
Running YaCy Crawling over VPN While Staying a Healthy Direct Peer

When I first set up YaCy in my homelab I wanted two things at the same time: crawler traffic leaving through a VPN proxy, and a node that behaved like a normal peer in the YaCy network.

The two interact in subtle ways. After some trial and error I ended up with a configuration that keeps crawler traffic behind a proxy while allowing the peer to integrate cleanly with the swarm.

This note documents the minimal configuration changes required.


Intended audience

This write-up assumes the reader already understands:

  • Docker container networking
  • basic homelab routing and port forwarding
  • HTTP proxy behaviour
  • the fundamentals of peer-to-peer systems

The goal here is not to explain those concepts but to show the exact YaCy configuration changes needed to reproduce the setup.


Part 1 – Ensure the peer is reachable

Peer reachability should be configured before introducing a proxy. A node that cannot accept inbound connections still works, but it acts as an outbound-only client and contributes less to the distributed index.

All changes are made in:

DATA/SETTINGS/yacy.conf

Set the externally reachable address and port:

staticIP=<public IPv4 address>
publicPort=8090
port=8090
host=0.0.0.0

Of these, the two keys that define the advertised endpoint are:

staticIP=<your WAN IP>
publicPort=8090

This ensures the peer advertises the correct endpoint when exchanging seed information.
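
For anyone who prefers to script the change rather than edit the file by hand, here is a minimal sketch of a key=value updater. It assumes yacy.conf is a flat properties-style file with one key=value per line; the function name apply_settings and the address 203.0.113.10 are placeholders of mine, not anything YaCy provides. I would stop YaCy before editing, on the assumption that a running instance may write the file back.

```python
def apply_settings(conf_text: str, settings: dict[str, str]) -> str:
    """Return conf_text with every key in settings set to its new value.

    Existing key=value lines are rewritten in place, keys that are missing
    are appended at the end, and comments or unrelated lines are kept as-is.
    """
    seen = set()
    out = []
    for line in conf_text.splitlines():
        key, sep, _ = line.partition("=")
        key = key.strip()
        is_comment = line.lstrip().startswith("#")
        if sep and not is_comment and key in settings:
            out.append(f"{key}={settings[key]}")
            seen.add(key)
        else:
            out.append(line)
    # Append any requested key that was not already present in the file.
    out.extend(f"{k}={v}" for k, v in settings.items() if k not in seen)
    return "\n".join(out) + "\n"


# Usage sketch (hypothetical values; run with YaCy stopped):
#   from pathlib import Path
#   path = Path("DATA/SETTINGS/yacy.conf")
#   path.write_text(apply_settings(path.read_text(), {
#       "staticIP": "203.0.113.10",   # placeholder WAN address
#       "publicPort": "8090",
#   }))
```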

Confirm the node is operating as a public peer and participates in the DHT:

cluster.mode=publicpeer
network.unit.dht=true

allowDistributeIndex=true
allowReceiveIndex=true
allowDistributeIndexWhileCrawling=true
allowDistributeIndexWhileIndexing=true

The router must already forward the YaCy port (8090 in this example) to the host running YaCy.
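
A quick way to confirm the forward is in place is a plain TCP connect test. The sketch below is a generic check, not a YaCy tool; port_is_open is a name I made up. It should be run from a machine outside the LAN, since many home routers do not loop forwarded ports back to internal clients.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


# Usage sketch (replace with your own WAN address and public port):
#   print(port_is_open("203.0.113.10", 8090))
```

A successful connect only shows that something is listening on the forwarded port; it does not prove that YaCy itself answered.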


Part 2 – Route crawler traffic through a proxy

Once the peer is reachable, configure the crawler to use a proxy.

In my environment the proxy is provided by a Gluetun container exposing an HTTP proxy.

Add or confirm the following configuration:

remoteProxyUse=true
remoteProxyHost=<proxy host or container name>
remoteProxyPort=<proxy port>
remoteProxyUse4SSL=true

For example, with the proxy listening on port 8888:

remoteProxyUse=true
remoteProxyHost=<proxy container>
remoteProxyPort=8888
remoteProxyUse4SSL=true

This causes YaCy’s outbound fetches to traverse the proxy.
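
Before relying on the setting, it can help to confirm from the same Docker network that the proxy really exits through the VPN. The sketch below uses only the Python standard library; the helper names, the container name gluetun and the echo URL are my own placeholders (any service that returns the caller's IP as plain text would do). It tests the proxy itself, not YaCy's use of it.

```python
import urllib.request


def build_proxied_opener(proxy_host: str, proxy_port: int) -> urllib.request.OpenerDirector:
    """Build an opener that sends both HTTP and HTTPS requests via the proxy."""
    proxy_url = f"http://{proxy_host}:{proxy_port}"
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


def fetch_via_proxy(url: str, proxy_host: str, proxy_port: int, timeout: float = 10.0) -> str:
    """Fetch url through the proxy and return the response body as text."""
    opener = build_proxied_opener(proxy_host, proxy_port)
    with opener.open(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", "replace").strip()


# Usage sketch (placeholder host and URL, both assumptions):
#   print(fetch_via_proxy("https://ip-echo.example/", "gluetun", 8888))
# The printed address should be the VPN exit IP, not your WAN IP.
```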

The default non-proxy list should remain unchanged:

remoteProxyNoProxy=10\\..*,127\\..*,172\\.(1[6-9]|2[0-9]|3[0-1])\\..*,169\\.254\\..*,192\\.168\\..*,localhost,0:0:0:0:0:0:0:1

This ensures traffic to loopback, link-local, and private (RFC 1918) addresses bypasses the proxy.
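
To see which hosts that list actually covers, the patterns can be exercised outside YaCy. This sketch assumes each comma-separated entry is a regular expression matched against the whole host name, and that the doubled backslashes are just properties-file escaping; both are my reading of the default value, not something I have verified against YaCy's source. The name bypasses_proxy is hypothetical.

```python
import re

# The default value exactly as it appears in yacy.conf (escaped backslashes).
NO_PROXY = (
    r"10\\..*,127\\..*,172\\.(1[6-9]|2[0-9]|3[0-1])\\..*,"
    r"169\\.254\\..*,192\\.168\\..*,localhost,0:0:0:0:0:0:0:1"
)


def bypasses_proxy(host: str, raw_value: str = NO_PROXY) -> bool:
    """Return True if host fully matches any pattern in the no-proxy list."""
    # Undo the properties-file escaping: two backslashes become one.
    patterns = raw_value.replace("\\\\", "\\").split(",")
    return any(re.fullmatch(p, host) for p in patterns)


# Usage sketch:
#   bypasses_proxy("192.168.1.20")   # private LAN address
#   bypasses_proxy("example.org")    # public host name
```

Under these assumptions a LAN address such as 192.168.1.20 matches the list, while a public host name such as example.org does not and would go through the proxy.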


Result

With these settings the behaviour becomes predictable:

Traffic                       Path
crawler fetches               proxy
peer discovery                direct
DHT index exchange            direct
distributed search queries    direct

In practice this keeps the two concerns separate: crawls run through the VPN as intended, while the node itself remains a fully functional participant in the YaCy network.


Closing note

After switching to this layout my node immediately behaved more like a cooperative peer. Crawl traffic stayed behind the proxy, but the swarm interaction became smoother and more consistent.

The configuration itself is small, but the order matters: make the peer reachable first, then add the proxy.