Service Discovery Network Simulation

Hello!

Our implementation of service discovery has reached a higher level of maturity, and it’s now time to look into what the DST team can do to help.

The first scenario would be a set of services Zipf distributed in popularity in a large network.
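
A minimal sketch of how such a workload could be generated. The Zipf exponent of 1.0, the seed, and the service names are assumptions chosen for illustration; none of them is specified above.

```python
import random

def zipf_weights(n_services: int, exponent: float = 1.0) -> list[float]:
    """Normalised Zipf weights: the service at rank k gets weight 1 / k**exponent."""
    raw = [1.0 / (rank ** exponent) for rank in range(1, n_services + 1)]
    total = sum(raw)
    return [w / total for w in raw]

def assign_services(n_nodes: int, n_services: int,
                    exponent: float = 1.0, seed: int = 0) -> list[str]:
    """Pick one service per node, with popularity following the Zipf weights."""
    rng = random.Random(seed)  # fixed seed so a simulation run is reproducible
    # Hypothetical service IDs, used only for illustration.
    services = [f"/example/service-{rank}/1.0.0" for rank in range(1, n_services + 1)]
    return rng.choices(services, weights=zipf_weights(n_services, exponent), k=n_nodes)
```

With exponent 1.0 the most popular service is advertised roughly twice as often as the second one, so the tail of the list gives the "least popular" services to search for.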

In this network, we will try to find nodes supporting the least and most popular services, measuring median latency, closer peer count, time to first result, etc.

We could then add complexity by having nodes support multiple services. In this case, comparing search speed for nodes that support many vs. few services might be interesting.

We will surely find more scenarios, but for now I just wanted to get the ball rolling on this topic. :folded_hands:

One quirk of the protocol is that calling `startDiscovering` before triggering `lookup` should speed up the search because, to search, a node needs to maintain a table centered on the searched-for service.

It would be interesting to answer questions like: what is the performance difference between a “primed” search and one without a prior `startDiscovering` call, and how long before a `lookup` does `startDiscovering` need to run to be beneficial (if it is at all)?
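
One way to compare primed and unprimed searches, sketched under the assumption that each measured lookup has already been reduced to a pair of (lead time, duration). The sample format and the function name are hypothetical; the sketch does not call `startDiscovering` or `lookup` itself.

```python
from collections import defaultdict
from statistics import median

def median_by_lead_time(samples):
    """Group lookup durations by priming lead time and return the median per group.

    Each sample is (lead_s, duration_ms). lead_s is None for an unprimed lookup,
    otherwise the number of seconds between startDiscovering and lookup.
    """
    groups = defaultdict(list)
    for lead_s, duration_ms in samples:
        groups[lead_s].append(duration_ms)
    return {lead_s: median(durations) for lead_s, durations in groups.items()}
```

Running the same lookup with several lead times (for example 0s, 10s, 60s) and comparing the medians against the `None` group would show whether priming helps and from which lead time onward.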

Scenarios:

Scenario 1: Rare Service Discovery in a Large Noisy Network

Goal: Verify that service discovery finds rare services faster than random walking and that discoverers converge toward the right registrars rather than wandering.

Network setup:

  • 250 registrars → server mode, no services, no xprPublishing
  • 150 kad-only peers → plain KadDHT (not ServiceDiscovery) — act as routing noise
  • 80 advertisers → advertise a common service e.g. “/logos/popular/1.0.0”
  • 3 advertisers → advertise the rare service “/logos/rare/1.0.0”
  • 40 discoverers → call lookup() for “/logos/rare/1.0.0” only
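
The setup above can be written down as data so the total node count is explicit. This is only a sketch: the role names and the tuple layout are made up for illustration, while the counts and service IDs come from the list above.

```python
RARE = "/logos/rare/1.0.0"
POPULAR = "/logos/popular/1.0.0"

# (role name, node count, service advertised or looked up); role names are hypothetical.
ROLES = [
    ("registrar", 250, None),
    ("kad_only", 150, None),
    ("popular_advertiser", 80, POPULAR),
    ("rare_advertiser", 3, RARE),
    ("discoverer", 40, RARE),
]

def total_nodes(roles):
    """Total number of nodes to deploy across all roles."""
    return sum(count for _, count, _ in roles)

def nodes_for_service(roles, service):
    """Number of nodes whose role is tied to the given service."""
    return sum(count for _, count, svc in roles if svc == service)
```

`total_nodes(ROLES)` gives 523, assuming every group is a separate set of nodes.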

How to run:

  1. Start registrars first. Wait for routing tables to stabilise (~30s).

  2. Start 80 popular-service advertisers. Let them get confirmed (~60s).

  3. Start 3 rare-service advertisers. Record the exact timestamp.

  4. Start 40 discoverers. Each calls lookup("/logos/rare/1.0.0") in a loop every 10s.

Existing logs to watch:

| Log line | What it tells you |
| --- | --- |
| debug "getting adverts" | Each GET_ADS sent; count per discoverer |
| debug "adverts found" count=N | N=0 means empty response from that registrar |
| debug "advert accepted" | Rare service ad admitted at registrar |
| cd_lookup_peers_found metric | Total peers found per lookup call |
| cd_lookup_requests metric | Total lookups initiated |
| cd_registrar_cache_ads gauge | Cache size per registrar over time |
| cd_service_table_peers gauge | Routing table growth toward service hash |

Missing logs — need to add:

  • debug "lookup complete" with fields: serviceId, peersFound, registrarsContacted, bucketsTraversed, durationMs — currently no single log captures when a lookup finishes and what it took

  • debug "empty response from registrar" — currently adverts found count=0 exists but has no explicit distinguishing label; add a dedicated log so it can be grepped separately

  • debug "routing table state" at lookup start — log how many peers are in each bucket of DiscT(service_id_hash) before the first GET_ADS goes out; currently no snapshot log exists

What to check:

  • Time from rare ad admitted (advert accepted) to first discoverer finding it (adverts found count>0)

  • Ratio of adverts found count=0 to total getting adverts — should decrease over time as routing tables converge

  • cd_service_table_peers should grow steadily for discoverers; if it stays flat, routing table is not being updated from closerPeers

  • Whether the same registrar peer ID appears repeatedly in getting adverts — indicates discoverers are not advancing through buckets
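
Two of the checks above can be sketched as small functions over already-parsed log events. The event tuples, the 60s window, and the repeat threshold are assumptions; how the log lines are parsed into tuples is left out.

```python
from collections import Counter

def empty_ratio_per_window(responses, window_s=60.0):
    """responses: iterable of (timestamp_s, advert_count), one per GET_ADS response.

    Returns {window_index: fraction of responses with advert_count == 0}.
    """
    totals, empties = Counter(), Counter()
    for ts, count in responses:
        window = int(ts // window_s)
        totals[window] += 1
        if count == 0:
            empties[window] += 1
    return {w: empties[w] / totals[w] for w in sorted(totals)}

def repeated_registrars(requests, threshold=3):
    """requests: iterable of (discoverer_id, registrar_id), one per GET_ADS sent.

    Returns the (discoverer, registrar) pairs seen at least `threshold` times.
    """
    counts = Counter(requests)
    return {pair: n for pair, n in counts.items() if n >= threshold}
```

A ratio that does not fall from one window to the next, or a non-empty result from `repeated_registrars`, would both point at discoverers that are not converging.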

REGISTER Storm / Retry Explosion

  • 40 registrar nodes

  • 300 advertiser nodes

  • All advertisers advertise the same service

  • All advertisers should start simultaneously.

Things to check in logs:

  • Number of retries to same registrar

  • REGISTER requests per second

  • Whether advertisers retry without waiting

  • registrar cache size change with time

  • how waiting times change with registrar cache size
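
A rough sketch of how the first three checks could be computed, assuming each REGISTER request has been extracted from the logs as (timestamp, advertiser, registrar). That tuple format is hypothetical.

```python
from collections import Counter

def register_rate(events):
    """events: iterable of (timestamp_s, advertiser_id, registrar_id).

    Returns {second: number of REGISTER requests in that second}.
    """
    return dict(Counter(int(ts) for ts, _, _ in events))

def retries_per_pair(events):
    """Retries are the requests beyond the first for each (advertiser, registrar) pair."""
    counts = Counter((adv, reg) for _, adv, reg in events)
    return {pair: n - 1 for pair, n in counts.items() if n > 1}

def min_retry_gap(events):
    """Smallest gap in seconds between consecutive requests of each pair."""
    last_seen, gaps = {}, {}
    for ts, adv, reg in sorted(events):
        pair = (adv, reg)
        if pair in last_seen:
            gap = ts - last_seen[pair]
            gaps[pair] = min(gap, gaps.get(pair, gap))
        last_seen[pair] = ts
    return gaps
```

A minimum gap well below the wait time handed out by the registrar would suggest that advertisers retry without waiting.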

Client mode

The main goal of this test is to verify that client-mode nodes can use service discovery only as discoverers. Server-mode nodes may act as discoverers, advertisers, or registrars, but client-mode nodes must only act as discoverers. So a client-mode node should be able to search for peers providing a service, but it should not advertise its own service or accept/store advertisements from others.

  • 80 nodes → server mode registrars

  • 40 nodes → server mode advertisers

  • 30 nodes → server mode discoverers

  • 100 nodes → client mode

  • keep just 1 service, for simplicity and to test client-mode functionality specifically

First, start the server-mode registrars and advertisers. Let the advertisers register their advertisements.

Then start the client-mode nodes and make them discover the same service. The expected result is that client-mode nodes should be able to send GET_ADS requests and discover valid advertisers.

After that, intentionally try to misuse client mode. Pick 20 client-mode nodes and try to make them advertise. Pick another 20 client-mode nodes and try to make them behave like registrars by accepting REGISTER requests from advertisers. These operations should fail cleanly. They should not silently succeed, should not add anything to an advertisement cache, and should not start registrar/advertiser loops in the background.

Things to check in logs:

  • Client-mode nodes successfully send GET_ADS requests

  • Client-mode nodes receive valid advertisements for the requested service

  • Client-mode nodes do not send REGISTER requests for their own advertisements

  • Clear error or warning appears when advertise/register is attempted from client mode

Malicious Registrars Poisoning Routing Tables

Service discovery success depends heavily on registrars helping advertisers and discoverers move closer to the correct service-specific region of the keyspace.

This test checks whether malicious routing information slowly corrupts search tables.

  • 120 registrar nodes

  • 50 malicious registrars

  • 30 advertisers

  • 40 discoverers

The malicious registrars should intentionally return bad closerPeers lists. They should:

  • return duplicate peers

  • return only malicious peers

  • return unreachable peers

  • return themselves repeatedly

  • return peers far from service hash

  • return random garbage peer IDs.
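
To measure how often these behaviours show up, a test harness could classify each closerPeers response. The sketch below assumes a Kademlia-style XOR metric over SHA-256 of the raw IDs; the actual key derivation used by the implementation may differ. Treating "not closer than the responder" as "far from the service hash" is a rough heuristic, not a protocol rule.

```python
import hashlib

def key(data: bytes) -> int:
    """Map an ID to a 256-bit key. SHA-256 is an assumption, not the confirmed scheme."""
    return int.from_bytes(hashlib.sha256(data).digest(), "big")

def xor_distance(a: bytes, b: bytes) -> int:
    """Kademlia-style XOR distance between two IDs."""
    return key(a) ^ key(b)

def classify_closer_peers(service_id: bytes, responder: bytes, closer_peers):
    """Count suspicious entries in one closerPeers response."""
    responder_dist = xor_distance(service_id, responder)
    seen = set()
    report = {"duplicates": 0, "self": 0, "not_closer": 0, "ok": 0}
    for peer in closer_peers:
        if peer in seen:
            report["duplicates"] += 1
            continue
        seen.add(peer)
        if peer == responder:
            report["self"] += 1
        elif xor_distance(service_id, peer) >= responder_dist:
            report["not_closer"] += 1
        else:
            report["ok"] += 1
    return report
```

Unreachable and garbage peer IDs cannot be detected this way; those would only show up as failed dials.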

Things to check in logs:

  • Number of failed dials

  • Percentage of peers in a table that are unreachable

  • Whether discoverers stop making progress

Ticket Grinding Attack

  • 30 registrars

  • 200 malicious advertisers

The malicious advertisers repeatedly try to game the waiting-time system. We want to verify that malicious advertisers cannot get artificially lower wait times.

Malicious advertiser behaviour:

  • retry too early

  • retry too late

  • modify t_wait_for

  • reuse old tickets

  • use ticket from another registrar

  • slightly modify advertisement while reusing ticket

  • intentionally drop tickets and restart.

What to check in logs:

  • Early retry rejections

  • Late retry rejections

  • Ticket validation failures

  • Lower-bound state changes

  • Whether malicious advertisers are eventually penalized with higher waiting times
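
For the early/late retry cases, the simulation needs a reference for what counts as "on time". A minimal sketch, assuming a retry is valid only inside a fixed window that opens once `t_wait_for` has elapsed since the ticket was issued. The names `issued_at_s` and `window_s`, and the window model itself, are assumptions rather than confirmed protocol details.

```python
def classify_retry(issued_at_s: int, t_wait_for_s: int, retry_at_s: int, window_s: int) -> str:
    """Classify a retry relative to the wait time carried by the ticket.

    The retry is "on_time" if it falls within [earliest, earliest + window_s],
    where earliest = issued_at_s + t_wait_for_s.
    """
    earliest = issued_at_s + t_wait_for_s
    if retry_at_s < earliest:
        return "early"
    if retry_at_s > earliest + window_s:
        return "late"
    return "on_time"
```

Comparing this classification with what the registrar actually logs (rejection or acceptance) would reveal retries that should have been rejected but were not.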

Same-IP Sybil Flooding

Service discovery includes IP similarity scoring to prevent many nodes from the same subnet dominating the advertisement cache. This test checks whether the IP tree logic actually works.

  • 100 registrars

  • 500 advertisers

Out of these:

  • 400 advertisers run behind the same IP/subnet

  • 100 advertisers use unique IPs

All advertisers continuously advertise services.

Things to check in logs:

  • WAIT times grouped by IP

  • Number of ads admitted per subnet in the ad cache

  • IP similarity scores

  • IP tree growth

  • IP tree cleanup after expiry
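
The "ads admitted per subnet" check can be sketched with Python's standard `ipaddress` module. The /24 prefix length and IPv4 input are assumptions; the granularity used by the actual IP tree may differ.

```python
import ipaddress
from collections import Counter

def ads_per_subnet(advertiser_ips, prefix_len=24):
    """Count admitted ads per subnet, given the advertiser IP of each cached ad."""
    counts = Counter()
    for ip in advertiser_ips:
        # strict=False lets a host address stand in for its enclosing network.
        net = ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False)
        counts[str(net)] += 1
    return dict(counts)
```

If the IP similarity scoring works, the subnet shared by the 400 advertisers should hold far fewer than 80% of the cached ads.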

Popular Service Hotspot Attack

Service discovery specifically tries to avoid hotspotting near service hashes. This test checks whether rare services still remain discoverable when one service becomes extremely popular.

  • 150 registrars

  • 700 advertisers for one popular service

  • 20 advertisers for several rare services

  • 50 discoverers

The popular service should aggressively try to dominate registrar caches.

Things to check in logs:

  • Cache entries per service

  • WAIT times for popular vs rare services

  • Number of rare ads stored

  • Lookup latency for rare services vs popular one

Advertisement Expiry and Churn Chaos

This checks whether stale advertisements continue being returned after expiry time E.

Start:

  • 100 registrars

  • 100 advertisers

  • 100 discoverers

Allow advertisements to stabilize first.

Then suddenly:

  • kill 70 advertisers,

  • restart remaining advertisers with new IPs,

  • rotate peer IDs,

  • stop some registrars,

  • restart them later.

Things to check in logs:

  • Expired ads still returned

  • Failed dial count

  • Ad cleanup timing

  • Cache shrink behavior

  • Whether discoverers keep receiving dead peers
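
The "expired ads still returned" check could look like the sketch below, assuming each returned ad can be matched to the time it was admitted. The tuple layout is hypothetical; `expiry_s` stands for the expiry time E.

```python
def stale_results(returned, expiry_s):
    """Find ads that were returned after they should have expired.

    returned: iterable of (returned_at_s, admitted_at_s, peer_id).
    An ad is stale when it is returned more than expiry_s after admission.
    """
    return [
        (returned_at, admitted_at, peer_id)
        for returned_at, admitted_at, peer_id in returned
        if returned_at - admitted_at > expiry_s
    ]
```

Any non-empty result after the churn phase means registrars are still serving ads past E.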

Oversized / Corrupted Advertisement Attack

Registrars and discoverers should reject invalid advertisements safely.

Start:

  • 50 registrars

  • 50 malicious advertisers

The malicious advertisers send:

  • corrupted protobufs,

  • invalid signatures,

  • missing services,

  • huge advertisements,

  • invalid multiaddrs,

  • random bytes instead of XPR,

  • extremely large service metadata.

Things to check in logs:

  • Signature verification failures

  • memory spikes

  • corrupted ads accepted

Eclipse Attack Near Service Hash

Service discovery assumes that querying random registrars across buckets prevents eclipse attacks. This test checks whether that assumption holds.

Start:

  • 200 registrars

  • 60 malicious registrars positioned close to target service hash

  • 4 honest advertisers

  • 30 discoverers

The malicious registrars should:

  • suppress honest advertisements,

  • return empty responses,

  • return fake advertisements,

  • return only malicious closer peers.

Things to check in logs:

  • Honest peer discovery success rate

  • Number of malicious registrars contacted

  • Percentage of invalid ads returned

  • Whether discoverers terminate early
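
The first two checks reduce to simple ratios once lookup results are collected. A sketch with hypothetical input shapes: a mapping from discoverer to the peers its lookups returned, and a list of registrars each discoverer contacted.

```python
def honest_discovery_rate(results, honest_advertisers):
    """Fraction of discoverers that found at least one honest advertiser.

    results: {discoverer_id: iterable of peer IDs returned by its lookups}.
    """
    if not results:
        return 0.0
    honest = set(honest_advertisers)
    hits = sum(1 for peers in results.values() if honest & set(peers))
    return hits / len(results)

def malicious_contact_share(contacted, malicious_registrars):
    """Fraction of contacted registrars that belong to the malicious set."""
    if not contacted:
        return 0.0
    bad = set(malicious_registrars)
    return sum(1 for registrar in contacted if registrar in bad) / len(contacted)
```

A high malicious contact share combined with a low honest discovery rate would indicate that the eclipse succeeded.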

Concurrent Advertise + Lookup + Churn Race Test

Test concurrency bugs, async races, routing-table corruption, and cleanup issues.

Start:

  • 150 registrars

  • 200 advertisers

  • 200 discoverers

Continuously:

  • start advertisers,

  • stop advertisers,

  • rotate services,

  • perform lookups,

  • restart registrars.

Everything should happen simultaneously for a long duration.

Things to check in logs:

  • Concurrent map write errors

  • Deadlocks

  • Stuck futures/tasks

  • Duplicate registrations

  • Missing cleanup

  • Memory growth

Hi Arunima.

I come with some questions regarding these scenarios:

  • How do you imagine the initial deployment? For example, take the “Rare Service Discovery in a Large Noisy Network” scenario, where you describe a set of 500 nodes. Are these nodes connected to an initial bootstrap node, or do they connect among themselves?

Then, specifically for the scenarios:
“Rare Service Discovery in a Large Noisy Network”

  1. You mention 500 nodes, but 250 + 150 + 80 + 3 + 40 is a bit more than that. Just double-checking that we are talking about 523 nodes, since I assume that each described set of nodes is independent of the other groups.
  2. How do we set the behavior of a node to “registrar”? Is this not the default behavior?
  3. What is considered a popular service and a rare service? Is it correct to assume that popular is “popular” because it is advertised by 80 nodes?
  4. What is “normal random peer discovery”? Normal Kad-DHT nodes?

“REGISTER Storm / Retry Explosion”

  1. Is client mode finished? Where can I see an example of it?

My last question is: are the things to check already exposed by the service-discovery protocol, either through logs or metrics?

That would be it for now. I think scenarios are fairly doable so far.

Hi!

Thanks for taking a look at the scenarios and for the questions :slightly_smiling_face:

Regarding the initial deployment, I was imagining a setup similar to the current dogfooding setup where all nodes connect to a common bootstrap node initially. From there, they discover additional peers through Kad-DHT and service discovery. The exact topology is flexible though—the numbers in the scenarios are mainly intended to describe the scale and roles involved rather than a strict deployment plan.

For the “Rare Service Discovery in a Large Noisy Network” scenario, you’re right about the count. I wasn’t being precise with the total number there; the important part was the relative distribution of registrars, advertisers, discoverers, and other nodes. We can adjust the numbers.

Regarding registrars, my understanding is that service discovery currently runs on server-mode nodes, so yes, in practice many nodes may behave as registrars by default. I was mainly using the term to describe the role the node is playing in the scenario rather than implying special configuration.

For popular vs rare services, yes, that’s exactly the idea. A service is considered “popular” simply because many nodes advertise it, while a “rare” service has very few advertisers. The goal is to see whether service discovery still performs well when only a handful of nodes provide a given service.

By “normal random peer discovery”, I mean discovering peers through the underlying Kad-DHT without using service-specific advertisements. The comparison is mainly to see whether service discovery helps find peers for a specific service, especially a rare one, more efficiently.

For client mode, I believe the implementation is mostly there, but I haven’t written those scenarios assuming it is already fully available in Logos Delivery. I added the client-mode scenario mainly because it is an important behavior to verify.

For the metrics/logs, not everything is exposed today. Some of the things I listed are more “things worth observing” rather than metrics that already exist. We may need additional logging or instrumentation to measure some of them properly.

Glad to hear they seem doable so far. Also, these scenarios are still very much a draft and we will further discuss and refine them. The main goal right now is to brainstorm ways we can stress the system and identify potential weaknesses.