Service Discovery Network Simulation

SionoiS · May 4, 2026, 3:33pm

Hello!

Our implementation of service discovery has reached a higher level of maturity and it’s now time to look into what the DST team can do to help.

The first scenario would be a set of services Zipf distributed in popularity in a large network.

In this network we will try to find nodes supporting the least and most popular services, measuring median latency, closer-peer count, time to first result, etc.
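To make the popularity setup concrete, here is a minimal sketch of how a simulation harness could assign one service per node with Zipf-distributed popularity. The service names, the exponent, and the function names are assumptions for illustration only; nothing here comes from the service discovery implementation.

```python
import random
from collections import Counter


def zipf_weights(num_services: int, exponent: float = 1.0) -> list[float]:
    """Normalised Zipf weights: rank r gets weight proportional to 1 / r**exponent."""
    raw = [1.0 / (rank ** exponent) for rank in range(1, num_services + 1)]
    total = sum(raw)
    return [w / total for w in raw]


def assign_services(num_nodes: int, num_services: int,
                    exponent: float = 1.0, seed: int = 0) -> dict[int, str]:
    """Pick one service per node; low ranks are picked far more often."""
    rng = random.Random(seed)  # seeded so a simulation run is repeatable
    # Hypothetical service names, ranked from most to least popular.
    services = [f"/logos/service-{rank}/1.0.0" for rank in range(1, num_services + 1)]
    picks = rng.choices(services, weights=zipf_weights(num_services, exponent), k=num_nodes)
    return dict(enumerate(picks))


if __name__ == "__main__":
    counts = Counter(assign_services(1000, 20).values())
    for service, n in counts.most_common(3):
        print(service, n)
```

The least popular ranks may end up with only a handful of advertisers (or none), which is the case the rare-service measurements are meant to cover.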

We could then add some complexity by having nodes support multiple services. In that case, comparing search speed for nodes that support many vs. few services might be interesting.

We will surely find more scenarios, but for now I just wanted to get the ball rolling on this topic.

SionoiS · May 4, 2026, 3:39pm

One quirk of the protocol is that calling `startDiscovering` before triggering `lookup` should speed up the search, because to search, a node needs to maintain a table centered on the searched-for service.

It would be interesting to answer questions like: what is the performance difference between a “primed” search and one without a prior `startDiscovering`, and how long before a `lookup` does `startDiscovering` need to run to be beneficial (if it is at all)?
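One way to answer this from simulation output is to group lookup latencies by how long `startDiscovering` ran beforehand. The sketch below assumes the harness records `(lead_time_s, latency_ms)` pairs, with `None` as the lead time for unprimed lookups; that record format is hypothetical.

```python
from statistics import median


def median_latency_by_lead_time(samples):
    """Group lookup latencies by priming lead time and return the median of each group.

    samples: iterable of (lead_time_s, latency_ms); lead_time_s is None when
    startDiscovering was not called before the lookup.
    """
    by_lead = {}
    for lead_time_s, latency_ms in samples:
        by_lead.setdefault(lead_time_s, []).append(latency_ms)
    return {lead: median(values) for lead, values in by_lead.items()}
```

Comparing the `None` entry against the others shows whether priming helps, and sweeping the lead time shows where the benefit levels off.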

arunima · May 15, 2026, 4:07am

Scenarios:

Scenario 1: Rare Service Discovery in a Large Noisy Network

Goal: Verify that service discovery finds rare services faster than random walking and that discoverers converge toward the right registrars rather than wandering.

Network setup:

250 registrars → server mode, no services, no xprPublishing
150 kad-only peers → plain KadDHT (not ServiceDiscovery) — act as routing noise
80 advertisers → advertise a common service e.g. “/logos/popular/1.0.0”
3 advertisers → advertise the rare service “/logos/rare/1.0.0”
40 discoverers → call lookup() for “/logos/rare/1.0.0” only

How to run:

Start registrars first. Wait for routing tables to stabilise (~30s).
Start 80 popular-service advertisers. Let them get confirmed (~60s).
Start 3 rare-service advertisers. Record the exact timestamp.
Start 40 discoverers. Each calls lookup("/logos/rare/1.0.0") in a loop every 10s.

Existing logs to watch:

Log line	What it tells you
`debug "getting adverts"`	Each GET_ADS sent — count per discoverer
`debug "adverts found" count=N`	`N=0` means empty response from that registrar
`debug "advert accepted"`	Rare service ad admitted at registrar
`cd_lookup_peers_found` metric	Total peers found per lookup call
`cd_lookup_requests` metric	Total lookups initiated
`cd_registrar_cache_ads` gauge	Cache size per registrar over time
`cd_service_table_peers` gauge	Routing table growth toward service hash

Missing logs — need to add:

debug "lookup complete" with fields: serviceId, peersFound, registrarsContacted, bucketsTraversed, durationMs — currently no single log captures when a lookup finishes and what it took
debug "empty response from registrar" — currently adverts found count=0 exists but has no explicit distinguishing label; add a dedicated log so it can be grepped separately
debug "routing table state" at lookup start — log how many peers are in each bucket of DiscT(service_id_hash) before the first GET_ADS goes out; currently no snapshot log exists

What to check:

Time from rare ad admitted (advert accepted) to first discoverer finding it (adverts found count>0)
Ratio of adverts found count=0 to total getting adverts — should decrease over time as routing tables converge
cd_service_table_peers should grow steadily for discoverers; if it stays flat, routing table is not being updated from closerPeers
Whether the same registrar peer ID appears repeatedly in getting adverts — indicates discoverers are not advancing through buckets
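The first two checks above can be computed from parsed log timestamps. This is a rough sketch that assumes the logs have already been parsed into plain timestamp lists; the parsing itself and the input shapes are not part of the existing tooling.

```python
from collections import defaultdict


def empty_ratio_by_window(request_times, empty_response_times, window_s=60.0):
    """Per time window, ratio of empty 'adverts found' responses to 'getting adverts' requests."""
    requests = defaultdict(int)
    empties = defaultdict(int)
    for ts in request_times:
        requests[int(ts // window_s)] += 1
    for ts in empty_response_times:
        empties[int(ts // window_s)] += 1
    # Only windows that saw at least one request are reported.
    return {w: empties[w] / requests[w] for w in sorted(requests)}


def time_to_first_find(accepted_at, found_events):
    """Seconds from 'advert accepted' to the first 'adverts found' with count > 0.

    found_events: iterable of (timestamp_s, advert_count). Returns None if
    no discoverer found the advert after it was accepted.
    """
    hits = [ts for ts, count in found_events if count > 0 and ts >= accepted_at]
    return min(hits) - accepted_at if hits else None
```

A ratio that falls from one window to the next suggests the routing tables are converging; a flat ratio suggests discoverers keep hitting registrars that hold no rare-service ads.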

Scenario 2: REGISTER Storm / Retry Explosion

Goal: Verify that 300 advertisers hitting 40 registrars simultaneously do not cause retry loops, unbounded queues, or broken ticket state.

Network setup:

40 registrars → server mode, advertCacheCap = 1000 (default)
300 advertisers → all advertise the same service, all start at t=0

How to run:

Start 40 registrars. Wait for routing tables to stabilise.
Start all 300 advertisers simultaneously (use a barrier or coordinated start timestamp).
Run for at least 3 × advertExpiry (default = 3 × 900s = 45 min) to see full retry cycles.

Existing logs to watch:

Log line	What it tells you
`debug "registering advert"`	First attempt per advertiser per registrar
`debug "waiting for registrar" wait=Xs`	Wait time issued — look for outliers
`debug "advert accepted"`	Successful admission
`debug "registrar rejection, aborting"`	Hard rejection — should be rare
`error "no ticket to retry with"`	Registrar issued Wait but no ticket — protocol bug
`cd_register_requests{status="Wait"}`	Rate of Wait responses
`cd_register_requests{status="Confirmed"}`	Rate of admissions
`cd_registrar_cache_ads` gauge	Cache fill rate over time
`cd_advertiser_pending_actions` gauge	Queue depth on advertisers

Missing logs — need to add:

debug "waiting for registrar" currently logs wait as a string (waitSecs) — change to a numeric field so it can be aggregated; also add retryCount field to track how many times this advertiser has retried this registrar
No log when an advertiser exhausts advertExpiry and loops back for re-registration — add debug "re-registering after expiry" to distinguish fresh registration from retry
cd_register_requests label for "Wait" counts total Waits but not per-registrar — add registrar label or a separate histogram for wait duration values

What to check:

cd_advertiser_pending_actions should grow initially then stabilise — if it grows unboundedly, advertisers are not respecting wait times
debug "waiting for registrar" entries should have wait > 0 — wait=0 followed immediately by another registering advert for the same registrar is a retry-without-waiting bug
cd_registrar_cache_ads should approach but not exceed advertCacheCap
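The retry-without-waiting check can be automated once the two log lines are parsed into events. The sketch below assumes each event is a tuple `(timestamp_s, advertiser, registrar, kind, wait_s)` where `kind` is `"registering"` or `"waiting"`; this event shape and the tolerance value are assumptions, not something the logs provide today.

```python
def find_retry_violations(events, tolerance_s=0.5):
    """Flag advertisers that retry a registrar before the issued wait has elapsed.

    Returns a list of ((advertiser, registrar), timestamp_s, reason).
    """
    pending = {}  # (advertiser, registrar) -> (time the wait was issued, wait length)
    violations = []
    for ts, advertiser, registrar, kind, wait_s in sorted(events, key=lambda e: e[0]):
        key = (advertiser, registrar)
        if kind == "waiting":
            if wait_s <= 0:
                violations.append((key, ts, "non-positive wait"))
            pending[key] = (ts, wait_s)
        elif kind == "registering" and key in pending:
            issued_at, wait = pending.pop(key)
            # Allow a small tolerance for timer and clock jitter.
            if ts + tolerance_s < issued_at + wait:
                violations.append((key, ts, "retried before wait elapsed"))
    return violations
```

An empty result over the whole run is what a well-behaved advertiser population should produce.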

Scenario 3: Client mode

Goal: Verify client-mode nodes can discover but cannot advertise or act as registrars, and that violations fail loudly.

Network setup:

80 server-mode registrars
40 server-mode advertisers → advertise “/logos/test/1.0.0”
30 server-mode discoverers
100 client-mode nodes → client = true in ServiceDiscovery.new()

How to run:

Start registrars and advertisers. Wait for ads to be confirmed.
Start 100 client-mode nodes. Each calls lookup("/logos/test/1.0.0").
Pick 20 client-mode nodes and call addProvidedService() on them — expect failure.
Pick another 20 client-mode nodes and attempt to send a REGISTER request directly (craft a raw message or call advertiseToRegistrar) — expect failure.

Existing behaviour:

addProvidedService and advertiseToRegistrar both have doAssert not disco.clientMode, "not supported in client mode". Because Nim's doAssert stays enabled in release builds, this raises an assertion failure in both debug and release builds, which will typically crash the node. That is not a clean failure.

Missing logs — need to add (and fix):

Replace doAssert not disco.clientMode with a proper if disco.clientMode: warn "operation not supported in client mode", op="addProvidedService"; return — so it fails visibly in logs without crashing
Add debug "client mode lookup success" with peersFound field so it is verifiable in logs that client nodes are succeeding at discovery
No log exists when a client-mode node receives an inbound REGISTER request — the handler will process it silently because registration() does not check clientMode. This is a missing guard — a client-mode node acting as registrar should reject inbound REGISTER messages and log a warning

What to check:

debug "adverts found" count>0 appears for client nodes — discovery works
warn "operation not supported in client mode" appears when advertise is attempted — clean rejection
No debug "advert accepted" or cd_registrar_cache_ads increment on client nodes — they are not storing ads
No debug "registering advert" on client nodes — they are not sending REGISTER

Scenario 4: Malicious Registrars Poisoning Routing Tables

Goal: Verify that bad closerPeers responses from malicious registrars do not permanently corrupt service routing tables or prevent discovery.

Network setup:

120 honest registrars
50 malicious registrars → return crafted closerPeers (see below)
30 advertisers
40 discoverers

Malicious registrar behaviours to implement (one variant per run):

Return only their own peerId in closerPeers (self-loop)
Return 16 random garbage peer IDs not in the network
Return unreachable peers (valid IDs, no working addresses)
Return peers far from service_id_hash (bucket 0 peers for a service in bucket 15)
Return empty closerPeers list

How to run:
Each variant is a separate run. Run for at least 2 full lookup cycles. Compare discovery success rate against a clean baseline run.
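As a rough illustration of the variants above, the following sketch models what a malicious registrar could put into closerPeers. Peer IDs are modelled as raw 32-byte strings and the function and variant names are made up for this example; the real implementation uses libp2p peer records. The unreachable-peers variant is left out because it depends on address handling.

```python
import random


def xor_distance(a: bytes, b: bytes) -> int:
    """Kademlia-style XOR distance between two equal-length identifiers."""
    return int.from_bytes(a, "big") ^ int.from_bytes(b, "big")


def crafted_closer_peers(variant, own_id, known_peers, service_hash, k=16, rng=None):
    """Return the closerPeers list a malicious registrar would send for one variant."""
    rng = rng or random.Random(0)
    if variant == "self_loop":
        return [own_id]
    if variant == "garbage":
        # Random IDs that do not belong to any node in the network.
        return [rng.randbytes(32) for _ in range(k)]
    if variant == "far":
        # The k known peers that are farthest from the service hash.
        return sorted(known_peers,
                      key=lambda p: xor_distance(p, service_hash),
                      reverse=True)[:k]
    if variant == "empty":
        return []
    raise ValueError(f"unknown variant: {variant}")
```

Keeping every variant behind one switch makes it easy to run the same network once per behaviour and compare each run against the clean baseline.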

Existing logs to watch:

Log line	What it tells you
`error "failed to register ad" error="dialing peer failed"`	Unreachable peer in routing table
`debug "adverts found" count=N`	Discovery progress
`cd_service_table_insertions`	How many poisoned peers were inserted
`cd_service_table_peers` gauge	Routing table size over time

Missing logs — need to add:

No log when closerPeers are inserted into the routing table — add debug "inserting closer peers" with count and serviceId so poisoning rate can be measured
No log when a dial fails during dispatchGetAds — the error is returned as a string but not logged at the call site; add warn "GET_ADS dial failed" with registrar and error fields
No metric for failed dials specifically — cd_messages_sent exists but there is no cd_dial_failures counter; add one

What to check:

cd_service_table_peers should not grow unboundedly when malicious nodes return junk — indicates no dedup or cap on peer insertion
error "dialing peer failed" rate should correlate with the number of malicious nodes
Honest discoverers should still find ads within 2–3 lookup cycles even with 50 malicious registrars — if they never find ads, the routing table is fully poisoned

Scenario 5: Ticket Grinding Attack

Goal: Verify that malicious advertisers cannot reduce their wait time by manipulating ticket fields or retry timing.

Network setup:

30 registrars
200 malicious advertisers → one behaviour variant each (see below)

Malicious behaviours — implement as separate advertiser modes:

Variant	What it does
Early retry	Retries at `t_mod + t_wait_for - 10s` (before window opens)
Late retry	Retries at `t_mod + t_wait_for + δ + 10s` (after window closes)
Modified `t_wait_for`	Reduces `t_wait_for` in ticket before presenting it
Reused old ticket	Presents a ticket from a previous registration attempt
Cross-registrar ticket	Presents a ticket signed by registrar A to registrar B
Modified ad	Slightly changes the advertisement bytes while reusing the ticket
Dropped ticket	Ignores the ticket and retries fresh every time

Existing logs to watch:

Log line	What it tells you
`error "invalid ticket in register message"`	Ticket validation failure
`cd_register_requests{status="Rejected"}`	Hard rejection rate
`cd_register_requests{status="Wait"}`	Re-issuance rate (grinding attempt)
`debug "waiting for registrar" wait=Xs`	Wait times being issued

Missing logs — need to add:

No log distinguishes why a ticket was invalid — add a reason field: warn "ticket rejected" reason="early retry|late retry|signature mismatch|ad mismatch" — currently all failures collapse into error "invalid ticket in register message" with no reason
No log when updateLowerBounds increases the bound (i.e. when the grinding defence fires) — add debug "lower bound enforced" with serviceId, prevBound, newBound so it is observable
No metric tracking wait time values issued — add a histogram cd_wait_time_issued_secs so you can see whether wait times are growing for grinding advertisers

What to check:

Early and late retry should produce cd_register_requests{status="Rejected"} immediately — no Wait ticket re-issued
Dropped-ticket variant: wait times should not decrease across retries — each fresh attempt recalculates from current cache state
After sustained grinding, cd_wait_time_issued_secs histogram for the grinding advertisers should show increasing values (lower bound enforcement working)
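The early and late variants rely on a retry window that, going by the variant table, opens at `t_mod + t_wait_for` and closes `δ` later. The sketch below classifies a retry time against that window; treating both boundaries as inclusive is an assumption, and the registrar's actual validation may differ.

```python
def classify_retry(now: float, t_mod: float, t_wait_for: float, delta: float) -> str:
    """Classify a retry time against the window [t_mod + t_wait_for, t_mod + t_wait_for + delta]."""
    window_opens = t_mod + t_wait_for
    window_closes = window_opens + delta
    if now < window_opens:
        return "early"
    if now > window_closes:
        return "late"
    return "in_window"


if __name__ == "__main__":
    t_mod, t_wait_for, delta = 1000.0, 60.0, 5.0
    print(classify_retry(t_mod + t_wait_for - 10, t_mod, t_wait_for, delta))          # early
    print(classify_retry(t_mod + t_wait_for + delta + 10, t_mod, t_wait_for, delta))  # late
```

The test harness can use the same function to label each malicious retry and then confirm that every retry labelled early or late was answered with Rejected.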

Scenario 6: Same-IP Sybil Flooding

Goal: Verify that the IP tree correctly penalises subnet-concentrated advertisers while admitting diverse ones.

Network setup:

100 registrars
400 advertisers → all bind to addresses in 192.168.1.0/24
100 advertisers → each binds to a unique IP in 10.0.0.0/8
All advertise the same service continuously.

How to run:
Use loopback aliases or a network namespace to give the 400 Sybil advertisers addresses in the same /24. Run for at least 2 × advertExpiry.
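One detail worth planning for: a /24 has only 254 usable host addresses, so 400 Sybil advertisers necessarily share IPs (for example on different ports). A small sketch of address allocation for both groups, using Python's standard `ipaddress` module, could look like this. Giving each diverse advertiser its own /24 is an assumption that goes slightly beyond "unique IP".

```python
import ipaddress
from itertools import cycle, islice


def sybil_addresses(count: int, subnet: str = "192.168.1.0/24") -> list[str]:
    """Addresses for Sybil advertisers, all in one subnet; hosts repeat once exhausted."""
    hosts = list(ipaddress.ip_network(subnet).hosts())
    return [str(ip) for ip in islice(cycle(hosts), count)]


def diverse_addresses(count: int, supernet: str = "10.0.0.0/8") -> list[str]:
    """One address from each distinct /24 inside the supernet."""
    subnets = ipaddress.ip_network(supernet).subnets(new_prefix=24)
    return [str(next(net.hosts())) for net in islice(subnets, count)]


if __name__ == "__main__":
    sybils = sybil_addresses(400)
    print(len(sybils), "Sybil advertisers on", len(set(sybils)), "distinct IPs")
    print(diverse_addresses(3))
```

Whether several advertisers sharing the exact same IP are penalised differently from advertisers that only share the /24 is itself something the proposed ipSim log field would reveal.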

Existing logs to watch:

Log line	What it tells you
`debug "waiting for registrar" wait=Xs`	Compare wait values between /24 and unique-IP groups
`debug "advert accepted"`	Admission rate per group
`cd_iptree_unique_ips` gauge	IP tree growth over time
`cd_registrar_ads_expired` counter	Cleanup rate
`cd_registrar_cache_ads` gauge	Cache composition

Missing logs — need to add:

No log records the IP similarity score at the time of admission or Wait issuance — add debug "waiting time calculated" with fields ipSim, serviceSim, occupancy, tWait, peerId in waitingTime — this is the most important missing observable for this scenario
No log when an IP is inserted into or removed from the IP tree — add debug "ip tree insert" / debug "ip tree remove" with the IP (or at least the /24 prefix) so tree growth and cleanup are traceable
cd_iptree_unique_ips only tracks total unique IPs — add cd_iptree_depth_score histogram or similar to track score distribution

What to check:

debug "waiting for registrar" wait=Xs for /24 advertisers should be significantly higher than for unique-IP advertisers
cd_registrar_cache_ads breakdown: unique-IP advertisers should have proportionally more entries than their share of total advertisers
cd_iptree_unique_ips should stabilise after advertExpiry as expired ads remove their IPs — if it only grows, removeAd is not being called on expiry

Scenario 7: Popular Service Hotspot Attack

Goal: Verify that a dominant popular service does not starve rare services out of the cache.

Network setup:

150 registrars
700 advertisers → advertise “/logos/popular/1.0.0” aggressively
20 advertisers → advertise “/logos/rare-A/1.0.0” and “/logos/rare-B/1.0.0”
50 discoverers → look up both rare services

How to run:

Start registrars, let them stabilise.
Start all 700 popular-service advertisers simultaneously.
After 30s, start rare-service advertisers.
After another 30s, start discoverers. Measure lookup latency for rare vs popular.

Existing logs to watch:

Log line	What it tells you
`debug "advert accepted"` with `serviceId`	Admission rate per service
`debug "waiting for registrar" wait=Xs`	Compare wait times across services
`cd_registrar_cache_services` gauge	Number of distinct services in cache
`cd_registrar_cache_ads` gauge	Total cache occupancy
`cd_lookup_peers_found` counter	Discovery success for rare service

Missing logs — need to add:

No per-service cache count is logged or exposed as a metric — cd_registrar_cache_services counts distinct services but not ads per service; add cd_registrar_cache_ads_per_service gauge with serviceId label so hotspot dominance is directly visible
The serviceSim term in waitingTime captures per-service pressure but is never logged — include it in the proposed debug "waiting time calculated" log from Scenario 6

What to check:

Rare service ads should still appear in cache (debug "advert accepted" for rare serviceId) even while popular service dominates
service_similarity (c_s / C) for popular service should be high, causing its own advertisers to wait longer — self-regulating
Rare service lookup latency should not be significantly worse than in a no-popular-service baseline
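To make the service_similarity check concrete, here is a small sketch computing c_s / C per service from a registrar's cache contents. It assumes c_s is the number of cached ads for service s and C is the cache capacity (advertCacheCap); if C is meant to be the current total number of cached ads, pass that value instead.

```python
def service_similarity(ads_per_service: dict[str, int], cache_capacity: int) -> dict[str, float]:
    """Return c_s / C for each service, where C is assumed to be the cache capacity."""
    if cache_capacity <= 0:
        raise ValueError("cache_capacity must be positive")
    return {service: count / cache_capacity for service, count in ads_per_service.items()}


if __name__ == "__main__":
    # Made-up cache contents, only for illustration.
    cache = {
        "/logos/popular/1.0.0": 900,
        "/logos/rare-A/1.0.0": 60,
        "/logos/rare-B/1.0.0": 40,
    }
    for service, sim in service_similarity(cache, cache_capacity=1000).items():
        print(f"{service}: {sim:.2f}")
```

With numbers like these, the popular service's term is more than ten times larger than either rare service's, which is the self-regulating pressure the scenario is supposed to confirm.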

Scenario 8: Advertisement Expiry and Churn Chaos

Goal: Verify that expired ads are cleaned up and not returned to discoverers, and that the system recovers after registrar and advertiser churn.

Network setup:

100 registrars
100 advertisers
100 discoverers

How to run:

Start everything. Let ads stabilise (wait for debug "advert accepted" across most advertisers).
At t=E/2 (450s): kill 70 advertisers abruptly (SIGKILL).
At t=E (900s): restart remaining 30 advertisers with new peer IDs and new IPs.
At t=E+30s: kill 30 registrars. Restart them 60s later.
Continue running discoverers throughout and record what they find.
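The timeline above can be derived from advertExpiry so it stays correct if the default changes. This is a minimal sketch; the `ChurnEvent` type and the action names are hypothetical and would need to be mapped onto whatever the simulation harness actually supports.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChurnEvent:
    at_s: float   # seconds since the start of the run
    action: str
    target: str
    count: int


def churn_schedule(advert_expiry_s: float = 900.0) -> list[ChurnEvent]:
    """Build the Scenario 8 churn timeline relative to E = advertExpiry."""
    e = advert_expiry_s
    return [
        ChurnEvent(e / 2, "kill", "advertisers", 70),
        ChurnEvent(e, "restart_with_new_identity", "advertisers", 30),
        ChurnEvent(e + 30, "kill", "registrars", 30),
        ChurnEvent(e + 30 + 60, "restart", "registrars", 30),
    ]


if __name__ == "__main__":
    for event in churn_schedule():
        print(f"t={event.at_s:>6.0f}s  {event.action} {event.count} {event.target}")
```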

Existing logs to watch:

Log line	What it tells you
`debug "pruned expired adverts" count=N`	Cleanup running correctly
`cd_registrar_ads_expired` counter	Total expired ad count
`cd_registrar_cache_ads` gauge	Cache should shrink after kills
`error "failed to register ad" error="dialing peer failed"`	Dead advertiser dials
`debug "adverts found" count=N`	Whether stale ads are still being returned

Missing logs — need to add:

debug "pruned expired adverts" only logs a count — add serviceId breakdown so you can see which service’s ads are expiring
No log when getAdvertisements returns an ad that belongs to a peer that is now unreachable — the cache returns all stored ads regardless; this requires a separate staleness check but at minimum add debug "returning N ads for service" with serviceId and count in getAdvertisements so stale return rate is visible
No log distinguishing a registrar restarting from a fresh registrar — add info "registrar cache restored" with cache size on startup if persistence is added, or info "registrar starting with empty cache" if not

What to check:

After t=E, cd_registrar_cache_ads should drop significantly — dead advertisers’ ads have expired
debug "adverts found" count>0 after t=E should only return ads from alive advertisers — if dead-peer ads are still returned, expiry is not working
After registrar restart, cd_registrar_cache_ads gauge should show 0 (cache is lost on restart, expected) — discoverers should reroute to surviving registrars

Scenario 9: Oversized / Corrupted Advertisement Attack

Goal: Verify that malicious advertisements are rejected at validation without causing crashes, memory spikes, or silent acceptance.

Network setup:

50 honest registrars
50 malicious advertisers → one attack variant each (see below)

Attack variants — one per advertiser group:

Variant	How to produce
Random bytes	Send raw random bytes as `register.advertisement`
Valid protobuf, invalid signature	Build valid XPR structure but sign with wrong key
Valid signature, missing service	XPR with correct signature but no `ServiceInfo` entries
Oversized advertisement	XPR > 1024 bytes (pad `ServiceInfo.data` to overflow)
Oversized `ServiceInfo.data`	`ServiceInfo.data` > 33 bytes, otherwise valid
Invalid multiaddrs	Valid XPR with malformed multiaddress bytes
Empty advertisement	Send `register.advertisement = []`

Existing logs to watch:

Log line	What it tells you
`error "advertisement exceeds maximum encoded XPR size"`	Oversized ad caught
`error "invalid advertisement received"`	Decode failure
`error "advertisement does not advertise the requested service"`	Missing service caught
`error "advertisement violates XPR or ServiceInfo size limits"`	Size constraint caught
`cd_register_requests{status="Rejected"}`	Rejection rate

Missing logs — need to add:

No log records which validation step failed with enough detail — the four error messages above cover different cases but all fire at the same location with similar text; add a reason field to each: error "invalid advertisement" reason="decode_failed|size_exceeded|service_missing|xpr_limits"
No memory metric is tracked around validation — if a huge malformed protobuf causes a large allocation before rejection, it won’t be visible; consider adding a cd_advertisement_bytes_rejected counter with a size histogram

What to check:

Every malicious variant must produce a cd_register_requests{status="Rejected"} increment — none should result in Confirmed
cd_registrar_cache_ads must not increase during the attack
Process memory (external monitoring) should not grow — oversized ads are rejected before allocation where possible

Scenario 10: Eclipse Attack Near Service Hash

Goal: Verify that malicious registrars clustered near the target service hash cannot suppress honest advertisements.

Network setup:

200 honest registrars → distributed uniformly
60 malicious registrars → peer IDs chosen so XOR(peerId, service_hash) is minimal (close bucket)
4 honest advertisers
30 discoverers

Malicious registrar behaviours:

Return empty getAds.advertisements always
Return fake advertisements (valid structure, attacker-controlled peer IDs)
Return only other malicious nodes in closerPeers
Return REJECTED to all REGISTER requests from honest advertisers

How to run:

Pre-generate 60 peer keys such that SHA256(peerId) is close to SHA256(service_id) (brute-force leading bits).
Start malicious registrars with those keys.
Start honest infrastructure.
Run discoverers for at least 5 lookup cycles.
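For the key pre-generation step, a brute-force search over candidate identities could look roughly like the sketch below. Random 32-byte strings stand in for real peer IDs here; in practice each candidate would be a freshly generated key pair whose peer ID is hashed, and that part is not shown.

```python
import hashlib
import os


def shared_prefix_bits(a: bytes, b: bytes) -> int:
    """Number of leading bits two equal-length byte strings have in common."""
    diff = int.from_bytes(a, "big") ^ int.from_bytes(b, "big")
    return len(a) * 8 - diff.bit_length()


def grind_ids(service_id: str, min_bits: int, count: int,
              id_source=lambda: os.urandom(32)) -> list[bytes]:
    """Collect `count` candidate IDs whose SHA-256 shares >= min_bits leading bits with SHA-256(service_id)."""
    target = hashlib.sha256(service_id.encode()).digest()
    found = []
    while len(found) < count:
        candidate = id_source()  # stand-in for deriving a peer ID from a new key
        if shared_prefix_bits(hashlib.sha256(candidate).digest(), target) >= min_bits:
            found.append(candidate)
    return found
```

Each additional matching bit doubles the expected work (about 2^min_bits candidates per ID), so the number of leading bits is a direct knob for how strong an attacker the run simulates.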

Existing logs to watch:

Log line	What it tells you
`debug "adverts found" count=N`	`N=0` from malicious registrars
`debug "advert accepted"`	Honest ads still getting through to honest registrars
`cd_lookup_peers_found`	End-to-end success rate
`error "failed to register ad"`	Honest advertisers rejected by malicious registrars

Missing logs — need to add:

No log records which bucket a queried registrar came from during lookup — add debug "querying registrar" bucket=N in collectBucketAds so you can see whether discoverers are stuck in the high-numbered buckets (close to service hash) where malicious nodes are concentrated
No log for when validAds() filters out invalid ads from a GET_ADS response — currently done silently; add debug "filtered invalid ads" count=N registrar=X so the fake-ad rejection rate is visible

Scenario 11: Concurrent Advertise + Lookup + Churn Race Test

Goal: Detect concurrency bugs, async races, routing-table corruption, and cleanup issues under sustained load.

Network setup:

150 registrars
200 advertisers
200 discoverers

How to run:
Continuously for 4+ hours:

Every 30s: kill and restart 10 random advertisers with new peer IDs
Every 60s: rotate 5 random registrars (kill + restart)
Every 20s: each discoverer calls lookup() concurrently
Every 45s: add a new service and start advertising it; remove an old service

Existing logs to watch:

Log line	What it tells you
`error "no service routing table found"`	Table missing during active advertisement — race
`error "failed to register ad"`	Network churn causing failures
`debug "pruned expired adverts"`	Cleanup running during churn
`cd_advertiser_pending_actions` gauge	Should not grow unboundedly
`cd_service_tables_count` gauge	Should track add/remove correctly

Missing logs — need to add:

No log when a service routing table is created or destroyed during churn — add debug "service table created" / debug "service table removed" with serviceId and role (Provided/Interest) so table lifecycle is traceable
No log when an async task is cancelled unexpectedly (e.g. advertiseToRegistrar cancelled mid-wait) — add debug "advertise task cancelled" in the cancellation path
No log when maintainAdvertiser or maintainRegistrar loops restart after a crash or table change — add debug "maintenance loop iteration" with a timestamp and current state summary

What to check:

cd_advertiser_pending_actions should stay bounded — if it grows monotonically, tasks are leaking
cd_service_tables_count should track add/remove — if it only grows, removeService is not being called
No Nim runtime errors (Error: unhandled exception, SIGSEGV) in logs — concurrent map access issues would surface here
Memory (external) should be flat after initial ramp-up — growing memory indicates leaked futures or routing table entries
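The "stays bounded" checks for gauges such as cd_advertiser_pending_actions, and for externally sampled memory, can be automated with a simple heuristic. The sketch below is only one possible rule: it drops a warm-up portion and then compares the peak of the later half of the run against the earlier half. The warm-up fraction and tolerance are arbitrary assumptions to tune per metric.

```python
def is_bounded(samples, warmup_fraction=0.25, tolerance=1.1):
    """Heuristic check that a gauge stops growing after the initial ramp-up.

    samples: gauge values in time order. Returns True when the peak of the
    later half of the steady-state samples is at most `tolerance` times the
    peak of the earlier half.
    """
    steady = list(samples)[int(len(samples) * warmup_fraction):]
    if len(steady) < 2:
        raise ValueError("not enough samples after warm-up")
    mid = len(steady) // 2
    earlier_peak = max(steady[:mid])
    later_peak = max(steady[mid:])
    return later_peak <= tolerance * earlier_peak
```

Over a 4+ hour run this is cheap to evaluate on each gauge and gives a pass/fail signal, while the raw time series remains available for closer inspection when it fails.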

Alberto · May 22, 2026, 9:15am

Hi Arunima.

I come with some questions regarding these scenarios:

How do you imagine the initial deployment? For example, in the “Rare Service Discovery in a Large Noisy Network” scenario, where you describe a set of 500 nodes: are these nodes connected to an initial bootstrap node, or do they connect among themselves?

Then, specifically for scenarios:
“Rare Service Discovery in a Large Noisy Network”

You mention 500 nodes: 250 + 150 + 80 + 3 + 40 is a bit more than that. Just double-checking that we are talking about 523 nodes, since I assume that each described set of nodes is independent of the other groups.
How do we set the behavior of a node as “registrar”? Is this not the default behavior?
What is considered a popular service and what a rare service? Is it correct to assume that popular is “popular” because it is advertised by 80 nodes?
What is “normal random peer discovery”? Normal Kad-DHT nodes?

“REGISTER Storm / Retry Explosion”

Is client mode finished? Where can I see an example of it?

My last question: are the things we need to check exposed by the service-discovery protocol, either through logs or metrics?

That would be it for now. I think scenarios are fairly doable so far.

arunima · May 29, 2026, 11:37am

Hi!

Thanks for taking a look at the scenarios and for the questions.

Regarding the initial deployment, I was imagining a setup similar to the current dogfooding setup where all nodes connect to a common bootstrap node initially. From there, they discover additional peers through Kad-DHT and service discovery. The exact topology is flexible though—the numbers in the scenarios are mainly intended to describe the scale and roles involved rather than a strict deployment plan.

For the “Rare Service Discovery in a Large Noisy Network” scenario, you’re right about the count. I wasn’t being precise with the total number there; the important part was the relative distribution of registrars, advertisers, discoverers, and other nodes. We can adjust the numbers.

Regarding registrars, my understanding is that service discovery currently runs on server-mode nodes, so yes, in practice many nodes may behave as registrars by default. I was mainly using the term to describe the role the node is playing in the scenario rather than implying special configuration.

For popular vs rare services, yes, that’s exactly the idea. A service is considered “popular” simply because many nodes advertise it, while a “rare” service has very few advertisers. The goal is to see whether service discovery still performs well when only a handful of nodes provide a given service.

By “normal random peer discovery”, I mean discovering peers through the underlying Kad-DHT without using service-specific advertisements. The comparison is mainly to see whether service discovery helps find peers for a specific service, especially a rare one, more efficiently.

For client mode, I believe the implementation is mostly there, but I haven’t written those scenarios assuming it is already fully available in Logos Delivery. I added the client-mode scenario mainly because it is an important behavior to verify.

For the metrics/logs, not everything is exposed today. Some of the things I listed are “things worth observing” rather than metrics that already exist. We may need additional logging or instrumentation to measure some of them properly.

Glad to hear they seem doable so far. Also, these scenarios are still very much a draft and we will further discuss and refine them. The main goal right now is to brainstorm ways we can stress the system and identify potential weaknesses.