Service Discovery Network Simulation

arunima · May 15, 2026, 4:07am

Scenarios:

Scenario 1: Rare Service Discovery in a Large Noisy Network

Goal: Verify that service discovery finds rare services faster than random walking and that discoverers converge toward the right registrars rather than wandering.

Network setup:

250 registrars → server mode, no services, no xprPublishing
150 kad-only peers → plain KadDHT (not ServiceDiscovery) — act as routing noise
80 advertisers → advertise a common service e.g. “/logos/popular/1.0.0”
3 advertisers → advertise the rare service “/logos/rare/1.0.0”
40 discoverers → call lookup() for “/logos/rare/1.0.0” only

How to run:

Start registrars first. Wait for routing tables to stabilise (~30s).
Start 80 popular-service advertisers. Let them get confirmed (~60s).
Start 3 rare-service advertisers. Record the exact timestamp.
Start 40 discoverers. Each calls lookup("/logos/rare/1.0.0") in a loop every 10s.

Existing logs to watch:

| Log line | What it tells you |
| --- | --- |
| `debug "getting adverts"` | Each GET_ADS sent; count per discoverer |
| `debug "adverts found" count=N` | `N=0` means empty response from that registrar |
| `debug "advert accepted"` | Rare service ad admitted at registrar |
| `cd_lookup_peers_found` metric | Total peers found per lookup call |
| `cd_lookup_requests` metric | Total lookups initiated |
| `cd_registrar_cache_ads` gauge | Cache size per registrar over time |
| `cd_service_table_peers` gauge | Routing table growth toward service hash |

Missing logs — need to add:

debug "lookup complete" with fields: serviceId, peersFound, registrarsContacted, bucketsTraversed, durationMs — currently no single log captures when a lookup finishes and what it took
debug "empty response from registrar" — currently adverts found count=0 exists but has no explicit distinguishing label; add a dedicated log so it can be grepped separately
debug "routing table state" at lookup start — log how many peers are in each bucket of DiscT(service_id_hash) before the first GET_ADS goes out; currently no snapshot log exists

What to check:

Time from rare ad admitted (advert accepted) to first discoverer finding it (adverts found count>0)
Ratio of adverts found count=0 to total getting adverts — should decrease over time as routing tables converge
cd_service_table_peers should grow steadily for discoverers; if it stays flat, routing table is not being updated from closerPeers
Whether the same registrar peer ID appears repeatedly in getting adverts — indicates discoverers are not advancing through buckets
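
The first two checks can be computed offline once the logs are parsed. A minimal Python sketch, assuming each relevant log line has already been reduced to a `(timestamp_s, node_id, message, count)` tuple and filtered to the rare service; the tuple shape and both function names are illustrative, not something the codebase provides.

```python
from collections import defaultdict

def empty_ratio_by_window(events, window_s=60):
    """Return {window_index: empty_responses / get_ads_sent} per time window.

    events: iterable of (timestamp_s, node_id, message, count) tuples (assumed shape).
    """
    sent = defaultdict(int)
    empty = defaultdict(int)
    for ts, _node, msg, count in events:
        w = int(ts // window_s)
        if msg == "getting adverts":
            sent[w] += 1
        elif msg == "adverts found" and count == 0:
            empty[w] += 1
    return {w: empty[w] / sent[w] for w in sorted(sent)}

def time_to_first_discovery(events):
    """Seconds from the first 'advert accepted' to the first non-empty 'adverts found'."""
    accepted = [ts for ts, _n, msg, _c in events if msg == "advert accepted"]
    if not accepted:
        return None
    first_accept = min(accepted)
    found = [ts for ts, _n, msg, c in events
             if msg == "adverts found" and c and ts >= first_accept]
    return min(found) - first_accept if found else None
```

A ratio that falls from one window to the next is the convergence signal; a flat ratio near 1.0 suggests discoverers keep hitting registrars that hold no rare ads.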

Scenario 2: REGISTER Storm / Retry Explosion

Goal: Verify that 300 advertisers hitting 40 registrars simultaneously do not cause retry loops, unbounded queues, or broken ticket state.

Network setup:

40 registrars → server mode, advertCacheCap = 1000 (default)
300 advertisers → all advertise the same service, all start at t=0

How to run:

Start 40 registrars. Wait for routing tables to stabilise.
Start all 300 advertisers simultaneously (use a barrier or coordinated start timestamp).
Run for at least 3 × advertExpiry (default = 3 × 900s = 45 min) to see full retry cycles.
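
One way to get the simultaneous start in a single-host harness is a thread barrier. The sketch below is Python and uses threads as stand-ins for advertiser processes; `start_advertiser` is a hypothetical callable supplied by the harness. Across hosts, a shared start timestamp would be needed instead.

```python
import threading
import time

def run_simultaneously(n, start_advertiser):
    """Release n workers at (nearly) the same instant; return their start times."""
    barrier = threading.Barrier(n)
    started = [None] * n

    def worker(i):
        barrier.wait()                  # blocks until all n workers are ready
        started[i] = time.monotonic()   # record the release time for this worker
        start_advertiser(i)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return started
```

The spread `max(started) - min(started)` gives a rough bound on how simultaneous the storm really was.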

Existing logs to watch:

| Log line | What it tells you |
| --- | --- |
| `debug "registering advert"` | First attempt per advertiser per registrar |
| `debug "waiting for registrar" wait=Xs` | Wait time issued; look for outliers |
| `debug "advert accepted"` | Successful admission |
| `debug "registrar rejection, aborting"` | Hard rejection; should be rare |
| `error "no ticket to retry with"` | Registrar issued Wait but no ticket: protocol bug |
| `cd_register_requests{status="Wait"}` | Rate of Wait responses |
| `cd_register_requests{status="Confirmed"}` | Rate of admissions |
| `cd_registrar_cache_ads` gauge | Cache fill rate over time |
| `cd_advertiser_pending_actions` gauge | Queue depth on advertisers |

Missing logs — need to add:

debug "waiting for registrar" currently logs wait as a string (waitSecs) — change to a numeric field so it can be aggregated; also add retryCount field to track how many times this advertiser has retried this registrar
No log when an advertiser exhausts advertExpiry and loops back for re-registration — add debug "re-registering after expiry" to distinguish fresh registration from retry
cd_register_requests label for "Wait" counts total Waits but not per-registrar — add registrar label or a separate histogram for wait duration values

What to check:

cd_advertiser_pending_actions should grow initially then stabilise — if it grows unboundedly, advertisers are not respecting wait times
debug "waiting for registrar" entries should have wait > 0 — wait=0 followed immediately by another registering advert for the same registrar is a retry-without-waiting bug
cd_registrar_cache_ads should approach but not exceed advertCacheCap
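
The retry-without-waiting check can be automated over parsed logs. A rough Python sketch, assuming events are `(timestamp_s, advertiser, registrar, message, wait_s)` tuples and that a retry shows up as another "registering advert" line for the same registrar, as the check above implies; the tuple shape, function name, and tolerance are assumptions.

```python
def find_early_retries(events, tolerance_s=0.5):
    """Return (advertiser, registrar, waited_s, required_s) for retries sent too early.

    A retry is flagged when it follows a Wait before the issued wait has elapsed,
    or when the issued wait was not positive.
    """
    pending = {}      # (advertiser, registrar) -> (timestamp of Wait, wait seconds)
    violations = []
    for ts, adv, reg, msg, wait_s in sorted(events, key=lambda e: e[0]):
        key = (adv, reg)
        if msg == "waiting for registrar":
            pending[key] = (ts, wait_s)
        elif msg == "registering advert" and key in pending:
            wait_ts, required = pending.pop(key)
            waited = ts - wait_ts
            if required <= 0 or waited + tolerance_s < required:
                violations.append((adv, reg, waited, required))
    return violations
```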

Scenario 3: Client mode

Goal: Verify client-mode nodes can discover but cannot advertise or act as registrars, and that violations fail loudly.

Network setup:

80 server-mode registrars
40 server-mode advertisers → advertise “/logos/test/1.0.0”
30 server-mode discoverers
100 client-mode nodes → client = true in ServiceDiscovery.new()

How to run:

Start registrars and advertisers. Wait for ads to be confirmed.
Start 100 client-mode nodes. Each calls lookup("/logos/test/1.0.0").
Pick 20 client-mode nodes and call addProvidedService() on them — expect failure.
Pick another 20 client-mode nodes and attempt to send a REGISTER request directly (craft a raw message or call advertiseToRegistrar) — expect failure.

Existing behaviour:

addProvidedService and advertiseToRegistrar both have doAssert not disco.clientMode, "not supported in client mode". Unlike assert, doAssert stays enabled in release builds, so a violation raises an AssertionDefect, which is not meant to be caught and typically takes the node down in any build. That is not clean failure.

Missing logs — need to add (and fix):

Replace doAssert not disco.clientMode with a proper if disco.clientMode: warn "operation not supported in client mode", op="addProvidedService"; return — so it fails visibly in logs without crashing
Add debug "client mode lookup success" with peersFound field so it is verifiable in logs that client nodes are succeeding at discovery
No log exists when a client-mode node receives an inbound REGISTER request — the handler will process it silently because registration() does not check clientMode. This is a missing guard — a client-mode node acting as registrar should reject inbound REGISTER messages and log a warning

What to check:

debug "adverts found" count>0 appears for client nodes — discovery works
warn "operation not supported in client mode" appears when advertise is attempted — clean rejection
No debug "advert accepted" or cd_registrar_cache_ads increment on client nodes — they are not storing ads
No debug "registering advert" on client nodes — they are not sending REGISTER
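
These four checks reduce to two sets: client nodes that emitted a forbidden log line, and client nodes that never saw a non-empty lookup. A small Python sketch over parsed logs, assuming `(node_id, message, count)` tuples; the tuple shape and names are illustrative.

```python
FORBIDDEN_ON_CLIENT = {"advert accepted", "registering advert"}

def client_mode_report(events, client_nodes):
    """Return (violations, clients_without_discovery) from parsed log events.

    events: iterable of (node_id, message, count) tuples (assumed shape).
    """
    violations = []
    discovered = set()
    for node, msg, count in events:
        if node not in client_nodes:
            continue                      # server-mode nodes are out of scope here
        if msg in FORBIDDEN_ON_CLIENT:
            violations.append((node, msg))
        elif msg == "adverts found" and count:
            discovered.add(node)
    return violations, set(client_nodes) - discovered
```

Both return values should be empty for a passing run.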

Scenario 4: Malicious Registrars Poisoning Routing Tables

Goal: Verify that bad closerPeers responses from malicious registrars do not permanently corrupt service routing tables or prevent discovery.

Network setup:

120 honest registrars
50 malicious registrars → return crafted closerPeers (see below)
30 advertisers
40 discoverers

Malicious registrar behaviours to implement (one variant per run):

Return only their own peerId in closerPeers (self-loop)
Return 16 random garbage peer IDs not in the network
Return unreachable peers (valid IDs, no working addresses)
Return peers far from service_id_hash (bucket 0 peers for a service in bucket 15)
Return empty closerPeers list
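
A sketch of how a test harness might build these crafted lists, in Python. It assumes peers are plain `{"id": bytes, "addrs": [str]}` dicts, which is a stand-in for whatever the real wire type is; the function and variant names are hypothetical. The "far from service_id_hash" variant is left out because it depends on the real distance metric and bucket layout.

```python
import os

def malicious_closer_peers(variant, self_peer, known_peers=(), n=16, id_len=32):
    """Build a crafted closerPeers list for one malicious-registrar variant."""
    if variant == "self_loop":
        return [self_peer]
    if variant == "garbage":
        # Random IDs that belong to nobody; 203.0.113.0/24 is a documentation range.
        return [{"id": os.urandom(id_len), "addrs": ["/ip4/203.0.113.1/tcp/4001"]}
                for _ in range(n)]
    if variant == "unreachable":
        # Real IDs with the addresses stripped, so dials cannot succeed.
        return [{"id": p["id"], "addrs": []} for p in list(known_peers)[:n]]
    if variant == "empty":
        return []
    raise ValueError(f"unknown variant: {variant}")
```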

How to run:
Each variant is a separate run. Run for at least 2 full lookup cycles. Compare discovery success rate against a clean baseline run.

Existing logs to watch:

| Log line | What it tells you |
| --- | --- |
| `error "failed to register ad" error="dialing peer failed"` | Unreachable peer in routing table |
| `debug "adverts found" count=N` | Discovery progress |
| `cd_service_table_insertions` | How many poisoned peers were inserted |
| `cd_service_table_peers` gauge | Routing table size over time |

Missing logs — need to add:

No log when closerPeers are inserted into the routing table — add debug "inserting closer peers" with count and serviceId so poisoning rate can be measured
No log when a dial fails during dispatchGetAds — the error is returned as a string but not logged at the call site; add warn "GET_ADS dial failed" with registrar and error fields
No metric for failed dials specifically — cd_messages_sent exists but there is no cd_dial_failures counter; add one

What to check:

cd_service_table_peers should not grow unboundedly when malicious nodes return junk; unbounded growth indicates there is no dedup or cap on peer insertion
error "dialing peer failed" rate should correlate with the number of malicious nodes
Honest discoverers should still find ads within 2–3 lookup cycles even with 50 malicious registrars — if they never find ads, the routing table is fully poisoned

Scenario 5: Ticket Grinding Attack

Goal: Verify that malicious advertisers cannot reduce their wait time by manipulating ticket fields or retry timing.

Network setup:

30 registrars
200 malicious advertisers → one behaviour variant each (see below)

Malicious behaviours — implement as separate advertiser modes:

| Variant | What it does |
| --- | --- |
| Early retry | Retries at `t_mod + t_wait_for - 10s` (before window opens) |
| Late retry | Retries at `t_mod + t_wait_for + δ + 10s` (after window closes) |
| Modified `t_wait_for` | Reduces `t_wait_for` in ticket before presenting it |
| Reused old ticket | Presents a ticket from a previous registration attempt |
| Cross-registrar ticket | Presents a ticket signed by registrar A to registrar B |
| Modified ad | Slightly changes the advertisement bytes while reusing the ticket |
| Dropped ticket | Ignores the ticket and retries fresh every time |
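
The early and late variants imply a retry window that opens at `t_mod + t_wait_for` and closes `δ` later. A tiny Python sketch of that classification, useful for labelling what each malicious advertiser actually sent; treating both window edges as inclusive is an assumption, and the function name is made up.

```python
def classify_retry(now, t_mod, t_wait_for, delta):
    """Classify a retry time against the window [t_mod + t_wait_for, t_mod + t_wait_for + delta]."""
    opens = t_mod + t_wait_for
    closes = opens + delta
    if now < opens:
        return "early"
    if now > closes:
        return "late"
    return "in_window"
```

With `t_mod=0`, `t_wait_for=100`, `delta=5`, the early variant retries at 90 and the late variant at 115.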

Existing logs to watch:

| Log line | What it tells you |
| --- | --- |
| `error "invalid ticket in register message"` | Ticket validation failure |
| `cd_register_requests{status="Rejected"}` | Hard rejection rate |
| `cd_register_requests{status="Wait"}` | Re-issuance rate (grinding attempt) |
| `debug "waiting for registrar" wait=Xs` | Wait times being issued |

Missing logs — need to add:

No log distinguishes why a ticket was invalid — add a reason field: warn "ticket rejected" reason="early retry|late retry|signature mismatch|ad mismatch" — currently all failures collapse into error "invalid ticket in register message" with no reason
No log when updateLowerBounds increases the bound (i.e. when the grinding defence fires) — add debug "lower bound enforced" with serviceId, prevBound, newBound so it is observable
No metric tracking wait time values issued — add a histogram cd_wait_time_issued_secs so you can see whether wait times are growing for grinding advertisers

What to check:

Early and late retry should produce cd_register_requests{status="Rejected"} immediately — no Wait ticket re-issued
Dropped-ticket variant: wait times should not decrease across retries — each fresh attempt recalculates from current cache state
After sustained grinding, cd_wait_time_issued_secs histogram for the grinding advertisers should show increasing values (lower bound enforcement working)

Scenario 6: Same-IP Sybil Flooding

Goal: Verify that the IP tree correctly penalises subnet-concentrated advertisers while admitting diverse ones.

Network setup:

100 registrars
400 advertisers → all bind to addresses in 192.168.1.0/24
100 advertisers → each binds to a unique IP in 10.0.0.0/8
All advertise the same service continuously.

How to run:
Use loopback aliases or a network namespace to give the 400 Sybil advertisers addresses in the same /24. A /24 has only 254 usable host addresses, so some of the 400 must share an IP and differ by port. Run for at least 2 × advertExpiry.
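
The address plan for both groups can be generated with Python's `ipaddress` module. This is a sketch: function names are made up, and spreading the diverse group over distinct /24s is an extra assumption (the setup only requires unique IPs), chosen so that group shares as little prefix as possible.

```python
import ipaddress
from itertools import cycle, islice

def sybil_addresses(count=400, subnet="192.168.1.0/24"):
    """Assign `count` addresses from one subnet, reusing hosts once they run out."""
    hosts = list(ipaddress.ip_network(subnet).hosts())   # 254 usable hosts in a /24
    return [str(ip) for ip in islice(cycle(hosts), count)]

def diverse_addresses(count=100, base="10.0.0.0/8"):
    """Pick the first host of each of the first `count` /24 subnets inside `base`."""
    subnets = ipaddress.ip_network(base).subnets(new_prefix=24)
    return [str(next(iter(s.hosts()))) for s in islice(subnets, count)]
```

The resulting lists can feed whatever creates the loopback aliases or namespace interfaces.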

Existing logs to watch:

| Log line | What it tells you |
| --- | --- |
| `debug "waiting for registrar" wait=Xs` | Compare wait values between /24 and unique-IP groups |
| `debug "advert accepted"` | Admission rate per group |
| `cd_iptree_unique_ips` gauge | IP tree growth over time |
| `cd_registrar_ads_expired` counter | Cleanup rate |
| `cd_registrar_cache_ads` gauge | Cache composition |

Missing logs — need to add:

No log records the IP similarity score at the time of admission or Wait issuance — add debug "waiting time calculated" with fields ipSim, serviceSim, occupancy, tWait, peerId in waitingTime — this is the most important missing observable for this scenario
No log when an IP is inserted into or removed from the IP tree — add debug "ip tree insert" / debug "ip tree remove" with the IP (or at least the /24 prefix) so tree growth and cleanup are traceable
cd_iptree_unique_ips only tracks total unique IPs — add cd_iptree_depth_score histogram or similar to track score distribution

What to check:

debug "waiting for registrar" wait=Xs for /24 advertisers should be significantly higher than for unique-IP advertisers
cd_registrar_cache_ads breakdown: unique-IP advertisers should have proportionally more entries than their share of total advertisers
cd_iptree_unique_ips should stabilise after advertExpiry as expired ads remove their IPs — if it only grows, removeAd is not being called on expiry

Scenario 7: Popular Service Hotspot Attack

Goal: Verify that a dominant popular service does not starve rare services out of the cache.

Network setup:

150 registrars
700 advertisers → advertise “/logos/popular/1.0.0” aggressively
20 advertisers → advertise “/logos/rare-A/1.0.0” and “/logos/rare-B/1.0.0”
50 discoverers → look up both rare services

How to run:

Start registrars, let them stabilise.
Start all 700 popular-service advertisers simultaneously.
After 30s, start rare-service advertisers.
After another 30s, start discoverers. Measure lookup latency for rare vs popular.

Existing logs to watch:

| Log line | What it tells you |
| --- | --- |
| `debug "advert accepted"` with `serviceId` | Admission rate per service |
| `debug "waiting for registrar" wait=Xs` | Compare wait times across services |
| `cd_registrar_cache_services` gauge | Number of distinct services in cache |
| `cd_registrar_cache_ads` gauge | Total cache occupancy |
| `cd_lookup_peers_found` counter | Discovery success for rare service |

Missing logs — need to add:

No per-service cache count is logged or exposed as a metric — cd_registrar_cache_services counts distinct services but not ads per service; add cd_registrar_cache_ads_per_service gauge with serviceId label so hotspot dominance is directly visible
The serviceSim term in waitingTime captures per-service pressure but is never logged — include it in the proposed debug "waiting time calculated" log from Scenario 6

What to check:

Rare service ads should still appear in cache (debug "advert accepted" for rare serviceId) even while popular service dominates
service_similarity (c_s / C) for popular service should be high, causing its own advertisers to wait longer — self-regulating
Rare service lookup latency should not be significantly worse than in a no-popular-service baseline
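
To make the `c_s / C` check concrete, here is a minimal Python sketch that computes the ratio per service from a snapshot of cached ads. It assumes `C` is the cache capacity (advertCacheCap); if `C` is meant to be current occupancy, pass the number of cached ads instead. The function name is illustrative.

```python
from collections import Counter

def service_similarity(cached_service_ids, capacity):
    """Return {service_id: c_s / C} for a list of service IDs, one per cached ad."""
    counts = Counter(cached_service_ids)
    return {sid: n / capacity for sid, n in counts.items()}
```

For a cache of capacity 1000 holding 700 popular ads and 20 rare ads, this gives 0.7 for the popular service and 0.02 for the rare one, which is the gap the waiting-time formula is supposed to act on.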

Scenario 8: Advertisement Expiry and Churn Chaos

Goal: Verify that expired ads are cleaned up and not returned to discoverers, and that the system recovers after registrar and advertiser churn.

Network setup:

100 registrars
100 advertisers
100 discoverers

How to run:

Start everything. Let ads stabilise (wait for debug "advert accepted" across most advertisers).
At t=E/2 (450s): kill 70 advertisers abruptly (SIGKILL).
At t=E (900s): restart remaining 30 advertisers with new peer IDs and new IPs.
At t=E+30s: kill 30 registrars. Restart them 60s later.
Continue running discoverers throughout and record what they find.
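
The timeline above is parameterised by E (advertExpiry), so it helps to derive it rather than hard-code it. A small Python sketch; the function name and the tuple layout are assumptions for a test driver.

```python
def churn_schedule(expiry_s=900):
    """Return the Scenario 8 timeline as (seconds_from_start, action) pairs."""
    e = expiry_s
    return [
        (e / 2, "kill 70 advertisers (SIGKILL)"),
        (e, "restart remaining 30 advertisers with new peer IDs and new IPs"),
        (e + 30, "kill 30 registrars"),
        (e + 90, "restart the 30 registrars"),   # 60s after they were killed
    ]
```

With the default 900s expiry the events fall at 450s, 900s, 930s, and 990s.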

Existing logs to watch:

| Log line | What it tells you |
| --- | --- |
| `debug "pruned expired adverts" count=N` | Cleanup running correctly |
| `cd_registrar_ads_expired` counter | Total expired ad count |
| `cd_registrar_cache_ads` gauge | Cache should shrink after kills |
| `error "failed to register ad" error="dialing peer failed"` | Dead advertiser dials |
| `debug "adverts found" count=N` | Whether stale ads are still being returned |

Missing logs — need to add:

debug "pruned expired adverts" only logs a count — add serviceId breakdown so you can see which service’s ads are expiring
No log when getAdvertisements returns an ad that belongs to a peer that is now unreachable — the cache returns all stored ads regardless; this requires a separate staleness check but at minimum add debug "returning N ads for service" with serviceId and count in getAdvertisements so stale return rate is visible
No log distinguishing a registrar restarting from a fresh registrar — add info "registrar cache restored" with cache size on startup if persistence is added, or info "registrar starting with empty cache" if not

What to check:

After t=E, cd_registrar_cache_ads should drop significantly — dead advertisers’ ads have expired
debug "adverts found" count>0 after t=E should only return ads from alive advertisers — if dead-peer ads are still returned, expiry is not working
After registrar restart, cd_registrar_cache_ads gauge should show 0 (cache is lost on restart, expected) — discoverers should reroute to surviving registrars

Scenario 9: Oversized / Corrupted Advertisement Attack

Goal: Verify that malicious advertisements are rejected at validation without causing crashes, memory spikes, or silent acceptance.

Network setup:

50 honest registrars
50 malicious advertisers → one attack variant each (see below)

Attack variants — one per advertiser group:

| Variant | How to produce |
| --- | --- |
| Random bytes | Send raw random bytes as `register.advertisement` |
| Valid protobuf, invalid signature | Build valid XPR structure but sign with wrong key |
| Valid signature, missing service | XPR with correct signature but no `ServiceInfo` entries |
| Oversized advertisement | XPR > 1024 bytes (pad `ServiceInfo.data` to overflow) |
| Oversized `ServiceInfo.data` | `ServiceInfo.data` > 33 bytes, otherwise valid |
| Invalid multiaddrs | Valid XPR with malformed multiaddress bytes |
| Empty advertisement | Send `register.advertisement = []` |
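
The variants that need no XPR encoding can be produced as raw bytes. The Python sketch below covers only those; the signed and structurally valid variants need the real XPR builder and are not attempted here. Note that the oversized payload is random bytes past the 1024-byte limit, so it exercises the size check only, not the padded-but-valid case from the table. Names are hypothetical.

```python
import os

MAX_XPR_BYTES = 1024   # size limit taken from the variants table

def raw_attack_payloads():
    """Return raw `register.advertisement` payloads for the encoding-free variants."""
    return {
        "random_bytes": os.urandom(256),
        "empty": b"",
        "oversized_raw": os.urandom(MAX_XPR_BYTES + 1),
    }
```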

Existing logs to watch:

| Log line | What it tells you |
| --- | --- |
| `error "advertisement exceeds maximum encoded XPR size"` | Oversized ad caught |
| `error "invalid advertisement received"` | Decode failure |
| `error "advertisement does not advertise the requested service"` | Missing service caught |
| `error "advertisement violates XPR or ServiceInfo size limits"` | Size constraint caught |
| `cd_register_requests{status="Rejected"}` | Rejection rate |

Missing logs — need to add:

No log records which validation step failed with enough detail — the four error messages above cover different cases but all fire at the same location with similar text; add a reason field to each: error "invalid advertisement" reason="decode_failed|size_exceeded|service_missing|xpr_limits"
No memory metric is tracked around validation — if a huge malformed protobuf causes a large allocation before rejection, it won’t be visible; consider adding a cd_advertisement_bytes_rejected counter with a size histogram

What to check:

Every malicious variant must produce a cd_register_requests{status="Rejected"} increment — none should result in Confirmed
cd_registrar_cache_ads must not increase during the attack
Process memory (external monitoring) should not grow — oversized ads are rejected before allocation where possible

Scenario 10: Eclipse Attack Near Service Hash

Goal: Verify that malicious registrars clustered near the target service hash cannot suppress honest advertisements.

Network setup:

200 honest registrars → distributed uniformly
60 malicious registrars → peer IDs chosen so XOR(peerId, service_hash) is minimal (close bucket)
4 honest advertisers
30 discoverers

Malicious registrar behaviours:

Return empty getAds.advertisements always
Return fake advertisements (valid structure, attacker-controlled peer IDs)
Return only other malicious nodes in closerPeers
Return REJECTED to all REGISTER requests from honest advertisers

How to run:

Pre-generate 60 peer keys such that SHA256(peerId) is close to SHA256(service_id) (brute-force leading bits).
Start malicious registrars with those keys.
Start honest infrastructure.
Run discoverers for at least 5 lookup cycles.
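
The brute-force step in the first instruction can be sketched as follows in Python. Random 32-byte strings stand in for peer IDs here, which is an assumption: real peer IDs derive from keypairs, so an actual run would generate keypairs and hash the resulting peer ID instead. Function names are illustrative.

```python
import hashlib
import os

def common_prefix_bits(a: bytes, b: bytes) -> int:
    """Number of leading bits shared by two equal-length byte strings."""
    for i, (x, y) in enumerate(zip(a, b)):
        diff = x ^ y
        if diff:
            return i * 8 + (8 - diff.bit_length())
    return len(a) * 8

def grind_ids(service_id: bytes, min_bits: int, count: int):
    """Find `count` random IDs whose SHA-256 shares >= min_bits leading bits
    with SHA-256(service_id)."""
    target = hashlib.sha256(service_id).digest()
    found = []
    while len(found) < count:
        candidate = os.urandom(32)   # stand-in for a freshly generated peer ID
        digest = hashlib.sha256(candidate).digest()
        if common_prefix_bits(digest, target) >= min_bits:
            found.append(candidate)
    return found
```

Each extra matching bit doubles the expected work (about 2^min_bits hashes per ID), so the achievable closeness for 60 keys is bounded by the compute budget.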

Existing logs to watch:

| Log line | What it tells you |
| --- | --- |
| `debug "adverts found" count=N` | `N=0` from malicious registrars |
| `debug "advert accepted"` | Honest ads still getting through to honest registrars |
| `cd_lookup_peers_found` | End-to-end success rate |
| `error "failed to register ad"` | Honest advertisers rejected by malicious registrars |

Missing logs — need to add:

No log records which bucket a queried registrar came from during lookup — add debug "querying registrar" bucket=N in collectBucketAds so you can see whether discoverers are stuck in the high-numbered buckets (close to service hash) where malicious nodes are concentrated
No log for when validAds() filters out invalid ads from a GET_ADS response — currently done silently; add debug "filtered invalid ads" count=N registrar=X so the fake-ad rejection rate is visible

Scenario 11: Concurrent Advertise + Lookup + Churn Race Test

Goal: Detect concurrency bugs, async races, routing-table corruption, and cleanup issues under sustained load.

Network setup:

150 registrars
200 advertisers
200 discoverers

How to run:
Continuously for 4+ hours:

Every 30s: kill and restart 10 random advertisers with new peer IDs
Every 60s: rotate 5 random registrars (kill + restart)
Every 20s: each discoverer calls lookup() concurrently
Every 45s: add a new service and start advertising it; remove an old service
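
The four periodic actions can be driven from a single one-second tick. A minimal Python sketch of the dispatch logic; the action strings are placeholders for whatever the harness actually calls.

```python
CHURN_PERIODS = {
    20: "every discoverer calls lookup()",
    30: "kill and restart 10 random advertisers with new peer IDs",
    45: "add a new service and remove an old one",
    60: "rotate 5 random registrars",
}

def actions_due(t_s):
    """Return the actions that fire at second t_s (t_s > 0), in period order."""
    return [action for period, action in sorted(CHURN_PERIODS.items())
            if t_s > 0 and t_s % period == 0]
```

All four actions coincide every 180s, which is a good moment to look for races in the logs.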

Existing logs to watch:

| Log line | What it tells you |
| --- | --- |
| `error "no service routing table found"` | Table missing during active advertisement: race |
| `error "failed to register ad"` | Network churn causing failures |
| `debug "pruned expired adverts"` | Cleanup running during churn |
| `cd_advertiser_pending_actions` gauge | Should not grow unboundedly |
| `cd_service_tables_count` gauge | Should track add/remove correctly |

Missing logs — need to add:

No log when a service routing table is created or destroyed during churn — add debug "service table created" / debug "service table removed" with serviceId and role (Provided/Interest) so table lifecycle is traceable
No log when an async task is cancelled unexpectedly (e.g. advertiseToRegistrar cancelled mid-wait) — add debug "advertise task cancelled" in the cancellation path
No log when maintainAdvertiser or maintainRegistrar loops restart after a crash or table change — add debug "maintenance loop iteration" with a timestamp and current state summary

What to check:

cd_advertiser_pending_actions should stay bounded — if it grows monotonically, tasks are leaking
cd_service_tables_count should track add/remove — if it only grows, removeService is not being called
No Nim runtime errors (Error: unhandled exception, SIGSEGV) in logs — concurrent map access issues would surface here
Memory (external) should be flat after initial ramp-up — growing memory indicates leaked futures or routing table entries
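
For the "should stay bounded" checks on gauges and memory, a crude trend test over scraped samples is usually enough to flag a run for closer inspection. The Python sketch below is a heuristic, not a statistical test: it drops a warm-up fraction, then compares the mean of the later half against the earlier half. The thresholds and the function name are arbitrary assumptions.

```python
from statistics import mean

def grows_unboundedly(samples, warmup=0.25, min_rise=0.10):
    """Flag a series whose post-warm-up second half averages more than
    min_rise (as a fraction) above its first half."""
    steady = list(samples)[int(len(samples) * warmup):]
    if len(steady) < 4:
        return False                      # too few points to judge
    half = len(steady) // 2
    first, second = mean(steady[:half]), mean(steady[half:])
    return second > first * (1 + min_rise)
```

A series that ramps up and then flattens passes; a steady climb over the whole run gets flagged.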