Scenarios:
Scenario 1: Rare Service Discovery in a Large Noisy Network
Goal: Verify that service discovery finds rare services faster than random walking and that discoverers converge toward the right registrars rather than wandering.
Network setup:
- 250 registrars → server mode, no services, no xprPublishing
- 150 kad-only peers → plain KadDHT (not ServiceDiscovery) — act as routing noise
- 80 advertisers → advertise a common service e.g. “/logos/popular/1.0.0”
- 3 advertisers → advertise the rare service “/logos/rare/1.0.0”
- 40 discoverers → call lookup() for “/logos/rare/1.0.0” only
How to run:
- Start registrars first. Wait for routing tables to stabilise (~30s).
- Start 80 popular-service advertisers. Let them get confirmed (~60s).
- Start 3 rare-service advertisers. Record the exact timestamp.
- Start 40 discoverers. Each calls `lookup("/logos/rare/1.0.0")` in a loop every 10s.
Existing logs to watch:
| Log line | What it tells you |
|---|---|
| `debug "getting adverts"` | Each GET_ADS sent; count per discoverer |
| `debug "adverts found" count=N` | N=0 means empty response from that registrar |
| `debug "advert accepted"` | Rare service ad admitted at registrar |
| `cd_lookup_peers_found` metric | Total peers found per lookup call |
| `cd_lookup_requests` metric | Total lookups initiated |
| `cd_registrar_cache_ads` gauge | Cache size per registrar over time |
| `cd_service_table_peers` gauge | Routing table growth toward service hash |
Missing logs — need to add:
- `debug "lookup complete"` with fields `serviceId`, `peersFound`, `registrarsContacted`, `bucketsTraversed`, `durationMs`: currently no single log captures when a lookup finishes and what it took
- `debug "empty response from registrar"`: currently `adverts found count=0` exists but has no explicit distinguishing label; add a dedicated log so it can be grepped separately
- `debug "routing table state"` at lookup start: log how many peers are in each bucket of `DiscT(service_id_hash)` before the first GET_ADS goes out; currently no snapshot log exists
What to check:
- Time from rare ad admitted (`advert accepted`) to first discoverer finding it (`adverts found count>0`)
- Ratio of `adverts found count=0` to total `getting adverts`: should decrease over time as routing tables converge
- `cd_service_table_peers` should grow steadily for discoverers; if it stays flat, routing table is not being updated from `closerPeers`
- Whether the same registrar peer ID appears repeatedly in `getting adverts`: indicates discoverers are not advancing through buckets
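The first two checks can be computed offline once logs are parsed. The sketch below assumes each log line has already been turned into a `(timestamp_s, message, fields)` tuple; that tuple shape and both function names are hypothetical, not part of the codebase.

```python
def empty_response_ratio(events):
    """Ratio of 'adverts found' with count=0 to total 'getting adverts'."""
    sent = sum(1 for _, msg, _ in events if msg == "getting adverts")
    empty = sum(
        1 for _, msg, fields in events
        if msg == "adverts found" and fields.get("count") == 0
    )
    return empty / sent if sent else 0.0


def time_to_first_discovery(events):
    """Seconds from the first 'advert accepted' to the first non-empty
    'adverts found' that follows it; None if either is missing."""
    accepted_at = None
    for ts, msg, fields in sorted(events, key=lambda e: e[0]):
        if msg == "advert accepted" and accepted_at is None:
            accepted_at = ts
        elif (msg == "adverts found" and accepted_at is not None
              and fields.get("count", 0) > 0):
            return ts - accepted_at
    return None
```

Running `empty_response_ratio` over successive time windows shows whether the ratio falls as routing tables converge.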
Scenario 2: REGISTER Storm / Retry Explosion
Goal: Verify that 300 advertisers hitting 40 registrars simultaneously do not cause retry loops, unbounded queues, or broken ticket state.
Network setup:
- 40 registrars → server mode, advertCacheCap = 1000 (default)
- 300 advertisers → all advertise the same service, all start at t=0
How to run:
- Start 40 registrars. Wait for routing tables to stabilise.
- Start all 300 advertisers simultaneously (use a barrier or coordinated start timestamp).
- Run for at least 3 × `advertExpiry` (with the 900s default, 3 × 900s = 45 min) to see full retry cycles.
Existing logs to watch:
| Log line | What it tells you |
|---|---|
| `debug "registering advert"` | First attempt per advertiser per registrar |
| `debug "waiting for registrar" wait=Xs` | Wait time issued; look for outliers |
| `debug "advert accepted"` | Successful admission |
| `debug "registrar rejection, aborting"` | Hard rejection; should be rare |
| `error "no ticket to retry with"` | Registrar issued Wait but no ticket; protocol bug |
| `cd_register_requests{status="Wait"}` | Rate of Wait responses |
| `cd_register_requests{status="Confirmed"}` | Rate of admissions |
| `cd_registrar_cache_ads` gauge | Cache fill rate over time |
| `cd_advertiser_pending_actions` gauge | Queue depth on advertisers |
Missing logs — need to add:
- `debug "waiting for registrar"` currently logs `wait` as a string (`waitSecs`): change to a numeric field so it can be aggregated; also add a `retryCount` field to track how many times this advertiser has retried this registrar
- No log when an advertiser exhausts `advertExpiry` and loops back for re-registration: add `debug "re-registering after expiry"` to distinguish fresh registration from retry
- `cd_register_requests` label for `"Wait"` counts total Waits but not per-registrar: add a `registrar` label or a separate histogram for wait duration values
What to check:
- `cd_advertiser_pending_actions` should grow initially then stabilise; if it grows unboundedly, advertisers are not respecting wait times
- `debug "waiting for registrar"` entries should have `wait > 0`; `wait=0` followed immediately by another `registering advert` for the same registrar is a retry-without-waiting bug
- `cd_registrar_cache_ads` should approach but not exceed `advertCacheCap`
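The retry-without-waiting pattern can be flagged mechanically. A minimal sketch, assuming logs are already parsed into ordered `(message, advertiser, registrar, wait_secs)` tuples with `wait_secs` numeric (or `None` when the line has no wait field); the tuple layout and function name are hypothetical.

```python
def find_zero_wait_retries(events):
    """Return (advertiser, registrar) pairs where a 'waiting for registrar'
    entry with wait=0 is directly followed, for that same pair, by another
    'registering advert' entry."""
    last_seen = {}  # (advertiser, registrar) -> (message, wait_secs)
    hits = []
    for msg, advertiser, registrar, wait in events:
        key = (advertiser, registrar)
        if (msg == "registering advert"
                and last_seen.get(key) == ("waiting for registrar", 0)):
            hits.append(key)
        last_seen[key] = (msg, wait)
    return hits
```

An empty result over the full run is the pass condition; any hit names the advertiser and registrar to inspect.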
Scenario 3: Client mode
Goal: Verify client-mode nodes can discover but cannot advertise or act as registrars, and that violations fail loudly.
Network setup:
- 80 server-mode registrars
- 40 server-mode advertisers → advertise “/logos/test/1.0.0”
- 30 server-mode discoverers
- 100 client-mode nodes → client = true in ServiceDiscovery.new()
How to run:
- Start registrars and advertisers. Wait for ads to be confirmed.
- Start 100 client-mode nodes. Each calls `lookup("/logos/test/1.0.0")`.
- Pick 20 client-mode nodes and call `addProvidedService()` on them; expect failure.
- Pick another 20 client-mode nodes and attempt to send a REGISTER request directly (craft a raw message or call `advertiseToRegistrar`); expect failure.
Existing behaviour:
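`addProvidedService` and `advertiseToRegistrar` both have `doAssert not disco.clientMode, "not supported in client mode"`. Because `doAssert` stays enabled in every build mode, this raises an `AssertionDefect` in debug and release builds alike, and an unhandled defect terminates the process. That is not clean failure.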
Missing logs — need to add (and fix):
- Replace `doAssert not disco.clientMode` with a proper `if disco.clientMode: warn "operation not supported in client mode", op="addProvidedService"; return` so it fails visibly in logs without crashing
- Add `debug "client mode lookup success"` with a `peersFound` field so it is verifiable in logs that client nodes are succeeding at discovery
- No log exists when a client-mode node receives an inbound REGISTER request: the handler will process it silently because `registration()` does not check `clientMode`. This is a missing guard; a client-mode node acting as registrar should reject inbound REGISTER messages and log a warning
What to check:
- `debug "adverts found" count>0` appears for client nodes: discovery works
- `warn "operation not supported in client mode"` appears when advertise is attempted: clean rejection
- No `debug "advert accepted"` or `cd_registrar_cache_ads` increment on client nodes: they are not storing ads
- No `debug "registering advert"` on client nodes: they are not sending REGISTER
Scenario 4: Malicious Registrars Poisoning Routing Tables
Goal: Verify that bad closerPeers responses from malicious registrars do not permanently corrupt service routing tables or prevent discovery.
Network setup:
- 120 honest registrars
- 50 malicious registrars → return crafted closerPeers (see below)
- 30 advertisers
- 40 discoverers
Malicious registrar behaviours to implement (one variant per run):
- Return only their own `peerId` in `closerPeers` (self-loop)
- Return 16 random garbage peer IDs not in the network
- Return unreachable peers (valid IDs, no working addresses)
- Return peers far from `service_id_hash` (bucket 0 peers for a service in bucket 15)
- Return empty `closerPeers` list
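Four of the five variants can be modelled with peer IDs as opaque bytes, as sketched below. The function name, the variant labels, and the use of plain XOR distance to pick "far" peers are assumptions for illustration; the unreachable-peers variant is left out because it depends on address records rather than IDs.

```python
import os


def xor_distance(a: bytes, b: bytes) -> int:
    """Kademlia-style XOR distance between two equal-length IDs."""
    return int.from_bytes(a, "big") ^ int.from_bytes(b, "big")


def crafted_closer_peers(variant, self_id, known_peers, service_hash, k=16):
    """Build the closerPeers list a malicious registrar would return."""
    if variant == "self_loop":
        return [self_id]
    if variant == "garbage":
        # Random IDs that almost certainly match no live peer.
        return [os.urandom(len(self_id)) for _ in range(k)]
    if variant == "far":
        # The k known peers farthest from the service hash.
        ranked = sorted(known_peers,
                        key=lambda p: xor_distance(p, service_hash),
                        reverse=True)
        return ranked[:k]
    if variant == "empty":
        return []
    raise ValueError(f"unknown variant: {variant}")
```

Keeping the variant as a run-time parameter makes it easy to run one variant per test run, as the plan requires.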
How to run:
Each variant is a separate run. Run for at least 2 full lookup cycles. Compare discovery success rate against a clean baseline run.
Existing logs to watch:
| Log line | What it tells you |
|---|---|
| `error "failed to register ad" error="dialing peer failed"` | Unreachable peer in routing table |
| `debug "adverts found" count=N` | Discovery progress |
| `cd_service_table_insertions` | How many poisoned peers were inserted |
| `cd_service_table_peers` gauge | Routing table size over time |
Missing logs — need to add:
- No log when `closerPeers` are inserted into the routing table: add `debug "inserting closer peers"` with `count` and `serviceId` so poisoning rate can be measured
- No log when a dial fails during `dispatchGetAds`: the error is returned as a string but not logged at the call site; add `warn "GET_ADS dial failed"` with `registrar` and `error` fields
- No metric for failed dials specifically: `cd_messages_sent` exists but there is no `cd_dial_failures` counter; add one
What to check:
- `cd_service_table_peers` should not grow unboundedly when malicious nodes return junk; unbounded growth indicates no dedup or cap on peer insertion
- `error "dialing peer failed"` rate should correlate with the number of malicious nodes
- Honest discoverers should still find ads within 2–3 lookup cycles even with 50 malicious registrars; if they never find ads, the routing table is fully poisoned
Scenario 5: Ticket Grinding Attack
Goal: Verify that malicious advertisers cannot reduce their wait time by manipulating ticket fields or retry timing.
Network setup:
- 30 registrars
- 200 malicious advertisers → one behaviour variant each (see below)
Malicious behaviours — implement as separate advertiser modes:
| Variant | What it does |
|---|---|
| Early retry | Retries at t_mod + t_wait_for - 10s (before window opens) |
| Late retry | Retries at t_mod + t_wait_for + δ + 10s (after window closes) |
| Modified `t_wait_for` | Reduces `t_wait_for` in ticket before presenting it |
| Reused old ticket | Presents a ticket from a previous registration attempt |
| Cross-registrar ticket | Presents a ticket signed by registrar A to registrar B |
| Modified ad | Slightly changes the advertisement bytes while reusing the ticket |
| Dropped ticket | Ignores the ticket and retries fresh every time |
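The early and late variants are defined relative to the retry window, which the table places between `t_mod + t_wait_for` and `t_mod + t_wait_for + δ`. A minimal sketch of that classification for the test harness follows; the function name is hypothetical and treating both window edges as inclusive is an assumption, not confirmed registrar behaviour.

```python
def retry_verdict(now, t_mod, t_wait_for, delta):
    """Classify a retry time against the ticket's retry window.
    Window edges are treated as inclusive (assumption)."""
    opens = t_mod + t_wait_for
    closes = opens + delta
    if now < opens:
        return "early"
    if now > closes:
        return "late"
    return "in_window"
```

For example, with `t_mod=0`, `t_wait_for=100` and an illustrative `delta=1`, the early variant retries at 90 and the late variant at 111, so both fall outside the window.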
Existing logs to watch:
| Log line | What it tells you |
|---|---|
| `error "invalid ticket in register message"` | Ticket validation failure |
| `cd_register_requests{status="Rejected"}` | Hard rejection rate |
| `cd_register_requests{status="Wait"}` | Re-issuance rate (grinding attempt) |
| `debug "waiting for registrar" wait=Xs` | Wait times being issued |
Missing logs — need to add:
- No log distinguishes why a ticket was invalid: add a reason field, `warn "ticket rejected" reason="early retry|late retry|signature mismatch|ad mismatch"`; currently all failures collapse into `error "invalid ticket in register message"` with no reason
- No log when `updateLowerBounds` increases the bound (i.e. when the grinding defence fires): add `debug "lower bound enforced"` with `serviceId`, `prevBound`, `newBound` so it is observable
- No metric tracking wait time values issued: add a histogram `cd_wait_time_issued_secs` so you can see whether wait times are growing for grinding advertisers
What to check:
- Early and late retry should produce `cd_register_requests{status="Rejected"}` immediately, with no Wait ticket re-issued
- Dropped-ticket variant: wait times should not decrease across retries; each fresh attempt recalculates from current cache state
- After sustained grinding, the `cd_wait_time_issued_secs` histogram for the grinding advertisers should show increasing values (lower bound enforcement working)
Scenario 6: Same-IP Sybil Flooding
Goal: Verify that the IP tree correctly penalises subnet-concentrated advertisers while admitting diverse ones.
Network setup:
- 100 registrars
- 400 advertisers → all bind to addresses in 192.168.1.0/24
- 100 advertisers → each binds to a unique IP in 10.0.0.0/8
- All advertise the same service continuously.
How to run:
Use loopback aliases or a network namespace to give the 400 Sybil advertisers addresses in the same /24. Run for at least 2 × advertExpiry.
Existing logs to watch:
| Log line | What it tells you |
|---|---|
| `debug "waiting for registrar" wait=Xs` | Compare wait values between /24 and unique-IP groups |
| `debug "advert accepted"` | Admission rate per group |
| `cd_iptree_unique_ips` gauge | IP tree growth over time |
| `cd_registrar_ads_expired` counter | Cleanup rate |
| `cd_registrar_cache_ads` gauge | Cache composition |
Missing logs — need to add:
- No log records the IP similarity score at the time of admission or Wait issuance: add `debug "waiting time calculated"` with fields `ipSim`, `serviceSim`, `occupancy`, `tWait`, `peerId` in `waitingTime`; this is the most important missing observable for this scenario
- No log when an IP is inserted into or removed from the IP tree: add `debug "ip tree insert"` / `debug "ip tree remove"` with the IP (or at least the /24 prefix) so tree growth and cleanup are traceable
- `cd_iptree_unique_ips` only tracks total unique IPs: add a `cd_iptree_depth_score` histogram or similar to track score distribution
What to check:
- `debug "waiting for registrar" wait=Xs` for /24 advertisers should be significantly higher than for unique-IP advertisers
- `cd_registrar_cache_ads` breakdown: unique-IP advertisers should have proportionally more entries than their share of total advertisers
- `cd_iptree_unique_ips` should stabilise after `advertExpiry` as expired ads remove their IPs; if it only grows, `removeAd` is not being called on expiry
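The wait-time comparison between the two groups can be sketched as below. It assumes each wait sample has been joined with the advertiser's IP into `(ip_string, wait_secs)` pairs, which the current logs do not provide directly; the group labels and function name are hypothetical.

```python
import ipaddress
from statistics import mean

# The Sybil subnet from the network setup above.
SYBIL_NET = ipaddress.ip_network("192.168.1.0/24")


def mean_wait_by_group(samples):
    """Average issued wait per group: advertisers inside the Sybil /24
    versus everyone else. A group with no samples maps to None."""
    groups = {"sybil": [], "diverse": []}
    for ip, wait_secs in samples:
        in_sybil = ipaddress.ip_address(ip) in SYBIL_NET
        groups["sybil" if in_sybil else "diverse"].append(wait_secs)
    return {name: (mean(vals) if vals else None)
            for name, vals in groups.items()}
```

The scenario passes this check when the `sybil` mean is clearly above the `diverse` mean.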
Scenario 7: Popular Service Hotspot Attack
Goal: Verify that a dominant popular service does not starve rare services out of the cache.
Network setup:
- 150 registrars
- 700 advertisers → advertise “/logos/popular/1.0.0” aggressively
- 20 advertisers → advertise “/logos/rare-A/1.0.0” and “/logos/rare-B/1.0.0”
- 50 discoverers → look up both rare services
How to run:
- Start registrars, let them stabilise.
- Start all 700 popular-service advertisers simultaneously.
- After 30s, start rare-service advertisers.
- After another 30s, start discoverers. Measure lookup latency for rare vs popular.
Existing logs to watch:
| Log line | What it tells you |
|---|---|
| `debug "advert accepted"` with `serviceId` | Admission rate per service |
| `debug "waiting for registrar" wait=Xs` | Compare wait times across services |
| `cd_registrar_cache_services` gauge | Number of distinct services in cache |
| `cd_registrar_cache_ads` gauge | Total cache occupancy |
| `cd_lookup_peers_found` counter | Discovery success for rare service |
Missing logs — need to add:
- No per-service cache count is logged or exposed as a metric: `cd_registrar_cache_services` counts distinct services but not ads per service; add a `cd_registrar_cache_ads_per_service` gauge with a `serviceId` label so hotspot dominance is directly visible
- The `serviceSim` term in `waitingTime` captures per-service pressure but is never logged: include it in the proposed `debug "waiting time calculated"` log from Scenario 6
What to check:
- Rare service ads should still appear in cache (`debug "advert accepted"` for rare serviceId) even while popular service dominates
- `service_similarity` (`c_s / C`) for popular service should be high, causing its own advertisers to wait longer; self-regulating
- Rare service lookup latency should not be significantly worse than in a no-popular-service baseline
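A worked example of the `c_s / C` ratio, to show the scale of the gap to expect. This sketch assumes `c_s` is the number of cached ads for service `s` on one registrar and `C` is that registrar's cache capacity; the text above gives only the ratio, so both readings, the function name, and the per-service counts are illustrative.

```python
def service_similarity(ads_per_service, capacity):
    """c_s / C for each service, with c_s = cached ads for that service
    and C = cache capacity (both interpretations are assumptions)."""
    return {service: count / capacity
            for service, count in ads_per_service.items()}


# Hypothetical cache snapshot on one registrar with capacity 1000.
snapshot = {"/logos/popular/1.0.0": 700, "/logos/rare-A/1.0.0": 20}
scores = service_similarity(snapshot, capacity=1000)
```

With these numbers the popular service scores 0.7 against 0.02 for the rare one, which is the gap the proposed `serviceSim` log field should make visible.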
Scenario 8: Advertisement Expiry and Churn Chaos
Goal: Verify that expired ads are cleaned up and not returned to discoverers, and that the system recovers after registrar and advertiser churn.
Network setup:
- 100 registrars
- 100 advertisers
- 100 discoverers
How to run:
- Start everything. Let ads stabilise (wait for `debug "advert accepted"` across most advertisers).
- At `t=E/2` (450s): kill 70 advertisers abruptly (SIGKILL).
- At `t=E` (900s): restart remaining 30 advertisers with new peer IDs and new IPs.
- At `t=E+30s`: kill 30 registrars. Restart them 60s later.
- Continue running discoverers throughout and record what they find.
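The timeline above can be derived from `advertExpiry` so the run still lines up if the expiry is changed. A minimal sketch; the function name is hypothetical and the action strings are labels for a test driver, not real commands.

```python
def churn_schedule(advert_expiry_s=900):
    """Return (offset_seconds, action) pairs for the churn steps,
    with E = advert_expiry_s."""
    e = advert_expiry_s
    return [
        (e // 2, "kill 70 advertisers (SIGKILL)"),
        (e, "restart remaining 30 advertisers with new peer IDs and IPs"),
        (e + 30, "kill 30 registrars"),
        (e + 30 + 60, "restart the 30 killed registrars"),
    ]
```

With the default 900s expiry this yields offsets 450, 900, 930 and 990 seconds from the start of the run.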
Existing logs to watch:
| Log line | What it tells you |
|---|---|
| `debug "pruned expired adverts" count=N` | Cleanup running correctly |
| `cd_registrar_ads_expired` counter | Total expired ad count |
| `cd_registrar_cache_ads` gauge | Cache should shrink after kills |
| `error "failed to register ad" error="dialing peer failed"` | Dead advertiser dials |
| `debug "adverts found" count=N` | Whether stale ads are still being returned |
Missing logs — need to add:
- `debug "pruned expired adverts"` only logs a count: add a `serviceId` breakdown so you can see which service’s ads are expiring
- No log when `getAdvertisements` returns an ad that belongs to a peer that is now unreachable: the cache returns all stored ads regardless; this requires a separate staleness check, but at minimum add `debug "returning N ads for service"` with `serviceId` and `count` in `getAdvertisements` so stale return rate is visible
- No log distinguishing a registrar restarting from a fresh registrar: add `info "registrar cache restored"` with cache size on startup if persistence is added, or `info "registrar starting with empty cache"` if not
What to check:
- After `t=E`, `cd_registrar_cache_ads` should drop significantly: dead advertisers’ ads have expired
- `debug "adverts found" count>0` after `t=E` should only return ads from alive advertisers; if dead-peer ads are still returned, expiry is not working
- After registrar restart, the `cd_registrar_cache_ads` gauge should show 0 (cache is lost on restart, expected); discoverers should reroute to surviving registrars
Scenario 9: Oversized / Corrupted Advertisement Attack
Goal: Verify that malicious advertisements are rejected at validation without causing crashes, memory spikes, or silent acceptance.
Network setup:
- 50 honest registrars
- 50 malicious advertisers → one attack variant each (see below)
Attack variants — one per advertiser group:
| Variant | How to produce |
|---|---|
| Random bytes | Send raw random bytes as register.advertisement |
| Valid protobuf, invalid signature | Build valid XPR structure but sign with wrong key |
| Valid signature, missing service | XPR with correct signature but no ServiceInfo entries |
| Oversized advertisement | XPR > 1024 bytes (pad ServiceInfo.data to overflow) |
| Oversized `ServiceInfo.data` | `ServiceInfo.data` > 33 bytes, otherwise valid |
| Invalid multiaddrs | Valid XPR with malformed multiaddress bytes |
| Empty advertisement | Send register.advertisement = [] |
Existing logs to watch:
| Log line | What it tells you |
|---|---|
| `error "advertisement exceeds maximum encoded XPR size"` | Oversized ad caught |
| `error "invalid advertisement received"` | Decode failure |
| `error "advertisement does not advertise the requested service"` | Missing service caught |
| `error "advertisement violates XPR or ServiceInfo size limits"` | Size constraint caught |
| `cd_register_requests{status="Rejected"}` | Rejection rate |
Missing logs — need to add:
- No log records which validation step failed with enough detail: the four error messages above cover different cases but all fire at the same location with similar text; add a `reason` field to each: `error "invalid advertisement" reason="decode_failed|size_exceeded|service_missing|xpr_limits"`
- No memory metric is tracked around validation: if a huge malformed protobuf causes a large allocation before rejection, it won’t be visible; consider adding a `cd_advertisement_bytes_rejected` counter with a size histogram
What to check:
- Every malicious variant must produce a `cd_register_requests{status="Rejected"}` increment; none should result in `Confirmed`
- `cd_registrar_cache_ads` must not increase during the attack
- Process memory (external monitoring) should not grow; oversized ads are rejected before allocation where possible
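The first check can be automated per variant. A minimal sketch, assuming the harness records the REGISTER response statuses each attack variant received as a list of strings; that collection format and the function name are hypothetical, while the status names are the `cd_register_requests` label values used above.

```python
def failing_variants(outcomes):
    """outcomes maps variant name -> list of response statuses.
    A variant fails if it got any non-Rejected status, or no responses
    at all (a rejection was never observed)."""
    failing = {}
    for variant, statuses in outcomes.items():
        unexpected = [s for s in statuses if s != "Rejected"]
        if unexpected or not statuses:
            failing[variant] = unexpected
    return failing
```

An empty dictionary means every variant was rejected on every attempt.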
Scenario 10: Eclipse Attack Near Service Hash
Goal: Verify that malicious registrars clustered near the target service hash cannot suppress honest advertisements.
Network setup:
- 200 honest registrars → distributed uniformly
- 60 malicious registrars → peer IDs chosen so XOR(peerId, service_hash) is minimal (close bucket)
- 4 honest advertisers
- 30 discoverers
Malicious registrar behaviours:
- Return empty `getAds.advertisements` always
- Return fake advertisements (valid structure, attacker-controlled peer IDs)
- Return only other malicious nodes in `closerPeers`
- Return `REJECTED` to all REGISTER requests from honest advertisers
How to run:
- Pre-generate 60 peer keys such that `SHA256(peerId)` is close to `SHA256(service_id)` (brute-force leading bits).
- Start malicious registrars with those keys.
- Start honest infrastructure.
- Run discoverers for at least 5 lookup cycles.
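The brute-force step can be sketched as follows. Random byte strings stand in for peer IDs here; a real run would generate a keypair per attempt and hash the derived peer ID instead. The function names and the `min_bits` parameter are hypothetical.

```python
import hashlib
import os


def shared_prefix_bits(a: bytes, b: bytes) -> int:
    """Number of leading bits two equal-length digests have in common."""
    diff = int.from_bytes(a, "big") ^ int.from_bytes(b, "big")
    return len(a) * 8 - diff.bit_length()


def grind_ids(service_id: bytes, count: int, min_bits: int, id_len: int = 32):
    """Collect `count` candidate IDs whose SHA-256 shares at least
    `min_bits` leading bits with SHA-256(service_id)."""
    target = hashlib.sha256(service_id).digest()
    found = []
    while len(found) < count:
        candidate = os.urandom(id_len)  # stand-in for a fresh peer ID
        digest = hashlib.sha256(candidate).digest()
        if shared_prefix_bits(digest, target) >= min_bits:
            found.append(candidate)
    return found
```

Each extra bit of required prefix doubles the expected number of attempts (about `2**min_bits` per ID), so `min_bits` should be chosen just large enough to land the keys in the closest populated bucket.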
Existing logs to watch:
| Log line | What it tells you |
|---|---|
| `debug "adverts found" count=N` | N=0 from malicious registrars |
| `debug "advert accepted"` | Honest ads still getting through to honest registrars |
| `cd_lookup_peers_found` | End-to-end success rate |
| `error "failed to register ad"` | Honest advertisers rejected by malicious registrars |
Missing logs — need to add:
- No log records which bucket a queried registrar came from during lookup: add `debug "querying registrar" bucket=N` in `collectBucketAds` so you can see whether discoverers are stuck in the high-numbered buckets (close to service hash) where malicious nodes are concentrated
- No log for when `validAds()` filters out invalid ads from a GET_ADS response (currently done silently): add `debug "filtered invalid ads" count=N registrar=X` so the fake-ad rejection rate is visible
Scenario 11: Concurrent Advertise + Lookup + Churn Race Test
Goal: Detect concurrency bugs, async races, routing-table corruption, and cleanup issues under sustained load.
Network setup:
- 150 registrars
- 200 advertisers
- 200 discoverers
How to run:
Continuously for 4+ hours:
- Every 30s: kill and restart 10 random advertisers with new peer IDs
- Every 60s: rotate 5 random registrars (kill + restart)
- Every 20s: each discoverer calls `lookup()` concurrently
- Every 45s: add a new service and start advertising it; remove an old service
Existing logs to watch:
| Log line | What it tells you |
|---|---|
| `error "no service routing table found"` | Table missing during active advertisement; race |
| `error "failed to register ad"` | Network churn causing failures |
| `debug "pruned expired adverts"` | Cleanup running during churn |
| `cd_advertiser_pending_actions` gauge | Should not grow unboundedly |
| `cd_service_tables_count` gauge | Should track add/remove correctly |
Missing logs — need to add:
- No log when a service routing table is created or destroyed during churn: add `debug "service table created"` / `debug "service table removed"` with `serviceId` and `role` (Provided/Interest) so table lifecycle is traceable
- No log when an async task is cancelled unexpectedly (e.g. `advertiseToRegistrar` cancelled mid-wait): add `debug "advertise task cancelled"` in the cancellation path
- No log when `maintainAdvertiser` or `maintainRegistrar` loops restart after a crash or table change: add `debug "maintenance loop iteration"` with a timestamp and current state summary
What to check:
- `cd_advertiser_pending_actions` should stay bounded; if it grows monotonically, tasks are leaking
- `cd_service_tables_count` should track add/remove; if it only grows, `removeService` is not being called
- No Nim runtime errors (`Error: unhandled exception`, `SIGSEGV`) in logs; concurrent map access issues would surface here
- Memory (external) should be flat after initial ramp-up; growing memory indicates leaked futures or routing table entries
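"Bounded" and "flat" are easier to judge from a trend than by eye over a 4-hour run. A minimal sketch that fits a least-squares slope to scraped gauge or memory samples after a warm-up period; the `(timestamp_s, value)` sample format, the function name, and any pass threshold are assumptions to be tuned per metric.

```python
def growth_per_hour(samples, warmup_s=0):
    """Least-squares slope of value over time, in units per hour,
    ignoring samples taken before warmup_s."""
    points = [(t, v) for t, v in samples if t >= warmup_s]
    if len(points) < 2:
        return 0.0
    n = len(points)
    mean_t = sum(t for t, _ in points) / n
    mean_v = sum(v for _, v in points) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in points)
    if var_t == 0:
        return 0.0
    cov = sum((t - mean_t) * (v - mean_v) for t, v in points)
    return cov / var_t * 3600
```

A slope near zero after ramp-up is consistent with bounded queues and no leak; a steadily positive slope on `cd_advertiser_pending_actions` or process memory is the signal to investigate.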