Service Discovery Network Simulation

Scenarios:

Scenario 1: Rare Service Discovery in a Large Noisy Network

Goal: Verify that service discovery finds rare services faster than random walking and that discoverers converge toward the right registrars rather than wandering.

Network setup:

  • 250 registrars → server mode, no services, no xprPublishing
  • 150 kad-only peers → plain KadDHT (not ServiceDiscovery) — act as routing noise
  • 80 advertisers → advertise a common service e.g. “/logos/popular/1.0.0”
  • 3 advertisers → advertise the rare service “/logos/rare/1.0.0”
  • 40 discoverers → call lookup() for “/logos/rare/1.0.0” only

How to run:

  1. Start registrars first. Wait for routing tables to stabilise (~30s).

  2. Start 80 popular-service advertisers. Let them get confirmed (~60s).

  3. Start 3 rare-service advertisers. Record the exact timestamp.

  4. Start 40 discoverers. Each calls lookup("/logos/rare/1.0.0") in a loop every 10s.

Existing logs to watch:

| Log line | What it tells you |
| --- | --- |
| debug "getting adverts" | Each GET_ADS sent (count per discoverer) |
| debug "adverts found" count=N | N=0 means empty response from that registrar |
| debug "advert accepted" | Rare service ad admitted at registrar |
| cd_lookup_peers_found metric | Total peers found per lookup call |
| cd_lookup_requests metric | Total lookups initiated |
| cd_registrar_cache_ads gauge | Cache size per registrar over time |
| cd_service_table_peers gauge | Routing table growth toward service hash |

Missing logs — need to add:

  • debug "lookup complete" with fields: serviceId, peersFound, registrarsContacted, bucketsTraversed, durationMs — currently no single log captures when a lookup finishes and what it took

  • debug "empty response from registrar" — currently adverts found count=0 exists but has no explicit distinguishing label; add a dedicated log so it can be grepped separately

  • debug "routing table state" at lookup start — log how many peers are in each bucket of DiscT(service_id_hash) before the first GET_ADS goes out; currently no snapshot log exists

What to check:

  • Time from rare ad admitted (advert accepted) to first discoverer finding it (adverts found count>0)

  • Ratio of adverts found count=0 to total getting adverts — should decrease over time as routing tables converge

  • cd_service_table_peers should grow steadily for discoverers; if it stays flat, routing table is not being updated from closerPeers

  • Whether the same registrar peer ID appears repeatedly in getting adverts — indicates discoverers are not advancing through buckets
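
The first two checks above can be computed offline from collected logs. The following is a minimal sketch, not the project's tooling. It assumes a hypothetical line shape of `<ISO timestamp> <LEVEL> <message> key=value ...`, and it assumes that "advert accepted" carries the service ID somewhere on the line; adjust the patterns to whatever the real log output looks like.

```python
import re
from datetime import datetime

# Assumed (hypothetical) line shape: "<ISO timestamp> <LEVEL> <message> key=value ..."
LINE = re.compile(r"^(?P<ts>\S+)\s+\S+\s+(?P<rest>.*)$")
COUNT = re.compile(r"\bcount=(\d+)")

def scenario1_summary(lines, service="/logos/rare/1.0.0"):
    """Return (seconds from first rare 'advert accepted' to the first
    non-empty 'adverts found', ratio of empty responses to GET_ADS sent)."""
    first_accept = first_hit = None
    gets = empties = 0
    for line in lines:
        m = LINE.match(line.strip())
        if not m:
            continue
        try:
            ts = datetime.fromisoformat(m.group("ts"))
        except ValueError:
            continue  # not a log line in the assumed shape
        rest = m.group("rest")
        if rest.startswith("advert accepted") and service in rest:
            if first_accept is None:
                first_accept = ts
        elif rest.startswith("getting adverts"):
            gets += 1
        elif rest.startswith("adverts found"):
            c = COUNT.search(rest)
            if c is None:
                continue
            if int(c.group(1)) == 0:
                empties += 1
            elif first_hit is None and first_accept is not None:
                first_hit = ts
    latency = None
    if first_accept is not None and first_hit is not None:
        latency = (first_hit - first_accept).total_seconds()
    ratio = empties / gets if gets else None
    return latency, ratio
```

Running it over successive time windows of the merged log shows whether the empty-response ratio is actually falling as routing tables converge.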

Scenario 2: REGISTER Storm / Retry Explosion

Goal: Verify that 300 advertisers hitting 40 registrars simultaneously do not cause retry loops, unbounded queues, or broken ticket state.

Network setup:

  • 40 registrars → server mode, advertCacheCap = 1000 (default)
  • 300 advertisers → all advertise the same service, all start at t=0

How to run:

  1. Start 40 registrars. Wait for routing tables to stabilise.

  2. Start all 300 advertisers simultaneously (use a barrier or coordinated start timestamp).

  3. Run for at least 3 × advertExpiry (default = 3 × 900s = 45 min) to see full retry cycles.

Existing logs to watch:

| Log line | What it tells you |
| --- | --- |
| debug "registering advert" | First attempt per advertiser per registrar |
| debug "waiting for registrar" wait=Xs | Wait time issued (look for outliers) |
| debug "advert accepted" | Successful admission |
| debug "registrar rejection, aborting" | Hard rejection (should be rare) |
| error "no ticket to retry with" | Registrar issued Wait but no ticket (protocol bug) |
| cd_register_requests{status="Wait"} | Rate of Wait responses |
| cd_register_requests{status="Confirmed"} | Rate of admissions |
| cd_registrar_cache_ads gauge | Cache fill rate over time |
| cd_advertiser_pending_actions gauge | Queue depth on advertisers |

Missing logs — need to add:

  • debug "waiting for registrar" currently logs wait as a string (waitSecs) — change to a numeric field so it can be aggregated; also add retryCount field to track how many times this advertiser has retried this registrar

  • No log when an advertiser exhausts advertExpiry and loops back for re-registration — add debug "re-registering after expiry" to distinguish fresh registration from retry

  • cd_register_requests label for "Wait" counts total Waits but not per-registrar — add registrar label or a separate histogram for wait duration values

What to check:

  • cd_advertiser_pending_actions should grow initially then stabilise — if it grows unboundedly, advertisers are not respecting wait times

  • debug "waiting for registrar" entries should have wait > 0; wait=0 followed immediately by another registering advert for the same registrar is a retry-without-waiting bug

  • cd_registrar_cache_ads should approach but not exceed advertCacheCap
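
The retry-without-waiting check can be automated over a single advertiser's log. This is a rough sketch under assumptions: it presumes each relevant line carries a `registrar=<id>` field and that the wait appears as `wait=<number>` with an optional trailing `s`. Neither field layout is confirmed by the current log output, so treat the patterns as placeholders.

```python
import re

def field(line, name):
    """Extract a key=value field from a log line, or None if absent."""
    m = re.search(rf"\b{name}=(\S+)", line)
    return m.group(1) if m else None

def retry_without_wait(lines):
    """Return registrar IDs for which a wait=0 was followed by a new
    'registering advert' to the same registrar."""
    suspects = []
    zero_wait = set()  # registrars whose most recent Wait was 0
    for line in lines:
        registrar = field(line, "registrar")
        if "waiting for registrar" in line:
            wait = field(line, "wait")
            if wait is not None and float(wait.rstrip("s")) == 0:
                zero_wait.add(registrar)
            else:
                zero_wait.discard(registrar)
        elif "registering advert" in line and registrar in zero_wait:
            suspects.append(registrar)
            zero_wait.discard(registrar)
    return suspects
```

An empty result across all 300 advertiser logs is the expected outcome; any registrar ID in the list is worth inspecting by hand.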

Scenario 3: Client mode

Goal: Verify client-mode nodes can discover but cannot advertise or act as registrars, and that violations fail loudly.

Network setup:

  • 80 server-mode registrars
  • 40 server-mode advertisers → advertise “/logos/test/1.0.0”
  • 30 server-mode discoverers
  • 100 client-mode nodes → client = true in ServiceDiscovery.new()

How to run:

  1. Start registrars and advertisers. Wait for ads to be confirmed.

  2. Start 100 client-mode nodes. Each calls lookup("/logos/test/1.0.0").

  3. Pick 20 client-mode nodes and call addProvidedService() on them — expect failure.

  4. Pick another 20 client-mode nodes and attempt to send a REGISTER request directly (craft a raw message or call advertiseToRegistrar) — expect failure.

Existing behaviour:

  • addProvidedService and advertiseToRegistrar both have doAssert not disco.clientMode, "not supported in client mode". In Nim, doAssert stays enabled in release builds as well, so a violation raises an AssertionDefect in every build and will normally take the node down. That is not clean failure.

Missing logs — need to add (and fix):

  • Replace doAssert not disco.clientMode with a proper if disco.clientMode: warn "operation not supported in client mode", op="addProvidedService"; return — so it fails visibly in logs without crashing

  • Add debug "client mode lookup success" with peersFound field so it is verifiable in logs that client nodes are succeeding at discovery

  • No log exists when a client-mode node receives an inbound REGISTER request — the handler will process it silently because registration() does not check clientMode. This is a missing guard — a client-mode node acting as registrar should reject inbound REGISTER messages and log a warning

What to check:

  • debug "adverts found" count>0 appears for client nodes — discovery works

  • warn "operation not supported in client mode" appears when advertise is attempted — clean rejection

  • No debug "advert accepted" or cd_registrar_cache_ads increment on client nodes — they are not storing ads

  • No debug "registering advert" on client nodes — they are not sending REGISTER
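
The four checks above reduce to a pass over each client-mode node's log. A minimal sketch follows, assuming the proposed warn line has been added and that `count=N` follows the "adverts found" message on the same line; the result keys are made up for illustration.

```python
import re

# Messages that must never appear in a client-mode node's log.
FORBIDDEN = ("advert accepted", "registering advert")
FOUND = re.compile(r"adverts found.*\bcount=[1-9]\d*")

def check_client_log(lines):
    """Summarise whether one client-mode node behaved as expected."""
    return {
        # at least one lookup returned ads
        "discovery_ok": any(FOUND.search(l) for l in lines),
        # the proposed warning fired instead of a crash
        "clean_rejection": any(
            "operation not supported in client mode" in l for l in lines
        ),
        # lines showing the node stored ads or sent REGISTER
        "violations": [l for l in lines if any(m in l for m in FORBIDDEN)],
    }
```

Only the nodes selected in steps 3 and 4 are expected to show `clean_rejection`; `violations` should be empty for all 100.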

Scenario 4: Malicious Registrars Poisoning Routing Tables

Goal: Verify that bad closerPeers responses from malicious registrars do not permanently corrupt service routing tables or prevent discovery.

Network setup:

  • 120 honest registrars
  • 50 malicious registrars → return crafted closerPeers (see below)
  • 30 advertisers
  • 40 discoverers

Malicious registrar behaviours to implement (one variant per run):

  • Return only their own peerId in closerPeers (self-loop)

  • Return 16 random garbage peer IDs not in the network

  • Return unreachable peers (valid IDs, no working addresses)

  • Return peers far from service_id_hash (bucket 0 peers for a service in bucket 15)

  • Return empty closerPeers list
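
For a simulation harness, the five variants can be modelled as one function that picks the closerPeers list to return. This is an illustrative sketch only: peer IDs are represented as hex strings, the variant names are invented here, and the lists of unreachable and far peers are assumed to be prepared by the harness.

```python
import random

def malicious_closer_peers(variant, self_id, unreachable=(), far=(), rng=None, k=16):
    """Return the closerPeers list a malicious registrar would send.

    variant     -- one of: self_loop, garbage, unreachable, far, empty
    unreachable -- valid IDs with no working addresses (prepared by the harness)
    far         -- IDs far from service_id_hash (prepared by the harness)
    """
    rng = rng or random.Random()
    if variant == "self_loop":
        return [self_id]
    if variant == "garbage":
        # k random 32-byte IDs that do not belong to any node in the network
        return [rng.getrandbits(256).to_bytes(32, "big").hex() for _ in range(k)]
    if variant == "unreachable":
        return list(unreachable)[:k]
    if variant == "far":
        return list(far)[:k]
    if variant == "empty":
        return []
    raise ValueError(f"unknown variant: {variant}")
```

Passing a seeded `random.Random` makes the garbage variant reproducible between the baseline and attack runs.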

How to run:
Each variant is a separate run. Run for at least 2 full lookup cycles. Compare discovery success rate against a clean baseline run.

Existing logs to watch:

| Log line | What it tells you |
| --- | --- |
| error "failed to register ad" error="dialing peer failed" | Unreachable peer in routing table |
| debug "adverts found" count=N | Discovery progress |
| cd_service_table_insertions | How many poisoned peers were inserted |
| cd_service_table_peers gauge | Routing table size over time |

Missing logs — need to add:

  • No log when closerPeers are inserted into the routing table — add debug "inserting closer peers" with count and serviceId so poisoning rate can be measured

  • No log when a dial fails during dispatchGetAds — the error is returned as a string but not logged at the call site; add warn "GET_ADS dial failed" with registrar and error fields

  • No metric for failed dials specifically — cd_messages_sent exists but there is no cd_dial_failures counter; add one

What to check:

  • cd_service_table_peers should not grow unboundedly when malicious nodes return junk — indicates no dedup or cap on peer insertion

  • error "dialing peer failed" rate should correlate with the number of malicious nodes

  • Honest discoverers should still find ads within 2–3 lookup cycles even with 50 malicious registrars — if they never find ads, the routing table is fully poisoned

Scenario 5: Ticket Grinding Attack

Goal: Verify that malicious advertisers cannot reduce their wait time by manipulating ticket fields or retry timing.

Network setup:

  • 30 registrars
  • 200 malicious advertisers → one behaviour variant each (see below)

Malicious behaviours — implement as separate advertiser modes:

| Variant | What it does |
| --- | --- |
| Early retry | Retries at t_mod + t_wait_for - 10s (before window opens) |
| Late retry | Retries at t_mod + t_wait_for + δ + 10s (after window closes) |
| Modified t_wait_for | Reduces t_wait_for in ticket before presenting it |
| Reused old ticket | Presents a ticket from a previous registration attempt |
| Cross-registrar ticket | Presents a ticket signed by registrar A to registrar B |
| Modified ad | Slightly changes the advertisement bytes while reusing the ticket |
| Dropped ticket | Ignores the ticket and retries fresh every time |
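
The early and late variants imply a retry window that opens at t_mod + t_wait_for and closes δ later. The sketch below computes that window and the two attack timestamps so the malicious advertiser modes can be scheduled consistently. Treating both window edges as inclusive is an assumption; the registrar's actual comparison may differ.

```python
def retry_window(t_mod, t_wait_for, delta):
    """Return (opens, closes) of the retry window, in seconds."""
    opens = t_mod + t_wait_for
    return opens, opens + delta

def classify_retry(t_retry, t_mod, t_wait_for, delta):
    """Classify a retry time relative to the window (edges assumed inclusive)."""
    opens, closes = retry_window(t_mod, t_wait_for, delta)
    if t_retry < opens:
        return "early"
    if t_retry > closes:
        return "late"
    return "in_window"

def attack_times(t_mod, t_wait_for, delta, margin=10):
    """Retry times for the early and late variants from the table above."""
    opens, closes = retry_window(t_mod, t_wait_for, delta)
    return {"early": opens - margin, "late": closes + margin}
```

For example, with t_mod=1000, t_wait_for=60 and δ=5 the window is 1060 to 1065, so the early variant retries at 1050 and the late variant at 1075.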

Existing logs to watch:

| Log line | What it tells you |
| --- | --- |
| error "invalid ticket in register message" | Ticket validation failure |
| cd_register_requests{status="Rejected"} | Hard rejection rate |
| cd_register_requests{status="Wait"} | Re-issuance rate (grinding attempt) |
| debug "waiting for registrar" wait=Xs | Wait times being issued |

Missing logs — need to add:

  • No log distinguishes why a ticket was invalid — add a reason field: warn "ticket rejected" reason="early retry|late retry|signature mismatch|ad mismatch" — currently all failures collapse into error "invalid ticket in register message" with no reason

  • No log when updateLowerBounds increases the bound (i.e. when the grinding defence fires) — add debug "lower bound enforced" with serviceId, prevBound, newBound so it is observable

  • No metric tracking wait time values issued — add a histogram cd_wait_time_issued_secs so you can see whether wait times are growing for grinding advertisers

What to check:

  • Early and late retry should produce cd_register_requests{status="Rejected"} immediately — no Wait ticket re-issued

  • Dropped-ticket variant: wait times should not decrease across retries — each fresh attempt recalculates from current cache state

  • After sustained grinding, cd_wait_time_issued_secs histogram for the grinding advertisers should show increasing values (lower bound enforcement working)

Scenario 6: Same-IP Sybil Flooding

Goal: Verify that the IP tree correctly penalises subnet-concentrated advertisers while admitting diverse ones.

Network setup:

  • 100 registrars
  • 400 advertisers → all bind to addresses in 192.168.1.0/24
  • 100 advertisers → each binds to a unique IP in 10.0.0.0/8
  • All advertise the same service continuously.

How to run:
Use loopback aliases or a network namespace to give the 400 Sybil advertisers addresses in the same /24. Run for at least 2 × advertExpiry.
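
The address plan for both groups can be generated up front. One thing to account for: a /24 has only 254 usable host addresses, so 400 Sybil advertisers cannot each get their own IP. This sketch assumes they share addresses round-robin and differ by port, and it spreads the diverse group across distinct /16 blocks of 10.0.0.0/8. The port numbers and function names are placeholders.

```python
import ipaddress
from itertools import cycle, islice

def sybil_addresses(n=400, subnet="192.168.1.0/24", base_port=4000):
    """n (ip, port) pairs inside one subnet; IPs repeat once hosts run out."""
    hosts = list(ipaddress.ip_network(subnet).hosts())
    return [(str(h), base_port + i) for i, h in enumerate(islice(cycle(hosts), n))]

def diverse_addresses(n=100, network="10.0.0.0/8", port=4000):
    """One address per distinct /16 inside the network (at most 256 for a /8)."""
    subnets = ipaddress.ip_network(network).subnets(new_prefix=16)
    return [(str(next(iter(s.hosts()))), port) for s in islice(subnets, n)]
```

The resulting lists can feed whatever mechanism creates the loopback aliases or namespace interfaces.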

Existing logs to watch:

| Log line | What it tells you |
| --- | --- |
| debug "waiting for registrar" wait=Xs | Compare wait values between /24 and unique-IP groups |
| debug "advert accepted" | Admission rate per group |
| cd_iptree_unique_ips gauge | IP tree growth over time |
| cd_registrar_ads_expired counter | Cleanup rate |
| cd_registrar_cache_ads gauge | Cache composition |

Missing logs — need to add:

  • No log records the IP similarity score at the time of admission or Wait issuance — add debug "waiting time calculated" with fields ipSim, serviceSim, occupancy, tWait, peerId in waitingTime — this is the most important missing observable for this scenario

  • No log when an IP is inserted into or removed from the IP tree — add debug "ip tree insert" / debug "ip tree remove" with the IP (or at least the /24 prefix) so tree growth and cleanup are traceable

  • cd_iptree_unique_ips only tracks total unique IPs — add cd_iptree_depth_score histogram or similar to track score distribution

What to check:

  • debug "waiting for registrar" wait=Xs for /24 advertisers should be significantly higher than for unique-IP advertisers

  • cd_registrar_cache_ads breakdown: unique-IP advertisers should have proportionally more entries than their share of total advertisers

  • cd_iptree_unique_ips should stabilise after advertExpiry as expired ads remove their IPs — if it only grows, removeAd is not being called on expiry

Scenario 7: Popular Service Hotspot Attack

Goal: Verify that a dominant popular service does not starve rare services out of the cache.

Network setup:

  • 150 registrars
  • 700 advertisers → advertise “/logos/popular/1.0.0” aggressively
  • 20 advertisers → advertise “/logos/rare-A/1.0.0” and “/logos/rare-B/1.0.0”
  • 50 discoverers → look up both rare services

How to run:

  1. Start registrars, let them stabilise.

  2. Start all 700 popular-service advertisers simultaneously.

  3. After 30s, start rare-service advertisers.

  4. After another 30s, start discoverers. Measure lookup latency for rare vs popular.

Existing logs to watch:

| Log line | What it tells you |
| --- | --- |
| debug "advert accepted" with serviceId | Admission rate per service |
| debug "waiting for registrar" wait=Xs | Compare wait times across services |
| cd_registrar_cache_services gauge | Number of distinct services in cache |
| cd_registrar_cache_ads gauge | Total cache occupancy |
| cd_lookup_peers_found counter | Discovery success for rare service |

Missing logs — need to add:

  • No per-service cache count is logged or exposed as a metric — cd_registrar_cache_services counts distinct services but not ads per service; add cd_registrar_cache_ads_per_service gauge with serviceId label so hotspot dominance is directly visible

  • The serviceSim term in waitingTime captures per-service pressure but is never logged — include it in the proposed debug "waiting time calculated" log from Scenario 6

What to check:

  • Rare service ads should still appear in cache (debug "advert accepted" for rare serviceId) even while popular service dominates

  • service_similarity (c_s / C) for popular service should be high, causing its own advertisers to wait longer — self-regulating

  • Rare service lookup latency should not be significantly worse than in a no-popular-service baseline
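
The c_s / C term can be reproduced from a cache snapshot to sanity-check the self-regulation claim. This sketch assumes c_s is the number of cached ads for service s and C is the cache capacity (advertCacheCap); if the implementation normalises by something else, pass that value as `capacity` instead.

```python
from collections import Counter

def service_similarity(cached_service_ids, capacity):
    """Map each service ID to c_s / C for one registrar's cache snapshot.

    cached_service_ids -- one entry per cached ad, naming its service
    capacity           -- C, assumed here to be the cache capacity
    """
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    counts = Counter(cached_service_ids)
    return {service: n / capacity for service, n in counts.items()}
```

With 700 popular ads and 20 rare ads in a cache of capacity 1000, the popular service scores 0.7 and the rare one 0.02, which is the gap that should translate into longer waits for popular-service advertisers.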

Scenario 8: Advertisement Expiry and Churn Chaos

Goal: Verify that expired ads are cleaned up and not returned to discoverers, and that the system recovers after registrar and advertiser churn.

Network setup:

  • 100 registrars
  • 100 advertisers
  • 100 discoverers

How to run:

  1. Start everything. Let ads stabilise (wait for debug "advert accepted" across most advertisers).

  2. At t=E/2 (450s): kill 70 advertisers abruptly (SIGKILL).

  3. At t=E (900s): restart remaining 30 advertisers with new peer IDs and new IPs.

  4. At t=E+30s: kill 30 registrars. Restart them 60s later.

  5. Continue running discoverers throughout and record what they find.
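
The churn steps above can be expressed as a timeline derived from advertExpiry, which keeps the offsets correct if the expiry is changed for a shorter test run. A small sketch; the action strings are labels for the harness, not commands.

```python
def churn_timeline(expiry_s=900):
    """Return (seconds since start, action) pairs for steps 2 to 4."""
    return [
        (expiry_s // 2, "SIGKILL 70 advertisers"),
        (expiry_s, "restart remaining 30 advertisers with new peer IDs and IPs"),
        (expiry_s + 30, "kill 30 registrars"),
        (expiry_s + 30 + 60, "restart the 30 registrars"),
    ]
```

With the default expiry of 900s the events land at 450s, 900s, 930s and 990s.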

Existing logs to watch:

| Log line | What it tells you |
| --- | --- |
| debug "pruned expired adverts" count=N | Cleanup running correctly |
| cd_registrar_ads_expired counter | Total expired ad count |
| cd_registrar_cache_ads gauge | Cache should shrink after kills |
| error "failed to register ad" error="dialing peer failed" | Dead advertiser dials |
| debug "adverts found" count=N | Whether stale ads are still being returned |

Missing logs — need to add:

  • debug "pruned expired adverts" only logs a count — add serviceId breakdown so you can see which service’s ads are expiring

  • No log when getAdvertisements returns an ad that belongs to a peer that is now unreachable — the cache returns all stored ads regardless; this requires a separate staleness check but at minimum add debug "returning N ads for service" with serviceId and count in getAdvertisements so stale return rate is visible

  • No log distinguishing a registrar restarting from a fresh registrar — add info "registrar cache restored" with cache size on startup if persistence is added, or info "registrar starting with empty cache" if not

What to check:

  • After t=E, cd_registrar_cache_ads should drop significantly — dead advertisers’ ads have expired

  • debug "adverts found" count>0 after t=E should only return ads from alive advertisers — if dead-peer ads are still returned, expiry is not working

  • After registrar restart, cd_registrar_cache_ads gauge should show 0 (cache is lost on restart, expected) — discoverers should reroute to surviving registrars

Scenario 9: Oversized / Corrupted Advertisement Attack

Goal: Verify that malicious advertisements are rejected at validation without causing crashes, memory spikes, or silent acceptance.

Network setup:

  • 50 honest registrars
  • 50 malicious advertisers → one attack variant each (see below)

Attack variants — one per advertiser group:

| Variant | How to produce |
| --- | --- |
| Random bytes | Send raw random bytes as register.advertisement |
| Valid protobuf, invalid signature | Build valid XPR structure but sign with wrong key |
| Valid signature, missing service | XPR with correct signature but no ServiceInfo entries |
| Oversized advertisement | XPR > 1024 bytes (pad ServiceInfo.data to overflow) |
| Oversized ServiceInfo.data | ServiceInfo.data > 33 bytes, otherwise valid |
| Invalid multiaddrs | Valid XPR with malformed multiaddress bytes |
| Empty advertisement | Send register.advertisement = [] |
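
Three of these variants need no XPR encoder and can be produced as raw byte strings. The sketch below covers only those, using the 1024-byte and 33-byte limits from the table; the signed and multiaddr variants need the real encoding library and are left out. Function names are illustrative.

```python
import os

MAX_XPR_BYTES = 1024           # limit taken from the table above
MAX_SERVICE_DATA_BYTES = 33    # limit taken from the table above

def random_bytes_payload(n=256):
    """Raw random bytes to send as register.advertisement."""
    return os.urandom(n)

def empty_payload():
    """Empty register.advertisement."""
    return b""

def oversized_service_data(extra=1):
    """ServiceInfo.data value that exceeds the limit by `extra` bytes."""
    return b"\x00" * (MAX_SERVICE_DATA_BYTES + extra)

def padding_to_overflow_xpr(current_len):
    """Padding bytes that push an XPR of current_len past the size limit.

    The encoded result may end up slightly larger than 1025 bytes because
    the enclosing field's length prefix can grow as well.
    """
    return b"\x00" * max(0, MAX_XPR_BYTES - current_len + 1)
```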

Existing logs to watch:

| Log line | What it tells you |
| --- | --- |
| error "advertisement exceeds maximum encoded XPR size" | Oversized ad caught |
| error "invalid advertisement received" | Decode failure |
| error "advertisement does not advertise the requested service" | Missing service caught |
| error "advertisement violates XPR or ServiceInfo size limits" | Size constraint caught |
| cd_register_requests{status="Rejected"} | Rejection rate |

Missing logs — need to add:

  • No log records which validation step failed with enough detail — the four error messages above cover different cases but all fire at the same location with similar text; add a reason field to each: error "invalid advertisement" reason="decode_failed|size_exceeded|service_missing|xpr_limits"

  • No memory metric is tracked around validation — if a huge malformed protobuf causes a large allocation before rejection, it won’t be visible; consider adding a cd_advertisement_bytes_rejected counter with a size histogram

What to check:

  • Every malicious variant must produce a cd_register_requests{status="Rejected"} increment — none should result in Confirmed

  • cd_registrar_cache_ads must not increase during the attack

  • Process memory (external monitoring) should not grow — oversized ads are rejected before allocation where possible

Scenario 10: Eclipse Attack Near Service Hash

Goal: Verify that malicious registrars clustered near the target service hash cannot suppress honest advertisements.

Network setup:

  • 200 honest registrars → distributed uniformly
  • 60 malicious registrars → peer IDs chosen so XOR(peerId, service_hash) is minimal (close bucket)
  • 4 honest advertisers
  • 30 discoverers

Malicious registrar behaviours:

  • Return empty getAds.advertisements always

  • Return fake advertisements (valid structure, attacker-controlled peer IDs)

  • Return only other malicious nodes in closerPeers

  • Return REJECTED to all REGISTER requests from honest advertisers

How to run:

  1. Pre-generate 60 peer keys such that SHA256(peerId) is close to SHA256(service_id) (brute-force leading bits).

  2. Start malicious registrars with those keys.

  3. Start honest infrastructure.

  4. Run discoverers for at least 5 lookup cycles.
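
Step 1 is a brute-force search on shared leading bits. The sketch below shows the idea with random 32-byte strings standing in for peer IDs; real peer IDs are derived from generated key pairs, so the candidate generation would be replaced by key generation in an actual run. The service ID in the usage note is a placeholder.

```python
import hashlib
import random

def common_prefix_bits(a: bytes, b: bytes) -> int:
    """Number of leading bits two equal-length byte strings share."""
    x = int.from_bytes(a, "big") ^ int.from_bytes(b, "big")
    return len(a) * 8 - x.bit_length()

def grind_ids(service_id: str, want: int, min_bits: int, seed=0):
    """Find `want` candidate IDs whose SHA-256 shares at least `min_bits`
    leading bits with SHA-256(service_id)."""
    rng = random.Random(seed)
    target = hashlib.sha256(service_id.encode()).digest()
    found = []
    while len(found) < want:
        # Stand-in for a real peer ID derived from a fresh key pair.
        candidate = rng.getrandbits(256).to_bytes(32, "big")
        digest = hashlib.sha256(candidate).digest()
        if common_prefix_bits(digest, target) >= min_bits:
            found.append(candidate)
    return found
```

Each extra matching bit doubles the expected work: `grind_ids("/logos/example/1.0.0", want=60, min_bits=8)` needs roughly 60 × 256 hashes, which is cheap, while 24 bits costs about 16 million candidates per ID.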

Existing logs to watch:

| Log line | What it tells you |
| --- | --- |
| debug "adverts found" count=N | N=0 from malicious registrars |
| debug "advert accepted" | Honest ads still getting through to honest registrars |
| cd_lookup_peers_found | End-to-end success rate |
| error "failed to register ad" | Honest advertisers rejected by malicious registrars |

Missing logs — need to add:

  • No log records which bucket a queried registrar came from during lookup — add debug "querying registrar" bucket=N in collectBucketAds so you can see whether discoverers are stuck in the high-numbered buckets (close to service hash) where malicious nodes are concentrated

  • No log for when validAds() filters out invalid ads from a GET_ADS response — currently done silently; add debug "filtered invalid ads" count=N registrar=X so the fake-ad rejection rate is visible

Scenario 11: Concurrent Advertise + Lookup + Churn Race Test

Goal: Detect concurrency bugs, async races, routing-table corruption, and cleanup issues under sustained load.

Network setup:

  • 150 registrars
  • 200 advertisers
  • 200 discoverers

How to run:
Continuously for 4+ hours:

  • Every 30s: kill and restart 10 random advertisers with new peer IDs

  • Every 60s: rotate 5 random registrars (kill + restart)

  • Every 20s: each discoverer calls lookup() concurrently

  • Every 45s: add a new service and start advertising it; remove an old service
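
The four periodic actions can be merged into a single ordered event list that a driver script walks through. A minimal sketch; the action names are made up for the harness, and events that fall on the same second are ordered alphabetically by name.

```python
# Interval in seconds for each periodic action listed above.
INTERVALS = {
    "restart_advertisers": 30,  # kill and restart 10 random advertisers
    "rotate_registrars": 60,    # kill and restart 5 random registrars
    "lookup": 20,               # every discoverer calls lookup()
    "swap_service": 45,         # add a new service, remove an old one
}

def schedule(duration_s, intervals=INTERVALS):
    """Return sorted (second, action) events up to and including duration_s."""
    events = []
    for name, every in intervals.items():
        events.extend((t, name) for t in range(every, duration_s + 1, every))
    return sorted(events)
```

For a 4-hour run, `schedule(4 * 3600)` yields the full plan; logging each event as it fires makes it possible to line up errors with the churn action that preceded them.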

Existing logs to watch:

| Log line | What it tells you |
| --- | --- |
| error "no service routing table found" | Table missing during active advertisement (race) |
| error "failed to register ad" | Network churn causing failures |
| debug "pruned expired adverts" | Cleanup running during churn |
| cd_advertiser_pending_actions gauge | Should not grow unboundedly |
| cd_service_tables_count gauge | Should track add/remove correctly |

Missing logs — need to add:

  • No log when a service routing table is created or destroyed during churn — add debug "service table created" / debug "service table removed" with serviceId and role (Provided/Interest) so table lifecycle is traceable

  • No log when an async task is cancelled unexpectedly (e.g. advertiseToRegistrar cancelled mid-wait) — add debug "advertise task cancelled" in the cancellation path

  • No log when maintainAdvertiser or maintainRegistrar loops restart after a crash or table change — add debug "maintenance loop iteration" with a timestamp and current state summary

What to check:

  • cd_advertiser_pending_actions should stay bounded — if it grows monotonically, tasks are leaking

  • cd_service_tables_count should track add/remove — if it only grows, removeService is not being called

  • No Nim runtime errors (Error: unhandled exception, SIGSEGV) in logs — concurrent map access issues would surface here

  • Memory (external) should be flat after initial ramp-up — growing memory indicates leaked futures or routing table entries
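
The "grows monotonically" and "only grows" conditions can be checked mechanically on scraped gauge samples. This is a deliberately crude sketch: it flags a series that never decreases after a warm-up period and ends higher than it started. A real run may want a slope or trend test instead, since one small dip hides a leak from this check.

```python
def grows_monotonically(samples, warmup=0):
    """True if, after skipping `warmup` samples, the series never
    decreases and finishes above where it started."""
    tail = samples[warmup:]
    if len(tail) < 2:
        return False
    never_drops = all(b >= a for a, b in zip(tail, tail[1:]))
    return never_drops and tail[-1] > tail[0]
```

Applied to cd_advertiser_pending_actions, cd_service_tables_count, and the external memory series, a `True` result over a multi-hour window is the signal to investigate.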
