itsgoin/docs/TODO-hardening.md
Scott Reimers bb6f2b64b0 v0.4.1: Security hardening, lock contention fixes, data cleanup
Security:
- Reaction signatures: ed25519 sign/verify (sign_reaction, verify_reaction_signature)
  Backward-compatible — unsigned reactions from old nodes still accepted
- Comment signature verification: verify_comment_signature now called on receipt
- Reaction removal authorization: only reactor or post author can remove
- BlobHeader author verification: lookup actual author from storage, don't trust payload

Lock contention (4 fixes):
- ManifestPush discovery: cm lock released before PostFetch I/O
- Pull request handler: load under lock, filter without lock, brief re-lock for is_deleted
- Pull sender: split into two brief locks (store posts, then batch upstream+sync)
- Engagement checker: batch all chunk results, single lock for writes

Data cleanup:
- Post deletion cleans post_downstream, post_upstream, seen_engagement tables
- Added TODO-hardening.md documenting remaining DOS/security/lock/data issues

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-21 19:30:38 -04:00

9.8 KiB

DOS Hardening TODO

Identified during v0.4.0 audit (2026-03-21). Implement before v0.4.1.

CRITICAL — Lock Contention (v4-introduced)

L1. ManifestPush discovery holds cm lock during network I/O (connection.rs:4847-4981)

  • Spawned task grabs cm lock, calls send_post_fetch (QUIC I/O), waits for response — all while locked
  • Every connection operation queues behind it (5s+ freeze possible)
  • Fix: Gather connection handle before locking cm. PostFetch outside lock. Brief re-acquire for DB writes.

L2. Pull request handler holds lock during filtering (connection.rs:1855-1905)

  • Loads ALL posts, loops through checking visibility + timestamps while holding storage lock
  • 50K posts = 500ms+ lock hold
  • Fix: Load posts under lock (brief), release, filter without lock (CPU only), re-acquire briefly for is_deleted() on filtered subset.

L3. Pull sender's second lock too long (connection.rs:1650-1721, 1572-1624)

  • After receiving posts: store + add_upstream (count query each) + update_last_sync — all under one lock
  • 100 posts = 100 inserts + 100 count queries + 20 author updates
  • Fix: Split into two brief locks. First: bulk store posts. Second: batch upstream adds + last_sync updates. Collect unique authors during first lock.

L4. Per-post engagement lock acquisitions (connection.rs:1777-1833)

  • Lock acquired/released 100 times in tight loop (once per post)
  • Each acquisition blocks behind other tasks
  • Fix: Batch writes. Collect all engagement results, acquire lock once, write all. Network I/O already outside lock.

L5. Stale follows query (node.rs:2528-2530) — LOW

  • get_stale_follows every 60s, brief query, acceptable
  • Fix (optional): Add index on follows(last_sync_ms) if missing

HIGH Priority — DOS

1. Stream handler cap (connection.rs:4252, 4270)

  • Max 10 concurrent workers per connection via Semaphore
  • Excess streams wait, not spawned unbounded

2. Slot index memory bomb (connection.rs:5754-5792)

  • Soft 1K slot limit per post
  • Author can sign capacity increase that propagates via BlobHeaderDiff
  • Without author signature, cap stays at 1K
  • Consider thread-split pattern for overflow (already exists for 16KB comments)

3. ManifestPush amplification (connection.rs:4877-4936)

  • Custom ManifestPush for new posts: only deliver [new_post_id, previous_post_id]
  • Each CDN partner updates their own local manifest copy
  • Same diff pattern useful for N+10 updates
  • Lower bandwidth, low-priority background task

MEDIUM Priority

4. Post list pagination (connection.rs:1857)

  • Limit to 200 posts per pull response
  • ~100KB memory, <5ms lock hold
  • Next sync cycle catches remainder via since_ms timestamps

5. Eviction candidate cap (storage.rs:3678-3737)

  • Limit to 100 candidates per batch
  • ~40KB memory, <5ms lock hold
  • Next 5-min cycle catches more if needed

6. Payload element abuse — CDN consensus check

  • Before accepting a large engagement update, check 1-2 CDN neighbors
  • "Does your header for this post look like this?" If not → reject
  • Attacker must compromise multiple CDN nodes to pass
  • No trust scoring needed — just peer corroboration

7. Lock acquisition timeouts (connection.rs: throughout)

  • 5-second timeout on storage lock acquisition
  • On timeout: skip operation, try next cycle
  • Log: operation name, wait duration, who holds the lock
  • Add last_lock_holder: AtomicU64 storing hash of acquiring function name

8. Discovery task cap (connection.rs:4938-4980)

  • One discovery task per peer at a time
  • AtomicBool flag per connection, skip if already running

LOW Priority

9. Engagement rate limiting

  • Self-claimed: max 3 emoji + 1 comment per 10 seconds
  • Chain-propagated: CDN consensus check from #6 applies
  • Process only first 100 ops per BlobHeaderDiff message

10. Mesh stream spawn cap (connection.rs)

  • Same as #1 — 10 max concurrent handlers per connection
  • Supplements auth rate limiter (which handles connection-level, not stream-level)

11. Retry backoff per target

  • Start at 5 seconds, triple on each failure
  • 5s → 15s → 45s → 135s → 405s → 1215s → 3645s → 10935s → 14400s (cap at 4hr)
  • 8 failures to hit max backoff
  • Reset to 5s on success
  • Track per target peer, not global

Security Hardening TODO

Identified during v0.4.0 security audit (2026-03-21).

CRITICAL — Immediate (before next public release)

S1. Comment signature verification — ONE LINE FIX (connection.rs:5711-5724)

  • verify_comment_signature() exists in crypto.rs but is NEVER called on receipt
  • Add if !crypto::verify_comment_signature(...) { continue; } before store_comment()
  • Infrastructure exists, just not wired up

S2. Reaction removal auth check — TWO LINE FIX (connection.rs:5708-5709)

  • RemoveReaction accepts from any sender, no auth
  • Add: if *reactor == sender || sender == payload.author { ... }
  • Same pattern already used in EditComment/DeleteComment

S3. Reaction signature — ~30 lines (types.rs, crypto.rs, connection.rs)

  • Reaction has no signature field — anyone can fake reactions from any NodeId
  • Add signature: Vec<u8> to Reaction struct (#[serde(default)] for compat)
  • Sign (reactor + post_id + emoji + timestamp) with reactor's ed25519 key
  • Verify in handle_blob_header_diff before storing
  • Follow existing sign_comment / verify_comment_signature pattern

S4. BlobHeader author verification — ~5 lines (connection.rs:5821-5836)

  • Header rebuild uses payload.author without checking against stored post author
  • Look up actual author from storage.get_post(&payload.post_id)
  • Use stored author, not payload-claimed author

HIGH — Short-term

S5. PostId verification in all paths (connection.rs)

  • PostPush verifies with verify_post_id() but some pull paths don't
  • Audit all store_post_with_visibility call sites
  • Ensure verify_post_id() called before each store

S6. Slot write protection — self-healing signature system (connection.rs:5749-5803)

  • Problem: any peer can overwrite encrypted slots with garbage
  • Solution (two layers):
    1. CDN tree membership check: only accept slot writes from peers in post_downstream or post_upstream for that post. Rejects random peers.
    2. Self-healing signatures: participants sign their own slot writes with the slot key (derived from CEK) and keep a local copy. On diff check, if their slot was overwritten with something they didn't sign, they re-write their signed version. Other participants verify signatures — keep the signed version, discard unsigned garbage. The legitimate version propagates through the CDN tree. Attacker must keep overwriting forever; the real version keeps coming back from every CDN node that received it.
  • Relay nodes can't verify signatures (don't have CEK) but pass through all writes — participants do client-side verification on decrypt

S7. Comment edit/delete cryptographic proof (connection.rs:5726-5736)

  • Currently "trust-based" — checks sender == author at transport layer
  • QUIC connection IS authenticated (iroh ed25519), so sender identity is verified
  • Risk: compromised relay node
  • Fix: require new signature over edited content (editor proves they hold private key)
  • For post-author deletes: require post author's signature over delete request

S8. Pull sync follow list privacy (connection.rs:1846-1915)

  • PullSyncRequest sends entire follow list unencrypted to every sync peer
  • Every mesh peer learns your complete social graph
  • Options:
    • Accept and document (mesh peers are semi-trusted infrastructure) — RECOMMENDED for now
    • Bloom filter: probabilistic set, leaks less, some irrelevant posts received (acceptable bandwidth cost)
    • Long-term: oblivious transfer / PIR (heavy crypto, probably not worth it for social network)

MEDIUM — Design review

S9. Nonce reuse guard (crypto.rs:54-56)

  • ChaCha20-Poly1305 catastrophic on nonce reuse
  • RNG is reliable on modern OS (getrandom syscall)
  • Add sanity check: if nonce is all zeros after generation, panic rather than encrypt
  • One-line guard

S10. Slot timing metadata leakage (connection.rs:5757, 5776)

  • header.updated_at changes on slot writes, leaking WHEN engagement occurs on private posts
  • Passive observer can correlate timestamps with known user behavior
  • Fix: round updated_at to 10-minute buckets for private posts
  • Or batch slot writes on fixed schedule rather than immediately

S11. Per-author engagement rate limiting (connection.rs:5699-5725)

  • A peer can send 10,000 fake reactions in one BlobHeaderDiff
  • Cap ops per message (100 max per DOS hardening #9)
  • Deduplicate by (reactor, post_id, emoji) — storage already does ON CONFLICT DO UPDATE
  • Combined with reaction signatures (S3), fake NodeId reactions become impossible

LOW


Data Cleanup TODO

D1. post_downstream not cleaned on post delete (storage.rs delete_post)

  • When a post is deleted, downstream registrations stay forever
  • Fix: add DELETE FROM post_downstream WHERE post_id = ?1 in delete_post()
  • Also add: DELETE FROM post_upstream WHERE post_id = ?1
  • Also add: DELETE FROM seen_engagement WHERE post_id = ?1
  • One-line fixes each

D2. Document BlobHeader-table relationship (storage.rs store_blob_header)

  • Header JSON is a snapshot, reactions/comments tables are authoritative
  • They can temporarily diverge (BlobHeaderResponse arrives with newer header than tables)
  • Header rebuilt from tables on next engagement op
  • Add clarifying comment to store_blob_header

Low Priority

S12. Hex parse error logging (web.rs:110-119)

  • Malformed hex strings silently return 404
  • Add debug logging for malformed inputs

S13. Edit comment signature consistency (storage.rs:4338-4343)

  • edit_comment updates content without updating signature
  • If signature verification (S1) is enabled, edited comments would have invalid signatures
  • Fix: add signature parameter to edit_comment, re-sign edited content