v0.4.1: Security hardening, lock contention fixes, data cleanup

Security: - Reaction signatures: ed25519 sign/verify (sign_reaction, verify_reaction_signature) Backward-compatible — unsigned reactions from old nodes still accepted - Comment signature verification: verify_comment_signature now called on receipt - Reaction removal authorization: only reactor or post author can remove - BlobHeader author verification: lookup actual author from storage, don't trust payload Lock contention (4 fixes): - ManifestPush discovery: cm lock released before PostFetch I/O - Pull request handler: load under lock, filter without lock, brief re-lock for is_deleted - Pull sender: split into two brief locks (store posts, then batch upstream+sync) - Engagement checker: batch all chunk results, single lock for writes Data cleanup: - Post deletion cleans post_downstream, post_upstream, seen_engagement tables - Added TODO-hardening.md documenting remaining DOS/security/lock/data issues Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-21 19:30:38 -04:00 · 2026-03-21 19:30:38 -04:00 · bb6f2b64b0
commit bb6f2b64b0
parent bbaacf9b6c
11 changed files with 500 additions and 138 deletions
--- a/docs/TODO-hardening.md
+++ b/docs/TODO-hardening.md
@ -0,0 +1,204 @@
+# DOS Hardening TODO
+
+Identified during v0.4.0 audit (2026-03-21). Implement before v0.4.1.
+
+## CRITICAL — Lock Contention (v4-introduced)
+
+### L1. ManifestPush discovery holds cm lock during network I/O (connection.rs:4847-4981)
+- Spawned task grabs cm lock, calls send_post_fetch (QUIC I/O), waits for response — all while locked
+- Every connection operation queues behind it (5s+ freeze possible)
+- **Fix:** Gather connection handle before locking cm. PostFetch outside lock. Brief re-acquire for DB writes.
+
+### L2. Pull request handler holds lock during filtering (connection.rs:1855-1905)
+- Loads ALL posts, loops through checking visibility + timestamps while holding storage lock
+- 50K posts = 500ms+ lock hold
+- **Fix:** Load posts under lock (brief), release, filter without lock (CPU only), re-acquire briefly for is_deleted() on filtered subset.
+
+### L3. Pull sender's second lock too long (connection.rs:1650-1721, 1572-1624)
+- After receiving posts: store + add_upstream (count query each) + update_last_sync — all under one lock
+- 100 posts = 100 inserts + 100 count queries + 20 author updates
+- **Fix:** Split into two brief locks. First: bulk store posts. Second: batch upstream adds + last_sync updates. Collect unique authors during first lock.
+
+### L4. Per-post engagement lock acquisitions (connection.rs:1777-1833)
+- Lock acquired/released 100 times in tight loop (once per post)
+- Each acquisition blocks behind other tasks
+- **Fix:** Batch writes. Collect all engagement results, acquire lock once, write all. Network I/O already outside lock.
+
+### L5. Stale follows query (node.rs:2528-2530) — LOW
+- get_stale_follows every 60s, brief query, acceptable
+- **Fix (optional):** Add index on follows(last_sync_ms) if missing
+
+## HIGH Priority — DOS
+
+### 1. Stream handler cap (connection.rs:4252, 4270)
+- Max 10 concurrent workers per connection via Semaphore
+- Excess streams wait, not spawned unbounded
+
+### 2. Slot index memory bomb (connection.rs:5754-5792)
+- Soft 1K slot limit per post
+- Author can sign capacity increase that propagates via BlobHeaderDiff
+- Without author signature, cap stays at 1K
+- Consider thread-split pattern for overflow (already exists for 16KB comments)
+
+### 3. ManifestPush amplification (connection.rs:4877-4936)
+- Custom ManifestPush for new posts: only deliver [new_post_id, previous_post_id]
+- Each CDN partner updates their own local manifest copy
+- Same diff pattern useful for N+10 updates
+- Lower bandwidth, low-priority background task
+
+## MEDIUM Priority
+
+### 4. Post list pagination (connection.rs:1857)
+- Limit to 200 posts per pull response
+- ~100KB memory, <5ms lock hold
+- Next sync cycle catches remainder via since_ms timestamps
+
+### 5. Eviction candidate cap (storage.rs:3678-3737)
+- Limit to 100 candidates per batch
+- ~40KB memory, <5ms lock hold
+- Next 5-min cycle catches more if needed
+
+### 6. Payload element abuse — CDN consensus check
+- Before accepting a large engagement update, check 1-2 CDN neighbors
+- "Does your header for this post look like this?" If not → reject
+- Attacker must compromise multiple CDN nodes to pass
+- No trust scoring needed — just peer corroboration
+
+### 7. Lock acquisition timeouts (connection.rs: throughout)
+- 5-second timeout on storage lock acquisition
+- On timeout: skip operation, try next cycle
+- Log: operation name, wait duration, who holds the lock
+- Add `last_lock_holder: AtomicU64` storing hash of acquiring function name
+
+### 8. Discovery task cap (connection.rs:4938-4980)
+- One discovery task per peer at a time
+- AtomicBool flag per connection, skip if already running
+
+## LOW Priority
+
+### 9. Engagement rate limiting
+- Self-claimed: max 3 emoji + 1 comment per 10 seconds
+- Chain-propagated: CDN consensus check from #6 applies
+- Process only first 100 ops per BlobHeaderDiff message
+
+### 10. Mesh stream spawn cap (connection.rs)
+- Same as #1 — 10 max concurrent handlers per connection
+- Supplements auth rate limiter (which handles connection-level, not stream-level)
+
+### 11. Retry backoff per target
+- Start at 5 seconds, triple on each failure
+- 5s → 15s → 45s → 135s → 405s → 1215s → 3645s → 10935s → 14400s (cap at 4hr)
+- 8 failures to hit max backoff
+- Reset to 5s on success
+- Track per target peer, not global
+
+---
+
+# Security Hardening TODO
+
+Identified during v0.4.0 security audit (2026-03-21).
+
+## CRITICAL — Immediate (before next public release)
+
+### S1. Comment signature verification — ONE LINE FIX (connection.rs:5711-5724)
+- `verify_comment_signature()` exists in crypto.rs but is NEVER called on receipt
+- Add `if !crypto::verify_comment_signature(...) { continue; }` before `store_comment()`
+- Infrastructure exists, just not wired up
+
+### S2. Reaction removal auth check — TWO LINE FIX (connection.rs:5708-5709)
+- `RemoveReaction` accepts from any sender, no auth
+- Add: `if *reactor == sender || sender == payload.author { ... }`
+- Same pattern already used in EditComment/DeleteComment
+
+### S3. Reaction signature — ~30 lines (types.rs, crypto.rs, connection.rs)
+- `Reaction` has no signature field — anyone can fake reactions from any NodeId
+- Add `signature: Vec<u8>` to Reaction struct (#[serde(default)] for compat)
+- Sign `(reactor + post_id + emoji + timestamp)` with reactor's ed25519 key
+- Verify in handle_blob_header_diff before storing
+- Follow existing `sign_comment` / `verify_comment_signature` pattern
+
+### S4. BlobHeader author verification — ~5 lines (connection.rs:5821-5836)
+- Header rebuild uses `payload.author` without checking against stored post author
+- Look up actual author from `storage.get_post(&payload.post_id)`
+- Use stored author, not payload-claimed author
+
+## HIGH — Short-term
+
+### S5. PostId verification in all paths (connection.rs)
+- PostPush verifies with `verify_post_id()` but some pull paths don't
+- Audit all `store_post_with_visibility` call sites
+- Ensure `verify_post_id()` called before each store
+
+### S6. Slot write protection — self-healing signature system (connection.rs:5749-5803)
+- Problem: any peer can overwrite encrypted slots with garbage
+- Solution (two layers):
+  1. CDN tree membership check: only accept slot writes from peers in post_downstream or post_upstream for that post. Rejects random peers.
+  2. Self-healing signatures: participants sign their own slot writes with the slot key (derived from CEK) and keep a local copy. On diff check, if their slot was overwritten with something they didn't sign, they re-write their signed version. Other participants verify signatures — keep the signed version, discard unsigned garbage. The legitimate version propagates through the CDN tree. Attacker must keep overwriting forever; the real version keeps coming back from every CDN node that received it.
+- Relay nodes can't verify signatures (don't have CEK) but pass through all writes — participants do client-side verification on decrypt
+
+### S7. Comment edit/delete cryptographic proof (connection.rs:5726-5736)
+- Currently "trust-based" — checks sender == author at transport layer
+- QUIC connection IS authenticated (iroh ed25519), so sender identity is verified
+- Risk: compromised relay node
+- Fix: require new signature over edited content (editor proves they hold private key)
+- For post-author deletes: require post author's signature over delete request
+
+### S8. Pull sync follow list privacy (connection.rs:1846-1915)
+- PullSyncRequest sends entire follow list unencrypted to every sync peer
+- Every mesh peer learns your complete social graph
+- Options:
+  - Accept and document (mesh peers are semi-trusted infrastructure) — RECOMMENDED for now
+  - Bloom filter: probabilistic set, leaks less, some irrelevant posts received (acceptable bandwidth cost)
+  - Long-term: oblivious transfer / PIR (heavy crypto, probably not worth it for social network)
+
+## MEDIUM — Design review
+
+### S9. Nonce reuse guard (crypto.rs:54-56)
+- ChaCha20-Poly1305 catastrophic on nonce reuse
+- RNG is reliable on modern OS (getrandom syscall)
+- Add sanity check: if nonce is all zeros after generation, panic rather than encrypt
+- One-line guard
+
+### S10. Slot timing metadata leakage (connection.rs:5757, 5776)
+- `header.updated_at` changes on slot writes, leaking WHEN engagement occurs on private posts
+- Passive observer can correlate timestamps with known user behavior
+- Fix: round updated_at to 10-minute buckets for private posts
+- Or batch slot writes on fixed schedule rather than immediately
+
+### S11. Per-author engagement rate limiting (connection.rs:5699-5725)
+- A peer can send 10,000 fake reactions in one BlobHeaderDiff
+- Cap ops per message (100 max per DOS hardening #9)
+- Deduplicate by (reactor, post_id, emoji) — storage already does ON CONFLICT DO UPDATE
+- Combined with reaction signatures (S3), fake NodeId reactions become impossible
+
+## LOW
+
+---
+
+# Data Cleanup TODO
+
+### D1. post_downstream not cleaned on post delete (storage.rs delete_post)
+- When a post is deleted, downstream registrations stay forever
+- Fix: add `DELETE FROM post_downstream WHERE post_id = ?1` in delete_post()
+- Also add: `DELETE FROM post_upstream WHERE post_id = ?1`
+- Also add: `DELETE FROM seen_engagement WHERE post_id = ?1`
+- One-line fixes each
+
+### D2. Document BlobHeader-table relationship (storage.rs store_blob_header)
+- Header JSON is a snapshot, reactions/comments tables are authoritative
+- They can temporarily diverge (BlobHeaderResponse arrives with newer header than tables)
+- Header rebuilt from tables on next engagement op
+- Add clarifying comment to store_blob_header
+
+---
+
+# Low Priority
+
+### S12. Hex parse error logging (web.rs:110-119)
+- Malformed hex strings silently return 404
+- Add debug logging for malformed inputs
+
+### S13. Edit comment signature consistency (storage.rs:4338-4343)
+- edit_comment updates content without updating signature
+- If signature verification (S1) is enabled, edited comments would have invalid signatures
+- Fix: add signature parameter to edit_comment, re-sign edited content