itsgoin/docs/TODO-hardening.md

# DOS Hardening TODO

Identified during v0.4.0 audit (2026-03-21). Implement before v0.4.1.

## CRITICAL — Lock Contention (v4-introduced)

### L1. ManifestPush discovery holds cm lock during network I/O (connection.rs:4847-4981)
- Spawned task grabs cm lock, calls send_post_fetch (QUIC I/O), waits for response — all while locked
- Every connection operation queues behind it (5s+ freeze possible)
- **Fix:** Gather connection handle before locking cm. PostFetch outside lock. Brief re-acquire for DB writes.
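
A minimal sketch of the lock scoping described in the fix, using `std::sync::Mutex` for illustration. `ConnManager`, `ConnHandle`, `fetch_post`, and `discover_post` are hypothetical stand-ins, and the sketch assumes the connection handle is cheap to clone out of the manager:

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

// Hypothetical stand-ins for the real connection manager and QUIC handle.
#[derive(Clone)]
struct ConnHandle {
    peer: u64,
}

#[derive(Default)]
struct ConnManager {
    conns: HashMap<u64, ConnHandle>,
    stored_posts: Vec<String>,
}

// Placeholder for send_post_fetch; in the real code this is slow network I/O.
fn fetch_post(handle: &ConnHandle, post_id: &str) -> String {
    format!("{}@peer{}", post_id, handle.peer)
}

fn discover_post(cm: &Arc<Mutex<ConnManager>>, peer: u64, post_id: &str) -> bool {
    // 1. Brief lock: clone the handle; the guard is dropped at the end of the block.
    let handle = {
        let guard = cm.lock().unwrap();
        guard.conns.get(&peer).cloned()
    };
    let handle = match handle {
        Some(h) => h,
        None => return false,
    };
    // 2. Network I/O with no lock held.
    let post = fetch_post(&handle, post_id);
    // 3. Brief re-acquire for the write only.
    cm.lock().unwrap().stored_posts.push(post);
    true
}
```

The point of the shape is that no guard is alive across the `fetch_post` call, so other connection operations are blocked only for the two short critical sections.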

### L2. Pull request handler holds lock during filtering (connection.rs:1855-1905)
- Loads ALL posts, loops through checking visibility + timestamps while holding storage lock
- 50K posts = 500ms+ lock hold
- **Fix:** Load posts under lock (brief), release, filter without lock (CPU only), re-acquire briefly for is_deleted() on filtered subset.
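
The load / filter / re-check sequence could look roughly like this. `Post`, `Storage`, and `posts_for_pull` are simplified, hypothetical types, and visibility is reduced to a single boolean for the sake of the sketch:

```rust
use std::collections::HashSet;
use std::sync::Mutex;

// Hypothetical, simplified post and storage types.
#[derive(Clone, Debug, PartialEq)]
struct Post {
    id: u32,
    timestamp_ms: u64,
    public: bool,
}

struct Storage {
    posts: Vec<Post>,
    deleted: HashSet<u32>,
}

fn posts_for_pull(storage: &Mutex<Storage>, since_ms: u64) -> Vec<Post> {
    // Lock 1 (brief): snapshot the posts, then release.
    let snapshot: Vec<Post> = storage.lock().unwrap().posts.clone();

    // No lock held: CPU-only visibility and timestamp filtering.
    let candidates: Vec<Post> = snapshot
        .into_iter()
        .filter(|p| p.public && p.timestamp_ms > since_ms)
        .collect();

    // Lock 2 (brief): drop anything deleted in the meantime.
    let guard = storage.lock().unwrap();
    let result: Vec<Post> = candidates
        .into_iter()
        .filter(|p| !guard.deleted.contains(&p.id))
        .collect();
    result
}
```

Cloning the full post list is itself a cost; combined with the pagination cap in #4 the snapshot stays small.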

### L3. Pull sender's second lock held too long (connection.rs:1650-1721, 1572-1624)
- After receiving posts: store + add_upstream (count query each) + update_last_sync — all under one lock
- 100 posts = 100 inserts + 100 count queries + 20 author updates
- **Fix:** Split into two brief locks. First: bulk store posts. Second: batch upstream adds + last_sync updates. Collect unique authors during first lock.

### L4. Per-post engagement lock acquisitions (connection.rs:1777-1833)
- Lock acquired/released 100 times in tight loop (once per post)
- Each acquisition blocks behind other tasks
- **Fix:** Batch writes. Collect all engagement results, acquire lock once, write all. Network I/O already outside lock.
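
As an illustration of the batching idea, the sketch below collects all results first and then writes them under a single lock acquisition. `Engagement`, `fetch_engagement`, and the `HashMap` store are hypothetical placeholders for the real engagement types and storage:

```rust
use std::collections::HashMap;
use std::sync::Mutex;

// Hypothetical engagement result for one post.
struct Engagement {
    post_id: u32,
    reactions: u32,
}

// Placeholder for the per-post network fetch, which already runs without the lock.
fn fetch_engagement(post_id: u32) -> Engagement {
    Engagement { post_id, reactions: post_id * 2 }
}

fn sync_engagement(store: &Mutex<HashMap<u32, u32>>, post_ids: &[u32]) -> usize {
    // Collect every result first; no lock is held in this loop.
    let results: Vec<Engagement> = post_ids.iter().map(|&id| fetch_engagement(id)).collect();

    // One acquisition for the whole batch instead of one per post.
    let mut guard = store.lock().unwrap();
    for e in &results {
        guard.insert(e.post_id, e.reactions);
    }
    results.len()
}
```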

### L5. Stale follows query (node.rs:2528-2530) — LOW
- get_stale_follows every 60s, brief query, acceptable
- **Fix (optional):** Add index on follows(last_sync_ms) if missing

## HIGH Priority — DOS

### 1. Stream handler cap (connection.rs:4252, 4270)
- Max 10 concurrent workers per connection via Semaphore
- Excess streams wait, not spawned unbounded
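
To make the "wait, don't spawn" behaviour concrete, here is a small counting semaphore built only on `std` primitives. This is a swapped-in illustration: in an async codebase the runtime's own semaphore type would be the natural choice, and `Semaphore` / `Permit` here are hypothetical names:

```rust
use std::sync::{Condvar, Mutex};

/// Minimal counting semaphore built on std primitives.
struct Semaphore {
    permits: Mutex<usize>,
    cv: Condvar,
}

/// Holding a `Permit` means one worker slot is in use; dropping it frees the slot.
struct Permit<'a>(&'a Semaphore);

impl Semaphore {
    fn new(permits: usize) -> Self {
        Semaphore { permits: Mutex::new(permits), cv: Condvar::new() }
    }

    /// Blocks until a permit is free; excess streams wait here instead of spawning.
    fn acquire(&self) -> Permit<'_> {
        let mut free = self.permits.lock().unwrap();
        while *free == 0 {
            free = self.cv.wait(free).unwrap();
        }
        *free -= 1;
        Permit(self)
    }
}

impl Drop for Permit<'_> {
    fn drop(&mut self) {
        *self.0.permits.lock().unwrap() += 1;
        self.0.cv.notify_one();
    }
}
```

A per-connection `Semaphore::new(10)` would give the cap described above; each stream handler acquires a permit before doing work and releases it on exit, including on early return or panic.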

### 2. Slot index memory bomb (connection.rs:5754-5792)
- Soft 1K slot limit per post
- Author can sign capacity increase that propagates via BlobHeaderDiff
- Without author signature, cap stays at 1K
- Consider thread-split pattern for overflow (already exists for 16KB comments)

### 3. ManifestPush amplification (connection.rs:4877-4936)
- Custom ManifestPush for new posts: only deliver [new_post_id, previous_post_id]
- Each CDN partner updates their own local manifest copy
- Same diff pattern useful for N+10 updates
- Lower bandwidth, low-priority background task

## MEDIUM Priority

### 4. Post list pagination (connection.rs:1857)
- Limit to 200 posts per pull response
- ~100KB memory, <5ms lock hold
- Next sync cycle catches remainder via since_ms timestamps
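
A sketch of how the 200-post cap and the `since_ms` cursor might interact. `PostRef` and `page_since` are hypothetical. Note one caveat of a pure timestamp cursor: posts that share the exact timestamp of the last post in a page would be skipped on the next cycle, so the real implementation may need a tie-breaker:

```rust
// Hypothetical post summary; only the timestamp matters here.
#[derive(Clone, Debug, PartialEq)]
struct PostRef {
    id: u32,
    timestamp_ms: u64,
}

const MAX_POSTS_PER_PULL: usize = 200;

/// Returns at most `limit` posts newer than `since_ms`, oldest first, plus the
/// cursor the requester should send as `since_ms` on its next sync.
fn page_since(posts: &[PostRef], since_ms: u64, limit: usize) -> (Vec<PostRef>, u64) {
    let mut newer: Vec<PostRef> = posts
        .iter()
        .filter(|p| p.timestamp_ms > since_ms)
        .cloned()
        .collect();
    newer.sort_by_key(|p| p.timestamp_ms);
    newer.truncate(limit);
    // If nothing was returned, the cursor stays where it was.
    let next_since = newer.last().map_or(since_ms, |p| p.timestamp_ms);
    (newer, next_since)
}
```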

### 5. Eviction candidate cap (storage.rs:3678-3737)
- Limit to 100 candidates per batch
- ~40KB memory, <5ms lock hold
- Next 5-min cycle catches more if needed

### 6. Payload element abuse — CDN consensus check
- Before accepting a large engagement update, check 1-2 CDN neighbors
- "Does your header for this post look like this?" If not → reject
- Attacker must compromise multiple CDN nodes to pass
- No trust scoring needed — just peer corroboration

### 7. Lock acquisition timeouts (connection.rs: throughout)
- 5-second timeout on storage lock acquisition
- On timeout: skip operation, try next cycle
- Log: operation name, wait duration, who holds the lock
- Add `last_lock_holder: AtomicU64` storing hash of acquiring function name
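
One possible shape for the timeout plus holder tracking, sketched with `std::sync::Mutex` and a `try_lock` polling loop (std has no timed lock). If the storage lock is an async mutex, wrapping the lock future in the runtime's timeout helper would likely be simpler. `lock_with_timeout`, `name_hash`, and the static `LAST_LOCK_HOLDER` are hypothetical names:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::{Mutex, MutexGuard, TryLockError};
use std::thread;
use std::time::{Duration, Instant};

// Hash of the function name that most recently acquired the lock (0 = none yet).
static LAST_LOCK_HOLDER: AtomicU64 = AtomicU64::new(0);

fn name_hash(name: &str) -> u64 {
    let mut h = DefaultHasher::new();
    name.hash(&mut h);
    h.finish()
}

/// Polls `try_lock` until `timeout` elapses. Returns `None` on timeout so the
/// caller can skip the operation and retry on the next cycle.
fn lock_with_timeout<'a, T>(
    m: &'a Mutex<T>,
    caller: &str,
    timeout: Duration,
) -> Option<MutexGuard<'a, T>> {
    let start = Instant::now();
    loop {
        match m.try_lock() {
            Ok(guard) => {
                LAST_LOCK_HOLDER.store(name_hash(caller), Ordering::Relaxed);
                return Some(guard);
            }
            // A poisoned lock is still usable; recover the guard.
            Err(TryLockError::Poisoned(p)) => return Some(p.into_inner()),
            Err(TryLockError::WouldBlock) => {
                if start.elapsed() >= timeout {
                    eprintln!(
                        "lock timeout: op={} waited={:?} holder_hash={:#x}",
                        caller,
                        start.elapsed(),
                        LAST_LOCK_HOLDER.load(Ordering::Relaxed)
                    );
                    return None;
                }
                thread::sleep(Duration::from_millis(1));
            }
        }
    }
}
```

Because only a hash is stored, the log line needs a lookup table from hash to function name to be readable; that table is left out of the sketch.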

### 8. Discovery task cap (connection.rs:4938-4980)
- One discovery task per peer at a time
- AtomicBool flag per connection, skip if already running
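
The skip-if-running flag can be wrapped in a small RAII guard so it is always cleared, even if the discovery task returns early or panics. `DiscoveryGuard` is a hypothetical name; in a spawned task the flag would probably live behind an `Arc<AtomicBool>` on the connection:

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// RAII guard: clears the flag when the discovery task finishes or unwinds.
struct DiscoveryGuard<'a>(&'a AtomicBool);

impl<'a> DiscoveryGuard<'a> {
    /// Returns `None` if a discovery task is already running for this connection.
    fn try_start(flag: &'a AtomicBool) -> Option<Self> {
        flag.compare_exchange(false, true, Ordering::AcqRel, Ordering::Acquire)
            .ok()
            .map(|_| DiscoveryGuard(flag))
    }
}

impl Drop for DiscoveryGuard<'_> {
    fn drop(&mut self) {
        self.0.store(false, Ordering::Release);
    }
}
```

Using `compare_exchange` rather than a separate load and store avoids the race where two tasks both observe `false` and both start.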

## LOW Priority

### 9. Engagement rate limiting
- Self-claimed: max 3 emoji + 1 comment per 10 seconds
- Chain-propagated: CDN consensus check from #6 applies
- Process only first 100 ops per BlobHeaderDiff message
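
A sketch of the self-claimed limit, reading "per 10 seconds" as a sliding window (an assumption; a fixed window would also satisfy the bullet). `EngagementLimiter` and `OpKind` are hypothetical, and one limiter instance per author is assumed:

```rust
use std::collections::VecDeque;

const WINDOW_MS: u64 = 10_000;
const MAX_EMOJI: usize = 3;
const MAX_COMMENTS: usize = 1;

#[derive(Clone, Copy)]
enum OpKind {
    Emoji,
    Comment,
}

/// Sliding-window limiter for one author's self-claimed engagement.
#[derive(Default)]
struct EngagementLimiter {
    emoji: VecDeque<u64>,
    comments: VecDeque<u64>,
}

impl EngagementLimiter {
    /// Returns true and records the op if it fits within the window limits.
    fn allow(&mut self, kind: OpKind, now_ms: u64) -> bool {
        let (times, max) = match kind {
            OpKind::Emoji => (&mut self.emoji, MAX_EMOJI),
            OpKind::Comment => (&mut self.comments, MAX_COMMENTS),
        };
        // Forget ops that have aged out of the 10-second window.
        while times.front().map_or(false, |&t| now_ms.saturating_sub(t) >= WINDOW_MS) {
            times.pop_front();
        }
        if times.len() >= max {
            return false;
        }
        times.push_back(now_ms);
        true
    }
}
```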

### 10. Mesh stream spawn cap (connection.rs)
- Same as #1 — 10 max concurrent handlers per connection
- Supplements auth rate limiter (which handles connection-level, not stream-level)

### 11. Retry backoff per target
- Start at 5 seconds, triple on each failure
- 5s → 15s → 45s → 135s → 405s → 1215s → 3645s → 10935s → 14400s (cap at 4hr)
- 8 failures to hit max backoff
- Reset to 5s on success
- Track per target peer, not global
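
The schedule above works out to `min(5 * 3^n, 14400)` seconds after `n` consecutive failures. A minimal sketch, with `backoff_secs` and `RetryTracker` as hypothetical names and a `u64` standing in for the peer id:

```rust
use std::collections::HashMap;

const BASE_SECS: u64 = 5;
const CAP_SECS: u64 = 14_400; // 4 hours

/// Delay after `failures` consecutive failures since the last success:
/// 5 s with none, tripling per failure, capped at 4 h.
fn backoff_secs(failures: u32) -> u64 {
    3u64.checked_pow(failures)
        .and_then(|m| m.checked_mul(BASE_SECS))
        .map_or(CAP_SECS, |d| d.min(CAP_SECS)) // overflow also means "capped"
}

/// Per-target failure counters; nothing here is global.
#[derive(Default)]
struct RetryTracker {
    failures: HashMap<u64, u32>,
}

impl RetryTracker {
    /// Records a failure for `peer` and returns the next delay in seconds.
    fn on_failure(&mut self, peer: u64) -> u64 {
        let n = self.failures.entry(peer).or_insert(0);
        *n = n.saturating_add(1);
        backoff_secs(*n)
    }

    /// Resets `peer` to the base delay.
    fn on_success(&mut self, peer: u64) -> u64 {
        self.failures.remove(&peer);
        BASE_SECS
    }
}
```

The checked arithmetic matters only for very large failure counts, but it keeps the function total instead of panicking on overflow in debug builds.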

---

# Security Hardening TODO

Identified during v0.4.0 security audit (2026-03-21).

## CRITICAL — Immediate (before next public release)

### S1. Comment signature verification — ONE LINE FIX (connection.rs:5711-5724)
- `verify_comment_signature()` exists in crypto.rs but is NEVER called on receipt
- Add `if !crypto::verify_comment_signature(...) { continue; }` before `store_comment()`
- Infrastructure exists, just not wired up

### S2. Reaction removal auth check — TWO LINE FIX (connection.rs:5708-5709)
- `RemoveReaction` accepts from any sender, no auth
- Add: `if *reactor == sender || sender == payload.author { ... }`
- Same pattern already used in EditComment/DeleteComment

### S3. Reaction signature — ~30 lines (types.rs, crypto.rs, connection.rs)
- `Reaction` has no signature field — anyone can fake reactions from any NodeId
- Add `signature: Vec<u8>` to Reaction struct (#[serde(default)] for compat)
- Sign `(reactor + post_id + emoji + timestamp)` with reactor's ed25519 key
- Verify in handle_blob_header_diff before storing
- Follow existing `sign_comment` / `verify_comment_signature` pattern

### S4. BlobHeader author verification — ~5 lines (connection.rs:5821-5836)
- Header rebuild uses `payload.author` without checking against stored post author
- Look up actual author from `storage.get_post(&payload.post_id)`
- Use stored author, not payload-claimed author
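
A sketch of the lookup, with a `HashMap` standing in for `storage.get_post` and 32-byte arrays standing in for the real id types (both assumptions). `trusted_author` is a hypothetical helper that always returns the stored author and reports whether the payload's claim matched, which could be useful for logging:

```rust
use std::collections::HashMap;

// Hypothetical stand-ins for the real id types.
type NodeId = [u8; 32];
type PostId = [u8; 32];

struct StoredPost {
    author: NodeId,
}

/// Returns the stored author and whether the payload's claimed author matched.
/// `None` means the post is unknown locally and the diff should be dropped.
fn trusted_author(
    posts: &HashMap<PostId, StoredPost>,
    post_id: &PostId,
    claimed: &NodeId,
) -> Option<(NodeId, bool)> {
    posts.get(post_id).map(|p| (p.author, p.author == *claimed))
}
```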

## HIGH — Short-term

### S5. PostId verification in all paths (connection.rs)
- PostPush verifies with `verify_post_id()` but some pull paths don't
- Audit all `store_post_with_visibility` call sites
- Ensure `verify_post_id()` called before each store

### S6. Slot write protection — self-healing signature system (connection.rs:5749-5803)
- Problem: any peer can overwrite encrypted slots with garbage
- Solution (two layers):
  1. CDN tree membership check: only accept slot writes from peers in post_downstream or post_upstream for that post. Rejects random peers.
  2. Self-healing signatures:
     - Participants sign their own slot writes with the slot key (derived from CEK) and keep a local copy.
     - On diff check, if their slot was overwritten with something they didn't sign, they re-write their signed version.
     - Other participants verify signatures: keep the signed version, discard unsigned garbage.
     - The legitimate version propagates through the CDN tree. An attacker must keep overwriting forever; the real version keeps coming back from every CDN node that received it.
- Relay nodes can't verify signatures (don't have CEK) but pass through all writes — participants do client-side verification on decrypt

### S7. Comment edit/delete cryptographic proof (connection.rs:5726-5736)
- Currently "trust-based" — checks sender == author at transport layer
- QUIC connection IS authenticated (iroh ed25519), so sender identity is verified
- Risk: compromised relay node
- Fix: require new signature over edited content (editor proves they hold private key)
- For post-author deletes: require post author's signature over delete request

### S8. Pull sync follow list privacy (connection.rs:1846-1915)
- PullSyncRequest sends entire follow list unencrypted to every sync peer
- Every mesh peer learns your complete social graph
- Options:
  - Accept and document (mesh peers are semi-trusted infrastructure) — RECOMMENDED for now
  - Bloom filter: probabilistic set, leaks less, some irrelevant posts received (acceptable bandwidth cost)
  - Long-term: oblivious transfer / PIR (heavy crypto, probably not worth it for a social network)

## MEDIUM — Design review

### S9. Nonce reuse guard (crypto.rs:54-56)
- ChaCha20-Poly1305 catastrophic on nonce reuse
- RNG is reliable on modern OS (getrandom syscall)
- Add sanity check: if nonce is all zeros after generation, panic rather than encrypt
- One-line guard
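
The guard might look like the following. The 12-byte length assumes the standard 96-bit ChaCha20-Poly1305 nonce, and `fill_random` is a hypothetical stand-in for whatever RNG call crypto.rs actually makes. The check only catches a silently failed RNG; it does not detect nonce reuse in general:

```rust
/// Panics instead of returning an all-zero nonce, which would suggest the RNG
/// call failed silently. `fill_random` is a hypothetical stand-in for the
/// real RNG call.
fn generate_nonce(fill_random: impl FnOnce(&mut [u8; 12])) -> [u8; 12] {
    let mut nonce = [0u8; 12];
    fill_random(&mut nonce);
    assert!(
        nonce.iter().any(|&b| b != 0),
        "RNG produced an all-zero nonce; refusing to encrypt"
    );
    nonce
}
```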

### S10. Slot timing metadata leakage (connection.rs:5757, 5776)
- `header.updated_at` changes on slot writes, leaking WHEN engagement occurs on private posts
- Passive observer can correlate timestamps with known user behavior
- Fix: round updated_at to 10-minute buckets for private posts
- Or batch slot writes on fixed schedule rather than immediately
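
The bucketing option is a one-line computation. This sketch assumes `updated_at` is a millisecond timestamp (as the `_ms` fields elsewhere suggest) and rounds down rather than to nearest; `bucket_updated_at` is a hypothetical name:

```rust
const BUCKET_MS: u64 = 10 * 60 * 1000;

/// Rounds a millisecond timestamp down to the start of its 10-minute bucket.
fn bucket_updated_at(updated_at_ms: u64) -> u64 {
    updated_at_ms - updated_at_ms % BUCKET_MS
}
```

Rounding down keeps the stored value monotonic with respect to the real one, so comparisons such as "is this header newer?" still behave sensibly at bucket granularity.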

### S11. Per-author engagement rate limiting (connection.rs:5699-5725)
- A peer can send 10,000 fake reactions in one BlobHeaderDiff
- Cap ops per message (100 max per DOS hardening #9)
- Deduplicate by (reactor, post_id, emoji) — storage already does ON CONFLICT DO UPDATE
- Combined with reaction signatures (S3), fake NodeId reactions become impossible
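
The cap and in-message deduplication could be applied before anything reaches storage. `ReactionOp` and `sanitize_ops` are hypothetical, with `u64` ids standing in for the real NodeId and PostId types:

```rust
use std::collections::HashSet;

const MAX_OPS_PER_DIFF: usize = 100;

// Hypothetical reaction op.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
struct ReactionOp {
    reactor: u64,
    post_id: u64,
    emoji: String,
}

/// Keeps only the first 100 ops of a message, then drops repeats of the same
/// (reactor, post_id, emoji) triple while preserving order.
fn sanitize_ops(ops: &[ReactionOp]) -> Vec<ReactionOp> {
    let mut seen = HashSet::new();
    ops.iter()
        .take(MAX_OPS_PER_DIFF)
        .filter(|op| seen.insert((*op).clone()))
        .cloned()
        .collect()
}
```

Applying the cap before deduplication bounds the work per message at 100 hash insertions regardless of how large the incoming diff is.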

## LOW

- See S12 and S13 under "Low Priority" at the end of this document.

---

# Data Cleanup TODO

### D1. post_downstream not cleaned on post delete (storage.rs delete_post)
- When a post is deleted, downstream registrations stay forever
- Fix: add `DELETE FROM post_downstream WHERE post_id = ?1` in delete_post()
- Also add: `DELETE FROM post_upstream WHERE post_id = ?1`
- Also add: `DELETE FROM seen_engagement WHERE post_id = ?1`
- One-line fixes each

### D2. Document BlobHeader-table relationship (storage.rs store_blob_header)
- Header JSON is a snapshot, reactions/comments tables are authoritative
- They can temporarily diverge (BlobHeaderResponse arrives with newer header than tables)
- Header rebuilt from tables on next engagement op
- Add clarifying comment to store_blob_header

---

# Low Priority

### S12. Hex parse error logging (web.rs:110-119)
- Malformed hex strings silently return 404
- Add debug logging for malformed inputs

### S13. Edit comment signature consistency (storage.rs:4338-4343)
- edit_comment updates content without updating signature
- If signature verification (S1) is enabled, edited comments would have invalid signatures
- Fix: add signature parameter to edit_comment, re-sign edited content