Scott Reimers bb6f2b64b0 v0.4.1: Security hardening, lock contention fixes, data cleanup

Security:
- Reaction signatures: ed25519 sign/verify (sign_reaction, verify_reaction_signature)
  Backward-compatible — unsigned reactions from old nodes still accepted
- Comment signature verification: verify_comment_signature now called on receipt
- Reaction removal authorization: only reactor or post author can remove
- BlobHeader author verification: lookup actual author from storage, don't trust payload

Lock contention (4 fixes):
- ManifestPush discovery: cm lock released before PostFetch I/O
- Pull request handler: load under lock, filter without lock, brief re-lock for is_deleted
- Pull sender: split into two brief locks (store posts, then batch upstream+sync)
- Engagement checker: batch all chunk results, single lock for writes

Data cleanup:
- Post deletion cleans post_downstream, post_upstream, seen_engagement tables
- Added TODO-hardening.md documenting remaining DOS/security/lock/data issues

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-03-21 19:30:38 -04:00

9.8 KiB

Raw Blame History

DOS Hardening TODO

Identified during v0.4.0 audit (2026-03-21). Implement before v0.4.1.

CRITICAL — Lock Contention (v4-introduced)

L1. ManifestPush discovery holds cm lock during network I/O (connection.rs:4847-4981)

Spawned task grabs cm lock, calls send_post_fetch (QUIC I/O), waits for response — all while locked
Every connection operation queues behind it (5s+ freeze possible)
Fix: Gather connection handle before locking cm. PostFetch outside lock. Brief re-acquire for DB writes.

L2. Pull request handler holds lock during filtering (connection.rs:1855-1905)

Loads ALL posts, loops through checking visibility + timestamps while holding storage lock
50K posts = 500ms+ lock hold
Fix: Load posts under lock (brief), release, filter without lock (CPU only), re-acquire briefly for is_deleted() on filtered subset.

L3. Pull sender's second lock too long (connection.rs:1650-1721, 1572-1624)

After receiving posts: store + add_upstream (count query each) + update_last_sync — all under one lock
100 posts = 100 inserts + 100 count queries + 20 author updates
Fix: Split into two brief locks. First: bulk store posts. Second: batch upstream adds + last_sync updates. Collect unique authors during first lock.

L4. Per-post engagement lock acquisitions (connection.rs:1777-1833)

Lock acquired/released 100 times in tight loop (once per post)
Each acquisition blocks behind other tasks
Fix: Batch writes. Collect all engagement results, acquire lock once, write all. Network I/O already outside lock.

L5. Stale follows query (node.rs:2528-2530) — LOW

get_stale_follows every 60s, brief query, acceptable
Fix (optional): Add index on follows(last_sync_ms) if missing

HIGH Priority — DOS

1. Stream handler cap (connection.rs:4252, 4270)

Max 10 concurrent workers per connection via Semaphore
Excess streams wait, not spawned unbounded

2. Slot index memory bomb (connection.rs:5754-5792)

Soft 1K slot limit per post
Author can sign capacity increase that propagates via BlobHeaderDiff
Without author signature, cap stays at 1K
Consider thread-split pattern for overflow (already exists for 16KB comments)

3. ManifestPush amplification (connection.rs:4877-4936)

Custom ManifestPush for new posts: only deliver [new_post_id, previous_post_id]
Each CDN partner updates their own local manifest copy
Same diff pattern useful for N+10 updates
Lower bandwidth, low-priority background task

MEDIUM Priority

4. Post list pagination (connection.rs:1857)

Limit to 200 posts per pull response
~100KB memory, <5ms lock hold
Next sync cycle catches remainder via since_ms timestamps

5. Eviction candidate cap (storage.rs:3678-3737)

Limit to 100 candidates per batch
~40KB memory, <5ms lock hold
Next 5-min cycle catches more if needed

6. Payload element abuse — CDN consensus check

Before accepting a large engagement update, check 1-2 CDN neighbors
"Does your header for this post look like this?" If not → reject
Attacker must compromise multiple CDN nodes to pass
No trust scoring needed — just peer corroboration

7. Lock acquisition timeouts (connection.rs: throughout)

5-second timeout on storage lock acquisition
On timeout: skip operation, try next cycle
Log: operation name, wait duration, who holds the lock
Add last_lock_holder: AtomicU64 storing hash of acquiring function name

8. Discovery task cap (connection.rs:4938-4980)

One discovery task per peer at a time
AtomicBool flag per connection, skip if already running

LOW Priority

9. Engagement rate limiting

Self-claimed: max 3 emoji + 1 comment per 10 seconds
Chain-propagated: CDN consensus check from #6 applies
Process only first 100 ops per BlobHeaderDiff message

10. Mesh stream spawn cap (connection.rs)

Same as #1 — 10 max concurrent handlers per connection
Supplements auth rate limiter (which handles connection-level, not stream-level)

11. Retry backoff per target

Start at 5 seconds, triple on each failure
5s → 15s → 45s → 135s → 405s → 1215s → 3645s → 10935s → 14400s (cap at 4hr)
8 failures to hit max backoff
Reset to 5s on success
Track per target peer, not global

Security Hardening TODO

Identified during v0.4.0 security audit (2026-03-21).

CRITICAL — Immediate (before next public release)

S1. Comment signature verification — ONE LINE FIX (connection.rs:5711-5724)

verify_comment_signature() exists in crypto.rs but is NEVER called on receipt
Add if !crypto::verify_comment_signature(...) { continue; } before store_comment()
Infrastructure exists, just not wired up

S2. Reaction removal auth check — TWO LINE FIX (connection.rs:5708-5709)

RemoveReaction accepts from any sender, no auth
Add: if *reactor == sender || sender == payload.author { ... }
Same pattern already used in EditComment/DeleteComment

S3. Reaction signature — ~30 lines (types.rs, crypto.rs, connection.rs)

Reaction has no signature field — anyone can fake reactions from any NodeId
Add signature: Vec<u8> to Reaction struct (#[serde(default)] for compat)
Sign (reactor + post_id + emoji + timestamp) with reactor's ed25519 key
Verify in handle_blob_header_diff before storing
Follow existing sign_comment / verify_comment_signature pattern

S4. BlobHeader author verification — ~5 lines (connection.rs:5821-5836)

Header rebuild uses payload.author without checking against stored post author
Look up actual author from storage.get_post(&payload.post_id)
Use stored author, not payload-claimed author

HIGH — Short-term

S5. PostId verification in all paths (connection.rs)

PostPush verifies with verify_post_id() but some pull paths don't
Audit all store_post_with_visibility call sites
Ensure verify_post_id() called before each store

S6. Slot write protection — self-healing signature system (connection.rs:5749-5803)

Problem: any peer can overwrite encrypted slots with garbage
Solution (two layers):
1. CDN tree membership check: only accept slot writes from peers in post_downstream or post_upstream for that post. Rejects random peers.
2. Self-healing signatures: participants sign their own slot writes with the slot key (derived from CEK) and keep a local copy. On diff check, if their slot was overwritten with something they didn't sign, they re-write their signed version. Other participants verify signatures — keep the signed version, discard unsigned garbage. The legitimate version propagates through the CDN tree. Attacker must keep overwriting forever; the real version keeps coming back from every CDN node that received it.
Relay nodes can't verify signatures (don't have CEK) but pass through all writes — participants do client-side verification on decrypt

S7. Comment edit/delete cryptographic proof (connection.rs:5726-5736)

Currently "trust-based" — checks sender == author at transport layer
QUIC connection IS authenticated (iroh ed25519), so sender identity is verified
Risk: compromised relay node
Fix: require new signature over edited content (editor proves they hold private key)
For post-author deletes: require post author's signature over delete request

S8. Pull sync follow list privacy (connection.rs:1846-1915)

PullSyncRequest sends entire follow list unencrypted to every sync peer
Every mesh peer learns your complete social graph
Options:
- Accept and document (mesh peers are semi-trusted infrastructure) — RECOMMENDED for now
- Bloom filter: probabilistic set, leaks less, some irrelevant posts received (acceptable bandwidth cost)
- Long-term: oblivious transfer / PIR (heavy crypto, probably not worth it for social network)

MEDIUM — Design review

S9. Nonce reuse guard (crypto.rs:54-56)

ChaCha20-Poly1305 catastrophic on nonce reuse
RNG is reliable on modern OS (getrandom syscall)
Add sanity check: if nonce is all zeros after generation, panic rather than encrypt
One-line guard

S10. Slot timing metadata leakage (connection.rs:5757, 5776)

header.updated_at changes on slot writes, leaking WHEN engagement occurs on private posts
Passive observer can correlate timestamps with known user behavior
Fix: round updated_at to 10-minute buckets for private posts
Or batch slot writes on fixed schedule rather than immediately

S11. Per-author engagement rate limiting (connection.rs:5699-5725)

A peer can send 10,000 fake reactions in one BlobHeaderDiff
Cap ops per message (100 max per DOS hardening #9)
Deduplicate by (reactor, post_id, emoji) — storage already does ON CONFLICT DO UPDATE
Combined with reaction signatures (S3), fake NodeId reactions become impossible

LOW

Data Cleanup TODO

D1. post_downstream not cleaned on post delete (storage.rs delete_post)

When a post is deleted, downstream registrations stay forever
Fix: add DELETE FROM post_downstream WHERE post_id = ?1 in delete_post()
Also add: DELETE FROM post_upstream WHERE post_id = ?1
Also add: DELETE FROM seen_engagement WHERE post_id = ?1
One-line fixes each

D2. Document BlobHeader-table relationship (storage.rs store_blob_header)

Header JSON is a snapshot, reactions/comments tables are authoritative
They can temporarily diverge (BlobHeaderResponse arrives with newer header than tables)
Header rebuilt from tables on next engagement op
Add clarifying comment to store_blob_header

Low Priority

S12. Hex parse error logging (web.rs:110-119)

Malformed hex strings silently return 404
Add debug logging for malformed inputs

S13. Edit comment signature consistency (storage.rs:4338-4343)

edit_comment updates content without updating signature
If signature verification (S1) is enabled, edited comments would have invalid signatures
Fix: add signature parameter to edit_comment, re-sign edited content

9.8 KiB Raw Blame History

DOS Hardening TODO

CRITICAL — Lock Contention (v4-introduced)

L1. ManifestPush discovery holds cm lock during network I/O (connection.rs:4847-4981)

L2. Pull request handler holds lock during filtering (connection.rs:1855-1905)

L3. Pull sender's second lock too long (connection.rs:1650-1721, 1572-1624)

L4. Per-post engagement lock acquisitions (connection.rs:1777-1833)

L5. Stale follows query (node.rs:2528-2530) — LOW

HIGH Priority — DOS

1. Stream handler cap (connection.rs:4252, 4270)

2. Slot index memory bomb (connection.rs:5754-5792)

3. ManifestPush amplification (connection.rs:4877-4936)

MEDIUM Priority

4. Post list pagination (connection.rs:1857)

5. Eviction candidate cap (storage.rs:3678-3737)

6. Payload element abuse — CDN consensus check

7. Lock acquisition timeouts (connection.rs: throughout)

8. Discovery task cap (connection.rs:4938-4980)

LOW Priority

9. Engagement rate limiting

10. Mesh stream spawn cap (connection.rs)

11. Retry backoff per target

Security Hardening TODO

CRITICAL — Immediate (before next public release)

S1. Comment signature verification — ONE LINE FIX (connection.rs:5711-5724)

S2. Reaction removal auth check — TWO LINE FIX (connection.rs:5708-5709)

S3. Reaction signature — ~30 lines (types.rs, crypto.rs, connection.rs)

S4. BlobHeader author verification — ~5 lines (connection.rs:5821-5836)

HIGH — Short-term

S5. PostId verification in all paths (connection.rs)

S6. Slot write protection — self-healing signature system (connection.rs:5749-5803)

S7. Comment edit/delete cryptographic proof (connection.rs:5726-5736)

S8. Pull sync follow list privacy (connection.rs:1846-1915)

MEDIUM — Design review

S9. Nonce reuse guard (crypto.rs:54-56)

S10. Slot timing metadata leakage (connection.rs:5757, 5776)

S11. Per-author engagement rate limiting (connection.rs:5699-5725)

LOW

Data Cleanup TODO

D1. post_downstream not cleaned on post delete (storage.rs delete_post)

D2. Document BlobHeader-table relationship (storage.rs store_blob_header)

Low Priority

S12. Hex parse error logging (web.rs:110-119)

S13. Edit comment signature consistency (storage.rs:4338-4343)

9.8 KiB

Raw Blame History