Content-Addressed Storage Implementation Guide¶

TL;DR: Files You Need¶

For implementing content-addressed storage in BIRD/shq, use these files in this order:

1. Design & Architecture (Read First)¶

1. CONTENT_ADDRESSED_BLOBS.md    - Complete design document
2. bird_spec.md                  - Updated BIRD specification
3. SPEC_CHANGELOG.md             - What changed from UUID-based

2. Implementation Reference (Code From)¶

4. shq_implementation.md         - Updated Rust code examples
5. MIGRATION_SUMMARY.md          - Phase-by-phase checklist

3. Supporting Docs (Reference As Needed)¶

6. COMPRESSION_DECISION.md       - Why gzip not zstd
7. STORAGE_LIFECYCLE.md          - Blob lifecycle details

Quick Verification: Is bird_spec.md Updated?¶

✅ YES! Here's proof:

Directory Structure (lines 30-36)¶

└── blobs/
    └── content/         # Content-addressed pool ✅
        ├── ab/
        │   └── abc123def456...789.bin.gz  ✅ Hash-based!
        ├── cd/
        └── ...          # 256 subdirs (00-ff)

Outputs Table (lines 115-143)¶

CREATE TABLE outputs (
    id                UUID PRIMARY KEY,
    command_id        UUID NOT NULL,

    -- NEW: Content identification
    content_hash      TEXT NOT NULL,     -- BLAKE3 hash ✅
    byte_length       BIGINT NOT NULL,

    -- NEW: Polymorphic storage
    storage_type      TEXT NOT NULL,     -- 'inline', 'blob', 'archive' ✅
    storage_ref       TEXT NOT NULL,     -- URI to content ✅

    stream            TEXT NOT NULL,
    content_type      TEXT,
    ...
);

Blob Registry Table (lines 155-178)¶

CREATE TABLE blob_registry (
    content_hash      TEXT PRIMARY KEY,  -- BLAKE3 hash ✅
    byte_length       BIGINT NOT NULL,
    ref_count         INT DEFAULT 0,     -- Reference tracking ✅
    first_seen        TIMESTAMP,
    last_accessed     TIMESTAMP,
    storage_tier      TEXT,              -- 'recent', 'archive' ✅
    storage_path      TEXT,
    ...
);

All updated! ✅

Implementation Phases¶

Phase 1: Schema (Week 1)¶

File: bird_spec.md lines 115-178

Tasks: 1. Add columns to outputs table:

ALTER TABLE outputs ADD COLUMN content_hash TEXT;
ALTER TABLE outputs ADD COLUMN storage_type TEXT DEFAULT 'inline';
ALTER TABLE outputs ADD COLUMN storage_ref TEXT;

Create blob_registry table:

CREATE TABLE blob_registry (...);  -- Full schema in bird_spec.md

Update OutputRecord struct (Rust):

struct OutputRecord {
    id: UUID,
    command_id: UUID,
    stream: String,
    content_hash: String,        // NEW
    byte_length: i64,
    storage_type: String,        // NEW
    storage_ref: String,         // NEW
    content_type: Option<String>,
    date: Date,
}

Reference: shq_implementation.md lines 163-172 (updated struct)

Phase 2: Capture Flow (Week 2)¶

File: shq_implementation.md lines 143-265

Key Functions to Implement:

write_output() - Main entry point

fn write_output(command_id: &UUID, stream: &str, content: &[u8]) -> Result<()> {
    // 1. Compute BLAKE3 hash
    let hash = blake3::hash(content);

    // 2. Size-based routing
    if content.len() < THRESHOLD {
        // Inline: data: URI
    } else {
        // Blob: Check for existing (dedup!)
        if let Some(path) = check_blob_exists(&hash)? {
            increment_ref_count(&hash)?;  // DEDUP HIT
        } else {
            write_content_addressed_blob(&hash, content)?;  // DEDUP MISS
        }
    }

    // 3. Insert output record
}

write_content_addressed_blob() - Lines 196-226

fn write_content_addressed_blob(hash: &str, content: &[u8]) -> Result<String> {
    // Subdirectory: first 2 hex chars
    let subdir = &hash[..2];
    let blob_dir = bird_root.join("db/data/recent/blobs/content").join(subdir);

    // Filename: {hash}.bin.gz
    let filename = format!("{}.bin.gz", hash);

    // Atomic write (handles race conditions)
    // ... see full code in shq_implementation.md
}

check_blob_exists() - Lines 228-239

fn check_blob_exists(hash: &str) -> Result<Option<String>> {
    // Query blob_registry
    // Return storage_path if exists
}

register_blob() - Lines 241-250

fn register_blob(hash: &str, byte_length: usize, path: &str) -> Result<()> {
    // INSERT into blob_registry
    // Set ref_count = 1
}

increment_blob_ref_count() - Lines 252-260

fn increment_blob_ref_count(hash: &str) -> Result<()> {
    // UPDATE blob_registry SET ref_count = ref_count + 1
}

Dependencies:

[dependencies]
blake3 = "1.5"           # Fast hashing
flate2 = "1.0"           # Gzip compression
base64 = "0.21"          # For data: URIs

Phase 3: Query Flow (Week 3)¶

File: shq_implementation.md lines 460-540

Key Function:

resolve_storage_ref() - Lines 505-533

href="#__codelineno-15-1">fn resolve_storage_ref(storage_type: &str, storage_ref: &str, hash: &str) -> Result<String> { match storage_type { "inline" => { // Decode data: URI, write to temp file let b64_data = storage_ref.split(',').nth(1)?; let decoded = base64::decode(b64_data)?; let temp_path = format!("/tmp/shq-output-{}.tmp", hash); fs::write(&temp_path, decoded)?; Ok(temp_path) }, "blob" | "archive" => { // Extract path from file:// URI let rel_path = &storage_ref[7..]; let full_path = bird_root.join("db/data").join(rel_path); Ok(full_path.display().to_string()) }, _ => Err(anyhow!("Unknown storage type")) } }

Update Queries: - Replace o.file_ref with o.storage_type, o.storage_ref, o.content_hash - Call resolve_storage_ref() before accessing content - See lines 464-504 for complete example

Phase 4: Testing (Week 4)¶

Test Cases:

Deduplication Test

#[test]
fn test_dedup_same_content() {
    let content = b"identical output";

    // First write
    write_output(&cmd1, "stdout", content)?;
    let blob_count_1 = count_blobs()?;

    // Second write (same content)
    write_output(&cmd2, "stdout", content)?;
    let blob_count_2 = count_blobs()?;

    // Should reuse blob (same count)
    assert_eq!(blob_count_1, blob_count_2);

    // But both outputs exist
    assert_eq!(count_outputs()?, 2);
}

Reference Counting Test

#[test]
fn test_ref_counting() {
    let hash = write_blob(content)?;
    let refs = get_ref_count(&hash)?;
    assert_eq!(refs, 1);

    // Write same content again
    write_output(&cmd2, "stdout", content)?;
    let refs = get_ref_count(&hash)?;
    assert_eq!(refs, 2);  // Incremented!
}

Storage URI Resolution Test

#[test]
fn test_resolve_inline() {
    let uri = "data:application/octet-stream;base64,SGVsbG8=";
    let path = resolve_storage_ref("inline", uri, "abc123")?;
    let content = fs::read_to_string(path)?;
    assert_eq!(content, "Hello");
}

Storage Savings Calculator¶

Your Workload:

Commands/day:     _______
Avg output size:  _______
Dedup ratio:      70% (typical CI)

Before:  commands/day × avg_size × 30 days
After:   commands/day × avg_size × (1 - dedup_ratio) × 30 days
Savings: _______

Example:

10,000 commands/day
200KB avg output
70% dedup

Before:  10k × 200KB × 30 = 60GB/month
After:   10k × 200KB × 0.3 × 30 = 18GB/month
Savings: 42GB (70%)

Troubleshooting¶

"Hash mismatch on read"¶

Corruption in blob file
Check: SELECT * FROM blob_registry WHERE corrupt = TRUE
Fix: Recompute hash, mark for deletion

"Dedup not working"¶

Check: Are hashes identical? SELECT content_hash, COUNT(*) FROM outputs GROUP BY content_hash
Verify: BLAKE3 version consistent
Debug: Log dedup hit/miss rates

"Blob directory growing too large"¶

Check: Subdirectory sharding (00-ff)
Verify: ls blobs/content/ | wc -l should show 256 dirs
Fix: Migrate old blobs to subdirs

Migration from UUID-based¶

File: MIGRATION_SUMMARY.md

Steps:

# 1. Backup
cp -r db/data/recent/managed db/data/recent/managed.backup

# 2. Run migration
shq migrate-to-content-addressed

# 3. Verify
shq verify-blobs

# 4. Cleanup (after verification)
rm -rf db/data/recent/managed.backup

Performance Targets¶

✅ Hash computation: <2ms per 5MB output
✅ Dedup check: <1ms (indexed query)
✅ Total overhead: <3ms per capture
✅ Storage savings: 70-90% for CI workloads
✅ Query speed: No regression

Summary Checklist¶

[ ] Read CONTENT_ADDRESSED_BLOBS.md (design)
[ ] Read bird_spec.md (schema)
[ ] Phase 1: Schema migration
[ ] Phase 2: Capture flow (write_output, etc.)
[ ] Phase 3: Query flow (resolve_storage_ref)
[ ] Phase 4: Testing (dedup, ref counting, URIs)
[ ] Measure: Storage savings achieved
[ ] Document: Actual dedup ratios for your workload

Implementation time: ~4 weeks
Complexity: Medium (clear interfaces)
Risk: Low (backward compatible)
Payoff: 70-90% storage reduction 🎯