BIRD Spec Changelog
v5: Attempts/Outcomes Split (2026-02-10)
Summary
BIRD v5 splits invocation data into attempts (which commands were started) and outcomes (how they completed). This eliminates pending files and status partitioning: invocations becomes a VIEW, and status is derived from a LEFT JOIN of attempts against outcomes.
Key Changes
1. Schema Split: Attempts + Outcomes
Before (v4):
CREATE TABLE invocations (
id UUID PRIMARY KEY,
-- ... all columns including exit_code, status ...
status VARCHAR DEFAULT 'completed' -- pending, completed, orphaned
);
After (v5):
-- What was tried
CREATE TABLE attempts (
id UUID PRIMARY KEY,
timestamp TIMESTAMP NOT NULL,
cmd VARCHAR NOT NULL,
cwd VARCHAR,
session_id VARCHAR,
tag VARCHAR,
source_client VARCHAR NOT NULL,
machine_id VARCHAR, -- Runner ID for liveness checking
hostname VARCHAR,
executable VARCHAR,
format_hint VARCHAR,
metadata MAP(VARCHAR, JSON), -- Extensible metadata
date DATE NOT NULL
);
-- What happened
CREATE TABLE outcomes (
attempt_id UUID PRIMARY KEY, -- References attempts.id
completed_at TIMESTAMP NOT NULL,
exit_code INTEGER, -- NULL = crashed/orphaned
duration_ms BIGINT NOT NULL,
signal INTEGER,
timeout BOOLEAN,
metadata MAP(VARCHAR, JSON),
date DATE NOT NULL
);
-- invocations is now a VIEW
CREATE VIEW invocations AS
SELECT
    a.*, o.completed_at, o.exit_code, o.duration_ms, o.signal, o.timeout,
    CASE
        WHEN o.attempt_id IS NULL THEN 'pending'
        WHEN o.exit_code IS NULL THEN 'orphaned'
        ELSE 'completed'
    END AS status
FROM attempts a
LEFT JOIN outcomes o ON a.id = o.attempt_id;
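The status rules in the VIEW can be exercised with a small in-memory database. This is a minimal sketch, not the reference implementation: SQLite stands in for DuckDB, the tables keep only the columns the status logic needs, and the IDs `a1` to `a3` are made up for illustration.

```python
import sqlite3

# Assumption: SQLite substitutes for DuckDB; only status-relevant columns are kept.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE attempts (id TEXT PRIMARY KEY, cmd TEXT NOT NULL);
CREATE TABLE outcomes (attempt_id TEXT PRIMARY KEY, exit_code INTEGER);
CREATE VIEW invocations AS
SELECT a.*, o.exit_code,
       CASE
           WHEN o.attempt_id IS NULL THEN 'pending'
           WHEN o.exit_code IS NULL THEN 'orphaned'
           ELSE 'completed'
       END AS status
FROM attempts a
LEFT JOIN outcomes o ON a.id = o.attempt_id;
""")

conn.executemany(
    "INSERT INTO attempts VALUES (?, ?)",
    [("a1", "make test"), ("a2", "make build"), ("a3", "make lint")],
)
# a1 finished normally, a2 has an outcome without an exit code, a3 has no outcome yet.
conn.executemany("INSERT INTO outcomes VALUES (?, ?)", [("a1", 0), ("a2", None)])

# Map each attempt ID to the status the VIEW derives for it.
statuses = dict(conn.execute("SELECT id, status FROM invocations"))
```

No status is ever written: removing or adding an outcome row is enough to change what the VIEW reports.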
2. Simplified Directory Structure
Before (v4):
recent/invocations/status=pending/date=2024-01-15/
recent/invocations/status=completed/date=2024-01-15/
recent/invocations/status=orphaned/date=2024-01-15/
After (v5):
No status partitioning - status is derived from the join.
3. Pending Detection via VIEW
Before (v4): Pending files in db/pending/*.pending
After (v5): A SQL query against the invocations VIEW
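The query itself is not reproduced here. As a hedged sketch, assume it filters the invocations VIEW on `status = 'pending'`, and that an orphan is recorded by inserting an outcome row with a NULL `exit_code` (which is what the VIEW's CASE expression implies). SQLite stands in for DuckDB; `mark_orphaned`, the `is_alive` callback, and the `duration_ms` value of 0 are hypothetical choices made for this example.

```python
import sqlite3
from datetime import datetime, timezone

def mark_orphaned(conn, is_alive):
    """Insert a NULL-exit-code outcome for each pending attempt whose runner is gone."""
    pending = conn.execute(
        "SELECT id, machine_id FROM invocations WHERE status = 'pending'"
    ).fetchall()
    now = datetime.now(timezone.utc).isoformat()
    orphaned = [attempt_id for attempt_id, runner in pending if not is_alive(runner)]
    conn.executemany(
        # duration_ms is NOT NULL in the schema; 0 is a placeholder assumption.
        "INSERT INTO outcomes (attempt_id, completed_at, exit_code, duration_ms) "
        "VALUES (?, ?, NULL, 0)",
        [(attempt_id, now) for attempt_id in orphaned],
    )
    return orphaned

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE attempts (id TEXT PRIMARY KEY, cmd TEXT NOT NULL, machine_id TEXT);
CREATE TABLE outcomes (attempt_id TEXT PRIMARY KEY, completed_at TEXT NOT NULL,
                       exit_code INTEGER, duration_ms INTEGER NOT NULL);
CREATE VIEW invocations AS
SELECT a.*, o.exit_code,
       CASE WHEN o.attempt_id IS NULL THEN 'pending'
            WHEN o.exit_code IS NULL THEN 'orphaned'
            ELSE 'completed' END AS status
FROM attempts a LEFT JOIN outcomes o ON a.id = o.attempt_id;
INSERT INTO attempts VALUES ('a1', 'make test', 'pid:111'), ('a2', 'make build', 'pid:222');
""")

# Pretend only the runner behind pid:111 is still alive.
orphaned = mark_orphaned(conn, is_alive=lambda runner: runner == "pid:111")
after = dict(conn.execute("SELECT id, status FROM invocations"))
```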
4. Schema Versioning
New bird_meta table for schema versioning:
CREATE TABLE bird_meta (
key VARCHAR PRIMARY KEY,
value VARCHAR NOT NULL,
updated_at TIMESTAMP DEFAULT now()
);
-- Required entries:
-- schema_version: '5'
-- primary_client: 'shq' | 'blq' | etc.
-- created_at: ISO timestamp
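A client could consult this table before touching the data. The sketch below is illustrative only: SQLite stands in for DuckDB, and `read_meta`, `needs_migration`, and `SUPPORTED_SCHEMA_VERSION` are hypothetical names, not part of the spec.

```python
import sqlite3

SUPPORTED_SCHEMA_VERSION = "5"  # the schema_version value listed above

def read_meta(conn, key):
    """Return the value stored under key in bird_meta, or None if absent."""
    row = conn.execute("SELECT value FROM bird_meta WHERE key = ?", (key,)).fetchone()
    return row[0] if row else None

def needs_migration(conn):
    """True when the stored schema version differs from the one this client supports."""
    return read_meta(conn, "schema_version") != SUPPORTED_SCHEMA_VERSION

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE bird_meta (
        key TEXT PRIMARY KEY,
        value TEXT NOT NULL,
        updated_at TEXT DEFAULT CURRENT_TIMESTAMP
    )
""")
# Simulate a database that was created by a v4 client.
conn.executemany(
    "INSERT INTO bird_meta (key, value) VALUES (?, ?)",
    [("schema_version", "4"), ("primary_client", "shq")],
)
migration_required = needs_migration(conn)
```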
5. Extensible Metadata
MAP(VARCHAR, JSON) columns on both attempts and outcomes:
-- Query metadata
SELECT * FROM invocations WHERE metadata['vcs']->>'branch' = 'main';
-- Well-known keys: vcs, ci, env, resources, timing
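For clients that read rows outside SQL, the same lookup can be done in application code. This sketch assumes a MAP(VARCHAR, JSON) value arrives as a dict from string keys to JSON-encoded strings; `metadata_field` and the sample rows are hypothetical.

```python
import json

def metadata_field(metadata, key, field):
    """Return metadata[key][field], or None when either level is missing."""
    raw = metadata.get(key)
    if raw is None:
        return None
    value = json.loads(raw)
    return value.get(field) if isinstance(value, dict) else None

# Sample rows; each metadata value is a JSON document stored as text.
rows = [
    {"cmd": "make test", "metadata": {"vcs": json.dumps({"branch": "main"})}},
    {"cmd": "make lint", "metadata": {"vcs": json.dumps({"branch": "dev"})}},
    {"cmd": "ls", "metadata": {}},
]

# Equivalent of: WHERE metadata['vcs']->>'branch' = 'main'
on_main = [r["cmd"] for r in rows if metadata_field(r["metadata"], "vcs", "branch") == "main"]
```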
Benefits
- Cleaner crash recovery: No pending files, just query the VIEW
- Simpler partitioning: No status= directory level
- Extensibility: MAP metadata without schema changes
- Better separation: Attempt data vs outcome data clearly separated
Migration
Existing v4 databases will need migration. The pending file mechanism is deprecated - v5 uses VIEW-based pending detection.
Breaking changes:
- invocations is now a VIEW (can't INSERT directly)
- db/pending/ directory no longer used
- status= partitioning removed
v4: In-Flight Tracking + Crash Recovery (2026-02-09)
Summary
BIRD v4 adds comprehensive crash recovery support with in-flight invocation tracking, status partitioning, and the clean/prune command.
Key Changes
1. Runner ID for Liveness Checking
New runner_id field in invocations for tracking execution context:
Supported formats:
- pid:<number> - Local process ID (can check with kill -0)
- gha:run:<id> - GitHub Actions run ID
- k8s:pod:<name> - Kubernetes pod identifier
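The three formats above can be told apart by prefix. The following is a rough sketch: `parse_runner_id` and `is_local_pid_alive` are hypothetical helpers, and the liveness check covers only the `pid:` form on POSIX systems (signal 0 has a different meaning on Windows, so do not run it there).

```python
import os

def parse_runner_id(runner_id):
    """Split a runner ID into (kind, value), e.g. 'pid:123' -> ('pid', '123')."""
    for prefix in ("gha:run:", "k8s:pod:", "pid:"):
        if runner_id.startswith(prefix):
            return prefix.rstrip(":"), runner_id[len(prefix):]
    raise ValueError(f"unsupported runner ID: {runner_id!r}")

def is_local_pid_alive(pid):
    """POSIX-only equivalent of `kill -0 <pid>`: probe the process without signalling it."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True
```

Checking `gha:run:` or `k8s:pod:` runners would need calls to the respective platform APIs, which this sketch leaves out.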
2. Status Partitioning
Invocations are now partitioned by status as the first-level hive partition:
invocations/
└── status=<status>/ # pending, completed, orphaned
└── date=YYYY-MM-DD/
└── <session>--<exec>--<uuid>.parquet
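A path in this layout can be assembled from its parts. This is a sketch under assumptions: `invocation_path` is a hypothetical helper, and `<exec>` is taken to mean the executable name.

```python
from datetime import date
from pathlib import PurePosixPath

VALID_STATUSES = {"pending", "completed", "orphaned"}

def invocation_path(status, day, session, executable, invocation_id):
    """Build invocations/status=<status>/date=YYYY-MM-DD/<session>--<exec>--<uuid>.parquet."""
    if status not in VALID_STATUSES:
        raise ValueError(f"unknown status: {status!r}")
    filename = f"{session}--{executable}--{invocation_id}.parquet"
    return (
        PurePosixPath("invocations")
        / f"status={status}"
        / f"date={day.isoformat()}"
        / filename
    )

path = invocation_path(
    "pending",
    date(2024, 1, 15),
    "zsh-12345",
    "make",
    "01937a2b-3c4d-7e8f-9012-3456789abcde",
)
```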
Status values:
| Status | Description |
|--------|-------------|
| pending | Command is currently running (exit_code is NULL) |
| completed | Command finished normally (exit code captured) |
| orphaned | Process died without cleanup (crash, SIGKILL, system reboot) |
3. Pending File Operations
Crash-safe JSON markers for in-flight invocations:
Pending file format:
{
  "id": "01937a2b-3c4d-7e8f-9012-3456789abcde",
  "session_id": "zsh-12345",
  "timestamp": "2024-01-15T10:30:00Z",
  "cwd": "/home/user/project",
  "cmd": "make test",
  "runner_id": "pid:12345",
  "client_id": "user@hostname"
}
Lifecycle:
1. Command starts → Create JSON pending file + write to status=pending/
2. Command completes → Write to status=completed/, delete pending files
3. Crash recovery → Scan pending files, mark dead runners as orphaned
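The marker side of this lifecycle might look like the sketch below. The file name `<id>.pending` and the write-then-rename step are assumptions made for the example, and `write_pending`, `clear_pending`, and `scan_pending` are hypothetical names; the Parquet writes are omitted.

```python
import json
import tempfile
from pathlib import Path

def write_pending(pending_dir, record):
    """Create the marker by writing a temp file, then renaming it into place."""
    pending_dir.mkdir(parents=True, exist_ok=True)
    final = pending_dir / f"{record['id']}.pending"
    tmp = final.with_suffix(".tmp")
    tmp.write_text(json.dumps(record))
    tmp.replace(final)  # a rename avoids leaving a half-written marker behind
    return final

def clear_pending(pending_dir, invocation_id):
    """Delete the marker once the command has completed."""
    (pending_dir / f"{invocation_id}.pending").unlink(missing_ok=True)

def scan_pending(pending_dir):
    """Load every marker still on disk, e.g. during crash recovery."""
    return [json.loads(p.read_text()) for p in sorted(pending_dir.glob("*.pending"))]

with tempfile.TemporaryDirectory() as root:
    pending_dir = Path(root) / "db" / "pending"
    record = {
        "id": "01937a2b-3c4d-7e8f-9012-3456789abcde",
        "cmd": "make test",
        "runner_id": "pid:12345",
    }
    write_pending(pending_dir, record)                          # command starts
    in_flight = [r["cmd"] for r in scan_pending(pending_dir)]
    clear_pending(pending_dir, record["id"])                    # command completes
    remaining = scan_pending(pending_dir)
```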
4. Clean/Prune Command
New shq clean command for recovery and maintenance:
shq clean # Recover orphaned invocations
shq clean --dry-run # Preview what would be cleaned
shq clean --max-age 12 # Mark as orphaned after 12 hours
shq clean --prune # Also prune old archive data
shq clean --prune --older-than 90d # Prune data older than 90 days
Operations:
1. Scan pending files for dead/stale runners
2. Mark orphaned invocations (write to status=orphaned/)
3. Delete stale pending files
4. Optionally prune old archive data
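The dead/stale decision in step 1 can be sketched as a pure function. This is an illustration, not the behaviour of `shq clean` itself: `is_stale` and `should_orphan` are hypothetical names, and treating a missing `--max-age` as "no age limit" is an assumption.

```python
from datetime import datetime, timedelta, timezone

def is_stale(started_at, now, max_age_hours):
    """True when a pending invocation started more than max_age_hours ago."""
    started = datetime.fromisoformat(started_at.replace("Z", "+00:00"))
    return now - started > timedelta(hours=max_age_hours)

def should_orphan(record, now, runner_alive, max_age_hours=None):
    """Orphan a pending record if its runner is dead or it exceeds the age limit."""
    if not runner_alive:
        return True
    return max_age_hours is not None and is_stale(record["timestamp"], now, max_age_hours)

record = {"timestamp": "2024-01-15T10:30:00Z", "runner_id": "pid:12345"}
now = datetime(2024, 1, 15, 23, 0, tzinfo=timezone.utc)  # 12.5 hours after the start
```

With these inputs, a live runner is orphaned under a 12-hour limit but kept under a 24-hour limit.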
Schema Changes
Invocations table additions:
runner_id VARCHAR, -- Runner identifier for liveness checking
status VARCHAR DEFAULT 'completed', -- pending, completed, orphaned
Benefits
- Crash recovery: Detect and record commands that crashed or were killed
- Better visibility: See pending and orphaned commands in queries
- Cross-platform: Runner ID supports local PIDs, GitHub Actions, Kubernetes
- No data loss: Pending files persist across crashes for recovery
Migration
Existing databases will default to status='completed' and runner_id=NULL for historical data. No migration script required - the new columns are nullable with defaults.
v3: Dual Storage Backends + Remote Sync (2026-01-15)
Previous version - see below for content-addressed storage changes.
v2: Content-Addressed Storage (2026-01-02)
Summary
The BIRD specification has been updated from UUID-based blob storage to content-addressed storage using BLAKE3 hashing. This enables automatic deduplication, reducing storage by 70-90% for typical CI/CD workloads.
Key Changes
1. Directory Structure
Before:
After:
recent/
└── blobs/
└── content/ # Content-addressed pool
├── ab/
│ └── abc123...789.bin.gz
├── cd/
│ └── cde456...012.bin.gz
└── ... # 256 subdirs (00-ff)
2. Database Schema
outputs table - Before:
CREATE TABLE outputs (
id UUID PRIMARY KEY,
command_id UUID NOT NULL,
stream TEXT NOT NULL,
content BLOB, -- Inline for <1MB
file_ref TEXT, -- Path for ≥1MB
byte_length BIGINT NOT NULL,
content_type TEXT,
...
);
outputs table - After:
CREATE TABLE outputs (
id UUID PRIMARY KEY,
command_id UUID NOT NULL,
stream TEXT NOT NULL,
-- NEW: Content identification
content_hash TEXT NOT NULL, -- BLAKE3 hash
byte_length BIGINT NOT NULL,
-- NEW: Polymorphic storage
storage_type TEXT NOT NULL, -- 'inline', 'blob', 'tarfs', 'archive'
storage_ref TEXT NOT NULL, -- URI to content
content_type TEXT,
...
);
CREATE INDEX idx_outputs_hash ON outputs(content_hash);
NEW: blob_registry table:
CREATE TABLE blob_registry (
content_hash TEXT PRIMARY KEY,
byte_length BIGINT NOT NULL,
ref_count INT DEFAULT 0, -- Reference tracking
first_seen TIMESTAMP,
last_accessed TIMESTAMP,
storage_tier TEXT, -- 'recent', 'archive'
storage_path TEXT,
verified_at TIMESTAMP,
corrupt BOOLEAN DEFAULT FALSE
);
3. Filename Format
Before:
After:
blobs/content/{hash[0:2]}/{hash}.bin.gz
Example: blobs/content/ab/abc123def456789abcdef0123456789...789.bin.gz
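The path rule is mechanical: the first two hex characters of the hash pick the subdirectory. A minimal sketch follows; `blob_path` is a hypothetical helper, and normalising the hash to lowercase is an assumption.

```python
from pathlib import PurePosixPath

def blob_path(content_hash):
    """Map a hex content hash to blobs/content/{hash[0:2]}/{hash}.bin.gz."""
    digest = content_hash.lower()
    if len(digest) < 2 or any(c not in "0123456789abcdef" for c in digest):
        raise ValueError(f"not a hex digest: {content_hash!r}")
    return PurePosixPath("blobs/content") / digest[:2] / f"{digest}.bin.gz"

example = blob_path("abc123def456")
```

Two hex characters give the 256 subdirectories (00 to ff) shown in the directory layout.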
4. Compression
Before: zstd (.zst)
After: gzip (.gz) - DuckDB can read directly
5. Storage URI Formats
Inline (<1MB):
Blob (≥1MB):
Archive:
6. Capture Flow
Before:
// Always write new file
let path = format!("managed/{}.bin.zst", uuid);
write_compressed(&path, data)?;
After:
// Check for an existing blob first
let hash = blake3::hash(data);
let path = if let Some(existing) = check_blob_exists(&hash)? {
    // DEDUP HIT: reuse the existing file and bump its reference count
    increment_ref_count(&hash)?;
    existing
} else {
    // DEDUP MISS: write a new content-addressed file
    write_blob(&hash, data)?
};
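The same hit/miss flow can be tried end to end with the standard library. This is a sketch under stated assumptions: `hashlib.blake2b` substitutes for BLAKE3 (which is not in Python's standard library), a plain dict stands in for `blob_registry.ref_count`, and `BlobStore` is a hypothetical class that starts each new blob at one reference.

```python
import gzip
import hashlib
import tempfile
from pathlib import Path

class BlobStore:
    """Toy content-addressed store: one gzip file per unique payload."""

    def __init__(self, root):
        self.root = Path(root)
        self.ref_counts = {}  # stand-in for blob_registry.ref_count

    def _path(self, digest):
        return self.root / "blobs" / "content" / digest[:2] / f"{digest}.bin.gz"

    def capture(self, data):
        # Assumption: blake2b replaces BLAKE3 for the purposes of this example.
        digest = hashlib.blake2b(data, digest_size=32).hexdigest()
        path = self._path(digest)
        if not path.exists():
            # DEDUP MISS: write the compressed payload once.
            path.parent.mkdir(parents=True, exist_ok=True)
            path.write_bytes(gzip.compress(data))
        # Hit or miss, this capture now holds one more reference to the blob.
        self.ref_counts[digest] = self.ref_counts.get(digest, 0) + 1
        return digest, path

with tempfile.TemporaryDirectory() as root:
    store = BlobStore(root)
    first, path = store.capture(b"ok\n")
    second, _ = store.capture(b"ok\n")  # identical output: reuses the same file
    files_on_disk = len(list(Path(root).rglob("*.bin.gz")))
    round_trip = gzip.decompress(path.read_bytes())
```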
7. Compaction Strategy
Before:
- Parquet files: Compact by date
- Blob files: Tar archives by week
After:
- Parquet files: Compact by date (unchanged)
- Blob files: No compaction needed! (already deduplicated)
8. Garbage Collection
Before: Not applicable (UUID = unique)
After: Three strategies available:
1. Never delete (MVP) - Simple, safe
2. Reference counting - Production-ready
3. Mark-and-sweep - For migrations
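As an illustration of the reference-counting strategy, the sketch below keeps the registry as a dict with `ref_count` and `last_accessed` fields mirroring the `blob_registry` columns. `release` and `collectable` are hypothetical names, and applying the grace period to `last_accessed` is an assumption; the 30-day default mirrors `grace_period_days` in the configuration section.

```python
from datetime import datetime, timedelta, timezone

def release(registry, content_hash):
    """Drop one reference to a blob; the count never goes below zero."""
    entry = registry[content_hash]
    entry["ref_count"] = max(0, entry["ref_count"] - 1)

def collectable(registry, now, grace_period_days=30):
    """Hashes with no references that were last accessed before the grace cutoff."""
    cutoff = now - timedelta(days=grace_period_days)
    return sorted(
        h for h, e in registry.items()
        if e["ref_count"] == 0 and e["last_accessed"] < cutoff
    )

now = datetime(2026, 3, 1, tzinfo=timezone.utc)
registry = {
    "aa11": {"ref_count": 1, "last_accessed": datetime(2026, 1, 1, tzinfo=timezone.utc)},
    "bb22": {"ref_count": 1, "last_accessed": datetime(2026, 2, 20, tzinfo=timezone.utc)},
}
before_release = collectable(registry, now)
release(registry, "aa11")
release(registry, "bb22")
after_release = collectable(registry, now)
```

Here `bb22` is unreferenced but still inside the grace period, so only `aa11` becomes eligible for deletion.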
9. Archival
Before:
archive/by-week/managed/
└── client=1/year=2024/week=52/
└── archive-20241230.tar.zst # Contains blobs from that week
After:
archive/blobs/content/ # Global content pool
├── ab/abc123...789.bin.gz # Referenced by many weeks
└── cd/cde456...012.bin.gz # Archived when all refs archived
Benefits
| Metric | Before | After | Improvement |
|---|---|---|---|
| Typical CI output | 500MB/day | 50MB/day | 90% smaller |
| 1 year storage | 182GB | 18GB | 90% reduction |
| File count | O(commands) | O(unique outputs) | 70-90% fewer |
| Dedup ratio | 0% | 70-90% | Automatic |
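The yearly figures follow from the daily ones. A quick check, assuming a 365-day year and decimal units (1 GB = 1000 MB); `yearly_storage_gb` is a hypothetical helper:

```python
def yearly_storage_gb(mb_per_day, days=365):
    """Convert a daily volume in MB to a yearly volume in GB (1 GB = 1000 MB)."""
    return mb_per_day * days / 1000

before_gb = yearly_storage_gb(500)  # 182.5, shown as 182GB in the table
after_gb = yearly_storage_gb(50)    # 18.25, shown as 18GB in the table
reduction = 1 - after_gb / before_gb
```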
Implementation Impact
Minimal Changes Required
- ✅ Capture: Add hash computation (~2ms overhead)
- ✅ Query: Transparent (read by hash)
- ✅ Schema: Add 3 columns (content_hash, storage_type, storage_ref), 1 table (blob_registry)
- ✅ Storage: ~90% less disk space
No Breaking Changes
- Existing commands/outputs tables unchanged
- Query interface identical
- APIs remain the same
- Migration path available
Migration Path
For existing BIRD installations:
# Automated migration script
bird migrate-to-content-addressed
# Steps:
# 1. Add new columns (with defaults)
# 2. Create blob_registry table
# 3. Hash existing blobs
# 4. Move to content-addressed paths
# 5. Populate content_hash and storage_ref
# 6. Remove old managed/ directory
Configuration
[deduplication]
enabled = true # Enable content-addressing
hash_algorithm = "blake3" # Fast cryptographic hash
[garbage_collection]
enabled = false # Disable for MVP
strategy = "ref_counting" # ref_counting, mark_sweep
grace_period_days = 30
References
- CONTENT_ADDRESSED_BLOBS.md - Complete design document
- bird_spec.md - Updated specification
- bird_spec_v1_uuid.md.backup - Original UUID-based spec
Version: 2.0 (Content-Addressed Storage)
Date: 2026-01-02
Impact: 70-90% storage reduction for CI workloads 🎉