BIRD Integration with blq¶
This document describes how BIRD integrates with blq (Build Log Query) and the duck_hunt DuckDB extension.
Overview¶
┌──────────────────────────────────────────────────────────────────┐
│ MAGIC Ecosystem │
├──────────────────────────────────────────────────────────────────┤
│ │
│ ┌────────────────────┐ ┌────────────────────┐ │
│ │ shq │ │ blq │ │
│ │ (Shell Query) │◀────────▶│ (Build Log Query) │ │
│ │ │ │ │ │
│ │ • Captures history │ │ • Parses formats │ │
│ │ • Stores in BIRD │ │ • Analyzes logs │ │
│ │ • Provides queries │ │ • Aggregates data │ │
│ └─────────┬──────────┘ └─────────┬──────────┘ │
│ │ │ │
│ └───────────┬───────────────────┘ │
│ │ │
│ ▼ │
│ ┌────────────────────────┐ │
│ │ duck_hunt │ │
│ │ (DuckDB Extension) │ │
│ │ │ │
│ │ • 80+ log formats │ │
│ │ • Unified schema │ │
│ │ • SQL integration │ │
│ └────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌────────────────────────┐ │
│ │ BIRD DB │ │
│ │ (DuckDB + Parquet) │ │
│ └────────────────────────┘ │
│ │
└──────────────────────────────────────────────────────────────────┘
The duck_hunt Extension¶
duck_hunt is a DuckDB extension that provides a unified interface for parsing structured logs.
Installation¶
Supported Formats¶
Over 80 log formats including:
Compilers: - GCC, Clang, MSVC - Rust (cargo, rustc) - Go (go build) - Java (javac)
Build Tools: - Make, CMake, Ninja - Gradle, Maven, Ant - Bazel, Buck
Linters: - ESLint, TSLint - Pylint, Flake8, mypy - RuboCop - ShellCheck
Test Frameworks: - pytest, unittest - Jest, Mocha - JUnit, TestNG - Go test, Cargo test
CI Systems: - GitHub Actions - GitLab CI - Jenkins - CircleCI
Unified Schema¶
All formats parse to a common schema:
CREATE TABLE parsed_events (
severity TEXT, -- error, warning, info, note
message TEXT, -- Error message
file TEXT, -- Source file path
line INTEGER, -- Line number
column INTEGER, -- Column number (if available)
code TEXT, -- Error code (e.g., E0308)
rule TEXT, -- Rule name (e.g., no-unused-vars)
category TEXT, -- Error category
suggestion TEXT, -- Fix suggestion (if available)
context TEXT, -- Surrounding context
metadata JSON -- Format-specific extras
);
Usage¶
-- Parse GCC output
SELECT * FROM read_duck_hunt_log('build.log', 'gcc');
-- Parse pytest output
SELECT * FROM read_duck_hunt_log('test-output.txt', 'pytest');
-- Parse cargo output
SELECT * FROM read_duck_hunt_log('cargo-build.log', 'cargo');
Integration Approaches¶
1. Direct Query Integration¶
Query BIRD data with duck_hunt parsing inline:
-- Find all GCC errors from last week
SELECT
c.cmd,
c.timestamp,
e.*
FROM bird.commands c
CROSS JOIN read_duck_hunt_log(c.stdout_file, c.format_hint) e
WHERE c.format_hint = 'gcc'
AND e.severity = 'error'
AND c.date >= current_date - 7;
2. shq Command Integration (Planned for v0.6)¶
The shq events command is planned for v0.6 with duck_hunt integration.
# (Future) Parse last command output
# shq events
# (Future) Filter by severity
# shq events --severity error
3. Current shq Commands¶
For now, use SQL queries to analyze captured data:
# Run a command with capture
shq run make test
# View the output
shq show
# Query invocations
shq sql "SELECT cmd, exit_code FROM invocations WHERE cmd LIKE 'make%'"
Format Detection Strategy¶
1. Command-Based Detection¶
Detect format from command name:
fn detect_from_command(cmd: &str) -> Option<String> {
let lower = cmd.to_lowercase();
if lower.contains("gcc") || lower.contains("g++") {
return Some("gcc".to_string());
}
if lower.contains("cargo") {
return Some("cargo".to_string());
}
if lower.contains("pytest") {
return Some("pytest".to_string());
}
// ... more patterns
None
}
2. Content-Based Detection¶
Analyze output to detect format:
fn detect_from_content(output: &[u8]) -> Option<String> {
let text = String::from_utf8_lossy(output);
// GCC pattern: "file.c:42:5: error:"
if text.contains(":error:") && text.contains(".c:") {
return Some("gcc".to_string());
}
// Cargo pattern: "error[E0308]:"
if text.contains("error[E") && text.contains("]") {
return Some("cargo".to_string());
}
// pytest pattern: "FAILED test_foo.py::test_bar"
if text.contains("FAILED") && text.contains("::") {
return Some("pytest".to_string());
}
None
}
3. Hybrid Approach¶
Combine both with confidence scoring:
struct FormatDetection {
format: String,
confidence: f32,
source: DetectionSource,
}
enum DetectionSource {
Command,
Content,
Both,
}
fn detect_format(cmd: &str, output: &[u8]) -> Option<FormatDetection> {
let from_cmd = detect_from_command(cmd);
let from_content = detect_from_content(output);
match (from_cmd, from_content) {
(Some(c1), Some(c2)) if c1 == c2 => Some(FormatDetection {
format: c1,
confidence: 0.95,
source: DetectionSource::Both,
}),
(Some(c), None) => Some(FormatDetection {
format: c,
confidence: 0.7,
source: DetectionSource::Command,
}),
(None, Some(c)) => Some(FormatDetection {
format: c,
confidence: 0.8,
source: DetectionSource::Content,
}),
(None, None) => None,
}
}
Advanced Query Patterns¶
Error Frequency Analysis¶
-- Top 10 most common errors
SELECT
message,
COUNT(*) as occurrences,
COUNT(DISTINCT c.id) as affected_builds
FROM bird.commands c
CROSS JOIN read_duck_hunt_log(c.stdout_file, c.format_hint) e
WHERE e.severity = 'error'
AND c.date >= current_date - 30
GROUP BY message
ORDER BY occurrences DESC
LIMIT 10;
Error Resolution Tracking¶
-- Errors that got fixed
WITH recent_errors AS (
SELECT DISTINCT e.message, e.file, e.line
FROM bird.commands c
CROSS JOIN read_duck_hunt_log(c.stdout_file, c.format_hint) e
WHERE e.severity = 'error'
AND c.date = current_date - 7
),
current_errors AS (
SELECT DISTINCT e.message, e.file, e.line
FROM bird.commands c
CROSS JOIN read_duck_hunt_log(c.stdout_file, c.format_hint) e
WHERE e.severity = 'error'
AND c.date = current_date
)
SELECT r.*
FROM recent_errors r
LEFT JOIN current_errors c USING (message, file, line)
WHERE c.message IS NULL;
Build Time vs Error Count¶
-- Correlation between errors and duration
SELECT
DATE_TRUNC('day', c.timestamp) as day,
AVG(c.duration_ms) as avg_duration,
COUNT(DISTINCT e.message) as error_count
FROM bird.commands c
CROSS JOIN read_duck_hunt_log(c.stdout_file, c.format_hint) e
WHERE c.cmd LIKE 'make%'
AND e.severity = 'error'
GROUP BY day
ORDER BY day;
File-Level Error Heatmap¶
-- Which files have most errors?
SELECT
e.file,
COUNT(*) as error_count,
COUNT(DISTINCT e.line) as affected_lines,
COUNT(DISTINCT c.date) as days_with_errors
FROM bird.commands c
CROSS JOIN read_duck_hunt_log(c.stdout_file, c.format_hint) e
WHERE e.severity = 'error'
AND c.date >= current_date - 30
GROUP BY e.file
ORDER BY error_count DESC
LIMIT 20;
BIRD Sync Protocol¶
For syncing BIRD data across machines.
Goals¶
- Conflict-free merging
- Efficient transfer (rsync-friendly)
- Selective sync (by date range, client)
- Preserve parquet structure
Protocol Design¶
┌─ Machine A ─────────────────┐ ┌─ Shared Storage ─────────┐ ┌─ Machine B ─────────────────┐
│ │ │ │ │ │
│ ~/.local/share/bird/ │ │ /shared/bird/ │ │ ~/.local/share/bird/ │
│ ├── db/data/recent/ │ │ ├── sync/ │ │ ├── db/data/recent/ │
│ │ └── date=2024-12-30/ │──────▶│ │ └── pending/ │◀──────│ │ └── date=2024-12-30/ │
│ └── db/data/archive/ │ │ │ └── laptop-*.parquet │ └── db/data/archive/ │
│ │ │ └── archive/ │ │ │
│ Every 6 hours: │ │ └── by-week/ │ │ Every 6 hours: │
│ 1. Export new files │ │ │ │ 1. Check for new files │
│ 2. Upload to shared │ │ Merge process: │ │ 2. Download and merge │
│ 3. Check for updates │ │ - No conflicts (UUID) │ │ 3. Update local DB │
│ │ │ - Append-only │ │ │
└─────────────────────────────┘ └──────────────────────────┘ └─────────────────────────────┘
Sync Commands¶
# Push local data to shared storage
shq sync push --remote /shared/bird/
# Pull remote data to local
shq sync pull --remote /shared/bird/
# Bidirectional sync
shq sync --remote /shared/bird/
# Status check
shq sync status --remote /shared/bird/
Conflict Resolution¶
No conflicts possible because: - UUIDs guarantee uniqueness - Parquet files are immutable (append-only) - Each client writes to separate archive partitions
Sync Algorithm¶
fn sync_to_remote(local: &Path, remote: &Path) -> Result<()> {
// 1. Find new files (not in remote)
let local_files = scan_parquet_files(local)?;
let remote_files = scan_parquet_files(remote)?;
let new_files: Vec<_> = local_files
.difference(&remote_files)
.collect();
// 2. Copy new files (rsync for efficiency)
for file in new_files {
rsync(file, remote)?;
}
// 3. Update sync metadata
write_sync_manifest(remote, &local_files)?;
Ok(())
}
fn sync_from_remote(local: &Path, remote: &Path) -> Result<()> {
// 1. Find new files in remote
let remote_files = scan_parquet_files(remote)?;
let local_files = scan_parquet_files(local)?;
let new_files: Vec<_> = remote_files
.difference(&local_files)
.collect();
// 2. Copy new files
for file in new_files {
rsync(remote.join(file), local)?;
}
// 3. Refresh DuckDB views
refresh_views()?;
Ok(())
}
blq Command Examples¶
Using blq with BIRD data¶
# Analyze all GCC builds from last week
shq sql "
SELECT stdout_file
FROM invocations
WHERE format_hint = 'gcc'
AND date >= current_date - 7
" | blq from - --format gcc \
| blq stats
# Find most common warnings
shq sql "
SELECT stdout_file, cmd
FROM invocations
WHERE format_hint = 'gcc'
" | blq from - --format gcc \
| blq where severity=warning \
| blq group message \
| blq sort count desc \
| blq head 10
# Compare error rates across dates
for date in 2024-12-{25..30}; do
echo "=== $date ==="
shq sql "
SELECT stdout_file
FROM invocations
WHERE date = '$date' AND format_hint = 'gcc'
" | blq from - --format gcc \
| blq where severity=error \
| blq count
done
Best Practices¶
1. Always Store Format Hint¶
When running commands, let shq detect and store the format:
Or specify explicitly:
2. Query Recent Data First¶
Recent data is faster to query:
-- Fast: Query recent only
SELECT * FROM bird.commands_recent WHERE exit_code != 0;
-- Slower: Query all history
SELECT * FROM bird.commands WHERE exit_code != 0;
3. Use Managed Files for Large Outputs¶
Configure threshold appropriately:
Files >1MB automatically stored separately, keeping parquet files small and fast.
4. Regular Compaction¶
Run compaction to keep query performance high:
5. Selective Sync¶
Only sync what you need:
# Sync only last 7 days
shq sync --since 7d
# Sync specific clients
shq sync --clients laptop,desktop
# Exclude large blob files (content-addressed storage)
shq sync --no-blobs
Part of the MAGIC ecosystem 🏀