jsonl-algebra (command: ja
) is a production-grade implementation of relational algebra for JSONL data. It’s the full-featured, battle-tested realization of dotsuite’s dotrelate component—bringing SQL-like operations to the command line with first-class support for nested JSON structures.
The Relationship to Dotsuite
In dotsuite’s architecture, dotrelate provides relational operations on document collections:
- dotrelate: Join, union, project, difference on collections
- Theoretical foundation: Relational algebra lifted to nested documents
jsonl-algebra (ja
) is the production implementation of these concepts, with:
- ✅ Published on PyPI:
pip install jsonl-algebra
- ✅ Battle-tested: Used in real-world data pipelines
- ✅ Feature-complete: All relational operations + aggregations
- ✅ Streaming: Process datasets larger than memory
- ✅ Schema tools: Inference and validation
- ✅ Interactive REPL: Build pipelines step-by-step
Core Insight: Relational Algebra for Nested Data
Traditional relational algebra assumes flat tables:
SELECT name, age FROM users WHERE age > 30
But modern JSON is deeply nested:
{
"user": {
"id": 1,
"name": "Alice",
"address": {
"city": "NYC",
"zip": "10001"
}
},
"orders": [
{"id": 101, "amount": 50}
]
}
jsonl-algebra bridges this gap by extending relational algebra with dot notation for nested access:
ja select 'user.age > 30' users.jsonl
ja project user.name,user.address.city users.jsonl
ja join users.jsonl orders.jsonl --on user.id=customer_id
The Five Core Operations
Relational algebra has five fundamental operations that form a complete algebra:
1. Selection (σ) — Filter Rows
Mathematical notation:
# Filter where status is "active"
ja select 'status == `"active"`' data.jsonl
# Filter on nested fields
ja select 'user.age > 30' users.jsonl
# Complex boolean logic
ja select 'price < 100 and category == `"electronics"`' products.jsonl
Properties:
- Commutative:
- Combinable:
2. Projection (π) — Select/Compute Columns
Mathematical notation:
# Pick specific fields
ja project id,name data.jsonl
# Access nested fields
ja project user.name,user.address.city users.jsonl
# Computed columns (coming soon)
ja project name,annual_income=salary*12 employees.jsonl
Properties:
- Idempotent (for simple projections):
3. Join (⋈) — Combine Relations
Mathematical notation:
# Inner join on user ID
ja join users.jsonl orders.jsonl --on user.id=customer_id
# Join on nested fields
ja join posts.jsonl comments.jsonl --on post.id=comment.post_id
# Multiple join keys
ja join users.jsonl accounts.jsonl --on id=user_id,email=account_email
Properties:
- Commutative:
- Associative:
This means you can join multiple files in any order:
ja join users.jsonl orders.jsonl --on user.id=customer_id \
| ja join - products.jsonl --on product_id=id
4. Union (∪) — Combine All Rows
Mathematical notation:
# Combine employees and contractors
ja union employees.jsonl contractors.jsonl
# Union multiple sources
ja union jan.jsonl feb.jsonl mar.jsonl
Properties:
- Commutative:
- Associative:
- Identity:
5. Difference (−) — Set Subtraction
Mathematical notation:
# Find users without orders
ja difference users.jsonl customers_with_orders.jsonl --on id
# Set operations
ja difference all_items.jsonl sold_items.jsonl --on sku
Properties:
- Not commutative:
- Anti-identity: , but
Aggregation Operations: Extending the Algebra
While the five core operations are complete, practical data processing needs aggregation:
GroupBy (γ) — Aggregate by Groups
Mathematical notation:
# Count by category
ja groupby category --agg count products.jsonl
# Sum by customer
ja groupby customer_id --agg sum:amount orders.jsonl
# Multiple aggregations
ja groupby department --agg avg:salary,count,max:age employees.jsonl
Supported aggregations:
count
: Number of itemssum
: Sum of valuesavg
: Average of valuesmin
/max
: Minimum/maximumlist
: Collect all values into array
Chained Grouping: An Innovation
Traditional SQL requires nested subqueries for multi-level grouping. jsonl-algebra enables chained grouping through metadata preservation:
# Multi-level grouping
cat sales.jsonl \
| ja groupby region \ # First level
| ja groupby product \ # Second level
| ja groupby month \ # Third level
| ja agg total=sum:amount # Final aggregation
How it works: Each groupby
adds metadata without aggregating:
{
"sale_id": 101,
"region": "east",
"product": "laptop",
"month": "jan",
"amount": 1200,
"_groups": [
{"field": "region", "value": "east"},
{"field": "product", "value": "laptop"},
{"field": "month", "value": "jan"}
],
"_group_size": 5,
"_group_index": 0
}
This preserves the relational structure while tracking grouping hierarchy—a key innovation that enables composition.
Real-World Example: The 5-Minute Tour
Setup: Two JSONL files
users.jsonl
:
{"user": {"id": 1, "name": "Alice"}}
{"user": {"id": 2, "name": "Bob"}}
orders.jsonl
:
{"order_id": 101, "customer_id": 1, "amount": 50}
{"order_id": 102, "customer_id": 1, "amount": 75}
{"order_id": 103, "customer_id": 2, "amount": 120}
Goal: Total amount spent by each user
Solution:
# Join users and orders, then aggregate
ja join users.jsonl orders.jsonl --on user.id=customer_id \
| ja groupby user.name --agg sum:amount
Output:
{"user.name": "Alice", "sum_amount": 125}
{"user.name": "Bob", "sum_amount": 120}
This demonstrates:
- ✅ Nested field access (
user.id
,user.name
) - ✅ Relational join
- ✅ Aggregation
- ✅ Unix-style piping
Additional Operations
Distinct — Remove Duplicates
# Unique users
ja distinct users.jsonl
# Distinct on specific fields
ja distinct --key user.id,email contacts.jsonl
Sort — Order Results
# Sort by age ascending
ja sort age users.jsonl
# Sort descending
ja sort age:desc users.jsonl
# Multi-key sort
ja sort department,salary:desc employees.jsonl
Rename — Field Aliasing
# Rename single field
ja rename id=user_id data.jsonl
# Rename nested fields
ja rename user.loc=user.location data.jsonl
Head/Tail — Limit Results
# First 10 rows
ja head 10 data.jsonl
# Last 10 rows
ja tail 10 data.jsonl
Schema Operations: Inference and Validation
Schema Inference
Automatically discover the structure of your data:
# Infer JSON Schema
ja schema infer users.jsonl > users_schema.json
Output:
{
"type": "object",
"properties": {
"id": {"type": "integer"},
"name": {"type": "string"},
"age": {"type": "integer"},
"address": {
"type": "object",
"properties": {
"city": {"type": "string"},
"zip": {"type": "string"}
}
}
}
}
Schema Validation
Validate data against a schema:
# Check if data conforms to schema
ja schema validate users_schema.json users.jsonl
This is crucial for:
- Data quality checks
- ETL pipeline validation
- API contract testing
Format Conversion: CSV Integration
JSONL to CSV
# Flatten nested structure to CSV
ja to-csv users.jsonl > users.csv
# Select specific fields
ja project user.name,user.age users.jsonl | ja to-csv > simple.csv
Intelligent flattening:
{"user": {"name": "Alice", "age": 30}}
Becomes:
user.name,user.age
Alice,30
CSV to JSONL
# Convert CSV to JSONL
ja from-csv users.csv > users.jsonl
Interactive REPL
Build and test pipelines interactively:
$ ja repl users.jsonl
ja> select 'age > 25'
# Preview first few results...
ja> project name,email
# Preview transformed data...
ja> groupby department --agg avg:salary
# See aggregated results...
ja> write output.jsonl
# Save pipeline results
The REPL is perfect for:
- Exploring unfamiliar data
- Testing query predicates
- Iterative pipeline development
- Teaching relational algebra concepts
Command-Line Examples
Log Analysis
# Find error logs from auth service
ja select 'level == `"ERROR"` and service == `"auth"`' app.log.jsonl
# Count errors by service
ja select 'level == `"ERROR"`' app.log.jsonl \
| ja groupby service --agg count
ETL Pipeline
# Extract, transform, load
ja select 'status == `"completed"`' sales.jsonl \
| ja project customer_id,amount,date \
| ja join - customers.jsonl --on customer_id=id \
| ja groupby customer.region --agg sum:amount \
| ja sort sum_amount:desc \
| ja head 10
Data Quality Checks
# Find users without email
ja select 'not (email != `null`)' users.jsonl
# Find duplicate user IDs
ja groupby user_id users.jsonl \
| ja select '_group_size > 1'
Report Generation
# Monthly sales report
ja join orders.jsonl products.jsonl --on product_id=id \
| ja groupby month,category --agg sum:revenue,count:orders \
| ja sort month,sum_revenue:desc \
| ja to-csv > monthly_report.csv
Python API
Use ja
programmatically:
from ja.core import read_jsonl, join, groupby_agg, select_rows
from ja.schema import infer_schema
# Load data
users = read_jsonl("users.jsonl")
orders = read_jsonl("orders.jsonl")
# Join
joined = join(users, orders, on=[("user.id", "customer_id")])
# Filter
high_value = select_rows(joined, "amount > 100")
# Aggregate
result = groupby_agg(
high_value,
group_by_key="user.name",
aggregations=[("sum", "amount"), ("count", None)]
)
print(list(result))
# Schema inference
schema = infer_schema(users)
Integration with Ecosystem
jsonl-algebra works seamlessly with other tools:
With JAF (filtering)
# JAF for boolean filtering, ja for relational ops
jaf filter users.jsonl '(eq? @status "active")' --eval \
| ja join - orders.jsonl --on id=user_id \
| ja groupby name --agg sum:amount
With jq (JSON manipulation)
# ja for relational ops, jq for complex transformations
ja join users.jsonl orders.jsonl --on id=user_id \
| jq '.metadata = {processed: now, version: "1.0"}'
With standard Unix tools
# Combine with grep, sort, uniq
ja project email users.jsonl \
| grep '@gmail.com' \
| sort \
| uniq -c
Theoretical Foundation: Relational Algebra
jsonl-algebra implements Codd’s relational algebra extended for nested data:
Completeness
The five operations (σ, π, ⋈, ∪, −) form a complete algebra—any relational query can be expressed using these primitives.
Closure Property
Every operation produces a valid relation:
This ensures composability: R | op1 | op2 | op3
always works.
Query Optimization
Because operations have well-defined properties, queries can be optimized:
Selection pushdown:
Projection elimination:
These properties enable automatic query optimization (future work).
Nested Data Extension
Traditional relational algebra assumes flat tuples. jsonl-algebra extends this with:
Path Expressions
Replace column names with dot-separated paths:
Deep Equality
Join conditions can reference nested fields:
Projection Flattening
Projection can flatten nested structures:
Performance Characteristics
Operation | Memory | Time | Notes |
---|---|---|---|
select | O(1) | O(n) | Streaming filter |
project | O(1) | O(n) | Field extraction |
join | O(min(R,S)) | O(R×S) | Hash join (smaller table in memory) |
union | O(1) | O(R+S) | Streaming concatenation |
distinct | O(k) | O(n) | k = unique items |
groupby | O(g) | O(n) | g = number of groups |
sort | O(n) | O(n log n) | External sort for large data |
jsonl-algebra is designed for:
- ✅ Streaming operations where possible
- ✅ Bounded memory for most operations
- ✅ Efficient joins via hash tables
- ✅ External sorting for datasets larger than RAM
When to Use jsonl-algebra
Use ja
when:
- ✅ Working with JSONL data
- ✅ Need SQL-like operations on nested JSON
- ✅ Building command-line data pipelines
- ✅ Joining multiple data sources
- ✅ Aggregating and grouping data
- ✅ Want schema inference/validation
Use something else when:
- ❌ Need a persistent database → PostgreSQL, MongoDB
- ❌ Complex analytical queries → DuckDB, ClickHouse
- ❌ Real-time streaming → Apache Kafka, Flink
- ❌ Just filtering/mapping → JAF
Combine with:
- JAF: Boolean filtering before relational ops
- jq: Complex JSON transformations
- SQL databases: Load processed data
Design Philosophy
🎯 Unix Philosophy: Do one thing well (relational algebra)
🔗 Composability: Every operation produces a valid relation
📦 Streaming First: Process data without loading into memory
🛡️ Type Safety: Schema inference and validation
📚 Pedagogical: Clear mapping to mathematical operations
⚡ Production-Ready: Battle-tested, published on PyPI
Quick Start
# Install
pip install jsonl-algebra
# Basic usage
ja select 'age > 30' users.jsonl
ja project name,email users.jsonl
ja join users.jsonl orders.jsonl --on id=user_id
# Interactive mode
ja repl data.jsonl
Comparison: dotsuite vs jsonl-algebra
Feature | dotsuite (dotrelate) | jsonl-algebra (ja) |
---|---|---|
Status | Pedagogical concept | Production tool |
Distribution | Source code | PyPI package |
Operations | Basic join, union | Full relational algebra |
Aggregation | Planned | Complete (sum, avg, count, etc.) |
Schema | None | Inference + validation |
Interactive | No | REPL included |
Format conversion | No | CSV support |
Documentation | Concept docs | Full CLI reference |
Testing | Basic | Comprehensive test suite |
Recommendation: Learn relational concepts through dotsuite, use ja
for production.
Resources
- PyPI: pypi.org/project/jsonl-algebra/
- Repository: github.com/queelius/jsonl-algebra
- Quick Start: docs/quickstart.md
- Concepts: docs/concepts/jsonl-algebra.md
License
MIT
jsonl-algebra: Bringing the power of relational algebra to nested JSON, one operation at a time. From SQL’s flat tables to JSON’s nested structures, with full mathematical foundations.
Discussion