jsonl-algebra (command: ja) is a command-line implementation of relational algebra for JSONL data. It’s the production version of dotsuite’s dotrelate component: SQL-like operations on the command line with first-class support for nested JSON structures.
The Relationship to Dotsuite
In dotsuite’s architecture, dotrelate provides relational operations on document collections: join, union, project, difference. jsonl-algebra (ja) is the production implementation of those concepts, published on PyPI, with all relational operations plus aggregations, streaming support, schema tools, and an interactive REPL.
The Core Insight
Traditional relational algebra assumes flat tables:
SELECT name, age FROM users WHERE age > 30
But real-world JSON is deeply nested:
{
"user": {
"id": 1,
"name": "Alice",
"address": {
"city": "NYC",
"zip": "10001"
}
},
"orders": [
{"id": 101, "amount": 50}
]
}
jsonl-algebra bridges this gap by extending relational algebra with dot notation for nested access:
ja select 'user.age > 30' users.jsonl
ja project user.name,user.address.city users.jsonl
ja join users.jsonl orders.jsonl --on user.id=customer_id
The Five Core Operations
Relational algebra has five fundamental operations that form a complete algebra. Everything else is derived.
1. Selection (filter rows)
Mathematical notation:
# Filter where status is "active"
ja select 'status == `"active"`' data.jsonl
# Filter on nested fields
ja select 'user.age > 30' users.jsonl
# Complex boolean logic
ja select 'price < 100 and category == `"electronics"`' products.jsonl
Selection is commutative () and combinable ().
2. Projection (select/compute columns)
Mathematical notation:
# Pick specific fields
ja project id,name data.jsonl
# Access nested fields
ja project user.name,user.address.city users.jsonl
# Computed columns (coming soon)
ja project name,annual_income=salary*12 employees.jsonl
Idempotent for simple projections: .
3. Join (combine relations)
Mathematical notation:
# Inner join on user ID
ja join users.jsonl orders.jsonl --on user.id=customer_id
# Join on nested fields
ja join posts.jsonl comments.jsonl --on post.id=comment.post_id
# Multiple join keys
ja join users.jsonl accounts.jsonl --on id=user_id,email=account_email
Commutative and associative, so you can join multiple files in any order:
ja join users.jsonl orders.jsonl --on user.id=customer_id \
| ja join - products.jsonl --on product_id=id
4. Union (combine all rows)
Mathematical notation:
# Combine employees and contractors
ja union employees.jsonl contractors.jsonl
# Union multiple sources
ja union jan.jsonl feb.jsonl mar.jsonl
5. Difference (set subtraction)
Mathematical notation:
# Find users without orders
ja difference users.jsonl customers_with_orders.jsonl --on id
# Set operations
ja difference all_items.jsonl sold_items.jsonl --on sku
Not commutative: . This is the one that isn’t symmetric.
Aggregation Operations
The five core operations are complete in the theoretical sense, but practical data processing needs aggregation.
GroupBy
# Count by category
ja groupby category --agg count products.jsonl
# Sum by customer
ja groupby customer_id --agg sum:amount orders.jsonl
# Multiple aggregations
ja groupby department --agg avg:salary,count,max:age employees.jsonl
Supported aggregations: count, sum, avg, min, max, list.
Chained Grouping
Traditional SQL requires nested subqueries for multi-level grouping. jsonl-algebra enables chained grouping through metadata preservation:
# Multi-level grouping
cat sales.jsonl \
| ja groupby region \
| ja groupby product \
| ja groupby month \
| ja agg total=sum:amount
Each groupby adds metadata without aggregating:
{
"sale_id": 101,
"region": "east",
"product": "laptop",
"month": "jan",
"amount": 1200,
"_groups": [
{"field": "region", "value": "east"},
{"field": "product", "value": "laptop"},
{"field": "month", "value": "jan"}
],
"_group_size": 5,
"_group_index": 0
}
This preserves the relational structure while tracking grouping hierarchy. It’s what makes the chaining composable.
The 5-Minute Tour
Two JSONL files:
users.jsonl:
{"user": {"id": 1, "name": "Alice"}}
{"user": {"id": 2, "name": "Bob"}}
orders.jsonl:
{"order_id": 101, "customer_id": 1, "amount": 50}
{"order_id": 102, "customer_id": 1, "amount": 75}
{"order_id": 103, "customer_id": 2, "amount": 120}
Goal: total amount spent by each user.
# Join users and orders, then aggregate
ja join users.jsonl orders.jsonl --on user.id=customer_id \
| ja groupby user.name --agg sum:amount
Output:
{"user.name": "Alice", "sum_amount": 125}
{"user.name": "Bob", "sum_amount": 120}
Nested field access, relational join, aggregation, Unix-style piping. That’s the whole idea.
Additional Operations
Distinct
ja distinct users.jsonl
ja distinct --key user.id,email contacts.jsonl
Sort
ja sort age users.jsonl
ja sort age:desc users.jsonl
ja sort department,salary:desc employees.jsonl
Rename
ja rename id=user_id data.jsonl
ja rename user.loc=user.location data.jsonl
Head/Tail
ja head 10 data.jsonl
ja tail 10 data.jsonl
Schema Operations
Inference
Automatically discover the structure of your data:
ja schema infer users.jsonl > users_schema.json
Output:
{
"type": "object",
"properties": {
"id": {"type": "integer"},
"name": {"type": "string"},
"age": {"type": "integer"},
"address": {
"type": "object",
"properties": {
"city": {"type": "string"},
"zip": {"type": "string"}
}
}
}
}
Validation
ja schema validate users_schema.json users.jsonl
Useful for data quality checks, ETL pipeline validation, and API contract testing.
Format Conversion
JSONL to CSV
ja to-csv users.jsonl > users.csv
ja project user.name,user.age users.jsonl | ja to-csv > simple.csv
Nested structures get flattened intelligently: {"user": {"name": "Alice", "age": 30}} becomes user.name,user.age\nAlice,30.
CSV to JSONL
ja from-csv users.csv > users.jsonl
Interactive REPL
Build and test pipelines interactively:
$ ja repl users.jsonl
ja> select 'age > 25'
# Preview first few results...
ja> project name,email
# Preview transformed data...
ja> groupby department --agg avg:salary
# See aggregated results...
ja> write output.jsonl
# Save pipeline results
Good for exploring unfamiliar data, testing query predicates, and iterative pipeline development.
Command-Line Examples
Log Analysis
ja select 'level == `"ERROR"` and service == `"auth"`' app.log.jsonl
ja select 'level == `"ERROR"`' app.log.jsonl \
| ja groupby service --agg count
ETL Pipeline
ja select 'status == `"completed"`' sales.jsonl \
| ja project customer_id,amount,date \
| ja join - customers.jsonl --on customer_id=id \
| ja groupby customer.region --agg sum:amount \
| ja sort sum_amount:desc \
| ja head 10
Data Quality Checks
# Find users without email
ja select 'not (email != `null`)' users.jsonl
# Find duplicate user IDs
ja groupby user_id users.jsonl \
| ja select '_group_size > 1'
Report Generation
ja join orders.jsonl products.jsonl --on product_id=id \
| ja groupby month,category --agg sum:revenue,count:orders \
| ja sort month,sum_revenue:desc \
| ja to-csv > monthly_report.csv
Python API
from ja.core import read_jsonl, join, groupby_agg, select_rows
from ja.schema import infer_schema
# Load data
users = read_jsonl("users.jsonl")
orders = read_jsonl("orders.jsonl")
# Join
joined = join(users, orders, on=[("user.id", "customer_id")])
# Filter
high_value = select_rows(joined, "amount > 100")
# Aggregate
result = groupby_agg(
high_value,
group_by_key="user.name",
aggregations=[("sum", "amount"), ("count", None)]
)
print(list(result))
# Schema inference
schema = infer_schema(users)
Integration with Ecosystem
With JAF (filtering)
jaf filter users.jsonl '(eq? @status "active")' --eval \
| ja join - orders.jsonl --on id=user_id \
| ja groupby name --agg sum:amount
With jq (JSON manipulation)
ja join users.jsonl orders.jsonl --on id=user_id \
| jq '.metadata = {processed: now, version: "1.0"}'
With standard Unix tools
ja project email users.jsonl \
| grep '@gmail.com' \
| sort \
| uniq -c
Theoretical Foundation
jsonl-algebra implements Codd’s relational algebra extended for nested data.
Completeness
The five operations (selection, projection, join, union, difference) form a complete algebra. Any relational query can be expressed using these primitives.
Closure Property
Every operation produces a valid relation:
This ensures composability: R | op1 | op2 | op3 always works.
Query Optimization
Because operations have well-defined algebraic properties, queries can be optimized:
Selection pushdown:
Projection elimination:
These enable automatic query optimization (future work).
Nested Data Extension
Traditional relational algebra assumes flat tuples. jsonl-algebra extends this with path expressions (replace column names with dot-separated paths), deep equality (join conditions can reference nested fields), and projection flattening (projection can flatten nested structures).
Performance Characteristics
| Operation | Memory | Time | Notes |
|---|---|---|---|
| select | O(1) | O(n) | Streaming filter |
| project | O(1) | O(n) | Field extraction |
| join | O(min(R,S)) | O(R*S) | Hash join (smaller table in memory) |
| union | O(1) | O(R+S) | Streaming concatenation |
| distinct | O(k) | O(n) | k = unique items |
| groupby | O(g) | O(n) | g = number of groups |
| sort | O(n) | O(n log n) | External sort for large data |
When to Use jsonl-algebra
Use ja for working with JSONL data, SQL-like operations on nested JSON, command-line data pipelines, joining multiple data sources, aggregating and grouping data, and schema inference/validation.
Use something else for persistent databases (PostgreSQL, MongoDB), complex analytical queries (DuckDB, ClickHouse), real-time streaming (Kafka, Flink), or if you just need filtering and mapping (JAF).
Quick Start
# Install
pip install jsonl-algebra
# Basic usage
ja select 'age > 30' users.jsonl
ja project name,email users.jsonl
ja join users.jsonl orders.jsonl --on id=user_id
# Interactive mode
ja repl data.jsonl
Resources
- PyPI: pypi.org/project/jsonl-algebra/
- Repository: github.com/queelius/jsonl-algebra
- Quick Start: docs/quickstart.md
- Concepts: docs/concepts/jsonl-algebra.md
License
MIT
Discussion