jsonl-algebra: Production-Grade Relational Algebra for Nested JSON

December 18, 2024 · Updated: October 7, 2025

jsonl-algebra (command: ja) is a production-grade implementation of relational algebra for JSONL data. It’s the full-featured, battle-tested realization of dotsuite’s dotrelate component—bringing SQL-like operations to the command line with first-class support for nested JSON structures.

The Relationship to Dotsuite

In dotsuite’s architecture, dotrelate provides relational operations on document collections:

dotrelate: Join, union, project, difference on collections
Theoretical foundation: Relational algebra lifted to nested documents

jsonl-algebra (ja) is the production implementation of these concepts, with:

✅ Published on PyPI: pip install jsonl-algebra
✅ Battle-tested: Used in real-world data pipelines
✅ Feature-complete: All relational operations + aggregations
✅ Streaming: Process datasets larger than memory
✅ Schema tools: Inference and validation
✅ Interactive REPL: Build pipelines step-by-step

Core Insight: Relational Algebra for Nested Data

Traditional relational algebra assumes flat tables:

SELECT name, age FROM users WHERE age > 30

But modern JSON is deeply nested:

{
  "user": {
    "id": 1,
    "name": "Alice",
    "address": {
      "city": "NYC",
      "zip": "10001"
    }
  },
  "orders": [
    {"id": 101, "amount": 50}
  ]
}

jsonl-algebra bridges this gap by extending relational algebra with dot notation for nested access:

ja select 'user.age > 30' users.jsonl
ja project user.name,user.address.city users.jsonl
ja join users.jsonl orders.jsonl --on user.id=customer_id
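To make the dot notation concrete, here is a minimal Python sketch of how a dotted path could be resolved against a nested record. The helper name `get_path`, and its choice to return `None` when a segment is missing, are assumptions for illustration rather than ja's actual implementation.

```python
from typing import Any


def get_path(record: dict, path: str, default: Any = None) -> Any:
    """Hypothetical helper: follow a dot-separated path such as 'user.address.city'."""
    current: Any = record
    for key in path.split("."):
        # Stop as soon as a segment is missing or we hit a non-object value.
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


record = {
    "user": {"id": 1, "name": "Alice", "address": {"city": "NYC", "zip": "10001"}},
    "orders": [{"id": 101, "amount": 50}],
}

print(get_path(record, "user.address.city"))  # NYC
print(get_path(record, "user.age"))           # None (field is absent)
```

Treating a missing segment as a null-like value, instead of raising, is one reasonable design choice: it lets a predicate such as `user.age > 30` skip records that lack the field.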

The Five Core Operations

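The classical presentation of relational algebra takes selection, projection, Cartesian product, union, and difference as its primitives (with rename as a helper). ja exposes join in place of the raw product, giving five core operations: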

1. Selection (σ) — Filter Rows

Mathematical notation: $\sigma_{\text{predicate}}(R)$

# Filter where status is "active"
ja select 'status == `"active"`' data.jsonl

# Filter on nested fields
ja select 'user.age > 30' users.jsonl

# Complex boolean logic
ja select 'price < 100 and category == `"electronics"`' products.jsonl

Properties:

Commutative: $\sigma_{p_1}(\sigma_{p_2}(R)) = \sigma_{p_2}(\sigma_{p_1}(R))$
Combinable: $\sigma_{p_1}(\sigma_{p_2}(R)) = \sigma_{p_1 \land p_2}(R)$
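Both properties are easy to check on a small example. The sketch below models selection as a plain Python generator; the `select` function and the sample data are illustrative assumptions, not part of ja.

```python
def select(rows, predicate):
    """Selection: lazily keep the rows for which the predicate holds."""
    return (row for row in rows if predicate(row))


def is_cheap(row):
    return row["price"] < 100


def is_electronics(row):
    return row["category"] == "electronics"


products = [
    {"name": "cable", "price": 9, "category": "electronics"},
    {"name": "phone", "price": 499, "category": "electronics"},
    {"name": "mug", "price": 12, "category": "kitchen"},
]

# Commutative: the order of two filters does not matter.
a = list(select(select(products, is_cheap), is_electronics))
b = list(select(select(products, is_electronics), is_cheap))

# Combinable: two filters collapse into one conjunction.
c = list(select(products, lambda r: is_cheap(r) and is_electronics(r)))

print(a == b == c)  # True
print(a)            # [{'name': 'cable', 'price': 9, 'category': 'electronics'}]
```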

2. Projection (π) — Select/Compute Columns

Mathematical notation: $\pi_{\text{columns}}(R)$

# Pick specific fields
ja project id,name data.jsonl

# Access nested fields
ja project user.name,user.address.city users.jsonl

# Computed columns (coming soon)
ja project name,annual_income=salary*12 employees.jsonl

Properties:

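Cascading (for simple projections): an outer projection absorbs an inner one, $\pi_a(\pi_{a,b}(R)) = \pi_a(R)$. Idempotence is the special case $\pi_a(\pi_a(R)) = \pi_a(R)$.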

3. Join (⋈) — Combine Relations

Mathematical notation: $R \bowtie_{\text{condition}} S$

# Inner join on user ID
ja join users.jsonl orders.jsonl --on user.id=customer_id

# Join on nested fields
ja join posts.jsonl comments.jsonl --on post.id=comment.post_id

# Multiple join keys
ja join users.jsonl accounts.jsonl --on id=user_id,email=account_email

Properties:

Commutative: $R \bowtie S = S \bowtie R$
Associative: $(R \bowtie S) \bowtie T = R \bowtie (S \bowtie T)$

This means you can join multiple files in any order:

ja join users.jsonl orders.jsonl --on user.id=customer_id \
  | ja join - products.jsonl --on product_id=id
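For intuition about what an equi-join on nested keys involves, here is a rough Python sketch of a hash join. The function names and the merged output shape (a shallow merge of both records) are assumptions for illustration; ja's real output layout may differ.

```python
from collections import defaultdict


def get_path(record, path):
    """Hypothetical helper: follow a dot-separated path, None if absent."""
    current = record
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def hash_join(left, right, left_key, right_key):
    """Inner equi-join: index the right side, then stream the left side."""
    index = defaultdict(list)
    for row in right:
        key = get_path(row, right_key)
        if key is not None:  # rows without the key never match
            index[key].append(row)
    for row in left:
        for match in index.get(get_path(row, left_key), []):
            # Assumed output shape: a shallow merge of both records.
            yield {**row, **match}


users = [{"user": {"id": 1, "name": "Alice"}}, {"user": {"id": 2, "name": "Bob"}}]
orders = [
    {"order_id": 101, "customer_id": 1, "amount": 50},
    {"order_id": 102, "customer_id": 1, "amount": 75},
    {"order_id": 103, "customer_id": 2, "amount": 120},
]

joined = list(hash_join(users, orders, "user.id", "customer_id"))
print(len(joined))  # 3
print(joined[0])
# {'user': {'id': 1, 'name': 'Alice'}, 'order_id': 101, 'customer_id': 1, 'amount': 50}
```

This sketch always indexes the right-hand input; an implementation that wants to bound memory would index whichever input is smaller.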

4. Union (∪) — Combine All Rows

Mathematical notation: $R \cup S$

# Combine employees and contractors
ja union employees.jsonl contractors.jsonl

# Union multiple sources
ja union jan.jsonl feb.jsonl mar.jsonl

Properties:

Commutative: $R \cup S = S \cup R$
Associative: $(R \cup S) \cup T = R \cup (S \cup T)$
Identity: $R \cup \emptyset = R$

5. Difference (−) — Set Subtraction

Mathematical notation: $R - S$

# Find users without orders
ja difference users.jsonl customers_with_orders.jsonl --on id

# Set operations
ja difference all_items.jsonl sold_items.jsonl --on sku

Properties:

Not commutative: $R - S \neq S - R$
Right identity only: $R - \emptyset = R$, but $\emptyset - R = \emptyset$
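A keyed difference behaves like an anti-join: keep the left rows whose key never appears on the right. The sketch below is a minimal illustration with made-up data, not ja's implementation.

```python
def difference(left, right, key):
    """Keyed difference: rows of `left` whose `key` value is absent from `right`."""
    right_keys = {row[key] for row in right}
    return [row for row in left if row[key] not in right_keys]


users = [
    {"id": 1, "name": "Alice"},
    {"id": 2, "name": "Bob"},
    {"id": 3, "name": "Carol"},
]
customers_with_orders = [{"id": 1}, {"id": 2}]

print(difference(users, customers_with_orders, "id"))  # [{'id': 3, 'name': 'Carol'}]
print(difference(customers_with_orders, users, "id"))  # [] (not commutative)
print(difference(users, [], "id") == users)            # True (R minus the empty set is R)
```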

Aggregation Operations: Extending the Algebra

While the five core operations are complete, practical data processing needs aggregation:

GroupBy (γ) — Aggregate by Groups

Mathematical notation: $\gamma_{\text{group\_keys}, \text{aggregations}}(R)$

# Count by category
ja groupby category --agg count products.jsonl

# Sum by customer
ja groupby customer_id --agg sum:amount orders.jsonl

# Multiple aggregations
ja groupby department --agg avg:salary,count,max:age employees.jsonl

Supported aggregations:

count: Number of items
sum: Sum of values
avg: Average of values
min/max: Minimum/maximum
list: Collect all values into array
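The aggregations above can be sketched in a few lines of Python. The function `group_and_aggregate` and the output column names (`sum_amount`, `count`) are assumptions modeled on the CLI examples in this post, not ja's Python API.

```python
from collections import defaultdict

# One reducer per aggregation name from the list above.
AGGREGATORS = {
    "sum": sum,
    "avg": lambda values: sum(values) / len(values),
    "min": min,
    "max": max,
    "list": list,
}


def group_and_aggregate(rows, key, specs):
    """Group rows by `key`, then apply specs such as 'sum:amount' or 'count'."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    for value, members in groups.items():
        out = {key: value}
        for spec in specs:
            func, _, field = spec.partition(":")
            if func == "count":
                out["count"] = len(members)
            else:
                out[f"{func}_{field}"] = AGGREGATORS[func]([m[field] for m in members])
        yield out


orders = [
    {"order_id": 101, "customer_id": 1, "amount": 50},
    {"order_id": 102, "customer_id": 1, "amount": 75},
    {"order_id": 103, "customer_id": 2, "amount": 120},
]

summary = list(group_and_aggregate(orders, "customer_id", ["sum:amount", "count", "avg:amount"]))
print(summary[0])  # {'customer_id': 1, 'sum_amount': 125, 'count': 2, 'avg_amount': 62.5}
```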

Chained Grouping: An Innovation

In SQL, multi-level grouping is written either as one GROUP BY over several columns or as nested subqueries. jsonl-algebra instead lets groupby stages be chained in a pipeline, carrying the grouping along as metadata:

# Multi-level grouping: region, then product, then month, then one final aggregation.
# (No trailing comments after the backslashes: they would break line continuation.)
cat sales.jsonl \
  | ja groupby region \
  | ja groupby product \
  | ja groupby month \
  | ja agg total=sum:amount

How it works: Each groupby adds metadata without aggregating:

{
  "sale_id": 101,
  "region": "east",
  "product": "laptop",
  "month": "jan",
  "amount": 1200,
  "_groups": [
    {"field": "region", "value": "east"},
    {"field": "product", "value": "laptop"},
    {"field": "month", "value": "jan"}
  ],
  "_group_size": 5,
  "_group_index": 0
}

This preserves the relational structure while tracking grouping hierarchy—a key innovation that enables composition.
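The following Python sketch shows one way the metadata approach could work: each grouping stage only annotates rows, and a final stage aggregates per group path. The function names are hypothetical and the field names are copied from the record above; ja's internals may differ, and this version buffers all rows to compute group sizes.

```python
from collections import defaultdict


def annotate_group(rows, field):
    """Add one grouping level as metadata; rows are never collapsed."""
    buckets = defaultdict(list)
    for row in rows:
        # Extend the group path recorded by earlier stages, if any.
        path = tuple((g["field"], g["value"]) for g in row.get("_groups", []))
        buckets[path + ((field, row[field]),)].append(row)
    for path, members in buckets.items():
        for index, row in enumerate(members):
            yield {
                **row,
                "_groups": [{"field": f, "value": v} for f, v in path],
                "_group_size": len(members),
                "_group_index": index,
            }


def aggregate_sum(rows, field, name):
    """Final stage: sum `field` once per distinct group path."""
    totals = {}
    for row in rows:
        path = tuple((g["field"], g["value"]) for g in row["_groups"])
        totals[path] = totals.get(path, 0) + row[field]
    return [{**dict(path), name: total} for path, total in totals.items()]


sales = [
    {"region": "east", "product": "laptop", "amount": 1200},
    {"region": "east", "product": "laptop", "amount": 800},
    {"region": "east", "product": "phone", "amount": 500},
    {"region": "west", "product": "laptop", "amount": 1000},
]

staged = list(annotate_group(annotate_group(sales, "region"), "product"))
totals = aggregate_sum(staged, "amount", "total")
print(totals[0])  # {'region': 'east', 'product': 'laptop', 'total': 2000}
```

Because every stage emits ordinary records, any other operation (a select on `_group_size`, for example) can sit between grouping stages.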

Real-World Example: The 5-Minute Tour

Setup: Two JSONL files

users.jsonl:

{"user": {"id": 1, "name": "Alice"}}
{"user": {"id": 2, "name": "Bob"}}

orders.jsonl:

{"order_id": 101, "customer_id": 1, "amount": 50}
{"order_id": 102, "customer_id": 1, "amount": 75}
{"order_id": 103, "customer_id": 2, "amount": 120}

Goal: Total amount spent by each user

Solution:

# Join users and orders, then aggregate
ja join users.jsonl orders.jsonl --on user.id=customer_id \
  | ja groupby user.name --agg sum:amount

Output:

{"user.name": "Alice", "sum_amount": 125}
{"user.name": "Bob", "sum_amount": 120}

This demonstrates:

✅ Nested field access (user.id, user.name)
✅ Relational join
✅ Aggregation
✅ Unix-style piping

Additional Operations

Distinct — Remove Duplicates

# Unique users
ja distinct users.jsonl

# Distinct on specific fields
ja distinct --key user.id,email contacts.jsonl

Sort — Order Results

# Sort by age ascending
ja sort age users.jsonl

# Sort descending
ja sort age:desc users.jsonl

# Multi-key sort
ja sort department,salary:desc employees.jsonl

Rename — Field Aliasing

# Rename single field
ja rename id=user_id data.jsonl

# Rename nested fields
ja rename user.loc=user.location data.jsonl

Head/Tail — Limit Results

# First 10 rows
ja head 10 data.jsonl

# Last 10 rows
ja tail 10 data.jsonl

Schema Operations: Inference and Validation

Schema Inference

Automatically discover the structure of your data:

# Infer JSON Schema
ja schema infer users.jsonl > users_schema.json

Output:

{
  "type": "object",
  "properties": {
    "id": {"type": "integer"},
    "name": {"type": "string"},
    "age": {"type": "integer"},
    "address": {
      "type": "object",
      "properties": {
        "city": {"type": "string"},
        "zip": {"type": "string"}
      }
    }
  }
}
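As a rough idea of how inference could produce a schema like this, the Python sketch below infers a fragment per record and merges fragments across records. All names here (`infer_value`, `merge`, `infer_schema_sketch`) are hypothetical, and the handling of conflicting types (a sorted list of type names) is an assumption, not necessarily what ja emits.

```python
from functools import reduce


def infer_value(value):
    """Map one JSON value to a minimal JSON Schema fragment."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return {"type": "boolean"}
    if isinstance(value, int):
        return {"type": "integer"}
    if isinstance(value, float):
        return {"type": "number"}
    if isinstance(value, str):
        return {"type": "string"}
    if value is None:
        return {"type": "null"}
    if isinstance(value, list):
        return {"type": "array"}
    if isinstance(value, dict):
        return {"type": "object",
                "properties": {k: infer_value(v) for k, v in value.items()}}
    raise TypeError(f"not a JSON value: {value!r}")


def merge(a, b):
    """Combine two schema fragments that describe the same field."""
    if a == b:
        return a
    if a["type"] == "object" and b["type"] == "object":
        props = dict(a["properties"])
        for name, schema in b["properties"].items():
            props[name] = merge(props[name], schema) if name in props else schema
        return {"type": "object", "properties": props}
    # Conflicting types: record the union of the observed type names.
    names = set()
    for schema in (a, b):
        t = schema["type"]
        names.update(t if isinstance(t, list) else [t])
    return {"type": sorted(names)}


def infer_schema_sketch(records):
    """Fold per-record fragments into one schema (expects at least one record)."""
    return reduce(merge, (infer_value(r) for r in records))


records = [
    {"id": 1, "name": "Alice", "age": 30, "address": {"city": "NYC", "zip": "10001"}},
    {"id": 2, "name": "Bob", "age": None},
]
schema = infer_schema_sketch(records)
print(schema["properties"]["age"])  # {'type': ['integer', 'null']}
```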

Schema Validation

Validate data against a schema:

# Check if data conforms to schema
ja schema validate users_schema.json users.jsonl

This is crucial for:

Data quality checks
ETL pipeline validation
API contract testing

Format Conversion: CSV Integration

JSONL to CSV

# Flatten nested structure to CSV
ja to-csv users.jsonl > users.csv

# Select specific fields
ja project user.name,user.age users.jsonl | ja to-csv > simple.csv

Intelligent flattening:

{"user": {"name": "Alice", "age": 30}}

Becomes:

user.name,user.age
Alice,30
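The flattening step can be approximated in Python with a short recursive function and the standard `csv` module. This is a sketch of the idea only; the `flatten` helper is hypothetical, and how ja treats arrays or empty objects is not covered here.

```python
import csv
import io


def flatten(record, prefix=""):
    """Turn nested objects into a single level of dot-separated keys."""
    flat = {}
    for key, value in record.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


rows = [flatten({"user": {"name": "Alice", "age": 30}})]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(rows[0]), lineterminator="\n")
writer.writeheader()
writer.writerows(rows)

print(buffer.getvalue(), end="")
# user.name,user.age
# Alice,30
```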

CSV to JSONL

# Convert CSV to JSONL
ja from-csv users.csv > users.jsonl

Interactive REPL

Build and test pipelines interactively:

$ ja repl users.jsonl

ja> select 'age > 25'
# Preview first few results...

ja> project name,email
# Preview transformed data...

ja> groupby department --agg avg:salary
# See aggregated results...

ja> write output.jsonl
# Save pipeline results

The REPL is perfect for:

Exploring unfamiliar data
Testing query predicates
Iterative pipeline development
Teaching relational algebra concepts

Command-Line Examples

Log Analysis

# Find error logs from auth service
ja select 'level == `"ERROR"` and service == `"auth"`' app.log.jsonl

# Count errors by service
ja select 'level == `"ERROR"`' app.log.jsonl \
  | ja groupby service --agg count

ETL Pipeline

# Extract, transform, load
ja select 'status == `"completed"`' sales.jsonl \
  | ja project customer_id,amount,date \
  | ja join - customers.jsonl --on customer_id=id \
  | ja groupby customer.region --agg sum:amount \
  | ja sort sum_amount:desc \
  | ja head 10

Data Quality Checks

# Find users without email
ja select 'not (email != `null`)' users.jsonl

# Find duplicate user IDs
ja groupby user_id users.jsonl \
  | ja select '_group_size > 1'

Report Generation

# Monthly sales report
ja join orders.jsonl products.jsonl --on product_id=id \
  | ja groupby month,category --agg sum:revenue,count:orders \
  | ja sort month,sum_revenue:desc \
  | ja to-csv > monthly_report.csv

Python API

Use ja programmatically:

from ja.core import read_jsonl, join, groupby_agg, select_rows
from ja.schema import infer_schema

# Load data
users = read_jsonl("users.jsonl")
orders = read_jsonl("orders.jsonl")

# Join
joined = join(users, orders, on=[("user.id", "customer_id")])

# Filter
high_value = select_rows(joined, "amount > 100")

# Aggregate
result = groupby_agg(
    high_value,
    group_by_key="user.name",
    aggregations=[("sum", "amount"), ("count", None)]
)

print(list(result))

# Schema inference
schema = infer_schema(users)

Integration with Ecosystem

jsonl-algebra works seamlessly with other tools:

With JAF (filtering)

# JAF for boolean filtering, ja for relational ops
jaf filter users.jsonl '(eq? @status "active")' --eval \
  | ja join - orders.jsonl --on id=user_id \
  | ja groupby name --agg sum:amount

With jq (JSON manipulation)

# ja for relational ops, jq for complex transformations
ja join users.jsonl orders.jsonl --on id=user_id \
  | jq '.metadata = {processed: now, version: "1.0"}'

With standard Unix tools

# Combine with grep, sort, uniq
ja project email users.jsonl \
  | grep '@gmail.com' \
  | sort \
  | uniq -c

Theoretical Foundation: Relational Algebra

jsonl-algebra implements Codd’s relational algebra extended for nested data:

Completeness

The five operations (σ, π, ⋈, ∪, −), together with rename, are relationally complete in Codd’s sense: any query expressible in the relational calculus can be built from these primitives.

Closure Property

Every operation produces a valid relation:

$$\text{Relation} \xrightarrow{\text{Operation}} \text{Relation}$$

This ensures composability: R | op1 | op2 | op3 always works.

Query Optimization

Because operations have well-defined properties, queries can be optimized:

Selection pushdown:

$$\sigma_p(R \bowtie S) \equiv (\sigma_p(R)) \bowtie S \quad \text{(if p only references R)}$$

Projection elimination:

$$\pi_a(\pi_{a,b}(R)) \equiv \pi_a(R)$$

These properties enable automatic query optimization (future work).
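Selection pushdown can be checked by hand on toy data. The Python sketch below uses a deliberately naive nested-loop join and made-up records; it illustrates the equivalence and is not a description of any optimizer in ja.

```python
def join(left, right, left_key, right_key):
    """Naive nested-loop equi-join, enough to compare two query plans."""
    return [{**l, **r} for l in left for r in right if l[left_key] == r[right_key]]


def select(rows, predicate):
    return [row for row in rows if predicate(row)]


def over_30(row):
    # The predicate only references a field that comes from `users`.
    return row["age"] > 30


users = [{"id": 1, "name": "Alice", "age": 34}, {"id": 2, "name": "Bob", "age": 27}]
orders = [
    {"order_id": 101, "user_id": 1, "amount": 50},
    {"order_id": 102, "user_id": 2, "amount": 120},
]

filter_late = select(join(users, orders, "id", "user_id"), over_30)
filter_early = join(select(users, over_30), orders, "id", "user_id")

print(filter_late == filter_early)  # True
print(len(select(users, over_30)))  # 1 (the join now sees one user instead of two)
```

Filtering first gives the same answer while handing the join a smaller input, which is the whole point of the rewrite.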

Nested Data Extension

Traditional relational algebra assumes flat tuples. jsonl-algebra extends this with:

Path Expressions

Replace column names with dot-separated paths:

$$\sigma_{\text{user.age} > 30}(R) \quad \text{vs} \quad \sigma_{\text{age} > 30}(R)$$

Deep Equality

Join conditions can reference nested fields:

$$R \bowtie_{\text{user.id} = \text{order.customer\_id}} S$$

Projection Flattening

Projection can flatten nested structures:

$$\pi_{\text{user.name as name}, \text{user.address.city as city}}(R)$$

Performance Characteristics

Operation	Memory	Time	Notes
select	O(1)	O(n)	Streaming filter
project	O(1)	O(n)	Field extraction
join	O(min(R,S))	O(R+S) expected, O(R×S) worst case	Hash join (smaller table in memory); the worst case is when many rows share a key
union	O(1)	O(R+S)	Streaming concatenation
distinct	O(k)	O(n)	k = unique items
groupby	O(g)	O(n)	g = number of groups
sort	O(n)	O(n log n)	External sort for large data

jsonl-algebra is designed for:

✅ Streaming operations where possible
✅ Bounded memory for most operations
✅ Efficient joins via hash tables
✅ External sorting for datasets larger than RAM
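The constant-memory claim for streaming operations follows from processing one line at a time. Here is a minimal Python sketch using generators; `iter_jsonl` and `select` are illustrative names chosen for this example, not functions from ja.

```python
import io
import json


def iter_jsonl(stream):
    """Yield one parsed record per non-empty line; nothing is buffered."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


def select(rows, predicate):
    """Streaming filter: holds a single record at a time."""
    return (row for row in rows if predicate(row))


# An in-memory stand-in for a file or stdin.
source = io.StringIO(
    '{"name": "Alice", "age": 34}\n'
    '{"name": "Bob", "age": 27}\n'
    '{"name": "Carol", "age": 41}\n'
)

matches = [row["name"] for row in select(iter_jsonl(source), lambda r: r["age"] > 30)]
print(matches)  # ['Alice', 'Carol']
```

Operations such as sort or groupby cannot work this way, because they need to see every record before emitting the first result.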

When to Use jsonl-algebra

Use ja when:

✅ Working with JSONL data
✅ Need SQL-like operations on nested JSON
✅ Building command-line data pipelines
✅ Joining multiple data sources
✅ Aggregating and grouping data
✅ Want schema inference/validation

Use something else when:

❌ Need a persistent database → PostgreSQL, MongoDB
❌ Complex analytical queries → DuckDB, ClickHouse
❌ Real-time streaming → Apache Kafka, Flink
❌ Just filtering/mapping → JAF

Combine with:

JAF: Boolean filtering before relational ops
jq: Complex JSON transformations
SQL databases: Load processed data

Design Philosophy

🎯 Unix Philosophy: Do one thing well (relational algebra)

🔗 Composability: Every operation produces a valid relation

📦 Streaming First: Process data without loading into memory

🛡️ Type Safety: Schema inference and validation

📚 Pedagogical: Clear mapping to mathematical operations

⚡ Production-Ready: Battle-tested, published on PyPI

Quick Start

# Install
pip install jsonl-algebra

# Basic usage
ja select 'age > 30' users.jsonl
ja project name,email users.jsonl
ja join users.jsonl orders.jsonl --on id=user_id

# Interactive mode
ja repl data.jsonl

Comparison: dotsuite vs jsonl-algebra

Feature	dotsuite (dotrelate)	jsonl-algebra (ja)
Status	Pedagogical concept	Production tool
Distribution	Source code	PyPI package
Operations	Basic join, union	Full relational algebra
Aggregation	Planned	Complete (sum, avg, count, etc.)
Schema	None	Inference + validation
Interactive	No	REPL included
Format conversion	No	CSV support
Documentation	Concept docs	Full CLI reference
Testing	Basic	Comprehensive test suite

Recommendation: Learn relational concepts through dotsuite, use ja for production.

Resources

PyPI: pypi.org/project/jsonl-algebra/
Repository: github.com/queelius/jsonl-algebra
Quick Start: docs/quickstart.md
Concepts: docs/concepts/jsonl-algebra.md

License

MIT

jsonl-algebra: Bringing the power of relational algebra to nested JSON, one operation at a time. From SQL’s flat tables to JSON’s nested structures, with full mathematical foundations.
