Skip to main content

jsonl-algebra: Production-Grade Relational Algebra for Nested JSON

jsonl-algebra (command: ja) is a production-grade implementation of relational algebra for JSONL data. It’s the full-featured, battle-tested realization of dotsuite’s dotrelate component—bringing SQL-like operations to the command line with first-class support for nested JSON structures.

The Relationship to Dotsuite

In dotsuite’s architecture, dotrelate provides relational operations on document collections:

  • dotrelate: Join, union, project, difference on collections
  • Theoretical foundation: Relational algebra lifted to nested documents

jsonl-algebra (ja) is the production implementation of these concepts, with:

  • Published on PyPI: pip install jsonl-algebra
  • Battle-tested: Used in real-world data pipelines
  • Feature-complete: All relational operations + aggregations
  • Streaming: Process datasets larger than memory
  • Schema tools: Inference and validation
  • Interactive REPL: Build pipelines step-by-step

Core Insight: Relational Algebra for Nested Data

Traditional relational algebra assumes flat tables:

SELECT name, age FROM users WHERE age > 30

But modern JSON is deeply nested:

{
  "user": {
    "id": 1,
    "name": "Alice",
    "address": {
      "city": "NYC",
      "zip": "10001"
    }
  },
  "orders": [
    {"id": 101, "amount": 50}
  ]
}

jsonl-algebra bridges this gap by extending relational algebra with dot notation for nested access:

ja select 'user.age > 30' users.jsonl
ja project user.name,user.address.city users.jsonl
ja join users.jsonl orders.jsonl --on user.id=customer_id

The Five Core Operations

Relational algebra has five fundamental operations that form a complete algebra:

1. Selection (σ) — Filter Rows

Mathematical notation: σpredicate(R)\sigma_{\text{predicate}}(R)

# Filter where status is "active"
ja select 'status == `"active"`' data.jsonl

# Filter on nested fields
ja select 'user.age > 30' users.jsonl

# Complex boolean logic
ja select 'price < 100 and category == `"electronics"`' products.jsonl

Properties:

  • Commutative: σp1(σp2(R))=σp2(σp1(R))\sigma_{p_1}(\sigma_{p_2}(R)) = \sigma_{p_2}(\sigma_{p_1}(R))
  • Combinable: σp1(σp2(R))=σp1p2(R)\sigma_{p_1}(\sigma_{p_2}(R)) = \sigma_{p_1 \land p_2}(R)

2. Projection (π) — Select/Compute Columns

Mathematical notation: πcolumns(R)\pi_{\text{columns}}(R)

# Pick specific fields
ja project id,name data.jsonl

# Access nested fields
ja project user.name,user.address.city users.jsonl

# Computed columns (coming soon)
ja project name,annual_income=salary*12 employees.jsonl

Properties:

  • Idempotent (for simple projections): πa(πa,b(R))=πa(R)\pi_a(\pi_{a,b}(R)) = \pi_a(R)

3. Join (⋈) — Combine Relations

Mathematical notation: RconditionSR \bowtie_{\text{condition}} S

# Inner join on user ID
ja join users.jsonl orders.jsonl --on user.id=customer_id

# Join on nested fields
ja join posts.jsonl comments.jsonl --on post.id=comment.post_id

# Multiple join keys
ja join users.jsonl accounts.jsonl --on id=user_id,email=account_email

Properties:

  • Commutative: RS=SRR \bowtie S = S \bowtie R
  • Associative: (RS)T=R(ST)(R \bowtie S) \bowtie T = R \bowtie (S \bowtie T)

This means you can join multiple files in any order:

ja join users.jsonl orders.jsonl --on user.id=customer_id \
  | ja join - products.jsonl --on product_id=id

4. Union (∪) — Combine All Rows

Mathematical notation: RSR \cup S

# Combine employees and contractors
ja union employees.jsonl contractors.jsonl

# Union multiple sources
ja union jan.jsonl feb.jsonl mar.jsonl

Properties:

  • Commutative: RS=SRR \cup S = S \cup R
  • Associative: (RS)T=R(ST)(R \cup S) \cup T = R \cup (S \cup T)
  • Identity: R=RR \cup \emptyset = R

5. Difference (−) — Set Subtraction

Mathematical notation: RSR - S

# Find users without orders
ja difference users.jsonl customers_with_orders.jsonl --on id

# Set operations
ja difference all_items.jsonl sold_items.jsonl --on sku

Properties:

  • Not commutative: RSSRR - S \neq S - R
  • Anti-identity: R=RR - \emptyset = R, but R=\emptyset - R = \emptyset

Aggregation Operations: Extending the Algebra

While the five core operations are complete, practical data processing needs aggregation:

GroupBy (γ) — Aggregate by Groups

Mathematical notation: γgroup_keys,aggregations(R)\gamma_{\text{group\_keys}, \text{aggregations}}(R)

# Count by category
ja groupby category --agg count products.jsonl

# Sum by customer
ja groupby customer_id --agg sum:amount orders.jsonl

# Multiple aggregations
ja groupby department --agg avg:salary,count,max:age employees.jsonl

Supported aggregations:

  • count: Number of items
  • sum: Sum of values
  • avg: Average of values
  • min/max: Minimum/maximum
  • list: Collect all values into array

Chained Grouping: An Innovation

Traditional SQL requires nested subqueries for multi-level grouping. jsonl-algebra enables chained grouping through metadata preservation:

# Multi-level grouping
cat sales.jsonl \
  | ja groupby region \        # First level
  | ja groupby product \       # Second level
  | ja groupby month \         # Third level
  | ja agg total=sum:amount    # Final aggregation

How it works: Each groupby adds metadata without aggregating:

{
  "sale_id": 101,
  "region": "east",
  "product": "laptop",
  "month": "jan",
  "amount": 1200,
  "_groups": [
    {"field": "region", "value": "east"},
    {"field": "product", "value": "laptop"},
    {"field": "month", "value": "jan"}
  ],
  "_group_size": 5,
  "_group_index": 0
}

This preserves the relational structure while tracking grouping hierarchy—a key innovation that enables composition.

Real-World Example: The 5-Minute Tour

Setup: Two JSONL files

users.jsonl:

{"user": {"id": 1, "name": "Alice"}}
{"user": {"id": 2, "name": "Bob"}}

orders.jsonl:

{"order_id": 101, "customer_id": 1, "amount": 50}
{"order_id": 102, "customer_id": 1, "amount": 75}
{"order_id": 103, "customer_id": 2, "amount": 120}

Goal: Total amount spent by each user

Solution:

# Join users and orders, then aggregate
ja join users.jsonl orders.jsonl --on user.id=customer_id \
  | ja groupby user.name --agg sum:amount

Output:

{"user.name": "Alice", "sum_amount": 125}
{"user.name": "Bob", "sum_amount": 120}

This demonstrates:

  • ✅ Nested field access (user.id, user.name)
  • ✅ Relational join
  • ✅ Aggregation
  • ✅ Unix-style piping

Additional Operations

Distinct — Remove Duplicates

# Unique users
ja distinct users.jsonl

# Distinct on specific fields
ja distinct --key user.id,email contacts.jsonl

Sort — Order Results

# Sort by age ascending
ja sort age users.jsonl

# Sort descending
ja sort age:desc users.jsonl

# Multi-key sort
ja sort department,salary:desc employees.jsonl

Rename — Field Aliasing

# Rename single field
ja rename id=user_id data.jsonl

# Rename nested fields
ja rename user.loc=user.location data.jsonl

Head/Tail — Limit Results

# First 10 rows
ja head 10 data.jsonl

# Last 10 rows
ja tail 10 data.jsonl

Schema Operations: Inference and Validation

Schema Inference

Automatically discover the structure of your data:

# Infer JSON Schema
ja schema infer users.jsonl > users_schema.json

Output:

{
  "type": "object",
  "properties": {
    "id": {"type": "integer"},
    "name": {"type": "string"},
    "age": {"type": "integer"},
    "address": {
      "type": "object",
      "properties": {
        "city": {"type": "string"},
        "zip": {"type": "string"}
      }
    }
  }
}

Schema Validation

Validate data against a schema:

# Check if data conforms to schema
ja schema validate users_schema.json users.jsonl

This is crucial for:

  • Data quality checks
  • ETL pipeline validation
  • API contract testing

Format Conversion: CSV Integration

JSONL to CSV

# Flatten nested structure to CSV
ja to-csv users.jsonl > users.csv

# Select specific fields
ja project user.name,user.age users.jsonl | ja to-csv > simple.csv

Intelligent flattening:

{"user": {"name": "Alice", "age": 30}}

Becomes:

user.name,user.age
Alice,30

CSV to JSONL

# Convert CSV to JSONL
ja from-csv users.csv > users.jsonl

Interactive REPL

Build and test pipelines interactively:

$ ja repl users.jsonl

ja> select 'age > 25'
# Preview first few results...

ja> project name,email
# Preview transformed data...

ja> groupby department --agg avg:salary
# See aggregated results...

ja> write output.jsonl
# Save pipeline results

The REPL is perfect for:

  • Exploring unfamiliar data
  • Testing query predicates
  • Iterative pipeline development
  • Teaching relational algebra concepts

Command-Line Examples

Log Analysis

# Find error logs from auth service
ja select 'level == `"ERROR"` and service == `"auth"`' app.log.jsonl

# Count errors by service
ja select 'level == `"ERROR"`' app.log.jsonl \
  | ja groupby service --agg count

ETL Pipeline

# Extract, transform, load
ja select 'status == `"completed"`' sales.jsonl \
  | ja project customer_id,amount,date \
  | ja join - customers.jsonl --on customer_id=id \
  | ja groupby customer.region --agg sum:amount \
  | ja sort sum_amount:desc \
  | ja head 10

Data Quality Checks

# Find users without email
ja select 'not (email != `null`)' users.jsonl

# Find duplicate user IDs
ja groupby user_id users.jsonl \
  | ja select '_group_size > 1'

Report Generation

# Monthly sales report
ja join orders.jsonl products.jsonl --on product_id=id \
  | ja groupby month,category --agg sum:revenue,count:orders \
  | ja sort month,sum_revenue:desc \
  | ja to-csv > monthly_report.csv

Python API

Use ja programmatically:

from ja.core import read_jsonl, join, groupby_agg, select_rows
from ja.schema import infer_schema

# Load data
users = read_jsonl("users.jsonl")
orders = read_jsonl("orders.jsonl")

# Join
joined = join(users, orders, on=[("user.id", "customer_id")])

# Filter
high_value = select_rows(joined, "amount > 100")

# Aggregate
result = groupby_agg(
    high_value,
    group_by_key="user.name",
    aggregations=[("sum", "amount"), ("count", None)]
)

print(list(result))

# Schema inference
schema = infer_schema(users)

Integration with Ecosystem

jsonl-algebra works seamlessly with other tools:

With JAF (filtering)

# JAF for boolean filtering, ja for relational ops
jaf filter users.jsonl '(eq? @status "active")' --eval \
  | ja join - orders.jsonl --on id=user_id \
  | ja groupby name --agg sum:amount

With jq (JSON manipulation)

# ja for relational ops, jq for complex transformations
ja join users.jsonl orders.jsonl --on id=user_id \
  | jq '.metadata = {processed: now, version: "1.0"}'

With standard Unix tools

# Combine with grep, sort, uniq
ja project email users.jsonl \
  | grep '@gmail.com' \
  | sort \
  | uniq -c

Theoretical Foundation: Relational Algebra

jsonl-algebra implements Codd’s relational algebra extended for nested data:

Completeness

The five operations (σ, π, ⋈, ∪, −) form a complete algebra—any relational query can be expressed using these primitives.

Closure Property

Every operation produces a valid relation:

RelationOperationRelation \text{Relation} \xrightarrow{\text{Operation}} \text{Relation}

This ensures composability: R | op1 | op2 | op3 always works.

Query Optimization

Because operations have well-defined properties, queries can be optimized:

Selection pushdown:

σp(RS)(σp(R))S(if p only references R) \sigma_p(R \bowtie S) \equiv (\sigma_p(R)) \bowtie S \quad \text{(if p only references R)}

Projection elimination:

πa(πa,b(R))πa(R) \pi_a(\pi_{a,b}(R)) \equiv \pi_a(R)

These properties enable automatic query optimization (future work).

Nested Data Extension

Traditional relational algebra assumes flat tuples. jsonl-algebra extends this with:

Path Expressions

Replace column names with dot-separated paths:

σuser.age>30(R)vsσage>30(R) \sigma_{\text{user.age} > 30}(R) \quad \text{vs} \quad \sigma_{\text{age} > 30}(R)

Deep Equality

Join conditions can reference nested fields:

Ruser.id=order.customer_idS R \bowtie_{\text{user.id} = \text{order.customer\_id}} S

Projection Flattening

Projection can flatten nested structures:

πuser.name as name,user.address.city as city(R) \pi_{\text{user.name as name}, \text{user.address.city as city}}(R)

Performance Characteristics

OperationMemoryTimeNotes
selectO(1)O(n)Streaming filter
projectO(1)O(n)Field extraction
joinO(min(R,S))O(R×S)Hash join (smaller table in memory)
unionO(1)O(R+S)Streaming concatenation
distinctO(k)O(n)k = unique items
groupbyO(g)O(n)g = number of groups
sortO(n)O(n log n)External sort for large data

jsonl-algebra is designed for:

  • Streaming operations where possible
  • Bounded memory for most operations
  • Efficient joins via hash tables
  • External sorting for datasets larger than RAM

When to Use jsonl-algebra

Use ja when:

  • ✅ Working with JSONL data
  • ✅ Need SQL-like operations on nested JSON
  • ✅ Building command-line data pipelines
  • ✅ Joining multiple data sources
  • ✅ Aggregating and grouping data
  • ✅ Want schema inference/validation

Use something else when:

  • ❌ Need a persistent database → PostgreSQL, MongoDB
  • ❌ Complex analytical queries → DuckDB, ClickHouse
  • ❌ Real-time streaming → Apache Kafka, Flink
  • ❌ Just filtering/mapping → JAF

Combine with:

  • JAF: Boolean filtering before relational ops
  • jq: Complex JSON transformations
  • SQL databases: Load processed data

Design Philosophy

🎯 Unix Philosophy: Do one thing well (relational algebra)

🔗 Composability: Every operation produces a valid relation

📦 Streaming First: Process data without loading into memory

🛡️ Type Safety: Schema inference and validation

📚 Pedagogical: Clear mapping to mathematical operations

⚡ Production-Ready: Battle-tested, published on PyPI

Quick Start

# Install
pip install jsonl-algebra

# Basic usage
ja select 'age > 30' users.jsonl
ja project name,email users.jsonl
ja join users.jsonl orders.jsonl --on id=user_id

# Interactive mode
ja repl data.jsonl

Comparison: dotsuite vs jsonl-algebra

Featuredotsuite (dotrelate)jsonl-algebra (ja)
StatusPedagogical conceptProduction tool
DistributionSource codePyPI package
OperationsBasic join, unionFull relational algebra
AggregationPlannedComplete (sum, avg, count, etc.)
SchemaNoneInference + validation
InteractiveNoREPL included
Format conversionNoCSV support
DocumentationConcept docsFull CLI reference
TestingBasicComprehensive test suite

Recommendation: Learn relational concepts through dotsuite, use ja for production.

Resources

License

MIT


jsonl-algebra: Bringing the power of relational algebra to nested JSON, one operation at a time. From SQL’s flat tables to JSON’s nested structures, with full mathematical foundations.

Discussion