Skip to main content

jsonl-algebra: Relational Algebra for Nested JSON

jsonl-algebra (command: ja) is a command-line implementation of relational algebra for JSONL data. It’s the production version of dotsuite’s dotrelate component: SQL-like operations on the command line with first-class support for nested JSON structures.

The Relationship to Dotsuite

In dotsuite’s architecture, dotrelate provides relational operations on document collections: join, union, project, difference. jsonl-algebra (ja) is the production implementation of those concepts, published on PyPI, with all relational operations plus aggregations, streaming support, schema tools, and an interactive REPL.

The Core Insight

Traditional relational algebra assumes flat tables:

SELECT name, age FROM users WHERE age > 30

But real-world JSON is deeply nested:

{
  "user": {
    "id": 1,
    "name": "Alice",
    "address": {
      "city": "NYC",
      "zip": "10001"
    }
  },
  "orders": [
    {"id": 101, "amount": 50}
  ]
}

jsonl-algebra bridges this gap by extending relational algebra with dot notation for nested access:

ja select 'user.age > 30' users.jsonl
ja project user.name,user.address.city users.jsonl
ja join users.jsonl orders.jsonl --on user.id=customer_id

The Five Core Operations

Relational algebra has five fundamental operations that form a complete algebra. Everything else is derived.

1. Selection (filter rows)

Mathematical notation: σpredicate(R)\sigma_{\text{predicate}}(R)

# Filter where status is "active"
ja select 'status == `"active"`' data.jsonl

# Filter on nested fields
ja select 'user.age > 30' users.jsonl

# Complex boolean logic
ja select 'price < 100 and category == `"electronics"`' products.jsonl

Selection is commutative (σp1(σp2(R))=σp2(σp1(R))\sigma_{p_1}(\sigma_{p_2}(R)) = \sigma_{p_2}(\sigma_{p_1}(R))) and combinable (σp1(σp2(R))=σp1p2(R)\sigma_{p_1}(\sigma_{p_2}(R)) = \sigma_{p_1 \land p_2}(R)).

2. Projection (select/compute columns)

Mathematical notation: πcolumns(R)\pi_{\text{columns}}(R)

# Pick specific fields
ja project id,name data.jsonl

# Access nested fields
ja project user.name,user.address.city users.jsonl

# Computed columns (coming soon)
ja project name,annual_income=salary*12 employees.jsonl

Idempotent for simple projections: πa(πa,b(R))=πa(R)\pi_a(\pi_{a,b}(R)) = \pi_a(R).

3. Join (combine relations)

Mathematical notation: RconditionSR \bowtie_{\text{condition}} S

# Inner join on user ID
ja join users.jsonl orders.jsonl --on user.id=customer_id

# Join on nested fields
ja join posts.jsonl comments.jsonl --on post.id=comment.post_id

# Multiple join keys
ja join users.jsonl accounts.jsonl --on id=user_id,email=account_email

Commutative and associative, so you can join multiple files in any order:

ja join users.jsonl orders.jsonl --on user.id=customer_id \
  | ja join - products.jsonl --on product_id=id

4. Union (combine all rows)

Mathematical notation: RSR \cup S

# Combine employees and contractors
ja union employees.jsonl contractors.jsonl

# Union multiple sources
ja union jan.jsonl feb.jsonl mar.jsonl

5. Difference (set subtraction)

Mathematical notation: RSR - S

# Find users without orders
ja difference users.jsonl customers_with_orders.jsonl --on id

# Set operations
ja difference all_items.jsonl sold_items.jsonl --on sku

Not commutative: RSSRR - S \neq S - R. This is the one that isn’t symmetric.

Aggregation Operations

The five core operations are complete in the theoretical sense, but practical data processing needs aggregation.

GroupBy

# Count by category
ja groupby category --agg count products.jsonl

# Sum by customer
ja groupby customer_id --agg sum:amount orders.jsonl

# Multiple aggregations
ja groupby department --agg avg:salary,count,max:age employees.jsonl

Supported aggregations: count, sum, avg, min, max, list.

Chained Grouping

Traditional SQL requires nested subqueries for multi-level grouping. jsonl-algebra enables chained grouping through metadata preservation:

# Multi-level grouping
cat sales.jsonl \
  | ja groupby region \
  | ja groupby product \
  | ja groupby month \
  | ja agg total=sum:amount

Each groupby adds metadata without aggregating:

{
  "sale_id": 101,
  "region": "east",
  "product": "laptop",
  "month": "jan",
  "amount": 1200,
  "_groups": [
    {"field": "region", "value": "east"},
    {"field": "product", "value": "laptop"},
    {"field": "month", "value": "jan"}
  ],
  "_group_size": 5,
  "_group_index": 0
}

This preserves the relational structure while tracking grouping hierarchy. It’s what makes the chaining composable.

The 5-Minute Tour

Two JSONL files:

users.jsonl:

{"user": {"id": 1, "name": "Alice"}}
{"user": {"id": 2, "name": "Bob"}}

orders.jsonl:

{"order_id": 101, "customer_id": 1, "amount": 50}
{"order_id": 102, "customer_id": 1, "amount": 75}
{"order_id": 103, "customer_id": 2, "amount": 120}

Goal: total amount spent by each user.

# Join users and orders, then aggregate
ja join users.jsonl orders.jsonl --on user.id=customer_id \
  | ja groupby user.name --agg sum:amount

Output:

{"user.name": "Alice", "sum_amount": 125}
{"user.name": "Bob", "sum_amount": 120}

Nested field access, relational join, aggregation, Unix-style piping. That’s the whole idea.

Additional Operations

Distinct

ja distinct users.jsonl
ja distinct --key user.id,email contacts.jsonl

Sort

ja sort age users.jsonl
ja sort age:desc users.jsonl
ja sort department,salary:desc employees.jsonl

Rename

ja rename id=user_id data.jsonl
ja rename user.loc=user.location data.jsonl

Head/Tail

ja head 10 data.jsonl
ja tail 10 data.jsonl

Schema Operations

Inference

Automatically discover the structure of your data:

ja schema infer users.jsonl > users_schema.json

Output:

{
  "type": "object",
  "properties": {
    "id": {"type": "integer"},
    "name": {"type": "string"},
    "age": {"type": "integer"},
    "address": {
      "type": "object",
      "properties": {
        "city": {"type": "string"},
        "zip": {"type": "string"}
      }
    }
  }
}

Validation

ja schema validate users_schema.json users.jsonl

Useful for data quality checks, ETL pipeline validation, and API contract testing.

Format Conversion

JSONL to CSV

ja to-csv users.jsonl > users.csv
ja project user.name,user.age users.jsonl | ja to-csv > simple.csv

Nested structures get flattened intelligently: {"user": {"name": "Alice", "age": 30}} becomes user.name,user.age\nAlice,30.

CSV to JSONL

ja from-csv users.csv > users.jsonl

Interactive REPL

Build and test pipelines interactively:

$ ja repl users.jsonl

ja> select 'age > 25'
# Preview first few results...

ja> project name,email
# Preview transformed data...

ja> groupby department --agg avg:salary
# See aggregated results...

ja> write output.jsonl
# Save pipeline results

Good for exploring unfamiliar data, testing query predicates, and iterative pipeline development.

Command-Line Examples

Log Analysis

ja select 'level == `"ERROR"` and service == `"auth"`' app.log.jsonl

ja select 'level == `"ERROR"`' app.log.jsonl \
  | ja groupby service --agg count

ETL Pipeline

ja select 'status == `"completed"`' sales.jsonl \
  | ja project customer_id,amount,date \
  | ja join - customers.jsonl --on customer_id=id \
  | ja groupby customer.region --agg sum:amount \
  | ja sort sum_amount:desc \
  | ja head 10

Data Quality Checks

# Find users without email
ja select 'not (email != `null`)' users.jsonl

# Find duplicate user IDs
ja groupby user_id users.jsonl \
  | ja select '_group_size > 1'

Report Generation

ja join orders.jsonl products.jsonl --on product_id=id \
  | ja groupby month,category --agg sum:revenue,count:orders \
  | ja sort month,sum_revenue:desc \
  | ja to-csv > monthly_report.csv

Python API

from ja.core import read_jsonl, join, groupby_agg, select_rows
from ja.schema import infer_schema

# Load data
users = read_jsonl("users.jsonl")
orders = read_jsonl("orders.jsonl")

# Join
joined = join(users, orders, on=[("user.id", "customer_id")])

# Filter
high_value = select_rows(joined, "amount > 100")

# Aggregate
result = groupby_agg(
    high_value,
    group_by_key="user.name",
    aggregations=[("sum", "amount"), ("count", None)]
)

print(list(result))

# Schema inference
schema = infer_schema(users)

Integration with Ecosystem

With JAF (filtering)

jaf filter users.jsonl '(eq? @status "active")' --eval \
  | ja join - orders.jsonl --on id=user_id \
  | ja groupby name --agg sum:amount

With jq (JSON manipulation)

ja join users.jsonl orders.jsonl --on id=user_id \
  | jq '.metadata = {processed: now, version: "1.0"}'

With standard Unix tools

ja project email users.jsonl \
  | grep '@gmail.com' \
  | sort \
  | uniq -c

Theoretical Foundation

jsonl-algebra implements Codd’s relational algebra extended for nested data.

Completeness

The five operations (selection, projection, join, union, difference) form a complete algebra. Any relational query can be expressed using these primitives.

Closure Property

Every operation produces a valid relation:

RelationOperationRelation \text{Relation} \xrightarrow{\text{Operation}} \text{Relation}

This ensures composability: R | op1 | op2 | op3 always works.

Query Optimization

Because operations have well-defined algebraic properties, queries can be optimized:

Selection pushdown:

σp(RS)(σp(R))S(if p only references R) \sigma_p(R \bowtie S) \equiv (\sigma_p(R)) \bowtie S \quad \text{(if p only references R)}

Projection elimination:

πa(πa,b(R))πa(R) \pi_a(\pi_{a,b}(R)) \equiv \pi_a(R)

These enable automatic query optimization (future work).

Nested Data Extension

Traditional relational algebra assumes flat tuples. jsonl-algebra extends this with path expressions (replace column names with dot-separated paths), deep equality (join conditions can reference nested fields), and projection flattening (projection can flatten nested structures).

Performance Characteristics

OperationMemoryTimeNotes
selectO(1)O(n)Streaming filter
projectO(1)O(n)Field extraction
joinO(min(R,S))O(R*S)Hash join (smaller table in memory)
unionO(1)O(R+S)Streaming concatenation
distinctO(k)O(n)k = unique items
groupbyO(g)O(n)g = number of groups
sortO(n)O(n log n)External sort for large data

When to Use jsonl-algebra

Use ja for working with JSONL data, SQL-like operations on nested JSON, command-line data pipelines, joining multiple data sources, aggregating and grouping data, and schema inference/validation.

Use something else for persistent databases (PostgreSQL, MongoDB), complex analytical queries (DuckDB, ClickHouse), real-time streaming (Kafka, Flink), or if you just need filtering and mapping (JAF).

Quick Start

# Install
pip install jsonl-algebra

# Basic usage
ja select 'age > 30' users.jsonl
ja project name,email users.jsonl
ja join users.jsonl orders.jsonl --on id=user_id

# Interactive mode
ja repl data.jsonl

Resources

License

MIT

Discussion