
DagShell: A Content-Addressable Virtual Filesystem

DagShell is a virtual filesystem that organizes data by content instead of location. Identical files automatically share storage because each content block is keyed by its SHA-256 hash. The structure is a directed acyclic graph rather than a tree, so the same content block can be referenced from multiple paths without duplication.

I built it because sometimes you need filesystem semantics without touching the actual disk: for testing, sandboxing, versioning, and portability. The implementation has 583 tests with 77% coverage.

The DAG structure

Traditional filesystems are organized as trees: each file normally has exactly one parent directory. DagShell uses a DAG where content is stored once and referenced by hash:

/project/
├── src/
│   └── main.py  ──────┐
├── backup/            │
│   └── main.py  ──────┼──> [SHA256: abc123...] -> "print('hello')"
└── archive/           │
    └── main.py  ──────┘

Three paths, one storage block.
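
To make the idea concrete, here is a minimal sketch of content addressing in plain Python. It is not DagShell's implementation; the `ContentStore` class and its `write`/`read` methods are hypothetical names used only for illustration. The only real dependency is `hashlib` from the standard library.

```python
import hashlib


class ContentStore:
    """Toy content-addressable store (hypothetical, not DagShell's code)."""

    def __init__(self):
        self.blobs = {}  # SHA-256 hex digest -> bytes
        self.paths = {}  # path -> SHA-256 hex digest

    def write(self, path, data):
        digest = hashlib.sha256(data).hexdigest()
        # Keying by digest means identical content collapses to one entry.
        self.blobs.setdefault(digest, data)
        self.paths[path] = digest
        return digest

    def read(self, path):
        return self.blobs[self.paths[path]]


store = ContentStore()
for p in ("/project/src/main.py",
          "/project/backup/main.py",
          "/project/archive/main.py"):
    store.write(p, b"print('hello')")

print(len(store.paths), "paths,", len(store.blobs), "blob")  # 3 paths, 1 blob
```

Because the key is derived from the bytes themselves, deduplication needs no extra bookkeeping: writing the same content a second time finds the existing entry.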

Fluent Python API

DagShell provides a chainable API that mirrors shell commands:

from dagshell.dagshell_fluent import DagShell

shell = DagShell()

# Create project structure
(shell
    .mkdir("/project/src")
    .mkdir("/project/docs")
    .cd("/project/src")
    .echo("def main(): pass").out("main.py")
    .echo("# My Project").out("../docs/README.md"))

# Navigate with directory stack
shell.pushd("/tmp")
shell.touch("scratch.txt")
shell.popd()  # Back to /project/src

# Save entire filesystem to JSON
shell.save("project_snapshot.json")
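
The chaining works because each call hands back an object you can keep calling methods on. The sketch below shows that pattern with a hypothetical `MiniShell` class. It is an assumption about how such an API could be built, not DagShell's code. In particular, creating missing parent directories in `mkdir` and buffering `echo` output until `out` are choices made for this toy.

```python
import posixpath


class MiniShell:
    """Hypothetical toy shell that only illustrates method chaining."""

    def __init__(self):
        self.cwd = "/"
        self.dirs = {"/"}
        self.files = {}      # absolute path -> text
        self._pending = ""   # output waiting to be redirected

    def _abs(self, path):
        # Resolve relative paths (including "..") against the current directory.
        return posixpath.normpath(posixpath.join(self.cwd, path))

    def mkdir(self, path):
        # Assumption: behaves like `mkdir -p` and creates missing parents.
        path = self._abs(path)
        while path != "/":
            self.dirs.add(path)
            path = posixpath.dirname(path)
        return self

    def cd(self, path):
        path = self._abs(path)
        if path not in self.dirs:
            raise FileNotFoundError(path)
        self.cwd = path
        return self

    def echo(self, text):
        self._pending = text + "\n"
        return self

    def out(self, path):
        # Write the buffered output to a file, like `>` in a real shell.
        self.files[self._abs(path)] = self._pending
        self._pending = ""
        return self


sh = (MiniShell()
      .mkdir("/project/src")
      .mkdir("/project/docs")
      .cd("/project/src")
      .echo("def main(): pass").out("main.py")
      .echo("# My Project").out("../docs/README.md"))

print(sorted(sh.files))
```

Returning `self` from every method is the whole trick; the parentheses around the chain just let Python continue the expression across lines.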

Terminal emulator

For interactive exploration:

python -m dagshell.terminal

dagshell:/$ mkdir /home/user
dagshell:/$ cd /home/user
dagshell:/home/user$ echo "Hello" > greeting.txt
dagshell:/home/user$ cat greeting.txt
Hello
dagshell:/home/user$ ls -la
total 1
drwxr-xr-x  2 user user  4096 Aug 15 10:00 .
drwxr-xr-x  3 user user  4096 Aug 15 10:00 ..
-rw-r--r--  1 user user     6 Aug 15 10:00 greeting.txt

Virtual devices

Standard Unix special files work:

shell.echo("garbage").out("/dev/null")  # Discarded
random_bytes = shell.cat("/dev/random")  # Random data
zeros = shell.head("/dev/zero", 100)     # 100 zero bytes
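
One way to think about virtual devices is as paths whose reads and writes are computed rather than stored. The following is a rough sketch under that assumption; `read_device` and `write_device` are hypothetical helpers, not part of DagShell, and a real implementation may differ.

```python
import os


def read_device(path, n):
    """Return up to n bytes from a simulated device (hypothetical helper)."""
    if path == "/dev/zero":
        return bytes(n)        # n zero bytes
    if path == "/dev/random":
        return os.urandom(n)   # n random bytes from the OS
    if path == "/dev/null":
        return b""             # reading /dev/null hits end-of-file at once
    raise FileNotFoundError(path)


def write_device(path, data):
    """Accept a write to a simulated device and report the bytes consumed."""
    if path == "/dev/null":
        return len(data)       # data is accepted and discarded
    raise PermissionError(path)


zeros = read_device("/dev/zero", 100)
noise = read_device("/dev/random", 16)
print(len(zeros), len(noise), write_device("/dev/null", b"garbage"))
```

Nothing is stored for these paths, so they cost no space in the content store no matter how much is read from or written to them.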

Import/export

Move files between real and virtual filesystems:

# Import from real filesystem
shell.import_file("/real/path/data.csv", "/virtual/data.csv")

# Export to real filesystem
shell.export_file("/virtual/results.json", "/real/path/results.json")

# Import entire directory
shell.import_dir("/real/project", "/virtual/project")

Persistence

The entire filesystem state serializes to JSON:

shell.save("filesystem.json")
restored = DagShell.load("filesystem.json")

# Or get JSON directly
state = shell.to_json()

The JSON format is human-readable:

{
  "root": {
    "type": "directory",
    "children": {
      "project": {
        "type": "directory",
        "children": {
          "README.md": {
            "type": "file",
            "content_hash": "abc123..."
          }
        }
      }
    }
  },
  "content_store": {
    "abc123...": "# My Project\n..."
  }
}

The directory tree holds only content hashes, and the actual content lives in a flat store keyed by hash, so deduplication falls out naturally.
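
Reading a file from a snapshot shaped like this takes two steps: walk the tree to find the hash, then look the hash up in the store. Here is a small sketch that assumes exactly the layout shown above (`root`, `children`, `content_hash`, `content_store`). The `read_file` helper and the `COPY.md` entry are made up for the example.

```python
import hashlib
import json

content = "# My Project\n"
digest = hashlib.sha256(content.encode()).hexdigest()

# Two file entries share one hash, so the store holds the text only once.
state = {
    "root": {"type": "directory", "children": {
        "project": {"type": "directory", "children": {
            "README.md": {"type": "file", "content_hash": digest},
            "COPY.md": {"type": "file", "content_hash": digest},
        }},
    }},
    "content_store": {digest: content},
}


def read_file(state, path):
    """Resolve an absolute path in the snapshot and return the file's text."""
    node = state["root"]
    for name in (part for part in path.split("/") if part):
        if node["type"] != "directory":
            raise NotADirectoryError(path)
        node = node["children"][name]
    if node["type"] != "file":
        raise IsADirectoryError(path)
    return state["content_store"][node["content_hash"]]


# Round-trip through JSON text to mimic saving and loading a snapshot.
restored = json.loads(json.dumps(state))
print(read_file(restored, "/project/README.md"), end="")
```

The snapshot survives the JSON round trip unchanged because it contains only dictionaries and strings.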

Scheme DSL

For Lisp people, there’s a Scheme interface:

(mkdir "/project")
(cd "/project")
(echo "Hello" "greeting.txt")
(define files (ls))

I included this partly because I like Scheme and partly because a filesystem is a natural fit for s-expressions.

Use cases

  • Build systems: track input/output files without disk I/O
  • Testing frameworks: create fixture filesystems programmatically
  • Backup tools: represent filesystem snapshots efficiently
  • Educational: teach filesystem concepts without system access

Installation

pip install dagshell
# Or from source
pip install -e .
