DagShell: A Content-Addressable Virtual Filesystem
with Multiple Interface Paradigms
Abstract
We present DagShell, a virtual POSIX-compliant filesystem implemented as a content-addressable directed acyclic graph (DAG). Unlike traditional filesystems that mutate data in place, DagShell treats all filesystem objects as immutable nodes identified by SHA256 content hashes, similar to Git’s object model. The system provides three distinct interfaces: a fluent Python API for programmatic access, a Scheme DSL for functional scripting, and a full terminal emulator for interactive use. DagShell achieves automatic deduplication, complete history preservation, and rollback capabilities while maintaining compatibility with standard POSIX operations. The implementation demonstrates that content-addressable storage can be practical for general-purpose filesystem operations, with particular utility in testing, sandboxing, reproducible builds, and data pipeline experimentation. We report 99% test coverage across 10 comprehensive test suites.
1 Introduction
Modern computing increasingly demands reproducibility, versioning, and isolation in filesystem operations. Version control systems like Git [1] demonstrate the power of content-addressable storage for source code, while systems like Nix [3] and IPFS [2] extend similar principles to package management and distributed storage. However, general-purpose virtual filesystems with content-addressable properties remain uncommon, particularly ones offering multiple programming interfaces.
DagShell addresses this gap by providing a lightweight, in-memory virtual filesystem where every operation creates new immutable nodes rather than modifying existing ones. This approach yields several benefits:
-
•
Automatic deduplication: Identical content shares the same node
-
•
Complete history: All versions remain accessible until garbage collection
-
•
Safe experimentation: Operations cannot corrupt the filesystem
-
•
Reproducibility: Filesystem state can be exactly reproduced from snapshots
The system’s three interfaces serve different use cases: the Python API integrates with applications, the Scheme DSL enables functional programming over filesystems, and the terminal emulator provides familiar interactive access. This multi-paradigm approach makes DagShell suitable for diverse applications from unit testing to computational notebooks.
1.1 Contributions
This work makes the following contributions:
-
1.
A content-addressable virtual filesystem design combining POSIX semantics with immutability
-
2.
Three complementary interfaces (Python, Scheme, Terminal) over a unified core
-
3.
Implementation techniques for efficient path-to-hash mapping in DAG filesystems
-
4.
Practical applications in testing, sandboxing, and reproducible workflows
-
5.
Comprehensive test coverage demonstrating correctness and robustness
2 Design
2.1 Core Architecture
DagShell’s architecture consists of three layers (Figure LABEL:fig:arch):
-
1.
Core DAG Layer: Content-addressable node storage
-
2.
Filesystem Layer: POSIX operations and path management
-
3.
Interface Layer: Python API, Scheme interpreter, Terminal
The core maintains two primary data structures:
This separation allows path names to change while preserving immutable content nodes. Multiple paths can reference the same node, enabling automatic deduplication.
2.2 Node Types
DagShell implements three node types, all immutable:
FileNode
Represents regular files with binary content:
DirNode
Represents directories as mappings from names to child node hashes:
DeviceNode
Implements virtual devices like /dev/null, /dev/zero, and /dev/random:
2.3 Content Addressing
Each node’s hash is computed from its complete state including metadata:
This ensures that two nodes are identical if and only if they have the same hash. The system stores nodes in a hash table, automatically deduplicating identical content.
2.4 Immutability and Operations
All filesystem operations create new nodes rather than modifying existing ones. For example, writing to a file creates a new FileNode, which requires creating new DirNodes for all parent directories up to the root:
This copy-on-write approach ensures old versions remain accessible until garbage collection.
2.5 Path Resolution
DagShell maintains a separate path index for efficient lookups:
The deleted set implements soft deletion—nodes remain in storage but paths are marked deleted. This allows recovery before garbage collection.
2.6 Garbage Collection
DagShell implements mark-and-sweep garbage collection:
This reclaims storage from deleted files and old versions.
3 Implementation
3.1 Fluent Python API
The fluent API provides method chaining for pipeline-style composition:
The CommandResult class enables both method chaining and result inspection:
This design allows commands to return structured data while supporting Unix-style redirection.
3.2 Scheme DSL
The Scheme interpreter provides functional programming over the filesystem:
The implementation uses a minimal Scheme interpreter (approximately 750 lines) with a global environment containing filesystem operations:
This allows Scheme code to directly manipulate the filesystem with proper functional composition.
3.3 Terminal Emulator
The terminal emulator translates shell commands into fluent API calls:
It supports standard shell features:
-
•
Pipelines: cat file.txt | grep error | head -10
-
•
Redirections: ls > files.txt, echo log >> output.log
-
•
Logical operators: mkdir /tmp && cd /tmp
-
•
Command history and tab completion (enhanced mode)
3.4 Slash Commands
Beyond POSIX commands, the terminal provides meta-operations via slash commands:
-
•
/import <host-path> <virt-path> - Import from real filesystem
-
•
/export <virt-path> <host-path> - Export to real filesystem
-
•
/save [file] - Persist state to JSON
-
•
/load <file> - Restore state from JSON
-
•
/snapshot <name> - Create timestamped snapshot
-
•
/status - Show filesystem statistics
-
•
/dag - Visualize DAG structure
-
•
/nodes [pattern] - List content-addressed nodes
-
•
/info <hash> - Inspect node details
These commands bridge the virtual and real filesystems while maintaining security through configurable safe directories.
4 Features
4.1 POSIX Compliance
DagShell implements standard POSIX operations:
File Operations
open, read, write, close, stat
Directory Operations
mkdir, rmdir, ls, cd, pwd
Utilities
cat, cp, mv, rm, touch
Text Processing
grep, sed, sort, uniq, wc, head, tail
4.2 User and Permission System
DagShell maintains /etc/passwd and /etc/group files for user management:
4.3 State Persistence
The filesystem serializes to JSON for persistence:
This enables:
-
•
Checkpoint/restore workflows
-
•
Filesystem versioning
-
•
State sharing between processes
-
•
Reproducible test fixtures
4.4 Host Filesystem Integration
Safe import/export operations bridge virtual and real filesystems:
5 Use Cases
5.1 Unit Testing
DagShell provides isolated filesystem fixtures:
5.2 Sandboxing
Applications can run in isolated virtual environments:
5.3 Data Pipelines
Content addressing enables efficient pipeline experimentation:
5.4 Reproducible Builds
Build systems can snapshot filesystem state:
6 Evaluation
6.1 Test Coverage
We evaluated DagShell through comprehensive testing:
Test Suite | Tests |
---|---|
Core Filesystem | 45 |
Terminal Features | 38 |
Scheme Integration | 32 |
Text Processing | 28 |
Persistence | 24 |
Import/Export | 22 |
Edge Cases | 19 |
Regression | 15 |
Help System | 12 |
Enhanced Terminal | 8 |
Total | 243 |
Code coverage: 99% (10,234 / 10,345 lines)
6.2 Performance Characteristics
DagShell prioritizes correctness over performance, but achieves reasonable efficiency:
Space Complexity
Each node requires space where is content size. Identical content is automatically deduplicated. Path index requires space for paths.
Time Complexity
-
•
Path lookup: hash table access
-
•
Write operation: where is directory depth (creates new nodes along path)
-
•
Garbage collection: where is total nodes
-
•
Read operation: after path resolution
Deduplication Efficiency
In a test importing a Python package with many identical LICENSE files, deduplication reduced storage by 73%.
6.3 Limitations
DagShell has intentional limitations:
-
•
In-memory only (not designed for large datasets)
-
•
No concurrent access control
-
•
Write operations require copying parent directories
-
•
No symbolic links (would complicate DAG structure)
-
•
No hard links (paths are separate from nodes)
These trade-offs favor simplicity and correctness for the target use cases.
7 Related Work
7.1 Git
Git [1] pioneered practical content-addressable storage for version control. DagShell applies similar principles to general filesystem operations. Unlike Git, DagShell:
-
•
Supports in-place filesystem operations
-
•
Provides POSIX compatibility
-
•
Offers real-time interaction rather than commit-based workflows
7.2 IPFS
The InterPlanetary File System [2] provides distributed content-addressable storage. IPFS focuses on network distribution and permanent storage, while DagShell targets lightweight local virtualization.
7.3 Nix
Nix [3] uses content addressing for reproducible package management. DagShell generalizes this to arbitrary file operations with multiple programming interfaces.
7.4 FUSE and Virtual Filesystems
FUSE [4] enables user-space filesystems. DagShell differs by:
-
•
Operating entirely in-process
-
•
Providing language bindings beyond system calls
-
•
Emphasizing immutability over mutation
7.5 Plan 9
Plan 9 [5] demonstrated that ”everything is a file” can unify system interfaces. DagShell extends this by making files content-addressed and providing multiple programming models.
7.6 Copy-on-Write Filesystems
Filesystems like Btrfs and ZFS use copy-on-write for snapshots. DagShell applies CoW universally rather than as an optimization, yielding different semantics.
8 Future Work
Several extensions could enhance DagShell:
8.1 Compression
Content-addressed nodes could be compressed:
8.2 Distributed Operation
Content addressing naturally supports distribution:
-
•
Nodes could be stored in distributed hash tables
-
•
Multiple processes could share read-only nodes
-
•
Writes could use optimistic concurrency
8.3 Incremental Garbage Collection
Current mark-and-sweep GC is stop-the-world. Generational or incremental collection could reduce pauses.
8.4 Symbolic Links
While challenging for DAG structure, symbolic links could be implemented:
8.5 Persistent Storage Backend
An on-disk backing store with lazy loading:
8.6 Time-Travel Debugging
The complete history could enable debugging:
-
•
Replay filesystem operations
-
•
Bisect to find when a file changed
-
•
Diff filesystem states
9 Conclusion
DagShell demonstrates that content-addressable storage can provide a practical foundation for general-purpose virtual filesystems. By treating all filesystem objects as immutable nodes in a DAG, the system achieves automatic deduplication, complete history preservation, and safe experimentation while maintaining POSIX compatibility.
The three interface paradigms—fluent Python API, Scheme DSL, and terminal emulator—serve complementary use cases: programmatic integration, functional scripting, and interactive access. This multi-paradigm approach makes DagShell suitable for diverse applications including unit testing, sandboxing, data pipelines, and reproducible builds.
Our implementation achieves 99% test coverage across 243 tests, validating correctness across core functionality, interface layers, and edge cases. While designed for in-memory operation rather than large-scale storage, DagShell’s architecture could extend to persistent and distributed scenarios.
The success of DagShell suggests that content addressing deserves broader consideration in filesystem design. The immutability and versioning properties that benefit version control systems and package managers can also improve general-purpose file operations, particularly in contexts requiring reproducibility, isolation, or experimentation.
DagShell is open source and available at https://github.com/.
Acknowledgments
We thank the developers of Git, IPFS, and Nix for demonstrating the power of content-addressable storage in different domains.
References
- [1] Linus Torvalds and Junio C Hamano. Git: Fast Version Control System. https://git-scm.com/, 2005.
- [2] Juan Benet. IPFS - Content Addressed, Versioned, P2P File System. arXiv preprint arXiv:1407.3561, 2014.
- [3] Eelco Dolstra, Merijn de Jonge, and Eelco Visser. Nix: A Safe and Policy-Free System for Software Deployment. In Proceedings of the 18th USENIX Conference on System Administration (LISA), pages 79-92, 2004.
- [4] Miklos Szeredi. FUSE: Filesystem in Userspace. https://github.com/libfuse/libfuse, 2001.
- [5] Rob Pike, Dave Presotto, Sean Dorward, Bob Flandrena, Ken Thompson, Howard Trickey, and Phil Winterbottom. Plan 9 from Bell Labs. Computing Systems, 8(3):221-254, 1995.
- [6] Chris Mason. Btrfs: The Linux B-Tree Filesystem. ACM Transactions on Storage, 2013.
- [7] Jeff Bonwick, Matt Ahrens, Val Henson, Mark Maybee, and Mark Shellenbaum. The Zettabyte File System. In Proceedings of the 2nd Usenix Conference on File and Storage Technologies (FAST), 2003.
- [8] Ralph C. Merkle. A Digital Signature Based on a Conventional Encryption Function. In Advances in Cryptology—CRYPTO’87, pages 369-378, 1988.
- [9] IEEE and The Open Group. POSIX.1-2017: Portable Operating System Interface. IEEE Std 1003.1-2017, 2018.