maph: Maps Based on Perfect Hashing
for Sub-Microsecond Key-Value Storage

29 min read
Alexander Towell PhD Student
Southern Illinois University Edwardsville/Carbondale
Email: atowell@siue.edu, lex@metafunctor.com
GitHub: https://github.com/queelius/rd_ph_filter
Abstract

We present maph (Map based on Perfect Hash), a high-performance key-value storage system that achieves sub-microsecond latency through a novel combination of memory-mapped I/O, approximate perfect hashing, and lock-free atomic operations. Unlike traditional key-value stores that suffer from kernel/user space transitions and locking overhead, maph leverages direct memory access via mmap(2) to eliminate system call overhead on the critical path. Our design employs a dual-region architecture with 80% static slots using perfect hashing for collision-free O(1) lookups, and 20% dynamic slots with bounded linear probing for handling hash collisions. Each slot is fixed at 512 bytes and cache-line aligned (64-byte boundaries) to minimize false sharing and maximize CPU cache utilization. Experimental evaluation demonstrates that maph achieves 10 million GET operations per second with sub-100 nanosecond latency on a single thread, and scales to 98 million operations per second with 16 threads. The system supports SIMD-accelerated batch operations via AVX2 instructions, achieving 50 million keys per second for parallel lookups. We show that maph outperforms Redis by 12×, RocksDB by 87×, and Memcached by 6× on read-heavy workloads while maintaining comparable write performance. The framework is particularly suited for applications requiring predictable ultra-low latency, including high-frequency trading systems, machine learning feature stores, and real-time gaming infrastructure.

Index Terms:
key-value stores, memory-mapped I/O, perfect hashing, lock-free algorithms, sub-microsecond latency, zero-copy architecture

I Introduction

Modern distributed systems and cloud applications increasingly depend on high-performance key-value stores for caching, session management, and metadata storage. However, existing solutions face fundamental limitations when microsecond-level latency becomes critical. Traditional in-memory databases like Redis [1] operate in user space, incurring system call overhead for each operation. Persistent stores like RocksDB [2] optimize for durability and compression at the expense of lookup speed. Even specialized caches like Memcached [3] cannot achieve sub-microsecond latency due to network stack overhead and serialization costs.

The performance gap becomes particularly acute in domains such as high-frequency trading, where every nanosecond of latency translates directly to competitive disadvantage and potential financial loss [4]. Similarly, machine learning inference pipelines require feature stores capable of retrieving thousands of features within tight latency budgets [5]. Real-time gaming and IoT applications demand predictable response times even under concurrent access patterns [6].

We identify three fundamental bottlenecks in existing key-value stores:

  1. 1.

    Kernel overhead: System calls for I/O operations require expensive context switches between user and kernel space, typically adding 100-500ns of latency per operation [7].

  2. 2.

    Memory copying: Traditional stores copy data multiple times—from kernel buffers to user space, between internal data structures, and for serialization. Each copy operation consumes CPU cycles and memory bandwidth.

  3. 3.

    Synchronization overhead: Lock-based concurrency control creates contention under high concurrency, leading to unpredictable latency spikes and poor scalability [8].

To address these challenges, we present maph, a novel key-value storage framework that achieves sub-microsecond latency through three key innovations:

  • Zero-copy architecture via mmap: By memory-mapping the entire database file, we eliminate kernel/user space transitions and enable direct memory access to persistent storage. The CPU’s memory management unit (MMU) handles address translation transparently, providing the illusion of in-memory access with automatic persistence.

  • Approximate perfect hashing: We partition the key space into static (80%) and dynamic (20%) regions. Static slots use perfect hash functions [9, 10] to guarantee collision-free O(1) lookups. Dynamic slots handle overflow via bounded linear probing with a maximum of 10 probes.

  • Lock-free atomic operations: All operations use compare-and-swap (CAS) primitives and atomic versioning to ensure consistency without locks. Readers never block, and writers coordinate through atomic slot versioning.

Our primary contributions are:

  1. 1.

    A hybrid hash architecture combining perfect hashing for known keys with standard hashing for dynamic keys, providing O(1) lookups for optimized datasets while gracefully handling new insertions

  2. 2.

    Application of memory-mapped I/O to eliminate system call overhead in the critical path while maintaining automatic persistence through the kernel page cache

  3. 3.

    Engineering optimizations including cache-line aligned fixed-size slots (512 bytes), lock-free versioning for concurrent reads, and bounded linear probing for predictable worst-case performance

  4. 4.

    Experimental evaluation comparing in-process performance against both network-based systems (Redis, Memcached) and in-memory data structures (std::unordered_map) with detailed latency analysis

  5. 5.

    Open-source implementation demonstrating practical application of these techniques for JSON key-value storage with sub-microsecond target latency

The remainder of this paper is organized as follows. Section II reviews related work in high-performance key-value storage. Section III presents the maph system architecture and design rationale. Section IV details the implementation including memory layout, hash functions, and concurrency control. Section V evaluates performance through microbenchmarks and comparison with existing systems. Section VI discusses limitations and future work. Section VII concludes.

II Background and Related Work

II-A Traditional Key-Value Stores

Redis [1] represents the most widely deployed in-memory key-value store, using a single-threaded event loop to avoid synchronization overhead. While this design simplifies reasoning about consistency, it fundamentally limits scalability on multi-core systems. Redis achieves approximately 100,000 operations per second on a single core but cannot exploit parallelism without complex sharding schemes.

RocksDB [2] and LevelDB [11] optimize for storage efficiency using log-structured merge trees (LSM-trees). The multi-level structure with compaction provides excellent compression ratios and write throughput but requires multiple disk seeks for reads. Even with extensive caching, lookup latency remains in the microsecond range due to the inherent tree traversal overhead.

Memcached [3] focuses on simplicity and network efficiency for distributed caching. Its slab allocator reduces fragmentation, and multi-threaded architecture scales well. However, the network stack adds unavoidable latency—even with kernel bypass techniques like DPDK [12], round-trip times exceed one microsecond.

II-B Low-Latency Distributed Systems

Several research systems target microsecond-scale latency through specialized architectures and RDMA networking.

RAMCloud [30] achieves single-digit microsecond latency by keeping all data in DRAM with log-structured organization and fast crash recovery. The system uses kernel bypass and polling for low latency, reporting median read latency of 5μs. While significantly faster than traditional databases, this remains an order of magnitude slower than our sub-microsecond target.

FaRM [31] leverages RDMA one-sided reads to achieve median read latency of 5μs for small objects. The use of one-sided RDMA eliminates server CPU involvement for reads, but network latency and RDMA NIC processing still dominate at this scale. FaRM focuses on distributed transactions rather than single-machine key-value performance.

MICA [32] demonstrates that careful engineering of partitioned in-memory hash tables can achieve 65 million operations per second on a single machine. The design uses per-core data structures to avoid synchronization overhead. While MICA achieves high throughput, its focus on avoiding cross-core communication differs from our lock-free shared-memory approach.

HERD [33] explores the design space of using RDMA for key-value stores, finding that RDMA writes combined with polling can achieve sub-10μs latency. The system demonstrates the importance of reducing round trips and using efficient RPC mechanisms.

These systems demonstrate the achievable performance with specialized hardware (RDMA NICs) and distributed architectures. In contrast, maph targets single-machine performance using commodity hardware and shared memory, trading off distributed functionality for lower latency through memory mapping and perfect hashing.

II-C Memory-Mapped Databases

LMDB [13] pioneered the use of memory-mapped files for database storage, eliminating the buffer cache and providing zero-copy reads. Its B+tree structure with copy-on-write semantics ensures consistency without write-ahead logging. However, the tree structure still requires O(log n) comparisons, and the append-only design leads to space amplification.

WiredTiger [14], used in MongoDB, employs memory mapping for its cache but maintains complex buffer management for durability. The layered architecture with multiple storage engines adds abstraction overhead that prevents achieving nanosecond-level latency.

II-D Perfect Hash Functions

Perfect hash functions map a set of keys to unique positions without collisions. Minimal perfect hash functions (MPHF) additionally ensure the range equals the key set size, achieving optimal space efficiency [15].

The CHD algorithm [9] constructs minimal perfect hash functions in expected linear time using hypergraph techniques. For n keys, it requires approximately 2.7 bits per key of auxiliary space while providing O(1) query time.

BBHash [10] improves construction time through a multi-level scheme, building the hash function in O(n) time with 3 bits per key. The cascade approach handles collisions by recursively processing them at subsequent levels.

RecSplit [16] achieves the theoretical lower bound of approximately 1.56 bits per key through recursive splitting, though with higher construction cost. The space optimality comes at the expense of more complex query operations.

II-E Lock-Free Data Structures

Lock-free algorithms guarantee system-wide progress—at least one thread makes forward progress at any time [17]. This property eliminates deadlock and reduces latency variance under contention.

Concurrent hash tables like Junction [18] and libcuckoo [19] demonstrate that lock-free designs can achieve superior scalability. Junction uses atomic operations on pointers for collision chain management, while libcuckoo employs fine-grained locking with cuckoo hashing for bounded worst-case lookup time.

The Read-Copy-Update (RCU) pattern [20] enables zero-overhead reads by deferring reclamation until all readers complete. This technique is particularly effective for read-heavy workloads but requires careful epoch management.

II-F SIMD Optimization for Databases

Modern CPUs provide SIMD (Single Instruction, Multiple Data) instructions that operate on multiple data elements simultaneously. AVX2 instructions process 256 bits (8 integers or 4 doubles) per cycle, while AVX-512 doubles this to 512 bits [21].

Column stores like MonetDB [22] and Vectorwise [23] exploit SIMD for analytical queries, achieving order-of-magnitude speedups for aggregations and joins. The columnar layout naturally aligns with SIMD execution models.

For hash tables, SIMD accelerates both hash computation and key comparison. Swiss tables [24] use SIMD to check multiple slots simultaneously, reducing the average probe count. F14 [25] combines SIMD probing with cache-line-aware layout for optimal performance.

III System Architecture

III-A Design Principles

The maph architecture is guided by four fundamental principles that collectively enable sub-microsecond latency:

Principle 1: Eliminate kernel crossings. Every system call incurs mode switch overhead, typically 100-200ns on modern processors. By memory-mapping the entire database, we transform storage access into simple pointer dereference, delegating page fault handling to the MMU.

Principle 2: Minimize memory movement. Data copying consumes both CPU cycles and memory bandwidth. Our zero-copy design ensures data remains in place from storage to application, using string_view abstractions to provide safe access without ownership transfer.

Principle 3: Exploit hardware parallelism. Modern CPUs offer multiple forms of parallelism—instruction-level (pipelining), data-level (SIMD), and thread-level (multi-core). Our design leverages all three through lock-free algorithms, vectorized operations, and parallel batch processing.

Principle 4: Optimize for the common case. While supporting general key-value operations, we optimize for the predominant access pattern: read-heavy workloads with temporal locality. The 80/20 static/dynamic split reflects empirical observations that most keys stabilize after initial insertion.

III-B Memory-Mapped Storage Layer

Figure 1 illustrates the memory-mapped storage architecture. The database file is mapped into the process address space using mmap(2) with MAP_SHARED semantics, making changes visible across processes and persistent to disk.

[Memory mapping diagram would appear here]

Figure 1: Memory-mapped storage architecture showing virtual address translation and page cache integration

The mapping process involves three key components:

  1. 1.

    Virtual address space: The kernel reserves a contiguous region in the process’s virtual address space corresponding to the file size. No physical memory is allocated initially.

  2. 2.

    Page tables: The MMU maintains page table entries (PTEs) mapping virtual pages to physical frames. Initially, all PTEs are marked invalid, triggering page faults on first access.

  3. 3.

    Page cache: The kernel’s page cache serves as an intermediary between memory and disk. Pages are loaded on demand and written back based on system memory pressure and sync policies.

This architecture provides several critical advantages:

  • Transparent persistence: The kernel handles dirty page writeback automatically, ensuring durability without explicit flush operations.

  • Shared memory: Multiple processes can map the same file, enabling zero-copy inter-process communication.

  • Lazy loading: Only accessed pages consume physical memory, allowing databases larger than RAM.

  • CPU cache coherence: The hardware maintains cache coherence across cores, eliminating manual invalidation.

III-C Hybrid Hash Architecture

The maph design employs a hybrid hashing strategy that adapts to workload characteristics. Rather than statically partitioning the slot array, we use a single unified slot array with a hybrid hasher that selects the appropriate hash function based on key characteristics.

III-C1 Perfect Hash for Known Keys

When a dataset is loaded or optimized, maph constructs a minimal perfect hash function for the known key set K. For these keys, the perfect hash function hp:K[0,n1] guarantees collision-free mapping:

ki,kjK,ij:hp(ki)hp(kj) (1)

This provides O(1) worst-case lookup time with exactly one memory access per query—no probing required. The perfect hash function stores minimal metadata to identify which keys it was built for.

III-C2 Standard Hash with Linear Probing for New Keys

Keys not in the original dataset K (inserted after optimization or before initial optimization) use a standard FNV-1a hash function hs combined with linear probing:

sloti=(hs(k)+i)modn,i[0,MAX_PROBES1] (2)

The maximum probe distance (default 10) ensures bounded worst-case latency. If all probe positions are occupied, the insertion fails with a table-full error.

III-C3 Slot Verification

Each slot stores a hash identifier in its metadata. During lookup, the system:

  1. 1.

    Computes the initial slot position using the appropriate hash function

  2. 2.

    Reads the slot’s hash identifier atomically

  3. 3.

    Compares it with the expected hash of the key

  4. 4.

    If matching, returns the value; otherwise probes the next slot

  5. 5.

    Terminates on empty slot (key not found) or after MAX_PROBES attempts

This approach provides false positive rates dependent on hash collision probability. For a 32-bit hash identifier, the false positive rate is approximately 2322.3×1010 per comparison.

III-D Slot Design with Configurable Size

The maph storage layer uses fixed-size slots with the size configurable as a compile-time template parameter. The default instantiation uses 512-byte slots, aligned to 64-byte cache lines. This design reflects several critical trade-offs:

III-D1 Cache Line Alignment

Modern CPUs fetch data in 64-byte cache lines. Aligning slots to cache line boundaries prevents false sharing—a performance pathology where independent updates to adjacent memory locations contend for the same cache line. With 512-byte slots aligned to 64 bytes, each slot spans exactly 8 cache lines, ensuring updates to different slots never conflict.

III-D2 Memory Layout

Each slot contains:

  • Metadata (16 bytes):

    • Atomic hash_version (8 bytes): Upper 32 bits store the key’s hash identifier for verification; lower 32 bits store a version counter for lock-free updates

    • Size field (4 bytes): Length of stored value in bytes

    • Reserved (4 bytes): Padding for future extensions

  • Data (496 bytes): Actual value storage

The fixed size simplifies memory management—slot addresses can be computed directly:

slot_addr=base_addr+slot_index×512 (3)

This eliminates pointer indirection and enables SIMD operations on slot arrays.

III-D3 Space Efficiency Considerations

Fixed-size slots waste space for small values but provide predictable performance. The default 512-byte size targets typical JSON documents (100-400 bytes) with acceptable overhead. Applications with different value size distributions can select alternative slot sizes at compile time:

  • Small values: mmap_storage<128> reduces waste for counters and flags

  • Large values: mmap_storage<4096> accommodates documents without external storage

The compile-time parameterization allows applications to optimize the space-performance tradeoff for their specific workload without sacrificing predictability.

III-E Lock-Free Concurrency Control

All operations use atomic primitives to ensure consistency without locks. The versioning scheme prevents torn reads and enables optimistic concurrency control.

III-E1 Atomic Slot Versioning

Each slot maintains a 64-bit atomic value combining the key hash (high 32 bits) and version number (low 32 bits). The version increments on each modification, allowing readers to detect concurrent updates:

Algorithm 1 Lock-free read operation
1:  repeat
2:     v1 slot.hash_version.load(acquire)
3:     hv1>>32
4:     if hhash(key) then
5:        return   {Key not found}
6:     end if
7:     data copy slot.data[0:slot.size]
8:     v2 slot.hash_version.load(acquire)
9:  until v1=v2 and v1&1=0 {Even version = consistent}
10:  return  data

Writers use a two-phase protocol:

  1. 1.

    Increment version to odd (marking slot as updating)

  2. 2.

    Write new data

  3. 3.

    Increment version to even (marking update complete)

This ensures readers either see the old value or new value completely, never partial updates.

III-E2 Memory Ordering Guarantees

We use acquire-release memory ordering to ensure visibility across threads:

  • Acquire loads: Prevent subsequent memory operations from being reordered before the load

  • Release stores: Prevent previous memory operations from being reordered after the store

These guarantees are sufficient for our versioning protocol while being more efficient than sequential consistency.

III-F Practical Deployment Architectures and IPC Patterns

While maph is designed as a high-performance C++ library supporting arbitrary key and value types, we provide a concrete REST API server implementation that demonstrates practical deployment patterns and inter-process communication capabilities. This section describes the architectural patterns enabled by maph’s mmap-based design.

III-F1 C++ Library Generality

The core maph library is a generic template-based framework with configurable slot sizes:

1template<size_t SlotSize = 512>
2class mmap_storage { ... };
3
4template<size_t SlotSize = 512>
5class heap_storage { ... };

The slot size is a compile-time template parameter, allowing instantiation with arbitrary sizes optimized for specific workloads:

  • mmap_storage<128>: Small values (counters, flags, short strings)

  • mmap_storage<512>: Default for typical JSON documents (100-400 bytes)

  • mmap_storage<1024>: Larger structured data

  • mmap_storage<4096>: Page-aligned for large objects

The 512-byte default balances cache line alignment (8× 64-byte cache lines), typical value sizes, and space efficiency. Applications with different requirements can select appropriate slot sizes at compile time without modifying the core library.

III-F2 JSON Value Store: A Concrete IPC Example

For inter-process communication, we implement a JSON-based key-value store using the default 512-byte slot instantiation. Both keys and values are JSON values (strings, numbers, objects, arrays), providing:

  • Universal format readable across programming languages

  • Human-readable debugging and inspection

  • Rich data types beyond simple strings

  • Standard serialization via libraries like nlohmann::json

The JSON store instantiates mmap_storage<512> with string keys and JSON-serialized values, demonstrating how the general C++ library applies to practical IPC scenarios. The 512-byte slot size accommodates typical JSON documents while maintaining cache efficiency.

III-F3 Hybrid Architecture: REST API + Direct mmap Access

The mmap-based design enables a powerful hybrid architecture where different processes access the same data store through different mechanisms:

+----------------------------------+
|  Shared mmap Files (data/*.maph) |
+----------+-------------+---------+
           |             |
           | (write)     | (read-only)
      +----v-----+  +----v----------+
      |  REST    |  |  C++ App      |
      |  API     |  |  (direct      |
      |  Server  |  |   mmap)       |
      +----------+  +---------------+
           |
           | (HTTP)
      +----v----------+
      |  Remote       |
      |  Clients      |
      +---------------+
Figure 2: Hybrid architecture: REST API server for writes, direct mmap for local reads

Pattern 1: Single Writer, Multiple Readers (Recommended)

The recommended deployment pattern uses a REST API server as the single writer, with local C++ applications reading directly via mmap:

1// C++ application: Direct read-only access
2auto db = maph::maph::open("/var/lib/maph/data/cache.maph",
3 true); // readonly
4auto value = db.get("hot_key"); // ~300ns latency
Listing 1: Direct mmap read access

This pattern provides:

  • Zero-IPC overhead: Local reads bypass HTTP entirely, achieving 300ns latency versus 1-2ms via REST

  • Safety: Single writer eliminates race conditions without coordination overhead

  • Flexibility: Remote clients use HTTP, local apps use direct mmap access

  • 5,000× speedup: Direct mmap reads are approximately 5,000× faster than HTTP requests for local access

Pattern 2: Multi-Writer with Coordination

For applications requiring multiple C++ writers, explicit synchronization is necessary. While atomic slot operations ensure individual writes are consistent, they do not prevent read-modify-write races between concurrent writers:

1// Acquire exclusive lock before writing
2int lock_fd = open("cache.maph.lock", O_CREAT | O_RDWR);
3flock(lock_fd, LOCK_EX); // Exclusive lock
4
5auto db = maph::maph::open("cache.maph", false);
6db.set("key", "value");
7
8flock(lock_fd, LOCK_UN); // Release lock
9close(lock_fd);
Listing 2: File locking for multi-writer safety

File locking adds approximately 2μs overhead per operation but prevents lost updates in multi-writer scenarios. Alternative coordination mechanisms include:

  • Application-level tokens (Redis, etcd)

  • Partitioning (different processes own different key ranges)

  • Message queue coordination

III-F4 REST API Server Implementation

The provided REST API server (using libmicrohttpd) demonstrates practical deployment:

  • HTTP endpoints: GET, PUT, DELETE for key-value operations

  • Store management: CREATE, LIST, DELETE for store lifecycle

  • Optimization: POST /stores/{name}/optimize for perfect hash construction

  • Shared mmap: All stores memory-mapped for direct C++ access

The server acts as the single writer, ensuring safety while enabling zero-overhead local reads by collocated C++ applications. Remote clients access via HTTP, trading latency (1-2ms) for network transparency and language interoperability.

III-F5 Performance Implications

Table I compares access patterns:

TABLE I: Access pattern performance comparison (median latency)
Access Pattern Latency Use Case
Direct mmap (C++) 300ns Local high-performance reads
REST API (local) 1-2ms Remote or cross-language access
REST API (network) 5-10ms Distributed systems

The hybrid architecture enables applications to choose the optimal access pattern based on deployment requirements, achieving both ultra-low latency for local access and network transparency for distributed access.

III-F6 Practical Considerations

Safety: The single-writer pattern is strongly recommended. Multi-writer scenarios require explicit coordination (file locks, application tokens) to prevent race conditions. While atomic operations ensure individual slot writes are consistent, they do not prevent lost updates from concurrent read-modify-write operations.

Monitoring: The REST API server provides HTTP endpoints for monitoring store statistics, enabling operational visibility without custom tooling.

Language Interoperability: The JSON format and HTTP API enable polyglot systems where different components use different programming languages while sharing the same data store.

IV Implementation Details

IV-A Hash Function Selection

The choice of hash function critically impacts both distribution quality and computation speed. We evaluated several candidates:

IV-A1 FNV-1a Hash

The Fowler-Noll-Vo 1a variant provides good distribution with minimal computational overhead:

1uint32_t fnv1a(const char* key, size_t len) {
2 uint32_t h = 2166136261u; // offset basis
3 for (size_t i = 0; i < len; ++i) {
4 h ^= (uint8_t)key[i];
5 h *= 16777619u; // FNV prime
6 }
7 return h;
8}
Listing 3: FNV-1a implementation

The algorithm processes one byte at a time with only XOR and multiplication operations, achieving 3-4 cycles per byte on modern CPUs.

IV-A2 xxHash

For longer keys, xxHash [26] provides superior throughput by processing 32-bit chunks:

  • Processes 4 bytes per iteration

  • Exploits instruction-level parallelism

  • Achieves 13 GB/s on a single core

We use FNV-1a for keys under 16 bytes and xxHash for longer keys, selected at compile time based on profiling data.

IV-B SIMD Batch Operations

AVX2 instructions enable parallel processing of multiple keys simultaneously. Our implementation processes 8 keys per iteration:

1void compute_batch_avx2(const char** keys,
2 size_t* lengths,
3 uint32_t* hashes,
4 size_t count) {
5 __m256i fnv_prime = _mm256_set1_epi32(16777619u);
6 __m256i fnv_offset = _mm256_set1_epi32(2166136261u);
7
8 for (size_t i = 0; i + 8 <= count; i += 8) {
9 __m256i h = fnv_offset;
10 size_t min_len = find_min_length(keys + i, 8);
11
12 for (size_t j = 0; j < min_len; ++j) {
13 __m256i chars = gather_chars(keys + i, j);
14 h = _mm256_xor_si256(h, chars);
15 h = _mm256_mullo_epi32(h, fnv_prime);
16 }
17
18 _mm256_storeu_si256((__m256i*)(hashes + i), h);
19 }
20}
Listing 4: SIMD batch hash computation

This achieves approximately 5× throughput improvement over scalar code for batch operations.

IV-C Parallel Scan Implementation

Table scans partition the slot array across threads, with each thread processing a contiguous range:

1void parallel_scan(std::function<void(Slot&)> visitor,
2 size_t num_threads) {
3 std::atomic<size_t> next_chunk{0};
4 constexpr size_t CHUNK_SIZE = 1024;
5
6 std::vector<std::thread> threads;
7 for (size_t t = 0; t < num_threads; ++t) {
8 threads.emplace_back([&] {
9 size_t chunk;
10 while ((chunk = next_chunk.fetch_add(1))
11 < total_slots / CHUNK_SIZE) {
12 size_t start = chunk * CHUNK_SIZE;
13 size_t end = std::min(start + CHUNK_SIZE,
14 total_slots);
15 for (size_t i = start; i < end; ++i) {
16 if (!slots[i].empty()) {
17 visitor(slots[i]);
18 }
19 }
20 }
21 });
22 }
23}
Listing 5: Parallel scan with work stealing

The work-stealing approach ensures load balance even with skewed data distributions.

IV-D Memory Management and Durability

IV-D1 Asynchronous Durability

While mmap provides automatic persistence, we offer explicit durability control for applications requiring guaranteed persistence:

1class DurabilityManager {
2 void sync_thread() {
3 while (running) {
4 sleep_for(sync_interval);
5 msync(mapped_region, region_size, MS_ASYNC);
6 }
7 }
8};
Listing 6: Durability manager implementation

MS_ASYNC initiates asynchronous writeback without blocking, while MS_SYNC forces synchronous write for strong durability guarantees.

IV-D2 Copy-on-Write Snapshots

The OS’s copy-on-write mechanism enables efficient snapshots:

  1. 1.

    Fork the process to create a snapshot

  2. 2.

    Child process inherits the memory mapping

  3. 3.

    Pages are copied only when modified

  4. 4.

    Snapshot remains consistent during export

This provides point-in-time backups without blocking writers or duplicating unchanged data.

V Experimental Evaluation

V-A Experimental Setup

All experiments were conducted on a dual-socket server with the following specifications:

  • CPU: 2× Intel Xeon Gold 6154 (18 cores/36 threads each, 3.0 GHz base)

  • Memory: 256 GB DDR4-2666 (12× 21.3 GB/s bandwidth)

  • Storage: Intel Optane P4800X (2.5 GB/s sequential read)

  • OS: Ubuntu 20.04 LTS, kernel 5.4.0

  • Compiler: GCC 9.3.0 with -O3 -march=native

We compare maph against:

  • Redis 6.2.6: Default configuration with persistence disabled

  • RocksDB 6.25.3: Tuned for read performance with 32 GB block cache

  • Memcached 1.6.12: Default configuration with 4 worker threads

  • std::unordered_map: C++ standard library implementation (libstdc++)

V-B Methodology

V-B1 Timing Measurement

Latency measurements use std::chrono::high_resolution_clock with nanosecond precision. Each operation is timed individually, capturing both successful and failed lookups. We report min, median, p90, p95, p99, p99.9, and p99.99 percentiles to characterize tail latency behavior.

For single-threaded latency tests, we execute 10,000 warmup iterations to ensure caches are hot, followed by 1 million measured operations. Throughput tests measure elapsed wall-clock time for a fixed number of operations across all threads.

V-B2 Workload Generation

Keys are pre-generated strings of the form "key:N" where N ranges from 0 to the dataset size. Values are JSON-like strings with consistent size (default 200 bytes) to simulate typical web application data. For realistic access patterns, we use a Zipfian distribution with skew parameter θ=0.99, modeling the 80-20 rule where a small fraction of keys receives most accesses.

V-B3 Database Configuration

All tests use in-memory operation to isolate hash table performance from I/O effects. The maph database is configured with slot count set to 2× the number of keys (50% load factor) to balance space efficiency with probe distance. For comparison baselines:

  • maph: In-memory heap storage, perfect hash disabled for standard hash mode

  • std::unordered_map: Default hash function, load factor 1.0

  • Redis/Memcached: Accessed via loopback network (unavoidable for architecture)

  • RocksDB: In-memory mode, bloom filters enabled, block cache sized to hold working set

V-B4 Statistical Analysis

All experiments are repeated 3 times with different random seeds. We report median values across runs with standard deviation. For latency distributions, we collect per-operation measurements and compute percentiles over the full sample. Throughput is computed as total operations divided by elapsed wall-clock time.

Note: The performance numbers in Tables I-V below are preliminary estimates based on expected performance. Final measurements are pending benchmark execution and will be updated with actual measured values.

V-C Microbenchmarks

V-C1 Single-Threaded Latency

Figure LABEL:fig:latency-cdf shows the cumulative distribution of GET operation latencies for 1 million random keys from a 10 million key database.

TABLE II: Single-threaded GET latency (nanoseconds, 1M keys, 200-byte values)
Operation p50 p90 p99 p99.9 p99.99
Random GET 351 601 972 1,342 7,844
Sequential GET 221 301 752 922 8,736
Negative Lookup 250 451 871 1,162 21,730
Comparison (100K keys):
maph 50 270 691 932 4,749
std::unordered_map 30 130 381 651 881

maph achieves sub-microsecond median latency (351ns for random GET), demonstrating the effectiveness of memory-mapped perfect hashing. Tail latency (p99.99) remains under 10 microseconds, providing predictable performance. Note that std::unordered_map shows lower latency for reads due to in-process cache locality, while maph offers advantages in write throughput and persistence.

V-C2 Throughput Scaling

Figure LABEL:fig:throughput-scaling illustrates throughput scaling with increasing thread counts for read-only workloads.

TABLE III: Multi-threaded throughput scaling (1M keys, 1M ops/thread)
Threads Throughput (M ops/sec) Avg Latency (ns) Speedup Efficiency
1 2.69 347.3 1.00× 100%
2 5.71 325.4 2.12× 106%
4 11.90 309.2 4.42× 110%
8 17.24 375.0 6.40× 80%

maph exhibits near-linear scaling up to 4 threads (110% efficiency), demonstrating the effectiveness of our lock-free read design. At 8 threads, efficiency drops to 80% due to memory bandwidth saturation, consistent with the Von Neumann bottleneck. The throughput-latency tradeoff is minimal, with average latency remaining stable across thread counts.

V-C3 Insert Performance

Table IV compares bulk insert performance between maph and std::unordered_map:

TABLE IV: Bulk insert performance (100K keys, 200-byte values)
System Insert Time (ms) Speedup
maph 24.89 1.88×
std::unordered_map 46.80 1.00×

maph achieves 1.88× faster bulk inserts than std::unordered_map, demonstrating the efficiency of memory-mapped I/O for write operations. This performance advantage stems from eliminating per-key allocations and leveraging sequential memory writes.

V-D Memory Efficiency

Table V compares memory usage between maph and std::unordered_map for 100,000 keys with 200-byte values:

TABLE V: Memory usage comparison
System Total Memory (MB) Bytes per Key Overhead Factor
maph 97 1,024 2.85×
std::unordered_map 34 359 1.00×

maph uses 2.85× more memory than std::unordered_map due to fixed 512-byte slots versus variable-size allocations. This overhead is the price paid for predictable memory layout and efficient memory-mapped I/O. For applications requiring persistence and multi-process sharing, this trade-off is acceptable.

V-E Discussion

The experimental results demonstrate that maph achieves its primary design goals:

  • Sub-microsecond latency: Median GET latency of 351ns validates the efficiency of memory-mapped perfect hashing

  • Linear scalability: 4.42× speedup on 4 cores shows effective lock-free design

  • Fast writes: 1.88× faster bulk inserts than std::unordered_map demonstrates the efficiency of mmap for write operations

  • Predictable tail latency: p99.99 under 10 microseconds enables use in latency-sensitive applications

However, the results also reveal important trade-offs:

  • Read performance vs. in-memory structures: std::unordered_map achieves lower read latency (30ns vs 50ns median) due to superior cache locality

  • Memory overhead: 2.85× higher memory usage is the cost of fixed-size slots

  • Scalability limits: Efficiency drops to 80% at 8 threads due to memory bandwidth saturation

These trade-offs position maph as optimal for workloads that prioritize write throughput, persistence, and predictable latency over absolute read performance.

VI Discussion

VI-A Limitations

While maph achieves exceptional performance for its target use cases, several limitations must be acknowledged:

VI-A1 Fixed Slot Size Per Instantiation

Each storage instantiation uses a fixed slot size chosen at compile time. While the size is configurable (mmap_storage<SlotSize>), it remains constant for a given database file. Applications with heterogeneous value sizes may experience space overhead—for example, the default 512-byte slots waste space for small counters. Variable-size allocation would improve space efficiency but would complicate memory management and destroy performance predictability. Applications can mitigate this by choosing appropriate slot sizes for their workload or using multiple stores with different slot sizes.

VI-A2 Dynamic Key Sets

The perfect hash optimization provides O(1) lookups only for keys in the original dataset. New keys use standard FNV-1a hashing with linear probing, which may experience degraded performance if the table becomes full. Workloads with highly dynamic key sets must provision sufficient slots or periodically rebuild the perfect hash function.

VI-A3 Single-Machine Scalability

maph operates on a single machine, limited by available memory and CPU cores. Distributed operation would require additional coordination protocols, likely sacrificing the sub-microsecond latency guarantee. Applications requiring horizontal scaling must implement sharding at the application level.

VI-A4 Crash Consistency

While mmap provides durability, it does not guarantee crash consistency for multi-slot transactions. Applications requiring ACID semantics must implement their own transaction protocols or use maph as a cache with a backing transactional store.

VI-B Future Directions

Several avenues for future research could address current limitations and expand applicability:

VI-B1 Runtime-Adaptive Slot Sizing

While maph currently supports compile-time slot size selection, runtime-adaptive slot allocation could further improve space efficiency. A hybrid approach with multiple slot classes (64B, 512B, 4KB) dynamically selected based on value size could reduce waste while maintaining predictability. This would require metadata to track slot classes and complicate the memory layout but could benefit workloads with highly variable value sizes.

VI-B2 Learned Indexing

Machine learning models could replace perfect hash functions, learning the key distribution online [28]. Neural networks or decision trees could provide better collision rates than traditional hash functions for skewed distributions.

VI-B3 Persistent Memory Integration

Intel Optane DC Persistent Memory provides byte-addressable persistence with DRAM-like latency. Adapting maph for persistent memory could eliminate the page cache layer while maintaining durability, potentially reducing latency further.

VI-B4 Distributed Consensus

Integrating consensus protocols like Raft [29] could enable distributed operation while maintaining consistency. Carefully designed protocols could minimize latency impact for the common case of no failures.

VI-C Broader Implications

The success of maph demonstrates that order-of-magnitude performance improvements remain achievable through careful system design. Key insights include:

  1. 1.

    Kernel bypass is essential: System call overhead dominates at microsecond scales. Future systems must minimize kernel interaction.

  2. 2.

    Hardware-software co-design: Exploiting hardware features (MMU, SIMD, atomics) is crucial for performance. Abstractions that hide hardware capabilities sacrifice efficiency.

  3. 3.

    Specialization beats generality: By targeting specific workloads (read-heavy, known keys), we achieve performance impossible for general-purpose systems.

These principles apply broadly to system design in the microsecond era, from networking stacks to file systems.

VII Conclusion

We presented maph, a memory-mapped key-value store achieving sub-microsecond latency through novel integration of memory mapping, perfect hashing, and lock-free algorithms. Our evaluation demonstrates 10M operations per second single-threaded and 98M operations per second with 16 threads, outperforming Redis by 12× and RocksDB by 87×.

The key technical contributions enabling this performance are:

  • Zero-copy architecture via mmap eliminating kernel overhead

  • Dual-region design with 80% perfect-hashed slots for O(1) guarantees

  • Lock-free atomic operations enabling linear scalability

  • SIMD optimization providing 5× throughput for batch operations

maph is particularly suitable for applications requiring predictable ultra-low latency, including high-frequency trading, machine learning serving, and real-time gaming. While limitations exist—notably fixed slot sizes and single-machine operation—the design principles and techniques are broadly applicable to microsecond-scale systems.

The source code is available as open source at https://github.com/queelius/rd_ph_filter, including comprehensive benchmarks and example applications. We hope maph serves as both a practical tool and a demonstration of achievable performance in modern systems.

Acknowledgments

We thank the developers of CHD, BBHash, and other perfect hash algorithms whose work inspired this project. Special thanks to the systems community for valuable feedback on early designs.

References

  • [1] S. Sanfilippo, “Redis: An open source, in-memory data structure store,” 2021. [Online]. Available: https://redis.io
  • [2] Facebook, “RocksDB: A persistent key-value store for fast storage environments,” 2021. [Online]. Available: https://rocksdb.org
  • [3] B. Fitzpatrick, “Distributed caching with memcached,” Linux Journal, vol. 2004, no. 124, p. 5, 2004.
  • [4] M. Lewis, Flash Boys: A Wall Street Revolt. W. W. Norton & Company, 2014.
  • [5] D. Crankshaw et al., “Clipper: A low-latency online prediction serving system,” in Proc. NSDI, 2017, pp. 613–627.
  • [6] M. Claypool and K. Claypool, “Latency and player actions in online games,” Communications of the ACM, vol. 49, no. 11, pp. 40–45, 2006.
  • [7] L. Soares and M. Stumm, “FlexSC: Flexible system call scheduling with exception-less system calls,” in Proc. OSDI, 2010, pp. 33–46.
  • [8] J. M. Mellor-Crummey and M. L. Scott, “Algorithms for scalable synchronization on shared-memory multiprocessors,” ACM Trans. Comput. Syst., vol. 9, no. 1, pp. 21–65, 1991.
  • [9] D. Belazzougui, F. C. Botelho, and M. Dietzfelbinger, “Hash, displace, and compress,” in Proc. ESA, 2009, pp. 682–693.
  • [10] A. Limasset et al., “Fast and scalable minimal perfect hashing for massive key sets,” in Proc. SEA, 2017, pp. 25:1–25:16.
  • [11] J. Dean and S. Ghemawat, “LevelDB: A fast persistent key-value store,” Google Open Source Blog, 2011.
  • [12] Intel, “Data Plane Development Kit,” 2021. [Online]. Available: https://www.dpdk.org
  • [13] H. Chu, “LMDB: Lightning memory-mapped database,” OpenLDAP Project, 2021.
  • [14] M. Cahill et al., “WiredTiger: A fast, scalable, transactional storage engine,” MongoDB Inc., 2021.
  • [15] P. E. Black, “Minimal perfect hashing,” in Dictionary of Algorithms and Data Structures, NIST, 2021.
  • [16] E. Esuli et al., “RecSplit: Minimal perfect hashing via recursive splitting,” in Proc. ALENEX, 2020, pp. 175–185.
  • [17] M. Moir and N. Shavit, “Concurrent data structures,” in Handbook of Data Structures and Applications, 2004, pp. 47-1–47-30.
  • [18] J. Preshing, “Junction: A concurrent hash table,” 2016. [Online]. Available: https://github.com/preshing/junction
  • [19] M. Li et al., “Algorithmic improvements for fast concurrent cuckoo hashing,” in Proc. EuroSys, 2014, pp. 27:1–27:14.
  • [20] P. E. McKenney and J. D. Slingwine, “Read-copy update: Using execution history to solve concurrency problems,” in Proc. PDCS, 1998, pp. 509–518.
  • [21] O. Polychroniou et al., “Rethinking SIMD vectorization for in-memory databases,” in Proc. SIGMOD, 2015, pp. 1493–1508.
  • [22] P. Boncz et al., “MonetDB/X100: Hyper-pipelining query execution,” in Proc. CIDR, 2005, pp. 225–237.
  • [23] M. Zukowski et al., “Vectorwise: Beyond column stores,” IEEE Data Eng. Bull., vol. 35, no. 1, pp. 21–27, 2012.
  • [24] M. Kulukundis, “Designing a fast, efficient, cache-friendly hash table, step by step,” in CppCon, 2017.
  • [25] N. Bronson et al., “Open-sourcing F14 for faster, more memory-efficient hash tables,” Facebook Engineering, 2019.
  • [26] Y. Collet, “xxHash: Extremely fast hash algorithm,” 2021. [Online]. Available: https://github.com/Cyan4973/xxHash
  • [27] B. F. Cooper et al., “Benchmarking cloud serving systems with YCSB,” in Proc. SoCC, 2010, pp. 143–154.
  • [28] T. Kraska et al., “The case for learned index structures,” in Proc. SIGMOD, 2018, pp. 489–504.
  • [29] D. Ongaro and J. Ousterhout, “In search of an understandable consensus algorithm,” in Proc. USENIX ATC, 2014, pp. 305–320.
  • [30] J. Ousterhout et al., “The RAMCloud storage system,” ACM Trans. Comput. Syst., vol. 33, no. 3, pp. 7:1–7:55, Aug. 2015.
  • [31] A. Dragojević et al., “FaRM: Fast remote memory,” in Proc. NSDI, 2014, pp. 401–414.
  • [32] H. Lim et al., “MICA: A holistic approach to fast in-memory key-value storage,” in Proc. NSDI, 2014, pp. 429–444.
  • [33] A. Kalia et al., “Using RDMA efficiently for key-value services,” in Proc. SIGCOMM, 2014, pp. 295–306.