active library

alga

Started 2021 C++

Resources & Distribution

Source Code

Package Registries

Alga

A mathematically elegant C++ library for algebraic text processing and compositional parsing. Built on rigorous algebraic foundations with monoids, functors, and extended operators, Alga transforms text manipulation from imperative string processing into declarative mathematical expressions.

Key Features

  • Mathematical Foundation: Built on rigorous algebraic structures (monoids, functors) with verified laws and properties
  • High Performance: Header-only template library with zero-cost abstractions and C++17/20 support
  • Type Safety: Compile-time guarantees through concepts, strong typing, and mathematical invariants
  • Algebraic Operators: Extended operator set (*, |, ^, %, >>) with C++20 concepts for mathematical composition
  • Compositional Design: Simple parsers combine naturally into sophisticated text analysis pipelines
  • Comprehensive Testing: 90%+ coverage with property-based testing of algebraic laws
  • Production Ready: Includes industry-standard Porter2 stemmer and real-world examples

Quick Start

Prerequisites

  • C++17 compatible compiler (GCC 7+, Clang 5+, MSVC 2017+)
  • CMake 3.14+
  • Optional: Google Test for running tests

Installation

git clone https://github.com/yourusername/alga.git
cd alga
mkdir build && cd build
cmake ..
make -j$(nproc)

# Run tests to verify installation
make test

Basic Example

#include "parsers/lc_alpha.hpp"
#include "parsers/porter2stemmer.hpp"
#include "parsers/algebraic_operators.hpp"

using namespace alga;

int main() {
    // Algebraic text processing with mathematical elegance
    auto word1 = make_lc_alpha("hello");
    auto word2 = make_lc_alpha("world");
    
    if (word1 && word2) {
        // Mathematical composition operators
        auto combined = *word1 * *word2;          // Monoid concatenation: "helloworld"
        auto choice = word1 | word2;              // Choice operator: first valid
        auto emphasis = *word1 ^ 3;               // Repetition: "hellohellohello"
        auto sequence = *word1 >> *word2;         // Sequential: vector["hello", "world"]
        
        // Function application with error handling
        auto to_upper = [](auto w) { 
            std::string s = w.str();
            std::transform(s.begin(), s.end(), s.begin(), ::toupper);
            return s;
        };
        auto uppercase = word1 % to_upper;        // Functional style: "HELLO"
        
        // Porter2 stemming with algebraic composition
        auto stem = make_porter2_stem("running");
        if (stem) {
            auto repeated = *stem ^ 2;             // "runrun"
            std::cout << "Stem repetition: " << std::string{repeated} << std::endl;
        }
        
        // Complex algebraic expressions
        auto complex = ((word1 ^ 2) | (word2 ^ 3)) % [](auto w) { return w.size(); };
        std::cout << "Complex expression result: " << *complex << std::endl;
    }
    
    return 0;
}

Mathematical Foundations

Algebraic Structures

The library implements several key mathematical concepts:

  1. Free Monoids: The set of lowercase alphabetic strings forms a free monoid under concatenation

    • Identity: Empty string
    • Associativity: (a · b) · c = a · (b · c)
    • Closure: Concatenation always produces valid strings
  2. Functors: Structure-preserving transformations between categories

    // fmap preserves composition: fmap(g ∘ f) = fmap(g) ∘ fmap(f)
    auto parser = make_word_parser();
    auto length_parser = fmap(parser, [](const std::string& s) { return s.length(); });
    
  3. Monadic Composition: Error-handling composition with automatic failure propagation

    auto result = make_lc_alpha("running")
        >>= [](const lc_alpha& word) { return stemmer(word); }
        >>= [](const porter2_stem& stem) { return make_unigram(stem); };
    

Verified Properties

All algebraic structures satisfy their mathematical laws:

// Monoid laws (automatically tested)
ASSERT_EQ((a * b) * c, a * (b * c));           // Associativity
ASSERT_EQ(empty * a, a);                        // Left identity  
ASSERT_EQ(a * empty, a);                        // Right identity

// Functor laws (automatically tested)
ASSERT_EQ(fmap(id, parser), parser);            // Identity preservation
ASSERT_EQ(fmap(compose(g,f), parser), fmap(g, fmap(f, parser))); // Composition

Core Components

Parser Types

  • lc_alpha: Type-safe lowercase alphabetic strings with monoid structure
  • porter2_stemmer: Production-ready Porter2 stemming algorithm with algebraic interface
  • combinatorial_parser: Composable parser combinators for building complex grammars
  • fsm_string_rewriter: Finite state machine-based string transformations
  • ngram_stemmer: N-gram generation and composition with algebraic operations

Composition Examples

#include "examples/composition_examples.hpp"

// Multi-strategy processing with fallbacks
MultiStrategyStemmer stemmer;
auto result = stemmer(word);  // Tries Porter2, falls back to alternatives

// Document processing pipeline
DocumentProcessor processor;
auto analysis = processor.process_document(text);
// Returns: word count, unique stems, n-gram statistics

// Parallel processing with value semantics
std::vector<std::future<std::optional<porter2_stem>>> futures;
for (const auto& word : words) {
    futures.emplace_back(std::async(std::launch::async, [=]() {
        return make_porter2_stem(word);  // Safe to copy - no shared state
    }));
}

Advanced Usage

Custom Parser Development

// Define algebraic parser with proper interface
template<typename Predicate, typename Transform>
class AlgebraicParser {
    Predicate predicate;
    Transform transformer;

public:
    using input_type = std::string;
    using output_type = std::invoke_result_t<Transform, std::string>;
    
    auto operator()(const input_type& input) const -> std::optional<output_type> {
        if (predicate(input)) {
            return transformer(input);
        }
        return std::nullopt;
    }
    
    // Monadic composition
    template<typename NextParser>
    auto then(NextParser&& next) const {
        return [*this, next = std::forward<NextParser>(next)](const input_type& input) {
            auto result = operator()(input);
            if (!result) return std::nullopt;
            return next(*result);
        };
    }
};

// Usage
auto vowel_extractor = AlgebraicParser{
    [](const std::string& s) { return !s.empty() && is_vowel(s[0]); },
    [](const std::string& s) { return std::toupper(s[0]); }
};

Parser Combinators

#include "parsers/combinatorial_parser_fixed.hpp"

// Sequential composition
auto word_number = sequence(word_parser(), digit_parser());

// Choice composition (sum types)
auto word_or_number = alternative(word_parser(), digit_parser());  

// Repetition (Kleene star)
auto many_words = many(word_parser());

// Transform results while preserving structure
auto word_lengths = transform(word_parser(), [](const std::string& s) { 
    return s.length(); 
});

Testing and Quality Assurance

Comprehensive Test Suite

The library includes extensive testing at multiple levels:

# Run all tests
make test

# Test coverage analysis
make coverage  # Requires CMAKE_BUILD_TYPE=Coverage

# Performance benchmarks
make benchmark

Property-Based Testing

Mathematical properties are automatically verified:

// Test algebraic laws for all monoid types
TEST_F(AlgebraicProperties, MonoidLaws) {
    // Tests run on randomly generated data
    for (int i = 0; i < 1000; ++i) {
        auto a = generate_random_lc_alpha();
        auto b = generate_random_lc_alpha();  
        auto c = generate_random_lc_alpha();
        
        ASSERT_EQ((a * b) * c, a * (b * c));  // Associativity
    }
}

Real-World Integration Tests

TEST(Integration, ComplexDocumentProcessing) {
    // Process multi-paragraph document
    std::string document = load_test_document("lorem_ipsum.txt");
    DocumentProcessor processor;
    
    auto result = processor.process_document(document);
    
    ASSERT_GT(result.word_count, 100);
    ASSERT_GT(result.unique_stems.size(), 50);
    ASSERT_FALSE(result.bigrams.empty());
}

Performance

  • Zero-Cost Abstractions: Template metaprogramming eliminates runtime overhead
  • Memory Efficient: RAII and move semantics, no unnecessary copying
  • Parallelizable: Value semantics enable safe concurrent processing
  • Optimized Algorithms: Porter2 implementation tuned for performance

Benchmarks show comparable performance to hand-optimized C code while providing mathematical guarantees and composability.

Architecture

include/parsers/           # Core algebraic parser types
├── lc_alpha.hpp          # Free monoid of lowercase strings  
├── porter2stemmer.hpp    # Porter2 algorithm with algebraic interface
├── combinatorial_parser_fixed.hpp  # Parser combinators
├── ngram_stemmer.hpp     # N-gram composition
├── fsm_string_rewriter.hpp # Finite state transformations
└── word_parser.hpp       # Word tokenization

examples/                  # Real-world composition patterns
└── composition_examples.hpp  # Document processing pipelines

test/                     # Comprehensive test suite
├── composition_test.cpp  # Compositional property tests
├── comprehensive_test_suite.cpp # Integration tests
└── performance_benchmark.cpp    # Performance validation

Contributing

We welcome contributions that maintain the mathematical rigor and performance characteristics:

  1. Mathematical Correctness: All algebraic structures must satisfy their laws
  2. Type Safety: Use concepts and static assertions to prevent invalid usage
  3. Performance: Maintain zero-cost abstraction principles
  4. Testing: Include property-based tests for algebraic laws
  5. Documentation: Provide clear mathematical explanations

See CONTRIBUTING.md for detailed guidelines.

License

This project is licensed under the GNU General Public License v2.0 - see the LICENSE file for details.

Acknowledgments

  • Inspired by Haskell’s parser combinator libraries and category theory
  • Built using modern C++ best practices and mathematical foundations
  • Implements classical algorithms (Porter2) within a principled algebraic framework

Mathematical parsing meets practical performance. This library demonstrates that rigorous mathematical foundations enhance rather than compromise real-world usability and performance.