shannon
MCP-LLM Backend
An elegant, Anthropic-compatible LLM backend with native MCP (Model Context Protocol) support. Run local LLMs with tool-calling capabilities while seamlessly integrating with MCP servers.
Features
- Anthropic API Compatibility: Drop-in replacement for Claude API
- Native MCP Support: Automatically discovers and exposes MCP tools
- Fast Local Inference: Uses llama.cpp for efficient GGUF model serving
- Plugin Architecture: Extensible design for custom functionality
- Streaming Support: Full SSE streaming compatible with Anthropic clients
- Tool Calling: Native support for models with tool-calling capabilities
Architecture
User Application → Anthropic API → Orchestrator → LLM Engine (llama.cpp)
                                        ↓
                                  MCP Manager → MCP Servers
The system acts as an MCP Host, managing connections to multiple MCP servers while using a local LLM to intelligently orchestrate tool calls.
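In practice, the orchestrator runs an agentic loop: it sends the conversation and the discovered MCP tools to the LLM engine, executes any tool calls the model emits via the MCP manager, and feeds the results back until the model produces a final answer. A minimal sketch of that loop follows; the LLMEngine and MCPManager methods shown are simplified assumptions, not the project's exact interfaces:

# Illustrative orchestration loop; method names and return shapes are assumptions.
async def orchestrate(engine, mcp_manager, messages, max_rounds=5):
    tools = await mcp_manager.list_tools()          # Anthropic-format tool specs
    reply = None
    for _ in range(max_rounds):
        reply = await engine.generate(messages, tools=tools)
        tool_calls = [block for block in reply.content if block.type == "tool_use"]
        if not tool_calls:
            return reply                            # model answered directly
        # Echo the assistant turn, then append tool results as the next user turn.
        messages.append({"role": "assistant", "content": reply.content})
        results = []
        for call in tool_calls:
            output = await mcp_manager.call_tool(call.name, call.input)
            results.append({
                "type": "tool_result",
                "tool_use_id": call.id,
                "content": output,
            })
        messages.append({"role": "user", "content": results})
    return reply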
Installation
# Install dependencies
pip install -e .
# Download a compatible GGUF model (e.g., Llama 3.2 Instruct)
mkdir models
wget -O models/llama-3.2-3b-instruct.gguf [model-url]
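If the model is published on the Hugging Face Hub, the GGUF file can also be fetched with huggingface_hub; the repository and filename below are placeholders, not a specific recommendation:

from huggingface_hub import hf_hub_download

# Placeholder repo/filename: substitute the GGUF build you actually want.
path = hf_hub_download(
    repo_id="your-org/llama-3.2-3b-instruct-gguf",
    filename="llama-3.2-3b-instruct.gguf",
    local_dir="models",
)
print(path)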
Configuration
Edit config.yaml:
models:
  local:
    - name: "local-llama-3.2"
      path: "models/llama-3.2-3b-instruct.gguf"
      context_size: 8192
      gpu_layers: 35

mcp:
  servers:
    - name: "filesystem"
      transport: "stdio"
      command: ["npx", "-y", "@modelcontextprotocol/server-filesystem", "/tmp"]
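Conceptually, this file maps onto a couple of small config records; a rough sketch of parsing it with PyYAML (the dataclass names are illustrative, not the project's actual types):

import yaml
from dataclasses import dataclass

@dataclass
class ModelConfig:
    name: str
    path: str
    context_size: int = 4096
    gpu_layers: int = 0

@dataclass
class MCPServerConfig:
    name: str
    transport: str
    command: list[str]

def load_config(path="config.yaml"):
    # Parse config.yaml into typed records matching the structure above.
    with open(path) as f:
        raw = yaml.safe_load(f)
    models = [ModelConfig(**m) for m in raw["models"]["local"]]
    servers = [MCPServerConfig(**s) for s in raw["mcp"]["servers"]]
    return models, servers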
Usage
Start the Server
python main.py
The server runs on http://localhost:8000 by default.
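To confirm the server is up and the configured MCP servers connected, you can query the health endpoint (listed under API Endpoints below) with any HTTP client; the exact response fields may vary:

import requests

# Health check: should report server status plus MCP server status.
resp = requests.get("http://localhost:8000/health")
print(resp.status_code, resp.json())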
Using with Claude Desktop
Configure Claude Desktop to use the local backend:
{
  "apiProvider": "anthropic",
  "apiUrl": "http://localhost:8000/v1",
  "apiKey": "local"
}
Using with Python
from anthropic import Anthropic

client = Anthropic(
    base_url="http://localhost:8000/v1",
    api_key="local"
)

response = client.messages.create(
    model="local-llama-3.2",
    messages=[
        {"role": "user", "content": "List files in /tmp"}
    ],
    tools="auto",  # Automatically use MCP tools
    max_tokens=1024
)

print(response.content)
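Because the backend advertises full SSE streaming compatibility, the Anthropic SDK's standard streaming helper should also work against it; a sketch:

from anthropic import Anthropic

client = Anthropic(base_url="http://localhost:8000/v1", api_key="local")

# Stream the response text as it is generated.
with client.messages.stream(
    model="local-llama-3.2",
    messages=[{"role": "user", "content": "Summarize the files in /tmp"}],
    max_tokens=1024,
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)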
Using with curl
curl -X POST http://localhost:8000/v1/messages \
  -H "Content-Type: application/json" \
  -d '{
    "model": "local-llama-3.2",
    "messages": [
      {"role": "user", "content": "What files are in /tmp?"}
    ],
    "tools": "auto",
    "max_tokens": 1024
  }'
MCP Integration
The backend automatically:
- Connects to configured MCP servers on startup
- Discovers available tools from each server
- Exposes them as Anthropic-format tools to the LLM (see the conversion sketch below)
- Routes tool calls to the appropriate MCP server
- Returns results in the response
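The conversion in step 3 is straightforward because MCP tools already describe their inputs with JSON Schema; a minimal sketch of what an adapter does conceptually (the field names on the MCP side follow the MCP tool listing format, the helper itself is illustrative):

def mcp_tool_to_anthropic(tool: dict) -> dict:
    """Map an MCP tool listing entry to Anthropic's tool schema."""
    return {
        "name": tool["name"],
        "description": tool.get("description", ""),
        "input_schema": tool["inputSchema"],  # MCP already uses JSON Schema here
    }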
Supported MCP Servers
- @modelcontextprotocol/server-filesystem - File system access
- @modelcontextprotocol/server-memory - Memory/knowledge base
- @modelcontextprotocol/server-postgres - Database access
- Any MCP-compliant server
API Endpoints
- POST /v1/messages - Anthropic Messages API
- GET /v1/models - List available models
- GET /health - Health check and MCP server status
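For example, the model list can be queried directly (the response is assumed to describe the models configured in config.yaml):

import requests

print(requests.get("http://localhost:8000/v1/models").json())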
Development
Project Structure
src/
├── api/ # Anthropic-compatible API
├── core/ # Orchestration and types
├── llm/ # LLM engine abstraction
├── mcp/ # MCP client and manager
├── adapters/ # Format converters
└── plugins/ # Extension system
Adding a New LLM Engine
Implement the LLMEngine interface:
from src.llm.base import LLMEngine

class MyEngine(LLMEngine):
    async def generate(self, messages, tools, **kwargs):
        # Your implementation
        pass
Creating Plugins
from src.plugins.base import Plugin

class MyPlugin(Plugin):
    async def on_request(self, context):
        # Pre-process request
        pass

    async def on_response(self, context):
        # Post-process response
        pass
Requirements
- Python 3.10+
- CUDA-capable GPU (optional, for acceleration)
- Compatible GGUF model with tool-calling support
License
MIT
Contributing
Contributions welcome! The design prioritizes:
- Clean abstractions
- Minimal dependencies
- Anthropic API compatibility
- Efficient MCP integration