Architecture¶
This document describes the design principles, patterns, and architectural decisions that guide Echomine development.
Design Principles¶
Echomine follows 8 core Constitution Principles that guide all architectural and implementation decisions:
I. Library-First Architecture¶
Core functionality is built as an importable library, with CLI as a thin wrapper on top.
Rationale: Primary use case is cognivault integration. CLI is a convenience layer, not the core product.
Implementation:
```python
# ✅ CORRECT: CLI wraps library
from pathlib import Path

from echomine import OpenAIAdapter, SearchQuery

def search_command(file: Path, keywords: list[str]) -> None:
    adapter = OpenAIAdapter()  # Library component
    query = SearchQuery(keywords=keywords)
    for result in adapter.search(file, query):
        print_result(result)  # CLI formatting

# ❌ WRONG: Library calls CLI
def search(file: Path, query: SearchQuery):
    subprocess.run(["echomine", "search", ...])  # NO!
```
II. CLI Interface Contract¶
Results go to stdout, progress and errors go to stderr, with standard exit codes.
Contract:
- stdout: Results only (JSON or human-readable)
- stderr: Progress, warnings, errors
- Exit codes: 0 (success), 1 (operational error), 2 (usage error)
Benefits:
- Pipeline-friendly (compose with jq, grep, xargs)
- Separates data from metadata
- Standard UNIX conventions
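A minimal sketch of the contract from the CLI layer's perspective (the helper names are illustrative, not Echomine's actual API):

```python
import sys

def emit_result(line: str) -> None:
    # Results only: stdout stays clean for jq, grep, and xargs
    sys.stdout.write(line + "\n")

def emit_error(message: str) -> None:
    # Diagnostics go to stderr; stdout is never polluted
    sys.stderr.write(f"error: {message}\n")
    sys.exit(1)  # operational error; usage errors exit with 2
```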
III. Test-Driven Development (TDD)¶
All features follow the RED-GREEN-REFACTOR cycle with no exceptions.
Workflow:
- RED: Write failing test first (verify it fails)
- GREEN: Write minimal code to pass test
- REFACTOR: Improve code while keeping tests green
Enforcement: Pre-commit hooks reject commits without test coverage.
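For example, a RED-phase test is written and run to failure before any implementation exists. A sketch (the `export_file` fixture and the `score` attribute are assumptions):

```python
from echomine import OpenAIAdapter, SearchQuery

def test_search_ranks_results_by_relevance(export_file):
    adapter = OpenAIAdapter()
    results = list(adapter.search(export_file, SearchQuery(keywords=["python"])))

    # Fails (RED) until search() exists, then drives the minimal GREEN implementation
    scores = [r.score for r in results]
    assert scores == sorted(scores, reverse=True)
```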
IV. Observability & Debuggability¶
JSON structured logs via structlog with contextual fields.
Logging Pattern:
```python
from echomine.utils.logging import get_logger

logger = get_logger(__name__)

logger.info(
    "Processing conversation",
    operation="stream_conversations",
    file_name=str(file_path),
    conversation_id=conversation.id,
    count=count,
)
```
Graceful Degradation:
- Malformed entries logged and skipped (WARNING level)
- Processing continues for valid entries
- Summary reports include skip counts
V. Simplicity & YAGNI¶
Implement ONLY what the spec requires. No speculative features.
Examples:
- No database layer (just file parsing)
- No caching (streaming is sufficient)
- No async (sync generators are simpler and adequate)
When Complexity is Justified:
- ijson: Required for O(1) memory usage on 1GB+ files
- BM25 ranking: Spec requirement for search quality
VI. Strict Typing Mandatory¶
`mypy --strict` with ZERO TOLERANCE for errors.
Requirements:
- Type hints on ALL functions, methods, variables
- No `Any` types in the public API
- Use `Protocol` for abstractions
- Pydantic models for all structured data
Example:
```python
from pathlib import Path
from typing import Callable, Iterator, Optional

from echomine.models import Conversation, SearchQuery, SearchResult

def search(
    file_path: Path,
    query: SearchQuery,
    *,
    progress_callback: Optional[Callable[[int], None]] = None,
) -> Iterator[SearchResult[Conversation]]:
    """Full type safety."""
    pass
```
VII. Multi-Provider Adapter Pattern¶
Stateless adapters implement ConversationProvider protocol.
Design:
- OpenAIAdapter for ChatGPT (v1.0)
- Future: ClaudeAdapter, GeminiAdapter (v2.0+)
- Shared models (Message, Conversation) across providers
- Provider-specific data in the `metadata` dict
Stateless Pattern:
```python
# ✅ CORRECT: Stateless adapter
class OpenAIAdapter:
    def stream_conversations(self, file_path: Path) -> Iterator[Conversation]:
        # file_path passed as argument
        pass

# ❌ WRONG: Stateful adapter
class OpenAIAdapter:
    def __init__(self, file_path: Path):  # NO!
        self.file_path = file_path
```
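Statelessness also means a single adapter instance can serve any number of files; a brief sketch (paths are illustrative):

```python
from pathlib import Path

from echomine import OpenAIAdapter

adapter = OpenAIAdapter()  # no per-file state, so one instance is enough

for path in [Path("export_2023.json"), Path("export_2024.json")]:
    for conversation in adapter.stream_conversations(path):
        ...
```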
VIII. Memory Efficiency & Streaming¶
O(1) memory usage regardless of file size via ijson streaming.
Pattern:
```python
import json

import ijson

# ✅ CORRECT: Streaming with ijson
def stream_conversations(file_path: Path) -> Iterator[Conversation]:
    with open(file_path, "rb") as f:
        parser = ijson.items(f, "item")
        for item in parser:
            yield Conversation.model_validate(item)

# ❌ WRONG: Load entire file
def stream_conversations(file_path: Path) -> Iterator[Conversation]:
    with open(file_path) as f:
        data = json.load(f)  # Loads entire file into memory!
        for item in data:
            yield Conversation.model_validate(item)
```
Performance Contracts:
- 1.6GB file search in <30 seconds
- 10K conversations + 50K messages on 8GB RAM
- O(1) memory (constant, not proportional to file size)
Architectural Patterns¶
Adapter Pattern¶
All provider-specific logic is encapsulated in adapters that implement the ConversationProvider protocol:
```python
from pathlib import Path
from typing import Iterator, Optional, Protocol, TypeVar

from echomine.models import Conversation, SearchQuery, SearchResult

ConversationT = TypeVar("ConversationT", bound="Conversation")

class ConversationProvider(Protocol[ConversationT]):
    """Protocol for conversation export adapters."""

    def stream_conversations(
        self,
        file_path: Path,
    ) -> Iterator[ConversationT]:
        """Stream conversations from export file."""
        ...

    def search(
        self,
        file_path: Path,
        query: SearchQuery,
    ) -> Iterator[SearchResult[ConversationT]]:
        """Search conversations with BM25 ranking."""
        ...

    def get_conversation_by_id(
        self,
        file_path: Path,
        conversation_id: str,
    ) -> Optional[ConversationT]:
        """Retrieve specific conversation by ID."""
        ...
```
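Because `Protocol` relies on structural typing, any conforming adapter can be used interchangeably. A minimal sketch of provider-agnostic code:

```python
def count_conversations(provider: ConversationProvider, path: Path) -> int:
    # Accepts OpenAIAdapter today, ClaudeAdapter tomorrow; no registration needed
    return sum(1 for _ in provider.stream_conversations(path))
```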
Immutable Data Models¶
All data models use Pydantic with strict validation and immutability:
```python
from datetime import datetime
from typing import Literal, Optional

from pydantic import BaseModel, ConfigDict, Field

class Message(BaseModel):
    model_config = ConfigDict(
        frozen=True,             # Immutability
        strict=True,             # No type coercion
        extra="forbid",          # Reject unknown fields
        validate_assignment=True,
    )

    id: str = Field(..., min_length=1)
    content: str
    role: Literal["user", "assistant", "system"]
    timestamp: datetime  # UTC, timezone-aware
    parent_id: Optional[str] = None
```
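With `frozen=True`, accidental mutation fails loudly at runtime instead of silently corrupting shared state. A quick illustration (field values are made up):

```python
from datetime import datetime, timezone

msg = Message(
    id="m1",
    content="Hello",
    role="user",
    timestamp=datetime.now(timezone.utc),
)
msg.content = "edited"  # raises pydantic.ValidationError: instance is frozen
```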
Streaming Pattern¶
All operations use generators for memory efficiency:
```python
import ijson
from pydantic import ValidationError

# Returns Iterator, not List
def stream_conversations(file_path: Path) -> Iterator[Conversation]:
    with open(file_path, "rb") as f:
        parser = ijson.items(f, "item")
        for item in parser:
            try:
                yield Conversation.model_validate(item)
            except ValidationError as e:
                logger.warning("Skipped malformed entry", reason=str(e))
                continue
```
Error Handling Strategy¶
Fail-Fast on Unrecoverable Errors:
- FileNotFoundError: File doesn't exist
- PermissionError: No read access
- SchemaVersionError: Unsupported export version
Graceful Degradation on Data Errors:
- ValidationError: Skip malformed conversation, log warning, continue
- ParseError: Skip malformed JSON entry, log warning, continue
No Retries: All errors are permanent. Users must fix the issue manually.
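A sketch of how the two strategies divide, assuming a hypothetical `_validate_input` helper (graceful degradation itself is shown in the streaming pattern above):

```python
import os
from pathlib import Path

def _validate_input(file_path: Path) -> None:
    # Fail fast: unrecoverable problems raise immediately and are never retried
    if not file_path.exists():
        raise FileNotFoundError(f"Export file not found: {file_path}")
    if not os.access(file_path, os.R_OK):
        raise PermissionError(f"No read access: {file_path}")
```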
Project Structure¶
```text
echomine/
├── src/echomine/            # Library source code
│   ├── models/              # Pydantic data models
│   │   ├── conversation.py  # Conversation, Message
│   │   ├── search.py        # SearchQuery, SearchResult
│   │   └── protocols.py     # ConversationProvider protocol
│   ├── adapters/            # Provider adapters
│   │   └── openai/          # OpenAI (ChatGPT) adapter
│   ├── search/              # Search and ranking logic
│   │   └── ranking.py       # BM25 algorithm
│   ├── exporters/           # Export formatters
│   │   └── markdown.py      # Markdown exporter
│   ├── utils/               # Utilities
│   │   └── logging.py       # Structured logging setup
│   └── cli/                 # CLI commands (thin wrapper)
│       ├── app.py           # Typer app
│       └── commands/        # Individual commands
├── tests/                   # Test suite
│   ├── unit/                # Unit tests (70%)
│   ├── integration/         # Integration tests (20%)
│   ├── contract/            # Protocol contract tests (5%)
│   └── performance/         # Performance benchmarks (5%)
└── specs/                   # Design documents
    └── 001-ai-chat-parser/  # Feature specification
```
Data Flow¶
Streaming Operation Flow¶
```text
Export File (JSON)
        ↓
ijson.items()  [Streaming Parser]
        ↓
dict (raw JSON object)
        ↓
Conversation.model_validate()  [Pydantic Validation]
        ↓
Conversation (Immutable Model)
        ↓
Generator Yield
        ↓
Consumer (CLI, Library User)
```
Search Operation Flow¶
```text
Export File
        ↓
stream_conversations()  [Stream All]
        ↓
Filter by Date Range (if specified)
        ↓
Filter by Title (if specified)  [Metadata-only]
        ↓
BM25 Ranking (if keywords specified)  [Full-text]
        ↓
Sort by Relevance Score (descending)
        ↓
Limit Results
        ↓
SearchResult[Conversation] Generator
        ↓
Consumer
```
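In code, this flow maps onto a pipeline of lazy generator stages. The sketch below is illustrative only; helpers such as `in_date_range` and `bm25_score`, and the exact `SearchQuery`/`SearchResult` field names, are assumptions:

```python
import heapq

def search(file_path: Path, query: SearchQuery) -> Iterator[SearchResult[Conversation]]:
    candidates = stream_conversations(file_path)
    if query.start_date or query.end_date:
        candidates = (c for c in candidates if in_date_range(c, query))  # metadata-only
    if query.title:
        candidates = (c for c in candidates if query.title.lower() in c.title.lower())
    scored = ((bm25_score(c, query.keywords), c) for c in candidates)    # full-text
    # Keep only the best `limit` entries, yielded highest score first
    for score, conversation in heapq.nlargest(query.limit, scored, key=lambda s: s[0]):
        yield SearchResult(conversation=conversation, score=score)
```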
Technology Choices¶
Core Stack¶
| Technology | Purpose | Why Chosen |
|---|---|---|
| Python 3.12+ | Language | Modern type hints (PEP 695, improved generics) |
| Pydantic v2 | Data validation | Comprehensive validation, immutability, JSON schema |
| ijson | JSON parsing | Streaming for O(1) memory (handles 1GB+ files) |
| typer | CLI framework | Native type hint support, automatic help |
| rich | Terminal output | Tables, progress bars, syntax highlighting |
| structlog | Logging | JSON output for observability, contextual fields |
Development Tools¶
| Tool | Purpose | Why Chosen |
|---|---|---|
| pytest | Testing | De facto standard, excellent fixtures/plugins |
| mypy | Type checking | Strict mode for zero-tolerance type safety |
| ruff | Linting/Formatting | Fast (10-100x faster than alternatives) |
| pre-commit | Git hooks | Automated quality gates |
Alternative Considerations¶
Why not async?
- Sync generators are simpler
- No network I/O to overlap; the workload is a single sequential pass over a local file
- ijson streaming is adequate for performance
Why not database?
- YAGNI: Not required by spec
- Export files are read-only
- Streaming handles large files efficiently
Why not caching?
- Export files don't change during read
- Memory efficiency is more important
- Adds complexity without clear benefit
Extension Points¶
Adding New Providers¶
- Implement the `ConversationProvider` protocol
- Map provider-specific roles to standard roles
- Store provider-specific data in the `metadata` dict
- Add provider-specific tests
Example:
```python
class ClaudeAdapter:
    """Adapter for Anthropic Claude exports."""

    def stream_conversations(
        self,
        file_path: Path,
    ) -> Iterator[Conversation]:
        # Parse Claude-specific format
        # Map "human" → "user", "assistant" → "assistant"
        # Store Claude-specific fields in metadata
        pass
```
Adding New Search Filters¶
- Add an optional field to the `SearchQuery` model
- Update search logic in adapters
- Add tests for the new filter
- Update the CLI to accept the new filter
Backward compatibility: New filters must be optional with sensible defaults.
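For instance, a hypothetical `min_messages` filter would be added as an optional field with a default that preserves existing behavior (other fields elided):

```python
class SearchQuery(BaseModel):
    keywords: list[str] = Field(default_factory=list)
    # Hypothetical new filter: optional, defaulting to "no filtering"
    min_messages: Optional[int] = None
```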
Performance Optimization¶
Memory Efficiency¶
- Streaming: Never load entire file into memory
- Generators: Use `Iterator` return types, not `List`
- Context Managers: Ensure file handles are closed
Search Performance¶
- Title Filtering: Metadata-only (no message content scan)
- BM25 Ranking: Only when keywords specified
- Early Termination: Stop after `limit` results
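Because `search()` returns an iterator, consumers can also cut the stream short themselves:

```python
from itertools import islice

# Parsing stops as soon as the first five results have been yielded
top_five = list(islice(adapter.search(file_path, query), 5))
```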
Profiling¶
Use pytest-benchmark for performance regression testing:
```python
def test_search_performance(benchmark):
    """Search of a 1.6GB file completes in <30 seconds."""
    # adapter, large_file, and query come from test fixtures; consume the
    # iterator, otherwise only generator creation would be timed
    results = benchmark(lambda: list(adapter.search(large_file, query)))
    assert benchmark.stats.stats.mean < 30.0
```
Security Considerations¶
Input Validation¶
- All file paths validated (Path objects, existence checks)
- All search queries validated (Pydantic models)
- No shell execution or eval()
Resource Limits¶
- Streaming prevents OOM attacks
- File handle cleanup ensures no resource leaks
Data Privacy¶
- No network calls (offline library)
- No telemetry or tracking
- All processing local
Concurrency Model¶
Thread Safety¶
- Adapter instances: Thread-safe (stateless)
- Iterators: NOT thread-safe (each thread needs its own)
Multi-Process Safety¶
- Multiple processes can read same file concurrently
- File system provides read isolation
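A sketch of the intended usage: share the stateless adapter, never the iterators (paths and worker shape are illustrative):

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

from echomine import OpenAIAdapter

adapter = OpenAIAdapter()  # safe to share across threads: no mutable state

def count(path: Path) -> int:
    # Each call builds a fresh iterator, so threads never share one
    return sum(1 for _ in adapter.stream_conversations(path))

with ThreadPoolExecutor() as pool:
    totals = list(pool.map(count, [Path("a.json"), Path("b.json")]))
```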
Future Considerations¶
Multi-Provider Support (v2.0)¶
- Add ClaudeAdapter, GeminiAdapter
- Auto-detection helper (optional)
- Provider registry pattern
Advanced Search (v1.1)¶
- Semantic search (embeddings)
- Regex pattern matching
- Boolean query syntax
Export Formats (v1.1)¶
- HTML export
- PDF export
- CSV export (metadata)
Next Steps¶
- Library Usage: Comprehensive API guide
- CLI Usage: Command-line reference
- API Reference: Detailed API documentation
- Contributing: Development guidelines