BM25 Ranking

BM25 relevance ranking algorithm for full-text search.

Overview

Echomine uses the BM25 (Best Matching 25) algorithm for relevance ranking when searching conversations by keywords. BM25 is a probabilistic ranking function used by search engines to estimate the relevance of documents to a given query.

API Reference

ranking

BM25 relevance ranking for conversation search.

BM25 (Best Matching 25) is a probabilistic relevance ranking function used to score documents (conversations) against keyword queries.

Parameters (from FR-317):

  k1 = 1.5   # Term frequency saturation parameter
  b = 0.75   # Length normalization parameter

Algorithm

score(D, Q) = Σ IDF(qi) * (f(qi, D) * (k1 + 1)) / (f(qi, D) + k1 * (1 - b + b * |D|/avgdl))

Where:

  • D: Document (conversation)
  • Q: Query (keywords)
  • f(qi, D): Frequency of keyword qi in document D
  • |D|: Length of document D (word count)
  • avgdl: Average document length across corpus
  • IDF(qi): Inverse document frequency of qi

Constitution Compliance
  • Principle VI: Strict typing with mypy --strict
  • FR-317-326: BM25 algorithm with standard parameters
  • FR-319: Case-insensitive keyword matching

Advanced Search Features (v1.1.0):

  • FR-001-006: phrase_matches() for exact phrase matching
  • FR-012-016: exclude_filter() for exclusion filtering

BM25Scorer

BM25Scorer(corpus: list[str], avg_doc_length: float)

BM25 relevance scorer for conversations.

Scores conversations based on keyword matches using the BM25 algorithm. Normalizes scores to the [0.0, 1.0] range for consistency.

Algorithm Details
  • k1 = 1.5 (term frequency saturation)
  • b = 0.75 (length normalization)
  • IDF: log((N - df + 0.5) / (df + 0.5) + 1)
  • Score normalization: score_normalized = score_raw / (score_raw + 1) per FR-326
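A minimal sketch of this normalization step (per FR-326), which maps the unbounded raw score into the [0.0, 1.0] range, approaching 1.0 asymptotically:

def normalize_score(score_raw: float) -> float:
    """Map a raw BM25 score from [0, inf) toward [0.0, 1.0)."""
    return score_raw / (score_raw + 1.0)

assert normalize_score(0.0) == 0.0  # no keyword matches
assert normalize_score(4.0) == 0.8  # strong raw score -> high normalized score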
Example
# Build corpus
corpus = [
    "Python is a great programming language",
    "I love Python and JavaScript",
    "Go is fast but Python is easier"
]

# Initialize scorer
scorer = BM25Scorer(corpus=corpus, avg_doc_length=7.0)

# Score documents
score = scorer.score("Python is a great programming language", ["python"])
# Returns BM25 score (higher = more relevant)
Requirements
  • FR-317: BM25 algorithm implementation
  • FR-318: k1 = 1.5, b = 0.75
  • FR-319: Case-insensitive matching
  • FR-326: Score normalization to [0.0, 1.0]

Initialize BM25 scorer with corpus statistics.

Parameters:

  • corpus (list[str], required): List of document texts (conversation content)
  • avg_doc_length (float, required): Average document length (for normalization)
Example
corpus = ["doc one content", "doc two content"]
avg_len = sum(len(doc.split()) for doc in corpus) / len(corpus)
scorer = BM25Scorer(corpus=corpus, avg_doc_length=avg_len)
Source code in src/echomine/search/ranking.py
def __init__(self, corpus: list[str], avg_doc_length: float) -> None:
    """Initialize BM25 scorer with corpus statistics.

    Args:
        corpus: List of document texts (conversation content)
        avg_doc_length: Average document length (for normalization)

    Example:
        ```python
        corpus = ["doc one content", "doc two content"]
        avg_len = sum(len(doc.split()) for doc in corpus) / len(corpus)
        scorer = BM25Scorer(corpus=corpus, avg_doc_length=avg_len)
        ```
    """
    self.corpus_size = len(corpus)
    self.avg_doc_length = avg_doc_length

    # Calculate IDF scores for all terms
    self.idf_scores: dict[str, float] = self._calculate_idf(corpus)
score
score(document: str, keywords: list[str]) -> float

Score a document for given keywords using BM25.

Tokenizes both document and keywords using the same method to ensure consistent matching. Multi-character keywords (e.g., Chinese "编程") are split into individual character tokens.

Parameters:

  • document (str, required): Document text (conversation content)
  • keywords (list[str], required): List of query keywords (will be tokenized)

Returns:

  • float: BM25 score (higher = more relevant, unnormalized)

Example
doc = "Python is a great programming language"
keywords = ["python", "programming"]
score = scorer.score(doc, keywords)
# Returns sum of BM25 scores for each keyword token
Algorithm

For each keyword token qi: IDF(qi) * (f(qi, D) * (k1 + 1)) / (f(qi, D) + k1 * (1 - b + b * |D|/avgdl))

Source code in src/echomine/search/ranking.py
def score(self, document: str, keywords: list[str]) -> float:
    """Score a document for given keywords using BM25.

    Tokenizes both document and keywords using the same method to ensure
    consistent matching. Multi-character keywords (e.g., Chinese "编程")
    are split into individual character tokens.

    Args:
        document: Document text (conversation content)
        keywords: List of query keywords (will be tokenized)

    Returns:
        BM25 score (higher = more relevant, unnormalized)

    Example:
        ```python
        doc = "Python is a great programming language"
        keywords = ["python", "programming"]
        score = scorer.score(doc, keywords)
        # Returns sum of BM25 scores for each keyword token
        ```

    Algorithm:
        For each keyword token qi:
            IDF(qi) * (f(qi, D) * (k1 + 1)) / (f(qi, D) + k1 * (1 - b + b * |D|/avgdl))
    """
    # Tokenize document using improved tokenization
    doc_terms = self._tokenize(document)
    doc_length = len(doc_terms)

    # Count term frequencies
    tf_counter: CounterType[str] = Counter(doc_terms)

    # Tokenize keywords (handles multi-character keywords like "编程")
    keyword_tokens: list[str] = []
    for keyword in keywords:
        keyword_tokens.extend(self._tokenize(keyword))

    # Calculate BM25 score
    score = 0.0

    for kw_token in keyword_tokens:
        # Get IDF score (0 if term not in corpus)
        idf = self.idf_scores.get(kw_token, 0.0)

        # Get term frequency in document
        tf = tf_counter.get(kw_token, 0)

        # BM25 formula
        numerator = tf * (self.K1 + 1.0)

        # Guard against division by zero when avg_doc_length is 0
        # This can happen with sparse corpora (e.g., role_filter=system with few system messages)
        # When avg_doc_length is 0, skip length normalization (use ratio of 1.0)
        length_ratio = doc_length / self.avg_doc_length if self.avg_doc_length > 0 else 1.0

        denominator = tf + self.K1 * (1.0 - self.B + self.B * length_ratio)

        score += idf * (numerator / denominator)

    return score

phrase_matches

phrase_matches(text: str, phrases: list[str]) -> bool

Check if any phrase matches in the text (case-insensitive substring).

This function implements exact phrase matching without tokenization. Phrases are matched as-is, preserving hyphens, underscores, and spaces.

Parameters:

  • text (str, required): Text to search in (e.g., conversation content)
  • phrases (list[str], required): List of phrases to match

Returns:

  • bool: True if any phrase is found in text, False otherwise

Example
text = "We use algo-insights for data analysis"
assert phrase_matches(text, ["algo-insights"]) is True
assert phrase_matches(text, ["algorithm"]) is False
Requirements
  • FR-001: Exact phrase matching (no tokenization)
  • FR-003: Case-insensitive matching
  • FR-006: Special characters matched literally
Source code in src/echomine/search/ranking.py
def phrase_matches(text: str, phrases: list[str]) -> bool:
    """Check if any phrase matches in the text (case-insensitive substring).

    This function implements exact phrase matching without tokenization.
    Phrases are matched as-is, preserving hyphens, underscores, and spaces.

    Args:
        text: Text to search in (e.g., conversation content)
        phrases: List of phrases to match

    Returns:
        True if any phrase is found in text, False otherwise

    Example:
        ```python
        text = "We use algo-insights for data analysis"
        assert phrase_matches(text, ["algo-insights"]) is True
        assert phrase_matches(text, ["algorithm"]) is False
        ```

    Requirements:
        - FR-001: Exact phrase matching (no tokenization)
        - FR-003: Case-insensitive matching
        - FR-006: Special characters matched literally
    """
    # Empty phrases list means no matches possible
    if not phrases:
        return False

    # Empty text means no matches possible
    if not text:
        return False

    # Case-insensitive substring matching (OR logic for multiple phrases)
    text_lower = text.lower()
    return any(phrase.lower() in text_lower for phrase in phrases)

all_terms_present

all_terms_present(text: str, keywords: list[str], scorer: BM25Scorer) -> bool

Check if ALL keyword tokens are present in the text.

Uses the same tokenization as BM25Scorer to ensure consistent matching behavior. All keyword tokens must be present in the text tokens for match_mode='all' to succeed.

Parameters:

  • text (str, required): Text to check
  • keywords (list[str], required): Keywords to find (will be tokenized)
  • scorer (BM25Scorer, required): BM25Scorer instance for tokenization

Returns:

  • bool: True if ALL keyword tokens are present in text, False otherwise

Example
scorer = BM25Scorer(corpus=["test"], avg_doc_length=1.0)
text = "Python and Java programming"
assert all_terms_present(text, ["python", "java"], scorer) is True
assert all_terms_present(text, ["python", "rust"], scorer) is False
Requirements
  • FR-009: All keywords must be present (AND logic)
  • FR-010: Uses same tokenization as BM25
Source code in src/echomine/search/ranking.py
def all_terms_present(text: str, keywords: list[str], scorer: BM25Scorer) -> bool:
    """Check if ALL keyword tokens are present in the text.

    Uses the same tokenization as BM25Scorer to ensure consistent matching
    behavior. All keyword tokens must be present in the text tokens for
    match_mode='all' to succeed.

    Args:
        text: Text to check
        keywords: Keywords to find (will be tokenized)
        scorer: BM25Scorer instance for tokenization

    Returns:
        True if ALL keyword tokens are present in text,
        False otherwise

    Example:
        ```python
        scorer = BM25Scorer(corpus=["test"], avg_doc_length=1.0)
        text = "Python and Java programming"
        assert all_terms_present(text, ["python", "java"], scorer) is True
        assert all_terms_present(text, ["python", "rust"], scorer) is False
        ```

    Requirements:
        - FR-009: All keywords must be present (AND logic)
        - FR-010: Uses same tokenization as BM25
    """
    # Empty keywords list is vacuously true (all of nothing is present)
    if not keywords:
        return True

    # Tokenize the text using BM25Scorer's tokenization
    text_tokens = set(scorer._tokenize(text))  # noqa: SLF001

    # Tokenize all keywords and check if each token is present
    for keyword in keywords:
        keyword_tokens = scorer._tokenize(keyword)  # noqa: SLF001
        # All tokens from this keyword must be present
        for token in keyword_tokens:
            if token not in text_tokens:
                return False

    return True

exclude_filter

exclude_filter(
    text: str, exclude_keywords: list[str], scorer: BM25Scorer
) -> bool

Check if text contains any excluded keywords.

Uses the same tokenization as BM25Scorer to ensure consistent matching behavior between inclusion and exclusion.

Parameters:

  • text (str, required): Text to check for excluded terms
  • exclude_keywords (list[str], required): Keywords to exclude (will be tokenized)
  • scorer (BM25Scorer, required): BM25Scorer instance for tokenization

Returns:

  • bool: True if text should be EXCLUDED (contains any excluded term), False if text should be KEPT (no excluded terms found)

Example
scorer = BM25Scorer(corpus=["test"], avg_doc_length=1.0)
text = "Python with Django framework"
assert exclude_filter(text, ["django"], scorer) is True  # Exclude
assert exclude_filter(text, ["flask"], scorer) is False  # Keep
Requirements
  • FR-014: Applied after matching, before ranking
  • FR-015: Uses same tokenization as keywords
Source code in src/echomine/search/ranking.py
def exclude_filter(text: str, exclude_keywords: list[str], scorer: BM25Scorer) -> bool:
    """Check if text contains any excluded keywords.

    Uses the same tokenization as BM25Scorer to ensure consistent matching
    behavior between inclusion and exclusion.

    Args:
        text: Text to check for excluded terms
        exclude_keywords: Keywords to exclude (will be tokenized)
        scorer: BM25Scorer instance for tokenization

    Returns:
        True if text should be EXCLUDED (contains any excluded term),
        False if text should be KEPT (no excluded terms found)

    Example:
        ```python
        scorer = BM25Scorer(corpus=["test"], avg_doc_length=1.0)
        text = "Python with Django framework"
        assert exclude_filter(text, ["django"], scorer) is True  # Exclude
        assert exclude_filter(text, ["flask"], scorer) is False  # Keep
        ```

    Requirements:
        - FR-014: Applied after matching, before ranking
        - FR-015: Uses same tokenization as keywords
    """
    # Empty exclusions means keep all
    if not exclude_keywords:
        return False

    # Empty text has no tokens to match exclusions
    if not text:
        return False

    # Tokenize the text using BM25Scorer's tokenization
    text_tokens = set(scorer._tokenize(text))  # noqa: SLF001

    # Check if ANY excluded token is present (OR logic for exclusion)
    for keyword in exclude_keywords:
        keyword_tokens = scorer._tokenize(keyword)  # noqa: SLF001
        # If any token from this keyword is present, exclude
        for token in keyword_tokens:
            if token in text_tokens:
                return True

    return False

How BM25 Works

Algorithm

BM25 calculates a relevance score based on:

  1. Term Frequency (TF): How often keywords appear in the conversation
  2. Inverse Document Frequency (IDF): How rare keywords are across all conversations
  3. Document Length Normalization: Adjusts for conversation length

Formula:

score(D, Q) = Σ IDF(qi) · (f(qi, D) · (k1 + 1)) / (f(qi, D) + k1 · (1 - b + b · |D| / avgdl))

Where:

  • D = Conversation (document)
  • Q = Search query keywords
  • f(qi, D) = Frequency of keyword qi in conversation D
  • |D| = Length of conversation (number of words)
  • avgdl = Average conversation length across all conversations
  • k1 = Term saturation parameter (default: 1.5)
  • b = Length normalization parameter (default: 0.75)
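As a worked example with illustrative numbers (not from a real corpus): a single keyword with IDF 1.2 that appears twice in a 10-word conversation, in a corpus whose average length is 20 words:

k1, b = 1.5, 0.75
tf, idf = 2, 1.2          # keyword appears twice; illustrative IDF value
doc_len, avgdl = 10, 20.0

numerator = tf * (k1 + 1)                              # 2 * 2.5 = 5.0
denominator = tf + k1 * (1 - b + b * doc_len / avgdl)  # 2 + 1.5 * 0.625 = 2.9375
score = idf * numerator / denominator                  # ~2.04 (unnormalized)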

Parameters

Echomine uses standard BM25 parameters:

  • k1 = 1.5: Controls term saturation (higher = more weight to term frequency)
  • b = 0.75: Controls length normalization (higher = more penalty for long documents)
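The saturation that k1 controls is easiest to see by holding the document at the corpus-average length (so the length term reduces to 1) and varying term frequency; a small illustration:

k1 = 1.5
for tf in (1, 2, 5, 10, 50):
    factor = tf * (k1 + 1) / (tf + k1)  # length ratio = 1.0, so b drops out
    print(tf, round(factor, 2))
# 1 -> 1.0, 2 -> 1.43, 5 -> 1.92, 10 -> 2.17, 50 -> 2.43 (approaches k1 + 1 = 2.5)

Repeating a keyword fifty times scores only about 2.4x higher than mentioning it once, which is what keeps keyword-stuffed conversations from dominating results.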

Usage Examples

Basic Search with Ranking

from echomine import OpenAIAdapter, SearchQuery
from pathlib import Path

adapter = OpenAIAdapter()
export_file = Path("conversations.json")

# Search with keywords
query = SearchQuery(keywords=["python", "async"], limit=10)

# Results are ranked by BM25 score
for result in adapter.search(export_file, query):
    print(f"[{result.score:.2f}] {result.conversation.title}")

Score Interpretation

for result in adapter.search(export_file, query):
    score = result.score

    if score >= 0.8:
        quality = "Excellent match"
    elif score >= 0.6:
        quality = "Good match"
    elif score >= 0.4:
        quality = "Fair match"
    else:
        quality = "Weak match"

    print(f"[{score:.2f}] {quality}: {result.conversation.title}")

Filtering by Score Threshold

# Only show high-quality matches
min_score = 0.6

for result in adapter.search(export_file, query):
    if result.score >= min_score:
        print(f"[{result.score:.2f}] {result.conversation.title}")

Score Normalization

Scores are normalized to the [0.0, 1.0] range:

  • 1.0: Perfect match (all keywords present, high frequency)
  • 0.8-0.9: Excellent match (most keywords, good frequency)
  • 0.6-0.7: Good match (some keywords, moderate frequency)
  • 0.4-0.5: Fair match (few keywords, low frequency)
  • <0.4: Weak match

Note: Scores are relative. A score of 0.8 doesn't mean "80% match"; it means the conversation is in the top tier of matches for the query.

Search Behavior

Keyword Processing

Keywords are processed as follows:

  1. Case-insensitive: "Python" matches "python", "PYTHON" (see the check after this list)
  2. Whole words: "python" doesn't match "pythonic" (unless stemming is enabled)
  3. Multiple keywords: All keywords contribute to the score (OR logic)
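A quick check of case-insensitivity, assuming the tokenizer lowercases as described under Implementation Details (corpus values here are illustrative):

scorer = BM25Scorer(corpus=["Python is great"], avg_doc_length=3.0)
doc = "Python is great"

# FR-319: differently-cased keywords produce identical scores
assert scorer.score(doc, ["PYTHON"]) == scorer.score(doc, ["python"])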

Content Searched

BM25 searches across:

  • Message content (all messages in conversation)
  • Conversation title (weighted higher)

Ranking Order

Results are sorted by score in descending order:

results = list(adapter.search(export_file, query))

# Results are pre-sorted by score
assert results[0].score >= results[1].score >= results[2].score

Performance

Time Complexity

  • O(n · m): where n = number of conversations and m = average messages per conversation
  • Early termination: Stops after finding limit top matches

Memory Complexity

  • O(1): Constant memory (streaming-based)
  • No full conversation buffering
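One way to reconcile a result limit with constant memory is a bounded min-heap over the stream of scored conversations; a minimal sketch, not the actual Echomine internals:

import heapq
from collections.abc import Iterable

def top_k(scored: Iterable[tuple[float, str]], k: int) -> list[tuple[float, str]]:
    """Keep only the k best (score, title) pairs while streaming."""
    heap: list[tuple[float, str]] = []
    for score, title in scored:
        if len(heap) < k:
            heapq.heappush(heap, (score, title))
        elif score > heap[0][0]:  # better than the worst retained result
            heapq.heapreplace(heap, (score, title))
    return sorted(heap, reverse=True)  # descending by score

Memory stays O(k) regardless of how many conversations are streamed.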

Speed

  • 1.6GB file: <30 seconds for keyword search
  • 10K conversations: <10 seconds

Comparison with Other Ranking Methods

BM25 vs TF-IDF

Feature               BM25                                TF-IDF
Saturation            Yes (k1 parameter)                  No
Length normalization  Yes (b parameter)                   Limited
Performance           Better for varied-length documents  Better for uniform documents
Echomine choice       ✅ Used                             ❌ Not used

Why BM25? Conversations vary widely in length. BM25's length normalization prevents long conversations from dominating results.

BM25 vs Semantic Search

Feature        BM25                  Semantic Search
Query type     Keywords              Natural language
Speed          Fast (no ML)          Slower (embedding models)
Setup          No dependencies       Requires embedding models
Accuracy       Good for exact terms  Better for synonyms/concepts
Echomine v1.0  ✅ Used               ❌ Not used

Future: Semantic search may be added in v2.0 as an optional feature.

Limitations

  1. No fuzzy matching: "python" doesn't match "pythonic"
  2. No synonym matching: "car" doesn't match "automobile"
  3. Keyword-only: Doesn't understand semantic meaning
  4. No phrase matching in keyword queries: "machine learning" is treated as two separate keywords (exact phrases are handled by phrase_matches(), added in v1.1.0; see the sketch after this list)
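For limitation 4 specifically, the v1.1.0 phrase_matches() helper documented above can post-filter keyword results into exact-phrase matches. A sketch, assuming the import path implied by the source-file location (src/echomine/search/ranking.py) and using the title attribute as a stand-in for whatever conversation text is available:

from echomine.search.ranking import phrase_matches

query = SearchQuery(keywords=["machine", "learning"], limit=50)

for result in adapter.search(export_file, query):
    # Keep only results whose text contains the exact phrase (FR-001)
    if phrase_matches(result.conversation.title, ["machine learning"]):
        print(f"[{result.score:.2f}] {result.conversation.title}")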

Advanced Usage

Multi-Keyword Queries

# All keywords contribute to score (OR logic)
query = SearchQuery(
    keywords=["python", "async", "asyncio", "coroutine"],
    limit=10
)

# Conversations with more keyword matches rank higher
for result in adapter.search(export_file, query):
    print(f"[{result.score:.2f}] {result.conversation.title}")

Combining with Filters

from datetime import date

# BM25 ranking + date filtering
query = SearchQuery(
    keywords=["algorithm", "design"],
    from_date=date(2024, 1, 1),
    to_date=date(2024, 3, 31),
    limit=10
)

# Date filter reduces search space, then BM25 ranks
for result in adapter.search(export_file, query):
    print(f"[{result.score:.2f}] {result.conversation.title}")

Implementation Details

Tokenization

Messages are tokenized by:

  1. Lowercase conversion
  2. Split on whitespace and punctuation
  3. Filter stop words (optional, not implemented in v1.0)
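A minimal sketch of a tokenizer with this behavior (the real _tokenize may differ, e.g., in the per-character CJK handling noted earlier):

import re

def tokenize(text: str) -> list[str]:
    """Lowercase, then split on whitespace and punctuation."""
    return re.findall(r"\w+", text.lower())

assert tokenize("Python, async & AsyncIO!") == ["python", "async", "asyncio"]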

IDF Calculation

import math

def calculate_idf(term: str, total_docs: int, docs_with_term: int) -> float:
    """Calculate inverse document frequency."""
    return math.log((total_docs - docs_with_term + 0.5) / (docs_with_term + 0.5) + 1.0)
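With this formulation, rare terms receive much higher weight than common ones; for example (illustrative document-frequency counts):

print(calculate_idf("python", total_docs=100, docs_with_term=90))  # ~0.11 (common)
print(calculate_idf("asyncio", total_docs=100, docs_with_term=2))  # ~3.70 (rare)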

Score Calculation

def calculate_bm25_score(
    term_freq: int,
    doc_length: int,
    avg_doc_length: float,
    idf: float,
    k1: float = 1.5,
    b: float = 0.75,
) -> float:
    """Calculate BM25 score for a single term."""
    numerator = term_freq * (k1 + 1)
    denominator = term_freq + k1 * (1 - b + b * (doc_length / avg_doc_length))
    return idf * (numerator / denominator)
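Putting the two helpers together for a single term (same illustrative counts as above):

idf = calculate_idf("asyncio", total_docs=100, docs_with_term=2)
score = calculate_bm25_score(
    term_freq=3,         # term appears three times
    doc_length=12,       # shorter than average
    avg_doc_length=20.0,
    idf=idf,
)
print(score)  # ~6.85: a rare, repeated term in a short document scores highly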
