BM25 Ranking

BM25 relevance ranking algorithm for full-text search.

Overview

Echomine uses the BM25 (Best Matching 25) algorithm for relevance ranking when searching conversations by keywords. BM25 is a probabilistic ranking function used by search engines to estimate the relevance of documents to a given query.

API Reference

ranking

BM25 relevance ranking for conversation search.

BM25 (Best Matching 25) is a probabilistic relevance ranking function used to score documents (conversations) against keyword queries.

Parameters (from FR-317):

  k1 = 1.5   # Term frequency saturation parameter
  b = 0.75   # Length normalization parameter

Algorithm

score(D, Q) = Σ IDF(qi) * (f(qi, D) * (k1 + 1)) / (f(qi, D) + k1 * (1 - b + b * |D|/avgdl))

Where:

  • D: Document (conversation)
  • Q: Query (keywords)
  • f(qi, D): Frequency of keyword qi in document D
  • |D|: Length of document D (word count)
  • avgdl: Average document length across corpus
  • IDF(qi): Inverse document frequency of qi

Constitution Compliance
  • Principle VI: Strict typing with mypy --strict
  • FR-317-326: BM25 algorithm with standard parameters
  • FR-319: Case-insensitive keyword matching

Advanced Search Features (v1.1.0):

  • FR-001-006: phrase_matches() for exact phrase matching
  • FR-012-016: exclude_filter() for exclusion filtering

BM25Scorer

BM25Scorer(corpus: list[str], avg_doc_length: float)

BM25 relevance scorer for conversations.

Scores conversations based on keyword matches using the BM25 algorithm. Normalizes scores to the [0.0, 1.0] range for consistency.

Algorithm Details
  • k1 = 1.5 (term frequency saturation)
  • b = 0.75 (length normalization)
  • IDF: log((N - df + 0.5) / (df + 0.5) + 1)
  • Score normalization: score_normalized = score_raw / (score_raw + 1) per FR-326
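A minimal sketch of this normalization step (per FR-326), which maps the unbounded raw score into the [0.0, 1.0] range, approaching 1.0 asymptotically:

def normalize_score(score_raw: float) -> float:
    """Map a raw BM25 score from [0, inf) toward [0.0, 1.0)."""
    return score_raw / (score_raw + 1.0)

assert normalize_score(0.0) == 0.0  # no keyword matches
assert normalize_score(4.0) == 0.8  # strong raw score -> high normalized score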
Example
# Build corpus
corpus = [
    "Python is a great programming language",
    "I love Python and JavaScript",
    "Go is fast but Python is easier"
]

# Initialize scorer
scorer = BM25Scorer(corpus=corpus, avg_doc_length=7.0)

# Score documents
score = scorer.score("Python is a great programming language", ["python"])
# Returns BM25 score (higher = more relevant)
Requirements
  • FR-317: BM25 algorithm implementation
  • FR-318: k1 = 1.5, b = 0.75
  • FR-319: Case-insensitive matching
  • FR-326: Score normalization to [0.0, 1.0]

Initialize BM25 scorer with corpus statistics.

Parameters:

  • corpus (list[str], required): List of document texts (conversation content)
  • avg_doc_length (float, required): Average document length (for normalization)
Example
corpus = ["doc one content", "doc two content"]
avg_len = sum(len(doc.split()) for doc in corpus) / len(corpus)
scorer = BM25Scorer(corpus=corpus, avg_doc_length=avg_len)
Source code in src/echomine/search/ranking.py
def __init__(self, corpus: list[str], avg_doc_length: float) -> None:
    """Initialize BM25 scorer with corpus statistics.

    Args:
        corpus: List of document texts (conversation content)
        avg_doc_length: Average document length (for normalization)

    Example:
        ```python
        corpus = ["doc one content", "doc two content"]
        avg_len = sum(len(doc.split()) for doc in corpus) / len(corpus)
        scorer = BM25Scorer(corpus=corpus, avg_doc_length=avg_len)
        ```
    """
    self.corpus_size = len(corpus)
    self.avg_doc_length = avg_doc_length

    # Calculate IDF scores for all terms
    self.idf_scores: dict[str, float] = self._calculate_idf(corpus)
score
score(document: str, keywords: list[str]) -> float

Score a document for given keywords using BM25.

Tokenizes both document and keywords using the same method to ensure consistent matching. Multi-character keywords (e.g., Chinese "编程") are split into individual character tokens.

Parameters:

  • document (str, required): Document text (conversation content)
  • keywords (list[str], required): List of query keywords (will be tokenized)

Returns:

  • float: BM25 score (higher = more relevant, unnormalized)

Example
doc = "Python is a great programming language"
keywords = ["python", "programming"]
score = scorer.score(doc, keywords)
# Returns sum of BM25 scores for each keyword token
Algorithm

For each keyword token qi: IDF(qi) * (f(qi, D) * (k1 + 1)) / (f(qi, D) + k1 * (1 - b + b * |D|/avgdl))

Source code in src/echomine/search/ranking.py
def score(self, document: str, keywords: list[str]) -> float:
    """Score a document for given keywords using BM25.

    Tokenizes both document and keywords using the same method to ensure
    consistent matching. Multi-character keywords (e.g., Chinese "编程")
    are split into individual character tokens.

    Args:
        document: Document text (conversation content)
        keywords: List of query keywords (will be tokenized)

    Returns:
        BM25 score (higher = more relevant, unnormalized)

    Example:
        ```python
        doc = "Python is a great programming language"
        keywords = ["python", "programming"]
        score = scorer.score(doc, keywords)
        # Returns sum of BM25 scores for each keyword token
        ```

    Algorithm:
        For each keyword token qi:
            IDF(qi) * (f(qi, D) * (k1 + 1)) / (f(qi, D) + k1 * (1 - b + b * |D|/avgdl))
    """
    # Tokenize document using improved tokenization
    doc_terms = self._tokenize(document)
    doc_length = len(doc_terms)

    # Count term frequencies
    tf_counter: CounterType[str] = Counter(doc_terms)

    # Tokenize keywords (handles multi-character keywords like "编程")
    keyword_tokens: list[str] = []
    for keyword in keywords:
        keyword_tokens.extend(self._tokenize(keyword))

    # Calculate BM25 score
    score = 0.0

    for kw_token in keyword_tokens:
        # Get IDF score (0 if term not in corpus)
        idf = self.idf_scores.get(kw_token, 0.0)

        # Get term frequency in document
        tf = tf_counter.get(kw_token, 0)

        # BM25 formula
        numerator = tf * (self.K1 + 1.0)

        # Guard against division by zero when avg_doc_length is 0
        # This can happen with sparse corpora (e.g., role_filter=system with few system messages)
        # When avg_doc_length is 0, skip length normalization (use ratio of 1.0)
        length_ratio = doc_length / self.avg_doc_length if self.avg_doc_length > 0 else 1.0

        denominator = tf + self.K1 * (1.0 - self.B + self.B * length_ratio)

        score += idf * (numerator / denominator)

    return score

phrase_matches

phrase_matches(text: str, phrases: list[str]) -> bool

Check if any phrase matches in the text (case-insensitive substring).

This function implements exact phrase matching without tokenization. Phrases are matched as-is, preserving hyphens, underscores, and spaces.

Parameters:

  • text (str, required): Text to search in (e.g., conversation content)
  • phrases (list[str], required): List of phrases to match

Returns:

  • bool: True if any phrase is found in text, False otherwise

Example
text = "We use algo-insights for data analysis"
assert phrase_matches(text, ["algo-insights"]) is True
assert phrase_matches(text, ["algorithm"]) is False
Requirements
  • FR-001: Exact phrase matching (no tokenization)
  • FR-003: Case-insensitive matching
  • FR-006: Special characters matched literally
Source code in src/echomine/search/ranking.py
def phrase_matches(text: str, phrases: list[str]) -> bool:
    """Check if any phrase matches in the text (case-insensitive substring).

    This function implements exact phrase matching without tokenization.
    Phrases are matched as-is, preserving hyphens, underscores, and spaces.

    Args:
        text: Text to search in (e.g., conversation content)
        phrases: List of phrases to match

    Returns:
        True if any phrase is found in text, False otherwise

    Example:
        ```python
        text = "We use algo-insights for data analysis"
        assert phrase_matches(text, ["algo-insights"]) is True
        assert phrase_matches(text, ["algorithm"]) is False
        ```

    Requirements:
        - FR-001: Exact phrase matching (no tokenization)
        - FR-003: Case-insensitive matching
        - FR-006: Special characters matched literally
    """
    # Empty phrases list means no matches possible
    if not phrases:
        return False

    # Empty text means no matches possible
    if not text:
        return False

    # Case-insensitive substring matching (OR logic for multiple phrases)
    text_lower = text.lower()
    return any(phrase.lower() in text_lower for phrase in phrases)

all_terms_present

all_terms_present(text: str, keywords: list[str], scorer: BM25Scorer) -> bool

Check if ALL keyword tokens are present in the text.

Uses the same tokenization as BM25Scorer to ensure consistent matching behavior. All keyword tokens must be present in the text tokens for match_mode='all' to succeed.

Parameters:

  • text (str, required): Text to check
  • keywords (list[str], required): Keywords to find (will be tokenized)
  • scorer (BM25Scorer, required): BM25Scorer instance for tokenization

Returns:

  • bool: True if ALL keyword tokens are present in text, False otherwise

Example
scorer = BM25Scorer(corpus=["test"], avg_doc_length=1.0)
text = "Python and Java programming"
assert all_terms_present(text, ["python", "java"], scorer) is True
assert all_terms_present(text, ["python", "rust"], scorer) is False
Requirements
  • FR-009: All keywords must be present (AND logic)
  • FR-010: Uses same tokenization as BM25
Source code in src/echomine/search/ranking.py
def all_terms_present(text: str, keywords: list[str], scorer: BM25Scorer) -> bool:
    """Check if ALL keyword tokens are present in the text.

    Uses the same tokenization as BM25Scorer to ensure consistent matching
    behavior. All keyword tokens must be present in the text tokens for
    match_mode='all' to succeed.

    Args:
        text: Text to check
        keywords: Keywords to find (will be tokenized)
        scorer: BM25Scorer instance for tokenization

    Returns:
        True if ALL keyword tokens are present in text,
        False otherwise

    Example:
        ```python
        scorer = BM25Scorer(corpus=["test"], avg_doc_length=1.0)
        text = "Python and Java programming"
        assert all_terms_present(text, ["python", "java"], scorer) is True
        assert all_terms_present(text, ["python", "rust"], scorer) is False
        ```

    Requirements:
        - FR-009: All keywords must be present (AND logic)
        - FR-010: Uses same tokenization as BM25
    """
    # Empty keywords list is vacuously true (all of nothing is present)
    if not keywords:
        return True

    # Tokenize the text using BM25Scorer's tokenization
    text_tokens = set(scorer._tokenize(text))  # noqa: SLF001

    # Tokenize all keywords and check if each token is present
    for keyword in keywords:
        keyword_tokens = scorer._tokenize(keyword)  # noqa: SLF001
        # All tokens from this keyword must be present
        for token in keyword_tokens:
            if token not in text_tokens:
                return False

    return True

exclude_filter

exclude_filter(
    text: str, exclude_keywords: list[str], scorer: BM25Scorer
) -> bool

Check if text contains any excluded keywords.

Uses the same tokenization as BM25Scorer to ensure consistent matching behavior between inclusion and exclusion.

Parameters:

  • text (str, required): Text to check for excluded terms
  • exclude_keywords (list[str], required): Keywords to exclude (will be tokenized)
  • scorer (BM25Scorer, required): BM25Scorer instance for tokenization

Returns:

  • bool: True if text should be EXCLUDED (contains any excluded term), False if text should be KEPT (no excluded terms found)

Example
scorer = BM25Scorer(corpus=["test"], avg_doc_length=1.0)
text = "Python with Django framework"
assert exclude_filter(text, ["django"], scorer) is True  # Exclude
assert exclude_filter(text, ["flask"], scorer) is False  # Keep
Requirements
  • FR-014: Applied after matching, before ranking
  • FR-015: Uses same tokenization as keywords
Source code in src/echomine/search/ranking.py
def exclude_filter(text: str, exclude_keywords: list[str], scorer: BM25Scorer) -> bool:
    """Check if text contains any excluded keywords.

    Uses the same tokenization as BM25Scorer to ensure consistent matching
    behavior between inclusion and exclusion.

    Args:
        text: Text to check for excluded terms
        exclude_keywords: Keywords to exclude (will be tokenized)
        scorer: BM25Scorer instance for tokenization

    Returns:
        True if text should be EXCLUDED (contains any excluded term),
        False if text should be KEPT (no excluded terms found)

    Example:
        ```python
        scorer = BM25Scorer(corpus=["test"], avg_doc_length=1.0)
        text = "Python with Django framework"
        assert exclude_filter(text, ["django"], scorer) is True  # Exclude
        assert exclude_filter(text, ["flask"], scorer) is False  # Keep
        ```

    Requirements:
        - FR-014: Applied after matching, before ranking
        - FR-015: Uses same tokenization as keywords
    """
    # Empty exclusions means keep all
    if not exclude_keywords:
        return False

    # Empty text has no tokens to match exclusions
    if not text:
        return False

    # Tokenize the text using BM25Scorer's tokenization
    text_tokens = set(scorer._tokenize(text))  # noqa: SLF001

    # Check if ANY excluded token is present (OR logic for exclusion)
    for keyword in exclude_keywords:
        keyword_tokens = scorer._tokenize(keyword)  # noqa: SLF001
        # If any token from this keyword is present, exclude
        for token in keyword_tokens:
            if token in text_tokens:
                return True

    return False

How BM25 Works

Algorithm

BM25 calculates a relevance score based on:

  1. Term Frequency (TF): How often keywords appear in the conversation
  2. Inverse Document Frequency (IDF): How rare keywords are across all conversations
  3. Document Length Normalization: Adjusts for conversation length

Formula:

score(D, Q) = Σ IDF(qi) · (f(qi, D) · (k1 + 1)) / (f(qi, D) + k1 · (1 - b + b · |D| / avgdl))

Where:

  • D = Conversation (document)
  • Q = Search query keywords
  • f(qi, D) = Frequency of keyword qi in conversation D
  • |D| = Length of conversation (number of words)
  • avgdl = Average conversation length across all conversations
  • k1 = Term saturation parameter (default: 1.5)
  • b = Length normalization parameter (default: 0.75)
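As a worked example with illustrative numbers (not from a real corpus): a single keyword with IDF 1.2 that appears twice in a 10-word conversation, in a corpus whose average length is 20 words:

k1, b = 1.5, 0.75
tf, idf = 2, 1.2          # keyword appears twice; illustrative IDF value
doc_len, avgdl = 10, 20.0

numerator = tf * (k1 + 1)                              # 2 * 2.5 = 5.0
denominator = tf + k1 * (1 - b + b * doc_len / avgdl)  # 2 + 1.5 * 0.625 = 2.9375
score = idf * numerator / denominator                  # ~2.04 (unnormalized)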

Parameters

Echomine uses standard BM25 parameters:

  • k1 = 1.5: Controls term saturation (higher = more weight to term frequency)
  • b = 0.75: Controls length normalization (higher = more penalty for long documents)
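The saturation that k1 controls is easiest to see by holding the document at the corpus-average length (so the length term reduces to 1) and varying term frequency; a small illustration:

k1 = 1.5
for tf in (1, 2, 5, 10, 50):
    factor = tf * (k1 + 1) / (tf + k1)  # length ratio = 1.0, so b drops out
    print(tf, round(factor, 2))
# 1 -> 1.0, 2 -> 1.43, 5 -> 1.92, 10 -> 2.17, 50 -> 2.43 (approaches k1 + 1 = 2.5)

Repeating a keyword fifty times scores only about 2.4x higher than mentioning it once, which is what keeps keyword-stuffed conversations from dominating results.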

Usage Examples

Basic Search with Ranking

from echomine import OpenAIAdapter, SearchQuery
from pathlib import Path

adapter = OpenAIAdapter()
export_file = Path("conversations.json")

# Search with keywords
query = SearchQuery(keywords=["python", "async"], limit=10)

# Results are ranked by BM25 score
for result in adapter.search(export_file, query):
    print(f"[{result.score:.2f}] {result.conversation.title}")

Score Interpretation

for result in adapter.search(export_file, query):
    score = result.score

    if score >= 0.8:
        quality = "Excellent match"
    elif score >= 0.6:
        quality = "Good match"
    elif score >= 0.4:
        quality = "Fair match"
    else:
        quality = "Weak match"

    print(f"[{score:.2f}] {quality}: {result.conversation.title}")

Filtering by Score Threshold

# Only show high-quality matches
min_score = 0.6

for result in adapter.search(export_file, query):
    if result.score >= min_score:
        print(f"[{result.score:.2f}] {result.conversation.title}")

Score Normalization

Scores are normalized to the [0.0, 1.0] range:

  • 1.0: Perfect match (all keywords present, high frequency)
  • 0.8-0.9: Excellent match (most keywords, good frequency)
  • 0.6-0.7: Good match (some keywords, moderate frequency)
  • 0.4-0.5: Fair match (few keywords, low frequency)
  • <0.4: Weak match

Note: Scores are relative. A score of 0.8 doesn't mean "80% match"; it means the conversation is in the top tier of matches for the query.

Search Behavior

Keyword Processing

Keywords are processed as follows:

  1. Case-insensitive: "Python" matches "python", "PYTHON" (see the check after this list)
  2. Whole words: "python" doesn't match "pythonic" (unless stemming is enabled)
  3. Multiple keywords: All keywords contribute to the score (OR logic)
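A quick check of case-insensitivity, assuming the tokenizer lowercases as described under Implementation Details (corpus values here are illustrative):

scorer = BM25Scorer(corpus=["Python is great"], avg_doc_length=3.0)
doc = "Python is great"

# FR-319: differently-cased keywords produce identical scores
assert scorer.score(doc, ["PYTHON"]) == scorer.score(doc, ["python"])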

Content Searched

BM25 searches across:

  • Message content (all messages in conversation)
  • Conversation title (weighted higher)

Ranking Order

Results are sorted by score in descending order:

results = list(adapter.search(export_file, query))

# Results are pre-sorted by score
assert results[0].score >= results[1].score >= results[2].score

Performance

Time Complexity

  • O(n · m): where n = number of conversations and m = average messages per conversation
  • Early termination: Stops after finding limit top matches

Memory Complexity

  • O(1): Constant memory (streaming-based)
  • No full conversation buffering
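One way to reconcile a result limit with constant memory is a bounded min-heap over the stream of scored conversations; a minimal sketch, not the actual Echomine internals:

import heapq
from collections.abc import Iterable

def top_k(scored: Iterable[tuple[float, str]], k: int) -> list[tuple[float, str]]:
    """Keep only the k best (score, title) pairs while streaming."""
    heap: list[tuple[float, str]] = []
    for score, title in scored:
        if len(heap) < k:
            heapq.heappush(heap, (score, title))
        elif score > heap[0][0]:  # better than the worst retained result
            heapq.heapreplace(heap, (score, title))
    return sorted(heap, reverse=True)  # descending by score

Memory stays O(k) regardless of how many conversations are streamed.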

Speed

  • 1.6GB file: <30 seconds for keyword search
  • 10K conversations: <10 seconds

Comparison with Other Ranking Methods

BM25 vs TF-IDF

Feature               BM25                                TF-IDF
Saturation            Yes (k1 parameter)                  No
Length normalization  Yes (b parameter)                   Limited
Performance           Better for varied-length documents  Better for uniform documents
Echomine choice       ✅ Used                             ❌ Not used

Why BM25? Conversations vary widely in length. BM25's length normalization prevents long conversations from dominating results.

BM25 vs Semantic Search

Feature        BM25                  Semantic Search
Query type     Keywords              Natural language
Speed          Fast (no ML)          Slower (embedding models)
Setup          No dependencies       Requires embedding models
Accuracy       Good for exact terms  Better for synonyms/concepts
Echomine v1.0  ✅ Used               ❌ Not used

Future: Semantic search may be added in v2.0 as an optional feature.

Limitations

  1. No fuzzy matching: "python" doesn't match "pythonic"
  2. No synonym matching: "car" doesn't match "automobile"
  3. Keyword-only: Doesn't understand semantic meaning
  4. No phrase matching in keyword queries: "machine learning" is treated as two separate keywords (exact phrases are handled by phrase_matches(), added in v1.1.0; see the sketch after this list)
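For limitation 4 specifically, the v1.1.0 phrase_matches() helper documented above can post-filter keyword results into exact-phrase matches. A sketch, assuming the import path implied by the source-file location (src/echomine/search/ranking.py) and using the title attribute as a stand-in for whatever conversation text is available:

from echomine.search.ranking import phrase_matches

query = SearchQuery(keywords=["machine", "learning"], limit=50)

for result in adapter.search(export_file, query):
    # Keep only results whose text contains the exact phrase (FR-001)
    if phrase_matches(result.conversation.title, ["machine learning"]):
        print(f"[{result.score:.2f}] {result.conversation.title}")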

Advanced Usage

Multi-Keyword Queries

# All keywords contribute to score (OR logic)
query = SearchQuery(
    keywords=["python", "async", "asyncio", "coroutine"],
    limit=10
)

# Conversations with more keyword matches rank higher
for result in adapter.search(export_file, query):
    print(f"[{result.score:.2f}] {result.conversation.title}")

Combining with Filters

from datetime import date

# BM25 ranking + date filtering
query = SearchQuery(
    keywords=["algorithm", "design"],
    from_date=date(2024, 1, 1),
    to_date=date(2024, 3, 31),
    limit=10
)

# Date filter reduces search space, then BM25 ranks
for result in adapter.search(export_file, query):
    print(f"[{result.score:.2f}] {result.conversation.title}")

Implementation Details

Tokenization

Messages are tokenized by:

  1. Lowercase conversion
  2. Split on whitespace and punctuation
  3. Filter stop words (optional, not implemented in v1.0)
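A minimal sketch of a tokenizer with this behavior (the real _tokenize may differ, e.g., in the per-character CJK handling noted earlier):

import re

def tokenize(text: str) -> list[str]:
    """Lowercase, then split on whitespace and punctuation."""
    return re.findall(r"\w+", text.lower())

assert tokenize("Python, async & AsyncIO!") == ["python", "async", "asyncio"]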

IDF Calculation

import math

def calculate_idf(term: str, total_docs: int, docs_with_term: int) -> float:
    """Calculate inverse document frequency."""
    return math.log((total_docs - docs_with_term + 0.5) / (docs_with_term + 0.5) + 1.0)
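With this formulation, rare terms receive much higher weight than common ones; for example (illustrative document-frequency counts):

print(calculate_idf("python", total_docs=100, docs_with_term=90))  # ~0.11 (common)
print(calculate_idf("asyncio", total_docs=100, docs_with_term=2))  # ~3.70 (rare)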

Score Calculation

def calculate_bm25_score(
    term_freq: int,
    doc_length: int,
    avg_doc_length: float,
    idf: float,
    k1: float = 1.5,
    b: float = 0.75,
) -> float:
    """Calculate BM25 score for a single term."""
    numerator = term_freq * (k1 + 1)
    denominator = term_freq + k1 * (1 - b + b * (doc_length / avg_doc_length))
    return idf * (numerator / denominator)
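Putting the two helpers together for a single term (same illustrative counts as above):

idf = calculate_idf("asyncio", total_docs=100, docs_with_term=2)
score = calculate_bm25_score(
    term_freq=3,         # term appears three times
    doc_length=12,       # shorter than average
    avg_doc_length=20.0,
    idf=idf,
)
print(score)  # ~6.85: a rare, repeated term in a short document scores highly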
