BM25 Ranking¶
BM25 relevance ranking algorithm for full-text search.
Overview¶
Echomine uses BM25 (Best Matching 25) algorithm for relevance ranking when searching conversations by keywords. BM25 is a probabilistic ranking function used by search engines to estimate the relevance of documents to a given query.
API Reference¶
ranking
¶
BM25 relevance ranking for conversation search.
BM25 (Best Matching 25) is a probabilistic relevance ranking function. Used to score documents (conversations) based on keyword queries.
Parameters (from FR-317): k1 = 1.5 # Term frequency saturation parameter b = 0.75 # Length normalization parameter
Algorithm
score(D, Q) = Σ IDF(qi) * (f(qi, D) * (k1 + 1)) / (f(qi, D) + k1 * (1 - b + b * |D|/avgdl))
Where: - D: Document (conversation) - Q: Query (keywords) - f(qi, D): Frequency of keyword qi in document D - |D|: Length of document D (word count) - avgdl: Average document length across corpus - IDF(qi): Inverse document frequency of qi
Constitution Compliance
- Principle VI: Strict typing with mypy --strict
- FR-317-326: BM25 algorithm with standard parameters
- FR-319: Case-insensitive keyword matching
Advanced Search Features (v1.1.0): - FR-001-006: phrase_matches() for exact phrase matching - FR-012-016: exclude_filter() for exclusion filtering
BM25Scorer
¶
BM25 relevance scorer for conversations.
Scores conversations based on keyword matches using BM25 algorithm. Normalizes scores to [0.0, 1.0] range for consistency.
Algorithm Details
- k1 = 1.5 (term frequency saturation)
- b = 0.75 (length normalization)
- IDF: log((N - df + 0.5) / (df + 0.5) + 1)
- Score normalization: score_normalized = score_raw / (score_raw + 1) per FR-319
Example
# Build corpus
corpus = [
"Python is a great programming language",
"I love Python and JavaScript",
"Go is fast but Python is easier"
]
# Initialize scorer
scorer = BM25Scorer(corpus=corpus, avg_doc_length=7.0)
# Score documents
score = scorer.score("Python is a great programming language", ["python"])
# Returns BM25 score (higher = more relevant)
Requirements
- FR-317: BM25 algorithm implementation
- FR-318: k1 = 1.5, b = 0.75
- FR-319: Case-insensitive matching
- FR-326: Score normalization to [0.0, 1.0]
Initialize BM25 scorer with corpus statistics.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
corpus
|
list[str]
|
List of document texts (conversation content) |
required |
avg_doc_length
|
float
|
Average document length (for normalization) |
required |
Example
Source code in src/echomine/search/ranking.py
score
¶
Score a document for given keywords using BM25.
Tokenizes both document and keywords using the same method to ensure consistent matching. Multi-character keywords (e.g., Chinese "编程") are split into individual character tokens.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
document
|
str
|
Document text (conversation content) |
required |
keywords
|
list[str]
|
List of query keywords (will be tokenized) |
required |
Returns:
| Type | Description |
|---|---|
float
|
BM25 score (higher = more relevant, unnormalized) |
Example
Algorithm
For each keyword token qi: IDF(qi) * (f(qi, D) * (k1 + 1)) / (f(qi, D) + k1 * (1 - b + b * |D|/avgdl))
Source code in src/echomine/search/ranking.py
phrase_matches
¶
Check if any phrase matches in the text (case-insensitive substring).
This function implements exact phrase matching without tokenization. Phrases are matched as-is, preserving hyphens, underscores, and spaces.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
text
|
str
|
Text to search in (e.g., conversation content) |
required |
phrases
|
list[str]
|
List of phrases to match |
required |
Returns:
| Type | Description |
|---|---|
bool
|
True if any phrase is found in text, False otherwise |
Example
Requirements
- FR-001: Exact phrase matching (no tokenization)
- FR-003: Case-insensitive matching
- FR-006: Special characters matched literally
Source code in src/echomine/search/ranking.py
all_terms_present
¶
Check if ALL keyword tokens are present in the text.
Uses the same tokenization as BM25Scorer to ensure consistent matching behavior. All keyword tokens must be present in the text tokens for match_mode='all' to succeed.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
text
|
str
|
Text to check |
required |
keywords
|
list[str]
|
Keywords to find (will be tokenized) |
required |
scorer
|
BM25Scorer
|
BM25Scorer instance for tokenization |
required |
Returns:
| Type | Description |
|---|---|
bool
|
True if ALL keyword tokens are present in text, |
bool
|
False otherwise |
Example
Requirements
- FR-009: All keywords must be present (AND logic)
- FR-010: Uses same tokenization as BM25
Source code in src/echomine/search/ranking.py
exclude_filter
¶
Check if text contains any excluded keywords.
Uses the same tokenization as BM25Scorer to ensure consistent matching behavior between inclusion and exclusion.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
text
|
str
|
Text to check for excluded terms |
required |
exclude_keywords
|
list[str]
|
Keywords to exclude (will be tokenized) |
required |
scorer
|
BM25Scorer
|
BM25Scorer instance for tokenization |
required |
Returns:
| Type | Description |
|---|---|
bool
|
True if text should be EXCLUDED (contains any excluded term), |
bool
|
False if text should be KEPT (no excluded terms found) |
Example
Requirements
- FR-014: Applied after matching, before ranking
- FR-015: Uses same tokenization as keywords
Source code in src/echomine/search/ranking.py
How BM25 Works¶
Algorithm¶
BM25 calculates a relevance score based on:
- Term Frequency (TF): How often keywords appear in the conversation
- Inverse Document Frequency (IDF): How rare keywords are across all conversations
- Document Length Normalization: Adjusts for conversation length
Formula:
Where:
- D = Conversation (document)
- Q = Search query keywords
- f(qi, D) = Frequency of keyword qi in conversation D
- |D| = Length of conversation (number of words)
- avgdl = Average conversation length across all conversations
- k1 = Term saturation parameter (default: 1.5)
- b = Length normalization parameter (default: 0.75)
Parameters¶
Echomine uses standard BM25 parameters:
- k1 = 1.5: Controls term saturation (higher = more weight to term frequency)
- b = 0.75: Controls length normalization (higher = more penalty for long documents)
Usage Examples¶
Basic Search with Ranking¶
from echomine import OpenAIAdapter, SearchQuery
from pathlib import Path
adapter = OpenAIAdapter()
export_file = Path("conversations.json")
# Search with keywords
query = SearchQuery(keywords=["python", "async"], limit=10)
# Results are ranked by BM25 score
for result in adapter.search(export_file, query):
print(f"[{result.score:.2f}] {result.conversation.title}")
Score Interpretation¶
for result in adapter.search(export_file, query):
score = result.score
if score >= 0.8:
quality = "Excellent match"
elif score >= 0.6:
quality = "Good match"
elif score >= 0.4:
quality = "Fair match"
else:
quality = "Weak match"
print(f"[{score:.2f}] {quality}: {result.conversation.title}")
Filtering by Score Threshold¶
# Only show high-quality matches
min_score = 0.6
for result in adapter.search(export_file, query):
if result.score >= min_score:
print(f"[{result.score:.2f}] {result.conversation.title}")
Score Normalization¶
Scores are normalized to 0.0-1.0 range:
- 1.0: Perfect match (all keywords present, high frequency)
- 0.8-0.9: Excellent match (most keywords, good frequency)
- 0.6-0.7: Good match (some keywords, moderate frequency)
- 0.4-0.5: Fair match (few keywords, low frequency)
- <0.4: Weak match
Note: Scores are relative. A score of 0.8 doesn't mean "80% match" - it means the conversation is in the top tier of matches for the query.
Search Behavior¶
Keyword Processing¶
Keywords are processed as follows:
- Case-insensitive: "Python" matches "python", "PYTHON"
- Whole words: "python" doesn't match "pythonic" (unless stemming is enabled)
- Multiple keywords: All keywords contribute to score (OR logic)
Content Searched¶
BM25 searches across:
- Message content (all messages in conversation)
- Conversation title (weighted higher)
Ranking Order¶
Results are sorted by score in descending order:
results = list(adapter.search(export_file, query))
# Results are pre-sorted by score
assert results[0].score >= results[1].score >= results[2].score
Performance¶
Time Complexity¶
- O(n · m): Where n = number of conversations, m = average messages per conversation
- Early termination: Stops after finding
limittop matches
Memory Complexity¶
- O(1): Constant memory (streaming-based)
- No full conversation buffering
Speed¶
- 1.6GB file: <30 seconds for keyword search
- 10K conversations: <10 seconds
Comparison with Other Ranking Methods¶
BM25 vs TF-IDF¶
| Feature | BM25 | TF-IDF |
|---|---|---|
| Saturation | Yes (k1 parameter) | No |
| Length normalization | Yes (b parameter) | Limited |
| Performance | Better for varied-length documents | Better for uniform documents |
| Echomine choice | ✅ Used | ❌ Not used |
Why BM25? Conversations vary widely in length. BM25's length normalization prevents long conversations from dominating results.
BM25 vs Semantic Search¶
| Feature | BM25 | Semantic Search |
|---|---|---|
| Query type | Keywords | Natural language |
| Speed | Fast (no ML) | Slower (embedding models) |
| Setup | No dependencies | Requires embedding models |
| Accuracy | Good for exact terms | Better for synonyms/concepts |
| Echomine v1.0 | ✅ Used | ❌ Not used |
Future: Semantic search may be added in v2.0 as an optional feature.
Limitations¶
- No fuzzy matching: "python" doesn't match "pythonic"
- No synonym matching: "car" doesn't match "automobile"
- Keyword-only: Doesn't understand semantic meaning
- No phrase matching: "machine learning" treated as two separate keywords
Advanced Usage¶
Multi-Keyword Queries¶
# All keywords contribute to score (OR logic)
query = SearchQuery(
keywords=["python", "async", "asyncio", "coroutine"],
limit=10
)
# Conversations with more keyword matches rank higher
for result in adapter.search(export_file, query):
print(f"[{result.score:.2f}] {result.conversation.title}")
Combining with Filters¶
from datetime import date
# BM25 ranking + date filtering
query = SearchQuery(
keywords=["algorithm", "design"],
from_date=date(2024, 1, 1),
to_date=date(2024, 3, 31),
limit=10
)
# Date filter reduces search space, then BM25 ranks
for result in adapter.search(export_file, query):
print(f"[{result.score:.2f}] {result.conversation.title}")
Implementation Details¶
Tokenization¶
Messages are tokenized by:
- Lowercase conversion
- Split on whitespace and punctuation
- Filter stop words (optional, not implemented in v1.0)
IDF Calculation¶
import math
def calculate_idf(term: str, total_docs: int, docs_with_term: int) -> float:
"""Calculate inverse document frequency."""
return math.log((total_docs - docs_with_term + 0.5) / (docs_with_term + 0.5) + 1.0)
Score Calculation¶
def calculate_bm25_score(
term_freq: int,
doc_length: int,
avg_doc_length: float,
idf: float,
k1: float = 1.5,
b: float = 0.75,
) -> float:
"""Calculate BM25 score for a single term."""
numerator = term_freq * (k1 + 1)
denominator = term_freq + k1 * (1 - b + b * (doc_length / avg_doc_length))
return idf * (numerator / denominator)
Related¶
- SearchQuery: Search parameters
- SearchResult: Result model with score
- OpenAI Adapter: Adapter using BM25