Offline search evaluation metrics are computed @k, meaning they consider only the top k results returned by the search engine.
Compare to online metrics, which include Average Order Value (AOV) and Click-Through Rate (CTR).
Computing evaluation metrics
When a metric is computed @k, it measures performance using only the top k ranked results for a given query. Once @k metrics are computed for each query, they are averaged across all queries to produce a single score.
Examples:
- Precision@5 considers only the top 5 results.
- Recall@10 checks how many relevant documents are retrieved in the top 10.
- nDCG@3 rewards ranking quality within the top 3 results.
Common choices of @k include @1, @3, @5, or @10, depending on how deeply users typically browse the result list.
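As a rough illustration of this per-query-then-average pattern, the Python sketch below truncates each ranked result list to the top k and averages a per-query score across queries. The data layout (ranked document IDs per query, sets of relevant IDs) and the function names are hypothetical, not part of any specific search engine's API.

```python
from statistics import mean

def average_metric_at_k(per_query_metric, ranked_by_query, relevant_by_query, k):
    """Score each query on its top k results only, then average across queries."""
    scores = []
    for query, ranked_ids in ranked_by_query.items():
        top_k = ranked_ids[:k]  # consider only the top k results
        scores.append(per_query_metric(top_k, relevant_by_query[query]))
    return mean(scores)

def hit(top_k, relevant_ids):
    """Example per-query metric: 1.0 if any top-k result is relevant, else 0.0."""
    return 1.0 if any(doc_id in relevant_ids for doc_id in top_k) else 0.0

# Toy data: ranked result IDs per query, and the set of relevant IDs per query.
ranked = {"q1": ["d1", "d2", "d3", "d4"], "q2": ["d7", "d8", "d9"]}
relevant = {"q1": {"d2", "d5"}, "q2": {"d7"}}
print(average_metric_at_k(hit, ranked, relevant, k=3))  # -> 1.0
```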
Evaluation Metrics
Each evaluation metric has its own formula and use cases.
Metric | Measures | Rank-Aware | Uses Graded Relevance | Interpretable As |
---|---|---|---|---|
Recall@k | Coverage of relevant documents | ❌ | ❌ | Percentage of relevant docs retrieved |
F1@k | Balance between precision and recall | ❌ | ❌ | Unified classification score |
Hit@k | Presence of any relevant result | ❌ | ❌ | Binary success or failure |
MRR@k | Position of 1st relevant result | ✅ | ❌ | Early relevance ranking |
nDCG@k | Overall ranking quality | ✅ | ✅ (optional) | Best results high in list |
MAP@k | Aggregate relevance and ranking | ✅ | ❌ | Average quality over queries |
Use cases
Short Description | Use Case | Recommended Metrics |
---|---|---|
Quick success checks | Site-search, Knowledge Management, Ecommerce | Hit@k, MRR@k |
Early relevance (fast find-ability) | Knowledge Management, Ecommerce | MRR@k, nDCG@k |
Ranking quality | Ecommerce | nDCG@k, MRR@k, MAP@k |
Multi-relevant answers | Ecommerce, Knowledge Management | nDCG@k, MAP@k, Recall@k |
Classification | Classification | F1@k, Recall@k |
Precision@k
Precision@k is a straightforward metric that measures the proportion of relevant items among the top k results in a ranked list. This metric is useful for measuring overall system performance and relevance.
- What it is: The number of relevant items in the top k results, divided by k
- Formula:
Precision@k = (Number of relevant items in the top k) / k
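A minimal Python sketch of this formula, assuming ranked results are a list of document IDs and relevance judgments are a set of IDs (both hypothetical stand-ins for your own data):

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Precision@k = (number of relevant items in the top k) / k."""
    top_k = ranked_ids[:k]
    relevant_in_top_k = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    return relevant_in_top_k / k

# Two of the top five results are relevant -> Precision@5 = 0.4
print(precision_at_k(["d1", "d2", "d3", "d4", "d5"], {"d2", "d4"}, k=5))
```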
Hit@k
Hit@k is a straightforward but powerful metric that mimics accuracy by checking whether the system succeeded in returning something relevant at all. This metric is useful as a quick success check and intuitive to interpret, as the answer is either 0 or 1.
- What it is: A binary indicator of whether any relevant document is present in the top k results
- Formula:
Hit@k = 1 if any relevant document is in the top k, else Hit@k = 0
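The same formula as a small Python sketch, again with hypothetical document IDs and a set of known-relevant IDs:

```python
def hit_at_k(ranked_ids, relevant_ids, k):
    """Hit@k = 1 if any relevant document appears in the top k, else 0."""
    return 1 if any(doc_id in relevant_ids for doc_id in ranked_ids[:k]) else 0

print(hit_at_k(["d1", "d2", "d3"], {"d3"}, k=3))  # -> 1
print(hit_at_k(["d1", "d2", "d3"], {"d9"}, k=3))  # -> 0
```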
Mean Average Precision (MAP@k)
MAP@k evaluates how many of the top k results are relevant. It uses both relevance and ranking quality across the ranked list, rewarding systems that consistently return relevant results near the top. This metric is especially useful in ecommerce and multi-relevant-answer scenarios, where each query is expected to have many relevant documents or products.
- What it is: The mean of average precision scores across all queries, computed at cutoff k
- Formula (per query):
Average Precision@k = Mean of precision values at each rank i ≤ k where document_i is relevant
- Then:
MAP@k = Mean of Average Precision@k over all queries
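A Python sketch of both steps (Average Precision per query, then the mean over queries), using the same hypothetical data layout as the earlier examples:

```python
from statistics import mean

def average_precision_at_k(ranked_ids, relevant_ids, k):
    """Mean of the precision values at each rank i <= k where the document at
    rank i is relevant; 0.0 if no relevant document appears in the top k."""
    precisions = []
    hits = 0
    for i, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            hits += 1
            precisions.append(hits / i)  # precision at this relevant rank
    return mean(precisions) if precisions else 0.0

def map_at_k(ranked_by_query, relevant_by_query, k):
    """MAP@k: mean of Average Precision@k over all queries."""
    return mean(
        average_precision_at_k(ranked_by_query[q], relevant_by_query[q], k)
        for q in ranked_by_query
    )

ranked = {"q1": ["d1", "d2", "d3", "d4"], "q2": ["d5", "d6", "d7"]}
relevant = {"q1": {"d1", "d3"}, "q2": {"d6"}}
print(map_at_k(ranked, relevant, k=3))  # ((1 + 2/3)/2 + 1/2) / 2 ≈ 0.667
```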
Mean Reciprocal Rank (MRR@k)
MRR@k captures how high in the search rankings a user finds a relevant result. It rewards systems that surface correct answers earlier in the result list.
- What it is: The average, over all queries, of the reciprocal of the rank at which the first relevant document appears, up to position k
- Formula (per query):
Reciprocal Rank@k = 1 / (rank of first relevant doc) if rank ≤ k, else 0
- Then:
MRR@k = Mean of Reciprocal Rank@k over all queries
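A Python sketch of the per-query reciprocal rank and its mean over queries, with hypothetical data:

```python
from statistics import mean

def reciprocal_rank_at_k(ranked_ids, relevant_ids, k):
    """1 / rank of the first relevant document in the top k, or 0 if there is none."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def mrr_at_k(ranked_by_query, relevant_by_query, k):
    """MRR@k: mean reciprocal rank over all queries."""
    return mean(
        reciprocal_rank_at_k(ranked_by_query[q], relevant_by_query[q], k)
        for q in ranked_by_query
    )

ranked = {"q1": ["d1", "d2", "d3"], "q2": ["d4", "d5", "d6"]}
relevant = {"q1": {"d2"}, "q2": {"d6"}}
print(mrr_at_k(ranked, relevant, k=3))  # (1/2 + 1/3) / 2 ≈ 0.417
```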
Normalized Discounted Cumulative Gain (nDCG@k)
nDCG@k is order-aware and optionally weight-aware (graded relevance). It penalizes systems that bury relevant results and rewards those that put the most relevant documents near the top. If graded relevance scores are unavailable, binary nDCG is computed (relevant vs. not relevant, where all relevant documents have a weight/grade of 1).
- What it is: Measures both the presence and position of relevant documents in the top k, while optionally considering graded relevance (that is, some results are more relevant than others)
- Formula (simplified):
DCG@k = Σ ((2^relevance_i - 1) / log2(i + 1)) for i = 1 to k
nDCG@k = DCG@k / IDCG@k (the DCG@k of the ideal ordering)
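A Python sketch of the simplified formula above, assuming graded relevance is stored as a hypothetical dict mapping document IDs to grades (0 for not relevant):

```python
import math

def dcg_at_k(ranked_ids, grades, k):
    """DCG@k = sum over the top k of (2^relevance_i - 1) / log2(i + 1)."""
    return sum(
        (2 ** grades.get(doc_id, 0) - 1) / math.log2(i + 1)
        for i, doc_id in enumerate(ranked_ids[:k], start=1)
    )

def ndcg_at_k(ranked_ids, grades, k):
    """nDCG@k = DCG@k / IDCG@k, where IDCG@k is the DCG of the ideal ordering."""
    ideal_order = sorted(grades, key=grades.get, reverse=True)  # best grades first
    idcg = dcg_at_k(ideal_order, grades, k)
    return dcg_at_k(ranked_ids, grades, k) / idcg if idcg > 0 else 0.0

# Graded relevance: 2 = highly relevant, 1 = somewhat relevant, 0 = not relevant.
grades = {"d1": 2, "d2": 0, "d3": 1}
print(ndcg_at_k(["d2", "d1", "d3"], grades, k=3))  # best doc is not first -> < 1.0
```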
Recall@k
Recall@k captures completeness. Even if many relevant results exist, if they aren’t surfaced early, users may never find them. While Precision@k measures the proportion of relevant documents among the top k results, Recall@k measures the proportion of all relevant documents that appear in the top k.
Recall@k is bounded by k / total_relevant_docs. For example, if you compute Recall@5 for a query that has 10 relevant documents, the Recall@5 value cannot exceed 0.5 even if all 5 results are relevant.
- What it is: The proportion of all relevant documents that appear in the top k
- Formula:
Recall@k = (Relevant documents in top k) / (Total relevant documents)
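A Python sketch of this formula; the cap from the example above (10 relevant documents, k = 5) is visible in the output:

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Recall@k = (relevant documents in the top k) / (total relevant documents)."""
    top_k = ranked_ids[:k]
    found = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    return found / len(relevant_ids) if relevant_ids else 0.0

# 10 relevant documents exist, and all 5 of the top 5 are relevant -> capped at 0.5
relevant = {f"d{i}" for i in range(10)}
print(recall_at_k(["d0", "d1", "d2", "d3", "d4"], relevant, k=5))  # -> 0.5
```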
F1@k
F1@k balances both the accuracy (precision) and coverage (recall) of search results, giving a holistic performance score at depth k. This is a classification-specific metric.
- What it is: The harmonic mean of Precision@k and Recall@k
- Formula:
F1@k = 2 * (Precision@k * Recall@k) / (Precision@k + Recall@k)
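A Python sketch that computes Precision@k and Recall@k inline and combines them into F1@k, with hypothetical document IDs:

```python
def f1_at_k(ranked_ids, relevant_ids, k):
    """F1@k: harmonic mean of Precision@k and Recall@k (0.0 if both are 0)."""
    top_k = ranked_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    precision = hits / k
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precision@5 = 2/5, Recall@5 = 2/4 -> F1@5 ≈ 0.444
print(f1_at_k(["d1", "d2", "d3", "d4", "d5"], {"d2", "d4", "d8", "d9"}, k=5))
```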