Offline search evaluation metrics are computed @k, meaning they consider only the top k results returned by the search engine.
Compare to online metrics, which include Average Order Value (AOV) and Click-Through Rate (CTR).
Computing evaluation metrics
When a metric is computed @k, it measures performance using only the top k ranked results for a given query. Once @k metrics are computed for each query, they are averaged across all queries to produce a single score.
Examples:
- Precision@5 considers only the top 5 results.
- Recall@10 checks how many relevant documents are retrieved in the top 10.
- nDCG@3 rewards ranking quality within the top 3 results.
Common choices of @k include @1, @3, @5, or @10, depending on how deeply users typically browse the result list.
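As a rough illustration of this per-query-then-average pattern, the Python sketch below truncates each ranked result list to the top k and averages a per-query score across queries. The data layout (ranked document IDs per query, sets of relevant IDs) and the function names are hypothetical, not part of any specific search engine's API.

```python
from statistics import mean

def average_metric_at_k(per_query_metric, ranked_by_query, relevant_by_query, k):
    """Score each query on its top k results only, then average across queries."""
    scores = []
    for query, ranked_ids in ranked_by_query.items():
        top_k = ranked_ids[:k]  # consider only the top k results
        scores.append(per_query_metric(top_k, relevant_by_query[query]))
    return mean(scores)

def hit(top_k, relevant_ids):
    """Example per-query metric: 1.0 if any top-k result is relevant, else 0.0."""
    return 1.0 if any(doc_id in relevant_ids for doc_id in top_k) else 0.0

# Toy data: ranked result IDs per query, and the set of relevant IDs per query.
ranked = {"q1": ["d1", "d2", "d3", "d4"], "q2": ["d7", "d8", "d9"]}
relevant = {"q1": {"d2", "d5"}, "q2": {"d7"}}
print(average_metric_at_k(hit, ranked, relevant, k=3))  # -> 1.0
```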
Evaluation Metrics
Each evaluation metric has its own formula and use cases.
Metric | Measures | Rank-Aware | Uses Graded Relevance | Interpretable As |
---|---|---|---|---|
Recall@k | Coverage of relevant documents | ❌ | ❌ | Percentage of relevant docs retrieved |
F1@k | Balance between precision and recall | ❌ | ❌ | Unified classification score |
Hit@k | Presence of any relevant result | ❌ | ❌ | Binary success or failure |
MRR@k | Position of 1st relevant result | ✅ | ❌ | Early relevance ranking |
nDCG@k | Overall ranking quality | ✅ | ✅ (optional) | Best results high in list |
MAP@k | Aggregate relevance and ranking | ✅ | ❌ | Average quality over queries |
Use cases
Short Description | Use Case | Recommended Metrics |
---|---|---|
Quick success checks | Site-search, Knowledge Management, Ecommerce | Hit@k, MRR@k |
Early relevance (fast find-ability) | Knowledge Management, Ecommerce | MRR@k, nDCG@k |
Ranking quality | Ecommerce | nDCG@k, MRR@k, MAP@k |
Multi-relevant answers | Ecommerce, Knowledge Management | nDCG@k, MAP@k, Recall@k |
Classification | Classification | F1@k, Recall@k |
Precision@k
Precision@k is a straightforward metric that measures the proportion of relevant items among the top k results in a ranked list. This metric is useful for measuring overall system performance and relevance.
- What it is: The number of relevant items in the top k results, divided by k
- Formula:
Precision@k = (Number of relevant items in the top k) / k
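A minimal Python sketch of this formula, assuming ranked results are a list of document IDs and relevance judgments are a set of IDs (both hypothetical stand-ins for your own data):

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Precision@k = (number of relevant items in the top k) / k."""
    top_k = ranked_ids[:k]
    relevant_in_top_k = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    return relevant_in_top_k / k

# Two of the top five results are relevant -> Precision@5 = 0.4
print(precision_at_k(["d1", "d2", "d3", "d4", "d5"], {"d2", "d4"}, k=5))
```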
Hit@k
Hit@k is a straightforward but powerful metric that mimics accuracy by checking whether the system succeeded in returning something relevant at all. This metric is useful as a quick success check and intuitive to interpret, as the answer is either 0 or 1.
- What it is: A binary indicator of whether any relevant document is present in the top k results
- Formula:
Hit@k = 1 if any relevant document is in the top k, else Hit@k = 0
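The same formula as a small Python sketch, again with hypothetical document IDs and a set of known-relevant IDs:

```python
def hit_at_k(ranked_ids, relevant_ids, k):
    """Hit@k = 1 if any relevant document appears in the top k, else 0."""
    return 1 if any(doc_id in relevant_ids for doc_id in ranked_ids[:k]) else 0

print(hit_at_k(["d1", "d2", "d3"], {"d3"}, k=3))  # -> 1
print(hit_at_k(["d1", "d2", "d3"], {"d9"}, k=3))  # -> 0
```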
Mean Average Precision (MAP@k)
MAP@k evaluates how many of the top k results are relevant. It uses both relevance and ranking quality across the ranked list, rewarding systems that consistently return relevant results near the top. This metric is especially useful in ecommerce and multi-relevant-answer scenarios, where each query is expected to have many relevant documents or products.
- What it is: The mean of average precision scores across all queries, computed at cutoff k
- Formula (per query):
Average Precision@k = Mean of precision values at each rank i ≤ k where document_i is relevant
- Then:
MAP@k = Mean of Average Precision@k over all queries
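A Python sketch of both steps (Average Precision per query, then the mean over queries), using the same hypothetical data layout as the earlier examples:

```python
from statistics import mean

def average_precision_at_k(ranked_ids, relevant_ids, k):
    """Mean of the precision values at each rank i <= k where the document at
    rank i is relevant; 0.0 if no relevant document appears in the top k."""
    precisions = []
    hits = 0
    for i, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            hits += 1
            precisions.append(hits / i)  # precision at this relevant rank
    return mean(precisions) if precisions else 0.0

def map_at_k(ranked_by_query, relevant_by_query, k):
    """MAP@k: mean of Average Precision@k over all queries."""
    return mean(
        average_precision_at_k(ranked_by_query[q], relevant_by_query[q], k)
        for q in ranked_by_query
    )

ranked = {"q1": ["d1", "d2", "d3", "d4"], "q2": ["d5", "d6", "d7"]}
relevant = {"q1": {"d1", "d3"}, "q2": {"d6"}}
print(map_at_k(ranked, relevant, k=3))  # ((1 + 2/3)/2 + 1/2) / 2 ≈ 0.667
```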
Mean Reciprocal Rank (MRR@k)
MRR@k captures how high in the search rankings a user finds a relevant result. It rewards systems that surface correct answers earlier in the result list.
- What it is: The average, over all queries, of the reciprocal of the rank at which the first relevant document appears, up to position k
- Formula (per query):
Reciprocal Rank@k = 1 / (rank of first relevant doc) if rank ≤ k, else 0
- Then:
MRR@k = Mean of Reciprocal Rank@k over all queries
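A Python sketch of the per-query reciprocal rank and its mean over queries, with hypothetical data:

```python
from statistics import mean

def reciprocal_rank_at_k(ranked_ids, relevant_ids, k):
    """1 / rank of the first relevant document in the top k, or 0 if there is none."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def mrr_at_k(ranked_by_query, relevant_by_query, k):
    """MRR@k: mean reciprocal rank over all queries."""
    return mean(
        reciprocal_rank_at_k(ranked_by_query[q], relevant_by_query[q], k)
        for q in ranked_by_query
    )

ranked = {"q1": ["d1", "d2", "d3"], "q2": ["d4", "d5", "d6"]}
relevant = {"q1": {"d2"}, "q2": {"d6"}}
print(mrr_at_k(ranked, relevant, k=3))  # (1/2 + 1/3) / 2 ≈ 0.417
```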
Normalized Discounted Cumulative Gain (nDCG@k)
nDCG@k is order-aware and optionally weight-aware (graded relevance). It penalizes systems that bury relevant results and rewards those that put the most relevant documents near the top. If graded relevance scores are unavailable, binary nDCG is computed (relevant vs. not relevant, where all relevant documents have a weight/grade of 1).
- What it is: Measures both the presence and position of relevant documents in the top k, while optionally considering graded relevance (that is, some results are more relevant than others)
- Formula (simplified):
DCG@k = Σ ((2^relevance_i - 1) / log2(i + 1)) for i = 1 to k
nDCG@k = DCG@k / IDCG@k (the DCG@k of the ideal ordering)
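A Python sketch of the simplified formula above, assuming graded relevance is stored as a hypothetical dict mapping document IDs to grades (0 for not relevant):

```python
import math

def dcg_at_k(ranked_ids, grades, k):
    """DCG@k = sum over the top k of (2^relevance_i - 1) / log2(i + 1)."""
    return sum(
        (2 ** grades.get(doc_id, 0) - 1) / math.log2(i + 1)
        for i, doc_id in enumerate(ranked_ids[:k], start=1)
    )

def ndcg_at_k(ranked_ids, grades, k):
    """nDCG@k = DCG@k / IDCG@k, where IDCG@k is the DCG of the ideal ordering."""
    ideal_order = sorted(grades, key=grades.get, reverse=True)  # best grades first
    idcg = dcg_at_k(ideal_order, grades, k)
    return dcg_at_k(ranked_ids, grades, k) / idcg if idcg > 0 else 0.0

# Graded relevance: 2 = highly relevant, 1 = somewhat relevant, 0 = not relevant.
grades = {"d1": 2, "d2": 0, "d3": 1}
print(ndcg_at_k(["d2", "d1", "d3"], grades, k=3))  # best doc is not first -> < 1.0
```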
Recall@k
Recall@k captures completeness. Even if many relevant results exist, if they aren’t surfaced early, users may never find them. While Precision@k measures the proportion of relevant documents among the top k results, Recall@k measures the proportion of all relevant documents that appear in the top k.
Recall@k is bounded by k / total_relevant_docs. For example, if you compute Recall@5 for a query that has 10 relevant documents, the Recall@5 value cannot exceed 0.5 even if all 5 results are relevant.
- What it is: The proportion of all relevant documents that appear in the top k
- Formula:
Recall@k = (Relevant documents in top k) / (Total relevant documents)
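A Python sketch of this formula; the cap from the example above (10 relevant documents, k = 5) is visible in the output:

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Recall@k = (relevant documents in the top k) / (total relevant documents)."""
    top_k = ranked_ids[:k]
    found = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    return found / len(relevant_ids) if relevant_ids else 0.0

# 10 relevant documents exist, and all 5 of the top 5 are relevant -> capped at 0.5
relevant = {f"d{i}" for i in range(10)}
print(recall_at_k(["d0", "d1", "d2", "d3", "d4"], relevant, k=5))  # -> 0.5
```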
F1@k
F1@k balances both the accuracy (precision) and coverage (recall) of search results, giving a holistic performance score at depth k. This is a classification-specific metric.
- What it is: The harmonic mean of Precision@k and Recall@k
- Formula:
F1@k = 2 * (Precision@k * Recall@k) / (Precision@k + Recall@k)
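A Python sketch that computes Precision@k and Recall@k inline and combines them into F1@k, with hypothetical document IDs:

```python
def f1_at_k(ranked_ids, relevant_ids, k):
    """F1@k: harmonic mean of Precision@k and Recall@k (0.0 if both are 0)."""
    top_k = ranked_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    precision = hits / k
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precision@5 = 2/5, Recall@5 = 2/4 -> F1@5 ≈ 0.444
print(f1_at_k(["d1", "d2", "d3", "d4", "d5"], {"d2", "d4", "d8", "d9"}, k=5))
```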