Evaluation metrics are computed over a set with known query/document relevancy. Also known as offline metrics, search evaluation metrics measure how effective a system is at returning relevant, well-ranked results. Many of these metrics are evaluated @k, meaning they consider only the top k results returned by the search engine. Contrast them with online metrics, such as Average Order Value (AOV) and Click-Through Rate (CTR), which are measured from live user behavior.

Computing evaluation metrics

When a metric is computed @k, it measures performance using only the top k ranked results for a given query. Once the @k metric is computed for each query, the per-query scores are averaged to produce a single score for the evaluation set. Examples:
  • Precision@5 considers only the top 5 results.
  • Recall@10 checks how many relevant documents are retrieved in the top 10.
  • nDCG@3 rewards ranking quality within the top 3 results.
Default values for @k include @1, @3, @5, and @10, depending on how deeply users typically browse the result list.
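
The @k pattern is the same for every metric: truncate each query's ranked results to the top k, score the truncated list, then average the per-query scores. The following Python sketch illustrates that pattern under some simplifying assumptions that are not part of the original text: each query is represented as a pair of a ranked list of document IDs and a set of known-relevant IDs, and the placeholder metric is simply the fraction of the truncated list that is relevant.

```python
# Minimal sketch (illustrative only) of the general @k pattern:
# truncate each ranked result list to its top k entries, score each query,
# then average the per-query scores.

def average_at_k(runs, k, metric):
    """runs: list of (retrieved_ids, relevant_id_set) pairs, one per query."""
    scores = []
    for retrieved, relevant in runs:
        top_k = retrieved[:k]          # only the top k results are considered
        scores.append(metric(top_k, relevant))
    return sum(scores) / len(scores)

runs = [
    (["d1", "d7", "d3", "d9", "d2"], {"d1", "d3"}),  # query 1
    (["d5", "d4", "d8", "d6", "d0"], {"d4"}),        # query 2
]

# Placeholder metric: fraction of the truncated list that is relevant.
frac_relevant = lambda top_k, relevant: sum(d in relevant for d in top_k) / len(top_k)
print(average_at_k(runs, k=5, metric=frac_relevant))  # (0.4 + 0.2) / 2 = 0.3
```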

Evaluation Metrics

Each evaluation metric has different formulas and use cases.
| Metric | Measures | Rank-Aware | Uses Graded Relevance | Interpretable As |
| --- | --- | --- | --- | --- |
| Recall@k | Coverage of relevant documents | | | Percentage of relevant docs retrieved |
| F1@k | Balance between precision and recall | | | Unified classification score |
| Hit@k | Presence of any relevant result | | | Binary success or failure |
| MRR@k | Position of 1st relevant result | ✅ | | Early relevance ranking |
| nDCG@k | Overall ranking quality | ✅ | ✅ (optional) | Best results high in list |
| MAP@k | Aggregate relevance and ranking | ✅ | | Average quality over queries |

Use cases

| Short Description | Use Case | Recommended Metrics |
| --- | --- | --- |
| Quick success checks | Site-search, Knowledge Management, Ecommerce | Hit@k, MRR@k |
| Early relevance (fast find-ability) | Knowledge Management, Ecommerce | MRR@k, nDCG@k |
| Ranking quality | Ecommerce | nDCG@k, MRR@k, MAP@k |
| Multi-relevant answers | Ecommerce, Knowledge Management | nDCG@k, MAP@k, Recall@k |
| Classification | Classification | F1@k, Recall@k |

Precision@k

Precision@k is a straightforward metric that measures the proportion of relevant items among the top k results in a ranked list. This metric is useful for measuring overall system performance and relevance.
  • What it is: The number of relevant items in the top k results, divided by k
  • Formula: (Number of relevant items in the top k) / k
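A minimal Python sketch of Precision@k, for illustration only; it assumes `retrieved` is a ranked list of document IDs and `relevant` is the set of IDs judged relevant for the query (these names are not from the original text).

```python
# Minimal Precision@k sketch: fraction of the top k results that are relevant.

def precision_at_k(retrieved, relevant, k):
    top_k = retrieved[:k]
    hits = sum(1 for doc in top_k if doc in relevant)
    return hits / k   # divide by k, not by the number of results returned

# 2 of the top 5 results are relevant -> Precision@5 = 0.4
print(precision_at_k(["d1", "d2", "d3", "d4", "d5"], {"d1", "d4", "d9"}, k=5))
```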

Hit@k

Hit@k is a straightforward but powerful metric that mimics accuracy by checking whether the system succeeded in returning something relevant at all. This metric is useful as a quick success check and intuitive to interpret, as the answer is either 0 or 1.
  • What it is: A binary indicator of whether any relevant document is present in the top k results
  • Formula: Hit@k = 1 if any relevant document is in top k, else Hit@k = 0
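A minimal Python sketch of Hit@k, under the same illustrative assumptions (a ranked list of document IDs and a set of relevant IDs):

```python
# Minimal Hit@k sketch: 1 if any of the top k results is relevant, else 0.

def hit_at_k(retrieved, relevant, k):
    return int(any(doc in relevant for doc in retrieved[:k]))

print(hit_at_k(["d8", "d2", "d5"], {"d5"}, k=3))  # 1: d5 appears at rank 3
print(hit_at_k(["d8", "d2", "d5"], {"d5"}, k=2))  # 0: d5 is below the cutoff
```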

Mean Average Precision (MAP@k)

MAP@k evaluates how well relevant results are ranked within the top k. It combines relevance and ranking quality, rewarding systems that consistently return relevant results near the top. This metric is especially useful in ecommerce and multi-relevant-answer scenarios, where each query is expected to have many relevant documents or products.
  • What it is: The mean of average precision scores across all queries, computed at cutoff k
  • Formula (per query): Average Precision@k = Mean of precision values at each rank i ≤ k where document_i is relevant
    • Then: MAP@k = Mean of Average Precision@k over all queries
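The following Python sketch follows the per-query formula above: it averages the precision at each rank (within the top k) where a relevant document appears, then averages those per-query scores. The data layout (ranked ID lists and relevant-ID sets) is assumed for illustration, and relevance judgments are treated as binary.

```python
# Minimal MAP@k sketch with binary relevance judgments.

def average_precision_at_k(retrieved, relevant, k):
    hits, precisions = 0, []
    for i, doc in enumerate(retrieved[:k], start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / i)      # Precision@i at each relevant rank
    return sum(precisions) / len(precisions) if precisions else 0.0

def map_at_k(runs, k):
    return sum(average_precision_at_k(r, rel, k) for r, rel in runs) / len(runs)

runs = [
    (["d1", "d9", "d3", "d8", "d2"], {"d1", "d3"}),  # AP@5 = (1/1 + 2/3) / 2 ≈ 0.83
    (["d7", "d4", "d6", "d5", "d0"], {"d4"}),        # AP@5 = 1/2 = 0.5
]
print(map_at_k(runs, k=5))  # ≈ 0.67
```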

Mean Reciprocal Rank (MRR@k)

MRR@k captures how high in the search rankings a user finds a relevant result. It rewards systems that surface correct answers earlier in the result list.
  • What it is: The average reciprocal of the rank at which the first relevant document appears, up to position k
  • Formula (per query): Reciprocal Rank@k = 1 / (rank of first relevant doc) if that rank ≤ k, else 0
    • Then: MRR@k = Mean of Reciprocal Rank@k over all queries
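
A minimal Python sketch of MRR@k under the same illustrative assumptions:

```python
# Minimal MRR@k sketch: per query, take 1/rank of the first relevant result
# within the top k (0 if none appears), then average over queries.

def reciprocal_rank_at_k(retrieved, relevant, k):
    for i, doc in enumerate(retrieved[:k], start=1):
        if doc in relevant:
            return 1.0 / i
    return 0.0

def mrr_at_k(runs, k):
    return sum(reciprocal_rank_at_k(r, rel, k) for r, rel in runs) / len(runs)

runs = [
    (["d9", "d1", "d3"], {"d1", "d3"}),  # first relevant at rank 2 -> 0.5
    (["d7", "d4", "d6"], {"d6"}),        # first relevant at rank 3 -> 1/3
]
print(mrr_at_k(runs, k=3))  # (0.5 + 0.333...) / 2 ≈ 0.42
```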

Normalized Discounted Cumulative Gain (nDCG@k)

nDCG@k is order-aware and optionally weight-aware (graded-relevance). It penalizes systems that bury relevant results and rewards those that put the most relevant documents near the top. If graded relevance scores are unavailable, binary nDCG is computed (relevant vs. not relevant, where all relevant documents have a weight/grade of 1).
  • What it is: Measures both the presence and position of relevant documents in the top k, while optionally considering graded relevance (that is, some results are more relevant than others)
  • Formula (simplified): DCG@k = Σ ((2^relevance_i - 1) / log2(i + 1)) for i = 1 to k
    • Then: nDCG@k = DCG@k / IDCG@k, where IDCG@k is the DCG@k of the ideal (perfectly ordered) ranking
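
A minimal Python sketch of nDCG@k using the gain formula above. It assumes, for illustration, that `relevance` is a dict mapping document IDs to graded judgments (use 0/1 grades for the binary case).

```python
# Minimal nDCG@k sketch with graded relevance judgments.
import math

def dcg_at_k(retrieved, relevance, k):
    return sum(
        (2 ** relevance.get(doc, 0) - 1) / math.log2(i + 1)
        for i, doc in enumerate(retrieved[:k], start=1)
    )

def ndcg_at_k(retrieved, relevance, k):
    ideal = sorted(relevance, key=relevance.get, reverse=True)  # best possible order
    idcg = dcg_at_k(ideal, relevance, k)
    return dcg_at_k(retrieved, relevance, k) / idcg if idcg > 0 else 0.0

relevance = {"d1": 3, "d2": 1, "d3": 2}                 # graded judgments
print(ndcg_at_k(["d2", "d1", "d9", "d3"], relevance, k=4))  # ≈ 0.71
```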

Recall@k

Recall@k captures completeness. Even if many relevant results exist, users may never find them if they aren’t surfaced early. While Precision@k measures the proportion of relevant documents among the top k results, Recall@k measures the proportion of all relevant documents that appear within the top k. Recall@k is bounded by k / total_relevant_docs: for example, if you compute Recall@5 for a query that has 10 relevant documents, Recall@5 cannot exceed 0.5 even if all 5 results are relevant.
  • What it is: The proportion of all relevant documents that appear in the top k
  • Formula: Recall@k = (Relevant documents in top k) / (Total relevant documents)
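A minimal Python sketch of Recall@k under the same illustrative assumptions:

```python
# Minimal Recall@k sketch: fraction of *all* relevant documents that appear
# in the top k results.

def recall_at_k(retrieved, relevant, k):
    if not relevant:
        return 0.0
    found = sum(1 for doc in retrieved[:k] if doc in relevant)
    return found / len(relevant)

# 3 relevant docs in total, 2 of them in the top 5 -> Recall@5 ≈ 0.67
print(recall_at_k(["d1", "d8", "d3", "d9", "d2"], {"d1", "d3", "d7"}, k=5))
```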

F1@k

F1@k balances both the accuracy (precision) and coverage (recall) of search results, giving a holistic performance score at depth k. This is a classification-specific metric.
  • What it is: The harmonic mean of Precision@k and Recall@k
  • Formula: F1@k = 2 * (Precision@k * Recall@k) / (Precision@k + Recall@k)
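A minimal Python sketch of F1@k. The precision and recall helpers are restated here so the block is self-contained; the data layout is assumed for illustration.

```python
# Minimal F1@k sketch: harmonic mean of Precision@k and Recall@k.

def precision_at_k(retrieved, relevant, k):
    return sum(1 for doc in retrieved[:k] if doc in relevant) / k

def recall_at_k(retrieved, relevant, k):
    return sum(1 for doc in retrieved[:k] if doc in relevant) / len(relevant)

def f1_at_k(retrieved, relevant, k):
    p, r = precision_at_k(retrieved, relevant, k), recall_at_k(retrieved, relevant, k)
    return 2 * p * r / (p + r) if (p + r) > 0 else 0.0

# Precision@5 = 0.4, Recall@5 = 2/3 -> F1@5 = 0.5
print(f1_at_k(["d1", "d8", "d3", "d9", "d2"], {"d1", "d3", "d7"}, k=5))
```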