Compute Text Similarity

via LeetCode

Problem

Write code that computes a similarity score between two pieces of text, and justify the metric chosen.

Input / Output

  • Input: strings a and b.
  • Output: a similarity score in [0, 1] (or a score meant only for ranking candidates against each other).

Constraints

  • Clarify the use case first — near-duplicate detection, semantic similarity, or fuzzy matching — the metric follows from it.

Example

  • "the hotel was clean" vs "a clean hotel" → substantial Jaccard/cosine overlap after tokenization (Jaccard is 2/5 = 0.4 on raw tokens and 1.0 once stop words are removed).
  • "cheap flight" vs "inexpensive airfare" → near-zero lexical overlap; needs embeddings for a high score.
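
To make the example concrete, here is a minimal Jaccard sketch in Python. The names `tokenize`, `jaccard`, and `STOP_WORDS` are hypothetical, and the stop-word list is a tiny illustrative assumption rather than a standard one.

```python
import re

# Tiny illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"a", "an", "the", "was", "is", "of"}


def tokenize(text, drop_stop_words=False):
    """Lowercase the text and extract runs of letters/digits as tokens."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    if drop_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens


def jaccard(a, b, drop_stop_words=False):
    """|A ∩ B| / |A ∪ B| over token sets; two empty texts count as identical."""
    set_a = set(tokenize(a, drop_stop_words))
    set_b = set(tokenize(b, drop_stop_words))
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


print(jaccard("the hotel was clean", "a clean hotel"))                        # 0.4
print(jaccard("the hotel was clean", "a clean hotel", drop_stop_words=True))  # 1.0
print(jaccard("cheap flight", "inexpensive airfare"))                         # 0.0
```

The last call shows the limit of any lexical metric: the two phrases are near-synonyms, yet they share no tokens.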

Expected approach

  • Lexical route (codeable in-interview): tokenize and normalize, then use Jaccard over token sets (order-insensitive, cheap) or cosine over TF-IDF vectors (weights rare terms, the standard IR choice).
  • Character-level route: edit distance, for typos and short strings.
  • Semantic route: average word vectors or a sentence-encoder embedding, compared with cosine; this catches synonymy at the cost of running a model.
  • Walk through the trade-offs (speed, synonym handling, length bias, normalization) and implement one option cleanly, typically TF-IDF cosine or Jaccard with a tokenizer.
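
The lexical and character-level routes can be sketched with the standard library alone. This is one possible version under stated assumptions, not a reference implementation: the function names and the optional `corpus` parameter are hypothetical, the IDF uses a smoothed form log((1 + N) / (1 + df)) + 1 chosen for illustration, and when no corpus is supplied the two inputs themselves serve as the document collection, which makes the IDF weights only weakly informative.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and extract runs of letters/digits as tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_cosine(a, b, corpus=None):
    """Cosine similarity of TF-IDF vectors; IDF comes from `corpus`, or from [a, b] if omitted."""
    docs = [tokenize(d) for d in (corpus if corpus is not None else [a, b])]
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))

    def idf(term):
        # Smoothed IDF (an illustrative choice); unseen terms have df == 0.
        return math.log((1 + n_docs) / (1 + df[term])) + 1.0

    def vectorize(text):
        counts = Counter(tokenize(text))
        return {term: count * idf(term) for term, count in counts.items()}

    vec_a, vec_b = vectorize(a), vectorize(b)
    dot = sum(w * vec_b.get(term, 0.0) for term, w in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # at least one text has no tokens
    return dot / (norm_a * norm_b)


def edit_distance(s, t):
    """Levenshtein distance with a rolling row: O(len(s) * len(t)) time, O(len(t)) space."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1,                # delete cs
                            curr[j - 1] + 1,            # insert ct
                            prev[j - 1] + (cs != ct)))  # substitute or match
        prev = curr
    return prev[-1]


def edit_similarity(s, t):
    """Map edit distance into [0, 1] by dividing by the longer length."""
    longest = max(len(s), len(t))
    return 1.0 if longest == 0 else 1.0 - edit_distance(s, t) / longest


print(tfidf_cosine("the hotel was clean", "a clean hotel"))
print(tfidf_cosine("cheap flight", "inexpensive airfare"))  # 0.0, no shared terms
print(edit_distance("kitten", "sitting"))                   # 3
```

Dividing the edit distance by the longer string's length is one simple way to address length bias. The semantic route is left out here because it depends on a pretrained model, but it would reuse the same cosine step on dense vectors.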