Compute Text Similarity
via LeetCode
Problem
Write code that computes a similarity score between two pieces of text, and justify the metric chosen.
Input / Output
- Input: strings a and b.
- Output: similarity in [0, 1] (or a ranked interpretation).
Constraints
- Clarify the use case first — near-duplicate detection, semantic similarity, or fuzzy matching — the metric follows from it.
Example
- "the hotel was clean" vs "a clean hotel" → high Jaccard/cosine overlap after tokenization and normalization (the content words "clean" and "hotel" are shared).
- "cheap flight" vs "inexpensive airfare" → near-zero lexical overlap; a high score requires embeddings.
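A minimal sketch of how the two example pairs score under purely lexical metrics. The tokenizer, the tiny `STOPWORDS` set, and the function names are illustrative assumptions, not part of the original problem statement.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list used only for this illustration.
STOPWORDS = frozenset({"a", "an", "the", "was", "is"})


def tokenize(text, stopwords=frozenset()):
    """Lowercase, split on runs of letters/digits, and drop any stopwords."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in stopwords]


def jaccard(a, b, stopwords=frozenset()):
    """|A ∩ B| / |A ∪ B| over token sets; defined as 1.0 when both sets are empty."""
    sa, sb = set(tokenize(a, stopwords)), set(tokenize(b, stopwords))
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def cosine_counts(a, b, stopwords=frozenset()):
    """Cosine similarity over raw term-count vectors (no IDF weighting)."""
    ca, cb = Counter(tokenize(a, stopwords)), Counter(tokenize(b, stopwords))
    dot = sum(ca[t] * cb[t] for t in ca.keys() & cb.keys())
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


if __name__ == "__main__":
    pairs = [
        ("the hotel was clean", "a clean hotel"),
        ("cheap flight", "inexpensive airfare"),
    ]
    for a, b in pairs:
        print(a, "|", b)
        print("  jaccard (raw tokens):   ", jaccard(a, b))
        print("  jaccard (no stopwords): ", jaccard(a, b, STOPWORDS))
        print("  cosine  (no stopwords): ", cosine_counts(a, b, STOPWORDS))
```

With raw tokens the first pair shares 2 of 5 distinct tokens, so Jaccard is 0.4; once the stopwords above are removed, both texts reduce to {clean, hotel} and the score rises to 1.0. The second pair shares no tokens, so every lexical metric here returns 0.0, which is the gap that embeddings are meant to close.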
Expected approach
- Lexical route (codeable in-interview): tokenize and normalize, then compute Jaccard over token sets (order-insensitive, cheap) or cosine over TF-IDF vectors (weights rare terms, standard IR choice).
- Character-level: edit distance for typos and short strings.
- Semantic route: average word vectors or a sentence-encoder embedding, compared with cosine; catches synonymy at model cost.
- Walk through the trade-offs (speed, synonym handling, length bias, normalization) and implement one cleanly, typically TF-IDF cosine or Jaccard with a tokenizer.
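The lexical and character-level routes can be sketched in plain Python as follows. This is one possible implementation under stated assumptions: the smoothed IDF formula, the reference corpus, and all function names are illustrative choices, and IDF weights are only meaningful when a reference corpus is available. The semantic route is omitted because it depends on a pretrained model.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on runs of letters/digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_idf(corpus):
    """Return an idf(term) function using the smoothed form
    log((1 + N) / (1 + df)) + 1, so unseen terms still get a finite weight.
    (Assumption: one common smoothing variant; others exist.)"""
    n_docs = len(corpus)
    df = Counter()
    for doc in corpus:
        df.update(set(tokenize(doc)))  # count each term once per document

    def idf(term):
        return math.log((1 + n_docs) / (1 + df[term])) + 1.0

    return idf


def tfidf_vector(text, idf):
    """Sparse TF-IDF vector as a dict: raw term count times IDF weight."""
    return {t: c * idf(t) for t, c in Counter(tokenize(text)).items()}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def tfidf_cosine(a, b, idf):
    return cosine(tfidf_vector(a, idf), tfidf_vector(b, idf))


def levenshtein(a, b):
    """Edit distance via dynamic programming: O(len(a) * len(b)) time,
    O(len(b)) space by keeping only the previous row."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def edit_similarity(a, b):
    """Map edit distance into [0, 1] by dividing by the longer length."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


if __name__ == "__main__":
    # Hypothetical reference corpus, used only to estimate IDF weights.
    corpus = ["the hotel was clean", "the room was noisy",
              "the staff was friendly", "a clean hotel"]
    idf = build_idf(corpus)
    print(tfidf_cosine("the hotel was clean", "a clean hotel", idf))
    print(edit_similarity("hotel", "hotle"))
```

Dividing by the longer string's length is one way to address the length bias mentioned above for edit distance; cosine handles it through vector normalization, and Jaccard through the union in its denominator.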
asked …