Compute Text Similarity

via LeetCode

Problem

Write code that computes a similarity score between two pieces of text, and justify the metric chosen.

Input / Output

Constraints

Clarify the use case first (near-duplicate detection, semantic similarity, or fuzzy matching); the choice of metric follows from it.

Example

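"the hotel was clean" vs "a clean hotel" → token-set Jaccard of 2/5 = 0.4 on raw lowercase tokens, rising to 1.0 once the stopwords "the", "was", and "a" are dropped. "cheap flight" vs "inexpensive airfare" → zero token overlap, so token-level Jaccard and cosine both score 0; a high score needs embeddings.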
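The two example pairs can be checked with a minimal Jaccard sketch. The regex tokenizer, the tiny `STOPWORDS` set, and the function names are illustrative assumptions, not part of the original question; a real tokenizer and stopword list would differ.

```python
import re

# Tiny illustrative stopword list (an assumption, not a standard list).
STOPWORDS = frozenset({"a", "an", "the", "was", "is", "of"})

def tokenize(text, drop_stopwords=False):
    """Lowercase, split on non-alphanumerics, optionally drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    if drop_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    return tokens

def jaccard(a, b, drop_stopwords=False):
    """|A ∩ B| / |A ∪ B| over token sets; defined as 1.0 when both are empty."""
    sa = set(tokenize(a, drop_stopwords))
    sb = set(tokenize(b, drop_stopwords))
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

print(jaccard("the hotel was clean", "a clean hotel"))        # 0.4
print(jaccard("the hotel was clean", "a clean hotel", True))  # 1.0
print(jaccard("cheap flight", "inexpensive airfare"))         # 0.0
```

The jump from 0.4 to 1.0 shows how much the score depends on normalization choices, which is worth saying out loud in the interview.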

Expected approach

Lexical route (codeable in-interview): tokenize and normalize, then compute either Jaccard over token sets (order-insensitive, cheap) or cosine over TF-IDF vectors (weights rare terms more heavily; the standard IR choice).

Character-level route: edit distance, suited to typos and short strings.

Semantic route: average word vectors or a sentence-encoder embedding, compared with cosine. This catches synonymy at the cost of running a model.

Walk through the trade-offs (speed, synonym handling, length bias, normalization) and implement one option cleanly, typically TF-IDF cosine or Jaccard with a tokenizer.
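As a rough sketch of two of the codeable options, the block below implements TF-IDF cosine and Levenshtein edit distance using only the standard library. The smoothed IDF formula, the reference `corpus`, and all function names are assumptions chosen for illustration; IDF only means something relative to a document collection, so in an interview it is worth asking what corpus is available.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase and split on non-alphanumerics (a simplifying assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def doc_freq(corpus):
    """Return (number of documents, per-term document frequency)."""
    df = Counter()
    for doc in corpus:
        df.update(set(tokenize(doc)))
    return len(corpus), df

def tfidf_vector(text, n_docs, df):
    """Sparse vector: raw term count times a smoothed IDF.

    IDF here is log((1 + N) / (1 + df)) + 1, one common smoothing choice
    that keeps terms unseen in the corpus finite.
    """
    tf = Counter(tokenize(text))
    return {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t, c in tf.items()}

def cosine(u, v):
    """Cosine of two sparse dict vectors; 0.0 if either is all zeros."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def tfidf_cosine(a, b, corpus):
    n_docs, df = doc_freq(corpus)
    return cosine(tfidf_vector(a, n_docs, df), tfidf_vector(b, n_docs, df))

def levenshtein(s, t):
    """Edit distance with two rolling rows: O(len(s) * len(t)) time."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (cs != ct)))  # substitution or match
        prev = curr
    return prev[-1]

# Hypothetical reference corpus built from the example strings.
corpus = ["the hotel was clean", "a clean hotel", "cheap flight", "inexpensive airfare"]
print(tfidf_cosine(corpus[0], corpus[1], corpus))  # partial overlap, between 0 and 1
print(tfidf_cosine(corpus[2], corpus[3], corpus))  # 0.0, no shared tokens
print(levenshtein("kitten", "sitting"))            # 3
```

Cosine normalizes by vector length, which reduces the length bias that raw overlap counts suffer from. Edit distance is usually normalized too, for example by dividing by the longer string's length, before it is used as a similarity score.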
