Dropbox Document Search Engine

The startup had ~23,000 documents on Dropbox and no way to find anything. Naïve keyword search across that many heterogeneous files (PDFs, Word documents, plain text, scanned images with OCR layers) would have been slow and unhelpful: users want a result list that shows the matching passage, not just the file name.

What I built

Document preprocessing. A pipeline that parses the diverse file types into a normalised text representation, segments at the sentence level using NLP models, and indexes per-sentence so we can map a query back to the exact span that matched.
Partitioned index. Sentence-level keyword mapping with per-document partitioning, so query latency stays low even as the corpus grows.
Dropbox API integration. Real-time access to the original document for context expansion when a user clicks a result.
Highlighting algorithm. Given a query and a matched sentence, surface the minimal sub-span that explains the match. This part is documented in the linked writeup above.
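
To make the index and highlighting steps above concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the project's code: the regex sentence splitter stands in for the NLP segmentation models, the names PartitionedIndex, split_sentences, tokens and minimal_span are hypothetical, and the highlighter uses a generic shortest-covering-window approach that may differ from the algorithm described in the writeup.

```python
import re
from collections import defaultdict

# Assumption: a naive regex splitter stands in for the NLP sentence models.
_SENT_RE = re.compile(r"(?<=[.!?])\s+")
_TOKEN_RE = re.compile(r"\w+")


def split_sentences(text):
    """Split text into sentences at whitespace that follows ., ! or ?"""
    return [s for s in _SENT_RE.split(text.strip()) if s]


def tokens(text):
    """Lowercased word tokens of a string."""
    return [m.group(0).lower() for m in _TOKEN_RE.finditer(text)]


class PartitionedIndex:
    """Hypothetical per-document partitions: doc_id -> term -> sentence numbers."""

    def __init__(self):
        self.partitions = {}
        self.sentences = {}

    def add(self, doc_id, text):
        sents = split_sentences(text)
        postings = defaultdict(set)
        for i, sent in enumerate(sents):
            for term in tokens(sent):
                postings[term].add(i)
        self.partitions[doc_id] = postings
        self.sentences[doc_id] = sents

    def search(self, query):
        """Return (doc_id, sentence_no, sentence) for sentences containing every query term."""
        terms = set(tokens(query))
        hits = []
        if not terms:
            return hits
        for doc_id, postings in self.partitions.items():
            matched = set.intersection(*(postings.get(t, set()) for t in terms))
            for i in sorted(matched):
                hits.append((doc_id, i, self.sentences[doc_id][i]))
        return hits


def minimal_span(query, sentence):
    """Character offsets (start, end) of the shortest window that contains
    every query term, or None if some term is missing from the sentence."""
    terms = set(tokens(query))
    hits = [(m.group(0).lower(), m.start(), m.end())
            for m in _TOKEN_RE.finditer(sentence)
            if m.group(0).lower() in terms]
    if not terms or {w for w, _, _ in hits} != terms:
        return None
    best = None
    counts = defaultdict(int)
    covered = 0
    left = 0
    for word, _, end in hits:
        counts[word] += 1
        if counts[word] == 1:
            covered += 1
        # Shrink from the left while the window still covers all terms.
        while covered == len(terms):
            start = hits[left][1]
            if best is None or end - start < best[1] - best[0]:
                best = (start, end)
            dropped = hits[left][0]
            counts[dropped] -= 1
            if counts[dropped] == 0:
                covered -= 1
            left += 1
    return best


index = PartitionedIndex()
index.add("a.txt", "The lease starts in May. The tenant pays rent monthly. Rent is due on the first.")
index.add("b.txt", "Quarterly report attached. Revenue grew.")
results = index.search("tenant rent")
span = minimal_span("tenant rent", results[0][2])
```

Because each document keeps its own postings map, a document can be re-indexed or dropped without touching the rest of the corpus, and partitions can be searched independently. In this sketch a sentence matches only if it contains every query term; ranking, stemming and partial matches are left out.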

Hosted on AWS and written in Python. The whole thing was scoped for a one-engineer summer, which is part of why I picked the partitioned index over a heavier ML retrieval stack.

Capstone for the Computer Science & Engineering BS at Lehigh (May 2024 – July 2025).