Dropbox Document Search Engine
A full-stack search engine over ~23,000 Dropbox documents for an early-stage startup, with sentence-level highlight extraction. My Lehigh CSE capstone.
The startup had ~23,000 documents on Dropbox and no way to find anything. Naïve keyword search across that many heterogeneous files (PDFs, Word documents, plain text, scanned-image documents with OCR layers) would have been slow and unhelpful: users want a result list that shows the matching passage, not just the file name.
What I built
- Document preprocessing. A pipeline that parses the diverse file types into a normalised text representation, segments the text into sentences using NLP models, and indexes each sentence so a query can be mapped back to the exact span that matched.
- Partitioned index. Sentence-level keyword mapping with per-document partitioning, so query latency stays low even as the corpus grows.
- Dropbox API integration. Real-time access to the original document for context expansion when a user clicks a result.
- Highlighting algorithm. Given a query and a matched sentence, surface the minimal sub-span that explains the match. This part is documented in the linked writeup above.
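To make the indexing and highlighting ideas concrete, here is a minimal sketch, not the capstone code. It assumes a regex sentence splitter in place of the NLP segmenter, exact lowercase term matching, and an in-memory dictionary per document as the "partition". The names `SentenceIndex`, `add_document`, `search`, and `minimal_span` are hypothetical, and the highlighter uses a standard minimum-window scan as one plausible reading of "minimal sub-span".

```python
import re
from collections import defaultdict

# Assumption: a naive regex splitter stands in for the NLP sentence segmenter.
SENTENCE_RE = re.compile(r"(?<=[.!?])\s+")
TOKEN_RE = re.compile(r"\w+")


def tokenize(text):
    """Lowercased word tokens."""
    return [t.lower() for t in TOKEN_RE.findall(text)]


class SentenceIndex:
    """Hypothetical sentence-level inverted index, partitioned per document."""

    def __init__(self):
        # doc_id -> term -> set of sentence numbers containing that term
        self.partitions = defaultdict(lambda: defaultdict(set))
        # doc_id -> list of sentences, so a hit maps back to its text
        self.sentences = {}

    def add_document(self, doc_id, text):
        sents = [s.strip() for s in SENTENCE_RE.split(text) if s.strip()]
        self.sentences[doc_id] = sents
        for number, sentence in enumerate(sents):
            for term in tokenize(sentence):
                self.partitions[doc_id][term].add(number)

    def search(self, query):
        """Return (doc_id, sentence_number, sentence) for sentences containing all terms."""
        terms = tokenize(query)
        if not terms:
            return []
        hits = []
        for doc_id, postings in self.partitions.items():
            # Each partition is searched on its own; intersect the posting sets.
            matching = set.intersection(*(postings.get(t, set()) for t in terms))
            for number in sorted(matching):
                hits.append((doc_id, number, self.sentences[doc_id][number]))
        return hits


def minimal_span(sentence, query):
    """Smallest (start, end) character span covering every query term, or None."""
    needed = set(tokenize(query))
    tokens = [(m.group().lower(), m.start(), m.end())
              for m in TOKEN_RE.finditer(sentence)]
    counts = defaultdict(int)
    covered = 0
    best = None
    left = 0
    for term, _, end in tokens:
        if term in needed:
            counts[term] += 1
            if counts[term] == 1:
                covered += 1
        # Shrink the window from the left while it still covers every term.
        while needed and covered == len(needed):
            start = tokens[left][1]
            if best is None or end - start < best[1] - best[0]:
                best = (start, end)
            left_term = tokens[left][0]
            if left_term in needed:
                counts[left_term] -= 1
                if counts[left_term] == 0:
                    covered -= 1
            left += 1
    return best
```

In this sketch, `search` returns the matched sentence itself, and `minimal_span` then gives the character offsets to highlight inside it. Keeping one posting map per document means a document can be added or re-indexed without touching the others, which is the property the partitioning is meant to illustrate here.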
Hosted on AWS and written in Python. The whole thing was scoped for a one-engineer summer, which is part of why I picked the partitioned index over a heavier ML retrieval stack.
Capstone for the Computer Science & Engineering BS at Lehigh (May 2024 – July 2025).