Project Overview
Objective
Built query-image-driven retrieval across both image and video collections.
Stack
CLIP (ViT-B/32) · FAISS · BLIP · FastAPI · React · Axios
Delivery highlights
- Built an end-to-end visual retrieval system that accepts a query image and performs cross-media similarity search across both images and videos.
- Encoded the uploaded image into a semantic embedding with CLIP (ViT-B/32) in PyTorch and compared it against pre-indexed image files and extracted video keyframes stored in a FAISS inner-product index.
- For videos, aggregated matched frame timestamps, merged temporally adjacent segments into consolidated intervals (start_time, end_time), and ranked results by similarity score.
- Integrated BLIP to automatically generate a descriptive caption for the query image, improving interpretability and contextual understanding.
- Implemented the backend with FastAPI, exposing endpoints for index construction and search execution that return structured JSON responses containing media paths, similarity scores, and timestamp ranges.
- Developed the frontend with React and Axios, supporting configurable search parameters (top_k, similarity threshold, caption length) and interactive video playback with direct navigation to the detected relevant segments.
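The core retrieval step can be sketched as follows. This is a minimal illustration, not the system's actual code: NumPy stands in for the FAISS inner-product index (with L2-normalized vectors, inner product equals cosine similarity, which is why normalization matters here), and the 4-dimensional toy vectors stand in for 512-dimensional CLIP embeddings. All names and dimensions are illustrative.

```python
import numpy as np

def normalize(v: np.ndarray) -> np.ndarray:
    """L2-normalize rows so that inner product equals cosine similarity."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def search(query: np.ndarray, gallery: np.ndarray, top_k: int = 3):
    """Return (scores, indices) of the top_k gallery rows most similar to query."""
    scores = normalize(gallery) @ normalize(query)   # one inner product per row
    order = np.argsort(-scores)[:top_k]              # highest similarity first
    return scores[order], order

# Toy 4-dim "embeddings" standing in for CLIP image/keyframe vectors.
rng = np.random.default_rng(0)
gallery = rng.normal(size=(10, 4)).astype("float32")
# A query that is a near-duplicate of gallery item 3.
query = gallery[3] + 0.01 * rng.normal(size=4).astype("float32")

scores, ids = search(query, gallery, top_k=3)
print(ids[0])  # → 3 (the near-duplicate ranks first)
```

In the real system, the gallery rows would be CLIP embeddings of indexed images and extracted video keyframes, and the FAISS index would replace the brute-force matrix product.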
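The merging of temporally adjacent keyframe hits into consolidated (start_time, end_time) intervals can be illustrated with a short sketch; the gap threshold, frame duration, and function name are assumptions, not values from the project.

```python
def merge_timestamps(hits, gap=2.0, frame_len=1.0):
    """Collapse matched keyframe timestamps (seconds) into consolidated
    (start_time, end_time) intervals, merging hits separated by <= gap."""
    intervals = []
    for t in sorted(hits):
        if intervals and t - intervals[-1][1] <= gap:
            intervals[-1][1] = t + frame_len          # extend the open segment
        else:
            intervals.append([t, t + frame_len])      # start a new segment
    return [tuple(seg) for seg in intervals]

print(merge_timestamps([3.0, 4.0, 5.0, 30.0, 31.5]))
# → [(3.0, 6.0), (30.0, 32.5)]
```

The hits at 3–5 s collapse into one interval, while the distant cluster around 30 s forms a second interval, which is what allows the player to jump directly to each relevant segment.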
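The structured JSON response per hit might be shaped like the sketch below. The exact field names (media_path, score, segments, and so on) are hypothetical; the source only states that responses contain media paths, similarity scores, and timestamp ranges.

```python
def build_hit(media_path, score, intervals=None, caption=None):
    """Shape one search hit into the JSON-serializable structure the API
    could return. Image hits omit `segments`; video hits carry merged
    (start_time, end_time) ranges. Field names are illustrative."""
    hit = {"media_path": media_path, "score": round(float(score), 4)}
    if caption is not None:
        hit["query_caption"] = caption            # BLIP caption of the query image
    if intervals:
        hit["segments"] = [
            {"start_time": s, "end_time": e} for s, e in intervals
        ]
    return hit

print(build_hit("videos/demo.mp4", 0.8123, intervals=[(3.0, 6.0)]))
```

A FastAPI endpoint would return a ranked list of such dicts, which the React client renders with per-segment seek controls.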