Project Overview
Objective
Built query-image-driven retrieval across both image and video collections.
Stack
CLIP (ViT-B/32) · FAISS · BLIP · FastAPI · React · Axios
Delivery highlights
- Built an end-to-end visual retrieval system that accepts a query image and performs cross-media similarity search across both images and videos.
- Encoded the uploaded image into a semantic embedding with CLIP (ViT-B/32) in PyTorch and compared it against pre-indexed image files and extracted video keyframes stored in a FAISS inner-product index.
- For videos, aggregated matched frame timestamps, merged temporally adjacent segments into consolidated intervals (start_time, end_time), and ranked results by similarity score.
- Integrated BLIP to automatically generate a descriptive caption for the query image, improving interpretability and contextual understanding.
- Implemented the backend with FastAPI, exposing endpoints for index construction and search execution that return structured JSON responses containing media paths, similarity scores, and timestamp ranges.
- Developed the frontend with React and Axios, supporting configurable search parameters (top_k, similarity threshold, caption length) and interactive video playback with direct navigation to the detected relevant segments.
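The core retrieval step can be sketched as follows. This is a minimal illustration, not the system's actual code: NumPy stands in for the FAISS inner-product index (with L2-normalized vectors, inner product equals cosine similarity, which is why normalization matters here), and the 4-dimensional toy vectors stand in for 512-dimensional CLIP embeddings. All names and dimensions are illustrative.

```python
import numpy as np

def normalize(v: np.ndarray) -> np.ndarray:
    """L2-normalize rows so that inner product equals cosine similarity."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def search(query: np.ndarray, gallery: np.ndarray, top_k: int = 3):
    """Return (scores, indices) of the top_k gallery rows most similar to query."""
    scores = normalize(gallery) @ normalize(query)   # one inner product per row
    order = np.argsort(-scores)[:top_k]              # highest similarity first
    return scores[order], order

# Toy 4-dim "embeddings" standing in for CLIP image/keyframe vectors.
rng = np.random.default_rng(0)
gallery = rng.normal(size=(10, 4)).astype("float32")
# A query that is a near-duplicate of gallery item 3.
query = gallery[3] + 0.01 * rng.normal(size=4).astype("float32")

scores, ids = search(query, gallery, top_k=3)
print(ids[0])  # → 3 (the near-duplicate ranks first)
```

In the real system, the gallery rows would be CLIP embeddings of indexed images and extracted video keyframes, and the FAISS index would replace the brute-force matrix product.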
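The merging of temporally adjacent keyframe hits into consolidated (start_time, end_time) intervals can be illustrated with a short sketch; the gap threshold, frame duration, and function name are assumptions, not values from the project.

```python
def merge_timestamps(hits, gap=2.0, frame_len=1.0):
    """Collapse matched keyframe timestamps (seconds) into consolidated
    (start_time, end_time) intervals, merging hits separated by <= gap."""
    intervals = []
    for t in sorted(hits):
        if intervals and t - intervals[-1][1] <= gap:
            intervals[-1][1] = t + frame_len          # extend the open segment
        else:
            intervals.append([t, t + frame_len])      # start a new segment
    return [tuple(seg) for seg in intervals]

print(merge_timestamps([3.0, 4.0, 5.0, 30.0, 31.5]))
# → [(3.0, 6.0), (30.0, 32.5)]
```

The hits at 3–5 s collapse into one interval, while the distant cluster around 30 s forms a second interval, which is what allows the player to jump directly to each relevant segment.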
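The structured JSON response per hit might be shaped like the sketch below. The exact field names (media_path, score, segments, and so on) are hypothetical; the source only states that responses contain media paths, similarity scores, and timestamp ranges.

```python
def build_hit(media_path, score, intervals=None, caption=None):
    """Shape one search hit into the JSON-serializable structure the API
    could return. Image hits omit `segments`; video hits carry merged
    (start_time, end_time) ranges. Field names are illustrative."""
    hit = {"media_path": media_path, "score": round(float(score), 4)}
    if caption is not None:
        hit["query_caption"] = caption            # BLIP caption of the query image
    if intervals:
        hit["segments"] = [
            {"start_time": s, "end_time": e} for s, e in intervals
        ]
    return hit

print(build_hit("videos/demo.mp4", 0.8123, intervals=[(3.0, 6.0)]))
```

A FastAPI endpoint would return a ranked list of such dicts, which the React client renders with per-segment seek controls.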