Multimodal Semantic Retrieval (Video and Image Search)

Unified text-to-video and text-to-image search into one cross-modal retrieval platform. This project walks through the work from architecture and implementation to delivery outcomes.

Personal Projects · Year 2026

Project Overview

Objective

Unified text-to-video and text-to-image search into one cross-modal retrieval platform.

Stack

CLIP (ViT-B/32) · FAISS · FastAPI · React.js · Tailwind CSS

Delivery highlights

  • Extended and integrated two previous projects (Text-to-Video Semantic Search and Text-to-Image Semantic Search) into a unified multimodal semantic retrieval platform that searches across both videos and images using natural-language queries.
  • Leveraged OpenAI's CLIP (ViT-B/32) to generate shared embeddings for text, video keyframes, and images within the same vector space, enabling cross-modal semantic similarity search with FAISS and accurate timestamp alignment for video playback.
  • Integrated a language-translation preprocessing step (Thai → English) to improve embedding alignment and retrieval accuracy, as CLIP performs more effectively on English text inputs.
  • Developed RESTful APIs using FastAPI that return structured JSON responses containing media_id, timestamp (for videos), similarity score, and media URL, and built a responsive frontend using React.js and Tailwind CSS for real-time result visualization.
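The retrieval flow above can be sketched as follows. This is a minimal illustration only: random unit vectors stand in for CLIP (ViT-B/32) embeddings, a brute-force inner-product scan stands in for a FAISS `IndexFlatIP`, and names like `media_items` and the URLs are hypothetical, not taken from the project.

```python
import numpy as np

rng = np.random.default_rng(42)
DIM = 512  # CLIP ViT-B/32 embedding dimension

def normalize(v: np.ndarray) -> np.ndarray:
    """L2-normalize so that inner product equals cosine similarity."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# Indexed media: video keyframes carry a playback timestamp, images do not.
# (Illustrative records; the real index would hold one entry per keyframe/image.)
media_items = [
    {"media_id": "vid-001", "timestamp": 12.5, "media_url": "/media/vid-001.mp4"},
    {"media_id": "vid-001", "timestamp": 47.0, "media_url": "/media/vid-001.mp4"},
    {"media_id": "img-007", "timestamp": None, "media_url": "/media/img-007.jpg"},
]

# Stand-in for CLIP image/keyframe embeddings living in the shared vector space.
index_vectors = normalize(rng.standard_normal((len(media_items), DIM)))

def search(query_vector: np.ndarray, k: int = 2) -> list[dict]:
    """Cosine-similarity top-k over the shared space; shaped like the API's JSON."""
    scores = index_vectors @ normalize(query_vector)
    top = np.argsort(-scores)[:k]  # highest similarity first
    return [
        {**media_items[i], "score": float(scores[i])}
        for i in top
    ]

# Stand-in for a CLIP text embedding of the (translated) query.
results = search(rng.standard_normal(DIM))
print(results)
```

Each result dict mirrors the structured JSON the FastAPI layer returns, so a video hit carries the timestamp needed to seek playback while an image hit leaves it null.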