Project Overview
Objective
Built a multimodal chat system that uses vector embeddings and cross-modal retrieval to provide semantic search across text and images, with multilingual support.
Stack
FastAPI, Next.js, PostgreSQL, Qdrant, SentenceTransformers, CLIP, BLIP, GoogleTranslator, Typhoon OCR
Delivery highlights
- Developed a multimodal semantic search chat system enabling cross-modal retrieval across text and images, built on a dual-database architecture: PostgreSQL for structured chat data and Qdrant for vector similarity search.
- Generated multilingual text embeddings with SentenceTransformers and unified image–text representations with CLIP; enriched image understanding with BLIP-based captioning and Typhoon OCR for text extraction, paired with translation.
- Designed a hybrid search pipeline that combines semantic similarity scores from the text and image modalities with ranking and filtering to improve retrieval relevance and accuracy.
- Built scalable backend services with FastAPI and integrated a Next.js frontend to support real-time chat interaction and efficient semantic search.
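The hybrid ranking step described above can be sketched as a weighted fusion of per-modality similarities. The snippet below is a minimal, self-contained illustration, not the project's actual code: the `hybrid_search` helper, the 0.6/0.4 weights, and the toy 3-dimensional vectors (standing in for real SentenceTransformers/CLIP embeddings) are all assumptions made for the example.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def hybrid_search(query_vec, docs, text_weight=0.6, image_weight=0.4, top_k=2):
    """Score each document by a weighted sum of its text-embedding and
    image-embedding similarity to the query, then return the top_k ids.
    Documents with no image embedding contribute 0 for that modality."""
    scored = []
    for doc in docs:
        text_sim = cosine(query_vec, doc["text_vec"])
        image_sim = cosine(query_vec, doc["image_vec"]) if doc.get("image_vec") else 0.0
        scored.append((doc["id"], text_weight * text_sim + image_weight * image_sim))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [doc_id for doc_id, _ in scored[:top_k]]

# Toy 3-d embeddings standing in for SentenceTransformers/CLIP vectors.
docs = [
    {"id": "msg-1", "text_vec": [1.0, 0.0, 0.0], "image_vec": [0.9, 0.1, 0.0]},
    {"id": "msg-2", "text_vec": [0.0, 1.0, 0.0], "image_vec": None},
    {"id": "msg-3", "text_vec": [0.7, 0.7, 0.0], "image_vec": [0.6, 0.8, 0.0]},
]
print(hybrid_search([1.0, 0.0, 0.0], docs))  # → ['msg-1', 'msg-3']
```

In the real system the candidate vectors would come from Qdrant similarity queries rather than an in-memory list, and filtering (e.g. by chat or language metadata) would be applied before the fused ranking.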