Project Overview
Objective
Built end-to-end VQA platform for image upload, scene understanding, and LLM-based answers.
Stack
FastAPI, React, YOLO, CLIP, ViT, BLIP, GPT-4o-mini, GPT-4.1, GPT-5
Delivery highlights
- Developed an end-to-end Visual Question Answering (VQA) system that lets users upload an image and ask natural-language questions about the scene.
- Combined YOLO for object detection, CLIP for image–text similarity reasoning, Vision Transformer (ViT) for image classification, and BLIP for automatic image captioning, then fed the extracted visual information to a selectable Large Language Model (LLM) to generate context-aware answers.
- Supported multiple LLM options (GPT-4o-mini, GPT-4.1, and GPT-5) so users can compare responses across models.
- Built the backend with FastAPI for model inference and API services, and an interactive React web interface where users can upload images, select the LLM, visualize detected objects and bounding boxes, and receive AI-generated explanations of the image content.
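The fusion step described above, turning detector and captioner output into an LLM prompt, can be sketched as follows. This is a minimal illustration, not the project's actual code: `Detection` and `build_vqa_prompt` are hypothetical names, and the caption and boxes stand in for real BLIP/YOLO outputs.

```python
from dataclasses import dataclass

@dataclass
class Detection:
    label: str           # object class from the detector (e.g. a YOLO class name)
    confidence: float    # detector confidence score in [0, 1]
    box: tuple           # (x1, y1, x2, y2) bounding box in pixels

def build_vqa_prompt(caption: str, detections: list, question: str) -> str:
    """Merge a caption and detected objects into one context-aware LLM prompt."""
    lines = [f"- {d.label} (conf {d.confidence:.2f}) at {d.box}" for d in detections]
    objects = "\n".join(lines) if lines else "- none detected"
    return (
        "You are answering a question about an image.\n"
        f"Caption: {caption}\n"
        f"Detected objects:\n{objects}\n"
        f"Question: {question}\n"
        "Answer using only the visual evidence above."
    )

# Example with stub visual outputs in place of real model inference:
prompt = build_vqa_prompt(
    caption="a dog playing with a ball in a park",
    detections=[Detection("dog", 0.91, (34, 50, 220, 310)),
                Detection("sports ball", 0.78, (240, 280, 300, 340))],
    question="What animal is in the picture?",
)
print(prompt)
```

In the deployed system, the assembled prompt would be sent to whichever LLM the user selected, so all models answer from identical visual context and their responses are directly comparable.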