Build an AI Image Caption Generator with Gemini Vision and Next.js
Learn how to build an AI-powered image caption generator using Google Gemini Vision API and Next.js. Upload any image and get an instant AI-generated description.
Ever wanted to add AI image understanding to your web app? Google's Gemini Vision model makes it surprisingly easy. In this tutorial, we'll build a simple image caption generator — upload an image, and Gemini Vision describes what's in it.
This is a great starting point if you want to explore multimodal AI (AI that handles both text and images) without getting lost in complicated setups.
What We're Building
A Next.js app where users can:
- Upload any image from their device
- Hit a button to analyze it
- Get an AI-generated caption or description back
We'll use the Google Generative AI SDK (which gives access to Gemini Vision), a Next.js API route for the backend call, and Tailwind CSS for the UI.
Prerequisites
- Next.js 14+ project set up
- A Google AI Studio account (free)
- A Gemini API key
Getting an API key takes two minutes — go to Google AI Studio, create a new key, and copy it. No billing setup needed for basic usage.
Step 1: Install the Google Generative AI SDK
npm install @google/generative-ai
That's the only package we need. It works with both Gemini text and vision models.
Step 2: Add the API Key to Environment Variables
Create or update .env.local:
GEMINI_API_KEY=your_api_key_here
Never put this in client-side code. The API route will handle it server-side.
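A missing or blank key tends to surface later as an opaque SDK error. One option is to fail fast with a small guard. The `requireEnv` function below is a hypothetical helper, not part of Next.js or the Google SDK; it takes the environment object as a parameter so it can be exercised without touching real variables.

```typescript
// Hypothetical helper: read a required environment variable and fail with a
// clear message if it is missing or blank.
function requireEnv(
  name: string,
  env: Record<string, string | undefined>,
): string {
  const value = env[name]?.trim();
  if (!value) {
    throw new Error(`Missing required environment variable: ${name}`);
  }
  return value;
}
```

In a server-side file this could be called as `requireEnv('GEMINI_API_KEY', process.env)`. Next.js only exposes variables prefixed with `NEXT_PUBLIC_` to the browser, so the key stays on the server as long as it keeps its unprefixed name.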
Step 3: Create the API Route
Create app/api/caption/route.ts:
import { GoogleGenerativeAI } from '@google/generative-ai';
import { NextRequest, NextResponse } from 'next/server';

const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY!);

export async function POST(req: NextRequest) {
  try {
    const { imageBase64, mimeType } = await req.json();

    if (!imageBase64 || !mimeType) {
      return NextResponse.json({ error: 'Image data is required' }, { status: 400 });
    }

    const model = genAI.getGenerativeModel({ model: 'gemini-1.5-flash' });

    const result = await model.generateContent([
      {
        inlineData: {
          data: imageBase64,
          mimeType,
        },
      },
      'Describe this image in 2-3 clear sentences. Focus on the main subject and any important details.',
    ]);

    const caption = result.response.text();
    return NextResponse.json({ caption });
  } catch (error) {
    console.error('Gemini Vision error:', error);
    return NextResponse.json({ error: 'Failed to generate caption' }, { status: 500 });
  }
}
We send the image as a base64 string along with a simple prompt. Gemini handles the rest.
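The route as written only checks that both fields are present, so it trusts whatever the client sends. A minimal sketch of stricter validation follows. The `validateCaptionRequest` helper, the allow-list of MIME types (the three formats the upload UI in Step 4 names; the Gemini documentation has the full list), and the 4 MB cap are all assumptions for illustration, not part of the SDK.

```typescript
// Assumed allow-list; check the Gemini docs for the current supported types.
const ALLOWED_MIME_TYPES = ['image/png', 'image/jpeg', 'image/webp'];

// Assumed cap on the decoded image size (4 MB).
const MAX_IMAGE_BYTES = 4 * 1024 * 1024;

type CaptionRequest = { imageBase64: string; mimeType: string };
type ValidationResult =
  | { ok: true; value: CaptionRequest }
  | { ok: false; error: string };

// Compute the decoded byte length of padded base64 without decoding it:
// every 4 characters carry 3 bytes, minus one byte per '=' of padding.
function decodedSize(base64: string): number {
  const padding = base64.endsWith('==') ? 2 : base64.endsWith('=') ? 1 : 0;
  return (base64.length / 4) * 3 - padding;
}

function validateCaptionRequest(body: unknown): ValidationResult {
  if (typeof body !== 'object' || body === null) {
    return { ok: false, error: 'Request body must be a JSON object' };
  }
  const { imageBase64, mimeType } = body as Record<string, unknown>;
  if (typeof imageBase64 !== 'string' || typeof mimeType !== 'string') {
    return { ok: false, error: 'Image data is required' };
  }
  if (!ALLOWED_MIME_TYPES.includes(mimeType)) {
    return { ok: false, error: `Unsupported image type: ${mimeType}` };
  }
  const looksLikeBase64 =
    imageBase64.length > 0 &&
    imageBase64.length % 4 === 0 &&
    /^[A-Za-z0-9+/]+={0,2}$/.test(imageBase64);
  if (!looksLikeBase64) {
    return { ok: false, error: 'Image data is not valid base64' };
  }
  if (decodedSize(imageBase64) > MAX_IMAGE_BYTES) {
    return { ok: false, error: 'Image exceeds the 4 MB limit' };
  }
  return { ok: true, value: { imageBase64, mimeType } };
}
```

Inside the route, the result of `validateCaptionRequest(await req.json())` could replace the presence check: return a 400 with `error` when `ok` is false, and pass `value` to Gemini otherwise.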
Step 4: Build the Upload UI
Create app/caption/page.tsx:
'use client';

import { useState } from 'react';

export default function CaptionPage() {
  const [preview, setPreview] = useState<string | null>(null);
  const [caption, setCaption] = useState('');
  const [loading, setLoading] = useState(false);
  const [imageData, setImageData] = useState<{ base64: string; mimeType: string } | null>(null);

  function handleFileChange(e: React.ChangeEvent<HTMLInputElement>) {
    const file = e.target.files?.[0];
    if (!file) return;

    const reader = new FileReader();
    reader.onload = () => {
      const result = reader.result as string;
      const base64 = result.split(',')[1];
      setPreview(result);
      setImageData({ base64, mimeType: file.type });
      setCaption('');
    };
    reader.readAsDataURL(file);
  }

  async function generateCaption() {
    if (!imageData) return;
    setLoading(true);
    setCaption('');

    try {
      const res = await fetch('/api/caption', {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({
          imageBase64: imageData.base64,
          mimeType: imageData.mimeType,
        }),
      });
      const data = await res.json();
      setCaption(data.caption || data.error);
    } catch {
      setCaption('Something went wrong. Please try again.');
    } finally {
      setLoading(false);
    }
  }

  return (
    <div className="min-h-screen bg-gray-50 flex items-center justify-center p-6">
      <div className="bg-white rounded-2xl shadow-md w-full max-w-lg p-8">
        <h1 className="text-2xl font-bold text-gray-800 mb-2">AI Image Caption Generator</h1>
        <p className="text-gray-500 text-sm mb-6">Upload an image and let Gemini Vision describe it.</p>

        <label className="flex flex-col items-center justify-center w-full h-48 border-2 border-dashed border-gray-300 rounded-xl cursor-pointer hover:border-blue-400 transition-colors">
          {preview ? (
            <img src={preview} alt="Preview" className="h-full w-full object-contain rounded-xl" />
          ) : (
            <div className="text-center">
              <p className="text-gray-400 text-sm">Click to upload an image</p>
              <p className="text-gray-300 text-xs mt-1">PNG, JPG, WEBP supported</p>
            </div>
          )}
          <input type="file" accept="image/*" onChange={handleFileChange} className="hidden" />
        </label>

        <button
          onClick={generateCaption}
          disabled={!imageData || loading}
          className="mt-5 w-full bg-blue-600 hover:bg-blue-700 disabled:bg-blue-300 text-white font-medium py-3 rounded-xl transition-colors"
        >
          {loading ? 'Analyzing...' : 'Generate Caption'}
        </button>

        {caption && (
          <div className="mt-5 p-4 bg-gray-50 rounded-xl border border-gray-200">
            <p className="text-xs font-semibold text-gray-400 uppercase tracking-wide mb-2">Caption</p>
            <p className="text-gray-700 text-sm leading-relaxed">{caption}</p>
          </div>
        )}
      </div>
    </div>
  );
}
How It Works
- User picks an image, and the browser reads it as a base64 string using FileReader
- That base64 data gets sent to our API route via a POST request
- API route passes it to Gemini Vision with a prompt
- Gemini returns a text description, which we show on screen
The whole round trip usually takes 1–3 seconds depending on image size.
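The base64 extraction in the first step depends on the shape of the string that `FileReader.readAsDataURL` produces: `data:<mime type>;base64,<payload>`. The page simply splits on the comma. A slightly more defensive version, sketched here with a hypothetical `parseDataUrl` helper, returns `null` instead of `undefined` fragments when the input does not have that shape.

```typescript
type ParsedDataUrl = { mimeType: string; base64: string };

// Split a base64 data URL of the form "data:<mime>;base64,<payload>".
// Returns null when the prefix, the ";base64," marker, the MIME type,
// or the payload is missing.
function parseDataUrl(dataUrl: string): ParsedDataUrl | null {
  const prefix = 'data:';
  const marker = ';base64,';
  const markerIndex = dataUrl.indexOf(marker);
  if (!dataUrl.startsWith(prefix) || markerIndex === -1) return null;

  const mimeType = dataUrl.slice(prefix.length, markerIndex);
  const base64 = dataUrl.slice(markerIndex + marker.length);
  if (!mimeType || !base64) return null;

  return { mimeType, base64 };
}

// parseDataUrl('data:image/png;base64,aGVsbG8=')
//   returns { mimeType: 'image/png', base64: 'aGVsbG8=' }
```

Taking the MIME type from the data URL is one design choice; the page above reads it from `file.type` instead, and either could be used.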
Tips Before Going Live
Limit file size on the client side. Large images slow down the request and eat into your API quota. Add a size check before reading the file:
if (file.size > 4 * 1024 * 1024) {
  alert('Image must be under 4MB');
  return;
}
Customize the prompt. The prompt we're using is generic. You can change it to fit your use case: product descriptions for an e-commerce app, alt text generation for accessibility tools, or photo tagging for a gallery app.
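One way to organize this, as a rough sketch, is a small map of prompt presets keyed by use case. The `UseCase` names and the prompt wording below are illustrative assumptions and have not been tuned against the model; only the `caption` entry comes from the route above.

```typescript
// Hypothetical use cases based on the examples mentioned above.
type UseCase = 'caption' | 'product' | 'altText' | 'tags';

const PROMPTS: Record<UseCase, string> = {
  // The prompt used in the API route.
  caption:
    'Describe this image in 2-3 clear sentences. Focus on the main subject and any important details.',
  // Illustrative wording only.
  product:
    'Write a short product description for the item in this image. Mention only materials, colors, and features that are visible.',
  altText:
    'Write alt text for this image in one sentence of at most 125 characters. Do not start with "Image of".',
  tags:
    'List 5 to 10 short, lowercase tags for this image, separated by commas.',
};

// Look up the prompt for a use case; the union type rules out unknown keys
// at compile time.
function promptFor(useCase: UseCase): string {
  return PROMPTS[useCase];
}
```

The route could then accept an optional `useCase` field in the request body and pass `promptFor(useCase)` to `generateContent` in place of the hard-coded string, after checking that the value is one of the known keys.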
Rate limiting. If this is public-facing, add rate limiting on the API route. A free-tier Gemini key is limited to a fixed number of requests per minute. Upstash Redis pairs well with Next.js for this.
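To show the idea without any external service, here is a minimal in-memory fixed-window limiter. It is a sketch under clear assumptions: `createRateLimiter` is a hypothetical helper, state lives in a single process (so it resets on restart and is not shared across serverless instances), and entries are never evicted. A shared store such as Redis would be needed for a real deployment.

```typescript
type WindowState = { start: number; count: number };

// Create a limiter that allows at most `limit` requests per `windowMs`
// milliseconds for each key. `now` is injectable so the clock can be faked.
function createRateLimiter(
  limit: number,
  windowMs: number,
  now: () => number = Date.now,
) {
  const windows = new Map<string, WindowState>();

  // Returns true if the request identified by `key` is allowed.
  return function allow(key: string): boolean {
    const t = now();
    let current = windows.get(key);

    // Start a fresh window on first use or once the old one has expired.
    if (!current || t - current.start >= windowMs) {
      current = { start: t, count: 0 };
      windows.set(key, current);
    }

    if (current.count >= limit) return false;
    current.count += 1;
    return true;
  };
}
```

In the API route, the key could be the caller's IP address or a session identifier, with a 429 response returned whenever `allow(key)` is false.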
Wrapping Up
Gemini Vision is one of the easiest ways to add image AI to a web app. The SDK is clean, the API is fast, and the free tier is generous enough to build and test without worrying about costs.
From here, you could extend this to generate alt text for images automatically, build a product photo tagger, or hook it into a CMS to auto-describe uploaded media.
If you're looking to go deeper, check out our post on building a chat interface with Vertex AI and Next.js — the same pattern applies, just with text instead of images.
Have questions or built something cool with this? Drop a comment below.