Learn how to build an AI-powered image caption generator using the Google Gemini Vision API and Next.js. Upload any image and get an instant AI-generated description.

Build an AI Image Caption Generator with Gemini Vision and Next.js

Ever wanted to add AI image understanding to your web app? Google's Gemini Vision model makes it surprisingly easy. In this tutorial, we'll build a simple image caption generator — upload an image, and Gemini Vision describes what's in it.

This is a great starting point if you want to explore multimodal AI (AI that handles both text and images) without getting lost in complicated setups.

What We're Building

A Next.js app where users can:

Upload any image from their device
Hit a button to analyze it
Get an AI-generated caption or description back

We'll use the Google Generative AI SDK (which gives access to Gemini Vision), a Next.js API route for the backend call, and Tailwind CSS for the UI.

Prerequisites

Next.js 14+ project set up
A Google AI Studio account (free)
A Gemini API key

Getting an API key takes two minutes — go to Google AI Studio, create a new key, and copy it. No billing setup needed for basic usage.

Step 1: Install the Google Generative AI SDK

npm install @google/generative-ai

That's the only package we need. It works with both Gemini text and vision models.

Step 2: Add the API Key to Environment Variables

Create or update .env.local:

GEMINI_API_KEY=your_api_key_here

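Never put this in client-side code, and do not prefix the variable with NEXT_PUBLIC_, because Next.js inlines those variables into the browser bundle. The API route will read the key server-side.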
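
The route in Step 3 reads the key as process.env.GEMINI_API_KEY!, and that trailing ! only silences TypeScript. If the variable is missing, the failure shows up later as a vague API error. A minimal sketch of a fail-fast check could look like this; requireEnv is a hypothetical helper name, not something Next.js or the Gemini SDK provides.

```typescript
// Hypothetical helper (not part of Next.js or the Gemini SDK).
// Takes the environment as a parameter so it can be tested without touching process.env.
type Env = Record<string, string | undefined>;

function requireEnv(env: Env, name: string): string {
  const value = env[name]?.trim();
  if (!value) {
    // Fail early with a message that points at the likely fix.
    throw new Error(`Missing environment variable: ${name}. Add it to .env.local.`);
  }
  return value;
}
```

In the route you would then pass requireEnv(process.env, 'GEMINI_API_KEY') to the GoogleGenerativeAI constructor instead of using the non-null assertion.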

Step 3: Create the API Route

Create app/api/caption/route.ts:

import { GoogleGenerativeAI } from '@google/generative-ai';
import { NextRequest, NextResponse } from 'next/server';

// Create the client once per server instance. The key is only ever read server-side.
const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY!);

export async function POST(req: NextRequest) {
  try {
    // The client sends the raw base64 payload plus the file's MIME type.
    const { imageBase64, mimeType } = await req.json();

    if (!imageBase64 || !mimeType) {
      return NextResponse.json({ error: 'Image data is required' }, { status: 400 });
    }

    // Model name as used in this tutorial; confirm it is still available for your key.
    const model = genAI.getGenerativeModel({ model: 'gemini-1.5-flash' });

    // A multimodal request: the image as inline data, followed by the text prompt.
    const result = await model.generateContent([
      {
        inlineData: {
          data: imageBase64,
          mimeType,
        },
      },
      'Describe this image in 2-3 clear sentences. Focus on the main subject and any important details.',
    ]);

    const caption = result.response.text();
    return NextResponse.json({ caption });
  } catch (error) {
    // Log the details on the server, but return a generic message to the client.
    console.error('Gemini Vision error:', error);
    return NextResponse.json({ error: 'Failed to generate caption' }, { status: 500 });
  }
}

We send the image as a base64 string along with a simple prompt. Gemini handles the rest.
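
The route above only checks that both fields are present. If you want stricter input checks before spending API quota, a validation step along these lines is one option. This is a sketch under assumptions: validateCaptionRequest is a hypothetical helper, and the allow-list simply mirrors the PNG, JPG, and WEBP formats the upload UI advertises, so check the Gemini documentation for the formats the model actually accepts.

```typescript
// Assumed allow-list, mirroring the formats named in the upload UI.
const ALLOWED_MIME_TYPES = new Set(['image/png', 'image/jpeg', 'image/webp']);
// Standard base64 alphabet with up to two padding characters.
const BASE64_PATTERN = /^[A-Za-z0-9+\/]+={0,2}$/;

type CaptionRequest = { imageBase64: string; mimeType: string };
type ValidationResult =
  | { ok: true; value: CaptionRequest }
  | { ok: false; error: string };

// Hypothetical helper: checks the parsed JSON body before it reaches the model.
function validateCaptionRequest(body: unknown): ValidationResult {
  if (typeof body !== 'object' || body === null) {
    return { ok: false, error: 'Request body must be a JSON object' };
  }
  const { imageBase64, mimeType } = body as Record<string, unknown>;
  if (typeof imageBase64 !== 'string' || typeof mimeType !== 'string') {
    return { ok: false, error: 'imageBase64 and mimeType must be strings' };
  }
  if (!ALLOWED_MIME_TYPES.has(mimeType)) {
    return { ok: false, error: `Unsupported image type: ${mimeType}` };
  }
  // Padded base64 always has a length that is a multiple of 4.
  if (imageBase64.length % 4 !== 0 || !BASE64_PATTERN.test(imageBase64)) {
    return { ok: false, error: 'imageBase64 is not valid base64' };
  }
  return { ok: true, value: { imageBase64, mimeType } };
}
```

Inside POST, you could call this on the result of req.json() and return a 400 response with the error message whenever ok is false.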

Step 4: Build the Upload UI

Create app/caption/page.tsx:

'use client';
 
import { useState } from 'react';
 
export default function CaptionPage() {
  const [preview, setPreview] = useState<string | null>(null);
  const [caption, setCaption] = useState('');
  const [loading, setLoading] = useState(false);
  const [imageData, setImageData] = useState<{ base64: string; mimeType: string } | null>(null);
 
  function handleFileChange(e: React.ChangeEvent<HTMLInputElement>) {
    const file = e.target.files?.[0];
    if (!file) return;
 
    const reader = new FileReader();
    reader.onload = () => {
      const result = reader.result as string;
      const base64 = result.split(',')[1];
      setPreview(result);
      setImageData({ base64, mimeType: file.type });
      setCaption('');
    };
    reader.readAsDataURL(file);
  }
 
  async function generateCaption() {
    if (!imageData) return;
    setLoading(true);
    setCaption('');
 
    try {
      const res = await fetch('/api/caption', {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({
          imageBase64: imageData.base64,
          mimeType: imageData.mimeType,
        }),
      });
 
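      const data = await res.json();
      // Show the caption on success; otherwise show the server's error or a generic fallback.
      setCaption(res.ok && data.caption ? data.caption : data.error ?? 'Failed to generate caption.');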
    } catch {
      setCaption('Something went wrong. Please try again.');
    } finally {
      setLoading(false);
    }
  }
 
  return (
    <div className="min-h-screen bg-gray-50 flex items-center justify-center p-6">
      <div className="bg-white rounded-2xl shadow-md w-full max-w-lg p-8">
        <h1 className="text-2xl font-bold text-gray-800 mb-2">AI Image Caption Generator</h1>
        <p className="text-gray-500 text-sm mb-6">Upload an image and let Gemini Vision describe it.</p>
 
        <label className="flex flex-col items-center justify-center w-full h-48 border-2 border-dashed border-gray-300 rounded-xl cursor-pointer hover:border-blue-400 transition-colors">
          {preview ? (
            <img src={preview} alt="Preview" className="h-full w-full object-contain rounded-xl" />
          ) : (
            <div className="text-center">
              <p className="text-gray-400 text-sm">Click to upload an image</p>
              <p className="text-gray-300 text-xs mt-1">PNG, JPG, WEBP supported</p>
            </div>
          )}
          <input type="file" accept="image/png,image/jpeg,image/webp" onChange={handleFileChange} className="hidden" />
        </label>
 
        <button
          onClick={generateCaption}
          disabled={!imageData || loading}
          className="mt-5 w-full bg-blue-600 hover:bg-blue-700 disabled:bg-blue-300 text-white font-medium py-3 rounded-xl transition-colors"
        >
          {loading ? 'Analyzing...' : 'Generate Caption'}
        </button>
 
        {caption && (
          <div className="mt-5 p-4 bg-gray-50 rounded-xl border border-gray-200">
            <p className="text-xs font-semibold text-gray-400 uppercase tracking-wide mb-2">Caption</p>
            <p className="text-gray-700 text-sm leading-relaxed">{caption}</p>
          </div>
        )}
      </div>
    </div>
  );
}

How It Works

The user picks an image, and the browser reads it as a base64 data URL using FileReader
The base64 payload is sent to our API route in a POST request
The API route passes it to Gemini Vision with a prompt
Gemini returns a text description, which we show on screen

The whole round trip usually takes 1–3 seconds depending on image size.
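
One detail in the first step is easy to miss: readAsDataURL returns a data URL of the form data:<mime type>;base64,<payload>, and the component keeps only the part after the comma. The sketch below shows a slightly stricter way to do that split. parseDataUrl is a hypothetical helper, not part of the tutorial's code.

```typescript
type ParsedImage = { mimeType: string; base64: string };

// Hypothetical helper: splits a base64 data URL into its MIME type and payload.
// Returns null when the input does not have the expected shape.
function parseDataUrl(dataUrl: string): ParsedImage | null {
  // Expected shape: data:<mime type>;base64,<payload>
  const match = /^data:([^;,]+);base64,(.+)$/.exec(dataUrl);
  if (!match) return null;
  return { mimeType: match[1], base64: match[2] };
}
```

Compared with result.split(',')[1], this also recovers the MIME type from the URL itself and rejects input that is not a base64 data URL.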

Tips Before Going Live

Limit file size on the client side. Large images slow down the request and eat into your API quota. Add a size check at the top of handleFileChange, right after the `if (!file) return;` guard and before the file is read:

if (file.size > 4 * 1024 * 1024) {
  alert('Image must be under 4MB');
  return;
}
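
Keep in mind that the check above measures the raw file, while the request carries the base64 version, which is about a third larger. If your host caps request bodies, you may want to derive the file limit from that cap. The sketch below shows the arithmetic; the 4,500,000-byte body limit and the 1 KB JSON overhead are placeholder assumptions, so substitute the real numbers for your platform.

```typescript
// Base64 turns every 3 bytes into 4 characters (padding included).
function base64Length(byteCount: number): number {
  return 4 * Math.ceil(byteCount / 3);
}

// Largest raw file whose encoded form, plus some JSON overhead, fits the body limit.
// `overheadBytes` is a rough allowance for the JSON keys and MIME type (assumption).
function maxFileBytesFor(bodyLimitBytes: number, overheadBytes = 1024): number {
  const available = Math.max(0, bodyLimitBytes - overheadBytes);
  return Math.floor(available / 4) * 3;
}

// Hypothetical body limit; replace with the real limit of your hosting platform.
const ASSUMED_BODY_LIMIT = 4_500_000;
const MAX_FILE_BYTES = maxFileBytesFor(ASSUMED_BODY_LIMIT);
```

For reference, base64Length(4 * 1024 * 1024) is 5,592,408, so a file that just passes the 4MB check produces a payload of roughly 5.3MB. In handleFileChange, the condition would become file.size > MAX_FILE_BYTES.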

Customize the prompt. The prompt we're using is generic. You can change it to fit your use case — product descriptions for an e-commerce app, alt text generation for accessibility tools, or photo tagging for a gallery app.
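
One way to organize this is a small map of prompt presets that the route selects from. The sketch below is illustrative only: apart from the caption prompt, which is the one used in Step 3, the preset names and wording are made up for this example, and promptFor is a hypothetical helper.

```typescript
// Illustrative presets; only `caption` comes from the tutorial's route.
const PROMPTS = {
  caption:
    'Describe this image in 2-3 clear sentences. Focus on the main subject and any important details.',
  altText:
    'Write concise alt text for this image in one sentence, suitable for a screen reader.',
  product:
    'Write a short product description based on this photo. Mention color, material, and visible features.',
  tags: 'List 5 to 10 short, comma-separated tags that describe this photo.',
} as const;

type PromptKind = keyof typeof PROMPTS;

// Hypothetical helper: returns the preset for `kind`, falling back to the generic caption.
function promptFor(kind: string | undefined): string {
  // hasOwnProperty avoids matching inherited keys such as "toString".
  if (kind !== undefined && Object.prototype.hasOwnProperty.call(PROMPTS, kind)) {
    return PROMPTS[kind as PromptKind];
  }
  return PROMPTS.caption;
}
```

The client could send an extra field such as kind in the POST body, and the route would pass promptFor(kind) to generateContent in place of the hard-coded string.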

Rate limiting. If this is public-facing, add rate limiting on the API route. A free-tier Gemini key is capped at a limited number of requests per minute, so a single busy client can exhaust it for everyone. Upstash Redis pairs well with Next.js for this.
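
To make the idea concrete, here is a minimal fixed-window limiter kept in memory. Note that this is a simpler stand-in for illustration, not the Upstash approach: createRateLimiter is a hypothetical helper, and its state lives in a single process, so it will not hold up across multiple serverless instances. That limitation is the reason a shared store is the better fit for production.

```typescript
type WindowState = { windowStart: number; count: number };

// Hypothetical helper: allows at most `maxRequests` per `windowMs` for each key.
function createRateLimiter(maxRequests: number, windowMs: number) {
  const windows = new Map<string, WindowState>();

  // `now` is injectable so the limiter can be tested without waiting.
  return function isAllowed(key: string, now: number = Date.now()): boolean {
    const state = windows.get(key);
    // No window yet, or the previous window has expired: start a new one.
    if (!state || now - state.windowStart >= windowMs) {
      windows.set(key, { windowStart: now, count: 1 });
      return true;
    }
    if (state.count >= maxRequests) return false;
    state.count += 1;
    return true;
  };
}
```

In the route, the key would typically be the caller's IP address or user ID, and a false result would map to a 429 response.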

Wrapping Up

Gemini Vision is one of the easiest ways to add image AI to a web app. The SDK is clean, the API is fast, and the free tier is generous enough to build and test without worrying about costs.

From here, you could extend this to generate alt text for images automatically, build a product photo tagger, or hook it into a CMS to auto-describe uploaded media.

If you're looking to go deeper, check out our post on building a chat interface with Vertex AI and Next.js — the same pattern applies, just with text instead of images.

Have questions or built something cool with this? Drop a comment below.
