Joe Barrow

Google Gemini 102 - Advanced Structured Outputs

Probing the supported output types of Gemini.

This is a follow-up to Google Gemini 101 - Object Detection with Vision and Structured Outputs. Here we explore some advanced and alternative setups for Gemini structured outputs, as well as some gotchas.

Motivation: Required Fields

One of the first things you’ll notice from the previous blog post is that none of the keys are required. That is, the model is free to return any subset of the keys, which can lead to getting funky results with no localization, e.g.:

{
  "type": "Teacup"
}
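If you stay with non-required keys, one defensive option is to check each returned object yourself before using it. The following is a minimal sketch using only the standard library; the `REQUIRED_KEYS` set, the `missing_keys` helper, and the sample `raw` string are hypothetical names and data for illustration, not part of either package.

```python
import json

# Keys that this post's schema expects on every detected object.
REQUIRED_KEYS = {"type", "bounding_box"}


def missing_keys(item: dict) -> set[str]:
    """Return the required keys that are absent from one detected object."""
    return REQUIRED_KEYS - set(item)


# Hypothetical raw model output: the second object has no localization.
raw = '[{"type": "Teapot", "bounding_box": [10, 20, 300, 400]}, {"type": "Teacup"}]'

for item in json.loads(raw):
    if missing := missing_keys(item):
        print(f"skipping {item}: missing {sorted(missing)}")
```

This only detects the problem after the fact; it does not make the model fill in the missing keys.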

There are three ways to get required fields with AI Studio:

1. use raw JSON schemas
2. use protobufs
3. use the openai package and API, but with the Gemini endpoint

I’m rather partial to the third, so that’s what we’re going to cover here. However, feel free to follow the above links to the documentation for the other solutions, if you’d prefer.

Use the openai Package

The openai structured output approach enforces that every key is required (link to docs). This means that if we swap from the google-generativeai package to openai, all keys will be required:

from typing import Literal
from pydantic import BaseModel
from PIL import Image  # Pillow; needed for Image.open() in main()
from .utils import draw_bounding_box

import argparse
import openai
import base64
import os


class TeaSet(BaseModel):
    type: Literal["Teacup", "Teapot"]
    bounding_box: list[int]


class TeaSets(BaseModel):
    tea_sets: list[TeaSet]


def main(client: openai.OpenAI, image_path: str, size: float = 1024) -> None:
    with open(image_path, "rb") as image_file:
        image_base64 = base64.b64encode(image_file.read()).decode("utf-8")

    image = Image.open(image_path)

    response = client.beta.chat.completions.parse(
        model="gemini-2.0-flash-exp",
        n=1,
        messages=[
            {
                "role": "system",
                "content": """Find all the teacups and teapots in the image.
Return your answer as a list of JSON objects with the type and bounding box.
Return the bounding box in [ymin, xmin, ymax, xmax] format.""",
            },
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Here's the image:"},
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/jpeg;base64,{image_base64}"
                        },
                    },
                ],
            },
        ],
        response_format=TeaSets,
    )

    if tea_sets := response.choices[0].message.parsed:
        for tea_set in tea_sets.tea_sets:
            draw_bounding_box(image, tea_set.type, tea_set.bounding_box)

    image.show()


if __name__ == "__main__":
    client = openai.OpenAI(
        base_url="https://generativelanguage.googleapis.com/v1beta/openai/",
        api_key=os.getenv("GOOGLE_API_KEY"),
    )

    parser = argparse.ArgumentParser()
    parser.add_argument("image_path", help="Path to image file to process")
    args = parser.parse_args()

    main(client, args.image_path)

Gemini Structured Mode Gotchas

However, if you want to do more complex things (that OpenAI can do), it’s not that easy! The following things aren’t supported by Gemini’s structured outputs:

- union types
- tuples

Union Types

For instance, union types aren’t supported:

class Teacup(BaseModel):
    is_empty: bool
    bounding_box: list[int]

class Teapot(BaseModel):
    rating: int
    bounding_box: list[int]

class TeaSets(BaseModel):
    tea_sets: list[Teacup | Teapot]

That also means that Optional types (or | None in more recent Python) aren’t supported:

from typing import Optional

class TeaSet(BaseModel):
    type: Literal["Teacup", "Teapot"]
    bounding_box: Optional[list[int]]

class TeaSets(BaseModel):
    tea_sets: list[TeaSet]

Tuples

In the previous post, you might have thought to yourself: “Why implement bounding box as a list of integers when there are only 4 items? Shouldn’t you just use a 4-tuple?” That’s a great question!

Unfortunately, it’s not supported. Give it a shot yourself:

class TeaSet(BaseModel):
    type: Literal["Teacup", "Teapot"]
    bounding_box: tuple[int, int, int, int]

class TeaSets(BaseModel):
    tea_sets: list[TeaSet]
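If you still want a 4-tuple on the Python side, one option is to keep `list[int]` in the schema and convert after parsing. This is a minimal standard-library sketch; `BoundingBox` and `to_bounding_box` are hypothetical names, and the field order assumes the [ymin, xmin, ymax, xmax] format requested in the prompt above.

```python
from typing import NamedTuple


class BoundingBox(NamedTuple):
    ymin: int
    xmin: int
    ymax: int
    xmax: int


def to_bounding_box(values: list[int]) -> BoundingBox:
    """Convert a list from the model into a 4-tuple, rejecting wrong lengths."""
    if len(values) != 4:
        raise ValueError(f"expected 4 coordinates, got {len(values)}")
    return BoundingBox(*values)


box = to_bounding_box([10, 20, 300, 400])
print(box.ymin, box.xmax)
```

The length check happens on your side, after the response arrives, so a malformed box raises an error instead of being silently drawn.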

What is supported?

https://cloud.google.com/vertex-ai/docs/reference/rest/v1/Schema