Google Gemini 102 - Advanced Structured Outputs
Probing the supported output types of Gemini.
This is a follow-up to Google Gemini 101 - Object Detection with Vision and Structured Outputs, where we explore some advanced/alternative setups for Gemini structured outputs, as well as some gotchas.
Motivation: Required Fields
One of the first things you’ll notice from the previous blog post is that none of the keys are required. That is, the model is free to return any subset of the keys, which can lead to getting funky results with no localization, e.g.:
{
"type": "Teacup"
}
There are three ways to get required fields with AI Studio:
1. use raw JSON schemas
2. use protobufs
3. use the openai package and API, but with the Gemini endpoint
I’m rather partial to the third, so that’s what we’re going to cover here. However, feel free to follow the above links to the documentation for the other solutions, if you’d prefer.
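For a sense of what "required" means at the schema level, here is a minimal sketch of the first option. The schema dict and the `missing_required` helper are hypothetical illustrations that reuse the key names from this post; they are not taken from Google's documentation and only check a response locally.

```python
# Hypothetical raw JSON schema for one detection. "required" lists the keys
# that must be present; the key names mirror the TeaSet model in this post.
TEA_SET_SCHEMA = {
    "type": "object",
    "properties": {
        "type": {"type": "string", "enum": ["Teacup", "Teapot"]},
        "bounding_box": {"type": "array", "items": {"type": "integer"}},
    },
    "required": ["type", "bounding_box"],
}


def missing_required(obj: dict, schema: dict) -> list[str]:
    """Return the required keys that are absent from obj, in schema order."""
    return [key for key in schema.get("required", []) if key not in obj]


# The funky result from above is missing its localization.
print(missing_required({"type": "Teacup"}, TEA_SET_SCHEMA))  # ['bounding_box']
```

Without the "required" list, the teacup-only object above would count as a valid answer, which is exactly the problem we're trying to avoid.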
Use the openai Package
The openai structured output approach enforces that every key is required (link to docs). This means that if we swap from the google-generativeai package to openai, all keys will be required:
from typing import Literal
from pydantic import BaseModel
from PIL import Image
from .utils import draw_bounding_box
import argparse
import openai
import base64
import os


class TeaSet(BaseModel):
    type: Literal["Teacup", "Teapot"]
    bounding_box: list[int]


class TeaSets(BaseModel):
    tea_sets: list[TeaSet]


def main(client: openai.OpenAI, image_path: str, size: float = 1024) -> None:
    # Encode the image as base64 so it can be sent inline as a data URL.
    with open(image_path, "rb") as image_file:
        image_base64 = base64.b64encode(image_file.read()).decode("utf-8")
    image = Image.open(image_path)
    response = client.beta.chat.completions.parse(
        model="gemini-2.0-flash-exp",
        n=1,
        messages=[
            {
                "role": "system",
                "content": """Find all the teacups and teapots in the image.
Return your answer as a list of JSON objects with the type and bounding box.
Return the bounding box in [ymin, xmin, ymax, xmax] format.""",
            },
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Here's the image:"},
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/jpeg;base64,{image_base64}"
                        },
                    },
                ],
            },
        ],
        # Passing the Pydantic model makes every key required.
        response_format=TeaSets,
    )
    # parsed is None when the response could not be parsed into TeaSets.
    if tea_sets := response.choices[0].message.parsed:
        for tea_set in tea_sets.tea_sets:
            draw_bounding_box(image, tea_set.type, tea_set.bounding_box)
    image.show()


if __name__ == "__main__":
    # Point the OpenAI client at Gemini's OpenAI-compatible endpoint.
    client = openai.OpenAI(
        base_url="https://generativelanguage.googleapis.com/v1beta/openai/",
        api_key=os.getenv("GOOGLE_API_KEY"),
    )
    parser = argparse.ArgumentParser()
    parser.add_argument("image_path", help="Path to image file to process")
    args = parser.parse_args()
    main(client, args.image_path)
Gemini Structured Mode Gotchas
However, if you want to do more complex things (that OpenAI can do), it’s not that easy! The following things aren’t supported by Gemini’s structured outputs:
- union types
- tuples
Union Types
For instance, union types aren’t supported:
class Teacup(BaseModel):
    is_empty: bool
    bounding_box: list[int]


class Teapot(BaseModel):
    rating: int
    bounding_box: list[int]


class TeaSets(BaseModel):
    tea_sets: list[Teacup | Teapot]
That also means that Optional types (or | None in more recent Python) aren’t supported:
from typing import Optional


class TeaSet(BaseModel):
    type: Literal["Teacup", "Teapot"]
    bounding_box: Optional[list[int]]


class TeaSets(BaseModel):
    tea_sets: list[TeaSet]
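One possible way around the missing union support is to ask the model for a single flat object that carries every field plus a "type" discriminator, and then split it into the right class after parsing. This is only a sketch under that assumption: the flattened shape and the `split_flat_item` helper are hypothetical, it uses plain dataclasses to stay self-contained, and I haven't confirmed it is the recommended approach.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Teacup:
    is_empty: bool
    bounding_box: list[int]


@dataclass
class Teapot:
    rating: int
    bounding_box: list[int]


def split_flat_item(item: dict) -> Union[Teacup, Teapot]:
    """Convert one flattened item (hypothetical shape) into a Teacup or Teapot.

    The flattened item carries every field plus a "type" discriminator; the
    field that does not apply to the given type is simply ignored.
    """
    if item["type"] == "Teacup":
        return Teacup(is_empty=item["is_empty"], bounding_box=item["bounding_box"])
    if item["type"] == "Teapot":
        return Teapot(rating=item["rating"], bounding_box=item["bounding_box"])
    raise ValueError(f"unexpected type: {item['type']!r}")


flat = {"type": "Teapot", "is_empty": False, "rating": 4, "bounding_box": [10, 20, 300, 400]}
print(split_flat_item(flat))  # Teapot(rating=4, bounding_box=[10, 20, 300, 400])
```

The trade-off is that the model has to fill in fields that don't apply (here, is_empty for a teapot), so the prompt should say which placeholder value to use for them.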
Tuples
In the previous post, you might have thought to yourself: “Why implement bounding box as a list of integers when there are only 4 items? Shouldn’t you just use a 4-tuple?” That’s a great question!
Unfortunately, it’s not supported. Give it a shot yourself:
class TeaSet(BaseModel):
    type: Literal["Teacup", "Teapot"]
    bounding_box: tuple[int, int, int, int]


class TeaSets(BaseModel):
    tea_sets: list[TeaSet]
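Since the schema can't enforce the length, one option is to keep list[int] in the model and check the shape yourself after parsing. A minimal sketch, assuming the [ymin, xmin, ymax, xmax] order from the prompt above; `as_box` is a hypothetical helper, not part of any library.

```python
def as_box(values: list[int]) -> tuple[int, int, int, int]:
    """Check a [ymin, xmin, ymax, xmax] list and return it as a 4-tuple."""
    if len(values) != 4:
        raise ValueError(f"expected 4 coordinates, got {len(values)}")
    ymin, xmin, ymax, xmax = values
    # Assumption: a box whose min exceeds its max is a model error, not a convention.
    if ymin > ymax or xmin > xmax:
        raise ValueError(f"min exceeds max in {values}")
    return (ymin, xmin, ymax, xmax)


print(as_box([10, 20, 300, 400]))  # (10, 20, 300, 400)
```

In the script above, this could be called on tea_set.bounding_box before drawing, so a malformed box raises an error instead of producing a strange rectangle.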
What is supported?
https://cloud.google.com/vertex-ai/docs/reference/rest/v1/Schema
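To check a model before sending it, you can scan its JSON schema for the keywords behind the two gotchas above. This is a rough sketch: the keyword list is my assumption based on what Pydantic emits for unions ("anyOf") and fixed-length tuples ("prefixItems"), not an official list of what Gemini rejects, and `find_suspect_keywords` is a hypothetical helper.

```python
from typing import Any, Iterator

# Assumption: these are the keywords Pydantic emits for unions/Optional and
# for fixed-length tuples. The list comes from the gotchas in this post,
# not from Google's documentation.
SUSPECT_KEYWORDS = {"anyOf", "prefixItems"}


def find_suspect_keywords(schema: Any, path: str = "$") -> Iterator[tuple[str, str]]:
    """Yield (path, keyword) for each suspect keyword found in a schema dict."""
    if isinstance(schema, dict):
        for key, value in schema.items():
            if key in SUSPECT_KEYWORDS:
                yield path, key
            # Keep walking so nested definitions are checked too.
            yield from find_suspect_keywords(value, f"{path}.{key}")
    elif isinstance(schema, list):
        for index, value in enumerate(schema):
            yield from find_suspect_keywords(value, f"{path}[{index}]")


# Hand-written schema fragment resembling what a 4-tuple field produces.
example = {
    "type": "object",
    "properties": {
        "bounding_box": {"type": "array", "prefixItems": [{"type": "integer"}] * 4},
    },
}
print(list(find_suspect_keywords(example)))
```

With Pydantic v2, the schema for the models in this post can be obtained with TeaSets.model_json_schema() and passed straight to the function. Note that a property that happens to be named "anyOf" would be flagged too, so treat the output as a hint rather than a verdict.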