Letting an LLM read your Metabase is a deceptively easy thing to do badly. The first version everyone writes is "give the model an API token and a schema dump and let it figure out what to query". That works for the demo. It also invents dashboards that do not exist, queries questions it cannot read, and occasionally writes a SQL question into the wrong collection.
This post covers the patterns that have actually held up in production over the last year of building MetaLens, plus the failure modes I keep running into when reviewing other people's integrations. Code examples are in Python and TypeScript. The patterns are the same regardless of which model you use; "Claude" and "GPT" in the title are shorthand for "any LLM with tool use".
Auth: API key vs session token
Metabase has two auth modes for the API.
- Session token. Issued by `POST /api/session` with a username and password. Short-lived (default 14 days). Acts as the user who logged in.
- API key. Created in admin settings, scoped to a group you control, long-lived until revoked. Acts as a service account.
For LLM use, always use API keys. Session tokens are a maintenance problem (rotation, expiry, accidentally checked-in credentials), and they conflate "this human" with "this service", which makes audit logs noisy.
Two more rules I treat as non-negotiable:
- The API key's group has read-only permissions on every collection it should see, and no permission on collections it should not. Do not grant admin to an LLM-facing key, ever.
- The key does not have permission to create, edit, or delete questions, dashboards, or alerts. If your integration "needs" to write back to Metabase, that is a separate key with a much smaller scope, used in a much narrower code path.
In Python:
```python
import os

import httpx

class Metabase:
    def __init__(self, base_url: str, api_key: str):
        self.base_url = base_url.rstrip("/")
        self.headers = {"x-api-key": api_key}

    def get(self, path: str, **params):
        r = httpx.get(f"{self.base_url}{path}", headers=self.headers, params=params, timeout=30)
        r.raise_for_status()
        return r.json()

mb = Metabase(os.environ["METABASE_URL"], os.environ["METABASE_API_KEY"])
print(mb.get("/api/database"))
```

In TypeScript (Node 22+):
```ts
const BASE = process.env.METABASE_URL!.replace(/\/$/, "");
const KEY = process.env.METABASE_API_KEY!;

async function mb<T>(path: string, qs: Record<string, string> = {}): Promise<T> {
  const url = new URL(BASE + path);
  for (const [k, v] of Object.entries(qs)) url.searchParams.set(k, v);
  const r = await fetch(url, { headers: { "x-api-key": KEY } });
  if (!r.ok) throw new Error(`${path} ${r.status}: ${await r.text()}`);
  return r.json() as Promise<T>;
}
```

That is the entire client. Everything below builds on this.
Schema introspection: what to give the model
The naive "give the model a schema dump" approach fails for two reasons. First, real Metabase instances have schemas the size of a novel. The model wastes its context window memorizing 800 columns it does not need. Second, the warehouse schema is rarely what the LLM should reason about. The model should be reasoning about the curated saved questions and models, not the raw events table with 200 columns.
The pattern that works: give the model a small, curated card list, not a schema. For each card the model should know:
- ID
- Name
- Description (Metabase's own description field, if filled)
- Collection path
- Output columns (name + type)
That's it. No SQL, no source columns, no upstream tables. The model decides "the user is asking about churn, here are three relevant questions" and then asks for the data from one of them by ID.
```python
def list_curated_cards(collection_id: int):
    # models="card" filters the collection listing to saved questions only.
    cards = mb.get(f"/api/collection/{collection_id}/items", models="card")
    out = []
    for c in cards["data"]:
        card = mb.get(f"/api/card/{c['id']}")
        out.append({
            "id": card["id"],
            "name": card["name"],
            "description": card.get("description") or "",
            # "collection" can be present but null; guard before .get()
            "collection": (card.get("collection") or {}).get("name", ""),
            "result_columns": [
                {"name": col["name"], "type": col.get("base_type")}
                for col in (card.get("result_metadata") or [])
            ],
        })
    return out
```

The "curated" part matters. Do not feed the model every saved question in the instance. Pick a collection (or a tag, or a name prefix) that you treat as the LLM-readable surface area, and only expose those. The first time somebody asks the model "what's our MRR" and it pulls a dashboard from "old / experiments / draft", you will wish you had limited the surface.
Read-only patterns: never POST or PUT
The ONLY endpoints the LLM-facing client should call are:

- `GET /api/database`
- `GET /api/collection/:id/items`
- `GET /api/card/:id`
- `POST /api/card/:id/query` (to execute a saved card)

That last one is a POST, but it is a "run this query" call, not a "create something" call; nothing you send to it can create or modify a card. It is safe to expose.

Everything else (`POST /api/card`, `PUT /api/card/:id`, `DELETE /api/card/:id`, anything under `/api/dashboard` that mutates) is forbidden. Enforce that either in code (the client class above simply does not implement them) or by permissions on the API key; the safer choice is both layers.
If your integration genuinely needs to write back, do that with a different key, in a different module, with a human-in-the-loop confirmation. The "let the agent do everything" architecture is a great way to wake up to a deleted dashboard.
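If you want the "forbidden by code" layer to be explicit rather than implicit, a small allowlist check at the bottom of the client works. This is a minimal sketch; `check_allowed` and the `ALLOWED` table are names I am making up for illustration, not Metabase or library API:

```python
import re

# Every request the LLM-facing code makes goes through this check,
# no matter what the model asked for.
ALLOWED = [
    ("GET", re.compile(r"^/api/database$")),
    ("GET", re.compile(r"^/api/collection/\d+/items$")),
    ("GET", re.compile(r"^/api/card/\d+$")),
    ("POST", re.compile(r"^/api/card/\d+/query$")),
]

def check_allowed(method: str, path: str) -> None:
    for m, pattern in ALLOWED:
        if m == method and pattern.match(path):
            return
    raise PermissionError(f"{method} {path} is not on the allowlist")
```

Call it at the top of every request method; even if a future refactor adds a `post()` helper, the allowlist still holds the line.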
Running a saved card and returning rows
Once the model has picked a card, you run it and feed the result back. Metabase's card-query endpoint returns rows in a JSON envelope with column metadata.
```python
def run_card(card_id: int, parameters: list | None = None):
    body = {"parameters": parameters or []}
    r = httpx.post(
        f"{mb.base_url}/api/card/{card_id}/query",
        headers={**mb.headers, "Content-Type": "application/json"},
        json=body,
        timeout=60,
    )
    r.raise_for_status()
    payload = r.json()
    cols = [c["name"] for c in payload["data"]["cols"]]
    rows = payload["data"]["rows"]
    return [dict(zip(cols, row)) for row in rows]
```

Two things to watch:
- Rows can be huge. A saved question can return millions of rows. Cap it before you send to the model. I default to 200 rows and a payload size limit of 30KB. Anything above that gets truncated with a note ("results truncated to 200 rows of 12,541; ask a more specific question").
- Parameters. If the card has filters, the model has to provide values. Validate them: parameter targets, types, and allowable values. The parameter list comes from `GET /api/card/:id`, in the `parameters` field. Both defenses are sketched just after this list.
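Here is a minimal sketch of both, using the 200-row and 30KB limits from above. The helper names are mine, and the parameter check matches on the card parameter `id`; depending on your Metabase version you may need to match on target or slug instead:

```python
import json

MAX_ROWS = 200
MAX_BYTES = 30_000

def truncate_rows(rows: list[dict]) -> tuple[list[dict], str | None]:
    note = None
    total = len(rows)
    if total > MAX_ROWS:
        rows = rows[:MAX_ROWS]
        note = f"results truncated to {MAX_ROWS} rows of {total}; ask a more specific question"
    # Halve until the serialized payload fits, or only one row remains.
    while len(rows) > 1 and len(json.dumps(rows)) > MAX_BYTES:
        rows = rows[: len(rows) // 2]
        note = f"results truncated to {len(rows)} rows of {total} to fit the payload limit"
    return rows, note

def validate_parameters(card: dict, supplied: list[dict]) -> None:
    # `card` is the GET /api/card/:id response; `parameters` lists its filters.
    declared = {p["id"] for p in (card.get("parameters") or [])}
    for p in supplied:
        if p.get("id") not in declared:
            raise ValueError(f"unknown parameter id: {p.get('id')!r}")
```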
Hallucination prevention: constrain to what exists
The most common failure mode is the model "remembering" a card name from a previous turn that does not exist in this Metabase, or inventing a card that sounds plausible. Two things prevent this in practice.
First, force the model to pick from a list, not generate. Tool use is the right primitive here. Define one tool: `run_card(card_id: integer, parameters?: object)`. Hand the model the curated card list as plain text in the system prompt or as the previous tool result. The model's only path to data is to call `run_card` with an ID that came from the list.
Second, validate every tool call. If the model calls `run_card(card_id=999999, ...)` and that ID is not in the curated list, refuse and tell it. Do not just call the API and let Metabase 404. The model will hallucinate less if you are strict about this.
In TypeScript, with Anthropic's tool use:
```ts
const tools = [
  {
    name: "run_card",
    description: "Run a saved Metabase card by ID. Only IDs from the provided card list are allowed.",
    input_schema: {
      type: "object",
      properties: {
        card_id: { type: "integer" },
        parameters: { type: "array", items: { type: "object" } },
      },
      required: ["card_id"],
    },
  },
];

async function handleToolCall(call: { name: string; input: any }, allowedIds: Set<number>) {
  if (call.name !== "run_card") throw new Error(`Unknown tool: ${call.name}`);
  if (!allowedIds.has(call.input.card_id)) {
    return { error: `card_id ${call.input.card_id} is not in the allowed card list` };
  }
  return await runCard(call.input.card_id, call.input.parameters);
}
```

A small extra: log every tool call with the model, the input, and the output size. The first time an agent goes off the rails, the log is what tells you which step it happened on.
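A minimal sketch of that logging, in Python to match the loop in the next section. The wrapper name is mine, and `handle_tool_call` is the Python counterpart of `handleToolCall` above:

```python
import json
import logging

log = logging.getLogger("llm_tools")

def logged_tool_call(model: str, block, allowed_ids: set[int]):
    result = handle_tool_call(block, allowed_ids)
    log.info(
        "model=%s tool=%s input=%s output_bytes=%d",
        model,
        block.name,
        json.dumps(block.input),
        len(json.dumps(result)),
    )
    return result
```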
Rate limits and cost control
Metabase does not have great rate limit primitives. The defenses you put in are mostly your own:
- Cache card results aggressively. A card is a saved query. If it ran 30 seconds ago, the result is almost certainly the same. A 60-second in-memory cache keyed by `(card_id, parameters_hash)` solves most of the load.
- Cap concurrent queries per session. If a user is in a chat, only run one query at a time. The model will wait.
- Watch for runaway loops. If the model has called `run_card` more than N times in a session without producing an answer, stop. Something is wrong, and the cost will keep climbing while it figures it out. (The cache and this guard are both sketched after the list.)
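A minimal sketch of the cache and the loop guard, assuming the `run_card` from earlier. The TTL, the call cap, and the names are mine:

```python
import hashlib
import json
import time

CACHE_TTL = 60               # seconds
MAX_CALLS_PER_SESSION = 8    # pick an N that fits your longest legitimate flow

_cache: dict[tuple[int, str], tuple[float, list]] = {}

def cached_run_card(card_id: int, parameters: list | None = None):
    params_hash = hashlib.sha256(
        json.dumps(parameters or [], sort_keys=True).encode()
    ).hexdigest()
    key = (card_id, params_hash)
    hit = _cache.get(key)
    if hit and time.monotonic() - hit[0] < CACHE_TTL:
        return hit[1]
    rows = run_card(card_id, parameters)
    _cache[key] = (time.monotonic(), rows)
    return rows

class SessionGuard:
    """Count run_card calls per chat session and stop runaway loops."""
    def __init__(self, limit: int = MAX_CALLS_PER_SESSION):
        self.limit = limit
        self.calls = 0

    def check(self) -> None:
        self.calls += 1
        if self.calls > self.limit:
            raise RuntimeError("too many run_card calls this session; aborting")
```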
Token cost on the LLM side is mostly about the result size. A 200-row, 8-column table is fine. A 50,000-row table will eat your budget and the model will not be able to summarize it usefully anyway. Truncate, sample, or paginate.
Putting it together
A minimal end-to-end flow looks like this:
```python
import json

import anthropic

client = anthropic.Anthropic()

def answer(user_question: str, allowed_collection: int):
    cards = list_curated_cards(allowed_collection)
    allowed_ids = {c["id"] for c in cards}
    system = f"""You answer questions using the provided Metabase cards.
Only the card IDs in this list exist:
{json.dumps(cards, indent=2)}
To get data, call the run_card tool with a card_id from this list.
Never invent card IDs. If no card fits, say so."""

    messages = [{"role": "user", "content": user_question}]
    while True:
        resp = client.messages.create(
            model="claude-opus-4-7",
            system=system,
            tools=[run_card_tool],  # the run_card schema from above, as a dict
            messages=messages,
            max_tokens=2000,
        )
        if resp.stop_reason == "tool_use":
            # Append the assistant turn once, then answer every tool_use
            # block in a single user turn -- one tool_result per call.
            messages.append({"role": "assistant", "content": resp.content})
            results = []
            for block in resp.content:
                if block.type == "tool_use":
                    result = handle_tool_call(block, allowed_ids)
                    results.append({
                        "type": "tool_result",
                        "tool_use_id": block.id,
                        "content": json.dumps(result)[:30000],
                    })
            messages.append({"role": "user", "content": results})
            continue
        return "".join(b.text for b in resp.content if b.type == "text")
```

Three things to note about the loop:
- The card list goes in the system prompt. With Anthropic's prompt caching, this is cheap on repeat questions in the same session.
- The tool result is truncated at 30KB. The model handles "the result was truncated" gracefully if you tell it.
- The loop exits on `stop_reason != "tool_use"`. If the model calls a tool, you handle it and continue.
Common pitfalls I see in code reviews
- Sending the warehouse schema to the model "for context". Almost always a mistake. Curated cards are smaller, more accurate, and harder to misuse.
- Letting the model pass arbitrary SQL. "Just give it the dataset endpoint" sounds simple. It is not. Native query execution against the warehouse, generated by an LLM, with no validation, is a security incident waiting to happen.
- Treating the LLM as the source of truth. The model should describe what it found, not assert numbers. If it says "MRR is $X", that's the row from card N. Make that traceable in your UI.
- Forgetting to handle parameter cards. Many Metabase cards have required filters. The model needs the parameter spec from `GET /api/card/:id` and has to fill it correctly.
- Not handling MBQL questions. Cards built in the visual query builder come back as MBQL, not SQL. The query endpoint runs them either way, but if you are doing anything with the SQL field, it will sometimes be missing (see the sketch after this list).
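For that last pitfall, a defensive sketch; the `dataset_query` shape is Metabase's, the helper name is mine:

```python
def card_sql(card: dict) -> str | None:
    # Native questions carry SQL at dataset_query.native.query;
    # query-builder (MBQL) cards have type == "query" and no SQL to read.
    dq = card.get("dataset_query") or {}
    if dq.get("type") == "native":
        return (dq.get("native") or {}).get("query")
    return None
```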
What MetaLens does about this
MetaLens uses the patterns above to talk to Metabase from an LLM, but it does it for the meta-task: auditing the instance itself. The model reads cards and dashboards, looks for the seven patterns from the audit post, and reports them with traceable links back to the source. Same constraints: API key, read-only permissions, curated set of operations, no SQL generation.
If you are building your own LLM-Metabase integration, the rules are the simple ones from this post: API key with read-only scope, curated card list, tool use with validation, truncate aggressively. Almost every production failure I have seen on this stack is one of those four rules being broken.
If you have built one and it is misbehaving, walk back through the four. Usually it's the second one (no curated list) or the third one (tool calls not validated, which in practice means the model can pass arbitrary SQL). Fix those two, and most of the surprises go away.