
OpenAI’s ecosystem in 2026 is built around Assistants, a first-class abstraction that packages models, tools, instructions, and memory into a single unit. Below is a practical guide that walks you through every step—from creating your first Assistant to wiring it into an end-to-end workflow—complete with code snippets, FAQs, and tips that reflect the current state of the platform.
In 2026 the OpenAI API is largely declarative: you describe what you want, not how to achieve it.
| Concept | 2026 Abstraction | What You Provide |
|---|---|---|
| Model | model string ("gpt-5", "o3-mini") | Instruction set & temperature |
| Tools | tools array (code interpreter, function calls, file search, web search) | JSON schema & Python functions |
| Memory | vector_store + thread | File IDs, chunking strategy, retention rules |
| Prompt | instructions | System-level persona, tone, guardrails |
| State | thread | Conversation history & metadata |
Key changes from 2024:
- Everything executes as a Run: an Assistant run against a Thread.
- File search (`vector_store`) is vector-only; hybrid BM25 retrieval is deprecated.

Creating an Assistant:

```python
from openai import OpenAI

client = OpenAI(api_key="sk-...")

assistant = client.beta.assistants.create(
    name="CodeReviewer",
    instructions="You are a senior Python engineer. Review PRs for style, safety, and performance.",
    model="gpt-5",
    tools=[
        {"type": "code_interpreter"},
        {"type": "file_search", "vector_store_ids": ["vs_abc123"]}
    ],
    temperature=0.2
)
```
- `model` → pick the smartest model you can afford (gpt-5 ≥ o4-mini).
- `tools` → order matters; the code interpreter runs before file search.
- `vector_store_ids` → attaches a pre-created vector store (see §3).

Threads are ephemeral by default:
```python
thread = client.beta.threads.create(
    messages=[
        {
            "role": "user",
            "content": "Review this PR: https://github.com/.../pull/123"
        }
    ]
)
```
If you need persistence, set `metadata={"retention": "30d"}` and store the `thread_id` in your DB.
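If you plan to resume conversations later, the thread-ID bookkeeping can be sketched like this (the dict stands in for a real database; `client` is the OpenAI client created earlier):

```python
# Sketch: map your own conversation IDs to OpenAI thread IDs so a
# conversation can be resumed. A real app would persist this in a DB.
thread_store: dict[str, str] = {}  # conversation_id -> thread_id

def get_or_create_thread(client, conversation_id: str) -> str:
    """Reuse the existing thread for this conversation, or create one."""
    if conversation_id in thread_store:
        return thread_store[conversation_id]
    thread = client.beta.threads.create(
        metadata={"retention": "30d"}  # keep the thread around for 30 days
    )
    thread_store[conversation_id] = thread.id
    return thread.id
```

Calling it twice with the same conversation ID returns the same thread, so context accumulates across requests.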
```python
run = client.beta.threads.runs.create(
    thread_id=thread.id,
    assistant_id=assistant.id,
    instructions="Focus on type hints and exception safety."
)
```
Monitor status:
```python
status = client.beta.threads.runs.retrieve(thread_id=thread.id, run_id=run.id)

if status.status == "completed":
    messages = client.beta.threads.messages.list(thread_id=thread.id)
    print(messages.data[0].content[0].text.value)
```
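Rather than checking the status once, a small polling helper is the usual pattern. A sketch, assuming the terminal states shown (`completed`, `failed`, `cancelled`, `expired`):

```python
import time

TERMINAL_STATES = {"completed", "failed", "cancelled", "expired"}

def wait_for_run(client, thread_id: str, run_id: str,
                 poll_interval: float = 0.5, timeout: float = 120.0):
    """Poll a run until it reaches a terminal status or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        run = client.beta.threads.runs.retrieve(
            thread_id=thread_id, run_id=run_id
        )
        if run.status in TERMINAL_STATES:
            return run
        time.sleep(poll_interval)
    raise TimeoutError(f"run {run_id} did not finish within {timeout}s")
```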
```python
vector_store = client.beta.vector_stores.create(name="PythonStyleGuide")

for file in ["pep8.md", "mypy.md"]:
    client.beta.vector_stores.files.upload(
        vector_store_id=vector_store.id,
        file=open(file, "rb"),
        chunking_strategy={"type": "static", "max_chunk_size_tokens": 800}
    )
```
- `chunking_strategy` defaults to 800 tokens; you can set `max_chunk_size_tokens` up to 4096.
- Supported file types: `.txt`, `.pdf`, `.md`, `.docx`, `.pptx`, `.csv`, `.jsonl`.

Attach the store to the Assistant:

```python
assistant = client.beta.assistants.update(
    assistant_id=assistant.id,
    tool_resources={"file_search": {"vector_store_ids": [vector_store.id]}}
)
```
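To build intuition for what a static `chunking_strategy` does, here is a rough stand-alone sketch that splits text into fixed-size chunks, using whitespace-separated words as a stand-in for real tokenizer tokens:

```python
def chunk_tokens(text: str, max_chunk_size_tokens: int = 800) -> list[str]:
    """Split text into chunks of at most `max_chunk_size_tokens` 'tokens'.

    Whitespace words approximate tokens here; the platform's chunker
    uses the model's actual tokenizer.
    """
    tokens = text.split()
    return [
        " ".join(tokens[i:i + max_chunk_size_tokens])
        for i in range(0, len(tokens), max_chunk_size_tokens)
    ]
```

Larger chunks give the model more context per retrieval hit; smaller chunks give finer-grained matches.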
- Re-sync the `vector_store` if files change.
- Set `ranking_options={"ranker": "default"}` for better precision.

Define a function tool:

```python
tools = [
    {
        "type": "function",
        "function": {
            "name": "fetch_github_pr",
            "description": "Fetch a GitHub PR diff.",
            "parameters": {
                "type": "object",
                "properties": {
                    "owner": {"type": "string"},
                    "repo": {"type": "string"},
                    "pr_number": {"type": "integer"}
                },
                "required": ["owner", "repo", "pr_number"]
            }
        }
    }
]
```
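Before invoking the local function, it can pay to validate the model-supplied arguments against this schema. A minimal hand-rolled check (a production system might use the `jsonschema` package instead):

```python
# Map JSON Schema primitive types to Python types for a shallow check.
_TYPE_MAP = {"string": str, "integer": int, "number": (int, float),
             "boolean": bool, "object": dict, "array": list}

def validate_args(schema: dict, args: dict) -> list[str]:
    """Return a list of problems; an empty list means the args look valid."""
    errors = []
    for key in schema.get("required", []):
        if key not in args:
            errors.append(f"missing required field: {key}")
    for key, spec in schema.get("properties", {}).items():
        if key in args and not isinstance(args[key], _TYPE_MAP[spec["type"]]):
            errors.append(f"{key}: expected {spec['type']}")
    return errors
```

Rejecting malformed arguments before the tool runs turns silent failures into actionable error messages you can feed back to the model.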
```python
import httpx

def fetch_github_pr(owner: str, repo: str, pr_number: int) -> str:
    """Download the raw diff for a pull request."""
    diff_url = f"https://patch-diff.githubusercontent.com/raw/{owner}/{repo}/pull/{pr_number}.diff"
    return httpx.get(diff_url).text
```
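When several function tools are registered, a small dispatch table keeps the run loop tidy. A sketch with a hypothetical toy tool (`add`) standing in for `fetch_github_pr`:

```python
import json

TOOL_REGISTRY: dict = {}

def tool(fn):
    """Decorator: register a Python function as a callable tool by name."""
    TOOL_REGISTRY[fn.__name__] = fn
    return fn

@tool
def add(a: int, b: int) -> int:  # toy tool for illustration
    return a + b

def dispatch_tool_call(name: str, arguments_json: str):
    """Route a tool call (name + JSON-encoded args) to its function."""
    if name not in TOOL_REGISTRY:
        raise KeyError(f"unknown tool: {name}")
    return TOOL_REGISTRY[name](**json.loads(arguments_json))
```

In the run loop you would call `dispatch_tool_call(tool_call.function.name, tool_call.function.arguments)` instead of hard-coding one function.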
```python
assistant = client.beta.assistants.update(
    assistant_id=assistant.id,
    tools=[*tools, {"type": "file_search", "vector_store_ids": [...]}]
)
```
```python
import json

run = client.beta.threads.runs.create(
    thread_id=thread.id,
    assistant_id=assistant.id
)

# Stream events; EventHandler is your subclass of the SDK's event handler.
for event in client.beta.threads.runs.stream(
    thread_id=thread.id,
    run_id=run.id,
    event_handler=EventHandler()
):
    if event.event == "thread.run.step.completed":
        step = event.data
        if step.step_details.type == "tool_calls":
            for tool_call in step.step_details.tool_calls:
                args = json.loads(tool_call.function.arguments)
                result = fetch_github_pr(**args)
                client.beta.threads.runs.submit_tool_outputs(
                    thread_id=thread.id,
                    run_id=run.id,
                    tool_outputs=[{"tool_call_id": tool_call.id, "output": result}]
                )
```
```python
with client.beta.threads.messages.stream(
    thread_id=thread.id,
    event_handler=MessageStreamHandler()
) as stream:
    for text in stream.text_deltas:
        yield text
```
```python
run = client.beta.threads.runs.create(
    thread_id=thread.id,
    assistant_id=assistant.id,
    stream=True,
    truncation_strategy={"type": "auto"}
)
```
- `truncation_strategy` defaults to 16k tokens; set `max_prompt_tokens` to control cost.
- Latency: expect roughly 400–600 ms to first token on o4-mini; gpt-5 is slower.

| Metric | 2026 Rate |
|---|---|
| Input tokens | $0.03 / 1M (cached) / $0.12 / 1M (fresh) |
| Output tokens | $0.06 / 1M |
| Code interpreter | $0.08 / 1M tokens + $0.03 / minute compute |
| File search | $0.05 / 1k queries |
| Vector store | $0.10 / GB / month |
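Plugging the token rates from this table into a quick estimator makes per-run costs concrete (rates hard-coded from the table above; tool, compute, and storage charges are ignored in this sketch):

```python
RATES_USD_PER_1M = {  # from the pricing table above
    "input_cached": 0.03,
    "input_fresh": 0.12,
    "output": 0.06,
}

def estimate_run_cost(cached_input_tokens: int,
                      fresh_input_tokens: int,
                      output_tokens: int) -> float:
    """Token cost of one run in USD, ignoring tool and storage charges."""
    cost = (cached_input_tokens * RATES_USD_PER_1M["input_cached"]
            + fresh_input_tokens * RATES_USD_PER_1M["input_fresh"]
            + output_tokens * RATES_USD_PER_1M["output"]) / 1_000_000
    return round(cost, 6)
```

The cached vs. fresh split matters: at these rates, a prompt served from cache is 4x cheaper than the same prompt sent fresh.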
```python
cached_prompt = client.beta.prompts.create(
    input="Review this PR for style and safety.",
    model="gpt-5",
    temperature=0.2
)
```
Pass `cached_prompt_id` instead of `instructions` on later runs.

```python
thread = client.beta.threads.create(
    messages=[...],
    tool_resources={"file_search": {"vector_store_ids": [...]}}
)
```
```python
run = client.beta.threads.runs.create(
    thread_id=thread.id,
    assistant_id=assistant.id,
    max_prompt_tokens=12_000,
    max_completion_tokens=4_000
)
```
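To pick a sensible `max_prompt_tokens`, a crude character-based estimate (~4 characters per token for English text) is often enough for budgeting. This heuristic is an approximation, not the real tokenizer:

```python
def rough_token_count(text: str) -> int:
    """Very rough token estimate: ~4 characters per token for English."""
    return max(1, len(text) // 4)

def fits_budget(messages: list[str], max_prompt_tokens: int = 12_000) -> bool:
    """Check whether combined messages are likely under the prompt cap."""
    return sum(rough_token_count(m) for m in messages) <= max_prompt_tokens
```

If the check fails, trim older messages (or rely on `truncation_strategy`) before creating the run.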
max_prompt_tokens includes messages + tool context.moderation tool to flag unsafe content before streaming.httpx calls raise RuntimeError.os.system, subprocess are blocked.math, random, numpy, pandas, matplotlib, PIL.text/*, application/pdf, application/vnd.openxmlformats-officedocument.*.metadata={"redact": false}.location="eu" flag pins threads & vector stores to Frankfurt.[[queues.consumers]]
max_batch_size = 10
max_retries = 3
[queues.producers]
queue = "assistant-runs"
[[r2_buckets]]
binding = "BUCKET"
bucket_name = "assistant-files"
Worker code:
```javascript
export default {
  async queue(batch, env) {
    const { client } = env.OPENAI;
    // Queue handlers receive a MessageBatch; each message's payload is in .body.
    for (const msg of batch.messages) {
      await client.beta.threads.runs.create({
        thread_id: msg.body.threadId,
        assistant_id: env.ASSISTANT_ID
      });
    }
  }
};
```
```yaml
containers:
  - name: assistant-proxy
    image: openai/assistant-proxy:v26
    env:
      - name: OPENAI_API_KEY
        valueFrom:
          secretKeyRef:
            name: openai-creds
            key: key
    ports:
      - containerPort: 8080
```
Proxy handles token budgeting, retry logic, and observability.
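The retry behavior such a proxy implements is typically exponential backoff with jitter. A generic sketch of the pattern (not the proxy's actual code):

```python
import random
import time

def with_retries(fn, max_retries: int = 3, base_delay: float = 0.5):
    """Call fn(), retrying on exception with exponential backoff + jitter."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_retries:
                raise  # out of retries: surface the last error
            # Double the delay each attempt; jitter avoids thundering herds.
            delay = base_delay * (2 ** attempt) * (1 + random.random() * 0.1)
            time.sleep(delay)
```

Wrap each `runs.create` call in `with_retries` so transient 429s and 5xxs do not fail the whole batch.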
Debugging and tuning tips:

- `client.beta.threads.runs.steps.list(thread_id, run_id)` shows tool-call attempts.
- `text-embedding-3-small` is the default embedding model; switch to `text-embedding-3-large` for better recall.
- Set `metadata={"language": "python"}` when uploading files.
- Use `truncation_strategy={"type": "auto", "last_messages": 10}` to keep only the last 10 messages.
- Back up retrieval data with a `vector_store` export.
- Set `max_completion_tokens` to cap output.
- Cache prompts (`client.beta.prompts`) to avoid regenerating instructions.
- Scope retrieval with `metadata={"whitelist": ["jira", "ticket"]}`.
- Tune the `moderation.model="text-moderation-007"` threshold in `assistants.create`.
- Fine-tuning is available via `client.fine_tuning` (private beta).

OpenAI's Assistants in 2026 abstract away the gritty details of prompt engineering, token counting, and tool orchestration, letting you focus on the intent of your workflow. Start small (one Assistant, one thread, one tool) and iterate. The new primitives are declarative, observable, and cost-capped, which makes it possible to ship production-grade AI helpers without becoming an LLM expert overnight.