
OpenAI’s ecosystem in 2026 is built around Assistants, a first-class abstraction that packages models, tools, instructions, and memory into a single unit. Below is a practical guide that walks you through every step—from creating your first Assistant to wiring it into an end-to-end workflow—complete with code snippets, FAQs, and tips that reflect the current state of the platform.
In 2026 the OpenAI API is largely declarative: you describe what you want, not how to achieve it.
| Concept | 2026 Abstraction | What You Provide |
|---|---|---|
| Model | model string ("gpt-5", "o3-mini") | Instruction set & temperature |
| Tools | tools array (code interpreter, function calls, file search, web search) | JSON schema & Python functions |
| Memory | vector_store + thread | File IDs, chunking strategy, retention rules |
| Prompt | instructions | System-level persona, tone, guardrails |
| State | thread | Conversation history & metadata |
Key changes from 2024:
- Everything executes as a Run: an Assistant run against a Thread.
- File search (`vector_store`) is vector-only; hybrid BM25 retrieval is deprecated.

Creating an Assistant:

```python
from openai import OpenAI

client = OpenAI(api_key="sk-...")

assistant = client.beta.assistants.create(
    name="CodeReviewer",
    instructions="You are a senior Python engineer. Review PRs for style, safety, and performance.",
    model="gpt-5",
    tools=[
        {"type": "code_interpreter"},
        {"type": "file_search", "vector_store_ids": ["vs_abc123"]}
    ],
    temperature=0.2
)
```
- `model` → pick the smartest model you can afford (gpt-5 ≥ o4-mini).
- `tools` → order matters; the code interpreter runs before file search.
- `vector_store_ids` → attaches a pre-created vector store (see §3).

Threads are ephemeral by default:
```python
thread = client.beta.threads.create(
    messages=[
        {
            "role": "user",
            "content": "Review this PR: https://github.com/.../pull/123"
        }
    ]
)
```
If you need persistence, set `metadata={"retention": "30d"}` and store the `thread_id` in your DB.
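If you plan to resume conversations later, the thread-ID bookkeeping can be sketched like this (the dict stands in for a real database; `client` is the OpenAI client created earlier):

```python
# Sketch: map your own conversation IDs to OpenAI thread IDs so a
# conversation can be resumed. A real app would persist this in a DB.
thread_store: dict[str, str] = {}  # conversation_id -> thread_id

def get_or_create_thread(client, conversation_id: str) -> str:
    """Reuse the existing thread for this conversation, or create one."""
    if conversation_id in thread_store:
        return thread_store[conversation_id]
    thread = client.beta.threads.create(
        metadata={"retention": "30d"}  # keep the thread around for 30 days
    )
    thread_store[conversation_id] = thread.id
    return thread.id
```

Calling it twice with the same conversation ID returns the same thread, so context accumulates across requests.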
```python
run = client.beta.threads.runs.create(
    thread_id=thread.id,
    assistant_id=assistant.id,
    instructions="Focus on type hints and exception safety."
)
```
Monitor status:
```python
status = client.beta.threads.runs.retrieve(thread_id=thread.id, run_id=run.id)

if status.status == "completed":
    messages = client.beta.threads.messages.list(thread_id=thread.id)
    print(messages.data[0].content[0].text.value)
```
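Rather than checking the status once, a small polling helper is the usual pattern. A sketch, assuming the terminal states shown (`completed`, `failed`, `cancelled`, `expired`):

```python
import time

TERMINAL_STATES = {"completed", "failed", "cancelled", "expired"}

def wait_for_run(client, thread_id: str, run_id: str,
                 poll_interval: float = 0.5, timeout: float = 120.0):
    """Poll a run until it reaches a terminal status or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        run = client.beta.threads.runs.retrieve(
            thread_id=thread_id, run_id=run_id
        )
        if run.status in TERMINAL_STATES:
            return run
        time.sleep(poll_interval)
    raise TimeoutError(f"run {run_id} did not finish within {timeout}s")
```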
```python
vector_store = client.beta.vector_stores.create(name="PythonStyleGuide")

for file in ["pep8.md", "mypy.md"]:
    client.beta.vector_stores.files.upload(
        vector_store_id=vector_store.id,
        file=open(file, "rb"),
        chunking_strategy={"type": "static", "max_chunk_size_tokens": 800}
    )
```
- `chunking_strategy` defaults to 800 tokens; you can set `max_chunk_size_tokens` up to 4096.
- Supported file types: `.txt`, `.pdf`, `.md`, `.docx`, `.pptx`, `.csv`, `.jsonl`.

Attach the store to the Assistant:

```python
assistant = client.beta.assistants.update(
    assistant_id=assistant.id,
    tool_resources={"file_search": {"vector_store_ids": [vector_store.id]}}
)
```
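To build intuition for what a static `chunking_strategy` does, here is a rough stand-alone sketch that splits text into fixed-size chunks, using whitespace-separated words as a stand-in for real tokenizer tokens:

```python
def chunk_tokens(text: str, max_chunk_size_tokens: int = 800) -> list[str]:
    """Split text into chunks of at most `max_chunk_size_tokens` 'tokens'.

    Whitespace words approximate tokens here; the platform's chunker
    uses the model's actual tokenizer.
    """
    tokens = text.split()
    return [
        " ".join(tokens[i:i + max_chunk_size_tokens])
        for i in range(0, len(tokens), max_chunk_size_tokens)
    ]
```

Larger chunks give the model more context per retrieval hit; smaller chunks give finer-grained matches.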
- Re-sync the `vector_store` if files change.
- Set `ranking_options={"ranker": "default"}` for better precision.

Define a function tool:

```python
tools = [
    {
        "type": "function",
        "function": {
            "name": "fetch_github_pr",
            "description": "Fetch a GitHub PR diff.",
            "parameters": {
                "type": "object",
                "properties": {
                    "owner": {"type": "string"},
                    "repo": {"type": "string"},
                    "pr_number": {"type": "integer"}
                },
                "required": ["owner", "repo", "pr_number"]
            }
        }
    }
]
```
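Before invoking the local function, it can pay to validate the model-supplied arguments against this schema. A minimal hand-rolled check (a production system might use the `jsonschema` package instead):

```python
# Map JSON Schema primitive types to Python types for a shallow check.
_TYPE_MAP = {"string": str, "integer": int, "number": (int, float),
             "boolean": bool, "object": dict, "array": list}

def validate_args(schema: dict, args: dict) -> list[str]:
    """Return a list of problems; an empty list means the args look valid."""
    errors = []
    for key in schema.get("required", []):
        if key not in args:
            errors.append(f"missing required field: {key}")
    for key, spec in schema.get("properties", {}).items():
        if key in args and not isinstance(args[key], _TYPE_MAP[spec["type"]]):
            errors.append(f"{key}: expected {spec['type']}")
    return errors
```

Rejecting malformed arguments before the tool runs turns silent failures into actionable error messages you can feed back to the model.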
```python
import httpx

def fetch_github_pr(owner: str, repo: str, pr_number: int) -> str:
    """Download the raw diff for a pull request."""
    diff_url = f"https://patch-diff.githubusercontent.com/raw/{owner}/{repo}/pull/{pr_number}.diff"
    return httpx.get(diff_url).text
```
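When several function tools are registered, a small dispatch table keeps the run loop tidy. A sketch with a hypothetical toy tool (`add`) standing in for `fetch_github_pr`:

```python
import json

TOOL_REGISTRY: dict = {}

def tool(fn):
    """Decorator: register a Python function as a callable tool by name."""
    TOOL_REGISTRY[fn.__name__] = fn
    return fn

@tool
def add(a: int, b: int) -> int:  # toy tool for illustration
    return a + b

def dispatch_tool_call(name: str, arguments_json: str):
    """Route a tool call (name + JSON-encoded args) to its function."""
    if name not in TOOL_REGISTRY:
        raise KeyError(f"unknown tool: {name}")
    return TOOL_REGISTRY[name](**json.loads(arguments_json))
```

In the run loop you would call `dispatch_tool_call(tool_call.function.name, tool_call.function.arguments)` instead of hard-coding one function.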
```python
assistant = client.beta.assistants.update(
    assistant_id=assistant.id,
    tools=[*tools, {"type": "file_search", "vector_store_ids": [...]}]
)
```
```python
import json

run = client.beta.threads.runs.create(
    thread_id=thread.id,
    assistant_id=assistant.id
)

# Stream events; EventHandler is your subclass of the SDK's event handler.
for event in client.beta.threads.runs.stream(
    thread_id=thread.id,
    run_id=run.id,
    event_handler=EventHandler()
):
    if event.event == "thread.run.step.completed":
        step = event.data
        if step.step_details.type == "tool_calls":
            for tool_call in step.step_details.tool_calls:
                args = json.loads(tool_call.function.arguments)
                result = fetch_github_pr(**args)
                client.beta.threads.runs.submit_tool_outputs(
                    thread_id=thread.id,
                    run_id=run.id,
                    tool_outputs=[{"tool_call_id": tool_call.id, "output": result}]
                )
```
```python
with client.beta.threads.messages.stream(
    thread_id=thread.id,
    event_handler=MessageStreamHandler()
) as stream:
    for text in stream.text_deltas:
        yield text
```
```python
run = client.beta.threads.runs.create(
    thread_id=thread.id,
    assistant_id=assistant.id,
    stream=True,
    truncation_strategy={"type": "auto"}
)
```
- `truncation_strategy` defaults to 16k tokens; set `max_prompt_tokens` to control cost.
- Latency: expect roughly 400–600 ms to first token on o4-mini; gpt-5 is slower.

| Metric | 2026 Rate |
|---|---|
| Input tokens | $0.03 / 1M (cached) / $0.12 / 1M (fresh) |
| Output tokens | $0.06 / 1M |
| Code interpreter | $0.08 / 1M tokens + $0.03 / minute compute |
| File search | $0.05 / 1k queries |
| Vector store | $0.10 / GB / month |
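Plugging the token rates from this table into a quick estimator makes per-run costs concrete (rates hard-coded from the table above; tool, compute, and storage charges are ignored in this sketch):

```python
RATES_USD_PER_1M = {  # from the pricing table above
    "input_cached": 0.03,
    "input_fresh": 0.12,
    "output": 0.06,
}

def estimate_run_cost(cached_input_tokens: int,
                      fresh_input_tokens: int,
                      output_tokens: int) -> float:
    """Token cost of one run in USD, ignoring tool and storage charges."""
    cost = (cached_input_tokens * RATES_USD_PER_1M["input_cached"]
            + fresh_input_tokens * RATES_USD_PER_1M["input_fresh"]
            + output_tokens * RATES_USD_PER_1M["output"]) / 1_000_000
    return round(cost, 6)
```

The cached vs. fresh split matters: at these rates, a prompt served from cache is 4x cheaper than the same prompt sent fresh.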
```python
cached_prompt = client.beta.prompts.create(
    input="Review this PR for style and safety.",
    model="gpt-5",
    temperature=0.2
)
```
Pass `cached_prompt_id` instead of `instructions` on later runs.

```python
thread = client.beta.threads.create(
    messages=[...],
    tool_resources={"file_search": {"vector_store_ids": [...]}}
)
```
```python
run = client.beta.threads.runs.create(
    thread_id=thread.id,
    assistant_id=assistant.id,
    max_prompt_tokens=12_000,
    max_completion_tokens=4_000
)
```
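To pick a sensible `max_prompt_tokens`, a crude character-based estimate (~4 characters per token for English text) is often enough for budgeting. This heuristic is an approximation, not the real tokenizer:

```python
def rough_token_count(text: str) -> int:
    """Very rough token estimate: ~4 characters per token for English."""
    return max(1, len(text) // 4)

def fits_budget(messages: list[str], max_prompt_tokens: int = 12_000) -> bool:
    """Check whether combined messages are likely under the prompt cap."""
    return sum(rough_token_count(m) for m in messages) <= max_prompt_tokens
```

If the check fails, trim older messages (or rely on `truncation_strategy`) before creating the run.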
max_prompt_tokens includes messages + tool context.moderation tool to flag unsafe content before streaming.httpx calls raise RuntimeError.os.system, subprocess are blocked.math, random, numpy, pandas, matplotlib, PIL.text/*, application/pdf, application/vnd.openxmlformats-officedocument.*.metadata={"redact": false}.location="eu" flag pins threads & vector stores to Frankfurt.[[queues.consumers]]
max_batch_size = 10
max_retries = 3
[queues.producers]
queue = "assistant-runs"
[[r2_buckets]]
binding = "BUCKET"
bucket_name = "assistant-files"
Worker code:
```javascript
export default {
  async queue(batch, env) {
    const { client } = env.OPENAI;
    // Queue handlers receive a MessageBatch; each message's payload is in .body.
    for (const msg of batch.messages) {
      await client.beta.threads.runs.create({
        thread_id: msg.body.threadId,
        assistant_id: env.ASSISTANT_ID
      });
    }
  }
};
```
```yaml
containers:
  - name: assistant-proxy
    image: openai/assistant-proxy:v26
    env:
      - name: OPENAI_API_KEY
        valueFrom:
          secretKeyRef:
            name: openai-creds
            key: key
    ports:
      - containerPort: 8080
```
Proxy handles token budgeting, retry logic, and observability.
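The retry behavior such a proxy implements is typically exponential backoff with jitter. A generic sketch of the pattern (not the proxy's actual code):

```python
import random
import time

def with_retries(fn, max_retries: int = 3, base_delay: float = 0.5):
    """Call fn(), retrying on exception with exponential backoff + jitter."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_retries:
                raise  # out of retries: surface the last error
            # Double the delay each attempt; jitter avoids thundering herds.
            delay = base_delay * (2 ** attempt) * (1 + random.random() * 0.1)
            time.sleep(delay)
```

Wrap each `runs.create` call in `with_retries` so transient 429s and 5xxs do not fail the whole batch.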
Debugging and tuning tips:

- `client.beta.threads.runs.steps.list(thread_id, run_id)` shows tool-call attempts.
- `text-embedding-3-small` is the default embedding model; switch to `text-embedding-3-large` for better recall.
- Set `metadata={"language": "python"}` when uploading files.
- Use `truncation_strategy={"type": "auto", "last_messages": 10}` to keep only the last 10 messages.
- Back up retrieval data with a `vector_store` export.
- Set `max_completion_tokens` to cap output.
- Cache prompts (`client.beta.prompts`) to avoid regenerating instructions.
- Scope retrieval with `metadata={"whitelist": ["jira", "ticket"]}`.
- Tune the `moderation.model="text-moderation-007"` threshold in `assistants.create`.
- Fine-tuning is available via `client.fine_tuning` (private beta).

OpenAI's Assistants in 2026 abstract away the gritty details of prompt engineering, token counting, and tool orchestration, letting you focus on the intent of your workflow. Start small (one Assistant, one thread, one tool) and iterate. The new primitives are declarative, observable, and cost-capped, which makes it possible to ship production-grade AI helpers without becoming an LLM expert overnight.