An open source TypeScript implementation of Programmatic Tool Calling for AI Agents.
Codecall changes how agents interact with tools by letting them write and execute code instead of making individual tool calls that bloat context, increase cost, and slow everything down.
Works with MCP servers and standard tool definitions.
Note
Before reading :)
Please keep in mind that all of this is the future plan for Codecall and how it will work. Codecall is still a WIP and not production-ready.
This README describes the vision and architecture for how the system will function once completed. Features, API design, and implementation details are subject to change.
If you're interested in contributing or following the project, check back soon or open an issue to discuss ideas!
Traditional tool calling has fundamental architectural issues that get worse at scale:
Every tool definition lives in your system prompt. Connect a few MCP servers and you're burning tens of thousands of tokens before the conversation even starts.
GitHub MCP: 32 tools → ~60,000 tokens
Internal Tools: 12 tools → ~24,000 tokens
───────────────────────────────────────────────
Total: 44 tools → ~84,000 tokens (before any work happens)
Each tool call requires a full model inference pass. The entire conversation history gets sent back and forth every single time.
User: "Find all admin users and update their permissions"
Traditional approach:
Turn 1: [8,000 tokens] → get_all_users()
Turn 2: [18,000 tokens] → filter mentally, pick first admin
Turn 3: [19,500 tokens] → update_user(id1, ...)
Turn 4: [21,000 tokens] → update_user(id2, ...)
Turn 5: [22,500 tokens] → update_user(id3, ...)
...
Total: 150,000+ tokens, 12 inference passes
The problem also compounds because each tool call adds its output to the context, making every subsequent generation more expensive.
Benchmarks show models have a 10-50% failure rate when searching through large datasets in context. They hallucinate field names, miss entries, and get confused by similar data.
Doing this programmatically fixes the problem: the model just writes code, and because the code is deterministic, that step has a 0% failure rate:
users.filter((u) => u.role === "admin");

The special tokens used for tool calls (`<tool_call>`, `</tool_call>`) come from synthetic training data. Models don't have much exposure to tool-calling syntax; they've only seen contrived examples from training sets... but they DO have:
- Millions of lines of real world TypeScript
- Lots of experience writing code to call APIs
“Making an LLM perform tasks with tool calling is like putting Shakespeare through a month-long class in Mandarin and then asking him to write a play in it. It’s just not going to be his best work.”
— Cloudflare Engineering
Grok 4 was heavily trained on tool calling. The result? It still hallucinates tool call XML syntax in the middle of responses, writing the format but not triggering actual execution. The model "knows" the syntax exists but doesn't use it correctly.
Let models do what they're good at: writing code.
LLMs have enormous amounts of real-world TypeScript in their training data. They're significantly better at writing code to call APIs than they are at the arbitrary JSON matching that tool calling requires.
// Instead of 12+ inference passes and 150,000+ tokens:
const allUsers = await tools.users.listAllUsers();
const adminUsers = allUsers.filter((u) => u.role === "admin");
const resources = await tools.resources.getSensitiveResources();
progress({
step: "Data loaded",
admins: adminUsers.length,
resources: resources.length,
});
const revokedAccesses = [];
const failedAccesses = [];
for (const admin of adminUsers) {
for (const resource of resources) {
try {
const result = await tools.permissions.revokeAccess({
userId: admin.id,
resourceId: resource.id,
});
if (result.success) {
revokedAccesses.push({ admin: admin.name, resource: resource.name });
}
} catch (err) {
failedAccesses.push({
admin: admin.name,
resource: resource.name,
error: err.message,
});
}
}
}
return {
totalAdmins: adminUsers.length,
resourcesAffected: resources.length,
accessesRevoked: revokedAccesses.length,
accessesFailed: failedAccesses.length,
};

One inference pass. ~2,000 tokens. 98.7% reduction.
Codecall gives the model three tools to work with, so the model still controls the entire flow: it decides what to read, what code to write, when to execute, and how to respond. Everything stays fully agentic.
Instead of exposing every tool directly to the LLM for it to call, Codecall:
- Converts your MCP definitions into TypeScript SDK files (types + function signatures)
- Shows the model a directory tree of available files
- Allows the model to selectively read SDK files to understand types and APIs
- Lets the model write code to accomplish the task
- Executes that code in a Deno sandbox with access to your actual tools as functions
- Returns the execution result back (success/error)
- Lets the model produce a response or continue (a hypothetical end-to-end sketch of this loop follows below)
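Put together, a hypothetical agent loop wired to Codecall could look like the sketch below. None of the names here (createCodecall, toolDefinitions, handleToolCall, the llm client) are the final API; they are placeholders for the flow described above.

// Hypothetical sketch only: Codecall's real API is still being designed.
import { createCodecall } from "codecall"; // placeholder entry point

declare const llm: {
  chat(req: { messages: any[]; tools: unknown[] }): Promise<{
    content: string;
    toolCalls?: { id: string; name: string; arguments: unknown }[];
  }>;
};

const codecall = await createCodecall({
  mcpServers: [{ name: "github", url: "http://localhost:3001/mcp" }],
});

// The model only ever sees three tools: listFiles, readFile, executeCode.
const messages: any[] = [
  { role: "user", content: "Revoke admin access to all sensitive resources" },
];

while (true) {
  const response = await llm.chat({ messages, tools: codecall.toolDefinitions() });
  if (!response.toolCalls?.length) break; // model produced its final answer

  for (const call of response.toolCalls) {
    // Codecall returns the file tree, file contents, or sandbox execution result.
    const result = await codecall.handleToolCall(call.name, call.arguments);
    messages.push({ role: "tool", toolCallId: call.id, content: JSON.stringify(result) });
  }
}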
Returns the SDK file tree showing all available tools as files
Example:
listFiles() ->
tools/
├─ users/
│ ├─ listAllUsers.ts
│ ├─ getUser.ts
│ ├─ updateUser.ts
│ └─ ...
├─ permissions/
│ ├─ revokeAccess.ts
│ ├─ grantAccess.ts
│ ├─ listPermissions.ts
│ └─ ...
├─ resources/
│  ├─ getSensitiveResources.ts
│  └─ listResources.ts
└─ ...
Returns the full contents of a specific SDK file, including type definitions, function signatures, and schemas.
Example:
readFile({ path: "tools/users/listAllUsers.ts" }); ->
// /tools/users/listAllUsers.ts
// SDK stub for tool: "users.listAllUsers"
export interface ListAllUsersInput {
limit?: number;
offset?: number;
}
export interface User {
id: string;
name: string;
email: string;
role: "admin" | "user" | "guest";
department: string;
createdAt: string;
}
export async function listAllUsers(input: ListAllUsersInput): Promise<User[]> {
return call("users.listAllUsers", input);
}

Executes TypeScript code in a Deno sandbox. Returns either the successful output or an error with the execution trace.
Example:
executeCode(`
const users = await tools.users.listAllUsers({ limit: 100 });
return users.filter(u => u.role === "admin");
`);

Success returns:
{
status: "success",
output: [
{ id: "1", name: "Alice", role: "admin", ... },
{ id: "2", name: "Bob", role: "admin", ... }
]
}

Error returns:
{
status: "error",
error: "ToolError: revokeAccess expected object { userId: string, resourceId: string }, got (string, string)",
executionTrace: [
{ step: 1, function: "listAllUsers", input: {}, output: [...] },
{ step: 2, function: "revokeAccess", input: ["admin-1", "resource-db-prod"], error: "Invalid Argument Schema" }
],
failedCode: "const result = await tools.permissions.revokeAccess(admin.id, resource.id);"
}

When the model calls executeCode(), Codecall runs that code inside a fresh, short-lived Deno sandbox. Each sandbox is spun up using Deno and runs the code in isolation, and Deno's security model blocks access to sensitive capabilities unless they are explicitly allowed.
By default, the sandboxed code has no access to the filesystem, network, environment variables, or system processes. The only way it can interact with the outside world is by calling the tool functions exposed through tools (which are forwarded by Codecall to the MCP server).
┌─────────────────────────────────────────────────────────────────────────────────────────┐
│ SANDBOX LIFECYCLE │
│ │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ SPAWN │────▶│ INJECT │────▶│ EXECUTE │────▶│ CAPTURE │────▶│ DESTROY │ │
│ └─────────┘ └─────────┘ └─────────┘ └─────────┘ └─────────┘ │
│ │ │ │ │ │ │
│ ▼ ▼ ▼ ▼ ▼ │
│ Fresh Deno tools proxy Run generated Collect return Terminate │
│ process with + progress() TypeScript value or error process, │
│ deny-all injected code + exec trace cleanup │
│ permissions │
│ │
└─────────────────────────────────────────────────────────────────────────────────────────┘
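As a rough sketch of the SPAWN → EXECUTE → DESTROY steps, assuming the host runs on Node and shells out to a locally installed Deno binary (the tools/progress bridge, which would be wired over an IPC channel such as stdio, is omitted here):

// Sketch only, not the actual implementation.
import { spawn } from "node:child_process";
import { writeFile, rm } from "node:fs/promises";
import { randomUUID } from "node:crypto";
import { tmpdir } from "node:os";
import { join } from "node:path";

async function runInSandbox(generatedCode: string): Promise<string> {
  const entry = join(tmpdir(), `codecall-${randomUUID()}.ts`);
  await writeFile(entry, generatedCode); // INJECT: write the generated script

  // SPAWN: no --allow-* flags, so Deno denies filesystem, network, env, and
  // subprocess access by default; --no-prompt turns permission prompts into errors.
  const proc = spawn("deno", ["run", "--no-prompt", entry]);

  let stdout = "";
  let stderr = "";
  proc.stdout.on("data", (chunk) => (stdout += chunk));
  proc.stderr.on("data", (chunk) => (stderr += chunk));

  // CAPTURE: wait for the process to exit and collect its output.
  const exitCode = await new Promise<number>((resolve) =>
    proc.on("close", (code) => resolve(code ?? 1))
  );

  await rm(entry, { force: true }); // DESTROY: clean up the temp entrypoint

  if (exitCode !== 0) throw new Error(`Sandbox failed (exit ${exitCode}): ${stderr}`);
  return stdout;
}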
┌─────────────────────────────────────────────────────────────────────────────────────────┐
│ DATA FLOW │
│ │
│ │
│ SANDBOX TOOL BRIDGE MCP SERVER │
│ │ │ │ │
│ │ tools.users.listAllUsers() │ │ │
│ │ ─────────────────────────────▶│ │ │
│ │ │ │ │
│ │ │ tools/call: listAllUsers │ │
│ │ │ ──────────────────────────────────▶│ │
│ │ │ │ │
│ │ │ [{ id, name, role }, ...] │ │
│ │ │ ◀──────────────────────────────────│ │
│ │ │ │ │
│ │ Promise<User[]> resolved │ │ │
│ │ ◀─────────────────────────────│ │ │
│ │ │ │ │
│ │ (code continues execution) │ │ │
│ │ │ │ │
│ │ progress({ step: "Done" }) │ │ │
│ │ ─────────────────────────────▶│ │ │
│ │ │ │ │
│ │ Streams to UI │ │
│ │ │ │ │
│ │ return { success: true } │ │ │
│ │ ─────────────────────────────▶│ │ │
│ │ │ │ │
│ │ Result sent to Model │ │
│ │ for response generation │ │
│ │ │ │ │
│ ▼ │
└─────────────────────────────────────────────────────────────────────────────────────────┘
When the generated code runs, Codecall injects a real tools object into the sandbox.
- `tools` is not a set of local functions; it's a small runtime bridge provided by Codecall
- Each call to `tools.*` is forwarded to the real tool implementation
So when the model calls executeCode() with code like this:
const result = await tools.permissions.revokeAccess({
userId: admin.id,
resourceId: resource.id,
reason: "security-audit",
});

What actually happens is:
- The sandbox captures the tool name (`"permissions.revokeAccess"`) and arguments
- Codecall forwards that request to the connected MCP server using `tools/call`
- The MCP server executes the real tool
- The result is returned back to the sandbox
- The script continues running
From the code’s perspective this behaves exactly like calling a normal async function.
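One way the sandbox-side tools object could be built is with a Proxy that captures the namespace and tool name and hands them to a low-level transport. The sketch below assumes a call(toolName, input) function that sends the request back to the Codecall host, which then issues the real MCP tools/call; the names are illustrative, not the actual implementation.

// Assumed transport back to the Codecall host (e.g. over stdio messages).
declare function call(toolName: string, input: unknown): Promise<unknown>;

function makeToolsBridge() {
  return new Proxy({} as Record<string, Record<string, (input?: unknown) => Promise<unknown>>>, {
    // `tools.users` resolves to a namespace proxy...
    get(_target, namespace) {
      return new Proxy({} as Record<string, (input?: unknown) => Promise<unknown>>, {
        // ...and `tools.users.listAllUsers` resolves to a forwarding function.
        get(_ns, toolName) {
          return (input?: unknown) => call(`${String(namespace)}.${String(toolName)}`, input);
        },
      });
    },
  });
}

const tools = makeToolsBridge();
// tools.permissions.revokeAccess({ userId, resourceId }) now resolves with
// whatever the real MCP tool returned via the host's tools/call request.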
The model can use progress() when writing code to provide real-time feedback during long-running operations. While the model could also achieve this by making multiple smaller executeCode() calls, using progress() within a single execution is more efficient, gives better context, and reduces the number of steps.
Because Codecall's main benefit comes from executing comprehensive code in a single pass, progress updates are important for two reasons:
- Better UX: Users see real-time feedback during long-running operations without multiple model calls adding cost and latency
- Model awareness: The model receives progress logs in the `executeCode()` response and can reference them when explaining what it did.
So for example, in your system prompt you can tell the model to use progress():
When writing code, use progress(...) to show meaningful updates so the user can see what is happening. For example:
progress("Loading data...");
progress({ step: "Processing", current: i, total });
progress({ step: "Sending emails", done: count });
Agent Code Example
const allUsers = await tools.users.listAllUsers({ limit: 5000 });
progress({
step: "Loaded all users",
totalCount: allUsers.length,
adminCount: allUsers.filter((u) => u.role === "admin").length,
});
const adminUsers = allUsers.filter((u) => u.role === "admin");
const sensitiveResources = await tools.resources.getSensitiveResources();
progress({
step: "Loaded sensitive resources",
resourceCount: sensitiveResources.length,
resourceNames: sensitiveResources.map((r) => r.name),
});
const revokedAccesses = [];
const failedAccesses = [];
for (let i = 0; i < adminUsers.length; i++) {
const admin = adminUsers[i];
for (let j = 0; j < sensitiveResources.length; j++) {
const resource = sensitiveResources[j];
try {
const result = await tools.permissions.revokeAccess({
userId: admin.id,
resourceId: resource.id,
reason: "security-audit",
});
if (result.success) {
revokedAccesses.push({
admin: admin.name,
email: admin.email,
resource: resource.name,
timestamp: result.timestamp,
});
} else {
failedAccesses.push({
admin: admin.name,
resource: resource.name,
reason: result.reason || "unknown",
});
}
if ((revokedAccesses.length + failedAccesses.length) % 10 === 0) {
progress({
step: "Revoking access",
admin: admin.name,
resource: resource.name,
processed: revokedAccesses.length + failedAccesses.length,
revoked: revokedAccesses.length,
failed: failedAccesses.length,
});
}
} catch (err) {
failedAccesses.push({
admin: admin.name,
resource: resource.name,
error: err.message,
});
}
}
}
progress({
step: "Access revocation complete",
revoked: revokedAccesses.length,
failed: failedAccesses.length,
});
return {
execution: {
totalAdminsProcessed: adminUsers.length,
totalResourcesAffected: sensitiveResources.length,
totalAttempted: revokedAccesses.length + failedAccesses.length,
accessesRevoked: revokedAccesses.length,
accessesFailed: failedAccesses.length,
successPercentage: Math.round(
(revokedAccesses.length /
(revokedAccesses.length + failedAccesses.length)) *
100
),
},
revokedDetails: revokedAccesses.map((r) => ({
...r,
status: "success",
})),
failureDetails: failedAccesses.slice(0, 25),
};

This keeps the UX of a "step by step" agent with user-facing intermediate updates, while still getting the cost and speed benefits of single-pass execution.
Benchmarks show Claude Opus 4.1 performs:
- 42.3% on Python
- 47.7% on TypeScript
That's roughly a 12% relative improvement just from language choice, and other models show the same pattern.
TypeScript also gives you:
- Full type inference for SDK generation
- Compile-time validation of tool schemas (see the sketch after this list)
- The model sees types and can use them correctly
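For illustration, here is the kind of mistake that compile-time checking against the generated stubs would catch, assuming generated code is type-checked before it is handed to the sandbox (the import path mirrors the SDK layout shown earlier):

import { revokeAccess } from "./tools/permissions/revokeAccess.ts";

// Rejected by the compiler: revokeAccess expects a single { userId, resourceId } object,
// not two positional strings (the exact bug from the error example above).
// await revokeAccess("admin-1", "resource-db-prod");

// Compiles: the argument matches the generated input interface.
await revokeAccess({ userId: "admin-1", resourceId: "resource-db-prod" });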
MCP tool definitions include inputSchema (what you pass to a tool), but outputSchema is optional and most servers never provide it. This matters because Codecall generates TypeScript code that chains tool calls together. Without knowing what a tool returns, the model has to guess the structure, leading to runtime errors.
Example of the problem:
const tasks = await tools.todoist.getTasks({ filter: "today" });
for (const task of tasks) {
console.log(task.title); // BUG: actual property is "name", not "title"

if (task.dueDate === "2024-01-15") { ... }
// BUG: actual structure is task.due, not task.dueDate
}

The code looks correct but fails at runtime because the model hallucinated the return type based on common naming patterns...
We haven't fully solved this (that would require MCP servers to provide outputSchema), but we've implemented a hack that works in practice:
- Tool Classification - We use an LLM to classify each tool as `read`, `write`, `destructive`, or `write_read` based on its semantics
- Output Schema Discovery - For tools classified as `read` or `write_read`, we generate safe sample inputs and actually call the tool
- Schema Inference - We capture the real response and infer a JSON schema from it (a simplified sketch follows below)
- Typed SDK Generation - The inferred schema is passed to the SDK generator, producing proper TypeScript output types
This means tools like search_engine now generate SDKs with accurate output types based on real API responses, not guesses.
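As a rough illustration of the Schema Inference step, deriving a JSON schema from a single sampled response can be as simple as the sketch below; the real generator would also need to handle unions, empty arrays, and nullability.

type JsonSchema = Record<string, unknown>;

function inferSchema(sample: unknown): JsonSchema {
  if (sample === null) return { type: "null" };
  if (Array.isArray(sample)) {
    // Infer the item schema from the first element; a single sample can miss variants.
    return { type: "array", items: sample.length ? inferSchema(sample[0]) : {} };
  }
  if (typeof sample === "object") {
    const properties: Record<string, JsonSchema> = {};
    for (const [key, value] of Object.entries(sample as Record<string, unknown>)) {
      properties[key] = inferSchema(value);
    }
    return { type: "object", properties };
  }
  return { type: typeof sample }; // "string" | "number" | "boolean" for JSON values
}

// Example: a sampled users.listAllUsers response yields
// { type: "array", items: { type: "object", properties: { id: { type: "string" }, ... } } }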
Limitations:
- Requires actually calling the tools during SDK generation
- Single sample responses may miss optional fields or variant shapes
- Write+Read tools create real data (we use identifiable test names like `codecall_test_*`)
A second, more fundamental challenge is that many MCP servers return plain strings or markdown, not structured data...
In these cases:
- The output has no stable shape
- There are no fields to index into
- There is nothing meaningful to type beyond string
From Codecall’s perspective, this means:
- No reliable code generation beyond simple passthrough
- No safe composition of tool outputs
- No advantage over a traditional agent that directly interprets text
This is not a limitation of Codecall, but a reflection of how the tools were designed.
Because Codecall focuses on deterministic, type-safe code generation, its benefits disappear when tool outputs are unstructured. In those cases, interpretation must happen in the LLM itself, which moves the system back into standard agent behavior.
Sadly, there is no reliable workaround when using external MCP servers: if you do not control the tool, you cannot enforce structured outputs.
WIP, please check back soon or feel free to add here :)
Still working out the high-level architecture and how everything should flow together.
We welcome contributions! Feel free to:
- Open issues for bugs or feature requests
- Submit PRs for improvements
- Share your use cases and feedback
This project builds on ideas from the community and is directly inspired by:
- Yannic Kilcher – What Cloudflare's code mode misses about MCP and tool calling
- Theo – Anthropic admits that MCP sucks & Anthropic is trying SO hard to fix MCP...
- Boundary – Using MCP server with 10000+ tools: 🦄 Ep #7
- Cloudflare – Code mode: the better way to use MCP
- Anthropic – Code execution with MCP: building more efficient AI agents & Introducing advanced tool use
- Medium – Your Agent Is Wasting Money On Tools. Code Execution With MCP Fixes It.
MIT