News
What's new in AI Roblox development
May 2, 2026 - covering changes since March
A lot has shipped since the Studio MCP server stabilized in March. Roblox announced an agentic direction for Studio, shipped Planning Mode and a Playtesting Agent, added Procedural Models, and made wiring external tools into Studio's MCP server a one-click setup. The OpenGameEval benchmark also more than doubled in size, and several new models landed on the leaderboard. Here's what changed and what it means for your workflow.
Studio is going agentic
On April 15, Roblox SVP of Engineering Nick Tornow announced that the Assistant is moving from a single-prompt tool to an autonomous development partner that plans, builds, and tests - the same loop a human developer follows. According to Roblox, 44% of the top 1,000 creators already use Roblox Assistant or third-party AI tools via MCP.
The agentic direction has three pillars.
1. Plan: Planning Mode
Announced May 1, Planning Mode changes how Assistant handles a request. Instead of acting immediately, it analyzes your code and data model, asks clarifying questions, and produces an editable action plan. You review, modify, or reorder steps before any change happens. Each session creates checkpoints you can revert to.
The plan acts as a small design document that the agent then executes. It integrates with three generation tools - Material Generation, Mesh Generation, and Creator Store imports - and a verification step that playtests the result, reads logs, captures screenshots, and feeds bugs back for automatic fixes.
Coming next: multi-agent execution where separate agents handle scripting, layout, and UI concurrently, plus a node graph for visualizing and editing the plan-build-test workflow.
2. Build: Procedural Models with AI generation
Procedural Models (edit-time beta, April 30) are a new Instance type that rebuilds its contents when parameters change. Assign a Generator Module written in Luau, define attributes that trigger regeneration, and the model updates on edit. Change the Size and a bookcase adds shelves; switch a Material attribute from wood to stone and every surface updates.
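As a rough mental model, a Generator Module boils down to: read the attributes that drive the model, clear the previous contents, rebuild from primitive parts. Here is a minimal Luau sketch of that pattern; the module shape, the generate entry point, and the ShelfCount/Width/Material attributes are assumptions for illustration, not the shipped API.

```lua
-- Hypothetical bookcase generator, illustration only.
-- The actual Procedural Models contract may differ; this just shows the
-- read-attributes -> clear -> rebuild pattern described above.
local Generator = {}

function Generator.generate(model: Model)
	-- Attributes that trigger regeneration when changed in Studio.
	local shelfCount = model:GetAttribute("ShelfCount") or 4
	local width = model:GetAttribute("Width") or 6
	local materialName = model:GetAttribute("Material") or "Wood"

	-- Resolve the material safely; fall back if the attribute isn't a valid enum name.
	local ok, material = pcall(function()
		return Enum.Material[materialName]
	end)
	if not ok then
		material = Enum.Material.Wood
	end

	-- Clear the previous build before regenerating.
	model:ClearAllChildren()

	-- Rebuild shelves from primitive parts.
	for i = 1, shelfCount do
		local shelf = Instance.new("Part")
		shelf.Name = "Shelf" .. i
		shelf.Size = Vector3.new(width, 0.4, 2)
		shelf.Position = Vector3.new(0, i * 1.5, 0)
		shelf.Anchored = true
		shelf.Material = material
		shelf.Parent = model
	end
end

return Generator
```

With a module like this assigned, changing the ShelfCount or Material attribute in the Properties panel would re-run the generator, which is the bookcase behavior described above.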
You can generate them through the Assistant (`/generate_procedural_model a medieval castle with terracotta roofs`) or through the MCP server's `generate_procedural_model` function. The Assistant picks attributes from your prompt and produces models built from primitive parts you can keep editing. They can be published to the Creator Store with automatic sandboxing that blocks access to sensitive APIs.
The interesting part is reusability: instead of a one-off model, you get a parameterized asset that can be adjusted, shared, and composed with other Procedural Models.
3. Test: Playtesting Agent
The Playtesting Agent (beta) acts as an automated tester. It runs your game against the original plan, reads output logs, simulates keyboard and mouse input, and verifies behavior on its own. Bugs it surfaces feed back into the next round of planning, closing the loop.
First teased on March 6 as MCP-powered playtest automation, it's now wired into Planning Mode's verification step. You no longer need to manually playtest after every iteration.
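The agent's internals aren't public, but the "read the logs, surface bugs" half of the loop maps onto an API any developer can use today: LogService. Here is a small Luau sketch of that idea run during a playtest; the 30-second window and the issues table are arbitrary choices for the example, not part of the actual agent.

```lua
-- Conceptual sketch only: capture warnings and errors during a playtest run
-- so they can be fed back into the next planning iteration.
local LogService = game:GetService("LogService")

local issues = {}

LogService.MessageOut:Connect(function(message, messageType)
	-- Collect anything that looks like a failure while the game runs.
	if messageType == Enum.MessageType.MessageError
		or messageType == Enum.MessageType.MessageWarning then
		table.insert(issues, { kind = messageType.Name, text = message })
	end
end)

-- After an arbitrary test window, report what was captured.
task.delay(30, function()
	print(("Playtest window finished: %d issue(s) captured"):format(#issues))
	for _, issue in issues do
		print(issue.kind, issue.text)
	end
end)
```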
Data Model Search Subagent: smarter exploration
Shipped April 28, this addresses a real problem with large projects: as tool calls pile up in the conversation, the agent has less room to reason. The Data Model Search Subagent runs on its own thread to explore the project, then returns a compact summary. Multiple subagents can run concurrently, investigating different parts of the codebase in parallel.
For a 50,000-line game with hundreds of instances, this is the difference between an agent that loses the plot and one that doesn't.
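The subagent itself is part of the Assistant rather than something you script, but the pattern it implements (explore one slice of the project, hand back a compact summary instead of raw tool output) is easy to picture. A rough Luau sketch of that idea, with the containers and summary format chosen arbitrarily for illustration:

```lua
-- Illustration of the "explore, then summarize" pattern, not the subagent itself.
-- Each explorer walks one container and returns a compact one-line summary
-- instead of the full instance tree.
local function summarize(container: Instance): string
	local counts = {}
	for _, descendant in container:GetDescendants() do
		counts[descendant.ClassName] = (counts[descendant.ClassName] or 0) + 1
	end

	local parts = {}
	for className, count in counts do
		table.insert(parts, ("%s x%d"):format(className, count))
	end
	table.sort(parts)

	return ("%s: %s"):format(container.Name, table.concat(parts, ", "))
end

-- Several explorers can run concurrently, each covering a different service.
for _, container in { game.Workspace, game.ReplicatedStorage, game.ServerScriptService } do
	task.spawn(function()
		print(summarize(container))
	end)
end
```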
MCP Quick Connect: one-click setup for external tools
Also shipped April 28, Quick Connect detects supported MCP clients on your machine - Claude Code, Claude Desktop, Cursor, Gemini CLI, VS Code, Codex CLI, and Antigravity - and adds them to Studio with a single toggle. No JSON editing, no config file hunting. The manual setup option is still there for clients that aren't auto-detected.
Combined with the new Copy Unique ID feature (right-click any instance for a stable text handle to paste into external prompts), it makes moving between Studio and tools like BloxBot, Claude Code, or Cursor a lot smoother.
OpenGameEval: from 47 to 117 evaluations
Roblox's OpenGameEval benchmark has expanded twice:
- Debug evals (March 3): 30 evaluations testing bug-fixing ability. Each one introduces a bug - renamed instances, inverted logic, changed constants, state management issues - and asks the model to find and fix it (see the inverted-logic sketch after this list). Top scores: Gemini 3.1 Pro (56.7%), GLM 5 (56.0%), Claude Opus 4.6 (50.7%, 0.96% tool error rate).
- Code Generation Batch 2 (April 14): 40 evaluations with higher complexity and broader place coverage, including from-scratch baseplate scenarios that require multi-step system creation. Top on this harder set: Claude Opus 4.6 (48.1%, 0.71% tool errors).
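To make the bug categories concrete, here is a hypothetical seeded bug of the inverted-logic kind; the door-unlock scenario is invented for illustration and is not taken from the benchmark itself.

```lua
-- Hypothetical Debug-eval style bug, invented for illustration: the comparison
-- is inverted, so the door unlocks for players who do NOT have enough coins.
local Players = game:GetService("Players")

local REQUIRED_COINS = 10
local door = workspace:WaitForChild("Door")

local function tryOpenDoor(player: Player)
	local coins = player:GetAttribute("Coins") or 0
	-- BUG: inverted comparison; the expected fix is `coins >= REQUIRED_COINS`.
	if coins < REQUIRED_COINS then
		door.CanCollide = false
		door.Transparency = 0.6
	end
end

Players.PlayerAdded:Connect(function(player)
	-- Re-check whenever the player's coin count changes.
	player:GetAttributeChangedSignal("Coins"):Connect(function()
		tryOpenDoor(player)
	end)
end)
```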
The benchmark now covers 117 evaluations across two skill categories and tracks 24 models.
New models since March
| Model | Pass@1 | Pass@5 | Tool error rate | Notes |
|---|---|---|---|---|
| Gemini 3.1 Pro | 55.3% | 72.3% | 1.3% | New #1 overall, also leads Debug |
| Gemini 3 Flash | 54.7% | 65.7% | 2.2% | Near-Pro performance at Flash speed and cost |
| Claude Opus 4.6 | 51.9% | 65.0% | 1.4% | Best on the expanded 87-eval set |
| GLM 5 | 51.7% | 69.0% | 2.0% | Second on Debug (56.0%), best value |
| Claude Opus 4.7 | 46.4% | 61.3% | 2.2% | 39% fewer tool calls than 4.6 |
| Claude Sonnet 4.6 | 46.4% | 57.5% | 1.3% | Strong mid-tier with low tool errors |
| GPT-5.4 | 35.1% | 55.4% | 1.3% | Reasoning mode; 50.0% on Debug |
Claude Opus 4.7: the efficiency play
Opus 4.7's Pass@1 (46.4%) looks like a regression from 4.6 (51.9%), but the headline number misses the point. On the expanded 87-eval set, the gap between 4.6 and 4.7 isn't statistically significant (p=0.24). Where 4.7 stands out is efficiency: 39% fewer tool calls per task, with the largest drops in exploration tools like `search_game_tree`, `script_grep`, and `inspect_instance`. For practical use that means faster sessions and lower costs, especially on large projects.
Gemini 3.1 Pro: the new leader
Google has clearly invested in Roblox-specific performance. Gemini 3.1 Pro leads overall Pass@1 (55.3%) and Debug Pass@1 (56.7%). It hasn't been evaluated on the expanded 87-eval set yet, so its performance on the harder multi-step tasks isn't known.
What this means for your workflow
- Use Planning Mode for complex features. The Plan → Build → Test loop with checkpoints and verification beats ad-hoc prompting on multi-step tasks. It's built into Roblox Assistant now, and the pattern translates to any AI tool you already use.
- Don't commit to one model. The benchmarks make it clear no model wins everywhere. Match the model to the task: Gemini 3.1 Pro for debugging, Opus 4.6 for complex multi-file architecture, GLM 5 for high-volume iteration, Opus 4.7 when speed-per-task matters.
- Try Procedural Models. Parameterized, reusable AI-generated assets are a new primitive. Even if you write your own Generator Modules, the pattern is worth understanding.
- Wire up MCP Quick Connect. If you use Claude Code, Cursor, or another supported client, the one-toggle setup removes the last bit of friction from connecting external AI tools to Studio.