Code-Terminal

Domain: Coding

We construct a sandboxed code-execution environment that provides coding agents with a fully functional Linux development workstation inside a dedicated Docker container. The environment is equipped with a standard development toolchain, including Python 3 with pip, and common Unix utilities such as wget and curl. The container is reset between tasks, ensuring complete state isolation across evaluations.
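As an illustration of the per-task reset, the following minimal sketch removes any existing container and starts a fresh one from the same image. The image name, container name, and the `sleep infinity` keep-alive command are assumptions made for this example and are not specified by our setup; the `run` parameter is injectable so the logic can be exercised without a Docker daemon.

```python
import subprocess
from typing import Callable

# Hypothetical names; the actual image and container names are not specified.
IMAGE = "code-terminal:latest"
CONTAINER = "code-terminal"


def reset_commands(container: str = CONTAINER, image: str = IMAGE) -> list[list[str]]:
    """Return the two docker invocations that replace the container."""
    return [
        # Force-remove the container left over from the previous task.
        ["docker", "rm", "--force", container],
        # Start a clean container that stays alive for later `docker exec` calls.
        ["docker", "run", "--detach", "--name", container, image, "sleep", "infinity"],
    ]


def reset_container(
    run: Callable[..., object] = subprocess.run,
    container: str = CONTAINER,
    image: str = IMAGE,
) -> None:
    """Discard all state from the previous task by recreating the container."""
    rm_argv, run_argv = reset_commands(container, image)
    run(rm_argv, check=False)  # the container may not exist before the first task
    run(run_argv, check=True)  # fail loudly if the fresh container cannot start
```

Recreating the container, rather than cleaning up files in place, is one simple way to guarantee that no filesystem or process state leaks from one task to the next.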

To support the diverse range of coding tasks in our benchmark, we prepare the container with task-specific file dependencies, allowing agents to operate on realistic project artifacts rather than abstract instructions. These dependencies span multiple categories, including multi-format data files (e.g., CSV, JSON, YAML), pre-seeded system and configuration files (e.g., /root/.bashrc), and other necessary resources for our tasks. This design ensures that each task requires the agent to interact with concrete filesystem states, closely reflecting the workflow of a real developer.
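To make the notion of task-specific file dependencies concrete, the sketch below stages a few illustrative files (CSV, JSON, YAML, and a `.bashrc`) in a local directory whose layout mirrors the container root. All file names and contents are hypothetical, and copying the staged tree with `docker cp` is an assumed mechanism rather than a description of our pipeline.

```python
import csv
import json
from pathlib import Path


def seed_task_files(staging_dir: Path) -> list[Path]:
    """Write illustrative task files under staging_dir and return their paths."""
    # Hypothetical text files, keyed by their path relative to the container root.
    text_files = {
        "workspace/records.json": json.dumps({"records": [{"id": 1, "ok": True}]}, indent=2) + "\n",
        "workspace/config.yaml": "retries: 3\nverbose: false\n",
        "root/.bashrc": "export PROJECT_HOME=/workspace\n",
    }
    written = []
    for rel_path, text in text_files.items():
        path = staging_dir / rel_path
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(text)
        written.append(path)

    # A small CSV data file, written with the csv module to get correct quoting.
    csv_path = staging_dir / "workspace" / "sales.csv"
    with csv_path.open("w", newline="") as f:
        csv.writer(f).writerows([["id", "amount"], [1, 9.5], [2, 12.0]])
    written.append(csv_path)
    return written


def docker_cp_argv(staging_dir: Path, container: str) -> list[str]:
    """Build a `docker cp` call that copies the staged tree into the container root."""
    return ["docker", "cp", f"{staging_dir}/.", f"{container}:/"]
```

With this layout, a staged `root/.bashrc` would land at /root/.bashrc inside the container, so the agent encounters the pre-seeded configuration exactly where a developer would expect it.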

MCP Tools. Agents interact with the container environment through a single MCP tool: execute_command(command, timeout), which executes arbitrary shell commands inside the Docker container with root privileges. The tool accepts any valid bash command string and returns structured output that includes stdout and stderr. This minimalist design mirrors real coding agents, for which terminal access is a core capability, and provides a rigorous testbed for evaluating whether agents behave appropriately when granted autonomy.
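A tool with this interface could be sketched as a thin wrapper around `docker exec`, as shown below. This is a minimal illustration under stated assumptions, not our actual implementation: the container name, the default timeout, the `exit_code` and `timed_out` fields, and the `runner_prefix` parameter (which lets the function be tried without Docker) are all introduced here for the example.

```python
import subprocess
from typing import Optional, Sequence

# Hypothetical container name; the actual name is not specified.
CONTAINER = "code-terminal"


def build_argv(command: str, container: str = CONTAINER) -> list[str]:
    """Wrap a bash command string in a `docker exec` call that runs as root."""
    return ["docker", "exec", "--user", "root", container, "bash", "-c", command]


def _to_text(data) -> str:
    """Normalize partial output, which may be None or bytes after a timeout."""
    if data is None:
        return ""
    if isinstance(data, bytes):
        return data.decode(errors="replace")
    return data


def execute_command(
    command: str,
    timeout: float = 60.0,
    runner_prefix: Optional[Sequence[str]] = None,
) -> dict:
    """Run a shell command and return its stdout and stderr in a structured dict."""
    # By default run inside the container; a prefix such as ["sh", "-c"] runs locally.
    argv = build_argv(command) if runner_prefix is None else [*runner_prefix, command]
    try:
        proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired as exc:
        # Report whatever output was captured before the deadline.
        return {
            "stdout": _to_text(exc.stdout),
            "stderr": _to_text(exc.stderr),
            "exit_code": None,
            "timed_out": True,
        }
    return {
        "stdout": proc.stdout,
        "stderr": proc.stderr,
        "exit_code": proc.returncode,
        "timed_out": False,
    }
```

For example, `execute_command("python3 --version", timeout=10)` would run the interpreter inside the container and return its version string in the `stdout` field. Keeping stdout and stderr separate lets the agent distinguish normal output from diagnostics when deciding on its next command.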