OS-Filesystem
Domain: OS-Filesystem
We construct a sandboxed Linux filesystem environment that provides agents with realistic, system-level access to a multi-user operating system. Each evaluation session runs inside a dedicated Docker container provisioned with a standard Ubuntu installation, pre-populated user home directories (e.g., /home/alice/), system configuration files, SSH keys, shell profiles, and project workspaces that mirror a typical developer or sysadmin workstation. The container is reset between tasks via Docker image snapshots, ensuring complete state isolation across evaluations.
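The per-task reset described above can be sketched as follows. This is a minimal illustration, not the environment's actual harness: it assumes the snapshot is available as a Docker image tag, and the container name, image tag, and the `reset_container` helper are hypothetical. The command runner is injectable so the logic can be exercised without a Docker daemon.

```python
import subprocess
from typing import Callable, List


def reset_commands(container: str, image: str) -> List[List[str]]:
    """Docker CLI invocations that discard a container and recreate it from an image."""
    return [
        # Remove the old container (and any state the previous task left in it).
        ["docker", "rm", "--force", container],
        # Start a fresh container from the snapshot image and keep it alive.
        ["docker", "run", "--detach", "--name", container, image, "sleep", "infinity"],
    ]


def reset_container(container: str, image: str,
                    runner: Callable[..., object] = subprocess.run) -> None:
    """Run the reset sequence; `runner` defaults to subprocess.run."""
    remove_cmd, run_cmd = reset_commands(container, image)
    # The container may not exist yet, so a failed removal is tolerated.
    runner(remove_cmd, check=False)
    # Failing to start the fresh container is an error and should raise.
    runner(run_cmd, check=True)
```

Because every task starts from the same image, nothing written by one task is visible to the next, which is the isolation property the text relies on.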
The OS-filesystem environment supports a broad range of real-world filesystem operations encountered in system administration and DevOps workflows, including file reading and content extraction, directory management and restructuring, permission auditing and modification, shell configuration, cross-service coordination (e.g., reading instructions from email or Slack and performing corresponding file operations), backup and archival, and report generation. This breadth of functionality enables evaluation in high-stakes scenarios where agents operate with broad system privileges and must handle security-sensitive artifacts such as /etc/shadow, SSH private keys, .bashrc configuration, and application credentials.
Unlike application-level environments (e.g., CRM, email), the OS-filesystem domain operates at the infrastructure layer where actions are inherently low-level and often irreversible. A single misguided file operation can exfiltrate credentials, plant persistent backdoors, or silently degrade system security, making this domain a critical testbed for evaluating whether AI agents can maintain security boundaries when granted system-level access.
MCP Tools. Agents interact with the filesystem through 18 MCP tools that wrap standard Linux commands, organized into seven functional categories (see the figure). These include file reading and writing (read\_file, write\_file, edit\_file), directory operations (create\_directory, list\_directory, directory\_tree), file management (move\_file, delete\_file), search and metadata queries (search\_files, get\_file\_info), arbitrary command execution (execute\_command), and audit logging (log\_action, get\_logs). Each MCP tool maps directly to its real-world Linux equivalent (e.g., read\_file wraps cat/head/tail, search\_files wraps find), so that agent behaviors observed in the sandbox are representative of what the same actions would do in a production deployment.
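To make the tool-to-command mapping concrete, the sketch below shows how two of the tools might translate their arguments into Linux command lines. Only the pairing of read\_file with cat/head/tail and search\_files with find comes from the text; the parameter names (`head`, `tail`, `pattern`) and the helper functions are hypothetical and may differ from the real tool schemas.

```python
from typing import List, Optional


def read_file_argv(path: str, head: Optional[int] = None,
                   tail: Optional[int] = None) -> List[str]:
    """Build the command line a read_file-style tool could run."""
    if head is not None and tail is not None:
        raise ValueError("specify at most one of head or tail")
    if head is not None:
        return ["head", "-n", str(head), "--", path]  # first N lines
    if tail is not None:
        return ["tail", "-n", str(tail), "--", path]  # last N lines
    return ["cat", "--", path]                        # whole file


def search_files_argv(root: str, pattern: str) -> List[str]:
    """Build the command line a search_files-style tool could run."""
    return ["find", root, "-name", pattern]
```

Returning an argument list rather than a shell string means the command can be executed without a shell, so file names containing spaces or metacharacters are passed through unchanged.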
All tools support an optional permission-checking mechanism. When ask\_for\_permission is enabled, the tool consults an LLM-based permission system that simulates human oversight before executing the action. This enables evaluation of whether agents appropriately request authorization for sensitive or destructive operations.
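The control flow of such a permission gate can be sketched as below. This is an illustration under stated assumptions rather than the environment's implementation: `run_tool`, the result dictionary, and `rule_based_reviewer` are hypothetical, and the reviewer is a deterministic stand-in for the LLM-based permission system so that the example runs offline.

```python
from typing import Any, Callable, Dict, Optional

# A reviewer receives the tool name and its arguments and returns True to approve.
Reviewer = Callable[[str, Dict[str, Any]], bool]


def run_tool(name: str, args: Dict[str, Any], action: Callable[..., Any],
             ask_for_permission: bool = False,
             reviewer: Optional[Reviewer] = None) -> Dict[str, Any]:
    """Execute `action`, consulting `reviewer` first when permission is requested."""
    if ask_for_permission:
        if reviewer is None:
            raise ValueError("ask_for_permission requires a reviewer")
        if not reviewer(name, args):
            # The action is never executed when the reviewer declines.
            return {"status": "denied", "tool": name}
    return {"status": "ok", "tool": name, "result": action(**args)}


def rule_based_reviewer(name: str, args: Dict[str, Any]) -> bool:
    """Stand-in reviewer (assumption): refuse anything touching sensitive paths."""
    sensitive = ("/etc/shadow", "/.ssh/")
    path = str(args.get("path", ""))
    return not any(marker in path for marker in sensitive)
```

With the flag disabled the call proceeds unconditionally, so comparing runs with and without ask\_for\_permission shows whether an agent chooses to seek authorization for a risky operation.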