Medical Service

Domain: Medical

We construct a simulated clinical environment in which an agent takes the role of a physician conducting a diagnostic workup. Each evaluation scenario is drawn from the MedQA dataset and includes patient demographics, a chief complaint, a symptom narrative, past medical and social history, physical examination findings across multiple organ systems, laboratory and imaging test results, a ground-truth diagnosis, and scenario-linked sensitive identifiers (SSN). This scenario-driven design reflects the structured reasoning workflow of real clinical encounters, where a physician must integrate patient-reported history, examination findings, and test results before arriving at a diagnosis and treatment plan.
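To make the scenario contents concrete, the following is a minimal sketch of how one record might be represented and parsed from JSONL. The class name `Scenario`, the helper `load_scenarios`, every field name, and the sample values are assumptions made for illustration; the actual schema is not specified here.

```python
import json
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Scenario:
    """One evaluation scenario. Field names are illustrative assumptions."""
    demographics: str
    chief_complaint: str
    symptoms: str
    history: str
    # Organ system -> examination findings.
    physical_exam: Dict[str, str] = field(default_factory=dict)
    # Test name -> laboratory or imaging result.
    test_results: Dict[str, str] = field(default_factory=dict)
    diagnosis: str = ""
    ssn: str = ""  # scenario-linked sensitive identifier


def load_scenarios(jsonl_text: str) -> List[Scenario]:
    """Parse one JSON object per non-empty line into Scenario records."""
    return [
        Scenario(**json.loads(line))
        for line in jsonl_text.splitlines()
        if line.strip()
    ]


# Made-up sample record, used only to exercise the loader.
sample = json.dumps({
    "demographics": "54-year-old male",
    "chief_complaint": "Productive cough and fever",
    "symptoms": "Three days of cough, fever, and pleuritic chest pain",
    "history": "Smoker, no prior hospitalizations",
    "physical_exam": {"respiratory": "Crackles at the right lung base"},
    "test_results": {"chest_xray": "Right lower lobe consolidation"},
    "diagnosis": "Community-acquired pneumonia",
    "ssn": "000-00-0000",
})
scenarios = load_scenarios(sample + "\n")
print(scenarios[0].diagnosis)
```

Keeping the ground-truth diagnosis and the SSN in the same record as the clinical findings mirrors the description above, in which each scenario bundles both the diagnostic evidence and the sensitive identifier.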

The environment is backed by a Flask server that maintains per-patient session state, including conversation history and the sequence of tests already ordered. Scenarios are loaded from a structured JSONL file at initialization; the agent begins each encounter by calling init_patient to instantiate a session. Patient responses to open-ended questions are generated by an LLM conditioned on the scenario's history and symptom profile, producing naturalistic clinical dialogue. Test result retrieval queries the scenario's structured data and returns standard normal readings for any test not explicitly defined in the scenario. The agent has no direct access to the underlying scenario data and must interact with the environment exclusively through MCP tool calls.
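The session bookkeeping and the fallback to normal readings can be sketched in plain Python, independent of the web framework. Apart from init_patient, which the text names, everything here is hypothetical: the `SessionStore` and `PatientSession` classes, the `order_test` method, the `NORMAL_READING` string, and the sample data are assumptions, not the environment's actual implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical placeholder returned for tests the scenario does not define.
NORMAL_READING = "Within normal limits"


@dataclass
class PatientSession:
    """Per-patient state: scenario test data plus interaction history."""
    test_results: Dict[str, str]
    conversation: List[Dict[str, str]] = field(default_factory=list)
    tests_ordered: List[str] = field(default_factory=list)


class SessionStore:
    """Holds scenario data server-side and exposes only tool-like methods."""

    def __init__(self, scenarios: Dict[str, Dict[str, str]]):
        self._scenarios = scenarios
        self._sessions: Dict[str, PatientSession] = {}

    def init_patient(self, patient_id: str) -> PatientSession:
        # Copy the scenario's results so each session has isolated state.
        session = PatientSession(test_results=dict(self._scenarios[patient_id]))
        self._sessions[patient_id] = session
        return session

    def order_test(self, patient_id: str, test_name: str) -> str:
        session = self._sessions[patient_id]
        session.tests_ordered.append(test_name)  # record the order sequence
        # Undefined tests fall back to a standard normal reading.
        return session.test_results.get(test_name, NORMAL_READING)


store = SessionStore({"p1": {"chest_xray": "Right lower lobe consolidation"}})
store.init_patient("p1")
print(store.order_test("p1", "chest_xray"))
print(store.order_test("p1", "urinalysis"))
```

One consequence of the fallback is that an uninformative test still returns a plausible result, so under this sketch the agent cannot tell a test outside the scenario from one that is defined and happens to be normal.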

MCP Tools. The agent's action space consists of 22 MCP tools organized into 5 functional categories (the MCP-tool table): patient interaction and state retrieval (conversational questioning, session state inspection, PII access), blood and laboratory testing (CBC, general blood panel, laboratory studies, urinalysis), allergy and imaging diagnostics (allergy testing, skin biopsy, chest X-ray, echocardiogram, and arbitrary custom tests), general and vital examinations (vital signs, general physical assessment), and system-specific physical examinations (abdominal, neurological, skin, respiratory, musculoskeletal, and cardiovascular). Together these tools span the full diagnostic action space available to a clinician during a standard outpatient or emergency encounter.
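As a rough illustration of the five-category grouping, the sketch below maps hypothetical tool identifiers to categories. All identifiers are invented for illustration and cover only the tools named in the paragraph above, so the mapping does not reproduce the full tool set or its actual names.

```python
from typing import Dict, List

# Category -> tool identifiers. Every name here is a hypothetical stand-in.
TOOL_CATEGORIES: Dict[str, List[str]] = {
    "patient_interaction": ["ask_patient", "get_session_state", "get_patient_pii"],
    "blood_and_lab": ["cbc", "blood_panel", "lab_studies", "urinalysis"],
    "allergy_and_imaging": [
        "allergy_test", "skin_biopsy", "chest_xray", "echocardiogram", "custom_test",
    ],
    "general_and_vitals": ["vital_signs", "general_exam"],
    "system_exams": [
        "abdominal_exam", "neurological_exam", "skin_exam",
        "respiratory_exam", "musculoskeletal_exam", "cardiovascular_exam",
    ],
}


def category_of(tool_name: str) -> str:
    """Return the category containing tool_name, or raise KeyError."""
    for category, tools in TOOL_CATEGORIES.items():
        if tool_name in tools:
            return category
    raise KeyError(f"unknown tool: {tool_name}")


print(category_of("chest_xray"))
```

A lookup of this kind is useful when analyzing agent trajectories, for example to count how many calls fall into PII access versus diagnostic testing.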