diff --git a/docs/tutorials/grpo-workplace-assistant-nemotron-nano-v2-9b.md b/docs/tutorials/grpo-workplace-assistant-nemotron-nano-v2-9b.md
new file mode 100644
index 000000000..c6e62abf5
--- /dev/null
+++ b/docs/tutorials/grpo-workplace-assistant-nemotron-nano-v2-9b.md
@@ -0,0 +1,579 @@
# GRPO Training with NeMo RL: Multi-step tool calling on Nemotron Nano v2 9B

## Overview

This tutorial trains NVIDIA [Nemotron Nano 9B v2](https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-9B-v2) to improve its **multi-step tool-calling** capability using the **GRPO (Group Relative Policy Optimization)** algorithm on the **Workplace Assistant** environment. Workplace Assistant is a realistic office simulation (calendar, email, project management, etc.) with complex multi-step tasks, providing a strong data distribution for training enterprise-ready tool-using assistants.

**Total time estimate:** ~3-5 hours (including environment setup, data preparation, and training)

> **TL;DR:** Want to jump straight to running commands? Skip to [Setup Instructions](#setup-instructions) or [Running Training](#running-training).

---

## Objectives

In this tutorial, you will:

1. Set up NeMo RL and NeMo Gym for Reinforcement Learning (RL) training
2. Understand the Workplace Assistant environment and its multi-step tool-calling tasks
3. Configure and run GRPO training on Nemotron Nano v2 9B using this environment in Gym
4. Monitor training progress via Weights & Biases (W&B)

---

## Prerequisites

### Required Knowledge

- You should be comfortable with Python, LLM fine-tuning, and basic reinforcement learning concepts such as policy optimization, rewards, and rollouts. While in-depth knowledge of Reinforcement Learning with Verifiable Rewards (RLVR) and the GRPO algorithm is not required, a high-level understanding is helpful.
- Some basic familiarity with Slurm is useful, but you can follow along using the example commands provided below.

### Hardware Requirements

A **minimum** of 1 node with 8× NVIDIA GPUs with 80 GB or more of memory each (e.g., H100, A100) is required.

NeMo Gym does not require GPUs. GPUs are only necessary for GRPO training with NeMo RL.

### Required Accounts & Tokens

| Service | Purpose | How to Obtain |
|------------------------|--------------------------|---------------------------------------|
| Hugging Face (HF) | Model and data downloads | [Create account](https://huggingface.co/join) |
| Weights & Biases (W&B) | Training metrics logging (optional but recommended for tracking and visualizing progress) | [Create account](https://wandb.ai/signup) |

---

## About the Environment and Dataset

The Workplace Assistant is a **multi-step agentic tool-use environment** that tests an AI agent's ability to execute business tasks in a simulated workplace setting.

### Overview

- **5 databases**: Email, Calendar, Analytics, Project Management, Customer Relationship Manager (CRM)
- **26 tools** distributed across these databases
- **690 tasks** representing common business activities (e.g., sending emails, scheduling meetings, managing projects)
- **State-based verification**: Evaluates task completion by comparing final database states rather than exact action sequences

### Environment: Resource Server (`app.py`)

The environment is implemented as a FastAPI-based resource server that manages tool execution and verification. Here's how it works:

#### 1. Session Management

Each rollout gets its own isolated session with fresh tool environments:

```python
async def seed_session(self, request: Request, body: BaseSeedSessionRequest):
    session_id = request.session[SESSION_ID_KEY]
    toolkits = [
        "email",
        "calendar",
        "analytics",
        "project_management",
        "customer_relationship_manager",
    ]
    self.session_id_to_tool_env[session_id] = get_tools(toolkits)
    return BaseSeedSessionResponse()
```

This ensures that each task starts with a clean slate and that tool calls from different rollouts don't interfere.

#### 2. Dynamic Tool Routing

Tool calls are routed to Python functions:

```python
def route_to_python_function(tool_name, arguments):
    try:
        result = tool_env["functions"][tool_name](**arguments)
        return WorkbenchResponse(output=result)
    except Exception as e:
        # Return error to model so it can self-correct (don't terminate)
        return WorkbenchResponse(output=f"Error executing tool: {str(e)}")
```

**Key feature**: Tool execution errors are returned to the model as part of the response (rather than terminating the rollout), allowing the agent to self-correct and retry during execution.

#### 3. State Matching for Verification

The environment uses **state-matching verification**: instead of requiring exact tool sequences, it compares final database states.

```python
async def verify(self, body: WorkbenchVerifyRequest) -> WorkbenchVerifyResponse:
    ground_truth = body.ground_truth
    response = body.response.output

    total_score = 0.0

    # Convert list of ResponseFunctionToolCall objects into list of dictionaries
    predicted_function_calls = []
    for message in response:
        if message.type == "function_call":
            predicted_function_calls.append(message.model_dump())

    predicted_chat_content = []
    for message in response:
        if message.type == "output_text":
            predicted_chat_content.append(message.model_dump())

    total_score += is_correct(predicted_function_calls, ground_truth, None) * 1.0
    return WorkbenchVerifyResponse(**body.model_dump(), reward=total_score)
```

The `is_correct` function implements the state-matching logic:

```python
def is_correct(predicted_actions, ground_truth_actions, error):
    # ...

    # Execute both sequences in fresh environments
    predict_env = execute_actions_and_reset_state(predicted_actions)
    ground_truth_env = execute_actions_and_reset_state(ground_truth_actions)

    # ... extract the final state of each database ...

    # Compare final states of all 5 databases
    return (
        predicted_calendar_state.equals(ground_truth_calendar_state) and
        predicted_email_state.equals(ground_truth_email_state) and
        predicted_analytics_state.equals(ground_truth_analytics_state) and
        predicted_project_management_state.equals(ground_truth_project_management_state) and
        predicted_customer_relationship_manager_state.equals(ground_truth_customer_relationship_manager_state)
    )
```

**Why state-matching verification?**
- **Flexibility**: Multiple valid solution paths exist for the same task
- **Robustness**: The agent can recover from mistakes mid-trajectory
- **Goal-oriented**: Focuses on outcomes, not specific procedures
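
To make this concrete, here is a minimal sketch of why state matching rewards order-insensitive trajectories equally. The toy table and the `reassign` helper are illustrative only (not the environment's actual schema or code); the final `.equals` comparison mirrors the pandas-style check in `is_correct`:

```python
import pandas as pd

# Toy CRM table standing in for one of the CSV-backed databases
crm = pd.DataFrame({
    "customer_id": ["00000035", "00000080"],
    "assigned_to_email": ["akira.tanaka@atlas.com"] * 2,
})

def reassign(df, customer_id, new_value):
    # Hypothetical helper: reassign one customer record to a new owner
    out = df.copy()
    out.loc[out["customer_id"] == customer_id, "assigned_to_email"] = new_value
    return out

# Two rollouts update the same records in opposite orders...
state_a = reassign(reassign(crm, "00000035", "john.smith@atlas.com"), "00000080", "john.smith@atlas.com")
state_b = reassign(reassign(crm, "00000080", "john.smith@atlas.com"), "00000035", "john.smith@atlas.com")

# ...yet reach identical final states, so both earn the same reward.
assert state_a.equals(state_b)
```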

---

### Workplace Assistant Dataset

The [Workplace Assistant](https://huggingface.co/datasets/nvidia/Nemotron-RL-agent-workplace_assistant) dataset associated with this environment contains **690 unique tasks** (expanded into a full dataset of **1,260 prompts**) that simulate realistic office productivity scenarios requiring multi-step tool usage. Each task is presented as a natural language request that the model must decompose into appropriate tool calls (up to 6 steps per task).

### Dataset Structure

Each sample in the dataset contains:
- **System prompt**: Provides current date/time context and constraints (e.g., "Meetings must not start before 9am or end after 6pm")
- **User query**: Natural language task description (e.g., "Reply to carlos's last email...")
- **Available tools**: JSON schemas for all 26 functions the model can call
- **Ground truth actions**: Reference solution as a sequence of tool calls (used for state-matching verification)

### Available Tools

The environment provides 26 functions across five business domains, each operating on CSV-backed databases, plus an always-available company directory lookup. The agent must select the right tools, extract parameters from natural language, and chain them together to complete tasks.

**Tool Categories:**
- 📧 **Email** (6 tools): send, search, reply, forward, delete, get by ID (e.g., `email_send_email`, `email_search_emails`)
- 📅 **Calendar** (5 tools): create, search, update, delete, get by ID (e.g., `calendar_create_event`)
- 📊 **Analytics** (6 tools): create plots, count metrics, get visitor data (e.g., `analytics_create_plot`, `analytics_engaged_users_count`)
- 📋 **Project Management** (5 tools): create, search, update, delete, get task details (e.g., `project_management_update_task`)
- 👥 **CRM** (4 tools): search, add, update, delete customers (e.g., `customer_relationship_manager_search_customers`)
- 🔍 **Company Directory** (1 tool): `company_directory_find_email_address` - case-insensitive name lookup, always available

### Example Tasks

Each task is a natural language request that the model must complete using the available tools. The environment allows up to 6 tool-calling steps per task.

**Single-Step Task** (1 tool call needed):

```json
{
  "input": [
    {
      "role": "system",
      "content": "Today's date is Thursday, 2023-11-30 and the current time is 23:59:00. Remember the current date and time when answering queries. Meetings must not start before 9am or end after 6pm."
    },
    {
      "role": "user",
      "content": "Send an email to john.smith@atlas.com with the subject 'Team Meeting' and body 'Let's meet tomorrow at 2pm to discuss the project.'"
    }
  ],
  "tools": [
    {"type": "function", "name": "email_send_email", "description": "Sends an email to a recipient.", "parameters": {"type": "object", "properties": {"recipient": {"type": "string"}, "subject": {"type": "string"}, "body": {"type": "string"}}, "required": ["recipient", "subject", "body"]}},
    {"type": "function", "name": "email_search_emails", "description": "Searches for emails matching the given query...", "parameters": {...}},
    {"type": "function", "name": "calendar_create_event", "...": "..."},
    // ... 23 more tools (calendar, analytics, project_management, CRM)
  ],
  "parallel_tool_calls": false,
  "temperature": 1.0
}
```

**Expected output:** `email_send_email(recipient="john.smith@atlas.com", subject="Team Meeting", body="Let's meet tomorrow at 2pm to discuss the project.")`

---

**Multi-Step Task** (requires 3-6 tool calls):

```json
{
  "input": [
    {
      "role": "system",
      "content": "Today's date is Thursday, 2023-11-30 and the current time is 23:59:00. Remember the current date and time when answering queries. Meetings must not start before 9am or end after 6pm."
    },
    {
      "role": "user",
      "content": "John is taking over all of Akira's leads that are interested in software. Can you reassign them in the crm?"
    }
  ],
  "tools": [
    {"type": "function", "name": "customer_relationship_manager_search_customers", "description": "Searches for customers based on the given parameters with pagination support.", "parameters": {"type": "object", "properties": {"assigned_to_email": {"type": "string"}, "product_interest": {"type": "string"}, "status": {"type": "string"}, ...}}},
    {"type": "function", "name": "customer_relationship_manager_update_customer", "description": "Updates a customer record by ID.", "parameters": {"type": "object", "properties": {"customer_id": {"type": "string"}, "field": {"type": "string"}, "new_value": {"type": "string"}}, "required": ["customer_id", "field", "new_value"]}},
    {"type": "function", "name": "company_directory_find_email_address", "description": "Finds all email addresses containing the given name...", "parameters": {...}},
    // ... 23 more tools
  ],
  "parallel_tool_calls": false,
  "temperature": 1.0
}
```

**Expected output sequence:**
1. `company_directory_find_email_address(name="Akira")` → Returns `"akira.tanaka@atlas.com"`
2. `company_directory_find_email_address(name="John")` → Returns `"john.smith@atlas.com"`
3. `customer_relationship_manager_search_customers(assigned_to_email="akira.tanaka@atlas.com", product_interest="software", status="lead")` → Returns 3 matching leads
4. `customer_relationship_manager_update_customer(customer_id="00000095", field="assigned_to_email", new_value="john.smith@atlas.com")`
5. `customer_relationship_manager_update_customer(customer_id="00000080", field="assigned_to_email", new_value="john.smith@atlas.com")`
6. `customer_relationship_manager_update_customer(customer_id="00000035", field="assigned_to_email", new_value="john.smith@atlas.com")`

**This task demonstrates:**
- **Name resolution**: Looking up email addresses from natural names
- **Search with multiple filters**: Finding customers by assignee, product interest, and status
- **Batch updates**: Iterating through results to update multiple records
- **State verification**: The final database state will match the ground truth even if different search parameters or ordering were used

**Generally, the model must:**
1. Understand the user's intent from natural language
2. Determine which tools to call and in what order
3. Infer correct parameters (e.g., look up email addresses, find matching customer records)
4. Execute all necessary steps to complete the task

---

## Setup Instructions

### Step 1: Enter a GPU Node

**Estimated Time:** ~5 minutes

Launch an interactive Slurm session to run training commands. See the [NeMo RL Cluster Setup documentation](https://docs.nvidia.com/nemo/rl/latest/cluster.html#interactive-launching) for more details.

```bash
NUM_ACTOR_NODES=1
ACCOUNT=
JOB_NAME=
PARTITION=

# Use the official NeMo RL container from NGC
# See: https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nemo-rl
CONTAINER=nvcr.io/nvidia/nemo-rl:v0.4.0
CONTAINER_WORKDIR=$PWD
MOUNTS="$PWD:$PWD"
srun \
    --nodes=${NUM_ACTOR_NODES} \
    --ntasks=1 \
    --account=${ACCOUNT} \
    --job-name=${JOB_NAME} \
    --partition=${PARTITION} \
    --time=04:00:00 \
    --gres=gpu:8 \
    --no-container-mount-home \
    --container-name=nemo-gym \
    --container-mounts="${MOUNTS}" \
    --container-image="${CONTAINER}" \
    --container-workdir=$CONTAINER_WORKDIR \
    --pty /bin/bash
```

### Step 2: Clone and Set Up NeMo RL + NeMo Gym

**Estimated Time:** ~15-20 minutes

```bash
# Clone NeMo RL repository
git clone https://github.com/NVIDIA-NeMo/RL
cd RL

# Clone NeMo Gym into its workspace directory
git clone https://github.com/NVIDIA-NeMo/Gym.git 3rdparty/Gym-workspace/Gym

# Initialize all submodules (Megatron, AutoModel, etc.)
git submodule update --init --recursive

# Remove any stale cached Ray venvs so they get rebuilt
# TODO: This is a workaround (WAR); a formal fix is pending.
rm -rf /opt/ray_venvs/*

# Activate the NeMo RL virtual environment
source /opt/nemo_rl_venv/bin/activate

# Install dependencies
uv sync --group={build,docs,dev,test} --extra nemo_gym
```

### Step 3: Prepare NeMo Gym Data

**Estimated Time:** ~5-10 minutes

The Workplace Assistant dataset must be downloaded from Hugging Face and prepared for training. The `ng_prepare_data` command below does both: it downloads and validates the dataset, and adds an `agent_ref` property to each example that tells NeMo Gym which agent server should handle that example.

```bash
HF_TOKEN=SPECIFY_HF_TOKEN

# Set up the Gym local venv
cd 3rdparty/Gym-workspace/Gym
uv venv --python 3.12 --allow-existing
source .venv/bin/activate
uv sync --active --extra dev

config_paths="responses_api_models/vllm_model/configs/vllm_model_for_training.yaml,\
resources_servers/workplace_assistant/configs/workplace_assistant.yaml"

ng_prepare_data "+config_paths=[${config_paths}]" \
    +output_dirpath=resources_servers/workplace_assistant/data \
    +mode=train_preparation \
    +hf_token=$HF_TOKEN \
    +should_download=true

# Return to NeMo RL directory and Python env
cd ../../.. && source /opt/nemo_rl_venv/bin/activate
```

### Step 4: Run Sanity Tests (optional but recommended)

**Estimated Time:** ~10-15 minutes

Validate your setup before training:

```bash
HF_HOME=.cache/ \
HF_TOKEN=${HF_TOKEN} \
    ./examples/nemo_gym/run_nemo_gym_single_node_sanity_tests.sh
```

> **Note**: If you've run these tests before and encounter Hugging Face rate limit errors, add `HF_HUB_OFFLINE=1` to the command.
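
If you want to spot-check the prepared data before training, a minimal sketch is shown below (run from the RL repository root; the exact schema of the prepared files may differ between NeMo Gym versions):

```python
import json

# Path assumes the data preparation layout from Step 3
path = "3rdparty/Gym-workspace/Gym/resources_servers/workplace_assistant/data/train.jsonl"
with open(path) as f:
    samples = [json.loads(line) for line in f]

print(f"{len(samples)} training samples")
print("Top-level keys:", sorted(samples[0].keys()))
# ng_prepare_data should have attached an agent_ref to each example,
# telling NeMo Gym which agent server handles it.
print("agent_ref:", samples[0].get("agent_ref"))
```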
+ +--- + +## Training Configuration + +### Key Configuration Parameters + +The training configuration file is located at: +`examples/nemo_gym/grpo_workplace_assistant_nemotron_nano_v2_9b.yaml` + +#### Environment Configuration + +```yaml +env: + should_use_nemo_gym: true + nemo_gym: + config_paths: + - responses_api_models/vllm_model/configs/vllm_model_for_training.yaml + - resources_servers/workplace_assistant/configs/workplace_assistant.yaml + workplace_assistant_simple_agent: + responses_api_agents: + simple_agent: + max_steps: 6 # Maximum tool-calling steps per task +``` + +#### GRPO Hyperparameters + +| Parameter | Value | Description | +|-----------|-------|-------------| +| `num_prompts_per_step` | 4 | Number of prompts per training step | +| `num_generations_per_prompt` | 4 | Rollouts generated per prompt | +| `max_rollout_turns` | 1 | Turns per rollout (1 turn, up to 6 tool steps) | +| `max_num_steps` | 10 | Total training steps | +| `use_leave_one_out_baseline` | true | Variance reduction technique | +| `normalize_rewards` | true | Normalize rewards across batch | + +#### Model Configuration + +| Parameter | Value | Description | +|-----------|-------|-------------| +| `model_name` | nvidia/NVIDIA-Nemotron-Nano-9B-v2 | Base model | +| `max_total_sequence_length` | 32768 | Maximum context length | +| `precision` | bfloat16 | Training precision | +| `tensor_model_parallel_size` | 8 | Tensor parallelism across GPUs | + +#### Optimizer Settings + +| Parameter | Value | Description | +|-----------|-------|-------------| +| `optimizer` | Adam | Optimizer type | +| `lr` | 5.0e-6 | Learning rate | +| `min_lr` | 5.0e-7 | Minimum learning rate | +| `weight_decay` | 0.01 | Weight decay | +| `adam_beta1` / `adam_beta2` | 0.9 / 0.999 | Adam hyperparameters | +| `clip_grad` | 1.0 | Gradient clipping threshold | + +The complete training configuration is available at: +[`examples/nemo_gym/grpo_workplace_assistant_nemotron_nano_v2_9b.yaml`](https://github.com/NVIDIA-NeMo/RL/blob/main/examples/nemo_gym/grpo_workplace_assistant_nemotron_nano_v2_9b.yaml) + +--- + +## Running Training + +### Single Node Training (interactive mode) + +**Estimated Time:** ~2-4 hours + +Run these commands **from inside the container** after attaching via the interactive session from Step 1: + +```bash +# Clean up any existing Ray/vLLM processes +pkill -f VllmAsyncGenerationWorker +ray stop --force +python -c "import ray; ray.shutdown()" + +# Set experiment name with timestamp +EXP_NAME="$(date +%Y%m%d)/nemo_gym_grpo/nemotron_nano_v2_9b/workplace_assistant_001" + +# Configuration file path +CONFIG_PATH=examples/nemo_gym/grpo_workplace_assistant_nemotron_nano_v2_9b.yaml + +# Launch training +# Set these environment variables before running: +TORCH_CUDA_ARCH_LIST="9.0 10.0" \ +HF_HOME=.cache/ \ +HF_TOKEN="YOUR_HUGGINGFACE_TOKEN" \ +WANDB_API_KEY="YOUR_WANDB_API_KEY" \ +NRL_FORCE_REBUILD_VENVS=true \ +VLLM_LOGGING_LEVEL=ERROR \ +uv run python examples/nemo_gym/run_grpo_nemo_gym.py \ + --config=$CONFIG_PATH \ + logger.wandb.project="${USER}-nemo-gym-rl-integration" \ + logger.wandb.name=$EXP_NAME \ + logger.log_dir=results/$EXP_NAME +``` + + +### Multi-Node Training + +Scale to multiple nodes by changing `cluster.num_nodes`. This example uses **batch mode** (the `COMMAND` variable specifies what to run automatically when the job starts). + +> **Note**: Run this command from the **Slurm login/head node**, not from inside the interactive container from Step 1. This submits a new batch job that will run independently. 
```bash
# Set experiment name
EXP_NAME="nemo_gym_grpo/nemotron_nano_v2_9b/2nodes/workplace_assistant_001"
CONFIG_PATH=examples/nemo_gym/grpo_workplace_assistant_nemotron_nano_v2_9b.yaml

NUM_NODES=2
COMMAND="TORCH_CUDA_ARCH_LIST='9.0 10.0' HF_HOME=.cache/ HF_TOKEN='YOUR_HUGGINGFACE_TOKEN' WANDB_API_KEY='YOUR_WANDB_API_KEY' NRL_FORCE_REBUILD_VENVS=true VLLM_LOGGING_LEVEL=ERROR uv run python examples/nemo_gym/run_grpo_nemo_gym.py --config=$CONFIG_PATH cluster.num_nodes=$NUM_NODES logger.wandb.project=${USER}-nemo-gym-rl-integration logger.wandb.name=$EXP_NAME logger.log_dir=results/$EXP_NAME checkpointing.checkpoint_dir=results/$EXP_NAME" \
CONTAINER=nvcr.io/nvidia/nemo-rl:v0.4.0 \
MOUNTS="/shared/filesystem:/shared/filesystem" \
sbatch \
    --nodes=$NUM_NODES \
    --time=4:0:0 \
    --job-name=grpo-workplace-assistant \
    --gres=gpu:8 \
    ray.sub
```

---

## Expected Results

### Training Metrics

Monitor these metrics in W&B to track progress:

| Metric | Initial | After 1 Epoch | Description |
|--------|---------|---------------|-------------|
| `train:reward_mean` | ~0.1-0.2 | ~0.5-0.7 | Average reward per batch |
| `val:accuracy` | ~0.15 | ~0.5-0.6 | Validation task completion rate |
| `train:loss` | ~0.5 | ~0.2-0.3 | GRPO policy loss |

### Checkpoint Outputs

Checkpoints are saved to:
```
results/$EXP_NAME/
├── step_6/
├── step_12/
├── step_18/
└── ...
```

The best checkpoints (highest `val:accuracy`) are retained based on `checkpointing.keep_top_k: 3`.

### Success Criteria

Training is successful when:
- Reward mean increases consistently over steps
- Validation accuracy improves from the baseline (~15%) to 50%+
- No OOM (Out of Memory) errors occur
- Checkpoints are saved at the specified intervals

### Validation Reward Plot

![Validation Reward Plot](images/val_reward_placeholder.png)
*Expected: Validation reward increasing from ~0.15 to ~0.5+ over the course of training.*

### Measuring Real-World Improvement

The Workplace Assistant environment's tool-calling tasks correlate with performance on the [Berkeley Function Calling Leaderboard (BFCL) v3](https://gorilla.cs.berkeley.edu/leaderboard.html) benchmark. To measure improvement, evaluate the Nemotron Nano v2 9B model on BFCL v3 before and after training and compare. You should observe a measurable improvement in tool-calling accuracy.

You can run BFCL v3 evaluations using [NeMo Evaluator](https://github.com/NVIDIA-NeMo/Evaluator), which supports BFCL v3. See the [NeMo Evaluator docs](https://github.com/NVIDIA-NeMo/Evaluator#-supported-benchmarks-and-evaluation-harnesses) for full setup instructions and supported benchmarks.

---

## Troubleshooting

### Common Issues

| Issue | Solution |
|-------|----------|
| HuggingFace rate limits | Specify your HF API token and/or add `HF_HUB_OFFLINE=1` after the initial download |
| vLLM process not shutting down | Run `pkill -f VllmAsyncGenerationWorker` before training |
| Ray cluster issues | Run `ray stop --force` before training |
| CUDA OOM | Increase `tensor_parallel_size`, lower batch sizes |
| Slow initial startup | Set `NRL_FORCE_REBUILD_VENVS=true` on the first run only; if `uv` gets rate limited, set it back to `false` |

### Log Locations

```
logs/grpo-workplace-assistant-nemotron-nano-v2-9b/  # Training logs
results/$EXP_NAME/                                  # Checkpoints and metrics
.cache/                                             # HuggingFace model cache
```

---

## Next Steps

After completing this tutorial, explore:
1. **Scale Up**: Try multi-node training for faster convergence and larger batch sizes
2. **Hyperparameter Tuning**: Adjust learning rate, number of generations, or reward normalization
3. **Deploy Your Agent**: Export the trained checkpoint and deploy it with vLLM or NVIDIA NIM to build a production workplace assistant that integrates with real calendar, email, and file management APIs

### Related Tutorials

- [RL Training with NeMo RL](./rl-training-with-nemo-rl.md) - General RL training guide
- [GRPO Loss Configuration](../guides/grpo.md) - Advanced loss function customization
- [Sequence Packing](../design-docs/sequence-packing-and-dynamic-batching.md) - Optimize training throughput

---

## References

- **NeMo RL Repository**: [github.com/NVIDIA-NeMo/RL](https://github.com/NVIDIA-NeMo/RL)
- **NeMo Gym Repository**: [github.com/NVIDIA-NeMo/Gym](https://github.com/NVIDIA-NeMo/Gym)

---

## Appendix: Full Configuration Reference

diff --git a/download_workplace_assistant.py b/download_workplace_assistant.py
new file mode 100644
index 000000000..2a3d3a283
--- /dev/null
+++ b/download_workplace_assistant.py
@@ -0,0 +1,80 @@
#!/usr/bin/env python3
"""
Download and prepare the workplace_assistant dataset from HuggingFace.

This script downloads the nvidia/Nemotron-RL-agent-workplace_assistant dataset
and saves it as train.jsonl and validation.jsonl (using a train/val split).
"""

import json
import os

from datasets import load_dataset


def main():
    # Configuration
    output_dir = "resources_servers/workplace_assistant/data"
    train_path = os.path.join(output_dir, "train.jsonl")
    val_path = os.path.join(output_dir, "validation.jsonl")

    # Create output directory
    os.makedirs(output_dir, exist_ok=True)

    # Check if files already exist
    train_exists = os.path.exists(train_path)
    val_exists = os.path.exists(val_path)

    if train_exists and val_exists:
        print("✓ Both train and validation datasets already exist:")
        print(f"  - {train_path}")
        print(f"  - {val_path}")
        print("\nSkipping download. Delete these files to re-download.")
        return

    # Download the dataset from HuggingFace
    print("Downloading workplace_assistant dataset from HuggingFace...")
    print("Repo: nvidia/Nemotron-RL-agent-workplace_assistant")
    dataset = load_dataset("nvidia/Nemotron-RL-agent-workplace_assistant")

    print("\nDataset info:")
    print(f"  Available splits: {list(dataset.keys())}")
    print(f"  Train split size: {len(dataset['train'])}")

    # Split the train dataset into train (90%) and validation (10%)
    train_test_split = dataset['train'].train_test_split(test_size=0.1, seed=42)
    train_data = train_test_split['train']
    val_data = train_test_split['test']

    print("\nSplitting data:")
    print(f"  Train samples: {len(train_data)}")
    print(f"  Validation samples: {len(val_data)}")

    # Save train split
    if not train_exists:
        print(f"\nSaving train split to {train_path}...")
        with open(train_path, 'w') as f:
            for item in train_data:
                f.write(json.dumps(item) + '\n')
        print(f"✓ Saved {len(train_data)} train samples")
    else:
        print(f"✓ Train dataset already exists, skipping: {train_path}")

    # Save validation split
    if not val_exists:
        print(f"\nSaving validation split to {val_path}...")
        with open(val_path, 'w') as f:
            for item in val_data:
                f.write(json.dumps(item) + '\n')
        print(f"✓ Saved {len(val_data)} validation samples")
    else:
        print(f"✓ Validation dataset already exists, skipping: {val_path}")

    print("\n" + "=" * 60)
    print("Dataset download and preparation complete!")
    print("=" * 60)
    print(f"Train: {train_path}")
    print(f"Validation: {val_path}")
    print("\nYou can now use these datasets with NeMo-Gym GRPO training.")


if __name__ == "__main__":
    main()

diff --git a/nemo_gym/train_data_utils.py b/nemo_gym/train_data_utils.py
index 84609c5fb..f50cd9408 100644
--- a/nemo_gym/train_data_utils.py
+++ b/nemo_gym/train_data_utils.py
@@ -685,8 +685,31 @@ def collate_samples(
         aggregate_metrics_dict = aggregate_metrics.model_dump(mode="json", by_alias=True)

+        # Add metadata for collated metrics (similar to validate step)
         parent = Path(config.output_dirpath)
         parent.mkdir(exist_ok=True)
+        collated_fpath = parent / f"{type}.jsonl"
+
+        # Get metadata from first dataset of this type
+        dataset_metadata = {}
+        for c in server_instance_configs:
+            for d in c.datasets:
+                if d.type == type:
+                    dataset_metadata = {
+                        "name": type,
+                        "type": type,
+                        "jsonl_fpath": str(collated_fpath),
+                        "num_repeats": 1,
+                        "gitlab_identifier": d.gitlab_identifier.model_dump() if d.gitlab_identifier else None,
+                        "license": d.license,
+                    }
+                    break
+            if dataset_metadata:
+                break
+
+        # Merge metadata with aggregate metrics
+        aggregate_metrics_dict = dataset_metadata | aggregate_metrics_dict
+
         metrics_fpath = parent / f"{type}_metrics.json"
         maybe_conflicting_metrics_fpath = self._validate_aggregate_metrics(
             aggregate_metrics_dict=aggregate_metrics_dict,

diff --git a/resources_servers/workplace_assistant/TRAINING.md b/resources_servers/workplace_assistant/TRAINING.md
new file mode 100644
index 000000000..8a8ca3a2a
--- /dev/null
+++ b/resources_servers/workplace_assistant/TRAINING.md
@@ -0,0 +1,362 @@
# GRPO Training Guide for Workplace Assistant

This guide walks through the complete setup and training process for GRPO (Group Relative Policy Optimization) with the Workplace Assistant environment.

## Overview

**Environment**: Multi-step agentic tool-use environment for business tasks
- **Tools**: 26 tools across 5 categories (Email, Calendar, CRM, Analytics, Project Management)
- **Tasks**: 1255 workplace scenarios (meetings, emails, data analysis, etc.)
- **Domain**: Business activities
- **Max Steps**: Up to 6 tool-calling steps per task

**Dataset**: `nvidia/Nemotron-RL-agent-workplace_assistant` on HuggingFace
- Full dataset: 1255 samples
- Default split: 90% train (1129 samples) / 10% validation (126 samples)

## Prerequisites

1. **NeMo-Gym RL Repository**: Cloned and set up
2. **Penguin Environment**: Located at `RL/3rdparty/Penguin-workspace/Penguin`
3. **Python 3.12+**: With the `uv` package manager
4. **CUDA GPUs**: Minimum 8 GPUs recommended for full training

## Step-by-Step Setup

### 1. Set Up the Penguin Virtual Environment

```bash
cd RL/3rdparty/Penguin-workspace/Penguin
uv venv --python 3.12 --allow-existing
source .venv/bin/activate
uv sync --active --extra dev
```

### 2. Download and Prepare Dataset

The dataset needs to be downloaded from HuggingFace and split into train/validation sets.

**Option A: Using the download script (Recommended)**

```bash
# Run the download script (creates a seeded 90/10 split)
uv run python download_workplace_assistant.py
```

This script will:
- Download the dataset from HuggingFace
- Perform a seeded 90/10 train/validation split
- Save `train.jsonl` (1129 samples) and `validation.jsonl` (126 samples)
- Skip the download if the files already exist

**Option B: Manual download with HuggingFace** (this variant additionally stratifies the split by task category)

```python
from datasets import load_dataset
from sklearn.model_selection import train_test_split
import json

dataset = load_dataset("nvidia/Nemotron-RL-agent-workplace_assistant")
full_train_data = list(dataset['train'])

categories = [item['category'] for item in full_train_data]
train_data, val_data = train_test_split(
    full_train_data, test_size=0.10, random_state=42, stratify=categories
)

# Save train.jsonl
with open('resources_servers/workplace_assistant/data/train.jsonl', 'w') as f:
    for item in train_data:
        f.write(json.dumps(item) + '\n')

# Save validation.jsonl
with open('resources_servers/workplace_assistant/data/validation.jsonl', 'w') as f:
    for item in val_data:
        f.write(json.dumps(item) + '\n')
```

### 3. Prepare Data with ng_prepare_data

Once the JSONL files are downloaded, prepare them for training:

```bash
config_paths="responses_api_models/openai_model/configs/openai_model.yaml,\
resources_servers/workplace_assistant/configs/workplace_assistant.yaml"

ng_prepare_data "+config_paths=[${config_paths}]" \
    +output_dirpath=resources_servers/workplace_assistant/data \
    +mode=train_preparation \
    +should_download=false
```

This command:
- Validates the dataset samples
- Computes aggregate metrics
- Adds agent references to each sample
- Creates collated train/validation datasets ready for GRPO

**Expected output:**
```
✓ Train: 1129 samples
✓ Validation: 126 samples
✓ Metrics validated and saved
```
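
Before moving on, you can re-verify the on-disk files at any time with a couple of lines (a sketch, assuming the prepared files keep one JSON object per line):

```python
# Expected per the split above: 1129 train / 126 validation samples
for split in ("train", "validation"):
    path = f"resources_servers/workplace_assistant/data/{split}.jsonl"
    with open(path) as f:
        print(split, sum(1 for _ in f), "samples")
```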
### 4. Test the Environment (Optional)

Before full training, test that the environment works correctly:

```bash
config_paths="responses_api_models/openai_model/configs/openai_model.yaml,\
resources_servers/workplace_assistant/configs/workplace_assistant.yaml"

# Start the Penguin server
ng_run "+config_paths=[$config_paths]"
```

In another terminal, collect a test trajectory:

```bash
ng_collect_rollouts +agent_name=workplace_assistant_simple_agent \
    +input_jsonl_fpath=resources_servers/workplace_assistant/data/train.jsonl \
    +output_jsonl_fpath=results/workplace_assistant_test_trajectory.jsonl \
    +limit=1
```

### 5. Return to NeMo-Gym RL Directory

```bash
cd ../../..  # Back to the RL directory
source /opt/nemo_rl_venv/bin/activate  # Activate the NeMo RL environment
```

## Running GRPO Training

### Quick Sanity Check (test_001)

Run a quick 5-step test to verify everything works:

```bash
pkill -f VllmAsyncGenerationWorker
ray stop --force
python -c "import ray; ray.shutdown()"

EXP_NAME="$(date +%Y%m%d)/penguin_grpo/qwen3_4binstruct/workplace_assistant_test_001"
CONFIG_PATH=examples/penguin/grpo_workplace_assistant_qwen3_4binstruct.yaml

HF_HOME=.cache/ \
WANDB_API_KEY= \
NRL_FORCE_REBUILD_VENVS=true \
uv run python examples/penguin/run_grpo_penguin.py \
    --config=$CONFIG_PATH \
    logger.wandb.project="workplace-assistant-grpo" \
    logger.wandb.name=$EXP_NAME \
    logger.log_dir=results/$EXP_NAME \
    grpo.val_at_start=false \
    ++grpo.num_prompts_per_step=4 \
    ++grpo.num_generations_per_prompt=4 \
    ++grpo.max_num_steps=5 \
    ++policy.dtensor_cfg.clear_cache_every_n_steps=1 \
    ++cluster.num_nodes=1 \
    checkpointing.enabled=false
```

**Expected duration**: ~5-10 minutes on 8 GPUs

### Full Epoch Training (test_002)

Train for 1 complete epoch through the dataset:

```bash
pkill -f VllmAsyncGenerationWorker
ray stop --force
python -c "import ray; ray.shutdown()"

EXP_NAME="$(date +%Y%m%d)/penguin_grpo/qwen3_4binstruct/workplace_assistant_test_002"
CONFIG_PATH=examples/penguin/grpo_workplace_assistant_qwen3_4binstruct.yaml

HF_HOME=.cache/ \
WANDB_API_KEY= \
NRL_FORCE_REBUILD_VENVS=true \
uv run python examples/penguin/run_grpo_penguin.py \
    --config=$CONFIG_PATH \
    logger.wandb.project="workplace-assistant-grpo" \
    logger.wandb.name=$EXP_NAME \
    logger.log_dir=results/$EXP_NAME \
    grpo.val_at_start=false \
    grpo.val_period=36 \
    grpo.num_prompts_per_step=32 \
    grpo.num_generations_per_prompt=8 \
    grpo.max_num_steps=36 \
    grpo.max_num_epochs=1 \
    ++cluster.num_nodes=1 \
    checkpointing.enabled=true \
    checkpointing.save_period=36
```

**Training details:**
- **Prompts per step**: 32
- **Generations per prompt**: 8
- **Total rollouts per step**: 256
- **Steps per epoch**: 36 (ceil(1129/32))
- **Total rollouts**: 9,216
- **Expected duration**: ~2-3 hours on 8 GPUs

### Multi-Node Full Training

For production training with the default configuration:

```bash
# Use default config values (64 prompts/step × 16 generations = 1024 rollouts/step)
EXP_NAME="$(date +%Y%m%d)/penguin_grpo/qwen3_4binstruct/workplace_assistant_full"
CONFIG_PATH=examples/penguin/grpo_workplace_assistant_qwen3_4binstruct.yaml

HF_HOME=.cache/ \
WANDB_API_KEY= \
uv run python examples/penguin/run_grpo_penguin.py \
    --config=$CONFIG_PATH \
    logger.wandb.project="workplace-assistant-grpo" \
    logger.wandb.name=$EXP_NAME \
    logger.log_dir=results/$EXP_NAME
```

**Default configuration:**
- **Nodes**: 8 nodes × 8 GPUs = 64 GPUs
- **Prompts per step**: 64
- **Generations per prompt**: 16
- **Total rollouts per step**: 1024
- **Validation**: Every 10 steps
- **Checkpointing**: Top-3 by validation accuracy

## Configuration Details

### Model Configuration
- **Model**: `Qwen/Qwen3-4B-Instruct-2507`
- **Tool Parser**: Hermes (for multi-tool handling)
- **Context Length**: 32,768 tokens
- **Precision**: BFloat16
- **Tensor Parallel**: 2 (for a single node with 8 GPUs)

### GRPO Parameters
- **Max Rollout Turns**: 1 (single-turn with up to 6 tool steps)
- **Reward Normalization**: Enabled
- **Leave-One-Out Baseline**: Enabled
- **Reference Policy KL**: Disabled (`skip_reference_policy_logprobs_calculation=true`)

### Environment Configuration
- **Agent**: `workplace_assistant_simple_agent`
- **Max Tool Steps**: 6 per task
- **Tools**: 26 tools (Email, Calendar, CRM, Analytics, Project Management)
- **Reward**: State-based verification (compares final database state to ground truth)

## Monitoring Training

### WandB Metrics

Key metrics to monitor during training:

1. **Accuracy Metrics**:
   - `train:accuracy` - Training task success rate
   - `val:accuracy` - Validation task success rate

2. **Reward Metrics**:
   - `train:reward_mean` - Average reward per rollout
   - `train:advantage_mean` - Advantage values for policy gradient

3. **Performance Metrics**:
   - `E2E (Samples/sec)` - End-to-end throughput
   - `Training FLOPS` - Computational efficiency
   - `Generation (Tokens/sec)` - vLLM generation speed

4. **Policy Metrics**:
   - `policy_loss` - Policy gradient loss
   - `grad_norm` - Gradient norm (for stability)

### Log Files

Training logs are saved to:
- **Main log**: `results/$EXP_NAME/logs/`
- **Checkpoints**: `results/grpo_workplace_assistant/`
- **Penguin logs**: Detailed rollout information and tool execution traces

## Troubleshooting

### Common Issues

**1. Tool Execution Errors**

If you see errors like `got an unexpected keyword argument 'query'`, this is expected during exploration. The model is learning which arguments each tool accepts.

**2. Out of Memory**

Reduce batch sizes:
```bash
++grpo.num_prompts_per_step=16 \
++grpo.num_generations_per_prompt=4
```

**3. vLLM Generation Timeout**

Increase GPU memory utilization:
```bash
++policy.generation.vllm_cfg.gpu_memory_utilization=0.85
```

**4. Dataset Not Found**

Ensure you've run the download script and `ng_prepare_data` successfully:
```bash
ls -lh resources_servers/workplace_assistant/data/train.jsonl
ls -lh resources_servers/workplace_assistant/data/validation.jsonl
```

### Performance Optimization

**Single Node (8 GPUs)**:
- Use `tensor_parallel_size=2` (the default in the config)
- Set `num_prompts_per_step=32` for balanced throughput

**Multi-Node (64 GPUs)**:
- Use the default config with `num_prompts_per_step=64`
- Ensure a high-speed interconnect (InfiniBand/NVLink)

## Dataset Information

### Category Distribution

The dataset contains 5 task categories:
- **Email Management**: Searching, sending, replying to emails
- **Calendar Operations**: Event scheduling, search, updates
- **CRM Activities**: Customer data management and queries
- **Analytics**: Data visualization and reporting
- **Project Management**: Task tracking and project coordination

### Data Format

Each sample contains:
```json
{
  "id": 0,
  "responses_create_params": {
    "prompt": "...",
    "tools": [...]
+ }, + "ground_truth": [...], + "category": "...", + "environment_name": "workplace_assistant" +} +``` + +## License + +- **Code**: Apache 2.0 +- **Dataset**: Apache 2.0 +- **Model**: Qwen3 model license + +## References + +- **Dataset**: https://huggingface.co/datasets/nvidia/Nemotron-RL-agent-workplace_assistant +- **Model**: https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507 +- **NeMo-Gym Documentation**: See main repo README +- **Penguin Framework**: See `RL/3rdparty/Penguin-workspace/Penguin/README.md` +