The Grimoire
A multiplayer AI-powered tabletop RPG
Context engineering, tool calling, and prompt architecture for an AI game master
The main challenge here is giving the LLM enough world state to narrate coherently without blowing the context window. Every turn, I assemble a hierarchical context snapshot from live game state and inject it as a system message.
I structure it in layers of decreasing importance: party stats first (with condition penalties calculated), then the current scene, then world lore, then recent story turns. It's roughly the same priority order a human GM would keep in their head.
def build_world_context(world_state, recent_turns):
    lines = ["[CURRENT WORLD STATE]"]

    # Layer 1: Party — stats, conditions, effective values
    for char in world_state.get('party', []):
        stats = char.get('stats', {})
        stat_parts = []
        for stat_name in ['edge', 'heart', 'iron', 'shadow', 'wits']:
            base = stats.get(stat_name, 0)
            effective = get_effective_stat(char, stat_name)
            # "wits:3(base 4)" when a condition applies a penalty
            stat_parts.append(
                f"{stat_name}:{effective}" if effective == base
                else f"{stat_name}:{effective}(base {base})"
            )
        lines.append(f"{char.get('name', '?')} — {' '.join(stat_parts)}")

    # Layer 2: Current scene — location, situation, threats
    scene = world_state.get('scene', {})
    lines.append(f"Location: {scene.get('location', 'Unknown')}")
    lines.append(f"Situation: {scene.get('situation', 'Unknown')}")

    # Layer 3: World lore — last 5 facts only (token budget)
    for fact in world_state.get('world_facts', [])[-5:]:
        lines.append(f"- {fact.get('fact', '')}")

    # Layer 4: Recent story — last 10-15 turns
    for turn in recent_turns:
        lines.append(build_turn_summary(turn))

    return "\n".join(lines)
I inject the context as a second system message, placed between the main system prompt and the user's input. This exploits the model's recency bias: tokens closer to the end of the context window get more attention. The world state is always "top of mind" for the LLM, even in long-running games.
messages = [
{"role": "system", "content": system_prompt}, # Style & rules
{"role": "system", "content": context}, # Live world state
{"role": "user", "content": user_content} # Player's action
]
The token budget also adapts based on how many players are active. More players means more narrative to cover:
# Dynamic token budget based on player count
if player_count >= 4:
max_tokens = 450
elif player_count >= 2:
max_tokens = 350
else:
max_tokens = 250
The most important thing I got right early on was the trust boundary: the LLM handles narrative, the server handles mechanics. The LLM can never directly modify character stats, roll dice, or change game state. It can only request actions by calling tools, and the server decides what actually happens.
Each tool uses enum constraints in the JSON schema, which means the LLM can only pick from a fixed list of valid values. It can't hallucinate a move that doesn't exist or make up a stat name:
{
  "name": "resolve_move",
  "parameters": {
    "type": "object",
    "properties": {
      "move": {
        "type": "string",
        "enum": ["face_danger", "secure_advantage",
                 "gather_information", "compel",
                 "heal", "hearten", "resupply", ...]
      },
      "stat": {
        "type": "string",
        "enum": ["edge", "heart", "iron",
                 "shadow", "wits", "supply"]
      },
      "character": {
        "type": "string",
        "description": "Name of the character making the move"
      }
    }
  }
}
So the LLM decides which move fits the fiction and which stat makes sense for the approach, but it has no power to change numbers. The server rolls dice, calculates consequences, and mutates state.
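As a concrete sketch of what that server-side authority might look like: the move and stat names in the schema above resemble Ironsworn-style play, so assuming that resolution model (action die d6 + stat versus two d10 challenge dice), a minimal roll handler could be written like this. The function name and return shape are illustrative, not from the actual codebase:

```python
import random

def roll_move(effective_stat):
    """Server-authoritative dice roll — the LLM never sees or sets these numbers.

    Assumes Ironsworn-style resolution: action die (d6) + stat
    versus two challenge dice (d10 each).
    """
    action_score = random.randint(1, 6) + effective_stat
    c1, c2 = random.randint(1, 10), random.randint(1, 10)
    beaten = (action_score > c1) + (action_score > c2)
    # Beat both dice: strong hit. One: weak hit. Neither: miss.
    outcome = ("miss", "weak_hit", "strong_hit")[beaten]
    return {"outcome": outcome, "action_score": action_score, "challenge": (c1, c2)}
```

The key design point is that the LLM's tool call carries only the *intent* (move, stat, character); everything numeric happens here, after the call.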
The actual flow works in multiple passes. First, the LLM reads the context and calls tools. The server executes those tools (rolling dice, resolving consequences, updating tracks) and feeds the results back as tool role messages. Then the LLM gets the mechanical outcomes and writes the narrative around them:
# Pass 1: LLM decides what to do
completion = client.chat.completions.create(
    model=model,
    messages=messages,
    tools=active_tools,
    tool_choice="auto",
    max_tokens=max_tokens
)
message = completion.choices[0].message
messages.append(message)  # keep the assistant's tool calls in history

# Server executes each tool call
for tool_call in message.tool_calls or []:
    fn_name = tool_call.function.name
    args = json.loads(tool_call.function.arguments)
    tool_response = _execute_tool(fn_name, args, party, result)
    messages.append({
        "role": "tool",
        "tool_call_id": tool_call.id,
        "content": json.dumps(tool_response)
    })

# Pass 2: LLM narrates the outcome
completion2 = client.chat.completions.create(
    model=model,
    messages=messages,
    tools=active_tools,
    tool_choice="auto",
    max_tokens=max_tokens
)
There's one exception: apply_consequence. When a player misses a roll, the LLM chooses the type of consequence (physical harm, mental stress, lost momentum, supply damage, or a narrative complication) based on what makes sense in the fiction. But the severity is still server-controlled.
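A sketch of how that split of responsibility could be implemented: the LLM's tool call names one of the enum-constrained consequence types, and a server-owned table decides what that actually costs. The track names and amounts below are assumptions for illustration, not the game's real numbers:

```python
# Server-owned severity table — the LLM picks the *type*, never the amount
CONSEQUENCE_SEVERITY = {
    "physical_harm": {"track": "health",   "amount": 1},
    "mental_stress": {"track": "spirit",   "amount": 1},
    "lost_momentum": {"track": "momentum", "amount": 2},
    "supply_damage": {"track": "supply",   "amount": 1},
    "complication":  {"track": None,       "amount": 0},  # purely narrative
}

def apply_consequence(char, consequence_type):
    effect = CONSEQUENCE_SEVERITY[consequence_type]
    if effect["track"]:
        # Clamp at zero so the LLM's choice can never push a track negative
        char[effect["track"]] = max(0, char.get(effect["track"], 0) - effect["amount"])
    return {"applied": consequence_type, **effect}
```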
The system prompt isn't a single static block. I built it to be modular: a base prompt establishes the LLM's role, writing style, and rules, and then addenda get appended depending on what's happening in the game.
For example, when a Scene Challenge is active, a challenge addendum overrides the available moves and injects the challenge state into the instructions:
# Dynamic prompt composition based on game state
system_prompt = PARTY_MODE_PROMPT # Base: role + style + rules
if challenge_active:
system_prompt += SCENE_CHALLENGE_ADDENDUM.format(
objective=challenge.get('objective', '?'),
rank_label=rank_info.get('label', rank),
progress_boxes=progress_score(ticks),
clock_segments=clock,
clock_warning='*** CRITICAL ***' if clock >= 3 else '',
)
if ff_active:
system_prompt += FINISH_SCENE_ADDENDUM # or FINISH_CHAPTER_ADDENDUM
if farewell_char:
system_prompt += FAREWELL_SCENE_PROMPT.format(
name=farewell_char['name'],
fate='fallen in battle',
last_words=player_input or '(silence)',
)
The tool set itself also changes depending on state. During normal play, the LLM has access to resolve_move, update_scene, and oracle_roll. During a Scene Challenge, those are swapped out for resolve_sc_move and finish_the_scene. The LLM literally can't call the wrong tools because they're not available.
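The swap itself can be as simple as a state-keyed selector. The tool names below are the ones from the post; the helper function and the minimal tool dicts are illustrative:

```python
# Minimal tool definitions (shapes abbreviated for illustration)
RESOLVE_MOVE_TOOL    = {"type": "function", "function": {"name": "resolve_move"}}
UPDATE_SCENE_TOOL    = {"type": "function", "function": {"name": "update_scene"}}
ORACLE_ROLL_TOOL     = {"type": "function", "function": {"name": "oracle_roll"}}
RESOLVE_SC_MOVE_TOOL = {"type": "function", "function": {"name": "resolve_sc_move"}}
FINISH_THE_SCENE_TOOL = {"type": "function", "function": {"name": "finish_the_scene"}}

def select_active_tools(game_state):
    if game_state.get("challenge_active"):
        # Scene Challenge: normal moves are unavailable by construction,
        # so the LLM can't call them even if it wanted to
        return [RESOLVE_SC_MOVE_TOOL, FINISH_THE_SCENE_TOOL]
    return [RESOLVE_MOVE_TOOL, UPDATE_SCENE_TOOL, ORACLE_ROLL_TOOL]
```

Constraining by omission like this is stronger than a prompt instruction: an instruction can be ignored, but a tool that isn't in the request can't be called.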
Another thing I had to solve was output discipline. LLMs naturally want to present numbered option lists and "What will you do?" menus, which breaks the fiction. I explicitly ban these patterns in the system prompt, and a post-processing safety net strips any that slip through:
Your narrative must NEVER contain:
numbered/bulleted option lists,
"What will you do?" questions,
"You could/might..." suggestions,
"Choose:" headers,
or any prompt-like content.
End with a story hook (action, dialogue, or threat), not a menu.
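The post-processing safety net could look something like the following. This is a simplified sketch with hypothetical patterns, not the actual filter; a real version would likely cover more phrasings:

```python
import re

# Illustrative menu-pattern filters mirroring the prompt's bans
MENU_PATTERNS = [
    re.compile(r"^\s*(\d+[.)]|[-*•])\s+.*$", re.MULTILINE),  # numbered/bulleted options
    re.compile(r"What will you do\??", re.IGNORECASE),
    re.compile(r"^\s*Choose:.*$", re.IGNORECASE | re.MULTILINE),
]

def strip_menus(narrative):
    for pattern in MENU_PATTERNS:
        narrative = pattern.sub("", narrative)
    # Collapse the blank lines the removals leave behind
    return re.sub(r"\n{3,}", "\n\n", narrative).strip()
```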
Instead of letting the LLM generate menus inline, I handle interactive action suggestions in a separate LLM call with tool_choice="required", forcing the model to output structured data via a dedicated suggest_prompts tool. Narrative prose and interactive UI stay cleanly separated, so the story never breaks character.
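A sketch of that separate suggestions call, assuming an OpenAI-style chat completions client. The `suggest_prompts` tool shape and the helper function are my guesses at the structure, not the actual code:

```python
import json

# Assumed shape of the dedicated suggestions tool
SUGGEST_PROMPTS_TOOL = {
    "type": "function",
    "function": {
        "name": "suggest_prompts",
        "parameters": {
            "type": "object",
            "properties": {
                "suggestions": {
                    "type": "array",
                    "items": {"type": "string"},
                    "description": "Short action suggestions for the UI"
                }
            },
            "required": ["suggestions"]
        }
    }
}

def get_action_suggestions(client, model, messages):
    # tool_choice="required" forces a structured tool call — no freeform prose
    completion = client.chat.completions.create(
        model=model,
        messages=messages,
        tools=[SUGGEST_PROMPTS_TOOL],
        tool_choice="required",
    )
    call = completion.choices[0].message.tool_calls[0]
    return json.loads(call.function.arguments)["suggestions"]
```

Because this call can only ever return structured data, the suggestions render as UI buttons while the narrative call stays pure prose.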