Skip to content

matt-kempster/sm64.sql

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

92 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

sm64.sql

Load Super Mario 64 decompilation game data into a SQLite database so you can explore it with SQL.

The decomp stores its data as C macros and tables (OBJECT(...), MACRO_OBJECT(...), enum MacroPresets, #define MODEL_*, ...). sm64.sql reads those source files and writes the equivalent rows into a SQLite database, so questions like "which objects appear only in act 1?", "what music plays in each course?", or "where does each painting warp to?" become one-line queries.

It currently populates 30 tables (plus 17 derived views) spanning placed objects, models and per-level model loads, behaviors and their command scripts and native C code, levels/courses/areas, warps, camera-trigger zones, dialog, music, animations, sounds, the in-game course and star names, and the named constants behavior params use — see the schema and example queries below.

Install

pip install -e .            # from a checkout
pip install -e ".[dev]"     # plus pytest / black / mypy for development

Requires Python 3.8+ and a checkout of the SM64 decompilation source (this tool reads the source; it does not need the game to be built).

Usage

sm64-sql --repo /path/to/sm64 --db sm64.db
# or, without installing:
PYTHONPATH=src python -m sm64_sql --repo /path/to/sm64 --db sm64.db
Flag Description
-r, --repo Path to the SM64 decompilation source tree (required).
-d, --db SQLite file to write. Omit to use an in-memory database.
-o, --overwrite Overwrite an existing database without prompting.

Web playground

web/ is a static, zero-backend site that loads the database into your browser with sql.js and lets you run your own SQL (with syntax highlighting and schema-aware autocomplete), browse the schema, try curated example queries, and explore visual tabs — a map of any level's placed objects, an object × level/course heatmap, and a treemap of the game's object population (the last two built on D3). Every chart cell links back to the JOIN behind it. It's all client-side, so nothing you type leaves the page. Build the database and serve the folder:

sm64-sql -r /path/to/sm64 -d web/sm64.db -o
cd web && python3 -m http.server 8000   # then open http://localhost:8000

A GitHub Actions workflow rebuilds the database and deploys the site to GitHub Pages on every push to master. See web/README.md.

Schema

Table Source Columns
object levels/*/script.c model_name, level, initial_x/y/z, initial_rot_x/y/z, bhv_param, bhv_param_value, bhv_param_1bhv_param_4, behavior, in_act_1in_act_6
macro_object levels/**/macro.inc.c macro_name, level, yaw, pos_x/y/z, bhv_param, bhv_param_value
model include/model_ids.h model_name, model_id
model_load levels/*/script.c + levels/scripts.c level (common = shared), model_name, geo, layer, kind (geo/dl)
macro_preset include/macro_presets.h + macro_presets.inc.c macro_name, behavior, model_name, param, param_value
level levels/level_defines.h level_name, course_name, folder, internal_name, is_stub
course levels/course_defines.h course_name, display_name, dance_cutscene, is_bonus
course_name text/us/courses.h course_name, number (1-15, 0 for bonus), name (in-game file-select name)
star text/us/courses.h course_name, kind (main/secret), act (1-6, 0 for secret), name
sequence include/seq_ids.h seq_name, seq_id (music tracks)
dialog text/us/dialogs.h + include/dialog_ids.h dialog_name, dialog_id, lines_per_box, left_offset, width, text
special_preset include/special_presets.h + special_presets.inc.c preset_name, preset_id, preset_type, default_param, model_name, behavior
special_object levels/**/collision.inc.c preset_name, preset_id, level, area, pos_x/y/z, yaw, bhv_param, bhv_param_value
behavior include/behavior_data.h + data/behavior_data.c behavior_name, obj_list
behavior_command data/behavior_data.c behavior_name, seq, command, args, args_json
behavior_call src/game/behaviors/*.inc.c behavior_name, function, seq, call, args, args_json, file, line
behavior_data_spawn src/game/behaviors/*.inc.c behavior_name, spawned_behavior, spawned_model, source, function, file, line
mario_action include/sm64.h + src/game/mario*.c action_name, id (hex), group_name, flags_json, handler, file, line
mario_action_call src/game/mario*.c action_name, function, seq, call, target, condition, args, args_json, file, line
mario_action_data_transition src/game/mario*.c action_name, to_action, source, condition, function, file, line
warp levels/*/script.c level, area (0 = level-global), node_id, dest_level, dest_area, dest_node, flags, is_painting
instant_warp levels/*/script.c level, area, warp_index, dest_area, displace_x/y/z
area levels/*/script.c level, area, geo, terrain_type, background_music, dialog
camera_trigger src/game/camera.c + levels/level_defines.h level (NULL = defined but unused), camera_table, seq, area (−1 = whole-level), event, center_x/y/z, bounds_x/y/z, bounds_yaw, doc, file, line
mario_animation include/mario_animation_ids.h anim_name, anim_id
sound include/sounds.h sound_name, sound_id, bank
constant include/object_constants.h + src/game/level_update.h name, value, source (warp_nodes/object_constants)
save_struct src/game/save_file.h struct_name, size, align, doc
save_field src/game/save_file.h struct_name, seq, field_name, type_name, dims, count, elem_size, offset, size, is_struct, doc
save_flag src/game/save_file.h flag_group, bit, flag_name, mask

Names such as MODEL_BOO, bhvGoomba, and macro_yellow_coin_2 are kept as the symbolic strings used in the source, so the tables join naturally on those names.

Behavior scripts

Each behavior is a little bytecode program — an ordered array of command macros (BEGIN, CALL_NATIVE, SPAWN_CHILD, BEGIN_LOOP, …). The behavior_command table records one row per command, in order (seq), keeping the comma-joined args text plus args_json, a JSON array of the top-level-split arguments (so an expression like (OBJ_FLAG_A | OBJ_FLAG_B) stays one argument).

Three views read that backbone to expose the high-value relations as plain columns that join like any other table:

View From Columns
behavior_spawn SPAWN_CHILD / SPAWN_OBJ / SPAWN_CHILD_WITH_PARAM behavior_name, seq, kind, spawned_model, spawned_behavior, bhv_param
behavior_native CALL_NATIVE behavior_name, seq, func (the C function the behavior runs)
behavior_resource LOAD_ANIMATIONS / LOAD_COLLISION_DATA / SET_MODEL behavior_name, seq, kind, symbol

Because the arguments are stored as JSON, json_each answers the questions the views do not — e.g. "which commands name this symbol in any argument slot?"

Behavior native code

A behavior script mostly just CALL_NATIVEs into C functions under src/game/behaviors/. The real logic — what an object spawns, the sounds it plays, the dialog it shows — lives in those functions and the helpers they call, and the script never names it. The behavior_call table is the backbone for that layer: one row per call site in the C, attributed to the behavior(s) that reach it. The C is parsed with tree-sitter (a real syntax tree, not regex), so every function and every call is enumerated structurally; each call is attributed by following the static call graph from a behavior's CALL_NATIVE roots through the object code, treating engine helpers (spawn_object, cur_obj_play_sound_2, …) as the leaf relation vocabulary.

The same backbone-plus-views pattern then exposes the relations as plain columns. The target symbol is matched by argument pattern (a spawned behavior is the bhv* argument, a model the MODEL_* one), so the views are robust to each helper's differing argument order:

View From calls Columns (besides behavior_name, function, file, line, call)
behavior_calls_spawn spawn_object, spawn_object_relative, … spawned_behavior, spawned_model
behavior_calls_sound cur_obj_play_sound_1/2, play_sound, … sound
behavior_calls_model cur_obj_set_model model
behavior_calls_dialog cur_obj_update_dialog_with_cutscene, cutscene_object_with_dialog, … dialog
behavior_calls_morph cur_obj_set_behavior, obj_set_behavior becomes_behavior
behavior_calls_seek cur_obj_nearest_object_with_behavior, cur_obj_has_behavior, … target_behavior

Each view lists only the call sites whose target resolves to a literal symbol, so the target column is never null. A call that passes its target as a runtime value — a signpost reading its dialog id from oBhvParams2ndByte, a spawn of a behavior held in a variable — stays in behavior_call (query it directly) but is not surfaced as a clean edge here.

Reachability follows two C idioms a plain call graph cannot: a behavior's loop dispatching per-frame logic through an action-function table (cur_obj_call_action_function(sFooActions)), and a spawn whose behavior is read from a static table or forwarded through a helper parameter. The latter — e.g. the exclamation box spawning its sExclamationBoxContents table — are resolved interprocedurally into the behavior_data_spawn table (with the model paired from the same row). The behavior_all_spawns view then unions all three sources — behavior_spawn (bytecode), behavior_calls_spawn (literal C), and behavior_data_spawn (data-table) — tagged by origin, for the complete spawn graph.

Completeness is auditable rather than assumed: nothing is filtered at parse time, so behavior_call_unclassified lists every captured call a relation view does not classify (most-frequent first) — the visible residue to scan for a helper family worth promoting to a relation.

This is what makes runtime spawns visible that the script alone cannot show: a Bob-omb's explosion (spawn_object(bhvExplosion) inside bobomb_act_explode) has no SPAWN_* opcode, so it is absent from behavior_spawn but present in behavior_calls_spawn.

Mario's action state machine

Mario is a state machine: each frame he is in one action (ACT_WALKING, ACT_LONG_JUMP, …) and transitions to another by calling set_mario_action(m, ACT_*, arg). The same tree-sitter machinery mines the whole machine from src/game/:

  • mario_action is the node table: every ACT_* constant, with its group_name (the seven dispatch groups) and flags_json decoded from the bit-packed value, and the handler act_* read from the authoritative mario_execute_*_action dispatcher switch (which captures shared handlers and fall-through cases).
  • mario_action_call is the transition backbone: from each action's handler, reachability follows the static call graph through Mario's action code, and every call to a transition setter (set_mario_action and its siblings) is recorded, attributed to the action(s) that reach it. The four setters that take the destination as an argument are the leaf vocabulary; the fixed-target helpers (set_steep_jump_actionACT_STEEP_JUMP, …) are recursed into. Each call also records its trigger — the nearest enclosing if-guard (m->input & INPUT_Z_PRESSED, m->forwardVel >= 38.0f), negated for an else branch — which is what the Actions tab labels each arrow with.
  • Group-wide cancels. Each per-group dispatcher mario_execute_*_action runs check_common_*_cancels before its switch, so those transitions apply to every action in the group, not to one handler — without this, ACT_DROWNING, ACT_SQUISHED, and ACT_VERTICAL_WIND had no incoming edge at all. The walk also starts from each dispatcher (treating the handlers as leaves) and attributes that ambient code to the whole group.
  • Flag-gated refutation. Some transitions sit inside a shared helper's switch on a flag-gated result — common_air_action_step sets ACT_START_HANGING under case AIR_STEP_GRABBED_CEILING, which perform_air_step only returns when the caller passed AIR_STEP_CHECK_HANG. The call graph alone attributes that to every air action, so a backflip falsely appears to reach hanging. The parser discovers the result → flag contract from the code (a constant returned only under arg & FLAG) and refutes the edge for callers that don't pass the flag: the row keeps a gated_by flag, is dropped from mario_transition, and stays visible in mario_transition_refuted. (Jump and double-jump, which do pass AIR_STEP_CHECK_HANG, keep their hang edge.) The same refutation runs on a source-action flag: a group cancel guarded by m->action & ACT_FLAG_* (e.g. check_for_instant_quicksand needs ACT_FLAG_INVULNERABLE) only fires for actions whose own decoded flags satisfy the guard, so the group-wide attribution is trimmed to those — read straight off mario_action.flags_json, with || conditions left alone as a conservative guard.

Three views and a table expose the edges:

View / table Holds Columns
mario_transition literal-target edges (flag-refuted ones excluded) action_name, to_action
mario_action_data_transition runtime targets resolved to a literal action action_name, to_action, source, function, file, line
mario_all_transitions the dedup union of the two action_name, to_action
mario_transition_refuted call-graph edges a flag (an argument, or the source action's own ACT_FLAG_*) disproves action_name, to_action, gated_by

mario_action_data_transition resolves the tractable runtime targets the same way behavior_data_spawn does: a forwarded land action (common_air_action_step(m, ACT_JUMP_LAND, …)act_jump lands as ACT_JUMP_LAND) and a literal embedded in an expression (a ternary's two branches). The genuinely runtime tail — struct-table landings (landingAction->endAction) and two-level forwards — stays visible in mario_action_call_unclassified, the completeness residue.

Camera trigger zones

Nine courses steer the camera with trigger zones — world-space boxes that run a CameraEvent function (cam_bob_tower, cam_thi_move_cam_through_tunnel, …) while Mario is inside them. The camera_trigger table captures every box from the struct CameraTrigger sCam*[] tables in camera.c: its center_x/y/z, half-extent bounds_x/y/z, the bounds_yaw that rotates it about the vertical axis, and the table's own doc comment. The boxes share the object coordinate space, so the web playground's Map tab overlays them on the level — the invisible camera logic, made visible.

Each table is wired to its level by the camera-table column of DEFINE_LEVEL in level_defines.h; that resolves camera_trigger.level to the folder. One table, sCamBOB, is defined but no level references it (BOB's column is _), so it is dead code: its rows keep level = NULL and are surfaced by the camera_trigger_unused residue view rather than silently dropped.

Save file layout

Super Mario 64 saves to a 512-byte (EEPROM_SIZE = 0x200) EEPROM, and the on-cartridge image is one C struct: struct SaveBuffer. save_layout.py is a small struct-layout engine over src/game/save_file.h — it parses the structs, resolves their constants (NUM_SAVE_FILES, the COURSE_COUNT / COURSE_STAGES_COUNT enum counts) and the sizeof-based EEPROM filler, applies N64-ABI natural alignment, and computes the byte offset and size of every member. It emits the layout of every struct reachable from SaveBuffer into save_struct (sized structs) and save_field (each member at its offset, keeping the declared array shape in dimsfiles[4][2] is the four save files each stored twice as a backup — and is_struct so a struct-typed member points back into save_struct). EU-only fields behind #ifdef VERSION_EU are filtered out so the common US/JP layout is the one captured.

The progress flags word is bit-packed; save_flag decodes the SAVE_FLAG_* #defines into one row per bit (bit, mask). The unused bits (21–23, 29–31) simply have no row — the gaps are the residue. The whole thing is checked the same way every other corpus is: the save_struct_coverage view reports each struct's declared size against the sum of its field sizes, and that padding_bytes slack is 0 for every struct — the layout provably tiles the EEPROM, with SaveBuffer summing to exactly 0x200. The web playground's Save tab renders it as a memory-map block diagram (the file/backup overview, a per-byte struct grid, and the 32-bit flags ribbon).

Behavior parameters

Each placed object carries a behavior parameter that the game packs into the 32-bit oBhvParams field, written in the source as a combination of the BPARAMn macros (BPARAM1(x) is the top byte, BPARAM2(x) the next, …). Each byte is an independent, behavior-specific field — a warp node, a dialog id, a star index, an enemy size, and so on. The tables expose it three ways:

  • bhv_param keeps the expression exactly as written (e.g. BPARAM1(0x01) | BPARAM2(WARP_NODE_03) or DIALOG_089).
  • bhv_param_value is the resolved 32-bit integer, but only when the whole expression is numeric; expressions that reference a #defined constant are left NULL.
  • on object, bhv_param_1bhv_param_4 hold the argument written in each BPARAMn slot (bhv_param_2 is the famous oBhvParams2ndByte), so a warp node or star index can be selected or joined directly.

Those symbolic byte values (WARP_NODE_0A, STAR_INDEX_ACT_3, GOOMBA_SIZE_HUGE) are names; the constant table resolves each to its integer value, so you can join a param byte to its number.

Example queries

-- How many placed objects are in each level?
SELECT level, COUNT(*) AS n FROM object GROUP BY level ORDER BY n DESC;

-- Objects that only appear during act 1.
SELECT level, behavior, initial_x, initial_y, initial_z
FROM object
WHERE in_act_1 AND NOT (in_act_2 OR in_act_3 OR in_act_4 OR in_act_5 OR in_act_6);

-- Each course's display name and the music that plays in its main area.
SELECT c.display_name, s.seq_name
FROM area a
JOIN level l ON a.level = l.folder
JOIN course c ON l.course_name = c.course_name
JOIN sequence s ON a.background_music = s.seq_name
WHERE a.area = 1;

-- The level-to-level warp graph (ignoring within-level warps).
SELECT w.level AS from_level, dl.folder AS to_level, w.node_id
FROM warp w
JOIN level sl ON w.level = sl.folder
JOIN level dl ON w.dest_level = dl.level_name
WHERE w.dest_level <> sl.level_name;

-- Read the intro dialog shown when entering each area.
SELECT a.level, a.area, d.text
FROM area a JOIN dialog d ON a.dialog = d.dialog_name;

-- Which placed objects run a "surface" behavior?
SELECT o.level, o.behavior
FROM object o JOIN behavior b ON o.behavior = b.behavior_name
WHERE b.obj_list = 'OBJ_LIST_SURFACE';

-- Resolve each macro object to its behavior and model.
SELECT mo.level, mo.macro_name, mp.behavior, mp.model_name
FROM macro_object mo JOIN macro_preset mp ON mo.macro_name = mp.macro_name;

-- Read the message on each signpost: its behavior param is a dialog id.
SELECT mo.level, d.text
FROM macro_object mo JOIN dialog d ON mo.bhv_param = d.dialog_name
WHERE mo.macro_name = 'macro_wooden_signpost';

-- Which warp node does each warp object send Mario to? (BPARAM2 = 2nd byte)
SELECT level, behavior, bhv_param_2 AS warp_node
FROM object
WHERE behavior LIKE '%Warp%' AND bhv_param_2 IS NOT NULL;

-- Map each star to the act that awards it (BPARAM1 = 1st byte).
SELECT level, behavior, bhv_param_1 AS star_index
FROM object
WHERE bhv_param_1 LIKE 'STAR_INDEX_%'
ORDER BY level, bhv_param_1;

-- Every main-course star name, by act.
SELECT c.display_name, s.act, s.name
FROM star s JOIN course c ON s.course_name = c.course_name
WHERE s.kind = 'main'
ORDER BY c.display_name, s.act;

-- The payoff: which object awards each named star (star name <- STAR_INDEX byte).
SELECT l.folder AS level, o.behavior, s.name AS star
FROM object o
JOIN level l ON o.level = l.folder
JOIN star s ON s.course_name = l.course_name
           AND s.act = CAST(replace(o.bhv_param_1, 'STAR_INDEX_ACT_', '') AS INTEGER)
WHERE o.bhv_param_1 LIKE 'STAR_INDEX_ACT_%';

-- One model slot is reused per level: what geo is MODEL_LEVEL_GEOMETRY_03 in each?
SELECT level, geo FROM model_load
WHERE model_name = 'MODEL_LEVEL_GEOMETRY_03' AND level <> 'common'
ORDER BY level;

-- Resolve a symbolic param byte to its number via the constant table.
SELECT o.level, o.behavior, o.bhv_param_2 AS warp_node, c.value
FROM object o JOIN constant c ON o.bhv_param_2 = c.name
WHERE o.bhv_param_2 LIKE 'WARP_NODE_%';

-- The object spawn graph: which behavior spawns which (a self-join).
SELECT behavior_name AS parent, spawned_behavior AS child, spawned_model
FROM behavior_spawn
WHERE spawned_behavior IS NOT NULL
ORDER BY parent;

-- What C code implements each behavior? (its init/loop/update functions)
SELECT behavior_name, group_concat(func, ', ') AS funcs
FROM behavior_native GROUP BY behavior_name;

-- Which behaviors load a given collision mesh / animation set?
SELECT behavior_name, symbol FROM behavior_resource WHERE kind = 'collision';

-- The *complete* spawn graph: bytecode, literal-C, and data-table spawns, each
-- tagged by origin. The 'c'/'data' edges live only in code (e.g. a Bob-omb's
-- explosion, or the exclamation box's contents) -- invisible to the bytecode.
SELECT behavior_name AS parent, spawned_behavior AS child, spawned_model, origin
FROM behavior_all_spawns ORDER BY parent;

-- Which objects make a given sound, and from which C function?
SELECT behavior_name, function, sound FROM behavior_calls_sound
WHERE sound = 'SOUND_OBJ_BOBOMB_WALK';

-- Completeness audit: captured C calls that no relation view classifies yet.
SELECT call, n FROM behavior_call_unclassified LIMIT 25;

-- Mario's action graph: where can a long jump go next?
SELECT to_action FROM mario_all_transitions WHERE action_name = 'ACT_LONG_JUMP';

-- The action hubs: which actions the most other actions can transition into.
SELECT to_action, COUNT(DISTINCT action_name) AS in_degree
FROM mario_all_transitions GROUP BY to_action ORDER BY in_degree DESC LIMIT 10;

-- Every attacking action and its group (flags decoded from the packed value).
SELECT action_name, group_name FROM mario_action
WHERE flags_json LIKE '%ATTACKING%' ORDER BY group_name;

-- Read a single behavior's command script, in order.
SELECT seq, command, args FROM behavior_command
WHERE behavior_name = 'bhvGoomba' ORDER BY seq;

-- The open door: every command that names a symbol in ANY argument slot.
SELECT behavior_name, command FROM behavior_command, json_each(args_json)
WHERE json_each.value = 'MODEL_BOWSER_FLAME';

How it works

The decomp's level data is written as C macro invocations with symbolic arguments (OBJECT(MODEL_BOO, ..., bhvGhostHuntBigBoo)), not plain C that a compiler would parse directly — the files are designed to be #included and macro-expanded. sm64.sql reads them line by line and pulls the arguments out of each known macro call. Argument extraction is bracket-aware (src/sm64_sql/parse_utils.py), so behavior-parameter expressions like BPARAM2(41) or BPARAM1(0) | BPARAM2(1) are split, decomposed into their byte slots, and arithmetically resolved when fully numeric (src/sm64_sql/behavior_param.py).

This deliberately stops at structure, not semantics: it keeps MODEL_BOO as a string rather than expanding it to a number, and likewise leaves a behavior param such as BPARAM2(WARP_NODE_03) as the symbol WARP_NODE_03 rather than resolving the #define to a number. See docs/parsing.md for the trade-offs and when a real C parser (e.g. tree-sitter) would be worth adopting.

Development

pip install -e ".[dev]"
pytest                 # unit tests (fast, hermetic)
SM64_DECOMP_PATH=/path/to/sm64 pytest   # also run the end-to-end test
black src tests        # format
mypy                   # type-check

Status & limitations

30 tables (plus 17 views) are populated from a full current n64decomp/sm64 checkout: placed objects, macro objects and special objects; models and per-level model loads, behaviors and their command scripts and native C code, macro/special presets; levels, courses and areas; warps and instant warps; dialog text, music sequences, Mario animations, and sound effects; the in-game course and star names; and the named constants behavior params use. Counts are cross-checked against the source.

Behavior parameters (bhvParam / preset param) are captured on the object, macro object, special object, and macro preset tables — split into their BPARAM byte slots and resolved to a number when the expression is numeric. The symbolic byte values (WARP_NODE_*, STAR_INDEX_*, GOOMBA_SIZE_*, …) resolve to integers via the constant table.

Behavior command scripts are recorded as an ordered command stream (behavior_command), but their control flow is not resolved: GOTO / CALL jump targets are kept as the raw argument text rather than linked to a command position. The data-binding opcodes (spawns, native calls, loaded resources) are surfaced as views; the loop/jump opcodes are just rows.

Not yet captured (contributions welcome):

  • Geo layouts, collision geometry, trajectories, and the level command script flow (the JUMP/CALL/LOOP in levels/scripts.c) are not extracted — see the geometry note in the project brief; these are intentionally out of scope.
  • The geo symbols in model_load are recorded as names; the geo layouts they point at are not parsed (out of scope, as above).
  • A few values are kept verbatim rather than resolved: e.g. an area whose music is SEQ_x | SEQ_VARIATION won't join to sequence by name.

About

A reproducible SQL database of Super Mario 64 data.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors