Refactor COW interpreter and compiler: bytecode, instruction decoder, jump table, and faster dispatch#1
Conversation
|
You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard. |
Review Summary by QodoRefactor COW interpreter and compiler with bytecode, computed-goto dispatch, and jump tables
WalkthroughsDescription• Refactor interpreter to use bytecode instructions and computed-goto dispatch • Replace iterator-based memory with index-based access for better performance • Precompute jump table for MOO/moo loop pairs to enable faster control flow • Centralize instruction decoding logic in decode_instruction function • Optimize compiler to build output in-memory via emit helper before writing • Add COW_MAX_STEPS environment variable support to guard long-running programs • Improve file I/O by reading entire source into buffer before parsing • Add templated apply_all_stomachs utility in DDX module for cleaner code Diagramflowchart LR
A["Source File"] -->|Read into buffer| B["Bytecode Program"]
B -->|Precompute jump table| C["Jump Table"]
C -->|Dispatch via computed-goto| D["Execute Instructions"]
D -->|Index-based memory| E["Memory Array"]
B -->|Compiler path| F["Emit to string"]
F -->|Write once| G["cow.out.cpp"]
File Changes1. source/cow.cpp
|
Code Review by Qodo
1. mOO breaks loop exec
|
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly refactors the core components of the COW interpreter and compiler to enhance their performance, maintainability, and consistency. By introducing a bytecode-based instruction set, optimizing memory access with index-based navigation, and streamlining control flow with precomputed jump tables and efficient dispatch mechanisms, the changes aim to create a more robust and faster execution environment for COW programs. The refactoring also centralizes common logic, such as instruction decoding, to promote code reuse and simplify future development. Highlights
Changelog
Activity
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Summary
This PR refactors the COW interpreter and compiler with bytecode encoding, jump tables, and faster dispatch. While the architecture improvements are sound, there are critical defects that must be fixed before merge.
Critical Issues Found
- Crash risk in computed-goto dispatch: Empty program causes segfault on line 82
- Out-of-bounds array access: Invalid instructions (>11) cause OOB read in dispatch table
- Missing bounds checks: Multiple memory access operations lack
mem_pos < memory.size()validation - File I/O error handling: Negative
ftell()return values are cast to unsigned, potentially causing huge allocations in both interpreter and compiler
These issues create multiple crash vectors and potential security vulnerabilities that block merge. Please address all findings before proceeding.
You can now have the agent implement changes and create commits directly on your pull request's source branch. Simply comment with /q followed by your request in natural language to ask the agent to make changes.
| op_mOO: | ||
| if( memory[mem_pos] == 3 ) | ||
| quit( false ); | ||
| if( memory[mem_pos] < 0 || memory[mem_pos] > 11 ) | ||
| quit( false ); | ||
| return exec(*mem_pos); | ||
|
|
||
| // Moo | ||
| case 4: | ||
| if( (*mem_pos) != 0 ) | ||
| printf( "%c", *mem_pos ); | ||
| instruction = (instruction_t)memory[mem_pos]; | ||
| continue; |
There was a problem hiding this comment.
🛑 Crash Risk: Missing bounds check before accessing memory[mem_pos]. If memory is empty or mem_pos is out of bounds, this will cause undefined behavior or crash.
| op_mOO: | |
| if( memory[mem_pos] == 3 ) | |
| quit( false ); | |
| if( memory[mem_pos] < 0 || memory[mem_pos] > 11 ) | |
| quit( false ); | |
| return exec(*mem_pos); | |
| // Moo | |
| case 4: | |
| if( (*mem_pos) != 0 ) | |
| printf( "%c", *mem_pos ); | |
| instruction = (instruction_t)memory[mem_pos]; | |
| continue; | |
| op_mOO: | |
| if( mem_pos >= memory.size() ) | |
| quit( true ); | |
| if( memory[mem_pos] == 3 ) | |
| quit( false ); | |
| if( memory[mem_pos] < 0 || memory[mem_pos] > 11 ) | |
| quit( false ); | |
| instruction = (instruction_t)memory[mem_pos]; | |
| continue; |
| op_Moo: | ||
| if( memory[mem_pos] != 0 ) | ||
| printf( "%c", memory[mem_pos] ); | ||
| else | ||
| { | ||
| (*mem_pos) = getchar(); | ||
| memory[mem_pos] = getchar(); | ||
| while( getchar() != '\n' ); | ||
| } | ||
| break; | ||
|
|
||
| // MOo | ||
| case 5: | ||
| (*mem_pos)--; | ||
| break; | ||
|
|
||
| // MoO | ||
| case 6: | ||
| (*mem_pos)++; | ||
| break; | ||
|
|
||
| // MOO | ||
| case 7: | ||
| if( (*mem_pos) == 0 ) | ||
| ++pc; | ||
| if( pc < program_size ) instruction = prog[pc]; | ||
| continue; |
There was a problem hiding this comment.
🛑 Crash Risk: All memory access operations (op_Moo, op_MOo, op_MoO, op_OOO, op_MMM, op_OOM, op_oom, op_MOO) dereference memory[mem_pos] without checking if mem_pos < memory.size(). This pattern repeats throughout the computed-goto dispatch and can cause crashes.
| op_Moo: | |
| if( memory[mem_pos] != 0 ) | |
| printf( "%c", memory[mem_pos] ); | |
| else | |
| { | |
| (*mem_pos) = getchar(); | |
| memory[mem_pos] = getchar(); | |
| while( getchar() != '\n' ); | |
| } | |
| break; | |
| // MOo | |
| case 5: | |
| (*mem_pos)--; | |
| break; | |
| // MoO | |
| case 6: | |
| (*mem_pos)++; | |
| break; | |
| // MOO | |
| case 7: | |
| if( (*mem_pos) == 0 ) | |
| ++pc; | |
| if( pc < program_size ) instruction = prog[pc]; | |
| continue; | |
| op_Moo: | |
| if( mem_pos >= memory.size() ) | |
| quit( true ); | |
| if( memory[mem_pos] != 0 ) | |
| printf( "%c", memory[mem_pos] ); | |
| else | |
| { | |
| memory[mem_pos] = getchar(); | |
| while( getchar() != '\n' ); | |
| } | |
| ++pc; | |
| if( pc < program_size ) instruction = prog[pc]; | |
| continue; |
| fseek( f, 0, SEEK_END ); | ||
| const long size = ftell( f ); | ||
| rewind( f ); | ||
|
|
||
| char buf[3]; | ||
| memset( buf, 0, 3 ); | ||
| int pos = 0; | ||
| std::vector<char> source( size > 0 ? (size_t)size : 0 ); |
There was a problem hiding this comment.
🛑 Logic Error: When ftell returns -1 on error, the code casts it to size_t creating a huge allocation. Additionally, size is signed long but cast to size_t without checking for negative values.
| fseek( f, 0, SEEK_END ); | |
| const long size = ftell( f ); | |
| rewind( f ); | |
| char buf[3]; | |
| memset( buf, 0, 3 ); | |
| int pos = 0; | |
| std::vector<char> source( size > 0 ? (size_t)size : 0 ); | |
| fseek( f, 0, SEEK_END ); | |
| const long size = ftell( f ); | |
| rewind( f ); | |
| if( size < 0 ) | |
| { | |
| printf( "Error reading file size.\n" ); | |
| fclose( f ); | |
| exit( 1 ); | |
| } | |
| std::vector<char> source( size > 0 ? (size_t)size : 0 ); |
| fseek( f, 0, SEEK_END ); | ||
| const long size = ftell( f ); | ||
| rewind( f ); | ||
|
|
||
| while( !feof(f) ) | ||
| std::vector<char> source( size > 0 ? (size_t)size : 0 ); |
There was a problem hiding this comment.
🛑 Logic Error: Same ftell error handling issue exists in the compiler - negative return from ftell is cast to unsigned size_t, potentially causing huge allocation or wraparound.
| fseek( f, 0, SEEK_END ); | |
| const long size = ftell( f ); | |
| rewind( f ); | |
| while( !feof(f) ) | |
| std::vector<char> source( size > 0 ? (size_t)size : 0 ); | |
| fseek( f, 0, SEEK_END ); | |
| const long size = ftell( f ); | |
| rewind( f ); | |
| if( size < 0 ) | |
| { | |
| printf( "Error reading file size.\n" ); | |
| fclose( f ); | |
| exit( 1 ); | |
| } | |
| std::vector<char> source( size > 0 ? (size_t)size : 0 ); |
Co-authored-by: amazon-q-developer[bot] <208079219+amazon-q-developer[bot]@users.noreply.github.com>
Co-authored-by: amazon-q-developer[bot] <208079219+amazon-q-developer[bot]@users.noreply.github.com>
There was a problem hiding this comment.
Code Review
This pull request significantly modernizes the COW interpreter and compiler, introducing bytecode, a jump table for faster control flow, and improved I/O handling. The use of computed gotos for performance in the interpreter is a great touch. The changes are well-aligned with the goals of improving performance and maintainability.
My review focuses on a couple of areas:
- A repeated code block that could be centralized for better maintainability.
- A subtle bug in input handling that could lead to infinite loops when
EOFis encountered.
Overall, this is an excellent refactoring. Addressing these points will make the implementation even more robust.
| op_Moo: | ||
| if( memory[mem_pos] != 0 ) | ||
| printf( "%c", memory[mem_pos] ); | ||
| else |
There was a problem hiding this comment.
| case 4: | ||
| if( memory[mem_pos] != 0 ) | ||
| printf( "%c", memory[mem_pos] ); | ||
| else |
There was a problem hiding this comment.
| // Moo | ||
| case 4: | ||
| fprintf( output, "if((*p)!=0){putchar(*p);}else{(*p)=getchar();while(getchar()!='\\n');}" ); | ||
| emit( "if((*p)!=0){putchar(*p);}else{(*p)=getchar();while(getchar()!='\\n');}" ); |
There was a problem hiding this comment.
The generated code for the Moo instruction includes while(getchar()!='\n');, which can lead to an infinite loop on EOF in the compiled program. The generated code should also check for EOF when flushing the input buffer.
emit( "if((*p)!=0){putchar(*p);}else{(*p)=getchar();if((*p)!=EOF)for(int c=0;(c=getchar())!='\n'&&c!=EOF;);}" );| int decode_instruction( const char* token ) | ||
| { | ||
| switch( token[0] ) | ||
| { | ||
| case 'm': | ||
| if( token[1] == 'o' && token[2] == 'o' ) return 0; | ||
| if( token[1] == 'O' && token[2] == 'o' ) return 1; | ||
| if( token[1] == 'o' && token[2] == 'O' ) return 2; | ||
| if( token[1] == 'O' && token[2] == 'O' ) return 3; | ||
| break; | ||
| case 'M': | ||
| if( token[1] == 'o' && token[2] == 'o' ) return 4; | ||
| if( token[1] == 'O' && token[2] == 'o' ) return 5; | ||
| if( token[1] == 'o' && token[2] == 'O' ) return 6; | ||
| if( token[1] == 'O' && token[2] == 'O' ) return 7; | ||
| if( token[1] == 'M' && token[2] == 'M' ) return 9; | ||
| break; | ||
| case 'O': | ||
| if( token[1] == 'O' && token[2] == 'O' ) return 8; | ||
| if( token[1] == 'O' && token[2] == 'M' ) return 10; | ||
| break; | ||
| case 'o': | ||
| if( token[1] == 'o' && token[2] == 'm' ) return 11; | ||
| break; | ||
| }; | ||
|
|
||
| return -1; | ||
| } |
There was a problem hiding this comment.
The decode_instruction function is duplicated from source/cow.cpp. To improve maintainability and adhere to the PR's goal of centralizing logic, consider moving this function, along with the instruction_t typedef, into a shared header file (e.g., cow_common.h) and including it in both cow.cpp and cowcomp.cpp.
| op_mOO: | ||
| if( memory[mem_pos] == 3 ) | ||
| quit( false ); | ||
| if( memory[mem_pos] < 0 || memory[mem_pos] > 11 ) | ||
| quit( false ); | ||
| return exec(*mem_pos); | ||
|
|
||
| // Moo | ||
| case 4: | ||
| if( (*mem_pos) != 0 ) | ||
| printf( "%c", *mem_pos ); | ||
| instruction = (instruction_t)memory[mem_pos]; | ||
| continue; |
There was a problem hiding this comment.
1. Moo breaks loop exec 🐞 Bug ✓ Correctness
In source/cow.cpp, mOO (opcode 3) re-dispatches to an opcode from memory without changing pc, so if it resolves to moo (0) or MOO (7) the jump_table lookup uses the mOO location and errors. This breaks compatibility with the repo’s compiler and DDX interpreter, both of which allow mOO to execute moo/MOO relative to the current position.
Agent Prompt
### Issue description
`mOO` (opcode 3) re-dispatches to an opcode from memory without updating the program counter. If that opcode is `moo` (0) or `MOO` (7), the interpreter uses `jump_table[pc]` keyed by the `mOO` instruction position, which is not a loop token, causing `quit(true)` and breaking compatibility with existing semantics.
### Issue Context
- Loop opcodes use `jump_table[pc]`.
- `mOO` can legally dispatch to opcode 0 or 7 (compiler supports it and DDX interpreter historically supports it).
### Fix Focus Areas
- source/cow.cpp[90-119]
- source/cow.cpp[145-154]
- source/cow.cpp[201-240]
### Implementation notes
- Add a helper to compute loop targets relative to an arbitrary `pc` (old scan-based algorithm is acceptable for this rare path):
- For virtual `moo` at `pc`: scan backward to the matching `MOO`, set `pc` to that index, then continue dispatch.
- For virtual `MOO` at `pc`: if current cell is 0, scan forward to matching `moo` and set `pc` to `match+1`; else `pc++`.
- Use existing `jump_table` fast-path only when the opcode being executed is actually at `program[pc]`; otherwise use the virtual scan path.
- Ensure both dispatch backends (computed-goto and switch fallback) share the same logic.
ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools
| void execute_program( long long max_steps ) | ||
| { | ||
| const instruction_t* prog = program.data(); | ||
| const int program_size = (int)program.size(); | ||
| int pc = 0; | ||
| long long steps = 0; | ||
|
|
||
| #if defined(__GNUC__) || defined(__clang__) | ||
| static void* dispatch[] = { | ||
| &&op_moo, &&op_mOo, &&op_moO, &&op_mOO, | ||
| &&op_Moo, &&op_MOo, &&op_MoO, &&op_MOO, | ||
| &&op_OOO, &&op_MMM, &&op_OOM, &&op_oom | ||
| }; | ||
| instruction_t instruction = prog[pc]; | ||
|
|
There was a problem hiding this comment.
2. Empty program crashes 🐞 Bug ⛯ Reliability
In the GCC/Clang computed-goto dispatcher, execute_program reads prog[0] before checking program_size, so an empty parsed program dereferences invalid memory and can crash. main() can produce an empty program (no valid tokens) and still calls execute_program().
Agent Prompt
### Issue description
The computed-goto path in `execute_program()` dereferences `prog[0]` unconditionally. If the parsed program is empty, this is undefined behavior and may crash.
### Issue Context
The parser can legitimately produce an empty `program` (e.g., empty source file or file with no valid 3-char tokens). `main()` still calls `execute_program()`.
### Fix Focus Areas
- source/cow.cpp[69-85]
- source/cow.cpp[339-356]
- source/cow.cpp[387-392]
### Implementation notes
- Add `if (program_size == 0) return;` before `instruction_t instruction = prog[pc];`, OR
- Restructure computed-goto loop so the first `instruction` fetch happens only after verifying `pc < program_size` (e.g., fetch at top of the loop).
ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools
Motivation
Description
instruction_t(uint8_t) andprogram_tto store parsed instructions as bytes, and adddecode_instructionto map 3-character tokens to opcodes in bothsource/cow.cppandsource/cowcomp.cpp.source/cow.cpp, add ajump_tablecomputed from matchingMOO/moopairs, and implementexecute_programwith an optional computed-goto dispatch forgcc/clangand a portable switch fallback; add support forCOW_MAX_STEPSto guard long-running programs.fgetc/strncmploops with a robust slidingbuf+decode_instructionapproach when building the bytecodeprogram.source/cowcomp.cppto build output in-memory viaemitintostd::stringand writecow.out.cppat once, reusedecode_instruction, switch touint8_tinstruction storage, and keep existing code-generation behavior using the newemithelper.ddx/cow.cpp: include headers and a templatedapply_all_stomachsutility and replace explicit loops acrossnum_stomachswith the new helper.Testing
g++(tested withg++ -O3) and compilation completed successfully.COW_MAX_STEPSguard, and these checks passed.cow.out.cppfrom a sample program and verified the output file was created and compilable, and this step succeeded.gcc/clang) and the portable switch fallback to ensure both dispatch styles run through typical instruction sequences successfully.Codex Task