poc: Implement the specs + sample + AI #290
Cross-model benchmark + breakage-detection benchmark showed:
- All models (Haiku, Sonnet, Codex, bash) produce correct substitutions
- generate.sh ships a latent Podfile sed bug (deletes the GCC_PREPROCESSOR_DEFINITIONS line but not its array contents → invalid Ruby). validation_test.dart never caught it. flutter build ios catches it immediately.
- AI models (Haiku, Sonnet) handle the same transformation correctly because they understand Ruby structure. The validation suite was written to catch AI-level substitution misses, not sed-level structural errors.
Changes:
- Delete specs/generate.sh (bash generator, was never committed)
- Delete specs/validation/validate.sh + validation_test.dart (570 lines)
- Rewrite .github/workflows/test.yml: native Flutter tooling on sample/, macOS runner, JDK 17 pin, flavor-aware apk + ios builds
- Simplify specs/generation-prompt.md: no inline bash, self-check via grep, structural guidance for Podfile conditional
- README: drop validate_all.sh + 8 nonexistent spec file references, add verification commands and JDK 17 requirement note
- .gitignore: add output/ for AI-generated project outputs
Net: -670 lines. System is now sample/ + 2 markdown specs + native CI gate. No custom Dart validation, no bash generator.
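The Podfile failure mode described above can be reproduced in a few lines. This is a minimal sketch, not the actual generate.sh logic: the Podfile fragment, the `PERMISSION_CAMERA=1` entry, and both function names are hypothetical, chosen only to show why deleting a single matching line leaves orphaned array contents while a bracket-aware deletion does not.

```python
# Hypothetical Podfile fragment; the real sample/ Podfile may differ.
PODFILE = """\
config.build_settings['GCC_PREPROCESSOR_DEFINITIONS'] ||= [
  '$(inherited)',
  'PERMISSION_CAMERA=1',
]
config.build_settings['ENABLE_BITCODE'] = 'NO'
"""

KEY = "GCC_PREPROCESSOR_DEFINITIONS"


def delete_line_only(text: str) -> str:
    """Mimic a line-oriented sed delete: drop only lines that mention KEY."""
    lines = text.splitlines(keepends=True)
    return "".join(line for line in lines if KEY not in line)


def delete_whole_statement(text: str) -> str:
    """Drop the matching line and everything up to the bracket that closes it.

    Bracket counting is naive (it ignores brackets inside string literals),
    which is enough for this illustration.
    """
    kept = []
    depth = 0  # > 0 while we are inside the statement being removed
    for line in text.splitlines(keepends=True):
        if depth == 0 and KEY not in line:
            kept.append(line)
            continue
        depth += line.count("[") - line.count("]")
    return "".join(kept)


if __name__ == "__main__":
    print("--- line-only delete (orphaned array body, invalid Ruby) ---")
    print(delete_line_only(PODFILE))
    print("--- statement-aware delete ---")
    print(delete_whole_statement(PODFILE))
```

The line-only variant leaves the string entries and the closing `]` behind with no opening expression, which is the kind of structural error a substitution-oriented test misses and a native build catches.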
…hain
sample/'s Android stack (Gradle 7.5 + legacy Flutter plugin loader + AGP 7.3.0) is
incompatible with the modern Flutter plugin ecosystem. `flutter build apk` fails with
"compileSdkVersion is not specified" because package_info_plus ^9.0.0 expects the
declarative Flutter Gradle plugin to expose flutter.compileSdkVersion to library
subprojects — which the legacy apply-from-gradle pattern does not do.
Minimal toolchain migration to unblock builds:
- Gradle wrapper 7.5 → 7.6.3 (JDK 17 compatible, supports declarative plugins)
- settings.gradle: legacy apply-from pattern → declarative plugins block
(AGP 7.4.2, Kotlin 1.8.22)
- build.gradle: removed obsolete buildscript block (classpath deps now in
settings.gradle plugins)
- app/build.gradle: apply-plugin → plugins {} block at top of file; dropped
explicit kotlin-stdlib-jdk7 dep (bundled by kotlin plugin)
- minSdk 23 → 24 (flutter_secure_storage requirement)
- package_info_plus ^9.0.0 → ^8.3.0 (v9.x requires an even newer Flutter Gradle
plugin interface than the declarative migration provides; v8.3.0 works)
Verified: flutter build apk --debug --flavor staging succeeds; APK installs and
launches on Pixel 7 emulator. flutter build ios --debug --no-codesign --flavor
staging also succeeds (Runner.app produced).
AGP 8.x modernization deferred — hit unmigrated-plugin namespace issues
(flutter_config) that would require a full dependency audit. Out of scope for
unblocking builds.
- proposal-ai-generation-migration.md: full Notion-ready proposal with 3-round benchmark results, recommendation matrix (Opus/Sonnet/Haiku), and twin failure exhibits (Haiku Podfile + Sonnet Android XML).
- experiments-log.md: per-experiment record across 6 experiments and 18 total runs, source of truth for the proposal's claims.
Replaces the earlier outdated proposal (March), which referenced the deleted validation suite.
Reproducibility kit for the 3-round benchmark documented in docs/experiments-log.md:
- Round 1 (standard params): setup-bench.sh, teardown-bench.sh, verify-all.sh, run_benchmark.sh
- Round 2 (Opus edge cases): setup-edge-bench.sh, teardown-edge-bench.sh, verify-edge.sh
- Round 3 (multi-model edge cases): setup-models-bench.sh, teardown-models-bench.sh, verify-models.sh
- Canonical prompt pinned in benchmark-prompt.md
Each setup script creates per-case git worktrees with parameter blocks pre-injected; teardown collects outputs and restores memory; verify runs the full pipeline including flutter build apk + flutter build ios.
REPO_ROOT resolution updated to work from the new scripts/benchmark/ location.
Proposal (docs/proposal-ai-generation-migration.md):
- Trim from 460 → 178 lines while keeping load-bearing arguments
- Fix arithmetic: Haiku stage pass rate (28/35), N=15 runs with full build gate
- Add asymmetry framing ("used a few times a year, maintained every few weeks")
- Verify "42 Generate bundle commits" with git log
- Restructure scorecard, recommendation, and harness section
- Add per-format escaping callout for app_name special characters
- Add API cost-per-generation detail
- Add "Likely questions" section (CI, vendor compat, sample/ failure)
- Replace internal .md links with GitHub URLs on feature/ai-generation-migration
- Update verify command block: cd <project> + macOS-only marker
Generation prompt (specs/generation-prompt.md):
- Group substitution table by purpose (identifiers / display / metadata / codegen)
- Replace literal version line with pattern-based rule (resilient to sample/ bumps)
- Add per-format escaping section for app_name (Dart/XML/Ruby/pbxproj/MD)
- Inline architecture invariants (layer deps, naming conventions) — domain pure Dart rule added
- Add <string>sample</string> and 1.14.0 to self-check grep
- Replace personal example values with placeholder syntax (<your_project_name>)
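The per-format escaping requirement for app_name can be illustrated with a short sketch. The function names are hypothetical and this is not the content of specs/generation-prompt.md; it covers only three of the listed formats (Android XML string resources, single-quoted Dart strings, single-quoted Ruby strings) and only the most common special characters.

```python
def escape_android_xml(value: str) -> str:
    """Escape a value for an Android <string> resource body."""
    # Escape & before < so the & in &lt; is not escaped twice.
    value = value.replace("&", "&amp;").replace("<", "&lt;")
    # Android resource escaping: backslash first, then both quote characters.
    value = value.replace("\\", "\\\\")
    return value.replace("'", "\\'").replace('"', '\\"')


def escape_dart_single_quoted(value: str) -> str:
    """Escape a value placed inside a single-quoted Dart string literal."""
    value = value.replace("\\", "\\\\")
    # $ would otherwise start string interpolation in Dart.
    return value.replace("'", "\\'").replace("$", "\\$")


def escape_ruby_single_quoted(value: str) -> str:
    """Escape a value placed inside a single-quoted Ruby string literal."""
    return value.replace("\\", "\\\\").replace("'", "\\'")


if __name__ == "__main__":
    app_name = "Bob's App"  # apostrophe edge case
    print(escape_android_xml(app_name))
    print(escape_dart_single_quoted(app_name))
    print(escape_ruby_single_quoted(app_name))
```

The same input needs a different transformation per target file, which is why a single global find-and-replace of app_name is not sufficient when the name contains an apostrophe, ampersand, or dollar sign.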
Architecture rules (specs/architecture-rules.md):
- Removed; useful content inlined into generation-prompt.md, the rest was redundant with sample/
- Title: "Migrate ... from Mason to AI" → "Drop Mason — Use sample/ + AI to Generate Projects" (active verb, names both halves of the new system)
- Add "Maintenance effort, task by task" table with concrete scenarios (Flutter SDK bump, dep bump, architectural switch, bug fix, onboarding)
- Verify block: add cd <your_project> + macOS-only marker
- Remove "Likely questions" section (CI/vendor questions answered by experiments-log + readme)
- Remove API cost callout (premature without confirmed pricing)
- Replace "no Mustache wrappers" jargon with concrete "flutter run sample/"
- Replace internal .md links with GitHub URLs on feature/ai-generation-migration
Clarifies what's NOT in this proposal (harness, additional sample variants, Claude Code skills integration) so reviewers evaluate the right thing.
Re-ran all 5 Opus 4.7 cases, capturing Anthropic's /cost accounting before and after each session.
Measured:
- $1.99/run avg on API list price (range $0.81-$3.70, ~2.4M tokens/run)
- $0 marginal on Max/Team subscriptions
- ~2% of 5h window per run on Max 5x
Build pipeline: 35/35 gates pass (5 cases x 7 stages incl. native apk+ios builds). The apostrophe edge case passed cleanly; this is the spot where Sonnet 4.6 had previously failed (Android XML escape).
Adds reproducibility tooling under scripts/benchmark/:
- setup-opus-bench.sh: spin up 5 worktrees with preset parameters
- verify-opus-tokens.sh: full build pipeline + token extraction
- extract-tokens.py: dedup'd JSONL token sum with correct 1h/5m cache pricing split per Anthropic API docs
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
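A deduplicated token sum with a 1h/5m cache split could look roughly like the sketch below. This is not the contents of extract-tokens.py: the JSONL record layout (a `message` object with `id` and `usage`) is an assumption modeled on the usage object of the Anthropic Messages API, the default cache multipliers reflect my reading of Anthropic's published pricing and should be checked against the current docs, and the prices passed in the usage example are placeholders, not real list prices.

```python
import json
from dataclasses import dataclass


@dataclass
class Usage:
    input_tokens: int = 0
    output_tokens: int = 0
    cache_read: int = 0
    cache_write_5m: int = 0
    cache_write_1h: int = 0


def sum_usage(jsonl_lines) -> Usage:
    """Sum token usage across JSONL records, counting each message id once."""
    total = Usage()
    seen_ids = set()
    for raw in jsonl_lines:
        raw = raw.strip()
        if not raw:
            continue
        record = json.loads(raw)
        message = record.get("message") if isinstance(record, dict) else None
        if not isinstance(message, dict):
            continue
        usage = message.get("usage")
        message_id = message.get("id")
        # Assumption: repeated records for one message share the same id.
        if not usage or message_id is None or message_id in seen_ids:
            continue
        seen_ids.add(message_id)
        writes = usage.get("cache_creation") or {}
        total.input_tokens += usage.get("input_tokens", 0)
        total.output_tokens += usage.get("output_tokens", 0)
        total.cache_read += usage.get("cache_read_input_tokens", 0)
        total.cache_write_5m += writes.get("ephemeral_5m_input_tokens", 0)
        total.cache_write_1h += writes.get("ephemeral_1h_input_tokens", 0)
    return total


def cost_usd(u: Usage, input_price: float, output_price: float,
             write_5m_mult: float = 1.25, write_1h_mult: float = 2.0,
             read_mult: float = 0.1) -> float:
    """Cost in USD; prices are per million tokens, multipliers scale input price."""
    weighted_input = (u.input_tokens
                      + u.cache_write_5m * write_5m_mult
                      + u.cache_write_1h * write_1h_mult
                      + u.cache_read * read_mult)
    return (weighted_input * input_price + u.output_tokens * output_price) / 1_000_000


if __name__ == "__main__":
    record = json.dumps({"message": {"id": "msg_1", "usage": {
        "input_tokens": 1000, "output_tokens": 200,
        "cache_read_input_tokens": 10000,
        "cache_creation": {"ephemeral_5m_input_tokens": 400,
                           "ephemeral_1h_input_tokens": 100}}}})
    usage = sum_usage([record, record])  # the duplicate is counted once
    print(usage)
    print(cost_usd(usage, input_price=10.0, output_price=50.0))  # placeholder prices
```

Keeping the multipliers as parameters makes the pricing assumptions explicit and easy to update when the published rates change.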