Upgrade to Gemini 3 #130

Merged
mreichhoff merged 9 commits into main from upgrade-2-5
Feb 14, 2026

@mreichhoff
Owner

Testing out different models, starting with a minimal upgrade to Gemini 2.5 Flash.
Ideally we'll use Gemini 3, but let's see the delta in eval scores first. 2.0 is being
retired, and 2.5 is still cheaper. Will try out Pro vs. Flash and 2.5 vs. 3, pending
Vertex AI availability of these models.
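
The "delta in eval scores" comparison could be sketched roughly as follows. This is only an illustration, not the repository's tooling: `score_deltas` and the dictionary layout are hypothetical, and the sample numbers are the model-judged averages from the first two evaluation comments in this thread. The thread does not say which model produced which run, so the runs are simply labeled A and B.

```python
def score_deltas(baseline, candidate):
    """Return candidate minus baseline for every metric present in both runs."""
    return {
        key: round(candidate[key] - baseline[key], 1)
        for key in baseline
        if key in candidate
    }


# Keys are (flow, evaluator); values are average scores in percent.
run_a = {
    ("explain chinese", "grammarExplanationQuality"): 96.6,
    ("explain english", "grammarExplanationQuality"): 96.3,
    ("generate sentences", "sentenceGenerationQuality"): 100.0,
}
run_b = {
    ("explain chinese", "grammarExplanationQuality"): 99.1,
    ("explain english", "grammarExplanationQuality"): 97.9,
    ("generate sentences", "sentenceGenerationQuality"): 96.0,
}

deltas = score_deltas(run_a, run_b)
for (flow, evaluator), delta in deltas.items():
    # A leading sign makes regressions easy to spot at a glance.
    print(f"{flow} / {evaluator}: {delta:+.1f}")
```

With only 5 to 75 cases per flow, differences of a few points may be within run-to-run noise, so a comparison like this is best read alongside the pass counts.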

@github-actions

github-actions bot commented Feb 9, 2026

🧪 AI Evaluation Results

collocation

Evaluator Pass Rate Avg Score
chineseTextPresent ✅ 4/4 (100%) 100.0%
englishTranslationPresent ✅ 4/4 (100%) 100.0%
outputStructureValid ✅ 4/4 (100%) 100.0%

explain chinese

Evaluator Pass Rate Avg Score
chineseTextPresent ✅ 70/70 (100%) 100.0%
validPinyinFormat ✅ 70/70 (100%) 100.0%
grammarExplanationQuality 🟡 68/70 (97%) 96.6%
outputStructureValid ✅ 70/70 (100%) 100.0%

explain english

Evaluator Pass Rate Avg Score
chineseTextPresent ✅ 75/75 (100%) 100.0%
validPinyinFormat ✅ 75/75 (100%) 100.0%
grammarExplanationQuality 🟡 71/75 (95%) 96.3%
outputStructureValid ✅ 75/75 (100%) 100.0%

generate sentences

Evaluator Pass Rate Avg Score
chineseTextPresent ✅ 5/5 (100%) 100.0%
validPinyinFormat ✅ 5/5 (100%) 100.0%
sentenceGenerationQuality ✅ 5/5 (100%) 100.0%
outputStructureValid ✅ 5/5 (100%) 100.0%

word context

Evaluator Pass Rate Avg Score
chineseTextPresent ✅ 5/5 (100%) 100.0%
englishTranslationPresent ✅ 5/5 (100%) 100.0%
outputStructureValid ✅ 5/5 (100%) 100.0%
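
For readers unfamiliar with the table layout, here is a small sketch of how one row might be assembled from raw counts. The function name `format_row` and the icon rule are assumptions: the rule (all passed, none passed, otherwise partial) is merely consistent with the rows shown in this thread, and the real thresholds used by the evaluation workflow are not visible here.

```python
def format_row(evaluator, passed, total, avg_percent):
    """Build one 'Evaluator / Pass Rate / Avg Score' row as plain text.

    Assumes total > 0 and avg_percent is already expressed in percent.
    """
    # Round half up to a whole percent, e.g. 68/70 -> 97.
    pass_percent = int(100 * passed / total + 0.5)
    if passed == total:
        icon = "✅"
    elif passed == 0:
        icon = "❌"
    else:
        icon = "🟡"
    return f"{evaluator} {icon} {passed}/{total} ({pass_percent}%) {avg_percent:.1f}%"


print(format_row("grammarExplanationQuality", 68, 70, 96.6))
```

The pass rate and the average score are separate measurements: the first counts cases that cleared the evaluator's bar, the second averages the per-case scores, which is why a row can show 100% passing with an average below 100.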

@github-actions

github-actions bot commented Feb 9, 2026

🧪 AI Evaluation Results

collocation

Evaluator Pass Rate Avg Score
chineseTextPresent ✅ 4/4 (100%) 100.0%
englishTranslationPresent ✅ 4/4 (100%) 100.0%
outputStructureValid ✅ 4/4 (100%) 100.0%

explain chinese

Evaluator Pass Rate Avg Score
chineseTextPresent ✅ 70/70 (100%) 100.0%
validPinyinFormat ✅ 70/70 (100%) 100.0%
grammarExplanationQuality ✅ 70/70 (100%) 99.1%
outputStructureValid ✅ 70/70 (100%) 100.0%

explain english

Evaluator Pass Rate Avg Score
chineseTextPresent ✅ 75/75 (100%) 100.0%
validPinyinFormat ✅ 75/75 (100%) 100.0%
grammarExplanationQuality 🟡 73/75 (97%) 97.9%
outputStructureValid ✅ 75/75 (100%) 100.0%

generate sentences

Evaluator Pass Rate Avg Score
chineseTextPresent ✅ 5/5 (100%) 100.0%
validPinyinFormat ✅ 5/5 (100%) 100.0%
sentenceGenerationQuality ✅ 5/5 (100%) 96.0%
outputStructureValid ✅ 5/5 (100%) 100.0%

word context

Evaluator Pass Rate Avg Score
chineseTextPresent ✅ 5/5 (100%) 100.0%
englishTranslationPresent ✅ 5/5 (100%) 100.0%
outputStructureValid ✅ 5/5 (100%) 100.0%

@github-actions

github-actions bot commented Feb 9, 2026

🧪 AI Evaluation Results

collocation

Evaluator Pass Rate Avg Score
chineseTextPresent ✅ 4/4 (100%) 100.0%
englishTranslationPresent ✅ 4/4 (100%) 100.0%
outputStructureValid ✅ 4/4 (100%) 100.0%

explain chinese

Evaluator Pass Rate Avg Score
chineseTextPresent ✅ 70/70 (100%) 100.0%
validPinyinFormat ✅ 70/70 (100%) 100.0%
grammarExplanationQuality ❌ 0/70 (0%) NaN%
outputStructureValid ✅ 70/70 (100%) 100.0%

explain english

Evaluator Pass Rate Avg Score
chineseTextPresent ✅ 75/75 (100%) 100.0%
validPinyinFormat ✅ 75/75 (100%) 100.0%
grammarExplanationQuality ❌ 0/75 (0%) NaN%
outputStructureValid ✅ 75/75 (100%) 100.0%

generate sentences

Evaluator Pass Rate Avg Score
chineseTextPresent ✅ 5/5 (100%) 100.0%
validPinyinFormat ✅ 5/5 (100%) 100.0%
sentenceGenerationQuality ❌ 0/5 (0%) NaN%
outputStructureValid ✅ 5/5 (100%) 100.0%

word context

Evaluator Pass Rate Avg Score
chineseTextPresent ✅ 5/5 (100%) 100.0%
englishTranslationPresent ✅ 5/5 (100%) 100.0%
outputStructureValid ✅ 5/5 (100%) 100.0%
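
The `0/N (0%)` and `NaN%` rows in this run are what an average over an empty list of scores looks like. One plausible reading, which the thread does not confirm, is that the model-judged evaluators returned no numeric scores here while the rule-based evaluators kept working. The sketch below shows that failure mode and one way to report it more clearly; `summarize` and `PASS_THRESHOLD` are hypothetical names, and the threshold value is an assumption.

```python
import math

PASS_THRESHOLD = 0.8  # assumed cutoff; the real value is not shown in this thread


def summarize(scores):
    """Summarize per-case scores in [0, 1]; None marks a case with no score."""
    valid = [s for s in scores if s is not None]
    passed = sum(1 for s in valid if s >= PASS_THRESHOLD)
    # With no valid scores there is nothing to average, so report NaN
    # explicitly instead of dividing by zero.
    average = sum(valid) / len(valid) if valid else math.nan
    return passed, len(scores), average


passed, total, average = summarize([None] * 70)
label = "n/a (no scores returned)" if math.isnan(average) else f"{average * 100:.1f}%"
print(f"{passed}/{total} {label}")  # prints: 0/70 n/a (no scores returned)
```

Distinguishing "no score produced" from "scored and failed" in the report would make a run like this read as an evaluator outage rather than a quality regression.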

@github-actions

github-actions bot commented Feb 9, 2026

🧪 AI Evaluation Results

collocation

Evaluator Pass Rate Avg Score
chineseTextPresent ✅ 4/4 (100%) 100.0%
englishTranslationPresent ✅ 4/4 (100%) 100.0%
outputStructureValid ✅ 4/4 (100%) 100.0%

explain chinese

Evaluator Pass Rate Avg Score
chineseTextPresent ✅ 70/70 (100%) 100.0%
validPinyinFormat ✅ 70/70 (100%) 100.0%
grammarExplanationQuality 🟡 67/70 (96%) 97.4%
outputStructureValid ✅ 70/70 (100%) 100.0%

explain english

Evaluator Pass Rate Avg Score
chineseTextPresent ✅ 75/75 (100%) 100.0%
validPinyinFormat ✅ 75/75 (100%) 100.0%
grammarExplanationQuality 🟡 66/75 (88%) 91.5%
outputStructureValid ✅ 75/75 (100%) 100.0%

generate sentences

Evaluator Pass Rate Avg Score
chineseTextPresent ✅ 5/5 (100%) 100.0%
validPinyinFormat ✅ 5/5 (100%) 100.0%
sentenceGenerationQuality ✅ 5/5 (100%) 100.0%
outputStructureValid ✅ 5/5 (100%) 100.0%

word context

Evaluator Pass Rate Avg Score
chineseTextPresent ✅ 5/5 (100%) 100.0%
englishTranslationPresent ✅ 5/5 (100%) 100.0%
outputStructureValid ✅ 5/5 (100%) 100.0%

@github-actions

github-actions bot commented Feb 9, 2026

🧪 AI Evaluation Results

collocation

Evaluator Pass Rate Avg Score
chineseTextPresent ✅ 4/4 (100%) 100.0%
englishTranslationPresent ✅ 4/4 (100%) 100.0%
outputStructureValid ✅ 4/4 (100%) 100.0%

explain chinese

Evaluator Pass Rate Avg Score
chineseTextPresent ✅ 70/70 (100%) 100.0%
validPinyinFormat ✅ 70/70 (100%) 100.0%
grammarExplanationQuality 🟡 69/70 (99%) 97.4%
outputStructureValid ✅ 70/70 (100%) 100.0%

explain english

Evaluator Pass Rate Avg Score
chineseTextPresent ✅ 75/75 (100%) 100.0%
validPinyinFormat ✅ 75/75 (100%) 100.0%
grammarExplanationQuality 🟡 71/75 (95%) 94.4%
outputStructureValid ✅ 75/75 (100%) 100.0%

generate sentences

Evaluator Pass Rate Avg Score
chineseTextPresent ✅ 5/5 (100%) 100.0%
validPinyinFormat ✅ 5/5 (100%) 100.0%
sentenceGenerationQuality 🟡 4/5 (80%) 92.0%
outputStructureValid ✅ 5/5 (100%) 100.0%

word context

Evaluator Pass Rate Avg Score
chineseTextPresent ✅ 5/5 (100%) 100.0%
englishTranslationPresent ✅ 5/5 (100%) 100.0%
outputStructureValid ✅ 5/5 (100%) 100.0%

@github-actions

github-actions bot commented Feb 9, 2026

🧪 AI Evaluation Results

collocation

Evaluator Pass Rate Avg Score
chineseTextPresent ✅ 4/4 (100%) 100.0%
englishTranslationPresent ✅ 4/4 (100%) 100.0%
outputStructureValid ✅ 4/4 (100%) 100.0%

explain chinese

Evaluator Pass Rate Avg Score
chineseTextPresent ✅ 70/70 (100%) 100.0%
validPinyinFormat ✅ 70/70 (100%) 100.0%
grammarExplanationQuality 🟡 68/70 (97%) 96.9%
outputStructureValid ✅ 70/70 (100%) 100.0%

explain english

Evaluator Pass Rate Avg Score
chineseTextPresent ✅ 75/75 (100%) 100.0%
validPinyinFormat ✅ 75/75 (100%) 100.0%
grammarExplanationQuality 🟡 69/75 (92%) 93.2%
outputStructureValid ✅ 75/75 (100%) 100.0%

generate sentences

Evaluator Pass Rate Avg Score
chineseTextPresent ✅ 5/5 (100%) 100.0%
validPinyinFormat ✅ 5/5 (100%) 100.0%
sentenceGenerationQuality 🟡 4/5 (80%) 92.0%
outputStructureValid ✅ 5/5 (100%) 100.0%

word context

Evaluator Pass Rate Avg Score
chineseTextPresent ✅ 5/5 (100%) 100.0%
englishTranslationPresent ✅ 5/5 (100%) 100.0%
outputStructureValid ✅ 5/5 (100%) 100.0%

@github-actions

🧪 AI Evaluation Results

collocation

Evaluator Pass Rate Avg Score
chineseTextPresent ✅ 4/4 (100%) 100.0%
englishTranslationPresent ✅ 4/4 (100%) 100.0%
outputStructureValid ✅ 4/4 (100%) 100.0%

explain chinese

Evaluator Pass Rate Avg Score
chineseTextPresent ✅ 70/70 (100%) 100.0%
validPinyinFormat ✅ 70/70 (100%) 100.0%
grammarExplanationQuality 🟡 67/70 (96%) 97.1%
outputStructureValid ✅ 70/70 (100%) 100.0%

explain english

Evaluator Pass Rate Avg Score
chineseTextPresent ✅ 75/75 (100%) 100.0%
validPinyinFormat ✅ 75/75 (100%) 100.0%
grammarExplanationQuality 🟡 72/75 (96%) 95.5%
outputStructureValid ✅ 75/75 (100%) 100.0%

generate sentences

Evaluator Pass Rate Avg Score
chineseTextPresent ✅ 5/5 (100%) 100.0%
validPinyinFormat ✅ 5/5 (100%) 100.0%
sentenceGenerationQuality ✅ 5/5 (100%) 96.0%
outputStructureValid ✅ 5/5 (100%) 100.0%

word context

Evaluator Pass Rate Avg Score
chineseTextPresent ✅ 5/5 (100%) 100.0%
englishTranslationPresent ✅ 5/5 (100%) 100.0%
outputStructureValid ✅ 5/5 (100%) 100.0%

@github-actions

🧪 AI Evaluation Results

collocation

Evaluator Pass Rate Avg Score
chineseTextPresent ✅ 4/4 (100%) 100.0%
englishTranslationPresent ✅ 4/4 (100%) 100.0%
outputStructureValid ✅ 4/4 (100%) 100.0%

explain chinese

Evaluator Pass Rate Avg Score
chineseTextPresent 🟡 69/70 (99%) 98.6%
validPinyinFormat ✅ 70/70 (100%) 100.0%
grammarExplanationQuality 🟡 69/70 (99%) 98.3%
outputStructureValid ✅ 70/70 (100%) 100.0%

explain english

Evaluator Pass Rate Avg Score
chineseTextPresent ✅ 75/75 (100%) 100.0%
validPinyinFormat ✅ 75/75 (100%) 100.0%
grammarExplanationQuality 🟡 72/75 (96%) 96.0%
outputStructureValid ✅ 75/75 (100%) 100.0%

generate sentences

Evaluator Pass Rate Avg Score
chineseTextPresent ✅ 5/5 (100%) 100.0%
validPinyinFormat ✅ 5/5 (100%) 100.0%
sentenceGenerationQuality 🟡 4/5 (80%) 84.0%
outputStructureValid ✅ 5/5 (100%) 100.0%

word context

Evaluator Pass Rate Avg Score
chineseTextPresent ✅ 5/5 (100%) 100.0%
englishTranslationPresent ✅ 5/5 (100%) 100.0%
outputStructureValid ✅ 5/5 (100%) 100.0%

mreichhoff changed the title from "Upgrade to newer Gemini model" to "Upgrade to Gemini 3" on Feb 14, 2026
mreichhoff merged commit efa8e67 into main on Feb 14, 2026
1 check passed
mreichhoff deleted the upgrade-2-5 branch on February 14, 2026 22:34