Hello,
I was comparing some high-performing models' results on specific pages where they score 0%. I expected to see genuine failures from the LLMs, but in this sample Gemini Flash performed well and still lost its score because the ground truth appears to be inaccurate. Could you please check this sample and others like it, and confirm and/or re-evaluate? Am I missing something?
https://idp-leaderboard.org/explore/?model=Gemini-3-Flash&benchmark=olmocr&task=present&sample=17_pg17_pg1_text_03