A simple yet powerful web-based Optical Character Recognition (OCR) chatbot for images and PDF files, leveraging Google's Gemini 2.5 Pro Experimental model (gemini-2.5-pro-exp-03-25) and other Gemini models like Gemini 2.0 Flash (Thinking). The application features a German user interface, streaming output for fast results, and was primarily developed through iterative prompting of an AI assistant (Gemini) – an example of "Vibe Coding"!
Currently, the necessary API keys can be obtained for free via Google AI Studio.
- OCR for Images & PDFs: Accepts common image formats (PNG, JPG, WEBP) and PDF files for text extraction.
- Gemini 2.5 Pro Power: Utilizes the state-of-the-art gemini-2.5-pro-exp-03-25 model for recognition. Includes a dropdown to select Gemini 2.0 Flash (Thinking).
- Streaming Output: Displays extracted results incrementally as they are generated by the model.
- German UI: User interface is presented entirely in German.
- Material Design Inspired: Visual design follows Google's Material Design principles.
- Persistent Status Bar: Clearly indicates the current state (Ready, Thinking, Writing, Done, Error) with a "wave" animation during thinking/writing phases.
- Dark/Light Mode: Toggleable color scheme with persistence via Local Storage.
- Custom Instructions: Ability to provide specific instructions to the model for the extraction process.
- File Preview: Shows a thumbnail for images or an icon for PDFs.
- Secure: Ignores the .env file containing the API key via .gitignore.
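The upload and preview features above imply a small server-side check on the file type. The following is a minimal sketch of such a check, not the project's actual code: the names `ALLOWED_TYPES`, `detect_mime_type`, and `preview_kind` are hypothetical, and accepting `.jpeg` alongside `.jpg` is an assumption.

```python
from pathlib import Path
from typing import Optional

# Formats named in the feature list. The MIME mapping and the extra
# ".jpeg" spelling are assumptions made for this sketch.
ALLOWED_TYPES = {
    ".png": "image/png",
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".webp": "image/webp",
    ".pdf": "application/pdf",
}


def detect_mime_type(filename: str) -> Optional[str]:
    """Return the MIME type for a supported upload, or None if unsupported."""
    suffix = Path(filename).suffix.lower()
    return ALLOWED_TYPES.get(suffix)


def preview_kind(filename: str) -> str:
    """Decide what the file preview shows: a thumbnail, an icon, or nothing."""
    mime = detect_mime_type(filename)
    if mime is None:
        return "unsupported"
    return "icon" if mime == "application/pdf" else "thumbnail"
```

Checking the extension is only a first filter; a stricter variant could also inspect the file's leading bytes before sending it to the model.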
- Backend: Python 3.x with Flask
- Frontend: Vanilla HTML, CSS, JavaScript
- AI Model: Google Gemini 2.5 Pro Experimental (gemini-2.5-pro-exp-03-25) via the google-generativeai SDK
- Styling: CSS inspired by Material Design
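To illustrate how this stack could fit together, here is a hedged sketch of a streaming OCR call through the google-generativeai SDK. It is not taken from the repository: `build_prompt`, `iter_text`, and `stream_ocr` are hypothetical names, the prompt wording is an assumption, and the SDK is imported lazily so the helper functions can be used without it installed.

```python
import os
from typing import Iterable, Iterator

MODEL_NAME = "gemini-2.5-pro-exp-03-25"  # model id named in this README


def build_prompt(instructions: str = "") -> str:
    """Combine a base OCR request (assumed wording) with optional custom instructions."""
    base = "Extract all text from the attached file."
    extra = instructions.strip()
    return f"{base}\n\nAdditional instructions: {extra}" if extra else base


def iter_text(chunks: Iterable) -> Iterator[str]:
    """Yield the non-empty text of each streamed chunk, skipping chunks without text."""
    for chunk in chunks:
        try:
            text = chunk.text
        except (AttributeError, ValueError):
            # Accessing .text may fail for chunks that carry no text part.
            continue
        if text:
            yield text


def stream_ocr(data: bytes, mime_type: str, instructions: str = "") -> Iterator[str]:
    """Send one file to Gemini and yield text pieces as they arrive."""
    import google.generativeai as genai  # lazy import; requires the SDK

    genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
    model = genai.GenerativeModel(MODEL_NAME)
    response = model.generate_content(
        [build_prompt(instructions), {"mime_type": mime_type, "data": data}],
        stream=True,
    )
    yield from iter_text(response)
```

In a Flask route, a generator like `stream_ocr(...)` could be passed to `flask.Response` (optionally wrapped in `stream_with_context`) so the browser receives the text incrementally.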
Follow these steps to run the project locally:
- Prerequisites: Python 3.x with pip, Git, and a Google API key from Google AI Studio.
- Clone the Repository:
    git clone https://github.com/marlonka/gemini-ocr-chatbot.git
    cd gemini-ocr-chatbot
- Create a Virtual Environment (Recommended):
  - Linux/macOS:
      python3 -m venv venv
      source venv/bin/activate
  - Windows:
      python -m venv venv
      .\venv\Scripts\activate
- Install Dependencies:
    pip install -r requirements.txt
- Configure API Key (IMPORTANT!):
  - Create a file named .env in the project's root directory (gemini-ocr-chatbot).
  - Add the following content, replacing YOUR_API_KEY_HERE with your actual API key obtained from Google AI Studio:
      GOOGLE_API_KEY=YOUR_API_KEY_HERE
  - This file is ignored by .gitignore and should NEVER be committed or pushed to GitHub!
- Run the Flask Application:
    flask run
    # Or: python app.py
- Open the Application in your Browser:
  - Navigate to http://127.0.0.1:5000 (or the address shown in your terminal).
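For the API key step above, this is a minimal, standard-library sketch of how an application might read GOOGLE_API_KEY from the .env file and fail early if it is missing. The project may well use a library such as python-dotenv instead; `load_env_file` and `require_api_key` are hypothetical helpers written for illustration.

```python
import os
from pathlib import Path


def load_env_file(path: str = ".env") -> dict:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    env_path = Path(path)
    if not env_path.is_file():
        return values
    for raw in env_path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Remove surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def require_api_key(path: str = ".env") -> str:
    """Return the key from the environment or the .env file, or raise."""
    key = os.environ.get("GOOGLE_API_KEY") or load_env_file(path).get("GOOGLE_API_KEY", "")
    if not key or key == "YOUR_API_KEY_HERE":
        raise RuntimeError("GOOGLE_API_KEY is missing or still set to the placeholder.")
    return key
```

Calling `require_api_key()` once at startup surfaces a misconfigured key before the first OCR request rather than in the middle of one.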
- Open the web interface in your browser.
- Drag & drop an image or PDF file onto the dropzone, or click "Dateien durchsuchen" (Browse files) to select one.
- (Optional) Enter specific instructions in the text area (e.g., "Extract only the table on page 2").
- Click "PDF/Bild OCR starten" (Start PDF/Image OCR).
- Observe the status bar ("KI denkt nach..." -> "Die KI schreibt...") and the results streaming into the right panel.
- Click "Text kopieren" (Copy Text) to copy the result to your clipboard.
- Click "Neu starten" (Start New) to upload a different file.
- Use the icon in the top-right corner to toggle between light and dark modes.
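The status changes seen during a run (Ready, Thinking, Writing, Done, Error) can be modeled as a small state sequence driven by the text stream. The sketch below is an illustration in Python under that assumption; the real status bar presumably lives in the frontend JavaScript, and `Status` and `status_sequence` are hypothetical names.

```python
from enum import Enum
from typing import Iterable, Iterator, Tuple


class Status(Enum):
    READY = "Ready"
    THINKING = "Thinking"
    WRITING = "Writing"
    DONE = "Done"
    ERROR = "Error"


def status_sequence(chunks: Iterable[str]) -> Iterator[Tuple[Status, str]]:
    """Yield (status, text) pairs for one OCR run over a stream of text chunks."""
    # The request has been sent but no text has arrived yet.
    yield Status.THINKING, ""
    try:
        for text in chunks:
            # Each arriving chunk keeps the status bar in the writing phase.
            yield Status.WRITING, text
    except Exception:
        # Any failure while streaming ends the run in the error state.
        yield Status.ERROR, ""
        return
    yield Status.DONE, ""
```

A consumer would update the status label on every pair and append the text to the result panel whenever it is non-empty.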
This project was developed largely through AI prompting (specifically Google Gemini). It served as an experiment to explore how far precise instructions given to an advanced AI could go toward building a functional web application. The process involved approximately 8-10 iterative main prompts to implement features like PDF support, streaming, UI adjustments, and bug fixes.
Contributions are welcome! If you have suggestions or find bugs, please open an issue or submit a pull request.
This project is licensed under the MIT License. See the LICENSE file for details.
