"Because reading your own documents is so 2022. Let the AI do the heavy lifting while you take all the credit. Now with 100% more dark mode and a backend that actually works!"
- What Is This Masterpiece?
- What Was New in v2.0 (The Glow-Up Edition)
- Screenshots
- Features
- How to Run
- Running the Tests
- Project Structure
- API Key (OpenCode Zen)
- Session Persistence
- How It Works
- Why Use This?
- Installation Troubleshooting
- Credits & Thanks
- Disclaimer
- Feedback & Support
- Recent Changes (v3.1)
- Project History
- Maintainer
Welcome to AI Document Analyst v3.0, the tool you never knew you desperately needed, now with a complete backend overhaul and a cleaner architecture. Built by Devansh Singh (yes, I made this, and yes, I'm still waiting for my Nobel Prize).
This Python-powered, AI-infused, theme-switching, sarcasm-enabled agent will:
- Read your PDFs, DOCX, TXT, CSV, Excel (XLSX/XLS), and images (JPG, JPEG, PNG, TIFF, BMP via OCR, because why not?).
- Summarize them via OpenCode Zen (OpenAI-compatible chat completions), with `minimax-m3-free` as the default model.
- Stream answers back token-by-token via `st.write_stream`: first words in <1 s, full reply in 1-3 s.
- Analyze your data with pandas wizardry (the Avengers of data science).
- Visualize trends and patterns with auto-generated charts (because you love pretty colors).
- Chat with your documents like they're your best friend (spoiler: they're more reliable).
- Switch between light and dark themes (because your eyes deserve options).
- Generate Reports that sound like you spent hours on them (you didn't).
- Deploy to Streamlit Cloud with one click (yes, really).
All this, wrapped in a gorgeous Streamlit UI with tabs, themes, and more bells and whistles than a marching band.
v2.0 changelog (historical context)
A historical changelog entry: v2.0 was the previous major version before the 2026-06-05 reset to v3.0. Kept here for readers who arrive via the v2.0 screenshots below.
- Dark/Light Mode: Toggle between themes like a pro. Your retinas will thank you.
- Tabbed Interface: Home, Upload, Chat, Analytics, and Settings tabs. Organization is sexy.
- Enhanced Upload: Drag & drop files with style. Progress bars included (because waiting is fun).
- Interactive Chat: Ask your documents anything. They actually respond now.
- Settings Panel: Configure everything from AI models to themes. Power user vibes.
- Advanced Analytics: Beautiful charts, stats, and insights that'll make Excel cry.
- Better UI: Modern gradients, cards, and animations. Instagram-worthy data analysis.
Glimpses! So you know it actually works. Screenshots, in order:
- Home Dashboard
- Upload & Process
- File Processing
- AI Chat Interface
- Chat Conversation
- Analytics Dashboard
- Settings Panel
- Dark Mode Settings
The UI may differ slightly if I decided to tweak it and forgot to update screenshots. JK! (But seriously, it might.)
- Welcome dashboard with feature overview
- Quick start guide (for the impatient)
- Status indicators (so you know things are working)
- Multi-format Support: PDF, DOCX, TXT, CSV, Excel (XLSX/XLS), and images (JPG, JPEG, PNG, TIFF, BMP)
- Drag & Drop Interface: Because clicking is so 2010
- Real-time Processing: Watch your files get analyzed in real-time
- Progress Tracking: Know exactly what's happening (transparency is key)
- Auto-Visualization: Charts generate themselves (like magic, but with code)
- Conversational Q&A: Ask anything about your documents
- Context Awareness: Remembers your conversation (better than most humans)
- Quick Questions: Pre-built buttons for instant insights
- Smart Responses: Powered by OpenCode Zen (OpenAI-compatible chat completions)
- Token-by-token streaming: Answers render chunk-by-chunk via `st.write_stream`, so the first words appear in <1 s instead of waiting for the full 1-3 s completion
- Statistical Summaries: Mean, median, mode, and other math-y things
- Data Quality Checks: Missing values, duplicates, outliers
- Correlation Analysis: Find relationships you never knew existed
- Auto-Generated Charts: Histograms, heatmaps, box plots, and more
- API Key Management: Built-in key configuration (no more .env hunting)
- Model Selection: Choose from multiple AI models
- Theme Switching: Light/Dark mode toggle
- Processing Settings: Customize AI behavior
- Session Management: Reset everything when you mess up

`pip install -r requirements.txt`

(Or just install everything you see in the imports. I believe in your package management skills.)

`streamlit run app.py`

Note: the entrypoint is `app.py`, not `Agent.py`. `Agent.py` is the engine module; `app.py` is the Streamlit UI. The old `Data_Analyst_Agent.py` / `Agent.py` monolith is gone.

If `streamlit` is not on your PATH, you can also run `python -m streamlit run app.py`.

- The app opens at `http://localhost:8501` (Streamlit's default)
- If it doesn't, manually navigate there (I can't click for you)
- Set your `OPENCODE_API_KEY` in `.env` (or paste it into the app's Settings tab once it's running)
- Upload your files and start chatting. The default model is `minimax-m3-free` (no credit card burn)
- Push this repo to GitHub.
- On share.streamlit.io, click New app, select the repo, and set the main file path to `app.py`.
- Open Advanced settings → Secrets and paste: `OPENCODE_API_KEY = "your_key_here"`
- Click Deploy. The first build will pull `tesseract` from `packages.txt` and Python deps from `requirements.txt`.

`packages.txt` in the repo root installs the system `tesseract` binary on the Cloud image so image OCR works.
The repo ships with a 73-test suite under tests/ that covers extraction
failures, BM25 retrieval, the SSRF policy, the path-traversal-safe
filename helper, and the extension allowlist. Run it with either:

    # stdlib only: works out of the box, no install step
    python -m unittest discover -s tests -v
    # or, if you've installed pytest as a dev dep:
    python -m pytest tests -v

The suite finishes in ~100 ms because it stubs the LLM layer and never hits the network. Heavy visualization deps (matplotlib, seaborn, PIL, pytesseract) are not imported by the tests.

To install pytest (optional):

    pip install pytest

Project structure:

    Dataa_Analyst_Agent/
    ├── app.py              # Streamlit UI: entrypoint, pure orchestration
    ├── app_helpers.py      # _safe_filename, AVAILABLE_MODELS, list_model_choices
    ├── theme.py            # DARK_CSS / LIGHT_CSS + css_for_theme()
    ├── Agent.py            # Engine: extractors, BM25 retriever, OpenCode Zen client
    ├── ISSUES.md           # Open audit findings (audit complete as of v3.1)
    ├── Readme.md           # You are here
    ├── requirements.txt    # Python runtime deps
    ├── packages.txt        # System deps (tesseract for OCR on Streamlit Cloud)
    ├── pyproject.toml      # [tool.pytest.ini_options] for the test suite
    ├── .streamlit/
    │   └── config.toml     # Cloud-friendly Streamlit defaults
    ├── tests/
    │   ├── __init__.py
    │   ├── conftest.py     # Shared fixtures (works under pytest OR unittest)
    │   └── test_agent.py   # 73 tests covering extraction, retrieval, SSRF, persistence, reset path, streaming, viz guards, DataFrame preview, theme, helpers
    └── venv/               # Local virtualenv (not committed)

At runtime, not in the repo:
- `tempfile.gettempdir()/dataa_analyst_state_<uuid>.db`: per-agent SQLite store for session persistence. See Session Persistence below.
- `tempfile.gettempdir()/dataa_analyst_viz_<uuid>/`: per-agent directory holding chart PNGs.
The app talks to OpenCode Zen, a single endpoint that fronts multiple model providers behind an OpenAI-compatible chat-completions API. Sign up there, paste a credit card (the free tier is enough for most use), and grab an API key.
Resolution order (the app tries these in sequence):
- `st.secrets["OPENCODE_API_KEY"]`: used when deployed on Streamlit Cloud
- `OPENCODE_API_KEY` environment variable: used for local dev via `.env`
- `TOGETHER_API_KEY` environment variable: legacy fallback from v2.0
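
The resolution order above can be sketched roughly like this. It is a minimal illustration, not the code in `Agent.py`: `resolve_api_key` is a hypothetical helper, and it takes plain mappings so it can run without Streamlit installed (inside the app you would pass `st.secrets` as the first argument).

```python
import os
from typing import Mapping, Optional

# Environment variable names, in the order described above.
_ENV_KEYS = ("OPENCODE_API_KEY", "TOGETHER_API_KEY")


def resolve_api_key(
    secrets: Optional[Mapping[str, str]] = None,
    environ: Optional[Mapping[str, str]] = None,
) -> Optional[str]:
    """Return the first non-empty key found, or None (hypothetical helper)."""
    env = os.environ if environ is None else environ
    # 1. Streamlit Cloud secrets take priority when they are available.
    if secrets is not None:
        value = secrets.get("OPENCODE_API_KEY")
        if value:
            return value
    # 2. and 3. Environment variables: current name first, legacy name last.
    for name in _ENV_KEYS:
        value = env.get(name)
        if value:
            return value
    return None


if __name__ == "__main__":
    print(resolve_api_key({}, {"TOGETHER_API_KEY": "legacy-key"}))
```

Returning `None` instead of raising leaves it to the UI to decide how to nag you about the missing key.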
For local dev, put the key in `.env`:

`echo "OPENCODE_API_KEY=your_key_here" > .env`

On Streamlit Cloud, in the app dashboard, go to Settings → Secrets and paste:

`OPENCODE_API_KEY = "your_key_here"`

Save. The app will reboot and pick the key up automatically; no code change needed.
All 10 supported models (default is `minimax-m3-free`)

Default model is `minimax-m3-free` (free tier). You can swap to any of these from the in-app Settings tab:

- `minimax-m3-free`: default, free tier
- `mimo-v2.5-free`: free tier
- `qwen3.6-plus-free`: free tier
- `deepseek-v4-flash-free`: free tier
- `nemotron-3-ultra-free`: free tier
- `minimax-m2.7`: paid, latest MiniMax
- `minimax-m2.5`: paid, previous MiniMax
- `gpt-5`: paid, via Zen
- `claude-sonnet-4-6`: paid, via Zen
- `gemini-3.1-pro`: paid, via Zen

The full live catalog is at https://opencode.ai/zen/v1/models.
Uploads, conversation history, analysis results, BM25 chunk caches,
and chart bytes are persisted to a SQLite database under
tempfile.gettempdir(). The DB survives Streamlit container
recycles on Cloud but doesn't pollute the repo, and the agent
hydrates from it on init so a fresh container picks up exactly
where the previous one left off.
- DB path: `tempfile.gettempdir()/dataa_analyst_state_<uuid>.db`, one file per agent instance, UUID-keyed.
- What's stored: documents (content + summary), DataFrames (parquet blobs), analyses (JSON), conversation history (Q&A), BM25 chunk cache, and `(label, png_bytes)` viz pairs.
- Cleared by: the in-app "Clear All Files" button (which calls `agent.clear_caches()`, including the store) and the "Reset Session" button on the Settings tab.
- No new dependencies: `sqlite3` is in the Python stdlib.
If you want a clean slate, hit "Reset Session" in the Settings tab; on the next render the agent will be re-instantiated and a new DB will be created.
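
For the curious, here is a minimal sketch of the write-through idea using only the stdlib. `SessionStore`, its `history` table, and the JSON payload shape are all hypothetical; the real store also holds documents, parquet blobs, and chart bytes, which this sketch skips.

```python
import json
import os
import sqlite3
import tempfile
import uuid


class SessionStore:
    """Tiny write-through store for conversation history (hypothetical)."""

    def __init__(self, db_path=None):
        # One DB file per agent instance, UUID-keyed, under the temp dir.
        self.db_path = db_path or os.path.join(
            tempfile.gettempdir(), f"dataa_analyst_state_{uuid.uuid4().hex}.db"
        )
        self._conn = sqlite3.connect(self.db_path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS history ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, payload TEXT NOT NULL)"
        )
        self._conn.commit()

    def append(self, question, answer):
        # Write-through: every mutation is committed immediately.
        self._conn.execute(
            "INSERT INTO history (payload) VALUES (?)",
            (json.dumps({"q": question, "a": answer}),),
        )
        self._conn.commit()

    def load(self):
        # Hydrate the history in insertion order.
        rows = self._conn.execute("SELECT payload FROM history ORDER BY id")
        return [json.loads(payload) for (payload,) in rows]

    def clear(self):
        self._conn.execute("DELETE FROM history")
        self._conn.commit()

    def close(self):
        self._conn.close()
```

Reopening the same path with `SessionStore(path)` and calling `load()` is the "hydrate on init" step; `clear()` stands in for what "Reset Session" does to the on-disk store.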
- Start at Home: Overview of features and quick start guide
- Upload Files: Drag & drop your documents in the Upload tab
- Auto-Processing: Text extraction, OCR, data loading, all automatic
- Get Analytics: Instant stats, charts, and insights in the Analytics tab
- Chat Away: Ask questions in the Chat tab and get smart answers
- Customize: Tweak settings, change themes, swap AI models
- Export Results: Screenshots, insights, whatever you need
- Low Setup Hassle: Add your OpenCode Zen API key once, then just run and go
- Beautiful UI: Dark mode, themes, modern design
- Actually Smart: Real AI analysis, not just fancy buttons
- Multiple File Types: PDF, Excel, images - it reads everything
- Conversation Memory: Ask follow-up questions like a normal human
- Free to Use: No hidden costs, no subscription nonsense
- Regular Updates: I actually maintain this thing
If you encounter any errors (because software is never perfect):

NumPy / pandas version conflicts:

    pip uninstall numpy pandas -y
    pip install numpy==1.24.3
    pip install pandas==1.5.3
    pip install -r requirements.txt

Streamlit issues:

    pip install --upgrade streamlit

API key issues:

- Get a key from OpenCode Zen
- Add it in the Settings tab of the app, in `.env` (local), or in Streamlit Cloud → Settings → Secrets (deployment)
- The app will pick it up on next reload
Made with excessive amounts of coffee, determination, and a healthy dose of sarcasm by Devansh Singh.
Special thanks to:
- OpenCode Zen for their OpenAI-compatible chat API
- Streamlit for making beautiful UIs possible
- Meta for Llama models that actually work
- You for using this instead of doing manual analysis
- This tool is for educational and productivity purposes
- AI responses are smart but not infallible (unlike me)
- No documents were harmed in the making of this agent
- Dark mode may cause addiction to superior UI experiences
- Free API usage is subject to reasonable limits (don't abuse it)
- Open an issue on GitHub (I actually read them)
- Email: dksdevansh@gmail.com (for serious stuff)
- Or just scream into the void (therapeutic but less helpful)
The v3.0 cutover (June 2026) shipped the OpenCode Zen migration and the
Agent.py / app.py split. v3.1 layers a security + correctness
pass on top of that, plus the retrieval upgrade and a real test suite.
- XSS in chat output fixed. The `unsafe_allow_html=True` blocks that interpolated LLM output into HTML (10+ call sites in `app.py`) are gone. The chat response now uses `st.chat_message`, which sanitizes by default. The only remaining `unsafe_allow_html=True` is the CSS injection at `app.py:217`, which has to be raw HTML for theming to work.
- Path traversal via uploaded filename fixed. The original `temp_<uploaded_file.name>` sink is gone. Uploads now go to `temp_uploads/<uuid>.<safe_ext>` via `app._safe_filename()` (a werkzeug-free sanitizer: basename + allowlist regex + UUID fallback). All 18 adversarial inputs (path traversal, shell metacharacters, Windows backslashes, 300-char overflow) resolve to paths strictly inside `temp_uploads/`.
- SSRF chokepoint added. `Agent.safe_fetch_url()` is now the only path any future URL fetcher should use. It validates the scheme (http/https), blocks 17 private/loopback/link-local IPv4 + IPv6 ranges, normalises IPv4-mapped IPv6, and re-checks DNS + redirects at every hop to defeat DNS rebinding. A `SECURITY:` comment at the `import requests` line points future contributors to it.
- Extraction errors no longer fed to the LLM. The four `extract_*` helpers now raise instead of returning an error string. `process_document` returns `success: bool` + `error: str | None`; on failure `content` and `summary` are empty and the file is not added to `document_content` or `data_frames`. The UI short-circuits the render + viz pipeline and shows `st.error` with the actual message. `load_structured_data` no longer silently returns an empty DataFrame on failure (silent data loss bug).
- Extension allowlist unified. `Agent._SUPPORTED_EXTENSIONS` is the single source of truth. The `st.file_uploader` `type=` list is derived from it, so adding an extension once extends both layers. Side effect: `tiff` and `bmp` uploads now actually reach the OCR extractor (they were silently dropped by the uploader filter before).
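
To make the path-traversal fix above concrete, here is a rough sketch of the general idea: take the basename, validate the extension against an allowlist, and never reuse the user-supplied stem. `safe_upload_path`, `ALLOWED_EXTENSIONS`, and the `bin` fallback are assumptions for illustration; the actual `_safe_filename()` in `app_helpers.py` may behave differently.

```python
import re
import uuid
from pathlib import PurePosixPath

# Assumed allowlist, copied from the formats this README says are supported.
ALLOWED_EXTENSIONS = {
    "pdf", "docx", "txt", "csv", "xlsx", "xls",
    "jpg", "jpeg", "png", "tiff", "bmp",
}
_EXT_RE = re.compile(r"^[a-z0-9]{1,8}$")


def safe_upload_path(original_name: str, upload_dir: str = "temp_uploads") -> str:
    """Map any uploaded filename to upload_dir/<uuid>.<ext> (hypothetical)."""
    # Normalise Windows separators, then keep only the final path component.
    base = PurePosixPath(original_name.replace("\\", "/")).name
    ext = base.rsplit(".", 1)[-1].lower() if "." in base else ""
    # Anything that is not a short, plain, allowlisted extension is replaced.
    if not _EXT_RE.match(ext) or ext not in ALLOWED_EXTENSIONS:
        ext = "bin"  # assumed fallback; rejecting the upload is another option
    # The stem is always a fresh UUID, so user input never reaches the path.
    return f"{upload_dir}/{uuid.uuid4().hex}.{ext}"


if __name__ == "__main__":
    for name in ("report.PDF", "../../etc/passwd", "C:\\x\\evil.png; rm -rf /"):
        print(repr(name), "->", safe_upload_path(name))
```

Because the stem is generated rather than sanitized, there is nothing to escape: the worst an attacker can do is pick which allowlisted extension ends up on the file.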
- BM25 retrieval replaces blind truncation. `answer_question` used to send only the first `[:1500]` chars per doc and the first `[:4000]` of the assembled context to the LLM. A 50-page PDF or a 10-section CSV left the model blind past the opening pages. Now `process_document` chunks text at extraction time (paragraph-aware, 800/150), and `answer_question` scores chunks with a stdlib BM25-lite ranker and sends the top-4 per doc into a 12k-char context budget. Pure stdlib: no sentence-transformers, no chromadb. A follow-up could swap the ranker for embedding-based cosine similarity without changing the chunker or the budget.
- Visualisations no longer leak into CWD. The old `visualizations_<file_name>/` in the repo root is gone. Charts now go to a per-agent UUID-keyed subdir of `tempfile.gettempdir()` and are also kept in memory as `(label, png_bytes)` pairs on the agent (consumed by the analytics tab on subsequent reruns). `clear_visualizations()` is wired to the "Clear All Files" button.
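
A stdlib-only sketch of the chunk-then-rank approach described above. Everything here is illustrative: the function names are hypothetical, "800/150" is read as 800-character chunks with a 150-character overlap (an assumption), and the scoring is textbook BM25 with the common `k1=1.5`, `b=0.75` defaults rather than whatever `Agent.py` actually uses.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def chunk_text(text, size=800, overlap=150):
    """Greedily pack paragraphs into chunks of roughly `size` characters."""
    chunks, current = [], ""
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        if current and len(current) + len(para) + 2 > size:
            chunks.append(current)
            # Carry a short tail forward so context spans chunk boundaries.
            current = current[-overlap:] if overlap > 0 else ""
        current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks


def bm25_top_k(query, chunks, k=4, k1=1.5, b=0.75):
    """Return up to k chunks with a positive BM25 score, best first."""
    docs = [tokenize(c) for c in chunks]
    if not docs:
        return []
    n = len(docs)
    avg_len = sum(len(d) for d in docs) / n or 1.0
    doc_freq = Counter(term for d in docs for term in set(d))
    query_terms = set(tokenize(query))
    scored = []
    for idx, doc in enumerate(docs):
        tf = Counter(doc)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scored.append((score, idx))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [chunks[i] for score, i in scored[:k] if score > 0]
```

Note that this sketch does not split a single paragraph longer than `size`; a real chunker would need a fallback for that case.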
- 73 tests in `tests/test_agent.py` (one new optional dep: `httpx`, used lazily for streaming; runs under stdlib `unittest` or `pytest`).
- Coverage: extension detection + allowlist; `_safe_filename` for path-traversal safety; `process_document` happy + failure paths; BM25 retrieval surfaces the right chunk; `safe_fetch_url` blocks unsafe schemes + private IP ranges; BM25 ranker correctness; SQLite persistence across container recycle; Reset Session handler (no mid-iteration `del`, wipes on-disk store); token-by-token streaming (SSE parser, error surfacing, full-text persistence); visualization guards (empty/short dfs skip the right chart types, column truncation is surfaced in the label); DataFrame preview storage (large CSVs no longer inflate `document_content` or the SQLite documents row; the full DataFrame hydrates from parquet); the new `theme` + `app_helpers` modules (no `DARK_CSS` / `LIGHT_CSS` / `_safe_filename` / inline `AVAILABLE_MODELS` left in `app.py`); and the curated model catalogue (fictional ids like `mimo-v2.5-free` are not in `AVAILABLE_MODELS`).
- No network calls, no heavy-dep imports in the test path. Full suite finishes in ~100 ms.
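
The SSE parser mentioned in the coverage list can be pictured roughly as follows. This is a sketch that assumes the usual OpenAI-style stream format (`data: {json}` lines ending with `data: [DONE]`); whether OpenCode Zen matches it byte for byte is an assumption, and `iter_sse_tokens` is a hypothetical name.

```python
import json


def iter_sse_tokens(lines):
    """Yield content deltas from an iterable of OpenAI-style SSE lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alives and ": comment" lines
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        try:
            event = json.loads(payload)
        except json.JSONDecodeError:
            continue  # ignore malformed frames instead of crashing the chat
        if not isinstance(event, dict):
            continue
        if "error" in event:
            # Surface upstream errors rather than silently ending the stream.
            raise RuntimeError(str(event["error"]))
        for choice in event.get("choices", []):
            token = (choice.get("delta") or {}).get("content")
            if token:
                yield token


if __name__ == "__main__":
    demo = [
        'data: {"choices":[{"delta":{"content":"Hel"}}]}',
        'data: {"choices":[{"delta":{"content":"lo"}}]}',
        "data: [DONE]",
    ]
    print("".join(iter_sse_tokens(demo)))
```

Since it is a plain generator of strings, something like this can be handed straight to `st.write_stream`, and joining the yielded tokens gives the full text to persist.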
- `app.py`: Streamlit UI (the entrypoint for `streamlit run`).
- `Agent.py`: engine with extractors, BM25 retriever, OpenCode Zen client, and the `safe_fetch_url` SSRF chokepoint.
- `tests/`: 73 tests + shared fixtures (`conftest.py`).
- `pyproject.toml`: `[tool.pytest.ini_options]` for the test suite.
- `ISSUES.md`: personally tracked audit; updated as each fix lands.
- `.streamlit/config.toml`: Cloud-friendly defaults (port 8501, headless).
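
For a feel of what an SSRF chokepoint like `safe_fetch_url` has to check before any request goes out, here is a stdlib sketch. It is not the implementation in `Agent.py`: `is_public_address` and `check_url` are hypothetical names, the blocked set relies on the `ipaddress` module's built-in classifications instead of an explicit 17-range list, and redirect or DNS-rebinding re-checks are left out entirely.

```python
import ipaddress
import socket
from urllib.parse import urlsplit


def is_public_address(host_ip: str) -> bool:
    """True if the IP is not private, loopback, link-local, or similar."""
    ip = ipaddress.ip_address(host_ip.split("%")[0])  # drop any IPv6 zone id
    # Unwrap IPv4-mapped IPv6 (::ffff:a.b.c.d) so it cannot bypass the check.
    if ip.version == 6 and ip.ipv4_mapped is not None:
        ip = ip.ipv4_mapped
    return not (
        ip.is_private or ip.is_loopback or ip.is_link_local
        or ip.is_multicast or ip.is_reserved or ip.is_unspecified
    )


def check_url(url: str, resolve=socket.getaddrinfo) -> None:
    """Raise ValueError if the URL should not be fetched (sketch only)."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        raise ValueError(f"blocked scheme: {parts.scheme!r}")
    if not parts.hostname:
        raise ValueError("missing host")
    port = parts.port or (443 if parts.scheme == "https" else 80)
    # Every address the name resolves to must be public, not just the first.
    for info in resolve(parts.hostname, port, proto=socket.IPPROTO_TCP):
        addr = info[4][0]
        if not is_public_address(addr):
            raise ValueError(f"blocked address: {addr}")
```

The `resolve` parameter exists so tests can inject a fake resolver and stay off the network, in the same spirit as the suite described above.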
- XSS in chat output (10+ `unsafe_allow_html` sites) → `st.chat_message`
- Path-traversal sink (`temp_<uploaded_name>`) → UUID-keyed `temp_uploads/`
- SSRF TODO → working `safe_fetch_url()` chokepoint
- Extraction errors fed to LLM → `success: bool` + UI short-circuit
- Blind `[:4000]` truncation → BM25 retrieval over pre-chunked text
- `visualizations_*` dirs in CWD → in-memory bytes + `tempfile.gettempdir()`
- Hardcoded extension list → `Agent._SUPPORTED_EXTENSIONS` (single source of truth)
- No tests → 73-test suite under `tests/`
- In-memory state lost on container recycle → SQLite store under `tempfile.gettempdir()`, hydrated on init, write-through on every mutation. No new deps.
- `del st.session_state[key]` mid-iteration in Reset Session → `st.session_state.pop(key, None)` plus `agent.clear_caches()` + `agent.clear_visualizations()` so the on-disk store is wiped too.
- Synchronous HTTP, no token-by-token feedback → `httpx.AsyncClient.stream` + `st.write_stream` so the chat bubble appears in <1 s and tokens arrive in place. The new `stream_answer` mirrors `answer_question`'s side effects (conversation history, on-disk store) so a streamed answer and a non-streamed answer see the same persistence path.
- Charts always render, even for 3-row data → `create_visualizations` now picks chart types that match the data: empty df → no charts, <3 rows → no histogram, <5 rows → no box plot, <2 numeric cols → no heatmap, and column truncation is surfaced in the label as `"Distributions (showing 4 of 12)"`.
- `df.to_string()` stored in agent state (10 MB+ for 100k rows) → `process_document` now stores a bounded preview (shape, columns, dtypes, first 20 rows) in `document_content[file_name]["content"]` and the SQLite documents table. The full DataFrame is still in `data_frames[file_name]` (parquet blob, hydrated on container recycle). A 100k-row CSV's `content` is now bounded under 5 KB instead of >10 MB.
- `app.py` mixes UI + theming + helpers; `AVAILABLE_MODELS` includes fictional ids → CSS moved to `theme.py`, `_safe_filename` and a curated `AVAILABLE_MODELS` moved to `app_helpers.py`. The catalogue was trimmed from 10 entries (8 of which would 404 against OpenCode Zen) to 3 verified ids with a `curl`-based recipe for adding more. `app.py` re-exports the moved names so legacy `from app import _safe_filename` keeps working.
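
The chart-selection guards in the list above boil down to a handful of threshold checks. Here is a small sketch of that decision logic on its own, with no pandas or matplotlib involved. `pick_chart_types`, the `max_cols=4` cap (inferred from the "showing 4 of 12" example), the "Box plots" and "Correlation heatmap" labels, and the no-numeric-columns rule are all assumptions.

```python
def pick_chart_types(n_rows, n_numeric_cols, max_cols=4):
    """Return (chart_type, label) pairs that make sense for the data shape."""
    charts = []
    # Empty frame, or nothing numeric to plot (assumed rule): no charts.
    if n_rows == 0 or n_numeric_cols == 0:
        return charts
    shown = min(n_numeric_cols, max_cols)
    if n_rows >= 3:  # fewer than 3 rows: no histogram
        label = "Distributions"
        if shown < n_numeric_cols:
            # Surface the truncation instead of silently dropping columns.
            label += f" (showing {shown} of {n_numeric_cols})"
        charts.append(("histogram", label))
    if n_rows >= 5:  # fewer than 5 rows: no box plot
        charts.append(("box", "Box plots"))
    if n_numeric_cols >= 2:  # fewer than 2 numeric columns: no heatmap
        charts.append(("heatmap", "Correlation heatmap"))
    return charts


if __name__ == "__main__":
    for shape in ((0, 5), (2, 3), (4, 1), (100, 12)):
        print(shape, pick_chart_types(*shape))
```

Keeping the decision separate from the plotting code is also what makes this kind of guard cheap to test without importing matplotlib.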
This repo went through a one-time history reset on 2026-06-05.
Prior to that date, the main branch carried 26 commits of v2.0 history
that mixed the agent, the Streamlit UI, the launcher, and ~700 lines of
theme CSS in a single 2,197-line Agent.py file, with a real
TOGETHER_API_KEY committed to .env. The reset replaced that history
with the v3.0 structure described above.
If you're looking at an old clone and git pull shows nothing, that's
why β please re-clone.
This project is maintained solely by @DevanshSrajput (Devansh Singh).
As of 2026-06-05, collaborator write access for @aditya-ig10 has been revoked. The repo is now single-maintainer; PRs from other contributors are not accepted and pushes from outside the maintainer account will be force-reverted.
For issues, suggestions, or security reports, contact
dksdevansh@gmail.com.
Enjoy the new and improved document analysis experience! (Or don't, but at least it looks pretty now)