PhoneRAG

PhoneRAG is a simple Android proof of concept for fully on-device AI using llama.cpp.

It demonstrates how to:

run a GGUF LLM locally on Android
load models from device storage
perform inference without any cloud API
build a Kotlin Android app around local inference
experiment with lightweight retrieval for source-grounded responses

Demo

This project is part of a short demo on fully local AI on Android.

Blog: add-your-blog-link-here
LinkedIn post: add-your-linkedin-post-link-here

Why this project

Most AI app demos rely on cloud inference. The goal here was different:

keep prompts and data on the device
avoid server dependency
understand how to wire native llama.cpp inference into an Android app
test whether lightweight RAG-style behavior is practical on mobile

What it does

The app lets a user:

Pick a .gguf model file from device storage
Import and load the model locally
Send prompts to the model on-device
View responses in a simple Android UI

I also used this project to test retrieval-style prompting with a small local knowledge base and source display.

Core stack

llama.cpp
Android Studio
Kotlin
GGUF models
Android document picker
App-private storage for local model loading
AiChat / InferenceEngine integration pattern from the official Android sample

Project direction

This repo started as a custom Android integration attempt, then moved to a more stable approach: reusing the official llama.android integration pattern and customizing the application layer on top.

That made the native setup more reliable and let me focus on:

the app flow
prompt handling
lightweight retrieval experiments
debugging model behavior on-device

App flow

The working inference path is:

User selects a GGUF model
App parses GGUF metadata
App copies the model into app-private storage
The inference engine loads the model from the local path
User sends a prompt
Tokens stream back and are rendered in the UI
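
The import step in the flow above (copying the picked model into app-private storage) can be sketched in plain Kotlin. This is a minimal illustration, not the code from this repo: importModel and its parameters are hypothetical names. On Android the stream would typically come from contentResolver.openInputStream(uri) and the directory from context.filesDir, which is an assumption about how the app is wired.

```kotlin
import java.io.File
import java.io.InputStream

/**
 * Copies a model stream into a directory the app owns and returns the new file.
 * Hypothetical helper: on Android, `source` would likely be obtained from
 * contentResolver.openInputStream(uri) and `privateDir` from context.filesDir.
 */
fun importModel(source: InputStream, privateDir: File, fileName: String): File {
    // Keep only the last path segment so the name cannot point outside privateDir.
    val safeName = File(fileName).name
    require(safeName.endsWith(".gguf", ignoreCase = true)) { "Expected a .gguf file, got: $safeName" }
    privateDir.mkdirs()

    val target = File(privateDir, safeName)
    // Copy into a ".part" file first so an interrupted copy is never mistaken for a complete model.
    val partial = File(privateDir, "$safeName.part")
    source.use { input ->
        partial.outputStream().use { output -> input.copyTo(output) }
    }
    if (target.exists()) target.delete()
    check(partial.renameTo(target)) { "Could not move ${partial.name} into place" }
    return target
}
```

The returned file's absolutePath is what would then be handed to the inference engine as the local model path.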

Retrieval experiment

To move toward a RAG-like experience, I tested lightweight retrieval approaches inside the app.

The goal was to:

match a user query to small local context chunks
pass the selected context to the model
ask grounded questions with visible sources
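
One simple way to match a query to local chunks is keyword overlap. The sketch below is an assumption about what "lightweight retrieval" could look like, not the exact method used in this app. Chunk, Hit, tokenize, and retrieve are hypothetical names, and the stopword list is deliberately tiny.

```kotlin
data class Chunk(val source: String, val text: String)
data class Hit(val chunk: Chunk, val score: Double)

// A very small stopword list, for illustration only.
val STOPWORDS = setOf("the", "and", "for", "how", "does", "what", "with", "are")

// Lowercase, split on non-alphanumerics, drop short words and stopwords.
fun tokenize(text: String): Set<String> =
    text.lowercase()
        .split(Regex("[^a-z0-9]+"))
        .filter { it.length > 2 && it !in STOPWORDS }
        .toSet()

/**
 * Scores each chunk by the fraction of query terms it contains
 * and returns the best topK chunks with a non-zero score.
 */
fun retrieve(query: String, chunks: List<Chunk>, topK: Int = 2): List<Hit> {
    val queryTerms = tokenize(query)
    if (queryTerms.isEmpty()) return emptyList()
    return chunks
        .map { chunk ->
            val overlap = queryTerms.intersect(tokenize(chunk.text)).size
            Hit(chunk, overlap.toDouble() / queryTerms.size)
        }
        .filter { it.score > 0.0 }
        .sortedByDescending { it.score }
        .take(topK)
}
```

Each Hit keeps its source name and score, so the same values can be logged and shown next to the answer as visible sources.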

This was useful for exploring:

query-to-context matching
prompt design for grounded QA
how much model quality affects source-aware answers
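
For the prompt-design side, a grounded QA prompt can be assembled from the selected context roughly as follows. The wording of the instructions and the names ContextSnippet and buildGroundedPrompt are illustrative assumptions, not the prompt used in this project. Smaller models may need a different template.

```kotlin
data class ContextSnippet(val source: String, val text: String)

/**
 * Builds a prompt that numbers each snippet, names its source,
 * and asks the model to answer only from that context.
 */
fun buildGroundedPrompt(question: String, snippets: List<ContextSnippet>): String {
    val context = if (snippets.isEmpty()) {
        "(no matching context)"
    } else {
        snippets
            .mapIndexed { i, s -> "[${i + 1}] (${s.source}) ${s.text}" }
            .joinToString("\n")
    }
    return buildString {
        appendLine("Answer using only the context below.")
        appendLine("Cite sources by their bracketed number. If the context does not contain the answer, say you don't know.")
        appendLine()
        appendLine("Context:")
        appendLine(context)
        appendLine()
        appendLine("Question: $question")
        append("Answer:")
    }
}
```

Logging the full prompt string before inference makes it easier to tell a retrieval miss from a model that ignored good context.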

Problems faced during build

This project involved several practical Android-native issues:

CMake path and project structure issues
Gradle and plugin mismatches
AndroidX and minSdk mismatches
ABI packaging problems
backend-loading/runtime failures in native inference
weaker GGUF models failing to follow grounded QA prompts well

A major lesson was that on-device AI is not only about choosing a model. Native packaging and runtime setup matter just as much as the model and the UI.

Lessons learned

The official Android sample is the safest base for llama.cpp on Android
Native integration issues can dominate development time
Model quality strongly affects grounded QA performance
Retrieval can return the right context even when the model still answers poorly
Kotlin is fully capable of handling the app-side orchestration for local AI

How to run

Requirements

Android Studio
Android SDK / NDK configured
A compatible Android device or emulator
A GGUF model file stored locally

Steps

Open the project in Android Studio
Sync Gradle
Build and run the app
Pick a .gguf model from storage
Wait for the model to be imported and loaded
Start chatting locally

Debugging notes

During development, the most useful debugging tool was Logcat.

Things worth checking:

current device ABI
model import path
GGUF metadata parsing
model load success/failure
backend-loading logs
prompt text being sent
retrieval hits and source scores
generated token stream
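
A quick sanity check on an imported model is to read the fixed-size GGUF header before handing the file to the engine. The sketch below assumes the header layout used by GGUF version 2 and later (4-byte "GGUF" magic, 32-bit version, 64-bit tensor count, 64-bit metadata key/value count, all little-endian). GgufHeader and readGgufHeader are hypothetical names, and this is not the parser used in the app.

```kotlin
import java.io.DataInputStream
import java.io.EOFException
import java.io.InputStream
import java.nio.ByteBuffer
import java.nio.ByteOrder

data class GgufHeader(val version: Int, val tensorCount: Long, val metadataKvCount: Long)

/**
 * Reads the first 24 bytes of a stream and returns the GGUF header,
 * or null if the stream is too short or does not start with "GGUF".
 */
fun readGgufHeader(input: InputStream): GgufHeader? {
    val bytes = ByteArray(24)
    try {
        DataInputStream(input).readFully(bytes)
    } catch (e: EOFException) {
        return null // fewer than 24 bytes: not a usable model file
    }
    if (String(bytes, 0, 4, Charsets.US_ASCII) != "GGUF") return null

    // Remaining fields are little-endian: uint32 version, uint64 counts.
    val buf = ByteBuffer.wrap(bytes, 4, 20).order(ByteOrder.LITTLE_ENDIAN)
    val version = buf.int
    val tensorCount = buf.long
    val metadataKvCount = buf.long
    return GgufHeader(version, tensorCount, metadataKvCount)
}
```

A null result points at a model path or file problem rather than a backend problem, which helps narrow down which failure category applies.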

Typical failure categories:

model path issues
ABI mismatch
backend loading failure
weak prompt-following by the model

Repository purpose

This repository is mainly a learning and engineering log:

how to get local LLM inference working on Android
how Kotlin can be used around llama.cpp
what broke during integration
what worked in the final app
how a lightweight retrieval experiment can be layered on top

Credits

This work builds on:

llama.cpp
the official Android integration pattern from examples/llama.android

Status

This is a proof of concept, not a production-ready app.

It is intended to document the build process, the integration decisions, and the practical lessons from getting fully local Android AI running.
