Merged
58 changes: 58 additions & 0 deletions .github/workflows/release.yml
Contributor:
Could you please restore the changes for removing the blank lines?

@@ -51,7 +51,44 @@
          labels: ${{ steps.meta.outputs.labels }}
          outputs: type=image,name=openmindagi/riva-speech-server,push-by-digest=true,name-canonical=true,push=true

  build-embedding-arm64:
    runs-on: ubuntu-22.04-arm
    outputs:
      image-digest: ${{ steps.build.outputs.digest }}
      metadata: ${{ steps.meta.outputs.json }}
    steps:
      - name: Checkout repository
        uses: actions/checkout@v3
      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v2
      - name: Login to Docker Hub
        uses: docker/login-action@v2
        with:
          username: ${{ secrets.DOCKERHUB_USERNAME }}
          password: ${{ secrets.DOCKERHUB_TOKEN }}
      - name: Extract metadata (tags, labels) for Docker
        id: meta
        uses: docker/metadata-action@v4
        with:
          images: openmindagi/embedding-service
          tags: |
            type=semver,pattern={{version}},prefix=v
            type=semver,pattern={{major}}.{{minor}},prefix=v
            type=ref,event=branch
            type=sha
            type=raw,value=latest,enable=${{ github.ref == 'refs/heads/main' }}
            type=raw,value=e5-small-v2-aarch64
      - name: Build and push ARM64 image by digest
        id: build
        uses: docker/build-push-action@v4
        with:
          context: ./docker
          file: ./docker/Dockerfile.embed
          platforms: linux/arm64
          labels: ${{ steps.meta.outputs.labels }}
          outputs: type=image,name=openmindagi/embedding-service,push-by-digest=true,name-canonical=true,push=true

  create-manifest:

Check warning — Code scanning / CodeQL: Workflow does not contain permissions (Medium). Actions job or workflow does not limit the permissions of the GITHUB_TOKEN. Consider setting an explicit permissions block, using the following as a minimal starting point: {contents: read}
    runs-on: ubuntu-latest
    needs: [build-arm64]
    steps:
@@ -77,3 +114,24 @@
            --tag $tag \
            openmindagi/riva-speech-server@${{ needs.build-arm64.outputs.image-digest }}
          done

  create-embedding-manifest:
    runs-on: ubuntu-latest
    needs: [build-embedding-arm64]
    steps:
      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v2
      - name: Login to Docker Hub
        uses: docker/login-action@v2
        with:
          username: ${{ secrets.DOCKERHUB_USERNAME }}
          password: ${{ secrets.DOCKERHUB_TOKEN }}
      - name: Create manifest list and push
        run: |
          tags=$(echo '${{ needs.build-embedding-arm64.outputs.metadata }}' | jq -r '.tags[]')
          for tag in $tags; do
            echo "Creating manifest for: $tag"
            docker buildx imagetools create \
              --tag $tag \
              openmindagi/embedding-service@${{ needs.build-embedding-arm64.outputs.image-digest }}
          done
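The manifest jobs pull the tag list out of docker/metadata-action's JSON output with `jq -r '.tags[]'`. As an illustration only, the same extraction sketched in Python against a hypothetical sample payload (the real payload comes from `steps.meta.outputs.json`):

```python
import json

# Hypothetical payload shaped like docker/metadata-action's `json` output.
metadata = json.dumps({
    "tags": [
        "openmindagi/embedding-service:latest",
        "openmindagi/embedding-service:e5-small-v2-aarch64",
    ],
})

# Equivalent of: echo "$metadata" | jq -r '.tags[]'
tags = json.loads(metadata)["tags"]
for tag in tags:
    print(f"Creating manifest for: {tag}")
```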
Comment on lines +119 to +137

Check warning

Code scanning / CodeQL

Workflow does not contain permissions Medium

Actions job or workflow does not limit the permissions of the GITHUB_TOKEN. Consider setting an explicit permissions block, using the following as a minimal starting point: {}

Copilot Autofix (AI, 8 days ago)

In general, the fix is to add an explicit permissions block declaring the least privileges required for GITHUB_TOKEN. Since this workflow only builds and pushes Docker images to Docker Hub using separate secrets and does not interact with repository contents or GitHub APIs in a write capacity, it can use a minimal read-only configuration such as permissions: contents: read at the workflow level.

The best way to fix this without altering existing behavior is to add a root-level permissions block near the top of .github/workflows/release.yml, alongside name and on. This will apply to all jobs (build-arm64, build-embedding-arm64, create-manifest, and create-embedding-manifest) because none of them define their own permissions. A good least-privilege default is:

permissions:
  contents: read

This explicitly constrains GITHUB_TOKEN to read-only access to repository contents and leaves all other permission scopes disabled, while not interfering with Docker Hub logins or image pushes, which rely on secrets.DOCKERHUB_USERNAME and secrets.DOCKERHUB_TOKEN rather than GITHUB_TOKEN.

Concretely, edit .github/workflows/release.yml and insert the permissions block after the name: line and before the on: block. No additional imports, methods, or other definitions are needed, since this is purely a YAML configuration change for GitHub Actions.

Suggested changeset 1
.github/workflows/release.yml

Autofix patch

Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/.github/workflows/release.yml b/.github/workflows/release.yml
--- a/.github/workflows/release.yml
+++ b/.github/workflows/release.yml
@@ -1,5 +1,8 @@
 name: Release Riva Speech Server ARM64 Image
 
+permissions:
+  contents: read
+
 on:
   push:
     branches:
EOF
17 changes: 17 additions & 0 deletions docker/Dockerfile.embed
@@ -0,0 +1,17 @@
FROM nvcr.io/nvidia/pytorch:25.10-py3

WORKDIR /app

RUN pip install --no-cache-dir \
    fastapi \
    uvicorn[standard] \
    sentence-transformers

# Pre-download model at build time
RUN python3 -c "from sentence_transformers import SentenceTransformer; SentenceTransformer('intfloat/e5-small-v2')"

COPY src/embedding/embedding_server.py .

EXPOSE 8100

CMD ["uvicorn", "embedding_server:app", "--host", "0.0.0.0", "--port", "8100"]
8 changes: 8 additions & 0 deletions docker/docker-compose.yml
@@ -88,3 +88,11 @@ services:

    security_opt:
      - label:disable

  embedding:
    image: openmindagi/embedding-service:e5-small-v2-aarch64
    container_name: embedding_service
    ports:
      - "8100:8100"
    runtime: nvidia
    restart: unless-stopped
208 changes: 208 additions & 0 deletions src/embedding/embedding_server.py
@@ -0,0 +1,208 @@
"""
Embedding microservice for NVIDIA Jetson AGX Thor.
Run: uvicorn embedding_server:app --host 0.0.0.0 --port 8100.

Endpoints
---------
POST /embed
Single query embedding (base64 binary response).
POST /embed_batch
Batch query embedding (base64 binary response).
GET /health
Health check.
"""

import base64
import logging
import time

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from sentence_transformers import SentenceTransformer

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [%(levelname)s] %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",
)
logger = logging.getLogger("embedding-service")

app = FastAPI(title="Embedding Service")
model = None


# Request/Response Models
class QueryRequest(BaseModel):
    """
    Request body for single query embedding.

    Attributes
    ----------
    query : str
        The text to embed.
    """

    query: str


class FastResponse(BaseModel):
    """
    Response body for single query embedding.

    Attributes
    ----------
    embedding_b64 : str
        Base64-encoded float32 byte array (384 × 4 = 1536 bytes).
        Decode with ``np.frombuffer(base64.b64decode(s), dtype="float32")``.
    dimension : int
        Embedding dimension (384 for e5-small-v2).
    latency_ms : float
        Model inference latency in milliseconds (excludes network I/O).
    """

    embedding_b64: str
    dimension: int
    latency_ms: float


class BatchRequest(BaseModel):
    """
    Request body for batch query embedding.

    Attributes
    ----------
    queries : list of str
        List of texts to embed.
    """

    queries: list[str]


class BatchResponse(BaseModel):
    """
    Response body for batch query embedding.

    Attributes
    ----------
    embeddings_b64 : list of str
        List of base64-encoded float32 byte arrays, one per query.
    dimension : int
        Embedding dimension (384 for e5-small-v2).
    count : int
        Number of embeddings returned.
    latency_ms : float
        Model inference latency in milliseconds (excludes network I/O).
    """

    embeddings_b64: list[str]
    dimension: int
    count: int
    latency_ms: float


@app.on_event("startup")
def load_model():
    """
    Load the sentence transformer model onto GPU and run warmup inferences.

    This runs automatically when the FastAPI server starts. The warmup
    ensures CUDA kernels are compiled before the first real request,
    avoiding cold-start latency (~2s → ~6ms).
    """
    global model
    logger.info("Loading e5-small-v2 on CUDA...")
    model = SentenceTransformer("intfloat/e5-small-v2", device="cuda")
    for _ in range(5):
        model.encode(["warmup"], normalize_embeddings=True)
    logger.info("Model ready!")


@app.post("/embed", response_model=FastResponse)
def embed(req: QueryRequest):
    """
    Embed a single query and return a base64-encoded vector.

    The query is prefixed with ``"query: "`` before encoding, as
    required by the e5 model family.

    Parameters
    ----------
    req : QueryRequest
        Request body containing the query string.

    Returns
    -------
    FastResponse
        Base64-encoded embedding with dimension and latency info.
    """
    if model is None:
        raise HTTPException(
            status_code=503,
            detail="Model not loaded. Service not initialized properly.",
        )

    start = time.perf_counter()
    emb = model.encode([f"query: {req.query}"], normalize_embeddings=True).astype(
        "float32"
    )
    latency = (time.perf_counter() - start) * 1000
    emb_b64 = base64.b64encode(emb[0].tobytes()).decode("ascii")
    logger.info(f'embed | query="{req.query[:50]}" | latency={latency:.1f}ms')
    return FastResponse(
        embedding_b64=emb_b64, dimension=len(emb[0]), latency_ms=round(latency, 2)
    )


@app.post("/embed_batch", response_model=BatchResponse)
def embed_batch(req: BatchRequest):
    """
    Embed multiple queries in a single GPU batch.

    Batch processing is significantly faster than individual requests
    for multiple queries (e.g., 10 queries: ~15ms batch vs ~100ms
    sequential).

    Parameters
    ----------
    req : BatchRequest
        Request body containing a list of query strings.

    Returns
    -------
    BatchResponse
        List of base64-encoded embeddings with count and latency info.
    """
    if model is None:
        raise HTTPException(
            status_code=503,
            detail="Model not loaded. Service not initialized properly.",
        )
    if not req.queries:
        # Guard against an empty batch, which would otherwise fail below
        # when indexing embs[0].
        raise HTTPException(status_code=400, detail="queries must be non-empty.")

    start = time.perf_counter()
    prefixed = [f"query: {q}" for q in req.queries]
    embs = model.encode(prefixed, normalize_embeddings=True, batch_size=64).astype(
        "float32"
    )
    latency = (time.perf_counter() - start) * 1000
    embs_b64 = [base64.b64encode(e.tobytes()).decode("ascii") for e in embs]
    logger.info(f"embed_batch | count={len(req.queries)} | latency={latency:.1f}ms")
    return BatchResponse(
        embeddings_b64=embs_b64,
        dimension=len(embs[0]),
        count=len(embs),
        latency_ms=round(latency, 2),
    )


@app.get("/health")
def health():
    """
    Health check endpoint.

    Returns
    -------
    dict
        ``{"status": "ok", "model": "e5-small-v2"}`` if the service is
        running. Note this does not verify that the model has finished
        loading.
    """
    return {"status": "ok", "model": "e5-small-v2"}
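The service returns raw float32 bytes in base64, as the FastResponse docstring notes. A minimal client-side decode, shown as a round trip with a synthetic vector in place of a live HTTP call (numpy assumed; the helper name is illustrative):

```python
import base64

import numpy as np


def decode_embedding(embedding_b64: str) -> np.ndarray:
    """Decode a base64 float32 embedding as carried by FastResponse."""
    return np.frombuffer(base64.b64decode(embedding_b64), dtype="float32")


# Simulate the server side: a unit-norm 384-dim float32 vector,
# encoded the same way /embed encodes model output.
vec = np.ones(384, dtype="float32") / np.sqrt(np.float32(384.0))
payload = base64.b64encode(vec.tobytes()).decode("ascii")

decoded = decode_embedding(payload)
print(decoded.shape)              # (384,)
print(np.allclose(decoded, vec))  # True
```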