diff --git a/content/patterns/lemonade-stand-quickstart/_index.adoc b/content/patterns/lemonade-stand-quickstart/_index.adoc new file mode 100644 index 000000000..d27376f7e --- /dev/null +++ b/content/patterns/lemonade-stand-quickstart/_index.adoc @@ -0,0 +1,37 @@ +--- +title: Lemonade Stand AI Quickstart +date: 2026-06-25 +tier: sandbox +summary: This pattern deploys an AI guardrails demonstration with a multi-layered safety pipeline, interactive chatbot, and real-time monitoring on OpenShift. +rh_products: + - Red Hat OpenShift Container Platform + - Red Hat OpenShift AI +industries: + - General +focus_areas: + - AI + - Safety + - AI Quickstart +aliases: /lemonade-stand-quickstart/ +links: + github: https://github.com/validatedpatterns-sandbox/ai-quickstart-lemonade-stand + install: getting-started + bugs: https://github.com/validatedpatterns-sandbox/ai-quickstart-lemonade-stand/issues + feedback: https://docs.google.com/forms/d/e/1FAIpQLScI76b6tD1WyPu2-d_9CCVDr3Fu5jYERthqLKJDUGwqBg7Vcg/viewform +--- +:toc: +:imagesdir: /images +:_content-type: ASSEMBLY +include::modules/comm-attributes.adoc[] + +include::modules/lemonade-stand-quickstart-about.adoc[leveloffset=+1] + +include::modules/lemonade-stand-quickstart-architecture.adoc[leveloffset=+1] + +[id="next-steps-lemonade-stand-quickstart"] +== Next steps + +* link:getting-started[Install this pattern] +* link:cluster-sizing[Cluster sizing] +* link:customizing-this-pattern[Customizing this pattern] +* link:troubleshooting[Troubleshooting] diff --git a/content/patterns/lemonade-stand-quickstart/cluster-sizing.adoc b/content/patterns/lemonade-stand-quickstart/cluster-sizing.adoc new file mode 100644 index 000000000..945a76853 --- /dev/null +++ b/content/patterns/lemonade-stand-quickstart/cluster-sizing.adoc @@ -0,0 +1,29 @@ +--- +title: Cluster sizing +weight: 30 +aliases: /lemonade-stand-quickstart/cluster-sizing/ +--- + +:toc: +:imagesdir: /images +:_content-type: ASSEMBLY +include::modules/comm-attributes.adoc[] 
+include::modules/ai-quickstart-lemonade-stand/metadata-ai-quickstart-lemonade-stand.adoc[] + +include::modules/cluster-sizing-template.adoc[] + +[id="lemonade-stand-quickstart-gpu-node-requirements"] +== GPU node requirements + +In addition to the worker nodes listed above, this pattern requires at least 1 GPU-equipped node for LLM inference. On AWS, the pattern automatically provisions a `g5.2xlarge` instance with an NVIDIA A10G GPU. On other providers and bare metal, a GPU node must already be part of the cluster before deploying the pattern. + +.GPU node minimum requirements +[cols="<,^,<,<"] +|=== +| Cloud provider | Node type | Number of nodes | Instance type + +| Amazon Web Services +| GPU Worker +| 1 +| g5.2xlarge +|=== diff --git a/content/patterns/lemonade-stand-quickstart/customizing-this-pattern.adoc b/content/patterns/lemonade-stand-quickstart/customizing-this-pattern.adoc new file mode 100644 index 000000000..794701957 --- /dev/null +++ b/content/patterns/lemonade-stand-quickstart/customizing-this-pattern.adoc @@ -0,0 +1,138 @@ +--- +title: Customizing this pattern +weight: 20 +aliases: /lemonade-stand-quickstart/customizing/ +--- + +:toc: +:imagesdir: /images +:_content-type: ASSEMBLY +include::modules/comm-attributes.adoc[] + +[id="customizing-lemonade-stand-quickstart"] +== Customizing the Lemonade Stand AI Quickstart pattern + +This pattern deploys an AI chatbot with a multi-layered guardrails pipeline, including model-based detectors, a rule-based language detector, and regex-based competitor filtering. You can customize the LLM model, detector configuration, and monitoring settings. + +[id="changing-model-lemonade-stand"] +=== Changing the LLM model + +The pattern serves Llama 3.2 3B Instruct (FP8-quantized) by default through vLLM on KServe. The model is defined in the lemonade-stand-assistant Helm chart's `values.yaml`. + +To change the locally served model, update the model configuration in the Helm chart values. 
The model must be compatible with vLLM and fit within the available GPU VRAM on the provisioned node (NVIDIA A10G with 24 GB VRAM on `g5.2xlarge`). + +[id="using-external-model-lemonade-stand"] +=== Using an external model endpoint (BYOM) + +Instead of serving a model locally on GPU, you can configure the pattern to use an external Model-as-a-Service endpoint. This eliminates the GPU node requirement for inference. + +. Make a local copy of the secrets template outside of your repository: ++ +[WARNING] +==== +Do not add, commit, or push this file to your repository. Doing so might expose personal credentials to GitHub. +==== ++ +[source,terminal] +---- +$ cp values-secret.yaml.template ~/values-secret-ai-quickstart-lemonade-stand.yaml +---- + +. Edit the secrets file and set the API key for your external model endpoint: ++ +[source,terminal] +---- +$ vim ~/values-secret-ai-quickstart-lemonade-stand.yaml +---- ++ +[source,yaml] +---- + - name: lemonade-stand + vaultPrefixes: + - global + fields: + - name: vllm-api-key + value: +---- + +. Set the `model` section in the Helm chart values to point to your external endpoint: ++ +[source,yaml] +---- +model: + name: my-model + endpoint: my-maas-instance + port: 443 +---- + +When using an external model endpoint, the vLLM InferenceService is not deployed and the GPU node is not required for LLM inference. The guardrails pipeline continues to function normally with the external model. + +[id="enabling-gpu-detectors-lemonade-stand"] +=== Enabling GPU for detector models + +By default, the HAP and prompt injection detector models run on CPU. You can enable GPU acceleration for these models to reduce inference latency, but this requires additional GPU resources. 
+ +To enable GPU for the detector models, set the `useGpu` flag in the Helm chart values: + +[source,yaml] +---- +detectors: + hap: + useGpu: true + promptInjection: + useGpu: true +---- + +[NOTE] +==== +Enabling GPU for both detectors requires 2 additional GPUs beyond the 1 GPU used for the LLM, for a total of 3 GPUs. You must provision additional GPU nodes before enabling this option. +==== + +[id="configuring-detector-thresholds-lemonade-stand"] +=== Configuring detector thresholds + +The guardrails pipeline uses three detectors, each with a configurable detection threshold. For the HAP and prompt injection detectors, lower thresholds increase sensitivity (block more content) while higher thresholds reduce false positives. For the Lingua detector, the threshold is the minimum English-language confidence, so raising it blocks more input and lowering it blocks less. + +The default thresholds are: + +[cols="1,1,2",options="header"] +|=== +| Detector | Default threshold | Description + +| IBM Granite Guardian HAP +| 0.5 +| Hate speech, abuse, and profanity detection + +| DeBERTa v3 Prompt Injection +| 0.5 +| Prompt injection and jailbreak detection + +| Lingua Language +| 0.88 +| English language confidence threshold +|=== + +To adjust detector thresholds, modify the Guardrails Orchestrator configuration in the `fms-orchestr8-config-nlp` ConfigMap within the lemonade-stand-assistant Helm chart. + +[id="configuring-regex-detector-lemonade-stand"] +=== Configuring the regex detector + +The FastAPI application includes a regex-based detector that blocks mentions of competitor fruit names (oranges, apples, bananas, and others) across 13+ languages. This detector runs locally in the application before the request reaches the Guardrails Orchestrator. + +To modify the blocked terms or supported languages, edit the regex patterns in the `app_fastapi.py` file in the lemonade-stand-assistant repository. + +[id="configuring-shiny-dashboard-lemonade-stand"] +=== Adjusting the monitoring dashboard + +The R Shiny dashboard polls the FastAPI application's `/metrics` endpoint to display guardrail activation statistics in real time. 
The default polling interval is 1 second. + +To adjust the refresh interval, modify the `shinyDashboard.metrics.refreshInterval` value in the Helm chart values: + +[source,yaml] +---- +shinyDashboard: + metrics: + refreshInterval: 5 +---- + +Push your changes to your forked repository so the GitOps framework applies the updated configuration. diff --git a/content/patterns/lemonade-stand-quickstart/getting-started.adoc b/content/patterns/lemonade-stand-quickstart/getting-started.adoc new file mode 100644 index 000000000..e5c894a49 --- /dev/null +++ b/content/patterns/lemonade-stand-quickstart/getting-started.adoc @@ -0,0 +1,158 @@ +--- +title: Getting started +weight: 10 +aliases: /lemonade-stand-quickstart/getting-started/ +--- + +:toc: +:imagesdir: /images +:_content-type: ASSEMBLY +include::modules/comm-attributes.adoc[] + +[id="deploying-lemonade-stand-quickstart-pattern"] +== Deploying the Lemonade Stand AI Quickstart pattern + +.Prerequisites + +* An OpenShift cluster (version 4.18 or later). This pattern requires at least 1 NVIDIA GPU node for LLM inference. + ** *AWS*: The pattern automatically provisions 1 `g5.2xlarge` GPU worker node (NVIDIA A10G) during installation. No GPU nodes need to be present before you deploy. + ** *Other providers and bare metal*: A GPU node must already be part of the OpenShift cluster before you deploy this pattern. The pattern installs all required operators automatically. + ** To create an OpenShift cluster, go to the https://console.redhat.com/[Red Hat Hybrid Cloud console]. + ** Select *OpenShift \-> Red Hat OpenShift Container Platform \-> Create cluster*. +* The Helm binary. For instructions, see link:https://helm.sh/docs/intro/install/[Installing Helm]. +* The `oc` CLI tool. For instructions, see link:https://docs.openshift.com/container-platform/latest/cli_reference/openshift_cli/getting-started-cli.html[Getting started with the OpenShift CLI]. +* Additional installation tool dependencies. 
For details, see link:https://validatedpatterns.io/learn/quickstart/[Patterns quick start]. + +[id="preparing-for-deployment-lemonade-stand"] +== Preparing for deployment +.Procedure + +. Fork the link:https://github.com/validatedpatterns-sandbox/ai-quickstart-lemonade-stand[ai-quickstart-lemonade-stand] repository on GitHub. You must fork the repository to customize this pattern. + +. Clone the forked copy of this repository. ++ +[source,terminal] +---- +$ git clone git@github.com:your-username/ai-quickstart-lemonade-stand.git +---- + +. Go to the root directory of your Git repository: ++ +[source,terminal] +---- +$ cd ai-quickstart-lemonade-stand +---- + +. Run the following command to set the upstream repository: ++ +[source,terminal] +---- +$ git remote add -f upstream git@github.com:validatedpatterns-sandbox/ai-quickstart-lemonade-stand.git +---- + +. Verify the setup of your remote repositories by running the following command: ++ +[source,terminal] +---- +$ git remote -v +---- ++ +.Example output ++ +[source,terminal] +---- +origin git@github.com:your-username/ai-quickstart-lemonade-stand.git (fetch) +origin git@github.com:your-username/ai-quickstart-lemonade-stand.git (push) +upstream git@github.com:validatedpatterns-sandbox/ai-quickstart-lemonade-stand.git (fetch) +upstream git@github.com:validatedpatterns-sandbox/ai-quickstart-lemonade-stand.git (push) +---- + +. 
Optional: To customize the deployment, create and switch to a new branch by running the following command: ++ +[source,terminal] +---- +$ git checkout -b my-branch +---- ++ +Make your changes, then stage and commit them: ++ +[source,terminal] +---- +$ git add +$ git commit -m "Customize deployment" +---- ++ +Push the changes to your forked repository: ++ +[source,terminal] +---- +$ git push origin my-branch +---- + +[id="deploying-cluster-using-patternsh-file-lemonade-stand"] +== Deploying the pattern by using the pattern.sh file + +To deploy the pattern by using the `pattern.sh` file, complete the following steps: + +. Log in to your cluster by following this procedure: + +.. Obtain an API token by visiting link:https://oauth-openshift.apps../oauth/token/request[https://oauth-openshift.apps../oauth/token/request]. + +.. Log in to the cluster by running the following command: ++ +[source,terminal] +---- +$ oc login --token= --server=https://api..:6443 +---- ++ +Or log in by running the following command: ++ +[source,terminal] +---- +$ export KUBECONFIG=~/ +---- + +. Deploy the pattern to your cluster. Run the following command: ++ +[source,terminal] +---- +$ ./pattern.sh make install +---- + +.Verification + +To verify a successful installation, check the health of the ArgoCD applications: + +. Run the following command: ++ +[source,terminal] +---- +$ ./pattern.sh make argo-healthcheck +---- ++ +It might take several minutes for all applications to synchronize and reach a healthy state. This includes downloading detector models, initializing the GPU operator, and starting the vLLM inference service. + +. Verify that the Operators are installed by navigating to *Operators -> Installed Operators* in the {ocp} web console. Confirm the following Operators are present: ++ +* NVIDIA GPU Operator +* {rhoai} +* Node Feature Discovery Operator +* External Secrets Operator + +. 
After all applications are healthy, verify the inference service is serving by running: ++ +[source,terminal] +---- +$ oc get inferenceservice -A +---- + +. Access the Lemonade Stand chatbot UI. Navigate to *Networking -> Routes* in the `lemonade-stand` namespace and open the route URL for the `lemonade-stand` service. + +. Access the R Shiny monitoring dashboard. Navigate to *Networking -> Routes* in the `lemonade-stand` namespace and open the route URL for the `shiny-dashboard` service. + +[id="next-steps-getting-started-lemonade-stand"] +== Next steps + +* link:customizing-this-pattern[Customizing this pattern] +* link:cluster-sizing[Cluster sizing] +* link:troubleshooting[Troubleshooting] diff --git a/content/patterns/lemonade-stand-quickstart/troubleshooting.adoc b/content/patterns/lemonade-stand-quickstart/troubleshooting.adoc new file mode 100644 index 000000000..3d29d1a3e --- /dev/null +++ b/content/patterns/lemonade-stand-quickstart/troubleshooting.adoc @@ -0,0 +1,284 @@ +--- +title: Troubleshooting +weight: 40 +aliases: /lemonade-stand-quickstart/troubleshooting/ +--- + +:toc: +:imagesdir: /images +:_content-type: ASSEMBLY +include::modules/comm-attributes.adoc[] + +[id="troubleshooting-lemonade-stand-quickstart"] +== Troubleshooting the Lemonade Stand AI Quickstart pattern + +Use this page to diagnose and resolve common issues when deploying or operating this pattern. + +[id="troubleshooting-prereqs-lemonade-stand"] +== Prerequisite and tooling issues + +[id="troubleshooting-podman-version-lemonade-stand"] +=== Podman version not supported + +The `pattern.sh` script requires Podman 4.3.0 or later. Earlier versions do not support the `--userns=keep-id` flag required for correct UID/GID mapping inside the container. + +.Symptom + +The script exits with an error referencing the Podman version or `keep-id`. + +.Resolution + +. Check your Podman version: ++ +[source,terminal] +---- +$ podman --version +---- + +. 
If the version is earlier than 4.3.0, upgrade Podman. For instructions, see the link:https://podman.io/docs/installation[Podman installation documentation]. + +[id="troubleshooting-kubeconfig-lemonade-stand"] +=== KUBECONFIG path is outside the HOME directory + +The `pattern.sh` script runs inside a container and mounts your `$HOME` directory. If your `KUBECONFIG` file is located outside `$HOME`, the container cannot access it. + +.Symptom + +The script fails to connect to the cluster or reports that the kubeconfig file cannot be found. + +.Resolution + +Move your kubeconfig file to a path inside your home directory and export the updated path: + +[source,terminal] +---- +$ cp ~/kubeconfig +$ export KUBECONFIG=~/kubeconfig +---- + +[id="troubleshooting-deployment-lemonade-stand"] +== Deployment issues + +[id="troubleshooting-argocd-sync-lemonade-stand"] +=== ArgoCD applications are not syncing or are unhealthy + +After running `./pattern.sh make install`, ArgoCD applications can take 15–30 minutes to reach a healthy state. Model downloads and GPU operator initialization take additional time. + +.Symptom + +Running `./pattern.sh make argo-healthcheck` reports applications in `Progressing` or `Degraded` state. + +.Resolution + +. Check which applications are not healthy: ++ +[source,terminal] +---- +$ oc get applications -n openshift-gitops +---- + +. Inspect the failing application for error details: ++ +[source,terminal] +---- +$ oc describe application -n openshift-gitops +---- + +. Check the logs of the ArgoCD application controller: ++ +[source,terminal] +---- +$ oc logs -n openshift-gitops deployment/openshift-gitops-application-controller +---- + +. If applications are stuck in `Progressing`, wait an additional 10 minutes and re-run the health check. Detector model downloads from Hugging Face via MinIO and GPU operator initialization can take significant time. 
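The health check in the steps above can also be scripted. The following is a minimal sketch, assuming the `oc` CLI is logged in and that the Argo CD `Application` resources report `status.health.status` and `status.sync.status`. The function names `unhealthy_apps` and `fetch_apps` are hypothetical helpers for illustration and are not part of the pattern.

```python
import json
import subprocess


def unhealthy_apps(app_list: dict) -> list[tuple[str, str, str]]:
    """Return (name, health, sync) for every application that is not Healthy and Synced."""
    result = []
    for item in app_list.get("items", []):
        status = item.get("status", {})
        # Applications that have not reported a status yet are treated as Unknown.
        health = status.get("health", {}).get("status", "Unknown")
        sync = status.get("sync", {}).get("status", "Unknown")
        if health != "Healthy" or sync != "Synced":
            result.append((item["metadata"]["name"], health, sync))
    return result


def fetch_apps(namespace: str = "openshift-gitops") -> dict:
    """Read the application list as JSON, mirroring the `oc get applications` step above."""
    completed = subprocess.run(
        ["oc", "get", "applications", "-n", namespace, "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(completed.stdout)
```

Calling `unhealthy_apps(fetch_apps())` returns an empty list once every application is healthy and synced, which makes it easy to re-run the check in a loop while models download.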
+ +[id="troubleshooting-gpu-lemonade-stand"] +== GPU and inference issues + +[id="troubleshooting-gpu-nodes-lemonade-stand"] +=== GPU nodes are not ready + +The NVIDIA GPU Operator must successfully initialize on the GPU node before model serving can start. + +.Symptom + +The vLLM inference service pod remains in `Pending` state, or `oc get inferenceservice -A` shows the service not ready. + +.Resolution + +. Check the status of GPU nodes: ++ +[source,terminal] +---- +$ oc get nodes -l nvidia.com/gpu.present=true +---- + +. Check the NVIDIA GPU Operator pods: ++ +[source,terminal] +---- +$ oc get pods -n nvidia-gpu-operator +---- + +. Check for driver initialization errors: ++ +[source,terminal] +---- +$ oc logs -n nvidia-gpu-operator -l app=nvidia-driver-daemonset +---- + +. If you are using a provider other than AWS, confirm that a GPU node was present in the cluster before you deployed the pattern. The pattern does not provision GPU nodes on providers other than AWS. + +[id="troubleshooting-inference-lemonade-stand"] +=== Inference endpoint is not serving + +.Symptom + +`oc get inferenceservice -A` shows the inference service in a non-ready state, or the chatbot returns connection errors. + +.Resolution + +. Check the status of the inference service: ++ +[source,terminal] +---- +$ oc get inferenceservice -A +---- + +. Check the vLLM model server pod logs: ++ +[source,terminal] +---- +$ oc logs -n lemonade-stand -l serving.kserve.io/inferenceservice=llm-service +---- + +. Confirm that the GPU node has sufficient available VRAM. The default model configuration is sized for the NVIDIA A10G GPU (24 GB of VRAM) on the `g5.2xlarge` instance; a GPU with less memory might require a smaller model or adjusted vLLM settings. + +[id="troubleshooting-guardrails-lemonade-stand"] +== Guardrails orchestrator issues + +[id="troubleshooting-orchestrator-not-ready"] +=== Guardrails Orchestrator pod is not ready + +The Guardrails Orchestrator depends on all detector models being available and healthy before it can start serving requests. 
+ +.Symptom + +The orchestrator pod is in `CrashLoopBackOff` or `Error` state, or the chatbot returns 503 errors. + +.Resolution + +. Check the status of all pods in the lemonade-stand namespace: ++ +[source,terminal] +---- +$ oc get pods -n lemonade-stand +---- + +. Check the orchestrator pod logs for detector connection errors: ++ +[source,terminal] +---- +$ oc logs -n lemonade-stand -l app=guardrails-orchestrator +---- + +. Verify that all detector services are running: ++ +[source,terminal] +---- +$ oc get inferenceservice -n lemonade-stand +---- + +. If detector models are not ready, check that MinIO has successfully downloaded the model artifacts from Hugging Face: ++ +[source,terminal] +---- +$ oc logs -n lemonade-stand -l app=minio +---- + +[id="troubleshooting-all-blocked"] +=== Guardrails are blocking all requests + +.Symptom + +Every user query is blocked by the guardrails, even when the content appears safe and in English. + +.Resolution + +. Check the R Shiny dashboard to identify which detector is triggering. Navigate to *Networking -> Routes* in the `lemonade-stand` namespace and open the dashboard route. + +. If the Lingua detector is blocking English text, the language confidence threshold may be too high. Review the Lingua threshold in the `fms-orchestr8-config-nlp` ConfigMap. + +. If the HAP or prompt injection detector is triggering on safe content, their detection thresholds may be too aggressive. See link:customizing-this-pattern#configuring-detector-thresholds-lemonade-stand[Configuring detector thresholds]. + +[id="troubleshooting-application-lemonade-stand"] +== Application issues + +[id="troubleshooting-chatbot-ui"] +=== Lemonade Stand chatbot UI is not accessible + +.Symptom + +The chatbot UI route returns a 503 or connection error. + +.Resolution + +. Check that the lemonade-stand pod is running: ++ +[source,terminal] +---- +$ oc get pods -n lemonade-stand -l app=lemonade-stand +---- + +. 
Check the application logs for startup errors: ++ +[source,terminal] +---- +$ oc logs -n lemonade-stand -l app=lemonade-stand +---- + +. Verify the route is correctly configured: ++ +[source,terminal] +---- +$ oc get routes -n lemonade-stand +---- + +[id="troubleshooting-shiny-dashboard"] +=== R Shiny dashboard shows no data + +.Symptom + +The dashboard loads but shows zero values for all metrics, or displays errors. + +.Resolution + +. Confirm that the lemonade-stand application is running and the `/metrics` endpoint is accessible: ++ +[source,terminal] +---- +$ oc exec -n lemonade-stand deployment/shiny-dashboard -- curl -s http://lemonade-stand:8080/metrics +---- + +. Check the Shiny dashboard pod logs: ++ +[source,terminal] +---- +$ oc logs -n lemonade-stand -l app=shiny-dashboard +---- + +. Verify that the `shinyDashboard.metrics.url` in the Helm chart values points to the correct metrics endpoint. + +[id="troubleshooting-get-help-lemonade-stand"] +== Getting help + +If you cannot resolve an issue using this guide: + +* Check the link:https://github.com/validatedpatterns-sandbox/ai-quickstart-lemonade-stand/issues[GitHub issues] for known problems and workarounds. +* Open a new issue with the output of the following command to help diagnose the problem: ++ +[source,terminal] +---- +$ oc get pods -A | grep -v Running | grep -v Completed +---- diff --git a/modules/ai-quickstart-lemonade-stand/metadata-ai-quickstart-lemonade-stand.adoc b/modules/ai-quickstart-lemonade-stand/metadata-ai-quickstart-lemonade-stand.adoc new file mode 100644 index 000000000..b8db8310a --- /dev/null +++ b/modules/ai-quickstart-lemonade-stand/metadata-ai-quickstart-lemonade-stand.adoc @@ -0,0 +1,21 @@ +// This file defines cluster sizing attributes for the Lemonade Stand AI Quickstart pattern. +// These values are defined manually based on tested configurations. 
+:metadata_version: 1.0 +:name: ai-quickstart-lemonade-stand +:pattern_version: 1.0 +:description: Deploy an AI chatbot with multi-layered safety guardrails, TrustyAI orchestration, and real-time monitoring. +:display_name: Lemonade Stand AI Quickstart +:repo_url: https://github.com/validatedpatterns-sandbox/ai-quickstart-lemonade-stand +:docs_repo_url: https://github.com/validatedpatterns/docs +:issues_url: https://github.com/validatedpatterns-sandbox/ai-quickstart-lemonade-stand/issues +:docs_url: https://validatedpatterns.io/patterns/lemonade-stand-quickstart/ +:ci_url: https://validatedpatterns.io/ci/?pattern=lemonade-stand-quickstart +:tier: sandbox +:owners: dminnear-rh +:requirements_hub_controlPlane_platform_aws_replicas: 3 +:requirements_hub_controlPlane_platform_aws_type: m5.xlarge +:requirements_hub_compute_platform_aws_replicas: 3 +:requirements_hub_compute_platform_aws_type: m5.4xlarge +:extra_features_hypershift_support: false +:extra_features_spoke_support: false +:external_requirements: diff --git a/modules/lemonade-stand-quickstart-about.adoc b/modules/lemonade-stand-quickstart-about.adoc new file mode 100644 index 000000000..a21a9c038 --- /dev/null +++ b/modules/lemonade-stand-quickstart-about.adoc @@ -0,0 +1,79 @@ +:_content-type: CONCEPT +:imagesdir: ../../images +include::comm-attributes.adoc[] + +[id="about-lemonade-stand-quickstart"] += About the Lemonade Stand AI Quickstart pattern + +Deploy an AI chatbot with multi-layered safety guardrails on OpenShift to demonstrate content filtering, prompt injection defense, language enforcement, and real-time guardrail monitoring. + +Use case:: + +* Deploy a guardrailed AI chatbot that filters harmful content, detects prompt injection attempts, enforces language constraints, and blocks competitor mentions through multiple safety layers. 
+* Explore multi-layered AI safety techniques including model-based detectors (HAP, prompt injection), rule-based detectors (language enforcement), and regex-based filtering. +* Use a GitOps approach to provision AI guardrails infrastructure including GPU-accelerated model serving, TrustyAI orchestration, and monitoring. + +Background:: + +This pattern builds on the link:https://github.com/rh-ai-quickstart/lemonade-stand-assistant[Lemonade Stand Assistant AI Quickstart]. It provisions the OpenShift cluster with link:https://www.redhat.com/en/products/ai/openshift-ai[{rhoai}] configured for GPU-accelerated inference using vLLM. It deploys the NVIDIA GPU Operator for model serving on GPU nodes and manages secrets through the {solution-name-upstream} framework using HashiCorp Vault and the External Secrets Operator. This pattern generalizes one or more successful deployments of this use case. Implementation details might vary depending on your specific environment and requirements. + +Organizations can use the Lemonade Stand AI Quickstart to learn how to implement AI safety guardrails as a multi-layered pipeline. 
 It demonstrates a production-ready approach to: + +- Serving a language model (Llama 3.2 3B Instruct) with GPU-accelerated inference through vLLM on KServe +- Orchestrating multiple safety detectors through TrustyAI Guardrails Orchestrator (FMS Orchestr8) to validate both input and output +- Detecting hate speech, abuse, and profanity with the IBM Granite Guardian HAP 125M model +- Identifying prompt injection and jailbreak attempts with the DeBERTa v3 Base model +- Enforcing language constraints with the Lingua rule-based detector +- Blocking competitor product mentions with regex-based pre-filtering in 13+ languages +- Monitoring guardrail activation rates in real time through an R Shiny dashboard + +[id="about-lemonade-stand-quickstart-solution"] +== About the solution + +This pattern deploys a complete guardrailed chatbot on a single OpenShift cluster by using a GitOps approach. The {solution-name-upstream} framework handles infrastructure provisioning, including GPU operators, AI platform configuration, and secrets management. The Lemonade Stand Assistant AI Quickstart delivers the application layer: model serving, guardrails orchestration, safety detectors, and monitoring. + +User queries flow through a multi-stage safety pipeline. The FastAPI application first applies a regex pre-filter to block competitor fruit names across 13+ languages. Queries that pass are forwarded to the TrustyAI Guardrails Orchestrator, which chains three detectors in sequence: Lingua (language enforcement), IBM Granite Guardian HAP 125M (content safety), and DeBERTa v3 Base (prompt injection detection). Only validated queries reach the LLM. The same detectors validate model output before it is streamed back to the user. An R Shiny dashboard provides real-time visibility into guardrail activation rates by polling Prometheus metrics from the FastAPI application. 
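The multi-stage flow described above can be sketched in a few lines of Python. This is an illustrative outline only, not the application's code: the blocklist, the `Detector` type, and the functions `regex_prefilter`, `run_chain`, and `handle_query` are hypothetical names, and the detectors and LLM are passed in as plain callables rather than real service calls.

```python
import re
from typing import Callable, Optional

# Hypothetical blocklist for illustration; the real patterns cover 13+ languages.
COMPETITOR_PATTERN = re.compile(r"\b(apple|orange|banana)s?\b", re.IGNORECASE)

# A detector returns the name of the violated policy, or None when the text passes.
Detector = Callable[[str], Optional[str]]


def regex_prefilter(text: str) -> Optional[str]:
    """Local pre-filter that runs before any detector is called."""
    return "competitor" if COMPETITOR_PATTERN.search(text) else None


def run_chain(text: str, detectors: list[Detector]) -> Optional[str]:
    """Run detectors in order and stop at the first one that flags the text."""
    for detect in detectors:
        verdict = detect(text)
        if verdict is not None:
            return verdict
    return None


def handle_query(query: str, detectors: list[Detector], llm: Callable[[str], str]) -> str:
    """Apply input guardrails, call the model, then apply the same chain to the output."""
    blocked = regex_prefilter(query) or run_chain(query, detectors)
    if blocked:
        return f"blocked input: {blocked}"
    answer = llm(query)
    blocked = run_chain(answer, detectors)
    if blocked:
        return f"blocked output: {blocked}"
    return answer
```

The sketch short-circuits at the first failing stage, so a query rejected by the regex pre-filter never reaches the detectors or the model.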
+ +[id="about-lemonade-stand-quickstart-technology"] +== About the technology + +This solution uses the following technologies: + +https://www.redhat.com/en/technologies/cloud-computing/openshift/try-it[{rh-ocp}]:: +An enterprise-ready Kubernetes container platform built for an open hybrid cloud strategy. It provides a consistent application platform to manage hybrid cloud, public cloud, and edge deployments. + +https://www.redhat.com/en/technologies/cloud-computing/openshift/try-it[{rh-gitops}]:: +A declarative application continuous delivery tool for Kubernetes based on the ArgoCD project. Application definitions, configurations, and environments are declarative and version controlled in Git. + +https://www.redhat.com/en/technologies/cloud-computing/openshift/openshift-ai[{rhoai}]:: +A flexible, scalable MLOps platform with tools to build, deploy, and manage AI-enabled applications. This pattern uses {rhoai} to manage GPU-accelerated model serving with vLLM via KServe. + +https://www.redhat.com/en/blog/trustyai-openshift-ai[TrustyAI / FMS Guardrails Orchestrator]:: +A guardrails orchestration framework that chains multiple safety detectors to validate both input and output of LLM interactions. This pattern uses the FMS Orchestr8 configuration to pipeline Lingua, HAP, and prompt injection detectors. + +https://docs.vllm.ai/[vLLM]:: +A high-throughput, memory-efficient inference engine for large language models. vLLM serves the Llama 3.2 model with optimized GPU utilization. + +https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct[Llama 3.2 3B Instruct]:: +A compact instruction-tuned language model from Meta. This pattern serves the FP8-quantized variant for efficient GPU inference as the chatbot's conversational backend. + +https://huggingface.co/ibm-granite/granite-guardian-hap-125m[IBM Granite Guardian HAP 125M]:: +A lightweight hate, abuse, and profanity detection model from IBM. 
This pattern uses it to filter harmful content in both user queries and model responses. + +https://huggingface.co/protectai/deberta-v3-base-prompt-injection-v2[DeBERTa v3 Base Prompt Injection]:: +A fine-tuned DeBERTa model for detecting prompt injection and jailbreak attempts. This pattern uses it to identify attempts to override the chatbot's system instructions. + +https://github.com/pemistahl/lingua-py[Lingua]:: +A rule-based natural language detection library. This pattern uses Lingua to enforce English-only communication with the chatbot. + +https://github.com/foundation-model-stack/fms-guardrails-orchestrator[FMS Chunkers]:: +A sentence-level text chunking service that segments user input and model output for individual detector evaluation. + +https://min.io/[MinIO]:: +An S3-compatible object storage service. This pattern uses MinIO to store detector model artifacts that are downloaded from Hugging Face during initialization. + +https://shiny.posit.co/[R Shiny]:: +A web application framework for R. This pattern uses an R Shiny dashboard to visualize guardrail activation metrics in real time, displaying request counts, blocked inputs and outputs, and per-detector activation rates. + +https://prometheus.io/[Prometheus]:: +An open source monitoring and alerting toolkit. This pattern uses Prometheus-format metrics exposed by the FastAPI application to track guardrail activations. diff --git a/modules/lemonade-stand-quickstart-architecture.adoc b/modules/lemonade-stand-quickstart-architecture.adoc new file mode 100644 index 000000000..af49e14b6 --- /dev/null +++ b/modules/lemonade-stand-quickstart-architecture.adoc @@ -0,0 +1,174 @@ +:_content-type: CONCEPT +:imagesdir: ../../images +include::comm-attributes.adoc[] + +[id="lemonade-stand-quickstart-architecture"] += Lemonade Stand AI Quickstart architecture + +The following figure shows the Lemonade Stand AI Quickstart architecture. 
+ +.Lemonade Stand AI Quickstart system architecture +image::lemonade-stand-quickstart/lemonade-stand-architecture.png[Lemonade Stand AI Quickstart Architecture,link="/images/lemonade-stand-quickstart/lemonade-stand-architecture.png"] + +The architecture consists of three main layers: + +* *Guardrails Layer* -- Chains multiple safety detectors through the TrustyAI Guardrails Orchestrator to filter both input and output for content safety, prompt injection, and language compliance. +* *Inference Layer* -- Serves the Llama 3.2 3B Instruct model through vLLM with GPU acceleration on KServe. +* *Application Layer* -- Provides the chatbot UI with regex-based pre-filtering, Prometheus metrics, and a real-time R Shiny monitoring dashboard. + +[id="lemonade-stand-quickstart-guardrails-layer"] +== Guardrails layer + +The guardrails layer validates both user input and model output through a chain of safety detectors: + +TrustyAI Guardrails Orchestrator (FMS Orchestr8):: +Orchestrates multiple detector models in a configurable pipeline. Each detector evaluates the input or output independently, and the orchestrator aggregates results to determine whether to allow or block the content. + +IBM Granite Guardian HAP 125M:: +A lightweight model that detects hate speech, abuse, and profanity. Runs on CPU by default with optional GPU acceleration. Threshold: 0.5. + +DeBERTa v3 Base Prompt Injection Detector:: +A fine-tuned DeBERTa model that identifies prompt injection and jailbreak attempts. Runs on CPU by default with optional GPU acceleration. Threshold: 0.5. + +Lingua Language Detector:: +A rule-based language detection service that enforces English-only communication. Deployed as a Kubernetes Deployment (not KServe). Threshold: 0.88. + +FMS Chunkers:: +A gRPC-based text chunking service that segments input and output into sentences for individual detector evaluation. 
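To make the threshold behavior of the detectors above concrete, the following sketch shows one way per-detector scores could be aggregated into an allow or block decision. The threshold values come from this pattern's defaults, but the comparison directions, the `Rule` class, the detector keys, and the `triggered` function are assumptions for illustration and do not reflect the orchestrator's actual implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    threshold: float
    # True: block when score >= threshold. False: block when score < threshold.
    block_when_above: bool


# Default thresholds from this pattern; the comparison directions are assumed.
RULES = {
    "hap": Rule(0.5, True),
    "prompt_injection": Rule(0.5, True),
    "lingua_english": Rule(0.88, False),  # score is English-language confidence
}


def triggered(scores: dict[str, float], rules: dict[str, Rule] = RULES) -> list[str]:
    """Return the names of detectors whose score crosses its threshold, sorted."""
    hits = []
    for name, score in scores.items():
        rule = rules[name]
        if rule.block_when_above:
            crossed = score >= rule.threshold
        else:
            crossed = score < rule.threshold
        if crossed:
            hits.append(name)
    return sorted(hits)
```

Content is allowed only when `triggered` returns an empty list. Under these assumptions, lowering the HAP threshold blocks more content, while lowering the Lingua threshold blocks less.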
+ +[id="lemonade-stand-quickstart-inference-layer"] +== Inference layer + +The inference layer serves the language model and processes chat requests: + +vLLM Model Server:: +Serves the Llama 3.2 3B Instruct model (FP8-quantized) with GPU acceleration. Managed by {rhoai} as a KServe InferenceService with optimized settings including chunked prefill and 95% GPU memory utilization. + +NVIDIA GPU Operator:: +Manages NVIDIA GPU drivers, device plugins, and monitoring on worker nodes. Ensures GPUs are configured and available for model serving workloads. + +[id="lemonade-stand-quickstart-application-layer"] +== Application layer + +The application layer provides the user interface, pre-filtering, and monitoring: + +Lemonade Stand FastAPI Application:: +A Python FastAPI application that serves the chatbot UI on port 8080. It implements a local regex-based detector that blocks competitor fruit names across 13+ languages before forwarding queries to the Guardrails Orchestrator. It also exposes a `/metrics` endpoint with Prometheus-format guardrail activation metrics and streams LLM responses to the browser via Server-Sent Events (SSE). + +R Shiny Monitoring Dashboard:: +A real-time monitoring dashboard that visualizes guardrail activation metrics. It polls the FastAPI `/metrics` endpoint and displays total request counts, blocked input and output counts, approved requests, and per-detector activation breakdowns. + +MinIO:: +An S3-compatible object storage service that stores detector model artifacts. Models are downloaded from Hugging Face during initialization and served to the KServe detector runtimes. + +[id="lemonade-stand-quickstart-data-flow"] +== Data flow + +The following describes the request flow through the guardrails pipeline: + +. The user sends a message through the chatbot UI. +. The FastAPI application applies the regex pre-filter to check for competitor fruit names in 13+ languages. If detected, the request is blocked immediately. +. 
The application forwards the query to the TrustyAI Guardrails Orchestrator over HTTPS. +. The Guardrails Orchestrator chains the detectors in sequence: +.. *Lingua* checks that the input is in English. +.. *IBM Granite Guardian HAP* checks for hate speech, abuse, and profanity. +.. *DeBERTa v3 Prompt Injection* checks for jailbreak attempts. +. If all detectors pass, the query is forwarded to the vLLM model server. +. The LLM generates a response, which is validated by the same detector chain (output guardrails). +. The validated response is streamed back to the user via SSE. +. The FastAPI application records Prometheus metrics for each detector activation. +. The R Shiny dashboard polls the `/metrics` endpoint to display real-time guardrail statistics. + +[id="lemonade-stand-quickstart-deployment"] +== Deployment architecture + +The following table describes the pod structure when you deploy on OpenShift: + +[cols="1,2,3",options="header"] +|=== +| Pod | Purpose | Characteristics + +| Lemonade Stand App +| Chatbot UI and API +| FastAPI on port 8080, regex pre-filter for competitor fruits, Prometheus metrics endpoint, SSE response streaming + +| Guardrails Orchestrator +| Safety pipeline orchestration +| TrustyAI / FMS Orchestr8 on port 8032, chains detector models in sequence for input and output validation + +| vLLM Model Server +| LLM inference +| Llama 3.2 3B Instruct (FP8), GPU-accelerated, KServe InferenceService managed by {rhoai} + +| HAP Detector +| Content safety +| IBM Granite Guardian HAP 125M, CPU or optional GPU, detects hate speech, abuse, and profanity + +| Prompt Injection Detector +| Jailbreak defense +| DeBERTa v3 Base, CPU or optional GPU, detects prompt injection and manipulation attempts + +| Lingua Detector +| Language enforcement +| Rule-based English-only detection, lightweight CPU deployment + +| Chunker Service +| Text segmentation +| FMS Chunkers, gRPC on port 8085, sentence-level splitting for detector input + +| MinIO +| Model storage 
+| S3-compatible object storage, stores detector model artifacts downloaded from Hugging Face + +| R Shiny Dashboard +| Monitoring +| Real-time guardrail activation visualization, polls Prometheus metrics from the FastAPI app + +| Vault +| Secrets management +| Stores vLLM API key and other credentials, synced to the cluster by the External Secrets Operator +|=== + +[id="lemonade-stand-quickstart-technologies"] +== Implementation technologies + +[cols="1,2",options="header"] +|=== +| Component | Technology + +| Application Framework +| FastAPI (Python) + +| LLM Service +| vLLM with meta-llama/Llama-3.2-3B-Instruct + +| Guardrails Orchestration +| TrustyAI / FMS Orchestr8 + +| Content Safety +| IBM Granite Guardian HAP 125M + +| Prompt Injection Detection +| DeBERTa v3 Base + +| Language Detection +| Lingua + +| Text Chunking +| FMS Chunkers + +| Container Orchestration +| {rh-ocp} + {rhoai} + +| GPU Management +| NVIDIA GPU Operator + +| Object Storage +| MinIO (S3-compatible) + +| Monitoring +| R Shiny + Prometheus + +| Secrets Management +| HashiCorp Vault + External Secrets Operator +|=== diff --git a/static/images/lemonade-stand-quickstart/lemonade-stand-architecture.png b/static/images/lemonade-stand-quickstart/lemonade-stand-architecture.png new file mode 100644 index 000000000..5c88d2f58 Binary files /dev/null and b/static/images/lemonade-stand-quickstart/lemonade-stand-architecture.png differ