Skip to content

Latest commit

 

History

History
294 lines (191 loc) · 9.98 KB

File metadata and controls

294 lines (191 loc) · 9.98 KB

📖 Glossary - Speaking the Language of Neural Orchestration

Every expert was once a beginner who refused to give up on understanding

A

API Gateway A service that acts as an entry point for multiple backend services, handling routing, authentication, rate limiting, and other cross-cutting concerns.

Auto-initialization The automatic setup and configuration of the LLM router when the class is instantiated, eliminating the need for manual initialization calls.

Async Generator A JavaScript function that can be paused and resumed, yielding values asynchronously. Used extensively for streaming token generation.

B

Backpressure A mechanism to handle situations where data is produced faster than it can be consumed, preventing memory overflow in streaming scenarios.

Batch Processing The technique of processing multiple requests together to improve throughput and efficiency, especially useful for inference operations.

GGUF (GPT-Generated Unified Format) A binary format for storing large language models, optimized for fast loading and efficient memory usage. Successor to GGML format.

C

Cache Hit Rate The percentage of requests that can be served from cache rather than requiring new computation. Higher rates indicate better performance.

Circuit Breaker A design pattern that prevents cascading failures by automatically stopping requests to a failing service until it recovers.

Compute Shader A type of shader program that runs on the GPU for general-purpose parallel computing, used in WebGPU inference.

Context Window The maximum number of tokens a model can process in a single request, determining how much text it can "remember" during generation.

D

Dynamic Quantization The process of automatically reducing model precision (e.g., from 32-bit to 8-bit) at runtime based on available resources.

Directed Acyclic Graph (DAG) A graph structure with no cycles, used to represent dependencies between models and processing steps.

E

Engine A runtime environment that executes model inference, such as WebGPU for GPU acceleration or WASM for CPU processing.

Ensemble A technique that combines outputs from multiple models to improve accuracy and robustness of predictions.

Event Loop The JavaScript runtime mechanism that handles asynchronous operations, crucial for non-blocking model inference.

F

Fallback Chain A sequence of alternative models or engines to try if the primary choice fails, ensuring reliability.

First Token Latency The time between sending a request and receiving the first generated token, a critical performance metric for user experience.

G

GPU (Graphics Processing Unit) Specialized hardware optimized for parallel processing, increasingly used for AI model inference acceleration.

gRPC A high-performance, cross-platform RPC framework used for efficient communication between services.

H

Hardware Abstraction Layer A software layer that provides a consistent interface to different hardware platforms, allowing the same code to run on various systems.

HuggingFace A popular platform and library ecosystem for machine learning models, transformers, and datasets.

I

Inference The process of using a trained model to generate predictions or responses based on input data.

Interleaving A technique for processing multiple requests simultaneously by switching between them, improving overall throughput.

J

JSON Schema A specification for validating the structure of JSON data, used extensively for configuration validation.

JWT (JSON Web Token) A compact, URL-safe means of representing claims to be transferred between parties, commonly used for authentication.

L

Load Balancer A component that distributes incoming requests across multiple model instances or servers to optimize performance.

LLM (Large Language Model) A type of artificial intelligence model trained on vast amounts of text data to understand and generate human-like text.

LRU (Least Recently Used) A cache eviction strategy that removes the least recently accessed items when the cache reaches capacity.

M

Memory Mapping (mmap) A technique that maps files directly into memory, allowing efficient access to large model files without loading everything into RAM.

Model Registry A centralized repository for storing, versioning, and managing machine learning models and their metadata.

Multimodal Referring to models that can process multiple types of input (text, images, audio) or generate multiple types of output.

N

Neural Architecture Search (NAS) An automated method for designing optimal neural network architectures for specific tasks.

Node.js A JavaScript runtime built on Chrome's V8 JavaScript engine, enabling server-side JavaScript execution.

O

ONNX (Open Neural Network Exchange) An open-source format for representing machine learning models, enabling interoperability between different frameworks.

Orchestration The automated configuration, coordination, and management of multiple models and services.

P

Pipeline A sequence of data processing steps, each performed by different components, that transform input into desired output.

Prompt Engineering The practice of crafting input text to guide language models toward producing desired outputs.

Pruning A technique for reducing model size by removing unnecessary parameters or connections while maintaining performance.

Q

Quality Score A metric that evaluates the quality of model outputs based on various criteria like coherence, relevance, and accuracy.

Quantization The process of reducing the numerical precision of model parameters to decrease memory usage and increase inference speed.

Queue A data structure that manages requests in a first-in-first-out manner, essential for handling concurrent requests.

R

Rate Limiting A technique to control the number of requests a user or system can make within a specified time period.

Registry A centralized service for storing and retrieving models, configurations, and metadata.

Routing Strategy An algorithm that determines which model to use for a given request based on factors like quality, cost, and speed.

S

Safetensors A secure format for storing neural network parameters that prevents arbitrary code execution vulnerabilities.

Sandboxing A security mechanism that isolates running processes to prevent them from affecting the rest of the system.

SIMD (Single Instruction, Multiple Data) A parallel computing technique where the same operation is performed on multiple data points simultaneously.

Streaming A technique for processing and delivering data in small chunks over time, rather than waiting for complete processing.

T

Tensor A multi-dimensional array used to represent data in machine learning models, generalizing scalars, vectors, and matrices.

Thread Pool A collection of worker threads that can execute tasks concurrently, improving application performance.

Tokenization The process of converting text into smaller units (tokens) that can be processed by language models.

TTL (Time To Live) The amount of time data should remain in cache before being considered stale and potentially evicted.

U

UUID (Universally Unique Identifier) A 128-bit value used to uniquely identify resources in distributed systems.

Uptime The amount of time a system has been running and available, typically expressed as a percentage.

V

Vectorization The process of converting data into numerical vectors that can be efficiently processed by mathematical operations.

Virtual Memory A memory management technique that gives applications the illusion of having more memory than physically available.

W

WASM (WebAssembly) A binary instruction format that enables high-performance execution of code in web browsers and other environments.

WebGPU A modern web standard that provides access to GPU capabilities for high-performance graphics and compute operations.

WebRTC A collection of standards and technologies for real-time peer-to-peer communication in web browsers.

Worker Thread A separate execution context that runs in parallel with the main thread, enabling concurrent processing.

Common Acronyms

Acronym Full Form Context
AI Artificial Intelligence General field
API Application Programming Interface Software integration
CPU Central Processing Unit Hardware
GPU Graphics Processing Unit Hardware acceleration
HTTP HyperText Transfer Protocol Web communication
JSON JavaScript Object Notation Data format
ML Machine Learning AI subset
NLP Natural Language Processing AI for language
REST Representational State Transfer API architecture
SLA Service Level Agreement Performance guarantees
TLS Transport Layer Security Encryption
URL Uniform Resource Locator Web addresses
UUID Universally Unique Identifier Resource identification
WASM WebAssembly Binary format

Performance Terms

Batching Processing multiple inputs together to improve throughput.

Latency The time between request and first response.

Throughput The number of requests processed per unit of time.

P50, P95, P99 Percentile measurements (50th, 95th, 99th) of response times.

QPS Queries Per Second - a measure of system throughput.

RPS Requests Per Second - similar to QPS.

Security Terms

CORS Cross-Origin Resource Sharing - browser security feature.

CSP Content Security Policy - browser security mechanism.

RBAC Role-Based Access Control - authorization model.

OWASP Open Web Application Security Project - security standards.

CSRF Cross-Site Request Forgery - type of web attack.

XSS Cross-Site Scripting - type of web vulnerability.


"The beginning of wisdom is the definition of terms" - Socrates

Understanding the language is the first step to mastering the craft 📚

Built with 💙 by Echo AI Systems