This document outlines the architectural decisions and technical specifications for the SheikhLM family of models.
SheikhLM is designed for efficiency, speed, and deployment in resource-constrained environments. The architecture incorporates modern best practices from the Llama and Mistral families while maintaining a compact footprint.
- Tokenizer: Byte-Pair Encoding (BPE) with a target vocabulary size of 32,000.
- Activation Function: SwiGLU (SiLU/Swish combined with a gated linear unit) in the feed-forward layers. SwiGLU has been shown to outperform standard GELU feed-forward layers on most language-modeling benchmarks.
- Normalization: RMSNorm (Root Mean Square Layer Normalization), applied pre-norm before the attention and MLP sub-layers of each transformer block. RMSNorm is cheaper to compute than standard LayerNorm because it only rescales by the root mean square, with no mean-centering and no bias term.
- Positional Embeddings: RoPE (Rotary Positional Embeddings), which encode position by rotating query/key feature pairs so attention scores depend on relative offsets. RoPE generally extrapolates to longer sequences better than learned absolute embeddings and is the de facto standard in current open models (a minimal sketch of these components follows this list).
- Attention: Standard Multi-Head Attention (MHA) for all variants.
- Embeddings: Tied Embeddings (Weight tying between input and output embeddings) to reduce the total parameter count, which is particularly beneficial for smaller models.
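To make the list above concrete, the following is a minimal PyTorch sketch of the three core components (RMSNorm, the SwiGLU feed-forward block, and RoPE). Class names, the `eps` value, and tensor layouts are illustrative assumptions for this document, not excerpts from the SheikhLM implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Root Mean Square LayerNorm: rescales by the RMS of the activations
    (no mean-centering, no bias), which makes it cheaper than LayerNorm."""
    def __init__(self, hidden_size: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x * rms)

class SwiGLUMLP(nn.Module):
    """SwiGLU feed-forward: down(SiLU(gate(x)) * up(x)).
    Three weight matrices, hence 3 * hidden * intermediate parameters."""
    def __init__(self, hidden_size: int, intermediate_size: int):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))

def apply_rope(q: torch.Tensor, k: torch.Tensor, positions: torch.Tensor,
               base: float = 10000.0):
    """Rotary positional embeddings: rotate query/key feature pairs by a
    position-dependent angle so attention scores depend on relative offsets.
    q, k: (batch, heads, seq, head_dim); positions: (seq,) token indices."""
    head_dim = q.shape[-1]
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    angles = positions.float()[:, None] * inv_freq[None, :]   # (seq, head_dim / 2)
    cos, sin = angles.cos(), angles.sin()

    def rotate(x: torch.Tensor) -> torch.Tensor:
        x1, x2 = x[..., 0::2], x[..., 1::2]
        return torch.stack((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1).flatten(-2)

    return rotate(q), rotate(k)
```

In a Llama-style pre-norm block these compose as `x = x + Attn(RMSNorm(x))` followed by `x = x + MLP(RMSNorm(x))`, with `apply_rope` run on the query/key projections inside the attention step.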
| Feature | SheikhLM-135M | SheikhLM-360M | SheikhLM-1.7B |
|---|---|---|---|
| Parameters | ~135M | ~360M | ~1.7B |
| Hidden Size | 768 | 1024 | 2048 |
| Layers | 12 | 24 | 24 |
| Attention Heads | 12 | 16 | 16 |
| Intermediate Size | 2944 | 3072 | 8384 |
| Vocab Size | 32,000 | 32,000 | 32,000 |
| Max Context (tokens) | 2048 | 2048 | 2048 |
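For reference, the table rows can be written down as plain configuration objects. The `SheikhLMConfig` dataclass below is a hypothetical name used purely for illustration; only the numeric values come from the table above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SheikhLMConfig:
    # Fields mirror the rows of the table; the class itself is illustrative.
    hidden_size: int
    num_layers: int
    num_heads: int
    intermediate_size: int
    vocab_size: int = 32_000
    max_context: int = 2_048

CONFIGS = {
    "SheikhLM-135M": SheikhLMConfig(hidden_size=768,  num_layers=12, num_heads=12, intermediate_size=2944),
    "SheikhLM-360M": SheikhLMConfig(hidden_size=1024, num_layers=24, num_heads=16, intermediate_size=3072),
    "SheikhLM-1.7B": SheikhLMConfig(hidden_size=2048, num_layers=24, num_heads=16, intermediate_size=8384),
}
```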
Parameter counts include the token embeddings, every transformer layer (attention, MLP, and both RMSNorms), and the final norm; the tied output head adds no additional parameters.
- SheikhLM-135M: 134,302,464 parameters
- SheikhLM-360M: 359,973,888 parameters
- SheikhLM-1.7B: 1,704,560,640 parameters
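These totals can be reproduced with a short closed-form count. The helper below is a sketch assuming bias-free linear projections, two RMSNorm weights per layer, and a final RMSNorm before the tied output head (the standard Llama-style layout); it is not taken from the SheikhLM codebase.

```python
def count_params(hidden: int, layers: int, intermediate: int, vocab: int = 32_000) -> int:
    """Closed-form parameter count for a Llama-style decoder with tied embeddings."""
    embeddings = vocab * hidden              # input embeddings, shared with the LM head
    attention = 4 * hidden * hidden          # q, k, v, o projections (no bias)
    mlp = 3 * hidden * intermediate          # gate, up, down projections (no bias)
    norms = 2 * hidden                       # pre-attention + pre-MLP RMSNorm weights
    per_layer = attention + mlp + norms
    return embeddings + layers * per_layer + hidden  # trailing term: final RMSNorm

assert count_params(768, 12, 2944) == 134_302_464        # SheikhLM-135M
assert count_params(1024, 24, 3072) == 359_973_888       # SheikhLM-360M
assert count_params(2048, 24, 8384) == 1_704_560_640     # SheikhLM-1.7B
```

Under this count, the shared embedding matrix alone is about 18% of the 135M model, which is why weight tying is most valuable at the smallest scale.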
