
# SheikhLM Architecture Specifications

This document outlines the architectural decisions and technical specifications for the SheikhLM family of models.

## Design Philosophy

SheikhLM is designed for efficiency, speed, and deployment in resource-constrained environments. The architecture incorporates modern best practices from the Llama and Mistral families while maintaining a compact footprint.

## Core Architectural Components

- **Tokenizer:** Byte-Pair Encoding (BPE) with a target vocabulary size of 32,000.
- **Activation Function:** SwiGLU (using SiLU/Swish), which has been shown to outperform standard GELU on most language-modeling benchmarks.
- **Normalization:** RMSNorm (Root Mean Square Layer Normalization) applied before each transformer block (pre-normalization). RMSNorm skips the mean subtraction and bias of standard LayerNorm, making it computationally cheaper.
- **Positional Embeddings:** RoPE (Rotary Positional Embeddings), the current industry standard, which extrapolates better to longer sequence lengths than learned absolute embeddings.
- **Attention:** Standard Multi-Head Attention (MHA) for all variants.
- **Embeddings:** Tied embeddings (weight tying between the input embedding matrix and the output head) to reduce the total parameter count, which is particularly beneficial for smaller models (see the sketch below).
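
A minimal PyTorch sketch of these pieces, assuming a Llama-style pre-norm stack. The names here (`RMSNorm`, `SwiGLUMLP`, `apply_rope`) are illustrative, not SheikhLM's actual implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Scale features by their root mean square; no mean subtraction or
    bias term, which is what makes this cheaper than standard LayerNorm."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * x * rms

class SwiGLUMLP(nn.Module):
    """SwiGLU feed-forward: down(SiLU(gate(x)) * up(x)). Three projections
    rather than the two of a plain GELU MLP."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.gate_proj = nn.Linear(dim, hidden_dim, bias=False)
        self.up_proj = nn.Linear(dim, hidden_dim, bias=False)
        self.down_proj = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))

def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotary embeddings on a (batch, heads, seq, head_dim) tensor: rotate
    each even/odd channel pair by an angle that grows with position."""
    *_, seq_len, head_dim = x.shape
    freqs = torch.arange(0, head_dim, 2, device=x.device, dtype=torch.float32)
    inv_freq = base ** (-freqs / head_dim)                    # (head_dim/2,)
    angles = torch.arange(seq_len, device=x.device).float()[:, None] * inv_freq
    cos, sin = angles.cos(), angles.sin()                     # (seq, head_dim/2)
    x1, x2 = x[..., 0::2], x[..., 1::2]                       # channel pairs
    out = torch.stack((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1)
    return out.flatten(-2)

# Tied embeddings: the output head reuses the input embedding matrix,
# saving vocab_size * hidden_size parameters.
embed = nn.Embedding(32_000, 768)
lm_head = nn.Linear(768, 32_000, bias=False)
lm_head.weight = embed.weight
```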

## Model Variants

| Feature | SheikhLM-135M | SheikhLM-360M | SheikhLM-1.7B |
| --- | --- | --- | --- |
| Parameters | ~135M | ~360M | ~1.7B |
| Hidden Size | 768 | 1024 | 2048 |
| Layers | 12 | 24 | 24 |
| Attention Heads | 12 | 16 | 16 |
| Intermediate Size | 2944 | 3072 | 8384 |
| Vocab Size | 32,000 | 32,000 | 32,000 |
| Max Context | 2048 | 2048 | 2048 |
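
The same table as a hypothetical Python config (the `SheikhLMConfig` name and field layout are assumptions, not the repo's actual API). One detail the table implies but does not state: the per-head dimension `hidden_size / num_heads` is 64 for the two smaller variants and 128 for the 1.7B.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SheikhLMConfig:
    hidden_size: int
    num_layers: int
    num_heads: int
    intermediate_size: int
    vocab_size: int = 32_000
    max_context: int = 2_048

    @property
    def head_dim(self) -> int:
        # 768/12 = 64, 1024/16 = 64, 2048/16 = 128
        return self.hidden_size // self.num_heads

CONFIGS = {
    "SheikhLM-135M": SheikhLMConfig(768, 12, 12, 2944),
    "SheikhLM-360M": SheikhLMConfig(1024, 24, 16, 3072),
    "SheikhLM-1.7B": SheikhLMConfig(2048, 24, 16, 8384),
}
```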

## Parameter Calculation Verification

Parameters are counted including the embeddings, all transformer layers (attention + MLP + norms), and the tied output head, which adds no parameters of its own. The totals below match a bias-free, Llama-style count: `vocab_size * h` for the embeddings, plus `4h^2` (attention) `+ 3 * h * d_ff` (SwiGLU MLP) `+ 2h` (norms) per layer, plus `h` for the final norm; a short verification script follows the list.

- **SheikhLM-135M:** 134,302,464 parameters
- **SheikhLM-360M:** 359,973,888 parameters
- **SheikhLM-1.7B:** 1,704,560,640 parameters
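
A quick sanity check of these totals, using the bias-free count described above (`count_params` is a hypothetical helper, not part of the repo):

```python
def count_params(hidden: int, layers: int, intermediate: int, vocab: int = 32_000) -> int:
    embeddings = vocab * hidden              # input matrix, reused by the tied head
    attention = 4 * hidden * hidden          # q, k, v, o projections, no bias
    mlp = 3 * hidden * intermediate          # gate, up, down projections of SwiGLU
    norms = 2 * hidden                       # two RMSNorms per layer
    return embeddings + layers * (attention + mlp + norms) + hidden  # + final norm

for name, (h, n_layers, d_ff) in {
    "SheikhLM-135M": (768, 12, 2944),
    "SheikhLM-360M": (1024, 24, 3072),
    "SheikhLM-1.7B": (2048, 24, 8384),
}.items():
    print(f"{name}: {count_params(h, n_layers, d_ff):,}")

# SheikhLM-135M: 134,302,464
# SheikhLM-360M: 359,973,888
# SheikhLM-1.7B: 1,704,560,640
```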