diff --git a/docs/machine-learning/data-engineering-basics/data-collection/apis.mdx b/docs/machine-learning/data-engineering-basics/data-collection/apis.mdx
index e69de29..05a4d6b 100644
--- a/docs/machine-learning/data-engineering-basics/data-collection/apis.mdx
+++ b/docs/machine-learning/data-engineering-basics/data-collection/apis.mdx
@@ -0,0 +1,125 @@
---
title: Mastering APIs for Data Collection
sidebar_label: APIs
description: "A deep dive into REST and GraphQL APIs: how to fetch, authenticate, and process external data for machine learning."
tags: [apis, rest, graphql, json, data-engineering, python-requests]
---

In the Data Engineering lifecycle, **APIs** are the "clean" way to collect data. Unlike web scraping, which is brittle and unstructured, APIs provide a contract-based method to access data that is versioned, documented, and usually delivered in machine-readable formats like JSON.

## 1. How APIs Work: The Request-Response Cycle

An API acts as a middleman between your ML pipeline and a remote server. You send a **Request** (a specific question) and receive a **Response** (the data answer).

```mermaid
sequenceDiagram
    participant Pipeline as ML Data Pipeline
    participant API as API Gateway
    participant Server as Data Server

    Pipeline->>API: HTTP Request (GET /data)
    Note right of Pipeline: Includes Headers & API Key
    API->>Server: Validate & Route
    Server-->>API: Data Payload
    API-->>Pipeline: HTTP Response (200 OK + JSON)
```

### Components of an API Request

1. **Endpoint (URL):** The address where the data lives (e.g., `api.twitter.com/v2/tweets`).
2. **Method:** What you want to do (`GET` to fetch, `POST` to send).
3. **Headers:** Metadata like your **API Key** or the format you want (`Content-Type: application/json`).
4. **Parameters:** Filters for the data (e.g., `?start_date=2023-01-01`).

## 2. Common API Architectures in ML

### A. REST (Representational State Transfer)

The most common architecture. It treats every piece of data as a "Resource."

* **Best for:** Standardized data fetching.
* **Format:** Almost exclusively **JSON**.

### B. GraphQL

Developed by Meta, it allows the client to define the structure of the data it needs.

* **Advantage in ML:** If a user profile has 100 fields but you only need 3 features for your model, GraphQL prevents "over-fetching," saving bandwidth and memory.

### C. Streaming APIs (WebSockets/gRPC)

Used when data needs to be delivered in real time.

* **ML Use Case:** Algorithmic trading or live social media sentiment monitoring.

## 3. Implementation in Python

The `requests` library is the standard tool for interacting with APIs.

```python
import requests

url = "https://api.example.com/v1/weather"
headers = {
    "Authorization": "Bearer YOUR_TOKEN"  # API key or OAuth token
}
params = {
    "city": "Mandsaur",
    "country": "IN",
    "units": "metric"
}

# A timeout prevents the pipeline from hanging forever on a dead endpoint
response = requests.get(url, headers=headers, params=params, timeout=10)

if response.status_code == 200:
    data = response.json()
    temperature = data["main"]["temp"]   # extract the temperature field
    humidity = data["main"]["humidity"]  # extract the humidity field

    print(f"Temperature in Mandsaur: {temperature}°C")
    print(f"Humidity: {humidity}%")
else:
    print(f"Failed to fetch weather data (status {response.status_code})")
```

## 4. Challenges: Rate Limiting and Status Codes

APIs are not infinite resources. Providers implement **Rate Limiting** to prevent abuse.
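A common defensive pattern is to retry with exponential backoff whenever the server returns a 429, as shown in the table and flowchart below. A minimal sketch (the function name, retry count, and delays are illustrative, not a library API):

```python
import time

import requests

def fetch_with_backoff(url, headers=None, params=None, max_retries=5):
    """GET with exponential backoff on HTTP 429 (Too Many Requests)."""
    delay = 1  # seconds to wait after the first 429
    for _ in range(max_retries):
        response = requests.get(url, headers=headers, params=params, timeout=10)
        if response.status_code != 429:
            response.raise_for_status()  # surface 401/404/5xx immediately
            return response.json()
        time.sleep(delay)
        delay *= 2  # 1s, 2s, 4s, 8s, ...
    raise RuntimeError(f"Still rate-limited after {max_retries} attempts")
```

Doubling the delay after each failure gives the provider breathing room while still letting the pipeline recover automatically.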
The most common status codes, and how a pipeline should react to each:

| Status Code | Meaning | Action for ML Pipeline |
| --- | --- | --- |
| **200** | OK | Process the data. |
| **401** | Unauthorized | Check your API Key/Token. |
| **404** | Not Found | Check your Endpoint URL. |
| **429** | Too Many Requests | **Exponential Backoff:** Wait and try again later. |

```mermaid
flowchart TD
    Req[Send API Request] --> Res{Status Code?}
    Res -- 200 --> Save[Ingest to Database]
    Res -- 429 --> Wait[Wait/Sleep] --> Req
    Res -- 401 --> Fail[Alert Developer]
    style Wait fill:#fff3e0,stroke:#ef6c00,color:#333
```

## 5. Authentication Methods

1. **API Keys:** A simple string passed in the header.
2. **OAuth 2.0:** A more secure, token-based system used by Google, Meta, and Twitter.
3. **JWT (JSON Web Tokens):** Often used in internal microservices.

## References for More Details

* **[REST API Tutorial](https://restfulapi.net/):** Understanding the principles of RESTful design.
* **[Python Requests Guide](https://requests.readthedocs.io/en/latest/):** Mastering HTTP requests for data collection.

---

APIs give us structured data, but sometimes the "front door" is locked. When there is no API, we must use the more aggressive "side window" approach.
\ No newline at end of file
diff --git a/docs/machine-learning/data-engineering-basics/data-collection/data-sources.mdx b/docs/machine-learning/data-engineering-basics/data-collection/data-sources.mdx
index e69de29..ee50c2c 100644
--- a/docs/machine-learning/data-engineering-basics/data-collection/data-sources.mdx
+++ b/docs/machine-learning/data-engineering-basics/data-collection/data-sources.mdx
@@ -0,0 +1,101 @@
---
title: Data Sources in ML
sidebar_label: Data Sources
description: "Identifying and integrating various data sources: from relational databases and APIs to unstructured web data and IoT streams."
tags: [data-engineering, data-sources, sql, nosql, apis, web-scraping]
---

Data is the "fuel" for Machine Learning. However, this fuel is rarely found in one place. As a data engineer, your job is to identify where the raw data lives and how to transport it safely into your environment for processing.

## 1. The Data Source Landscape

We generally categorize data sources by their **structure** and their **storage method**.

```mermaid
graph TD
    Root[Data Sources] --> Structured[Structured]
    Root --> Semi[Semi-Structured]
    Root --> Unstructured[Unstructured]

    Structured --> SQL[Relational DBs: MySQL, Postgres]
    Semi --> Files[JSON, XML, Parquet]
    Unstructured --> Media[Images, Video, Audio, PDF]
```

## 2. Common Data Sources

### A. Relational Databases (SQL)

The most common source for tabular data (customer records, transactions).

* **Query interface:** SQL (Structured Query Language).
* **Pros:** Highly reliable (ACID compliant), easy to join tables.
* **Cons:** Hard to scale horizontally; requires a fixed schema.

### B. NoSQL Databases

Used for high-volume, high-velocity, or non-tabular data.

* **Key-Value Stores:** Redis.
* **Document Stores:** MongoDB (stores data as JSON/BSON).
* **ML Use Case:** Storing user profiles or real-time feature stores.

### C. APIs (Application Programming Interfaces)

Used to pull data from external services like Twitter, Google Maps, or financial market feeds.

* **Format:** Usually **JSON**, typically delivered over a **REST** interface.
* **Challenges:** Rate limiting (you can only pull so much data per hour) and authentication.
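However the data arrives from the sources above, the last step is usually the same: flattening the raw payload into a table for feature engineering. A minimal sketch, assuming a hypothetical endpoint that returns a list of JSON records:

```python
import pandas as pd
import requests

# Hypothetical endpoint returning a list of JSON records
response = requests.get("https://api.example.com/v1/transactions", timeout=10)
records = response.json()

# Flatten nested objects (e.g., {"user": {"id": 1}}) into columns like "user.id"
df = pd.json_normalize(records)
print(df.head())
```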
### D. Cloud Object Storage (The Data Lake)

Services like **AWS S3** or **Google Cloud Storage** act as a dumping ground for raw files before they are processed.

* **ML Use Case:** Storing millions of images for a Computer Vision model.

## 3. Batch vs. Streaming Sources

How the data arrives at your model is just as important as where it comes from.

| Feature | Batch Processing | Stream Processing |
| --- | --- | --- |
| **Source** | Databases, CSV files, Data Lakes | Kafka, Kinesis, IoT Sensors |
| **Frequency** | Hourly, Daily, Weekly | Real-time (Milliseconds) |
| **Use Case** | Training a model on historical sales | Predicting fraud during a transaction |

```mermaid
flowchart LR
    S1[(Database)] -->|Batch| B[ETL Process]
    S2{{IoT Sensor}} -->|Stream| P[Real-time Pipeline]
    B --> DL[Data Lake]
    P --> DL
    style P fill:#fff3e0,stroke:#ef6c00,color:#333
```

## 4. Web Scraping & Crawling

When data isn't available via an API or a database, we use scraping libraries (like `BeautifulSoup` or `Scrapy`) to extract information from raw HTML.

* **Ethics Check:** Always check a site's `robots.txt` before scraping to ensure you are legally and ethically allowed to take the data.

## 5. Identifying High-Quality Sources

Not all data sources are equal. When evaluating a source for an ML project, ask:

1. **Freshness:** How often is this data updated?
2. **Reliability:** Does the source go down often?
3. **Completeness:** How many missing values or gaps does it have?
4. **Granularity:** Is the data at the level we need (e.g., individual transactions vs. daily totals)?

## References for More Details

* **[Google Cloud - Data Source Types](https://cloud.google.com/architecture/data-lifecycle-cloud-platform):** Understanding how cloud providers handle different data types.
* **[MongoDB University](https://university.mongodb.com/):** Learning the difference between Document stores and SQL.

---

Finding the data is only the first step. Once we have access, we need to move it into our systems without losing information or causing bottlenecks.
\ No newline at end of file
diff --git a/docs/machine-learning/data-engineering-basics/data-collection/databases-sql-nosql.mdx b/docs/machine-learning/data-engineering-basics/data-collection/databases-sql-nosql.mdx
index e69de29..dce4917 100644
--- a/docs/machine-learning/data-engineering-basics/data-collection/databases-sql-nosql.mdx
+++ b/docs/machine-learning/data-engineering-basics/data-collection/databases-sql-nosql.mdx
@@ -0,0 +1,88 @@
---
title: "SQL vs. NoSQL for ML"
sidebar_label: SQL & NoSQL
description: "Comparing Relational and Non-Relational databases: choosing the right storage for your machine learning features and labels."
tags: [databases, sql, nosql, data-engineering, postgres, mongodb]
---

Choosing between a **SQL (Relational)** and a **NoSQL (Non-Relational)** database is one of the most critical decisions in a Data Engineering pipeline. In Machine Learning, this choice often depends on whether your data is fixed and structured or evolving and unstructured.

## 1. The Architectural Divide

```mermaid
graph TD
    subgraph SQL ["SQL (Relational)"]
    Table[Tables/Rows] --- Schema[Strict Schema]
    end
    subgraph NoSQL ["NoSQL (Non-Relational)"]
    Doc[Documents/Key-Value] --- Flexible[Dynamic Schema]
    end
    style SQL fill:#e3f2fd,stroke:#1565c0,color:#333
    style NoSQL fill:#f1f8e9,stroke:#33691e,color:#333
```
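To make the divide concrete before we unpack each side, here is the same user record in both worlds: a sketch using Python's built-in `sqlite3` for the relational model and a plain JSON document for the document model (table and field names are illustrative):

```python
import json
import sqlite3

# Relational: a fixed schema with enforced columns, relationships via keys
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, plan TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'Asha', 'pro')")
row = conn.execute("SELECT name, plan FROM users WHERE id = 1").fetchone()
print(row)  # ('Asha', 'pro')

# Document: the schema lives inside the data and can vary per record
user_doc = {
    "id": 1,
    "name": "Asha",
    "plan": "pro",
    "preferences": {"theme": "dark"},  # nested fields need no schema change
}
print(json.dumps(user_doc))
```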
## 2. SQL: Relational Databases

**Examples:** PostgreSQL, MySQL, SQLite, Oracle.

SQL databases store data in rows and columns. They are built on **ACID** properties (Atomicity, Consistency, Isolation, Durability), ensuring that every transaction is processed reliably.

* **Best for:** Structured data where relationships are key (e.g., linking a `User_ID` to `Transactions` and `Product_Details`).
* **Scaling:** Vertical (buying a bigger, more powerful server).
* **ML Use Case:** Serving as the "Source of Truth" for historical training data where data integrity is paramount.

## 3. NoSQL: Non-Relational Databases

**Examples:** MongoDB (Document), Cassandra (Column-family), Redis (Key-Value), Neo4j (Graph).

NoSQL databases are designed for distributed data and high-speed horizontal scaling. They are often **BASE** compliant (Basically Available, Soft state, Eventual consistency).

* **Best for:** Unstructured or semi-structured data (JSON, social media feeds, sensor logs).
* **Scaling:** Horizontal (adding more cheap servers to a cluster).
* **ML Use Cases:**
  * **Feature Stores:** Using Redis for ultra-fast lookup of features during real-time inference.
  * **Unstructured Storage:** Using MongoDB to store raw JSON metadata for NLP tasks.

## 4. Key Differences Comparison

| Feature | SQL | NoSQL |
| --- | --- | --- |
| **Data Model** | Tabular (Rows/Columns) | Document, Key-Value, Graph |
| **Schema** | Fixed (Pre-defined) | Dynamic (On-the-fly) |
| **Joins** | Very efficient (`JOIN`) | Generally avoided (data is denormalized) |
| **Query Language** | Structured Query Language (SQL) | Varies (e.g., MQL for MongoDB) |
| **Consistency Model** | ACID | BASE |

## 5. CAP Theorem: The Data Engineer's Trade-off

When choosing a database for a distributed ML system, you must consider the **CAP Theorem**. It states that a distributed system can provide only two of the following three guarantees:

```mermaid
pie
    title CAP Theorem
    "Consistency" : 1
    "Availability" : 1
    "Partition Tolerance" : 1
```

1. **Consistency:** Every read receives the most recent write.
2. **Availability:** Every request receives a response (even if it is not the latest data).
3. **Partition Tolerance:** The system continues to operate despite network failures.

## 6. Hybrid Approaches: The "Polyglot" Strategy

Modern ML architectures rarely rely on a single database.

* **Postgres (SQL)** might store the user accounts and labels.
* **MongoDB (NoSQL)** might store the raw log data.
* **S3 (Object Store)** might store the actual trained `.pkl` or `.onnx` model files.
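To make the feature-store use case from Section 3 concrete, here is a minimal sketch using the `redis-py` client. It assumes a Redis server on localhost; the key layout and feature names are illustrative:

```python
import redis  # pip install redis; assumes a Redis server on localhost

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# ETL time: write the latest features for a user as a hash
r.hset("user:42:features", mapping={
    "avg_order_value": 58.3,
    "days_since_last_login": 2,
    "n_sessions_7d": 14,
})

# Inference time: millisecond-latency lookup of the whole feature vector
features = r.hgetall("user:42:features")
print(features)  # note: Redis returns values as strings; cast before use
```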
## References for More Details

* **[PostgreSQL Documentation](https://www.postgresql.org/docs/):** Learning about complex joins and indexing for speed.
* **[MongoDB Architecture Guide](https://www.mongodb.com/docs/manual/core/data-modeling-introduction/):** Understanding document-based data modeling.

---

Storing data is one thing; getting it into your system is another. Let's look at how we build the bridges between these databases and our models.
\ No newline at end of file
diff --git a/docs/machine-learning/data-engineering-basics/data-collection/internet.mdx b/docs/machine-learning/data-engineering-basics/data-collection/internet.mdx
index e69de29..41c8474 100644
--- a/docs/machine-learning/data-engineering-basics/data-collection/internet.mdx
+++ b/docs/machine-learning/data-engineering-basics/data-collection/internet.mdx
@@ -0,0 +1,86 @@
---
title: "Data from the Web: APIs & Scraping"
sidebar_label: Web Data
description: "Mastering the techniques for harvesting data from the internet: REST APIs, GraphQL, and automated web scraping."
tags: [data-engineering, apis, web-scraping, json, rest, scraping-ethics]
---

The internet is the primary source of data for modern Machine Learning, from sentiment analysis of tweets to training LLMs on billions of webpages. There are two main ways to "ingest" this data: **APIs** (the front door) and **Web Scraping** (the side window).

## 1. APIs: The Structured Front Door

An **API (Application Programming Interface)** is a formal agreement between two systems. It allows you to request specific data and receive it in a predictable, structured format (usually JSON).

### Types of APIs in ML

* **REST (Representational State Transfer):** The standard for most web services. Uses HTTP methods like `GET` to fetch data.
* **GraphQL:** Allows you to request *exactly* the fields you need, reducing data transfer size, which is ideal for mobile data collection.
* **Webhooks:** Instead of you asking for data, the server "pushes" data to you when an event occurs (e.g., a new user sign-up).

```mermaid
sequenceDiagram
    participant ML_App as ML Pipeline
    participant API as Web API
    participant DB as External Database

    ML_App->>API: GET /v1/market-data?symbol=AAPL
    API->>DB: Query Price
    DB-->>API: 150.25
    API-->>ML_App: JSON: {"price": 150.25, "currency": "USD"}
```
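To make the GraphQL bullet concrete, here is a sketch of requesting exactly the `price` and `currency` fields from a market-data service. The endpoint and schema are hypothetical, but real GraphQL APIs (such as GitHub's) follow the same POST-a-query pattern:

```python
import requests

# Ask for exactly the two fields the pipeline needs, nothing more
query = """
{
  stock(symbol: "AAPL") {
    price
    currency
  }
}
"""

response = requests.post(
    "https://api.example.com/graphql",  # hypothetical endpoint
    json={"query": query},
    timeout=10,
)
print(response.json())  # e.g., {"data": {"stock": {"price": 150.25, "currency": "USD"}}}
```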
## 2. Web Scraping: The Unstructured Side Window

When a website does not provide an API, we use **Web Scraping**. This involves writing code that downloads the HTML of a page and "parses" (extracts) the specific information we need.

### The Scraping Toolkit

1. **Requests / HTTPX:** For downloading the raw HTML content.
2. **BeautifulSoup:** For navigating the HTML tree and finding tags (e.g., `<div>`, `<p>`).
3. **Selenium / Playwright:** For scraping "dynamic" sites that require JavaScript to load content (like infinite-scroll dashboards).
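A minimal sketch of the first two tools in action (the target URL is a placeholder; `pip install requests beautifulsoup4` provides the dependencies):

```python
import requests
from bs4 import BeautifulSoup

# Step 1: download the raw HTML (placeholder URL)
html = requests.get("https://example.com", timeout=10).text

# Step 2: parse the HTML tree
soup = BeautifulSoup(html, "html.parser")

# Step 3: extract the pieces we care about
title = soup.find("h1")
links = [a["href"] for a in soup.find_all("a", href=True)]

print(title.get_text(strip=True) if title else "no <h1> found")
print(links)
```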
## 3. Comparison: API vs. Scraping

| Feature | APIs | Web Scraping |
| --- | --- | --- |
| **Data Format** | JSON/XML (Easy to parse) | HTML (Messy/Unstructured) |
| **Stability** | High (Versioned) | Low (Breaks if the UI changes) |
| **Legality** | Encouraged by the provider | Gray area (Depends on Terms of Service) |
| **Speed** | Fast & Efficient | Slower (Requires rendering) |

## 4. The Ethics of Web Data Collection

Data Engineering isn't just about "can we get the data," but "should we?"

1. **Robots.txt:** Always check `website.com/robots.txt` to see which parts of the site the owner has forbidden to crawlers.
2. **Rate Limiting:** Do not spam a server with thousands of requests per second; that is effectively a DDoS attack. Use `time.sleep()` between requests.
3. **Terms of Service (ToS):** Many sites (like LinkedIn or Amazon) strictly forbid scraping in their user agreements.

```mermaid
graph LR
    Start[Scraping Plan] --> R["Check robots.txt"]
    R --> T["Review Terms of Service"]
    T --> L["Set Rate Limits"]
    L --> Run[Execute Collection]
    style R fill:#fff3e0,stroke:#ef6c00,color:#333
```

## 5. Cleaning Web Data

Data from the web is "noisy." You will almost always need to perform these steps immediately after collection:

* **HTML Stripping:** Removing `<script>` and `<style>` blocks and any leftover tags so that only the visible text remains.
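Putting the ethics checklist and the stripping step together, here is a minimal sketch of a "polite" collection pass. The target site and paths are placeholders, and the one-second delay is an illustrative rate limit:

```python
import time
import urllib.robotparser

import requests
from bs4 import BeautifulSoup

BASE = "https://example.com"
PAGES = ["/", "/about"]  # placeholder paths to collect

# 1. Check robots.txt before crawling anything
robots = urllib.robotparser.RobotFileParser()
robots.set_url(f"{BASE}/robots.txt")
robots.read()

for path in PAGES:
    url = f"{BASE}{path}"
    if not robots.can_fetch("*", url):
        continue  # the owner has forbidden this path to crawlers

    html = requests.get(url, timeout=10).text

    # HTML stripping: keep only the visible text for downstream cleaning
    text = BeautifulSoup(html, "html.parser").get_text(separator=" ", strip=True)
    print(url, "->", text[:80])

    time.sleep(1)  # rate limit: never hammer the server
```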