---
title: Mastering APIs for Data Collection
sidebar_label: APIs
description: "A deep dive into REST and GraphQL APIs: how to fetch, authenticate, and process external data for machine learning."
tags: [apis, rest, graphql, json, data-engineering, python-requests]
---

In the Data Engineering lifecycle, **APIs** are the "clean" way to collect data. Unlike web scraping, which is brittle and unstructured, APIs provide a contract-based method to access data that is versioned, documented, and usually delivered in machine-readable formats like JSON.

## 1. How APIs Work: The Request-Response Cycle

An API acts as a middleman between your ML pipeline and a remote server. You send a **Request** (a specific question) and receive a **Response** (the data answer).

```mermaid
sequenceDiagram
participant Pipeline as ML Data Pipeline
participant API as API Gateway
participant Server as Data Server

Pipeline->>API: HTTP Request (GET /data)
Note right of Pipeline: Includes Headers & API Key
API->>Server: Validate & Route
Server-->>API: Data Payload
API-->>Pipeline: HTTP Response (200 OK + JSON)

```

### Components of an API Request:

1. **Endpoint (URL):** The address where the data lives (e.g., `api.twitter.com/v2/tweets`).
2. **Method:** What you want to do (`GET` to fetch, `POST` to send).
3. **Headers:** Metadata like your **API Key** or the format you want (`Content-Type: application/json`).
4. **Parameters:** Filters for the data (e.g., `?start_date=2023-01-01`).
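
The four components above compose into a single URL plus a set of headers. A minimal sketch using only the standard library (the endpoint and token are placeholders, not a real API):

```python
from urllib.parse import urlencode, urlunsplit

# Hypothetical endpoint, for illustration only.
host = "api.example.com"
path = "/v2/tweets"
params = {"start_date": "2023-01-01", "max_results": 10}

# Endpoint + parameters become the URL; headers travel separately.
url = urlunsplit(("https", host, path, urlencode(params), ""))
headers = {
    "Authorization": "Bearer YOUR_TOKEN",       # API key / token
    "Content-Type": "application/json",         # requested format
}

print(url)
# https://api.example.com/v2/tweets?start_date=2023-01-01&max_results=10
```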

## 2. Common API Architectures in ML

### A. REST (Representational State Transfer)

The most common architecture. It treats every piece of data as a "Resource."

* **Best for:** Standardized data fetching.
* **Format:** Almost exclusively **JSON**.

### B. GraphQL

Developed by Meta, it allows the client to define the structure of the data it needs.

* **Advantage in ML:** If a user profile has 100 fields but you only need 3 features for your model, GraphQL prevents "Over-fetching," saving bandwidth and memory.
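
A GraphQL request is an ordinary HTTP POST whose JSON body names exactly the fields you want. The schema and field names below are hypothetical, shown only to illustrate the shape of the payload:

```python
import json

# The client asks for exactly the 3 fields the model needs;
# the server returns nothing else, avoiding over-fetching.
query = """
query UserFeatures($id: ID!) {
  user(id: $id) {
    age
    country
    accountTenureDays
  }
}
"""
payload = {"query": query, "variables": {"id": "42"}}

# Sent as e.g. requests.post(url, json=payload, headers=headers)
body = json.dumps(payload)
print(json.loads(body)["variables"])  # {'id': '42'}
```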

[Image comparing REST vs GraphQL data fetching efficiency]

### C. Streaming APIs (WebSockets/gRPC)

Used when data needs to be delivered in real-time.

* **ML Use Case:** Algorithmic trading or live social media sentiment monitoring.

## 3. Implementation in Python

The `requests` library is the standard tool for interacting with APIs.

```python
import requests

# Hypothetical weather endpoint; substitute your provider's URL and token.
url = "https://api.example.com/v1/weather"
headers = {
    "Authorization": "Bearer YOUR_TOKEN"
}
params = {
    "city": "Mandsaur",
    "country": "IN",
    "units": "metric"
}

# A timeout stops the pipeline from hanging on a dead connection.
response = requests.get(url, headers=headers, params=params, timeout=10)

if response.status_code == 200:
    data = response.json()
    temperature = data["main"]["temp"]    # extract temperature
    humidity = data["main"]["humidity"]   # extract humidity

    print(f"Temperature in Mandsaur: {temperature}°C")
    print(f"Humidity: {humidity}%")
else:
    print(f"Failed to fetch weather data (status {response.status_code})")
```

## 4. Challenges: Rate Limiting and Status Codes

APIs are not infinite resources. Providers implement **Rate Limiting** to prevent abuse.

| Status Code | Meaning | Action for ML Pipeline |
| --- | --- | --- |
| **200** | OK | Process the data. |
| **401** | Unauthorized | Check your API Key/Token. |
| **404** | Not Found | Check your Endpoint URL. |
| **429** | Too Many Requests | **Exponential Backoff:** Wait and try again later. |

```mermaid
flowchart TD
Req[Send API Request] --> Res{Status Code?}
Res -- 200 --> Save[Ingest to Database]
Res -- 429 --> Wait[Wait/Sleep] --> Req
Res -- 401 --> Fail[Alert Developer]
style Wait fill:#fff3e0,stroke:#ef6c00,color:#333

```
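
The retry loop in the flowchart can be sketched as a small helper. The fetch function is injected so the backoff policy can be exercised without a live API; `FakeResponse` is a stand-in for a real HTTP response object:

```python
import time
import random

def fetch_with_backoff(fetch, max_retries=5, base_delay=1.0):
    """Retry `fetch` on 429 responses with exponential backoff."""
    for attempt in range(max_retries):
        response = fetch()
        if response.status_code == 429:
            # Wait 1s, 2s, 4s, ... plus jitter to avoid synchronized retries.
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
            time.sleep(delay)
            continue
        return response
    raise RuntimeError("Rate limit persisted after all retries")

class FakeResponse:
    def __init__(self, status_code):
        self.status_code = status_code

# First call is rate-limited, the retry succeeds.
calls = iter([FakeResponse(429), FakeResponse(200)])
result = fetch_with_backoff(lambda: next(calls), base_delay=0.01)
print(result.status_code)  # 200
```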

## 5. Authentication Methods

1. **API Keys:** A simple string passed in the header.
2. **OAuth 2.0:** A more secure, token-based system used by Google, Meta, and Twitter.
3. **JWT (JSON Web Tokens):** Often used in internal microservices.
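
On the wire, all three schemes are just headers. The values below are placeholders, and the exact key-header name (`X-API-Key` here) varies by provider:

```python
# Placeholder credentials, for illustration only.
api_key_headers = {"X-API-Key": "YOUR_KEY"}                 # simple API key
oauth_headers = {"Authorization": "Bearer ACCESS_TOKEN"}    # OAuth 2.0 token

# A JWT is also sent as a Bearer token: three base64url segments
# (header.payload.signature) joined by dots.
jwt = "eyJhbGciOiJIUzI1NiJ9.eyJzdWIiOiJtbCJ9.signature"
print(jwt.count("."))  # 2
```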

## References for More Details

* **[REST API Tutorial](https://restfulapi.net/):** Understanding the principles of RESTful design.


* **[Python Requests Guide](https://requests.readthedocs.io/en/latest/):** Mastering HTTP requests for data collection.

---

APIs give us structured data, but sometimes the "front door" is locked. When there is no API, we must use the more aggressive "side window" approach.
---
title: Data Sources in ML
sidebar_label: Data Sources
description: "Identifying and integrating various data sources: from relational databases and APIs to unstructured web data and IoT streams."
tags: [data-engineering, data-sources, sql, nosql, apis, web-scraping]
---

Data is the "fuel" for Machine Learning. However, this fuel is rarely found in one place. As a data engineer, your job is to identify where the raw data lives and how to transport it safely into your environment for processing.

## 1. The Data Source Landscape

We generally categorize data sources based on their **Structure** and their **Storage Method**.

```mermaid
graph TD
Root[Data Sources] --> Structured[Structured]
Root --> Semi[Semi-Structured]
Root --> Unstructured[Unstructured]

Structured --> SQL[Relational DBs: MySQL, Postgres]
Semi --> Files[JSON, XML, Parquet]
Unstructured --> Media[Images, Video, Audio, PDF]

```

## 2. Common Data Sources

### A. Relational Databases (SQL)

The most common source for tabular data (customer records, transactions).

* **Protocol:** SQL (Structured Query Language).
* **Pros:** Highly reliable (ACID compliant), easy to join tables.
* **Cons:** Hard to scale horizontally; requires a fixed schema.
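
A minimal sketch of pulling tabular features out of a relational source. An in-memory SQLite database stands in for a production Postgres or MySQL instance; the table and columns are invented for the example:

```python
import sqlite3

# In-memory SQLite stands in for a production relational database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (user_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?)",
    [(1, 20.0), (1, 35.5), (2, 12.0)],
)

# A typical feature-extraction query: aggregate per user.
rows = conn.execute(
    "SELECT user_id, COUNT(*), SUM(amount) "
    "FROM transactions GROUP BY user_id ORDER BY user_id"
).fetchall()
print(rows)  # [(1, 2, 55.5), (2, 1, 12.0)]
```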

### B. NoSQL Databases

Used for high-volume, high-velocity, or non-tabular data.

* **Key-Value Stores:** Redis.
* **Document Stores:** MongoDB (Stores data as JSON/BSON).
* **ML Use Case:** Storing user profiles or real-time feature stores.

### C. APIs (Application Programming Interfaces)

Used to pull data from external services like Twitter, Google Maps, or Financial markets.

* **Format:** Usually **JSON**, most often delivered over **REST** or GraphQL endpoints.
* **Challenges:** Rate limiting (you can only pull so much data per hour) and authentication.

### D. Cloud Object Storage (The Data Lake)

Services like **AWS S3** or **Google Cloud Storage** act as a dumping ground for raw files before they are processed.

* **ML Use Case:** Storing millions of images for a Computer Vision model.

## 3. Batch vs. Streaming Sources

How the data arrives at your model is just as important as where it comes from.

| Feature | Batch Processing | Stream Processing |
| --- | --- | --- |
| **Source** | Databases, CSV files, Data Lakes | Kafka, Kinesis, IoT Sensors |
| **Frequency** | Hourly, Daily, Weekly | Real-time (Milliseconds) |
| **Use Case** | Training a model on historical sales | Predicting fraud during a transaction |

```mermaid
flowchart LR
S1[(Database)] -->|Batch| B[ETL Process]
S2{{IoT Sensor}} -->|Stream| P[Real-time Pipeline]
B --> DL[Data Lake]
P --> DL
style P fill:#fff3e0,stroke:#ef6c00,color:#333

```
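
The batch/stream split above can be sketched in a few lines: a batch job materializes the whole dataset before computing, while a streaming consumer updates its answer one event at a time in constant memory:

```python
def batch_mean(readings):
    data = list(readings)           # whole dataset materialized first
    return sum(data) / len(data)

def stream_mean(readings):
    count, total = 0, 0.0
    for value in readings:          # one record at a time, O(1) memory
        count += 1
        total += value
        yield total / count         # running estimate after each event

sensor = [2.0, 4.0, 6.0]
print(batch_mean(sensor))           # 4.0
print(list(stream_mean(sensor)))    # [2.0, 3.0, 4.0]
```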

## 4. Web Scraping & Crawling

When data isn't available via API or DB, we use scrapers (like `BeautifulSoup` or `Scrapy`) to extract information from HTML.

* **Ethics Check:** Always check a site's `robots.txt` before scraping to ensure you are legally and ethically allowed to take the data.
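
The standard library can do the `robots.txt` check for you. The rules below are parsed inline for the example; in practice you would fetch them from the site's `/robots.txt` before every scraping run:

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt rules, parsed directly for illustration.
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

print(rp.can_fetch("MyScraper", "https://example.com/public/page"))   # True
print(rp.can_fetch("MyScraper", "https://example.com/private/data"))  # False
```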

## 5. Identifying High-Quality Sources

Not all data sources are equal. When evaluating a source for an ML project, ask:

1. **Freshness:** How often is this data updated?
2. **Reliability:** Does the source go down often?
3. **Completeness:** Does it have missing values?
4. **Granularity:** Is the data at the level we need (e.g., individual transactions vs. daily totals)?
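
The completeness question lends itself to a quick automated check. A minimal sketch over a list of records (the field names and data are invented):

```python
# Fraction of non-missing values per field -- 1.0 means fully complete.
def completeness(records, fields):
    missing = {f: 0 for f in fields}
    for rec in records:
        for f in fields:
            if rec.get(f) is None:
                missing[f] += 1
    return {f: 1 - missing[f] / len(records) for f in fields}

rows = [
    {"amount": 10.0, "country": "IN"},
    {"amount": None, "country": "IN"},
    {"amount": 5.0,  "country": None},
    {"amount": 7.5,  "country": "US"},
]
print(completeness(rows, ["amount", "country"]))
# {'amount': 0.75, 'country': 0.75}
```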

## References for More Details

* **[Google Cloud - Data Source Types](https://cloud.google.com/architecture/data-lifecycle-cloud-platform):** Understanding how cloud providers handle different data types.


* **[MongoDB University](https://university.mongodb.com/):** Learning the difference between Document stores and SQL.

---

Finding the data is only the first step. Once we have access, we need to move it into our systems without losing information or causing bottlenecks.
---
title: "SQL vs. NoSQL for ML"
sidebar_label: SQL & NoSQL
description: "Comparing Relational and Non-Relational databases: choosing the right storage for your machine learning features and labels."
tags: [databases, sql, nosql, data-engineering, postgres, mongodb]
---

Choosing between a **SQL (Relational)** and a **NoSQL (Non-Relational)** database is one of the most critical decisions in a Data Engineering pipeline. In Machine Learning, this choice often depends on whether your data is fixed and structured or evolving and unstructured.

## 1. The Architectural Divide

```mermaid
graph TD
subgraph SQL ["SQL (Relational)"]
Table[Tables/Rows] --- Schema[Strict Schema]
end
subgraph NoSQL ["NoSQL (Non-Relational)"]
Doc[Documents/Key-Value] --- Flexible[Dynamic Schema]
end
style SQL fill:#e3f2fd,stroke:#1565c0,color:#333
style NoSQL fill:#f1f8e9,stroke:#33691e,color:#333

```

## 2. SQL: Relational Databases

**Examples:** PostgreSQL, MySQL, SQLite, Oracle.

SQL databases store data in rows and columns. They are built on **ACID** properties (Atomicity, Consistency, Isolation, Durability), ensuring that every transaction is processed reliably.

* **Best for:** Structured data where relationships are key (e.g., linking a `User_ID` to `Transactions` and `Product_Details`).
* **Scaling:** Vertically (buying a bigger, more powerful server).
* **ML Use Case:** Serving as the "Source of Truth" for historical training data where data integrity is paramount.

## 3. NoSQL: Non-Relational Databases

**Examples:** MongoDB (Document), Cassandra (Column-family), Redis (Key-Value), Neo4j (Graph).

NoSQL databases are designed for distributed data and high-speed horizontal scaling. They are often **BASE** compliant (Basically Available, Soft state, Eventual consistency).

* **Best for:** Unstructured or semi-structured data (JSON, social media feeds, sensor logs).
* **Scaling:** Horizontally (adding more cheap servers to a cluster).
* **ML Use Case:**
  * **Feature Stores:** Using Redis for ultra-fast lookup of features during real-time inference.
  * **Unstructured Storage:** Using MongoDB to store raw JSON metadata for NLP tasks.
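
The feature-store pattern is just key-value reads and writes. In this sketch a plain dict stands in for Redis; with the `redis-py` client the equivalent calls would be `r.hset(key, mapping=...)` and `r.hgetall(key)`:

```python
# A dict stands in for Redis here, purely for illustration.
store = {}

def put_features(user_id, features):
    store[f"features:{user_id}"] = features          # O(1) write

def get_features(user_id):
    return store.get(f"features:{user_id}", {})      # O(1) read at inference

put_features(42, {"avg_basket": 31.2, "days_active": 87})
print(get_features(42)["days_active"])  # 87
```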

## 4. Key Differences Comparison

| Feature | SQL | NoSQL |
| --- | --- | --- |
| **Data Model** | Tabular (Rows/Columns) | Document, Key-Value, Graph |
| **Schema** | Fixed (Pre-defined) | Dynamic (On-the-fly) |
| **Joins** | Very efficient (`JOIN`) | Generally avoided (data is denormalized) |
| **Query Language** | Structured Query Language (SQL) | Varies (e.g., MQL for MongoDB) |
| **Standard** | ACID | BASE |

## 5. CAP Theorem: The Data Engineer's Trade-off

When choosing a database for a distributed ML system, you must consider the **CAP Theorem**. It states that a distributed system can only provide two out of the following three:

```mermaid
pie
title CAP Theorem
"Consistency" : 1
"Availability" : 1
"Partition Tolerance" : 1
```

1. **Consistency:** Every read receives the most recent write.
2. **Availability:** Every request receives a response (even if it's not the latest).
3. **Partition Tolerance:** The system continues to operate despite network failures.

## 6. Hybrid Approaches: The "Polyglot" Strategy

Modern ML architectures rarely use just one.

* **Postgres (SQL)** might store the user account and labels.
* **MongoDB (NoSQL)** might store the raw log data.
* **S3 (Object Store)** might store the actual trained `.pkl` or `.onnx` model files.

## References for More Details

* **[PostgreSQL Documentation](https://www.postgresql.org/docs/):** Learning about complex joins and indexing for speed.

* **[MongoDB Architecture Guide](https://www.mongodb.com/docs/manual/core/data-modeling-introduction/):** Understanding document-based data modeling.

---

Storing data is one thing; getting it into your system is another. Let's look at how we build the bridges between these databases and our models.