diff --git a/docs/machine-learning/data-engineering-basics/data-collection/apis.mdx b/docs/machine-learning/data-engineering-basics/data-collection/apis.mdx
index e69de29..05a4d6b 100644
--- a/docs/machine-learning/data-engineering-basics/data-collection/apis.mdx
+++ b/docs/machine-learning/data-engineering-basics/data-collection/apis.mdx
@@ -0,0 +1,125 @@
---
title: Mastering APIs for Data Collection
sidebar_label: APIs
description: "A deep dive into REST and GraphQL APIs: how to fetch, authenticate, and process external data for machine learning."
tags: [apis, rest, graphql, json, data-engineering, python-requests]
---

In the Data Engineering lifecycle, **APIs** are the "clean" way to collect data. Unlike web scraping, which is brittle and unstructured, APIs provide a contract-based method to access data that is versioned, documented, and usually delivered in machine-readable formats like JSON.

## 1. How APIs Work: The Request-Response Cycle

An API acts as a middleman between your ML pipeline and a remote server. You send a **Request** (a specific question) and receive a **Response** (the data answer).

```mermaid
sequenceDiagram
    participant Pipeline as ML Data Pipeline
    participant API as API Gateway
    participant Server as Data Server

    Pipeline->>API: HTTP Request (GET /data)
    Note right of Pipeline: Includes Headers & API Key
    API->>Server: Validate & Route
    Server-->>API: Data Payload
    API-->>Pipeline: HTTP Response (200 OK + JSON)
```

### Components of an API Request

1. **Endpoint (URL):** The address where the data lives (e.g., `api.twitter.com/v2/tweets`).
2. **Method:** What you want to do (`GET` to fetch, `POST` to send).
3. **Headers:** Metadata like your **API Key** or the format you want (`Content-Type: application/json`).
4. **Parameters:** Filters for the data (e.g., `?start_date=2023-01-01`).

## 2. Common API Architectures in ML

### A. REST (Representational State Transfer)

The most common architecture. It treats every piece of data as a "Resource."

* **Best for:** Standardized data fetching.
* **Format:** Almost exclusively **JSON**.

### B. GraphQL

Developed by Meta, it allows the client to define the structure of the data it needs.

* **Advantage in ML:** If a user profile has 100 fields but you only need 3 features for your model, GraphQL prevents "over-fetching," saving bandwidth and memory.

### C. Streaming APIs (WebSockets/gRPC)

Used when data needs to be delivered in real time.

* **ML Use Case:** Algorithmic trading or live social media sentiment monitoring.

## 3. Implementation in Python

The `requests` library is the standard tool for interacting with APIs.

```python
import requests

url = "https://api.example.com/v1/weather"
headers = {
    "Authorization": "Bearer YOUR_TOKEN"  # API key or OAuth token
}
params = {
    "city": "Mandsaur",
    "country": "IN",
    "units": "metric"
}

# A timeout prevents the pipeline from hanging forever on a dead endpoint
response = requests.get(url, headers=headers, params=params, timeout=10)

if response.status_code == 200:
    data = response.json()
    temperature = data["main"]["temp"]   # extract the temperature field
    humidity = data["main"]["humidity"]  # extract the humidity field

    print(f"Temperature in Mandsaur: {temperature}°C")
    print(f"Humidity: {humidity}%")
else:
    print(f"Failed to fetch weather data (status {response.status_code})")
```

## 4. Challenges: Rate Limiting and Status Codes

APIs are not infinite resources. Providers implement **Rate Limiting** to prevent abuse.
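A common defensive pattern is to retry with exponential backoff whenever the server returns a 429, as shown in the table and flowchart below. A minimal sketch (the function name, retry count, and delays are illustrative, not a library API):

```python
import time

import requests

def fetch_with_backoff(url, headers=None, params=None, max_retries=5):
    """GET with exponential backoff on HTTP 429 (Too Many Requests)."""
    delay = 1  # seconds to wait after the first 429
    for _ in range(max_retries):
        response = requests.get(url, headers=headers, params=params, timeout=10)
        if response.status_code != 429:
            response.raise_for_status()  # surface 401/404/5xx immediately
            return response.json()
        time.sleep(delay)
        delay *= 2  # 1s, 2s, 4s, 8s, ...
    raise RuntimeError(f"Still rate-limited after {max_retries} attempts")
```

Doubling the delay after each failure gives the provider breathing room while still letting the pipeline recover automatically.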
The most common status codes, and how a pipeline should react to each:

| Status Code | Meaning | Action for ML Pipeline |
| --- | --- | --- |
| **200** | OK | Process the data. |
| **401** | Unauthorized | Check your API Key/Token. |
| **404** | Not Found | Check your Endpoint URL. |
| **429** | Too Many Requests | **Exponential Backoff:** Wait and try again later. |

```mermaid
flowchart TD
    Req[Send API Request] --> Res{Status Code?}
    Res -- 200 --> Save[Ingest to Database]
    Res -- 429 --> Wait[Wait/Sleep] --> Req
    Res -- 401 --> Fail[Alert Developer]
    style Wait fill:#fff3e0,stroke:#ef6c00,color:#333
```

## 5. Authentication Methods

1. **API Keys:** A simple string passed in the header.
2. **OAuth 2.0:** A more secure, token-based system used by Google, Meta, and Twitter.
3. **JWT (JSON Web Tokens):** Often used in internal microservices.

## References for More Details

* **[REST API Tutorial](https://restfulapi.net/):** Understanding the principles of RESTful design.
* **[Python Requests Guide](https://requests.readthedocs.io/en/latest/):** Mastering HTTP requests for data collection.

---

APIs give us structured data, but sometimes the "front door" is locked. When there is no API, we must use the more aggressive "side window" approach.
\ No newline at end of file
diff --git a/docs/machine-learning/data-engineering-basics/data-collection/data-sources.mdx b/docs/machine-learning/data-engineering-basics/data-collection/data-sources.mdx
index e69de29..ee50c2c 100644
--- a/docs/machine-learning/data-engineering-basics/data-collection/data-sources.mdx
+++ b/docs/machine-learning/data-engineering-basics/data-collection/data-sources.mdx
@@ -0,0 +1,101 @@
---
title: Data Sources in ML
sidebar_label: Data Sources
description: "Identifying and integrating various data sources: from relational databases and APIs to unstructured web data and IoT streams."
tags: [data-engineering, data-sources, sql, nosql, apis, web-scraping]
---

Data is the "fuel" for Machine Learning. However, this fuel is rarely found in one place. As a data engineer, your job is to identify where the raw data lives and how to transport it safely into your environment for processing.

## 1. The Data Source Landscape

We generally categorize data sources by their **structure** and their **storage method**.

```mermaid
graph TD
    Root[Data Sources] --> Structured[Structured]
    Root --> Semi[Semi-Structured]
    Root --> Unstructured[Unstructured]

    Structured --> SQL[Relational DBs: MySQL, Postgres]
    Semi --> Files[JSON, XML, Parquet]
    Unstructured --> Media[Images, Video, Audio, PDF]
```

## 2. Common Data Sources

### A. Relational Databases (SQL)

The most common source for tabular data (customer records, transactions).

* **Query interface:** SQL (Structured Query Language).
* **Pros:** Highly reliable (ACID compliant), easy to join tables.
* **Cons:** Hard to scale horizontally; requires a fixed schema.

### B. NoSQL Databases

Used for high-volume, high-velocity, or non-tabular data.

* **Key-Value Stores:** Redis.
* **Document Stores:** MongoDB (stores data as JSON/BSON).
* **ML Use Case:** Storing user profiles or real-time feature stores.

### C. APIs (Application Programming Interfaces)

Used to pull data from external services like Twitter, Google Maps, or financial market feeds.

* **Format:** Usually **JSON**, typically delivered over a **REST** interface.
* **Challenges:** Rate limiting (you can only pull so much data per hour) and authentication.
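However the data arrives from the sources above, the last step is usually the same: flattening the raw payload into a table for feature engineering. A minimal sketch, assuming a hypothetical endpoint that returns a list of JSON records:

```python
import pandas as pd
import requests

# Hypothetical endpoint returning a list of JSON records
response = requests.get("https://api.example.com/v1/transactions", timeout=10)
records = response.json()

# Flatten nested objects (e.g., {"user": {"id": 1}}) into columns like "user.id"
df = pd.json_normalize(records)
print(df.head())
```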
### D. Cloud Object Storage (The Data Lake)

Services like **AWS S3** or **Google Cloud Storage** act as a dumping ground for raw files before they are processed.

* **ML Use Case:** Storing millions of images for a Computer Vision model.

## 3. Batch vs. Streaming Sources

How the data arrives at your model is just as important as where it comes from.

| Feature | Batch Processing | Stream Processing |
| --- | --- | --- |
| **Source** | Databases, CSV files, Data Lakes | Kafka, Kinesis, IoT Sensors |
| **Frequency** | Hourly, Daily, Weekly | Real-time (Milliseconds) |
| **Use Case** | Training a model on historical sales | Predicting fraud during a transaction |

```mermaid
flowchart LR
    S1[(Database)] -->|Batch| B[ETL Process]
    S2{{IoT Sensor}} -->|Stream| P[Real-time Pipeline]
    B --> DL[Data Lake]
    P --> DL
    style P fill:#fff3e0,stroke:#ef6c00,color:#333
```

## 4. Web Scraping & Crawling

When data isn't available via an API or a database, we use scraping libraries (like `BeautifulSoup` or `Scrapy`) to extract information from raw HTML.

* **Ethics Check:** Always check a site's `robots.txt` before scraping to ensure you are legally and ethically allowed to take the data.

## 5. Identifying High-Quality Sources

Not all data sources are equal. When evaluating a source for an ML project, ask:

1. **Freshness:** How often is this data updated?
2. **Reliability:** Does the source go down often?
3. **Completeness:** How many missing values or gaps does it have?
4. **Granularity:** Is the data at the level we need (e.g., individual transactions vs. daily totals)?

## References for More Details

* **[Google Cloud - Data Source Types](https://cloud.google.com/architecture/data-lifecycle-cloud-platform):** Understanding how cloud providers handle different data types.
* **[MongoDB University](https://university.mongodb.com/):** Learning the difference between Document stores and SQL.

---

Finding the data is only the first step. Once we have access, we need to move it into our systems without losing information or causing bottlenecks.
\ No newline at end of file
diff --git a/docs/machine-learning/data-engineering-basics/data-collection/databases-sql-nosql.mdx b/docs/machine-learning/data-engineering-basics/data-collection/databases-sql-nosql.mdx
index e69de29..dce4917 100644
--- a/docs/machine-learning/data-engineering-basics/data-collection/databases-sql-nosql.mdx
+++ b/docs/machine-learning/data-engineering-basics/data-collection/databases-sql-nosql.mdx
@@ -0,0 +1,88 @@
---
title: "SQL vs. NoSQL for ML"
sidebar_label: SQL & NoSQL
description: "Comparing Relational and Non-Relational databases: choosing the right storage for your machine learning features and labels."
tags: [databases, sql, nosql, data-engineering, postgres, mongodb]
---

Choosing between a **SQL (Relational)** and a **NoSQL (Non-Relational)** database is one of the most critical decisions in a Data Engineering pipeline. In Machine Learning, this choice often depends on whether your data is fixed and structured or evolving and unstructured.

## 1. The Architectural Divide

```mermaid
graph TD
    subgraph SQL ["SQL (Relational)"]
    Table[Tables/Rows] --- Schema[Strict Schema]
    end
    subgraph NoSQL ["NoSQL (Non-Relational)"]
    Doc[Documents/Key-Value] --- Flexible[Dynamic Schema]
    end
    style SQL fill:#e3f2fd,stroke:#1565c0,color:#333
    style NoSQL fill:#f1f8e9,stroke:#33691e,color:#333
```
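To make the divide concrete before we unpack each side, here is the same user record in both worlds: a sketch using Python's built-in `sqlite3` for the relational model and a plain JSON document for the document model (table and field names are illustrative):

```python
import json
import sqlite3

# Relational: a fixed schema with enforced columns, relationships via keys
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, plan TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'Asha', 'pro')")
row = conn.execute("SELECT name, plan FROM users WHERE id = 1").fetchone()
print(row)  # ('Asha', 'pro')

# Document: the schema lives inside the data and can vary per record
user_doc = {
    "id": 1,
    "name": "Asha",
    "plan": "pro",
    "preferences": {"theme": "dark"},  # nested fields need no schema change
}
print(json.dumps(user_doc))
```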
## 2. SQL: Relational Databases

**Examples:** PostgreSQL, MySQL, SQLite, Oracle.

SQL databases store data in rows and columns. They are built on **ACID** properties (Atomicity, Consistency, Isolation, Durability), ensuring that every transaction is processed reliably.

* **Best for:** Structured data where relationships are key (e.g., linking a `User_ID` to `Transactions` and `Product_Details`).
* **Scaling:** Vertical (buying a bigger, more powerful server).
* **ML Use Case:** Serving as the "Source of Truth" for historical training data where data integrity is paramount.

## 3. NoSQL: Non-Relational Databases

**Examples:** MongoDB (Document), Cassandra (Column-family), Redis (Key-Value), Neo4j (Graph).

NoSQL databases are designed for distributed data and high-speed horizontal scaling. They are often **BASE** compliant (Basically Available, Soft state, Eventual consistency).

* **Best for:** Unstructured or semi-structured data (JSON, social media feeds, sensor logs).
* **Scaling:** Horizontal (adding more cheap servers to a cluster).
* **ML Use Cases:**
  * **Feature Stores:** Using Redis for ultra-fast lookup of features during real-time inference.
  * **Unstructured Storage:** Using MongoDB to store raw JSON metadata for NLP tasks.

## 4. Key Differences Comparison

| Feature | SQL | NoSQL |
| --- | --- | --- |
| **Data Model** | Tabular (Rows/Columns) | Document, Key-Value, Graph |
| **Schema** | Fixed (Pre-defined) | Dynamic (On-the-fly) |
| **Joins** | Very efficient (`JOIN`) | Generally avoided (data is denormalized) |
| **Query Language** | Structured Query Language (SQL) | Varies (e.g., MQL for MongoDB) |
| **Consistency Model** | ACID | BASE |

## 5. CAP Theorem: The Data Engineer's Trade-off

When choosing a database for a distributed ML system, you must consider the **CAP Theorem**. It states that a distributed system can provide only two of the following three guarantees:

```mermaid
pie
    title CAP Theorem
    "Consistency" : 1
    "Availability" : 1
    "Partition Tolerance" : 1
```

1. **Consistency:** Every read receives the most recent write.
2. **Availability:** Every request receives a response (even if it is not the latest data).
3. **Partition Tolerance:** The system continues to operate despite network failures.

## 6. Hybrid Approaches: The "Polyglot" Strategy

Modern ML architectures rarely rely on a single database.

* **Postgres (SQL)** might store the user accounts and labels.
* **MongoDB (NoSQL)** might store the raw log data.
* **S3 (Object Store)** might store the actual trained `.pkl` or `.onnx` model files.
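To make the feature-store use case from Section 3 concrete, here is a minimal sketch using the `redis-py` client. It assumes a Redis server on localhost; the key layout and feature names are illustrative:

```python
import redis  # pip install redis; assumes a Redis server on localhost

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# ETL time: write the latest features for a user as a hash
r.hset("user:42:features", mapping={
    "avg_order_value": 58.3,
    "days_since_last_login": 2,
    "n_sessions_7d": 14,
})

# Inference time: millisecond-latency lookup of the whole feature vector
features = r.hgetall("user:42:features")
print(features)  # note: Redis returns values as strings; cast before use
```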
## References for More Details

* **[PostgreSQL Documentation](https://www.postgresql.org/docs/):** Learning about complex joins and indexing for speed.
* **[MongoDB Architecture Guide](https://www.mongodb.com/docs/manual/core/data-modeling-introduction/):** Understanding document-based data modeling.

---

Storing data is one thing; getting it into your system is another. Let's look at how we build the bridges between these databases and our models.
\ No newline at end of file
diff --git a/docs/machine-learning/data-engineering-basics/data-collection/internet.mdx b/docs/machine-learning/data-engineering-basics/data-collection/internet.mdx
index e69de29..41c8474 100644
--- a/docs/machine-learning/data-engineering-basics/data-collection/internet.mdx
+++ b/docs/machine-learning/data-engineering-basics/data-collection/internet.mdx
@@ -0,0 +1,86 @@
---
title: "Data from the Web: APIs & Scraping"
sidebar_label: Web Data
description: "Mastering the techniques for harvesting data from the internet: REST APIs, GraphQL, and automated web scraping."
tags: [data-engineering, apis, web-scraping, json, rest, scraping-ethics]
---

The internet is the primary source of data for modern Machine Learning, from sentiment analysis of tweets to training LLMs on billions of webpages. There are two main ways to "ingest" this data: **APIs** (the front door) and **Web Scraping** (the side window).

## 1. APIs: The Structured Front Door

An **API (Application Programming Interface)** is a formal agreement between two systems. It allows you to request specific data and receive it in a predictable, structured format (usually JSON).

### Types of APIs in ML

* **REST (Representational State Transfer):** The standard for most web services. Uses HTTP methods like `GET` to fetch data.
* **GraphQL:** Allows you to request *exactly* the fields you need, reducing data transfer size, which is ideal for mobile data collection.
* **Webhooks:** Instead of you asking for data, the server "pushes" data to you when an event occurs (e.g., a new user sign-up).

```mermaid
sequenceDiagram
    participant ML_App as ML Pipeline
    participant API as Web API
    participant DB as External Database

    ML_App->>API: GET /v1/market-data?symbol=AAPL
    API->>DB: Query Price
    DB-->>API: 150.25
    API-->>ML_App: JSON: {"price": 150.25, "currency": "USD"}
```
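To make the GraphQL bullet concrete, here is a sketch of requesting exactly the `price` and `currency` fields from a market-data service. The endpoint and schema are hypothetical, but real GraphQL APIs (such as GitHub's) follow the same POST-a-query pattern:

```python
import requests

# Ask for exactly the two fields the pipeline needs, nothing more
query = """
{
  stock(symbol: "AAPL") {
    price
    currency
  }
}
"""

response = requests.post(
    "https://api.example.com/graphql",  # hypothetical endpoint
    json={"query": query},
    timeout=10,
)
print(response.json())  # e.g., {"data": {"stock": {"price": 150.25, "currency": "USD"}}}
```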
## 2. Web Scraping: The Unstructured Side Window

When a website does not provide an API, we use **Web Scraping**. This involves writing code that downloads the HTML of a page and "parses" (extracts) the specific information we need.

### The Scraping Toolkit

1. **Requests / HTTPX:** For downloading the raw HTML content.
2. **BeautifulSoup:** For navigating the HTML tree and finding tags (e.g., `<div>`, `<p>`).
3. **Selenium / Playwright:** For scraping "dynamic" sites that require JavaScript to load content (like infinite-scroll dashboards).
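A minimal sketch of the first two tools in action (the target URL is a placeholder; `pip install requests beautifulsoup4` provides the dependencies):

```python
import requests
from bs4 import BeautifulSoup

# Step 1: download the raw HTML (placeholder URL)
html = requests.get("https://example.com", timeout=10).text

# Step 2: parse the HTML tree
soup = BeautifulSoup(html, "html.parser")

# Step 3: extract the pieces we care about
title = soup.find("h1")
links = [a["href"] for a in soup.find_all("a", href=True)]

print(title.get_text(strip=True) if title else "no <h1> found")
print(links)
```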
## 3. Comparison: API vs. Scraping

| Feature | APIs | Web Scraping |
| --- | --- | --- |
| **Data Format** | JSON/XML (Easy to parse) | HTML (Messy/Unstructured) |
| **Stability** | High (Versioned) | Low (Breaks if the UI changes) |
| **Legality** | Encouraged by the provider | Gray area (Depends on Terms of Service) |
| **Speed** | Fast & Efficient | Slower (Requires rendering) |

## 4. The Ethics of Web Data Collection

Data Engineering isn't just about "can we get the data," but "should we?"

1. **Robots.txt:** Always check `website.com/robots.txt` to see which parts of the site the owner has forbidden to crawlers.
2. **Rate Limiting:** Do not spam a server with thousands of requests per second; that is effectively a DDoS attack. Use `time.sleep()` between requests.
3. **Terms of Service (ToS):** Many sites (like LinkedIn or Amazon) strictly forbid scraping in their user agreements.

```mermaid
graph LR
    Start[Scraping Plan] --> R["Check robots.txt"]
    R --> T["Review Terms of Service"]
    T --> L["Set Rate Limits"]
    L --> Run[Execute Collection]
    style R fill:#fff3e0,stroke:#ef6c00,color:#333
```

## 5. Cleaning Web Data

Data from the web is "noisy." You will almost always need to perform these steps immediately after collection:

* **HTML Stripping:** Removing `<script>` and `<style>` blocks and any leftover tags so that only the visible text remains.
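Putting the ethics checklist and the stripping step together, here is a minimal sketch of a "polite" collection pass. The target site and paths are placeholders, and the one-second delay is an illustrative rate limit:

```python
import time
import urllib.robotparser

import requests
from bs4 import BeautifulSoup

BASE = "https://example.com"
PAGES = ["/", "/about"]  # placeholder paths to collect

# 1. Check robots.txt before crawling anything
robots = urllib.robotparser.RobotFileParser()
robots.set_url(f"{BASE}/robots.txt")
robots.read()

for path in PAGES:
    url = f"{BASE}{path}"
    if not robots.can_fetch("*", url):
        continue  # the owner has forbidden this path to crawlers

    html = requests.get(url, timeout=10).text

    # HTML stripping: keep only the visible text for downstream cleaning
    text = BeautifulSoup(html, "html.parser").get_text(separator=" ", strip=True)
    print(url, "->", text[:80])

    time.sleep(1)  # rate limit: never hammer the server
```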