From eb0ccedea191d64a10b2d0e3980dfb1fb1b6004d Mon Sep 17 00:00:00 2001
From: Ajay Dhangar
Date: Wed, 24 Dec 2025 22:31:37 +0530
Subject: [PATCH] added more content

---
 .../data-formats/json.mdx    | 120 ++++++++++++++++++
 .../data-formats/parquet.mdx |  91 +++++++++++++
 .../data-formats/xml.mdx     | 106 ++++++++++++++++
 3 files changed, 317 insertions(+)

diff --git a/docs/machine-learning/data-engineering-basics/data-formats/json.mdx b/docs/machine-learning/data-engineering-basics/data-formats/json.mdx
index e69de29..8014460 100644
--- a/docs/machine-learning/data-engineering-basics/data-formats/json.mdx
+++ b/docs/machine-learning/data-engineering-basics/data-formats/json.mdx
@@ -0,0 +1,120 @@
---
title: "JSON: The Semi-Structured Standard"
sidebar_label: JSON
description: "Mastering JSON for Machine Learning: handling nested data, converting dictionaries, and efficient parsing for NLP pipelines."
tags: [data-engineering, json, api, semi-structured-data, python, nlp]
---

**JSON (JavaScript Object Notation)** is a lightweight, text-based format for storing and transporting data. While CSVs are perfect for simple tables, JSON excels at representing **hierarchical** or **nested** data, where a single observation may contain lists or other sub-observations.

## 1. JSON Syntax vs. Python Dictionaries

JSON's structure is almost identical to a Python dictionary's. It uses key-value pairs and supports a small set of data types:

* **Objects:** Enclosed in `{}` (maps to a Python `dict`).
* **Arrays:** Enclosed in `[]` (maps to a Python `list`).
* **Values:** Strings, numbers, Booleans (`true`/`false`), and `null`.

```json
{
  "user_id": 101,
  "metadata": {
    "login_count": 5,
    "tags": ["premium", "active"]
  },
  "is_active": true
}
```

## 2. Why JSON is Critical for ML

### A. Natural Language Processing (NLP)

Text data often comes with complex metadata (author, timestamp, geolocation, and nested entity tags). JSON keeps all of this information bundled with the raw text.

### B. Configuration Files

Most ML frameworks use JSON (or its cousin, YAML) to store **hyperparameters**.

```json
{
  "model": "ResNet-50",
  "learning_rate": 0.001,
  "optimizer": "Adam"
}
```

### C. API Responses

As discussed in the [APIs section](/tutorial/machine-learning/data-engineering-basics/data-collection/apis), almost every web service returns data in JSON format.

## 3. The "Flattening" Problem

Machine learning models (like linear regression or XGBoost) require **flat** 2D arrays (rows and columns). They cannot "see" inside a nested JSON object, so data engineers must **flatten** (or **normalize**) the data first.

```mermaid
graph LR
    Nested[Nested JSON] --> Normalize["pd.json_normalize()"]
    Normalize --> Flat[Flat DataFrame]
    style Normalize fill:#f3e5f5,stroke:#7b1fa2,color:#333
```

**Example in Python:**

```python
import pandas as pd

raw_json = [
    {"name": "Alice", "info": {"age": 25, "city": "NY"}},
    {"name": "Bob", "info": {"age": 30, "city": "SF"}}
]

# Flattens the nested 'info' object into 'info.age' and 'info.city' columns
df = pd.json_normalize(raw_json)
```
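`json_normalize` can also handle nested *lists*. A minimal sketch (the field names here are hypothetical): `record_path` explodes each element of an inner array into its own row, while `meta` copies parent-level fields down to those rows.

```python
import pandas as pd

raw_json = [
    {"name": "Alice", "orders": [{"item": "book", "price": 12.5},
                                 {"item": "pen", "price": 1.2}]},
    {"name": "Bob", "orders": [{"item": "lamp", "price": 30.0}]}
]

# record_path explodes each order into its own row;
# meta carries the parent 'name' field down to every row
df = pd.json_normalize(raw_json, record_path="orders", meta=["name"])
# Columns: item, price, name -- three rows, one per order
```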
## 4. Performance Trade-offs

| Feature | JSON | CSV | Parquet |
| --- | --- | --- | --- |
| **Flexibility** | **Very High** (Schema-less) | Low (Fixed columns) | Medium (Evolving schema) |
| **Parsing Speed** | Slow (Heavy string parsing) | Medium | **Very Fast** |
| **File Size** | Large (Repeated keys) | Medium | Small (Binary) |

:::note
In a JSON file, each key (e.g., `"user_id"`) is repeated for every single record, which wastes a lot of disk space compared to CSV.
:::

## 5. JSONL: The Big Data Variant

A standard JSON file must be loaded into memory in its entirety before it can be parsed. For datasets with millions of records, we use **JSONL (JSON Lines)** instead.

* Each line in the file is a separate, valid JSON object.
* **Benefit:** You can stream the file line by line without exhausting your RAM.

```text
{"id": 1, "text": "Hello world"}
{"id": 2, "text": "Machine Learning is fun"}
```

## 6. Best Practices for ML Engineers

1. **Validation:** Use JSON Schema to ensure the data you are ingesting has not changed structure.
2. **Encoding:** Always use `UTF-8` to avoid character corruption in text data.
3. **Compression:** Since JSON is text-heavy, always compress raw JSON files with `.gz` or `.zip`; this can save up to 90% of the space.

## References for More Details

* **[Python `json` Module](https://docs.python.org/3/library/json.html):** Learning `json.loads()` and `json.dumps()`.
* **[Pandas `json_normalize` Guide](https://pandas.pydata.org/docs/reference/api/pandas.json_normalize.html):** Mastering complex flattening of API data.

---

JSON is the king of flexibility, but for "Big Data" production environments where speed and storage are everything, we move to binary formats.
\ No newline at end of file
diff --git a/docs/machine-learning/data-engineering-basics/data-formats/parquet.mdx b/docs/machine-learning/data-engineering-basics/data-formats/parquet.mdx
index e69de29..6cb921d 100644
--- a/docs/machine-learning/data-engineering-basics/data-formats/parquet.mdx
+++ b/docs/machine-learning/data-engineering-basics/data-formats/parquet.mdx
@@ -0,0 +1,91 @@
---
title: "Parquet: The Big Data Gold Standard"
sidebar_label: Parquet
description: "Understanding Columnar storage, compression benefits, and why Parquet is the preferred format for high-performance ML pipelines."
tags: [data-engineering, parquet, big-data, columnar-storage, performance, cloud-storage]
---

**Apache Parquet** is an open-source, column-oriented data file format designed for efficient data storage and retrieval. Unlike CSV or JSON, which store data row by row, Parquet organizes data by **columns**. This single architectural shift makes it the industry standard for modern data lakes and ML feature stores.

## 1. Row-based vs. Columnar Storage

To understand Parquet, you must understand the difference in how the data is laid out on disk.

* **Row-based (CSV/SQL):** Stores all data for "User 1," then all data for "User 2."
* **Columnar (Parquet):** Stores all "User IDs" together, then all "Ages" together, then all "Incomes" together.

```mermaid
graph LR
    subgraph Row_Storage [Row-Based: CSV]
        R1[Row 1: ID, Age, Income]
        R2[Row 2: ID, Age, Income]
    end

    subgraph Col_Storage [Column-Based: Parquet]
        C1[IDs: 1, 2, 3...]
        C2[Ages: 25, 30, 35...]
        C3[Incomes: 50k, 60k...]
    end
```

## 2. Why Parquet is Superior for ML

### A. Column Projection (Selective Reading)

In ML, you might have a dataset with 500 columns while your specific model only needs 5 features.

* **CSV:** You must read and parse the entire file to extract those 5 columns.
* **Parquet:** The reader "jumps" directly to the 5 columns you need and skips the other 495, which can cut I/O by 90% or more.

### B. Drastic Compression

Because Parquet stores the values of each column together, similar data sits side by side and compresses extremely well with algorithms like Snappy or Gzip. The sketch below shows the effect on file size.

* **Example:** In an "Age" column, values repeat. Parquet can store "30, 30, 30, 31" as "3x30, 1x31" (**Run-Length Encoding**).
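A quick way to see the savings for yourself is to write the same DataFrame to both formats and compare sizes on disk. A minimal sketch, assuming `pyarrow` (or `fastparquet`) is installed; the file names and values are made up, and the exact ratio depends on your data:

```python
import os
import numpy as np
import pandas as pd

# A toy frame with repetitive values, which compress well
df = pd.DataFrame({
    "user_id": np.arange(1_000_000),
    "age": np.random.randint(18, 65, size=1_000_000),
    "country": np.random.choice(["US", "IN", "DE"], size=1_000_000),
})

df.to_csv("users.csv", index=False)
df.to_parquet("users.parquet", compression="snappy")

print(f"CSV:     {os.path.getsize('users.csv') / 1e6:.1f} MB")
print(f"Parquet: {os.path.getsize('users.parquet') / 1e6:.1f} MB")
```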
### C. Schema Preservation

Parquet is a binary format that stores **metadata**. It "knows" that a column is a 64-bit float or a timestamp, so you never have to worry about a "Date" column being accidentally read back as a string.

## 3. Parquet vs. CSV: The Benchmarks

The figures below are indicative; exact ratios depend on the data.

| Feature | CSV | Parquet |
| --- | --- | --- |
| **Storage Size** | 1.0x (Large) | **~0.2x (Small)** |
| **Query Speed** | Slow | **Very Fast** |
| **Cost (Cloud)** | Expensive (S3 scans more data) | **Cheap** (S3 scans less data) |
| **ML Readiness** | Requires manual type casting | **Plug-and-play** |

## 4. Using Parquet in Python

Pandas and PyArrow make it easy to switch from CSV to Parquet.

```python
import pandas as pd

# A small example frame; in practice this is your feature table
df = pd.DataFrame({"feature_1": [0.1, 0.2], "feature_2": [1.5, 2.5], "target": [0, 1]})

# Saving a DataFrame to Parquet
# Requires 'pyarrow' or 'fastparquet' to be installed
df.to_parquet('large_dataset.parquet', compression='snappy')

# Reading only specific columns (the magic of Parquet!)
df_subset = pd.read_parquet('large_dataset.parquet', columns=['feature_1', 'target'])
```

## 5. When to Use Parquet

1. **Production Pipelines:** Always use Parquet for data passed between the stages of a pipeline.
2. **Large Datasets:** Once your data reaches hundreds of megabytes, the speed gains become obvious.
3. **Cloud Storage:** If you store data in AWS S3 or Google Cloud Storage, Parquet will save you significant money on data scan and egress costs.

## References for More Details

* **[Apache Parquet Official Documentation](https://parquet.apache.org/):** Deep diving into the binary file structure.
* **[Databricks - Why Parquet?](https://www.databricks.com/glossary/what-is-parquet):** Understanding Parquet's role in the "Lakehouse" architecture.

---

Parquet is the king of analytical data storage. However, some streaming applications require a format that is optimized for high-speed row writes rather than column reads.
\ No newline at end of file
diff --git a/docs/machine-learning/data-engineering-basics/data-formats/xml.mdx b/docs/machine-learning/data-engineering-basics/data-formats/xml.mdx
index e69de29..9e897a6 100644
--- a/docs/machine-learning/data-engineering-basics/data-formats/xml.mdx
+++ b/docs/machine-learning/data-engineering-basics/data-formats/xml.mdx
@@ -0,0 +1,106 @@
---
title: "XML: Extensible Markup Language"
sidebar_label: XML
description: "Handling hierarchical data in XML: parsing techniques, its role in Computer Vision annotations, and converting XML to ML-ready formats."
tags: [data-engineering, xml, data-formats, computer-vision, pascal-voc, web-services]
---

**XML (Extensible Markup Language)** defines a set of rules for encoding documents in a format that is both human-readable and machine-readable. While JSON has largely replaced XML for web APIs, XML remains a cornerstone of industrial systems and **object detection** datasets.

## 1. Anatomy of an XML Document

XML uses a tree-like structure consisting of **tags**, **attributes**, and **content**. Here is a Pascal VOC-style image annotation:

```xml
<annotation>
  <filename>image_01.jpg</filename>
  <size>
    <width>640</width>
    <height>480</height>
  </size>
  <object>
    <name>cat</name>
    <bndbox>
      <xmin>100</xmin>
      <ymin>120</ymin>
      <xmax>250</xmax>
      <ymax>300</ymax>
    </bndbox>
  </object>
</annotation>
```
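Before we look at the use cases, it helps to see what an ML pipeline ultimately wants from a file like this: one flat record per bounding box. A minimal sketch, assuming the annotation above is saved as `annotation.xml` (Section 3 below covers parsing more generally):

```python
import xml.etree.ElementTree as ET

root = ET.parse('annotation.xml').getroot()
box = root.find('object/bndbox')

# One flat, table-ready record for the bounding box
row = {
    "filename": root.findtext('filename'),
    "label": root.findtext('object/name'),
    "xmin": int(box.findtext('xmin')),
    "ymin": int(box.findtext('ymin')),
    "xmax": int(box.findtext('xmax')),
    "ymax": int(box.findtext('ymax')),
}
print(row)
# {'filename': 'image_01.jpg', 'label': 'cat', 'xmin': 100, ...}
```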
## 2. XML in Machine Learning: Use Cases

### A. Computer Vision (Pascal VOC)

One of the most famous datasets in ML history, **Pascal VOC**, uses XML files like the one above to store bounding-box coordinates for image classification and object detection.

### B. Enterprise Data Integration

Many older banking, insurance, and manufacturing systems exchange data exclusively via XML over SOAP (Simple Object Access Protocol).

### C. Configuration & Metadata

XML is often used to store metadata for scientific datasets where complex, nested relationships must be strictly defined by a **Schema (XSD)**.

## 3. Parsing XML in Python

Because XML is a tree, we don't read it like a flat file; we "traverse" it using libraries like `ElementTree` or `lxml`.

```python
import xml.etree.ElementTree as ET

tree = ET.parse('annotation.xml')
root = tree.getroot()

# Accessing a specific element by tag name
filename = root.find('filename').text

# Iterating over every annotated object in the file
for obj in root.findall('object'):
    name = obj.find('name').text
    print(f"Detected object: {name}")
```

## 4. XML vs. JSON

| Feature | XML | JSON |
| --- | --- | --- |
| **Metadata** | Supports attributes and elements | Key-value pairs only |
| **Strictness** | High (supports XSD validation) | Low (flexible) |
| **Size** | Verbose (closing tags add bulk) | Compact |
| **Readability** | High (document-centric) | High (data-centric) |

## 5. The Challenge: Deep Nesting

Just like [JSON](/tutorial/machine-learning/data-engineering-basics/data-formats/json), XML is hierarchical. To use it in a standard ML model (like a Random Forest), you must **flatten** the tree into a table.

```mermaid
graph TD
    XML[XML Root] --> Branch1[Branch: Metadata]
    XML --> Branch2[Branch: Observations]
    Branch2 --> Leaf[Leaf: Data Point]
    Leaf --> Flatten[Flattening Logic]
    Flatten --> CSV[2D Feature Matrix]

    style XML fill:#f3e5f5,stroke:#7b1fa2,color:#333
    style CSV fill:#e1f5fe,stroke:#01579b,color:#333
```

## 6. Best Practices

1. **Use `lxml` for Speed:** The built-in `ElementTree` is fine for small files, but `lxml` is significantly faster for processing large datasets.
2. **Beware of "XML Bombs":** Malicious XML files can use entity expansion to crash your parser (a denial-of-service attack). Use `defusedxml` if you are parsing untrusted data from the web.
3. **Schema Validation:** Always validate your XML against an `.xsd` file if one is available, so your ML pipeline doesn't break on a missing tag.

## References for More Details

* **[Python ElementTree Documentation](https://docs.python.org/3/library/xml.etree.elementtree.html):** Learning the standard library approach.
* **[Pascal VOC Dataset Format](http://host.robots.ox.ac.uk/pascal/VOC/):** Seeing how XML is used in real-world ML projects.

---

XML completes our look at text-based formats. While these are easy for humans to read, they are slow for machines to process. Next, we look at the high-speed binary formats used in Big Data.
\ No newline at end of file