Ankush405/data-ingestion

Data Ingestion Service

A Go-based service for ingesting logs into ClickHouse, with migration and REST API support.

Database Schema

The service uses a ClickHouse table to store ingested logs. Below is an example schema:

CREATE TABLE logs (
    id          Int64,
    user_id     Int64,
    title       String,
    ingested_at DateTime,
    body        String,
    source      String
) ENGINE = MergeTree()
ORDER BY id;
  • id: Unique identifier for each log entry.
  • user_id: ID of the user associated with the post.
  • title: Title of the post.
  • ingested_at: Time the post was ingested.
  • body: Content body of the post.
  • source: Source of the log (e.g., placeholder_api).
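
As a rough illustration of how a row of this table might be represented in Go, the sketch below maps each column to a struct field. The names LogEntry, logColumns, and Values are hypothetical and not taken from the repository; only the column names and ClickHouse types come from the schema above.

```go
package main

import (
	"fmt"
	"time"
)

// LogEntry mirrors one row of the logs table. The Go identifiers are
// illustrative assumptions; the column names and types follow the schema.
type LogEntry struct {
	ID         int64     // id          Int64
	UserID     int64     // user_id     Int64
	Title      string    // title       String
	IngestedAt time.Time // ingested_at DateTime (second precision)
	Body       string    // body        String
	Source     string    // source      String
}

// logColumns lists the columns in schema order so inserts can name them
// explicitly instead of relying on positional order.
var logColumns = []string{"id", "user_id", "title", "ingested_at", "body", "source"}

// Values returns the field values in the same order as logColumns.
// The timestamp is truncated to whole seconds to match DateTime.
func (e LogEntry) Values() []any {
	return []any{e.ID, e.UserID, e.Title, e.IngestedAt.Truncate(time.Second), e.Body, e.Source}
}

func main() {
	e := LogEntry{
		ID:         1,
		UserID:     7,
		Title:      "example title",
		IngestedAt: time.Now().UTC(),
		Body:       "example body",
		Source:     "placeholder_api",
	}
	fmt.Println(logColumns)
	fmt.Println(len(e.Values()), "values")
}
```

Keeping the column list next to the struct makes it harder for the two to drift apart when the schema changes.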

Setup Instructions

Prerequisites

Running the Application

  1. Clone the repository:

    git clone git@github.com:Ankush405/data-ingestion.git
    cd data-ingestion
  2. Start the services using Docker Compose:

    docker-compose up --build

    This will:

  3. API Endpoints:

    • GET /health — Health check endpoint
    • GET /logs — Retrieve ingested logs
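
For orientation, here is a minimal sketch of what a GET /health handler could look like with Go's standard net/http package. The JSON response body, the handler and function names, and the use of httptest to exercise the handler in-process are all assumptions made for illustration; the actual server may be structured differently. GET /logs is omitted because it depends on a live ClickHouse connection.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// healthHandler answers GET /health. The {"status":"ok"} body is an
// assumed response shape, not one documented by the service.
func healthHandler(w http.ResponseWriter, r *http.Request) {
	if r.Method != http.MethodGet {
		w.Header().Set("Allow", http.MethodGet)
		http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
		return
	}
	w.Header().Set("Content-Type", "application/json")
	_ = json.NewEncoder(w).Encode(map[string]string{"status": "ok"})
}

// newMux wires the handler to its path.
func newMux() *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/health", healthHandler)
	return mux
}

// statusFor sends one in-process request to the mux and returns the
// HTTP status code, so the sketch runs without opening a network port.
func statusFor(method, path string) int {
	rec := httptest.NewRecorder()
	newMux().ServeHTTP(rec, httptest.NewRequest(method, path, nil))
	return rec.Code
}

func main() {
	fmt.Println("GET  /health ->", statusFor(http.MethodGet, "/health"))
	fmt.Println("POST /health ->", statusFor(http.MethodPost, "/health"))
}
```

In a real server the mux would be passed to http.ListenAndServe on whatever port the Docker Compose setup exposes.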

Running Tests

Note: Add unit/integration tests as needed.

  1. Run tests locally:

    go test ./...
  2. (Optional) Run tests inside Docker:

    • Add a test stage to your Dockerfile or use a separate test container.

Deploying to a Cloud Environment

  1. Build Docker images:

    docker build -f Dockerfile.server -t yourrepo/ingestion-server:latest .
    docker build -f Dockerfile.migrate -t yourrepo/ingestion-migrate:latest .
  2. Push images to your container registry:

    docker push yourrepo/ingestion-server:latest
    docker push yourrepo/ingestion-migrate:latest
  3. Provision a ClickHouse instance (e.g., Altinity.Cloud, Aiven, or self-hosted).

  4. Set environment variables for your deployment (e.g., CLICKHOUSE_URL).

  5. Deploy using your preferred orchestrator (e.g., Kubernetes, ECS, GCP Cloud Run, etc.), referencing the pushed images and environment variables.
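
Because step 4 relies on CLICKHOUSE_URL being set correctly, it can help to validate the value at startup. The following is a sketch only, assuming the URL follows the clickhouse://user:password@host:port/database form; clickHouseConfig and parseClickHouseURL are hypothetical names and do not describe how the service itself reads its configuration.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
	"os"
	"strings"
)

// clickHouseConfig holds the parts of CLICKHOUSE_URL that are safe to log.
// The password is deliberately left out.
type clickHouseConfig struct {
	Host     string // host:port
	User     string
	Database string
}

// parseClickHouseURL validates a clickhouse:// URL and extracts its parts.
func parseClickHouseURL(raw string) (clickHouseConfig, error) {
	if raw == "" {
		return clickHouseConfig{}, errors.New("CLICKHOUSE_URL is not set")
	}
	u, err := url.Parse(raw)
	if err != nil {
		return clickHouseConfig{}, fmt.Errorf("invalid CLICKHOUSE_URL: %w", err)
	}
	if u.Scheme != "clickhouse" {
		return clickHouseConfig{}, fmt.Errorf("unexpected scheme %q, want \"clickhouse\"", u.Scheme)
	}
	if u.Host == "" {
		return clickHouseConfig{}, errors.New("CLICKHOUSE_URL has no host")
	}
	return clickHouseConfig{
		Host:     u.Host,
		User:     u.User.Username(), // returns "" when no user info is present
		Database: strings.TrimPrefix(u.Path, "/"),
	}, nil
}

func main() {
	cfg, err := parseClickHouseURL(os.Getenv("CLICKHOUSE_URL"))
	if err != nil {
		fmt.Println("config error:", err)
		return
	}
	fmt.Printf("host=%s user=%s database=%s\n", cfg.Host, cfg.User, cfg.Database)
}
```

Failing fast on a missing or malformed URL tends to surface deployment mistakes before the first query is attempted.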

Documentation

  • Code Structure:

    • cmd/server/ — Main server application
    • cmd/migrate/ — Database migration tool
    • internal/log/ — Log ingestion business logic
  • Configuration:

    • Environment variable: CLICKHOUSE_URL (e.g., clickhouse://default:password123@clickhouse:9000/default)
  • Migrations:

  • Extending:

Design Trade-offs and Implementation Notes

Trade-offs Considered

  • Simplicity vs. Flexibility:
    The service is designed with a simple schema and API to enable quick ingestion and querying. More flexible schemas (e.g., supporting arbitrary fields) were avoided to keep the implementation straightforward and performant.
  • ClickHouse as the Storage Engine:
    ClickHouse was chosen for its high performance on analytical queries over large-scale log data. The cost is a more complex setup and weaker transactional support than traditional relational databases offer.
  • Dockerized Deployment:
    Using Docker and Docker Compose simplifies local development and deployment, but may not reflect all production nuances (e.g., security, scaling, persistent storage).

Hardest Parts to Implement

  • Database Migrations:
    Ensuring that migrations run reliably and idempotently, especially when deploying to new environments, required careful scripting and testing.
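
One common way to make a migration safe to re-run is to write each statement so that a second execution is a no-op, for example with CREATE TABLE IF NOT EXISTS, which ClickHouse supports. The sketch below illustrates that idea under stated assumptions: execer, runMigrations, and fakeDB are hypothetical names, and fakeDB is a small in-memory stand-in for ClickHouse, not the project's actual migration tool.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// execer is the minimal surface the runner needs. A real implementation
// would wrap a ClickHouse connection.
type execer interface {
	Exec(query string) error
}

// migrations are written to be idempotent: IF NOT EXISTS makes a repeated
// CREATE TABLE succeed without changing anything.
var migrations = []string{
	`CREATE TABLE IF NOT EXISTS logs (
    id          Int64,
    user_id     Int64,
    title       String,
    ingested_at DateTime,
    body        String,
    source      String
) ENGINE = MergeTree()
ORDER BY id`,
}

// runMigrations applies the statements in order and stops at the first error.
func runMigrations(db execer, stmts []string) error {
	for i, stmt := range stmts {
		if err := db.Exec(stmt); err != nil {
			return fmt.Errorf("migration %d failed: %w", i+1, err)
		}
	}
	return nil
}

// fakeDB stands in for ClickHouse. It remembers whether a table was created
// and rejects a repeated CREATE TABLE unless it uses IF NOT EXISTS.
type fakeDB struct{ created bool }

func (f *fakeDB) Exec(query string) error {
	if f.created && !strings.Contains(query, "IF NOT EXISTS") {
		return errors.New("table already exists")
	}
	f.created = true
	return nil
}

func main() {
	db := &fakeDB{}
	for run := 1; run <= 2; run++ {
		if err := runMigrations(db, migrations); err != nil {
			fmt.Println("run", run, "failed:", err)
			return
		}
		fmt.Println("run", run, "ok")
	}
}
```

Running the same list twice against the stand-in succeeds both times, which is the property a fresh or partially provisioned environment needs.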

Improvements for the Future

  • Error Handling for Edge Cases:
    Handle API timeouts, invalid responses, and database errors robustly, and add tests that simulate these failure scenarios.
  • Enhanced Schema Flexibility:
    Support dynamic fields or a more flexible schema to accommodate different post formats.
  • Observability:
    Add structured logging, metrics, and tracing to improve monitoring and debugging in production.
  • Automated CI/CD:
    Integrate automated testing, linting, and deployment pipelines for faster and safer releases.
  • Security Enhancements:
    Implement authentication, authorization, and secure handling of secrets and environment variables.

For more details, see the source code and comments in each file.
