Skip to content

How to get attribute level lineage "across" spark-jobs. through Spline-UI/AQL ? #1088

@adcb6gt

Description

@adcb6gt

I've an application that sources data from upstream, processes it and then passes it to downstream.
The application does this using an in-house concept called "Workflows".
Workflows are nothing but a set of activities (that can be sequential or parallel). There is a start activity, then set of transformation activities (which can be sequential or parallel) and then an end activity.
Each activity behind the scene translates to a spark-submit job.

As an example, one of our simple workflow would look like:
Source a file --> Select few columns from file --> Apply a filter condition --> Modify the value of a column based on a condition --> Save the modified file

Each of the above activity mentioned above is submitted as a pypsark job. At each step we create a dataframe and then persist to HDFS.

Scroll below frame for complete data view

Activity#      Activity (= pyspark job)   			    Source Action   							Target Action
1	       Source a File             		            df=spark.read.csv('/path/to/source/file/on/hdfs/sourceA.csv') df.write('/path/to/workflow-run-folder/step1.csv')
2	       Select few columns from file    		            df=df.select("col1","col2")				          df.write('/path/to/workflow-run-folder/step2.csv')
3	       Apply a filter condition				    df=df.filter(df.col1='xyz')					  df.write('/path/to/workflow-run-folder/step3.csv')
4	       Modify the value of a column based on a condition    df=df.withColumn("col2", df.col2*3)				 df.write('/path/to/workflow-run-folder/step4.csv')
5	       Save the modified filter												  df.write('/path/to/workflow-run-folder/step5.csv')

We then had requirement to get data lineage.

To achieve this objective, I explored the Spline framework and incorporated it in my application.

We placed the spark-spline-agent bundle on the spark driver (my --deploy-mode is client and --master is yarn) (placed the jar under jars folder of spark-home).
We also brought up the spark-rest-gateway and spline-ui along with arango db (used the spline 0.7 version).

On submitting our spark-jobs, we did the initialized via code by calling the enableLineageTracking on SparkLineageInitializer (for some reason codeless initialization via --packages didn't work).

I can see the spark-jobs listed on Spline-UI under ExecutionEvents screen (one record per spark-job; so for above example, we have 5 spark-jobs listed on Spline-UI).

On clicking on the Execution Plan hyperlink against each job, I can see the lineage that Spline-UI brings through ArangoDB.

Also in ArangoDB, I can see the vertices and edges in graph-view.

However, what I'm looking for is:

We now have a requirement to figure out data lineage "across" the spark jobs we submit.
For above example, I need some way where I provide "col2" as input and it gives me a lineage that should logically translate as:
col2 (Activity-1) -> col2 (Activity-2) -> col2 (Activity-3) -> col2*3 (Activity-4) -> col2*3 (Activity-5)

So if my col2 value at source is, say, value 1, then lineage should be:
1 -> 1 -> 1 -> 3 -> 3

Also against each value, I need to know the activity applied (if it was a pass-through (i.e., no transformations applied) or if there was any transformation on it (in above example, at activity 4, column value was multiplied by 3))

Is there a way I can achieve it by using AQL ?

I understand every spark job will get created as a Document in the document Collection (executionPlan) but if i've to get lineage of a specific attribute across these collections, how can I do it ? In other words, how can i get lineage of an attribute across different spark jobs (i.e., across different execution IDs) ?

Also can I capture the runtime value of the attribute as it goes through different transformations ?
I read through ArangoDB's traverse path graph but not able to get hold of the way i can achieve it via AQL (or throug any other approach in Spline)

Any pointers on this will be highly appreciated.

Thanks !

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    Status
    Done

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions