I have been trying to run this configuration in a Jupyter workbook with an EMR Serverless application attached:
    .config("spark.jars", "s3://bucket/spark-3.5-spline-agent-bundle_2.12-2.2.1.jar") \
    .config("spark.sql.queryExecutionListeners", "za.co.absa.spline.harvester.listener.SplineQueryExecutionListener") \
    .config("spark.spline.lineageDispatcher", "console,file") \
    .config("spark.spline.lineageDispatcher.file.className", "za.co.absa.spline.harvester.dispatcher.FileLineageDispatcher") \
    .config("spark.spline.lineageDispatcher.file.fileName", "s3://bucket/spline_workbook/lineage.csv")
The script I am trying to run:
from pyspark.sql import functions as sf  # assumed import: the script calls sf.sum below

# Read the employees file and rename a column
empsDF = spark.read \
    .option("header", "true") \
    .option("inferschema", "true") \
    .csv(input_file_1)
empsDF1 = empsDF.withColumnRenamed('name', 'Name')
empsDF1.show()

# Read the departments file
deptsDF = spark.read \
    .option("header", "true") \
    .option("inferschema", "true") \
    .csv(input_file_2)

# Left outer join of employees to departments, written out as CSV
resultsDF = empsDF1.join(deptsDF, empsDF1.dept_id == deptsDF.dept_id1, "left_outer")
resultsDF.write.csv(output_file_1, header=True, mode="overwrite")

# Total salary per manager, written out as a single CSV file
xdf = empsDF.groupBy('manager_id')
ydf = xdf.agg(sf.sum('salary').alias('total_salary'))
ydf.show()
ydf.coalesce(1).write.csv(output_file_2, header=True, mode="overwrite")
However, even though the run succeeds, the lineage file is not created at the S3 location.