Add Flink-Hive use case in Jupyter notebook (#91) #98

Open
TungYuChiang wants to merge 4 commits into apache:main from TungYuChiang:flink-hive

Conversation

@TungYuChiang

This commit adds a Flink-Hive use case to the Jupyter notebook

@TungYuChiang TungYuChiang force-pushed the flink-hive branch 3 times, most recently from 2068f38 to 833c4e0 on November 5, 2024 15:34
Comment thread .gitignore
@@ -1,2 +1,3 @@
**/.idea
**/.DS_Store
**/packages/**

seems a little odd to ignore packages here

Author

I added this line following the suggestion from @xunliu. Perhaps it would be better to only include init/jupyter/packages instead? Let me know if that works.
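If ignoring every packages directory is too broad, the pattern could be narrowed to just the Jupyter download directory, along the lines suggested above (a sketch; the exact path is assumed from this discussion):

```
init/jupyter/packages/
```

This keeps other future packages directories visible to git while still excluding the downloaded jars.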

Member

@FANNG1 Because we don't need to commit these downloaded jar files to the git repo.

Comment thread docker-compose.yaml
- ./init/jupyter:/tmp/gravitino
entrypoint: /bin/bash /tmp/gravitino/init.sh
environment:
- HADOOP_CLASSPATH=/tmp/gravitino/packages/hadoop-2.7.3/etc/hadoop:/tmp/gravitino/packages/hadoop-2.7.3/share/hadoop/common/lib/*:/tmp/gravitino/packages/hadoop-2.7.3/share/hadoop/common/*:/tmp/gravitino/packages/hadoop-2.7.3/share/hadoop/hdfs:/tmp/gravitino/packages/hadoop-2.7.3/share/hadoop/hdfs/lib/*:/tmp/gravitino/packages/hadoop-2.7.3/share/hadoop/hdfs/*:/tmp/gravitino/packages/hadoop-2.7.3/share/hadoop/yarn/lib/*:/tmp/gravitino/packages/hadoop-2.7.3/share/hadoop/yarn/*:/tmp/gravitino/packages/hadoop-2.7.3/share/hadoop/mapreduce/lib/*:/tmp/gravitino/packages/hadoop-2.7.3/share/hadoop/mapreduce/*:/tmp/gravitino/packages/contrib/capacity-scheduler/*.jar

Is it necessary to add the YARN and MapReduce jars to the CLASSPATH? It seems only HDFS is needed.
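Following that suggestion, the HADOOP_CLASSPATH could be trimmed to just the common and hdfs entries, dropping the yarn, mapreduce, and capacity-scheduler paths. An untested sketch of the docker-compose.yaml line, assuming only the HDFS client is actually used:

```yaml
environment:
  - HADOOP_CLASSPATH=/tmp/gravitino/packages/hadoop-2.7.3/etc/hadoop:/tmp/gravitino/packages/hadoop-2.7.3/share/hadoop/common/lib/*:/tmp/gravitino/packages/hadoop-2.7.3/share/hadoop/common/*:/tmp/gravitino/packages/hadoop-2.7.3/share/hadoop/hdfs:/tmp/gravitino/packages/hadoop-2.7.3/share/hadoop/hdfs/lib/*:/tmp/gravitino/packages/hadoop-2.7.3/share/hadoop/hdfs/*
```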

Comment thread docker-compose.yaml
environment:
- HADOOP_CLASSPATH=/tmp/gravitino/packages/hadoop-2.7.3/etc/hadoop:/tmp/gravitino/packages/hadoop-2.7.3/share/hadoop/common/lib/*:/tmp/gravitino/packages/hadoop-2.7.3/share/hadoop/common/*:/tmp/gravitino/packages/hadoop-2.7.3/share/hadoop/hdfs:/tmp/gravitino/packages/hadoop-2.7.3/share/hadoop/hdfs/lib/*:/tmp/gravitino/packages/hadoop-2.7.3/share/hadoop/hdfs/*:/tmp/gravitino/packages/hadoop-2.7.3/share/hadoop/yarn/lib/*:/tmp/gravitino/packages/hadoop-2.7.3/share/hadoop/yarn/*:/tmp/gravitino/packages/hadoop-2.7.3/share/hadoop/mapreduce/lib/*:/tmp/gravitino/packages/hadoop-2.7.3/share/hadoop/mapreduce/*:/tmp/gravitino/packages/contrib/capacity-scheduler/*.jar
- NB_USER=my-username
- GRANT_SUDO=yes

Why add the environment variables below and run as root?

GRANT_SUDO=yes
CHOWN_HOME=yes

Author

The GRANT_SUDO=yes and CHOWN_HOME=yes settings were added to allow users to install the JDK and PyFlink directly within the Jupyter notebook environment.



HADOOP_VERSION="2.7.3"
HADOOP_URL="https://archive.apache.org/dist/hadoop/common/hadoop-${HADOOP_VERSION}/hadoop-${HADOOP_VERSION}.tar.gz"

It may take too much time to download Hadoop in a low-bandwidth environment; is only the HDFS client needed here?

Author

Thank you for the feedback. I will work on optimizing the dependencies by using only the HDFS client instead of the full Hadoop installation.

Author

@FANNG1 , @coolderli I was wondering if it's possible to download only a subset of Hadoop components instead of the full package. Would it be possible to provide some guidance or assistance on how to achieve this? Any help would be greatly appreciated!
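One option worth checking: instead of the full Hadoop tarball, fetch only the shaded Hadoop client jars from Maven Central. Note this is a sketch under an assumption: the shaded hadoop-client-api and hadoop-client-runtime bundles exist for Hadoop 3.x (e.g. 3.3.6), not for 2.7.3, so it would require moving the playground to a 3.x client.

```shell
# Build Maven Central URLs for the shaded Hadoop client jars
# (hypothetical version choice; adjust to whatever 3.x line fits).
HADOOP_CLIENT_VERSION="3.3.6"
MAVEN_BASE="https://repo1.maven.org/maven2/org/apache/hadoop"

client_jar_url() {
  local artifact="$1"
  echo "${MAVEN_BASE}/${artifact}/${HADOOP_CLIENT_VERSION}/${artifact}-${HADOOP_CLIENT_VERSION}.jar"
}

# Usage (network access required):
# curl -fsSLO "$(client_jar_url hadoop-client-api)"
# curl -fsSLO "$(client_jar_url hadoop-client-runtime)"
```

Two jars of roughly 20-30 MB each, versus the ~200 MB full distribution.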


There seems to be no link to download a Hadoop client bundle jar that contains the dependency jars.


@xunliu WDYT?

Member

@TungYuChiang
I think we can use https://hub.docker.com/_/flink/tags to fix this problem, just like Spark in the gravitino-playground.

@FANNG1

FANNG1 commented Nov 6, 2024

@coolderli could you help to review this PR?

Comment thread init/jupyter/jupyter-dependency.sh Outdated
FLINK_HIVE_CONNECTOR_MD5="${FLINK_HIVE_CONNECTOR_JAR}.md5"
download_and_verify "${FLINK_HIVE_CONNECTOR_JAR}" "${FLINK_HIVE_CONNECTOR_MD5}" "${jupyter_dir}"

GRAVITINO_FLINK_JAR="https://repo1.maven.org/maven2/org/apache/gravitino/gravitino-flink-1.18_2.12/0.6.1-incubating/gravitino-flink-1.18_2.12-0.6.1-incubating.jar"

@TungYuChiang Do we still need a gravitino-flink package when we have a gravitino-flink-runtime jar?

Author

I just tested it, and it seems we don’t need the gravitino-flink package. I'll go ahead and remove it.

Author

I've removed the redundant Gravitino-Flink package as discussed. Thanks for the suggestion!
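The jupyter-dependency.sh diff above calls a download_and_verify helper. A minimal sketch of such a helper, under the assumption that the .md5 sidecar file contains only the hex digest (the actual script's behavior may differ):

```shell
# Download a jar plus its .md5 sidecar into dest_dir, then compare the
# published digest with the digest of the downloaded file.
download_and_verify() {
  local jar_url="$1" md5_url="$2" dest_dir="$3"
  local jar_name expected actual
  jar_name="$(basename "${jar_url}")"

  curl -fsSL -o "${dest_dir}/${jar_name}" "${jar_url}"
  curl -fsSL -o "${dest_dir}/${jar_name}.md5" "${md5_url}"

  expected="$(cat "${dest_dir}/${jar_name}.md5")"
  actual="$(md5sum "${dest_dir}/${jar_name}" | awk '{print $1}')"

  if [ "${expected}" != "${actual}" ]; then
    echo "checksum mismatch for ${jar_name}" >&2
    return 1
  fi
}
```

Failing closed on a checksum mismatch matters here because the script wires the downloaded jars straight into the Flink and Jupyter classpaths.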

ls "${jupyter_dir}/packages/" | xargs -I {} rm "${jupyter_dir}/packages/"{}
find "${jupyter_dir}/../spark/packages/" | grep jar | xargs -I {} ln {} "${jupyter_dir}/packages/"

FLINK_HIVE_CONNECTOR_JAR="https://repo1.maven.org/maven2/org/apache/flink/flink-sql-connector-hive-2.3.10_2.12/1.20.0/flink-sql-connector-hive-2.3.10_2.12-1.20.0.jar"

@FANNG1 Should we package the flink-sql-connector to the gravitino-flink-runtime? The kyuubi-spark-connector-hive is packaged to the spark-runtime. Do we need to maintain consistency?


@@ -0,0 +1,226 @@
{

This case looks good to me. Maybe we can add more operations like ALTER TABLE and DROP TABLE later.

Author

I tested the operations, and DROP works as expected. However, I encountered an error when trying to use ALTER:

Py4JJavaError: An error occurred while calling o31.executeSql.
: java.lang.NoClassDefFoundError: org/apache/gravitino/shaded/org/apache/commons/compress/utils/Lists

I find it quite strange because other commands work fine. I’m wondering if the issue could be related to the gravitino-flink-connector-runtime. Any insights or suggestions would be appreciated.


@TungYuChiang Could you replace import org.apache.commons.compress.utils.Lists; with import com.datastrato.gravitino.shaded.com.google.common.collect.Lists; in https://github.com/apache/gravitino/blob/main/flink-connector/flink/src/main/java/org/apache/gravitino/flink/connector/catalog/BaseCatalog.java#L32, rebuild it, and then try again? Thanks.


@TungYuChiang I created an issue, apache/gravitino#5534. Could you submit a patch to fix it?

Author
@TungYuChiang TungYuChiang Nov 12, 2024

@coolderli Thank you for your guidance!
I have already left a comment under apache/gravitino#3354.

4 participants