Add Flink-Hive use case in Jupyter notebook (#91) #98

Open
TungYuChiang wants to merge 4 commits into apache:main from TungYuChiang:flink-hive

Conversation

@TungYuChiang

This commit adds a Flink-Hive use case to the Jupyter notebook

@TungYuChiang TungYuChiang force-pushed the flink-hive branch 3 times, most recently from 2068f38 to 833c4e0 on November 5, 2024 15:34
Comment thread .gitignore
@@ -1,2 +1,3 @@
**/.idea
**/.DS_Store
**/packages/**

seems a little odd to ignore packages here

Author

I added this line following the suggestion from @xunliu. Perhaps it would be better to only include init/jupyter/packages instead? Let me know if that works.
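If ignoring every packages directory is too broad, the pattern could be narrowed to just the Jupyter download directory, along the lines suggested above (a sketch; the exact path is assumed from this discussion):

```
init/jupyter/packages/
```

This keeps other future packages directories visible to git while still excluding the downloaded jars.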

Member

@FANNG1 Because we don't need to commit these downloaded jar files to the git repo.

Comment thread docker-compose.yaml
- ./init/jupyter:/tmp/gravitino
entrypoint: /bin/bash /tmp/gravitino/init.sh
environment:
- HADOOP_CLASSPATH=/tmp/gravitino/packages/hadoop-2.7.3/etc/hadoop:/tmp/gravitino/packages/hadoop-2.7.3/share/hadoop/common/lib/*:/tmp/gravitino/packages/hadoop-2.7.3/share/hadoop/common/*:/tmp/gravitino/packages/hadoop-2.7.3/share/hadoop/hdfs:/tmp/gravitino/packages/hadoop-2.7.3/share/hadoop/hdfs/lib/*:/tmp/gravitino/packages/hadoop-2.7.3/share/hadoop/hdfs/*:/tmp/gravitino/packages/hadoop-2.7.3/share/hadoop/yarn/lib/*:/tmp/gravitino/packages/hadoop-2.7.3/share/hadoop/yarn/*:/tmp/gravitino/packages/hadoop-2.7.3/share/hadoop/mapreduce/lib/*:/tmp/gravitino/packages/hadoop-2.7.3/share/hadoop/mapreduce/*:/tmp/gravitino/packages/contrib/capacity-scheduler/*.jar

Is it necessary to add the YARN and MapReduce jars to the CLASSPATH? It seems only HDFS is needed.
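Following that suggestion, the HADOOP_CLASSPATH could be trimmed to just the common and hdfs entries, dropping the yarn, mapreduce, and capacity-scheduler paths. An untested sketch of the docker-compose.yaml line, assuming only the HDFS client is actually used:

```yaml
environment:
  - HADOOP_CLASSPATH=/tmp/gravitino/packages/hadoop-2.7.3/etc/hadoop:/tmp/gravitino/packages/hadoop-2.7.3/share/hadoop/common/lib/*:/tmp/gravitino/packages/hadoop-2.7.3/share/hadoop/common/*:/tmp/gravitino/packages/hadoop-2.7.3/share/hadoop/hdfs:/tmp/gravitino/packages/hadoop-2.7.3/share/hadoop/hdfs/lib/*:/tmp/gravitino/packages/hadoop-2.7.3/share/hadoop/hdfs/*
```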

Comment thread docker-compose.yaml
environment:
- HADOOP_CLASSPATH=/tmp/gravitino/packages/hadoop-2.7.3/etc/hadoop:/tmp/gravitino/packages/hadoop-2.7.3/share/hadoop/common/lib/*:/tmp/gravitino/packages/hadoop-2.7.3/share/hadoop/common/*:/tmp/gravitino/packages/hadoop-2.7.3/share/hadoop/hdfs:/tmp/gravitino/packages/hadoop-2.7.3/share/hadoop/hdfs/lib/*:/tmp/gravitino/packages/hadoop-2.7.3/share/hadoop/hdfs/*:/tmp/gravitino/packages/hadoop-2.7.3/share/hadoop/yarn/lib/*:/tmp/gravitino/packages/hadoop-2.7.3/share/hadoop/yarn/*:/tmp/gravitino/packages/hadoop-2.7.3/share/hadoop/mapreduce/lib/*:/tmp/gravitino/packages/hadoop-2.7.3/share/hadoop/mapreduce/*:/tmp/gravitino/packages/contrib/capacity-scheduler/*.jar
- NB_USER=my-username
- GRANT_SUDO=yes

Why add the environment variables below and run as root?

GRANT_SUDO=yes
CHOWN_HOME=yes

Author

The GRANT_SUDO=yes and CHOWN_HOME=yes settings were added to allow users to install the JDK and PyFlink directly within the Jupyter notebook environment.



HADOOP_VERSION="2.7.3"
HADOOP_URL="https://archive.apache.org/dist/hadoop/common/hadoop-${HADOOP_VERSION}/hadoop-${HADOOP_VERSION}.tar.gz"

It may take too much time to download Hadoop in a low-bandwidth environment; is only the HDFS client needed here?

Author

Thank you for the feedback. I will work on optimizing the dependencies by using only the HDFS client instead of the full Hadoop installation.

Author

@FANNG1 , @coolderli I was wondering if it's possible to download only a subset of Hadoop components instead of the full package. Would it be possible to provide some guidance or assistance on how to achieve this? Any help would be greatly appreciated!
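One option worth checking: instead of the full Hadoop tarball, fetch only the shaded Hadoop client jars from Maven Central. Note this is a sketch under an assumption: the shaded hadoop-client-api and hadoop-client-runtime bundles exist for Hadoop 3.x (e.g. 3.3.6), not for 2.7.3, so it would require moving the playground to a 3.x client.

```shell
# Build Maven Central URLs for the shaded Hadoop client jars
# (hypothetical version choice; adjust to whatever 3.x line fits).
HADOOP_CLIENT_VERSION="3.3.6"
MAVEN_BASE="https://repo1.maven.org/maven2/org/apache/hadoop"

client_jar_url() {
  local artifact="$1"
  echo "${MAVEN_BASE}/${artifact}/${HADOOP_CLIENT_VERSION}/${artifact}-${HADOOP_CLIENT_VERSION}.jar"
}

# Usage (network access required):
# curl -fsSLO "$(client_jar_url hadoop-client-api)"
# curl -fsSLO "$(client_jar_url hadoop-client-runtime)"
```

Two jars of roughly 20-30 MB each, versus the ~200 MB full distribution.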


There seems to be no link to download a Hadoop client bundle jar that contains the dependency jars.


@xunliu WDYT?

Member

@TungYuChiang
I think we can use https://hub.docker.com/_/flink/tags to fix this problem, just like Spark in the gravitino-playground.

@FANNG1

FANNG1 commented Nov 6, 2024

@coolderli could you help to review this PR?

Comment thread init/jupyter/jupyter-dependency.sh Outdated
FLINK_HIVE_CONNECTOR_MD5="${FLINK_HIVE_CONNECTOR_JAR}.md5"
download_and_verify "${FLINK_HIVE_CONNECTOR_JAR}" "${FLINK_HIVE_CONNECTOR_MD5}" "${jupyter_dir}"

GRAVITINO_FLINK_JAR="https://repo1.maven.org/maven2/org/apache/gravitino/gravitino-flink-1.18_2.12/0.6.1-incubating/gravitino-flink-1.18_2.12-0.6.1-incubating.jar"

@TungYuChiang Do we still need a gravitino-flink package when we have a gravitino-flink-runtime jar?

Author

I just tested it, and it seems we don’t need the gravitino-flink package. I'll go ahead and remove it.

Author

I've removed the redundant Gravitino-Flink package as discussed. Thanks for the suggestion!
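The jupyter-dependency.sh diff above calls a download_and_verify helper. A minimal sketch of such a helper, under the assumption that the .md5 sidecar file contains only the hex digest (the actual script's behavior may differ):

```shell
# Download a jar plus its .md5 sidecar into dest_dir, then compare the
# published digest with the digest of the downloaded file.
download_and_verify() {
  local jar_url="$1" md5_url="$2" dest_dir="$3"
  local jar_name expected actual
  jar_name="$(basename "${jar_url}")"

  curl -fsSL -o "${dest_dir}/${jar_name}" "${jar_url}"
  curl -fsSL -o "${dest_dir}/${jar_name}.md5" "${md5_url}"

  expected="$(cat "${dest_dir}/${jar_name}.md5")"
  actual="$(md5sum "${dest_dir}/${jar_name}" | awk '{print $1}')"

  if [ "${expected}" != "${actual}" ]; then
    echo "checksum mismatch for ${jar_name}" >&2
    return 1
  fi
}
```

Failing closed on a checksum mismatch matters here because the script wires the downloaded jars straight into the Flink and Jupyter classpaths.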

ls "${jupyter_dir}/packages/" | xargs -I {} rm "${jupyter_dir}/packages/"{}
find "${jupyter_dir}/../spark/packages/" | grep jar | xargs -I {} ln {} "${jupyter_dir}/packages/"

FLINK_HIVE_CONNECTOR_JAR="https://repo1.maven.org/maven2/org/apache/flink/flink-sql-connector-hive-2.3.10_2.12/1.20.0/flink-sql-connector-hive-2.3.10_2.12-1.20.0.jar"

@FANNG1 Should we package the flink-sql-connector to the gravitino-flink-runtime? The kyuubi-spark-connector-hive is packaged to the spark-runtime. Do we need to maintain consistency?


@@ -0,0 +1,226 @@
{

This case looks good to me. Maybe we can add more operations like ALTER TABLE and DROP TABLE later.

Author

I tested the operations, and DROP works as expected. However, I encountered an error when trying to use ALTER:

Py4JJavaError: An error occurred while calling o31.executeSql.
: java.lang.NoClassDefFoundError: org/apache/gravitino/shaded/org/apache/commons/compress/utils/Lists

I find it quite strange because other commands work fine. I’m wondering if the issue could be related to the gravitino-flink-connector-runtime. Any insights or suggestions would be appreciated.


@TungYuChiang Could you replace import org.apache.commons.compress.utils.Lists; with import com.datastrato.gravitino.shaded.com.google.common.collect.Lists; in https://github.com/apache/gravitino/blob/main/flink-connector/flink/src/main/java/org/apache/gravitino/flink/connector/catalog/BaseCatalog.java#L32, rebuild it, and then try again? Thanks.


@TungYuChiang I created an issue, apache/gravitino#5534. Could you submit a patch to fix it?

Author
@TungYuChiang TungYuChiang Nov 12, 2024

@coolderli Thank you for your guidance!
I have already left a comment under apache/gravitino#3354.

4 participants