Add Flink-Hive use case in Jupyter notebook (#91) #98
TungYuChiang wants to merge 4 commits into apache:main
This commit adds a Flink-Hive use case to the Jupyter notebook
Force-pushed from 2068f38 to 833c4e0, then from 833c4e0 to df2fc17.
| @@ -1,2 +1,3 @@ | |||
| **/.idea | |||
| **/.DS_Store | |||
| **/packages/** | |||
It seems a little odd to ignore packages here.
I added this line following the suggestion from @xunliu. Perhaps it would be better to include only init/jupyter/packages instead? Let me know if that works.
@FANNG1 Because we don't need to commit these downloaded jar files to the git repo.
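One way to compare the broad rule with the narrower one suggested above is to ask git which paths each rule ignores. The sketch below is only an illustration: `is_ignored` is a hypothetical helper, `demo.jar` is a made-up file name, and the check runs in a throwaway repository rather than in the playground itself.

```shell
# Hypothetical helper: report whether a single .gitignore rule ignores a path.
# It builds a temporary repository so the real working tree is never touched.
is_ignored() (
  pattern="$1"   # one .gitignore line, e.g. '**/packages/**'
  path="$2"      # path relative to the repository root
  repo="$(mktemp -d)"
  git -C "$repo" init -q
  printf '%s\n' "$pattern" > "$repo/.gitignore"
  mkdir -p "$repo/$(dirname "$path")"
  : > "$repo/$path"                       # create an empty stand-in file
  git -C "$repo" check-ignore -q "$path"  # exit 0 if ignored, 1 if not
  status=$?
  rm -rf "$repo"
  return "$status"
)

# demo.jar is a made-up name standing in for any downloaded jar.
for rule in '**/packages/**' 'init/jupyter/packages/'; do
  for jar in init/jupyter/packages/demo.jar init/spark/packages/demo.jar; do
    if is_ignored "$rule" "$jar"; then
      echo "$rule ignores $jar"
    else
      echo "$rule does not ignore $jar"
    fi
  done
done
```

The narrower rule only covers the Jupyter download directory, so any other packages directory would need its own entry.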
| - ./init/jupyter:/tmp/gravitino | ||
| entrypoint: /bin/bash /tmp/gravitino/init.sh | ||
| environment: | ||
| - HADOOP_CLASSPATH=/tmp/gravitino/packages/hadoop-2.7.3/etc/hadoop:/tmp/gravitino/packages/hadoop-2.7.3/share/hadoop/common/lib/*:/tmp/gravitino/packages/hadoop-2.7.3/share/hadoop/common/*:/tmp/gravitino/packages/hadoop-2.7.3/share/hadoop/hdfs:/tmp/gravitino/packages/hadoop-2.7.3/share/hadoop/hdfs/lib/*:/tmp/gravitino/packages/hadoop-2.7.3/share/hadoop/hdfs/*:/tmp/gravitino/packages/hadoop-2.7.3/share/hadoop/yarn/lib/*:/tmp/gravitino/packages/hadoop-2.7.3/share/hadoop/yarn/*:/tmp/gravitino/packages/hadoop-2.7.3/share/hadoop/mapreduce/lib/*:/tmp/gravitino/packages/hadoop-2.7.3/share/hadoop/mapreduce/*:/tmp/gravitino/packages/contrib/capacity-scheduler/*.jar |
Is it necessary to add the YARN and MapReduce jars to the CLASSPATH? It seems only HDFS is needed.
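A trimmed classpath along the lines of this question could be assembled as in the sketch below. `build_hdfs_classpath` is a hypothetical helper, and the choice of entries (config, common, and HDFS only) is an assumption taken from the question; whether the Flink-Hive notebook actually runs without the YARN and MapReduce entries has not been verified here.

```shell
# Hypothetical helper: print a HADOOP_CLASSPATH limited to the config,
# common, and HDFS entries of a Hadoop installation directory.
# The '*' wildcards are kept literal so the JVM expands them, not the shell.
build_hdfs_classpath() (
  hadoop_home="$1"
  share="${hadoop_home}/share/hadoop"
  printf '%s:%s:%s:%s:%s:%s\n' \
    "${hadoop_home}/etc/hadoop" \
    "${share}/common/lib/*" \
    "${share}/common/*" \
    "${share}/hdfs" \
    "${share}/hdfs/lib/*" \
    "${share}/hdfs/*"
)

# Example: print the value for the path used in the compose file.
build_hdfs_classpath /tmp/gravitino/packages/hadoop-2.7.3
```

The printed string could then replace the longer HADOOP_CLASSPATH value in the compose file if testing shows the extra entries are not needed.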
| environment: | ||
| - HADOOP_CLASSPATH=/tmp/gravitino/packages/hadoop-2.7.3/etc/hadoop:/tmp/gravitino/packages/hadoop-2.7.3/share/hadoop/common/lib/*:/tmp/gravitino/packages/hadoop-2.7.3/share/hadoop/common/*:/tmp/gravitino/packages/hadoop-2.7.3/share/hadoop/hdfs:/tmp/gravitino/packages/hadoop-2.7.3/share/hadoop/hdfs/lib/*:/tmp/gravitino/packages/hadoop-2.7.3/share/hadoop/hdfs/*:/tmp/gravitino/packages/hadoop-2.7.3/share/hadoop/yarn/lib/*:/tmp/gravitino/packages/hadoop-2.7.3/share/hadoop/yarn/*:/tmp/gravitino/packages/hadoop-2.7.3/share/hadoop/mapreduce/lib/*:/tmp/gravitino/packages/hadoop-2.7.3/share/hadoop/mapreduce/*:/tmp/gravitino/packages/contrib/capacity-scheduler/*.jar | ||
| - NB_USER=my-username | ||
| - GRANT_SUDO=yes |
Why add the environment variables below and run as root?
GRANT_SUDO=yes
CHOWN_HOME=yes
The GRANT_SUDO=yes and CHOWN_HOME=yes settings were added to allow users to install the JDK and PyFlink directly within the Jupyter notebook environment.
| HADOOP_VERSION="2.7.3" | ||
| HADOOP_URL="https://archive.apache.org/dist/hadoop/common/hadoop-${HADOOP_VERSION}/hadoop-${HADOOP_VERSION}.tar.gz" |
It may take too long to download Hadoop on a slow network. Is only the HDFS client needed here?
Thank you for the feedback. I will work on optimizing the dependencies by using only the HDFS client instead of the full Hadoop installation.
@FANNG1, @coolderli I was wondering if it's possible to download only a subset of Hadoop components instead of the full package. Would it be possible to provide some guidance or assistance on how to achieve this? Any help would be greatly appreciated!
It seems there is no link to download a Hadoop client bundle jar that contains the dependency jars.
@TungYuChiang I think maybe we can use https://hub.docker.com/_/flink/tags to fix this problem, just like Spark in the gravitino-playground.
@coolderli could you help to review this PR?
| FLINK_HIVE_CONNECTOR_MD5="${FLINK_HIVE_CONNECTOR_JAR}.md5" | ||
| download_and_verify "${FLINK_HIVE_CONNECTOR_JAR}" "${FLINK_HIVE_CONNECTOR_MD5}" "${jupyter_dir}" | ||
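The body of `download_and_verify` is not shown in this diff. As a rough illustration of the verification half only, the sketch below compares a file's MD5 digest with the contents of a `.md5` file. `verify_md5` is a hypothetical name, not the playground's function, and it assumes GNU `md5sum` is available and that the first whitespace-separated field of the `.md5` file is the hex digest.

```shell
# Hypothetical helper, not the playground's download_and_verify.
# Succeeds (exit 0) only if the file's MD5 digest matches the expected one.
verify_md5() (
  file="$1"       # path to the downloaded jar
  md5_file="$2"   # path to the matching .md5 file
  # Assumption: the digest is the first field on the first line.
  expected="$(awk 'NR == 1 { print tolower($1) }' "$md5_file")"
  actual="$(md5sum "$file" | awk '{ print $1 }')"
  # An empty or missing digest counts as a failure.
  [ -n "$expected" ] && [ "$expected" = "$actual" ]
)

# Example usage with hypothetical local paths:
# verify_md5 packages/connector.jar packages/connector.jar.md5 \
#   || echo "checksum mismatch" >&2
```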
| GRAVITINO_FLINK_JAR="https://repo1.maven.org/maven2/org/apache/gravitino/gravitino-flink-1.18_2.12/0.6.1-incubating/gravitino-flink-1.18_2.12-0.6.1-incubating.jar" |
@TungYuChiang Do we still need a gravitino-flink package when we have a gravitino-flink-runtime jar?
I just tested it, and it seems we don’t need the gravitino-flink package. I'll go ahead and remove it.
I've removed the redundant Gravitino-Flink package as discussed. Thanks for the suggestion!
| ls "${jupyter_dir}/packages/" | xargs -I {} rm "${jupyter_dir}/packages/"{} | ||
| find "${jupyter_dir}/../spark/packages/" | grep jar | xargs -I {} ln {} "${jupyter_dir}/packages/" | ||
| FLINK_HIVE_CONNECTOR_JAR="https://repo1.maven.org/maven2/org/apache/flink/flink-sql-connector-hive-2.3.10_2.12/1.20.0/flink-sql-connector-hive-2.3.10_2.12-1.20.0.jar" |
@FANNG1 Should we package the flink-sql-connector into the gravitino-flink-runtime? The kyuubi-spark-connector-hive is packaged into the spark-runtime. Do we need to maintain consistency?
I've removed the redundant Gravitino-Flink package as discussed. Thanks for the suggestion!
| @@ -0,0 +1,226 @@ | |||
| { | |||
This case looks good to me. Maybe we can add more operations, such as ALTER TABLE and DROP TABLE, later.
I tested the operations, and DROP works as expected. However, I encountered some errors when trying to use ALTER
Py4JJavaError: An error occurred while calling o31.executeSql.
: java.lang.NoClassDefFoundError: org/apache/gravitino/shaded/org/apache/commons/compress/utils/Lists
I find it quite strange because other commands work fine. I’m wondering if the issue could be related to the gravitino-flink-connector-runtime. Any insights or suggestions would be appreciated.
@TungYuChiang Could you replace the import org.apache.commons.compress.utils.Lists; with import com.datastrato.gravitino.shaded.com.google.common.collect.Lists; in https://github.com/apache/gravitino/blob/main/flink-connector/flink/src/main/java/org/apache/gravitino/flink/connector/catalog/BaseCatalog.java#L32 and rebuild it? Then try it again. Thanks.
@TungYuChiang I created an issue, apache/gravitino#5534. Could you submit a patch to fix it?
@coolderli Thank you for your guidance!
I have already left a comment under apache/gravitino#3354.