[core]JindoFileIO support cache using JindoCache #6949
@timmyyao
CREATE CATALOG paimon_catalog WITH (
    'type' = 'paimon',
    'metastore' = 'hive',
    'warehouse' = 'oss://xxxx.oss-dls.aliyuncs.com/user/hive/warehouse'
);
The main purpose of this PR is to provide a table-level cache policy that is managed on the DLF server side, with the cache backed by JindoCache. The related configs are therefore generated in RESTCatalog and RESTTokenFileIO according to the policy defined on the server side. If you only use a filesystem or Hive catalog with JindoFileIO, you can simply apply the cache configs to JindoSDK (fs.xengine, fs.jindocache.namespace.rpc.address), as long as you have a JindoCache cluster. In that case there is no server-side support for cache management; you can only decide on the client side whether to enable the cache.
In addition, this PR allows the cache policy to be configured on the client side. Instead of directly setting fs.xengine=true, you can set dlf.io-cache-enabled and dlf.io-cache.policy to define a more detailed cache policy (e.g. meta,read,write), and JindoFileIO will apply the cache only to specific files to guarantee consistency. This makes it safer to use the cache for Paimon tables.
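As a rough illustration of the client-side policy described above, the following Java sketch parses a comma-separated policy string such as meta,read,write and decides whether a given kind of access should go through the cache. It is not the PR's implementation: the class `CachePolicySketch`, the enum `CacheKind`, and both methods are hypothetical names, and treating each token as an independent switch is an assumption.

```java
import java.util.EnumSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch only; names and semantics are assumptions, not Paimon or JindoSDK APIs.
public class CachePolicySketch {

    /** Kinds of access a policy may cover; names mirror the "meta,read,write" example. */
    enum CacheKind { META, READ, WRITE }

    /** Parses a comma-separated policy string into a set of kinds; blank input yields an empty set. */
    static Set<CacheKind> parsePolicy(String policy) {
        Set<CacheKind> kinds = EnumSet.noneOf(CacheKind.class);
        if (policy == null || policy.trim().isEmpty()) {
            return kinds;
        }
        for (String token : policy.split(",")) {
            String trimmed = token.trim();
            if (!trimmed.isEmpty()) {
                // valueOf throws IllegalArgumentException for an unknown token.
                kinds.add(CacheKind.valueOf(trimmed.toUpperCase(Locale.ROOT)));
            }
        }
        return kinds;
    }

    /** The cache is used only when it is enabled and the policy covers this kind of access. */
    static boolean shouldCache(boolean enabled, Set<CacheKind> policy, CacheKind kind) {
        return enabled && policy.contains(kind);
    }

    public static void main(String[] args) {
        Set<CacheKind> policy = parsePolicy("meta, read");
        System.out.println(shouldCache(true, policy, CacheKind.READ));   // true
        System.out.println(shouldCache(true, policy, CacheKind.WRITE));  // false
        System.out.println(shouldCache(false, policy, CacheKind.READ));  // false
    }
}
```

Keeping the enable flag separate from the policy set mirrors the two options mentioned in the comment: one turns the cache on or off, the other narrows what it applies to.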
4ccf8f5 to d0673cf
        .withDescription("REST Catalog DLF OSS endpoint.");

public static final ConfigOption<Boolean> DLF_FILE_IO_CACHE_ENABLED =
        ConfigOptions.key("dlf.io-cache-enabled")
just io-cache.enabled.
fixed
        .withDescription(
                "Enable cache for visiting files using file io (currently only JindoFileIO supports cache).");
public static final ConfigOption<String> DLF_FILE_IO_CACHE_WHITELIST_PATH =
        ConfigOptions.key("dlf.io-cache.whitelist-path")
just io-cache.whitelist-path
fixed
        .withDescription(
                "Cache is only applied to paths which contain the specified pattern, and * means all paths.");
public static final ConfigOption<String> DLF_FILE_IO_CACHE_POLICY =
        ConfigOptions.key("dlf.io-cache.policy")
just io-cache.policy
fixed
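The whitelist option's description ("paths which contain the specified pattern, and * means all paths") could be read as a plain substring check with a wildcard shortcut. The sketch below shows that reading only. `WhitelistSketch` and `isWhitelisted` are hypothetical names, the example paths are made up, and treating an unset or empty whitelist as "cache nothing" is an assumption that the diff does not confirm.

```java
// Hypothetical sketch only; it does not reproduce the PR's matching logic.
public class WhitelistSketch {

    /** Returns true when the path should be cached under the given whitelist pattern. */
    static boolean isWhitelisted(String path, String whitelist) {
        if (whitelist == null || whitelist.isEmpty()) {
            return false; // assumption: no whitelist means no path is cached
        }
        if ("*".equals(whitelist)) {
            return true; // "*" means all paths
        }
        return path.contains(whitelist); // otherwise a plain substring match
    }

    public static void main(String[] args) {
        String path = "oss://bucket/warehouse/db.db/orders/bucket-0/data-1.orc";
        System.out.println(isWhitelisted(path, "*"));         // true
        System.out.println(isWhitelisted(path, "/orders/"));  // true
        System.out.println(isWhitelisted(path, "/users/"));   // false
    }
}
```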
Purpose
We tend to persist the files of Paimon tables on object storage such as S3 and OSS. However, the throughput and latency of such storage are usually limited, so IO performance becomes the bottleneck in high-concurrency ETL or OLAP scenarios. We therefore introduce a cache for when FileIO accesses files within a Paimon table. The solution is to support JindoCache, a distributed filesystem cache system, in JindoFileIO. A table-level cache policy is also implemented, which can be supported by a REST server such as DLF.