Expose the cached bdf.parquet and bdf.csv files via url for external applications #1772
be-smith wants to merge 8 commits into
…nerate the cache if not already present
… if we already generated it through this route
datalab
| Property | Value |
|---|---|
| Project | datalab |
| Branch Review | bes/exposing_echem_cache |
| Run duration | 07m 55s |
| Committer | Ben Smith |
| Test results | 0, 0, 0, 0, 478 |
…n identically named files from each directory
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #1772 +/- ##
==========================================
+ Coverage 78.81% 79.30% +0.48%
==========================================
Files 83 83
Lines 7223 7401 +178
==========================================
+ Hits 5693 5869 +176
- Misses 1530 1532 +2
Pull request overview
This PR extends the echem “cycle” block output to expose a cached parquet BDF export via parquet_url and adds a dedicated endpoint for external tools to download either the cached parquet or CSV BDF representation for a given item/block.
Changes:
- Add `parquet_url` alongside `bdf_url` when processing echem cycle blocks.
- Add `GET /items/<item_id>/blocks/<block_id>/bdf?format=parquet|csv` to serve (and generate on-demand) cached BDF exports.
- Add server tests covering `parquet_url` population and on-demand cache generation/serving.
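For an external tool, calling the new endpoint mostly comes down to building the right URL. The following is a minimal sketch of that step only. The helper name `bdf_export_url` and the base URL in the usage note are illustrative assumptions, not part of pydatalab. The path and the `format` query parameter follow the route described in the overview.

```python
from urllib.parse import quote, urlencode

# Formats named in the PR overview; anything else is rejected client-side.
ALLOWED_FORMATS = ("parquet", "csv")


def bdf_export_url(base_url: str, item_id: str, block_id: str, fmt: str = "parquet") -> str:
    """Build the download URL for a block's cached BDF export.

    Hypothetical helper: only the route shape comes from the PR.
    """
    if fmt not in ALLOWED_FORMATS:
        raise ValueError(f"format must be one of {ALLOWED_FORMATS}, got {fmt!r}")
    # Percent-encode the IDs so unusual characters cannot alter the path.
    path = f"/items/{quote(item_id, safe='')}/blocks/{quote(block_id, safe='')}/bdf"
    return f"{base_url.rstrip('/')}{path}?{urlencode({'format': fmt})}"
```

A caller would pass the resulting URL to whatever HTTP client it already uses, for example `bdf_export_url("https://datalab.example.org", "cell-1", "abc123")`, where the host is a placeholder.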
Reviewed changes
Copilot reviewed 3 out of 5 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| pydatalab/src/pydatalab/apps/echem/blocks.py | Produces parquet cache path and populates parquet_url in cycle block web data. |
| pydatalab/src/pydatalab/routes/v0_1/files.py | Adds a new route to serve cached parquet/CSV and generate caches on-demand. |
| pydatalab/tests/server/test_echem_block.py | Adds coverage for parquet_url and the new BDF cache download endpoint. |
    pydatalab.mongo.flask_mongo.db.items.update_one(
        {"item_id": item_id, f"blocks_obj.{block_id}": {"$exists": True}},
        {
            "$set": {
                f"blocks_obj.{block_id}.bdf_url": block.data.get("bdf_url"),
                f"blocks_obj.{block_id}.parquet_url": block.data.get("parquet_url"),
            }
        },
    )
@ml-evs this fix is simple enough, but the solution still generates the cache file on read only. I personally think that, if we actually make the cache file, populating the URL so that the cache is visible is probably the right thing to do. What are your thoughts?
Currently I have implemented the fix so the URLs won't be updated.
…tem and filename has an expected cache extension
        return jsonify({"status": "success"}), 200


    @FILES.route("/items/<string:item_id>/blocks/<string:block_id>/bdf", methods=["GET"])
@FILES.route("/files/<string:file_id>/<string:filename>/formats", methods=["GET"])
Maybe we should think about two routes more generally for blocks? The above could return e.g.,
{"data": {"formats": ["bdf+csv", "bdf+parquet"]}}
then a second route
@FILES.route("/files/<string:file_id>/<string:filename>/formats/<string:format>", methods=["GET"])
that can match on those keys and let you download the file.
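The two-route idea above could be backed by a small registry of format keys, sketched here without any Flask wiring. Everything in this block is an assumption for illustration: the `FORMATS` mapping, the function names, and the rule that the cached file name is the original name plus a suffix. Only the keys `bdf+csv` and `bdf+parquet` and the shape of the JSON payload come from the comment.

```python
# Hypothetical registry: format key -> suffix of the cached export.
FORMATS = {
    "bdf+csv": ".bdf.csv",
    "bdf+parquet": ".bdf.parquet",
}


def formats_payload() -> dict:
    """Body the first route (.../formats) could return."""
    return {"data": {"formats": sorted(FORMATS)}}


def cached_filename(filename: str, format_key: str) -> str:
    """Name of the cached file the second route (.../formats/<format>) would serve.

    Raises KeyError for an unknown key so a route handler can map it to a 404.
    The naming rule (append the suffix) is an assumption, not pydatalab's behaviour.
    """
    return filename + FORMATS[format_key]
```

Keeping the keys in one mapping means the listing route and the download route cannot drift apart when a new format is added.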
ml-evs
left a comment
As discussed, let's put this on hold for now, but extract the changes to the echem block into another PR.
This PR adds a `parquet_url` to the echem cycle block data, alongside the existing `bdf_url`, containing a link to the cached .bdf.parquet file. This is smaller and faster to load than the CSV and is more useful for external tools (e.g. datalab-plot).

It also adds a new route `GET /items/<item_id>/blocks/<block_id>/bdf?format=parquet|csv` that serves either cache file directly. If the cache doesn't yet exist (e.g. the block was added but never processed), the route generates it on demand before serving. This gives external analysis tools a single clean endpoint to pull processed echem data without needing to know the internal file structure. This route can also be used in the future for analysis blocks to fetch the fastest form of the echem data, similar to #1737.
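The "generate on demand before serving" step can be illustrated with a generic check-then-create helper. This is a sketch under assumptions, not the PR's implementation: `ensure_cache` and the `generate` callback are hypothetical names, and writing to a temporary file before renaming is a design choice added here so that a half-written cache is never served.

```python
from pathlib import Path
from typing import Callable


def ensure_cache(cache_path: Path, generate: Callable[[Path], None]) -> bool:
    """Create the cache file if it is missing.

    Returns True if this call generated the file, False if it already existed.
    `generate` is a hypothetical callback that writes the export to the given path.
    """
    if cache_path.exists():
        return False
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    # Write to a sibling temp file, then rename, so readers never see a partial file.
    tmp_path = cache_path.with_name(cache_path.name + ".tmp")
    generate(tmp_path)
    tmp_path.replace(cache_path)
    return True
```

A route handler would call this before serving the file, and the return value makes it easy to test that a second request reuses the existing cache.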