Skip to content

Conversation

@benjeffery
Copy link
Member

Add a metadata codec that supports JSON and a binary blob.

@codecov
Copy link

codecov bot commented Oct 24, 2025

Codecov Report

❌ Patch coverage is 89.31298% with 14 lines in your changes missing coverage. Please review.
✅ Project coverage is 89.76%. Comparing base (25a26a7) to head (06e279a).

Files with missing lines Patch % Lines
python/tskit/metadata.py 83.13% 8 Missing and 6 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3306      +/-   ##
==========================================
- Coverage   89.76%   89.76%   -0.01%     
==========================================
  Files          29       29              
  Lines       31292    31422     +130     
  Branches     5738     5757      +19     
==========================================
+ Hits        28089    28205     +116     
- Misses       1794     1802       +8     
- Partials     1409     1415       +6     
Flag Coverage Δ
c-tests 86.81% <100.00%> (+0.03%) ⬆️
lwt-tests 80.38% <ø> (ø)
python-c-tests 87.12% <ø> (ø)
python-tests 98.68% <83.13%> (-0.17%) ⬇️
python-tests-no-jit 33.31% <14.45%> (-0.20%) ⬇️
python-tests-numpy1 49.92% <14.45%> (-0.37%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
c/tskit/core.c 95.89% <100.00%> (+0.26%) ⬆️
c/tskit/core.h 100.00% <ø> (ø)
python/tskit/metadata.py 96.79% <83.13%> (-2.10%) ⬇️
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@petrelharp
Copy link
Contributor

This is the right way.

@bhaller
Copy link

bhaller commented Oct 24, 2025

Hi @benjeffery! This looks like it might provide a reasonable solution, sure. You prefer it to the proposal you made before, I take it? Because at least it's just more top-level metadata, rather than an entirely new thing?

Anyhow, I guess my main question is how this will look from the C API perspective. I see you've got a special header:

_HDR = struct.Struct("<4sBQQ") # magic, version, json_len, blob_len

which you parse in the new codec to split out the JSON and binary parts. Do you intend to provide some kind of wrapper for this in C, so that clients don't need to hard-code the structure of this header, at write and at read? (Or at least make the "magic" and the codec version be public in a C header, so that those aspects of it don't need to be hard-coded in SLiM.)

@petrelharp will this work for you on the pyslim end of things? Ah, I see your comment above now. :->

@bhaller
Copy link

bhaller commented Oct 28, 2025

...Anyhow, I guess my main question is how this will look from the C API perspective. I see you've got a special header:

_HDR = struct.Struct("<4sBQQ") # magic, version, json_len, blob_len

which you parse in the new codec to split out the JSON and binary parts. Do you intend to provide some kind of wrapper for this in C, so that clients don't need to hard-code the structure of this header, at write and at read? (Or at least make the "magic" and the codec version be public in a C header, so that those aspects of it don't need to be hard-coded in SLiM.)

Hi @benjeffery, just a ping on this; wondering about the C API plan for this PR. Apart from that concern, this proposal looks great to me; once it's merged @petrelharp and I can update the tskit copy in SLiM to this and test it out. :-> Thanks!

@benjeffery
Copy link
Member Author

Hi @benjeffery, just a ping on this

Sorry, school holidays here so had some leave. I'm not sure about this, it's been a long standing principle that the C code doesn't touch the contents of metadata or have anything to do with codecs. I can see a small helper that just extracts the binary buffer being helpful though. I've added that in the last commit.

@bhaller
Copy link

bhaller commented Oct 30, 2025

Sorry, school holidays here so had some leave.

:->

I'm not sure about this, it's been a long standing principle that the C code doesn't touch the contents of metadata or have anything to do with codecs. I can see a small helper that just extracts the binary buffer being helpful though. I've added that in the last commit.

This looks good to me, thanks. I can see the desire for the C code not to get into metadata/codec stuff, but this seems like a good exception to that principle to carve out, otherwise every client of this codec would have to reinvent this (potentially bug-prone) wheel. I think this is the right design.

As far as I can tell there is no dependence here on the JSON string being null-terminated; is that correct? Looks like after calling tsk_json_binary_metadata_get_blob() the length of the JSON can just be determined by the caller based upon the offset to, or the length of, the binary blob.

And since the header length and magic bytes and such are in core.c and thus private, it looks like the intent is for the caller to blindly call tsk_json_binary_metadata_get_blob() without knowing whether the metadata is in fact compliant, and use the return value to determine compliance and raise the appropriate user-visible error? Which seems right, just checking.

I'm happy with this, thanks!

@benjeffery
Copy link
Member Author

As far as I can tell there is no dependence here on the JSON string being null-terminated; is that correct?

Yes, that's right as we have fixed lengths. Makes me think the helper should return pointers/lengths to both parts so that API users can grab and decode the JSON?

@bhaller
Copy link

bhaller commented Oct 31, 2025

As far as I can tell there is no dependence here on the JSON string being null-terminated; is that correct?

Yes, that's right as we have fixed lengths. Makes me think the helper should return pointers/lengths to both parts so that API users can grab and decode the JSON?

That would perhaps be an improvement, yes. It's easy enough for the client to do the math as it is now; but it does seem better for the helper to do it, so that the client is not building in assumptions about the layout of the metadata. Actually, now that I think about it more, the client would also have to build in an assumption about the number of bytes in the codec's header, wouldn't they? If that's true, then the helper should definitely return pointers/lengths for both parts; we don't want the client's code knowing that sort of thing.

@benjeffery
Copy link
Member Author

Added returning of json with blob.

@jeromekelleher
Copy link
Member

This generally looks good to me. One question is should we provide a "setter" function in C also if we're providing a getter?

@bhaller
Copy link

bhaller commented Nov 3, 2025

This generally looks good to me. One question is should we provide a "setter" function in C also if we're providing a getter?

Good point, I wasn't thinking about that end of things! It does seem like a good idea to not have the client hard-coding the structure of the header and such.

@benjeffery
Copy link
Member Author

This generally looks good to me. One question is should we provide a "setter" function in C also if we're providing a getter?

facepalm! Of course.

@benjeffery
Copy link
Member Author

@bhaller Talking this over with JK just now, we realised that it is a problem that the bytes in the binary blob are not easily interpretable by tskit users e.g. pyslim, or described in the schema. To fix this we propose a metadata codec that allows specification of JSON and binary in one schema. It would look like:

{
    "codec":"json+struct",
    "json": _A NORMAL JSON SCHEMA_,
    "struct": _A NORMAL STRUCT SCHEMA_,
}

It would be an error to have identical keys in the two schemas, as the returned metadata object merges the two key lists. This allows minimal changes to existing code as opposed to returning with a top-level split as in the schema.

This makes no difference at all to the C side, except that the binary should be encoded in a struct codec friendly way.

This is quite a bit more work, so I will have to return to it after tskit 1.0 is out the door, but as it needs no C changes you can go ahead and start developing assuming the current binary structure.

Hopefully that all makes sense!

@bhaller
Copy link

bhaller commented Nov 3, 2025

@bhaller Talking this over with JK just now, we realised that it is a problem that the bytes in the binary blob are not easily interpretable by tskit users e.g. pyslim, or described in the schema...

Hmm, I see. I'm fine with this if you guys want to get into it. I think I can see how having a schema for the binary might make life easier for pyslim, since then I guess all the binary metadata would be automatically be parsed and made accessible by tskit on the python side? If the current PR's design doesn't have a schema for the JSON metadata, then I agree that that is clearly a problem; having that schema is essential. If it is just that the current PR's design doesn't have a schema for the binary metadata, but does have one for the JSON (which I assumed was the case, but hadn't really thought about as much as I should have!), then I'm not sure how much of a shortcoming that really is in practice, but I can see the broad motivations for attaching a schema to everything – completeness, consistency, transparency, etc. @petrelharp any thoughts here on how useful this would be for pyslim?

It's unfortunate that this won't make it into tskit 1.0, just from the perspective of wanting tskit 1.0 to be kind of a completed vision with no loose ends; but of course that's an impossible dream, and this new proposal sounds like more work for sure, so no worries, push it out. :-O I will end up needing it in the next couple of months, though, for the next release of SLiM, so I hope we're not talking about pushing it too terribly far out, though; what timeframe would you estimate?

Purely out of curiosity (I have no need/desire to do this in practice at present), would this new metadata codec be usable elsewhere in tskit? I.e., could you declare that the metadata column on any given tskit table uses this new codec? Or would its use be limited to the top-level metadata? I don't really have a clear idea of how all the metadata bits and pieces fit together architecturally in tskit. :->

Thanks for taking this on! As always, things turn out to be more complicated than originally envisioned. There must be a Somebody's Law about that. I greatly appreciate your time and effort.

@benjeffery
Copy link
Member Author

tskit 1.0 is a forward-compatibility boundary, not a statement of completeness. This work involves no breaking changes so it doesn't matter which side of that boundary it sits.

I'm hoping to have 1.0 in the order of 1-2 weeks, and this work to follow after. Certainly fits in the timescale you have above.

@petrelharp
Copy link
Contributor

That sounds great, @benjeffery.

@bhaller
Copy link

bhaller commented Nov 30, 2025

Hi @benjeffery! Congrats on tskit 1.0! :-> Just a ping to say that this issue remains important for me over on the SLiM side. Thanks!

@benjeffery
Copy link
Member Author

Ok, I've added the layer that let's python read the binary. I think this is now complete?

@bhaller
Copy link

bhaller commented Dec 11, 2025

Awesome, thanks! I'm not sure exactly when I will be able to actually exercise this in SLiM; things are complicated in the multitrait branch right now, lol. I figure that when the time comes I will copy this PR's version of tskit into SLiM and try to build stuff on top of it, and report back here if I encounter any problems. Is it OK for you to have this PR sitting open for some weeks until that happens?

@benjeffery
Copy link
Member Author

Is it OK for you to have this PR sitting open for some weeks until that happens?

Yep, I'm on leave for a week now anyway. I finished this up before I left as was told it was becoming urgent for you.

@bhaller
Copy link

bhaller commented Dec 11, 2025

Is it OK for you to have this PR sitting open for some weeks until that happens?

Yep, I'm on leave for a week now anyway. I finished this up before I left as was told it was becoming urgent for you.

Thanks! Yes, I've been holding off on launching into some big work for several weeks until this functionality became available. Now that it has arrived, I'll plunge into that project. :-> Enjoy your leave! 🏝️

@hyanwong hyanwong changed the title Add json+binary codec Add json+struct codec Dec 11, 2025
@hyanwong
Copy link
Member

hyanwong commented Feb 7, 2026

Is this actually ready to review and hopefully merge @benjeffery ? I ask because I think @bhaller is going to be moving to get the new SLiM version compatible with tskit, so I guess a released tskit version with this functionality would be very useful?

@bhaller
Copy link

bhaller commented Feb 7, 2026

Is this actually ready to review and hopefully merge @benjeffery ? I ask because I think @bhaller is going to be moving to get the new SLiM version compatible with tskit, so I guess a released tskit version with this functionality would be very useful?

Yes! Sorry, the previous work on my big SLiM project took quite a while, so I didn't get to trying this out nearly as soon as I thought I would. But I am now optimistic that I am ready to start building on top of this feature in SLiM more or less now. :-> I thought this had been merged and released a while ago, actually! I'm not really sure how to build my own tskit from a feature branch like this, being rather clueless about both GitHub and Python, so yes, it'd be great if this came out in a release. Thanks for noticing this, @hyanwong!

@bhaller
Copy link

bhaller commented Feb 9, 2026

I realized that for my purposes right now I don't need my python tskit updated. I'll just hand-merge the C changes for now, and then I can start using them in SLiM. Of course once I get that working I'll need the python side of things to actually use it, but I can cross that bridge a little later; I expect it'll be a couple days of work to get the SLiM changes working. It'd be great to know what the plan is for this, though – as I wrote above, I actually thought this was already merged and released. :-O

@bhaller
Copy link

bhaller commented Feb 9, 2026

Starting to work with this. The stuff in test_core.c that puts together an example for testing would be useful to have in tskit. Most importantly, the JBLB magic bytes should either (a) be exposed in core.h, or (b) should be set up by a tskit function, probably? It's a bit weird for SLiM to duplicate this magic in its own code. Less importantly, it'd be nice to provide set_u64_le() as a utility function. Providing a "write" equivalent of tsk_json_struct_metadata_get_blob() seems tricky since it wouldn't want to do its own malloc for the result, so then the caller needs to know how many bytes of storage will be needed, so life gets complicated and messy. Anyhow, I'll just duplicate the magic and set_u64_le() in SLiM for now, but maybe this is worth thinking about a bit so that every tskit client using this stuff doesn't end up copying those same things into their own codebases?

@bhaller
Copy link

bhaller commented Feb 9, 2026

One more thing that occurs to me @benjeffery: it would perhaps be nice to provide a memory alignment guarantee for the binary portion (assuming that the metadata chunk as a whole is at an aligned address, which is guaranteed by malloc). On my end, when writing out the metadata, I could of course insert null bytes in between the JSON and the binary to align the binary portion; but this gets tricky because those null bytes would have to be described in the binary schema, and the number of null bytes is variable since the length of the JSON string is variable, so it makes everything a bit of a mess. (I guess SLiM would have eight different possible binary schemas that it uses, with from 0 to 7 null bytes at the start of the binary data as needed, eww.) I could put spaces at the end of the JSON string instead to produce alignment, but I'm worried that that would make the JSON string non-canonical and make tskit produce an error, so I think (?) that solution is out.

Having an alignment guarantee (I'd suggest 8-byte) would be nice because SLiM will be writing and reading lots of data directly to/from the binary blob, so if it's unaligned it just makes for additional headaches and lower performance in the read/write code. If this alignment guarantee were built into the format expected by the json+struct codec, it would smooth the path. Thoughts?

@bhaller
Copy link

bhaller commented Feb 9, 2026

Sorry for all the chat, I'm chugging along over here. I see that you have a test that disallows duplicate keys, and I'm wondering how broadly that's interpreted. SLiM's schemas have "$schema":"http://json-schema.org/schema#" in them; I figured I ought to move that up to the top level so that the JSON and binary schemas don't contain that any more (is that right?). But what about the "examples" key? It would be quite preferable for the JSON and binary schemas to provide their own separate examples, I think; is that allowed?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants