Refactored repository to use id = -1 for the unassigned label class #31
jacksonjacobs1 wants to merge 1 commit into
|
i looked but couldn't quickly find it - can you confirm what the ordering would be here for the classes? does it now go {-1,0,1,2}? i think the previous order was {1,2,3}, so this seems to have moved 1 -> -1, but then is the result {-1,1,2,...} (assuming the autoincrement key typically starts at 1)? do we have a class 0 for the first labeled class?
|
Postgres sequences start at 1 by default, so it would go -1, 1, 2, 3. Would you prefer using 0 for the unassigned class then? It would be a simple fix.
|
close to 100% sure we'll introduce a bug at some point - especially if Claude is doing a lot of the lifting. is there not a way to force -1,0,1,2,3 so that the DB perfectly aligns with the softmax vector coming out of the DL model?
|
Yes, you can set sequence initializations like so:

CREATE SEQUENCE my_default_seq
START WITH 1
INCREMENT BY 1
MINVALUE 1
MAXVALUE 9223372036854775807; -- Max value for a BIGINT

I'd note though that label_class_id increments across projects, so I'm not sure that starting at 0 actually reduces the complexity of the logic to map label class ids to model logits indexes.

I'd like to propose adding a

Mapping label classes to model logits then becomes straightforward, with the order of label classes (filtered by project_id) mapping directly onto model logits indexes. The chance of bugs becomes very low. Thoughts?
|
i think a more stable design may be to normalize this a bit. there is a global primary key that is autoincrement and increments across projects, but there is also a projid + label_id column pair that forms the project-level ids, so one always knows that projid=X label_id=0 is always the 0th class coming out of the softmax.

it also needs the delete column as you suggested, to make sure the mapping remains locked in.

i think this drops the bug likelihood even further, and since the label table is quite small, the added columns have minimal overhead. i'm not 100% sure it's the best design at the moment, but it feels explicit instead of implicit, which has a nice vibe to it.
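For illustration, the two-level id scheme described above could be sketched as follows. This is a minimal sketch, not the repository's schema: it uses an in-memory SQLite database instead of Postgres so that it runs standalone, and the column names `project_label_id` and `deleted` and the helper `add_label_class` are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE label_class (
        id INTEGER PRIMARY KEY AUTOINCREMENT,  -- global key, increments across projects
        project_id INTEGER NOT NULL,
        project_label_id INTEGER NOT NULL,     -- project-local id, meant to equal the softmax index
        name TEXT NOT NULL,
        deleted INTEGER NOT NULL DEFAULT 0,    -- soft-delete flag (hypothetical column)
        UNIQUE (project_id, project_label_id)
    )
    """
)


def add_label_class(conn, project_id, name):
    """Insert a class with the next 0-based project-local id.

    Rows are only ever soft deleted, so MAX + 1 never reuses an id.
    Not safe under concurrent writers; a real implementation would need locking.
    """
    (next_id,) = conn.execute(
        "SELECT COALESCE(MAX(project_label_id) + 1, 0) "
        "FROM label_class WHERE project_id = ?",
        (project_id,),
    ).fetchone()
    conn.execute(
        "INSERT INTO label_class (project_id, project_label_id, name) VALUES (?, ?, ?)",
        (project_id, next_id, name),
    )
    return next_id


add_label_class(conn, 1, "tumor")
add_label_class(conn, 1, "stroma")
add_label_class(conn, 2, "other")     # a different project takes global id 3
add_label_class(conn, 1, "necrosis")

rows = conn.execute(
    "SELECT project_label_id, id FROM label_class "
    "WHERE project_id = ? ORDER BY project_label_id",
    (1,),
).fetchall()
print(rows)  # [(0, 1), (1, 2), (2, 4)]
```

The global ids for project 1 have a hole (1, 2, 4) because another project inserted in between, while the project-local ids stay contiguous from 0.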
|
I think this is redundant: there's no need for soft delete in the label class table if there's a project-local label class id. If we have a project_label_id column and a label class is removed from the table, the remaining label classes will still map onto the model logits.
|
maybe? what if i have 5 classes, add a class so there are 6 classes, then remove that 6th, and then add a new 6th class? the most stable training will be if the DL network continues to output 7 with 1 being soft deleted.
|
In that case, without needing an extra

SELECT logits_id FROM label_class WHERE project_id = 1

[0,1,2,3,4]

Mapping onto 7 logits

Where "1" marks enabled and "0" marks soft deleted.

Is there a problem with this?
|
i think the question is, do you want the mapping to happen at all? in this case "[0,1,2,3,4,6]" -> you actually have to convert it to the format [1,1,1,1,1,0,1]

if it was of the format:

the relevance of this, however, is limited to the scope where the values from the tables are used. is there ever a time when it needs to align, or are they only used in e.g. one-off queries?
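The conversion mentioned above, from a list of enabled logits indexes to a 0/1 mask over all logits, is small enough to sketch. The function name `enabled_mask` is hypothetical and not taken from the repository; the example values are the ones from the comment.

```python
def enabled_mask(enabled_indices, num_logits):
    """Return a 0/1 list of length num_logits; 1 = enabled, 0 = soft deleted."""
    mask = [0] * num_logits
    for idx in enabled_indices:
        if not 0 <= idx < num_logits:
            raise ValueError(f"logits index {idx} is outside 0..{num_logits - 1}")
        mask[idx] = 1
    return mask


print(enabled_mask([0, 1, 2, 3, 4, 6], 7))  # [1, 1, 1, 1, 1, 0, 1]
```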
|
My understanding is that some mapping must occur anyway. When patch predictions are saved, the classification from the model (e.g., 5) must be converted to the global label_class_id. If we truly wanted to avoid classification mapping at prediction time, we'd save

The mapping operation is computationally trivial either way, so I'd prefer to just map the
compelling argument and totally agree - this should be entirely obscured from the client.

that said, i'm not really sure exactly where in the code the mapping takes place. this "logits_idx to the patch_pred table" makes sense to me, since this is done by the DL directly itself, right? there would be no mapping and there would be a 1:1 alignment between what the DL model produces and what goes into its "table" that is within its scope to populate.

note that while trivial, the mapping would still have to occur potentially billions of times, every batch, every epoch. how often are those results actually used in their "mapped" version?

what i'm essentially thinking of is something like lazy mapping, so that it only occurs on demand when it's actually needed. i don't know what that means in terms of design / implementation based on the existing code, however.
|
If we performed a lazy mapping in the server layer, this would need to occur each time a patch or aggregation information is requested. During zooming / panning, I see ~10-20 aggregation tiles requested per second, and ~100 patches requested per second if "show patches" is enabled. Patch gallery pagination may also request ~100 patches per second. So it's far fewer than the ~5,000 patches per second throughput that we've seen for a single DL worker. But I'd also note that each endpoint call that performs the lazy mapping would have to perform an additional database round trip to get the most up-to-date vs. the ray train worker, which can just hold the LabelClassMap in memory.
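To make the in-memory option concrete, here is a minimal sketch of a map a worker could build once and then reuse for every prediction. The class `LabelClassMapSketch` is a hypothetical stand-in, not the repository's `LabelClassMap`, whose actual structure is not shown in this thread; the `(logits_idx, label_class_id)` row shape is also an assumption.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class LabelClassMapSketch:
    # Position = logits index, value = global label_class_id.
    logits_to_global: Tuple[int, ...]

    @classmethod
    def from_rows(cls, rows: Iterable[Tuple[int, int]]) -> "LabelClassMapSketch":
        """Build the map from (logits_idx, label_class_id) rows for one project."""
        ordered = sorted(rows)
        if [idx for idx, _ in ordered] != list(range(len(ordered))):
            raise ValueError("logits indexes must be contiguous and start at 0")
        return cls(tuple(global_id for _, global_id in ordered))

    def to_global(self, logits_idx: int) -> int:
        """Model output index -> global label_class_id (a tuple lookup)."""
        if not 0 <= logits_idx < len(self.logits_to_global):
            raise IndexError(f"no label class for logits index {logits_idx}")
        return self.logits_to_global[logits_idx]

    def to_logits(self, label_class_id: int) -> int:
        """Global label_class_id -> model output index."""
        return self.logits_to_global.index(label_class_id)


# Example rows, as they might come back from a single query at worker start-up.
label_map = LabelClassMapSketch.from_rows([(1, 12), (0, 7), (2, 15)])
print(label_map.to_global(2))  # 15
print(label_map.to_logits(7))  # 0
```

Held in memory like this, each per-prediction mapping is an index into a tuple, whereas a lazy mapping in the server layer would need its own lookup of the current rows per request.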
|