Write a global count as a separate process #4
Description
Right now, a global counts data frame is generated internally:
crispr-screening/bin/extract-reads.py
Lines 53 to 63 in 1a6f8c1
    df = df.merge(
        read_csv(
            LIBRARY,
            delimiter="\t",
            header=None,
            dtype=str,
            names=["gene", "id", "sequence"],
        ),
        on="sequence",
        how="inner",
    )
However, we never write this data frame out for later processing. Instead, we immediately try to pair the data.
crispr-screening/bin/extract-reads.py
Lines 66 to 72 in 1a6f8c1
    for sample in samples["sample"].unique().tolist():
        initialAdapter = samples.loc[
            (samples["sample"] == sample) & (samples["time"] == "initial"), ["sequence"]
        ].values[0][0]
        finalAdapter = samples.loc[
            (samples["sample"] == sample) & (samples["time"] == "final"), ["sequence"]
        ].values[0][0]
This is fragile: if a pair cannot be found, the workflow breaks and all of the counting (where most of the time is spent) must be repeated. I have run into this twice: 1) I had a typo in my sample sheet (see #2), and 2) we wanted to count a single sample separately (i.e., only one half of a pair).
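To illustrate the failure mode: when a sample has no matching "initial" or "final" row, the selection is empty and `.values[0][0]` raises an `IndexError`. Below is a minimal sketch of a lookup that tolerates missing halves. The helper `find_adapter` and the toy sample sheet are hypothetical, not part of `extract-reads.py`; only the column names `sample`, `time`, and `sequence` come from the snippet above.

```python
import pandas as pd


def find_adapter(samples, sample, time):
    """Return the adapter sequence for (sample, time), or None if absent.

    Hypothetical helper, not part of extract-reads.py.
    """
    match = samples.loc[
        (samples["sample"] == sample) & (samples["time"] == time), "sequence"
    ]
    return match.iloc[0] if not match.empty else None


# Toy sample sheet (assumed layout): sample "B" has no "final" row.
samples = pd.DataFrame(
    {
        "sample": ["A", "A", "B"],
        "time": ["initial", "final", "initial"],
        "sequence": ["ACGT", "TGCA", "GGCC"],
    }
)

complete_pairs = {}
unpaired = []
for sample in samples["sample"].unique().tolist():
    initial = find_adapter(samples, sample, "initial")
    final = find_adapter(samples, sample, "final")
    if initial is None or final is None:
        # Record the incomplete pair instead of crashing the whole run.
        unpaired.append(sample)
        continue
    complete_pairs[sample] = (initial, final)
```

With this toy sheet, sample "A" is paired and sample "B" is collected in `unpaired` rather than aborting the run.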
What I propose is writing the global count out as a separate file: after the first code block above, add df.to_csv("global-count.csv", ...), then have a separate "pairing" process that runs the code starting at the second code block.
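A rough sketch of how the two steps might be separated, under several assumptions: the function names `write_global_counts` and `pair_counts`, the `adapter` and `count` columns, and the toy data are all hypothetical, since the issue does not show the actual layout of `df`. Only the library columns (`gene`, `id`, `sequence`), the sample sheet columns, and the `global-count.csv` file name come from the text above.

```python
import os
import tempfile

import pandas as pd


def write_global_counts(df, library, path="global-count.csv"):
    """Step 1: annotate raw counts with the library and save them to disk."""
    merged = df.merge(library, on="sequence", how="inner")
    merged.to_csv(path, index=False)
    return path


def pair_counts(path, samples):
    """Step 2: read the saved counts and pair initial/final per sample."""
    counts = pd.read_csv(path, dtype={"gene": str, "id": str, "sequence": str})
    paired = {}
    for sample, rows in samples.groupby("sample"):
        adapters = dict(zip(rows["time"], rows["sequence"]))
        if not {"initial", "final"} <= set(adapters):
            continue  # incomplete pair: skip it, the saved counts stay usable
        initial = counts.loc[counts["adapter"] == adapters["initial"], ["id", "count"]]
        final = counts.loc[counts["adapter"] == adapters["final"], ["id", "count"]]
        paired[sample] = initial.merge(
            final, on="id", how="outer", suffixes=("_initial", "_final")
        )
    return paired


# Toy inputs (assumed layout, for illustration only).
library = pd.DataFrame(
    {"gene": ["G1", "G2"], "id": ["g1", "g2"], "sequence": ["AAAA", "CCCC"]}
)
df = pd.DataFrame(
    {
        "adapter": ["ACGT", "ACGT", "TGCA", "TGCA", "GGCC"],
        "sequence": ["AAAA", "CCCC", "AAAA", "CCCC", "AAAA"],
        "count": [10, 20, 5, 40, 7],
    }
)
samples = pd.DataFrame(
    {
        "sample": ["A", "A", "B"],
        "time": ["initial", "final", "initial"],
        "sequence": ["ACGT", "TGCA", "GGCC"],
    }
)

with tempfile.TemporaryDirectory() as tmp:
    counts_path = write_global_counts(
        df, library, os.path.join(tmp, "global-count.csv")
    )
    paired = pair_counts(counts_path, samples)
```

Because the counts are already on disk after step 1, a typo in the sample sheet or a sample with only one half of a pair would only require rerunning the pairing step, not the counting.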