Write a global count as a separate process

Right now, a global counts data frame is generated internally.

https://github.com/sheltzer-lab/crispr-screening/blob/1a6f8c1cbe94433e4abfc02d47247ba92c21ade4/bin/extract-reads.py#L53-L63

However, we do not write this file out for processing. Rather, we immediately move into attempting to pair data.

https://github.com/sheltzer-lab/crispr-screening/blob/1a6f8c1cbe94433e4abfc02d47247ba92c21ade4/bin/extract-reads.py#L66-L72

This is not ideal: the workflow breaks if pairs cannot be found, and all of the counting (where most of the time is spent) must then be repeated. I have run into this twice: 1) I had a typo in my sample sheet (see #2), and 2) we wanted to count a single sample separately (i.e., only one half of a pair).

What I propose is writing the global count out as a separate file (after the first code block here, add `df.to_csv("global-count.csv", ...)`) and then adding a "pairing" process that runs the logic starting at the second code block here.

	df = df.merge(
	    read_csv(
	        LIBRARY,
	        delimiter="\t",
	        header=None,
	        dtype=str,
	        names=["gene", "id", "sequence"],
	    ),
	    on="sequence",
	    how="inner",
	)

	for sample in samples["sample"].unique().tolist():
	    initialAdapter = samples.loc[
	        (samples["sample"] == sample) & (samples["time"] == "initial"), ["sequence"]
	    ].values[0][0]
	    finalAdapter = samples.loc[
	        (samples["sample"] == sample) & (samples["time"] == "final"), ["sequence"]
	    ].values[0][0]
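A rough sketch of the split proposed above, to make the idea concrete. It is not the repository's code: the helper names `write_global_counts` and `find_pairs`, the `count` column, and the toy data are all assumptions for illustration. Only the `sample`/`time`/`sequence` sample sheet columns and the `global-count.csv` file name come from the excerpts and the proposal. The pairing step reports samples that lack an `initial` or `final` row instead of failing on `.values[0][0]`, which raises `IndexError` when the selection is empty.

```python
import os
import tempfile

import pandas as pd


def write_global_counts(df: pd.DataFrame, path: str) -> None:
    # Hypothetical helper: persist the counts so the expensive counting
    # step does not have to be repeated if pairing fails later.
    df.to_csv(path, index=False)


def find_pairs(samples: pd.DataFrame):
    """Hypothetical helper: split samples into complete pairs and incomplete ones."""
    pairs, unpaired = {}, []
    for sample in samples["sample"].unique().tolist():
        rows = samples.loc[samples["sample"] == sample]
        initial = rows.loc[rows["time"] == "initial", "sequence"]
        final = rows.loc[rows["time"] == "final", "sequence"]
        if initial.empty or final.empty:
            # Collect the sample rather than raising on an empty selection.
            unpaired.append(sample)
        else:
            pairs[sample] = (initial.iloc[0], final.iloc[0])
    return pairs, unpaired


# Toy data; every value below is made up for illustration.
counts = pd.DataFrame(
    {"sequence": ["ACGT", "TTGA"], "gene": ["A", "B"], "count": [10, 3]}
)
samples = pd.DataFrame(
    {
        "sample": ["s1", "s1", "s2"],
        "time": ["initial", "final", "initial"],
        "sequence": ["AAAA", "CCCC", "GGGG"],
    }
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "global-count.csv")
    # Step 1 (counting process): write the global count to disk.
    write_global_counts(counts, path)
    # Step 2 (pairing process): start from the file, not from a recount.
    reloaded = pd.read_csv(path, dtype={"sequence": str, "gene": str})

pairs, unpaired = find_pairs(samples)
print(pairs)     # samples with both time points
print(unpaired)  # samples missing "initial" or "final"
```

With this shape, a typo in the sample sheet or a deliberately unpaired sample would only affect the cheap pairing step, and the written counts file could be reused as-is.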

Write a global count as a separate process #4

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Write a global count as a separate process #4

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions