Issue warning for possible duplicate analysis files by cadley-nyulangone · Pull Request #175 · mauranolab/mapping

cadley-nyulangone · 2020-03-26T17:27:22Z

Previous runs of the pipeline may leave old analysis files in place.
This can cause duplicate tracks to be constructed.


mattmaurano · 2020-03-26T18:22:56Z

John -- instead of relying on parsing filenames, check in main loop for an existing row with the same SampleID/Mapped_Genome/Group
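A row-level check of this kind could look roughly like the following Python sketch. The real check would live in the R script's main loop; the function name `find_duplicate_rows`, the dict-based rows, and the repeated third row are illustrative assumptions. Only the three column names come from the comment above.

```python
from collections import Counter

def find_duplicate_rows(rows):
    """Return the (SampleID, Mapped_Genome, Group) keys that occur more than once."""
    keys = [(r["SampleID"], r["Mapped_Genome"], r["Group"]) for r in rows]
    counts = Counter(keys)
    # Counter keeps first-seen order, so duplicates are reported in input order.
    return [key for key, n in counts.items() if n > 1]

# Hypothetical rows: the third repeats the first, as a leftover from an
# earlier pipeline run might.
rows = [
    {"SampleID": "BS01710A", "Mapped_Genome": "hg38", "Group": "20181114_FCHHWC2BGX7"},
    {"SampleID": "BS01710A", "Mapped_Genome": "hg38", "Group": "20181130_FCHN3FWBGX7"},
    {"SampleID": "BS01710A", "Mapped_Genome": "hg38", "Group": "20181114_FCHHWC2BGX7"},
]

for key in find_duplicate_rows(rows):
    print("WARNING: possible duplicate analysis files for", key)
```

Keying on the row contents rather than on filenames means the check does not depend on how the analysis files happen to be named.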

mattmaurano · 2020-03-27T15:40:37Z

+
+		# Check for duplicate analysisFiles, possibly left over from a previous run.
+		# Note:  "data$Group" does not contain flowcell ID info when opt$project=CEGS_byLocus, so we need to get it from curdir.
+		fcID <- strsplit(curdir, "/", fixed=TRUE)[[1]][1]


Does it need to be the FC ID? Shouldn't whatever is in data$Group be the ID we need?

When opt$project=="CEGS_byLocus", Group values are in the form of Study ID, like "LP_Cells".

There are a bunch of SampleID/Mapped_Genome pairs which are identical, but which come from different flowcells. However they have the same Study ID, which gets put into the Group column, and then generates a duplicate warning (when opt$project=="CEGS_byLocus").

How do they get unique track names then?

We get saved by the duplicate track name logic in MakeTrackhub.py. Here is an example. The first two stanzas are normal flowcell stanzas. The track names are "[genome]_[group]_[SampleID]_bam". They look almost the same, but they come from different flowcells. The group value is in the form [date]_[flowcell ID].

    track hg38_20181114_FCHHWC2BGX7_BS01710A_bam
    bigDataUrl ../mapped/FCHHWC2BGX7/dnase/H1_Clone58I_Percoll_60U-BS01710A/H1_Clone58I_Percoll_60U-BS01710A.hg38_noalt.bam
    shortLabel H1_Clone58I_Percoll_60U-BS01710A
    longLabel DNase-seq Reads H1_Clone58I_Percoll_60U-BS01710A (9,748,880 analyzed reads, 0.26 SPOT, 1,935 Hotspots)
    type bam
    subGroups sample=H1_Clone58I_Percoll_60U-BS01710A view=Reads
    parent Reads_view_hg38_20181114_FCHHWC2BGX7_dnase off

    track hg38_20181130_FCHN3FWBGX7_BS01710A_bam
    bigDataUrl ../mapped/FCHN3FWBGX7/dnase/H1_Clone58I_Percoll_60U-BS01710A/H1_Clone58I_Percoll_60U-BS01710A.hg38_noalt.bam
    shortLabel H1_Clone58I_Percoll_60U-BS01710A
    longLabel DNase-seq Reads H1_Clone58I_Percoll_60U-BS01710A (14,151,000 analyzed reads, 0.26 SPOT, 4,614 Hotspots)
    type bam
    subGroups sample=H1_Clone58I_Percoll_60U-BS01710A view=Reads
    parent Reads_view_hg38_20181130_FCHN3FWBGX7_dnase off

Below are the same two tracks in the "byLocus" form. Again, the track name is in the form "[genome]_[group]_[SampleID]_bam", but now the group value is "LandingPads" in both of them. That creates a duplicate track name, and we resolve the problem by appending the "2" after the sample ID.

    track hg38_LandingPads_BS01710A_bam
    bigDataUrl ../mapped/FCHHWC2BGX7/dnase/H1_Clone58I_Percoll_60U-BS01710A/H1_Clone58I_Percoll_60U-BS01710A.hg38_noalt.bam
    shortLabel H1_Clone58I_Percoll_60U-BS01710A
    longLabel DNase-seq Reads H1_Clone58I_Percoll_60U-BS01710A (9,748,880 analyzed reads, 0.26 SPOT, 1,935 Hotspots)
    type bam
    subGroups sample=H1_Clone58I_Percoll_60U sampleid=BS01710A project=LP058 assembly=HPRT1 type=LP_Cells view=Reads
    parent Reads_view_hg38byLocus_LandingPads_dnase off

    track hg38_LandingPads_BS01710A_2_bam
    bigDataUrl ../mapped/FCHN3FWBGX7/dnase/H1_Clone58I_Percoll_60U-BS01710A/H1_Clone58I_Percoll_60U-BS01710A.hg38_noalt.bam
    shortLabel H1_Clone58I_Percoll_60U-BS01710A
    longLabel DNase-seq Reads H1_Clone58I_Percoll_60U-BS01710A (14,151,000 analyzed reads, 0.26 SPOT, 4,614 Hotspots)
    type bam
    subGroups sample=H1_Clone58I_Percoll_60U sampleid=BS01710A project=LP058 assembly=HPRT1 type=LP_Cells view=Reads
    parent Reads_view_hg38byLocus_LandingPads_dnase off
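The renaming behavior described here can be illustrated with a small Python sketch. This is not the code in MakeTrackhub.py; the helper `unique_track_name` and the `used` set are hypothetical, and the sketch only reproduces the naming pattern visible in the stanzas (a counter inserted after the sample ID on collision).

```python
def unique_track_name(genome, group, sample_id, suffix, used):
    """Build "[genome]_[group]_[SampleID]_[suffix]", adding a counter on collision."""
    base = f"{genome}_{group}_{sample_id}"
    name = f"{base}_{suffix}"
    n = 1
    # On a collision, insert _2, _3, ... after the sample ID until the name is free.
    while name in used:
        n += 1
        name = f"{base}_{n}_{suffix}"
    used.add(name)
    return name

used = set()
first = unique_track_name("hg38", "LandingPads", "BS01710A", "bam", used)
second = unique_track_name("hg38", "LandingPads", "BS01710A", "bam", used)
print(first)
print(second)
```

A scheme like this keeps names unique, but which track receives the counter depends on processing order, so the names carry no information about the flowcell each track came from.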

That's very helpful. It would be better to generate unique group IDs from the start, rather than rely on them getting fixed later. Let's add the FC ID when the project is CEGS_byLocus.

That should simplify your duplicate checking here.

John, FC needs to get into the track name, not the Group name for CEGS_byLocus. I guess the issue is that MakeTrackhub.py makes the track name as follows:
curGroup_trackname = cleanTrackName(args.genome + args.tracknameprefix + "_" + curGroup + "_" + assay_suffix)
So we don't seem to have independent control over both the group name and the track name?

One option would be to change tracknameprefix from a command line arg to a column in the samplesforTrackhub_ output.

I added a FlowCellID column in the commit I just made.
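As a rough illustration of how a per-row flowcell ID could keep track names unique while leaving the group name shared, consider the Python sketch below. Everything in it is an assumption made for illustration: `clean_track_name` is a hypothetical stand-in for `cleanTrackName`, and the position of the flowcell ID within the name is a guess, not what MakeTrackhub.py actually does.

```python
import re

def clean_track_name(name):
    # Hypothetical stand-in for cleanTrackName: keep letters, digits, underscores.
    return re.sub(r"[^A-Za-z0-9_]", "_", name)

def track_name(genome, row, suffix):
    """Compose a track name, adding the row's FlowCellID when one is present."""
    parts = [genome, row["Group"]]
    if row.get("FlowCellID"):
        parts.append(row["FlowCellID"])
    parts += [row["SampleID"], suffix]
    return clean_track_name("_".join(parts))

# Two byLocus rows sharing a Group and SampleID but from different flowcells.
rows = [
    {"Group": "LandingPads", "SampleID": "BS01710A", "FlowCellID": "FCHHWC2BGX7"},
    {"Group": "LandingPads", "SampleID": "BS01710A", "FlowCellID": "FCHN3FWBGX7"},
]
names = [track_name("hg38", row, "bam") for row in rows]
print(names)
```

With the flowcell ID in the track name, the two rows no longer collide, and the Group value stays "LandingPads" for both.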


Issue warning for possible duplicate analysis files

10fb2e5

Previous runs of the pipeline may leave old analysis files in place. This can cause duplicate tracks to be constructed.

cadley-nyulangone added 2 commits March 26, 2020 18:11

Check for dups in the "data" matrix, rather than in filenames.

553bf14

Fixed some tabs

b2a9162

mattmaurano requested changes Mar 27, 2020


cadley-nyulangone added 3 commits March 30, 2020 14:52

Changed if from != to >

3bf3344

Added flowcell ID to Group when processing tracks byLocus.

c2fa72b

However, I think this causes the byLocus browser view to stop working as intended. Take a look at byLocus in dev, and we can decide what to do.

Add a FlowCellID column to samplesforTrackhub.R output

3fb0e3f

The column is used to make unique track names for byLocus tracks.

mattmaurano force-pushed the dupAnalysisFiles branch from a715613 to 3fb0e3f on July 6, 2020 21:32

mattmaurano force-pushed the master branch from 89b48d3 to a49d940 on July 6, 2020 21:33

cadley-nyulangone wants to merge 6 commits into master from dupAnalysisFiles

cadley-nyulangone commented Mar 26, 2020
mattmaurano commented Mar 26, 2020
mattmaurano Mar 27, 2020
cadley-nyulangone Mar 27, 2020
mattmaurano Mar 31, 2020
cadley-nyulangone Mar 31, 2020
mattmaurano Mar 31, 2020 (edited)
mattmaurano Apr 2, 2020
cadley-nyulangone Apr 6, 2020
