Issue warning for possible duplicate analysis files #175
cadley-nyulangone wants to merge 6 commits into master
Previous runs of the pipeline may leave old analysis files in place. This can cause duplicate tracks to be constructed.
John -- instead of relying on parsing filenames, check in main loop for an existing row with the same SampleID/Mapped_Genome/Group
# Check for duplicate analysisFiles, possibly left over from a previous run.
# Note: "data$Group" does not contain flowcell ID info when opt$project=CEGS_byLocus, so we need to get it from curdir.
fcID <- strsplit(curdir, "/", fixed=TRUE)[[1]][1]
Does it need to be the FC ID? Shouldn't whatever is in data$Group be the ID we need?
When opt$project=="CEGS_byLocus", Group values hold the Study ID, like "LP_Cells", rather than the flowcell.
Many SampleID/Mapped_Genome pairs are identical but come from different flowcells. Because they share the same Study ID, which is what gets put into the Group column, a check on SampleID/Mapped_Genome/Group alone reports them as duplicates (when opt$project=="CEGS_byLocus").
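To make the false positive concrete, here is a minimal sketch in Python (the check under review is in the R script, so this is only an illustration). The row dictionaries, the `find_duplicates` and `flowcell_id` helpers, the sample values, and the assumption that the flowcell ID is the first path component of `curdir` are all hypothetical; the last one simply mirrors the `strsplit(curdir, "/")` line in the diff.

```python
from collections import defaultdict


def flowcell_id(curdir):
    # Assumption: the flowcell ID is the first component of curdir,
    # as in the R line strsplit(curdir, "/", fixed=TRUE)[[1]][1].
    return curdir.split("/")[0]


def find_duplicates(rows, use_flowcell):
    """Return the keys shared by more than one row, with the curdirs involved."""
    seen = defaultdict(list)
    for row in rows:
        key = (row["SampleID"], row["Mapped_Genome"], row["Group"])
        if use_flowcell:
            # Extend the key so the same sample on two flowcells stays distinct.
            key += (flowcell_id(row["curdir"]),)
        seen[key].append(row["curdir"])
    return {key: dirs for key, dirs in seen.items() if len(dirs) > 1}


# Illustrative rows: one sample sequenced on two flowcells, same Study ID in Group.
rows = [
    {"SampleID": "BS01710A", "Mapped_Genome": "hg38", "Group": "LP_Cells",
     "curdir": "FCHHWC2BGX7/dnase/H1_Clone58I_Percoll_60U-BS01710A"},
    {"SampleID": "BS01710A", "Mapped_Genome": "hg38", "Group": "LP_Cells",
     "curdir": "FCHN3FWBGX7/dnase/H1_Clone58I_Percoll_60U-BS01710A"},
]

false_positives = find_duplicates(rows, use_flowcell=False)
real_duplicates = find_duplicates(rows, use_flowcell=True)
```

Without the flowcell ID the two rows collide on the same key; with it, only a genuine leftover from a previous run of the same flowcell would be reported.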
How do they get unique track names then?
We get saved by the duplicate track name logic in MakeTrackhub.py
Here is an example. The first two stanzas are normal flowcell stanzas. The track names are "[genome]_[group]_[SampleID]_bam". They look almost the same, but they come from different flowcells. The group value is in the form [date]_[flowcell ID].
track hg38_20181114_FCHHWC2BGX7_BS01710A_bam
bigDataUrl ../mapped/FCHHWC2BGX7/dnase/H1_Clone58I_Percoll_60U-BS01710A/H1_Clone58I_Percoll_60U-BS01710A.hg38_noalt.bam
shortLabel H1_Clone58I_Percoll_60U-BS01710A
longLabel DNase-seq Reads H1_Clone58I_Percoll_60U-BS01710A (9,748,880 analyzed reads, 0.26 SPOT, 1,935 Hotspots)
type bam
subGroups sample=H1_Clone58I_Percoll_60U-BS01710A view=Reads
parent Reads_view_hg38_20181114_FCHHWC2BGX7_dnase off
track hg38_20181130_FCHN3FWBGX7_BS01710A_bam
bigDataUrl ../mapped/FCHN3FWBGX7/dnase/H1_Clone58I_Percoll_60U-BS01710A/H1_Clone58I_Percoll_60U-BS01710A.hg38_noalt.bam
shortLabel H1_Clone58I_Percoll_60U-BS01710A
longLabel DNase-seq Reads H1_Clone58I_Percoll_60U-BS01710A (14,151,000 analyzed reads, 0.26 SPOT, 4,614 Hotspots)
type bam
subGroups sample=H1_Clone58I_Percoll_60U-BS01710A view=Reads
parent Reads_view_hg38_20181130_FCHN3FWBGX7_dnase off
Below are the same two tracks in the "byLocus" form. Again, the track name is in the form "[genome]_[group]_[SampleID]_bam", but now the group value is "LandingPads" in both of them. That creates a duplicate track name, which we resolve by appending "_2" after the sample ID of the second track.
track hg38_LandingPads_BS01710A_bam
bigDataUrl ../mapped/FCHHWC2BGX7/dnase/H1_Clone58I_Percoll_60U-BS01710A/H1_Clone58I_Percoll_60U-BS01710A.hg38_noalt.bam
shortLabel H1_Clone58I_Percoll_60U-BS01710A
longLabel DNase-seq Reads H1_Clone58I_Percoll_60U-BS01710A (9,748,880 analyzed reads, 0.26 SPOT, 1,935 Hotspots)
type bam
subGroups sample=H1_Clone58I_Percoll_60U sampleid=BS01710A project=LP058 assembly=HPRT1 type=LP_Cells view=Reads
parent Reads_view_hg38byLocus_LandingPads_dnase off
track hg38_LandingPads_BS01710A_2_bam
bigDataUrl ../mapped/FCHN3FWBGX7/dnase/H1_Clone58I_Percoll_60U-BS01710A/H1_Clone58I_Percoll_60U-BS01710A.hg38_noalt.bam
shortLabel H1_Clone58I_Percoll_60U-BS01710A
longLabel DNase-seq Reads H1_Clone58I_Percoll_60U-BS01710A (14,151,000 analyzed reads, 0.26 SPOT, 4,614 Hotspots)
type bam
subGroups sample=H1_Clone58I_Percoll_60U sampleid=BS01710A project=LP058 assembly=HPRT1 type=LP_Cells view=Reads
parent Reads_view_hg38byLocus_LandingPads_dnase off
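The renaming seen in these two stanzas can be sketched roughly as follows. This is not the code in MakeTrackhub.py, which is not shown in this thread; `uniquify_track_names` is a hypothetical helper that only reproduces the observed behaviour of inserting a counter before the `_bam` suffix.

```python
def uniquify_track_names(names, suffix="_bam"):
    """Give repeated track names a numeric counter placed before the suffix."""
    counts = {}
    result = []
    for name in names:
        n = counts.get(name, 0) + 1
        counts[name] = n
        if n == 1:
            # First occurrence keeps its name unchanged.
            result.append(name)
        elif suffix and name.endswith(suffix):
            # Later occurrences: insert "_<n>" between the stem and the suffix.
            result.append(f"{name[:-len(suffix)]}_{n}{suffix}")
        else:
            result.append(f"{name}_{n}")
    return result


unique_names = uniquify_track_names([
    "hg38_LandingPads_BS01710A_bam",
    "hg38_LandingPads_BS01710A_bam",
])
# unique_names == ["hg38_LandingPads_BS01710A_bam", "hg38_LandingPads_BS01710A_2_bam"]
```

A sketch like this does not guard against the generated name colliding with a name that already ends in a counter, and which track receives the suffix depends on input order, which is part of why fixing uniqueness upstream is attractive.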
That's very helpful. It would be better to generate unique group IDs from the start, rather than rely on them getting fixed later. Let's add the FC ID when project is CEGS_byLocus
That should simplify your duplicate checking here.
John, FC needs to get into the track name, not the Group name for CEGS_byLocus. I guess the issue is that MakeTrackhub.py makes the track name as follows:
curGroup_trackname = cleanTrackName(args.genome + args.tracknameprefix + "_" + curGroup + "_" + assay_suffix)
So we don't seem to have independent control over both the group name and the track name?
One option would be to change tracknameprefix from a command line arg to a column in the samplesforTrackhub_ output.
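One way the per-row prefix option might look, as a rough sketch only: the `Tracknameprefix` column name, the `prefix_for_row` helper, and the `clean_track_name` stand-in are hypothetical (the real `cleanTrackName` is not shown in this thread), and the example prefix values are invented for illustration.

```python
import re


def clean_track_name(name):
    # Hypothetical stand-in for cleanTrackName in MakeTrackhub.py;
    # here it just replaces anything outside [A-Za-z0-9_] with "_".
    return re.sub(r"[^A-Za-z0-9_]", "_", name)


def prefix_for_row(row, default_prefix=""):
    # Prefer a per-row value (hypothetical column), else fall back to the CLI default.
    return row.get("Tracknameprefix") or default_prefix


def group_track_name(genome, prefix, group, assay_suffix):
    # Same concatenation as the line quoted above, with prefix passed in explicitly.
    return clean_track_name(genome + prefix + "_" + group + "_" + assay_suffix)


rows = [
    {"Group": "LandingPads", "Tracknameprefix": "byLocus_FCHHWC2BGX7"},
    {"Group": "LandingPads", "Tracknameprefix": "byLocus_FCHN3FWBGX7"},
]
names = [
    group_track_name("hg38", prefix_for_row(r, "byLocus"), r["Group"], "dnase")
    for r in rows
]
```

With the prefix carried per row, the Group value stays "LandingPads" for both samples while the track names differ by flowcell.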
I added a FlowCellID column in the commit I just made.
However, I think this means the byLocus browser view no longer works as intended. Take a look at byLocus in dev, and we can decide what to do.
The FlowCellID column is used to make unique track names for byLocus tracks.