Issue warning for possible duplicate analysis files #175

Open
cadley-nyulangone wants to merge 6 commits into master from dupAnalysisFiles

@cadley-nyulangone
Contributor

Previous runs of the pipeline may leave old analysis files in place.
This can cause duplicate tracks to be constructed.

@mattmaurano
Contributor

John, instead of relying on parsing filenames, check in the main loop for an existing row with the same SampleID/Mapped_Genome/Group.
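
A minimal sketch of that kind of check, written in Python for illustration even though samplesforTrackhub.R is an R script. The column names come from the comment above; the function name `find_duplicate_rows` and the list-of-dicts row format are assumptions, not the pipeline's actual data structures.

```python
def find_duplicate_rows(rows):
    """Return (first_index, duplicate_index) pairs for rows that share
    the same SampleID/Mapped_Genome/Group key."""
    seen = {}        # key -> index of the first row carrying that key
    duplicates = []
    for i, row in enumerate(rows):
        key = (row["SampleID"], row["Mapped_Genome"], row["Group"])
        if key in seen:
            duplicates.append((seen[key], i))
        else:
            seen[key] = i
    return duplicates
```

A caller could issue one warning per returned pair instead of inferring duplicates from file names.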

Comment thread dnase/trackhub/samplesforTrackhub.R Outdated

# Check for duplicate analysisFiles, possibly left over from a previous run.
# Note: "data$Group" does not contain flowcell ID info when opt$project=CEGS_byLocus, so we need to get it from curdir.
fcID <- strsplit(curdir, "/", fixed=TRUE)[[1]][1]
Contributor

Does it need to be the FC ID? Shouldn't whatever is in data$Group be the ID we need?

Contributor Author

When opt$project=="CEGS_byLocus", Group values are in the form of Study ID, like "LP_Cells".

There are a bunch of SampleID/Mapped_Genome pairs that are identical but come from different flowcells. However, they share the same Study ID, which gets put into the Group column, so they trigger a duplicate warning (when opt$project=="CEGS_byLocus").

Contributor

How do they get unique track names then?

Contributor Author

We get saved by the duplicate track name logic in MakeTrackhub.py.

Here is an example. The first two stanzas are normal flowcell stanzas. The track names have the form "[genome]_[group]_[SampleID]_bam". The two tracks look almost the same, but they come from different flowcells. The group value has the form [date]_[flowcell ID].

                track hg38_20181114_FCHHWC2BGX7_BS01710A_bam
                bigDataUrl ../mapped/FCHHWC2BGX7/dnase/H1_Clone58I_Percoll_60U-BS01710A/H1_Clone58I_Percoll_60U-BS01710A.hg38_noalt.bam
                shortLabel H1_Clone58I_Percoll_60U-BS01710A
                longLabel DNase-seq Reads H1_Clone58I_Percoll_60U-BS01710A (9,748,880 analyzed reads, 0.26 SPOT, 1,935 Hotspots)
                type bam
                subGroups sample=H1_Clone58I_Percoll_60U-BS01710A view=Reads
                parent Reads_view_hg38_20181114_FCHHWC2BGX7_dnase off

                track hg38_20181130_FCHN3FWBGX7_BS01710A_bam
                bigDataUrl ../mapped/FCHN3FWBGX7/dnase/H1_Clone58I_Percoll_60U-BS01710A/H1_Clone58I_Percoll_60U-BS01710A.hg38_noalt.bam
                shortLabel H1_Clone58I_Percoll_60U-BS01710A
                longLabel DNase-seq Reads H1_Clone58I_Percoll_60U-BS01710A (14,151,000 analyzed reads, 0.26 SPOT, 4,614 Hotspots)
                type bam
                subGroups sample=H1_Clone58I_Percoll_60U-BS01710A view=Reads
                parent Reads_view_hg38_20181130_FCHN3FWBGX7_dnase off

Below are the same two tracks in the "byLocus" form. Again, the track name has the form "[genome]_[group]_[SampleID]_bam", but now the group value is "LandingPads" in both of them. That creates a duplicate track name, and we resolve the problem by appending "_2" after the sample ID.

                track hg38_LandingPads_BS01710A_bam
                bigDataUrl ../mapped/FCHHWC2BGX7/dnase/H1_Clone58I_Percoll_60U-BS01710A/H1_Clone58I_Percoll_60U-BS01710A.hg38_noalt.bam
                shortLabel H1_Clone58I_Percoll_60U-BS01710A
                longLabel DNase-seq Reads H1_Clone58I_Percoll_60U-BS01710A (9,748,880 analyzed reads, 0.26 SPOT, 1,935 Hotspots)
                type bam
                subGroups sample=H1_Clone58I_Percoll_60U sampleid=BS01710A project=LP058 assembly=HPRT1 type=LP_Cells view=Reads
                parent Reads_view_hg38byLocus_LandingPads_dnase off

                track hg38_LandingPads_BS01710A_2_bam
                bigDataUrl ../mapped/FCHN3FWBGX7/dnase/H1_Clone58I_Percoll_60U-BS01710A/H1_Clone58I_Percoll_60U-BS01710A.hg38_noalt.bam
                shortLabel H1_Clone58I_Percoll_60U-BS01710A
                longLabel DNase-seq Reads H1_Clone58I_Percoll_60U-BS01710A (14,151,000 analyzed reads, 0.26 SPOT, 4,614 Hotspots)
                type bam
                subGroups sample=H1_Clone58I_Percoll_60U sampleid=BS01710A project=LP058 assembly=HPRT1 type=LP_Cells view=Reads
                parent Reads_view_hg38byLocus_LandingPads_dnase off
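
The renaming seen in the stanzas above can be sketched as follows. This is only an illustration of the naming pattern, not the code in MakeTrackhub.py: the function name `unique_track_name`, the `used` set, and the counter-based suffix scheme are all assumptions inferred from the example output.

```python
def unique_track_name(genome, group, sample_id, used, suffix="bam"):
    """Build "[genome]_[group]_[SampleID]_bam"; on a collision, insert a
    counter after the sample ID (hypothetical scheme: _2, _3, ...)."""
    base = f"{genome}_{group}_{sample_id}"
    name = f"{base}_{suffix}"
    n = 1
    while name in used:
        n += 1
        name = f"{base}_{n}_{suffix}"
    used.add(name)   # remember the name so later tracks cannot reuse it
    return name
```

With the group set to "LandingPads" for both flowcells, the first call yields `hg38_LandingPads_BS01710A_bam` and the second `hg38_LandingPads_BS01710A_2_bam`, matching the stanzas above.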

Contributor

@mattmaurano Mar 31, 2020

That's very helpful. It would be better to generate unique group IDs from the start, rather than rely on them getting fixed later. Let's add the FC ID when project is CEGS_byLocus

That should simplify your duplicate checking here.
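
A rough sketch of this suggestion, in Python for illustration only. It assumes, as in the reviewed R line, that the flowcell ID is the first path component of `curdir`; the function name `group_with_flowcell` and the underscore separator are hypothetical.

```python
def group_with_flowcell(group, curdir, project):
    """Append the flowcell ID to the group when project is CEGS_byLocus."""
    if project != "CEGS_byLocus":
        return group
    # Assumption: curdir looks like "<flowcell ID>/...", so the first
    # path component is the flowcell ID.
    fc_id = curdir.split("/")[0]
    return f"{group}_{fc_id}"
```

Two rows that share a Study ID but come from different flowcells would then get distinct group values, so a check on SampleID/Mapped_Genome/Group no longer flags them.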

Contributor

John, FC needs to get into the track name, not the Group name for CEGS_byLocus. I guess the issue is that MakeTrackhub.py makes the track name as follows:
curGroup_trackname = cleanTrackName(args.genome + args.tracknameprefix + "_" + curGroup + "_" + assay_suffix)
So we don't seem to have independent control over both the group name and the track name?

One option would be to change tracknameprefix from a command line arg to a column in the samplesforTrackhub_ output.
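
To make that option concrete, here is a hedged sketch in which the prefix is passed per row rather than taken from `args`. It follows the concatenation order of the line quoted above but omits `cleanTrackName`, whose behavior is not shown in this thread; the function name `track_name` and the `row_prefix` parameter are hypothetical.

```python
def track_name(genome, group, assay_suffix, row_prefix=""):
    """Concatenate in the same order as the quoted MakeTrackhub.py line,
    with the prefix supplied per row (hypothetical) instead of via args."""
    return genome + row_prefix + "_" + group + "_" + assay_suffix
```

Passing a flowcell-specific prefix such as "_FCHHWC2BGX7" for one row and "_FCHN3FWBGX7" for another would keep the group name shared while making the track names differ.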

Contributor Author

I added a FlowCellID column in the commit I just made.

Comment thread dnase/trackhub/samplesforTrackhub.R Outdated
Comment thread dnase/trackhub/samplesforTrackhub.R Outdated
However, I think this means the byLocus browser view no longer works as intended.

Take a look at byLocus in dev, and we can decide what to do.
Is used to make unique tracknames for ByLocus tracks.