dataset: synthetic from PANTHER #25

Open
tristan-f-r wants to merge 34 commits into main from synthetic

@tristan-f-r
Contributor

@tristan-f-r tristan-f-r commented Jul 1, 2025

This does not add anything to config/*.yaml.

Co-Authored-By: Neha Talluri 78840540+ntalluri@users.noreply.github.com
Co-Authored-By: Oliver Faulkner Anderson 112665860+oliverfanderson@users.noreply.github.com
Co-Authored-By: Altaf Barelvi altafayyubibarelvi@gmail.com
@tristan-f-r tristan-f-r added the dataset label (Mutating datasets in any way.) Jul 1, 2025
@@ -0,0 +1,100 @@
pathways = ["Apoptosis_signaling", "B_cell_activation",
Collaborator

Each of the files has this variable. I think we should define it only in the Snakefile and pass the list to each file that uses it.
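One possible shape for this, as a sketch only: each script reads the pathway names from its command line, so the Snakefile is the single place where the list lives. The `--pathways` flag and the `parse_pathways` helper are hypothetical names, not code from this PR.

```python
import argparse


def parse_pathways(argv):
    """Read pathway names from the command line instead of a per-file constant."""
    parser = argparse.ArgumentParser(description="Process selected PANTHER pathways.")
    parser.add_argument(
        "--pathways",
        nargs="+",
        required=True,
        help="Pathway names, supplied by the Snakefile (hypothetical flag)",
    )
    return parser.parse_args(argv).pathways


# Stand-in for what the Snakefile would pass on the command line.
pathways = parse_pathways(["--pathways", "Apoptosis_signaling", "B_cell_activation"])
print(pathways)
```

In a real script the call would be `parse_pathways(sys.argv[1:])`; the explicit list here only keeps the example self-contained.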

@ntalluri
Collaborator

ntalluri commented Jul 3, 2025

A question I’d appreciate feedback on: Currently, we generate separate source, target, and prize files for each pathway, but we combine all pathways into each thresholded interactome. Should we also create a combined list of sources, targets, and prizes? Should we combine the gold standard too? Or would it be better to keep separate interactomes for each individual pathway (keep it the way it is)?

@tristan-f-r
Contributor Author

We should have separate gold standards.

@ntalluri
Collaborator

ntalluri commented Jul 28, 2025

When this is reviewed (or before), we should run tests to see how connected the networks are after thresholding, adding back the pathway data, and removing proteins that don't have UniProt IDs.
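One simple connectivity measure for such a test is the fraction of nodes that fall in the largest connected component. The sketch below assumes the network is given as a list of undirected (node1, node2) pairs; `largest_component_fraction` is a hypothetical helper, and other measures could serve equally well.

```python
from collections import Counter


def largest_component_fraction(edges):
    """Fraction of nodes that lie in the largest connected component."""
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, v in edges:
        parent[find(u)] = find(v)  # merge the two components
    if not parent:
        return 0.0
    sizes = Counter(find(node) for node in list(parent))
    return max(sizes.values()) / len(parent)


# Toy network: one component of 3 nodes and one of 2 nodes.
print(largest_component_fraction([("A", "B"), ("B", "C"), ("D", "E")]))
```

Running it once per processing stage (after thresholding, after adding pathway data, after dropping unmapped proteins) would show where connectivity is lost.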

@ntalluri
Collaborator

Also, there is a chance we can use more PANTHER pathways. We should look at what else we can use from Pathway Commons.

@tristan-f-r tristan-f-r mentioned this pull request Jul 30, 2025
@ntalluri
Collaborator

@oliverfanderson @ctrlaltaf For the gold standard nodes (and potentially the edges), should we exclude source, target, and prize nodes when defining it? Currently, it looks like we’re including these nodes in the gold standard for each pathway. These nodes overlap with the gold standard, but that overlap should happen naturally, not by construction/being predefined. I’m concerned this could inflate our precision and recall metrics, because of a form of data leakage.

@ntalluri
Collaborator

ntalluri commented Aug 11, 2025

@oliverfanderson @ctrlaltaf For the gold standard nodes (and potentially the edges), should we exclude source, target, and prize nodes when defining it? Currently, it looks like we’re including these nodes in the gold standard for each pathway. These nodes overlap with the gold standard, but that overlap should happen naturally, not by construction/being predefined. I’m concerned this could inflate our precision and recall metrics, because of a form of data leakage.

Plan: keep all of them in the gold standard, but update the evaluation code to handle the sources/targets/prizes being in the gold standard and show them as a separate baseline in which they are all set to frequency 1.0.
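One reading of that baseline, sketched under assumptions: a "prediction" made up of only the input nodes, each with frequency 1.0, is scored against the gold standard, so any method can be compared with what the inputs alone already achieve. The helper names and toy node sets below are hypothetical, not the project's evaluation code.

```python
def precision_recall(predicted, gold):
    """Node-level precision and recall of a predicted node set."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


def input_node_baseline(inputs, gold):
    """Baseline that predicts only the input nodes, each at frequency 1.0."""
    frequencies = {node: 1.0 for node in inputs}
    predicted = [node for node, freq in frequencies.items() if freq >= 1.0]
    return precision_recall(predicted, gold)


gold = {"A", "B", "C", "D", "E"}   # toy gold standard nodes
inputs = {"A", "B"}                # toy sources/targets/prizes
baseline = input_node_baseline(inputs, gold)
method = precision_recall({"A", "B", "C", "X"}, gold)
print(baseline, method)
```

A method only demonstrates real recovery when it improves on the baseline's recall, since the baseline's precision is perfect by construction whenever all inputs are in the gold standard.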

@ntalluri
Collaborator

Should we also consider how sparse an interactome becomes after applying a threshold to the STRING interactome? When we filter by size, we are implicitly accounting for the decrease in graph density as well. Would it make more sense to treat size and density as separate variables when evaluating performance? Then again, does testing for density even matter in this context? Are there any interactomes that aren’t already highly connected?

I’m thinking we should first threshold the interactomes, then select only those that are highly connected (e.g., density ≥ 0.85). From that subset, we could choose a few to represent different size scales.
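For reference, the density of an undirected simple graph with n nodes and m edges is 2m / (n(n - 1)). The sketch below computes it from an edge list and applies a cutoff; it assumes that only nodes with at least one edge count toward n, and `graph_density` and `select_dense` are hypothetical helpers.

```python
def graph_density(edges):
    """Density of an undirected simple graph: 2m / (n * (n - 1))."""
    nodes = set()
    unique_edges = set()
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        nodes.update((u, v))
        unique_edges.add(frozenset((u, v)))  # (u, v) and (v, u) are one edge
    n, m = len(nodes), len(unique_edges)
    return 0.0 if n < 2 else 2 * m / (n * (n - 1))


def select_dense(interactomes, min_density=0.85):
    """Names of interactomes whose density meets the cutoff."""
    return [name for name, edges in interactomes.items()
            if graph_density(edges) >= min_density]


triangle = [("A", "B"), ("B", "C"), ("A", "C")]
path = [("A", "B"), ("B", "C"), ("C", "D")]
print(graph_density(triangle), graph_density(path))
print(select_dense({"triangle": triangle, "path": path}))
```

Computing this for each thresholded interactome would also answer the open question of whether density varies enough to be worth treating separately from size.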

@ntalluri
Collaborator

ntalluri commented Nov 6, 2025

I will be updating how we create interactomes for the Panther pathways dataset.

Current:
Our current thresholded interactomes are built by applying a hard cutoff on experimental scores (keeping edges with score ≥ x).
While this approach retains high-score edges and removes low-score ones, it distorts the original score distribution, which could be problematic for algorithms that rely on edge scores during optimization.

New:
Instead, we should downsample the interactomes: build smaller interactomes that preserve the score distribution of the original network while reducing the total edge count.

For example, in the STRING interaction networks, when using only physical interactions and experimental edge scores, we could aim to keep 25% of all edges.
To achieve this, we could:

  • Sample 25% (or remove 75%) of edges uniformly at random, ignoring edge scores
  • Or use stratified sampling by edge score bins to preserve the distribution of scores
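A minimal sketch of both options, assuming edges are (node1, node2, score) tuples and equal-width score bins; the bin width of 300, the half-open bin boundaries, the function names, and the toy data are illustrative assumptions rather than the PR's implementation.

```python
import random
from collections import defaultdict


def uniform_downsample(edges, keep_fraction, seed=0):
    """Keep a uniform random subset of edges, ignoring scores."""
    rng = random.Random(seed)
    return rng.sample(edges, round(len(edges) * keep_fraction))


def stratified_downsample(edges, keep_fraction, bin_width=300, seed=0):
    """Keep the same fraction of edges within every score bin."""
    rng = random.Random(seed)
    bins = defaultdict(list)
    for edge in edges:
        # Half-open bins [0, 300), [300, 600), ... (assumed boundary handling)
        bins[int(edge[2] // bin_width)].append(edge)
    kept = []
    for key in sorted(bins):
        members = bins[key]
        kept.extend(rng.sample(members, round(len(members) * keep_fraction)))
    return kept


# Toy interactome with 200 / 500 / 300 edges in three score bins.
edges = (
    [(f"a{i}", f"b{i}", 100) for i in range(200)]
    + [(f"c{i}", f"d{i}", 400) for i in range(500)]
    + [(f"e{i}", f"f{i}", 700) for i in range(300)]
)
kept = stratified_downsample(edges, keep_fraction=0.25)
print(len(kept))
```

Uniform sampling preserves the score distribution only in expectation, while the stratified version fixes the per-bin counts, which matters more for small networks.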

We will now construct new interactomes by removing X% of edges and then adding all edges from all chosen PANTHER pathways. We will only keep downsampled interactomes that satisfy specified properties for a given set of sources and targets.

Proposed brute-force method for Panther pathways interactomes:

  1. Edge removal

Randomly remove X edges from the full STRING interactome

  • Option A: Remove edges uniformly at random (ignoring scores)
  • Option B: Stratify edges by score bins and sample within each bin to maintain the overall score distribution
    • We can stratify the edges into bins based on their score ranges (example: [0–300], [300–600], [600–900] ...).
    • Then, when we randomly remove X edges, we can remove them proportionally from each bin.
    • Example: suppose 20% of the original edges fall in the 0–300 bin, 50% in the 300–600 bin, and 30% in the 600–900 bin. If we're removing 1000 edges total, we'd randomly pick ~200 from the first bin, ~500 from the second, and ~300 from the third.

  2. Pathway integration

Add all edges from the selected Panther pathways to the new downsampled interactome

  3. Property checks

Verify that the new network maintains the following properties:

  • All-in-one: all sources and targets lie in the same connected component
    • V_{ST} = {all sources} ∪ {all targets}, and after edge removal, all vertices in V_{ST} should belong to the same connected component of the graph
  • Reachability: every target is reachable from at least one source (can be checked via APSP or BFS)
  • Might want to soften the criteria so that X% of the sources and targets remain in a single component, and (potentially) X% of the targets are reachable from at least one source.

  4. Restart if necessary

If the properties above are not satisfied, repeat the process with a different random sample.
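The whole loop can be sketched as follows, under several assumptions: edges are (u, v) pairs, the all-in-one check ignores direction while the reachability check follows u -> v, edge removal is uniform (Option A), and the retry cap of 100 is arbitrary. All function names are hypothetical.

```python
import random
from collections import defaultdict, deque


def reachable(edges, starts, directed):
    """Return every node reachable from `starts` by breadth-first search."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        if not directed:
            adj[v].add(u)
    seen = set(starts)
    queue = deque(seen)
    while queue:
        node = queue.popleft()
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def all_in_one(edges, sources, targets):
    """All sources and targets share one (undirected) connected component."""
    v_st = set(sources) | set(targets)
    if not v_st:
        return True
    return v_st <= reachable(edges, [next(iter(v_st))], directed=False)


def targets_reachable(edges, sources, targets):
    """Every target is reachable from at least one source along u -> v edges."""
    return set(targets) <= reachable(edges, sources, directed=True)


def sample_until_valid(interactome, pathway_edges, sources, targets,
                       remove_fraction, max_tries=100, seed=0):
    """Remove edges, add pathway edges, check properties, restart on failure."""
    rng = random.Random(seed)
    n_keep = round(len(interactome) * (1 - remove_fraction))
    for attempt in range(1, max_tries + 1):
        kept = rng.sample(interactome, n_keep)           # edge removal
        merged = sorted(set(kept) | set(pathway_edges))  # pathway integration
        if all_in_one(merged, sources, targets) and targets_reachable(
                merged, sources, targets):               # property checks
            return merged, attempt
    return None, max_tries                               # no valid sample found


interactome = [("S", "A"), ("A", "T"), ("B", "C"), ("C", "D")]
pathway = [("S", "P"), ("P", "T")]
merged, attempt = sample_until_valid(interactome, pathway, {"S"}, {"T"},
                                     remove_fraction=0.75)
print(attempt, merged)
```

The softened criteria would replace the two subset tests with a comparison of the covered fraction against X%. If edges are treated as undirected throughout, the reachability check is implied by the all-in-one check and only one of them is needed.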

@ntalluri
Collaborator

ntalluri commented Nov 6, 2025

For this dataset, we are planning on using it for all of the evaluations. I was deciding if we need to use all of the pathways, and I don't think we need to. I decided on a couple that we can use:

Balanced
Interleukin_signaling - 86 GS Nodes, 811 GS Edges, 18 S / 16 T (ratios 0.209 / 0.186)
Apoptosis_signaling - 108 GS Nodes, 286 GS Edges, 6 S / 17 T (0.056 / 0.157)

Skewed
Cadherin_signaling - 150 GS Nodes, 2650 GS Edges, 17 S / 3 T (0.113 / 0.02)
PDGF_signaling - 125 GS Nodes, 764 GS Edges, 2 S / 28 T (0.016 / 0.224)
Toll_signaling - 44 GS Nodes, 84 GS Edges, 8 S / 3 T (0.182 / 0.068)

Tiny
Hedgehog_signaling - 19 GS Nodes, 61 GS Edges, 2 S / 2 T (0.105 / 0.105)
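The ratios in parentheses appear to be the source and target counts divided by the number of gold standard nodes; the listed values are consistent with that reading. A small check, with `st_ratios` as a hypothetical helper:

```python
def st_ratios(gs_nodes, n_sources, n_targets):
    """Source and target counts as fractions of the gold standard node count."""
    return round(n_sources / gs_nodes, 3), round(n_targets / gs_nodes, 3)


print(st_ratios(86, 18, 16))   # Interleukin_signaling
print(st_ratios(150, 17, 3))   # Cadherin_signaling
print(st_ratios(19, 2, 2))     # Hedgehog_signaling
```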

When making the interactomes, I want to add all of these pathways to the thresholded interactomes and uphold the properties above.

I need to double-check whether using any of these will break the rules for pilot data/runs, but since we are making a new dataset that wasn't used for my thesis, I think we will be okay.

@tristan-f-r
Contributor Author

Made minor changes to fix the interactome fetching. These shouldn't cause any conflicts, and they aren't the changes I mentioned in Slack that I wanted to make. [If they do cause conflicts, feel free to force push.]

Collaborator

@ntalluri ntalluri left a comment

I was looking over the current state of this dataset to help with the write-up for the registered report. As I looked over it, I left a few broad questions and reminders where what I was seeing didn’t fully match my understanding of the current planned approach for the dataset. This was not a full, in-depth review.

edges_df = edges_df.sort_values(by=["NODE1", "NODE2", "Direction"], ascending=True, ignore_index=True)
edges_df = edges_df.drop_duplicates(keep="first", ignore_index=True)

edges_df.to_csv(out_folder / f"{pathway}_gs_edges.txt", sep="\t", index=False, header=False)
Collaborator

Reminder: This should be updated to make a gold standard of edges that are contained in both the pathway and the specific interactome.
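A minimal sketch of that filter, assuming edges are (NODE1, NODE2) pairs and that direction matters by default; `restrict_gold_standard` is a hypothetical helper, and the same result could be produced with an inner join on the two edge tables.

```python
def restrict_gold_standard(pathway_edges, interactome_edges, directed=True):
    """Keep only pathway edges that are also present in the interactome."""

    def key(u, v):
        # Undirected matching treats (u, v) and (v, u) as the same edge.
        return (u, v) if directed else frozenset((u, v))

    present = {key(u, v) for u, v in interactome_edges}
    return [(u, v) for u, v in pathway_edges if key(u, v) in present]


pathway_edges = [("P1", "P2"), ("P2", "P3"), ("P3", "P4")]
interactome_edges = [("P1", "P2"), ("P3", "P2"), ("X", "Y")]
print(restrict_gold_standard(pathway_edges, interactome_edges))
print(restrict_gold_standard(pathway_edges, interactome_edges, directed=False))
```

Whether matching should respect direction depends on how the interactome stores undirected interactions, which this sketch leaves as a parameter.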

prizes_df["dummy"] = ""
prizes_df.rename(columns={"uniprot": "NODEID", "prizes": "prize"}, inplace=True)
result_df = prizes_df[["NODEID", "prize", "sources", "targets", "active", "dummy"]]
result_df.to_csv(out_folder / f"{pathway}_node_prizes.txt", sep="\t", index=False, header=True)
Collaborator

Reminder: This should be updated so that the input nodes are those contained in both the pathway and the specific interactome.
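As a sketch of that restriction, assuming each prize row is a dict with a `NODEID` key (matching the column name in the snippet above) and that the interactome is a list of edges whose first two fields are node IDs; `restrict_input_nodes` is a hypothetical helper.

```python
def restrict_input_nodes(node_rows, interactome_edges):
    """Keep only input-node rows whose NODEID appears in the interactome."""
    interactome_nodes = {node for edge in interactome_edges for node in edge[:2]}
    return [row for row in node_rows if row["NODEID"] in interactome_nodes]


node_rows = [
    {"NODEID": "P1", "prize": 1.0},
    {"NODEID": "P9", "prize": 1.0},  # not present in the interactome
]
interactome_edges = [("P1", "P2", 700), ("P2", "P3", 450)]
print(restrict_input_nodes(node_rows, interactome_edges))
```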

pathway_df = convert_undirected_to_directed(pathway_df)
pathway_df = pathway_df[['Interactor1', 'Interactor2']]

print(f'Merging {pathway} with interactome...')
Collaborator

I'm assuming this isn't finished, but will you be testing for the properties that X% of the pathway edges are reachable and that X% of the pathway is in one connected component after merging the pathway into the sampled interactome?

interactome_df = interactome_df.iloc[list(full_set)]

pathway_df = pandas.read_csv(
current_directory / '..' / 'processed' / pathway / f'{pathway}_gs_edges.txt', sep='\t',
Collaborator

I'm assuming this isn't finished, but will you be updating the gold standards once we have the downsampled interactomes? I’m assuming we’ll have a separate gold standard for each interactome for each pathway. The current '{pathway}_gs_edges.txt' contains all of the edges in a pathway; some may not be reachable in an interactome.

@tristan-f-r
Contributor Author

tristan-f-r commented Feb 2, 2026

This isn't finished. I'm going to run this for longer, but if you look at the pathway.txt file generation, I haven't been able to get a connected s/t component after a concerning number of runs.

[As for your gold standard comments, I'm specifically just worried about the interactome generation, and, if I can't get a valid sampled gs after 100 runs, I'm going to further optimize interactome sampling - gold standard thresholding can be easily finished with the inner join in sampling.py]

@tristan-f-r
Contributor Author

I'll use that updated TF list 👍

@tristan-f-r
Contributor Author

tristan-f-r commented Feb 11, 2026

As a textual tl;dr, I still need to add trim.py, and I want to update the paxtools utility to [efficiently] do BioPAX extraction from the large .owl file provided by PC.

I'm going to make that trim utility, split this PR to add those utilities, and merge the EGFR interactome updates and sampling using those utility scripts, then jump back to doing the BioPAX parsing.

Collaborator

Would you be able to add more information to the README on how we are downsampling the interactomes, plus the pseudocode?

Collaborator

@ntalluri ntalluri Feb 12, 2026

Why are we providing this functionality?

Contributor Author

As mentioned in the comment above, I want to stop downloading PathwayCommons SIF files individually and extract them from OWL. I originally added this file as a quick exploration tool of this data, but I'll drop it once you have your list of signaling pathways from PathwayCommons and the criteria you used to fetch them, and move it over to use said automated selection.
