Strip tracking params (gaa_*, utm_*, ...) from stored URLs#649

Merged
mircealungu merged 2 commits into master from wt/strip-gaa-url
Jun 5, 2026

@mircealungu
Member

Problem

The bt.dk URL below broke article creation (reported in prod logs):

https://www.bt.dk/krimi/rumaenske-tyve-haerger--nu-doemmes-fire-om-dagen?gaa_at=...&gaa_n=...&gaa_ts=...&gaa_sig=...

Some publishers (e.g. bt.dk via Google Discover / "Subscribe with Google") append long signed Google Article Access tokens — gaa_at, gaa_n, gaa_ts, gaa_sig — to article URLs. The full URL is 278+ chars, and that overflowed two 255-char columns:

  1. user_activity_data.value — the OPEN POPUP event from the extension 500'd with DataError 1406: Data too long for column 'value', killing the whole /upload_user_activity_data request.
  2. url.path (via Url.get_path) — the over-long path silently failed the len(path) > 255 guard, so the article's URL row was dropped.

Fix

New helper remove_tracking_query_params() (zeeguu/core/util/url.py) strips known tracking cruft (gaa_*, utm_*, fbclid, gclid, _ga), keeps real params, and leaves non-URL strings untouched. Applied at both sites:

  • create_from_post_data: clean + clamp value to 255 (backstop against any future monster URL).
  • Url.get_path: clean at the single choke point used by __init__, find_or_create, and find, so store and lookup stay canonical and consistent.

The bt.dk URL drops from 278 → 72 chars (clean canonical article URL).
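
A minimal sketch of the two building blocks described above: deciding which query keys count as tracking cruft, and clamping a value to the column width. The names `is_tracking_param`, `clamp_value`, and the constants are hypothetical and only illustrate the idea; the merged helper in zeeguu/core/util/url.py may be structured differently.

```python
# Hypothetical sketch; names and structure are assumptions, not the PR's code.
TRACKING_PREFIXES = ("gaa_", "utm_")         # key families named in the PR
TRACKING_EXACT = {"fbclid", "gclid", "_ga"}  # single keys named in the PR
MAX_VALUE_LENGTH = 255                       # width of the two affected columns


def is_tracking_param(key: str) -> bool:
    """Return True for query keys the PR lists as tracking cruft."""
    return key in TRACKING_EXACT or key.startswith(TRACKING_PREFIXES)


def clamp_value(value: str, limit: int = MAX_VALUE_LENGTH) -> str:
    """Backstop: truncate so the value always fits the column."""
    return value[:limit]
```

Cleaning first and clamping second matters: clamping alone would store a truncated, non-canonical URL, while cleaning alone gives no guarantee against some future over-long URL.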

Tests

  • test_util_url.py: strips gaa tokens, strips utm/click-ids while keeping real params, leaves clean/non-URL/empty strings untouched.
  • test_url.py: get_path strips tracking params.

All pass. Verified the full model package still imports cleanly (no circular-import regression).

🤖 Generated with Claude Code

Some publishers (e.g. bt.dk via Google Discover / Subscribe with Google)
append long signed access tokens (gaa_at, gaa_n, gaa_ts, gaa_sig) to
article URLs. These pushed URLs past 255 chars and broke two things:

- user_activity_data.value insert 500'd with DataError 1406 on the
  OPEN POPUP event from the extension.
- Url.get_path produced an over-long path that silently failed the
  len(path) > 255 guard, so the article's URL row was dropped.

Add remove_tracking_query_params() and apply it at both sites: clean +
clamp the activity value, and clean at Url.get_path so store and lookup
stay canonical and consistent.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@github-actions

github-actions Bot commented Jun 5, 2026

ArchLens - No architecturally relevant changes to the existing views

Code review found the parse_qsl+urlencode round-trip mutated URLs even
when no tracking param was present:
- embedded articleURL=<inner url with ?/&/=> got percent-mangled and split
- ?q=a%20b -> ?q=a+b, valueless ?key -> ?key=
- signed image/CDN query values re-encoded -> broken served URLs
- every URL with a query string no longer matched its already-stored
  url.path row -> duplicate Url/Article rows, broken translated-article cache

Now operate on the raw query string: drop only the matched key=value
segments, leave survivors byte-for-byte, and return the original string
untouched when nothing was stripped. Add regression tests.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@mircealungu
Member Author

Follow-up after self code-review

Ran /code-review (high effort) on this branch. It flagged that the first cut used a parse_qsl + urlencode round-trip that mutated URLs even when no tracking param was present:

  • articleURL=-wrapped inner URLs got percent-mangled / split (breaking as_canonical_string() and translation.py's split("articleURL="))
  • ?q=a%20b → ?q=a+b, valueless ?key → ?key=
  • signed image/CDN query values re-encoded → broken served image URLs
  • every URL with a query string stopped matching its already-stored url.path → duplicate Url/Article rows + broken translated-article cache
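
The mutations listed above can be reproduced with the standard library alone. This is a sketch, not the first-cut code: `round_trip` is a hypothetical name, and `keep_blank_values=True` is an assumption made because the valueless `?key` came back as `?key=` rather than disappearing.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def round_trip(url: str) -> str:
    """Parse and re-encode the query string without removing anything."""
    parts = urlsplit(url)
    # Decoding loses the original escaping; urlencode then re-escapes its own way.
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    return urlunsplit(parts._replace(query=urlencode(pairs)))


# No tracking params anywhere, yet the URLs still change:
print(round_trip("https://example.com/search?q=a%20b"))  # https://example.com/search?q=a+b
print(round_trip("https://example.com/page?key"))        # https://example.com/page?key=

# An embedded inner URL is split at its own "&" and percent-encoded.
wrapped = "https://example.com/read?articleURL=https://news.example/a?x=1&y=2"
print(round_trip(wrapped) == wrapped)                    # False
```

Because the output differs from the input even for clean URLs, any lookup keyed on the stored string stops matching, which is the duplicate-row problem in the last bullet.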

Fixed in the latest commit: cleaning is now surgical — it splits the raw query, drops only the matched key=value segments, leaves survivors byte-for-byte, and returns the original string untouched when nothing matched. Added regression tests for all four cases. 12/12 tests pass.
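
A sketch of the surgical approach, for illustration only. `strip_tracking_segments` and `_is_tracking_key` are hypothetical names, the `http(s)://` prefix check is an assumed way of recognising non-URL strings, and the merged `remove_tracking_query_params()` may differ in all of these details.

```python
# Hypothetical sketch; not the merged implementation.
TRACKING_PREFIXES = ("gaa_", "utm_")
TRACKING_EXACT = {"fbclid", "gclid", "_ga"}


def _is_tracking_key(key: str) -> bool:
    return key in TRACKING_EXACT or key.startswith(TRACKING_PREFIXES)


def strip_tracking_segments(url: str) -> str:
    """Drop tracking key=value segments from the raw query; keep the rest verbatim."""
    if not url.startswith(("http://", "https://")):
        return url  # assumed non-URL check: leave other strings untouched

    head, hash_sep, fragment = url.partition("#")
    base, q_sep, query = head.partition("?")
    if not q_sep:
        return url  # no query string at all

    segments = query.split("&")
    kept = [s for s in segments if not _is_tracking_key(s.partition("=")[0])]
    if len(kept) == len(segments):
        return url  # nothing matched: return the original string untouched

    cleaned = base + ("?" + "&".join(kept) if kept else "")
    return cleaned + hash_sep + fragment


print(strip_tracking_segments("https://example.com/a?id=7&utm_source=x&q=a%20b"))
# https://example.com/a?id=7&q=a%20b
print(strip_tracking_segments("https://example.com/a?q=a%20b&key"))
# https://example.com/a?q=a%20b&key
```

Surviving segments are never decoded or re-encoded, so `%20` stays `%20` and a valueless `key` stays valueless. Note that this sketch treats every `&`-separated segment as part of the outer query, including any that follow an embedded inner URL.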

Accepted residual: URL rows stored before this change that genuinely had gaa_*/utm_* in their path won't match the now-cleaned lookup, so a re-encounter creates one fresh canonical row. That's the intended dedup-going-forward behavior; the stale duplicates are harmless. Not worth a backfill.

@mircealungu mircealungu merged commit fe034ad into master Jun 5, 2026
2 of 3 checks passed
@mircealungu mircealungu deleted the wt/strip-gaa-url branch June 5, 2026 16:15