Strip tracking params (gaa_*, utm_*, ...) from stored URLs by mircealungu · Pull Request #649 · zeeguu/api

mircealungu · 2026-06-05T12:12:35Z

Problem

The bt.dk URL below broke article creation (reported in prod logs):

https://www.bt.dk/krimi/rumaenske-tyve-haerger--nu-doemmes-fire-om-dagen?gaa_at=...&gaa_n=...&gaa_ts=...&gaa_sig=...

Some publishers (e.g. bt.dk via Google Discover / "Subscribe with Google") append long signed Google Article Access tokens — gaa_at, gaa_n, gaa_ts, gaa_sig — to article URLs. The full URL is 278+ chars, and that overflowed two 255-char columns:

user_activity_data.value — the OPEN POPUP event from the extension 500'd with DataError 1406: Data too long for column 'value', killing the whole /upload_user_activity_data request.
url.path (via Url.get_path) — the over-long path silently failed the len(path) > 255 guard, so the article's URL row was dropped.

Fix

New helper remove_tracking_query_params() (zeeguu/core/util/url.py) strips known tracking cruft (gaa_*, utm_*, fbclid, gclid, _ga), keeps real params, and leaves non-URL strings untouched. Applied at both sites:

create_from_post_data: clean + clamp value to 255 (backstop against any future monster URL).
Url.get_path: clean at the single choke point used by __init__, find_or_create, and find, so store and lookup stay canonical and consistent.

The bt.dk URL drops from 278 → 72 chars (clean canonical article URL).
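The "clean + clamp" step for the activity value can be sketched roughly as follows. This is only an illustration: `prepare_activity_value`, the `clean` parameter, and `MAX_VALUE_LEN` are hypothetical names, not the ones used in create_from_post_data, and the cleaner is passed in so the snippet stays self-contained.

```python
from typing import Callable, Optional

# Assumed column width, taken from the 255-char limit described above.
MAX_VALUE_LEN = 255


def prepare_activity_value(
    value: Optional[str],
    clean: Callable[[str], str] = lambda s: s,
    max_len: int = MAX_VALUE_LEN,
) -> Optional[str]:
    """Clean a value, then truncate it so it always fits the column.

    `clean` stands in for the tracking-param stripper; it defaults to the
    identity function so this sketch runs on its own.
    """
    if value is None:
        return None
    cleaned = clean(value)
    # Backstop: even if cleaning removed nothing, never exceed max_len.
    return cleaned[:max_len]
```

Cleaning first and clamping second means a URL that fits once its tracking tokens are gone is stored whole, and truncation only applies to values that are still too long afterwards.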

Tests

test_util_url.py — strips gaa tokens, strips utm/click-ids while keeping real params, leaves clean/non-URL/empty strings untouched.
test_url.py — get_path strips tracking params.

All pass. Verified the full model package still imports cleanly (no circular-import regression).

🤖 Generated with Claude Code

Some publishers (e.g. bt.dk via Google Discover / Subscribe with Google) append long signed access tokens (gaa_at, gaa_n, gaa_ts, gaa_sig) to article URLs. These pushed URLs past 255 chars and broke two things:

- user_activity_data.value insert 500'd with DataError 1406 on the OPEN POPUP event from the extension.
- Url.get_path produced an over-long path that silently failed the len(path) > 255 guard, so the article's URL row was dropped.

Add remove_tracking_query_params() and apply it at both sites: clean + clamp the activity value, and clean at Url.get_path so store and lookup stay canonical and consistent.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

github-actions · 2026-06-05T12:13:19Z

ArchLens - No architecturally relevant changes to the existing views

Code review found the parse_qsl+urlencode round-trip mutated URLs even when no tracking param was present:

- embedded articleURL=<inner url with ?/&/=> got percent-mangled and split
- ?q=a%20b -> ?q=a+b, valueless ?key -> ?key=
- signed image/CDN query values re-encoded -> broken served URLs
- every URL with a query string no longer matched its already-stored url.path row -> duplicate Url/Article rows, broken translated-article cache

Now operate on the raw query string: drop only the matched key=value segments, leave survivors byte-for-byte, and return the original string untouched when nothing was stripped. Add regression tests.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

mircealungu · 2026-06-05T15:27:49Z

Follow-up after self code-review

Ran /code-review (high effort) on this branch. It flagged that the first cut used a parse_qsl→urlencode round-trip that mutated URLs even when no tracking param was present:

articleURL=-wrapped inner URLs got percent-mangled / split (breaking as_canonical_string() and translation.py's split("articleURL="))
?q=a%20b → ?q=a+b, valueless ?key → ?key=
signed image/CDN query values re-encoded → broken served image URLs
every URL with a query string stopped matching its already-stored url.path → duplicate Url/Article rows + broken translated-article cache
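The mutations listed above can be reproduced with the standard library alone. The snippet below is a minimal sketch, not the code from the first commit: `round_trip` is a hypothetical name, and `keep_blank_values=True` is an assumption inferred from the reported `?key` → `?key=` behavior.

```python
from urllib.parse import parse_qsl, urlencode


def round_trip(query: str) -> str:
    """Parse a query string and re-encode it without removing anything."""
    pairs = parse_qsl(query, keep_blank_values=True)
    return urlencode(pairs)


# Percent-encoded space is decoded, then re-encoded as '+'.
print(round_trip("q=a%20b"))  # q=a+b

# A valueless key gains a trailing '='.
print(round_trip("key"))  # key=

# An embedded inner URL has its ':', '/', '?' and '=' percent-encoded,
# so the output no longer equals the input.
inner = "articleURL=https://a.com/p?x=1"
print(round_trip(inner) == inner)  # False
```

None of these inputs contains a tracking parameter, yet every output differs from its input, which is why previously stored url.path rows stopped matching.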

Fixed in the latest commit: cleaning is now surgical — it splits the raw query, drops only the matched key=value segments, leaves survivors byte-for-byte, and returns the original string untouched when nothing matched. Added regression tests for all four cases. 12/12 tests pass.
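A minimal sketch of the surgical approach described above, assuming the key set named in the PR description (gaa_*, utm_*, fbclid, gclid, _ga). The function and constant names here are hypothetical, and the real remove_tracking_query_params() in zeeguu/core/util/url.py may differ in its details.

```python
# Assumed tracking keys, taken from the PR description.
TRACKING_PREFIXES = ("gaa_", "utm_")
TRACKING_EXACT = frozenset({"fbclid", "gclid", "_ga"})


def _is_tracking_key(key: str) -> bool:
    return key in TRACKING_EXACT or key.startswith(TRACKING_PREFIXES)


def strip_tracking_sketch(url: str) -> str:
    """Drop tracking key=value segments from the raw query string.

    Surviving segments are never decoded or re-encoded, and the original
    string is returned unchanged when nothing was stripped.
    """
    if not url:
        return url
    # Set the fragment aside first so a '?' inside it is not treated
    # as the start of the query.
    head, hash_sep, fragment = url.partition("#")
    base, q_sep, query = head.partition("?")
    if not q_sep:
        return url
    segments = query.split("&")
    kept = [s for s in segments if not _is_tracking_key(s.split("=", 1)[0])]
    if len(kept) == len(segments):
        return url  # nothing matched: return the input untouched
    result = base
    if kept:
        result += "?" + "&".join(kept)
    if hash_sep:
        result += "#" + fragment
    return result


print(strip_tracking_sketch("https://e.com/a?id=5&utm_source=x&q=a%20b"))
# https://e.com/a?id=5&q=a%20b
```

Because the kept segments are joined back exactly as they were split, `q=a%20b` and a valueless `key` pass through unchanged, unlike in a parse-and-re-encode round trip.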

Accepted residual: URL rows stored before this change that genuinely had gaa_*/utm_* in their path won't match the now-cleaned lookup, so a re-encounter creates one fresh canonical row. That's the intended dedup-going-forward behavior; the stale duplicates are harmless. Not worth a backfill.

mircealungu merged commit fe034ad into master Jun 5, 2026
2 of 3 checks passed

mircealungu deleted the wt/strip-gaa-url branch June 5, 2026 16:15

mircealungu merged 2 commits into master from wt/strip-gaa-url