Strip tracking params (gaa_*, utm_*, ...) from stored URLs #649
Some publishers (e.g. bt.dk via Google Discover / Subscribe with Google) append long signed access tokens (gaa_at, gaa_n, gaa_ts, gaa_sig) to article URLs. These pushed URLs past 255 chars and broke two things:

- user_activity_data.value insert 500'd with DataError 1406 on the OPEN POPUP event from the extension.
- Url.get_path produced an over-long path that silently failed the len(path) > 255 guard, so the article's URL row was dropped.

Add remove_tracking_query_params() and apply it at both sites: clean + clamp the activity value, and clean at Url.get_path so store and lookup stay canonical and consistent.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
ArchLens - No architecturally relevant changes to the existing views
Code review found the parse_qsl+urlencode round-trip mutated URLs even when no tracking param was present:

- embedded articleURL=<inner url with ?/&/=> got percent-mangled and split
- ?q=a%20b -> ?q=a+b, valueless ?key -> ?key=
- signed image/CDN query values re-encoded -> broken served URLs
- every URL with a query string no longer matched its already-stored url.path row -> duplicate Url/Article rows, broken translated-article cache

Now operate on the raw query string: drop only the matched key=value segments, leave survivors byte-for-byte, and return the original string untouched when nothing was stripped. Add regression tests.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
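The mutation described in that commit message can be reproduced with the standard library alone. The snippet below is an illustrative sketch, not the code from the first commit: `round_trip` is a hypothetical name, and `keep_blank_values=True` is assumed because the review mentions `?key` turning into `?key=`.

```python
from urllib.parse import parse_qsl, urlencode


def round_trip(query: str) -> str:
    """Decode a query string into pairs, then re-encode it (the lossy approach)."""
    return urlencode(parse_qsl(query, keep_blank_values=True))


samples = [
    "q=a%20b",                                   # space encoded as %20
    "key",                                       # valueless key
    "articleURL=https://example.com/a?x=1&y=2",  # made-up embedded inner URL
]

for raw in samples:
    # None of these contains a tracking param, yet every one comes back changed.
    print(f"{raw!r} -> {round_trip(raw)!r}")

# 'q=a%20b' comes back as 'q=a+b' and 'key' comes back as 'key='.
```

Because the decode/encode pair is not an identity function, any URL with a query string can come back with different bytes, which is why a lookup against an already-stored path would miss.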
Follow-up after self code-review
Fixed in the latest commit: cleaning is now surgical — it splits the raw query, drops only the matched Accepted residual: URL rows stored before this change that genuinely had |
Problem
The bt.dk URL below broke article creation (reported in prod logs):
Some publishers (e.g. bt.dk via Google Discover / "Subscribe with Google") append long signed Google Article Access tokens (gaa_at, gaa_n, gaa_ts, gaa_sig) to article URLs. The full URL is 278+ chars, and that overflowed two 255-char columns:

- user_activity_data.value: the OPEN POPUP event from the extension 500'd with DataError 1406: Data too long for column 'value', killing the whole /upload_user_activity_data request.
- url.path (via Url.get_path): the over-long path silently failed the len(path) > 255 guard, so the article's URL row was dropped.

Fix

New helper remove_tracking_query_params() (zeeguu/core/util/url.py) strips known tracking cruft (gaa_*, utm_*, fbclid, gclid, _ga), keeps real params, and leaves non-URL strings untouched. Applied at both sites:

- create_from_post_data: clean + clamp value to 255 (backstop against any future monster URL).
- Url.get_path: clean at the single choke point used by __init__, find_or_create, and find, so store and lookup stay canonical and consistent.

The bt.dk URL drops from 278 → 72 chars (clean canonical article URL).
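For readers without the diff open, here is a minimal sketch of how such a helper could work on the raw query string. It is not the code in zeeguu/core/util/url.py: the key lists mirror the ones named above, while `clean_activity_value`, the `http(s)://` check for "non-URL strings", and the example.com URLs are assumptions made for illustration.

```python
TRACKING_PREFIXES = ("gaa_", "utm_")        # prefix families named in this PR
TRACKING_KEYS = {"fbclid", "gclid", "_ga"}  # exact keys named in this PR
MAX_VALUE_LEN = 255                         # column width mentioned above


def _is_tracking_key(key: str) -> bool:
    return key in TRACKING_KEYS or key.startswith(TRACKING_PREFIXES)


def remove_tracking_query_params(url: str) -> str:
    """Drop tracking key=value segments; leave everything else byte-for-byte."""
    # Assumption: "non-URL" means anything that does not start with http(s)://.
    if not url or not url.startswith(("http://", "https://")):
        return url
    head, hash_sep, fragment = url.partition("#")
    base, q_sep, query = head.partition("?")
    if not q_sep:
        return url
    segments = query.split("&")
    kept = [s for s in segments if not _is_tracking_key(s.partition("=")[0])]
    if len(kept) == len(segments):
        return url  # nothing stripped: return the original string untouched
    cleaned = base + ("?" + "&".join(kept) if kept else "")
    return cleaned + hash_sep + fragment


def clean_activity_value(value: str) -> str:
    """Hypothetical helper: clean first, then clamp as a backstop."""
    return remove_tracking_query_params(value)[:MAX_VALUE_LEN]


print(remove_tracking_query_params(
    "https://example.com/article?id=7&utm_source=x&q=a%20b"
))
# https://example.com/article?id=7&q=a%20b
```

Cleaning before clamping matters: truncating first could cut a URL in the middle of a token and leave a partial tracking segment behind.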
Tests
- test_util_url.py: strips gaa tokens, strips utm/click-ids while keeping real params, leaves clean/non-URL/empty strings untouched.
- test_url.py: get_path strips tracking params.

All pass. Verified the full model package still imports cleanly (no circular-import regression).
🤖 Generated with Claude Code