Merged
1 change: 1 addition & 0 deletions PROGRESS.md
@@ -41,3 +41,4 @@ Phase docs live in `docs/progress/` (untracked, local only).
| 9 | HGT retraining Session 2 | COMPLETE — retrained HGT on 140-ticker graph (no code changes; metadata extracted dynamically); val AUC **0.9807** at epoch 280 (vs 0.9803 / e=240 on the prior 30-ticker run); 7m 03s wall-clock; `MODEL_VERSION` bumped `hgt_link_pred_v1` → **`hgt_link_pred_v2`**; node_embeddings re-backfilled at 58 monthly snapshots → **8,120 rows** (140 × 58, dim=64); embedding validation passed (cos(NVDA,AMD)=0.98 > cos(NVDA,ARW)=0.63; per-component std median 0.05); **`graph_gnn_embedding_drift` IC backtest NULL at all horizons** (t=+0.382 @ 21d / +0.524 @ 63d / +0.368 @ 126d on N=52..57; HLZ fail by 10×); registered `status='rejected'` in signal_registry with full evidence record; paper trader unchanged from Phase 9 Session 1 | `docs/progress/phase_9.md` |
| 10 | Regime-aware aggregator Session 1 | COMPLETE — **NEGATIVE result, hypothesis refuted, flag rolled back**. Built `_apply_regime_gate` in aggregator + `fsi_value` param on `load_factor_records` + FSI wiring in paper_trader (5 new unit tests, 155/156 suite pass). Tested `non_calm_action: 'zero'` on `fundamental_margin_compression` (126d NON-CALM t=−2.31, N=8). Paper trader: **CAGR +8.72% → +7.68%, Sharpe 0.488 → 0.450, Max DD −32.68% → −35.39%** — all three metrics worsened. Monthly-horizon audit: factor made money in 4 of 6 NON-CALM forward months (gated rebalances sat at start of late-2022 recovery). 126d drag is a horizon artifact; doesn't translate to monthly rebalancing. Registry flag rolled back; regime-aware *infrastructure* retained as opt-in capability for future factors | `docs/progress/phase_10.md` |
| 10 | Conviction-weighted institutional flow Session 2 | COMPLETE — migration 0012 (`fund_strategy`, 22 rows: 9 T1 + 4 T2 + 6 T3 + 3 excluded banks); `compose_conviction_flow` pure helper (Δpct_portfolio, point-in-time gated on `available_as_of`); 9 TDD tests; institutional panel wired opt-in into backtest. **Primary `institutional_conviction_flow` NULL** (best raw t=+0.94 at 63d, HLZ fail by 4×) → `status='rejected'`. Ultra-T1 sub-test (5 funds: Lone Pine/Viking/Tiger/Coatue/Point72) full-window 21d t=+1.74; late-third 21d t=+3.02, 63d t=+3.33 — material but in-sample, fails HLZ M=400 (|t|≥3.78). Registered **`institutional_conviction_flow_ultra_t1` as `status='research'`** with dated review gate (2026-08-15, promote if full-window 21d t > 2.0 on extended Q2-2026 sample). NOT wired into aggregator. Paper trader unchanged. | `docs/progress/phase_10.md` |
| 10 | Opportunistic insider purchase signal Session 3A | COMPLETE — **NEGATIVE result: signal direction inverted vs hypothesis**. Migration 0013 (`insider_transactions`, 190,020 rows, 1,577 P txns, 140/140 tickers, 2018-2026). Cohen-Malloy-Pomorski opportunistic classifier (`_is_routine`: same calendar-month P in each of 3 prior years → routine; otherwise opportunistic). `compose_insider_signal` normalised by market cap; 9 TDD tests green. Full-universe IC: `insider_opportunistic_63d` 21d t=**−1.983**, 63d t=**−2.223**, 126d t=**−2.658** — uniform negative across all lookbacks and forward horizons. CALM regime 63d t=**−2.773**; late-third 21d t=**−2.870**. NON-CALM positive flip (+0.67) insufficient (N=8). Registered **`insider_opportunistic_63d` as `status='rejected'`** with full IC evidence in `regime_profile`. Not wired into aggregator. Paper trader unchanged. | `docs/progress/phase_10.md` |
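The routine/opportunistic rule quoted in the Session 3A row (same calendar-month purchase in each of the 3 prior years means routine) can be sketched as below. This is only an illustration under stated assumptions, not the PR's `_is_routine`: the function name, its signature, and the idea that one insider's purchase history arrives as a plain list of dates are all hypothetical.

```python
from datetime import date


def is_routine(trade_date: date, prior_purchase_dates: list[date]) -> bool:
    """Hypothetical sketch of the rule described in the progress table.

    A purchase counts as routine when the same insider also made a purchase
    in the same calendar month in each of the 3 preceding years; anything
    else is treated as opportunistic.
    """
    months_with_purchase = {(d.year, d.month) for d in prior_purchase_dates}
    return all(
        (trade_date.year - years_back, trade_date.month) in months_with_purchase
        for years_back in (1, 2, 3)
    )


# Made-up history for one insider: a March purchase in 2021, 2022 and 2023.
history = [date(2021, 3, 10), date(2022, 3, 4), date(2023, 3, 15)]

march_is_routine = is_routine(date(2024, 3, 8), history)
july_is_routine = is_routine(date(2024, 7, 1), history)
```

With this made-up history the March 2024 purchase is classified as routine and the July 2024 purchase as opportunistic. A point-in-time version would presumably also restrict the history to filings already public on the evaluation date.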
105 changes: 105 additions & 0 deletions migrations/versions/0013_insider_transactions.py
@@ -0,0 +1,105 @@
"""insider_transactions table for Form 4 insider signal.

Revision ID: 0013
Revises: 0012
Create Date: 2026-05-18

Phase 10 Session 3A: stores every nonDerivativeTransaction row parsed
from SEC Form 4 filings for all 140 universe tickers, 2018-01-01 to present.

Point-in-time key is ``filed_at`` (Form 4 filing date), never
``transaction_date`` (trade execution date). Form 4 is due within two
business days of the trade, so the two dates routinely differ, and that
gap matters at daily resolution.

All transaction codes are stored (P, S, A, D, M, F, etc). The factor
layer filters to code='P' (open-market purchases) for signal computation.
Storing everything enables future A/B tests on sales signal or option
exercises without re-ingesting the full archive.

UNIQUE constraint: (accession_number, insider_cik, transaction_date,
transaction_code, shares_traded, price_per_share). This is stronger than the
spec's original (company_id, insider_name, transaction_date, ...) because:
  1. accession_number is stable; insider_name formatting drifts across filings
  2. It allows two same-day same-code same-size trades at different prices
  3. Form 4/A amendments get their own accession_number, so both versions
     coexist; the factor picks the latest filed_at per trade cluster
"""
from collections.abc import Sequence
from typing import Union

import sqlalchemy as sa
from alembic import op

revision: str = "0013"
down_revision: Union[str, Sequence[str], None] = "0012"
branch_labels: Union[str, Sequence[str], None] = None
depends_on: Union[str, Sequence[str], None] = None


def upgrade() -> None:
    op.create_table(
        "insider_transactions",
        sa.Column("id", sa.BigInteger, primary_key=True, autoincrement=True),
        sa.Column(
            "company_id",
            sa.BigInteger,
            sa.ForeignKey("companies.id"),
            nullable=False,
        ),
        sa.Column("issuer_cik", sa.String(10), nullable=False),
        sa.Column("insider_cik", sa.String(20), nullable=False),
        sa.Column("insider_name", sa.String(200), nullable=False),
        sa.Column("insider_title", sa.String(200), nullable=True),
        sa.Column(
            "is_director",
            sa.Boolean,
            nullable=False,
            server_default=sa.text("FALSE"),
        ),
        sa.Column(
            "is_officer",
            sa.Boolean,
            nullable=False,
            server_default=sa.text("FALSE"),
        ),
        sa.Column(
            "is_ten_pct_owner",
            sa.Boolean,
            nullable=False,
            server_default=sa.text("FALSE"),
        ),
        sa.Column("transaction_date", sa.Date, nullable=False),
        sa.Column("filed_at", sa.Date, nullable=False),
        sa.Column("transaction_code", sa.String(1), nullable=False),
        sa.Column("acquired_disposed", sa.String(1), nullable=True),
        sa.Column("shares_traded", sa.Numeric(18, 4), nullable=False),
        sa.Column("price_per_share", sa.Numeric(12, 4), nullable=True),
        sa.Column("notional_usd", sa.Numeric(20, 4), nullable=True),
        sa.Column("shares_owned_after", sa.Numeric(18, 4), nullable=True),
        sa.Column("accession_number", sa.String(25), nullable=False),
        sa.UniqueConstraint(
            "accession_number",
            "insider_cik",
            "transaction_date",
            "transaction_code",
            "shares_traded",
            "price_per_share",
            name="uq_insider_tx",
        ),
    )
    op.create_index(
        "ix_insider_tx_company_filed",
        "insider_transactions",
        ["company_id", "filed_at"],
    )
    op.create_index(
        "ix_insider_tx_code",
        "insider_transactions",
        ["transaction_code"],
    )


def downgrade() -> None:
    op.drop_index("ix_insider_tx_code", table_name="insider_transactions")
    op.drop_index("ix_insider_tx_company_filed", table_name="insider_transactions")
    op.drop_table("insider_transactions")
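The migration docstring says the factor gates on ``filed_at`` and picks the latest ``filed_at`` per trade cluster when a Form 4/A coexists with the original. A minimal in-memory sketch of that selection follows. The `Row` type, the `visible_rows` helper, and above all the cluster key are assumptions made for illustration; the PR does not spell out how a trade cluster is defined.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Row:
    """Hypothetical subset of the insider_transactions columns."""
    insider_cik: str
    transaction_date: date
    transaction_code: str
    shares_traded: float
    filed_at: date
    accession_number: str


def visible_rows(rows: list[Row], as_of: date) -> list[Row]:
    """Rows knowable on `as_of`, one per trade cluster (latest filing wins)."""
    latest: dict[tuple[str, date, str], Row] = {}
    for row in rows:
        if row.filed_at > as_of:
            continue  # point-in-time gate: this filing was not yet public
        # ASSUMED cluster key. It ignores shares and price so that an
        # amendment correcting them replaces the original filing.
        key = (row.insider_cik, row.transaction_date, row.transaction_code)
        kept = latest.get(key)
        if kept is None or row.filed_at > kept.filed_at:
            latest[key] = row
    return list(latest.values())


# Made-up example: an original filing and a later amendment of the same trade.
original = Row("111", date(2024, 3, 1), "P", 1000.0, date(2024, 3, 5), "A-1")
amended = Row("111", date(2024, 3, 1), "P", 1200.0, date(2024, 3, 20), "A-2")
rows = [original, amended]

before_any_filing = visible_rows(rows, date(2024, 3, 4))
after_original = visible_rows(rows, date(2024, 3, 10))
after_amendment = visible_rows(rows, date(2024, 3, 31))
```

The trade-off in this assumed key is that two genuinely distinct same-day, same-code trades by one insider would collapse into one row, so a real implementation would need a finer rule than the one sketched here.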
30 changes: 30 additions & 0 deletions nexus/data/edgar/client.py
@@ -183,3 +183,33 @@ def get_13f_infotable(self, cik: str, accession_number: str) -> str | None:
        except Exception as e:
            print(f" [ERROR] infotable {accession_number}: {e}")
            return None

    def get_form4_xml(self, cik: str, accession_number: str) -> str | None:
        """Fetch the raw Form 4 ownershipDocument XML from EDGAR.

        EDGAR stores two XML variants per Form 4 filing:
          1. xslF345X06/filename.xml: XSLT-rendered HTML (not parseable as XML)
          2. filename.xml: raw ownershipDocument XML (what we want)

        The filing index lists both; we pick the first link whose path does
        not contain ``xsl`` (case-insensitive), which excludes the rendered
        variant in the xslF345X06/ subdirectory.
        """
        try:
            import re as _re
            cik_int = int(cik)
            path_acc = accession_number.replace("-", "")
            base = f"https://www.sec.gov/Archives/edgar/data/{cik_int}/{path_acc}"

            index_html = self._get(f"{base}/{accession_number}-index.htm").text

            all_xml = _re.findall(r'href="([^"]+\.xml)"', index_html, _re.IGNORECASE)
            # Exclude XSLT-rendered variants (live in xslF345X06/ subdirectory)
            raw_xmls = [x for x in all_xml if "xsl" not in x.lower()]
            if not raw_xmls:
                return None

            xml_name = raw_xmls[0].split("/")[-1]
            return self._get(f"{base}/{xml_name}").text

        except Exception as e:
            print(f" [ERROR] form4_xml {accession_number}: {e}")
            return None
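The link-selection step inside `get_form4_xml` can be exercised without network access by lifting it into a pure function. The sketch below assumes a hypothetical helper name (`pick_raw_form4_xml`) and a hand-written index fragment with placeholder paths; it is not real EDGAR output and not part of the PR.

```python
from __future__ import annotations

import re


def pick_raw_form4_xml(index_html: str) -> str | None:
    """Return the file name of the first .xml link whose path lacks 'xsl'.

    Mirrors the filtering in get_form4_xml: the XSLT-rendered copy lives
    under a directory such as xslF345X06/, so any href containing 'xsl'
    is dropped and the first remaining link is taken.
    """
    hrefs = re.findall(r'href="([^"]+\.xml)"', index_html, re.IGNORECASE)
    raw = [h for h in hrefs if "xsl" not in h.lower()]
    return raw[0].split("/")[-1] if raw else None


# Hand-written stand-in for a filing index page (paths are placeholders).
sample_index = """
<a href="/Archives/edgar/data/1234567/000123456724000001/xslF345X06/form4.xml">form4.html</a>
<a href="/Archives/edgar/data/1234567/000123456724000001/form4.xml">form4.xml</a>
"""

picked = pick_raw_form4_xml(sample_index)
only_rendered = pick_raw_form4_xml('<a href="/x/xslF345X06/form4.xml">rendered</a>')
```

One consequence of the substring filter is that a raw document whose own file name happens to contain "xsl" would also be excluded, so a stricter check on the directory component may be worth considering.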
211 changes: 211 additions & 0 deletions nexus/data/edgar/forms/form_4.py
@@ -0,0 +1,211 @@
"""Form 4 (Statement of Changes in Beneficial Ownership) XML parser.

Parses the raw ``ownershipDocument`` XML returned by
``EDGARClient.get_form4_xml``.

Only ``nonDerivativeTransaction`` rows are extracted — derivative
transactions (options, SARs) are compensation-driven, not open-market
signals. ``nonDerivativeHolding`` rows (position snapshots without a
transaction) are also skipped.

Point-in-time note
------------------
``InsiderTransaction.filed_at`` is the Form 4 filing date, i.e. the date the
information became public. ``transaction_date`` is when the trade executed,
typically up to 2 business days earlier (more when a filing is late). The
factor MUST gate on ``filed_at``, never ``transaction_date``.

XML structure (verified against live EDGAR 2026-03-24 NVDA filing)
------------------------------------------------------------------
- ``transactionCode`` has NO ``<value>`` child; read ``.text`` directly
- All numeric fields (shares, price, sharesOwned) use ``<value>`` wrappers
- ``rptOwnerCik`` has leading zeros; strip with ``.lstrip("0")``
- ``nonDerivativeHolding`` elements appear inside ``nonDerivativeTable``
  but lack ``transactionCoding``; detect and skip them
"""
from __future__ import annotations

import xml.etree.ElementTree as ET
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class InsiderTransaction:
    issuer_cik: str
    insider_cik: str
    insider_name: str
    insider_title: str  # empty string if not an officer
    is_director: bool
    is_officer: bool
    is_ten_pct_owner: bool
    transaction_date: date
    filed_at: date  # point-in-time key; always use this, never transaction_date
    transaction_code: str  # P=purchase, S=sale, A=award, D=disposition to company, etc.
    acquired_disposed: str  # A=acquired, D=disposed; empty string if unavailable
    shares_traded: float
    price_per_share: float | None
    notional_usd: float | None  # shares_traded × price_per_share; None when price is None
    shares_owned_after: float | None
    accession_number: str


def parse_form4(
    xml_text: str,
    accession_number: str,
    filed_at: date,
) -> list[InsiderTransaction]:
    """Parse a Form 4 ownershipDocument XML. Returns one record per
    nonDerivativeTransaction row. Pure function, no IO.

    Parameters
    ----------
    xml_text:
        Raw XML string from EDGARClient.get_form4_xml.
    accession_number:
        EDGAR accession number for this filing (stored for dedup).
    filed_at:
        Form 4 filing date from the EDGAR submissions index: the
        point-in-time availability date (NOT transactionDate).
    """
    if not xml_text or not xml_text.strip():
        return []
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        # Retry once with non-ASCII characters dropped before giving up.
        xml_text = xml_text.encode("ascii", "ignore").decode()
        try:
            root = ET.fromstring(xml_text)
        except ET.ParseError:
            return []

    # ── issuer ───────────────────────────────────────────────────────────────
    issuer_cik = _text(root, "issuer/issuerCik").lstrip("0") or "0"

    # ── reporting owner ──────────────────────────────────────────────────────
    owner = root.find("reportingOwner")
    if owner is None:
        return []

    insider_cik = _text(owner, "reportingOwnerId/rptOwnerCik").lstrip("0") or "0"
    insider_name = _text(owner, "reportingOwnerId/rptOwnerName").strip()

    rel = owner.find("reportingOwnerRelationship")
    is_director = _bool_flag(rel, "isDirector")
    is_officer = _bool_flag(rel, "isOfficer")
    is_ten_pct = _bool_flag(rel, "isTenPercentOwner")
    officer_title = _text(rel, "officerTitle").strip() if rel is not None else ""

    # ── non-derivative transactions ───────────────────────────────────────────
    table = root.find("nonDerivativeTable")
    if table is None:
        return []

    results: list[InsiderTransaction] = []
    for tx in table.findall("nonDerivativeTransaction"):
        # transactionCode has NO <value> wrapper; read .text directly.
        # nonDerivativeHolding elements have no transactionCoding child.
        coding = tx.find("transactionCoding")
        if coding is None:
            continue
        tc_el = coding.find("transactionCode")
        if tc_el is None or not (tc_el.text or "").strip():
            continue
        transaction_code = tc_el.text.strip()

        tx_date_str = _value(tx, "transactionDate")
        if not tx_date_str:
            continue
        try:
            transaction_date = date.fromisoformat(tx_date_str)
        except ValueError:
            continue

        amounts = tx.find("transactionAmounts")
        if amounts is None:
            continue

        shares_str = _value(amounts, "transactionShares")
        try:
            shares_traded = float((shares_str or "0").replace(",", ""))
        except ValueError:
            continue
        if shares_traded == 0:
            continue

        price_str = _value(amounts, "transactionPricePerShare")
        try:
            price_per_share: float | None = (
                float(price_str.replace(",", "")) if price_str else None
            )
        except ValueError:
            price_per_share = None

        notional_usd = (
            shares_traded * price_per_share if price_per_share is not None else None
        )

        acquired_disposed = _value(amounts, "transactionAcquiredDisposedCode") or ""

        post = tx.find("postTransactionAmounts")
        shares_after_str = (
            _value(post, "sharesOwnedFollowingTransaction") if post is not None else None
        )
        try:
            shares_owned_after: float | None = (
                float(shares_after_str.replace(",", ""))
                if shares_after_str
                else None
            )
        except ValueError:
            shares_owned_after = None

        results.append(InsiderTransaction(
            issuer_cik=issuer_cik,
            insider_cik=insider_cik,
            insider_name=insider_name,
            insider_title=officer_title,
            is_director=is_director,
            is_officer=is_officer,
            is_ten_pct_owner=is_ten_pct,
            transaction_date=transaction_date,
            filed_at=filed_at,
            transaction_code=transaction_code,
            acquired_disposed=acquired_disposed,
            shares_traded=shares_traded,
            price_per_share=price_per_share,
            notional_usd=notional_usd,
            shares_owned_after=shares_owned_after,
            accession_number=accession_number,
        ))

    return results


# ── helpers ───────────────────────────────────────────────────────────────────

def _text(element: ET.Element | None, path: str, default: str = "") -> str:
    """Text of a descendant located by simple slash-separated path."""
    if element is None:
        return default
    el = element.find(path)
    return (el.text or default) if el is not None else default


def _value(element: ET.Element | None, tag: str, default: str = "") -> str:
    """Text of the <value> child of a named child element."""
    if element is None:
        return default
    parent = element.find(tag)
    if parent is None:
        return default
    val_el = parent.find("value")
    return (val_el.text or default) if val_el is not None else default


def _bool_flag(element: ET.Element | None, tag: str) -> bool:
    """Return True if the named child element has text '1'."""
    if element is None:
        return False
    el = element.find(tag)
    return (el.text or "").strip() == "1" if el is not None else False
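The XML quirks listed in the module docstring (a bare `transactionCode`, `<value>`-wrapped numerics, holdings sharing the table with transactions) can be seen on a small hand-written fragment. The XML below is a made-up illustration trimmed to the fields of interest, not a real EDGAR filing, and the snippet uses `ElementTree` directly rather than the PR's helpers.

```python
import xml.etree.ElementTree as ET

# Hand-written sample: one transaction and one holding in the same table.
SAMPLE = """<ownershipDocument>
  <nonDerivativeTable>
    <nonDerivativeTransaction>
      <transactionDate><value>2024-03-01</value></transactionDate>
      <transactionCoding><transactionCode>P</transactionCode></transactionCoding>
      <transactionAmounts>
        <transactionShares><value>1,000</value></transactionShares>
        <transactionPricePerShare><value>12.50</value></transactionPricePerShare>
      </transactionAmounts>
    </nonDerivativeTransaction>
    <nonDerivativeHolding>
      <postTransactionAmounts>
        <sharesOwnedFollowingTransaction><value>500</value></sharesOwnedFollowingTransaction>
      </postTransactionAmounts>
    </nonDerivativeHolding>
  </nonDerivativeTable>
</ownershipDocument>"""

root = ET.fromstring(SAMPLE)

# Selecting by tag name leaves the nonDerivativeHolding element out.
txs = root.findall("nonDerivativeTable/nonDerivativeTransaction")
tx = txs[0]

# transactionCode carries its text directly, with no <value> child.
code = tx.findtext("transactionCoding/transactionCode")

# Numeric fields sit one level deeper, inside <value>; strip thousands commas.
shares = float(tx.findtext("transactionAmounts/transactionShares/value").replace(",", ""))
price = float(tx.findtext("transactionAmounts/transactionPricePerShare/value"))
notional = shares * price
```

Reading `transactionCode` through a `<value>` lookup would return nothing here, which is why the parser treats that one field differently from the numeric ones.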