145 changes: 85 additions & 60 deletions PLAN.md
@@ -1,86 +1,111 @@
# PLAN: MES Core — Week 2 (Modbus Machine State Reader)
# PLAN: MES Core — Week 5 (Downtime Tracking)

**Branch:** `feat/mes-week2-state-reader`
**Issue:** Mikecranesync/MIRA#320
**PRD:** `docs/PRD-MES-CORE.md`
**Date:** 2026-04-15
**Depends on:** Week 1 (feat/mes-week1-db-schema) merged
**Branch:** `feat/mes-week5-downtime`
**Issue:** Mikecranesync/MIRA#323
**PRD:** `docs/PRD-MES-CORE.md §4.4`
**Date:** 2026-04-16
**Depends on:** Weeks 1–4 merged

---

## Objective

Build the machine state reader: a background poller that reads the plc-modbus HTTP API every 5 seconds per configured line, detects state transitions (RUNNING/DOWN/IDLE/OFFLINE), writes them to `machine_states`, and exposes `GET /api/mes/lines` and `GET /api/mes/lines/{id}/state` REST endpoints.
Complete the "Core Four" anchor: downtime tracking with three capture modes:
1. **AUTO** — PLC fault code → reason_code (already live via state_poller + state_machine.py)
2. **MANUAL** — operator or MIRA sends a reason_code directly via REST
3. **NLP** — operator or MIRA sends a free-text description → keyword classifier → reason_code

## Affected Files

**New:**
- `services/mes/backend/services/__init__.py`
- `services/mes/backend/services/plc_client.py` — async HTTP client wrapping plc-modbus
- `services/mes/backend/services/state_machine.py` — pure state detection from IO snapshot
- `services/mes/backend/services/state_poller.py` — asyncio background poll loop
- `services/mes/backend/routes/lines.py` — GET /api/mes/lines, GET /lines/{id}/state
- `services/mes/tests/test_machine_states.py` — 10 unit tests, all mocked
- `services/mes/backend/services/downtime_classifier.py` — pure NLP keyword→reason_code
- `services/mes/backend/routes/downtime.py` — 3 endpoints
- `services/mes/tests/test_downtime.py` — classifier + API tests

**Modified:**
- `services/mes/requirements.txt` — add httpx
- `services/mes/backend/config.py` — add plc_modbus_url setting
- `services/mes/backend/main.py` — wire poller into lifespan, add lines router
- `docker-compose.yml` — add PLC_MODBUS_URL env to mes container
- `services/mes/backend/main.py` — include downtime router
- `PLAN.md` — this file

## Approach

1. `plc_client.py` — thin async wrapper around `GET /api/plc/io` (httpx). Raises `PLCOfflineError` on timeout/connection failure so caller can set OFFLINE state.
2. `state_machine.py` — pure function `detect_state(io_data)` → `(MachineStateEnum, reason_code | None)`. Derived from `VFDStatus` and `ErrorCode` registers. No DB or network calls — fully testable without mocks.
3. `state_poller.py` — asyncio task, one iteration per line every 5s. Maintains in-memory cache to avoid DB reads on every tick. Writes to `machine_states` only on transition.
4. `lines.py` routes — two endpoints: list all lines (from DB), get current state (from in-memory cache + last DB row).
5. `main.py` lifespan — start poller task on startup, cancel on shutdown.

State transition write: close open row (`ended_at = NOW()`), insert new row.

## State Machine
---

```
IO: VFDStatus=1, ErrorCode=0 → RUNNING
IO: VFDStatus=2 OR ErrorCode>0 → DOWN (reason_code from ErrorCode map)
IO: VFDStatus=0, ErrorCode=0 → IDLE
HTTP failure / timeout → OFFLINE
```
## Approach

## ErrorCode → reason_code map
### 1. NLP Classifier (pure function, no LLM)

`classify_reason(text: str) -> tuple[str, str]`
Returns `(reason_code, confidence)` where confidence is "high" or "low".

Keyword priority table (first match wins):
| Keywords | Reason Code |
|----------|-------------|
| estop / e-stop / emergency stop | E_STOP |
| pm / preventive / scheduled maint | MAINT_PM |
| breakdown / broken / failed / fault | MAINT_BREAKDOWN |
| tooling / tool change | CHANGEOVER_TOOLING |
| changeover / product change / switchover | CHANGEOVER_PRODUCT |
| jam / jammed / stuck / blocked conveyor | JAM |
| starved / no material / empty / feed | STARVED_MATERIAL |
| blocked / downstream / full | BLOCKED_DOWNSTREAM |
| quality / hold / inspection / reject | QUALITY_HOLD |
| overload / overcurrent | OVERLOAD |
| overheat / hot / thermal | OVERHEAT |
| sensor / proximity / photoelectric | SENSOR_FAIL |
| comms / communication / timeout / network | COMMS_FAIL |

Fallback → UNKNOWN, confidence="low"
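
A minimal sketch of how `classify_reason` could follow the keyword table above. The plan does not say how keywords are matched, so this sketch assumes case-insensitive matching anchored at the start of a word (`jam` also matches `jammed`, but `pm` does not match inside `equipment`). The `_RULES` and `_COMPILED` names and the non-string guard are illustrative, not taken from the actual `downtime_classifier.py`.

```python
import re

# Ordered rules copied from the plan's table; earlier rows win.
_RULES = [
    (("estop", "e-stop", "emergency stop"), "E_STOP"),
    (("pm", "preventive", "scheduled maint"), "MAINT_PM"),
    (("breakdown", "broken", "failed", "fault"), "MAINT_BREAKDOWN"),
    (("tooling", "tool change"), "CHANGEOVER_TOOLING"),
    (("changeover", "product change", "switchover"), "CHANGEOVER_PRODUCT"),
    (("jam", "jammed", "stuck", "blocked conveyor"), "JAM"),
    (("starved", "no material", "empty", "feed"), "STARVED_MATERIAL"),
    (("blocked", "downstream", "full"), "BLOCKED_DOWNSTREAM"),
    (("quality", "hold", "inspection", "reject"), "QUALITY_HOLD"),
    (("overload", "overcurrent"), "OVERLOAD"),
    (("overheat", "hot", "thermal"), "OVERHEAT"),
    (("sensor", "proximity", "photoelectric"), "SENSOR_FAIL"),
    (("comms", "communication", "timeout", "network"), "COMMS_FAIL"),
]

# Assumption: a keyword must begin at a word boundary; suffixes are allowed.
_COMPILED = [
    (re.compile("|".join(r"\b" + re.escape(k) for k in keywords)), code)
    for keywords, code in _RULES
]


def classify_reason(text: str) -> tuple[str, str]:
    """Return (reason_code, confidence); never raises."""
    if not isinstance(text, str):
        # Defensive guard so callers always get a tuple back.
        return ("UNKNOWN", "low")
    lowered = text.lower()
    for pattern, code in _COMPILED:
        if pattern.search(lowered):
            return (code, "high")
    return ("UNKNOWN", "low")
```

Because the first matching row wins, "blocked conveyor" resolves to JAM before the broader `blocked` keyword can map it to BLOCKED_DOWNSTREAM.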

### 2. Endpoints (`downtime.py`)

| Method | Path | Description |
|--------|------|-------------|
| GET | `/api/mes/downtime-reasons` | List all 14 reason codes |
| GET | `/api/mes/lines/{id}/downtime?hours=8` | All DOWN/CHANGEOVER events for line |
| POST | `/api/mes/lines/{id}/downtime` | Attach reason to current open DOWN event |

POST body (two modes):
- Direct: `{ "reason_code": "JAM", "entered_by": "OPERATOR", "notes": "..." }`
- NLP: `{ "description": "the line is jammed", "entered_by": "MIRA_AI" }`

POST logic:
1. Line must exist → 404
2. Must have an open DOWN/CHANGEOVER state (ended_at IS NULL) → 409 if not
3. If reason_code given: validate it exists → 422 if not
4. If description given: classify → reason_code (fallback to UNKNOWN)
5. UPDATE machine_states SET reason_code=?, entered_by=?, notes=?
6. Return updated event
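
The check order above can be sketched as a pure function, separate from FastAPI and the database. This is an illustration only: `plan_downtime_update` and its parameters are hypothetical names, and returning 422 when neither `reason_code` nor `description` is present is an assumption the plan does not state.

```python
from typing import Callable, Optional


def plan_downtime_update(
    line_exists: bool,
    has_open_event: bool,
    body: dict,
    valid_codes: set,
    classify: Callable[[str], tuple],
) -> tuple:
    """Return (http_status, update_fields or None) following the POST logic."""
    if not line_exists:
        return 404, None
    if not has_open_event:
        # No open DOWN/CHANGEOVER row (ended_at IS NULL) for this line.
        return 409, None

    reason_code: Optional[str] = body.get("reason_code")
    if reason_code is not None:
        # Direct mode: the code must exist in downtime_reasons.
        if reason_code not in valid_codes:
            return 422, None
    elif body.get("description"):
        # NLP mode: the classifier falls back to UNKNOWN on no match.
        reason_code, _confidence = classify(body["description"])
    else:
        # Assumption: a body with neither field is rejected.
        return 422, None

    return 200, {
        "reason_code": reason_code,
        "entered_by": body.get("entered_by"),
        "notes": body.get("notes"),
    }
```

Keeping the decision separate from the route handler would let the 404/409/422 ordering be unit tested without a database.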

### 3. Response shape

```python
{1: "OVERLOAD", 2: "OVERHEAT", 3: "SENSOR_FAIL", 4: "JAM", 7: "E_STOP"}
class DowntimeEventResponse(BaseModel):
    id: str
    line_id: str
    state: str                    # DOWN or CHANGEOVER
    reason_code: Optional[str]
    reason_desc: Optional[str]    # joined from downtime_reasons
    category: Optional[str]       # PLANNED / UNPLANNED / EXTERNAL
    entered_by: str
    notes: Optional[str]
    started_at: datetime
    ended_at: Optional[datetime]
    duration_min: Optional[int]   # None if still open
```
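
One way `duration_min` could be derived from the two timestamps. The helper name `compute_duration_min` is hypothetical, and truncating to whole minutes is an assumption; the plan only says the value is `None` while the event is still open.

```python
from datetime import datetime
from typing import Optional


def compute_duration_min(
    started_at: datetime, ended_at: Optional[datetime]
) -> Optional[int]:
    """Whole minutes between start and end, or None for an open event."""
    if ended_at is None:
        return None
    # Assumption: partial minutes are dropped rather than rounded.
    return int((ended_at - started_at).total_seconds() // 60)
```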

---

## Risks

- plc-modbus in mock mode returns VFDStatus=0 at rest — poller sees IDLE immediately (expected)
- Multiple lines share one plc-modbus service currently — same io_data, different `line_id` rows
- POST must find the open DOWN row atomically: use a single DB query filtering on
  `ended_at IS NULL AND state IN ('DOWN','CHANGEOVER')` rather than the in-memory cache.
- NLP classifier must never raise — always returns a (code, confidence) tuple.
- If line has multiple open rows (shouldn't happen, but defensive): update only the most recent.
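
A sketch of the single-statement update described in the risks above. It uses the standard-library `sqlite3` module only so the example runs on its own; the MES database engine and driver are not shown in this plan, and `DEMO_SCHEMA`, `UPDATE_OPEN_EVENT` and `attach_reason` are hypothetical names. Column names follow the response shape.

```python
import sqlite3

# Demo-only schema; the real machine_states table comes from the Week 1 migration.
DEMO_SCHEMA = """
CREATE TABLE machine_states (
    id TEXT PRIMARY KEY, line_id TEXT, state TEXT, reason_code TEXT,
    entered_by TEXT, notes TEXT, started_at TEXT, ended_at TEXT
)
"""

# One statement: pick the most recent open DOWN/CHANGEOVER row and update it.
UPDATE_OPEN_EVENT = """
UPDATE machine_states
   SET reason_code = :reason_code, entered_by = :entered_by, notes = :notes
 WHERE id = (
       SELECT id FROM machine_states
        WHERE line_id = :line_id
          AND ended_at IS NULL
          AND state IN ('DOWN', 'CHANGEOVER')
        ORDER BY started_at DESC
        LIMIT 1
 )
"""


def attach_reason(conn, line_id, reason_code, entered_by, notes=None) -> bool:
    """Return True if an open event was updated, False if none exists (409 case)."""
    cur = conn.execute(
        UPDATE_OPEN_EVENT,
        {
            "line_id": line_id,
            "reason_code": reason_code,
            "entered_by": entered_by,
            "notes": notes,
        },
    )
    conn.commit()
    return cur.rowcount == 1
```

When no open row exists the subquery yields NULL, no row matches, and the caller can translate the `False` result into the 409 response.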

## Rollback

```bash
git checkout feat/mes-week1-db-schema
```

## Verification Steps

```bash
# Unit tests (no docker needed)
cd services/mes && pytest tests/test_machine_states.py -v

# Integration: start stack, check state endpoint
docker compose up mes-db mes plc-modbus -d
curl localhost:8300/api/mes/lines
curl localhost:8300/api/mes/lines/<id>/state

# Inject a fault and verify DB transition
curl -X POST localhost:8001/api/plc/mock/fault -H "Content-Type: application/json" -d '{"fault_type":"jam"}'
sleep 8
curl localhost:8300/api/mes/lines/<id>/state # should show DOWN / JAM
```
Delete the new files and remove the downtime router import and registration from main.py. No DB schema changes.

## Note on Active Focus Window
## Verification

Explicitly authorized by Mike (2026-04-15 session).
1. `pytest tests/test_downtime.py -v` — all new tests pass
2. `pytest tests/ -v` — full suite (66 + new) passes, zero regressions
3. NLP: "the conveyor is jammed" → JAM, "scheduled PM" → MAINT_PM, "e-stop" → E_STOP
4. POST with no open DOWN → 409
5 changes: 4 additions & 1 deletion services/mes/backend/config.py
@@ -24,7 +24,10 @@ class Settings(BaseSettings):
    # Polling interval in seconds (default 5, set lower in tests)
    plc_poll_interval_sec: int = 5

    # Set True to skip poller startup (useful in unit tests)
    # OEE calculator tick interval in seconds (default 60)
    oee_tick_sec: int = 60

    # Set True to skip background task startup (useful in unit tests)
    plc_use_mock: bool = False


50 changes: 37 additions & 13 deletions services/mes/backend/main.py
@@ -1,12 +1,21 @@
"""FactoryLM MES API — FastAPI entry point.

Lifespan:
startup → seed state cache, launch background state poller
shutdown → signal poller to stop cleanly
startup → launch state poller + OEE calculator background tasks
shutdown → stop both tasks cleanly

Routes (cumulative by week):
Week 1: /api/health
Week 2: /api/mes/lines, /api/mes/lines/{id}/state
Week 3: /api/mes/lines/{id}/oee, /api/mes/lines/{id}/oee/history
/api/mes/oee/summary, /api/mes/kpis
Week 4: /api/mes/products (POST/GET)
/api/mes/work-orders (POST/GET), /api/mes/work-orders/{id} (GET)
/api/mes/work-orders/{id}/status (PATCH)
Schedule-aware TEEP via schedules table
Week 5: /api/mes/downtime-reasons
/api/mes/lines/{id}/downtime (GET/POST)
NLP classifier: free-text → reason_code
"""

import asyncio
@@ -17,9 +26,12 @@
from fastapi.middleware.cors import CORSMiddleware

from backend.config import settings
from backend.routes.downtime import router as downtime_router
from backend.routes.health import router as health_router
from backend.routes.lines import router as lines_router
from backend.services import state_poller
from backend.routes.oee import router as oee_router
from backend.routes.work_orders import router as work_orders_router
from backend.services import oee_calculator, state_poller

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
@@ -30,25 +42,34 @@ async def lifespan(app: FastAPI):
    db_host = settings.database_url.split("@")[-1]
    logger.info("MES service starting — DB: %s PLC: %s", db_host, settings.plc_modbus_url)

    poller_task = None
    poller_task = oee_task = None

    if not settings.plc_use_mock:
        poller_task = asyncio.create_task(
            state_poller.run(poll_interval_sec=settings.plc_poll_interval_sec),
            name="state_poller",
        )
        logger.info("State poller started (interval=%ds)", settings.plc_poll_interval_sec)
        oee_task = asyncio.create_task(
            oee_calculator.run(tick_sec=settings.oee_tick_sec),
            name="oee_calculator",
        )
        logger.info(
            "Background tasks started — poller=%ds oee_tick=%ds",
            settings.plc_poll_interval_sec, settings.oee_tick_sec,
        )
    else:
        logger.info("PLC mock mode — state poller disabled")
        logger.info("PLC mock mode — background tasks disabled")

    yield

    logger.info("MES service shutting down")
    if poller_task:
        state_poller.stop()
    state_poller.stop()
    oee_calculator.stop()
    for task in [t for t in [poller_task, oee_task] if t]:
        try:
            await asyncio.wait_for(poller_task, timeout=8.0)
        except asyncio.TimeoutError:
            poller_task.cancel()
            await asyncio.wait_for(task, timeout=8.0)
        except (asyncio.TimeoutError, asyncio.CancelledError):
            task.cancel()


app = FastAPI(
@@ -65,8 +86,11 @@ async def lifespan(app: FastAPI):
    allow_headers=["*"],
)

app.include_router(health_router, prefix=settings.api_prefix)
app.include_router(lines_router, prefix=settings.api_prefix)
app.include_router(health_router, prefix=settings.api_prefix)
app.include_router(lines_router, prefix=settings.api_prefix)
app.include_router(oee_router, prefix=settings.api_prefix)
app.include_router(work_orders_router, prefix=settings.api_prefix)
app.include_router(downtime_router, prefix=settings.api_prefix)


if __name__ == "__main__":