Add GR801 SoC radiation hardness simulation pipeline #62
Co-authored-by: danielnovais-tech <230455320+danielnovais-tech@users.noreply.github.com>
Pull request overview
Adds a new, self-contained GR801 SoC “radiation hardness” simulation module plus a pytest suite to validate basic initialization, fault injection/correction, monitoring, and end-to-end execution.
Changes:
- Introduces simulation_pipeline_gr801.py, implementing SoC/app/radiation/state structures and a main loop with fault injection, ECC-like correction, and shutdown checks.
- Adds tests/test_simulation_pipeline_gr801.py with unit and smoke tests for the new simulation pipeline.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 10 comments.
| File | Description |
|---|---|
| simulation_pipeline_gr801.py | New GR801 SoC radiation fault simulation pipeline (core logic + metrics + shutdown behavior). |
| tests/test_simulation_pipeline_gr801.py | New pytest coverage for initialization, fault injection/tolerance, monitoring, and simulation runs. |
    def log_state(metrics: dict[str, Any]) -> None:
        """Log the current state."""
        print(f"Time: {metrics['time']:.2f}s, Errors: {metrics['errors']}, Performance: {metrics['performance']:.2f}")

    def safety_violation_detected(state: SimulationState) -> bool:
        """Check for safety violations (e.g., too many errors)."""
        # If errors exceed a threshold, trigger a shutdown.
        error_threshold = 1000
        if state.soc.errors > error_threshold:
            print(f"Safety violation: Too many errors ({state.soc.errors})")
            return True
        return False

    def trigger_safe_shutdown(state: SimulationState) -> None:
        """Trigger a safe shutdown of the system."""
        print("Triggering safe shutdown.")
        # Save critical data, power down, etc.
This module uses print() for logging in log_state, safety_violation_detected, and trigger_safe_shutdown. Elsewhere in the repo (e.g., simulation_pipeline.py) simulation logging uses logging.getLogger(__name__) and LOGGER.info/error. To align with that convention and avoid noisy stdout in library use, switch these to a module logger (and ideally make logging frequency configurable).
Switched to logging.getLogger(__name__) throughout the module in commit e7b5e45. All print() statements replaced with LOGGER.info/warning/error to match the simulation_pipeline.py convention.
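The convention described in this thread can be sketched as follows. This is a minimal illustration, not the code from commit e7b5e45: LOGGER follows the name used in the review comment, while the step and log_interval parameters are hypothetical additions showing one way to make logging frequency configurable.

```python
import logging
from typing import Any

# Module-level logger, as the review recommends. Handlers and levels are
# left to the application that imports this module.
LOGGER = logging.getLogger(__name__)


def log_state(metrics: dict[str, Any], step: int = 0, log_interval: int = 1) -> bool:
    """Log the current state every `log_interval` steps.

    `step` and `log_interval` are hypothetical parameters for illustration.
    Returns True if the logging call was made for this step.
    """
    if log_interval <= 0 or step % log_interval != 0:
        return False
    # %-style arguments defer string formatting until a handler needs it.
    LOGGER.info(
        "Time: %.2fs, Errors: %d, Performance: %.2f",
        metrics["time"], metrics["errors"], metrics["performance"],
    )
    return True
```

With this shape, a library user who never configures logging sees no output at INFO level, and a caller running long simulations can pass a larger log_interval to thin out the records.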
    sim.run_ai_application(soc, app)

    assert app.output is not None
    # Memory should be modified (output is stored at memory[0])
initial_memory = soc.memory[0] is assigned but never used, which will fail Ruff/pyflakes (F841). Either remove it or use it to assert that run_ai_application() actually changes memory[0].
Suggested change:
    - # Memory should be modified (output is stored at memory[0])
    + # Memory should be modified (output is stored at memory[0])
    + assert soc.memory[0] != initial_memory
Fixed in commit e7b5e45. Now using initial_memory in assertion: assert soc.memory[0] != initial_memory
            self.performance = 0.0  # Some performance metric

    class RadiationModel:
        """Models the radiation environment."""
        def __init__(self, particle_flux: float, upset_rate: float):
Top-level class/function definitions need two blank lines between them to satisfy Ruff/pycodestyle (E302/E305). For example, there’s only one blank line between SoC and RadiationModel here; apply consistent 2-blank-line spacing throughout the module.
Fixed in commit e7b5e45. Added two blank lines between all top-level class and function definitions per PEP 8.
    num_cores = config.get('num_cores', 4)
    memory_size = config.get('memory_size', 1024*1024)  # 1 MB
    accelerator = config.get('accelerator', True)
    return SoC(num_cores, memory_size, accelerator)
memory_size = config.get('memory_size', 1024*1024) will trigger Ruff E226 (missing whitespace around arithmetic operator). Update to 1024 * 1024 to satisfy the repo lint configuration (ruff.toml enables E rules).
Fixed in commit e7b5e45. Changed to 1024 * 1024 with proper spacing.
    # Run the AI application
    run_ai_application(soc, app)

    # Inject faults due to radiation
    faults = inject_faults(soc, radiation, dt)
    state.faults_injected += faults

    # Apply fault tolerance
    corrected = apply_fault_tolerance(soc)
    state.faults_corrected += corrected

    # Update radiation model (if dynamic)
    update_radiation_model(radiation, dt)

    # Update time
    state.time += dt

    # Monitor and log
    metrics = monitor_state(state)
    metrics_history.append(metrics)
The performance metric recorded in metrics can be stale: run_ai_application() updates soc.performance before faults are injected/corrected, but monitor_state() reads it after those updates. Consider recomputing soc.performance after inject_faults/apply_fault_tolerance (or inside monitor_state) so the reported performance corresponds to the current soc.errors for that timestep.
Fixed in commit e7b5e45. Removed performance update from run_ai_application() and moved it to after apply_fault_tolerance() in run_simulation(), ensuring performance reflects current error count after fault handling.
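The ordering issue in this thread can be illustrated with a small sketch. It is not the PR's code: SoCStub and compute_performance are hypothetical stand-ins (the PR's actual performance formula is not shown in this conversation), and the only point is that the metric is recomputed after fault injection and correction.

```python
from dataclasses import dataclass


@dataclass
class SoCStub:
    """Hypothetical stand-in for the PR's SoC class (only the fields used here)."""
    errors: int = 0
    performance: float = 0.0


def compute_performance(errors: int) -> float:
    """Hypothetical metric that degrades as uncorrected errors accumulate."""
    return 1.0 / (1.0 + errors)


def step(soc: SoCStub, new_faults: int, correction_rate: float = 0.8) -> dict:
    """One timestep: inject, correct, then refresh performance before monitoring."""
    soc.errors += new_faults                       # fault injection
    corrected = int(soc.errors * correction_rate)  # ECC-like correction
    soc.errors -= corrected
    # Recompute last, so the metric matches the error count being reported.
    soc.performance = compute_performance(soc.errors)
    return {"errors": soc.errors, "performance": soc.performance}
```

Because the performance update is the last write before the metrics dict is built, each history entry pairs an error count with the performance derived from that same count.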
    # Simplified: Assume we can correct some errors with ECC in memory and cache.
    # We assume a fixed correction rate.
    correction_rate = 0.8  # 80% of errors are corrected
    corrected = int(soc.errors * correction_rate)
    soc.errors -= corrected
    return corrected
correction_rate is hard-coded to 0.8 here, but the PR description calls out a configurable ECC correction rate. Consider reading this from config (or making it a parameter with a documented default) so callers can tune the simulation without editing code.
Made configurable in commit e7b5e45. Added correction_rate parameter (default: 0.8) to apply_fault_tolerance() and reads from config in run_simulation().
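A minimal sketch of what a configurable correction rate might look like is shown below. The correction_rate parameter and its 0.8 default come from this thread. The 'correction_rate' config key, the range check, the SoCStub class, and the correction_rate_from helper are assumptions added for illustration.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class SoCStub:
    """Hypothetical stand-in for the PR's SoC class."""
    errors: int = 0


def apply_fault_tolerance(soc: SoCStub, correction_rate: float = 0.8) -> int:
    """Correct a fixed fraction of outstanding errors and return how many."""
    # Range check is an assumption; the PR may handle bad values differently.
    if not 0.0 <= correction_rate <= 1.0:
        raise ValueError(f"correction_rate must be in [0, 1], got {correction_rate}")
    corrected = int(soc.errors * correction_rate)
    soc.errors -= corrected
    return corrected


def correction_rate_from(config: dict[str, Any]) -> float:
    """Read the rate from config, assuming a 'correction_rate' key."""
    return float(config.get("correction_rate", 0.8))
```

A caller could then write apply_fault_tolerance(soc, correction_rate_from(config)), so tuning the simulation only requires changing the config dict.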
    # If errors exceed a threshold, trigger a shutdown.
    error_threshold = 1000
    if state.soc.errors > error_threshold:
        print(f"Safety violation: Too many errors ({state.soc.errors})")
        return True
    return False
error_threshold is hard-coded to 1000, but the PR description mentions a configurable error threshold for emergency shutdown. Consider making this configurable via config (or a function parameter) so different missions/environments can set appropriate safety limits.
Made configurable in commit e7b5e45. Added error_threshold parameter (default: 1000) to safety_violation_detected() and reads from config in run_simulation().
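The threshold change can be sketched in the same spirit. The error_threshold parameter and its default of 1000 come from this thread. Passing a bare error count instead of a SimulationState, the 'error_threshold' config key, and the first_violation helper are simplifications assumed here to keep the example self-contained.

```python
import logging
from typing import Any

LOGGER = logging.getLogger(__name__)


def safety_violation_detected(errors: int, error_threshold: int = 1000) -> bool:
    """Return True when the error count strictly exceeds the threshold."""
    if errors > error_threshold:
        LOGGER.error("Safety violation: too many errors (%d > %d)", errors, error_threshold)
        return True
    return False


def first_violation(error_counts: list[int], config: dict[str, Any]) -> int:
    """Return the index of the first step that violates the limit, or -1.

    Hypothetical helper: assumes the limit lives under 'error_threshold'.
    """
    threshold = int(config.get("error_threshold", 1000))
    for index, errors in enumerate(error_counts):
        if safety_violation_detected(errors, threshold):
            return index
    return -1
```

Keeping the comparison strict (>) matches the snippet under review, so a run that sits exactly at the limit does not shut down.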
    state.faults_injected += faults

    # Apply fault tolerance
    corrected = apply_fault_tolerance(soc)
    state.faults_corrected += corrected
state.faults_injected / state.faults_corrected are accumulated over the whole run (+= each step), but the PR description says metrics are “per-timestep”. Either rename these metrics to make it explicit they are cumulative (e.g., total_faults_injected), or record per-step values in the metrics history (or both) to match the stated API.
Suggested change:
    - state.faults_injected += faults
    - # Apply fault tolerance
    - corrected = apply_fault_tolerance(soc)
    - state.faults_corrected += corrected
    + # Store per-timestep faults injected
    + state.faults_injected = faults
    + # Apply fault tolerance
    + corrected = apply_fault_tolerance(soc)
    + # Store per-timestep faults corrected
    + state.faults_corrected = corrected
Fixed in commit e7b5e45. Renamed to total_faults_injected and total_faults_corrected in metrics dict to clearly indicate cumulative tracking over the simulation run.
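The review also mentions recording both views. A small sketch of that "both" option follows, assuming each step yields an (injected, corrected) pair. The total_faults_injected and total_faults_corrected keys follow the names in the reply above, while the per-step keys and the record_metrics helper are hypothetical.

```python
def record_metrics(step_faults: list[tuple[int, int]], dt: float = 0.1) -> list[dict]:
    """Build a metrics history with per-timestep and cumulative fault counts."""
    history = []
    total_injected = 0
    total_corrected = 0
    time = 0.0
    for injected, corrected in step_faults:
        total_injected += injected
        total_corrected += corrected
        time += dt
        history.append({
            "time": time,
            # Per-timestep values (hypothetical key names).
            "faults_injected": injected,
            "faults_corrected": corrected,
            # Cumulative values, named as in the reply above.
            "total_faults_injected": total_injected,
            "total_faults_corrected": total_corrected,
        })
    return history
```

Storing both lets a consumer plot instantaneous fault rates without differencing the cumulative series, at the cost of two extra keys per entry.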
    'num_cores': 4,
    'memory_size': 1024*1024,
    'accelerator': True,
    'particle_flux': 5.0,  # High radiation environment
    'upset_rate': 1e-4,
    'task': 'image_classification',
    'input_data': np.random.rand(100, 100),
'memory_size': 1024*1024 will trigger Ruff E226 (missing whitespace around arithmetic operator). Update to 1024 * 1024 to satisfy the repo lint configuration.
Fixed in commit e7b5e45. Changed to 1024 * 1024 with proper spacing in main block.
    # With high flux and upset rate, we should see some faults
    assert total_faults > 0
    assert soc.errors == total_faults
This test is statistically flaky: with particle_flux=10, dt=0.1, upset_rate=0.1, the total faults over 10 iterations is ~Poisson(λ≈1), so there's ~37% chance total_faults is 0 and this assertion fails. Make it deterministic by seeding/mocking the RNG (e.g., patch np.random.poisson/random/choice) or by choosing parameters/assertions that don't rely on chance.
Fixed in commit e7b5e45. Made the test deterministic by:
- Seeding np.random with np.random.seed(42)
- Increasing particle_flux to 100.0 and upset_rate to 0.5 for higher fault probability
- Adding a comment explaining expected behavior
The test now passes reliably on every run.
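The reviewer's 37% estimate can be checked with a few lines. The sketch below assumes the Poisson model the review describes and chip_area = 1.0 cm^2 from the original prompt, so the expected fault count is flux x area x dt x upset_rate x steps. It uses only the standard library: sample_poisson (Knuth's method) and total_faults are hypothetical stand-ins for the PR's NumPy-based injection, which is not shown in full here.

```python
import math
import random


def prob_no_faults(particle_flux: float, upset_rate: float, dt: float,
                   steps: int, chip_area: float = 1.0) -> float:
    """P(zero faults over the whole run) under a Poisson model."""
    lam = particle_flux * chip_area * dt * upset_rate * steps
    return math.exp(-lam)


def sample_poisson(rng: random.Random, lam: float) -> int:
    """Draw one Poisson(lam) sample with Knuth's method (fine for small lam)."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def total_faults(seed: int, lam_per_step: float, steps: int) -> int:
    """Sum per-step draws from a generator seeded locally, not globally."""
    rng = random.Random(seed)
    return sum(sample_poisson(rng, lam_per_step) for _ in range(steps))


# Original parameters: lambda is about 1, so roughly a 37% chance of no faults.
p_original = prob_no_faults(10.0, 0.1, 0.1, 10)
# Revised parameters, assuming dt and the iteration count are unchanged.
p_revised = prob_no_faults(100.0, 0.5, 0.1, 10)
```

Under these assumptions the revised parameters make a zero-fault run vanishingly unlikely even without a seed, and the seed makes the exact count reproducible. Passing a local generator, as total_faults does, avoids mutating global RNG state shared with other tests. NumPy's np.random.default_rng(seed) supports the same pattern.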
@copilot apply changes based on the comments in this thread
All changes from the review have been applied in commit e7b5e45. All tests pass and linting is clean.
Summary
All review feedback has been addressed. The simulation now uses proper logging with appropriate log levels, has configurable parameters for correction_rate and error_threshold, and all tests pass reliably.
Original prompt
simulation_pipeline_gr801.py
    import numpy as np
    from typing import Dict, Any, List, Optional

    # --- Data Structures ---

    class SoC:
        """Model of the GR801 SoC."""
        def __init__(self, num_cores: int, memory_size: int, accelerator_present: bool = True):
            self.num_cores = num_cores
            self.memory = np.zeros(memory_size, dtype=np.uint8)
            self.accelerator_present = accelerator_present
            self.registers = [0] * 32 * num_cores  # Assuming 32 registers per core
            self.cache = np.zeros(1024, dtype=np.uint8)  # Simplified cache
            self.errors = 0
            self.performance = 0.0  # Some performance metric

    class RadiationModel:
        """Models the radiation environment."""
        def __init__(self, particle_flux: float, upset_rate: float):
            self.particle_flux = particle_flux  # particles per cm^2 per second
            self.upset_rate = upset_rate  # probability of an upset per particle

    class AIApplication:
        """Represents an AI application running on the SoC."""
        def __init__(self, task: str, input_data: np.ndarray):
            self.task = task  # e.g., "image_classification"
            self.input_data = input_data
            self.output = None
            self.accuracy = 1.0  # Current accuracy of the application

    class SimulationState:
        """Holds the current state of the simulation."""
        def __init__(self, soc: SoC, radiation: RadiationModel, app: AIApplication, time: float = 0.0):
            self.soc = soc
            self.radiation = radiation
            self.app = app
            self.time = time
            self.faults_injected = 0
            self.faults_corrected = 0

    # --- Initialization ---

    def initialize_soc(config: Dict[str, Any]) -> SoC:
        """Initialize the SoC with given configuration."""
        num_cores = config.get('num_cores', 4)
        memory_size = config.get('memory_size', 1024*1024)  # 1 MB
        accelerator = config.get('accelerator', True)
        return SoC(num_cores, memory_size, accelerator)

    def initialize_radiation_model(config: Dict[str, Any]) -> RadiationModel:
        """Initialize the radiation model."""
        particle_flux = config.get('particle_flux', 1.0)  # particles/cm^2/s
        upset_rate = config.get('upset_rate', 1e-5)  # upsets per particle
        return RadiationModel(particle_flux, upset_rate)

    def initialize_ai_application(config: Dict[str, Any]) -> AIApplication:
        """Initialize the AI application."""
        task = config.get('task', 'image_classification')
        input_data = config.get('input_data', np.random.rand(100, 100))
        return AIApplication(task, input_data)

    # --- Core Steps ---

    def run_ai_application(soc: SoC, app: AIApplication) -> None:
        """Run the AI application on the SoC."""
        # In a real simulation, this would involve running the neural network on the SoC.
        # Here, we simulate by doing some computation and updating the application output.
        if soc.accelerator_present:
            # Use accelerator
            # Simulate processing by doing a matrix multiplication (e.g., convolution)
            # For simplicity, we'll just compute a dot product.
            processed_data = np.dot(app.input_data.flatten(), app.input_data.flatten())
        else:
            # Use CPU cores
            processed_data = np.sum(app.input_data)

    def inject_faults(soc: SoC, radiation: RadiationModel, dt: float) -> int:
        """
        Inject radiation-induced faults into the SoC.
        Returns the number of faults injected.
        """
        # Calculate expected number of particles hitting the chip
        chip_area = 1.0  # cm^2 (simplified)
        expected_particles = radiation.particle_flux * chip_area * dt

    def apply_fault_tolerance(soc: SoC) -> int:
        """
        Apply fault tolerance mechanisms to correct errors.
        Returns the number of faults cor...