
✨ Add stepwise intermediate reward for RL #526

Open
Shaobo-Zhou wants to merge 122 commits into munich-quantum-toolkit:main from Shaobo-Zhou:new_RL

Conversation

@Shaobo-Zhou
Contributor

@Shaobo-Zhou Shaobo-Zhou commented Nov 26, 2025

Description

This PR introduces a shaped, step-wise reward signal for the RL-based compiler.

For the figures of merit expected_fidelity and estimated_success_probability, the reward is computed in two regimes:

  1. Exact regime (native + mapped circuits)
    If the circuit consists only of device-native gates and respects the device’s coupling map, the step reward is based on the change in the exact calibration-aware metric between successive steps.

  2. Approximate regime (non-native / unmapped circuits)
    If the circuit still contains non-native gates or violates the device topology, a conservative canonical cost model is used to approximate the expected fidelity and ESP. The intermediate reward is then derived from the change in this approximate metric.

Checklist:

  • The pull request only contains commits that are focused and relevant to this change.
  • I have added appropriate tests that cover the new/changed functionality.
  • I have updated the documentation to reflect these changes.
  • I have added entries to the changelog for any noteworthy additions, changes, fixes, or removals.
  • I have added migration instructions to the upgrade guide (if needed).
  • The changes follow the project's style guidelines and introduce no new warnings.
  • The changes are fully tested and pass the CI checks.
  • I have reviewed my own code changes.

Shaobo Zhou and others added 30 commits March 29, 2025 19:20
  • Update action space and feature space
  • Update actions
  • Update action space
  • Fix: resolve pre-commit issues and add missing annotations
  • Fix: resolve pre-commit issues and add missing annotations
  • Remove example_test.py
  • Remove example_test.py
  • Fix: resolve pre-commit issues and add missing annotations
  • Fix: resolve pre-commit issues and add missing annotations
  • Fix: resolve pre-commit issues and add missing annotations
    Signed-off-by: Shaobo-Zhou <109073755+Shaobo-Zhou@users.noreply.github.com>
  • Fix bugs
  • Fix bugs
  • Fix bugs
  • Fix windows runtime warning problem
  • Fix windows runtime warning issue

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 6

🤖 Fix all issues with AI agents
In `@src/mqt/predictor/rl/approx_reward.py`:
- Around line 62-63: The code repeatedly calls estimate_basis_gate_counts (which
internally transpiles) for the same circuit state; modify calculate_reward
(and/or predictorenv.py's step) to memoize the basis-gate counts keyed by a
canonical representation of the quantum circuit (e.g., serialized QASM or a hash
of qc.data + device target), reuse the cached result for subsequent calls, and
fall back to calling get_basis_gates_from_target and estimate_basis_gate_counts
only on cache misses; ensure the cache key and lookup are used wherever basis =
get_basis_gates_from_target(device) and counts = estimate_basis_gate_counts(qc,
basis_gates=basis) are invoked so duplicate transpilation is avoided.
- Around line 30-40: The loop in estimate_basis_gate_counts uses deprecated
tuple unpacking of qc_t.data; replace it by iterating over CircuitInstruction
objects and access attributes directly (operation/qubits/clbits). Concretely, in
estimate_basis_gate_counts change the loop to iterate like "for circ_instr in
qc_t.data:", extract the gate with "instr = circ_instr.operation" (and use
circ_instr.qubits / circ_instr.clbits if needed), then use instr.name and the
existing BLACKLIST/counts logic.
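The attribute-access iteration style suggested above looks roughly like this. Since this is only a sketch of the pattern (not the module's actual code), Qiskit's `CircuitInstruction` and operation objects are mocked with minimal stand-in classes; the `BLACKLIST` contents match the non-gate operations named elsewhere in the review.

```python
from collections import Counter
from dataclasses import dataclass

# Minimal stand-ins for Qiskit's CircuitInstruction/operation objects,
# used only to illustrate attribute access instead of tuple unpacking.
@dataclass
class FakeOperation:
    name: str

@dataclass
class FakeCircuitInstruction:
    operation: FakeOperation
    qubits: tuple = ()
    clbits: tuple = ()

BLACKLIST = {"measure", "reset", "delay", "barrier"}

def count_basis_gates(data) -> Counter:
    """Count gate occurrences, skipping blacklisted non-gate operations."""
    counts: Counter = Counter()
    for circ_instr in data:           # iterate CircuitInstruction objects
        instr = circ_instr.operation  # attribute access, no tuple unpacking
        if instr.name not in BLACKLIST:
            counts[instr.name] += 1
    return counts
```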

In `@src/mqt/predictor/rl/predictorenv.py`:
- Around line 259-261: The code only suppresses delta_reward when transitioning
from "approx"→"exact"; change the logic in the reward computation (referencing
prev_reward_kind, new_kind, and delta_reward in predictorenv.py) to suppress the
delta for any regime change (i.e., if prev_reward_kind != new_kind then set
delta_reward = 0.0) and ensure prev_reward_kind is still updated after this
check so later steps see the new regime; adjust the conditional that currently
checks (self.prev_reward_kind == "approx" and new_kind == "exact") to a generic
regime-change check and keep existing comments about metrics comparability.
- Around line 623-624: Replace the local duplicate gate_blacklist in
_ensure_device_averages_cached with the shared BLACKLIST constant from
approx_reward to avoid divergence; import BLACKLIST from the approx_reward
module and use it when building basis_ops (i.e., change basis_ops = [name for
name in op_names if name not in gate_blacklist] to use BLACKLIST), removing the
local {"measure","reset","delay","barrier"} definition.
- Around line 263-268: The two branches computing reward_val in predictorenv.py
are identical; replace the if/elif block that compares delta_reward with a
single assignment using the existing symbols: set reward_val = self.reward_scale
* delta_reward (preserving delta_reward == 0.0 result) inside the same function
where delta_reward and reward_val are used so you remove the redundant
conditional logic.
- Around line 218-229: Calculate and assign prev_reward and prev_reward_kind
only when the chosen action is non-terminal: move the call to calculate_reward
from before apply_action into the branch that handles non-terminal actions
(i.e., the branch that proceeds after altered_qc is not None and not the
terminate action), so terminal/terminate actions skip the expensive pre-action
evaluation; ensure apply_action, calculate_reward, and the used_actions logic
remain unchanged. Also initialize prev_reward and prev_reward_kind in reset()
(or add a guard before use) so the first step has defined values when no prior
reward exists.
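The generic regime-change check asked for in the first comment above amounts to something like the following sketch (names follow the review's wording; the actual environment code differs):

```python
# Sketch: suppress the delta on ANY regime change, not only approx -> exact,
# since exact and approximate metrics are not directly comparable.

def shaped_delta(prev_kind: str, new_kind: str,
                 prev_value: float, new_value: float) -> float:
    if prev_kind != new_kind:
        return 0.0  # metrics from different regimes are incomparable
    return new_value - prev_value
```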


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@src/mqt/predictor/rl/predictorenv.py`:
- Around line 268-352: Update the calculate_reward signature to narrow the mode
parameter from plain str to a Literal type (Literal["auto", "exact", "approx"])
to catch invalid modes at type-check time; add the appropriate import for
Literal (from typing or typing_extensions depending on project compatibility)
and update any affected type hints/exports so static checkers (mypy/pyright)
pick it up—no logic changes needed inside calculate_reward or its use of
self._is_native_and_mapped, expected_fidelity,
approx_estimated_success_probability, etc.
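The `Literal` narrowing suggested above can be sketched like this (the runtime guard is optional; the point of `Literal` is that mypy/pyright reject invalid modes at type-check time). `calculate_reward_stub` is a hypothetical stand-in, not the real method:

```python
from typing import Literal, get_args

RewardMode = Literal["auto", "exact", "approx"]

def calculate_reward_stub(mode: RewardMode = "auto") -> str:
    """Static checkers flag e.g. calculate_reward_stub(mode="exat")
    as an error; the runtime check below is just belt-and-braces."""
    if mode not in get_args(RewardMode):
        raise ValueError(f"invalid mode: {mode!r}")
    return mode
```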


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/mqt/predictor/rl/predictorenv.py (1)

104-104: 🧹 Nitpick | 🔵 Trivial

Use lazy logging formatting for consistency.

String concatenation in the log call is inconsistent with the %s-style formatting used elsewhere in this file (e.g., lines 217, 271). Ruff rule G003 discourages eager string building in logging calls.

♻️ Suggested fix
-        logger.info("Init env: " + reward_function)
+        logger.info("Init env: %s", reward_function)
🤖 Fix all issues with AI agents
In `@src/mqt/predictor/rl/predictorenv.py`:
- Around line 244-245: The call to calculate_reward at the start of each step is
redundant because self.prev_reward and self.prev_reward_kind already hold the
post-action reward from the previous step; remove the redundant invocation in
the step logic (the line calling calculate_reward that sets
self.prev_reward/self.prev_reward_kind) and instead initialize/seed
self.prev_reward and self.prev_reward_kind during reset() by calling
calculate_reward() once there, ensuring the end-of-step code that updates
prev_reward/prev_reward_kind (lines around where post-action reward is set)
continues to overwrite for the next step; adjust any references in step() and
reset() accordingly to reuse the carried-over prev_reward/prev_reward_kind
without recomputing on each new step.
- Around line 726-735: _determine_valid_actions_for_state and
_is_native_and_mapped duplicate expensive GatesInBasis/CheckMap work; cache the
pass instances and share computed results to avoid repeated checks. Create
cached pass instances on self (e.g., self._gates_in_basis and self._check_map)
initialized once and reused in both determine_valid_actions_for_state and
_is_native_and_mapped, and have determine_valid_actions_for_state store the last
computed flags (e.g., self._last_only_nat_gates and self._last_mapped) after
running the checks so _is_native_and_mapped can return those cached booleans if
available instead of re-running the passes; fall back to running the cached pass
instances if no cached results exist and update the cache.
- Around line 218-242: The estimated_hellinger_distance branch duplicates the
apply→decompose→update-state→check-valid-actions logic found later; extract that
shared sequence into a helper (e.g. _apply_and_update or
_apply_action_and_update_state) which calls apply_action, performs the gate-type
decompose loop, sets self.state and self.state._layout, increments
self.num_steps, updates self.valid_actions via determine_valid_actions_for_state
and raises RuntimeError if empty; then replace the duplicated block in the
reward_function == "estimated_hellinger_distance" branch to call the helper and
only keep the branch-specific reward logic (calculate_reward(mode="exact") vs
no_effect_penalty) before returning create_feature_dict(self.state), reward,
done, False, {}.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@tests/compilation/test_predictor_rl.py`:
- Around line 149-204: Tests test_approx_reward_ef and test_approx_reward_esp
duplicate setup and assertions; replace them with a single parametrized test
using pytest.mark.parametrize over figure_of_merit values ("expected_fidelity"
and "estimated_success_probability") that creates Predictor(figure_of_merit=...)
and monkeypatches predictor.env._is_native_and_mapped the same way, runs val,
kind = predictor.env.calculate_reward(...), asserts common conditions (kind ==
"approx", 0<=val<=1, predictor.env._dev_avgs_cached, isinstance checks and
len(predictor.env._err_by_gate)>0, len(predictor.env._dur_by_gate)>0), and then
include the ESP-specific assertions (assert len(..._dur_by_gate)>0 and assert
predictor.env._tbar is None or predictor.env._tbar > 0.0) only when the
parameter == "estimated_success_probability"; keep references to Predictor,
predictor.env.calculate_reward, _is_native_and_mapped, _err_by_gate,
_dur_by_gate, and _tbar to locate the code.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/mqt/predictor/rl/predictorenv.py (1)

562-576: 🧹 Nitpick | 🔵 Trivial

determine_valid_actions_for_state creates fresh pass instances while _native_and_mapped reuses cached ones.

Both methods perform the same GatesInBasis + CheckMap checks. determine_valid_actions_for_state still instantiates new objects each call (lines 564, 574), while _native_and_mapped uses the cached instances from __init__. Consider reusing the cached instances here too, or calling _native_and_mapped internally.

🤖 Fix all issues with AI agents
In `@src/mqt/predictor/rl/predictorenv.py`:
- Around line 705-717: The cache in _native_and_mapped uses id(qc) which can be
recycled and cause stale hits; either remove the cache entirely and always run
_gates_in_basis_check and _check_map, or replace the id-based key with a safe
version counter: add a _state_version integer on the class (initialize in
__init__), increment _state_version whenever self.state is reassigned, store and
compare _last_state_version (instead of _last_qc_id) alongside
_last_native_mapped in _native_and_mapped, and update that version when you set
self.state so the cached (only_native, mapped) is only reused for the current
state.
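The version-counter cache proposed above follows this pattern; a sketch with a placeholder for the actual GatesInBasis + CheckMap passes (the `checks_run` counter is only instrumentation for the illustration):

```python
# Sketch: replace an id(qc)-keyed cache (ids can be recycled) with a
# monotonically increasing state version that invalidates on reassignment.

class EnvCacheSketch:
    def __init__(self):
        self._state_version = 0
        self._last_state_version = -1
        self._last_native_mapped = None
        self.checks_run = 0  # instrumentation for this sketch only

    def set_state(self, qc):
        self.state = qc
        self._state_version += 1  # any reassignment invalidates the cache

    def native_and_mapped(self):
        if self._last_state_version == self._state_version:
            return self._last_native_mapped  # safe cache hit
        self.checks_run += 1
        result = self._run_checks(self.state)
        self._last_native_mapped = result
        self._last_state_version = self._state_version
        return result

    def _run_checks(self, qc):
        # placeholder for the cached GatesInBasis + CheckMap pass runs
        return (True, True)
```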

In `@tests/compilation/test_predictor_rl.py`:
- Around line 175-178: The condition is checking the imported type alias
figure_of_merit instead of the test parameter fom, so the ESP-specific
assertions (predictor.env._dur_by_gate and predictor.env._tbar) never run;
update the conditional to use the test parameter (if fom ==
"estimated_success_probability") so the assertions execute for the ESP reward
path and validate duration/tbar caching for predictor.env._dur_by_gate and
predictor.env._tbar.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🤖 Fix all issues with AI agents
In `@src/mqt/predictor/rl/predictorenv.py`:
- Around line 636-639: The suppress(KeyError, AttributeError, TypeError) around
the basis_ops loop is too broad and can hide legitimate TypeError bugs when
calling target.operation_from_name or reading op.num_qubits; change the
exception list to (KeyError, AttributeError) (i.e., remove TypeError) and
optionally add a debug/log line inside the except path that records the gate
name and the caught exception so skipped gates are visible; locate the loop
using basis_ops, target.operation_from_name, arity_by_name and op.num_qubits to
make the change.
- Around line 238-243: The code path for reward_function ==
"estimated_hellinger_distance" only gives the episode terminal step a real
reward and assigns no_effect_penalty for every non-terminal step, which leaves
the agent without shaping signal; update the branch in predictorenv.py (the
block checking self.reward_function == "estimated_hellinger_distance") to either
implement a non-terminal shaping heuristic or, if this is intentional, add a
concise explanatory comment above the block referencing that choice and the
involved symbols (self.reward_function, "estimated_hellinger_distance",
self.no_effect_penalty, self.calculate_reward, and self.action_terminate_index)
so readers know why calculate_reward is only called at done and why non-terminal
steps receive no signal.

In `@tests/compilation/test_predictor_rl.py`:
- Around line 165-170: The monkeypatch targets a non-existent method
`_is_native_and_mapped` and has no effect; update the test to force the
approximate path by calling calculate_reward with mode="approx" instead of
relying on device mapping heuristics. Locate the call to
predictor.env.calculate_reward(qc=qc, mode="auto") in the test and change the
mode argument to "approx" (remove the monkeypatch line that sets
`_is_native_and_mapped`), ensuring the test exercises the approximate reward
branch in calculate_reward rather than depending on GatesInBasis/CheckMap
behavior.

Comment on lines +238 to +243

    done = action == self.action_terminate_index

    if self.reward_function == "estimated_hellinger_distance":
        reward_val = self.calculate_reward(mode="exact")[0] if done else self.no_effect_penalty
        self.state._layout = self.layout  # noqa: SLF001
        return create_feature_dict(self.state), reward_val, done, False, {}

@coderabbitai coderabbitai bot Feb 8, 2026


🧹 Nitpick | 🔵 Trivial

Hellinger path provides no shaping signal — non-terminal steps always get a flat penalty.

This is a design choice, but it means the RL agent receives zero useful gradient signal during an episode for estimated_hellinger_distance — reward information only comes at termination. If this is intentional (e.g., because there's no approximate Hellinger model), a brief comment explaining the rationale would help future readers.

🧰 Tools
🪛 Ruff (0.14.14)

[warning] 242-242: Unused noqa directive (non-enabled: SLF001)

Remove unused noqa directive

(RUF100)


Collaborator


Indeed, there's no approximate Hellinger model. However, I don't understand why this must imply a no-effect penalty here (the circuit might have changed)?



@flowerthrower You're absolutely right — this does look like a design oversight rather than an intentional choice.

The current code assigns no_effect_penalty to every non-terminal step when using estimated_hellinger_distance, regardless of whether the circuit actually improved or degraded. This is inconsistent with how the environment handles expected_fidelity and estimated_success_probability below (lines 244+), where:

  1. The previous reward is cached
  2. A new reward is computed after each action
  3. The step reward is the scaled delta (reward_scale * (new_val - prev_reward))
  4. Only when delta == 0.0 does it fall back to no_effect_penalty

The absence of an approximate Hellinger model simply means you can't use the transpile-based proxy — but you can still compute the exact Hellinger distance at each step via self.calculate_reward(mode="exact") and shape rewards based on the delta, just like EF/ESP do.

Suggested fix:
Remove the special-case early return for Hellinger and let it flow through the same delta-based shaping logic below, but always call calculate_reward(mode="exact") (since mode="auto" would also choose "exact" anyway, as there's no approximate path). That would give the agent meaningful gradient signal during the episode.


✏️ Learnings added
Learnt from: flowerthrower
Repo: munich-quantum-toolkit/predictor PR: 526
File: src/mqt/predictor/rl/predictorenv.py:238-243
Timestamp: 2026-02-09T13:15:50.387Z
Learning: When using reward_function == "estimated_hellinger_distance" in PredictorEnv (src/mqt/predictor/rl/predictorenv.py), the current implementation only computes the exact reward at episode termination and assigns no_effect_penalty to all non-terminal steps, which prevents intermediate shaping signal. This is likely unintentional; the Hellinger distance can be computed exactly at each step and used for delta-based reward shaping, similar to expected_fidelity and estimated_success_probability.


Collaborator

@flowerthrower flowerthrower left a comment


Hi there, great work so far. Only a few comments left from my end.


Signed-off-by: Patrick Hopf <81010725+flowerthrower@users.noreply.github.com>
@flowerthrower
Collaborator

flowerthrower commented Feb 24, 2026

@coderabbitai provide a detailed summary of the structural and methodical changes in this PR

@coderabbitai

coderabbitai bot commented Feb 24, 2026


@flowerthrower: Here's a detailed breakdown of the structural and methodical changes in PR #526:

🏗️ Structural Changes

New Module

src/mqt/predictor/rl/approx_reward.py (116 lines added)

  • Introduces heuristic reward estimation for non-native/unmapped circuits
  • Public functions added:
    • get_basis_gates_from_target(): Extracts basis gates from a Qiskit Target
    • estimate_basis_gate_counts(): Transpiles circuits to basis gates and counts occurrences
    • approx_expected_fidelity(): Computes product-based fidelity estimate using per-gate error rates
    • approx_estimated_success_probability(): Adds idle/decoherence penalties based on circuit duration and coherence time

Core Environment Refactoring (src/mqt/predictor/rl/predictorenv.py)

Constructor changes:

  • Added parameters: reward_scale: float = 1.0, no_effect_penalty: float = -0.001

New public methods:

  • calculate_reward(qc, mode="auto") -> tuple[float, str]: Central reward computation with three modes:
    • "exact": Uses calibration-aware metrics (expected_fidelity/ESP)
    • "approx": Uses conservative canonical cost model from new approx_reward module
    • "auto": Automatically selects exact vs. approx based on circuit native-ness and mapping status
  • _ensure_device_averages_cached(): Caches per-gate error rates, durations, and coherence scale (_tbar) for fast approximate evaluation

Signature changes:

  • _handle_qiskit_layout_postprocessing(): Now accepts PropertySet dict instead of PassManager for layout propagation

Internal flow updates:

  • New _apply_and_update() helper consolidates action application, decomposition, state update, and validation
  • Reward computation now tracks prev_reward and prev_reward_kind to enable delta-based reward shaping
  • Handles regime transitions (exact ↔ approx) gracefully during episode

Type System Changes (src/mqt/predictor/rl/actions.py)

  • DeviceDependentAction.do_while: Callback type changed from dict[str, Circuit] to PropertySet

Supporting Changes

  • src/mqt/predictor/reward.py: Added fallback dt = device.dt or 1.0 for devices without timing info
  • src/mqt/predictor/utils.py: Replaced Windows RuntimeWarning with logger.info for timeout unsupported message
  • src/mqt/predictor/ml/predictor.py: Whitespace cleanup (removed empty line)

🧠 Methodical Changes

From Terminal to Stepwise Rewards

Previous behavior:

  • As noted in the retrieved learning, reward functions like estimated_hellinger_distance only computed exact rewards at episode termination
  • Intermediate steps received no_effect_penalty, providing no shaping signal

New behavior:

  • Delta-based reward shaping: reward = (current_metric − prev_metric) × reward_scale, falling back to no_effect_penalty when the delta is zero
  • Terminal step uses exact final evaluation
  • Enables gradient information throughout the episode

Two-Regime Reward System

Exact Regime (native + mapped circuits):

Step reward = Δ(exact metric) × reward_scale (falling back to no_effect_penalty when Δ = 0)
  • Applies to circuits using only device-native gates respecting coupling map
  • Verified via internal _is_native_and_mapped() check (uses GatesInBasis and CheckMap passes)
  • Metrics: calibration-aware expected_fidelity or estimated_success_probability

Approximate Regime (non-native/unmapped circuits):

Step reward = Δ(approx metric) × reward_scale (falling back to no_effect_penalty when Δ = 0)
  • Conservative estimates for circuits with non-native gates or topology violations
  • Approximations:
    • Expected fidelity: ∏ᵍ (1 - pₑᵣᵣ)^count(g) over basis gates
    • ESP: Gate fidelity term × exp(-circuit_duration / tbar) coherence decay

Auto mode: Dynamically selects regime based on circuit state
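In sketch form, the two approximations listed above amount to the following (gate names, error rates, and durations in the example are illustrative; the real module's functions are `approx_expected_fidelity` and `approx_estimated_success_probability`):

```python
import math

def approx_expected_fidelity_sketch(counts, err_by_gate):
    """Product-based estimate: prod over gates g of (1 - p_err(g))^count(g)."""
    fid = 1.0
    for gate, n in counts.items():
        fid *= (1.0 - err_by_gate[gate]) ** n
    return fid

def approx_esp_sketch(counts, err_by_gate, dur_by_gate, tbar):
    """Gate-fidelity term times an exp(-duration / tbar) coherence decay."""
    fid = approx_expected_fidelity_sketch(counts, err_by_gate)
    duration = sum(dur_by_gate[g] * n for g, n in counts.items())
    return fid * math.exp(-duration / tbar)
```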

Optimization: Calibration Caching

  • Device error rates, gate durations, and coherence time (_tbar) are computed once and cached
  • Prevents redundant Target API queries during episode
  • Raises informative errors if calibration data is missing

🧪 Test Updates

New test (tests/compilation/test_predictor_rl.py):

  • test_approx_reward_paths_use_cached_per_gate_maps(): Validates approximate reward path, cache population, and output bounds [0,1]

Test configuration changes:

  • Replaced BenchmarkLevel.ALG with BenchmarkLevel.INDEP across multiple test files
  • Added mapping validation using CheckMap pass in test_qcompile_with_newly_trained_models
  • Increased timeout from 600 → 6000 in Hellinger distance tests

📝 Documentation

  • CHANGELOG.md: Added "✨ Improve RL reward design by adding intermediate rewards" under Unreleased/Changed
  • Removed support notes for custom model names and x86 macOS

This PR fundamentally shifts the RL training paradigm from sparse terminal rewards to dense intermediate feedback, enabling the agent to learn from every compilation step. The two-regime design maintains exactness when possible while providing conservative estimates when necessary, balancing computational efficiency with training signal quality.



@flowerthrower changed the title from "Add stepwise intermediate reward for RL" to "✨ Add stepwise intermediate reward for RL" on Feb 24, 2026

Labels: feature (New feature or request)
