0xideas · Jun 3, 2026 · Jun 4, 2026
diff --git a/.gitignore b/.gitignore
@@ -5,7 +5,7 @@ logs/
 metadata_configs/
 outputs/
 project_folder/
-
+!tests/unit/data
 *\~
 *.DS_Store
 

diff --git a/.vscode/settings.json b/.vscode/settings.json
@@ -0,0 +1,4 @@
+{
+    "python-envs.defaultEnvManager": "ms-python.python:conda",
+    "python-envs.defaultPackageManager": "ms-python.python:conda"
+}
diff --git a/README.md b/README.md
@@ -97,16 +97,16 @@ Let's start with the data format expected by sequifier. The basic data format th
 
 The two columns "sequenceId" and "itemPosition" have to be present, and then there must be at least one feature column. There can also be many feature columns, and these can be categorical or real valued.
 
-Data of this input format can be transformed into the format that is used for model training and inference using `sequifier preprocess`, which takes this form:
+Data of this input format can be transformed into the format that is used for model training and inference using `sequifier preprocess`. Preprocessing fixes the physical `stored_context_width` and `max_target_offset`; training and inference then choose a model-facing `context_length` that fits within that stored capacity. The preprocessed data takes this form:
 
-|sequenceId|subsequenceId|startItemPosition|columnName|[Subsequence Length]|[Subsequence Length - 1]|...|0|
-|----------|-------------|-----------------|----------|--------------------|------------------------| - |-|
-|0|0|0|column1|"high"|"high"|...|"low"|
-|0|0|0|column2|12.3|10.2|...|14.9|
-|...|...|...|...|...|...|...|...|
-|1|0|15|column1|"medium"|"high"|...|"medium"|
-|1|0|15|column2|20.6|18.5|...|21.6|
-|...|...|...|...|...|...|...|...|
+|sequenceId|subsequenceId|startItemPosition|leftPadLength|inputCol|[Window Length - 1]|[Window Length - 2]|...|0|
+|----------|-------------|-----------------|-------------|--------|-------------------|-------------------| - |-|
+|0|0|0|0|column1|"high"|"high"|...|"low"|
+|0|0|0|0|column2|12.3|10.2|...|14.9|
+|...|...|...|...|...|...|...|...|...|
+|1|0|15|0|column1|"medium"|"high"|...|"medium"|
+|1|0|15|0|column2|20.6|18.5|...|21.6|
+|...|...|...|...|...|...|...|...|...|
 
 On inference, the output is returned in the library input format, introduced first.
 
@@ -199,7 +199,7 @@ Please cite with:
   title = {sequifier - causal transformer models for multivariate sequence modelling},
   year = {2025},
   publisher = {GitHub},
-  version = {v1.2.0.0},
+  version = {v1.9.9.9},
   url = {https://github.com/0xideas/sequifier}
 }
 

diff --git a/docs/source/conf.py b/docs/source/conf.py
@@ -15,7 +15,7 @@
 project = 'sequifier'
 copyright = '2025, Leon Luithlen'
 author = 'Leon Luithlen'
-release = 'v1.2.0.0'
+release = 'v1.9.9.9'
 html_baseurl = 'https://www.sequifier.com/'
 
 # -- General configuration ---------------------------------------------------

diff --git a/documentation/configs/hyperparameter-search.md b/documentation/configs/hyperparameter-search.md
@@ -59,7 +59,7 @@ Sequifier allows you to search not just for model parameters, but for the best *
 | --- | --- | --- | --- |
 | `input_columns` | `list[list[str]]` | **Yes** | A list of input sets. E.g., `[['col1'], ['col1', 'col2']]`. |
 | `target_columns` | `list[str]` | **Yes** | The target column(s) to predict. Fixed across all runs. |
-| `seq_length` | `list[int]` | **Yes** | List of sequence lengths to test (e.g., `[24, 48]`). |
+| `context_length` | `list[int]` | **Yes** | List of context lengths to test (e.g., `[24, 48]`). |
 | `target_column_types` | `dict` | **Yes** | Map of target columns to `categorical` or `real`. |
 | `column_types` | `list[dict]` | *Conditional* | Required if `input_columns` varies. List of type maps corresponding to the input sets. |
 
@@ -134,8 +134,9 @@ Most fields here are lists for sampling, but some are scalar values fixed for al
 | `scheduler` | `list[dict]` | No | `[{'name': 'StepLR'...}]`| List of scheduler configs. `scheduler.step()` is only called if \< total\_steps, so correct configuration is essential. |
 | `scheduler_step_on` | `str` | No | `epoch` | When to step the scheduler: `epoch` or `batch`. |
 | `save_latest_interval_minutes`| `float`| No | `null` | Time interval to overwrite a "latest" checkpoint. |
-| `save_batch_interval_minutes` | `float` | No | `null` | Time interval to save a unique, batch-specific checkpoint. |
-| `save_batch_interval_minutes_val_loss` | `bool` | No | `true` | Whether to calculate validation loss at the moment of the batch interval save. |
+| `save_interval_minutes` | `float` | No | `null` | Time interval to save a unique, batch-specific checkpoint. |
+| `save_interval_batches` | `int` | No | `null` | Batch interval to save a unique, batch-specific checkpoint. |
+| `save_interval_val_loss` | `bool` | No | `true` | Whether to calculate validation loss at the moment of the batch interval save. |
 | `calculate_validation_loss_on_initialization` | `bool` | No | `false` | Determines if a validation pass runs before epoch 1 begins. |
 | `log_interval` | `int` | No | `10` | Logging frequency (batches). |
 | `class_share_log_columns`| `list[str]`| No | `[]` | Columns for which to log the predicted class distribution in validation. |
@@ -186,7 +187,7 @@ All other parameters are considered **Independent**. Sequifier will test every v
 
   * **Model:** `num_layers`, `dim_feedforward`, `activation_fn`, `normalization`, `norm_first`, `positional_encoding`, `attention_type`, `rope_theta`.
   * **Training:** `batch_size`, `dropout`, `accumulation_steps`, `optimizer`.
-  * **Data:** `seq_length`.
+  * **Data:** `context_length`.
 
 ### 3\. Special Case: `n_kv_heads`
 

diff --git a/documentation/configs/infer.md b/documentation/configs/infer.md
@@ -40,7 +40,7 @@ These fields tell the inference engine which columns to extract from the new dat
 | Field | Type | Mandatory | Default | Description |
 | :--- | :--- | :--- | :--- | :--- |
 | `model_type` | `str` | **Yes** | - | `generative` (predict next value) or `embedding` (extract vector representation). |
-| `seq_length` | `int` | **Yes** | - | The context window size. Must match training. |
+| `context_length` | `int` | **Yes** | - | The model context window size. It must match the `context_length` the model was trained with and fit inside the stored metadata capacity. |
 | `prediction_length` | `int` | No | `1` | Number of steps to predict *simultaneously*. **Must be 1** if `autoregression: true`. |
 | `inference_batch_size`| `int` | **Yes** | - | Number of sequences to process at once. |
 | `autoregression` | `bool` | No | `false` | If `true`, feeds predictions back into the model to predict further into the future. |

diff --git a/documentation/configs/preprocess.md b/documentation/configs/preprocess.md
@@ -38,15 +38,17 @@ The configuration is defined in a YAML file (e.g., `preprocess.yaml`). Below are
 | `selected_columns` | `list[str]` | No | `null` | A specific list of columns to process. If `null`, all columns (except metadata) are processed. |
 | `max_rows` | `int` | No | `null` | Limits processing to the first N rows. Useful for rapid debugging. |
 | `metadata_config_path` | `Optional[str]` | No | `null` | use a preexisting metadata config path for tokenizing discrete columns and standardising real-valued columns |
+| `mask_column` | `Optional[str]` | No | `null` | Optional input column used as a row-level mask. If set, `metadata_config_path` must also be set. |
 | `use_precomputed_maps`| `list[str]` | No | `null` | If not `null`, enforces the use of precomputed maps for the variables in the list. |
 
 ### 3\. Sequence Logic & Splitting
 
 | Field | Type | Mandatory | Default | Description |
 | :--- | :--- | :--- | :--- | :--- |
-| `seq_length` | `int` | **Yes** | - | The length of the context window (history) fed into the model. |
+| `stored_context_width` | `int` | **Yes** | - | The physical serialized window width written to preprocessed data. |
+| `max_target_offset` | `int` | No | `1` | Number of future items retained after the model input window. Use `0` for BERT-style same-width inputs and targets; use `1` for causal next-item training. |
 | `split_ratios` | `list[float]`| **Yes** | - | Proportions for data splits (e.g., `[0.8, 0.1, 0.1]` for train/val/test). Must sum to 1.0. |
-| `stride_by_split` | `list[int]` | No | `[seq_length]*N` | The step size used to slide the window for each split. Corresponds to `split_ratios`. |
+| `stride_by_split` | `list[int]` | No | `[stored_context_width]*N` | The step size used to slide the window for each split. Corresponds to `split_ratios`. |
 | `subsequence_start_mode`| `str` | No | `distribute` | Strategy for selecting start indices (`distribute` or `exact`). |
 
 ### 4\. Performance & System
@@ -71,10 +73,10 @@ The configuration is defined in a YAML file (e.g., `preprocess.yaml`). Below are
 
 This controls data augmentation and redundancy.
 
-  * **Stride = `seq_length` (Non-overlapping):** The model sees every data point exactly once as a target. Training is faster, but the model might miss patterns that cross the window boundary.
+  * **Stride = `context_length` (Non-overlapping):** The model sees every data point exactly once as a target. Training is faster, but the model might miss patterns that cross the window boundary.
   * **Stride = 1 (Maximum Overlap):** Maximizes data volume. The model sees every possible sequence. This yields the highest accuracy but significantly increases the size of the preprocessed data and training time.
   * **Hybrid Approach:** It is common practice to set a large stride for the training and validation splits (index 0) to reduce the size on disk of the dataset, and a stride=1 for the test split to evaluate the model on each point in the test set. This supposes that the test split value is low.
-      * *Example:* `stride_by_split: [24, 24, 1]` (assuming `seq_length: 48`).
+      * *Example:* `stride_by_split: [24, 24, 1]` (assuming `stored_context_width: 49`).
 
 ### 3\. `subsequence_start_mode`: `distribute` vs `exact`
 
@@ -116,7 +118,7 @@ After running `preprocess`, the following are generated:
 
 ## 5\. Advanced: Custom ID Mapping
 
-By default, Sequifier automatically generates integer IDs for categorical columns starting from index 2 (indices 0 and 1 are reserved for system use, such as "unknown" values).
+By default, Sequifier automatically generates integer IDs for categorical columns starting from index 2 (indices 0 and 1 are reserved for system use, such as "[unknown]" values).
 
 If you need to enforce specific integer mappings (e.g., to maintain consistency across different training runs or datasets), you can provide **precomputed ID maps**.
 

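The stride trade-off described in the `preprocess.md` hunk above can be illustrated with a generic sliding-window sketch. This is not sequifier's implementation: `window_starts` is a hypothetical helper, it ignores `subsequence_start_mode` and left padding, and it assumes windows are `stored_context_width` items wide and only full-width windows are kept.

```python
def window_starts(sequence_length: int, window_width: int, stride: int) -> list[int]:
    """Start indices of full-width windows taken every `stride` items.

    Hypothetical helper for illustration only; sequifier's own start
    selection (`distribute` / `exact`) and padding are not modelled here.
    """
    if window_width <= 0 or stride <= 0:
        raise ValueError("window_width and stride must be positive")
    last_start = sequence_length - window_width
    if last_start < 0:
        # The sequence is shorter than one window: no full-width window fits.
        return []
    return list(range(0, last_start + 1, stride))


# A 100-item sequence with an assumed stored width of 49:
overlapping = window_starts(100, 49, 24)    # stride 24, as in the doc's example
dense = window_starts(100, 49, 1)           # stride 1, maximum overlap
non_overlapping = window_starts(100, 49, 49)  # stride equal to the window width
print(overlapping, len(dense), non_overlapping)
```

Under these assumptions, stride 1 produces far more windows than stride 24 for the same sequence, which is the disk-size and training-time cost the documentation mentions.
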
diff --git a/documentation/configs/train.md b/documentation/configs/train.md
@@ -30,7 +30,7 @@ The configuration is defined in a YAML file (e.g., `train.yaml`). The file is st
 | `target_columns` | `list[str]`| **Yes** | - | The specific column(s) the model should learn to predict. |
 | `target_column_types`| `dict` | **Yes** | - | Map of target columns to their type: `'categorical'` or `'real'`. The key order in target_column_types must exactly match the list order in target_columns |
 | `input_columns` | `list[str]`| No | All | Subset of columns to use as input features. Defaults to all available in metadata. |
-| `seq_length` | `int` | **Yes** | - | Must match the `seq_length` used in preprocessing. |
+| `context_length` | `int` | **Yes** | - | Model input context length. It must fit inside the metadata `stored_context_width` with the stored `max_target_offset`. |
 
 ### 3\. Model Architecture (`model_spec`)
 
@@ -72,8 +72,9 @@ These fields determine the size and complexity of the Transformer.
 | `class_weights` | `dict` | No | `null` | Weights for specific classes (useful for imbalanced datasets). |
 | `save_interval_epochs` | `int` | **Yes** | - | Save a checkpoint every N epochs. |
 | `save_latest_interval_minutes`| `float`| No | Time interval to overwrite a "latest" checkpoint. |
-| `save_batch_interval_minutes` | `float` | No | Time interval to save a unique, batch-specific checkpoint. |
-| `save_batch_interval_minutes_val_loss` | `bool` | No | Whether to calculate validation loss at the moment of the batch interval save. Defaults to true. |
+| `save_interval_minutes` | `float` | No | `null` | Time interval to save a unique, batch-specific checkpoint. |
+| `save_interval_batches` | `int` | No | `null` | Batch interval to save a unique, batch-specific checkpoint. |
+| `save_interval_val_loss` | `bool` | No | `true` | Whether to calculate validation loss at the moment of the batch interval save. |
 | `calculate_validation_loss_on_initialization` | `bool` | No | Determines if a validation pass runs before epoch 1 begins. Defaults to true. |
 | `early_stopping_epochs`| `int` | No | `null` | Stop training if validation loss doesn't improve for N epochs. |
 | `log_interval` | `int` | No | `10` | Print training logs every N batches. |
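
The `train.md` hunk says `context_length` "must fit inside the metadata `stored_context_width` with the stored `max_target_offset`", but the diff does not spell out the exact rule. The sketch below assumes the rule is `context_length + max_target_offset <= stored_context_width`, which is an inference from the `preprocess.md` example pairing a former `seq_length: 48` with `stored_context_width: 49`. The function name `context_fits` is hypothetical and not part of sequifier.

```python
def context_fits(
    context_length: int,
    stored_context_width: int,
    max_target_offset: int = 1,
) -> bool:
    """Check whether a model context fits the stored window.

    ASSUMPTION: the stored window must hold the model input plus
    `max_target_offset` future items. The diff only says the context
    "must fit"; the precise inequality is inferred, not confirmed.
    """
    if context_length <= 0 or max_target_offset < 0:
        return False
    return context_length + max_target_offset <= stored_context_width


print(context_fits(48, 49, 1))  # causal next-item case from the doc's example
print(context_fits(49, 49, 1))  # no room left for the target offset
print(context_fits(49, 49, 0))  # same-width inputs and targets
```

If the assumption holds, a single preprocessed dataset with `stored_context_width: 49` could serve any `context_length` up to 48 for causal training, which would match the hyperparameter search sampling values such as `[24, 48]`.
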