Open
79 commits
47e2a8b
Add BERT configs
0xideas Jun 3, 2026
8fd1493
Introduce mask token
0xideas Jun 3, 2026
709d9f1
Move training_objective to TrainingSpecModel
0xideas Jun 4, 2026
7525aff
WIP
0xideas Jun 4, 2026
3a179be
Add curly brackets
0xideas Jun 4, 2026
857b86e
Fix index alignment
0xideas Jun 4, 2026
82cc41f
Fix tests
0xideas Jun 4, 2026
6df557a
Add padding mask
0xideas Jun 5, 2026
f61f621
Store pad length explicitly
0xideas Jun 5, 2026
9624e8c
Move to metadata dict
0xideas Jun 5, 2026
513cdf8
Deduplicate padding inference
0xideas Jun 5, 2026
0b99cbe
Fix test
0xideas Jun 5, 2026
d412e82
Update outputs
0xideas Jun 8, 2026
b4c8d03
Add [mask] input column
0xideas Jun 8, 2026
3753839
add reserved_mask_column
0xideas Jun 8, 2026
04fdba6
Rename reserved_mask_column to mask_column
0xideas Jun 8, 2026
95fa63b
Bump version to 2.0.0.0
0xideas Jun 8, 2026
ea71ce0
fix subsequence_starts problems
0xideas Jun 10, 2026
3b3ee96
Add hyperparameter search for bert vars
0xideas Jun 10, 2026
dbc355a
Add hyperparameter search for bert vars
0xideas Jun 10, 2026
c10f9ea
Update itemPosition in test outputs
0xideas Jun 10, 2026
e34e398
Add bert integration tests
0xideas Jun 10, 2026
c2ff935
WIP
0xideas Jun 10, 2026
8eaf770
WIP
0xideas Jun 10, 2026
6e287a0
WIP
0xideas Jun 10, 2026
34a3e7f
Clean up for tests
0xideas Jun 10, 2026
cc87def
Address BERT shortcomings
0xideas Jun 11, 2026
c9f0514
infer bert valid column
0xideas Jun 11, 2026
6c593f5
Small fixes
0xideas Jun 11, 2026
2e8c2a4
Clean up internal abstractions
0xideas Jun 11, 2026
f739c0a
Remove backward compatibility for missing mask tensors
0xideas Jun 11, 2026
9ad5a01
remove default value for training_objective
0xideas Jun 11, 2026
c0cd86e
remove negative int test
0xideas Jun 11, 2026
bd07708
remove padding inference again
0xideas Jun 11, 2026
6113f54
Make metadata mandatory
0xideas Jun 11, 2026
d15cb58
Update outputs
0xideas Jun 11, 2026
e504c14
Make special_token_ids metadata field mandatory
0xideas Jun 12, 2026
c3f03f0
Remove tuple backward compatibility
0xideas Jun 12, 2026
ecbb097
Create dummy metadata
0xideas Jun 12, 2026
24192e5
Introduce target_max_offset
0xideas Jun 12, 2026
0265aa3
Rename seq_length -> context_length
0xideas Jun 12, 2026
0b0296b
set max_lookahead to 0 for bert integration tests
0xideas Jun 12, 2026
6a95a7c
fix tests, reverse metadata export order
0xideas Jun 12, 2026
6bcfe78
Update outputs
0xideas Jun 12, 2026
14929b7
Enforce v2
0xideas Jun 12, 2026
ead716e
Make sequence_layout_version required
0xideas Jun 12, 2026
ac1bd3d
Small changes
0xideas Jun 12, 2026
20fbb29
Fail on wrong sequence_layout_version
0xideas Jun 12, 2026
5b57073
Remove vestiges of v1 processing
0xideas Jun 13, 2026
3012988
Access sequence layout directly
0xideas Jun 13, 2026
bd68908
Separate storage and modelling window size
0xideas Jun 13, 2026
6da8f27
Rename params
0xideas Jun 13, 2026
dcd92f7
make valid_mask compulsory
0xideas Jun 13, 2026
6b78f3e
set prediction_length from context_length in hyperparameter tuning of…
0xideas Jun 13, 2026
4c899e9
Expand preprocessing digest
0xideas Jun 13, 2026
da7c934
Correctly weighted validation loss
0xideas Jun 13, 2026
da4d0fa
update custom eval int test outputs
0xideas Jun 13, 2026
edb9ffd
Fix class_share calculation
0xideas Jun 15, 2026
9b8245d
shorten docstrings & clean up
0xideas Jun 15, 2026
99a2440
Add pin_memory to SequifierBatch
0xideas Jun 15, 2026
d450171
Token level loss scaling for distributed training
0xideas Jun 15, 2026
2146611
Vectorize bert masking
0xideas Jun 15, 2026
5e620f3
improve tests
0xideas Jun 15, 2026
a303826
Improve tests
0xideas Jun 16, 2026
d61a1b7
Make preprocessing resumption more efficient
0xideas Jun 16, 2026
3fd2784
pyright
0xideas Jun 16, 2026
d7ee14f
move tests
0xideas Jun 16, 2026
bb11b23
allow_sequence_splitting and device dependent torch float type
0xideas Jun 17, 2026
d428055
Correct state saving and loading WIP
0xideas Jun 17, 2026
76d6ba5
Correct state saving and loading WIP
0xideas Jun 17, 2026
095ad4a
Correct state saving and loading WIP
0xideas Jun 17, 2026
79b57ed
Refactor checkpoint metadata management
0xideas Jun 20, 2026
69024a1
WIP
0xideas Jun 20, 2026
49096fb
Rename save_batch_interval_minutes to save_interval_minutes
0xideas Jun 20, 2026
9e61362
Rename interval checkpointing vars
0xideas Jun 20, 2026
1171da5
Add self.save_interval_batches
0xideas Jun 20, 2026
89cf33c
remove accumulated_global_token_count, empty_global_batches, global_t…
0xideas Jun 20, 2026
7f08cd4
prevent layer_type_dtypes with FSDP
0xideas Jun 20, 2026
e842780
Fix FSDP mixed precision policy
0xideas Jun 22, 2026
2 changes: 1 addition & 1 deletion .gitignore
@@ -5,7 +5,7 @@ logs/
metadata_configs/
outputs/
project_folder/

!tests/unit/data
*~
*.DS_Store

4 changes: 4 additions & 0 deletions .vscode/settings.json
@@ -0,0 +1,4 @@
{
"python-envs.defaultEnvManager": "ms-python.python:conda",
"python-envs.defaultPackageManager": "ms-python.python:conda"
}
20 changes: 10 additions & 10 deletions README.md
@@ -97,16 +97,16 @@ Let's start with the data format expected by sequifier. The basic data format th

The two columns "sequenceId" and "itemPosition" have to be present, and then there must be at least one feature column. There can also be many feature columns, and these can be categorical or real valued.
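
The column requirement above can be illustrated with a small check. This is only a sketch in plain Python: `check_input_columns` and `REQUIRED_COLUMNS` are hypothetical names, not part of the sequifier API, and the column list is made-up example data.

```python
# Hypothetical helper, not part of sequifier: checks the input layout
# described above (two mandatory columns plus at least one feature column).
REQUIRED_COLUMNS = ("sequenceId", "itemPosition")


def check_input_columns(columns):
    """Return the feature columns, or raise if the layout is invalid."""
    missing = [name for name in REQUIRED_COLUMNS if name not in columns]
    if missing:
        raise ValueError(f"missing required columns: {missing}")
    features = [name for name in columns if name not in REQUIRED_COLUMNS]
    if not features:
        raise ValueError("at least one feature column is required")
    return features


# Example column list (illustrative only).
features = check_input_columns(["sequenceId", "itemPosition", "column1", "column2"])
print(features)
```

Feature columns may be categorical or real valued, so the sketch inspects column names only and makes no assumption about their types.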

Data of this input format can be transformed into the format that is used for model training and inference using `sequifier preprocess`, which takes this form:
Data of this input format can be transformed into the format that is used for model training and inference using `sequifier preprocess`. Preprocessing defines the physical `stored_context_width` and `max_target_offset`; training and inference choose the model-facing `context_length` from that stored capacity:

|sequenceId|subsequenceId|startItemPosition|columnName|[Subsequence Length]|[Subsequence Length - 1]|...|0|
|----------|-------------|-----------------|----------|--------------------|------------------------| - |-|
|0|0|0|column1|"high"|"high"|...|"low"|
|0|0|0|column2|12.3|10.2|...|14.9|
|...|...|...|...|...|...|...|...|
|1|0|15|column1|"medium"|"high"|...|"medium"|
|1|0|15|column2|20.6|18.5|...|21.6|
|...|...|...|...|...|...|...|...|
|sequenceId|subsequenceId|startItemPosition|leftPadLength|inputCol|[Window Length - 1]|[Window Length - 2]|...|0|
|----------|-------------|-----------------|-------------|--------|-------------------|-------------------| - |-|
|0|0|0|0|column1|"high"|"high"|...|"low"|
|0|0|0|0|column2|12.3|10.2|...|14.9|
|...|...|...|...|...|...|...|...|...|
|1|0|15|0|column1|"medium"|"high"|...|"medium"|
|1|0|15|0|column2|20.6|18.5|...|21.6|
|...|...|...|...|...|...|...|...|...|

On inference, the output is returned in the library input format, introduced first.

@@ -199,7 +199,7 @@ Please cite with:
title = {sequifier - causal transformer models for multivariate sequence modelling},
year = {2025},
publisher = {GitHub},
version = {v1.2.0.0},
version = {v1.9.9.9},
url = {https://github.com/0xideas/sequifier}
}

2 changes: 1 addition & 1 deletion docs/source/conf.py
@@ -15,7 +15,7 @@
project = 'sequifier'
copyright = '2025, Leon Luithlen'
author = 'Leon Luithlen'
release = 'v1.2.0.0'
release = 'v1.9.9.9'
html_baseurl = 'https://www.sequifier.com/'

# -- General configuration ---------------------------------------------------
9 changes: 5 additions & 4 deletions documentation/configs/hyperparameter-search.md
@@ -59,7 +59,7 @@ Sequifier allows you to search not just for model parameters, but for the best *
| --- | --- | --- | --- |
| `input_columns` | `list[list[str]]` | **Yes** | A list of input sets. E.g., `[['col1'], ['col1', 'col2']]`. |
| `target_columns` | `list[str]` | **Yes** | The target column(s) to predict. Fixed across all runs. |
| `seq_length` | `list[int]` | **Yes** | List of sequence lengths to test (e.g., `[24, 48]`). |
| `context_length` | `list[int]` | **Yes** | List of sequence lengths to test (e.g., `[24, 48]`). |
| `target_column_types` | `dict` | **Yes** | Map of target columns to `categorical` or `real`. |
| `column_types` | `list[dict]` | *Conditional* | Required if `input_columns` varies. List of type maps corresponding to the input sets. |

@@ -134,8 +134,9 @@ Most fields here are lists for sampling, but some are scalar values fixed for al
| `scheduler` | `list[dict]` | No | `[{'name': 'StepLR'...}]`| List of scheduler configs. `scheduler.step()` is only called if \< total\_steps, so correct configuration is essential. |
| `scheduler_step_on` | `str` | No | `epoch` | When to step the scheduler: `epoch` or `batch`. |
| `save_latest_interval_minutes`| `float`| No | `null` | Time interval to overwrite a "latest" checkpoint. |
| `save_batch_interval_minutes` | `float` | No | `null` | Time interval to save a unique, batch-specific checkpoint. |
| `save_batch_interval_minutes_val_loss` | `bool` | No | `true` | Whether to calculate validation loss at the moment of the batch interval save. |
| `save_interval_minutes` | `float` | No | `null` | Time interval to save a unique, batch-specific checkpoint. |
| `save_interval_batches` | `int` | No | `null` | Batch interval to save a unique, batch-specific checkpoint. |
| `save_interval_val_loss` | `bool` | No | `true` | Whether to calculate validation loss at the moment of the batch interval save. |
| `calculate_validation_loss_on_initialization` | `bool` | No | `false` | Determines if a validation pass runs before epoch 1 begins. |
| `log_interval` | `int` | No | `10` | Logging frequency (batches). |
| `class_share_log_columns`| `list[str]`| No | `[]` | Columns for which to log the predicted class distribution in validation. |
@@ -186,7 +187,7 @@ All other parameters are considered **Independent**. Sequifier will test every v

* **Model:** `num_layers`, `dim_feedforward`, `activation_fn`, `normalization`, `norm_first`, `positional_encoding`, `attention_type`, `rope_theta`.
* **Training:** `batch_size`, `dropout`, `accumulation_steps`, `optimizer`.
* **Data:** `seq_length`.
* **Data:** `context_length`.

### 3\. Special Case: `n_kv_heads`

2 changes: 1 addition & 1 deletion documentation/configs/infer.md
@@ -40,7 +40,7 @@ These fields tell the inference engine which columns to extract from the new dat
| Field | Type | Mandatory | Default | Description |
| :--- | :--- | :--- | :--- | :--- |
| `model_type` | `str` | **Yes** | - | `generative` (predict next value) or `embedding` (extract vector representation). |
| `seq_length` | `int` | **Yes** | - | The context window size. Must match training. |
| `context_length` | `int` | **Yes** | - | The model context window size. It must match the trained model view and fit inside the stored metadata capacity. |
| `prediction_length` | `int` | No | `1` | Number of steps to predict *simultaneously*. **Must be 1** if `autoregression: true`. |
| `inference_batch_size`| `int` | **Yes** | - | Number of sequences to process at once. |
| `autoregression` | `bool` | No | `false` | If `true`, feeds predictions back into the model to predict further into the future. |
12 changes: 7 additions & 5 deletions documentation/configs/preprocess.md
@@ -38,15 +38,17 @@ The configuration is defined in a YAML file (e.g., `preprocess.yaml`). Below are
| `selected_columns` | `list[str]` | No | `null` | A specific list of columns to process. If `null`, all columns (except metadata) are processed. |
| `max_rows` | `int` | No | `null` | Limits processing to the first N rows. Useful for rapid debugging. |
| `metadata_config_path` | `Optional[str]` | No | `null` | use a preexisting metadata config path for tokenizing discrete columns and standardising real-valued columns |
| `mask_column` | `Optional[str]` | No | `null` | Optional input column used as a row-level mask. If set, `metadata_config_path` must also be set. |
| `use_precomputed_maps`| `list[str]` | No | `null` | If not `null`, enforces the use of precomputed maps for the variables in the list. |

### 3\. Sequence Logic & Splitting

| Field | Type | Mandatory | Default | Description |
| :--- | :--- | :--- | :--- | :--- |
| `seq_length` | `int` | **Yes** | - | The length of the context window (history) fed into the model. |
| `stored_context_width` | `int` | **Yes** | - | The physical serialized window width written to preprocessed data. |
| `max_target_offset` | `int` | No | `1` | Number of future items retained after the model input window. Use `0` for BERT-style same-width inputs and targets; use `1` for causal next-item training. |
| `split_ratios` | `list[float]`| **Yes** | - | Proportions for data splits (e.g., `[0.8, 0.1, 0.1]` for train/val/test). Must sum to 1.0. |
| `stride_by_split` | `list[int]` | No | `[seq_length]*N` | The step size used to slide the window for each split. Corresponds to `split_ratios`. |
| `stride_by_split` | `list[int]` | No | `[stored_context_width]*N` | The step size used to slide the window for each split. Corresponds to `split_ratios`. |
| `subsequence_start_mode`| `str` | No | `distribute` | Strategy for selecting start indices (`distribute` or `exact`). |

### 4\. Performance & System
@@ -71,10 +73,10 @@ The configuration is defined in a YAML file (e.g., `preprocess.yaml`). Below are

This controls data augmentation and redundancy.

* **Stride = `seq_length` (Non-overlapping):** The model sees every data point exactly once as a target. Training is faster, but the model might miss patterns that cross the window boundary.
* **Stride = `context_length` (Non-overlapping):** The model sees every data point exactly once as a target. Training is faster, but the model might miss patterns that cross the window boundary.
* **Stride = 1 (Maximum Overlap):** Maximizes data volume. The model sees every possible sequence. This yields the highest accuracy but significantly increases the size of the preprocessed data and training time.
* **Hybrid Approach:** It is common practice to set a large stride for the training and validation splits (index 0) to reduce the size on disk of the dataset, and a stride=1 for the test split to evaluate the model on each point in the test set. This supposes that the test split value is low.
* *Example:* `stride_by_split: [24, 24, 1]` (assuming `seq_length: 48`).
* *Example:* `stride_by_split: [24, 24, 1]` (assuming `stored_context_width: 49`).
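
To make the stride trade-off concrete, the sketch below enumerates window start positions for a single sequence. It is a simplified illustration, not sequifier's implementation: `window_starts` is a hypothetical helper, and it ignores padding and the `subsequence_start_mode` logic. The sequence length of 100 is an arbitrary example value.

```python
# Hypothetical helper, not part of sequifier: plain sliding-window start
# positions, ignoring padding and subsequence_start_mode.
def window_starts(sequence_length, window_width, stride):
    """Return the start index of every full window of `window_width` items."""
    if window_width <= 0 or stride <= 0:
        raise ValueError("window_width and stride must be positive")
    last_start = sequence_length - window_width
    if last_start < 0:
        return []  # the sequence is shorter than one window
    return list(range(0, last_start + 1, stride))


# A 100-item sequence with a stored window of 49 items (example values).
overlapping = window_starts(100, 49, stride=24)
maximal = window_starts(100, 49, stride=1)
non_overlapping = window_starts(100, 49, stride=49)

print(overlapping)      # [0, 24, 48]
print(len(maximal))     # 52
print(non_overlapping)  # [0, 49]
```

Under these assumptions, a stride of 1 produces 52 windows for the same sequence that a stride of 24 covers with 3, which is the size increase the bullets describe.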

### 3\. `subsequence_start_mode`: `distribute` vs `exact`

@@ -116,7 +118,7 @@ After running `preprocess`, the following are generated:

## 5\. Advanced: Custom ID Mapping

By default, Sequifier automatically generates integer IDs for categorical columns starting from index 2 (indices 0 and 1 are reserved for system use, such as "unknown" values).
By default, Sequifier automatically generates integer IDs for categorical columns starting from index 2 (indices 0 and 1 are reserved for system use, such as "[unknown]" values).
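
A minimal sketch of such an ID map, assuming IDs are assigned in sorted order of the distinct values (the ordering is an assumption, as the text only states that generated IDs start at 2). `build_id_map` and `FIRST_FREE_ID` are hypothetical names, not sequifier functions.

```python
# Hypothetical helper, not part of sequifier. Indices 0 and 1 are reserved
# for system use, so generated IDs start at 2.
FIRST_FREE_ID = 2


def build_id_map(values):
    """Map each distinct categorical value to an integer ID starting at 2."""
    unique_values = sorted(set(values))  # sorted order is an assumption
    return {
        value: index
        for index, value in enumerate(unique_values, start=FIRST_FREE_ID)
    }


id_map = build_id_map(["low", "high", "medium", "high"])
print(id_map)  # {'high': 2, 'low': 3, 'medium': 4}
```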

If you need to enforce specific integer mappings (e.g., to maintain consistency across different training runs or datasets), you can provide **precomputed ID maps**.

7 changes: 4 additions & 3 deletions documentation/configs/train.md
@@ -30,7 +30,7 @@ The configuration is defined in a YAML file (e.g., `train.yaml`). The file is st
| `target_columns` | `list[str]`| **Yes** | - | The specific column(s) the model should learn to predict. |
| `target_column_types`| `dict` | **Yes** | - | Map of target columns to their type: `'categorical'` or `'real'`. The key order in target_column_types must exactly match the list order in target_columns |
| `input_columns` | `list[str]`| No | All | Subset of columns to use as input features. Defaults to all available in metadata. |
| `seq_length` | `int` | **Yes** | - | Must match the `seq_length` used in preprocessing. |
| `context_length` | `int` | **Yes** | - | Model input context length. It must fit inside the metadata `stored_context_width` with the stored `max_target_offset`. |
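
One plausible reading of the `context_length` constraint is that the model context plus the retained future items must not exceed the stored window. The sketch below encodes that reading. The inequality is an assumption, not a rule confirmed by the documentation, and `context_fits` is a hypothetical helper rather than a sequifier function.

```python
# Hypothetical helper, not part of sequifier. ASSUMPTION: the constraint is
# context_length + max_target_offset <= stored_context_width.
def context_fits(context_length, stored_context_width, max_target_offset):
    """Return True if the model context fits inside the stored window."""
    if context_length < 1 or max_target_offset < 0:
        return False
    return context_length + max_target_offset <= stored_context_width


# Causal next-item training: one future item retained after the input window.
print(context_fits(48, stored_context_width=49, max_target_offset=1))  # True
print(context_fits(49, stored_context_width=49, max_target_offset=1))  # False

# BERT-style training: inputs and targets share the same width.
print(context_fits(49, stored_context_width=49, max_target_offset=0))  # True
```

If the assumption holds, a single preprocessed dataset can serve several smaller `context_length` values without re-running preprocessing.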

### 3\. Model Architecture (`model_spec`)

@@ -72,8 +72,9 @@ These fields determine the size and complexity of the Transformer.
| `class_weights` | `dict` | No | `null` | Weights for specific classes (useful for imbalanced datasets). |
| `save_interval_epochs` | `int` | **Yes** | - | Save a checkpoint every N epochs. |
| `save_latest_interval_minutes`| `float`| No | Time interval to overwrite a "latest" checkpoint. |
| `save_batch_interval_minutes` | `float` | No | Time interval to save a unique, batch-specific checkpoint. |
| `save_batch_interval_minutes_val_loss` | `bool` | No | Whether to calculate validation loss at the moment of the batch interval save. Defaults to true. |
| `save_interval_minutes` | `float` | No | Time interval to save a unique, batch-specific checkpoint. |
| `save_interval_batches` | `int` | No | Batch interval to save a unique, batch-specific checkpoint. |
| `save_interval_val_loss` | `bool` | No | Whether to calculate validation loss at the moment of the batch interval save. Defaults to true. |
| `calculate_validation_loss_on_initialization` | `bool` | No | Determines if a validation pass runs before epoch 1 begins. Defaults to true. |
| `early_stopping_epochs`| `int` | No | `null` | Stop training if validation loss doesn't improve for N epochs. |
| `log_interval` | `int` | No | `10` | Print training logs every N batches. |