From 4352253f6dcdefd2e5ebeedfa78b8423ecd6b4d4 Mon Sep 17 00:00:00 2001 From: Nikita Date: Wed, 24 Jun 2026 19:36:38 +0200 Subject: [PATCH 01/15] PRDCT-376: split BigQuery transformation into how-to + reference MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Block 0: how-to + reference. Thin hub at the unchanged URL. Query limits and the Query timeout parameter (default 0) carried over; the 2-hour GCP figure and config fields flagged TODO(human-review) — schema not in this repo. Co-Authored-By: Claude Opus 4.8 --- .../docs/transformations/bigquery/how-to.md | 83 ++++++++++++ .../docs/transformations/bigquery/index.md | 119 ++---------------- .../transformations/bigquery/reference.md | 98 +++++++++++++++ 3 files changed, 192 insertions(+), 108 deletions(-) create mode 100644 src/content/docs/transformations/bigquery/how-to.md create mode 100644 src/content/docs/transformations/bigquery/reference.md diff --git a/src/content/docs/transformations/bigquery/how-to.md b/src/content/docs/transformations/bigquery/how-to.md new file mode 100644 index 000000000..85218bef8 --- /dev/null +++ b/src/content/docs/transformations/bigquery/how-to.md @@ -0,0 +1,83 @@ +--- +title: How do I run a BigQuery transformation? +slug: 'transformations/bigquery/how-to' +description: Create, configure, and run a Google BigQuery SQL transformation in Keboola from start to finish — set input mapping, write the SQL, set output mapping, run it, and confirm the result table landed in Storage. +keywords: + - run a BigQuery transformation + - create BigQuery transformation + - BigQuery SQL transformation Keboola + - BigQuery transformation example +type: how-to +--- + +You have a table in Keboola Storage and you want to transform it with BigQuery SQL and write the result back to Storage. This page takes you from nothing to a finished, successful run using a small worked example. 
For exact limits and syntax rules, see the [reference](/transformations/bigquery/reference/). + +**Time:** ~10 minutes · **You will need:** a Keboola project (on a BigQuery backend) where you can create configurations, and one table in [Storage](/storage/tables/) to read from. + +## Before you start + +Get a table into Storage to use as the input. If you do not have one handy, upload the [sample CSV file](/transformations/source.csv) as a new table (Storage → your bucket → **Create Table**) — the example SQL below expects a `source` table with `first` and `second` columns. + +## Step 1 — Create the transformation + +1. Open **Components → Transformations**. +2. Click **New Transformation**. +3. Choose **Google BigQuery Transformation** as the type. +4. Give it a descriptive name and confirm. + +## Step 2 — Add the input mapping + +1. In **Input Mapping**, click **New Table Input**. +2. Set **Source** to your Storage table. +3. Set the **Destination** (staging table name) to `source`. +4. Save the mapping. + +## Step 3 — Write the SQL script + +In the code editor, paste: + +```sql +CREATE OR REPLACE TABLE `result` AS +SELECT `first`, CAST(`second` AS INT64) * 42 AS `larger_second` +FROM `source`; +``` + +This reads the staged `source` table and creates a `result` table with two columns: `first`, and `larger_second`, which is `second` cast to `INT64` and multiplied by 42. Quote identifiers with backticks (`` `source` ``). You can split longer scripts into [blocks](/transformations/#writing-scripts). + +## Step 4 — Add the output mapping + +1. In **Output Mapping**, click **New Table Output**. +2. Set **Source** (the staging table the script created) to `result`. +3. Set **Destination** to a new Storage table, for example `out.c-main.result`. +4. Save the mapping. + +## Step 5 — Run it and confirm the result + +1. Click **Run** on the transformation. +2. Wait for the [job](/management/jobs/) to finish with a success status. +3. 
Open **Storage**, find your destination table (`out.c-main.result`), and check the data sample: it should contain `first` and `larger_second`, with `larger_second` equal to `second × 42`. + +If the table is there with the expected values, the transformation works. + +## Adjust the query timeout + +By default a BigQuery query is capped at BigQuery's own maximum runtime. To raise or lower it for this configuration, set the **Query timeout** parameter — see [limits](/transformations/bigquery/reference/#limits). + +## Stop a run on a condition + +To abort deliberately (for example, when an integrity check fails) and return a user error, set the `ABORT_TRANSFORMATION` variable in your script. See [aborting execution](/transformations/bigquery/reference/#aborting-execution-abort_transformation). + +## Troubleshooting + +| Symptom | Likely cause | Fix | +|---|---|---| +| `Not found: Table source` (or similar) | Input mapping destination doesn't match the script | Make sure the input **Destination** is exactly `source` and the script references `` `source` ``. | +| Run succeeds but nothing appears in Storage | No output mapping, or wrong **Source** staging name | Add an output mapping whose **Source** matches the table your script created (`result`). | +| Query exceeds the time limit | Long-running query past the BigQuery maximum | Optimize the query, or raise the **Query timeout** parameter ([reference](/transformations/bigquery/reference/#limits)). | +| Transformation aborted with a user error | `ABORT_TRANSFORMATION` was set to a non-empty value | Expected if you use the abort pattern; otherwise check the logic that sets it. | + +## Related + +- [BigQuery transformation reference](/transformations/bigquery/reference/) — limits, data types, UDFs. +- [Input and output mapping](/transformations/mappings/) — how staging works. +- [Tutorial: Manipulating data](/tutorial/manipulate/) — guided first transformation. 
diff --git a/src/content/docs/transformations/bigquery/index.md b/src/content/docs/transformations/bigquery/index.md index 0b23e3691..acc9ccbe9 100644 --- a/src/content/docs/transformations/bigquery/index.md +++ b/src/content/docs/transformations/bigquery/index.md @@ -1,116 +1,19 @@ --- title: Google BigQuery Transformation slug: 'transformations/bigquery' +description: Run SQL against Google BigQuery in Keboola. Start here, then jump to the how-to or the reference. +keywords: + - BigQuery transformation + - BigQuery transformations + - Google BigQuery SQL transformation +type: explanation --- +A **BigQuery transformation** runs your SQL against Google BigQuery — a fully managed, serverless, auto-scaling data warehouse — while Keboola handles input/output mapping to and from [Storage](/storage/tables/). It suits analytics over large datasets and integrates with the wider Google Cloud ecosystem. +This page is split by what you need: -[BigQuery](https://cloud.google.com/bigquery) offers a range of features: +- **[How do I run a BigQuery transformation?](/transformations/bigquery/how-to/)** — create, configure, and run one end to end, with a worked example and troubleshooting. +- **[BigQuery transformation reference](/transformations/bigquery/reference/)** — query limits, the abort variable, data-type casting, and user-defined functions. -- Fully managed, serverless data warehouse -- Automatic scaling of compute resources -- Storage and analysis of multi-terabyte datasets -- High-speed streaming insertion of data -- Integrates with Google's data analytics ecosystem - -## Limits -- By default, individual queries have a [maximum run time](https://cloud.google.com/bigquery/quotas#query_jobs) of 2 hours, but you can adjust this using the *Query timeout* parameter. -- There is a [limit on the number of tables](https://cloud.google.com/bigquery/quotas#tables) referenced by a single query. 
-- While table updates are possible, BigQuery favors an append-only model where mutations are [generally discouraged](https://cloud.google.com/bigquery/docs/best-practices-costs#avoid_using_dml). - -BigQuery is designed for flexibility and ease of use. Its integration with other Google Cloud services provides a robust platform for analytics at scale. To keep up with the latest improvements and updates, it's a good idea to monitor the [BigQuery release notes](https://cloud.google.com/bigquery/docs/release-notes). - -For information on BigQuery limitations within Keboola, refer to the [BigQuery Limitations](/storage/byodb/#bigquery-limitations) section. - -## Aborting Transformation Execution -In some cases, you may need to abort the transformation execution and exit with an error message. -To abort the execution, set the `ABORT_TRANSFORMATION` variable to any nonempty string value. The variable is already declared internally, so you only need to set its value. - -```sql -SET ABORT_TRANSFORMATION = ( - SELECT IF(COUNT(*) = 0, '', 'Integrity check failed') - FROM INTEGRITY_CHECK - WHERE RESULT = 'failed' -); -``` - -This example will set the `ABORT_TRANSFORMATION` variable value to `'Integrity check failed'` if the `INTEGRITY_CHECK` table -contains one or more records with the `RESULT` column equal to the value `'failed'`. - -The transformation engine checks `ABORT_TRANSFORMATION` after each successfully executed query and returns the variable's value -as a user error, `Transformation aborted: Integrity check failed.` in this case. - -![Screenshot - Transformation aborted](/transformations/bigquery/abort.png) - -## Example -To create a simple BigQuery transformation, follow these steps: - -- Create a table in Storage by uploading the [sample CSV file](/transformations/source.csv). -- Create an input mapping from that table, setting its destination to `source` (as expected by the BigQuery script). 
-- Create an output mapping, setting its destination to a new table in your Storage. -- Copy & paste the below script into the transformation code. -- Save and run the transformation. - -```sql -CREATE OR REPLACE TABLE `result` AS -SELECT `first`, CAST(`second` AS INT64) * 42 AS `larger_second` -FROM `source`; -``` - -![Screenshot - Sample Transformation](/transformations/bigquery/sample-transformation.png) - -You can organize the script into [blocks](/transformations/#writing-scripts). - -## Best Practices - -### Working With Data Types -Keboola Storage tables store data in character types. When creating a table for output mapping in BigQuery, you can rely on implicit casting to STRING: - -```sql -CREATE OR REPLACE TABLE test (ID STRING, TM TIMESTAMP, NUM FLOAT64); - -INSERT INTO test (ID, TM, NUM) -SELECT 'first', CURRENT_TIMESTAMP(), 12.5; -``` - -Alternatively, you can create the table with all columns as STRING and rely on implicit casting: - -```sql -CREATE OR REPLACE TABLE test (ID STRING, TM STRING, NUM STRING); - -INSERT INTO test (ID, TM, NUM) -SELECT 'first', FORMAT_TIMESTAMP('%F %T', CURRENT_TIMESTAMP()), CAST(12.5 AS STRING); -``` - -Explicit casting of columns to STRING is also an option: - -```sql -CREATE OR REPLACE TABLE test (ID STRING, TM STRING, NUM STRING); - -INSERT INTO test (ID, TM, NUM) -SELECT - CAST('first' AS STRING), - CAST(FORMAT_TIMESTAMP('%F %T', CURRENT_TIMESTAMP()) AS STRING), - CAST(12.5 AS STRING) -; -``` - -For unstructured data types in BigQuery, explicit casting is often necessary: - -```sql -CREATE OR REPLACE TABLE test (ID STRING, TM STRING, NUM STRING, OBJ STRING); - -INSERT INTO test (ID, TM, NUM, OBJ) -SELECT - 'first', - FORMAT_TIMESTAMP('%F %T', CURRENT_TIMESTAMP()), - CAST(12.5 AS STRING), - TO_JSON_STRING(STRUCT('name' AS NAME, '123' AS CIN)) -; -``` - -### UDF - -There are two types of user-defined functions in BigQuery: persistent and temporary. 
Persistent UDFs are stored in a dataset and can be used by any user with access to the dataset. Temporary UDFs are only available during the session in which they are created. - -Because BQ transformations always run in a new session (and new dataset), you can only use temporary UDFs. To create a temporary UDF, use the `CREATE TEMP FUNCTION` statement. You can find more information about UDFs in the [BigQuery documentation](https://cloud.google.com/bigquery/docs/reference/standard-sql/user-defined-functions). +New to transformations in general? Start with [Transformations](/transformations/) and the [Getting Started tutorial](/tutorial/manipulate/). For BigQuery limitations specific to Keboola, see [BigQuery Limitations](/storage/byodb/#bigquery-limitations). diff --git a/src/content/docs/transformations/bigquery/reference.md b/src/content/docs/transformations/bigquery/reference.md new file mode 100644 index 000000000..a5b6e552f --- /dev/null +++ b/src/content/docs/transformations/bigquery/reference.md @@ -0,0 +1,98 @@ +--- +title: BigQuery transformation reference +slug: 'transformations/bigquery/reference' +description: Lookup reference for BigQuery SQL transformations in Keboola — query limits, the abort variable, data-type casting to STRING, and user-defined functions. +keywords: + - BigQuery transformation limits + - BigQuery query timeout + - ABORT_TRANSFORMATION BigQuery + - BigQuery data types Keboola + - BigQuery temporary UDF +type: reference +--- + +Reference material for [BigQuery SQL transformations](/transformations/bigquery/). To create one, see the [how-to](/transformations/bigquery/how-to/). + + + +## Limits + +| Limit | Value | Notes | +|---|---|---| +| Query runtime | 2 hours (BigQuery default) | Adjustable per configuration via the **Query timeout** parameter. See [BigQuery query-jobs quotas](https://cloud.google.com/bigquery/quotas#query_jobs). 
| +| Tables per query | Capped | BigQuery limits the [number of tables referenced by a single query](https://cloud.google.com/bigquery/quotas#tables). | +| Mutations | Discouraged | BigQuery favors an append-only model; row-level mutations are [generally discouraged](https://cloud.google.com/bigquery/docs/best-practices-costs#avoid_using_dml). | + +**Query timeout** parameter — overrides the per-query runtime limit. Default: `0` (use BigQuery's own default). + +For BigQuery limitations specific to Keboola, see [BigQuery Limitations](/storage/byodb/#bigquery-limitations). Track upstream changes in the [BigQuery release notes](https://cloud.google.com/bigquery/docs/release-notes). + +## Aborting execution (`ABORT_TRANSFORMATION`) + +To stop a transformation and exit with a user error, set the `ABORT_TRANSFORMATION` variable to any non-empty string. The variable is already declared internally — you only set its value. The engine checks it after each successfully executed query and returns the value as a user error (for example, `Transformation aborted: Integrity check failed.`). + +```sql +SET ABORT_TRANSFORMATION = ( + SELECT IF(COUNT(*) = 0, '', 'Integrity check failed') + FROM INTEGRITY_CHECK + WHERE RESULT = 'failed' +); +``` + +This sets `ABORT_TRANSFORMATION` to `'Integrity check failed'` when the `INTEGRITY_CHECK` table has one or more rows with `RESULT = 'failed'`. An empty string does not abort. + +## Working with data types + +Keboola Storage [tables](/storage/tables/) store data as character types. 
When creating an output-mapping table you can rely on implicit casting to `STRING`: + +```sql +CREATE OR REPLACE TABLE test (ID STRING, TM TIMESTAMP, NUM FLOAT64); + +INSERT INTO test (ID, TM, NUM) +SELECT 'first', CURRENT_TIMESTAMP(), 12.5; +``` + +Or create all columns as `STRING`: + +```sql +CREATE OR REPLACE TABLE test (ID STRING, TM STRING, NUM STRING); + +INSERT INTO test (ID, TM, NUM) +SELECT 'first', FORMAT_TIMESTAMP('%F %T', CURRENT_TIMESTAMP()), CAST(12.5 AS STRING); +``` + +Or cast explicitly: + +```sql +CREATE OR REPLACE TABLE test (ID STRING, TM STRING, NUM STRING); + +INSERT INTO test (ID, TM, NUM) +SELECT + CAST('first' AS STRING), + CAST(FORMAT_TIMESTAMP('%F %T', CURRENT_TIMESTAMP()) AS STRING), + CAST(12.5 AS STRING) +; +``` + +For structured/semi-structured values, cast explicitly (for example, serialize a `STRUCT` to JSON): + +```sql +CREATE OR REPLACE TABLE test (ID STRING, TM STRING, NUM STRING, OBJ STRING); + +INSERT INTO test (ID, TM, NUM, OBJ) +SELECT + 'first', + FORMAT_TIMESTAMP('%F %T', CURRENT_TIMESTAMP()), + CAST(12.5 AS STRING), + TO_JSON_STRING(STRUCT('name' AS NAME, '123' AS CIN)) +; +``` + +## User-defined functions (UDFs) + +BigQuery has two kinds of UDF: **persistent** (stored in a dataset, reusable) and **temporary** (available only within the session that creates them). + +Because a BigQuery transformation always runs in a **new session and a new dataset**, you can only use **temporary** UDFs — create them with `CREATE TEMP FUNCTION`. See the [BigQuery UDF documentation](https://cloud.google.com/bigquery/docs/reference/standard-sql/user-defined-functions). From 0ae2ccb2526951ab0a0e315b1b22db3f8543f7dc Mon Sep 17 00:00:00 2001 From: Nikita Date: Wed, 24 Jun 2026 19:36:38 +0200 Subject: [PATCH 02/15] PRDCT-376: revise Oracle transformation how-to + document optional `schema` field Block 0: already how-to (no split). 
Reshaped to the standard how-to (steps, confirm, troubleshooting) and added the optional `schema` config field from oracle-transformation ConfigDefinition.php; exact default/behavior flagged TODO(human-review). Co-Authored-By: Claude Opus 4.8 --- .../docs/transformations/oracle/index.md | 78 +++++++++++++------ 1 file changed, 56 insertions(+), 22 deletions(-) diff --git a/src/content/docs/transformations/oracle/index.md b/src/content/docs/transformations/oracle/index.md index 7d692572d..d9dc31ec8 100644 --- a/src/content/docs/transformations/oracle/index.md +++ b/src/content/docs/transformations/oracle/index.md @@ -1,20 +1,28 @@ --- -title: Oracle Transformation +title: How do I run an Oracle transformation? slug: 'transformations/oracle' +description: Create, configure, and run an Oracle SQL transformation in Keboola — set up the database user and credentials, write the SQL, map input and output, and run it. Note that Oracle transformations run on your own Oracle server. +keywords: + - Oracle transformation + - Oracle transformations + - Oracle SQL transformation Keboola + - Oracle transformation credentials + - Oracle transformation schema +type: how-to --- +You want to transform data with SQL on an [Oracle database](https://www.oracle.com/database/). Unlike other backends, an Oracle transformation runs on **your own Oracle server** (it is not provisioned by Keboola), so you set up the database user and credentials yourself. This page takes you from nothing to a successful run. +**Time:** ~15 minutes · **You will need:** a Keboola project, access to an Oracle server where you can create a user, and one table in [Storage](/storage/tables/) to read from. -The [Oracle database](https://www.oracle.com/index.html) is a multi-model database management system produced and -is marketed by Oracle Corporation. 
+## Before you start -## Example -After you create a configuration, configure the database credentials using the `Database Credentials` link: +- You manage the Oracle server. Keboola connects to it with credentials you provide, so the server must be reachable from Keboola and you are responsible for its availability. +- Have the [sample CSV file](/transformations/source.csv) (or any table) ready to upload to Storage as the input. -![Screenshot - Credentials link](/transformations/oracle/navigate-to-credentials.png) +## Step 1 — Create a database user -The following SQL code creates user `KEBOOLA_TRANSFORMATION`, a schema with the same name, and grants the user -read/write privileges only to this schema. +In Oracle, create a dedicated user for Keboola and grant it the privileges to open a session and create tables. Replace the password with your own: ```sql CREATE USER KEBOOLA_TRANSFORMATION IDENTIFIED BY "secretPassword20" QUOTA UNLIMITED ON USERS; @@ -22,26 +30,52 @@ CREATE USER KEBOOLA_TRANSFORMATION IDENTIFIED BY "secretPassword20" QUOTA UNLIMI GRANT CREATE SESSION TO KEBOOLA_TRANSFORMATION; GRANT CREATE TABLE TO KEBOOLA_TRANSFORMATION; ``` - -Fill in the credentials to the database. After testing the credentials, save them: -![Screenshot - Credentials](/transformations/oracle/credentials.png) +## Step 2 — Create the transformation and add credentials -After you save the credentials, follow these steps to create a simple Oracle transformation: +1. Open **Components → Transformations**, click **New Transformation**, and choose **Oracle Transformation**. +2. Open the **Database Credentials** link in the configuration. +3. Enter the host, port, database/service, username, and password for the `KEBOOLA_TRANSFORMATION` user. +4. **(Optional) Schema** — set this to run the transformation against a specific Oracle schema. Leave it empty to use the connected user's default schema. +5. Click **Test Credentials**, then **Save**. + +## Step 3 — Map the input + +1. 
Upload the [sample CSV file](/transformations/source.csv) to Storage as a table. +2. In **Input Mapping**, add the table and set its **Destination** to `source`. +3. Save the mapping. + +## Step 4 — Write the SQL script + +In the code editor, paste: -- Create a table in Storage by uploading the [sample CSV file](/transformations/source.csv). -- Create input mapping from that table, setting its destination to `source`. -- Create output mapping, setting its destination to a new table in your Storage. -- Copy & paste the below script into the transformation code. -- Save and run the transformation. - ```sql CREATE TABLE "result" AS SELECT * FROM "source"; ``` -![Screenshot - Sample Transformation](/transformations/oracle/sample-transformation.png) +You can split longer scripts into [blocks](/transformations/#writing-scripts). + +## Step 5 — Add the output mapping + +1. In **Output Mapping**, set **Source** to `result` (the table the script creates). +2. Set **Destination** to a new Storage table, for example `out.c-main.result`. +3. Save the mapping. + +## Step 6 — Run it and confirm the result + +1. Click **Run** on the transformation. +2. Wait for the [job](/management/jobs/) to finish with a success status. +3. Open **Storage** and confirm your destination table contains the rows from `source`. + +## Troubleshooting + +| Symptom | Likely cause | Fix | +|---|---|---| +| Credentials test fails | Server unreachable, wrong host/port/service, or user lacks `CREATE SESSION` | Verify connectivity and re-check the grants in Step 1. | +| `table or view does not exist` | Input destination name doesn't match the script, or wrong schema | Ensure the input **Destination** is `source`; if you set **Schema**, confirm the objects live there. | +| Run succeeds but nothing in Storage | Missing/incorrect output mapping | Add an output mapping whose **Source** matches the table the script created (`result`). 
| -You can organize the script into [blocks](/transformations/#writing-scripts). +## Related -Please keep in mind that this transformation, unlike the other transformations, runs on your Oracle Database server -(it is not provisioned by Keboola). You must ensure a flawless course. +- [Input and output mapping](/transformations/mappings/) — how staging works. +- [Tutorial: Manipulating data](/tutorial/manipulate/) — guided first transformation. From fc4461356273c0f7408094f62517b18cbfe7caed Mon Sep 17 00:00:00 2001 From: Nikita Date: Wed, 24 Jun 2026 19:36:38 +0200 Subject: [PATCH 03/15] PRDCT-376: split DuckDB transformation into how-to + reference + explanation MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Block 0: how-to + reference + explanation. Thin hub at the unchanged URL. Settings, backend memory figures, and parameter names (threads/max_memory_mb) flagged TODO(human-review) — configuration.py not in this repo. Co-Authored-By: Claude Opus 4.8 --- .../transformations/duckdb/explanation.md | 59 +++ .../docs/transformations/duckdb/how-to.md | 89 +++++ .../docs/transformations/duckdb/index.md | 371 +----------------- .../docs/transformations/duckdb/reference.md | 173 ++++++++ 4 files changed, 336 insertions(+), 356 deletions(-) create mode 100644 src/content/docs/transformations/duckdb/explanation.md create mode 100644 src/content/docs/transformations/duckdb/how-to.md create mode 100644 src/content/docs/transformations/duckdb/reference.md diff --git a/src/content/docs/transformations/duckdb/explanation.md b/src/content/docs/transformations/duckdb/explanation.md new file mode 100644 index 000000000..548a94983 --- /dev/null +++ b/src/content/docs/transformations/duckdb/explanation.md @@ -0,0 +1,59 @@ +--- +title: When should I use a DuckDB transformation? 
+slug: 'transformations/duckdb/explanation' +description: Understand what a DuckDB transformation is in Keboola, why it is fast and cost-effective for small-to-medium analytics, and when to choose it over Snowflake. +keywords: + - DuckDB transformation + - when to use DuckDB + - DuckDB vs Snowflake + - DuckDB analytics Keboola + - DuckDB OLAP +type: explanation +--- + +A **DuckDB transformation** runs your SQL in [DuckDB](https://duckdb.org/), an in-process analytical (OLAP) database, while Keboola maps data to and from [Storage](/storage/tables/). This page explains what that means and when DuckDB is the right choice. To build one, follow the [how-to](/transformations/duckdb/how-to/); for all settings, see the [reference](/transformations/duckdb/reference/). + +:::caution[Beta] +DuckDB Transformation is currently in **BETA**. Breaking changes may occur. +::: + +## What it is + +DuckDB runs **in-process** — there is no external database server to provision. It uses **columnar storage** optimized for analytical queries, runs independent scripts within a block **in parallel** (with automatic dependency analysis), and ships a **rich SQL dialect** with modern conveniences. For Keboola that makes it a fast, low-overhead, cost-effective backend for small-to-medium analytics. + +Like every [transformation](/transformations/), it works on an isolated copy of your data: input mapping stages your Storage tables, your SQL runs against them, and output mapping writes results back. + +## Why DuckDB + +- **In-process execution** — no warehouse to spin up; low overhead and fast startup. +- **Columnar + parallel** — efficient on analytical (`SELECT`-heavy) workloads. +- **Cost-effective** — a cheaper alternative to cloud warehouses for datasets up to a few terabytes. +- **Rich SQL** — quality-of-life extensions (`GROUP BY ALL`, `EXCLUDE`, `ASOF JOIN`, `SUMMARIZE`); see the [reference](/transformations/duckdb/reference/#sql-extensions). + +## When to use DuckDB vs. 
Snowflake + +**Choose DuckDB for:** + +- Ad-hoc analysis and small-to-medium datasets (under a few TB) +- Rapid prototyping, development, and testing +- Projects with limited budgets + +**Choose [Snowflake](/transformations/snowflake-plain/) for:** + +- Very large datasets (well beyond a few TB) and complex enterprise workloads +- Sharing a warehouse across multiple processes +- Maximum scalability and Snowflake-specific features + +Migrating existing Snowflake transformations? See the [Snowflake to DuckDB migration guide](/transformations/duckdb/snowflake-migration/). + +## What DuckDB is not + +DuckDB is an **OLAP** database optimized for `SELECT` and analytical queries. Avoid workflows with frequent `INSERT`/`UPDATE` operations; for mutation-heavy workloads, use a different backend such as [Snowflake](/transformations/snowflake-plain/). + +## Designing maintainable transformations + +- Split complex transformations into smaller steps, each producing one output table. +- Use consistent naming for output tables (for example, `stg_customers`, `fact_orders`, `dim_products`). +- Document non-obvious business logic directly in the SQL. + +Because scripts within a block run in parallel based on a dependency graph, organizing logic into clear blocks lets the engine optimize execution for you — see [block-based orchestration](/transformations/duckdb/reference/#block-based-orchestration). diff --git a/src/content/docs/transformations/duckdb/how-to.md b/src/content/docs/transformations/duckdb/how-to.md new file mode 100644 index 000000000..1320d3dba --- /dev/null +++ b/src/content/docs/transformations/duckdb/how-to.md @@ -0,0 +1,89 @@ +--- +title: How do I run a DuckDB transformation? +slug: 'transformations/duckdb/how-to' +description: Create, configure, and run a DuckDB SQL transformation in Keboola from start to finish — create the configuration, map input, write the SQL, map output, run it, and confirm the result landed in Storage. 
+keywords: + - run a DuckDB transformation + - create DuckDB transformation + - DuckDB SQL transformation Keboola + - DuckDB transformation example +type: how-to +--- + +You have a table in Keboola Storage and you want to transform it with DuckDB SQL and write the result back to Storage. This page takes you from nothing to a successful run using a small worked example. For all settings and syntax rules, see the [reference](/transformations/duckdb/reference/); for when to choose DuckDB, see the [explanation](/transformations/duckdb/explanation/). + +:::caution[Beta] +DuckDB Transformation is currently in **BETA**. Breaking changes may occur. +::: + +**Time:** ~10 minutes · **You will need:** a Keboola project where you can create configurations, and one table in [Storage](/storage/tables/) to read from. + +## Before you start + +Get a table into Storage to use as the input. The example SQL below expects it to be staged as `sample` and to have `order_date` and `order_amount` columns. If you use the [sample CSV file](/transformations/source.csv) from the other transformation pages instead, adjust the column names in the SQL to match it. + +## Step 1 — Create the transformation + +1. Open **Components → Transformations** and click **New Transformation**. +2. Select **DuckDB Transformation**. +3. Name it, optionally add a description and folder, and click **Create Transformation**. + +## Step 2 — Add the input mapping + +1. In **Input Mapping**, add your Storage table. +2. Set its **Destination** (staging table name) to `sample`. +3. Save the mapping. + +## Step 3 — Write the SQL script + +In the code editor, paste: + +```sql +CREATE TABLE "output" AS +SELECT "order_date", SUM("order_amount") AS "sum_orders_amount" +FROM "sample" +GROUP BY "order_date"; +``` + +End every statement with a semicolon (`;`). Quote identifiers that need exact case (`"sample"`). 
You can split longer scripts into [blocks](/transformations/#writing-scripts), which DuckDB runs with automatic dependency analysis (see [block-based orchestration](/transformations/duckdb/reference/#block-based-orchestration)). + +> If `SUM()` fails with a type error, your input is probably loading as `VARCHAR`. Either cast explicitly (for example, `SUM(CAST("order_amount" AS DOUBLE))`), or enable **Infer input table data types** — see [Optional: work with typed inputs](#optional-work-with-typed-inputs). + +## Step 4 — Add the output mapping + +1. In **Output Mapping**, set **Source** to `output` (the table the script creates). +2. Set **Destination** to a new Storage table, for example `out.c-main.orders`. +3. Save the mapping. + +## Step 5 — Run it and confirm the result + +1. Click **Run** on the transformation. +2. Wait for the [job](/management/jobs/) to finish with a success status. +3. Open **Storage**, find your destination table, and confirm it has one row per `order_date` with the summed amount. + +## Optional: work with typed inputs + +By default, input columns load as `VARCHAR`, so numeric and date functions need explicit casts. To use real types directly, enable **Infer input table data types** in the configuration settings — DuckDB then detects types like `INTEGER`, `FLOAT`, and `DATE`. See [Infer input table data types](/transformations/duckdb/reference/#infer-input-table-data-types). + +## Make it faster (backend size) + +If the job is slow or runs out of memory, raise the **Backend size** (XSmall → Small → Medium → Large). The sizes and their memory are listed in the [reference](/transformations/duckdb/reference/#backend-sizes). For datasets over 10 GB, also see [memory management](/transformations/duckdb/reference/#memory-management-for-large-datasets). 
+ +## Check before you run (sync actions) + +You can validate without a full run using [sync actions](/transformations/duckdb/reference/#sync-actions) — for example **Syntax check** to catch SQL errors, or **Expected input tables** to confirm the inputs your script references. + +## Troubleshooting + +| Symptom | Likely cause | Fix | +|---|---|---| +| Syntax error between statements | Missing semicolon | End every statement with `;` ([reference](/transformations/duckdb/reference/#semicolons-between-statements)). | +| `SUM()`/aggregation fails on a column | Input loaded as `VARCHAR` | Cast explicitly, or enable **Infer input table data types**. | +| `table not found` for a mixed-case name | Unquoted name folded to lowercase | Quote the identifier (`"MyTable"`); see [case sensitivity](/transformations/duckdb/reference/#identifier-case-sensitivity). | +| Run succeeds but nothing in Storage | Missing/incorrect output mapping | Add an output mapping whose **Source** matches the table the script created (`output`). | + +## Related + +- [DuckDB transformation reference](/transformations/duckdb/reference/) — settings, backends, SQL extensions. +- [When should I use DuckDB?](/transformations/duckdb/explanation/) — DuckDB vs. Snowflake. +- [Snowflake to DuckDB migration guide](/transformations/duckdb/snowflake-migration/). diff --git a/src/content/docs/transformations/duckdb/index.md b/src/content/docs/transformations/duckdb/index.md index 43f043d71..e6f3a460b 100644 --- a/src/content/docs/transformations/duckdb/index.md +++ b/src/content/docs/transformations/duckdb/index.md @@ -1,365 +1,24 @@ --- title: DuckDB Transformation slug: 'transformations/duckdb' +description: Run SQL in DuckDB, an in-process analytical database, inside Keboola. Start here, then jump to the how-to, reference, or the explanation of when to use it. 
+keywords: + - DuckDB transformation + - DuckDB transformations + - DuckDB SQL transformation +type: explanation --- +A **DuckDB transformation** runs your SQL in [DuckDB](https://duckdb.org/) — a fast, in-process analytical database — while Keboola maps data to and from [Storage](/storage/tables/). It is a cost-effective backend for small-to-medium analytics. +:::caution[Beta] +DuckDB Transformation is currently in **BETA**. Breaking changes may occur. +::: -[DuckDB](https://duckdb.org/) is an in-process analytical database designed for fast SQL analytics. -It brings several advantages to Keboola transformations: +This page is split by what you need: -- **In-process execution** --- no external database server needed -- **Columnar storage** --- optimized for analytical queries -- **Block-based orchestration** with automatic dependency analysis -- **Parallel execution** of independent scripts within blocks -- **Cost-effective** alternative to cloud data warehouses for small to medium datasets -- **Rich SQL dialect** with modern quality-of-life extensions +- **[How do I run a DuckDB transformation?](/transformations/duckdb/how-to/)** — create, configure, and run one end to end, with a worked example and troubleshooting. +- **[DuckDB transformation reference](/transformations/duckdb/reference/)** — configuration settings, backend sizes, versions, sync actions, block orchestration, Parquet, type inference, case sensitivity, and SQL extensions. +- **[When should I use a DuckDB transformation?](/transformations/duckdb/explanation/)** — what it is, why DuckDB, and DuckDB vs. Snowflake. -***Note:** DuckDB Transformation is currently in **BETA**. Breaking changes may occur.* - -## Creating a DuckDB Transformation - -To create a new DuckDB transformation, click **New Transformation** in the Transformations section and select **DuckDB Transformation**. 
- -![Screenshot - New Transformation](/transformations/duckdb/new-transformation.png) - -Name your transformation, optionally add a description and folder, and click **Create Transformation**. - -![Screenshot - Create Transformation](/transformations/duckdb/create-transformation.png) - -## Configuration - -The configuration page allows you to set up input/output mappings, write SQL queries, and configure transformation settings. - -![Screenshot - DuckDB Transformation Configuration](/transformations/duckdb/configuration.png) - -On the right side panel, you can configure: - -- **Timeout** --- maximum execution time (default: 1 hour) -- **Backend size** --- amount of memory allocated for the transformation (see [Dynamic Backends](#dynamic-backends)) -- **DuckDB version** --- select which DuckDB version to use (see [DuckDB Version](#duckdb-version)) -- **Automatic data types** --- automatically assign data types to output tables -- **Use parquet for input tables** --- use Parquet format instead of CSV for input data (see [Parquet Format](#parquet-format)) -- **Infer input table data types** --- infer data types from input tables (see [Infer Input Table Data Types](#infer-input-table-data-types)) -- **Debug mode** --- enable debug logging for troubleshooting - -### DuckDB Version - -You can select the DuckDB version used to run the transformation. Use `latest` (default) to always run on the most -recent supported version, or pin to a specific version (e.g., `1.5.2`, `1.4.4`) for stability. Each supported version -runs in its own isolated environment. - -![Screenshot - DuckDB Version Selection](/transformations/duckdb/duckdb-version.png) - -## Block-Based Orchestration - -DuckDB transformations use **block-based orchestration** for organizing and executing SQL code: - -- **Blocks** are executed **sequentially** (one after another). -- **Scripts** (code pieces) within a block are executed **in parallel** when they have no dependencies on each other. 
-- The system uses [SQLGlot](https://github.com/tobymao/sqlglot) to automatically analyze SQL and build a **DAG** (Directed Acyclic Graph) of dependencies. -- Execution order is automatically optimized based on the dependency analysis. - -This means you can organize your transformation into logical blocks and let the system handle parallel execution where possible. - -## Sync Actions - -DuckDB transformations provide four **sync actions** for debugging and visualization without running the full transformation: - -- **Syntax check** (`syntax_check`) --- validates your SQL syntax without executing any queries. Useful for catching errors before running the transformation. -- **Lineage visualization** (`lineage_visualization`) --- generates a markdown diagram of data dependencies, showing how tables flow through your transformation. -- **Execution plan visualization** (`execution_plan_visualization`) --- shows the planned execution order with blocks and batches, illustrating how the automatic DAG organizes your queries. -- **Expected input tables** (`expected_input_tables`) --- displays the list of input tables that the transformation expects based on the SQL analysis. - -These actions are available from the transformation configuration page and are helpful for understanding and debugging complex transformations. - -## Dynamic Backends - -You can change the backend size to allocate more memory for your transformation. The following sizes are available: - -| Backend Size | Memory | Recommended For | -|---|---|---| -| **XSmall** | 8 GB | Small datasets, testing | -| **Small** *(default)* | 16 GB | Most use cases | -| **Medium** | 32 GB | Large datasets (5 GB+) | -| **Large** | 113.6 GB | Very large datasets (10 GB+) | - -Start with the **Small** backend and scale up as needed based on your dataset size and query complexity. 
- -***Note:** Dynamic backends are not available if you are on the [Free Plan (Pay As You Go)](/management/payg-project/).* - -### Auto-Resource Detection - -DuckDB automatically detects the available CPU and memory resources. You can also manually configure resource limits using the `threads` and `max_memory_mb` parameters in the transformation configuration. - -## Parquet Format - -By default, input tables are loaded as CSV files. You can enable Parquet format for significantly better performance, -especially with larger datasets. - -**Advantages of Parquet:** -- Much faster processing than CSV -- Lower memory usage -- Columnar storage optimized for analytical queries - -**Recommendation:** Always use Parquet for datasets larger than 1 GB. - -To enable Parquet, toggle the **Use parquet for input tables** option in the transformation settings. - -## Infer Input Table Data Types - -When working with non-typed (string-based) Storage tables, you can enable the **Infer input table data types** option. -This feature instructs DuckDB to infer the actual data types of the input columns, so you can work with numeric, date, and boolean types directly in your SQL queries without manual casting. - -![Screenshot - Infer Input Table Data Types Enabled](/transformations/duckdb/infer-data-types-enabled.png) - -**Why is this useful?** - -Keboola Storage tables can be **non-typed** (all columns stored as `VARCHAR`). Without type inference enabled, -all values in input tables are treated as strings, and functions like `SUM()` will fail because they expect numeric types. - -![Screenshot - Job Error Without Type Inference](/transformations/duckdb/job-error-varchar.png) - -With **Infer input table data types** enabled, DuckDB automatically detects the correct types (e.g., `INTEGER`, `FLOAT`, `DATE`), -so aggregate functions and type-specific operations work as expected. 
- -![Screenshot - Successful Job With Type Inference](/transformations/duckdb/job-success.png) - -The output table then contains properly typed columns: - -![Screenshot - Output Table With Typed Columns](/transformations/duckdb/output-typed-columns.png) - -## Example - -To create a simple DuckDB transformation, follow these steps: - -- Create a table in Storage by uploading the [sample CSV file](/transformations/source.csv). -- Create an input mapping from that table, setting its destination to `sample` (as expected by the DuckDB script). -- Create an output mapping, setting its destination to a new table in your Storage. -- Copy & paste the below script into the transformation code. -- Save and run the transformation. - -```sql -CREATE TABLE "output" AS -SELECT "order_date", SUM("order_amount") AS "sum_orders_amount" -FROM "sample" -GROUP BY "order_date"; -``` - -![Screenshot - Query Example](/transformations/duckdb/query-example.png) - -You can organize the script into [blocks](/transformations/#writing-scripts). - -## Best Practices - -### Semicolons Between Statements - -Each SQL statement in a DuckDB transformation **must be terminated with a semicolon** (`;`). If you have multiple statements -in a single script, make sure they are properly separated: - -```sql --- Correct: each statement ends with a semicolon -CREATE TABLE "output_a" AS SELECT * FROM "input_a"; - -CREATE TABLE "output_b" AS SELECT * FROM "input_b"; -``` - -Missing semicolons will cause syntax errors. - -### Case Sensitivity - -DuckDB handles identifier case differently than Snowflake: - -**Table names:** -- **Unquoted table names** are converted to **lowercase** (e.g., `SELECT * FROM MyTable` references `mytable`). -- **Quoted table names** are **case-sensitive** (e.g., `SELECT * FROM "MyTable"` references exactly `MyTable`). - -**Column names:** -- **Columns are always case-sensitive** regardless of quoting (e.g., `SELECT columnName` and `SELECT ColumnName` refer to different columns). 
- -This is different from Snowflake, where unquoted identifiers become uppercase. - -**Best practices:** -- Use consistent casing for table and column names. -- When referencing tables with mixed case or special characters, always use quotes: `"TaBlE-stage"`. -- Be aware that input table names are typically lowercase unless explicitly quoted. - -### Optimizing SQL Queries - -**Filter and project early** --- apply `WHERE` clauses as close to the source table as possible and select only the columns you need. -This reduces the amount of data DuckDB needs to scan. - -```sql --- Good: filter and project at the source -SELECT id, name, price -FROM products -WHERE category = 'electronics' AND price > 100; -``` - -**Use EXPLAIN for performance analysis** --- prefix your query with `EXPLAIN` to see the execution plan and identify expensive operations. - -```sql -EXPLAIN SELECT product_category, SUM(price) AS total_revenue -FROM sales -WHERE sale_date >= '2023-01-01' -GROUP BY product_category -ORDER BY total_revenue DESC; -``` - -### DuckDB SQL Extensions - -DuckDB provides several quality-of-life SQL extensions that simplify common patterns: - -**GROUP BY ALL** --- automatically groups by all non-aggregated columns: - -```sql -SELECT product, category, SUM(sales) -FROM orders -GROUP BY ALL; -``` - -**EXCLUDE** --- select all columns except specific ones: - -```sql -SELECT * EXCLUDE (password, ssn, credit_card) -FROM users; -``` - -**ASOF JOIN** --- useful for time-series data where timestamps do not match exactly: - -```sql -SELECT - s.player_id, - s.score, - s.score_time, - w.temperature, - w.conditions -FROM scores s -ASOF JOIN weather w -ON s.score_time >= w.timestamp; -``` - -**SUMMARIZE** --- quick data profiling with min, max, null percentage, and unique counts: - -```sql -SUMMARIZE SELECT * FROM my_table; -``` - -### Working With Data Types - -Keboola Storage [tables](/storage/tables/) store data in character types by default. 
When **Infer input table data types** is disabled, -all columns are loaded as `VARCHAR`. You need to cast values explicitly: - -```sql -CREATE TABLE "result" AS -SELECT - CAST("amount" AS DECIMAL) AS "amount", - CAST("created_at" AS TIMESTAMP) AS "created_at" -FROM "source"; -``` - -When **Infer input table data types** is enabled, DuckDB automatically infers the correct types and you can use them directly. - -### Memory Management for Large Datasets - -For datasets larger than 10 GB, configure DuckDB to use on-disk processing with PRAGMA settings: - -```sql -PRAGMA memory_limit='8GB'; -PRAGMA temp_directory='/tmp/duckdb_temp'; -PRAGMA threads=4; -PRAGMA enable_object_cache; -``` - -### Modular Transformations - -- Split complex transformations into smaller steps, each producing one output table. -- Use consistent naming conventions for output tables (e.g., `stg_customers`, `fact_orders`, `dim_products`). -- Document complex business logic directly in the SQL code. - -### What DuckDB Is Not - -DuckDB is an **OLAP** (Online Analytical Processing) database optimized for `SELECT` statements and analytical queries. -Avoid workflows with frequent `INSERT` and `UPDATE` operations. For transactional workloads, use a different backend such as Snowflake. - -### Real-World Example: CRM Data Transformation - -The following example shows a typical DuckDB transformation processing CRM data (e.g., from HubSpot). It demonstrates -common patterns: `TRY_CAST` for safe type conversion, `NULLIF` for handling empty strings, and `::` for type casting. 
- -```sql -/* companies */ -CREATE TABLE "out_companies" AS -SELECT - "companyId", - "name", - "website", - TRY_CAST(NULLIF("createdate", '') AS DATE) AS "createdate", - "isDeleted"::BOOLEAN AS "isDeleted" -FROM "companies"; - -/* contacts */ -CREATE TABLE "out_contacts" AS -SELECT - "canonical_vid", - "firstname", - "lastname", - "email", - TRY_CAST(NULLIF("createdate", '') AS DATE) AS "createdate", - "hs_analytics_source" AS "email_source", - "associatedcompanyid", - "lifecyclestage" -FROM "contacts"; - -/* deals */ -CREATE TABLE "out_deals" AS -SELECT - "dealId", - "isDeleted"::BOOLEAN AS "isDeleted", - "dealname", - TRY_CAST(NULLIF("createdate", '') AS DATE) AS "createdate", - TRY_CAST(NULLIF("closedate", '') AS DATE) AS "closedate", - "dealtype", - TRY_CAST(NULLIF("amount", '') AS DOUBLE) AS "amount", - "pipeline", - "dealstage", - "hubspot_owner_id", - "hs_analytics_source" -FROM "deals"; - -/* pipeline stages */ -CREATE TABLE "out_stages" AS -SELECT - "stageId", - "label", - TRY_CAST(NULLIF("displayOrder", '') AS INT) AS "displayOrder", - TRY_CAST(NULLIF("probability", '') AS DOUBLE) AS "probability", - "closedWon"::BOOLEAN AS "closedWon" -FROM "pipeline_stages"; -``` - -**Key patterns used:** -- `TRY_CAST(NULLIF("column", '') AS TYPE)` --- safely converts empty strings to `NULL` before casting. This avoids errors when the source data contains empty values. -- `"column"::BOOLEAN` --- shorthand type cast syntax. -- Each statement ends with a **semicolon** (`;`) --- required when multiple statements are in a single script. - -## When to Use DuckDB vs. 
Snowflake - -**Choose DuckDB for:** -- Ad-hoc analysis and small to medium datasets -- Rapid prototyping of transformations -- Projects with limited budgets -- Datasets under a few terabytes -- Development and testing - -**Choose Snowflake for:** -- Very large datasets (TB+) -- Complex enterprise workloads -- Sharing warehouses across multiple processes -- Maximum scalability -- Advanced Snowflake-specific features - -## Migrating from Snowflake to DuckDB - -If you are migrating existing Snowflake transformations to DuckDB, see the detailed -[Snowflake to DuckDB Migration Guide](/transformations/duckdb/snowflake-migration/). +Migrating from Snowflake? See the [Snowflake to DuckDB migration guide](/transformations/duckdb/snowflake-migration/). New to transformations? Start with [Transformations](/transformations/) and the [Getting Started tutorial](/tutorial/manipulate/). diff --git a/src/content/docs/transformations/duckdb/reference.md b/src/content/docs/transformations/duckdb/reference.md new file mode 100644 index 000000000..667ca7065 --- /dev/null +++ b/src/content/docs/transformations/duckdb/reference.md @@ -0,0 +1,173 @@ +--- +title: DuckDB transformation reference +slug: 'transformations/duckdb/reference' +description: Lookup reference for DuckDB SQL transformations in Keboola — configuration settings, backend sizes, versions, sync actions, block orchestration, Parquet and type inference, case sensitivity, and SQL extensions. +keywords: + - DuckDB transformation settings + - DuckDB backend size + - DuckDB version + - DuckDB sync actions + - DuckDB parquet + - DuckDB infer data types + - DuckDB case sensitivity + - DuckDB SQL extensions +type: reference +--- + +Reference material for [DuckDB SQL transformations](/transformations/duckdb/). To create one, see the [how-to](/transformations/duckdb/how-to/); for when to choose DuckDB, see the [explanation](/transformations/duckdb/explanation/). + +:::caution[Beta] +DuckDB Transformation is currently in **BETA**. 
Breaking changes may occur. +::: + + + +## Configuration settings + +Set these on the right-side panel of the transformation configuration: + +| Setting | Description | Default | +|---|---|---| +| **Timeout** | Maximum execution time. | 1 hour | +| **Backend size** | Memory allocated (see [Backend sizes](#backend-sizes)). | Small | +| **DuckDB version** | Which DuckDB version runs the transformation (see [DuckDB version](#duckdb-version)). | `latest` | +| **Automatic data types** | Automatically assign data types to output tables. | | +| **Use parquet for input tables** | Load inputs as Parquet instead of CSV (see [Parquet format](#parquet-format)). | Off | +| **Infer input table data types** | Infer types from input tables (see [Infer input table data types](#infer-input-table-data-types)). | Off | +| **Debug mode** | Enable debug logging for troubleshooting. | Off | + +### DuckDB version + +Select the DuckDB version used to run the transformation. Use `latest` (default) to always run on the most recent supported version, or pin a specific version (for example, `1.5.2`, `1.4.4`) for stability. Each supported version runs in its own isolated environment. + +## Backend sizes + +A larger backend allocates more memory. See the [how-to](/transformations/duckdb/how-to/#make-it-faster-backend-size) for how to change it. + +| Backend size | Memory | Recommended for | +|---|---|---| +| **XSmall** | 8 GB | Small datasets, testing | +| **Small** *(default)* | 16 GB | Most use cases | +| **Medium** | 32 GB | Large datasets (5 GB+) | +| **Large** | 113.6 GB | Very large datasets (10 GB+) | + + + +Dynamic backends are **not** available on the [Free Plan (Pay As You Go)](/management/payg-project/). + +### Auto-resource detection + +DuckDB automatically detects available CPU and memory. You can also set resource limits manually with the `threads` and `max_memory_mb` parameters in the transformation configuration. 
+ +## Block-based orchestration + +DuckDB transformations organize and execute SQL with **block-based orchestration**: + +- **Blocks** run **sequentially** (one after another). +- **Scripts** (code pieces) within a block run **in parallel** when they have no dependencies on each other. +- The system uses [SQLGlot](https://github.com/tobymao/sqlglot) to analyze SQL and build a **DAG** of dependencies, then optimizes execution order automatically. + +## Sync actions + +Four **sync actions** help you debug and visualize without running the full transformation, available from the configuration page: + +| Action | Name | What it does | +|---|---|---| +| Syntax check | `syntax_check` | Validates SQL syntax without executing queries. | +| Lineage visualization | `lineage_visualization` | Markdown diagram of data dependencies (how tables flow through). | +| Execution plan visualization | `execution_plan_visualization` | Shows the planned execution order (blocks and batches). | +| Expected input tables | `expected_input_tables` | Lists the input tables the transformation expects, based on SQL analysis. | + +## Parquet format + +By default, input tables are loaded as CSV. Enabling **Use parquet for input tables** loads them as Parquet, which is much faster, uses less memory, and is columnar (optimized for analytics). Recommended for datasets larger than 1 GB. + +## Infer input table data types + +Keboola Storage tables can be **non-typed** (all columns `VARCHAR`). With type inference off, every input value is a string, so functions like `SUM()` fail because they expect numeric types. + +Enable **Infer input table data types** to have DuckDB detect the real types (for example `INTEGER`, `FLOAT`, `DATE`) so aggregate and type-specific operations work and output columns are properly typed. + +## Semicolons between statements + +Each SQL statement **must end with a semicolon** (`;`). 
Separate multiple statements in one script: + +```sql +-- Correct: each statement ends with a semicolon +CREATE TABLE "output_a" AS SELECT * FROM "input_a"; + +CREATE TABLE "output_b" AS SELECT * FROM "input_b"; +``` + +Missing semicolons cause syntax errors. + +## Identifier case sensitivity + +DuckDB handles case differently from Snowflake: + +- **Unquoted table names** are folded to **lowercase** (`SELECT * FROM MyTable` references `mytable`). +- **Quoted table names** are case-sensitive (`SELECT * FROM "MyTable"` references exactly `MyTable`). +- **Columns are always case-sensitive**, regardless of quoting (`columnName` and `ColumnName` are different columns). + +Use consistent casing, and quote names with mixed case or special characters: `"TaBlE-stage"`. Input table names are typically lowercase unless quoted. + +## Working with data types + +With **Infer input table data types** disabled, all input columns load as `VARCHAR` and you must cast explicitly: + +```sql +CREATE TABLE "result" AS +SELECT + CAST("amount" AS DECIMAL) AS "amount", + CAST("created_at" AS TIMESTAMP) AS "created_at" +FROM "source"; +``` + +With inference enabled, DuckDB assigns the correct types and you can use them directly. + +## SQL extensions + +DuckDB adds quality-of-life SQL extensions: + +```sql +-- GROUP BY ALL: group by all non-aggregated columns +SELECT product, category, SUM(sales) FROM orders GROUP BY ALL; + +-- EXCLUDE: select all columns except some +SELECT * EXCLUDE (password, ssn, credit_card) FROM users; + +-- ASOF JOIN: match nearest (e.g. time-series where timestamps don't align) +SELECT s.player_id, s.score, w.temperature +FROM scores s +ASOF JOIN weather w ON s.score_time >= w.timestamp; + +-- SUMMARIZE: quick profiling (min, max, null %, unique counts) +SUMMARIZE SELECT * FROM my_table; +``` + +## Query optimization + +- **Filter and project early** — apply `WHERE` at the source and select only the columns you need, to reduce scanned data. 
+- **Use `EXPLAIN`** — prefix a query with `EXPLAIN` to see the execution plan and find expensive operations. + +```sql +EXPLAIN SELECT product_category, SUM(price) AS total_revenue +FROM sales +WHERE sale_date >= '2023-01-01' +GROUP BY product_category +ORDER BY total_revenue DESC; +``` + +## Memory management for large datasets + +For datasets larger than 10 GB, configure on-disk processing with `PRAGMA` settings: + +```sql +PRAGMA memory_limit='8GB'; +PRAGMA temp_directory='/tmp/duckdb_temp'; +PRAGMA threads=4; +PRAGMA enable_object_cache; +``` From 92a67c830b30918a305c97d5c6226d60ee2afbf5 Mon Sep 17 00:00:00 2001 From: Nikita Date: Wed, 24 Jun 2026 19:36:38 +0200 Subject: [PATCH 04/15] PRDCT-376: wire Group 1 (bigquery, duckdb) splits into the sidebar Co-Authored-By: Claude Opus 4.8 --- _data/navigation.yml | 14 ++++++++++++++ src/sidebar.mjs | 22 ++++++++++++++++++++-- 2 files changed, 34 insertions(+), 2 deletions(-) diff --git a/_data/navigation.yml b/_data/navigation.yml index 53767b949..96a365bfc 100644 --- a/_data/navigation.yml +++ b/_data/navigation.yml @@ -633,9 +633,23 @@ items: - url: /transformations/bigquery/ title: BigQuery Transformations + items: + - url: /transformations/bigquery/how-to/ + title: How do I run a BigQuery transformation? + - url: /transformations/bigquery/reference/ + title: Reference - url: /transformations/duckdb/ title: DuckDB Transformations + items: + - url: /transformations/duckdb/how-to/ + title: How do I run a DuckDB transformation? 
+ - url: /transformations/duckdb/reference/ + title: Reference + - url: /transformations/duckdb/explanation/ + title: When to use it + - url: /transformations/duckdb/snowflake-migration/ + title: Snowflake to DuckDB Migration - url: /transformations/oracle/ title: Oracle Transformations diff --git a/src/sidebar.mjs b/src/sidebar.mjs index 8543b6c15..bec2ed0ac 100644 --- a/src/sidebar.mjs +++ b/src/sidebar.mjs @@ -451,8 +451,26 @@ export const sidebar = [ { slug: "transformations/snowflake-plain/explanation" }, ], }, - { slug: "transformations/bigquery" }, - { slug: "transformations/duckdb" }, + { + label: "BigQuery Transformations", + collapsed: true, + items: [ + { label: "Overview", slug: "transformations/bigquery" }, + { slug: "transformations/bigquery/how-to" }, + { slug: "transformations/bigquery/reference" }, + ], + }, + { + label: "DuckDB Transformations", + collapsed: true, + items: [ + { label: "Overview", slug: "transformations/duckdb" }, + { slug: "transformations/duckdb/how-to" }, + { slug: "transformations/duckdb/reference" }, + { slug: "transformations/duckdb/explanation" }, + { slug: "transformations/duckdb/snowflake-migration" }, + ], + }, { slug: "transformations/oracle" }, { slug: "transformations/code-patterns" }, ], From c3a0db611769b186f41f1e487e0c59644c2e7166 Mon Sep 17 00:00:00 2001 From: Nikita Date: Wed, 24 Jun 2026 20:07:21 +0200 Subject: [PATCH 05/15] PRDCT-376: split Python transformation into how-to + reference MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Block 0: how-to + reference + tutorial. Mapped to how-to + reference + hub — the tutorial/dev-walkthrough facet folded into the how-to ("Develop and debug"), since type is constrained to how-to|reference|explanation. Kept the plaintext- secrets warning prominent. README is the SoT (not in repo) → limits/versions/ preinstalled packages flagged TODO(human-review). 
Co-Authored-By: Claude Opus 4.8 --- .../transformations/python-plain/how-to.md | 83 ++++++ .../transformations/python-plain/index.md | 243 +----------------- .../transformations/python-plain/reference.md | 140 ++++++++++ 3 files changed, 237 insertions(+), 229 deletions(-) create mode 100644 src/content/docs/transformations/python-plain/how-to.md create mode 100644 src/content/docs/transformations/python-plain/reference.md diff --git a/src/content/docs/transformations/python-plain/how-to.md b/src/content/docs/transformations/python-plain/how-to.md new file mode 100644 index 000000000..ab1761b93 --- /dev/null +++ b/src/content/docs/transformations/python-plain/how-to.md @@ -0,0 +1,83 @@ +--- +title: How do I run a Python transformation? +slug: 'transformations/python-plain/how-to' +description: Create, run, and develop a Python transformation in Keboola — write the script, map input and output CSV files, run it, confirm the result, and debug it in a workspace or locally. +keywords: + - run a Python transformation + - create Python transformation + - Python transformation example + - develop Python transformation locally + - Python transformation workspace +type: how-to +--- + +You want to process data with Python where SQL is awkward. A Python transformation reads your mapped input tables as CSV files, runs your script, and writes CSV outputs back to [Storage](/storage/tables/). This page gets you from nothing to a successful run, then shows how to develop and debug. For limits, file paths, and packages, see the [reference](/transformations/python-plain/reference/). + +:::caution[Never put credentials in transformation code] +Python transformations have **no facility for encrypting secrets**. Any API key, password, token, or connection string in the code is stored as **plaintext** in the configuration — readable by anyone with project access and included when features like AI **Generate description** process the configuration. 
Store secrets in the [Custom Python](/components/applications/custom-python/) application instead, where any parameter whose key starts with `#` is [encrypted](https://developers.keboola.com/overview/encryption/) and exposed to your code as an environment variable. +::: + +**Time:** ~10 minutes · **You will need:** a Keboola project and one table in [Storage](/storage/tables/) (or the [sample CSV file](/transformations/source.csv)). + +## Step 1 — Create the transformation + +1. Open **Components → Transformations**, click **New Transformation**, and choose **Python Transformation**. +2. Name it and confirm. + +## Step 2 — Map input and output + +1. Upload the [sample CSV file](/transformations/source.csv) to Storage as a table. +2. In **Input Mapping**, add it and set its **Destination** to `source` (the script reads `in/tables/source.csv`). +3. In **Output Mapping**, map `result.csv` (produced by the script) to a new Storage table, for example `out.c-main.result`. + +## Step 3 — Write the script + +Paste a script that reads `in/tables/source.csv` and writes `out/tables/result.csv`: + +```python +import csv + +with open('in/tables/source.csv', mode='rt', encoding='utf-8') as in_file, open('out/tables/result.csv', mode='wt', encoding='utf-8') as out_file: + reader = csv.DictReader((line.replace('\0', '') for line in in_file), dialect='kbc') + writer = csv.DictWriter(out_file, dialect='kbc', fieldnames=['col1', 'col2']) + writer.writeheader() + for row in reader: + writer.writerow({'col1': row['first'] + 'ping', 'col2': int(row['second']) * 42}) +``` + +See the [reference](/transformations/python-plain/reference/#reading-and-writing-csv) for list-based and explicit-format variants. You can split the script into [blocks](/transformations/#writing-scripts). + +## Step 4 — Run it and confirm the result + +1. Click **Run**. +2. Wait for the [job](/management/jobs/) to finish with a success status. +3. 
Open **Storage**, find your output table, and confirm `col1` has the `ping` suffix and `col2` is `second × 42`. + +## Develop and debug + +The fastest way to iterate is a [Python workspace](/workspace/) (JupyterLab) with the same input mapping: + +1. Configure input (and optionally output) mapping, then **Load Data** and **Connect** to the workspace. +2. Paste your script into the notebook — the `in/`/`out/` directory structure and input files are already prepared. +3. Run it; optionally **Unload Data** to push results to Storage, or **Create Transformation** to scaffold a transformation with the same mapping. + +To develop **locally**, [install Python](https://www.python.org/downloads/) and recreate the directory structure (`in/tables/`, `out/tables/`) with your input files. A ready example is in [data.zip](/transformations/python-plain/data.zip); the same script then runs unchanged as a transformation. For an exact environment, use the [Keboola Docker image](https://developers.keboola.com/extend/docker/running/#running-transformations). + +## Make it faster (backend size) + +For large data, raise the **Backend size** in the configuration (XSmall → Small → Medium → Large); see [backend sizes](/transformations/python-plain/reference/#backend-sizes-dynamic-backends). This affects [time-credit consumption](/management/project/limits/#project-power--time-credits). + +## Troubleshooting + +| Symptom | Likely cause | Fix | +|---|---|---| +| `FileNotFoundError` on `in/tables/source.csv` | Input mapping destination doesn't match the path in the script | Set the input **Destination** to `source` (or change the path in the script). | +| Output table empty / not created | Output mapping **Source** doesn't match the file the script writes | Map `result.csv` (the file your script writes to `out/tables/`). | +| `IndentationError` / `TabError` | Mixed tabs and spaces | Use consistent indentation; Python is indentation-sensitive. 
| +| A defined `main()` never runs | Wrapped in `if __name__ == '__main__':` | Call `main()` directly instead. | + +## Related + +- [Python transformation reference](/transformations/python-plain/reference/) — limits, file paths, packages, CSV. +- [Custom Python application](/components/applications/custom-python/) — for code that needs encrypted secrets. +- [Workspaces](/workspace/) · [Input and output mapping](/transformations/mappings/). diff --git a/src/content/docs/transformations/python-plain/index.md b/src/content/docs/transformations/python-plain/index.md index 4650b2d9e..13bac92c9 100644 --- a/src/content/docs/transformations/python-plain/index.md +++ b/src/content/docs/transformations/python-plain/index.md @@ -1,240 +1,25 @@ --- title: Python Transformation slug: 'transformations/python-plain' +description: Process data with Python in Keboola where SQL is awkward. Start here, then jump to the how-to or the reference. +keywords: + - Python transformation + - Python transformations + - Python transformation Keboola +type: explanation redirect_from: - /transformations/python/ --- +[Python](https://www.python.org/about/) transformations complement SQL transformations where computations or other operations are too difficult — though common operations like joining, sorting, or grouping are still easier and faster in [SQL transformations](/transformations/#backends). Keboola maps your input tables in as CSV files, runs your script, and writes CSV outputs back to [Storage](/storage/tables/). +:::caution[Never put credentials in transformation code] +Python transformation code is stored as **plaintext** and is not encrypted. Store secrets in the [Custom Python](/components/applications/custom-python/) application instead — see the [how-to](/transformations/python-plain/how-to/) for details. +::: -[Python](https://www.python.org/about/) transformations complement SQL transformations where computations or -other operations are too difficult. 
Common data operations like joining, sorting, or grouping are still easier and -faster to do in [SQL Transformations](/transformations/#backends). +This page is split by what you need: -***Warning:** Python transformations have **no facility for encrypting secrets**. Any credential you place in transformation code — API keys, passwords, tokens, connection strings — is stored as **plaintext** in the configuration. It is not encrypted at rest, it is readable by anyone with access to the project's configuration, and it is included when the configuration is processed by features such as the AI **Generate description**. **Do not put credentials in transformation code.** Instead, store them in the [Custom Python](/components/applications/custom-python/) application, where any parameter whose key starts with `#` is [encrypted](https://developers.keboola.com/overview/encryption/) and made available to your code as an environment variable at runtime.* +- **[How do I run a Python transformation?](/transformations/python-plain/how-to/)** — create, run, and develop/debug one end to end. +- **[Python transformation reference](/transformations/python-plain/reference/)** — runtime limits, file locations, packages, and CSV reading/writing. -## Environment -The Python script is running in an isolated [environment](https://developers.keboola.com/extend/#component). -The Python version is updated regularly, few weeks after the official release. The update is always announced on the -[status page](https://keboolastatus.com/). - -### Memory and Processing Constraints -A Python transformation has a limit of 8GB of allocated memory and the maximum running time is 6 hours. -The CPU is limited to the **equivalent** of two 2.3 GHz processors. - -### File Locations -The Python script itself will be compiled to `/data/script.py`. 
To access your -[mapped input and output](/transformations/mappings/) tables, use -relative (`in/tables/file.csv`, `out/tables/file.csv`) or absolute (`/data/in/tables/file.csv`, `/data/out/tables/file.csv`) paths. -To access downloaded files, use the `in/files/` or `/data/in/files/` path. If you want to dig really deep, -have a look at the [full Common Interface specification](https://developers.keboola.com/extend/common-interface/). -Temporary files can be written to a `/tmp/` folder. Do not use the `/data/` folder for those files you do not wish to exchange with Keboola. - -## Python Script Requirements -Python is **sensitive to indentation**. Make sure not to mix tabs and spaces. All files are assumed to be in UTF; -`# coding=utf-8` at the beginning of the script is not needed. You don't need to have -any main function, e.g., this is a valid script: - -```python -print("Hello Keboola") -``` - -If you define a main function, do not wrap it within the `if __name__ == '__main__':` block as it will not be run. -Simply calling it from within the script is enough: - -```python -def main(): - print("Hello Keboola") - -main() -``` - -You can organize the script into [blocks](/transformations/#writing-scripts). - -### Packages -You can list extra packages in the UI. These packages are installed using [pip](https://pypi.org/project/pip/). -Generally, any package available on [PyPI](https://pypi.org/) can be installed. However, some packages have external dependencies, -which might not be available. Feel free to [contact us](/management/support/) if you run into problems. When the -package is installed, you still need to `import` from it. - -![Screenshot - Package Configuration](/transformations/python-plain/packages.png) - -The latest versions of packages are always installed at the time of the release (you can check that -[in the repository](https://github.com/keboola/docker-custom-python/releases)). 
In case your code relies on a specific package version, -you can override the installed version by calling, e.g.: - -```python -import subprocess -import sys -subprocess.call([sys.executable, '-m', 'pip', 'install', '--disable-pip-version-check', '--no-cache-dir', '--force-reinstall', 'pandas==0.20.0']) -``` - -Some packages are already installed in the environment -(see [their full list](https://github.com/keboola/docker-custom-python/blob/master/python-3.10/Dockerfile#L47)), -and they do not need to be listed in the transformation. - -### CSV Format -Tables from Storage are imported to the Python script from CSV files. CSV files can be read by standard Python functions -from the [CSV packages](https://docs.python.org/3/library/csv.html). It is recommended to explicitly specify the formatting options. -You can read CSV files either to vectors (numbered columns), or to dictionaries (named columns). -Your input tables are stored as CSV files in `in/tables/`, and your output tables in `out/tables/`. - -If you can process the file line-by-line, then the most effective way is to read each line, process it and write -it immediately. The following two examples show two ways of reading and manipulating a CSV file. - -## Dynamic Backends -If you have a large amount of data in databases and complex queries, your transformation might run for a couple of hours. -To speed it up, you can change the backend size in the configuration. Python transformations suport the following sizes: -- XSmall -- Small _(default)_ -- Medium -- Large - -![Screenshot - Backend size configuration](/transformations/python-plain/backend-size.png) - -Scaling up the backend size allocates more resources to speed up your transformation, which impacts [time credits consumption](/management/project/limits/#project-power--time-credits). 
- -***Note:** Dynamic backends are not available to you if you are on the [Free Plan (Pay As You Go)](/management/payg-project/).* - - -## Development Tutorial -To develop and debug Python transformations, you can use a [Python workspace](/workspace/) or -you can develop the transformation script locally. - -![Screenshot - Data folder structure](/transformations/python-plain/tree.png) - -The script itself is expected to be in the `data` directory. The script name is arbitrary. The `data` directory name -is also arbitrary, we use it as general reference to the above folder structure. It is possible to use relative -directories --- the current directory of the transformation is always the `data` directory. That means you can move -the script to a Keboola transformation with no changes. To develop a Python transformation -that takes a [sample CSV file](/transformations/python-plain/source.csv) locally, follow these steps: - -- Put the Python code into a file, for example, script.py in the working directory. -- Put all the input mapping tables inside the `in/tables` subdirectory of the working directory. -- Store the result CSV files inside the `out/tables` subdirectory. 
- -Use this sample script: - -```python -import csv - -csvlt = '\n' -csvdel = ',' -csvquo = '"' -with open('in/tables/source.csv', mode='rt', encoding='utf-8') as in_file, open('out/tables/result.csv', mode='wt', encoding='utf-8') as out_file: - writer = csv.DictWriter(out_file, fieldnames=['col1', 'col2'], lineterminator=csvlt, delimiter=csvdel, quotechar=csvquo) - writer.writeheader() - - lazy_lines = (line.replace('\0', '') for line in in_file) - reader = csv.DictReader(lazy_lines, lineterminator=csvlt, delimiter=csvdel, quotechar=csvquo) - for row in reader: - # do something and write row - writer.writerow({'col1': row['first'] + 'ping', 'col2': int(row['second']) * 42}) -``` - -### Development Using Workspace -To develop a transformation using a [workspace](/workspace/), configure the input (and optionally output) mapping. -**Load Data** and **Connect** to the workspace: - -![Screenshot - Input & Output](/transformations/python-plain/input-output.png) - -When you connect to the workspace, you can paste the above sample script in the prepared notebook. -The directory structure and input files will be already prepared in the JupyterLab environment: - -![Screenshot - Workspace](/transformations/python-plain/workspace.png) - -You can run the script and, optionally, also **Unload Data** from the workspace to get the results -immediately into project Storage. You can also **Create Transformation** to prepare a transformation -skeleton with the configured input and output mapping into which you can paste the transformation script. - -![Screenshot - Create Transformation](/transformations/python-plain/create-transformation.png) - -### Local Development -If you want to replicate the execution environment on your local machine, you need to have -[Python installed](https://www.python.org/downloads/). - -To simulate the input and output mapping, all you need to do is create the right directories with the right files. 
-You can get a finished example of the [above script](/transformations/python-plain/#development-tutorial) -setup in [data.zip](/transformations/python-plain/data.zip). -Download it and test the script in your local Python installation. The `result.csv` output file will be created -in the output folder. This script can be used in your transformations without any modifications. All you need to do is - -- Create a table in Storage by uploading the [sample CSV file](/transformations/source.csv). -- Create an input mapping from that table, setting its destination to `source` (as expected by the Python script). -- Create an output mapping from `result.csv` (produced by the Python script) to a new table in your Storage, -- Copy & paste the script into the transformation code. -- Save and run the transformation. - -![Screenshot - Sample Input Output Mapping](/transformations/python-plain/sample-io.png) - -The above steps are usually sufficient for daily development and debugging of moderately complex Python transformations, -although they do not reproduce the transformation execution environment exactly. You can also create a development environment -with the exact same configuration using [our Docker image](https://developers.keboola.com/extend/docker/running/#running-transformations). - -## Example 1 -- Using Dictionaries -The following piece of code reads a table with two columns, named **first** and **second**, -from the **source.csv** input mapping file into the `row` dictionary using `csvReader`. -It then adds *ping* to the first column and multiplies the second column by *42*. -After that, it saves the row to the **result.csv** output mapping file. 
- -```python -import csv - -csvlt = '\n' -csvdel = ',' -csvquo = '"' -with open('in/tables/source.csv', mode='rt', encoding='utf-8') as in_file, open('out/tables/result.csv', mode='wt', encoding='utf-8') as out_file: - writer = csv.DictWriter(out_file, fieldnames=['col1', 'col2'], lineterminator=csvlt, delimiter=csvdel, quotechar=csvquo) - writer.writeheader() - - lazy_lines = (line.replace('\0', '') for line in in_file) - reader = csv.DictReader(lazy_lines, lineterminator=csvlt, delimiter=csvdel, quotechar=csvquo) - for row in reader: - # do something and write row - writer.writerow({'col1': row['first'] + 'ping', 'col2': int(row['second']) * 42}) -``` - -The above example shows how to process the file line-by-line; this is the most memory-efficient way -which allows you to process data files of any size. -The expression `lazy_lines = (line.replace('\0', '') for line in in_file)` -is a [Generator](https://wiki.python.org/moin/Generators) which makes sure that -[Null characters](https://en.wikipedia.org/wiki/Null_character) are properly handled. -It is also important to use `encoding='utf-8'` when reading and writing files. - -## Example 2 -- Using Lists -The following piece of code reads a table with some of its columns from the **source.csv** input mapping file into the `row` list of strings. -It then adds *ping* to the first column and multiplies the second column by *42*. After that, it saves the row to the **result.csv** output mapping file. 
- -```python -import csv - -csvlt = '\n' -csvdel = ',' -csvquo = '"' -with open('/data/in/tables/source.csv', mode='rt', encoding='utf-8') as in_file, open('/data/out/tables/result.csv', mode='wt', encoding='utf-8') as out_file: - writer = csv.writer(out_file, lineterminator=csvlt, delimiter=csvdel, quotechar=csvquo) - lazy_lines = (line.replace('\0', '') for line in in_file) - reader = csv.reader(lazy_lines, lineterminator=csvlt, delimiter=csvdel, quotechar=csvquo) - for row in reader: - # do something and write row - writer.writerow([row[0] + 'ping', int(row[1]) * 42]) -``` - -## Example 3 -- Using CSV Dialect -You can simplify the above code using our pre-installed Keboola dialect. - -```python -import csv - -with open('/data/in/tables/source.csv', mode='rt', encoding='utf-8') as in_file, open('/data/out/tables/result.csv', mode='wt', encoding='utf-8') as out_file: - lazy_lines = (line.replace('\0', '') for line in in_file) - reader = csv.DictReader(lazy_lines, dialect='kbc') - writer = csv.DictWriter(out_file, dialect='kbc', fieldnames=reader.fieldnames) - writer.writeheader() - for row in reader: - # do something and write row - writer.writerow({"first": row['first'] + 'ping', "second": int(row['second']) * 42}) -``` - -The `kbc` dialect is automatically available in the transformation environment. If you want it in your local environment, -it is defined as `csv.register_dialect('kbc', lineterminator='\n', delimiter = ',', quotechar = '"')`. +New to transformations? Start with [Transformations](/transformations/) and the [Getting Started tutorial](/tutorial/manipulate/). 
diff --git a/src/content/docs/transformations/python-plain/reference.md b/src/content/docs/transformations/python-plain/reference.md new file mode 100644 index 000000000..6983999bc --- /dev/null +++ b/src/content/docs/transformations/python-plain/reference.md @@ -0,0 +1,140 @@ +--- +title: Python transformation reference +slug: 'transformations/python-plain/reference' +description: Lookup reference for Python transformations in Keboola — runtime environment and limits, file locations, script requirements, installing packages, reading and writing CSV, and backend sizes. +keywords: + - Python transformation limits + - Python transformation memory + - Python transformation packages + - Python transformation file paths + - Python transformation CSV + - Python transformation backend size +type: reference +--- + +Reference material for [Python transformations](/transformations/python-plain/). To create and run one, see the [how-to](/transformations/python-plain/how-to/). + + + +## Environment + +The Python script runs in an isolated [environment](https://developers.keboola.com/extend/#component). The Python version is updated regularly, a few weeks after the official release; updates are announced on the [status page](https://keboolastatus.com/). + +### Limits + +| Resource | Limit | +|---|---| +| Memory | 8 GB | +| Max running time | 6 hours | +| CPU | Equivalent of two 2.3 GHz processors | + +### File locations + +- The script is compiled to `/data/script.py`. +- Mapped input/output tables: relative `in/tables/file.csv`, `out/tables/file.csv` or absolute `/data/in/tables/file.csv`, `/data/out/tables/file.csv`. +- Downloaded files: `in/files/` (or `/data/in/files/`). +- Temporary files: `/tmp/`. Do **not** use `/data/` for files you don't want exchanged with Keboola. + +See the full [Common Interface specification](https://developers.keboola.com/extend/common-interface/). + +## Script requirements + +Python is **sensitive to indentation** — do not mix tabs and spaces. 
Files are assumed UTF-8 (`# coding=utf-8` is not needed). No main function is required: + +```python +print("Hello Keboola") +``` + +If you define a main function, do **not** wrap it in `if __name__ == '__main__':` (it will not run) — just call it: + +```python +def main(): + print("Hello Keboola") + +main() +``` + +You can organize the script into [blocks](/transformations/#writing-scripts). + +## Packages + +List extra packages in the UI; they are installed with [pip](https://pypi.org/project/pip/) from [PyPI](https://pypi.org/). Some packages have external dependencies that may not be available — [contact support](/management/support/) if you hit problems. After install, you still need to `import` them. + +The latest versions are installed at release time. To pin a version, force-reinstall it from your code: + +```python +import subprocess +import sys +subprocess.call([sys.executable, '-m', 'pip', 'install', '--disable-pip-version-check', '--no-cache-dir', '--force-reinstall', 'pandas==0.20.0']) +``` + +Some packages are preinstalled and don't need to be listed. + +## Reading and writing CSV + +Input tables arrive as CSV in `in/tables/`; write outputs to `out/tables/`. Read with the standard [csv module](https://docs.python.org/3/library/csv.html); specifying formatting options explicitly is recommended. Process line-by-line for memory efficiency. 
+ +**Dictionaries (named columns):** + +```python +import csv + +csvlt = '\n' +csvdel = ',' +csvquo = '"' +with open('in/tables/source.csv', mode='rt', encoding='utf-8') as in_file, open('out/tables/result.csv', mode='wt', encoding='utf-8') as out_file: + writer = csv.DictWriter(out_file, fieldnames=['col1', 'col2'], lineterminator=csvlt, delimiter=csvdel, quotechar=csvquo) + writer.writeheader() + + lazy_lines = (line.replace('\0', '') for line in in_file) + reader = csv.DictReader(lazy_lines, lineterminator=csvlt, delimiter=csvdel, quotechar=csvquo) + for row in reader: + writer.writerow({'col1': row['first'] + 'ping', 'col2': int(row['second']) * 42}) +``` + +The generator `lazy_lines = (line.replace('\0', '') for line in in_file)` strips [null characters](https://en.wikipedia.org/wiki/Null_character). Always use `encoding='utf-8'`. + +**Lists (numbered columns):** + +```python +import csv + +with open('/data/in/tables/source.csv', mode='rt', encoding='utf-8') as in_file, open('/data/out/tables/result.csv', mode='wt', encoding='utf-8') as out_file: + writer = csv.writer(out_file, lineterminator='\n', delimiter=',', quotechar='"') + lazy_lines = (line.replace('\0', '') for line in in_file) + reader = csv.reader(lazy_lines, lineterminator='\n', delimiter=',', quotechar='"') + for row in reader: + writer.writerow([row[0] + 'ping', int(row[1]) * 42]) +``` + +**Preinstalled `kbc` dialect** (simplifies the format options): + +```python +import csv + +with open('/data/in/tables/source.csv', mode='rt', encoding='utf-8') as in_file, open('/data/out/tables/result.csv', mode='wt', encoding='utf-8') as out_file: + lazy_lines = (line.replace('\0', '') for line in in_file) + reader = csv.DictReader(lazy_lines, dialect='kbc') + writer = csv.DictWriter(out_file, dialect='kbc', fieldnames=reader.fieldnames) + writer.writeheader() + for row in reader: + writer.writerow({"first": row['first'] + 'ping', "second": int(row['second']) * 42}) +``` + +To register the `kbc` dialect 
locally: `csv.register_dialect('kbc', lineterminator='\n', delimiter=',', quotechar='"')`. + +## Backend sizes (dynamic backends) + +A larger backend allocates more resources for long or heavy transformations. Available sizes: + +| Size | | +|---|---| +| XSmall | | +| Small | Default | +| Medium | | +| Large | | + +Scaling up impacts [time-credit consumption](/management/project/limits/#project-power--time-credits). Dynamic backends are **not** available on the [Free Plan (Pay As You Go)](/management/payg-project/). From b7ce986d73f7c2adc05ba6450499ee1d16a368a8 Mon Sep 17 00:00:00 2001 From: Nikita Date: Wed, 24 Jun 2026 20:07:21 +0200 Subject: [PATCH 06/15] PRDCT-376: split R transformation into how-to + reference; bump R 4.0.5 -> 4.4.1 MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Block 0: how-to + reference + tutorial (tutorial folded into how-to). Updated the documented R version to 4.4.1 as instructed. README SoT (not in repo) → limits/versions flagged TODO(human-review). Example sub-pages (array-splitter, plots, binary) left untouched (Group 0). Co-Authored-By: Claude Opus 4.8 --- .../docs/transformations/r-plain/how-to.md | 76 +++++++ .../docs/transformations/r-plain/index.md | 197 ++---------------- .../docs/transformations/r-plain/reference.md | 105 ++++++++++ 3 files changed, 193 insertions(+), 185 deletions(-) create mode 100644 src/content/docs/transformations/r-plain/how-to.md create mode 100644 src/content/docs/transformations/r-plain/reference.md diff --git a/src/content/docs/transformations/r-plain/how-to.md b/src/content/docs/transformations/r-plain/how-to.md new file mode 100644 index 000000000..a55cadeaf --- /dev/null +++ b/src/content/docs/transformations/r-plain/how-to.md @@ -0,0 +1,76 @@ +--- +title: How do I run an R transformation?
+slug: 'transformations/r-plain/how-to' +description: Create, run, and develop an R transformation in Keboola — write the script, map input and output CSV files, run it, confirm the result, and debug it in a workspace or locally. +keywords: + - run an R transformation + - create R transformation + - R transformation example + - develop R transformation locally + - R transformation workspace +type: how-to +--- + +You want to run advanced statistical or vector/matrix computations with R. An R transformation reads your mapped input tables as CSV files, runs your script, and writes CSV outputs back to [Storage](/storage/tables/). This page gets you from nothing to a successful run, then shows how to develop and debug. For limits, packages, and CSV rules, see the [reference](/transformations/r-plain/reference/). + +**Time:** ~10 minutes · **You will need:** a Keboola project and one table in [Storage](/storage/tables/) (or the [sample CSV file](/transformations/r-plain/source.csv)). + +## Step 1 — Create the transformation + +1. Open **Components → Transformations**, click **New Transformation**, and choose **R Transformation**. +2. Name it and confirm. + +## Step 2 — Map input and output + +1. Upload the [sample CSV file](/transformations/r-plain/source.csv) to Storage as a table. +2. In **Input Mapping**, add it and set its **Destination** to `source` (the script reads `in/tables/source.csv`). +3. In **Output Mapping**, map `result.csv` to a new Storage table, for example `out.c-main.result`. + +## Step 3 — Write the script + +```r +data <- read.csv(file = "in/tables/source.csv") + +df <- data.frame( + col1 = paste0(data$first, 'ping'), + col2 = data$second * 42 +) +write.csv(df, file = "out/tables/result.csv", row.names = FALSE) +``` + +Always write outputs with `row.names = FALSE` — see [row index in output tables](/transformations/r-plain/reference/#row-index-in-output-tables). You can split the script into [blocks](/transformations/#writing-scripts). 
+ +## Step 4 — Run it and confirm the result + +1. Click **Run**. +2. Wait for the [job](/management/jobs/) to finish with a success status. +3. Open **Storage**, find your output table, and confirm `col1` has the `ping` suffix and `col2` is `second × 42`. + +## Develop and debug + +The fastest way to iterate is an [R workspace](/workspace/) with the same input mapping. While developing, read fewer rows to catch issues quickly: + +```r +mydata <- read.csv("in/tables/mydata", nrows=500) +``` + +To develop **locally**, [install R](https://cloud.r-project.org/) (preferably the [same version as Keboola](/transformations/r-plain/reference/#environment)) and recreate the directory structure (`in/tables/`, `out/tables/`, and `in/user/` for extension-less binary files) with your input files. A ready example is in [data.zip](/transformations/r-plain/data.zip); the same script then runs unchanged as a transformation. For an exact environment, use the [Keboola Docker image](https://developers.keboola.com/extend/docker/running/#running-transformations). + +## Make it faster (backend size) + +For large data, raise the **Backend size** in the configuration (XSmall → Small → Medium → Large); see [backend sizes](/transformations/r-plain/reference/#backend-sizes-dynamic-backends). This affects [time-credit consumption](/management/project/limits/#project-power--time-credits). + +## Troubleshooting + +| Symptom | Likely cause | Fix | +|---|---|---| +| Transformation fails on a harmless warning | Warnings are converted to errors | Fix the cause, or wrap the code in `tryCatch(..., warning = function(w) {})` ([reference](/transformations/r-plain/reference/#warnings-are-errors)). | +| Extra unnamed column / import fails | Row index written to the CSV | Write with `row.names = FALSE`. | +| `cannot open file 'in/tables/source.csv'` | Input destination doesn't match the path | Set the input **Destination** to `source`. 
| +| Output table empty / not created | Output **Source** doesn't match the file the script writes | Map `result.csv`. | + +## Related + +- [R transformation reference](/transformations/r-plain/reference/) — limits, packages, CSV, logging. +- In-depth examples: [array splitting](/transformations/r-plain/array-splitter/) · [charts & graphs](/transformations/r-plain/plots/) · [binary files](/transformations/r-plain/binary/). +- [Workspaces](/workspace/) · [Input and output mapping](/transformations/mappings/). diff --git a/src/content/docs/transformations/r-plain/index.md b/src/content/docs/transformations/r-plain/index.md index 3458e3844..661b75027 100644 --- a/src/content/docs/transformations/r-plain/index.md +++ b/src/content/docs/transformations/r-plain/index.md @@ -1,196 +1,23 @@ --- title: R Transformation slug: 'transformations/r-plain' +description: Run advanced statistical computations with R in Keboola. Start here, then jump to the how-to or the reference. +keywords: + - R transformation + - R transformations + - R transformation Keboola +type: explanation redirect_from: - /transformations/r/ --- +[R](https://cran.r-project.org/) transformations are for **advanced statistical computations** — ready-made algorithms and vector/matrix math — complementing Python and SQL where those are too difficult. Common operations like joining, sorting, or grouping are still easier and faster in [SQL transformations](/transformations/#backends). Keboola maps your input tables in as CSV, runs your script, and writes CSV outputs back to [Storage](/storage/tables/). +This page is split by what you need: -[R](https://cran.r-project.org/) transformations are for **advanced statistical computations**. -Apart from ready-to-use implementations of state-of-the-art algorithms, R's other great assets are vector and matrix -computations. R transformations complement Python and SQL transformations where computations or -other operations are too difficult. 
Common data operations like joining, sorting, or grouping are still easier and -faster to do in [SQL Transformations](/transformations/#backends). +- **[How do I run an R transformation?](/transformations/r-plain/how-to/)** — create, run, and develop/debug one end to end. +- **[R transformation reference](/transformations/r-plain/reference/)** — runtime version and limits, packages, CSV format, warnings-as-errors, and logging. -## Environment -The R script is executed in an isolated [environment](https://developers.keboola.com/extend/#component). -The current R version is **4.0.5**, however it is possible to switch your configuration to run on other versions if available. +In-depth examples: [array splitting](/transformations/r-plain/array-splitter/) · [charts & graphs](/transformations/r-plain/plots/) · [binary files](/transformations/r-plain/binary/). -![Screenshot - Change Backend](/transformations/r-plain/change-backend.png) - -![Screenshot - Version List](/transformations/r-plain/version-list.png) - -Any updates to the R version is always announced in the Keboola [changelog](https://changelog.keboola.com/). - -### Memory and Processing Constraints -An R transformation has a limit of 16GB of allocated memory and the maximum running time is 6 hours. -The CPU is limited to the **equivalent** of two 2.3 GHz processors. - -### File Locations -The R script itself will be compiled to `/data/script.R`. To access your -[mapped input and output](/transformations/mappings/) tables, use -relative (`in/tables/file.csv`, `out/tables/file.csv`) or absolute (`/data/in/tables/file.csv`, `/data/out/tables/file.csv`) paths. -To access downloaded files, use the `in/files/` or `/data/in/files/` path. If you want to dig really deep, -have a look at the [full Common Interface specification](https://developers.keboola.com/extend/common-interface/). -Temporary files can be written to a `/tmp/` folder. 
Do not use the `/data/` folder for those files you do not wish to exchange with Keboola. - -## R Script Requirements -You can organize your script into [blocks](/transformations/#writing-scripts), but the resulting R script to be run -within our environment must meet the following requirements: - -### Packages -The R transformation can use any package available on -[CRAN](https://cloud.r-project.org/web/packages/available_packages_by_name.html). In order for a package and -its dependencies to be automatically loaded and installed, list its name in the package section. Using `library()` -for loading is not necessary then. - -The latest versions of packages are always installed. Some packages are pre-installed in the environment -(see [list](https://github.com/keboola/docker-custom-r/blob/master/init.R#L3)). These pre-installed packages are -installed with their dependencies, therefore to get an authoritative list of installed packages use the `installed. -packages()` function. It does no harm if you add one of these packages to your transformation explicitly, but the transformation and -sandbox startup will be slowed by the forced re-installation. - -### CSV Format -Tables from Storage are imported to the R script from CSV files. The CSV files can be read by standard R functions. -Generally, the table can be read with default R settings. In case R gets confused, use the exact format -specification `sep=",", quote="\""`. For example: - -```r -data <- read.csv("in/tables/in.csv", sep=",", quote="\"") -``` - -### Row Index in Output Tables -Do not include the row index in the output table (use `row.names=FALSE`). If you are using the -[readr package](https://cran.r-project.org/web/packages/readr/readr.pdf), you can also use the `write_csv` function -which doesn't write row names. - -```r -write.csv(data, file="out/tables/out.csv", row.names=FALSE) -``` - -The row index produces a new unnamed column in the CSV file which cannot be imported to [Storage](/storage/). 
-If the row names contain valuable data, and you want to keep them, you have to convert them to a separate column first. - -```r -df <- data.frame(first = c('a', 'b'), second = c('x', 'y')) -data <- cbind(rownames(df), df) -write.csv(data, file="/data/out/tables/out.csv", row.names=FALSE) -``` - -### Errors and Warnings -We have set up our environment to be a little zealous; all warnings are converted to errors and they cause the -transformation to be unsuccessful. If you have a piece of code in your transformation which may emit warnings, -and you really want to ignore them, wrap the code in a `tryCatch` call: - -```r -tryCatch( -{ ... some code ... }, -warning = function(w) {} -) -``` - -## Dynamic Backends -If you have a large amount of data in databases and complex queries, your transformation might run for a couple of hours. -To speed it up, you can change the backend size in the configuration. R transformations suport the following sizes: -- XSmall -- Small _(default)_ -- Medium -- Large - -Scaling up the backend size allocates more resources to speed up your transformation, which impacts [time credits consumption](/management/project/limits/#project-power--time-credits). - -***Note:** Dynamic backends are not available to you if you are on the [Free Plan (Pay As You Go)](/management/payg-project/).* - -## Development Tutorial -We recommend that you create an [R Workspace](/transformations/workspace) with the same -input mapping your transformation will use. This is the fastest way to develop your transformation code. - -**Tip:** Limit the number of rows you read in from the CSV files: - -```r -mydata <- read.csv("in/tables/mydata", nrows=500) -``` - -This will help you catch annoying issues without having to process all data. - -You can also develop and debug R transformations on your local machine. -To do so, [install R](https://cloud.r-project.org/), preferably the same [version as us](#environment). 
-It is also helpful to use an IDE, such as the [Jupyter Notebook](https://jupyter.org). - -To simulate the input and output mapping, all you need to do is create the right directories with the right files. -The following image shows the directory structure: - -![Screenshot - Data folder structure](/transformations/r-plain/tree.png) - -The script itself is expected to be in the `data` directory; its name is arbitrary. It is possible to use relative directories, -so that you can move the script to a Keboola transformation with no changes. To develop an R transformation which takes -a [sample CSV file](/transformations/r-plain/source.csv) locally, take the following steps: - -- Put the R code into a file, for instance, script.R in the working directory. -- Put all tables from the input mapping inside the `in/tables` subdirectory of the working directory. -- Place the binary files (if using any) inside the `in/user` subdirectory of the working directory, and make sure - that their name has no extension. -- Store the result CSV files inside the `out/tables` subdirectory. - -Use this sample script: -```r -data <- read.csv(file = "in/tables/source.csv"); - -df <- data.frame( -col1 = paste0(data$first, 'ping'), -col2 = data$second * 42 -) -write.csv(df, file = "out/tables/result.csv", row.names = FALSE) -``` - -A complete example of the above is attached below in [data.zip](/transformations/r-plain/data.zip). -Download it and test the script in your local R installation. The `result.csv` output file will be created. -This script can be used in your transformations without any modifications. 
-All you need to do is - -- create a table in Storage by uploading the [sample CSV file](/transformations/r-plain/source.csv), -- create an input mapping from that table, setting its destination to `source` (as expected by the R script), -- create an output mapping from `result.csv` (produced by the R script) to a new table in your Storage, -- copy & paste the above script into the transformation code, and finally, -- save and run the transformation. - -### Events and Output -It is possible to output informational and debug messages from the R script simply by printing them out. -The following R script: - -```r -print('doing something') -Sys.sleep(3) -print('doing something else') -Sys.sleep(3) -write('still doing something', stdout()) -Sys.sleep(3) -write('error message', stderr()) -Sys.sleep(3) -app$logInfo("information") -Sys.sleep(3) -app$logError("error") -Sys.sleep(3) -TRUE -``` - -produces the following events in the transformation job: - -![Screenshot - Script Events](/transformations/r-plain/events.png) - -The `app$logInfo` and `app$logError` functions are also internally available; they can be useful if you need to know -the precise server time of when an event occurred. The standard event timestamp in job events is the time when the event was received -converted to the local time zone. - -### Going Further -The above steps are usually sufficient for daily development and debugging of moderately complex R transformations, -although they do not reproduce the transformation execution environment exactly. To create a development environment -with the exact same configuration as the transformation environment, use [our Docker image](https://developers.keboola.com/extend/docker/running/#running-transformations). 
- -## Examples -There are more in-depth examples dealing with - -- [array splitting](/transformations/r-plain/array-splitter/), -- [plotting charts & graphs](/transformations/r-plain/plots/), and -- [using trained models and binary files](/transformations/r-plain/binary/). +New to transformations? Start with [Transformations](/transformations/) and the [Getting Started tutorial](/tutorial/manipulate/). diff --git a/src/content/docs/transformations/r-plain/reference.md b/src/content/docs/transformations/r-plain/reference.md new file mode 100644 index 000000000..709593513 --- /dev/null +++ b/src/content/docs/transformations/r-plain/reference.md @@ -0,0 +1,105 @@ +--- +title: R transformation reference +slug: 'transformations/r-plain/reference' +description: Lookup reference for R transformations in Keboola — runtime environment and version, limits, file locations, packages, CSV format, row-index handling, warnings-as-errors, backend sizes, and logging. +keywords: + - R transformation limits + - R transformation version + - R transformation packages + - R transformation CSV + - R transformation warnings errors + - R transformation backend size +type: reference +--- + +Reference material for [R transformations](/transformations/r-plain/). To create and run one, see the [how-to](/transformations/r-plain/how-to/). + + + +## Environment + +The R script runs in an isolated [environment](https://developers.keboola.com/extend/#component). The current R version is **4.4.1**; you can switch a configuration to other available versions. Version updates are announced in the [changelog](https://changelog.keboola.com/). + +### Limits + +| Resource | Limit | +|---|---| +| Memory | 16 GB | +| Max running time | 6 hours | +| CPU | Equivalent of two 2.3 GHz processors | + +### File locations + +- The script is compiled to `/data/script.R`. +- Mapped input/output tables: relative `in/tables/file.csv`, `out/tables/file.csv` or absolute under `/data/`. 
+- Downloaded files: `in/files/` (or `/data/in/files/`). +- Temporary files: `/tmp/`. Do **not** use `/data/` for files you don't want exchanged with Keboola. + +See the full [Common Interface specification](https://developers.keboola.com/extend/common-interface/). + +## Packages + +R transformations can use any package on [CRAN](https://cloud.r-project.org/web/packages/available_packages_by_name.html). List a package's name in the package section to load and install it (with dependencies) automatically — `library()` is then not needed. The latest versions are installed. + +Some packages are preinstalled (with dependencies); for an authoritative list use `installed.packages()`. Adding a preinstalled package explicitly does no harm but slows startup due to forced re-installation. + +## CSV format + +Input tables arrive as CSV and can be read with standard R functions. If R misreads the format, specify it explicitly: + +```r +data <- read.csv("in/tables/in.csv", sep=",", quote="\"") +``` + +### Row index in output tables + +Do **not** write the row index — use `row.names=FALSE`. The row index creates an unnamed column that cannot be imported to [Storage](/storage/). + +```r +write.csv(data, file="out/tables/out.csv", row.names=FALSE) +``` + +If the row names hold real data, convert them to a column first: + +```r +df <- data.frame(first = c('a', 'b'), second = c('x', 'y')) +data <- cbind(rownames(df), df) +write.csv(data, file="/data/out/tables/out.csv", row.names=FALSE) +``` + +## Warnings are errors + +The environment converts **all warnings to errors**, which fail the transformation. To deliberately ignore warnings from a piece of code, wrap it in `tryCatch`: + +```r +tryCatch( +{ ... some code ... }, +warning = function(w) {} +) +``` + +## Logging and events + +Output informational and debug messages by printing them to stdout or stderr. 
The internally available `app$logInfo` and `app$logError` functions record the precise server time of an event (the standard job-event timestamp is when the event was received, converted to local time): + +```r +print('doing something') +write('error message', stderr()) +app$logInfo("information") +app$logError("error") +``` + +## Backend sizes (dynamic backends) + +A larger backend allocates more resources for long or heavy transformations. Available sizes: + +| Size | | +|---|---| +| XSmall | | +| Small | Default | +| Medium | | +| Large | | + +Scaling up impacts [time-credit consumption](/management/project/limits/#project-power--time-credits). Dynamic backends are **not** available on the [Free Plan (Pay As You Go)](/management/payg-project/). From 9047d5de73da7af935a6e225bdddb868f0f567ba Mon Sep 17 00:00:00 2001 From: Nikita Date: Wed, 24 Jun 2026 20:07:21 +0200 Subject: [PATCH 07/15] PRDCT-376: split Variables & Shared Code into how-to + explanation MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Block 0: explanation + how-to. Thin hub at the unchanged URL. Fixed a stale mappings anchor (#snowflake-loading-type -> #loading-type-snowflake-and-bigquery). No code SoT — no fields documented. 
Co-Authored-By: Claude Opus 4.8 --- .../transformations/variables/explanation.md | 33 +++ .../docs/transformations/variables/how-to.md | 89 +++++++ .../docs/transformations/variables/index.md | 238 +----------------- 3 files changed, 134 insertions(+), 226 deletions(-) create mode 100644 src/content/docs/transformations/variables/explanation.md create mode 100644 src/content/docs/transformations/variables/how-to.md diff --git a/src/content/docs/transformations/variables/explanation.md b/src/content/docs/transformations/variables/explanation.md new file mode 100644 index 000000000..ce4b9f669 --- /dev/null +++ b/src/content/docs/transformations/variables/explanation.md @@ -0,0 +1,33 @@ +--- +title: What are transformation variables and shared code? +slug: 'transformations/variables/explanation' +description: Understand how variables parametrize a transformation and how shared code reuses snippets across transformations in Keboola, and when to use each. +keywords: + - transformation variables + - shared code + - parametrize transformation + - reusable transformation code + - mustache variables Keboola +type: explanation +--- + +**Variables** and **shared code** make transformation code more dynamic and reusable. This page explains what they are and when to use them; to set them up, see the [how-to](/transformations/variables/how-to/). + +## Variables + +Variables let you **parametrize** a transformation — useful when several transformations differ only in a few values (for example, the same logic for the Meals department and the Drinks department). + +Transformation variables are **unrelated to the code's own variables** (they are not SQL or Python variables). They are evaluated **before** the transformation runs and apply to the **entire configuration** — all code blocks, shared code, and mapping. They are referenced with [Mustache syntax](https://scalate.github.io/scalate/documentation/mustache.html#Variables), `{{ name }}`. 
Every referenced variable must be defined, and every defined variable must have a value (a default, optionally overridden at runtime or per [flow](/flows/) task). + +## Shared code + +**Shared code** lets you reuse a snippet across otherwise unrelated transformations. Like variables, it is evaluated before the transformation runs, so it does not interfere with your code at runtime. Shared code can itself use variables. + +When you add shared code to a transformation you choose how it is linked: + +- **Use Inline** — copies the snippet into the transformation. There is no link; later changes to the shared code do not affect this copy. +- **Use as Shared Code** — links the snippet. Editing the shared code affects **all** linked transformations — which is the point, but also the risk: editing or deleting shared code can break every transformation that links it. + +This trade-off is the main thing to understand: inline favors isolation, linked favors central maintenance. + +To set either up, see the [how-to](/transformations/variables/how-to/). diff --git a/src/content/docs/transformations/variables/how-to.md b/src/content/docs/transformations/variables/how-to.md new file mode 100644 index 000000000..34d79dc28 --- /dev/null +++ b/src/content/docs/transformations/variables/how-to.md @@ -0,0 +1,89 @@ +--- +title: How do I use variables and shared code? +slug: 'transformations/variables/how-to' +description: Parametrize a Keboola transformation with variables (with defaults and runtime/flow overrides) and create, reuse, and manage shared code across transformations. +keywords: + - use transformation variables + - define variable default value + - override variable in flow + - create shared code + - reuse shared code Keboola +type: how-to +--- + +This page shows how to parametrize a transformation with **variables** and how to reuse snippets with **shared code**. 
For what they are and the inline-vs-linked trade-off, see the [explanation](/transformations/variables/explanation/). + +## Use a variable + +1. In your transformation code, reference the variable with Mustache syntax. For example, turn the multiplier `42` into `{{ multiplier }}`: + + ```sql + CREATE OR REPLACE TABLE "result" AS + SELECT "first", "second" * {{ multiplier }} AS "larger_second" FROM "source"; + ``` + +2. In the **Variables** section, define `multiplier` and give it a **default value**. Every referenced variable must be defined, and every defined variable must have a value. +3. **Run** the transformation. You can provide a runtime override of the default value when you run it. + +If a variable is referenced but not defined, or has no value, you get an error such as `Missing values for placeholders: "multiplier"` or `No value provided for variable "multiplier".` + +## Override a variable in a flow + +When you automate transformations with [flows](/flows/), each flow task can either use the defaults or override them. Add the override to the task's configuration JSON: + +```json +"variableValuesData": { + "values": [ + { + "name": "multiplier", + "value": "1000" + } + ] +} +``` + +## Create shared code + +Create shared code in either of two ways: + +- From the **Shared Codes** page — choose to create new shared code and enter a name. +- From an existing transformation's code — share a selected snippet; its code and code type are filled in automatically. You still enter a name. + +:::caution +For Snowflake, a single shared code can contain **only one query**, and the SQL query must end with a semicolon (`;`). +::: + +## Use shared code in a transformation + +1. While editing a transformation, insert shared code and select the snippet you want. +2. Choose how to use it (see the [explanation](/transformations/variables/explanation/#shared-code)): + - **Use Inline** — copies the snippet, no link. 
+ - **Use as Shared Code** — links it; later edits to the shared code apply everywhere it is linked. +3. To break a link later, choose **Use as Inline Code** from the snippet's dots menu. + +## Shared code with a variable (worked example) + +Suppose many SQL transformations need the same input prep. Because of [clone mapping](/transformations/mappings/#loading-type-snowflake-and-bigquery), you must drop the `_timestamp` column from the source: + +```sql +ALTER TABLE "source" DROP COLUMN "_timestamp"; +``` + +Make this reusable and parametrize the table name with a `source` variable: + +```sql +ALTER TABLE "{{source}}" DROP COLUMN "_timestamp"; +``` + +Create the shared code, add it to the transformation (drag it **before** the main code), then set the `source` variable to the [input mapping destination name](/transformations/mappings/#table-input-mapping) of the table (for example `source-table`). When you run the transformation, the job events show the shared-code query manipulating that table. + +## Manage shared code safely + +- **Review usage** — the **Usage** section on a shared code's detail page lists the transformations it is linked to. Inline copies are not listed (there is no link). +- **Editing** a linked shared code shows a warning that it may break the transformations using it. +- **Deleting** a used shared code lists the affected transformations; they stop working. A transformation referencing deleted shared code fails with a message like `Shared code configuration cannot be read: Row 10433 not found`. + +## Related + +- [What are variables and shared code?](/transformations/variables/explanation/) — concepts and the inline-vs-linked trade-off. +- [Input and output mapping](/transformations/mappings/) · [Flows](/flows/). 
diff --git a/src/content/docs/transformations/variables/index.md b/src/content/docs/transformations/variables/index.md index 699d2b65b..0367bb8d9 100644 --- a/src/content/docs/transformations/variables/index.md +++ b/src/content/docs/transformations/variables/index.md @@ -1,233 +1,19 @@ --- -title: Variables +title: Variables & Shared Code slug: 'transformations/variables' +description: Parametrize transformations with variables and reuse snippets with shared code in Keboola. Start here, then jump to the how-to or the explanation. +keywords: + - transformation variables + - shared code + - variables and shared code Keboola +type: explanation --- +**Variables** let you parametrize a transformation so one configuration can serve several cases; **shared code** lets you reuse a snippet across transformations. Both are evaluated before the transformation runs, so they don't interfere with your code at runtime. +This page is split by what you need: -Variables allow you to parametrize transformations. This is useful when you have similar transformations -which differ in only a limited number of values. You can have, for example, a transformation that -processes all orders from the Meals department. With variables, you can modify it to work for the -Drinks department, too. +- **[How do I use variables and shared code?](/transformations/variables/how-to/)** — define variables and defaults, override them at runtime or per flow, and create, reuse, and manage shared code. +- **[What are variables and shared code?](/transformations/variables/explanation/)** — the concepts and the inline-vs-linked trade-off. -## Variables -Transformation variables are unrelated to the transformation code itself. It means that they do not manifest themselves -as SQL or Python variables. Transformation variables are evaluated before the transformation is run and -are valid for the entire configuration (all code blocks, shared code, mapping, etc.). 
Variables are referenced -in the configuration using the -[Moustache Variable syntax](https://scalate.github.io/scalate/documentation/mustache.html#Variables). - -All variables referenced in the code must be defined in the variables section. All defined variables -must have assigned values. - -### Example -Consider the following transformation: - -```sql -CREATE OR REPLACE TABLE "result" AS - SELECT "first", "second" * 42 AS "larger_second" FROM "source"; -``` - -To parametrize the multiplier value (42), you can change it to a variable `{{ "{{ multiplier " }}}}`: - -```sql -CREATE OR REPLACE TABLE "result" AS - SELECT "first", "second" * {{ multiplier }} AS "larger_second" FROM "source"; -``` - -When you define a variable, you have to provide its default value: - -![Screenshot - Variables Configuration](/transformations/variables/variables-setting.png) - -When you run a transformation, you can provide a runtime override of the default value: - -![Screenshot - Running Transformation](/transformations/variables/variables-run.png) - -When a variable is referenced in the code but not defined, or its value is missing, -you'll get an error: - - Missing values for placeholders: "multiplier" - -or - - No value provided for variable "multiplier". - - -### Flow Usage -When you use [flows](/flows/) to automate transformations with variables, you can -either rely on the default values, or you can override them for each flow task. 
-This can be done by configuration of task parameters: - -![Screenshot - Orchestration Task Parameters](/transformations/variables/orchestration-parameters.png) - -There you can set variable values override: - -![Screenshot - Task Parameters](/transformations/variables/task-parameters.png) - -In the [above example](/transformations/variables/#example), you can override the default -value by **adding** the following code to the configuration json: - -```json - "variableValuesData": { - "values": [ - { - "name": "multiplier", - "value": "1000" - } - ] - } -``` - -The resulting configuration will look similar to this: - -```json -{ - "config": "6939", - "variableValuesData": { - "values": [ - { - "name": "multiplier", - "value": "1000" - } - ] - } -} -``` - -## Shared Code -Shared code is slightly related to variables in that it is another option how to make the -transformation code more dynamic. Shared code allows you to share pieces of code between -otherwise unrelated transformations. Like with the variables, the shared code is evaluated -before the transformation runs. This means that it does not interfere with your -transformation code. - -There are two ways how to create shared code --- from the **Shared Codes** page: - -![Screenshot - Create Shared Code](/transformations/variables/shared-code.png) - -Or from an existing transformation code: - -![Screenshot - Create Shared Code from Transformation](/transformations/variables/shared-code-2.png) - -You have to enter the name for the shared code when creating a new one. When you share an -existing piece of transformation code, the code and code type are filled in automatically. - -![Screenshot - Shared Code Detail](/transformations/variables/shared-code-detail.png) - -### Using Shared Code -You can use shared code when editing a transformation: - -![Screenshot - Shared Code Use](/transformations/variables/shared-code-use-1.png) - -Select the shared code you want to use. 
There are two options how you can use it: - -- **Use Inline** --- This will make a copy of the shared code in the transformation you're editing. There -won't be any link between the transformation and the shared code. -- **Use as Shared Code** --- This will link the shared code with the transformation. When you modify the -shared code, it will affect all linked transformations. - -![Screenshot - Shared Code Use](/transformations/variables/shared-code-use-2.png) - -When the code is inserted as shared code, you can always unlink the transformation -from the shared code by selecting **Use as Inline Code** from the dots menu: - -![Screenshot - Shared Code Use](/transformations/variables/shared-code-use-3.png) - -![Screenshot - Shared Code Use](/transformations/variables/shared-code-use-4.png) - -### Modifying Shared Code - -When a shared code is linked to transformations, you can review its usage in the -Usage section on the shared code detail page: - -![Screenshot - Shared Code List](/transformations/variables/shared-code-edit.png) - -You'll see a list of transformations to which the shared code is linked. The transformations -in which the shared code was used inline are not listed, because there is no link. - -When you attempt to edit a shared code, you'll see a warning that there's a potential -to break the transformations in which it is used. - -![Screenshot - Shared Code Edit](/transformations/variables/shared-code-edit-2.png) - -When you try to delete a shared code, you'll see a list of the transformations which use it. -When you delete a shared code that is used, the transformations using it will stop working. 
- -![Screenshot - Shared Code Delete](/transformations/variables/shared-code-delete.png) - -Transformations referencing a deleted shared code fail with a message similar to this: - - Shared code configuration cannot be read: Row 10433 not found - -### Example Using Shared Code - -Let's say that you have a lot of SQL transformations with a table in input mapping -that requires some preparation. - -For example: - -```sql -CREATE OR REPLACE TABLE "result" AS - SELECT *, "second" * 42 AS "larger_second" FROM "source"; -``` - -Because of [Clone mapping](/transformations/mappings/#snowflake-loading-type), you have -to drop the `_timestamp` column from the source by executing this query: - -```sql -ALTER TABLE "source" DROP COLUMN "_timestamp"; -``` - -If you have many transformations that require the table to be prepared in the same way, -you can create the following shared code: - -![Screenshot - Create Shared Code](/transformations/variables/shared-code-drop-1.png) - -*Note: When defining shared code for Snowflake, the shared code can contain only one query.* - -**Important: The SQL query must end with a semicolon `;`** - -Add the shared code to the transformation. Drag & Drop it before the main transformation code: - -![Screenshot - Use Code](/transformations/variables/shared-code-drop-2.png) - -The main code being: - -```sql -CREATE OR REPLACE TABLE "result" AS - SELECT "first", "second" * 42 AS "larger_second" FROM "source"; -``` - -When you run the transformation, you can see in the events what code has been executed: - -![Screenshot - Use Code](/transformations/variables/shared-code-events.png) - -### Example Shared Code with Variables -You can also define variables for shared code. -For example, we can extend the -[above example](/transformations/variables/#example-using-shared-code) -and parametrize the name of the table from which the `_timestamp` column is dropped. 
- -Add the `source` variable and modify the shared code to: - -```sql - -ALTER TABLE "{{source}}" DROP COLUMN "_timestamp"; - -``` - -![Screenshot - Shared Code with Variables](/transformations/variables/shared-code-variables-1.png) - -The transformation will detect that the value for the `source` variable is not defined: - -![Screenshot - Shared Code in Transformation](/transformations/variables/shared-code-variables-2.png) - -Set the `source` value to the [destination name](/transformations/mappings/#table-input-mapping) of the -table in the *Table Input Mapping* (`source-table` in this case): - -![Screenshot - Shared Code with set Variables](/transformations/variables/shared-code-variables-3.png) - -When you run the transformation, you can verify the executed queries in the job events. There -you can see that the shared code query manipulated the `source-table`: - -![Screenshot - Shared Code with set Variables](/transformations/variables/shared-code-variables-4.png) +New to transformations? Start with [Transformations](/transformations/). From 80325af31a9ac7121895af9c41cf136c2854260e Mon Sep 17 00:00:00 2001 From: Nikita Date: Wed, 24 Jun 2026 20:07:21 +0200 Subject: [PATCH 08/15] PRDCT-376: fix broken "Other features" table row + add frontmatter on Transformations index Block 0: explanation + reference. Kept as a single combined landing (not URL-split) to preserve load-bearing anchors like #writing-scripts referenced across the section. Corrected the malformed "Other features" row (removed a garbled row, rowspan 9 -> 8) and flagged it for human review. 
Co-Authored-By: Claude Opus 4.8 --- src/content/docs/transformations/index.md | 31 ++++++++++++++--------- 1 file changed, 19 insertions(+), 12 deletions(-) diff --git a/src/content/docs/transformations/index.md b/src/content/docs/transformations/index.md index 269db4ae5..fddb32eb0 100644 --- a/src/content/docs/transformations/index.md +++ b/src/content/docs/transformations/index.md @@ -1,12 +1,19 @@ ---- -title: Transformations -slug: 'transformations' -redirect_from: - - /manipulation/transformations/ --- +title: Transformations +slug: 'transformations' +description: Manipulate data in your Keboola project with SQL, Python, or R transformations — how they work, the available backends, and their features. +keywords: + - transformations + - Keboola transformation + - SQL Python R transformation + - transformation backends +type: explanation +redirect_from: + - /manipulation/transformations/ +--- + + - - *Go to our [Getting Started tutorial](/tutorial/manipulate/) to create your first transformation and learn how transformations are an integral part of the Keboola workflow.* @@ -171,12 +178,12 @@ Python and R transformations. JupyterLab + - Other features - ✓ - Not available - - + Other features Versioning ✓ From c0881884effa0320910524099bbfcfa0ebf856a2 Mon Sep 17 00:00:00 2001 From: Nikita Date: Wed, 24 Jun 2026 20:07:21 +0200 Subject: [PATCH 09/15] PRDCT-376: wire Group 2 (python, r, variables) splits into the sidebar Co-Authored-By: Claude Opus 4.8 --- _data/navigation.yml | 14 ++++++++++++++ src/sidebar.mjs | 22 ++++++++++++++++++++-- 2 files changed, 34 insertions(+), 2 deletions(-) diff --git a/_data/navigation.yml b/_data/navigation.yml index 96a365bfc..cce785423 100644 --- a/_data/navigation.yml +++ b/_data/navigation.yml @@ -587,6 +587,11 @@ items: - url: /transformations/variables/ title: Variables & Shared Code + items: + - url: /transformations/variables/how-to/ + title: How do I use variables and shared code? 
+ - url: /transformations/variables/explanation/ + title: What they are - url: /transformations/dbt/ title: dbt Transformation @@ -608,10 +613,19 @@ items: - url: /transformations/python-plain/ title: Python Transformations + items: + - url: /transformations/python-plain/how-to/ + title: How do I run a Python transformation? + - url: /transformations/python-plain/reference/ + title: Reference - url: /transformations/r-plain/ title: R Transformations items: + - url: /transformations/r-plain/how-to/ + title: How do I run an R transformation? + - url: /transformations/r-plain/reference/ + title: Reference - url: /transformations/r-plain/array-splitter/ title: Array Splitting diff --git a/src/sidebar.mjs b/src/sidebar.mjs index bec2ed0ac..074346ad8 100644 --- a/src/sidebar.mjs +++ b/src/sidebar.mjs @@ -417,7 +417,15 @@ export const sidebar = [ items: [ { label: "Overview", slug: "transformations" }, { slug: "transformations/mappings" }, - { slug: "transformations/variables" }, + { + label: "Variables & Shared Code", + collapsed: true, + items: [ + { label: "Overview", slug: "transformations/variables" }, + { slug: "transformations/variables/how-to" }, + { slug: "transformations/variables/explanation" }, + ], + }, { label: "dbt Transformation", collapsed: true, @@ -430,12 +438,22 @@ export const sidebar = [ { slug: "transformations/dbt/troubleshooting" }, ], }, - { slug: "transformations/python-plain" }, + { + label: "Python Transformations", + collapsed: true, + items: [ + { label: "Overview", slug: "transformations/python-plain" }, + { slug: "transformations/python-plain/how-to" }, + { slug: "transformations/python-plain/reference" }, + ], + }, { label: "R Transformations", collapsed: true, items: [ { label: "Overview", slug: "transformations/r-plain" }, + { slug: "transformations/r-plain/how-to" }, + { slug: "transformations/r-plain/reference" }, { slug: "transformations/r-plain/array-splitter" }, { slug: "transformations/r-plain/plots" }, { slug: 
"transformations/r-plain/binary" }, From b63f757c79d4648a563b751e9e8b455623dc7fe0 Mon Sep 17 00:00:00 2001 From: Nikita Date: Wed, 24 Jun 2026 22:36:17 +0200 Subject: [PATCH 10/15] =?UTF-8?q?PRDCT-376:=20clean=20up=20dbt=20subtree?= =?UTF-8?q?=20=E2=80=94=20alt=20text,=20drop=20decorative=20images,=20fron?= =?UTF-8?q?tmatter?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Group 3 (Confluence debt): stripped all Jekyll `{: width}` leftovers and added frontmatter (title/description/keywords/type) to all 6 dbt pages. Images (41 total, empty alt): judged per page. - cli (13): all CLI/terminal output, not UI locators → dropped; text stands alone. The generated source-file screenshot → TODO(human-review: add example). - transformation (16): kept 10 genuine config-screen locators with flagged alt; dropped 6 decorative job-log/redundant shots. - cloud (9): kept 2 config-screen locators; dropped 7 output/storage shots. - flows (3): decorative → dropped; expanded text to stand alone. - index, troubleshooting: frontmatter only. Kept-image alt text is context-derived (images not viewable here), prefixed `TODO(human-review: alt unverified)` so it is flagged but never empty. 
Co-Authored-By: Claude Opus 4.8 --- .../docs/transformations/dbt/cli/cli.md | 56 +++++++------------ .../docs/transformations/dbt/cloud/cloud.md | 40 +++++-------- .../docs/transformations/dbt/flows/flows.md | 20 +++++-- src/content/docs/transformations/dbt/index.md | 7 +++ .../dbt/transformation/transformation.md | 47 ++++++++-------- .../dbt/troubleshooting/troubleshooting.md | 9 ++- 6 files changed, 86 insertions(+), 93 deletions(-) diff --git a/src/content/docs/transformations/dbt/cli/cli.md b/src/content/docs/transformations/dbt/cli/cli.md index a962f8750..b47e06f5b 100644 --- a/src/content/docs/transformations/dbt/cli/cli.md +++ b/src/content/docs/transformations/dbt/cli/cli.md @@ -1,6 +1,14 @@ --- title: dbt CLI slug: 'transformations/dbt/cli' +description: Set up local dbt development with the Keboola CLI — install it, run kbc dbt init, store credentials, and run dbt debug and dbt run against your project Storage. +keywords: + - dbt CLI + - Keboola CLI dbt + - kbc dbt init + - local dbt development + - dbt debug run +type: how-to --- Video: @@ -25,8 +33,6 @@ brew upgrade keboola-cli kbc --version ``` -![](imgs/2772467715.png){: width="100%" } - **You will then gain access to dbt-related commands within Keboola CLI!** ### Steps @@ -51,22 +57,18 @@ The user is in the folder with the cloned dbt project and can run the following 4. They are outputted to stdout. -### Example - -![](imgs/2770010115.jpg){: width="100%" } - Store credentials to your zsh env profile (or your respective environment): --------------------------------------------------------------------------- -The file is located (Unix) in `~/.zshrc` - -![](imgs/2770206732.png){: width="100%" } +The file is located (Unix) in `~/.zshrc`. Add the environment variables that `kbc dbt init` printed to stdout. Then you can run dbt locally against the project storage, safely develop and test your code. -As part of the init command, CLI will create all sources from storage buckets. 
A storage bucket is a source file containing tables: +As part of the init command, CLI will create all sources from storage buckets. A storage bucket becomes a dbt source file containing its tables. -![](imgs/2777448699.png){: width="100%" } + *Note: Please note that `_timestamp` is automatically filled, alongside `primary keys` and corresponding `tests` for primary keys (`unique` and `not_null` tests).* @@ -84,21 +86,12 @@ dbt debug -t beer_demo --profiles-dir . * We are using local profiles; they are using environmental variables stored before. -![](imgs/2769649699.png){: width="100%" } - -You should see green in all steps: - -![](imgs/2770239505.png){: width="100%" } +All checks should pass (shown in green). dbt Run ------- -For the script alteration, the only check/change you have to make with off-the-shelf scripts is to alter source definitions to match sources: - -![](imgs/2769879073.png){: width="100%" } - - -![](imgs/2770042889.png){: width="100%" } +For the script alteration, the only check/change you have to make with off-the-shelf scripts is to alter source definitions to match sources. To execute the dbt: @@ -106,8 +99,6 @@ To execute the dbt: dbt run -t beer_demo --profiles-dir . ``` -![](imgs/2769879081.png){: width="100%" } - ### Other Commands for Future Use `kbc dbt generate profile` @@ -121,10 +112,7 @@ dbt run -t beer_demo --profiles-dir . * If there is an existing profile with the same name, it will be overwritten. Otherwise, the new profile will be just appended to the others if there are any. -* **Example** (non-interactive mode): - - -![](imgs/2770010121.jpg){: width="100%" } +* Can be run in non-interactive mode. `kbc dbt generate sources` -------------------------- @@ -133,9 +121,7 @@ dbt run -t beer_demo --profiles-dir . * Lists all tables in the default branch from the Storage API and generates source files to `models/_sources`. Tables from each bucket are stored in a separate file. 
-* **Example** (non-interactive mode) - - ![](imgs/2770010127.jpg){: width="100%" } +* Can be run in non-interactive mode. `kbc dbt generate env` ---------------------- @@ -144,9 +130,7 @@ dbt run -t beer_demo --profiles-dir . * Asks for an existing workspace (select box or id flag). -* **Example** - - ![](imgs/2770010133.jpg ){: width="100%" } +* Can be run in non-interactive mode. ### Workspaces Support @@ -159,6 +143,4 @@ dbt run -t beer_demo --profiles-dir . * Supports parameter `name`, `type`, and `size` (for `python` and `r`). -* **Example** (non-interactive mode) - - ![](imgs/2770010139.jpg){: width="100%" } +* Can be run in non-interactive mode. diff --git a/src/content/docs/transformations/dbt/cloud/cloud.md b/src/content/docs/transformations/dbt/cloud/cloud.md index 730f6a945..60099219f 100644 --- a/src/content/docs/transformations/dbt/cloud/cloud.md +++ b/src/content/docs/transformations/dbt/cloud/cloud.md @@ -1,11 +1,17 @@ --- title: dbt Cloud slug: 'transformations/dbt/cloud' +description: Use dbt Cloud from Keboola — trigger dbt Cloud jobs and extract dbt Cloud API data with the dedicated components, and find the resulting tables and artifacts in Storage. +keywords: + - dbt Cloud + - dbt Cloud job trigger + - dbt Cloud API extractor + - dbt Cloud Keboola + - dbt Cloud artifacts +type: how-to --- -dbt Cloud is supported via dedicated components. You can find them in the menu section **Components**: - -![](imgs/2777448719.png){: width="100%" } +dbt Cloud is supported via dedicated components. You can find them in the **Components** menu: * `kds-team.ex-dbt-cloud-api` for extracting data from dbt Cloud API @@ -14,32 +20,18 @@ dbt Cloud is supported via dedicated components. 
You can find them in the menu s ## dbt Cloud Trigger -![](imgs/2776563988.png){: width="100%" } +![TODO(human-review: alt unverified) The dbt Cloud Trigger component configuration with Account ID, Job ID, and API key fields](imgs/2776563988.png) The component configuration is pretty straightforward. You must authorize the component by providing your `Account ID`, `Job ID`, and `API key`. -The component generates a status table called `dbt_cloud_trigger` storing the job trigger API response: - -![](imgs/2776269020.png){: width="100%" } - -When **Wait for result** is selected, the component polls the status until the job ends. The component has a default wait time limit that can be optionally set to a different time. When the option **Wait for result** is used, the component extracts artifacts, stores them in the file storage, and additionally, produces a job result API call table: - -![](imgs/2776564000.png){: width="100%" } - -Both tables can be found in the storage, or accessed directly from the job result: +The component generates a status table called `dbt_cloud_trigger` storing the job trigger API response. -![](imgs/2777710848.png){: width="100%" } +When **Wait for result** is selected, the component polls the status until the job ends. The component has a default wait time limit that can be optionally set to a different time. When the option **Wait for result** is used, the component extracts artifacts, stores them in the file storage, and additionally, produces a job result API call table. Both tables can be found in Storage, or accessed directly from the job result. 
-Artifacts can be found in the storage - files - search by tag: - -![](imgs/2777448746.png){: width="100%" } - -Search by tag (component type or configuration ID): +Artifacts can be found in Storage → **Files**, searched by tag (component type or configuration ID): `tags:"componentId-kds-team.app-dbt-cloud-job-trigger"` -![](imgs/2776269036.png){: width="100%" } - **Tip:**: Those files can be also easily [retrieved externally via the API](https://api.keboola.com/?service=storage#get-/v2/storage/branch/-branchId-/files) or from an integrated Jupyter workspace for further analysis. *Note: Please keep in mind that the base URL of the API call depends on the stack you are using: US vs. Azure EU vs. EU central.* @@ -61,10 +53,8 @@ The purpose of this data source connector is to extract and store the [dbt Cloud To configure the source connector, enter the API token and select a default configuration: -![](imgs/2777448752.png){: width="100%" } - -You can access the data from Storage, or directly from the job detail screen: +![TODO(human-review: alt unverified) The dbt Cloud API source connector configuration with the API token and default configuration](imgs/2777448752.png) -![](imgs/2777710857.png){: width="100%" } +You can access the data from Storage, or directly from the job detail screen. ***Note:** The data source connector utilizes our powerful Generic Extractor. In case you want to customize the extraction, select just some endpoints, etc. 
You can switch to the JSON schema and edit the configuration manually.* diff --git a/src/content/docs/transformations/dbt/flows/flows.md b/src/content/docs/transformations/dbt/flows/flows.md index c01de54ad..ee3b176d4 100644 --- a/src/content/docs/transformations/dbt/flows/flows.md +++ b/src/content/docs/transformations/dbt/flows/flows.md @@ -1,12 +1,22 @@ --- -title: Usage in Flows +title: Using dbt in flows slug: 'transformations/dbt/flows' +description: Add dbt components to a Keboola flow to orchestrate them in a data pipeline, with scheduling and notifications, just like any other component. +keywords: + - dbt in flows + - orchestrate dbt Keboola + - schedule dbt transformation + - dbt pipeline +type: how-to --- -All dbt related components behave in the same way as any other component in Keboola does. They can be added to the flows to define orchestrated data pipelines, add schedule and notifications: +All dbt-related components behave like any other component in Keboola, so you orchestrate them with [flows](/flows/). -![](imgs/2776269000.jpeg){: width="100%" } +To run dbt as part of a pipeline: -![](imgs/2776269006.png){: width="100%" } +1. Open or create a [flow](/flows/). +2. Add the dbt component as a task, alongside your other extractors, transformations, and writers, in the order you want them to run. +3. Set a **schedule** on the flow so it runs automatically. +4. Configure **notifications** to be alerted on success, warning, or error. -![](imgs/2776269012.png){: width="100%" } +Because dbt components are ordinary flow tasks, everything a flow offers — ordering, scheduling, and notifications — applies to them. See [Flows](/flows/) for the full configuration. 
diff --git a/src/content/docs/transformations/dbt/index.md b/src/content/docs/transformations/dbt/index.md index ffd4f0022..3c4cd4610 100644 --- a/src/content/docs/transformations/dbt/index.md +++ b/src/content/docs/transformations/dbt/index.md @@ -1,6 +1,13 @@ --- title: dbt Transformation slug: 'transformations/dbt' +description: Run dbt projects in Keboola as versioned, schedulable components in your data pipeline — configure them in the UI, develop locally with the CLI, or trigger dbt Cloud. +keywords: + - dbt transformation + - dbt Keboola + - dbt component + - dbt data pipeline +type: explanation --- diff --git a/src/content/docs/transformations/dbt/transformation/transformation.md b/src/content/docs/transformations/dbt/transformation/transformation.md index 13906aded..6ce9a2c5c 100644 --- a/src/content/docs/transformations/dbt/transformation/transformation.md +++ b/src/content/docs/transformations/dbt/transformation/transformation.md @@ -1,11 +1,20 @@ --- title: dbt transformation slug: 'transformations/dbt/transformation' +description: Configure a dbt transformation in Keboola — connect a remote warehouse or use Keboola Storage, link the dbt project repository, define execution steps, set freshness and output mapping, and run or debug it. +keywords: + - dbt transformation + - dbt configuration Keboola + - dbt execution steps + - dbt project repository + - dbt profiles.yml + - dbt run debug +type: how-to --- ## Configuration -![](imgs/dbt-transformation-overview.webp){: width="100%" } +![TODO(human-review: alt unverified) Overview of the dbt transformation configuration screen](imgs/dbt-transformation-overview.webp) ### Database Connection @@ -15,21 +24,21 @@ slug: 'transformations/dbt/transformation' The required connection parameters for your remote data warehouse vary depending on the selected backend type. Use the **Run Debug** option in the right panel to validate the connection using the entered parameters. 
-![](imgs/dbt-transformation-db-connection.webp){: width="100%" } +![TODO(human-review: alt unverified) The Database Connection section for a dbt transformation with a remote warehouse](imgs/dbt-transformation-db-connection.webp) ### dbt Project Repository First, you must define a repository by specifying the URL (ending with GIT) and entering the access credentials if required. -![](imgs/2776563898.png){: width="100%" } +![TODO(human-review: alt unverified) The dbt Project Repository section with fields for the GIT URL and access credentials](imgs/2776563898.png) After saving a configuration, click **Load Branches** to select the desired branch. Don't forget to click **Save**. -![](imgs/2776563904.png){: width="100%" } +![TODO(human-review: alt unverified) The Load Branches control for selecting a repository branch](imgs/2776563904.png) ### Execution Steps -![](imgs/2740748169.png){: width="100%" } +![TODO(human-review: alt unverified) The Execution Steps section listing the dbt steps to run](imgs/2740748169.png) Select the desired execution steps, then edit or rearrange them as needed. @@ -41,11 +50,12 @@ For example, you can use the following command: dbt run --select "path:marts/finance,tag:nightly,config.materialized:table" --full-refresh ``` -![](imgs/dbt-transformation-step-edit.webp){: width="100%" } +![TODO(human-review: alt unverified) Editing an execution step to append dbt flags or resource selectors](imgs/dbt-transformation-step-edit.webp) ### Freshness If you run the `dbt source freshness` step in your project, you can set time limits for displaying warnings and errors. Both time limits can be enabled and configured independently. -![](imgs/2740735193.png){: width="100%" } + +![TODO(human-review: alt unverified) The Freshness settings with independent warning and error time limits](imgs/2740735193.png) ### Artifacts Artifacts generated by dbt (all steps except `dbt deps` and `dbt debug`) are automatically stored in Keboola Storage. 
Depending on the configuration, they are saved either as a compressed ZIP file or as individual files. @@ -56,29 +66,23 @@ Artifacts generated by dbt (all steps except `dbt deps` and `dbt debug`) are aut This is a specific configuration needed for the Keboola dbt component. Define which tables will be imported within storage. This configuration uses a standard output mapping UI element with configuration options, such as incremental or full load, filters, etc. -![](imgs/2776563928.png){: width="100%" } +![TODO(human-review: alt unverified) The Output Mapping section defining which tables are imported to Storage](imgs/2776563928.png) ## Running transformation Before running the dbt transformation, you can configure additional parameters (such as the dbt Core version, backend size, and number of threads), run debug command, or view generated project documentation. -![](imgs/dbt-transformation-run.webp){: width="50%" } +![TODO(human-review: alt unverified) The right-hand run panel with dbt Core version, backend size, threads, debug, and documentation options](imgs/dbt-transformation-run.webp) ### Run Debug To verify that your credentials and project setup are correct, you can run a debug job. This is the same as running `dbt debug` from the command prompt. -The **Run debug** button will create a separate job with standard logging, exposing the results of the dbt debug command: - -![](imgs/2776563946.png){: width="100%" } +The **Run debug** button will create a separate job with standard logging, exposing the results of the dbt debug command. ## dbt Project Documentation -When you press **dbt Project Documentation**, the job will generate the necessary files within artifacts to power documentation. The dbt documentation is then accessible via the button from the main configuration screen. 
- -Clicking the button synchronously generates the documentation in a popup: - -![](imgs/2776269049.png){: width="100%" } +When you press **dbt Project Documentation**, the job will generate the necessary files within artifacts to power documentation. The dbt documentation is then accessible via the button from the main configuration screen. Clicking the button synchronously generates the documentation in a popup. ## Manually Triggering dbt Transformation @@ -101,19 +105,13 @@ When you manually run a dbt transformation, a new job is triggered with standard * Record of producing and storing artifacts -![](imgs/2776563952.png){: width="100%" } - You can also access all configuration jobs from the configuration screen and the **Jobs** menu section. -![](imgs/2776563958.png){: width="100%" } - -![](imgs/2776563964.png){: width="50%" } - ## Discover The **Discover** tab is designed to provide more information about the run. Keboola plans to expand this tab to offer additional insights. Currently, it provides the timeline designed to visually display the duration of each model build. -![](imgs/2777448784.png){: width="100%" } +![TODO(human-review: alt unverified) The Discover tab timeline showing the build duration of each model](imgs/2777448784.png) ## Profiles and Target @@ -139,7 +137,6 @@ default: ***Note:** The values of environment variables are provided automatically based on the database connection settings or the use of Keboola Storage.* If needed, you can use a `profiles.yml` file committed in your dbt project repository for Remote DWH components and set the target according to your requirements. In this case, you must use the environment variables mentioned above in the generated `profiles.yml` and specify the target in each executed step. Your committed `profiles.yml` file will be merged with the automatically generated version. 
-![](imgs/2740663016.png){: width="100%" } :::caution Important: Never commit sensitive information such as access credentials or passwords to the repository. diff --git a/src/content/docs/transformations/dbt/troubleshooting/troubleshooting.md b/src/content/docs/transformations/dbt/troubleshooting/troubleshooting.md index 8bfc1be41..aee19d7f5 100644 --- a/src/content/docs/transformations/dbt/troubleshooting/troubleshooting.md +++ b/src/content/docs/transformations/dbt/troubleshooting/troubleshooting.md @@ -1,6 +1,13 @@ --- -title: Troubleshooting +title: dbt transformation troubleshooting slug: 'transformations/dbt/troubleshooting' +description: Troubleshoot dbt transformations in Keboola — diagnose and fix remote-workspace connection and authentication failures. +keywords: + - dbt troubleshooting + - dbt connection failure + - dbt authentication error + - dbt profiles.yml +type: reference --- ## Remote workspaces From f57cec98e8f28d139c799ee5052d32c3ee15ed06 Mon Sep 17 00:00:00 2001 From: Nikita Date: Wed, 24 Jun 2026 22:39:21 +0200 Subject: [PATCH 11/15] =?UTF-8?q?PRDCT-376:=20Group=200=20light=20touch=20?= =?UTF-8?q?=E2=80=94=20frontmatter=20+=20type=20on=20already-clean=20pages?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit No split. Added description/keywords/type to: r-plain/array-splitter, binary, plots (how-to examples); mappings (reference); code-patterns (how-to); duckdb/snowflake-migration (how-to). 
Co-Authored-By: Claude Opus 4.8 --- .../transformations/code-patterns/index.md | 17 +++++++--- .../duckdb/snowflake-migration.md | 6 ++++ .../docs/transformations/mappings/index.md | 8 +++++ .../transformations/r-plain/array-splitter.md | 22 ++++++++----- .../docs/transformations/r-plain/binary.md | 29 +++++++++------- .../docs/transformations/r-plain/plots.md | 33 +++++++++++-------- 6 files changed, 78 insertions(+), 37 deletions(-) diff --git a/src/content/docs/transformations/code-patterns/index.md b/src/content/docs/transformations/code-patterns/index.md index 1349ae5cd..554869341 100644 --- a/src/content/docs/transformations/code-patterns/index.md +++ b/src/content/docs/transformations/code-patterns/index.md @@ -1,10 +1,17 @@ ---- -title: Code Patterns -slug: 'transformations/code-patterns' --- +title: Code Patterns +slug: 'transformations/code-patterns' +description: Code Patterns are components that generate transformation code from parameters in Keboola — how to create a transformation with a code pattern and fill in its parameters form. +keywords: + - code patterns + - generate transformation code + - code pattern parameters + - predefined code pattern +type: how-to +--- + + - - Code Pattern is a special type of [component](/components/) that - generates code based on [parameters](#parameters-form), and diff --git a/src/content/docs/transformations/duckdb/snowflake-migration.md b/src/content/docs/transformations/duckdb/snowflake-migration.md index 8e3e03558..ccf7d850d 100644 --- a/src/content/docs/transformations/duckdb/snowflake-migration.md +++ b/src/content/docs/transformations/duckdb/snowflake-migration.md @@ -1,6 +1,12 @@ --- title: Snowflake to DuckDB Migration slug: 'transformations/duckdb/snowflake-migration' +description: How to migrate an existing Snowflake transformation to DuckDB in Keboola — SQL dialect differences and what to change. 
+keywords: + - Snowflake to DuckDB migration + - migrate Snowflake transformation DuckDB + - DuckDB SQL differences +type: how-to --- diff --git a/src/content/docs/transformations/mappings/index.md b/src/content/docs/transformations/mappings/index.md index 1d77a90ad..3cdd04bcf 100644 --- a/src/content/docs/transformations/mappings/index.md +++ b/src/content/docs/transformations/mappings/index.md @@ -1,6 +1,14 @@ --- title: Input and Output Mapping slug: 'transformations/mappings' +description: How input and output mapping move data between Keboola Storage and a transformation's staging area — table and file mapping, loading types, and read-only input mapping. +keywords: + - input mapping + - output mapping + - transformation staging + - read-only input mapping + - Snowflake loading type +type: reference --- *To configure input and output mappings in the process of creating a transformation, diff --git a/src/content/docs/transformations/r-plain/array-splitter.md b/src/content/docs/transformations/r-plain/array-splitter.md index 549da6563..47c9b2a76 100644 --- a/src/content/docs/transformations/r-plain/array-splitter.md +++ b/src/content/docs/transformations/r-plain/array-splitter.md @@ -1,10 +1,16 @@ ---- -title: Array Splitting -slug: 'transformations/r-plain/array-splitter' -redirect_from: - - /manipulation/transformations/r/array-splitter/ - - /transformations/r/array-splitter/ - +--- +title: Array Splitting +slug: 'transformations/r-plain/array-splitter' +description: Worked example of splitting an array column into multiple rows in an R transformation in Keboola. 
+keywords: + - R array splitting + - split array column R + - R transformation example +type: how-to +redirect_from: + - /manipulation/transformations/r/array-splitter/ + - /transformations/r/array-splitter/ + --- Array splitting is what you do if you have a list of values in a single cell delimited by a character (a comma, semi-colon, etc.), @@ -40,7 +46,7 @@ The following script will take each row of the source table, and split the colum together with their ID specified in the `idCol` variable, and they will also be assigned a new sequential ID in the `globalPos` column. -```r +```r splitChar = ',' splitCol = 'name' idCol = 'id' diff --git a/src/content/docs/transformations/r-plain/binary.md b/src/content/docs/transformations/r-plain/binary.md index cf8e54042..d6ab3ee2c 100644 --- a/src/content/docs/transformations/r-plain/binary.md +++ b/src/content/docs/transformations/r-plain/binary.md @@ -1,14 +1,21 @@ ---- -title: Using Binary Files -slug: 'transformations/r-plain/binary' -redirect_from: - - /manipulation/transformations/r/binary/ - - /transformations/r/binary/ - --- +title: Using Binary Files +slug: 'transformations/r-plain/binary' +description: Worked example of reading and writing binary files (such as trained models) in an R transformation in Keboola. +keywords: + - R binary files + - R transformation binary + - R trained model file + - R transformation example +type: how-to +redirect_from: + - /manipulation/transformations/r/binary/ + - /transformations/r/binary/ + +--- + + - - Inside an R transformation, pre-computed models can be used. These models of your data behaviour are great for predictions, among other things. The following are some of the reasons for using a pre-computed model inside an R transformation: @@ -48,7 +55,7 @@ Only the second table will be used in the actual R transformation. Upload that t First, it is necessary to get a file with the R model. To create and save a very simple model, use a script similar to the following one. 
It is supposed to be executed **outside Keboola**, for example, on your local machine. -```r +```r data <- read.csv("cashier-data.csv") lm <- lm(time_spent_in_shop ~ number_of_items, data) save(lm, file = "time_model.rda") @@ -85,7 +92,7 @@ transformation (it allows easy updates; if you need to rollback, just delete the The following sample script demonstrates the use of the pre-computed model. The `lm` variable is loaded from the `predictionModel` file. -```r +```r data <- read.csv(file = "in/tables/cashier-data-predict.csv"); # Load the pre-computed model diff --git a/src/content/docs/transformations/r-plain/plots.md b/src/content/docs/transformations/r-plain/plots.md index 19c1a8a74..da74ad386 100644 --- a/src/content/docs/transformations/r-plain/plots.md +++ b/src/content/docs/transformations/r-plain/plots.md @@ -1,21 +1,28 @@ ---- -title: Plots & Graphs -slug: 'transformations/r-plain/plots' -redirect_from: - - /manipulation/transformations/r/plots/ - - /transformations/r/plots/ - --- +title: Plots & Graphs +slug: 'transformations/r-plain/plots' +description: Worked example of generating charts and graphs in an R transformation in Keboola and saving them as image files. +keywords: + - R plots + - R charts graphs + - R transformation plot + - R transformation example +type: how-to +redirect_from: + - /manipulation/transformations/r/plots/ + - /transformations/r/plots/ + +--- + + - - Generating plots in R is supported through [Storage file uploads](/storage/files/). To upload a plot to Storage, save the file in the output directory for files (`out/files/`). Each file in that directory will be automatically saved into Storage File Uploads. To make file handling a bit easier, it is possible to write a *manifest*, which describes the file. This can be used to set *file tags* and other file upload options. 
To write the manifest, use the `app$writeFileManifest` function with the following signature: -```r +```r app$writeFileManifest = function(fileName, fileTags = vector(), isPublic = FALSE, isPermanent = TRUE, notify = FALSE) ``` @@ -34,7 +41,7 @@ in the input mapping. There is no output mapping. Then use the following R script in the transformation and run it. -```r +```r data <- read.csv("/data/in/tables/graph-source.csv") model <- lm(formula = time_spent_in_shop ~ customer_age + I(customer_age^2), data = data) @@ -62,7 +69,7 @@ single image using a 2x2 grid. Use the following script the same way as in Example 1. The only difference is that this script produces multiple files. -```r +```r data <- read.csv("/data/in/tables/graph-source.csv") model <- lm(formula = time_spent_in_shop ~ customer_age + I(customer_age^2), data = data) @@ -84,7 +91,7 @@ A manifest is written for each file individually. If you want to use the [ggplot](https://ggplot2.tidyverse.org/reference/) package, use a different function for saving a file in your script. The rest remains the same as in the previous examples. -```r +```r library(ggplot2) data <- read.csv("/data/in/tables/graph-source.csv") From 509ffa18772ac654f7ad802a0a9105d32de27d18 Mon Sep 17 00:00:00 2001 From: Nikita Date: Wed, 24 Jun 2026 22:40:40 +0200 Subject: [PATCH 12/15] PRDCT-376: consolidated human-review queue for the Transformations split MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Folds every inline TODO(human-review) from Groups 1-3 into one list (reference facts to verify vs code SoT, UI labels, dbt alt text, the index table-row fix, and Diátaxis mapping notes). 
Co-Authored-By: Claude Opus 4.8 --- revamp/PRDCT-376-human-review.md | 70 ++++++++++++++++++++++++++++++++ 1 file changed, 70 insertions(+) create mode 100644 revamp/PRDCT-376-human-review.md diff --git a/revamp/PRDCT-376-human-review.md b/revamp/PRDCT-376-human-review.md new file mode 100644 index 000000000..26250b5fe --- /dev/null +++ b/revamp/PRDCT-376-human-review.md @@ -0,0 +1,70 @@ +# PRDCT-376 — consolidated human-review queue + +Every `TODO(human-review)` marker left inline across the Transformations section +split, grouped for sign-off. The markers remain in-source; this is the index. +(63 markers across 15 files.) + +## A. Reference facts to verify against component code (SoT not in this repo) + +The component config schemas / READMEs are not in the docs repo and were not +reachable, so these carried-over values could not be verified against code. +Nothing was invented, renamed, or removed — only flagged. + +- **Snowflake** (`snowflake-plain/reference.md`): backend sizes + default; 7,200 s + query timeout; 8,192-char comment segfault; `ABORT_TRANSFORMATION` name/semantics; + AWS-US timestamp parameter overrides; copy/clone loading types; Free-Plan backend + availability. +- **BigQuery** (`bigquery/reference.md`, `bigquery/how-to.md`): the "2 hours" query + timeout vs current GCP quota; `Query timeout` parameter name/units and that default + `0` = "use BigQuery default"; `ABORT_TRANSFORMATION` semantics. +- **Oracle** (`oracle/index.md`): exact default/behavior of the optional `schema` + config field (added per instruction; semantics unconfirmed). +- **DuckDB** (`duckdb/reference.md`): default Timeout (1 h); default for "Automatic + data types"; backend sizes / memory figures / default; Free-Plan availability; + parameter names `threads` / `max_memory_mb`; list of supported DuckDB versions. +- **Python** (`python-plain/reference.md`): current Python version; 8 GB memory / + 6 h / CPU limits; preinstalled package list; backend sizes / default / plan. 
+- **R** (`r-plain/reference.md`): R `4.4.1` is current + other selectable versions + (version bumped 4.0.5 → 4.4.1 per instruction); 16 GB / 6 h / CPU limits; + preinstalled packages; backend sizes / default / plan. + +## B. UI labels / control names to confirm (how-to pages) + +In `snowflake-plain/how-to.md`, `bigquery/how-to.md`, `duckdb/how-to.md`, +`python-plain/how-to.md`, `r-plain/how-to.md`, `oracle/index.md`: the literal +navigation/label strings — e.g. **Components → Transformations**, **New +Transformation**, the per-backend type label (e.g. "Snowflake SQL Transformation"), +**New Table Input/Output**, **Create Table** — were written from convention and +need a quick check against the live UI. Also the DuckDB sample CSV's column names. + +## C. dbt screenshots — alt text unverified + +The images were not viewable while editing, so alt text on **kept** screenshots is +context-derived and prefixed `TODO(human-review: alt unverified)`. Verify each +against the actual image (12 kept): + +- `dbt/transformation/transformation.md` (10): configuration overview, database + connection, project repository, load branches, execution steps, step edit, + freshness, output mapping, run panel, Discover timeline. +- `dbt/cloud/cloud.md` (2): dbt Cloud Trigger config, dbt Cloud API source connector config. + +Plus `dbt/cli/cli.md`: `TODO(human-review: add generated source-file example)` — +the dropped "generated source file" screenshot should be replaced with a short +fenced YAML example (its exact contents weren't reconstructable from the page). + +## D. Content correctness + +- `transformations/index.md`: the first "Other features" table row was malformed + (no feature name, an extra cell, rowspan 9 vs 8 real features). It was removed and + the rowspan corrected to 8. **If it represented a real feature, re-add it with the + correct name/value.** + +## E. 
Diátaxis mapping note (not a defect) + +- **python-plain** and **r-plain** were Block-0-tagged how-to + reference + **tutorial**. + Because page `type` is constrained to how-to | reference | explanation, the + tutorial/dev-walkthrough facet was folded into each **how-to** ("Develop and + debug"). Confirm this is acceptable, or split a dedicated tutorial page later. +- **transformations/index.md** was kept as a single combined landing (not URL-split + into explanation+reference) to preserve load-bearing anchors such as + `#writing-scripts` referenced by every backend how-to. From 6ccee0494ad7f31648636bb746e7640fdadfb7f6 Mon Sep 17 00:00:00 2001 From: Nikita Date: Wed, 24 Jun 2026 23:50:53 +0200 Subject: [PATCH 13/15] PRDCT-376: transcribe dbt CLI config output to fenced blocks (masked) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Corrected the over-applied "drop all" on dbt/cli: the `kbc dbt init` example screenshot was config/command output, so per our own rule it is transcribed to fenced blocks, not dropped — env vars (shell), generated profiles.yml, and a generated models/_sources/*.yml. All values are placeholders; secrets (KBC_STORAGE_API_TOKEN, password) masked, no real IDs (secret-scanned). Also flag dbt/cloud dbt_cloud_trigger status-table columns for human review. Co-Authored-By: Claude Opus 4.8 --- .../docs/transformations/dbt/cli/cli.md | 63 ++++++++++++++++--- .../docs/transformations/dbt/cloud/cloud.md | 5 ++ 2 files changed, 58 insertions(+), 10 deletions(-) diff --git a/src/content/docs/transformations/dbt/cli/cli.md b/src/content/docs/transformations/dbt/cli/cli.md index b47e06f5b..da1ac432d 100644 --- a/src/content/docs/transformations/dbt/cli/cli.md +++ b/src/content/docs/transformations/dbt/cli/cli.md @@ -57,20 +57,63 @@ The user is in the folder with the cloned dbt project and can run the following 4. They are outputted to stdout. 
-Store credentials to your zsh env profile (or your respective environment): ---------------------------------------------------------------------------- - -The file is located (Unix) in `~/.zshrc`. Add the environment variables that `kbc dbt init` printed to stdout. +#### Example output + +`kbc dbt init` prints environment variables to stdout and generates the dbt files shown below. **All values here are placeholders** — use the exact values from your own `kbc dbt init` output, and never commit secrets (storage token, password) to the repository. + +Environment variables (printed to stdout — store them in your shell profile, e.g. `~/.zshrc`): + +```shell +export KBC_STORAGE_API_TOKEN= # secret — do not commit +export DBT_KBC_TARGET1_TYPE=snowflake +export DBT_KBC_TARGET1_ACCOUNT= +export DBT_KBC_TARGET1_DATABASE= +export DBT_KBC_TARGET1_WAREHOUSE= +export DBT_KBC_TARGET1_SCHEMA= +export DBT_KBC_TARGET1_USER= +export DBT_KBC_TARGET1_PASSWORD= # secret — do not commit +export DBT_KBC_TARGET1_THREADS=4 +``` -Then you can run dbt locally against the project storage, safely develop and test your code. +Generated `profiles.yml`: + +```yaml +default: + outputs: + target1: + type: "{{ env_var('DBT_KBC_TARGET1_TYPE') }}" + account: "{{ env_var('DBT_KBC_TARGET1_ACCOUNT') }}" + database: "{{ env_var('DBT_KBC_TARGET1_DATABASE') }}" + warehouse: "{{ env_var('DBT_KBC_TARGET1_WAREHOUSE') }}" + schema: "{{ env_var('DBT_KBC_TARGET1_SCHEMA') }}" + user: "{{ env_var('DBT_KBC_TARGET1_USER') }}" + password: "{{ env_var('DBT_KBC_TARGET1_PASSWORD') }}" + threads: "{{ env_var('DBT_KBC_TARGET1_THREADS') | as_number }}" + target: target1 +``` -As part of the init command, CLI will create all sources from storage buckets. A storage bucket becomes a dbt source file containing its tables. +Generated source file — one per Storage bucket (for example `models/_sources/in.c-test.yml`). 
`_timestamp` is added automatically, alongside the primary keys and their `unique` and `not_null` tests: + +```yaml +version: 2 + +sources: + - name: in.c-test + schema: in.c-test + tables: + - name: + columns: + - name: + tests: + - unique + - not_null + - name: _timestamp # filled automatically by Keboola +``` - +Store credentials to your shell env profile (or your respective environment): +--------------------------------------------------------------------------- -*Note: Please note that `_timestamp` is automatically filled, alongside `primary keys` and corresponding `tests` for primary keys (`unique` and `not_null` tests).* +On Unix, add the `export` lines above to `~/.zshrc` (or your shell profile). Then you can run dbt locally against the project storage, safely develop and test your code. ### Run Test Debug diff --git a/src/content/docs/transformations/dbt/cloud/cloud.md b/src/content/docs/transformations/dbt/cloud/cloud.md index 60099219f..bfebe3298 100644 --- a/src/content/docs/transformations/dbt/cloud/cloud.md +++ b/src/content/docs/transformations/dbt/cloud/cloud.md @@ -26,6 +26,11 @@ The component configuration is pretty straightforward. You must authorize the co The component generates a status table called `dbt_cloud_trigger` storing the job trigger API response. + + + When **Wait for result** is selected, the component polls the status until the job ends. The component has a default wait time limit that can be optionally set to a different time. When the option **Wait for result** is used, the component extracts artifacts, stores them in the file storage, and additionally, produces a job result API call table. Both tables can be found in Storage, or accessed directly from the job result. 
Artifacts can be found in Storage → **Files**, searched by tag (component type or configuration ID): From 0868af654f9422e5e18571b272cfdd648efedb9d Mon Sep 17 00:00:00 2001 From: Nikita Date: Thu, 25 Jun 2026 00:56:21 +0200 Subject: [PATCH 14/15] PRDCT-376: section-wide screenshot transcription audit MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Verified every screenshot in the Transformations section against the "code/config never as an image → transcribe" rule: - dbt CLI config output: already transcribed (env vars, profiles.yml, sources yml). - dbt transformation/cloud kept images are UI config screens (not code); the dbt run command and profiles.yml are already fenced blocks. - mappings, duckdb/snowflake-migration, r-plain examples, index: UI / result / chart screenshots — nothing transcribable. - code-patterns "Generated Code" screenshot is the one remaining code-in-an-image; flagged TODO(human-review) (content is pattern-specific, not viewable here). Updated the consolidated human-review list accordingly. Co-Authored-By: Claude Opus 4.8 --- revamp/PRDCT-376-human-review.md | 10 +++++++--- .../docs/transformations/code-patterns/index.md | 4 ++++ 2 files changed, 11 insertions(+), 3 deletions(-) diff --git a/revamp/PRDCT-376-human-review.md b/revamp/PRDCT-376-human-review.md index 26250b5fe..d6e5a8b73 100644 --- a/revamp/PRDCT-376-human-review.md +++ b/revamp/PRDCT-376-human-review.md @@ -48,9 +48,13 @@ against the actual image (12 kept): freshness, output mapping, run panel, Discover timeline. - `dbt/cloud/cloud.md` (2): dbt Cloud Trigger config, dbt Cloud API source connector config. -Plus `dbt/cli/cli.md`: `TODO(human-review: add generated source-file example)` — -the dropped "generated source file" screenshot should be replaced with a short -fenced YAML example (its exact contents weren't reconstructable from the page). 
+`dbt/cli/cli.md`: the `kbc dbt init` outputs (env vars, `profiles.yml`, generated +`models/_sources/*.yml`) are now transcribed to fenced blocks with masked +placeholders. + +`code-patterns/index.md`: `TODO(human-review: transcribe generated-code screenshot)` — +the "Generated Code" screenshot shows code (should be a fenced block); content is +code-pattern-specific and wasn't viewable here, so add a representative example. ## D. Content correctness diff --git a/src/content/docs/transformations/code-patterns/index.md b/src/content/docs/transformations/code-patterns/index.md index 554869341..f6b7633f8 100644 --- a/src/content/docs/transformations/code-patterns/index.md +++ b/src/content/docs/transformations/code-patterns/index.md @@ -95,6 +95,10 @@ It is (re)generated by clicking the **Regenerate Code** button. This calls the [Generate Action](https://developers.keboola.com/extend/component/code-patterns/interface#generate-action) on the code pattern component with the actual parameters. The result is then saved and displayed. + ![Screenshot -- Generated Code](/transformations/code-patterns/overview-6-code.png) After the code has been generated, you can [run the job](https://help.keboola.com/management/jobs/#running-jobs) the standard way. From 32830a417d9fce2f8ab39fd20644b5f687fdaf6a Mon Sep 17 00:00:00 2001 From: Nikita Date: Thu, 25 Jun 2026 01:36:02 +0200 Subject: [PATCH 15/15] PRDCT-376: reconcile types + verified facts with PRDCT-354 audit (Block 0/A/B) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Now that the Block 0 classification is accessible, aligned #987 with it: - Schema: add `tutorial` to the type enum. - Types per Block 0: cli, r-plain array-splitter/binary/plots, duckdb/ snowflake-migration -> tutorial; troubleshooting -> how-to; flows -> explanation. 
- Resolved TODO flags verified by Block B (vs component code): snowflake query_timeout=7200 + ABORT_TRANSFORMATION + copy/clone; bigquery Query timeout default 0 + ABORT_TRANSFORMATION; duckdb threads/max_memory_mb/dtypes_infer/ syntax_check/duckdb_version + supported versions {1.5.2,1.4.4} + sync actions; oracle optional db.schema. Kept only platform-level unverifiable flags. - Fixed broken anchors (Block A): #phases/#dependencies (index) and #new-transformations (code-patterns, mappings) — overlaps Devin PR #981. - Updated revamp/PRDCT-376-human-review.md (Resolved section + reconciliation note). Co-Authored-By: Claude Opus 4.8 --- revamp/PRDCT-376-human-review.md | 53 +++++++++++-------- src/content.config.ts | 2 +- .../transformations/bigquery/reference.md | 14 ++--- .../transformations/code-patterns/index.md | 2 +- .../docs/transformations/dbt/cli/cli.md | 2 +- .../docs/transformations/dbt/flows/flows.md | 2 +- .../dbt/troubleshooting/troubleshooting.md | 2 +- .../docs/transformations/duckdb/reference.md | 15 +++--- .../duckdb/snowflake-migration.md | 2 +- src/content/docs/transformations/index.md | 4 +- .../docs/transformations/mappings/index.md | 2 +- .../docs/transformations/oracle/index.md | 2 +- .../transformations/r-plain/array-splitter.md | 2 +- .../docs/transformations/r-plain/binary.md | 2 +- .../docs/transformations/r-plain/plots.md | 2 +- .../snowflake-plain/reference.md | 17 +++--- 16 files changed, 67 insertions(+), 58 deletions(-) diff --git a/revamp/PRDCT-376-human-review.md b/revamp/PRDCT-376-human-review.md index d6e5a8b73..479056818 100644 --- a/revamp/PRDCT-376-human-review.md +++ b/revamp/PRDCT-376-human-review.md @@ -2,31 +2,38 @@ Every `TODO(human-review)` marker left inline across the Transformations section split, grouped for sign-off. The markers remain in-source; this is the index. -(63 markers across 15 files.) - -## A. 
Reference facts to verify against component code (SoT not in this repo) - -The component config schemas / READMEs are not in the docs repo and were not -reachable, so these carried-over values could not be verified against code. -Nothing was invented, renamed, or removed — only flagged. - -- **Snowflake** (`snowflake-plain/reference.md`): backend sizes + default; 7,200 s - query timeout; 8,192-char comment segfault; `ABORT_TRANSFORMATION` name/semantics; - AWS-US timestamp parameter overrides; copy/clone loading types; Free-Plan backend - availability. -- **BigQuery** (`bigquery/reference.md`, `bigquery/how-to.md`): the "2 hours" query - timeout vs current GCP quota; `Query timeout` parameter name/units and that default - `0` = "use BigQuery default"; `ABORT_TRANSFORMATION` semantics. -- **Oracle** (`oracle/index.md`): exact default/behavior of the optional `schema` - config field (added per instruction; semantics unconfirmed). -- **DuckDB** (`duckdb/reference.md`): default Timeout (1 h); default for "Automatic - data types"; backend sizes / memory figures / default; Free-Plan availability; - parameter names `threads` / `max_memory_mb`; list of supported DuckDB versions. + +> **Reconciled with PRDCT-354 (Devin "Audit vs code").** Page `type`s were +> aligned to Block 0 (added a `tutorial` type; retyped cli, r-plain +> array-splitter/binary/plots, duckdb/snowflake-migration → tutorial; +> troubleshooting → how-to; flows → explanation). Items below that Block B +> verified against component code are marked **Resolved**; only the genuinely +> unverifiable ones remain open. + +## Resolved via PRDCT-354 audit (Block B — verified vs code) + +- **Snowflake** (`snowflake-plain/reference.md`): `query_timeout=7200`, + `ABORT_TRANSFORMATION`, and copy/clone loading types — confirmed. +- **BigQuery** (`bigquery/reference.md`): `Query timeout` parameter default `0` + and `ABORT_TRANSFORMATION` (STRING DEFAULT '') — confirmed. 
+- **DuckDB** (`duckdb/reference.md`): `threads`, `max_memory_mb`, `dtypes_infer`, + `debug`, `syntax_check`, `duckdb_version`, **supported versions {1.5.2, 1.4.4}**, + the 4 sync actions, and block orchestration — confirmed. +- **Oracle** (`oracle/index.md`): the optional `schema` field is `db.schema` + (`scalarNode('schema')`) — confirmed. + +## A. Reference facts still unverifiable (platform-level / not in code audit) + +- **Snowflake** (`snowflake-plain/reference.md`): backend sizes + default; + 8,192-char comment segfault; AWS-US timestamp parameter overrides; Free-Plan + backend availability. +- **BigQuery** (`bigquery/reference.md`): the "2 hours" GCP query-runtime claim + (platform-side; current GCP quota may be 6 h). +- **DuckDB** (`duckdb/reference.md`): backend sizes / memory figures; default Timeout (1 h). - **Python** (`python-plain/reference.md`): current Python version; 8 GB memory / 6 h / CPU limits; preinstalled package list; backend sizes / default / plan. -- **R** (`r-plain/reference.md`): R `4.4.1` is current + other selectable versions - (version bumped 4.0.5 → 4.4.1 per instruction); 16 GB / 6 h / CPU limits; - preinstalled packages; backend sizes / default / plan. +- **R** (`r-plain/reference.md`): R `4.4.1` confirmed (PRDCT-354 Block A, bumped + 4.0.5 → 4.4.1); 16 GB / 6 h / CPU limits; preinstalled packages; backend sizes. ## B. UI labels / control names to confirm (how-to pages) diff --git a/src/content.config.ts b/src/content.config.ts index 5a5b18a5f..684eb9218 100644 --- a/src/content.config.ts +++ b/src/content.config.ts @@ -16,7 +16,7 @@ export const collections = { // Docs revamp (Diátaxis) — every revamped page declares the single // reader need it serves, plus user-vocabulary keywords for search/RAG. 
keywords: z.array(z.string()).optional(), - type: z.enum(['how-to', 'reference', 'explanation']).optional(), + type: z.enum(['how-to', 'reference', 'explanation', 'tutorial']).optional(), }), }), }), diff --git a/src/content/docs/transformations/bigquery/reference.md b/src/content/docs/transformations/bigquery/reference.md index a5b6e552f..c42366bac 100644 --- a/src/content/docs/transformations/bigquery/reference.md +++ b/src/content/docs/transformations/bigquery/reference.md @@ -13,20 +13,20 @@ type: reference Reference material for [BigQuery SQL transformations](/transformations/bigquery/). To create one, see the [how-to](/transformations/bigquery/how-to/). - + ## Limits | Limit | Value | Notes | |---|---|---| -| Query runtime | 2 hours (BigQuery default) | Adjustable per configuration via the **Query timeout** parameter. See [BigQuery query-jobs quotas](https://cloud.google.com/bigquery/quotas#query_jobs). | +| Query runtime | 2 hours (BigQuery default) | Adjustable per configuration via the **Query timeout** parameter. See [BigQuery query-jobs quotas](https://cloud.google.com/bigquery/quotas#query_jobs). | | Tables per query | Capped | BigQuery limits the [number of tables referenced by a single query](https://cloud.google.com/bigquery/quotas#tables). | | Mutations | Discouraged | BigQuery favors an append-only model; row-level mutations are [generally discouraged](https://cloud.google.com/bigquery/docs/best-practices-costs#avoid_using_dml). | -**Query timeout** parameter — overrides the per-query runtime limit. Default: `0` (use BigQuery's own default). +**Query timeout** parameter — overrides the per-query runtime limit. Default: `0` (use BigQuery's own default). For BigQuery limitations specific to Keboola, see [BigQuery Limitations](/storage/byodb/#bigquery-limitations). Track upstream changes in the [BigQuery release notes](https://cloud.google.com/bigquery/docs/release-notes). 
@@ -42,7 +42,7 @@ SET ABORT_TRANSFORMATION = ( ); ``` -This sets `ABORT_TRANSFORMATION` to `'Integrity check failed'` when the `INTEGRITY_CHECK` table has one or more rows with `RESULT = 'failed'`. An empty string does not abort. +This sets `ABORT_TRANSFORMATION` to `'Integrity check failed'` when the `INTEGRITY_CHECK` table has one or more rows with `RESULT = 'failed'`. An empty string does not abort. ## Working with data types diff --git a/src/content/docs/transformations/code-patterns/index.md b/src/content/docs/transformations/code-patterns/index.md index f6b7633f8..149d0e5eb 100644 --- a/src/content/docs/transformations/code-patterns/index.md +++ b/src/content/docs/transformations/code-patterns/index.md @@ -15,7 +15,7 @@ type: how-to Code Pattern is a special type of [component](/components/) that - generates code based on [parameters](#parameters-form), and -- can be used in the user interface of [New Transformations](/transformations/#new-transformations). +- can be used in the user interface when creating a [transformation](/transformations/). ## List of Code Patterns diff --git a/src/content/docs/transformations/dbt/cli/cli.md b/src/content/docs/transformations/dbt/cli/cli.md index da1ac432d..fb1b107e2 100644 --- a/src/content/docs/transformations/dbt/cli/cli.md +++ b/src/content/docs/transformations/dbt/cli/cli.md @@ -8,7 +8,7 @@ keywords: - kbc dbt init - local dbt development - dbt debug run -type: how-to +type: tutorial --- Video: diff --git a/src/content/docs/transformations/dbt/flows/flows.md b/src/content/docs/transformations/dbt/flows/flows.md index ee3b176d4..42684dea8 100644 --- a/src/content/docs/transformations/dbt/flows/flows.md +++ b/src/content/docs/transformations/dbt/flows/flows.md @@ -7,7 +7,7 @@ keywords: - orchestrate dbt Keboola - schedule dbt transformation - dbt pipeline -type: how-to +type: explanation --- All dbt-related components behave like any other component in Keboola, so you orchestrate them with [flows](/flows/). 
diff --git a/src/content/docs/transformations/dbt/troubleshooting/troubleshooting.md b/src/content/docs/transformations/dbt/troubleshooting/troubleshooting.md index aee19d7f5..09b20dfee 100644 --- a/src/content/docs/transformations/dbt/troubleshooting/troubleshooting.md +++ b/src/content/docs/transformations/dbt/troubleshooting/troubleshooting.md @@ -7,7 +7,7 @@ keywords: - dbt connection failure - dbt authentication error - dbt profiles.yml -type: reference +type: how-to --- ## Remote workspaces diff --git a/src/content/docs/transformations/duckdb/reference.md b/src/content/docs/transformations/duckdb/reference.md index 667ca7065..a345157fa 100644 --- a/src/content/docs/transformations/duckdb/reference.md +++ b/src/content/docs/transformations/duckdb/reference.md @@ -20,10 +20,13 @@ Reference material for [DuckDB SQL transformations](/transformations/duckdb/). T DuckDB Transformation is currently in **BETA**. Breaking changes may occur. ::: - + + ## Configuration settings @@ -41,7 +44,7 @@ Set these on the right-side panel of the transformation configuration: ### DuckDB version -Select the DuckDB version used to run the transformation. Use `latest` (default) to always run on the most recent supported version, or pin a specific version (for example, `1.5.2`, `1.4.4`) for stability. Each supported version runs in its own isolated environment. +Select the DuckDB version used to run the transformation. Use `latest` (default) to always run on the most recent supported version, or pin a specific supported version — **`1.5.2`** or **`1.4.4`** — for stability. Each supported version runs in its own isolated environment. ## Backend sizes @@ -60,7 +63,7 @@ Dynamic backends are **not** available on the [Free Plan (Pay As You Go)](/manag ### Auto-resource detection -DuckDB automatically detects available CPU and memory. You can also set resource limits manually with the `threads` and `max_memory_mb` parameters in the transformation configuration. 
+DuckDB automatically detects available CPU and memory. You can also set resource limits manually with the `threads` and `max_memory_mb` parameters in the transformation configuration. ## Block-based orchestration diff --git a/src/content/docs/transformations/duckdb/snowflake-migration.md b/src/content/docs/transformations/duckdb/snowflake-migration.md index ccf7d850d..a130fb217 100644 --- a/src/content/docs/transformations/duckdb/snowflake-migration.md +++ b/src/content/docs/transformations/duckdb/snowflake-migration.md @@ -6,7 +6,7 @@ keywords: - Snowflake to DuckDB migration - migrate Snowflake transformation DuckDB - DuckDB SQL differences -type: how-to +type: tutorial --- diff --git a/src/content/docs/transformations/index.md b/src/content/docs/transformations/index.md index fddb32eb0..b9d45c856 100644 --- a/src/content/docs/transformations/index.md +++ b/src/content/docs/transformations/index.md @@ -192,11 +192,11 @@ Python and R transformations. ✓ - Phases + Phases Not available - Dependencies + Dependencies Not available diff --git a/src/content/docs/transformations/mappings/index.md b/src/content/docs/transformations/mappings/index.md index 3cdd04bcf..f24c80568 100644 --- a/src/content/docs/transformations/mappings/index.md +++ b/src/content/docs/transformations/mappings/index.md @@ -162,7 +162,7 @@ This function is automatically enabled in transformations. ##### Read-only input mapping -*Note: You must be using [new transformations](/transformations/#new-transformations) to see this feature.* +*Note: You must be using [transformations](/transformations/) to see this feature.* When **read-only input mappings** are enabled, you automatically have read access to all buckets and tables in the project (this also applies to linked buckets). Alias tables are materialized as database VIEWs and are fully accessible via read-only input mappings — including filtered aliases and aliases from linked buckets. 
diff --git a/src/content/docs/transformations/oracle/index.md b/src/content/docs/transformations/oracle/index.md index d9dc31ec8..f730806a3 100644 --- a/src/content/docs/transformations/oracle/index.md +++ b/src/content/docs/transformations/oracle/index.md @@ -36,7 +36,7 @@ GRANT CREATE TABLE TO KEBOOLA_TRANSFORMATION; 1. Open **Components → Transformations**, click **New Transformation**, and choose **Oracle Transformation**. 2. Open the **Database Credentials** link in the configuration. 3. Enter the host, port, database/service, username, and password for the `KEBOOLA_TRANSFORMATION` user. -4. **(Optional) Schema** — set this to run the transformation against a specific Oracle schema. Leave it empty to use the connected user's default schema. +4. **(Optional) Schema** — an optional `schema` field under the database connection (`db.schema` in the configuration). Set it to run the transformation against a specific Oracle schema; leave it empty to use the connected user's default schema. 5. Click **Test Credentials**, then **Save**. 
## Step 3 — Map the input diff --git a/src/content/docs/transformations/r-plain/array-splitter.md b/src/content/docs/transformations/r-plain/array-splitter.md index 47c9b2a76..61eac39f5 100644 --- a/src/content/docs/transformations/r-plain/array-splitter.md +++ b/src/content/docs/transformations/r-plain/array-splitter.md @@ -6,7 +6,7 @@ keywords: - R array splitting - split array column R - R transformation example -type: how-to +type: tutorial redirect_from: - /manipulation/transformations/r/array-splitter/ - /transformations/r/array-splitter/ diff --git a/src/content/docs/transformations/r-plain/binary.md b/src/content/docs/transformations/r-plain/binary.md index d6ab3ee2c..1f6e62e3b 100644 --- a/src/content/docs/transformations/r-plain/binary.md +++ b/src/content/docs/transformations/r-plain/binary.md @@ -7,7 +7,7 @@ keywords: - R transformation binary - R trained model file - R transformation example -type: how-to +type: tutorial redirect_from: - /manipulation/transformations/r/binary/ - /transformations/r/binary/ diff --git a/src/content/docs/transformations/r-plain/plots.md b/src/content/docs/transformations/r-plain/plots.md index da74ad386..7d2625610 100644 --- a/src/content/docs/transformations/r-plain/plots.md +++ b/src/content/docs/transformations/r-plain/plots.md @@ -7,7 +7,7 @@ keywords: - R charts graphs - R transformation plot - R transformation example -type: how-to +type: tutorial redirect_from: - /manipulation/transformations/r/plots/ - /transformations/r/plots/ diff --git a/src/content/docs/transformations/snowflake-plain/reference.md b/src/content/docs/transformations/snowflake-plain/reference.md index 266f9f871..bc3e427be 100644 --- a/src/content/docs/transformations/snowflake-plain/reference.md +++ b/src/content/docs/transformations/snowflake-plain/reference.md @@ -15,25 +15,24 @@ type: reference Reference material for [Snowflake SQL transformations](/transformations/snowflake-plain/). 
To create one, see the [how-to](/transformations/snowflake-plain/how-to/); for when and why to use them, see the [explanation](/transformations/snowflake-plain/explanation/). - + ## Limits | Limit | Value | Notes | |---|---|---| -| Query runtime | 7,200 seconds (default) | Long-running queries are cancelled past this. | -| Comment length | 8,192 characters | Queries containing a comment longer than this will segfault. | +| Query runtime | 7,200 seconds (default) | Long-running queries are cancelled past this. | +| Comment length | 8,192 characters | Queries containing a comment longer than this will segfault. | | Constraints | Defined but not enforced | `PRIMARY KEY` / `UNIQUE` are accepted but [not enforced by Snowflake](https://docs.snowflake.com/en/sql-reference/constraints-overview). | Snowflake is a cloud database that ships continuous updates and behavioral changes. Track them in the official [Snowflake release notes](https://docs.snowflake.com/en/release-notes/overview). ## Loading type (copy vs. clone) -When data is loaded into a Snowflake transformation there are two methods — **copy** and **clone**. They are configured on the input mapping; see [loading type](/transformations/mappings/#loading-type-snowflake-and-bigquery). +When data is loaded into a Snowflake transformation there are two methods — **copy** and **clone**. They are configured on the input mapping; see [loading type](/transformations/mappings/#loading-type-snowflake-and-bigquery). ## Backend sizes (dynamic backends) @@ -67,7 +66,7 @@ SET ABORT_TRANSFORMATION = ( ); ``` -The example sets `ABORT_TRANSFORMATION` to `'Integrity check failed'` when the `INTEGRITY_CHECK` table has one or more rows with `RESULT = 'failed'`. An empty string does not abort. +The example sets `ABORT_TRANSFORMATION` to `'Integrity check failed'` when the `INTEGRITY_CHECK` table has one or more rows with `RESULT = 'failed'`. An empty string does not abort. ## Identifier case sensitivity