Skip to main content

Dataform Basics

note

This is a just a short overview of Dataform need in the context of the package description. For more details please read official documentation.

What is Dataform?#

A few main facts about Dataform:

  • it's a tool for T (transformation) in you ETL (extract, transform, load) processes
  • it's already in BigQuery, you don't need to install anything
  • it's free
  • it uses JavaScript
  • you could create your own package (like this one)
  • it's really cool (and super cool compared to scheduling queries)

How does it work?#

SQLX#

You should define .sqlx files with SELECT statement and configuration block (they call them actions). All actions should be inside definitions folder or it subfolders.

So for example if you create definitions/test.sqlx file:

config {    type: "table",}select "dataform" as cool_tool

And if you run action (by clicking "Start Execution / Actions" and selecting "test") you will get table test inside dataform dataset with one row and one column full of true.

Magic! You just define SELECT and dataform creates table for you. And each time you run action it will recreate table, in our case you will always have only one row with the same result.

Of course you could query any BigQuery table in your SELECT for example GA4 export tables.

Dataform support a few types of actions:

  • table
  • incremental table
  • assertion
  • view
  • operation
  • test

Incremental tables#

In our context the most important type is the incremental. Incremental type generate two queries: one for creating tables (non-incremental) and one for updating tables (incremental). For updating Dataform generate INSERT statement and for each run instead of recreating table it adds new rows into the table.

If in our example we change action type to incremental:

config {    type: "incremental",}select "dataform" as cool_tool

And run test action twice we will get table with two rows.

So for GA4 data processing it means we could define definitions/sessions.sqlx file with SELECT statement that queries GA4 raw events only for the last day and returns sessions. The package provides Session and Event classes with publish method that generates SELECT statement for you. And to keep data fresh you need to run this action every day as soon as GA4 exports data to BigQuery.

One more benifit of incremental type, if for provide uniqueKey parameter, Dataform will generate MERGE statement instead of INSERT. The beauty of MERGE is that it INSERT new values and UPDATE existing ones based on uniqueKey. It means if for some reason you run the same action twice on the same day - you don't double the same sessions, you just refresh (UPDATE) them.

The package by default creates incremental actions with uniqueKey parameters [session_id,date] and [event_id,date] for sessions and events tables respectively.

And also Dataform provides JavaScript API you could use in your actions.

One of this helpers is incremental() function, that returns true when the current context is incremental. And using this function you could select different source tables for the first run and for the next runs, or SELECT from all days for the first run and only from the last day for next runs. Session and Event provides incrementalTableName and nonIncrementalTableName properties to set tables for incremental and non-incremental context.

ref()#

ref() is one of the most important Dataform functions. It allows you to reference other actions (table). And the benefit of using it is that Dataform could build dependencies based on ref() function. For example, if you create report and want to select data from sessions table generated by the package, you could write something like this:

config {    type: "table",}SELECT * FROM ${ref("sessions")}

And when you run all actions for daily update you don't need to care about execution order. At first Dataform will run sessions action and only after that lp_report action (because it depends on sessions). So that's why it's important to use ref() in all your actions.

Define sources#

You could use ref() to reference all actions created in your Dataform project. But to reference source tables that exist outside of Dataform, you should declare them first.

Dataform has a special action for this: declare. You could create definitions/sources/ga4.sqlx file with the following code:

   config {      type: "declaration",      schema: "analytics_XXXXXX",      name: "events_XXXXXX",    }

And after thatm you could query from this table like this:

select * from ${ref("events_XXXXXX")}

The package also expected that sources should be defined first. And it provides a special helper method: declareSources().

So for example you could create definitions/sources/ga4.js file with the following code:

const ga4 = require("dataform-ga4-sessions");// Define your configconst config = {  dataset: "analytics_XXXXXX",  incrementalTableName: "events_XXXXXX",};// Declare GA4 source tablesga4.declareSources(config);

And after that, you and the package could use ref("events_XXXXXX") in the actions.

Dataform project structure#

And a few words about Dataform folders. Dataform's documentation provides recomendations about project structure. Inside definitions folder you should create subfolders:

  • sources - for source definitions
  • staging - for intermediate tables
  • reporting - for reporting tables

It's a good starting point, before you really grow up to dozens of actions.

Tags#

One more important Dataform concept is tags. You could add tags to your actions and then run only actions with specific tags. For example you could add staging tag to all actions inside staging folder and run only them. Or you could add a daily tag to all actions you want to run daily.

The package by default add tag with source datset name (like analytics_XXXXXX) for all actions it creates. The reason behind this is that you could export data from a few GA4 accounts in the same GCP. And in this case, you need to separate configuration and execution for each GA4 account. So you could run only actions for specific GA4 dataset with specific tag (dataset name).