Rittman Analytics

View Original

Announcing our Automated dbt-to-Dataform Migration Tool, Powered by Open AI

At Rittman Analytics we’ve been advocates of dbt (“Data Build Tool”) and a partner of dbt Labs since back when they were known as Fishtown Analytics and dbt was their project accelerator toolkit.

Increasingly though and especially for clients who are invested in the Google Cloud ecosystem we’re being asked to use Dataform, rather than dbt, to perform these tasks. 

Dataform is a SQL-based data transformation tool, similar to dbt, but intended these days to work natively with Google BigQuery and run as a fully-managed service within Google Cloud. 

Crucially though for Google Cloud customers whilst Dataform Core, like dbt Core, is free to use any supporting services spend on scheduling and orchestration, for example, comes off of their existing committed Google Cloud spend rather than having to pay-for (and agree legals for) a separate service such as dbt Cloud.

A comparison of Dataform to dbt is something for another blog post but for organizations such as ours with considerable investment and code libraries written for dbt, having to convert those by-hand to Dataform is quite a daunting task. For example;

  • while both dbt and Dataform have the concept of models which in Dataform are called “tables”, Dataform’s syntax is of course different and tests, for example, are defined in-line rather than in separate .yml properties files

  • dbt models are defined using SQL and Jinja code, whereas Dataform tables are defined using SQL and Javascript code

  • dbt project configurations are stored in dbt_project.yml and defined in YAML whereas Dataform configurations go into either dataform.json in JSON format, or more recently workflow_settings.yaml stored in YAML format; and so on.

To help us translate our dbt repositories as well as public ones such as Jaffle Shop into Dataform syntax and configuration settings, we’ve been building a dbt to Dataform Migration Tool that partially automates the process of converting dbt (data build tool) projects to Dataform, focusing on BigQuery as the data warehouse and available as open-source Python code on our Github site.

It’s not yet a turn-key solution that handles all aspects of the migration for you, but it will take care of the routine steps so that you can concentrate on manually converting the more complex parts; in its initial 0.1.0-alpha release it:

  • Converts dbt models to Dataform SQLX files, with limitations

  • Translates dbt source definitions to Dataform declarations

  • Converts dbt macros to Dataform functions using GPT-4 (requires OpenAI API key)

  • Preserves project structure, adapting it to Dataform best practices

  • Handles (with limitations) dbt-specific Jinja syntax and converts it to JavaScript

  • Supports conversion of dbt variables to Dataform project config variables

  • Automatically converts common dbt_utils functions to their BigQuery equivalents

  • Uses GPT-4 to check and correct Dataform syntax in converted files (requires OpenAI API key)

  • Generates a detailed conversion report highlighting potential issues and syntax corrections

So how does this migration tool work, and how does it use the OpenAI API to assist in some of the more complex migration tasks?

Our approach to automating this code conversion was to use a combination of rule-based transformations and AI-powered processing for more complex scenarios. This hybrid approach enables the tool to handle both straightforward translations and nuanced, context-dependent conversions.

While the process is largely automated, it is designed to complement rather than replace human expertise. The tool provides a solid foundation for migration, but user intervention will be necessary for project-specific issues and handling of unsupported features.

The migration process is made-up of seven steps, plus manual review and completion by yourself afterwards.

  1. The RepositoryAnalyzer modules scans your dbt project structure to identify models, tests, macros, and YAML files.

  2. The ProjectConfigConverter modules translates your dbt_project.yml package settings to the equivalent dataform.json format, configuring your target warehouse connection settings and declaring project-wide settings and variables.

  3. The SourceConverter module processes dbt source definitions and creates individual .sqlx source table definitions for each discovered source.

  4. The ModelConverter module then translates each dbt SQL model to a Dataform .sqlx file, automatically handling reference conversions, variable replacements, and macro translations.

  5. The MacroConverter module then transforms dbt macros into Dataform JavaScript functions by sending them, wrapped in a prompt, to the Open AI API for conversion, like this:

See this content in the original post

6. Model and metadata conversions are then passed to the SyntaxChecker module which again uses the OpenAI API, this time to review each converted model against Dataform syntax rules and best practices, like this:

See this content in the original post

7. Finally, the ConversionReport module outputs a report on the migration process, highlighting potential issues, syntax corrections, and areas needing manual review.

Current limitations in the migration process are:

  • Complex dbt macros may require manual adjustment after conversion

  • Custom dbt tests might need additional implementation in Dataform

  • The tool assumes a BigQuery setup; adjustments may be needed for other warehouses

  • Certain dbt-specific features, such as dbt’s semantic layer, might not have direct equivalents in Dataform and are therefore out-of-scope

  • While this converter handles many aspects of dbt projects, some features are not currently supported or require manual intervention including seeds, snapshots, custom tests, documentation, exposures and advanced hooks, though we do intent to cover most of these off over time

The python package and all its code are available as open-source at https://github.com/rittmananalytics/ra_dbt_to_dataform and we’d welcome pull requests and issue reports. 

And … like London buses where there’s none for ages and then two turn-up at once, my old friend Christian Screen released his own dbt to Dataform converter a week or so ago, so check his out as well and I’m sure there’ll be some cross-project collaboration at some point.

INTERESTED? FIND OUT MORE!

Rittman Analytics is a boutique data analytics consultancy that helps ambitious, digital-native businesses scale-up their approach to data, analytics and generative AI.

We’re authorised delivery partners for Google Cloud along with Oracle, Segment, Cube, Dagster, Preset, dbt Labs and Fivetran and are experts at helping you design an analytics solution that’s right for your organisation’s needs, use-cases and budget and working with you and your data team to successfully implement it.

If you’re looking for some help and assistance with your Dataform initiative or would just like to talk shop and share ideas and thoughts on what’s going on in your organisation and the wider data analytics world, contact us now to organise a 100%-free, no-obligation call — we’d love to hear from you!