DBT & Python - How to write reusable and testable pipelines
Formal Metadata
Series: EuroPython 2024, talk 115 of 131
License: CC Attribution - NonCommercial - ShareAlike 3.0 Unported
Identifier: 10.5446/69419 (DOI)
Transcript: English (auto-generated)
00:04
What I'm going to present to you today is basically lessons learned from implementing production data pipelines with DBT on Snowflake. So it's not a library that we wrote ourselves, it's just the way we use DBT.
00:21
Before I get started, some things about myself. My name is Florian, I'm from Germany, I'm based in Berlin. I have a very energetic little son - I just attended a talk about how to teach coding to kids - so he's the center of my life. I myself am already 40 years old, and the last time I was
00:43
in Prague was actually 27 years ago, so, yeah, I got old. I have a background in mathematics, a little bit in databases, and in software architecture. And for three years now I have worked for a company called Flatiron Health.
01:00
And what we do is we work in the area of real-world evidence. What that means is we take data from operational systems, from hospitals and practices, to create databases for research, and we focus on oncology. So it's real-world oncology data, and we do this in the US,
01:23
in the UK, in Japan, and in Germany, which here stands in for the European Union. And this has two implications that are somewhat relevant for the perspective of this talk. First of all, it's not really a big data problem. We have genetics databases, but most of the data that we actually process is not really big data,
01:41
I mean, not in terms of today's databases. But these are very complicated data models. Before, I worked for eBay, and the data I was processing was like clicks on a webpage, right? So a very different problem. And since we do this in different countries, and each country has different data privacy regulations,
02:02
and we deal with very sensitive data, so we have very segmented environments. For example, I can never access the data of patients from the UK. But we still want to build systems that are shareable across each market, each country. And the way we approach it looks roughly like this.
02:22
I mean, as all of you know, a diagram is always wrong. But I try to at least outline what we're doing. So we pull data from the source systems, it's like all kinds of different computers standing in the basements of hospitals, sometimes cloud-based APIs, from all the sources.
02:41
We divide the data into two categories: one we call structured data, and the other unstructured data. The line is somewhat blurry, because if most of the information is in a free-text field, that already qualifies as unstructured data. And we process that through all kinds of steps.
03:01
And in the end, we harmonize the data, we normalize the data, and put it together, and then we build different data models. Because one thing I learned working for this company is that cancer is not a single disease, it's in fact a collection of a thousand diseases. For each cancer type, we develop our own data models, which in the end is a set of tables, which we then deliver
03:22
as data marts to our life science partners. And then we serve them in a secure environment. In the medical sphere, you call this a TRE, a trusted research environment. It's basically a set of notebooks and databases where we can basically control that the people don't join the data with outside data.
03:43
So basically, to prevent probabilistic re-identification of the data. Okay. With that out of the way, here's the agenda. I have one smarty-pants slide, just one, so please bear with me. Then we have a quick look at DBT to recap the concepts that are important for the talk.
04:02
And then we talk about testability, first with data tests, with dbt-expectations, and then unit testing for SQL models and Python models. And as I said, I will skip the part about reusability. I will mention it maybe a couple of times, and if you're interested, you can just reach out to me. It turned out that would have been a second talk.
04:22
Okay. Let's get started with the smarty-pants slide. This is the only slide where I mention a large language model - I had the feeling that I had to. My point is that it's very important that you develop a system
04:40
in which you can test your components in isolation and which is easy to observe. Right? I think we all agree that this is very important. My point here is that if you design your system for that, you get a lot of other important properties out of the box - at least that's how I think about it. So if you develop a system in which you can test your components
05:01
in isolation, and you can observe each component, then it's also evolvable. And if a system is evolvable, you can make it more secure, you can make it faster, you can change it. Right? But yeah, I'm happy to be challenged on that. And for me, this testing and observing are somewhat related.
05:22
And they are related via data tests, in my mind. There are, of course, unit tests - probably some of you will point out that what I show later is not a unit test, but a unit test means you can test a component in isolation. And a data test is really related to metrics: what you do is define metrics on which you fail or alert
05:45
when certain statistical properties are not met. So that was the smarty-pants slide. From now on, it will be just patterns and code. Okay. Let's get started with DBT. Of course, DBT is much more than what I'm going to show you,
06:01
but I want to introduce the most important concepts that matter for testing. The first concept, a very important one, is sources. A data source in DBT is a configuration file. And one pattern that I would recommend is that we have a sources folder
06:24
in our models folder, and per data source, we have a directory. And then for each, let's say, table of the data source, we have a configuration file. When we started, we had just one configuration file that then had like 10,000 lines of code. That's not really maintainable. So basically, with that structure, it's very easy
06:42
to identify the tables and the data sources. The first thing you do is define the location where DBT will read the data from - so it has a database and a schema property. Another recommendation is to use environment variables to allow yourself to overwrite those values. This is especially important if you want to have different deployments,
07:02
like a dev, a staging, and a production deployment. The second important aspect is that you then define your tables. At the least you need a table name; ideally, you also list the columns. We will come back to that later. So that's the first important concept: how you define the data sources in a DBT pipeline.
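As a minimal sketch of such a per-table source file - the source name, schema, and columns here are invented for illustration, not taken from the talk - it could look like this:

```yaml
# models/sources/shop/orders.yml - hypothetical example of one source table per file
version: 2

sources:
  - name: shop
    database: "{{ env_var('SOURCE_DATABASE', 'RAW') }}"   # overridable per deployment (dev/staging/prod)
    schema: "{{ env_var('SOURCE_SCHEMA', 'SHOP') }}"
    tables:
      - name: orders
        columns:
          - name: order_id
          - name: customer_id
          - name: ordered_at
```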
07:22
The second important concept is SQL models. A SQL model, in the end, is a SQL file that defines a select statement. In fact, DBT uses Jinja, so it's a template with those curly braces. And with the curly braces,
07:41
you reference other models and, for example, sources. You can hard-code table names, but you should not, right? You should always use refs and sources to reference the data the model depends on. The source function is a built-in macro, and you can define your own macros and reference them - it's a bit like defining a function, importing it, and calling it.
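A sketch of what such a model could look like - the table names and the total_amount macro are illustrative assumptions, not the speaker's actual code:

```sql
-- models/staging/orders.sql - hypothetical SQL model built from two sources
select
    o.order_id,
    o.customer_id,
    {{ total_amount('oi.price', 'oi.quantity') }} as total_amount  -- custom macro from the macros folder
from {{ source('shop', 'orders') }} as o                           -- built-in source() macro
join {{ source('shop', 'order_items') }} as oi
    on oi.order_id = o.order_id
group by
    o.order_id,
    o.customer_id
```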
08:02
I will talk more about macros later. Those models, by convention, live in a folder called models. You can change that, but that's the convention. So much for SQL models. Next important concept: Python models.
08:21
DBT has supported Python for about a year now. And since DBT pushes all computation to the data warehouse, that means you need a data warehouse that can run Python code. As far as I know, the only three data warehouses currently on the market that can do that are Snowflake, Databricks, and BigQuery, but I might be wrong.
08:41
There was an adapter for local execution of Python, but this project has been abandoned. So you can only make use, as of now, of Python models if you have one of those data warehouses. And then the Python model is basically a Python module which defines a function called model, which gets passed in a DBT object, which you should use.
09:01
It also gets passed in a database session, which you should not use. And it should return a data frame. In this example, I'm returning a pandas data frame. But if you have large data, then you need probably a data warehouse native data frame with lazy execution, like a PySpark data frame or a Snowpark data frame. But if your data fits into memory,
09:22
you can also work with pandas. And then you can do similar things as with SQL models: you have this DBT object, and you reference a source by calling the source function. And, similar to macros, you can import your own functions and call them. I will talk about this later.
09:42
The very nice thing about doing this is that you don't need to define stored procedures or anything like that - it's just normal Python code without any additional setup. That's all about the Python models.
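A minimal sketch of such a Python model, with made-up source and helper names (derive_region stands in for whatever custom function you would import):

```python
# models/staging/customers_enriched.py - hypothetical sketch of a dbt Python model
import pandas as pd

from my_package.regions import derive_region  # your own code, made available to the warehouse


def model(dbt, session):
    # the dbt object declares dependencies and config; the session argument is best left alone
    customers = dbt.source("shop", "customers").to_pandas()

    # plain pandas logic; for large data, use a lazily evaluated Snowpark or PySpark DataFrame instead
    customers["region"] = customers["country"].map(derive_region)

    return customers  # dbt materializes the returned data frame as a table
```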
10:00
The next concept is the lineage graph. The main reason why you should always use those refs and sources and not hard-code other tables is so that you allow DBT to build up the lineage graph. Let's, for example, look at a mart model. This mart model references two other models via the built-in ref macro: the revenue model references the customers model and the orders model, which I showed you before.
10:22
And as a reminder, the customers model references the customers source, and the orders model references the order items source and the orders source. With this information, DBT will build up the lineage graph - the DAG, the execution graph.
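As a sketch, such a mart model could look like this (illustrative names, not the speaker's actual revenue model):

```sql
-- models/marts/revenue.sql - hypothetical mart model; ref() is what lets dbt build the DAG
select
    c.customer_id,
    c.region,
    sum(o.total_amount) as amount
from {{ ref('orders') }} as o
join {{ ref('customers') }} as c
    on c.customer_id = o.customer_id
group by
    c.customer_id,
    c.region
```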
10:40
This is very important so that DBT knows in which order to build your models, but it also gives you very nice documentation. This screenshot comes from the built-in DBT documentation: you can just run dbt docs generate, and then you get a website. It's very helpful. First of all, you see your DAG, and you can zoom in, you can filter. And if you click on those nodes,
11:01
you get meta information. And the meta information also contains the compiled code, so the actual SQL that's running, which is very, very important for debugging. I mean, my code always runs, and I never have to debug my own code, but sometimes you have to. That was the lineage graph. The last concept, and then we're done with the concepts,
11:23
are materializations. Materializations are all about how DBT writes your data. I would start with specifying the target database and a base schema. There are different ways to do that; we do this with a profiles file. You can put a profiles file into your repository, which
11:44
would contain the credentials - well, not your credentials themselves, but environment variables that read your credentials, of course. It also has a database and a schema property. And as with sources, I would recommend allowing yourself to overwrite the target database and schema by environment variables.
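A sketch of such a profiles.yml for Snowflake - the profile name, environment-variable names, and defaults are assumptions:

```yaml
# profiles.yml - hypothetical; credentials and target location come from environment variables
my_project:
  target: default
  outputs:
    default:
      type: snowflake
      account: "{{ env_var('SNOWFLAKE_ACCOUNT') }}"
      user: "{{ env_var('SNOWFLAKE_USER') }}"
      password: "{{ env_var('SNOWFLAKE_PASSWORD') }}"
      role: "{{ env_var('SNOWFLAKE_ROLE') }}"
      warehouse: "{{ env_var('SNOWFLAKE_WAREHOUSE') }}"
      database: "{{ env_var('TARGET_DATABASE', 'ANALYTICS') }}"  # overridable per deployment
      schema: "{{ env_var('TARGET_SCHEMA', 'MAIN') }}"           # base schema for this run
```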
12:02
This allows you to easily do things like having a production pipeline where you first build a new release candidate, then compare the databases, and then switch over to the latest release. And that is then as easy as just setting
12:20
some environment variables. The next part of materializations is that you want to configure how your models are materialized. This happens in the main DBT configuration file, the dbt_project.yml - there's a models section.
12:40
There you can specify, for each folder or subdirectory of your models folder, a schema, which will be appended as a suffix to the schema that you defined in the profiles YAML. So let's say the profiles YAML defines the schema main; then the staging models would be written to a schema main_staging.
13:00
You don't have to do this, but I think it is a good idea, if you have that folder structure in your repository, to also have a similar structure in your data warehouse. And then you can define how those models are materialized. You can override this on a model level, but I think it's a good idea to use
13:20
the structure of your repository for conventions rather than having individual configurations per model.
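As a sketch, that part of dbt_project.yml could look like this (project and folder names are illustrative):

```yaml
# dbt_project.yml (excerpt) - hypothetical per-folder schemas and materializations
models:
  my_project:
    staging:
      +schema: staging        # written to <base schema>_staging, e.g. main_staging
      +materialized: view
    marts:
      +schema: marts
      +materialized: table    # Python models would have to live in a folder materialized as table
```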
13:43
Two things you should pay attention to. First, Python models can only be materialized as tables, and they can only read views and tables. Second, if you have very complicated SQL models with a lot of macros, you won't understand the Jinja template anymore. You can look at the compiled code in the DBT docs to understand what's happening, but sometimes it's easier to do this in the data warehouse. Then you first want to create a view,
14:02
maybe a materialized view, and then a table out of that. That makes it much faster to fix things in the data warehouse, but I think it's a matter of taste. That was the materialization section, and that concludes our quick look at DBT. Not everyone is looking at their phone so far, so I have not lost you yet. That's very good.
14:21
So then let's go to the main chapter, about testability in data pipelines. I will start with the data tests. But before we get to data tests, we brush up on one more concept: DBT packages. There's a packages.yml file in which you can list the external packages you want to install.
14:41
external packages you want to install. And there's one package that everyone should use, is DBT expectations. That is a port of great expectations to DBT, and it removes the need from writing custom tests to 90% of the cases, I would argue. So use DBT expectations. So it's an awesome project.
15:02
Okay, packages. So now about data tests. So I classified this as patterns, because what we do is we test our source data, because we obviously don't trust our data providers. And what we do mainly is integrity tests, because we do not understand the data yet. So what we do is we test the shape of the data
15:22
and test the data integrity. Then we build our own data marts, and then on the mart layer, we actually apply statistical tests, because now we understand the data, and now we can formulate proper expectations, like genetic types should be distributed following a certain distribution, and that you can test.
15:41
These are data tests, and the logic in between we test with unit tests, which I will show later. Let's start with the source tests. As I mentioned before, you represent a data source via a configuration file, and in this configuration file, for each column, you can add a data tests property and then do very simple tests, like not_null,
16:03
which is very important, because the data warehouses do not enforce that, so you can add a not null constraint to your create table statement, but most data warehouses will just ignore that. So it's good to test that. Then you can test the data type, because you don't control the source data, and you can test that this column is unique.
16:24
You can also test referential integrity with relationships. For example, you can say that the customer ID must reference an ID in the customers table. And you can, of course, write more complicated tests, but we leave those out on the source layer.
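Put together, such source-level tests could look like the following sketch, extending the earlier hypothetical orders.yml (the choice of dbt-expectations type test is an assumption):

```yaml
# models/sources/shop/orders.yml - hypothetical integrity tests on a source table
version: 2

sources:
  - name: shop
    tables:
      - name: orders
        columns:
          - name: order_id
            data_tests:               # called `tests:` in older dbt versions
              - not_null
              - unique
          - name: ordered_at
            data_tests:
              - dbt_expectations.expect_column_values_to_be_of_type:
                  column_type: timestamp
          - name: customer_id
            data_tests:
              - not_null
              - relationships:        # referential integrity
                  to: source('shop', 'customers')
                  field: id
```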
16:40
On the source layer, we test the data integrity. Then on the mart layer, so the final tables, we test more complicated logic, and here I picked the revenue model. For example, there's an amount column, and what we test here first of all, with dbt-expectations, is the range of allowed values.
17:00
So it should be between zero and, I think, 10 million. A little more complicated, you can test a basic distribution: you can say that 95% of the rows should have an amount of at least 1 million.
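A sketch of those two column-level checks with dbt-expectations (the numbers and the quantile encoding of the 95% rule are illustrative):

```yaml
# models/marts/revenue.yml (excerpt) - hypothetical column-level tests on the mart
models:
  - name: revenue
    columns:
      - name: amount
        data_tests:
          - dbt_expectations.expect_column_values_to_be_between:
              min_value: 0
              max_value: 10000000
          - dbt_expectations.expect_column_quantile_values_to_be_between:
              quantile: 0.05          # the 5th percentile must be >= 1,000,000,
              min_value: 1000000      # i.e. at least 95% of rows have an amount of at least 1 million
```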
17:21
To make it more complicated, you can also apply data tests on a table level. Like this, you have access to multiple columns. For example, we constructed a new column with array_construct - it's a Snowflake function, but similar things exist in other data warehouses - where you basically merge two columns, and we test that the combination of those columns is unique. And this next one is a more realistic data test.
17:41
It's also a bit more complicated. We first filter for all the rows that have an amount of over 5 million - we now look at those rows. And then - I switched the condition here, but it doesn't matter - we construct a new column: if the region is east, the value is one,
18:01
otherwise the value is zero. And then we test that at least 75% of the rows have a value of zero, meaning that of those rows with an amount of over 5 million, at least 75% are not from the east region.
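A sketch of how those model-level tests could be written - mapping the description onto dbt-expectations tests is my assumption, so treat the test names and arguments as something to verify:

```yaml
# models/marts/revenue.yml (excerpt) - hypothetical model-level tests
models:
  - name: revenue
    data_tests:
      # uniqueness of a combination of columns (the talk used array_construct for the same idea)
      - dbt_expectations.expect_compound_columns_to_be_unique:
          column_list: ["customer_id", "region"]
      # among rows with amount > 5 million, at most 25% may come from the east region
      - dbt_expectations.expect_column_mean_to_be_between:
          column_name: "iff(region = 'EAST', 1, 0)"   # 0/1 flag; iff() is Snowflake-specific
          max_value: 0.25
          row_condition: "amount > 5000000"
```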
18:21
A practical example for us is that we test the distribution of patients who are privately insured. So basically, if you have a non-numerical value, you can transform it with such a statement into a number and then apply distribution tests to it. That concludes the section about data tests on the mart level, and that's all that I have about data tests.
18:45
and that's all that I have about data tests. Now I switch over to testing the code, basically writing unit tests. And the first part will be unit testing SQL models, and that's actually since a couple of months a built-in feature.
19:04
When I planned the talk, I felt very smart because DBT did not support that, so we had to write our own scaffolding code for it, but now DBT has introduced it, so it's now very simple. So we look at a SQL model. Important for unit tests are, of course, the inputs and the outputs.
19:21
The inputs here are two sources, the order items and the orders. This is another reason why you should not hard-code tables: you could not unit test them then. And the outputs are the columns that are produced, so you have four columns. What you then do is add some configuration;
19:41
we specify it in a new configuration file. You could add this configuration to an existing configuration file, but we split it out - I will say something about this in a second. So basically, there is a configuration file with a unit-tests section; in the given section, you specify values for the inputs, for the order items and the orders,
20:01
and then you define your expectation. And then DBT will run this code. Two notes on that. This requires a database connection, so people argue that it's not a unit test, but I will ignore this discussion. And the second thing: I'm using SQL here to define the data. There are multiple formats - you can use CSVs, JSONs, or YAML.
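A sketch of such a unit-test file, using dbt's built-in unit tests (dbt 1.8+) with SQL-format fixtures; the names and values follow the earlier hypothetical orders model:

```yaml
# specs/staging/orders_unit_tests.yml - hypothetical unit test for the orders model
unit_tests:
  - name: orders_sums_price_times_quantity
    model: orders
    given:
      - input: source('shop', 'orders')
        format: sql                      # SQL gives full control over column names and types
        rows: |
          select 1 as order_id, 10 as customer_id, cast('2024-01-01' as date) as ordered_at
      - input: source('shop', 'order_items')
        format: sql
        rows: |
          select 1 as order_id, 2 as quantity, 5.0 as price
    expect:
      rows:
        - {order_id: 1, customer_id: 10, total_amount: 10.0}
```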
20:23
I have another example later. You want to use SQL when you need control over how the data looks; otherwise, if you specify your data with CSV, DBT will generate SQL for you, and that might not be what you want. For example, if your column names are weird, like numbers, then DBT will not compile correct code.
20:43
With SQL, you have full control over how the data will actually look. One pattern: what we do is we move the tests into a configuration file that lives in a different folder. The reason is that we want to have all the unit tests for SQL models,
21:00
Python models, and macros next to each other. So we introduce a new specs folder, and in order to enable that, we need to tell DBT that it should also look in the specs folder for model configuration. So in our main configuration file, we just add specs to the model paths.
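In dbt_project.yml that is a one-line change (the folder name is our own choice):

```yaml
# dbt_project.yml (excerpt) - let dbt also pick up configuration from the specs folder
model-paths: ["models", "specs"]
```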
21:21
We don't have to do that, but this is what works well for us. That's about SQL model unit testing. Now I will talk about macros. Macros are for us a very important concept of DBT. First, what are they? The model I showed you three times
21:41
uses two custom macros, for example the total amount macro. A macro is something you can call with the curly-braces syntax, and a macro is, in the end, a SQL file that lives in the macros folder. Again, it's a Jinja template, and you can use other macros with curly braces. For example, you can define input parameters and reference them with the double-curly-braces syntax.
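A sketch of such a macro, matching the hypothetical total_amount call used earlier:

```sql
-- macros/total_amount.sql - hypothetical custom macro with input parameters
{% macro total_amount(price_column, quantity_column) %}
    sum({{ price_column }} * {{ quantity_column }})
{% endmacro %}
```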
22:02
The reason we use macros is twofold. First of all, it allows you to create smaller components that you can test - we can basically divide larger models into smaller chunks of macros.
22:24
Instead of creating too many models - though that's a choice you have; you could also just create more models. But the second reason is that this is the primary way we share code between different models. We would talk more about that in the reusability section, but it's very important for us,
22:41
because between different cancer types, some properties or variables are actually the same, have the same definition, and we use macros for them to build up a variable library. Okay, these are macros. And since for us, macros contain a large chunk of the logic, we really want to unit test those macros.
23:00
DBT does not directly support unit testing macros, but it's actually not hard to do. So if you look at that macro, first of all, we again need to look at what is produced - the output, in this case two columns, the ID column and the total amount column. Then we need a test model. In this case, it's just a very simple model;
23:22
it just calls this macro. So we produce a test model, and then we need to look at the input, and that's where it gets a little hacky: we need to define a fake input model. The fake input model just contains the data structure, because DBT needs an input model for inferring the data types.
23:44
So the only thing that matters here are the columns and their types. With that in place, it's a normal unit test of this test model: in the given section, we define values for the input model - here we do this with YAML - and then we phrase our expectations.
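A sketch of the whole pattern - a fake input model, a test model that only calls the macro, and a unit test against that test model; all names are illustrative:

```sql
-- specs/macros/total_amount_fake_input.sql - fake input; only the columns and their types matter
select
    1 as id,
    cast(1.0 as numeric(10, 2)) as price,
    1 as quantity
```

```sql
-- specs/macros/total_amount_test_model.sql - test model that does nothing but call the macro
select
    id,
    {{ total_amount('price', 'quantity') }} as total_amount
from {{ ref('total_amount_fake_input') }}
group by id
```

```yaml
# specs/macros/total_amount_unit_tests.yml - hypothetical unit test of the test model
unit_tests:
  - name: total_amount_is_sum_of_price_times_quantity
    model: total_amount_test_model
    given:
      - input: ref('total_amount_fake_input')
        rows:
          - {id: 1, price: 2.5, quantity: 2}
          - {id: 1, price: 1.0, quantity: 3}
    expect:
      rows:
        - {id: 1, total_amount: 8.0}
```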
24:00
But those fake input models - and I didn't find a better way to do this - are a bit ugly, because in the end, DBT will materialize them in your database, and somehow I want to clean them up. So what we do is, in our project file, we say that the models living in that macros folder
24:21
are written to a temporary schema, materialized as views - it doesn't really matter - and then we have an on-run-end hook which just drops this temporary schema. So that's a macro that drops the schema. You don't have to do this, but it's just so that we don't have a million views lying around in our data warehouse.
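A sketch of that cleanup, assuming dbt's default schema naming (<target schema>_<custom schema>); the folder and macro names are invented:

```yaml
# dbt_project.yml (excerpt) - hypothetical: fake/test models go to a throwaway schema
models:
  my_project:
    specs:
      macros:
        +schema: tmp_macro_tests
        +materialized: view

on-run-end:
  - "{{ drop_macro_test_schema() }}"
```

```sql
-- macros/drop_macro_test_schema.sql - renders the SQL that the on-run-end hook executes
{% macro drop_macro_test_schema() %}
    drop schema if exists {{ target.database }}.{{ target.schema }}_tmp_macro_tests cascade
{% endmacro %}
```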
24:42
Okay, that's how we do macro unit testing. And that concludes this section about unit testing SQL models. Oh, okay, I will get this done, four minutes left. So, unit testing Python models. First, I need to introduce Python models to you,
25:04
or rather, Python packages. This is the Python model I showed you before, and as I mentioned already, you can call your own functions. The way this works: the derive function comes from a custom module that just lives in your code base - it's a derive region function.
25:23
And we can make this available to DBT. Because, again, this Python code runs in your data warehouse, you need to make your Python code available in the data warehouse. The way to do this: there's a config section of this Python model where you can specify, in the imports section, where Snowflake - in this case - should read the code from.
25:41
That's the first step. The second step is that you need to make your code available to your data warehouse. For that, we use an on-run-start hook, which deploys the packages. This is a macro that first creates a schema where you put your code, if it doesn't exist, and then creates a stage. A stage in Snowflake is basically an abstraction
26:01
of object storage - S3, if you're on AWS. And then you upload your code to this internal stage. With that in place, you can basically write normal Python code in your code base and make it available to DBT in your data warehouse.
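A sketch of the two pieces as I understand them - the stage path, the package name, and whether your adapter lets you issue PUT from a hook are assumptions to check against your own setup:

```python
# models/staging/customers_enriched.py (excerpt) - hypothetical config telling Snowflake where the code lives
def model(dbt, session):
    dbt.config(
        materialized="table",
        packages=["pandas"],                                          # resolved by the warehouse
        imports=["@analytics.code.python_packages/my_package.zip"],   # our uploaded code on a stage
    )
    ...
```

```sql
-- macros/deploy_python_packages.sql - hypothetical macro, called from an on-run-start hook
{% macro deploy_python_packages() %}
    {% do run_query("create schema if not exists " ~ target.database ~ ".code") %}
    {% do run_query("create stage if not exists " ~ target.database ~ ".code.python_packages") %}
    {# uploading the zip via PUT goes through the Snowflake connector; adjust if your setup differs #}
    {% do run_query("put file://dist/my_package.zip @" ~ target.database ~ ".code.python_packages overwrite=true auto_compress=false") %}
{% endmacro %}
```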
26:22
And then, to unit test your Python model, it's a normal Python unit test, so I don't have to explain to you how to do that. One note maybe - okay, we'll skip all that, that doesn't matter for you.
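For completeness, a sketch of what such a test could look like, with a hypothetical fake dbt object standing in for the real one (the import path, column names, and the region mapping are made up):

```python
# specs/python/test_customers_enriched.py - hypothetical pytest for the Python model, no warehouse needed
import pandas as pd

from models.staging.customers_enriched import model  # assumed import path


class FakeRelation:
    """Stands in for the object returned by dbt.source() / dbt.ref()."""

    def __init__(self, df: pd.DataFrame):
        self._df = df

    def to_pandas(self) -> pd.DataFrame:
        return self._df


class FakeDbt:
    def __init__(self, sources):
        self._sources = sources

    def source(self, source_name, table_name):
        return FakeRelation(self._sources[(source_name, table_name)])

    def config(self, **kwargs):
        pass  # the real dbt object accepts config(); the fake just ignores it


def test_model_derives_region():
    customers = pd.DataFrame({"country": ["DE", "UK"]})
    fake_dbt = FakeDbt({("shop", "customers"): customers})

    result = model(fake_dbt, session=None)  # the session argument is unused in the model

    assert list(result["region"]) == ["EU", "UK"]  # assumes derive_region maps DE -> EU, UK -> UK
```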
26:44
The last thing I want to mention is that, since we now put the Python unit tests into the specs folder, and DBT scans the specs folder for models, DBT will think: awesome, there's a Python model. And it will fail, because it's not a Python model. But there's a DBT ignore file, and we put the paths to the Python unit tests into the .dbtignore file to tell DBT that these are not Python models.
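The entry itself is just a gitignore-style pattern (the path reflects our hypothetical folder layout):

```
# .dbtignore - hypothetical: keep dbt from treating the Python unit tests as Python models
specs/python/**
```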
27:02
Okay, I managed to finish in time. I'm very happy about that. That concludes the section about unit testing and about testing in general. And if you remember one thing from this talk, it's the next slide. I saw this quote many years ago, and I love it:
27:20
that a good developer is like a werewolf afraid of silver bullets. So what I showed you works well for us. It might not work for you. And with that, thank you so much for listening. I have two minutes to answer questions. Thank you. Any questions?
27:46
Thank you. That was very helpful. I was just wondering: all those tests, in case you have an incremental model where you refresh or update the data regularly, do those tests run just on the batch that is being updated
28:00
or on the entire data set that results? They would run on the entire data set. Can you choose? Do you have any control over that? So, I cannot answer that straight away. There's a way - we actually use custom materializations in some cases for a different use case,
28:20
because we do not want the default behavior; by default, DBT does a create-or-replace. And that should work, but maybe we can discuss it afterwards - I cannot give you an answer on how to do that right now. So that's basically it. Thank you for your talk.
28:41
Question about testing SQL models. When you have multiple sources like 10 plus sources, how do you effectively test them instead of just writing out tons of CSV code? So first of all, you don't have to put everything into one configuration file.
29:00
You can put CSV files in a directory and reference the directory to organize it a little better. But it doesn't free you from creating test data. What I typically do is take production data, anonymize it, and that becomes the test data. So that's basically my workflow. But yes, you need to get the test data from somewhere.
29:20
And honestly, what I do is I trust my implementation: I run the DBT pipeline and then copy the produced table, and that's the expected output. That's kind of - yeah, does that answer your question? Or maybe I didn't get all of it. Right.
29:40
It answers my question. But you know, testing with production data, just copy-pasting from your model output - what's the purpose of this kind of test? So the main purpose is basically - first of all, during implementation, I verify it myself. Let's say,
30:01
if I really want to do test-driven development, then I write the expected output myself. But if you take the generated output, that's basically for securing against future changes of the model: I assume that at that stage the code is correct, and then future refactorings are protected by this.
30:22
That's a reason. Thank you. Thanks a lot. I think we are on time. Okay. Thank you. Thank you so much again. And have a nice day.