#random

Gustavo Puma

04/26/2022, 2:01 PM
Hi peeps 👋 At my company we're trying to improve the data quality of our (Databricks) Delta Lake by introducing monitoring. We'd like to add assertions (as SQL statements) that we could then schedule and alert on (based on the query results). Any of you familiar with such tools that you could recommend❓ Databricks has its own thing called SQL Analytics but unfortunately it doesn't have any form of git integration

Anna Geller

04/26/2022, 2:16 PM
I can very much relate to your problem! I used to approach this in an event-based way - check this blog post as an example. The idea is that any time a new file or partition arrives in your data lake, data quality checks fire and alert you when data validation fails. You could combine it with Prefect for enhanced observability - instead of implementing the data quality checks in AWS Lambda, your Lambda function could just trigger a Prefect flow as explained here.
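The Lambda-triggers-a-flow setup Anna describes could be sketched roughly like this. This is an illustrative sketch, not the blog post's actual code: it assumes the Prefect 1.x `Client.create_flow_run` API, and the flow ID and S3 event layout are hypothetical placeholders.

```python
# Sketch: an AWS Lambda handler that fires a Prefect flow run (which would
# contain the data quality checks) whenever new objects land in the data lake.
# Assumptions: Prefect 1.x Client API; flow ID below is a placeholder.

def extract_new_keys(s3_event):
    """Pull the object keys out of an S3 'ObjectCreated' event payload."""
    return [rec["s3"]["object"]["key"] for rec in s3_event.get("Records", [])]

def handler(event, context):
    keys = extract_new_keys(event)
    if not keys:
        return {"status": "no new objects"}
    # Prefect 1.x client call; in Prefect 2.x you would trigger a deployment instead.
    from prefect import Client  # imported lazily to keep the sketch self-contained
    client = Client()
    run_id = client.create_flow_run(
        flow_id="<your-quality-check-flow-id>",  # hypothetical placeholder
        parameters={"paths": keys},  # let the flow know which partitions to validate
    )
    return {"status": "triggered", "flow_run_id": run_id}
```

Passing the new object keys as flow parameters lets the downstream checks validate only the freshly arrived partition instead of the whole table.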

Jeremiah

04/26/2022, 4:15 PM
@Gustavo Puma it's a little early to discuss, but I think in the second half of 2022 Prefect will introduce the building blocks to become a first-class solution for your problem. I'm not saying you should wait for that, but do keep it in mind 🙂
๐Ÿ‘ 3

Rio McMahon

04/26/2022, 5:11 PM
We've been experimenting internally with Great Expectations (https://greatexpectations.io/expectations/) and evidently (https://github.com/evidentlyai/evidently). I am not sure if this exactly fits your use case but may be worth looking into.

Anna Geller

04/26/2022, 5:12 PM
thanks for sharing, Rio!

Gustavo Puma

04/27/2022, 11:11 AM
Thanks all for your responses. We actually already use something similar to Great Expectations for the incoming streaming data. I'm now looking for something that runs after the data has already been persisted. This is use-case driven and would involve joining different tables, which is why alerting based on SQL queries would fit us nicely

Anna Geller

04/27/2022, 11:27 AM
Gustavo, I'd be curious to clarify the problem even more. When you say you want to run the data tests after the load, why is that important? Is it just some sort of "integration test" to see whether, after the load, your data e.g. matches your expected value distribution? Or is the underlying problem that you want to run those data quality tests any time new data arrives in the given destination (table, data lake path), regardless of which process (data pipeline from an orchestrator, manual load from your dev machine, or a manual `dbt run` command) loaded that data? Just to confirm whether we have the same understanding of the problem

Gustavo Puma

04/27/2022, 1:21 PM
Hi Anna, more like the second case. I want to run these tests on either my entire dataset or a sliding window. Running these assertions as part of our ETL would increase the runtime by a lot, and our analysts would also be interested in defining them, which is why we'd like a post-ETL step. The issues are mostly aggregation-related: e.g. a null value is acceptable in a certain column, but if 80% of the values in a slice of data are null then this could be an issue. Another example is that the value of a column in one table should match the one in another table; we already have some cases like this but would like to monitor whether the mismatches increase
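The two assertion types Gustavo mentions (a null-ratio threshold and cross-table consistency) can be expressed as plain SQL metrics with a pass/fail predicate. Below is a minimal, dependency-free sketch using an in-memory SQLite table as a stand-in for a Delta Lake table; the table and column names (`orders`, `customers`, `customer_id`) are made up for illustration.

```python
# Sketch: "alert on SQL assertion" checks. Each assertion is a SQL query that
# returns a single metric, plus a predicate on that metric that must hold.
import sqlite3

ASSERTIONS = [
    # No more than 80% NULLs in the column (the threshold from the example above).
    ("null_ratio",
     "SELECT AVG(CASE WHEN customer_id IS NULL THEN 1.0 ELSE 0.0 END) FROM orders",
     lambda v: v <= 0.8),
    # Cross-table consistency: every non-null order reference must exist in customers.
    ("orphan_orders",
     "SELECT COUNT(*) FROM orders o LEFT JOIN customers c ON o.customer_id = c.id "
     "WHERE o.customer_id IS NOT NULL AND c.id IS NULL",
     lambda v: v == 0),
]

def run_assertions(conn):
    """Run every assertion query and return (name, metric) for the failing ones."""
    failures = []
    for name, sql, is_ok in ASSERTIONS:
        (value,) = conn.execute(sql).fetchone()
        if not is_ok(value):
            failures.append((name, value))  # in production: send an alert here
    return failures
```

A scheduler (e.g. a Prefect flow on a cron schedule) could run `run_assertions` against the warehouse periodically and page someone whenever the returned list is non-empty; restricting each query with a time predicate would give the sliding-window variant.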
๐Ÿ‘ 1

Anna Geller

04/27/2022, 1:31 PM
Thanks so much for sharing this! I cross-posted here on Discourse