
Laura Lorenz (she/her)

04/30/2020, 6:25 PM
Hey all! I’m noodling on an integration with Great Expectations, so if you have an interest in that library or in data validation/data assertions generally, I have a discussion issue open with some questions and value your thoughts! https://github.com/PrefectHQ/prefect/issues/2436
:upvote: 7
🚀 5
💯 3

Joe Schmid

04/30/2020, 7:14 PM
Paging @Jie Lou to the white courtesy phone! She has integrated GE (plus embulk for data transfer and dbt for analytics engineering) into a substantial Prefect Flow that populates a Redshift cluster with clinical healthcare data. That feeds a large data science pipeline (also in Prefect, of course) that trains & tests machine learning models used by our clients.
❤️ 1
❗ 1
At the moment, our use of GE is on data in external systems, e.g. Redshift, and not on data passing between Prefect tasks.

Steve Taylor

04/30/2020, 7:32 PM
By complete coincidence, we just started using GE. We're using it primarily for test cases and exploration notebooks, working through ideas and verifying the shapes of the data before we move them over to Prefect tasks. Our very next step was in fact to try to start putting validation tasks into our production workflow.
❗ 1
👀 2

Jie Lou

04/30/2020, 9:22 PM
Just adding a little detail to @Joe Schmid’s comments: we use Great Expectations to validate data after data transformation in Redshift. Basically, we configured the valid data format in JSON for each table, and then run a Python script in a ShellTask to get the results of GE, like valid/invalid. I would see the value of GE on a Prefect task result if it involves a complex data structure. I’ll think more!
❤️ 1
🙏 1
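For context, a per-table expectation file in the JSON suite format GE consumes might look roughly like this (the suite, table, and column names here are made up for illustration, not from the actual pipeline):

```json
{
  "expectation_suite_name": "example_table",
  "expectations": [
    {
      "expectation_type": "expect_column_to_exist",
      "kwargs": {"column": "id"}
    },
    {
      "expectation_type": "expect_column_values_to_not_be_null",
      "kwargs": {"column": "id"}
    }
  ]
}
```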

Steve Taylor

06/19/2020, 7:14 PM
@Laura Lorenz (she/her) I'm taking advantage of the fact that you indicated an interest in validation, not just GE: my team has had better luck using Pandera (https://github.com/pandera-dev/pandera) for simple validation, and even some not-so-simple validation. We've used it as straight Python, and I've had luck with its YAML implementation also. It's opinionated in a different way about its column checks, and it has some pretty deep roots in stats, which I don't use as much, but it's interesting to see where the author's heart is at. In this thread above I mentioned using GE, and we did eventually figure out how to use it simply and cleanly -- i.e., without the whole webpage, notebook editor/suite thing. It's just JSON and a couple lines of Python in a task to validate:
import great_expectations as ge
import prefect
from prefect import task


@task()
def validate_roster(df):
    """
    Validate the dataframe against a Great Expectations suite file.

    This may throw a warning, "Pandas doesn't allow columns to be created via a new attribute name,"
    which may be ignored. Working on this.

    Returns the dataframe given.
    """
    logger = prefect.context.get("logger")

    # Create a GE "batch" from the pandas dataframe
    df_ge = ge.from_pandas(pandas_df=df)

    # Run the suite; SUMMARY includes per-expectation stats in the result
    validation_result = df_ge.validate(
        expectation_suite="resources/expectations.json",
        result_format="SUMMARY",
    )

    if not validation_result.success:
        logger.info(validation_result)
        raise Exception("Dataframe did not validate correctly.")

    return df
I want to like it, but the result_format and such is just a little chaotic, especially with things that require SUMMARY for stats and floats. We're finding Pandera to be easier to live with.
👀 1