
Matthias

03/30/2022, 11:05 AM
In Orion, task run result storage is not implemented yet. Is there already a blueprint on what it will look like?

Anna Geller

03/30/2022, 11:13 AM
Nothing I could share yet.

Matthias

03/30/2022, 11:32 AM
And a potential timeline?

Anna Geller

03/30/2022, 11:36 AM
You are asking the hardest questions 🙂 I asked our product team

Matthias

03/30/2022, 12:20 PM
Haha sorry 😄. The reason I'm asking is that I have been thinking about and prototyping a lot with model registries (such as MLflow). Although there are nice tools out there, I found them lacking in functionality when used in an end-to-end flow and across different environments (dev/prod). The main reason is that a data flow typically produces more than just one "asset" (table, dataset, pickled ML model, …), while tools such as MLflow focus only on the ML model itself. So I was thinking it would be nice to have something available where you can register and version flow assets as well as store metadata about these assets that can then be made visible in a UI. This way, you have a nice coupling between flow runs and the assets they produce…
Btw I don't think it would be too hard to build, but unfortunately I am not a frontend developer…

Anna Geller

03/30/2022, 1:05 PM
Using the "boxes" and "arrows" analogy from this blog post, here is how you can approach this problem:
1. Your task runs are the "boxes" that reflect what is happening within your computation (e.g. generating a specific table/dataset/pickled model).
2. While Prefect is mainly focused on the "arrows" rather than the "boxes" themselves, you can label your boxes as you wish to reflect what the computation is about. For instance, if your task is related to a specific data asset, you can name it as such! With well-engineered naming of your task runs, you can easily trace the mapping from a computation to the asset it generates. For instance:
   a. if your task is loading data to a specific table, you can name the task run after this table - this way, you can easily see in your radar view that this is a data node related to this dataset
   b. if your task is computing an ML model, you can name it after your pickled model
3. In the Orion UI, the highly customizable filter functionality will allow you (in the very near future) to easily filter for task run names that contain the name of your data node, effectively giving you the state of computation related to a specific data node in one place. And you will even be able to save it as a custom Dashboard.
You would then end up with a dashboard showing the full history of runs related to a specific "data asset" without having to rely on a Results backend or an Artifacts API - simply with task run naming and the amazing filter functionality of the Orion UI. I said "in the near future" because the wildcard functionality and saving custom dashboards still have some open issues on GitHub, but once those are resolved, you will be able to give this approach a try. I would be curious to hear your thoughts on the topic!
And regarding tracking metadata related to specific experiments, ML models, and other data assets, simply logging this information within your tasks can get you quite far already. At some point, you could export the logs of specific task runs to a log aggregation service like Splunk and build an analytics dashboard on top of this logged metadata.
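A minimal sketch of the naming-plus-logging idea described above, written with the Python standard library only rather than Prefect's own API. The helper names (`task_run_name`, `log_asset_metadata`, `runs_for_asset`), the `step:asset` naming convention, and the sample metadata are all assumptions made for illustration; the wildcard match uses `fnmatch` as a stand-in for a UI filter.

```python
import fnmatch
import json
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("asset_tracking")


def task_run_name(asset: str, step: str) -> str:
    """Build a run name that embeds the data asset it produces (hypothetical convention)."""
    return f"{step}:{asset}"


def log_asset_metadata(run_name: str, **metadata) -> str:
    """Emit metadata as one JSON log line so a log aggregator could parse it later."""
    line = json.dumps({"task_run": run_name, **metadata}, sort_keys=True)
    logger.info(line)
    return line


def runs_for_asset(run_names, asset: str):
    """Wildcard filter: keep only run names that mention the asset."""
    return fnmatch.filter(run_names, f"*{asset}*")


# Hypothetical run names for two assets: a warehouse table and a pickled model.
names = [
    task_run_name("analytics.orders", "load"),
    task_run_name("churn_model.pkl", "train"),
    task_run_name("analytics.orders", "validate"),
]
line = log_asset_metadata(names[0], rows=1200, owner="data-eng")
orders_runs = runs_for_asset(names, "analytics.orders")
```

Here `orders_runs` holds every run that touched the `analytics.orders` table, which is the kind of per-asset history the naming convention is meant to make filterable.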
@Matthias btw just got info that the task result storage is already implemented. When you run
prefect storage xyz
you automatically choose not only the flow storage but also how your task run results are stored, e.g. you can already start using it for persisting task run results to S3!

justabill

03/30/2022, 1:56 PM
@Matthias do I understand correctly that when you visit the page for a flow run, you want to see the storage configuration that was used to persist the results from that flow run?

Matthias

03/30/2022, 2:30 PM
@Anna Geller thanks a lot for the info! Will definitely give it a try soon! About metadata tracking: logging could indeed get me quite far, but I was also thinking about the ability to log artifacts such as GE reports or plots, so is that something we might have to use the Artifacts API for? Nice, I didn't know that storage also covered task run result storage. Curious how that is going to evolve when adding Docker/GitHub storage…
👍 1
@justabill, not exactly. I was imagining something where you could register assets (flow outputs, if you will) as well as metadata (description, owner, …) during flow registration and then, during flow runs, log metadata/artifacts such as GE reports, performance plots for ML models, … while also automatically versioning the results so that a specific version tag can be linked to a specific flow run.
👍 1
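To make the idea above concrete, here is a rough in-memory sketch of such an asset registry. It is not an existing Prefect feature: `AssetRegistry`, `Asset`, `AssetVersion`, the `v1`/`v2` tag scheme, and the example bucket URI are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class AssetVersion:
    tag: str                                        # version tag, e.g. "v1"
    flow_run_id: str                                # flow run that produced this version
    artifacts: dict = field(default_factory=dict)   # artifact name -> reference (e.g. a URI)


@dataclass
class Asset:
    name: str
    description: str
    owner: str
    versions: list = field(default_factory=list)


class AssetRegistry:
    """Hypothetical registry: assets are declared once, versions are logged per flow run."""

    def __init__(self):
        self._assets = {}

    def register(self, name: str, description: str, owner: str) -> Asset:
        # Registering the same name twice returns the existing asset unchanged.
        return self._assets.setdefault(name, Asset(name, description, owner))

    def log_version(self, name: str, flow_run_id: str, **artifacts) -> AssetVersion:
        asset = self._assets[name]
        version = AssetVersion(f"v{len(asset.versions) + 1}", flow_run_id, dict(artifacts))
        asset.versions.append(version)
        return version

    def version_for_run(self, name: str, flow_run_id: str):
        # Link back from a flow run to the version it produced, if any.
        for version in self._assets[name].versions:
            if version.flow_run_id == flow_run_id:
                return version
        return None


registry = AssetRegistry()
registry.register("churn_model", "Pickled churn classifier", owner="ml-team")
v1 = registry.log_version(
    "churn_model", "run-a", ge_report="s3://example-bucket/reports/run-a.html"
)
v2 = registry.log_version("churn_model", "run-b")
```

Registration happens once per asset, while `log_version` is what a flow run would call, so each version tag stays tied to exactly one run.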

Anna Geller

03/30/2022, 3:32 PM
Docker and GitHub storage won't be the same as the current storage abstractions, as those are "read-only" storage types (you certainly wouldn't expect Prefect to commit things to your GitHub repository, would you?) and they are not suitable as a backend for result storage (you likely agree that you wouldn't expect Prefect, or anyone really, to commit pickle or CSV files to a GitHub repository). And a Docker image is a way of creating build artifacts, so that's also not suitable for a Results backend.
Regarding the GE artifacts and plots - do you think it would be enough to store a reference to an S3 object that would include your GE report, dbt report, or a file with a plot? Because this is something that would certainly be useful, doable even at scale, and something I would be happy to pass on to the product team. Thanks for all your feedback and suggestions here. Feedback from an ML engineer like you, dealing with this problem on a regular basis, is extremely valuable!
Regarding versioning of results, some thoughts:
1. Prefect can help you version your flows. You can attach an arbitrary version to each flow and use it to track how your flow evolved over time.
2. Versioning of the actual data is not a concern of a workflow orchestrator, but this is something that you could tackle separately from Prefect. Here are some suggestions:
   a. Using S3 versioning - if you enable versioning on your S3 bucket, anytime you (re)upload a specific object, S3 keeps track of it. Prefect can help you by e.g. automatically checkpointing/storing results of this computation to S3, but versioning itself is done on the S3 side
   b. Using lakeFS - an open-source tool that helps with data versioning
   c. Using Arctic - an open-source tool built on top of MongoDB that contains VersionStore, allowing you to very easily dump e.g. a Pandas dataframe to it and version it; Arctic even preserves the Pandas index (especially useful when working with time-series data!)
The question is then: if you use some of ☝️ those options for data versioning, would you need any feature in the orchestrator for that? My feeling is that Prefect could version your flows, each flow run/task run could be labeled in a way that references a specific data asset (say data warehouse schema, table name, dataset directory on S3), and optionally Prefect could provide a way of displaying some extra tiny piece of metadata on each task run (say, the exact S3 object path of the generated pickle file or the Arctic dataset version number). But the actual tracking of how data evolved (e.g. number of rows ingested, schema changes made over time) is not something I would expect from a workflow orchestrator.
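The split described above (the store versions the data, the orchestrator only records a small reference) can be sketched with a toy in-memory stand-in for a versioned bucket. This is not boto3 or the S3 API: `VersionedStore`, its version-id format, and the `run_metadata` dictionary are assumptions for illustration only. A real bucket generates its own opaque version ids.

```python
import hashlib
from collections import defaultdict


class VersionedStore:
    """Toy stand-in for a versioned bucket: re-uploading a key keeps all older versions."""

    def __init__(self):
        self._versions = defaultdict(list)  # key -> list of (version_id, payload)

    def put(self, key: str, payload: bytes) -> str:
        # Illustrative id: upload counter plus a short content hash.
        digest = hashlib.sha256(payload).hexdigest()[:8]
        version_id = f"{len(self._versions[key]) + 1}-{digest}"
        self._versions[key].append((version_id, payload))
        return version_id

    def get(self, key: str, version_id=None) -> bytes:
        versions = self._versions.get(key, [])
        if not versions:
            raise KeyError(key)
        if version_id is None:
            return versions[-1][1]  # latest version by default
        for stored_id, payload in versions:
            if stored_id == version_id:
                return payload
        raise KeyError(f"{key}@{version_id}")


store = VersionedStore()
key = "models/churn.pkl"
first = store.put(key, b"model-1")
second = store.put(key, b"model-2")

# The only thing the orchestrator side would need to keep per task run:
run_metadata = {"task_run": "train:churn_model", "key": key, "version_id": second}
```

With the key and version id recorded on the task run, any earlier result can be fetched again from the store without the orchestrator tracking the data itself.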

Matthias

03/30/2022, 6:57 PM
Thanks for the response Anna! This is a lot to reply to, so let me try to structure it a bit.
Docker/GitHub storage: I know, and I was hinting at that. There are two options: either you can use S3 for task result storage alongside it, or there is no option for result storage…
Regarding artifacts/plots: having a reference to the URI of the object in S3 is one thing, but it would be nice to render it directly in the UI. With e.g. Argo Workflows, you can define output artifacts of a task which are stored in S3, but you can also inspect them directly from the UI (which I think is a neat feature). For the data assets themselves, this is not so useful (and perhaps not ideal from a security point of view), but for GE reports or plots, this is extremely useful! Moreover, if you could then have a view that displays the reports of the current and previous run side by side, you could get a lot of insights from that!
About data versioning: I completely agree that this is not the responsibility of a workflow orchestrator. But having a way to display tiny pieces of metadata of a task run would be a nice feature! And that metadata could even be a table containing basic statistics of your dataset (pandas.DataFrame.describe displayed as a table in Markdown and made visible in the UI). If you could then compare these between flow runs, that's again super helpful! But on the other hand, there might be other tools to do that…
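As a rough illustration of the "basic statistics as a Markdown table" idea, here is a standard-library sketch. It does not use pandas, so it only approximates what `pandas.DataFrame.describe` reports; `describe_markdown`, `compare_means`, and the sample data are hypothetical.

```python
import statistics


def describe_markdown(columns: dict) -> str:
    """Render count/mean/min/max per numeric column as a Markdown table."""
    header = "| column | count | mean | min | max |"
    divider = "| --- | --- | --- | --- | --- |"
    rows = [
        f"| {name} | {len(values)} | {statistics.fmean(values):.2f} "
        f"| {min(values):.2f} | {max(values):.2f} |"
        for name, values in columns.items()
    ]
    return "\n".join([header, divider, *rows])


def compare_means(previous: dict, current: dict) -> dict:
    """Change in column means between two runs, for columns present in both."""
    return {
        name: statistics.fmean(current[name]) - statistics.fmean(previous[name])
        for name in previous
        if name in current
    }


# Hypothetical column values captured by two consecutive flow runs.
previous_run = {"amount": [10.0, 20.0, 30.0]}
current_run = {"amount": [10.0, 20.0, 60.0]}

table = describe_markdown(current_run)
drift = compare_means(previous_run, current_run)
```

The Markdown string is small enough to attach to a task run as metadata, and `drift` is the kind of between-run comparison mentioned above.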

Anna Geller

03/30/2022, 7:58 PM
Good point regarding storage - the current setup is implemented using a global setting for all storage configs, but we are working on a more granular (per deployment, per work queue, task results different than flows) storage/result backend that should be released in a month or two
"it would be nice to render it directly in the UI"
I like the way you put it because I would consider this a "nice to have" feature - something that could be nice, but nothing critical, because if you can just get the link to the file on S3 or even, say, Google Drive, it's quite easy to copy-paste it when you need it for troubleshooting.
"But having a way to display tiny pieces of metadata of a task run would be a nice feature!"
I agree! Will create a feature request for that. Thanks again for all your feedback, really appreciate it!