Apologies in advance for quite a vague question | Prefect Community #ask-community


Mike Geeves

07/19/2022, 7:57 AM

Apologies in advance for quite a vague question 🙂 I've historically custom rolled data pipelines (janky shell/python scripts, then slightly nicer ones with Camel but terrible observability). Is there much usage of Prefect for time series and spatial data, or are there nice ways of dealing with them? I can see, for example, being able to look at success/fail rates of the run history. I'm wondering how this could work when sometimes you only "care" about a failure for a short amount of time, and that's subject to when it last failed.

Use case: satellite images are processed for various areas, and there could be one job for each tile of interest. Periodically each tile is checked for new data and, when there is some, a number of steps to load and process it are performed. Sometimes there isn't data, so the first step fails. This can be because it isn't available yet, or because it won't ever be available. Failed might be fine because there's nothing we can do anyway (there's just no data). However, if the same one is failing for, say, a couple of weeks and there's no "recent" data, then it becomes bad. Sometimes bad things happen and everything fails. Maybe an API key has been revoked or a service is down. Failed is bad.

For my uses I ended up tracking these and making a dashboard showing e.g. each day along the x axis and each tile along the y axis, to be able to spot gaps which hopefully might go from red to green after retries. Is something like that possible via Prefect? To pull out the data and create custom dashboards like that, or via API calls etc?

This is purely "out of interest" rather than an immediate need. Observability, and even being able to categorise failures, was a huge problem, so I'm just wondering if this is something catered for. If there are solutions that would be great to hear, but if not, "nope, you're still on your own" is fine 😄
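The tile-by-day dashboard described in this message can be sketched in plain Python. This is only an illustration, not anything Prefect provides: the record format `(tile, day, succeeded)`, the names `build_grid` and `outage_days`, and the sample data are all hypothetical. A retry that succeeds later on the same day turns the cell green, and a day on which every tile that ran has failed is flagged as a likely systemic problem (revoked API key, service down).

```python
from datetime import date, timedelta

# Hypothetical run records: (tile_id, run_date, succeeded).
RUNS = [
    ("tile_a", date(2022, 7, 1), False),
    ("tile_a", date(2022, 7, 2), True),
    ("tile_a", date(2022, 7, 3), False),
    ("tile_a", date(2022, 7, 4), True),
    ("tile_b", date(2022, 7, 1), False),
    ("tile_b", date(2022, 7, 1), True),   # retry on the same day went green
    ("tile_b", date(2022, 7, 2), True),
    ("tile_b", date(2022, 7, 3), False),
    ("tile_c", date(2022, 7, 1), True),
    ("tile_c", date(2022, 7, 2), False),
    ("tile_c", date(2022, 7, 3), False),
    ("tile_c", date(2022, 7, 4), False),
]
DAYS = [date(2022, 7, 1) + timedelta(days=i) for i in range(4)]


def build_grid(runs, days):
    """Return {tile: row}, one character per day: G=success, R=failure, .=no run."""
    outcome = {}
    for tile, day, ok in runs:
        # Any success on a given day wins over failures on that day.
        outcome[(tile, day)] = outcome.get((tile, day), False) or ok
    grid = {}
    for tile in sorted({tile for tile, _, _ in runs}):
        cells = []
        for day in days:
            if (tile, day) not in outcome:
                cells.append(".")
            else:
                cells.append("G" if outcome[(tile, day)] else "R")
        grid[tile] = "".join(cells)
    return grid


def outage_days(grid, days, min_tiles=2):
    """Days on which at least `min_tiles` tiles ran and all of them failed."""
    flagged = []
    for i, day in enumerate(days):
        column = [row[i] for row in grid.values() if row[i] != "."]
        if len(column) >= min_tiles and all(cell == "R" for cell in column):
            flagged.append(day)
    return flagged


if __name__ == "__main__":
    grid = build_grid(RUNS, DAYS)
    for tile, row in grid.items():
        print(f"{tile}  {row}")
    print("possible outage on:", outage_days(grid, DAYS))
```

Keeping "a single tile has no data" separate from "everything failed at once" is the point of the second function; the `min_tiles` threshold is an arbitrary choice for the sketch.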

Sylvain Hazard

07/19/2022, 8:02 AM

Hey! For Prefect 1.x (and I believe it's basically the same for Prefect 2.x, but I haven't had the chance to fiddle with it yet), the whole flow run history and metadata is stored in a PostgreSQL database. You can theoretically access it directly, but that's pretty cumbersome. On the other hand, Prefect comes with a GraphQL API that allows for easy access to this kind of data. For example, I have a flow that queries the database for old flow runs and deletes them. I can give you the code for this if you want. For your use case, I could see having your data update flows run on their own, and then having another flow query the historical data and push whatever metrics you want to your dashboard. Hope that helps 🙂
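A minimal sketch of what such a history query might look like against the Prefect 1.x GraphQL API. The query shape and field names (`flow_run`, `state`, `scheduled_start_time`, `parameters`) and the variable types are assumptions that should be checked against the schema of your own Prefect Server or Cloud instance, and the `tile` parameter name is hypothetical. The client is passed in so that anything with a `graphql(query, variables=...)` method works, for example `prefect.Client()` in Prefect 1.x.

```python
from collections import defaultdict

# Assumed query shape; verify the field names against your own schema.
FLOW_RUN_HISTORY_QUERY = """
query ($flow_name: String!, $since: timestamptz!) {
  flow_run(
    where: {
      flow: {name: {_eq: $flow_name}},
      scheduled_start_time: {_gte: $since}
    },
    order_by: {scheduled_start_time: asc}
  ) {
    id
    state
    scheduled_start_time
    parameters
  }
}
"""


def fetch_flow_runs(client, flow_name, since_iso):
    """Return the raw flow run records for one flow since an ISO timestamp."""
    result = client.graphql(
        FLOW_RUN_HISTORY_QUERY,
        variables={"flow_name": flow_name, "since": since_iso},
    )
    return result["data"]["flow_run"]


def states_by_tile(flow_runs, param_name="tile"):
    """Group (day, state) pairs by the value of a (hypothetical) flow parameter."""
    grouped = defaultdict(list)
    for run in flow_runs:
        tile = (run.get("parameters") or {}).get(param_name, "unknown")
        day = run["scheduled_start_time"][:10]  # keep only YYYY-MM-DD
        grouped[tile].append((day, run["state"]))
    return dict(grouped)


class FakeClient:
    """Stand-in for a real client so the sketch runs without a Prefect backend."""

    def graphql(self, query, variables=None):
        return {"data": {"flow_run": [
            {"id": "1", "state": "Failed",
             "scheduled_start_time": "2022-07-18T06:00:00+00:00",
             "parameters": {"tile": "T31UFT"}},
            {"id": "2", "state": "Success",
             "scheduled_start_time": "2022-07-19T06:00:00+00:00",
             "parameters": {"tile": "T31UFT"}},
            {"id": "3", "state": "Failed",
             "scheduled_start_time": "2022-07-19T06:00:00+00:00",
             "parameters": {"tile": "T30UWB"}},
        ]}}


if __name__ == "__main__":
    runs = fetch_flow_runs(FakeClient(), "process-tile", "2022-07-01T00:00:00+00:00")
    print(states_by_tile(runs))
```

A separate reporting flow could call something like `fetch_flow_runs` on a schedule and push the grouped result to whatever dashboard is in use.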

Mike Geeves

07/19/2022, 8:14 AM

Thanks! Speedy reply 😮 Makes sense 🤔 I should have added to the question: "or is that considered out of scope for Prefect itself?" Ohh, another flow looking at the history sounds like an interesting idea, I like that 😀 That would avoid having yet another system doing its own scheduling as yet another point of failure 🤔

Sylvain Hazard

07/19/2022, 8:18 AM

It would also allow you to run history fetching independently from your update flows which could be a pain otherwise.

Mike Geeves

07/19/2022, 9:14 AM

Yeah, the aggregation of e.g. "one fail is OK, three fails is bad" makes the actual flow very convoluted. I'll give that a go should I need to again 😀 🤔 Grouping of task success/fail by arguments or something like that seems like it could be interesting to look at, but maybe there isn't much actual direct value (or a significant number of meaningful use cases).
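The "one fail is OK, three fails is bad" rule, grouped by argument, could live in the separate reporting flow as a few lines of Python. This is a sketch under assumptions: the state names `"Failed"` and `"Success"`, the thresholds, and the labels `ok` / `tolerated` / `bad` are illustrative choices, and `history` is a hypothetical mapping from an argument value (such as a tile ID) to its chronologically ordered run states.

```python
def trailing_failures(states):
    """Count consecutive failures at the end of a chronologically ordered list."""
    count = 0
    for state in reversed(states):
        if state != "Failed":
            break
        count += 1
    return count


def classify(states, warn_after=1, alert_after=3):
    """Label a run history by how many failures in a row it currently ends with."""
    streak = trailing_failures(states)
    if streak >= alert_after:
        return "bad"
    if streak >= warn_after:
        return "tolerated"
    return "ok"


def classify_by_argument(history, **thresholds):
    """Apply the rule to each argument value (e.g. each tile) independently."""
    return {arg: classify(states, **thresholds) for arg, states in history.items()}


if __name__ == "__main__":
    history = {
        "tile_a": ["Success", "Failed"],
        "tile_b": ["Failed", "Failed", "Failed"],
        "tile_c": ["Failed", "Success"],
    }
    print(classify_by_argument(history))
```

Because only the trailing streak counts, a tile that recovers after retries drops back to `ok` without any extra bookkeeping in the processing flow itself.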
