Nova Westlake

07/29/2022, 9:12 PM
I'm trying to wrap my head around whether Prefect is a good fit for our very specific problem. When I read the case studies, etc., it sounds like it is solving different problems. An example of what I'm trying to solve: we have a multi-terabyte drive of data that we need to upload to the cloud. Once it's uploaded, we run multiple proprietary tools on servers to extract the data, then transform and process it multiple times, say 20 different steps in one workflow. The process will break multiple times along the way, and we need to inspect and potentially even manually fix problems inside the files themselves, then debug / rerun individual steps. None of this is ever customer data; we're never trying to gain insight from big data generated by users or anything. This is purely a technical GIS data processing project.

Mason Menges

07/29/2022, 11:25 PM
Hey @Nova Westlake, I think Prefect, especially 2.0, can definitely accommodate your use case. For the most part, when you're building out your workflows you're just writing native Python, and it's possible to build a workflow that "pauses" on some unforeseen error and resumes when it's able to. In regards to this question: "I think what I'm most stuck on is is Prefect made to manage multiple separate tools that are built as Docker containers? I want to sandbox these data processing steps from each other in terms of their dependencies, library versions etc. But it feels like Prefect is more intended to be a monolithic app? How would you normally manage running tools that need to use python libraries with different versions in prefect, etc?" There's likely more than one way to accomplish this in 2.0, and it can depend on your use case, but as an initial thought I believe you could set up a flow of flows, where each subflow is deployed with its own Docker container and its own dependencies (more on deployments here, as well as the docker-container infrastructure here). Each flow/container could be triggered from the Orion API client with create_flow_run_from_deployment, which would ensure each flow runs within its own container. You can control the flow states from the API as well, and since 2.0 works really well with native Python, you can include conditional logic in your code around the success of the individual flows to deal with failures. There's more to consider here for sure, but I think it's definitely possible to accomplish what you're looking for.
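To make the "stop on failure, fix the data, then resume" idea a bit more concrete, here's a rough sketch in plain Python. It deliberately does not use Prefect's API: `run_pipeline`, the step names, and the JSON state file are all made up for illustration. The rough idea is that Prefect's run states would play the role of the state file, and each step function could instead trigger a subflow running in its own container.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, List, Optional, Tuple

# Hypothetical step type: a name plus a callable that raises on failure.
Step = Tuple[str, Callable[[], None]]


def run_pipeline(steps: List[Step], state_file: Path) -> Optional[str]:
    """Run steps in order, skipping any already recorded as done.

    Returns the name of the step that failed, or None if all steps completed.
    """
    done = set(json.loads(state_file.read_text())) if state_file.exists() else set()
    for name, func in steps:
        if name in done:
            continue  # finished in an earlier run, so don't redo it
        try:
            func()
        except Exception as exc:
            # Stop here so someone can inspect and fix the files by hand.
            print(f"step {name!r} failed: {exc}")
            return name
        done.add(name)
        # Persist progress after every successful step.
        state_file.write_text(json.dumps(sorted(done)))
    return None


if __name__ == "__main__":
    data_fixed = {"value": False}  # stands in for a manual fix to a file
    calls: List[str] = []

    def extract() -> None:
        calls.append("extract")

    def transform() -> None:
        calls.append("transform")
        if not data_fixed["value"]:
            raise ValueError("corrupt input file")

    def publish() -> None:
        calls.append("publish")

    steps: List[Step] = [
        ("extract", extract),
        ("transform", transform),
        ("publish", publish),
    ]

    with tempfile.TemporaryDirectory() as tmp:
        state = Path(tmp) / "state.json"
        failed = run_pipeline(steps, state)  # stops at "transform"
        data_fixed["value"] = True           # the manual fix happens here
        result = run_pipeline(steps, state)  # resumes at "transform"
        print(failed, result, calls)
```

The second call skips `extract` because it was already recorded as done, so only the failed step and the ones after it run again.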
As you're scoping it out, definitely feel free to ask any questions you come across; we're more than happy to try and help 😄 It's also worth noting that for complex scenarios and more production-heavy use cases you can also reach out for paid support, which can get a little more hands-on with helping you get this pipeline set up.

Nova Westlake

07/30/2022, 12:31 AM
Thank you for this detailed response! I really appreciate it. I'll have to continue perusing the documentation, and especially the docs you've pointed me towards.