Reece Hart
08/02/2021, 1:03 AM
seqtk (args) | fastp (args) | sentieon (args) | samtools (args) >out.tmp
In our pipeline, most steps are wrapped in a script, which is what the Makefile calls. This step starts with approximately 100 GB of data and writes out about 150 GB. Given the data volume, I would be reluctant to write intermediate files in lieu of the pipes.
Given the nature of the workflow -- all command-line tools with file-based data -- I think adopting Prefect would amount to turning most of our scripts into Prefect ShellTasks. I wonder whether this is really worth the effort.
The main drivers for choosing a workflow tool are to help with pipeline versioning, to schedule and track jobs, and to help orchestrate infrastructure scale up/down.
Thanks for any guidance.
Kevin Kho
ShellTask would be the way to go if you were to use Prefect. I don’t think any orchestration tool will require you to persist data with this setup, right? I’m not familiar with these tools, but if seqtk (args)… does not persist data, then no orchestration tool will require persistence of intermediate results, since you can just pipe that whole thing in the ShellTask.
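To make "pipe the whole thing in one task" concrete, here is a minimal sketch. It uses only Python's standard `subprocess` module, not Prefect's ShellTask itself, so treat it as an illustration of the idea rather than the Prefect API. The function name `run_piped_step` and the tiny placeholder commands are made up for the example; the real stages would be the seqtk, fastp, sentieon, and samtools calls from the question.

```python
import subprocess


def run_piped_step(command: str) -> subprocess.CompletedProcess:
    """Run one shell pipeline as a single step, with no intermediate files.

    The stages are connected by OS pipes inside bash, so Python never
    holds the data flowing between them.
    """
    # pipefail: the step fails if any stage fails, not only the last one.
    script = "set -o pipefail; " + command
    return subprocess.run(
        ["bash", "-c", script],
        capture_output=True,  # fine for this tiny demo; a real step would redirect to a file
        text=True,
    )


# Placeholder stages standing in for the real seqtk | fastp | sentieon | samtools pipe.
ok = run_piped_step("printf 'b\\na\\n' | sort | tr a-z A-Z")
failed = run_piped_step("false | cat")

print(ok.returncode, ok.stdout.split())
print(failed.returncode)
```

Because the orchestrator only sees a single command and its exit code, the 100 GB to 150 GB of data never has to be written out between stages. The `set -o pipefail` line matters here: without it, a failure in an early stage would be hidden by a successful last stage.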
2. There is one thing Prefect can help with here, and that is making the logic more modular. I have seen Flow of Flows used to orchestrate this kind of setup. Imagine a scenario where you only want the entire flow to run if new data comes in. The first sub-Flow can check for the existence of new data and propagate a SKIP if there isn’t any, so that all downstream steps don’t run. Similarly, if you only want to run sub-Flow 30 out of 50, Prefect can help decouple that. Though, of course, your Makefile is well capable of doing this too.
3. The modular setup also helps decouple infrastructure. If you use Docker containers, I think you are using one image that contains all the dependencies of the setup with your current Makefile. Having these split into individual containers might help speed up development. Maybe you don’t need images at all. That would make this thought pointless.
4. Scheduling, tracking failures, and scaling infrastructure up and down are all good use cases for workflow orchestration. If you only need simple scheduling and it’s just a matter of running all 50 of those tasks consecutively, Cron or something like Windows Task Scheduler will work for you. If you need to perform certain actions in the event things fail, then you would need a workflow orchestration system. For example, if task 10/50 fails, do you want to run downstream tasks? Do you want to proceed to task 30? Do you just want a notification? These triggers are things workflow orchestration can help with.
5. I think the effort level is not too bad. There will, of course, be some degree of effort to move to a workflow orchestration system. It would be up to you whether those gains are worth it. If a job fails and there’s no issue, then I guess it would be totally fine without a workflow orchestration system in place.
6. More of the effort might be around packaging and modularizing the containers. It sounds like you have a gigantic container, which might be painful to work with if you use a service like ECS that spins the compute up and down for you. Working with this might be a lot of effort.
7. I think Git and Docker might provide enough versioning? You can also post artifacts related to your pipeline somewhere like S3 for traceability (with or without Prefect).
8. The benefits you get from an orchestration system depend on how much you buy in, of course. If you used something like Prefect just for the sake of spin-up---run---spin-down (which is just scheduling and scaling), then it wouldn’t help you much if a run failed, aside from sending a notification. Prefect charges per task run because the observability service is added to each task. You might need to register more of those 50 steps to get more insight into the failures. I can understand if that’s not worth the trouble.
9. The beauty of this seems to be that Prefect does not have to be invasive into your code at all, since everything is just shell commands. I think this makes it really easy to onboard and offboard if you want, because you don’t need to decorate Python code. It would be Python orchestrating these shell commands in a different script.
Reece Hart
08/02/2021, 1:34 AM
Kevin Kho
Junjun Zhang
08/02/2021, 2:05 AM
Reece Hart
08/02/2021, 2:42 AM
Reece Hart
08/02/2021, 2:45 AM
Reece Hart
08/02/2021, 2:53 AM
Junjun Zhang
08/02/2021, 3:19 AM
Junjun Zhang
08/02/2021, 3:19 AM
Reece Hart
08/02/2021, 3:22 AM
Junjun Zhang
08/02/2021, 11:31 AM