# ask-community
r
Hi Community. I need some advice about the kinds of workflows that Prefect is good for or, conversely, not so good for. I'm considering using Prefect to coordinate the execution of genomics data processing. The pipeline consists of ~50 steps, each of which ranges from seconds to 2 hours. The steps are all command line tools written in a mix of compiled C and C++, Java, Python, shell scripts, and even some Perl for bad measure. The pipeline starts with 100-150GB of data, processes that into a series of similarly large files, then analyzes those data to produce files roughly 100MB or smaller. Processing takes 250-500 CPU-hours (e.g., 5-10 hours with 48 threads on an EC2 c5ad.12xlarge). The current pipeline is coordinated by a Makefile, which works surprisingly well. My primary concern (and question for the community) is whether Prefect is really suited to the tasks at hand. For example, one of the first steps looks roughly like this:
seqtk (args) | fastp (args) | sentieon (args) | samtools (args) >out.tmp
In our pipeline, most steps are wrapped in a script, which is what the Makefile calls. This step starts with approximately 100GB of data and dumps 150GB of data. Given the data volume, I would be reluctant to write intermediate files in lieu of the pipes. Given the nature of the workflow -- all command-line tools with file-based data -- I think adopting Prefect would amount to making most of our scripts into Prefect ShellTasks. I wonder whether this is really worth the effort. The main drivers for choosing a workflow tool are to help with pipeline versioning, to schedule and track jobs, and to help orchestrate infrastructure scale up/down. Thanks for any guidance.
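For concreteness, here is a minimal sketch of what wrapping that piped step as a Prefect ShellTask might look like, assuming Prefect 1.x and keeping the `(args)` placeholders from the command above; the flow and task names are hypothetical:

```python
# Minimal sketch (Prefect 1.x): the whole pipe becomes a single ShellTask,
# so the orchestrator introduces no extra intermediate files.
from prefect import Flow
from prefect.tasks.shell import ShellTask

align_step = ShellTask(
    name="seqtk_fastp_sentieon_samtools",
    helper_script="set -euo pipefail",  # fail the task if any stage of the pipe fails
)

with Flow("genomics-pipeline") as flow:
    out = align_step(
        command="seqtk (args) | fastp (args) | sentieon (args) | samtools (args) > out.tmp"
    )

# flow.run()  # or register the flow with a backend for scheduling and tracking
```

Each of the ~50 Makefile targets would get a similar wrapper, with ordering expressed through task results or explicit `upstream_tasks`.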
k
Hey @Reece Hart, this sounds pretty complicated and I don't have a clear answer, so I'll just list some thoughts.
1. You are right that ShellTask would be the way to go if you were to use Prefect. I don't think any orchestration tool will require you to persist data with this setup, right? I'm not familiar with these tools, but if `seqtk (args)…` does not persist data, then no orchestration tool will require persistence of intermediate results, since you can just pipe that whole thing inside the ShellTask.
2. One thing Prefect can help with here is making the logic more modular. I have seen Flow of Flows used to orchestrate this kind of setup. Imagine a scenario where you only want the entire flow to run if new data comes in. The first sub-flow can check for the existence of new data and propagate a SKIP if there isn't any, so that all downstream steps don't run. Similarly, if you only want to run sub-flow 30 out of 50, Prefect can help decouple that. Though, of course, your Makefile is well capable of doing this too.
3. The modular setup also helps decouple infrastructure. If you use Docker containers, I think you are using one image that contains all the dependencies of the setup with your current Makefile. Splitting these into individual containers might help speed up development. Maybe you don't need images at all, which would make this thought pointless.
4. Scheduling, tracking failures, and infrastructure scale up/down are all good use cases for workflow orchestration. If you only need simple scheduling and it's just a matter of running all 50 of those tasks consecutively, cron or something like Windows Task Scheduler will work for you. If you need to perform certain actions when things fail, then you would need a workflow orchestration system. For example, if task 10 of 50 fails, do you want to run downstream tasks? Do you want to proceed to task 30? Do you just want a notification? These triggers are things workflow orchestration can help with.
5. I think the effort level is not too bad. There will, of course, be some degree of effort to move to a workflow orchestration system; it would be up to you whether those gains are worth it. If a job fails and there's no issue, then I guess it would be totally fine without a workflow orchestration system in place.
6. More of the effort might be around packaging and modularizing the containers. It sounds like you have a gigantic container, which might be painful to work with if you use a service like ECS that spins compute up and down for you. Working with this might be a lot of effort.
7. I think Git and Docker might provide enough versioning? You can also post artifacts related to your pipeline somewhere like S3 for traceability (with or without Prefect).
8. The benefits you get from an orchestration system depend on how much you buy in, of course. If you used something like Prefect just for the sake of spin up → run → spin down (which is just scheduling and scaling), then it wouldn't be helpful when the run step fails, aside from a notification. Prefect charges per task run because the observability service is added to each task; you might need to register more of those 50 steps to get more insight into the failures. I can understand if that's not worth the trouble.
9. The beauty of this setup seems to be that Prefect does not have to be invasive into your code at all, since everything is just shell commands. I think this makes it really easy to onboard and offboard if you want, because you don't need to decorate Python code. It would be Python orchestrating these shell commands in a different script.
🚀 1
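To illustrate point 2 above, here is a rough sketch (again assuming Prefect 1.x) of a flow-of-flows where an initial check raises SKIP when there is no new data, so the downstream sub-flow runs are skipped; the flow names, project name, and the check itself are hypothetical placeholders:

```python
# Sketch of the flow-of-flows idea: a SKIP raised in the gate task propagates,
# so downstream sub-flow runs are skipped when there is no new data.
import os

from prefect import Flow, task
from prefect.engine.signals import SKIP
from prefect.tasks.prefect import StartFlowRun

@task
def check_for_new_data():
    # Placeholder check; the real test might look at S3, a manifest file, etc.
    if not os.path.exists("/data/incoming/READY"):
        raise SKIP("No new data; skipping the rest of the pipeline.")

run_alignment = StartFlowRun(flow_name="alignment-subflow", project_name="genomics", wait=True)
run_analysis = StartFlowRun(flow_name="analysis-subflow", project_name="genomics", wait=True)

with Flow("parent-pipeline") as flow:
    gate = check_for_new_data()
    align = run_alignment(upstream_tasks=[gate])
    analyze = run_analysis(upstream_tasks=[align])
```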
r
@Kevin Kho Thanks for these awesome comments. Very helpful. I should have said: in addition to the command line tools, a few of the steps are actually Docker invocations. We have dockerized several of the steps because they have onerous and sometimes atypical dependencies, and we didn't want to carry that baggage in the other containers.
k
That makes a lot of sense, and the Docker invocations sound good. One scenario where Prefect provides value is if the different steps of your pipeline require different compute power. When you do the flow-of-flows approach, some Prefect users send the GPU-dependent sub-flows to compute that has GPUs enabled. This is normally for users with heterogeneous hardware requirements throughout their pipeline. The downside is that changing executors means data has to be persisted in some way, which is what you're trying to avoid.
👍 1
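A sketch of how that routing might look in Prefect 1.x: each sub-flow gets a run config whose labels match an agent running on the appropriate hardware. The flow names, labels, and the choice of ECS are all assumptions for illustration:

```python
# Sketch: label each sub-flow's run config so agents on matching hardware pick it up.
from prefect import Flow
from prefect.run_configs import ECSRun

with Flow("alignment-subflow") as cpu_flow:
    ...  # CPU-heavy steps, e.g. the seqtk/fastp/sentieon/samtools pipe

with Flow("analysis-subflow") as small_flow:
    ...  # later, lighter-weight analysis steps that can run on cheaper instances

# Agents started with matching labels will only pick up flows whose run config
# carries that label, so each sub-flow lands on the right class of machine.
cpu_flow.run_config = ECSRun(labels=["cpu-heavy"])
small_flow.run_config = ECSRun(labels=["small"])
```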
j
@Reece Hart Saw your question; I can tell you are running genomics/bioinformatics workflows. I am in the same field. As you probably know, CWL, WDL, and Nextflow are typically used to develop bioinformatics workflows. I have been thinking about whether Prefect is appropriate for adoption by the bioinformatics community, similar to Airflow (although Airflow's adoption does not seem very successful). Do you mind sharing your thoughts on this?
👍 1
r
@Kevin Kho I like the idea of using cheaper hardware for the later stages of our pipeline. I think the next move for me is to draw the DAG out (yet again!), this time noting the data sizes flowing between jobs, the compute complexity of each task, and the software dependencies for each task. Perhaps that will help me think about where to partition the DAG to trade off devops/packaging complexity against computational efficiency.
@Kevin Kho And, yes, everything's in git and mostly built in CI/CD pipelines. Although that tracks code versions, it doesn't solve the problem of wanting to know which code generated which results. (It is logged, but it's not structured.)
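One lightweight way to make that mapping structured (with or without Prefect, per point 7 above) is to write a small provenance record next to each result. A sketch, with a hypothetical bucket name and helper:

```python
# Illustrative sketch: record which commit produced which outputs as a JSON object
# stored alongside the results in S3. The bucket and key layout are hypothetical.
import json
import subprocess
from datetime import datetime, timezone

import boto3

def record_provenance(step_name, inputs, outputs, bucket="my-genomics-results"):
    record = {
        "step": step_name,
        "git_commit": subprocess.check_output(
            ["git", "rev-parse", "HEAD"], text=True
        ).strip(),
        "inputs": inputs,
        "outputs": outputs,
        "finished_at": datetime.now(timezone.utc).isoformat(),
    }
    boto3.client("s3").put_object(
        Bucket=bucket,
        Key=f"provenance/{step_name}.json",
        Body=json.dumps(record, indent=2).encode(),
    )
```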
@Junjun Zhang I've written toy examples in CWL. I know of Nextflow and WDL, but have never used them. I don't think I have any well-formed thoughts yet. I can tell you that I look for tool simplicity, developer ease (from laptops to cloud clusters), and a gut-check on whether I think a project will be around for a while. What do you use for workflows?
j
I used all of them, but settled on Nextflow in the end. Nextflow is pretty nice, particularly with the new DSL2 syntax, which allows complex workflows to be built from modules, roughly one module per step. It has a bit of a learning curve, particularly around the concept of data channels, but once you master it, it becomes very logical and clean. It also has a very nice engine (i.e., execution orchestration): a properly written workflow (e.g., with containerized steps) can run on a laptop, VM, HPC, or cloud cluster without code modification. You also get nice features like resume and allocating different steps different compute resources, etc. I am not trying to advertise Nextflow here; I think Prefect is quite unique, although I have not studied it closely enough (and it keeps improving so rapidly 🙂). It's still quite possible for it to be used for bioinformatics workflows.
just some thoughts of mine to share so far
r
I appreciate the info. Maybe I should give Nextflow a second look.
j
@Reece Hart For your other considerations, such as versioning, devops/packaging, and which code generated which results, you might want to check out WFPM, a workflow package manager: https://wfpm.readthedocs.io/en/latest/README.html. Disclosure: I am the author of WFPM, but not affiliated with Nextflow. As this is the Prefect community, I'd like to make it very clear that my motivation is to compare Prefect with other bioinformatics workflow systems and see what would be a good strategy to introduce Prefect to the bioinformatics community to maximize its superpower.
👍 1