
Alex Joseph

06/06/2020, 6:18 PM
Hi everyone - We're a data science team trying to slowly move towards data engineering. We've been working mostly in R, and it has served us well, but we'd like to try Prefect for workflow orchestration. We have a lot of code in R with all the domain logic/tests, so rewriting everything in Python is not a viable option. Can we use Prefect for orchestration in this case? I've tried running the R code as a subprocess, and it works to an extent, but it seems very hacky. Is there a standard way to do this?

Jeremiah

06/06/2020, 6:23 PM
Hi @Alex Joseph, and welcome! We do have some DS teams running R scripts through our Python API, so hopefully one can chime in with a favorite recipe. In general, though, we do see the subprocess call-out somewhat frequently. In that setup, R is just a third party system like any other piece of your infrastructure, and Prefect is orchestrating it.
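(A minimal sketch of the subprocess call-out pattern described above, assuming Prefect 0.x, which was current at the time of this thread; the script path and arguments are placeholders.)

```python
# Sketch: wrapping an R script in a Prefect task via subprocess.
# Assumes Prefect 0.x; "scripts/transform.R" and its arguments are placeholders.
import subprocess

from prefect import Flow, task


@task
def run_r_script(script_path: str, args: list = None) -> str:
    """Run an R script with Rscript; a non-zero exit code fails the task."""
    result = subprocess.run(
        ["Rscript", script_path] + (args or []),
        capture_output=True,
        text=True,
        check=True,  # CalledProcessError surfaces as a Prefect task failure
    )
    return result.stdout


with Flow("r-orchestration") as flow:
    output = run_r_script("scripts/transform.R", args=["--date", "2020-06-06"])

flow.run()
```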

Alex Joseph

06/06/2020, 6:53 PM
Thanks Jeremiah! It does work, but it's a bit complicated because each R function has to be rewritten as a fully self-contained script and called. I'll wait to see if anyone else has better solutions (maybe RPy2?).

Alex Cano

06/06/2020, 8:36 PM
Hey @Alex Joseph, my current company has a similar (though not identical) scenario to the one you're running into. Right now we are using Airflow for the scheduler, but we have a mix of R and Python, all of which needs to be scheduled. What we ended up doing is having all of our code run within Docker containers, so the processing code (container execution) and scheduling (task definition and dependency ordering) are completely separate. To recreate this with Prefect, you can take a look at the suite of Docker tasks and have your Prefect code call the appropriate container. This will let you continue to use your R scripts, but will give you the flexibility to start migrating code over to Python slowly (or to keep using R). (https://docs.prefect.io/core/task_library/docker.html#containers)
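(For illustration, a rough sketch of this container-per-script pattern using the Prefect 0.x Docker task suite linked above; the image name and command are placeholders for an image that bundles the R code.)

```python
# Sketch: running an R job packaged in a Docker image as a chain of Prefect
# Docker tasks (Prefect 0.x task library). Image name and command are placeholders.
from prefect import Flow
from prefect.tasks.docker import (
    CreateContainer,
    StartContainer,
    WaitOnContainer,
    GetContainerLogs,
)

create = CreateContainer(image_name="my-org/r-jobs:latest", command="Rscript /app/job.R")
start = StartContainer()
wait = WaitOnContainer()
logs = GetContainerLogs()

with Flow("r-in-docker") as flow:
    container_id = create()
    started = start(container_id=container_id)
    finished = wait(container_id=container_id, upstream_tasks=[started])
    job_logs = logs(container_id=container_id, upstream_tasks=[finished])

flow.run()
```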

Alex Joseph

06/07/2020, 5:16 AM
Thanks @Alex Cano - This still means the unit of work is a Docker container, but I was wondering if it could be more fine-grained. Let me explain with a real example. We have a script which:
1. Creates a config file
2. Hits two different databases
3. Runs three levels of transformations
4. Runs one validation check
5. Pushes the output to a database
6. Pushes the state to another database
We currently have the entire thing dockerized in a single container. We could potentially dockerize each step independently, but that would mean 9 different Docker containers, and each time the state of the world would have to be recreated; I think the overhead would defeat the purpose of modularity. I'm really fascinated by the Prefect approach of having each function as a separate task that can be monitored independently, and I was wondering if there was a solution for that. I'm looking for something like an R task, similar to the other tasks in the library. Maybe I'm trying to have my cake and eat it too, but I was wondering if there were better solutions.
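(As a hypothetical illustration of the fine-grained layout being asked about: the six steps above expressed as individual Prefect 0.x tasks, each one shelling out to its own R script. Every script name below is made up.)

```python
# Sketch: the pipeline above as one Prefect flow, one task per step, each step
# calling its own R script. All script names are illustrative.
import subprocess

from prefect import Flow, task


@task
def rscript(path: str) -> None:
    """Run a single R script; a non-zero exit fails the task."""
    subprocess.run(["Rscript", path], check=True)


with Flow("r-pipeline") as flow:
    config = rscript("R/create_config.R")
    db_a = rscript("R/extract_db_a.R", upstream_tasks=[config])
    db_b = rscript("R/extract_db_b.R", upstream_tasks=[config])
    t1 = rscript("R/transform_1.R", upstream_tasks=[db_a, db_b])
    t2 = rscript("R/transform_2.R", upstream_tasks=[t1])
    t3 = rscript("R/transform_3.R", upstream_tasks=[t2])
    check = rscript("R/validate.R", upstream_tasks=[t3])
    load = rscript("R/push_output.R", upstream_tasks=[check])
    state = rscript("R/push_state.R", upstream_tasks=[load])
```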

Alex Cano

06/07/2020, 5:38 AM
Gotcha! Yeah, I think breaking them up into different containers would do what you're looking for here. You'd have one Prefect task for each container, and each one could be run and monitored by itself. This is the current pattern we use at work as well, saving all of our intermediate data in GCS buckets. So in your case, if you adopted a similar design pattern, saving the data to intermediate locations (you could delete it after the flow runs, or whenever you need to), you could get results similar to your current monolithic container. I'd actually argue you might even be better off: if a transformation fails, you'd already have the database output saved and wouldn't need to re-query the database. However, this does come with the overhead of managing those intermediate files. I think it would be relatively easy to add a task after the rest that deletes the intermediate files (assuming you don't want to keep them around). Specifically on your point of having an R task, I think you could build one (it would probably be pretty easy to take inspiration from the ShellTask), but the one thing I don't know enough about would be passing objects between R and Python itself.
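(A hypothetical sketch of what such an "R task" could look like, modeled loosely on Prefect 0.x's ShellTask as suggested above. It sidesteps the R/Python object hand-off by exchanging data through files on disk; the CLI-flag convention and all names are assumptions for illustration.)

```python
# Sketch: a custom RTask modeled loosely on ShellTask (Prefect 0.x).
# Data is exchanged through file paths (CSV/feather/RDS) rather than
# in-memory objects; the --output/--key flag convention is an assumption
# about how the R scripts parse their arguments.
import subprocess

from prefect import Task


class RTask(Task):
    def __init__(self, script: str, **kwargs):
        self.script = script
        super().__init__(**kwargs)

    def run(self, output_path: str = None, params: dict = None) -> str:
        args = ["Rscript", self.script]
        if output_path:
            args += ["--output", output_path]
        for key, value in (params or {}).items():
            args += [f"--{key}", str(value)]
        # A non-zero exit code raises, which Prefect records as a task failure.
        subprocess.run(args, check=True)
        # Return the output file path so downstream tasks (Python or R)
        # can pick the data up from disk.
        return output_path
```

Inside a flow you would then call, for example, `RTask("R/transform_1.R")(output_path="/tmp/t1.feather")` and pass the returned path into the next task, which keeps each R step independently runnable and monitorable.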