# ask-community
Abhas
Hi, if I am trying to run a heavy transform task on a large data frame -
```python
from prefect import task, Flow

# df and filtered_dfs are defined elsewhere in the original snippet

@task
def transform(df):
    ...  # heavy transformation
    return transformed_df

with Flow("transform-1") as flow:
    transform(df)                  # current implementation
    for df in filtered_dfs:        # suggested way to utilize parallelization on tasks
        transform(df)
```
1. Will breaking the DataFrame down into small chunks and then calling the transformation task on each batch help me reap the benefits of Prefect parallelization using a Dask executor? (See the chunking sketch below.)
2. Given that the input to a transform task is a large DataFrame, what other steps can I consider to optimize the turnaround time of the flow? (Suggestions involving different run configs and Prefect paradigms for writing the code are welcome.)
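(For reference, a minimal sketch of the batching step described in question 1, assuming a row-wise split is acceptable; big_df and n_chunks are hypothetical names, not from the original snippet:)
```python
import numpy as np
import pandas as pd

def split_frame(df: pd.DataFrame, n_chunks: int):
    # np.array_split divides the frame row-wise into roughly equal pieces
    return np.array_split(df, n_chunks)

big_df = pd.DataFrame({"x": np.arange(1_000_000)})  # stand-in for the real large frame
filtered_dfs = split_frame(big_df, n_chunks=8)      # iterable of smaller DataFrames
```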
d
Hi Abhas:
1. It's definitely possible! Once you've broken your flow up into component tasks and run it, the tasks are all submitted to Dask, which will try to do the heavy lifting of parallelization for you. However, this only works if the relationships between the tasks are such that Dask believes it can run them in parallel, and getting this to work well might require some experimentation. In your specific example, the for loop will generate tasks that are structured to run serially.
2. If you can structure your transform in a way that makes sense, consider using a Prefect map on an Iterable[pd.DataFrame] to submit parallel tasks to Dask; this is a natural way to construct Dask-friendly flows. You can find examples in our documentation: https://docs.prefect.io/core/concepts/mapping.html
Abhas
Exactly, the motivation for creating batched DataFrames was to be able to submit them as an iterable to a mapped transform task, so that Dask recognizes them as parallelizable child tasks :)
Quick question: will the mapped child tasks be visible in the dashboard?