# ask-community
j
I'm having a problem creating a Prefect flow to process a large number of XML files (40k) without running out of RAM. Is this the right place to ask for help?
k
Hi @Jon Ruhnke, this is the right place to ask for help. Are you using LocalDaskExecutor or DaskExecutor?
j
I don't have a clue. This is my first Prefect flow, and I've written most of it already; it seems to work pretty well on smaller file sets. I'm just running it via the command line.
Reading about it now.
k
Ah ok. If you didn’t set any it would just be the local sequential executor. I think this will help you. If you are dealing with large data, you can reduce your footprint by writing the results somewhere, returning the location, and then reading it back in later.
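The "write the results somewhere and return the location" pattern can be sketched roughly as below. This is a hypothetical illustration, not code from the thread: the function names and the JSON format are made up, and in a real flow each function would carry Prefect's @task decorator.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical sketch of the "return the location, not the data" pattern.
# In a real Prefect flow these functions would be decorated with @task.

def parse_and_persist(record: dict, out_dir: str) -> str:
    """Write the parsed result to disk and return only its path."""
    path = Path(out_dir) / f"{record['id']}.json"
    path.write_text(json.dumps(record))
    return str(path)  # a small string travels between tasks, not the data

def load_from_path(path: str) -> dict:
    """A downstream task re-reads the data only when it needs it."""
    return json.loads(Path(path).read_text())

out_dir = tempfile.mkdtemp()
location = parse_and_persist({"id": "a1", "value": 42}, out_dir)
restored = load_from_path(location)
```

Only the path string is held in memory between tasks; the payload itself lives on disk until the downstream task reads it back.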
j
I spent a bunch of time implementing mapping so that my ETL task would run in batches of 1,000 files. I thought that was going to solve my problem, but it doesn't seem to affect memory usage?
k
It does a bit. What is the return of your task?
j
This is a test run. It starts with 5 XML files in a folder. I create batches of 1 (since it's a test), then the parse function reads the XML values I need and puts them into a dataframe, then the load function writes that batch to a SQL table.
I assumed that, implemented this way with mapping, it would free up memory after executing each batch, but it didn't seem to do that.
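The batching step described above can be sketched with a small helper like the one below. This is a hypothetical example (the `make_batches` name and file names are made up); each batch would then be handed to a mapped task.

```python
# Hypothetical helper for splitting a file list into batches for Task.map.
def make_batches(files, batch_size):
    """Split `files` into consecutive batches of at most `batch_size`."""
    return [files[i:i + batch_size] for i in range(0, len(files), batch_size)]

batches = make_batches([f"file_{n}.xml" for n in range(5)], 2)
# Each element of `batches` would then go to a mapped task,
# e.g. parse.map(batches) in Prefect 1.x.
```

Note that mapping alone doesn't release memory: each mapped task's return value is still retained so it can be passed downstream, which is why large results should be written out rather than returned.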
k
It won’t, because the result is held in memory to pass it to the next task. If parse_file normally succeeds, you can reduce your footprint by combining these tasks. If not, you can just save that dataframe, return the location, and load it in the load_to_sql task.
j
By "save that dataframe", you mean save it to a file on disk, like JSON or something?
k
Yes exactly
j
Otherwise I could combine parse/load into a single function so no data is being held? That seems like bad practice, combining a bunch of things into a single task.
Can that one task reference multiple non-task functions to break up the code?
k
Yes, you are exactly right. Combining them into a single task doesn't mean combining the code. You can still keep each piece as a function and call it inside the task:
from prefect import task

def some_function():
    return 1

@task
def some_task():
    # plain Python function called from inside a task
    return some_function() + 1
If an operation is pretty much guaranteed to succeed, you don’t need to put it in a task since you don’t need observability and retries around it.
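Putting the two ideas together, a combined parse-and-load task might look like the sketch below. This is a hypothetical illustration: the helper names and the stand-in parse/load logic are made up, and `etl_batch` would carry Prefect's @task decorator in a real flow.

```python
# Hypothetical combined task: parse and load stay separate helper functions,
# but one task wraps them so the dataframe never crosses a task boundary.

def parse_batch(paths):
    """Plain helper: parse a batch of XML files into rows (stand-in logic)."""
    return [{"source": p, "value": len(p)} for p in paths]

def load_rows(rows, sink):
    """Plain helper: write rows to the 'database' (a list here for the sketch)."""
    sink.extend(rows)
    return len(rows)

# In the real flow this function would carry Prefect's @task decorator:
# @task
def etl_batch(paths, sink):
    rows = parse_batch(paths)     # the parsed data stays local to this one task
    return load_rows(rows, sink)  # only a small row count is returned

db = []
count = etl_batch(["a.xml", "bb.xml"], db)
```

Because only the row count is returned, the parsed rows are garbage-collected once the task finishes, instead of being held to pass between tasks.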
j
Oh that makes sense
Thanks, this was really helpful. It works now when I combine them.
k
No problem!