# ask-community
j
I'm having a problem creating a Prefect flow to process a large number of XML files (40k) without running out of RAM. Is this the right place to ask for help?
k
Hi @Jon Ruhnke, this is the right place to ask for help. Are you using LocalDaskExecutor or DaskExecutor?
j
I don't have a clue. This is my first Prefect flow, and I've written most of it already; it seems to work pretty well on smaller file sets. I'm just running it via the command line.
Reading about it now.
k
Ah ok. If you didn’t set any it would just be the local sequential executor. I think this will help you. If you are dealing with large data, you can reduce your footprint by writing the results somewhere, returning the location, and then reading it back in later.
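The "write the results somewhere and return the location" pattern can be sketched roughly as below. This is a hypothetical illustration, not code from the thread: the function names and the JSON format are made up, and in a real flow each function would carry Prefect's @task decorator.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical sketch of the "return the location, not the data" pattern.
# In a real Prefect flow these functions would be decorated with @task.

def parse_and_persist(record: dict, out_dir: str) -> str:
    """Write the parsed result to disk and return only its path."""
    path = Path(out_dir) / f"{record['id']}.json"
    path.write_text(json.dumps(record))
    return str(path)  # a small string travels between tasks, not the data

def load_from_path(path: str) -> dict:
    """A downstream task re-reads the data only when it needs it."""
    return json.loads(Path(path).read_text())

out_dir = tempfile.mkdtemp()
location = parse_and_persist({"id": "a1", "value": 42}, out_dir)
restored = load_from_path(location)
```

Only the path string is held in memory between tasks; the payload itself lives on disk until the downstream task reads it back.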
j
I spent a bunch of time implementing mapping so that my ETL task would run in batches of 1,000 files. I thought that was going to solve my problem, but it doesn't seem to affect memory usage?
k
It does a bit. What is the return of your task?
j
This is a test run. It starts with 5 XML files in a folder. I create batches of 1 (since it's a test), then the parse function reads the XML values I need and puts them into a dataframe, then the load function writes that batch to a SQL table.
I assumed that, implemented this way with mapping, it would free up memory after executing each batch, but it didn't seem to do that.
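The batching step described above can be sketched with a small helper like the one below. This is a hypothetical example (the `make_batches` name and file names are made up); each batch would then be handed to a mapped task.

```python
# Hypothetical helper for splitting a file list into batches for Task.map.
def make_batches(files, batch_size):
    """Split `files` into consecutive batches of at most `batch_size`."""
    return [files[i:i + batch_size] for i in range(0, len(files), batch_size)]

batches = make_batches([f"file_{n}.xml" for n in range(5)], 2)
# Each element of `batches` would then go to a mapped task,
# e.g. parse.map(batches) in Prefect 1.x.
```

Note that mapping alone doesn't release memory: each mapped task's return value is still retained so it can be passed downstream, which is why large results should be written out rather than returned.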
k
It won’t, because the result is held in memory to pass it to the next task. If parse_file normally succeeds, you can reduce your footprint by combining these tasks. If not, you can just save that dataframe, return the location, and load it in the load_to_sql task.
j
By "save that dataframe", you mean save it to a file on disk, like JSON or something?
k
Yes exactly
j
Otherwise I could combine parse/load into a single function so no data is being held? That seems like bad practice, combining a bunch of things into a single task.
Can that one task reference multiple non-task functions to break up the code?
k
Yes, you are exactly right. Combining them into a single task doesn't mean combining the code. You can still keep each piece as a function and call it inside the task:
from prefect import task

def some_function():
    return 1

@task
def some_task():
    # plain Python function called from inside a task
    return some_function() + 1
If an operation is pretty much guaranteed to succeed, you don’t need to put it in a task since you don’t need observability and retries around it.
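Putting the two ideas together, a combined parse-and-load task might look like the sketch below. This is a hypothetical illustration: the helper names and the stand-in parse/load logic are made up, and `etl_batch` would carry Prefect's @task decorator in a real flow.

```python
# Hypothetical combined task: parse and load stay separate helper functions,
# but one task wraps them so the dataframe never crosses a task boundary.

def parse_batch(paths):
    """Plain helper: parse a batch of XML files into rows (stand-in logic)."""
    return [{"source": p, "value": len(p)} for p in paths]

def load_rows(rows, sink):
    """Plain helper: write rows to the 'database' (a list here for the sketch)."""
    sink.extend(rows)
    return len(rows)

# In the real flow this function would carry Prefect's @task decorator:
# @task
def etl_batch(paths, sink):
    rows = parse_batch(paths)     # the parsed data stays local to this one task
    return load_rows(rows, sink)  # only a small row count is returned

db = []
count = etl_batch(["a.xml", "bb.xml"], db)
```

Because only the row count is returned, the parsed rows are garbage-collected once the task finishes, instead of being held to pass between tasks.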
j
Oh that makes sense
Thanks, this was really helpful. It works now when I combine them.
k
No problem!