I have a use case where our company has tens of thousands of databases, one per client, all within a single AWS RDS instance. For company-wide business analytics, we currently back up the instance to an S3 bucket, which produces millions of Parquet files partitioned by date, database, schema, and table.
We currently use an ELT approach with dbt: we dump all this data into our data warehouse using a multi-tenant design and treat it as our source data for now.
Coming to the issue at hand: I've created a flow consisting of multiple subflows and their respective tasks, and tested everything locally by deploying the workflows on Docker, orchestrated by a Docker-hosted Prefect server. It all runs well on a subset (100k files) of the data, since it's my local machine.
But once we deployed the solution to production on ECS infrastructure with Prefect Cloud, I ran into some odd issues:
• I have a flow that scans the metadata of all the millions of files and groups them by which database and table they belong to. Initially I simply returned the resulting dictionary from the task and passed it on to the next subflows to work on. This worked well on my local machine, but the flows get stuck in the PENDING state when running on the same subset (100k files) of data on Prefect Cloud. I was able to narrow it down to one of the parameters: that large mapping dictionary. As soon as I removed it and re-tested the flow, it started running again.
• Once I worked around that, another set of subflow tasks is now stuck in the PENDING state because they take a subset of that dictionary as input, which contains a list of thousands of S3 path strings. This also worked fine in my locally deployed workflows, but gets stuck when running on ECS infra with Prefect Cloud.
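For context, here's a minimal sketch of the pattern I'm describing (function and key names are hypothetical, and the key layout just mirrors the date/database/schema/table partitioning mentioned above). The point is that the task's return value is one big mapping whose values are lists of S3 keys, and that whole structure was being passed as a flow/subflow parameter:

```python
# Hypothetical sketch: build a {(database, table): [s3_key, ...]} mapping
# from partitioned S3 keys. In production this dict covers millions of
# files and was being passed directly as a subflow parameter.
def group_keys_by_table(keys):
    groups = {}
    for key in keys:
        # Parse "field=value" path segments; non-partition segments
        # (like the parquet file name) carry no "=" and are skipped.
        parts = dict(seg.split("=", 1) for seg in key.split("/") if "=" in seg)
        groups.setdefault((parts["database"], parts["table"]), []).append(key)
    return groups

keys = [
    "dt=2024-01-01/database=db_1/schema=public/table=orders/part-00.parquet",
    "dt=2024-01-01/database=db_1/schema=public/table=orders/part-01.parquet",
    "dt=2024-01-01/database=db_2/schema=public/table=users/part-00.parquet",
]
groups = group_keys_by_table(keys)
# groups[("db_1", "orders")] holds two keys, groups[("db_2", "users")] one
```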
Just a side note: all my flows and subflows run with the default ConcurrentTaskRunner.
It seems safe to assume that flow and task inputs are serialized and deserialized somewhere along the way, which might hint at why these flows and tasks get stuck in the PENDING state, but I can't understand why it doesn't reproduce locally. Does Prefect Cloud handle inputs differently from the open-source variant?
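To give a sense of scale, here's a rough back-of-envelope calculation. This is an assumption about how parameters might travel (JSON-serialized into the run payload), not a statement about Prefect internals, and the key length is just representative of the partitioned paths above:

```python
# Back-of-envelope: if a flow parameter holding 100k ~77-byte S3 keys is
# JSON-serialized into the submission payload, it is already several MB.
import json

sample_key = (
    "dt=2024-01-01/database=db_00001/schema=public/"
    "table=orders/part-00000.parquet"
)
payload = {"db_00001.orders": [sample_key] * 100_000}
size_mb = len(json.dumps(payload)) / (1024 * 1024)
# size_mb comes out well above typical API request-size limits' comfort zone
```

That kind of payload riding along on every flow-run submission would plausibly behave very differently against a remote API than against a local server.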
Has anyone faced something similar?
@Scott Walsh so it seems the only solution for passing large outputs and inputs between tasks and flows is to write them to some external storage, either local or remote, and read them back in the downstream tasks?
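If so, I imagine the pattern would look something like this sketch: persist the large mapping once, then hand subflows only a small locator string. All names here are hypothetical; in production the locator would be an S3 key written with boto3, but I'm using a local temp file so the sketch is self-contained:

```python
# Hypothetical "pass a reference, not the payload" workaround:
# the big dict goes to external storage; flows exchange only a path.
import json
import os
import tempfile

def write_mapping(groups):
    """Persist the large mapping and return a small locator string."""
    fd, path = tempfile.mkstemp(suffix=".json")
    with os.fdopen(fd, "w") as f:
        json.dump(groups, f)
    return path  # this tiny string is all that crosses the flow boundary

def read_mapping(locator):
    """Downstream task: rehydrate the mapping from the locator."""
    with open(locator) as f:
        return json.load(f)

groups = {"db_1.orders": ["s3://bucket/a.parquet", "s3://bucket/b.parquet"]}
locator = write_mapping(groups)
rehydrated = read_mapping(locator)
```

The flow-run parameters then stay tiny regardless of how many files the mapping covers.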