# prefect-server
r
Hello folks, I am new to Prefect and trying to build a process to load millions of files from an FTP location into one of our local databases. The typical ETL transformations happen on each file along the way. My question is: how can I design this using Prefect? Right now we have an in-house process that leverages Celery to queue and process the files, but the way it runs currently, it downloads all the files first and then performs the ETL process on each one in parallel. Should we keep it that way, or download each file and perform the ETL operation on it right away? We also need to maintain the state of error files even when the retry logic fails. The current process takes about 20-24 hours, as it processes billions of records. Any insight on how to design this would be helpful.
d
Hi @raman, Welcome to Prefect! You’ve come to the right place with this question 😄 Before I answer, I’d like to ask a few questions to make sure I understand your constraints and goals:
* Are you trying to reduce the amount of time this flow takes?
* What sort of resources are you working with? Would you have access to a large Dask cluster?
* Is it possible to detect when a new file hits the FTP? If so, you may be able to process the files as they come in. Take a look at: https://medium.com/the-prefect-blog/event-driven-workflows-with-aws-lambda-2ef9d8cc8f1a
r
Hi @Dylan, Thank you for your response. To answer your questions:
1. Yes, the primary reason for migrating this workflow is to reduce the time, and also to monitor which files have loaded and which have failed.
2. We don't have a Dask cluster. Do you see any advantage of having a Dask cluster with the flow we have highlighted? We are leveraging pandas directly so far. In terms of hardware resources, we have dedicated on-prem servers, and because of the sensitivity of the data it cannot be moved to cloud servers.
3. Yes, we can detect the new files on the FTP in a few cases, but since the data is sensitive and on-prem, I need to explore "non-cloud" options to perform the operation.
Happy to answer any further questions you have.
d
@raman Great! Thanks for these
Prefect Cloud never has access to your data and can operate without access to your flow run logs
We have many HIPAA compliant customers that are thrilled to let us manage their execution infrastructure
I’d suggest configuring something in the following pattern (see the sketch below):
1. A flow (Flow A) runs every minute to check for new files.
2. For every new file it finds, it kicks off a flow run of Flow B.
3. Flow B processes the file and uploads the results to the DB.
4. Flow B is executed on a locally-running Dask cluster via the `DaskExecutor`, or spins up its own cluster. With a per-run cluster you can control parallelism on a per-flow-run basis, but you’ll have less control over resource usage; a long-running cluster would use a fixed set of resources, but all flow runs would share it for execution.
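A minimal sketch of that pattern in the Prefect 1.x API (the version current in this thread). The project name "etl", the Dask scheduler address, and the `list_new_files` / `extract` / `transform` / `load` helpers are all assumptions to be swapped for your own FTP and database logic:

```python
from datetime import timedelta

from prefect import Flow, Parameter, task, unmapped
from prefect.executors import DaskExecutor
from prefect.schedules import IntervalSchedule
from prefect.tasks.prefect import create_flow_run


@task
def list_new_files():
    """Hypothetical: diff the FTP listing against a table of
    already-processed files and return only the new paths."""
    ...


@task
def build_params(file_path):
    # One parameters dict per child flow run.
    return {"file_path": file_path}


# Flow A: polls every minute and kicks off one Flow B run per new file.
with Flow(
    "flow-a-poll-ftp",
    schedule=IntervalSchedule(interval=timedelta(minutes=1)),
) as flow_a:
    new_files = list_new_files()
    create_flow_run.map(
        parameters=build_params.map(new_files),
        flow_name=unmapped("flow-b-etl"),
        project_name=unmapped("etl"),
    )


@task(max_retries=3, retry_delay=timedelta(minutes=1))
def extract(file_path):
    """Hypothetical: download one file from the FTP."""
    ...


@task
def transform(raw):
    """Hypothetical: the per-file pandas transformations."""
    ...


@task
def load(records):
    """Hypothetical: bulk-insert the transformed records into the local DB."""
    ...


# Flow B: per-file ETL. Pointing DaskExecutor at a long-running scheduler
# shares a fixed pool of workers across all flow runs; constructing it with
# no address spins up a temporary local cluster per flow run instead.
with Flow(
    "flow-b-etl",
    executor=DaskExecutor(address="tcp://dask-scheduler:8786"),
) as flow_b:
    file_path = Parameter("file_path")
    load(transform(extract(file_path)))
```

Both flows would then be registered against the backend (e.g. `flow_a.register(project_name="etl")`) so that `create_flow_run` can find Flow B by name.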
@raman let me know if that’s helpful!
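On the error-state question from the original post: in Prefect 1.x, retries pass through a `Retrying` state rather than a `Failed` one, so a state handler that checks `is_failed()` fires only after the retry logic is exhausted. A minimal sketch, where `mark_file_failed` is a hypothetical helper that writes to a tracking table:

```python
from datetime import timedelta

import prefect
from prefect import task


def mark_file_failed(file_path):
    """Hypothetical: persist the failed file path so it can be reprocessed."""
    ...


def record_failed_file(task_obj, old_state, new_state):
    # Called on every state change; Retrying states are not Failed states,
    # so this only records a file once all retries have been used up.
    if new_state.is_failed():
        file_path = prefect.context.get("parameters", {}).get("file_path")
        mark_file_failed(file_path)
    return new_state


@task(
    max_retries=3,
    retry_delay=timedelta(minutes=1),
    state_handlers=[record_failed_file],
)
def extract(file_path):
    ...
```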
r
Thanks @Dylan, this is very helpful!! Let me try these out and I'll get back to you with further questions.