# ask-community
p
Hi everyone! We are using Prefect 2.0 with Azure. The traditional df.to_sql is not fast enough for ingesting data into our Azure SQL DB. For a timetable with 50,000 rows:
- local Dask cluster, to_sql: ~6 min
- local Dask cluster, to_sql with fast_executemany: ~4 min
- Data Factory: ~15 sec
What would you do to increase ingestion speed? Would you recommend using a real Dask executor? Any help or ideas would be appreciated!
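For reference, this is roughly the kind of call I mean (a minimal sketch; the connection string, file, and table names are placeholders):
```python
# Minimal sketch of the to_sql approach described above.
# Connection string, source file, and table name are placeholders.
import pandas as pd
from sqlalchemy import create_engine

# fast_executemany is a pyodbc-specific option that batches the
# parameterized INSERTs instead of sending them one row at a time.
engine = create_engine(
    "mssql+pyodbc://user:password@myserver.database.windows.net/mydb"
    "?driver=ODBC+Driver+17+for+SQL+Server",
    fast_executemany=True,
)

df = pd.read_parquet("timetable.parquet")  # placeholder source

# chunksize controls how many rows go into each round trip; tuning it
# (and the number of Dask partitions) is what has been tried so far.
df.to_sql("timetable", engine, if_exists="append", index=False, chunksize=10_000)
```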
h
I have not been able to speed up ingestion either, so that makes two of us. I assume you have tried playing around with the chunksize argument and the number of partitions?
p
Yes, I have also tried different file types, but without any particular success. No matter how I try to avoid using Data Factory, it has proven impossible so far. The alternative could be self-hosting an open-source SQL DB, or maybe Snowflake, where the connectors are considerably better.
r
It may or may not fit your workflow/use case, but in the past when I've needed to ingest data into Azure SQL quickly, saving it to Blob Storage as a .csv and then running BULK INSERT worked well.
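In rough Python terms the pattern looks something like this (a sketch only; the storage account, the external data source, which has to be created in the database beforehand with a scoped credential, and the table name are all placeholders):
```python
# Rough sketch: stage a CSV in Blob Storage, then BULK INSERT it into Azure SQL.
# Container, data source, table, and connection strings are placeholders, and
# the EXTERNAL DATA SOURCE must already exist in the target database.
import pyodbc
from azure.storage.blob import BlobClient

# 1. Upload the CSV to Blob Storage.
blob = BlobClient.from_connection_string(
    conn_str="<storage-connection-string>",
    container_name="ingest",
    blob_name="timetable.csv",
)
with open("timetable.csv", "rb") as f:
    blob.upload_blob(f, overwrite=True)

# 2. Have Azure SQL pull the file straight from Blob Storage.
sql = """
BULK INSERT dbo.timetable
FROM 'timetable.csv'
WITH (
    DATA_SOURCE = 'BlobIngest',  -- pre-created EXTERNAL DATA SOURCE
    FORMAT = 'CSV',
    FIRSTROW = 2                 -- skip the header row
);
"""
conn = pyodbc.connect("<azure-sql-connection-string>", autocommit=True)
conn.execute(sql)
conn.close()
```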
h
I did a test with Meltano from Snowflake to Azure Postgres and it was slow too, so I don't think this is pandas/Dask-specific. I guess this is Microsoft's home-field advantage. But BULK INSERT looks interesting.
r
The common theme here might be SQLAlchemy? I think Meltano uses it, as does pandas' df.to_sql. SQLAlchemy is quite slow; it is good at what it does, but quick bulk inserts are not one of its strengths.
h
The Meltano sink uses psycopg2 directly, with no SQLAlchemy involved. Thinking about it now, there are a good number of systems that use some sort of bulk insert. I remember seeing a similar performance situation in Oracle Cloud some years ago too.
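For comparison (this is not what the Meltano sink actually does internally), a generic psycopg2 bulk load via COPY looks roughly like this, with a placeholder DSN, table, and data:
```python
# Generic Postgres bulk load using psycopg2's COPY support, shown only for
# comparison -- not a description of Meltano's sink internals.
import io
import psycopg2

rows = [("2024-01-01", "A", 1), ("2024-01-02", "B", 2)]  # placeholder data

# Build an in-memory CSV buffer and stream it with COPY, which is
# typically much faster than row-by-row INSERTs.
buf = io.StringIO()
for r in rows:
    buf.write(",".join(str(v) for v in r) + "\n")
buf.seek(0)

conn = psycopg2.connect("<postgres-dsn>")  # placeholder DSN
with conn, conn.cursor() as cur:
    cur.copy_expert(
        "COPY timetable (day, label, value) FROM STDIN WITH (FORMAT csv)", buf
    )
conn.close()
```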
p
2 seconds with BULK INSERT. Thanks! @Ryan Peden @Henning Holgersen 🙌
👍 1
😄 1
👀 1
r
Wow! That is great. I'm happy to hear it worked well for you.