# ask-community
p
Hi everyone! We are using Prefect 2.0 with Azure. The traditional df.to_sql is not fast enough for ingesting data into our Azure SQL DB. For a timetable with 50,000 rows:
- local Dask cluster, to_sql: ~6 min
- local Dask cluster, to_sql with fast_executemany: ~4 min
- Data Factory: ~15 sec
What would you do to increase ingestion speed? Would you recommend using a real Dask executor? Any help or ideas would be appreciated!
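For reference, this is roughly the kind of call I mean (a minimal sketch; the connection string, file, and table names are placeholders):
```python
# Minimal sketch of the to_sql approach described above.
# Connection string, source file, and table name are placeholders.
import pandas as pd
from sqlalchemy import create_engine

# fast_executemany is a pyodbc-specific option that batches the
# parameterized INSERTs instead of sending them one row at a time.
engine = create_engine(
    "mssql+pyodbc://user:password@myserver.database.windows.net/mydb"
    "?driver=ODBC+Driver+17+for+SQL+Server",
    fast_executemany=True,
)

df = pd.read_parquet("timetable.parquet")  # placeholder source

# chunksize controls how many rows go into each round trip; tuning it
# (and the number of Dask partitions) is what has been tried so far.
df.to_sql("timetable", engine, if_exists="append", index=False, chunksize=10_000)
```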
h
I have not been able to speed up ingestion either, so that makes two of us. I assume you have tried playing around with the chunksize argument and the number of partitions?
p
Yes, I have also tried different file types, but without any particular success. No matter how I try to avoid using Data Factory, it has proven impossible so far. The alternative could be self-hosting an open-source SQL DB, or maybe Snowflake, where the connectors are considerably better.
r
It may or may not fit your workflow/use case, but in the past when I've needed to ingest data into Azure SQL quickly, saving it to Blob Storage as a .csv and then running BULK INSERT worked well.
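In rough Python terms the pattern looks something like this (a sketch only; the storage account, the external data source, which has to be created in the database beforehand with a scoped credential, and the table name are all placeholders):
```python
# Rough sketch: stage a CSV in Blob Storage, then BULK INSERT it into Azure SQL.
# Container, data source, table, and connection strings are placeholders, and
# the EXTERNAL DATA SOURCE must already exist in the target database.
import pyodbc
from azure.storage.blob import BlobClient

# 1. Upload the CSV to Blob Storage.
blob = BlobClient.from_connection_string(
    conn_str="<storage-connection-string>",
    container_name="ingest",
    blob_name="timetable.csv",
)
with open("timetable.csv", "rb") as f:
    blob.upload_blob(f, overwrite=True)

# 2. Have Azure SQL pull the file straight from Blob Storage.
sql = """
BULK INSERT dbo.timetable
FROM 'timetable.csv'
WITH (
    DATA_SOURCE = 'BlobIngest',  -- pre-created EXTERNAL DATA SOURCE
    FORMAT = 'CSV',
    FIRSTROW = 2                 -- skip the header row
);
"""
conn = pyodbc.connect("<azure-sql-connection-string>", autocommit=True)
conn.execute(sql)
conn.close()
```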
h
I did a test with Meltano from Snowflake to Azure Postgres and it was slow too, so I don't think this is pandas/Dask-specific. I guess this is Microsoft's home-field advantage. But BULK INSERT looks interesting.
r
The common theme here might be SQLAlchemy? I think Meltano uses it, as does pandas' df.to_sql. SQLAlchemy is quite slow; it is good at what it does, but quick bulk inserts are not one of its strengths.
h
The Meltano sink uses psycopg2 directly, with no SQLAlchemy involved. Thinking about it now, there are a good number of systems that use some sort of bulk insert. I remember seeing a similar performance situation in Oracle Cloud some years ago too.
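For comparison (this is not what the Meltano sink actually does internally), a generic psycopg2 bulk load via COPY looks roughly like this, with a placeholder DSN, table, and data:
```python
# Generic Postgres bulk load using psycopg2's COPY support, shown only for
# comparison -- not a description of Meltano's sink internals.
import io
import psycopg2

rows = [("2024-01-01", "A", 1), ("2024-01-02", "B", 2)]  # placeholder data

# Build an in-memory CSV buffer and stream it with COPY, which is
# typically much faster than row-by-row INSERTs.
buf = io.StringIO()
for r in rows:
    buf.write(",".join(str(v) for v in r) + "\n")
buf.seek(0)

conn = psycopg2.connect("<postgres-dsn>")  # placeholder DSN
with conn, conn.cursor() as cur:
    cur.copy_expert(
        "COPY timetable (day, label, value) FROM STDIN WITH (FORMAT csv)", buf
    )
conn.close()
```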
p
2 seconds with BULK INSERT. Thanks! @Ryan Peden @Henning Holgersen 🙌
👍 1
😄 1
👀 1
r
Wow! That is great. I'm happy to hear it worked well for you.