Thread
#prefect-community
    Sebastian
    2 years ago
    Hi everyone, I was wondering if anyone could kindly give some pointers to a data engineering noob like me. 🤓 I have to ingest about 10GB of data from xlsx files into a SQL star schema. The person working on this before me used Pandas to do all the data cleaning, surrogate key creation, etc. I am worried that as we get more data, at some point it won't fit into memory anymore and we will have a problem. 🤪 What are the best approaches for this in your opinion? Do you put the data into some intermediary SQL table? Do you use Dask? (Taking into consideration that I would like to learn/leverage Prefect 🙂)
    Aiden Price
    2 years ago
    If there's a lot of Pandas code you've inherited, then switching to Dask is actually very straightforward. See examples like this: https://towardsdatascience.com/trying-out-dask-dataframes-in-python-for-fast-data-analysis-in-parallel-aa960c18a915
    Setting up the Dask cluster is the "hard" part. Even if you only run a local cluster, Dask will still be able to handle datasets larger than memory because of lazy evaluation.
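    For illustration, a minimal sketch of what moving the inherited Pandas loading onto Dask could look like, using dask.delayed so the xlsx files are only read lazily; the file path and sheet name here are just placeholders:

    import glob
    import dask
    import dask.dataframe as dd
    import pandas as pd

    # Wrap each pandas read_excel call in dask.delayed so nothing is loaded yet.
    files = glob.glob("data/*.xlsx")  # placeholder path
    parts = [dask.delayed(pd.read_excel)(path, sheet_name=0) for path in files]

    # Build one lazy Dask dataframe from the delayed parts; the cleaning code can
    # then stay close to the Pandas version, but only runs chunk by chunk when
    # .compute() or a .to_*() write is called.
    ddf = dd.from_delayed(parts)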
    Sebastian
    2 years ago
    Hi Aiden, thank you so much for your input! In your experience, will things like removing duplicates from a big concatenated dataframe work fairly well, or do you think I will run into big performance hits?
    Aiden Price
    2 years ago
    Actually, I don't know the answer to that one. Dask builds a graph of tasks, and finding uniques shouldn't be too much harder than other operations.
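    As a rough illustration of the code involved (the frames and key column below are toy placeholders): drop_duplicates on a Dask dataframe shuffles data between partitions, so it works on larger-than-memory data but is more expensive than purely row-wise operations.

    import pandas as pd
    import dask.dataframe as dd

    # Two toy frames standing in for the per-file data.
    a = dd.from_pandas(pd.DataFrame({"order_id": [1, 2, 3], "amount": [10, 20, 30]}), npartitions=1)
    b = dd.from_pandas(pd.DataFrame({"order_id": [3, 4], "amount": [30, 40]}), npartitions=1)

    # Concatenate lazily, then drop duplicates on the key column.
    deduped = dd.concat([a, b]).drop_duplicates(subset=["order_id"])

    print(deduped.compute())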
    Simone Cittadini
    2 years ago
    Hi Sebastian, in my experience there's no way to find duplicates without hitting memory constraints at some point, since you need to keep in memory what you've already seen. We solved it by streaming the validated rows to a "raw data" DB table and letting a stored procedure implement the logic for duplicates. Curious to hear if there's a better method.
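    For reference, the streaming side of that approach could look roughly like this, reading one file at a time and appending the validated rows to a staging table; the connection string, paths and table names are placeholders:

    import glob
    import pandas as pd
    from sqlalchemy import create_engine

    engine = create_engine("postgresql+psycopg2://user:pass@localhost/warehouse")  # placeholder DSN

    for path in glob.glob("data/*.xlsx"):  # placeholder path
        df = pd.read_excel(path)           # one file at a time keeps memory bounded
        # ... per-file validation / cleaning goes here ...
        df.to_sql("raw_data", engine, schema="staging",
                  if_exists="append", index=False, chunksize=10_000)

    # Duplicate handling then happens inside the database.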
    Sebastian
    2 years ago
    Hi @Simone Cittadini! Thanks for your insights! Could you clarify what you mean by a stored procedure? Like a SQL query on the raw data containing the duplicates?
    Simone Cittadini
    2 years ago
    Yes, by putting a trigger on the insert:
    CREATE TRIGGER trigger_ BEFORE INSERT ON schema_.table_ FOR EACH ROW WHEN (pg_trigger_depth() = 0) EXECUTE PROCEDURE schema_.function_();
    which calls a function that skips the duplicate (or whatever your business logic is).
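    A minimal sketch of what such a trigger function might look like, created here from Python with psycopg2; the key column ("id") and the duplicate check are only assumptions, and the real business logic may differ:

    import psycopg2

    # Hypothetical BEFORE INSERT trigger function: returning NULL skips the row,
    # returning NEW lets the insert proceed.
    DDL = """
    CREATE OR REPLACE FUNCTION schema_.function_() RETURNS trigger AS $$
    BEGIN
        IF EXISTS (SELECT 1 FROM schema_.table_ WHERE id = NEW.id) THEN
            RETURN NULL;   -- duplicate: silently skip it
        END IF;
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;
    """

    with psycopg2.connect("dbname=warehouse user=etl") as conn:  # placeholder DSN
        with conn.cursor() as cur:
            cur.execute(DDL)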
    Sebastian
    2 years ago
    Thank you so much, this is really helpful! 🙂