# prefect-community
s
Hi everyone, I was wondering if anyone could kindly give some pointers to a data engineering noob like me. 🤓 I have to ingest about 10GB of data from xlsx files into a SQL star schema. The person working on this before me used Pandas to do all the data cleaning, surrogate key creation, etc. I am worried that as we get more data, at some point it won't fit into memory anymore and we will have a problem. 🤪 What are the best approaches for this in your opinion? Do you put the data into some intermediary SQL table? Do you use Dask? (Taking into consideration that I would like to learn/leverage Prefect.)
a
If there's a lot of Pandas code you've inherited, then changing it to Dask code is actually very straightforward. See examples like this: https://towardsdatascience.com/trying-out-dask-dataframes-in-python-for-fast-data-analysis-in-parallel-aa960c18a915
Setting up the Dask cluster is the "hard" part. Even if you only run a local cluster, Dask will still be able to handle datasets larger than memory because of lazy evaluation.
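Something like this is all it takes (a minimal sketch; the CSV paths and column names are made up, and note that Dask has no native xlsx reader, so you'd export to CSV first or wrap pandas.read_excel in dask.delayed):
```python
import dask.dataframe as dd
from dask.distributed import Client, LocalCluster

# A purely local cluster is enough to work on data larger than RAM.
cluster = LocalCluster(n_workers=4, memory_limit="2GB")
client = Client(cluster)

# Lazy: nothing is read yet, Dask only records a task graph.
df = dd.read_csv("exports/*.csv")
df = df.dropna(subset=["order_id"])    # same pandas-style API
df["total"] = df["qty"] * df["price"]  # still lazy

# Only .compute() runs the graph, streaming partitions through memory
# instead of loading all 10GB at once.
result = df.groupby("customer_id")["total"].sum().compute()
```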
💡 2
s
Hi Aiden, thank you so much for your input! In your experience, will things like removing duplicates from a big concatenated dataframe work fairly well, or do you think I will see big performance hits?
a
Actually, I don't know the answer to that one. Dask builds a graph of tasks, and finding uniques shouldn't be much more difficult than other operations.
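Roughly, it would look like this (same made-up columns as above). As far as I understand, drop_duplicates dedups each partition first and then combines the partial results, so it does move data between workers and costs more than a row-local operation like dropna:
```python
import dask.dataframe as dd

df = dd.read_csv("exports/*.csv")

# Per-partition dedup followed by a combine step across partitions;
# the inter-worker data movement is where a performance hit could appear.
deduped = df.drop_duplicates(subset=["order_id"])

# len() forces the whole graph to execute.
print(len(deduped))
```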
s
Hi Sebastian, in my experience there's no way to find duplicates without running into memory constraints at some point, since you need to keep track of what you've already seen. We solved it by streaming the validated rows to a "raw data" db table and letting a stored procedure implement the logic for duplicates. Curious to hear if there's a better method
s
Hi @Simone Cittadini! Thanks for your insights! Could you clarify what you mean by a stored procedure? Like a SQL query on the raw data containing the duplicates?
s
yes, putting a trigger on the insert:
```sql
CREATE TRIGGER trigger_ BEFORE INSERT ON schema_.table_ FOR EACH ROW WHEN (pg_trigger_depth() = 0) EXECUTE PROCEDURE schema_.function_();
```
which calls a function that skips the duplicate (or whatever your business logic is)
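For example, the function can just return NULL to silently drop the row. A rough sketch only, created here from Python with psycopg2; the key column order_id and the connection string are made up, and your business logic will differ:
```python
import psycopg2

# PL/pgSQL body for the schema_.function_() referenced by the trigger above.
DEDUP_FUNCTION = """
CREATE OR REPLACE FUNCTION schema_.function_() RETURNS trigger AS $$
BEGIN
    -- If a row with the same natural key is already there, skip the insert.
    IF EXISTS (SELECT 1 FROM schema_.table_ WHERE order_id = NEW.order_id) THEN
        RETURN NULL;  -- returning NULL from a BEFORE trigger drops the row
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;
"""

with psycopg2.connect("dbname=warehouse") as conn, conn.cursor() as cur:
    cur.execute(DEDUP_FUNCTION)
```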
🙏 1
s
Thank you so much, this is really helpful! 🙂