Thread
#prefect-community
    Sebastian
    2 years ago
    Hi everyone, I was wondering if anyone could kindly give some pointers to a data engineering noob like me. 🤓 I have to ingest about 10GB of data from xlsx files into a SQL star schema. The person working on this before me used Pandas to do all the data cleaning, surrogate key creation, etc. I am worried that as we get more data, at some point it won't fit into memory anymore and we will have a problem. 🤪 What are the best approaches for this in your opinion? Do you put the data into some intermediary SQL table? Do you use Dask? (Taking into consideration that I would like to learn/leverage Prefect 🙂)
    Aiden Price
    2 years ago
    If there's a lot of Pandas code you've inherited, then switching to Dask is actually very straightforward. See examples like this: https://towardsdatascience.com/trying-out-dask-dataframes-in-python-for-fast-data-analysis-in-parallel-aa960c18a915
    Setting up the Dask cluster is the "hard" part. Even if you only run a local cluster, Dask will still be able to handle datasets larger than memory because of lazy evaluation.
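    For illustration, a minimal sketch of what moving the inherited Pandas loading onto Dask could look like, using dask.delayed so the xlsx files are only read lazily; the file path and sheet name here are just placeholders:

    import glob
    import dask
    import dask.dataframe as dd
    import pandas as pd

    # Wrap each pandas read_excel call in dask.delayed so nothing is loaded yet.
    files = glob.glob("data/*.xlsx")  # placeholder path
    parts = [dask.delayed(pd.read_excel)(path, sheet_name=0) for path in files]

    # Build one lazy Dask dataframe from the delayed parts; the cleaning code can
    # then stay close to the Pandas version, but only runs chunk by chunk when
    # .compute() or a .to_*() write is called.
    ddf = dd.from_delayed(parts)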
    Sebastian
    2 years ago
    Hi Aiden, thank you so much for your input! In your experience, will things like removing duplicates from a big concatenated dataframe work fairly well, or do you think I will run into big performance hits?
    Aiden Price
    2 years ago
    Actually, I don't know the answer to that one. Dask builds a graph of tasks, and finding uniques shouldn't be too much harder than other operations.
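    As a rough illustration of the code involved (the frames and key column below are toy placeholders): drop_duplicates on a Dask dataframe shuffles data between partitions, so it works on larger-than-memory data but is more expensive than purely row-wise operations.

    import pandas as pd
    import dask.dataframe as dd

    # Two toy frames standing in for the per-file data.
    a = dd.from_pandas(pd.DataFrame({"order_id": [1, 2, 3], "amount": [10, 20, 30]}), npartitions=1)
    b = dd.from_pandas(pd.DataFrame({"order_id": [3, 4], "amount": [30, 40]}), npartitions=1)

    # Concatenate lazily, then drop duplicates on the key column.
    deduped = dd.concat([a, b]).drop_duplicates(subset=["order_id"])

    print(deduped.compute())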
    Simone Cittadini
    2 years ago
    Hi Sebastian, in my experience there's no way to find duplicates without hitting memory constraints at some point, since you need to keep in memory what you've already seen. We solved it by streaming the validated rows to a "raw data" DB table and letting a stored procedure implement the logic for duplicates. Curious to hear if there's a better method.
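    For reference, the streaming side of that approach could look roughly like this, reading one file at a time and appending the validated rows to a staging table; the connection string, paths and table names are placeholders:

    import glob
    import pandas as pd
    from sqlalchemy import create_engine

    engine = create_engine("postgresql+psycopg2://user:pass@localhost/warehouse")  # placeholder DSN

    for path in glob.glob("data/*.xlsx"):  # placeholder path
        df = pd.read_excel(path)           # one file at a time keeps memory bounded
        # ... per-file validation / cleaning goes here ...
        df.to_sql("raw_data", engine, schema="staging",
                  if_exists="append", index=False, chunksize=10_000)

    # Duplicate handling then happens inside the database.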
    Sebastian
    2 years ago
    Hi @Simone Cittadini! Thanks for your insights! Could you clarify what you mean by a stored procedure? Like a SQL query on the raw data containing the duplicates?
    Simone Cittadini
    2 years ago
    Yes, by putting a trigger on the insert:
    CREATE TRIGGER trigger_ BEFORE INSERT ON schema_.table_ FOR EACH ROW WHEN (pg_trigger_depth() = 0) EXECUTE PROCEDURE schema_.function_();
    which calls a function that skips the duplicate (or whatever your business logic is).
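    A minimal sketch of what such a trigger function might look like, created here from Python with psycopg2; the key column ("id") and the duplicate check are only assumptions, and the real business logic may differ:

    import psycopg2

    # Hypothetical BEFORE INSERT trigger function: returning NULL skips the row,
    # returning NEW lets the insert proceed.
    DDL = """
    CREATE OR REPLACE FUNCTION schema_.function_() RETURNS trigger AS $$
    BEGIN
        IF EXISTS (SELECT 1 FROM schema_.table_ WHERE id = NEW.id) THEN
            RETURN NULL;   -- duplicate: silently skip it
        END IF;
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;
    """

    with psycopg2.connect("dbname=warehouse user=etl") as conn:  # placeholder DSN
        with conn.cursor() as cur:
            cur.execute(DDL)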
    Sebastian
    2 years ago
    Thank you so much, this is really helpful! 🙂