# prefect-gcp
k
I am working with Prefect on my local machine for my personal project, using prefect_gcp. I have one pipeline set up where I pull data from a website and store it in Google Cloud Storage, and a second pipeline where I pull the data from the GCS bucket, perform some transformations, and load it into Google BigQuery. I am doing all of this in Python. For now I am using gcs_bucket.get_directory to fetch the files from the GCS bucket onto my local machine, then read them into a pandas dataframe for the transformations, which later go into BigQuery. However, I wanted a way to skip downloading the files to my local machine and instead pull them from the GCS bucket straight into a pandas dataframe, so that I don't have to download lots of data locally. I read through the whole documentation but I couldn't find anything that does this. Maybe I am missing something?
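Roughly, this is what my current flow looks like (simplified sketch; the block name and paths are just placeholders):
```python
import pandas as pd
from prefect import flow
from prefect_gcp.cloud_storage import GcsBucket


@flow
def transform_and_load():
    # Load a previously saved GcsBucket block (name is a placeholder)
    gcs_bucket = GcsBucket.load("my-gcs-block")

    # Download the raw files from GCS onto the local machine
    gcs_bucket.get_directory(from_path="raw/", local_path="./data")

    # Read a downloaded file into pandas for the transformations
    df = pd.read_csv("./data/my_file.csv")
    ...  # transformations, then load into BigQuery
```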
n
hi @Kartik Ullal - if you don't want to save it on disk, you can just download the data from GCS as bytes and pass that directly to your dataframe (assuming the data can fit into memory) with something like `pd.read_csv(io.BytesIO(blob_bytes))`. if it can't fit into memory you could do something like:
• read it a `chunksize` at a time
• load it into BQ right away and use SQL to do the transformations there
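for example, a rough sketch assuming you've saved a `GcsBucket` block - the block name and object path here are placeholders, and `read_path` returns the object's contents as bytes:
```python
import io

import pandas as pd
from prefect import flow
from prefect_gcp.cloud_storage import GcsBucket


@flow
def transform_from_gcs():
    gcs_bucket = GcsBucket.load("my-gcs-block")  # placeholder block name

    # Read the object straight into memory as bytes - nothing is written to disk
    blob_bytes = gcs_bucket.read_path("raw/my_file.csv")

    # Wrap the bytes in a file-like object and hand them to pandas
    df = pd.read_csv(io.BytesIO(blob_bytes))
    ...  # transformations, then load into BigQuery
```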
k
+1 to loading the data into BigQuery; it eliminates any worry you might have about local storage or memory constraints
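for example, the BigQuery client can load a CSV straight from GCS into a table without it ever touching your machine - a rough sketch (the project, dataset, table, and URI below are placeholders):
```python
from google.cloud import bigquery

client = bigquery.Client()

# Placeholder destination table and source object
table_id = "my-project.my_dataset.raw_data"
uri = "gs://my_bucket/my_file.csv"

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,   # skip the header row
    autodetect=True,       # let BigQuery infer the schema
)

# BigQuery pulls the file from GCS directly; nothing goes through the local machine
load_job = client.load_table_from_uri(uri, table_id, job_config=job_config)
load_job.result()  # wait for the load job to finish

# transformations can then be done in SQL, e.g. with client.query(...)
```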
s
With pandas, you can read directly from Google Cloud Storage. If your object is at gs://my_bucket/my_file.csv, you can read it into a pandas dataframe using:
```python
df = pandas.read_csv("gs://my_bucket/my_file.csv")
```
There is no need to download the file(s) or to read them into a separate place in memory.
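Note this relies on gcsfs being installed so pandas can open gs:// paths. It also combines with the chunksize suggestion above if the file is too large for memory; a rough sketch (the bucket, table, and project names are placeholders, and to_gbq requires pandas-gbq):
```python
import pandas as pd

# Requires gcsfs so pandas can open gs:// paths directly.
for chunk in pd.read_csv("gs://my_bucket/my_file.csv", chunksize=100_000):
    transformed = chunk  # ... your transformations here ...
    # Append each transformed chunk to BigQuery (needs pandas-gbq installed)
    transformed.to_gbq("my_dataset.my_table", project_id="my-project", if_exists="append")
```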