# prefect-gcp
k
I am working with Prefect on my local machine for my personal project, using prefect_gcp. I have one pipeline set up where I pull data from a website and store it in Google Cloud Storage, and a second pipeline where I pull the data from the GCS bucket, perform some transformations, and load it into Google BigQuery. I am doing all of this in Python. For now I am using gcs_bucket.get_directory to fetch the files from the GCS bucket onto my local machine, then read them into a pandas dataframe for the transformations, which later go into BigQuery. However, I wanted a way to skip downloading the files to my local machine and instead pull them from the GCS bucket straight into a pandas dataframe, so that I don't have to download lots of data locally. I read through the whole documentation but I couldn't find anything that does this. Maybe I am missing something?
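Roughly, this is what my current flow looks like (simplified sketch; the block name and paths are just placeholders):
```python
import pandas as pd
from prefect import flow
from prefect_gcp.cloud_storage import GcsBucket


@flow
def transform_and_load():
    # Load a previously saved GcsBucket block (name is a placeholder)
    gcs_bucket = GcsBucket.load("my-gcs-block")

    # Download the raw files from GCS onto the local machine
    gcs_bucket.get_directory(from_path="raw/", local_path="./data")

    # Read a downloaded file into pandas for the transformations
    df = pd.read_csv("./data/my_file.csv")
    ...  # transformations, then load into BigQuery
```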
n
hi @Kartik Ullal - if you don't want to save it on disk, you can just download the data from GCS as bytes and pass that directly to your dataframe (assuming the data can fit into memory) with something like `pd.read_csv(io.BytesIO(blob_bytes))`. if it can't fit into memory you could do something like:
• read it a `chunksize` at a time
• load it into BQ right away and use SQL to do the transformations there
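for example, a rough sketch assuming you've saved a `GcsBucket` block - the block name and object path here are placeholders, and `read_path` returns the object's contents as bytes:
```python
import io

import pandas as pd
from prefect import flow
from prefect_gcp.cloud_storage import GcsBucket


@flow
def transform_from_gcs():
    gcs_bucket = GcsBucket.load("my-gcs-block")  # placeholder block name

    # Read the object straight into memory as bytes - nothing is written to disk
    blob_bytes = gcs_bucket.read_path("raw/my_file.csv")

    # Wrap the bytes in a file-like object and hand them to pandas
    df = pd.read_csv(io.BytesIO(blob_bytes))
    ...  # transformations, then load into BigQuery
```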
k
+1 to loading the data into BigQuery; it eliminates any worry you might have about local storage or memory constraints
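for example, the BigQuery client can load a CSV straight from GCS into a table without it ever touching your machine - a rough sketch (the project, dataset, table, and URI below are placeholders):
```python
from google.cloud import bigquery

client = bigquery.Client()

# Placeholder destination table and source object
table_id = "my-project.my_dataset.raw_data"
uri = "gs://my_bucket/my_file.csv"

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,   # skip the header row
    autodetect=True,       # let BigQuery infer the schema
)

# BigQuery pulls the file from GCS directly; nothing goes through the local machine
load_job = client.load_table_from_uri(uri, table_id, job_config=job_config)
load_job.result()  # wait for the load job to finish

# transformations can then be done in SQL, e.g. with client.query(...)
```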
s
With pandas, you can read directly from Google Cloud Storage. If your object is at gs://my_bucket/my_file.csv, you can read it into a pandas dataframe using:
```python
df = pandas.read_csv("gs://my_bucket/my_file.csv")
```
There is no need to download the file(s) or to read them into a separate place in memory.
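Note this relies on gcsfs being installed so pandas can open gs:// paths. It also combines with the chunksize suggestion above if the file is too large for memory; a rough sketch (the bucket, table, and project names are placeholders, and to_gbq requires pandas-gbq):
```python
import pandas as pd

# Requires gcsfs so pandas can open gs:// paths directly.
for chunk in pd.read_csv("gs://my_bucket/my_file.csv", chunksize=100_000):
    transformed = chunk  # ... your transformations here ...
    # Append each transformed chunk to BigQuery (needs pandas-gbq installed)
    transformed.to_gbq("my_dataset.my_table", project_id="my-project", if_exists="append")
```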