# prefect-gcp
Hi folks. I'm working on a project, trying to extract data from GCS and load it into BQ. I'm getting a "No such file or directory" error when reading the parquet file from the local path I assign. Here's the error text:

```
FileNotFoundError: [Errno 2] No such file or directory: '\\data\\data\\matches\\atp_matches_1969.parquet'
```
Here's the path to that file in the GCS bucket:

```
tennis_data_lake_tennis-analysis-405301/data/matches/atp_matches_1969.parquet
```
Here's my extract code:

```python
def extract_from_gcs(tour: str, subgroup: str, year: int) -> Path:
    """Download trip data from GCS"""
    gcs_path = f"/data/matches/{tour}_matches{subgroup}_{year}.parquet"
    gcs_block = GcsBucket.load("tennis-bucket")
    gcs_block.get_directory(from_path=gcs_path, local_path=f"../data/")
    return Path(f"../data/{gcs_path}")
```
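As an aside, here's a minimal sketch (not from the original code, using `PurePosixPath` so the behavior is OS-independent) of how that returned f-string path resolves: because `gcs_path` starts with a leading `/`, the concatenation produces a doubled `data` segment, which matches the `data\data\matches` path in the error.

```python
from pathlib import PurePosixPath

gcs_path = "/data/matches/atp_matches_1969.parquet"

# f-string concatenation keeps the leading slash, yielding "../data//data/..."
raw = f"../data/{gcs_path}"
print(raw)                 # ../data//data/matches/atp_matches_1969.parquet

# pathlib collapses the repeated slash, leaving a doubled "data" segment
print(PurePosixPath(raw))  # ../data/data/matches/atp_matches_1969.parquet
```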
Where tour is "atp", subgroup is "" (an empty string), and year is 1969, so the returned path should be "data/data/matches/atp_matches_1969.parquet", right? Here's my transform code:
```python
@task()
def transform(path: Path) -> pd.DataFrame:
    """Transform parquet to df"""
    df = pd.read_parquet(path)
    return df
```
So I guess it's not finding the local path properly? My directory structure is:

```
data/
  matches/
src/
  BQ_ETL.py   (the file with the flow in question)
```

Shouldn't "../data/" as the local path and f"../data/{gcs_path}" as the returned path point to the same place? I have data from years 1968 to 2023, and weirdly the current code works for 1968 and 2023, the first and last files in my bucket, but fails for every other file in the bucket.
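One way to narrow this down is to check what `get_directory` actually wrote to disk. This is a hypothetical helper, not part of the original flow:

```python
from pathlib import Path

def list_parquet_files(root: str) -> list[str]:
    """Return POSIX-style paths of all parquet files under root, relative to root."""
    base = Path(root)
    return sorted(p.relative_to(base).as_posix() for p in base.rglob("*.parquet"))

# e.g. call print(list_parquet_files("../data")) right after the get_directory call
# and compare the listed paths against the path the flow tries to read
```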
@Seth Taylor besides the error you're seeing, why do you need to download the file from GCS to local in the first place?
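To that point: if gcsfs is installed in the environment (an assumption), pandas can read parquet straight from a `gs://` URI, skipping the local download entirely. The helper below is hypothetical, built from the bucket layout shown in the question:

```python
def gcs_uri(bucket: str, tour: str, subgroup: str, year: int) -> str:
    """Build the gs:// object URI matching the bucket layout from the question."""
    return f"gs://{bucket}/data/matches/{tour}_matches{subgroup}_{year}.parquet"

# With gcsfs installed, pandas can read the object without any local copy:
# import pandas as pd
# df = pd.read_parquet(gcs_uri("tennis_data_lake_tennis-analysis-405301", "atp", "", 1969))
```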