Miguel Angel
01/17/2022, 6:00 PMimport dask.dataframe as dd
from dask.distributed import Client
from s3fs import S3FileSystem
s3 = S3FileSystem()
client = Client()
folder_list = [
"file1",
"file2",
"file3",
"file4",
"file5",
"file6",
"file7",
"file8",
]
file_list = list(
map(lambda folder: f"<s3://my-bucket/parquet/{folder}/*.parquet>", folder_list,)
)
dataframe_list = client.map(dd.read_parquet, file_list, gather_statistics=False)
dataframe = client.submit(dd.concat, dataframe_list)
mean_value = client.submit(lambda x: ["some_data_column"].mean(), dataframe)
mean_compute = client.submit(lambda x: x.compute(), mean_value)
print(mean_compute.result())
Anna Geller
Miguel Angel
01/17/2022, 6:52 PMfolder_list
on the snipped above).Miguel Angel
01/17/2022, 6:53 PMfutures
instead of dealayed
Anna Geller
import awswrangler as wr
import pandas as pd
from prefect import task, Flow
from prefect.executors import DaskExecutor
from typing import List
@task
def get_folder_list() -> List[str]:
return [
"file1",
"file2",
"file3",
"file4",
"file5",
"file6",
"file7",
"file8",
]
@task
def process_parquet_file(file_name: str) -> pd.DataFrame:
return wr.s3.read_parquet(f"<s3://my-bucket/parquet/{file_name}/*.parquet>")
@task
def concat_dfs(list_of_dfs: List[pd.DataFrame]) -> None:
final_df = pd.concat(list_of_dfs, ignore_index=True)
print(final_df["some_data_column"].mean())
with Flow("flow_run_test", executor=DaskExecutor()) as flow:
folder_list = get_folder_list()
list_of_dfs = process_parquet_file.map(folder_list)
concat_dfs(list_of_dfs)
Kevin Kho
client.submit()
to execute. There is also some optimization so that consecutive mapped tasks are submitted together so it might be better.
But if you really want to do client.submit
or work with a dask dataframe, you can follow thisKevin Kho
client.submit
when working with a Dask dataframeMiguel Angel
01/18/2022, 10:35 AMfrom dask.dataframe.core import DataFrame
objects like). I tried something similar, although couldn't get parallel reading while fetching parquet files,Miguel Angel
01/18/2022, 10:37 AMAnna Geller