Bertrand GERVAIS
04/21/2020, 4:41 PM
Zachary Hughes
04/21/2020, 4:47 PM
Jim Crist-Harif
04/21/2020, 4:58 PM
dask.distributed.Client directly, rather than relying on prefect's executor inside this task, as that interface is less public.
Bertrand GERVAIS
04/22/2020, 6:19 AM
Jim Crist-Harif
04/22/2020, 5:22 PM
> I want to avoid returning a list of references to the data, as I am using a generic data reader (OGR library) and my source data can be either local files, DB or served by web services. OGR gives me a good abstraction to read this kind of data transparently, that I don't want to lose by using Prefect
Clarifying, just to ensure we're not talking past each other. Frequently when I see workloads like this, operations can be broken down into two steps:
• List references to all the objects you want to process (usually fast, doesn't require reading the data). This may be a list of files, a list of keys in the db, etc.
• Do some work on each object. This task would take the reference, and use the underlying library (OGR) to read the data in and process it.
If your workload fits this, then prefect's map should work:
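from prefect import task

@task
def items_to_process():
    # Placeholder: return references (file paths, DB keys, URLs), not the data itself
    return list_of_keys_to_process

@task
def process_single_item(key):
    # Placeholder helpers: read one item (e.g. via OGR), then process it
    data = read_in_data_from_key(key)
    do_some_work(data)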
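To make the two-step shape concrete without depending on any particular Prefect version, here is a minimal sketch that uses the standard library's `concurrent.futures` in place of Prefect's map. Everything in it is an assumption for illustration: `SOURCES` is a hypothetical in-memory stand-in for files, DB rows, or web services, and `read_in_data_from_key` stands in for whatever reader (such as OGR) opens the source behind a key.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data sources keyed by reference; in a real workload a key
# could be a file path, a DB key, or a service URL.
SOURCES = {"a": [1, 2, 3], "b": [10, 20], "c": []}


def items_to_process():
    """Step 1: list references only (cheap, no data is read)."""
    return sorted(SOURCES)


def read_in_data_from_key(key):
    """Stand-in for a generic reader opening the source behind `key`."""
    return SOURCES[key]


def process_single_item(key):
    """Step 2: read one item and do some work on it (here: sum the values)."""
    data = read_in_data_from_key(key)
    return key, sum(data)


def run():
    keys = items_to_process()
    with ThreadPoolExecutor(max_workers=2) as pool:
        # Each key is processed independently; the data is read inside the worker.
        return dict(pool.map(process_single_item, keys))


if __name__ == "__main__":
    print(run())
```

The point of the sketch is that only the keys travel between the two steps, so the generic reader stays in charge of how each source is opened.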
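from dask.distributed import get_client
from prefect import task

@task
def your_dask_processing_step():
    # Get a client for the Dask cluster this task is running on
    client = get_client()
    # do dask things in here
    # for example (inc is a placeholder for any function to apply):
    results = client.map(inc, [1, 2, 3, 4, 5])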
Bertrand GERVAIS
04/23/2020, 9:12 AM
from dask.distributed import get_client
I am very interested in the future work you will be doing on this!