Bertrand GERVAIS04/21/2020, 4:41 PM
Zachary Hughes04/21/2020, 4:47 PM
Jim Crist-Harif04/21/2020, 4:58 PM
directly, rather than relying on prefect's executor inside this task, as that interface is less public.
Bertrand GERVAIS04/22/2020, 6:19 AM
Jim Crist-Harif04/22/2020, 5:22 PM
I want to avoid returning a list of references to the data, as I am using a generic data reader (OGR library) and my source data can be either local files, DB or served by web services. OGR gives me a good abstraction to read this kind of data transparently, that I dont want to lose by using PrefectClarifying, just to ensure we're not talking past each other. Frequently when I see workloads like this, operations can be broken down into two steps: • List references to all the objects you want to process (usually fast, doesn't require reading the data). This may be a list of files, a list of keys in the db, etc... • Do some work on each object. This task would take the reference, and use the underlying library (OGR) to read the data in and process it. If your workload fits, this, then prefect's map should work:
@task def items_to_process(): return list_of_keys_to_process @task def process_single_item(key): data = read_in_data_from_key(key) do_some_work(data)
from dask.distributed import get_client @task def your_dask_processing_step(): client = get_client() # do dask things in here # for example: results = client.map(inc, [1, 2, 3, 4, 5])
Bertrand GERVAIS04/23/2020, 9:12 AM
I am very interested in future works you will be doing on this !
from dask.distributed import get_client