An Hoang
08/19/2021, 5:25 PM
Is there a way to set a variable in prefect.context to be used in all subsequent tasks? I have a stateful variable that many tasks will use, and I would love to use it inside many tasks without passing it in as a parameter.
Kevin Kho
An Hoang
08/19/2021, 5:30 PM
```python
config_path = Parameter("config_path")  # string param
data_catalog_object = get_data_catalog(config_path)  # custom-class object initialized from parameters parsed from the config_path
task1_result = task1(data_catalog_object, *args)  # data_catalog_object's state might be modified here
# many more tasks using the data_catalog_object as input
```
It is rarely below 10kb, so I don't think the KV Store would work.
Kevin Kho
You can pass the data_catalog_object to downstream tasks…unless it's not serializable?
An Hoang
08/19/2021, 5:33 PM
Each task takes in the data_catalog_object and modifies its state, so I wondered if there is an easy way to not have to pass it to the tasks every single time. It's also a hassle to have to return the modified data_catalog_object in every single task, as target caching will not work as intended.
Kevin Kho
If a task modifies the data_catalog_object, it needs to be explicitly returned, because Prefect won't keep track of that state for operations like restarting the Flow Run from the point of failure. I'll try to figure out the target for multiple returns, though.
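For illustration only, the explicit pass-and-return pattern being described might look roughly like this (the task names and bodies are hypothetical placeholders, not from the thread):
```python
from prefect import Flow, Parameter, task

@task
def get_data_catalog(config_path):
    # hypothetical: build the catalog object from the config path
    ...

@task
def add_dataset(catalog):
    # ... mutate the catalog here ...
    return catalog  # the mutated object has to be returned explicitly

@task
def use_catalog(catalog):
    ...

with Flow("explicit-returns") as flow:
    config_path = Parameter("config_path")
    catalog = get_data_catalog(config_path)
    catalog = add_dataset(catalog)  # re-bind so downstream tasks see the new state
    use_catalog(catalog)
```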
An Hoang
08/19/2021, 7:40 PM
What if the data_catalog_object is not mutated, just used many times in many tasks? Does that change anything?
Kevin Kho
What if you make your own decorator like @task_with_catalog that supplies the data_catalog so you don't have to worry about it? You would still have to input it inside the Flow, but at least it doesn't affect the task definition?
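Purely as a sketch of what that might look like (task_with_catalog and the data_catalog argument name are made up for illustration):
```python
import functools
from prefect import task

def task_with_catalog(fn):
    # hypothetical decorator: pulls the catalog out of kwargs and hands it
    # to the wrapped function as its first argument
    @task
    @functools.wraps(fn)
    def wrapper(*args, data_catalog=None, **kwargs):
        return fn(data_catalog, *args, **kwargs)
    return wrapper

@task_with_catalog
def task1(catalog, other_arg):
    ...

# inside the Flow block you would still wire it up once per call:
# task1(other_arg, data_catalog=data_catalog_object)
```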
Kevin Kho
What exactly is the data_catalog_object? I think the KV Store is really something that will help here if possible.
An Hoang
08/19/2021, 9:27 PM
The data_catalog_object is Kedro's DataCatalog. I want to separate out the data loading/saving part from the Prefect code and let the data catalog handle all of that. So when I want to save cars.csv, assume that I have the yaml below at config/path/folder/catalog.yaml:
```yaml
cars:
  type: pandas.CSVDataSet
  filepath: data/01_raw/company/cars.csv
  load_args:
    sep: ','
  save_args:
    index: False
    date_format: '%Y-%m-%d %H:%M'
    decimal: .
```
I can do
```python
import yaml
from kedro.io import DataCatalog

cars_df = ...
# build the catalog from the config shown above
with open("config/path/folder/catalog.yaml") as f:
    data_catalog = DataCatalog.from_config(yaml.safe_load(f))
data_catalog.save("cars", cars_df)  # saves cars_df to data/01_raw/company/cars.csv
cars_df = data_catalog.load("cars")
```
I just use with Flow('flow', result=LocalResult("path/to/outer-most/folder")) and use this catalog to handle the loading and saving of sub-files/folders.
An Hoang
08/19/2021, 9:28 PM
One thing that would be very helpful is if the target argument accepts a function that returns True or False; then I can write the function to check multiple outputs with complex logic.
Kevin Kho
The first thing you can do is store the path to the DataCatalog you want to use in the Flow and then load it during the Flow run. If you store your flow as a script, it will be loaded and run at runtime. The second thing is to use the KV Store to point to the address of the DataCatalog and then load it in per task. You can also maybe mutate it, save it, and then load it downstream again. You can create a helper function (non-task) that loads this in, and then your flows can use it.
If you make your own decorator like @mytask to handle this, the function just has to take in kwargs for you to be able to pass in the configuration. The new decorator might be able to take care of this for you.
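A rough sketch of the helper-function idea under those assumptions (the KV Store key name "catalog_config_path" and the task are hypothetical; get_key_value is the Prefect 1.x KV Store client):
```python
import yaml
from kedro.io import DataCatalog
from prefect import task
from prefect.backend import get_key_value

def load_catalog() -> DataCatalog:
    # non-task helper: look up where the catalog config lives, then rebuild it
    config_path = get_key_value("catalog_config_path")
    with open(config_path) as f:
        return DataCatalog.from_config(yaml.safe_load(f))

@task
def process_cars():
    catalog = load_catalog()       # each task rebuilds the catalog at runtime
    cars_df = catalog.load("cars")
    # ... transform cars_df ...
    catalog.save("cars", cars_df)  # persist through the catalog, not Prefect results
```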
08/20/2021, 2:06 PMstore the path to theThere can be only one fixed path stored per flow right? I need the path to be parameterized by the user Also, what do you think about this? Would it be feasible to add in the near future?you want to use in the Flow and then load it during the Flow runDataCatalog
One thing that would be very helpful is ifargument accepts a function that returnstarget
orTrue
, then I can write the function to check multiple outputs with complex logicFalse
Kevin Kho
targets still work in conjunction with serializers, and if you have two different targets, you would still need to make your own custom serializer to handle different types. Actually, the way it works right now is that your two returns will come in as a tuple to the Serializer, so you can actually make your own serializer to provide custom logic to handle the tuple.
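For illustration, a custom serializer along those lines might look like this (TupleSerializer is a hypothetical name and the per-element handling is just a placeholder):
```python
import cloudpickle
from prefect.engine.serializers import Serializer

class TupleSerializer(Serializer):
    # a task returning two values hands the Serializer a single tuple,
    # so element-specific logic can live here
    def serialize(self, value) -> bytes:
        first, second = value
        # ... custom handling of each element could go here ...
        return cloudpickle.dumps((first, second))

    def deserialize(self, value: bytes):
        return cloudpickle.loads(value)
```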
If you need to parameterize it, then the best approach is really to parameterize the path, create the DataCatalog, and pass it throughout the flow. I think by design this just becomes really hard if you mutate it inside the tasks; the best design, I think, is to mutate them in their own tasks and pass them around like that.
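A minimal sketch of that parameterized pattern, reusing the catalog.yaml from earlier in the thread (task names are hypothetical):
```python
import yaml
from kedro.io import DataCatalog
from prefect import Flow, Parameter, task

@task
def get_data_catalog(config_path: str) -> DataCatalog:
    # build the catalog once from the parameterized path
    with open(config_path) as f:
        return DataCatalog.from_config(yaml.safe_load(f))

@task
def task1(catalog: DataCatalog):
    cars_df = catalog.load("cars")
    # ... work with cars_df ...
    return cars_df

with Flow("catalog-flow") as flow:
    config_path = Parameter("config_path", default="config/path/folder/catalog.yaml")
    catalog = get_data_catalog(config_path)
    task1(catalog)  # pass the catalog explicitly to every downstream task
```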