An Hoang
08/19/2021, 5:25 PM
Is there a way to set a variable in prefect.context to be used in all subsequent tasks? I have a stateful variable that many tasks will use, and I would love to use it inside many tasks without passing it in as a parameter.
Kevin Kho
An Hoang
08/19/2021, 5:30 PM
```python
config_path = Parameter("config_path")  # string param
data_catalog_object = get_data_catalog(config_path)  # custom-class object initialized from parameters parsed from the config_path
task1_result = task1(data_catalog_object, *args)  # data_catalog_object's state might be modified here
# many more tasks using the data_catalog_object as input
```
It is rarely below 10kb, so I don't think the KV Store would work.
Kevin Kho
You can pass the data_catalog_object to downstream tasks…unless it's not serializable?
An Hoang
08/19/2021, 5:33 PM
Each task takes in the data_catalog_object and modifies its state, so I wondered if there is an easy way to not have to pass it to the tasks every single time. It's also a hassle to have to return the modified data_catalog_object in every single task, as target caching will not work as intended.
Kevin Kho
If a task modifies the data_catalog_object, it needs to be explicitly returned, because Prefect won't keep track of that state for operations like restarting the Flow Run from the point of failure. I'll try to figure out the target for multiple returns, though.
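For illustration only, the explicit pass-and-return pattern being described might look roughly like this (the task names and bodies are hypothetical placeholders, not from the thread):
```python
from prefect import Flow, Parameter, task

@task
def get_data_catalog(config_path):
    # hypothetical: build the catalog object from the config path
    ...

@task
def add_dataset(catalog):
    # ... mutate the catalog here ...
    return catalog  # the mutated object has to be returned explicitly

@task
def use_catalog(catalog):
    ...

with Flow("explicit-returns") as flow:
    config_path = Parameter("config_path")
    catalog = get_data_catalog(config_path)
    catalog = add_dataset(catalog)  # re-bind so downstream tasks see the new state
    use_catalog(catalog)
```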
An Hoang
08/19/2021, 7:40 PM
What if the data_catalog_object is not mutated, just used many times in many tasks? Does that change anything?
Kevin Kho
What if you make your own decorator like @task_with_catalog that supplies the data_catalog so you don't have to worry about it? You would still have to input it inside the Flow, but at least it doesn't affect the task definition?
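Purely as a sketch of what that might look like (task_with_catalog and the data_catalog argument name are made up for illustration):
```python
import functools
from prefect import task

def task_with_catalog(fn):
    # hypothetical decorator: pulls the catalog out of kwargs and hands it
    # to the wrapped function as its first argument
    @task
    @functools.wraps(fn)
    def wrapper(*args, data_catalog=None, **kwargs):
        return fn(data_catalog, *args, **kwargs)
    return wrapper

@task_with_catalog
def task1(catalog, other_arg):
    ...

# inside the Flow block you would still wire it up once per call:
# task1(other_arg, data_catalog=data_catalog_object)
```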
Kevin Kho
What exactly is the data_catalog_object? I think the KV Store is really something that will help here if possible.
An Hoang
08/19/2021, 9:27 PM
The data_catalog_object is Kedro's DataCatalog. I want to separate out the data loading/saving part from the Prefect code and let the data catalog handle all of that. So when I want to save cars.csv, assume that I have the yaml below at config/path/folder/catalog.yaml:
```yaml
cars:
  type: pandas.CSVDataSet
  filepath: data/01_raw/company/cars.csv
  load_args:
    sep: ','
  save_args:
    index: False
    date_format: '%Y-%m-%d %H:%M'
    decimal: .
```
I can do
```python
import yaml
from kedro.io import DataCatalog

cars_df = ...
# build the catalog from the config shown above
with open("config/path/folder/catalog.yaml") as f:
    data_catalog = DataCatalog.from_config(yaml.safe_load(f))
data_catalog.save("cars", cars_df)  # saves cars_df to data/01_raw/company/cars.csv
cars_df = data_catalog.load("cars")
```
I just use with Flow('flow', result=LocalResult("path/to/outer-most/folder")) and use this catalog to handle the loading and saving of sub-files/folders.
An Hoang
08/19/2021, 9:28 PM
One thing that would be very helpful is if the target argument accepts a function that returns True or False; then I can write the function to check multiple outputs with complex logic.
Kevin Kho
The first thing you can do is store the path to the DataCatalog you want to use in the Flow and then load it during the Flow run. If you store your flow as a script, it will be loaded and run at runtime. The second thing is to use the KV Store to point to the address of the DataCatalog and then load it in per task. You can also maybe mutate it, save it, and then load it downstream again. You can create a helper function (non-task) that loads this in, and then your flows can use it.
If you make your own decorator like @mytask to handle this, the function just has to take in kwargs for you to be able to pass in the configuration. The new decorator might be able to take care of this for you.
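A rough sketch of the helper-function idea under those assumptions (the KV Store key name "catalog_config_path" and the task are hypothetical; get_key_value is the Prefect 1.x KV Store client):
```python
import yaml
from kedro.io import DataCatalog
from prefect import task
from prefect.backend import get_key_value

def load_catalog() -> DataCatalog:
    # non-task helper: look up where the catalog config lives, then rebuild it
    config_path = get_key_value("catalog_config_path")
    with open(config_path) as f:
        return DataCatalog.from_config(yaml.safe_load(f))

@task
def process_cars():
    catalog = load_catalog()       # each task rebuilds the catalog at runtime
    cars_df = catalog.load("cars")
    # ... transform cars_df ...
    catalog.save("cars", cars_df)  # persist through the catalog, not Prefect results
```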
08/20/2021, 2:06 PMstore the path to theThere can be only one fixed path stored per flow right? I need the path to be parameterized by the user Also, what do you think about this? Would it be feasible to add in the near future?you want to use in the Flow and then load it during the Flow runDataCatalog
One thing that would be very helpful is ifargument accepts a function that returnstarget
orTrue
, then I can write the function to check multiple outputs with complex logicFalse
Kevin Kho
targets still work in conjunction with serializers, and if you have two different targets, you would still need to make your own custom serializer to handle different types. Actually, the way it works right now is that your two returns will come in as a tuple to the Serializer, so you can actually make your own serializer to provide custom logic to handle the tuple.
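For illustration, a custom serializer along those lines might look like this (TupleSerializer is a hypothetical name and the per-element handling is just a placeholder):
```python
import cloudpickle
from prefect.engine.serializers import Serializer

class TupleSerializer(Serializer):
    # a task returning two values hands the Serializer a single tuple,
    # so element-specific logic can live here
    def serialize(self, value) -> bytes:
        first, second = value
        # ... custom handling of each element could go here ...
        return cloudpickle.dumps((first, second))

    def deserialize(self, value: bytes):
        return cloudpickle.loads(value)
```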
If you need to parameterize it, then the best approach is really to parameterize the path, create the DataCatalog, and pass it throughout the flow. I think by design this just becomes really hard if you mutate it inside the tasks; the best design, I think, is to mutate them in their own tasks and pass them around like that.
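A minimal sketch of that parameterized pattern, reusing the catalog.yaml from earlier in the thread (task names are hypothetical):
```python
import yaml
from kedro.io import DataCatalog
from prefect import Flow, Parameter, task

@task
def get_data_catalog(config_path: str) -> DataCatalog:
    # build the catalog once from the parameterized path
    with open(config_path) as f:
        return DataCatalog.from_config(yaml.safe_load(f))

@task
def task1(catalog: DataCatalog):
    cars_df = catalog.load("cars")
    # ... work with cars_df ...
    return cars_df

with Flow("catalog-flow") as flow:
    config_path = Parameter("config_path", default="config/path/folder/catalog.yaml")
    catalog = get_data_catalog(config_path)
    task1(catalog)  # pass the catalog explicitly to every downstream task
```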