Hey folks I was wondering if it's possible to make our tasks Prefect Community #ask-community

Federico Zambelli

04/20/2023, 1:37 PM

Hey folks, I was wondering if it's possible to make our tasks and flows functions as methods of a class, without having the linter screaming at me (and without disabling type checking). Example of what I mean, try this:

from prefect import flow, task


class MyClass:
    @task
    def doStuffWithArgs(self, myArg: str):
        ...

    def doStuffWithoutArgs(self):
        ...

    @flow
    def doMoreStuff(self):
        self.doStuffWithArgs("stuff")
        self.doStuffWithoutArgs()

You will see that for doStuffWithArgs, Pylance complains that no overload matches that call (see screenshot):

flapili

04/20/2023, 1:39 PM

in my opinion this design is not easy to maintain anyway

flapili

04/20/2023, 1:40 PM

and probably not prefect friendly

Federico Zambelli

04/20/2023, 1:41 PM

Thanks for the feedback, could you elaborate on that? The reason I wanted a class is because I have certain objects that I want to reuse across different tasks. In my head it made more sense to be able to do this.my_object instead of either passing it as an arg of each task or making it global

Federico Zambelli

04/20/2023, 1:41 PM

what would you suggest as an alternative ?

flapili

04/20/2023, 1:41 PM

the issue is the args must be picklable

flapili

04/20/2023, 1:41 PM

in order to retry failed tasks, for example

Federico Zambelli

04/20/2023, 1:41 PM

uh dayum you're right

flapili

04/20/2023, 1:42 PM

passing the args is probably the best way to work with prefect

Federico Zambelli

04/20/2023, 1:43 PM

what if the arg itself ends up not being pickleable, but the result can? Such as, idk, a duckdb connection object

Federico Zambelli

04/20/2023, 1:43 PM

or an aiohttp session, for instance
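
A minimal sketch of the pickling concern, using only the standard library. sqlite3 stands in here for the duckdb connection mentioned above (an assumption made so the snippet runs without extra packages), and is_picklable and count_rows are hypothetical helper names. The idea is to pass something picklable, such as a path string, and open the connection inside the function body:

```python
import pickle
import sqlite3


def is_picklable(obj) -> bool:
    """Return True if pickle can serialize obj, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, TypeError, AttributeError):
        return False


def count_rows(db_path: str) -> int:
    """Open the connection inside the function, so only the path string is passed in."""
    conn = sqlite3.connect(db_path)
    try:
        query = "SELECT COUNT(*) FROM (SELECT 1 UNION ALL SELECT 2)"
        return conn.execute(query).fetchone()[0]
    finally:
        conn.close()
```

Here the argument (a string) and the return value (an int) are both picklable, while the connection object never leaves the function.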

flapili

04/20/2023, 1:44 PM

I believe tasks will not even run without the validate_parameters kwarg in the flow decorator

Federico Zambelli

04/20/2023, 1:45 PM

ah, fair enough

Federico Zambelli

04/20/2023, 1:45 PM

so my only option in that case is a variable set at the module level, i guess ?

flapili

04/20/2023, 1:47 PM

the less contextual objects tasks have the better they are 😅

Federico Zambelli

04/20/2023, 1:48 PM

pardon my ignorance, i'm not familiar with the term "contextual object", what do you mean with that ?

flapili

04/20/2023, 1:48 PM

tasks should be idempotent

Federico Zambelli

04/20/2023, 1:48 PM

tasks should be idempotent

this I'm aware

flapili

04/20/2023, 1:50 PM

by contextual object I would mean global/shared object

flapili

04/20/2023, 1:51 PM

you could create a pure function which returns an aiohttp session, for example

flapili

04/20/2023, 1:51 PM

and in this function you could implement pools of sessions or even singleton sessions

Federico Zambelli

04/20/2023, 1:53 PM

uhm ok I understand, altho I'm not sure in this case what would be the difference between a singleton session vs a global shared session object. it's not like im going to initialize it more than once

flapili

04/20/2023, 1:54 PM

you could also use a global shared session object

flapili

04/20/2023, 1:54 PM

but you can't pass it as args

Federico Zambelli

04/20/2023, 1:55 PM

seems fine by me, i don't see any drawbacks

flapili

04/20/2023, 1:58 PM

mostly DX issues

Federico Zambelli

04/20/2023, 1:58 PM

yeah, just that, but it's a personal project so im not too worried about it

flapili

04/20/2023, 1:58 PM

like typing, autocomplete, etc.

flapili

04/20/2023, 1:59 PM

from prefect import task, flow


def get_global_var():
    return 42


@task
def some_task():
    var = get_global_var()
    # do something with var
    return var


@flow
def main():
    r = some_task()
    print(r)

flapili

04/20/2023, 1:59 PM

imagine get_global_var returns an unpicklable object

Federico Zambelli

04/20/2023, 1:59 PM

yup 👍

flapili

04/20/2023, 1:59 PM

and if you want to reuse it you just have to implement a singleton
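
One possible way to get that singleton behaviour is functools.lru_cache on the accessor function. This is only a sketch: get_connection is a hypothetical name, and sqlite3 is again a stand-in for whatever unpicklable object is being shared. The cache lives per process, so separate worker processes would each create their own instance.

```python
import sqlite3
from functools import lru_cache


@lru_cache(maxsize=1)
def get_connection() -> sqlite3.Connection:
    """Create the connection on the first call; later calls return the same object."""
    return sqlite3.connect(":memory:")


def some_task() -> int:
    # The shared object is fetched inside the task instead of being passed as an argument.
    conn = get_connection()
    return conn.execute("SELECT 42").fetchone()[0]
```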

Federico Zambelli

04/20/2023, 2:00 PM

Since you're here, I have one last question if you don't mind: Is it possible to cache a task that returns nothing? E.g. imagine a task writes its results to some external storage, and I want it to skip execution if the passed args are the same. Does it make sense ?

Federico Zambelli

04/20/2023, 2:02 PM

(the purpose is idempotency ofc)

flapili

04/20/2023, 2:04 PM

you can, and prefect already implements it

flapili

04/20/2023, 2:04 PM

I don't remember the name but there is a func that takes the inputs and hashes them as the cache key

flapili

04/20/2023, 2:04 PM

but I'm wondering if you will not need a storage anyway

flapili

04/20/2023, 2:05 PM

because it keeps more than the result

flapili

04/20/2023, 2:05 PM

the date, for example

flapili

04/20/2023, 2:06 PM

like you could tell prefect "cache by args, use cache if result is less than X hours old"
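
To make "cache by args, use cache if result is less than X hours old" concrete, here is a plain-Python illustration of the idea. It is not Prefect's implementation (Prefect's task decorator takes cache_key_fn and cache_expiration for this); input_hash and cached_call are hypothetical names, and the cache is just an in-memory dict.

```python
import hashlib
import json
import time
from typing import Any, Callable, Dict, Tuple

# cache key -> (time the result was stored, the result itself)
_cache: Dict[str, Tuple[float, Any]] = {}


def input_hash(*args: Any, **kwargs: Any) -> str:
    """Build a stable key from the call arguments."""
    payload = json.dumps({"args": args, "kwargs": kwargs}, sort_keys=True, default=repr)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def cached_call(fn: Callable[..., Any], *args: Any, max_age_seconds: float, **kwargs: Any) -> Any:
    """Call fn unless a result for the same inputs is younger than max_age_seconds."""
    key = input_hash(fn.__name__, *args, **kwargs)
    now = time.monotonic()
    hit = _cache.get(key)
    if hit is not None and now - hit[0] < max_age_seconds:
        return hit[1]  # fresh enough: skip execution
    result = fn(*args, **kwargs)
    _cache[key] = (now, result)
    return result
```

Because the cache stores a (timestamp, result) pair, a function that returns None is still skipped on the second call with the same inputs.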

flapili

04/20/2023, 2:07 PM

https://docs.prefect.io/latest/tutorials/flow-task-config/#task-input-hash

Federico Zambelli

04/20/2023, 2:07 PM

task(cache_key_fn=task_input_hash, cache_expiration=None)

I assume you mean this?

but I'm wondering if you will not need a storage anyway

I'm mostly playing around. The idea behind my last question is as follows: Imagine im reading data from some API that returns unpredictable results, and I'm writing them to S3. I don't want duplicate results so if the write_to_s3 function receives the same input (e.g. result_key), skip execution.

flapili

04/20/2023, 2:22 PM

not sure I understand

flapili

04/20/2023, 2:22 PM

do you mean you can't predict the result but it's idempotent ?

Federico Zambelli

04/20/2023, 2:23 PM

sorry, lemme explain better

flapili

04/20/2023, 2:24 PM

like if the api is a function f, f(5) will always return the same result ?

flapili

04/20/2023, 2:24 PM

but the result is hard/impossible to guess ?

Federico Zambelli

04/20/2023, 2:26 PM

like if the api is a function f, f(5) will always return the same result ?

It should in theory, but in practice it doesn't because of unavailability of the API itself. f(5) sometimes can get 1000 results, sometimes can return only 100. And this 100 can be a subset of the 1000. Each result however has a unique key. So imagine I have two tasks: get_results_from_api ---> write_result_to_storage. For each result in get_results_from_api, write to storage ONLY if the result_key hasn't been seen before.

Federico Zambelli

04/20/2023, 2:28 PM

so I was thinking, if I use cache with the 2nd task, does it skip writing to storage if the passed result_key was seen before?

flapili

04/20/2023, 2:33 PM

caching is mostly for retry stuff

flapili

04/20/2023, 2:34 PM

you could instead name the S3 object from input hash
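
A small sketch of that naming idea: derive the object name deterministically from the input, so writing the same input twice targets the same object instead of creating a duplicate. object_key_for, the "results/" prefix and the ".json" suffix are all made up for illustration.

```python
import hashlib


def object_key_for(result_key: str, prefix: str = "results/") -> str:
    """Map a result_key to a stable object name based on its SHA-256 hash."""
    digest = hashlib.sha256(result_key.encode("utf-8")).hexdigest()
    return f"{prefix}{digest}.json"
```

With this approach the deduplication does not depend on any cache being available, since a repeated write simply overwrites the same object.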

flapili

04/20/2023, 2:35 PM

because if the prefect database or the storage block which caches results is not available for some reason

flapili

04/20/2023, 2:35 PM

you could have idempotency issues

flapili

04/20/2023, 2:36 PM

but you could for sure play "belt and suspender"

Federico Zambelli

04/20/2023, 2:36 PM

ok i understand. I have no idea what "belt and suspender" is tho 😅

flapili

04/20/2023, 2:36 PM

prefect cache + checks in the write tasks to ensure the job was not done in the past
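
The "check in the write task" half could look roughly like the sketch below. InMemoryStore is a hypothetical stand-in for S3 so the snippet runs on its own; a real store would need its own existence check, and the check-then-write here is not atomic, so concurrent writers could still race.

```python
from typing import Dict


class InMemoryStore:
    """Hypothetical stand-in for an object store such as S3."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}

    def exists(self, key: str) -> bool:
        return key in self._objects

    def put(self, key: str, body: bytes) -> None:
        self._objects[key] = body


def write_result_to_storage(store: InMemoryStore, result_key: str, body: bytes) -> bool:
    """Write body under result_key only if that key has not been written before.

    Returns True if a write happened, False if it was skipped.
    """
    if store.exists(result_key):
        return False
    store.put(result_key, body)
    return True
```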

flapili

04/20/2023, 2:37 PM

IDK I translated literally the french expression "ceinture et bretelle"

Federico Zambelli

04/20/2023, 2:37 PM

ahaha, is this a programming pattern or something you just made up 😄 ?

flapili

04/20/2023, 2:38 PM

the true translation is belt and braces 😅

Federico Zambelli

04/20/2023, 2:38 PM

got it 👍 , thanks !

Federico Zambelli

04/20/2023, 2:38 PM

i had no idea this term existed

Federico Zambelli

04/20/2023, 2:38 PM

well thanks a lot for the help !
