# ask-community
Scott Moreland:
Trying to make a task decorator that subsamples and caches the output of a task to HDFS (each task returns a Spark dataframe). The goal is to quickly iterate on and debug downstream tasks using subsampled data. Since these are not full-blown checkpoints, I'm not sure whether the Results API would be appropriate. I was thinking something like:
```python
from prefect import task, Flow

@cache(subsample=100, sdf_key='sdf_large')  # proposed caching decorator
@task
def some_large_spark_dataframe():
    "intensive ETL process here"
    ...
    return sdf


@task
def downstream_task(sdf_large):
    "some intensive computation on sdf_large"
    ...
    return sdf


with Flow("example") as flow:
    # read_from_cache would load the subsampled dataframe back from HDFS
    sdf_large_sample = read_from_cache('sdf_large')
    downstream_task(sdf_large_sample)
```
...but I've had difficulty stacking decorators with the `task` decorator. On top of that, there are the usual challenges of the task result not being available until evaluation time. Any recipes you'd recommend?
z:
Hey @Scott Moreland -- have you tried putting your decorator after the `task` decorator instead? Basically you want the start of your task to check for a cached result and return that instead of doing an expensive computation. You need it to remain a `Task` type, and as long as this is on the inside of the `task` decorator you can ignore that it is deferred.
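For concreteness, a minimal sketch of that ordering, with `task` on the outside and the caching logic on the inside. The `cache` decorator itself, the `CACHE_DIR` path, and the parquet-on-HDFS layout are all assumptions for illustration, not Prefect APIs:

```python
import functools

from prefect import task
from pyspark.sql import SparkSession
from pyspark.sql.utils import AnalysisException

CACHE_DIR = "hdfs:///tmp/subsample_cache"  # hypothetical cache location


def cache(subsample=100, sdf_key=None):
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            spark = SparkSession.builder.getOrCreate()
            path = f"{CACHE_DIR}/{sdf_key or fn.__name__}"
            try:
                # Cache hit: skip the expensive computation entirely.
                return spark.read.parquet(path)
            except AnalysisException:
                pass  # cache miss: fall through and compute
            sdf = fn(*args, **kwargs)
            # Persist a small subsample for fast downstream iteration.
            sample = sdf.limit(subsample)
            sample.write.mode("overwrite").parquet(path)
            return sample
        return wrapper
    return decorator


@task  # task stays on the outside, so the result is still a Task
@cache(subsample=100, sdf_key="sdf_large")
def some_large_spark_dataframe():
    "intensive ETL process here"
    ...
```

Because `cache(...)` wraps the plain function before `@task` sees it, Prefect registers an ordinary task, and the cache check happens inside the task's run at flow execution time, so the deferred-result issue goes away.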
Scott Moreland:
Thanks, I'll give it a go. Love the product and appreciate the support as always.