# ask-community
Scott Moreland:
Trying to make a task decorator that subsamples and caches the output of a task to HDFS (each task returns a Spark dataframe). The goal is to quickly iterate on and debug downstream tasks using subsampled data. Since these are not full-blown checkpoints, I'm not sure whether the Results API would be appropriate. I was thinking something like:
```python
from prefect import task, Flow

@cache(subsample=100, sdf_key='sdf_large')  # proposed caching decorator
@task
def some_large_spark_dataframe():
    "intensive ETL process here"
    ...
    return sdf


@task
def downstream_task(sdf_large):
    "some intensive computation on sdf_large"
    ...
    return sdf


with Flow("example") as flow:
    # read_from_cache would load the subsampled dataframe back from HDFS
    sdf_large_sample = read_from_cache('sdf_large')
    downstream_task(sdf_large_sample)
```
...but I've had difficulty stacking decorators with the `task` decorator. On top of that, there are the usual challenges of the task result not being available until evaluation time. Any recipes you'd recommend?
z:
Hey @Scott Moreland -- have you tried putting your decorator after the `task` decorator instead? Basically you want the start of your task to check for a cached result and return that instead of doing an expensive computation. You need it to remain a `Task` type, and as long as this is on the inside of the `task` decorator you can ignore that it is deferred.
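For concreteness, a minimal sketch of that ordering, with `task` on the outside and the caching logic on the inside. The `cache` decorator itself, the `CACHE_DIR` path, and the parquet-on-HDFS layout are all assumptions for illustration, not Prefect APIs:

```python
import functools

from prefect import task
from pyspark.sql import SparkSession
from pyspark.sql.utils import AnalysisException

CACHE_DIR = "hdfs:///tmp/subsample_cache"  # hypothetical cache location


def cache(subsample=100, sdf_key=None):
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            spark = SparkSession.builder.getOrCreate()
            path = f"{CACHE_DIR}/{sdf_key or fn.__name__}"
            try:
                # Cache hit: skip the expensive computation entirely.
                return spark.read.parquet(path)
            except AnalysisException:
                pass  # cache miss: fall through and compute
            sdf = fn(*args, **kwargs)
            # Persist a small subsample for fast downstream iteration.
            sample = sdf.limit(subsample)
            sample.write.mode("overwrite").parquet(path)
            return sample
        return wrapper
    return decorator


@task  # task stays on the outside, so the result is still a Task
@cache(subsample=100, sdf_key="sdf_large")
def some_large_spark_dataframe():
    "intensive ETL process here"
    ...
```

Because `cache(...)` wraps the plain function before `@task` sees it, Prefect registers an ordinary task, and the cache check happens inside the task's run at flow execution time, so the deferred-result issue goes away.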
Scott Moreland:
Thanks, I'll give it a go. Love the product and appreciate the support as always.