Prefect Community #show-and-tell

Philip MacMenamin

06/19/2020, 5:39 PM

Hey, if you needed to create a flow that did some logic in Python, shelled out to a number of tasks and waited for them to return, then did more logic in Python and ultimately created a set of final outputs which would be persisted, what would be the Prefect-ish way of doing this WRT communicating the locations of the files for each task? Is there a way to give each flow its own dir in /tmp, for example, and then refer to the files relatively, with each flow run looking in its own /tmp/dir? Or do you use result caching and pass around the file locations that way? I'm probably missing something obvious. If anybody can point out an example of a flow which has this kind of logic it would be really helpful!

nicholas

06/19/2020, 5:44 PM

Hi @Philip MacMenamin - are you wanting to handle the read/write of the files yourself, or are you wanting Prefect to handle that?

Philip MacMenamin

06/19/2020, 5:46 PM

For the binaries that I'm going to shell out to, I'd leave those binaries to handle everything. It would just be a dumb call into the CLI. There would be some processing of the files in Python also, and I would think it might be easier for me to handle the I/O on those files? I guess I'm not sure what the alternative would be.

Philip MacMenamin

06/19/2020, 5:48 PM

I feel like my question is sort of betraying the fact that there's some aspect of Prefect I'm not 100% getting.

nicholas

06/19/2020, 5:49 PM

Well in this case you have 2 options I think: you can return the file paths from the tasks themselves (and pass to downstream), or use the results, targets, and checkpointing interfaces.

Philip MacMenamin

06/19/2020, 5:53 PM

OK. I was looking at that area of the docs. So the first option you mention would be that each task creates a file in some arbitrary location which it takes care of, and I would return the full path to that file from each task.

nicholas

06/19/2020, 5:57 PM

Exactly. And of course there's no reason you can't use a mix of these methods to suit your needs!
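
The first option discussed above (each step writes its own file and returns the full path for the next step) can be sketched roughly as below. This is a minimal illustration in plain Python with only the standard library, not Prefect code: the function names `run_binary` and `post_process`, the file names, and the child Python process standing in for a real CLI binary are all assumptions made for the example. In a real flow these functions would presumably be Prefect tasks, with the returned path passed to the downstream task.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_binary(workdir: Path) -> Path:
    """Shell out to a CLI that writes its own output file; return that file's path."""
    out_path = workdir / "raw.txt"
    # Stand-in for a real binary: a child Python process that writes one file.
    cmd = [
        sys.executable,
        "-c",
        "import sys; open(sys.argv[1], 'w').write('1 2 3')",
        str(out_path),
    ]
    subprocess.run(cmd, check=True)  # raises CalledProcessError on a non-zero exit
    return out_path


def post_process(raw_path: Path) -> Path:
    """Do the Python-side processing and return the path of the new file."""
    numbers = [int(tok) for tok in raw_path.read_text().split()]
    result_path = raw_path.with_name("scaled.txt")
    result_path.write_text(" ".join(str(n * 10) for n in numbers))
    return result_path


# Each step receives a path and hands a path to the next step.
workdir = Path(tempfile.mkdtemp(prefix="flow_run_"))
final_path = post_process(run_binary(workdir))
```

Only the paths travel between steps here; the file contents stay on the local disk until something downstream reads them.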

Philip MacMenamin

06/19/2020, 5:57 PM

And the second option would be to use output caching along the lines of something like https://docs.prefect.io/core/concepts/persistence.html#output-caching-based-on-a-file-target maybe? And then return a full path at the end of that task?

Philip MacMenamin

06/19/2020, 6:00 PM

Is there no way to just set a default location in a flow, so that each run of that flow creates some uniq dir, e.g. /tmp/my_uniq_dir_0, and you can tell all tasks to operate on files there, on the assumption that the default location will always be correct for that run?

Philip MacMenamin

06/19/2020, 6:04 PM

By uniq dir, I mean unique to a run. As in, you can set up the flow to create a base dir /tmp/prefect_runs/uniq_run_id, and every task can default to using that special uniq dir that's dedicated to that run.
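
The per-run directory idea described above could look something like the following sketch. It uses only the standard library and is not a Prefect feature: the helper names `make_run_dir` and `step_output_path` are hypothetical, and the run ID is a random UUID generated here as an assumption (an identifier taken from the orchestrator could be substituted if one is available in your setup).

```python
import tempfile
import uuid
from pathlib import Path


def make_run_dir(base: Path, run_id: str) -> Path:
    """Create base/run_id and fail loudly if that run directory already exists."""
    run_dir = base / run_id
    run_dir.mkdir(parents=True, exist_ok=False)
    return run_dir


def step_output_path(run_dir: Path, filename: str) -> Path:
    """Resolve a file name relative to the directory dedicated to this run."""
    return run_dir / filename


# Base dir mirrors the /tmp/prefect_runs layout from the message above.
base = Path(tempfile.gettempdir()) / "prefect_runs"
run_id = uuid.uuid4().hex  # assumption: a random ID stands in for a real run ID
run_dir = make_run_dir(base, run_id)

# Every step writes relative to run_dir, so concurrent runs cannot collide.
out = step_output_path(run_dir, "step1.txt")
out.write_text("hello")
```

Passing `run_dir` to each step (or having the first step return it) keeps the steps free of hard-coded absolute paths.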

nicholas

06/19/2020, 6:07 PM

Ah gotcha, and then you'd want those to be persisted, correct?

Philip MacMenamin

06/19/2020, 6:11 PM

I would want at least some of them persisted. I probably don't care about every output. The binaries I'm calling into might produce ancillary files, or junk I don't care about. Some binaries produce outputs I don't care about persisting; some I will just need the paths to so I can feed them to the next binary, or do some operation on them in Python. I will need to persist files at the end of the run. I was going to use S3 or something for that.

nicholas

06/19/2020, 6:14 PM

Got it. In which case, I think templated locations are what you're looking for, which allow you to output results directly to run-specific directories. Since those can be configured at the task level, you can easily persist results you want and discard those you don't. Is that helpful?

Philip MacMenamin

06/19/2020, 6:26 PM

OK, so those results are placed in a cloud bucket, so it will do a write across the network into that bucket, and then the next task will do a read to get the file local again. Is there a way to use the local file system during the run, and then pick a subset of files to persist in buckets?

Philip MacMenamin

06/19/2020, 6:30 PM

In a way similar to this templated location. As in, just not use the cloud bucket during the run, and then ship selected outputs.

nicholas

06/19/2020, 6:32 PM

You can use a local result to persist the initial results and then probably a downstream task to pick the ones you'd like to persist elsewhere.

nicholas

06/19/2020, 6:33 PM

And since prefect.context is available in every task, you'd be able to persist the final results in the remote location using the same variables you would in a templated location.
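
The suggestion above (work on the local file system during the run, then have a downstream step ship only selected files to a templated remote location) might be sketched as follows. Everything here is an assumption for illustration: a plain dict stands in for prefect.context, its keys `flow_name` and `run_id` are hypothetical, a second local directory stands in for the S3 bucket, and `persist_selected` is not a Prefect function.

```python
import shutil
import tempfile
from pathlib import Path

# Stand-in for values a task might read from prefect.context (hypothetical keys).
context = {"flow_name": "local-results", "run_id": "run-0001"}

# Destination layout built from the same kind of variables as a templated location.
DEST_TEMPLATE = "{flow_name}/{run_id}/{filename}"


def persist_selected(run_dir: Path, keep: list, dest_root: Path, ctx: dict) -> list:
    """Copy only the named files from the run directory to the destination root."""
    persisted = []
    for name in keep:
        target = dest_root / DEST_TEMPLATE.format(filename=name, **ctx)
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(run_dir / name, target)
        persisted.append(target)
    return persisted


# Local scratch space for the run, plus a directory playing the role of the bucket.
run_dir = Path(tempfile.mkdtemp(prefix="run_"))
dest_root = Path(tempfile.mkdtemp(prefix="bucket_"))
for name in ("final.csv", "junk.log"):
    (run_dir / name).write_text(name)

# Only final.csv is shipped; junk.log stays behind in the run directory.
persisted = persist_selected(run_dir, ["final.csv"], dest_root, context)
```

With a real bucket, the `shutil.copy2` call would be replaced by whatever upload call your storage client provides.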

Philip MacMenamin

06/19/2020, 6:35 PM

OK. I think this is what I'm looking for. Great.

Philip MacMenamin

06/19/2020, 6:36 PM

Thanks Nicholas!

nicholas

06/19/2020, 6:36 PM

You're welcome! Let me know if you have any hiccups

Philip MacMenamin

06/19/2020, 6:37 PM

will do.

Philip MacMenamin

06/23/2020, 8:15 PM

from https://docs.prefect.io/core/concepts/results.html#how-to-configure-task-result-persistence

from prefect import task, Flow
from prefect.engine.results import LocalResult


@task(result=LocalResult(location="initial_data.prefect"))
def root_task():
    return [1, 2, 3]

@task(result=LocalResult(location="{date:%A}/{task_name}.prefect"))
def downstream_task(x):
    return [i * 10 for i in x]

with Flow("local-results") as flow:
    downstream_task(root_task)

What should this snippet do?

Philip MacMenamin

06/23/2020, 8:16 PM

(upon flow.run() getting called)
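
One part of the snippet above can be illustrated without Prefect: how a location template such as "{date:%A}/{task_name}.prefect" expands. The sketch below assumes the template is filled with Python's standard str.format-style substitution, using a date and a task name supplied at run time. The date is fixed to the day of the message purely so the result is reproducible. It says nothing about where or whether Prefect writes the file.

```python
from datetime import datetime

# Same template string as in the snippet above.
template = "{date:%A}/{task_name}.prefect"

# {date:%A} applies strftime's %A (full weekday name) to the datetime value.
location = template.format(
    date=datetime(2020, 6, 23),  # assumption: fixed date instead of the run time
    task_name="downstream_task",
)
print(location)  # Tuesday/downstream_task.prefect (with the default C locale)
```

So a template like this would group results into one directory per weekday, with one file per task name.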

nicholas

06/23/2020, 8:49 PM

Hi @Philip MacMenamin - if you wouldn't mind, please open a new thread in the community channel

Philip MacMenamin

06/23/2020, 8:49 PM

will do.
