Prefect Community #prefect-contributors-archived

Brett Naul

07/07/2020, 3:18 AM

is there a way / is it a bad idea to access an upstream `Result` directly from within a task? I'm thinking about how a papermill task might look; since it runs in a subprocess you'd need to load the results there instead of in the main prefect taskrunner process....

Laura Lorenz (she/her)

07/07/2020, 1:47 PM

we def imagined that people would use the results API directly, it may be a bit tough to get an arbitrary result from upstream within a task (unless you are willing to hydrate it yourself from the file location)

Laura Lorenz (she/her)

07/07/2020, 2:02 PM

this was our idea of how people might use the results API directly. if nothing else if you know the location of the upstream file you could do this style https://docs.prefect.io/core/concepts/results.html#persisting-user-created-results
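A rough sketch of the "hydrate it yourself from a known location" idea, using only the Python standard library as a stand-in. It does not use the Prefect results API: `persist`, `run_downstream`, `demo`, and the file name are hypothetical names for illustration, and pickle is assumed as the serializer. The point is only that the subprocess receives a path and loads the value itself.

```python
import pickle
import subprocess
import sys
import tempfile
from pathlib import Path


def persist(value, location: Path) -> Path:
    """Stand-in for writing a result: pickle `value` to a known location."""
    location.write_bytes(pickle.dumps(value))
    return location


# Code run by the child process: it is given only the path and
# hydrates the value itself, so nothing is passed through the parent.
CHILD_CODE = """
import pickle, sys
from pathlib import Path
value = pickle.loads(Path(sys.argv[1]).read_bytes())
print(sum(value))
"""


def run_downstream(location: Path) -> str:
    """Launch a subprocess that loads the persisted value from `location`."""
    completed = subprocess.run(
        [sys.executable, "-c", CHILD_CODE, str(location)],
        capture_output=True,
        text=True,
        check=True,
    )
    return completed.stdout.strip()


def demo() -> str:
    """Persist a small list, then let a subprocess read and sum it."""
    with tempfile.TemporaryDirectory() as tmp:
        location = persist([1, 2, 3], Path(tmp) / "upstream_result.pkl")
        return run_downstream(location)
```

With a real result class, the write and read steps would presumably be replaced by whatever the linked docs describe; the subprocess boundary is the part this sketch is meant to show.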

Brett Naul

07/07/2020, 4:50 PM

thanks @Laura Lorenz (she/her)! that definitely makes sense, this is definitely a weird edge case that I'm imagining. it'd be easy to make something like the airflow papermill operator https://airflow.apache.org/docs/stable/howto/operator/papermill.html that just hard-codes the parameters at build-time. but since passing results is a first-class prefect feature it'd be cool to be able to support that here as well 🤔

Brett Naul

07/07/2020, 4:54 PM

oh but just to clarify: I definitely am fine rehydrating the result myself, in fact it's the only way that this could work. I'm mostly just wondering how to actually grab that path from inside a task, since the argument that the task runner passes to my function is the already-rehydrated value

Laura Lorenz (she/her)

07/08/2020, 2:29 PM

Yeah, FWIW we were talking a bit about this yesterday internally as it felt related to another feature request about results not reading in if downstreams are cached (https://github.com/PrefectHQ/prefect/issues/2922), in the sense that it’s about being able to arbitrarily ‘reach back’ or ‘reach forward’ in the graph for state metadata to make a decision (that issue is about the pipeline deciding to read a result, yours is about a task grabbing an upstream Result object). Out of curiosity for the papermill stich you are thinking of. Correct me if I’m wrong (I have no papermill experience, btw, but read the homepage haha) but I think what you are implying is that the papermill task code will start a subprocess (using some papermill utility) that a jupyter notebook runs in, and you need to pass in parameters of the jupyter notebook subprocess whenever your prefect task calls that papermill utility. What is the connection that needs them to be `Result` objects instead of the upstream data dependencies themselves?

Brett Naul

07/08/2020, 2:52 PM

yeah you're right, I guess the real issue I am trying to solve is different from what I explained. for small literal JSON-serializable parameters you're right there's no problem, you can pass them in directly via `execute_notebook`. for anything larger or that isn't a built-in Python type (say a large dataframe), we've had to hack together various crazy workarounds to get those into papermill; what I really want is to pass in `output=some_task()` for the `parameters` cell to be populated with just `output = GCSResult(bucket).read(path_to_result).value`, which means the data is only deserialized once inside the subprocess and also doesn't have to go through JSON in between
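To make the idea above concrete, here is a minimal sketch of how the source for such a parameters cell could be generated from result locations. `render_parameters_cell` is a hypothetical helper, not part of papermill or Prefect; it only builds the text of the assignment shown in the message, and it assumes the notebook would import `GCSResult` itself.

```python
def render_parameters_cell(result_paths: dict, bucket: str) -> str:
    """Build source for a notebook cell that loads each result by path.

    `result_paths` maps a notebook variable name to the location of an
    upstream result. Only the cell text is produced; nothing is read here.
    """
    lines = []
    for name, path in result_paths.items():
        # Refuse names that could not appear on the left of an assignment.
        if not name.isidentifier():
            raise ValueError(f"not a valid variable name: {name!r}")
        # Mirrors the expression from the message above; repr() quotes the
        # bucket and path so they appear as string literals in the cell.
        lines.append(f"{name} = GCSResult({bucket!r}).read({path!r}).value")
    return "\n".join(lines)
```

For example, `render_parameters_cell({"output": "results/a.pkl"}, "my-bucket")` produces a one-line cell assigning `output`, so only the short path string would need to cross the process boundary.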
