
Greg Desmarais

07/21/2020, 2:08 PM
It looks like the docker image prefecthq/prefect:all_extras is running python 3.8 - this seems to require all clients to move to 3.8, as cloudpickle might be sensitive to the version difference. Am I right on that? Is the position of the prefect team that everything needs to be on py 3.8? I might not have rtfm enough...
I might be wrong, but I feel like when I started looking at prefect, 3.7 was required, so I intentionally set up everything for 3.7. Before I go through the trouble of moving everything to 3.8, I figure I should ask.

Jim Crist-Harif

07/21/2020, 2:11 PM
Yeah, Python bumped the default pickle protocol in 3.8, so flows packaged with 3.8 will need to be running 3.8. In general though, we recommend you always run flows in the same/similar environment to the one they were registered in.
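[Editor's note: a minimal sketch of the protocol bump Jim mentions, not from the thread. 3.8 moved pickle's default protocol from 3 to 4, and every payload records the protocol it was written with in its leading opcode, so an older interpreter rejects payloads newer than it understands. Pinning the protocol explicitly is one way to keep older readers working:]

```python
import pickle

data = {"flow": "example"}

# 3.8 writes protocol 4 by default; pin a lower protocol explicitly
# when the payload may be read by an older interpreter.
blob = pickle.dumps(data, protocol=3)

# The PROTO opcode (0x80) plus the version byte lead every payload.
assert blob[:2] == b"\x80\x03"
assert pickle.loads(blob) == data
```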

Greg Desmarais

07/21/2020, 2:14 PM
just fwiw, that is potentially really tough if we want flows to be created by a diverse team. On some platforms, upgrading your python major version is not a small task. In addition, infrastructure is laid down with major versions, so bumping this piece may require a big change. Not sure it is the right move. Causes me a s*** load of work potentially.

Jim Crist-Harif

07/21/2020, 2:15 PM
Sorry, back-tracking a bit. Flows packaged with older versions can run on the new version, flows packaged with newer versions cannot run on the old version.

Greg Desmarais

07/21/2020, 2:15 PM
I understand that 3.8 has been out a while...

Jim Crist-Harif

07/21/2020, 2:15 PM
So your old stuff should still work, but flows packaged with 3.8 cannot run on 3.7

Greg Desmarais

07/21/2020, 2:17 PM
ok - I have a flow packaged on 3.7.8:
(rightsize-venv) MacBook-Pro:scripts gdesmarais$ python --version
Python 3.7.8
that, after the hoops of getting it deployed on fargate, results in the following message on the AWS side:
2020-07-21T09:56:16.831-04:00 [2020-07-21 13:56:16] INFO - prefect.S3 | Downloading datasciences/prefect_flows/dask_cloud_provider_test from celsius-temp-data
2020-07-21T09:56:17.050-04:00 an integer is required (got type bytes)
Traceback (most recent call last):
  File "/usr/local/bin/prefect", line 8, in <module>
    sys.exit(cli())
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/prefect/cli/execute.py", line 80, in cloud_flow
    raise exc
  File "/usr/local/lib/python3.8/site-packages/prefect/cli/execute.py", line 69, in cloud_flow
    flow = storage.get_flow(storage.flows[flow_data.name])
  File "/usr/local/lib/python3.8/site-packages/prefect/environments/storage/s3.py", line 90, in get_flow
    return cloudpickle.loads(output)
TypeError: an integer is required (got type bytes)
From what I gather, that error message is indicative of a Python version mismatch via cloudpickle.

Jim Crist-Harif

07/21/2020, 2:20 PM
Hmmm, that's odd. Pickle is a backwards-compatible format, so this might be one of cloudpickle's serializers failing, not the pickle protocol itself (which is what you'd see going from 3.8 -> 3.7).
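[Editor's note: a plausible reading of the specific TypeError, not confirmed in the thread. Plain pickle refuses code objects outright, so cloudpickle rebuilds them from their `types.CodeType` constructor arguments, and 3.8 inserted a new `posonlyargcount` argument into that constructor; an argument list captured on 3.7 would then put a `bytes` value where 3.8 expects an `int`, matching the error above. A quick stdlib illustration of the first half:]

```python
import pickle

def add_one(x):
    return x + 1

# The stdlib pickler refuses code objects outright, which is why
# cloudpickle has to reconstruct them by hand from their
# types.CodeType constructor arguments (whose signature changed in 3.8).
try:
    pickle.dumps(add_one.__code__)
except TypeError as exc:
    print(exc)  # e.g. "cannot pickle code objects" (wording varies by version)
```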

Greg Desmarais

07/21/2020, 2:20 PM
My 20-minute-level informed opinion is that I agree.
It seems to be cloudpickle.

Jim Crist-Harif

07/21/2020, 2:24 PM
Hmmm. You might need to have users on 3.7 register flows to run in images with 3.7. I understand that from an organizational level that could be tricky, but prefect flows really aren't designed to be registered and run in two disparate environments.

Greg Desmarais

07/21/2020, 2:25 PM
I am already creating some custom images, so I can certainly do that. I had created one already, but it was hanging after the flow executed. Figured I should try using the hq one.

Jim Crist-Harif

07/21/2020, 2:25 PM
We have added support for script-based storage (some support already out, some coming in this release (probably today/tomorrow)). This lets you store the script instead of the pickle file, which will make this a bit easier, but I still recommend standardizing environments for the other benefits.

Greg Desmarais

07/21/2020, 2:26 PM
So let me ask - this is the task that is pulling the flow from my storage, deserializing it, then executing it. What does that image need?
I have an image already built that pulls in prefect (among a lot of other stuff) and has this entry:
ENTRYPOINT ["tini", "-g", "--", "/usr/bin/prepare.sh"]
I'm not sure that is the right entry, based on the hq image, or something else. I totally agree with standardizing the images for the flow run processes.

Jim Crist-Harif

07/21/2020, 2:29 PM
The image needs:
• All the libraries your flow relies on (including prefect, ideally the same version as used locally when developing/registering; things may work if they aren't the same, but no guarantees).
• Any helper code not contained directly in the same file that created the flow.
• Ideally the same major version of python as registered with. If using pickle-based storage this is more a requirement than a suggestion.
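[Editor's note: one cheap guard along these lines, a hypothetical helper of mine rather than a Prefect API: record `platform.python_version()` at registration time and compare major.minor before trusting a pickle-based flow inside the image.]

```python
import platform
from typing import Optional

def same_minor(registered: str, running: Optional[str] = None) -> bool:
    """True when the major.minor versions match, e.g. 3.7.8 vs 3.7.4."""
    running = running or platform.python_version()
    return registered.split(".")[:2] == running.split(".")[:2]

# 3.7.8 and 3.7.4 share a minor version; 3.7.8 and 3.8.3 do not.
print(same_minor("3.7.8", "3.7.4"))  # True
print(same_minor("3.7.8", "3.8.3"))  # False
```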

Greg Desmarais

07/21/2020, 2:30 PM
Ok - got that.
(I'm using a bunch of dask and cuda)

Jim Crist-Harif

07/21/2020, 2:30 PM
I'm not sure what prepare.sh does, but using tini is a good option. Sorry, not sure I understand the question here.

Greg Desmarais

07/21/2020, 2:30 PM
np - not sure I asked it clearly.
My question might then be different. I had assumed my custom image for the task that deserializes the flow and runs it was the problem - that I needed to use the hq one for that step, then my custom one for the actual task workers (dask).
I can go back and re-frame on what happens after the flow run is executed - I might have an issue with my dask cluster creation/configuration.

Jim Crist-Harif

07/21/2020, 2:33 PM
So the error you got above was failure to load the flow, before dask enters the picture at all. So that's an incompatibility between python/cloudpickle versions.

Greg Desmarais

07/21/2020, 2:33 PM
yea - agreed.

Jim Crist-Harif

07/21/2020, 2:33 PM
But more generally, we also recommend dependencies match throughout your dask cluster.

Greg Desmarais

07/21/2020, 2:34 PM
When I use my custom image instead of the hq one, the flow gets through execute, but then hangs. I've got patches for prefect code that insert a bunch of printlns, so I can add more and see if I can isolate further.
My first instinct was that my custom image didn't have something right in the dockerfile.

Jim Crist-Harif

07/21/2020, 2:35 PM
The flow executes, but the image doesn't return?
So prefect thinks the flow is done, but the image is still running?

Greg Desmarais

07/21/2020, 2:35 PM
Actually, it returns, with a status of 0.
And the UI doesn't show it as complete,
which makes me think there is some wrapper that is supposed to communicate back to the api to say 'done'

Jim Crist-Harif

07/21/2020, 2:37 PM
All of that is contained within the flow runner (the thing executing the flow), there's no other bit that happens afterward.

Greg Desmarais

07/21/2020, 2:37 PM
hrm...ok, let me dig/debug/print a whole crap ton of stuff.
I've been battling this for a while, and I feel like I'm really close.

Jim Crist-Harif

07/21/2020, 2:37 PM
Does prefect see that the flow started?

Greg Desmarais

07/21/2020, 2:38 PM
yes

Jim Crist-Harif

07/21/2020, 2:38 PM
That is odd. Hmmm.
If you can create a reproducible example, we'd be happy to take a look.

Greg Desmarais

07/21/2020, 2:38 PM
I have a fargate agent started from api with a bunch of config, it picks up the run, kicks off the first task...
that first task pulls the flow from storage, deserializes, executes, then nothing...except that the container exits with a 0
the UI doesn't show the run completed
(I kick off the flow from the ui)
I'm "pretty sure" my networking between the containers and the api is correct, but maybe I have to look at that. it is all in the same vpc with wide open sec groups (net acl protecting the vpc)

Jim Crist-Harif

07/21/2020, 2:51 PM
What Environment class are you using?

Greg Desmarais

07/21/2020, 2:56 PM
executor = DaskExecutor(cluster_class='dask_cloudprovider.FargateCluster',
                        cluster_kwargs=cluster_kwargs)

flow.environment = FargateTaskEnvironment(
    executor=executor,
    region_name=DEFAULT_REGION,
    metadata=metadata,
    **task_definition_kwargs
)
I have been through so many iterations of this that there is almost certainly some cruft.
I'm going to disappear for a while - I will put my debugs in and see where that leads me. If you have any suggestions, my ears are wide open.

Jim Crist-Harif

07/21/2020, 2:59 PM
FargateTaskEnvironment kicks off an intermediate fargate task that launches another fargate task where the flow actually runs. Are you seeing both tasks start successfully, or just the first one?

Greg Desmarais

07/21/2020, 6:50 PM
I'm seeing the first one start, and I'm seeing the run_task for the second one return successfully, but I'm not seeing a command or anything. I am going to switch over to a dask cloud provider environment and see where that takes me.

Jim Crist-Harif

07/21/2020, 6:55 PM
Hmmm, that is odd behavior. If you can provide a reproducible example that'd be useful for us to help debug.

Greg Desmarais

07/21/2020, 7:01 PM
I've def. been trying to get a repro for this, but it invariably depends on my environment - a fargate cluster, server in EC2, etc. I've got things running in my pycharm debugger now - isolated the bit of code that is suspect. I'll open a new convo if I am able to pose a concise, answerable question. Right now it would be a pain in the ass to support me.

Jim Crist-Harif

07/21/2020, 7:02 PM
Ok, let us know if/when you figure something out. Thanks for looking into this!

Greg Desmarais

07/21/2020, 7:03 PM
I have a feeling I just don't have the magic recipe for my deployment model.