Thread
#prefect-community
    Laksh Aithani

    4 months ago
    Hello everyone, I’m trying to run a very simple Prefect 2.0 workflow on a remote Ray cluster. More details in the thread below.
    from prefect import flow, task
    from prefect.task_runners import RayTaskRunner
    
    
    @task(retries=3, retry_delay_seconds=5)
    def say_hello(name):
        print(f"hello {name}")
    
    
    kwargs = dict() # works on my Mac
    kwargs = dict(address="ray://**.***.**.***:10001", init_kwargs=dict(runtime_env={"pip": []})) # fails on my Mac, works on Ray head node
    
    
    @flow(task_runner=RayTaskRunner(**kwargs))
    def greetings(names):
        for name in names:
            say_hello(name)
    
    
    if __name__ == "__main__":
        greetings(["arthur", "trillian", "ford", "marvin"])
    When I run this flow on my Mac without connecting to the remote client, spinning up a local Ray cluster instead, it works. When I use the remote client from my Mac, it fails with the following error:
    19:38:37.575 | INFO    | Task run 'say_hello-d71d0552-0' - Crash detected! Execution was interrupted by an unexpected exception.
    19:38:37.616 | INFO    | Task run 'say_hello-d71d0552-1' - Crash detected! Execution was interrupted by an unexpected exception.
    19:38:37.628 | INFO    | Task run 'say_hello-d71d0552-2' - Crash detected! Execution was interrupted by an unexpected exception.
    19:38:37.641 | INFO    | Task run 'say_hello-d71d0552-3' - Crash detected! Execution was interrupted by an unexpected exception.
    Weirdly, when I run it on the head node of the Ray cluster, still connecting to the cluster through the remote client, it works. Does anyone know why this may be the case?
    Upon further thought, this might be due to the local Prefect database not interacting properly with the Ray remote client.
    Kevin Kho

    4 months ago
    Yes, I think it’s something with the Ray client. There’s no need to tag people on the weekend. It won’t make us respond to the post any faster. We’ll still see it 🙂
    Anna Geller

    4 months ago
    @Laksh Aithani your intuition is right that the Ray client is not able to communicate with your SQLite DB correctly. I have two ideas of what may be happening:
    1. SQLite is not meant for concurrent writes, and it's possible that your parallel Ray execution is trying to perform task run state updates for multiple tasks at the same time, causing DB issues.
    2. Another possible cause of your DB connection issues is network connectivity: you are running Orion locally and your Ray cluster seems to be running in a different network (remote VPC/subnet), which may result in connection issues.
    I believe that it's #1. You could test that by running a test flow on the same server as your Ray cluster but with SequentialTaskRunner (rather than RayTaskRunner) to confirm whether it's #1 or #2. To really confirm that it's #1, and potentially solve the issue at the same time, you can switch to the Postgres backend, which allows concurrent writes.
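To illustrate point #1, here is a minimal sketch using only Python's standard-library `sqlite3` module. It is not Prefect or Ray code: the `task_run` table and the function name `demonstrate_write_lock` are made up for illustration. It shows that while one connection holds a write transaction on a SQLite file, a second writer is rejected with an `OperationalError` instead of being allowed to write concurrently.

```python
import os
import sqlite3
import tempfile


def demonstrate_write_lock() -> str:
    """Return the error message a second writer sees while a first writer holds the lock."""
    # Hypothetical throwaway database file, unrelated to Prefect's own database.
    db_path = os.path.join(tempfile.mkdtemp(), "demo.db")

    # isolation_level=None lets us issue BEGIN/COMMIT explicitly.
    # timeout=0 makes a blocked writer fail immediately instead of waiting.
    writer_a = sqlite3.connect(db_path, timeout=0, isolation_level=None)
    writer_b = sqlite3.connect(db_path, timeout=0, isolation_level=None)
    try:
        writer_a.execute("CREATE TABLE task_run (name TEXT, state TEXT)")

        # Writer A opens a write transaction and keeps it open.
        writer_a.execute("BEGIN IMMEDIATE")
        writer_a.execute("INSERT INTO task_run VALUES ('say_hello-0', 'Running')")

        # Writer B now tries to start its own write transaction.
        try:
            writer_b.execute("BEGIN IMMEDIATE")
        except sqlite3.OperationalError as exc:
            message = str(exc)
        else:
            writer_b.execute("ROLLBACK")
            message = "no error"

        writer_a.execute("COMMIT")
        return message
    finally:
        writer_a.close()
        writer_b.close()


if __name__ == "__main__":
    print(demonstrate_write_lock())
```

In practice a writer usually waits for the lock (the `timeout` argument) rather than failing at once, so this only demonstrates the single-writer constraint, not the exact failure seen in the thread.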
    Laksh Aithani

    4 months ago
    Hi Anna, we confirmed it was an issue with SQLite, and using a Postgres database deployed on AWS RDS solved the issue!
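For readers who want to try the same fix, the sketch below shows one way to point Orion at Postgres. It assumes the Prefect 2.0 (Orion) setting `PREFECT_ORION_DATABASE_CONNECTION_URL` and the `postgresql+asyncpg://` URL scheme. The helper `build_orion_postgres_url`, the user name, password, and host are hypothetical placeholders, not values from this thread.

```python
import os
from urllib.parse import quote


def build_orion_postgres_url(user, password, host, port=5432, database="orion"):
    """Build an async Postgres connection URL (hypothetical helper)."""
    # Percent-encode credentials so characters such as "@" or "/" do not break the URL.
    return (
        f"postgresql+asyncpg://{quote(user, safe='')}:{quote(password, safe='')}"
        f"@{host}:{port}/{database}"
    )


if __name__ == "__main__":
    # Placeholder credentials and host; replace with your own RDS endpoint.
    url = build_orion_postgres_url("prefect", "example-password", "my-db.example.com")
    # Assumption: this must be set before Prefect loads its settings,
    # i.e. before the Orion server or the flow process starts.
    os.environ["PREFECT_ORION_DATABASE_CONNECTION_URL"] = url
```

The same value can also be stored in the active Prefect profile with `prefect config set` instead of an environment variable. Check the setting name against the Prefect version you are running.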