# ask-community
t
Hi everybody, I'm having an issue with Ray. It's creating multiple db connections (server db) but it's not closing them. I have 100 connections that stay idle. How can I clean these connections from Prefect server? One solution is to have a job that cleans these connections every N minutes, but I'm wondering if there is an existing setting that takes care of this. Thank you
@Nate any recommendation?
b
Hey Tedi! Please refrain from tagging folks directly 🙇. Are you explicitly closing the connections at any point during the task or flow? Something like this?
```python
import psycopg2  # or whichever library you're using to connect to the database

from prefect import task


@task
def query_database():
    conn = psycopg2.connect("your_connection_string")
    try:
        # Perform database operations
        with conn.cursor() as cur:
            cur.execute("SELECT * FROM your_table")
            results = cur.fetchall()
        return results
    finally:
        # Ensure the connection is always closed
        conn.close()
```
t
@Bianca Hoch My apologies. I'm not creating the connection. Prefect and Ray are creating it.
n
can you explain what database connections you’re talking about and how you’re observing them?
t
I have a flow that runs a function in parallel. My team is using Ray. I'm decorating my function like this:
```python
from prefect import flow, unmapped
from prefect_ray.task_runners import RayTaskRunner


@flow(
    name="some_name",
    task_runner=RayTaskRunner(init_kwargs={"num_cpus": NUM_TRANSFORM_CPUS}),
)
def main(batches, data_dir, save_dir, demo_lavel, config):
    futures = t_signal.temp_transform_batches.map(
        question_configs=batches,
        data_dir=data_dir,
        save_dir=save_dir,
        level=demo_lavel,
        config_obj=unmapped(config.CONFIG),  # type: ignore
    )
    for future in futures:
        future.wait()
```
where `temp_transform_batches` is decorated with:
```python
@task(name="TempStep2-transfrom-list", retries=10, retry_delay_seconds=6)
def temp_transform_batches( ...
```
@Nate the database connections that I'm talking about are the ones created when Prefect creates the tasks.
I believe I'm overloading my db with this, as it creates 130 connections
The error that I get is this:
```
Crash detected! Execution was interrupted by an unexpected exception: PrefectHTTPStatusError: Server error '500 Internal Server Error' for url 'http://<my_url>:4200/api/task_runs/e93dc7a6-19f2-46a6-89ba-b79ea289841a/set_state'
Response: {'exception_message': 'Internal Server Error'}
```
I'm running Prefect server from a systemd service like this:
```ini
# other configs here
[Service]
Type=simple
User=ubuntu
Restart=always
Environment="PATH=/opt/prefect-server/prefect-server/bin/"
ExecStart=sudo PREFECT_SQLALCHEMY_POOL_SIZE=200 PREFECT_SQLALCHEMY_POOL_RECYCLE=1800 PREFECT_SQLALCHEMY_POOL_PRE_PING=True PREFECT_SQLALCHEMY_MAX_OVERFLOW=50 /opt/prefect-server/prefect-server/bin/prefect server start --host 0.0.0.0
```
n
Hmm, while trying to minimally reproduce your error I ran into what I believe is a bug in either pydantic or Prefect
```
FAILED tests/test_task_runners.py::TestRayTaskRunner::test_can_run_many_tasks_without_crashing[default_ray_task_runner] - ray.exceptions.RaySystemError: System error: Failed to unpickle serialized exception
traceback: Traceback (most recent call last):
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/ray/exceptions.py", line 51, in from_ray_exception
    return pickle.loads(ray_exception.serialized_exception)
TypeError: __init__() missing 1 required keyword-only argument: 'code'
```
have you seen this at all client side?
t
I believe I have seen this issue with joblib when I use `loky` as the backend
👍 1
but I might be wrong
I believe my issue is that when I run a job in parallel, each process creates a task
n
if I'm understanding correctly, that is the expected behavior with the Ray and Dask task runners in general
t
I see. My team will switch to joblib with the `multiprocessing` backend and will stop creating a task for each parallel process
👍 1
but with joblib, if I pass a class object, it sometimes raises the same issue that you faced
basically a serialization issue
n
if you don't want to use the task runner paradigm, you're free to call arbitrary Python (which may call Ray) inside of tasks (at that point, tasks are just tools for encapsulating work / adding retries / caching, etc.)
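As an illustrative sketch of that pattern (not from the thread; `transform_batch` and the batch shape are assumptions): a single Prefect task fans the work out with Ray directly, so only one task run is reported to the API instead of one per batch.
```python
# Sketch: one Prefect task that parallelizes internally with Ray.
# The flow uses the default task runner here, so Prefect only records
# state changes for a single task run.
import ray
from prefect import flow, task


@ray.remote
def transform_batch(batch):
    # placeholder for the per-batch transformation
    return len(batch)


@task(retries=10, retry_delay_seconds=6)
def transform_all(batches):
    ray.init(ignore_reinit_error=True)
    futures = [transform_batch.remote(b) for b in batches]
    return ray.get(futures)


@flow
def main(batches):
    return transform_all(batches)
```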
t
I'm surprised that this declaration: `@task(name="foo bar", retries=10, retry_delay_seconds=6)` doesn't capture the Timeout error, nor does it retry
I'm still confused about why Prefect/Ray creates so many connections to the db. Is each task a new connection?
n
again, can you clarify what connections you're talking about and how you're counting these?
as far as Prefect goes: the task engine sends updates to the API, which writes state changes to the db, and yes, that happens for each task
t
Prefect connects to a database. It stores all the task runs/logs/flows etc? I can see in the AWS RDS metrics that DatabaseConnections skyrockets when this flow runs
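As a cross-check on that RDS metric, something along these lines (a sketch; the connection string is a placeholder) shows how many connections the Prefect database currently has in each state:
```python
# Sketch: count connections to the current database grouped by state,
# using Postgres's pg_stat_activity view.
import psycopg2

conn = psycopg2.connect("your_connection_string")
try:
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT state, count(*)
            FROM pg_stat_activity
            WHERE datname = current_database()
            GROUP BY state
            ORDER BY count(*) DESC
            """
        )
        for state, count in cur.fetchall():
            print(state, count)
finally:
    conn.close()
```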
n
It stores all the task runs/logs/flows etc?
yes!
t
if the job crashes, the flow doesn't properly close the connections and they stay idle
n
if the job crashes, the flow doesn't properly close the connections and they stay idle
that sounds like a bug, I'd guess likely related to using the Ray task runner
t
Postgres can handle the `idle in transaction` connections, but not `idle` ones
I have to manually kill these connections
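If a periodic cleanup job ends up being the workaround, a minimal sketch might look like this (the connection string and the 10-minute threshold are assumptions; on Postgres 14+ the `idle_session_timeout` setting can also do this server-side):
```python
# Sketch: terminate connections that have been idle for more than 10 minutes.
# Excludes the current session; threshold and connection string are placeholders.
import psycopg2

conn = psycopg2.connect("your_connection_string")
try:
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT pg_terminate_backend(pid)
            FROM pg_stat_activity
            WHERE datname = current_database()
              AND state = 'idle'
              AND state_change < now() - interval '10 minutes'
              AND pid <> pg_backend_pid()
            """
        )
        terminated = cur.fetchall()
        print(f"terminated {len(terminated)} idle connections")
finally:
    conn.close()
```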
n
are you willing to open an issue with a minimal reproduction of this?
t
okay
🙏 1