# ask-community
Memory leak in web runners? 👀 I realize it's a fairly new feature, improved in the 2.16.5 release last night, but has anyone managed to get the web runners approach working robustly for concurrent synchronous flows (not async)? I am consistently seeing the Prefect server process crash at around 1350 files processed (around 1100 before 2.16.5), with a fairly simple API call-and-response pattern. For example (the length of the flow seems irrelevant), this crashes just as readily as a much longer and more complex flow:
```python
import os

from prefect import flow
from prefect.runner import submit_to_runner  # import path in recent 2.x releases

import t  # project helper module wrapping the science APIs used below


@flow(log_prints=True)
def warm_cache(file):
    # Split the path into components to inspect the directory structure
    filepath = os.path.normpath(file).split(os.path.sep)
    if filepath[-3] == "atlas_test":
        identity = t.packed_provisional_to_identity(filepath[-2])
        orbit_job_id = t.object_orbit_submit(identity)
        orbit = t.object_orbit(orbit_job_id)


@flow(log_prints=True)
def atlas_ingest():
    atlas_path = "/data/staging/atlas_test"
    files = t.file_checker_atlas(atlas_path)
    print(f"There are {len(files)} files found to process.")
    # Submit one warm_cache flow run per file to the runner webserver
    submit_to_runner(warm_cache, [{"file": file} for file in files])
```
It reads a set of directories (or one, in my test case) of images to process scientifically. Having things go boom around 1350 files on each run is problematic (and, more to the point, the full initial historical ingest is about 8M images). The scientific flow that follows the orbit cache warming is never even reached. I am serving this with about 51 concurrent workers on a very beefy server, but at this point I'm fairly convinced the issue is a memory leak in the Prefect server with the web runners.
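One workaround I've been sketching, in case the problem scales with the number of queued submissions, is to cap in-flight runs by submitting the files in fixed-size chunks rather than all at once. `chunked` here is my own hypothetical helper, not a Prefect API:

```python
def chunked(items, size):
    """Yield successive fixed-size slices of a list."""
    for i in range(0, len(items), size):
        yield items[i:i + size]


# Hypothetical usage inside atlas_ingest, replacing the single big call:
#     for batch in chunked(files, 500):
#         submit_to_runner(warm_cache, [{"file": f} for f in batch])
```

No idea yet whether this actually sidesteps the leak or just delays it, since the server still sees the same total number of flow runs.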
```python
if __name__ == "__main__":
    atlas_ingest.serve(name="atlas_pipeline", webserver=True, limit=51)
```
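To help pin down whether it really is server-side memory growth, I've been thinking of logging the serving process's peak resident set size over time. A minimal stdlib sketch (Unix-only `resource` module; assumes Linux, where `ru_maxrss` is reported in KiB):

```python
import resource


def peak_rss_mib() -> float:
    """Peak resident set size of the current process in MiB.

    Assumes Linux semantics: ru_maxrss is in KiB there
    (on macOS it is in bytes).
    """
    peak_kib = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return peak_kib / 1024.0
```

Printing this periodically from the serving process should show whether RSS climbs monotonically toward the ~1350-file crash point.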
It's been hard to diagnose, as the `docker logs` only show an issue at crash time (plus post-crash database lock/write issues). So any help, even just with diagnosis, would be appreciated. As mentioned, this and the second flow succeed up to a consistent limit, and I think I've eliminated my Python code (with retries and backoff/jitter) as a factor, so I'm a bit stumped as to how to fix it. I'm really loving Prefect compared to Airflow, but I really need to get this sorted now. Alternatively: is there another method to serve/submit/deploy the above "take all these files and run one flow per file" pattern? I used the web runner approach to run a synchronous flow concurrently, but if web runners are still too experimental for production, what are my options? (Running on big metal on-prem, not in AWS or GCP with K8s.) See also: https://prefect-community.slack.com/archives/CL09KU1K7/p1710454407885129