Tim Galvin
11/14/2023, 8:22 AM
I have a flow that seems to be running fine, except tasks that reach a Complete() state are, for whatever reason, being restarted, sometimes hours after they were successfully completed.
As far as I can tell, there are no errors in my files. I am seeing the correct set of end data products from tasks that are completed after the task that is restarted needlessly.
Is there something I should be looking into for this? I am a little unsure where to even begin.
Tess Dicker
11/14/2023, 2:40 PM
Tim Galvin
11/14/2023, 3:41 PM
I am running the prefect server with WEB_CONCURRENCY=24, and a postgres database server through a container:
SINGULARITYENV_POSTGRES_PASSWORD="$POSTGRES_PASS" SINGULARITYENV_POSTGRES_DB="$POSTGRES_DB" SINGULARITYENV_PGDATA="$POSTGRES_SCRATCH/pgdata" \
singularity run --cleanenv --bind "$POSTGRES_SCRATCH":/var postgres_latest.sif -c max_connections=4096 -c shared_buffers=4096MB
I can see the workers and connections.
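A minimal sketch of how the database side of that setup can be double-checked, assuming prefect 2.x; the connection URL in the comment is only what the container above would suggest, not a confirmed value:
# Print the database connection URL the prefect API settings resolve to.
from prefect.settings import PREFECT_API_DATABASE_CONNECTION_URL

print(PREFECT_API_DATABASE_CONNECTION_URL.value())
# Expected to look something like (placeholder credentials):
# postgresql+asyncpg://postgres:<password>@localhost:5432/<database>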
The actual code is running on a set of compute nodes managed by SLURM on a HPC cluster. Each job effectively has twice as many CPUs and twice as much memory as I have spec'd as being required for the workflow.
The entire run does not restart. When I am looking at these restarted tasks, the UI is reporting their run count as 1. I can also see that the prefect message Finished in state Complete()
is issued, and then the task restarts some time later on the same page. Other tasks that are still running continue to run without any issue.
In my prefect server I am getting these messages _very occasionally_:
Invalid HTTP request received.
Invalid HTTP request received.
Invalid HTTP request received.
Invalid HTTP request received.
Invalid HTTP request received.
Invalid HTTP request received.
Are there any extra options I can activate to try to track those down and see if they are related? I am completely reinstalling my environment in the hope that the cause is a version mismatch or a bug that has since been patched in an upstream library. I have some memory of a http2-type error in a dependency lurking around.
I don't have a MWE that can consistently reproduce the issue. When I try to track it down and rerun my failed flow, it works. It could very well be related to the HPC I am using, but there are no errors reported or known by the technical staff who manage it.
The attached screenshot is an example. The particular error is because the data file has been zipped by a later stage, which completed successfully. The only way that it could have gotten to this zip stage is if a number of other stages had completed successfully against this correctly preprocessed measurement set. I can even see some figures produced from this data that imply everything worked at least once. The other tasks in my flow that fail also have the same basic structure - they complete, I see the finished data products, prefect marks them as Complete(), but some time afterwards they rerun. In the attached screenshot it was some 2 hours later.
Tess Dicker
11/14/2023, 10:13 PM
Tim Galvin
11/15/2023, 2:20 AM
I am using dask_jobqueue to establish a set of dask workers. So long as these dask workers have work to do, they are alive for the duration of the flow. The SLURM script that runs on the compute node is purely starting a dask worker that connects to the dask task runner's scheduler.
While there are tasks to do, the set of workers remains alive and available for work. When a dask worker does shut down, I would hope it has reported back to the scheduler that it is doing so, and informed the prefect engine as such?
Also, I am noticing that this is happening while the flow run is still running. The only other obvious thing I can think of is to turn off the adaptive mode of the dask cluster (which is what the dask task runner is using under the hood).
I am unclear on how to go about isolating and testing things further.
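For illustration, a minimal sketch of what turning the adaptive mode off could look like with the prefect dask task runner; the SLURM resources are placeholders rather than the real cluster settings:
from prefect import flow, task
from prefect_dask import DaskTaskRunner

# Placeholder resources; the real cores/memory/walltime belong to the actual SLURM setup.
task_runner = DaskTaskRunner(
    cluster_class="dask_jobqueue.SLURMCluster",
    cluster_kwargs={
        "cores": 24,
        "memory": "128GB",
        "walltime": "12:00:00",
        "n_workers": 36,  # request a fixed pool of workers up front
    },
    # adapt_kwargs deliberately omitted: the cluster never scales adaptively,
    # so dask should not retire a worker that prefect still has tasks on.
)

@task
def preprocess(ms: str) -> str:
    ...

@flow(task_runner=task_runner)
def pipeline(ms: str) -> None:
    preprocess.submit(ms)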
Tess Dicker
11/15/2023, 7:00 PM
Tim Galvin
11/16/2023, 2:03 AM
Tim Galvin
11/16/2023, 2:04 AM
Tim Galvin
11/16/2023, 2:05 AM
Tim Galvin
11/16/2023, 2:16 AM
Tess Dicker
11/16/2023, 5:32 PM
Tim Galvin
11/17/2023, 8:52 AM
Invalid http message error
Tim Galvin
11/23/2023, 5:30 AM
I have been setting the following dask configuration through environment variables:
# See <https://docs.dask.org/en/latest/configuration.html#distributed-scheduler>
# For more information on these variables
# This attempts to set distributed.scheduler.worker-saturation
export DASK_DISTRIBUTED__SCHEDULER__WORKER_SATURATION=0.01        # hold tasks back on the scheduler rather than queuing them deep on workers
export DASK_DISTRIBUTED__SCHEDULER__WORK_STEALING=True
export DASK_DISTRIBUTED__SCHEDULER__WORK_STEALING_INTERVAL="120s" # rebalance via work stealing far less often than the default
export DASK_DISTRIBUTED__SCHEDULER__WORKER_TTL="3600s"            # allow a long gap between worker heartbeats before the scheduler declares a worker dead
export DASK_DISTRIBUTED__SCHEDULER__ALLOWED_FAILURES=100          # tolerate many worker deaths before a task is marked as errored
export DASK_DISTRIBUTED__WORKER__PROFILE__INTERVAL="10000ms"      # dial the worker profiler right back
export DASK_DISTRIBUTED__WORKER__PROFILE__CYCLE="1000000ms"
export DASK_DISTRIBUTED__COMM__TIMEOUTS__CONNECT="300s"           # be very tolerant of slow or congested connections
export DASK_DISTRIBUTED__COMM__SOCKET_BACKLOG="16384"
export DASK_DISTRIBUTED__COMM__TIMEOUTS__TCP="300s"
export DASK_DISTRIBUTED__COMM__RETRY__COUNT=12
and setting the adaptive mode to something like
adapt_kwargs:
  minimum: 2
  maximum: 36
  wait_count: 20
  target_interval: "300s"
  interval: "30s"
Would there be any other insights you or other team members could share? I am very close to having a proper working solution that I can go all out with. Is it also worth considering putting something together in the docs about this? This seems to be a larger issue for me since I am using dask_jobqueue.SLURMCluster, which means I am at times only slowly acquiring the compute resources.
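As a side note, a minimal sketch of how the DASK_* settings above can be sanity-checked once the cluster is up, assuming a reachable scheduler (the address is a placeholder):
import dask
from dask.distributed import Client

# Connect to the running dask scheduler; the address is a placeholder.
client = Client("tcp://scheduler-address:8786")

# dask folds DASK_* environment variables into its configuration at import time,
# so these lookups should reflect the exported values where they were set.
print(dask.config.get("distributed.scheduler.worker-saturation"))                            # local process view
print(client.run_on_scheduler(lambda: dask.config.get("distributed.scheduler.worker-ttl")))  # scheduler view
print(client.run(lambda: dask.config.get("distributed.scheduler.allowed-failures")))         # per-worker view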
Tess Dicker
11/30/2023, 3:20 PM
Tim Galvin
11/30/2023, 3:51 PM
Tim Galvin
11/30/2023, 3:53 PM
I might have to strip out the task decorator / function to confirm that dask is not doing anything silly. Frustrating. But some of these functions have some very frustrating side effects that I might not be able to get around.
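A minimal sketch of that test, assuming the prefect layer is taken out entirely and a plain function is submitted straight to the dask client (the function body and scheduler address are placeholders):
from dask.distributed import Client

# Placeholder address for the already-running SLURM-backed dask scheduler.
client = Client("tcp://scheduler-address:8786")

# The same body as the prefect task, just without the @task decorator,
# so any spurious re-runs would have to come from dask itself.
def preprocess(ms: str) -> str:
    ...
    return ms

future = client.submit(preprocess, "example_measurement_set.ms")
print(future.result())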
Tim Galvin
12/14/2023, 3:28 AM