Alex Papanicolaou
08/29/2020, 8:14 PM
12:10:11 prefect.CloudTaskRunner Task 'run_simulator[54]': Starting task run...
12:10:11 prefect.run_simulator[54] Starting simulator run
12:10:11 prefect.run_simulator[54] cusip_list [{'secmnem': 'FNMMA3057', 'cusip': '31418CMF8'}]
12:10:11 prefect.run_simulator[54] Loading model 'cf621134-8c36-446a-96b5-7ecde88a33e2'
12:10:22 prefect.run_simulator[54] Simulating pool {'secmnem': 'FNMMA3057', 'cusip': '31418CMF8'}
12:10:31 prefect.run_simulator[54] Number of replicates 6
12:11:59 prefect.CloudTaskRunner Task 'run_simulator[54]': finished task run for task with final state: 'Success'
Here is an example, though (they don't appear super common), where the task succeeded and was later rerun.
One thing you can note is that the model id is different. It is randomly generated (not a big deal), but along with the timestamps it confirms that this is a repeated run, not a duplicated log (a quick sketch of this reasoning follows the log below).
11:55:34 prefect.CloudTaskRunner Task 'run_simulator[6]': Starting task run...
11:55:35 prefect.run_simulator[6] Starting simulator run
11:55:35 prefect.run_simulator[6] cusip_list [{'secmnem': 'FNMMA3774', 'cusip': '31418DFQ0'}]
11:55:35 prefect.run_simulator[6] Loading model 'c410358f-4612-4aef-8f12-e9a3642711de'
11:56:23 prefect.run_simulator[6] Simulating pool {'secmnem': 'FNMMA3774', 'cusip': '31418DFQ0'}
11:56:36 prefect.run_simulator[6] Number of replicates 3
11:57:12 prefect.CloudTaskRunner Task 'run_simulator[6]': finished task run for task with final state: 'Success'
12:06:17 prefect.CloudTaskRunner Task 'run_simulator[6]': Starting task run...
12:06:17 prefect.run_simulator[6] Starting simulator run
12:06:17 prefect.run_simulator[6] cusip_list [{'secmnem': 'FNMMA3774', 'cusip': '31418DFQ0'}]
12:06:17 prefect.run_simulator[6] Loading model '45322fce-d452-4340-9e06-e7bcc2775b84'
12:06:27 prefect.run_simulator[6] Simulating pool {'secmnem': 'FNMMA3774', 'cusip': '31418DFQ0'}
12:06:40 prefect.run_simulator[6] Number of replicates 3
12:07:15 prefect.CloudTaskRunner Task 'run_simulator[6]': finished task run for task with final state: 'Success'
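A minimal sketch of that reasoning; load_model and the log format here are hypothetical stand-ins for whatever the simulator actually does, assuming it draws a fresh UUID per run:
```python
# Hypothetical stand-in for the simulator's model loading: if a fresh UUID is
# drawn on every execution, two log lines with different model ids must come
# from two separate runs, while a duplicated log record would repeat the id.
import uuid

def load_model() -> str:
    model_id = str(uuid.uuid4())  # new random id on every run
    print(f"Loading model '{model_id}'")
    return model_id
```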
Chris White
08/29/2020, 8:17 PM
are you running your flow on a dask cluster? If so, this can occur if there is a worker eviction or if dask loses track of the data.
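For context on the setup being asked about, a minimal sketch of a mapped Prefect flow running on an existing Dask cluster, assuming the 0.13-era import path (DaskExecutor later moved to prefect.executors); the flow name, scheduler address, and task body are placeholders:
```python
# A sketch of a mapped Prefect flow executed on a Dask cluster. Each mapped
# task becomes a Dask future; if a worker is evicted or its memory is lost,
# Dask can recompute those futures, which looks like finished tasks re-running.
from prefect import Flow, task
from prefect.engine.executors import DaskExecutor  # 0.13-era import path

@task
def run_simulator(pool: dict) -> dict:
    # placeholder: the real task loads a model and simulates the pool
    return pool

with Flow("simulator") as flow:
    pools = [{"secmnem": "FNMMA3057", "cusip": "31418CMF8"}]
    results = run_simulator.map(pools)

# placeholder scheduler address
flow.run(executor=DaskExecutor(address="tcp://dask-scheduler:8786"))
```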
Alex Papanicolaou
08/29/2020, 8:25 PM
Chris White
08/29/2020, 8:33 PM
Not sure why dask isn't reassigning your tasks.
Alex Papanicolaou
08/30/2020, 9:58 PM
We studied this a bit more, and it seems like the task that runs multiple times does so because its worker restarted due to another task.
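That failure mode can be reproduced outside the pipeline. A toy dask.distributed sketch (not the actual flow): kill the worker holding a finished result, standing in for a restart triggered by another task, and the scheduler recomputes the "finished" future:
```python
# Toy reproduction: a task finishes, its worker is killed (as an OOM kill by
# another task would), the nanny restarts the worker, and Dask recomputes the
# lost result, so the already-finished task runs a second time.
import os
import signal
import time

from dask.distributed import Client, LocalCluster

def simulate(i: int) -> int:
    print(f"running simulate({i}) in pid {os.getpid()}", flush=True)
    return i * 2

if __name__ == "__main__":
    client = Client(LocalCluster(n_workers=1, threads_per_worker=1))

    fut = client.submit(simulate, 1)
    fut.result()                      # first execution completes

    (pid,) = client.run(os.getpid).values()
    os.kill(pid, signal.SIGKILL)      # worker dies; its in-memory result is lost
    time.sleep(3)                     # let the nanny restart the worker

    fut.result()                      # simulate(1) is recomputed: second run
```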
Chris White
09/01/2020, 2:53 AM
Alex Papanicolaou
09/01/2020, 3:13 AM
We were still running into memory issues, but we think we know why and found a workaround. After running another attempt at the flow in question, this did the trick (a sketch follows this list):
1. Tasks were set up so that there were ~2 tasks per worker. Before adjusting the profile intervals, we found that the second task on a worker would start with elevated memory usage (~4 GB, compared to ~0.5 GB for the first task on a worker).
2. Cutting down the polling by 1000x led to the profiler no longer being picked up by tracemalloc (as reported in that Stack Overflow post).
3. Whereas before we ran into a ton of garbage collection warnings from Dask, with memory usage creeping up to 4, 5, 6 GB and eventually bricking the workers and thus the flow, once we set the profile intervals each worker peaked at just over 1 GB with no noticeable memory growth; 750 MB-1 GB is about what we expect.
I'm not sure what the side effects are, but this is a tremendous improvement for us. I'll give that Stack Overflow post a few days and maybe migrate it to the Dask GitHub to see if anyone has an opinion on it.
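For reference, a sketch of that workaround applied via dask.config. The exact values are illustrative assumptions (the distributed defaults are interval=10ms and cycle=1000ms, and the thread describes slowing the polling by ~1000x), and the settings must be in place before the workers start:
```python
# Slow the Dask worker's statistical profiler by ~1000x so it stops being
# picked up by tracemalloc and, per the thread, curbs the memory growth.
# Illustrative values; defaults are interval=10ms, cycle=1000ms.
# Set this before the cluster/workers are created.
import dask

dask.config.set({
    "distributed.worker.profile.interval": "10s",   # sampling period
    "distributed.worker.profile.cycle": "1000s",    # aggregation window
})
```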
Chris White
09/01/2020, 2:53 PM
Marvin
09/01/2020, 2:54 PM
Jim Crist-Harif
09/01/2020, 4:19 PM