# prefect-server
m
Hey, sometimes we miss some CloudTaskRunner errors in the output of prefect agent docker start --show-flow-logs. We can see them in the GUI, but not in the terminal output. Is this a known behavior?
a
Are you able to identify what type of CloudTaskRunner logs are missing? Are those missing for all errors or only for specific exceptions?
This is the way it's implemented:
Copy code
if self.show_flow_logs:
            self.stream_flow_logs(container.get("Id"))
So it should stream all logs from a running container.
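Under the hood, streaming boils down to following the container's log stream via docker-py. Here is a minimal sketch of that pattern (not the agent's exact code; the socket URL is an assumption):
Copy code
import docker

# Low-level docker-py client; the Unix socket URL is an assumption.
client = docker.APIClient(base_url="unix://var/run/docker.sock")

def stream_container_logs(container_id: str) -> None:
    # follow=True keeps the stream open while the container is running;
    # the generator ends once the container exits.
    for line in client.logs(container=container_id, stream=True, follow=True):
        print(line.decode("utf-8", errors="replace"), end="")
So any output the flow process never actually wrote out (for example, output still sitting in a buffer when the container stopped) cannot show up in that stream either.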
You could try adjusting these environment variables:
• PREFECT__CLOUD__SEND_FLOW_RUN_LOGS
• PREFECT__LOGGING__LEVEL
But given that it appears only sometimes and only for some error logs, this is hard... I would open an issue directly, but let's try to collect more info first. Can you share the output of prefect diagnostics to make it easier to reproduce? Also, trying to pin down when exactly this happens (only specific flows, specific exceptions) would help.
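For reference, those PREFECT__SECTION__KEY variables map onto prefect.config.section.key. A minimal sketch of verifying which values the config ends up with (assuming Prefect 1.x config loading; the values shown are just examples):
Copy code
import os

# Set the variables before importing prefect so the config picks them up at import time.
os.environ["PREFECT__CLOUD__SEND_FLOW_RUN_LOGS"] = "true"
os.environ["PREFECT__LOGGING__LEVEL"] = "DEBUG"

import prefect

# PREFECT__SECTION__KEY -> prefect.config.section.key
print(prefect.config.logging.level)             # expected: DEBUG
print(prefect.config.cloud.send_flow_run_logs)  # expected: True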
m
@Anna Geller Thanks a lot for your answer! I can describe the env we have:
• the Docker agent runs as a service on Linux (with the show-flow-logs flag)
• so all the output is captured and can be displayed with journalctl
We have a simple log scanner for errors (by keyword) on the output of this service/agent (a rough sketch follows the example below). The reason is that if the container cannot be started at all, we still get at least one mail notification; this kind of error is only thrown by the agent. Example of a working "ERROR - prefect.CloudTaskRunner":
Copy code
Apr 29 07:29:04 rndprefectdev01.uhbs.ch run_tst_docker_agent.sh[4000120]: [2022-04-29 05:29:04+0000] ERROR - prefect.CloudTaskRunner | Task 'cdwh.historical_audit: historical_audit': Exception encountered during task execution!
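The scanner itself is just a keyword match over the journal output, roughly like this (a hypothetical sketch, not our actual script; the unit name, keyword, and notification hook are assumptions):
Copy code
import subprocess

UNIT = "tst_docker_agent.service"   # hypothetical systemd unit name
KEYWORD = "ERROR - prefect."        # keyword we scan for

def notify(line: str) -> None:
    # Placeholder for the mail notification we actually send.
    print(f"would send mail for: {line}")

# Follow the agent's journal and flag matching lines.
proc = subprocess.Popen(
    ["journalctl", "-u", UNIT, "-f", "--output=cat"],
    stdout=subprocess.PIPE,
    text=True,
)
for line in proc.stdout:
    if KEYWORD in line:
        notify(line.rstrip())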
The case where we missed it: this is the log we collected from journalctl via the show-flow-logs flag.
Copy code
May 02 07:25:05 rndprefectdev01.uhbs.ch run_tst_docker_agent.sh[1844096]: [2022-05-02 05:25:05+0000] INFO - prefect.CloudTaskRunner | Task 'cdwh.join_checks: join_checks': Starting task run...
May 02 07:25:05 rndprefectdev01.uhbs.ch run_tst_docker_agent.sh[1844096]: [2022-05-02 05:25:05+0000] INFO - prefect.transform | Task started: transformation_clinerion.cli_labtest: transform
May 02 07:25:05 rndprefectdev01.uhbs.ch run_tst_docker_agent.sh[1844096]: [2022-05-02 05:25:05+0000] INFO - prefect.join_checks | Task started: cdwh.join_checks: join_checks
May 02 09:11:46 rndprefectdev01.uhbs.ch run_tst_docker_agent.sh[2134]: [2022-05-02 07:11:46,129] INFO - tst | Deploying flow run 83e1c3fe-0e29-48df-b561-392204d73c3c to execution environment...
This is the full log we got from the Prefect Server GUI:
Copy code
2 May 2022,07:25:05 	prefect.join_checks	INFO	Task started: cdwh.join_checks: join_checks
2 May 2022,07:25:05 	prefect.transform	INFO	transformation_clinerion.cli_labtest: transform - Cron: 0 0 * * mon#1 -> Last time executed: 2022-04-08 09:24:35.513000 -> Next run should be: 2022-05-02 00:00:00 (today is: 2022-05-02 05:25:05.315360, so should skip by cron: False)
2 May 2022,07:26:36 	prefect.join_checks	INFO	Task finished: cdwh.join_checks: join_checks
2 May 2022,07:26:36 	prefect.CloudTaskRunner	INFO	Task 'cdwh.join_checks: join_checks': Finished task run for task with final state: 'Success'
2 May 2022,07:26:36 	prefect.CloudTaskRunner	INFO	Task 'cdwh.join_checks: cleanup': Starting task run...
2 May 2022,07:26:36 	prefect.cleanup	INFO	Task started: cdwh.join_checks: cleanup
2 May 2022,07:27:36 	prefect.cleanup	INFO	Task finished: cdwh.join_checks: cleanup
2 May 2022,07:27:36 	prefect.CloudTaskRunner	INFO	Task 'cdwh.join_checks: cleanup': Finished task run for task with final state: 'Success'
2 May 2022,07:28:05 	prefect.transform	INFO	Task finished: transformation_clinerion.cli_labtest: transform
2 May 2022,07:28:05 	prefect.CloudTaskRunner	INFO	Task 'transformation_clinerion.cli_labtest: transform': Finished task run for task with final state: 'Success'
2 May 2022,07:28:05 	prefect.CloudTaskRunner	INFO	Task 'clinerion.delete_orphan_records: clinerion_deleter': Starting task run...
2 May 2022,07:28:05 	prefect.clinerion_deleter	INFO	Task started: clinerion.delete_orphan_records: clinerion_deleter
2 May 2022,07:28:05 	prefect.clinerion_deleter	INFO	clinerion.delete_orphan_records: clinerion_deleter - Cron: 0 0 * * mon#1 -> Last time executed: 2022-04-08 09:27:34.508000 -> Next run should be: 2022-05-02 00:00:00 (today is: 2022-05-02 05:28:05.855010, so should skip by cron: False)
2 May 2022,07:28:09 	prefect.clinerion_deleter	INFO	693580 records will be deleted from dm_cli_case
2 May 2022,07:28:09 	prefect.clinerion_deleter	INFO	33172 records will be deleted from dm_cli_patient
2 May 2022,07:28:09 	prefect.clinerion_deleter	INFO	33172 records will be deleted from dm_cli_deathdetail
2 May 2022,07:28:10 	prefect.clinerion_deleter	INFO	Task finished: clinerion.delete_orphan_records: clinerion_deleter
2 May 2022,07:28:10 	prefect.CloudTaskRunner	INFO	Task 'clinerion.delete_orphan_records: clinerion_deleter': Finished task run for task with final state: 'Success'
2 May 2022,07:28:10 	prefect.CloudTaskRunner	INFO	Task 'cdwh.historical_audit: historical_audit': Starting task run...
2 May 2022,07:28:10 	prefect.historical_audit	INFO	Task started: cdwh.historical_audit: historical_audit
2 May 2022,07:28:48 	prefect.historical_audit	INFO	Task finished: cdwh.historical_audit: historical_audit
2 May 2022,07:28:48 	prefect.CloudTaskRunner	INFO	Task 'cdwh.historical_audit: historical_audit': Finished task run for task with final state: 'Success'
2 May 2022,07:28:48 	prefect.CloudTaskRunner	INFO	Task 'cdwh.historical_audit: compare_with_prev_run': Starting task run...
2 May 2022,07:28:48 	prefect.compare_with_prev_run	INFO	Task started: cdwh.historical_audit: compare_with_prev_run
2 May 2022,07:28:49 	prefect.CloudTaskRunner	ERROR	Task 'cdwh.historical_audit: compare_with_prev_run': Exception encountered during task execution!
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/prefect/engine/task_runner.py", line 880, in get_task_run_state
    value = prefect.utilities.executors.run_task_with_timeout(
  File "/usr/local/lib/python3.9/site-packages/prefect/utilities/executors.py", line 468, in run_task_with_timeout
    return task.run(*args, **kwargs)  # type: ignore
  File "/usr/local/lib/python3.9/site-packages/cdwhprefect/model/task_collection/impl/general/historical_audit.py", line 373, in compare_with_prev_run
    - history <i>(last {alarm.last_measures_n:,.0f}): last_run_ids: {alarm.last_run_ids}</i>
AttributeError: Could not locate column in row for column 'last_run_ids'
2 May 2022,07:28:49 	prefect.compare_with_prev_run	INFO	Failed with cdwh.historical_audit: compare_with_prev_run: Error during execution of task: AttributeError("Could not locate column in row for column 'last_run_ids'")
2 May 2022,07:28:49 	prefect.compare_with_prev_run	INFO	Error during execution of task: AttributeError("Could not locate column in row for column 'last_run_ids'")
2 May 2022,07:28:49 	prefect.compare_with_prev_run	INFO	Task finished: cdwh.historical_audit: compare_with_prev_run
2 May 2022,07:28:49 	prefect.CloudTaskRunner	INFO	Task 'cdwh.historical_audit: compare_with_prev_run': Finished task run for task with final state: 'Failed'
2 May 2022,07:28:52 	prefect.CloudFlowRunner	INFO	Flow run FAILED: some reference tasks failed.
So in that long list, only the records at the beginning were also visible in journalctl; all following records were missed. The first log from April shows that, in general, an "ERROR - prefect.CloudTaskRunner" record (like the one at 2 May 2022, 07:28:49) does get captured. Here is the output of prefect diagnostics:
Copy code
{
  "config_overrides": {
    "server": {
      "database": {
        "password": true
      },
      "ui": {
        "apollo_url": true
      }
    }
  },
  "env_vars": [],
  "system_information": {
    "platform": "Linux-5.4.0-109-generic-x86_64-with-glibc2.31",
    "prefect_backend": "server",
    "prefect_version": "1.1.0",
    "python_version": "3.9.6"
  }
}
Yes, the analysis of such errors is hard. I should try to build a Docker image with a simple error. But doing this with the full pipeline (flow, container, agent) is not a quick task. But then we could reproduce it.
So I assume the flow stops too early after the exception (thrown in our task). Maybe it waits for the log to be written via GraphQL, but not for the actual print, which would then be redirected to journalctl in the end.
a
Hi @Michael Hadorn, catching up now. Thank you so much for this detailed write-up. I will create an issue today, but let's discuss one possibility beforehand. I don't want to suggest changing your process, but you could catch and get notified about errors such as this one via state handlers, rather than relying on alerts from a log aggregation service:
Copy code
ERROR - prefect.CloudTaskRunner | Task 'cdwh.historical_audit: historical_audit': Exception encountered during task execution!
Using a state handler would be more reliable since this would be handled on the orchestration backend. Here is one example allowing you to receive the full exception traceback in a Slack notification: https://gist.github.com/anna-geller/2014180ee5eaec9ea54f4d3f5b98ca93
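The core of that pattern is a state handler that fires when the task run enters a Failed state and pulls the exception out of it. A minimal sketch, using the failing task from your logs as an example (the webhook URL is a placeholder; see the gist for the full version):
Copy code
import traceback

import requests
from prefect import task

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder

def notify_on_failure(task_obj, old_state, new_state):
    # Prefect 1.x state handlers receive (object, old_state, new_state)
    # and may return a (possibly modified) state.
    if new_state.is_failed():
        # For a Failed state, `result` usually holds the raised exception.
        exc = new_state.result
        tb = "".join(
            traceback.format_exception(type(exc), exc, getattr(exc, "__traceback__", None))
        )
        requests.post(SLACK_WEBHOOK_URL, json={"text": f"{task_obj.name} failed:\n{tb}"})
    return new_state

@task(state_handlers=[notify_on_failure])
def compare_with_prev_run():
    raise AttributeError("Could not locate column in row for column 'last_run_ids'")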
m
@Anna Geller Thank you so much for your answer. Awesome! Yes, we already do this in the flow state handler, because we care about the full flow. We added the other method only to also detect errors in the log files when the flow itself could not be started; afaik there is no other way to catch this. So for our process, it is safe if we have both at the moment. But I still think these errors should also be reflected in the Docker output.
a
I generally view this as a negative engineering task that Prefect should take care of for you - you shouldn't need alerting from a 3rd-party log aggregator based on the log message content of a task run. When the flow could not be started, that would be a different message. For the "Exception encountered during task execution" you would need a task-level state handler; a flow-level state handler won't catch it, as it's triggered based on the flow run object, not the task run.
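To make the distinction concrete: the same handler signature can be attached at either level, but it only sees state changes of the object it is attached to. A minimal sketch (names are placeholders):
Copy code
from prefect import task, Flow

def notify_on_failure(obj, old_state, new_state):
    # Same (object, old_state, new_state) signature as above; just log here.
    if new_state.is_failed():
        print(f"{obj} entered a Failed state: {new_state.message}")
    return new_state

# Task-level: fires on the task run's transitions, so it sees
# "Exception encountered during task execution" (the task run going Failed).
@task(state_handlers=[notify_on_failure])
def some_task():
    raise ValueError("boom")

# Flow-level: fires on flow run transitions only, e.g.
# "Flow run FAILED: some reference tasks failed." -- it never sees the task's exception.
with Flow("example-flow", state_handlers=[notify_on_failure]) as flow:
    some_task()

if __name__ == "__main__":
    flow.run()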
@Marvin open "Docker agent with --show-flow-logs doesn't stream all flow run logs to the CLI output"
m
@Anna Geller For clarification: we don't read logs in the flow state handler. There we check for errors from tasks (with the prefect lib) and then send the status email. The log parsing alerts us when e.g. our registry is not available, because then we cannot handle it from the flow itself. I read somewhere that there is no better solution for an on-prem setup; only in the cloud are you guaranteed to get an error email. I'm not sure if you understood me correctly, which is why I repeated some topics. Thanks anyhow.
👍 1