# prefect-community
t
Hello again! I think I’ve tracked down an issue with MS Teams notifications, and the notification queue in general, where only certain flow run states were being sent. This logger message appears:
```
prefect-orion-1  | 21:53:34.590 | DEBUG   | prefect.orion.services.flowrunnotifications - Got 3 notifications from queue.
```
However, it looks like the code intends for only one notification to be read off the queue in the first place: https://github.com/PrefectHQ/prefect/blob/main/src/prefect/orion/services/flow_run_notifications.py#L38-L42.
The behavior I’m seeing is that if we have notifications for `Running` and `Failed` in succession, only one of them gets read off the queue and the other is discarded.
If I remove the notification for the `Running` state, then the `Failed` notification makes it out of the queue. If there’s more time (I believe the polling interval is 4 seconds) in the flow run between the `Running` and `Failed` notifications getting enqueued, then both seem to get read off appropriately.
I was only able to get this far thanks to this: https://prefect-community.slack.com/archives/CL09KU1K7/p1668038154353999 — so this already feels like progress.
Other details: still running Prefect 2.6.4 + Postgres in Docker.
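[Editor's note: a toy model of the drain behavior described above, not Prefect's actual implementation. The linked code reads notifications off the queue with a limit of 1 per service tick; if that limit is honored, two states enqueued back-to-back should both be delivered on consecutive ticks rather than one being discarded. All names here are made up for illustration:]

```python
from collections import deque

def drain(queue: deque, batch_size: int = 1) -> list:
    """Toy sketch of the intended behavior: each tick pops at most
    `batch_size` notifications (the linked code uses a limit of 1),
    and nothing read off the queue is ever discarded."""
    sent = []
    while queue:
        # one "tick" of the notification service
        for _ in range(min(batch_size, len(queue))):
            sent.append(queue.popleft())
    return sent

notifications = deque(["Running", "Failed"])  # enqueued in quick succession
print(drain(notifications))  # ['Running', 'Failed'] -- both delivered, in order
```

The bug report is that the real service instead reads multiple notifications in one tick and sends only one of them.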
z
Thanks for the additional details! We’ll look into this.
🙏 1
cc @Zach Angell perhaps this broke during the query optimizations?
Can you share the output of `prefect version` where you’re running your server? Curious whether you’re using Postgres or SQLite, since these queries are different for each backend.
c
2.6.4
Postgres
t
^ what he said:
```
Version:             2.6.4
API version:         0.8.2
Python version:      3.9.15
Git commit:          51e92dda
Built:               Thu, Oct 20, 2022 3:11 PM
OS/Arch:             linux/x86_64
Profile:             default
Server type:         hosted
```
```
psql (PostgreSQL) 14.1
```
z
Great, thanks!
I can’t reproduce. Can you also get the versions from `pip show sqlalchemy asyncpg`?
t
```
Name: SQLAlchemy
Version: 1.4.42
Summary: Database Abstraction Library
Home-page: https://www.sqlalchemy.org
Author: Mike Bayer
Author-email: mike_mp@zzzcomputing.com
License: MIT
Location: /usr/local/lib/python3.9/site-packages
Requires: greenlet
Required-by: alembic, prefect
---
Name: asyncpg
Version: 0.26.0
Summary: An asyncio PostgreSQL driver
Home-page: https://github.com/MagicStack/asyncpg
Author: MagicStack Inc
Author-email: hello@magic.io
License: Apache License, Version 2.0
Location: /usr/local/lib/python3.9/site-packages
Requires:
Required-by: prefect
```
t
If it helps, this is the docker image we are using: prefecthq/prefect:2.6.4-python3.9
z
Our tests run against Postgres 13 in CI but I can’t reproduce this against 14 locally either.
Can you turn on `PREFECT_ORION_DATABASE_ECHO`?
t
I did `prefect config set PREFECT_ORION_DATABASE_ECHO=True` — I’ll tail the logs and try to re-run.
z
I’m interested in seeing the query that’s being used to retrieve notifications; it should have `LIMIT 1` from the code, but apparently that’s not working for y’all.
t
Does the server need to be bounced for that setting to take effect? I’m also running the logging hotfix you posted yesterday, so I’m seeing the log streams for prefect and uvicorn, but the DB is not echoing.
z
Yeah, the config isn’t hot-reloaded.
t
Okay, will bounce the container with the config set.
Okay, got the super verbose logs now. Re-running the test flow.
Is this what you’re looking for:
```
prefect-orion-1  | 23:52:39.362 | INFO    | sqlalchemy.engine.Engine - WITH queued_notifications AS
prefect-orion-1  | (DELETE FROM flow_run_notification_queue WHERE flow_run_notification_queue.id IN (SELECT flow_run_notification_queue.id
prefect-orion-1  | FROM flow_run_notification_queue ORDER BY flow_run_notification_queue.updated
prefect-orion-1  |  LIMIT %s FOR UPDATE SKIP LOCKED) RETURNING flow_run_notification_queue.id, flow_run_notification_queue.flow_run_notification_policy_id, flow_run_notification_queue.flow_run_state_id)
prefect-orion-1  |  SELECT queued_notifications.id AS queue_id, flow_run_notification_policy.id AS flow_run_notification_policy_id, flow_run_notification_policy.message_template AS flow_run_notification_policy_message_template, flow_run_notification_policy.block_document_id, flow.id AS flow_id, flow.name AS flow_name, flow_run.id AS flow_run_id, flow_run.name AS flow_run_name, flow_run.parameters AS flow_run_parameters, flow_run_state.type AS flow_run_state_type, flow_run_state.name AS flow_run_state_name, flow_run_state.timestamp AS flow_run_state_timestamp, flow_run_state.message AS flow_run_state_message
prefect-orion-1  | FROM queued_notifications JOIN flow_run_notification_policy ON queued_notifications.flow_run_notification_policy_id = flow_run_notification_policy.id JOIN flow_run_state ON queued_notifications.flow_run_state_id = flow_run_state.id JOIN flow_run ON flow_run_state.flow_run_id = flow_run.id JOIN flow ON flow_run.flow_id = flow.id
prefect-orion-1  | 2022-11-10 23:52:39,363 INFO sqlalchemy.engine.Engine [cached since 75.8s ago] (1,)
```
right before the notification was sent via `apprise`:
```
prefect-orion-1  | 23:53:27.396 | DEBUG   | prefect.orion.services.flowrunnotifications - Got 2 notifications from queue.
prefect-orion-1  | 23:53:27.396 | DEBUG   | prefect.orion.services.flowrunnotifications - Got 2 notifications from queue.
```
z
`LIMIT %s` 👀
t
Oh, I missed that!
I love a good prepared statement(?) bug.
I unfortunately don’t have an Orion environment set up in my debugger at the moment, otherwise I’d try to step through where that CTE is built and see if it’s just a bad value passed or if SQLAlchemy is not happy with something.
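[Editor's note: the `LIMIT %s` itself is expected — it’s a "format"-style placeholder in the compiled statement, and the `(1,)` at the end of the echo line above is the bound parameter tuple. A toy illustration of how the placeholder pairs with that tuple (not asyncpg internals — the real driver binds parameters server-side):]

```python
# The statement text SQLAlchemy logs, with a "format"-paramstyle placeholder
sql = "SELECT id FROM q ORDER BY updated LIMIT %s FOR UPDATE SKIP LOCKED"
params = (1,)  # matches the "(1,)" shown on the SQLAlchemy echo line

# A driver-side stand-in for binding; real drivers do this safely,
# never with raw string interpolation.
rendered = sql % params
print(rendered)  # SELECT id FROM q ORDER BY updated LIMIT 1 FOR UPDATE SKIP LOCKED
```

So the echo log by itself shows the limit value 1 being bound as intended, which is what makes the "Got 2 notifications" result so confusing.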
z
With a debugger:
```
(Pdb) print(notification_details_stmt.compile(compile_kwargs={"literal_binds": True}))
WITH queued_notifications AS 
(DELETE FROM flow_run_notification_queue WHERE flow_run_notification_queue.id IN (SELECT flow_run_notification_queue.id 
FROM flow_run_notification_queue ORDER BY flow_run_notification_queue.updated
 LIMIT 1 FOR UPDATE) RETURNING flow_run_notification_queue.id, flow_run_notification_queue.flow_run_notification_policy_id, flow_run_notification_queue.flow_run_state_id)
 SELECT queued_notifications.id AS queue_id, flow_run_notification_policy.id AS flow_run_notification_policy_id, flow_run_notification_policy.message_template AS flow_run_notification_policy_message_template, flow_run_notification_policy.block_document_id, flow.id AS flow_id, flow.name AS flow_name, flow_run.id AS flow_run_id, flow_run.name AS flow_run_name, flow_run.parameters AS flow_run_parameters, flow_run_state.type AS flow_run_state_type, flow_run_state.name AS flow_run_state_name, flow_run_state.timestamp AS flow_run_state_timestamp, flow_run_state.message AS flow_run_state_message 
FROM queued_notifications JOIN flow_run_notification_policy ON queued_notifications.flow_run_notification_policy_id = flow_run_notification_policy.id JOIN flow_run_state ON queued_notifications.flow_run_state_id = flow_run_state.id JOIN flow_run ON flow_run_state.flow_run_id = flow_run.id JOIN flow ON flow_run.flow_id = flow.id
```
I’m going to have to point someone else at this tomorrow, it looks sensible and I can’t reproduce the issue myself but it’s clearly happening on your end from the log 🙂
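[Editor's note: the pdb technique above — compiling a statement with `literal_binds` to inline bound values — works against any SQLAlchemy Core statement. A minimal sketch with a toy table (the names here are made up, not Prefect's actual schema):]

```python
import sqlalchemy as sa

# Toy table standing in for flow_run_notification_queue
queue = sa.table("queue", sa.column("id"), sa.column("updated"))

stmt = sa.select(queue.c.id).order_by(queue.c.updated).limit(1)

# literal_binds inlines bound parameters (including the limit) into the SQL
# text, which is what turns "LIMIT :param_1" / "LIMIT %s" into "LIMIT 1"
print(stmt.compile(compile_kwargs={"literal_binds": True}))
```

The output for the default dialect ends with `LIMIT 1`, confirming the statement itself carries the right limit value.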
t
That sounds good - appreciate you digging in! Happy to provide any other Orion runtime environment details as needed.
z
Minor note: mine printed `LIMIT :param_1` for Postgres and `LIMIT %s` for SQLite — are you sure it’s running against Postgres as intended?
t
From the Orion settings panel:
`PREFECT_ORION_DATABASE_CONNECTION_URL: postgresql+asyncpg://postgres@...`
z
That’s pretty promising 🙂 I don’t think there’s a CTE for SQLite anyway, the query would look pretty different.
t
I mean, you never know what else they’ll figure out how to package into SQLite… I can also share the docker-compose I’m launching this from, minus any interesting-to-only-us bits:
```yaml
# docker-compose.yml
version: "3.9"
services:
  database:
    image: postgres:14.1-alpine
    restart: always
    environment:
      - POSTGRES_USER=postgres
      - POSTGRES_PASSWORD=MySuperSecretPassword
      - POSTGRES_DB=orion
    expose:
      - 5432
    volumes:
      - /prefect/postgres:/var/lib/postgresql/data

  orion:
    image: prefecthq/prefect:2.6.4-python3.9
    restart: always
    volumes:
      - /prefect/orion:/root/.prefect
      - /prefect/monkeypatch/orion.py:/usr/local/lib/python3.9/site-packages/prefect/cli/orion.py # temporary logging fix, see: https://github.com/PrefectHQ/prefect/compare/poc/fix-prefect-logging-uvicorn
    entrypoint: ["prefect", "orion", "start"]
    environment:
      - PREFECT_ORION_API_HOST=0.0.0.0
      - PREFECT_ORION_API_PORT=80
      - PREFECT_API_URL=http://our-server-lives-here.com:80/api
      - PREFECT_API_KEY=<guid>
      - PREFECT_ORION_DATABASE_CONNECTION_URL=postgresql+asyncpg://postgres:MySuperSecretPassword@database:5432/orion
      - PREFECT_LOGGING_LEVEL=DEBUG # default: INFO
      - PREFECT_LOGGING_SERVER_LEVEL=DEBUG # default: WARNING
      - PREFECT_DEBUG_MODE=True
      - PREFECT_LOGGING_ORION_ENABLED=True
      - PREFECT_ORION_DATABASE_ECHO=True
    ports:
      - 80:80
    depends_on:
      - database
```
z
Just an update here: we can reproduce this, but not in unit tests.
c
progress, thanks for the update
z
We’re all thoroughly confused though 🙂
t
The best kind of bug!
z
Would you mind opening an issue? I think this is going to move to our backlog as we’ve made no progress 😕
t
I can do that. Thanks for digging in this far.
Hey @Zanie, I’ve filed that here: https://github.com/PrefectHQ/prefect/issues/7568. Let me know how I can help you or the other community engineers on this one. Thanks again!