05/15/2020, 3:50 PM
As a follow-up to my earlier message, we wanted to let everyone know what went wrong and what we're doing to prevent similar outages in the future.

_What happened_: During a routine API release at approximately 09:28 EST, one of our API nodes fell out of sync with the others because a metadata application hung partway through. As a result, traffic routed to the unhealthy node incorrectly received 400 errors on any request referencing flow schedules. While investigating the issue, our Cloud engineers had to re-apply the metadata, causing a complete service disruption that lasted approximately 9 minutes. During that window, all communication with the API was severed: runs scheduled in that time may have started late, and runs already in progress may have become zombies. The intermittent disruptions lasted from approximately 09:45 EST to 11:00 EST.

_What we're doing to respond_: One of the services that makes up the Cloud architecture uses a mechanism for applying database migrations; we've outgrown that mechanism, and it was the source of this disruption. Moving forward, we're replacing it with a more scalable process for applying API metadata (removing the mechanism altogether) and introducing safeguards in the form of pre-production replica databases.
:prefect: 1
👍 10