Intermittent failures in API
Incident Report for Sanity
Incident resolved.
Posted Sep 20, 2019 - 03:59 CEST
It seems most services are restored. We are continuing to monitor.
Posted Sep 19, 2019 - 21:27 CEST
We are having further issues. We are on it.
Posted Sep 19, 2019 - 20:37 CEST
At 11:15 UTC we got reports of slow queries from some of our API users. We monitored the problem, but there did not seem to be cause for alarm.
Investigating further, we discovered that some nodes in one of the pools that handle GROQ and GraphQL API queries for our freemium and advanced plan customers were having memory issues. These issues seemed to poison the entire pool.
Restarting the failing nodes, however, started a cascade of problems that made the whole pool unstable, resulting in partial outages, either intermittent or sustained, for an unknown fraction of our freemium and advanced plan customers.
An uncharacteristic problem made the nodes recover very slowly. We spent a long time hunting down more "ill" nodes. Once we rebooted these, recovery finally picked up speed.
At 15:32 UTC all our indexes were back online, but redundancy was still at a critical level.
At 16:38 UTC our query APIs were nominally operational again. This means all resources are online, but the pool is still stabilising, so certain write operations and queries may still time out until it settles.
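For clients that see such timeouts while the pool settles, retrying with exponential backoff is one common way to cope. The following is a minimal sketch under stated assumptions, not an official client: `retry_with_backoff`, its parameters, and the use of Python's built-in `TimeoutError` as the failure signal are all hypothetical choices made for illustration, and the delays are arbitrary.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation`, retrying on TimeoutError with exponential backoff.

    Hypothetical helper: `operation` stands in for whatever query or write
    call your client makes; it is assumed to raise TimeoutError on a timeout.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TimeoutError:
            if attempt == max_attempts:
                # Out of attempts: surface the last timeout to the caller.
                raise
            # The delay cap doubles on each attempt, bounded by max_delay.
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            # "Full jitter": sleep a random time in [0, cap] so that many
            # clients do not all retry at the same moment.
            sleep(random.uniform(0, cap))

    raise AssertionError("unreachable")
```

Retrying a query is generally harmless. Retrying a write is only safe if the operation can be repeated without side effects, so treat that part as something to verify for your own use case.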
There was never a threat of data loss during this situation. The document store is an entirely separate service and was operating normally the entire time. In addition to the document store not being affected by this event, we keep daily backups of all your content for 30 days.
We are still monitoring the situation.
Posted Sep 19, 2019 - 19:23 CEST
Service is restored for all customers. We are still waiting for all redundant capacity to come online.
Posted Sep 19, 2019 - 17:36 CEST
Services are now recovering. Stand by for further updates. Thank you for your patience.
Posted Sep 19, 2019 - 17:24 CEST
We are having trouble with an index cluster, which is temporarily preventing us from processing queries for a currently unknown fraction of datasets. We are all hands on deck working to get the cluster back up. There is no risk of data loss; only the indexes used for query processing are down.
Posted Sep 19, 2019 - 15:37 CEST
We are investigating.
Posted Sep 19, 2019 - 14:30 CEST