Intermittent failures in API
Incident Report for Sanity
Resolved
Incident resolved.
Sep 20, 2019 - 03:59 CEST
Update
It seems most services are restored. We are continuing to monitor.
Sep 19, 2019 - 21:27 CEST
Update
We are having further issues. We are on it.
Sep 19, 2019 - 20:37 CEST
Update
At 11:15 UTC we got reports of slow queries from some of our API users. We monitored the problem, but there did not seem to be cause for alarm.
Investigating further, we discovered that some nodes in one pool that handle GROQ and GraphQL API queries for our freemium and advanced plan customers were having memory issues. Their issues seemed to poison the entire pool.
Restarting the failing nodes, however, started a cascade of problems that made the whole pool unstable, resulting in intermittent or sustained partial outages for an unknown fraction of our freemium and advanced plan customers.
An uncharacteristic problem made the nodes recover very slowly, and we spent a long time hunting down more "ill" nodes. Rebooting these finally made recovery pick up speed.
At 15:32 UTC all our indexes were back online, but redundancy was still critically low.
At 16:38 UTC our query APIs were nominally operational again. This means all resources are online, but the pool is still stabilising, so certain write operations and queries may still time out until it settles.
There was never a threat of data loss during this situation. The document store is an entirely separate service, was unaffected by this event, and operated normally the entire time. We also keep daily backups of all your content for 30 days.
We are still monitoring the situation.
Sep 19, 2019 - 19:23 CEST
Monitoring
Service restored for all customers; still waiting for all redundant capacity to come online.
Sep 19, 2019 - 17:36 CEST
Update
Services are now recovering. Stand by for further updates. Thank you for your patience.
Sep 19, 2019 - 17:24 CEST
Identified
We are having trouble with an index cluster, temporarily preventing us from processing queries for a currently unknown fraction of datasets. We are all hands on deck working to get the cluster back up. There is no risk of data loss; only the indexes used for query processing are down.
Sep 19, 2019 - 15:37 CEST
Investigating
We are investigating.
Sep 19, 2019 - 14:30 CEST
This incident affected: api.sanity.io.