1. System downtime description:
From 09:56 to 10:15 UTC on March 10, OKEx detected abnormal behavior in Spot and Margin trading, affecting the web, app, and API servers.
After troubleshooting, we found that this abnormal behavior resulted from a latent bug that, once triggered, caused the internal services that the trading system relies on to stop working.
The timeline for this incident is detailed below:
09:56:00 UTC, our detection system discovered an abnormal condition in the system and issued an alarm message.
09:56:00 UTC, the API trading interface began returning the error code "30030" with the prompt "Matching engine is being upgraded. Please try in about 1 minute". Spot and margin trading services were suspended.
09:57:00 UTC, urgent repairs were initiated.
10:05:00 UTC, the OKEx engineering team determined that the failure was caused by the sudden unavailability of the internal service that supports the spot and margin trading systems.
10:07:00 UTC, the OKEx engineering team completed the preparations for resuming the services.
10:15:00 UTC, the spot and margin trading services returned to normal.
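For API users, the "30030" response in the timeline above suggests a simple client-side reaction: pause and retry instead of failing outright. The following is a minimal sketch, not OKEx's official client code. It assumes that the error code arrives in a field named "error_code", and the names is_engine_unavailable, submit_with_retry, and send_order are hypothetical placeholders for your own request logic. The 60-second default wait simply mirrors the "about 1 minute" wording of the prompt.

```python
import time
from typing import Any, Callable, Mapping

# Error code quoted in the incident timeline.
MATCHING_ENGINE_UPGRADING = "30030"


def is_engine_unavailable(response: Mapping[str, Any]) -> bool:
    """Return True if the response carries the 30030 error code.

    Assumption: the code is found under an "error_code" key; adjust this
    to match the response format you actually receive.
    """
    return str(response.get("error_code")) == MATCHING_ENGINE_UPGRADING


def submit_with_retry(
    send_order: Callable[[], Mapping[str, Any]],
    max_attempts: int = 5,
    wait_seconds: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Mapping[str, Any]:
    """Call send_order, retrying while the matching engine reports 30030.

    send_order is a hypothetical caller-supplied function that performs one
    request and returns the parsed response. The last response is returned
    when the attempts are exhausted, so the caller can inspect it.
    """
    response: Mapping[str, Any] = {}
    for attempt in range(1, max_attempts + 1):
        response = send_order()
        if not is_engine_unavailable(response):
            return response
        if attempt < max_attempts:
            # Wait before the next attempt; no wait after the final one.
            sleep(wait_seconds)
    return response
```

Capping the number of attempts keeps a client from retrying indefinitely during a longer outage. Before resubmitting an order, it is also worth confirming that the earlier request was not accepted, to avoid placing it twice.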
2. Why did this downtime happen?
OKEx provides 24/7 trading services and has been dedicated to making its trading system ultra-stable and smooth. However, given the complexity of a high-performance trading system and the unexpected abnormalities that can arise in it, we cannot guarantee that the system will work perfectly at all times. We continue to work hard to improve system stability and minimize the probability of downtime.
3. What work do we do to ensure the stability of the OKEx platform?
1). We are strengthening engineering quality assurance and optimizing the test system. Code for new functions can be launched only after it has run stably for a period of time in demo trading.
2). We are upgrading our architecture. High availability across multiple servers in various regions is being implemented to reduce downtime caused by hardware and software problems.
3). Hot upgrades will be implemented in a stateless way, which reduces the impact of upgrades on user transactions.
4. How do we optimize the process of fault repair?
(1) Once we detect a failure, we will immediately publish a failure notification on the Status page.
(2) If a system upgrade is scheduled, we will publish a notification on the Status page and notify users via market and community channels (API user community + regular user community). API users can also be notified of updates by subscribing to the System/Status channel.