1. System downtime description:
On March 26, OKEx spot and margin trading services were suspended twice, at 5:50 am UTC and at 1:03 pm UTC, due to system downtime that affected the web, app and API servers.
After troubleshooting, we discovered the cause of the downtime:
For the first downtime, which occurred at 5:50 am UTC, OKEx consulted technical experts from our cloud server provider and found a system abnormality in the cloud server that hosts our spot and margin trading system. This abnormality caused the internal services that the trading system relies on to stop working.
For the second downtime, at 1:03 pm UTC, OKEx detected that the same cloud server was down again, which caused the second suspension of the spot and margin trading service.
Timeline of first downtime:
At 5:50 am UTC, our detection system discovered an abnormal condition in the system and issued a notification.
Also at 5:50 am UTC, the API trading interface was returning error code "30030" with a prompt that read, "Matching engine is being upgraded. Please try in about 1 minute." At this time, the spot and margin trading service had been suspended.
At 5:51 am UTC, urgent repairs were initiated.
By 6:05 am UTC, the OKEx engineering team had identified an abnormality in the cloud server as the likely cause of the system issue and immediately contacted the server provider to fix it.
At 6:30 am UTC, the OKEx engineering team started working with technical experts from the server provider to troubleshoot and fix the issue, and confirmed that the cloud server abnormality had caused it.
By 6:36 am UTC, the spot and margin trading service was back to normal.
Timeline of second downtime:
At 1:03 pm UTC, our detection system discovered an abnormal condition in the system and issued a notification.
Also at 1:03 pm UTC, the API trading interface was returning error code "30030" with a prompt that read, "Matching engine is being upgraded. Please try in about 1 minute." At this time, the spot and margin trading service had been suspended.
At 1:04 pm UTC, urgent repairs were initiated.
By 1:10 pm UTC, the OKEx engineering team had determined that the cloud server abnormality had again caused the system issue.
At 1:13 pm UTC, the cloud server provider performed a server upgrade to fix the abnormality.
By 1:25 pm UTC, the spot and margin trading service was back to normal.
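For API users, the error code "30030" seen in both timelines signals a temporary condition, and its prompt suggests trying again in about one minute. The following is a minimal client-side sketch of that idea in Python, not an official OKEx client. The names `ApiError` and `call_with_retry`, the retry limit, and the fixed 60-second wait are all assumptions made for illustration; only the error code and the "about 1 minute" hint come from this report.

```python
import time

# Error code quoted in this report for "Matching engine is being upgraded."
MATCHING_ENGINE_UPGRADING = "30030"


class ApiError(Exception):
    """Hypothetical exception carrying the error code returned by the API."""

    def __init__(self, code, message):
        super().__init__(f"{code}: {message}")
        self.code = str(code)


def call_with_retry(request, max_attempts=5, wait_seconds=60.0, sleep=time.sleep):
    """Call request(); on error 30030, wait and retry up to max_attempts times.

    `request` is any zero-argument callable that performs one API call and
    raises ApiError on failure (an assumption of this sketch). Any other
    error code, or a 30030 on the final attempt, is re-raised to the caller.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return request()
        except ApiError as err:
            if err.code != MATCHING_ENGINE_UPGRADING or attempt == max_attempts:
                raise
            # The prompt suggests retrying in about one minute.
            sleep(wait_seconds)
```

Passing `sleep` as a parameter keeps the helper easy to test without real waiting. A production client would probably also cap the total wait and check the Status page before resuming order placement.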
2. Why did this downtime happen?
OKEx provides 24/7 trading services and is dedicated to keeping its trading system stable and smooth. However, a high-performance trading system is complex and can encounter unexpected abnormalities, such as the cloud server faults described above, so we cannot guarantee that it will work perfectly at all times. We continue to work on improving system stability and minimizing the probability of downtime.
3. What work do we do to ensure the stability of the OKEx platform?
1). We are strengthening engineering quality assurance and optimizing the test system. Code for new features can be launched only after it has run stably for a period of time in demo trading.
2). We are upgrading our architecture. We are implementing high availability across multiple servers in various regions, which will reduce downtime caused by hardware and software problems.
3). We will implement hot upgrades in a stateless way, which reduces the impact of upgrades on user transactions.
4. How do we optimize the process of fault repair?
(1) Once we detect a failure, we will immediately publish a failure notification on the Status page.
(2) If a system upgrade is scheduled, we will publish a notification on the Status page and notify users via market and community channels (API user community + regular user community). API users can also receive these updates by subscribing to the System/Status channel.
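As an illustration of how an API user might act on such notifications, the sketch below checks a status notice and decides whether to pause order placement. It is a hedged example only: the JSON fields `service` and `state`, their values, and the function name `is_trading_paused` are hypothetical and are not taken from the OKEx API documentation, which should be consulted for the real message format of the System/Status channel.

```python
import json


def is_trading_paused(raw_message: str, service: str = "spot") -> bool:
    """Return True if a status notice says `service` is under maintenance.

    Hypothetical message schema assumed by this sketch:
        {"service": "spot", "state": "scheduled" | "ongoing" | "completed"}
    """
    notice = json.loads(raw_message)
    return notice.get("service") == service and notice.get("state") == "ongoing"
```

A trading bot could call this on each incoming notice and stop submitting orders while it returns `True`, instead of discovering the outage through failed requests.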