1. System downtime description:
At 9:37 am UTC on Jan. 29, OKEx detected abnormal behavior in website access and trading features. Troubleshooting revealed that these abnormalities resulted from a traffic overload and the consequent bandwidth shortage in the cache system, which affected the web, app, and API servers.
OKEx engineers responded immediately and restored all functions by 10:18 am UTC on Jan. 29. The timeline of the incident is detailed below:
9:37 am UTC: OKEx detected abnormal behavior. Market information and trading depth data were not displayed on the web and app interfaces, while the API intermittently returned error code “30012” with an “Invalid Authority” prompt.
9:40 am UTC: OKEx engineers discovered that the failures were caused by a traffic overload, which resulted in a bandwidth shortage in the cache system and internal service call timeouts. Urgent repairs were initiated.
9:58 am UTC: Market information and trading depth data were displayed normally again, and trading features were fully restored.
10:05 am UTC: Due to the internal service call timeouts, event processing in the perpetual swap API service was blocked, causing interface requests to time out.
10:18 am UTC: The API trading feature for perpetual swaps was restored.
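During incidents like this, API clients may intermittently receive error code “30012” (“Invalid Authority”) even when their credentials are valid. Below is a minimal sketch of defensive client-side handling; treating 30012 as transient, the `request` callable, and the response shape are illustrative assumptions for this sketch, not official OKEx guidance:

```python
import time

# Error code observed during the incident; treating it as transient and
# retryable is an assumption for this sketch, not official OKEx guidance.
TRANSIENT_ERROR_CODES = {"30012"}  # "Invalid Authority", seen intermittently

def call_with_retry(request, max_attempts=3, base_delay=1.0):
    """Retry an API call when it returns a known transient error code.

    `request` is any callable returning a dict with a "code" key
    (a hypothetical response shape used here for illustration).
    """
    for attempt in range(1, max_attempts + 1):
        response = request()
        code = str(response.get("code", "0"))
        if code not in TRANSIENT_ERROR_CODES:
            return response
        if attempt < max_attempts:
            # Exponential backoff before retrying.
            time.sleep(base_delay * 2 ** (attempt - 1))
    return response

# Simulated responses: two transient failures, then success.
responses = iter([
    {"code": "30012", "msg": "Invalid Authority"},
    {"code": "30012", "msg": "Invalid Authority"},
    {"code": "0", "msg": "ok"},
])
result = call_with_retry(lambda: next(responses), base_delay=0.01)
print(result["code"])  # → 0
```

A bounded retry with backoff avoids hammering an already overloaded service while still recovering automatically once the platform stabilizes.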
2. Why did this downtime happen?
OKEx provides 24/7 trading services and is dedicated to keeping its trading system stable and smooth. However, given the complexity of a high-performance trading system and the unexpected abnormalities it can encounter, we cannot guarantee that the system will work perfectly at all times. We continue to work hard to improve system stability and minimize the probability of downtime in every respect.
3. What do we do to ensure the stability of the OKEx platform?
1). We strengthen engineering quality assurance and optimize the test system. Code for new features is launched only after it has run stably in demo trading for a period of time.
2). We are upgrading our architecture to achieve high availability across multiple servers in various regions, reducing downtime caused by hardware and software problems.
3). We are implementing stateless hot upgrades, which reduce the impact of upgrades on user transactions.
4. How do we optimize the process of fault repair?
When a failure occurs, we immediately post maintenance notifications on the Status page and notify users promptly through market and community channels (the API user community and the regular user community). Meanwhile, API users can receive updates by subscribing to the System/Status channel.
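For API users, subscribing to a status channel typically means sending a JSON subscribe payload over a WebSocket connection. The sketch below only builds that payload with the standard library; the `{"op": "subscribe", "args": [...]}` shape follows the general OKEx WebSocket convention, and the `system/status` channel name is an assumption inferred from the wording above, so the official API documentation should be consulted for the exact identifier:

```python
import json

# Hypothetical channel name based on the "System/Status" wording above;
# check the official API docs for the exact identifier.
STATUS_CHANNEL = "system/status"

def build_subscribe_message(channels):
    """Build a WebSocket subscribe payload in the common
    {"op": "subscribe", "args": [...]} shape used by OKEx's API."""
    return json.dumps({"op": "subscribe", "args": list(channels)})

# The JSON text a client would send once its WebSocket connection is open.
message = build_subscribe_message([STATUS_CHANNEL])
print(message)
```

Once subscribed, the client receives push updates on the channel whenever the platform's status changes, instead of polling the Status page.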