Saturday, April 02, 2022

Design a Health Code System That Will Never Crash

Today, when the epidemic is not over, the health code on the mobile phone provides great convenience for quickly identifying risk groups. However, once the back-end system fails to provide query services in a timely manner, users will face the embarrassment of being unable to prove their innocence. So, is it possible to provide reliable services with limited resources?

This article attempts to design a low-cost (average daily cost of $0.15) and a health code service system that will not collapse by using modern information technology.

Disclaimer: All analyses in this article, such as virtual machine prices, are based on public information on the Internet, and the actual situation may be different.

If you want to know how to optimally and quickly detect the new coronavirus, you can refer to "Optimal Grouping Algorithm for Covid-19 Detection".

Demand analysis

First analyze the functional requirements, the system must be able to achieve at least the following two basic operations:

  • Query: Given an ID number, it can return whether there is a risk in time (such as red and green codes, more complex ones can further return the risk value or location data, which is ignored here);
  • Updates: Background data can be updated, including user transitions between health and risk status.

In terms of performance requirements, if the service is provided in a province, it needs to support the number of users at the level of 100 million; if it is in a city, the number of users at the level of 10 million is sufficient.

It is assumed here that the province is the unit to support the daily queries of 100 million users. If a user makes an average of 10 queries per day (8 am to 8 pm), the throughput rate is about 24 thousand queries/second (100,000,000*10/12/3600). Assuming the peak is 10 times the average, the maximum throughput the system needs to support is 240 kbps.

The user ID number is 18 bits, and it takes 36 bytes to store in a string, and 8 bytes to store in a number. It is almost negligible relative to the network protocol header.

Assuming that a single request message is 200 bytes, the peak access bandwidth is 2402008 Kbps = 384 Mbps, and a conservative estimate of 400 Mbps bandwidth can be satisfied.

Analysis of the characteristics of the problem

The hardest part of the problem is how to support high-rate queries. Considering that under normal circumstances, the ratio of riskers is small (otherwise there is no need for health code services), this is a typical sparse data query problem.

For similar problems, the usual idea is to store only sparse data (risk person ID number), and if the query cannot be found, it means non-risk person.

Assuming that the risk ratio is 1%, the actual stored data is 1 million ID numbers (about 36 MB in string storage, which is easy to put into memory), which is not a large-scale dataset.

Initial design



Simplest design, storing all risker data directly to the database. Select common open source to implement MySQL, which is easy to implement thousands of queries per second on a single machine. After applying common optimization methods (cache, index, memory engine, etc.), it can achieve more than 10,000 queries per second on a single machine.

A rough estimate is that the overall system requires 24 servers. Taking the rental of public cloud virtual machines as an example, the average daily price of a 4-core 8G virtual machine is less than 1 yuan, and the total price of 24 virtual machines is 24 yuan. 400 Mbps bandwidth costs about 1000 RMB per day.

Overall, the average daily cost is 1024 yuan ($160).

Considering that this service can support hundreds of millions of users, this result is already very good. However, technology workers will not give up if they do not reach the limit. Let us see if we can further optimize it.

Optimization Solution 1: replace the database

First of all, since the query only needs to confirm whether a certain data exists, and the data volume is not large, it has a high tolerance for read and write consistency, so consider replacing it with a NoSQL database such as Redis.

The queries of NoSQL databases on ordinary servers can reach 100 thousand times per second, successfully reducing the number of servers to 3.

The average daily cost of this program is 1003 yuan.

Optimization Solution 2: Use Bloom filter

Bloom Filter is a classic data structure for querying whether an element exists in a collection. It has a case of misjudgment, that is, if the judgment exists, it may not actually exist; but if the judgment does not exist, it must not exist.

This feature is very suitable for judging the situation of risk groups. If the Bloom filter thinks someone is risk-free (in most cases), it must be risk-free; if it is considered risky, it can be confirmed through the database.

When one million pieces of data are stored and the false positive rate is 3%, the Bloom filter only needs a 1 MB memory data structure, and the query performance can easily reach one million times. In fact, many high-performance databases also use this data structure (such as Redis) to improve performance.

At this point, only 1 server is required to handle all queries easily.

The average daily cost is 1001 yuan.

Optimization Solution 3: preset filter

 Readers may find that the number of servers has been optimized from 24 to 1, but the overall cost has barely dropped because bandwidth costs account for most of the cost. So, is there room for optimization of network bandwidth?

The most direct idea is to simply download the filter to the client locally. In this way, the vast majority of risk-free users can obtain results through local queries, and only a few risk users and misjudged users need to be sent to the server for processing.

It is estimated that the bandwidth requirement has been reduced to 1% of the original, and the average daily cost has been reduced to 5 yuan.

Optimization Solution 4: use content distribution network


 Careful readers may think of another problem. When the preset filter on the client needs to be updated, it will still occupy a lot of network bandwidth. This is where the Content Delivery Network (CDN) comes in.

The content distribution network will deploy multiple edge servers in different geographical locations in advance, and cache the data to be distributed on the edge servers in advance, and the bandwidth cost is lower than that of direct distribution from the core network.

In this way, even if the client needs to update the data every day, it will not be a big problem.

What if you don't even want to pay for this?

The Ultimate Solution: Distributed Network Applications


 The essence of the network is nothing more than to transmit data. If clients can directly send data to each other, wouldn't the distribution pressure on the server be greatly reduced? This is exactly what P2P networks are good at.

By allowing the client to support P2P network distribution, the problem of client filter update is perfectly solved.

Further, the server can distribute both the filter and complete data to the client through the P2P network, and the client can query it locally. Exchange storage for computing, and even if the server goes down or the network fails, it will not affect the display of the health code.

More lazy, if you don't even want to develop a client that can do P2P networking, then just encrypt the data and put it in the existing P2P network (IPFS can understand it). The client can download and decrypt the data at any time before updating.

In this way, the server does not even need to provide query services, and the average daily cost is reduced to 1 yuan ($0.15).

Summarize

In conclusion, this article shows the step-by-step evolution of designing a high-performance concurrent query system: by selecting the appropriate database and system optimization, the number of required servers is reduced to 10%. Use Bloom filter to successfully implement a single machine to support massive query requests. The server pressure was reduced by another 1% through the client-side preset filter. Realize rapid update of client data through content distribution network. Combined with P2P technology, the scalability and reliability of the system are greatly improved, and the average daily cost is finally reduced to 1 yuan.

Of course, this article only analyzes possible design schemes from a theoretical point of view, and there must be a lot of work to be done between system design and implementation. In actual production, at least the following aspects need to be considered:

  • Load balancing: It is assumed that load balancing is performed by presetting the server address on the client.
  • Security precautions: It is assumed that the data center where the server is located provides a basic security precaution system.
  • Privacy protection: It is assumed that the client can better protect user privacy through encryption.
  • ......

As the primary productive force, isn't technology just to make everyone's life better and easier?

No comments:

Post a Comment