Thursday, April 07, 2022

From Web 1.0 to Web 3.0

 The Internet ecology has undergone a transformation from web 1.0 to web 2.0 in the past three decades. Where is the future of web 3.0, it is worth thinking about and exploring.

web 1.0

In the early 1990s, after the invention of the HTTP protocol, various websites sprang up one after another. In the early days, such as AOL and Yahoo, they all created amazing growth miracles.

The characteristics of these websites are that the owner is responsible for providing the content, and the user can only read the content and use the service. In other words, the classic single-production-many-consumption model.

The model of web 1.0 makes the right to speak on the World Wide Web in the hands of a few website service providers, a few decide the mainstream voice, and the rest are the silent majority.

web 2.0

web 1.0 spawned the booming .com bubble. Ten years later, in the fall of 2001, Internet stocks crashed, marking the beginning of the 1.0 model's gradual decline.

People began to discuss a new generation of web models, and after several years of exploration, the 2.0 model gradually became popular. This model begins to support user interaction, that is, users can create and decide content.

In the early days of web 2.0, blogs were used as a typical application, which represented a transition from a traditional user consumption model to a user who could become a content producer at the same time. Further, in the later period, social networking sites around social networking were born, and twitter, facebook, etc. became famous for a while. This pattern continues to this day.

While users can interact and create content, the platform remains firmly in the hands of a handful of internet giants, meaning they can easily channel and control so-called "mainstream" voices.

web 3.0

With the emphasis on privacy protection and the awareness of personal data rights, users are beginning to be dissatisfied with the existing network ecosystem, hoping to get their own interests back through a new generation of network models.

Web 3.0 is pinned on this kind of good hope. At present, the concept of web 3.0 is still in development, and the early exploration mainly hopes to combine distributed ledger technology to allow users to control their own identity information while limiting the sharing of data. Decentralized finance, data storage and trading applications have emerged.

Summarize

From the early web 1.0 to the web 3.0 that is still under discussion, it can be seen that the user's demand for interactive participation is constantly increasing, and at the same time, they hope to have their own control over the data on the network. This will be a huge challenge to the existing Internet ecosystem, and it may take ten or even decades to evolve.

Friday, April 01, 2022

Design a Health Code System That Will Never Crash

Today, when the epidemic is not over, the health code on the mobile phone provides great convenience for quickly identifying risk groups. However, once the back-end system fails to provide query services in a timely manner, users will face the embarrassment of being unable to prove their innocence. So, is it possible to provide reliable services with limited resources?

This article attempts to design a low-cost (average daily cost of $0.15) and a health code service system that will not collapse by using modern information technology.

Disclaimer: All analyses in this article, such as virtual machine prices, are based on public information on the Internet, and the actual situation may be different.

If you want to know how to optimally and quickly detect the new coronavirus, you can refer to "Optimal Grouping Algorithm for Covid-19 Detection".

Demand analysis

First analyze the functional requirements, the system must be able to achieve at least the following two basic operations:

  • Query: Given an ID number, it can return whether there is a risk in time (such as red and green codes, more complex ones can further return the risk value or location data, which is ignored here);
  • Updates: Background data can be updated, including user transitions between health and risk status.

In terms of performance requirements, if the service is provided in a province, it needs to support the number of users at the level of 100 million; if it is in a city, the number of users at the level of 10 million is sufficient.

It is assumed here that the province is the unit to support the daily queries of 100 million users. If a user makes an average of 10 queries per day (8 am to 8 pm), the throughput rate is about 24 thousand queries/second (100,000,000*10/12/3600). Assuming the peak is 10 times the average, the maximum throughput the system needs to support is 240 kbps.

The user ID number is 18 bits, and it takes 36 bytes to store in a string, and 8 bytes to store in a number. It is almost negligible relative to the network protocol header.

Assuming that a single request message is 200 bytes, the peak access bandwidth is 2402008 Kbps = 384 Mbps, and a conservative estimate of 400 Mbps bandwidth can be satisfied.

Analysis of the characteristics of the problem

The hardest part of the problem is how to support high-rate queries. Considering that under normal circumstances, the ratio of riskers is small (otherwise there is no need for health code services), this is a typical sparse data query problem.

For similar problems, the usual idea is to store only sparse data (risk person ID number), and if the query cannot be found, it means non-risk person.

Assuming that the risk ratio is 1%, the actual stored data is 1 million ID numbers (about 36 MB in string storage, which is easy to put into memory), which is not a large-scale dataset.

Initial design



Simplest design, storing all risker data directly to the database. Select common open source to implement MySQL, which is easy to implement thousands of queries per second on a single machine. After applying common optimization methods (cache, index, memory engine, etc.), it can achieve more than 10,000 queries per second on a single machine.

A rough estimate is that the overall system requires 24 servers. Taking the rental of public cloud virtual machines as an example, the average daily price of a 4-core 8G virtual machine is less than 1 yuan, and the total price of 24 virtual machines is 24 yuan. 400 Mbps bandwidth costs about 1000 RMB per day.

Overall, the average daily cost is 1024 yuan ($160).

Considering that this service can support hundreds of millions of users, this result is already very good. However, technology workers will not give up if they do not reach the limit. Let us see if we can further optimize it.

Optimization Solution 1: replace the database

First of all, since the query only needs to confirm whether a certain data exists, and the data volume is not large, it has a high tolerance for read and write consistency, so consider replacing it with a NoSQL database such as Redis.

The queries of NoSQL databases on ordinary servers can reach 100 thousand times per second, successfully reducing the number of servers to 3.

The average daily cost of this program is 1003 yuan.

Optimization Solution 2: Use Bloom filter

Bloom Filter is a classic data structure for querying whether an element exists in a collection. It has a case of misjudgment, that is, if the judgment exists, it may not actually exist; but if the judgment does not exist, it must not exist.

This feature is very suitable for judging the situation of risk groups. If the Bloom filter thinks someone is risk-free (in most cases), it must be risk-free; if it is considered risky, it can be confirmed through the database.

When one million pieces of data are stored and the false positive rate is 3%, the Bloom filter only needs a 1 MB memory data structure, and the query performance can easily reach one million times. In fact, many high-performance databases also use this data structure (such as Redis) to improve performance.

At this point, only 1 server is required to handle all queries easily.

The average daily cost is 1001 yuan.

Optimization Solution 3: preset filter

 Readers may find that the number of servers has been optimized from 24 to 1, but the overall cost has barely dropped because bandwidth costs account for most of the cost. So, is there room for optimization of network bandwidth?

The most direct idea is to simply download the filter to the client locally. In this way, the vast majority of risk-free users can obtain results through local queries, and only a few risk users and misjudged users need to be sent to the server for processing.

It is estimated that the bandwidth requirement has been reduced to 1% of the original, and the average daily cost has been reduced to 5 yuan.

Optimization Solution 4: use content distribution network


 Careful readers may think of another problem. When the preset filter on the client needs to be updated, it will still occupy a lot of network bandwidth. This is where the Content Delivery Network (CDN) comes in.

The content distribution network will deploy multiple edge servers in different geographical locations in advance, and cache the data to be distributed on the edge servers in advance, and the bandwidth cost is lower than that of direct distribution from the core network.

In this way, even if the client needs to update the data every day, it will not be a big problem.

What if you don't even want to pay for this?

The Ultimate Solution: Distributed Network Applications


 The essence of the network is nothing more than to transmit data. If clients can directly send data to each other, wouldn't the distribution pressure on the server be greatly reduced? This is exactly what P2P networks are good at.

By allowing the client to support P2P network distribution, the problem of client filter update is perfectly solved.

Further, the server can distribute both the filter and complete data to the client through the P2P network, and the client can query it locally. Exchange storage for computing, and even if the server goes down or the network fails, it will not affect the display of the health code.

More lazy, if you don't even want to develop a client that can do P2P networking, then just encrypt the data and put it in the existing P2P network (IPFS can understand it). The client can download and decrypt the data at any time before updating.

In this way, the server does not even need to provide query services, and the average daily cost is reduced to 1 yuan ($0.15).

Summarize

In conclusion, this article shows the step-by-step evolution of designing a high-performance concurrent query system: by selecting the appropriate database and system optimization, the number of required servers is reduced to 10%. Use Bloom filter to successfully implement a single machine to support massive query requests. The server pressure was reduced by another 1% through the client-side preset filter. Realize rapid update of client data through content distribution network. Combined with P2P technology, the scalability and reliability of the system are greatly improved, and the average daily cost is finally reduced to 1 yuan.

Of course, this article only analyzes possible design schemes from a theoretical point of view, and there must be a lot of work to be done between system design and implementation. In actual production, at least the following aspects need to be considered:

  • Load balancing: It is assumed that load balancing is performed by presetting the server address on the client.
  • Security precautions: It is assumed that the data center where the server is located provides a basic security precaution system.
  • Privacy protection: It is assumed that the client can better protect user privacy through encryption.
  • ......

As the primary productive force, isn't technology just to make everyone's life better and easier?

Optimal Grouping Algorithm for COVID-19 Detection

 In response to the outbreak, countries around the world need to test potentially infected people. Due to the relative shortage of detection reagents, how to use as few reagents as possible for detection has become an interesting problem. It is assumed here that the sampling volume is sufficient, and the detection time requirement is not considered.

At present, many countries use the group testing mechanism, that is, the groups to be tested are divided into groups, and the samples in the same group are mixed for testing. If a group is positive, the group is tested further.

A natural question: how to group to save reagents?

Consider first the two extreme cases.

A small number of infected people

Usually, infected people within a group are a minority.

Suppose there are 64 people in the group, and there is only 1 infected person, using the 2, 4, 8, 64 equal division method.

  • 2-point method requires 6 rounds of testing, a total of 2*6=12 times;
  • 4-point method requires 3 rounds to be tested, a total of 4*3=12 times;
  • 8-point method requires 2 rounds to be tested, a total of 8*2=16 times;
  • 64-point method requires 1 round of tests, 64 times in total.

In fact, when there are a large number of people, such as N people, the x-division method is used, and the total number of tests is about y=x*log(x, N). It is easy to know that the function takes a minimum value when x=e.

In this case, an x ​​value near e should be taken, such as 2 or 3, to save reagents. At the same time, the number of people N is best chosen to be a value that divides x.

A large number of infected people

In the second scenario, all 64 people are infected, and the 2, 4, 8, and 64 division method is also used.

  • 2-points method requires 6 rounds to be tested, a total of 2+4+8+16+32+64 = 126 times;
  • 4-point method requires 3 rounds of testing, a total of 4+16+32+64=116 times;
  • 8-point method requires 2 rounds of testing, a total of 8+64=72 times.
  • 64-point method requires 1 round of tests, 64 times in total.

It is easy to find that when the 64-point method is used at this time, the total number of detections is the least.

N people, using the x-point method, the total number of tests is about S=x*\frac{1-x^log(x,N)}{1-x}, which is obviously a decreasing function.

In this case, a larger number of groups should be taken.

Promotion as a general case

If N people are known and the infection rate is p, how much should x be selected at this time?

This problem is a bit more complicated than the above two cases, we adopt the idea of ​​backward push.

First draw the detection route, which is a typical tree structure, as shown in the following figure.

Among them, there are p*N people who are infected (in the worst case they do not have a common parent node), then in the last round, it needs to be checked p*N*x times. Pushing back to the next level, it still needs to detect pNx times. Until the last sample to be detected has a common parent node, starting from this layer, each layer needs to be detected x times.

That is, p*N*x + p*N*x + … x + x + ….

In the worst case, infected samples are separated as far as possible from each other until the second layer, where all samples to be detected have a common parent node. At this time, the total number of detections is S=p*N*x*(log(x, N)-1)+x. At this time, the optimal x value is around 3.


Usually, this formula is accurate when the infection rate p is not high; when p is high, the total number of detections predicted by the formula is higher than the actual value because multiple infected people at the bottom have a common parent node. Therefore, the consensus obtained is the upper limit of the required number of reagents.

Applied to real life

Assuming that the resident population of a city is N=20000000 (20 million), and the infection rate is p=0.0001 (one ten thousandth), then insert the formula to get

S=2000*x*(log(x,20000000)-1)+x, it is easy to find that when x=2.91 (which can be rounded to 3), the minimum value of S is 85782, and the number of detection rounds is 16 rounds.

Therefore, if the optimal detection algorithm is adopted, the number of reagents used in the city accounts for only 0.43% of the overall population to be tested, which can save a lot of reagents. The cost is the need for multiple rounds of testing.

In practice, only the first few rounds can be grouped to achieve a balance between the optimal number of reagents and the detection time. For example, the detection must be completed within a given time, and the number of detection packets at each layer can be adjusted at this time.

ps, interestingly, no matter what value p takes, the final optimal x will never exceed 4.

Wednesday, November 06, 2019

三个值得深思的小故事

第一个故事是耳熟能详的《扁鹊见蔡桓公》,大意就是扁鹊多次建议蔡桓公预防疾病,桓公不听啊,最终病入膏肓,无药可治。
长久以来,大家都在批评桓公“讳疾忌医”。其实不然。桓公在位时曾创建稷下学宫,大举招纳贤士,乃是一位有作为的明君,怎会如此愚笨。
故事的关键在于桓公著名的一句话:“医之好(hào)治不病以为功“。意思是,医生都喜欢治疗没有病的健康人——这样啥都不干还作为功劳。你看看,这哪里是目光短浅,这分明目光如炬,桓公麾下的大臣是不容易混日子的。当时的齐国”人才荟萃,彬彬大盛“。
但是很可惜,一代明主还是英年(44 岁)早逝了。
问题出在哪里?扁鹊的三阶段论说的好:病症前期,汤药即可;病症中期,需要针灸手术;到了晚期,无药可救。再直白一点,治病这样专业的事情啊,并不是说干的越多就效果越好。你一个外行非不听专业意见,自然没有好果子吃。
联想到有一些医生门庭若市,他的病人隔三差五就要来开药治疗;有的医生门庭冷清,因为他的病人按照他的建议吃一两副药就病好了,常年健康,自然无需看病。
嗯,扯远了,接着说第二个故事:有两个县令,都被派到偏远山区工作。
其中一个发现当地治安很差。为了保卫人民财产安全,他积极增加差役,加强巡逻排查,自己也呕心沥血连轴转。年终上京述职自然是成果满满:抓获小偷若干,破获重大案件若干,获得老百姓送的锦旗若干……
另一个县令也发现治安很差。他没有急着增加差役,而是采取了一系列有利商业发展和民生的措施。年终述职却发现无料可写,因为辖地安居乐业,已经压根没有人去做小偷了。
如果你是这两个县令的上司,你会怎么评价他们呢?
第三个故事是说某地有家大互联网公司。有个经理发现自己部门的员工们每天都按点下班,毫无紧张气氛。于是决定要提高下工作效率,每天要求员工汇报工作内容,每月评比,辞退排名垫底的员工。
经过一段时间,部门果然有了变化。大家都加班加点紧张工作。为了产生工作量,把一些无关紧要的产品特性改来改去;新产品带着 bug 仓促上线,等客户抱怨再来作为新的工作量去修正。经理很满意报表上显示人均工作时间提升了一倍。结果到了年底,经理被辞退了。
其实,产品设计也是类似道理。优秀的产品不应该让用户感觉到有太多的缺点,也不应该让用户感觉到太多的优点。因为它就是按照最自然的方式来工作的,用户不会因为某个特性而感到惊喜或者迷惑。好的产品会帮助用户完成工作,却无需留意到它到底是怎么工作的。
这样的产品上线后,压根就不会有人报 ticket。需求早已被满足了,没有故障,又有谁会去开 ticket 呢?
但是,当公司的考核是比拼加班和工作量的时候,哪个团队又会愿意开发这样的产品出来呢?

Thursday, October 31, 2019

春风又绿江南岸

克罗奇(意大利史学家、哲学家)曾说:“一切历史都是当代史”。
普通人很难拥有远见,因为百年生命太过短暂;难以全面看待问题,因为所学所知毕竟片面;难以深入洞察,因为缺乏逻辑和推理训练(理工科更应该成为普适教育)。
第一个开启心智的猿人,不论它当时所思所想,估计是满心的欢喜、恐慌,以及悲伤和无奈吧。
大部分人会以为科技与宗教是两码事,然而科技其实正是新时代的宗教。
它提高了生产力,重构了社会的结构,启发和引导了文明进步的新方向……与此同时,它也导致了巨大的贫富差距,创造出一个又一个泡沫。
也许泡沫破灭时,可以推动文明的小船继续往前漂动,以弥补进化意义上的缺陷。
天道不仁。唯其不仁,方能囊括一切。方能亘古久远、连绵不绝。
微观处的不对称性,并不会对宏观系统的运转产生即时的影响,因为复杂系统自身往往具有很强的鲁棒性。但千千万万的微观不对称长期积累起来,最终会由量变引发质变,推动系统过渡到新的平衡。经济学上把这个过程称为经济危机。无论奥派还是新旧芝派,对此都只能聪明地采取鸵鸟的做法。
是我们引领了时代,还是时代创造了我们?