Showing posts with label architecture. Show all posts
Showing posts with label architecture. Show all posts

Monday, August 15, 2022

Mastering Systematic Thinking

The whole is greater than the sum of its parts, and structure determines system behavior.

What is System Thinking?

System thinking is an approach to problem solving by thinking holistically.

Different from the simple way of thinking about the problem itself intuitively, systematic thinking often needs to observe the behavior, structure, and association of complex systems, summarize its internal laws from different levels, and understand its operation. Furthermore, the internal laws can be changed by adjusting the structure of the system to achieve the goal of changing the behavior of the system.

Focus on the whole, not the parts. Focus on connections, not things.

For example, seeing an apple falling to the ground, the intuitive way of thinking is that the apple will fall to the ground when it is ripe.

Systems thinking may need to consider:

What is the connection between apples, fruit trees and the ground?

What internal law causes the behavior of the apple falling to the ground?

What factors can be changed to prevent apples from falling to the ground?

......

Another example is to see inflation, intuitive thinking may think that it is due to additional currency issuance. New systems thinking takes into account the cyclical laws and distribution mechanisms of the economy.

Why master systems thinking?

The world itself is a complex system, and many problems in real life are dealing with complexity. For example, designing a bridge, building an assembly line, implementing an enterprise software, and so on.

Simple systems tend to be linear, i.e. 1+1=2, while complex systems are usually nonlinear, i.e. 1+1>2.

Usually, due to the limitations of knowledge, cognition, and way of thinking, it is difficult for humans to intuitively see the whole of things.

Also, it's hard to understand it directly for most complex objects.

These all require dissection and thinking using systems thinking, which can help us analyze problems more comprehensively.

How to Master System Thinking

Systematic thinking can be mastered through training, which mainly includes the following steps:

First, to observe the dynamic behavior of the system, including system events, behavior characteristics, and summarize the behavior rules of the system.

Afterwards, its possible internal structure is predicted through behavioral laws.

Further, the divide-and-conquer system is divided into multiple small-scale simple systems according to the structure.

To verify that the predicted structures are accurate, prototypes can be built to conduct experiments. Make adjustments through experimental feedback.

Thursday, April 21, 2022

Latest Progress in the Fed's Digital Currency

Since 2016, Central Bank Digital Currency (CBDC) has gradually become an important subject of research and development experiments by central banks around the world. In terms of application scenarios, general CBDC is oriented to retail, online shopping, personal payment, etc., basically corresponding to cash scenarios, and is the main research direction at present. In addition, there are CBDCs such as financial institutions’ reserves.

The Federal Reserve has been cautious in its exploration of digital currencies, and has suppressed Facebook's Libra project. But it has been conducting exploration and research itself, mainly including its financial laboratory and the "Hamilton" project that authorized its Boston branch.

Note: The project name honors two people: Alexander Hamilton, the first U.S. Treasury Secretary and founder of the financial system. Margaret Hamilton, director of software engineering at the MIT Instrumentation Laboratory, was involved in software development for the Apollo program.

Hamilton Project

The "Hamilton" project is an exploratory research project by the Federal Reserve Boston Branch and the MIT Monetary Research Center.

The project is divided into two phases:

  • Phase 1: Solve core issues such as high performance, reliable transactions, scalability, and privacy protection. Target 100,000 TPS, second-level confirmation, multi-region fault tolerance.
  • Phase 2: Solve key issues such as auditable, programmable contracts, support for intermediary layers, attack prevention, and offline transactions.

After several years of hard work, the first phase was completed in February this year. The source code OpenCBDC was released in the form of open source software, mainly developed through C++, following the MIT open source license agreement, and the project address is mit-dci/opencbdc-tx.

Two kinds of engines were tested. The single-order node engine Atomizer (order-preserving) can reach a peak value of 170,000 TPS; the parallel execution engine 2PC (order-preserving) can reach 1.7 million TPS.

In terms of architecture, it is similar to other central bank digital currency systems, drawing on the technical characteristics of blockchain and cryptocurrencies.

  • A centralized transaction structure is adopted because the central bank can provide a strong premise of trust;
  • Transactions are verified by private key signature;
  • The user uses the currency through the wallet client;
  • Referring to the UTXO model, the spent currency will be destroyed, and then a new currency will be created;
  • Transaction verification and execution are decoupled, making it easier to expand.

The project is still at an early stage and the scenarios under consideration are very limited. The author believes that there is still a long way to go before it can be used on the ground.

Several major open issues at present:

  • How is identity verification implemented? This still depends on the public-private key mechanism, which can be accelerated by specific hardware.
  • How to monitor anti-money laundering? This may be handled offline in an extended manner.
  • How is the audit granularity of identity and transaction data achieved? The main purpose is to allow different roles to see different granularities. This can be achieved through data isolation and encryption mechanisms.
  • How is the currency issued? It can be directly exchanged to individuals, or it can be authorized by secondary commercial banks (the latter is adopted by the digital renminbi).
  • How to integrate with the existing financial system? You can go through the transaction gateway, or simply not get through first, and go separately.

Summary

In fact, objectively speaking, under the premise of a centralized architecture, it is not difficult to implement a high-throughput trading system by using the existing software and hardware system. The difficulty is to support complex financial services, multi-transaction associations, and scalability, while taking into account conflicting requirements such as compliance, auditability, and privacy protection. These often require a lot of hands-on experience.

Thursday, April 07, 2022

Design and Management of Digital Token

NIST’s February 8301 report, “Blockchain Networks: Token Design and Management Overview,” specifically addressed issues related to digital tokens. There are several interesting topics that are worth thinking about.

The focus of future architecture


According to the mainstream design, the architecture is divided into 5 layers from bottom to top:

  • Physical layer: Physical hardware. Including servers, network infrastructure, etc.;
  • Network layer: A network that supports peer-to-peer network communication. Such as the Internet, enterprise network, etc.;
  • Blockchain layer: The implementation of blockchain-related protocols. Including consensus, storage, smart contracts, etc.;
  • Integration layer: An integrated application based on blockchain smart contracts. Such as middleware, offline computing storage, etc.;
  • Interface Layer: User-facing interface. Such as data analysis, client applications, etc.

The bottom layer and the top layer are rich in research and practice, but the integration layer has strong definability and complex functions, and currently faces great challenges. In fact, in the traditional software architecture, the most important is the middleware layer. Only when the middle stability function is powerful can it effectively link the previous and the next.

At present, the industry does not have in-depth research on the middle layer, and there are not many problems in the traditional academic category. Only the open source community such as the Hyperledger community has begun some explorations on the basis of practice.

Token Definition and Classification

Explicitly defines a token as the data that the service can verify the data exchange with. It is classified into Blockchain-Based tokens (such as various NFTs and FTs) and Self-contained tokens (such as JWT).

The mainstream blockchain-based token implementation is the authentication data bound to the account. And accounts can be on-chain or off-chain (in client wallets). The implementation of Token can be native or contract-based; it can be a UTXO model or an account model; it can be split or non-split.

This actually implies that the digital currency will be taken out separately. After all, the digital currency has a greater significance in the currency, and the underlying realization is just the carrier. The current official digital currency is more about digitizing real currency than a certificate of circulation in the digital world that geeks envision.

Token exchange

Mainly through atomic swap (atomic swap). Unmanaged, the exchange is either fully implemented or will fall back to its original state. The implementation is mostly achieved through hash locks and time locks.

Atomic swaps can be within the same chain or across chains.

It relies heavily on the security of the contract.

Cross-chain support

Including two-way exchange or one-way transfer two scenarios. The implementation depends on the intermediate coordination system, or requires the support of contracts or agreements on both sides.

There is no practical solution for this yet, and further development is required.

Friday, April 01, 2022

Design a Health Code System That Will Never Crash

Today, when the epidemic is not over, the health code on the mobile phone provides great convenience for quickly identifying risk groups. However, once the back-end system fails to provide query services in a timely manner, users will face the embarrassment of being unable to prove their innocence. So, is it possible to provide reliable services with limited resources?

This article attempts to design a low-cost (average daily cost of $0.15) and a health code service system that will not collapse by using modern information technology.

Disclaimer: All analyses in this article, such as virtual machine prices, are based on public information on the Internet, and the actual situation may be different.

If you want to know how to optimally and quickly detect the new coronavirus, you can refer to "Optimal Grouping Algorithm for Covid-19 Detection".

Demand analysis

First analyze the functional requirements, the system must be able to achieve at least the following two basic operations:

  • Query: Given an ID number, it can return whether there is a risk in time (such as red and green codes, more complex ones can further return the risk value or location data, which is ignored here);
  • Updates: Background data can be updated, including user transitions between health and risk status.

In terms of performance requirements, if the service is provided in a province, it needs to support the number of users at the level of 100 million; if it is in a city, the number of users at the level of 10 million is sufficient.

It is assumed here that the province is the unit to support the daily queries of 100 million users. If a user makes an average of 10 queries per day (8 am to 8 pm), the throughput rate is about 24 thousand queries/second (100,000,000*10/12/3600). Assuming the peak is 10 times the average, the maximum throughput the system needs to support is 240 kbps.

The user ID number is 18 bits, and it takes 36 bytes to store in a string, and 8 bytes to store in a number. It is almost negligible relative to the network protocol header.

Assuming that a single request message is 200 bytes, the peak access bandwidth is 2402008 Kbps = 384 Mbps, and a conservative estimate of 400 Mbps bandwidth can be satisfied.

Analysis of the characteristics of the problem

The hardest part of the problem is how to support high-rate queries. Considering that under normal circumstances, the ratio of riskers is small (otherwise there is no need for health code services), this is a typical sparse data query problem.

For similar problems, the usual idea is to store only sparse data (risk person ID number), and if the query cannot be found, it means non-risk person.

Assuming that the risk ratio is 1%, the actual stored data is 1 million ID numbers (about 36 MB in string storage, which is easy to put into memory), which is not a large-scale dataset.

Initial design



Simplest design, storing all risker data directly to the database. Select common open source to implement MySQL, which is easy to implement thousands of queries per second on a single machine. After applying common optimization methods (cache, index, memory engine, etc.), it can achieve more than 10,000 queries per second on a single machine.

A rough estimate is that the overall system requires 24 servers. Taking the rental of public cloud virtual machines as an example, the average daily price of a 4-core 8G virtual machine is less than 1 yuan, and the total price of 24 virtual machines is 24 yuan. 400 Mbps bandwidth costs about 1000 RMB per day.

Overall, the average daily cost is 1024 yuan ($160).

Considering that this service can support hundreds of millions of users, this result is already very good. However, technology workers will not give up if they do not reach the limit. Let us see if we can further optimize it.

Optimization Solution 1: replace the database

First of all, since the query only needs to confirm whether a certain data exists, and the data volume is not large, it has a high tolerance for read and write consistency, so consider replacing it with a NoSQL database such as Redis.

The queries of NoSQL databases on ordinary servers can reach 100 thousand times per second, successfully reducing the number of servers to 3.

The average daily cost of this program is 1003 yuan.

Optimization Solution 2: Use Bloom filter

Bloom Filter is a classic data structure for querying whether an element exists in a collection. It has a case of misjudgment, that is, if the judgment exists, it may not actually exist; but if the judgment does not exist, it must not exist.

This feature is very suitable for judging the situation of risk groups. If the Bloom filter thinks someone is risk-free (in most cases), it must be risk-free; if it is considered risky, it can be confirmed through the database.

When one million pieces of data are stored and the false positive rate is 3%, the Bloom filter only needs a 1 MB memory data structure, and the query performance can easily reach one million times. In fact, many high-performance databases also use this data structure (such as Redis) to improve performance.

At this point, only 1 server is required to handle all queries easily.

The average daily cost is 1001 yuan.

Optimization Solution 3: preset filter

 Readers may find that the number of servers has been optimized from 24 to 1, but the overall cost has barely dropped because bandwidth costs account for most of the cost. So, is there room for optimization of network bandwidth?

The most direct idea is to simply download the filter to the client locally. In this way, the vast majority of risk-free users can obtain results through local queries, and only a few risk users and misjudged users need to be sent to the server for processing.

It is estimated that the bandwidth requirement has been reduced to 1% of the original, and the average daily cost has been reduced to 5 yuan.

Optimization Solution 4: use content distribution network


 Careful readers may think of another problem. When the preset filter on the client needs to be updated, it will still occupy a lot of network bandwidth. This is where the Content Delivery Network (CDN) comes in.

The content distribution network will deploy multiple edge servers in different geographical locations in advance, and cache the data to be distributed on the edge servers in advance, and the bandwidth cost is lower than that of direct distribution from the core network.

In this way, even if the client needs to update the data every day, it will not be a big problem.

What if you don't even want to pay for this?

The Ultimate Solution: Distributed Network Applications


 The essence of the network is nothing more than to transmit data. If clients can directly send data to each other, wouldn't the distribution pressure on the server be greatly reduced? This is exactly what P2P networks are good at.

By allowing the client to support P2P network distribution, the problem of client filter update is perfectly solved.

Further, the server can distribute both the filter and complete data to the client through the P2P network, and the client can query it locally. Exchange storage for computing, and even if the server goes down or the network fails, it will not affect the display of the health code.

More lazy, if you don't even want to develop a client that can do P2P networking, then just encrypt the data and put it in the existing P2P network (IPFS can understand it). The client can download and decrypt the data at any time before updating.

In this way, the server does not even need to provide query services, and the average daily cost is reduced to 1 yuan ($0.15).

Summarize

In conclusion, this article shows the step-by-step evolution of designing a high-performance concurrent query system: by selecting the appropriate database and system optimization, the number of required servers is reduced to 10%. Use Bloom filter to successfully implement a single machine to support massive query requests. The server pressure was reduced by another 1% through the client-side preset filter. Realize rapid update of client data through content distribution network. Combined with P2P technology, the scalability and reliability of the system are greatly improved, and the average daily cost is finally reduced to 1 yuan.

Of course, this article only analyzes possible design schemes from a theoretical point of view, and there must be a lot of work to be done between system design and implementation. In actual production, at least the following aspects need to be considered:

  • Load balancing: It is assumed that load balancing is performed by presetting the server address on the client.
  • Security precautions: It is assumed that the data center where the server is located provides a basic security precaution system.
  • Privacy protection: It is assumed that the client can better protect user privacy through encryption.
  • ......

As the primary productive force, isn't technology just to make everyone's life better and easier?

Sunday, December 27, 2015

集群负载均衡技术概述

集群负载均衡技术(Load Balancing)是目前互联网后端服务的关键技术,是互联网系统演化到现在这样巨大规模的基础。
客观地说,负载均衡是一个门槛相当不低的领域,已有技术主要包括硬件方案和软件方案。简单说,硬件方案性能好,但是昂贵;软件方案性能差,但是成本相对可控。
硬件方案代表为F5、Ctrix、A10、Redware 等 LB 厂商的产品,每年市场营收额高达百亿。
开源的软件实现方案也有不少,知名的包括 HAProxy、Nginx、LVS 等。
在实际大规模高性能的场景下,往往很难靠单一方案实现可靠的负载均衡服务。
很多 LB 厂家现在已经不再使用“负载均衡设备”这个术语了,改叫做应用分发(ADC、ADN)设备(甚至直接叫做 4-7 层交换机),但在这些设备中的最为基础和关键的技术仍然是对海量流量进行处理的负载均衡技术。
实际上,从客户端是否能感知或参与负载均衡的过程来区分,可以首先将负载均衡技术大概分为两大类:客户端可感知的负载均衡和客户端透明的负载均衡。当然,两类技术又可以进一步地结合起来使用。

客户端可感知负载均衡

顾名思义,客户端主动参与负载均衡过程,可以感知负载均衡存在。
典型的例子是一些客户端应用,需要动态获知服务地址。从这个意义上看,不少 P2P 应用都属于该范畴。
基本的过程可以为,客户端先向某个控制(Control)服务器发起请求,获取业务(Service)服务器地址(域名或者 IP 地址)的列表。获取列表成功后,本地按照某种均衡策略向这些业务服务器发出请求。
这样做的好处显而易见,可以降低对服务器端进行负载均衡的要求,甚至服务端可以不需要再进行负载均衡,简单资源池即可。
技术难点主要在于客户端和服务端之间的信息同步。比如业务服务器池调整后信息怎么及时更新到客户端;某些负载均衡策略需要获取实时服务器的状态信息,等等。
具体实现上,往往需要客户端和服务端进行一些配合。针对某些特定的业务场景,可以进行一些调整和优化,例如 Yahoo 就在某些互联网业务中使用了这种模式。
桌面时代的客户端往往受控性差,情况比较复杂,这种模式实现难度较大。
现在移动互联网发展起来后,客户端 APP 的可控性大大提高,比较看好这种模式的应用前景。

客户端透明负载均衡

客户端透明负载均衡实际上又包括很多层面的技术,这里都统归为一类。不同实现可能放在不同地方(运营商或者服务商等),但原理都是一致的,核心都是借用了“Overlay”的思想(计算机行业的所有问题,基本都可以通过引入新的层来实现)。
下面从客户端发出请求后的完成流程来看,可以在哪些步骤实现负载均衡。

DNS 层

首先,客户端会向某个业务服务器的域名发起请求,要先解析这个域名为具体的 IP 地址。
域名解析这块能做的事情很多,但大致上要么是运营商本地应答,要么运营商扔给服务商的域名服务器。
如果是前者(一般情况下)的话,运营商会通过本地 cache 进行查询,无命中则扔给上层域名服务或者服务商。
域名解析的过程直接决定了进行 LB 的效果。这里就有问题了,由于域名是由运营商进行维护答复的,一旦发生变化,服务商怎么去快速地调整呢,传统域名更新的方式可达数分钟甚至数小时的时延。
一种思路是跟运营商去谈,所有相关请求都服务商自己来处理或者怎么能合作进行快速更新;一种是将请求想办法绕开运营商,隧道给自己(又是一种 Overlay),腾讯的 HttpDNS 就是这个思路,域名解析不走传统 DNS 协议,而是通过 HTTP 请求来实现。
总结下,基于 DNS 的 LB 方式是比较简单直接,容易实现的,但是存在问题也不少,主要有多级 DNS 结构不可控,容易分布不均匀,和难以及时刷新等等。

IP 层

拿到 IP 之后,客户端可以发起请求到对应的 IP 地址,之后就等着接收到源地址为该 IP 的响应数据包。很自然的,可以在 LB 上替换掉这个地址,发给后端的真实服务池。这也是目前大量三层负载均衡器的基本原理。
实践中,前面挡一层 DNS 均衡,后面进行三层负载均衡,已经可以支持很大规模的场景了。
典型的,LVS 是这一类的软件方案,但具体实现上,LVS 支持三种模式:NAT(Network Address Translation)、DR(Direct Routing)、TUN(IP Tunneling)。NAT 就是把 LB 做 NAT 网关;DR 和 TUN 是想办法把流量转发给后端服务器,让后面服务器自己处理,形成三角路由。
首先,负载均衡器上会带有一个 VIP(用户看到的服务地址)和 DIP(用于探测后端服务的存活),后端真实服务器上会有 RIP。
NAT 模式下,请求和响应流量都经过负载均衡器,性能会差一些。
通过网络地址转换,调度器重写请求报文的目标地址,根据预设的调度算法,将请求分派给后端的真实服务器;真实服务器的响应报文通过调度器时,报文的源地址被重写,再返回给客户,完成整个负载调度过程。
DR 模式下通过改写请求报文的目的 MAC 地址为某个真实服务器的 MAC(ARP 请求肯定不能自动响应了),将请求发送到真实服务器。而真实服务器将响应直接返回给客户。该模式扩展型号,但要求 LB 与真实服务器可以二层连通。这种模式因为性能好,用的很多。
TUN 是解决 DR 无法跨网络的问题,不是拿 MAC 做转发,而是封装完整包做 IP 转发,比 DR 处理要复杂。
此外,肯定要保证 Session 一致性,可以根据 IP 头和 TCP 头记录之前决策,保证流调度到同一台机器。实践中,支持十几台到数十台后端服务器比较合适。

应用层

用户访问资源需要通过 URI,一个 IP 地址上可能支持数个 URI,在三层均衡的同时,可以根据这些 URI 来制定具体的分发策略,比如访问 Front_IP:80/app1 的流量可能会被分配到 Back_IP1:80/app1;而到 Front_IP:80/app2 的流量被分配到Back_IP1:80/app2 等等。
这种应用层均衡,特点当然是最灵活,可以多级划分,但因为要看应用层的信息,计算复杂性大一些,对均衡器要求也比较高,很少有能做高性能的。

Monday, August 17, 2015

用 consul + consul-template + registrator + nginx 打造真正可动态扩展的服务架构

在互联网应用领域,服务的动态性需求十分常见,这就对服务的自动发现和可动态扩展提出了很高的要求。
Docker 的出现,以及微服务架构的兴起,让众多开源项目开始关注在松耦合的架构前提下,如何基于 Docker 实现一套真正可动态扩展的服务架构。

基本需求

基本的需求包括:
  • 服务启动后要能自动被发现(vs 传统需要手动进行注册);
  • 负载要能动态在可用的服务实例上进行均衡(vs 传统需要静态写入配置);
  • 服务规模要方便进行快速调整(vs 传统需要较长时间的手动调整)。

相关项目

服务发现

服务发现的项目已经有不少,包括之前介绍的 consul,以及 skydnsserf、以及主要关注一致性的强大的 zookeeper 等。
这些项目各有优缺点,功能上大同小异,都是要通过某种机制来获取服务信息,然后通过维护一套(分布式)数据库来存储服务的信息。这也是为什么 etcd 受到大家关注和集成。
在这里,选用 HashiCorp 公司的 consul 作为服务发现的管理端,它的简介可以参考 这里

服务注册

服务注册的手段有很多,当然,从发起方是谁可以分为两大类,主动注册还是被动探测。
主动注册,顾名思义,服务启动后,向指定的服务发现管理端的 API 发送请求,给出自身的相关信息。这样做,对管理端的要求简单了很多,但意味着服务自身要完成注册工作,并且极端情况下,管理端比较难探测出真正存活的服务。
被动探测,则是服务发现管理端通过某种机制来探测存活的服务。这样可以获取真实的服务情况,但如何探测是个很难设计的点,特别当服务类型比较复杂的时候。
以上两种,都对网络连通性要求较高。从短期看,主动注册方式会比较容易实现一些,应用情形更广泛;但长期维护上,被动探测方式应该是更高效的设计。
这里,我们选用 gliderlabs 的 registrator,它可以通过跟本地的 docker 引擎通信,来获取本地启动的容器信息,并且注册到指定的服务发现管理端。

配置更新

服务被调整后,负载均衡器要想动态重新分配负载,就需要通过配置来获取更新。这样的方案也有不少,基本上都是要安装一些本地 agent 来监听服务发现管理端的信息,生成新的配置,并执行更新命令。
HashiCorp 公司 的consul-template,可以通过监听 consul 的注册信息,来完成本地应用的配置更新。

负载均衡

负载均衡对性能要求很高,其实并不是软件所擅长的领域,但软件方案胜在成本低、维护方便。包括 lvshaproxy 都是很优秀的设计方案。
这里,我们选用 nginx。nginx 不仅是个强大的 web 代理服务器,同时在负载均衡方面表现也不俗。更关键的,新版本的 nginx 对在线升级支持做到了极致。实时配置更新更是不在话下,可以保证服务的连续性。

实验过程

准备工作

首先,从 这里 下载模板文件。
主要内容如下:
#backend web application, scale this with docker-compose scale web=3
web:
  image: yeasy/simple-web:latest
  environment:
    SERVICE_80_NAME: http
    SERVICE_NAME: web
    SERVICE_TAGS: backend
  ports:
  - "80"

#load balancer will automatically update the config using consul-template
lb:
  image: yeasy/nginx-consul-template:latest
  hostname: lb
  links:
  - consulserver:consul
  ports:
  - "80:80"

consulserver:
  image: gliderlabs/consul-server:latest
  hostname: consulserver
  ports:
  - "8300"
  - "8400"
  - "8500:8500"
  - "53"
  command: -data-dir /tmp/consul -bootstrap -client 0.0.0.0

# listen on local docker sock to register the container with public ports to the consul service
registrator:
  image: gliderlabs/registrator:master
  hostname: registrator
  links:
  - consulserver:consul
  volumes:
  - "/var/run/docker.sock:/tmp/docker.sock"
  command: -internal consul://consul:8500
如果没有安装 docker 和 docker-compose,需要先进行安装,以 ubuntu 系统为例。
$ curl -sSL https://get.docker.com/ | sh
$ sudo pip install docker-compose

执行

docker-compose 模板所在目录,执行
$ sudo docker-compose up
相关镜像即可自动被下载,下载完毕后,容器就启动起来了。
访问 http://localhost 可以看到一个 web 页面,提示实际访问的目标地址。
多次刷新,可以看到目标地址没有变化,这是因为,目前我们只有一个 web 后端服务器。
2015-08-18 03:37:58: 6 requests from <LOCAL: 172.17.1.150> to WebServer <172.17.1.148>
调整后端为 3 个服务器。
$ sudo docker-compose scale web=3
然后,再次访问 http://localhost,多次刷新,可以看到访问的实际目标地址发生了变化,新启动的 web 服务器被自动注册,并且 nginx 自动对它们进行了负载均衡。
2015-08-18 03:37:58: 6 requests from <LOCAL: 172.17.1.150> to WebServer <172.17.1.148>
2015-08-18 03:38:17: 5 requests from <LOCAL: 172.17.1.150> to WebServer <172.17.1.152>
2015-08-18 03:38:20: 5 requests from <LOCAL: 172.17.1.150> to WebServer <172.17.1.153>

Tuesday, June 23, 2015

Mesos 基本原理与架构

首先,Mesos 是一个资源调度框架,并非一整套完整的应用管理平台,本身是不能干活的。但是它可以比较容易的跟各种应用管理或者中间件平台整合,一起工作,提高资源使用效率。

架构

mesos-arch master-slave 架构,master 使用 zookeeper 来做 HA。
master 单独运行在管理节点上,slave 运行在各个计算任务节点上。
各种具体任务的管理平台,即 framework 跟 master 交互,来申请资源。

基本单元

master

负责整体的资源调度和逻辑控制。

slave

负责汇报本节点上的资源给 master,并负责隔离资源来执行具体的任务。
隔离机制当然就是各种容器机制了。

framework

framework 是实际干活的,包括两个主要组件:
  • scheduler:注册到主节点,等待分配资源;
  • executor:在 slave 节点上执行本framework 的任务。
framework 分两种:一种是对资源需求可以 scale up 或者 down 的(Hadoop、Spark);一种是对资源需求大小是固定的(MPI)。

调度

对于一个资源调度框架来说,最核心的就是调度机制,怎么能快速高效的完成对某个 framework 资源的分配(最好是能猜到它的实际需求)。

两层调度算法:

master 先调度一大块资源给某个 framework,framework 自己再实现内部的细粒度调度。
调度机制支持插件。默认是 DRF。

基本调度过程

调度通过 offer 方式交互:
  • master 提供一个 offer(一组资源) 给 framework;
  • framework 可以决定要不要,如果接受的话,返回一个描述,说明自己希望如何使用和分配这些资源(可以说明只希望使用部分资源,则多出来的会被 master 收回);
  • master 则根据 framework 的分配情况发送给 slave,以使用 framework 的 executor 来按照分配的资源策略执行任务。

过滤器

framework 可以通过过滤器机制告诉 master 它的资源偏好,比如希望分配过来的 offer 有哪个资源,或者至少有多少资源。
主要是为了加速资源分配的交互过程。

回头机制

master 可以通过回收计算节点上的任务来动态调整长期任务和短期任务的分布。

HA

master

master 节点存在单点失效问题,所以肯定要上 HA,目前主要是使用 zookpeer 来热备份。
同时 master 节点可以通过 slave 和 framework 发来的消息重建内部状态(具体能有多快呢?这里不使用数据库可能是避免引入复杂度。)。

framework 通知

framework 中相关的失效,master 将发给它的 scheduler 来通知。

Wednesday, June 17, 2015

云计算容器服务该何去何从

容器技术最近很火,各家项目纷纷提出自己的支持方案,比如 OpenStack、CF、Mesos,以及一堆本身就基于容器的平台方案,更是跟容器技术脱不开关系。
容器是个啥?它不是虚拟机这样单纯的底层虚拟化,也不是纯应用,它恰好位于两者之间,位置极其重要。简直就像是 IP 协议层一样,无论自顶向下还是自底向上都无法绕开。
这也直接导致了暧昧已久的 IaaS 领域和 PaaS 领域正式开始正面的冲突。
在 IaaS 工程师们看来,做 PaaS 无非就是提供几个应用模板嘛,原来虚机不好做,现在有了 Docker,瞬间给你把服务整起来。更别提还有最近出来搅局的 hyper,虚机号称要跟容器比性能,那原来熟悉虚机的工程师可是太喜欢了。
而在 PaaS 工程师们看来,IaaS 就该老老实实地做底层物理资源给抽象成虚拟资源的事。原来底层都是虚机的时候,感觉好复杂,什么 SR-IOV,Libvirt/KVM,SDN,Overlay,Ceph……搞 PaaS 的人一般看不懂。而现在有了一堆现成提供 Docker 的容器平台,再往上就是应用了,都是做软件的人可以做的事情了,所以以后其实 IaaS 就没那么关键了。
这些讨论就像当年的单内核和微内核之争一样,是不光涉及到技术,还涉及到模式的关键战役。
可以说,看起来现在已经到了一个很关键的节骨眼上,中间这一层的设计将决定未来至少十年内云计算产业的形态。

当前状况

姑且放下这些争论,我们来看一下 IaaS 领域的 Top 开源项目 OpenStack 现在是怎么对待容器的。
目前有三种方案:
  • Nova-Docker:把容器作为虚机管起来。基本上其它组件都不需要动。唯一的问题就是容器毕竟不是虚机,比如需要提供一些额外的参数支持啦,需要引入组的概念啦,需要性能的优化啦。这导致玩 PaaS 的人很不喜欢。
  • Heat Docker Driver:用 Heat 来管容器。Heat 大家都知道,是个十分灵活且强大的解释引擎,理论上 Docker 需要的支持它都能有。唯一的问题是 Heat 毕竟是个解释引擎,它本质上还是要基于其它服务提供的 API。由于不是个运维引擎,导致运行时的管理没法保障,比如自动的资源调度啊,网络功能啊,等等。如果这些都做了,那就等于在更高一层上重复发明轮子了。
  • Magnum:玩容器的人看问题当然基本上都是从应用层往上开始考虑。一帮人兴冲冲跑去 Nova 项目谈,应该怎么支持基于容器的 DevOps 啊,应用模板啊,Nova 组一帮做系统的人就傻了,这事我们咋能干?这分明是 PaaS 该做的事情。但架不住大家都觉得 Docker 很火啊,我们肯定还是要玩点花样的,于是一个新的项目诞生了。但玩应用的人毕竟不懂系统,调研了下,发现现在能管理 Docker 的开源方案还真有几个,比如 Swarm 和 Kubernetes。太好了,那么,怎么把 Swarm 和 Kubernetes 这样的 PaaS 平台给集成到 OpenStack 这样的 IaaS 平台上呢?这个听起来好像也不懂唉,有人又想起 Heat 来了,一拍脑袋,可以先拿 Heat 来装一套啊。每次需要的时候就调个 Heat 命令,动态的装一套。所有问题看起来都解决了,皆大欢喜啊,真是皆大欢喜!
就这样,最为关键的容器服务提供这一层反而有意无意的被大家忽略了。

思考云计算

某位名人说过,我之所以能看得远,是因为站在了前人的肩膀上。
让我们抛开系统和应用之争,姑且也大胆地站在前人的肩膀上来重新发明轮子
首先还是要忍不住重复强调,信息技术的领域最为核心的思想就是分层和抽象。历史上,在不同位置进行分层和抽象,诞生了小型机,诞生了处理器,诞生了编程语言,诞生了 Web 服务,诞生了云计算……
抛开 IaaS,抛开 PaaS,云计算到底要提供啥?这个问题大家都知道,是服务。
啥叫服务?用户需要操作系统,可以直接给你一个;用户需要一个运行环境,可以直接给你一个;用户需要一套软件,也可以直接给你一个;用户需要一套方案,这个目前还没法直接给你一个,属于外包公司的业务。
那么,对于云平台的设计者来说,就是要提供这些不同层次的服务给用户,这才有了所谓的 IaaS 和 PaaS。所以,要记住,各种 XaaS 是呈献给用户的服务层次的不同,根本不是设计层次和技术方案的不同。
就好比你买了一个手机,可以玩游戏,也可以打电话。游戏和电话是手机提供给你的不同服务形态而已,并非说游戏是一种特殊的手机,电话是另外一种特殊的手机。
OK,那么,下面的问题就是讨论为了满足用户的这些需求,在设计上该如何分层。前人总结出了计算、存储、网络这三个根本的基础业务。其中计算又是最核心和最直接的。
我们来看直接面向用户的计算业务。数据中心里面放着的都是物理机器,物理机器上可以装操作系统,操作系统上可以装各种软件,可以运行虚拟机,可以运行容器。无论物理机、虚拟机、容器,都是属于计算资源。统统都应该用云平台给实现和供应出来。
如果说 IDC 是让用户可以拿物理机作为计算资源载体,那么现在的云计算是更进一步,让用户可以直接忽略计算资源的实际载体,无论是操作系统还是应用,直接提供给你即可,无需关心具体的载体。
总之一句话,云计算就是要方便提供计算资源!

问题与方向

Magnum 目前被认为 OpenStack 里面最有活力的容器项目,但是很可惜,一开始的路子就是偏的。
Magnum 的定位是提供一套 OpenStack API,底下可以兼容/依赖多种第三方的容器管理平台。OpenStack 本来就是要做一套资源管理平台,现在是用别人的,意味着这跟 OpenStack 其实关系不大。但是如果抛开 OpenStack,上面封装的一套 OpenStack API 又没有了意义。第三方的管理平台都有自己现成的 API。
真正的 Container as a Service,其实应该是在 OpenStack 中实现一套容器平台,而不是在 OpenStack 中安装了别人的一套平台,然后进行了 API 的封装。
或许有人会猜测,之所以不做底下的实现,可能跟 Nova-Docker 有关。
如果只谈技术,其实很容易在 Nova 中实现真正的 Container as a Service。在 Nova 看来,都是计算节点,但计算节点可以带上各自的类型,比如有的计算节点是物理机、有的是虚拟机、有的是容器甚至容器组。不同的类型意味着底层不同的驱动。用一套抽象的资源调度的框架(参考 Mesos 的二层调度机制),带上不同的底层 framework,问题很容易得到解决。
但偏偏是现在已经有了 Nova-Docker,已经有了 Magnum,不知道要经过多少波折才可能走到这个方向上来。或许在 OpenStack 的大环境下,是一件太困难的事情了。
或许,这就是开源的魅力,分分合合,曲折中前行。

Tuesday, January 13, 2015

网络技术正当革命时

计算机网络自诞生之后,面向的应用场景主要包括局域网、广域网两大类。在各种环境中,通过层级结构将局域网连接起来,形成一个网络的网络,即所谓的互联网。无论什么网络,唯一的事实标准就是 TCP/IP 协议栈以及围绕这个协议栈的各种管理和应用技术,即便后来推出的 IPv6、CCN、SDN 等网络技术,都没有完全超出这个范畴。所以,看起来基于 TCP/IP 的修修补补在相当长的一段时间里满足了各种场景下对于网络的需求
然而,到了现在云计算的时代,数据中心场景对于网络的需求,让这些传统的网络架构开始碰到了真正的难题。
在数据中心里有着其截然不同的网络需求和运营特点:
  • 资源都为某一方统一管理(或者说是存在统一查询的逻辑层)
  • 网络本身的动态性极高(各种迁移调度)
  • 网络的规模极大(几十万台主机,甚至几百万台主机)
  • 策略配置十分复杂(一个 vm 可能就要跟着十几条规则)
  • 性能需求十分苛刻(不光是正常传输性能,还有收敛性能)
这些特点对网络技术提出了很严峻的挑战,甚至可以说,其中某些问题在现有的网络架构下,是无解的。
就比如说可扩展性的问题,传统的二层汇聚,然后三层隔离广播域的做法,让网络对于核心层的压力骤增。而即便是昂贵的商用核心解决方案,也存在种种局限。
再比如说迁移的问题,这就从根本上对 TCP/IP 的模型的设计提出了挑战,虽然有种种 overlay 的技术来试图弥补这个缺陷,但又带来了传输的代价和额外的管理成本。
这些挑战使得网络虚拟化看起来很美,但到了落地阶段就会发现十分困难。
那么,为什么不重新考虑整个问题,看看数据中心中的网络需求到底是什么?
在当前的数据中心里,实际上运行的无论是否是虚拟机,提供给用户的都是各种通过互联网接入的服务(直接以虚机方式提供给用户的形式将越来越少)。用户访问这些服务,是基于 TCP/IP 服务进行的。这些服务之间,以及内部如何实现,其实并不一定是 TCP/IP。
一旦脱离开 TCP/IP 技术的限制,其实可以设计一套面向数据中心内部种种场景的特定协议。比如,在数据中心中,虚机(或主机)的发现,其实完全不需要依赖传统的二层广播形式。在启动虚机的同时(或在虚机启动后),控制层就已经明确知道这个虚机在哪里,它需要跟谁进行通信,那么很自然,相应的位置信息和通信信息其实可以进行提前的配置和优化,而无需非要先广播询问接口,通过一层甚至多层的响应,造成大量的无用数据包。
当广播域根本不需要隔离,甚至消灭了广播域的时候,对交换设备的压力就会减少很多。那个时候,设计成本可以接受、性能满足要求的交换层将成为可能。
当然,并不是要完全抛弃传统网络领域中种种优秀的技术和方法,不少精巧的思想都可以应用到数据中心场景,所谓万变不离其宗。但毫无疑问,非要绑定到 TCP/IP 的协议框架下,并不是一个合理的方案(或许也是个没办法下的方案),这必将带来更多的难题。
TCP/IP 在其发展过程中,已经经历过太多的风浪,不知道这次的挑战,它是否能再次安然度过。

Sunday, October 26, 2014

云计算时代应用设计十二要素

云计算时代应用设计十二要素

  • 什么样的软件才是可用性和可维护性好的软件?
  • 什么样的代码才能避免后续开发的上手障碍?
  • 什么样的实行才能稳定的运行在分布式的环境中?
Heroku (一家 PaaS 服务提供者,2010 年被 Salesforce 收购)平台创始人 Adam Winggins 提出了推荐的应用十二风格,对我们设计和实现云时代(特别是 PaaS 和 SaaS 上)高效的应用都有很好的参考意义。

代码

每个子系统都用一个代码库管理,使用版本管理,实现独立的部署。

依赖

显式声明依赖,通过环境来严格隔离不同依赖。

配置

在环境变量中保存配置信息,而避免放在源码或配置文件中。

后端服务

后端服务作为可挂载资源来使用,这样系统跟外部依赖尽量松耦合。

生命周期

区分不同声明周期的运行环境,包括创建、发布、部署,各个步骤要相互隔离。

进程

以一个或多个无状态的进程来运行应用,即尽量实现无状态,不要在进程中保存数据。

端口

通过端口绑定来对外提供服务。

并发

通过进程控制来扩展,即以多进程模型进行扩展。

可丢弃性

快速启动,优雅关闭,并尽量鲁棒(随时 kill,随时 crash)。

开发与生产环境的差异性

尽量保持从开发到生产部署环境的相似性。

日志

将日志当作事件流来进行统一的管理和维护(使用 Logstash 等工具)。

管理

将管理作为一次性的系统服务来使用。