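K8S Calico BGP Implementation and Route Design
After nearly a month of quarantine at home, I finally got around to writing up the K8S notes I had been putting off for a year. I will start with Calico and work through container networking solutions.
This article uses a K8S cluster deployed in containers with kubeadm, deploys Calico in containers in one step from the official yaml file, and covers the basic configuration of Calico's BGP mode along with several practical internal and external BGP network designs.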
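1. Container Network Logic
The smallest unit K8S manages is the Pod, and a Pod usually runs a single container. The physical machine or host is called a Node. A Pod is essentially one or more processes running in the Node's operating system; these processes could be mysql, nginx, or one of K8S's own databases.
For convenience, this article uses Pod and container interchangeably, and likewise Host and Node.
Create and enter a busybox container with docker run: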
root@e564:~# docker run -it --name my-busybox1 busybox /bin/sh
/ #
/ # ip add
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
207: eth0@if208: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 1500 qdisc noqueue
link/ether 02:42:c0:a8:a9:04 brd ff:ff:ff:ff:ff:ff
inet 172.17.0.4/16 brd 172.17.255.255 scope global eth0
valid_lft forever preferred_lft forever
Now check the routing table on the host with the route command. The container network route is there, with docker0 as the outgoing interface:
# route -n
Kernel IP routing table
Destination Gateway Genmask Flags Metric Ref Use Iface
0.0.0.0 10.1.253.1 0.0.0.0 UG 0 0 0 mybr0
172.17.0.0 0.0.0.0 255.255.0.0 U 0 0 0 docker0
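Logically, the docker0 interface behaves like a loopback interface. After manually adding a route on the PC (10.1.253.100) to the container network (172.17.0.0/16) with the host (10.1.253.101) as next hop, the PC can reach the container.
Does this logic look familiar? When deploying many Hosts, we only need to plan a dedicated IP pool for each Host and add a route per Host on the physical network, and the external network can reach the containers. Hosts can also point static routes at each other so that containers on different Hosts can reach one another!
But once the number of Hosts grows past a certain point, hand-written static routes become a heavy burden. K8S therefore supports many container networking solutions, and Calico's BGP mode is a particularly good fit for large, highly scalable container networks.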
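To make the scaling problem concrete, here is a minimal Python sketch that generates the static routes each Host would need to reach the pools on the other Hosts. The per-Host pools in HOST_POOLS and the helper names are hypothetical, chosen only for illustration; they are not taken from the lab above.

```python
from ipaddress import ip_network

# Hypothetical addressing plan: physical IP of each Host -> pod pool on that Host.
HOST_POOLS = {
    "10.1.253.101": "172.17.1.0/24",
    "10.1.253.102": "172.17.2.0/24",
    "10.1.253.103": "172.17.3.0/24",
}


def static_routes(host_pools):
    """Return {host_ip: [commands]} giving each Host a route to every other Host's pool."""
    plan = {}
    for host in host_pools:
        commands = []
        for peer, pool in host_pools.items():
            if peer == host:
                continue  # the local pool is reached directly, no route needed
            ip_network(pool)  # raises ValueError if the pool is not a valid network
            commands.append(f"ip route add {pool} via {peer}")
        plan[host] = commands
    return plan


def total_static_routes(n_hosts):
    """Each of n Hosts needs a route to the other n - 1 pools."""
    return n_hosts * (n_hosts - 1)


if __name__ == "__main__":
    for host, commands in static_routes(HOST_POOLS).items():
        print(host)
        for command in commands:
            print("  " + command)
    print("routes for 100 Hosts:", total_static_routes(100))
```

With three Hosts that is only six routes, but the count grows quadratically, which is the burden a routing protocol removes.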
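2. Calico
Calico has two basic approaches to inter-container connectivity: Overlay and Underlay.
By default it uses an IPinIP Overlay, which wraps the IP packets exchanged between Pods in an outer IP header. The inner IPs are the Pod IPs and the outer IPs are the Hosts' physical network IPs.
This approach is the simplest and the least flexible. It works out of the box but only suits small networks. When deploying it, note that the Pod MTU is automatically reduced by the 20 bytes of the outer IP header.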
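The MTU arithmetic can be sketched in a few lines of Python. The function name pod_mtu is made up for this example, and the 20-byte figure assumes an outer IPv4 header without options, as described above.

```python
IPIP_OVERHEAD = 20  # bytes: one outer IPv4 header without options


def pod_mtu(host_mtu, ipip=True):
    """Return the MTU left for Pod traffic on a link with the given host MTU."""
    if host_mtu <= IPIP_OVERHEAD:
        raise ValueError("host MTU is too small to carry an IPIP packet")
    return host_mtu - IPIP_OVERHEAD if ipip else host_mtu


if __name__ == "__main__":
    print(pod_mtu(1500))              # IPIP overlay on a standard Ethernet link
    print(pod_mtu(1500, ipip=False))  # BGP underlay, no encapsulation
```

In the unencapsulated BGP mode covered next, no outer header is added, so the Pod MTU can match the host MTU.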
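The second approach uses BGP to achieve Underlay connectivity.
This article uses containerized Calico Pods to provide BGP: one Calico Pod runs on every Host in the cluster, and the Hosts establish BGP neighbor relationships with each other through Calico. They advertise their local Pod networks to one another and install the learned routes into the Host's local routing table. Calico can also peer with external switches and exchange routes with them, which covers access in both directions between the cluster and the outside.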
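Deploying Calico
- Stop NetworkManager from managing the Calico interfaces
vim /etc/NetworkManager/conf.d/calico.conf
[keyfile]
unmanaged-devices=interface-name:cali*;interface-name:tunl*
- Download the manifest: wget https://docs.projectcalico.org/manifests/calico.yaml
By default this file deploys Calico in IPIP mode. Setting CALICO_IPV4POOL_IPIP to "Never" turns the encapsulation off, so Pod traffic is routed natively using the routes learned over BGP: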
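# Enable IPIP
- name: CALICO_IPV4POOL_IPIP
  value: "Never" # disable IPIP so that plain BGP routing is used
  #value: "Always"
- Choose the interface used for BGP (regular expressions are supported)
Set the IP_AUTODETECTION_METHOD parameter so that it matches the intended interface: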
# IP automatic detection
- name: IP_AUTODETECTION_METHOD
  value: "interface=en.*"
- Deploy in one step with the K8S management tool
# kubectl apply -f calico.yaml
configmap/calico-config created
.
.
.
deployment.apps/calico-kube-controllers created
serviceaccount/calico-kube-controllers created
poddisruptionbudget.policy/calico-kube-controllers created
Until Calico is fully configured, you need to open TCP port 179 manually on each Node if you want to see the BGP peer sessions come up:
iptables -I INPUT -p tcp --dport=179 -j ACCEPT
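Before applying the manifest it can help to check which interface names the IP_AUTODETECTION_METHOD value set earlier would select. The Python sketch below is only an approximation: it assumes the value is a comma-separated list of regular expressions matched anywhere in the interface name, and it uses Python's re module rather than Calico's own matcher, so treat the exact matching rules as an assumption and confirm them against the Calico documentation. The interface names are sample values.

```python
import re


def parse_method(value):
    """Split 'interface=<regex>[,<regex>...]' into a list of patterns."""
    kind, _, patterns = value.partition("=")
    if kind != "interface" or not patterns:
        raise ValueError("only the 'interface=<regex list>' form is handled here")
    return patterns.split(",")


def matching_interfaces(value, names):
    """Return the interface names selected by the autodetection value (assumed semantics)."""
    patterns = [re.compile(p) for p in parse_method(value)]
    return [n for n in names if any(p.search(n) for p in patterns)]


if __name__ == "__main__":
    sample = ["lo", "enp0s25", "ens192", "eth0", "docker0", "tunl0", "cali716c8c147b6"]
    print(matching_interfaces("interface=en.*", sample))
```

If more than one physical interface matches, a narrower pattern keeps the BGP address predictable.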
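Deploying the calicoctl tool
calicoctl is the management tool published by the Calico project. In essence it reads and writes Calico data in the cluster's datastore. It has two access modes: talking to the etcd database directly, or going through the K8S API.
This article uses the latter:
- Download calicoctl: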
curl -L https://github.com/projectcalico/calico/releases/download/v3.22.1/calicoctl-linux-amd64 -o /usr/local/bin/calicoctl
chmod +x /usr/local/bin/calicoctl
- Create and edit the default configuration file
vim /etc/calico/calicoctl.cfg
apiVersion: projectcalico.org/v3
kind: CalicoAPIConfig
metadata:
spec:
  datastoreType: "kubernetes"
  kubeconfig: "<user home directory>/.kube/config"
- Verify
# sudo calicoctl node status
Calico process is running.
IPv4 BGP status
+--------------+---------------+-------+----------+-------------+
| PEER ADDRESS | PEER TYPE | STATE | SINCE | INFO |
+--------------+---------------+-------+----------+-------------+
| 10.1.253.235 | node specific | up | 08:42:52 | Established |
| 10.1.253.236 | node specific | up | 08:42:52 | Established |
| 10.1.253.128 | node specific | up | 09:00:23 | Established |
+--------------+---------------+-------+----------+-------------+
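When there are many Nodes, reading this table by eye gets tedious. As a rough sketch, the table can be parsed in Python to list the peers that are not in the Established state. The function names are invented for this example, and the parser assumes the column layout shown in the output above.

```python
SAMPLE = """\
Calico process is running.
IPv4 BGP status
+--------------+---------------+-------+----------+-------------+
| PEER ADDRESS | PEER TYPE | STATE | SINCE | INFO |
+--------------+---------------+-------+----------+-------------+
| 10.1.253.235 | node specific | up | 08:42:52 | Established |
| 10.1.253.236 | node specific | up | 08:42:52 | Established |
| 10.1.253.128 | node specific | up | 09:00:23 | Established |
+--------------+---------------+-------+----------+-------------+
"""


def parse_peers(text):
    """Turn the ASCII table into a list of dicts keyed by the header cells."""
    header = None
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip banners and +---+ separator lines
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if header is None:
            header = cells  # the first |...| line is the header row
        else:
            rows.append(dict(zip(header, cells)))
    return rows


def peers_not_established(text):
    """Return the peer addresses whose INFO column is not 'Established'."""
    return [p["PEER ADDRESS"] for p in parse_peers(text) if p["INFO"] != "Established"]


if __name__ == "__main__":
    print(parse_peers(SAMPLE))
    print(peers_not_established(SAMPLE))
```

In practice the text would come from running calicoctl node status on the Node; SAMPLE simply reuses the output shown above.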
3. Basic BGP Configuration
Like K8S, Calico configures BGP with yaml or json files. In this article each kind of resource is kept in its own yaml file.
Common Calico configuration
- kind: BGPConfiguration
Sets the cluster-wide AS number, turns the full mesh on or off, advertises specific networks, and so on
apiVersion: projectcalico.org/v3
kind: BGPConfiguration
metadata:
  name: default
spec:
  logSeverityScreen: Info
  nodeToNodeMeshEnabled: false
  asNumber: 65111
  serviceClusterIPs:
  - cidr: 10.96.0.0/16
- kind: BGPPeer
Configures neighbor relationships inside and outside the cluster. nodeSelector picks the A-end BGP speakers and peerSelector picks the Z-end speakers:
kind: BGPPeer
apiVersion: projectcalico.org/v3
metadata:
  name: client-rr
spec:
  nodeSelector: all()
  peerSelector: label-bgp-rr == 'true'
- kind: Node
Configures the BGP parameters of a single host, for example the RR Cluster-ID
apiVersion: projectcalico.org/v3
kind: Node
metadata:
  labels:
    label-bgp-rr: "true"
  name: worknode.homelab.local
spec:
  addresses:
  - address: 10.1.253.13/24
    type: CalicoNodeIP
  - address: 10.1.253.13
    type: InternalIP
  bgp:
    ipv4Address: 10.1.253.13/24
    routeReflectorClusterID: 100.64.0.13
    asNumber: 65111
  orchRefs:
  - nodeName: worknode.homelab.local
    orchestrator: k8s
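4. BGP Network Design
Calico enables the full mesh by default (nodeToNodeMeshEnabled: true). A full mesh requires every pair of Calico BGP speakers to establish and maintain a BGP session, so once the cluster grows large, route reflectors (RR) should be configured to cut the number of BGP peers and reduce overhead.
With a full mesh, every Calico Node has to establish a BGP session with every other Calico Node.
When each Node peers only with the RR, the overhead of BGP session messages drops considerably as the number of nodes grows.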
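A quick calculation shows how large the difference becomes. This Python sketch counts BGP sessions for a full mesh and for a route reflector design in which every Node peers with every RR and the RRs also peer with each other; the function names and the cluster sizes are illustrative assumptions, not measurements.

```python
def full_mesh_sessions(nodes):
    """Every pair of nodes keeps one BGP session: n * (n - 1) / 2."""
    return nodes * (nodes - 1) // 2


def route_reflector_sessions(nodes, reflectors):
    """Clients peer with every RR; the RRs also peer among themselves."""
    if not 0 < reflectors <= nodes:
        raise ValueError("reflectors must be between 1 and the number of nodes")
    clients = nodes - reflectors
    return clients * reflectors + reflectors * (reflectors - 1) // 2


if __name__ == "__main__":
    for n in (3, 50, 100, 500):
        print(n, full_mesh_sessions(n), route_reflector_sessions(n, 2 if n > 3 else 1))
```

For the three-Node lab in this article with one RR, the result is two sessions instead of three; at 100 Nodes with two RRs it is 197 instead of 4950.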
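Taking a BGP route reflector as the example: first label the chosen Node, then have all other Nodes establish BGP sessions with the Nodes that carry that label:
- Add a custom label to the node:
kubectl label node <RR_node_name> label-bgp-rr=true
- Edit the kind: BGPPeer file to define the neighbor relationship:
# vim bgp_peer_conf.yaml
kind: BGPPeer
apiVersion: projectcalico.org/v3
metadata:
  name: peer-with-route-reflectors
spec:
  nodeSelector: all() # select every Node
  peerSelector: label-bgp-rr == 'true' # peer with the Nodes labelled label-bgp-rr
- Configure a Cluster-ID for the RR (Calico treats a node as a route reflector once routeReflectorClusterID is set)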
apiVersion: projectcalico.org/v3
kind: Node
metadata:
  labels:
    label-bgp-rr: "true"
  name: worknode.homelab.local
spec:
  addresses:
  - address: 10.1.253.13/24
    type: CalicoNodeIP
  - address: 10.1.253.13
    type: InternalIP
  bgp:
    ipv4Address: 10.1.253.13/24
    routeReflectorClusterID: 100.64.0.13
    asNumber: 65111
- Disable the full mesh (if it is still enabled)
apiVersion: projectcalico.org/v3
kind: BGPConfiguration
metadata:
  name: default
spec:
  nodeToNodeMeshEnabled: false
- Apply the configuration
calicoctl apply -f bgp_peer_conf.yaml
calicoctl apply -f bgp_node_conf.yaml
calicoctl apply -f bgp_global_conf.yaml
Check the peer status:
RR)# calicoctl node status
+--------------+---------------+-------+----------+-------------+
| PEER ADDRESS | PEER TYPE | STATE | SINCE | INFO |
+--------------+---------------+-------+----------+-------------+
| 10.1.253.235 | node specific | up | 10:50:30 | Established |
| 10.1.253.236 | node specific | up | 10:50:30 | Established |
+--------------+---------------+-------+----------+-------------+
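The way nodeSelector and peerSelector combine can be modelled in a few lines. The Python sketch below is a simplified model, not Calico's selector engine: selectors are plain Python predicates over a label dict, node names are replaced by the lab IP addresses, and a node is assumed never to peer with itself.

```python
def bgp_sessions(nodes, node_selector, peer_selector):
    """Return the set of sessions as frozenset pairs.

    nodes maps a node name to its label dict; a session exists between a and b
    when a matches node_selector, b matches peer_selector, and a is not b.
    """
    sessions = set()
    for a, a_labels in nodes.items():
        if not node_selector(a_labels):
            continue
        for b, b_labels in nodes.items():
            if a != b and peer_selector(b_labels):
                sessions.add(frozenset((a, b)))
    return sessions


def peers_of(name, sessions):
    """List the peers that a given node has a session with."""
    return sorted(other for s in sessions if name in s for other in s if other != name)


# Lab nodes from this article; only 10.1.253.13 carries the RR label.
NODES = {
    "10.1.253.13": {"label-bgp-rr": "true"},
    "10.1.253.235": {},
    "10.1.253.236": {},
}

if __name__ == "__main__":
    sessions = bgp_sessions(
        NODES,
        node_selector=lambda labels: True,                               # all()
        peer_selector=lambda labels: labels.get("label-bgp-rr") == "true",
    )
    print(peers_of("10.1.253.13", sessions))
    print(peers_of("10.1.253.235", sessions))
```

The RR ends up with the two sessions seen in the status output above, while each client keeps a single session to the RR.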
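Calico also supports specifying an external BGP peer by peerIP. An external RR can be used to take even more load off the Calico cluster:
- Configure the external peer IP and ASN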
kind: BGPPeer
apiVersion: projectcalico.org/v3
metadata:
  name: rr-tor
spec:
  nodeSelector: label-bgp-rr == 'true'
  peerIP: 10.1.253.128
  asNumber: 65111
- Apply the configuration
calicoctl apply -f bgp_peer_conf.yaml
Check the neighbor relationship:
RR)# calicoctl node status
IPv4 BGP status
+--------------+---------------+-------+----------+-------------+
| PEER ADDRESS | PEER TYPE | STATE | SINCE | INFO |
+--------------+---------------+-------+----------+-------------+
| 10.1.253.128 | node specific | up | 09:00:22 | Established |
+--------------+---------------+-------+----------+-------------+
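If the K8S environment may need to be upgraded or relocated as a whole, the Calico cluster should be loosely coupled to the external BGP switches: when an external router or layer 3 switch goes down or is decommissioned, communication inside the cluster must keep working. A hybrid design fits this case, combining an internal RR with an external iBGP neighbor:
- Define the first BGPPeer file so that all Calico Nodes peer with the Calico RR: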
kind: BGPPeer
apiVersion: projectcalico.org/v3
metadata:
  name: client-rr
spec:
  nodeSelector: all()
  peerSelector: label-bgp-rr == 'true'
- Define the second BGPPeer file so that the RR Node peers with the external peerIP:
kind: BGPPeer
apiVersion: projectcalico.org/v3
metadata:
  name: rr-tor
spec:
  nodeSelector: label-bgp-rr == 'true'
  peerIP: 10.1.253.128
  asNumber: 65111
- Disable the full mesh (if it is still enabled)
- Apply the configuration files:
calicoctl apply -f bgp_peer_conf.yaml
calicoctl apply -f bgp_peer-with-tor_conf.yaml
calicoctl apply -f bgp_global_conf.yaml
The neighbor relationships now show that every other Node peers only with the Calico RR:
[root@node-236 ~]# calicoctl node status
Calico process is running.
IPv4 BGP status
+--------------+---------------+-------+----------+-------------+
| PEER ADDRESS | PEER TYPE | STATE | SINCE | INFO |
+--------------+---------------+-------+----------+-------------+
| 10.1.253.13 | node specific | up | 08:42:51 | Established |
+--------------+---------------+-------+----------+-------------+
The Calico RR Node peers with the other Nodes and with the external TOR switch:
RR)# calicoctl node status
IPv4 BGP status
+--------------+---------------+-------+----------+-------------+
| PEER ADDRESS | PEER TYPE | STATE | SINCE | INFO |
+--------------+---------------+-------+----------+-------------+
| 10.1.253.235 | node specific | up | 08:42:51 | Established |
| 10.1.253.236 | node specific | up | 08:42:51 | Established |
| 10.1.253.128 | node specific | up | 09:00:22 | Established |
+--------------+---------------+-------+----------+-------------+
TOR switch configuration example (Cisco NXOS):
interface Ethernet1/1
  no switchport
  ip address 10.1.253.128/24
  no shutdown
#
router bgp 65111
  router-id 10.1.253.128
  address-family ipv4 unicast
    network 10.114.114.114/32
  neighbor 10.1.253.13
    remote-as 65111
    description Calico-BGP-Peer-node13
    address-family ipv4 unicast
Check the TOR routing table. It has learned the K8S cluster's routes (inside 10.244.0.0/16), with the three Nodes as next hops:
NX-Pub# show ip route bgp
10.244.109.192/26, ubest/mbest: 1/0
*via 10.1.253.13, [200/0], 06:38:18, bgp-65111, internal, tag 65111
10.244.210.192/26, ubest/mbest: 1/0
*via 10.1.253.236, [200/0], 06:38:18, bgp-65111, internal, tag 65111
10.244.214.64/26, ubest/mbest: 1/0
*via 10.1.253.235, [200/0], 06:38:18, bgp-65111, internal, tag 65111
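Each Node advertises a /26 block carved out of the cluster pool, and the TOR forwards a packet to whichever Node owns the block containing the destination. The lookup can be sketched in Python with the standard ipaddress module; the TOR_ROUTES table copies the three routes shown above, while the function name and the sample destinations are only illustrative.

```python
from ipaddress import ip_address, ip_network

# BGP routes learned by the TOR, copied from the output above: prefix -> next hop.
TOR_ROUTES = {
    "10.244.109.192/26": "10.1.253.13",
    "10.244.210.192/26": "10.1.253.236",
    "10.244.214.64/26": "10.1.253.235",
}


def next_hop(destination, routes):
    """Longest-prefix match: return the next hop of the most specific matching route."""
    dst = ip_address(destination)
    best_net, best_hop = None, None
    for prefix, hop in routes.items():
        net = ip_network(prefix)
        if dst in net and (best_net is None or net.prefixlen > best_net.prefixlen):
            best_net, best_hop = net, hop
    return best_hop  # None when no route covers the destination


if __name__ == "__main__":
    print(next_hop("10.244.109.199", TOR_ROUTES))  # a Pod on the RR node
    print(next_hop("10.244.210.200", TOR_ROUTES))
    print(next_hop("10.244.0.1", TOR_ROUTES))      # inside the pool, but no block advertised
    print(ip_network("10.244.109.192/26").num_addresses)
```

A /26 block holds 64 addresses, so a Node that runs more Pods than that would be expected to advertise more than one block.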
On a Node, the routing table shows the network advertised by the TOR (10.114.114.114/32):
# route -n
Kernel IP routing table
Destination Gateway Genmask Flags Metric Ref Use Iface
0.0.0.0 10.1.253.1 0.0.0.0 UG 100 0 0 enp0s25
10.1.253.0 0.0.0.0 255.255.255.0 U 100 0 0 enp0s25
10.244.109.192 0.0.0.0 255.255.255.192 U 0 0 0 *
10.244.109.199 0.0.0.0 255.255.255.255 UH 0 0 0 cali716c8c147b6
10.244.109.211 0.0.0.0 255.255.255.255 UH 0 0 0 calidf9444430f4
10.244.109.213 0.0.0.0 255.255.255.255 UH 0 0 0 cali2fc1d82df67
10.244.109.215 0.0.0.0 255.255.255.255 UH 0 0 0 cali765da0572cb
10.244.210.192 10.1.253.236 255.255.255.192 UG 0 0 0 enp0s25
10.244.214.64 10.1.253.235 255.255.255.192 UG 0 0 0 enp0s25
10.114.114.114 10.1.253.128 255.255.255.255 UGH 0 0 0 enp0s25 <====== route from the external TOR