Cloud-Native redis-cluster Scheduler Design

vlambda
2021-07-01


Background

This article describes how to run redis-cluster on Kubernetes and how to keep it highly available.

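A minimal Redis Cluster has 6 Redis instances: 3 masters and 3 slaves, with each slave replicating one master. When a master fails, its slave is automatically promoted to become the new master.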

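The cluster shards data through hash slots and needs at least 3 shards. A client computes CRC16(key) & 16383 to find the slot for a key, and from the slot the shard to talk to.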
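The slot calculation can be sketched in a few lines of Go. This is a minimal illustration, assuming the CRC16 variant named in the Redis Cluster specification (CCITT/XMODEM, polynomial 0x1021, initial value 0). The function names `crc16` and `keySlot` are made up for this example, and hash tags (`{...}`) are deliberately left out.

```go
package main

import "fmt"

// crc16 computes CRC16-CCITT (XMODEM variant: polynomial 0x1021, initial
// value 0), processing each byte most significant bit first.
func crc16(data []byte) uint16 {
	var crc uint16
	for _, b := range data {
		crc ^= uint16(b) << 8
		for i := 0; i < 8; i++ {
			if crc&0x8000 != 0 {
				crc = crc<<1 ^ 0x1021
			} else {
				crc <<= 1
			}
		}
	}
	return crc
}

// keySlot maps a key to one of the 16384 hash slots.
// Hash tags are ignored in this sketch.
func keySlot(key string) int {
	return int(crc16([]byte(key)) & 16383)
}

func main() {
	for _, k := range []string{"123456789", "user:1000"} {
		fmt.Printf("%s -> slot %d\n", k, keySlot(k))
	}
}
```

Because 16384 is a power of two, `& 16383` is the same as taking the CRC modulo 16384.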

For Redis to be highly available, the following conditions must hold:

No two masters run on the same node.
The master and slave of the same shard do not run on the same node.
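The two conditions above can be written as a small check over a placement. The sketch below is only an illustration: the `Shard` type and the `checkPlacement` function are hypothetical names, and a real check would read node names from pod specs.

```go
package main

import "fmt"

// Shard records where the master and the slave of one shard run.
// The type and its fields are illustrative.
type Shard struct {
	MasterNode string
	SlaveNode  string
}

// checkPlacement returns an error describing the first violated condition,
// or nil when both high-availability conditions hold.
func checkPlacement(shards []Shard) error {
	masterOn := map[string]int{} // node name -> index of the shard whose master runs there
	for i, s := range shards {
		if j, ok := masterOn[s.MasterNode]; ok {
			return fmt.Errorf("masters of shard %d and shard %d share node %s", j, i, s.MasterNode)
		}
		masterOn[s.MasterNode] = i
		if s.MasterNode == s.SlaveNode {
			return fmt.Errorf("master and slave of shard %d share node %s", i, s.MasterNode)
		}
	}
	return nil
}

func main() {
	good := []Shard{{"node-a", "node-b"}, {"node-b", "node-c"}, {"node-c", "node-a"}}
	bad := []Shard{{"node-a", "node-a"}, {"node-b", "node-c"}, {"node-c", "node-b"}}
	fmt.Println(checkPlacement(good))
	fmt.Println(checkPlacement(bad))
}
```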

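Approach

The approach relies on a scheduler and an operator working together. The scheduler controls which node each pod is assigned to, and the operator controls which Redis role each pod takes.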

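Scheduler

Policy

On top of the default scheduling rules, the following policies are added. Depending on user configuration, the scheduler enforces either the minimum requirement or the recommended requirement.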

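Minimum requirement

The cluster has at least replicas/2 nodes; for example, 6 replicas need at least 3 nodes.

Recommended requirement

The cluster has at least as many nodes as replicas, and every instance runs on a different node.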
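As a rough illustration of the two requirements, the sketch below computes the node count each one asks for and the per-node cap it implies. The names `Policy`, `requiredNodes` and `maxPerNode` are hypothetical. Rounding replicas/2 up for odd replica counts, and reading the minimum requirement as "at most 2 instances per node", are assumptions of this sketch rather than statements from the design.

```go
package main

import "fmt"

// Policy selects which requirement the scheduler enforces.
// The type and constant names are illustrative.
type Policy int

const (
	Minimum     Policy = iota // at least replicas/2 nodes
	Recommended               // at least replicas nodes
)

// requiredNodes returns the smallest node count that satisfies the policy.
// Rounding up for odd replica counts is an assumption of this sketch.
func requiredNodes(replicas int, p Policy) int {
	if p == Recommended {
		return replicas
	}
	return (replicas + 1) / 2
}

// maxPerNode returns how many instances of one cluster a single node may
// hold under the policy (assumed: 2 for Minimum, 1 for Recommended).
func maxPerNode(p Policy) int {
	if p == Recommended {
		return 1
	}
	return 2
}

func main() {
	fmt.Println(requiredNodes(6, Minimum), maxPerNode(Minimum))
	fmt.Println(requiredNodes(6, Recommended), maxPerNode(Recommended))
}
```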


Workflow

The scheduler is built on the Kubernetes scheduling framework. The node policy is implemented mainly in the Filter method.

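1. Enter the Filter stage.
2. Take the pod and the node as input.
3. Using node + pod.label["middleware-type"] + pod.label["middleware-app"] as the key, look up in the index how many instances already exist on that node.
4. If the number of instances on the node is greater than the threshold (usually the number of shards), mark the node as unschedulable; otherwise mark it as schedulable.
5. Move on to the next node and repeat from step 2.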
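The steps above can be sketched without any Kubernetes dependency. In this minimal example the index is a plain map, the label keys are written as `middleware-type` and `middleware-app` (an assumption about the intended spelling), and `instanceIndex`, `keyFor` and `feasibleNodes` are hypothetical names. The sketch treats the threshold as the maximum number of instances a node may hold, so a node that has already reached it is rejected; this matches the demo later in this article, which allows scheduling only while a node holds fewer than 2 instances.

```go
package main

import "fmt"

// Label keys are assumptions of this sketch.
const (
	labelType = "middleware-type"
	labelApp  = "middleware-app"
)

// indexKey identifies the instances of one application on one node.
type indexKey struct {
	Node, Type, App string
}

// instanceIndex counts already-placed instances per (node, type, app).
type instanceIndex map[indexKey]int

func keyFor(node string, labels map[string]string) indexKey {
	return indexKey{Node: node, Type: labels[labelType], App: labels[labelApp]}
}

// filter reports whether node may still accept a pod with these labels.
func (idx instanceIndex) filter(node string, labels map[string]string, limit int) bool {
	return idx[keyFor(node, labels)] < limit
}

// feasibleNodes applies filter to every node, one node at a time.
func (idx instanceIndex) feasibleNodes(nodes []string, labels map[string]string, limit int) []string {
	var out []string
	for _, n := range nodes {
		if idx.filter(n, labels, limit) {
			out = append(out, n)
		}
	}
	return out
}

func main() {
	idx := instanceIndex{}
	nodes := []string{"node-a", "node-b", "node-c"}
	labels := map[string]string{labelType: "redis", labelApp: "demo"}
	for i := 0; i < 7; i++ {
		ok := idx.feasibleNodes(nodes, labels, 2)
		if len(ok) == 0 {
			fmt.Printf("pod-%d: no feasible node\n", i)
			continue
		}
		idx[keyFor(ok[0], labels)]++ // pretend the pod was bound to the first feasible node
		fmt.Printf("pod-%d -> %s\n", i, ok[0])
	}
}
```

Keying the index by node, type and app keeps different applications from counting against each other on the same node.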


Each pod must declare its application type and application name in its labels. For now the label keys are middleware-type and middleware-app.

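Operator

Policy

Every master runs on a different node, and the master and slave of the same shard are never on the same node.

First, make sure a master is assigned on every node.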


Then place each master and its slave on different nodes.


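Workflow

1. List the pods and convert them into a map whose key is nodeName and whose value is []pod.
2. Iterate over the map and mark the first element of each []pod as a master, which guarantees that the masters are on different nodes. The remaining pods are slaves.
3. Collect the masters and the slaves into two separate []pod slices.
4. Iterate over the masters and slaves and group each master with a slave whose spec.nodeName differs, so that the master and slave of the same shard are never on the same node.
5. Output the result.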
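The workflow above can be sketched as follows. The `Pod` and `Pair` types and the `assignRoles` function are hypothetical stand-ins, not the operator's real types. For step 4 this sketch uses a small augmenting-path matching instead of a single greedy pass, because a greedy pass can leave the last master with only a same-node slave; that choice belongs to this example and is not specified by the design. It also assumes at least one slave per master, as in the layout with two pods per node.

```go
package main

import (
	"fmt"
	"sort"
)

// Pod is a minimal stand-in for a Kubernetes pod.
type Pod struct {
	Name     string
	NodeName string
}

// Pair is one shard: a master and the slave that replicates it.
type Pair struct {
	Master, Slave Pod
}

func assignRoles(pods []Pod) ([]Pair, error) {
	// Step 1: nodeName -> pods on that node.
	byNode := map[string][]Pod{}
	for _, p := range pods {
		byNode[p.NodeName] = append(byNode[p.NodeName], p)
	}
	// Sort node names so the result is deterministic (map order is random).
	nodes := make([]string, 0, len(byNode))
	for n := range byNode {
		nodes = append(nodes, n)
	}
	sort.Strings(nodes)

	// Steps 2 and 3: the first pod on each node is a master, the rest are slaves.
	var masters, slaves []Pod
	for _, n := range nodes {
		masters = append(masters, byNode[n][0])
		slaves = append(slaves, byNode[n][1:]...)
	}

	// Step 4: match every master with a slave on a different node.
	owner := make([]int, len(slaves)) // slave index -> master index, -1 if free
	for i := range owner {
		owner[i] = -1
	}
	var try func(m int, seen []bool) bool
	try = func(m int, seen []bool) bool {
		for s := range slaves {
			if seen[s] || slaves[s].NodeName == masters[m].NodeName {
				continue
			}
			seen[s] = true
			// Take a free slave, or re-route the master that currently holds it.
			if owner[s] == -1 || try(owner[s], seen) {
				owner[s] = m
				return true
			}
		}
		return false
	}
	for m := range masters {
		if !try(m, make([]bool, len(slaves))) {
			return nil, fmt.Errorf("no slave on another node for master %s", masters[m].Name)
		}
	}

	// Step 5: output one pair per master.
	pairs := make([]Pair, len(masters))
	for s, m := range owner {
		if m >= 0 {
			pairs[m] = Pair{Master: masters[m], Slave: slaves[s]}
		}
	}
	return pairs, nil
}

func main() {
	pods := []Pod{
		{"redis-0", "node-a"}, {"redis-1", "node-a"},
		{"redis-2", "node-b"}, {"redis-3", "node-b"},
		{"redis-4", "node-c"}, {"redis-5", "node-c"},
	}
	pairs, err := assignRoles(pods)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, p := range pairs {
		fmt.Printf("master %s (%s) <- slave %s (%s)\n",
			p.Master.Name, p.Master.NodeName, p.Slave.Name, p.Slave.NodeName)
	}
}
```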

Key technical validation

Scheduler demo

https://github.com/shenkonghui/scheduler-plugins/blob/pf/pkg/redis/redis.go#L120

func (r *Redis) Filter(ctx context.Context, state *framework.CycleState, pod *v1.Pod, nodeInfo *framework.NodeInfo) *framework.Status {
	app := pod.Labels["app"]
	if app != "" {
		// Get the instances of this application that already run on the node.
		pods, _ := r.getPodsAssignedToNode(nodeInfo.Node().Name, app)
		// Allow scheduling only while the node holds fewer than 2 of them.
		if len(pods) < 2 {
			klog.Infof("Filter: agree pod %s schedule to %s", pod.Name, nodeInfo.Node().Name)
			return framework.NewStatus(framework.Success)
		} else {
			klog.Infof("Filter: not agree pod %s schedule to %s, existed num %d", pod.Name, nodeInfo.Node().Name, len(pods))
		}
	}
	// Pods without an "app" label, and nodes that already hold 2 instances, end up here.
	return framework.NewStatus(framework.Unschedulable)
}

I0629 10:59:17.117661 36994 redis.go:130] Filter: agree pod busybox-0 schedule to slave-213
I0629 10:59:17.117675 36994 redis.go:130] Filter: agree pod busybox-0 schedule to slave-214
I0629 10:59:17.117735 36994 redis.go:130] Filter: agree pod busybox-0 schedule to master-212
I0629 10:59:26.780771 36994 redis.go:130] Filter: agree pod busybox-1 schedule to slave-213
I0629 10:59:26.780772 36994 redis.go:130] Filter: agree pod busybox-1 schedule to master-212
I0629 10:59:26.780771 36994 redis.go:130] Filter: agree pod busybox-1 schedule to slave-214
I0629 10:59:32.859554 36994 redis.go:130] Filter: agree pod busybox-2 schedule to slave-213
I0629 10:59:32.859584 36994 redis.go:130] Filter: agree pod busybox-2 schedule to master-212
I0629 10:59:32.859554 36994 redis.go:130] Filter: agree pod busybox-2 schedule to slave-214
I0629 10:59:40.321685 36994 redis.go:130] Filter: agree pod busybox-3 schedule to slave-213
I0629 10:59:40.321695 36994 redis.go:130] Filter: agree pod busybox-3 schedule to slave-214
I0629 10:59:40.321727 36994 redis.go:133] Filter: not agree pod busybox-3 schedule to master-212, existed num 2
I0629 10:59:58.062286 36994 redis.go:133] Filter: not agree pod busybox-4 schedule to slave-213, existed num 2
I0629 10:59:58.062303 36994 redis.go:130] Filter: agree pod busybox-4 schedule to slave-214
I0629 10:59:58.062323 36994 redis.go:133] Filter: not agree pod busybox-4 schedule to master-212, existed num 2
I0629 10:59:59.879474 36994 redis.go:130] Filter: agree pod busybox-5 schedule to slave-214
I0629 10:59:59.879473 36994 redis.go:133] Filter: not agree pod busybox-5 schedule to master-212, existed num 2
I0629 10:59:59.879473 36994 redis.go:133] Filter: not agree pod busybox-5 schedule to slave-213, existed num 2