
Reading notes: Ceph Cookbook - Second Edition, Ceph under the Hood

Ceph under the Hood

In this chapter, we will cover the following recipes:

  • Ceph scalability and high availability
  • Understanding the CRUSH mechanism
  • CRUSH map internals
  • CRUSH tunables
  • Ceph cluster map
  • High availability monitors
  • Ceph authentication and authorization
  • I/O path from a Ceph client to a Ceph cluster
  • Ceph placement group
  • Placement group states
  • Creating Ceph pools on specific OSDs

Introduction

In this chapter, we will take a deep dive into the internal workings of Ceph by looking at features such as its scalability, high availability, authentication, and authorization. We will also cover the CRUSH map, one of the most important parts of a Ceph cluster. Finally, we will look at dynamic cluster management and custom CRUSH map settings for Ceph pools.

Ceph scalability and high availability

To understand Ceph's scalability and high availability, let's first talk about the architecture of traditional storage systems. In that architecture, to store or retrieve data, clients talk to a centralized component known as a controller or gateway. These storage controllers act as a single point of contact for client requests. The following diagram illustrates this situation:

[Figure: clients accessing storage through a centralized controller or gateway]

This storage gateway, being the single point of entry to the storage system, also becomes a single point of failure. It limits scalability and performance, and if the centralized component goes down, the whole system goes down with it.

Ceph does not follow this traditional storage architecture; it has been completely redesigned for next-generation storage. Ceph eliminates the centralized gateway by enabling clients to interact directly with the Ceph OSD daemons. The following diagram illustrates how clients connect to a Ceph cluster:

[Figure: Ceph clients communicating directly with the Ceph OSD daemons]

The Ceph OSD daemons create objects and their replicas on other Ceph nodes to ensure data safety and high availability. Ceph uses a cluster of monitors to remove centralization and ensure high availability. Ceph uses an algorithm called Controlled Replication Under Scalable Hashing (CRUSH); with the help of CRUSH, clients compute, on demand, where data should be written to or read from. In the following recipe, we will examine the details of the Ceph CRUSH algorithm.

Understanding the CRUSH mechanism

For data storage and management, Ceph uses the CRUSH algorithm, which is Ceph's intelligent data distribution mechanism. As we discussed in the previous recipe, traditional storage systems use a central metadata/index table to know where user data is stored. Ceph, on the other hand, uses the CRUSH algorithm to deterministically compute where data should be written to or read from. Instead of storing metadata, CRUSH computes it on demand, removing the need for a centralized server/gateway or broker. This enables Ceph clients to compute the metadata, which is also known as a CRUSH lookup, and communicate with OSDs directly.

For a read/write operation against the Ceph cluster, the client first contacts a Ceph monitor and retrieves a copy of the cluster map, which comprises five maps, namely the monitor, OSD, MDS, CRUSH, and PG maps; we will cover these maps later in this chapter. These cluster maps help the client know the state and configuration of the Ceph cluster. Next, the data is converted into an object using the object name and the pool name/ID. This object is then hashed with the number of PGs to generate the final PG within the required Ceph pool. The computed PG then goes through a CRUSH lookup function to determine the primary, secondary, and tertiary OSD locations to store or retrieve the data.

Once the client gets the exact OSD IDs, it contacts the OSDs directly and stores the data. All these compute operations are performed by the client; hence, they do not affect cluster performance. The following diagram illustrates the entire process:

[Figure: the CRUSH lookup, from object name and pool to PG to OSDs]
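You can observe the result of a CRUSH lookup from the command line: ceph osd map asks the cluster to compute the placement for an object without actually writing it (the object name below is only an example):

        # ceph osd map rbd test-object

The output reports the PG that the object name hashes to in the given pool, along with the up and acting OSD sets computed by CRUSH.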

CRUSH map internals

To know what is inside the CRUSH map, and to edit it conveniently, we need to extract and decompile it to convert it into a human-readable form. The following diagram illustrates this process:

[Figure: extracting, decompiling, editing, recompiling, and injecting the CRUSH map]

Changes to the CRUSH map are applied to the Ceph cluster dynamically; that is, as soon as the new CRUSH map is injected into the Ceph cluster, all the changes come into effect immediately, on the fly.

How to do it...

Let's now take a look at the CRUSH map of our Ceph cluster:

  1. Extract the CRUSH map from any of the monitor nodes:
        # ceph osd getcrushmap -o crushmap_compiled_file
  2. Once you have the CRUSH map, decompile it to convert it into a human-readable/editable form:
        # crushtool -d crushmap_compiled_file -o crushmap_decompiled_file

At this point, the output file, crushmap_decompiled_file, can be viewed/edited in your favorite editor. In the following recipe, we will learn how to make changes to the CRUSH map.

  3. Once the changes are done, compile them:

        # crushtool -c crushmap_decompiled_file -o newcrushmap

  4. Finally, inject the newly compiled CRUSH map into the Ceph cluster:

        # ceph osd setcrushmap -i newcrushmap

How it works...

Now that we know how to edit the Ceph CRUSH map, let's understand what is inside it. A CRUSH map file contains four main sections; they are as follows:

  • Devices: This section of the CRUSH map keeps a list of all the OSD devices in your cluster. The OSD is a physical disk corresponding to the ceph-osd daemon. To map a PG to an OSD device, CRUSH requires a list of OSD devices. This list appears at the beginning of the CRUSH map to declare the devices used in it. The following is a sample device list:
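A typical devices section of a decompiled CRUSH map looks like the following; the number of OSDs shown here is illustrative:

        # devices
        device 0 osd.0
        device 1 osd.1
        device 2 osd.2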
  • Bucket types: This defines the types of buckets used in your CRUSH hierarchy. Buckets consist of a hierarchical aggregation of physical locations (for example, rows, racks, chassis, hosts, and so on) and their assigned weights. They facilitate a hierarchy of nodes and leaves, where the node bucket represents a physical location and can aggregate other nodes and leaves buckets under the hierarchy. The leaf bucket represents the ceph-osd daemon and its underlying physical device. The following table lists the default bucket types:
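In a Jewel-era decompiled CRUSH map, these default bucket types typically appear as follows:

        # types
        type 0 osd
        type 1 host
        type 2 chassis
        type 3 rack
        type 4 row
        type 5 pdu
        type 6 pod
        type 7 room
        type 8 datacenter
        type 9 region
        type 10 root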

CRUSH also supports the creation of custom bucket types. These default bucket types can be removed, and new types can be introduced as needed.

  • Bucket instances: Once you define bucket types, you must declare bucket instances for your hosts. A bucket instance requires the bucket type, a unique name (string), a unique ID expressed as a negative integer, a weight relative to the total capacity of its item, a bucket algorithm (straw, by default), and the hash (0, by default, reflecting the CRUSH hash rjenkins1). A bucket may have one or more items, and these items may consist of other buckets or OSDs. The item should have a weight that reflects the relative weight of the item. The general syntax of a bucket type looks as follows:
        [bucket-type] [bucket-name] {
        id [a unique negative numeric ID]
        weight [the relative capacity the item]
        alg [ the bucket type: uniform | list | tree | straw |straw2]
        hash [the hash type: 0 by default]
        item [item-name] weight [weight]
        }

We will now briefly go through the parameters used by a CRUSH bucket instance:

    • bucket-type: It's the type of bucket, where we must specify the OSD's location in the CRUSH hierarchy.
    • bucket-name: A unique bucket name.
    • id: The unique ID, expressed as a negative integer.
    • weight: Ceph writes data evenly across the cluster disks, which helps in performance and better data distribution. This forces all the disks to participate in the cluster and make sure that all cluster disks are equally utilized, irrespective of their capacity. To do so, Ceph uses a weighting mechanism. CRUSH allocates weights to each OSD. The higher the weight of an OSD, the more physical storage capacity it will have. A weight is a relative difference between device capacities. We recommend using 1.00 as the relative weight for a 1 TB storage device. Similarly, a weight of 0.5 would represent approximately 500 GB, and a weight of 3.00 would represent approximately 3 TB.
    • alg: Ceph supports multiple algorithm bucket types for your selection. These algorithms differ from each other on the basis of performance and reorganizational efficiency. Let's briefly cover these bucket types:
      • uniform: The uniform bucket can be used if the storage devices have exactly the same weight. For non-uniform weights, this bucket type should not be used. The addition or removal of devices in this bucket type requires the complete reshuffling of data, which makes this bucket type less efficient.
      • list: The list buckets aggregate their contents as linked lists and can contain storage devices with arbitrary weights. In the case of cluster expansion, new storage devices can be added to the head of a linked list with minimum data migration. However, storage device removal requires a significant amount of data movement. So, this bucket type is suitable for scenarios under which the addition of new devices to the cluster is extremely rare or non-existent. In addition, list buckets are efficient for small sets of items, but they may not be appropriate for large sets.
      • tree: The tree buckets store their items in a binary tree. It is more efficient than list buckets because a bucket contains a larger set of items. Tree buckets are structured as a weighted binary search tree with items at the leaves. Each interior node knows the total weight of its left and right subtrees and is labeled according to a fixed strategy. The tree buckets are an all-around boon, providing excellent performance and decent reorganization efficiency.
      • straw: To select an item using list and tree buckets, a limited number of hash values need to be calculated and compared by weight. They use a divide and conquer strategy, which gives precedence to certain items (for example, those at the beginning of a list). This improves the performance of the replica placement process, but it introduces moderate reorganization when bucket contents change due to addition, removal, or re-weighting.

The straw bucket type allows all items to fairly compete for replica placement. In scenarios where removals are expected and reorganization efficiency is critical, straw buckets provide optimal migration behavior between subtrees. This bucket type lets all items compete fairly for replica placement through a process analogous to drawing straws.

      • straw2: This is an improved straw bucket that correctly avoids any data movement between items A and B, when neither A's nor B's weights are changed. In other words, if we adjust the weight of item C by adding a new device to it, or by removing it completely, the data movement will take place to or from C, never between other items in the bucket. Thus, the straw2 bucket algorithm reduces the amount of data migration required when changes are made to the cluster.
    • hash: Each bucket uses a hash algorithm. Currently, Ceph supports rjenkins1. Enter 0 as your hash setting to select rjenkins1.
    • item: A bucket may have one or more items. These items may consist of node buckets or leaves. Items may have a weight that reflects the relative weight of the item.

As an illustration of CRUSH bucket instances, consider three host bucket instances, where each host bucket instance consists of OSD buckets:

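A minimal sketch of what such bucket instances look like in a decompiled CRUSH map follows; the host names, IDs, and weights are illustrative:

        host ceph-node1 {
                id -2
                alg straw
                hash 0  # rjenkins1
                item osd.0 weight 0.010
                item osd.1 weight 0.010
                item osd.2 weight 0.010
        }
        host ceph-node2 {
                id -3
                alg straw
                hash 0  # rjenkins1
                item osd.3 weight 0.010
                item osd.4 weight 0.010
                item osd.5 weight 0.010
        }
        host ceph-node3 {
                id -4
                alg straw
                hash 0  # rjenkins1
                item osd.6 weight 0.010
                item osd.7 weight 0.010
                item osd.8 weight 0.010
        }
        root default {
                id -1
                alg straw
                hash 0  # rjenkins1
                item ceph-node1 weight 0.030
                item ceph-node2 weight 0.030
                item ceph-node3 weight 0.030
        }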
  • Rules: The CRUSH maps contain CRUSH rules that determine the data placement for pools. As the name suggests, these are the rules that define the pool properties and the way data gets stored in the pools. They define the replication and placement policy that allows CRUSH to store objects in a Ceph cluster. The default CRUSH map contains a rule for default pools, that is, rbd. The general syntax of a CRUSH rule looks as follows:
        rule <rulename> 
       {
        ruleset <ruleset>
        type [ replicated | erasure ]
        min_size <min-size>
        max_size <max-size>
        step take <bucket-type>
        step [choose|chooseleaf] [firstn] <num> <bucket-type>
        step emit
        }

We will now briefly go through the parameters used by a CRUSH rule:

    • ruleset: An integer value; it classifies a rule as belonging to a set of rules.
    • type: A string value; it's the type of pool that is either replicated or erasure coded.
    • min_size: An integer value; if a pool makes fewer replicas than this number, CRUSH will not select this rule.
    • max_size: An integer value; if a pool makes more replicas than this number, CRUSH will not select this rule.
    • step take: This takes a bucket name and begins iterating down the tree.
    • step choose firstn <num> type <bucket-type>: This selects the number (N) of buckets of a given type, where the number (N) is usually the number of replicas in the pool (that is, pool size):
      • If num == 0, select N buckets
      • If num > 0 && < N, select num buckets
      • If num < 0, select N - num buckets

For example: step choose firstn 1 type row

In this example, num = 1; assuming the pool size is 3, CRUSH evaluates this condition as 1 > 0 && 1 < 3. Hence, it will select one bucket of type row.

    • step chooseleaf firstn <num> type <bucket-type>: This first selects a set of buckets of a bucket type, and then chooses the leaf node from the subtree of each bucket in the set of buckets. The number of buckets in the set (N) is usually the number of replicas in the pool:
      • If num == 0, select N buckets
      • If num > 0 && < N, select num buckets
      • If num < 0, select N - num buckets

For example: step chooseleaf firstn 0 type row

In this example, num = 0; assuming the pool size is 3, CRUSH evaluates this condition as 0 == 0 and selects a set of row-type buckets such that the set contains three buckets. It then chooses a leaf node from the subtree of each of those buckets; in this way, CRUSH selects three leaf nodes.

    • step emit: This first outputs the current value and empties the stack. This is typically used at the end of a rule, but it may also be used to form different trees in the same rule.
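Putting these parameters together, the default replicated rule that ships with a Jewel-era CRUSH map looks roughly like this (reproduced here as a sketch):

        rule replicated_ruleset {
                ruleset 0
                type replicated
                min_size 1
                max_size 10
                step take default
                step chooseleaf firstn 0 type host
                step emit
        }

It takes the default root bucket, selects one leaf (OSD) under a distinct host for each replica, and emits the result.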

CRUSH tunables

In Ceph, the developers calculate the placement of data by enhancing the CRUSH algorithm over time. To support these changes in behavior, they introduced a series of CRUSH tunable options. These options control whether the improved variation of the algorithm or the legacy one is used. Both Ceph servers and clients must support the new version of CRUSH in order to use the new tunables.

For this reason, the Ceph developers named the CRUSH tunable profiles after the Ceph release that introduced them. For example, the Firefly release supports the firefly tunables, which do not work with older clients. As soon as a given set of tunables is changed from the legacy default behavior, ceph-osd and ceph-mon will prevent old clients, which do not support the new CRUSH features, from connecting to the cluster.

For more information, visit http://docs.ceph.com/docs/jewel/rados/operations/crush-map/#tunables.

The evolution of CRUSH tunables

In the following sections, we will explain the evolution of the CRUSH tunables.

Argonaut – legacy

The legacy CRUSH tunables from Argonaut behave fine for some clusters, as long as not many OSDs have been marked out of the cluster, because marking OSDs out can cause problems with rebalancing the data correctly.

Bobtail – CRUSH_TUNABLES2

The Bobtail profile fixes several CRUSH problems:

  • In CRUSH hierarchies with a small number of devices in a bucket, such as a host bucket with one to three OSDs under it, PGs may get mapped to fewer than the desired number of replicas
  • In larger Ceph clusters with several hierarchy layers (row, rack, host, osd), it is possible for a small number of PGs to get mapped to fewer than the desired number of OSDs
  • If an OSD gets marked out in Bobtail, the data usually gets rebalanced to nearby OSDs in the same bucket instead of across the entire CRUSH hierarchy

The following are the new tunables:

  • choose_local_tries: The number of local retries is given by this tunable. The legacy and optimal values are 2 and 0, respectively.
  • choose_local_fallback_tries: The legacy and optimal values are 5 and 0, respectively.
  • choose_total_tries: The total number of attempts to make when trying to choose an item. The legacy value was 19; further testing has shown that this is too low a value, and a more appropriate value for a typical cluster is 50. For very large clusters, a bigger value might be necessary to properly choose an item.
  • chooseleaf_descend_once: Whether a recursive chooseleaf attempt will retry, or only try once and allow the original placement to retry. The legacy default is 0 and the optimal value is 1:
    • Migration impact: A moderate amount of data movement is triggered if we move from Argonaut version to Bobtail version. We will have to be cautious on a cluster that is already populated with data.

Firefly – CRUSH_TUNABLES3

The Firefly profile fixes a problem where the chooseleaf CRUSH rule behavior, which is responsible for PG mapping, comes up with too few results when too many OSDs have been marked out of the cluster, leaving those PGs unable to be mapped.

The following are the new tunables:

  • chooseleaf_vary_r: Whether a recursive chooseleaf attempt starts with a non-zero value of r, based on the number of attempts the parent has already made. The legacy default is 0, but with that value CRUSH is sometimes unable to find a mapping, which can leave PGs in an unmapped state. The optimal value (in terms of computational cost and correctness) is 1:
    • Migration impact: For existing clusters that hold a lot of data, changing from 0 to 1 will cause a lot of data to move; a value of 4 or 5 will still allow CRUSH to find a valid mapping while moving less data.
  • straw_calc_version: This tunable resolves an issue where items in the CRUSH map with a weight of 0, or a mix of different weights in straw buckets, would lead CRUSH to distribute data incorrectly throughout the cluster. A value of 0 preserves the old, broken internal weight calculation; a value of 1 fixes the behavior:
    • Migration impact: Moving to straw_calc_version 1 and then adjusting a straw bucket (adding, removing, or reweighting an item, or using the reweight-all command) triggers a small to moderate amount of data movement if the cluster has hit one of the problematic conditions. This tunable is special in that it has no impact at all on the kernel version required on the client side.

Hammer – CRUSH_V4

The Hammer tunable profile does not affect the mapping of an existing CRUSH map simply by changing the profile; it requires a manual change to the CRUSH map that enables the new straw2 bucket type on the CRUSH buckets:

  • straw2: The straw2 bucket resolves several initial limitations of the original straw bucket algorithm. The major change is that with the initial straw buckets, changing the weight of a bucket item would lead to multiple PG mapping changes of other bucket items outside the item that was actually reweighted. straw2 will allow only changing mappings to or from the bucket that was actually reweighted:
    • Migration impact: Changing a bucket type from straw to straw2 will result in a fairly small amount of data movement, depending on how much the bucket item weights vary from each other. When all the weights are same, no data will move, and when item weights vary considerably there will be more movement.

Jewel – CRUSH_TUNABLES5

The Jewel profile improves the overall behavior of CRUSH by limiting the number of PG mapping changes when an OSD is marked out of the cluster.

The following is the new tunable:

  • chooseleaf_stable: A recursive chooseleaf attempt will use a better value for an inner loop that greatly reduces the number of mapping changes when an OSD is marked out. The legacy value is 0, while the new value of 1 uses the new approach:
    • Migration impact: Changing this value on an existing cluster will result in a very large amount of data movement as almost every PG mapping is likely to change

Ceph and kernel versions that support given tunables

The following are the Ceph and kernel versions that support the given tunables:

Tunable            Ceph and kernel versions that support it
CRUSH_TUNABLES     Argonaut series, v0.48.1 or later; v0.49 or greater;
                   Linux kernel v3.6 or later (for the filesystem and RBD kernel clients)
CRUSH_TUNABLES2    v0.55 or later, including the Bobtail series (v0.56.x);
                   Linux kernel v3.9 or later (for the filesystem and RBD kernel clients)
CRUSH_TUNABLES3    v0.78 (Firefly) or later;
                   CentOS 7.1, Linux kernel v3.15 or later (for the filesystem and RBD kernel clients)
CRUSH_V4           v0.94 (Hammer) or later;
                   CentOS 7.1, Linux kernel v4.1 or later (for the filesystem and RBD kernel clients)
CRUSH_TUNABLES5    v10.2.0 (Jewel) or later;
                   CentOS 7.3, Linux kernel v4.5 or later (for the filesystem and RBD kernel clients)

Warning when tunables are non-optimal

Starting with v0.74, the Ceph cluster will issue a health warning if the CRUSH tunables currently in effect are not optimal for the Ceph version that is currently running.

To remove this warning from the Ceph cluster, you can adjust the tunables on the existing cluster. Adjusting the CRUSH tunables will result in some data movement (possibly as much as 10% of the data on the cluster). This is clearly the preferred route, but care should be taken on a production cluster, as any data movement may affect current cluster performance.
You can enable the optimal tunables with the following command:

ceph osd crush tunables optimal

If performance problems arise because of the data-movement load caused by rebalancing after the tunables change, or if you run into a client compatibility problem (old kernel cephfs or rbd clients, or pre-Bobtail librados clients), you can switch back to the legacy tunables with the following command:

ceph osd crush tunables legacy

A few important points

Adjusting the CRUSH tunables will result in some PGs shifting between storage nodes. If the Ceph cluster already contains a lot of data, be prepared for a significant amount of PG movement as a result of a CRUSH tunables change.

Once the monitor and OSD daemons receive the updated map, they will start requiring the newly enabled CRUSH features for every new connection. Clients that are already connected to the cluster are grandfathered in; this will lead to misbehavior if those clients (kernel version, Ceph version) do not support the newly enabled features. If you choose to set the CRUSH tunables to optimal, confirm that all Ceph nodes and clients are running the same version:

  • If your CRUSH tunables are set to non-legacy values and are then reverted back to the legacy defaults, the OSD daemons will not be required to support the feature. Please note that the OSD peering process does require reviewing and comprehending old maps, so you should not run old versions of Ceph if the cluster has previously used non-legacy CRUSH tunables, even if the latest maps were reverted to the legacy default values. It is very important to validate that all OSDs are running the same Ceph version.
    The simplest way to adjust the CRUSH tunables is by changing to a known profile. Those are:
    • legacy: The legacy profile gives the legacy behavior from Argonaut and previous versions
    • argonaut: The argonaut profile gives the legacy values that are supported by the original Argonaut release
    • bobtail: The values that are supported by the Bobtail release are given by the bobtail profile
    • firefly: The values that are supported by Firefly release are given by the firefly profile
    • hammer: The values that are supported by the Hammer release are given by the hammer profile
    • jewel: The values that are supported by the Jewel release are given by the jewel profile
    • optimal: The optimal profile gives the best/optimal values of the current Ceph version
    • default: The default values of a new cluster are given by the default profile

You can select a profile on a running cluster with the following command:

ceph osd crush tunables {PROFILE}
Note that this may result in some data movement.

You can check the current profile on a running cluster with the following command:

ceph osd crush show-tunables        

Ceph cluster map

The Ceph monitors are responsible for monitoring the health of the entire cluster, as well as maintaining the cluster membership state, the state of peer nodes, and cluster configuration information. The Ceph monitors perform these tasks by maintaining a master copy of the cluster map. The cluster map comprises the monitor, OSD, PG, CRUSH, and MDS maps; all of these maps are collectively known as the cluster map. Let's take a quick look at the function of each map:

  • Monitor map: It holds end-to-end information about the monitor nodes, which includes the Ceph cluster ID, the monitor hostnames, and their IP addresses with port numbers. It also stores the current epoch of the map and its creation and last-changed times. You can check your cluster's monitor map by executing the following:
        # ceph mon dump
  • OSD map: It stores some common fields, such as cluster ID, epoch for OSD map creation and last changed, and information related to pools, such as pool names, pool ID, type, replication level, and PGs. It also stores OSD information such as count, state, weight, last clean interval, and OSD host information. You can check your cluster's OSD maps by executing the following:
        # ceph osd dump
  • PG map: It holds the PG version, timestamp, last OSD map epoch, full ratio, and
    near full ratio information. It also keeps track of each PG ID, object count, state,
    state stamp, up and acting OSD sets, and finally, the scrub details. To check your cluster PG map, execute the following:
        # ceph pg dump
  • CRUSH map: It holds information on your clusters devices, buckets, failure domain hierarchy, and the rules defined for the failure domain when storing data. To check your cluster CRUSH map, execute the following:
        # ceph osd crush dump
  • MDS map: This stores information on the current MDS map epoch, map creation and modification time, data and metadata pool ID, cluster MDS count, and the MDS state. To check your cluster MDS map, execute the following:
        # ceph mds dump

High availability monitors

Ceph monitors do not store or serve data to clients; they serve updated cluster maps to clients as well as to the other cluster nodes. Clients and other cluster nodes periodically check with the monitors for the most recent copy of the cluster map. A Ceph client must contact a Ceph monitor and obtain a current copy of the cluster map before it can read or write data.

A Ceph storage cluster can operate with a single monitor, but this introduces the risk of a single point of failure; that is, if the monitor node goes down, Ceph clients will not be able to read or write data. To overcome this, a typical Ceph cluster consists of a cluster of Ceph monitors. A multi-monitor Ceph architecture uses the Paxos algorithm to develop a quorum and provide consensus for distributed decision-making in the cluster. The monitor count in the cluster should be an odd number; the bare minimum is one monitor node, and the recommended count is three. Since monitors operate in a quorum, more than half of the monitor nodes should always be available to avoid split-brain problems. Out of all the cluster monitors, one operates as the leader; the other monitor nodes are entitled to become the leader if the leader monitor becomes unavailable. A production cluster must have at least three monitor nodes to provide high availability.
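You can inspect the monitor quorum at any time and see which monitor is currently acting as the leader, for example:

        # ceph mon stat
        # ceph quorum_status --format json-pretty

The quorum_status output lists the monitors that are currently in quorum along with the quorum leader.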

Ceph authentication and authorization

In this recipe, we will cover the authentication and authorization mechanisms used by Ceph. Users are either individuals or system actors, such as applications, that use Ceph clients to interact with the Ceph storage cluster daemons. The following diagram illustrates this flow:

[Figure: users and applications accessing the Ceph cluster through Ceph clients]

Ceph provides two authentication modes. They are as follows:

  • none: With this mode, any user can access the Ceph cluster without authentication. This mode is disabled by default. Cryptographic authentication, which includes encrypting and decrypting user keys, has some computational costs. You can disable the Ceph authentication if you are very sure that your network infrastructure is secure, the clients/Ceph cluster nodes have established trust, and you want to save some computation by disabling authentication. However, this is not recommended, and you might be at risk of a man-in-the-middle attack. Still, if you are interested in disabling the Ceph authentication, you can do it by adding the following parameters in the global section of your Ceph configuration file on all the nodes, followed by the Ceph service restart:
       auth cluster required = none
       auth service required = none
       auth client required = none
  • cephx: Ceph provides its Cephx authentication system to authenticate users and daemons in order to identify users and protect against man-in-the-middle attacks. The Cephx protocol works similar to Kerberos to some extent and allows clients to access the Ceph cluster. It's worth knowing that the Cephx protocol does not do data encryption. In a Ceph cluster, the Cephx protocol is enabled by default. If you have disabled Cephx by adding the preceding auth options to your cluster configuration file, then you can enable Cephx in two ways. One is to simply remove all auth entries from the cluster configuration file, which are none, or you can explicitly enable Cephx by adding the following options in the cluster configuration file and restarting the Ceph services:
        auth cluster required = cephx
        auth service required = cephx
        auth client required = cephx

Now that we have covered the different authentication modes of Ceph, let's understand how authentication and authorization work in Ceph.

Ceph authentication

To access the Ceph cluster, an actor/user/application invokes a Ceph client to contact the cluster's monitor nodes. A Ceph cluster usually has more than one monitor, and a Ceph client can connect to any monitor node to initiate the authentication process. This multi-monitor architecture of Ceph removes any single-point-of-failure situation during the authentication process.

To use cephx, an administrator, that is, client.admin, must first create a user account on the Ceph cluster. To create a user account, the client.admin user invokes the ceph auth get-or-create-key command. The Ceph authentication subsystem generates a username and a secret key, stores this information on the Ceph monitors, and returns the user's secret key to the client.admin user that invoked the user-creation command. The Ceph system administrator should share this username and secret key with the Ceph client that wants to use the Ceph storage service in a secure manner. The following diagram visualizes this entire process:

[Figure: cephx user creation and key distribution]

In the previous section, we looked at the user creation process and how the user's keys are stored across the cluster nodes. We will now examine how Ceph authenticates a user and allows access to the cluster nodes.

To access the Ceph cluster, the client first contacts a Ceph monitor node and passes only its username. The cephx protocol works in such a way that both parties are able to prove to each other that they have a copy of the key without actually revealing it. This is why the client sends only its username and not its secret key.

The monitor generates a session key for the user and encrypts it with the secret key associated with that user. The encrypted session key is transmitted back to the client by the monitor. The client then decrypts the payload with its secret key to retrieve the session key. This session key remains valid for that user for the current session.

Using the session key, the client requests a ticket from the Ceph monitor. The Ceph monitor verifies the session key, generates a ticket, encrypts it with the user's secret key, and transmits it to the user. The client decrypts the ticket and uses it to sign requests to the OSDs and metadata servers throughout the cluster.

The cephx protocol authenticates the ongoing communication between the Ceph nodes and clients. After the initial authentication, every message sent between a client and a Ceph node is signed with a ticket that the metadata servers, OSDs, and monitors verify with their shared secret. Cephx tickets do expire, so an attacker cannot use an expired ticket or session key to gain access to the Ceph cluster. The following diagram illustrates the entire authentication process described here:

[Figure: the cephx authentication workflow]

Ceph authorization

In the previous recipe, we covered the authentication process used by Ceph. In this recipe, we will examine its authorization process. Once a user is authenticated, they are authorized for different types of access, activities, or roles. Ceph uses the term capabilities, abbreviated as caps. Capabilities are the rights a user is granted, and they define the level of access they have to operate the cluster. The capability syntax looks as follows:

{daemon-type} 'allow {capability}' [{daemon-type} 'allow {capability}']

A detailed explanation of the capability syntax is as follows:

  • Monitor caps: These include the r, w, and x parameters and allow profile {cap}. For example:
        mon 'allow rwx' or mon 'allow profile osd'
  • OSD caps: Includes r, w, x, class-read, class-write, and profile OSD. For example:
        osd 'allow rwx' or osd 'allow class-read, allow rwx pool=rbd'
  • MDS caps: Only requires allow. For example:
        mds 'allow'

Let's understand each capability:

  • allow: This implies rw only for MDS.
  • r: This gives the user read access, which is required with the monitor to read CRUSH maps.
  • w: This gives the user write access to objects.
  • x: This gives the user the ability to call class methods, including read and write, and also, the rights to perform auth operations on monitors.
Ceph can be extended by creating shared object classes called Ceph classes. Ceph can load .so classes stored in the osd class dir directory. With a class, you can create new object methods that are able to call the native methods in the Ceph object store; for example, the objects you define in your class can call native Ceph methods such as read and write.
  • class-read: This is a subset of x that allows users to call class read methods.
  • class-write: This is a subset of x that allows users to call class write methods.
  • *: This gives users full permission (r, w, and x) on a specific pool as well as to execute admin commands.
  • profile osd: This allows users to connect as an OSD to other OSDs or monitors. Used for the OSD heartbeat traffic and status reporting.
  • profile mds: This allows users to connect as an MDS to other MDSs.
  • profile bootstrap-osd: This allows users to bootstrap an OSD. For example, ceph-deploy and ceph-disk tools use the client.bootstrap-osd user, which has permission to add keys and bootstrap an OSD.
  • profile bootstrap-mds: This allows the user to bootstrap the metadata server. For example, the ceph-deploy tool uses the client.bootstrap-mds user to add keys and bootstrap the metadata server.

A user can be an individual or an application, such as cinder/nova in OpenStack. Creating users allows you to control who can access your Ceph storage cluster, its pools, and the data within those pools. In Ceph, a user has a type, which is always client, and an ID, which can be any name. So the valid username syntax in Ceph is TYPE.ID, that is, client.<name>, for example, client.admin or client.cinder.

How to do it…

In the following recipe, we will discuss Ceph user management in more detail by running some commands:

  1. To list the users in your cluster, execute the following command:
        # ceph auth list

The output of this command shows that, for each daemon type, Ceph creates a user with different capabilities. It also lists the client.admin user, which is the cluster administrator user.

  2. To retrieve a specific user, for example, client.admin, execute the following:
        # ceph auth get client.admin
  3. Create a user, client.rbd:
        # ceph auth get-or-create client.rbd

This creates a user, client.rbd, with no capabilities, and a user without caps is of no use.

  4. Add capabilities to the client.rbd user and list the user's capabilities:
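For example, the following grants client.rbd read access on the monitors and full access to the rbd pool, and then prints the user's entry again; the specific capabilities granted here are illustrative:

        # ceph auth caps client.rbd mon 'allow r' osd 'allow rwx pool=rbd'
        # ceph auth get client.rbd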

I/O path from a Ceph client to a Ceph cluster

Let's quickly recap how a client accesses the Ceph cluster. To perform a write operation against the Ceph cluster, clients get the latest copy of the cluster map from a Ceph monitor (if they do not already have it). The cluster map provides information about the layout of the Ceph cluster. The client then writes/reads the object, which is stored in a Ceph pool. The pool selects the OSDs based on the CRUSH ruleset of that pool. The following diagram illustrates the entire process:

[Figure: the I/O path from a Ceph client to the Ceph cluster]

Now, let's understand how data is stored inside the Ceph cluster. Ceph stores data in logical partitions known as pools. These pools contain multiple PGs, which in turn contain objects. Ceph is a truly distributed storage system, in which every object is replicated and stored on different OSDs every time. This mechanism is explained with the help of the following diagram, in which we try to show how objects are stored in a Ceph cluster:

[Figure: objects stored across pools, PGs, and OSDs]

Ceph Placement Group

A Placement Group (PG) is a logical collection of objects that are replicated on OSDs to provide reliability in the storage system. Depending on the replication level of the Ceph pool, each PG is replicated and distributed over multiple OSDs of the Ceph cluster. You can think of a PG as a logical container holding multiple objects, such that this logical container is mapped to multiple OSDs:

[Figure: a placement group mapped to multiple OSDs]

PGs are vital to the scalability and performance of a Ceph storage system. Without PGs, it would be difficult to manage and track tens of millions of objects that are replicated and spread over hundreds of OSDs. Managing these objects without PGs would also come at a computational cost. Instead of managing every object individually, the system only has to manage PGs, each containing a large number of objects. This makes Ceph a more manageable and less complex system.

Each PG requires some system resources, as it has to manage multiple objects. The number of PGs in a cluster should be calculated carefully, which will be discussed later in this book. Usually, increasing the number of PGs in your cluster rebalances the OSD load. A PG count of 50 to 100 per OSD is recommended to avoid high resource utilization on the OSD nodes. As the amount of data on a Ceph cluster grows, you may need to tune the cluster by adjusting the PG count. CRUSH manages the relocation of PGs in the most optimal way when devices are added to or removed from the cluster.
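A commonly quoted rule of thumb from the Ceph documentation (a rough guide only; the exact calculation is discussed later in the book) is:

        Total PGs = (number of OSDs x 100) / replica count

For example, a cluster with 9 OSDs and a replica count of 3 gives (9 x 100) / 3 = 300, which would then be rounded up to the next power of two (512) and distributed across the pools.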

We have now learned that a Ceph PG stores its data on multiple OSDs for reliability and high availability. These OSDs are referred to as the primary, secondary, tertiary, and so on, and they belong to a set known as the acting set of that PG. For each PG acting set, the first OSD is the primary, and the following ones are the secondary and tertiary OSDs.

How to do it…

To understand this better, let's find out the acting set of a PG from our Ceph cluster:

  1. Add a temporary object named hosts to the rbd pool:
        # rados put -p rbd hosts /etc/hosts
  2. Check the PG name for the object hosts:
        # ceph osd map rbd hosts

If you observe the output, the placement group (0.e) has an up set of [2,4,3] and an acting set of [2,4,3]. So, here osd.2 is the primary OSD, and osd.4 and osd.3 are the secondary and tertiary OSDs, respectively. The primary OSD is the only OSD that accepts write operations from clients. For reads, the default is also to read from the primary OSD; however, we can change this behavior by setting a read affinity.

The OSDs that are up remain in the up set as well as in the acting set. Once the primary OSD goes down, it is first removed from the up set and then from the acting set. The secondary OSD is then promoted to become the primary OSD. Ceph recovers the PGs of the failed OSD onto a new OSD and then adds it to the up and acting sets to ensure high availability. In a Ceph cluster, an OSD can be the primary OSD for some PGs, while at the same time being the secondary or tertiary OSD for other PGs.
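If you want to recheck the up and acting sets of this PG later, you can query the PG map directly using the PG ID from the earlier output (0.e in this example):

        # ceph pg map 0.e

This prints the current osdmap epoch along with the up and acting OSD sets for that PG.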

Placement Group states

Ceph PGs may exhibit several states, depending on what is happening inside the cluster at the time. To know the state of the PGs, you can look at the output of the ceph status command. In this recipe, we will cover these different PG states and understand what each of them actually means:

  • Creating: The PG is being created. This generally happens when pools are being created or when PGs are increased for a pool.
  • Active: All PGs are active, and requests to the PG will be processed.
  • Clean: All objects in the PG are replicated the correct number of times.
  • Down: A replica with necessary data is down, so the PG is offline (down).
  • Replay: The PG is waiting for clients to replay operations after an OSD has crashed.
  • Splitting: The PG is being split into multiple PGs. Usually, a PG attains this state when PGs are increased for an existing pool. For example, if you increase the PGs of a pool rbd from 64 to 128, the existing PGs will split, and some of their objects will be moved to new PGs.
  • Scrubbing: The PG is being checked for inconsistencies.
  • Degraded: Some objects in the PG are not replicated as many times as they are supposed to be.
  • Inconsistent: The PG's replicas are not consistent. For example, an object has the wrong size, or objects are missing from one replica after recovery has finished.
  • Peering: The PG is undergoing the peering process, in which it's trying to bring the OSDs that store the replicas of the PG into agreement about the state of the objects and metadata in the PG.
  • Repair: The PG is being checked, and any inconsistencies found will be repaired (if possible).
  • Recovering: Objects are being migrated/synchronized with replicas. When an OSD goes down, its contents may fall behind the current state of other replicas in the PGs. So, the PG goes into a recovering state and objects will be migrated/synchronized with replicas.
  • Backfill: When a new OSD joins the cluster, CRUSH will reassign PGs from existing OSDs in the cluster to the newly added OSD. Once the backfilling is complete, the new OSD will begin serving requests when it is ready.
  • Backfill-wait: The PG is waiting in line to start backfill.
  • Incomplete: A PG is missing a necessary period of history from its log. This generally occurs when an OSD that contains needed information fails or is unavailable.
  • Stale: The PG is in an unknown state—the monitors have not received an update for it since the PG mapping changed. When you start your cluster, it is common to see the stale state until the peering process completes.
  • Remapped: When the acting set that services a PG changes, the data migrates from the old acting set to the new acting set. It may take some time for a new primary OSD to service requests. So, it may ask the old primary OSD to continue to service requests until the PG migration is complete. Once data migration completes, the mapping uses the primary OSD of the new acting set.

The following are two more new PG states that were added in the Jewel release for the snapshot trimming feature:

  • snaptrim: The PGs are currently being trimmed
  • snaptrim_wait: The PGs are waiting to be trimmed
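To see which of these states your PGs are currently in, the following commands give a quick summary; the second one lists PGs stuck in a problematic state (you can also pass inactive or stale instead of unclean):

        # ceph pg stat
        # ceph pg dump_stuck unclean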

Creating Ceph pools on specific OSDs

A Ceph cluster typically consists of several nodes with multiple disk drives, and these disk drives can be of mixed types. For example, your Ceph nodes might contain disks of types such as SATA, NL-SAS, SAS, SSD, or even PCIe. Ceph gives you the flexibility to create pools on specific drive types. For example, you can create a high-performance SSD pool from a set of SSD disks, or you can create a high-capacity, low-cost pool using SATA disk drives.

In this recipe, we will see how to create a pool named ssd-pool backed by SSD disks, and another pool named sata-pool backed by SATA disks. To achieve this, we will edit the CRUSH map and make the necessary configuration changes.

The Ceph cluster that we have deployed and used throughout this book is hosted on virtual machines and does not have real SSD disks backing it. We will therefore assume, for learning purposes, that some of the virtual disks are SSD disks. Nothing changes if you perform this exercise on a real Ceph cluster backed by SSD disks.

For the following demonstration, we will assume that osd.0, osd.3, and osd.6 are SSD disks, and we will create the SSD pool on these disks. Similarly, we will assume that osd.1, osd.5, and osd.7 are SATA disks, which will host the SATA pool.

How to do it...

Let's begin the configuration:

  1. Get the current CRUSH map and decompile it:
        # ceph osd getcrushmap -o crushmapdump
        # crushtool -d crushmapdump -o crushmapdump-decompiled
  2. Edit the crushmapdump-decompiled CRUSH map file and add the following section after the root default section:
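Under the assumptions made earlier (osd.0, osd.3, and osd.6 as SSDs; osd.1, osd.5, and osd.7 as SATA disks), the added section would look roughly as follows; the bucket IDs and weights are illustrative:

        root ssd {
                id -5
                alg straw
                hash 0
                item osd.0 weight 0.010
                item osd.3 weight 0.010
                item osd.6 weight 0.010
        }

        root sata {
                id -6
                alg straw
                hash 0
                item osd.1 weight 0.010
                item osd.5 weight 0.010
                item osd.7 weight 0.010
        }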
  3. Create the CRUSH rule by adding the following rules under the rule section of the CRUSH map, and then save and exit the file:
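A sketch of such rules follows; the rule names are illustrative, but the later steps assume that the SSD rule ends up as ruleset 1 and the SATA rule as ruleset 2:

        rule ssd-pool {
                ruleset 1
                type replicated
                min_size 1
                max_size 10
                step take ssd
                step chooseleaf firstn 0 type osd
                step emit
        }

        rule sata-pool {
                ruleset 2
                type replicated
                min_size 1
                max_size 10
                step take sata
                step chooseleaf firstn 0 type osd
                step emit
        }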
  4. Compile and inject the new CRUSH map in the Ceph cluster:
        # crushtool -c crushmapdump-decompiled -o crushmapdump-compiled
        # ceph osd setcrushmap -i crushmapdump-compiled
Add the osd_crush_update_on_start=false option in the [global] or [osd] section of ceph.conf on all OSD nodes, so that in the future, if any OSD node or OSD is restarted, it will keep using the custom CRUSH map and will not update it back to the default.
  5. Once the new CRUSH map has been applied to the Ceph cluster, check the OSD tree view for the new arrangement, and notice the ssd and sata root buckets:
        # ceph osd tree
  6. Create and verify the ssd-pool.
Since this is a small cluster hosted on virtual machines, we will create these pools with only a few PGs.
    1. Create the ssd-pool:
                # ceph osd pool create ssd-pool 8 8
    2. Verify the ssd-pool; notice that the crush_ruleset is 0, which is the default:
                # ceph osd dump | grep -i ssd
    3. Let's change the crush_ruleset to 1 so that the new pool gets created on the SSD disks:
                # ceph osd pool set ssd-pool crush_ruleset 1
    4. Verify the pool and notice the change in crush_ruleset:
                # ceph osd dump | grep -i ssd
  7. Similarly, create and verify sata-pool:
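The commands mirror the ssd-pool ones; assuming the SATA rule is ruleset 2, as in the sketch above:

        # ceph osd pool create sata-pool 8 8
        # ceph osd pool set sata-pool crush_ruleset 2
        # ceph osd dump | grep -i sata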
  8. Let's add some objects to these pools:
    1. Since these pools are new, they should not contain any objects, but let's verify this by using the rados list command:
                # rados -p ssd-pool ls
                # rados -p sata-pool ls
    2. We will now add an object to these pools using the rados put command. The syntax would be rados -p <pool_name> put <object_name> <file_name>:
                # rados -p ssd-pool put dummy_object1 /etc/hosts
                # rados -p sata-pool put dummy_object1 /etc/hosts
    3. Using the rados list command, list these pools. You should get the object names that we stored in the last step:
                # rados -p ssd-pool ls
                # rados -p sata-pool ls
  9. Now, the interesting part of this entire section is to verify that the objects are getting stored on the correct set of OSDs:
    1. For the ssd-pool, we have used the OSDs 0, 3, and 6. Check the osd map for ssd-pool using the syntax ceph osd map <pool_name> <object_name>:
                # ceph osd map ssd-pool dummy_object1
    2. Similarly, check the object from sata-pool:
                # ceph osd map sata-pool dummy_object1

As the preceding output shows, the object created on the ssd-pool is actually stored on the OSD set [3,0,6], and the object created on the sata-pool is stored on the OSD set [1,7,4]. This output is expected, and it verifies that the pools we created are using the correct set of OSDs, as we intended. This type of configuration can be very useful in a production setup, where you want a fast pool based only on SSDs and a medium/lower-performance pool based on spinning disks.