
Reading Notes on Ceph Cookbook, Second Edition: Production Planning and Performance Tuning for Ceph

Production Planning and Performance Tuning for Ceph

In this chapter, we will cover the following recipes:

  • The dynamics of capacity, performance, and cost
  • Choosing hardware and software components for Ceph
  • Ceph recommendations and performance tuning
  • Ceph erasure-coding
  • Creating an erasure-coded pool
  • Ceph cache tiering
  • Creating a pool for cache tiering
  • Creating a cache tier
  • Configuring a cache tier
  • Testing a cache tier
  • Cache tiering – possible dangers in production environments

Introduction

In this chapter, we will cover some very interesting topics related to Ceph. These include hardware and software recommendations, performance tuning of the Ceph components (namely Ceph MON and OSD) and of clients, including operating system tuning. Finally, we will look at Ceph erasure-coding and cache tiering, covering the different techniques for both.

The dynamics of capacity, performance, and cost

Ceph is a software-defined storage solution designed to run on commodity hardware. This ability makes it a flexible and economical solution tailored to your needs. Since all of Ceph's intelligence lives in its software, it needs a good set of hardware to make it a great overall storage package.

Ceph hardware selection requires careful planning based on your storage requirements and your use case. Organizations need an optimized hardware configuration that allows them to start small and scale up to several petabytes. The following diagram shows several factors used to determine the optimal configuration of a Ceph cluster:

[Figure: factors that determine the optimal configuration of a Ceph cluster]

Different organizations have different storage workloads and usually need a middle ground between performance, capacity, and TCO. Ceph is unified storage, meaning it can provide file, block, and object storage from the same cluster. Ceph can also provide different types of storage pools for different workloads within the same cluster. This ability allows organizations to tailor their storage infrastructure to their needs. There are several ways to define your storage requirements; the following diagram shows one way of doing so:

[Figure: one way of categorizing storage requirements (IOPS, throughput, and capacity optimized)]

The other approaches are as follows:

  • IOPS optimized: The highlight of this type of configuration is that it has the highest IOPS (I/O operations per second) with low TCO (Total Cost of Ownership) per I/O. It is typically implemented using high-performance nodes containing faster SSD disks, PCIe SSD, NVMe, and so on, for data storage. It is generally used for block storage, however, you can use it for other workloads that require a high IOPS. These deployments are suitable for cloud computing applications such as MySQL or MariaDB instances as virtual machines running on OpenStack.
  • Throughput optimized: Its highlights include the highest throughput and a low cost per throughput. It is typically implemented using SSD disks and PCIe SSD for OSD journals, with a high bandwidth, physically-separated dual network. It is mostly used for block storage. If your use case requires a high-performance object or file storage, then you should consider this. These deployments are suitable for serving up large amounts of data, such as graphics, audio, and video content.
  • Capacity optimized: Its highlights include a low cost per TB and a low cost per rack unit of physical space in the data center. It is also known as economic storage, cheap storage, and archival/long-term storage, and is typically implemented using dense servers full of spinning disks, usually 36 to 72, with 4 TB to 6 TB of physical disk space per server. It is generally used for low cost, large storage capacity objects or filesystem storage. It is a good candidate for using erasure-coding to maximize the usable capacity. These deployments are suitable for storing backup data for a long amount of time.

Choosing hardware and software components for Ceph

As mentioned earlier, Ceph hardware selection requires careful planning based on your environment and storage needs. The type of hardware components, the network infrastructure, and the cluster design are some of the key factors you should consider during the initial phase of Ceph storage planning. There is no golden rule for Ceph hardware selection, as it depends on various factors such as budget, performance versus capacity (or both), fault tolerance level, and the use case.

Ceph is hardware agnostic; organizations are free to select any hardware of their choice based on budget, performance or capacity requirements, or use case. They have full control over their storage cluster and the underlying infrastructure. Also, one of the advantages of Ceph is that it supports heterogeneous hardware. You can mix hardware brands when creating your Ceph cluster infrastructure. For example, while building a Ceph cluster, you can mix hardware from different manufacturers such as HP, Dell, Supermicro, and so on, and even off-the-shelf hardware, which can lead to significant cost savings.

You should keep in mind that hardware selection for Ceph depends on the workload you plan to put on your storage cluster, the environment, and the features you will be using. In this recipe, we will learn some general practices for selecting hardware for your Ceph cluster.

Processor

The Ceph monitor daemon maintains the cluster maps and does not serve any data to clients, so it is lightweight and does not have very strict processor requirements. In most cases, an ordinary single-core server processor will do the job for a Ceph monitor. On the other hand, the Ceph MDS is more resource hungry; it requires significantly more CPU processing power, with a quad-core or better CPU. For small Ceph clusters or proof-of-concept environments, you can colocate the Ceph monitor with other Ceph components such as the OSD, the RADOS Gateway, or even the Ceph MDS. For medium to large environments, the Ceph monitors should not be shared; they should be hosted on dedicated machines.

Ceph OSD daemons require a fair amount of processing power, as they serve data to clients. To estimate the CPU requirement for Ceph OSDs, it's important to know how many OSDs the server will host. It's generally recommended that each OSD daemon should have a minimum of 1 GHz of one CPU core. You can use the following formula to estimate the OSD CPU requirement:

((CPU sockets*CPU cores per socket*CPU clock speed in GHz) / No. Of OSD) >=1

For example, a server with a single-socket, six-core, 2.5 GHz CPU should be enough to host 12 Ceph OSDs, where each OSD will get roughly 1.25 GHz of computing power:

((1*6*2.5)/12) = 1.25

Here are some more processor examples for Ceph OSD nodes:

  • Intel® Xeon® Processor E5-2620 v4 (2.10 GHz, 8 cores):
    1*8*2.10 = 16.8, meaning that this is good for a Ceph node with up to 16 OSDs

  • Intel® Xeon® Processor E5-2680 v4 (2.40 GHz, 14 cores):
    1*14*2.40 = 33.6, meaning that this is good for a Ceph node with up to 33 OSDs

If you intend to use the Ceph erasure-coding feature, it is beneficial to get a more powerful CPU, as erasure-coding operations require more processing power. When sizing the CPU for use with Ceph erasure-coding, it is better to overestimate the required processing power than to underestimate it.

If you intend to use Ceph erasure-coded pools, it is useful to get a more powerful CPU, since Ceph OSDs hosting erasure-coded pools will use more CPU than Ceph OSDs hosting replicated pools.

Memory

Monitor and MDS daemons need to serve their data quickly, so they should have enough memory for fast processing. The rule of thumb is 2 GB or more of memory per daemon instance; this should be good for the Ceph MDS and for monitors in smaller Ceph clusters. Larger Ceph clusters should increase this amount. The Ceph MDS depends heavily on data caching; since it needs to serve data quickly, it requires plenty of RAM. The more RAM the Ceph MDS has, the better the performance of CephFS will be.

OSDs generally require a decent amount of physical memory. For an average workload, 1 GB of memory per OSD daemon instance should suffice. However, from a performance point of view, 2 GB per OSD daemon is a good choice, and having more memory also helps during recovery and improves caching. This recommendation assumes that you are using one OSD daemon per physical disk. If you use more than one physical disk per OSD, the memory requirement grows as well. Generally, more physical memory is better, since memory consumption increases significantly during cluster recovery. It is worth knowing that OSD memory consumption grows with the raw capacity of the underlying physical disk, so the memory requirement for an OSD backed by a 6 TB disk will be higher than for one backed by a 4 TB disk. You should make this decision wisely so that memory does not become a bottleneck for cluster performance.

It is usually cost-effective to over-allocate CPU and memory at an early stage of cluster planning, since you can always add more JBOD-style physical disks to the same host, if it has enough system resources, rather than buying an entirely new node, which is rather expensive.

Network

Ceph is a distributed storage system and relies heavily on the underlying network infrastructure. If you want your Ceph cluster to be reliable and performant, make sure your network is designed for it. It is recommended that all cluster nodes have two redundant, separate networks for cluster and client traffic.

Set up the cluster and client networks on separate NICs.

For a small proof-of-concept, or a test Ceph cluster of a few nodes, a 1 Gbps network speed should work fine. If you have a medium to large cluster (several tens of nodes), you should think about 10 Gbps or more of network bandwidth. The network plays a vital role during recovery and rebalancing. If your network connection provides 10 Gbps or more of bandwidth, your cluster will recover quickly; otherwise, it might take some time. So, from a performance point of view, dual 10 Gbps or higher networks are a good choice. A well-designed Ceph cluster uses two physically separated networks: one for the cluster network (the internal network) and another for the client network (the external network). Both of these networks should be physically separated all the way from the servers to the network switches and everything in between, as shown in the following diagram:

[Figure: physically separated cluster (internal) and client (external) networks]

Another topic of debate regarding the network is whether to use Ethernet or InfiniBand, or, more precisely, a 10 G network versus a 40 G or higher-bandwidth network. It depends on several factors, such as the workload, the size of the Ceph cluster, the density and number of Ceph OSD nodes, and so on. In several deployments, I have seen customers using both 10 G and 40 G networks with Ceph. In those cases, the Ceph clusters ranged over a few petabytes and a few hundred nodes, and they used a 10 G client network and a high-bandwidth, low-latency 40 G internal cluster network. However, the price of Ethernet networking keeps coming down; based on your use case, you can decide the kind of network you want.

For network optimization, you can choose to use jumbo frames for a better CPU/bandwidth ratio. If you intend to use jumbo frames, remember to verify that all the interconnecting network equipment is configured to handle jumbo frames. Enabling jumbo frames is discussed in a later recipe in this chapter.

A 1 Gbps network is not suitable for a production cluster. For example, in the case of a drive failure, replicating 1 TB of data across a 1 Gbps network takes 3 hours, and 3 TB (a typical drive configuration) takes 9 hours. With a 10 Gbps network, the replication times are 20 minutes for 1 TB and 1 hour for 3 TB.

Disk

The performance and economics of a Ceph cluster both depend heavily on an effective choice of storage media. Before selecting storage media for your Ceph cluster, you should understand your workload and the possible performance requirements. Ceph uses storage media in two ways: the OSD journal part and the OSD data part. As explained in earlier chapters, every write operation in Ceph is currently a two-step process. When an OSD receives a request to write an object, it first writes that object to the journal part of the OSDs in the acting set and sends a write acknowledgement to the client. Soon after, the journal data is synced to the data partition. It's worth knowing that replication is also an important factor in write performance. The replication factor is usually a trade-off between reliability, performance, and TCO. In this way, all cluster performance revolves around the OSD journal and data partitions.

Partitioning the Ceph OSD journal

If your workload is performance-centric, then it is recommended that you use SSDs. By using SSDs, you can significantly improve throughput by reducing the access time and write latency. To use SSDs as journals, we create multiple logical partitions on each physical SSD, such that each SSD logical partition (journal) is mapped to one OSD data partition. In this case, the OSD data partition lives on a spinning disk and its journal lives on the faster SSD partition. The following diagram illustrates this configuration:

[Figure: SSD journal partitions mapped to OSD data partitions on spinning disks]

In this type of setup, you should keep in mind not to overload your SSDs by hosting more journals on them than they can handle. Generally, a journal size of 10 GB to 20 GB should be enough for most cases; however, if you have a larger SSD, you can create a larger journal device. In that case, don't forget to increase the filestore maximum and minimum sync intervals for the OSDs.

The two most common types of non-volatile fast storage used with Ceph are SATA or SAS SSDs, and PCIe or NVMe SSDs. To get good performance out of your SATA/SAS SSDs, your SSD-to-OSD ratio should be 1:4, that is, one SSD shared by four OSD data disks. For PCIe or NVMe flash devices, depending on the device performance, the SSD-to-OSD ratio can vary from 1:12 to 1:18, that is, one flash device shared by 12 to 18 OSD data disks.

The SSD-to-OSD ratios mentioned here are very general and work well in most cases. However, I would recommend that you test your SSD/PCIe devices against your specific workload and environment to get the most out of them.

The downside of using a single SSD for multiple journals is that if you lose an SSD hosting multiple journals, all the OSDs associated with that SSD will fail, and you might lose your data. However, you can overcome this by using RAID 1 for the journals, but that will increase your storage cost. Also, the cost per GB of SSDs is nearly ten times higher than that of HDDs, so building your cluster with SSDs increases the per-GB cost of your Ceph cluster. However, if you are looking for a significant performance improvement from your Ceph cluster, it's worth investing in SSDs for the journals.

We have learned a lot about SSD journals, and we know that they can help improve write performance. However, if you are not concerned about extreme performance, and the cost per TB is your deciding factor, then you should consider configuring the journal and data partitions on the same hard drive. This means that, out of your large spinning disk, you will allocate a few GB for the OSD journal and use the remaining capacity of the same drive for the OSD data. Such a setup might not perform as well as an SSD-journal-based setup, but the TCO per TB of storage will be considerably lower.

In general, the established formula for the OSD-to-journal ratio is:

Journal number = (SSD seq write speed) / (spinning disk seq write speed)

The preceding formula generally yields around four to five spinning drives per SSD journal disk. This means that, based on the formula, a single SSD disk can be used to host the journals of roughly four to five OSDs.
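As a quick worked example with assumed drive speeds (say, an SSD sustaining about 450 MB/s of sequential writes and a spinning disk sustaining about 110 MB/s):

Journal number = 450 / 110 ≈ 4

So one such SSD would be a reasonable journal device for about four OSDs.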

Partitioning Ceph OSD data

The OSDs are the real workhorses that store all the data. In a production environment, you should use enterprise-, cloud-, or archive-class hard disk drives in your Ceph cluster. Typically, desktop-class HDDs are not well suited to a production Ceph cluster. The reason is that, in a Ceph cluster, several hundred spinning HDDs are mounted in close proximity, and the combined rotational vibration can become a challenge for desktop-class HDDs. This increases the disk failure rate and can affect overall performance. Enterprise-class HDDs are purpose-built to handle vibration, and they themselves generate very little rotational vibration. Also, their mean time between failures (MTBF) is significantly higher than that of desktop-class HDDs.

Another consideration for Ceph OSD data disks is the interface, that is, SATA or SAS. NL-SAS HDDs have dual 12 Gbps SAS ports, and they generally perform better than single-port 6 Gbps SATA interface HDDs. Also, dual SAS ports provide redundancy and can allow simultaneous reads and writes. Another aspect of SAS devices is that they have a lower unrecoverable read error (URE) rate compared to SATA drives. The lower the URE, the fewer the scrubbing errors and placement group repair operations.

The density of Ceph OSD nodes is also an important factor in cluster performance, usable capacity, and TCO. Generally, it is better to have a larger number of smaller nodes than a few large-capacity nodes, but this is always debatable. You should select the density of your Ceph OSD nodes such that one node accounts for less than 10% of the total cluster size.

For example, in a 1 PB Ceph cluster, you should avoid using 4 x 250 TB OSD nodes, where each node makes up 25% of the cluster capacity. Instead, you can have 13 x 80 TB OSD nodes, where each node is less than 10% of the cluster capacity. However, this might increase your TCO and could affect several other factors in your cluster planning.

Properly planning Ceph OSD node density helps prevent prolonged cluster recovery times during OSD node failures. The cluster can recover faster when an OSD node hosting 10% of the cluster's data fails than when one hosting 25% of the cluster's data fails!

Operating system

Ceph is a software-defined system that runs on top of a Linux-based operating system. Ceph supports most of the major Linux distributions. As of now, the valid choices of operating system for running a Ceph cluster are RHEL, CentOS, Fedora, Debian, Ubuntu, openSUSE, and SLES. For the Linux kernel version, it's recommended that you deploy Ceph on a fairly recent release of the Linux kernel. We also recommend deploying it on releases with long-term support (LTS). At the time of writing this book, Linux kernel v3.16.3 or later is recommended and is a good starting point. It's a good idea to keep an eye on http://docs.ceph.com/docs/master/start/os-recommendations. According to the documentation, CentOS 7 and Ubuntu 14.04 are tier-1 distributions, with comprehensive functional, regression, and stress test suites run on them continuously, and RHEL 7 is without doubt the best choice if you are using the enterprise Red Hat Ceph Storage product.

OSD filesystem

The Ceph OSD daemon runs on top of a filesystem, which can be XFS, ext4, or even Btrfs. However, choosing the right filesystem for the Ceph OSD is a critical factor, because the OSD daemon relies heavily on the stability and performance of the underlying filesystem. Apart from stability and performance, the filesystem also provides extended attributes (XATTRs), which the Ceph OSD daemon takes advantage of. XATTRs provide internal information about the object state, snapshots, metadata, and ACLs to the Ceph OSD daemon, which helps with data management.

That's why the underlying filesystem should provide sufficient capacity for XATTRs. Btrfs provides larger XATTR metadata, which is stored with the file. XFS has a relatively large limit (64 KB) that most deployments will not encounter, but ext4 is too small to be usable. If you are using the ext4 filesystem for your Ceph OSDs, you should always add the setting filestore xattr use omap = true to the [osd] section of the ceph.conf file. The choice of filesystem is quite important for production workloads, and with respect to Ceph, these filesystems differ from each other in various ways, as follows:

  • XFS: XFS is a reliable, mature, and very stable filesystem, which is recommended for production usage of Ceph clusters. However, XFS stands lower when compared to Btrfs. XFS has small performance issues in terms of metadata scaling. Also, XFS is a journaling filesystem, that is, each time a client sends data to write to a Ceph cluster, it is first written to a journaling space and then to an XFS filesystem. This increases the overhead of writing the same data twice, and thus makes the XFS perform slower when compared to Btrfs, which does not use journals. XFS is the currently recommended OSD filesystem for production Ceph workloads.
  • Btrfs: The OSD, with the Btrfs filesystem underneath, delivers the best performance when compared to XFS and ext4 filesystem-based OSDs. One of the major advantages of using Btrfs is that it supports copy-on-write and writable snapshots. With the Btrfs filesystem, Ceph uses parallel journaling, that is, Ceph writes to the OSD journal and OSD data in parallel, which boosts write performance. It also supports transparent compression and pervasive checksums, and it incorporates multi-device management in a filesystem. It has an attractive feature: online FSCK. However, despite these new features, Btrfs is currently not production-ready, but it remains a good candidate for test deployments.
  • Ext4: The fourth extended filesystem (ext4) is also a journaling filesystem that is production-ready for Ceph OSD. However, we don't recommend using ext4 due to limitations in terms of the size of the XATTRs it can store, and the problems this will cause with the way Ceph handles long RADOS object names. These issues will generally not surface with Ceph clusters using only short object names (RBD for example), but other client users like RGW that make use of long object names will have issues. Starting with Jewel, the ceph-osd daemon will not start if the configured max object name cannot be safely stored on ext4. If the cluster is planned to be used only with short object names (RBD usage only), then you can set the following configuration options to continue using ext4:
        osd max object name len = 256
        osd max object namespace len = 64
Don't confuse the Ceph journal with the filesystem journal (XFS, ext4); they are different things. Ceph does its own journaling when writing to the filesystem, and the filesystem in turn does journaling when writing the data to the underlying disk.

Ceph recommendations and performance tuning

In this recipe, we will look at some performance tuning parameters for a Ceph cluster. These cluster-wide configuration parameters are defined in the Ceph configuration file, so that each time any Ceph daemon starts, it respects the defined settings. By default, the configuration file is named ceph.conf and is located in the /etc/ceph directory. This configuration file has a global section as well as several sections for each service type. Whenever a Ceph service type starts, it applies the configuration defined under the [global] section as well as under its daemon-specific section. A Ceph configuration file has multiple sections, as shown in the following diagram:

[Figure: sections of a Ceph configuration file]

We will now discuss the role of each section of the configuration file; a combined example is shown after the list:

  • Global section: The global section of the cluster configuration file begins with the [global] keyword. All the settings defined under this section apply to all the daemons of the Ceph cluster. The following is an example of a parameter defined in the [global] section:
        public network = 192.168.0.0/24
  • Monitor section: The settings defined under the [mon] section of the configuration file are applied to all the Ceph monitor daemons in the cluster. The parameter defined under this section overrides the parameters defined under the [global] section. The following is an example of a parameter usually defined under the [mon] section:
        mon initial members = ceph-mon1
  • OSD section: The settings defined in the [osd] section are applied to all the Ceph OSD daemons in the Ceph cluster. The configuration defined under this section overrides the same setting defined under the [global] section. The following is an example of the settings in this section:
        osd mkfs type = xfs
  • MDS section: The settings defined in the [mds] section are applied to all the Ceph MDS daemons in the Ceph cluster. The configuration defined under this section overrides the same setting defined under the [global] section. The following is an example of the settings in this section:
        mds cache size = 250000
  • Client section: The settings defined under the [client] section are applied to all the Ceph clients. The configuration defined under this section overrides the same setting defined under the [global] section. The following is an example of the settings in this section:
        rbd cache size = 67108864
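Put together, a minimal ceph.conf skeleton built only from the example parameters listed above might look as follows (all values are illustrative):

        [global]
        public network = 192.168.0.0/24

        [mon]
        mon initial members = ceph-mon1

        [osd]
        osd mkfs type = xfs

        [mds]
        mds cache size = 250000

        [client]
        rbd cache size = 67108864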
If utilizing Ansible as a deployment tool for Ceph, then Ansible will manage the ceph.conf file for the cluster. Be sure that any custom configuration settings applied to specific sections are properly updated on the Ansible management node in the /usr/share/ceph-ansible/group_vars/all.yml file, under the ceph_conf_overrides section, as discussed in previous chapters. If this is not done, any management done using Ansible will wipe out the custom configuration settings in your ceph.conf file!

In the next recipe, we will learn some tips for tuning Ceph cluster performance. Performance tuning is a vast topic that requires an understanding of Ceph as well as of the other components of the storage stack. There is no silver bullet for performance tuning; it depends largely on the underlying infrastructure and your environment.

Tuning global clusters

Global parameters should be defined in the [global] section of your Ceph cluster configuration file, as follows:

  • Network: It's recommended that you use two physically separated networks for your Ceph cluster, which are referred to as public and cluster networks respectively. Earlier in this chapter, we covered the need for two different networks. Let's now understand how we can define them in a Ceph configuration:
    • public network: Use this syntax to define the public network public network = {public network / netmask}:
                public network = 192.168.100.0/24
    • cluster network: Use this syntax to define the cluster network cluster network = {cluster network / netmask}:
                 cluster network = 192.168.1.0/24
  • max open files: If this parameter is in place and the Ceph cluster starts, it sets the maximum open file descriptors at the OS level. This keeps OSD daemons from running out of the file descriptors. The default value of this parameter is zero, but you can set it as up to a 64 bit integer:
                max open files = 131072
  • osd pool default min size: This is the replication level in a degraded state, which should be set lower than the osd pool default size value. It sets the minimum number of replicas for objects in the pool required to acknowledge a write operation from clients when the cluster is degraded. If the minimum number of replicas is not met, Ceph will not acknowledge the write to the client. In production environments where data consistency is vital, this is recommended to be 2. The default value is 0:
               osd pool default min size = 2
  • osd pool default pg / osd pool default pgp: Make sure that the cluster has a realistic number of placement groups. The recommended value of placement groups per OSD is 100. Use this formula to calculate the PG count: (Total number of OSD * 100)/number of replicas.
    For 10 OSDs and a replica size of three, the PG count should be under (10*100)/3 = 333:
        osd pool default pg num = 128
        osd pool default pgp num = 128

As mentioned earlier, the PG and PGP numbers should be kept the same. The PG and PGP values vary a lot with cluster size. The configuration mentioned above should not harm your cluster, but you may want to think about these values before applying them. You should know that these parameters do not change the PG and PGP numbers of existing pools; they are applied only when you create a new pool without specifying PG and PGP values.
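For example, you can check the placement group count of an existing pool, or create a new pool with a count derived from the formula above (the pool names here are only examples):

        # ceph osd pool get rbd pg_num
        # ceph osd pool create testpool 128 128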

  • osd pool default crush rule: The default CRUSH ruleset to use when creating a pool:
        osd pool default crush rule = 0
  • Disable in memory logs: Each Ceph subsystem has a logging level for its output logs, and it logs in-memory. We can set different values for each of these subsystems by setting a log file level and a memory level for debug logging on a scale of 1 to 20, where 1 is terse and 20 is verbose. The first setting is the log level and the second setting is the memory level. You must separate them with a forward slash as follows:
         debug <subsystem> = <log-level>/<memory-level>

The default logging levels are good enough for your cluster, unless you find that the memory-level logs are affecting your performance or memory consumption. In that case, you can try disabling in-memory logging. To disable the default values of the in-memory logs, add the following parameters (a runtime alternative is shown after the list):

        debug_default = 0/0
        debug_lockdep = 0/0
        debug_context = 0/0
        debug_crush = 0/0
        debug_mds = 0/0
        debug_mds_balancer = 0/0
        debug_mds_locker = 0/0
        debug_mds_log = 0/0
        debug_mds_log_expire = 0/0
        debug_mds_migrator = 0/0
        debug_buffer = 0/0
        debug_timer = 0/0
        debug_filer = 0/0
        debug_objecter = 0/0
        debug_rados = 0/0
        debug_rbd = 0/0
        debug_journaler = 0/0
        debug_objectcacher = 0/0
        debug_client = 0/0
        debug_osd = 0/0
        debug_optracker = 0/0
        debug_objclass = 0/0
        debug_filestore = 0/0
        debug_journal = 0/0
        debug_ms = 0/0
        debug_monc = 0/0
        debug_tp = 0/0
        debug_auth = 0/0
        debug_finisher = 0/0
        debug_heartbeatmap = 0/0
        debug_perfcounter = 0/0
        debug_asok = 0/0
        debug_throttle = 0/0
        debug_mon = 0/0
        debug_paxos = 0/0
        debug_rgw = 0/0
        debug_javaclient = 0/0 
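These debug settings can also be changed on a running cluster without restarting any daemons, for example with injectargs (a sketch; adjust the list of subsystems to the ones you actually want to silence):

        # ceph tell osd.* injectargs '--debug_osd 0/0 --debug_ms 0/0'
        # ceph tell mon.* injectargs '--debug_mon 0/0 --debug_paxos 0/0'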

Tuning Monitor

Monitor tuning parameters should be defined in the [mon] section of your Ceph cluster configuration file, as follows (a runtime check is shown after the list):

  • mon_osd_down_out_interval: This is the number of seconds Ceph waits before marking a Ceph OSD daemon as down and out if it doesn't respond. This option comes in handy when your OSD nodes crash and reboot by themselves or after some short glitch in the network. You don't want your cluster to start rebalancing as soon as the problem comes, rather, for it to wait for a few minutes and see if the problem gets fixed (default 300):
        mon_osd_down_out_interval = 600
  • mon_allow_pool_delete: To avoid the accidental deletion of the Ceph pool, set this parameter as false. This can be useful if you have many administrators managing the Ceph cluster, and you do not want to take any risks with client data (default true):
        mon_allow_pool_delete = false
  • mon_osd_min_down_reporters: The Ceph OSD daemon can report to MON about its peer OSDs if they are down, by default this value is two. With this option, you can change the minimum number of Ceph OSD daemons required to report a down Ceph OSD to the Ceph monitor. In a large cluster, it's recommended that you have this value larger than the default; three should be a good number:
        mon_osd_min_down_reporters = 3
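You can confirm what a running monitor is actually using through its admin socket; a minimal sketch, assuming a monitor named ceph-mon1 as in the earlier example:

        # ceph daemon mon.ceph-mon1 config get mon_osd_down_out_interval
        # ceph daemon mon.ceph-mon1 config get mon_osd_min_down_reporters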

OSD tuning

In this recipe, we will cover the settings that should be defined in the [osd] section of the Ceph cluster configuration file.

OSD general settings

The following settings allow the Ceph OSD daemon to determine the filesystem type, the mount options, and some other useful settings:

  • osd_mkfs_options_xfs: At the time of OSD creation, Ceph will use these xfs options to create the OSD filesystem:
        osd_mkfs_options_xfs = "-f -i size=2048"
  • osd_mount_options_xfs: It supplies the xfs filesystem mount options to OSD. When Ceph is mounting an OSD, it will use the following options for OSD filesystem mounting:

        osd_mount_options_xfs = "rw,noatime,largeio,
        inode64,swalloc,logbufs=8,logbsize=256k,delaylog,allocsize=4M"
  • osd_max_write_size: The maximum size in MB an OSD can write at a time:
        osd_max_write_size = 256
  • osd_client_message_size_cap: The largest client data message in bytes that is allowed in memory:
        osd_client_message_size_cap = 1073741824
  • osd_map_dedup: Remove duplicate entries in the OSD map:
        osd_map_dedup = true
  • osd_op_threads: The number of threads to service the Ceph OSD daemon operations. Set it to zero to disable it. Increasing the number may increase the request processing rate:
        osd_op_threads = 16
  • osd_disk_threads: The number of disk threads that are used to perform background disk intensive OSD operations such as scrubbing and snap trimming:
        osd_disk_threads = 1
  • osd_disk_thread_ioprio_class: It is used in conjunction with osd_disk_thread_ioprio_priority. This tunable can change the I/O scheduling class of the disk thread, and it only works with the Linux kernel CFQ scheduler. The possible values are idle, be, or rt:
    • idle: The disk thread will have a lower priority than any other thread in the OSD. It is useful when you want to slow down the scrubbing on an OSD that is busy handling client requests.
    • be: The disk threads have the same priority as other threads in the OSD.
    • rt: The disk thread will have more priority than all the other threads. This is useful when scrubbing is much needed, and it can be prioritized at the expense of client operations:
                osd_disk_thread_ioprio_class = idle
  • osd_disk_thread_ioprio_priority: It's used in conjunction with osd_disk_thread_ioprio_class. This tunable can change the I/O scheduling priority of the disk thread ranging from 0 (highest) to 7 (lowest). If all OSDs on a given host are in class idle and are competing for I/O and not doing many operations, this parameter can be used to lower the disk thread priority of one OSD to 7 so that another OSD with a priority of zero can potentially scrub faster. Like the osd_disk_thread_ioprio_class, this also works with the Linux kernel CFQ scheduler:
        osd_disk_thread_ioprio_priority = 0
osd_disk_thread_ioprio_class and osd_disk_thread_ioprio_priority only take effect when the scheduler is CFQ. The scheduler can be changed at runtime without affecting I/O operations; this is discussed in a later recipe in this chapter.
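When CFQ is in use, these two values can also be injected into running OSDs instead of waiting for a restart; a sketch:

        # ceph tell osd.* injectargs '--osd_disk_thread_ioprio_class idle --osd_disk_thread_ioprio_priority 7'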


OSD journal settings

The Ceph OSD daemon supports the following journal configuration options (a runtime check follows the list):

  • osd journal size: Ceph's default OSD journal size value is zero; you should use the osd_journal_size parameter to set the journal size. The journal size should be at least twice the product of the expected drive speed and filestore max sync interval. If you are using SSD journals, it's usually good to create journals larger than 10 GB and increase the filestore minimum/maximum sync intervals:
        osd_journal_size = 20480
  • journal_max_write_bytes: The maximum number of bytes the journal can write at once:

        journal_max_write_bytes = 1073714824
  • journal_max_write_entries: The maximum number of entries the journal can write at once:
        journal_max_write_entries = 10000
  • journal queue max ops: The maximum number of operations allowed in the journal queue at a given time:
        journal queue max ops = 50000
  • journal queue max bytes: The maximum number of bytes allowed in the journal queue at a given time:
        journal queue max bytes = 10485760000
  • journal_dio: This enables direct I/O to the journal. It requires journal block align to be set to true:
        journal_dio = true
  • journal_aio: This enables the use of libaio for asynchronous writes to the journal. It requires journal_dio to be set to true:
        journal_aio = true
  • journal block align: This block aligns write operations. It's required for dio and aio.
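To check which journal settings a running OSD is actually using, you can query its admin socket; a sketch using osd.0 as an example:

        # ceph daemon osd.0 config get osd_journal_size
        # ceph daemon osd.0 config get journal_max_write_bytes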

OSD filestore settings

The following are some of the filestore settings that can be configured for the Ceph OSD daemon:

  • filestore merge threshold: This is the minimum number of subdirectories before merging them into a parent directory. A negative value can be set here to disable subdirectory merging:
        filestore merge threshold = 50
  • filestore_split_multiple: The maximum number of files in a subdirectory before splitting it into child directories:
        filestore_split_multiple = 12
  • filestore xattr use omap: Uses the object map for XATTRs. Needs to be set to true for ext4 filesystems:
        filestore xattr use omap = true
  • filestore sync interval: In order to create a consistent commit point, the filestore needs to quiesce write operations and do a syncfs() operation, which syncs data from the journal to the data partition and thus frees the journal. A more frequently performed sync operation reduces the amount of data that is stored in a journal. In such cases, the journal becomes underutilized. Configuring less frequent syncs allows the filesystem to coalesce small writes better, and we might get improved performance. The following parameters define the minimum and maximum time period between two syncs:
        filestore_min_sync_interval = 10
        filestore_max_sync_interval = 15
  • filestore_queue_max_ops: The maximum number of operations that a filestore can accept before blocking new operations from joining the queue:
        filestore_queue_max_ops = 2500
  • filestore_queue_max_bytes: The maximum number of bytes in an operation:
        filestore_queue_max_bytes = 10485760
  • filestore_queue_committing_max_ops: The maximum number of operations the filestore can commit:
        filestore_queue_committing_max_ops = 5000
 
  • filestore_queue_committing_max_bytes: The maximum number of bytes the filestore can commit:
        filestore_queue_committing_max_bytes = 10485760000

OSD recovery settings

These settings should be used when you prefer recovery over performance, or vice versa. If your Ceph cluster is unhealthy and under recovery, you might not get its normal performance, as the OSDs will be busy with recovery. If you still prefer performance over recovery, you can reduce the recovery priority to keep the OSDs less occupied with recovery. You can also set these values if you want a quick recovery of your cluster, helping the OSDs perform recovery faster (a runtime example follows the list):

  • osd_recovery_max_active: The number of active recovery requests per OSD at a given moment:
        osd_recovery_max_active = 1
  • osd_recovery_max_single_start: This is used in conjunction with osd_recovery_max_active. To understand this, let's assume osd_recovery_max_single_start is equal to 1, and osd_recovery_max_active is equal to three. In this case, it means that the OSD will start a maximum of one recovery operation at a time, out of a total of three operations active at that time:
        osd_recovery_max_single_start = 1
  • osd_recovery_op_priority: This is the priority set for the recovery operation. This is relative to osd client op priority. The higher the number, the higher the recovery priority:
        osd_recovery_op_priority = 1
  • osd client op priority: This is the priority set for client operation. The higher the number, the higher the client operation priority:
        osd client op priority = 63
  • osd_recovery_max_chunk: The maximum size of a recovered chuck of data in bytes:
        osd_recovery_max_chunk = 1048576
  • osd_recovery_threads: The number of threads needed for recovering data:
        osd_recovery_threads = 1
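During planned maintenance, these recovery limits are often tightened temporarily at runtime rather than by editing ceph.conf; a sketch:

        # ceph tell osd.* injectargs '--osd_recovery_max_active 1 --osd_recovery_op_priority 1'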

OSD backfilling settings

OSD backfill settings allow Ceph to run backfill operations at a lower priority than read and write requests:

  • osd_max_backfills: The maximum number of backfills allowed to or from a single OSD:
        osd_max_backfills = 2
  • osd_backfill_scan_min: The minimum number of objects per backfill scan:
        osd_backfill_scan_min = 8
  • osd_backfill_scan_max: The maximum number of objects per backfill scan:
        osd_backfill_scan_max = 64

OSD scrubbing settings

OSD scrubbing is important for maintaining data integrity, but it can reduce performance. You can adjust the following settings to increase or decrease scrubbing operations:

  • osd_max_scrubs: The maximum number of simultaneous scrub operations for a Ceph OSD daemon:
        osd_max_scrubs = 1
  • osd_scrub_sleep: The time in seconds that scrubbing sleeps between two consecutive scrubs:
        osd_scrub_sleep = .1
  • osd_scrub_chunk_min: The minimum number of data chunks an OSD should perform scrubbing on:
        osd_scrub_chunk_min = 1
  • osd_scrub_chunk_max: The maximum number of data chunks an OSD should perform scrubbing on:
        osd_scrub_chunk_max = 5
  • osd_deep_scrub_stride: The read size in bytes while doing a deep scrub:
        osd_deep_scrub_stride = 1048576
  • osd_scrub_begin_hour: The earliest hour that scrubbing can begin. This is used in conjunction with osd_scrub_end_hour to define a scrubbing time window:
        osd_scrub_begin_hour = 19
  • osd_scrub_end_hour: This is the upper bound when the scrubbing can be performed. This works in conjunction with osd_scrub_begin_hour to define a scrubbing time window:
        osd_scrub_end_hour = 7
A scrub or deep scrub that starts before osd_scrub_end_hour may run past the intended end hour. The end hour is simply the cut-off time for starting new scrubs on the cluster.
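If you need to stop scrubbing entirely during a maintenance window, rather than just restricting its hours, you can set and later clear the cluster-wide scrub flags; for example:

        # ceph osd set noscrub
        # ceph osd set nodeep-scrub
        # ceph osd unset noscrub
        # ceph osd unset nodeep-scrub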

Tuning the client

Client tuning parameters should be defined under the [client] section of your Ceph configuration file. Usually, this [client] section should also be present in the Ceph configuration file hosted on the client node. The parameters are as follows (a combined example is shown after the list):

  • rbd_cache: Enables caching for the RADOS Block Device (RBD):
        rbd_cache = true
  • rbd_cache_writethrough_until_flush: Starts out in write-through mode, and switches to writeback after the first flush request is received:
        rbd_cache_writethrough_until_flush = true
  • rbd_concurrent_management_ops: The number of concurrent management operations that can be performed on rbd:
        rbd_concurrent_management_ops = 10
  • rbd_cache_size: The rbd cache size in bytes:
        rbd_cache_size = 67108864 #64M
  • rbd_cache_max_dirty: The limit in bytes at which the cache should trigger a writeback. It should be less than rbd_cache_size:
        rbd_cache_max_dirty = 50331648 #48M
  • rbd_cache_target_dirty: The dirty target before the cache begins writing data to the backing store:
        rbd_cache_target_dirty = 33554432 #32M
  • rbd_cache_max_dirty_age: The number of seconds that the dirty data is in the cache before writeback starts:
        rbd_cache_max_dirty_age = 2
  • rbd_default_format: This uses the second RBD format, which is supported by librbd and the Linux kernel (since version 3.11). This adds support for cloning and is more easily extensible, allowing for more features in the future:
        rbd_default_format = 2
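Collected into the [client] section of ceph.conf on the client node, these settings would look roughly as follows (values taken from the list above):

        [client]
        rbd_cache = true
        rbd_cache_writethrough_until_flush = true
        rbd_cache_size = 67108864
        rbd_cache_max_dirty = 50331648
        rbd_cache_target_dirty = 33554432
        rbd_cache_max_dirty_age = 2
        rbd_default_format = 2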

Tuning the operating system

In the previous recipes, we covered tuning parameters for Ceph MON, OSD, and clients. In this recipe, we will cover some general tuning parameters that can be applied to the operating system, as follows:

  • kernel.pid_max: This is a Linux kernel parameter that is responsible for the maximum number of threads and process IDs. By default, the Linux kernel has a relatively small kernel.pid_max value. You should configure this parameter with a higher value on Ceph nodes hosting several OSDs, typically more than 20 OSDs. This setting helps spawn multiple threads for faster recovery and rebalancing. To use this parameter, execute the following command from the root user:
        # echo 4194303 > /proc/sys/kernel/pid_max
  • file-max: This is the maximum number of open files on a Linux system. It's generally a good idea to have a larger value for this parameter:
        # echo 26234859 > /proc/sys/fs/file-max
  • disk read_ahead: The read_ahead parameter speeds up disk read operations by fetching data beforehand and loading it into random access memory. Setting a relatively higher value for read_ahead will benefit clients performing sequential read operations.

    Suppose that the disk vda is an RBD mounted on a client node. Check its read_ahead value with the following command; this is the default in most cases:

        # cat /sys/block/vda/queue/read_ahead_kb

To set read_ahead to a higher value, that is, 8 MB, for the vda RBD, execute the following command:

        # echo "8192" > /sys/block/vda/queue/read_ahead_kb

The read_ahead setting is used on Ceph clients that have RBDs mounted. To get a read performance boost, you can set it to several MB on all your RBD devices, depending on your hardware.
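A quick way to apply the same value to every mapped RBD device on a client is a small shell loop; a sketch that assumes kernel-mapped devices appearing as /dev/rbd*:

        # for dev in /sys/block/rbd*; do echo 8192 > "$dev/queue/read_ahead_kb"; done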

  • Virtual memory: Due to the highly I/O-focused profile, swap usage can cause the entire server to become unresponsive. A low swappiness value is recommended for high I/O workloads. Set vm.swappiness to 0 in /etc/sysctl.conf to prevent this:

        # echo "vm.swappiness=0" >> /etc/sysctl.conf 
  • min_free_kbytes: This provides the minimum number of KB to keep free across the system. You can keep 1% to 3% of the total system memory free with min_free_kbytes by running the following command:
        # echo 262144 > /proc/sys/vm/min_free_kbytes
 
  • zone_reclaim_mode: This allows someone to set more or less aggressive approaches to reclaim memory when a zone runs out of memory. If it is set to 0, then no zone reclaim occurs. Allocations will be satisfied from other zones in the system. Set this with the following command:
        # echo 0 > /proc/sys/vm/zone_reclaim_mode
  • vfs_cache_pressure: This percentage value controls the tendency of the kernel to reclaim the memory which is used for caching directory and inode objects. Set this with the following command:
        # echo 50 > /proc/sys/vm/vfs_cache_pressure
  • I/O scheduler: Linux gives us the option to select the I/O scheduler, and this can be changed without rebooting, too. It provides three options for I/O schedulers, which are as follows:
    • Deadline: The deadline I/O scheduler replaces CFQ as the default I/O scheduler in Red Hat Enterprise Linux 7 and its derivatives, as well as in Ubuntu Trusty. The deadline scheduler favors reads over writes via the use of separate I/O queues for each. This scheduler is suitable for most use cases, but particularly for those in which read operations occur more often than write operations. Queued I/O requests are sorted into a read or write batch and then scheduled for execution in increasing LBA order. The read batches take precedence over the write batches by default, as applications are more likely to block on read I/O. For Ceph OSD workloads, the deadline I/O scheduler looks promising.
    • CFQ: The Completely Fair Queuing (CFQ) scheduler was the default scheduler in Red Hat Enterprise Linux (4, 5, and 6) and its derivatives. The default scheduler is only for devices identified as SATA disks. The CFQ scheduler divides processes into three separate classes: real-time, best effort, and idle. Processes in the real-time class are always performed before processes in the best effort class, which are always performed before processes in the idle class. This means that processes in the real-time class can starve both the best effort and idle processes of the processor time. Processes are assigned to the best effort class by default.
    • Noop: The Noop I/O scheduler implements a simple first-in first-out (FIFO) scheduling algorithm. Requests are merged at the generic block layer through a simple last-hit cache. This can be the best scheduler for CPU-bound systems using fast storage. For an SSD, the NOOP I/O scheduler can reduce I/O latency and increase throughput as well as eliminate the CPU time spent reordering I/O requests. This scheduler typically works well with SSDs, virtual machines, and even with NVMe cards. Thus, the Noop I/O scheduler should be a good choice for SSD disks used for Ceph journals.

      Execute the following command to check the default I/O scheduler for the disk device sda (the default scheduler appears within square brackets):

                # cat /sys/block/sda/queue/scheduler

To change the default I/O scheduler of the sda disk to deadline:

                # echo deadline > /sys/block/sda/queue/scheduler

To change the disk's default I/O scheduler to noop:

                # echo noop > /sys/block/sda/queue/scheduler
You must repeat these commands for all your disks to change the default scheduler to deadline or noop, based on your requirements. Also, to make this change permanent, you need to update the grub bootloader with the required elevator option.
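On a RHEL/CentOS 7 style system, making the scheduler choice persistent might look roughly like this (an illustrative sketch; paths and tooling differ between distributions):

        # vi /etc/default/grub     # append elevator=deadline to GRUB_CMDLINE_LINUX
        # grub2-mkconfig -o /boot/grub2/grub.cfg
        # reboot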
  • I/O scheduler queue: The default I/O scheduler queue size is 128. The scheduler queue sorts queued requests in an attempt to optimize for sequential I/O and to reduce seek time. Changing the depth of the scheduler queue to 1024 can increase the proportion of sequential I/O that disks perform and improve overall throughput.

    To check the scheduler depth for the sda block device, use the following command:

        # cat /sys/block/sda/queue/nr_requests

To increase the scheduler depth to 1024, use the following command:

        # echo 1024 > /sys/block/sda/queue/nr_requests

Tuning the network

Ethernet frames carrying a payload with an MTU of more than 1,500 bytes are known as jumbo frames. Enabling jumbo frames on all the network interfaces that Ceph uses for both the cluster and client networks should improve network throughput and overall network performance.

Jumbo frames should be enabled on the host as well as on the network switch side; otherwise, a mismatch in MTU size will result in packet loss. To enable jumbo frames on the interface eth0, execute the following command:

# ifconfig eth0 mtu 9000

Similarly, you should do this for the other interfaces participating in the Ceph networks. To make this change permanent, you should add this configuration to the interface configuration file.
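On RHEL/CentOS style systems, for example, the MTU can be made persistent in the interface configuration file (a sketch, assuming eth0):

        # echo "MTU=9000" >> /etc/sysconfig/network-scripts/ifcfg-eth0
        # systemctl restart network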

Sample tuning profile for OSD nodes

The following are various sysctl parameters that you can implement, which are known to have a positive impact on Ceph OSD node network performance. If you intend to use these parameters in production, please test and verify them prior to implementing.

How to do it...

Proper tuning can greatly improve the performance of your cluster, both for day-to-day I/O workloads and in recovery/rebalancing scenarios. Let's look at setting a recommended tuning profile for your OSD nodes:

  1. Create a file: /etc/sysctl.d/ceph-tuning.conf.

An example is shown in the following screenshot:

[Screenshot: creating the /etc/sysctl.d/ceph-tuning.conf file]
  2. Update with the following parameters:
        ### Network tuning ###
        net.core.rmem_max = 56623104
        net.core.wmem_max = 56623104
        net.core.rmem_default = 56623104
        net.core.wmem_default = 56623104
        net.core.optmem_max = 40960
        net.ipv4.tcp_rmem = 4096 87380 56623104
        net.ipv4.tcp_wmem = 4096 65536 56623104
        net.core.somaxconn = 1024
        net.core.netdev_max_backlog = 50000
        net.ipv4.tcp_max_syn_backlog = 30000
        net.ipv4.tcp_max_tw_buckets = 2000000
        net.ipv4.tcp_tw_reuse = 1
        net.ipv4.tcp_fin_timeout = 10
        net.ipv4.tcp_slow_start_after_idle = 0
        net.ipv4.conf.all.send_redirects = 0
        net.ipv4.conf.all.accept_redirects = 0
        net.ipv4.conf.all.accept_source_route = 0
        net.ipv4.tcp_mtu_probing = 1
        net.ipv4.tcp_timestamps = 0

The following screenshot shows an example:


[Screenshot: the network tuning parameters in /etc/sysctl.d/ceph-tuning.conf]
  3. Load the values:
        # sysctl -p /etc/sysctl.d/ceph-tuning.conf

For more information on these values, please refer to https://www.kernel.org/doc/Documentation/sysctl/.

Do not use this profile in your VirtualBox lab, as the VMs cannot handle these values and will crash.

Ceph erasure-coding

The default data protection mechanism in Ceph is replication. It has proven to be one of the most popular methods of data protection. However, the downside of replication is that it requires double the amount of storage space to provide redundancy. For instance, if you were planning to build a storage solution with 1 PB of usable capacity and a replication factor of three, you would need 3 PB of raw storage capacity for 1 PB of usable capacity, that is, 200% more. In this way, with the replication mechanism, the per-GB cost of the storage system increases significantly. You might ignore the replication overhead for a small cluster, but for large environments it becomes significant.

The Firefly release of Ceph introduced another method of data protection known as erasure-coding. This method of data protection is entirely different from the replication method. It guarantees data protection by dividing each object into smaller chunks known as data chunks, encoding them with coding chunks, and finally storing all these chunks across the different failure zones of the Ceph cluster. The concept of erasure-coding revolves around the equation n = k + m. This is explained in the following list:

  • k: This is the number of chunks the original object is divided into; it is also known as data chunks.
  • m: This is the extra code added to the original data chunks to provide data protection; it is also known as coding chunks. For ease of understanding, you can consider it as the reliability level.
  • n: This is the total number of chunks created after the erasure-coding process.

Based on the preceding equation, every object in an erasure-coded Ceph pool will be stored as k+m chunks, and each chunk is stored on a unique OSD in the acting set. In this way, all the chunks of an object are spread across the entire Ceph cluster, providing a higher degree of reliability. Now, let's discuss some useful terms with regard to erasure-coding:

  • Recovery: At the time of recovery, we will require any k chunks out of n chunks to recover the data.
  • Reliability level: With erasure-coding, Ceph can tolerate the failure of up to m chunks.

  • Encoding rate (r): This can be calculated using the formula r = k / n, where r is less than 1. The storage required is calculated using the formula 1/r.

To understand these terms better, let's consider an example. A Ceph pool with five OSDs is created based on an erasure code (3, 2) rule. Every object stored in this pool will be divided into sets of data and coding chunks according to the formula n = k + m.

Consider 5 = 3 + 2, so n = 5, k = 3, and m = 2. Every object will be divided into three data chunks, and two extra erasure-coded chunks will be added to them, making a total of five chunks that will be stored and distributed on the five OSDs of the erasure-coded pool across the Ceph cluster. In the event of a failure, to construct the original file we need any three chunks (k chunks) out of the five chunks (n chunks) to recover it. Thus, we can sustain the failure of any two (m) OSDs, as the data can be recovered using three OSDs:

  • Encoding rate (r) = 3 / 5 = 0.6 < 1
  • Storage Required = 1/r = 1 / 0.6 = 1.6 times of original file

Suppose there is a data file of size 1 GB. To store this file in a Ceph cluster on an erasure-coded (3, 2) pool, you will need 1.6 GB of storage space, and this will give you file storage that can sustain the failure of two OSDs.

In contrast to the replication method, if the same file is stored in a replicated pool, then in order to sustain the failure of two OSDs, Ceph would need a pool of replica size 3, which eventually requires 3 GB of storage space to reliably store a 1 GB file. In this way, you can reduce storage costs by approximately 40% by using the erasure-coding feature of Ceph, while getting the same reliability as with replication.

Erasure-coded pools require less storage space compared to replicated pools; however, this storage saving comes at the cost of performance, because the erasure-coding process divides every object into multiple smaller data chunks, and a few newer coding chunks are mixed with these data chunks. Finally, all these chunks are stored across the different failure zones of the Ceph cluster. This entire mechanism requires more computing power from the OSD nodes. Moreover, at the time of recovery, decoding the data chunks also requires a lot of computing. So, you might find the erasure-coding mechanism for storing data somewhat slower than the replication mechanism. Erasure-coding is largely use-case dependent, and you can get the most out of it based on your data storage requirements.

Erasure code plugin

Ceph gives us the option of choosing an erasure code plugin while creating the erasure code profile. Multiple erasure code profiles can be created, each time with a different plugin. Choosing the right profile is important, because it cannot be modified after the pool is created. In order to change a profile, a new pool with a different profile needs to be created, and all the objects from the previous pool must be moved to the new pool. Ceph supports the following plugins for erasure-coding:

  • Jerasure erasure code plugin: The Jerasure plugin is the most generic and flexible plugin. It is also the default for Ceph erasure-coded pools. The Jerasure plugin encapsulates the Jerasure library. Jerasure uses the Reed Solomon Code technique. The following diagram illustrates Jerasure code (3, 2). As explained, data is first divided into three data chunks, and an additional two coded chunks are added, and they finally get stored in the unique failure zone of the Ceph cluster, as shown in the following diagram:
    [Figure: Jerasure erasure code (3, 2) layout]
    With the Jerasure plugin, when an erasure-coded object is stored on multiple OSDs, recovering from the loss of one OSD requires reading from all the others. For instance, if Jerasure is configured with k=3 and m=2, losing one OSD requires reading from all five OSDs to repair, which is not very efficient during recovery.
  • Locally repairable erasure code plugin: Since Jerasure erasure code (Reed Solomon) was not recovery efficient, it was improved by the local parity method, and the new method is known as Locally Repairable erasure Code (LRC). The LRC plugin creates local parity chunks that are able to recover using less OSD, which makes it recovery-efficient. To understand this better, let's assume that LRC is configured with k=8, m=4, and l=4 (locality). It will create an additional parity chunk for every four OSDs. When a single OSD is lost, it can be recovered with only four OSDs instead of 11, which is the case with Jerasure. See the following diagram:
[Figure: locally repairable erasure code (LRC) with an additional local parity chunk per four OSDs]

LRC is designed to reduce the bandwidth needed to recover from the loss of a single OSD. As discussed earlier, a local parity chunk (L) is generated for every four data chunks (K). When K3 is lost, instead of recovering from all the [(K+M)-K3] chunks, that is, 11 chunks, with LRC it is enough to recover from the K1, K2, K4, and L1 chunks.

  • Shingled erasure code plugin: Local parity works well for a single OSD failure. For multiple OSD failures, however, LRC has a significant recovery overhead, as it has to use the global parity (M) for recovery. Let's reconsider the preceding scenario and assume that multiple data chunks, K3 and K4, are lost. To recover the lost chunks using LRC, recovery needs K1, K2, L1 (a local parity chunk), and M1 (a global parity chunk). Thus, LRC involves overhead for multi-disk failures.

    To address this problem, the Shingled Erasure Code (SHEC) has been introduced. The SHEC plugin encapsulates multiple SHEC libraries and allows Ceph to recover data more efficiently than Jerasure and LRC. The goal of the SHEC method is to handle multiple disk failures efficiently. Under this method, the calculation range of the local parities has been changed, and the parities overlap each other (like shingles on a roof) to maintain durability.

    Let's understand this with the example SHEC (10, 6, 5), where K = 10 (data chunks), m = 6 (parity chunks), and l = 5 (calculation range). In this case, SHEC is illustrated as follows:

[Figure: SHEC (10, 6, 5) layout with overlapping local parities]

Recovery efficiency is one of the biggest features of SHEC. It minimizes the amount of data read from disks during recovery. If the chunks K6 and K9 are lost, SHEC will use the M3 and M4 parity chunks and the K5, K7, K8, and K10 data chunks for recovery. This is also illustrated in the following diagram:

[Figure: SHEC recovery of the lost chunks K6 and K9]

For multiple disk failures, SHEC is expected to recover more efficiently than the other methods. In the case of a double disk failure, SHEC's recovery time is 18.6% faster than that of the Reed Solomon code.

  • ISA-I erasure code plugin: The Intelligent Storage Acceleration (ISA) plugin encapsulates the ISA library. ISA-I was optimized for Intel platforms using some platform-specific instructions, and thus runs only on Intel architecture. ISA can be used in either of the two forms of Reed Solomon, that is, Vandermonde or Cauchy.

Creating an erasure-coded pool

Erasure coding is implemented by creating a Ceph pool of the erasure type. This pool is based on an erasure code profile that defines the erasure-coding characteristics. We will first create the erasure code profile, and then we will create an erasure-coded pool based on this profile.

How to do it...

The following steps will show you how to create an EC profile and then apply that profile to an EC pool:

  1. The commands in this section will create an erasure code profile named EC-profile, which will have the characteristics k=3 and m=2, which are the number of data and coding chunks respectively. So, every object stored in the erasure-coded pool will be divided into 3 (k) data chunks, with 2 (m) additional coding chunks added to them, making a total of 5 (k + m) chunks. Finally, these 5 (k + m) chunks are spread across OSDs in different failure zones:

    1. Create the erasure code profile:
                 # ceph osd erasure-code-profile set EC-profile
                   ruleset-failure-domain=osd k=3 m=2
    2. List the profile:
                 # ceph osd erasure-code-profile ls
    3. Get the contents of your erasure code profile:
                # ceph osd erasure-code-profile get EC-profile

This is shown in the following screenshot:

[Screenshot: output of the ceph osd erasure-code-profile commands]
  2. Create a Ceph pool of the erasure type, which is based on the erasure code profile that we created in step 1:
        # ceph osd pool create EC-pool 16 16 erasure EC-profile

Check the details of the newly created pool; you should find that the pool's size is 5 (k + m), that is, the erasure size. Hence, the data will be written to five different OSDs, as shown in the following screenshot:


[Screenshot: pool details showing a size of 5 for EC-pool]
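The same check can be done directly from the command line; for example:

        # ceph osd pool get EC-pool size
        # ceph osd pool get EC-pool erasure_code_profile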
  3. Let's now add some data to this newly created Ceph pool. To do this, we will create a dummy file, hello.txt, and add this file to the EC-pool as shown in the following screenshot:
[Screenshot: putting hello.txt into EC-pool as object1]
  4. To verify if the erasure-coded pool is working correctly, we will check the OSD map for the EC-pool and object1 as shown in the following screenshot:
[Screenshot: output of ceph osd map EC-pool object1]

If you observe the output, you will notice that object1 is stored in placement group 2.c, which in turn is stored in the EC-pool. You will also notice that the placement group is stored on five OSDs, namely osd.2, osd.8, osd.0, osd.7, and osd.6. If you go back to step 1, you will see that we created an erasure code profile of (3, 2). This is why object1 is stored on five OSDs.
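For reference, the commands behind the two screenshots above are roughly the following (the contents of hello.txt are arbitrary):

        # echo "hello ceph" > hello.txt
        # rados -p EC-pool put object1 hello.txt
        # ceph osd map EC-pool object1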

At this stage, we have completed setting up an erasure-coded pool on the Ceph cluster. Now, we will deliberately try to bring OSDs down to see how the erasure-coded pool behaves when OSDs are unavailable.

  5. We will now try to bring down osd.2 and osd.8, one by one.
These are optional steps and you should not be performing this on your production Ceph cluster. Also, the OSD numbers might change for your cluster; replace them wherever necessary.

Bring down osd.2 and check the OSD map for the EC-pool and object1. You should notice that osd.2 is replaced with the word NONE, which means that osd.2 is no longer available for this pool:

        root@ceph-node2 # systemctl stop ceph-osd@2
                        # ceph osd map EC-pool object1
This is shown in the following screenshot:
[Screenshot: ceph osd map output showing NONE in place of osd.2]
  6. Similarly, break one or more OSD, that is, osd.8, and notice the OSD map for the EC-pool and object1. You will notice that, like osd.2, osd.8 also gets replaced by the word NONE, which means that osd.8 is also no longer available for this EC-pool:
         root@ceph-node1 # systemctl stop ceph-osd@8
                         # ceph osd map EC-pool object1

This is shown in the following screenshot:

[Screenshot: ceph osd map output showing NONE in place of osd.2 and osd.8]

Now, the EC-pool is running on three OSDs, which is the minimum requirement for this erasure pool setup. As discussed earlier, the EC-pool requires any three of its five chunks in order to read the data. Now, we have only three chunks left, on osd.0, osd.7, and osd.6, and we can still access the data. Let's verify this by reading the data:

# rados -p EC-pool ls
# rados -p EC-pool get object1 /tmp/object1
# cat /tmp/object1
[Screenshot: output of the preceding commands, showing that object1 is still readable]

Ceph's robust architecture benefits greatly from the erasure-coding feature. When Ceph detects that any failure zone is unavailable, it starts its basic recovery operation. During the recovery operation, erasure pools rebuild themselves by decoding the failed chunks onto new OSDs, after which all the chunks automatically become available again.

In the last two steps mentioned above, we deliberately broke osd.2 and osd.8. After some time, Ceph will start recovery and regenerate the missing chunks onto different OSDs. Once the recovery operation is complete, you should check the OSD map for the EC-pool and object1. You will be surprised to see new OSD IDs, such as osd.1 and osd.3. In this way, an erasure pool becomes healthy without any administrative input, as shown in the following screenshot:

[Screenshot: ceph osd map output showing the recovered placement on new OSDs]
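You can watch this recovery happen and then re-check the placement with standard commands; for example:

        # ceph -s
        # ceph osd map EC-pool object1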

Ceph cache tiering

Like erasure-coding, the cache tiering feature was also introduced in the Ceph Firefly release. A cache tier provides Ceph clients with better I/O performance for a subset of data stored in the cache tier. Cache tiering creates a Ceph pool on top of faster disks, typically SSDs. This cache pool should be placed in front of a regular, replicated or erasure-coded pool, such that all client I/O operations are handled first by the cache pool; later, the data is flushed to the existing data pool. Clients enjoy the high performance of the cache pool, while their data is written to the regular pool transparently. The following diagram illustrates Ceph cache tiering:

[Figure: Ceph cache tiering with a cache tier in front of a storage tier]

The cache tier is built on top of expensive, faster SSDs/NVMe devices, thus providing clients with better I/O performance. The cache tier is backed by a storage tier, which is made up of HDDs using either the replicated or the erasure type. All client I/O requests go to the cache tier and get faster responses, whether read or write; the faster cache tier serves the client requests. Based on the policy we create for the cache tier, it flushes all its data to the backing storage tier so that it can cache new requests coming from clients. All data migration between the cache tier and the storage tier happens automatically and is transparent to clients. The cache-tiering agent handles the data migration between the cache tier and the storage tier. Administrators can configure how this migration takes place. There are two main scenarios, which we will discuss in the following sections.

Writeback mode

When Ceph cache tiering is configured in writeback mode, Ceph clients write data to the cache tier pool, that is, the faster pool, and hence receive an acknowledgement immediately. Based on the flushing/evicting policy that you have set for your cache tier, data is migrated from the cache tier to the storage tier, and is eventually removed from the cache tier by the cache-tiering agent. During a read operation by the client, data is first transferred from the storage tier to the cache tier by the cache-tiering agent, and it is then served to the client. Data remains in the cache tier until it becomes inactive or cold. A cache tier in writeback mode is ideal for mutable data such as photo or video editing, transactional data, and so on.

Read-only mode

When Ceph cache tiering is configured in read-only mode, it works only for the clients' read operations. Write operations are not handled in this mode; they are stored in the storage tier. When any client performs a read operation, the cache-tiering agent copies the requested data from the storage tier to the cache tier. Based on the policy that you have configured for the cache tier, stale objects are removed from the cache tier. This approach is ideal when multiple clients need to read large amounts of similar data, for example, social media content. Immutable data is a good candidate for a read-only cache tier.

There has been a lot of discussion in the upstream Ceph community around the stability and real-world performance improvement of Ceph cache tiering. It is recommended that you research this feature thoroughly before deploying it in a production environment!

Creating a pool for cache tiering

To get the most out of Ceph's cache tiering feature, you should use faster disks such as SSDs and create a fast cache pool on top of a slower/regular pool made up of HDDs. In Chapter 9, Ceph Under the Hood, we covered the process of creating Ceph pools on specific OSDs by modifying the CRUSH map. To set up a cache tier in your environment, you first need to modify your CRUSH map and create a ruleset for the SSD disks. Since we already covered this in Chapter 9, Ceph Under the Hood, in the Creating a Ceph pool on specific OSDs recipe, we will use the same ruleset, which is based on osd.0, osd.3, and osd.6. Since this is a test setup and we don't have real SSDs, we will assume that OSDs 0, 3, and 6 are SSDs and will create a cache pool on top of them, as shown in the following diagram:

[Figure: cache pool placed on osd.0, osd.3, and osd.6]

Let's check the CRUSH layout using the ceph osd crush rule ls command, as shown in the following screenshot. We already have the ssd-pool ruleset that was created in Chapter 9, Ceph Under the Hood, in the Creating a Ceph pool on specific OSDs recipe. You can get more information about this CRUSH rule by running the ceph osd crush rule dump ssd-pool command:

[Screenshot: output of ceph osd crush rule ls and ceph osd crush rule dump ssd-pool]

How to do it...

We will now create a pool to be used for cache tiering:

  1. Create a new pool with the name cache-pool and set crush_ruleset as 1 so that the new pool gets created on SSD disks:
        # ceph osd pool create cache-pool 16 16
        # ceph osd pool set cache-pool crush_ruleset 1

This is shown in the following screenshot:

[Screenshot: creating cache-pool and setting crush_ruleset to 1]
  2. Make sure that your pool is created correctly, which means that it should always store all the objects on osd.0, osd.3, and osd.6.
  3. List the cache-pool for contents; since it's a new pool, it should not have any content:
        # rados -p cache-pool ls
  4. Add a temporary object to the cache-pool to make sure it's storing the object on the correct OSDs:
        # rados -p cache-pool put object1 /etc/hosts
        # rados -p cache-pool ls
  5. Verify the OSD map for the cache-pool and object1; it should get stored on osd.0, osd.3, and osd.6:
        # ceph osd map cache-pool object1
  6. Finally, remove the object:
        # rados -p cache-pool rm object1
[Screenshot: output of the preceding commands]

See also

  • Refer to the Creating a cache tier recipe in this chapter.

Creating a cache tier

In the previous recipe, we created an SSD-based pool, cache-pool. We will now use this pool as a cache tier for the erasure-coded pool, EC-pool, that we created earlier in this chapter:

[Figure: cache-pool acting as a cache tier in front of EC-pool]

The following instructions will guide you through creating a cache tier in writeback mode and setting up an overlay with the EC pool.

How to do it...

With the following set of steps, we will create a cache tier for our erasure-coded pool:

  1. Create a cache tier that will associate storage-pools with cache-pools. The syntax is ceph osd tier add <storage_pool> <cache_pool>:
        # ceph osd tier add EC-pool cache-pool
  2. Set the cache mode as either writeback or read-only. In this demonstration, we will use writeback, and the syntax is ceph osd tier cache-mode <cache_pool> writeback:
        # ceph osd tier cache-mode cache-pool writeback
  3. To direct all the client requests from the standard pool to the cache pool, set the pool overlay using the syntax ceph osd tier set-overlay <storage_pool> <cache_pool>:
         # ceph osd tier set-overlay EC-pool cache-pool
  4. On checking the pool details, you will notice that the EC-pool has tier, read_tier, and write_tier set as 4, which is the pool ID for the cache-pool. Similarly, for cache-pool, the settings will be tier_of set to 5 and cache_mode as writeback. All these settings imply that the cache pool is configured correctly:
        # ceph osd dump | egrep -i "EC-pool|cache-pool"
[Screenshot: output of the ceph osd dump command]

Configuring a cache tier

A cache tier has several configuration options that define the cache tier policy. The cache tier policy is needed to flush data from the cache tier to the storage tier in the case of writeback, and to move data from the storage tier to the cache tier in the case of a read-only cache tier. In this recipe, I have tried to demonstrate the cache tier using writeback mode. These are some of the settings that you should configure for your production environment, with values that match your requirements.

How to do it...

Now that we have configured the cache tier, we need to set some recommended configuration options for the cache pool:

  1. If looking to use a cache-tier in a production deployment, you should use the bloom filters data structure after careful review of the cache-tier considerations:
        # ceph osd pool set cache-pool hit_set_type bloom
  2. hit_set_count defines how many such hit sets are to be persisted, and hit_set_period defines how much time in seconds each hit set should cover:
        # ceph osd pool set cache-pool hit_set_count 1
        # ceph osd pool set cache-pool hit_set_period 300
  3. target_max_bytes is the maximum number of bytes after which the cache-tiering agent starts flushing/evicting objects from a cache pool. target_max_objects is the maximum number of objects after which the cache-tiering agent starts flushing/evicting objects from a cache pool:
        # ceph osd pool set cache-pool target_max_bytes 1000000 
        # ceph osd pool set cache-pool target_max_objects 10000

This is shown in the following screenshot:

[Screenshot: setting target_max_bytes and target_max_objects on cache-pool]
  4. Enable cache_min_flush_age and cache_min_evict_age, which are the times in seconds that the cache-tiering agent takes to flush and evict objects from a cache tier to a storage tier:
        # ceph osd pool set cache-pool cache_min_flush_age 300 
        # ceph osd pool set cache-pool cache_min_evict_age 300

This is shown in the following screenshot:

[Screenshot: setting cache_min_flush_age and cache_min_evict_age]
  5. Enable cache_target_dirty_ratio, which is the percentage of the cache pool containing dirty (modified) objects before the cache-tiering agent flushes them to the storage tier:
        # ceph osd pool set cache-pool cache_target_dirty_ratio .01
  6. Enable cache_target_full_ratio, which is the percentage of the cache pool containing unmodified (clean) objects before the cache-tiering agent evicts them from the cache tier:
        # ceph osd pool set cache-pool cache_target_full_ratio .02

This is shown in the following screenshot:

[Screenshot: setting cache_target_dirty_ratio and cache_target_full_ratio]

Once these steps are done, the Ceph cache tiering setup should be complete, and you can start adding workloads to it.

  7. Create a temporary file of 500 MB that we will use to write to the EC-pool, and which will eventually be written to the cache-pool:
        # dd if=/dev/zero of=/tmp/file1 bs=1M count=500

This is shown in the following screenshot:


[Screenshot: creating a 500 MB test file with dd]

Testing a cache tier

Since our cache tier is ready, during write operations clients will see their content being written to their regular pool; in fact, it is written to the cache-pool first, and then, based on the cache tier policy, the data is flushed to the storage tier. This data migration is transparent to the client.

How to do it...

  1. In the previous recipe, we created a 500 MB test file named /tmp/file1; we will now put this file in an EC-pool:
        # rados -p EC-pool put object1 /tmp/file1
  2. Since an EC-pool is tiered with a cache-pool, the named file1 will get written to the EC-pool as object metadata, but the actual object will get written into the cache-pool. To verify this, list each pool to get the object names:
        # rados -p EC-pool ls
        # rados -p cache-pool ls

This is shown in the following screenshot:

[Screenshot: listing EC-pool and cache-pool contents]
  3. When viewing rados df, we can see the actual space the object is taking up in each of the pools and where it truly resides:
        # rados df
This is shown in the following screenshot:
[Screenshot: rados df output showing object1 residing in cache-pool]
  4. After 300 seconds (as we have configured the cache_min_evict_age to 300 seconds), the cache tiering agent will migrate the object from the cache-pool to the EC-pool, and object1 will be removed from the cache-pool:
        # rados df

This is shown in the following screenshot:

[Screenshot: rados df output after the object has been flushed to EC-pool]
       # rados -p EC-pool ls
       # rados -p cache-pool ls
[Screenshot: output of the preceding commands]

If you look closely at steps 3 and 4, you will notice that the data has been migrated from the cache-pool to the EC-pool after a certain period of time, completely transparently to the user.

Cache tiering – possible dangers in production environments

Proper and thorough testing should be done before deploying cache tiering for any type of production use case, to verify that the cache tiering functionality does not cause unexpected performance issues. Cache tiering is known to degrade read/write performance for most client workloads and should be used with caution:

  • Workload dependent: The use of cache tiering to improve cluster performance depends on the type of work the cluster will be doing. The promotion and demotion of objects into and out of the cache can only be effective if there is large commonality in the data access pattern and client requests reach a smaller number of objects. When designing the cache pool, it is important that it is large enough to capture your planned working set for the defined workload to avoid any type of cache thrashing.
  • Difficult to benchmark: Any common benchmarking methods used on the Ceph cluster will report terrible performance with the use of cache tiering. This is because most performance benchmarking tools are not geared toward writing/reading a small object set, therefore it can take a longer time for the cache tier to heat up and show its usefulness. The time it takes for the tier to heat up can be costly.
  • Usually slower: Workloads that span large numbers of objects and are not cache-tiering friendly can show much lower performance than a normal RADOS pool not utilizing a cache tier.
  • Complexity: Deciding to utilize cache-tiering can lead to further configuration and management complexity within your Ceph cluster. The use of cache tiering may put you in a position where you encounter an issue that no other Ceph user has encountered; this may put your cluster at risk.

Known good workloads

Cache tiering can be useful when using RGW and when the RGW workload consists mostly of read operations on recently written or frequently accessed objects. Cache tiering in this use case (keeping objects in the faster tier for a configurable period of time) may actually improve performance.

Known bad workloads

The following are known bad workloads:

  • RBD with a replicated cache and an erasure-coded base: Even a reasonably skewed workload will still send some writes to cold objects from time to time. Since small writes are still not supported by the cache tier, entire 4 MB objects must be promoted into the cache tier in order to satisfy a smaller write. Only a very small number of users have deployed this configuration, and it works for them only because their use case is backup data, which remains cold and is not performance-sensitive.

  • RBD with a replicated cache and base: RBD with a replicated base tier does better than when the base is erasure-coded, but this still depends heavily on the amount of skew present in the workload and is very difficult to validate. This use case requires a very good understanding of your workload, and cache tiering must be tuned and configured carefully and correctly.