
Reading notes on Ceph Cookbook, Second Edition

More on Ceph

In this chapter, we will cover the following recipes:

  • Disk performance baseline
  • Baseline network performance
  • Ceph rados bench
  • RADOS load-gen
  • Benchmarking the Ceph Block Device
  • Benchmarking Ceph RBD using FIO
  • Ceph admin socket
  • Using the ceph tell command
  • Ceph REST API
  • Profiling Ceph memory
  • The ceph-objectstore-tool
  • Using ceph-medic
  • Deploying the experimental Ceph BlueStore

Introduction

In the previous chapters, we covered different ways of deploying, configuring, and managing Ceph. In this chapter, we will cover benchmarking a Ceph cluster, which is a must-do before moving it to production. We will also cover advanced methods of Ceph administration and troubleshooting using the admin socket, the REST API, and the ceph-objectstore-tool. Finally, we will learn about Ceph memory profiling.

Benchmarking your Ceph cluster before using it for production workloads should be a priority. Benchmarking gives you approximate results for how the cluster performs during read, write, latency, and other workloads.

Before doing real benchmarking, it's a good idea to establish a baseline for the expected maximum performance by measuring the performance of the hardware attached to the cluster nodes, such as the disks and the network.

Disk performance baseline

The disk performance baseline test will be done in two steps. First, we will measure the performance of a single disk; then, we will measure the performance of all the disks connected to one Ceph OSD node simultaneously.

To get realistic results, I am running the benchmarks described in this recipe against a Ceph cluster deployed on physical hardware. We can also run these tests on a Ceph cluster hosted on virtual machines, but we might not get appealing results.

Single disk write performance

To get disk read and write performance, we will use the dd command with oflag set to direct, in order to bypass the disk cache and get realistic results.

How to do it...

Let's benchmark single disk write performance:

  1. Drop caches:
        # echo 3 > /proc/sys/vm/drop_caches
  2. Use dd to write a file named deleteme of size 10G, filled with zeros from /dev/zero as the input file, to the directory where the Ceph OSD is mounted, that is, /var/lib/ceph/osd/ceph-0/:
        # dd if=/dev/zero of=/var/lib/ceph/osd/ceph-0/deleteme bs=10G count=1 oflag=direct

Ideally, you should repeat steps 1 and 2 several times and take the average value. In our case, the average for the write operation was 121 MB/s.
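Averaging the repeated runs can be scripted; the helper below is a small sketch, assuming GNU dd's result line (printed on stderr), where the throughput is the second-to-last field:

```shell
# Average the MB/s figures from several recorded dd result lines.
# dd prints e.g. "10737418240 bytes (11 GB) copied, 88.7 s, 121 MB/s" on stderr.
avg_mbps() {
  awk '/MB\/s/ { sum += $(NF-1); n++ } END { if (n) printf "%.0f MB/s\n", sum / n }'
}

# Example with three recorded dd result lines:
printf '%s\n' \
  '10737418240 bytes (11 GB) copied, 90.2 s, 119 MB/s' \
  '10737418240 bytes (11 GB) copied, 88.7 s, 121 MB/s' \
  '10737418240 bytes (11 GB) copied, 87.3 s, 123 MB/s' | avg_mbps
# → 121 MB/s
```

In a real run you would capture dd's stderr (2>&1) from each iteration and pipe all of it through avg_mbps.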


Multiple disk write performance

In the next step, we will run dd on all the OSD disks used by Ceph on the node ceph-node1 to get the aggregated disk write performance for a single node.

How to do it...

Let's benchmark multiple disk write performance:

  1. Get the total number of disks in use by the Ceph OSDs; in my case, it's three disks:
        # mount | grep -i osd | wc -l
  2. Drop caches:
        # echo 3 > /proc/sys/vm/drop_caches
  3. The following command will execute dd on all the Ceph OSD disks:
        # for i in `mount | grep osd | awk '{print $3}'`; do (dd if=/dev/zero of=$i/deleteme bs=10G count=1 oflag=direct &) ; done

To get the aggregated disk write performance, take the average of all the write speeds. In my case, the average was 127 MB/s.


Single disk read performance

To get single disk read performance, we will again use the dd command.

How to do it...

Let's benchmark single disk read performance:

  1. Drop caches:
        # echo 3 > /proc/sys/vm/drop_caches
  2. Use dd to read from the deleteme file, which we created during the write test. We will read the deleteme file to /dev/null with iflag set to direct:
        # dd if=/var/lib/ceph/osd/ceph-0/deleteme of=/dev/null bs=10G count=1 iflag=direct

Ideally, you should repeat steps 1 and 2 several times and take the average value. In our case, the average for the read operation was 133 MB/s.

Multiple disk read performance

Similar to single disk read performance, we will use dd to get the aggregated multiple disk read performance.

How to do it...

Let's benchmark multiple disk read performance:

  1. Get the total number of disks in use by the Ceph OSDs; in my case, it's three disks:
        # mount | grep -i osd | wc -l
  2. Drop caches:
        # echo 3 > /proc/sys/vm/drop_caches
  3. The following command will execute dd on all the Ceph OSD disks:
        # for i in `mount | grep osd | awk '{print $3}'`; do (dd if=$i/deleteme of=/dev/null bs=10G count=1 iflag=direct &); done

To get the aggregated disk read performance, take the average of all the read speeds. In my case, the average was 137 MB/s.


Results

Based on the tests we performed, the results are summarized in the following table. These results vary across environments; the hardware in use and the number of disks on the OSD node can make a big difference:

Operation | Per disk | Aggregate
--------- | -------- | ---------
Read      | 133 MB/s | 137 MB/s
Write     | 121 MB/s | 127 MB/s

Baseline network performance

In this recipe, we will perform tests to discover the baseline performance of the network between the Ceph OSD nodes. For this, we will use the iperf utility; make sure the iperf package is installed on the Ceph nodes. iperf is a simple point-to-point network bandwidth tester that works on a client-server model.

To begin network benchmarking, execute iperf with the server option on the first Ceph node and with the client option on the second Ceph node.

How to do it...

Using iperf, let's get a baseline for the cluster network performance:

  1. Install iperf on ceph-node1 and ceph-node2:
        # sudo yum install iperf -y
  2. On ceph-node1, execute iperf with -s for the server, and -p to listen on a specific port:
        # iperf -s -p 6900
     You can skip the -p option if TCP port 5201 is open, or you can choose any other port that is open and not in use.
  3. On ceph-node2, execute iperf with the client option, -c:
        # iperf -c ceph-node1 -p 6900
     You can also use the -P option with the iperf command to set the number of parallel stream connections to make with the server; this returns a more realistic result if you have a channel-bonding technique such as LACP in place.

This shows a network connectivity of roughly 2.15 Gbits/s between the Ceph nodes. Similarly, you can perform network bandwidth checks against the other nodes of the Ceph cluster. The network bandwidth really depends on the network infrastructure used between the Ceph nodes.
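To compare this network ceiling directly with the disk baseline measured earlier, it helps to convert the iperf figure into the MB/s units dd reports; a quick sketch (decimal units, as iperf uses):

```shell
# Convert a throughput figure in Gbit/s to MB/s (decimal: 1 Gbit = 1000 Mbit, 8 bits per byte).
gbit_to_mbyte() {
  awk -v g="$1" 'BEGIN { printf "%.0f MB/s\n", g * 1000 / 8 }'
}

gbit_to_mbyte 2.15   # → 269 MB/s
```

So the 2.15 Gbits/s link corresponds to roughly 269 MB/s, comfortably above the single-node aggregate disk throughput measured above.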

See also

  • Chapter 10, Production Planning and Performance Tuning for Ceph, where you can find more information related to Ceph networking.

Ceph rados bench

Ceph ships with an inbuilt benchmarking tool known as rados bench, which can be used to measure the performance of a Ceph cluster at the pool level. The rados bench tool supports write, sequential read, and random read benchmarking, and it also allows the cleanup of temporary benchmark data, which is quite neat.

How to do it...

Let's run some tests using rados bench:

  1. To run a 10-second write test on the rbd pool without cleanup, use the following command:
        # rados bench -p rbd 10 write --no-cleanup
     You will notice that my test actually ran for a total of 17 seconds; this was caused by running the test on a VM and the extra time taken for the write OPS to finish.
  2. Similarly, to run a 10-second sequential read test on the rbd pool, run the following:
        # rados bench -p rbd 10 seq
     In this case, it might be interesting to know why the read test finished in a few seconds, and why it did not run for the specified 10 seconds. This is because reads are faster than writes, and rados bench had finished reading all the data generated during the write test. However, this behavior depends heavily on your hardware and software infrastructure.
  3. To run a random read test with rados bench, execute the following:
        # rados bench -p rbd 10 rand

How it works...

The syntax for rados bench is as follows:

# rados bench -p <pool_name> <seconds> <write|seq|rand> -b <block_size> -t <concurrent_threads> --no-cleanup

The syntax can be explained as follows:

  • -p or --pool: Specifies the pool name.
  • <seconds>: The test duration in seconds.
  • <write|seq|rand>: The type of test, for example, write, sequential read, or random read.
  • -b: The block size; by default, it's 4M.
  • -t: The number of concurrent threads; the default is 16.
  • --no-cleanup: Instructs rados bench not to clean up the temporary data it writes to the pool. This data can then be reused by subsequent sequential or random read tests. By default, the data is cleaned up.
Note that I ran these tests on the Vagrant cluster deployed in this book. You can absolutely run these commands against your physical cluster and will receive much better results than I did on virtual machines!

rados bench is a pretty handy tool to quickly measure the raw performance of your Ceph cluster with write, sequential read, and random read profiles.
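The three tests above can be bundled into one dry-run wrapper; this is a hypothetical convenience script, not part of rados itself, which only prints the commands so you can review them before piping to sh on a cluster node:

```shell
# Print the write/seq/rand rados bench invocations for a given pool and duration.
# The write test keeps its data (--no-cleanup) so the two read tests have something to read.
bench_suite() {
  pool=$1
  secs=$2
  echo "rados bench -p $pool $secs write --no-cleanup"
  echo "rados bench -p $pool $secs seq"
  echo "rados bench -p $pool $secs rand"
}

bench_suite rbd 10
```

Running `bench_suite rbd 10 | sh` on a node with cluster access would execute the same sequence shown in this recipe.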

RADOS load-gen

Somewhat similar to rados bench, RADOS load-gen is another interesting tool provided by Ceph that works out of the box. As the name suggests, the RADOS load-gen tool can be used to generate load on a Ceph cluster, and it can be useful for simulating high-load scenarios.

How to do it...

  1. Let's try to generate some load on our Ceph cluster with the following command:
        # rados -p rbd load-gen --num-objects 50 --min-object-size 4M \
          --max-object-size 4M --max-ops 16 --min-op-len 4M --max-op-len 4M \
          --percent 5 --target-throughput 2000 --run-length 60

How it works...

The syntax for RADOS load-gen is as follows:

# rados -p <pool-name> load-gen

The following is a detailed explanation of the preceding command options:

  • --num-objects: The total number of objects
  • --min-object-size: The minimum object size in bytes
  • --max-object-size: The maximum object size in bytes
  • --min-ops: The minimum number of operations
  • --max-ops: The maximum number of operations
  • --min-op-len: The minimum operation length
  • --max-op-len: The maximum operation length
  • --max-backlog: The maximum backlog (in MB)
  • --percent: The percentage of read operations
  • --target-throughput: The target throughput (in MB)
  • --run-length: The total run time in seconds

This command will generate load on the Ceph cluster by writing 50 objects to the rbd pool. Each of these objects and operation lengths is 4 MB in size, with 5% reads, and the test runs for 60 seconds.
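As a quick sanity check on the run above, the data volume implied by those parameters can be computed directly (a trivial sketch):

```shell
# 50 objects at 4 MB each (--min/--max-object-size were both 4M), written once:
objects=50
object_size_mb=4
echo "total data written: $(( objects * object_size_mb )) MB"
# → total data written: 200 MB
```

Knowing the expected total helps you judge whether the reported throughput and the 60-second run length are consistent with each other.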


The output has been trimmed for brevity. Once the load-gen command finishes, it cleans up all the objects it created during the test and displays the operation throughput.


There's more...

You can also watch the cluster with the watch ceph -s or ceph -w command while RADOS load-gen is running, just to see how it progresses.

Benchmarking the Ceph Block Device

The tools rados bench and RADOS load-gen, which we discussed in the previous recipes, are used to benchmark a Ceph cluster pool. In this recipe, we will focus on benchmarking a Ceph Block Device with the rbd bench-write tool. The ceph rbd command-line interface provides an option named bench-write, which is a tool to perform write benchmarking operations on a Ceph RADOS Block Device.

How to do it...

To benchmark a Ceph Block Device, we need to create a block device and map it to a Ceph client node:

  1. Create a Ceph Block Device named block-device1, of size 10 G, and map it:
        # rbd create block-device1 --size 10240 --image-feature layering
        # rbd info --image block-device1
        # rbd map block-device1
        # rbd showmapped
  2. Create a filesystem on the block device and mount it:
        # mkfs.xfs /dev/rbd1
        # mkdir -p /mnt/ceph-block-device1
        # mount /dev/rbd1 /mnt/ceph-block-device1
        # df -h /mnt/ceph-block-device1
  3. To benchmark block-device1 for 5 GB of total write length, execute the following command:
        # rbd bench-write block-device1 --io-total 5368709200

As you can see, rbd bench-write outputs nicely formatted results.

How it works...

The syntax for rbd bench-write is as follows:

# rbd bench-write <RBD image name>

The following is a detailed explanation of the preceding options:

  • --io-size: The write size in bytes; the default is 4M
  • --io-threads: The number of threads; the default is 16
  • --io-total: The total bytes to write; the default is 1024M
  • --io-pattern <seq|rand>: This is the write pattern, the default is seq
You can use different options with the rbd bench-write tool to adjust the block size, number of threads, and IO pattern.
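When sizing --io-total, exact binary byte counts are easy to get wrong by hand; the value used in the recipe above is close to, but not exactly, 5 GiB. A tiny helper for computing them:

```shell
# Convert a GiB count to bytes (1 GiB = 1024^3 bytes).
gib_bytes() { echo $(( $1 * 1024 * 1024 * 1024 )); }

gib_bytes 5    # → 5368709120
gib_bytes 1    # → 1073741824
```

So an exact 5 GiB write would be `--io-total 5368709120`; either value gives a comparable benchmark as long as you keep it consistent between runs.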

See also

  • Chapter 2, Working with Ceph Block Device, where we covered the creation of the Ceph Block Device in detail.

Benchmarking Ceph RBD using FIO

Flexible I/O (FIO) is one of the most popular tools for generating I/O workloads and benchmarking. FIO recently added native support for RBD. FIO is highly customizable and can be used to simulate and benchmark almost every kind of workload. In this recipe, we will learn how to benchmark a Ceph RBD using FIO.

How to do it...

To benchmark a Ceph Block Device with FIO, we need a block device mapped to a Ceph client node:

  1. Install the FIO package on the node where you mapped the Ceph RBD image. In our case, it's the ceph-client1 node:
        # yum install fio -y
  2. Since FIO supports the RBD ioengine, we do not need to mount the RBD image as a filesystem. To benchmark RBD, we simply provide the RBD image name, the pool, and the Ceph user that will be used to connect to the Ceph cluster. Create a FIO profile file, write.fio, with the following content:

        [write-4M]
        description="write test with block size of 4M"
        ioengine=rbd
        clientname=admin
        pool=rbd
        rbdname=block-device1
        iodepth=32
        runtime=120
        rw=write
        bs=4M

  3. To start FIO benchmarking, execute the fio command with the profile file as an argument:
        # fio write.fio

Once done, FIO generates a lot of useful information that should be carefully observed. However, at first glance, you might primarily be interested in the IOPS and the aggregated bandwidth.
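To contrast sequential writes with random I/O, a random-write variant of the same profile can be sketched; the section name and block size here are illustrative choices, while the pool, image, and client settings match the profile above:

```ini
[randwrite-4K]
description="random write test with block size of 4K"
ioengine=rbd
clientname=admin
pool=rbd
rbdname=block-device1
iodepth=32
runtime=120
rw=randwrite
bs=4K
```

Small random writes typically show far lower bandwidth but a more revealing IOPS figure than the 4M sequential test.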


Ceph admin socket

Ceph components run as daemons, and each daemon exposes a Unix domain socket; Ceph allows us to use these sockets to query its daemons. The Ceph admin socket is a powerful tool for getting and setting Ceph daemon configuration at runtime. With this tool, changing a daemon configuration value becomes much easier than editing the Ceph configuration file, which requires restarting the daemon.

To do this, you should log in to the node running the Ceph daemon and execute the ceph daemon command.

How to do it...

There are two ways to access the admin socket:

  1. Using the Ceph daemon name:
        $ sudo ceph daemon {daemon-name} {option}
  2. Using the absolute path of the socket file; the default location is /var/run/ceph:
        $ sudo ceph daemon {absolute path to socket file} {option}
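For the second form, the default socket path follows the pattern /var/run/ceph/<cluster>-<daemon-type>.<id>.asok; a tiny helper (a sketch, assuming the default cluster name "ceph" and the default socket directory) to build it:

```shell
# Build the default admin socket path for a daemon (assumes cluster name "ceph").
sock_path() { echo "/var/run/ceph/ceph-$1.$2.asok"; }

sock_path osd 0            # → /var/run/ceph/ceph-osd.0.asok
sock_path mon ceph-node1   # → /var/run/ceph/ceph-mon.ceph-node1.asok
```

For example, `sudo ceph daemon $(sock_path osd 0) help` is equivalent to `sudo ceph daemon osd.0 help` on that node.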

We will now try accessing the Ceph daemons using the admin socket:

  1. List all the available admin socket commands for an OSD:
        # ceph daemon osd.0 help
  2. Similarly, list all the available socket commands for a MON:
        # ceph daemon mon.ceph-node1 help
  3. Check the OSD configuration settings for osd.0:
        # ceph daemon osd.0 config show
  4. Check the MON configuration settings for mon.ceph-node1:
        # ceph daemon mon.ceph-node1 config show
     The Ceph admin socket allows you to change a daemon configuration setting at runtime. However, these changes are temporary. To change a daemon configuration permanently, update the Ceph configuration file.
  5. To get the current value of the osd_recovery_max_chunk parameter for the osd.0 daemon, execute the following:
        # ceph daemon osd.0 config get osd_recovery_max_chunk
  6. To change the osd_recovery_max_chunk value for osd.0, execute the following command:
        # ceph daemon osd.0 config set osd_recovery_max_chunk 1000000

Using the ceph tell command

Another effective way of changing the runtime configuration of a Ceph daemon, without the overhead of logging in to that node, is to use the ceph tell command.

How to do it...

The ceph tell command saves you the effort of logging in to the node where the daemon is running. This command goes through the monitor nodes, so you can execute it from any node in the cluster:

  1. The syntax for the ceph tell command is as follows:
        # ceph tell {daemon-type}.{id or *} injectargs --{config_setting_name} {value}
  2. To change the osd_recovery_threads setting on osd.0, execute the following:
        # ceph tell osd.0 injectargs '--osd_recovery_threads=2'
  3. To change the same setting for all the OSDs across the cluster, execute the following:
        # ceph tell osd.* injectargs '--osd_recovery_threads=2'
  4. You can also change multiple settings as a one-liner:
        # ceph tell osd.* injectargs '--osd_recovery_max_active=1 --osd_recovery_max_single_start=1 --osd_recovery_op_priority=50'

Ceph REST API

Ceph comes with a powerful REST API interface that allows you to administer your cluster programmatically. It can run as a WSGI application or as a standalone server, listening on the default port 5000. It provides functionality similar to the Ceph command-line tools through an HTTP-accessible interface. Commands are submitted as HTTP GET and PUT requests, and the results can be returned in JSON, XML, and text formats. In this recipe, I will quickly show you how to set up the Ceph REST API and interact with it.

How to do it...

Let's configure and use the Ceph REST API to check some cluster states:

  1. Create a user, client.restapi, on the Ceph cluster with appropriate access to mon, osd, and mds:
        # ceph auth get-or-create client.restapi mds 'allow *' osd 'allow *' mon 'allow *' > /etc/ceph/ceph.client.restapi.keyring
  2. Add the following section to the ceph.conf file:
        [client.restapi]
        log file = /var/log/ceph/ceph.restapi.log
        keyring = /etc/ceph/ceph.client.restapi.keyring
  3. Execute the following command to start ceph-rest-api as a standalone web server in the background:
        # nohup ceph-rest-api > /var/log/ceph-rest-api.log 2> /var/log/ceph-rest-api-error.log &
     You can also run ceph-rest-api in the foreground, without nohup and without sending it to the background.
  4. ceph-rest-api should now be listening on 0.0.0.0:5000; use curl to query it for the cluster health:
        # curl localhost:5000/api/v0.1/health
  5. Similarly, check the osd and mon status via the REST API:
        # curl localhost:5000/api/v0.1/osd/stat
        # curl localhost:5000/api/v0.1/mon/stat
  6. ceph-rest-api supports most of the Ceph CLI commands. To check the list of available ceph-rest-api commands, execute the following:
        # curl localhost:5000/api/v0.1
     This command returns the output in HTML; it is better to visit localhost:5000/api/v0.1 from a web browser to render the HTML for easier readability.

This was a basic implementation of ceph-rest-api. To use it in production, it's best to deploy multiple instances as WSGI applications wrapped by a web server and fronted by a load balancer. ceph-rest-api is a scalable, lightweight service that lets you manage your Ceph cluster like a pro.

Profiling Ceph memory

Memory profiling is the process of dynamic program analysis, using TCMalloc, to determine a program's memory consumption and identify ways to optimize it. In this recipe, we will discuss how to use a memory profiler on Ceph daemons for memory investigation.

How to do it...

Let's see how to profile the memory usage of Ceph daemons running on a node:

  1. Start the memory profiler on a specific daemon:
        # ceph tell osd.2 heap start_profiler
     To auto-start the profiler as soon as the Ceph OSD daemon starts, set the environment variable CEPH_HEAP_PROFILER_INIT=true.
     It's a good idea to keep the profiler running for a few hours so that it can collect as much information related to the memory footprint as possible. At the same time, you can also generate some load on the cluster.
  2. Next, print heap statistics about the memory footprint that the profiler has collected:
        # ceph tell osd.2 heap stats
  3. You can also dump heap stats to a file for later use; by default, the dump file is created as /var/log/ceph/osd.2.profile.0001.heap:
        # ceph tell osd.2 heap dump
  4. To read this dump file, you will require google-perftools:
        # yum install -y google-perftools
  5. To view the profiler logs:
        # pprof --text {path-to-daemon} {log-path/filename}
        # pprof --text /usr/bin/ceph-osd /var/log/ceph/osd.2.profile.0001.heap
  6. For a granular comparison, generate several profile dump files for the same daemon, and use the Google profiler tool to compare them:
        # pprof --text --base /var/log/ceph/osd.2.profile.0001.heap /usr/bin/ceph-osd /var/log/ceph/osd.2.profile.0002.heap
  7. Release memory that TCMalloc has allocated but that is not being used by Ceph:
        # ceph tell osd.2 heap release
  8. Once you are done, stop the profiler, as you do not want to leave it running on a production cluster:
        # ceph tell osd.2 heap stop_profiler

Ceph daemons are quite mature, and you may not really need to profile them unless you hit a bug that causes a memory leak. You can use the procedure discussed above to track down memory issues with Ceph daemons.

The ceph-objectstore-tool

One of the key features of Ceph is its self-repair and self-healing quality. Ceph does this by keeping multiple copies of placement groups across different OSDs, ensuring that you do not lose your data. In very rare cases, you may see multiple OSD failures, where one or more PG replicas were on the failed OSDs; the PG state then becomes incomplete, leading to errors in cluster health. For granular recovery, Ceph provides a low-level PG and object data recovery tool known as ceph-objectstore-tool.

Using the ceph-objectstore-tool can be a risky operation, and the command needs to be run as root or with sudo. Do not attempt this on a production cluster without engaging Red Hat Ceph Storage Support, unless you are sure of what you are doing. It can cause irreversible data loss in your cluster.

How to do it...

Let's look at some usage examples for ceph-objectstore-tool:

  1. Find incomplete PGs on your Ceph cluster. Using this command, you can get the PG ID and its acting set:
        # ceph health detail | grep incomplete
  2. Using the acting set, you can locate the OSD host:
        # ceph osd find <osd_number>
  3. Log in to the OSD node and stop the OSD that you intend to work on:
        # systemctl stop ceph-osd@<id>

The following operations can be performed with ceph-objectstore-tool:

  1. To identify the objects within an OSD, execute the following. The tool will output all objects, irrespective of their placement groups:
        # ceph-objectstore-tool --data-path </path/to/osd> --journal-path </path/to/journal> --op list
  2. To identify the objects within a placement group, execute the following:
        # ceph-objectstore-tool --data-path </path/to/osd> --journal-path </path/to/journal> --pgid <pgid> --op list
  3. To list the placement groups stored on an OSD, execute the following:
        # ceph-objectstore-tool --data-path </path/to/osd> --journal-path </path/to/journal> --op list-pgs
  4. If you know the object ID that you are looking for, specify it to find the PG ID:
        # ceph-objectstore-tool --data-path </path/to/osd> --journal-path </path/to/journal> --op list <object-id>
  5. Retrieve information about a particular placement group:
        # ceph-objectstore-tool --data-path </path/to/osd> --journal-path </path/to/journal> --pgid <pg-id> --op info
  6. Retrieve a log of operations on a placement group:
        # ceph-objectstore-tool --data-path </path/to/osd> --journal-path </path/to/journal> --pgid <pg-id> --op log

Removing a placement group is a risky operation and may cause data loss; use this feature with caution. If a corrupted placement group on an OSD prevents the OSD service from peering or starting, then before removing that placement group, make sure you have a valid copy of it on another OSD. As a precaution, you can also back up the PG by exporting it to a file before removing it:

  7. To remove a placement group, execute the following command:
        # ceph-objectstore-tool --data-path </path/to/osd> --journal-path </path/to/journal> --pgid <pg-id> --op remove
  8. To export a placement group to a file, execute the following:
        # ceph-objectstore-tool --data-path </path/to/osd> --journal-path </path/to/journal> --pgid <pg-id> --file /path/to/file --op export
  9. To import a placement group from a file, execute the following:
        # ceph-objectstore-tool --data-path </path/to/osd> --journal-path </path/to/journal> --file </path/to/file> --op import
  10. An OSD may have objects marked as lost. To list the lost or unfound objects, execute the following:
        # ceph-objectstore-tool --data-path </path/to/osd> --journal-path </path/to/journal> --op list-lost
  11. To find objects marked as lost for a single placement group, specify the pgid:
        # ceph-objectstore-tool --data-path </path/to/osd> --journal-path </path/to/journal> --pgid <pgid> --op list-lost
  12. The ceph-objectstore-tool can also fix a PG's lost objects. To clear the lost setting for the lost objects on an OSD, execute the following:
        # ceph-objectstore-tool --data-path </path/to/osd> --journal-path </path/to/journal> --op fix-lost
  13. To fix lost objects for a particular placement group, specify the pgid:
        # ceph-objectstore-tool --data-path </path/to/osd> --journal-path </path/to/journal> --pgid <pg-id> --op fix-lost
  14. If you know the identity of the lost object you want to fix, specify the object ID:
        # ceph-objectstore-tool --data-path </path/to/osd> --journal-path </path/to/journal> --op fix-lost <object-id>
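Since every invocation repeats the same --data-path/--journal-path prefix, a hypothetical dry-run helper can cut the typing; it only prints the assembled command (the paths below are placeholders), so you can inspect it before running it for real on a stopped OSD:

```shell
# Assemble (but do not run) a ceph-objectstore-tool command line.
ost() {
  data=$1
  journal=$2
  shift 2
  echo ceph-objectstore-tool --data-path "$data" --journal-path "$journal" "$@"
}

ost /var/lib/ceph/osd/ceph-1 /var/lib/ceph/osd/ceph-1/journal --pgid 17.1f --op info
# Pipe the output to sh (or drop the echo) to actually execute it.
```

Keeping the helper as a dry run fits the cautionary note above: you review the exact command, including the export-before-remove step, before anything touches the OSD.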

How it works...

The syntax for ceph-objectstore-tool is as follows:

ceph-objectstore-tool <options>

The values for <options> can be as follows:

  • --data-path: The path to the OSD
  • --journal-path: The path to the journal
  • --op: The operation
  • --pgid: The placement group ID
  • --skip-journal-replay: Use this when the journal is corrupted
  • --skip-mount-omap: Use this when the LevelDB data store is corrupted and unable to mount
  • --file: The path to the file, used with the import/export operation

To understand this tool better, let's consider an example: a pool keeps two copies of an object, and the PG is located on osd.1 and osd.2. If a failure occurs, the following sequence takes place:

  1. osd.1 goes down.
  2. osd.2 handles all the write operations in a degraded state.
  3. osd.1 comes up and peers with osd.2 for data replication.
  4. Suddenly, osd.2 goes down before replicating all the objects to osd.1.
  5. At this point, you have data on osd.1, but it's stale.

After troubleshooting, you find that you can read the osd.2 data from the filesystem, but its OSD service will not start. In such a situation, you should use ceph-objectstore-tool to extract and recover the data from the failed OSD. ceph-objectstore-tool gives you enough capability to examine, modify, and retrieve object data and metadata.

You should avoid using Linux tools such as cp and rsync to recover data from a failed OSD, as these tools do not take all the necessary metadata into account, and the recovered objects might be unusable!

Using ceph-medic

From its earliest days, Ceph lacked a holistic health-check tool that could easily highlight issues inside a Ceph cluster. The ceph status and ceph health detail commands exist and do a good job of providing overall cluster health details, but they do not point the user in any specific direction if a more complicated issue exists. The ceph-medic project was created to allow a single command to be run that polls multiple predefined checks on a Ceph cluster. These checks range from best-practice suggestions to validation of keyring and directory ownership. The ceph-medic project continues to grow rapidly, and new checks are added frequently.

At the time of writing this book, only an RPM repository built for CentOS 7 is supported.

How to do it...

We will use the following steps to install and use ceph-medic:

  1. Install the latest RPM repo:
        # wget http://download.ceph.com/ceph-medic/latest/rpm/el7/ceph-medic.repo -O /etc/yum.repos.d/ceph-medic.repo
  2. Install epel-release:
        # yum install epel-release
  3. Install the GPG key for ceph-medic:
        # wget https://download.ceph.com/keys/release.asc
        # rpm --import release.asc
  4. Install ceph-medic:
        # yum install ceph-medic
  5. Validate the install:
        # ceph-medic --help
  6. Run a ceph-medic check on your cluster:
        # ceph-medic check

ceph-medic writes a complete log file to the current working directory from which the command was issued. This log is much more verbose than the output the command sends to the terminal. The log location can be changed with the --log-path option in ~/.cephmedic.conf.

How it works...

Since ceph-medic performs checks against the entire cluster, it needs to know which nodes exist in the cluster and have passwordless SSH access to them. If your cluster was deployed via ceph-ansible, your nodes are already configured and this is not required; if not, you will need to point ceph-medic to an inventory file and an SSH config file.

The syntax for the ceph-medic command is as follows:

        # ceph-medic --inventory /path/to/hosts --ssh-config /path/to/ssh_config check

The inventory file is a typical Ansible inventory file and can be created in the current working directory where ceph-medic check is run. The file must be named hosts, and the following standard host groups are supported: mons, osds, rgws, mdss, mgrs, and clients. A sample hosts file looks as follows:

[mons]
 ceph-node1
 ceph-node2
 ceph-node3
 
 [osds]
 ceph-node1
 ceph-node2
 ceph-node3
 
 [mdss]
 ceph-node2

The SSH config file allows non-interactive SSH access to specific accounts that can use sudo without a password prompt. This file can be created in the working directory where ceph-medic check is run. A sample SSH config file for a Vagrant VM cluster looks as follows:

Host ceph-node1
   HostName 127.0.0.1
   User vagrant
   Port 2200
   UserKnownHostsFile /dev/null
   StrictHostKeyChecking no
   PasswordAuthentication no
   IdentityFile /Users/andrewschoen/.vagrant.d/insecure_private_key
   IdentitiesOnly yes
   LogLevel FATAL
 
 Host ceph-node2
   HostName 127.0.0.1
   User vagrant
   Port 2201
   UserKnownHostsFile /dev/null
   StrictHostKeyChecking no
   PasswordAuthentication no
   IdentityFile /Users/andrewschoen/.vagrant.d/insecure_private_key
   IdentitiesOnly yes
   LogLevel FATAL

See also

  • The upstream project page has details of the ceph-medic tool and its various checks, and it is a good source of information as the tool develops further: https://github.com/ceph/ceph-medic.

Deploying the experimental Ceph BlueStore

BlueStore is a new backend for the Ceph OSD daemons. Its highlights are better performance (roughly 2x for writes), full data checksumming, and built-in compression. Compared with the currently used FileStore backend, BlueStore stores objects directly on the Ceph block devices without any filesystem interface. BlueStore is the new default storage backend in the Luminous (12.2.z) release and will be used by default when provisioning new OSDs. BlueStore is not considered production-ready in Jewel, so running any production Jewel cluster with BlueStore as the backend is not recommended.

Some of the features and enhancements of BlueStore include:

  • RocksDB backend: Metadata is stored in a RocksDB backend as opposed to FileStore's current LevelDB. RocksDB is a multithreaded backend and is much more performant than the current LevelDB backend.
  • Multi-device support: BlueStore can use multiple block devices for storing different data.
  • No large double-writes: BlueStore will only fall back to typical write-ahead journaling scheme if write size is below a certain configurable threshold.
  • Efficient block device usage: BlueStore doesn't use a filesystem so it minimizes the need to clear the storage device cache.
  • Flexible allocator: BlueStore can implement different policies for different types of storage devices. Basically setting different behaviors between SSDs and HDDs.

How to do it...

OSDs can be deployed with the BlueStore backend via ceph-ansible. I encourage you to deploy a second Ceph cluster, or an OSD node with the BlueStore backend in your existing cluster, and run the benchmarks described earlier in this chapter against the BlueStore-backed Ceph cluster or OSD node; you will see a significant improvement in the rados bench tests!

ceph-node4 安装为通过 ceph-ansible 带有 BlueStore 后端的 OSD 节点,您可以执行以下操作:

  1. Add ceph-node4 to the /etc/ansible/hosts file under the [osds] section.
  2. In the group_vars/all.yml file on the ceph-ansible management node, ceph-node1, update the osd_objectstore setting and the config overrides:

        osd_objectstore: bluestore

        ceph_conf_overrides:
          global:
            enable experimental unrecoverable data corrupting features: 'bluestore rocksdb'

  3. In the group_vars/osds.yml file on the ceph-ansible management node, ceph-node1, update the following settings:
        bluestore: true
        journal_collocation: true
  4. Rerun the Ansible playbook on ceph-node1:
        root@ceph-node1 ceph-ansible # ansible-playbook site.yml
  5. Check the ceph -s output and note the new flags enabled in the cluster for the BlueStore experimental feature.
  6. Check the OSD data directory on one of the newly deployed OSDs backed by BlueStore and compare it to a FileStore-backed OSD; on the BlueStore OSD, you can see that the data directory links directly to the block device.

ceph-node4 has now been successfully deployed with three OSDs using the BlueStore backend. The rest of the OSDs in the cluster remain on the Jewel default FileStore backend. Feel free to test the performance difference between the BlueStore and FileStore backends using the tools provided in this chapter!
