
Reading Notes on the Ceph Cookbook, Second Edition: An Introduction to Troubleshooting Ceph

An Introduction to Troubleshooting Ceph

In this chapter, we will cover the following recipes:

  • Initial troubleshooting and logging
  • Troubleshooting network issues
  • Troubleshooting monitors
  • Troubleshooting OSDs
  • Troubleshooting placement groups

Introduction

The earlier recipes in this book have taught you how to deploy and manage a Ceph cluster and how to configure different clients to access Ceph object storage, but what if something goes wrong in your cluster? Over time, the hardware backing a Ceph cluster will fail, and troubleshooting Ceph does not have to be a frightening ordeal if you understand what the errors mean and what to look for in order to narrow down the actual problem. In this chapter, we will cover tips that will make you proficient at resolving a variety of issues on a Ceph cluster, and show you that, with a proper understanding of some common error messages, troubleshooting Ceph really isn't that difficult!

Initial troubleshooting and logging

When you begin troubleshooting a Ceph issue, the first thing you need to do is determine which Ceph component is causing the problem. Sometimes this component is clearly flagged in the ceph health detail output or the status command, but at other times further investigation is required to uncover the actual issue. Validating the high-level cluster state can also help you determine whether you are dealing with a single failure or an entire node failure. It is also wise to verify whether something in your configuration could be contributing to the problem, such as a deprecated setting or misconfigured hardware in the environment. The various recipes in this chapter will help narrow down these issues, but let's first look at a high-level overview of the cluster and what these commands can tell us.

How to do it...

Let's review some of the initial cluster status commands to help determine where we need to focus our troubleshooting:

  1. Validate the overall health of the Ceph storage cluster:
      # ceph health detail
  2. If the cluster is in a HEALTH_OK state, this command really does not tell you much. But if there are PGs or OSDs in the error state, this command can provide further details about the state of the cluster. Let's stop one of our OSD processes on ceph-node2 and then rerun the ceph health detail to see what it reports:
      root@ceph-node2 # systemctl stop ceph-osd@<id>  
                      # ceph health detail
  3. With an OSD process stopped, you can see that the ceph health detail output is much more verbose in its reporting (the output has been trimmed here). It shows the actual number of degraded objects due to the down OSD, and that the PGs backed by the OSD that was stopped on ceph-node2 are unclean and undersized, as they have not yet been backfilled to another OSD.
  4. Check the current status of your Ceph cluster:
      # ceph status

ceph status (or ceph -s) provides more information than ceph health detail, but it will not be as detailed if there is an issue in the cluster. Let's once again stop one of the OSD processes on ceph-node2 and then rerun ceph status to see what it reports:

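The exact output depends on your cluster and Ceph release, but with one OSD stopped on ceph-node2 it will look roughly like the following sketch (the values shown are hypothetical and for illustration only):

      # ceph -s
          cluster <cluster-id>
           health HEALTH_WARN
                  64 pgs degraded; 64 pgs undersized;
                  recovery 156/1404 objects degraded (11.111%);
                  1/9 in osds are down
           monmap e1: 3 mons at {ceph-node1,ceph-node2,ceph-node3}, quorum 0,1,2
           osdmap e45: 9 osds: 8 up, 9 in
            pgmap v1280: 192 pgs, 3 pools: 64 active+undersized+degraded, 128 active+clean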

Chapter 7, Monitoring Ceph Clusters, covers these commands in more detail. Any of these commands can help you narrow down the affected component so that you can begin troubleshooting the failure at the component level or the node level.

Troubleshooting network issues

Ceph storage depends heavily on the underlying network configuration and on having a dedicated and reliable network connection. The network is one of the most important parts of the cluster, and the Ceph nodes use it to communicate with each other. Problems in the network infrastructure can cause many issues in a Ceph cluster, from flapping OSDs (OSDs repeatedly going down and coming back up) to monitor clock skew errors. In addition, network errors such as packet loss or high latency can cause stability and performance problems across the entire cluster.

How to do it...

If a cluster communication problem is suspected, here are some initial checks you can perform:

  1. Verify that the ceph.conf file has the IP address values set for cluster_network and public_network.
  2. Verify that the networking interfaces on the nodes are up. You can use a Linux utility, such as ifconfig or ip address, to view the state of the configured networking interfaces.
  3. Validate that all Ceph nodes are able to communicate via short hostnames. You can validate that the /etc/hosts file on each node is properly updated.

  4. Validate that the proper firewall ports are open for each component in the Ceph cluster. You can utilize the firewall-cmd utility to validate the port status. The proper firewall ports for each component are discussed in Chapter 1, Ceph - Introduction and Beyond:
      # firewall-cmd --info-zone=public

Port 6789 and ports 6800 through 7100 are open on the interfaces enp0s3 and enp0s8 because this node hosts both a monitor and OSDs.
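If one of the required port ranges turns out to be closed, it can be opened with the same utility. The following is a minimal sketch assuming the default public zone and the MON/OSD port ranges mentioned above; adjust the zone and ranges to match your environment:

      # firewall-cmd --zone=public --add-port=6789/tcp --permanent
      # firewall-cmd --zone=public --add-port=6800-7100/tcp --permanent
      # firewall-cmd --reload
      # firewall-cmd --zone=public --list-ports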

  5. Validate that the network interfaces have no errors and that there is no latency between nodes. The Linux utilities ethtool and ping can be used to validate:
      # ethtool -S <interface>
      # ping <ceph node>
  6. If network performance is suspect, then the iperf utility can be utilized. This tool was discussed in detail in Chapter 12, More on Ceph.
  7. Verify that NTP is set up and running on each of the Ceph nodes in the cluster:
       # systemctl status ntpd
       # ntpq -p
  8. Lastly, you can verify that all Ceph nodes in the cluster have identical network configurations, as a single slow node can weigh down the entire Ceph cluster. This includes verifying that all nodes have identical MTU settings, bonding, and speed. The ip address and ethtool utilities can be used in validating this.
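For example, a quick way to compare MTU and link speed across nodes is a sketch like the following (the interface name enp0s8 is an assumption; substitute your own). If jumbo frames are configured, the ping with fragmentation prohibited (-M do) and an 8972-byte payload additionally confirms that a 9000-byte MTU actually passes end to end:

      # ip address show enp0s8 | grep mtu
      # ethtool enp0s8 | grep Speed
      # ping -M do -s 8972 -c 3 <ceph node>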

Troubleshooting monitors

Ceph monitor issues can usually be seen in the ceph status and ceph health detail command output, which will identify which monitor is reporting a problem. The Ceph monitor log is located at /var/log/ceph/ceph-mon.<node-name>.log and can be investigated to determine the root cause of a monitor failure or error. The upcoming recipes cover some of the common issues that can be seen with monitors in a Ceph cluster.
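As a starting point, it often helps to watch the log of the monitor that is reporting the problem and to check that monitor's view of the cluster through its admin socket; a minimal sketch, assuming the monitor in question is ceph-node1 (the help subcommand lists the other commands available on the admin socket):

      # tail -f /var/log/ceph/ceph-mon.ceph-node1.log
      # ceph daemon mon.ceph-node1 mon_status
      # ceph daemon mon.ceph-node1 help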

How to do it...

Let's take a look at some of the most common monitor errors and some steps on how to resolve these issues:

  1. Ceph monitor is reporting a clock skew:
    • Clock skew can be reported in the monitor log file as well as in the ceph status command.
    • This error indicates that the monitor reporting the clock skew error has a clock that is not synchronized with the rest of the monitors in the cluster.
    • This error is usually due to improper NTP configuration or no NTP server running on the monitor node.
    • Network issues can also lead to clock skew errors. Follow the recipe in the previous section for proper NTP investigation and network troubleshooting.
  2. Ceph monitor is reported as out of quorum:
    • Out of quorum can be reported in the monitor log file as well as in the ceph status command.
    • This error indicates that one or more MONs is marked as down but the other monitors in the cluster are able to form a quorum:
      • If the monitor is stuck in a probing state, then you would want to validate the network connectivity between the monitor and the other nodes in the cluster. You can follow the previous section on network troubleshooting.
      • If the monitor is stuck in a state of electing, then you would want to check for a clock skew.
      • If the monitor sets itself as a leader but is not in quorum with the rest of the monitors, then the monitors may have a clock skew issue.
    • If the monitor process is running but the MON is marked as down, you can check the current status of the monitor by utilizing the admin socket on the MON:
                       # ceph daemon mon.<nodename> mon_status
    • If the monitor process is not running, you can attempt to start the monitor:
                  # systemctl start ceph-mon@<node name>
                  # systemctl status ceph-mon@<node name>
    • If the monitor fails to start, then review the monitor log file /var/log/ceph/ceph-mon.<node-name>.log for a failure.
    • If the monitor log contains a Corruption: error in middle of record error, then the monitor likely has a corrupted mon store. To resolve this issue, replace the Ceph monitor, as covered in Chapter 8, Operating and managing a Ceph cluster.
    • If the monitor log contains the Caught signal (Bus error) error, then the monitor's /var partition is likely full. You will need to delete any unused data from the /var partition, but do not delete any data from the monitor directory manually. You can manually compact the monitor store database using the following command, or by configuring it to compact at start in the ceph.conf file under the [mon] section and restarting the monitor:
                  # ceph tell mon.<node name> compact
                    mon_compact_on_start = true
  3. Ceph monitor is reporting store is getting too big!:
    • The Ceph MON store is a LevelDB database (store.db) that stores key-value pairs. The monitor can actually be delayed in responding to client requests if the store.db is too big, as it takes a longer time to query a large LevelDB database.
    • This error is logged before the /var partition on the monitor becomes full.
    • Validate the current size of the Ceph monitor store:
             # du -sch /var/lib/ceph/mon/<cluster name>-
               <node name>/store.db
    • Compact the monitor store as required and as discussed.
  4. How to inject a valid monitor map if the monitor has a corrupted or outdated map:
    • A Ceph monitor cannot gain a quorum if its map is outdated or corrupted. If two or more monitors are in quorum, the safest option is to obtain a valid monmap and inject it into the monitor that is unable to gain quorum:
      • On a monitor that is in quorum, pull the valid monmap:
                        # ceph mon getmap -o /tmp/monmap
      • Stop the monitor with the corrupted monmap, copy the valid monmap from a good monitor, and inject the monmap into the bad monitor:
                         # systemctl stop ceph-mon@<node-name>
                         # scp /tmp/monmap root@<nodename>:/tmp/monmap
                         # ceph-mon -i <id> --inject-monmap /tmp/monmap
      • Start the monitor with the corrupted monmap:
                        # systemctl start ceph-mon@<node-name>
    • If all of the monitors are unable to gain quorum, for example, if you only have one monitor with a valid monmap, then the recovery scenario differs from the case where a valid quorum exists:
      • On a monitor that has a valid monmap, stop the monitor:
                         # systemctl stop ceph-mon@<node-name>
      • Pull the valid monmap from the stopped monitor:
                        # ceph mon getmap -o /tmp/monmap
      • Stop one of the monitors with a corrupted monmap:
                        # systemctl stop ceph-mon@<node-name>
      • Copy the good monmap to the monitor with the corrupted map:
                        # scp /tmp/monmap root@<nodename>:/tmp/monmap
      • Inject the monmap from the good monitor:
                        # ceph-mon -i <id> --inject-monmap /tmp/monmap
      • Start the monitor and check the mon_status to validate the quorum:
                        # systemctl start ceph-mon@<node-name>
                        # ceph daemon mon.<nodename> mon_status
    • Let's see an example where ceph-node1 has a good copy of the monmap and ceph-node2 has a corrupted monmap. We copy the good monmap from ceph-node1, inject it into ceph-node2, and then check mon_status; the full command sequence is sketched below.
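Putting the generic steps together, the sequence for this example would look roughly as follows. This is a sketch built from the commands above; it assumes the monitor ID passed to ceph-mon -i is the node name, as is common in such deployments:

      root@ceph-node1 # ceph mon getmap -o /tmp/monmap
      root@ceph-node1 # scp /tmp/monmap root@ceph-node2:/tmp/monmap
      root@ceph-node2 # systemctl stop ceph-mon@ceph-node2
      root@ceph-node2 # ceph-mon -i ceph-node2 --inject-monmap /tmp/monmap
      root@ceph-node2 # systemctl start ceph-mon@ceph-node2
      root@ceph-node2 # ceph daemon mon.ceph-node2 mon_status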

Troubleshooting OSDs

As with Ceph monitor issues, Ceph OSD issues usually show up first in the ceph health detail and status commands. This will generally give you some idea of where to start looking for the actual problem, for example, whether a single OSD is down or a block of OSDs corresponding to a specific host is down. The Ceph OSD log, located at /var/log/ceph/ceph-osd.<id>.log on the node hosting the particular OSD process, is the first place to look when troubleshooting an OSD issue. The following recipes will show you how to troubleshoot some of the more common Ceph OSD issues that you may encounter in your Ceph cluster.
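For example, to see quickly whether the down OSDs all belong to one host, and to locate a given OSD's node and log file, a sketch like the following can help (osd.1 is a hypothetical ID; ceph osd find prints the IP address and CRUSH location of the OSD):

      # ceph osd tree | grep down
      # ceph osd find 1
      # less /var/log/ceph/ceph-osd.1.log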

How to do it...

Before beginning to troubleshoot OSDs, it is a good idea to first validate the cluster and public networks between the Ceph nodes, as OSD down issues are commonly caused by communication problems between peer OSDs and the MONs:

  1. How to handle a full OSD flag on your Ceph cluster:
    • Running a ceph health detail will provide you with the OSD ID that is currently flagged as full by the cluster:
              # ceph health detail
    • A full flag is indicated by the Ceph config option mon_osd_full_ratio; by default, this is 95%. Note that this config setting applies only to a situation when the flag gets set on the cluster and does not apply to the actual PGs.
    • Ceph will prevent client I/O from writing to a PG that resides on an OSD that has the full flag set to prevent any chance of data loss to the PGs on the full OSD.
    • We would want to validate the percentage of RAW capacity used to determine which safe recovery methods to take and whether we have free space in the cluster for the recovery. If the RAW capacity used is less than 75%, it is considered safe for recovery actions:
              # ceph df
    • Deleting unneeded data from the cluster is the easiest method of recovery, but a delete is a write operation, and writes are blocked to PGs on an OSD that is over 95% utilized with the full flag set. In order for the delete operation to succeed, we must temporarily set the PG full ratio on the cluster higher than 95%:
               # ceph pg set_full_ratio 0.98
    • Adding additional OSDs or OSD nodes is also a method to increase the RAW capacity in the cluster if you are just running out of space.
  2. How to handle a near full OSD flag on your Ceph cluster:
    • As with the full flag, the near full flag can also be seen in the ceph health detail command and will provide the OSD ID that is flagged:
               # ceph health detail
    • A near full flag is indicated by the Ceph config option mon_osd_nearfull_ratio; by default, this is 85%. Note that this config setting applies only when the flag gets set on the cluster and does not apply to the actual PGs.
    • Typical causes and remedies for near full OSDs are as follows:
      • Improper imbalance of OSD count per node throughout the cluster. Look into the proper cluster balance.
      • Improper balance of OSD weights in the CRUSH map throughout the cluster. Look into proper cluster balance or implement OSD weights by utilization. This will reweight OSDs automatically based upon the OSD utilization average (threshold) throughout the cluster. You can test this prior to implementation as well:
                       # ceph osd test-reweight-by-utilization 
                         [threshold] [weight change amount]
                          [number of osds]
                       # ceph osd reweight-by-utilization [threshold] 
                         [weight change amount] [number of osds] 
      • Improper PG count setting per OSD count in the cluster. Utilize the Ceph PG calculator (http://ceph.com/pgcalc/) tool to verify a proper PG count per OSD count in the Ceph cluster.
      • Old CRUSH tunables being run on the cluster (optimal is recommended). Set CRUSH tunables to optimal in the cluster. Verify that any running clients support the version of tunables being set on the Ceph cluster. Kernel clients must support the running tunable, or issues will arise. Tunables and their features are detailed at http://docs.ceph.com/docs/master/rados/operations/crush-map/?highlight=crush%20tunables#tunables:
                       # ceph osd crush tunables optimal
      • Simply putting more data into the cluster than the backing hardware can support. Look into adding additional OSDs or OSD nodes to the Ceph cluster.
  3. Ceph cluster reports that one or more OSDs are down:
    • If a single OSD is down, this error is typically due to a hardware failure on the backing disk.
    • If multiple OSDs are down, this error is typically due to communication issues between the peer OSDs or the OSDs and the monitors.
    • In order to determine a direction to go for troubleshooting down OSDs, you need to determine which OSDs are actually down. The ceph health detail command provides these details:
                 # ceph health detail
    • You can try and restart the OSD daemon:
                # systemctl start ceph-osd@<id>
    • If the OSD daemon cannot start, then review the following common issues:
      • Validate that the OSD and journal partitions are mounted:
                         # ceph-disk list
    • If your OSD node has more than ten OSDs, then validate that you have properly set the kernel max PID setting. The default is 32768; increase it to 4194303 and restart the OSD processes.

Check the current pid_max setting:

                          # cat /proc/sys/kernel/pid_max

/etc/sysctl.conf 中,set: kernel.pid_max = 4194303
设置更改无需重启:

                          # sysctl -p

Verify that the change was made:

                          # sysctl -a | grep kernel.pid_max
      • If you receive ERROR: missing keyring, cannot use cephx for authentication, then the OSD is missing its keyring and you will need to register the OSD's keyring:
                           # ceph auth add osd.{osd-num} osd 'allow *' 
                             mon 'allow rwx' -i /var/lib/ceph/osd/ceph-
                             {osd-num}/keyring
      • If you receive ERROR: unable to open OSD superblock on /var/lib/ceph/osd/ceph-<id>, then the OSD daemon cannot read the underlying filesystem. Review /var/log/messages or dmesg for any errors against the device backing the OSD node and replace the faulty media.
      • If you receive the FAILED assert(!m_filestore_fail_eio || r != -5) error at /var/log/ceph/ceph-osd.<id>.log, then review /var/log/messages or dmesg for any errors against the device backing the OSD node and replace the faulty media.
    • If the OSD daemon is running but is still marked as down, then review the following issues:
      • Check the OSD log file /var/log/ceph/ceph-osd.<id>.log for a wrongly marked me down message and verify whether any recovery or scrubbing activity was occurring at the time the OSD was marked down. The cluster log file located on the monitor node, /var/log/ceph/ceph.log, can be reviewed for these events. These operations can take an extended amount of time in certain scenarios and can cause the OSD to not respond to heartbeat requests and be marked down by its peers.
        You can increase the grace period OSDs will wait for heartbeat packets, but this is not recommended. The default is 20 seconds, and this can be set in the [osd] section of the conf file:
                           osd_heartbeat_grace = 20
      • If no scrubbing or recovery occurred at the time the OSD was wrongly marked down, then validate in the ceph health detail command whether any blocked or slow requests are seen against any OSDs:
                           # ceph health detail
      • See whether you can find a commonality between the OSDs that have the blocked or slow requests logging. If a common OSD node is seen, then you likely have a network issue on that OSD node; if a common rack is seen, then you likely have a network issue on that rack. You can sort the down OSDs by executing the following:
                           # ceph osd tree | grep down
    • How to troubleshoot slow requests or blocked requests on an OSD (a consolidated sketch follows this list):
      • A slow request is a request that has been in the OSD op_queue for 30 seconds or more and has not been serviced. This is configurable by setting the osd_op_complaint_time, default 30 seconds. It is not recommended that you change this, as that can lead to false reporting issues.
      • Main causes of these issues are network issues, hardware issues, and system load.
      • To begin troubleshooting these issues, you will first want to validate whether the OSDs with the slow requests are sharing any common hardware or networking; if they are, then the hardware/network is the likely suspect.
      • You can drill down further into the slow requests by dumping the historic ops on the OSD while the slow requests are logging:
                          # ceph daemon osd.<id> dump_historic_ops
      • This can tell you the type of slow request and can help pinpoint where to look further. The types of common slow requests are as follows:
        • waiting for rw locks: OSD is waiting to acquire a lock on PG for OP
        • waiting for sub ops: OSD is waiting for a replica OSD to commit the OP to the journal
        • no flag points reached: OP hasn't reached any major OP milestone
        • waiting for degraded object: OSD hasn't replicated an object a certain number of times yet
      • Using Linux tools such as iostat can help determine a poorly performing OSD that can be leading to slow requests. Reviewing for high await times can help pinpoint a poorly performing disk. More information on the iostat utility can be found at http://man7.org/linux/man-pages/man1/iostat.1.html:
                          # iostat -x 1
      • Using Linux tools such as netstat can help determine a poorly performing OSD node that may be leading to slow requests. Refer to the earlier recipes on Troubleshooting network issues:
                              # netstat -s
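Pulling these steps together, a typical first pass at a slow request investigation might look like the following sketch (osd.7 is a hypothetical ID, and the ceph daemon command must be run on the node hosting that OSD):

      # ceph health detail | grep -iE 'slow|blocked'
      # ceph daemon osd.7 dump_historic_ops
      # iostat -x 1 5
      # netstat -s | grep -i retrans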

Troubleshooting placement groups

As with Ceph daemon issues, placement group issues usually show up first in the ceph health detail and status commands. They are often accompanied by OSDs that are in a down state or clock skew issues on the monitors. Before proceeding with troubleshooting placement groups, verify that all of your Ceph monitors are up and in quorum and that all Ceph OSDs are in an up/in state. The following recipes will show you how to troubleshoot some of the more common Ceph placement group issues that you may encounter in your Ceph cluster.
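Before drilling into individual PG states, you can get a quick overview of which PGs are stuck and in which state; a minimal sketch (the stuck states queried here are the ones covered in this recipe):

      # ceph pg stat
      # ceph pg dump_stuck stale
      # ceph pg dump_stuck inactive
      # ceph pg dump_stuck unclean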

How to do it...

Before beginning to troubleshoot PG states, confirm that the Ceph monitors are all up and in quorum and that any available OSDs are also in an up/in state:

  1. How to handle stale placement groups:
    • A PG is labeled stale by the Ceph monitor when it does not receive a status update of the PG from the primary OSD or if a peer OSD reports the primary OSD as down.
    • The stale PG state is commonly seen after recently bringing the Ceph cluster up and when the PGs have not completed peering.
    • If the PG is stuck in a stale state, then validate with ceph health detail about which OSDs are currently down in the PG's acting set and attempt to bring that OSD online. If you're unable to bring it online, refer to the previous section on Troubleshooting OSDs:
              # ceph health detail
              # systemctl start ceph-osd@<id>
  2. How to handle unclean placement groups:
    • A PG is labeled unclean if it has not been active+clean for 300 seconds, as defined by the mon_pg_stuck_threshold.
    • A PG labeled unclean has not properly replicated its objects for the required replication size on the pool.
    • An unclean PG is usually due to an OSD being down. Review the ceph health detail and ceph osd tree for any down OSDs and resolve as necessary:
             # ceph osd tree | grep down
             # ceph health detail
  3. How to handle inconsistent placement groups:
    • A PG is marked inconsistent when there is a mismatch between objects on its replicas. Examples include differences in object size and objects missing in the replica after recovery completion.
    • These errors are typically flagged by scrubbing on the Ceph cluster.
    • In order to determine why the PG is flagged as inconsistent, we can do the following:
      • Issue a deep-scrub on the inconsistent placement group:
                           # ceph pg deep-scrub <pg.id>
      • Review the ceph -w command for a message related to deep-scrub on that PG:
                           # ceph -w |grep <pg.id>
      • Validate that the deep-scrub error fits one of the following repairable scenarios:
        • <pg.id> shard <osd>: soid <object> missing attr _, missing attr <attr type>
        • <pg.id> shard <osd>: soid <object> digest 0 != known digest <digest>, size 0 != known size <size>
        • <pg.id> shard <osd>: soid <object> size 0 != known size <size>
        • <pg.id> deep-scrub stat mismatch, got <mismatch>
        • <pg.id> shard <osd>: soid <object> candidate had a read error, digest 0 != known digest <digest>
      • If the error reported fits one of these scenarios, then the PG can be safely repaired and have a deep-scrub rerun to validate the repair:
                           # ceph pg repair <pg.id>
                            # ceph pg deep-scrub <pg.id>
      • If the output indicates one of the following, then do not repair the PG and open a case with Red Hat support for assistance or reach out to the Ceph community:
        • <pg.id> shard <osd>: soid <object> digest <digest> != known digest <digest>
        • <pg.id> shard <osd>: soid <object> omap_digest <digest> != known omap_digest <digest>
  4. How to handle down placement groups:
    • When a PG is in a down state, it will not actively be serving client I/O.
    • A down PG is typically due to a peering failure caused by one or more down OSDs.
    • To determine the cause for a down PG, we need to query the PG:
                    # ceph pg <pg.id> query
    • When reviewing the PG query output, there is a section called recovery_state. This section highlights the PG's current recovery and will flag any issues that are blocking the current recovery.
    • If recovery_state shows peering is blocked by down osd, then bring the down OSD back up to recover the PG. Otherwise, if the OSD process cannot be started, review the Troubleshooting OSDs section for further methods of OSD investigation.
  5. How to handle inactive placement groups:
    • When a PG is inactive, it will not be serving client I/O.
    • A PG is set to the inactive state when it has not been active for 300 seconds, as defined by the mon_pg_stuck_threshold.
    • PGs that are inactive are usually due to OSDs that are in a down state. Use the ceph health detail and ceph osd tree commands to determine the down OSDs:
                   # ceph health detail
                   # ceph osd tree | grep down
  6. How to handle placement groups that have unfound objects (a worked sketch follows this list):
    • Unfound objects exist when Ceph is aware that newer copies of an object exist but is unable to find them. Because Ceph cannot find the newer copies, it cannot complete recovery and so marks the objects as unfound.
    • A cluster that has placement groups with unfound objects has usually experienced an OSD node failure or an OSD flapping scenario in which OSDs in the PG's acting set were continually dropping offline and coming back online before recovery of the objects could complete.
    • The ceph health detail command will flag PGs with unfound objects and will help us determine which OSDs need investigation:
                    # ceph health detail
    • We can query the PG and review recovery_state, as we did previously with down placement groups, to determine further details on the OSDs that need to be probed for the missing objects:
                    # ceph pg <pg.id> query
    • Resolve any of the OSDs that are in a down state in order to recover the unfound objects.
    • If you're unable to recover these objects, open a support ticket with Red Hat support or reach out to the Ceph community for assistance.
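As a worked sketch of the unfound-object flow (the PG ID 4.1f and osd.2 used here are hypothetical): identify the affected PG, query it to see which down OSD still needs to be probed, and bring that OSD back up on its host. The recovery_state section of the query output typically lists the OSDs that might hold the unfound objects and whether each one has already been queried or is still down:

      # ceph health detail | grep unfound
      # ceph pg 4.1f query
      # systemctl start ceph-osd@2
      # ceph health detail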

There's more…

Now that you have a general understanding of troubleshooting the different components in Ceph, you are well equipped to handle failures in a Ceph cluster without panicking. But what if your Ceph cluster finds itself in an unrepairable state and, no matter what you do, you cannot recover the cluster? The Ceph community is a large and knowledgeable community that is always willing to help fellow Ceph users. For more information on the Ceph community, see http://ceph.com/community/ and join the Ceph community IRC channels and mailing lists: http://ceph.com/irc/.