Bug 1856960
Summary: | [Tool] Update the ceph-bluestore-tool for adding rescue procedure for bluefs log replay | ||
---|---|---|---|
Product: | [Red Hat Storage] Red Hat Ceph Storage | Reporter: | Neha Ojha <nojha> |
Component: | RADOS | Assignee: | Adam Kupczyk <akupczyk> |
Status: | CLOSED ERRATA | QA Contact: | Manohar Murthy <mmurthy> |
Severity: | urgent | Docs Contact: | |
Priority: | urgent | ||
Version: | 3.3 | CC: | akupczyk, assingh, bhubbard, ceph-eng-bugs, cswanson, dzafman, gsitlani, jdurgin, kchai, linuxkidd, mmuench, mmurthy, nojha, pdhange, rmandyam, rollercow, rzarzyns, sseshasa, tpetr, tserlin, tvainio, vumrao, ykaul |
Target Milestone: | --- | Keywords: | Reopened |
Target Release: | 4.2 | ||
Hardware: | x86_64 | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | ceph-14.2.11-13.el8cp, ceph-14.2.11-13.el7cp | Doc Type: | Bug Fix |
Doc Text: |
Cause:
There were not enough checks on the size of the BlueFS replay log.
Even when the OSD was not processing any external requests, it periodically sent a small update to RocksDB, and each update appended some data to the BlueFS log. Note: any actual operation on the OSD would have triggered log compaction.
Consequence:
The BlueFS log grew so large that it could no longer be read. The problem remained unnoticed until the OSD was restarted.
Fix:
Once the error condition is reached, the OSD is unable to boot.
A heuristic procedure has been created that attempts to find the missing parts of the log on the device.
It is enabled by setting "bluefs_replay_recovery=true".
Because the procedure is only heuristic, fsck must be run to check whether the recovery was successful.
In normal mode, BlueFS compacts the log right after bootup. To prevent this compaction, use "bluefs_replay_recovery_disable_compact=true" until *fsck* returns success.
The fix procedure therefore has two steps:
1) CHECK
ceph-bluestore-tool -l /proc/self/fd/1 --log-level 5 --path *osd path* fsck --debug_bluefs=5/5 --bluefs_replay_recovery=true --bluefs_replay_recovery_disable_compact=true
2) ACTUAL FIX
ceph-bluestore-tool -l /proc/self/fd/1 --log-level 5 --path *osd path* fsck --debug_bluefs=5/5 --bluefs_replay_recovery=true
Result:
The OSD can now boot up.
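The two steps above could be chained in a small wrapper so that the fix is attempted only if the check succeeds. The following is a minimal sketch, not part of the shipped tooling: the names `TOOL`, `run_fsck`, and `rescue_osd` are hypothetical, and the flags are copied unchanged from the two commands in this Doc Text. It assumes that ceph-bluestore-tool exits with a non-zero status when fsck fails.

```shell
#!/bin/sh
# Sketch of the two-step rescue procedure described in this bug.
# TOOL, run_fsck and rescue_osd are hypothetical names for illustration;
# the command-line flags are taken verbatim from the Doc Text.
TOOL="${TOOL:-ceph-bluestore-tool}"

run_fsck() {
    # $1 = OSD path; any remaining arguments are appended to the common flags.
    osd_path="$1"; shift
    "$TOOL" -l /proc/self/fd/1 --log-level 5 --path "$osd_path" fsck \
        --debug_bluefs=5/5 --bluefs_replay_recovery=true "$@"
}

rescue_osd() {
    osd_path="$1"
    # Step 1 (CHECK): heuristic replay with log compaction disabled.
    # Assumption: a non-zero exit status means fsck did not succeed.
    if ! run_fsck "$osd_path" --bluefs_replay_recovery_disable_compact=true; then
        echo "check failed; not attempting the fix" >&2
        return 1
    fi
    # Step 2 (ACTUAL FIX): same command, compaction allowed.
    run_fsck "$osd_path"
}
```

After sourcing the file, a call such as `rescue_osd /var/lib/ceph/osd/ceph-0` would run both steps; the path here is only an example and stands in for *osd path*.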
|
Story Points: | --- |
Clone Of: | 1821133 | Environment: | |
Last Closed: | 2021-01-12 14:56:02 UTC | Type: | --- |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | 1821133, 1856961 | ||
Bug Blocks: |
Comment 8
errata-xmlrpc
2021-01-12 14:56:02 UTC
|