Bug 1856960
| Summary: | [Tool] Update the ceph-bluestore-tool for adding rescue procedure for bluefs log replay | ||
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Ceph Storage | Reporter: | Neha Ojha <nojha> |
| Component: | RADOS | Assignee: | Adam Kupczyk <akupczyk> |
| Status: | CLOSED ERRATA | QA Contact: | Manohar Murthy <mmurthy> |
| Severity: | urgent | Docs Contact: | |
| Priority: | urgent | ||
| Version: | 3.3 | CC: | akupczyk, assingh, bhubbard, ceph-eng-bugs, cswanson, dzafman, gsitlani, jdurgin, kchai, linuxkidd, mmuench, mmurthy, nojha, pdhange, rmandyam, rollercow, rzarzyns, sseshasa, tpetr, tserlin, tvainio, vumrao, ykaul |
| Target Milestone: | --- | Keywords: | Reopened |
| Target Release: | 4.2 | ||
| Hardware: | x86_64 | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Fixed In Version: | ceph-14.2.11-13.el8cp, ceph-14.2.11-13.el7cp | Doc Type: | Bug Fix |
| Doc Text: |
Cause:
There were not enough checks on the size of the BlueFS log used for log replay.
Even when an OSD was not processing any external requests, it still periodically sent a small update to RocksDB. Each update appended some data to the BlueFS log. Note: any actual operation on the OSD would have triggered log compaction.
Consequence:
The BlueFS log grew so large that it could no longer be read. The problem remained unnoticed until the OSD was restarted.
Fix:
Once the error condition is reached, the OSD is unable to boot.
A heuristic procedure has been created that attempts to find the missing parts of the log on the device.
It is enabled by setting "bluefs_replay_recovery=true".
Because it is only a heuristic solution, fsck is necessary to check whether the process was successful.
In normal mode, BlueFS compacts the log right after bootup. To prevent this compaction, "bluefs_replay_recovery_disable_compact=true" should be used until *fsck* returns success.
So, the fix procedure has 2 steps:
1) CHECK
ceph-bluestore-tool -l /proc/self/fd/1 --log-level 5 --path *osd path* fsck --debug_bluefs=5/5 --bluefs_replay_recovery=true --bluefs_replay_recovery_disable_compact=true
2) ACTUAL FIX
ceph-bluestore-tool -l /proc/self/fd/1 --log-level 5 --path *osd path* fsck --debug_bluefs=5/5 --bluefs_replay_recovery=true
Result:
Now the OSD can boot up.
|
Story Points: | --- |
| Clone Of: | 1821133 | Environment: | |
| Last Closed: | 2021-01-12 14:56:02 UTC | Type: | --- |
| Bug Depends On: | 1821133, 1856961 | ||
| Bug Blocks: | |||
|
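The two-step procedure from the Doc Text can be wrapped in a small shell helper so that the fix step only runs after the check step succeeds. This is a minimal sketch, not part of the shipped tooling: the function names, the `DRY_RUN` variable, and the example OSD path are hypothetical, and it assumes that `ceph-bluestore-tool fsck` exits with a non-zero status when fsck does not succeed. The command-line options themselves are copied from the Doc Text.

```shell
#!/bin/sh
# Sketch of the CHECK / ACTUAL FIX sequence described in the Doc Text.
# Function names and DRY_RUN are illustrative assumptions.

# Print the fsck command line for a phase ("check" or "fix") and an OSD path.
bluefs_recovery_cmd() {
    phase="$1"
    osd_path="$2"
    cmd="ceph-bluestore-tool -l /proc/self/fd/1 --log-level 5 --path $osd_path fsck --debug_bluefs=5/5 --bluefs_replay_recovery=true"
    case "$phase" in
        # CHECK: recovery heuristic on, log compaction disabled.
        check) echo "$cmd --bluefs_replay_recovery_disable_compact=true" ;;
        # ACTUAL FIX: recovery heuristic on, compaction allowed again.
        fix)   echo "$cmd" ;;
        *)     echo "unknown phase: $phase" >&2; return 1 ;;
    esac
}

# Run CHECK first; run ACTUAL FIX only if CHECK exited successfully.
# With DRY_RUN=1 (the default) the commands are printed, not executed.
run_bluefs_recovery() {
    osd_path="$1"
    if [ "${DRY_RUN:-1}" = "1" ]; then
        bluefs_recovery_cmd check "$osd_path"
        bluefs_recovery_cmd fix "$osd_path"
        return 0
    fi
    # Assumes the OSD path contains no whitespace.
    sh -c "$(bluefs_recovery_cmd check "$osd_path")" &&
        sh -c "$(bluefs_recovery_cmd fix "$osd_path")"
}
```

For example, `run_bluefs_recovery /var/lib/ceph/osd/ceph-0` prints both commands for a hypothetical OSD path. Setting `DRY_RUN=0` executes them on a stopped OSD, stopping after the check step if it fails so that the log is not compacted prematurely.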
Comment 8
errata-xmlrpc
2021-01-12 14:56:02 UTC
|