Bug 1856960

Summary: [Tool] Update the ceph-bluestore-tool for adding rescue procedure for bluefs log replay
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: Neha Ojha <nojha>
Component: RADOS
Assignee: Adam Kupczyk <akupczyk>
Status: CLOSED ERRATA
QA Contact: Manohar Murthy <mmurthy>
Severity: urgent
Docs Contact:
Priority: urgent
Version: 3.3
CC: akupczyk, assingh, bhubbard, ceph-eng-bugs, cswanson, dzafman, gsitlani, jdurgin, kchai, linuxkidd, mmuench, mmurthy, nojha, pdhange, rmandyam, rollercow, rzarzyns, sseshasa, tpetr, tserlin, tvainio, vumrao, ykaul
Target Milestone: ---
Keywords: Reopened
Target Release: 4.2
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version: ceph-14.2.11-13.el8cp, ceph-14.2.11-13.el7cp
Doc Type: Bug Fix
Doc Text:
Cause: There were insufficient checks on the size of the BlueFS log during replay. Even when an OSD was not processing any external requests, it still periodically sent a small update to RocksDB, which appended data to the BlueFS log. Note: any actual operation on the OSD would have triggered log compaction.

Consequence: The BlueFS log grew so large that it could no longer be read. This went unnoticed until the OSD was restarted, at which point the OSD was unable to boot.

Fix: A heuristic recovery procedure was created that attempts to find the missing parts of the log on the device. It is enabled by setting "bluefs_replay_recovery=true". Because this is only a heuristic solution, fsck is necessary to check whether the recovery was successful. In normal mode, BlueFS compacts the log right after bootup; to prevent this compaction, "bluefs_replay_recovery_disable_compact=true" should be used until *fsck* returns success. The fix procedure therefore has two steps:

1) CHECK: ceph-bluestore-tool -l /proc/self/fd/1 --log-level 5 --path *osd path* fsck --debug_bluefs=5/5 --bluefs_replay_recovery=true --bluefs_replay_recovery_disable_compact=true

2) ACTUAL FIX: ceph-bluestore-tool -l /proc/self/fd/1 --log-level 5 --path *osd path* fsck --debug_bluefs=5/5 --bluefs_replay_recovery=true

Result: The OSD can now boot up.
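The two-step procedure from the doc text can be sketched as a small shell script. This is only an illustrative sketch: the OSD_PATH value below is a placeholder assumption (substitute the data path of the affected OSD), and the commands must be run while the OSD daemon is stopped.

```shell
#!/bin/sh
# Sketch of the two-step BlueFS log rescue procedure described above.
# OSD_PATH is a placeholder assumption; substitute the affected OSD's path.
OSD_PATH=/var/lib/ceph/osd/ceph-0

# Step 1: CHECK -- run fsck with replay recovery enabled but log compaction
# disabled, so nothing is persisted until the heuristic is verified.
ceph-bluestore-tool -l /proc/self/fd/1 --log-level 5 \
    --path "$OSD_PATH" fsck --debug_bluefs=5/5 \
    --bluefs_replay_recovery=true \
    --bluefs_replay_recovery_disable_compact=true || exit 1

# Step 2: ACTUAL FIX -- only after step 1 succeeds, rerun without disabling
# compaction so the recovered log is compacted and made permanent.
ceph-bluestore-tool -l /proc/self/fd/1 --log-level 5 \
    --path "$OSD_PATH" fsck --debug_bluefs=5/5 \
    --bluefs_replay_recovery=true
```

Note the `|| exit 1` guard: if the check step fails, the fix step must not be run, since compaction would persist an incomplete recovery.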
Story Points: ---
Clone Of: 1821133
Environment:
Last Closed: 2021-01-12 14:56:02 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1821133, 1856961    
Bug Blocks:    

Comment 8 errata-xmlrpc 2021-01-12 14:56:02 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat Ceph Storage 4.2 Security and Bug Fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:0081