Bug 2269348

Summary: osdc/Journaler: better handle ENOENT during replay as up:standby-replay
Product: [Red Hat Storage] Red Hat Ceph Storage Reporter: Venky Shankar <vshankar>
Component: CephFSAssignee: Venky Shankar <vshankar>
Status: CLOSED ERRATA QA Contact: Hemanth Kumar <hyelloji>
Severity: medium Docs Contact:
Priority: unspecified    
Version: 6.0CC: ceph-eng-bugs, cephqe-warriors, hyelloji, sostapov, tserlin
Target Milestone: ---   
Target Release: 6.1z7   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: ceph-17.2.6-230 Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of: 2269347 Environment:
Last Closed: 2024-08-28 17:59:10 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 2269347    
Bug Blocks:    

Description Venky Shankar 2024-03-13 09:58:38 UTC
+++ This bug was initially created as a clone of Bug #2269347 +++

-15> 2022-07-29T13:23:34.738+0000 7f1ee3d5a700  1 mds.21387370.journaler.mdlog(ro) recover start
   -14> 2022-07-29T13:23:34.738+0000 7f1ee3d5a700  1 mds.21387370.journaler.mdlog(ro) read_head
   -13> 2022-07-29T13:23:34.738+0000 7f1ee3d5a700  4 mds.0.log Waiting for journal 0x200 to recover...
   -12> 2022-07-29T13:23:34.742+0000 7f1ee455b700  1 mds.21387370.journaler.mdlog(ro) _finish_read_head loghead(trim 7788696698880, expire 7788721262080, write 7789114335530, stream_format 1).  probing for end of log (from 7789114335530)...
   -11> 2022-07-29T13:23:34.742+0000 7f1ee455b700  1 mds.21387370.journaler.mdlog(ro) probing for end of the log
   -10> 2022-07-29T13:23:34.742+0000 7f1eed56d700 10 monclient: get_auth_request con 0x55c5deab7000 auth_method 0
    -9> 2022-07-29T13:23:34.755+0000 7f1ee455b700  1 mds.21387370.journaler.mdlog(ro) _finish_probe_end write_pos = 7789125276317 (header had 7789114335530). recovered.
    -8> 2022-07-29T13:23:34.755+0000 7f1ee3d5a700  4 mds.0.log Journal 0x200 recovered.
    -7> 2022-07-29T13:23:34.755+0000 7f1ee3d5a700  4 mds.0.log Recovered journal 0x200 in format 1
    -6> 2022-07-29T13:23:34.755+0000 7f1ee3d5a700  2 mds.0.0 Booting: 1: loading/discovering base inodes
    -5> 2022-07-29T13:23:34.755+0000 7f1ee3d5a700  0 mds.0.cache creating system inode with ino:0x100
    -4> 2022-07-29T13:23:34.755+0000 7f1ee3d5a700  0 mds.0.cache creating system inode with ino:0x1
    -3> 2022-07-29T13:23:34.757+0000 7f1ee455b700  2 mds.0.0 Booting: 2: replaying mds log
    -2> 2022-07-29T13:23:34.798+0000 7f1ee455b700  0 mds.21387370.journaler.mdlog(ro) _finish_read got error -2
    -1> 2022-07-29T13:23:34.800+0000 7f1ee2d58700 -1 /builddir/build/BUILD/ceph-16.2.0/src/mds/MDLog.cc: In function 'void MDLog::_replay_thread()' thread 7f1ee2d58700 time 2022-07-29T13:23:34.799865+0000
/builddir/build/BUILD/ceph-16.2.0/src/mds/MDLog.cc: 1383: FAILED ceph_assert(journaler->is_readable() || mds->is_daemon_stopping())

 ceph version 16.2.0-152.el8cp (e456e8b705cb2f4a779689a0d80b122bcb0d67c9) pacific (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x158) [0x7f1ef2fc2b60]
 2: /usr/lib64/ceph/libceph-common.so.2(+0x274d7a) [0x7f1ef2fc2d7a]
 3: (MDLog::_replay_thread()+0x1d7c) [0x55c5dc72e2ec]
 4: (MDLog::ReplayThread::entry()+0x11) [0x55c5dc430101]
 5: /lib64/libpthread.so.0(+0x817a) [0x7f1ef1d6317a]
 6: clone()

     0> 2022-07-29T13:23:34.801+0000 7f1ee2d58700 -1 *** Caught signal (Aborted) **
 in thread 7f1ee2d58700 thread_name:md_log_replay

 ceph version 16.2.0-152.el8cp (e456e8b705cb2f4a779689a0d80b122bcb0d67c9) pacific (stable)
 1: /lib64/libpthread.so.0(+0x12c20) [0x7f1ef1d6dc20]
 2: gsignal()
 3: abort()
 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a9) [0x7f1ef2fc2bb1]
 5: /usr/lib64/ceph/libceph-common.so.2(+0x274d7a) [0x7f1ef2fc2d7a]
 6: (MDLog::_replay_thread()+0x1d7c) [0x55c5dc72e2ec]
 7: (MDLog::ReplayThread::entry()+0x11) [0x55c5dc430101]
 8: /lib64/libpthread.so.0(+0x817a) [0x7f1ef1d6317a]
 9: clone()
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Journaler should not cause the MDS to assert in this situation. We should handle this more gracefully.

Comment 2 Scott Ostapovicz 2024-03-19 13:07:36 UTC
These BZs were targeted to z5 after the date when they should have been targeted at z6.

Comment 12 errata-xmlrpc 2024-08-28 17:59:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat Ceph Storage 6.1 security, bug fix, and enhancement updates.), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2024:5960