2269348 – osdc/Journaler: better handle ENOENT during replay as up:standby-replay

Summary: osdc/Journaler: better handle ENOENT during replay as up:standby-replay

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Ceph Storage
Classification:	Red Hat Storage
Component:	CephFS
Sub Component:
Version:	6.0
Hardware:	x86_64
OS:	Linux
Priority:	unspecified
Severity:	medium
Target Milestone:	---
Target Release:	6.1z7
Assignee:	Venky Shankar
QA Contact:	Hemanth Kumar
Docs Contact:
URL:
Whiteboard:
Depends On:	2269347
Blocks:

Reported:	2024-03-13 09:58 UTC by Venky Shankar
Modified:	2024-08-28 17:59 UTC
CC List:	5 users
Fixed In Version:	ceph-17.2.6-230
Doc Type:	No Doc Update
Doc Text:
Clone Of:	2269347
Environment:
Last Closed:	2024-08-28 17:59:10 UTC
Embargoed:
Dependent Products:

Links
System	ID	Priority	Status	Summary	Last Updated
Ceph Project Bug Tracker	57048	None	None	None	2024-03-13 09:58:37 UTC
Red Hat Issue Tracker	RHCEPH-8557	None	None	None	2024-03-19 13:10:10 UTC
Red Hat Product Errata	RHBA-2024:5960	None	None	None	2024-08-28 17:59:12 UTC

Description Venky Shankar 2024-03-13 09:58:38 UTC

+++ This bug was initially created as a clone of Bug #2269347 +++

   -15> 2022-07-29T13:23:34.738+0000 7f1ee3d5a700  1 mds.21387370.journaler.mdlog(ro) recover start
   -14> 2022-07-29T13:23:34.738+0000 7f1ee3d5a700  1 mds.21387370.journaler.mdlog(ro) read_head
   -13> 2022-07-29T13:23:34.738+0000 7f1ee3d5a700  4 mds.0.log Waiting for journal 0x200 to recover...
   -12> 2022-07-29T13:23:34.742+0000 7f1ee455b700  1 mds.21387370.journaler.mdlog(ro) _finish_read_head loghead(trim 7788696698880, expire 7788721262080, write 7789114335530, stream_format 1).  probing for end of log (from 7789114335530)...
   -11> 2022-07-29T13:23:34.742+0000 7f1ee455b700  1 mds.21387370.journaler.mdlog(ro) probing for end of the log
   -10> 2022-07-29T13:23:34.742+0000 7f1eed56d700 10 monclient: get_auth_request con 0x55c5deab7000 auth_method 0
    -9> 2022-07-29T13:23:34.755+0000 7f1ee455b700  1 mds.21387370.journaler.mdlog(ro) _finish_probe_end write_pos = 7789125276317 (header had 7789114335530). recovered.
    -8> 2022-07-29T13:23:34.755+0000 7f1ee3d5a700  4 mds.0.log Journal 0x200 recovered.
    -7> 2022-07-29T13:23:34.755+0000 7f1ee3d5a700  4 mds.0.log Recovered journal 0x200 in format 1
    -6> 2022-07-29T13:23:34.755+0000 7f1ee3d5a700  2 mds.0.0 Booting: 1: loading/discovering base inodes
    -5> 2022-07-29T13:23:34.755+0000 7f1ee3d5a700  0 mds.0.cache creating system inode with ino:0x100
    -4> 2022-07-29T13:23:34.755+0000 7f1ee3d5a700  0 mds.0.cache creating system inode with ino:0x1
    -3> 2022-07-29T13:23:34.757+0000 7f1ee455b700  2 mds.0.0 Booting: 2: replaying mds log
    -2> 2022-07-29T13:23:34.798+0000 7f1ee455b700  0 mds.21387370.journaler.mdlog(ro) _finish_read got error -2
    -1> 2022-07-29T13:23:34.800+0000 7f1ee2d58700 -1 /builddir/build/BUILD/ceph-16.2.0/src/mds/MDLog.cc: In function 'void MDLog::_replay_thread()' thread 7f1ee2d58700 time 2022-07-29T13:23:34.799865+0000
/builddir/build/BUILD/ceph-16.2.0/src/mds/MDLog.cc: 1383: FAILED ceph_assert(journaler->is_readable() || mds->is_daemon_stopping())

 ceph version 16.2.0-152.el8cp (e456e8b705cb2f4a779689a0d80b122bcb0d67c9) pacific (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x158) [0x7f1ef2fc2b60]
 2: /usr/lib64/ceph/libceph-common.so.2(+0x274d7a) [0x7f1ef2fc2d7a]
 3: (MDLog::_replay_thread()+0x1d7c) [0x55c5dc72e2ec]
 4: (MDLog::ReplayThread::entry()+0x11) [0x55c5dc430101]
 5: /lib64/libpthread.so.0(+0x817a) [0x7f1ef1d6317a]
 6: clone()

     0> 2022-07-29T13:23:34.801+0000 7f1ee2d58700 -1 *** Caught signal (Aborted) **
 in thread 7f1ee2d58700 thread_name:md_log_replay

 ceph version 16.2.0-152.el8cp (e456e8b705cb2f4a779689a0d80b122bcb0d67c9) pacific (stable)
 1: /lib64/libpthread.so.0(+0x12c20) [0x7f1ef1d6dc20]
 2: gsignal()
 3: abort()
 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a9) [0x7f1ef2fc2bb1]
 5: /usr/lib64/ceph/libceph-common.so.2(+0x274d7a) [0x7f1ef2fc2d7a]
 6: (MDLog::_replay_thread()+0x1d7c) [0x55c5dc72e2ec]
 7: (MDLog::ReplayThread::entry()+0x11) [0x55c5dc430101]
 8: /lib64/libpthread.so.0(+0x817a) [0x7f1ef1d6317a]
 9: clone()
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

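In the log above, the Journaler's _finish_read got error -2 (ENOENT) during replay, after which MDLog::_replay_thread() failed ceph_assert(journaler->is_readable() || mds->is_daemon_stopping()) and the MDS aborted. The Journaler should not cause the MDS to assert in this situation; the error should be handled more gracefully.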
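
As an illustration only, the kind of graceful handling requested here could be sketched as a decision function that the replay loop consults instead of asserting. This is not the actual patch and does not use Ceph's real classes: MdsState, JournalerStatus, ReplayAction and decide_replay_action are hypothetical names invented for this sketch. It also assumes that, on a standby-replay daemon, ENOENT most likely means the journal object being read was trimmed by the active MDS, so restarting replay is a reasonable reaction.

```cpp
#include <cerrno>

// Hypothetical stand-ins for illustration; NOT Ceph's real types or signatures.
enum class MdsState { Replay, StandbyReplay };

enum class ReplayAction {
  Continue,       // an entry is readable, keep replaying
  Wait,           // nothing readable yet and no error, wait for more data
  RestartReplay,  // journal moved underneath us, start replay over
  MarkDamaged,    // journal object missing on a non-standby daemon
  Fail            // any other error
};

struct JournalerStatus {
  bool readable;  // true if an entry can be read right now
  int error;      // 0, or a negative errno such as -ENOENT
};

// Decide what the replay loop should do rather than asserting on !readable.
ReplayAction decide_replay_action(const JournalerStatus& j, MdsState state) {
  if (j.error == 0) {
    return j.readable ? ReplayAction::Continue : ReplayAction::Wait;
  }
  if (j.error == -ENOENT) {
    // Assumption: in standby-replay, a missing object means the active MDS
    // trimmed the journal while this daemon was reading it.
    return state == MdsState::StandbyReplay ? ReplayAction::RestartReplay
                                            : ReplayAction::MarkDamaged;
  }
  return ReplayAction::Fail;
}
```

With the state seen in the log (not readable, error -2, standby-replay), this sketch would choose RestartReplay instead of aborting the daemon.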

Comment 2 Scott Ostapovicz 2024-03-19 13:07:36 UTC

These BZs were targeted to z5 after the date when they should have been targeted at z6.

Comment 12 errata-xmlrpc 2024-08-28 17:59:10 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat Ceph Storage 6.1 security, bug fix, and enhancement updates.), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2024:5960
