Bug 2018110 - Ceph monitor crash after upgrade from ceph
Summary: Ceph monitor crash after upgrade from ceph
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: CephFS
Version: 5.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 5.1
Assignee: Patrick Donnelly
QA Contact: Amarnath
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-10-28 09:35 UTC by Venky Shankar
Modified: 2022-04-04 10:22 UTC
CC List: 3 users

Fixed In Version: ceph-16.2.6-20.el8cp
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-04-04 10:22:22 UTC
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
Ceph Project Bug Tracker 52820 0 None None None 2021-10-28 09:35:06 UTC
Ceph Project Bug Tracker 52874 0 None None None 2021-10-28 09:38:46 UTC
Red Hat Issue Tracker RHCEPH-2120 0 None None None 2021-10-28 09:36:54 UTC
Red Hat Product Errata RHSA-2022:1174 0 None None None 2022-04-04 10:22:41 UTC

Description Venky Shankar 2021-10-28 09:35:06 UTC
Seen when upgrading from the 15.2.14 release to the 16.2.6 release:

Oct 05 18:42:57 virthost2 ceph-mon[115602]:  in thread 7f14a74f3700 thread_name:ms_dispatch
Oct 05 18:42:57 virthost2 ceph-mon[115602]:  ceph version 16.2.6 (1a6b9a05546f335eeeddb460fdc89caadf80ac7a) pacific (stable)
Oct 05 18:42:57 virthost2 ceph-mon[115602]:  1: /lib/x86_64-linux-gnu/libpthread.so.0(+0x14140) [0x7f14b0190140]
Oct 05 18:42:57 virthost2 ceph-mon[115602]:  2: gsignal()
Oct 05 18:42:57 virthost2 ceph-mon[115602]:  3: abort()
Oct 05 18:42:57 virthost2 ceph-mon[115602]:  4: /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x9a7ec) [0x7f14b00437ec]
Oct 05 18:42:57 virthost2 ceph-mon[115602]:  5: /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xa5966) [0x7f14b004e966]
Oct 05 18:42:57 virthost2 ceph-mon[115602]:  6: /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xa59d1) [0x7f14b004e9d1]
Oct 05 18:42:57 virthost2 ceph-mon[115602]:  7: /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xa5c65) [0x7f14b004ec65]
Oct 05 18:42:57 virthost2 ceph-mon[115602]:  8: /usr/lib/ceph/libceph-common.so.2(+0x28982a) [0x7f14b06e682a]
Oct 05 18:42:57 virthost2 ceph-mon[115602]:  9: (MDSMonitor::tick()+0x475) [0x55a3d8709015]
Oct 05 18:42:57 virthost2 ceph-mon[115602]:  10: (MDSMonitor::on_active()+0x28) [0x55a3d86ef068]
Oct 05 18:42:57 virthost2 ceph-mon[115602]:  11: (Context::complete(int)+0x9) [0x55a3d850fc29]
Oct 05 18:42:57 virthost2 ceph-mon[115602]:  12: (void finish_contexts<std::__cxx11::list<Context*, std::allocator<Context*> > >(ceph::common::CephContext*, std::__cxx11::list<Context*, std::allocator<Context*> >&, int)+0xa8) [0x55a3d853b458]
Oct 05 18:42:57 virthost2 ceph-mon[115602]:  13: (Paxos::finish_round()+0x70) [0x55a3d8623100]
Oct 05 18:42:57 virthost2 ceph-mon[115602]:  14: (Paxos::dispatch(boost::intrusive_ptr<MonOpRequest>)+0x3d3) [0x55a3d8624ef3]
Oct 05 18:42:57 virthost2 ceph-mon[115602]:  15: (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0x116b) [0x55a3d850d7eb]
Oct 05 18:42:57 virthost2 ceph-mon[115602]:  16: (Monitor::_ms_dispatch(Message*)+0x41e) [0x55a3d850de2e]
Oct 05 18:42:57 virthost2 ceph-mon[115602]:  17: (Dispatcher::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x59) [0x55a3d853c9d9]
Oct 05 18:42:57 virthost2 ceph-mon[115602]:  18: (Messenger::ms_deliver_dispatch(boost::intrusive_ptr<Message> const&)+0x468) [0x7f14b08d5eb8]
Oct 05 18:42:57 virthost2 ceph-mon[115602]:  19: (DispatchQueue::entry()+0x5ef) [0x7f14b08d35bf]
Oct 05 18:42:57 virthost2 ceph-mon[115602]:  20: (DispatchQueue::DispatchThread::entry()+0xd) [0x7f14b0990cbd]
Oct 05 18:42:57 virthost2 ceph-mon[115602]:  21: /lib/x86_64-linux-gnu/libpthread.so.0(+0x8ea7) [0x7f14b0184ea7]
Oct 05 18:42:57 virthost2 ceph-mon[115602]:  22: clone()
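
For reference, frames 4-7 above are inside libstdc++, which together with the abort() at frame 3 is consistent with an uncaught C++ exception reaching std::terminate() out of MDSMonitor::tick(). A rough sketch of how the crash details can be collected on an affected mon node (unit and log names are assumptions and vary by deployment method):

# Crash reports registered with the mgr crash module (Nautilus and later):
ceph crash ls
ceph crash info <crash-id>        # <crash-id> is a placeholder taken from 'ceph crash ls'

# Journal entries around the abort for a systemd-managed mon
# (the unit name differs between ceph-ansible and cephadm deployments):
journalctl -u 'ceph-mon*' --since "2021-10-05 18:40" --until "2021-10-05 18:45"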

Comment 7 Amarnath 2021-12-03 09:10:34 UTC
Upgraded from 14.2.11-208.el8cp to 16.2.0-146.el8cp.

Did not observe any clone()/abort() backtraces (as in the trace above) in the mon logs.
Attached the mon logs from all the nodes.
IOs were also running during the upgrade.
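
A rough sketch of the log check described above (log paths and unit names are assumptions; they differ between bare-metal and containerized deployments):

# Look for crash/abort signatures in the mon logs and journal:
grep -iE 'Caught signal|abort|ceph_abort|terminate' /var/log/ceph/ceph-mon.*.log
journalctl -u 'ceph-mon*' | grep -iE 'Caught signal|abort'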

Setup details before upgrade:
[root@ceph-bz-verify-jgwibw-node7 ~]# ceph fs ls
name: cephfs, metadata pool: cephfs_metadata, data pools: [cephfs_data ]
[root@ceph-bz-verify-jgwibw-node7 ~]# ceph df
RAW STORAGE:
    CLASS     SIZE        AVAIL       USED       RAW USED     %RAW USED 
    hdd       180 GiB     168 GiB     83 MiB       12 GiB          6.71 
    TOTAL     180 GiB     168 GiB     83 MiB       12 GiB          6.71 
 
POOLS:
    POOL                    ID     STORED      OBJECTS     USED        %USED     MAX AVAIL 
    cephfs_data              1       132 B           1     192 KiB         0        53 GiB 
    cephfs_metadata          2      42 KiB          22     1.5 MiB         0        53 GiB 
    .rgw.root                3     1.3 KiB           4     768 KiB         0        53 GiB 
    default.rgw.control      4         0 B           8         0 B         0        53 GiB 
    default.rgw.meta         5       374 B           2     384 KiB         0        53 GiB 
    default.rgw.log          6     3.5 KiB          49     6.2 MiB         0        53 GiB 
    rbd                      7         0 B           0         0 B         0        53 GiB 
[root@ceph-bz-verify-jgwibw-node7 ~]# ceph -s
  cluster:
    id:     37dc81e0-e59c-4aa0-b819-abeb2eee717b
    health: HEALTH_WARN
            1 pools have too few placement groups
            mons are allowing insecure global_id reclaim
 
  services:
    mon:     3 daemons, quorum ceph-bz-verify-jgwibw-node2,ceph-bz-verify-jgwibw-node3,ceph-bz-verify-jgwibw-node1-installer (age 35h)
    mgr:     ceph-bz-verify-jgwibw-node1-installer(active, since 35h), standbys: ceph-bz-verify-jgwibw-node2
    mds:     cephfs:1 {0=ceph-bz-verify-jgwibw-node5=up:active} 2 up:standby
    osd:     12 osds: 12 up (since 35h), 12 in (since 35h)
    rgw-nfs: 2 daemons active (ceph-bz-verify-jgwibw-node4, ceph-bz-verify-jgwibw-node6)
 
  data:
    pools:   7 pools, 208 pgs
    objects: 86 objects, 47 KiB
    usage:   12 GiB used, 168 GiB / 180 GiB avail
    pgs:     208 active+clean
 
[root@ceph-bz-verify-jgwibw-node7 ~]# ceph fs set cephfs allow_standby_replay false
[root@ceph-bz-verify-jgwibw-node7 ~]# ceph version
ceph version 14.2.11-208.el8cp (6738ba96f296a41c24357c12e8d594fbde457abc) nautilus (stable)
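
For reference, the pre-upgrade file system/MDS state can be captured with the standard CLI; a minimal sketch (output formats differ between Nautilus and Pacific):

ceph fs status cephfs      # MDS ranks and standbys
ceph fs get cephfs         # MDSMap for the fs: flags, max_mds, standby settings
ceph mds stat              # one-line MDS summary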

After upgrade:
[root@ceph-bz-verify-jgwibw-node7 cephfs_fuse]# ceph -s
  cluster:
    id:     37dc81e0-e59c-4aa0-b819-abeb2eee717b
    health: HEALTH_WARN
            mons are allowing insecure global_id reclaim
            1 pools have too few placement groups
 
  services:
    mon:     3 daemons, quorum ceph-bz-verify-jgwibw-node2,ceph-bz-verify-jgwibw-node3,ceph-bz-verify-jgwibw-node1-installer (age 29m)
    mgr:     ceph-bz-verify-jgwibw-node1-installer(active, since 18m), standbys: ceph-bz-verify-jgwibw-node2
    mds:     1/1 daemons up, 2 standby
    osd:     12 osds: 12 up (since 24m), 12 in (since 37h)
    rgw-nfs: 2 daemons active (2 hosts, 1 zones)
 
  data:
    volumes: 1/1 healthy
    pools:   8 pools, 209 pgs
    objects: 220 objects, 474 MiB
    usage:   3.3 GiB used, 177 GiB / 180 GiB avail
    pgs:     209 active+clean
 
  io:
    client:   2.7 KiB/s rd, 21 MiB/s wr, 1 op/s rd, 33 op/s wr
 
[root@ceph-bz-verify-jgwibw-node7 cephfs_fuse]# ceph version
ceph version 16.2.0-146.el8cp (56f5e9cfe88a08b6899327eca5166ca1c4a392aa) pacific (stable)
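
The corresponding post-upgrade checks, as a minimal sketch (run from the same admin node; assumes the cluster-wide CLI is available):

ceph versions              # all daemons should report 16.2.x
ceph health detail
ceph fs status cephfs      # MDS ranks/standbys came back after the upgrade
ceph crash ls              # no new mon crash reports expected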

Setup details:
ceph-BZ_Verify-JGWIBW-node7	10.0.208.121
ceph-BZ_Verify-JGWIBW-node6	10.0.211.48
ceph-BZ_Verify-JGWIBW-node5	10.0.209.32
ceph-BZ_Verify-JGWIBW-node4	10.0.210.249
ceph-BZ_Verify-JGWIBW-node3	10.0.209.157
ceph-BZ_Verify-JGWIBW-node2	10.0.209.20
ceph-BZ_Verify-JGWIBW-node1-installer	10.0.211.212



@vshankar,
Can you please confirm if this is sufficient?

Comment 8 Venky Shankar 2021-12-06 05:56:25 UTC
> @vshankar,
> Can you please confirm if this is sufficient?

Looks good.

Comment 14 errata-xmlrpc 2022-04-04 10:22:22 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat Ceph Storage 5.1 Security, Enhancement, and Bug Fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:1174

