Bug 2018110

Summary: Ceph monitor crash after upgrade from ceph 15.2.14 to 16.2.6
Product: [Red Hat Storage] Red Hat Ceph Storage
Component: CephFS
Version: 5.0
Target Release: 5.1
Hardware: Unspecified
OS: Unspecified
Severity: medium
Priority: medium
Status: CLOSED ERRATA
Type: Bug
Reporter: Venky Shankar <vshankar>
Assignee: Patrick Donnelly <pdonnell>
QA Contact: Amarnath <amk>
CC: ceph-eng-bugs, tserlin, vereddy
Fixed In Version: ceph-16.2.6-20.el8cp
Doc Type: If docs needed, set a value
Last Closed: 2022-04-04 10:22:22 UTC

Description Venky Shankar 2021-10-28 09:35:06 UTC
Seen when upgrading from the 15.2.14 release to 16.2.6:

Oct 05 18:42:57 virthost2 ceph-mon[115602]:  in thread 7f14a74f3700 thread_name:ms_dispatch
Oct 05 18:42:57 virthost2 ceph-mon[115602]:  ceph version 16.2.6 (1a6b9a05546f335eeeddb460fdc89caadf80ac7a) pacific (stable)
Oct 05 18:42:57 virthost2 ceph-mon[115602]:  1: /lib/x86_64-linux-gnu/libpthread.so.0(0x14140) [0x7f14b0190140]
Oct 05 18:42:57 virthost2 ceph-mon[115602]:  2: gsignal()
Oct 05 18:42:57 virthost2 ceph-mon[115602]:  3: abort()
Oct 05 18:42:57 virthost2 ceph-mon[115602]:  4: /usr/lib/x86_64-linux-gnu/libstdc++.so.6(0x9a7ec) [0x7f14b00437ec]
Oct 05 18:42:57 virthost2 ceph-mon[115602]:  5: /usr/lib/x86_64-linux-gnu/libstdc++.so.6(0xa5966) [0x7f14b004e966]
Oct 05 18:42:57 virthost2 ceph-mon[115602]:  6: /usr/lib/x86_64-linux-gnu/libstdc++.so.6(0xa59d1) [0x7f14b004e9d1]
Oct 05 18:42:57 virthost2 ceph-mon[115602]:  7: /usr/lib/x86_64-linux-gnu/libstdc++.so.6(0xa5c65) [0x7f14b004ec65]
Oct 05 18:42:57 virthost2 ceph-mon[115602]:  8: /usr/lib/ceph/libceph-common.so.2(+0x28982a) [0x7f14b06e682a]
Oct 05 18:42:57 virthost2 ceph-mon[115602]:  9: (MDSMonitor::tick()+0x475) [0x55a3d8709015]
Oct 05 18:42:57 virthost2 ceph-mon[115602]:  10: (MDSMonitor::on_active()+0x28) [0x55a3d86ef068]
Oct 05 18:42:57 virthost2 ceph-mon[115602]:  11: (Context::complete(int)+0x9) [0x55a3d850fc29]
Oct 05 18:42:57 virthost2 ceph-mon[115602]:  12: (void finish_contexts<std::__cxx11::list<Context*, std::allocator<Context*> > >(ceph::common::CephContext*, std::__cxx11::list<Context*, std::allocator<Context*> >&, int)+0xa8) [0x55a3d853b458]
Oct 05 18:42:57 virthost2 ceph-mon[115602]:  13: (Paxos::finish_round()+0x70) [0x55a3d8623100]
Oct 05 18:42:57 virthost2 ceph-mon[115602]:  14: (Paxos::dispatch(boost::intrusive_ptr<MonOpRequest>)+0x3d3) [0x55a3d8624ef3]
Oct 05 18:42:57 virthost2 ceph-mon[115602]:  15: (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0x116b) [0x55a3d850d7eb]
Oct 05 18:42:57 virthost2 ceph-mon[115602]:  16: (Monitor::_ms_dispatch(Message*)+0x41e) [0x55a3d850de2e]
Oct 05 18:42:57 virthost2 ceph-mon[115602]:  17: (Dispatcher::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x59) [0x55a3d853c9d9]
Oct 05 18:42:57 virthost2 ceph-mon[115602]:  18: (Messenger::ms_deliver_dispatch(boost::intrusive_ptr<Message> const&)+0x468) [0x7f14b08d5eb8]
Oct 05 18:42:57 virthost2 ceph-mon[115602]:  19: (DispatchQueue::entry()+0x5ef) [0x7f14b08d35bf]
Oct 05 18:42:57 virthost2 ceph-mon[115602]:  20: (DispatchQueue::DispatchThread::entry()+0xd) [0x7f14b0990cbd]
Oct 05 18:42:57 virthost2 ceph-mon[115602]:  21: /lib/x86_64-linux-gnu/libpthread.so.0(+0x8ea7) [0x7f14b0184ea7]
Oct 05 18:42:57 virthost2 ceph-mon[115602]:  22: clone()
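
For reference, the rest of the crash context can usually be pulled straight from the affected monitor host. A minimal sketch of the collection, assuming systemd-managed daemons with journal logging and a cluster reachable with the mgr crash module enabled (the unit glob and timestamps below are placeholders):

# journalctl -u 'ceph-mon@*' --since "2021-10-05 18:40" --until "2021-10-05 18:45" | grep -B2 -A40 'Caught signal'
# ceph crash ls
# ceph crash info <crash-id>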

Comment 7 Amarnath 2021-12-03 09:10:34 UTC
Upgraded from 14.2.11-208.el8cp to 16.2.0-146.el8cp


Did not observe any abort()/clone() frames (i.e. the monitor crash signature above) in the mon logs.
Attached the mon logs from all the nodes.
IOs were running during the upgrade as well.
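
For anyone repeating that check, a sketch of scanning the collected mon logs for the crash signature, assuming the logs are in the default /var/log/ceph location on each mon node (adjust paths as needed):

# grep -nE 'Caught signal|abort\(\)|MDSMonitor::tick' /var/log/ceph/ceph-mon.*.log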

Setup details before upgrade : 
[root@ceph-bz-verify-jgwibw-node7 ~]# ceph fs ls
name: cephfs, metadata pool: cephfs_metadata, data pools: [cephfs_data ]
[root@ceph-bz-verify-jgwibw-node7 ~]# ceph df
RAW STORAGE:
    CLASS     SIZE        AVAIL       USED       RAW USED     %RAW USED 
    hdd       180 GiB     168 GiB     83 MiB       12 GiB          6.71 
    TOTAL     180 GiB     168 GiB     83 MiB       12 GiB          6.71 
 
POOLS:
    POOL                    ID     STORED      OBJECTS     USED        %USED     MAX AVAIL 
    cephfs_data              1       132 B           1     192 KiB         0        53 GiB 
    cephfs_metadata          2      42 KiB          22     1.5 MiB         0        53 GiB 
    .rgw.root                3     1.3 KiB           4     768 KiB         0        53 GiB 
    default.rgw.control      4         0 B           8         0 B         0        53 GiB 
    default.rgw.meta         5       374 B           2     384 KiB         0        53 GiB 
    default.rgw.log          6     3.5 KiB          49     6.2 MiB         0        53 GiB 
    rbd                      7         0 B           0         0 B         0        53 GiB 
[root@ceph-bz-verify-jgwibw-node7 ~]# ceph -s
  cluster:
    id:     37dc81e0-e59c-4aa0-b819-abeb2eee717b
    health: HEALTH_WARN
            1 pools have too few placement groups
            mons are allowing insecure global_id reclaim
 
  services:
    mon:     3 daemons, quorum ceph-bz-verify-jgwibw-node2,ceph-bz-verify-jgwibw-node3,ceph-bz-verify-jgwibw-node1-installer (age 35h)
    mgr:     ceph-bz-verify-jgwibw-node1-installer(active, since 35h), standbys: ceph-bz-verify-jgwibw-node2
    mds:     cephfs:1 {0=ceph-bz-verify-jgwibw-node5=up:active} 2 up:standby
    osd:     12 osds: 12 up (since 35h), 12 in (since 35h)
    rgw-nfs: 2 daemons active (ceph-bz-verify-jgwibw-node4, ceph-bz-verify-jgwibw-node6)
 
  data:
    pools:   7 pools, 208 pgs
    objects: 86 objects, 47 KiB
    usage:   12 GiB used, 168 GiB / 180 GiB avail
    pgs:     208 active+clean
 
[root@ceph-bz-verify-jgwibw-node7 ~]# ceph fs set cephfs allow_standby_replay false
[root@ceph-bz-verify-jgwibw-node7 ~]# ceph version
ceph version 14.2.11-208.el8cp (6738ba96f296a41c24357c12e8d594fbde457abc) nautilus (stable)
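
For context, a rough sketch of how such a 14.x to 16.x (RHCS 4 to 5) upgrade is typically driven, assuming a ceph-ansible managed cluster and the stock infrastructure playbooks (the inventory name "hosts" is a placeholder; the authoritative steps are the ones in the RHCS upgrade guide):

# ansible-playbook -i hosts infrastructure-playbooks/rolling_update.yml
# ansible-playbook -i hosts infrastructure-playbooks/cephadm-adopt.yml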

After Upgrade : 
[root@ceph-bz-verify-jgwibw-node7 cephfs_fuse]# ceph -s
  cluster:
    id:     37dc81e0-e59c-4aa0-b819-abeb2eee717b
    health: HEALTH_WARN
            mons are allowing insecure global_id reclaim
            1 pools have too few placement groups
 
  services:
    mon:     3 daemons, quorum ceph-bz-verify-jgwibw-node2,ceph-bz-verify-jgwibw-node3,ceph-bz-verify-jgwibw-node1-installer (age 29m)
    mgr:     ceph-bz-verify-jgwibw-node1-installer(active, since 18m), standbys: ceph-bz-verify-jgwibw-node2
    mds:     1/1 daemons up, 2 standby
    osd:     12 osds: 12 up (since 24m), 12 in (since 37h)
    rgw-nfs: 2 daemons active (2 hosts, 1 zones)
 
  data:
    volumes: 1/1 healthy
    pools:   8 pools, 209 pgs
    objects: 220 objects, 474 MiB
    usage:   3.3 GiB used, 177 GiB / 180 GiB avail
    pgs:     209 active+clean
 
  io:
    client:   2.7 KiB/s rd, 21 MiB/s wr, 1 op/s rd, 33 op/s wr
 
[root@ceph-bz-verify-jgwibw-node7 cephfs_fuse]# ceph version
ceph version 16.2.0-146.el8cp (56f5e9cfe88a08b6899327eca5166ca1c4a392aa) pacific (stable)
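
A quick post-upgrade sanity pass to confirm the mons stayed up and everything is on the new build can be done with standard ceph CLI subcommands, for example:

# ceph versions        # all daemons should report 16.2.0-146.el8cp
# ceph health detail   # no crash-related warnings expected
# ceph fs status cephfs
# ceph crash ls        # should show no new mon crashes from the upgrade window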

Setup Details :
ceph-BZ_Verify-JGWIBW-node7	10.0.208.121
ceph-BZ_Verify-JGWIBW-node6	10.0.211.48
ceph-BZ_Verify-JGWIBW-node5	10.0.209.32
ceph-BZ_Verify-JGWIBW-node4	10.0.210.249
ceph-BZ_Verify-JGWIBW-node3	10.0.209.157
ceph-BZ_Verify-JGWIBW-node2	10.0.209.20
ceph-BZ_Verify-JGWIBW-node1-installer	10.0.211.212



@vshankar,
Can you please confirm if this is sufficient?

Comment 8 Venky Shankar 2021-12-06 05:56:25 UTC
> @vshankar,
> Can you please confirm if this is sufficient?

Looks good.

Comment 14 errata-xmlrpc 2022-04-04 10:22:22 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: Red Hat Ceph Storage 5.1 Security, Enhancement, and Bug Fix update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:1174