Bug 2018110 - Ceph monitor crash after upgrade from ceph
Summary: Ceph monitor crash after upgrade from ceph
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: CephFS
Version: 5.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 5.1
Assignee: Patrick Donnelly
QA Contact: Amarnath
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-10-28 09:35 UTC by Venky Shankar
Modified: 2022-04-04 10:22 UTC
CC List: 3 users

Fixed In Version: ceph-16.2.6-20.el8cp
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-04-04 10:22:22 UTC
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
Ceph Project Bug Tracker 52820 0 None None None 2021-10-28 09:35:06 UTC
Ceph Project Bug Tracker 52874 0 None None None 2021-10-28 09:38:46 UTC
Red Hat Issue Tracker RHCEPH-2120 0 None None None 2021-10-28 09:36:54 UTC
Red Hat Product Errata RHSA-2022:1174 0 None None None 2022-04-04 10:22:41 UTC

Description Venky Shankar 2021-10-28 09:35:06 UTC
Seen when upgrading from the 15.2.14 release to the 16.2.6 release:

Oct 05 18:42:57 virthost2 ceph-mon[115602]:  in thread 7f14a74f3700 thread_name:ms_dispatch
Oct 05 18:42:57 virthost2 ceph-mon[115602]:  ceph version 16.2.6 (1a6b9a05546f335eeeddb460fdc89caadf80ac7a) pacific (stable)
Oct 05 18:42:57 virthost2 ceph-mon[115602]:  1: /lib/x86_64-linux-gnu/libpthread.so.0(+0x14140) [0x7f14b0190140]
Oct 05 18:42:57 virthost2 ceph-mon[115602]:  2: gsignal()
Oct 05 18:42:57 virthost2 ceph-mon[115602]:  3: abort()
Oct 05 18:42:57 virthost2 ceph-mon[115602]:  4: /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x9a7ec) [0x7f14b00437ec]
Oct 05 18:42:57 virthost2 ceph-mon[115602]:  5: /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xa5966) [0x7f14b004e966]
Oct 05 18:42:57 virthost2 ceph-mon[115602]:  6: /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xa59d1) [0x7f14b004e9d1]
Oct 05 18:42:57 virthost2 ceph-mon[115602]:  7: /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xa5c65) [0x7f14b004ec65]
Oct 05 18:42:57 virthost2 ceph-mon[115602]:  8: /usr/lib/ceph/libceph-common.so.2(+0x28982a) [0x7f14b06e682a]
Oct 05 18:42:57 virthost2 ceph-mon[115602]:  9: (MDSMonitor::tick()+0x475) [0x55a3d8709015]
Oct 05 18:42:57 virthost2 ceph-mon[115602]:  10: (MDSMonitor::on_active()+0x28) [0x55a3d86ef068]
Oct 05 18:42:57 virthost2 ceph-mon[115602]:  11: (Context::complete(int)+0x9) [0x55a3d850fc29]
Oct 05 18:42:57 virthost2 ceph-mon[115602]:  12: (void finish_contexts<std::__cxx11::list<Context*, std::allocator<Context*> > >(ceph::common::CephContext*, std::__cxx11::list<Context*, std::allocator<Context*> >&, int)+0xa8) [0x55a3d853b458]
Oct 05 18:42:57 virthost2 ceph-mon[115602]:  13: (Paxos::finish_round()+0x70) [0x55a3d8623100]
Oct 05 18:42:57 virthost2 ceph-mon[115602]:  14: (Paxos::dispatch(boost::intrusive_ptr<MonOpRequest>)+0x3d3) [0x55a3d8624ef3]
Oct 05 18:42:57 virthost2 ceph-mon[115602]:  15: (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0x116b) [0x55a3d850d7eb]
Oct 05 18:42:57 virthost2 ceph-mon[115602]:  16: (Monitor::_ms_dispatch(Message*)+0x41e) [0x55a3d850de2e]
Oct 05 18:42:57 virthost2 ceph-mon[115602]:  17: (Dispatcher::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x59) [0x55a3d853c9d9]
Oct 05 18:42:57 virthost2 ceph-mon[115602]:  18: (Messenger::ms_deliver_dispatch(boost::intrusive_ptr<Message> const&)+0x468) [0x7f14b08d5eb8]
Oct 05 18:42:57 virthost2 ceph-mon[115602]:  19: (DispatchQueue::entry()+0x5ef) [0x7f14b08d35bf]
Oct 05 18:42:57 virthost2 ceph-mon[115602]:  20: (DispatchQueue::DispatchThread::entry()+0xd) [0x7f14b0990cbd]
Oct 05 18:42:57 virthost2 ceph-mon[115602]:  21: /lib/x86_64-linux-gnu/libpthread.so.0(+0x8ea7) [0x7f14b0184ea7]
Oct 05 18:42:57 virthost2 ceph-mon[115602]:  22: clone()
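
For reference, frames 4-7 above are inside libstdc++, which together with the abort() at frame 3 is consistent with an uncaught C++ exception reaching std::terminate() out of MDSMonitor::tick(). A rough sketch of how the crash details can be collected on an affected mon node (unit and log names are assumptions and vary by deployment method):

# Crash reports registered with the mgr crash module (Nautilus and later):
ceph crash ls
ceph crash info <crash-id>        # <crash-id> is a placeholder taken from 'ceph crash ls'

# Journal entries around the abort for a systemd-managed mon
# (the unit name differs between ceph-ansible and cephadm deployments):
journalctl -u 'ceph-mon*' --since "2021-10-05 18:40" --until "2021-10-05 18:45"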

Comment 7 Amarnath 2021-12-03 09:10:34 UTC
Upgraded from 14.2.11-208.el8cp to 16.2.0-146.el8cp.

Did not observe any clone()/abort() backtraces (as in the trace above) in the mon logs.
Attached the mon logs from all the nodes.
IOs were also running during the upgrade.
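
A rough sketch of the log check described above (log paths and unit names are assumptions; they differ between bare-metal and containerized deployments):

# Look for crash/abort signatures in the mon logs and journal:
grep -iE 'Caught signal|abort|ceph_abort|terminate' /var/log/ceph/ceph-mon.*.log
journalctl -u 'ceph-mon*' | grep -iE 'Caught signal|abort'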

Setup details before upgrade:
[root@ceph-bz-verify-jgwibw-node7 ~]# ceph fs ls
name: cephfs, metadata pool: cephfs_metadata, data pools: [cephfs_data ]
[root@ceph-bz-verify-jgwibw-node7 ~]# ceph df
RAW STORAGE:
    CLASS     SIZE        AVAIL       USED       RAW USED     %RAW USED 
    hdd       180 GiB     168 GiB     83 MiB       12 GiB          6.71 
    TOTAL     180 GiB     168 GiB     83 MiB       12 GiB          6.71 
 
POOLS:
    POOL                    ID     STORED      OBJECTS     USED        %USED     MAX AVAIL 
    cephfs_data              1       132 B           1     192 KiB         0        53 GiB 
    cephfs_metadata          2      42 KiB          22     1.5 MiB         0        53 GiB 
    .rgw.root                3     1.3 KiB           4     768 KiB         0        53 GiB 
    default.rgw.control      4         0 B           8         0 B         0        53 GiB 
    default.rgw.meta         5       374 B           2     384 KiB         0        53 GiB 
    default.rgw.log          6     3.5 KiB          49     6.2 MiB         0        53 GiB 
    rbd                      7         0 B           0         0 B         0        53 GiB 
[root@ceph-bz-verify-jgwibw-node7 ~]# ceph -s
  cluster:
    id:     37dc81e0-e59c-4aa0-b819-abeb2eee717b
    health: HEALTH_WARN
            1 pools have too few placement groups
            mons are allowing insecure global_id reclaim
 
  services:
    mon:     3 daemons, quorum ceph-bz-verify-jgwibw-node2,ceph-bz-verify-jgwibw-node3,ceph-bz-verify-jgwibw-node1-installer (age 35h)
    mgr:     ceph-bz-verify-jgwibw-node1-installer(active, since 35h), standbys: ceph-bz-verify-jgwibw-node2
    mds:     cephfs:1 {0=ceph-bz-verify-jgwibw-node5=up:active} 2 up:standby
    osd:     12 osds: 12 up (since 35h), 12 in (since 35h)
    rgw-nfs: 2 daemons active (ceph-bz-verify-jgwibw-node4, ceph-bz-verify-jgwibw-node6)
 
  data:
    pools:   7 pools, 208 pgs
    objects: 86 objects, 47 KiB
    usage:   12 GiB used, 168 GiB / 180 GiB avail
    pgs:     208 active+clean
 
[root@ceph-bz-verify-jgwibw-node7 ~]# ceph fs set cephfs allow_standby_replay false
[root@ceph-bz-verify-jgwibw-node7 ~]# ceph version
ceph version 14.2.11-208.el8cp (6738ba96f296a41c24357c12e8d594fbde457abc) nautilus (stable)
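
For reference, the pre-upgrade file system/MDS state can be captured with the standard CLI; a minimal sketch (output formats differ between Nautilus and Pacific):

ceph fs status cephfs      # MDS ranks and standbys
ceph fs get cephfs         # MDSMap for the fs: flags, max_mds, standby settings
ceph mds stat              # one-line MDS summary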

After upgrade:
[root@ceph-bz-verify-jgwibw-node7 cephfs_fuse]# ceph -s
  cluster:
    id:     37dc81e0-e59c-4aa0-b819-abeb2eee717b
    health: HEALTH_WARN
            mons are allowing insecure global_id reclaim
            1 pools have too few placement groups
 
  services:
    mon:     3 daemons, quorum ceph-bz-verify-jgwibw-node2,ceph-bz-verify-jgwibw-node3,ceph-bz-verify-jgwibw-node1-installer (age 29m)
    mgr:     ceph-bz-verify-jgwibw-node1-installer(active, since 18m), standbys: ceph-bz-verify-jgwibw-node2
    mds:     1/1 daemons up, 2 standby
    osd:     12 osds: 12 up (since 24m), 12 in (since 37h)
    rgw-nfs: 2 daemons active (2 hosts, 1 zones)
 
  data:
    volumes: 1/1 healthy
    pools:   8 pools, 209 pgs
    objects: 220 objects, 474 MiB
    usage:   3.3 GiB used, 177 GiB / 180 GiB avail
    pgs:     209 active+clean
 
  io:
    client:   2.7 KiB/s rd, 21 MiB/s wr, 1 op/s rd, 33 op/s wr
 
[root@ceph-bz-verify-jgwibw-node7 cephfs_fuse]# ceph version
ceph version 16.2.0-146.el8cp (56f5e9cfe88a08b6899327eca5166ca1c4a392aa) pacific (stable)
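
The corresponding post-upgrade checks, as a minimal sketch (run from the same admin node; assumes the cluster-wide CLI is available):

ceph versions              # all daemons should report 16.2.x
ceph health detail
ceph fs status cephfs      # MDS ranks/standbys came back after the upgrade
ceph crash ls              # no new mon crash reports expected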

Setup details:
ceph-BZ_Verify-JGWIBW-node7	10.0.208.121
ceph-BZ_Verify-JGWIBW-node6	10.0.211.48
ceph-BZ_Verify-JGWIBW-node5	10.0.209.32
ceph-BZ_Verify-JGWIBW-node4	10.0.210.249
ceph-BZ_Verify-JGWIBW-node3	10.0.209.157
ceph-BZ_Verify-JGWIBW-node2	10.0.209.20
ceph-BZ_Verify-JGWIBW-node1-installer	10.0.211.212



@vshankar,
Can you please confirm if this is sufficient?

Comment 8 Venky Shankar 2021-12-06 05:56:25 UTC
> @vshankar,
> Can you please confirm if this is sufficient?

Looks good.

Comment 14 errata-xmlrpc 2022-04-04 10:22:22 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat Ceph Storage 5.1 Security, Enhancement, and Bug Fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:1174

