Description of problem (please be as detailed as possible and provide log snippets):
mon is in CLBO after upgrading to 4.10-113

Version of all relevant components (if applicable):
upgraded from ocs-registry:4.9.2-9 to ocs-registry:4.10.0-113

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
Yes

Is there any workaround available to the best of your knowledge?
NA

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
1

Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?
Not tried

If this is a regression, please provide more details to justify this:
Yes

Steps to Reproduce:
1. Upgrade ODF from ocs-registry:4.9.2-9 to ocs-registry:4.10.0-113
2. Check the mon status (example commands are sketched after this comment)
3.

Actual results:

rook-ceph-mon-a-7995c845d-7742h    2/2   Running            0               50m   10.131.0.38   ip-10-0-184-89.us-east-2.compute.internal    <none>   <none>
rook-ceph-mon-b-5f6cfbd5d6-98hg2   1/2   CrashLoopBackOff   14 (101s ago)   48m   10.128.2.39   ip-10-0-139-255.us-east-2.compute.internal   <none>   <none>
rook-ceph-mon-c-68f7cc956d-rlt57   2/2   Running            0               50m   10.129.2.68   ip-10-0-220-6.us-east-2.compute.internal     <none>   <none>

Expected results:
All pods should be up and running after the upgrade.

Additional info:

Name:                 rook-ceph-mon-b-5f6cfbd5d6-98hg2
Namespace:            openshift-storage
Priority:             2000001000
Priority Class Name:  system-node-critical
Node:                 ip-10-0-139-255.us-east-2.compute.internal/10.0.139.255
Start Time:           Thu, 20 Jan 2022 15:55:33 +0000
Labels:               app=rook-ceph-mon

Events:
  Type     Reason          Age                    From               Message
  ----     ------          ----                   ----               -------
  Normal   Scheduled       48m                    default-scheduler  Successfully assigned openshift-storage/rook-ceph-mon-b-5f6cfbd5d6-98hg2 to ip-10-0-139-255.us-east-2.compute.internal
  Normal   AddedInterface  48m                    multus             Add eth0 [10.128.2.39/23] from openshift-sdn
  Normal   Pulled          48m                    kubelet            Container image "quay.io/rhceph-dev/rhceph@sha256:77a11bd0eca26a1315c384f1d7f0d7a1f6dd0631e464cd0b1e2cee929f558d9d" already present on machine
  Normal   Created         48m                    kubelet            Created container chown-container-data-dir
  Normal   Started         48m                    kubelet            Started container chown-container-data-dir
  Normal   Pulled          48m                    kubelet            Container image "quay.io/rhceph-dev/rhceph@sha256:77a11bd0eca26a1315c384f1d7f0d7a1f6dd0631e464cd0b1e2cee929f558d9d" already present on machine
  Normal   Created         48m                    kubelet            Created container init-mon-fs
  Normal   Started         48m                    kubelet            Started container init-mon-fs
  Normal   Pulled          48m                    kubelet            Container image "quay.io/rhceph-dev/rhceph@sha256:77a11bd0eca26a1315c384f1d7f0d7a1f6dd0631e464cd0b1e2cee929f558d9d" already present on machine
  Normal   Created         48m                    kubelet            Created container log-collector
  Normal   Started         48m                    kubelet            Started container log-collector
  Normal   Pulled          48m (x3 over 48m)      kubelet            Container image "quay.io/rhceph-dev/rhceph@sha256:77a11bd0eca26a1315c384f1d7f0d7a1f6dd0631e464cd0b1e2cee929f558d9d" already present on machine
  Normal   Created         48m (x3 over 48m)      kubelet            Created container mon
  Normal   Started         48m (x3 over 48m)      kubelet            Started container mon
  Warning  BackOff         3m30s (x228 over 48m)  kubelet            Back-off restarting failed container
> rook-ceph operator logs

2022-01-20 16:03:20.307388 I | op-k8sutil: updating deployment "rook-ceph-mon-b" after verifying it is safe to stop
2022-01-20 16:03:20.307405 I | op-mon: checking if we can stop the deployment rook-ceph-mon-b
2022-01-20 16:03:37.271287 E | ceph-bucket-notification: failed to reconcile failed to get object store from ObjectBucket "openshift-storage/obc-openshift-storage-cli-bucket-0dc2c10ed8e44b068a4f4808c81f4": malformed BucketHost "s3.openshift-storage.svc": malformed subdomain name "s3"
2022-01-20 16:05:36.491127 E | op-mon: attempting to continue after failing to start mon "b". failed to update mon deployment rook-ceph-mon-b: gave up waiting for deployment "rook-ceph-mon-b" to update because "ProgressDeadlineExceeded"
2022-01-20 16:28:25.356784 W | op-mon: monitor b is not in quorum list
2022-01-20 16:28:25.382445 E | ceph-cluster-controller: failed to reconcile CephCluster "openshift-storage/ocs-storagecluster-cephcluster". failed to reconcile cluster "ocs-storagecluster-cephcluster": failed to configure local ceph cluster: failed to create cluster: failed to start ceph monitors: failed to start mon pods: failed to wait for mon quorum: exceeded max retry count waiting for monitors to reach quorum

job: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/3022/consoleFull

must gather: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j-128ai3c33-ua/j-128ai3c33-ua_20220120T114821/logs/failed_testcase_ocs_logs_1642682264/test_upgrade_ocs_logs/
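For reference, a minimal sketch of how the status and logs above can be collected. It assumes the default openshift-storage namespace, the app=rook-ceph-mon label shown in the pod metadata, and an operator deployment named rook-ceph-operator (standard for ODF, but an assumption here):

  # list the mon pods and spot the one in CrashLoopBackOff
  $ oc -n openshift-storage get pods -l app=rook-ceph-mon -o wide

  # events for the failing mon pod (source of the "Additional info" section above)
  $ oc -n openshift-storage describe pod rook-ceph-mon-b-5f6cfbd5d6-98hg2

  # operator view of the mon rollout (source of the "rook-ceph operator logs" snippet)
  $ oc -n openshift-storage logs deploy/rook-ceph-operator | grep 'op-mon'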
I think we will need the Ceph team to help debug this issue. The mon is in crash loop backoff with an error from the mon process (copied below). I don't see any crashes registered, and it doesn't look like any must-gather ceph commands were run, so debugging this may be hard (a few commands that might help are sketched after the log below).

debug    -43> 2022-01-20T16:42:30.287+0000 7f5cc51a0700  5 AuthRegistry(0x55896cf84140) adding auth protocol: cephx
debug    -42> 2022-01-20T16:42:30.287+0000 7f5cc51a0700  5 AuthRegistry(0x55896cf84140) adding auth protocol: cephx
debug    -41> 2022-01-20T16:42:30.287+0000 7f5cc51a0700  5 AuthRegistry(0x55896cf84140) adding auth protocol: cephx
debug    -40> 2022-01-20T16:42:30.287+0000 7f5cc51a0700  5 AuthRegistry(0x55896cf84140) adding auth protocol: none
debug    -39> 2022-01-20T16:42:30.287+0000 7f5cc51a0700  5 AuthRegistry(0x55896cf84140) adding con mode: secure
debug    -38> 2022-01-20T16:42:30.287+0000 7f5cc51a0700  5 AuthRegistry(0x55896cf84140) adding con mode: crc
debug    -37> 2022-01-20T16:42:30.287+0000 7f5cc51a0700  5 AuthRegistry(0x55896cf84140) adding con mode: secure
debug    -36> 2022-01-20T16:42:30.287+0000 7f5cc51a0700  5 AuthRegistry(0x55896cf84140) adding con mode: crc
debug    -35> 2022-01-20T16:42:30.287+0000 7f5cc51a0700  5 AuthRegistry(0x55896cf84140) adding con mode: secure
debug    -34> 2022-01-20T16:42:30.287+0000 7f5cc51a0700  5 AuthRegistry(0x55896cf84140) adding con mode: crc
debug    -33> 2022-01-20T16:42:30.287+0000 7f5cc51a0700  5 AuthRegistry(0x55896cf84140) adding con mode: crc
debug    -32> 2022-01-20T16:42:30.287+0000 7f5cc51a0700  5 AuthRegistry(0x55896cf84140) adding con mode: secure
debug    -31> 2022-01-20T16:42:30.287+0000 7f5cc51a0700  5 AuthRegistry(0x55896cf84140) adding con mode: crc
debug    -30> 2022-01-20T16:42:30.287+0000 7f5cc51a0700  5 AuthRegistry(0x55896cf84140) adding con mode: secure
debug    -29> 2022-01-20T16:42:30.287+0000 7f5cc51a0700  5 AuthRegistry(0x55896cf84140) adding con mode: crc
debug    -28> 2022-01-20T16:42:30.287+0000 7f5cc51a0700  5 AuthRegistry(0x55896cf84140) adding con mode: secure
debug    -27> 2022-01-20T16:42:30.287+0000 7f5cc51a0700  2 auth: KeyRing::load: loaded key file /etc/ceph/keyring-store/keyring
debug    -26> 2022-01-20T16:42:30.288+0000 7f5cc51a0700  0 starting mon.b rank 1 at public addrs [v2:172.30.242.117:3300/0,v1:172.30.242.117:6789/0] at bind addrs [v2:10.128.2.39:3300/0,v1:10.128.2.39:6789/0] mon_data /var/lib/ceph/mon/ceph-b fsid 370885ac-8dec-4d95-8350-0deb0752c15b
debug    -25> 2022-01-20T16:42:30.288+0000 7f5cc51a0700  5 AuthRegistry(0x55896cf84a40) adding auth protocol: cephx
debug    -24> 2022-01-20T16:42:30.288+0000 7f5cc51a0700  5 AuthRegistry(0x55896cf84a40) adding auth protocol: cephx
debug    -23> 2022-01-20T16:42:30.288+0000 7f5cc51a0700  5 AuthRegistry(0x55896cf84a40) adding auth protocol: cephx
debug    -22> 2022-01-20T16:42:30.288+0000 7f5cc51a0700  5 AuthRegistry(0x55896cf84a40) adding auth protocol: none
debug    -21> 2022-01-20T16:42:30.288+0000 7f5cc51a0700  5 AuthRegistry(0x55896cf84a40) adding con mode: secure
debug    -20> 2022-01-20T16:42:30.288+0000 7f5cc51a0700  5 AuthRegistry(0x55896cf84a40) adding con mode: crc
debug    -19> 2022-01-20T16:42:30.288+0000 7f5cc51a0700  5 AuthRegistry(0x55896cf84a40) adding con mode: secure
debug    -18> 2022-01-20T16:42:30.288+0000 7f5cc51a0700  5 AuthRegistry(0x55896cf84a40) adding con mode: crc
debug    -17> 2022-01-20T16:42:30.288+0000 7f5cc51a0700  5 AuthRegistry(0x55896cf84a40) adding con mode: secure
debug    -16> 2022-01-20T16:42:30.288+0000 7f5cc51a0700  5 AuthRegistry(0x55896cf84a40) adding con mode: crc
debug    -15> 2022-01-20T16:42:30.288+0000 7f5cc51a0700  5 AuthRegistry(0x55896cf84a40) adding con mode: crc
debug    -14> 2022-01-20T16:42:30.288+0000 7f5cc51a0700  5 AuthRegistry(0x55896cf84a40) adding con mode: secure
debug    -13> 2022-01-20T16:42:30.288+0000 7f5cc51a0700  5 AuthRegistry(0x55896cf84a40) adding con mode: crc
debug    -12> 2022-01-20T16:42:30.288+0000 7f5cc51a0700  5 AuthRegistry(0x55896cf84a40) adding con mode: secure
debug    -11> 2022-01-20T16:42:30.288+0000 7f5cc51a0700  5 AuthRegistry(0x55896cf84a40) adding con mode: crc
debug    -10> 2022-01-20T16:42:30.288+0000 7f5cc51a0700  5 AuthRegistry(0x55896cf84a40) adding con mode: secure
debug     -9> 2022-01-20T16:42:30.288+0000 7f5cc51a0700  2 auth: KeyRing::load: loaded key file /etc/ceph/keyring-store/keyring
debug     -8> 2022-01-20T16:42:30.288+0000 7f5cc51a0700  5 adding auth protocol: cephx
debug     -7> 2022-01-20T16:42:30.288+0000 7f5cc51a0700  5 adding auth protocol: cephx
debug     -6> 2022-01-20T16:42:30.288+0000 7f5cc51a0700 10 log_channel(cluster) update_config to_monitors: true to_syslog: false syslog_facility: daemon prio: info to_graylog: false graylog_host: 127.0.0.1 graylog_port: 12201)
debug     -5> 2022-01-20T16:42:30.288+0000 7f5cc51a0700 10 log_channel(audit) update_config to_monitors: true to_syslog: false syslog_facility: local0 prio: info to_graylog: false graylog_host: 127.0.0.1 graylog_port: 12201)
debug     -4> 2022-01-20T16:42:30.289+0000 7f5cc51a0700  1 mon.b@-1(???) e3 preinit fsid 370885ac-8dec-4d95-8350-0deb0752c15b
debug     -3> 2022-01-20T16:42:30.289+0000 7f5cc51a0700  0 mon.b@-1(???).mds e27 new map
debug     -2> 2022-01-20T16:42:30.290+0000 7f5cc51a0700  0 mon.b@-1(???).mds e27 print_map
e27
enable_multiple, ever_enabled_multiple: 1,1
default compat: compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=no anchor table,9=file layout v2,10=snaprealm v2}
legacy client fscid: 1

Filesystem 'ocs-storagecluster-cephfilesystem' (1)
fs_name	ocs-storagecluster-cephfilesystem
epoch	27
flags	32
created	2022-01-20T12:27:41.132859+0000
modified	2022-01-20T15:55:22.452393+0000
tableserver	0
root	0
session_timeout	60
session_autoclose	300
max_file_size	1099511627776
required_client_features	{}
last_failure	0
last_failure_osd_epoch	982
compat	compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=no anchor table,9=file layout v2,10=snaprealm v2}
max_mds	1
in	0
up	{0=74228}
failed
damaged
stopped
data_pools	[3]
metadata_pool	2
inline_data	disabled
balancer
standby_count_wanted	1
[mds.ocs-storagecluster-cephfilesystem-a{0:74228} state up:active seq 5 join_fscid=1 addr [v2:10.131.0.39:6800/1133637502,v1:10.131.0.39:6801/1133637502] compat {c=[1],r=[1],i=[77f]}]
[mds.ocs-storagecluster-cephfilesystem-b{0:74306} state up:standby-replay seq 1 join_fscid=1 addr [v2:10.129.2.70:6800/115663078,v1:10.129.2.70:6801/115663078] compat {c=[1],r=[1],i=[7ff]}]

debug     -1> 2022-01-20T16:42:30.291+0000 7f5cc51a0700 -1 /builddir/build/BUILD/ceph-16.2.7/src/mds/FSMap.cc: In function 'void FSMap::sanity(bool) const' thread 7f5cc51a0700 time 2022-01-20T16:42:30.290637+0000
/builddir/build/BUILD/ceph-16.2.7/src/mds/FSMap.cc: 868: FAILED ceph_assert(info.compat.writeable(fs->mds_map.compat))

 ceph version 16.2.7-31.el8cp (2cfe2e2a505bfa00c184623965dbdb21ed9ff6aa) pacific (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x158) [0x7f5cbc322c82]
 2: /usr/lib64/ceph/libceph-common.so.2(+0x276e9c) [0x7f5cbc322e9c]
 3: (FSMap::sanity(bool) const+0x2a8) [0x7f5cbc871788]
 4: (MDSMonitor::update_from_paxos(bool*)+0x39a) [0x55896ac1b7aa]
 5: (PaxosService::refresh(bool*)+0x10e) [0x55896ab3c29e]
 6: (Monitor::refresh_from_paxos(bool*)+0x18c) [0x55896a9ed2cc]
 7: (Monitor::init_paxos()+0x10c) [0x55896a9ed5dc]
 8: (Monitor::preinit()+0xd30) [0x55896aa1aaa0]
 9: main()
 10: __libc_start_main()
 11: _start()

debug      0> 2022-01-20T16:42:30.293+0000 7f5cc51a0700 -1 *** Caught signal (Aborted) **
 in thread 7f5cc51a0700 thread_name:ceph-mon

 ceph version 16.2.7-31.el8cp (2cfe2e2a505bfa00c184623965dbdb21ed9ff6aa) pacific (stable)
 1: /lib64/libpthread.so.0(+0x12c20) [0x7f5cba062c20]
 2: gsignal()
 3: abort()
 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a9) [0x7f5cbc322cd3]
 5: /usr/lib64/ceph/libceph-common.so.2(+0x276e9c) [0x7f5cbc322e9c]
 6: (FSMap::sanity(bool) const+0x2a8) [0x7f5cbc871788]
 7: (MDSMonitor::update_from_paxos(bool*)+0x39a) [0x55896ac1b7aa]
 8: (PaxosService::refresh(bool*)+0x10e) [0x55896ab3c29e]
 9: (Monitor::refresh_from_paxos(bool*)+0x18c) [0x55896a9ed2cc]
 10: (Monitor::init_paxos()+0x10c) [0x55896a9ed5dc]
 11: (Monitor::preinit()+0xd30) [0x55896aa1aaa0]
 12: main()
 13: __libc_start_main()
 14: _start()
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 rbd_mirror
   0/ 5 rbd_replay
   0/ 5 rbd_pwl
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 immutable_obj_cache
   0/ 5 client
   1/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 journal
   0/ 0 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 1 reserver
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/ 5 rgw_sync
   1/10 civetweb
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
   0/ 0 refs
   1/ 5 compressor
   1/ 5 bluestore
   1/ 5 bluefs
   1/ 3 bdev
   1/ 5 kstore
   4/ 5 rocksdb
   4/ 5 leveldb
   4/ 5 memdb
   1/ 5 fuse
   2/ 5 mgr
   1/ 5 mgrc
   1/ 5 dpdk
   1/ 5 eventtrace
   1/ 5 prioritycache
   0/ 5 test
   0/ 5 cephfs_mirror
   0/ 5 cephsqlite
  -2/-2 (syslog threshold)
  99/99 (stderr threshold)
--- pthread ID / name mapping for recent threads ---
  140036001363712 / rocksdb:dump_st
  140036169725696 / admin_socket
  140036420536064 / ceph-mon
  max_recent     10000
  max_new        10000
  log_file /var/lib/ceph/crash/2022-01-20T16:42:30.293729Z_ff2cc9e7-cf68-4c5a-b384-336b8997364e/log
--- end dump of recent events ---
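The assert compares each MDS's recorded compat set against the filesystem's mds_map compat, and the two MDS entries in the print_map above do show different incompat bitmasks (i=[77f] vs i=[7ff]). A rough sketch of commands that could help narrow this down, assuming the toolbox deployment is enabled and named rook-ceph-tools (names are from this cluster and may differ elsewhere):

  # pull the assert and backtrace out of the previous mon container attempt
  $ oc -n openshift-storage logs rook-ceph-mon-b-5f6cfbd5d6-98hg2 -c mon --previous | grep -A 20 'FAILED ceph_assert'

  # from the toolbox (mons a and c still have quorum): dump the FSMap compat sets the assert is checking
  $ oc -n openshift-storage rsh deploy/rook-ceph-tools
  sh-4.4$ ceph fs dump
  sh-4.4$ ceph versions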
Transferring this to the Ceph component to get help debugging this on the Ceph side.
I have created a Ceph bug, but a quick Google search tells me there is already a known issue around this; see the release notes in https://github.com/ceph/ceph/pull/44131 and this conversation in the community: https://www.spinics.net/lists/ceph-users/msg70110.html
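For context, a minimal sketch of the workaround those links describe, as I understand it: skip the FSMap sanity check on the mons for the duration of the upgrade, then remove the override. This assumes admin access and would be run from the toolbox:

  $ ceph config set mon mon_mds_skip_sanity 1
  $ ceph config get mon mon_mds_skip_sanity
  # once all mons (and MDS daemons) are upgraded:
  $ ceph config rm mon mon_mds_skip_sanity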
Rook is already applying "mon_mds_skip_sanity" before the upgrade; we can see that in the operator logs:

2022-01-20 15:53:10.400208 I | ceph-cluster-controller: upgrading ceph cluster to "16.2.7-31 pacific"
2022-01-20 15:53:10.400230 I | ceph-cluster-controller: cluster "openshift-storage": version "16.2.7-31 pacific" detected for image "quay.io/rhceph-dev/rhceph@sha256:77a11bd0eca26a1315c384f1d7f0d7a1f6dd0631e464cd0b1e2cee929f558d9d"
2022-01-20 15:53:10.460626 I | op-config: setting "mon"="mon_mds_skip_sanity"="1" option to the mon configuration database
2022-01-20 15:53:10.784735 I | op-config: successfully set "mon"="mon_mds_skip_sanity"="1" option to the mon configuration database

Still, it looks like mon-b is failing the upgrade. Patrick, do you have any idea what could be wrong? Thanks!
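A small sketch of how the option could be double-checked, assuming the toolbox is available; the last command only works if the mon admin socket is reachable inside the mon container:

  # confirm the operator set the option
  $ oc -n openshift-storage logs deploy/rook-ceph-operator | grep mon_mds_skip_sanity

  # value in the centralized config database (from the toolbox)
  $ ceph config get mon mon_mds_skip_sanity

  # runtime value on a mon that is up, via its admin socket
  $ oc -n openshift-storage exec deploy/rook-ceph-mon-a -c mon -- ceph daemon mon.a config get mon_mds_skip_sanity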
*** Bug 2043510 has been marked as a duplicate of this bug. ***
Verified with the below versions:
===========================
The upgrade was successful from ocs-registry:4.9.3-2 to ocs-registry:4.10.0-164.

job: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/3311/console

Moving to verified.
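A short sketch of the post-upgrade checks this verification relies on, assuming the toolbox is available for the ceph commands:

  # operator/CSV version and mon pod health after the upgrade
  $ oc -n openshift-storage get csv
  $ oc -n openshift-storage get pods -l app=rook-ceph-mon

  # from the toolbox: expect HEALTH_OK and all three mons in quorum
  $ ceph -s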
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.10.0 enhancement, security & bug fix update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:1372