Bug 2043513

Summary: [Tracker for Ceph BZ 2044836] mon is in CLBO after upgrading to 4.10-113
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: Vijay Avuthu <vavuthu>
Component: ceph
Assignee: Patrick Donnelly <pdonnell>
Status: CLOSED ERRATA
QA Contact: Vijay Avuthu <vavuthu>
Severity: urgent
Priority: unspecified
Version: 4.10
CC: bniver, madam, mhackett, mmuench, muagarwa, ocs-bugs, odf-bz-bot, pbalogh, pdonnell, shan, tnielsen
Keywords: Automation, Regression, UpgradeBlocker
Target Release: ODF 4.10.0
Hardware: Unspecified
OS: Unspecified
Fixed In Version: 4.10.0-147
Doc Type: No Doc Update
Cloned As: 2044836 (view as bug list)
Last Closed: 2022-04-13 18:51:56 UTC
Type: Bug
Bug Depends On: 2044836

Description Vijay Avuthu 2022-01-21 12:22:26 UTC
Description of problem (please be as detailed as possible and provide log
snippets):

mon is in CLBO after upgrading to 4.10-113

Version of all relevant components (if applicable):

upgraded from ocs-registry:4.9.2-9 to ocs-registry:4.10.0-113

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
Yes

Is there any workaround available to the best of your knowledge?
NA

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
1

Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?
Not tried

If this is a regression, please provide more details to justify this:
Yes

Steps to Reproduce:
1. Upgrade ODF from ocs-registry:4.9.2-9 to ocs-registry:4.10.0-113
2. Check mon status
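The mon status check in step 2 can be done with standard oc queries (a sketch; the namespace and `app=rook-ceph-mon` label follow the Rook defaults seen in this cluster's pod listing):

```shell
# List the mon pods with restart counts; a crash-looping mon shows 1/2 Ready
oc -n openshift-storage get pods -l app=rook-ceph-mon -o wide

# Tail the crashing mon container to see why it keeps restarting
oc -n openshift-storage logs deploy/rook-ceph-mon-b -c mon --tail=100
```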


Actual results:

rook-ceph-mon-a-7995c845d-7742h                                   2/2     Running                0               50m     10.131.0.38    ip-10-0-184-89.us-east-2.compute.internal    <none>           <none>
rook-ceph-mon-b-5f6cfbd5d6-98hg2                                  1/2     CrashLoopBackOff       14 (101s ago)   48m     10.128.2.39    ip-10-0-139-255.us-east-2.compute.internal   <none>           <none>
rook-ceph-mon-c-68f7cc956d-rlt57                                  2/2     Running                0               50m     10.129.2.68    ip-10-0-220-6.us-east-2.compute.internal     <none>           <none>


Expected results:

All pods should be up and running after the upgrade

Additional info:

Name:                 rook-ceph-mon-b-5f6cfbd5d6-98hg2
Namespace:            openshift-storage
Priority:             2000001000
Priority Class Name:  system-node-critical
Node:                 ip-10-0-139-255.us-east-2.compute.internal/10.0.139.255
Start Time:           Thu, 20 Jan 2022 15:55:33 +0000
Labels:               app=rook-ceph-mon

Events:
  Type     Reason          Age                    From               Message
  ----     ------          ----                   ----               -------
  Normal   Scheduled       48m                    default-scheduler  Successfully assigned openshift-storage/rook-ceph-mon-b-5f6cfbd5d6-98hg2 to ip-10-0-139-255.us-east-2.compute.internal
  Normal   AddedInterface  48m                    multus             Add eth0 [10.128.2.39/23] from openshift-sdn
  Normal   Pulled          48m                    kubelet            Container image "quay.io/rhceph-dev/rhceph@sha256:77a11bd0eca26a1315c384f1d7f0d7a1f6dd0631e464cd0b1e2cee929f558d9d" already present on machine
  Normal   Created         48m                    kubelet            Created container chown-container-data-dir
  Normal   Started         48m                    kubelet            Started container chown-container-data-dir
  Normal   Pulled          48m                    kubelet            Container image "quay.io/rhceph-dev/rhceph@sha256:77a11bd0eca26a1315c384f1d7f0d7a1f6dd0631e464cd0b1e2cee929f558d9d" already present on machine
  Normal   Created         48m                    kubelet            Created container init-mon-fs
  Normal   Started         48m                    kubelet            Started container init-mon-fs
  Normal   Pulled          48m                    kubelet            Container image "quay.io/rhceph-dev/rhceph@sha256:77a11bd0eca26a1315c384f1d7f0d7a1f6dd0631e464cd0b1e2cee929f558d9d" already present on machine
  Normal   Created         48m                    kubelet            Created container log-collector
  Normal   Started         48m                    kubelet            Started container log-collector
  Normal   Pulled          48m (x3 over 48m)      kubelet            Container image "quay.io/rhceph-dev/rhceph@sha256:77a11bd0eca26a1315c384f1d7f0d7a1f6dd0631e464cd0b1e2cee929f558d9d" already present on machine
  Normal   Created         48m (x3 over 48m)      kubelet            Created container mon
  Normal   Started         48m (x3 over 48m)      kubelet            Started container mon
  Warning  BackOff         3m30s (x228 over 48m)  kubelet            Back-off restarting failed container


> rook-ceph operator logs

2022-01-20T16:03:20.307429100Z 2022-01-20 16:03:20.307388 I | op-k8sutil: updating deployment "rook-ceph-mon-b" after verifying it is safe to stop
2022-01-20T16:03:20.307429100Z 2022-01-20 16:03:20.307405 I | op-mon: checking if we can stop the deployment rook-ceph-mon-b
2022-01-20T16:03:37.271345478Z 2022-01-20 16:03:37.271287 E | ceph-bucket-notification: failed to reconcile failed to get object store from ObjectBucket "openshift-storage/obc-openshift-storage-cli-bucket-0dc2c10ed8e44b068a4f4808c81f4": malformed BucketHost "s3.openshift-storage.svc": malformed subdomain name "s3"
2022-01-20T16:05:36.491165267Z 2022-01-20 16:05:36.491127 E | op-mon: attempting to continue after failing to start mon "b". failed to update mon deployment rook-ceph-mon-b: gave up waiting for deployment "rook-ceph-mon-b" to update because "ProgressDeadlineExceeded"


2022-01-20T16:28:25.356825984Z 2022-01-20 16:28:25.356784 W | op-mon: monitor b is not in quorum list
2022-01-20T16:28:25.382488939Z 2022-01-20 16:28:25.382445 E | ceph-cluster-controller: failed to reconcile CephCluster "openshift-storage/ocs-storagecluster-cephcluster". failed to reconcile cluster "ocs-storagecluster-cephcluster": failed to configure local ceph cluster: failed to create cluster: failed to start ceph monitors: failed to start mon pods: failed to wait for mon quorum: exceeded max retry count waiting for monitors to reach quorum




job: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/3022/consoleFull

must gather: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j-128ai3c33-ua/j-128ai3c33-ua_20220120T114821/logs/failed_testcase_ocs_logs_1642682264/test_upgrade_ocs_logs/

Comment 3 Blaine Gardner 2022-01-24 19:09:23 UTC
I think we will need the Ceph team to help debug this issue. The mon is in crash loop backoff with an error from the mon process (copied below). I don't see any crashes registered, and I don't see that any must-gather ceph commands were run, so debugging this may be hard.

2022-01-20T16:42:30.296755514Z debug    -43> 2022-01-20T16:42:30.287+0000 7f5cc51a0700  5 AuthRegistry(0x55896cf84140) adding auth protocol: cephx
2022-01-20T16:42:30.296755514Z debug    -42> 2022-01-20T16:42:30.287+0000 7f5cc51a0700  5 AuthRegistry(0x55896cf84140) adding auth protocol: cephx
2022-01-20T16:42:30.296755514Z debug    -41> 2022-01-20T16:42:30.287+0000 7f5cc51a0700  5 AuthRegistry(0x55896cf84140) adding auth protocol: cephx
2022-01-20T16:42:30.296755514Z debug    -40> 2022-01-20T16:42:30.287+0000 7f5cc51a0700  5 AuthRegistry(0x55896cf84140) adding auth protocol: none
2022-01-20T16:42:30.296755514Z debug    -39> 2022-01-20T16:42:30.287+0000 7f5cc51a0700  5 AuthRegistry(0x55896cf84140) adding con mode: secure
2022-01-20T16:42:30.296763563Z debug    -38> 2022-01-20T16:42:30.287+0000 7f5cc51a0700  5 AuthRegistry(0x55896cf84140) adding con mode: crc
2022-01-20T16:42:30.296763563Z debug    -37> 2022-01-20T16:42:30.287+0000 7f5cc51a0700  5 AuthRegistry(0x55896cf84140) adding con mode: secure
2022-01-20T16:42:30.296771171Z debug    -36> 2022-01-20T16:42:30.287+0000 7f5cc51a0700  5 AuthRegistry(0x55896cf84140) adding con mode: crc
2022-01-20T16:42:30.296778803Z debug    -35> 2022-01-20T16:42:30.287+0000 7f5cc51a0700  5 AuthRegistry(0x55896cf84140) adding con mode: secure
2022-01-20T16:42:30.296778803Z debug    -34> 2022-01-20T16:42:30.287+0000 7f5cc51a0700  5 AuthRegistry(0x55896cf84140) adding con mode: crc
2022-01-20T16:42:30.296786459Z debug    -33> 2022-01-20T16:42:30.287+0000 7f5cc51a0700  5 AuthRegistry(0x55896cf84140) adding con mode: crc
2022-01-20T16:42:30.296793797Z debug    -32> 2022-01-20T16:42:30.287+0000 7f5cc51a0700  5 AuthRegistry(0x55896cf84140) adding con mode: secure
2022-01-20T16:42:30.296801328Z debug    -31> 2022-01-20T16:42:30.287+0000 7f5cc51a0700  5 AuthRegistry(0x55896cf84140) adding con mode: crc
2022-01-20T16:42:30.296808629Z debug    -30> 2022-01-20T16:42:30.287+0000 7f5cc51a0700  5 AuthRegistry(0x55896cf84140) adding con mode: secure
2022-01-20T16:42:30.296816004Z debug    -29> 2022-01-20T16:42:30.287+0000 7f5cc51a0700  5 AuthRegistry(0x55896cf84140) adding con mode: crc
2022-01-20T16:42:30.296816004Z debug    -28> 2022-01-20T16:42:30.287+0000 7f5cc51a0700  5 AuthRegistry(0x55896cf84140) adding con mode: secure
2022-01-20T16:42:30.296823269Z debug    -27> 2022-01-20T16:42:30.287+0000 7f5cc51a0700  2 auth: KeyRing::load: loaded key file /etc/ceph/keyring-store/keyring
2022-01-20T16:42:30.296823269Z debug    -26> 2022-01-20T16:42:30.288+0000 7f5cc51a0700  0 starting mon.b rank 1 at public addrs [v2:172.30.242.117:3300/0,v1:172.30.242.117:6789/0] at bind addrs [v2:10.128.2.39:3300/0,v1:10.128.2.39:6789/0] mon_data /var/lib/ceph/mon/ceph-b fsid 370885ac-8dec-4d95-8350-0deb0752c15b
2022-01-20T16:42:30.296830648Z debug    -25> 2022-01-20T16:42:30.288+0000 7f5cc51a0700  5 AuthRegistry(0x55896cf84a40) adding auth protocol: cephx
2022-01-20T16:42:30.296839985Z debug    -24> 2022-01-20T16:42:30.288+0000 7f5cc51a0700  5 AuthRegistry(0x55896cf84a40) adding auth protocol: cephx
2022-01-20T16:42:30.296839985Z debug    -23> 2022-01-20T16:42:30.288+0000 7f5cc51a0700  5 AuthRegistry(0x55896cf84a40) adding auth protocol: cephx
2022-01-20T16:42:30.296847622Z debug    -22> 2022-01-20T16:42:30.288+0000 7f5cc51a0700  5 AuthRegistry(0x55896cf84a40) adding auth protocol: none
2022-01-20T16:42:30.296847622Z debug    -21> 2022-01-20T16:42:30.288+0000 7f5cc51a0700  5 AuthRegistry(0x55896cf84a40) adding con mode: secure
2022-01-20T16:42:30.296855188Z debug    -20> 2022-01-20T16:42:30.288+0000 7f5cc51a0700  5 AuthRegistry(0x55896cf84a40) adding con mode: crc
2022-01-20T16:42:30.296855188Z debug    -19> 2022-01-20T16:42:30.288+0000 7f5cc51a0700  5 AuthRegistry(0x55896cf84a40) adding con mode: secure
2022-01-20T16:42:30.296862684Z debug    -18> 2022-01-20T16:42:30.288+0000 7f5cc51a0700  5 AuthRegistry(0x55896cf84a40) adding con mode: crc
2022-01-20T16:42:30.296862684Z debug    -17> 2022-01-20T16:42:30.288+0000 7f5cc51a0700  5 AuthRegistry(0x55896cf84a40) adding con mode: secure
2022-01-20T16:42:30.296869994Z debug    -16> 2022-01-20T16:42:30.288+0000 7f5cc51a0700  5 AuthRegistry(0x55896cf84a40) adding con mode: crc
2022-01-20T16:42:30.296869994Z debug    -15> 2022-01-20T16:42:30.288+0000 7f5cc51a0700  5 AuthRegistry(0x55896cf84a40) adding con mode: crc
2022-01-20T16:42:30.296877556Z debug    -14> 2022-01-20T16:42:30.288+0000 7f5cc51a0700  5 AuthRegistry(0x55896cf84a40) adding con mode: secure
2022-01-20T16:42:30.296884850Z debug    -13> 2022-01-20T16:42:30.288+0000 7f5cc51a0700  5 AuthRegistry(0x55896cf84a40) adding con mode: crc
2022-01-20T16:42:30.296884850Z debug    -12> 2022-01-20T16:42:30.288+0000 7f5cc51a0700  5 AuthRegistry(0x55896cf84a40) adding con mode: secure
2022-01-20T16:42:30.296892184Z debug    -11> 2022-01-20T16:42:30.288+0000 7f5cc51a0700  5 AuthRegistry(0x55896cf84a40) adding con mode: crc
2022-01-20T16:42:30.296899315Z debug    -10> 2022-01-20T16:42:30.288+0000 7f5cc51a0700  5 AuthRegistry(0x55896cf84a40) adding con mode: secure
2022-01-20T16:42:30.296899315Z debug     -9> 2022-01-20T16:42:30.288+0000 7f5cc51a0700  2 auth: KeyRing::load: loaded key file /etc/ceph/keyring-store/keyring
2022-01-20T16:42:30.296906626Z debug     -8> 2022-01-20T16:42:30.288+0000 7f5cc51a0700  5 adding auth protocol: cephx
2022-01-20T16:42:30.296906626Z debug     -7> 2022-01-20T16:42:30.288+0000 7f5cc51a0700  5 adding auth protocol: cephx
2022-01-20T16:42:30.296914084Z debug     -6> 2022-01-20T16:42:30.288+0000 7f5cc51a0700 10 log_channel(cluster) update_config to_monitors: true to_syslog: false syslog_facility: daemon prio: info to_graylog: false graylog_host: 127.0.0.1 graylog_port: 12201)
2022-01-20T16:42:30.296921160Z debug     -5> 2022-01-20T16:42:30.288+0000 7f5cc51a0700 10 log_channel(audit) update_config to_monitors: true to_syslog: false syslog_facility: local0 prio: info to_graylog: false graylog_host: 127.0.0.1 graylog_port: 12201)
2022-01-20T16:42:30.296928465Z debug     -4> 2022-01-20T16:42:30.289+0000 7f5cc51a0700  1 mon.b@-1(???) e3 preinit fsid 370885ac-8dec-4d95-8350-0deb0752c15b
2022-01-20T16:42:30.296935725Z debug     -3> 2022-01-20T16:42:30.289+0000 7f5cc51a0700  0 mon.b@-1(???).mds e27 new map
2022-01-20T16:42:30.296935725Z debug     -2> 2022-01-20T16:42:30.290+0000 7f5cc51a0700  0 mon.b@-1(???).mds e27 print_map
2022-01-20T16:42:30.296935725Z e27
2022-01-20T16:42:30.296935725Z enable_multiple, ever_enabled_multiple: 1,1
2022-01-20T16:42:30.296935725Z default compat: compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=no anchor table,9=file layout v2,10=snaprealm v2}
2022-01-20T16:42:30.296935725Z legacy client fscid: 1
2022-01-20T16:42:30.296935725Z  
2022-01-20T16:42:30.296935725Z Filesystem 'ocs-storagecluster-cephfilesystem' (1)
2022-01-20T16:42:30.296935725Z fs_name	ocs-storagecluster-cephfilesystem
2022-01-20T16:42:30.296935725Z epoch	27
2022-01-20T16:42:30.296935725Z flags	32
2022-01-20T16:42:30.296935725Z created	2022-01-20T12:27:41.132859+0000
2022-01-20T16:42:30.296935725Z modified	2022-01-20T15:55:22.452393+0000
2022-01-20T16:42:30.296935725Z tableserver	0
2022-01-20T16:42:30.296935725Z root	0
2022-01-20T16:42:30.296935725Z session_timeout	60
2022-01-20T16:42:30.296935725Z session_autoclose	300
2022-01-20T16:42:30.296935725Z max_file_size	1099511627776
2022-01-20T16:42:30.296935725Z required_client_features	{}
2022-01-20T16:42:30.296935725Z last_failure	0
2022-01-20T16:42:30.296935725Z last_failure_osd_epoch	982
2022-01-20T16:42:30.296935725Z compat	compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=no anchor table,9=file layout v2,10=snaprealm v2}
2022-01-20T16:42:30.296935725Z max_mds	1
2022-01-20T16:42:30.296935725Z in	0
2022-01-20T16:42:30.296935725Z up	{0=74228}
2022-01-20T16:42:30.296935725Z failed	
2022-01-20T16:42:30.296935725Z damaged	
2022-01-20T16:42:30.296935725Z stopped	
2022-01-20T16:42:30.296935725Z data_pools	[3]
2022-01-20T16:42:30.296935725Z metadata_pool	2
2022-01-20T16:42:30.296935725Z inline_data	disabled
2022-01-20T16:42:30.296935725Z balancer	
2022-01-20T16:42:30.296935725Z standby_count_wanted	1
2022-01-20T16:42:30.296935725Z [mds.ocs-storagecluster-cephfilesystem-a{0:74228} state up:active seq 5 join_fscid=1 addr [v2:10.131.0.39:6800/1133637502,v1:10.131.0.39:6801/1133637502] compat {c=[1],r=[1],i=[77f]}]
2022-01-20T16:42:30.296935725Z [mds.ocs-storagecluster-cephfilesystem-b{0:74306} state up:standby-replay seq 1 join_fscid=1 addr [v2:10.129.2.70:6800/115663078,v1:10.129.2.70:6801/115663078] compat {c=[1],r=[1],i=[7ff]}]
2022-01-20T16:42:30.296935725Z  
2022-01-20T16:42:30.296935725Z  
2022-01-20T16:42:30.296935725Z 
2022-01-20T16:42:30.296950797Z debug     -1> 2022-01-20T16:42:30.291+0000 7f5cc51a0700 -1 /builddir/build/BUILD/ceph-16.2.7/src/mds/FSMap.cc: In function 'void FSMap::sanity(bool) const' thread 7f5cc51a0700 time 2022-01-20T16:42:30.290637+0000
2022-01-20T16:42:30.296950797Z /builddir/build/BUILD/ceph-16.2.7/src/mds/FSMap.cc: 868: FAILED ceph_assert(info.compat.writeable(fs->mds_map.compat))
2022-01-20T16:42:30.296950797Z 
2022-01-20T16:42:30.296950797Z  ceph version 16.2.7-31.el8cp (2cfe2e2a505bfa00c184623965dbdb21ed9ff6aa) pacific (stable)
2022-01-20T16:42:30.296950797Z  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x158) [0x7f5cbc322c82]
2022-01-20T16:42:30.296950797Z  2: /usr/lib64/ceph/libceph-common.so.2(+0x276e9c) [0x7f5cbc322e9c]
2022-01-20T16:42:30.296950797Z  3: (FSMap::sanity(bool) const+0x2a8) [0x7f5cbc871788]
2022-01-20T16:42:30.296950797Z  4: (MDSMonitor::update_from_paxos(bool*)+0x39a) [0x55896ac1b7aa]
2022-01-20T16:42:30.296950797Z  5: (PaxosService::refresh(bool*)+0x10e) [0x55896ab3c29e]
2022-01-20T16:42:30.296950797Z  6: (Monitor::refresh_from_paxos(bool*)+0x18c) [0x55896a9ed2cc]
2022-01-20T16:42:30.296950797Z  7: (Monitor::init_paxos()+0x10c) [0x55896a9ed5dc]
2022-01-20T16:42:30.296950797Z  8: (Monitor::preinit()+0xd30) [0x55896aa1aaa0]
2022-01-20T16:42:30.296950797Z  9: main()
2022-01-20T16:42:30.296950797Z  10: __libc_start_main()
2022-01-20T16:42:30.296950797Z  11: _start()
2022-01-20T16:42:30.296950797Z 
2022-01-20T16:42:30.296950797Z debug      0> 2022-01-20T16:42:30.293+0000 7f5cc51a0700 -1 *** Caught signal (Aborted) **
2022-01-20T16:42:30.296950797Z  in thread 7f5cc51a0700 thread_name:ceph-mon
2022-01-20T16:42:30.296950797Z 
2022-01-20T16:42:30.296950797Z  ceph version 16.2.7-31.el8cp (2cfe2e2a505bfa00c184623965dbdb21ed9ff6aa) pacific (stable)
2022-01-20T16:42:30.296950797Z  1: /lib64/libpthread.so.0(+0x12c20) [0x7f5cba062c20]
2022-01-20T16:42:30.296950797Z  2: gsignal()
2022-01-20T16:42:30.296950797Z  3: abort()
2022-01-20T16:42:30.296950797Z  4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a9) [0x7f5cbc322cd3]
2022-01-20T16:42:30.296950797Z  5: /usr/lib64/ceph/libceph-common.so.2(+0x276e9c) [0x7f5cbc322e9c]
2022-01-20T16:42:30.296950797Z  6: (FSMap::sanity(bool) const+0x2a8) [0x7f5cbc871788]
2022-01-20T16:42:30.296950797Z  7: (MDSMonitor::update_from_paxos(bool*)+0x39a) [0x55896ac1b7aa]
2022-01-20T16:42:30.296950797Z  8: (PaxosService::refresh(bool*)+0x10e) [0x55896ab3c29e]
2022-01-20T16:42:30.296950797Z  9: (Monitor::refresh_from_paxos(bool*)+0x18c) [0x55896a9ed2cc]
2022-01-20T16:42:30.296950797Z  10: (Monitor::init_paxos()+0x10c) [0x55896a9ed5dc]
2022-01-20T16:42:30.296950797Z  11: (Monitor::preinit()+0xd30) [0x55896aa1aaa0]
2022-01-20T16:42:30.296950797Z  12: main()
2022-01-20T16:42:30.296950797Z  13: __libc_start_main()
2022-01-20T16:42:30.296950797Z  14: _start()
2022-01-20T16:42:30.296950797Z  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
2022-01-20T16:42:30.296964906Z 
2022-01-20T16:42:30.296992315Z --- logging levels ---
2022-01-20T16:42:30.296992315Z    0/ 5 none
2022-01-20T16:42:30.297000801Z    0/ 1 lockdep
2022-01-20T16:42:30.297000801Z    0/ 1 context
2022-01-20T16:42:30.297000801Z    1/ 1 crush
2022-01-20T16:42:30.297008537Z    1/ 5 mds
2022-01-20T16:42:30.297008537Z    1/ 5 mds_balancer
2022-01-20T16:42:30.297016192Z    1/ 5 mds_locker
2022-01-20T16:42:30.297016192Z    1/ 5 mds_log
2022-01-20T16:42:30.297016192Z    1/ 5 mds_log_expire
2022-01-20T16:42:30.297023991Z    1/ 5 mds_migrator
2022-01-20T16:42:30.297023991Z    0/ 1 buffer
2022-01-20T16:42:30.297031807Z    0/ 1 timer
2022-01-20T16:42:30.297031807Z    0/ 1 filer
2022-01-20T16:42:30.297031807Z    0/ 1 striper
2022-01-20T16:42:30.297039569Z    0/ 1 objecter
2022-01-20T16:42:30.297039569Z    0/ 5 rados
2022-01-20T16:42:30.297047277Z    0/ 5 rbd
2022-01-20T16:42:30.297047277Z    0/ 5 rbd_mirror
2022-01-20T16:42:30.297054704Z    0/ 5 rbd_replay
2022-01-20T16:42:30.297054704Z    0/ 5 rbd_pwl
2022-01-20T16:42:30.297054704Z    0/ 5 journaler
2022-01-20T16:42:30.297062378Z    0/ 5 objectcacher
2022-01-20T16:42:30.297069833Z    0/ 5 immutable_obj_cache
2022-01-20T16:42:30.297069833Z    0/ 5 client
2022-01-20T16:42:30.297069833Z    1/ 5 osd
2022-01-20T16:42:30.297079198Z    0/ 5 optracker
2022-01-20T16:42:30.297079198Z    0/ 5 objclass
2022-01-20T16:42:30.297079198Z    1/ 3 filestore
2022-01-20T16:42:30.297087171Z    1/ 3 journal
2022-01-20T16:42:30.297087171Z    0/ 0 ms
2022-01-20T16:42:30.297094968Z    1/ 5 mon
2022-01-20T16:42:30.297094968Z    0/10 monc
2022-01-20T16:42:30.297102520Z    1/ 5 paxos
2022-01-20T16:42:30.297102520Z    0/ 5 tp
2022-01-20T16:42:30.297102520Z    1/ 5 auth
2022-01-20T16:42:30.297110195Z    1/ 5 crypto
2022-01-20T16:42:30.297110195Z    1/ 1 finisher
2022-01-20T16:42:30.297118068Z    1/ 1 reserver
2022-01-20T16:42:30.297118068Z    1/ 5 heartbeatmap
2022-01-20T16:42:30.297125715Z    1/ 5 perfcounter
2022-01-20T16:42:30.297125715Z    1/ 5 rgw
2022-01-20T16:42:30.297125715Z    1/ 5 rgw_sync
2022-01-20T16:42:30.297133388Z    1/10 civetweb
2022-01-20T16:42:30.297133388Z    1/ 5 javaclient
2022-01-20T16:42:30.297141161Z    1/ 5 asok
2022-01-20T16:42:30.297141161Z    1/ 1 throttle
2022-01-20T16:42:30.297148825Z    0/ 0 refs
2022-01-20T16:42:30.297148825Z    1/ 5 compressor
2022-01-20T16:42:30.297148825Z    1/ 5 bluestore
2022-01-20T16:42:30.297156553Z    1/ 5 bluefs
2022-01-20T16:42:30.297156553Z    1/ 3 bdev
2022-01-20T16:42:30.297164207Z    1/ 5 kstore
2022-01-20T16:42:30.297164207Z    4/ 5 rocksdb
2022-01-20T16:42:30.297164207Z    4/ 5 leveldb
2022-01-20T16:42:30.297171880Z    4/ 5 memdb
2022-01-20T16:42:30.297171880Z    1/ 5 fuse
2022-01-20T16:42:30.297179561Z    2/ 5 mgr
2022-01-20T16:42:30.297179561Z    1/ 5 mgrc
2022-01-20T16:42:30.297187170Z    1/ 5 dpdk
2022-01-20T16:42:30.297187170Z    1/ 5 eventtrace
2022-01-20T16:42:30.297187170Z    1/ 5 prioritycache
2022-01-20T16:42:30.297194819Z    0/ 5 test
2022-01-20T16:42:30.297194819Z    0/ 5 cephfs_mirror
2022-01-20T16:42:30.297202426Z    0/ 5 cephsqlite
2022-01-20T16:42:30.297202426Z   -2/-2 (syslog threshold)
2022-01-20T16:42:30.297209918Z   99/99 (stderr threshold)
2022-01-20T16:42:30.297209918Z --- pthread ID / name mapping for recent threads ---
2022-01-20T16:42:30.297234902Z   140036001363712 / rocksdb:dump_st
2022-01-20T16:42:30.297244513Z   140036169725696 / admin_socket
2022-01-20T16:42:30.297251760Z   140036420536064 / ceph-mon
2022-01-20T16:42:30.297251760Z   max_recent     10000
2022-01-20T16:42:30.297251760Z   max_new        10000
2022-01-20T16:42:30.297259441Z   log_file /var/lib/ceph/crash/2022-01-20T16:42:30.293729Z_ff2cc9e7-cf68-4c5a-b384-336b8997364e/log
2022-01-20T16:42:30.297259441Z --- end dump of recent events ---
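The checks that would confirm whether Ceph registered a crash are the standard ones (a sketch; the ceph commands assume a reachable cluster and the usual rook-ceph toolbox pod, neither of which the reporter confirmed here):

```shell
# Previous (crashed) invocation of the mon container, straight from kubelet
oc -n openshift-storage logs rook-ceph-mon-b-5f6cfbd5d6-98hg2 -c mon --previous

# From the rook-ceph toolbox: list crashes Ceph has registered, then inspect one
ceph crash ls
ceph crash info "$CRASH_ID"   # set CRASH_ID to an entry from 'ceph crash ls'
```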

Comment 4 Blaine Gardner 2022-01-24 20:16:34 UTC
Transferring this to the Ceph component to get help debugging this on the Ceph side.

Comment 5 Mudit Agarwal 2022-01-25 10:29:11 UTC
I have created a Ceph bug, but a quick Google search tells me there is an existing issue around this;
see the release notes https://github.com/ceph/ceph/pull/44131 and this conversation in the community: https://www.spinics.net/lists/ceph-users/msg70110.html

Comment 6 Sébastien Han 2022-01-25 10:52:29 UTC
Rook already applies "mon_mds_skip_sanity" before the upgrade; we can see that in the operator logs.


2022-01-20T15:53:10.400253921Z 2022-01-20 15:53:10.400208 I | ceph-cluster-controller: upgrading ceph cluster to "16.2.7-31 pacific"
2022-01-20T15:53:10.400253921Z 2022-01-20 15:53:10.400230 I | ceph-cluster-controller: cluster "openshift-storage": version "16.2.7-31 pacific" detected for image "quay.io/rhceph-dev/rhceph@sha256:77a11bd0eca26a1315c384f1d7f0d7a1f6dd0631e464cd0b1e2cee929f558d9d"
2022-01-20T15:53:10.460664304Z 2022-01-20 15:53:10.460626 I | op-config: setting "mon"="mon_mds_skip_sanity"="1" option to the mon configuration database
2022-01-20T15:53:10.784778923Z 2022-01-20 15:53:10.784735 I | op-config: successfully set "mon"="mon_mds_skip_sanity"="1" option to the mon configuration database
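The option the operator sets above can also be checked or applied manually with the ceph CLI (a sketch of the equivalent commands; they assume a working rook-ceph toolbox):

```shell
# Verify the sanity-check bypass is present in the mon config database
ceph config get mon mon_mds_skip_sanity

# Equivalent of the op-config step shown in the operator log above
ceph config set mon mon_mds_skip_sanity 1
```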

Still, it looks like mon-b is failing the upgrade.

Patrick do you have any idea what could be wrong?
Thanks!

Comment 7 Sébastien Han 2022-01-26 15:28:06 UTC
*** Bug 2043510 has been marked as a duplicate of this bug. ***

Comment 18 Vijay Avuthu 2022-02-23 15:17:57 UTC
Verified with the following versions:
=====================================

Upgrade was successful from ocs-registry:4.9.3-2 to ocs-registry:4.10.0-164

job: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/3311/console

Moving to verified

Comment 20 errata-xmlrpc 2022-04-13 18:51:56 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.10.0 enhancement, security & bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:1372