Created attachment 1752122 [details]
output of openshift-storage details

Description of problem (please be detailed as possible and provide log snippets):
We deployed OCS 4.6 and see that 3 mon-related PVCs were created and 2 mon pods came to running state, but the third one came up as mon-d instead of mon-c. The mon-c PVC is free to use, but the operator still went ahead and spawned mon-d, which remains in Pending because we don't have an extra disk. Ideally it should use the mon-c PVC, which is available.

Version of all relevant components (if applicable):
OCS 4.6

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
It does not complete the OCS 4.6 deployment.

Is there any workaround available to the best of your knowledge?
No.

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
1

Can this issue be reproduced?
Yes

Can this issue reproduce from the UI?
Did not try from the UI

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Deploy OCS 4.6 using local volumes through LSO with 3 disks each for mon and osd
2. The operators are deployed and start to deploy the mons first
3. mon-a and mon-b are deployed, but mon-d starts instead of mon-c and it is stuck

Actual results:
The deployment is stuck waiting for the mon pod to complete; it is waiting for a 4th disk, which is not present, instead of using the free mon-c PVC.

Expected results:
It should deploy mon-c and proceed with the deployment.

Additional info:
Adding the output of the OCS deployment to show the issue.
Travis, can you or someone from the Rook team take a look?
Can you share the logs for the operator? In particular, the rook operator log would show why it is trying to start mon-d instead of mon-c.
Created attachment 1754179 [details]
Rook Operator logs
Created attachment 1754183 [details]
rook-ceph-operator logs

@travis: Adding the Rook operator logs of a cluster where we just reproduced this again.
Travis, PTAL
The mons are not able to form quorum, which is blocking the cluster from being created. There are several symptoms that show this is the case:

- There are not three mons running named a, b, c. If there is a mon-d attempting to start, it means the operator timed out waiting on the first three mons and is going to try a new mon. The fact that it jumped to mon-d instead of retrying mon-b or mon-c is a bug that is fixed in OCS 4.7.
- The crash collector pods won't start until after the mons have quorum.
- No mgr, osd, or other ceph daemons are running besides the mons.

The cause of the mons not forming quorum is almost always network related: the operator cannot connect to the mon pods, either due to network config or a firewall blocking the mon ports. See the Ceph network reference for the ports that need to be open for mons and OSDs: https://docs.ceph.com/en/latest/rados/configuration/network-config-ref/

A quick connectivity test is to "oc rsh" into the operator pod and curl the mon endpoint. You can find the mon endpoint with "oc -n openshift-storage get svc", then something like:

curl <clusterIP>:3300
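For example, roughly (the openshift-storage namespace and deployment/service names assume a default OCS install; use whichever cluster IP the svc listing shows):

$ oc -n openshift-storage get svc | grep rook-ceph-mon
$ oc -n openshift-storage rsh deployment/rook-ceph-operator
sh-4.4$ curl <clusterIP>:3300

Port 3300 speaks the msgr2 protocol rather than HTTP, so curl won't print anything readable, but an accepted connection (no hang, no "connection refused") is enough to show the operator can reach the mon; a timeout or refusal points at network config or a firewall.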
Created attachment 1754857 [details]
Logs of mon-d

The mon-d pod is running successfully, so I have attached its logs.
Created attachment 1754859 [details]
Logs of mon-f

The mon-f pod is running successfully, but the logs show errors.
The mons are not in quorum and seem to be having issues peering. The operator is able to connect to the mons (the operator can curl the mon endpoints), so it shouldn't be a network issue, at least between the operator and mons.

In the mon-d logs, we see these messages repeated over and over:

debug 2021-02-03 17:44:00.739 7f49395f8700 0 log_channel(audit) log [DBG] : from='admin socket' entity='admin socket' cmd=mon_status args=[]: finished
debug 2021-02-03 17:44:04.598 7f493457d700 -1 mon.d@0(probing) e0 get_health_metrics reporting 728 slow ops, oldest is log(1 entries from seq 1 at 2021-02-03 16:43:00.735590)
debug 2021-02-03 17:44:05.574 7f4931d78700 1 mon.d@0(probing) e0 adding peer [v2:172.21.58.93:3300/0,v1:172.21.58.93:6789/0] to list of hints
debug 2021-02-03 17:44:05.574 7f4931d78700 1 mon.d@0(probing) e0 adding peer [v2:172.21.58.93:3300/0,v1:172.21.58.93:6789/0] to list of hints

In the mon-f logs, these messages are repeated:

debug 2021-02-03 17:00:06.709 7f56852f1700 0 log_channel(cluster) log [INF] : mon.f calling monitor election
debug 2021-02-03 17:00:06.709 7f56852f1700 1 mon.f@0(electing).elector(0) init, first boot, initializing epoch at 1
debug 2021-02-03 17:00:06.713 7f56852f1700 -1 mon.f@0(electing) e0 failed to get devid for sda: fallback method has serial 'LUN C-Mode 'but no model
debug 2021-02-03 17:00:06.713 7f56852f1700 -1 mon.f@0(electing) e0 failed to get devid for sdc: fallback method has serial 'LUN C-Mode 'but no model

@Josh Do these errors give an indication why these two mons are not forming quorum? Or can someone take a look at the logs for other indications of what's happening?
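Since there is no quorum, most ceph commands from the toolbox will just hang, so the more useful check is each mon's own view via its admin socket (the same mechanism the liveness probe uses in the audit log lines above). A rough sketch, assuming the default deployment names:

$ oc -n openshift-storage exec deploy/rook-ceph-mon-d -- ceph daemon mon.d mon_status
$ oc -n openshift-storage exec deploy/rook-ceph-mon-f -- ceph daemon mon.f mon_status

The "state" field shows whether the mon is stuck probing or electing, and "outside_quorum"/"extra_probe_peers" show which peers it does and does not see, which should narrow down whether the mons can reach each other at all.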
@sahina: If this is due to a timing issue, can the timeout be increased on OCS 4.6 to support this?
Moving to ceph to investigate the mons forming quorum
Ping. Hi Scott, the IBM team is running into this issue frequently on the IBM Cloud (ROKS) platform. Can someone from the Ceph team look at this?
Ping. Hi Josh, the IBM team is running into this issue frequently on the IBM Cloud (ROKS) platform. Can someone from the Ceph team look at this?
(In reply to Travis Nielsen from comment #10)
> The mons are not in quorum and seem to be having issues peering. The
> operator is able to connect to the mons (the operator can curl the mon
> endpoints), so it shouldn't be a network issue, at least between the
> operator and mons.
>
> In the mon-d logs, we see these messages repeated over and over:
>
> debug 2021-02-03 17:44:00.739 7f49395f8700 0 log_channel(audit) log [DBG] :
> from='admin socket' entity='admin socket' cmd=mon_status args=[]: finished
> debug 2021-02-03 17:44:04.598 7f493457d700 -1 mon.d@0(probing) e0
> get_health_metrics reporting 728 slow ops, oldest is log(1 entries from seq
> 1 at 2021-02-03 16:43:00.735590)
> debug 2021-02-03 17:44:05.574 7f4931d78700 1 mon.d@0(probing) e0 adding
> peer [v2:172.21.58.93:3300/0,v1:172.21.58.93:6789/0] to list of hints
> debug 2021-02-03 17:44:05.574 7f4931d78700 1 mon.d@0(probing) e0 adding
> peer [v2:172.21.58.93:3300/0,v1:172.21.58.93:6789/0] to list of hints
>
> In the mon-f logs, these messages are repeated:
>
> debug 2021-02-03 17:00:06.709 7f56852f1700 0 log_channel(cluster) log [INF]
> : mon.f calling monitor election
> debug 2021-02-03 17:00:06.709 7f56852f1700 1 mon.f@0(electing).elector(0)
> init, first boot, initializing epoch at 1
> debug 2021-02-03 17:00:06.713 7f56852f1700 -1 mon.f@0(electing) e0 failed to
> get devid for sda: fallback method has serial 'LUN C-Mode 'but no model
> debug 2021-02-03 17:00:06.713 7f56852f1700 -1 mon.f@0(electing) e0 failed to
> get devid for sdc: fallback method has serial 'LUN C-Mode 'but no model
>
> @Josh Do these errors give an indication why these two mons are not forming
> quorum? Or can someone take a look at the logs for other indications of
> what's happening?

The monitors don't seem to be communicating with each other. There's nothing indicating any other issue in their logs. This could be confirmed by debug_ms=1 logs from the monitors, though I expect more basic testing of connectivity between the mon pods would be simpler. Basic verification of connectivity between nodes should be a part of any CI run.
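To collect the debug_ms=1 logs and sanity-check mon-to-mon connectivity without quorum (note that "ceph config set" won't work while the mons can't form quorum), one option is Rook's ceph.conf override ConfigMap plus a direct curl between mon pods. A rough sketch, assuming default resource names and that curl is available in the ceph image (if not, bash's /dev/tcp redirection can substitute):

$ oc -n openshift-storage edit configmap rook-config-override
  # under data.config, add:
  #   [global]
  #   debug_ms = 1
$ oc -n openshift-storage delete pod -l app=rook-ceph-mon   # the mon deployments recreate the pods with the override applied

$ oc -n openshift-storage get svc -l app=rook-ceph-mon
$ oc -n openshift-storage exec deploy/rook-ceph-mon-d -- curl -sv --max-time 5 <mon-f clusterIP>:3300

If the connection from one mon pod to another mon's service times out or is refused, that confirms the peering problem is network-level rather than a Ceph bug.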
Sahina, this needs to be reproduced with debug logs.
Akash, if you can reproduce this, please provide debug logs and check network connectivity as well.
@Sahina: We reproduced this and gave a cluster to Rohan last week for further debugging.
Travis suggested increasing the mon failover timeout to 15 minutes.

Jose suggested doing a platform check (because this issue is specific to IBM ROKS) and changing the default value of this timeout, instead of exposing it as a field in the StorageCluster CR.
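For reference, the knob already exists on the CephCluster CR in Rook as spec.healthCheck.daemonHealth.mon.timeout (default "10m"). A hedged sketch of what bumping it by hand would look like (the CephCluster name is the usual OCS default, and the ocs-operator may reconcile a manual patch back, which is why changing the default in the operator via a platform check makes sense):

$ oc -n openshift-storage patch cephcluster ocs-storagecluster-cephcluster --type merge \
    -p '{"spec":{"healthCheck":{"daemonHealth":{"mon":{"timeout":"15m"}}}}}'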
@sahina: We tested this fix and still see mon-a, b, and d as before, instead of c, and there is an error message in the rook-ceph-operator log for mon-c:

op-mon: mon "c" not found in quorum, waiting for timeout (101 seconds left) before failover

So this issue is still not fixed.
Got another cluster with this issue.

Catalog Source indicates the build, so someone please double-check this build has the fix: quay.io/rhceph-dev/ocs-registry:4.6.4-311.ci

The node was down for 14 minutes, the mon timeout was 10m (not 15m as the fix should have made it), and the mon timed out.

$ oc get infrastructure cluster -o yaml -o jsonpath='{.status.platform}'
IBMCloud

Did not see `Increasing Mon failover timeout to 15m for IBMCloud` in the ocs-operator logs (that's the log message I expect to see from looking at the PR).

Relevant logs:

# cluster comes up on March 24 5:38 PM
2021-03-24 17:38:07.028949 I | clusterdisruption-controller: all osds are up. pg health is active+clean

# first clusterdisruption-controller message without "all osds are up", upgrade? March 25 5:24:55 AM
2021-03-25 05:24:55.398374 I | clusterdisruption-controller: all "rack" failure domains: [rack0 rack1]. currently draining: "rack2". pg health: "cluster is not fully clean. PGs: [{StateName:peering Count:176}]"

# within a minute, mon-c goes down:
2021-03-25 05:25:15.559578 W | op-mon: mon "c" not found in quorum, waiting for timeout (599 seconds left) before failover

# 10 minutes later, mon-c times out, and mon-d starts:
2021-03-25 05:35:50.364039 W | op-mon: mon "c" NOT found in quorum and timeout exceeded, mon will be failed over
2021-03-25 05:35:50.364068 I | op-mon: Failing over monitor "c"
2021-03-25 05:35:50.375205 I | op-mon: scaling the mon "c" deployment to replica 0
2021-03-25 05:35:50.395872 I | op-mon: starting new mon: &{ResourceName:rook-ceph-mon-d DaemonName:d PublicIP: Port:6789 DataPathMap:0xc000731b90}
2021-03-25 05:35:50.446195 I | op-mon: sched-mon: created canary monitor rook-ceph-mon-d-canary pvc rook-ceph-mon-d
2021-03-25 05:35:50.600571 I | op-mon: sched-mon: created canary deployment rook-ceph-mon-d-canary

# 3 minutes later, mon-d comes up, node in rack2 still down
2021-03-25 05:38:48.069099 I | clusterdisruption-controller: all "rack" failure domains: [rack0 rack1]. currently draining: "rack2". pg health: "cluster is not fully clean. PGs: [{StateName:active+undersized+degraded Count:123} {StateName:active+undersized Count:53}]"
2021-03-25 05:38:50.041396 I | op-mon: Monitors in quorum: [a b d]

# After 14m2s (minus the time it takes for the OSDs to come up) of overall node downtime, the OSDs come back up, indicating the node is back up and the cluster can start healing
2021-03-25 05:38:57.885178 I | clusterdisruption-controller: all osds are up. last drained failure domain: "rack2". pg health: "cluster is not fully clean. PGs: [{StateName:active+undersized+degraded Count:123} {StateName:active+undersized Count:53}]". waiting for PGs to be active+clean or PG health check to timeout
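Two quick checks that should tell whether a given cluster actually carries the fix (a sketch; resource names assume the default OCS install, and the exact log wording may differ from the PR message quoted above):

$ oc -n openshift-storage logs deploy/ocs-operator | grep -i "mon failover timeout"
$ oc -n openshift-storage get cephcluster ocs-storagecluster-cephcluster \
    -o jsonpath='{.spec.healthCheck.daemonHealth.mon.timeout}'

On an IBMCloud cluster with the fix in place, the second command should print 15m rather than the 10m default seen above.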
Comment 23 and Comment 24 are about the same cluster.
I am creating a new cluster for verification of the other BZ. Once the tier1 job finishes here: https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/view/Tier1/job/qe-trigger-ibmcloud-managed-1az-rhel-3w-tier1/14/console we can use that cluster for verification of this BZ.
I have verified the mon timeout on the cluster provided by Petr and it was set to 15 minutes.
Based on comment 27, I am moving this to VERIFIED.
Hi @edonnell, what should be the doc type for this bug?
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: Red Hat OpenShift Container Storage 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2041
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days