Bug 1922421 - [ROKS] OCS deployment stuck at mon pod in pending state
Summary: [ROKS] OCS deployment stuck at mon pod in pending state
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Container Storage
Classification: Red Hat Storage
Component: ocs-operator
Version: 4.6
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Target Release: OCS 4.7.0
Assignee: Rohan Gupta
QA Contact: Petr Balogh
URL:
Whiteboard:
Depends On:
Blocks: 1937837 1938134
 
Reported: 2021-01-29 17:02 UTC by akgunjal@in.ibm.com
Modified: 2023-09-15 00:59 UTC
CC List: 15 users

Fixed In Version: 4.7.0-307.ci
Doc Type: Bug Fix
Doc Text:
.Increased `mon` failover timeout
With this update, the `mon` failover timeout has been increased to 15 minutes on IBM Cloud. Previously, the `mons` would begin to fail over while they were still coming up.
Clone Of:
: 1937837 (view as bug list)
Environment:
Last Closed: 2021-05-19 09:18:58 UTC
Embargoed:


Attachments (Terms of Use)
output of openshift-storage details (9.07 KB, text/plain)
2021-01-29 17:02 UTC, akgunjal@in.ibm.com
no flags Details
Rook Operator logs (77.92 KB, text/plain)
2021-02-02 05:55 UTC, Kesavan
no flags Details
rook-ceph-operator logs (28.49 KB, text/plain)
2021-02-02 06:27 UTC, akgunjal@in.ibm.com
no flags Details
Logs of mon-d (463.04 KB, text/plain)
2021-02-03 19:27 UTC, Rohan Gupta
no flags Details
Logs of mon-f (556.04 KB, text/plain)
2021-02-03 19:29 UTC, Rohan Gupta
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Github openshift ocs-operator pull 1113 0 None open Increase mon failover timeout to 15 min for IBMCloudPlatformType 2021-03-12 08:40:12 UTC
Github openshift ocs-operator pull 1122 0 None open Bug 1922421:[release-4.7] Increase mon failover timeout to 15 min for IBMCloudPlatformType 2021-03-18 16:40:59 UTC
Red Hat Product Errata RHSA-2021:2041 0 None None None 2021-05-19 09:19:51 UTC

Description akgunjal@in.ibm.com 2021-01-29 17:02:06 UTC
Created attachment 1752122 [details]
output of openshift-storage details

Description of problem (please be as detailed as possible and provide log
snippets):
We deployed OCS 4.6 and see that 3 mon-related PVCs were created and 2 mon pods came to the Running state, but the third came up as mon-d instead of mon-c. The mon-c PVC is free to use, but the operator still went ahead and spawned mon-d, which remains Pending because we don't have an extra disk. Ideally it should use the mon-c PVC, which is available.

Version of all relevant components (if applicable):
OCS 4.6

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
The OCS 4.6 deployment does not complete.


Is there any workaround available to the best of your knowledge?
No.

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
1

Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?
Did not try from the UI.

If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Deploy OCS 4.6 using local volumes through LSO, with 3 disks each for mon and OSD.
2. The operators are deployed and start deploying the mons first.
3. mon-a and mon-b are deployed, but mon-d starts instead of mon-c and is stuck.
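
As a rough sketch of how the stuck state can be inspected (the namespace and the app=rook-ceph-mon label assume the default OCS/Rook conventions; the pod name is a placeholder):

# List the mon pods and their PVCs
$ oc -n openshift-storage get pods -l app=rook-ceph-mon -o wide
$ oc -n openshift-storage get pvc | grep rook-ceph-mon

# Describe the Pending mon pod to see the scheduling/volume events
$ oc -n openshift-storage describe pod <pending-mon-pod-name>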


Actual results:
The deployment is stuck waiting for the mon pod to complete; it waits for a 4th disk, which is not present, instead of using the free mon-c PVC.

Expected results:
It should deploy mon-c and proceed with the deployment.

Additional info:
Adding the output of OCS deployment to show the issue.


Comment 2 Sahina Bose 2021-01-29 17:42:44 UTC
Travis, can you or someone from Rook team take a look?

Comment 3 Travis Nielsen 2021-02-01 18:21:06 UTC
Can you share the logs for the operator? In particular, the rook operator log would show why it is trying to start mon-d instead of mon-c.

Comment 4 Kesavan 2021-02-02 05:55:21 UTC
Created attachment 1754179 [details]
Rook Operator logs

Comment 5 akgunjal@in.ibm.com 2021-02-02 06:27:02 UTC
Created attachment 1754183 [details]
rook-ceph-operator logs

@travis: Adding the rook operator logs of the cluster where we just reproduced the issue.

Comment 6 Mudit Agarwal 2021-02-02 14:19:25 UTC
Travis, PTAL

Comment 7 Travis Nielsen 2021-02-02 18:52:06 UTC
The mons are not able to form quorum, which is blocking the cluster from being created. There are several symptoms that show this is the case:
- There are not three mons running named a, b, c. If there is a mon-d attempting to start, it means the operator timed out trying the first three mons and is going to try a new mon. The fact that it jumped to mon-d instead of retrying mon-b or mon-c is a bug that is fixed in OCS 4.7.
- The crash collector pods won't start until after the mons have quorum.
- No mgr, OSD, or other Ceph daemons are running besides the mons.

The cause of the mons not forming quorum is almost always network related. The operator cannot connect to the mon pods, either due to network config, or a firewall blocking the mon ports. See the Ceph network reference for the ports that need to be open for mons and OSDs:
https://docs.ceph.com/en/latest/rados/configuration/network-config-ref/

A quick test of connectivity is to "oc rsh" into the operator pod, then curl the mon endpoint. You can find the mon endpoint with "oc -n openshift-storage get svc", then run something like:
curl <clusterIP>:3300
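
For example (a sketch only; the operator pod name is a placeholder, and the namespace assumes a default install):

# Find the mon services and their cluster IPs
$ oc -n openshift-storage get svc | grep rook-ceph-mon

# From inside the operator pod, probe a mon's msgr2 port
$ oc -n openshift-storage rsh <rook-ceph-operator-pod-name>
sh-4.4$ curl <clusterIP>:3300

If the connection is refused or times out, something on the network path (config or firewall) is likely blocking the mon ports (3300/6789).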

Comment 8 Rohan Gupta 2021-02-03 19:27:44 UTC
Created attachment 1754857 [details]
Logs of mon-d

The mon-d pod is running successfully, so I have attached the logs.

Comment 9 Rohan Gupta 2021-02-03 19:29:55 UTC
Created attachment 1754859 [details]
Logs of mon-f

The mon-f pod is running successfully, but the logs show errors.

Comment 10 Travis Nielsen 2021-02-03 20:43:40 UTC
The mons are not in quorum and seem to be having issues peering. The operator is able to connect to the mons (the operator can curl the mon endpoints), so it shouldn't be a network issue, at least between the operator and mons. 

In the mon-d logs, we see these messages repeated over and over:

debug 2021-02-03 17:44:00.739 7f49395f8700  0 log_channel(audit) log [DBG] : from='admin socket' entity='admin socket' cmd=mon_status args=[]: finished
debug 2021-02-03 17:44:04.598 7f493457d700 -1 mon.d@0(probing) e0 get_health_metrics reporting 728 slow ops, oldest is log(1 entries from seq 1 at 2021-02-03 16:43:00.735590)
debug 2021-02-03 17:44:05.574 7f4931d78700  1 mon.d@0(probing) e0  adding peer [v2:172.21.58.93:3300/0,v1:172.21.58.93:6789/0] to list of hints
debug 2021-02-03 17:44:05.574 7f4931d78700  1 mon.d@0(probing) e0  adding peer [v2:172.21.58.93:3300/0,v1:172.21.58.93:6789/0] to list of hints

In the mon-f logs, these messages are repeated:

debug 2021-02-03 17:00:06.709 7f56852f1700  0 log_channel(cluster) log [INF] : mon.f calling monitor election
debug 2021-02-03 17:00:06.709 7f56852f1700  1 mon.f@0(electing).elector(0) init, first boot, initializing epoch at 1 
debug 2021-02-03 17:00:06.713 7f56852f1700 -1 mon.f@0(electing) e0 failed to get devid for sda: fallback method has serial 'LUN C-Mode      'but no model
debug 2021-02-03 17:00:06.713 7f56852f1700 -1 mon.f@0(electing) e0 failed to get devid for sdc: fallback method has serial 'LUN C-Mode      'but no model

@Josh Do these errors give an indication why these two mons are not forming quorum? Or can someone take a look at the logs for other indications of what's happening?
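
One more data point that could help: each mon's own view of its peers (the probing/electing state above) can be dumped from its admin socket inside the pod. A sketch, assuming Rook's default deployment names and socket path:

# Query mon.d's status via its admin socket (does not require quorum);
# the socket path /run/ceph/ceph-mon.d.asok is Rook's usual location.
$ oc -n openshift-storage exec deploy/rook-ceph-mon-d -- \
    ceph --admin-daemon /run/ceph/ceph-mon.d.asok mon_status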

Comment 11 akgunjal@in.ibm.com 2021-02-04 14:41:21 UTC
@sahina: If this is due to a timing issue, can the timeout be increased on OCS 4.6 to support this?

Comment 12 Travis Nielsen 2021-02-15 16:54:37 UTC
Moving to Ceph to investigate the mons not forming quorum.

Comment 13 Sahina Bose 2021-02-22 12:54:00 UTC
Ping. 
Hi Scott, the IBM team is running into this issue frequently on the IBM Cloud (ROKS) platform. Can someone from Ceph team look at this?

Comment 14 Sahina Bose 2021-02-22 12:54:32 UTC
Ping. 
Hi Josh, the IBM team is running into this issue frequently on the IBM Cloud (ROKS) platform. Can someone from Ceph team look at this?

Comment 15 Josh Durgin 2021-02-23 16:39:00 UTC
(In reply to Travis Nielsen from comment #10)
> The mons are not in quorum and seem to be having issues peering. The
> operator is able to connect to the mons (the operator can curl the mon
> endpoints), so it shouldn't be a network issue, at least between the
> operator and mons. 
> 
> In the mon-d logs, we see these messages repeated over and over:
> 
> debug 2021-02-03 17:44:00.739 7f49395f8700  0 log_channel(audit) log [DBG] :
> from='admin socket' entity='admin socket' cmd=mon_status args=[]: finished
> debug 2021-02-03 17:44:04.598 7f493457d700 -1 mon.d@0(probing) e0
> get_health_metrics reporting 728 slow ops, oldest is log(1 entries from seq
> 1 at 2021-02-03 16:43:00.735590)
> debug 2021-02-03 17:44:05.574 7f4931d78700  1 mon.d@0(probing) e0  adding
> peer [v2:172.21.58.93:3300/0,v1:172.21.58.93:6789/0] to list of hints
> debug 2021-02-03 17:44:05.574 7f4931d78700  1 mon.d@0(probing) e0  adding
> peer [v2:172.21.58.93:3300/0,v1:172.21.58.93:6789/0] to list of hints
> 
> In the mon-f logs, these messages are repeated:
> 
> debug 2021-02-03 17:00:06.709 7f56852f1700  0 log_channel(cluster) log [INF]
> : mon.f calling monitor election
> debug 2021-02-03 17:00:06.709 7f56852f1700  1 mon.f@0(electing).elector(0)
> init, first boot, initializing epoch at 1 
> debug 2021-02-03 17:00:06.713 7f56852f1700 -1 mon.f@0(electing) e0 failed to
> get devid for sda: fallback method has serial 'LUN C-Mode      'but no model
> debug 2021-02-03 17:00:06.713 7f56852f1700 -1 mon.f@0(electing) e0 failed to
> get devid for sdc: fallback method has serial 'LUN C-Mode      'but no model
> 
> @Josh Do these errors give an indication why these two mons are not forming
> quorum? Or can someone take a look at the logs for other indications of
> what's happening?

The monitors don't seem to be communicating with each other. There's nothing indicating any other issue in their logs. This could be confirmed with debug_ms=1 logs from the monitors; however, I expect more basic testing of connectivity between the mon pods would be easier.

Basic verification of connectivity between nodes should be a part of any CI run.
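
If someone hits this again, a sketch of how the debug_ms=1 logs could be gathered in a Rook-managed cluster (the ConfigMap name and key follow Rook's documented ceph.conf override mechanism; adjust as needed):

# Add a mon debug override via Rook's ceph.conf override ConfigMap
$ oc -n openshift-storage edit configmap rook-config-override
# and under the "config" key add:
#   [mon]
#   debug ms = 1

# Then restart the mon pods (one at a time) so they pick up the override
$ oc -n openshift-storage delete pod <mon-pod-name>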

Comment 16 Mudit Agarwal 2021-03-02 09:09:24 UTC
Sahina, this needs to be reproduced with debug logs.

Comment 17 Sahina Bose 2021-03-02 15:12:34 UTC
Akash, if you can reproduce this, please provide debug logs and check network connectivity as well.

Comment 18 akgunjal@in.ibm.com 2021-03-08 14:48:14 UTC
@Sahina: We reproduced the issue and gave a cluster to Rohan last week for further debugging.

Comment 19 Rohan Gupta 2021-03-11 06:44:33 UTC
Travis suggested increasing the mon failover timeout to 15 min.
Jose suggested doing a platform check (because this issue is specific to IBM ROKS) and changing the default value of this timeout instead of exposing it as a field in the StorageCluster CR.

Comment 23 akgunjal@in.ibm.com 2021-03-26 06:43:49 UTC
@sahina: We tested this fix and still see mon-a, b, and d as before, instead of c,
and there is an error message in the rook-ceph operator for mon-c:

"op-mon: mon "c" not found in quorum, waiting for timeout (101 seconds left) before failover"

So this issue is still not fixed.

Comment 24 Rohan CJ 2021-03-26 07:47:20 UTC
Got another cluster with this issue.

Catalog Source indicates the build, so someone please double-check this build has the fix: quay.io/rhceph-dev/ocs-registry:4.6.4-311.ci

The node was down for 14 minutes, the mon timeout was 10m (not 15m, as the fix should have made it), and the mon timed out.

$ oc get infrastructure cluster -o yaml -o jsonpath='{.status.platform}'
IBMCloud

Did not see `Increasing Mon failover timeout to 15m for IBMCloud` in ocs-operator logs (that's the log message I expect to see from looking at the PR)
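
For reference, the checks above were roughly the following (the CatalogSource listing and the grep pattern are assumptions based on this comment and the linked PR):

# Which ocs-registry build is actually installed?
$ oc get catalogsource --all-namespaces -o jsonpath='{range .items[*]}{.metadata.namespace}{"/"}{.metadata.name}{" "}{.spec.image}{"\n"}{end}'

# Confirm the platform type the operator sees
$ oc get infrastructure cluster -o jsonpath='{.status.platform}'

# Look for the expected timeout-bump message in the ocs-operator log
$ oc -n openshift-storage logs deploy/ocs-operator | grep -i "failover timeout"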

Relevant logs:

# cluster comes up on March 24 5:38 PM
2021-03-24 17:38:07.028949 I | clusterdisruption-controller: all osds are up. pg health is active+clean

# first clusterdisruption-controller message without "all osds are up", upgrade? March 25 5:24:55 AM 
2021-03-25 05:24:55.398374 I | clusterdisruption-controller: all "rack" failure domains: [rack0 rack1]. currently draining: "rack2". pg health: "cluster is not fully clean. PGs: [{StateName:peering Count:176}]"

# within a minute, mon-c goes down:
2021-03-25 05:25:15.559578 W | op-mon: mon "c" not found in quorum, waiting for timeout (599 seconds left) before failover

# 10 minutes later, mon-c times out, and mon-d starts:
2021-03-25 05:35:50.364039 W | op-mon: mon "c" NOT found in quorum and timeout exceeded, mon will be failed over
2021-03-25 05:35:50.364068 I | op-mon: Failing over monitor "c"
2021-03-25 05:35:50.375205 I | op-mon: scaling the mon "c" deployment to replica 0
2021-03-25 05:35:50.395872 I | op-mon: starting new mon: &{ResourceName:rook-ceph-mon-d DaemonName:d PublicIP: Port:6789 DataPathMap:0xc000731b90}
2021-03-25 05:35:50.446195 I | op-mon: sched-mon: created canary monitor rook-ceph-mon-d-canary pvc rook-ceph-mon-d
2021-03-25 05:35:50.600571 I | op-mon: sched-mon: created canary deployment rook-ceph-mon-d-canary

# 3 Minutes later, mon-d comes up, node in rack2 still down
2021-03-25 05:38:48.069099 I | clusterdisruption-controller: all "rack" failure domains: [rack0 rack1]. currently draining: "rack2". pg health: "cluster is not fully clean. PGs: [{StateName:active+undersized+degraded Count:123} {StateName:active+undersized Count:53}]"
2021-03-25 05:38:50.041396 I | op-mon: Monitors in quorum: [a b d]

# After 14m2s (minus the time it takes for the OSDs to come up) of overall node downtime, the OSDs come back up, indicating the node is back up and the cluster can start healing
2021-03-25 05:38:57.885178 I | clusterdisruption-controller: all osds are up. last drained failure domain: "rack2".  pg health: "cluster is not fully clean. PGs: [{StateName:active+undersized+degraded Count:123} {StateName:active+undersized Count:53}]". waiting for PGs to be active+clean or PG health check to timeout

Comment 25 Rohan CJ 2021-03-26 07:48:14 UTC
Comment 23 and Comment 24 are about the same cluster.

Comment 26 Petr Balogh 2021-03-30 13:59:08 UTC
I am creating a new cluster for verification of the other BZ. So once the tier1 job finishes here:
https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/view/Tier1/job/qe-trigger-ibmcloud-managed-1az-rhel-3w-tier1/14/console

we can use this cluster for verification of this BZ.

Comment 27 Shirisha S Rao 2021-04-06 12:38:15 UTC
I have verified the mon timeout on the cluster provided by Petr and it was set to 15 minutes.
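
For anyone re-checking this later, a sketch of how the value can be read from the CephCluster CR (the CR name and the healthCheck field path are assumptions based on the default OCS install and Rook's daemon health settings):

# Expect 15m on IBM Cloud with the fix (the default elsewhere remains 10 minutes)
$ oc -n openshift-storage get cephcluster ocs-storagecluster-cephcluster \
    -o jsonpath='{.spec.healthCheck.daemonHealth.mon.timeout}'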

Comment 29 Petr Balogh 2021-04-20 13:53:07 UTC
Based on Comment 27, I am moving this to VERIFIED.

Comment 30 Rohan Gupta 2021-04-20 14:07:25 UTC
Hi @edonnell, what should the doc type be for this bug?

Comment 32 errata-xmlrpc 2021-05-19 09:18:58 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat OpenShift Container Storage 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2041

Comment 33 Red Hat Bugzilla 2023-09-15 00:59:35 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days

