Bug 1937837 - [ROKS] OCS deployment stuck at mon pod in pending state
Summary: [ROKS] OCS deployment stuck at mon pod in pending state
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Container Storage
Classification: Red Hat Storage
Component: ocs-operator
Version: 4.6
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Target Release: OCS 4.6.4
Assignee: Rohan Gupta
QA Contact: Petr Balogh
URL:
Whiteboard:
Depends On: 1922421
Blocks: 1931424
Reported: 2021-03-11 16:10 UTC by Mudit Agarwal
Modified: 2021-06-01 08:48 UTC

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of: 1922421
Environment:
Last Closed: 2021-04-08 10:29:00 UTC
Embargoed:


Links
System ID | Private | Priority | Status | Summary | Last Updated
Github openshift ocs-operator pull 1113 | 0 | None | open | Increase mon failover timeout to 15 min for IBMCloudPlatformType | 2021-03-15 14:21:41 UTC
Red Hat Product Errata RHBA-2021:1134 | 0 | None | None | None | 2021-04-08 10:29:28 UTC
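
The linked pull request title describes a platform-specific timeout. A minimal sketch of that idea follows, in Python for illustration only (ocs-operator itself is written in Go). The function name, the lookup table, and the 10-minute default are assumptions that this bug does not state; only the 15-minute value for the IBM Cloud platform type comes from the pull request title.

```python
from datetime import timedelta

# Assumed default for illustration; this bug report does not state the default value.
DEFAULT_MON_FAILOVER_TIMEOUT = timedelta(minutes=10)

# Per-platform overrides. The "IBMCloud" key is an assumed spelling of the
# platform type; the 15-minute value is taken from the pull request title.
PLATFORM_MON_FAILOVER_TIMEOUTS = {
    "IBMCloud": timedelta(minutes=15),
}


def mon_failover_timeout(platform_type):
    """Return the mon failover timeout to use for the given platform type."""
    return PLATFORM_MON_FAILOVER_TIMEOUTS.get(
        platform_type, DEFAULT_MON_FAILOVER_TIMEOUT
    )
```

Keeping the override in a lookup table means other platforms keep the default unless they are listed explicitly.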

Comment 6 Petr Balogh 2021-03-25 08:45:03 UTC
I ran deployment + tier1 with build:
quay.io/rhceph-dev/ocs-registry:4.6.4-311.ci

I think that based on the execution I did:
https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/1564/
Logs:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/pbalogh-ibmcloud/pbalogh-ibmcloud_20210324T154933

I was able to deploy the cluster and did not see any mon pods stuck.
I shared the kubeconfig for the cluster (http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/pbalogh-ibmcloud/pbalog[…]ud_20210324T154933/openshift-cluster-dir/auth/kubeconfig) with the IBM guys / Akash so they can confirm that.

Comment 7 Sahina Bose 2021-03-26 08:30:30 UTC
Rohan, can you confirm if the build has the fix?

Comment 8 Rohan CJ 2021-03-26 08:51:38 UTC
We confirmed that the build with the fix is not working.

Comment 10 Mudit Agarwal 2021-03-26 17:05:10 UTC
Moving this out of 4.6.4, as we can't delay 4.6.4 for this fix.

Comment 11 Rohan CJ 2021-03-29 07:32:57 UTC
Looks like the build we tested with didn't have the patch:

https://storage-jenkins-csb-ceph.cloud.paas.psi.redhat.com/job/ocs-ci/311/ -> https://storage-jenkins-csb-ceph.cloud.paas.psi.redhat.com/job/OCS%20Build%20Pipeline%204.6/174/artifact/ocs_operator_tag.txt -> ocs-operator tag 4.6-83.d9600491.release_4.6


When we tested with the patched version, the timeout was set to 15 minutes correctly.

We made a mistake earlier when verifying whether the patch was in the build.
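
The chain above ends in an ocs-operator tag, 4.6-83.d9600491.release_4.6. A minimal sketch of how reading such a tag could be scripted is shown below. It assumes the tag has the form <version>-<build>.<short commit hash>.<branch>, which is inferred from this single example and not confirmed anywhere in the thread; the helper names are hypothetical.

```python
import re

# Assumed tag layout, inferred from one example: <version>-<build>.<commit>.<branch>
TAG_RE = re.compile(
    r"^(?P<version>\d+\.\d+)-(?P<build>\d+)\.(?P<commit>[0-9a-f]+)\.(?P<branch>.+)$"
)


def parse_operator_tag(tag):
    """Split an ocs-operator tag into its assumed components."""
    match = TAG_RE.match(tag.strip())
    if match is None:
        raise ValueError(f"unrecognized tag format: {tag!r}")
    return match.groupdict()


def tag_points_at(tag, expected_commit):
    """Return True if the tag's short hash is a prefix match for expected_commit."""
    commit = parse_operator_tag(tag)["commit"]
    expected = expected_commit.lower()
    # Compare only the characters both hashes have in common.
    length = min(len(commit), len(expected))
    return length > 0 and commit[:length] == expected[:length]
```

Matching the hash only shows which commit the build was cut from. Whether the patch is an ancestor of that commit still has to be checked in the repository.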

Comment 12 Rohan CJ 2021-03-29 07:36:59 UTC
@muagarwa can we move this back to 4.6.4?

Comment 13 Mudit Agarwal 2021-03-29 10:48:01 UTC
Providing the dev_ack; let's wait for QA.

Comment 14 Rohan CJ 2021-03-29 13:20:31 UTC
I see the patch in the latest build: https://storage-jenkins-csb-ceph.cloud.paas.psi.redhat.com/job/ocs-ci/322/

Comment 15 Petr Balogh 2021-03-30 12:07:32 UTC
Deployed a new cluster with the RC2 build of 4.6.4; here is the kubeconfig I provided to Akash so he can take a look at the cluster:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/pbaloghibmcloud/pbaloghibmcloud_20210330T101320/openshift-cluster-dir/auth/kubeconfig

Deployed here:
https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/1674/

Comment 18 Shirisha S Rao 2021-04-01 03:31:11 UTC
I have verified the mon timeout on the cluster provided by @pbalogh and it was set to 15 minutes.
The OCS version on the cluster is: 4.7.0-330.ci
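
A check like this can be scripted. The sketch below assumes the timeout surfaces in the CephCluster resource under spec.healthCheck.daemonHealth.mon.timeout as a duration string such as "15m". The thread does not say how the value was inspected, so the field path and the sample dictionary are assumptions, and the parser handles only a simple subset of duration strings (whole hours, minutes, and seconds).

```python
import re
from datetime import timedelta

_UNITS = {"h": "hours", "m": "minutes", "s": "seconds"}


def parse_duration(text):
    """Parse a simple duration such as '15m', '900s' or '1h30m' into a timedelta."""
    parts = re.findall(r"(\d+)([hms])", text)
    # Reject input with anything left over besides <number><unit> pairs.
    if not parts or "".join(num + unit for num, unit in parts) != text:
        raise ValueError(f"unsupported duration: {text!r}")
    total = timedelta()
    for num, unit in parts:
        total += timedelta(**{_UNITS[unit]: int(num)})
    return total


def mon_timeout(ceph_cluster):
    """Read the mon health-check timeout from a CephCluster-like dict (assumed path)."""
    raw = ceph_cluster["spec"]["healthCheck"]["daemonHealth"]["mon"]["timeout"]
    return parse_duration(raw)


# Hypothetical sample; on a real cluster this dict would come from the resource itself.
sample = {"spec": {"healthCheck": {"daemonHealth": {"mon": {"timeout": "15m"}}}}}
is_fifteen_minutes = mon_timeout(sample) == timedelta(minutes=15)
```

Comparing parsed durations instead of raw strings means "15m" and "900s" are treated as the same setting.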

Comment 19 Petr Balogh 2021-04-01 11:14:28 UTC
Hey Shirisha,

Yesterday at about 3-4 pm Brno time I upgraded the cluster. Akash had confirmed that you were done testing on it, so I used it for upgrade testing.

So when you worked on the cluster yesterday, it was:
v4.6.4-323.ci

So I will mark it as verified.

Thanks

Comment 23 errata-xmlrpc 2021-04-08 10:29:00 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenShift Container Storage 4.6.4 container bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:1134

