Bug 1977379

Summary: After upgrade to OCS 4.8 we see: Ceph cluster health is not OK. Health: HEALTH_ERR no active mgr - MGR pod is in CrashLoopBackOff
Product: [Red Hat Storage] Red Hat OpenShift Container Storage
Reporter: Petr Balogh <pbalogh>
Component: rook
Assignee: Travis Nielsen <tnielsen>
Status: VERIFIED
QA Contact: Petr Balogh <pbalogh>
Severity: urgent
Priority: unspecified
Version: 4.8
CC: muagarwa, nberry
Target Milestone: ---
Keywords: Automation, Regression, UpgradeBlocker
Target Release: OCS 4.8.0
Hardware: Unspecified
OS: Unspecified
Fixed In Version: 4.8.0-443.ci
Doc Type: No Doc Update
Type: Bug

Description Petr Balogh 2021-06-29 14:53:30 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
After upgrade to OCS 4.8 we see:
The CSV seems to be in Succeeded state:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j036ai3c33-ua/j036ai3c33-ua_20210628T181832/logs/failed_testcase_ocs_logs_1624911902/test_upgrade_ocs_logs/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-318f9fb2c5a4aa470e5f221c4a7a76045b6778b4c8895db71bd9225024169371/namespaces/openshift-storage/oc_output/csv

NAME                         DISPLAY                       VERSION        REPLACES                     PHASE
ocs-operator.v4.8.0-431.ci   OpenShift Container Storage   4.8.0-431.ci   ocs-operator.v4.7.2-429.ci   Succeeded



http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j036ai3c33-ua/j036ai3c33-ua_20210628T181832/logs/failed_testcase_ocs_logs_1624911902/test_upgrade_ocs_logs/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-318f9fb2c5a4aa470e5f221c4a7a76045b6778b4c8895db71bd9225024169371/namespaces/openshift-storage/oc_output/pods_-owide

rook-ceph-mgr-a-9c5fff665-wsj6p                                   1/2     CrashLoopBackOff   53         3h15m   10.128.2.161   ip-10-0-154-253.us-east-2.compute.internal   <none>           <none>

From events I see:
Events:
  Type     Reason      Age                    From     Message
  ----     ------      ----                   ----     -------
  Warning  ProbeError  10m (x151 over 3h14m)  kubelet  Liveness probe error: Get "http://10.128.2.161:9283/": dial tcp 10.128.2.161:9283: connect: connection refused
body:
  Warning  BackOff  48s (x584 over 3h6m)  kubelet  Back-off restarting failed container



Version of all relevant components (if applicable):
OCS: 4.8.0-431.ci
OCP: 4.8.0-0.nightly-2021-06-25-182927


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
Ceph reports HEALTH_ERR (no active mgr).


Is there any workaround available to the best of your knowledge?
NO

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
1

Is this issue reproducible?
Haven't reproduced yet.

Can this issue be reproduced from the UI?
Haven't tried.

If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Install OCS 4.7.1
2. Upgrade to the mentioned 4.8 build (see the sketch below)
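
A minimal sketch of one way to drive step 2 through the OLM subscription channel (the subscription name "ocs-operator" and the "stable-4.8" channel are assumptions; the QE Jenkins job may perform the upgrade differently):

$ oc -n openshift-storage get subscription
# Assuming the subscription is named ocs-operator, switch it to the 4.8 channel:
$ oc -n openshift-storage patch subscription ocs-operator --type merge \
    -p '{"spec":{"channel":"stable-4.8"}}'
# Watch the CSV phase until it reports Succeeded:
$ oc -n openshift-storage get csv -w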



Actual results:
The MGR pod is in CrashLoopBackOff error state.

Expected results:
The MGR pod should be running.


Additional info:
Job link:
https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/1247/
Must Gather:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j036ai3c33-ua/j036ai3c33-ua_20210628T181832/logs/failed_testcase_ocs_logs_1624911902/test_upgrade_ocs_logs/


Trying to reproduce here:
https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-trigger-aws-ipi-3az-rhcos-3m-3w-upgrade-ocs-auto/38/

Comment 3 Travis Nielsen 2021-06-29 20:04:25 UTC
The mgr log [1] shows that a socket could not be created for the prometheus endpoint. Since the liveness probe uses the prometheus port (9283), the mgr is killed after the liveness probe fails. 

2021-06-29T02:10:18.479499730Z debug 2021-06-29 02:10:18.478 7f5679c72700  0 mgr[prometheus] [29/Jun/2021:02:10:18] ENGINE Error in HTTP server: shutting down
2021-06-29T02:10:18.479499730Z Traceback (most recent call last):
2021-06-29T02:10:18.479499730Z   File "/usr/lib/python3.6/site-packages/cherrypy/process/servers.py", line 217, in _start_http_thread
2021-06-29T02:10:18.479499730Z     self.httpserver.start()
2021-06-29T02:10:18.479499730Z   File "/usr/lib/python3.6/site-packages/cherrypy/wsgiserver/__init__.py", line 2008, in start
2021-06-29T02:10:18.479499730Z     raise socket.error(msg)
2021-06-29T02:10:18.479499730Z OSError: No socket could be created -- (('10.128.2.18', 9283): [Errno 99] Cannot assign requested address)

This seems like a one-time environment error, but let's see if there is a repro.
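
For anyone triaging a similar failure, a minimal diagnostic sketch (the pod name is taken from the must-gather above; it assumes the mgr container is named "mgr", as the must-gather path suggests):

# Confirm the liveness probe targets the prometheus port 9283 on the mgr container:
$ oc -n openshift-storage get pod rook-ceph-mgr-a-9c5fff665-wsj6p \
    -o jsonpath='{.spec.containers[?(@.name=="mgr")].livenessProbe}'
# Review the probe failures and back-off events for the pod:
$ oc -n openshift-storage describe pod rook-ceph-mgr-a-9c5fff665-wsj6p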


[1] http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j036ai3c33-ua/j036ai3c33-ua_20210628T181832/logs/failed_testcase_ocs_logs_1624911902/test_upgrade_ocs_logs/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-318f9fb2c5a4aa470e5f221c4a7a76045b6778b4c8895db71bd9225024169371/namespaces/openshift-storage/pods/rook-ceph-mgr-a-9c5fff665-wsj6p/mgr/mgr/logs/current.log

Comment 4 Petr Balogh 2021-06-30 11:00:28 UTC
This issue got reproduced on the job I triggered here:
https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/1264/testReport/tests.ecosystem.upgrade/test_upgrade/test_upgrade/

New must gather from above job:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j038ai3c33-ua/j038ai3c33-ua_20210629T174937/logs/failed_testcase_ocs_logs_1624996704/test_upgrade_ocs_logs/

Realized that both jobs I triggered were for verification of the hugepages bug:
https://bugzilla.redhat.com/show_bug.cgi?id=1946792

So, FYI, the environment has hugepages enabled in both cases.


Trying to reproduce once more, pausing the job before teardown so that I will be able to provide access to the cluster:
https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/1270/console

Also trying to reproduce without hugepages enabled here:
https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-trigger-aws-ipi-3az-rhcos-3m-3w-upgrade-ocs-auto/40/

Comment 5 Travis Nielsen 2021-06-30 21:20:47 UTC
Thanks for the repro; now I see what is happening. This is certainly related to changes we made between 4.7 and 4.8 that only affect upgrades from 4.7 to 4.8.

In 4.7 (and previous releases), the metrics endpoint was bound to the pod IP, which was updated on every start by an init container.
In 4.8, the metrics endpoint binds to the default address, which listens on all interfaces, so there is no need to bind to a specific pod IP anymore.
When upgrading from 4.7 to 4.8, the setting binding to the old pod IP still persists in the Ceph config. Since the mgr pod is assigned a new pod IP when it is updated and restarted during the upgrade, the bind to the stale IP fails.

Therefore, Rook needs to ensure the binding setting is unset so that the mgr pod can start with the default binding.
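
A hedged sketch of a manual workaround along those lines, run from the rook-ceph toolbox (it assumes the stale setting is the prometheus module's server_addr key; the actual Rook fix may reset more than this one option):

# Show the address the prometheus module is currently configured to bind to:
$ ceph config get mgr mgr/prometheus/server_addr
# Clear the stale per-pod IP so the module falls back to the default binding,
# or set it to 0.0.0.0 as the upstream v1.5.7 fix did:
$ ceph config rm mgr mgr/prometheus/server_addr
# Restart the mgr pod so it picks up the change (label is the usual Rook mgr label):
$ oc -n openshift-storage delete pod -l app=rook-ceph-mgr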

This issue is not seen upstream because in Rook v1.5.7 we first made this fix, which resets the setting to bind to 0.0.0.0:
https://github.com/rook/rook/pull/7151

Then in upstream Rook v1.6 we made the change to remove the init container with https://github.com/rook/rook/pull/7152

As long as upstream users upgraded from 1.5.7 or newer to v1.6, they would not hit this issue (and we have had no reports of this issue).

The v1.5 fix was never backported downstream to OCS 4.7; therefore, we need a fix in 4.8 to handle this case.

@Petr Is this perhaps the first time upgrade tests have been run from 4.7 to 4.8? Otherwise, I don't see why this wouldn't have been seen sooner. This certainly should be a blocker for upgrades from 4.7.

Comment 8 Petr Balogh 2021-07-01 12:03:00 UTC
No, it's not the first time we have run the upgrade. I think it used to pass some time back, IIRC (I will check some other upgrade jobs we had). Is it possible that the change came downstream recently?


The only issue was with the external mode upgrade, where we have the other bug.

Another reproduction was here:
https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/1272/console
and here:
https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/1273/testReport/tests.ecosystem.upgrade/test_upgrade/test_upgrade/

So it's not related to hugepages, as I thought it could be.

Comment 9 Travis Nielsen 2021-07-01 13:35:04 UTC
It actually won't repro every time; it will only repro if the pod IP of the mgr pod changes. So the previous upgrade tests must have had mgr pods that retained their pod IPs during the upgrade.
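
A quick hedged check for that condition on a live cluster: compare the mgr pod's current IP with the address the prometheus module is still configured to bind to (the label and config key are assumptions based on the analysis above):

$ oc -n openshift-storage get pod -l app=rook-ceph-mgr -o wide
$ ceph config get mgr mgr/prometheus/server_addr   # run from the rook-ceph toolbox
# If the configured address is a stale pod IP that no longer matches the pod, the bind fails as above.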