Bug 1977379
| Summary: | After upgrade to OCS 4.8 we see: Ceph cluster health is not OK. Health: HEALTH_ERR no active mgr - MGR pod is in CrashLoopBackOff | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Container Storage | Reporter: | Petr Balogh <pbalogh> |
| Component: | rook | Assignee: | Travis Nielsen <tnielsen> |
| Status: | VERIFIED --- | QA Contact: | Petr Balogh <pbalogh> |
| Severity: | urgent | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.8 | CC: | muagarwa, nberry |
| Target Milestone: | --- | Keywords: | Automation, Regression, UpgradeBlocker |
| Target Release: | OCS 4.8.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | 4.8.0-443.ci | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Petr Balogh
2021-06-29 14:53:30 UTC
The mgr log [1] shows that a socket could not be created for the prometheus endpoint. Since the liveness probe targets the prometheus port (9283), the probe fails and the mgr is killed.
2021-06-29T02:10:18.479499730Z debug 2021-06-29 02:10:18.478 7f5679c72700 0 mgr[prometheus] [29/Jun/2021:02:10:18] ENGINE Error in HTTP server: shutting down
2021-06-29T02:10:18.479499730Z Traceback (most recent call last):
2021-06-29T02:10:18.479499730Z File "/usr/lib/python3.6/site-packages/cherrypy/process/servers.py", line 217, in _start_http_thread
2021-06-29T02:10:18.479499730Z self.httpserver.start()
2021-06-29T02:10:18.479499730Z File "/usr/lib/python3.6/site-packages/cherrypy/wsgiserver/__init__.py", line 2008, in start
2021-06-29T02:10:18.479499730Z raise socket.error(msg)
2021-06-29T02:10:18.479499730Z OSError: No socket could be created -- (('10.128.2.18', 9283): [Errno 99] Cannot assign requested address)
This seems like a one-time environment error, but let's see if there is a repro.
[1] http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j036ai3c33-ua/j036ai3c33-ua_20210628T181832/logs/failed_testcase_ocs_logs_1624911902/test_upgrade_ocs_logs/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-318f9fb2c5a4aa470e5f221c4a7a76045b6778b4c8895db71bd9225024169371/namespaces/openshift-storage/pods/rook-ceph-mgr-a-9c5fff665-wsj6p/mgr/mgr/logs/current.log
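For context, the traceback above is the generic failure a process hits when it tries to bind a listening socket to an IP address that is not assigned to any local interface. A minimal Python sketch (using a TEST-NET placeholder address instead of the cluster's stale pod IP) reproduces the same "[Errno 99] Cannot assign requested address":

```python
import errno
import socket

# Illustration of the failure in the mgr log above: binding a listening socket
# to an IP that is not assigned to any local interface fails with
# EADDRNOTAVAIL (errno 99 on Linux), i.e. "Cannot assign requested address".
# 192.0.2.1 is a TEST-NET address standing in for the stale mgr pod IP.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    sock.bind(("192.0.2.1", 9283))
except OSError as exc:
    print(f"bind failed with errno {exc.errno} "
          f"(EADDRNOTAVAIL={errno.EADDRNOTAVAIL}): {exc}")
finally:
    sock.close()
```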
This issue got reproduced on the job I triggered here: https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/1264/testReport/tests.ecosystem.upgrade/test_upgrade/test_upgrade/

New must-gather from the above job: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j038ai3c33-ua/j038ai3c33-ua_20210629T174937/logs/failed_testcase_ocs_logs_1624996704/test_upgrade_ocs_logs/

I realized that both jobs I triggered were for verification of the hugepages bug: https://bugzilla.redhat.com/show_bug.cgi?id=1946792 so FYI, in both cases the environment has hugepages enabled. I am trying to reproduce once more with the job paused before teardown so I can provide access to the cluster: https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/1270/console I am also trying to reproduce without hugepages enabled here: https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-trigger-aws-ipi-3az-rhcos-3m-3w-upgrade-ocs-auto/40/

Thanks for the repro; now I see what is happening. This is certainly related to changes we made between 4.7 and 4.8 and only affects upgrades from 4.7 to 4.8. In 4.7 (and previous releases), the metrics service was bound to the pod IP, which was updated on every start by an init container. In 4.8, the metrics service binds to the default address, which listens on all interfaces, so there is no need to bind to a specific IP anymore. When upgrading from 4.7 to 4.8, the setting that binds to the old pod IP still persists in the Ceph settings. Since the mgr pod IP changes during the upgrade when the mgr pod is updated and restarted, the bind fails because the mgr pod has been assigned a new pod IP. Therefore, Rook needs to ensure the binding setting is unset so that the mgr pod can start with the default binding.

This issue is not seen upstream because Rook v1.5.7 first shipped a fix that resets the setting to bind to 0.0.0.0: https://github.com/rook/rook/pull/7151 Then in upstream Rook v1.6 we removed the init container with https://github.com/rook/rook/pull/7152 As long as upstream users upgraded from v1.5.7 or newer to v1.6, they would not hit this issue (and we have had no reports of it). The v1.5 fix was never backported downstream to OCS 4.7, therefore we need a fix in 4.8 to handle this case.

@Petr Is this perhaps the first time upgrade tests have been run from 4.7 to 4.8? Otherwise, I don't see why this wouldn't have been seen sooner. This certainly should be a blocker for upgrades from 4.7.

No, it's not the first time we ran the upgrade; IIRC it used to pass some time back (I will check some other upgrade jobs we had). Is it possible that this change came downstream recently? The only other issue was with the external mode upgrade, where we have the other bug. Another reproduction was here: https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/1272/console and here: https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/1273/testReport/tests.ecosystem.upgrade/test_upgrade/test_upgrade/ So it is not related to hugepages as I thought it could be.

It actually won't repro every time; it will only repro if the pod IP of the mgr pod changes. So the previous upgrade tests must have had mgr pods that retained their pod IPs during the upgrade.
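As a rough illustration of the mitigation described above (not the literal operator fix from the PRs), the stale bind address can be cleared from the Ceph config store, for example from the rook-ceph toolbox pod, so the prometheus module falls back to its default listen address. Whether OCS 4.7 stored the key globally for "mgr" or per daemon (e.g. "mgr.a") is an assumption here:

```python
import subprocess

# Sketch only: remove the stale prometheus bind address from the Ceph config
# store so the mgr module falls back to its default listen address after the
# pod IP changes. Whether 4.7 stored the setting globally ("mgr") or per
# daemon (e.g. "mgr.a") is an assumption, so both scopes are attempted.
for who in ("mgr", "mgr.a"):
    subprocess.run(
        ["ceph", "config", "rm", who, "mgr/prometheus/server_addr"],
        check=False,  # ignore the scope that was never set
    )
```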
Based on the latest executions, https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-trigger-aws-ipi-3az-rhcos-3m-3w-upgrade-ocs-auto/47/testReport/ and https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-trigger-vsphere-upi-1az-rhcos-vsan-3m-3w-upgrade-ocs-auto/73/testReport/ plus some others such as https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-trigger-production-upgrade-ci-pipeline-4.8/2/ where I did not see this issue, I am marking this as verified.