Description of problem (please be as detailed as possible and provide log snippets):

After upgrade to OCS 4.8 we see:

The CSV seems to be in Succeeded state:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j036ai3c33-ua/j036ai3c33-ua_20210628T181832/logs/failed_testcase_ocs_logs_1624911902/test_upgrade_ocs_logs/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-318f9fb2c5a4aa470e5f221c4a7a76045b6778b4c8895db71bd9225024169371/namespaces/openshift-storage/oc_output/csv

NAME                         DISPLAY                       VERSION        REPLACES                     PHASE
ocs-operator.v4.8.0-431.ci   OpenShift Container Storage   4.8.0-431.ci   ocs-operator.v4.7.2-429.ci   Succeeded

But the mgr pod is in CrashLoopBackOff:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j036ai3c33-ua/j036ai3c33-ua_20210628T181832/logs/failed_testcase_ocs_logs_1624911902/test_upgrade_ocs_logs/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-318f9fb2c5a4aa470e5f221c4a7a76045b6778b4c8895db71bd9225024169371/namespaces/openshift-storage/oc_output/pods_-owide

rook-ceph-mgr-a-9c5fff665-wsj6p   1/2   CrashLoopBackOff   53   3h15m   10.128.2.161   ip-10-0-154-253.us-east-2.compute.internal   <none>   <none>

From the events I see:

Events:
  Type     Reason      Age                    From     Message
  ----     ------      ----                   ----     -------
  Warning  ProbeError  10m (x151 over 3h14m)  kubelet  Liveness probe error: Get "http://10.128.2.161:9283/": dial tcp 10.128.2.161:9283: connect: connection refused body:
  Warning  BackOff     48s (x584 over 3h6m)   kubelet  Back-off restarting failed container

Version of all relevant components (if applicable):
OCS: 4.8.0-431.ci
OCP: 4.8.0-0.nightly-2021-06-25-182927

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
Ceph reports HEALTH_ERR.

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
1

Is this issue reproducible?
Haven't reproduced it yet.

Can this issue be reproduced from the UI?
Haven't tried.

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Install OCS 4.7.1
2. Upgrade to the mentioned 4.8 build

Actual results:
The mgr pod is in CrashLoopBackOff.

Expected results:
The mgr pod is running.

Additional info:
Job link: https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/1247/
Must gather: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j036ai3c33-ua/j036ai3c33-ua_20210628T181832/logs/failed_testcase_ocs_logs_1624911902/test_upgrade_ocs_logs/
Trying to reproduce here: https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-trigger-aws-ipi-3az-rhcos-3m-3w-upgrade-ocs-auto/38/
The mgr log [1] shows that a socket could not be created for the prometheus endpoint. Since the liveness probe uses the prometheus port (9283), the mgr is killed after the liveness probe fails.

2021-06-29T02:10:18.479499730Z debug 2021-06-29 02:10:18.478 7f5679c72700  0 mgr[prometheus] [29/Jun/2021:02:10:18] ENGINE Error in HTTP server: shutting down
2021-06-29T02:10:18.479499730Z Traceback (most recent call last):
2021-06-29T02:10:18.479499730Z   File "/usr/lib/python3.6/site-packages/cherrypy/process/servers.py", line 217, in _start_http_thread
2021-06-29T02:10:18.479499730Z     self.httpserver.start()
2021-06-29T02:10:18.479499730Z   File "/usr/lib/python3.6/site-packages/cherrypy/wsgiserver/__init__.py", line 2008, in start
2021-06-29T02:10:18.479499730Z     raise socket.error(msg)
2021-06-29T02:10:18.479499730Z OSError: No socket could be created -- (('10.128.2.18', 9283): [Errno 99] Cannot assign requested address)

This seems like a one-time environment error, but let's see if there is a repro.

[1] http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j036ai3c33-ua/j036ai3c33-ua_20210628T181832/logs/failed_testcase_ocs_logs_1624911902/test_upgrade_ocs_logs/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-318f9fb2c5a4aa470e5f221c4a7a76045b6778b4c8895db71bd9225024169371/namespaces/openshift-storage/pods/rook-ceph-mgr-a-9c5fff665-wsj6p/mgr/mgr/logs/current.log
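For illustration only, the same OS-level failure can be reproduced outside the mgr: binding to an IP address that is not assigned to any local interface fails with EADDRNOTAVAIL (errno 99). A minimal sketch in Go, assuming 10.128.2.18 is no longer a local address on the host running it:

package main

import (
	"fmt"
	"net"
)

func main() {
	// The mgr prometheus module fails the same way when it tries to bind to a
	// pod IP that is no longer assigned to its pod.
	stalePodIP := "10.128.2.18" // old mgr pod IP from the traceback above
	ln, err := net.Listen("tcp", stalePodIP+":9283")
	if err != nil {
		// Prints something like: listen tcp 10.128.2.18:9283: bind: cannot assign requested address
		fmt.Println("bind failed:", err)
		return
	}
	ln.Close()
}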
This issue got reproduced on the job I triggered here: https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/1264/testReport/tests.ecosystem.upgrade/test_upgrade/test_upgrade/

New must gather from the above job: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j038ai3c33-ua/j038ai3c33-ua_20210629T174937/logs/failed_testcase_ocs_logs_1624996704/test_upgrade_ocs_logs/

I realized that both jobs I triggered were for verification of the hugepages bug https://bugzilla.redhat.com/show_bug.cgi?id=1946792, so FYI, in both cases the environment had hugepages enabled.

I am trying to reproduce once more with the job paused before teardown, so I will be able to provide access to the cluster: https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/1270/console

I am also trying to reproduce without hugepages enabled here: https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-trigger-aws-ipi-3az-rhcos-3m-3w-upgrade-ocs-auto/40/
Thanks for the repro, now I see what is happening. This is certainly related to changes we made between 4.7 and 4.8 that only affect upgrades from 4.7 to 4.8.

In 4.7 (and previous releases), the metrics service was bound to the pod IP, which was updated every time by an init container. In 4.8, the metrics service binds to the default address, which listens on all interfaces, so there is no need to bind to a specific IP anymore. When upgrading from 4.7 to 4.8, the setting that binds to the pod IP still persists in the ceph config. When the mgr pod is updated and restarted during the upgrade, it is assigned a new pod IP, so binding to the old IP fails. Therefore, Rook needs to ensure the binding setting is unset so that the mgr pod can start with the default binding.

This issue is not seen upstream because in Rook v1.5.7 we first made a fix that resets the setting to bind to 0.0.0.0: https://github.com/rook/rook/pull/7151. Then in upstream Rook v1.6 we removed the init container with https://github.com/rook/rook/pull/7152. As long as upstream users upgraded from v1.5.7 or newer to v1.6, they would not hit this issue (and we have had no reports of it). The v1.5 fix was never backported downstream to OCS 4.7, so we need a fix in 4.8 to handle this case.

@Petr Is this perhaps the first time upgrade tests have been run from 4.7 to 4.8? Otherwise, I don't see why this wouldn't have been seen sooner. This certainly should be a blocker for upgrades from 4.7.
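For illustration, here is a minimal sketch of the kind of cleanup described above. This is not Rook's actual implementation (see the PRs above for the real fix); it assumes the ceph CLI and admin credentials are reachable, e.g. from the toolbox pod:

package main

import (
	"fmt"
	"os/exec"
)

// Minimal sketch, not Rook's actual code: clear the per-pod-IP prometheus bind
// address that a 4.7 operator persisted in the ceph config store, so the mgr
// module falls back to its default bind address. The upstream v1.5.7 fix instead
// reset the value to 0.0.0.0, which has the same effect.
func resetPrometheusBindAddr() error {
	// "ceph config rm <who> <option>" removes an override from the mon config store.
	cmd := exec.Command("ceph", "config", "rm", "mgr", "mgr/prometheus/server_addr")
	out, err := cmd.CombinedOutput()
	if err != nil {
		return fmt.Errorf("resetting mgr/prometheus/server_addr: %v: %s", err, out)
	}
	return nil
}

func main() {
	if err := resetPrometheusBindAddr(); err != nil {
		fmt.Println(err)
	}
	// After this, a restarted mgr pod can bind regardless of which pod IP it receives.
}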
No, it's not the first time we ran the upgrade; IIRC it used to pass some time back (I will check some other upgrade jobs we had). Is it possible that this change came downstream only recently? The only other issue was with the external mode upgrade, where we have the other bug.

Another reproduction was here: https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/1272/console
and here: https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/1273/testReport/tests.ecosystem.upgrade/test_upgrade/test_upgrade/

So it's not related to hugepages as I thought it could be.
It actually won't repro every time; it only repros if the pod IP of the mgr pod changes. So the previous upgrade tests must have had mgr pods that retained their pod IPs during the upgrade.
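For anyone triaging a cluster in this state, one way to confirm this failure mode is to compare the persisted bind address with the current mgr pod IP. A hypothetical helper (not part of any product tooling), assuming the ceph CLI is available and the current mgr pod IP is passed in via a POD_IP environment variable:

package main

import (
	"fmt"
	"os"
	"os/exec"
	"strings"
)

func main() {
	// Read the bind address persisted in the ceph config store.
	out, err := exec.Command("ceph", "config", "get", "mgr", "mgr/prometheus/server_addr").Output()
	if err != nil {
		fmt.Println("could not read mgr/prometheus/server_addr:", err)
		return
	}
	persisted := strings.TrimSpace(string(out))
	podIP := os.Getenv("POD_IP") // current mgr pod IP, e.g. from `oc get pod -o wide`
	if persisted != "" && persisted != podIP {
		// This is the upgrade-breaking case: the module keeps trying to bind to the old IP.
		fmt.Printf("stale bind address %q != pod IP %q -- prometheus module cannot bind\n", persisted, podIP)
	} else {
		fmt.Println("bind address matches the pod IP (or is unset); this upgrade won't hit the bug")
	}
}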
Based on the latest executions:
https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-trigger-aws-ipi-3az-rhcos-3m-3w-upgrade-ocs-auto/47/testReport/
https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-trigger-vsphere-upi-1az-rhcos-vsan-3m-3w-upgrade-ocs-auto/73/testReport/

plus some others, e.g. https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-trigger-production-upgrade-ci-pipeline-4.8/2/, where I didn't see this issue, I am marking this as verified.