Description of problem (please be as detailed as possible and provide log snippets):

After upgrading to 4.6.6, the ceph toolbox pod can no longer communicate with the cluster.

In this job: https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/1282/consoleFull

Before the upgrade, on OCS 4.5, the ceph command works:

19:39:14 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage exec rook-ceph-tools-f5494bcdc-nd7qg -- ceph health
19:39:25 - MainThread - ocs_ci.utility.utils - INFO - Ceph cluster health is HEALTH_OK.

But right after the upgrade:

21:02:29 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage exec rook-ceph-tools-f5494bcdc-nd7qg -- ceph health
21:02:40 - MainThread - ocs_ci.utility.utils - WARNING - Command stderr: [errno 13] error connecting to the cluster
command terminated with exit code 13

All subsequent ceph commands fail as well.

Version of all relevant components (if applicable):
OCS: 4.6.6-426.ci
OCP: 4.6 nightly build

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

Is this issue reproducible?
Yes, it is constantly reproducible; all our upgrade jobs failed with this.

Can this issue be reproduced from the UI?
Haven't tried.

If this is a regression, please provide more details to justify this:
Yes, this worked well before.

Steps to Reproduce:
1. Install OCS 4.5
2. Upgrade to a 4.6.6 build
3. Run a ceph command on the ceph toolbox pod

Actual results:
Cannot run ceph commands on the toolbox pod.

Expected results:
Ceph commands on the toolbox pod run successfully.

Additional info:

Job: https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/1282/consoleFull
kubeconfig: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j043ai3c33-ua/j043ai3c33-ua_20210701T150543/openshift-cluster-dir/auth/kubeconfig
Must gather: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j043ai3c33-ua/j043ai3c33-ua_20210701T150543/logs/failed_testcase_ocs_logs_1625159105/test_upgrade_ocs_logs/

Another reproduction:
https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/1263/testReport/tests.ecosystem.upgrade/test_upgrade/test_upgrade/
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j066vu1cs33-ua/j066vu1cs33-ua_20210629T160701/logs/failed_testcase_ocs_logs_1624988496/test_upgrade_ocs_logs/

Next:
https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/1255/testReport/tests.ecosystem.upgrade/test_upgrade/test_upgrade/

Next, on vSphere:
https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/1253/testReport/tests.ecosystem.upgrade/test_upgrade/test_upgrade/
I was looking into the first cluster. The toolbox image was:

image: registry.redhat.io/ocs4/rook-ceph-rhel8-operator@sha256:82db76aa07847d7fb5993af515cfaaffb803876b8a7869f85a27260a9edf7fb8

Something triggered a restart of the OCS operator (it wasn't me); I think the tests are still running on the cluster. Now the image shows up correctly as:

image: quay.io/rhceph-dev/rook-ceph@sha256:6626db27489220f89fadef52203687860474411c604e8e27f8262fc5973879e8

The ceph commands are also working now in the toolbox.
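The image check and the failing command above can be sketched as a small script. This is a minimal sketch, assuming the standard Rook toolbox label (app=rook-ceph-tools) and the openshift-storage namespace; the pod name seen in the logs is cluster-specific, so it is looked up instead of hardcoded:

```shell
# Sketch only: assumes the standard Rook toolbox label and namespace.
check_toolbox() {
  if ! command -v oc >/dev/null 2>&1; then
    echo "oc not found; run this against the affected cluster"
    return 0
  fi
  ns=openshift-storage
  pod=$(oc -n "$ns" get pod -l app=rook-ceph-tools \
        -o jsonpath='{.items[0].metadata.name}')
  # Show which image the toolbox container is actually running ...
  oc -n "$ns" get pod "$pod" -o jsonpath='{.spec.containers[0].image}'
  echo
  # ... then try a ceph command; exit code 13 matches the
  # "[errno 13] error connecting to the cluster" failure above.
  oc -n "$ns" exec "$pod" -- ceph health
}
check_toolbox
```

On the broken cluster this prints the wrong (operator) image digest and the ceph command exits with code 13; after the operator restart it prints the rook-ceph image and HEALTH_OK.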
We might close this issue after concluding the discussion: https://chat.google.com/room/AAAAREGEba8/4JE6K4E4_94

In any case, this looks like it is specific to 4.6.z.
Verified here: https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/1303/testReport/
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenShift Container Storage 4.6.6 bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:2669