Bug 1978578

Summary: After upgrade to 4.6.6 the toolbox pod fails to run ceph commands with error: [errno 13] error connecting to the cluster
Product: [Red Hat Storage] Red Hat OpenShift Container Storage Reporter: Petr Balogh <pbalogh>
Component: ocs-operator Assignee: Rakshith <rar>
Status: CLOSED ERRATA QA Contact: Petr Balogh <pbalogh>
Severity: urgent Docs Contact:
Priority: unspecified    
Version: 4.6 CC: madam, muagarwa, ocs-bugs, rcyriac, sapillai, sostapov
Target Milestone: --- Keywords: Automation, Regression, UpgradeBlocker, ZStream
Target Release: OCS 4.6.6   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: v4.6.6-442.ci Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-07-07 18:53:00 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Petr Balogh 2021-07-02 08:38:24 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
After the upgrade to 4.6.6, the ceph toolbox pod can no longer communicate with the cluster.

In this job:
https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/1282/consoleFull

Before the upgrade, in the test on OCS 4.5, the ceph command works:
19:39:14 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage exec rook-ceph-tools-f5494bcdc-nd7qg -- ceph health
19:39:25 - MainThread - ocs_ci.utility.utils - INFO - Ceph cluster health is HEALTH_OK.

But right after upgrade:
21:02:29 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage exec rook-ceph-tools-f5494bcdc-nd7qg -- ceph health
21:02:40 - MainThread - ocs_ci.utility.utils - WARNING - Command stderr: [errno 13] error connecting to the cluster
command terminated with exit code 13

All subsequent ceph commands fail as well.
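
For triage, the "[errno 13]" prefix in the stderr line above can be turned into its symbolic name. The following is a minimal sketch, not part of ocs-ci: the helper name describe_ceph_error is hypothetical, and it only names the error code using Python's standard errno table. It does not diagnose why the connection was refused.

```python
import errno
import re

# Matches the "[errno N]" prefix seen in the stderr line quoted in this report.
_ERRNO_RE = re.compile(r"\[errno (\d+)\]")


def describe_ceph_error(stderr: str):
    """Return (number, symbolic name) for an "[errno N]" message, or None.

    Hypothetical helper for illustration; it is not an ocs-ci function.
    """
    match = _ERRNO_RE.search(stderr)
    if match is None:
        return None
    number = int(match.group(1))
    # errno.errorcode maps numbers to names on the platform running this script.
    return number, errno.errorcode.get(number, "UNKNOWN")


print(describe_ceph_error("[errno 13] error connecting to the cluster"))
```

On Linux, errno 13 is EACCES (permission denied), which matches the exit code 13 reported by oc exec.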



Version of all relevant components (if applicable):
OCS: 4.6.6-426.ci
OCP: 4.6 nightly build


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?


Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?
Yes, it is consistently reproducible; all our upgrade jobs failed with this.


Can this issue be reproduced from the UI?
Haven't tried


If this is a regression, please provide more details to justify this:
Yes, this worked well before


Steps to Reproduce:
1. Install OCS 4.5
2. Upgrade to the 4.6.6 build
3. Run a ceph command in the ceph toolbox pod; the command fails
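
Step 3 can be scripted roughly as follows. This is a sketch under assumptions, not the ocs-ci implementation: the function names, the injectable runner, and the pod name argument are hypothetical, while the oc command line mirrors the one shown in the job log above. The runner is injectable so the check can be exercised without a cluster.

```python
import subprocess
from typing import Callable, List, Tuple

# A runner takes a command line and returns (exit code, stdout, stderr).
Runner = Callable[[List[str]], Tuple[int, str, str]]


def oc_runner(cmd: List[str]) -> Tuple[int, str, str]:
    """Run the command for real; requires `oc` and access to a cluster."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return proc.returncode, proc.stdout, proc.stderr


def toolbox_ceph_health(pod: str, run: Runner = oc_runner,
                        namespace: str = "openshift-storage") -> str:
    """Run `ceph health` in the toolbox pod and return its output.

    Raises RuntimeError when the command exits non-zero, as it does
    after the upgrade described in this report.
    """
    cmd = ["oc", "-n", namespace, "exec", pod, "--", "ceph", "health"]
    code, out, err = run(cmd)
    if code != 0:
        raise RuntimeError(f"ceph health failed (exit {code}): {err.strip()}")
    return out.strip()


def stub_runner(cmd: List[str]) -> Tuple[int, str, str]:
    """Stand-in runner that replays the post-upgrade failure from the log."""
    return 13, "", "[errno 13] error connecting to the cluster\n"


try:
    toolbox_ceph_health("rook-ceph-tools-f5494bcdc-nd7qg", run=stub_runner)
except RuntimeError as exc:
    print(exc)
```

Calling toolbox_ceph_health with the default runner before and after the upgrade would reproduce the difference between the two log excerpts in the description.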


Actual results:
Cannot run ceph commands on toolbox pod


Expected results:
Be able to run the commands


Additional info:
Job:
https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/1282/consoleFull

kubeconfig: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j043ai3c33-ua/j043ai3c33-ua_20210701T150543/openshift-cluster-dir/auth/kubeconfig

Must gather:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j043ai3c33-ua/j043ai3c33-ua_20210701T150543/logs/failed_testcase_ocs_logs_1625159105/test_upgrade_ocs_logs/


Another reproduction:

https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/1263/testReport/tests.ecosystem.upgrade/test_upgrade/test_upgrade/
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j066vu1cs33-ua/j066vu1cs33-ua_20210629T160701/logs/failed_testcase_ocs_logs_1624988496/test_upgrade_ocs_logs/

Next:
https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/1255/testReport/tests.ecosystem.upgrade/test_upgrade/test_upgrade/

Next on vsphere:
https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/1253/testReport/tests.ecosystem.upgrade/test_upgrade/test_upgrade/

Comment 3 Santosh Pillai 2021-07-02 09:40:37 UTC
I was looking into the first cluster. The toolbox image was
- image: registry.redhat.io/ocs4/rook-ceph-rhel8-operator@sha256:82db76aa07847d7fb5993af515cfaaffb803876b8a7869f85a27260a9edf7fb8

Something triggered the restart of the OCS operator (wasn't me). I think the tests are still running on the cluster. Now the image is showing up correctly as 
- image: quay.io/rhceph-dev/rook-ceph@sha256:6626db27489220f89fadef52203687860474411c604e8e27f8262fc5973879e8

Also, the ceph commands are working now in the toolbox.
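
The image check described above could be automated along these lines. This is only a sketch: the helper names and the container name in the sample are hypothetical, the expected image has to be supplied by the caller, and the sample JSON is a hand-trimmed stand-in for what `oc -n openshift-storage get deployment <name> -o json` would return for the toolbox deployment.

```python
import json


def container_images(deployment: dict) -> list:
    """Return the image of every container in a Deployment object."""
    containers = deployment["spec"]["template"]["spec"]["containers"]
    return [container["image"] for container in containers]


def runs_expected_image(deployment: dict, expected_image: str) -> bool:
    """True if any container in the Deployment uses the expected image."""
    return expected_image in container_images(deployment)


# Hand-written sample using the stale image quoted in this comment.
# The container name "rook-ceph-tools" is an assumption for illustration.
stale = json.loads("""
{
  "spec": {"template": {"spec": {"containers": [
    {"name": "rook-ceph-tools",
     "image": "registry.redhat.io/ocs4/rook-ceph-rhel8-operator@sha256:82db76aa07847d7fb5993af515cfaaffb803876b8a7869f85a27260a9edf7fb8"}
  ]}}}
}
""")

expected = ("quay.io/rhceph-dev/rook-ceph@sha256:"
            "6626db27489220f89fadef52203687860474411c604e8e27f8262fc5973879e8")

print(runs_expected_image(stale, expected))
```

A False result here corresponds to the state seen before the operator restart, when the toolbox was still running the old image.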

Comment 4 Mudit Agarwal 2021-07-02 10:09:16 UTC
We might close this issue after concluding the discussion: https://chat.google.com/room/AAAAREGEba8/4JE6K4E4_94

In any case, it looks like this is specific to 4.6.z.

Comment 15 errata-xmlrpc 2021-07-07 18:53:00 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenShift Container Storage 4.6.6 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:2669