Bug 1978578 - After upgrade to 4.6.6 the toolbox pod fails to run ceph commands with error: [errno 13] error connecting to the cluster
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Container Storage
Classification: Red Hat Storage
Component: ocs-operator
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: OCS 4.6.6
Assignee: Rakshith
QA Contact: Petr Balogh
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2021-07-02 08:38 UTC by Petr Balogh
Modified: 2021-07-07 18:53 UTC

Fixed In Version: v4.6.6-442.ci
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-07-07 18:53:00 UTC
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github openshift ocs-operator pull 1254 0 None open Bug 1978578: [release-4.6] fixed cephTools update when spec is changed 2021-07-02 13:14:09 UTC
Red Hat Product Errata RHBA-2021:2669 0 None None None 2021-07-07 18:53:06 UTC

Description Petr Balogh 2021-07-02 08:38:24 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
After the upgrade to 4.6.6, the ceph toolbox pod can no longer communicate with the cluster.

In this job:
https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/1282/consoleFull

Before the upgrade, in the test on OCS 4.5, the ceph command works:
19:39:14 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage exec rook-ceph-tools-f5494bcdc-nd7qg -- ceph health
19:39:25 - MainThread - ocs_ci.utility.utils - INFO - Ceph cluster health is HEALTH_OK.

But right after upgrade:
21:02:29 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage exec rook-ceph-tools-f5494bcdc-nd7qg -- ceph health
21:02:40 - MainThread - ocs_ci.utility.utils - WARNING - Command stderr: [errno 13] error connecting to the cluster
command terminated with exit code 13

All subsequent ceph commands fail as well.
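
For triage, the numeric code in the stderr line quoted above can be mapped to its symbolic name. The sketch below is only an illustration: the helper name describe_ceph_errno is hypothetical and not part of ocs_ci. On Linux, errno 13 is EACCES (permission denied), which would be consistent with the client being rejected rather than the cluster being unreachable, although this report does not state the root cause.

```python
import errno
import re

# Matches the "[errno N]" prefix seen in the stderr line quoted above.
_ERRNO_RE = re.compile(r"\[errno (\d+)\]")


def describe_ceph_errno(stderr):
    """Return (number, symbolic name) for the first "[errno N]" in stderr.

    Returns None when no such prefix is present. This helper is
    hypothetical and only illustrates how to read the error code.
    """
    match = _ERRNO_RE.search(stderr)
    if match is None:
        return None
    number = int(match.group(1))
    # errno.errorcode maps numeric codes to names such as "EACCES".
    return number, errno.errorcode.get(number, "UNKNOWN")


if __name__ == "__main__":
    print(describe_ceph_errno("[errno 13] error connecting to the cluster"))
```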



Version of all relevant components (if applicable):
OCS: 4.6.6-426.ci
OCP: 4.6 nightly build


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?


Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?
Yes, it is consistently reproducible; all our upgrade jobs failed with this.


Can this issue reproduce from the UI?
Haven't tried


If this is a regression, please provide more details to justify this:
Yes, this worked well before


Steps to Reproduce:
1. Install OCS 4.5
2. Upgrade to the 4.6.6 build
3. Run a ceph command in the ceph toolbox pod; the issue is hit at this point
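
Step 3 can be scripted roughly as follows. This is a minimal sketch, not the ocs_ci implementation: the function names are hypothetical, the namespace and command are taken from the log lines in this report, and the pod name has to be supplied by the caller. The run parameter exists only so the check can be exercised without a cluster.

```python
import subprocess


def ceph_health_cmd(pod, namespace="openshift-storage"):
    """Build the same `oc exec ... -- ceph health` command shown in the logs."""
    return ["oc", "-n", namespace, "exec", pod, "--", "ceph", "health"]


def check_ceph_health(pod, run=subprocess.run):
    """Run `ceph health` in the toolbox pod.

    Returns (exit code, message), where the message is stdout on
    success and stderr when stdout is empty.
    """
    result = run(ceph_health_cmd(pod), capture_output=True, text=True)
    return result.returncode, (result.stdout or result.stderr).strip()


if __name__ == "__main__":
    # Pod name as it appears in this report; it differs per cluster.
    code, message = check_ceph_health("rook-ceph-tools-f5494bcdc-nd7qg")
    print(code, message)
```

On an affected cluster this would be expected to report exit code 13 together with the error message from the description.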


Actual results:
Cannot run ceph commands on toolbox pod


Expected results:
Be able to run the commands


Additional info:
Job:
https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/1282/consoleFull

kubeconfig: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j043ai3c33-ua/j043ai3c33-ua_20210701T150543/openshift-cluster-dir/auth/kubeconfig

Must gather:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j043ai3c33-ua/j043ai3c33-ua_20210701T150543/logs/failed_testcase_ocs_logs_1625159105/test_upgrade_ocs_logs/


Another reproduction:

https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/1263/testReport/tests.ecosystem.upgrade/test_upgrade/test_upgrade/
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j066vu1cs33-ua/j066vu1cs33-ua_20210629T160701/logs/failed_testcase_ocs_logs_1624988496/test_upgrade_ocs_logs/

Next:
https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/1255/testReport/tests.ecosystem.upgrade/test_upgrade/test_upgrade/

Next on vsphere:
https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/1253/testReport/tests.ecosystem.upgrade/test_upgrade/test_upgrade/

Comment 3 Santosh Pillai 2021-07-02 09:40:37 UTC
I was looking into the first cluster. The toolbox image was
- image: registry.redhat.io/ocs4/rook-ceph-rhel8-operator@sha256:82db76aa07847d7fb5993af515cfaaffb803876b8a7869f85a27260a9edf7fb8

Something triggered a restart of the OCS operator (it wasn't me). I think the tests are still running on the cluster. The image now shows up correctly as
- image: quay.io/rhceph-dev/rook-ceph@sha256:6626db27489220f89fadef52203687860474411c604e8e27f8262fc5973879e8

Also, the ceph commands are working now in the toolbox.
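
The image comparison above can be checked mechanically. The sketch below assumes the toolbox deployment is named rook-ceph-tools (inferred from the pod name in the description, not confirmed here) and that its JSON was obtained with something like `oc -n openshift-storage get deployment rook-ceph-tools -o json`. The function names are hypothetical.

```python
import json


def container_images(deployment_json):
    """Return the container images listed in a Deployment's pod template."""
    deployment = json.loads(deployment_json)
    containers = deployment["spec"]["template"]["spec"]["containers"]
    return [container["image"] for container in containers]


def toolbox_image_is_stale(deployment_json, expected_image):
    """True when no container in the deployment runs the expected image."""
    return expected_image not in container_images(deployment_json)
```

Passing the quay.io/rhceph-dev/rook-ceph image reference from this comment as expected_image would flag a toolbox deployment that still carries the older rook-ceph-rhel8-operator image.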

Comment 4 Mudit Agarwal 2021-07-02 10:09:16 UTC
We might close this issue after concluding the discussion: https://chat.google.com/room/AAAAREGEba8/4JE6K4E4_94

In any case, it looks like this is specific to 4.6.z.

Comment 15 errata-xmlrpc 2021-07-07 18:53:00 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenShift Container Storage 4.6.6 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:2669

