1978578 – After upgrade to 4.6.6 the toolbox pod fails to run ceph commands with error: [errno 13] error connecting to the cluster

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat OpenShift Container Storage
Classification:	Red Hat Storage
Component:	ocs-operator
Sub Component:
Version:	4.6
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	urgent
Target Milestone:	---
Target Release:	OCS 4.6.6
Assignee:	Rakshith
QA Contact:	Petr Balogh
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2021-07-02 08:38 UTC by Petr Balogh
Modified:	2021-07-07 18:53 UTC
CC List:	6 users
Fixed In Version:	v4.6.6-442.ci
Doc Type:	No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed:	2021-07-07 18:53:00 UTC
Embargoed:
Dependent Products:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift ocs-operator pull 1254	0	None	open	Bug 1978578: [release-4.6] fixed cephTools update when spec is changed	2021-07-02 13:14:09 UTC
Red Hat Product Errata	RHBA-2021:2669	0	None	None	None	2021-07-07 18:53:06 UTC

Description Petr Balogh 2021-07-02 08:38:24 UTC

Description of problem (please be as detailed as possible and provide log
snippets):
After upgrading to 4.6.6, the ceph toolbox pod can no longer communicate with the cluster.

In this job:
https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/1282/consoleFull

Before the upgrade, in the test on OCS 4.5, the ceph command works:
19:39:14 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage exec rook-ceph-tools-f5494bcdc-nd7qg -- ceph health
19:39:25 - MainThread - ocs_ci.utility.utils - INFO - Ceph cluster health is HEALTH_OK.

But right after the upgrade:
21:02:29 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage exec rook-ceph-tools-f5494bcdc-nd7qg -- ceph health
21:02:40 - MainThread - ocs_ci.utility.utils - WARNING - Command stderr: [errno 13] error connecting to the cluster
command terminated with exit code 13

All subsequent ceph commands fail as well.
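
The check from the log lines above can be scripted. The following is a minimal Python sketch, not the ocs-ci implementation: the function names `run_command` and `toolbox_ceph_health` and the injectable `runner` parameter are made up for illustration. Only the `oc -n openshift-storage exec <pod> -- ceph health` command line is taken from the log.

```python
import subprocess
from typing import Callable, Sequence, Tuple

# A runner takes a command line and returns (exit code, stdout, stderr).
Runner = Callable[[Sequence[str]], Tuple[int, str, str]]


def run_command(cmd: Sequence[str]) -> Tuple[int, str, str]:
    """Default runner (hypothetical helper): execute cmd and capture output."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return proc.returncode, proc.stdout, proc.stderr


def toolbox_ceph_health(
    pod: str,
    namespace: str = "openshift-storage",
    runner: Runner = run_command,
) -> Tuple[bool, str]:
    """Run `ceph health` inside the toolbox pod and return (ok, message)."""
    # Same command line as in the job log above.
    cmd = ["oc", "-n", namespace, "exec", pod, "--", "ceph", "health"]
    code, out, err = runner(cmd)
    if code == 0:
        return True, out.strip()
    # A non-zero exit (13 in this bug) is reported together with stderr.
    return False, f"exit code {code}: {err.strip()}"


# Example call against a live cluster (requires `oc` and a valid login):
#   ok, message = toolbox_ceph_health("rook-ceph-tools-f5494bcdc-nd7qg")
```

Passing the runner as a parameter keeps the check usable without a cluster, for example by substituting a function that returns the exit code and stderr seen in this report.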



Version of all relevant components (if applicable):
OCS: 4.6.6-426.ci
OCP: 4.6 nightly build


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?


Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?
Yes, it is consistently reproducible; all our upgrade jobs failed with this error.


Can this issue be reproduced from the UI?
Haven't tried


If this is a regression, please provide more details to justify this:
Yes, this worked well before


Steps to Reproduce:
1. Install OCS 4.5
2. Upgrade to the 4.6.6 build
3. Run a ceph command in the ceph toolbox pod; the command fails


Actual results:
Cannot run ceph commands on toolbox pod


Expected results:
Be able to run the commands


Additional info:
Job:
https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/1282/consoleFull

kubeconfig: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j043ai3c33-ua/j043ai3c33-ua_20210701T150543/openshift-cluster-dir/auth/kubeconfig

Must gather:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j043ai3c33-ua/j043ai3c33-ua_20210701T150543/logs/failed_testcase_ocs_logs_1625159105/test_upgrade_ocs_logs/


Another reproduction:

https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/1263/testReport/tests.ecosystem.upgrade/test_upgrade/test_upgrade/
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j066vu1cs33-ua/j066vu1cs33-ua_20210629T160701/logs/failed_testcase_ocs_logs_1624988496/test_upgrade_ocs_logs/

Next:
https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/1255/testReport/tests.ecosystem.upgrade/test_upgrade/test_upgrade/

Next on vsphere:
https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/1253/testReport/tests.ecosystem.upgrade/test_upgrade/test_upgrade/

Comment 3 Santosh Pillai 2021-07-02 09:40:37 UTC

I was looking into the first cluster. The toolbox image was
- image: registry.redhat.io/ocs4/rook-ceph-rhel8-operator@sha256:82db76aa07847d7fb5993af515cfaaffb803876b8a7869f85a27260a9edf7fb8

Something triggered a restart of the OCS operator (it wasn't me). I think the tests are still running on the cluster. The image now shows up correctly as
- image: quay.io/rhceph-dev/rook-ceph@sha256:6626db27489220f89fadef52203687860474411c604e8e27f8262fc5973879e8

Also, the ceph commands are working now in the toolbox.
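
The image comparison described in this comment could be automated roughly as follows. This is a hedged Python sketch under stated assumptions: the deployment name `rook-ceph-tools` is inferred from the pod name `rook-ceph-tools-f5494bcdc-nd7qg`, the toolbox is assumed to be the first container in the pod template, and the helper names `get_output`, `toolbox_image` and `image_is_stale` are hypothetical. The expected image reference has to be supplied by the caller.

```python
import subprocess
from typing import Callable, Sequence


def get_output(cmd: Sequence[str]) -> str:
    """Default runner (hypothetical helper): return stdout, raise on failure."""
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


def toolbox_image(
    namespace: str = "openshift-storage",
    deployment: str = "rook-ceph-tools",  # assumption: inferred from the pod name
    runner: Callable[[Sequence[str]], str] = get_output,
) -> str:
    """Return the image of the first container in the toolbox deployment."""
    cmd = [
        "oc", "-n", namespace, "get", "deployment", deployment,
        "-o", "jsonpath={.spec.template.spec.containers[0].image}",
    ]
    return runner(cmd).strip()


def image_is_stale(actual: str, expected: str) -> bool:
    """True when the running toolbox image differs from the expected one."""
    return actual.strip() != expected.strip()
```

With the two image references quoted in this comment, `image_is_stale` returns `True` for the pre-restart image and `False` once the deployment shows the `quay.io/rhceph-dev/rook-ceph` image.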

Comment 4 Mudit Agarwal 2021-07-02 10:09:16 UTC

We might close this issue after concluding the discussion: https://chat.google.com/room/AAAAREGEba8/4JE6K4E4_94

In any case, it looks like this is specific to 4.6.z.

Comment 9 Petr Balogh 2021-07-07 08:04:12 UTC

Verified here:
https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/1303/testReport/

Comment 15 errata-xmlrpc 2021-07-07 18:53:00 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenShift Container Storage 4.6.6 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:2669
