Bug 1886348

Summary: osd removal job failed with status "Error"
Product: [Red Hat Storage] Red Hat OpenShift Container Storage
Reporter: Itzhak <ikave>
Component: rook
Assignee: Travis Nielsen <tnielsen>
Status: CLOSED ERRATA
QA Contact: Itzhak <ikave>
Severity: high
Docs Contact:
Priority: unspecified
Version: 4.6
CC: ebenahar, madam, muagarwa, nberry, ocs-bugs, prsurve, rgeorge, sdudhgao, shan, sostapov, tnielsen
Target Milestone: ---
Keywords: AutomationBackLog, Regression
Target Release: OCS 4.6.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: 4.6.0-153.ci
Doc Type: No Doc Update
Last Closed: 2020-12-17 06:24:44 UTC
Type: Bug
Bug Blocks: 1787236, 1879008    

Description Itzhak 2020-10-08 09:18:09 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
When trying to run the osd removal job, the job was created successfully but got stuck in an "Error" state.

Version of all relevant components (if applicable):
OCP 4.6, OCS 4.6

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
When reattaching a new volume to the VM, we need to run the osd removal job in order to delete the osd pod's deployment.

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
2

Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?
No

If this is a regression, please provide more details to justify this:


Steps to Reproduce:
Execute the osd removal job using the command:
$ oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=0 | oc create -f -
output: job.batch/ocs-osd-removal-0 created

Actual results:
After a few seconds, the osd removal job failed with the status "Error". 
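(A hedged way to observe this, assuming the default job name created by the command above; the job-name label is set on the job's pods by Kubernetes:)
$ oc get job ocs-osd-removal-0 -n openshift-storage
$ oc get pods -n openshift-storage -l job-name=ocs-osd-removal-0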

Expected results:
The osd removal job should succeed with the status "Completed".

Additional info:

I checked both vSphere LSO 4.6 and vSphere non-LSO 4.6. The issue seems to be only with 4.6, because the osd removal job was fine when I ran it on vSphere LSO 4.5.
Versions:
OCP version:
Client Version: 4.3.8
Server Version: 4.6.0-0.nightly-2020-10-05-234751
Kubernetes Version: v1.19.0+db1fc96

OCS version:
ocs-operator.v4.6.0-113.ci   OpenShift Container Storage   4.6.0-113.ci              Succeeded

cluster version
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2020-10-05-234751   True        False         31h     Cluster version is 4.6.0-0.nightly-2020-10-05-234751

Rook version
rook: 4.6-64.6507bc66.release_4.6
go: go1.15.0

Ceph version
ceph version 14.2.8-111.el8cp (2e6029d57bc594eceba4751373da6505028c2650) nautilus (stable)

Comment 2 Sébastien Han 2020-10-08 15:40:15 UTC
Logs?

Comment 3 Itzhak 2020-10-11 08:48:10 UTC
When I tried to get the logs from one of the osd removal jobs:
$ oc logs -f ocs-osd-removal-1-6nvxs -n openshift-storage
I got this error:
Error initializing cluster client: ObjectNotFound('error calling conf_read_file',)

I think Servesha already knows about this issue.

Comment 4 Servesha 2020-10-12 08:14:04 UTC
@Itzhak Yes. I saw the issue. I'm digging into it.

Comment 5 Travis Nielsen 2020-10-12 13:24:48 UTC
This error means that the ceph command failed to execute, probably because it was missing a volume mount with the mon endpoints, or the keyring.
Error initializing cluster client: ObjectNotFound('error calling conf_read_file',)

@Servesha, this sounds like the same issue that you were already investigating with the OCS integration. We have a working example in the rook repo. The template generated by the OCS operator needs to match the job spec for the working example. 
https://github.com/rook/rook/blob/release-1.4/cluster/examples/kubernetes/ceph/osd-purge.yaml
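(A hedged way to compare the two, assuming the template and job names from the reproduce step; the rendered spec should mount the mon endpoints ConfigMap and the keyring the same way osd-purge.yaml does:)
$ oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=0 -o yaml
$ oc get job ocs-osd-removal-0 -n openshift-storage -o yaml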

Comment 7 Servesha 2020-10-13 06:29:10 UTC
A PR is opened: https://github.com/openshift/ocs-operator/pull/820/files

Comment 8 Michael Adam 2020-10-13 08:13:06 UTC
(In reply to Servesha from comment #7)
> A PR is opened: https://github.com/openshift/ocs-operator/pull/820/files

Is the BZ component wrong?

Comment 9 Sébastien Han 2020-10-13 08:56:14 UTC
(In reply to Michael Adam from comment #8)
> (In reply to Servesha from comment #7)
> > A PR is opened: https://github.com/openshift/ocs-operator/pull/820/files
> 
> Is the BZ component wrong?

Yes.

Comment 10 Servesha 2020-10-14 23:58:34 UTC
Update: Tested the code and it is working properly. Waiting for the checks to pass for PR https://github.com/openshift/ocs-operator/pull/820

Comment 11 Servesha 2020-10-20 06:40:55 UTC
https://github.com/openshift/ocs-operator/pull/820 is merged.

Comment 12 Mudit Agarwal 2020-10-20 07:15:06 UTC
Backport PR is not yet merged: https://github.com/openshift/ocs-operator/pull/852

Comment 13 Mudit Agarwal 2020-10-23 00:49:40 UTC
The backport PR is still not merged.

Comment 16 Travis Nielsen 2020-10-29 18:00:45 UTC
Removing the OSD is succeeding and triggering a reconcile as expected. The reconcile will attempt to start a new OSD to replace the old one. However, the bug is that the PVC from the previous OSD was not deleted. The same PVC is reused for the new OSD, which fails to start since the old OSD on it was purged.

The fix is to purge the PVC during the OSD removal. The log messages from the osd purge job that would indicate the PVC was deleted are missing. Here is the code where it is expected to be removed:
https://github.com/openshift/rook/blob/release-4.6/pkg/daemon/ceph/osd/remove.go#L95-L118
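(A hedged check for those messages, assuming the job name from the reproduce step:)
$ oc logs -n openshift-storage job/ocs-osd-removal-0 | grep -i pvc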

A workaround should be to delete the PVC and the OSD deployment, which will trigger a new reconcile and start a new OSD successfully.
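(A hedged sketch of that workaround; the PVC name is taken from the logs below and rook-ceph-osd-<id> is the usual OSD deployment naming, so the exact names depend on the cluster:)
$ oc delete deployment rook-ceph-osd-0 -n openshift-storage
$ oc delete pvc ocs-deviceset-1-data-0-9l6lc -n openshift-storage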

From the operator log [1] we see that the PVC still exists:

2020-10-29T08:33:05.435252525Z 2020-10-29 08:33:05.435095 I | op-osd: OSD PVC "ocs-deviceset-1-data-0-9l6lc" already exists

From the osd prepare log [2] we see that the old OSD is detected again:

2020-10-29T08:33:28.681747178Z 2020-10-29 08:33:28.681732 D | exec: Running command: stdbuf -oL ceph-volume --log-path /tmp/ceph-log raw list /mnt/ocs-deviceset-1-data-0-9l6lc --format json
2020-10-29T08:33:28.945916190Z 2020-10-29 08:33:28.945863 D | cephosd: {
2020-10-29T08:33:28.945916190Z     "0": {
2020-10-29T08:33:28.945916190Z         "ceph_fsid": "847e0bc9-137c-4dd1-892f-cec46ef682e2",
2020-10-29T08:33:28.945916190Z         "device": "/mnt/ocs-deviceset-1-data-0-9l6lc",
2020-10-29T08:33:28.945916190Z         "osd_id": 0,
2020-10-29T08:33:28.945916190Z         "osd_uuid": "35920aab-57d0-4895-93b4-11f3358f5cad",
2020-10-29T08:33:28.945916190Z         "type": "bluestore"
2020-10-29T08:33:28.945916190Z     }
2020-10-29T08:33:28.945916190Z }


[1] http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/1886348/must-gather.local.1871105043414068258/quay-io-rhceph-dev-ocs-must-gather-sha256-82bb2fcff186300764858427b7912ee518009df45ef4eaa3bc13f1f49e8a8301/namespaces/openshift-storage/pods/rook-ceph-operator-5fb9cd9764-kld8x/rook-ceph-operator/rook-ceph-operator/logs/current.log
[2] http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/1886348/must-gather.local.1871105043414068258/quay-io-rhceph-dev-ocs-must-gather-sha256-82bb2fcff186300764858427b7912ee518009df45ef4eaa3bc13f1f49e8a8301/namespaces/openshift-storage/pods/rook-ceph-osd-prepare-ocs-deviceset-1-data-0-9l6lc-4s9ht/provision/provision/logs/current.log

@sdudhgao Can you take a look?

Comment 17 Servesha 2020-10-30 07:59:05 UTC
Yes Travis, I will take a look.

Comment 19 Travis Nielsen 2020-11-03 00:22:40 UTC
Fix verified upstream here: https://github.com/rook/rook/pull/6533

Comment 20 Travis Nielsen 2020-11-03 20:22:53 UTC
Merged downstream to release-4.6: https://github.com/openshift/rook/pull/144

Comment 21 Itzhak 2020-11-09 17:57:47 UTC
I checked with vSphere non-LSO OCP 4.6, and the osd removal job was fine. You can see in this validation job https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs-cluster/14459/console that it worked as expected.

About vSphere LSO OCP 4.6: 
from what I have seen, the osd removal job works fine, but I didn't try the steps of node replacement or device replacement, so I am still not sure. I need to recheck it.

Comment 22 Itzhak 2020-11-11 15:49:00 UTC
I performed node replacement steps with the configurations: vSphere, OCP 4.6, OCS 4.6, LSO.

As part of these steps, I also executed the osd removal job. 
The job was created successfully and finished with the status "Completed",
and the node replacement process completed successfully.

In summary, the osd removal job completed successfully both with vSphere, OCP 4.6, LSO,
and with vSphere, OCP 4.6, non-LSO.
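(For reference, a hedged example of the kind of checks used to confirm this; exact resource names vary per cluster:)
$ oc get pods -n openshift-storage | grep ocs-osd-removal
$ oc get pods -n openshift-storage | grep rook-ceph-osd
$ oc get pvc -n openshift-storage | grep ocs-deviceset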

Additional info about the LSO cluster I used: 

OCP version:
Client Version: 4.3.8
Server Version: 4.6.0-0.nightly-2020-11-07-035509
Kubernetes Version: v1.19.0+9f84db3

OCS version:
ocs-operator.v4.6.0-156.ci   OpenShift Container Storage   4.6.0-156.ci              Succeeded

cluster version
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2020-11-07-035509   True        False         2d1h    Cluster version is 4.6.0-0.nightly-2020-11-07-035509

Rook version
rook: 4.6-73.15d47331.release_4.6
go: go1.15.2

Ceph version
ceph version 14.2.8-111.el8cp (2e6029d57bc594eceba4751373da6505028c2650) nautilus (stable)

Comment 25 errata-xmlrpc 2020-12-17 06:24:44 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat OpenShift Container Storage 4.6.0 security, bug fix, enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5605