Description of problem (please be as detailed as possible and provide log snippets):
When trying to run the osd removal job, the job was created successfully but it got stuck in an error state.

Version of all relevant components (if applicable):
OCP 4.6, OCS 4.6

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
When reattaching a new volume to the VM we need to run the osd removal job in order to delete the osd pod's deployment.

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
2

Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?
No

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
Execute the osd removal job using the command:
$ oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=0 | oc create -f -
output:
job.batch/ocs-osd-removal-0 created

Actual results:
After a few seconds, the osd removal job failed with the status "Error".

Expected results:
The osd removal job succeeds with the status "Completed".

Additional info:
I checked both vSphere LSO 4.6 and vSphere non-LSO 4.6. The issue seems to affect only 4.6, because the osd removal job was fine when I ran it on vSphere LSO 4.5.

Versions:

OCP version:
Client Version: 4.3.8
Server Version: 4.6.0-0.nightly-2020-10-05-234751
Kubernetes Version: v1.19.0+db1fc96

OCS version:
ocs-operator.v4.6.0-113.ci   OpenShift Container Storage   4.6.0-113.ci   Succeeded

Cluster version:
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2020-10-05-234751   True        False         31h     Cluster version is 4.6.0-0.nightly-2020-10-05-234751

Rook version:
rook: 4.6-64.6507bc66.release_4.6
go: go1.15.0

Ceph version:
ceph version 14.2.8-111.el8cp (2e6029d57bc594eceba4751373da6505028c2650) nautilus (stable)
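For reference, a minimal sketch of the reproduction and status check, assuming the default openshift-storage namespace and the job name shown in the output above:

$ oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=0 | oc create -f -
# Watch the job's pod; in this bug it ends up in the "Error" state instead of "Completed"
$ oc get pods -n openshift-storage -l job-name=ocs-osd-removal-0 -w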
Logs?
When I tried to take the logs from one of the osd removal jobs:
$ oc logs -f ocs-osd-removal-1-6nvxs -n openshift-storage
I got this error:
Error initializing cluster client: ObjectNotFound('error calling conf_read_file',)
I think Servesha already knows about the issue.
@Itzhak Yes. I saw the issue. I'm digging into it.
This error means that the ceph command failed to execute, probably because it was missing a volume mount with the mon endpoints, or the keyring:
Error initializing cluster client: ObjectNotFound('error calling conf_read_file',)

@Servesha, this sounds like the same issue that you were already investigating with the OCS integration. We have a working example in the rook repo. The template generated by the OCS operator needs to match the job spec for the working example.
https://github.com/rook/rook/blob/release-1.4/cluster/examples/kubernetes/ceph/osd-purge.yaml
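One quick way to check whether the generated job spec actually carries the mon endpoint/keyring mounts from the upstream osd-purge.yaml example — a rough sketch, assuming the default template and the job name from the reproduction above:

# Render the template locally and inspect the volumes/volumeMounts it defines
$ oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=0 -o yaml | grep -A 10 -i volume
# Or inspect the job that was already created
$ oc get job ocs-osd-removal-0 -n openshift-storage -o yaml | grep -A 10 -i volume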
A PR is opened: https://github.com/openshift/ocs-operator/pull/820/files
(In reply to Servesha from comment #7)
> A PR is opened: https://github.com/openshift/ocs-operator/pull/820/files

Is the BZ component wrong?
(In reply to Michael Adam from comment #8)
> (In reply to Servesha from comment #7)
> > A PR is opened: https://github.com/openshift/ocs-operator/pull/820/files
>
> Is the BZ component wrong?

Yes.
Update: Tested the code and it is working properly. Waiting for the checks to pass for PR https://github.com/openshift/ocs-operator/pull/820
https://github.com/openshift/ocs-operator/pull/820 is merged.
Backport PR is not yet merged: https://github.com/openshift/ocs-operator/pull/852
Backport PR is still not merged.
Removing the OSD is succeeding, and triggering a reconcile as expected. The reconcile will attempt to start a new OSD to replace the old one. However, the bug is that the PVC from the previous OSD was not deleted. The same PVC is being reused for the new OSD, which fails to start since the OSD on it was purged. The fix is to purge the PVC during the OSD removal.

The log messages that would indicate the PVC was deleted are missing from the osd purge job. Here is the code where it is expected to be removed:
https://github.com/openshift/rook/blob/release-4.6/pkg/daemon/ceph/osd/remove.go#L95-L118

A workaround should be to delete the PVC and the OSD deployment, which will trigger a new reconcile and start a new OSD successfully.

From the operator log [1] we see that the PVC still exists:
2020-10-29T08:33:05.435252525Z 2020-10-29 08:33:05.435095 I | op-osd: OSD PVC "ocs-deviceset-1-data-0-9l6lc" already exists

From the osd prepare log [2] we see that the old OSD is detected again:
2020-10-29T08:33:28.681747178Z 2020-10-29 08:33:28.681732 D | exec: Running command: stdbuf -oL ceph-volume --log-path /tmp/ceph-log raw list /mnt/ocs-deviceset-1-data-0-9l6lc --format json
2020-10-29T08:33:28.945916190Z 2020-10-29 08:33:28.945863 D | cephosd: {
2020-10-29T08:33:28.945916190Z   "0": {
2020-10-29T08:33:28.945916190Z     "ceph_fsid": "847e0bc9-137c-4dd1-892f-cec46ef682e2",
2020-10-29T08:33:28.945916190Z     "device": "/mnt/ocs-deviceset-1-data-0-9l6lc",
2020-10-29T08:33:28.945916190Z     "osd_id": 0,
2020-10-29T08:33:28.945916190Z     "osd_uuid": "35920aab-57d0-4895-93b4-11f3358f5cad",
2020-10-29T08:33:28.945916190Z     "type": "bluestore"
2020-10-29T08:33:28.945916190Z   }
2020-10-29T08:33:28.945916190Z }

[1] http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/1886348/must-gather.local.1871105043414068258/quay-io-rhceph-dev-ocs-must-gather-sha256-82bb2fcff186300764858427b7912ee518009df45ef4eaa3bc13f1f49e8a8301/namespaces/openshift-storage/pods/rook-ceph-operator-5fb9cd9764-kld8x/rook-ceph-operator/rook-ceph-operator/logs/current.log
[2] http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/1886348/must-gather.local.1871105043414068258/quay-io-rhceph-dev-ocs-must-gather-sha256-82bb2fcff186300764858427b7912ee518009df45ef4eaa3bc13f1f49e8a8301/namespaces/openshift-storage/pods/rook-ceph-osd-prepare-ocs-deviceset-1-data-0-9l6lc-4s9ht/provision/provision/logs/current.log

@sdudhgao Can you take a look?
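For anyone hitting this before the fix lands, a rough sketch of the workaround described above, using the OSD ID and PVC name from the logs in this comment (substitute your own; the deployment name rook-ceph-osd-0 assumes the usual rook naming for OSD 0):

# Delete the old OSD deployment and its PVC so the reconcile can create fresh ones
$ oc delete deployment rook-ceph-osd-0 -n openshift-storage
$ oc delete pvc ocs-deviceset-1-data-0-9l6lc -n openshift-storage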
Yes Travis, I will take a look.
Fix verified upstream here: https://github.com/rook/rook/pull/6533
Merged downstream to release-4.6: https://github.com/openshift/rook/pull/144
I checked with vSphere non-LSO OCP 4.6, and the osd removal job was fine. You can see in this validation job https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs-cluster/14459/console that it worked as expected.

About vSphere LSO OCP 4.6: from what I have seen, the osd removal job works fine, but I didn't try the steps of node replacement or device replacement, so I am still not sure. I need to recheck it.
I performed the node replacement steps with the configuration: vSphere, OCP 4.6, OCS 4.6, LSO. As part of these steps, I also executed the osd removal job. The job was created successfully and finished with the status "Completed", and the node replacement process finished successfully.

So in summary, the osd removal job runs successfully both with vSphere, OCP 4.6, LSO and with vSphere, OCP 4.6, non-LSO.

Additional info about the LSO cluster I used:

OCP version:
Client Version: 4.3.8
Server Version: 4.6.0-0.nightly-2020-11-07-035509
Kubernetes Version: v1.19.0+9f84db3

OCS version:
ocs-operator.v4.6.0-156.ci   OpenShift Container Storage   4.6.0-156.ci   Succeeded

Cluster version:
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2020-11-07-035509   True        False         2d1h    Cluster version is 4.6.0-0.nightly-2020-11-07-035509

Rook version:
rook: 4.6-73.15d47331.release_4.6
go: go1.15.2

Ceph version:
ceph version 14.2.8-111.el8cp (2e6029d57bc594eceba4751373da6505028c2650) nautilus (stable)
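For completeness, a sketch of how the job result can be checked after running the removal, assuming the job name from the earlier reproduction:

# Both the job and its pod should report success / "Completed"
$ oc get job ocs-osd-removal-0 -n openshift-storage
$ oc get pods -n openshift-storage -l job-name=ocs-osd-removal-0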
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: Red Hat OpenShift Container Storage 4.6.0 security, bug fix, enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5605