Bug 1886348
| Field | Value |
|---|---|
| Summary | osd removal job failed with status "Error" |
| Product | [Red Hat Storage] Red Hat OpenShift Container Storage |
| Component | rook |
| Version | 4.6 |
| Status | CLOSED ERRATA |
| Severity | high |
| Priority | unspecified |
| Reporter | Itzhak <ikave> |
| Assignee | Travis Nielsen <tnielsen> |
| QA Contact | Itzhak <ikave> |
| CC | ebenahar, madam, muagarwa, nberry, ocs-bugs, prsurve, rgeorge, sdudhgao, shan, sostapov, tnielsen |
| Keywords | AutomationBackLog, Regression |
| Target Release | OCS 4.6.0 |
| Hardware | Unspecified |
| OS | Unspecified |
| Fixed In Version | 4.6.0-153.ci |
| Doc Type | No Doc Update |
| Type | Bug |
| Last Closed | 2020-12-17 06:24:44 UTC |
| Bug Blocks | 1787236, 1879008 |
Description
Itzhak
2020-10-08 09:18:09 UTC
Logs? When I tried to take the logs from one of the osd removal jobs:

    $ oc logs -f ocs-osd-removal-1-6nvxs -n openshift-storage

I got this error:

    Error initializing cluster client: ObjectNotFound('error calling conf_read_file',)

I think Servesha already knows the issue.

@Itzhak Yes, I saw the issue. I'm digging into it.

This error means that the ceph command failed to execute, probably because the job was missing a volume mount with the mon endpoints, or the keyring:

    Error initializing cluster client: ObjectNotFound('error calling conf_read_file',)

@Servesha, this sounds like the same issue that you were already investigating with the OCS integration. We have a working example in the rook repo. The template generated by the OCS operator needs to match the job spec of the working example: https://github.com/rook/rook/blob/release-1.4/cluster/examples/kubernetes/ceph/osd-purge.yaml

A PR is opened: https://github.com/openshift/ocs-operator/pull/820/files

(In reply to Servesha from comment #7)
> A PR is opened: https://github.com/openshift/ocs-operator/pull/820/files

Is the BZ component wrong?

(In reply to Michael Adam from comment #8)
> (In reply to Servesha from comment #7)
> > A PR is opened: https://github.com/openshift/ocs-operator/pull/820/files
>
> Is the BZ component wrong?

Yes.

Update: Tested the code and it is working properly. Waiting for the checks to pass on PR https://github.com/openshift/ocs-operator/pull/820

Backport PR is not yet merged: https://github.com/openshift/ocs-operator/pull/852

BP PR is still not merged.

Removing the OSD is succeeding and triggering a reconcile as expected. The reconcile then attempts to start a new OSD to replace the old one. However, the bug is that the PVC from the previous OSD was not deleted: the same PVC is reused for the new OSD, which fails to start because the underlying OSD was purged. The fix is to purge the PVC during the OSD removal.
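To illustrate the earlier point about the missing mon-endpoint and keyring wiring: the working upstream osd-purge.yaml example injects the mon endpoints and the ceph credentials into the job's pod, which is what lets the ceph client assemble its configuration (and what the OCS-generated template was missing). The fragment below is a sketch of that kind of wiring, not a verbatim copy of the upstream file; the image tag, service account name, and osd id are placeholders.

```yaml
# Sketch of the essential wiring for an OSD purge job (illustrative, not the
# exact upstream osd-purge.yaml). The ceph client needs the mon endpoints and
# keyring material from the cluster's configmap/secret, plus writable dirs
# for the config it generates.
apiVersion: batch/v1
kind: Job
metadata:
  name: rook-ceph-purge-osd
spec:
  template:
    spec:
      serviceAccountName: rook-ceph-system      # placeholder; varies by install
      restartPolicy: Never
      containers:
        - name: osd-removal
          image: rook/ceph:v1.4.6               # placeholder tag
          args: ["ceph", "osd", "remove", "--osd-ids", "1"]
          env:
            - name: ROOK_MON_ENDPOINTS
              valueFrom:
                configMapKeyRef: {name: rook-ceph-mon-endpoints, key: data}
            - name: ROOK_CEPH_USERNAME
              valueFrom:
                secretKeyRef: {name: rook-ceph-mon, key: ceph-username}
            - name: ROOK_CEPH_SECRET
              valueFrom:
                secretKeyRef: {name: rook-ceph-mon, key: ceph-secret}
          volumeMounts:
            - {name: ceph-conf-emptydir, mountPath: /etc/ceph}
            - {name: rook-config, mountPath: /var/lib/rook}
      volumes:
        - {name: ceph-conf-emptydir, emptyDir: {}}
        - {name: rook-config, emptyDir: {}}
```

Without the mon endpoints and keyring reaching the pod, the ceph client cannot build a usable ceph.conf, which matches the `error calling conf_read_file` failure seen in the job logs.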
The log messages that would indicate the PVC was deleted are missing from the osd purge job. Here is the code where it is expected to be removed: https://github.com/openshift/rook/blob/release-4.6/pkg/daemon/ceph/osd/remove.go#L95-L118

A workaround should be to delete the PVC and the OSD deployment, which will trigger a new reconcile and start a new OSD successfully.

From the operator log [1] we see that the PVC still exists:

    2020-10-29T08:33:05.435252525Z 2020-10-29 08:33:05.435095 I | op-osd: OSD PVC "ocs-deviceset-1-data-0-9l6lc" already exists

From the osd prepare log [2] we see that the old OSD is detected again:

    2020-10-29T08:33:28.681747178Z 2020-10-29 08:33:28.681732 D | exec: Running command: stdbuf -oL ceph-volume --log-path /tmp/ceph-log raw list /mnt/ocs-deviceset-1-data-0-9l6lc --format json
    2020-10-29T08:33:28.945916190Z 2020-10-29 08:33:28.945863 D | cephosd: {
    2020-10-29T08:33:28.945916190Z     "0": {
    2020-10-29T08:33:28.945916190Z         "ceph_fsid": "847e0bc9-137c-4dd1-892f-cec46ef682e2",
    2020-10-29T08:33:28.945916190Z         "device": "/mnt/ocs-deviceset-1-data-0-9l6lc",
    2020-10-29T08:33:28.945916190Z         "osd_id": 0,
    2020-10-29T08:33:28.945916190Z         "osd_uuid": "35920aab-57d0-4895-93b4-11f3358f5cad",
    2020-10-29T08:33:28.945916190Z         "type": "bluestore"
    2020-10-29T08:33:28.945916190Z     }
    2020-10-29T08:33:28.945916190Z }

[1] http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/1886348/must-gather.local.1871105043414068258/quay-io-rhceph-dev-ocs-must-gather-sha256-82bb2fcff186300764858427b7912ee518009df45ef4eaa3bc13f1f49e8a8301/namespaces/openshift-storage/pods/rook-ceph-operator-5fb9cd9764-kld8x/rook-ceph-operator/rook-ceph-operator/logs/current.log
[2]
http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/1886348/must-gather.local.1871105043414068258/quay-io-rhceph-dev-ocs-must-gather-sha256-82bb2fcff186300764858427b7912ee518009df45ef4eaa3bc13f1f49e8a8301/namespaces/openshift-storage/pods/rook-ceph-osd-prepare-ocs-deviceset-1-data-0-9l6lc-4s9ht/provision/provision/logs/current.log

@sdudhgao Can you take a look?

Yes Travis, I will take a look.

Fix verified upstream here: https://github.com/rook/rook/pull/6533
Merged downstream to release-4.6: https://github.com/openshift/rook/pull/144

I checked with vSphere non-LSO OCP 4.6, and the osd removal job was fine. You can see in this validation job https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs-cluster/14459/console that it worked as expected. About vSphere LSO OCP 4.6: from what I have seen, the osd removal job works fine, but I didn't try the node replacement or device replacement steps, so I am still not sure. I need to recheck it.

I performed the node replacement steps with this configuration: vSphere, OCP 4.6, OCS 4.6, LSO. As part of these steps, I also executed the osd removal job. The job was created successfully with the status "Completed", and the node replacement process finished successfully. In summary, the osd removal job is created successfully both with vSphere, OCP 4.6, LSO, and with vSphere, OCP 4.6, non-LSO.
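For clusters that hit this bug before the fixed build, the workaround described earlier (delete the stale PVC and the old OSD deployment so the operator's next reconcile provisions a fresh PVC and OSD) amounts to something like the following. The namespace, OSD id, and PVC name below come from the logs in this bug report and are placeholders to adjust for the failed OSD in your cluster:

```shell
#!/bin/sh
# Workaround sketch: remove the stale PVC and OSD deployment so the operator's
# next reconcile provisions a fresh PVC and starts a new OSD.
# Names below are taken from this bug's logs -- substitute your own.
NS=openshift-storage
OSD_ID=0                                 # id of the purged OSD
PVC=ocs-deviceset-1-data-0-9l6lc         # PVC that backed that OSD

if command -v oc >/dev/null 2>&1; then
  oc delete deployment "rook-ceph-osd-${OSD_ID}" -n "$NS" --ignore-not-found
  oc delete pvc "$PVC" -n "$NS" --ignore-not-found
else
  echo "oc client not found; run the deletes from a cluster admin shell" >&2
fi
```

After the deletes, the operator log should show a new PVC being created for the device set instead of the `OSD PVC ... already exists` message quoted above.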
Additional info about the LSO cluster I used:

OCP version:

    Client Version: 4.3.8
    Server Version: 4.6.0-0.nightly-2020-11-07-035509
    Kubernetes Version: v1.19.0+9f84db3

OCS version:

    ocs-operator.v4.6.0-156.ci   OpenShift Container Storage   4.6.0-156.ci   Succeeded

Cluster version:

    NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
    version   4.6.0-0.nightly-2020-11-07-035509   True        False         2d1h    Cluster version is 4.6.0-0.nightly-2020-11-07-035509

Rook version:

    rook: 4.6-73.15d47331.release_4.6
    go: go1.15.2

Ceph version:

    ceph version 14.2.8-111.el8cp (2e6029d57bc594eceba4751373da6505028c2650) nautilus (stable)

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: Red Hat OpenShift Container Storage 4.6.0 security, bug fix, enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5605