Description of problem (please be as detailed as possible and provide log snippets):

After 2 of the 3 original OCS nodes failed, I followed the directions listed below to remove the failed nodes and add new nodes:
https://access.redhat.com/documentation/en-us/red_hat_openshift_container_storage/4.6/html/scaling_storage/scaling-out-storage-capacity_rhocs#adding-a-node-using-a-local-storage-device_rhocs

The ocs-operator at the time of failure was running on one of the failed nodes, so the ocs-operator deployment tried to reschedule it to a healthy node. I had to delete the ConfigMap lock that was preventing it from being rescheduled, but I am unable to get the ocs-operator pod to become ready.

# ocs-operator logs
{"level":"info","ts":"2021-02-18T23:06:19.395Z","logger":"cmd","msg":"Go Version: go1.15.5"}
{"level":"info","ts":"2021-02-18T23:06:19.396Z","logger":"cmd","msg":"Go OS/Arch: linux/amd64"}
{"level":"info","ts":"2021-02-18T23:06:19.396Z","logger":"cmd","msg":"operator-sdk Version: v0.17.0"}
{"level":"info","ts":"2021-02-18T23:06:19.396Z","logger":"cmd","msg":"Running in development mode: false"}
{"level":"info","ts":"2021-02-18T23:06:19.396Z","logger":"leader","msg":"Trying to become the leader."}
{"level":"info","ts":"2021-02-18T23:06:22.217Z","logger":"leader","msg":"No pre-existing lock was found."}
{"level":"info","ts":"2021-02-18T23:06:22.222Z","logger":"leader","msg":"Became the leader."}
{"level":"info","ts":"2021-02-18T23:06:25.032Z","logger":"controller-runtime.metrics","msg":"metrics server is starting to listen","addr":":8080"}
{"level":"info","ts":"2021-02-18T23:06:25.032Z","logger":"cmd","msg":"Registering Components."}
{"level":"info","ts":"2021-02-18T23:06:25.043Z","logger":"cmd","msg":"OCSInitialization resource already exists"}
{"level":"info","ts":"2021-02-18T23:06:25.043Z","logger":"cmd","msg":"Starting the Cmd."}
{"level":"info","ts":"2021-02-18T23:06:25.044Z","logger":"controller-runtime.manager","msg":"starting metrics server","path":"/metrics"}
{"level":"info","ts":"2021-02-18T23:06:25.044Z","logger":"controller-runtime.controller","msg":"Starting EventSource","controller":"storagecluster-controller","source":"kind source: /, Kind="}
{"level":"info","ts":"2021-02-18T23:06:25.044Z","logger":"controller-runtime.controller","msg":"Starting EventSource","controller":"ocsinitialization-controller","source":"kind source: /, Kind="}
{"level":"info","ts":"2021-02-18T23:06:25.044Z","logger":"controller-runtime.controller","msg":"Starting EventSource","controller":"persistentvolume-controller","source":"kind source: /, Kind="}
{"level":"info","ts":"2021-02-18T23:06:25.144Z","logger":"controller-runtime.controller","msg":"Starting Controller","controller":"persistentvolume-controller"}
{"level":"info","ts":"2021-02-18T23:06:25.245Z","logger":"controller-runtime.controller","msg":"Starting workers","controller":"persistentvolume-controller","worker count":1}
{"level":"info","ts":"2021-02-18T23:06:25.245Z","logger":"controller-runtime.controller","msg":"Starting Controller","controller":"ocsinitialization-controller"}
{"level":"info","ts":"2021-02-18T23:06:25.345Z","logger":"controller-runtime.controller","msg":"Starting workers","controller":"ocsinitialization-controller","worker count":1}
{"level":"info","ts":"2021-02-18T23:06:25.345Z","logger":"controller_ocsinitialization","msg":"Reconciling OCSInitialization","Request.Namespace":"openshift-storage","Request.Name":"ocsinit"}
{"level":"info","ts":"2021-02-18T23:06:25.349Z","logger":"controller_ocsinitialization","msg":"Updating rook-ceph SecurityContextConstraint","Request.Namespace":"openshift-storage","Request.Name":"ocsinit"}
{"level":"info","ts":"2021-02-18T23:06:25.360Z","logger":"controller_ocsinitialization","msg":"Updating rook-ceph-csi SecurityContextConstraint","Request.Namespace":"openshift-storage","Request.Name":"ocsinit"}
{"level":"info","ts":"2021-02-18T23:06:25.371Z","logger":"controller_ocsinitialization","msg":"Updating noobaa SecurityContextConstraint","Request.Namespace":"openshift-storage","Request.Name":"ocsinit"}
{"level":"info","ts":"2021-02-18T23:06:25.397Z","logger":"controller_ocsinitialization","msg":"Reconciling OCSInitialization","Request.Namespace":"openshift-storage","Request.Name":"ocsinit"}
{"level":"info","ts":"2021-02-18T23:06:25.399Z","logger":"controller_ocsinitialization","msg":"Updating rook-ceph SecurityContextConstraint","Request.Namespace":"openshift-storage","Request.Name":"ocsinit"}
{"level":"info","ts":"2021-02-18T23:06:25.410Z","logger":"controller_ocsinitialization","msg":"Updating rook-ceph-csi SecurityContextConstraint","Request.Namespace":"openshift-storage","Request.Name":"ocsinit"}
{"level":"info","ts":"2021-02-18T23:06:25.417Z","logger":"controller_ocsinitialization","msg":"Updating noobaa SecurityContextConstraint","Request.Namespace":"openshift-storage","Request.Name":"ocsinit"}
{"level":"info","ts":"2021-02-18T23:06:25.445Z","logger":"controller-runtime.controller","msg":"Starting Controller","controller":"storagecluster-controller"}
{"level":"info","ts":"2021-02-18T23:06:25.445Z","logger":"controller-runtime.controller","msg":"Starting workers","controller":"storagecluster-controller","worker count":1}
{"level":"info","ts":"2021-02-18T23:06:25.445Z","logger":"controller_storagecluster","msg":"Reconciling StorageCluster","Request.Namespace":"openshift-storage","Request.Name":"ocs-storagecluster"}
{"level":"info","ts":"2021-02-18T23:06:25.876Z","logger":"controller_storagecluster","msg":"Restoring original cephObjectStore ocs-storagecluster-cephobjectstore","Request.Namespace":"openshift-storage","Request.Name":"ocs-storagecluster"}
{"level":"info","ts":"2021-02-18T23:06:26.006Z","logger":"controller_storagecluster","msg":"Restoring original cephObjectStoreUser ocs-storagecluster-cephobjectstoreuser","Request.Namespace":"openshift-storage","Request.Name":"ocs-storagecluster"}
{"level":"info","ts":"2021-02-18T23:06:26.120Z","logger":"controller_storagecluster","msg":"Restoring original cephBlockPool ocs-storagecluster-cephblockpool","Request.Namespace":"openshift-storage","Request.Name":"ocs-storagecluster"}
{"level":"info","ts":"2021-02-18T23:06:26.234Z","logger":"controller_storagecluster","msg":"Restoring original cephFilesystem ocs-storagecluster-cephfilesystem","Request.Namespace":"openshift-storage","Request.Name":"ocs-storagecluster"}
{"level":"info","ts":"2021-02-18T23:06:26.368Z","logger":"controller_storagecluster","msg":"Waiting on ceph cluster to initialize before starting noobaa","Request.Namespace":"openshift-storage","Request.Name":"ocs-storagecluster"}
{"level":"info","ts":"2021-02-18T23:06:26.503Z","logger":"controller_storagecluster","msg":"Reconciling metrics exporter service","NamespacedName":{"namespace":"openshift-storage","name":"ocs-metrics-exporter"}}
{"level":"info","ts":"2021-02-18T23:06:26.610Z","logger":"controller_storagecluster","msg":"Reconciling metrics exporter service monitor","NamespacedName":{"namespace":"openshift-storage","name":"ocs-metrics-exporter"}}

Version of all relevant components (if applicable):
OCS 4.6.2
OCP 4.6.15

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
No storage is available for the OCP cluster.

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
2

Can this issue be reproduced?
Yes

Can this issue be reproduced from the UI?
Yes

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Deploy OCS
2. Kill two of the three RHCOS VMs running the OCS nodes
3. Remove the ocs-operator ConfigMap lock so the ocs-operator can be rescheduled to a healthy node

Actual results:
The ocs-operator pod never becomes Ready.

Expected results:
The ocs-operator pod becomes Ready and properly repairs the rook-ceph-mon-* and rook-ceph-osd-* pods.

Additional info:
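For reference, step 3 above can be sketched roughly as follows. This is a hedged sketch, not an official procedure: operator-sdk v0.17 leader-for-life election stores its lock in a ConfigMap in the operator's namespace, but the exact ConfigMap name ("ocs-operator-lock") is an assumption here, so list the ConfigMaps first to confirm what it is called in your cluster.

```shell
# List the ConfigMaps to locate the leader-election lock
# (the name below is assumed, not confirmed from this report).
oc -n openshift-storage get configmaps

# Delete the stale lock so a replacement ocs-operator pod can
# acquire leadership on a healthy node.
oc -n openshift-storage delete configmap ocs-operator-lock
```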
I'll add that rook-ceph-mon-a/b are in Pending status (as they were on the failed nodes). rook-ceph-mon-c is in healthy status.

rook-ceph-mon-a-6bbfbf5999-ddmgm   0/1   Pending   0    143m
rook-ceph-mon-b-755568566d-4sgsj   0/1   Pending   0    24h
rook-ceph-mon-c-5cf658f954-mfr2r   1/1   Running   13   14d

But normally the nodeSelector on those deployments is managed by the ocs-operator... which is waiting for Ceph to finish initializing... This is circular...
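To make the circular dependency above concrete, the nodeSelector pinning each mon deployment to its (now-failed) node can be inspected like this. A hedged sketch using standard oc/jsonpath; the deployment names come from the pod listing above, and the label key rook uses for mon placement may differ by version.

```shell
# Show the nodeSelector each mon deployment is pinned to.
# mon-a and mon-b will point at labels that only existed on the failed nodes.
for d in rook-ceph-mon-a rook-ceph-mon-b rook-ceph-mon-c; do
  echo "$d:"
  oc -n openshift-storage get deployment "$d" \
    -o jsonpath='{.spec.template.spec.nodeSelector}{"\n"}'
done
```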
This is all expected behavior. The whole ConfigMap lock deletion issue is a known problem that will be resolved in OCS 4.7 (it's something out of our control; it's part of the framework we're using, and we're upgrading it for OCS 4.7).

As for the ocs-operator not being Ready, this is intentional: the operator should not report Ready until all StorageClusters (and their components) are healthy. And indeed, NooBaa should not be created until the CephCluster is healthy, since NooBaa relies on a Ceph volume for its operation. You should inspect the CephCluster CR and the rook-ceph-operator logs to determine what is actually going on.

Dealing with failed nodes in Kubernetes is a pain in general. Pods will remain Pending and/or Terminating until either the exact node comes back healthy or the admin intervenes. In this case, you probably have to force delete any stuck Pods. We're considering ways to address this, but nothing is available for OCS 4.6.
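The inspection and cleanup steps suggested above can be sketched with standard oc commands. This assumes the default openshift-storage namespace; `<pod-name>` is a placeholder for whichever pod is stuck on a failed node.

```shell
# Check the CephCluster CR status/conditions to see why Ceph is unhealthy.
oc -n openshift-storage get cephcluster -o yaml

# Follow the rook-ceph-operator logs for reconcile errors.
oc -n openshift-storage logs deploy/rook-ceph-operator

# Force-delete a Pod stuck on a failed node so it can be rescheduled
# (replace <pod-name> with the actual stuck pod).
oc -n openshift-storage delete pod <pod-name> --grace-period=0 --force
```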
Since this is not a crucial bug, moving to OCS 4.8.
Starting from OCS 4.7 we no longer use ConfigMap locks. The operator readiness behavior and the wait for the CephCluster before creating NooBaa are working as expected. So this bug doesn't exist anymore. Is it critical enough to warrant a 4.6-only fix? If not, we can close this.