Description of problem:

Version-Release number of selected component (if applicable):

How reproducible:
100%

Steps to Reproduce:
1. Deploy CNV
2. Edit HCO and add nodePlacement for infra and workloads:

   $ oc -n openshift-cnv edit hyperconverged

   Change the spec:

   spec:
     infra:
       nodePlacement:
         nodeSelector:
           foo: bar
     workloads:
       nodePlacement:
         nodeSelector:
           foo: bar

3. Save edits and close editor
4. Revert edits

Actual results:
error: hyperconvergeds.hco.kubevirt.io "kubevirt-hyperconverged" could not be patched: Internal error occurred: failed calling webhook "validate-hco.kubevirt.io": Post "https://hco-operator-service.openshift-cnv.svc:4343/validate-hco-kubevirt-io-v1beta1-hyperconverged?timeout=30s": no endpoints available for service "hco-operator-service"
You can run `oc replace -f /tmp/oc-edit-ikelt.yaml` to try this update again.

Expected results:
Changes made in step 3 should be reversed.

Additional info:
Happens whether or not there is a schedulable node labelled foo=bar in the cluster.
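(Reproduction note, not part of the original report.) The same change can also be applied and reverted non-interactively with oc patch instead of oc edit; the revert goes through the same validating webhook, so it fails with the same "no endpoints available" error while the webhook pod is not ready:

   # apply the nodePlacement change from step 2 (sketch, merge patch)
   $ oc -n openshift-cnv patch hyperconverged kubevirt-hyperconverged --type=merge \
       -p '{"spec":{"infra":{"nodePlacement":{"nodeSelector":{"foo":"bar"}}},"workloads":{"nodePlacement":{"nodeSelector":{"foo":"bar"}}}}}'

   # attempt the revert from step 4 (a merge patch with null removes the fields)
   $ oc -n openshift-cnv patch hyperconverged kubevirt-hyperconverged --type=merge \
       -p '{"spec":{"infra":{"nodePlacement":null},"workloads":{"nodePlacement":null}}}'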
This is an HCO issue and not an SSP issue; moving to the install team.
Currently the operator and its validating webhook are exposed by a single pod. If the user applies a change that causes the operator, or indirectly one of its operands, to get stuck waiting for an impossible reconciliation, the operator is going to report ready=false (currently the readiness probe is its only available communication channel to OLM). As a side effect, the webhook is not ready either, which blocks any further user change and can potentially cause a deadlock.

The proper solution is to stop using the readiness probe to communicate with OLM and rely instead on the newer operator conditions mechanism (https://github.com/operator-framework/enhancements/blob/master/enhancements/operator-conditions.md), but this can be done only on OCP 4.7, so CNV 2.6. A quick (but dirty) solution for CNV 2.5 is to expose the webhook from a second pod (using a single binary for simplicity) so that we can rely on two independent readiness probes.
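A rough sketch of what the second-pod workaround could look like (the Deployment name, labels, image reference, and probe path below are illustrative assumptions, not the actual CNV 2.5 manifests); the point is only that the webhook server gets a readiness probe independent of the operator's reconciliation state:

   apiVersion: apps/v1
   kind: Deployment
   metadata:
     name: hco-webhook                 # illustrative name
     namespace: openshift-cnv
   spec:
     replicas: 1
     selector:
       matchLabels:
         name: hco-webhook
     template:
       metadata:
         labels:
           name: hco-webhook
       spec:
         containers:
         - name: hco-webhook
           image: <hco-operator-image> # same binary, started in webhook-only mode (mechanism not shown)
           ports:
           - containerPort: 4343       # webhook port taken from the error message above
           readinessProbe:             # independent of the main operator pod's readiness probe
             httpGet:
               path: /readyz           # assumed probe path
               port: 4343
               scheme: HTTPS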
Taint a node.

[kbidarka@localhost cnv-tests]$ oc adm taint node kbid25ve-m2n85-worker-0-95qhh worker=load-balancer:NoSchedule
node/kbid25ve-m2n85-worker-0-95qhh tainted

Ensure the taint got applied on that node.

[kbidarka@localhost cnv-tests]$ oc get nodes kbid25ve-m2n85-worker-0-95qhh -o yaml | grep -A 3 taints
--
  taints:
  - effect: NoSchedule
    key: worker
    value: load-balancer

Update the workloads section under the hyperconverged CR.

[kbidarka@localhost cnv-tests]$ oc edit hyperconverged -n openshift-cnv
hyperconverged.hco.kubevirt.io/kubevirt-hyperconverged edited

Ensure the hyperconverged CR got updated successfully.

[kbidarka@localhost cnv-tests]$ oc get hyperconverged kubevirt-hyperconverged -n openshift-cnv -o yaml | grep -A 5 workloads
--
  workloads:
    nodePlacement:
      tolerations:
      - effect: NoSchedule
        key: worker
        operator: Exists

Ensure the change got propagated to the virt-handler daemonset, as an example of a workloads component.

[kbidarka@localhost cnv-tests]$ oc get daemonset virt-handler -n openshift-cnv -o yaml | grep -A 6 "tolerations:"
--
      tolerations:
      - key: CriticalAddonsOnly
        operator: Exists
      - effect: NoSchedule
        key: worker
        operator: Exists

Check that the pods got re-created.

[kbidarka@localhost cnv-tests]$ oc get pods -n openshift-cnv | grep -i virt-handler
virt-handler-6v6pl   1/1   Running   0   7m48s
virt-handler-r75wl   1/1   Running   0   6m55s
virt-handler-w6cq7   1/1   Running   0   7m17s

Ensure the virt-handler pod running on the tainted node also got created, with the tolerations.

[kbidarka@localhost cnv-tests]$ oc get pods virt-handler-w6cq7 -n openshift-cnv -o yaml | grep nodeName
      fieldPath: spec.nodeName
  nodeName: kbid25ve-m2n85-worker-0-95qhh

Pod got re-created with the tolerations.

[kbidarka@localhost cnv-tests]$ oc get pods virt-handler-w6cq7 -n openshift-cnv -o yaml | grep -A 6 "tolerations:"
--
  tolerations:
  - key: CriticalAddonsOnly
    operator: Exists
  - effect: NoSchedule
    key: worker
    operator: Exists
  - effect: NoExecute
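(Side note, not part of the original verification.) The node each virt-handler pod landed on can also be checked in a single command with wide output, instead of grepping each pod's YAML for nodeName:

[kbidarka@localhost cnv-tests]$ oc get pods -n openshift-cnv -o wide | grep -i virt-handler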
Continuing from the steps mentioned in comment 3.

We have 3 running virt-handler pods.

(cnv-tests) [kbidarka@localhost cnv-tests]$ oc get pods -n openshift-cnv | grep -i virt-handler
virt-handler-6v6pl   1/1   Running   0   21m
virt-handler-r75wl   1/1   Running   0   20m
virt-handler-w6cq7   1/1   Running   0   21m

We had updated the workloads section above.

(cnv-tests) [kbidarka@localhost cnv-tests]$ oc get hyperconverged kubevirt-hyperconverged -n openshift-cnv -o yaml | grep -A 5 workloads
--
  workloads:
    nodePlacement:
      tolerations:
      - effect: NoSchedule
        key: worker
        operator: Exists

Reverting the workloads section by editing the hyperconverged object.

(cnv-tests) [kbidarka@localhost cnv-tests]$ oc edit hyperconverged kubevirt-hyperconverged -n openshift-cnv
hyperconverged.hco.kubevirt.io/kubevirt-hyperconverged edited

The updates to the workloads section have been reverted successfully.

(cnv-tests) [kbidarka@localhost cnv-tests]$ oc get hyperconverged kubevirt-hyperconverged -n openshift-cnv -o yaml | grep -A 5 workloads
--
  workloads: {}
status:
  conditions:
  - lastHeartbeatTime: "2020-10-28T19:51:24Z"
    lastTransitionTime: "2020-10-21T19:32:23Z"
    message: Reconcile completed successfully

With the tolerations removed from the workloads section of the hyperconverged CR, there are now only 2 virt-handler pods running.

(cnv-tests) [kbidarka@localhost cnv-tests]$ oc get pods -n openshift-cnv | grep -i virt-handler
virt-handler-f2hdn   1/1   Running   0   2m9s
virt-handler-qz6rq   1/1   Running   0   2m38s

Removing the taint from the node brings all the virt-handler pods back up.

[kbidarka@localhost cnv-tests]$ oc get pods -n openshift-cnv | grep -i virt-handler
virt-handler-8q7dx   1/1   Running   0   13m
virt-handler-f2hdn   1/1   Running   0   15m
virt-handler-qz6rq   1/1   Running   0   15m

---
Summary: The updates to the workloads section have been reverted successfully.
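(For reference only; the verification above used oc edit.) The same revert can be done non-interactively with a JSON patch that resets the workloads section to the empty object seen in the output above, and the taint can be removed by re-running the taint command with a trailing dash:

   $ oc -n openshift-cnv patch hyperconverged kubevirt-hyperconverged --type=json \
       -p '[{"op": "replace", "path": "/spec/workloads", "value": {}}]'

   $ oc adm taint node kbid25ve-m2n85-worker-0-95qhh worker=load-balancer:NoSchedule-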
---
spec:
  infra:
    nodePlacement:
      nodeSelector:
        node-role.kubernetes.io/master: ""
      tolerations:
      - effect: NoSchedule
        key: node-role.kubernetes.io/master
        operator: Exists
  version: v2.5.0
  workloads:
    nodePlacement:
      nodeSelector:
        node-role.kubernetes.io/worker: ""

Updated the infra and the workloads sections under the hyperconverged CR as shown above.

[kbidarka@localhost must-gather-kbidarka]$ oc get pods -n openshift-cnv | grep hco-operator
hco-operator-656878d67d-8z7gk   0/1   Running   0   175m

Reverted the hyperconverged CR to default values and the hco-operator was back in 1/1 READY state.

[kbidarka@localhost must-gather-kbidarka]$ oc get pods -n openshift-cnv | grep hco-operator
hco-operator-656878d67d-8z7gk   1/1   Running   0   3h1m

Spoke to Simone; he mentioned I could use the above steps to verify this bug. We are now able to revert the changes made to the HCO CR successfully.
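(Additional check, not from the original comment.) After setting the infra nodeSelector to master nodes and the workloads nodeSelector to worker nodes, pod placement can be confirmed by listing the pods with their nodes and comparing against the node role labels:

   $ oc get pods -n openshift-cnv -o wide
   $ oc get nodes -L node-role.kubernetes.io/master -L node-role.kubernetes.io/worker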
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Virtualization 2.5.0 Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2020:5127