Bug 1889401

Summary: Cannot revert changes after adding nodePlacement to HCO
Product: Container Native Virtualization (CNV)
Component: Installation
Version: 2.5.0
Target Release: 2.5.0
Hardware: Unspecified
OS: Unspecified
Severity: high
Priority: unspecified
Status: CLOSED ERRATA
Reporter: Or Bairey-Sehayek <obaireys>
Assignee: Simone Tiraboschi <stirabos>
QA Contact: Kedar Bidarkar <kbidarka>
CC: cnv-qe-bugs, kbidarka, stirabos
Fixed In Version: hco-bundle-registry-container-v2.5.0-399
Last Closed: 2020-11-17 13:24:56 UTC
Type: Bug
Bug Blocks: 1877698

Description Or Bairey-Sehayek 2020-10-19 15:18:03 UTC
Description of problem:
After adding nodePlacement to the HyperConverged (HCO) CR, the changes cannot be reverted; the update is rejected by the HCO validating webhook (see "Actual results" below).

Version-Release number of selected component (if applicable):
CNV 2.5.0

How reproducible: 100%


Steps to Reproduce:
1. Deploy CNV
2. Edit HCO and add nodePlacement for infra and workloads:

$ oc -n openshift-cnv edit hyperconverged

Change the spec:

spec:
  infra:
    nodePlacement:
      nodeSelector:
        foo: bar
  workloads:
    nodePlacement:
      nodeSelector:
        foo: bar

3. Save the edits and close the editor
4. Revert the edits (re-edit the CR and remove the nodePlacement stanzas added in step 2; see the sketch below for a non-interactive alternative)
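
For reference, a non-interactive way to attempt the same revert is a JSON patch (a sketch, assuming the two nodePlacement stanzas were added exactly as in step 2; while the bug is present this fails with the same webhook error, since a patch is intercepted by the same validating webhook):

$ oc -n openshift-cnv patch hyperconverged kubevirt-hyperconverged --type=json \
    -p '[{"op": "remove", "path": "/spec/infra/nodePlacement"},
         {"op": "remove", "path": "/spec/workloads/nodePlacement"}]'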

Actual results:

error: hyperconvergeds.hco.kubevirt.io "kubevirt-hyperconverged" could not be patched: Internal error occurred: failed calling webhook "validate-hco.kubevirt.io": Post "https://hco-operator-service.openshift-cnv.svc:4343/validate-hco-kubevirt-io-v1beta1-hyperconverged?timeout=30s": no endpoints available for service "hco-operator-service"
You can run `oc replace -f /tmp/oc-edit-ikelt.yaml` to try this update again.
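
The error indicates that the webhook Service has no ready endpoints. For reference, this can be confirmed directly:

$ oc -n openshift-cnv get endpoints hco-operator-service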


Expected results:

The changes saved in step 3 should be reverted successfully.

Additional info:

Happens whether or not there is a schedulable node labelled foo=bar in the cluster.
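
For reference, whether such a labelled node exists can be checked with:

$ oc get nodes -l foo=bar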

Comment 1 Omer Yahud 2020-10-20 07:01:10 UTC
This is an HCO issue and not an SSP issue; moving to the install team.

Comment 2 Simone Tiraboschi 2020-10-23 12:19:07 UTC
Currently the operator and its validating webhook are exposed by a single pod.
If the user applies a change that causes the operator, or indirectly one of its operands, to get stuck waiting for an impossible reconciliation, the operator reports ready=false (currently the readiness probe is the only available communication channel to the OLM).
As a side effect, the webhook is not ready either: a pod that fails its readiness probe is removed from its Service's endpoints, which is exactly the "no endpoints available" error above. This blocks any further user changes, including the revert itself, potentially causing a deadlock.

The proper solution is to stop using the readiness probe to communicate with the OLM and to use the newer operator conditions mechanism instead ( https://github.com/operator-framework/enhancements/blob/master/enhancements/operator-conditions.md ), but this can be done only on OCP 4.7, so CNV 2.6.
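
With that mechanism the operator reports its state through a dedicated OperatorCondition resource instead of its readiness probe, so a stuck reconciliation no longer takes the webhook down with it. A rough sketch of what HCO could then publish (the group/version, condition type, reason, and message here are assumptions drawn from the linked enhancement, not from this bug):

apiVersion: operators.coreos.com/v1
kind: OperatorCondition
metadata:
  name: kubevirt-hyperconverged-operator.v2.6.0
  namespace: openshift-cnv
status:
  conditions:
  - type: Upgradeable
    status: "False"
    reason: ReconcileBlocked
    message: waiting for an operand to finish reconciling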

A quick (but dirty) solution for CNV 2.5 is to expose the webhook on a second pod (using a single binary for simplicity) so that we can rely on two independent readiness probes.
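
An illustrative sketch of that second pod (the names, image, command, and probe path are assumptions for illustration; only port 4343 comes from the error message above, and the real manifests ship in the HCO bundle):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: hco-webhook
  namespace: openshift-cnv
spec:
  replicas: 1
  selector:
    matchLabels:
      name: hco-webhook
  template:
    metadata:
      labels:
        name: hco-webhook
    spec:
      containers:
      - name: hco-webhook
        image: hco-operator        # same binary, started in webhook-only mode (assumption)
        command: ["hco-webhook"]
        ports:
        - containerPort: 4343
        readinessProbe:            # independent of the operator pod's probe, so a stuck
          httpGet:                 # reconciliation no longer empties the webhook endpoints
            path: /readyz
            port: 4343
            scheme: HTTPS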

Comment 3 Kedar Bidarkar 2020-10-28 20:01:01 UTC
Taint a node.
[kbidarka@localhost cnv-tests]$ oc adm taint node kbid25ve-m2n85-worker-0-95qhh worker=load-balancer:NoSchedule
node/kbid25ve-m2n85-worker-0-95qhh tainted

Ensure the taint got applied on that node.
[kbidarka@localhost cnv-tests]$ oc get nodes kbid25ve-m2n85-worker-0-95qhh -o yaml | grep -A 3 taints
--
  taints:
  - effect: NoSchedule
    key: worker
    value: load-balancer

Update the workloads section under the hyperconverged CR.
[kbidarka@localhost cnv-tests]$ oc edit hyperconverged -n openshift-cnv
hyperconverged.hco.kubevirt.io/kubevirt-hyperconverged edited


Ensure the hyperconverged CR got updated successfully.
[kbidarka@localhost cnv-tests]$ oc get hyperconverged kubevirt-hyperconverged -n openshift-cnv -o yaml | grep -A 5 workloads
--
  workloads:
    nodePlacement:
      tolerations:
      - effect: NoSchedule
        key: worker
        operator: Exists

Ensure the change got propagated to the virt-handler daemonset, as an example of a workloads component.
[kbidarka@localhost cnv-tests]$ oc get daemonset virt-handler  -n openshift-cnv -o yaml  | grep -A 6 "tolerations:"
--
      tolerations:
      - key: CriticalAddonsOnly
        operator: Exists
      - effect: NoSchedule
        key: worker
        operator: Exists


Check that the pods got re-created. 
[kbidarka@localhost cnv-tests]$ oc get pods -n openshift-cnv | grep -i virt-handler
virt-handler-6v6pl                                   1/1       Running   0          7m48s
virt-handler-r75wl                                   1/1       Running   0          6m55s
virt-handler-w6cq7                                   1/1       Running   0          7m17s

Ensure the virt-handler pod running on the tainted node also got created, with the tolerations.
[kbidarka@localhost cnv-tests]$ oc get pods virt-handler-w6cq7 -n openshift-cnv -o yaml | grep nodeName 
          fieldPath: spec.nodeName
  nodeName: kbid25ve-m2n85-worker-0-95qhh


The pod got re-created with the tolerations.
[kbidarka@localhost cnv-tests]$ oc get pods virt-handler-w6cq7 -n openshift-cnv -o yaml | grep -A 6 "tolerations:"
--
  tolerations:
  - key: CriticalAddonsOnly
    operator: Exists
  - effect: NoSchedule
    key: worker
    operator: Exists
  - effect: NoExecute

Comment 4 Kedar Bidarkar 2020-10-28 20:08:59 UTC
Continuing from the steps mentioned in comment 3:

We have 3 running virt-handler pods:
(cnv-tests) [kbidarka@localhost cnv-tests]$ oc get pods -n openshift-cnv | grep -i virt-handler
virt-handler-6v6pl                                   1/1       Running   0          21m
virt-handler-r75wl                                   1/1       Running   0          20m
virt-handler-w6cq7                                   1/1       Running   0          21m

We had updated the workloads section above.
(cnv-tests) [kbidarka@localhost cnv-tests]$ oc get hyperconverged kubevirt-hyperconverged -n openshift-cnv -o yaml | grep -A 5 workloads
--
  workloads:
    nodePlacement:
      tolerations:
      - effect: NoSchedule
        key: worker
        operator: Exists

Reverting the workloads section by editing the hyperconverged object.
(cnv-tests) [kbidarka@localhost cnv-tests]$ oc edit hyperconverged kubevirt-hyperconverged -n openshift-cnv
hyperconverged.hco.kubevirt.io/kubevirt-hyperconverged edited

The updates to the workloads section have been reverted successfully.
(cnv-tests) [kbidarka@localhost cnv-tests]$ oc get hyperconverged kubevirt-hyperconverged -n openshift-cnv -o yaml | grep -A 5 workloads
--
  workloads: {}
status:
  conditions:
  - lastHeartbeatTime: "2020-10-28T19:51:24Z"
    lastTransitionTime: "2020-10-21T19:32:23Z"
    message: Reconcile completed successfully


With the tolerations removed from the workloads section of the hyperconverged CR, there are now only 2 virt-handler pods running:
(cnv-tests) [kbidarka@localhost cnv-tests]$ oc get pods -n openshift-cnv | grep -i virt-handler
virt-handler-f2hdn                                   1/1       Running    0          2m9s
virt-handler-qz6rq                                   1/1       Running    0          2m38s


Removing the taint from the node brings all the virt-handler pods back up.
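
For reference, the taint added earlier is removed with the trailing "-" syntax:

$ oc adm taint node kbid25ve-m2n85-worker-0-95qhh worker=load-balancer:NoSchedule-
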
[kbidarka@localhost cnv-tests]$ oc get pods -n openshift-cnv | grep -i virt-handler
virt-handler-8q7dx                                   1/1       Running   0          13m
virt-handler-f2hdn                                   1/1       Running   0          15m
virt-handler-qz6rq                                   1/1       Running   0          15m

---

Summary: The updates to the workloads section have been reverted successfully.

Comment 6 Kedar Bidarkar 2020-11-03 16:31:50 UTC
---
  spec:
    infra:
      nodePlacement:
        nodeSelector:
          node-role.kubernetes.io/master: ""
        tolerations:
        - effect: NoSchedule
          key: node-role.kubernetes.io/master
          operator: Exists
    version: v2.5.0
    workloads:
      nodePlacement:
        nodeSelector:
          node-role.kubernetes.io/worker: ""


Updated the infra and workloads sections under the hyperconverged CR as shown above; the hco-operator pod then reported not-ready (0/1):

[kbidarka@localhost must-gather-kbidarka]$ oc get pods -n openshift-cnv | grep hco-operator
hco-operator-656878d67d-8z7gk                         0/1     Running   0          175m
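
With the fix, the validating webhook is served from its own pod with an independent readiness probe, so edits (including the revert below) still pass validation while hco-operator reports 0/1. Assuming the fixed bundle names that pod with "webhook" in it (an assumption, not shown in this report), it can be spotted with:

$ oc get pods -n openshift-cnv | grep -i webhook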


Reverted the hyperconverged CR to default values and the hco-operator was back in 1/1 READY state.

[kbidarka@localhost must-gather-kbidarka]$ oc get pods -n openshift-cnv | grep hco-operator
hco-operator-656878d67d-8z7gk                         1/1     Running   0          3h1m


Spoke to Simone; he mentioned I could use the above steps to verify this bug.

We are now able to revert the changes made to the HCO CR successfully.

Comment 10 errata-xmlrpc 2020-11-17 13:24:56 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Virtualization 2.5.0 Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2020:5127