Bug 2025401
| Summary: | [TEST ONLY] [CNV+OCS/ODF] Virtualization poison pill implementation | | |
|---|---|---|---|
| Product: | Container Native Virtualization (CNV) | Reporter: | Gobinda Das <godas> |
| Component: | SSP | Assignee: | Boriso <bodnopoz> |
| Status: | CLOSED ERRATA | QA Contact: | Geetika Kapoor <gkapoor> |
| Severity: | high | Docs Contact: | |
| Priority: | high | ||
| Version: | 4.9.0 | CC: | bodnopoz, cnv-qe-bugs, dholler, fdeutsch, gkapoor, msluiter, owasserm, rnetser, yadu, ycui |
| Target Milestone: | --- | ||
| Target Release: | 4.11.0 | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2022-09-14 19:28:23 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
|
Description

Gobinda Das 2021-11-22 05:20:47 UTC

Hi,
We did the first round of testing with a 3-node cluster (masters and workers on the same nodes). Basically nothing worked for us: the failed node was not removed from the cluster, and the VM did not start on another node. The results are captured here: https://docs.google.com/document/d/1SilyExSqgXIth7-a0jtWqNBdYTfohDoA0NV6qAu09dA/edit?usp=sharing

Conclusion:
As of now, PP+NHC do not operate on masters by default, and there is no way to configure Poison Pill to work on masters.

Hi All,
We completed the basic testing with 3 masters and 3 workers and replica 2
Below are the steps and results:
Setup:
OpenShift version 4.9.4
OCS version: 4.8.4
CNV version: 4.9.0
Node Health Check Operator: 0.1.0
Poison Pill Operator: 0.2.0
Baremetal nodes
rhocs-bm1.lab.eng.blr.redhat.com Ready master
rhocs-bm2.lab.eng.blr.redhat.com Ready master
rhocs-bm3.lab.eng.blr.redhat.com Ready master
rhocs-bm7.lab.eng.blr.redhat.com Ready worker
rhocs-bm8.lab.eng.blr.redhat.com Ready worker
rhocs-bm9.lab.eng.blr.redhat.com Ready worker
Steps:
1 - Install OCS and create a StorageCluster
2 - Install CNV and the NodeHealthCheck operator, which installs the Poison Pill operator. The configuration is the default that comes out of the box when installing NHC
3 - Create a VM utilizing the block volume from ceph-rbd storage class
4 - Kill the kubelet service on the node (rhocs-bm7.lab.eng.blr.redhat.com) where the VM is running
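Step 4 above was done by hand; for repeated runs it can be scripted. A minimal, hypothetical helper for building the fault-injection command (the `core` SSH user is an assumption typical of RHCOS nodes, not something stated in this report):

```python
def kubelet_stop_cmd(node, user="core"):
    """Build the SSH command that stops kubelet on a node, simulating the
    node failure injected in step 4. The `core` login is an assumption
    for illustration; adjust to your environment."""
    return ["ssh", f"{user}@{node}", "sudo", "systemctl", "stop", "kubelet.service"]

# Example: the faulty node from this test run; pass the result to
# subprocess.run() from a host with SSH access to the node.
cmd = kubelet_stop_cmd("rhocs-bm7.lab.eng.blr.redhat.com")
```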
Results:
1 - The failed node is detected by NHC
2 - A Poison Pill remediation is created for rhocs-bm7.lab.eng.blr.redhat.com (where the VM is running)
3 - The failed node is then removed from the cluster
4 - The baremetal node goes into a down state but does not reboot
5 - The VM is spawned on another node (rhocs-bm9.lab.eng.blr.redhat.com)
6 - After some time, the failed node object is recreated in case the node comes back up later
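Results 2 and 5 amount to the same VMI coming back Running and Ready on a different node. A small sketch for checking that mechanically from `oc get vmi -A` rows (the seven-column layout is assumed from typical output; the rows below are hypothetical, shaped like the node names in this report):

```python
def parse_vmi_row(row):
    """Split one data row of `oc get vmi -A` output into its columns:
    NAMESPACE NAME AGE PHASE IP NODENAME READY."""
    ns, name, age, phase, ip, node, ready = row.split()
    return {"name": name, "phase": phase, "node": node, "ready": ready == "True"}

def failed_over(before_row, after_row):
    """True when the same VMI is Running and Ready again on a different node."""
    before, after = parse_vmi_row(before_row), parse_vmi_row(after_row)
    return (before["name"] == after["name"]
            and after["phase"] == "Running"
            and after["ready"]
            and before["node"] != after["node"])

# Hypothetical rows for the failover observed in this test:
before = "default testvm 5m Running 10.131.0.229 rhocs-bm7.lab.eng.blr.redhat.com True"
after  = "default testvm 4m Running 10.128.2.79  rhocs-bm9.lab.eng.blr.redhat.com True"
print(failed_over(before, after))  # True
```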
All related details are attached in doc [1]
[1] https://docs.google.com/document/d/1BZNUC8P_CEE5oOoMObpYTvsuqzzvhheRMQu2P3I7ZPM/edit?usp=sharing
(In reply to Gobinda Das from comment #1)
> Hi, We did the first round of testing with 3 nodes cluster (Master+Workers
> are in same node). So basically nothing works for us (neither node moved out
> from cluster nor VM started in another node). Here
> https://docs.google.com/document/d/1SilyExSqgXIth7-a0jtWqNBdYTfohDoA0NV6qAu09dA/edit?usp=sharing
> the results are captured.
>
> Conclusion:
> As of now by default PP+NHC do not operate on masters. And there is no way
> we can configure to work Poison Pill on master.

This is tracked in https://issues.redhat.com/browse/ECOPROJECT-116

All the possible cases are outlined here: https://www.medik8s.io/remediation/poison-pill/how-it-works/ and here is an explanation of how to test them: https://docs.google.com/document/d/1EgwV3MH-JaBa-8N5KR0MPaQ2ZRr6QhCPdDHymT5hgSo/edit?usp=sharing

Hi Fabian, could you please take a look? Do we need to move this to the Virt component?

Gobinda, what is the specific thing you'd like to see tested?

(In reply to Fabian Deutsch from comment #6)
> Gobinda, what is the specific thing you'd like to see tested?

Fabian, we have done the basic testing and are planning to do some more tests as mentioned in https://docs.google.com/document/d/1EgwV3MH-JaBa-8N5KR0MPaQ2ZRr6QhCPdDHymT5hgSo/edit
Planning to resume testing next week, as the servers are occupied with some other testing.

Test Environment:
Set up the Node Health Check operator. By default it is installed for all namespaces. [nhc_setup]

Make sure the deployment succeeded:

$ oc get csv -n node-health-check
NAME                               DISPLAY                      VERSION   REPLACES   PHASE
node-healthcheck-operator.v0.2.0   Node Health Check Operator   0.2.0                Succeeded
poison-pill.v0.3.0                 Poison Pill Operator         0.3.0                Succeeded

Test Case 1: Start a VM. Stop the kubelet service and make sure the VM is able to fail over and move to some other worker node.

1. Trigger a VM. Make sure it is running.

$ oc get vmi -A
NAMESPACE           NAME                        AGE     PHASE     IP             NODENAME                             READY
node-health-check   nhc-vm-1656964433-7484694   5m53s   Running   10.131.0.229   c01-gk411-nhc-ttpt5-worker-0-4wqtz   True

2. sudo systemctl stop kubelet.service

3. Wait for the kubelet service to start again on its own.

4. Connectivity to the VM is lost in the meantime.

$ oc logs virt-launcher-nhc-vm-1656964433-7484694-2vw9f -n node-health-check
Error from server: Get "https://192.168.0.152:10250/containerLogs/node-health-check/virt-launcher-nhc-vm-1656964433-7484694-2vw9f/compute": dial tcp 192.168.0.152:10250: connect: connection refused

[cnv-qe-jenkins@c01-gk411-nhc-ttpt5-executor ~]$ oc get vmi -A
NAMESPACE           NAME                        AGE   PHASE     IP             NODENAME                             READY
node-health-check   nhc-vm-1656964433-7484694   10m   Running   10.131.0.229   c01-gk411-nhc-ttpt5-worker-0-4wqtz   False

5. Poison Pill tries to find the faulty node and starts the peer health check.

$ oc logs poison-pill-ds-ntfhv -n node-health-check
I0704 19:54:20.070978 3425096 request.go:655] Throttling request took 1.037833139s, request: GET:https://172.30.0.1:443/apis/network.operator.openshift.io/v1?timeout=32s
2022-07-04T19:54:23.180Z INFO controller-runtime.metrics metrics server is starting to listen {"addr": ":8080"}
2022-07-04T19:54:23.180Z INFO setup Starting as a poison pill agent that should run as part of the daemonset
2022-07-04T19:54:23.261Z INFO setup Time to assume that unhealthy node has been rebooted {"time": "3m5s"}
2022-07-04T19:54:23.261Z INFO setup init grpc server
2022-07-04T19:54:23.261Z INFO setup starting manager
2022-07-04T19:54:23.262Z INFO controller-runtime.manager starting metrics server {"path": "/metrics"}
2022-07-04T19:54:23.262Z INFO watchdog watchdog started
2022-07-04T19:54:23.262Z INFO api-check api connectivity check started
2022-07-04T19:54:23.262Z INFO controller-runtime.manager.controller.poisonpillremediation Starting EventSource {"reconciler group": "poison-pill.medik8s.io", "reconciler kind": "PoisonPillRemediation", "source": "kind source: /, Kind="}
2022-07-04T19:54:23.362Z INFO peers peers started
2022-07-04T19:54:23.363Z INFO controller-runtime.manager.controller.poisonpillremediation Starting Controller {"reconciler group": "poison-pill.medik8s.io", "reconciler kind": "PoisonPillRemediation"}
2022-07-04T19:54:23.463Z INFO controller-runtime.manager.controller.poisonpillremediation Starting workers {"reconciler group": "poison-pill.medik8s.io", "reconciler kind": "PoisonPillRemediation", "worker count": 1}
2022-07-04T19:54:23.464Z INFO peerhealth.server peer health server started

6. The VMI is still reported on the failed node and not ready.

$ oc get vmi -A
NAMESPACE           NAME                        AGE   PHASE     IP             NODENAME                             READY
node-health-check   nhc-vm-1656964433-7484694   15m   Running   10.131.0.229   c01-gk411-nhc-ttpt5-worker-0-4wqtz   False

7. The VM is retriggered on a different node.

$ oc get vmi -A
NAMESPACE           NAME                        AGE     PHASE     IP            NODENAME                             READY
node-health-check   nhc-vm-1656964433-7484694   3m59s   Running   10.128.2.79   c01-gk411-nhc-ttpt5-worker-0-57zp2   True

8. Later, the kubelet service also gets started on the worker.

$ uptime
 20:08:50 up 2 min, 1 user, load average: 3.13, 1.04, 0.38

$ systemctl status kubelet.service
● kubelet.service - Kubernetes Kubelet
   Loaded: loaded (/etc/systemd/system/kubelet.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/kubelet.service.d
           └─10-mco-default-madv.conf, 20-logging.conf, 20-nodenet.conf, 20-openstack-node-name.conf
   Active: active (running) since Mon 2022-07-04 20:08:08 UTC; 45s ago
  Process: 2800 ExecStartPre=/bin/rm -f /var/lib/kubelet/memory_manager_state (code=exited, status=0/SUCCESS)
  Process: 2798 ExecStartPre=/bin/rm -f /var/lib/kubelet/cpu_manager_state (code=exited, status=0/SUCCESS)
  Process: 2796 ExecStartPre=/bin/mkdir --parents /etc/kubernetes/manifests (code=exited, status=0/SUCCESS)
 Main PID: 2802 (kubelet)
    Tasks: 20 (limit: 101913)
   Memory: 170.1M
      CPU: 3.567s

Good to see this verified.
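The peer health check seen in step 5 follows the decision flow described on the medik8s "how it works" page linked earlier: an agent that loses the API server asks its peers whether the cluster considers its node unhealthy before fencing itself with the watchdog. A simplified sketch of that decision as we understand it; the verdict strings and exact rules here are illustrative assumptions, not the operator's real internals:

```python
def should_self_fence(api_reachable, peer_verdicts):
    """Decide whether a poison-pill agent should reboot its own node.

    api_reachable -- whether this agent could reach the API server
    peer_verdicts -- answers from reachable peers: "healthy", "unhealthy",
                     or "api-error" (that peer cannot reach the API either);
                     an empty list means no peer responded at all.
    Hypothetical model of the documented flow, for illustration only.
    """
    if api_reachable:
        return False  # node can see the control plane; nothing to do
    if not peer_verdicts:
        return True   # totally isolated: assume this node is the problem
    if all(v == "api-error" for v in peer_verdicts):
        return False  # peers also lost the API: likely an API outage, not us
    return "unhealthy" in peer_verdicts  # a peer saw a remediation for this node
```

Under this model, stopping kubelet on a worker leaves the agent unable to confirm health, its peers report the remediation created by NHC, and the node self-fences, which is consistent with the failover observed above.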
@gkapoor please provide some details on the cluster (node count) that this got tested with.

The test environment used for testing is a 6-node cluster (3 workers, 3 masters). This is not expected to work with SNO or a compact cluster (masters and workers sharing nodes), as per discussion with the Poison Pill team.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Virtualization 4.11.0 Images security and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:6526