Bug 2025401

Summary: [TEST ONLY] [CNV+OCS/ODF] Virtualization poison pill implementation
Product: Container Native Virtualization (CNV)
Reporter: Gobinda Das <godas>
Component: SSP
Assignee: Boriso <bodnopoz>
Status: CLOSED ERRATA
QA Contact: Geetika Kapoor <gkapoor>
Severity: high
Priority: high
Docs Contact:
Version: 4.9.0
CC: bodnopoz, cnv-qe-bugs, dholler, fdeutsch, gkapoor, msluiter, owasserm, rnetser, yadu, ycui
Target Milestone: ---
Target Release: 4.11.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2022-09-14 19:28:23 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Gobinda Das 2021-11-22 05:20:47 UTC
Description of problem:
  In an HCI environment (CNV+ODF), we want to see whether the automated poison pill approach could cause ODF mon quorum loss or VM outages for OCP Virt. Or has it been tested with these products already?

NHC/PP only kicks in once there has been a failure, so ODF would have already lost quorum (because the mon already failed or was unreachable), and we're trying to get it back by recovering the node. The same goes for the VMs: either the VM is already dead or unreachable, and remediation gives you the chance to bring it up somewhere else.

The most common reasons that fencing might "create downtime" are admins setting overly aggressive timeouts (causing NHC/PP to react to every little blip) and under-speccing the machines (causing the process that keeps the watchdog alive to be starved of CPU).

We can't control the hardware, but we have some rules in place for configuring timeouts, and have more planned for future releases.
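The timeout rules mentioned above live in the NodeHealthCheck custom resource. As a hedged sketch only (field names follow the medik8s v1alpha1 API as documented at the time; the resource names, durations, and namespace here are illustrative assumptions, not the tested configuration), a conservative setup that tolerates short blips could look like:

```yaml
# Hypothetical conservative NodeHealthCheck configuration (illustrative values).
apiVersion: remediation.medik8s.io/v1alpha1
kind: NodeHealthCheck
metadata:
  name: nhc-workers          # assumed name
spec:
  # Stop remediating if fewer than 51% of selected nodes are healthy,
  # to avoid fencing storms taking out quorum.
  minHealthy: 51%
  selector:
    matchExpressions:
      - key: node-role.kubernetes.io/worker
        operator: Exists
  unhealthyConditions:
    # Long durations = tolerate transient blips before declaring a node unhealthy.
    - type: Ready
      status: "False"
      duration: 300s
    - type: Ready
      status: Unknown
      duration: 300s
  remediationTemplate:
    apiVersion: poison-pill.medik8s.io/v1alpha1
    kind: PoisonPillRemediationTemplate
    namespace: openshift-operators          # assumed install namespace
    name: poison-pill-default-template      # assumed template name
```

The key knobs for the "overly aggressive timeouts" failure mode are the `duration` fields and `minHealthy`; shortening the durations makes NHC react to every blip.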


Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 Gobinda Das 2021-11-22 05:28:04 UTC
Hi,
We did the first round of testing with a 3-node cluster (masters and workers on the same nodes). Basically nothing worked for us: the node was neither removed from the cluster nor was the VM restarted on another node. The results are captured here: https://docs.google.com/document/d/1SilyExSqgXIth7-a0jtWqNBdYTfohDoA0NV6qAu09dA/edit?usp=sharing

Conclusion:
As of now, PP+NHC do not operate on masters by default, and there is no way to configure Poison Pill to work on masters.

Comment 2 Gobinda Das 2021-11-22 05:35:07 UTC
Hi All,
 We completed the basic testing with 3 masters, 3 workers, and replica 2.
Below are the steps and results:
Setup:
   OpenShift version 4.9.4
   OCS version: 4.8.4
   CNV version: 4.9.0
   Node Health Check Operator: 0.1.0
   Poison Pill Operator: 0.2.0

Baremetal nodes  
   rhocs-bm1.lab.eng.blr.redhat.com   Ready    master
   rhocs-bm2.lab.eng.blr.redhat.com   Ready    master  
   rhocs-bm3.lab.eng.blr.redhat.com   Ready    master  
   rhocs-bm7.lab.eng.blr.redhat.com   Ready    worker  
   rhocs-bm8.lab.eng.blr.redhat.com   Ready    worker  
   rhocs-bm9.lab.eng.blr.redhat.com   Ready    worker 
 
Steps:
   1 - Install OCS and create a StorageCluster
   2 - Install CNV and the NodeHealthCheck operator, which installs the Poison Pill operator. The configuration is the default that comes out of the box when installing the NHC operator
   3 - Create a VM utilizing a block volume from the ceph-rbd storage class
   4 - Kill the kubelet service of the node (rhocs-bm7.lab.eng.blr.redhat.com) where the VM is running.
   
Results:
    1 - The failed node is detected by NHC
    2 - A Poison Pill remediation is created for rhocs-bm7.lab.eng.blr.redhat.com (where the VM is running)
    3 - The failed node is then removed from the cluster
    4 - The baremetal node goes into a down state but does not reboot
    5 - The VM is restarted on another node (rhocs-bm9.lab.eng.blr.redhat.com)
    6 - After some time, the failed node object is recreated so that the node can rejoin if it comes back up later

All  related details are attached in doc [1]

[1] https://docs.google.com/document/d/1BZNUC8P_CEE5oOoMObpYTvsuqzzvhheRMQu2P3I7ZPM/edit?usp=sharing
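Step 3 above can be sketched as a VM backed by a Block-mode PVC on the ceph-rbd storage class. This is a hypothetical minimal example, not taken from the test run: the VM name, PVC name, disk size, and the `ocs-storagecluster-ceph-rbd` class name are assumptions.

```yaml
# Hedged sketch of step 3: a VM on a ceph-rbd block volume (names illustrative).
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: test-vm                      # assumed name
spec:
  running: true
  template:
    spec:
      domain:
        devices:
          disks:
            - name: rootdisk
              disk:
                bus: virtio
        resources:
          requests:
            memory: 2Gi
      volumes:
        - name: rootdisk
          persistentVolumeClaim:
            claimName: test-vm-rootdisk
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-vm-rootdisk
spec:
  storageClassName: ocs-storagecluster-ceph-rbd   # assumed ODF/OCS class name
  volumeMode: Block
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 20Gi
```

`ReadWriteMany` access with `volumeMode: Block` is what allows the VM to be restarted on another node while the original node is still being fenced.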

Comment 3 Gobinda Das 2021-11-22 05:35:48 UTC
(In reply to Gobinda Das from comment #1)
> Hi,
> We did the first round of testing with 3 nodes cluster(Master+Workers are in
> same node) . So basically nothing works for us(Neither node moved out from
> cluster nor VM started in another node to another node). Here
> https://docs.google.com/document/d/1SilyExSqgXIth7-
> a0jtWqNBdYTfohDoA0NV6qAu09dA/edit?usp=sharing the results are captured.
> 
> Conclusion:
> As of now by default PP+NHC do not operate on masters. And there is no way
> we can configure to work Poison Pill on master.

This is tracked in
https://issues.redhat.com/browse/ECOPROJECT-116

Comment 4 Gobinda Das 2021-11-22 05:36:56 UTC
All the possible cases which are outlined here:
 https://www.medik8s.io/remediation/poison-pill/how-it-works/

and here's an explanation on how to test them:
   https://docs.google.com/document/d/1EgwV3MH-JaBa-8N5KR0MPaQ2ZRr6QhCPdDHymT5hgSo/edit?usp=sharing

Comment 5 Yan Du 2021-11-24 13:33:09 UTC
Hi, Fabian, could you please take a look? Do we need to move Virt component?

Comment 6 Fabian Deutsch 2021-11-25 14:37:03 UTC
Gobinda, what is the specific thing you'd like to see tested?

Comment 7 Gobinda Das 2021-12-09 07:41:29 UTC
(In reply to Fabian Deutsch from comment #6)
> Gobinda, what is the specific thing you'd like to see tested?

Fabian, We have done the basic testing and planning to do some more tests as mentioned in https://docs.google.com/document/d/1EgwV3MH-JaBa-8N5KR0MPaQ2ZRr6QhCPdDHymT5hgSo/edit

Planning to resume testing next week, as the servers are occupied with some other testing.

Comment 9 Geetika Kapoor 2022-07-04 20:26:26 UTC
Test Environment:

Set up the Node Health Check operator. By default it is installed for all namespaces. [nhc_setup]
Make sure the deployment succeeds:

$ oc get csv -n node-health-check
NAME                               DISPLAY                      VERSION   REPLACES   PHASE
node-healthcheck-operator.v0.2.0   Node Health Check Operator   0.2.0                Succeeded
poison-pill.v0.3.0                 Poison Pill Operator         0.3.0                Succeeded


Test Case 1: Start a VM. Stop the kubelet service and make sure the VM is able to fail over to some other worker node.

1. Start the VM. Make sure it is running.

$ oc get vmi -A
NAMESPACE           NAME                        AGE     PHASE     IP             NODENAME                             READY
node-health-check   nhc-vm-1656964433-7484694   5m53s   Running   10.131.0.229   c01-gk411-nhc-ttpt5-worker-0-4wqtz   True


2. sudo systemctl stop kubelet.service
3. Wait for the kubelet service to start again on its own.
4. Connectivity to the VM is lost in the meantime.

$ oc logs virt-launcher-nhc-vm-1656964433-7484694-2vw9f -n node-health-check
Error from server: Get "https://192.168.0.152:10250/containerLogs/node-health-check/virt-launcher-nhc-vm-1656964433-7484694-2vw9f/compute": dial tcp 192.168.0.152:10250: connect: connection refused
[cnv-qe-jenkins@c01-gk411-nhc-ttpt5-executor ~]$ oc get vmi -A
NAMESPACE           NAME                        AGE   PHASE     IP             NODENAME                             READY
node-health-check   nhc-vm-1656964433-7484694   10m   Running   10.131.0.229   c01-gk411-nhc-ttpt5-worker-0-4wqtz   False

5. Poison Pill tries to identify the faulty node and starts the peer health check

$ oc logs poison-pill-ds-ntfhv  -n node-health-check
I0704 19:54:20.070978 3425096 request.go:655] Throttling request took 1.037833139s, request: GET:https://172.30.0.1:443/apis/network.operator.openshift.io/v1?timeout=32s
2022-07-04T19:54:23.180Z	INFO	controller-runtime.metrics	metrics server is starting to listen	{"addr": ":8080"}
2022-07-04T19:54:23.180Z	INFO	setup	Starting as a poison pill agent that should run as part of the daemonset
2022-07-04T19:54:23.261Z	INFO	setup	Time to assume that unhealthy node has been rebooted	{"time": "3m5s"}
2022-07-04T19:54:23.261Z	INFO	setup	init grpc server
2022-07-04T19:54:23.261Z	INFO	setup	starting manager
2022-07-04T19:54:23.262Z	INFO	controller-runtime.manager	starting metrics server	{"path": "/metrics"}
2022-07-04T19:54:23.262Z	INFO	watchdog	watchdog started
2022-07-04T19:54:23.262Z	INFO	api-check	api connectivity check started
2022-07-04T19:54:23.262Z	INFO	controller-runtime.manager.controller.poisonpillremediation	Starting EventSource	{"reconciler group": "poison-pill.medik8s.io", "reconciler kind": "PoisonPillRemediation", "source": "kind source: /, Kind="}
2022-07-04T19:54:23.362Z	INFO	peers	peers started
2022-07-04T19:54:23.363Z	INFO	controller-runtime.manager.controller.poisonpillremediation	Starting Controller	{"reconciler group": "poison-pill.medik8s.io", "reconciler kind": "PoisonPillRemediation"}
2022-07-04T19:54:23.463Z	INFO	controller-runtime.manager.controller.poisonpillremediation	Starting workers	{"reconciler group": "poison-pill.medik8s.io", "reconciler kind": "PoisonPillRemediation", "worker count": 1}
2022-07-04T19:54:23.464Z	INFO	peerhealth.server	peer health server started


6. $ oc get vmi -A
NAMESPACE           NAME                        AGE   PHASE     IP             NODENAME                             READY
node-health-check   nhc-vm-1656964433-7484694   15m   Running   10.131.0.229   c01-gk411-nhc-ttpt5-worker-0-4wqtz   False


7. The VM is restarted on a different node.

$ oc get vmi -A
NAMESPACE           NAME                        AGE     PHASE     IP            NODENAME                             READY
node-health-check   nhc-vm-1656964433-7484694   3m59s   Running   10.128.2.79   c01-gk411-nhc-ttpt5-worker-0-57zp2   True

8. Later, the kubelet service also starts again on the worker.

$ uptime
 20:08:50 up 2 min,  1 user,  load average: 3.13, 1.04, 0.38
$ systemctl status kubelet.service
● kubelet.service - Kubernetes Kubelet
   Loaded: loaded (/etc/systemd/system/kubelet.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/kubelet.service.d
           └─10-mco-default-madv.conf, 20-logging.conf, 20-nodenet.conf, 20-openstack-node-name.conf
   Active: active (running) since Mon 2022-07-04 20:08:08 UTC; 45s ago
  Process: 2800 ExecStartPre=/bin/rm -f /var/lib/kubelet/memory_manager_state (code=exited, status=0/SUCCESS)
  Process: 2798 ExecStartPre=/bin/rm -f /var/lib/kubelet/cpu_manager_state (code=exited, status=0/SUCCESS)
  Process: 2796 ExecStartPre=/bin/mkdir --parents /etc/kubernetes/manifests (code=exited, status=0/SUCCESS)
 Main PID: 2802 (kubelet)
    Tasks: 20 (limit: 101913)
   Memory: 170.1M
      CPU: 3.567s
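The remediation object referenced in the controller logs above (group poison-pill.medik8s.io, kind PoisonPillRemediation) is created per unhealthy node, named after it. A hedged sketch of its shape, inferred from the log output rather than dumped from the cluster (the namespace and empty spec are assumptions):

```yaml
# Hypothetical shape of the remediation CR NHC creates in step 5.
apiVersion: poison-pill.medik8s.io/v1alpha1
kind: PoisonPillRemediation
metadata:
  name: c01-gk411-nhc-ttpt5-worker-0-4wqtz   # named after the unhealthy node
  namespace: node-health-check               # assumed namespace
spec: {}
```

Deleting this CR (or the node recovering, as in step 8) ends the remediation cycle.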

Comment 11 Fabian Deutsch 2022-07-06 12:48:05 UTC
Good to see this verified.

@gkapoor please provide some details on the cluster (node count) that this got tested with

Comment 12 Geetika Kapoor 2022-07-06 15:03:05 UTC
The test environment is a 6-node cluster (3 masters, 3 workers). This is not expected to work with SNO or compact clusters (masters and workers sharing nodes), per discussion with the Poison Pill team.

Comment 14 errata-xmlrpc 2022-09-14 19:28:23 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: OpenShift Virtualization 4.11.0 Images security and bug fix update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:6526