Bug 2159397 - VM latency checkup - Checkup not performing a teardown in case of setup failure
Summary: VM latency checkup - Checkup not performing a teardown in case of setup failure
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Container Native Virtualization (CNV)
Classification: Red Hat
Component: Networking
Version: 4.12.0
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: low
Target Milestone: ---
Target Release: 4.12.1
Assignee: Edward Haas
QA Contact: Nir Rozen
URL:
Whiteboard:
Depends On: 2156902
Blocks:
 
Reported: 2023-01-09 12:57 UTC by awax
Modified: 2023-02-28 20:06 UTC
CC: 3 users

Fixed In Version: vm-network-latency-checkup v4.12.1-2
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 2156902
Environment:
Last Closed: 2023-02-28 20:06:27 UTC
Target Upstream Version:
Embargoed:




Links
- GitHub kiagnose/kiagnose pull 241 (Merged): vmlatency, checkup, setup: Delete VMI/s if setup fails (2023-01-25 11:49:46 UTC)
- Red Hat Issue Tracker CNV-24121 (2023-01-09 13:00:47 UTC)
- Red Hat Product Errata RHEA-2023:1023 (2023-02-28 20:06:39 UTC)

Description awax 2023-01-09 12:57:02 UTC
+++ This bug was initially created as a clone of Bug #2156902 +++
Target release: v4.12.1

Description of problem:
When the checkup encounters a setup failure, the components created by the job are not deleted.

Version-Release number of selected component (if applicable):
kubevirt-hyperconverged-operator.v4.12.0   OpenShift Virtualization                         4.12.0     kubevirt-hyperconverged-operator.v4.11.1   Succeeded

Client Version: 4.12.0-rc.6
Kustomize Version: v4.5.7
Server Version: 4.12.0-rc.6
Kubernetes Version: v1.25.4+77bec7a


How reproducible:
Create a checkup ConfigMap with a nonexistent node specified as the source node. The source virt-launcher pod stays in Pending and never reaches Running, so the actual checkup never starts.
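
A quick way to see the stuck setup once the steps below have been run (namespace and VMI name taken from the steps and output further down; a sketch only, not part of the checkup itself):

oc get vmi,pods -n test-latency
oc describe vmi latency-check-source -n test-latency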

Steps to Reproduce:
1. Create a namespace:
oc new-project test-latency

2. Create a Bridge with this yaml:
apiVersion: nmstate.io/v1
kind: NodeNetworkConfigurationPolicy
metadata:
  name: br10
spec:
  desiredState:
    interfaces:
    - bridge:
        options:
          stp:
            enabled: false
        port:
        - name: ens9
      ipv4:
        auto-dns: true
        dhcp: false
        enabled: false
      ipv6:
        auto-dns: true
        autoconf: false
        dhcp: false
        enabled: false
      name: br10
      state: up
      type: linux-bridge
  nodeSelector:
    node-role.kubernetes.io/worker: ''

3. Create a NAD with this yaml:
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: bridge-network-nad
spec:
  config: |
    {
      "cniVersion": "0.3.1",
      "name": "br10",
      "plugins": [
        {
          "type": "cnv-bridge",
          "bridge": "br10"
        }
      ]
    }


4. Create a service account, roles, and role bindings:
cat <<EOF | kubectl apply  -f -
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: vm-latency-checkup-sa
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: kubevirt-vm-latency-checker
rules:
- apiGroups: ["kubevirt.io"]
  resources: ["virtualmachineinstances"]
  verbs: ["get", "create", "delete"]
- apiGroups: ["subresources.kubevirt.io"]
  resources: ["virtualmachineinstances/console"]
  verbs: ["get"]
- apiGroups: ["k8s.cni.cncf.io"]
  resources: ["network-attachment-definitions"]
  verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: kubevirt-vm-latency-checker
subjects:
- kind: ServiceAccount
  name: vm-latency-checkup-sa
roleRef:
  kind: Role
  name: kubevirt-vm-latency-checker
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: kiagnose-configmap-access
rules:
- apiGroups: [ "" ]
  resources: [ "configmaps" ]
  verbs: ["get", "update"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: kiagnose-configmap-access
subjects:
- kind: ServiceAccount
  name: vm-latency-checkup-sa
roleRef:
  kind: Role
  name: kiagnose-configmap-access
  apiGroup: rbac.authorization.k8s.io
EOF


5. Create the ConfigMap, with the "spec.param.max_desired_latency_milliseconds" field set to "0" and "spec.param.source_node" set to a nonexistent node:
cat <<EOF | kubectl apply -f -
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: kubevirt-vm-latency-checkup-config
data:
  spec.timeout: 5m
  spec.param.network_attachment_definition_namespace: "manual-latency-check"
  spec.param.network_attachment_definition_name: "bridge-network-nad"
  spec.param.max_desired_latency_milliseconds: "0"
  spec.param.sample_duration_seconds: "5"
  spec.param.source_node: non-existent-node
  spec.param.target_node: cnv-qe-14.cnvqe.lab.eng.rdu2.redhat.com
EOF


6. Create a job:
cat <<EOF | kubectl apply -f -
---
apiVersion: batch/v1
kind: Job
metadata:
  name: kubevirt-vm-latency-checkup
spec:
  backoffLimit: 0
  template:
    spec:
      serviceAccountName: vm-latency-checkup-sa
      restartPolicy: Never
      containers:
        - name: vm-latency-checkup
          image: brew.registry.redhat.io/rh-osbs/container-native-virtualization-vm-network-latency-checkup:v4.12.0
          securityContext:
            runAsUser: 1000
            allowPrivilegeEscalation: false
            capabilities:
              drop: ["ALL"]
            runAsNonRoot: true
            seccompProfile:
              type: "RuntimeDefault"
          env:
            - name: CONFIGMAP_NAMESPACE
              value: test-latency
            - name: CONFIGMAP_NAME
              value: kubevirt-vm-latency-checkup-config
EOF
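
Once the job runs, the setup failure and the leftover objects can be observed with standard commands; the checkup is also expected to write its status back into the user-supplied ConfigMap (a sketch using the names from the steps above):

oc get job,pods,vmi -n test-latency
oc logs job/kubevirt-vm-latency-checkup -n test-latency
oc get configmap kubevirt-vm-latency-checkup-config -n test-latency -o yaml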

Actual results:
When the job is deleted, the pods and VMIs are not deleted:
oc get all
NAME                                           READY   STATUS    RESTARTS   AGE
pod/latency-nonexistent-node-job-qt4wk         0/1     Error     0          74m
pod/virt-launcher-latency-check-source-4fqgk   0/2     Pending   0          74m
pod/virt-launcher-latency-check-target-smj9r   2/2     Running   0          74m

NAME                                     COMPLETIONS   DURATION   AGE
job.batch/latency-nonexistent-node-job   0/1           74m        74m

NAME                                                      AGE   PHASE        IP               NODENAME                                  READY
virtualmachineinstance.kubevirt.io/latency-check-source   74m   Scheduling                                                              False
virtualmachineinstance.kubevirt.io/latency-check-target   74m   Running      192.168.100.20   cnv-qe-14.cnvqe.lab.eng.rdu2.redhat.com   True


Expected results:
All the resources created by the job are deleted when the job is deleted.
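
As a manual workaround, deleting the leftover VMIs also removes their virt-launcher pods; roughly (VMI names taken from the output above, job name from step 6, adjust to the actual job name in use):

oc delete vmi latency-check-source latency-check-target -n test-latency
oc delete job kubevirt-vm-latency-checkup -n test-latency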

Comment 1 awax 2023-02-07 08:21:59 UTC
Verified on IBM BM cluster:
$ oc get clusterversion
NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.12.0-rc.8   True        False         4d16h   Cluster version is 4.12.0-rc.8


$ oc get csv -n openshift-cnv   
kubevirt-hyperconverged-operator.v4.12.1   OpenShift Virtualization                         4.12.1                kubevirt-hyperconverged-operator.v4.11.1   Succeeded
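
With the fixed image, cleanup can be confirmed by re-running the reproduction steps and checking that no latency-check VMIs or virt-launcher pods remain once the checkup job ends, e.g.:

oc get vmi,pods -n test-latency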

Comment 7 errata-xmlrpc 2023-02-28 20:06:27 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Virtualization 4.12.1 Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2023:1023

