Bug 2156902 - VM latency checkup - Checkup not performing a teardown in case of setup failure
Summary: VM latency checkup - Checkup not performing a teardown in case of setup failure
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Container Native Virtualization (CNV)
Classification: Red Hat
Component: Networking
Version: 4.12.0
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: low
Target Milestone: ---
Target Release: 4.13.0
Assignee: Edward Haas
QA Contact: awax
URL:
Whiteboard:
Depends On:
Blocks: 2159397
 
Reported: 2022-12-29 11:27 UTC by awax
Modified: 2023-05-18 02:56 UTC
CC List: 3 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 2159397 (view as bug list)
Environment:
Last Closed: 2023-05-18 02:56:36 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github kiagnose kiagnose pull 239 0 None Merged vmlatency, checkup, setup: Delete VMI/s if setup fails 2023-01-25 13:38:38 UTC
Red Hat Issue Tracker CNV-23732 0 None None None 2022-12-29 11:35:21 UTC
Red Hat Product Errata RHSA-2023:3205 0 None None None 2023-05-18 02:56:53 UTC

Description awax 2022-12-29 11:27:12 UTC
Description of problem:
When a checkup encounters a setup failure, the resources created by the job are not deleted.

Version-Release number of selected component (if applicable):
kubevirt-hyperconverged-operator.v4.12.0   OpenShift Virtualization                         4.12.0     kubevirt-hyperconverged-operator.v4.11.1   Succeeded

Client Version: 4.12.0-rc.6
Kustomize Version: v4.5.7
Server Version: 4.12.0-rc.6
Kubernetes Version: v1.25.4+77bec7a


How reproducible:
Create a checkup ConfigMap with a nonexistent node specified as the source node. The source virt-launcher pod stays in Pending, never reaches a Running state, and the actual checkup never starts.

Steps to Reproduce:
1. Create a namespace:
oc new-project test-latency

2. Create a Bridge with this yaml:
apiVersion: nmstate.io/v1
kind: NodeNetworkConfigurationPolicy
metadata:
  name: br10
spec:
  desiredState:
    interfaces:
    - bridge:
        options:
          stp:
            enabled: false
        port:
        - name: ens9
      ipv4:
        auto-dns: true
        dhcp: false
        enabled: false
      ipv6:
        auto-dns: true
        autoconf: false
        dhcp: false
        enabled: false
      name: br10
      state: up
      type: linux-bridge
  nodeSelector:
    node-role.kubernetes.io/worker: ''
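
Optionally, verify the policy was applied before moving on; a minimal sketch using plain oc (the NNCP name br10 comes from the YAML above):

# Wait until the policy reports Available on the worker nodes
oc wait nncp br10 --for=condition=Available --timeout=2m
# If it does not converge, inspect the per-node enactments
oc get nnce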

3. Create a NAD with this yaml:
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: bridge-network-nad
spec:
  config: |
    {
      "cniVersion": "0.3.1",
      "name": "br10",
      "plugins": [
        {
          "type": "cnv-bridge",
          "bridge": "br10"
        }
      ]
    }
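
A quick sanity check that the NAD was created in the checkup namespace (a sketch, assuming the test-latency namespace from step 1):

oc get network-attachment-definitions bridge-network-nad -n test-latency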


4. Create a ServiceAccount, Roles, and RoleBindings:
cat <<EOF | kubectl apply  -f -
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: vm-latency-checkup-sa
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: kubevirt-vm-latency-checker
rules:
- apiGroups: ["kubevirt.io"]
  resources: ["virtualmachineinstances"]
  verbs: ["get", "create", "delete"]
- apiGroups: ["subresources.kubevirt.io"]
  resources: ["virtualmachineinstances/console"]
  verbs: ["get"]
- apiGroups: ["k8s.cni.cncf.io"]
  resources: ["network-attachment-definitions"]
  verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: kubevirt-vm-latency-checker
subjects:
- kind: ServiceAccount
  name: vm-latency-checkup-sa
roleRef:
  kind: Role
  name: kubevirt-vm-latency-checker
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: kiagnose-configmap-access
rules:
- apiGroups: [ "" ]
  resources: [ "configmaps" ]
  verbs: ["get", "update"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: kiagnose-configmap-access
subjects:
- kind: ServiceAccount
  name: vm-latency-checkup-sa
roleRef:
  kind: Role
  name: kiagnose-configmap-access
  apiGroup: rbac.authorization.k8s.io
EOF
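
To confirm the ServiceAccount received the expected permissions before running the checkup, a minimal check (both commands should print "yes"; the namespace is test-latency from step 1):

oc auth can-i create virtualmachineinstances.kubevirt.io -n test-latency \
  --as=system:serviceaccount:test-latency:vm-latency-checkup-sa
oc auth can-i update configmaps -n test-latency \
  --as=system:serviceaccount:test-latency:vm-latency-checkup-sa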


5. Create the ConfigMap with the "spec.param.max_desired_latency_milliseconds" field set to 0 and "spec.param.source_node" set to a nonexistent node:
cat <<EOF | kubectl apply -f -
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: kubevirt-vm-latency-checkup-config
data:
  spec.timeout: 5m
  spec.param.network_attachment_definition_namespace: "manual-latency-check"
  spec.param.network_attachment_definition_name: "bridge-network-nad"
  spec.param.max_desired_latency_milliseconds: "0"
  spec.param.sample_duration_seconds: "5"
  spec.param.source_node: non-existent-node
  spec.param.target_node: cnv-qe-14.cnvqe.lab.eng.rdu2.redhat.com
EOF
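
For reference, kiagnose writes the checkup result back into this ConfigMap once the job ends; a sketch for reading it back (the status.* key names follow kiagnose conventions and may differ between versions):

oc get configmap kubevirt-vm-latency-checkup-config -n test-latency -o yaml
# or only the result keys (note the escaped dots in the jsonpath)
oc get configmap kubevirt-vm-latency-checkup-config -n test-latency \
  -o jsonpath='{.data.status\.succeeded}{"\n"}{.data.status\.failureReason}{"\n"}'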


6. Create a job:
cat <<EOF | kubectl apply -f -
---
apiVersion: batch/v1
kind: Job
metadata:
  name: kubevirt-vm-latency-checkup
spec:
  backoffLimit: 0
  template:
    spec:
      serviceAccountName: vm-latency-checkup-sa
      restartPolicy: Never
      containers:
        - name: vm-latency-checkup
          image: brew.registry.redhat.io/rh-osbs/container-native-virtualization-vm-network-latency-checkup:v4.12.0
          securityContext:
            runAsUser: 1000
            allowPrivilegeEscalation: false
            capabilities:
              drop: ["ALL"]
            runAsNonRoot: true
            seccompProfile:
              type: "RuntimeDefault"
          env:
            - name: CONFIGMAP_NAMESPACE
              value: test-latency
            - name: CONFIGMAP_NAME
              value: kubevirt-vm-latency-checkup-config
EOF
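
To follow the run, the job can be waited on and its logs inspected; a small sketch (with this misconfiguration the job is expected to end up Failed rather than Complete):

# spec.timeout is 5m, so give the wait a bit more than that
oc wait job/kubevirt-vm-latency-checkup -n test-latency --for=condition=Failed --timeout=6m
oc logs job/kubevirt-vm-latency-checkup -n test-latency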

Actual results:
When the job is deleted, the pods and VMIs are not deleted:
oc get all
NAME                                           READY   STATUS    RESTARTS   AGE
pod/latency-nonexistent-node-job-qt4wk         0/1     Error     0          74m
pod/virt-launcher-latency-check-source-4fqgk   0/2     Pending   0          74m
pod/virt-launcher-latency-check-target-smj9r   2/2     Running   0          74m

NAME                                     COMPLETIONS   DURATION   AGE
job.batch/latency-nonexistent-node-job   0/1           74m        74m

NAME                                                      AGE   PHASE        IP               NODENAME                                  READY
virtualmachineinstance.kubevirt.io/latency-check-source   74m   Scheduling                                                              False
virtualmachineinstance.kubevirt.io/latency-check-target   74m   Running      192.168.100.20   cnv-qe-14.cnvqe.lab.eng.rdu2.redhat.com   True


Expected results:
All the resources created by the Job are deleted when the job is deleted.
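
Until this is fixed, the leftovers have to be removed manually; a sketch of the workaround, using the VMI names from the output above (deleting the VMIs also removes their virt-launcher pods):

oc delete vmi latency-check-source latency-check-target -n test-latency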

Comment 1 awax 2023-02-07 11:18:38 UTC
Verified on a PSI cluster:

$ oc get csv -A | grep virt
...
openshift-cnv                                      kubevirt-hyperconverged-operator.v4.13.0          OpenShift Virtualization      4.13.0                kubevirt-hyperconverged-operator.v4.11.1   Succeeded

Openshift version: 4.12.0
CNV version: 4.13.0
HCO image: brew.registry.redhat.io/rh-osbs/iib:418191
OCS version: 4.12.0
CNI type: OVNKubernetes
Workers type: virtual

Comment 3 errata-xmlrpc 2023-05-18 02:56:36 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Virtualization 4.13.0 Images security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:3205

