Bug 2159397

Summary: VM latency checkup - Checkup not performing a teardown in case of setup failure
Product: Container Native Virtualization (CNV)
Reporter: awax
Component: Networking
Assignee: Edward Haas <edwardh>
Status: CLOSED ERRATA
QA Contact: Nir Rozen <nrozen>
Severity: low
Docs Contact:
Priority: low
Version: 4.12.0
CC: edwardh, phoracek, ralavi
Target Milestone: ---
Target Release: 4.12.1
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: vm-network-latency-checkup v4.12.1-2
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 2156902
Environment:
Last Closed: 2023-02-28 20:06:27 UTC
Type: ---
Bug Depends On: 2156902    
Bug Blocks:    

Description awax 2023-01-09 12:57:02 UTC
+++ This bug was initially created as a clone of Bug #2156902 +++
Target release: v4.12.1

Description of problem:
When a checkup encounters a setup failure, the components created by the job are not deleted.

Version-Release number of selected component (if applicable):
kubevirt-hyperconverged-operator.v4.12.0   OpenShift Virtualization                         4.12.0     kubevirt-hyperconverged-operator.v4.11.1   Succeeded

Client Version: 4.12.0-rc.6
Kustomize Version: v4.5.7
Server Version: 4.12.0-rc.6
Kubernetes Version: v1.25.4+77bec7a


How reproducible:
Create a checkup ConfigMap with a nonexistent node specified as the source node. The source virt-launcher pod stays in Pending and never reaches a Running state, so the actual checkup never starts.

Steps to Reproduce:
1. Create a Namespace:
oc new-project test-latency

2. Create a Bridge with this yaml:
apiVersion: nmstate.io/v1
kind: NodeNetworkConfigurationPolicy
metadata:
  name: br10
spec:
  desiredState:
    interfaces:
    - bridge:
        options:
          stp:
            enabled: false
        port:
        - name: ens9
      ipv4:
        auto-dns: true
        dhcp: false
        enabled: false
      ipv6:
        auto-dns: true
        autoconf: false
        dhcp: false
        enabled: false
      name: br10
      state: up
      type: linux-bridge
  nodeSelector:
    node-role.kubernetes.io/worker: ''
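
The policy can take a short while to apply; an optional way to confirm it reconciled before moving on (not part of the original report, assuming the kubernetes-nmstate shortname nncp is available) is:

$ oc get nncp br10
# wait until the policy reports as successfully configured on the selected worker nodes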

3. Create a NAD with this yaml:
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: bridge-network-nad
spec:
  config: |
    {
      "cniVersion": "0.3.1",
      "name": "br10",
      "plugins": [
        {
          "type": "cnv-bridge",
          "bridge": "br10"
        }
      ]
    }
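
Before continuing, it is worth a quick check (optional, not from the original report) that the NAD exists where the checkup expects it:

$ oc get net-attach-def bridge-network-nad
# the NAD must live in the namespace referenced by
# spec.param.network_attachment_definition_namespace in the checkup ConfigMap (step 5)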


4. Create a service-account, role, and role-binding:
cat <<EOF | kubectl apply  -f -
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: vm-latency-checkup-sa
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: kubevirt-vm-latency-checker
rules:
- apiGroups: ["kubevirt.io"]
  resources: ["virtualmachineinstances"]
  verbs: ["get", "create", "delete"]
- apiGroups: ["subresources.kubevirt.io"]
  resources: ["virtualmachineinstances/console"]
  verbs: ["get"]
- apiGroups: ["k8s.cni.cncf.io"]
  resources: ["network-attachment-definitions"]
  verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: kubevirt-vm-latency-checker
subjects:
- kind: ServiceAccount
  name: vm-latency-checkup-sa
roleRef:
  kind: Role
  name: kubevirt-vm-latency-checker
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: kiagnose-configmap-access
rules:
- apiGroups: [ "" ]
  resources: [ "configmaps" ]
  verbs: ["get", "update"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: kiagnose-configmap-access
subjects:
- kind: ServiceAccount
  name: vm-latency-checkup-sa
roleRef:
  kind: Role
  name: kiagnose-configmap-access
  apiGroup: rbac.authorization.k8s.io
EOF
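
Optionally verify the RBAC objects landed in the test project before wiring up the ConfigMap (verification commands added here, not part of the original report):

$ oc get sa,role,rolebinding -n test-latency
# expect vm-latency-checkup-sa plus the kubevirt-vm-latency-checker and kiagnose-configmap-access roles/rolebindings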


5. Create the ConfigMap, with "spec.param.source_node" set to a nonexistent node and the "spec.param.max_desired_latency_milliseconds" field set to 0:
cat <<EOF | kubectl apply -f -
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: kubevirt-vm-latency-checkup-config
data:
  spec.timeout: 5m
  spec.param.network_attachment_definition_namespace: "manual-latency-check"
  spec.param.network_attachment_definition_name: "bridge-network-nad"
  spec.param.max_desired_latency_milliseconds: "0"
  spec.param.sample_duration_seconds: "5"
  spec.param.source_node: non-existent-node
  spec.param.target_node: cnv-qe-14.cnvqe.lab.eng.rdu2.redhat.com
EOF
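
The checkup writes its outcome back into this same ConfigMap, so it can be inspected again after the job finishes; roughly (the exact status.* keys may differ between versions):

$ oc get configmap kubevirt-vm-latency-checkup-config -n test-latency -o yaml
# after a run, look for keys such as status.succeeded and status.failureReason under data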


6. Create a job:
cat <<EOF | kubectl apply -f -
---
apiVersion: batch/v1
kind: Job
metadata:
  name: kubevirt-vm-latency-checkup
spec:
  backoffLimit: 0
  template:
    spec:
      serviceAccountName: vm-latency-checkup-sa
      restartPolicy: Never
      containers:
      - name: vm-latency-checkup
        image: brew.registry.redhat.io/rh-osbs/container-native-virtualization-vm-network-latency-checkup:v4.12.0
        securityContext:
          runAsUser: 1000
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
          runAsNonRoot: true
          seccompProfile:
            type: "RuntimeDefault"
        env:
        - name: CONFIGMAP_NAMESPACE
          value: test-latency
        - name: CONFIGMAP_NAME
          value: kubevirt-vm-latency-checkup-config
EOF
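
To follow the run, watch the job, its pod, and the checkup VMIs, and pull the job log on failure (commands added for convenience, not taken from the original report):

$ oc get job,pod,vmi -n test-latency
$ oc logs job/kubevirt-vm-latency-checkup -n test-latency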

Actual results:
When the job is deleted, the pods and VMIs are not deleted:
oc get all
NAME                                           READY   STATUS    RESTARTS   AGE
pod/latency-nonexistent-node-job-qt4wk         0/1     Error     0          74m
pod/virt-launcher-latency-check-source-4fqgk   0/2     Pending   0          74m
pod/virt-launcher-latency-check-target-smj9r   2/2     Running   0          74m

NAME                                     COMPLETIONS   DURATION   AGE
job.batch/latency-nonexistent-node-job   0/1           74m        74m

NAME                                                      AGE   PHASE        IP               NODENAME                                  READY
virtualmachineinstance.kubevirt.io/latency-check-source   74m   Scheduling                                                              False
virtualmachineinstance.kubevirt.io/latency-check-target   74m   Running      192.168.100.20   cnv-qe-14.cnvqe.lab.eng.rdu2.redhat.com   True


Expected results:
All the resources created by the job are deleted when the job is deleted.
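
Until the fix is available, leftovers have to be removed by hand; a minimal manual cleanup, using the names from the output above, would look like:

$ oc delete job latency-nonexistent-node-job -n test-latency
$ oc delete vmi latency-check-source latency-check-target -n test-latency
# deleting the VMIs also removes their virt-launcher pods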

Comment 1 awax 2023-02-07 08:21:59 UTC
Verified on IBM BM cluster:
$ oc get clusterversion
NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.12.0-rc.8   True        False         4d16h   Cluster version is 4.12.0-rc.8


$ oc get csv -n openshift-cnv   
kubevirt-hyperconverged-operator.v4.12.1   OpenShift Virtualization                         4.12.1                kubevirt-hyperconverged-operator.v4.11.1   Succeeded

Comment 7 errata-xmlrpc 2023-02-28 20:06:27 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Virtualization 4.12.1 Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2023:1023