Bug 2156392

Summary: In the VM latency checkup, the max_desired_latency_milliseconds field has no meaning when the measured latency is less than 1[ms]
Product: Container Native Virtualization (CNV)
Component: Networking
Version: 4.12.0
Target Milestone: ---
Target Release: 4.13.0
Hardware: Unspecified
OS: Unspecified
Status: CLOSED ERRATA
Severity: low
Priority: low
Reporter: awax
Assignee: Orel Misan <omisan>
QA Contact: awax
CC: omisan, ralavi
Last Closed: 2023-05-18 02:56:36 UTC
Type: Bug

Description awax 2022-12-26 18:41:10 UTC

Comment 1 awax 2022-12-26 18:41:29 UTC
Description of problem:
The max_desired_latency_milliseconds field has no effect when the measured latency is less than 1 [ms].

Version-Release number of selected component (if applicable):


How reproducible:
Create a ConfigMap with the "spec.param.max_desired_latency_milliseconds" field set to 0. The Job will finish successfully instead of failing.

Steps to Reproduce:
1. Create a namespace:
oc new-project test-latency

2. Create a Bridge with this yaml:
apiVersion: nmstate.io/v1
kind: NodeNetworkConfigurationPolicy
metadata:
  name: br10
spec:
  desiredState:
    interfaces:
    - bridge:
        options:
          stp:
            enabled: false
        port:
        - name: ens9
      ipv4:
        auto-dns: true
        dhcp: false
        enabled: false
      ipv6:
        auto-dns: true
        autoconf: false
        dhcp: false
        enabled: false
      name: br10
      state: up
      type: linux-bridge
  nodeSelector:
    node-role.kubernetes.io/worker: ''

3. Create a NAD with this yaml:
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: bridge-network-nad
spec:
  config: |
    {
      "cniVersion": "0.3.1",
      "name": "br10",
      "plugins": [
        {
          "type": "cnv-bridge",
          "bridge": "br10"
        }
      ]
    }


4. Create a service-account, role, and role-binding:
cat <<EOF | kubectl apply  -f -
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: vm-latency-checkup-sa
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: kubevirt-vm-latency-checker
rules:
- apiGroups: ["kubevirt.io"]
  resources: ["virtualmachineinstances"]
  verbs: ["get", "create", "delete"]
- apiGroups: ["subresources.kubevirt.io"]
  resources: ["virtualmachineinstances/console"]
  verbs: ["get"]
- apiGroups: ["k8s.cni.cncf.io"]
  resources: ["network-attachment-definitions"]
  verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: kubevirt-vm-latency-checker
subjects:
- kind: ServiceAccount
  name: vm-latency-checkup-sa
roleRef:
  kind: Role
  name: kubevirt-vm-latency-checker
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: kiagnose-configmap-access
rules:
- apiGroups: [ "" ]
  resources: [ "configmaps" ]
  verbs: ["get", "update"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: kiagnose-configmap-access
subjects:
- kind: ServiceAccount
  name: vm-latency-checkup-sa
roleRef:
  kind: Role
  name: kiagnose-configmap-access
  apiGroup: rbac.authorization.k8s.io
EOF


5. Create the ConfigMap with the "spec.param.max_desired_latency_milliseconds" field set to 0:
cat <<EOF | kubectl apply -f -
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: kubevirt-vm-latency-checkup-config
data:
  spec.timeout: 5m
  spec.param.network_attachment_definition_namespace: "manual-latency-check"
  spec.param.network_attachment_definition_name: "bridge-network-nad"
  spec.param.max_desired_latency_milliseconds: "0"
  spec.param.sample_duration_seconds: "5"
EOF


6. Create a job:
cat <<EOF | kubectl apply -n <target-namespace> -f -
---
apiVersion: batch/v1
kind: Job
metadata:
  name: kubevirt-vm-latency-checkup
spec:
  backoffLimit: 0
  template:
    spec:
      serviceAccountName: vm-latency-checkup-sa
      restartPolicy: Never
      containers:
        - name: vm-latency-checkup
          image: quay.io/kiagnose/kubevirt-vm-latency:main
          securityContext:
            runAsUser: 1000
            allowPrivilegeEscalation: false
            capabilities:
              drop: ["ALL"]
            runAsNonRoot: true
            seccompProfile:
              type: "RuntimeDefault"
          env:
            - name: CONFIGMAP_NAMESPACE
              value: test-latency
            - name: CONFIGMAP_NAME
              value: kubevirt-vm-latency-checkup-config



Actual results:
Check the Job status: the Job finishes successfully (instead of failing).
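For example (using the Job name from step 6 and the namespace from step 1):
oc get job kubevirt-vm-latency-checkup -n test-latency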
Check the ConfigMap results: it reports all the parameters of a successful checkup:
oc get cm latency-configmap -oyaml
apiVersion: v1
data:
  spec.param.max_desired_latency_milliseconds: "0"
  spec.param.network_attachment_definition_name: checkup-nad
  spec.param.network_attachment_definition_namespace: test-checkup-framework
  spec.param.sample_duration_seconds: "5"
  spec.timeout: 300m
  status.completionTimestamp: "2022-12-25T14:21:01Z"
  status.failureReason: ""
  status.result.avgLatencyNanoSec: "491000"
  status.result.maxLatencyNanoSec: "652000"
  status.result.measurementDurationSec: "5"
  status.result.minLatencyNanoSec: "336000"
  status.result.sourceNode: master2
  status.result.targetNode: master1
  status.startTimestamp: "2022-12-25T14:20:09Z"
  status.succeeded: "true"
kind: ConfigMap
metadata:
  creationTimestamp: "2022-12-25T14:20:06Z"
  labels:
    created-by-dynamic-class-creator: "Yes"
  name: latency-configmap
  namespace: test-checkup-framework
  resourceVersion: "7798869"
  uid: dadebd03-c8b5-456b-9995-b201eb52415f


Expected results:
The Job should fail, since the measured max latency (652000 [ns], i.e. ~0.65 [ms]) is greater than the requested maximum of 0 [ms].

Comment 2 Orel Misan 2022-12-27 07:30:01 UTC
The "spec.param.max_desired_latency_milliseconds" field is respected.

The measured latency is treated as an integer, with units of milliseconds.
When the measured latency is less than 1 [ms], it is truncated to 0 [ms]; the condition `actualMaxLatency > maxLatencyDesired` (which fails the checkup when true) is therefore false, and the checkup succeeds.

https://github.com/kiagnose/kiagnose/blob/95d8c7995fabbb7be11f7cde0eca57dc01e222bb/checkups/kubevirt-vm-latency/vmlatency/internal/checkup/checkup.go#L157
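
To illustrate with the numbers from the ConfigMap output in comment 1 (plain shell arithmetic that mimics the comparison; the actual check is the Go code linked above):

measured_ns=652000                        # status.result.maxLatencyNanoSec
desired_ms=0                              # spec.param.max_desired_latency_milliseconds
measured_ms=$(( measured_ns / 1000000 ))  # integer division truncates 0.652 [ms] down to 0 [ms]
if (( measured_ms > desired_ms )); then echo "checkup fails"; else echo "checkup succeeds"; fi
# prints "checkup succeeds"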

Comment 3 awax 2023-02-07 13:42:22 UTC
Tested on a PSI cluster:

$ oc get csv -A | grep virt
...
openshift-cnv                                      kubevirt-hyperconverged-operator.v4.13.0          OpenShift Virtualization      4.13.0                kubevirt-hyperconverged-operator.v4.11.1   Succeeded

Openshift version: 4.12.0
CNV version: 4.13.0
HCO image: brew.registry.redhat.io/rh-osbs/iib:418191
OCS version: 4.12.0
CNI type: OVNKubernetes
Workers type: virtual

It still doesn't seem to respect the max_desired_latency_milliseconds field - I would expect such a checkup to fail because it didn't meet the user's requirements. What actually happens is that the VMIs are not created; the virt-launcher pods are stuck in init status:
$ oc logs virt-launcher-latency-check-source-vz5rd-kk5tl
Error from server (BadRequest): container "compute" in pod "virt-launcher-latency-check-source-vz5rd-kk5tl" is waiting to start: PodInitializing
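
For example, the stuck pods can be listed with:
oc get pods -n test-checkup-framework    # namespace used in this run (see the failure message below)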


After about 10 minutes the checkup job fails with:
checkup failed: setup: failed to wait for VMI 'test-checkup-framework/latency-check-target-rlpmb' IP address to appear on status: timed out waiting for the condition

Comment 4 awax 2023-02-08 12:22:10 UTC
Verified on a fresh installation of PSI:

kubevirt-hyperconverged-operator.v4.13.0
IIB - 425854 - version  v4.13.0.rhel9-1385

Comment 6 errata-xmlrpc 2023-05-18 02:56:36 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Virtualization 4.13.0 Images security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:3205