Bug 2243672

Summary: [DPDK checkup] Teardown should happen immediately after setup failure
Product: Container Native Virtualization (CNV) Reporter: Yossi Segev <ysegev>
Component: Networking Assignee: Orel Misan <omisan>
Status: CLOSED MIGRATED QA Contact: Yossi Segev <ysegev>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 4.14.0 CC: omisan
Target Milestone: ---   
Target Release: 4.15.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: v4.15.0.rhel9-1377 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2023-12-14 16:16:14 UTC Type: Bug
Attachments:
Description Flags
DPDK checkup resources none

Description Yossi Segev 2023-10-12 13:53:23 UTC
Created attachment 1993631
DPDK checkup resources

Description of problem:
When the DPDK checkup setup fails (e.g. due to an invalid parameter in the ConfigMap, as in https://bugzilla.redhat.com/show_bug.cgi?id=2188244), the teardown should occur immediately, and not wait for the configured timeout (which is set in the job's ConfigMap).


Version-Release number of selected component (if applicable):
CNV 4.14.0
container-native-virtualization/kubevirt-dpdk-checkup-rhel9:v4.14.0-116


How reproducible:
Always


Steps to Reproduce:
1. Apply the attached resources in their numeric order.
$ oc apply -f 1-dpdk-checkup-mcp.yaml
$ oc apply -f 2-dpdk-checkup-performance-profile.yaml
...
Change the parameters according to your environment.
Note that the ConfigMap (6-dpdk-checkup-configmap.yaml) has this invalid parameter on purpose:
spec.param.trafficGenTargetNodeName: "invalid-node-name"


Actual results:
The job fails and teardown removes all created resources (pods and VMIs), but this happens only after the timeout configured in the ConfigMap has elapsed (10 minutes in this example).


Expected results:
If an invalid parameter is entered and the checkup fails, it should fail immediately and tear down all the created resources.

Comment 1 Orel Misan 2023-10-16 08:04:40 UTC
If there is a problem with the creation of either of the VMIs - they are deleted immediately.

After the creation of the two VMIs, there is a wait for both of the VMIs to be ready (the wait is serial).
This wait is bounded by the overall timeout, specified in the user-supplied ConfigMap.

The solution needs to bound the setup with its own timeout, after which the setup fails and the VMIs are deleted.
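
The idea can be sketched roughly as follows in Go. This is only an illustration, not the checkup's actual code: `runSetup`, `waitFunc`, `neverReady` and `demo` are hypothetical names, and the durations are shortened so the example finishes quickly. The serial readiness waits share a dedicated setup deadline nested inside the overall one, and teardown is called as soon as a wait fails.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// waitFunc blocks until a VMI is ready or ctx is done.
// Hypothetical stand-in for the checkup's real readiness wait.
type waitFunc func(ctx context.Context) error

// runSetup waits for each VMI serially under a dedicated setup deadline.
// If any wait fails, teardown runs right away instead of waiting for the
// overall timeout carried by parent.
func runSetup(parent context.Context, setupTimeout time.Duration, waits []waitFunc, teardown func()) error {
	ctx, cancel := context.WithTimeout(parent, setupTimeout)
	defer cancel()

	for i, wait := range waits {
		if err := wait(ctx); err != nil {
			teardown()
			return fmt.Errorf("setup: VMI %d not ready: %w", i, err)
		}
	}
	return nil
}

// neverReady simulates a VMI that never becomes ready,
// e.g. one that targets a node name that does not exist.
func neverReady(ctx context.Context) error {
	<-ctx.Done()
	return ctx.Err()
}

// demo runs the setup with a short setup deadline (50ms) nested inside a
// longer overall timeout (1s) and reports what happened.
func demo() (tornDown bool, timedOut bool) {
	overall, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()

	err := runSetup(overall, 50*time.Millisecond,
		[]waitFunc{neverReady, neverReady},
		func() { tornDown = true })
	return tornDown, errors.Is(err, context.DeadlineExceeded)
}

func main() {
	tornDown, timedOut := demo()
	fmt.Printf("tornDown=%v timedOut=%v\n", tornDown, timedOut)
}
```

Because the setup context is derived from the overall one, whichever deadline is earlier applies, so the setup limit can never extend the overall timeout.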

Comment 2 Yossi Segev 2023-10-23 08:06:20 UTC
The implemented solution grants a 10-minute grace period for the setup to succeed; if the setup fails, teardown runs once this period ends.
So the teardown shouldn't occur immediately, but rather after this 10-minute timeout (or after `spec.timeout`, if the value in the job's ConfigMap is less than 10 minutes).
To verify this bug, set `spec.timeout` to 15m and verify that the teardown occurs after 10 minutes.
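
As a small illustration of the rule above, the effective setup limit is the smaller of the 10-minute grace period and `spec.timeout`. The Go sketch below only restates that arithmetic; `setupGracePeriod` and `effectiveSetupTimeout` are hypothetical names, not identifiers taken from the checkup's code.

```go
package main

import (
	"fmt"
	"time"
)

// setupGracePeriod is the 10-minute setup limit described in this comment.
const setupGracePeriod = 10 * time.Minute

// effectiveSetupTimeout returns the time after which a failed setup is torn
// down: the grace period, unless the overall timeout is shorter.
func effectiveSetupTimeout(overall time.Duration) time.Duration {
	if overall < setupGracePeriod {
		return overall
	}
	return setupGracePeriod
}

func main() {
	for _, overall := range []time.Duration{5 * time.Minute, 15 * time.Minute} {
		fmt.Printf("spec.timeout=%v -> setup limit %v\n", overall, effectiveSetupTimeout(overall))
	}
}
```

With `spec.timeout` set to 15m, as in the verification step, the sketch yields a 10m limit; with 5m it yields 5m.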