Bug 2132473

Summary: Memory usage of virt-operator pods increased after updating HCO
Product: Container Native Virtualization (CNV) Reporter: Denys Shchedrivyi <dshchedr>
Component: Virtualization    Assignee: lpivarc
Status: CLOSED ERRATA QA Contact: Akriti Gupta <akrgupta>
Severity: high Docs Contact:
Priority: medium    
Version: 4.12.0    CC: acardace, jlejosne, kbidarka
Target Milestone: ---   
Target Release: 4.14.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2023-11-08 14:05:03 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
virt-operator pod memory usage (flags: none)
Heap pprof (flags: none)

Description Denys Shchedrivyi 2022-10-05 20:48:20 UTC
Created attachment 1916302 [details]
virt-operator pod memory usage

Description of problem:
 There is suspicious behavior of the virt-operator pod: each time I update the KubeVirt configuration through the HCO, the memory usage of virt-operator increases by 15-20 MB.

 Initially the virt-operator pod used ~200 MB, but after updating the HCO several times the usage grew to ~400 MB and never went back down (screenshot attached).

 The maximum I saw in my tests was 450 MB (with peaks of 570 MB).


Version-Release number of selected component (if applicable):
4.12

How reproducible:
100%

Steps to Reproduce:
1. Update KubeVirt through the HCO (for example, by changing liveMigrationConfig parameters or adding annotations).
2. Observe a peak in memory usage after the update (which is probably expected); when it subsides, memory usage settles 15-20 MB above the initial value.

Comment 1 sgott 2022-10-06 12:06:23 UTC
Let's assume the worst-case scenario: the virt-operator pod ends up getting killed due to OOM or memory pressure. In that case, leader election will pick the alternate virt-operator and a new one will be respawned. In other words, the cluster will recover gracefully.

Because of this I'm estimating the severity of this BZ to be medium. Please let me know if you have concerns with this rationale.

Comment 2 Denys Shchedrivyi 2022-10-06 13:04:24 UTC
 I think medium is fine; sooner or later it will be OOM-killed and respawned. You can see it in the new screenshot: I ran HCO updates in a loop and virt-operator memory usage rose to 1.2 GB (it would have gone further had the loop script continued), but the pod was only restarted several hours later.
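
For a rough sense of scale, the numbers in this report allow a simple back-of-the-envelope estimate. The sketch below is illustrative only; the function name and the 1.2 GB ceiling are assumptions for the example, taken from the figures reported above:

```python
# Illustrative arithmetic only: estimate how many HCO updates it takes
# for virt-operator to approach a given memory ceiling, assuming the
# reported leak of roughly 15-20 MB per update on a ~200 MB baseline.
def updates_until_limit(baseline_mb, limit_mb, leak_per_update_mb):
    """Whole number of updates before usage reaches limit_mb."""
    return max(0, int((limit_mb - baseline_mb) // leak_per_update_mb))

# With the ~200 MB baseline and ~15 MB leaked per update, the 1.2 GB
# usage observed above corresponds to on the order of 66 updates.
print(updates_until_limit(200, 1200, 15))
```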

Comment 4 Jed Lejosne 2022-10-06 18:07:49 UTC
On an upstream KubeVirt setup, updating an annotation on the CR in a loop doesn't seem to repro the issue. HCO must be doing something else to KubeVirt on CR update.

Comment 5 Jed Lejosne 2022-10-06 20:01:14 UTC
Created attachment 1916613 [details]
Heap pprof

Comment 6 Jed Lejosne 2022-10-06 20:01:36 UTC
Scratch the above, changing just an annotation doesn't do much at all. However, changing the CPU model has a strong impact on memory consumption.
This temporarily tripled memory consumption (it went back down afterward but stabilized much higher than before):

# Toggle the cluster-wide CPU model on the KubeVirt CR to trigger the growth
for i in $(seq 10); do
  kubectl patch kubevirt kubevirt -n kubevirt --type='json' \
    -p='[{"op": "replace", "path": "/spec/configuration/cpuModel", "value": "Penryn"}]'
  sleep 10
  kubectl patch kubevirt kubevirt -n kubevirt --type='json' \
    -p='[{"op": "remove", "path": "/spec/configuration/cpuModel"}]'
  sleep 10
done

Attached is a heap pprof captured after running that and waiting about 15 minutes.

Comment 11 Kedar Bidarkar 2023-03-01 13:58:49 UTC
Raising the severity to High and moving this to CNV 4.14 due to the capacity.

Comment 12 Denys Shchedrivyi 2023-08-17 16:34:57 UTC
Verified on CNV-v4.14.0.rhel9-1576. I ran HCO updates in a loop for several hours and observed some memory spikes while the script was running, but when it completed, memory usage returned to its initial state.

Comment 15 errata-xmlrpc 2023-11-08 14:05:03 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Virtualization 4.14.0 Images security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:6817