Description of problem:
The memory usage of the virt-launcher process in the virt-launcher pod is growing and may lead to OOMKilled pods:

> # VM started:
> qemu 10 0.1 0.2 2002312 60872 ? Sl 22:15 0:04 /usr/bin/virt-launcher --qemu-timeout 253s --name vm-fedora --uid 6b181371-7c41-4783-8f8f-a
> # 12hrs:
> qemu 10 0.1 0.2 2002312 67952 ? Sl May16 0:58 /usr/bin/virt-launcher --qemu-timeout 253s --name vm-fedora --uid 6b181371-7c41-4783-8f8f-a
> # 48hrs:
> qemu 10 0.1 0.2 2002312 70528 ? Sl May16 3:06 /usr/bin/virt-launcher --qemu-timeout 253s --name vm-fedora --uid 6b181371-7c41-4783-8f8f-a
> # And after 5 days it has exceeded the 75MB limit:
> qemu 10 0.1 0.3 2002312 77364 ? Sl May16 9:28 /usr/bin/virt-launcher --qemu-timeout 253s --name vm-fedora --uid 6b181371-7c41-4783-8f8f-a

Version-Release number of selected component (if applicable):
4.11

How reproducible:

Steps to Reproduce:
1. Run a VM with stress for a long time
2. Check memory usage
3.

Actual results:
Memory usage of virt-launcher grows and may exceed the limit

Expected results:
Memory usage stays under the limit

Additional info:
We've left a cluster running for a week, and have not seen evidence of a memory leak. We've used a profiler to double check memory consumption of virt-launcher as well. Because we don't have a clear reproducer, we'd like to enlist QE's help. Would it be possible to set up a long-running test to provide evidence of the problem? Deferring to the next release. There's no clear evidence that this is an urgent problem, and diagnosing whether or not there is a problem will take time.
I've reproduced it again and can share access to the cluster. The VM has been running for 6 days and virt-launcher currently uses ~81 MB:

$ oc get vm
NAME        AGE     STATUS    READY
vm-fedora   6d20h   Running   True

# Initial
USER  PID %CPU %MEM     VSZ   RSS TTY STAT START TIME COMMAND
qemu   10  0.1  0.2 2002316 65300 ?   Sl   18:56 0:01 /usr/bin/virt-launcher --qemu-timeout 313s --name vm-fedora --uid 7535d491-5ac3-46fa-9173-0
# 72hrs
qemu   10  0.1  0.3 2002316 75840 ?   Sl   May27 5:56 /usr/bin/virt-launcher --qemu-timeout 313s --name vm-fedora --uid 7535d491-5ac3-46fa-9173-0
# 6 days
qemu   10  0.1  0.3 2002316 81084 ?   Sl   May27 11:04 /usr/bin/virt-launcher --qemu-timeout 313s --name vm-fedora --uid 7535d491-5ac3-46fa-9173-0

The guest is a Fedora VM with the stress-ng tool running:

nohup sudo stress-ng --vm 1 --vm-bytes 100% --vm-method all --verify -t 0 &
Could you share the VM definition that was used? We'd like to clarify what limit was set.
During analysis of this BZ, we did not find direct evidence of a memory leak. This was validated by profiling the virt-launcher container over the course of several days. However, empirical data collected during profiling showed that the overhead calculation for virt-launcher was too low. Prior to the fix, the overhead was set to 75M, while testing showed that virt-launcher's overhead could reach as high as 90M. The overhead calculation has subsequently been adjusted. The fix has been merged in the CNV 4.11 stable branch, but builds are still pending.
I have 7 VMs that have been running for multiple days (5-10) and don't see memory usage above 100 MB (usually it is between 70 and 90 MB). Moving this BZ to Verified. I will continue testing for several more days just in case.
This bug re-appeared in CI, so I'm reopening it for further investigation.
Because the symptoms are different, I've filed a new bug (bug #2111794) and am moving this one back to Verified.
I've decided to reopen this bug. According to the longevity test (VMs running for 20-30 days), we are already over the limit on some pods and pretty close on others:

> virt-launcher-rhel8-nonroot-meadowlark-fw7qr   1/1   Running   0   21d
> POD: virt-launcher-rhel8-nonroot-meadowlark-fw7qr, RSS: 98640
> virt-launcher-rhel8-nonroot-quelea-gkcz5       1/1   Running   0   21d
> POD: virt-launcher-rhel8-nonroot-quelea-gkcz5, RSS: 97928
> virt-launcher-rhel8-root-barracuda-8mqd4       1/1   Running   0   28d
> POD: virt-launcher-rhel8-root-barracuda-8mqd4, RSS: 105512
> virt-launcher-vm-fedora-mem-d7pnt              2/2   Running   0   21d
> POD: virt-launcher-vm-fedora-mem-d7pnt, RSS: 69576
> virt-launcher-win10-nonroot-sailfish-fvpl2     2/2   Running   0   21d
> POD: virt-launcher-win10-nonroot-sailfish-fvpl2, RSS: 94884
> virt-launcher-win10-nonroot-stoat-pnmx7        2/2   Running   0   21d
> POD: virt-launcher-win10-nonroot-stoat-pnmx7, RSS: 89344
> virt-launcher-win10-root-rodent-jk7l4          2/2   Running   0   31d
> POD: virt-launcher-win10-root-rodent-jk7l4, RSS: 109876

I'm collecting memory usage every 6 hours, and I can see that it goes up and down periodically, but the general trend is that it grows from day to day on most of the pods. You can see an example in the attached file rhel8-root-baracuda.mem.log.
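A log like the one attached can be summarized with a short shell function to make the growth trend obvious. This is only a sketch: the two-column "TIMESTAMP RSS-in-KiB" format is an assumption about the attached log's layout.

```shell
# Sketch: summarize a "TIMESTAMP RSS_KiB" memory log (format assumed from the
# attached rhel8-root-baracuda.mem.log) by printing the first, last, and peak
# RSS samples, so a steady upward trend over the collection period is easy to spot.
mem_trend() {
    awk '
        NR == 1   { first = $2 }   # first sample in the log
        $2 > peak { peak  = $2 }   # track the maximum seen so far
                  { last  = $2 }   # latest sample
        END { printf "first=%sKi last=%sKi peak=%sKi\n", first, last, peak }
    ' "$1"
}
```

For example, `mem_trend rhel8-root-baracuda.mem.log` would print one line with the first, latest, and peak RSS values.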
Federico, can you please share a command showing how Denys can get output similar to what the test provides? (example)

tests/vmi_configuration_test.go:2950
the virtlogd process is taking too much RAM! (17468Ki > 17Mi). All processes: &map[libvirtd:{{28876800 0} {<nil>} BinarySI} qemu-kvm:{{454934528 0} {<nil>} BinarySI} virt-launcher:{{61800448 0} {<nil>} BinarySI} virt-launcher-monitor:{{19357696 0} {<nil>} BinarySI} virtlogd:{{17887232 0} {<nil>} 17468Ki BinarySI}]

IOW, can we not only look at the container level, but also try to identify which process is growing, to get a better lead?
@fdeutsch The test checks the RSS of the involved processes by running the `ps` command with different arguments. The full command is `ps --no-header axo rss,command`, but I don't think the arguments are important here. We check the following processes:

"virt-launcher-monitor"
"virt-launcher"
"virtlogd"
"libvirtd"

btw, even though they may seem to refer to the entire pod, the RSS results Denys showed refer to the `virt-launcher` process only.
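A per-process breakdown like the test's can be reproduced by hand inside the virt-launcher pod. The following is a sketch, not the test's exact code; it uses `ps axo rss=,command=` (equivalent to the test's `ps --no-header axo rss,command`) and matches on the command's basename so that `virt-launcher` does not also swallow `virt-launcher-monitor`.

```shell
# Sketch: report the RSS, in KiB, of each process the functional test watches.
# RSS of multiple instances of the same binary is summed; an absent process
# reports 0Ki.
proc_rss_report() {
    for name in virt-launcher-monitor virt-launcher virtlogd libvirtd; do
        rss=$(ps axo rss=,command= \
              | awk -v p="$name" '{ n = split($2, a, "/"); if (a[n] == p) s += $1 }
                                  END { print s + 0 }')
        echo "$name: ${rss}Ki"
    done
}
proc_rss_report
```

Running this periodically (e.g. from the same 6-hour collection loop) would show which of the four processes is the one actually growing.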
The libvirt-go fix has been merged in the release branch and should shortly land downstream too.
The following is a procedure to verify that the leak is actually gone.

1. Create a VMI (any kind of VMI, it's not important)
2. SSH as root into the node running the VMI
3. Install bcc-tools to get the memleak tool:
   # dnf install bcc-tools kernel-devel-$(uname -r)
4. Look for the PID of the virt-launcher running the VMI previously started
5. Run the memleak tool against that PID:
   # /usr/share/bcc/tools/memleak -p $PID -o 8000 --wa-missing-free
   The memleak tool will display all outstanding allocations (meaning they haven't been freed yet). What you want to verify is that you don't see any memory allocation related to the virt-handler prometheus endpoint scraping.
6. Exec into the virt-handler running the VMI:
   $ kubectl -n openshift-cnv exec virt-handler-* -- bash
7. Scrape the prometheus endpoint multiple times and check that you don't see any related allocation reported by memleak:
   $ for i in {1..10000}; do curl -L -k https://0.0.0.0:8443/metrics; done

If the leak is still there it will be pretty evident: right after scraping the endpoint 10K times, you will see the tool reporting ~500K of leaked memory from the same call site.

Hope this is clear enough.
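Steps 4-5 above can be scripted on the node. This sketch only echoes the memleak invocation instead of executing it, since bcc-tools must be installed first (step 3) and the tool has to run as root; the `pgrep` pattern is an assumption based on the `/usr/bin/virt-launcher` command line shown earlier in this bug.

```shell
# Sketch of steps 4-5: find the oldest virt-launcher PID on the node and build
# the memleak command line. The command is echoed, not run, because bcc-tools
# and root privileges are prerequisites (see step 3).
PID=$(pgrep -o -f '/usr/bin/virt-launcher' || true)
echo /usr/share/bcc/tools/memleak -p "${PID:-<virt-launcher-pid>}" -o 8000 --wa-missing-free
```

If several VMIs run on the node, match on the VMI name instead (e.g. `pgrep -f 'virt-launcher.*--name vm-fedora'`) to pick the right process.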
Here are the results after running

for i in {1..10000}; do curl -L -k https://0.0.0.0:8443/metrics; done

multiple times on the old and new CNV.

Old cluster:
[01:48:35] Top 10 stacks with outstanding allocations:
    22448595 bytes in 307515 allocations from stack
        g_malloc+0x16 [libglib-2.0.so.0.5600.4]

New cluster (CNV-v4.11.0-587):
[01:48:59] Top 10 stacks with outstanding allocations:
    7024 bytes in 9 allocations from stack
        [unknown] [libglib-2.0.so.0.5600.4]

It looks like the issue is fixed! Thank you!
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Virtualization 4.11.0 Images security and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:6526