1004426 – Rhevm Server System Memory Growth Concern

Keywords:
Status:	CLOSED NOTABUG
Alias:	None
Product:	Red Hat Enterprise Virtualization Manager
Classification:	Red Hat
Component:	ovirt-engine
Sub Component:
Version:	3.2.0
Hardware:	x86_64
OS:	Linux
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Target Release:	---
Assignee:	Nobody
QA Contact:
Docs Contact:
URL:
Whiteboard:	infra
Depends On:
Blocks:

Reported:	2013-09-04 15:31 UTC by baiesi
Modified:	2016-02-10 19:35 UTC
CC List:	7 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2013-09-16 07:47:07 UTC
oVirt Team:	Infra
Target Upstream Version:
Embargoed:

Attachments
Rhevm System Memory Trend - run1 (203.67 KB, application/vnd.oasis.opendocument.spreadsheet) 2013-09-09 19:04 UTC, baiesi

Description baiesi 2013-09-04 15:31:25 UTC

Summary:
Rhevm 3.2 Server System Memory Growth Concern

Description of problem:
Over a 26-day period of running system / longevity tests with Rhevm, I've been collecting my test environment's system metrics while the system is running user admin and simulated client load.  I noticed that memory usage on the Rhevm Server climbed from an initial 21.5GB to 28.3GB over that time (26 days), which is approx 86% of the total.  Over the past 6 days I began monitoring libvirt, watching it grow from VSZ (virtual memory) 10.4GB to 11.8GB.  I stopped all client load yesterday to see whether Java's garbage collection would kick in and clean up, but the trend just seemed to level off. The test environment is currently in this condition and will remain in this state for a brief period of time in case developers wish to get access to it.

I'm sure I could keep driving the system to a point where physical memory would become exhausted, but then the system would most likely become unusable for both myself and the developers.  I have plenty of collected data from the test systems in the environment to share if needed.

top - 14:59:40 up 26 days, 23:45,  1 user,  load average: 0.00, 0.00, 0.00
Tasks: 420 total,   2 running, 418 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.1%us,  0.1%sy,  0.0%ni, 99.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  32862732k total, 28372068k used,  4490664k free,   253588k buffers
Swap: 16498680k total,        0k used, 16498680k free, 20768292k cached
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+ COMMAND
31746 ovirt     20   0 11.3g 2.4g  18m S  1.2  7.6   1539:24 java

Version-Release number of selected component:
System Test Env:
-Red Hat Enterprise Virtualization Manager Version: 3.2.1-0.39.el6ev
-Qty 1 Rhel6.4, Rhevm Server, high end Dell PowerEdge R710 Dual 8core, 32GB RAM, rhevm-3.2.1-0.39.el6ev.noarch
-Qty 4 Rhel6.4, Hosts, all high end Dell PowerEdge R710 Dual 8core, 16GB RAM
-Qty 1 Rhel6.4, Ipa Directory Server
-Qty 3 Rhel6.4, Load Client machines to drive user simulated load.

How reproducible: This is the first run
Steps to Reproduce:
1.Run System test load against the system for an extended period of time

Actual results:
Memory growth trending upward

Expected results:
Sustainable memory management with continued system operation and functionality without interruptions

Additional info:
I have been running a 30-day test using Rhevm 3.2.
Type            : System / Longevity
Target Duration : 30 days
Current Duration: 26 days / Run 1

VM(s)
Total 46 VMs created

Storage
-iSCSI Total 500G
-Name: ISCIMainStorage, Type: Data (Master), Storage: iSCSI, Format: V3, Cross Data-Center Status: Active, FreeSpace: 263 GB

Data collection / monitoring:
All systems (except for the IPA Server and the Clients) are being monitored for uptime, memory, swap, cpu, network io, disk io and disk space during the test run.  The tests here are meant to simulate client load but not to stress the systems.  The idea is to simulate 1 year of activity in 30 days while monitoring for system reliability and continued admin and user functionality.

System Test Load:
1. VM_Crud client: a multithreaded Python client that uses the SDK to cycle through a CRUD flow of VM(s) for a period of time defined by the tester, to drive load against the system (10 threads)

2. VM_Migration client: a multithreaded Python client that uses the SDK to cycle through migrating running VMs from host to host in the test environment for a period of time defined by the tester, to drive load against the system (2 threads)

3. VM_Cycling client: a multithreaded Python client that uses the SDK to cycle through a random run, suspend or stop of existing VM(s) in the test environment for a period of time defined by the tester, to drive load against the system (10 threads)

4. UserPortal client: a multithreaded Python client that uses Selenium to drive the User Portal.  The client cycles through unique users to run, stop or start a remote-viewer console of existing VM(s) in the test environment for a period of time defined by the tester, to drive load against the system (10 threads)
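
For readers who want to picture the shape of such a client, here is a minimal sketch of a time-boxed, multithreaded load driver. It is not the tester's actual code: run_load and fake_vm_cycle are hypothetical names, and the stub only sleeps where a real client would call the SDK or Selenium.

```python
import threading
import time


def run_load(operation, threads, duration_s):
    """Call operation(worker_id) repeatedly from several threads until the
    tester-defined duration has elapsed. Returns the number of completed calls."""
    deadline = time.monotonic() + duration_s
    lock = threading.Lock()
    completed = 0

    def worker(worker_id):
        nonlocal completed
        while time.monotonic() < deadline:
            operation(worker_id)
            with lock:  # protect the shared counter
                completed += 1

    pool = [threading.Thread(target=worker, args=(i,)) for i in range(threads)]
    for thread in pool:
        thread.start()
    for thread in pool:
        thread.join()
    return completed


def fake_vm_cycle(worker_id):
    """Hypothetical stand-in for one run/suspend/stop cycle against a VM."""
    time.sleep(0.01)
```

A call such as run_load(fake_vm_cycle, threads=10, duration_s=3600) would then drive ten workers for one hour.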

Comment 1 Juan Hernández 2013-09-06 08:27:16 UTC

If I understand correctly you have a RHEV-M machine with 32 GiB of RAM and according to the output of top the RHEV-M process is only consuming 11.3 GiB of virtual space and 2.4 GiB of real RAM. The amount of virtual space is not really relevant, and the real amount of RAM is reasonable. Take into account that the Java virtual machine used by the RHEV-M process is configured to use up to 1 GiB of heap space by default. That plus the stacks of the threads account for most of those 2.4 GiB of real RAM. So I would say that this is normal, not alarming at all.

What is probably alarming you is the total amount of RAM in use reported by top, those 28 GiB. But you have to take into account that most of that space, approx 20 GiB, is used by the file system cache. This is normal as well: the kernel tries to use all the available memory, and if it is not needed for anything else it is used for the file system cache. So in the long term the machine should use all the available memory, until the whole file system is loaded in RAM; that is perfectly healthy.
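
To make the arithmetic explicit, here is a small sketch that takes the Mem: and Swap: lines from the top output in the description and subtracts buffers and cache from the used figure. The function names are made up for illustration, and the parsing assumes the older top layout shown in this report, where the cached counter is printed on the Swap: line.

```python
import re

# The two summary lines copied from the top output in the description.
TOP_SAMPLE = """\
Mem:  32862732k total, 28372068k used,  4490664k free,   253588k buffers
Swap: 16498680k total,        0k used, 16498680k free, 20768292k cached
"""


def parse_top_memory(text):
    """Collect the KiB counters from top's Mem: and Swap: lines."""
    counters = {}
    for line in text.splitlines():
        label, _, rest = line.partition(":")
        prefix = label.strip().lower()
        if prefix not in ("mem", "swap"):
            continue
        for value, name in re.findall(r"(\d+)k\s+(\w+)", rest):
            counters[f"{prefix}_{name}"] = int(value)
    return counters


def memory_breakdown(counters):
    """Split 'used' into buffers/cache and everything else (processes, kernel)."""
    used = counters["mem_used"]
    cache = counters["mem_buffers"] + counters["swap_cached"]
    return {"used_kib": used, "cache_kib": cache, "non_cache_kib": used - cache}


if __name__ == "__main__":
    for name, kib in memory_breakdown(parse_top_memory(TOP_SAMPLE)).items():
        print(f"{name}: {kib} KiB ({kib / 1048576:.1f} GiB)")
```

With the figures from this report, about 7 GiB of the used memory is outside buffers and cache, and of that the java process holds 2.4 GiB resident.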

I would say that this is probably an indication that the machine has too much memory. I would suggest reducing it, maybe to 4 GiB instead of 32 GiB. This is the minimum required by RHEV-M (and even that is probably too large), so if you test with that you will be testing what our more demanding/constrained customers will do.

I would also suggest activating garbage collection logging in the RHEV-M Java virtual machine, so that if there are problems in that area in the future we can analyze them. To do that, add the following to /etc/sysconfig/ovirt-engine:

ENGINE_VERBOSE_GC=false

Then restart RHEV-M:

# service ovirt-engine restart

It will then start to produce garbage collection debug information in /var/log/ovirt-engine/console.log.

Comment 2 Juan Hernández 2013-09-06 08:29:15 UTC

Sorry, obviously it should be:

  ENGINE_VERBOSE_GC=true

Comment 3 baiesi 2013-09-09 19:02:40 UTC

Great, thanks for the good information you have relayed.

I'll take that into account and locate a server with 4GB for the next test run.  Agreed, it is probably better to test with the minimum required RAM specified by the Red Hat docs. I'll activate the garbage collection logging as specified going forward.

In future runs I'll also monitor the ovirt java process from the start, as well as the java heap memory.  I will be starting a new longevity test run soon.
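
A rough sketch of how the per-process monitoring could be done on Linux, by sampling VmSize and VmRSS from /proc/<pid>/status at a fixed interval. The function names and the sampling interval are assumptions for illustration, not the monitoring tooling actually used in this environment.

```python
import re
import time


def parse_proc_status(text):
    """Return the VmSize and VmRSS values (in kB) found in /proc/<pid>/status text."""
    sizes = {}
    for line in text.splitlines():
        match = re.match(r"(VmSize|VmRSS):\s+(\d+)\s+kB", line)
        if match:
            sizes[match.group(1)] = int(match.group(2))
    return sizes


def sample_process(pid):
    """Read the current virtual and resident sizes of one process (Linux only)."""
    with open(f"/proc/{pid}/status") as handle:
        return parse_proc_status(handle.read())


def monitor(pid, interval_s=60.0, samples=3):
    """Print one timestamped line per sample so the trend can be charted later."""
    for _ in range(samples):
        sizes = sample_process(pid)
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        print(stamp, pid, sizes.get("VmSize"), sizes.get("VmRSS"))
        time.sleep(interval_s)
```

For example, monitor(31746) would sample the java PID shown in the top output above. VmRSS corresponds to the RES column in top and VmSize to VIRT.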

I've also attached a LibreOffice Calc spreadsheet showing the trend I observed and was initially concerned about. (rhevm32_run1_system_memory_20days.ods)

Thanks

Comment 4 baiesi 2013-09-09 19:04:45 UTC

Created attachment 795735
Rhevm System Memory Trend - run1

Comment 5 Juan Hernández 2013-09-16 07:47:07 UTC

I'm closing this bug. If this concern reappears in the future please reopen.
