Bug 1030460

Summary: Seen out of memory kill in the engine.
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: RamaKasturi <knarra>
Component: rhsc
Assignee: Sahina Bose <sabose>
Status: CLOSED ERRATA
QA Contact: RamaKasturi <knarra>
Severity: high
Priority: high
Version: 2.1
CC: dpati, dtsang, grajaiya, herrold, juan.hernandez, knarra, mmahoney, pprakash, rhs-bugs, sabose, ssampat
Keywords: ZStream
Target Release: RHGS 2.1.2
Hardware: Unspecified
OS: Unspecified
Fixed In Version: cb12
Doc Type: Bug Fix
Last Closed: 2014-02-25 08:03:48 UTC
Type: Bug
Bug Depends On: 1028966, 1040049
Attachments:
Description                             Flags
Attaching the error screenshot.         none
Output of pmap after starting engine    none
pmap2.txt                               none

Description RamaKasturi 2013-11-14 13:50:31 UTC
Created attachment 823961 [details]
Attaching the error screenshot.

Description of problem:
The ovirt-engine service crashes.

Version-Release number of selected component (if applicable):
rhsc-2.1.2-0.23.master.el6_5.noarch

How reproducible:
Not always

Steps to Reproduce:
1. Create a distributed volume and start it.
2. Mount the volume and create a 10 GB file in it.
3. Select a brick and start removing it.

Actual results:
Once data migration has completed, clicking the drop-down in the Activities column brings up a popup saying "A request to the server failed .status code : 503".

Expected results:
The engine should not crash.

Additional info:

Comment 3 Sahina Bose 2013-11-17 05:17:48 UTC
Please provide the engine and vdsm logs.

Comment 5 Prasanth 2013-11-18 09:37:23 UTC
Sahina, is this bug related to Bug 1026100 in RHEVM by any chance? If so, I think it might be useful for you to fix this bug!

Comment 6 Juan Hernández 2013-11-18 10:34:54 UTC
This looks very similar to bug 1028966 in RHEV-M, as the engine is consuming more than 6 GiB of RSS.

To make progress we need to generate a heap dump of the engine while it is consuming this unusual amount of memory. I would suggest trying to reproduce this on a machine with more RAM (the current one has 8 GiB), so that when the engine is consuming those 6 GiB we can take a heap dump before the out-of-memory killer kills it.

Comment 7 Juan Hernández 2013-11-18 15:34:27 UTC
In RHEV-M we are investigating whether this can be caused by the 64 MiB memory areas created by the libc "malloc" allocator (87 were detected in bug 1028966). It would be helpful if you could check whether the following setting in /etc/sysconfig/ovirt-engine helps:

export MALLOC_ARENA_MAX=1

Please make sure that this is actually applied to the engine process:

# ps -u ovirt
  PID TTY          TIME CMD
 1710 ?        00:00:00 ovirt-websocket
 4547 ?        00:00:00 ovirt-engine.py
 4549 ?        00:01:30 java

# strings /proc/4549/environ | grep MALLOC
MALLOC_ARENA_MAX=1

This should reduce the number of 64 MiB areas to just 1.
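
The check above can be wrapped in a small helper. This is only a sketch: the function name env_has and its optional third argument are hypothetical, and it assumes a Linux /proc filesystem where /proc/PID/environ holds NUL-separated NAME=VALUE entries.

```shell
# Hypothetical helper: succeed if the exact entry NAME=VALUE is present in a
# process environment.
# Usage: env_has PID NAME=VALUE [ENVIRON_FILE]
# ENVIRON_FILE defaults to /proc/PID/environ; the override is an assumption
# added so the function can also be run against a saved copy.
env_has() {
    pid=$1
    wanted=$2
    file=${3:-/proc/$pid/environ}
    # Entries are NUL-separated: turn them into lines, then require an exact,
    # whole-line, fixed-string match so that "=1" does not also match "=10".
    tr '\0' '\n' < "$file" | grep -qxF -- "$wanted"
}
```

For the process listed above this would be called as env_has 4549 MALLOC_ARENA_MAX=1; the exit status is 0 when the setting reached the java process.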

Comment 8 Juan Hernández 2013-11-18 15:36:12 UTC
Other useful information you can gather from the engine when this situation arises is the memory map generated with the "pmap" command:

# ps -u ovirt
  PID TTY          TIME CMD
 1710 ?        00:00:00 ovirt-websocket
 4547 ?        00:00:00 ovirt-engine.py
 4549 ?        00:01:30 java

# pmap 4549 > mymap.txt
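
Once the map is saved, the largest mappings are usually the first thing to look at. A minimal sketch, assuming the plain "pmap PID" output format (address, size with a K suffix, permissions, mapping name); the function name largest_mappings is hypothetical.

```shell
# Hypothetical helper: print the N largest mappings from a saved plain
# "pmap PID" output (not the "pmap -x" format, which has no K suffix).
# Usage: largest_mappings FILE [N]      (N defaults to 5)
largest_mappings() {
    # Keep only lines that start with a hex address followed by "<digits>K";
    # this skips the header line and the trailing "total" line. Prefix each
    # kept line with its numeric size, sort on that, then drop the prefix.
    awk '$1 ~ /^[0-9a-f]+$/ && $2 ~ /^[0-9]+K$/ {
             size = $2; sub(/K$/, "", size); print size, $0
         }' "$1" |
        sort -k1,1nr | head -n "${2:-5}" | cut -d' ' -f2-
}
```

For example, largest_mappings mymap.txt 10 would list the ten largest mappings in the file written by the command above.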

Comment 9 Sahina Bose 2013-11-22 13:10:43 UTC
Juan,

For what it's worth - QE has been hitting this OOM killer ever since they started testing the engine on RHEL 6.5 and EAP 6.2.

From the pmap output taken when memory consumption on the engine VM was approaching the 8 GB limit:

0000000000e13000 2868564 2534820 2534820 rw---    [ anon ]
00000000aff80000 1311232  840660  840660 rw---    [ anon ]

00007fe4c2482000 3301336 2353956 2353956 rw---    [ anon ]


And from the pmap output taken just after the engine was started:
0000000000e13000 2868564K rw---    [ anon ]
00000000aff80000 1311232K rw---    [ anon ]

00007fe52c214000 1567120K rw---    [ anon ]


Notice that the third line has roughly doubled. I'm not sure what it corresponds to, however. I will attach both pmap outputs to the bug.
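
The comparison can be put in numbers. A minimal sketch, assuming the third line of each excerpt refers to the same mapping (the addresses differ, so this is not certain) and that the sizes are in KiB as pmap reports them; the function names are illustrative only.

```shell
# Illustrative helpers (names are not from any existing tool).

# Ratio of two sizes given in KiB, to two decimals.
growth_ratio() {
    LC_ALL=C awk -v before="$1" -v after="$2" 'BEGIN { printf "%.2f\n", after / before }'
}

# Convert a size in KiB to GiB, to two decimals.
kib_to_gib() {
    LC_ALL=C awk -v kib="$1" 'BEGIN { printf "%.2f\n", kib / 1048576 }'
}

# Third mapping, just after start vs. near the 8 GB limit:
#   growth_ratio 1567120 3301336
#   kib_to_gib 3301336
```

With the sizes quoted above, growth_ratio 1567120 3301336 prints 2.11 and kib_to_gib 3301336 prints 3.15, while the first two mappings are the same size in both excerpts.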

Comment 10 Sahina Bose 2013-11-22 13:12:06 UTC
Created attachment 827768 [details]
Output of pmap after starting engine

Output of pmap after starting engine

Comment 11 Sahina Bose 2013-11-22 13:13:14 UTC
Created attachment 827769 [details]
pmap2.txt

Output of pmap when engine was consuming close to 8GB

Comment 12 Juan Hernández 2013-11-25 11:46:40 UTC
Please take a look at comment 17 in bug 1028966. If you can do the same in your environment it will help to determine the cause of this issue.

https://bugzilla.redhat.com/show_bug.cgi?id=1028966#c17

Comment 13 Juan Hernández 2013-11-28 12:24:08 UTC
Sahina, I think that this bug should now be closed as a duplicate of bug 1028966, and the solution should be the same as the one proposed there.

Comment 15 Sahina Bose 2013-12-16 02:32:56 UTC
The patch that introduces "Conflicts: java-1.7.0-openjdk = 1:1.7.0.45-2.4.3.3.el6" (as per comment 48 on Bug 1028966) has been merged into the RHSC repository.

Comment 16 Sahina Bose 2013-12-17 05:44:39 UTC
openjdk update is available in RHEL 6.5.z stream. Please ensure that you're subscribed to this.

Comment 17 Prasanth 2013-12-17 11:24:20 UTC
(In reply to Sahina Bose from comment #16)
> openjdk update is available in RHEL 6.5.z stream. Please ensure that you're
> subscribed to this.

Sahina,

As the RHS-C server is expected to be subscribed to the base RHEL 6 channel (rhel-x86_64-server-6) to get the required child channels [1], is it possible to get the openjdk update that is available in the RHEL 6.5.z stream?

[1] 
rhel-x86_64-server-6-rhs-rhsc-2.1
jbappplatform-6-x86_64-server-6-rpm

If so, please let me know how that works.

Comment 18 Sahina Bose 2013-12-17 11:34:49 UTC
I'm assuming the base RHEL 6 channel will have the Z stream updates. If not, we need to check with release engineering how to get them.

Comment 19 RamaKasturi 2013-12-25 14:00:36 UTC
I have not seen this issue with cb12 and the new OpenJDK version "java-1.7.0-openjdk-1.7.0.45-2.4.3.4.el6_5.x86_64", so I am marking this as verified.

Will reopen if it happens again.

Comment 21 errata-xmlrpc 2014-02-25 08:03:48 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHEA-2014-0208.html