Bug 1011515 - cgred process dies
Status: CLOSED CURRENTRELEASE
Product: OpenShift Container Platform
Classification: Red Hat
Component: Containers
Version: 1.2.0
Hardware: x86_64 Linux
Priority: high  Severity: high
Assigned To: Brenton Leanhardt
QA Contact: libra bugs
Depends On: 964219
Blocks: 947775 961026 923851
Reported: 2013-09-24 09:06 EDT by Brenton Leanhardt
Modified: 2017-03-08 12 EST
CC: 1 user

Doc Type: Bug Fix
Clone Of: 964219
Environment: aws ec2 instance
Last Closed: 2014-02-04 13:17:34 EST
Type: Bug
Description Brenton Leanhardt 2013-09-24 09:06:41 EDT
This is a bug to help the OSE team track the fix for RHEL 6.5.

+++ This bug was initially created as a clone of Bug #964219 +++

Description of problem:

We use cgroups extensively to manage resources on our OpenShift servers.  Since the RHEL 6.3 to 6.4 update, we have been noticing that the cgred process dies.  Here is the output we are seeing in rsyslog:

-----
May 14 17:42:25 ex-std-node181 CGRE[4439]: *** glibc detected *** /sbin/cgrulesengd: double free or corruption (fasttop): 0x00000000022ffa30 ***
May 14 17:42:25 ex-std-node181.prod.rhcloud.com kernel: cgrulesengd[4439] general protection ip:32e0433fe4 sp:7ffff2778e90 error:0 in libc-2.12.so[32e0400000+18a000]
-----

The interesting thing is that after we restart the cgred service, it sometimes goes right back into this "dead" state within 2-5 minutes.

May 16 14:21:55 ex-std-node47 CGRE[28214]: *** glibc detected *** /sbin/cgrulesengd: double free or corruption (fasttop): 0x0000000000b2e5b0 ***

May 16 14:25:24 ex-std-node47 CGRE[3286]: *** glibc detected *** /sbin/cgrulesengd: double free or corruption (fasttop): 0x0000000000c24af0 ***
May 16 14:25:24 ex-std-node47 kernel: cgrulesengd[3286] general protection ip:3000433fe4 sp:7fff2c9d4fd0 error:0 in libc-2.12.so[3000400000+18a000]

On ex-std-node47 there are 731 users that we have placed into cgroups.  

Version-Release number of selected component (if applicable):
libcgroup-pam-0.37-7.1.el6_4.x86_64
libcgroup-0.37-7.1.el6_4.x86_64

How reproducible:

We are seeing this multiple times a day and currently have a cgred restart handler to attempt to bring the service back up.

I'm not exactly sure what is causing this failure, but we are more than willing to help debug the issue.
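One way to narrow this down (a sketch only; it assumes cgrulesengd's -n/--nodaemon foreground flag and glibc's MALLOC_CHECK_ variable behave as documented on RHEL 6) would be to run the rules engine in the foreground with glibc heap checking turned up, so the first bad free aborts with diagnostics instead of the daemon silently dying:

-----
# Stop the managed service, then run the rules engine in the
# foreground with glibc malloc checking set to print and abort.
]# service cgred stop
]# MALLOC_CHECK_=3 /sbin/cgrulesengd -n -g cgred
-----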

Steps to Reproduce:
No isolated reproducer; the crashes occur spontaneously under production load (see above).
Actual results:

The cgred service is dying.

Expected results:

Cgred should be resilient and continue to enforce resource restrictions.

Additional info:

We are probably an outlier when it comes to most problems, but we use cgroups heavily and depend on them to run OpenShift.  We place every user in our multi-tenant environment into a cgroup and then rely on cgroups to restrict each user according to predefined resource limits.  Please assist, and we will provide more info.
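For context, the per-user setup is conceptually like the following sketch (the group name "user123" and the limit values here are illustrative, not our real configuration):

-----
# /etc/cgconfig.conf -- a group with predefined resource limits
group user123 {
    memory {
        memory.limit_in_bytes = 536870912;   # 512 MB
    }
    cpu {
        cpu.shares = 128;
    }
}

# /etc/cgrules.conf -- cgred (cgrulesengd) moves the user's
# processes into that group as they are spawned
# <user>   <controllers>   <destination>
user123    cpu,memory      user123
-----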

--- Additional comment from RHEL Product and Program Management on 2013-05-17 11:03:59 EDT ---

Since this bug report was entered in bugzilla, the release flag has been
set to ? to ensure that it is properly evaluated for this release.

--- Additional comment from Peter Schiffer on 2013-08-13 12:57:49 EDT ---

Kenny,

do you still see this problem (even with libcgroup-0.37-7.2.el6_4)?

Thanks,

peter

--- Additional comment from Kenny Woodson on 2013-08-16 11:28:29 EDT ---

Peter,

Yes, we are still seeing these issues.  Here are a few of them from today:

Aug 16 05:53:05 ex-std-node96 CGRE[1340]: *** glibc detected *** /sbin/cgrulesengd: double free or corruption (fasttop): 0x0000000001bf3370 ***

Aug 16 04:37:12 ex-std-node38 CGRE[1149]: *** glibc detected *** /sbin/cgrulesengd: double free or corruption (fasttop): 0x0000000000be73f0 ***

Aug 16 10:44:43 ex-std-node4 CGRE[1103]: *** glibc detected *** /sbin/cgrulesengd: double free or corruption (fasttop): 0x00000000025eabf0 ***

These are 3 separate servers.

rpm -qa | grep libcgroup:

libcgroup-pam-0.37-7.2.el6_4.x86_64
libcgroup-0.37-7.2.el6_4.x86_64

Thanks,

kenny

--- Additional comment from Peter Schiffer on 2013-08-19 06:51:15 EDT ---

Kenny,

are you able to generate coredump? If yes, could you attach it?

Thanks,

peter

--- Additional comment from Kenny Woodson on 2013-08-19 15:13:05 EDT ---

Peter,

We have the core file size limit (ulimit -c) set for dumps.  I have experimented and tried to dump the process by killing it with signals 3 (SIGQUIT), 4 (SIGILL), 6 (SIGABRT), 8 (SIGFPE), and 11 (SIGSEGV).  I haven't had any luck.  Is there something else I can try?

Any assistance would help.

]# cat /proc/sys/kernel/core_pattern 
/var/crash/core-%e-%s-%u-%g-%p-%t
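(In case it helps: on RHEL 6 the init-script daemon() function keeps the daemon's core file limit at 0 unless DAEMON_COREFILE_LIMIT is set, so core_pattern alone may not be enough.  Assuming the cgred init script sources /etc/sysconfig/cgred, a sketch of the extra piece would be:

-----
# /etc/sysconfig/cgred -- allow the daemon to dump core
DAEMON_COREFILE_LIMIT='unlimited'
-----

followed by a cgred restart, after which the raised limit should show up in the daemon's /proc/<pid>/limits.)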


Thanks,

kenny

--- Additional comment from Rob Millner on 2013-08-19 21:25:13 EDT ---



--- Additional comment from Kenny Woodson on 2013-08-23 12:24:37 EDT ---

Attached a core dump.  Working with peter.

--- Additional comment from Kenny Woodson on 2013-08-23 12:31:46 EDT ---

Binary core dump

--- Additional comment from Kenny Woodson on 2013-08-23 13:59:48 EDT ---

Peter,

I'm not sure if this is the same issue, but I found 2 more core files.

Attached.

--- Additional comment from Peter Schiffer on 2013-08-26 13:16:15 EDT ---

This problem was introduced with bug #849757, and it should be fixed as part of the bug #913286.

--- Additional comment from RHEL Product and Program Management on 2013-08-27 10:59:02 EDT ---

Since this bug is proposed for the current release and its status
has changed to MODIFIED or VERIFIED, the devel_ack flag has been
set to +.

--- Additional comment from errata-xmlrpc on 2013-08-28 04:53:48 EDT ---

Bug report changed to ON_QA status by Errata System.
A QE request has been submitted for advisory RHBA-2013:15463-01
https://errata.devel.redhat.com/advisory/15463

--- Additional comment from Kenny Woodson on 2013-09-11 10:01:58 EDT ---

At Peter's request I installed the latest version for RHEL 6.5, and the cgred process immediately dies.

libcgroup-debuginfo-0.40.rc1-2.el6.x86_64
libcgroup-pam-0.40.rc1-2.el6.x86_64
libcgroup-devel-0.40.rc1-2.el6.x86_64
libcgroup-0.40.rc1-2.el6.x86_64

Stopping CGroup Rules Engine Daemon...                     [  OK  ]
Starting CGroup Rules Engine Daemon: /bin/bash: line 1:  2095 Segmentation fault      /sbin/cgrulesengd -g cgred
                                                           [FAILED]

Attaching a core dump.
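Since the matching libcgroup-debuginfo package is installed above, a backtrace can be pulled out of the core with gdb along these lines (sketch; <corefile> stands for the actual file produced under the core_pattern shown earlier):

-----
]# gdb /sbin/cgrulesengd /var/crash/<corefile>
(gdb) bt full
(gdb) thread apply all bt
-----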

--- Additional comment from Kenny Woodson on 2013-09-11 10:03:37 EDT ---



--- Additional comment from Peter Schiffer on 2013-09-16 06:18:08 EDT ---

http://bulk-mail.corp.redhat.com/archives/cvs-commits-list/2013-September/msg06437.html
http://bulk-mail.corp.redhat.com/archives/cvs-commits-list/2013-September/msg06439.html

--- Additional comment from errata-xmlrpc on 2013-09-17 07:36:39 EDT ---

Bug report changed from MODIFIED to ON_QA status by the Errata System: 
Advisory RHBA-2013:15463-01: 
Changed by: Peter Schiffer (pschiffe@redhat.com)
http://errata.devel.redhat.com/advisory/15463
