Red Hat Bugzilla – Bug 1011515
cgred process dies
Last modified: 2017-03-08 12:35 EST
This is a bug to help the OSE team track the fix for RHEL 6.5.
+++ This bug was initially created as a clone of Bug #964219 +++
Description of problem:
We use cgroups extensively to manage resources on our OpenShift servers. Since the RHEL 6.3 to 6.4 update, we have been seeing the cgred process die. Here is the output we are seeing in rsyslog:
May 14 17:42:25 ex-std-node181 CGRE: *** glibc detected *** /sbin/cgrulesengd: double free or corruption (fasttop): 0x00000000022ffa30 ***
May 14 17:42:25 ex-std-node181.prod.rhcloud.com kernel: cgrulesengd general protection ip:32e0433fe4 sp:7ffff2778e90 error:0 in libc-2.12.so[32e0400000+18a000]
The interesting thing is that when we restart the cgred service, it sometimes goes right back into this "dead" state within 2-5 minutes.
May 16 14:21:55 ex-std-node47 CGRE: *** glibc detected *** /sbin/cgrulesengd: double free or corruption (fasttop): 0x0000000000b2e5b0 ***
May 16 14:25:24 ex-std-node47 CGRE: *** glibc detected *** /sbin/cgrulesengd: double free or corruption (fasttop): 0x0000000000c24af0 ***
May 16 14:25:24 ex-std-node47 kernel: cgrulesengd general protection ip:3000433fe4 sp:7fff2c9d4fd0 error:0 in libc-2.12.so[3000400000+18a000]
On ex-std-node47 there are 731 users that we have placed into cgroups.
Version-Release number of selected component (if applicable):
We are seeing this multiple times a day and currently have a cgred restart handler to attempt to bring the service back up.
I'm not exactly sure what is causing this to fail but we are more than willing to help debug this issue.
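A restart handler of the kind mentioned above could look roughly like the following. This is only a minimal sketch, not the handler actually used on these servers: the function name `ensure_cgred_running` and the `CHECK_CMD`/`RESTART_CMD` variables are hypothetical, and it assumes `pgrep` and the `service` command are available, as on a stock RHEL 6 install.

```shell
# Hypothetical watchdog sketch: restart cgred when cgrulesengd is not running.
# CHECK_CMD and RESTART_CMD are illustrative names; they can be overridden
# so the logic can be dry-run without touching the real service.
CHECK_CMD="${CHECK_CMD:-pgrep -x cgrulesengd}"
RESTART_CMD="${RESTART_CMD:-service cgred restart}"

ensure_cgred_running() {
    # pgrep exits 0 when at least one matching process exists.
    if $CHECK_CMD >/dev/null 2>&1; then
        echo "cgrulesengd is running"
        return 0
    fi
    echo "cgrulesengd is not running; restarting cgred"
    $RESTART_CMD
}
```

Calling `ensure_cgred_running` from cron every minute or so would bring the daemon back after a crash, but it only masks the underlying double free.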
Steps to Reproduce:
Actual results:
Service cgred is dying.
Expected results:
Cgred to be resilient and continue to enforce resource restrictions.
We are probably an outlier when it comes to most problems, but we use cgroups and depend on them to run OpenShift. We place all of our users into our multi-tenant environment and into cgroups, then rely on cgroups to restrict each user according to predefined resource limits. Please assist and we will provide more info.
--- Additional comment from RHEL Product and Program Management on 2013-05-17 11:03:59 EDT ---
Since this bug report was entered in bugzilla, the release flag has been
set to ? to ensure that it is properly evaluated for this release.
--- Additional comment from Peter Schiffer on 2013-08-13 12:57:49 EDT ---
do you still see this problem (even with libcgroup-0.37-7.2.el6_4)?
--- Additional comment from Kenny Woodson on 2013-08-16 11:28:29 EDT ---
Yes, we are still seeing these issues. Here are a few of them from today:
Aug 16 05:53:05 ex-std-node96 CGRE: *** glibc detected *** /sbin/cgrulesengd: double free or corruption (fasttop): 0x0000000001bf3370 ***
Aug 16 04:37:12 ex-std-node38 CGRE: *** glibc detected *** /sbin/cgrulesengd: double free or corruption (fasttop): 0x0000000000be73f0 ***
Aug 16 10:44:43 ex-std-node4 CGRE: *** glibc detected *** /sbin/cgrulesengd: double free or corruption (fasttop): 0x00000000025eabf0 ***
These are 3 separate servers.
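To see how the aborts are spread across servers, the syslog lines can be tallied per host. The following is a sketch only: the function name `count_cgred_crashes` is hypothetical, and it assumes the traditional syslog layout shown in the excerpts above, where the hostname is the fourth whitespace-separated field.

```shell
# Count cgrulesengd "double free" aborts per host in syslog-format input.
# Reads the files given as arguments, or standard input when none are given.
count_cgred_crashes() {
    # Match only the glibc abort message, not the kernel's
    # "general protection" line that follows it.
    awk '/cgrulesengd: double free or corruption/ { n[$4]++ }
         END { for (h in n) print n[h], h }' "$@" |
        sort -k1,1nr -k2,2   # most crashes first, then by hostname
}
```

For example, `count_cgred_crashes /var/log/messages` on a central log host would print one line per affected server, with the crash count first.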
rpm -qa | grep libcgroup:
--- Additional comment from Peter Schiffer on 2013-08-19 06:51:15 EDT ---
are you able to generate coredump? If yes, could you attach it?
--- Additional comment from Kenny Woodson on 2013-08-19 15:13:05 EDT ---
We have core file size set to -c for dumps. I have experimented and tried to dump the process by killing it with signals 3, 4, 6, 8, and 11, but I haven't had any luck. Is there something else I can try?
Any assistance would help.
]# cat /proc/sys/kernel/core_pattern
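When no core file appears, it can help to check the limit that applies to the daemon itself, since a process started from an init script may run with a different core size limit than an interactive shell. The sketch below is an illustration, not something taken from this report: the function name `show_core_settings` is hypothetical, and it assumes a Linux `/proc` filesystem that exposes `/proc/<pid>/limits`.

```shell
# Report the settings that decide whether a crashing process leaves a core file.
# Takes a PID as its first argument; defaults to the current shell.
show_core_settings() {
    pid="${1:-$$}"
    # Per-process limit; a soft limit of 0 means no core file is written.
    if [ -r "/proc/$pid/limits" ]; then
        grep 'Max core file size' "/proc/$pid/limits"
    else
        echo "Max core file size: unavailable for pid $pid"
    fi
    # Kernel template for where core files are written or piped.
    if [ -r /proc/sys/kernel/core_pattern ]; then
        echo "core_pattern: $(cat /proc/sys/kernel/core_pattern)"
    else
        echo "core_pattern: unavailable"
    fi
}
```

Running `show_core_settings "$(pgrep -o -x cgrulesengd)"` while the daemon is up shows the limit it actually inherited.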
--- Additional comment from Rob Millner on 2013-08-19 21:25:13 EDT ---
--- Additional comment from Kenny Woodson on 2013-08-23 12:24:37 EDT ---
Attached a core dump. Working with Peter.
--- Additional comment from Kenny Woodson on 2013-08-23 12:31:46 EDT ---
Binary core dump
--- Additional comment from Kenny Woodson on 2013-08-23 13:59:48 EDT ---
I'm not sure if this is the same issue, but I found 2 more core files.
--- Additional comment from Peter Schiffer on 2013-08-26 13:16:15 EDT ---
This problem was introduced with bug #849757, and it should be fixed as part of bug #913286.
--- Additional comment from RHEL Product and Program Management on 2013-08-27 10:59:02 EDT ---
Since this bug is proposed for the current release and its status
has changed to MODIFIED or VERIFIED, the devel_ack flag has been
set to +.
--- Additional comment from errata-xmlrpc on 2013-08-28 04:53:48 EDT ---
Bug report changed to ON_QA status by Errata System.
A QE request has been submitted for advisory RHBA-2013:15463-01
--- Additional comment from Kenny Woodson on 2013-09-11 10:01:58 EDT ---
At Peter's request I installed the latest version for RHEL 6.5, and the cgred process immediately dies.
Stopping CGroup Rules Engine Daemon... [ OK ]
Starting CGroup Rules Engine Daemon: /bin/bash: line 1: 2095 Segmentation fault /sbin/cgrulesengd -g cgred
Attaching a core dump.
--- Additional comment from Kenny Woodson on 2013-09-11 10:03:37 EDT ---
--- Additional comment from Peter Schiffer on 2013-09-16 06:18:08 EDT ---
--- Additional comment from errata-xmlrpc on 2013-09-17 07:36:39 EDT ---
Bug report changed from MODIFIED to ON_QA status by the Errata System:
Changed by: Peter Schiffer (firstname.lastname@example.org)