Bug 797192

Summary: corosync filling up /dev/shm
Product: Red Hat Enterprise Linux 6 Reporter: Patrick Van Gilst <patrick.vangilst>
Component: corosync Assignee: Jan Friesse <jfriesse>
Status: CLOSED ERRATA QA Contact: Cluster QE <mspqa-list>
Severity: urgent Docs Contact:
Priority: urgent    
Version: 6.2 CC: ccaulfie, cluster-maint, jkortus, lhh, msvoboda, rpeterso, sbradley, teigland
Target Milestone: rc Keywords: ZStream
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: corosync-1.4.1-6.el6 Doc Type: Bug Fix
Doc Text:
Previously, the underlying library of corosync did not delete temporary buffers used for Inter-Process Communication (IPC) that are stored in the /dev/shm shared memory file system. Therefore, if the user without proper privileges attempted to establish an IPC connection, the attempt failed with an error message as expected but memory allocated for temporary buffers was not released. This could eventually result in /dev/shm being fully used and Denial of Service. This update modifies the coroipcc library to let applications delete temporary buffers if the buffers were not deleted by the corosync server. The /dev/shm file system is no longer cluttered with needless data in this scenario and IPC connections can be established as expected.
Story Points: ---
Clone Of:
: 797922 (view as bug list) Environment:
Last Closed: 2012-06-20 12:23:20 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 797922, 810915, 810916, 810917    
Attachments:
Description      Flags
cluster.conf     none
bind_mount.sh    none
Proposed patch   none

Description Patrick Van Gilst 2012-02-24 14:18:29 UTC
Description of problem:

Running a two-node active-passive NFS cluster with RHCS 6.2, /dev/shm becomes 100% full after about 1 hour. SELinux is in Permissive mode.

# df -h /dev/shm/
Filesystem            Size  Used Avail Use% Mounted on
tmpfs                  16G     -     -   -  /dev/shm

# ls -l /dev/shm/ | wc -l
21275

# ls /dev/shm
control_buffer-geIdyN  control_buffer-T04M9s  dispatch_buffer-9UF86u  dispatch_buffer-mhchTW  dispatch_buffer-yWbMFr  request_buffer-GkctZk   request_buffer-SU8WQu  response_buffer-AEBaB3  response_buffer-n1BCmK 
.........

# ls -l /dev/shm | less
total 16409808
-rw-------. 1 rpcuser rpcuser    8192 Feb 24 13:22 control_buffer-00AIgE
-rw-------. 1 root    root       8192 Feb 24 14:51 control_buffer-00oq6Q
-rw-------. 1 rpcuser rpcuser    8192 Feb 24 13:48 control_buffer-00R0LQ
-rw-------. 1 rpcuser rpcuser    8192 Feb 24 13:00 control_buffer-01AsLt
-rw-------. 1 rpcuser rpcuser    8192 Feb 24 12:16 control_buffer-01D8op
-rw-------. 1 rpcuser rpcuser    8192 Feb 24 13:16 control_buffer-01wU0Z
-rw-------. 1 rpcuser rpcuser    8192 Feb 24 12:20 control_buffer-01XgT8
-rw-------. 1 rpcuser rpcuser    8192 Feb 24 14:14 control_buffer-01xmph
......

It seems that coroipcc.c does not free the shared memory.
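
To quantify the leak shown in the listings above, a small script can group the leftover buffer files by type and owner. This is only a rough sketch: the function name `summarize_shm` is hypothetical, and the four file name prefixes are taken from the `ls /dev/shm` output in this report, not from the corosync sources.

```python
import os
import pwd
from collections import Counter

# Prefixes as they appear in the "ls /dev/shm" output in this report.
BUFFER_PREFIXES = (
    "control_buffer-",
    "dispatch_buffer-",
    "request_buffer-",
    "response_buffer-",
)


def summarize_shm(directory="/dev/shm"):
    """Count leftover buffer files per (kind, owner) and sum their sizes."""
    counts = Counter()
    total_bytes = 0
    with os.scandir(directory) as entries:
        for entry in entries:
            # Map the file name to its buffer kind, or skip unrelated files.
            kind = next(
                (p.rstrip("-") for p in BUFFER_PREFIXES if entry.name.startswith(p)),
                None,
            )
            if kind is None or not entry.is_file(follow_symlinks=False):
                continue
            st = entry.stat(follow_symlinks=False)
            try:
                owner = pwd.getpwuid(st.st_uid).pw_name
            except KeyError:
                owner = str(st.st_uid)  # uid without a passwd entry
            counts[(kind, owner)] += 1
            total_bytes += st.st_size
    return counts, total_bytes
```

Calling `summarize_shm()` on an affected node and printing the returned counter should show which account owns most of the files; in this report that appears to be rpcuser.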


Version-Release number of selected component (if applicable):

Red Hat Enterprise Linux Server release 6.2 (Santiago)
cman-3.0.12.1-23.el6.x86_64
rgmanager-3.0.12.1-5.el6.x86_64
resource-agents-3.9.2-7.el6.x86_64
corosync-1.4.1-4.el6.x86_64
nfs-utils-lib-1.1.5-4.el6.x86_64
nfs-utils-1.2.3-15.el6.x86_64

How reproducible:
Always


Steps to Reproduce:

1. Create an RHCS cluster for NFS in active-passive mode.
2. Start the service.
3. Observe that /dev/shm fills up.
  
Actual results:

/dev/shm fills up. As a consequence:
# cman_tool version -r    (segfaults)
# corosync-objctl -a
Could not initialize objdb library. Error 2
None of the corosync-* utilities can be used.

Expected results:
/dev/shm not filling up, no segfault, no errors with the corosync-* utilities

Comment 2 Jan Friesse 2012-02-27 10:37:50 UTC
Can you please provide your cluster.conf?

Comment 3 Patrick Van Gilst 2012-02-27 11:17:50 UTC
Created attachment 566012 [details]
cluster.conf

Comment 4 Jan Friesse 2012-02-27 12:25:08 UTC
(In reply to comment #3)
> Created attachment 566012 [details]
> cluster.conf

Thanks, can you please also provide "/usr/local/bin/bind_mount.sh" (even though I can roughly imagine what it does)?

Comment 5 Patrick Van Gilst 2012-02-27 12:33:08 UTC
Created attachment 566027 [details]
bind_mount.sh

Thanks for working on this issue.

Comment 6 Jan Friesse 2012-02-27 13:28:17 UTC
I believe I have found the main problem in corosync. Please confirm that you are seeing "Invalid IPC credentials." in /var/log/messages.

What I don't understand is which process running as rpcuser is trying to connect to corosync. This does not matter for fixing the bug, but it may become a problem for your environment in the future.

Comment 7 Patrick Van Gilst 2012-02-27 13:51:25 UTC
Yes I confirm that we have a lot of "Invalid IPC credentials." in /var/log/messages.

Concerning rpcuser, maybe this can help:

# ps aux | grep [r]pcuser
rpcuser  30140  0.0  0.0  27424  1396 ?        S<s  Feb24   0:03 rpc.statd -H /usr/share/cluster/nfsserver.sh -d
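
The log check confirmed above can be sketched as a simple line count. This is a minimal example, not a corosync tool: `count_ipc_rejections` is a hypothetical helper, and it assumes only that each rejected connection logs one line containing the quoted message.

```python
def count_ipc_rejections(log_path, needle="Invalid IPC credentials."):
    """Return how many lines of the log file contain the rejection message."""
    count = 0
    # errors="replace" keeps the scan going if the log holds non-UTF-8 bytes.
    with open(log_path, "r", encoding="utf-8", errors="replace") as log:
        for line in log:
            if needle in line:
                count += 1
    return count
```

For this report the call would be `count_ipc_rejections("/var/log/messages")`, run with enough privileges to read that file.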

Comment 10 Jan Friesse 2012-02-27 14:27:21 UTC
Created attachment 566051 [details]
Proposed patch

Unlink shm buffers if init fails

If IPC init failed, the buffers were unlinked by neither the client (lib)
side nor the server (corosync) side. This may fill all available space, so
that no further connections are accepted. A typical example is a user
running any corosync IPC binary (such as corosync-objctl) without a correct
uid/gid entry in the corosync configuration, resulting in a DoS.
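
The idea in the patch description, unlinking the buffers on the client side when init fails, can be illustrated with a short sketch. This is not the attached patch, which changes the C library, and every name here (`IpcInitError`, `create_buffer`, `connect`, the `handshake` callback) is hypothetical. The 8192-byte default mirrors the control_buffer size in the listing above and is otherwise an assumption.

```python
import os
import tempfile


class IpcInitError(Exception):
    """Hypothetical error standing in for a rejected IPC handshake."""


def create_buffer(directory, prefix, size=8192):
    """Create a temporary buffer file of the given size and return its path."""
    fd, path = tempfile.mkstemp(prefix=prefix, dir=directory)
    try:
        os.ftruncate(fd, size)
    finally:
        os.close(fd)
    return path


def connect(directory, handshake):
    """Create the buffers, run the handshake, and unlink the buffers on failure."""
    prefixes = ("control_buffer-", "request_buffer-",
                "response_buffer-", "dispatch_buffer-")
    paths = []
    try:
        for prefix in prefixes:
            paths.append(create_buffer(directory, prefix))
        handshake(paths)  # hypothetical callback; raises if the server refuses
    except Exception:
        # The server never took over the buffers, so the client removes them.
        for path in paths:
            try:
                os.unlink(path)
            except FileNotFoundError:
                pass  # already removed by the other side
        raise
    return paths
```

With a handshake that raises `IpcInitError`, the directory is empty again after the call, which is the behaviour the patch description aims for. On success this sketch simply leaves the files in place; what the real library does at that point is outside its scope.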

Comment 13 Patrick Van Gilst 2012-02-27 15:14:16 UTC
Thanks a lot for your efficiency!
I'm waiting for suggestions from the rgmanager maintainers before I give it a try.

Comment 15 Jan Friesse 2012-03-07 08:57:26 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Cause
A user account without IPC privileges tries to use corosync IPC.

Consequence
The application is correctly informed that it lacks IPC privileges and the error message is correctly logged, but the temporary buffers in /dev/shm used for IPC are not deleted, so /dev/shm keeps filling up.

Fix
Applications (via the library) delete the temporary buffers in /dev/shm if the corosync server has not done so.

Result
The library properly deletes the temporary buffers in /dev/shm if corosync has not done so, and /dev/shm no longer fills up.

Comment 20 Miroslav Svoboda 2012-04-26 12:32:15 UTC
    Technical note updated. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    Diffed Contents:
@@ -1,11 +1 @@
-Cause
+Previously, the underlying library of corosync did not delete temporary buffers used for Inter-Process Communication (IPC) that are stored in the /dev/shm shared memory file system. Therefore, if the user without proper privileges attempted to establish an IPC connection, the attempt failed with an error message as expected but memory allocated for temporary buffers was not released. This could eventually result in /dev/shm being fully used and Denial of Service. This update modifies the coroipcc library to let applications delete temporary buffers if the buffers were not deleted by the corosync server. The /dev/shm file system is no longer cluttered with needless data in this scenario and IPC connections can be established as expected.
-A user account without IPC privileges tries to use corosync IPC.
-
-Consequence
-The application is correctly informed that it lacks IPC privileges and the error message is correctly logged, but the temporary buffers in /dev/shm used for IPC are not deleted, so /dev/shm keeps filling up.
-
-Fix
-Applications (via the library) delete the temporary buffers in /dev/shm if the corosync server has not done so.
-
-Result
-The library properly deletes the temporary buffers in /dev/shm if corosync has not done so, and /dev/shm no longer fills up.

Comment 22 errata-xmlrpc 2012-06-20 12:23:20 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2012-0777.html