Description of problem:
Running a two-node active-passive NFS cluster with RHCS 6.2, /dev/shm becomes 100% full after about one hour. SELinux is in Permissive mode.

# df -h /dev/shm/
Filesystem            Size  Used Avail Use% Mounted on
tmpfs                  16G     -     -    - /dev/shm

# ls -l /dev/shm/ | wc -l
21275

# ls /dev/shm
control_buffer-geIdyN   control_buffer-T04M9s   dispatch_buffer-9UF86u
dispatch_buffer-mhchTW  dispatch_buffer-yWbMFr  request_buffer-GkctZk
request_buffer-SU8WQu   response_buffer-AEBaB3  response_buffer-n1BCmK
.........

# ls -l /dev/shm | less
total 16409808
-rw-------. 1 rpcuser rpcuser 8192 Feb 24 13:22 control_buffer-00AIgE
-rw-------. 1 root    root    8192 Feb 24 14:51 control_buffer-00oq6Q
-rw-------. 1 rpcuser rpcuser 8192 Feb 24 13:48 control_buffer-00R0LQ
-rw-------. 1 rpcuser rpcuser 8192 Feb 24 13:00 control_buffer-01AsLt
-rw-------. 1 rpcuser rpcuser 8192 Feb 24 12:16 control_buffer-01D8op
-rw-------. 1 rpcuser rpcuser 8192 Feb 24 13:16 control_buffer-01wU0Z
-rw-------. 1 rpcuser rpcuser 8192 Feb 24 12:20 control_buffer-01XgT8
-rw-------. 1 rpcuser rpcuser 8192 Feb 24 14:14 control_buffer-01xmph
......

It seems that coroipcc.c cannot free shared memory.

Version-Release number of selected component (if applicable):
Red Hat Enterprise Linux Server release 6.2 (Santiago)
cman-3.0.12.1-23.el6.x86_64
rgmanager-3.0.12.1-5.el6.x86_64
resource-agents-3.9.2-7.el6.x86_64
corosync-1.4.1-4.el6.x86_64
nfs-utils-lib-1.1.5-4.el6.x86_64
nfs-utils-1.2.3-15.el6.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Create an RHCS cluster for NFS in active-passive mode.
2. Start the service.
3. /dev/shm fills up.

Actual results:
/dev/shm fills up. As a consequence:

# cman_tool version -r
segfaults
# corosync-objctl -a
Could not initialize objdb library. Error 2

The corosync-* utilities cannot be used.

Expected results:
/dev/shm does not fill up, no segfault, and no errors with the corosync-* utilities.
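The growth shown in the listing above can be tracked by tallying the entries in /dev/shm by buffer type and owner. What follows is a minimal Python sketch, not part of corosync or RHCS. The function name summarize_buffers is hypothetical, and it assumes the buffer files follow the "<type>_buffer-<random suffix>" naming seen in the listing.

```python
import os
import pwd
from collections import Counter


def summarize_buffers(directory="/dev/shm"):
    """Count regular files by (name prefix, owner) and sum their sizes.

    Hypothetical helper: assumes names such as "control_buffer-00AIgE",
    where everything before the last "-" identifies the buffer type.
    """
    counts = Counter()
    total_bytes = 0
    for entry in os.scandir(directory):
        if not entry.is_file(follow_symlinks=False):
            continue
        info = entry.stat(follow_symlinks=False)
        # "control_buffer-00AIgE" -> "control_buffer"
        prefix = entry.name.rsplit("-", 1)[0]
        try:
            owner = pwd.getpwuid(info.st_uid).pw_name
        except KeyError:
            owner = str(info.st_uid)  # uid without a passwd entry
        counts[(prefix, owner)] += 1
        total_bytes += info.st_size  # apparent size, not allocated blocks
    return counts, total_bytes
```

Calling summarize_buffers() a few minutes apart and comparing the counters shows which owner and which buffer type are accumulating. In this report that would be expected to point at rpcuser.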
Can you please provide your cluster.conf?
Created attachment 566012 [details] cluster.conf
(In reply to comment #3)
> Created attachment 566012 [details]
> cluster.conf

Thanks. Can you please also provide "/usr/local/bin/bind_mount.sh" (even though I can more or less imagine what it does)?
Created attachment 566027 [details]
bind_mount.sh

Thanks for working on this issue.
I believe I have found the main problem in corosync. Please confirm that you are seeing "Invalid IPC credentials." in /var/log/messages. What I don't understand is which process running as rpcuser is trying to connect to corosync. Answering that is not needed to fix this bug, but it may become a problem for your environment in the future.
Yes, I confirm that we have a lot of "Invalid IPC credentials." messages in /var/log/messages. Concerning rpcuser, maybe this can help:

# ps aux | grep [r]pcuser
rpcuser 30140 0.0 0.0 27424 1396 ? S<s Feb24 0:03 rpc.statd -H /usr/share/cluster/nfsserver.sh -d
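To put a number on "a lot", the occurrences of the message can be counted. This is a small Python sketch under the assumption that the log is a plain text file readable by the invoking user. The function name count_marker is hypothetical, and the default marker is the string quoted in this thread.

```python
def count_marker(log_path, marker="Invalid IPC credentials."):
    """Return the number of lines in log_path that contain marker."""
    hits = 0
    # errors="replace" keeps the scan going over stray non-UTF-8 bytes.
    with open(log_path, encoding="utf-8", errors="replace") as log:
        for line in log:
            if marker in line:
                hits += 1
    return hits
```

For example, count_marker("/var/log/messages") returns the number of rejected IPC attempts logged in that file.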
Created attachment 566051 [details]
Proposed patch

Unlink shm buffers if init fails

If IPC init failed, the buffers were unlinked neither by the client (lib) side nor by the server (corosync) side. This could fill all available space, so that no further connections are accepted. A typical example is a user running any corosync IPC binary (such as corosync-objctl) without a correct uid/gid entry in the corosync configuration, resulting in a DoS.
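The idea behind the patch, removing the buffer files on the client side when init fails, can be illustrated with a short Python sketch. This is not the corosync code (the real change is in the C library) and everything here is an assumption for illustration. connect and handshake are hypothetical names, handshake stands in for the server-side credential check, and the four buffer kinds are taken from the file names in the original report.

```python
import os
import tempfile

# Buffer kinds as seen in the /dev/shm listing of the report.
BUFFER_KINDS = ("control", "request", "response", "dispatch")


def connect(shm_dir, handshake):
    """Create the buffer files, run handshake, unlink the files on failure.

    handshake is a hypothetical callable standing in for the server's
    credential check; it returns True when the connection is accepted.
    """
    paths = []
    try:
        for kind in BUFFER_KINDS:
            fd, path = tempfile.mkstemp(prefix=f"{kind}_buffer-", dir=shm_dir)
            os.close(fd)
            paths.append(path)
        if not handshake(paths):
            raise PermissionError("IPC init rejected")
        return paths
    except Exception:
        # Without this loop, every rejected attempt leaves its files behind.
        for path in paths:
            try:
                os.unlink(path)
            except FileNotFoundError:
                pass  # the other side already removed it
        raise
```

With a handshake that always returns False, repeated calls leave shm_dir empty. Removing the cleanup loop reproduces the leak described in this report, with four files left behind per failed attempt. What happens to the files after a successful init is outside the scope of this sketch.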
Thanks a lot for your efficiency! I'm waiting for suggestions from the rgmanager maintainers before I give it a try.
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
Cause
Attempting corosync IPC from a user account without the privileges to do IPC.

Consequence
The application is correctly informed that it lacks the privileges to do IPC and the error message is correctly logged, but the temporary buffers in /dev/shm used for IPC are not deleted, so /dev/shm keeps filling up.

Fix
Applications delete the temporary buffers in /dev/shm (implemented in the library) if the corosync server did not do so.

Result
The library properly deletes the temporary buffers in /dev/shm if corosync did not do so, and /dev/shm no longer fills up.
Technical note updated. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -1,11 +1 @@
-Cause
+Previously, the underlying library of corosync did not delete temporary buffers used for Inter-Process Communication (IPC) that are stored in the /dev/shm shared memory file system. Therefore, if the user without proper privileges attempted to establish an IPC connection, the attempt failed with an error message as expected but memory allocated for temporary buffers was not released. This could eventually result in /dev/shm being fully used and Denial of Service. This update modifies the coroipcc library to let applications delete temporary buffers if the buffers were not deleted by the corosync server. The /dev/shm file system is no longer cluttered with needless data in this scenario and IPC connections can be established as expected.
-Trying to do corosync IPC with user account without privileges to do IPC.
-
-Consequence
-Application is correctly informed about no privileges to do IPC, error message is correctly logged, but temporary buffers in /dev/shm used for IPC are not deleted and /dev/shm is keep filling.
-
-Fix
-Delete temporary buffers in /dev/shm by applications (implemented in lib) if corosync server didn't did so.
-
-Result
-Library properly deletes temporary buffers in /dev/shm if corosync didn't did so and /dev/shm is not filled up.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHBA-2012-0777.html