Description of problem:

When the system semaphores are exhausted, aisexec enters a tight spin which makes debugging/diagnosing the problem difficult. It would be nice if aisexec could limit the retries it does, to fail a little more gracefully, and maybe log an error message about running out of semaphores so people know what to do without searching or contacting support.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
Dave, I'm not totally sure which problem you mean. Is the problem with the *_init functions (cpg_init, ...) in the client waiting forever for a semaphore, or is there really some problem with the aisexec server? If the problem is in the client, I'm not sure what solution to choose. Of course the function can return something like ERR_RETRY, ERR_LIBRARY, ..., but isn't that an ABI/API breaker? Some kind of logging in the client seems even harder to me, because we don't have any API for logging in client apps.

Honza

(In reply to comment #0)
> Description of problem:
>
> When the system semaphores are exhausted, aisexec enters a tight spin which
> makes debugging/diagnosing the problem difficult. It would be nice if aisexec
> could limit the retries it does to fail a little more gracefully, and maybe log
> an error message about running out of semaphores so people know what to do
> without searching or contacting support.
>
> Version-Release number of selected component (if applicable):
>
>
> How reproducible:
>
>
> Steps to Reproduce:
> 1.
> 2.
> 3.
>
> Actual results:
>
>
> Expected results:
>
>
> Additional info:
Any library call that involves creating a semaphore can fail when the semaphores run out; I don't know which calls those are. When that happens, the library call should return a fatal error (not retry, since retrying is pointless), ideally an error which is unique or can be generally identified with the semaphores. Most if not all of my apps that use cpg will log the error that's returned, so we can easily tell. Right now no error is returned when semaphores run out, and the caller ends up stuck, looping or spinning forever.

In addition to library calls, if there is any non-library code in openais that creates semaphores, it should obviously check for errors and log an error.
Dave, semaphores are always created first in the lib and only after that in the server. This is why I asked if you know about something specific (a reproducer) in the server. Even though retrying might seem pointless, that is not entirely true. You can always run the ipcrm command to remove dead semaphores, or kill the app holding the semaphore, and after that the stuck app will continue and start working without any problem. Of course adding a special error code is not a big issue, but from my point of view the changed behavior is an ABI/API breaker.

Steve, any idea here?

(In reply to comment #3)
> Any library call that involves creating a semaphore can fail when the
> semaphores run out; I don't know what calls those are. When that happens, the
> library call should return a fatal error (not retry since retrying is
> pointless), ideally some error which is unique or can be generally identified
> with the semaphores. Most if not all of my apps that use cpg will log the
> error that's returned so we can easily tell. Right now no error is returned
> when semaphores run out, and the caller ends up stuck, looping or spinning
> forever.
>
> In addition to library calls, if there is any non-library code in openais that
> creates semaphores, it should obviously check for errors and log and error.
Honza,

There are two separate ways to fix this problem.

Method #1 involves changing the semaphore usage to unnamed POSIX semaphores. This prevents our semaphores from running out and avoids interactions with third-party ISV applications.

Method #2 involves returning an error in util.c:openais_service_connect:336. Instead of looping on error, return an error such as ERR_LIBRARY.

To duplicate, create something like 150 or so IPC connections.

Note this problem doesn't exist in corosync because it uses POSIX unnamed semaphores.
*** Bug 584574 has been marked as a duplicate of this bug. ***
Created attachment 418296 [details]
MMapped shared memory

Support for mmapped shared memory. Even though the problem was not reported for SYS V shm, it is just a matter of time until somebody finds that, with the SYS V semaphore problem solved, there is another one with shm.
Created attachment 418297 [details]
Support for POSIX semaphores

POSIX semaphores don't use SYS V semaphore resources, so this should fix the bug while keeping functionality. Apply on top of the MMAP patch.
Created attachment 418298 [details]
IPC + posix semaphores hardening

Backport of almost the same patch as in corosync, to solve a segfault issue in openais on exit. The semop implementation doesn't have this problem because semop is a little more clever and doesn't segfault on an uninitialized semaphore. Apply on top of the "Support for POSIX semaphores" patch.
nice work Honza

Regards
-steve
Created attachment 428413 [details]
Support for POSIX semaphores - try2 - add LDFLAGS to libcpg, libconfdb

POSIX semaphores don't use SYS V semaphore resources, so this should fix the bug while keeping functionality. Apply on top of the MMAP patch. This version also adds -lpthread to the LDFLAGS for libcpg and libconfdb linking, so the new version works correctly with old applications.
Created attachment 430373 [details]
Proposed patch - previous 3 patches merged together

The patch contains the previous 3 patches merged together and also a "backport" of file preallocation.
Created attachment 431741 [details]
Proposed patch - previous 3 patches merged together and removed /var/run from shm path

The patch contains the previous 3 patches merged together and also a "backport" of file preallocation. The only change since the previous patch is the removal of "/var/run" from the valid shm paths (so only /dev/shm is supported now).
Please note this bugzilla will provide method #2 to offer some limited relief for the customer issue of complete deadlock. Instead of deadlocking, applications that properly check errors will print them, allowing the customer to deduce that the system limits have been reached.
Looks like we should be able to verify this one.
This request was evaluated by Red Hat Product Management for inclusion in the current release of Red Hat Enterprise Linux. Because the affected component is not scheduled to be updated in the current release, Red Hat is unfortunately unable to address this request at this time. Red Hat invites you to ask your support representative to propose this request, if appropriate and relevant, in the next release of Red Hat Enterprise Linux.
This request was erroneously denied for the current release of Red Hat Enterprise Linux. The error has been fixed and this request has been re-proposed for the current release.
*** Bug 626910 has been marked as a duplicate of this bug. ***
Created attachment 486065 [details]
Proposed patch

open_ais_service_connect creates SYS V semaphores and shms. If the system limit for semaphores/shms is exceeded, the code looped in an endless cycle. Now ENOSPC is correctly handled and SA_AIS_ERR_NO_SPACE is returned to the library user.
Created attachment 486066 [details]
Test case

Test case for the patch.

Output without the patch:

[root@node-08 a]# ./without
Cpg initialize 0
Cpg initialize 1
...
Cpg initialize 126
Cpg initialize 127

strace $PID:
...
semget(0x22969f66, 3, IPC_CREAT|IPC_EXCL|0600) = -1 ENOSPC (No space left on device)
geteuid32() = 0
semget(0x16018da9, 3, IPC_CREAT|IPC_EXCL|0600) = -1 ENOSPC (No space left on device)
geteuid32() = 0
semget(0x7732a8b1, 3, IPC_CREAT|IPC_EXCL|0600) = -1 ENOSPC (No space left on device)
geteuid32() = 0
...

With the patch:

[root@node-08 a]# ./with
Cpg initialize 0
Cpg initialize 1
...
Cpg initialize 126
Cpg initialize 127
Could not initialize Cluster Process Group API instance error 15
Patch sent to ML
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-1012.html