Bug 561546
Summary: | aisexec spins when semaphores run out |
---|---|
Product: | Red Hat Enterprise Linux 5 |
Reporter: | David Teigland <teigland> |
Component: | openais |
Assignee: | Jan Friesse <jfriesse> |
Status: | CLOSED ERRATA |
QA Contact: | Cluster QE <mspqa-list> |
Severity: | urgent |
Priority: | urgent |
Version: | 5.5 |
CC: | cluster-maint, gbarros, iannis, jkortus, jwest, sdake, tao |
Target Milestone: | rc |
Keywords: | ZStream |
Hardware: | All |
OS: | Linux |
Fixed In Version: | openais-0.80.6-29.el5 |
Doc Type: | Bug Fix |
: | 626910 |
Last Closed: | 2011-07-21 07:46:43 UTC |
Bug Depends On: | 613551 |
Bug Blocks: | 626910, 694180 |
Description
David Teigland
2010-02-03 21:31:39 UTC
Description of problem:

When the system semaphores are exhausted, aisexec enters a tight spin, which makes debugging and diagnosing the problem difficult. It would be nice if aisexec could limit the retries it does so that it fails a little more gracefully, and maybe log an error message about running out of semaphores so people know what to do without searching or contacting support.

(In reply to comment #0)
Dave, I'm not totally sure which problem you mean. Is the problem that the *_init functions (cpg_init, ...) in the client wait forever for a semaphore, or is there really some problem in the aisexec server? If the problem is in the client, I'm not sure which solution to choose. Of course the function could return something like ERR_RETRY, ERR_LIBRARY, ... but isn't that an ABI/API breaker? Some kind of logging in the client seems even harder to me, because we don't have any API for logging in client apps.
Honza

Any library call that involves creating a semaphore can fail when the semaphores run out; I don't know what calls those are. When that happens, the library call should return a fatal error (not retry, since retrying is pointless), ideally some error which is unique or can be generally identified with the semaphores. Most if not all of my apps that use cpg will log the error that's returned, so we can easily tell. Right now no error is returned when semaphores run out, and the caller ends up stuck, looping or spinning forever.

In addition to library calls, if there is any non-library code in openais that creates semaphores, it should obviously check for errors and log an error.

Dave, semaphores should always be created first in the lib and only after that in the server. This is why I asked whether you know about something specific (a reproducer) in the server.
Even though retrying might seem pointless, that is not entirely true. You can always run the ipcrm command to remove dead semaphores, or kill the app holding the semaphores, and after that the stuck app will continue and start working without any problem. Of course adding a special error code is not a big issue, but from my point of view the changed behavior is an ABI/API breaker. Steve, any idea there?
(In reply to comment #3)

Honza,

There are two separate ways to fix this problem.

Method #1 involves changing the semaphore usage to use unnamed POSIX semaphores. This prevents our semaphores from running out and avoids interactions with third-party ISV applications.

Method #2 involves returning an error in util.c:openais_service_connect:336. Instead of looping on error, return an error such as ERR_LIBRARY.

To duplicate, create something like 150 or so ipc connections.

Note this problem doesn't exist in corosync because it uses POSIX unnamed semaphores.

*** Bug 584574 has been marked as a duplicate of this bug. ***

Created attachment 418296 [details]
MMapped shared memory
Support for mmapped shared memory. Even though no problem has been reported with SYS V shm, it is just a matter of time until somebody finds that the SYS V semaphore problem is solved but there is another one with shm.
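For readers unfamiliar with the approach, here is a minimal sketch of file-backed mmap shared memory. It is not the code from the attachment; the helper name `shm_region_create` and the path template are hypothetical and chosen only for illustration. The point is that no SYS V shm segment (and therefore no SYS V limit) is involved.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the attached patch): create a temporary
 * file from path_template (for example "/dev/shm/example-XXXXXX"), size
 * it to len bytes and map it MAP_SHARED.
 * Returns the mapping, or NULL on failure with errno set.
 * path_template is rewritten by mkstemp() with the real file name, so a
 * peer process can open and map the same file. The caller unlinks the
 * file once it is no longer needed.
 */
void *shm_region_create(char *path_template, size_t len)
{
    void *addr;
    int fd;

    assert(path_template != NULL && len > 0);

    fd = mkstemp(path_template);
    if (fd == -1)
        return NULL;

    if (ftruncate(fd, (off_t)len) == -1) {
        close(fd);
        unlink(path_template);
        return NULL;
    }

    addr = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd); /* the mapping stays valid after the descriptor is closed */
    if (addr == MAP_FAILED) {
        unlink(path_template);
        return NULL;
    }
    return addr;
}
```

A caller would pass a writable template such as `char path[] = "/dev/shm/example-XXXXXX";` and later release the region with `munmap()` and `unlink()`.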
Created attachment 418297 [details]
Support for POSIX semaphores
POSIX semaphores don't use SYS V semaphore resources, so this should fix the bug while keeping functionality. Apply on top of the MMAP patch.
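To illustrate why this avoids the SYS V limits, here is a minimal sketch of unnamed POSIX semaphores placed inside a shared mapping. It is an assumption-laden illustration, not the attached patch: the `ipc_control` layout and both function names are hypothetical, and an anonymous mapping is used only to keep the sketch short (a client/server pair would use a file-backed mapping instead).

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <semaphore.h>
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical layout: the semaphores live inside the shared region. */
struct ipc_control {
    sem_t request;  /* posted by the client, waited on by the server */
    sem_t response; /* posted by the server, waited on by the client */
};

/* Map a shared region and initialise both semaphores to 0.
 * Returns NULL on failure with errno set. */
struct ipc_control *ipc_control_create(void)
{
    struct ipc_control *ctl;

    ctl = mmap(NULL, sizeof(*ctl), PROT_READ | PROT_WRITE,
               MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (ctl == MAP_FAILED)
        return NULL;

    /* pshared = 1: usable by every process that maps this memory.
     * No semget() call is made, so no SYS V semaphore set is consumed. */
    if (sem_init(&ctl->request, 1, 0) == -1) {
        munmap(ctl, sizeof(*ctl));
        return NULL;
    }
    if (sem_init(&ctl->response, 1, 0) == -1) {
        sem_destroy(&ctl->request);
        munmap(ctl, sizeof(*ctl));
        return NULL;
    }
    return ctl;
}

/* Destroy both semaphores and unmap the region. */
void ipc_control_destroy(struct ipc_control *ctl)
{
    assert(ctl != NULL);
    sem_destroy(&ctl->request);
    sem_destroy(&ctl->response);
    munmap(ctl, sizeof(*ctl));
}
```

On older glibc versions the `sem_*` functions live in libpthread, so such code has to be linked with `-lpthread`.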
Created attachment 418298 [details]
IPC + posix semaphores hardening
Backport of almost the same patch as in corosync, to solve a segfault
issue in openais on exit.
The semop implementation doesn't have this problem because semop is a little
more clever and doesn't segfault on an uninitialized semaphore.
Apply on top of the Support for POSIX semaphores patch.
nice work Honza

Regards
-steve

Created attachment 428413 [details]
Support for POSIX semaphores - try2 - add LDFLAGS to libcpg, libconfdb
POSIX semaphores don't use SYS V semaphore resources, so this should fix the
bug while keeping functionality. Apply on top of the MMAP patch.
It also adds LDFLAGS -lpthread to the libcpg and libconfdb linking, so the new version works correctly with old applications.
Created attachment 430373 [details]
Proposed patch - previous 3 patches merged together
The patch contains the previous 3 patches merged together and also a "backport" of file preallocation.
Created attachment 431741 [details]
Proposed patch - previous 3 patches merged together and removed /var/run from shm path
The patch contains the previous 3 patches merged together and also a "backport" of
file preallocation.
The only change since the previous patch is the removal of "/var/run" from the valid shm paths (so only /dev/shm is supported now).
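The thread does not spell out what the preallocation does. A plausible reading (an assumption, not confirmed by the patch description) is that the backing file is filled with zeros up front, so that a full /dev/shm is reported as an error at creation time rather than as a fault when a mapped page is first touched. A minimal sketch under that assumption, with a hypothetical helper name:

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the attached patch): create a file from
 * path_template (for example "/dev/shm/example-XXXXXX") and write len
 * zero bytes into it so that every block is really allocated.
 * Returns an open descriptor, or -1 with errno set (for example ENOSPC).
 */
int shm_file_create(char *path_template, size_t len)
{
    char zeros[4096];
    int fd;

    assert(path_template != NULL && len > 0);

    fd = mkstemp(path_template);
    if (fd == -1)
        return -1;

    memset(zeros, 0, sizeof(zeros));
    while (len > 0) {
        size_t chunk = len < sizeof(zeros) ? len : sizeof(zeros);
        ssize_t written = write(fd, zeros, chunk);

        if (written == -1 && errno == EINTR)
            continue; /* interrupted before anything was written: retry */
        if (written <= 0) {
            /* No progress is reported as ENOSPC in this sketch. */
            int saved = written == 0 ? ENOSPC : errno;

            close(fd);
            unlink(path_template);
            errno = saved;
            return -1;
        }
        len -= (size_t)written;
    }
    return fd;
}
```

The returned descriptor can then be passed to `mmap()` with `MAP_SHARED`.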
Please note that this bugzilla will provide method #2 to offer some limited relief for the customer issue of complete deadlock. Instead of deadlocking, applications that properly check errors will print them to the customer, allowing them to deduce that the system limits have been reached.

Looks like we should be able to verify this one.

This request was evaluated by Red Hat Product Management for inclusion in the current release of Red Hat Enterprise Linux. Because the affected component is not scheduled to be updated in the current release, Red Hat is unfortunately unable to address this request at this time. Red Hat invites you to ask your support representative to propose this request, if appropriate and relevant, in the next release of Red Hat Enterprise Linux.

This request was erroneously denied for the current release of Red Hat Enterprise Linux. The error has been fixed and this request has been re-proposed for the current release.

*** Bug 626910 has been marked as a duplicate of this bug. ***

Created attachment 486065 [details]
Proposed patch
open_ais_service_connect creates SYS V semaphores and shms.
If the system limit for semaphores/shms was exceeded, the code looped
in an endless cycle.
Now ENOSPC is correctly handled and SA_AIS_ERR_NO_SPACE is
returned to the lib user.
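The change in behaviour can be sketched roughly as follows. This is not the attached patch: the function names and the `sketch_error` enum are hypothetical stand-ins, with numeric values assumed to mirror SaAisErrorT (15 is consistent with the "error 15" in the test output below). The semget() arguments follow the strace in the test case.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <sys/ipc.h>
#include <sys/sem.h>

/* Hypothetical stand-ins for SaAisErrorT values (assumed numbering). */
enum sketch_error {
    SKETCH_OK = 1,
    SKETCH_ERR_LIBRARY = 2,
    SKETCH_ERR_NO_SPACE = 15
};

/* Map a semget() errno to the error handed back to the library user. */
int sem_errno_to_error(int err)
{
    return err == ENOSPC ? SKETCH_ERR_NO_SPACE : SKETCH_ERR_LIBRARY;
}

/*
 * Create a private semaphore set under a random key.
 * Only a key collision (EEXIST) is retried. ENOSPC means a system-wide
 * limit was hit, which retrying cannot fix, so it is returned at once
 * instead of looping forever.
 */
int sem_set_create(int nsems, int *semid_out)
{
    assert(nsems > 0 && semid_out != NULL);

    for (;;) {
        key_t key = (key_t)random();
        int semid = semget(key, nsems, IPC_CREAT | IPC_EXCL | 0600);

        if (semid != -1) {
            *semid_out = semid;
            return SKETCH_OK;
        }
        if (errno == EEXIST)
            continue; /* key already in use: try another random key */
        return sem_errno_to_error(errno);
    }
}
```

A caller that logs the returned code can then tell resource exhaustion apart from other failures, which is what the reporter asked for.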
Created attachment 486066 [details]
Test case
Test case for patch:
Output without patch:
[root@node-08 a]# ./without
Cpg initialize 0
Cpg initialize 1
...
Cpg initialize 126
Cpg initialize 127
strace $PID:
...
semget(0x22969f66, 3, IPC_CREAT|IPC_EXCL|0600) = -1 ENOSPC (No space left on device)
geteuid32() = 0
semget(0x16018da9, 3, IPC_CREAT|IPC_EXCL|0600) = -1 ENOSPC (No space left on device)
geteuid32() = 0
semget(0x7732a8b1, 3, IPC_CREAT|IPC_EXCL|0600) = -1 ENOSPC (No space left on device)
geteuid32() = 0
...
With patch:
[root@node-08 a]# ./with
Cpg initialize 0
Cpg initialize 1
...
Cpg initialize 126
Cpg initialize 127
Could not initialize Cluster Process Group API instance error 15
Patch sent to ML

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-1012.html