Bug 247291 - shutdown while processing relocation request results in node reboot
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: rgmanager
Version: 5.0
Hardware: All
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: ---
Assignee: Lon Hohberger
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2007-07-06 18:18 UTC by Lon Hohberger
Modified: 2009-04-16 22:18 UTC (History)

Fixed In Version: RHBA-2007-0580
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2007-11-07 16:46:15 UTC
Target Upstream Version:
Embargoed:


Links
System: Red Hat Product Errata
ID: RHBA-2007:0580
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: rgmanager bug fix and enhancement update
Last Updated: 2007-10-30 15:37:24 UTC

Description Lon Hohberger 2007-07-06 18:18:16 UTC
Description of problem:

On one node, if you run 'clusvcadm -r foo' in a loop, it bounces the service
back and forth between nodes.

On one of the nodes the service is using, if you run 'while : ; do service
rgmanager start; sleep 60; service rgmanager stop; done', you will eventually
get this:

Jul  6 13:57:55 lisa rgmanager: [17227]: <notice> Shutting down Cluster Service Manager...
Jul  6 13:57:55 lisa clurgmgrd[16434]: <notice> Shutting down
Jul  6 13:57:56 lisa clurgmgrd[16434]: <notice> Shutdown complete, exiting
Jul  6 13:57:56 lisa kernel: clurgmgrd[17239]: segfault at 0000000000000000 rip 0000000000415cd8 rsp 0000000044605f30 error 4
Jul  6 13:57:56 lisa kernel: dlm: rgmanager: group leave failed -512 0
Jul  6 13:57:56 lisa clurgmgrd[16433]: <crit> Watchdog: Daemon died, rebooting...
Jul  6 13:57:56 lisa dlm_controld[3676]: open "/sys/kernel/dlm/rgmanager/control" error -1 2
Jul  6 13:57:56 lisa dlm_controld[3676]: open "/sys/kernel/dlm/rgmanager/event_done" error -1 2
Jul  6 13:57:56 lisa kernel: md: stopping all md devices.
Jul  6 13:57:57 lisa kernel: Synchronizing SCSI cache for disk sda:


Version-Release number of selected component (if applicable): 5.1 beta


How reproducible: difficult

Comment 1 Lon Hohberger 2007-07-16 17:34:38 UTC
I think this can be solved by tracking all threads (even simple ones) and making
sure they're cleaned up in the exit path.  I will test this soon.
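
For illustration only, the idea above (track every thread, then clean them all
up in the exit path) could be sketched roughly as follows. This is not the
actual rgmanager patch: the names tracked_thread, spawn_tracked,
join_all_tracked and example_worker are hypothetical, and the sketch assumes
plain joinable POSIX threads.

```c
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical registry entry: one per thread the daemon has spawned. */
struct tracked_thread {
    pthread_t tid;
    struct tracked_thread *next;
};

static struct tracked_thread *thread_list = NULL;
static pthread_mutex_t thread_list_lock = PTHREAD_MUTEX_INITIALIZER;

/* Spawn a joinable thread and record it. Returns 0 on success, -1 on error. */
int spawn_tracked(void *(*fn)(void *), void *arg)
{
    struct tracked_thread *t = malloc(sizeof(*t));
    int rc;

    if (!t)
        return -1;

    pthread_mutex_lock(&thread_list_lock);
    rc = pthread_create(&t->tid, NULL, fn, arg);
    if (rc == 0) {
        /* Only threads that really started go on the list. */
        t->next = thread_list;
        thread_list = t;
    }
    pthread_mutex_unlock(&thread_list_lock);

    if (rc != 0) {
        free(t);
        return -1;
    }
    return 0;
}

/*
 * Exit path: join every tracked thread before shared state is torn down.
 * The lock is dropped before each join so that a thread which is still
 * spawning helpers cannot deadlock against us. Returns the number joined.
 */
int join_all_tracked(void)
{
    int joined = 0;

    for (;;) {
        struct tracked_thread *t;

        pthread_mutex_lock(&thread_list_lock);
        t = thread_list;
        if (t)
            thread_list = t->next;
        pthread_mutex_unlock(&thread_list_lock);

        if (!t)
            break;

        pthread_join(t->tid, NULL);
        free(t);
        joined++;
    }
    return joined;
}

/* Trivial stand-in for a request-handling thread: marks its slot as done. */
void *example_worker(void *arg)
{
    *(int *)arg = 1;
    return NULL;
}
```

In such a scheme the daemon would call join_all_tracked() after it stops
accepting new requests and before it logs that shutdown is complete, so that
no request thread is still running once the process state has been freed.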

Comment 3 RHEL Program Management 2007-07-20 20:07:13 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 4 Lon Hohberger 2007-07-23 18:20:45 UTC
Test setup:
* 5 node cluster
* 2 exclusive services (test1, test2)

Reproduce case:
* on node 1: 
   while :; do clusvcadm -r test1; done
* on node 2:
   while :; do clusvcadm -r test2; done
* on node 3 (**):
   while :; do service rgmanager stop; service rgmanager start; sleep 30; done

**: This needs to be one of the nodes the service is hitting.


Comment 5 Lon Hohberger 2007-07-24 18:51:54 UTC
Patches in RHEL5, RHEL51, head.

Comment 8 errata-xmlrpc 2007-11-07 16:46:15 UTC
An advisory has been issued which should help resolve the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2007-0580.html


