247291 – shutdown while processing relocation request results in node reboot

Summary: shutdown while processing relocation request results in node reboot

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 5
Classification:	Red Hat
Component:	rgmanager
Sub Component:
Version:	5.0
Hardware:	All
OS:	Linux
Priority:	urgent
Severity:	urgent
Target Milestone:	---
Target Release:	---
Assignee:	Lon Hohberger
QA Contact:
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2007-07-06 18:18 UTC by Lon Hohberger
Modified:	2009-04-16 22:18 UTC
CC List:	1 user
Fixed In Version:	RHBA-2007-0580
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2007-11-07 16:46:15 UTC
Target Upstream Version:
Embargoed:
Dependent Products:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2007:0580	0	normal	SHIPPED_LIVE	rgmanager bug fix and enhancement update	2007-10-30 15:37:24 UTC

Description Lon Hohberger 2007-07-06 18:18:16 UTC

Description of problem:

On one node, if you run 'clusvcadm -r foo' in a loop, it bounces the service
back and forth between nodes.

On one of the nodes the service is using, if you run 'while : ; do service
rgmanager start; sleep 60; service rgmanager stop; done', you will eventually
get this:

Jul  6 13:57:55 lisa rgmanager: [17227]: <notice> Shutting down Cluster Service Manager...
Jul  6 13:57:55 lisa clurgmgrd[16434]: <notice> Shutting down
Jul  6 13:57:56 lisa clurgmgrd[16434]: <notice> Shutdown complete, exiting
Jul  6 13:57:56 lisa kernel: clurgmgrd[17239]: segfault at 0000000000000000 rip 0000000000415cd8 rsp 0000000044605f30 error 4
Jul  6 13:57:56 lisa kernel: dlm: rgmanager: group leave failed -512 0
Jul  6 13:57:56 lisa clurgmgrd[16433]: <crit> Watchdog: Daemon died, rebooting...
Jul  6 13:57:56 lisa dlm_controld[3676]: open "/sys/kernel/dlm/rgmanager/control" error -1 2
Jul  6 13:57:56 lisa dlm_controld[3676]: open "/sys/kernel/dlm/rgmanager/event_done" error -1 2
Jul  6 13:57:56 lisa kernel: md: stopping all md devices.
Jul  6 13:57:57 lisa kernel: Synchronizing SCSI cache for disk sda:


Version-Release number of selected component (if applicable): 5.1 beta


How reproducible: difficult

Comment 1 Lon Hohberger 2007-07-16 17:34:38 UTC

I think this can be solved by tracking all threads (even simple ones) and making
sure they're cleaned up in the exit path.  I will test this soon.
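
The idea of tracking every thread and joining it in the exit path could look
roughly like the sketch below. This is not the rgmanager patch. Every name in
it (tracked_spawn, tracked_join_all, relocation_worker, workers_finished) is
hypothetical, and the worker only waits for shutdown instead of handling a
real relocation request. It assumes POSIX threads and builds with -pthread.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical registry: one node per thread the daemon has started. */
struct tracked_thread {
    pthread_t tid;
    struct tracked_thread *next;
};

static struct tracked_thread *thread_list;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t shutdown_cond = PTHREAD_COND_INITIALIZER;
static int shutting_down;      /* protected by lock */
static int workers_finished;   /* protected by lock */

/* Start a joinable thread and record it so the exit path can find it.
 * Returns 0 on success, -1 if shutdown has begun or creation failed. */
int tracked_spawn(void *(*fn)(void *), void *arg)
{
    struct tracked_thread *t = malloc(sizeof(*t));
    if (!t)
        return -1;

    pthread_mutex_lock(&lock);
    if (shutting_down || pthread_create(&t->tid, NULL, fn, arg) != 0) {
        pthread_mutex_unlock(&lock);
        free(t);
        return -1;
    }
    t->next = thread_list;
    thread_list = t;
    pthread_mutex_unlock(&lock);
    return 0;
}

/* Exit path: refuse new threads, wake the existing ones, then join each
 * of them before the caller tears down shared state.
 * Returns the number of threads joined. */
int tracked_join_all(void)
{
    int joined = 0;

    pthread_mutex_lock(&lock);
    shutting_down = 1;
    pthread_cond_broadcast(&shutdown_cond);
    pthread_mutex_unlock(&lock);

    for (;;) {
        /* Pop one entry under the lock, join it with the lock released. */
        pthread_mutex_lock(&lock);
        struct tracked_thread *t = thread_list;
        if (t)
            thread_list = t->next;
        pthread_mutex_unlock(&lock);
        if (!t)
            break;

        int rc = pthread_join(t->tid, NULL);
        assert(rc == 0);
        (void)rc;
        free(t);
        joined++;
    }
    return joined;
}

/* Stand-in for a request-handling thread: it blocks until shutdown is
 * signalled, then records that it finished cleanly. */
void *relocation_worker(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock);
    while (!shutting_down)
        pthread_cond_wait(&shutdown_cond, &lock);
    workers_finished++;
    pthread_mutex_unlock(&lock);
    return NULL;
}
```

In this sketch the daemon would start each short-lived thread through
tracked_spawn() and call tracked_join_all() before releasing locks or leaving
the DLM lockspace, so no thread is still running against state that has
already been freed.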

Comment 3 RHEL Program Management 2007-07-20 20:07:13 UTC

This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 4 Lon Hohberger 2007-07-23 18:20:45 UTC

Test setup:
* 5 node cluster
* 2 exclusive services (test1, test2)

Reproduce case:
* on node 1: 
   while :; do clusvcadm -r test1; done
* on node 2:
   while :; do clusvcadm -r test2; done
* on node 3 (**):
   while :; do service rgmanager stop; service rgmanager start; sleep 30; done

**: This needs to be one of the nodes the service is hitting.

Comment 5 Lon Hohberger 2007-07-24 18:51:54 UTC

Patches in RHEL5, RHEL51, head.

Comment 8 errata-xmlrpc 2007-11-07 16:46:15 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2007-0580.html