Bug 440967

Summary: dlm_emergency_shutdown on path failover
Product: Red Hat Enterprise Linux 4
Component: dlm-kernel
Version: 4.6
Hardware: powerpc
OS: Linux
Status: CLOSED NOTABUG
Severity: urgent
Priority: low
Keywords: TestBlocker
Target Milestone: rc
Target Release: ---
Reporter: Abdel Sadek <abdel.sadek>
Assignee: David Teigland <teigland>
QA Contact: Cluster QE <mspqa-list>
CC: cluster-maint, sghosh
Doc Type: Bug Fix
Last Closed: 2009-05-01 19:00:48 UTC

Description Abdel Sadek 2008-04-04 18:11:46 UTC
Description of problem:

Test: Red Hat Cluster Suite installed and configured on both cluster nodes.
While all resources were up (cluster in optimal state) and I/O was running
from a 3rd host, an automated multi-switch port-failure test was run.

After the test has run for a few hours, immediately following a failover, the
DLM (distributed lock manager) issues an emergency shutdown. All disk
resources associated with the cluster go into recovery. DLM prints:

lock_dlm:  Assertion failed on line 432 of file
/builddir/build/BUILD/gfs-kernel-2.6.9-75/up/src/dlm/lock.c

and the Kernel follows with:
kernel BUG in do_dlm_lock at
/builddir/build/BUILD/gfs-kernel-2.6.9-75/up/src/dlm/lock.c:432!



Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:

1) Install Red Hat Enterprise Linux 4.6 on two PPC hosts
2) Install the appropriate MPP and IBM client software
3) Install Red Hat Cluster Suite and create resources on the array
4) While I/O is running from the 3rd host, start MulitPathPortFail on the
switch (Cisco 9000).  The script runs the following loop (a minimal sketch of
it appears after this list):
   a) fail paths A (from host side)
   - sleep 600 seconds
   - bring paths A up
   - sleep 600 seconds
   - fail paths B (from host side)
   - sleep 600 seconds
   - bring paths B up
   - sleep 600 seconds
   - fail paths A (from array side)
   - sleep 600 seconds
   - bring paths A up
   - sleep 600 seconds
   - fail paths B (from array side)
   - sleep 600 seconds
   - bring paths B up
   - sleep 600 seconds
   - <loop back to step 4a>
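
The fail/restore sequence above reduces to a simple loop. The following is a
minimal Python sketch of that loop under stated assumptions, not the actual
MulitPathPortFail script; set_ports() is a hypothetical placeholder for
whatever switch-port commands the real test issues on the Cisco 9000.

import time

# Minimal sketch of the port-failure loop described above -- NOT the actual
# test script.  set_ports() is a hypothetical placeholder; in the real test
# the ports are toggled on the Cisco 9000 switch.
def set_ports(path_group, side, up):
    """Placeholder: enable/disable the switch ports for path group 'A' or 'B'
    on the given side ('host' or 'array')."""
    state = "up" if up else "down"
    print(f"paths {path_group} ({side} side) -> {state}")

PAUSE = 600  # seconds between actions, as in the reported test

while True:
    for side in ("host", "array"):
        for group in ("A", "B"):
            set_ports(group, side, up=False)  # fail the paths
            time.sleep(PAUSE)
            set_ports(group, side, up=True)   # bring the paths back up
            time.sleep(PAUSE)

At every point in this loop one path group (A or B) remains up on both the
host and array sides, which is why no loss of storage connectivity is expected.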

  
Actual results:
After the test has run for a few hours, immediately following a failover, the
DLM (distributed lock manager) issues an emergency shutdown. All disk
resources associated with the cluster go into recovery, and all cluster nodes
drop to the mon> prompt.

Expected results:
Cluster resources shouldn't go into recovery, cluster nodes shouldn't drop to
the mon> prompt, and there shouldn't be an emergency shutdown. One path from
each cluster node and from the storage array side was always available for
redundancy, which should have protected the cluster nodes from losing
communication with the storage array.

Additional info:

Host information:

Kernel = Linux gonsalves 2.6.9-67.EL #1 SMP Wed Nov 7 13:50:40 EST 2007 ppc64
ppc64 ppc64 GNU/Linux
RHEL Release = Red Hat Enterprise Linux AS release 4 (Nahant Update 6)
MPP version = 09.02.B5.15
HBA model =  Emulex LP10000 2Gb PCI-X Fibre Channel Adapter on PCI bus d8 device
08 irq 343
Emulex Driver = Emulex LightPulse Fibre Channel SCSI driver 8.0.16.34
Emulex FW = 2.10 (B2F2.10X8)
Emulex BIOS = 1.50a4

Comment 2 Subhendu Ghosh 2009-01-07 20:29:00 UTC
re-assigning to dlm-kernel not cachefilesd

Comment 3 David Teigland 2009-01-07 20:46:59 UTC
The dlm shuts down because the cluster (cman) has shut down.
The cluster will typically shut down due to a network disruption.

Comment 4 Nate Straz 2009-01-07 20:49:35 UTC
Can we get a copy of your cluster.conf?  We're missing some information that should be in there.

1. lock_dlm is really part of GFS, so we need some details on your GFS config
2. What is your multipath config?  Are you using something beyond device-mapper-multipath?
3. You talk about I/O from a third node.  Does that mean your GFS file system is exported via NFS to the third node?

Comment 5 David Teigland 2009-01-07 20:55:40 UTC
Another possibility for the cluster shutting down is something in the kernel
monopolizing the CPU and not giving the cman membership thread a chance to
send heartbeat messages.  We would expect to see some cman information about
this in /var/log/messages.
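
A minimal sketch for pulling cluster-related lines out of /var/log/messages on
the affected nodes, assuming the RHEL 4 cman/dlm kernel code tags its syslog
output with "CMAN", "dlm", or "lock_dlm" (the exact prefixes should be
verified locally):

import re

# Scan the syslog for cman/dlm-related lines; the prefixes below are an
# assumption about how the RHEL 4 cluster kernel modules tag their messages.
PATTERN = re.compile(r"CMAN|dlm|lock_dlm", re.IGNORECASE)

with open("/var/log/messages") as log:
    for line in log:
        if PATTERN.search(line):
            print(line.rstrip())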