Bug 440967 - dlm_emergency_shutdown on path failover
Status: CLOSED NOTABUG
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: dlm-kernel
Version: 4.6
Hardware: powerpc Linux
Priority: low  Severity: urgent
Target Milestone: rc
Target Release: ---
Assigned To: David Teigland
QA Contact: Cluster QE
Keywords: TestBlocker
Depends On:
Blocks:
Reported: 2008-04-04 14:11 EDT by Abdel Sadek
Modified: 2016-04-26 09:49 EDT
CC List: 2 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2009-05-01 15:00:48 EDT
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments: None
Description Abdel Sadek 2008-04-04 14:11:46 EDT
Description of problem:

Test: Red Hat Cluster Suite installed and configured on both cluster nodes.  While all
resources are up (cluster in an optimal state) and IO is running from a 3rd
host, run an automated multi-switch port-fail test.

After the test has run for a few hours, immediately following a failover, dlm
(the distributed lock manager) issues an emergency shutdown. All disk resources
associated with the cluster go into recovery. DLM prints:

lock_dlm:  Assertion failed on line 432 of file
/builddir/build/BUILD/gfs-kernel-2.6.9-75/up/src/dlm/lock.c

and the Kernel follows with:
kernel BUG in do_dlm_lock at
/builddir/build/BUILD/gfs-kernel-2.6.9-75/up/src/dlm/lock.c:432!



Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:

1) Install Red Hat Enterprise Linux 4.6 on two PPC hosts
2) Install the appropriate mpp and IBM client software
3) Install Red Hat Cluster Suite and create resources on the array
4) While IO is running from the 3rd host, start MultiPathPortFail on the switch
(Cisco 9000).  The script does the following (a minimal sketch follows the list):
   a) fail paths A (from host sides)
   - sleep 600 seconds
   - bring paths A up
   - sleep 600 seconds
   - fail paths B (from host sides)
   - sleep 600 seconds
   - bring paths B up
   - sleep 600 seconds
   - fail paths A (from array side)
   - sleep 600 seconds
   - bring paths A up
   - sleep 600 seconds
   - fail paths B (from array side)
   - sleep 600 seconds
   - bring paths B up
   - sleep 600 seconds
   - <loop back to step 4a>
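
A minimal sketch of the fail/restore loop described in step 4, written in Python
for illustration only. The fail_paths()/restore_paths() helpers are hypothetical
placeholders for whatever switch CLI commands the real MultiPathPortFail script
issues against the Cisco 9000; they are not part of the original report.

import time

SLEEP = 600  # seconds between each action, per the steps above

def fail_paths(group, side):
    # Hypothetical placeholder: disable the switch ports for path group
    # 'A' or 'B' on the 'host' or 'array' side via the switch CLI.
    print("would fail paths %s (from %s side)" % (group, side))

def restore_paths(group, side):
    # Hypothetical placeholder: re-enable the same switch ports.
    print("would bring paths %s up (on %s side)" % (group, side))

def multipath_port_fail_loop():
    # One pass of steps 4a onward; loops back to 4a until interrupted.
    actions = [("A", "host"), ("B", "host"), ("A", "array"), ("B", "array")]
    while True:
        for group, side in actions:
            fail_paths(group, side)
            time.sleep(SLEEP)
            restore_paths(group, side)
            time.sleep(SLEEP)

if __name__ == "__main__":
    multipath_port_fail_loop()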

  
Actual results:
After the test has run for a few hours, immediately following a failover, dlm
(the distributed lock manager) issues an emergency shutdown. All disk resources
associated with the cluster go into recovery. All cluster nodes go to mon> mode.

Expected results:
Cluster resources shouldn't go into recovery, cluster nodes shouldn't go to
mon> mode, and there shouldn't be an emergency shutdown.  There was always one path
available from each cluster node, and from the storage array side, for redundancy and
to protect the cluster nodes from losing communication with the storage array.

Additional info:

Host information:

Kernel = Linux gonsalves 2.6.9-67.EL #1 SMP Wed Nov 7 13:50:40 EST 2007 ppc64
ppc64 ppc64 GNU/Linux
RHEL Release = Red Hat Enterprise Linux AS release 4 (Nahant Update 6)
MPP version = 09.02.B5.15
HBA model =  Emulex LP10000 2Gb PCI-X Fibre Channel Adapter on PCI bus d8 device
08 irq 343
Emulex Driver = Emulex LightPulse Fibre Channel SCSI driver 8.0.16.34
Emulex FW = 2.10 (B2F2.10X8)
EmulexBIOS = 1.50a4
Comment 2 Subhendu Ghosh 2009-01-07 15:29:00 EST
Re-assigning to dlm-kernel, not cachefilesd.
Comment 3 David Teigland 2009-01-07 15:46:59 EST
The dlm shuts down because the cluster (cman) has shut down.
The cluster will typically shut down due to a network disruption.
Comment 4 Nate Straz 2009-01-07 15:49:35 EST
Can we get a copy of your cluster.conf?  We're missing some information that should be in there.

1. lock_dlm is really part of GFS, so we need some details on your GFS config
2. What is your multipath config?  Are you using something beyond device-mapper-multipath?
3. You talk about I/O from a third node.  Does that mean your GFS file system is exported via NFS to the third node?
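
For reference, a minimal RHEL 4 Cluster Suite cluster.conf has roughly the shape
sketched below. The node names, fence device, and GFS resource here are placeholder
assumptions for illustration; the reporter's actual values are exactly the
information this comment is asking for.

<?xml version="1.0"?>
<cluster name="example" config_version="1">
  <clusternodes>
    <clusternode name="node1" votes="1">
      <fence>
        <method name="single">
          <device name="apc1" port="1"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="node2" votes="1">
      <fence>
        <method name="single">
          <device name="apc1" port="2"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <fencedevices>
    <fencedevice name="apc1" agent="fence_apc" ipaddr="10.0.0.1" login="apc" passwd="apc"/>
  </fencedevices>
  <rm>
    <resources>
      <!-- GFS file system managed as a cluster resource -->
      <clusterfs name="gfs1" device="/dev/sdb1" fstype="gfs" mountpoint="/mnt/gfs1"/>
    </resources>
  </rm>
</cluster>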
Comment 5 David Teigland 2009-01-07 15:55:40 EST
Another possibility for the cluster shutting down is something
in the kernel monopolizing the CPU and not giving the cman membership
thread a chance to send heartbeat messages.  We would expect to see
some cman information about this in /var/log/messages.
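
A quick way to pull those cman (and lock_dlm) messages out of the default syslog
location, sketched in Python; the log path and match strings are assumptions,
adjust them to the local syslog setup.

# Print cman- and lock_dlm-related lines from the default syslog file.
with open("/var/log/messages", errors="replace") as log:
    for line in log:
        if "cman" in line.lower() or "lock_dlm" in line:
            print(line.rstrip())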
