Description of problem:
Test: Red Hat Cluster installed and configured on both cluster nodes. While all resources were up (cluster in optimal state) and I/O was running from a third host, an automated multi-switch port-fail test was run. After the test had run for a few hours, immediately following a failover, the dlm (distributed lock manager) issued an emergency shutdown. All disk resources associated with the cluster went into recovery.

dlm prints:
lock_dlm: Assertion failed on line 432 of file /builddir/build/BUILD/gfs-kernel-2.6.9-75/up/src/dlm/lock.c
and the kernel follows with:
kernel BUG in do_dlm_lock at /builddir/build/BUILD/gfs-kernel-2.6.9-75/up/src/dlm/lock.c:432!

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1) Install Red Hat 4.6 on two PPC hosts
2) Install the appropriate MPP and IBM client software
3) Install Red Hat Cluster Suite and create resources on the array
4) While I/O is running from a third host, start MultiPathPortFail on the switch (Cisco 9000). The script does:
   a) fail paths A (from the host side)
   - sleep 600 seconds
   - bring paths A up
   - sleep 600 seconds
   - fail paths B (from the host side)
   - sleep 600 seconds
   - bring paths B up
   - sleep 600 seconds
   - fail paths A (from the array side)
   - sleep 600 seconds
   - bring paths A up
   - sleep 600 seconds
   - fail paths B (from the array side)
   - sleep 600 seconds
   - bring paths B up
   - sleep 600 seconds
   - <loop back to step 4a>

Actual results:
After the test has run for a few hours, immediately following a failover, the dlm (distributed lock manager) issues an emergency shutdown. All disk resources associated with the cluster go into recovery. All cluster nodes drop to the mon> prompt.

Expected results:
Cluster resources should not go into recovery, cluster nodes should not drop to the mon> prompt, and there should be no emergency shutdown. One path from each cluster node and from the storage-array side was always available for redundancy, which should protect the cluster nodes from losing communication with the storage array.
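The step-4a loop can be sketched as a small shell script. Note that fail_path and restore_path below are hypothetical placeholders: the real MultiPathPortFail script drives the Cisco 9000 switch CLI to take ports down and bring them back, and that part is not shown in this report.

```shell
# Sketch of the step-4a loop. fail_path/restore_path are hypothetical
# stand-ins for the real switch CLI commands that toggle the ports on
# the Cisco 9000; only the fail/restore/sleep sequence is from the report.
fail_path()    { echo "fail paths $1 ($2 side)"; }
restore_path() { echo "restore paths $1 ($2 side)"; }

# One full cycle: fail/restore paths A then B from the host side,
# then from the array side, sleeping $1 seconds between each step.
portfail_cycle() {
    delay=$1
    for side in host array; do
        for path in A B; do
            fail_path "$path" "$side"
            sleep "$delay"
            restore_path "$path" "$side"
            sleep "$delay"
        done
    done
}

# The test loops this indefinitely with 600-second pauses:
#   while true; do portfail_cycle 600; done
```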
Additional info:
Host information:
Kernel = Linux gonsalves 2.6.9-67.EL #1 SMP Wed Nov 7 13:50:40 EST 2007 ppc64 ppc64 ppc64 GNU/Linux
RHEL Release = Red Hat Enterprise Linux AS release 4 (Nahant Update 6)
MPP version = 09.02.B5.15
HBA model = Emulex LP10000 2Gb PCI-X Fibre Channel Adapter on PCI bus d8 device 08 irq 343
Emulex Driver = Emulex LightPulse Fibre Channel SCSI driver 8.0.16.34
Emulex FW = 2.10 (B2F2.10X8)
Emulex BIOS = 1.50a4
re-assigning to dlm-kernel not cachefilesd
The dlm shuts down because the cluster (cman) has shut down. The cluster will typically shut down due to a network disruption.
Can we get a copy of your cluster.conf? We're missing some information that should be in there.
1. lock_dlm is really part of GFS, so we need some details on your GFS config.
2. What is your multipath config? Are you using something beyond device-mapper-multipath?
3. You talk about I/O from a third node. Does that mean your GFS file system is exported via NFS to the third node?
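For reference, a RHEL 4 two-node cluster.conf has roughly the shape below. Every name and fence device in this skeleton is made up for illustration; the reporter's actual file is what is needed.

```xml
<?xml version="1.0"?>
<!-- Illustrative skeleton only: cluster name, node names and fence
     devices here are hypothetical, not taken from the reporter's setup. -->
<cluster name="example_cluster" config_version="1">
  <cman two_node="1" expected_votes="1"/>
  <clusternodes>
    <clusternode name="node1" votes="1">
      <fence>
        <method name="1">
          <device name="fence1"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="node2" votes="1">
      <fence>
        <method name="1">
          <device name="fence2"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <fencedevices>
    <fencedevice name="fence1" agent="fence_manual"/>
    <fencedevice name="fence2" agent="fence_manual"/>
  </fencedevices>
  <rm/>
</cluster>
```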
Another possibility for the cluster shutting down is something in the kernel monopolizing the CPU and not giving the cman membership thread a chance to send heartbeat messages. We would expect to see some cman information about this in /var/log/messages.
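As a sketch of where to look, cman reports membership trouble through the kernel log, which lands in /var/log/messages on RHEL 4; a minimal filter follows (the sample message in the comment is only illustrative):

```shell
# cman logs membership problems via the kernel log; on RHEL 4 they end
# up in /var/log/messages. This helper filters them out of a log file.
cman_messages() {
    grep -i 'CMAN' "${1:-/var/log/messages}"
}

# Typical usage on an affected node:
#   cman_messages | tail -50
# Missed heartbeats show up as lines like (illustrative):
#   kernel: CMAN: removing node <name> from the cluster : Missed too many heartbeats
```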