Bug 177163 - umount of a gfs filesystem hangs
Summary: umount of a gfs filesystem hangs
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Cluster Suite
Classification: Retired
Component: cman
Version: 4
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Christine Caulfield
QA Contact: Cluster QE
URL:
Whiteboard:
Depends On:
Blocks: 164915
Reported: 2006-01-06 19:49 UTC by Robert Peterson
Modified: 2009-04-16 20:00 UTC (History)

Fixed In Version: RHBA-2006-0236
Clone Of:
Environment:
Last Closed: 2006-03-09 19:47:36 UTC
Embargoed:


Attachments
libcman program to demonstrate hang (656 bytes, text/x-csrc)
2006-01-13 14:02 UTC, Christine Caulfield
Proposed patch for testing (1.09 KB, patch)
2006-01-13 16:23 UTC, Christine Caulfield
Test version w/Patrick's 16-bit fix and Bob's /proc/cluster/msg_history (9.80 KB, patch)
2006-01-13 23:17 UTC, Robert Peterson
Additional patch (681 bytes, patch)
2006-01-16 17:00 UTC, Christine Caulfield
New patch for testing (2.42 KB, patch)
2006-01-17 14:27 UTC, Christine Caulfield
Yet another patch (2.48 KB, patch)
2006-01-19 13:43 UTC, Christine Caulfield


Links
System: Red Hat Product Errata
ID: RHBA-2006:0236
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: cman-kernel bug fix update
Last Updated: 2006-03-09 05:00:00 UTC

Description Robert Peterson 2006-01-06 19:49:33 UTC
Description of problem:

Running umount on a gfs filesystem hung after my cluster sat mostly idle for a
few days.  

This is a three-node cluster running RHEL4-U3 code.
The cluster consists of trin-12, trin-13, and trin-14, all 32-bit i686 boxes.

The sequence of events was like this:

1. I rebooted all nodes in my cluster, and everything came up normally,
   with nfs service running on trin-13.
2. I used ls -l on my gfs mountpoint to make sure everything was up normally.
3. I used the cluster management GUI from trin-12 to disable my nfs
   service on trin-13.
4. I let all systems sit mostly idle for two days while I worked on other
   systems.  No suspicious messages appeared in dmesg during that time.  No
   test cases were running, and no cluster-suite related commands were issued
   that I can recall.
5. On trin-12, after a couple of days of idle time, I ran umount on my gfs
   filesystem.

On the failing (hung) system, trin-12, I dumped the task status and the umount
task looked like this:

umount        D 00008EE2  2228  5399   3729                     (NOTLB)
dda46e80 00000082 54feae58 00008ee2 df194600 0000041f 54ff81f4 00008ee2
       d7bc4c50 d7bc4ddc df5c51a8 df5c5180 dda46e9c dda46ed0 c030f021 00000000
       d7bc4c50 c011d0ce 00000000 00000000 54feadb7 dda46eb8 c011c50c 00000001
Call Trace:
 [<c030f021>] wait_for_completion+0x11f/0x206
 [<c011d0ce>] default_wake_function+0x0/0xc
 [<c011c50c>] activate_task+0x53/0x5f
 [<c011d0ce>] default_wake_function+0x0/0xc
 [<e153e021>] kcl_leave_service+0xe1/0x142 [cman]
 [<e01e0f56>] release_mountgroup+0x137/0x152 [lock_dlm]
 [<e01e357d>] lm_dlm_unmount+0x21/0x45 [lock_dlm]
 [<e018635e>] lm_unmount+0x39/0x6d [lock_harness]
 [<e197a92c>] gfs_lm_unmount+0x1e/0x26 [gfs]
 [<e198b07a>] gfs_put_super+0x302/0x331 [gfs]
 [<c016f471>] generic_shutdown_super+0x119/0x2eb
 [<e1988a80>] gfs_kill_sb+0x1f/0x43 [gfs]
 [<c016f145>] deactivate_super+0xc5/0xda
 [<c018a388>] sys_umount+0x65/0x6c
 [<c015b195>] unmap_vma_list+0xe/0x17
 [<c015b544>] do_munmap+0x1c8/0x1d2
 [<c018a39a>] sys_oldumount+0xb/0xe
 [<c03115af>] syscall_call+0x7/0xb

So cman is hanging in kcl_leave_service, blocked in wait_for_completion,
presumably waiting for a reply.
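
For anyone triaging a similar trace, the module that owns the blocked call can be picked out mechanically. The following is only a minimal sketch: the regular expression and the function name module_frames are my own, and they assume frames are formatted exactly as in the trace above, with the module name in square brackets at the end of the line.

```python
import re

# One frame looks like:  [<e153e021>] kcl_leave_service+0xe1/0x142 [cman]
# The trailing "[module]" is absent for functions built into the kernel.
FRAME_RE = re.compile(
    r'\[<[0-9a-f]+>\]\s+(\w+)\+0x[0-9a-f]+/0x[0-9a-f]+(?:\s+\[(\w+)\])?'
)

def module_frames(trace_text):
    """Return (function, module) for every frame that names a module."""
    frames = []
    for line in trace_text.splitlines():
        m = FRAME_RE.search(line)
        if m and m.group(2):
            frames.append((m.group(1), m.group(2)))
    return frames

# A few frames copied from the trace in this report.
SAMPLE_TRACE = """\
 [<c030f021>] wait_for_completion+0x11f/0x206
 [<c011d0ce>] default_wake_function+0x0/0xc
 [<e153e021>] kcl_leave_service+0xe1/0x142 [cman]
 [<e01e0f56>] release_mountgroup+0x137/0x152 [lock_dlm]
"""

if __name__ == "__main__":
    # The first module frame is the innermost module code on the stack.
    for func, mod in module_frames(SAMPLE_TRACE):
        print(mod, func)
```

The first tuple returned is the innermost module frame, which is the caller sitting directly above wait_for_completion in this trace.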
On the failing system, I did: clustat and got:
[root@trin-12 src]# clustat
Member Status: Quorate

  Member Name                              Status
  ------ ----                              ------
  trin-12                                  Online, Local, rgmanager
  trin-13                                  Online, rgmanager
  trin-14                                  Online, rgmanager

  Service Name         Owner (Last)                   State
  ------- ----         ----- ------                   -----
  nfssvc               (trin-13)                      disabled

On the failing system, I did: cat /proc/cluster/services and saw:

Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 run       -
[1 2 3]

DLM Lock Space:  "clvmd"                             2   3 run       -
[1 2 3]

DLM Lock Space:  "bobs_gfs"                          3   4 run       -
[1 2 3]

DLM Lock Space:  "Magma"                             6   7 run       -
[1 2 3]

GFS Mount Group: "bobs_gfs"                          4   5 run       S-11,208,1
[1 2 3]

User:            "usrm::manager"                     5   6 run       -
[1 2 3]

So gfs is trying to leave the mountgroup "bobs_gfs".
According to Dave Teigland, it looks like cman/sm has started leaving the group
and is waiting for acks from the other nodes (S-11 == SEST_LEAVE_ACKWAIT).
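
A quick way to spot a group stuck like this is to scan the services listing for any entry whose Code column is not "-". The sketch below is not part of cman. The regular expression and the name pending_services are my own, and they assume the line layout is exactly what the output above shows (type, quoted name, GID, LID, state, code).

```python
import re

# Matches service lines such as:
#   GFS Mount Group: "bobs_gfs"      4   5 run       S-11,208,1
# The column header and the "[1 2 3]" member lines do not match.
SERVICE_RE = re.compile(
    r'^(?P<type>[^:]+):\s+"(?P<name>[^"]+)"\s+'
    r'(?P<gid>\d+)\s+(?P<lid>\d+)\s+(?P<state>\S+)\s+(?P<code>\S+)\s*$'
)

def pending_services(text):
    """Return (type, name, state, code) for services whose code is not '-'."""
    pending = []
    for line in text.splitlines():
        m = SERVICE_RE.match(line)
        if m and m.group("code") != "-":
            pending.append(
                (m.group("type"), m.group("name"),
                 m.group("state"), m.group("code"))
            )
    return pending

# Excerpt of the trin-12 output from this report.
SAMPLE = """\
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 run       -
[1 2 3]

DLM Lock Space:  "bobs_gfs"                          3   4 run       -
[1 2 3]

GFS Mount Group: "bobs_gfs"                          4   5 run       S-11,208,1
[1 2 3]

User:            "usrm::manager"                     5   6 run       -
[1 2 3]
"""

if __name__ == "__main__":
    for svc in pending_services(SAMPLE):
        print(svc)
```

Run against the listing from each node, this makes it easy to see that only the node doing the umount reports a pending code.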

Doing cat /proc/cluster/services on the other nodes produced:

[root@trin-13 cluster-i686-2005-12-21-1504]# cat /proc/cluster/services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 run       -
[1 2 3]

DLM Lock Space:  "clvmd"                             2   3 run       -
[1 2 3]

DLM Lock Space:  "bobs_gfs"                          3   4 run       -
[1 2 3]

DLM Lock Space:  "Magma"                             6   7 run       -
[1 2 3]

GFS Mount Group: "bobs_gfs"                          4   5 run       -
[1 2 3]

User:            "usrm::manager"                     5   6 run       -
[1 2 3]

And:

[root@trin-14 src]# cat /proc/cluster/services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 run       -
[1 2 3]

DLM Lock Space:  "clvmd"                             2   3 run       -
[1 2 3]

DLM Lock Space:  "bobs_gfs"                          3   4 run       -
[2 1 3]

DLM Lock Space:  "Magma"                             6   7 run       -
[1 2 3]

GFS Mount Group: "bobs_gfs"                          4   5 run       -
[2 1 3]

User:            "usrm::manager"                     5   6 run       -
[1 2 3]

This seemed to indicate that the other nodes may not have seen the message they
were supposed to ack.

Netstat -a on trin-12 produced this:

[root@trin-12 ~]# netstat -a
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address               Foreign Address             State
tcp        0      0 *:nfs                       *:*                         LISTEN
tcp        0      0 *:867                       *:*                         LISTEN
tcp        0      0 trin-12.lab.msp.redha:21064 *:*                         LISTEN
tcp        0      0 *:32778                     *:*                         LISTEN
tcp        0      0 *:32779                     *:*                         LISTEN
tcp        0      0 *:41966                     *:*                         LISTEN
tcp        0      0 *:sunrpc                    *:*                         LISTEN
tcp        0      0 *:41968                     *:*                         LISTEN
tcp        0      0 *:5008                      *:*                         LISTEN
tcp        0      0 *:883                       *:*                         LISTEN
tcp        0      0 *:50008                     *:*                         LISTEN
tcp        0      0 trin-12.lab.msp.redha:32775 trin-14.lab.msp.redha:21064 ESTABLISHED
tcp        0      0 trin-12.lab.msp.redha:32774 trin-13.lab.msp.redha:21064 ESTABLISHED
tcp        0      0 *:41967                     *:*                         LISTEN
tcp        0      0 *:41969                     *:*                         LISTEN
tcp        0      0 *:ssh                       *:*                         LISTEN
tcp        0      0 ::1:50006                   *:*                         LISTEN
tcp        0      0 *:50009                     *:*                         LISTEN
tcp        0      0 trin-12.lab.msp.redhat.:ssh technetium.msp.redhat:52981 ESTABLISHED
tcp        0      0 trin-12.lab.msp.redhat.:ssh technetium.msp.redhat:53897 ESTABLISHED
udp        0      0 *:nfs                       *:*
udp        0      0 *:32771                     *:*
udp        0      0 *:32772                     *:*
udp        0      0 trin-12.lab.msp.re:6809     *:*
udp        0      0 broadcast:6809              *:*
udp        0      0 *:23456                     *:*
udp        0      0 *:807                       *:*
udp        0      0 *:bootpc                    *:*
udp        0      0 *:bootpc                    *:*
udp        0      0 *:864                       *:*
udp        0      0 *:sunrpc                    *:*
udp        0      0 *:880                       *:*
udp        0      0 *:50007                     *:*
Active UNIX domain sockets (servers and established)
Proto RefCnt Flags       Type       State         I-Node Path
unix  2      [ ACC ]     STREAM     LISTENING     6551   /var/run/cluster/ccsd.sock
unix  2      [ ACC ]     STREAM     LISTENING     7577   /tmp/.font-unix/fs7100
unix  2      [ ]         DGRAM                    7686   @/var/run/hal/hotplug_socket
unix  2      [ ]         DGRAM                    4387   @udevd
unix  2      [ ACC ]     STREAM     LISTENING     6824   @clvmd
unix  14     [ ]         DGRAM                    6339   /dev/log
unix  2      [ ACC ]     STREAM     LISTENING     7090   /var/run/acpid.socket
unix  2      [ ACC ]     STREAM     LISTENING     7619   /var/run/dbus/system_bus_socket
unix  2      [ ]         DGRAM                    110454
unix  2      [ ]         DGRAM                    83382
unix  2      [ ]         DGRAM                    75460
unix  2      [ ]         DGRAM                    50446
unix  2      [ ]         DGRAM                    9926
unix  2      [ ]         DGRAM                    7767
unix  3      [ ]         STREAM     CONNECTED     7685   /var/run/dbus/system_bus_socket
unix  3      [ ]         STREAM     CONNECTED     7684
unix  2      [ ]         DGRAM                    7638
unix  3      [ ]         STREAM     CONNECTED     7622
unix  3      [ ]         STREAM     CONNECTED     7621
unix  2      [ ]         DGRAM                    7612
unix  2      [ ]         DGRAM                    7525
unix  2      [ ]         DGRAM                    7175
unix  2      [ ]         DGRAM                    6837
unix  2      [ ]         DGRAM                    6747
unix  2      [ ]         DGRAM                    6533
unix  3      [ ]         STREAM     CONNECTED     6514
unix  3      [ ]         STREAM     CONNECTED     6513
unix  2      [ ]         DGRAM                    6347
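
To compare the inter-node TCP links across nodes, it helps to pull out just the ESTABLISHED connections on port 21064, which every node in these listings is listening on (as far as I know this is the default DLM port, but treat that as an assumption). The sketch below uses a function name of my own choosing and copes with the state wrapping onto its own line, as it does in the pasted output.

```python
def established_on_port(netstat_text, port):
    """Return (local, foreign) pairs for ESTABLISHED TCP connections
    where either endpoint uses the given port."""
    suffix = ":" + str(port)
    pairs = []
    lines = netstat_text.splitlines()
    for i, line in enumerate(lines):
        fields = line.split()
        if len(fields) < 5 or fields[0] != "tcp":
            continue
        local, foreign = fields[3], fields[4]
        # The state is normally the sixth field, but in wrapped output
        # it can end up alone on the following line.
        if len(fields) >= 6:
            state = fields[5]
        elif i + 1 < len(lines):
            state = lines[i + 1].strip()
        else:
            state = ""
        if state == "ESTABLISHED" and (
            local.endswith(suffix) or foreign.endswith(suffix)
        ):
            pairs.append((local, foreign))
    return pairs

# Excerpt of the trin-12 listing, with one wrapped and one unwrapped entry.
SAMPLE = """\
tcp        0      0 trin-12.lab.msp.redha:21064 *:*                         LISTEN
tcp        0      0 trin-12.lab.msp.redha:32775 trin-14.lab.msp.redha:21064
ESTABLISHED
tcp        0      0 trin-12.lab.msp.redha:32774 trin-13.lab.msp.redha:21064 ESTABLISHED
tcp        0      0 *:ssh                       *:*                         LISTEN
"""

if __name__ == "__main__":
    for local, foreign in established_on_port(SAMPLE, 21064):
        print(local, "->", foreign)
```

For trin-12 this reports one connection each to trin-13 and trin-14, so the TCP links themselves look intact.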

Netstat -a on trin-13 produced this:

[root@trin-13 cluster-i686-2005-12-21-1504]# netstat -a
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address               Foreign Address             State
tcp        0      0 *:nfs                       *:*                         LISTEN
tcp        0      0 *:776                       *:*                         LISTEN
tcp        0      0 trin-13.lab.msp.redha:21064 *:*                         LISTEN
tcp        0      0 *:5801                      *:*                         LISTEN
tcp        0      0 *:5901                      *:*                         LISTEN
tcp        0      0 *:41966                     *:*                         LISTEN
tcp        0      0 *:sunrpc                    *:*                         LISTEN
tcp        0      0 *:41968                     *:*                         LISTEN
tcp        0      0 *:5008                      *:*                         LISTEN
tcp        0      0 *:6001                      *:*                         LISTEN
tcp        0      0 *:32789                     *:*                         LISTEN
tcp        0      0 *:32790                     *:*                         LISTEN
tcp        0      0 *:792                       *:*                         LISTEN
tcp        0      0 *:50008                     *:*                         LISTEN
tcp        0      0 trin-13.lab.msp.redha:21064 trin-13.lab.msp.redha:32784 ESTABLISHED
tcp        0      0 trin-13.lab.msp.redha:32774 trin-14.lab.msp.redha:21064 ESTABLISHED
tcp        0      0 trin-13.lab.msp.redha:32784 trin-13.lab.msp.redha:21064 ESTABLISHED
tcp        0      0 trin-13.lab.msp.redha:21064 trin-12.lab.msp.redha:32774 ESTABLISHED
tcp        0      0 trin-13.lab.msp.redha:21064 trin-14.lab.msp.redha:32774 ESTABLISHED
tcp        0      0 trin-13.lab.msp.redhat:5901 technetium.msp.redhat:51801 ESTABLISHED
tcp        0      0 *:41967                     *:*                         LISTEN
tcp        0      0 *:6001                      *:*                         LISTEN
tcp        0      0 *:41969                     *:*                         LISTEN
tcp        0      0 *:ssh                       *:*                         LISTEN
tcp        0      0 ::1:50006                   *:*                         LISTEN
tcp        0      0 *:50009                     *:*                         LISTEN
tcp        0      0 trin-13.lab.msp.redhat.:ssh technetium.msp.redhat:58444 ESTABLISHED
udp        0      0 *:nfs                       *:*
udp        0      0 *:32771                     *:*
udp        0      0 *:32772                     *:*
udp        0      0 *:773                       *:*
udp        0      0 *:789                       *:*
udp        0      0 trin-13.lab.msp.re:6809     *:*
udp        0      0 broadcast:6809              *:*
udp        0      0 *:23456                     *:*
udp        0      0 *:bootpc                    *:*
udp        0      0 *:716                       *:*
udp        0      0 *:sunrpc                    *:*
udp        0      0 *:50007                     *:*
raw        0      0 *:icmp                      *:*                         7
Active UNIX domain sockets (servers and established)
Proto RefCnt Flags       Type       State         I-Node Path
unix  2      [ ACC ]     STREAM     LISTENING     7186   /tmp/.font-unix/fs7100
unix  2      [ ACC ]     STREAM     LISTENING     7248   /var/run/dbus/system_bus_socket
unix  2      [ ACC ]     STREAM     LISTENING     132158 @/tmp/fam-root-
unix  2      [ ACC ]     STREAM     LISTENING     131882 /tmp/.X11-unix/X1
unix  2      [ ACC ]     STREAM     LISTENING     131968 /tmp/orbit-root/linc-2efa-0-352b6cee4b0d5
unix  2      [ ACC ]     STREAM     LISTENING     131975 /tmp/orbit-root/linc-2ed8-0-688899c94c95d
unix  2      [ ACC ]     STREAM     LISTENING     132098 /tmp/.ICE-unix/11992
unix  2      [ ACC ]     STREAM     LISTENING     132456 /tmp/mapping-root
unix  2      [ ACC ]     STREAM     LISTENING     132107 /tmp/keyring-UBDS9K/socket
unix  2      [ ACC ]     STREAM     LISTENING     132123 /tmp/orbit-root/linc-2efe-0-7112fc9f6c3e4
unix  2      [ ACC ]     STREAM     LISTENING     132143 /tmp/orbit-root/linc-2f00-0-49ca6aba5fc1
unix  2      [ ACC ]     STREAM     LISTENING     132267 /tmp/orbit-root/linc-2f25-0-4adc9d44cf36
unix  2      [ ACC ]     STREAM     LISTENING     132300 /tmp/orbit-root/linc-2f29-0-7daa02e6ed464
unix  2      [ ACC ]     STREAM     LISTENING     132322 /tmp/orbit-root/linc-2f2d-0-2f96663b273b
unix  2      [ ACC ]     STREAM     LISTENING     132343 /tmp/orbit-root/linc-2f2b-0-2f96663b88298
unix  2      [ ACC ]     STREAM     LISTENING     132414 /tmp/orbit-root/linc-2f37-0-7de67610eaa4e
unix  2      [ ACC ]     STREAM     LISTENING     132443 /tmp/orbit-root/linc-2f33-0-765a6e5a7056f
unix  2      [ ACC ]     STREAM     LISTENING     132500 /tmp/orbit-root/linc-2f40-0-1978dbb4ab9f9
unix  2      [ ACC ]     STREAM     LISTENING     132530 /tmp/orbit-root/linc-2f42-0-abe7c1b4666f
unix  2      [ ACC ]     STREAM     LISTENING     132567 /tmp/orbit-root/linc-2f44-0-abe7c1bdfd3f
unix  2      [ ACC ]     STREAM     LISTENING     132597 /tmp/orbit-root/linc-2f46-0-34b75c0b1c659
unix  2      [ ACC ]     STREAM     LISTENING     132704 /tmp/orbit-root/linc-2f4a-0-745b5083a41a3
unix  2      [ ACC ]     STREAM     LISTENING     7006   /dev/gpmctl
unix  16     [ ]         DGRAM                    6000   /dev/log
unix  2      [ ]         DGRAM                    7429   @/var/run/hal/hotplug_socket
unix  2      [ ]         DGRAM                    4301   @udevd
unix  2      [ ACC ]     STREAM     LISTENING     6534   @clvmd
unix  2      [ ACC ]     STREAM     LISTENING     6222   /var/run/cluster/ccsd.sock
unix  2      [ ACC ]     STREAM     LISTENING     6827   /var/run/acpid.socket
unix  2      [ ]         DGRAM                    149404
unix  3      [ ]         STREAM     CONNECTED     132707 /tmp/orbit-root/linc-2f4a-0-745b5083a41a3
unix  3      [ ]         STREAM     CONNECTED     132706
unix  3      [ ]         STREAM     CONNECTED     132703 /tmp/orbit-root/linc-2efa-0-352b6cee4b0d5
unix  3      [ ]         STREAM     CONNECTED     132702
unix  3      [ ]         STREAM     CONNECTED     132648 /tmp/.X11-unix/X1
unix  3      [ ]         STREAM     CONNECTED     132647
unix  2      [ ]         DGRAM                    132642
unix  3      [ ]         STREAM     CONNECTED     132630 /tmp/.X11-unix/X1
unix  3      [ ]         STREAM     CONNECTED     132629
unix  3      [ ]         STREAM     CONNECTED     132612 /tmp/orbit-root/linc-2f29-0-7daa02e6ed464
unix  3      [ ]         STREAM     CONNECTED     132611
unix  3      [ ]         STREAM     CONNECTED     132610 /tmp/orbit-root/linc-2f46-0-34b75c0b1c659
unix  3      [ ]         STREAM     CONNECTED     132609
unix  3      [ ]         STREAM     CONNECTED     132604 /tmp/orbit-root/linc-2f46-0-34b75c0b1c659
unix  3      [ ]         STREAM     CONNECTED     132603
unix  3      [ ]         STREAM     CONNECTED     132602 /tmp/orbit-root/linc-2efe-0-7112fc9f6c3e4
unix  3      [ ]         STREAM     CONNECTED     132601
unix  3      [ ]         STREAM     CONNECTED     132600 /tmp/orbit-root/linc-2f46-0-34b75c0b1c659
unix  3      [ ]         STREAM     CONNECTED     132599
unix  3      [ ]         STREAM     CONNECTED     132596 /tmp/orbit-root/linc-2efa-0-352b6cee4b0d5
unix  3      [ ]         STREAM     CONNECTED     132595
unix  3      [ ]         STREAM     CONNECTED     132590 /tmp/.X11-unix/X1
unix  3      [ ]         STREAM     CONNECTED     132589
unix  3      [ ]         STREAM     CONNECTED     132582 /tmp/orbit-root/linc-2f29-0-7daa02e6ed464
unix  3      [ ]         STREAM     CONNECTED     132581
unix  3      [ ]         STREAM     CONNECTED     132580 /tmp/orbit-root/linc-2f44-0-abe7c1bdfd3f
unix  3      [ ]         STREAM     CONNECTED     132579
unix  3      [ ]         STREAM     CONNECTED     132574 /tmp/orbit-root/linc-2f44-0-abe7c1bdfd3f
unix  3      [ ]         STREAM     CONNECTED     132573
unix  3      [ ]         STREAM     CONNECTED     132572 /tmp/orbit-root/linc-2efe-0-7112fc9f6c3e4
unix  3      [ ]         STREAM     CONNECTED     132571
unix  3      [ ]         STREAM     CONNECTED     132570 /tmp/orbit-root/linc-2f44-0-abe7c1bdfd3f
unix  3      [ ]         STREAM     CONNECTED     132569
unix  3      [ ]         STREAM     CONNECTED     132566 /tmp/orbit-root/linc-2efa-0-352b6cee4b0d5
unix  3      [ ]         STREAM     CONNECTED     132565
unix  3      [ ]         STREAM     CONNECTED     132560 /tmp/.X11-unix/X1
unix  3      [ ]         STREAM     CONNECTED     132559
unix  3      [ ]         STREAM     CONNECTED     132548 /tmp/orbit-root/linc-2f29-0-7daa02e6ed464
unix  3      [ ]         STREAM     CONNECTED     132547
unix  3      [ ]         STREAM     CONNECTED     132546 /tmp/orbit-root/linc-2f42-0-abe7c1b4666f
unix  3      [ ]         STREAM     CONNECTED     132545
unix  3      [ ]         STREAM     CONNECTED     132537 /tmp/orbit-root/linc-2f42-0-abe7c1b4666f
unix  3      [ ]         STREAM     CONNECTED     132536
unix  3      [ ]         STREAM     CONNECTED     132535 /tmp/orbit-root/linc-2efe-0-7112fc9f6c3e4
unix  3      [ ]         STREAM     CONNECTED     132534
unix  3      [ ]         STREAM     CONNECTED     132533 /tmp/orbit-root/linc-2f42-0-abe7c1b4666f
unix  3      [ ]         STREAM     CONNECTED     132532
unix  3      [ ]         STREAM     CONNECTED     132529 /tmp/orbit-root/linc-2efa-0-352b6cee4b0d5
unix  3      [ ]         STREAM     CONNECTED     132528
unix  3      [ ]         STREAM     CONNECTED     132523 /tmp/.X11-unix/X1
unix  3      [ ]         STREAM     CONNECTED     132522
unix  3      [ ]         STREAM     CONNECTED     132515 /tmp/orbit-root/linc-2f29-0-7daa02e6ed464
unix  3      [ ]         STREAM     CONNECTED     132514
unix  3      [ ]         STREAM     CONNECTED     132513 /tmp/orbit-root/linc-2f40-0-1978dbb4ab9f9
unix  3      [ ]         STREAM     CONNECTED     132512
unix  3      [ ]         STREAM     CONNECTED     132507 /tmp/orbit-root/linc-2f40-0-1978dbb4ab9f9
unix  3      [ ]         STREAM     CONNECTED     132506
unix  3      [ ]         STREAM     CONNECTED     132505 /tmp/orbit-root/linc-2efe-0-7112fc9f6c3e4
unix  3      [ ]         STREAM     CONNECTED     132504
unix  3      [ ]         STREAM     CONNECTED     132503 /tmp/orbit-root/linc-2f40-0-1978dbb4ab9f9
unix  3      [ ]         STREAM     CONNECTED     132502
unix  3      [ ]         STREAM     CONNECTED     132499 /tmp/orbit-root/linc-2efa-0-352b6cee4b0d5
unix  3      [ ]         STREAM     CONNECTED     132498
unix  3      [ ]         STREAM     CONNECTED     132493 /tmp/.X11-unix/X1
unix  3      [ ]         STREAM     CONNECTED     132492
unix  3      [ ]         STREAM     CONNECTED     132460 /tmp/mapping-root
unix  3      [ ]         STREAM     CONNECTED     132452
unix  3      [ ]         STREAM     CONNECTED     132446 /tmp/orbit-root/linc-2f33-0-765a6e5a7056f
unix  3      [ ]         STREAM     CONNECTED     132445
unix  3      [ ]         STREAM     CONNECTED     132442 /tmp/orbit-root/linc-2efa-0-352b6cee4b0d5
unix  3      [ ]         STREAM     CONNECTED     132441
unix  3      [ ]         STREAM     CONNECTED     132438 /tmp/.ICE-unix/11992
unix  3      [ ]         STREAM     CONNECTED     132437
unix  3      [ ]         STREAM     CONNECTED     132431 /tmp/orbit-root/linc-2f37-0-7de67610eaa4e
unix  3      [ ]         STREAM     CONNECTED     132430
unix  3      [ ]         STREAM     CONNECTED     132429 /tmp/orbit-root/linc-2efa-0-352b6cee4b0d5
unix  3      [ ]         STREAM     CONNECTED     132428
unix  3      [ ]         STREAM     CONNECTED     132427 @/tmp/fam-root-
unix  3      [ ]         STREAM     CONNECTED     132426
unix  3      [ ]         STREAM     CONNECTED     132425 /var/run/dbus/system_bus_socket
unix  3      [ ]         STREAM     CONNECTED     132424
unix  3      [ ]         STREAM     CONNECTED     132423 /tmp/orbit-root/linc-2f2b-0-2f96663b88298
unix  3      [ ]         STREAM     CONNECTED     132422
unix  3      [ ]         STREAM     CONNECTED     132421 /tmp/orbit-root/linc-2f37-0-7de67610eaa4e
unix  3      [ ]         STREAM     CONNECTED     132420
unix  3      [ ]         STREAM     CONNECTED     132417 /tmp/orbit-root/linc-2f37-0-7de67610eaa4e
unix  3      [ ]         STREAM     CONNECTED     132416
unix  3      [ ]         STREAM     CONNECTED     132413 /tmp/orbit-root/linc-2efe-0-7112fc9f6c3e4
unix  3      [ ]         STREAM     CONNECTED     132412
unix  3      [ ]         STREAM     CONNECTED     132404 /tmp/orbit-root/linc-2f2b-0-2f96663b88298
unix  3      [ ]         STREAM     CONNECTED     132403
unix  3      [ ]         STREAM     CONNECTED     132402 /tmp/orbit-root/linc-2efe-0-7112fc9f6c3e4
unix  3      [ ]         STREAM     CONNECTED     132401
unix  3      [ ]         STREAM     CONNECTED     132398 @/tmp/fam-root-
unix  3      [ ]         STREAM     CONNECTED     132397
unix  3      [ ]         STREAM     CONNECTED     132396 @/tmp/fam-root-
unix  3      [ ]         STREAM     CONNECTED     132395
unix  3      [ ]         STREAM     CONNECTED     132394 /tmp/.X11-unix/X1
unix  3      [ ]         STREAM     CONNECTED     132393
unix  3      [ ]         STREAM     CONNECTED     132374 /tmp/.ICE-unix/11992
unix  3      [ ]         STREAM     CONNECTED     132371
unix  3      [ ]         STREAM     CONNECTED     132370 /tmp/.X11-unix/X1
unix  3      [ ]         STREAM     CONNECTED     132369
unix  3      [ ]         STREAM     CONNECTED     132346 /tmp/orbit-root/linc-2f2b-0-2f96663b88298
unix  3      [ ]         STREAM     CONNECTED     132345
unix  3      [ ]         STREAM     CONNECTED     132342 /tmp/orbit-root/linc-2efa-0-352b6cee4b0d5
unix  3      [ ]         STREAM     CONNECTED     132341
unix  3      [ ]         STREAM     CONNECTED     132338 /tmp/.ICE-unix/11992
unix  3      [ ]         STREAM     CONNECTED     132337
unix  3      [ ]         STREAM     CONNECTED     132332 /tmp/.X11-unix/X1
unix  3      [ ]         STREAM     CONNECTED     132331
unix  3      [ ]         STREAM     CONNECTED     132327 /var/run/dbus/system_bus_socket
unix  3      [ ]         STREAM     CONNECTED     132326
unix  3      [ ]         STREAM     CONNECTED     132325 /tmp/orbit-root/linc-2f2d-0-2f96663b273b
unix  3      [ ]         STREAM     CONNECTED     132324
unix  3      [ ]         STREAM     CONNECTED     132321 /tmp/orbit-root/linc-2efa-0-352b6cee4b0d5
unix  3      [ ]         STREAM     CONNECTED     132320
unix  3      [ ]         STREAM     CONNECTED     132317 /tmp/.ICE-unix/11992
unix  3      [ ]         STREAM     CONNECTED     132316
unix  3      [ ]         STREAM     CONNECTED     132311 /tmp/.X11-unix/X1
unix  3      [ ]         STREAM     CONNECTED     132310
unix  3      [ ]         STREAM     CONNECTED     132309 /tmp/orbit-root/linc-2f29-0-7daa02e6ed464
unix  3      [ ]         STREAM     CONNECTED     132308
unix  3      [ ]         STREAM     CONNECTED     132307 /tmp/orbit-root/linc-2efe-0-7112fc9f6c3e4
unix  3      [ ]         STREAM     CONNECTED     132306
unix  3      [ ]         STREAM     CONNECTED     132303 /tmp/orbit-root/linc-2f29-0-7daa02e6ed464
unix  3      [ ]         STREAM     CONNECTED     132302
unix  3      [ ]         STREAM     CONNECTED     132299 /tmp/orbit-root/linc-2efa-0-352b6cee4b0d5
unix  3      [ ]         STREAM     CONNECTED     132298
unix  3      [ ]         STREAM     CONNECTED     132294 /tmp/.ICE-unix/11992
unix  3      [ ]         STREAM     CONNECTED     132293
unix  3      [ ]         STREAM     CONNECTED     132286 /tmp/.X11-unix/X1
unix  3      [ ]         STREAM     CONNECTED     132285
unix  3      [ ]         STREAM     CONNECTED     132274 /tmp/.ICE-unix/11992
unix  3      [ ]         STREAM     CONNECTED     132273
unix  3      [ ]         STREAM     CONNECTED     132272 /tmp/.X11-unix/X1
unix  3      [ ]         STREAM     CONNECTED     132271
unix  3      [ ]         STREAM     CONNECTED     132270 /tmp/orbit-root/linc-2f25-0-4adc9d44cf36
unix  3      [ ]         STREAM     CONNECTED     132269
unix  3      [ ]         STREAM     CONNECTED     132266 /tmp/orbit-root/linc-2efa-0-352b6cee4b0d5
unix  3      [ ]         STREAM     CONNECTED     132265
unix  3      [ ]         STREAM     CONNECTED     132209 /tmp/orbit-root/linc-2f00-0-49ca6aba5fc1
unix  3      [ ]         STREAM     CONNECTED     132203
unix  3      [ ]         STREAM     CONNECTED     132189 /tmp/orbit-root/linc-2f00-0-49ca6aba5fc1
unix  3      [ ]         STREAM     CONNECTED     132188
unix  3      [ ]         STREAM     CONNECTED     132187 /tmp/orbit-root/linc-2efe-0-7112fc9f6c3e4
unix  3      [ ]         STREAM     CONNECTED     132186
unix  3      [ ]         STREAM     CONNECTED     132160 @/tmp/fam-root-
unix  3      [ ]         STREAM     CONNECTED     132159
unix  3      [ ]         STREAM     CONNECTED     132146 /tmp/orbit-root/linc-2f00-0-49ca6aba5fc1
unix  3      [ ]         STREAM     CONNECTED     132145
unix  3      [ ]         STREAM     CONNECTED     132142 /tmp/orbit-root/linc-2efa-0-352b6cee4b0d5
unix  3      [ ]         STREAM     CONNECTED     132141
unix  3      [ ]         STREAM     CONNECTED     132136 /tmp/.X11-unix/X1
unix  3      [ ]         STREAM     CONNECTED     132135
unix  3      [ ]         STREAM     CONNECTED     132129 /tmp/orbit-root/linc-2ed8-0-688899c94c95d
unix  3      [ ]         STREAM     CONNECTED     132128
unix  3      [ ]         STREAM     CONNECTED     132127 /tmp/orbit-root/linc-2efe-0-7112fc9f6c3e4
unix  3      [ ]         STREAM     CONNECTED     132126
unix  3      [ ]         STREAM     CONNECTED     132097 /tmp/orbit-root/linc-2ed8-0-688899c94c95d
unix  3      [ ]         STREAM     CONNECTED     132096
unix  3      [ ]         STREAM     CONNECTED     132095 /tmp/orbit-root/linc-2efa-0-352b6cee4b0d5
unix  3      [ ]         STREAM     CONNECTED     131974
unix  2      [ ]         DGRAM                    131964
unix  3      [ ]         STREAM     CONNECTED     131901 /tmp/.X11-unix/X1
unix  3      [ ]         STREAM     CONNECTED     131900
unix  3      [ ]         STREAM     CONNECTED     131893 /tmp/.X11-unix/X1
unix  3      [ ]         STREAM     CONNECTED     131892
unix  3      [ ]         STREAM     CONNECTED     131891 /tmp/.X11-unix/X1
unix  3      [ ]         STREAM     CONNECTED     131890
unix  2      [ ]         DGRAM                    96275
unix  2      [ ]         DGRAM                    26684
unix  2      [ ]         DGRAM                    7525
unix  3      [ ]         STREAM     CONNECTED     7428   /var/run/dbus/system_bus_socket
unix  3      [ ]         STREAM     CONNECTED     7427
unix  3      [ ]         STREAM     CONNECTED     7410   /var/run/dbus/system_bus_socket
unix  3      [ ]         STREAM     CONNECTED     7409
unix  2      [ ]         DGRAM                    7267
unix  3      [ ]         STREAM     CONNECTED     7251
unix  3      [ ]         STREAM     CONNECTED     7250
unix  2      [ ]         DGRAM                    7231
unix  2      [ ]         DGRAM                    7134
unix  2      [ ]         DGRAM                    6933
unix  2      [ ]         DGRAM                    6891
unix  2      [ ]         DGRAM                    6545
unix  2      [ ]         DGRAM                    6442
unix  2      [ ]         DGRAM                    6418
unix  2      [ ]         DGRAM                    6204
unix  3      [ ]         STREAM     CONNECTED     6185
unix  3      [ ]         STREAM     CONNECTED     6184
unix  2      [ ]         DGRAM                    6008

Netstat -a on trin-14 produced this:

[root@trin-14 src]# netstat -a
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address               Foreign Address             State
tcp        0      0 *:nfs                       *:*                         LISTEN
tcp        0      0 trin-14.lab.msp.redha:21064 *:*                         LISTEN
tcp        0      0 *:32777                     *:*                         LISTEN
tcp        0      0 *:32778                     *:*                         LISTEN
tcp        0      0 *:41966                     *:*                         LISTEN
tcp        0      0 *:622                       *:*                         LISTEN
tcp        0      0 *:sunrpc                    *:*                         LISTEN
tcp        0      0 *:41968                     *:*                         LISTEN
tcp        0      0 *:5008                      *:*                         LISTEN
tcp        0      0 *:50008                     *:*                         LISTEN
tcp        0      0 *:1021                      *:*                         LISTEN
tcp        0      0 trin-14.lab.msp.redha:32774 trin-13.lab.msp.redha:21064 ESTABLISHED
tcp        0      0 trin-14.lab.msp.redha:21064 trin-12.lab.msp.redha:32775 ESTABLISHED
tcp        0      0 trin-14.lab.msp.redha:21064 trin-13.lab.msp.redha:32774 ESTABLISHED
tcp        0      0 *:41967                     *:*                         LISTEN
tcp        0      0 *:41969                     *:*                         LISTEN
tcp        0      0 *:ssh                       *:*                         LISTEN
tcp        0      0 ::1:50006                   *:*                         LISTEN
tcp        0      0 *:50009                     *:*                         LISTEN
tcp        0   1104 trin-14.lab.msp.redhat.:ssh technetium.msp.redhat:56306 ESTABLISHED
udp        0      0 *:nfs                       *:*
udp        0      0 *:32769                     *:*
udp        0      0 *:32770                     *:*
udp        0      0 trin-14.lab.msp.re:6809     *:*
udp        0      0 broadcast:6809              *:*
udp        0      0 *:925                       *:*
udp        0      0 *:23456                     *:*
udp        0      0 *:bootpc                    *:*
udp        0      0 *:619                       *:*
udp        0      0 *:sunrpc                    *:*
udp        0      0 *:1018                      *:*
udp        0      0 *:50007                     *:*
Active UNIX domain sockets (servers and established)
Proto RefCnt Flags       Type       State         I-Node Path
unix  2      [ ACC ]     STREAM     LISTENING     6222   /var/run/dbus/system_bus_socket
unix  2      [ ACC ]     STREAM     LISTENING     5958   /dev/gpmctl
unix  2      [ ACC ]     STREAM     LISTENING     5191   /var/run/cluster/ccsd.sock
unix  14     [ ]         DGRAM                    4969   /dev/log
unix  2      [ ACC ]     STREAM     LISTENING     6156   /tmp/.font-unix/fs7100
unix  2      [ ]         DGRAM                    6406   @/var/run/hal/hotplug_socket
unix  2      [ ]         DGRAM                    3397   @udevd
unix  2      [ ACC ]     STREAM     LISTENING     5476   @clvmd
unix  2      [ ACC ]     STREAM     LISTENING     5778   /var/run/acpid.socket
unix  2      [ ]         DGRAM                    112847
unix  2      [ ]         DGRAM                    83455
unix  2      [ ]         DGRAM                    10957
unix  2      [ ]         DGRAM                    8355
unix  2      [ ]         DGRAM                    6512
unix  3      [ ]         STREAM     CONNECTED     6405   /var/run/dbus/system_bus_socket
unix  3      [ ]         STREAM     CONNECTED     6404
unix  3      [ ]         STREAM     CONNECTED     6387   /var/run/dbus/system_bus_socket
unix  3      [ ]         STREAM     CONNECTED     6386
unix  2      [ ]         DGRAM                    6246
unix  3      [ ]         STREAM     CONNECTED     6230
unix  3      [ ]         STREAM     CONNECTED     6229
unix  2      [ ]         DGRAM                    6197
unix  2      [ ]         DGRAM                    6087
unix  2      [ ]         DGRAM                    5885
unix  2      [ ]         DGRAM                    5844
unix  2      [ ]         DGRAM                    5490
unix  2      [ ]         DGRAM                    5395
unix  2      [ ]         DGRAM                    5175
unix  3      [ ]         STREAM     CONNECTED     5156
unix  3      [ ]         STREAM     CONNECTED     5155
unix  2      [ ]         DGRAM                    4977

Again, no dmesgs were reported on the nodes in my cluster.

Cluster Suite tasks were in this state:

ccsd          S 00000000  2548  2338      1          2339  2324 (NOTLB)
ddaeaeb0 00000082 c014c301 00000000 00000001 0000048c 5bfca29b 00008f4e
       ddcc51a0 ddcc532c de00c280 7fffffff 00000000 ddaeaf74 c030fef1 db52d518
       c017d342 00000246 db57f100 ddaeaf58 db52d518 00000246 00000246 ddee1300
Call Trace:
 [<c014c301>] __alloc_pages+0xd5/0x2f7
 [<c030fef1>] schedule_timeout+0x50/0x10c
 [<c017d342>] __pollwait+0x2d/0x94
 [<c017d795>] do_select+0x347/0x378
 [<c017d315>] __pollwait+0x0/0x94
 [<c017dab9>] sys_select+0x2e0/0x43a
 [<c01856cc>] destroy_inode+0x3d/0x4c
 [<c03115af>] syscall_call+0x7/0xb
 [<c031007b>] rwsem_down_read_failed+0x33/0x204
ccsd          S 00000000  2252  2339      1          2389  2338 (NOTLB)
dd3a6eb0 00000082 c014c301 00000000 ddcc4bd0 00037232 b3f31959 00000022
       df130230 df1303bc de3bbb80 7fffffff 00000000 dd3a6f74 c030fef1 db52d918
       c017d342 00000246 dbf52580 dd3a6f58 db52d718 00000246 dd01a680 dd3a6f58
Call Trace:
 [<c014c301>] __alloc_pages+0xd5/0x2f7
 [<c030fef1>] schedule_timeout+0x50/0x10c
 [<c017d342>] __pollwait+0x2d/0x94
 [<c02b2c44>] datagram_poll+0x25/0xd1
 [<c017d795>] do_select+0x347/0x378
 [<c017d315>] __pollwait+0x0/0x94
 [<c017dab9>] sys_select+0x2e0/0x43a
 [<c03115af>] syscall_call+0x7/0xb
 [<c031007b>] rwsem_down_read_failed+0x33/0x204
cman_comms    S 0000001C  1796  2389      1          2390  2339 (L-TLB)
ddba1f90 00000046 ddba1f44 0000001c ddba1f3c 00000102 c7275423 00008f4f
       df131970 df131afc ddba1000 e154b740 e154b760 db46f700 e152e35d 0000001f
       00000000 1d244b3c 00000000 0000000a e153f85c 00000000 00000000 df126eac
Call Trace:
 [<e152e35d>] cluster_kthread+0x10c/0x594 [cman]
 [<c03114ce>] ret_from_fork+0x6/0x14
 [<c011d0ce>] default_wake_function+0x0/0xc
 [<e152e251>] cluster_kthread+0x0/0x594 [cman]
 [<c01041dd>] kernel_thread_helper+0x5/0xb
cman_serviced S 0000005B  2672  2391      3          2440  1685 (L-TLB)
ddb85fc8 00000046 df194600 0000005b df131970 00000039 5501bfb3 00008ee2
       df194600 df19478c ddb85000 df126e6c 00000000 e1539617 e153972e c0139ad9
       fffffffc ffffffff ffffffff c0139a70 00000000 00000000 00000000 c01041dd
Call Trace:
 [<e1539617>] serviced+0x0/0x140 [cman]
 [<e153972e>] serviced+0x117/0x140 [cman]
 [<c0139ad9>] kthread+0x69/0x91
 [<c0139a70>] kthread+0x0/0x91
 [<c01041dd>] kernel_thread_helper+0x5/0xb
cman_memb     S 00000008  1868  2390      1          2392  2389 (L-TLB)
ddb88fb4 00000046 ddb88f80 00000008 ddb88f78 0000012c 9073e7bc 00008f4f
       df1313a0 df13152c df1313a0 00000000 ddb88fdc 00000000 e1534a12 0000001f
       00000000 e15347e3 00000000 df1313a0 c011d0ce db4e5db0 db4e5db0 00000000
Call Trace:
 [<e1534a12>] membership_kthread+0x22f/0x520 [cman]
 [<e15347e3>] membership_kthread+0x0/0x520 [cman]
 [<c011d0ce>] default_wake_function+0x0/0xc
 [<e15347e3>] membership_kthread+0x0/0x520 [cman]
 [<c01041dd>] kernel_thread_helper+0x5/0xb
cman_hbeat    S 00008F4F  2636  2392      1          2427  2390 (L-TLB)
ddbecfd4 00000046 c692120b 00008f4f df131970 00000937 c692120b 00008f4f
       df194bd0 df194d5c ddbec000 df194bd0 e154bb80 00000000 e1534559 0000001f
       00000000 e153449a 00000000 00000000 c01041dd 00000000 00000000 00000000
Call Trace:
 [<e1534559>] hello_kthread+0xbf/0x13c [cman]
 [<e153449a>] hello_kthread+0x0/0x13c [cman]
 [<c01041dd>] kernel_thread_helper+0x5/0xb
fenced        S DDA13FA8  2724  2427      1          2439  2392 (NOTLB)
dda13f9c 00000086 0000000b dda13fa8 00000000 0000a5d9 05f8d371 00000023
       df130dd0 df130f5c dda13000 dda13fac bff2d380 dda13000 c01054ae 00004200
       00000000 00000000 00000000 bff2d380 00000000 bff2d49c c03115af bff2d380
Call Trace:
 [<c01054ae>] sys_rt_sigsuspend+0x209/0x224
 [<c03115af>] syscall_call+0x7/0xb
 [<c031007b>] rwsem_down_read_failed+0x33/0x204
clvmd         S C0360CF4  2212  2439      1          2446  2427 (NOTLB)
dd445eb0 00000086 000000d0 c0360cf4 dd445f58 000007b7 c1d0ff55 00008f4a
       ddc213a0 ddc2152c 096484d8 096484d8 00000000 dd445f74 c030ff92 c035f430
       ddbc7eb8 096484d8 1d244b3c 00000000 00000005 c031e8be c0320b2a 000000a8
Call Trace:
 [<c030ff92>] schedule_timeout+0xf1/0x10c
 [<c012b87d>] process_timeout+0x0/0x5
 [<c017d795>] do_select+0x347/0x378
 [<c017d315>] __pollwait+0x0/0x94
 [<c017dab9>] sys_select+0x2e0/0x43a
 [<c02acb83>] sock_ioctl+0x2dd/0x38b
 [<c012614b>] sys_time+0xf/0x58
 [<c03115af>] syscall_call+0x7/0xb
clvmd         S 00000024  3436  2446      1          2447  2439 (NOTLB)
dd963f14 00000086 a6a1e790 00000024 de2f8700 00001971 a6a1ed5e 00000024
       de19d8f0 de19da7c db5cbd80 db4fb300 dd963f4c dd963fac e154ffb3 dda7ef5c
       00000003 dda7eef8 00000246 c040d544 00000038 b7f703d0 00000000 de19d8f0
Call Trace:
 [<e154ffb3>] dlm_read+0x187/0x625 [dlm]
 [<c011d0ce>] default_wake_function+0x0/0xc
 [<c013a661>] wake_futex+0x3a/0x44
 [<c011d0ce>] default_wake_function+0x0/0xc
 [<c0168bb2>] vfs_read+0xb6/0xe2
 [<c0168dc5>] sys_read+0x3c/0x62
 [<c03115af>] syscall_call+0x7/0xb
clvmd         S 00000024  2472  2447      1          2570  2446 (NOTLB)
de695e94 00000086 a6a2a647 00000024 de2f8700 00001c6d a6c2f991 00000024
       de19c780 de19c90c 00000000 7fffffff de695000 de695ef0 c030fef1 00000001
       de695ef8 de695ef8 de695ef8 de0ae530 de695ef8 c013abf8 1d244b3c 00000000
Call Trace:
 [<c030fef1>] schedule_timeout+0x50/0x10c
 [<c013abf8>] queue_me+0x59/0x121
 [<c013af34>] futex_wait+0x133/0x196
 [<c011d0ce>] default_wake_function+0x0/0xc
 [<c011d0ce>] default_wake_function+0x0/0xc
 [<c013b1d9>] do_futex+0x29/0x5a
 [<c013b30b>] sys_futex+0x101/0x10c
 [<c03115af>] syscall_call+0x7/0xb
dlm_astd      S DDB1CF84  3680  2440      3          2441  2391 (L-TLB)
ddb1cfa4 00000046 e156c0a0 ddb1cf84 d7ae1770 0000002b 5dcd33dc 00008f4e
       ddc20800 ddc2098c ddb1c000 dd445e60 00000000 e154eb46 e154ecab 1d244b3c
       00000000 0000000a e1562aa4 00000000 00000000 dd445e60 00000000 ddb1c000
Call Trace:
 [<e154eb46>] dlm_astd+0x0/0x1ff [dlm]
 [<e154ecab>] dlm_astd+0x165/0x1ff [dlm]
 [<c0139ad9>] kthread+0x69/0x91
 [<c0139a70>] kthread+0x0/0x91
 [<c01041dd>] kernel_thread_helper+0x5/0xb
dlm_recvd     S 00000000  2548  2441      3          2442  2440 (L-TLB)
ddb1bfa4 00000046 00000000 00000000 00000000 0000016e 98dccbad 00008f4e
       ddc21970 ddc21afc ddb1b000 dd445e54 00000000 e1559620 e15596aa 1d244b3c
       00000000 0000000a e1564010 00000000 00000000 dd445e54 00000000 ddb1b000
Call Trace:
 [<e1559620>] dlm_recvd+0x0/0xa7 [dlm]
 [<e15596aa>] dlm_recvd+0x8a/0xa7 [dlm]
 [<c0139ad9>] kthread+0x69/0x91
 [<c0139a70>] kthread+0x0/0x91
 [<c01041dd>] kernel_thread_helper+0x5/0xb
dlm_sendd     S 00000048  3104  2442      3          2443  2441 (L-TLB)
ddb17fa4 00000046 ccfc8de0 00000048 00000048 000002de 980093a0 00008f4e
       df5e2cd0 df5e2e5c ddb17000 dd445e54 00000000 e155989c e1559926 1d244b3c
       00000000 0000000a e1564010 00000000 00000000 dd445e54 00000000 ddb17000
Call Trace:
 [<e155989c>] dlm_sendd+0x0/0xac [dlm]
 [<e1559926>] dlm_sendd+0x8a/0xac [dlm]
 [<c0139ad9>] kthread+0x69/0x91
 [<c0139a70>] kthread+0x0/0x91
 [<c01041dd>] kernel_thread_helper+0x5/0xb
dlm_recoverd  S DF7D5600  3512  2443      3          2514  2442 (L-TLB)
ddb10fc0 00000046 00000003 df7d5600 df131970 00003d56 97d9e54a 00000023
       de2f8cd0 de2f8e5c ddb10000 df7d5600 df7d5600 e1561482 e15614a8 ddb10000
       dd445e3c c0139ad9 fffffffc ffffffff ffffffff c0139a70 00000000 00000000
Call Trace:
 [<e1561482>] dlm_recoverd+0x0/0x55 [dlm]
 [<e15614a8>] dlm_recoverd+0x26/0x55 [dlm]
 [<c0139ad9>] kthread+0x69/0x91
 [<c0139a70>] kthread+0x0/0x91
 [<c01041dd>] kernel_thread_helper+0x5/0xb
dlm_recoverd  S DF7C9C00  3500  2514      3          2515  2443 (L-TLB)
ddf96fc0 00000046 00000003 df7c9c00 df131970 00004464 26e0767a 00000025
       de19c1b0 de19c33c ddf96000 df7c9c00 df7c9c00 e1561482 e15614a8 ddf96000
       ddb95bfc c0139ad9 fffffffc ffffffff ffffffff c0139a70 00000000 00000000
Call Trace:
 [<e1561482>] dlm_recoverd+0x0/0x55 [dlm]
 [<e15614a8>] dlm_recoverd+0x26/0x55 [dlm]
 [<c0139ad9>] kthread+0x69/0x91
 [<c0139a70>] kthread+0x0/0x91
 [<c01041dd>] kernel_thread_helper+0x5/0xb
lock_dlm1     S 00008EE2  2900  2515      3          2516  2514 (L-TLB)
dda32f6c 00000046 54fc79fd 00008ee2 ddcc4030 0000005e 54fc9d5d 00008ee2
       de19d320 de19d4ac df7c6400 df59a780 dda32f94 00000000 e01e6c04 0000008c
       01000000 84c4eac8 00000000 de19d320 c011d0ce 00000000 00000000 c030ed50
Call Trace:
 [<e01e6c04>] dlm_async+0x11c/0x416 [lock_dlm]
 [<c011d0ce>] default_wake_function+0x0/0xc
 [<c030ed50>] schedule+0x438/0x5ea
 [<c011d0ce>] default_wake_function+0x0/0xc
 [<e01e6ae8>] dlm_async+0x0/0x416 [lock_dlm]
 [<c0139ad9>] kthread+0x69/0x91
 [<c0139a70>] kthread+0x0/0x91
 [<c01041dd>] kernel_thread_helper+0x5/0xb
lock_dlm2     S DF7C6400  2656  2516      3          3713  2515 (L-TLB)
ddaf4f6c 00000046 df59aa80 df7c6400 d7bc4c50 00000035 54fc9f3a 00008ee2
       ddcc4030 ddcc41bc df7c6400 df59aa80 ddaf4f94 d89ab520 e01e6c04 0000008c
       01000000 866bb168 00000000 ddcc4030 c011d0ce 00000000 00000000 c030ed50
Call Trace:
 [<e01e6c04>] dlm_async+0x11c/0x416 [lock_dlm]
 [<c011d0ce>] default_wake_function+0x0/0xc
 [<c030ed50>] schedule+0x438/0x5ea
 [<c011d0ce>] default_wake_function+0x0/0xc
 [<e01e6ae8>] dlm_async+0x0/0x416 [lock_dlm]
 [<c0139ad9>] kthread+0x69/0x91
 [<c0139a70>] kthread+0x0/0x91
 [<c01041dd>] kernel_thread_helper+0x5/0xb
hald          S 00000001  2460  2749      1          2764  2739 (NOTLB)
ddbd9f1c 00000086 00000000 00000001 00000000 00000bb1 9bc284c8 00008f4f
       d7c7d3a0 d7c7d52c 0963f3d4 0963f3d4 ddbd9fa0 c7744240 c030ff92 c035f338
       c035f338 0963f3d4 1d244b3c 00000000 00000005 c031e8be c0320b2a 000000a8
Call Trace:
 [<c030ff92>] schedule_timeout+0xf1/0x10c
 [<c012b87d>] process_timeout+0x0/0x5
 [<c017dd21>] do_poll+0x8d/0xab
 [<c017dede>] sys_poll+0x19f/0x24d
 [<c017d315>] __pollwait+0x0/0x94
 [<c0126225>] sys_gettimeofday+0x53/0xac
 [<c03115af>] syscall_call+0x7/0xb
 [<c031007b>] rwsem_down_read_failed+0x33/0x204
clurgmgrd     S D7EA6F58  2272  2764      1          2769  2749 (NOTLB)
d7ea6eb0 00000086 c0360cf4 d7ea6f58 000000d0 00000562 5e37c94b 00008f4e
       d7ae1770 d7ae18fc 0963fe3f 0963fe3f 00000000 d7ea6f74 c030ff92 ddc370c4
       db8bccf4 0963fe3f 1d244b3c 00000000 00000005 c031e8be c0320b2a 000000a8
Call Trace:
 [<c030ff92>] schedule_timeout+0xf1/0x10c
 [<c012b87d>] process_timeout+0x0/0x5
 [<c017d795>] do_select+0x347/0x378
 [<c017d315>] __pollwait+0x0/0x94
 [<c017dab9>] sys_select+0x2e0/0x43a
 [<c03115af>] syscall_call+0x7/0xb
 [<c031007b>] rwsem_down_read_failed+0x33/0x204
clurgmgrd     S 00000000  2948  3711      1                3672 (NOTLB)
ddbf7eb0 00000086 c014c301 00000000 00000001 000007aa c72a3afe 00008f4f
       d7bc40b0 d7bc423c 0963f2cc 0963f2cc 00000000 ddbf7f74 c030ff92 c035f330
       c035f330 0963f2cc 1d244b3c 00000000 00000005 c031e8be c0320b2a 000000a8
Call Trace:
 [<c014c301>] __alloc_pages+0xd5/0x2f7
 [<c030ff92>] schedule_timeout+0xf1/0x10c
 [<c012b87d>] process_timeout+0x0/0x5
 [<c017d795>] do_select+0x347/0x378
 [<c017d315>] __pollwait+0x0/0x94
 [<c017dab9>] sys_select+0x2e0/0x43a
 [<c030ed50>] schedule+0x438/0x5ea
 [<c03115af>] syscall_call+0x7/0xb
dlm_recoverd  S DF7D3C00  3500  3713      3                2516 (L-TLB)
d761efc0 00000046 00000003 df7d3c00 df131970 000057ff debe506d 00000028
       ddc20230 ddc203bc d761e000 df7d3c00 df7d3c00 e1561482 e15614a8 d761e000
       d7ea6e3c c0139ad9 fffffffc ffffffff ffffffff c0139a70 00000000 00000000
Call Trace:
 [<e1561482>] dlm_recoverd+0x0/0x55 [dlm]
 [<e15614a8>] dlm_recoverd+0x26/0x55 [dlm]
 [<c0139ad9>] kthread+0x69/0x91
 [<c0139a70>] kthread+0x0/0x91
 [<c01041dd>] kernel_thread_helper+0x5/0xb

Version-Release number of selected component (if applicable):

cman-1.0.4-0

How reproducible:

I haven't tried to reproduce it.

Steps to Reproduce:
1. Reboot all systems in the cluster.
2. Use cluster management gui to stop nfs service.
3. Wait a couple days (this step may be unnecessary).
4. Do umount of your gfs filesystem.
  
Actual results:

umount command hangs.

Expected results:

umount should complete normally.

Additional info:

Comment 1 Christine Caulfield 2006-01-09 10:22:31 UTC
I would suggest that it's pretty unlikely that a message has just failed to get
through to another node. Bugzilla history indicates that multiple resends (and
node down events when they don't arrive) are much more common in the event of
networking glitches.

Comment 2 Corey Marthaler 2006-01-11 16:48:43 UTC
I appear to have hit this exact issue as well on link-08 by just running
mount_stress on my three node cluster (link-01, link-02, link-08).

################ itr=488 ################
unmounting on link-02.../mnt/link2.../mnt/link0.../mnt/link3...
unmounting on link-08.../mnt/link1...
[HANG]


Dump of the umount task:
Jan 11 05:16:28 link-08 kernel: umount        D 000001001f27c248     0  5067    851                     (NOTLB)
Jan 11 05:16:28 link-08 kernel: 000001003b89dd18 0000000000000002 000000000060f9c0 ffffffff00000073
Jan 11 05:16:28 link-08 kernel:        0000010020a6d7f0 0000000000000073 0000010001709fc0 0000000067616c66
Jan 11 05:16:28 link-08 kernel:        000001001a5bc7f0 00000000001041e3
Jan 11 05:16:28 link-08 kernel: Call Trace:<ffffffff80304c65>{wait_for_completion+167} <ffffffff80133401>{default_wake_function+0}
Jan 11 05:16:28 link-08 kernel:        <ffffffff80133401>{default_wake_function+0} <ffffffffa021c021>{:cman:kcl_leave_service+249}
Jan 11 05:16:28 link-08 kernel:        <ffffffffa02b8dc3>{:lock_dlm:release_mountgroup+189}
Jan 11 05:16:28 link-08 kernel:        <ffffffffa02baae5>{:lock_dlm:lm_dlm_unmount+38} <ffffffffa02503ae>{:lock_harness:lm_unmount+62}
Jan 11 05:16:28 link-08 kernel:        <ffffffffa026b73f>{:gfs:gfs_lm_unmount+33} <ffffffffa027a5a4>{:gfs:gfs_put_super+830}
Jan 11 05:16:28 link-08 kernel:        <ffffffff8017d0b9>{generic_shutdown_super+202} <ffffffffa0277fdd>{:gfs:gfs_kill_sb+42}
Jan 11 05:16:28 link-08 kernel:        <ffffffff801ccc20>{dummy_inode_permission+0} <ffffffff8017cfd6>{deactivate_super+95}
Jan 11 05:16:28 link-08 kernel:        <ffffffff8019253b>{sys_umount+925} <ffffffff801e865d>{__up_write+20}
Jan 11 05:16:28 link-08 kernel:        <ffffffff8016b7d0>{sys_munmap+94} <ffffffff801101c6>{system_call+126}
Jan 11 05:16:28 link-08 kernel:



[root@link-08 ~]# ps -C umount
  PID TTY          TIME CMD
 5067 ?        00:00:00 umount


[root@link-08 ~]# lsof -p 5067
COMMAND  PID USER   FD   TYPE DEVICE     SIZE    NODE NAME
umount  5067 root  cwd    DIR  253,0     4096       2 /
umount  5067 root  rtd    DIR  253,0     4096       2 /
umount  5067 root  txt    REG  253,0    57264 6340698 /bin/umount
umount  5067 root  mem    REG  253,0 39554608 4165972 /usr/lib/locale/locale-archive
umount  5067 root  mem    REG  253,0   105202 1409081 /lib64/ld-2.3.4.so
umount  5067 root  mem    REG  253,0  1489988 1409262 /lib64/tls/libc-2.3.4.so
umount  5067 root    0u  IPv4 242736              TCP link-08.lab.msp.redhat.com:52364->joynter.lab.msp.redhat.com:5010 (ESTABLISHED)
umount  5067 root    1w   CHR    1,3             1861 /dev/null
umount  5067 root    2w   CHR    1,3             1861 /dev/null
umount  5067 root    4u  IPv4 242736              TCP link-08.lab.msp.redhat.com:52364->joynter.lab.msp.redhat.com:5010 (ESTABLISHED)
umount  5067 root    5u  IPv4 242737              TCP link-08.lab.msp.redhat.com:52365->joynter.lab.msp.redhat.com:5011 (ESTABLISHED)
umount  5067 root    6u  IPv4 242738              TCP link-08.lab.msp.redhat.com:52366->joynter.lab.msp.redhat.com:5012 (ESTABLISHED)



[root@link-01 ~]# cat /proc/cluster/services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 run       -
[1 2 3]

DLM Lock Space:  "clvmd"                             3   4 run       -
[1 2 3]

DLM Lock Space:  "link1"                           3730 2625 run       -
[1 3]

DLM Lock Space:  "link3"                           3728 2627 run       -
[1]

GFS Mount Group: "link1"                           3731 2626 run       -
[1 3]

GFS Mount Group: "link3"                           3729 2628 run       -
[1]


[root@link-02 ~]# cat /proc/cluster/services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 run       -
[1 2 3]

DLM Lock Space:  "clvmd"                             3   4 run       -
[1 2 3]



[root@link-08 ~]# cat /proc/cluster/services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 run       -
[1 2 3]

DLM Lock Space:  "clvmd"                             3   4 run       -
[1 2 3]

DLM Lock Space:  "link1"                           3730 2541 run       -
[1 3]

DLM Lock Space:  "link2"                           3724 2543 run       -
[3]

GFS Mount Group: "link1"                           3731 2542 run       S-11,208,1
[1 3]

GFS Mount Group: "link2"                           3725 2544 run       -
[3]



Comment 3 David Teigland 2006-01-11 17:22:15 UTC
I suspect this may be a regression from RHEL4U2 since it looks
repeatable and hasn't been reported outside of RHEL4U3 testing.

Comment 4 David Teigland 2006-01-11 18:47:56 UTC
I get the same result using the STABLE branch.  I had
three nodes sitting idle for a couple days and the
umount hung just as above.  I'll be looking first for
counters that might be rolling over given enough time.

Comment 5 Corey Marthaler 2006-01-11 22:09:46 UTC
bumping the severity to reflect comment #3...

Comment 6 Christine Caulfield 2006-01-13 14:00:35 UTC
It seems to be the "fix" for bz#173621 causing this, but I haven't been able to
work out why, or why the symptoms are as bizarre as they are.

To reproduce:
Run the attached program (below) on two nodes. Eventually it will stall
even though the cluster remains up.

Now try to bring another node into the cluster and it will die with all sorts of
weird messages about inconsistent cluster views and transition timeouts.

Back out the change mentioned above:
  cvs update -r1.42.2.17 cman-kernel/src/cnxman.c
and it will run seemingly forever.

This also very likely explains bz#176872



Comment 7 Christine Caulfield 2006-01-13 14:02:25 UTC
Created attachment 123162 [details]
libcman program to demonstrate hang

Comment 8 Christine Caulfield 2006-01-13 16:23:12 UTC
Created attachment 123170 [details]
Proposed patch for testing

This is what I think the patch should look like (with an extra printk). Please
test it if you can and see how many open cman/dlm bugs it actually fixes :)

Comment 9 Robert Peterson 2006-01-13 23:15:02 UTC
First of all, I verified that Patrick is indeed right about how and why the code
is failing.  With my /proc/cluster/msg_history patch, I was able to see the
cluster run normally until the 16-bit cman sequence number wrapped from 0xffff
to 0x0000.  That's when the systems in the cluster started ignoring each other's
messages.  When I let my third node back into the cluster, it caused my second
node to reproduce the symptoms of bz176872 (kernel panic).

This probably DOES explain bz 177163 because part of that symptom was "wait 2
days".  Based on very rough "hello message" estimates on an idle 3-node cluster,
I predicted that the time for the sequence number to wrap is about 18 hours.

With Patrick's test program, the time to recreate is reduced to about 13 minutes.

I have done a fair amount of testing on Patrick's test patch (minus the extra
printk).  The patch appears to work fine, and I ran the test program until the
cman sequence number wrapped several times.  (I was able to see this by using my
msg_history patch).  The same test failed without Patrick's test patch.

My concern is that we can still lose/ignore cman messages when we wrap from
positive numbers to negative numbers, as illustrated by the following simple
program and its accompanying output:

#include <stdio.h>

int main()
{
  unsigned short a,b,i;

  for (a=32765,b=32763,i=0; i<8; i++,a++,b++) {
    printf("a=%04X b=%04X: a-b<=0? ",a,b);
    if ((short)(((short)(a) - (short)b) <= 0))
      printf("yes.\n");
    else
      printf("no.\n");
  }
}

Yields this output:

a=7FFD b=7FFB: a-b<=0? no.
a=7FFE b=7FFC: a-b<=0? no.
a=7FFF b=7FFD: a-b<=0? no.
a=8000 b=7FFE: a-b<=0? yes.
a=8001 b=7FFF: a-b<=0? yes.
a=8002 b=8000: a-b<=0? no.
a=8003 b=8001: a-b<=0? no.
a=8004 b=8002: a-b<=0? no.

Wrapping from 0xffff to 0x0000 works properly with the patch.

I made my own version of a test patch to solve this problem using 32-bit
unsigned numbers rather than 16-bit.  That patch also worked perfectly.  
With 32-bit numbers, I calculate it would take roughly 134 years for the
sequence number to wrap with normal "hello" messages.  With Patrick's test
program, I calculate it would wrap in approximately 455 days.

However, the 32-bit patch has a nasty side-effect:  Since the cman message
structure is changed to accommodate the 32-bit numbers, any nodes in the cluster
who aren't running the 32-bit version will hang up.  That means the fix would
have to be applied throughout the entire cluster and then the whole cluster
restarted.  If any node out there has the old 16-bit code, it will get very
confused and have problems coming up.  I tried this and ended up having to boot
the node with the old version into single-user mode, doing ifup eth0, service
sshd start, etc., then copying on and applying the patch.  Once the 32-bit patch
is on all nodes, no problem however.  I don't necessarily want to subject
customers to that restriction or potential headaches, but it's one solution.

I had a philosophical discussion about this with Jon Brassow, and he recommended
we try a solution where we keep the sequence number as 16-bits (and thus not
change the internal message structure), but use the MSB to signify a
wrap-condition.  The cman message would only be considered duplicate (and
ignored) if the sequence number is the same (or less) AND the MSB/wrap-bit is
the same.  This is very close to Patrick's existing patch, but there wouldn't be
a boundary problem with 0x7fff - 0x8000.

Of course, there is still the possibility of hitting a cman message that will be
ignored, but it's not likely to occur.  The failure/wrap around time under idle
conditions would be reduced from 18 hours to 9 hours (if idle) due to using 15
bits rather than 16.  The messages would only be ignored if there were two wraps
that occurred in between messages (i.e. the MSB goes from 0 to 1 and then back
to 0, making it look as it did 18 hours ago).  So the system's messages would
have to disappear and reappear 18 hours later, after the 9-hour wrap occurs
twice.  Even with Patrick's test program, it would have to appear 13 minutes
later, which is not likely.

If messages are received with the wrong wrap-bit, no problem.  If the node goes
down and comes back, the sequence number for the new incarnation is restarted,
so again, no problem.

I'm running this hybrid 16-bit patch on the smoke cluster, and running a smoke
test over the weekend.

I'm planning to try to implement this kind of solution on Monday unless Patrick
beats me to it.

Also, I'm attaching my /proc/cluster/msg_history patch.


Comment 10 Robert Peterson 2006-01-13 23:17:47 UTC
Created attachment 123185 [details]
Test version w/Patrick's 16-bit fix and Bob's /proc/cluster/msg_history

Comment 11 Christine Caulfield 2006-01-16 09:29:47 UTC
My patch is correct (see also /usr/src/linux/include/linux/jiffies.h, which is
where it was lifted from).

Your test program has a bug in it. The comparison line should read:

if ((short)((short)(a) - (short)b) <= 0)

You were casting the whole of the comparison to (short) rather than just the
subtraction :)

Comment 12 Christine Caulfield 2006-01-16 17:00:39 UTC
Created attachment 123245 [details]
Additional patch

Looks like you might need this patch on top of the first one. If there are no
ACKed messages for ages (e.g. lots of HELLOs but no umounts) and you then get a
message with a different sign to the last ACK, it will get ignored.

Comment 13 Christine Caulfield 2006-01-17 14:27:42 UTC
Created attachment 123292 [details]
New patch for testing

This patch supersedes the previous two. I realised that there is a small problem
with them too in that messages that get lost on the wire may not get resent if
a HELLO message follows them too closely.

This is an amalgamation of those two patches plus a small amount of extra code
to distinguish between ACKable messages and non-ACKable ones.

Comment 14 Christine Caulfield 2006-01-19 13:43:51 UTC
Created attachment 123436 [details]
Yet another patch

Just in case you think I was wasting my time in a hospital bed - I realised why
that last patch still doesn't work.

This one just might :)

Comment 15 Robert Peterson 2006-01-19 23:25:24 UTC
This latest patch seems pretty stable.  I still haven't tried it in my two-NIC
situation that was killing the previous patch, but I'll do that tomorrow.

FYI:  https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=178367
describes a minor memory leak regarding the usage of /proc/cluster/services and
/proc/cluster/nodes.  The patch for that memory leak is attached to that bz, and
it contains (1) Patrick's patch from 19 January 2006, (2) my patch for the leak,
and (3) a newer version of my new /proc/cluster/smsg_history and
/proc/cluster/msg_history patch.  I'm only posting this here because that patch
supersedes my patch/attachment here from 13 January 2006.


Comment 16 Christine Caulfield 2006-01-23 16:56:08 UTC
Fix checked in to U3:
Checking in cnxman-private.h;
/cvs/cluster/cluster/cman-kernel/src/cnxman-private.h,v  <--  cnxman-private.h
new revision: 1.12.2.2.10.1; previous revision: 1.12.2.2
done
Checking in cnxman.c;
/cvs/cluster/cluster/cman-kernel/src/cnxman.c,v  <--  cnxman.c
new revision: 1.42.2.18.2.1; previous revision: 1.42.2.18
done

Fixed into RHEL4:
Checking in cnxman-private.h;
/cvs/cluster/cluster/cman-kernel/src/cnxman-private.h,v  <--  cnxman-private.h
new revision: 1.12.2.4; previous revision: 1.12.2.3
done
Checking in cnxman.c;
/cvs/cluster/cluster/cman-kernel/src/cnxman.c,v  <--  cnxman.c
new revision: 1.42.2.21; previous revision: 1.42.2.20
done


Comment 19 Red Hat Bugzilla 2006-03-09 19:47:36 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2006-0236.html


