Bug 152451 - gfs_recoverd (and other processes) stuck waiting on disk i/o
gfs_recoverd (and other processes) stuck waiting on disk i/o
Status: CLOSED NOTABUG
Product: Red Hat Cluster Suite
Classification: Red Hat
Component: dlm (Show other bugs)
4
All Linux
medium Severity medium
: ---
: ---
Assigned To: David Teigland
Cluster QE
:
Depends On:
Blocks: 144795
  Show dependency treegraph
 
Reported: 2005-03-29 12:15 EST by Corey Marthaler
Modified: 2009-04-16 16:30 EDT (History)
1 user (show)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2005-04-20 10:57:03 EDT
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)

  None (edit)
Description Corey Marthaler 2005-03-29 12:15:21 EST
Description of problem:
Revolver has seen recovery times up to 2.5 hours.

The lastest was on a 6 node cluster (tank-01 - tank-06), each running "LITE" i/o
to 2 gfs. Tank-02 was the shot node.

tank-01:
Mar 28 17:39:25 tank-01 kernel: CMAN: removing node tank-02.lab.msp.redhat.com
from the cluster : Missed too many heartbeats
Mar 28 17:39:38 tank-01 kernel: lpfc 0000:02:01.0: 0:0203 Nodev timeout on NPort
x680100 Data: x2010808 x7 x7
Mar 28 17:39:46 tank-01 fenced[5846]: tank-02.lab.msp.redhat.com not a cluster
member after 20 secpost_fail_delay
Mar 28 17:39:46 tank-01 fenced[5846]: fencing node "tank-02.lab.msp.redhat.com"
Mar 28 17:39:50 tank-01 fenced[5846]: fence "tank-02.lab.msp.redhat.com" success
Mar 28 17:39:57 tank-01 kernel: GFS: fsid=tank-cluster:gfs1.0: jid=1: Trying to
acquire journal lock...
Mar 28 17:39:57 tank-01 kernel: GFS: fsid=tank-cluster:gfs0.0: jid=1: Trying to
acquire journal lock...
Mar 28 17:39:57 tank-01 kernel: GFS: fsid=tank-cluster:gfs0.0: jid=1: Looking at
journal...
Mar 28 17:39:57 tank-01 kernel: GFS: fsid=tank-cluster:gfs1.0: jid=1: Busy
Mar 28 17:39:57 tank-01 kernel: GFS: fsid=tank-cluster:gfs0.0: jid=1: Acquiring
the transaction lock...
Mar 28 17:42:46 tank-01 kernel: CMAN: node tank-02.lab.msp.redhat.com rejoining
.
.  [about 2.5 hours later]
.
Mar 28 20:17:08 tank-01 kernel: GFS: fsid=tank-cluster:gfs0.0: jid=1: Replaying
journal...
Mar 28 20:17:09 tank-01 kernel: GFS: fsid=tank-cluster:gfs0.0: jid=1: Replayed
78 of 79 blocks
Mar 28 20:17:09 tank-01 kernel: GFS: fsid=tank-cluster:gfs0.0: jid=1: replays =
78, skips = 0, sames = 1
Mar 28 20:17:10 tank-01 kernel: GFS: fsid=tank-cluster:gfs0.0: jid=1: Journal
replayed in 9434s
Mar 28 20:17:10 tank-01 kernel: GFS: fsid=tank-cluster:gfs0.0: jid=1: Done


Here's the info from back when this was seen but mistakenly added to bz 145683:

Hit this again on a 5 node cluster, 7 gfs. Shot 4 nodes, only one left
up was morph-02. 

Here's all the info you asked for:

[root@morph-02 ~]# cat /proc/cluster/services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 run       -
[3 2 1 4 5]

DLM Lock Space:  "clvmd"                             3   4 run       -
[3]

DLM Lock Space:  "gfs0"                              4   5 run       -
[3]

DLM Lock Space:  "gfs1"                              6   7 run       -
[3]

DLM Lock Space:  "gfs2"                              8   9 run       -
[3]

DLM Lock Space:  "gfs3"                             10  11 run       -
[3]

DLM Lock Space:  "gfs4"                             12  13 run       -
[3]

DLM Lock Space:  "gfs5"                             14  15 run       -
[3]

DLM Lock Space:  "gfs6"                             16  17 run       -
[3]

GFS Mount Group: "gfs0"                              5   6 recover 2 -
[3]

GFS Mount Group: "gfs1"                              7   8 recover 2 -
[3]

GFS Mount Group: "gfs2"                              9  10 run       -
[3]

GFS Mount Group: "gfs3"                             11  12 run       -
[3]

GFS Mount Group: "gfs5"                             15  16 run       -
[3]

GFS Mount Group: "gfs6"                             17  18 run       -
[3]



[root@morph-02 ~]# cat /proc/cluster/lock_dlm/debug
b7e7 0,5 id 10225 sts 0 0
7794 req 7,ebb7e7 ex 0-7fffffffffffffff lkf 2000 wait 1
7794 lk 7,ebb7e7 id 0 -1,5 2000
7794 lk 11,ebb7e7 id 10225 5,0 4
6619 qc 7,ebb7e7 -1,5 id e102c8 sts 0 0
6619 qc 11,ebb7e7 5,0 id 10225 sts 0 0
7794 ex plock 0
7620 en punlock 7,ebb7e4
7811 en punlock 7,ebb7ee
7620 lk 11,ebb7e4 id 101af 0,5 4
7811 lk 11,ebb7ee id 201d7 0,5 4
6619 qc 11,ebb7e4 0,5 id 101af sts 0 0
6619 qc 11,ebb7ee 0,5 id 201d7 sts 0 0
7620 remove 7,ebb7e4
7620 un 7,ebb7e4 113013c 5 0
6619 qc 7,ebb7e4 5,5 id 113013c sts -65538 0
7811 remove 7,ebb7ee
7811 un 7,ebb7ee f603cf 5 0
7620 lk 11,ebb7e4 id 101af 5,0 4
6619 qc 7,ebb7ee 5,5 id f603cf sts -65538 0
6619 qc 11,ebb7e4 5,0 id 101af sts 0 0
7620 ex punlock 0
7620 en plock 7,ebb7e4
7620 lk 11,ebb7e4 id 101af 0,5 4
6619 qc 11,ebb7e4 0,5 id 101af sts 0 0
7620 req 7,ebb7e4 ex 0-7fffffffffffffff lkf 2000 wait 1
7620 lk 7,ebb7e4 id 0 -1,5 2000
7620 lk 11,ebb7e4 id 101af 5,0 4
6619 qc 7,ebb7e4 -1,5 id 125019e sts 0 0
6619 qc 11,ebb7e4 5,0 id 101af sts 0 0
7620 ex plock 0
7811 lk 11,ebb7ee id 201d7 5,0 4
6619 qc 11,ebb7ee 5,0 id 201d7 sts 0 0
7811 ex punlock 0
7811 en plock 7,ebb7ee
7811 lk 11,ebb7ee id 201d7 0,5 4
6619 qc 11,ebb7ee 0,5 id 201d7 sts 0 0
7811 req 7,ebb7ee ex 2e1138-2e230a lkf 2000 wait 1
7811 lk 7,ebb7ee id 0 -1,5 2000
7811 lk 11,ebb7ee id 201d7 5,0 4
6619 qc 7,ebb7ee -1,5 id f702f2 sts 0 0
6619 qc 11,ebb7ee 5,0 id 201d7 sts 0 0
7811 ex plock 0
7641 en punlock 7,ebbc52
7641 lk 11,ebbc52 id 103c1 0,5 4
6619 qc 11,ebbc52 0,5 id 103c1 sts 0 0
7641 remove 7,ebbc52
7641 un 7,ebbc52 f20135 5 0
6619 qc 7,ebbc52 5,5 id f20135 sts -65538 0
7641 lk 11,ebbc52 id 103c1 5,0 4
6619 qc 11,ebbc52 5,0 id 103c1 sts 0 0
7641 ex punlock 0
7641 en plock 7,ebbc5f
7641 lk 11,ebbc5f id 102a4 0,5 4
6619 qc 11,ebbc5f 0,5 id 102a4 sts 0 0
7641 req 7,ebbc5f ex 0-7fffffffffffffff lkf 2000 wait 1
7641 lk 7,ebbc5f id 0 -1,5 2000
7794 en punlock 7,ebb7e7
7794 lk 11,ebb7e7 id 10225 0,5 4
7641 lk 11,ebbc5f id 102a4 5,0 4
6619 qc 7,ebbc5f -1,5 id f401f1 sts 0 0
6619 qc 11,ebb7e7 0,5 id 10225 sk 7,edb746
7746 lk 11,edb746 id 2006d 0,5 4
6619 qc 11,edb746 0,5 id 2006d sts 0 0
7621 en punlock 7,ebb7e2
7746 remove 7,edb746
7746 un 7,edb746 112032b 5 0
6619 qc 7,edb746 5,5 id 112032b sts -65538 0
7621 lk 11,ebb7e2 id 2003a 0,5 4
7746 lk 11,edb746 id 2006d 5,0 4
6619 qc 11,ebb7e2 0,5 id 2003a sts 0 0
6619 qc 11,edb746 5,0 id 2006d sts 0 0
7746 ex punlock 0
7746 en plock 7,edb74e
7621 remove 7,ebb7e2
7621 un 7,ebb7e2 11202bd 5 0
6619 qc 7,ebb7e2 5,5 id 11202bd sts -65538 0
7621 lk 11,ebb7e2 id 2003a 5,0 4
6619 qc 11,ebb7e2 5,0 id 2003a sts 0 0
7621 ex punlock 0
7621 en plock 7,ebb7e5
7621 lk 11,ebb7e5 id 10097 0,5 4
6619 qc 11,ebb7e5 0,5 id 10097 sts 0 0
7621 req 7,ebb7e5 ex 0-7fffffffffffffff lkf 2000 wait 1
7621 lk 7,ebb7e5 id 0 -1,5 2000
7621 lk 11,ebb7e5 id 10097 5,0 4
6619 qc 7,ebb7e5 -1,5 id 11e00d6 sts 0 0
6619 qc 11,ebb7e5 5,0 id 10097 sts 0 0
7621 ex plock 0
7622 lk 11,ebb7e2 id 2003a 0,5 4
6619 qc 11,ebb7e2 0,5 id 2003a sts 0 0
7622 req 7,ebb7e2 ex 0-7fffffffffffffff lkf 2000 wait 1
7622 lk 7,e22 lk 11,ebb7e2 id 2003a 5,0 4
6619 qc 7,ebb7e2 -1,5 id 11202dc sts 0 0
6619 qc 11,ebb7e2 5,0 id 2003a sts 0 0
7622 ex plock 0
7639 en punlock 7,ebbc5e
7639 lk 11,ebbc5e id 302a7 0,5 4
6619 qc 11,ebbc5e 0,5 id 302a7 sts 0 0
7639 remove 7,ebbc5e
7639 un 7,ebbc5e e70074 5 0
6619 qc 7,ebbc5e 5,5 id e70074 sts -65538 0
7639 lk 11,ebbc5e id 302a7 5,0 4
6619 qc 11,ebbc5e 5,0 id 302a7 sts 0 0
7639 ex punlock 0
7639 en plock 7,ebbc5f
7644 lk 11,ebbc5e id 302a7 0,5 4
6619 qc 11,ebbc5e 0,5 id 302a7 sts 0 0
7644 req 7,ebbc5e ex 0-7fffffffffffffff lkf 2000 wait 1
7644 lk 7,ebbc5e id 0 -1,5 2000
7644 lk 11,ebbc5e id 302a7 5,0 4
6619 qc 7,ebbc5e -1,5 id 1000185 sts 0 0
6619 qc 11,ebbc5e 5,0 id 302a7 sts 0 0
7644 ex plock 0
7621 en punlock 7,ebb7e5
7621 lk 11,ebb7e5 id 10097 0,5 4
6619 qc 11,ebb7e5 0,5 id 10097 sts 0 0
7621 remove 7,ebb7e5
7621 un 7,ebb7e5 11e00d6 5 0
6619 qc 7,ebb7e5 5,5 id 11e00d6 sts -65538 0
7621 lk 11,ebb7e5 id 10097 5,0 4
6619 qc 11,ebb7e5 5,0 id 10097 sts 0 0
7621 ex punlock 0
7621 en plock 7,ebb7e4

[root@morph-02 ~]# cat /proc/cluster/dlm_debug
1 "      11
gfs2 resent 7 requests
gfs1 recover event 71 finished
gfs2 recover event 71 finished
gfs4 move flags 0,0,1 ids 65,71,71
gfs0 processed 0 requests
gfs0 resend marked requests
gfs0 resend e0222 lq 3 flg 1200008 node 0/0 "       8
gfs0 resend 9034e lq 1 flg 200000 node -1/-1 "       7
gfs0 resend 702e0 lq 1 flg 200000 node -1/-1 "       7
gfs0 resend 703bd lq 1 flg 200000 node -1/-1 "       5
gfs0 resend b02fa lq 1 flg 200000 node -1/-1 "       7
gfs0 resent 5 requests
gfs0 recover event 71 finished
gfs4 process held requests
gfs4 processed 0 requests
gfs4 resend marked requests
gfs4 resend d01ab lq 3 flg 1200008 node 0/0 "       8
gfs4 resend 60116 lq 1 flg 200000 node -1/-1 "       7
gfs4 resend c03ec lq 1 flg 200000 node -1/-1 "       7
gfs4 resend a018d lq 1 flg 200000 node -1/-1 "       7
gfs4 resend 8011f lq 1 flg 200000 node -1/-1 "       5
gfs4 resent 5 requests
gfs4 recover event 71 finished


[root@morph-02 ~]# ps -e -o pid,wchan=WIDE-WCHAN-COLUMN -o cmd
  PID WIDE-WCHAN-COLUMN CMD
    1 -                 init [3]
    2 migration_thread  [migration/0]
    3 ksoftirqd         [ksoftirqd/0]
    4 migration_thread  [migration/1]
    5 ksoftirqd         [ksoftirqd/1]
    6 worker_thread     [events/0]
    7 worker_thread     [events/1]
    8 worker_thread     [khelper]
    9 worker_thread     [kacpid]
   41 worker_thread     [kblockd/0]
   42 worker_thread     [kblockd/1]
   43 hub_thread        [khubd]
   52 pdflush           [pdflush]
   53 -                 [pdflush]
   55 worker_thread     [aio/0]
   56 worker_thread     [aio/1]
   54 kswapd            [kswapd0]
  129 serio_thread      [kseriod]
  198 -                 [scsi_eh_0]
  199 -                 [qla2300_0_dpc]
  224 worker_thread     [kmirrord/0]
  225 worker_thread     [kmirrord/1]
  241 kjournald         [kjournald]
 1104 -                 udevd
 1464 kjournald         [kjournald]
 1810 -                 syslogd -m 0
 1814 syslog            klogd -x
 1824 -                 irqbalance
 1834 -                 portmap
 1853 -                 rpc.statd
 1883 -                 rpc.idmapd
 1985 -                 /usr/sbin/smartd
 1994 -                 /usr/sbin/acpid
 2005 -                 cupsd
 2040 -                 /usr/sbin/sshd
 2053 -                 xinetd -stayalive -pidfile /var/run/xinetd.pid
 2072 -                 sendmail: rejecting connections on daemon MTA:
load average: 93
 2080 pause             sendmail: Queue runner@01:00:00 for
/var/spool/clientmqueue
 2135 -                 gpm -m /dev/input/mice -t imps2
 2188 -                 crond
 2213 -                 xfs -droppriv -daemon
 2230 -                 /usr/sbin/atd
 2239 -                 dbus-daemon-1 --system
 2250 -                 cups-config-daemon
 2260 -                 hald
 2267 -                 /sbin/agetty ttyS0 115200 vt100
 2268 -                 /sbin/mingetty tty1
 2269 -                 /sbin/mingetty tty2
 2270 -                 /sbin/mingetty tty3
 2271 -                 /sbin/mingetty tty4
 2272 -                 /sbin/mingetty tty5
 2273 -                 /sbin/mingetty tty6
 5871 -                 ccsd
 5893 cluster_kthread   [cman_comms]
 5895 serviced          [cman_serviced]
 5894 membership_kthrea [cman_memb]
 5972 hello_kthread     [cman_hbeat]
 6031 rt_sigsuspend     fenced
 6618 -                 clvmd
 6619 dlm_astd          [dlm_astd]
 6620 dlm_recvd         [dlm_recvd]
 6621 dlm_sendd         [dlm_sendd]
 6622 dlm_recoverd      [dlm_recoverd]
 6814 dlm_recoverd      [dlm_recoverd]
 6815 dlm_async         [lock_dlm1]
 6816 dlm_async         [lock_dlm2]
 6817 -                 [gfs_scand]
 6818 gfs_glockd        [gfs_glockd]
 6819 wait_on_buffer    [gfs_recoverd]
 6820 -                 [gfs_logd]
 6821 glock_wait_intern [gfs_quotad]
 6822 -                 [gfs_inoded]
 6882 dlm_recoverd      [dlm_recoverd]
 6883 dlm_async         [lock_dlm1]
 6884 dlm_async         [lock_dlm2]
 6885 -                 [gfs_scand]
 6886 gfs_glockd        [gfs_glockd]
 6887 wait_on_buffer    [gfs_recoverd]
 6888 -                 [gfs_logd]
 6889 glock_wait_intern [gfs_quotad]
 6890 -                 [gfs_inoded]
 6959 dlm_recoverd      [dlm_recoverd]
 6960 dlm_async         [lock_dlm1]
 6961 dlm_async         [lock_dlm2]
 6962 -                 [gfs_scand]
 6963 gfs_glockd        [gfs_glockd]
 6964 -                 [gfs_recoverd]
 6965 -                 [gfs_logd]
 6966 -                 [gfs_quotad]
 6967 -                 [gfs_inoded]
 7036 dlm_recoverd      [dlm_recoverd]
 7037 dlm_async         [lock_dlm1]
 7038 dlm_async         [lock_dlm2]
 7039 -                 [gfs_scand]
 7040 -                 [gfs_glockd]
 7041 -                 [gfs_recoverd]
 7042 -                 [gfs_logd]
 7043 -                 [gfs_quotad]
 7044 -                 [gfs_inoded]
 7113 dlm_recoverd      [dlm_recoverd]
 7114 dlm_async         [lock_dlm1]
 7115 dlm_async         [lock_dlm2]
 7116 -                 [gfs_scand]
 7117 gfs_glockd        [gfs_glockd]
 7118 -                 [gfs_recoverd]
 7119 -                 [gfs_logd]
 7120 -                 [gfs_quotad]
 7121 -                 [gfs_inoded]
 7190 dlm_recoverd      [dlm_recoverd]
 7191 dlm_async         [lock_dlm1]
 7192 dlm_async         [lock_dlm2]
 7193 -                 [gfs_scand]
 7194 gfs_glockd        [gfs_glockd]
 7195 -                 [gfs_recoverd]
 7196 -                 [gfs_logd]
 7197 -                 [gfs_quotad]
 7198 -                 [gfs_inoded]
 7267 dlm_recoverd      [dlm_recoverd]
 7268 -                 [lock_dlm1]
 7269 -                 [lock_dlm2]
 7270 -                 [gfs_scand]
 7271 gfs_glockd        [gfs_glockd]
 7272 -                 [gfs_recoverd]
 7273 -                 [gfs_logd]
 7274 -                 [gfs_quotad]
 7275 -                 [gfs_inoded]
 7446 -                 sshd: root@notty
 7448 -                 sshd: root@notty
 7450 -                 sshd: root@notty
 7451 -                 sshd: root@notty
 7454 -                 sshd: root@notty
 7456 wait              bash -c
PATH=/usr/kerberos/bin:/usr/local/bin:/usr/bin:/bin:/usr/X11R6/bin:/home/msp/c
 7457 wait              bash -c
PATH=/usr/kerberos/bin:/usr/local/bin:/usr/bin:/bin:/usr/X11R6/bin:/home/msp/c
 7465 -                 sshd: root@notty
 7468 wait              bash -c
PATH=/usr/kerberos/bin:/usr/local/bin:/usr/bin:/bin:/usr/X11R6/bin:/home/msp/c
 7481 wait              bash -c
PATH=/usr/kerberos/bin:/usr/local/bin:/usr/bin:/bin:/usr/X11R6/bin:/home/msp/c
 7482 wait              bash -c
PATH=/usr/kerberos/bin:/usr/local/bin:/usr/bin:/bin:/usr/X11R6/bin:/home/msp/c
 7541 wait              bash -c
PATH=/usr/kerberos/bin:/usr/local/bin:/usr/bin:/bin:/usr/X11R6/bin:/home/msp/c
 7565 -                 sshd: root@notty
 7568 wait              bash -c
PATH=/usr/kerberos/bin:/usr/local/bin:/usr/bin:/bin:/usr/X11R6/bin:/home/msp/c
 7586 pipe_wait         /usr/bin/perl -w
/tmp/STS/gfs/bin/revolver_load_gen -r /tmp/STS -L HEAVY -m LOCK_DLM
 7587 pipe_wait         /usr/bin/perl -w
/tmp/STS/gfs/bin/revolver_load_gen -r /tmp/STS -L HEAVY -m LOCK_DLM
 7588 pipe_wait         /usr/bin/perl -w
/tmp/STS/gfs/bin/revolver_load_gen -r /tmp/STS -L HEAVY -m LOCK_DLM
 7597 wait              sh -c PATH=$PATH:/tmp/STS/bin;
/tmp/STS/bin/pan2 -x 5 -f /tmp/revolver_load_gen.7586 -
 7598 wait              sh -c PATH=$PATH:/tmp/STS/bin;
/tmp/STS/bin/pan2 -x 5 -f /tmp/revolver_load_gen.7587 -
 7599 pipe_wait         /usr/bin/perl -w
/tmp/STS/gfs/bin/revolver_load_gen -r /tmp/STS -L HEAVY -m LOCK_DLM
 7602 pipe_wait         /usr/bin/perl -w
/tmp/STS/gfs/bin/revolver_load_gen -r /tmp/STS -L HEAVY -m LOCK_DLM
 7605 pipe_wait         /usr/bin/perl -w
/tmp/STS/gfs/bin/revolver_load_gen -r /tmp/STS -L HEAVY -m LOCK_DLM
 7608 -                 /tmp/STS/bin/pan2 -x 5 -f
/tmp/revolver_load_gen.7587 -l /tmp/revolver/7587/revolver_l
 7609 pipe_wait         /tmp/STS/bin/pan2 -x 5 -f
/tmp/revolver_load_gen.7587 -l /tmp/revolver/7587/revolver_l
 7610 wait_on_buffer    growfiles -i 0 -N 500 -n 4 -b
 7611 pipe_wait         /tmp/STS/bin/pan2 -x 5 -f
/tmp/revolver_load_gen.7587 -l /tmp/revolver/7587/revolver_l
 7612 wait              sh -c iogen -f sync -m sequential -s
read,write,readv,writev -t 1b -T 30000 30000:rwsy
 7613 pipe_wait         /tmp/STS/bin/pan2 -x 5 -f
/tmp/revolver_load_gen.7587 -l /tmp/revolver/7587/revolver_l
 7614 wait              accordion -p 4 accrdfile1 accrdfile2
accrdfile3 accrdfile4 accrdfile5
 7615 pipe_wait         /tmp/STS/bin/pan2 -x 5 -f
/tmp/revolver_load_gen.7587 -l /tmp/revolver/7587/revolver_l
 7616 wait              sh -c iogen -f buffered -m sequential -s
read,write,readv,writev -t 1b -T 6000b 6000b:
 7617 pipe_wait         /tmp/STS/bin/pan2 -x 5 -f
/tmp/revolver_load_gen.7587 -l /tmp/revolver/7587/revolver_l
 7618 wait              genesis -n 500 -d 150 -p 4
 7619 wait_local        accordion -p 4 accrdfile1 accrdfile2
accrdfile3 accrdfile4 accrdfile5
 7620 -                 accordion -p 4 accrdfile1 accrdfile2
accrdfile3 accrdfile4 accrdfile5
 7621 wait_on_buffer    accordion -p 4 accrdfile1 accrdfile2
accrdfile3 accrdfile4 accrdfile5
 7622 wait_local        accordion -p 4 accrdfile1 accrdfile2
accrdfile3 accrdfile4 accrdfile5
 7623 wait_on_buffer    genesis -n 500 -d 150 -p 4
 7624 wait_on_buffer    genesis -n 500 -d 150 -p 4
 7625 wait_on_buffer    genesis -n 500 -d 150 -p 4
 7626 wait_on_buffer    genesis -n 500 -d 150 -p 4
 7627 wait_on_buffer    growfiles -i 0 -N 500 -n 4 -b
 7628 -                 growfiles -i 0 -N 500 -n 4 -b
 7629 -                 growfiles -i 0 -N 500 -n 4 -b
 7630 -                 /tmp/STS/bin/pan2 -x 5 -f
/tmp/revolver_load_gen.7586 -l /tmp/revolver/7586/revolver_l
 7631 pipe_wait         /tmp/STS/bin/pan2 -x 5 -f
/tmp/revolver_load_gen.7586 -l /tmp/revolver/7586/revolver_l
 7632 wait              sh -c iogen -f buffered -m sequential -s
read,write,readv,writev -t 1b -T 6000b 6000b:
 7633 pipe_wait         iogen -f buffered -m sequential -s read write
readv writev -t 1b -T 6000b 6000b:rwbufl
 7634 lock_page         doio -avk
 7635 pipe_wait         /tmp/STS/bin/pan2 -x 5 -f
/tmp/revolver_load_gen.7586 -l /tmp/revolver/7586/revolver_l
 7636 wait              accordion -p 4 accrdfile1 accrdfile2
accrdfile3 accrdfile4 accrdfile5
 7637 pipe_wait         /tmp/STS/bin/pan2 -x 5 -f
/tmp/revolver_load_gen.7586 -l /tmp/revolver/7586/revolver_l
 7638 wait              genesis -n 500 -d 150 -p 4
 7639 -                 accordion -p 4 accrdfile1 accrdfile2
accrdfile3 accrdfile4 accrdfile5
 7640 -                 genesis -n 500 -d 150 -p 4
 7641 wait_local        accordion -p 4 accrdfile1 accrdfile2
accrdfile3 accrdfile4 accrdfile5
 7642 wait_on_buffer    genesis -n 500 -d 150 -p 4
 7643 pipe_wait         /tmp/STS/bin/pan2 -x 5 -f
/tmp/revolver_load_gen.7586 -l /tmp/revolver/7586/revolver_l
 7644 wait_on_buffer    accordion -p 4 accrdfile1 accrdfile2
accrdfile3 accrdfile4 accrdfile5
 7645 -                 growfiles -i 0 -N 500 -n 4 -b
 7646 -                 growfiles -i 0 -N 500 -n 4 -b
 7647 pipe_wait         /tmp/STS/bin/pan2 -x 5 -f
/tmp/revolver_load_gen.7586 -l /tmp/revolver/7586/revolver_l
 7648 wait              sh -c iogen -f sync -m sequential -s
read,write,readv,writev -t 1b -T 30000 30000:rwsy
 7649 -                 genesis -n 500 -d 150 -p 4
 7650 glock_wait_intern genesis -n 500 -d 150 -p 4
 7651 -                 accordion -p 4 accrdfile1 accrdfile2
accrdfile3 accrdfile4 accrdfile5
 7652 wait_on_buffer    growfiles -i 0 -N 500 -n 4 -b
 7653 -                 growfiles -i 0 -N 500 -n 4 -b
 7654 pipe_wait         iogen -f sync -m sequential -s read write
readv writev -t 1b -T 30000 30000:rwsynclarg
 7655 -                 doio -avk
 7657 wait              sh -c PATH=$PATH:/tmp/STS/bin;
/tmp/STS/bin/pan2 -x 5 -f /tmp/revolver_load_gen.7599 -
 7659 wait              sh -c PATH=$PATH:/tmp/STS/bin;
/tmp/STS/bin/pan2 -x 5 -f /tmp/revolver_load_gen.7605 -
 7660 -                 /tmp/STS/bin/pan2 -x 5 -f
/tmp/revolver_load_gen.7605 -l /tmp/revolver/7605/revolver_l
 7661 pipe_wait         /tmp/STS/bin/pan2 -x 5 -f
/tmp/revolver_load_gen.7605 -l /tmp/revolver/7605/revolver_l
 7662 wait              accordion -p 4 accrdfile1 accrdfile2
accrdfile3 accrdfile4 accrdfile5
 7663 pipe_wait         /tmp/STS/bin/pan2 -x 5 -f
/tmp/revolver_load_gen.7605 -l /tmp/revolver/7605/revolver_l
 7664 wait              genesis -n 500 -d 150 -p 4
 7665 pipe_wait         /tmp/STS/bin/pan2 -x 5 -f
/tmp/revolver_load_gen.7605 -l /tmp/revolver/7605/revolver_l
 7666 wait              sh -c iogen -f sync -m sequential -s
read,write,readv,writev -t 1b -T 30000 30000:rwsy
 7667 pipe_wait         /tmp/STS/bin/pan2 -x 5 -f
/tmp/revolver_load_gen.7605 -l /tmp/revolver/7605/revolver_l
 7668 wait_local        accordion -p 4 accrdfile1 accrdfile2
accrdfile3 accrdfile4 accrdfile5
 7669 wait_local        accordion -p 4 accrdfile1 accrdfile2
accrdfile3 accrdfile4 accrdfile5
 7671 pipe_wait         /tmp/STS/bin/pan2 -x 5 -f
/tmp/revolver_load_gen.7605 -l /tmp/revolver/7605/revolver_l
 7672 wait_local        accordion -p 4 accrdfile1 accrdfile2
accrdfile3 accrdfile4 accrdfile5
 7673 -                 growfiles -i 0 -N 500 -n 4 -b
 7674 wait_async        accordion -p 4 accrdfile1 accrdfile2
accrdfile3 accrdfile4 accrdfile5
 7675 -                 genesis -n 500 -d 150 -p 4
 7676 glock_wait_intern genesis -n 500 -d 150 -p 4
 7677 glock_wait_intern genesis -n 500 -d 150 -p 4
 7678 -                 genesis -n 500 -d 150 -p 4
 7670 wait              sh -c iogen -f buffered -m sequential -s
read,write,readv,writev -t 1b -T 6000b 6000b:
 7679 -                 growfiles -i 0 -N 500 -n 4 -b
 7680 glock_wait_intern growfiles -i 0 -N 500 -n 4 -b
 7681 glock_wait_intern growfiles -i 0 -N 500 -n 4 -b
 7682 pipe_wait         iogen -f sync -m sequential -s read write
readv writev -t 1b -T 30000 30000:rwsynclarg
 7683 pipe_wait         iogen -f buffered -m sequential -s read write
readv writev -t 1b -T 6000b 6000b:rwbufl
 7684 wait_async        doio -avk
 7685 wait_async        doio -avk
 7686 pipe_wait         iogen -f buffered -m sequential -s read write
readv writev -t 1b -T 6000b 6000b:rwbufl
 7687 pipe_wait         iogen -f sync -m sequential -s read write
readv writev -t 1b -T 30000 30000:rwsynclarg
 7688 -                 doio -avk
 7689 -                 doio -avk
 7728 wait              sh -c PATH=$PATH:/tmp/STS/bin;
/tmp/STS/bin/pan2 -x 5 -f /tmp/revolver_load_gen.7602 -
 7729 -                 /tmp/STS/bin/pan2 -x 5 -f
/tmp/revolver_load_gen.7602 -l /tmp/revolver/7602/revolver_l
 7730 pipe_wait         /tmp/STS/bin/pan2 -x 5 -f
/tmp/revolver_load_gen.7602 -l /tmp/revolver/7602/revolver_l
 7731 wait              sh -c iogen -f buffered -m sequential -s
read,write,readv,writev -t 1b -T 6000b 6000b:
 7732 pipe_wait         /tmp/STS/bin/pan2 -x 5 -f
/tmp/revolver_load_gen.7602 -l /tmp/revolver/7602/revolver_l
 7733 wait              sh -c iogen -f sync -m sequential -s
read,write,readv,writev -t 1b -T 30000 30000:rwsy
 7734 pipe_wait         /tmp/STS/bin/pan2 -x 5 -f
/tmp/revolver_load_gen.7602 -l /tmp/revolver/7602/revolver_l
 7735 wait              genesis -n 500 -d 150 -p 4
 7736 pipe_wait         /tmp/STS/bin/pan2 -x 5 -f
/tmp/revolver_load_gen.7602 -l /tmp/revolver/7602/revolver_l
 7737 -                 growfiles -i 0 -N 500 -n 4 -b
 7738 pipe_wait         iogen -f buffered -m sequential -s read write
readv writev -t 1b -T 6000b 6000b:rwbufl
 7739 pipe_wait         /tmp/STS/bin/pan2 -x 5 -f
/tmp/revolver_load_gen.7602 -l /tmp/revolver/7602/revolver_l
 7740 wait              accordion -p 4 accrdfile1 accrdfile2
accrdfile3 accrdfile4 accrdfile5
 7741 wait_on_buffer    genesis -n 500 -d 150 -p 4
 7742 -                 doio -avk
 7743 -                 genesis -n 500 -d 150 -p 4
 7744 wait_on_buffer    genesis -n 500 -d 150 -p 4
 7745 wait_on_buffer    genesis -n 500 -d 150 -p 4
 7746 wait_local        accordion -p 4 accrdfile1 accrdfile2
accrdfile3 accrdfile4 accrdfile5
 7747 wait_on_buffer    accordion -p 4 accrdfile1 accrdfile2
accrdfile3 accrdfile4 accrdfile5
 7748 wait_local        accordion -p 4 accrdfile1 accrdfile2
accrdfile3 accrdfile4 accrdfile5
 7749 -                 growfiles -i 0 -N 500 -n 4 -b
 7750 -                 growfiles -i 0 -N 500 -n 4 -b
 7751 -                 growfiles -i 0 -N 500 -n 4 -b
 7752 -                 accordion -p 4 accrdfile1 accrdfile2
accrdfile3 accrdfile4 accrdfile5
 7753 pipe_wait         iogen -f sync -m sequential -s read write
readv writev -t 1b -T 30000 30000:rwsynclarg
 7754 -                 doio -avk
 7755 -                 /tmp/STS/bin/pan2 -x 5 -f
/tmp/revolver_load_gen.7599 -l /tmp/revolver/7599/revolver_l
 7756 pipe_wait         /tmp/STS/bin/pan2 -x 5 -f
/tmp/revolver_load_gen.7599 -l /tmp/revolver/7599/revolver_l
 7757 -                 growfiles -i 0 -N 500 -n 4 -b
 7758 pipe_wait         /tmp/STS/bin/pan2 -x 5 -f
/tmp/revolver_load_gen.7599 -l /tmp/revolver/7599/revolver_l
 7759 wait              accordion -p 4 accrdfile1 accrdfile2
accrdfile3 accrdfile4 accrdfile5
 7761 -                 growfiles -i 0 -N 500 -n 4 -b
 7760 pipe_wait         /tmp/STS/bin/pan2 -x 5 -f
/tmp/revolver_load_gen.7599 -l /tmp/revolver/7599/revolver_l
 7762 wait_on_buffer    growfiles -i 0 -N 500 -n 4 -b
 7763 pipe_wait         /tmp/STS/bin/pan2 -x 5 -f
/tmp/revolver_load_gen.7599 -l /tmp/revolver/7599/revolver_l
 7764 wait              sh -c iogen -f buffered -m sequential -s
read,write,readv,writev -t 1b -T 6000b 6000b:
 7765 -                 growfiles -i 0 -N 500 -n 4 -b
 7766 pipe_wait         /tmp/STS/bin/pan2 -x 5 -f
/tmp/revolver_load_gen.7599 -l /tmp/revolver/7599/revolver_l
 7767 wait              sh -c iogen -f sync -m sequential -s
read,write,readv,writev -t 1b -T 30000 30000:rwsy
 7768 pipe_wait         iogen -f sync -m sequential -s read write
readv writev -t 1b -T 30000 30000:rwsynclarg
 7769 -                 doio -avk
 7770 wait              genesis -n 500 -d 150 -p 4
 7771 wait_on_buffer    genesis -n 500 -d 150 -p 4
 7773 wait_on_buffer    genesis -n 500 -d 150 -p 4
 7772 wait_on_buffer    accordion -p 4 accrdfile1 accrdfile2
accrdfile3 accrdfile4 accrdfile5
 7774 -                 genesis -n 500 -d 150 -p 4
 7775 wait_on_buffer    genesis -n 500 -d 150 -p 4
 7776 wait_local        accordion -p 4 accrdfile1 accrdfile2
accrdfile3 accrdfile4 accrdfile5
 7777 wait_on_buffer    accordion -p 4 accrdfile1 accrdfile2
accrdfile3 accrdfile4 accrdfile5
 7778 pipe_wait         iogen -f buffered -m sequential -s read write
readv writev -t 1b -T 6000b 6000b:rwbufl
 7779 wait_on_buffer    accordion -p 4 accrdfile1 accrdfile2
accrdfile3 accrdfile4 accrdfile5
 7780 pipe_wait         /usr/bin/perl -w
/tmp/STS/gfs/bin/revolver_load_gen -r /tmp/STS -L HEAVY -m LOCK_DLM
 7781 wait_on_buffer    doio -avk
 7785 wait              sh -c PATH=$PATH:/tmp/STS/bin;
/tmp/STS/bin/pan2 -x 5 -f /tmp/revolver_load_gen.7780 -
 7786 -                 /tmp/STS/bin/pan2 -x 5 -f
/tmp/revolver_load_gen.7780 -l /tmp/revolver/7780/revolver_l
 7787 pipe_wait         /tmp/STS/bin/pan2 -x 5 -f
/tmp/revolver_load_gen.7780 -l /tmp/revolver/7780/revolver_l
 7788 wait              accordion -p 4 accrdfile1 accrdfile2
accrdfile3 accrdfile4 accrdfile5
 7789 pipe_wait         /tmp/STS/bin/pan2 -x 5 -f
/tmp/revolver_load_gen.7780 -l /tmp/revolver/7780/revolver_l
 7790 -                 growfiles -i 0 -N 500 -n 4 -b
 7792 wait_local        accordion -p 4 accrdfile1 accrdfile2
accrdfile3 accrdfile4 accrdfile5
 7791 pipe_wait         /tmp/STS/bin/pan2 -x 5 -f
/tmp/revolver_load_gen.7780 -l /tmp/revolver/7780/revolver_l
 7794 -                 accordion -p 4 accrdfile1 accrdfile2
accrdfile3 accrdfile4 accrdfile5
 7795 pipe_wait         /tmp/STS/bin/pan2 -x 5 -f
/tmp/revolver_load_gen.7780 -l /tmp/revolver/7780/revolver_l
 7793 wait              sh -c iogen -f sync -m sequential -s
read,write,readv,writev -t 1b -T 30000 30000:rwsy
 7797 wait_local        accordion -p 4 accrdfile1 accrdfile2
accrdfile3 accrdfile4 accrdfile5
 7796 wait              genesis -n 500 -d 150 -p 4
 7798 pipe_wait         /tmp/STS/bin/pan2 -x 5 -f
/tmp/revolver_load_gen.7780 -l /tmp/revolver/7780/revolver_l
 7799 wait              sh -c iogen -f buffered -m sequential -s
read,write,readv,writev -t 1b -T 6000b 6000b:
 7800 -                 accordion -p 4 accrdfile1 accrdfile2
accrdfile3 accrdfile4 accrdfile5
 7801 -                 growfiles -i 0 -N 500 -n 4 -b
 7802 wait_on_buffer    growfiles -i 0 -N 500 -n 4 -b
 7803 -                 growfiles -i 0 -N 500 -n 4 -b
 7804 wait_on_buffer    genesis -n 500 -d 150 -p 4
 7805 wait_on_buffer    genesis -n 500 -d 150 -p 4
 7806 wait_on_buffer    genesis -n 500 -d 150 -p 4
 7807 wait_on_buffer    genesis -n 500 -d 150 -p 4
 7808 pipe_wait         iogen -f sync -m sequential -s read write
readv writev -t 1b -T 30000 30000:rwsynclarg
 7809 -                 doio -avk
 7810 pipe_wait         iogen -f buffered -m sequential -s read write
readv writev -t 1b -T 6000b 6000b:rwbufl
 7811 -                 doio -avk
 7813 wait              sh -c PATH=$PATH:/tmp/STS/bin;
/tmp/STS/bin/pan2 -x 5 -f /tmp/revolver_load_gen.7588 -
 7814 -                 /tmp/STS/bin/pan2 -x 5 -f
/tmp/revolver_load_gen.7588 -l /tmp/revolver/7588/revolver_l
 7815 pipe_wait         /tmp/STS/bin/pan2 -x 5 -f
/tmp/revolver_load_gen.7588 -l /tmp/revolver/7588/revolver_l
 7817 pipe_wait         /tmp/STS/bin/pan2 -x 5 -f
/tmp/revolver_load_gen.7588 -l /tmp/revolver/7588/revolver_l
 7816 wait              sh -c iogen -f buffered -m sequential -s
read,write,readv,writev -t 1b -T 6000b 6000b:
 7818 pipe_wait         /tmp/STS/bin/pan2 -x 5 -f
/tmp/revolver_load_gen.7588 -l /tmp/revolver/7588/revolver_l
 7819 wait              accordion -p 4 accrdfile1 accrdfile2
accrdfile3 accrdfile4 accrdfile5
 7821 pipe_wait         iogen -f buffered -m sequential -s read write
readv writev -t 1b -T 6000b 6000b:rwbufl
 7820 wait_local        accordion -p 4 accrdfile1 accrdfile2
accrdfile3 accrdfile4 accrdfile5
 7822 wait_async        doio -avk
 7823 wait_async        accordion -p 4 accrdfile1 accrdfile2
accrdfile3 accrdfile4 accrdfile5
 7824 wait_local        accordion -p 4 accrdfile1 accrdfile2
accrdfile3 accrdfile4 accrdfile5
 7825 wait              genesis -n 500 -d 150 -p 4
 7826 -                 genesis -n 500 -d 150 -p 4
 7827 wait_local        accordion -p 4 accrdfile1 accrdfile2
accrdfile3 accrdfile4 accrdfile5
 7828 glock_wait_intern genesis -n 500 -d 150 -p 4
 7829 pipe_wait         /tmp/STS/bin/pan2 -x 5 -f
/tmp/revolver_load_gen.7588 -l /tmp/revolver/7588/revolver_l
 7830 -                 genesis -n 500 -d 150 -p 4
 7831 -                 growfiles -i 0 -N 500 -n 4 -b
 7832 -                 genesis -n 500 -d 150 -p 4
 7833 pipe_wait         /tmp/STS/bin/pan2 -x 5 -f
/tmp/revolver_load_gen.7588 -l /tmp/revolver/7588/revolver_l
 7834 wait              sh -c iogen -f sync -m sequential -s
read,write,readv,writev -t 1b -T 30000 30000:rwsy
 7835 glock_wait_intern growfiles -i 0 -N 500 -n 4 -b
 7836 glock_wait_intern growfiles -i 0 -N 500 -n 4 -b
 7837 -                 growfiles -i 0 -N 500 -n 4 -b
 7838 pipe_wait         iogen -f sync -m sequential -s read write
readv writev -t 1b -T 30000 30000:rwsynclarg
 7839 wait_async        doio -avk
 9019 -                 sshd: root@pts/0
 9021 wait              -bash
 9059 -                 ps -e -o pid,wchan=WIDE-WCHAN-COLUMN -o cmd



Version-Release number of selected component (if applicable):
DLM 2.6.9-29.0 (built Mar 28 2005 09:56:36) installed
Lock_DLM (built Mar 28 2005 10:02:04) installed


How reproducible:
every once and awhile
Comment 1 Corey Marthaler 2005-03-29 12:17:15 EST
Dave's comments from bz 145683 and email:

While gfs0 and gfs1 are in state "recover 2", gfs should be doing
journal recovery for those two fs's.  At the end there should be
a message from gfs stating how long it took to do each journal replay;
it would be interesting to see what that showed.

I'm betting that the journal recovery time will indicate the 15-20
minutes you waited.  The ps shows that a bunch of processes, including
the first two gfs_recoverd threads, are blocked on "wait_on_buffer".
If I'm not mistaken, this means they are waiting for i/o to complete.

I've seen situations where the storage device or drivers or something
below the fs hangs for a long time and all i/o hangs until the
problem is resolved.  This could be the situation.  It's also
possible that there's no i/o hang, but just an i/o bottleneck.
The i/o on all the running fs's could be starving the i/o from the
gfs_recoverd threads on the two unrecovered fs's.  Running top
during this time would probably tell a lot about which of these
was going on.



I don't suppose those nodes are still stuck there?  It would be helpful to
get the ps from the other two nodes, as well as /proc/meminfo and
/proc/slabinfo from all.

There's actually nothing wrong on morph-01.  It's morph-02 and 05 where
dlm recovery has stalled in the rebuild resource directory stage.  That
stage can eat up a lot of memory and in the past I've seen nodes run out
of memory completely there.  There are usually other indications that the
system memory is gone, though, which is why I'm interested in
meminfo/slabinfo.

If it is in fact a memory problem, the solution isn't too clear...

With the load you're running multiplied by 4 fs's, the situation seems
ripe for running memory dry.  It's really high numbers of locks in the
dlm, combined with recovery that can lead to this.  

/proc/cluster/lock_dlm/drop_count is one crude method we have for trying
to keep the number of dlm/gfs locks down to try to avoid running out of
memory during recovery.  By default I think it's set to 50000, and the
only way to find a better number is trial and error.  Lowering it makes
out-of-memory during recovery less likely, but can limit gfs caching and
hurt performance.

We've discussed the situation before:
https://www.redhat.com/archives/linux-cluster/2004-October/msg00188.html

There may be other ways of keeping the number of dlm/gfs locks down that I
haven't looked into, like /proc/sys/vm/vfs_cache_pressure, or maybe some
gfs tunables.  It would take a fair bit of study and controlled testing to
find a good answer.
Comment 2 David Teigland 2005-03-29 21:44:16 EST
The first part of comment 1 is applicable to this bug.  The second part
isn't related, so please disregard.

I'm 90% certain that this is a bug at the SAN level; either in the lpfc
driver or with the storage hardware.
Comment 3 Corey Marthaler 2005-03-30 10:27:54 EST
Only one node in the six node tank cluster has a lpfc hba, and none of the 
nodes in the morph cluster have one. All the others have qla's. So it could 
very well be a "machine killed while doing I/O messes up the SAN" thing but I 
don't think it's a lpfc thing.
Comment 4 Corey Marthaler 2005-04-20 10:57:03 EDT
blaming this on "cheap" storage, will reopen if ever seen on "good" storage.

Note You need to log in before you can comment on or make changes to this bug.