Bug 243013 - stuck dlm recovery causes corrupt filesystem after cmirror leg and node failure
Status: CLOSED DUPLICATE of bug 359341
Product: Red Hat Cluster Suite
Classification: Red Hat
Component: cmirror
Version: 4
Hardware: All
OS: Linux
Priority: low
Severity: low
Assigned To: Jonathan Earl Brassow
QA Contact: Cluster QE
Reported: 2007-06-06 17:53 EDT by Corey Marthaler
Modified: 2010-01-11 21:03 EST
CC: 4 users

Doc Type: Bug Fix
Last Closed: 2008-03-26 13:51:42 EDT


Attachments: None
Description Corey Marthaler 2007-06-06 17:53:31 EDT
Description of problem:
After killing one of the legs in a 3-leg corelog mirror, as well as one of the
nodes in the cluster (link-07), all GFS I/O that I had going to that cmirror
stopped, and all subsequent I/O got the standard "`/mnt/mirror3': Input/output
error". The down-conversion appeared to work fine (although it was suspiciously
instantaneous) and clvmd is not deadlocked. The only issue is that the one GFS
filesystem appears to be stuck in dlm recovery.

# BEFORE THE FAILURE
[root@link-08 ~]# lvs -a -o +devices
  LV                 VG         Attr   LSize  Origin Snap% Move Log          Copy%  Devices
  LogVol00           VolGroup00 -wi-ao 72.44G                                        /dev/hda2(0)
  LogVol01           VolGroup00 -wi-ao  1.94G                                        /dev/hda2(2318)
  mirror1            corey1     Mwi-ao 10.00G                   mirror1_mlog 100.00  mirror1_mimage_0(0),mirror1_mimage_1(0)
  [mirror1_mimage_0] corey1     iwi-ao 10.00G                                        /dev/sda1(0)
  [mirror1_mimage_1] corey1     iwi-ao 10.00G                                        /dev/sdb1(0)
  [mirror1_mlog]     corey1     lwi-ao  4.00M                                        /dev/sdb2(0)
  mirror2            corey2     Mwi-ao 10.00G                                100.00  mirror2_mimage_0(0),mirror2_mimage_1(0)
  [mirror2_mimage_0] corey2     iwi-ao 10.00G                                        /dev/sdd1(0)
  [mirror2_mimage_1] corey2     iwi-ao 10.00G                                        /dev/sdc1(0)
  mirror3            corey3     Mwi-ao 10.00G                                100.00  mirror3_mimage_0(0),mirror3_mimage_1(0),mirror3_mimage_2(0)
  [mirror3_mimage_0] corey3     iwi-ao 10.00G                                        /dev/sde1(0)
  [mirror3_mimage_1] corey3     iwi-ao 10.00G                                        /dev/sdf1(0)
  [mirror3_mimage_2] corey3     iwi-ao 10.00G                                        /dev/sdg1(0)

# ON ALL NODES
[root@link-08 ~]# echo offline > /sys/block/sde/device/state

# I KILLED LINK-07 
[root@link-08 ~]# lvs -a -o +devices
  /dev/sde1: open failed: No such device or address
  LV                 VG         Attr   LSize  Origin Snap% Move Log          Copy%  Devices
  LogVol00           VolGroup00 -wi-ao 72.44G                                        /dev/hda2(0)
  LogVol01           VolGroup00 -wi-ao  1.94G                                        /dev/hda2(2318)
  mirror1            corey1     Mwi-ao 10.00G                   mirror1_mlog 100.00  mirror1_mimage_0(0),mirror1_mimage_1(0)
  [mirror1_mimage_0] corey1     iwi-ao 10.00G                                        /dev/sda1(0)
  [mirror1_mimage_1] corey1     iwi-ao 10.00G                                        /dev/sdb1(0)
  [mirror1_mlog]     corey1     lwi-ao  4.00M                                        /dev/sdb2(0)
  mirror2            corey2     Mwi-ao 10.00G                                100.00  mirror2_mimage_0(0),mirror2_mimage_1(0)
  [mirror2_mimage_0] corey2     iwi-ao 10.00G                                        /dev/sdd1(0)
  [mirror2_mimage_1] corey2     iwi-ao 10.00G                                        /dev/sdc1(0)
  mirror3            corey3     Mwi-ao 10.00G                                100.00  mirror3_mimage_2(0),mirror3_mimage_1(0)
  [mirror3_mimage_1] corey3     iwi-ao 10.00G                                        /dev/sdf1(0)
  [mirror3_mimage_2] corey3     iwi-ao 10.00G                                        /dev/sdg1(0)
[root@link-08 ~]# touch /mnt/mirror3/foo
touch: cannot touch `/mnt/mirror3/foo': Input/output error
[root@link-08 ~]# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/VolGroup00-LogVol00
                       72G  2.3G   66G   4% /
/dev/hda1              99M   19M   75M  21% /boot
none                  500M     0  500M   0% /dev/shm
/dev/mapper/corey1-mirror1
                      9.5G  3.9M  9.5G   1% /mnt/mirror1
df: `/mnt/mirror3': Input/output error
/dev/mapper/corey2-mirror2
                      9.5G  3.9M  9.5G   1% /mnt/mirror2


[root@link-08 ~]# cman_tool nodes
Node  Votes Exp Sts  Name
   1    1    4   M   link-08
   2    1    4   M   link-02
   3    1    4   X   link-07
   4    1    4   M   link-04
[root@link-08 ~]# cman_tool services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           5   2 run       -
[2 1 4]

DLM Lock Space:  "clvmd"                            74   4 run       -
[2 1 4]

DLM Lock Space:  "clustered_log"                    75   5 run       -
[2 1 4]

DLM Lock Space:  "1"                                77   6 run       -
[2 1 4]

DLM Lock Space:  "3"                                85   8 run       S-10,200,0
[2 1 4]

DLM Lock Space:  "2"                                81  10 run       -
[2 1 4]

GFS Mount Group: "1"                                79   7 run       -
[2 1 4]

GFS Mount Group: "3"                                87   9 recover 2 -
[2 1 4]

GFS Mount Group: "2"                                83  11 run       -
[2 1 4]
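(Editor's note, not part of the original report: the wedged group can be picked out of `cman_tool services` output mechanically. A minimal sketch, assuming the RHEL4 column layout shown above, where service lines carry a quoted name and a healthy entry ends in `run       -`; the `stuck_groups` helper name is invented here.)

```shell
# Print cman_tool services entries whose State/Code columns are not a
# clean "run -" (e.g. "recover 2 -", or "run" with a suspend code such
# as "S-10,200,0").  Assumes the RHEL4 output format quoted above.
stuck_groups() {
    grep '"' | grep -Ev 'run +-$'
}
```

For example, `cman_tool services | stuck_groups` would flag DLM Lock Space "3" and GFS Mount Group "3" in the dump above.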


[root@link-08 ~]# dmsetup status
corey1-mirror1_mimage_1: 0 20971520 linear
corey1-mirror1: 0 20971520 mirror 2 253:7 253:8 20480/20480 1 AA 3 clustered_disk 253:6 A
corey1-mirror1_mimage_0: 0 20971520 linear
corey2-mirror2_mimage_1: 0 20971520 linear
corey2-mirror2_mimage_0: 0 20971520 linear
corey2-mirror2: 0 20971520 mirror 2 253:10 253:11 20480/20480 1 AA 1 clustered_core
corey3-mirror3_mimage_2: 0 20971520 linear
corey3-mirror3_mimage_1: 0 20971520 linear
VolGroup00-LogVol01: 0 4063232 linear
corey3-mirror3: 0 20971520 mirror 2 253:4 253:3 20480/20480 1 AA 1 clustered_core
VolGroup00-LogVol00: 0 151912448 linear
corey1-mirror1_mlog: 0 8192 linear
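(Editor's note, not part of the original report: in the `dmsetup status` mirror lines above, the field of the form `20480/20480` is the in-sync/total region count, and two fields later comes one health character per leg, `A` for alive; other letters flag a failed leg. A small sketch that extracts both, assuming that layout; `mirror_health` is a name invented here.)

```shell
# For each mirror target line from "dmsetup status", print the device
# name, the in-sync/total region ratio, and the per-leg health string
# (e.g. "AA" = both legs alive).  Field layout assumed from this report:
#   <name>: <start> <len> mirror <#legs> <devs...> <sync>/<total> 1 <health> ...
mirror_health() {
    awk '$4 == "mirror" {
        for (i = 5; i <= NF; i++)
            if ($i ~ /^[0-9]+\/[0-9]+$/) {
                printf "%s ratio=%s health=%s\n", $1, $i, $(i + 2)
                break
            }
    }'
}
```

For example, `dmsetup status | mirror_health` would report `health=AA` for each mirror in the dump above.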
[root@link-08 ~]# dmsetup info
Name:              corey1-mirror1_mimage_1
State:             ACTIVE
Tables present:    LIVE
Open count:        1
Event number:      0
Major, minor:      253, 8
Number of targets: 1
UUID: LVM-0QLi8UKjRguEwy4KI1k9KriBg4lyb2j1V9V23NFX7ZxmKdVvCMdSh2B0Q5uPTSKB

Name:              corey1-mirror1
State:             ACTIVE
Tables present:    LIVE
Open count:        1
Event number:      1
Major, minor:      253, 9
Number of targets: 1
UUID: LVM-0QLi8UKjRguEwy4KI1k9KriBg4lyb2j1fydUC356OtfUZVfh74E4yWHGepLEAQu5

Name:              corey1-mirror1_mimage_0
State:             ACTIVE
Tables present:    LIVE
Open count:        1
Event number:      0
Major, minor:      253, 7
Number of targets: 1
UUID: LVM-0QLi8UKjRguEwy4KI1k9KriBg4lyb2j1YjGdtpLiSDkmhud2mVjyNgZveaSecV87

Name:              corey2-mirror2_mimage_1
State:             ACTIVE
Tables present:    LIVE
Open count:        1
Event number:      0
Major, minor:      253, 11
Number of targets: 1
UUID: LVM-caAVDd2dPhYq3rRmtemqUvgAt90oxO1vpnsQHYANmpEyo0yXJQwqe62JRBXRDtT9

Name:              corey2-mirror2_mimage_0
State:             ACTIVE
Tables present:    LIVE
Open count:        1
Event number:      0
Major, minor:      253, 10
Number of targets: 1
UUID: LVM-caAVDd2dPhYq3rRmtemqUvgAt90oxO1vczGS49XAjOu44xh1tr4PQr7WiR5ErZl3

Name:              corey2-mirror2
State:             ACTIVE
Tables present:    LIVE
Open count:        1
Event number:      1
Major, minor:      253, 12
Number of targets: 1
UUID: LVM-caAVDd2dPhYq3rRmtemqUvgAt90oxO1vC3HFk9CLB9ELCxhSZBOf1lkkE11Rdz25

Name:              corey3-mirror3_mimage_2
State:             ACTIVE
Tables present:    LIVE
Open count:        1
Event number:      0
Major, minor:      253, 4
Number of targets: 1
UUID: LVM-vCWvB54l0xEiiQuUH1anc8UJwd1KeH2R2yZvBBJDzam6k5CuJbD1uAfD3FTZvNwi

Name:              corey3-mirror3_mimage_1
State:             ACTIVE
Tables present:    LIVE
Open count:        1
Event number:      0
Major, minor:      253, 3
Number of targets: 1
UUID: LVM-vCWvB54l0xEiiQuUH1anc8UJwd1KeH2RJMiGbaBAD0GCnWpSCxIZ7ot8if1bjIBE

Name:              VolGroup00-LogVol01
State:             ACTIVE
Tables present:    LIVE
Open count:        1
Event number:      0
Major, minor:      253, 1
Number of targets: 1
UUID: LVM-TCx5xJ7FuRhXzJ4g7CvPsw2AhhFBNLQUvNlDc7SvClgdBMh2WD6TraPFgzjSVMRp

Name:              corey3-mirror3
State:             ACTIVE
Tables present:    LIVE
Open count:        1
Event number:      7
Major, minor:      253, 5
Number of targets: 1
UUID: LVM-vCWvB54l0xEiiQuUH1anc8UJwd1KeH2RrVxFHPheVHLeYGToOyHqciYLB1QRho7o

Name:              VolGroup00-LogVol00
State:             ACTIVE
Tables present:    LIVE
Open count:        1
Event number:      0
Major, minor:      253, 0
Number of targets: 1
UUID: LVM-TCx5xJ7FuRhXzJ4g7CvPsw2AhhFBNLQUrfFjpdEgWnBeJx2UpyC5Mr6XpgXdHWCh

Name:              corey1-mirror1_mlog
State:             ACTIVE
Tables present:    LIVE
Open count:        1
Event number:      0
Major, minor:      253, 6
Number of targets: 1
UUID: LVM-0QLi8UKjRguEwy4KI1k9KriBg4lyb2j1hCJh1EVAOxhYcsIkh5Fwap1yOkxFeidY


Version-Release number of selected component (if applicable):
2.6.9-55.ELlargesmp
dlm-kernel-2.6.9-46.16
cmirror-kernel-2.6.9-32.0
Comment 1 Corey Marthaler 2007-06-06 18:04:38 EDT
More info...

[root@link-02 ~]# cman_tool nodes
Node  Votes Exp Sts  Name
   1    1    4   M   link-08
   2    1    4   M   link-02
   3    1    4   X   link-07
   4    1    4   M   link-04
[root@link-02 ~]# cman_tool services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           5   2 run       -
[2 1 4]

DLM Lock Space:  "clvmd"                            74  13 run       -
[2 1 4]

DLM Lock Space:  "clustered_log"                    75  14 run       -
[2 1 4]

DLM Lock Space:  "1"                                77  15 run       -
[2 1 4]

DLM Lock Space:  "2"                                81  17 run       -
[2 1 4]

DLM Lock Space:  "3"                                85  19 run       -
[2 1 4]

GFS Mount Group: "1"                                79  16 run       -
[2 1 4]

GFS Mount Group: "2"                                83  18 run       -
[2 1 4]

GFS Mount Group: "3"                                87  20 recover 4 -
[2 1 4]


[root@link-04 ~]# cman_tool nodes
Node  Votes Exp Sts  Name
   1    1    4   M   link-08
   2    1    4   M   link-02
   3    1    4   X   link-07
   4    1    4   M   link-04
[root@link-04 ~]# cman_tool services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           5   2 run       -
[2 1 4]

DLM Lock Space:  "clvmd"                            74   4 run       -
[2 1 4]

DLM Lock Space:  "clustered_log"                    75   5 run       -
[2 1 4]

DLM Lock Space:  "1"                                77   6 run       -
[2 1 4]

DLM Lock Space:  "2"                                81   8 run       -
[2 1 4]

DLM Lock Space:  "3"                                85  10 run       -
[2 1 4]

GFS Mount Group: "1"                                79   7 run       -
[2 1 4]

GFS Mount Group: "2"                                83   9 run       -
[2 1 4]

GFS Mount Group: "3"                                87  11 recover 4 -
[2 1 4]



[root@link-08 ~]# cman_tool nodes
Node  Votes Exp Sts  Name
   1    1    4   M   link-08
   2    1    4   M   link-02
   3    1    4   X   link-07
   4    1    4   M   link-04
[root@link-08 ~]# cman_tool services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           5   2 run       -
[2 1 4]

DLM Lock Space:  "clvmd"                            74   4 run       -
[2 1 4]

DLM Lock Space:  "clustered_log"                    75   5 run       -
[2 1 4]

DLM Lock Space:  "1"                                77   6 run       -
[2 1 4]

DLM Lock Space:  "3"                                85   8 run       S-10,200,0
[2 1 4]

DLM Lock Space:  "2"                                81  10 run       -
[2 1 4]

GFS Mount Group: "1"                                79   7 run       -
[2 1 4]

GFS Mount Group: "3"                                87   9 recover 2 -
[2 1 4]

GFS Mount Group: "2"                                83  11 run       -
[2 1 4]









[root@link-02 ~]# cat /proc/cluster/sm_debug
 3
0200004f recover state 2
0200004f cb recover state 2
02000057 recover state 4
02000053 recover state 4
0200004f recover state 3
02000057 recover state 4
02000053 recover state 4
0200004f recover state 5
02000057 recover state 4
02000053 recover state 5

[root@link-04 ~]# cat /proc/cluster/sm_debug
 3
0200004f recover state 2
0200004f cb recover state 2
02000057 recover state 4
02000053 recover state 4
0200004f recover state 3
02000057 recover state 4
02000053 recover state 4
0200004f recover state 5
02000057 recover state 4
02000053 recover state 5

[root@link-08 ~]# cat /proc/cluster/sm_debug
 2
02000057 recover state 2
0200004f recover state 3
02000053 recover state 2
02000057 recover state 2
0200004f recover state 5
02000053 cb recover state 2
02000053 recover state 3
02000057 recover state 2
02000053 recover state 5
02000057 recover state 2


[root@link-02 ~]# cat /proc/cluster/dlm_debug
s
clvmd update remastered resources
3 updated 1 resources
3 rebuild locks
2 updated 2 resources
2 rebuild locks
1 updated 3 resources
1 rebuild locks
clvmd updated 3 resources
clvmd rebuild locks
3 rebuilt 1 locks
3 recover event 114 done
2 rebuilt 2 locks
2 recover event 114 done
1 rebuilt 3 locks
1 recover event 114 done
clvmd rebuilt 3 locks
clvmd recover event 114 done
3 move flags 0,0,1 ids 112,114,114
3 process held requests
3 processed 0 requests
3 resend marked requests
3 resent 0 requests
3 recover event 114 finished
1 move flags 0,0,1 ids 108,114,114
1 process held requests
1 processed 0 requests
1 resend marked requests
1 resent 0 requests
1 recover event 114 finished
2 move flags 0,0,1 ids 110,114,114
2 process held requests
2 processed 0 requests
2 resend marked requests
2 resent 0 requests
2 recover event 114 finished
clvmd move flags 0,0,1 ids 106,114,114
clvmd process held requests
clvmd processed 0 requests
clvmd resend marked requests
clvmd resent 0 requests
clvmd recover event 114 finished


[root@link-04 ~]# cat /proc/cluster/dlm_debug
s
1 marked 0 requests
1 purge locks of departed nodes
clvmd mark waiting requests
clvmd marked 0 requests
clvmd purge locks of departed nodes
clvmd purged 0 locks
clvmd update remastered resources
2 purged 1 locks
2 update remastered resources
1 purged 2 locks
1 update remastered resources
1 updated 3 resources
1 rebuild locks
2 updated 2 resources
2 rebuild locks
clvmd updated 3 resources
clvmd rebuild locks
1 rebuilt 3 locks
1 recover event 12 done
2 rebuilt 2 locks
2 recover event 12 done
clvmd rebuilt 3 locks
clvmd recover event 12 done
1 move flags 0,0,1 ids 6,12,12
1 process held requests
1 processed 0 requests
1 resend marked requests
1 resent 0 requests
1 recover event 12 finished
2 move flags 0,0,1 ids 8,12,12
2 process held requests
2 processed 0 requests
2 resend marked requests
2 resent 0 requests
2 recover event 12 finished
clvmd move flags 0,0,1 ids 4,12,12
clvmd process held requests
clvmd processed 0 requests
clvmd resend marked requests
clvmd resent 0 requests
clvmd recover event 12 finished


[root@link-08 ~]# cat /proc/cluster/dlm_debug
t 0 locks
3 recover event 24 done
2 purged 1 locks
2 update remastered resources
clvmd purged 0 locks
clvmd update remastered resources
2 updated 2 resources
2 rebuild locks
2 rebuilt 0 locks
2 recover event 24 done
clvmd updated 3 resources
clvmd rebuild locks
1 updated 3 resources
1 rebuild locks
clvmd rebuilt 0 locks
1 rebuilt 0 locks
clvmd recover event 24 done
1 recover event 24 done
3 move flags 0,0,1 ids 22,24,24
3 process held requests
3 processed 0 requests
3 resend marked requests
3 resent 0 requests
3 recover event 24 finished
1 move flags 0,0,1 ids 18,24,24
1 process held requests
1 processed 0 requests
1 resend marked requests
1 resent 0 requests
1 recover event 24 finished
2 move flags 0,0,1 ids 20,24,24
2 process held requests
2 processed 0 requests
2 resend marked requests
2 resent 0 requests
2 recover event 24 finished
clvmd move flags 0,0,1 ids 16,24,24
clvmd process held requests
clvmd processed 0 requests
clvmd resend marked requests
clvmd resent 0 requests
clvmd recover event 24 finished

Comment 2 Corey Marthaler 2007-06-06 18:06:48 EDT
[root@link-02 ~]# ps ax -o pid,stat,cmd,wchan
  PID STAT CMD              WCHAN
    1 S    init [3]         -
    2 S    [migration/0]    migration_thread
    3 SN   [ksoftirqd/0]    ksoftirqd
    4 S    [migration/1]    migration_thread
    5 SN   [ksoftirqd/1]    ksoftirqd
    6 S<   [events/0]       worker_thread
    7 S<   [events/1]       worker_thread
    8 S<   [khelper]        worker_thread
    9 S<   [kacpid]         worker_thread
   37 S<   [kblockd/0]      worker_thread
   38 S<   [kblockd/1]      worker_thread
   39 S    [khubd]          hub_thread
   62 S    [pdflush]        pdflush
   63 D    [pdflush]        wait_on_buffer
   64 S    [kswapd0]        kswapd
   65 S<   [aio/0]          worker_thread
   66 S<   [aio/1]          worker_thread
  210 S    [kseriod]        serio_thread
  445 S    [scsi_eh_0]      16045567552327254017
  446 S<   [qla2300_0_dpc]  16045567552327254017
  506 S    [kjournald]      kjournald
 1896 S<   [kmirrord]       worker_thread
 1993 S<s  udevd            -
 2099 S<   [kedac]          -
 2226 S<   [kauditd]        kauditd_thread
 2338 S    [kjournald]      kjournald
 2918 Ss   /sbin/dhclient - -
 2961 Ss   syslogd -m 0     -
 2965 Ss   klogd -x         syslog
 2978 Ss   irqbalance       -
 2989 Ss   portmap          -
 3008 Ss   rpc.statd        -
 3036 Ss   rpc.idmapd       -
 3121 S    /usr/sbin/smartd -
 3130 Ss   /usr/sbin/acpid  -
 3218 Ss   /usr/sbin/sshd   -
 3237 Ss   xinetd -stayaliv -
 3255 Ss   sendmail: accept -
 3264 Ss   sendmail: Queue  pause
 3318 Ss   gpm -m /dev/inpu -
 3469 Ss   crond            -
 3490 Ss   xfs -droppriv -d -
 3507 Ss   /usr/sbin/atd    -
 3516 Ss   dbus-daemon-1 -- -
 3527 Ss   hald             -
 3542 S<sl modclusterd      -
 3617 Ss   /usr/sbin/oddjob -
 3653 Ss   /usr/sbin/saslau fcntl_setlk
 3656 S    /usr/sbin/saslau -
 3657 S    /usr/sbin/saslau fcntl_setlk
 3658 S    /usr/sbin/saslau fcntl_setlk
 3659 S    /usr/sbin/saslau fcntl_setlk
 3680 S<s  ricci -u 101     -
 3685 Ss   login -- root    wait
 3686 Ss+  /sbin/mingetty t -
 3687 Ss+  /sbin/mingetty t -
 3688 Ss+  /sbin/mingetty t -
 3689 Ss+  /sbin/mingetty t -
 3691 Ss+  /sbin/mingetty t -
 3692 Ss+  /sbin/mingetty t -
 4238 R+   ps ax -o pid,sta -
 4509 Ss   -bash            wait
 4616 Ssl  ccsd             -
 4665 S    [cman_comms]     cluster_kthread
 4666 S    [cman_memb]      membership_kthread
 4667 S<   [cman_serviced]  serviced
 4668 S    [cman_hbeat]     hello_kthread
 4687 Ss   fenced -t 120 -w rt_sigsuspend
10514 Ss   cupsd            -
11595 Rs   sshd: root@notty -
27225 S<   [dlm_astd]       dlm_astd
27226 S<   [dlm_recvd]      dlm_recvd
27227 S<   [dlm_sendd]      dlm_sendd
27723 Ssl  clvmd -T20 -t 90 -
27724 S<   [dlm_recoverd]   dlm_recoverd
28467 S    [cluster_log_ser -
28512 S<   [kmirrord]       worker_thread
28513 S<   [kcopyd]         worker_thread
30090 S<Lsl [dmeventd]      -
30111 S<   [dlm_recoverd]   dlm_recoverd
30112 S<   [lock_dlm1]      dlm_async
30113 S<   [lock_dlm2]      dlm_async
30114 S    [gfs_scand]      -
30115 S    [gfs_glockd]     gfs_glockd
30121 S    [gfs_recoverd]   -
30122 S    [gfs_logd]       -
30123 S    [gfs_quotad]     -
30124 S    [gfs_inoded]     -
30128 S<   [dlm_recoverd]   dlm_recoverd
30138 S<   [lock_dlm1]      dlm_async
30139 S<   [lock_dlm2]      dlm_async
30140 S    [gfs_scand]      -
30141 S    [gfs_glockd]     gfs_glockd
30142 S    [gfs_recoverd]   -
30148 S    [gfs_logd]       -
30149 S    [gfs_quotad]     -
30150 S    [gfs_inoded]     -
30154 S<   [dlm_recoverd]   dlm_recoverd
30164 S<   [lock_dlm1]      dlm_async
30165 S<   [lock_dlm2]      dlm_async
30166 S    [gfs_scand]      -
30167 S    [gfs_glockd]     gfs_glockd
30168 S    [gfs_recoverd]   -
30174 S    [gfs_logd]       -
30175 S    [gfs_quotad]     -
30176 S    [gfs_inoded]     -
30242 S    xiogen -f buffer pipe_wait
30243 S    xdoio -vD        pipe_wait
30244 S    xiogen -f buffer pipe_wait
30245 S    xdoio -vD        pipe_wait
30246 S    xiogen -f buffer pipe_wait
30248 S    xdoio -vD        pipe_wait
30252 D    xdoio -vD        -
30253 D    xdoio -vD        glock_wait_internal
30254 R    xdoio -vD        -
30534 S+   tail -f /var/log -
30672 Ss   sshd: root@pts/0 -
30679 Ss   -bash            wait
31346 S<   [kmirrord]       worker_thread
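(Editor's note, not part of the original report: the interesting rows in dumps like the one above are the D-state, uninterruptible-sleep, processes, e.g. `xdoio` stuck in `glock_wait_internal`. A throwaway filter, assuming the `pid,stat,cmd,wchan` column order used here; `dstate` is a name invented for this sketch.)

```shell
# From "ps ax -o pid,stat,cmd,wchan" output, print the PID and wait
# channel of every process in uninterruptible (D) sleep.  CMD may
# contain spaces, so WCHAN is taken as the last field.
dstate() {
    awk 'NR > 1 && $2 ~ /^D/ { print $1, $NF }'
}
```

Usage: `ps ax -o pid,stat,cmd,wchan | dstate`.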




[root@link-04 ~]# ps ax -o pid,stat,cmd,wchan
  PID STAT CMD              WCHAN
    1 S    init [3]         -
    2 S    [migration/0]    migration_thread
    3 SN   [ksoftirqd/0]    ksoftirqd
    4 S<   [events/0]       worker_thread
    5 S<   [khelper]        worker_thread
    6 S<   [kacpid]         worker_thread
   30 S<   [kblockd/0]      worker_thread
   31 S    [khubd]          hub_thread
   52 S    [pdflush]        pdflush
   53 D    [pdflush]        wait_on_buffer
   54 S    [kswapd0]        kswapd
   55 S<   [aio/0]          worker_thread
  199 S    [kseriod]        serio_thread
  429 S    [scsi_eh_0]      16045567552327254017
  454 S    [kjournald]      kjournald
 1530 S<s  udevd            -
 1595 S<   [kedac]          -
 1717 S<   [kauditd]        kauditd_thread
 1822 S    [kjournald]      kjournald
 2383 Ss   /sbin/dhclient - -
 2426 Ss   syslogd -m 0     -
 2430 Ss   klogd -x         syslog
 2450 Ss   portmap          -
 2469 Ss   rpc.statd        -
 2496 Ss   rpc.idmapd       -
 2572 S    /usr/sbin/smartd -
 2581 Ss   /usr/sbin/acpid  -
 2590 Ss   cupsd            -
 2648 Ss   /usr/sbin/sshd   -
 2661 Ss   xinetd -stayaliv -
 2679 Ss   sendmail: accept -
 2689 Ss   sendmail: Queue  pause
 2736 Ss   gpm -m /dev/inpu -
 2883 Ss   crond            -
 2904 Ss   xfs -droppriv -d -
 2921 Ss   /usr/sbin/atd    -
 2930 Ss   dbus-daemon-1 -- -
 2941 Ss   hald             -
 2956 S<sl modclusterd      -
 3051 Ss   /usr/sbin/oddjob -
 3087 Ss   /usr/sbin/saslau fcntl_setlk
 3088 S    /usr/sbin/saslau -
 3089 S    /usr/sbin/saslau fcntl_setlk
 3090 S    /usr/sbin/saslau fcntl_setlk
 3091 S    /usr/sbin/saslau fcntl_setlk
 3100 S<s  ricci -u 101     -
 3105 Ss   login -- root    wait
 3106 Ss+  /sbin/mingetty t -
 3107 Ss+  /sbin/mingetty t -
 3108 Ss+  /sbin/mingetty t -
 3109 Ss+  /sbin/mingetty t -
 3110 Ss+  /sbin/mingetty t -
 3111 Ss+  /sbin/mingetty t -
 4744 Ss   sshd: root@notty -
 4801 Ssl  ccsd             -
 4854 S    [cman_comms]     cluster_kthread
 4855 S    [cman_memb]      membership_kthread
 4856 S<   [cman_serviced]  serviced
 4863 S    [cman_hbeat]     hello_kthread
 4878 Ss   -bash            wait
 4914 S+   tail -f /var/log -
 4934 Ss   fenced -t 120 -w rt_sigsuspend
 6402 Ss   sshd: root@pts/1 -
 6404 Ss   -bash            wait
 6592 S    [scsi_eh_1]      16045567552327254017
 7192 Ssl  clvmd -T20 -t 90 -
 7193 S<   [dlm_astd]       dlm_astd
 7194 S<   [dlm_recvd]      dlm_recvd
 7195 S<   [dlm_sendd]      dlm_sendd
 7196 S<   [dlm_recoverd]   dlm_recoverd
 7274 S    [cluster_log_ser -
 7284 S<   [kcopyd]         worker_thread
 7286 S<Lsl [dmeventd]      -
 7311 S<   [kmirrord]       worker_thread
 7364 S<   [kmirrord]       worker_thread
 7520 S<   [dlm_recoverd]   dlm_recoverd
 7521 S<   [lock_dlm1]      dlm_async
 7522 S<   [lock_dlm2]      dlm_async
 7523 S    [gfs_scand]      -
 7524 S    [gfs_glockd]     gfs_glockd
 7525 S    [gfs_recoverd]   -
 7526 S    [gfs_logd]       -
 7527 S    [gfs_quotad]     -
 7528 S    [gfs_inoded]     -
 7532 S<   [dlm_recoverd]   dlm_recoverd
 7538 S<   [lock_dlm1]      dlm_async
 7539 S<   [lock_dlm2]      dlm_async
 7540 S    [gfs_scand]      -
 7541 S    [gfs_glockd]     gfs_glockd
 7551 S    [gfs_recoverd]   -
 7552 S    [gfs_logd]       -
 7553 S    [gfs_quotad]     -
 7554 S    [gfs_inoded]     -
 7558 S<   [dlm_recoverd]   dlm_recoverd
 7559 S<   [lock_dlm1]      dlm_async
 7560 S<   [lock_dlm2]      dlm_async
 7561 S    [gfs_scand]      -
 7562 S    [gfs_glockd]     gfs_glockd
 7572 S    [gfs_recoverd]   -
 7573 S    [gfs_logd]       -
 7574 S    [gfs_quotad]     -
 7575 S    [gfs_inoded]     -
 7632 S    xiogen -f buffer pipe_wait
 7633 S    xdoio -vD        pipe_wait
 7634 S    xiogen -f buffer pipe_wait
 7635 S    xdoio -vD        pipe_wait
 7636 S    xiogen -f buffer pipe_wait
 7637 S    xdoio -vD        pipe_wait
 7642 R    xdoio -vD        -
 7643 R    xdoio -vD        -
 7644 D    xdoio -vD        glock_wait_internal
 7708 S<   [kmirrord]       worker_thread
10021 R+   ps ax -o pid,sta -




[root@link-08 ~]# ps ax -o pid,stat,cmd,wchan
  PID STAT CMD              WCHAN
    1 S    init [3]         -
    2 S    [migration/0]    migration_thread
    3 SN   [ksoftirqd/0]    ksoftirqd
    4 S    [migration/1]    migration_thread
    5 SN   [ksoftirqd/1]    ksoftirqd
    6 S<   [events/0]       worker_thread
    7 S<   [events/1]       worker_thread
    8 S<   [khelper]        worker_thread
    9 S<   [kacpid]         worker_thread
   38 S<   [kblockd/0]      worker_thread
   39 S<   [kblockd/1]      worker_thread
   40 S    [khubd]          hub_thread
   63 S    [pdflush]        pdflush
   64 D    [pdflush]        wait_on_buffer
   65 S    [kswapd1]        kswapd
   66 S    [kswapd0]        kswapd
   67 S<   [aio/0]          worker_thread
   68 S<   [aio/1]          worker_thread
  212 S    [kseriod]        serio_thread
  447 S    [scsi_eh_0]      16045567552327254017
  449 S    [scsi_eh_1]      16045567552327254017
  466 S    [scsi_eh_2]      16045567552327254017
  467 S<   [qla2300_2_dpc]  16045567552327254017
  527 S    [kjournald]      kjournald
 2036 S<s  udevd            -
 2145 S<   [kedac]          -
 2272 S<   [kauditd]        kauditd_thread
 2381 S    [kjournald]      kjournald
 2975 Ss   syslogd -m 0     -
 2979 Ss   klogd -x         syslog
 2992 Ss   irqbalance       -
 3003 Ss   portmap          -
 3022 Ss   rpc.statd        -
 3050 Ss   rpc.idmapd       -
 3129 S    /usr/sbin/smartd -
 3138 Ss   /usr/sbin/acpid  -
 3147 Ss   cupsd            -
 3210 Ss   /usr/sbin/sshd   -
 3223 Ss   xinetd -stayaliv -
 3241 Ss   sendmail: accept -
 3250 Ss   sendmail: Queue  pause
 3298 Ss   gpm -m /dev/inpu -
 3435 Ss   crond            -
 3456 Ss   xfs -droppriv -d -
 3473 Ss   /usr/sbin/atd    -
 3482 Ss   dbus-daemon-1 -- -
 3493 Ss   hald             -
 3509 S<sl modclusterd      -
 3576 Ss   /usr/sbin/oddjob -
 3598 Ss   /usr/sbin/saslau fcntl_setlk
 3599 S    /usr/sbin/saslau -
 3600 S    /usr/sbin/saslau fcntl_setlk
 3601 S    /usr/sbin/saslau fcntl_setlk
 3602 S    /usr/sbin/saslau fcntl_setlk
 3624 S<s  ricci -u 101     -
 3629 Ss   login -- root    wait
 3630 Ss+  /sbin/mingetty t -
 3631 Ss+  /sbin/mingetty t -
 3632 Ss+  /sbin/mingetty t -
 3633 Ss+  /sbin/mingetty t -
 3634 Ss+  /sbin/mingetty t -
 3635 Ss+  /sbin/mingetty t -
 4624 Ss   -bash            wait
 5492 Ss   /sbin/dhclient - -
 5552 Ssl  ccsd             -
 5643 S    [cman_comms]     cluster_kthread
 5644 S    [cman_memb]      membership_kthread
 5645 S<   [cman_serviced]  serviced
 5646 S    [cman_hbeat]     hello_kthread
 5665 Ss   fenced -t 120 -w rt_sigsuspend
 5774 S<Lsl [dmeventd]      -
 5977 Ssl  clvmd -T20 -t 90 -
 5978 S<   [dlm_astd]       dlm_astd
 5979 S<   [dlm_recvd]      dlm_recvd
 5980 S<   [dlm_sendd]      dlm_sendd
 5981 S<   [dlm_recoverd]   dlm_recoverd
 6024 S    [cluster_log_ser -
 6069 S<   [kcopyd]         worker_thread
 6089 S<   [kmirrord]       worker_thread
 6260 Ss   sshd: root@pts/0 -
 6262 Ss   -bash            -
 6333 S<   [dlm_recoverd]   dlm_recoverd
 6339 S<   [lock_dlm1]      dlm_async
 6340 S<   [lock_dlm2]      dlm_async
 6341 S    [gfs_scand]      -
 6342 S    [gfs_glockd]     gfs_glockd
 6343 S    [gfs_recoverd]   -
 6344 D    [gfs_logd]       -
 6345 S    [gfs_quotad]     -
 6346 S    [gfs_inoded]     -
 6351 S<   [dlm_recoverd]   dlm_recoverd
 6352 D<   [lock_dlm1]      kcl_leave_service
 6353 S<   [lock_dlm2]      dlm_async
 6354 S    [gfs_scand]      -
 6355 S    [gfs_glockd]     gfs_glockd
 6365 D    [gfs_recoverd]   glock_wait_internal
 6366 S    [gfs_logd]       -
 6367 S    [gfs_quotad]     -
 6368 S    [gfs_inoded]     -
 6431 Ss   sshd: root@pts/1 -
 6433 Ss   -bash            wait
 6496 S<   [kmirrord]       worker_thread
 6550 S<   [dlm_recoverd]   dlm_recoverd
 6551 S<   [lock_dlm1]      dlm_async
 6552 S<   [lock_dlm2]      dlm_async
 6553 S    [gfs_scand]      -
 6554 S    [gfs_glockd]     gfs_glockd
 6579 S+   tail -f /var/log -
 8746 S    [gfs_recoverd]   -
 8877 S    [gfs_logd]       -
 8878 S    [gfs_quotad]     -
 8879 S    [gfs_inoded]     -
 9376 S    xiogen -f buffer pipe_wait
 9377 S    xdoio -vD        pipe_wait
 9378 S    xiogen -f buffer pipe_wait
 9379 S    xdoio -vD        pipe_wait
 9381 S+   xiogen -f buffer pipe_wait
 9382 S+   xdoio -vD        pipe_wait
 9385 D+   xdoio -vD        glock_wait_internal
 9386 R    xdoio -vD        -
 9387 D    xdoio -vD        -
 9543 S<   [kmirrord]       worker_thread
11855 R+   ps ax -o pid,sta -

Comment 3 Corey Marthaler 2007-06-06 18:13:22 EDT
Looks like this may be the problem, on link-08:

Jun  6 16:31:21 link-08 lvm[5774]: Completed: vgreduce --config devices{ignore_suspended_devices=1} --removemissing corey3
Jun  6 16:31:21 link-08 lvm[5774]: corey3-mirror3 is now in-sync
Jun  6 16:31:22 link-08 kernel: GFS: fsid=LINK_128:1.1: jid=2: Replaying journal...
Jun  6 16:31:22 link-08 lvm[5774]: No longer monitoring mirror device corey3-mirror3 for events
Jun  6 16:31:23 link-08 kernel: GFS: fsid=LINK_128:1.1: jid=2: Replayed 2 of 2 blocks
Jun  6 16:31:23 link-08 kernel: GFS: fsid=LINK_128:1.1: jid=2: replays = 2, skips = 0, sames = 0
Jun  6 16:31:23 link-08 kernel: GFS: fsid=LINK_128:2.1: jid=2: Replaying journal...
Jun  6 16:31:23 link-08 kernel: GFS: fsid=LINK_128:2.1: jid=2: Replayed 2 of 2 blocks
Jun  6 16:31:23 link-08 kernel: GFS: fsid=LINK_128:2.1: jid=2: replays = 2, skips = 0, sames = 0
Jun  6 16:31:23 link-08 kernel: GFS: fsid=LINK_128:3.1: fatal: filesystem consistency error
Jun  6 16:31:23 link-08 kernel: GFS: fsid=LINK_128:3.1:   function = trans_go_xmote_bh
Jun  6 16:31:23 link-08 kernel: GFS: fsid=LINK_128:3.1:   file = /builddir/build/BUILD/gfs-kernel-2.6.9-72/largesmp/src/gfs/glops.c, line = 542
Jun  6 16:31:23 link-08 kernel: GFS: fsid=LINK_128:3.1:   time = 1181165483
Jun  6 16:31:23 link-08 kernel: GFS: fsid=LINK_128:3.1: about to withdraw from the cluster
Jun  6 16:31:23 link-08 kernel: GFS: fsid=LINK_128:3.1: waiting for outstanding I/O
Jun  6 16:31:23 link-08 kernel: GFS: fsid=LINK_128:3.1: telling LM to withdraw
Jun  6 16:31:24 link-08 kernel: GFS: fsid=LINK_128:1.1: jid=2: Journal replayed in 4s
Jun  6 16:31:24 link-08 kernel: GFS: fsid=LINK_128:1.1: jid=2: Done
Jun  6 16:31:24 link-08 kernel: GFS: fsid=LINK_128:2.1: jid=2: Journal replayed in 4s
Jun  6 16:31:24 link-08 kernel: GFS: fsid=LINK_128:2.1: jid=2: Done
Jun  6 16:32:34 link-08 kernel: cdrom: open failed.
Jun  6 16:33:35 link-08 kernel: cdrom: open failed.
Comment 4 Corey Marthaler 2007-06-12 10:43:17 EDT
Reproduced this issue without the "function = trans_go_xmote_bh" assertion.

[root@link-02 ~]# cman_tool nodes
Node  Votes Exp Sts  Name
   1    1    4   M   link-02
   2    1    4   X   link-08
   3    1    4   M   link-04
   4    1    4   M   link-07

[root@link-02 ~]# cman_tool services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           5   2 run       -
[1 4 3]

DLM Lock Space:  "clvmd"                            58  11 run       -
[1 3 4]

DLM Lock Space:  "clustered_log"                    59  12 run       -
[1 3 4]

DLM Lock Space:  "1"                                61  13 run       -
[1 3 4]

DLM Lock Space:  "2"                                65  15 run       -
[1 3 4]

DLM Lock Space:  "3"                                69  17 run       -
[1 3 4]

GFS Mount Group: "1"                                63  14 run       -
[1 3 4]

GFS Mount Group: "2"                                67  16 run       -
[1 3 4]

GFS Mount Group: "3"                                71  18 recover 4 -
[1 3 4]

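(Editorial note, not part of the original report: the `cman_tool services` output above is the key symptom here. Every lock space and mount group is in the `run` state except GFS Mount Group "3", which is stuck in `recover 4`. A stuck service can be spotted mechanically by scanning each service line for a state field other than `run`; a minimal sketch against sample lines copied from the output above:)

```shell
# Flag any cluster service whose state is not "run", using saved
# `cman_tool services` output. The sample text is taken from this report;
# in practice you would pipe live `cman_tool services` output instead.
services='Fence Domain:    "default"                           5   2 run       -
DLM Lock Space:  "clvmd"                            58  11 run       -
GFS Mount Group: "3"                                71  18 recover 4 -'

# Split on the double quotes: $2 is the service name, $3 holds GID, LID,
# and the state; print the name when the state field is not "run".
stuck=$(printf '%s\n' "$services" | awk -F'"' '/"/ { if ($3 !~ / run /) print $2 }')
echo "$stuck"
```

With the sample above this prints `3`, matching the mount group that never left recovery.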

[root@link-02 ~]# dmsetup info
Name:              corey1-mirror1_mimage_1
State:             ACTIVE
Tables present:    LIVE
Open count:        1
Event number:      0
Major, minor:      253, 3
Number of targets: 1
UUID: LVM-0QLi8UKjRguEwy4KI1k9KriBg4lyb2j11Q90OmNWZQ6Pox3i0Ng53hJV0Tw6oKpc

Name:              corey1-mirror1
State:             ACTIVE
Tables present:    LIVE
Open count:        1
Event number:      1
Major, minor:      253, 5
Number of targets: 1
UUID: LVM-0QLi8UKjRguEwy4KI1k9KriBg4lyb2j1TYbQrVgkVw59UKMAeI2N6MQ8XJohDhmq

Name:              corey1-mirror1_mimage_0
State:             ACTIVE
Tables present:    LIVE
Open count:        1
Event number:      0
Major, minor:      253, 2
Number of targets: 1
UUID: LVM-0QLi8UKjRguEwy4KI1k9KriBg4lyb2j1vvItnjlKtYjveNxMlJPa2LEWsyskBGD5

Name:              corey2-mirror2_mimage_1
State:             ACTIVE
Tables present:    LIVE
Open count:        1
Event number:      0
Major, minor:      253, 7
Number of targets: 1
UUID: LVM-caAVDd2dPhYq3rRmtemqUvgAt90oxO1vVcr1q3EQly4f8qd6A66B9dlTNkRh0V0T

Name:              corey2-mirror2_mimage_0
State:             ACTIVE
Tables present:    LIVE
Open count:        1
Event number:      0
Major, minor:      253, 6
Number of targets: 1
UUID: LVM-caAVDd2dPhYq3rRmtemqUvgAt90oxO1v2Ap1nemAZlcYodVA8c5oX6J5GYd8zuEs

Name:              corey2-mirror2
State:             ACTIVE
Tables present:    LIVE
Open count:        1
Event number:      1
Major, minor:      253, 8
Number of targets: 1
UUID: LVM-caAVDd2dPhYq3rRmtemqUvgAt90oxO1vc0yz9j3KbEPcj4crAfUQJ9Y0s97ihs1t

Name:              corey3-mirror3_mimage_2
State:             ACTIVE
Tables present:    LIVE
Open count:        2
Event number:      0
Major, minor:      253, 11
Number of targets: 1
UUID: LVM-vCWvB54l0xEiiQuUH1anc8UJwd1KeH2RAK1EBHNcPALYNfaYSHr2bY6AyKwHHnza

Name:              corey3-mirror3_mimage_1
State:             ACTIVE
Tables present:    LIVE
Open count:        2
Event number:      0
Major, minor:      253, 10
Number of targets: 1
UUID: LVM-vCWvB54l0xEiiQuUH1anc8UJwd1KeH2RT3l70pL9L2NATt0FXghyhR94GkwJo8W0

Name:              corey3-mirror3_mimage_0
State:             ACTIVE
Tables present:    LIVE
Open count:        1
Event number:      0
Major, minor:      253, 9
Number of targets: 1
UUID: LVM-vCWvB54l0xEiiQuUH1anc8UJwd1KeH2RmuLBJPJc9DB6v2gRUV9rU4gUY11kgrHQ

Name:              VolGroup00-LogVol01
State:             ACTIVE
Tables present:    LIVE
Open count:        1
Event number:      0
Major, minor:      253, 1
Number of targets: 1
UUID: LVM-dq1liKVsB8CzZiuNRtOF1tYTkXqgKO85b28r2eyrPqurvpuPPS83R8fcaG7QULHG

Name:              corey3-mirror3
State:             ACTIVE
Tables present:    LIVE & INACTIVE
Open count:        1
Event number:      5
Major, minor:      253, 12
Number of targets: 1
UUID: LVM-vCWvB54l0xEiiQuUH1anc8UJwd1KeH2RCC3uqHwUhyadUOOqFLGNuGfqmWpgP3OK

Name:              VolGroup00-LogVol00
State:             ACTIVE
Tables present:    LIVE
Open count:        1
Event number:      0
Major, minor:      253, 0
Number of targets: 1
UUID: LVM-dq1liKVsB8CzZiuNRtOF1tYTkXqgKO85W9BzgAV2x8i4cT3pKw0TArv1HJm54Tu6

Name:              corey1-mirror1_mimage_2
State:             ACTIVE
Tables present:    LIVE
Open count:        1
Event number:      0
Major, minor:      253, 4
Number of targets: 1
UUID: LVM-0QLi8UKjRguEwy4KI1k9KriBg4lyb2j1kWTC4gefpSn1T4NFkhOoSw4DoWdYP1iv
Comment 5 Jonathan Earl Brassow 2007-09-28 11:40:30 EDT
This could very well be a dup of bz 257241
Comment 6 RHEL Product and Program Management 2007-11-30 14:05:29 EST
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.
Comment 7 Jonathan Earl Brassow 2008-03-26 13:51:42 EDT
Making this a dup of 359341 (going to newer bug because it has better recreation
information)

*** This bug has been marked as a duplicate of 359341 ***
