Bug 479154

Summary: RHEL5 cmirror tracker: data corruption detected after multiple cmirror/gfs mount ops
Product: Red Hat Enterprise Linux 5
Reporter: Corey Marthaler <cmarthal>
Component: cmirror
Assignee: Jonathan Earl Brassow <jbrassow>
Status: CLOSED INSUFFICIENT_DATA
QA Contact: Cluster QE <mspqa-list>
Severity: high
Priority: medium
Version: 5.4
CC: agk, ccaulfie, dwysocha, edamato, heinzm, mbroz, syeghiay
Target Milestone: rc
Target Release: ---
Hardware: All
OS: Linux
Doc Type: Bug Fix
Last Closed: 2009-03-02 17:12:25 UTC

Description Corey Marthaler 2009-01-07 16:33:52 UTC
Description of problem:
I was running mount_stress on a cmirror/gfs filesystem on the taft cluster. After stopping the test, I saw taft-02 assert when attempting to umount this filesystem. The last operation run was a mount with a different 'num_glockd' option specified.

This issue may be related to one of the other cmirror corruption issues, but in this case no recovery or relocation was involved.
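For reference, the test loop looks roughly like the following. This is only a hand-written sketch of what mount_stress iterates through, not the actual test script: the node/option selection, the ssh access, and any options beyond the dirsync and num_glockd=21 seen in the output below are assumptions.

#!/bin/bash
# Rough reproducer sketch (not the real mount_stress script).
# Assumes passwordless ssh to the nodes and an existing gfs filesystem
# on /dev/mapper/taft-mirror.
NODES="taft-01-bond taft-02-bond taft-03-bond taft-04-bond"
DEV=/dev/mapper/taft-mirror
MNT=/mnt/mirror
OPTS=(dirsync num_glockd=21 num_glockd=1 noatime)   # option list is a guess

itr=0
while true; do
    itr=$((itr + 1))
    echo "####### itr=$itr ($(date)) #######"

    # unmount the filesystem on every node where it happens to be mounted
    for n in $NODES; do
        ssh "$n" "umount $MNT" 2>/dev/null && echo "unmounted on $n"
    done

    # remount on one node with one of the candidate options
    node=$(echo $NODES | tr ' ' '\n' | shuf -n 1)
    opt=${OPTS[$((RANDOM % ${#OPTS[@]}))]}
    echo "attempting mount on $node: -o $opt"
    ssh "$node" "mount -t gfs -o $opt $DEV $MNT" || break
done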

####### itr=1110 (Tue Jan  6 15:35:11 CST 2009) #######
unmounting on taft-01-bond.../mnt/mirror...
unmounting on taft-03-bond...

attempting mount(s) on taft-03-bond:
mount -t gfs -o dirsync /dev/mapper/taft-mirror /mnt/mirror


####### itr=1111 (Tue Jan  6 15:35:13 CST 2009) #######
unmounting on taft-01-bond...
unmounting on taft-02-bond.../mnt/mirror...
unmounting on taft-04-bond.../mnt/mirror...
unmounting on taft-03-bond.../mnt/mirror...

attempting mount(s) on taft-02-bond:
mount -t gfs -o num_glockd=21 /dev/mapper/taft-mirror /mnt/mirror


####### itr=1112 (Tue Jan  6 15:35:21 CST 2009) #######

/dev/mapper/taft-mirror on /mnt/mirror type gfs (rw,hostdata=jid=0:id=16056322:first=1,num_glockd=21)

[root@taft-02]# umount /mnt/mirror

Trying to join cluster "lock_dlm", "TAFT:1167501108"
Joined cluster. Now mounting FS...
GFS: fsid=TAFT:1167501108.1: jid=1: Trying to acquire journal lock...
GFS: fsid=TAFT:1167501108.1: jid=1: Looking at journal...
GFS: fsid=TAFT:1167501108.1: jid=1: Done
Trying to join cluster "lock_dlm", "TAFT:1"
Joined cluster. Now mounting FS...
GFS: fsid=TAFT:1.1: jid=1: Trying to acquire journal lock...
GFS: fsid=TAFT:1.1: jid=1: Looking at journal...
GFS: fsid=TAFT:1.1: jid=1: Done
Trying to join cluster "lock_dlm", "TAFT:1"
Joined cluster. Now mounting FS...
GFS: fsid=TAFT:1.0: jid=0: Trying to acquire journal lock...
GFS: fsid=TAFT:1.0: jid=0: Looking at journal...
GFS: fsid=TAFT:1.0: jid=0: Done
GFS: fsid=TAFT:1.0: jid=1: Trying to acquire journal lock...
GFS: fsid=TAFT:1.0: jid=1: Looking at journal...
GFS: fsid=TAFT:1.0: jid=1: Done
GFS: fsid=TAFT:1.0: jid=2: Trying to acquire journal lock...
GFS: fsid=TAFT:1.0: jid=2: Looking at journal...
GFS: fsid=TAFT:1.0: jid=2: Done
GFS: fsid=TAFT:1.0: jid=3: Trying to acquire journal lock...
GFS: fsid=TAFT:1.0: jid=3: Looking at journal...
GFS: fsid=TAFT:1.0: jid=3: Done
GFS: fsid=TAFT:1.0: Scanning for log elements...
GFS: fsid=TAFT:1.0: Found 1 unlinked inodes
GFS: fsid=TAFT:1.0: Found quota changes for 0 IDs
GFS: fsid=TAFT:1.0: Done
GFS: fsid=TAFT:1.0: fatal: filesystem consistency error
GFS: fsid=TAFT:1.0:   RG = 17
GFS: fsid=TAFT:1.0:   function = gfs_setbit
GFS: fsid=TAFT:1.0:   file = /builddir/build/BUILD/gfs-kmod-0.1.31/_kmod_build_/src/gfs/bits.c, line = 81
GFS: fsid=TAFT:1.0:   time = 1231277734
GFS: fsid=TAFT:1.0: about to withdraw from the cluster
GFS: fsid=TAFT:1.0: telling LM to withdraw
GFS: fsid=TAFT:1.0: withdrawn

Call Trace:
 [<ffffffff8867afc0>] :gfs:gfs_lm_withdraw+0xc4/0xd3
 [<ffffffff80015008>] sync_buffer+0x0/0x3f
 [<ffffffff8005bc6a>] cache_alloc_refill+0x106/0x186
 [<ffffffff88668de0>] :gfs:gfs_dpin+0x11c/0x20c
 [<ffffffff88692ad0>] :gfs:gfs_consist_rgrpd_i+0x3c/0x41
 [<ffffffff8868e210>] :gfs:blkfree_internal+0x125/0x145
 [<ffffffff8868e369>] :gfs:gfs_metafree+0x2b/0xea
 [<ffffffff8009d909>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8866cca7>] :gfs:gfs_ea_dealloc+0x3f7/0x4b3
 [<ffffffff8009d909>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8009d909>] keventd_create_kthread+0x0/0xc4
 [<ffffffff88675661>] :gfs:gfs_inode_dealloc+0x203/0x53c
 [<ffffffff886671e8>] :gfs:gfs_inoded+0x0/0x44
 [<ffffffff8009d909>] keventd_create_kthread+0x0/0xc4
 [<ffffffff88692023>] :gfs:gfs_unlinked_dealloc+0x31/0xb3
 [<ffffffff886671e8>] :gfs:gfs_inoded+0x0/0x44
 [<ffffffff886671f6>] :gfs:gfs_inoded+0xe/0x44
 [<ffffffff80032360>] kthread+0xfe/0x132
 [<ffffffff8005dfb1>] child_rip+0xa/0x11
 [<ffffffff8009d909>] keventd_create_kthread+0x0/0xc4
 [<ffffffff80032262>] kthread+0x0/0x132
 [<ffffffff8005dfa7>] child_rip+0x0/0x11


I didn't see any device errors in any of the logs before this took place.


Version-Release number of selected component (if applicable):
2.6.18-128.el5
lvm2-2.02.40-6.el5    BUILT: Fri Oct 24 07:37:33 CDT 2008
lvm2-cluster-2.02.40-7.el5    BUILT: Wed Nov 26 07:19:19 CST 2008
device-mapper-1.02.28-2.el5    BUILT: Fri Sep 19 02:50:32 CDT 2008
cmirror-1.1.36-1.el5    BUILT: Tue Dec  9 16:38:13 CST 2008
kmod-cmirror-0.1.21-10.el5    BUILT: Wed Dec 17 15:18:59 CST 2008


I'll attempt to reproduce this and gather more info...

Comment 1 Jonathan Earl Brassow 2009-03-02 17:12:25 UTC
I'm closing this bug "INSUFFICIENT_DATA". I'll need either a reproducer or something that points to cmirror being the problem.