Bug 429546

Summary: lock_dlm: plock device version mismatch: kernel (1.1.0), user (16777216.16777216.0)
Product: Red Hat Enterprise Linux 5
Reporter: Nate Straz <nstraz>
Component: cman
Assignee: David Teigland <teigland>
Status: CLOSED ERRATA
QA Contact: GFS Bugs <gfs-bugs>
Severity: urgent
Priority: urgent
Version: 5.1
CC: cluster-maint
Target Milestone: rc
Keywords: ZStream
Target Release: ---
Hardware: ppc64
OS: Linux
Fixed In Version: RHBA-2008-0347
Doc Type: Bug Fix
Last Closed: 2008-05-21 15:58:40 UTC
Bug Blocks: 429587, 437164    

Description Nate Straz 2008-01-21 14:36:54 UTC
Description of problem:

While running g2 on GFS, the first single-reader, multi-writer test case produced
the following log entries:

Jan 20 16:04:10 basic gfs_controld[28493]: receive_plock from 1 header 1 info 16777216
Jan 20 16:04:10 basic gfs_controld[28493]: plock result write err -1 errno 22
Jan 20 16:04:10 basic kernel: lock_dlm: plock device version mismatch: kernel (1.1.0), user (16777216.16777216.0)
Jan 20 16:04:10 basic gfs_controld[28493]: receive_plock from 1 header 1 info 16777216
Jan 20 16:04:10 basic gfs_controld[28493]: plock result write err -1 errno 22
Jan 20 16:04:10 basic kernel: lock_dlm: plock device version mismatch: kernel (1.1.0), user (16777216.16777216.0)
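
A note on the numbers in these messages: 16777216 is 0x01000000, which is the 32-bit value 1 with its byte order reversed. So the user-space version 16777216.16777216.0 reads as 1.1.0 once each field is byte-swapped. The short Python check below illustrates this. It is not part of the original report, and `swap32` is a hypothetical helper name.

```python
import struct


def swap32(value):
    """Return a 32-bit unsigned value with its byte order reversed."""
    # Pack as big-endian, then reinterpret the same four bytes as little-endian.
    return struct.unpack("<I", struct.pack(">I", value))[0]


# Version fields as printed in the lock_dlm mismatch message.
reported = (16777216, 16777216, 0)
decoded = tuple(swap32(field) for field in reported)
print(decoded)
```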

Both writers hung in D state with the following backtraces:

 xdoio         D 000000000ff0df40 12368 29572      1               29576 (NOTLB)
 Call Trace:
 [C000000002433590] [C000000002433620] 0xc000000002433620 (unreliable)
 [C000000002433760] [C0000000000106C0] .__switch_to+0x130/0x154
 [C0000000024337F0] [C00000000035B0C8] .schedule+0xae0/0xc70
 [C000000002433900] [D000000000A03138] .gdlm_plock+0x158/0x240 [lock_dlm]
 [C0000000024339E0] [D000000000ACD8AC] .gfs_lm_plock+0x4c/0x68 [gfs]
 [C000000002433A60] [D000000000AD81F0] .gfs_lock+0x144/0x168 [gfs]
 [C000000002433BB0] [C000000000109EEC] .fcntl_setlk+0x180/0x2fc
 [C000000002433CB0] [C000000000105454] .sys_fcntl+0x320/0x3f8
 [C000000002433D50] [C000000000129B90] .compat_sys_fcntl64+0x348/0x514
 [C000000002433E30] [C0000000000086A4] syscall_exit+0x0/0x40
 xdoio         D 000000000ff0df40 12368 29576      1         29572 28994 (NOTLB)
 Call Trace:
 [C0000000769D3590] [C0000000769D3620] 0xc0000000769d3620 (unreliable)
 [C0000000769D3760] [C0000000000106C0] .__switch_to+0x130/0x154
 [C0000000769D37F0] [C00000000035B0C8] .schedule+0xae0/0xc70
 [C0000000769D3900] [D000000000A03138] .gdlm_plock+0x158/0x240 [lock_dlm]
 [C0000000769D39E0] [D000000000ACD8AC] .gfs_lm_plock+0x4c/0x68 [gfs]
 [C0000000769D3A60] [D000000000AD81F0] .gfs_lock+0x144/0x168 [gfs]
 [C0000000769D3BB0] [C000000000109EEC] .fcntl_setlk+0x180/0x2fc
 [C0000000769D3CB0] [C000000000105454] .sys_fcntl+0x320/0x3f8
 [C0000000769D3D50] [C000000000129B90] .compat_sys_fcntl64+0x348/0x514
 [C0000000769D3E30] [C0000000000086A4] syscall_exit+0x0/0x40


Version-Release number of selected component (if applicable):
cman-2.0.73-1.el5_1.4.ppc
gfs-utils-0.1.12-1.el5.ppc
gfs2-utils-0.1.38-1.el5.ppc
kernel-2.6.18-53.el5.ppc64
kernel-2.6.18-53.1.6.el5.ppc64
kmod-gfs-0.1.19-7.el5_1.1.ppc64

How reproducible:
Hit on the first try. Will reset and try again.

Comment 1 Nate Straz 2008-01-21 15:06:41 UTC
Steps to Reproduce:

1. cd /mnt/gfs
2. export PATH=$PATH:/usr/tests/sts-rhel5.1/bin
3. xiogen -n -i 1 -f buffered -m sequential -s write -F 1000b:single_reader_mw_buffered_sequential | xdoio -vk

Comment 2 Nate Straz 2008-01-21 15:16:54 UTC
Some information requested by Dave after starting gfs_controld with -P.

[root@doral gfs_vs0]# group_tool dump gfs gfs_vs0
1200928471 listen 1
1200928471 cpg 4
1200928471 groupd 6
1200928471 uevent 7
1200928471 plocks 10
1200928471 setup done
1200928493 process_plocks: no mg id 40001
1200928493 process_plocks: no mg id 40001
1200928521 client 6: dump
[root@doral gfs_vs0]# group_tool
type             level name     id       state       
fence            0     default  00010001 none        
[1 2 3 4]
dlm              1     clvmd    00030001 none        
[1 2 3 4]
dlm              1     gfs_vs0  00050001 none        
[1 2 3 4]
gfs              2     gfs_vs0  00040001 none        
[1 2 3 4]


Comment 3 David Teigland 2008-01-21 19:59:47 UTC
This is an alignment problem.  Things work if we do the byte-swapping on the
original structure and then copy it into the final buffer, instead of copying
first and then trying to do the byte-swapping at an offset within the send buffer.
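
The ordering described above (convert the standalone structure first, then copy the converted bytes into the send buffer) can be sketched roughly as follows. This is a Python illustration only, not the committed fix, which is in C. `HEADER_LEN`, `INFO_FMT`, `build_message`, and `parse_version` are hypothetical names, and the three-field layout and little-endian wire order are assumptions made for the example.

```python
import struct

HEADER_LEN = 20     # hypothetical header size; the real header layout is not shown here
INFO_FMT = "<3I"    # assumption: three 32-bit fields in little-endian wire order


def build_message(header, version):
    """Byte-swap the standalone structure first, then copy it into the buffer."""
    # Step 1: produce the wire representation from the original values.
    info_wire = struct.pack(INFO_FMT, *version)
    # Step 2: copy the already-converted bytes to their offset in the send buffer.
    buf = bytearray(HEADER_LEN + len(info_wire))
    buf[:len(header)] = header
    buf[HEADER_LEN:] = info_wire
    return bytes(buf)


def parse_version(message):
    """Receiver side: decode the fields at the same offset and wire order."""
    return struct.unpack_from(INFO_FMT, message, HEADER_LEN)


msg = build_message(b"\x00" * HEADER_LEN, (1, 1, 0))
print(parse_version(msg))
```

With this ordering, no multi-byte field is ever converted in place at an offset inside the send buffer; the buffer only ever receives bytes that are already in wire order.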


Comment 4 David Teigland 2008-01-21 20:32:26 UTC
Fix committed to the HEAD, RHEL5, and RHEL51 branches.

Comment 7 Rob Kenna 2008-03-12 18:05:48 UTC
Marking as 5.1z fix for ppc support.

Comment 10 errata-xmlrpc 2008-05-21 15:58:40 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2008-0347.html