429546 – lock_dlm: plock device version mismatch: kernel (1.1.0), user (16777216.16777216.0)

Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 5
Classification:	Red Hat
Component:	cman
Version:	5.1
Hardware:	ppc64
OS:	Linux
Priority:	urgent
Severity:	urgent
Target Milestone:	rc
Target Release:	---
Assignee:	David Teigland
QA Contact:	GFS Bugs
Blocks:	Cluster5-ppc 437164

Reported:	2008-01-21 14:36 UTC by Nate Straz
Modified:	2009-04-16 22:19 UTC
CC List:	1 user
Fixed In Version:	RHBA-2008-0347
Doc Type:	Bug Fix
Last Closed:	2008-05-21 15:58:40 UTC

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2008:0347	0	normal	SHIPPED_LIVE	cman bug fix and enhancement update	2008-05-20 12:39:41 UTC

Description Nate Straz 2008-01-21 14:36:54 UTC

Description of problem:

While running g2 on GFS, the first single reader multi writer test case produced
the following log entries.

Jan 20 16:04:10 basic gfs_controld[28493]: receive_plock from 1 header 1 info 16777216
Jan 20 16:04:10 basic gfs_controld[28493]: plock result write err -1 errno 22
Jan 20 16:04:10 basic kernel: lock_dlm: plock device version mismatch: kernel (1.1.0), user (16777216.16777216.0)
Jan 20 16:04:10 basic gfs_controld[28493]: receive_plock from 1 header 1 info 16777216
Jan 20 16:04:10 basic gfs_controld[28493]: plock result write err -1 errno 22
Jan 20 16:04:10 basic kernel: lock_dlm: plock device version mismatch: kernel (1.1.0), user (16777216.16777216.0)
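
The user-side numbers in these messages are worth a second look: 16777216 is 0x01000000, which is the value 1 with its four bytes reversed, so the user-side "16777216.16777216.0" reads as a byte-swapped "1.1.0". The following is only a small sketch of that arithmetic. The names swap32 and check_logged_version are illustrative and are not taken from the lock_dlm or gfs_controld sources.

```c
#include <assert.h>
#include <stdint.h>

/* Reverse the byte order of a 32-bit value (illustrative helper). */
uint32_t swap32(uint32_t v)
{
    return ((v & 0x000000ffu) << 24) |
           ((v & 0x0000ff00u) << 8)  |
           ((v & 0x00ff0000u) >> 8)  |
           ((v & 0xff000000u) >> 24);
}

/* Check the values from the log: user (16777216.16777216.0)
 * against kernel (1.1.0). */
void check_logged_version(void)
{
    assert(swap32(16777216u) == 1u); /* major */
    assert(swap32(16777216u) == 1u); /* minor */
    assert(swap32(0u) == 0u);        /* patch is 0 in either byte order */
}
```

Calling check_logged_version() from a test program passes silently, which is consistent with the version fields reaching the kernel in the wrong byte order on this ppc64 node.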

Both writers hung in D state with the following backtraces:

 xdoio         D 000000000ff0df40 12368 29572      1               29576 (NOTLB)
 Call Trace:
 [C000000002433590] [C000000002433620] 0xc000000002433620 (unreliable)
 [C000000002433760] [C0000000000106C0] .__switch_to+0x130/0x154
 [C0000000024337F0] [C00000000035B0C8] .schedule+0xae0/0xc70
 [C000000002433900] [D000000000A03138] .gdlm_plock+0x158/0x240 [lock_dlm]
 [C0000000024339E0] [D000000000ACD8AC] .gfs_lm_plock+0x4c/0x68 [gfs]
 [C000000002433A60] [D000000000AD81F0] .gfs_lock+0x144/0x168 [gfs]
 [C000000002433BB0] [C000000000109EEC] .fcntl_setlk+0x180/0x2fc
 [C000000002433CB0] [C000000000105454] .sys_fcntl+0x320/0x3f8
 [C000000002433D50] [C000000000129B90] .compat_sys_fcntl64+0x348/0x514
 [C000000002433E30] [C0000000000086A4] syscall_exit+0x0/0x40
 xdoio         D 000000000ff0df40 12368 29576      1         29572 28994 (NOTLB)
 Call Trace:
 [C0000000769D3590] [C0000000769D3620] 0xc0000000769d3620 (unreliable)
 [C0000000769D3760] [C0000000000106C0] .__switch_to+0x130/0x154
 [C0000000769D37F0] [C00000000035B0C8] .schedule+0xae0/0xc70
 [C0000000769D3900] [D000000000A03138] .gdlm_plock+0x158/0x240 [lock_dlm]
 [C0000000769D39E0] [D000000000ACD8AC] .gfs_lm_plock+0x4c/0x68 [gfs]
 [C0000000769D3A60] [D000000000AD81F0] .gfs_lock+0x144/0x168 [gfs]
 [C0000000769D3BB0] [C000000000109EEC] .fcntl_setlk+0x180/0x2fc
 [C0000000769D3CB0] [C000000000105454] .sys_fcntl+0x320/0x3f8
 [C0000000769D3D50] [C000000000129B90] .compat_sys_fcntl64+0x348/0x514
 [C0000000769D3E30] [C0000000000086A4] syscall_exit+0x0/0x40


Version-Release number of selected component (if applicable):
cman-2.0.73-1.el5_1.4.ppc
gfs-utils-0.1.12-1.el5.ppc
gfs2-utils-0.1.38-1.el5.ppc
kernel-2.6.18-53.el5.ppc64
kernel-2.6.18-53.1.6.el5.ppc64
kmod-gfs-0.1.19-7.el5_1.1.ppc64

How reproducible:
Hit on first try. Will reset and try again.

Comment 1 Nate Straz 2008-01-21 15:06:41 UTC

Steps to Reproduce:

1. cd /mnt/gfs
2. export PATH=$PATH:/usr/tests/sts-rhel5.1/bin
3. xiogen -n -i 1 -f buffered -m sequential -s write -F 1000b:single_reader_mw_buffered_sequential | xdoio -vk

Comment 2 Nate Straz 2008-01-21 15:16:54 UTC

Some information requested by Dave after starting gfs_controld w/ -P.

[root@doral gfs_vs0]# group_tool dump gfs gfs_vs0
1200928471 listen 1
1200928471 cpg 4
1200928471 groupd 6
1200928471 uevent 7
1200928471 plocks 10
1200928471 setup done
1200928493 process_plocks: no mg id 40001
1200928493 process_plocks: no mg id 40001
1200928521 client 6: dump
[root@doral gfs_vs0]# group_tool
type             level name     id       state       
fence            0     default  00010001 none        
[1 2 3 4]
dlm              1     clvmd    00030001 none        
[1 2 3 4]
dlm              1     gfs_vs0  00050001 none        
[1 2 3 4]
gfs              2     gfs_vs0  00040001 none        
[1 2 3 4]

Comment 3 David Teigland 2008-01-21 19:59:47 UTC

This is an alignment problem.  Things work if we do the byte-swapping on the
original structure and then copy it into the final buffer, instead of copying
first and then trying to do the byte-swapping at an offset within the send buffer.
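
The ordering described above can be sketched roughly as follows. This is not the committed fix. The struct layout, the names plock_msg, to_le32 and pack_plock, and the little-endian wire format are all assumptions made for illustration. The point is only that the conversion runs on a properly aligned local copy, and the converted bytes are then copied to whatever offset the send buffer requires.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical message layout; the real structure is not shown in this report. */
struct plock_msg {
    uint32_t version[3];
    uint32_t optype;
};

/* Return a value whose in-memory bytes are little-endian on any host
 * (assumed wire format). */
uint32_t to_le32(uint32_t v)
{
    uint8_t b[4] = { (uint8_t)v, (uint8_t)(v >> 8),
                     (uint8_t)(v >> 16), (uint8_t)(v >> 24) };
    uint32_t out;
    memcpy(&out, b, sizeof(out));
    return out;
}

/* Swap first, copy second: convert an aligned local copy, then memcpy it
 * to buf + hdr_len, which may be an unaligned address.
 * The problematic order would be to memcpy first and then convert through
 * a struct pointer cast from buf + hdr_len. */
size_t pack_plock(uint8_t *buf, size_t buflen, size_t hdr_len,
                  const struct plock_msg *in)
{
    struct plock_msg tmp = *in; /* aligned copy; caller's struct is untouched */
    size_t i;

    assert(buflen >= hdr_len + sizeof(tmp));
    for (i = 0; i < 3; i++)
        tmp.version[i] = to_le32(tmp.version[i]);
    tmp.optype = to_le32(tmp.optype);

    memcpy(buf + hdr_len, &tmp, sizeof(tmp));
    return hdr_len + sizeof(tmp);
}
```

With a 5-byte header, for example, the payload starts at an odd address, and the memcpy handles that without any multi-byte access through a misaligned pointer.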

Comment 4 David Teigland 2008-01-21 20:32:26 UTC

Fix committed to the HEAD, RHEL5 and RHEL51 branches.

Comment 7 Rob Kenna 2008-03-12 18:05:48 UTC

Marking as 5.1z fix for ppc support.

Comment 10 errata-xmlrpc 2008-05-21 15:58:40 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2008-0347.html
