Bug 220762

Summary: GFS2 kernel BUG
Product: Fedora
Reporter: Michael Smotritsky <mike.smey>
Component: kernel
Assignee: Steve Whitehouse <swhiteho>
Status: CLOSED WORKSFORME
QA Contact: Brian Brock <bbrock>
Severity: high
Priority: medium
Version: 6
CC: swhiteho, wtogami
Hardware: x86_64
OS: Linux
Last Closed: 2007-02-27 16:58:20 UTC

Description Michael Smotritsky 2006-12-26 01:36:16 UTC
Description of problem:
1. Two blades with a simple CMAN/GFS2 setup.
2. Startup is fine.
3. blade1 creates file1 and can read it fine.
4. When blade2 tries to open file1 for the first time: Kernel BUG at mm/filemap.c:553.
5. Every subsequent access works fine.
6. When blade2 creates file2, blade1 opens file2 just fine.
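
Roughly, the sequence is this (the mount point /mnt/gfs2 below is just an example, not the actual path in use here):

# on blade1
echo test > /mnt/gfs2/file1
cat /mnt/gfs2/file1          # reads fine on blade1

# on blade2: first access to the same file
cat /mnt/gfs2/file1          # -> Kernel BUG at mm/filemap.c:553
cat /mnt/gfs2/file1          # subsequent accesses work

# the reverse direction is fine: file2 created on blade2 opens fine on blade1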


/var/log/messages

Dec 25 12:26:39 vvsorms2 kernel: Kernel BUG at mm/filemap.c:553
Dec 25 12:26:39 vvsorms2 kernel: invalid opcode: 0000 [1] SMP
Dec 25 12:26:39 vvsorms2 kernel: last sysfs 
file: /devices/pci0000:00/0000:00:03.0/0000:04:00.0/0000:06:01.0/host1/rport-
1:0-0/target1:0:0/1:0:0:0/vendor
Dec 25 12:26:39 vvsorms2 kernel: CPU 0
Dec 25 12:26:39 vvsorms2 kernel: Modules linked in: md5 sctp ipv6 lock_dlm gfs2 
dlm configfs ib_iser rdma_cm ib_addr ib_cm ib_sa ib_mad ib_core iscsi_tcp 
libiscsi scsi_transport_iscsi dm_multipath video sbs i2c_ec button battery 
asus_acpi ac parport_pc lp parport joydev sg tg3 i6300esb i2c_i801 i2c_core 
e752x_edac edac_mc pcspkr dm_snapshot dm_zero dm_mirror dm_mod usb_storage 
qla2xxx scsi_transport_fc shpchp mptspi mptscsih mptbase scsi_transport_spi 
sd_mod scsi_mod ext3 jbd ehci_hcd ohci_hcd uhci_hcd
Dec 25 12:26:39 vvsorms2 kernel: Pid: 2932, comm: cat Not tainted 2.6.18-
1.2868.fc6 #1
Dec 25 12:26:39 vvsorms2 kernel: RIP: 0010:[<ffffffff80217998>]  
[<ffffffff80217998>] unlock_page+0xf/0x2f
Dec 25 12:26:39 vvsorms2 kernel: RSP: 0018:ffff81015cdb99f8  EFLAGS: 00010246
Dec 25 12:26:39 vvsorms2 kernel: RAX: 0000000000000000 RBX: ffff810006737000 
RCX: 00000000c0000100
Dec 25 12:26:39 vvsorms2 kernel: RDX: ffff81015cdb9bf8 RSI: ffff81016017f830 
RDI: ffff810006737000
Dec 25 12:26:39 vvsorms2 kernel: RBP: 0000000000000000 R08: ffff81015cdb8000 
R09: ffff810164971000
Dec 25 12:26:39 vvsorms2 kernel: R10: ffffffff88348960 R11: ffff81015cdb4198 
R12: 0000000000000001
Dec 25 12:26:39 vvsorms2 kernel: R13: 0000000000000001 R14: ffff81015d767390 
R15: 0000000000000001
Dec 25 12:26:40 vvsorms2 kernel: FS:  00002aaaaaab4260(0000) GS:ffffffff805e4000
(0000) knlGS:0000000000000000
Dec 25 12:26:40 vvsorms2 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 
000000008005003b
Dec 25 12:26:40 vvsorms2 kernel: CR2: 0000000000402fa0 CR3: 000000015c8c9000 
CR4: 00000000000006e0
Dec 25 12:26:40 vvsorms2 kernel: Process cat (pid: 2932, threadinfo 
ffff81015cdb8000, task ffff81016017f830)
Dec 25 12:26:40 vvsorms2 kernel: Stack:  ffff810006737000 ffffffff8833b429 
ffffffffffffffff ffff81015cdb9bf8
Dec 25 12:26:40 vvsorms2 kernel:  0000000000000000 ffff810164971000 
ffff810164c5d480 ffff81016017f830
Dec 25 12:26:40 vvsorms2 kernel:  0000000000000000 ffff810164c5d480 
ffff810164c5d480 ffffffff80262372
Dec 25 12:26:40 vvsorms2 kernel: Call Trace:
Dec 25 12:26:40 vvsorms2 kernel:  
[<ffffffff8833b429>] :gfs2:gfs2_readpages+0x1cb/0x1f6
Dec 25 12:26:40 vvsorms2 kernel:  [<ffffffff802127de>] 
__do_page_cache_readahead+0x107/0x1da
Dec 25 12:26:40 vvsorms2 kernel:  [<ffffffff80232225>] 
blockable_page_cache_readahead+0x56/0xb5
Dec 25 12:26:40 vvsorms2 kernel:  [<ffffffff80213955>] 
page_cache_readahead+0xd6/0x1af
Dec 25 12:26:40 vvsorms2 kernel:  [<ffffffff8020bd92>] 
do_generic_mapping_read+0x126/0x41d
Dec 25 12:26:40 vvsorms2 kernel:  [<ffffffff8020c1f5>] 
__generic_file_aio_read+0x16c/0x1bf
Dec 25 12:26:40 vvsorms2 kernel:  [<ffffffff802ba13f>] 
generic_file_read+0xac/0xc5
Dec 25 12:26:40 vvsorms2 kernel:  [<ffffffff8020b1bb>] vfs_read+0xcb/0x171
Dec 25 12:26:40 vvsorms2 kernel:  [<ffffffff80211527>] sys_read+0x45/0x6e
Dec 25 12:26:40 vvsorms2 kernel:  [<ffffffff8025bcce>] system_call+0x7e/0x83
Dec 25 12:26:40 vvsorms2 kernel: DWARF2 unwinder stuck at system_call+0x7e/0x83
Dec 25 12:26:40 vvsorms2 kernel: Leftover inexact backtrace:
Dec 25 12:26:40 vvsorms2 kernel:
Dec 25 12:26:40 vvsorms2 kernel:
Dec 25 12:26:40 vvsorms2 kernel: Code: 0f 0b 68 54 4a 48 80 c2 29 02 48 89 df 
e8 b3 29 00 00 48 89
Dec 25 12:26:40 vvsorms2 kernel: RIP  [<ffffffff80217998>] unlock_page+0xf/0x2f
Dec 25 12:26:40 vvsorms2 kernel:  RSP <ffff81015cdb99f8>

Comment 1 Michael Smotritsky 2006-12-26 01:39:50 UTC
*** Bug 220761 has been marked as a duplicate of this bug. ***

Comment 2 Michael Smotritsky 2006-12-26 03:53:03 UTC
The same problem occurs on blade2 every time I modify file1 on blade1 and then
try to touch it on blade2...

Comment 3 Steve Whitehouse 2007-01-04 14:58:23 UTC
I suspect that this might already be fixed in the upstream kernel, as some
changes were made in that area since FC6. Are you able to try an upstream
kernel, or even the kernel from rawhide?

Comment 4 Michael Smotritsky 2007-01-05 16:30:32 UTC
Sorry, we can't go to upstream or rawhide at the moment... these blades are
getting ready for a production pilot...
Which kernel should be more stable for GFS2? We really would like to use it...
When do you think a more stable (GFS2-wise) kernel will get into FC6?
Thanks!

Comment 5 Russell Cattelan 2007-01-08 20:53:33 UTC
Are you planning on using GFS2 in production?

I would advise against that at this time.
There are a couple of issues being worked on at the moment
that prevent GFS2 from being used in a production environment.


Comment 6 Steve Whitehouse 2007-01-10 09:59:42 UTC
I've just sent a bunch of GFS2 updates to go into the next FC6 kernel. It's in
testing at the moment and should be released next week if all goes to plan.
Please try that one when it's released and see if that fixes your problem. I
suspect it will, but do let us know if it doesn't.

Comment 7 Michael Smotritsky 2007-01-10 12:10:32 UTC
Great! Will try the new kernel as soon as it's out and let you know about the 
results...
Thank you very much!

Comment 8 Steve Whitehouse 2007-01-15 09:36:52 UTC
Please test kernel 2.6.19-1.2895 and let me know if that fixes the problem for you.
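
If it hasn't reached the stable updates repo yet, something along these lines should pull it from updates-testing (the repo id here is assumed to be the standard Fedora one):

yum --enablerepo=updates-testing update kernel
rpm -q kernel    # check that 2.6.19-1.2895 is now installed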

Comment 9 Michael Smotritsky 2007-01-15 19:43:55 UTC
Thanks for the heads up... As soon as 2.6.19-1.2895 shows up in the updates
repos I'll test it.

Comment 10 Michael Smotritsky 2007-01-26 13:05:41 UTC
I've installed the new kernel; unfortunately I can't boot the blade - kernel
panic in initrd - https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=224289

As soon as it's resolved I'll continue with testing gfs2.

Comment 11 Michael Smotritsky 2007-01-26 19:39:02 UTC
I've fixed my problem by rebuilding the initrd image with LVM support.
I will be testing gfs2 over the weekend...
I've also got all the new updates - lvm2, cman and gfs2-tools.
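
For anyone hitting the same thing, rebuilding the initrd was along these lines (the exact kernel version string and module list are assumptions from this setup):

# regenerate the initrd for the new kernel, forcing the device-mapper module in
mkinitrd -f --with=dm-mod /boot/initrd-2.6.19-1.2895.fc6.img 2.6.19-1.2895.fc6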

Comment 12 Michael Smotritsky 2007-02-13 04:28:58 UTC
After installing 2.6.19-1.2895, the second blade fails to start fenced:

***********************
Feb 12 22:07:44 vvsorms2 openais[10801]: [CLM  ] Members Left: 
Feb 12 22:07:44 vvsorms2 openais[10801]: [CLM  ] Members Joined: 
Feb 12 22:07:44 vvsorms2 openais[10801]: [CLM  ]        r(0) ip(192.168.1.163)  
Feb 12 22:07:44 vvsorms2 openais[10801]: [SYNC ] This node is within the 
primary component and will provide service. 
Feb 12 22:07:44 vvsorms2 openais[10801]: [TOTEM] entering OPERATIONAL state. 
Feb 12 22:07:44 vvsorms2 openais[10801]: [CMAN ] quorum regained, resuming 
activity 
Feb 12 22:07:44 vvsorms2 openais[10801]: [CLM  ] got nodejoin message 
192.168.1.163 
Feb 12 22:07:51 vvsorms2 fenced[10817]: 192.168.1.160 not a cluster member 
after 6 sec post_join_delay
Feb 12 22:07:51 vvsorms2 fenced[10817]: 192.168.1.163 not a cluster member 
after 6 sec post_join_delay
Feb 12 22:07:51 vvsorms2 fenced[10817]: fencing node "192.168.1.160"
Feb 12 22:07:51 vvsorms2 fenced[10817]: fence "192.168.1.160" failed
Feb 12 22:07:56 vvsorms2 fenced[10817]: fencing node "192.168.1.160"
Feb 12 22:07:56 vvsorms2 fenced[10817]: fence "192.168.1.160" failed
Feb 12 22:08:01 vvsorms2 fenced[10817]: fencing node "192.168.1.160"
Feb 12 22:08:01 vvsorms2 fenced[10817]: fence "192.168.1.160" failed

**********************************************
The first blade has no problem.

I have tried a clean start and adjusting post_join_delay - nothing looks
different in the log - still 6 secs...


cluster.conf:
**************************
<?xml version="1.0"?>
<cluster name="vvso" config_version="2">

  <cman expected_votes="1" two_node="1">
  </cman>

  <clusternodes>
    <clusternode name="vvsorms1" nodeid="1">
      <fence>
        <method>
          <device name="blade_center" blade="1"/>
        </method>
      </fence>
    </clusternode>

    <clusternode name="vvsorms2" nodeid="2">
      <fence>
        <method>
          <device name="blade_center" blade="2"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>

  <fence_daemon clean_start="0">
  </fence_daemon>

  <fencedevices>
    <fencedevice name="blade_center" agent="fence_bladecenter"
                 ipaddr="xxx.xxx.xxx.xxx" login="xxxxxxxx" passwd="xxxxxxxx"/>
  </fencedevices>
</cluster>

*****************************************************************

I wonder what could be the problem here? It was OK on the previous kernel...

Thanks!

Mike


Comment 13 Steve Whitehouse 2007-02-27 10:00:16 UTC
Patrick, any ideas on this one?

Comment 14 Christine Caulfield 2007-02-27 10:23:49 UTC
That's just a fencing failure. The way to find out why fencing is failing is to
try running it manually: 'fence_node 192.168.1.160'.

Actually it looks slightly odd that fenced is referring to nodes by IP address
rather than name - which might explain why it's failing as they have names in
the config file. I don't know where fenced gets its node names from so I can't
really add much beyond that.
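
As a rough sketch, the manual checks would be something like this (substitute whatever node name cman actually reports):

# run the fence operation by hand to see the agent's real error output
fence_node 192.168.1.160

# see which names/addresses cman itself knows the nodes by
cman_tool nodes
cman_tool status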

Comment 15 Michael Smotritsky 2007-02-27 15:27:08 UTC
Thanks for the suggestion...
After replacing the IPs with names in cluster.conf (I had IPs originally - that
is why the log shows them) the services started up fine and the original Kernel
BUG is gone! It looks like the new kernel fixed it...
I have another question though: when I stop the gfs2 service I get
*****************************************
[vvsorms2:/capsit]# service gfs2 stop
Unmounting GFS2 filesystems:  [  OK  ]
FATAL: Module lock_dlm is in use.
FATAL: Module gfs2 is in use.
*****************************************
Is that OK?

If I try to start it up again it starts up just fine...
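
(For what it's worth, a quick way to check whether the modules are genuinely still in use after the unmount, using only standard tools:)

# confirm no gfs2 filesystems are still mounted
grep gfs2 /proc/mounts

# check the "Used by" counts for the cluster modules
lsmod | grep -E 'gfs2|lock_dlm|dlm'

# once the counts are zero the modules can be removed by hand
modprobe -r lock_dlm gfs2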

Thank you very much for your help.

I will be continuing testing now with the real application.
Our application is very demanding of the file system - it stores tens of
millions of small files (10-30k) and some very large ones (1-2G), so it will be
a good test for GFS2...

Steve...
Do you want me to post my notes here or on some other forum? I'm sure, given
the nature of our app, I'll have some info to share...

This particular bug has been resolved...

Thanks again

Mike 

Comment 16 Steve Whitehouse 2007-02-27 15:33:43 UTC
Note that since the FC-6 kernel got upgraded to 2.6.20, it seems that some of the
GFS2 patches got lost along the way. I've just sent a new patch to bring it up to
date, so hopefully that will make it in shortly.

It's probably best to post your notes to the cluster-devel mailing list unless
there is a confidential aspect to them.

If you are happy with the current state of this bug, then I'll mark it closed.

Comment 17 Michael Smotritsky 2007-02-27 16:15:04 UTC
Thanks for the warning.
Will 2.6.20, after that update, have anything new in gfs2 since 2.6.19?
I'll post updates on my testing to cluster-devel...
And this bug is closed, thanks.

Comment 18 Steve Whitehouse 2007-02-27 16:58:20 UTC
Yes, there will be a number of new patches and they are well worth having.

Not really sure which resolution is best; none of them really seem to fit the
situation, so I've picked what I think is the closest.