Bug 167962

Summary: GFS causes kernel panic when moving 8G file
Product: [Retired] Red Hat Cluster Suite
Component: gfs
Version: 3
Hardware: x86_64
OS: Linux
Status: CLOSED INSUFFICIENT_DATA
Severity: high
Priority: medium
Reporter: Joseph Salisbury <jts>
Assignee: Wendy Cheng <nobody+wcheng>
QA Contact: Mike McLean <mikem>
CC: bmarzins, ccaulfie, dshaks, jbrassow, kanderso, rkenna
Doc Type: Bug Fix
Last Closed: 2006-08-30 20:19:09 UTC

Description Joseph Salisbury 2005-09-09 21:12:58 UTC

Description of problem:
GFS version 6.0 causes a kernel panic when moving an 8G file from one gfs filesystem to another gfs filesystem.  The following error is reported before the panic:

Message from syslogd@spa65 at Fri Sep  9 15:26:39 2005 ...
spa65 kernel: Kernel panic: GFS: Assertion failed on line 1253 of file linux_dio.c


This happens whether moving a single file or multiple files at the same time.

We enabled multipathing and round-robin, which we thought might be the problem.  However, the problem still happens when these features are set to none with pool_mp.



Version-Release number of selected component (if applicable):
GFS 6.0 on RHEL3U6

How reproducible:
Always

Steps to Reproduce:
1. Try to move a file that is 8G in size from one gfs filesystem to another.


  

Actual Results:  Kernel Panic

Expected Results:  The file moved.

Additional info:

Comment 1 Kiersten (Kerri) Anderson 2005-09-09 21:41:07 UTC
We need the system logs and the console messages that got printed out when the
panic happened.

Comment 3 John Shakshober 2005-09-12 15:17:30 UTC
Let me add some background on the configuration and the errors Joe and I are seeing:

System - 4-node x86_64 DL585 cluster, 4-CPU, 16 GB of memory (amd64)
SW - RHEL3 U6 with GFS 6.0, with two dual-port QLA2300 cards per node.

We used the QLogic driver for failover, which reduced the paths from 4 down to 2.
Presented 4 LUNs to 2 HBAs each to POOL and enabled round-robin striping.

The Oracle 10g R1 database ran fine with sync IO or async IO.  We filled up one LUN
with "space", so Joe was attempting to move the file (via "mv") or copy it
(via "cp").  Either operation failed.

We get an Oracle error (reported to Oracle) if/when we attempt to use
Direct IO (Oracle simply attempts to open the file with the O_DIRECT flag).
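
(For reference, a minimal C sketch of what an O_DIRECT open and write typically
look like on Linux -- this is not Oracle's code; the path, size, and 4 KB
alignment below are illustrative assumptions only.)

#define _GNU_SOURCE                     /* needed for O_DIRECT with glibc */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    const size_t align = 4096;          /* assumed block-size alignment */
    void *buf;
    int fd;

    /* O_DIRECT bypasses the page cache; I/O goes straight to the device. */
    fd = open("/oradata/testfile", O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0) {
        perror("open(O_DIRECT)");
        return 1;
    }

    /* Buffer address, file offset, and transfer size must all be aligned. */
    if (posix_memalign(&buf, align, align) != 0) {
        close(fd);
        return 1;
    }
    memset(buf, 0, align);

    if (write(fd, buf, align) != (ssize_t)align)
        perror("write");

    free(buf);
    close(fd);
    return 0;
}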

We are in the process of rebuilding without round-robin striping and/or
attempting to reproduce on a 4-node cluster at Westford (rather than at HP).


Comment 4 Joseph Salisbury 2005-09-12 15:46:10 UTC
The system panics during boot now.  Below is the console output while trying to
boot the system.  Seems to be metadata corruption:


GFS: fsid=gfsrac:oraindex.2: Done
GFS: fsid=gfsrac:oraundo.2: Joined cluster. Now mounting FS...
GFS: fsid=gfsrac:oraundo.2: jid=2: Trying to acquire journal lock...
GFS: fsid=gfsrac:oraundo.2: jid=2: Looking at journal...
GFS: fsid=gfsrac:oraundo.2: jid=2: Done
GFS: fsid=gfsrac:oraundo.2: Scanning for log elements...
GFS: fsid=gfsrac:oraundo.2: Found 0 unlinked inodes
GFS: fsid=gfsrac:oraundo.2: Found quota changes for 0 IDs
GFS: fsid=gfsrac:oraundo.2: Done
                                                           [  OK  ]
Starting automount: Bad metadata at 7039232, should be 5
  mh_magic = 0x00C50453
  mh_type = 223219726
  mh_generation = 878477087085035717
  mh_format = 69010723
  mh_incarn = 537791556
This architecture does not implement dump_stack()
Kernel panic: GFS: Assertion failed on line 1253 of file linux_dio.c
GFS: assertion: "metatype_check_magic == GFS_MAGIC && metatype_check_type == ((height) ? (5) : (4))"
GFS: time = 1126539848
GFS: fsid=gfsrac:oraindex.2
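
(For reference, a rough and purely illustrative C reconstruction of the kind of
magic/type check behind the assertion above.  The field names mirror the mh_*
values printed in the log; the magic value, struct layout, and the meaning of
types 4 and 5 are assumptions here -- the GFS source (gfs_ondisk.h,
linux_dio.c) is the authority.)

#include <stdint.h>

#define GFS_MAGIC 0x01161970u          /* assumed on-disk magic number */

struct gfs_meta_header {
    uint32_t mh_magic;                 /* must equal GFS_MAGIC */
    uint32_t mh_type;                  /* metadata block type code */
    uint64_t mh_generation;
    uint32_t mh_format;
    uint32_t mh_incarn;
};

/* Returns nonzero if the header matches the expected type.  Per the assertion
 * text, the expected type is 5 when height != 0 and 4 otherwise (presumably
 * indirect block vs. dinode).  In the kernel, a mismatch -- e.g. the garbage
 * mh_magic/mh_type values printed above -- trips the GFS assertion and panics
 * the node. */
static int metatype_check(const struct gfs_meta_header *mh, unsigned int height)
{
    uint32_t expected = height ? 5u : 4u;

    return mh->mh_magic == GFS_MAGIC && mh->mh_type == expected;
}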


Comment 5 John Shakshober 2005-09-12 20:28:10 UTC
We set up a RHEL3 U6 3-node (used to be 4-node) cluster here in Westford;
no problem copying eight 15 GB files between mount points.

Now running Oracle with DIO (O_DIRECT) to a GFS volume with a 10 GB TPC-H data set.
Results will be available by tomorrow AM.



Comment 6 John Shakshober 2005-09-14 14:33:26 UTC
Direct IO runs completed; a comparison of performance is below:

Without Direct IO, TPC-H fills up the file caches and we can only get 11 MB/sec.

Timing is in seconds:
m1timing:Execution Time Throughput Test: 89020.63
m2timing:Execution Time Throughput Test: 75675.87

With Direct IO, TPC-H can run at one FC adapter's bandwidth of 180 MB/sec (out of a
200 MB/sec peak).

m1timing:Execution Time Throughput Test: 6550.27
m2timing:Execution Time Throughput Test: 6616.08

A nice ~12x improvement (89020.63 / 6550.27 ≈ 13.6x and 75675.87 / 6616.08 ≈ 11.4x).

Back to re-creating the larger cluster filesystems with round-robin for pool.



Comment 7 John Shakshober 2005-09-21 15:19:50 UTC
We cannot reproduce this on the x86 cluster; DIO works fine to create and run
TPC-H on 3 nodes (one node died on me).  This is running RHEL3 U6 with GFS 6.0.

Additional file corruption has occurred any time we try to create databases
with Direct IO ... the filesystem gets corrupted:

Sep 14 21:57:02 spa65 su(pam_unix)[3756]: session opened for user oracle by (uid=0)
Sep 14 21:57:02 spa65 su(pam_unix)[3596]: session opened for user oracle by (uid=0)
Sep 14 21:57:03 spa65 kernel: This architecture does not implement dump_stack()

>> Sep 14 21:57:03 spa65 kernel: Kernel panic: GFS: Assertion failed on line 1227 of file rgrp.c

Sep 14 22:16:03 spa65 syslogd 1.4.1: restart.
Sep 14 22:16:03 spa65 syslog: syslogd startup succeeded
Sep 14 22:16:03 spa65 kernel: klogd 1.4.1, log source = /proc/kmsg started.
Sep 14 22:16:03 spa65 kernel: ok

We then re-ran gfs_fsck on each filesystem:
gfs_fsck -v -y /dev/pool/oraundo
gfs_fsck -v -y /dev/pool/oraindex
gfs_fsck -v -y /dev/pool/oradata

[root@spa65 log]# df
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/cciss/c0d0p6    159738020   4735588 146888132   4% /
/dev/cciss/c0d0p1       518000     47196    444492  10% /boot
none                   8005308         0   8005308   0% /dev/shm
/dev/cciss/c0d0p5      8256808     39928   7797460   1% /tmp
/dev/cciss/c0d0p3      8256824    222572   7614828   3% /var
/dev/pool/oracle_base
                     38572512   4938704  33633808  13% /opt/oracle
/dev/pool/oradata    720029760 320741308 399288452  45% /oradata
/dev/pool/oraindex    38572512       540  38571972   1% /oraindex
/dev/pool/oraundo     38572512  23599260  14973252  62% /oraundo 


We have ALSO reproduced this on a second x86_64 system, with the same 10g R1 setup
running on AMD64 systems using iSCSI under RHEL4 U2 with GFS 6.1.

  


Comment 8 Wendy Cheng 2005-10-28 04:14:32 UTC
The original plan for addressing this issue was to:

* take VFS layer fix from bz 147870 (rhel 3 oracle pwrite issue)
* add similar GFS patch from bz 169154 (rhel 4 direct io)

Unfortunately, after really going through the code today, it turns out the GFS direct IO
implementations are different between RHEL 4 (where most of the work is done by the
base kernel) and RHEL 3 (GFS does its own thing).

In short, this problem is different from bz 169154. I don't have a good
understanding of the root cause of this RHEL 3 problem at this moment.


Comment 9 Wendy Cheng 2005-10-28 13:59:36 UTC
Hi, I have to re-examine the issue ...

Is this problem about "cp of an 8G file between different GFS filesystems" being
able to cause a panic?  It is fuzzy to me why this is Direct IO related.  Could you
explain more?

Comment 10 Wendy Cheng 2005-10-28 14:06:07 UTC
I need to be more specific - when we do "cp /mnt/gfs1/myfile /mnt/gfs2/myfile",
there should *not* be any Direct IO code involved. How do you make this
command crash the kernel in DIO code?  Did you set the inherit_direct_io flag on the
filesystem?
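
(For contrast with the O_DIRECT sketch in comment 3: a hedged, cp-like buffered
copy in C.  There is no O_DIRECT flag anywhere in this path, which is why a
plain "cp" hitting DIO code is surprising unless a direct-IO inheritance flag
is set or the on-disk metadata is already corrupt.  Illustrative only, not the
actual cp source.)

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* cp-like buffered copy: plain read()/write(), no O_DIRECT anywhere,
 * so all data moves through the page cache. */
static int copy_file(const char *src, const char *dst)
{
    char buf[64 * 1024];
    ssize_t n;
    int in = open(src, O_RDONLY);
    int out = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0644);

    if (in < 0 || out < 0) {
        perror("open");
        if (in >= 0) close(in);
        if (out >= 0) close(out);
        return -1;
    }

    while ((n = read(in, buf, sizeof(buf))) > 0) {
        if (write(out, buf, (size_t)n) != n) {
            perror("write");
            n = -1;
            break;
        }
    }

    close(in);
    close(out);
    return n < 0 ? -1 : 0;
}

int main(void)
{
    /* same paths as in the comment above */
    return copy_file("/mnt/gfs1/myfile", "/mnt/gfs2/myfile") ? 1 : 0;
}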

Comment 11 John Shakshober 2005-10-28 15:14:38 UTC
The current thinking is that the file got corrupted when it was written with DIO
by Oracle; Oracle had already bailed out previously.

After a reboot, Joe's attempt to copy the corrupt file caused GFS to panic
as it detected the corruption during the read. 



Comment 12 Wendy Cheng 2005-10-28 15:46:14 UTC
Could I have one of these files to do experiments ? 

Comment 13 Wendy Cheng 2005-10-30 21:25:01 UTC
Phone call with dshak - he said this couldn't be recreated on the perf5-7 (i686)
cluster (that I'm currently working on), but I forgot to ask:

1. Did this *only* happen with x86_64 nodes? Running 32-bit or 64-bit? Was the
copy between i686 and x86_64 nodes?
2. Did the crashes happen on the "dest" node or the "src" node (when doing "cp")?



Comment 14 Wendy Cheng 2005-10-31 03:40:37 UTC
Sorry, typo in 2) - it should be:

"Did the crash happen on the "dest" file or the "src" file?"  However, I guess since we
never captured the vmcore and/or the stack dump, it is unknown at this moment, so
don't worry about it.  I'll try to get a pair of x86_64 nodes to try this out.

Comment 15 John Shakshober 2005-10-31 10:39:30 UTC
This does not happen when moving basic 8 GB files around ...
We have completed full IOzone testing with 1 MB to 32 GB files (doubling the size each time).

This only occurred AFTER a database was built with DIO.  Then, while trying to
isolate the problem, we were moving files around when GFS would panic.

We are running up2date on the cluster to fix a different EXT3 issue 
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=171931

After that we will try to reproduce, and will give you access to the system if
we can reproduce it.


Comment 16 John Shakshober 2005-11-18 15:49:15 UTC
Able to crash the RHEL3 U6 cluster again running the TPC-H part of the
Oracle 10G stress test (OAST).

Repeated 9 times, trying different nodes and cleaning up with gfs_fsck.

# DISK SUMMARY (/sec)
#Reads  R-Merged  R-KBytes   Writes  W-Merged  W-KBytes
    5         0        17     9818     71128     88661
Ouch!
[root@spa65 GFS]# vmstat 3
procs                      memory      swap          io     system         cpu
r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
0 18      0 13200292  29992 1543228    0    0   112  3927  188   919 24  6 66  4
0 18      0 13198076  30064 1546904    0    0   305 88885 6163  1429  2  7  1 91
2 17      0 13197820  30064 1547880    0    0    31 89352 6030   641  2  5  0 93
0 18      0 13197404  30072 1548316    0    0    18 89529 6066   719  2  6  0 93
0 16      0 13197168  30080 1548896    0    0    29 89062 6218   915  2  6  0 93
0 16      0 13196744  30080 1549408    0    0    28 88999 6310   913  1  5  0 94
3  1      0 13685528  30080 1053740    0    0    23 58755 12405 64286  2 18 17 62
2  3      0 13685536  30080 1055000    0    0    36 77750 24667 222595  7 41 20 32
1 16      0 13683084  30088 1059720    0    0   383 91229 13249 78496  3 16  2 79
2  0      0 13684024  30096 1055052    0    0   134 51498 15714 163853  3 28 21 48
0 14      0 13676516  30096 1067272    0    0  3475 95926 8216 16311  2  8  1 89
7  0      0 13675676  30096 1066504    0    0   143 83475 7261 14529  1  7  4 88

Message from syslogd@spa65 at Wed Nov 16 15:10:42 2005 ...
spa65 kernel: Kernel panic: GFS: Assertion failed on line 550 of file rgrp.c




Nov 13 04:03:01 spa65 syslogd 1.4.1: restart.
Nov 14 14:01:47 spa65 sshd(pam_unix)[17741]: session opened for user root by (uid=0)
Nov 14 14:01:51 spa65 sshd(pam_unix)[17741]: session closed for user root
Nov 14 14:02:13 spa65 sshd(pam_unix)[17794]: session opened for user oracle by
(uid=0)
Nov 14 19:04:11 spa65 kernel: Bad metadata at 72654208, should be 5
Nov 14 19:04:11 spa65 kernel:   mh_magic = 0x00020000
Nov 14 19:04:11 spa65 kernel:   mh_type = 2097152
Nov 14 19:04:11 spa65 kernel:   mh_generation = 8444250867653466
Nov 14 19:04:11 spa65 kernel:   mh_format = 2692808704
Nov 14 19:04:11 spa65 kernel:   mh_incarn = 2797273088
Nov 14 19:04:11 spa65 kernel: This architecture does not implement dump_stack()
Nov 14 19:08:28 spa65 syslogd 1.4.1: restart.
Nov 14 19:08:28 spa65 syslog: syslogd startup succeeded
Nov 14 19:08:28 spa65 kernel: klogd 1.4.1, log source = /proc/kmsg started.
Nov 14 19:08:28 spa65 kernel: ok
Nov 14 19:08:28 spa65 kernel: Bootdata ok (command line is ro root=LABEL=/ )
Nov 14 19:08:28 spa65 kernel: Linux version 2.4.21-37.ELsmp
(bhcompile.redhat.com) (gcc version 3.2.3 20030502 (Red Hat Linux
3.2.3-53)) #1 SMP Wed Sep 7 13:32:18 EDT 2005 

Comment 17 Wendy Cheng 2005-11-18 15:55:06 UTC
Was this the same as the email you sent earlier, which happened with and without
Direct IO on x86_64-only nodes?

Comment 18 John Shakshober 2005-11-18 16:13:21 UTC
Yes this is with or without DIO.

Shak

Comment 19 Wendy Cheng 2005-11-18 21:16:07 UTC
Building a debug kernel now ... let me know if the machine slot is available.

Comment 20 Wendy Cheng 2005-11-19 16:50:42 UTC
Cut and paste of part of the report from dshak, for reference purposes:

>
> In September 2005, issues with RHEL3 x86_64 were identified running
> heavy I/O workloads against the 4-node amd64 system with 16 GB of memory,
> 2 FC adapters/node, and 4 200 GB LUNs (RAID 0+1 in HW), initially using
> Pool striping on GFS.  Pool striping appears to be the cause of the kernel panics;
> https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=167962
> We are working around the current striped pool problem, using a
> re-layout to separate LUNs to match the previous 9i CERT and Westford x86
> cluster.  We are keeping the striped LUNs for debug later.

Comment 21 Ben Marzinski 2006-04-25 21:16:45 UTC
Comments #17 and #18 refer to DIO. Is it possible to create this problem entirely
without using DIO?

Comment 22 Wendy Cheng 2006-04-25 21:18:42 UTC
Answer to comment #21: YES (according to shak).

Comment 23 John Shakshober 2006-04-26 08:24:23 UTC
We can request time on the HP system again, if it is still available.
Remember the symptom:
 POOL was configured using pool striping with 4 FC paths to the same LUN.
Does anyone have a machine with that many paths here at Red Hat?

Comment 24 Rob Kenna 2006-08-30 19:38:06 UTC
Shak -  I've thrown this to you for "need info" given that Salisbury is
"splitsville".

- Rob

Comment 25 John Shakshober 2006-08-30 20:10:41 UTC
We no longer have a system to try to reproduce this problem.


Comment 26 Wendy Cheng 2006-08-30 20:19:09 UTC
Since we haven't seen a field report for this issue, I'll close this as
non-reproducible.  Reopen it if necessary.