250824 – EXT3 problem/crashing with large partitions

Summary: EXT3 problem/crashing with large partitions

Keywords:
Status:	CLOSED INSUFFICIENT_DATA
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Sub Component:
Version:	7
Hardware:	x86_64
OS:	Linux
Priority:	low
Severity:	urgent
Target Milestone:	---
Assignee:	Eric Sandeen
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2007-08-03 20:04 UTC by Paul D
Modified:	2008-02-16 02:33 UTC
CC List:	2 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2008-02-16 02:33:06 UTC
Type:	---
Embargoed:
Dependent Products:

Attachments
log file of error messages (64.69 KB, application/zip) 2007-08-03 20:04 UTC, Paul D
dmesg for the machine concerned (24.37 KB, text/plain) 2007-08-06 19:38 UTC, Paul D

Description Paul D 2007-08-03 20:04:34 UTC

This bug seems to have reappeared from previous incarnations.

I have a 3TB RAID array running on an Areca RAID controller.  Previously the
controller was running the array as a 1.8TB partition and a 1.2TB partition;
the 1.8TB partition had to be expanded to the full 3TB (less overhead).

I migrated all the data off the array (which has been working fine for 8
months) and recreated it as follows:-

parted /dev/sdb

1.  rm 1 (removed the remaining old 1.8TB partition)
2.  unit GB
3.  mklabel - gpt
4.  mkpartfs - primary - ext2 - 0 - 3000
5.  q

The partition was created without problems and was converted to ext3 using
tune2fs -j /dev/sdb1, again no problems reported.

As soon as I started to migrate the data back, the messages log started to fill
with the messages shown in the attachment.  The partition has already
gone into ro mode once, and although it appears stable right now (even though the
log is filling with error messages) I'm not happy moving 2.2TB of data back
onto the array.

I'm currently running 2.6.21-1.3228-fc7 x86_64.

If you look at the log file at 18:17 you'll see the drive go into ro mode,
followed by the reboot.  I have deleted literally tens of thousands of the
same error messages from the log but have left enough in place to give you a
flavour of what's happening, and enough to debug it.

If you want any further data please let me know.

Comment 1 Paul D 2007-08-03 20:04:34 UTC

Created attachment 160652
log file of error messages

Comment 2 Paul D 2007-08-03 21:30:26 UTC

Since the original posting I've done a couple of tests and made the following
observation.

1.  The error seems to kick in if the GUI box "preparing to copy" appears (I'm
moving a lot of data here); if the copy goes active immediately, the error
doesn't seem to appear.

I'll do more testing and see if I can provide further feedback.

Comment 3 Paul D 2007-08-05 02:35:18 UTC

OK after spending the last 30 hours messing around with this I can report the 
following.

I have reconfigured the system each time before conducting the following tests:-

The system has been configured thus:-

The Areca web front end deletes and then recreates the RAID array as 5 x 750GB
drives in RAID5, giving 3000GB of available data space (before overhead).

The Areca web front end has then been used to create a single 3000GB volumeset.

I have installed the latest Fedora 7 kernel (2.6.22.1-41.fc7.x86_64) to ensure 
that the problem is still valid in the latest Kernel.

The drive gets identified as /dev/sdb for the purposes of this setup.

Commands used to create the drive:-

parted /dev/sdb
unit GB
mklabel - gpt
mkpartfs - primary - ext2 - 0 - 3000
q

tune2fs -j /dev/sdb1

During these commands no errors are reported back.

I have then deliberately copied heavily to the created array from multiple
sources, all from within GNOME using Nautilus.

So far in 5 tests I have been able to get the array to fail every time.  However
(and this may be relevant or it may well be a false positive), when I was
copying to the array from only a single source I could not get the array to
error.  One test ran for 8 hours copying nearly 1TB from another source without
failure, but within 20 minutes of starting a 2nd copy process to the array the
errors started to show up.  Interestingly, cancelling this second copy process
stopped the errors appearing in the log.  It's almost as if the system 'gets
confused' when it's handling multiple write streams.

Please also note that this array controller also has a separate 1TB array
attached (using 6 x 200GB drives) that has never shown this behaviour.  So it
may well be the combination of >2TB and multiple write streams that is the
problem.
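
For anyone trying to reproduce the multiple-write-stream load described here,
the following is a minimal sketch only. It is not the procedure used in this
report (the copies here were started from Nautilus). It assumes Python 3.9 or
later; the names copy_tree, parallel_copy and demo_parallel_copy are
hypothetical, and the demo runs against a temporary directory rather than a
real array.

```python
import shutil
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def copy_tree(src: Path, dest_root: Path) -> Path:
    """Recursively copy one source tree into its own directory under dest_root."""
    target = dest_root / src.name
    shutil.copytree(src, target)
    return target


def parallel_copy(sources: list[Path], dest_root: Path) -> list[Path]:
    """Start one copy per source so several write streams hit dest_root at once."""
    with ThreadPoolExecutor(max_workers=max(1, len(sources))) as pool:
        return list(pool.map(lambda s: copy_tree(s, dest_root), sources))


def demo_parallel_copy() -> int:
    """Build three tiny source trees, copy them concurrently, count the results."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        sources = []
        for i in range(3):
            src = root / f"src{i}"
            src.mkdir()
            (src / "data.bin").write_bytes(bytes([i]) * 1024)
            sources.append(src)
        dest = root / "array"  # stand-in for the mount point of the array
        dest.mkdir()
        copied = parallel_copy(sources, dest)
        return sum(1 for p in copied if (p / "data.bin").exists())


if __name__ == "__main__":
    print("trees copied:", demo_parallel_copy())
```

On a real test the sources would be large directories on other drives or NFS
mounts, and the system log would be watched for EXT3-fs errors while the
copies run.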

I'm going to create the array using reiserfs and see if the problem still
occurs then.

Comment 4 Paul D 2007-08-05 14:37:36 UTC

Further update.

I created an array using reiserfs using the same procedure as above except
replacing tune2fs -j /dev/sdb1 with mkreiserfs /dev/sdb1.

So far I've copied over 1TB into the array with up to 7 separate concurrent
copies (from both local drives and NFS connections) and have not been able to
get the fault to appear.  Given that previously in 5 attempts I couldn't get
the fault to not appear if there was more than one concurrent copy running, it
would genuinely appear to be a problem/conflict between multiple copy processes
/ >2TB array / ext3 file system / the Areca controller.  I'm not sure if the
controller is an issue, as googling this problem seems to pop it up on lots of
different controllers.

While this problem may be obscure and currently minor, can I suggest that it may
be of bigger import than realised at first.  If users find that they've been
storing data and there's nothing to indicate a problem (except the logs) until
they read the data or do a reboot and then find the problem, they'll be pretty
p**d off.  The reason for the increased import is that a number of
manufacturers have recently announced 1TB drives, and this obscure problem may
well suddenly become a flood.

If you want me to do any testing I'll have to do it before start of business on
Thursday, as after that point I'll have to commit the drive to production (and
will be doing so with the reiserfs structure that's currently working).

Comment 5 Paul D 2007-08-06 01:02:57 UTC

OK, the new reiserfs-formatted array has now had over 2TB copied onto it.

No errors reported in the logs and I've randomly checked about 200 files and
they seem OK.

As I said in my earlier note, I'm convinced this is a conflict between a >2TB
array, ext3 and multiple concurrent copy processes.

I've e-mailed Areca support and drawn their attention to this bugzilla.  It 
would be useful if they set up a machine to test whether they can recreate the 
problem.

The setup instructions on the Areca site for arrays >2tb are out of date for 
more recent Kernels as Large Block Device is enabled in the kernel now by 
default (it would be useful if Areca updated their instructions).

Comment 6 Phil Knirsch 2007-08-06 12:46:44 UTC

Reassigning to correct component (kernel).

Read ya, Phil

Comment 7 Chuck Ebbert 2007-08-06 17:40:46 UTC

EXT3-fs error (device sdb1): ext3_new_block: Allocating block in system zone -
blocks from 731578723, length 1

Could mkpartfs followed by tune2fs be creating a broken filesystem?

Why not use mkpart followed by 'mkfs -t ext3'?

Comment 8 Eric Sandeen 2007-08-06 18:21:36 UTC

So are you using the arcmsr module?

Reiserfs working might tend to point to ext3 problems, although it may be
possible that reiserfs just hasn't found the corruption yet.  I don't know how
well reiser's fsck works, but it might be worth running it as a test.

I know that xfs should be safe past 2T for sure; if you have any time you might
try making it with xfs, do the copy, and run xfs_repair to see if it finds any
problems.  I'm still tempted to suspect the IO layers, not the filesystem.

(I also would prefer to see you explicitly run mkfs.ext3 when creating the
filesystem, only because I don't know for sure what parted does in that respect,
but manually running mkfs.ext3 is obvious...)

Thanks,

-Eric

Comment 9 Eric Sandeen 2007-08-06 18:47:22 UTC

It might also be interesting to include the geometry of the resulting ext3
filesystem, so I can get an idea of where those reported block numbers are
landing.  Doing the normal parted thing, when you're done, do debugfs <device>
and type the "stats" command.

cat /proc/partitions would also be good to see exactly how large your device is.

Thanks,
-Eric

Comment 10 Paul D 2007-08-06 19:32:09 UTC

I did the following to test the system further.

As I'd moved the 2TB of data back to the primary array on reiserfs, I could
then free up the 1TB array to play with.

The 1TB array is 6 x 200GB drives in RAID5, physically in the same machine
attached to the same controller.

I created a 1TB array exactly the same way I had above for the 3TB array and
then started to do multiple concurrent copies into it to see what the effect/
impact was.  After copying 500GB with up to 5 concurrent copies (3 local from
other local drives and 1 'push' and 1 'pull' from other boxes via NFS), I could
not get the error to appear.

I'll do a reiser fsck when I can umount the drive again in a couple of hours 
(I've got a data job running right now that needs that drive online).  I'll get 
the rest of the reports at about 10:30pm London time.

Comment 11 Paul D 2007-08-06 19:38:27 UTC

Created attachment 160766
dmesg for the machine concerned

Comment 12 Paul D 2007-08-06 20:09:15 UTC

Eric

Sorry, after reading through your notes it's going to be a problem for me to
reset the drive back to ext3.  I've copied all the data back to it and that
took nearly 24 hours (there's about 2.1TB of live data on it that I'd have to
migrate off it again).  So I'd be looking at about 48 hours just to shunt the
data off and back again, which takes me pretty close to the drive going back
into live production on Weds night/Thurs morning.

Even shunting 1TB of data across the controller to the other 1TB array takes
about 5 hours (if run as a single copy) and network moves are painfully slow!!

Areca have dropped me a note saying they're going to try to set up a similar 
configuration to test and could they be added to the cc list 
support.tw.

I noticed in the dmesg that during detection the 3TB and 1TB drives get a 
different detection flag.

-----------------------------------------------------------------------

ARECA RAID ADAPTER4: FIRMWARE VERSION V1.43 2007-4-17  
scsi4 : Areca SATA Host Adapter RAID Controller( RAID6 capable)
        Driver Version 1.20.00.13
scsi 4:0:0:0: Direct-Access     Areca    ARC-1160-VOL#00  R001 PQ: 0 ANSI: 5
scsi 4:0:0:1: Direct-Access     Areca    ARC-1160-VOL#01  R001 PQ: 0 ANSI: 5
scsi 4:0:16:0: Processor         Areca    RAID controller  R001 PQ: 0 ANSI: 0
sd 4:0:0:0: [sdb] Very big device. Trying to use READ CAPACITY(16).
sd 4:0:0:0: [sdb] 5859373056 512-byte hardware sectors (2999999 MB)
sd 4:0:0:0: [sdb] Write Protect is off
sd 4:0:0:0: [sdb] Mode Sense: cb 00 00 08
sd 4:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support 
DPO or FUA
sd 4:0:0:0: [sdb] Very big device. Trying to use READ CAPACITY(16).
sd 4:0:0:0: [sdb] 5859373056 512-byte hardware sectors (2999999 MB)
sd 4:0:0:0: [sdb] Write Protect is off
sd 4:0:0:0: [sdb] Mode Sense: cb 00 00 08
sd 4:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support 
DPO or FUA
 sdb: sdb1
sd 4:0:0:0: [sdb] Attached SCSI disk
sd 4:0:0:0: Attached scsi generic sg2 type 0
sd 4:0:0:1: [sdc] 1953123840 512-byte hardware sectors (999999 MB)
sd 4:0:0:1: [sdc] Write Protect is off
sd 4:0:0:1: [sdc] Mode Sense: cb 00 00 08
sd 4:0:0:1: [sdc] Write cache: enabled, read cache: enabled, doesn't support 
DPO or FUA
sd 4:0:0:1: [sdc] 1953123840 512-byte hardware sectors (999999 MB)
sd 4:0:0:1: [sdc] Write Protect is off
sd 4:0:0:1: [sdc] Mode Sense: cb 00 00 08
sd 4:0:0:1: [sdc] Write cache: enabled, read cache: enabled, doesn't support 
DPO or FUA
 sdc: unknown partition table
sd 4:0:0:1: [sdc] Attached SCSI disk
sd 4:0:0:1: Attached scsi generic sg3 type 0

-------------------------------------------------------------------

The "very big device" reference seems to indicate that the underlying code 
handles the drive differently?

I take your point about running mkfs.ext3 explicitly rather than relying on
mkpartfs and tune2fs; I didn't think of that yesterday but it was an obvious
test in hindsight.

I'll get back to you with the fsck as soon as I can.

Comment 13 Eric Sandeen 2007-08-06 20:14:56 UTC

2T is the likely culprit here; that's right at the 32-bits-of-512-byte-sectors
boundary.  Whether it's the fs or the controller remains to be seen, but I think
3TB filesystems on ext3 aren't too uncommon...

I understand that you can't go back to ext3 or test xfs, if your data is in
place... though without further testing we may have trouble isolating the problem.

If the reiser fsck turns up trouble, though, that'll be a big hint.  If not, we
can choose between not trusting reiserfsck vs. assuming ext3 must be broken.  :)

Has Areca tested this driver on x86_64 with > 2T?

We can only put support.tw on cc: if they have a bugzilla acct, I'm afraid.

-Eric

Comment 14 Eric Sandeen 2007-08-06 20:16:46 UTC

> 2T is the likely culprit here; that's right at the 32-bits-of-512-byte-sectors
> boundary.  Whether it's the fs or the controller remains to be seen, but I think
> 3TB filesystems on ext3 aren't too uncommon...

Oh, and since the controller speaks in 512-byte units while the fs speaks largely in
block-sized units (4k most likely), that does tend to make me suspect the controller.
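
The 32-bits-of-512-byte-sectors boundary can be checked with a little
arithmetic. The sketch below uses the sector counts from the dmesg output in
comment 12; the helper name needs_more_than_32_bits is made up for
illustration. It only shows which device crosses the boundary and says nothing
about where a fault actually lies.

```python
SECTOR_BYTES = 512        # sector size reported in dmesg
LIMIT_SECTORS = 2 ** 32   # sectors addressable with a 32-bit sector number

# Sector counts copied from the dmesg output in comment 12.
devices = {"sdb": 5859373056, "sdc": 1953123840}

# Largest device size reachable with 32-bit sector numbers: 2**41 bytes (2 TiB).
limit_bytes = LIMIT_SECTORS * SECTOR_BYTES


def needs_more_than_32_bits(sectors: int) -> bool:
    """True if the device has sectors that a 32-bit sector number cannot reach."""
    return sectors > LIMIT_SECTORS


for name, sectors in devices.items():
    size_bytes = sectors * SECTOR_BYTES
    print(f"{name}: {sectors} sectors, {size_bytes} bytes, "
          f"beyond 32-bit sector range: {needs_more_than_32_bits(sectors)}")
```

Only sdb (the 3TB volume) extends past the boundary; sdc (the 1TB volume that
never showed the errors) stays well inside it.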

Comment 15 Paul D 2007-08-06 20:28:49 UTC

/proc/partitions

major minor  #blocks  name

   8     0   36151920 sda
   8     1     208813 sda1
   8     2    2048287 sda2
   8     3   15350107 sda3
   8     4          1 sda4
   8     5   18538978 sda5
   8    16 2929686528 sdb
   8    17 2929686494 sdb1
   8    32  976561920 sdc
   8    33  976561886 sdc1

Comment 16 Paul D 2007-08-06 23:34:54 UTC

Eric

I'm running the fsck.reiserfs now.....but looking at it it's an overnight run 
so I'll post the results tomorrow.

Comment 17 Paul D 2007-08-07 02:45:11 UTC

reiserfsck 3.6.19 (2003 www.namesys.com)

*************************************************************
** If you are using the latest reiserfsprogs and  it fails **
** please  email bug reports to reiserfs-list, **
** providing  as  much  information  as  possible --  your **
** hardware,  kernel,  patches,  settings,  all reiserfsck **
** messages  (including version),  the reiserfsck logfile, **
** check  the  syslog file  for  any  related information. **
** If you would like advice on using this program, support **
** is available  for $25 at  www.namesys.com/support.html. **
*************************************************************

Will read-only check consistency of the filesystem on /dev/sdb1
Will put log info to 'stdout'

Do you want to run this program?[N/Yes] (note need to type Yes if you do):Yes
###########
reiserfsck --check started at Tue Aug  7 00:01:12 2007
###########
Replaying journal..
Reiserfs journal '/dev/sdb1' in blocks [18..8211]: 0 transactions replayed
Checking internal tree..//  2 (of  23)/ / 24 (of 138)/////123 (of 170)
finished                               
Comparing bitmaps..finished
Checking Semantic tree:
finished                                                                       
No corruptions found
There are on the filesystem:
        Leaves 568681
        Internal nodes 3446
        Directories 8186
        Other files 438643
        Data block pointers 540989760 (0 of them are zero)
        Safe links 0
###########
reiserfsck finished at Tue Aug  7 01:49:32 2007
###########

This is with about 2.1TB of data on the array.

Comment 18 Eric Sandeen 2007-08-07 03:27:39 UTC

hm, interesting.  Ok, thanks for the reiserfs check info!  I have to say I
expected to find corruption. :)

Comment 19 Paul D 2007-08-07 04:47:34 UTC

No problem.

My gut has been telling me that this is something to do with the multiple 
concurrent copies and ext3.

You may well be right though that doing mkpartfs within parted is not the same
as doing mkfs.ext3.  It's strange though that on the >2TB array I could
get it to fail five times out of five, while on the 1TB array (which was created
using parted's mkpartfs, by the way) I couldn't get the problem to happen.  This
was all on the same physical controller and machine.

There's certainly something obscure going on; it'll be difficult to pin it down
though.  If I can get my hands on some more HDs I'll try to set up another 3TB
on the array and see if I can run some more tests for you.

Comment 20 Eric Sandeen 2007-08-07 16:56:24 UTC

I'll set up 3T of ext3 here & do some testing.  If you are willing to sacrifice
your reiserfs fs at this point, you could use lmdd to write a unique pattern at
each offset to the entire 3T block device, then read it back & verify.  (This
patterning functionality is built into lmdd.)  That'd be a reasonably exhaustive
test of the underlying block device.
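
The idea of an offset-pattern test can be sketched as follows. This is not
lmdd and does not reproduce its pattern format; it is a small Python
illustration (assuming Python 3.9 or later) in which every 8-byte word holds
its own byte offset, so data that lands at the wrong address identifies where
it was meant to go. The names pattern_block, write_pattern, verify_pattern and
demo_pattern_check, and the 4096-byte block size, are illustrative choices.
The demo uses a scratch file and simulates a misdirected write instead of
touching a real device.

```python
import os
import struct
import tempfile

BLOCK = 4096  # bytes written per step (illustrative choice)


def pattern_block(offset: int, size: int = BLOCK) -> bytes:
    """Build a block in which each 8-byte word stores its own byte offset."""
    return b"".join(struct.pack("<Q", offset + i) for i in range(0, size, 8))


def write_pattern(path: str, total_bytes: int) -> None:
    """Fill the target with the offset pattern, one block at a time."""
    with open(path, "wb") as f:
        for offset in range(0, total_bytes, BLOCK):
            f.write(pattern_block(offset))


def verify_pattern(path: str, total_bytes: int) -> list[int]:
    """Return the offsets of blocks whose contents do not match the pattern."""
    bad = []
    with open(path, "rb") as f:
        for offset in range(0, total_bytes, BLOCK):
            if f.read(BLOCK) != pattern_block(offset):
                bad.append(offset)
    return bad


def demo_pattern_check() -> tuple[list[int], list[int]]:
    """Verify a clean scratch file, then verify again after a misplaced write."""
    size = 64 * BLOCK
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "scratch.img")
        write_pattern(path, size)
        clean = verify_pattern(path, size)
        # Simulate a misdirected write: data meant for block 3 lands on block 10.
        with open(path, "r+b") as f:
            f.seek(10 * BLOCK)
            f.write(pattern_block(3 * BLOCK))
        corrupted = verify_pattern(path, size)
    return clean, corrupted


if __name__ == "__main__":
    clean, corrupted = demo_pattern_check()
    print("bad blocks before:", clean)
    print("bad blocks after simulated misdirected write:", corrupted)
```

Because the stored words name their intended offset, a block found at the
wrong place also shows how far the address was off, which is what makes this
kind of test useful for spotting address wrap-around.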

Thanks,
-Eric

Comment 21 Paul D 2007-08-08 16:10:58 UTC

Eric

Just to let you know, I set up a script to copy data onto and off the array
(running multiple copy processes) while it's still on reiserfs and I haven't
been able to generate a problem.  Thus far I must have copied about 4-5TB onto
the drive (this is over and above the genuine 2.1TB on it).

The drive goes back into production tomorrow so it'll be next weekend before I
can do any more testing on it.  I'll keep you advised of my progress.

Comment 22 Eric Sandeen 2007-08-08 16:23:05 UTC

Paul, I appreciate all the info.  I will try to test this here as well when I
have the time & hardware available.  Sorry for being so anxious to push this off
onto Areca.  ;-)  For your script, do you simply have multiple recursive copies
running, or something like that?

Thanks,

-Eric

Comment 23 Eric Sandeen 2007-09-20 14:36:51 UTC

Paul, any further news on this problem?  I'm trying to set aside some time to
test here.  Can you let me know exactly what script you were using for your test?

Thanks,

-eric

Comment 24 Paul D 2007-09-20 14:57:11 UTC

Hi Eric

I'm currently rationalising some of the content (extracting pertinent data from 
backups then dumping rest) and should be in a position to do some more testing 
the week after next.  The problem (as I'm sure you're aware) is that getting 
the data off a 3TB array takes nearly 24 hours in itself (as well as 24 hours 
to get it back on!!) so I have to manage the logistics of this carefully - 
while I'm doing this testing I've effectively removed my fallback box.

I'll aim to come back to you the 2nd week of October if that's OK and I should 
be able to do a whole raft of tests that week.

Regards

Comment 25 Eric Sandeen 2007-10-29 21:55:59 UTC

Paul, if you can give me an idea what tests you've run (are these scripts that
are copying files around?) I will try to do some testing here.  Since you
indicate that parallel copies seem to cause the problem, I would like to try to
replicate what you are doing as closely as possible, but I need more info to do
that.
 
Thanks,
-Eric

Comment 26 Eric Sandeen 2007-10-29 22:28:04 UTC

FWIW, others w/ problems on areca, although of course that's what you get with
the right google search... but just for posterity:

http://www.nabble.com/Problem-with-ext3-filesystem-t2887878.html
http://oss.sgi.com/archives/xfs/2007-10/msg00144.html
http://oss.sgi.com/archives/xfs/2006-06/msg00098.html

Comment 27 Christopher Brown 2008-01-10 19:19:37 UTC

Hello,

I'm reviewing this bug as part of the kernel bug triage project, an attempt to
isolate current bugs in the Fedora kernel.

http://fedoraproject.org/wiki/KernelBugTriage

Paul - were you able to complete the tests indicated in comment #24?

If the problem no longer exists then please close this bug or I'll do so in a
few days if there is no additional information lodged.

Comment 28 Christopher Brown 2008-02-16 02:33:06 UTC

Closing as per previous comment. Please re-open if this is still an issue for you.
