Bug 860483 - MAYBE Fedora 17 has a disk I/O problem. MAYBE
Status: CLOSED INSUFFICIENT_DATA
Product: Fedora
Classification: Fedora
Component: kernel
Version: 19
Hardware: i686
OS: Linux
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: ---
Assigned To: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
Docs Contact:
Depends On:
Blocks:
Reported: 2012-09-25 19:38 EDT by Jim Brannan
Modified: 2013-10-08 13:40 EDT
CC List: 5 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2013-10-08 13:40:22 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments

  None
Description Jim Brannan 2012-09-25 19:38:42 EDT
Description of problem:

I've used yum update to keep current with the Fedora 17 x86_64 release as of 25Sep2012. The following tests took place over many weeks; I can repeat some of them if you think things have changed.

It all started when I had a series of probably duplicate files. I wanted to make sure, so I used cmp to check and kept getting different answers: sometimes equal, sometimes different, and at different locations. With so many new things in play, what could be wrong?

A) cmp is broken or can't handle large binary files [400MB to 3.1GB dtv recordings]

1) I wrote a Java program to compare two files. It kept giving different answers too, so it doesn't seem to be cmp.

B) Could it be my brand new USB->SATA drive?

2) I copied one of the suspicious files to another drive and then duplicated it. When comparing them over and over, I still got different answers. It doesn't look like the drive.

C) Could it be ext4? I had never used ext4 until last month, when I installed Fedora17 for the first time and had a drive failure, losing almost everything. [Or was it a drive failure?]

3) I copied the files from #2 to an ext3 partition on the drive from #2 and to a FAT32 partition on the USB drive from #1. Both compared differently on different attempts, so it doesn't seem to be related to the file system.

D) Could it be Fedora17 x86_64? This is the first time I've used the 64-bit release.

4) I downloaded and created a live CD for Fedora17 i686. Running from that, I got 3 'wrong' compares in 45 tests. Running x86_64 booted from a USB stick, I got 15 wrong out of 45. I got 53 failures out of 63 runs on the 'real' Fedora17 x86_64 installation. All used the same files on the same USB->SATA drive with ext4, so the result was somewhat inconclusive.

E) Could it be Fedora17?

5) Though the home partition of my old Fedora7 installation was destroyed, the boot and root partitions seemed intact, so I made a new TempHome7 partition (ext3) and added three pairs of files and a test script. Something went wrong during boot and the X server or video driver didn't come up, but I logged on as root and ran the test, getting only one failure out of 67 runs. That one failure really bothers me.

F) Interestingly, I used cp on Fedora17 to make the test files and expected some errors in the copies, but cmp on Fedora7 said they were (mostly) equal. Is there any difference between the way cp makes a copy and the way cmp tests for equality? For example, mapping a file into memory vs. reading one block at a time?

6) I may never know, but I did modify the Java program to try both regular I/O and memory-mapped files, and ran numerous tests on those same 3 pairs of files on that same ext3 partition.

	Fedora17 Equal/Runs
	cmp:	10/135
	Mapped:	54/135
	I/O:	135/135

So far the java program's regular I/O has always said each pair [on that drive w/ ext3] are equal and cmp almost never does with the memory mapped java somewhere in the middle.  When using the original set of 13 files with both cmp and the new expanded java program, I get:

	cmp:	12/86
	Mapped:	24/86
	I/O:	54/86
	not eq:	5+5+5
	no file:10

Now it's harder to analyze the results as I now believe that at least two of those files are genuinely not equal and one is a different size and the pair of another has been deleted.  But there seems to still be a difference between the methods.

7) Downloaded a Fedora16 i686 live CD and after a lot of frustration built another bootable USB stick. No java on the stick.

		TempHome7		Original 13
	cmp:	93/95		cmp:	20/47
				no file:7

Not completely conclusive. It started out well, but I kept running the tests and it eventually started to fail. Then the system halted; it did say "panic" in the tiny, tiny text, but I don't really know what happened.

G) Could it be my 5 year old motherboard/CPU?

8) I ran one pass of the memtest from the Fedora17 x86_64 CD [Memtest86 v4.20] and it passed. But that's just mostly memory, right?
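Step 1 above mentions a Java program that compares two files; its source is not part of this report. A minimal sketch of such a checker, assuming plain stream reads in 4096-byte blocks (the class and method names are hypothetical, not the reporter's), reports the first differing offset much as cmp does. Note that cmp counts bytes from 1, while this sketch counts from 0.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical cmp-style checker (not the reporter's program): reads both files
// in fixed-size blocks and reports the first differing byte.
final class FirstDifference {

    /**
     * Returns -1 if the files are identical; otherwise the 0-based offset of the
     * first differing byte, or the shorter file's length if one is a prefix of the other.
     */
    static long firstDifference(Path a, Path b, int bufferSize) throws IOException {
        try (InputStream inA = Files.newInputStream(a);
             InputStream inB = Files.newInputStream(b)) {
            byte[] bufA = new byte[bufferSize];
            byte[] bufB = new byte[bufferSize];
            long offset = 0;
            while (true) {
                // readNBytes fills the buffer unless EOF is reached, so short reads
                // cannot push the two streams out of step
                int nA = inA.readNBytes(bufA, 0, bufferSize);
                int nB = inB.readNBytes(bufB, 0, bufferSize);
                int n = Math.min(nA, nB);
                for (int i = 0; i < n; i++) {
                    if (bufA[i] != bufB[i]) {
                        return offset + i;
                    }
                }
                if (nA != nB) {
                    return offset + n; // one file ended before the other
                }
                if (nA == 0) {
                    return -1; // both files ended with no difference found
                }
                offset += n;
            }
        }
    }

    public static void main(String[] args) throws IOException {
        long pos = firstDifference(Paths.get(args[0]), Paths.get(args[1]), 4096);
        System.out.println(pos < 0 ? "equal" : "differ at byte offset " + pos);
    }
}
```

With Java 11 or later it can be run directly as `java FirstDifference.java fileA fileB`. On a healthy system, repeated runs on the same pair should always print the same answer; a different offset on each run is the symptom described in this report.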

----

Ok, I know, too much data.  All I really know at the moment is that on Fedora17 x86_64 files that are probably equal almost never match w/ cmp and most of the time match with java I/O and about half of the time match with java memory mapped files.  It appears that cp can successfully copy a file that cmp won't match afterward.  I've had two partitions wiped out and several weird freezes and program failures since starting to use Fedora17.  It doesn't make sense to report any of these until I have confidence in my hardware and Fedora17's disk I/O.
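The Equal/Runs tallies in this report come from running one comparison method many times over the same pair of files. A minimal harness for that is sketched below, assuming Java 12 or later for `Files.mismatch`; the class and interface names are hypothetical, and any of the methods discussed here (cmp-style reads, plain I/O, memory mapping) could be plugged in as the comparison.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical harness: re-runs one comparison method on the same pair of files
// and counts how many runs say "equal", giving an Equal/Runs figure.
final class EqualRunsTally {

    /** One way of deciding whether two files have identical contents. */
    @FunctionalInterface
    interface FileComparison {
        boolean same(Path a, Path b) throws IOException;
    }

    /** Runs {@code method} {@code runs} times and returns the number of "equal" verdicts. */
    static int countEqual(Path a, Path b, int runs, FileComparison method) throws IOException {
        int equal = 0;
        for (int i = 0; i < runs; i++) {
            if (method.same(a, b)) {
                equal++;
            }
        }
        return equal;
    }

    public static void main(String[] args) throws IOException {
        Path a = Paths.get(args[0]);
        Path b = Paths.get(args[1]);
        int runs = Integer.parseInt(args[2]);
        // Files.mismatch (Java 12+) returns -1L when the two files have identical contents
        int equal = countEqual(a, b, runs, (x, y) -> Files.mismatch(x, y) == -1L);
        System.out.printf("Equal/Runs: %d/%d%n", equal, runs);
    }
}
```

One caveat for this kind of repeated test: files that fit in free memory may be served from the page cache on later runs instead of being re-read from the disk. As root, `echo 3 > /proc/sys/vm/drop_caches` between runs forces fresh reads.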

Q1) So what do I try next? How can I help figure out what (if anything) is wrong?

Q2) Are there any CPU diagnostic test programs out there? disk I/O too?

Q3) I am on dial-up, so it takes about 2 1/2 days to download a CD-sized ISO. But if there's a Fedora version I should try, let me know.

Q4) Are there any tools out there to try to resurrect my lost data [ext3 partition]? The live CD/USB systems just hang when I try to mount, dd copy, or fsck the partition.

Q5) I'm thinking of moving the drive into an external USB case so that perhaps the system won't hang when it gets confused. Would that make a difference?

Version-Release number of selected component (if applicable):


How reproducible:
Very, given enough time.

Steps to Reproduce:
1. Make a large file of around 1GB, perhaps with dd from a random source.
2. cp it to another location.
3. cmp the two (for more details, see above).
  
Actual results:
Sometimes equal, sometimes not equal, and at different locations

Expected results:
always equal
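The three steps to reproduce can be sketched in Java as follows. This is an illustration under stated assumptions, not the reporter's script: a seeded `java.util.Random` stands in for dd from a random device, `Files.copy` stands in for cp, `Files.mismatch` (Java 12+) stands in for cmp, and the class name is hypothetical. On a healthy system every comparison should report no difference.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.Random;

// Hypothetical Java rendering of the three steps to reproduce (not the reporter's code).
final class CopyCompareRepro {

    /** Step 1: write {@code size} pseudo-random bytes; a seeded Random stands in for dd from a random device. */
    static void makeRandomFile(Path file, long size, long seed) throws IOException {
        Random rng = new Random(seed);
        byte[] chunk = new byte[1 << 20]; // write 1 MiB at a time
        try (OutputStream out = Files.newOutputStream(file)) {
            long left = size;
            while (left > 0) {
                int n = (int) Math.min(chunk.length, left);
                rng.nextBytes(chunk);
                out.write(chunk, 0, n);
                left -= n;
            }
        }
    }

    /** Steps 2 and 3: copy once (like cp), then compare {@code runs} times (like cmp); returns the failure count. */
    static int copyThenCompare(Path source, Path copy, int runs) throws IOException {
        Files.copy(source, copy, StandardCopyOption.REPLACE_EXISTING);
        int failures = 0;
        for (int i = 0; i < runs; i++) {
            long pos = Files.mismatch(source, copy); // -1L means identical (Java 12+)
            if (pos != -1L) {
                failures++;
                System.out.println("run " + (i + 1) + ": differ at byte offset " + pos);
            }
        }
        return failures;
    }

    public static void main(String[] args) throws IOException {
        Path source = Paths.get(args[0]);
        Path copy = Paths.get(args[1]);
        makeRandomFile(source, 1L << 30, 42L); // about 1 GB, as in step 1
        int failures = copyThenCompare(source, copy, 10);
        System.out.println(failures + " of 10 comparisons reported a difference");
    }
}
```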

Additional info:
Pentium D x 2 3.4GHz extra L2 cache; 2GB DDR2 interleaved ram
Comment 1 Jim Brannan 2012-10-16 05:05:14 EDT
----------------------------------------------------------------------

Oops, oops, oops. A small bug in my Java program led to a great deal of confusion: I had the labels for "Memory Mapped I/O" and "Block I/O" swapped. So in the discussion above, it was "Memory Mapped I/O" that succeeded most frequently.

My latest test was to install i686 into a partition of its own, yum update to the most current version (3.6.1-1.fc17.i686), clear out my test logs, and rerun the three-equal-file tests (TempHome7 above). Apparently i686 limits how large a file it will map into memory; the 1.2GB and 1.6GB files exceeded that limit, which is how I noticed the labeling error. I've updated the Java program to map in chunks when needed, and here are the results of the tests:

	            *fc17.i686.log                  *fc17.x86_64.log
	            Pass/Total  %Pass  Errors       Pass/Total  %Pass  Errors
	cmp:        540/629     85%    0            57/630      9%     0
	I/O:        366/629     58%    0            300/630     47%    0
	Mapped:     592/629     94%    1            597/630     94%    0
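The comment above says the 1.2GB and 1.6GB files could not be mapped on i686 until the program mapped them in chunks. A single `FileChannel.map` call is limited to `Integer.MAX_VALUE` bytes, and a 32-bit JVM has far less address space than that to spare, so mapping one window at a time is a common workaround. A minimal sketch, assuming Java 11+ for `ByteBuffer.mismatch`; the class name and the 64 MiB default window are hypothetical choices, not taken from the report.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

// Hypothetical chunked memory-mapped comparison: maps one window of each file at a
// time instead of the whole file, so large files fit in a 32-bit address space.
final class ChunkedMappedCompare {

    /** Returns -1 if identical, else the first differing offset (or the shorter length). */
    static long firstDifference(Path a, Path b, long chunkSize) throws IOException {
        if (chunkSize <= 0 || chunkSize > Integer.MAX_VALUE) {
            throw new IllegalArgumentException("chunkSize must be in 1..Integer.MAX_VALUE");
        }
        try (FileChannel ca = FileChannel.open(a, StandardOpenOption.READ);
             FileChannel cb = FileChannel.open(b, StandardOpenOption.READ)) {
            long common = Math.min(ca.size(), cb.size());
            for (long pos = 0; pos < common; pos += chunkSize) {
                long len = Math.min(chunkSize, common - pos);
                MappedByteBuffer ma = ca.map(FileChannel.MapMode.READ_ONLY, pos, len);
                MappedByteBuffer mb = cb.map(FileChannel.MapMode.READ_ONLY, pos, len);
                int i = ma.mismatch(mb); // Java 11+: -1 when the two windows are equal
                if (i >= 0) {
                    return pos + i;
                }
            }
            return ca.size() == cb.size() ? -1 : common;
        }
    }

    public static void main(String[] args) throws IOException {
        long chunk = 64L << 20; // 64 MiB per window: an assumed value, not from the report
        long pos = firstDifference(Paths.get(args[0]), Paths.get(args[1]), chunk);
        System.out.println(pos < 0 ? "equal" : "differ at byte offset " + pos);
    }
}
```

Mappings are released only when their buffers are garbage collected, so the window size is a trade-off worth tuning on i686.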

Since I have no idea what the source of the error is, I can't really say whether the differences are important, only that it still happens with the current i686 kernel, and practically speaking I seem to stand less chance of mystery crashes with it.

Let me know if I can be of help and sorry again for the confusion.

jsb.
Comment 2 Jim Brannan 2013-02-20 20:43:04 EST
A little more data:
  When I cp a large file, I sometimes get a bad copy; the larger the file, the more likely. I presume this is just this problem showing up in the reads that cp does, which then writes the bad data out.
  dd if= of=, however, rarely (if ever) fails to copy correctly. Since dd's default block size is 512 bytes, I added another compare form to my Java program that uses 512-byte reads rather than 4K. It also rarely fails.
  So now we have four programs (java, cmp, cp, dd) that do I/O in different ways and fail with different frequencies.
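The comment compares 512-byte reads (dd's default block size) with 4K reads. A sketch of a scanner that reads both inputs in fixed-size blocks and lists every block that differs, rather than stopping at the first, so the positions can be compared across runs and block sizes. It assumes Java 9+ (`readNBytes`, ranged `Arrays.equals`); the names are hypothetical, and the Java buffer size only approximates the read size the kernel actually sees. Because it just opens paths, it could also be pointed at partitions such as /dev/sdd10 and /dev/sdd11 (reading raw devices usually needs root).

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical block-size scanner: compares two files (or block devices) in
// fixed-size reads and returns the starting offset of every differing block.
final class BlockSizeScan {

    static List<Long> differingBlocks(Path a, Path b, int blockSize) throws IOException {
        List<Long> offsets = new ArrayList<>();
        try (InputStream inA = Files.newInputStream(a);
             InputStream inB = Files.newInputStream(b)) {
            byte[] bufA = new byte[blockSize];
            byte[] bufB = new byte[blockSize];
            long offset = 0;
            while (true) {
                int nA = inA.readNBytes(bufA, 0, blockSize);
                int nB = inB.readNBytes(bufB, 0, blockSize);
                if (nA == 0 && nB == 0) {
                    break; // both inputs fully read
                }
                // Ranged Arrays.equals also compares lengths, so a short final block counts as different
                if (!Arrays.equals(bufA, 0, nA, bufB, 0, nB)) {
                    offsets.add(offset);
                }
                if (nA != nB) {
                    break; // one input ended early; nothing further lines up
                }
                offset += nA;
            }
        }
        return offsets;
    }

    public static void main(String[] args) throws IOException {
        Path a = Paths.get(args[0]);
        Path b = Paths.get(args[1]);
        for (int size : new int[] {512, 4096}) {
            List<Long> bad = differingBlocks(a, b, size);
            System.out.println(size + "-byte reads: " + bad.size() + " differing block(s) at " + bad);
        }
    }
}
```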

  A new experiment occurred to me, so I created two 8G partitions on the USB drive. I dd copied a big file to the first [I discovered I had misunderstood /dev/random; you cannot dd copy a big file from it, oh well], then dd copied the first partition to the second and ran my Java compare program as well as cmp.

  fdisk /dev/sdd
    n ... +8G
  dd 'if=/Drive/HomeC/Backup/nightly-2013-06.zip' 'of=/dev/sdd10' bs=4K
  dd 'if=/dev/sdd10' 'of=/dev/sdd11' bs=4K

  I haven't managed to get (Java-based) memory-mapped I/O to work on a partition, but the 512-byte and 4K reads behave as above: the 512-byte reads say equal, while the 4K reads and cmp report inequality at apparently random spots. This at least bypasses all file systems and points a finger downstream.

  Finally got Intel to tell me where the Linux based CPU diagnostics were and so I've run and passed the 4 minute 32-bit test.  Any other diagnostics out there for confirming/denying potential hardware explanations?

  As of Fri, Feb 15, 2013 and 3.7.3-101.fc17.i686

jsb.
Comment 3 Fedora End Of Life 2013-07-03 19:31:49 EDT
This message is a reminder that Fedora 17 is nearing its end of life.
Approximately 4 (four) weeks from now Fedora will stop maintaining
and issuing updates for Fedora 17. It is Fedora's policy to close all
bug reports from releases that are no longer maintained. At that time
this bug will be closed as WONTFIX if it remains open with a Fedora 
'version' of '17'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version prior to Fedora 17's end of life.

Bug Reporter:  Thank you for reporting this issue and we are sorry that 
we may not be able to fix it before Fedora 17 is end of life. If you 
would still like to see this bug fixed and are able to reproduce it 
against a later version of Fedora, you are encouraged to change the 
'version' to a later Fedora version prior to Fedora 17's end of life.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.
Comment 4 Jim Brannan 2013-07-27 21:52:12 EDT
as of 3.9.5-301.fc19.i686
Comment 5 Josh Boyer 2013-09-18 16:46:48 EDT
*********** MASS BUG UPDATE **************

We apologize for the inconvenience.  There is a large number of bugs to go through and several of them have gone stale.  Due to this, we are doing a mass bug update across all of the Fedora 19 kernel bugs.

Fedora 19 has now been rebased to 3.11.1-200.fc19. Please test this kernel update and let us know if your issue has been resolved or if it is still present with the newer kernel.

If you experience different issues, please open a new bug report for those.
Comment 6 Josh Boyer 2013-10-08 13:40:22 EDT
This bug is being closed with INSUFFICIENT_DATA as there has not been a response in 2 weeks. If you are still experiencing this issue, please reopen and attach the relevant data from the latest kernel you are running and any data that might have been requested previously.
