Bug 454848 - unstable file system after kernel update
Status: CLOSED INSUFFICIENT_DATA
Product: Fedora
Classification: Fedora
Component: kernel
Version: 8
Hardware: x86_64 Linux
Priority: low
Severity: high
Assigned To: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
URL: https://www.redhat.com/archives/fedor...
Reported: 2008-07-10 02:49 EDT by Ralf W. Grosse-Kunstleve
Modified: 2008-07-14 20:03 EDT (History)
Doc Type: Bug Fix
Last Closed: 2008-07-14 20:03:30 EDT


Attachments
example output of 99 bin/libtbx.scons runs (43.10 KB, text/plain)
2008-07-10 02:49 EDT, Ralf W. Grosse-Kunstleve

Description Ralf W. Grosse-Kunstleve 2008-07-10 02:49:27 EDT
Description of problem:

  Randomly, files appear to be missing, leading to application failures.
  When immediately trying again the exact same command it (usually) works.

Version-Release number of selected component (if applicable):

  kernel 2.6.23.15-137.fc8

How reproducible:

  I can reproduce this problem by repeatedly running SCons (make replacement)
  on a large source tree. About 10% of the SCons runs end with some kind
  of problem.
  I have direct evidence that the problem is connected to the kernel version:
  After rebooting with kernel 2.6.25.9-40.fc8 I'm seeing the SCons
  failures.
  After rebooting with the older kernel 2.6.23.15-137.fc8 the machine is
  completely stable again.
  Other machines with a newer kernel are unstable, too.
  Other machines with the older kernel are stable, too.

Steps to Reproduce:
  mkdir /var/tmp/junk
  cd /var/tmp/junk
  wget http://cci.lbl.gov/cctbx_build/results/2008_04_25_1421/cctbx_bundle.selfx
  perl cctbx_bundle.selfx 4 # this will take ~10 minutes
  cd cctbx_build
  bin/libtbx.scons
  bin/libtbx.scons
  bin/libtbx.scons
  ...
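
  The repeated runs above can be automated. The following is only a
  minimal sketch and not part of the original test case: repeat_build is
  a hypothetical helper name, and it assumes that a bad run either exits
  non-zero or prints a "missing SConscript" warning. It will not catch
  spurious recompilations, which still have to be spotted by reading the
  output.

```shell
#!/bin/sh
# Run a command repeatedly and count the runs that look suspicious.
# Usage: repeat_build RUNS PATTERN COMMAND [ARGS...]
# A run is counted when it exits non-zero or its output matches PATTERN.
repeat_build() {
    runs=$1
    pattern=$2
    shift 2
    bad=0
    i=1
    while [ "$i" -le "$runs" ]; do
        # Capture stdout and stderr together and remember the exit status.
        if out=$("$@" 2>&1); then status=0; else status=$?; fi
        if [ "$status" -ne 0 ] || printf '%s\n' "$out" | grep -q -e "$pattern"; then
            bad=$((bad + 1))
            printf 'run %d: suspicious (exit %d)\n' "$i" "$status" >&2
        fi
        i=$((i + 1))
    done
    # Record the running kernel next to the result, since that is the
    # variable under suspicion here.
    printf '%d of %d runs on kernel %s were suspicious\n' \
        "$bad" "$runs" "$(uname -r)"
}

# Example, from inside the cctbx_build directory:
#   repeat_build 99 'missing SConscript' bin/libtbx.scons
```

  Printing the kernel version with each summary makes it easier to
  compare results collected after booting different kernels.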
  
Actual results:

  About 10% of the time there are SCons warnings or spurious recompilations.

Expected results:

  I'll attach an example output with 99 bin/libtbx.scons results, showing
  8 "missing SConscript" warnings and one spurious set of recompilations.
  The other 90 calls are OK. 

Additional info:

  The same problem exists under Fedora 9, all versions from the day it
  was released (same day I installed it) to at least this kernel version:
  2.6.25.9-76.fc9.x86_64 #1 SMP Fri Jun 27

  The first kernel version I know is broken is 2.6.25.4-10.fc8.

  I've also seen at least one "make" failure, while building
  Python 2.5.2 from sources. So it is not just SCons. SCons
  just happens to trigger the kernel problem exceptionally
  often.
Comment 1 Ralf W. Grosse-Kunstleve 2008-07-10 02:49:27 EDT
Created attachment 311450
example output of 99 bin/libtbx.scons runs
Comment 2 Eric Sandeen 2008-07-10 12:31:54 EDT
I'll take a quick look over the testcase, thanks.
Comment 3 Eric Sandeen 2008-07-10 12:39:05 EDT
For those playing along at home, you need tcsh to run the testcase.
Comment 4 Eric Sandeen 2008-07-10 15:18:43 EDT
.... and python-devel.  :)

Anyway, I ran 100 builds on 2.6.25.9-40.fc8 and saw no errors.

On a whim, can you add default_relatime=0 to your boot commandline and see if it
makes any difference?

After booting, grep for "default relative atime updates" in dmesg to be sure.

Thanks,
-Eric
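
The check requested above could be scripted roughly as follows. This is a sketch only: has_relatime_off is a hypothetical helper name, the parameter and the dmesg message text are taken from this comment as-is, and on some systems reading dmesg requires root.

```shell
#!/bin/sh
# Succeed if the given kernel command line contains default_relatime=0
# as a separate, whole parameter.
has_relatime_off() {
    case " $1 " in
        *" default_relatime=0 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# /proc/cmdline holds the command line the running kernel was booted with.
if has_relatime_off "$(cat /proc/cmdline 2>/dev/null)"; then
    echo "default_relatime=0 is on the boot command line"
else
    echo "default_relatime=0 is NOT on the boot command line"
fi

# Show the kernel's own message about the relatime default, if it logged one.
dmesg 2>/dev/null | grep "default relative atime updates" \
    || echo "no relatime message found in dmesg"
```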
Comment 5 Ralf W. Grosse-Kunstleve 2008-07-10 22:02:46 EDT
> On a whim, can you add default_relatime=0 to your boot commandline and see if it
> makes any difference?

Will do that over the weekend; the machines are too busy right now.

In the meantime, I did this:

1. install fedora 8 from dvd, starting from scratch (i.e. disk repartitioned)
2. yum install tcsh; adduser ralf, no other customizations
3. run 100*bin/libtbx.scons
4. yum update + reboot
5. run 100*bin/libtbx.scons

Step 3. shows no errors.
(I got stuck at step 4 with this: ERROR with rpm_check_debug
but got around this via: yum remove NetworkManager; yum update)
To my surprise, step 5. also shows no errors!
The kernel version after yum update is still as reported yesterday:
2.6.25.9-40.fc8

The only difference from what I did before is that I skipped the
customizations of the system that I usually do: activate NFS (server & client),
NIS, automount, and some other misc. things. I'll do that asap and will
report what happens.
Comment 6 Ralf W. Grosse-Kunstleve 2008-07-14 19:44:25 EDT
Followup to comment #5: I applied all customizations step by step, rebooting
and testing several times. The machine remained stable all the way to the end.

On Saturday I ran "yum update" on the Fedora 9 machine, and it is stable now,
too, for the first time.

Today I rebuilt another unstable Fedora 8 system from scratch, exactly the
same way as the one in comment #5, except that I did everything in one go
and didn't reboot+test after each step. At the end it was NOT stable.
I have absolutely no clue how this can be, except that "yum update" may
have given me different updates on Friday (first machine) and today
(second machine). But then again, why should it be "NOT OK", "OK", "NOT OK"?
Nothing here really makes sense.

Rebooting the second FC8 machine with the original kernel (2.6.23.1-42.fc8)
made it stable again. This is similar to what I observed on another machine
mentioned in my original posting.

For completeness, eventually I ran rpm --install --force
kernel-2.6.23.15-137.fc8.x86_64.rpm
kernel-devel-2.6.23.15-137.fc8.x86_64.rpm
kernel-headers-2.6.23.15-137.fc8.x86_64.rpm
on the machine I rebuilt today (just because we've been using that kernel
for several months without problems) and it is still stable.

Summary: a few days ago I had three broken machines, today I have them
all fixed somehow in three different ways, and I have zero explanations.

I'll keep my hands off the systems now, quietly hoping that somebody
will somehow find and fix the root cause of the problem, without ever
knowing how much trouble it has caused me.
Comment 7 Eric Sandeen 2008-07-14 19:52:03 EDT
If you wind up with a problematic system again, please do try turning off the
default relatime and see how that goes.

Thanks,
-Eric
Comment 8 Eric Sandeen 2008-07-14 20:03:30 EDT
I'm not sure there's a lot we can do without a reproducer, but please do keep
us up to date, and re-open if you get more info.

Thanks,
-Eric
