Bug 76584 - EXT3 journal starved
Status: CLOSED CURRENTRELEASE
Product: Red Hat Linux
Classification: Retired
Component: kernel
Version: 7.3
Hardware: i686 Linux
Priority: medium
Severity: high
Assigned To: Dave Jones
QA Contact: Brian Brock
Reported: 2002-10-23 14:19 EDT by Ben Woodard
Modified: 2015-01-04 17:02 EST
Doc Type: Bug Fix
Last Closed: 2004-09-30 11:40:06 EDT

Attachments
Patch which evidently fixes the problem (1.01 KB, patch)
2002-10-23 14:20 EDT, Ben Woodard

Description Ben Woodard 2002-10-23 14:19:19 EDT
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.0.1) Gecko/20020918

Description of problem:
From: 	Jim Garlick <garlick@llnl.gov>
To: 	bwoodard@llnl.gov
Subject: 	Analysis of problems with the RMS database on MCRI (fwd)
Date: 	17 Oct 2002 13:54:39 -0700	
Hi Ben -

Could you have a look at this?  The Quadrics guys have reported that crashing
the machine running the RMS database occasionally caused the database to
revert to a state over an hour old.  They have done a nice job of tracking
down the problem to ext3 and provided a kernel patch of what seems to be
reasonable pedigree.  The pertinent summary information follows (included in
context in attached analysis):
--------------------
 - the cause is believed to be that the synchronous writeout flow of data is
starving the writeout of filesystem meta-data, due to a bug in the ext3
filesystem [1].
  -> apparently due to incorrect setting of the buffer head flushtime field
in dirty buffers.

 - Neil Brown found a similar issue and raised it on the ext3-users mailing
list [2].

 - Andrew Morton, core kernel developer, developed a patch to cleanly fix
this issue [1] (attached).

 - attached results of reproduced scenario show complete loss of data
written into database across a period of 1h 40m.
--------------------
Could you open this as a redhat bugzilla so this patch gets thrown into
the redhat patch mill for a future kernel release?  I'll plan to roll this
into the CHAOS kernel (unless you know of a good reason not to).

Thanks,

Jim
---------- Forwarded message ----------
Date: Thu, 17 Oct 2002 18:15:58 +0100
From: Duncan Roweth <duncan@quadrics.com>
To: "Robin Goldstone (E-mail)" <robing@llnl.gov>,
     "Jim Garlick (E-mail)" <garlick@llnl.gov>
Subject: Analysis of problems with the RMS database on MCRI

Robin, Jim

I have attached a copy of our analysis of the problems
seen with the RMS database "going backwards" when mcri
was rebooted.

Have a read and then get back to me.

Best Wishes
Duncan Roweth

Quadrics Limited,               Tel:   +44 (0)117 9075384
One Bridewell Street,           Fax:   +44 (0)117 9075395
Bristol, BS1 2AA,               email: duncan@quadrics.com
                                http://www.quadrics.com/

Version-Release number of selected component (if applicable):


How reproducible:
Always

Steps to Reproduce:
 - many attempts were tried to reproduce this set of problems, with varying
environmental factors.

 - unless all the given conditions were met, very little data loss occurred
(usually 5-10s, up to 30s).

 - a set of scripts was developed to help create the right conditions to
reproduce this problem reliably (attached).

 - conditions required to reproduce problem:
  -> rewriting blocks inside the database file, ie not extending the file
with new blocks.
   - achieved by creating a 'tester' table, flushing many entries to it to
set the maximum extent, deleting all the entries, restarting mSQLd, then
slowly adding entries every 1s. See 'db-create', 'db-populate' and 'db-add'
scripts.

  -> having a process perform continuous synchronous writeout (eg by lots of
synchronous logging) against a file sharing the same journal as the
database, ie on the same filesystem.
   - achieved by aggressively logging to the syslog with logger(1), and
observed with vmstat(8). See 'log-load' script.

  -> uncleanly reboot the system, ie no unmounting, syncing or process
killing. This can be achieved with the 'boot' option with the sysrequest
kernel feature, over serial console or terminal console.

 - able to then harvest information from the database, quantifying data
loss.
  -> achieved by looking at the last readable entries in 'tester' table. See
'db-get' script.
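
The attached scripts are not reproduced in this report. As a rough,
independent sketch of the same workload shape (not the Quadrics scripts),
the following Python rewrites fixed slots inside a preallocated file, issues
synchronous appends to a second file on the same filesystem, and reads back
the newest surviving timestamp. The record layout, file names and slot count
are assumptions made for illustration only.

```python
import os
import struct
import tempfile

RECORD = struct.Struct("<dI")   # assumed layout: (timestamp, sequence number)
SLOTS = 64                      # fixed slot count: blocks are rewritten, never added


def preallocate(path: str) -> None:
    """Create the file at its final size so later writes reuse existing blocks."""
    with open(path, "wb") as f:
        f.write(b"\0" * RECORD.size * SLOTS)
        f.flush()
        os.fsync(f.fileno())


def rewrite_slot(path: str, seq: int, now: float) -> None:
    """Overwrite one slot in place, without O_SYNC (lazy, database-style write)."""
    fd = os.open(path, os.O_WRONLY)
    try:
        os.pwrite(fd, RECORD.pack(now, seq), (seq % SLOTS) * RECORD.size)
    finally:
        os.close(fd)


def sync_log_write(path: str, message: bytes) -> None:
    """Append one line with O_SYNC, so the write reaches disk before returning."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND | os.O_SYNC, 0o644)
    try:
        os.write(fd, message + b"\n")
    finally:
        os.close(fd)


def newest_timestamp(path: str) -> float:
    """Return the latest timestamp found in any slot (0.0 if none was written)."""
    with open(path, "rb") as f:
        data = f.read()
    return max(RECORD.unpack_from(data, i * RECORD.size)[0] for i in range(SLOTS))


def demo() -> tuple:
    """Run a few rounds in a temporary directory and report the resulting state."""
    with tempfile.TemporaryDirectory() as d:
        db = os.path.join(d, "tester.db")
        log = os.path.join(d, "load.log")
        preallocate(db)
        for seq in range(5):
            rewrite_slot(db, seq, 1000.0 + seq)
            sync_log_write(log, b"load")
        return os.path.getsize(db), newest_timestamp(db), os.path.getsize(log)
```

After an unclean reboot, the difference between the reboot time and
newest_timestamp() for the database file approximates the window of lost
data.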

Actual Results:  ext3 journal never gets updated

Expected Results:  ext3 journal gets updated in a timely manner

Additional info:

From: 	Daniel Blueman <daniel.blueman@quadrics.com>
To: 	Duncan Roweth <duncan@quadrics.com>
Cc: 	Robin Crook <robin@quadrics.com>
Subject: 	MCR problem report...
Date: 	17 Oct 2002 16:59:08 +0100	
Problem Analysis on MCR rmshost Dataloss
----------------------------------------
Daniel J Blueman, Quadrics


1. Background

 - rmshost in the MCR cluster, mcri, is known to oops/panic causing severe
dataloss in the RMS database.


2. Observations

 - stracing mSQLd showed that it was calling msync(MS_ASYNC) every 120s as
expected, so the database itself is behaving correctly.

 - the 'vmstat' command showed that there was a continuous stream of
synchronous data being written to the disk, specifically the syslog files in
/var/log. This provided an initial clue about what was causing this problem
to occur. Asynchronous traffic manifests itself by regular bursts of
writeout at 5s intervals, whereas synchronous writeout hits the disk straight
away.
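
To make the msync(MS_ASYNC) observation concrete: the call returns without
waiting for the data to reach disk, so durability depends on the kernel's
background writeout. The sketch below calls msync() through ctypes on a
small mapped file. The MS_ASYNC value of 1 is the Linux one and is an
assumption for other platforms; the helper names are illustrative and have
nothing to do with mSQLd's own code.

```python
import ctypes
import mmap
import os
import tempfile

MS_ASYNC = 1  # value in Linux <sys/mman.h>; assumed, check your platform

_libc = ctypes.CDLL(None, use_errno=True)
_libc.msync.argtypes = [ctypes.c_void_p, ctypes.c_size_t, ctypes.c_int]
_libc.msync.restype = ctypes.c_int


def msync_async(mm: mmap.mmap) -> None:
    """Ask the kernel to write the mapping back, without waiting for completion."""
    view = ctypes.c_char.from_buffer(mm)   # exposes the mapping's start address
    try:
        rc = _libc.msync(ctypes.addressof(view), len(mm), MS_ASYNC)
        err = ctypes.get_errno()
    finally:
        del view                           # release the buffer so mm can be closed
    if rc != 0:
        raise OSError(err, os.strerror(err))


def demo() -> bytes:
    """Modify a mapped file, call msync(MS_ASYNC), and read the bytes back."""
    with tempfile.NamedTemporaryFile() as f:
        f.write(b"\0" * mmap.PAGESIZE)
        f.flush()
        with mmap.mmap(f.fileno(), mmap.PAGESIZE) as mm:
            mm[:5] = b"hello"
            msync_async(mm)
        f.seek(0)
        return f.read(5)
```

Reading the file back only proves the page cache holds the new bytes; it
says nothing about what is on disk, which is exactly the gap this bug
exposes.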

4. Analysis

 - the cause is believed to be that the synchronous writeout flow of data is
starving the writeout of filesystem meta-data, due to a bug in the ext3
filesystem [1].
  -> apparently due to incorrect setting of the buffer head flushtime field
in dirty buffers.

 - Neil Brown found a similar issue and raised it on the ext3-users mailing
list [2].

 - Andrew Morton, core kernel developer, developed a patch to cleanly fix
this issue [1] (attached).

 - attached results of reproduced scenario show complete loss of data
written into database across a period of 1h 40m.
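
The report states only that the flushtime field was set incorrectly; the
actual kernel logic is in the referenced patch [1] and is not reproduced
here. As a toy model of how a wrong flushtime can starve writeout, assume
(hypothetically) that every re-dirtying of a buffer pushes its flushtime
forward, while the fixed policy sets it only when a clean buffer becomes
dirty. MAX_AGE and all names below are invented for the illustration.

```python
from dataclasses import dataclass

MAX_AGE = 30  # assumed: seconds a dirty buffer may wait before being flushed


@dataclass
class Buffer:
    dirty: bool = False
    flushtime: int = 0
    writes_to_disk: int = 0


def mark_dirty(buf: Buffer, now: int, reset_when_already_dirty: bool) -> None:
    """Dirty a buffer; the flag selects the starving or the fixed policy."""
    if not buf.dirty or reset_when_already_dirty:
        buf.flushtime = now + MAX_AGE
    buf.dirty = True


def flusher_tick(buf: Buffer, now: int) -> None:
    """Write the buffer out once its flushtime has passed."""
    if buf.dirty and now >= buf.flushtime:
        buf.dirty = False
        buf.writes_to_disk += 1


def simulate(seconds: int, redirty_every: int, reset_when_already_dirty: bool) -> int:
    """Count how often a repeatedly re-dirtied buffer actually reaches disk."""
    buf = Buffer()
    for now in range(seconds):
        if now % redirty_every == 0:
            mark_dirty(buf, now, reset_when_already_dirty)
        flusher_tick(buf, now)
    return buf.writes_to_disk


starved = simulate(6000, 1, reset_when_already_dirty=True)
healthy = simulate(6000, 1, reset_when_already_dirty=False)
```

With the buffer re-dirtied every second for 6000s (1h 40m), the first policy
never writes it out, while the second writes it roughly every MAX_AGE
seconds.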


5. Solutions

5.1 Workarounds

5.1.1 Moving RMS database to different filesystem

 - move /var/rms/ or /var/rms/msqldb to a different filesystem.
  -> disadvantage: doesn't guarantee that the database will have a journal
to itself.

 - better still, create a journaling filesystem specifically for the
database and/or RMS logs (mke2fs -j /dev/hdaX).

5.1.2 Mounting filesystem with different journaling mode

 - ext3 journaling behaviour may be different with different journal modes.

 - edit the fstab to add the filesystem mount option
'data={writeback|ordered|journal}'.
  -> disadvantage: mounting /var with data=journal will hurt write
performance significantly, and performance will degrade further if there is
a lot of synchronous writeout.
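
As an illustration of the fstab edit, the helper below rewrites the options
field of a single ext3 entry so that it carries exactly one data= option. It
is a string-level sketch only; the function name and the example device and
mount point are assumptions, and it does not touch /etc/fstab itself.

```python
VALID_MODES = ("writeback", "ordered", "journal")


def set_data_mode(fstab_line: str, mode: str) -> str:
    """Return the fstab entry with its data= mount option set to `mode`."""
    if mode not in VALID_MODES:
        raise ValueError(f"unknown ext3 data mode: {mode!r}")
    fields = fstab_line.split()
    # Leave comments, short lines and non-ext3 entries untouched.
    if len(fields) < 4 or fields[0].startswith("#") or fields[2] != "ext3":
        return fstab_line
    options = [o for o in fields[3].split(",") if not o.startswith("data=")]
    options.append(f"data={mode}")
    fields[3] = ",".join(options)
    return "\t".join(fields)


example = set_data_mode("/dev/hda5 /var ext3 defaults 1 2", "journal")
```

The new mode takes effect the next time the filesystem is mounted.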

5.1.3 Enable asynchronous logging with syslog
 - add '-' in front of the log file specifiers in /etc/syslog.conf.
  -> disadvantage: log updates won't be synchronous, therefore 5-10s of log
data may be lost if rmshost oopses.
  -> advantage: logging will consume much less I/O bandwidth.
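
A sketch of the syslog.conf change, assuming the classic "selector action"
line format in which an action that is a file path may be prefixed with '-'
to skip the sync after each message. The helper name is hypothetical, and
paths under /dev/ are left alone as an assumption that they are devices
rather than log files.

```python
def make_async(conf_line: str) -> str:
    """Prefix a file action with '-' so syslogd does not sync after each write."""
    line = conf_line.rstrip("\n")
    stripped = line.strip()
    if not stripped or stripped.startswith("#"):
        return line                      # blank line or comment: unchanged
    parts = line.split(None, 1)          # selector, then the action
    if len(parts) != 2:
        return line
    selector, action = parts[0], parts[1].strip()
    # Only plain file paths qualify; '-' already present means nothing to do.
    if action.startswith("/") and not action.startswith("/dev/"):
        action = "-" + action
    return f"{selector}\t{action}"


example = make_async("*.info;mail.none\t/var/log/messages")
```

syslogd has to re-read its configuration before the change applies.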

5.1.4 Reduce logging from other hosts
 - tweak /etc/syslog.conf file on compute nodes to raise the message
priority filter.

5.1.5 Move logging to another host
 - use a separate 'loghost' system, to collect all logging from compute
nodes.
  -> advantage: reduce load and dependency on rmshost.
  -> advantage: with rmshost failure, logs are still available.

5.1.6 Regular calls to sync(1) or sync(2)
  -> disadvantage: very expensive - will penalise writing to all disks
severely and starve reads; effectively a read-write barrier.
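
To gauge how expensive this workaround is on a given host, a bounded loop
around os.sync() (Python's wrapper for sync(2)) can time each call. The
interval and round count are arbitrary parameters chosen for the sketch, not
recommendations.

```python
import os
import time


def periodic_sync(interval_s: float, rounds: int) -> float:
    """Call sync(2) `rounds` times and return the total seconds spent inside it."""
    total = 0.0
    for _ in range(rounds):
        start = time.monotonic()
        os.sync()                        # flushes all dirty data on all filesystems
        total += time.monotonic() - start
        time.sleep(interval_s)
    return total
```

A large total relative to rounds * interval_s indicates that the sync calls
are taking a significant share of the disk's time.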


5.2 Fixes

5.2.1 Patch ext3 tree in kernel to fix bug
 - patch the kernel to fix the bug in ext3 [1].
 - raise this issue with RedHat in a support contract or otherwise to get
fixed in revised kernel release.


6. Additional recommendations

6.1 Tune I/O subsystem
 - 'hdparm -c1 -u1 -m16 /dev/hda' to enable multiple blocks/interrupt,
32-bit transfers and (importantly) IRQ unmasking.
 - 'elvtune -r2048 -w32768 /dev/hda' to optimise I/O elevator better for
write bandwidth (from better write request merging).

6.2 Migrate frequently used filesystems to different disks
  -> advantage: concurrent disk access greatly reduces journal overhead and
seeking.

6.3 Avoid panic-on-oopsing
 - allows the user to use the sysrequest sequence to sync and unmount disks.
  -> requires: /etc/sysctl.conf must be edited to not disable the sysrequest
feature.
  -> advantage: allows additional safety upon fatal error.
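
A small check for the sysctl.conf requirement, assuming the setting is
expressed with the kernel.sysrq key. The function only parses text handed to
it and returns None when the file does not set the key, since the effective
default then depends on the kernel.

```python
from typing import Optional


def sysrq_enabled(sysctl_conf_text: str) -> Optional[bool]:
    """Return the last kernel.sysrq setting found, or None if it is not set."""
    enabled = None
    for raw in sysctl_conf_text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#;":  # skip blank lines and comments
            continue
        key, sep, value = line.partition("=")
        if sep and key.strip() == "kernel.sysrq":
            try:
                enabled = int(value.strip()) != 0
            except ValueError:
                pass                     # ignore values this sketch cannot parse
    return enabled
```

With the feature enabled, the SysRq sequence 's' (sync), 'u' (remount
read-only), 'b' (reboot) can be issued from the console after an oops.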

6.4 Apply latest ext3 bugfix patch [3]
 - fixes all known outstanding bugs.


7. Notes

The design of the ext3 journaling method by Dr Stephen C Tweedie can be
found at [4]


8. References

 [1] https://listman.redhat.com/pipermail/ext3-users/2002-June/003640.html
 [2] https://listman.redhat.com/pipermail/ext3-users/2002-June/003634.html
 [3] ftp://ftp.kernel.org/pub/linux/kernel/people/sct/ext3/v2.4/ext3-0.9.18-2.4.19pre8.patch
 [4] ftp://ftp.kernel.org/pub/linux/kernel/people/sct/ext3/journal-design.ps.gz

___________________________
Daniel J Blueman
Software Engineer, Quadrics
Comment 1 Ben Woodard 2002-10-23 14:20:34 EDT
Created attachment 81724
Patch which evidently fixes the problem
Comment 2 Stephen Tweedie 2002-11-07 13:25:37 EST
The patch looks good, I'll definitely queue it for our own kernels.

Do users want a test build with this fix?
Comment 3 Ben Woodard 2002-11-07 14:30:57 EST
No, that will not be necessary. We are already running this patch in house.

-ben
Comment 4 Bugzilla owner 2004-09-30 11:40:06 EDT
Thanks for the bug report. However, Red Hat no longer maintains this version of
the product. Please upgrade to the latest version and open a new bug if the problem
persists.

The Fedora Legacy project (http://fedoralegacy.org/) maintains some older releases, 
and if you believe this bug is interesting to them, please report the problem in
the bug tracker at: http://bugzilla.fedora.us/
