Bug 174182

Summary: journal commit starvation
Product: Red Hat Enterprise Linux 3
Component: kernel
Version: 3.0
Hardware: All
OS: Linux
Status: CLOSED NOTABUG
Severity: medium
Priority: medium
Reporter: Bastien Nocera <bnocera>
Assignee: Stephen Tweedie <sct>
QA Contact: Brian Brock <bbrock>
CC: petrides, tao
Doc Type: Bug Fix
Last Closed: 2006-12-05 22:51:58 UTC
Bug Blocks: 170417
Attachments:
test.pl (flags: none)
messages file (flags: none)

Description Bastien Nocera 2005-11-25 14:38:56 UTC
Using the attached testcase, applications using the disk will "freeze" for long
periods of time (tens of seconds).

The same problem doesn't occur when ext2 is the filesystem used.

kernel 2.4.21-37.ELsmp

Full Alt+SysRq+T output is attached below. Selected samples (the dd is replaced
in this case by the attached test program):

Nov 23 11:55:05 maroon kernel: dd            R current   3776  1275   1274 (NOTLB)
Nov 23 11:55:05 maroon kernel: Call Trace:   [<c016787a>] create_buffers [kernel] 0x6a (0xf6917e24)
Nov 23 11:55:05 maroon kernel: [<f8886532>] ext3_get_block [ext3] 0x52 (0xf6917e38)
Nov 23 11:55:05 maroon kernel: [<c016814b>] __block_prepare_write [kernel] 0x1ab (0xf6917e5c)
Nov 23 11:55:05 maroon kernel: [<c0168b09>] block_prepare_write [kernel] 0x39 (0xf6917ea0)
Nov 23 11:55:05 maroon kernel: [<f88864e0>] ext3_get_block [ext3] 0x0 (0xf6917eb4)
Nov 23 11:55:05 maroon kernel: [<f8886bb9>] ext3_prepare_write [ext3] 0xc9 (0xf6917ec0)
Nov 23 11:55:05 maroon kernel: [<f88864e0>] ext3_get_block [ext3] 0x0 (0xf6917ed0)
Nov 23 11:55:05 maroon kernel: [<c014c053>] do_generic_file_write [kernel] 0x1e3 (0xf6917ef4)
Nov 23 11:55:05 maroon kernel: [<c014c5bf>] generic_file_write [kernel] 0x13f (0xf6917f48)
Nov 23 11:55:05 maroon kernel: [<f8883e99>] ext3_file_write [ext3] 0x39 (0xf6917f74)
Nov 23 11:55:05 maroon kernel: [<c0164b27>] sys_write [kernel] 0x97 (0xf6917f94)

Nov 23 11:55:05 maroon kernel: vi            D 00000001  3764  1326   1276 (NOTLB)
Nov 23 11:55:05 maroon kernel: Call Trace:   [<c0124a52>] sleep_on [kernel] 0x52 (0xf2679ee8)
Nov 23 11:55:05 maroon kernel: [<f8879be8>] log_wait_commit_Rsmp_4dfe4007 [jbd] 0x68 (0xf2679f18)
Nov 23 11:55:05 maroon kernel: [<f8874bd3>] journal_stop_Rsmp_74af6844 [jbd] 0x193 (0xf2679f34)
Nov 23 11:55:05 maroon kernel: [<f8873445>] journal_start_Rsmp_25661df5 [jbd] 0xa5 (0xf2679f40)
Nov 23 11:55:05 maroon kernel: [<f8874ccc>] journal_force_commit_Rsmp_2a9443c3 [jbd] 0x7c (0xf2679f64)
Nov 23 11:55:05 maroon kernel: [<f888f091>] ext3_force_commit [ext3] 0x51 (0xf2679f70)
Nov 23 11:55:05 maroon kernel: [<f8883fb4>] ext3_sync_file [ext3] 0x84 (0xf2679f7c)
Nov 23 11:55:05 maroon kernel: [<f8887270>] ext3_writepage [ext3] 0x0 (0xf2679f84)
Nov 23 11:55:05 maroon kernel: [<c01667f8>] sys_fsync [kernel] 0x98 (0xf2679f9c)
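
The two traces above show one process streaming writes (sys_write) while
another is blocked in sys_fsync waiting for a journal commit. The pattern can be
approximated with a sketch along the following lines. This is not the attached
test.pl; the function names, file names, chunk size and byte cap are all
illustrative assumptions. It only measures how long a small fsync takes while a
background writer is busy, and the directory passed to run() has to live on the
filesystem being examined.

```python
import os
import tempfile
import threading
import time


def stream_writes(path, stop, chunk_size=1 << 20, max_bytes=64 << 20):
    """Background writer: append zero-filled chunks until told to stop
    or until max_bytes have been written (cap is an arbitrary assumption)."""
    chunk = b"\0" * chunk_size
    written = 0
    with open(path, "wb") as f:
        while not stop.is_set() and written < max_bytes:
            f.write(chunk)
            written += chunk_size
    return written


def measure_fsync_latency(directory, samples=5, pause=0.1):
    """Rewrite a small file and fsync it; return each fsync's duration in seconds."""
    latencies = []
    path = os.path.join(directory, "small.txt")  # hypothetical file name
    for i in range(samples):
        with open(path, "w") as f:
            f.write("line %d\n" % i)
            f.flush()  # push Python's buffer to the kernel before timing
            start = time.monotonic()
            os.fsync(f.fileno())
            latencies.append(time.monotonic() - start)
        time.sleep(pause)
    return latencies


def run(directory, samples=5):
    """Time small fsync calls while a background thread streams writes."""
    stop = threading.Event()
    big = os.path.join(directory, "big.dat")  # hypothetical file name
    writer = threading.Thread(target=stream_writes, args=(big, stop))
    writer.start()
    try:
        return measure_fsync_latency(directory, samples)
    finally:
        stop.set()
        writer.join()


if __name__ == "__main__":
    # The temporary directory may sit on tmpfs; point run() at a directory
    # on the ext3 (or ext2) mount to compare filesystems.
    with tempfile.TemporaryDirectory() as d:
        for i, seconds in enumerate(run(d)):
            print("fsync %d: %.3f s" % (i, seconds))
```

Running the same script against an ext2 and an ext3 mount would give a rough
side-by-side view of the stall lengths, although the numbers depend heavily on
the hardware, the kernel and how much data the writer manages to queue.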

Comment 1 Bastien Nocera 2005-11-25 14:38:56 UTC
Created attachment 121488 [details]
test.pl

Comment 2 Bastien Nocera 2005-11-25 14:40:34 UTC
Created attachment 121489 [details]
messages file

Comment 3 Stephen Tweedie 2005-11-29 20:28:23 UTC
ext3 is a journaled filesystem.  The journal is a limited resource.  If you
completely fill the journal, then no more writes can be scheduled, at all; ext2
has no such bottleneck simply because it has no journal.  

And your test case is the worst-case scenario because you're forcing ext3 to
flush out large amounts of data for each transaction, bottlenecking the
transaction itself on the data queue.  This may not be particularly pleasant but
it's largely as expected in this case.

It is unlikely we're going to do much work to significantly rebalance the
interactions between ext3 and the VM for RHEL-3 at this stage.  Is there a major
problem being caused here?


Comment 4 Bastien Nocera 2005-11-30 09:14:07 UTC
About 200 users are logged in via ssh to this machine, running text editors, and
it hangs for 10 to 30 seconds whenever snapshots of the database are taken.

Comment 5 Bastien Nocera 2005-11-30 09:15:08 UTC
Would increasing the size of the journal help?

Comment 6 Thomas Uebermeier 2005-11-30 09:23:44 UTC
The original information that ext2 solves the problem is probably wrong.
Although most tests (including this one) were done on RHEL4, the customer also
saw a freeze on an ext2 partition; it just occurred later.

Comment 10 Ernie Petrides 2006-12-05 22:51:58 UTC
Closing based on last comment.