Bug 174182

Summary:

journal commit starvation

Product:

Red Hat Enterprise Linux 3

Reporter:

Bastien Nocera <bnocera>

Component:

kernel

Assignee:

Stephen Tweedie <sct>

Status:

CLOSED NOTABUG

QA Contact:

Brian Brock <bbrock>

Severity:

medium

Docs Contact:

Priority:

medium

Version:

3.0

CC:

petrides, tao

Target Milestone:

---

Target Release:

---

Hardware:

All

OS:

Linux

Whiteboard:

Fixed In Version:

Doc Type:

Bug Fix

Last Closed:

2006-12-05 22:51:58 UTC

Bug Blocks:

170417

Attachments:

Description	Flags
test.pl	none
messages file	none

Description Bastien Nocera 2005-11-25 14:38:56 UTC

While the attached test case is running, other applications that use the disk
"freeze" for long periods of time (tens of seconds).

The same problem doesn't occur when ext2 is the filesystem used.

kernel 2.4.21-37.ELsmp

The full Alt+SysRq+T output is attached below. Selected samples (in this case
the dd was replaced by the attached test program):

Nov 23 11:55:05 maroon kernel: dd            R current   3776  1275   1274                    (NOTLB)
Nov 23 11:55:05 maroon kernel: Call Trace:   [<c016787a>] create_buffers [kernel] 0x6a (0xf6917e24)
Nov 23 11:55:05 maroon kernel: [<f8886532>] ext3_get_block [ext3] 0x52 (0xf6917e38)
Nov 23 11:55:05 maroon kernel: [<c016814b>] __block_prepare_write [kernel] 0x1ab (0xf6917e5c)
Nov 23 11:55:05 maroon kernel: [<c0168b09>] block_prepare_write [kernel] 0x39 (0xf6917ea0)
Nov 23 11:55:05 maroon kernel: [<f88864e0>] ext3_get_block [ext3] 0x0 (0xf6917eb4)
Nov 23 11:55:05 maroon kernel: [<f8886bb9>] ext3_prepare_write [ext3] 0xc9 (0xf6917ec0)
Nov 23 11:55:05 maroon kernel: [<f88864e0>] ext3_get_block [ext3] 0x0 (0xf6917ed0)
Nov 23 11:55:05 maroon kernel: [<c014c053>] do_generic_file_write [kernel] 0x1e3 (0xf6917ef4)
Nov 23 11:55:05 maroon kernel: [<c014c5bf>] generic_file_write [kernel] 0x13f (0xf6917f48)
Nov 23 11:55:05 maroon kernel: [<f8883e99>] ext3_file_write [ext3] 0x39 (0xf6917f74)
Nov 23 11:55:05 maroon kernel: [<c0164b27>] sys_write [kernel] 0x97 (0xf6917f94)

Nov 23 11:55:05 maroon kernel: vi            D 00000001  3764  1326   1276                    (NOTLB)
Nov 23 11:55:05 maroon kernel: Call Trace:   [<c0124a52>] sleep_on [kernel] 0x52 (0xf2679ee8)
Nov 23 11:55:05 maroon kernel: [<f8879be8>] log_wait_commit_Rsmp_4dfe4007 [jbd] 0x68 (0xf2679f18)
Nov 23 11:55:05 maroon kernel: [<f8874bd3>] journal_stop_Rsmp_74af6844 [jbd] 0x193 (0xf2679f34)
Nov 23 11:55:05 maroon kernel: [<f8873445>] journal_start_Rsmp_25661df5 [jbd] 0xa5 (0xf2679f40)
Nov 23 11:55:05 maroon kernel: [<f8874ccc>] journal_force_commit_Rsmp_2a9443c3 [jbd] 0x7c (0xf2679f64)
Nov 23 11:55:05 maroon kernel: [<f888f091>] ext3_force_commit [ext3] 0x51 (0xf2679f70)
Nov 23 11:55:05 maroon kernel: [<f8883fb4>] ext3_sync_file [ext3] 0x84 (0xf2679f7c)
Nov 23 11:55:05 maroon kernel: [<f8887270>] ext3_writepage [ext3] 0x0 (0xf2679f84)
Nov 23 11:55:05 maroon kernel: [<c01667f8>] sys_fsync [kernel] 0x98 (0xf2679f9c)
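
For readers without access to the attachments, the pattern in the two traces (one process streaming writes while another blocks in fsync) can be approximated with a small probe. This is only a sketch and is not the attached test.pl: the function names, file names, chunk size, and the 64 MiB cap on the bulk file are all assumptions chosen for illustration.

```python
import os
import tempfile
import threading
import time


def bulk_writer(path, stop, chunk=1 << 20, limit=64 << 20):
    """Stream zero-filled chunks into `path` until `stop` is set.

    Stands in for the dd-style writer. The file is rewritten from the
    start once it reaches `limit` bytes so the probe cannot fill the disk.
    """
    buf = b"\0" * chunk
    with open(path, "wb") as f:
        while not stop.is_set():
            if f.tell() >= limit:
                f.seek(0)
            f.write(buf)


def fsync_latencies(path, samples=5, interval=0.1):
    """Append a small record and fsync it, roughly as an editor does on save.

    Returns the wall-clock duration of each write+fsync pair in seconds.
    """
    latencies = []
    with open(path, "ab") as f:
        for _ in range(samples):
            start = time.monotonic()
            f.write(b"x" * 512)
            f.flush()
            os.fsync(f.fileno())
            latencies.append(time.monotonic() - start)
            time.sleep(interval)
    return latencies


def run_probe(directory=None, samples=5, limit=64 << 20):
    """Measure fsync latency while a bulk writer runs on the same filesystem.

    A scratch directory is created inside `directory` (or the system temp
    directory when None) and removed afterwards.
    """
    with tempfile.TemporaryDirectory(dir=directory) as work:
        stop = threading.Event()
        writer = threading.Thread(
            target=bulk_writer,
            args=(os.path.join(work, "bulk.dat"), stop),
            kwargs={"limit": limit},
        )
        writer.start()
        try:
            return fsync_latencies(os.path.join(work, "small.txt"), samples)
        finally:
            stop.set()
            writer.join()
```

Calling `run_probe("/path/on/ext3")` and then the same on an ext2 mount (both paths hypothetical) would give two lists of latencies to compare. A capped, rewritten file is a much lighter load than a real database snapshot, so the probe may well not reproduce stalls of the reported length.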

Comment 1 Bastien Nocera 2005-11-25 14:38:56 UTC

Created attachment 121488 [details]
test.pl

Comment 2 Bastien Nocera 2005-11-25 14:40:34 UTC

Created attachment 121489 [details]
messages file

Comment 3 Stephen Tweedie 2005-11-29 20:28:23 UTC

ext3 is a journaled filesystem.  The journal is a limited resource.  If you
completely fill the journal, then no more writes can be scheduled, at all; ext2
has no such bottleneck simply because it has no journal.  

And your test case is the worst-case scenario because you're forcing ext3 to
flush out large amounts of data for each transaction, bottlenecking the
transaction itself on the data queue.  This may not be particularly pleasant but
it's largely as expected in this case.

It is unlikely we're going to do much work to significantly rebalance the
interactions between ext3 and the VM for RHEL-3 at this stage.  Is there a major
problem being caused here?
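
As a back-of-the-envelope illustration of the point above, assume the filesystem is mounted in ext3's default data=ordered mode, where file data attached to a transaction has to reach disk before that transaction can commit. A process waiting on the commit (the vi in fsync, for example) then waits at least as long as the data flush takes. The function name and the figures below are hypothetical and were not measured on the reporter's system.

```python
def commit_delay_seconds(dirty_data_mib, disk_mib_per_s):
    """Lower bound on commit time when the commit must first flush data.

    dirty_data_mib: data attached to the running transaction (assumed).
    disk_mib_per_s: sustained write throughput of the disk (assumed).
    """
    if disk_mib_per_s <= 0:
        raise ValueError("disk throughput must be positive")
    return dirty_data_mib / disk_mib_per_s


# Hypothetical figures: 600 MiB of pending data on a 30 MiB/s disk.
print(commit_delay_seconds(600, 30))  # 20.0
```

Under these assumed numbers the wait is on the order of the "tens of seconds" described in the report. The model ignores seek overhead, competing I/O, and journal size, so it is a rough lower bound at best.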

Comment 4 Bastien Nocera 2005-11-30 09:14:07 UTC

About 200 users are logged in to this machine via ssh, running text editors, and
it hangs for between 10 and 30 seconds whenever a snapshot of the database is
taken.

Comment 5 Bastien Nocera 2005-11-30 09:15:08 UTC

Would increasing the size of the journal help?

Comment 6 Thomas Uebermeier 2005-11-30 09:23:44 UTC

The original information, that using ext2 solves the problem, is probably wrong.
Although most tests (including this one) were done on RHEL4, the customer also
saw a freeze on an ext2 partition; it just appeared later.

Comment 10 Ernie Petrides 2006-12-05 22:51:58 UTC

Closing based on last comment.