Bug 746812 - [RHEL4][PATCH] Assertion failure in journal_start() at fs/jbd/transaction.c:274: "handle->h_transaction->t_journal == journal"
Summary: [RHEL4][PATCH] Assertion failure in journal_start() at fs/jbd/transaction.c:274: "handle->h_transaction->t_journal == journal"
Keywords:
Status: CLOSED DUPLICATE of bug 740642
Alias: None
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: kernel
Version: 4.9
Hardware: All
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Red Hat Kernel Manager
QA Contact: Red Hat Kernel QE team
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2011-10-17 20:25 UTC by Frank Hirtz
Modified: 2011-10-17 20:37 UTC
CC List: 2 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-10-17 20:28:14 UTC
Target Upstream Version:


Attachments
Patch to correct the issue (1.58 KB, patch)
2011-10-17 20:25 UTC, Frank Hirtz
no flags

Description Frank Hirtz 2011-10-17 20:25:16 UTC
Created attachment 528650 [details]
Patch to correct the issue

Description of problem:

Here's the panic on the console:

Assertion failure in journal_start() at fs/jbd/transaction.c:274: "handle->h_transaction->t_journal == journal"
----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at transaction:274
invalid operand: 0000 [1] SMP
CPU 1
Modules linked in: nfs lockd nfs_acl autofs4 md5 ipv6 i2c_dev i2c_core sunrpc ipmi_devintf ipmi_si ipmi_msghandler openafs(U) arpt_mangle arptable_filter arp_tables ds yenta_socket pcmcia_core scsi_dump diskdump zlib_deflate vxspec(U) vxio(U) vxdmp(U) joydev dm_mod button battery ac ohci_hcd hw_random shpchp tg3 mptscsih mptfc mptsas mptspi mptscsi mptbase ext3 jbd sd_mod scsi_mod
Pid: 5006, comm: zapplet Tainted: PF     2.6.9-34msdw1.ELsmp
RIP: 0010:[<ffffffffa002c446>] <ffffffffa002c446>{:jbd:journal_start+95}
RSP: 0018:00000103c824f218  EFLAGS: 00010212
RAX: 0000000000000073 RBX: 000001036eae4d30 RCX: ffffffff803d9f48
RDX: ffffffff803d9f48 RSI: 0000000000000246 RDI: ffffffff803d9f40
RBP: 00000100edf3e200 R08: ffffffff803d9f48 R09: 000001036eae4d30
R10: 0000000100000000 R11: ffffffff8011e060 R12: 0000000000000002
R13: 00000100edf28800 R14: 00000103c824f3d8 R15: 0000000000001000
FS:  0000002a958a0b00(0000) GS:ffffffff804d7d00(005b) knlGS:000000005d718bb0
CS:  0010 DS: 002b ES: 002b CR0: 000000008005003b
CR2: 0000000055908d41 CR3: 00000000edf98000 CR4: 00000000000006e0
Process zapplet (pid: 5006, threadinfo 00000103c824e000, task 00000103d0cb07f0)
Stack: 00000103d79e7c18 000001036eae4d30 00000103d79e7c18 ffffffffa004476b
       00000103d79e7c18 0000000000000001 0000000000003000 ffffffff80196054
       0000000045bf5493 00000103d79e7c18
Call Trace:<ffffffffa004476b>{:ext3:ext3_dirty_inode+40} <ffffffff80196054>{__mark_inode_dirty+40}
       <ffffffff8018f6ff>{update_atime+147} <ffffffff801599e2>{do_generic_mapping_read+997}
       <ffffffff801599f4>{file_read_actor+0} <ffffffff8015b5bc>{__generic_file_aio_read+385}
       <ffffffff8015b7a0>{generic_file_aio_read+48} <ffffffff80177a08>{do_sync_read+173}
       <ffffffff80134e72>{autoremove_wake_function+0} <ffffffff80177606>{generic_file_llseek+46}
       <ffffffffa0362b7f>{:openafs:osi_rdwr+255} <ffffffffa033b87f>{:openafs:afs_UFSRead+3407}
       <ffffffffa0367f2d>{:openafs:afs_linux_readpage+413}
       <ffffffff80134e72>{autoremove_wake_function+0} <ffffffff8015f94a>{read_pages+171}
       <ffffffff8015fc92>{do_page_cache_readahead+319} <ffffffff80159e3d>{filemap_nopage+338}
       <ffffffff80134e44>{finish_wait+52} <ffffffff80168291>{do_no_page+1003}
       <ffffffff8016860f>{handle_mm_fault+343} <ffffffff80304b5d>{thread_return+88}
       <ffffffff8024abe5>{__make_request+1351} <ffffffff80123752>{do_page_fault+518}
       <ffffffff8030546f>{io_schedule+38} <ffffffff80178cde>{__wait_on_buffer+143}
       <ffffffff80178b52>{bh_wake_function+0} <ffffffff80110c61>{error_exit+0}
       <ffffffff801ea257>{copy_user_generic_c+13} <ffffffff8015a4b8>{__filemap_copy_from_user_iovec+74}
       <ffffffff8015a94b>{generic_file_buffered_write+794}
       <ffffffffa0049568>{:ext3:__ext3_journal_stop+31} <ffffffff80196112>{__mark_inode_dirty+230}
       <ffffffff8015afb6>{__generic_file_aio_write_nolock+731}
       <ffffffff8015b088>{__generic_file_write_nolock+158}
       <ffffffff801592a4>{__lock_page+206} <ffffffff80168491>{do_no_page+1515}
       <ffffffff80134e72>{autoremove_wake_function+0} <ffffffff8015b102>{generic_file_writev+97}
       <ffffffff80177b38>{do_sync_write+0} <ffffffff8019da70>{compat_do_readv_writev+447}
       <ffffffff80134e72>{autoremove_wake_function+0} <ffffffff80189873>{poll_freewait+64}
       <ffffffff8018a4c6>{sys_poll+783} <ffffffff80177b16>{vfs_read+226}
       <ffffffff8019dbd9>{compat_sys_writev+95} <ffffffff80125497>{cstar_do_call+27}

Code: 0f 0b 16 42 03 a0 ff ff ff ff 12 01 ff 43 0c e9 98 00 00 00
RIP <ffffffffa002c446>{:jbd:journal_start+95} RSP <00000103c824f218>

To understand the issue, let's walk through the way the algorithm works, both with and without the patch. For simplicity, let's assume we're doing a writev() of an array of 8 struct iovecs, each with length 1024. Let's also assume that the pages backing the buffers of the 8 iovecs are all not present and will be fetched during a page fault, and that we're writing to an empty file, which means offset = 0.
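Purely for illustration, here is a minimal userspace sketch of the scenario assumed above (the file name and the use of untouched anonymous mappings are my own choices; on its own this does not reproduce the crash, which needs the OpenAFS interaction described further down):

#define _GNU_SOURCE
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/uio.h>

#define NSEGS  8
#define SEGLEN 1024

int main(void)
{
        struct iovec iov[NSEGS];
        int i, fd;

        fd = open("testfile", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) {
                perror("open");
                return 1;
        }
        for (i = 0; i < NSEGS; i++) {
                /* Anonymous mappings that are never touched before the
                 * writev(), so their pages are not yet present and have to
                 * be faulted in while the kernel copies from them. */
                iov[i].iov_base = mmap(NULL, SEGLEN, PROT_READ | PROT_WRITE,
                                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
                if (iov[i].iov_base == MAP_FAILED) {
                        perror("mmap");
                        return 1;
                }
                iov[i].iov_len = SEGLEN;
        }
        if (writev(fd, iov, NSEGS) < 0)
                perror("writev");
        close(fd);
        return 0;
}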
.
Without the patch, this is what happens:
1. Function generic_file_buffered_write is called.
2. The #do { ... } while (count)# loop is entered
3. bytes is calculated basically as PAGE_CACHE_SIZE - offset. In other words, bytes <= 4096 always. As our offset is 0, bytes will be equal to PAGE_CACHE_SIZE which is 4096.
4. maxlen is calculated from the first iovec, in our case it will be 1024.
5. fault_in_pages_readable(iovec[0], 1024) will be called. This function basically reads the first and last byte of the buffer to trigger the page faults and lets the kernel do the rest. As maxlen <= bytes always, fault_in_pages_readable will only ever fault in one or two pages, which is enough to hold the whole buffer in memory (see the sketch after this list).
---> iovec[0].iov_base buffer will be faulted here.
6. ext3->prepare_write is called here; this starts a journalling operation. If it gets called again anywhere below this point, we'll hit the panic from the assertion in journal_start().
7. filemap_copy_from_user_iovec is called, passing the iovec array and bytes = 4096! That's correct: we want to fill the whole output page here, reading as many entries from the iovec as needed.
8. It will copy the buffer from iovec[0] to the destination page, but 3072 bytes will remain to be written, so it will move on.
9. It will copy the buffer from iovec[1] to the destination page, but it's not present so it will trigger a page fault here.
---> iovec[1].iov_base buffer will be faulted here.
10. We wrote 2048 bytes, but there are still 2048 bytes to go, so we'll move on. We'll now copy the buffer from iovec[2] to the destination page, but it's not present so it will trigger yet another page fault here.
---> iovec[2].iov_base buffer will be faulted here.
11. Almost there, 3072 written, 1024 to go, we'll move on to iovec[3], this will be our last one.
---> iovec[3].iov_base buffer will be faulted here.
12. Done! We copied 4096 bytes to our destination page, but in the process we faulted 3 extra pages.
13. We're back in generic_file_buffered_write; as the destination page is now full, ext3->commit_write will be called, which ends the journalling transaction. It will also move on to iovec[4], as that is the next one to be written.
14. generic_file_buffered_write will allocate another page, now mapped to offset = 4096 of the destination file. Recalculate bytes = 8192 - 4096 = 4096, maxlen = 1024 (length of iovec[4]).
15. fault_in_pages_readable(iovec[4], 1024) will be called.
---> iovec[4].iov_base buffer will be faulted here.
16. ext3->prepare_write is called here, this starts a new journalling operation for the next 4096 bytes.
17. filemap_copy_from_user_iovec is called passing the &iovec[4] and bytes = 4096! We want to fill in this page as well.
18. It will copy the buffer from iovec[4] to the destination page, but 3072 bytes will remain to be written.
19. It will copy the buffer from iovec[5] to the destination page, but it's not present so it will trigger a page fault here.
---> iovec[5].iov_base buffer will be faulted here.
20. Still 2048 bytes to go, we'll now copy the buffer from iovec[6] to the destination page, but it's not present so it will trigger yet another page fault here.
---> iovec[6].iov_base buffer will be faulted here.
21. Almost there, 3072 written, 1024 to go, we'll move on to iovec[7], this will be our last one.
---> iovec[7].iov_base buffer will be faulted here.
22. Done! We copied another 4096 bytes to our destination page, but in the process we faulted another 3 pages.
23. ext3->commit_write is done; we finish this journalling operation as well.
24. count = 0, nothing remaining to write, so we just do final clean up and bail out.
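For reference, here is a simplified sketch of the relevant part of the unpatched loop, written from memory of the 2.6.9-era generic_file_buffered_write() (locals, error handling and the single-segment fast path omitted), just to show where the pre-fault happens relative to prepare_write:

/* Simplified sketch, not the literal RHEL4 source. */
do {
        unsigned long offset = pos & (PAGE_CACHE_SIZE - 1);
        size_t bytes = PAGE_CACHE_SIZE - offset;
        unsigned long maxlen;

        if (bytes > count)
                bytes = count;

        /* Pre-fault only the *current* iovec's buffer (1024 bytes in our
         * example), not everything the copy below will read. */
        maxlen = cur_iov->iov_len - iov_base;
        if (maxlen > bytes)
                maxlen = bytes;
        fault_in_pages_readable(buf, maxlen);

        page = __grab_cache_page(mapping, index, &cached_page, &lru_pvec);

        /* Starts the journal handle (journal_start). */
        status = a_ops->prepare_write(file, page, offset, offset + bytes);

        /* Copies 'bytes' (4096) from as many iovecs as needed; in our
         * example iovec[1]..iovec[3] are still not present, so this
         * faults them in while the journal handle is open. */
        copied = filemap_copy_from_user_iovec(page, offset,
                                              cur_iov, iov_base, bytes);

        /* Ends the journal handle (journal_stop). */
        status = a_ops->commit_write(file, page, offset, offset + bytes);

        /* ... advance pos/count/cur_iov/iov_base by 'copied' ... */
} while (count);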
.
Now, compare with what happens with the patch:
1. Function generic_file_buffered_write is called.
2. The #do { ... } while (count)# loop is entered
3. bytes is calculated basically as PAGE_CACHE_SIZE - offset. In other words, bytes <= 4096 always. As our offset is 0, bytes will be equal to PAGE_CACHE_SIZE which is 4096.
4. maxlen is calculated from the first iovec, in our case it will be 1024.
5. fault_in_iovec_pages_readable(iovec, 4096) will be called! This function will call fault_in_pages_readable for as many iovecs as needed to fulfill 4096 bytes; in our case, 4 of them (see the sketch after this list).
---> iovec[0].iov_base buffer will be faulted here.
---> iovec[1].iov_base buffer will be faulted here.
---> iovec[2].iov_base buffer will be faulted here.
---> iovec[3].iov_base buffer will be faulted here.
6. ext3->prepare_write is called here; this starts a journalling operation. If it gets called again anywhere below this point, we'll hit the panic from the assertion in journal_start().
7. filemap_copy_from_user_iovec is called passing the iovec and bytes = 4096.
8. It will copy the buffer from iovec[0] to the destination page, but 3072 bytes will remain to be written, so it will move on.
9. It will copy the buffer from iovec[1] to the destination page, this will already be present in memory so no page fault is typically triggered here!
10. It will copy the buffer from iovec[2] to the destination page, this will also already be present and no page fault triggered!
11. Almost there, 3072 written, 1024 to go, we'll move on to iovec[3], this will be our last one. No page fault triggered here either.
12. Done! We copied 4096 bytes to our destination page. We didn't trigger any additional page faults. [* see below]
13. We're back in generic_file_buffered_write; as the destination page is now full, ext3->commit_write will be called, which ends the journalling transaction. It will also move on to iovec[4], as that is the next one to be written.
14. generic_file_buffered_write will allocate another page, now mapped to offset = 4096 of the destination file. Recalculate bytes = 8192 - 4096 = 4096, maxlen = 1024 (length of iovec[4]).
15. fault_in_iovec_pages_readable(&iovec[4], 4096) will be called. It will fetch the pages for the next 4 buffers in order to fulfill the request for 4096 bytes.
---> iovec[4].iov_base buffer will be faulted here.
---> iovec[5].iov_base buffer will be faulted here.
---> iovec[6].iov_base buffer will be faulted here.
---> iovec[7].iov_base buffer will be faulted here.
16. ext3->prepare_write is called here, this starts a new journalling operation for the next 4096 bytes.
17. filemap_copy_from_user_iovec is called passing the &iovec[4] and bytes = 4096 as we want to fill in this page as well.
18. It will copy the buffer from iovec[4] to the destination page, but 3072 bytes will remain to be written.
19. It will copy the buffer from iovec[5] to the destination page, no page fault triggered here.
20. It will copy the buffer from iovec[6] to the destination page, no page fault triggered here.
21. It will copy the buffer from iovec[7] to the destination page, no page fault triggered here.
22. Done! We copied another 4096 bytes to our destination page, no page faults triggered here either. [* see below]
23. ext3->commit_write is done; we finish this journalling operation as well.
24. count = 0, nothing remaining to write, so we just do final clean up and bail out.
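I haven't pasted the attached patch inline, but based on the description above the new helper presumably has roughly this shape (hypothetical reconstruction, not copied from the attachment; the real version also has to deal with a partially consumed first iovec):

/*
 * Hypothetical reconstruction: fault in enough of the user buffers
 * described by 'iov' to cover 'bytes' bytes, so that the copy done
 * between prepare_write and commit_write normally takes no page faults.
 */
static void fault_in_iovec_pages_readable(const struct iovec *iov,
                                          size_t bytes)
{
        while (bytes) {
                size_t len = iov->iov_len;

                if (len > bytes)
                        len = bytes;
                fault_in_pages_readable(iov->iov_base, len);
                bytes -= len;
                iov++;
        }
}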
.
As you can see, the amount of work and the number of pages faulted in is exactly the same in both algorithms. The basic difference is that, with the patch, the pages are faulted in *BEFORE* calling ext3->prepare_write. This makes a big difference for us because an AFS read/readpage might end up triggering an ext3->prepare_write itself, so it would be called in a context where it's not allowed and will trigger the bug.
.
The reason why an AFS read might trigger an ext3 write is that AFS keeps a cache in a local filesystem; here we implement it with an ext3 filesystem mounted under /afscore. In a situation where AFS's memory and disk caches are full, for an AFS read to succeed it has to flush a file from the memory cache to the filesystem cache, and it also has to flush an old entry from the filesystem cache, so it ends up triggering an ext3 write (ext3->prepare_write, which is where the kernel panics).
.
Now, with the patched kernel, under a lot of memory pressure it's possible that, between the call to fault_in_iovec_pages_readable and the point in filemap_copy_from_user_iovec where a given buffer is touched, the page holding that buffer is discarded, in which case another page fault for that same page will happen. Of course, that's far from ideal; in fact, more recent kernels disable page faults at that point and are prepared to retry the operation if one of the pages is not present.
.
However, for our purposes, I guess we're OK with allowing the retry there. In fact, it can already happen for the first page, which is faulted in the original algorithm too. Even in a very high memory pressure situation where the page that was just faulted in has already been flushed, it's really unlikely that the data will also have been flushed from AFS's memory and filesystem caches, so the chance that this ends up triggering the AFS read that triggers an ext3 write is so small that we probably won't see it until our last RHEL4 machine is decommissioned... In any case, even if we do see it, it would be many orders of magnitude less frequent than the problem we're seeing right now.
.
Now, regarding adoption of this patch, I'm not really sure whether I'd push for it in a "standard" RHEL4 kernel; I think it's OK for us to keep it only in the EL.msdw.* kernel. It's not very probable that this problem affects any modules other than OpenAFS; the fact that OpenAFS has such a close relationship with an ext3 filesystem really makes it unique in that sense. Also, considering that upstream and more recent kernels went in a completely different direction, it's probably not the best idea to make this part of the default kernel anyway.
.
There is, of course, an alternative. There is a patch that fixes this issue more in line with what is in more recent kernels: it pre-faults one page only, but writes to the destination page with page faults disabled. If it needs a page fault, it commits the write so far and retries the operation later, starting from the page that was not present (pre-faulting it before the next write, of course, to guarantee eventual success). A rough sketch of that approach follows the links below.
The patch started on mailing list at this post:
https://lkml.org/lkml/2006/10/13/141
And eventually it was committed to the kernel in this commit:
http://git.kernel.org/?p=linux/kernel/git/next/linux-next.git;a=commitdiff;h=08291429cfa6258c4cd95d8833beb40f828b194e
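For comparison, the shape of that upstream approach is roughly the following (a loose sketch based on the description above and on later kernels, not a copy of the actual commit):

/* Loose sketch of the upstream idea: prefault only the current segment,
 * copy with page faults disabled, commit whatever was copied, and let
 * the next loop iteration retry (and prefault) the part that faulted. */
do {
        size_t copied;

        fault_in_pages_readable(buf, bytes);    /* best effort */

        status = a_ops->prepare_write(file, page, offset, offset + bytes);

        kaddr = kmap_atomic(page, KM_USER0);
        pagefault_disable();
        copied = bytes - __copy_from_user_inatomic(kaddr + offset,
                                                   buf, bytes);
        pagefault_enable();
        kunmap_atomic(kaddr, KM_USER0);

        /* Commit only what actually made it into the page ... */
        status = a_ops->commit_write(file, page, offset, offset + copied);

        /* ... and go around again; if 'copied' fell short because a source
         * page was missing, the next iteration faults it in and retries. */
        buf += copied;
        pos += copied;
        count -= copied;
} while (count);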
.
Neither of them applies cleanly to the RHEL4 kernel. By the time the patch was committed upstream, that filemap.c file already looked completely different, so it would be hard to backport. The version from the mailing list did not apply cleanly either, but it doesn't look very different, so it might be possible to backport it from there. However, I don't know whether there were issues with it that caused it to go through such a transformation before actually being applied upstream. In any case, I think that would only introduce more risk, as the change is much more profound than the patch to pre-fault the pages... The option is there, though.


Version-Release number of selected component (if applicable):


How reproducible:
Sporadic in the client's environment

Steps to Reproduce:
1. Use RHEL4
2. Use Openafs
3. Wait
  
Actual results:


Expected results:


Additional info:

Comment 1 Frank Hirtz 2011-10-17 20:28:14 UTC

*** This bug has been marked as a duplicate of bug 740642 ***

