Bug 429092

Summary:	GFS: generic_file_write_nolock() called with opened journal transaction
Product:	Red Hat Enterprise Linux 5	Reporter:	Monakhov Dmitriy <dmonakhov>
Component:	gfs-kmod	Assignee:	Robert Peterson <rpeterso>
Status:	CLOSED WONTFIX	QA Contact:	Cluster QE <mspqa-list>
Severity:	high	Docs Contact:
Priority:	low
Version:	5.0	CC:	adas, bmarzins, nstraz, rwheeler, swhiteho, vvs
Target Milestone:	---
Target Release:	5.5
Hardware:	All
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2009-09-30 14:33:52 UTC	Type:	---
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Monakhov Dmitriy 2008-01-17 08:10:40 UTC

Description of problem:
Gfs writing codepath calling generic_file_write_nolock() with opened journal
transaction. This is strongly prohibited actions because most probably result in
VM related actions such as balance_dirty_pages_ratelimited_nr which result in 
switching between different fs superblocks while current->journal != NULL.
Sooner or later this result in BUG() triggering or EIO and other bad things.
In my case i've got EIO from ext3_write_inode() and droping DIRTY inode flag
with following calltrace:
 Call Trace:
  [<ffffffff880521b6>] :ext3:ext3_write_inode+0x2b/0x47
  [<ffffffff80030db2>] __writeback_single_inode+0x1e3/0x322
  [<ffffffff80023af4>] mempool_alloc+0x3a/0xf0
  [<ffffffff80021610>] sync_sb_inodes+0x1a9/0x267
  [<ffffffff800514ba>] writeback_inodes+0x82/0xd8
  [<ffffffff800c85dc>] balance_dirty_pages_ratelimited_nr+0x11a/0x1fa
  [<ffffffff8000fa2f>] generic_file_buffered_write+0x56c/0x6a0
  [<ffffffff80016118>] __generic_file_aio_write_nolock+0x36c/0x3b8
  [<ffffffff883f4aae>] :gfs:gfs_dreread+0x87/0xc7
  [<ffffffff883f4b26>] :gfs:gfs_dread+0x38/0x57
  [<ffffffff800c5822>] generic_file_aio_write_nolock+0x20/0x6c

  [<ffffffff800c5bee>] generic_file_write_nolock+0x8f/0xa8
  [<ffffffff8009b42b>] autoremove_wake_function+0x0/0x2e
  [<ffffffff8841c890>] :gfs:gfs_trans_begin_i+0x13c/0x1b2
  [<ffffffff8840f769>] :gfs:do_write_buf+0x455/0x690
  [<ffffffff8840ef68>] :gfs:walk_vm+0x110/0x317
  [<ffffffff8840f314>] :gfs:do_write_buf+0x0/0x690
  [<ffffffff8002fea8>] __clear_user+0x12/0x50
  [<ffffffff8840f21b>] :gfs:__gfs_write+0xac/0xc6
  [<ffffffff800164fb>] vfs_write+0xad/0x153
  [<ffffffff80016b98>] sys_write+0x49/0xbf
  [<ffffffff8005e166>] system_call+0x7e/0x83


testcase: in fact test case is fantastically trivial, i'm wondering why nobody
catch it before. write big file to gfs fs (file_size >= 2 * ram_size )
dd if=/dev/zero of=/mnt_gfs bs=1M count=10240

buggy place:
do_do_write_buf
...
   if (alloc_required) {
...      gfs_trans_begin(..)
   }
         struct iovec local_iov = { .iov_base = buf, .iov_len = size };

                if (!iocb) {
                        count = generic_file_write_nolock(file, &local_iov, 1,
offset);
<<                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 
 << generic_file_xxx_write_nolock called with runing transaction(curent->journal
!= NULL). this is absolutaly prohobited.  
            } else {
                        count = generic_file_aio_write_nolock(iocb,
                                &local_iov, 1, offset);
                        if (count == -EIOCBQUEUED)
                                count = wait_on_sync_kiocb(iocb);
                }
}

Comment 3 Steve Whitehouse 2008-02-06 11:47:34 UTC

Is this really RHEL 4? balance_dirty_pages_ratelimited_nr doesn't appear to
exist in RHEL4, but only in RHEL-5. Can you confirm the kernel version?

Comment 4 Monakhov Dmitriy 2008-02-06 11:53:44 UTC

yes it was RHEL5

Comment 6 Robert Peterson 2008-02-07 15:19:46 UTC

Well, gfs1 isn't part of the base kernel, so there should be no need
to post the change to rhkernel-list.  So no resistance.

Comment 8 Robert Peterson 2008-02-07 19:55:47 UTC

Dmitriy, I tried the dd command you recommended to recreate the problem,
but GFS did not fail.  We can see the problem and we're researching the
best way to fix it.  IMHO, there is not a "good" way to fix it because of
the lock ordering required by GFS.  So we are trying to find a viable
alternative that will make it work properly.  By the way, GFS2 does not
have this problem.

Comment 9 Monakhov Dmitriy 2008-02-08 05:57:04 UTC

>We can see the problem and we're researching the best way to fix it.
As a temporary solution gfs_writei() may be used for all cases. For example 
following trivial patch has fixed my issue, but with significant performance
drawback, because gfs_writei() optimized for journaled mode.

--- ./fs/gfs/ops_file.c~orig    2008-02-08 08:48:19.000000000 +0300
+++ ./fs/gfs/ops_file.c 2008-02-08 08:49:07.000000000 +0300
@@ -851,7 +851,7 @@ do_do_write_buf(struct file *file, char 
                brelse(dibh);
        }
 
-       if (journaled ||
+       if (1 || journaled || /* always use this branch */
            (gfs_is_stuffed(ip) && !test_bit(GIF_PAGED, &ip->i_flags) &&
             *offset + size <= sdp->sd_sb.sb_bsize - sizeof(struct gfs_dinode))) {
 
>By the way, GFS2 does no thave this problem
Yes, i do know about this, but unfortunately IMHO currently GFS2 is not ready
for production.

Comment 13 Robert Peterson 2009-09-30 14:33:52 UTC

No one has complained about this problem and no one has been able
to recreate it for a very long time.  Changing the code to fix it
would only put customers at risk.  If the problem is recreatable,
we can re-address the issue at that time.  I'm closing as WONTFIX.