Bug 783063 - tasks blocked forever on sync-mounted ext2 volume
Summary: tasks blocked forever on sync-mounted ext2 volume
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 16
Hardware: x86_64
OS: Linux
unspecified
high
Target Milestone: ---
Assignee: Eric Sandeen
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2012-01-19 08:54 UTC by Martin Wilck
Modified: 2012-09-17 19:07 UTC (History)
6 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2012-09-17 19:07:06 UTC
Type: ---


Attachments (Terms of Use)
strace of vi hanging in an unlink() call (16.46 KB, application/vnd.pg.format)
2012-01-19 08:54 UTC, Martin Wilck
no flags Details
dmesg with output of sysrq-l, sysrq-w, sysrq-t (121.90 KB, text/plain)
2012-01-19 08:55 UTC, Martin Wilck
no flags Details
suggested patch (1.66 KB, patch)
2012-01-30 16:39 UTC, Martin Wilck
no flags Details | Diff

Description Martin Wilck 2012-01-19 08:54:04 UTC
Created attachment 556212 [details]
strace of vi hanging in an unlink() call

Description of problem:
I have an encrypted ext2 volume which I have been using under F15 for storing passwords and other private data. It's a 10MB regular file. For working with it, I loop-mount the file, open the loop device with cryptsetup luksOpen, and mount the crypt device.

Since updating to F16, when I try to open a file on this device with "vi", the process will hang in uninterruptable sleep forever. The file system can't be unmounted any more and the system won't be able to shut down cleanly.

Version-Release number of selected component (if applicable):
3.1.9-1.fc16.x86_64

How reproducible:
So far, every time I tried (3x)

Steps to Reproduce:
1. mount encrypted volume (I am using the options "noatime,nodiratime,sync")
2. open some file on the volume with "vi"
  
Actual results:
vi never opens the file and hangs forever, not killable with kill -9

Expected results:
file is opened cleanly

Additional info:
Strace attached. Before mounting this volume, I did an e2fsck that found a few minor problems in the file system (caused by the previous occurence of this problem) which e2fsck was able to fix:

/dev/mapper/vol2 contains a file system with errors, check forced.
Pass 1: Checking inodes, blocks, and sizes
Extended attribute block 306 has reference count 246, should be 245.  Fix<y>? yes

Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Free blocks count wrong for group #0 (7098, counted=7113).
Fix<y>? yes

Free blocks count wrong (7098, counted=7113).
Fix<y>? yes

Inode bitmap differences:  -256 -(258--259)
Fix<y>? yes

Free inodes count wrong for group #0 (1788, counted=1792).
Fix<y>? yes

Free inodes count wrong (1788, counted=1792).
Fix<y>? yes


/dev/mapper/vol2: ***** FILE SYSTEM WAS MODIFIED *****
/dev/mapper/vol2: 256/2048 files (0.8% non-contiguous), 1079/8192 blocks

From /proc/mounts:

/dev/mapper/vol2 /mnt/vol2 ext2 rw,sync,seclabel,noatime,nodiratime,user_xattr,acl,barrier=1 0 0

Comment 1 Martin Wilck 2012-01-19 08:55:31 UTC
Created attachment 556213 [details]
dmesg with output of sysrq-l, sysrq-w, sysrq-t

Comment 2 Martin Wilck 2012-01-19 08:58:03 UTC
Just for clarification, this never happened to me with F15, but so far every time on F16.

Comment 3 Martin Wilck 2012-01-19 11:52:47 UTC
This has nothing todo with loop mount or encryption, only with the "sync" mount option. I just reproduced it with a normal LVM volume mounted with "-o sync". Doesn't happen on the same volume if "-o sync" is left out.

/dev/mapper/vg-S /mnt/james2 ext2 rw,sync,seclabel,relatime,user_xattr,acl,barrier=1 0 0

Comment 4 Martin Wilck 2012-01-19 12:17:43 UTC
Here is a tiny reproducer in python. Works on any sync-mounted ex2 fs.

from os import open as osopen, rename, unlink, close, O_RDWR, O_CREAT

f1=osopen("f1", O_RDWR|O_CREAT)
close(f1)
unlink("f1")  # this works

f2=osopen("f2", O_RDWR|O_CREAT)
f1=osopen("f1", O_RDWR|O_CREAT)
close(f1)
unlink("f1")  # this hangs forever

Comment 5 Josh Boyer 2012-01-19 12:20:24 UTC
In F16 (and newer) we're using the ext4 module to mount ext2/ext3 filesystems.  I wonder if that has anything to do with this behavior.

Eric?

Comment 6 Martin Wilck 2012-01-19 12:24:22 UTC
This is how the stack of the blocked processes looks like.

[ 1515.121557] python          D ffff8801200ae080     0  3295   3034 0x00000000
[ 1515.121561]  ffff8800b0a87b58 0000000000000086 ffff880100000000 0000000000000000
[ 1515.121565]  ffff8801200adcc0 ffff8800b0a87fd8 ffff8800b0a87fd8 ffff8800b0a87fd8
[ 1515.121570]  ffffffff81a0d020 ffff8801200adcc0 ffff8800b0a87b58 000000018109824d
[ 1515.121573] Call Trace:
[ 1515.121577]  [<ffffffff811a07b0>] ? __wait_on_buffer+0x30/0x30
[ 1515.121579]  [<ffffffff815d343f>] schedule+0x3f/0x60
[ 1515.121582]  [<ffffffff815d34ef>] io_schedule+0x8f/0xd0
[ 1515.121585]  [<ffffffff811a07be>] sleep_on_buffer+0xe/0x20
[ 1515.121588]  [<ffffffff815d3bba>] __wait_on_bit_lock+0x5a/0xc0
[ 1515.121591]  [<ffffffff811a07b0>] ? __wait_on_buffer+0x30/0x30
[ 1515.121594]  [<ffffffff815d3c9c>] out_of_line_wait_on_bit_lock+0x7c/0x90
[ 1515.121597]  [<ffffffff8108e860>] ? autoremove_wake_function+0x50/0x50
[ 1515.121601]  [<ffffffff811a0806>] __lock_buffer+0x36/0x40
[ 1515.121604]  [<ffffffff811a1a85>] __sync_dirty_buffer+0xc5/0xd0
[ 1515.121607]  [<ffffffff811a1aa3>] sync_dirty_buffer+0x13/0x20
[ 1515.121610]  [<ffffffff81217370>] __ext4_handle_dirty_metadata+0xb0/0x120
[ 1515.121613]  [<ffffffff8108e810>] ? remove_wait_queue+0x50/0x50
[ 1515.121616]  [<ffffffff81225e84>] ext4_xattr_release_block+0xd4/0x190
[ 1515.121619]  [<ffffffff81227c2e>] ext4_xattr_delete_inode+0xbe/0x100
[ 1515.121622]  [<ffffffff811ec441>] ext4_free_inode+0x71/0x590
[ 1515.121625]  [<ffffffff811f2bc7>] ? ext4_mark_inode_dirty+0x87/0x210
[ 1515.121628]  [<ffffffff811f4dd6>] ? ext4_evict_inode+0x276/0x430
[ 1515.121631]  [<ffffffff811f4eb3>] ext4_evict_inode+0x353/0x430
[ 1515.121634]  [<ffffffff8118bd9a>] evict+0x9a/0x1a0
[ 1515.121636]  [<ffffffff8118bfa3>] iput+0x103/0x200
[ 1515.121639]  [<ffffffff811815d3>] do_unlinkat+0x153/0x1c0
[ 1515.121642]  [<ffffffff8108239a>] ? sys_rt_sigaction+0xba/0xd0
[ 1515.121645]  [<ffffffff81182a86>] sys_unlink+0x16/0x20
[ 1515.121648]  [<ffffffff815dd1c2>] system_call_fastpath+0x16/0x1b

Comment 7 Martin Wilck 2012-01-19 12:29:34 UTC
(In reply to comment #5)
> In F16 (and newer) we're using the ext4 module to mount ext2/ext3 filesystems. 
> I wonder if that has anything to do with this behavior.

The same test works fine on an ext4-formatted partition.

Comment 8 Martin Wilck 2012-01-19 12:36:22 UTC
The problem *only* occurs when mounting a *ext2-formatted block device*.
It doesn't happen with ext4- or ext3-formatted devices, even if I use "mount -t ext2" (which is only legal for ext3-formatted device, anyway).

[root@cooper ~]# mke2fs -t ext4 /dev/vg/Q
[root@cooper ~]# mount -t ext4 -o sync /dev/vg/Q /mnt/q
[root@cooper ~]# (cd /mnt/q; python /tmp/test.py)         -> OK
[root@cooper ~]# umount /mnt/q
[root@cooper ~]# mount -t ext3 -o sync /dev/vg/Q /mnt/q   -> mount fails
[root@cooper ~]# mount -t ext2 -o sync /dev/vg/Q /mnt/q   -> mount fails
[root@cooper ~]# mke2fs -t ext3 /dev/vg/Q
[root@cooper ~]# mount -t ext3 -o sync /dev/vg/Q /mnt/q
[root@cooper ~]# (cd /mnt/q; python /tmp/test.py)         -> OK
[root@cooper ~]# umount /mnt/q
[root@cooper ~]# mount -t ext2 -o sync /dev/vg/Q /mnt/q 
[root@cooper ~]# (cd /mnt/q; python /tmp/test.py)         -> OK
[root@cooper ~]# umount /mnt/q
[root@cooper ~]# mke2fs /dev/vg/Q
[root@cooper ~]# mount -t ext2 -o sync /dev/vg/Q /mnt/q
[root@cooper ~]# (cd /mnt/q; python /tmp/test.py)         -> HANGS

Comment 9 Eric Sandeen 2012-01-19 16:24:19 UTC
If you do your test mounting it with an explicit selinux context does the problem go away?

something like:

mount -o context=system_u:object_r:nfs_t:s0 -o sync ....

it's hanging up in the xattr sync path, presumably that's selinux stuff going on.

Weird that it's in io_schedule though...

Comment 10 Martin Wilck 2012-01-20 08:47:18 UTC
It happens also with the the follwing options:

/dev/mapper/vg-R /mnt/r ext2 rw,sync,context=system_u:object_r:nfs_t:s0,relatime,user_xattr,acl,barrier=1 0 0

However I found that the reproducer in comment 4 is not always sufficient to reproduce the problem. "vi" is always sufficient though.

Comment 11 Martin Wilck 2012-01-20 08:52:17 UTC
* Please forget what I wrote in comment #8. *

My reproducer was obviously not reliable. With "vi" as reproducer, I am seeing this problem also on ext4 partition mounted with -o sync:

/dev/mapper/vg-T /mnt/t ext4 rw,sync,seclabel,relatime,user_xattr,acl,barrier=1,data=ordered 0 0

As I said in the previous comment already, mounting with fixed selinux context doesn't help, either.

Comment 12 Martin Wilck 2012-01-30 16:39:46 UTC
Created attachment 558402 [details]
suggested patch

The problem also occurs with upstream 3.3.0-rc1 when the ext4 module is used for an ext2 file system.

I just sent this patch to the linux-ext4 mailing list. It solves the problem for me, but I'm certainly not knowledgeable enough to judge if it's correct.

Comment 13 Martin Wilck 2012-01-30 16:43:54 UTC
To avoid confusion, I should add that my statement in comment #11 was wrong. I wish I could delete that comment.

I have done more extensive testing and seen that the problem occurs only when ext2 filesystems are mounted with "-o sync" using the ext4 driver, as I correctly noted in comment #8. The options "nouser_xattr,noacl" do not change anything.

Anyway, please review the patch from comment #12.

Comment 14 Eric Sandeen 2012-01-30 17:03:48 UTC
Martin, thanks for sending the patch, apologies for not digging into this sooner myself.  I'll take a look.

-Eric

Comment 15 Dave Jones 2012-03-22 17:03:55 UTC
[mass update]
kernel-3.3.0-4.fc16 has been pushed to the Fedora 16 stable repository.
Please retest with this update.

Comment 16 Dave Jones 2012-03-22 17:06:54 UTC
[mass update]
kernel-3.3.0-4.fc16 has been pushed to the Fedora 16 stable repository.
Please retest with this update.

Comment 17 Dave Jones 2012-03-22 17:18:00 UTC
[mass update]
kernel-3.3.0-4.fc16 has been pushed to the Fedora 16 stable repository.
Please retest with this update.

Comment 18 Hiroya Nozaki 2012-04-26 08:39:19 UTC
I think this problem has been not solved yet, has it ?
I've encountered the same or similar problem with rhel5.5, 2.6.18-194.el5. and I've also found out the cause of the problem, ... maybe. So could you verify my hypothesis is whether true or not. 
by the way, all written code below is the 2.6.18-194.el5.

then, Here is the one of the stack traces in my case

------------------------------------------------------------------------
PID: 10146  TASK: ffff81043ff21860  CPU: 11  COMMAND: "run.sh"
 #0 [ffff8106dd6ed988] schedule at ffffffff80063f96
 #1 [ffff8106dd6eda60] io_schedule at ffffffff800647ea
 #2 [ffff8106dd6eda80] sync_page at ffffffff80028b1a
 #3 [ffff8106dd6eda90] __wait_on_bit_lock at ffffffff8006492e
 #4 [ffff8106dd6edac0] __lock_page at ffffffff8003ff92
 #5 [ffff8106dd6edb10] truncate_inode_pages_range at ffffffff8002b86a
 #6 [ffff8106dd6edbe0] vmtruncate at ffffffff800ced98
 #7 [ffff8106dd6edc00] ll_setattr_raw at ffffffff889a1038
 #8 [ffff8106dd6edea0] do_filp_open at ffffffff800274fb
 #9 [ffff8106dd6edf50] do_sys_open at ffffffff80019e1e
#10 [ffff8106dd6edf80] system_call at ffffffff8005e116
    RIP: 00000036d2ec58f0  RSP: 00007fffcf8e43f8  RFLAGS: 00000246
    RAX: 0000000000000002  RBX: ffffffff8005e116  RCX: 0000000000000000
    RDX: 00000000000001b6  RSI: 0000000000000241  RDI: 000000000fb23080
    RBP: 0000000000000000   R8: 0000000000000020   R9: 0000000000000001
    R10: 0000000000000000  R11: 0000000000000246  R12: 000000000fb23080
    R13: 0000000000000241  R14: 00007fffcf8e4620  R15: 000000000fb24650
    ORIG_RAX: 0000000000000002  CS: 0033  SS: 002b
------------------------------------------------------------------------

__lock_page() calls __wait_on_bit_lock() with the sync_page() symbol which is the one in the same file, mm/filemap.c

so, __wait_on_bit_lock() comes to call sync_page() in its while-loop and, obviously expect action(), in the case sync_page() thought, to returns non-zero value.but, as you can look at the below code, sync_page() never return non-zero value in this case.


------------------------------------------------------------------------
void fastcall __lock_page(struct page *page)
{
        DEFINE_WAIT_BIT(wait, &page->flags, PG_locked);

        __wait_on_bit_lock(page_waitqueue(page), &wait, sync_page,
                                                        TASK_UNINTERRUPTIBLE);
}
EXPORT_SYMBOL(__lock_page);

------------------------------------------------------------------------
int __sched fastcall
__wait_on_bit_lock(wait_queue_head_t *wq, struct wait_bit_queue *q,
                        int (*action)(void *), unsigned mode)
{
        int ret = 0;

        do {
                prepare_to_wait_exclusive(wq, &q->wait, mode);
                if (test_bit(q->key.bit_nr, q->key.flags)) {
                        if ((ret = (*action)(q->key.flags)))
                                break;
                }
        } while (test_and_set_bit(q->key.bit_nr, q->key.flags));
        finish_wait(wq, &q->wait);
        return ret;
}
EXPORT_SYMBOL(__wait_on_bit_lock);
------------------------------------------------------------------------

static int sync_page(void *word)
{
        struct address_space *mapping;
        struct page *page;

        page = container_of((unsigned long *)word, struct page, flags);
        smp_mb();
        mapping = page_mapping(page);
        if (mapping && mapping->a_ops && mapping->a_ops->sync_page)
                mapping->a_ops->sync_page(page);
        io_schedule();
        return 0;
}
------------------------------------------------------------------------

and I think the other problems like my case or have similar stack trace are caused by the same issue, i mean, q->key.bit_nr in __wait_on_bit() and __wait_on_bit_lock() is already set and action() never returns non-zero value.

anyway, I know this site is not for the support, but I'm glad if you give me an answer or some opinion to my hypothesis.

Comment 19 Martin Wilck 2012-04-26 09:25:50 UTC
Nozaki-san,

> I think this problem has been not solved yet, has it ?

I will respond to this separately later.

> I've encountered the same or similar problem with rhel5.5, 2.6.18-194.el5. and
> I've also found out the cause of the problem, ... maybe. So could you verify my
> hypothesis is whether true or not. 

This bugzilla is against teh Fedora kernel, which is much newer than the RHEL5 kernel you are referring to. You should open a new bug.

> so, __wait_on_bit_lock() comes to call sync_page() in its while-loop and,
> obviously expect action(), in the case sync_page() thought, to returns >non-zero
> value.but, as you can look at the below code, sync_page() never return >non-zero
> value in this case.

That doesn't matter because the condition that__wait_on_bit_lock waits for is actually "test_and_set_bit(q->key.bit_nr, q->key.flags)", not the result of action(). The non-zero result of action() exists only to signal an error condition. The fact that sync_page() never returns non-zero simply means that errors can't occur on this code path.

If the kernel gets stuck with the stack trace you have shown, it means that the page you're looking at stays locked (PH_locked bit set) forever. This may or may not indicate a locking problem in the kernel.

Comment 20 Martin Wilck 2012-04-26 09:42:16 UTC
This bug is indeed not fixed in FC16 kernel 3.3.2-6.fc16

Comment 21 Martin Wilck 2012-04-26 09:55:30 UTC
The patch from comment #12, with Eric's additions, is scheduled for inclusion in kernel 3.4.

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=c1bb05a657fb3d8c6179a4ef7980261fae4521d7

Comment 22 Hiroya Nozaki 2012-04-27 05:11:18 UTC
Hi, Martin. 
I'm glad to receive your reply.

>This bugzilla is against teh Fedora kernel, which is much newer than the RHEL5
>kernel you are referring to. You should open a new bug.

I'm sorry. I'll do definitely that when I find out some bugs.


>That doesn't matter because the condition that__wait_on_bit_lock waits for is
>actually "test_and_set_bit(q->key.bit_nr, q->key.flags)", not the result of
>action(). The non-zero result of action() exists only to signal an error
>condition. The fact that sync_page() never returns non-zero simply means that
>errors can't occur on this code path.

Oh,i got it. It really makes sense what's you said, I never came up with the signal stuff. Then, now, I started suspecting another possibility and have found 2nd suspect in Lustre bugzilla.

I'm not sure it gives you a hint to this case, but anyway, I left the link to the article here. it seems that the similar bug are caused by kmalloc()'s flag.
(and I haven't applied this patch to my custom Lustre, actually).
https://bugzilla.lustre.org/show_bug.cgi?id=21983

so, I'll recommend you to investigate kmalloc() flag is right or not.
and thank you for your co-operation to me !!!

Comment 23 Eric Sandeen 2012-05-16 02:08:38 UTC
Looks like we should have submitted this to -stable.  I'll do that.

Comment 24 Eric Sandeen 2012-05-22 14:44:05 UTC
Patch should be present in 3.3.7, which I think just got built for F16. Please retest when that hits the repos!

Thanks,
-Eric

Comment 25 Josh Boyer 2012-09-17 19:07:06 UTC
I'm going to close this out for now.  3.3.7 was released a while ago, and we've been on 3.4 for a while now too.

If this problem still occurs in Fedora with the latest 3.4 or 3.5 kernel, please reopen.


Note You need to log in before you can comment on or make changes to this bug.