Bug 636287 - GFS2: [RFE] Make GFS2 handle errors more gracefully
Summary: GFS2: [RFE] Make GFS2 handle errors more gracefully
Keywords:
Status: NEW
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: rawhide
Hardware: All
OS: Linux
Priority: low
Severity: medium
Target Milestone: ---
Assignee: Steve Whitehouse
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On: 253948
Blocks: 589070 697864
 
Reported: 2010-09-21 20:15 UTC by Steve Whitehouse
Modified: 2018-12-01 19:03 UTC
CC List: 11 users

Fixed In Version:
Doc Type: Enhancement
Doc Text:
Clone Of:
Environment:
Last Closed:
Type: ---
Embargoed:



Description Steve Whitehouse 2010-09-21 20:15:59 UTC
In a number of cases, the answer to any problem is to "withdraw". This procedure is error prone (depending on what failed, it might not work) and it is rather complicated, relying on userspace to do some of the work and then poke the kernel to acknowledge what has been done.

The plan is that we'd like to introduce ways of handling errors more gracefully. Some work has been done already. The goal is to leave as much of the filesystem still functioning as possible while avoiding making matters any worse in the area of the error.

It would be good to have something like the remount read-only behaviour that ext2/3/4 has (errors=remount-ro). For cluster filesystems we also need to ensure that we don't write anything back to the disk after that point, in case we are leaving the cluster. We may need some extra blkdev function to allow that.
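
As a rough sketch of what that ext-style response amounts to (illustrative only; the function name below is made up, though ext3's real handler, ext3_handle_error() in fs/ext3/super.c, does essentially this for errors=remount-ro):

#include <linux/fs.h>
#include <linux/kernel.h>

static void example_remount_ro(struct super_block *sb)
{
        if (sb->s_flags & MS_RDONLY)
                return;
        printk(KERN_CRIT "Remounting filesystem read-only\n");
        /* For GFS2, something extra is needed to stop any further
           writes reaching the shared block device after this point. */
        sb->s_flags |= MS_RDONLY;
}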

This is a bug in which to collect any similar ideas.

Comment 1 Steve Whitehouse 2010-09-24 14:19:26 UTC
There are three kinds of bugs which we need to look at:

1. I/O Errors. The underlying block device tells us that it cannot process an I/O request of some kind

Simulation method: dm-error module

Solution: Where possible this should be fed back to the user. That may be via an async path (e.g. a dirtied mmapped page, for which msync is the correct return path for that information). There is no reason for this to be catastrophic, provided that a reasonable amount of the block device is still functional.

Potentially we might want to drop into a read-only mount, and maybe migrate services off the node, but we have to be careful here. If the fault is on the block device itself, rather than on the path between the node and the block device, there is no point in migrating services. How can we tell the difference? How do we relay this info to userland?
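
A minimal sketch of the async feedback path mentioned above (illustrative only, not GFS2 code; the handler name is made up, but mapping_set_error() is the real kernel helper). If writeback of a dirtied, mmapped page fails, the completion handler records the error on the address_space, so a later msync()/fsync() on the file returns -EIO rather than silently succeeding:

#include <linux/bio.h>
#include <linux/pagemap.h>
#include <linux/page-flags.h>

static void example_write_end_io(struct bio *bio, int error)
{
        struct page *page = bio->bi_io_vec[0].bv_page;

        if (error) {
                SetPageError(page);
                /* remembered in the mapping; picked up by msync/fsync */
                mapping_set_error(page->mapping, error);
        }
        end_page_writeback(page);
        bio_put(bio);
}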

2. On-disk format errors. We've read something from disk that doesn't conform to the expected pattern.

Simulation method: gfs2_edit, archived corrupt metadata

Solution: Try to mitigate the effects of the problem. Maybe we can just fail the syscall where we hit the issue? Ideally, try to isolate the problem. We already have the error flag for rgrps, which is a step in this direction. Maybe we need a per-inode error flag as well? The main issue here is how to ensure that the incorrect information is contained and does not lead to further corruption of the filesystem. Some thought needs to go into which kinds of problems we can fix "on the fly" and which will need fsck to solve.
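
As a sketch of the per-inode containment idea (purely hypothetical: GFS2_DIF_ERROR and gfs2_bad_dinode() do not exist, and the flag value is made up), the per-inode analogue of the existing rgrp error flag might look something like this:

#include <linux/errno.h>
#include "incore.h"
#include "util.h"

#define GFS2_DIF_ERROR  0x80000000      /* hypothetical "inode is bad" flag */

/*
 * Hypothetical helper: note the error against this one inode and fail
 * just the current syscall with -EIO, instead of withdrawing the
 * whole filesystem.
 */
static int gfs2_bad_dinode(struct gfs2_inode *ip, const char *function)
{
        struct gfs2_sbd *sdp = GFS2_SB(&ip->i_inode);

        if (!(ip->i_diskflags & GFS2_DIF_ERROR)) {
                ip->i_diskflags |= GFS2_DIF_ERROR;
                fs_err(sdp, "inode %llu marked bad in %s\n",
                       (unsigned long long)ip->i_no_addr, function);
        }
        return -EIO;
}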

3. In-core state errors. Some part of the in-core state is not as expected.

Simulation method: none at the moment aside from editing the source 

Solution: Probably have to do something similar to the old withdraw function. We can't risk writing back incorrect data into the shared block device. If the issue is minor, we might be able to carry on, but it might be a symptom of something liable to cause corruption on the disk (e.g. a buffer overrun) so we need to be very careful. Using BUG() to terminate the current thread is one possible way to deal with this, but since it is likely to occur while locks are being held, it is unlikely to do anything but postpone the issue for a short while.
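
As a sketch of the alternative to BUG() (illustrative only; GFS2's existing gfs2_assert_withdraw() macro in fs/gfs2/util.h already has this shape, though it triggers a withdraw rather than a contained error return), the check logs where the in-core state went wrong and hands back an error the caller can propagate, instead of killing a thread that may be holding glocks:

#include <linux/errno.h>
#include <linux/kernel.h>
#include "incore.h"
#include "util.h"

/*
 * Illustrative assertion: on failure, log the broken invariant and
 * return -EIO so the caller can back out, rather than calling BUG()
 * with locks held.
 */
#define example_assert(sdp, assertion)                                  \
({                                                                      \
        int _failed = !(assertion);                                     \
        if (unlikely(_failed))                                          \
                fs_err((sdp), "assertion \"%s\" failed at %s:%d\n",     \
                       #assertion, __func__, __LINE__);                 \
        _failed ? -EIO : 0;                                             \
})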


Are there any other classes of errors which I've missed?

Comment 2 Steve Whitehouse 2010-09-24 14:22:23 UTC
We also have a bug open already for the I/O errors part of this issue.

Comment 7 ouyang.maochun 2012-02-23 10:27:53 UTC
If GFS2 filesystem consistency is corrupted, the whole cluster has to stop and run fsck.gfs2 on it. I've found that in some scenarios, when one node that was accessing GFS2 went down, another node would fence it and replay its journal (jid), but after a short while the filesystem reported that its consistency was corrupted. Does journal replay have some bugs? Is online filesystem checking on the schedule?

Comment 8 Steve Whitehouse 2012-02-23 10:40:53 UTC
I don't expect to find any bugs in the journal replay code, since it is pretty simple. It just finds a set of blocks in the journal, and then copies those blocks back in place.

That said, we do also have a bug open to perform more checks during the replay of journals in order to improve the robustness of this process.

What kernel version are you using? We've not had any other reports of issues relating to journal replay, and this code is checked regularly by us.

We currently don't have online filesystem checking on our todo list. It would be very tricky to do due to the dynamic state, and it would certainly not be a complete solution, since it would have to rely on the internal state of the filesystem for consistency, which is the very thing it should be checking.

Comment 10 ouyang.maochun 2012-02-23 11:30:05 UTC
My kernel version is "2.6.32-71.el6.x86_64".

I have had filesystem consistency corrupted twice. The steps to reproduce the issue are as follows:
1. I ran mkfs.gfs2.
2. Several nodes were accessing GFS2, and one of them failed (I just physically powered it off).
3. Another node fenced it, acquired its journal (jid) glock and replayed the journal.
4. After a while GFS2 reported a filesystem consistency problem and withdrew.

It is a very simple scenario for testing whether journal replay works, but so far I have hit filesystem consistency issues twice. I know that before another node replays the journal, there is some response time before it takes action. Is there any possibility that, during that response time, other nodes make changes to GFS2 which the failed node's journal replay cannot handle?

Comment 11 Steve Whitehouse 2012-02-23 11:39:39 UTC
That sounds just like a bug to me. There shouldn't be anything that can be damaged on the filesystem in that way. Since you are using RHEL6, I assume that you are a customer, so the best solution is to file a ticket with our support team and we can look into this for you.

If you could provide the output from fsck when you run it and also the messages from the withdraw and a metadata dump of the damaged filesystem, then that would be very helpful. Our support team should be able to help you in collecting that data.

Comment 14 ouyang.maochun 2012-03-30 08:32:12 UTC
I've found an issue and tried to fix it by modifying the code.

Steps to reproduce (kernel version "2.6.32-71.el6.x86_64"):
1. All cluster nodes mount GFS2.
2. More than n/2 of the nodes fail.
3. Run "umount" of the GFS2 filesystem on one of the surviving nodes.
4. The umount process goes into "D" state and never returns, even after all the failed nodes come back up.

Here is the stack:
Mar 26 11:34:11 node4 kernel: umount        D 0000000000000002     0 22327  22247 0x00000080
Mar 26 11:34:11 node4 kernel: ffff88002b307ba8 0000000000000086 ffff880001e10b68 ffff88002b307ce0
Mar 26 11:34:11 node4 kernel: 00ff88002b307b48 00ffffff00000002 0000000000000000 00000000ffffffff
Mar 26 11:34:11 node4 kernel: ffff88002b9bfb18 ffff88002b307fd8 0000000000010518 ffff88002b9bfb18
Mar 26 11:34:11 node4 kernel: Call Trace:
Mar 26 11:34:11 node4 kernel: [<ffffffffa0743210>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
Mar 26 11:34:11 node4 kernel: [<ffffffffa074321e>] gfs2_glock_holder_wait+0xe/0x20 [gfs2]
Mar 26 11:34:11 node4 kernel: [<ffffffff814ccecf>] __wait_on_bit+0x5f/0x90
Mar 26 11:34:11 node4 kernel: [<ffffffffa0743210>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
Mar 26 11:34:11 node4 kernel: [<ffffffff814ccf78>] out_of_line_wait_on_bit+0x78/0x90
Mar 26 11:34:11 node4 kernel: [<ffffffff81091c60>] ? wake_bit_function+0x0/0x50
Mar 26 11:34:11 node4 kernel: [<ffffffffa0744e56>] gfs2_glock_wait+0x36/0x40 [gfs2]
Mar 26 11:34:11 node4 kernel: [<ffffffffa0746031>] gfs2_glock_nq+0x191/0x370 [gfs2]
Mar 26 11:34:11 node4 kernel: [<ffffffffa075ecf8>] gfs2_statfs_sync+0x58/0x1b0 [gfs2]
Mar 26 11:34:11 node4 kernel: [<ffffffffa075ecf0>] ? gfs2_statfs_sync+0x50/0x1b0 [gfs2]
Mar 26 11:34:11 node4 kernel: [<ffffffffa075ee90>] gfs2_make_fs_ro+0x40/0xc0 [gfs2]
Mar 26 11:34:11 node4 kernel: [<ffffffff81091c60>] ? wake_bit_function+0x0/0x50
Mar 26 11:34:11 node4 kernel: [<ffffffff81183208>] ? invalidate_inodes+0xd8/0x160
Mar 26 11:34:11 node4 kernel: [<ffffffffa075f100>] gfs2_put_super+0x1f0/0x220 [gfs2]
Mar 26 11:34:11 node4 kernel: [<ffffffff8116ab86>] generic_shutdown_super+0x56/0xe0
Mar 26 11:34:11 node4 kernel: [<ffffffff8116ac41>] kill_block_super+0x31/0x50
Mar 26 11:34:11 node4 kernel: [<ffffffffa0750e91>] gfs2_kill_sb+0x61/0x90 [gfs2]
Mar 26 11:34:11 node4 kernel: [<ffffffff8116bcf0>] deactivate_super+0x70/0x90
Mar 26 11:34:11 node4 kernel: [<ffffffff8118738f>] mntput_no_expire+0xbf/0x110
Mar 26 11:34:11 node4 kernel: [<ffffffff811877bb>] sys_umount+0x7b/0x3a0
Mar 26 11:34:11 node4 kernel: [<ffffffff810d4002>] ? audit_syscall_entry+0x272/0x2a0
Mar 26 11:34:11 node4 kernel: [<ffffffff81013172>] system_call_fastpath+0x16/0x1b
Mar 26 11:36:11 node4 kernel: INFO: task gfs2_quotad:21576 blocked for more than 120 seconds.
Mar 26 11:36:11 node4 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 26 11:36:11 node4 kernel: gfs2_quotad   D 0000000000000002     0 21576      2 0x00000080
Mar 26 11:36:11 node4 kernel: ffff88002b38bc20 0000000000000046 ffff88002b38bb90 ffffffffa04a798d
Mar 26 11:36:11 node4 kernel: 0000000000000000 ffff880077daf000 ffff88002b38bc50 ffffffffa04a57b8
Mar 26 11:36:11 node4 kernel: ffff880037efc638 ffff88002b38bfd8 0000000000010518 ffff880037efc638
Mar 26 11:36:11 node4 kernel: Call Trace:
Mar 26 11:36:11 node4 kernel: [<ffffffffa04a798d>] ? dlm_put_lockspace+0x1d/0x40 [dlm]
Mar 26 11:36:11 node4 kernel: [<ffffffffa04a57b8>] ? dlm_lock+0x98/0x1e0 [dlm]
Mar 26 11:36:11 node4 kernel: [<ffffffffa0743210>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
Mar 26 11:36:11 node4 kernel: [<ffffffffa074321e>] gfs2_glock_holder_wait+0xe/0x20 [gfs2]
Mar 26 11:36:11 node4 kernel: [<ffffffff814ccecf>] __wait_on_bit+0x5f/0x90
Mar 26 11:36:11 node4 kernel: [<ffffffffa0743210>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
Mar 26 11:36:11 node4 kernel: [<ffffffff814ccf78>] out_of_line_wait_on_bit+0x78/0x90
Mar 26 11:36:11 node4 kernel: [<ffffffff81091c60>] ? wake_bit_function+0x0/0x50
Mar 26 11:36:11 node4 kernel: [<ffffffffa0744e56>] gfs2_glock_wait+0x36/0x40 [gfs2]
Mar 26 11:36:11 node4 kernel: [<ffffffffa0746031>] gfs2_glock_nq+0x191/0x370 [gfs2]
Mar 26 11:36:11 node4 kernel: [<ffffffff8107de7b>] ? try_to_del_timer_sync+0x7b/0xe0
Mar 26 11:36:11 node4 kernel: [<ffffffff8107de7b>] ? try_to_del_timer_sync+0x7b/0xe0
Mar 26 11:36:11 node4 kernel: [<ffffffffa075ecf8>] gfs2_statfs_sync+0x58/0x1b0 [gfs2]
Mar 26 11:36:11 node4 kernel: [<ffffffff814ccb6c>] ? schedule_timeout+0x19c/0x2f0
Mar 26 11:36:11 node4 kernel: [<ffffffffa075ecf0>] ? gfs2_statfs_sync+0x50/0x1b0 [gfs2]
Mar 26 11:36:11 node4 kernel: [<ffffffffa0756c07>] quotad_check_timeo+0x57/0xb0 [gfs2]
Mar 26 11:36:11 node4 kernel: [<ffffffffa0756e94>] gfs2_quotad+0x234/0x2b0 [gfs2]
Mar 26 11:36:11 node4 kernel: [<ffffffff81091c20>] ? autoremove_wake_function+0x0/0x40
Mar 26 11:36:11 node4 kernel: [<ffffffffa0756c60>] ? gfs2_quotad+0x0/0x2b0 [gfs2]
Mar 26 11:36:11 node4 kernel: [<ffffffff810918b6>] kthread+0x96/0xa0
Mar 26 11:36:11 node4 kernel: [<ffffffff810141ca>] child_rip+0xa/0x20
Mar 26 11:36:11 node4 kernel: [<ffffffff81091820>] ? kthread+0x0/0xa0
Mar 26 11:36:11 node4 kernel: [<ffffffff810141c0>] ? child_rip+0x0/0x20

Maybe there is a deadlock if umount asks for a glock before lock recovery is complete. Because I didn't figure out how to deal with the glock, I tried another way to avoid the problem. My modification:

fs/gfs2/sys.c
static ssize_t block_store(struct gfs2_sbd *sdp, const char *buf, size_t len)
{
        struct lm_lockstruct *ls = &sdp->sd_lockstruct;
        ssize_t ret = len;
        int val;

        val = simple_strtol(buf, NULL, 0);

        if (val == 1)
                set_bit(DFL_BLOCK_LOCKS, &ls->ls_flags);
        else if (val == 0) {
                clear_bit(DFL_BLOCK_LOCKS, &ls->ls_flags);
                smp_mb__after_clear_bit();
                /* added: wake anyone in gfs2_kill_sb() waiting for recovery to finish */
                wake_up_bit(&ls->ls_flags, DFL_BLOCK_LOCKS);
                gfs2_glock_thaw(sdp);
        } else {
                ret = -EINVAL;
        }
        return ret;
}

fs/gfs2/ops_fstype.c

static int gfs2_kill_sb_wait(void *word)
{
        schedule();
        return 0;
}

static void gfs2_kill_sb(struct super_block *sb)
{
        struct gfs2_sbd *sdp = sb->s_fs_info;
        struct lm_lockstruct *ls;

        if (sdp == NULL) {
                kill_block_super(sb);
                return;
        }

        /*
         * added: wait for lock recovery to finish (gfs_controld clears
         * DFL_BLOCK_LOCKS via the sysfs "block" file once recovery is
         * done) before tearing the superblock down.  The NULL check
         * above has to come first, since sd_lockstruct lives in sdp.
         */
        ls = &sdp->sd_lockstruct;
        if (test_bit(DFL_BLOCK_LOCKS, &ls->ls_flags))
                wait_on_bit(&ls->ls_flags, DFL_BLOCK_LOCKS,
                            gfs2_kill_sb_wait, TASK_UNINTERRUPTIBLE);

        gfs2_meta_syncfs(sdp);
        dput(sdp->sd_root_dir);
        dput(sdp->sd_master_dir);
        sdp->sd_root_dir = NULL;
        sdp->sd_master_dir = NULL;
        shrink_dcache_sb(sb);
        kill_block_super(sb);
        gfs2_delete_debugfs_file(sdp);
        kfree(sdp);
}

I tested my revision and it works. Is my modification OK? Or is there a better way to fix this issue?

Comment 15 Steve Whitehouse 2012-03-30 08:51:38 UTC
Comment #14 has nothing to do with this bug. Can you open a new bug for it?

So far as I can tell, what you've found is that we have the following:

 - Recovery starts
 - Unmount is called and the sysfs directory becomes inaccessible
 - Unmount is waiting for last few locks to be unlocked, but since sysfs
   is inaccessible, this cannot happen

Or did I misunderstand? What version of Fedora are you using?

Comment 16 ouyang.maochun 2012-03-30 09:15:23 UTC
Fedora 12, but I rebuilt the kernel, which is based on RHEL 6.0.

My modification does not try to access sysfs. However, after all recoveries are done, gfs_controld will call the function "start_kernel" to notify the kernel that lock recovery is done, and I just wait at that point for DFL_BLOCK_LOCKS to be cleared, letting umount carry on with the rest of the procedure.

Comment 17 Steve Whitehouse 2012-03-30 09:40:04 UTC
Fedora 12 is very old. You also cannot assume that mixing and matching Fedora userspace and RHEL kernels will work without issues. When you say that the kernel is based on RHEL6, what exactly do you mean?

Can you reproduce this on some more recent distro?

I can see where you are adding the new wait, but it looks to me like sysfs has already been unmounted by that point, so I'm not sure how the message will be relayed to the kernel.

In the latest rawhide, all this infrastructure has gone away anyway, and the recovery coordination comes via the DLM and its remaining userspace. So I'm not at all sure that there is anything to fix in the upstream code, anyway.

Either way, this doesn't belong here, so please open a separate bug for it.

