Bug 773219
Summary: Detaching a busy block device from a 64-bit PV guest sometimes crashes the guest
Product: Red Hat Enterprise Linux 6
Component: kernel
Version: 6.3
Status: CLOSED ERRATA
Severity: medium
Priority: medium
Reporter: Shengnan Wang <shwang>
Assignee: Andrew Jones <drjones>
QA Contact: Virtualization Bugs <virt-bugs>
CC: drjones, leiwang, lersek, moli, mrezanin, qguan, qwan, xen-maint, yuzhou
Target Milestone: rc
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard: xen
Fixed In Version: kernel-2.6.32-238.el6
Doc Type: Bug Fix
Clones: 791116 (view as bug list)
Last Closed: 2012-06-20 08:13:57 UTC
Bug Blocks: 653816, 791116, 818023
Description (Shengnan Wang, 2012-01-11 09:58:52 UTC)
Created attachment 552055 [details]: screenshot of the crashed guest
Created attachment 552056 [details]: boot log for the crashed 6.2 PV guest
Created attachment 552057 [details]: xm dmesg log for the 6.2 PV guest
Created attachment 552058 [details]: xm dmesg log for the 5.8 PV guest
(In reply to comment #0)
> 4. KEEP the block device mounted and detach it in the host:
>    xm block-detach $guestid $block_device_id
> 5. Redo step 4 until no error info is given.

I assume you do this in a loop. How long do you wait between the individual detach attempts?

For the 4.9 PV test it occurred on the first attempt. In the following tests there were about 2-4 seconds between two detach operations.

The problem here is that the detach is done via xenbus, where we store the closing state and check for an error message afterwards. However, when the error is not delivered in time, the host is not aware of the unsuccessful detach. In that case the disk should be detached as soon as possible, which happens on unmount. Since fixing this flaw in the detach handling would cause more problems than it solves, that part of the issue won't be fixed. However, the warning backtrace on RHEL 6 is still present. Even if the guest keeps working after this backtrace, that should be confirmed by the kernel folks.

------------[ cut here ]------------
WARNING: at lib/list_debug.c:26 __list_add+0x6d/0xa0() (Not tainted)
list_add corruption. next->prev should be prev (ffff880004378608), but was ffff88003d609ce8. (next=ffff88003d609ce8).
Modules linked in: ext2 autofs4 sunrpc ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 microcode xen_netfront ext4 mbcache jbd2 xen_blkfront dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
Pid: 1642, comm: umount Not tainted 2.6.32-220.el6.x86_64 #1
Call Trace:
 [<ffffffff81069b77>] ? warn_slowpath_common+0x87/0xc0
 [<ffffffff81069c66>] ? warn_slowpath_fmt+0x46/0x50
 [<ffffffff8127b84d>] ? __list_add+0x6d/0xa0
 [<ffffffff8119f87f>] ? list_move+0x1f/0x30
 [<ffffffff8119ff4f>] ? __mark_inode_dirty+0x13f/0x160
 [<ffffffff811922cc>] ? inode_setattr+0x4c/0x60
 [<ffffffff811315ab>] ? shmem_notify_change+0x7b/0x120
 [<ffffffff81192698>] ? notify_change+0x168/0x340
 [<ffffffff81215acc>] ? selinux_inode_getattr+0x2c/0x30
 [<ffffffff8134ad3b>] ? devtmpfs_delete_node+0x29b/0x2f0
 [<ffffffff81007c8f>] ? xen_restore_fl_direct_end+0x0/0x1
 [<ffffffff8115fbf4>] ? kmem_cache_free+0xc4/0x2b0
 [<ffffffff81191a5c>] ? ifind_fast+0x3c/0xb0
 [<ffffffff810074fd>] ? xen_force_evtchn_callback+0xd/0x10
 [<ffffffff81007ca2>] ? check_events+0x12/0x20
 [<ffffffff81341af2>] ? klist_children_put+0x12/0x20
 [<ffffffff81342f08>] ? device_del+0x1a8/0x1e0
 [<ffffffff811e8024>] ? del_gendisk+0x134/0x150
 [<ffffffff8126ca27>] ? kobject_put+0x27/0x60
 [<ffffffffa003d009>] ? blkfront_closing+0x99/0x110 [xen_blkfront]
 [<ffffffffa003d7e6>] ? blkif_release+0x66/0x70 [xen_blkfront]
 [<ffffffff811afb8e>] ? __blkdev_put+0xce/0x190
 [<ffffffff811afc3a>] ? __blkdev_put+0x17a/0x190
 [<ffffffff811afc60>] ? blkdev_put+0x10/0x20
 [<ffffffff811afcfc>] ? close_bdev_exclusive+0x2c/0x40
 [<ffffffff811788f4>] ? kill_block_super+0x44/0x50
 [<ffffffff811798a0>] ? deactivate_super+0x70/0x90
 [<ffffffff8119581f>] ? mntput_no_expire+0xbf/0x110
 [<ffffffff811962bb>] ? sys_umount+0x7b/0x3a0
 [<ffffffff810d46e2>] ? audit_syscall_entry+0x272/0x2a0
 [<ffffffff8100b0f2>] ? system_call_fastpath+0x16/0x1b
---[ end trace b66bdfda9c09c32d ]---

I couldn't reproduce this on my system with the following setup:

Host                         Guest (rhel6 64-bit PV)
----                         -----------------------
xen-3.0.3-135.el5            kernel 2.6.32-220.4.1.el6.x86_64
kernel-xen-2.6.18-305.el5

On the host I created a disk and attached it to the guest:

# dd if=/dev/zero of=test.disk bs=1 count=1 seek=2G
# mkfs.ext3 test.disk
# xm block-attach rhel6-64pv file:$PWD/test.disk xvdb w

On the guest I mounted it:

# mount /dev/xvdb a

Back on the host I attempted to detach it with a loop:

# while ! xm block-detach rhel6-64pv xvdb ; do ((++count)) ; echo $count; sleep .3; done

So I was trying to detach it every .3 seconds. I let it try over 1000 times, but I always got an error like this on the host:

> Error: Device 51728 (vbd) disconnect failed - Device in use; refusing to close.
> Usage: xm block-detach <Domain> <DevId> [-f|--force]
>
> Destroy a domain's virtual block device.

and I got an error like this on the guest:

> vbd vbd-51728: 16 Device in use; refusing to close

The mounted block device still worked, and nothing else was in the guest's dmesg, i.e. no backtrace. Then I unmounted it in the guest, and still didn't get anything in dmesg. 'xm list --long rhel6-64pv' showed that it was still attached, and indeed I could mount it again from the guest, and then umount it again. After umounting, I tried 'xm block-detach rhel6-64pv xvdb' again; still no errors, and xm list showed it was gone. The guest's dmesg showed nothing strange. I repeated the attach/mount/umount/detach cycle, and still all was OK. So I guess it takes a certain environment to reproduce.

I looked at the backtrace in comment 8 and have a *guess* at what's going on. I need a host where this reproduces to do some tests in order to be sure of anything though. Here's the trace with some comments.

WARNING: at lib/list_debug.c:26 __list_add+0x6d/0xa0() (Not tainted)
list_add corruption. next->prev should be prev (ffff880004378608), but was ffff88003d609ce8. (next=ffff88003d609ce8).

The warning says some list is corrupted.

Call Trace:
 [<ffffffff81069b77>] ? warn_slowpath_common+0x87/0xc0
 [<ffffffff81069c66>] ? warn_slowpath_fmt+0x46/0x50
 [<ffffffff8127b84d>] ? __list_add+0x6d/0xa0
 [<ffffffff8119f87f>] ? list_move+0x1f/0x30
 [<ffffffff8119ff4f>] ? __mark_inode_dirty+0x13f/0x160  (__mark_inode_dirty [fs/fs-writeback.c:974])

We find out here that it's the bdi_writeback dirty inodes list.

 [<ffffffff811922cc>] ? inode_setattr+0x4c/0x60
 [<ffffffff811315ab>] ? shmem_notify_change+0x7b/0x120
 [<ffffffff81192698>] ? notify_change+0x168/0x340
 [<ffffffff81215acc>] ? selinux_inode_getattr+0x2c/0x30
 [<ffffffff8134ad3b>] ? devtmpfs_delete_node+0x29b/0x2f0

From here up it looks like a sane stack progressing towards the __list_add call.

 [<ffffffff81007c8f>] ? xen_restore_fl_direct_end+0x0/0x1
 [<ffffffff8115fbf4>] ? kmem_cache_free+0xc4/0x2b0
 [<ffffffff81191a5c>] ? ifind_fast+0x3c/0xb0
 [<ffffffff810074fd>] ? xen_force_evtchn_callback+0xd/0x10
 [<ffffffff81007ca2>] ? check_events+0x12/0x20

From here down it looks like a sane stack starting from the umount command that progresses towards the devtmpfs_delete_node call above.

 [<ffffffff81341af2>] ? klist_children_put+0x12/0x20
 [<ffffffff81342f08>] ? device_del+0x1a8/0x1e0
 [<ffffffff811e8024>] ? del_gendisk+0x134/0x150
 [<ffffffff8126ca27>] ? kobject_put+0x27/0x60
 [<ffffffffa003d009>] ? blkfront_closing+0x99/0x110 [xen_blkfront]
 [<ffffffffa003d7e6>] ? blkif_release+0x66/0x70 [xen_blkfront]
 [<ffffffff811afb8e>] ? __blkdev_put+0xce/0x190   (__blkdev_put [fs/block_dev.c:1379])
 [<ffffffff811afc3a>] ? __blkdev_put+0x17a/0x190  (__blkdev_put [fs/block_dev.c:1398])

Note the two calls to __blkdev_put on the stack above. This means we recursed into a contained blkdev. It also means that we dropped the mutex for partition zero's blkdev.

 [<ffffffff811afc60>] ? blkdev_put+0x10/0x20
 [<ffffffff811afcfc>] ? close_bdev_exclusive+0x2c/0x40
 [<ffffffff811788f4>] ? kill_block_super+0x44/0x50
 [<ffffffff811798a0>] ? deactivate_super+0x70/0x90
 [<ffffffff8119581f>] ? mntput_no_expire+0xbf/0x110
 [<ffffffff811962bb>] ? sys_umount+0x7b/0x3a0
 [<ffffffff810d46e2>] ? audit_syscall_entry+0x272/0x2a0
 [<ffffffff8100b0f2>] ? system_call_fastpath+0x16/0x1b

So what's the other stuff in the middle that we ignored? That part of the trace isn't complete, but we appear to have handled an event which required ifind_fast() and kmem_cache_free(). Let's guess the event was xenwatch, and that it was telling us that blkback changed. If so, then we'd call backend_changed() in blkfront (and I think there's a decent chance, considering this test, that the state would be XenbusStateClosing).
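Before walking the closing path, it may help to see what the check that fired actually verifies. The warning analyzed above comes from __list_add() in lib/list_debug.c, which refuses to link a new entry between prev and next when next->prev no longer points back at prev. Here is a minimal Python analogue (all names are invented for this sketch; this is not kernel code) showing how an unsynchronized writer trips exactly that condition:

```python
# Simplified analogue of the kernel's __list_add() sanity check from
# lib/list_debug.c. Node and checked_list_add are made-up names.

class Node:
    def __init__(self, name):
        self.name = name
        self.prev = self   # a fresh node is a self-referencing list
        self.next = self

def checked_list_add(new, prev, nxt):
    """Insert `new` between `prev` and `nxt`, like __list_add().

    Returns a warning string (instead of WARN()) if the neighbours no
    longer point at each other -- the "next->prev should be prev" case.
    """
    if nxt.prev is not prev:
        return ("list_add corruption. next->prev should be prev "
                "({}), but was {}.".format(prev.name, nxt.prev.name))
    new.next = nxt
    new.prev = prev
    nxt.prev = new
    prev.next = new
    return None

# Healthy insert: head <-> a
head, a = Node("head"), Node("a")
assert checked_list_add(a, head, head.next) is None

# Simulate an unsynchronized writer rewiring head.next behind our back,
# leaving the neighbours inconsistent.
b, stray = Node("b"), Node("stray")
head.next = stray                      # corruption: head.next.prev != head
warning = checked_list_add(b, head, head.next)
print(warning)
```

The message mirrors the kernel warning: the list was rewired by another context between reading the neighbours and performing the insert, which is what a racing task would do to the bdi_writeback dirty inode list.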
So we'd run

1120         case XenbusStateClosing:
1121                 if (info->gd == NULL) {
1122                         xenbus_frontend_closed(dev);
1123                         break;
1124                 }
1125                 bd = bdget_disk(info->gd, 0);
1126                 if (bd == NULL)
1127                         xenbus_dev_fatal(dev, -ENODEV, "bdget failed");
1128
1129                 mutex_lock(&bd->bd_mutex);
1130                 if (info->users > 0)
1131                         xenbus_dev_error(dev, -EBUSY,
1132                                          "Device in use; refusing to close");
1133                 else
1134                         blkfront_closing(dev);
1135                 mutex_unlock(&bd->bd_mutex);
1136                 bdput(bd);
1137                 break;

We take a reference to partition zero's blkdev (line 1125). This means we can grab that blkdev's mutex (line 1129) (recall our umount task has already switched to holding the contained blkdev's mutex instead). Now info->users == 0, since our umount task's call to blkif_release has already decremented it. (As an aside, I believe this change to the blkfront info should have been protected by its own mutex, as it is now upstream.) Anyway, since users is zero, we call blkfront_closing(), which looks like this:

1066 static void blkfront_closing(struct xenbus_device *dev)
1067 {
1068         struct blkfront_info *info = dev_get_drvdata(&dev->dev);
1069         unsigned long flags;
1070
1071         dev_dbg(&dev->dev, "blkfront_closing: %s removed\n", dev->nodename);
1072
1073         if (info->rq == NULL)
1074                 goto out;
1075
1076         spin_lock_irqsave(&blkif_io_lock, flags);
1077
1078         /* No more blkif_request(). */
1079         blk_stop_queue(info->rq);
1080
1081         /* No more gnttab callback work. */
1082         gnttab_cancel_free_callback(&info->callback);
1083         spin_unlock_irqrestore(&blkif_io_lock, flags);
1084
1085         /* Flush gnttab callback work. Must be done with no locks held. */
1086         flush_scheduled_work();
1087
1088         blk_cleanup_queue(info->rq);
1089         info->rq = NULL;
1090
1091         del_gendisk(info->gd);
1092
1093 out:
1094         xenbus_frontend_closed(dev);
1095 }

Since from the backtrace we know our umount task is in del_gendisk(), we also know info->rq == NULL (again, an unprotected blkfront info change).
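The aside above, that the info->users check-then-act should be protected by its own mutex, is the essence of the upstream b70f5fa fix. Here is a toy Python sketch of the pattern (Info and its methods are invented for illustration, not driver code): two contexts race toward closing, the shared counter is read and acted on under one lock, and the teardown is made idempotent so it runs exactly once regardless of scheduling:

```python
# Toy model of serializing a check-then-act on shared device state.
# threading.Lock stands in for the per-info mutex added upstream.
import threading

class Info:
    def __init__(self):
        self.users = 1
        self.is_closed = False
        self.closed_count = 0          # how many times teardown ran
        self.mutex = threading.Lock()  # stands in for the per-info mutex

    def _close_locked(self):
        # Idempotent teardown; caller must hold self.mutex.
        if not self.is_closed:
            self.is_closed = True
            self.closed_count += 1

    def release(self):
        # blkif_release()-like path: drop the last user, then close.
        with self.mutex:
            self.users -= 1
            if self.users == 0:
                self._close_locked()

    def backend_changed(self):
        # xenwatch-like path: close only if nobody is using the device.
        with self.mutex:
            if self.users == 0:
                self._close_locked()

info = Info()
t1 = threading.Thread(target=info.release)
t2 = threading.Thread(target=info.backend_changed)
t1.start(); t2.start()
t1.join(); t2.join()
# Whichever order the threads run in, teardown happens exactly once.
print(info.closed_count)
```

Without the shared mutex and the idempotence flag, both paths can observe users == 0 and both run the teardown, which is the double-close scenario described above.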
So we jump to out (line 1074) and call xenbus_frontend_closed(), which just tells xenbus that the device is now closed (which it isn't, because we're still closing it in the umount task...). We then return to backend_changed() at line 1135 and call bdput() on partition zero's blkdev. Here's where we finally get back to those functions in the trace that we initially ignored, ifind_fast() and kmem_cache_free():

bdput
  iput
    iput_final
      generic_drop_inode
        generic_delete_inode
          ext4_delete_inode
          ext4_destroy_inode
            maybe some eventual call to ifind_fast to remove sysfs nodes?
          kmem_cache_free

The bdput trace I've shown here isn't perfect (I'm not sure about ifind_fast). I need an environment where I can reproduce this bug, and then I can turn on ftrace or something to be sure. Also note that we've already called bdput() on this inode once from __blkdev_put() before we recursed. I'm not sure how that plays in here.

In any case, I'd say blkfront looks a bit buggy wrt its locking, and upstream's version now looks better. We should probably take a closer look at it.

Here's an upstream patch (and its add-on fix patch) that should fix this bug:

0e34582 blkfront: fixes for 'xm block-detach ... --force'
5d7ed20 blkfront: don't access freed struct xenbus_device

and here's an upstream patch that addresses the blkfront info locking issues I saw:

b70f5fa blkfront: Lock blkfront_info when closing

In between them are a couple of fixes which would also be good to have (if for nothing else, then just for context):

a66b5ae blkfront: Clean up vbd release
9897cb5 blkfront: Fix gendisk leak
89de166 blkfront: Fix backtrace in del_gendisk

Additional blkfront info lock related patches that came after b70f5fa are:

fa1bd35 blkfront: Lock blockfront_info during xbdev removal
7fd152f blkfront: Fix blkfront backend switch race (bdev release)
1396174 blkfront: Fix blkfront backend switch race (bdev open)

If we take all the above commits, then we're pretty well synced up with Greg's stable/linux-2.6.36.y branch.
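The double-bdput() question above is easier to reason about with an explicit refcount model. This is a hypothetical sketch (RefObject and its methods are invented, not kernel structures): teardown happens on the put that drops the count to zero, so whether an extra put is harmful depends entirely on whether a matching get was taken first:

```python
# Toy refcount model: the object is torn down when the count reaches
# zero, standing in for the iput_final()/kmem_cache_free() path above.
class RefObject:
    def __init__(self):
        self.refs = 1        # the original owner's reference
        self.freed = False

    def get(self):
        assert not self.freed, "get() after free"
        self.refs += 1
        return self

    def put(self):
        assert self.refs > 0, "unbalanced put()"
        self.refs -= 1
        if self.refs == 0:
            self.freed = True  # stand-in for final inode teardown

# Balanced usage: backend_changed()-style caller takes its own
# reference (bdget_disk) before its put (bdput), so the object
# survives that put.
bd = RefObject()
bd.get()
bd.put()
assert not bd.freed

# The original owner's final put triggers the real teardown.
bd.put()
assert bd.freed
```

If the second context were to put() without a prior get(), the count would underflow and teardown would fire while the first context is still mid-release, which is the use-after-free flavor of bug the upstream locking patches guard against.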
We'd still be missing:

a4cc14e xen-blkfront: fix missing out label
d54142c blkfront: Klog the unclean release path
7b32d10 blkfront: Remove obsolete info->users
acfca3c blkfront: Remove obsolete info->users

Those are pretty safe, but unnecessary, so I'd rather not bother with them just to sync to an arbitrary stable branch. I think I'll roll a patchset for the other 9 patches though.

This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.

(In reply to comment #13)
> I think I'll roll a patchset for the other 9 patches though.

Nice work! ... if I may :)

I created a patch series today with the 9 backports from comment 13, plus one more. To be precise:

203fd61 xen: use less generic names in blkfront driver.
0e34582 blkfront: fixes for 'xm block-detach ... --force'
5d7ed20 blkfront: don't access freed struct xenbus_device
89de166 blkfront: Fix backtrace in del_gendisk
9897cb5 blkfront: Fix gendisk leak
a66b5ae blkfront: Clean up vbd release
b70f5fa blkfront: Lock blkfront_info when closing
1396174 blkfront: Fix blkfront backend switch race (bdev open)
7fd152f blkfront: Fix blkfront backend switch race (bdev release)
fa1bd35 blkfront: Lock blockfront_info during xbdev removal

They were all clean cherry-picks, with the exception that I removed the calls to [un]lock_kernel(). Those calls should be unnecessary because RHEL6 doesn't have 8a6cfeb6deca or 6e9624b8caec, meaning the calls are still higher up the stack. Then I created my brew build[1] to test it.

Testing showed things got *worse*! Before, I was unable to reproduce this in over 1000 tries on my machine (comment 10).
Now, with the host completely the same, the guest completely the same (except for the new kernel), and the test disk image completely the same, I can reproduce on my second try every time (which is likely a clue as to what's going wrong host-side). When I get no error on the 'xm block-detach', 'xm block-list' shows the block device in state 5 (XenbusStateClosing). I don't get the backtrace from comment 8 in the guest though; instead, the tasks accessing the disk *hang* when trying to use it, and I get "blocked for more than 120 seconds" messages.

Sigh... I'll need to take a longer look at this to decide what to do. Maybe we should look again at fixing the host side, as well as the guest side? Or instead of, or neither...

[1] https://brewweb.devel.redhat.com/taskinfo?taskID=4045779

I should also add that if I don't try to rip the block device out from under the feet of the guest (i.e. don't run 'xm block-detach' on the mounted device), then the block devices seem to work fine with the patched blkfront module (the brew build pointed to in comment 16). So perhaps that brew build is at least good for making this bug more reproducible, and thus easier to debug the host side. If we do fix the host side, then perhaps we'll still want this patch series as well.

OK, I got the rhel6 guest back to the same behavior as reported in comment 10. The host-side race doesn't reproduce for me and the guest is always happy. I tried over 2500 times to detach while the guest had the blkdev mounted; the guest always refused and continued to work properly. There was a bug in the backports that had made things worse yesterday, and I also found another bug by analysis while poking through the code.
I've written patches for them and sent them upstream:

http://lists.xen.org/archives/html/xen-devel/2012-02/msg01372.html
http://lists.xen.org/archives/html/xen-devel/2012-02/msg01373.html

I opted to backport three more patches as well:

acfca3c blkfront: Remove obsolete info->users
7b32d10 blkfront: Remove obsolete info->users
a4cc14e xen-blkfront: fix missing out label

Patch(es) available on kernel-2.6.32-238.el6

host: 2.6.18-312.el5
xen: xen-3.0.3-135.el5_8.2

Steps:
1. On the host, attach the test disk:
   # xm block-attach rhel62.pv file:/tmp/test.disk xvdb w
2. On the guest, mount the disk:
   # mount /dev/xvdb /mnt
3. Run the detach in a loop:
   # while ! xm block-detach $guest xvdb ; do ((++count)) ; echo $count; sleep .3; done
4. After the loop exits, try to umount the disk in the guest:
   # umount /mnt

With guest kernel 2.6.32-220.el6, it is easy to reproduce the crash with the above steps. With guest kernel 2.6.32-262.el6, I tried about 20 times with the above steps and the guest didn't crash, so this should be fixed according to my test result.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2012-0862.html