688858 – VM can't be stopped when hit read error

Summary: VM can't be stopped when hit read error

Keywords:
Status:	CLOSED NOTABUG
Alias:	None
Product:	Red Hat Enterprise Linux 6
Classification:	Red Hat
Component:	qemu-kvm
Sub Component:
Version:	6.1
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	high
Target Milestone:	rc
Target Release:	---
Assignee:	Jes Sorensen
QA Contact:	Virtualization Bugs
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	Rhel6KvmTier1

Reported:	2011-03-18 09:28 UTC by juzhang
Modified:	2013-01-09 23:40 UTC
CC List:	5 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2011-03-29 13:32:32 UTC
Target Upstream Version:
Embargoed:

Description juzhang 2011-03-18 09:28:13 UTC

Description of problem:
The VM is not stopped when a read error occurs on a disk configured with rerror=stop.

Version-Release number of selected component (if applicable):
#rpm -qa | grep qemu-kvm
qemu-kvm-tools-0.12.1.2-2.149.el6.x86_64
#uname -r
2.6.32-118.el6.x86_64
guest:
rhel5.6


How reproducible:
100%

Steps to Reproduce:
1. On the NFS server:
#qemu-img create -f qcow2 junzhang.qcow2 6G
2. Mount the NFS export on the host:
#mount 10.66.8.113:/home/ nfs/ -o soft,timeo=2,retrans=2
3. Boot the guest with junzhang.qcow2 as a secondary disk:
#/usr/libexec/qemu-kvm -m 2G -smp 4 -drive file=/root/zhangjunyi/rhel5.6-virtio-64.qcow2,if=none,id=test,cache=none,format=qcow2,werror=stop,rerror=stop -device virtio-blk-pci,drive=test -cpu qemu64,+sse2,+x2apic -boot c -netdev tap,id=hostnet0,vhost=on -device virtio-net-pci,netdev=hostnet0,id=net0,mac=22:11:22:45:66:94 -monitor stdio  -drive file=/dev/cdrom,if=none,media=cdrom,id=drive-ide0-1-0,readonly=on,format=raw -device ide-drive,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 -vnc :10 -drive file=/root/nfs/junzhang.qcow2,if=none,id=test1,cache=none,format=qcow2,werror=stop,rerror=stop -device virtio-blk-pci,drive=test1 -qmp tcp:0:4446,server,nowait
4. In the guest, read data from the secondary disk:
while true;do dd if=/dev/vdb of=/dev/null ;done
5. Disconnect the NFS server and make sure the nfs service is stopped.

 
Actual results:
The VM is not stopped; the guest can still read data from vdb.

Expected results:
The VM should be stopped with an error such as:
{"timestamp": {"seconds": 1300438562, "microseconds": 236084}, "event": "BLOCK_IO_ERROR", "data": {"device": "test1", "__com.redhat_debug_info": {"message": "Input/output error", "errno": 5}, "__com.redhat_reason": "eio", "operation": "stop", "action": "stop"}}

Additional info:
I also tested writing data to vdb; in that case the VM is stopped with the message:
{"timestamp": {"seconds": 1300438562, "microseconds": 237911}, "event": "BLOCK_IO_ERROR", "data": {"device": "test1", "__com.redhat_debug_info": {"message": "Input/output error", "errno": 5}, "__com.redhat_reason": "eio", "operation": "write", "action": "stop"}}
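For anyone scripting this check against the QMP socket, here is a minimal Python sketch that picks the device, operation and action out of a BLOCK_IO_ERROR event line. The function name classify_block_event is hypothetical; the sample is the write-error event quoted above, and the sketch assumes one JSON object per line.

```python
import json


def classify_block_event(line):
    """Return (device, operation, action) for a BLOCK_IO_ERROR event, else None."""
    try:
        msg = json.loads(line)
    except ValueError:
        # Not valid JSON (for example a partial line): ignore it.
        return None
    if not isinstance(msg, dict) or msg.get("event") != "BLOCK_IO_ERROR":
        return None
    data = msg.get("data", {})
    return data.get("device"), data.get("operation"), data.get("action")


# Sample copied from the write-error event in this report.
sample = (
    '{"timestamp": {"seconds": 1300438562, "microseconds": 237911}, '
    '"event": "BLOCK_IO_ERROR", "data": {"device": "test1", '
    '"__com.redhat_debug_info": {"message": "Input/output error", "errno": 5}, '
    '"__com.redhat_reason": "eio", "operation": "write", "action": "stop"}}'
)

print(classify_block_event(sample))  # -> ('test1', 'write', 'stop')
```

With rerror=stop, a read failure on test1 would be expected to produce the same kind of event with a read operation; in this report no such event arrives.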

Comment 2 Dor Laor 2011-03-20 08:10:44 UTC

Are you sure read requests for this secondary disk were executed?
What's the output of strace or tcpdump to the nfs server?

Comment 3 juzhang 2011-03-20 09:31:12 UTC

(In reply to comment #2)
Mount command
#mount 10.66.8.113:/home/ nfs/ -o soft,timeo=2,retrans=2
> Are you sure read requests for this secondary disk were executed?
Yes
> What's the output of strace or tcpdump to the nfs server?
ls -la /proc/`pgrep qemu`/fd
dr-x------. 2 root root  0 Mar 20 17:17 .
dr-xr-xr-x. 7 root root  0 Mar 20 17:16 ..
lrwx------. 1 root root 64 Mar 20 17:17 0 -> /dev/pts/2
lrwx------. 1 root root 64 Mar 20 17:17 1 -> /dev/pts/2
lrwx------. 1 root root 64 Mar 20 17:17 10 -> /root/zhangjunyi/rhel5.6-virtio-64.qcow2
lrwx------. 1 root root 64 Mar 20 17:17 11 -> anon_inode:[signalfd]
lr-x------. 1 root root 64 Mar 20 17:17 12 -> /dev/sr0
lrwx------. 1 root root 64 Mar 20 17:17 13 -> /root/nfs/junzhang.qcow2
lrwx------. 1 root root 64 Mar 20 17:17 14 -> anon_inode:kvm-vcpu
lrwx------. 1 root root 64 Mar 20 17:17 15 -> anon_inode:kvm-vcpu
lrwx------. 1 root root 64 Mar 20 17:17 16 -> anon_inode:kvm-vcpu
lrwx------. 1 root root 64 Mar 20 17:17 17 -> anon_inode:kvm-vcpu
lrwx------. 1 root root 64 Mar 20 17:17 18 -> socket:[1999005]
lrwx------. 1 root root 64 Mar 20 17:17 19 -> socket:[1999002]
lrwx------. 1 root root 64 Mar 20 17:17 2 -> /dev/pts/2
lrwx------. 1 root root 64 Mar 20 17:17 20 -> anon_inode:[eventfd]
lrwx------. 1 root root 64 Mar 20 17:17 21 -> anon_inode:[eventfd]
lrwx------. 1 root root 64 Mar 20 17:17 22 -> anon_inode:[signalfd]
lrwx------. 1 root root 64 Mar 20 17:17 23 -> anon_inode:[eventfd]
lrwx------. 1 root root 64 Mar 20 17:17 24 -> anon_inode:[eventfd]
lrwx------. 1 root root 64 Mar 20 17:17 25 -> socket:[1999006]
lrwx------. 1 root root 64 Mar 20 17:17 3 -> socket:[1998960]
lrwx------. 1 root root 64 Mar 20 17:17 4 -> /dev/kvm
lrwx------. 1 root root 64 Mar 20 17:17 5 -> anon_inode:kvm-vm
lr-x------. 1 root root 64 Mar 20 17:17 6 -> pipe:[1998962]
l-wx------. 1 root root 64 Mar 20 17:17 7 -> pipe:[1998962]
lrwx------. 1 root root 64 Mar 20 17:17 8 -> /dev/net/tun
lrwx------. 1 root root 64 Mar 20 17:17 9 -> /dev/vhost-net

Please note that /root/nfs/junzhang.qcow2 (fd 13 above) is the secondary disk, which resides on the NFS server.
 # strace -p `pidof qemu-kvm` -e trace=desc 2> a.txt
Disconnect the NFS server.
#service nfs stop
#tail -f a.txt | grep 13

Results:
No error is reported for file descriptor 13 in the strace output while NFS is down.

tail -f a.txt | grep 13
select(27, [0 6 8 11 18 20 22 23 24 25 26], [], [], {1, 0}) = 1 (in [24], left {0, 999613})
select(27, [0 6 8 11 18 20 22 23 24 25 26], [], [], {1, 0}) = 1 (in [24], left {0, 999413})
select(27, [0 6 8 11 18 20 22 23 24 25 26], [], [], {1, 0}) = 1 (in [24], left {0, 999613})
select(27, [0 6 8 11 18 20 22 23 24 25 26], [], [], {1, 0}) = 1 (in [24], left {0, 999613})
select(27, [0 6 8 11 18 20 22 23 24 25 26], [], [], {1, 0}) = 1 (in [24], left {0, 999513})
select(27, [0 6 8 11 18 20 22 23 24 25 26], [], [], {1, 0}) = 1 (in [24], left {0, 999513})
select(27, [0 6 8 11 18 20 22 23 24 25 26], [], [], {1, 0}) = 1 (in [24], left {0, 999513})
select(27, [0 6 8 11 18 20 22 23 24 25 26], [], [], {1, 0}) = 1 (in [24], left {0, 999513})
select(27, [0 6 8 11 18 20 22 23 24 25 26], [], [], {1, 0}) = 1 (in [24], left {0, 999513})
select(27, [0 6 8 11 18 20 22 23 24 25 26], [], [], {1, 0}) = 1 (in [24], left {0, 999513})
select(27, [0 6 8 11 18 20 22 23 24 25 26], [], [], {1, 0}) = 1 (in [24], left {0, 999513})
select(27, [0 6 8 11 18 20 22 23 24 25 26], [], [], {1, 0}) = 1 (in [24], left {0, 999413})
select(27, [0 6 8 11 18 20 22 23 24 25 26], [], [], {1, 0}) = 1 (in [24], left {0, 999613})
select(27, [0 6 8 11 18 20 22 23 24 25 26], [], [], {1, 0}) = 1 (in [24], left {0, 999613})
select(27, [0 6 8 11 18 20 22 23 24 25 26], [], [], {1, 0}) = 1 (in [24], left {0, 999613})
select(27, [0 6 8 11 18 20 22 23 24 25 26], [], [], {1, 0}) = 1 (in [24], left {0, 999613})
select(27, [0 6 8 11 18 20 22 23 24 25 26], [], [], {1, 0}) = 1 (in [24], left {0, 999613})
select(27, [0 6 8 11 18 20 22 23 24 25 26], [], [], {1, 0}) = 1 (in [24], left {0, 999613})
select(27, [0 6 8 11 18 20 22 23 24 25 26], [], [], {1, 0}) = 1 (in [24], left {0, 999613})
select(27, [0 6 8 11 18 20 22 23 24 25 26], [], [], {1, 0}) = 1 (in [24], left {0, 999613})
select(27, [0 6 8 11 18 20 22 23 24 25 26], [], [], {1, 0}) = 1 (in [11], left {0, 998813})
select(27, [0 6 8 11 18 20 22 23 24 25 26], [], [], {1, 0}) = 1 (in [24], left {0, 999613})
select(27, [0 6 8 11 18 20 22 23 24 25 26], [], [], {1, 0}) = 1 (in [24], left {0, 999613})
select(27, [0 6 8 11 18 20 22 23 24 25 26], [], [], {1, 0}) = 1 (in [24], left {0, 999613})
select(27, [0 6 8 11 18 20 22 23 24 25 26], [], [], {1, 0}) = 1 (in [24], left {0, 999513})
select(27, [0 6 8 11 18 20 22 23 24 25 26], [], [], {1, 0}) = 1 (in [24], left {0, 999513})
select(27, [0 6 8 11 18 20 22 23 24 25 26], [], [], {1, 0}) = 1 (in [24], left {0, 999513})
select(27, [0 6 8 11 18 20 22 23 24 25 26], [], [], {1, 0}) = 1 (in [24], left {0, 999513})
select(27, [0 6 8 11 18 20 22 23 24 25 26], [], [], {1, 0}) = 1 (in [24], left {0, 999513})
select(27, [0 6 8 11 18 20 22 23 24 25 26], [], [], {1, 0}) = 1 (in [24], left {0, 999513})
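A note on the output above: fd 13 does not appear in the descriptor set of any of these select() calls. The lines match `grep 13` only because the remaining-time values (999613, 999513, 998813 and so on) contain the digits 13. Below is a minimal Python sketch of a stricter filter, assuming the usual strace layout where the descriptor is the first syscall argument. The function name touches_fd and the pread sample line are hypothetical and not taken from this bug.

```python
import re


def touches_fd(line, fd):
    """True if the strace line is a syscall whose first argument is exactly fd.

    An optional "[pid N]" prefix (as printed by strace -f) is tolerated.
    """
    pattern = r"^(?:\[pid\s+\d+\]\s+)?\w+\(%d[,)]" % fd
    return re.search(pattern, line) is not None


# Line copied from the strace output in this comment.
select_line = ("select(27, [0 6 8 11 18 20 22 23 24 25 26], [], [], {1, 0})"
               " = 1 (in [24], left {0, 999613})")
# Hypothetical line showing what a failing read on fd 13 could look like.
pread_line = 'pread(13, 0x7f0000000000, 4096, 0) = -1 EIO (Input/output error)'

print(touches_fd(select_line, 13))
print(touches_fd(pread_line, 13))
```

Filtering this way would make it easier to tell whether any call on the NFS-backed image shows up in the trace at all.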

Comment 4 juzhang 2011-03-20 09:39:08 UTC

(In reply to comment #3)

> 
> Please note,/root/nfs/junzhang.qcow2 is secondary disk where in nfs server.
>  # strace -p `pidof qemu-kvm` -e trace=desc 2> a.txt
> Disconnect the nfs server.
> #service nfs stop
> #tail -f a.txt | grep 13
> 
> Results:
> got no error for file descriptor 13 in strace file when nfs is down.
However, on the host, when I tried to enter the mount directory while the NFS server was down, I hit the following message. I think this shows that the host has detected that NFS is disconnected.
#cd /root/nfs
-bash: cd: /root/nfs: Input/output error

Comment 5 Jes Sorensen 2011-03-29 09:09:36 UTC

What happens if you use hard nfs mounts, instead of soft mounts?

Comment 6 juzhang 2011-03-29 11:14:13 UTC

(In reply to comment #5)
> What happens if you use hard nfs mounts, instead of soft mounts?

Using mount 10.66.8.113:/home/ nfs/ instead of mount 10.66.8.113:/home/ nfs/ -o soft,timeo=2,retrans=2.

After disconnecting the NFS server, the qemu-kvm process hangs. If you reconnect the NFS server, qemu-kvm comes back.
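To confirm what kind of hang this is, the process state can be read from /proc/<pid>/stat on the host. A minimal Python sketch follows, assuming a Linux /proc filesystem; the helper names and the sample stat line are hypothetical. State "D" means uninterruptible sleep, which is what a task blocked on an unreachable hard-mounted NFS server typically shows.

```python
def parse_proc_state(stat_text):
    """Extract the one-letter state field from the contents of /proc/<pid>/stat.

    The command name is wrapped in parentheses and may itself contain spaces
    or ")", so split on the last ")" rather than on whitespace alone.
    """
    rest = stat_text[stat_text.rindex(")") + 1:].split()
    return rest[0]


def process_state(pid):
    """Read and parse the state of a live process (Linux only)."""
    with open("/proc/%d/stat" % pid) as f:
        return parse_proc_state(f.read())


# Hypothetical stat line, truncated to the first few fields.
sample_stat = "12345 (qemu-kvm) D 1 12345 12345 0"
print(parse_proc_state(sample_stat))  # -> D
```

On the host this could be called as process_state(pid) with the qemu-kvm PID while the NFS server is down.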

Comment 7 Jes Sorensen 2011-03-29 11:46:42 UTC

I believe soft mounts are default, could you try with -ohard ?

Thanks,
Jes

Comment 8 juzhang 2011-03-29 12:07:02 UTC

(In reply to comment #7)
> I believe soft mounts are default, could you try with -ohard ?
> 
> Thanks,
> Jes

#mount 10.66.8.162:/home/ nfs/ -o hard

Hit the same issue as in comment 6.

Comment 9 Jes Sorensen 2011-03-29 13:32:32 UTC

Hi,

I did some more digging into this. The problem is that, whether the
mount is soft or hard, the process can get stuck on a semaphore in the
kernel, which doesn't get interrupted if the NFS server disappears.
Even if you mount using the 'intr' flag, you may still get stuck.

The select() calls you are seeing in the strace log are simply the
QEMU AIO code sitting waiting for IOs to complete. Unfortunately
I don't see anything in the AIO headers that allows us to set a
timeout for these operations.

This is a property of NFS - not much we can do about it unfortunately.
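As a host-side workaround sketch (not something QEMU does), a monitoring script can at least detect the condition by probing the mount point from a helper thread and giving up after a timeout. The following Python sketch assumes this is acceptable for monitoring only; the function name probe_path and the timeout value are hypothetical. It does not cancel the stuck call: a worker blocked in the kernel stays blocked until the NFS server returns.

```python
import os
import threading


def probe_path(path, timeout):
    """Stat path in a helper thread; report 'ok', 'error: ...' or 'timeout'."""
    result = {}

    def worker():
        try:
            os.stat(path)
            result["status"] = "ok"
        except OSError as exc:
            # For example EIO on a soft mount whose server is gone.
            result["status"] = "error: %s" % exc.strerror

    # Daemon thread so a worker that never returns is not joined at exit.
    t = threading.Thread(target=worker, daemon=True)
    t.start()
    t.join(timeout)
    # If the worker has not finished, no status was recorded.
    return result.get("status", "timeout")


print(probe_path("/", 5.0))
```

In this report the path to probe would be /root/nfs; a soft mount would be expected to return an error, as with the cd attempt in comment 4, while a hard mount would be expected to time out.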

You might want to check http://nfs.sourceforge.net/#section_d
for more details, see under D6.

Cheers,
Jes
