Bug 1089873
Summary: | hang on startup using ISO image from inaccessible NFS share | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|
Product: | Red Hat Enterprise Linux 6 | Reporter: | Francesco Romani <fromani> | ||||||||
Component: | libvirt | Assignee: | Jiri Denemark <jdenemar> | ||||||||
Status: | CLOSED NOTABUG | QA Contact: | Virtualization Bugs <virt-bugs> | ||||||||
Severity: | unspecified | Docs Contact: | |||||||||
Priority: | unspecified | ||||||||||
Version: | 6.5 | CC: | acathrow, bili, chhu, dallan, dyuan, fromani, michal.skrivanek, mzhan, shyu | ||||||||
Target Milestone: | rc | ||||||||||
Target Release: | --- | ||||||||||
Hardware: | Unspecified | ||||||||||
OS: | Unspecified | ||||||||||
Whiteboard: | |||||||||||
Fixed In Version: | Doc Type: | Bug Fix | |||||||||
Doc Text: | Story Points: | --- | |||||||||
Clone Of: | Environment: | ||||||||||
Last Closed: | 2014-05-14 08:07:39 UTC | Type: | Bug | ||||||||
Regression: | --- | Mount Type: | --- | ||||||||
Documentation: | --- | CRM: | |||||||||
Verified Versions: | Category: | --- | |||||||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||||
Cloudforms Team: | --- | Target Upstream Version: | |||||||||
Embargoed: | |||||||||||
Bug Depends On: | |||||||||||
Bug Blocks: | 755528 | ||||||||||
Attachments: |
|
Description
Francesco Romani
2014-04-22 06:29:46 UTC
relevant mount/fstab line of the NFS mount on which the ISO image is stored: shinji:/srv/ovirt/iso on /rhev/data-center/mnt/shinji:_srv_ovirt_iso type nfs (rw,soft,nosharecache,timeo=600,retrans=6,nfsvers=3,addr=192.168.1.64) where "shinji" is 192.168.1.64 That is the NFS request should timeout in ~6 minutes. Could you try to reproduce and keep it waiting for a significantly more time (let's say half an hour) and if it's still waiting, provide debug logs from libvirtd so that we see if we are still waiting on the first request or if multiple requests were involved? Also running "strace -e trace=file" on libvirtd at the same time could be useful too. Reproduced. It took a bit more then 12 minutes 2014-04-28 11:37:19.680+0000: 2354: debug : qemuDomainObjBeginJobInternal:808 : Starting job: modify (async=none) 2014-04-28 11:37:19.680+0000: 2354: debug : qemuProcessStart:3606 : Beginning VM startup process 2014-04-28 11:37:19.680+0000: 2354: debug : qemuProcessStart:3618 : Setting current domain def as transient 2014-04-28 11:37:19.680+0000: 2354: debug : qemuProcessStart:3644 : Preparing host devices 2014-04-28 11:37:19.680+0000: 2354: debug : qemuProcessStart:3648 : Preparing chr devices 2014-04-28 11:37:19.680+0000: 2354: debug : qemuProcessStart:3657 : Generating domain security label (if required) 2014-04-28 11:37:19.686+0000: 2354: debug : qemuProcessStart:3666 : Ensuring no historical cgroup is lying around 2014-04-28 11:37:19.689+0000: 2354: debug : qemuProcessStart:3744 : Creating domain log file 2014-04-28 11:37:19.689+0000: 2354: debug : qemuProcessStart:3749 : Checking for KVM availability 2014-04-28 11:37:19.689+0000: 2354: debug : qemuProcessStart:3759 : Determining emulator version 2014-04-28 11:37:19.689+0000: 2354: debug : qemuCapsCacheLookup:2151 : Returning caps 0x7f0ce81166a0 for /usr/libexec/qemu-kvm 2014-04-28 11:37:19.689+0000: 2354: debug : qemuProcessStart:3768 : Checking for CDROM and floppy presence 2014-04-28 11:49:39.009+0000: 2354: debug : qemuDomainCheckDiskPresence:1920 : Dropping disk 'hdc' on domain 'F20_1' (UUID '025d59e3-f12c-41b8-be3f-de8dc22d23ce') due to inaccessible source '/rhev/data-center/mnt/192.168.1.20:_srv_virtstore_nfs_generic_iso/b0390d9e-2f1d-43c0-a56c-a20af032e934/images/11111111-1111-1111-1111-111111111111/Fedora-Live-Desktop-x86_64-20-1.iso' 2014-04-28 11:49:39.009+0000: 2354: debug : virStorageFileGetMetadata:993 : path=/rhev/data-center/mnt/blockSD/00249736-3f63-497a-b357-e6092e457e35/images/a447e318-db6c-420f-8a7e-2900190ec3f1/2133baf1-22a8-4f67-9466-160ca8df2a04 format=1 uid=107 gid=107 probe=0 2014-04-28 11:49:39.009+0000: 2354: debug : virStorageFileGetMetadataRecurse:925 : path=/rhev/data-center/mnt/blockSD/00249736-3f63-497a-b357-e6092e457e35/images/a447e318-db6c-420f-8a7e-2900190ec3f1/2133baf1-22a8-4f67-9466-160ca8df2a04 format=1 uid=107 gid=107 probe=0 2014-04-28 11:49:39.010+0000: 2354: debug : virStorageFileGetMetadataInternal:673 : path=/rhev/data-center/mnt/blockSD/00249736-3f63-497a-b357-e6092e457e35/images/a447e318-db6c-420f-8a7e-2900190ec3f1/2133baf1-22a8-4f67-9466-160ca8df2a04, fd=26, format=1 2014-04-28 11:49:39.011+0000: 2354: debug : qemuProcessStart:3798 : Setting up domain cgroup (if required) 2014-04-28 11:49:39.021+0000: 2354: debug : qemuProcessStart:3807 : Preparing monitor state 2014-04-28 11:49:39.021+0000: 2354: debug : qemuProcessStart:3843 : Assigning domain PCI addresses It may be good enough for me. There is any way to control this timeout? since it's 12 minutes, that would hint that it's accessing a file twice somewhere. If we can reduce that to one and say that the reliability of the check depends on the "timeo" parameter that might be good enough. Do you have the strace to confirm this? Created attachment 892814 [details]
strace logs
Created attachment 892821 [details]
output of strace -tt -o libvirt_strace.txt -e trace=file -f -p 2310 and libvirtd logs
Unfortunately, the strace is not very helpful: 2389 11:16:41.257783 open("/etc/group", O_RDONLY|O_CLOEXEC) = 26 2389 11:26:41.261512 --- SIGCHLD (Child exited) @ 0 (0) --- The checks happen in a new child between these two lines. I'm not sure why -f was not enough to follow that child. So it appears that strace on RHEL-6 is unable to trace children when it is attached to a running process. To see them, one needs to run the command using strace: strace -t -o nfs.strace -e trace=file -ff libvirtd --daemon --listen Searching for the inaccessible CDROM image (RHEL-7-boot.iso in my case) gives the following result: # grep boot.iso nfs.strace.* nfs.strace.29789:09:59:09 access("/mnt/test/isos/RHEL-7-boot.iso", F_OK) = -1 EIO (Input/output error) That is, we only access the file once. So there's no bug in libvirt in this area. And after reading nfs(5), it may even work as expected as the timeout seems to be 10 minutes rather then 6: timeo=n The time in deciseconds (tenths of a second) the NFS client waits for a response before it retries an NFS request. For NFS over TCP the default timeo value is 600 (60 sec‐ onds). The NFS client performs linear backoff: After each retransmission the timeout is increased by timeo up to the maximum of 600 seconds. However, for NFS over UDP, the client uses an adaptive algorithm to estimate an appropriate timeout value for frequently used request types (such as READ and WRITE requests), but uses the timeo setting for infrequently used request types (such as FSINFO requests). If the timeo option is not specified, infrequently used request types are retried after 1.1 seconds. After each retransmission, the NFS client doubles the timeout for that request, up to a maximum timeout length of 60 sec‐ onds. retrans=n The number of times the NFS client retries a request before it attempts further recovery action. If the retrans option is not specified, the NFS client tries each request three times. The NFS client generates a "server not responding" mes‐ sage after retrans retries, then attempts further recov‐ ery (depending on whether the hard mount option is in effect). I don't really understand what they are trying to say but it seems that the real timeout is not just "timeo / 10 * retrans" as one would hope for. |