Bug 881732
Summary: | vdsm: vdsm is stuck in recovery for almost an hour on NFS storage with running VMs when blocking storage from host
---|---|---|---
Product: | Red Hat Enterprise Linux 6 | Reporter: | Barak <bazulay>
Component: | qemu-kvm | Assignee: | Luiz Capitulino <lcapitulino>
Status: | CLOSED ERRATA | QA Contact: | Virtualization Bugs <virt-bugs>
Severity: | urgent | Docs Contact: |
Priority: | high | |
Version: | 6.3 | CC: | abaron, acathrow, areis, bazulay, berrange, bsarathy, chayang, cpelland, dallan, dpaikov, dron, dyasny, dyuan, hateya, iheim, jdenemar, jlibosva, juzhang, lpeer, michen, minovotn, mkenneth, mzhan, rwu, sluo, virt-maint, whuang, xfu, ykaul, zpeng
Target Milestone: | rc | Keywords: | ZStream
Target Release: | --- | |
Hardware: | x86_64 | |
OS: | Linux | |
Whiteboard: | infra | |
Fixed In Version: | qemu-kvm-0.12.1.2-2.344.el6 | Doc Type: | Bug Fix
Doc Text: | | Story Points: | ---
Clone Of: | 851837 | |
: | 884650 (view as bug list) | Environment: |
Last Closed: | 2013-02-21 07:45:12 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | |
Bug Depends On: | |
Bug Blocks: | 851837, 884650, 886101 | |
Attachments: | libvirtd log, vdsm log, guest kernel call trace log (see comments below)
Comment 2
Martin Kletzander
2012-11-29 16:29:40 UTC
Hi Dafna, I am trying to reproduce this bug with vdsm and libvirt. Can you reproduce it in your environment with just one host running two VMs that are writing, then add iptables rules on the host to block the NFS storage? In my environment vdsm can recover after NFS is blocked by iptables: the host comes up (green) and the guest is paused due to a storage I/O problem.

vdsm-4.9.6-41.0.el6_3.x86_64
libvirt-0.9.10-21.el6_3.6.x86_64
qemu-kvm-rhev-0.12.1.2-2.295.el6_3.4.x86_64

Wenlong

I reproduced this bug with 3 hosts using one NFS storage (a consolidated sketch of these steps follows the attachments below):

1) a running and writing VM on each host
2) block the NFS IP on each host via iptables
3) wait 20 minutes

The SPM host never recovers, but the other two hosts come up (green). I will attach the vdsm and libvirt logs later.

Created attachment 654962 [details]
libvirtd log
Created attachment 654963 [details]
vdsm log
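The multi-host reproduction above boils down to blocking the NFS server address on every host at once and waiting. A minimal sketch of those three steps, under the assumption that $NFS_SERVER_IP holds the storage address and that it is run on each of the three hosts (the variable name and the getVdsCaps check are illustrative, not from the report):

```sh
# Run on each of the 3 hosts; $NFS_SERVER_IP is the NFS storage address (assumed).
iptables -A INPUT -s "$NFS_SERVER_IP" -j DROP    # block traffic from the NFS server
iptables -A OUTPUT -d "$NFS_SERVER_IP" -j DROP   # and traffic to it
sleep 1200                                       # wait the 20 minutes from the report
# Probe whether vdsm has recovered (it answers getVdsCaps once out of recovery):
vdsClient -s 0 getVdsCaps > /dev/null && echo "host recovered" || echo "still stuck"
```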
As I explained in comment #2, there is nothing we can do here that wouldn't be just a burden rather than a clean solution. I also presume this won't be easy to deal with in qemu, but apart from workarounds in the upper layers (vdsm, etc.), could something be done in qemu? I'm reassigning this to the qemu-kvm component to try to solve this issue. In case there are any libvirt-related questions, feel free to request needinfo or ask me directly.

This bug looks similar in nature to bug 665820. There's nothing we can do in qemu right now to get away from it: if the connection to an NFS storage goes away, we're in trouble. Check bug 665820#c10 for more details. I'm moving it back to libvirt, as I would have to close it as WONTFIX otherwise.

After further discussion on IRC, it seems like the best option is to backport the following two qemu patches, so I'm moving this back to qemu:

- "Add rate limiting of RTC_CHANGE, BALLOON_CHANGE & WATCHDOG events"
- "Add event notification for guest balloon changes"

Dan, those are what's required, right?

Yep, those were the key patches on the QEMU side.

Hi all, I have tried this issue with NFS storage, but did not hit anything unexpected.

Host info:
kernel-2.6.32-345.el6.x86_64
qemu-kvm-0.12.1.2-2.337.el6.x86_64
seabios-0.6.1.2-25.el6.x86_64
spice-server-0.12.0-7.el6.x86_64
spice-gtk-0.14-5.el6.x86_64
spice-gtk-tools-0.14-5.el6.x86_64

Guest info:
kernel-2.6.32-345.el6.x86_64

Steps:
1. Configure the NFS export and mount it.
# cat /etc/exports
/home *(rw,no_root_squash,sync)
# mount -o soft,timeo=600,retrans=2,nosharecache $nfs_server_ip:/home /mnt
2. Boot a guest with the image located on the NFS storage, e.g.:
# /usr/libexec/qemu-kvm -M rhel6.4.0 -cpu SandyBridge -enable-kvm -m 4096 -smp 4,sockets=2,cores=2,threads=1,maxcpus=6 -usb -device usb-tablet,id=input0 -name sluo_migration -uuid 990ea161-6b67-47b2-b803-19fb01d30d30 -rtc base=localtime,clock=host,driftfix=slew -device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x3 -drive file=/mnt/RHEL-Server-6.4-64-virtio.qcow2,if=none,id=drive-virtio-disk0,format=qcow2,cache=none,werror=stop,rerror=stop -device virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -netdev tap,id=hostnet0,vhost=on,script=/etc/qemu-ifup -device virtio-net-pci,netdev=hostnet0,id=virtio-net-pci0,mac=08:2E:5F:0A:0D:B1,bus=pci.0,addr=0x5 -chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 -chardev spicevmc,id=charchannel0,name=vdagent -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=com.redhat.spice.0 -spice port=5931,disable-ticketing,seamless-migration=on -vga qxl -global qxl-vga.vram_size=67108864 -device intel-hda,id=sound0,bus=pci.0,addr=0x6 -device hda-duplex -device usb-ehci,id=ehci,addr=0x7 -chardev spicevmc,name=usbredir,id=usbredirchardev1 -device usb-redir,chardev=usbredirchardev1,id=usbredirdev1,bus=ehci.0,debug=3 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x8 -global PIIX4_PM.disable_s3=0 -global PIIX4_PM.disable_s4=0 -serial unix:/tmp/ttyS0,server,nowait -qmp tcp:0:4444,server,nowait -monitor stdio
3. Run iozone in the guest.
# iozone -a
4. Disconnect the NFS server and check the guest & host status.
- scenario 1: stop the NFS server directly.
# service nfs stop
- scenario 2: use iptables to firewall the NFS port on the host.
# iptables -A OUTPUT -p tcp -d $host_ip_addr --dport $port -j DROP
5. Check the host & guest status after 3 min.
6. Reconnect the NFS server and check the host and guest status.
- scenario 1: # service nfs restart
- scenario 2: # service iptables stop

Results:
After step 2, the guest boots successfully.
After step 4, the guest and host are in running status.
After step 5, the guest and host are still in running status, but the guest kernel log outputs a call trace (maybe bug 873246); I will attach the log later.
After step 6, the guest and host are still running correctly.

Best Regards,
sluo

Created attachment 657967 [details]
the guest kernel emits a call trace after stopping the NFS service or firewalling the NFS ports with iptables.
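Because the guest above is started with werror=stop,rerror=stop and a QMP socket on tcp:0:4444 (see the command line in the steps), the pause on I/O error can also be watched directly while the NFS outage is induced. A minimal sketch, assuming that socket and a `nc` variant that keeps the connection open while stdin is held:

```sh
# Watch QMP events during the NFS outage (assumes -qmp tcp:0:4444,server,nowait above).
# After capability negotiation, qemu streams events such as BLOCK_IO_ERROR and STOP
# when the werror=stop/rerror=stop disk hits the dead NFS mount.
{ echo '{"execute": "qmp_capabilities"}'; sleep 300; } | nc localhost 4444
```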
Hi developers, according to our KVM QE testing results in comment 13, the host and guest both work well, regardless of whether the iptables block is removed, once qemu-kvm recognizes the I/O error. If anything is mistaken, please feel free to point it out; if further testing is needed, please feel free to leave a comment.

(In reply to comment #10)
> After further discussion on IRC, it seems like the best option is to
> backport the following two qemu patches, so I'm moving back to qemu
>
> - "Add rate limiting of RTC_CHANGE, BALLOON_CHANGE & WATCHDOG events"
> - "Add event notification for guest balloon changes"
>
> Dan, those are what's required, right?

Dave, does the fix for the following RHEL 7 libvirt bug also need to be backported to 6.3.z, or is it already there?
Bug 822094 - RFE: Avoid calling 'query-balloon' monitor API in virDomainGetXMLDesc and virDomainGetInfo

(In reply to comment #16)
> (In reply to comment #10)
> > After further discussion on IRC, it seems like the best option is to
> > backport the following two qemu patches, so I'm moving back to qemu
> >
> > - "Add rate limiting of RTC_CHANGE, BALLOON_CHANGE & WATCHDOG events"
> > - "Add event notification for guest balloon changes"
> >
> > Dan, those are what's required, right?
>
> Dave, does the fix for the following RHEL 7 libvirt bug also need to be
> backported to 6.3.z, or is it already there?
> Bug 822094 - RFE: Avoid calling 'query-balloon' monitor API in
> virDomainGetXMLDesc and virDomainGetInfo

I'm wondering the same thing. I'll work on the qemu side, but we need confirmation on the following:

1. Does libvirt in 6.4 already have the needed changes, or does it need to be patched too? If it does need to be patched, then we need a bz for it.
2. For which RHEL versions should I backport this?

Since this bug is not a regression but a new feature at the qemu-kvm level, I am removing the Regression keyword from this bug.

Jakub, thanks for your testing, but while you were testing I was informed by libvirt that without the query-events command libvirt doesn't make use of the new event. The qemu build you tested doesn't have the query-events command. Are you sure you _can_ reproduce the issue with qemu from 6.4?

Re-tested this issue with https://brewweb.devel.redhat.com/taskinfo?taskID=5171409
The testing steps are the same as in comment 23.
Testing result:

First time:
{"QMP": {"version": {"qemu": {"micro": 1, "minor": 12, "major": 0}, "package": "(qemu-kvm-0.12.1.2)"}, "capabilities": []}}
{"execute": "qmp_capabilities"}
{"return": {}}
{"timestamp": {"seconds": 1355111067, "microseconds": 986891}, "event": "BALLOON_CHANGE", "data": {"actual": 2146435072}}
{"timestamp": {"seconds": 1355111068, "microseconds": 967956}, "event": "BALLOON_CHANGE", "data": {"actual": 407896064}}
{"timestamp": {"seconds": 1355111069, "microseconds": 870799}, "event": "BALLOON_CHANGE", "data": {"actual": 339972096}}
{"timestamp": {"seconds": 1355111070, "microseconds": 285427}, "event": "BALLOON_CHANGE", "data": {"actual": 334503936}}
{"timestamp": {"seconds": 1355111071, "microseconds": 801001}, "event": "BALLOON_CHANGE", "data": {"actual": 332226560}}
{"timestamp": {"seconds": 1355111072, "microseconds": 610561}, "event": "BALLOON_CHANGE", "data": {"actual": 330407936}}
{"timestamp": {"seconds": 1355111073, "microseconds": 820518}, "event": "BALLOON_CHANGE", "data": {"actual": 329113600}}
{"timestamp": {"seconds": 1355111074, "microseconds": 838049}, "event": "BALLOON_CHANGE", "data": {"actual": 326492160}}
{"timestamp": {"seconds": 1355111075, "microseconds": 854049}, "event": "BALLOON_CHANGE", "data": {"actual": 323403776}}
{"timestamp": {"seconds": 1355111076, "microseconds": 666959}, "event": "BALLOON_CHANGE", "data": {"actual": 320581632}}

Second time:
{"QMP": {"version": {"qemu": {"micro": 1, "minor": 12, "major": 0}, "package": "(qemu-kvm-0.12.1.2)"}, "capabilities": []}}
{"execute": "qmp_capabilities"}
{"return": {}}
{"timestamp": {"seconds": 1355111598, "microseconds": 587178}, "event": "BALLOON_CHANGE", "data": {"actual": 2146435072}}
{"timestamp": {"seconds": 1355111599, "microseconds": 586354}, "event": "BALLOON_CHANGE", "data": {"actual": 417333248}}
{"timestamp": {"seconds": 1355111600, "microseconds": 564807}, "event": "BALLOON_CHANGE", "data": {"actual": 343719936}}
{"timestamp": {"seconds": 1355111601, "microseconds": 196695}, "event": "BALLOON_CHANGE", "data": {"actual": 335794176}}
{"timestamp": {"seconds": 1355111602, "microseconds": 406443}, "event": "BALLOON_CHANGE", "data": {"actual": 334860288}}
{"timestamp": {"seconds": 1355111603, "microseconds": 416257}, "event": "BALLOON_CHANGE", "data": {"actual": 333811712}}
{"timestamp": {"seconds": 1355111604, "microseconds": 428653}, "event": "BALLOON_CHANGE", "data": {"actual": 331460608}}
{"timestamp": {"seconds": 1355111605, "microseconds": 236315}, "event": "BALLOON_CHANGE", "data": {"actual": 330924032}}
{"timestamp": {"seconds": 1355111606, "microseconds": 449123}, "event": "BALLOON_CHANGE", "data": {"actual": 329601024}}
{"timestamp": {"seconds": 1355111607, "microseconds": 366527}, "event": "BALLOON_CHANGE", "data": {"actual": 326688768}}
{"timestamp": {"seconds": 1355111608, "microseconds": 582896}, "event": "BALLOON_CHANGE", "data": {"actual": 322342912}}
{"timestamp": {"seconds": 1355111609, "microseconds": 494079}, "event": "BALLOON_CHANGE", "data": {"actual": 320143360}}

Test of the query-events command:
{"execute": "query-events"}
{"return": [{"name": "SPICE_MIGRATE_COMPLETED"}, {"name": "BALLOON_CHANGE"}, {"name": "WAKEUP"}, {"name": "SUSPEND_DISK"}, {"name": "SUSPEND"}, {"name": "__com.redhat_SPICE_DISCONNECTED"}, {"name": "__com.redhat_SPICE_INITIALIZED"}, {"name": "BLOCK_JOB_CANCELLED"}, {"name": "BLOCK_JOB_COMPLETED"}, {"name": "DEVICE_TRAY_MOVED"}, {"name": "SPICE_DISCONNECTED"}, {"name": "SPICE_INITIALIZED"}, {"name": "SPICE_CONNECTED"}, {"name": "WATCHDOG"}, {"name": "RTC_CHANGE"}, {"name": "BLOCK_IO_ERROR"}, {"name": "VNC_DISCONNECTED"}, {"name": "VNC_INITIALIZED"}, {"name": "VNC_CONNECTED"}, {"name": "RESUME"}, {"name": "STOP"}, {"name": "POWERDOWN"}, {"name": "RESET"}, {"name": "SHUTDOWN"}]}
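The presence of BALLOON_CHANGE in the query-events list is what libvirt keys on before relying on the new event (see the discussion above), so the same check can be scripted. A minimal sketch, assuming QMP is again exposed on tcp:0:4444 as in the earlier command line:

```sh
# Probe whether the running qemu-kvm advertises the BALLOON_CHANGE event.
# Prints the event name only on a build carrying the backported patches.
{ echo '{"execute": "qmp_capabilities"}'; sleep 1; \
  echo '{"execute": "query-events"}'; sleep 1; } | nc localhost 4444 | grep -o '"BALLOON_CHANGE"'
```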
"RTC_CHANGE"}, {"name": "BLOCK_IO_ERROR"}, {"name": "VNC_DISCONNECTED"}, {"name": "VNC_INITIALIZED"}, {"name": "VNC_CONNECTED"}, {"name": "RESUME"}, {"name": "STOP"}, {"name": "POWERDOWN"}, {"name": "RESET"}, {"name": "SHUTDOWN"}]} These are the results I'm getting with a fixed build: * Blocked hosts still respond to getVdsCaps. * Blocked hosts aren't moved to Non Responding but ARE moved to Non Operational. * Some VMs on blocked hosts are migrated to non-blocked host. * Some VMs on blocked hosts are flip-flopping between Paused and Not Responding statuses. Daniel, Does this mean the issue is fixed? Also, can you say which of the builds you've tested? (In reply to comment #38) > Daniel, > > Does this mean the issue is fixed? yes - issue is fixed and vdsm boots normally recovering the needed vms. >Also, can you say which of the builds you've tested? we tested the following: vdsm-4.9.6-44.0.el6_3.x86_64 libvirt-0.10.2-11.el6.x86_64 kernel-2.6.32-279.15.1.el6.x86_64 qemu-kvm-rhev-0.12.1.2-2.295.el6_3.balloon2.x86_64 Thanks Haim, patches have been posted for 6.4. Now, can you please clarify if you need 6.4 and/or 6.3.z? If you need 6.3.z, it's important to note that we first have to get the patches merged in 6.4 (and that's what I'm working on right now). Hi Luiz, We're planning on building the RHEV Hypervisor today which is based on RHEL 6.3.z bug fixes and this bug is one of the fixes we need in today's RHEV-H 6.3.z build. Hope this helps. Thanks, Chris Chris, It's very unlikely we'll the fixes in for 6.3.z today. Fixed in version qemu-kvm-0.12.1.2-2.344.el6 Michal Verified with qemu-kvm-rhev-0.12.1.2-2.348.el6.x86_64, libvirt-0.10.2-13.el6.x86_64, vdsm-4.10.2-1.0.el6.x86_64 Steps: 1. install a VM on a host 2. block connectivity to the storage domain from this host using iptables # iptables -A INPUT -s $ip_of_storage_domain -j DROP 3. remove iptables 4. 
Actual Result:

After step 2:
# vdsClient -s 0 list table
Failed to initialize storage
# virsh -r list
 Id    Name                           State
----------------------------------------------------
 1     win7-bz881732                  running

In the vdsm log, getCapabilities was trying to answer vdsm but failed:
BindingXMLRPC::903::vds::(wrapper) client [10.66.4.148]::call getCapabilities with () {}
BindingXMLRPC::910::vds::(wrapper) return getCapabilities with {'status': {'message': 'Recovering from crash or Initializing', 'code': 99}}

After step 4:
# vdsClient -s 0 list table
13549dd1-2583-4b61-ac55-78a4a92ead3d  8775  win7-bz881732  Up
# virsh -r list
 Id    Name                           State
----------------------------------------------------
 1     win7-bz881732                  running

In the vdsm log, getCapabilities answered vdsm successfully:
BindingXMLRPC::903::vds::(wrapper) client [10.66.4.148]::call getCapabilities with () {} flowID [4a31d36e]
BindingXMLRPC::910::vds::(wrapper) return getCapabilities with {'status': {'message': 'Done', 'code': 0}, 'info': {'HBAInventory': {'iSCSI': [{'InitiatorName': 'iqn.1994-05.com.redhat:80cd3cf52ce'}], 'FC': []}, 'packages2': {'kernel': {'release': '348.el6.x86_64', 'buildtime': 1355114883.0, 'version': '2.6.32'}, 'qemu-kvm-rhev': {'release': '2.348.el6', 'buildtime': 1355930976L, 'version': '0.12.1.2'}, 'spice-server': {'release': '10.el6', 'buildtime': 1356032154L, 'version': '0.12.0'}, 'vdsm': {'release': '1.0.el6', 'buildtime': 1353349266L, 'version': '4.10.2'}, 'libvirt': {'release': '13.el6', 'buildtime': 1355929853L, 'version': '0.10.2'}, 'qemu-img-rhev': {'release': '2.348.el6', 'buildtime': 1355930976L, 'version': '0.12.1.2'}}, 'cpuModel': 'Intel(R) Core(TM) i5-2400 CPU @ 3.10GHz', 'hooks': {}, 'vmTypes': ['kvm'], 'supportedProtocols': ['2.2', '2.3'], 'networks': {'rhevm': {'iface': 'rhevm', 'addr': '', 'cfg': {'DEVICE': 'rhevm', 'DELAY': '0', 'BOOTPROTO': 'dhcp', 'TYPE': 'Bridge', 'ONBOOT': 'yes'}, 'mtu': '1500', 'netmask': '', 'stp': 'off', 'bridged': True, 'gateway': '10.66.103.254', 'ports': ['eth0', 'vnet0', 'vnet1']}}, 'bridges': {'rhevm': {'addr': '', 'cfg': {'DEVICE': 'rhevm', 'DELAY': '0', 'BOOTPROTO': 'dhcp', 'TYPE': 'Bridge', 'ONBOOT': 'yes'}, 'mtu': '1500', 'netmask': '', 'stp': 'off', 'ports': ['eth0', 'vnet0', 'vnet1']}}, 'uuid': '4C4C4544-0048-3710-8050-B6C04F423358_d4:be:d9:a2:83:ba', 'lastClientIface': 'rhevm', 'nics': {'eth0': {'addr': '', 'cfg': {'BRIDGE': 'rhevm', 'IPV6INIT': 'yes', 'NM_CONTROLLED': 'no', 'MTU': '1500', 'HWADDR': 'D4:BE:D9:A2:83:BA', 'DEVICE': 'eth0', 'TYPE': 'Ethernet', 'ONBOOT': 'yes', 'UUID': '86c006a6-9de1-472b-aff1-cc2134eb5d23'}, 'mtu': '1500', 'netmask': '', 'hwaddr': 'd4:be:d9:a2:83:ba', 'speed': 1000}}, 'software_revision': '1.0', 'clusterLevels': ['3.0', '3.1'], 'cpuFlags': u'fpu,vme,de,pse,tsc,msr,pae,mce,cx8,apic,sep,mtrr,pge,mca,cmov,pat,pse36,clflush,dts,acpi,mmx,fxsr,sse,sse2,ss,ht,tm,pbe,syscall,nx,rdtscp,lm,constant_tsc,arch_perfmon,pebs,bts,rep_good,xtopology,nonstop_tsc,aperfmperf,pni,pclmulqdq,dtes64,monitor,ds_cpl,vmx,smx,est,tm2,ssse3,cx16,xtpr,pdcm,pcid,sse4_1,sse4_2,x2apic,popcnt,tsc_deadline_timer,aes,xsave,avx,lahf_lm,ida,arat,epb,xsaveopt,pln,pts,dts,tpr_shadow,vnmi,flexpriority,ept,vpid,model_Nehalem,model_Conroe,model_coreduo,model_core2duo,model_Penryn,model_Westmere,model_n270,model_SandyBridge', 'ISCSIInitiatorName': 'iqn.1994-05.com.redhat:80cd3cf52ce', 'netConfigDirty': 'False', 'supportedENGINEs': ['3.0', '3.1'], 'reservedMem': '321', 'bondings': {'bond4': {'addr': '', 'cfg': {}, 'mtu': '1500', 'netmask': '', 'slaves': [],
'hwaddr': '00:00:00:00:00:00'}, 'bond0': {'addr': '', 'cfg': {}, 'mtu': '1500', 'netmask': '', 'slaves': [], 'hwaddr': '00:00:00:00:00:00'}, 'bond1': {'addr': '', 'cfg': {}, 'mtu': '1500', 'netmask': '', 'slaves': [], 'hwaddr': '00:00:00:00:00:00'}, 'bond2': {'addr': '', 'cfg': {}, 'mtu': '1500', 'netmask': '', 'slaves': [], 'hwaddr': '00:00:00:00:00:00'}, 'bond3': {'addr': '', 'cfg': {}, 'mtu': '1500', 'netmask': '', 'slaves': [], 'hwaddr': '00:00:00:00:00:00'}}, 'software_version': '4.10', 'memSize': '7750', 'cpuSpeed': '3093.060', 'cpuSockets': '1', 'vlans': {}, 'cpuCores': '4', 'kvmEnabled': 'true', 'guestOverhead': '65', 'management_ip': '', 'version_name': 'Snow Man', 'emulatedMachines': [u'rhel6.4.0', u'pc', u'rhel6.3.0', u'rhel6.2.0', u'rhel6.1.0', u'rhel6.0.0', u'rhel5.5.0', u'rhel5.4.4', u'rhel5.4.0'], 'operatingSystem': {'release': '6.4.0.3.el6', 'version': '6Server', 'name': 'RHEL'}, 'lastClient': '10.66.4.148'}}

As shown above, this issue has been fixed correctly.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-0527.html