Description of problem:
- host is SPM
- kill -STOP `pgrep vdsm` && sleep 120 && kill -CONT `pgrep vdsm`
- vdsClient 0 getVdsCaps -> unexpected exception

Thread-6461::DEBUG::2012-06-20 20:19:43,541::BindingXMLRPC::864::vds::(wrapper) client [10.35.104.13]::call getCapabilities with () {}
Thread-6461::ERROR::2012-06-20 20:19:43,581::BindingXMLRPC::873::vds::(wrapper) libvirt error
Traceback (most recent call last):
  File "/usr/share/vdsm/BindingXMLRPC.py", line 869, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/BindingXMLRPC.py", line 278, in getCapabilities
    ret = api.getCapabilities()
  File "/usr/share/vdsm/API.py", line 1038, in getCapabilities
    c = caps.get()
  File "/usr/share/vdsm/caps.py", line 215, in get
    caps.update(netinfo.get())
  File "/usr/lib/python2.6/site-packages/vdsm/netinfo.py", line 246, in get
    nets = networks()
  File "/usr/lib/python2.6/site-packages/vdsm/netinfo.py", line 69, in networks
    for name in conn.listNetworks():
  File "/usr/lib64/python2.6/site-packages/libvirt.py", line 3240, in listNetworks
    if ret is None: raise libvirtError ('virConnectListNetworks() failed', conn=self)
libvirtError: Cannot write data: Broken pipe

[root@orange-vdsd ~]# ps -o etime `pgrep libvirt`
    ELAPSED
 3-03:35:32

The libvirtd service did not restart or die in any way during this procedure, so vdsm should still be able to communicate with it via its old PID.
I do not know why the socket got broken. It would be interesting to retry this SIGSTOP with virsh. Does vdsm take itself down after this failure?
*** Bug 833772 has been marked as a duplicate of this bug. ***
(In reply to comment #2)
> I do not know why the socket got broken. It would be interesting to retry
> this SIGSTOP with virsh.

I found this in the libvirtd logs after reproducing:

2012-07-20 15:36:43.373+0000: 4270: warning : virKeepAliveTimer:182 : No response from client 0x266b7e0 after 5 keepalive messages in 30 seconds
2012-07-20 15:36:43.373+0000: 4270: debug : virConnectClose:1496 : conn=0x7fbbc8000930
2012-07-20 15:36:43.374+0000: 4270: debug : qemuDriverCloseCallbackRunAll:661 : conn=0x7fbbc8000930

So it looks like libvirt shuts the connection down. There was a patch to libvirt that allows the client to turn off keepalives, but there are some caveats regarding event loops, and it looks like VDSM makes use of those. So...

> Does vdsm take itself down after this failure?

No, and I see that there is code that wraps the connection methods in order to trigger a shutdown upon error. I'm looking into why this isn't happening.
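For reference, the timeout arithmetic behind that warning can be sketched in a small simulation. This is only an illustration of the keepalive protocol's timing, not libvirt's actual implementation: with the defaults of keepalive_interval = 5 and keepalive_count = 5, the daemon sends 5 unanswered pings and drops a frozen client after roughly 30 seconds, which matches the "5 keepalive messages in 30 seconds" log line.

```python
# Sketch of libvirt's server-side keepalive timing (illustrative only):
# the daemon sends a ping every `interval` seconds of silence and closes
# the connection once `count` pings have gone unanswered.
def keepalive_disconnect_time(interval, count, frozen_for):
    """Seconds after which the server drops a frozen client, or None if
    the client resumes in time (or keepalive is disabled)."""
    if interval <= 0:
        return None  # keepalive disabled (keepalive_interval = -1)
    # Pings go out at interval, 2*interval, ..., count*interval; the
    # timer then fires one interval later and closes the connection.
    cutoff = interval * (count + 1)
    return cutoff if frozen_for >= cutoff else None
```

A 120-second SIGSTOP with the defaults is therefore disconnected at the 30-second mark, long before the client resumes.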
Haim,

Please check whether this reproduces on the latest libvirt, after the change of libvirt's keepalive default to off.
(In reply to comment #6)
> Haim,
>
> Please check if reproduced on the latest libvirt - after change of libvirt's
> keepalive default to off

No repro on the latest libvirt and vdsm. I executed the script and vdsm properly takes itself down:

Thread-10779::ERROR::2012-08-02 16:40:32,916::libvirtconnection::92::vds::(wrapper) connection to libvirt broken. taking vdsm down.
Thread-10779::INFO::2012-08-02 16:40:32,927::vmChannels::135::vds::(stop) VM channels listener was stopped.
Thread-10779::DEBUG::2012-08-02 16:40:32,928::task::588::TaskManager.Task::(_updateState) Task=`1baf7dbc-fa12-4efa-a369-a171c73bd9f4`::moving from state init -> state preparing
Thread-10779::INFO::2012-08-02 16:40:32,929::logUtils::37::dispatcher::(wrapper) Run and protect: prepareForShutdown(options=None)

vdsm-4.9.6-24.0.el6_3.x86_64
libvirt-0.9.10-21.el6.x86_64
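The "taking vdsm down" line comes from vdsm wrapping its libvirt connection methods so that a connection-level failure triggers a deliberate shutdown instead of leaving a dead connection in place. A minimal sketch of that pattern, with illustrative names (`ConnectionBroken`, `panic_cb` are not vdsm's actual identifiers; real vdsm inspects libvirt error codes such as VIR_ERR_SYSTEM_ERROR):

```python
import functools

class ConnectionBroken(Exception):
    """Stand-in for a libvirt error indicating the connection itself died."""

def wrap_methods(conn, panic_cb):
    """Wrap every public method of `conn` so that a ConnectionBroken
    error invokes panic_cb (vdsm logs and takes itself down) before
    re-raising the original error."""
    def wrap(method):
        @functools.wraps(method)
        def wrapper(*args, **kwargs):
            try:
                return method(*args, **kwargs)
            except ConnectionBroken:
                panic_cb()  # e.g. log "taking vdsm down" and exit
                raise
        return wrapper
    for name in dir(conn):
        attr = getattr(conn, name)
        if callable(attr) and not name.startswith('_'):
            setattr(conn, name, wrap(attr))
    return conn
```

With this in place, the first libvirt call on a broken connection (such as the listNetworks() in comment #1's traceback) would trigger the shutdown path rather than being silently ignored.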
Reopening this bug as keepalives are once again enabled by default (BZ#832081) and the steps listed in c#1 can reproduce this problem on RHEL 6.4:

# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 6.4 (Santiago)

# rpm -qa | egrep '(libvirt|vdsm)' | sort
libvirt-0.10.2-18.el6.x86_64
libvirt-client-0.10.2-18.el6.x86_64
libvirt-debuginfo-0.10.2-18.el6.x86_64
libvirt-lock-sanlock-0.10.2-18.el6.x86_64
libvirt-python-0.10.2-18.el6.x86_64
vdsm-4.10.2-1.6.el6.x86_64
vdsm-cli-4.10.2-1.6.el6.noarch
vdsm-python-4.10.2-1.6.el6.x86_64
vdsm-xmlrpc-4.10.2-1.6.el6.noarch

# kill -STOP `pgrep vdsm` && sleep 120 && kill -CONT `pgrep vdsm`
# vdsClient -s 0 getVdsCaps
Unexpected exception

Disabling keepalives works around this for now:

/etc/libvirt/libvirtd.conf
~~~
keepalive_interval = -1
~~~

# service libvirtd restart
# service vdsmd restart
# kill -STOP `pgrep vdsm` && sleep 120 && kill -CONT `pgrep vdsm`
# vdsClient -s 0 getVdsCaps
	HBAInventory = {'iSCSI': [{'InitiatorName': 'iqn.1994-05.com.redhat:eb7ebd96c076'}], 'FC': [{'wwpn': '21000024ff2c1bb8', 'wwnn': '20000024ff2c1bb8', 'model': 'QLE2562 - QLogic 8Gb FC Dual-port HBA for System x'}, {'wwpn': '21000024ff2c1bb9', 'wwnn': '20000024ff2c1bb9', 'model': 'QLE2562 - QLogic 8Gb FC Dual-port HBA for System x'}]}
	[..]

BZ#951571 covers another use case where vdsmd fences itself due to libvirtd closing the connection during a migration.
Reporting an instance of lost keepalives and a vdsm crash.

libvirtd.log (1 hour behind):
2013-03-26 10:14:28.583+0000: 13785: warning : virKeepAliveTimerInternal:156 : No response from client 0x11eb3a0 after 5 keepalive messages in 30 seconds

messages:
Mar 26 11:14:28 xxxxxxxx kernel: vdsm[27223]: segfault at 20 ip 00000032d0209220 sp 00007fc9f57f9748 error 4 in libpthread-2.12.so[32d0200000+17000]

Log files will follow shortly.
*** Bug 951571 has been marked as a duplicate of this bug. ***
Checked on RHEVM 3.2 - SF14:
vdsm-4.10.2-16.0.el6ev.x86_64
libvirt-0.10.2-18.el6_4.4.x86_64

vdsm does not lose its connection to the libvirt socket.

[root@nott-vds1 ~]# kill -STOP `pgrep vdsm` && sleep 120 && kill -CONT `pgrep vdsm`
[root@nott-vds1 ~]# ps -o etime `pgrep libvirt`
[root@nott-vds1 ~]# vdsClient -s 0 getVdsCaps
	HBAInventory = {'iSCSI': [{'InitiatorName': 'iqn.1994-05.com.redhat:nott-vds1'}], 'FC': []}
	ISCSIInitiatorName = iqn.1994-05.com.redhat:nott-vds1
	bondings = {'bond4': {'addr': '', 'cfg': {}, 'mtu': '1500', 'netmask': '', 'slaves': [], 'hwaddr': '00:00:00:00:00:00'}, 'bond0': {'addr': '', 'cfg': {}, 'mtu': '1500', 'netmask': '', 'slaves': [], 'hwaddr': '00:00:00:00:00:00'}, 'bond1': {'addr': '', 'cfg': {}, 'mtu': '1500', 'netmask': '', 'slaves': [], 'hwaddr': '00:00:00:00:00:00'}, 'bond2': {'addr': '', 'cfg': {}, 'mtu': '1500', 'netmask': '', 'slaves': [], 'hwaddr': '00:00:00:00:00:00'}, 'bond3': {'addr': '', 'cfg': {}, 'mtu': '1500', 'netmask': '', 'slaves': [], 'hwaddr': '00:00:00:00:00:00'}}
	bridges = {'rhevm': {'addr': '10.35.115.10', 'cfg': {'IPV6INIT': 'yes', 'DHCP_HOSTNAME': 'nott-vds1.qa.lab.tlv.redhat.com', 'HOSTNAME': 'nott-vds1.qa.lab.tlv.redhat.com', 'MTU': '1500', 'DELAY': '0', 'NM_CONTROLLED': 'no', 'BOOTPROTO': 'dhcp', 'DEVICE': 'rhevm', 'TYPE': 'Bridge', 'ONBOOT': 'yes', 'UUID': 'ea5b84dd-3530-4fae-9ee0-df853fac5125'}, 'mtu': '1500', 'netmask': '255.255.255.0', 'stp': 'off', 'ports': ['eth0']}}
	clusterLevels = ['3.0', '3.1', '3.2']
	cpuCores = 8
	cpuFlags = fpu,vme,de,pse,tsc,msr,pae,mce,cx8,apic,sep,mtrr,pge,mca,cmov,pat,pse36,clflush,dts,acpi,mmx,fxsr,sse,sse2,ss,ht,tm,pbe,syscall,nx,rdtscp,lm,constant_tsc,arch_perfmon,pebs,bts,rep_good,xtopology,nonstop_tsc,aperfmperf,pni,dtes64,monitor,ds_cpl,vmx,est,tm2,ssse3,cx16,xtpr,pdcm,dca,sse4_1,sse4_2,popcnt,lahf_lm,ida,dts,tpr_shadow,vnmi,flexpriority,ept,vpid,model_Nehalem,model_Conroe,model_coreduo,model_core2duo,model_Penryn,model_n270
	cpuModel = Intel(R) Xeon(R) CPU E5520 @ 2.27GHz
	cpuSockets = 2
	cpuSpeed = 2266.000
	cpuThreads = 16
	emulatedMachines = ['rhel6.4.0', 'pc', 'rhel6.3.0', 'rhel6.2.0', 'rhel6.1.0', 'rhel6.0.0', 'rhel5.5.0', 'rhel5.4.4', 'rhel5.4.0']
	guestOverhead = 65
	hooks = {}
	kvmEnabled = true
	lastClient = 10.35.161.240
	lastClientIface = rhevm
	management_ip = 
	memSize = 32095
	netConfigDirty = False
	networks = {'rhevm': {'iface': 'rhevm', 'addr': '10.35.115.10', 'cfg': {'IPV6INIT': 'yes', 'DHCP_HOSTNAME': 'nott-vds1.qa.lab.tlv.redhat.com', 'HOSTNAME': 'nott-vds1.qa.lab.tlv.redhat.com', 'MTU': '1500', 'DELAY': '0', 'NM_CONTROLLED': 'no', 'BOOTPROTO': 'dhcp', 'DEVICE': 'rhevm', 'TYPE': 'Bridge', 'ONBOOT': 'yes', 'UUID': 'ea5b84dd-3530-4fae-9ee0-df853fac5125'}, 'mtu': '1500', 'netmask': '255.255.255.0', 'stp': 'off', 'bridged': True, 'gateway': '10.35.115.254', 'ports': ['eth0']}}
	nics = {'eth1': {'addr': '', 'cfg': {'DEVICE': 'eth1', 'UUID': '144387f8-f9ff-455d-82f6-fafc6fb86d45', 'DHCP_HOSTNAME': 'nott-vds1.qa.lab.tlv.redhat.com', 'NM_CONTROLLED': 'yes', 'BOOTPROTO': 'dhcp', 'HWADDR': '78:E7:D1:E4:8C:71', 'TYPE': 'Ethernet', 'ONBOOT': 'no'}, 'mtu': '1500', 'netmask': '', 'hwaddr': '78:e7:d1:e4:8c:71', 'speed': 0}, 'eth0': {'addr': '', 'cfg': {'BRIDGE': 'rhevm', 'DEVICE': 'eth0', 'IPV6INIT': 'yes', 'DHCP_HOSTNAME': 'nott-vds1.qa.lab.tlv.redhat.com', 'HOSTNAME': 'nott-vds1.qa.lab.tlv.redhat.com', 'MTU': '1500', 'NM_CONTROLLED': 'no', 'HWADDR': '78:e7:d1:e4:8c:70', 'ONBOOT': 'yes', 'UUID': 'ea5b84dd-3530-4fae-9ee0-df853fac5125'}, 'mtu': '1500', 'netmask': '', 'hwaddr': '78:e7:d1:e4:8c:70', 'speed': 1000}}
	operatingSystem = {'release': '6.4.0.4.el6', 'version': '6Server', 'name': 'RHEL'}
	packages2 = {'kernel': {'release': '358.2.1.el6.x86_64', 'buildtime': 1361355457.0, 'version': '2.6.32'}, 'spice-server': {'release': '12.el6', 'buildtime': 1358331735, 'version': '0.12.0'}, 'vdsm': {'release': '16.0.el6ev', 'buildtime': 1366880858, 'version': '4.10.2'}, 'qemu-kvm': {'release': '2.355.el6_4.3', 'buildtime': 1365612567, 'version': '0.12.1.2'}, 'libvirt': {'release': '18.el6_4.4', 'buildtime': 1365809427, 'version': '0.10.2'}, 'qemu-img': {'release': '2.355.el6_4.3', 'buildtime': 1365612567, 'version': '0.12.1.2'}}
	reservedMem = 321
	software_revision = 16.0
	software_version = 4.10
	supportedENGINEs = ['3.0', '3.1', '3.2']
	supportedProtocols = ['2.2', '2.3']
	supportedRHEVMs = ['3.0']
	uuid = bb62c08b-ba33-4a8a-b8b5-bfa1b06f6290
	version_name = Snow Man
	vlans = {}
	vmTypes = ['kvm']

[root@nott-vds1 ~]# ps -o etime `pgrep libvirt`
  ELAPSED
 02:05:19
[root@nott-vds1 ~]# vdsClient -s 0 getVdsCaps
	HBAInventory = {'iSCSI': [{'InitiatorName': 'iqn.1994-05.com.redhat:nott-vds1'}], 'FC': []}
	ISCSIInitiatorName = iqn.1994-05.com.redhat:nott-vds1
	bondings = {'bond4': {'addr': '', 'cfg': {}, 'mtu': '1500', 'netmask': '', 'slaves': [], 'hwaddr': '00:00:00:00:00:00'}, 'bond0': {'addr': '', 'cfg': {}, 'mtu': '1500', 'netmask': '', 'slaves': [], 'hwaddr': '00:00:00:00:00:00'}, 'bond1': {'addr': '', 'cfg': {}, 'mtu': '1500', 'netmask': '', 'slaves': [], 'hwaddr': '00:00:00:00:00:00'}, 'bond2': {'addr': '', 'cfg': {}, 'mtu': '1500', 'netmask': '', 'slaves': [], 'hwaddr': '00:00:00:00:00:00'}, 'bond3': {'addr': '', 'cfg': {}, 'mtu': '1500', 'netmask': '', 'slaves': [], 'hwaddr': '00:00:00:00:00:00'}}
	bridges = {'rhevm': {'addr': '10.35.115.10', 'cfg': {'IPV6INIT': 'yes', 'DHCP_HOSTNAME': 'nott-vds1.qa.lab.tlv.redhat.com', 'HOSTNAME': 'nott-vds1.qa.lab.tlv.redhat.com', 'MTU': '1500', 'DELAY': '0', 'NM_CONTROLLED': 'no', 'BOOTPROTO': 'dhcp', 'DEVICE': 'rhevm', 'TYPE': 'Bridge', 'ONBOOT': 'yes', 'UUID': 'ea5b84dd-3530-4fae-9ee0-df853fac5125'}, 'mtu': '1500', 'netmask': '255.255.255.0', 'stp': 'off', 'ports': ['eth0']}}
	clusterLevels = ['3.0', '3.1', '3.2']
	cpuCores = 8
	cpuFlags = fpu,vme,de,pse,tsc,msr,pae,mce,cx8,apic,sep,mtrr,pge,mca,cmov,pat,pse36,clflush,dts,acpi,mmx,fxsr,sse,sse2,ss,ht,tm,pbe,syscall,nx,rdtscp,lm,constant_tsc,arch_perfmon,pebs,bts,rep_good,xtopology,nonstop_tsc,aperfmperf,pni,dtes64,monitor,ds_cpl,vmx,est,tm2,ssse3,cx16,xtpr,pdcm,dca,sse4_1,sse4_2,popcnt,lahf_lm,ida,dts,tpr_shadow,vnmi,flexpriority,ept,vpid,model_Nehalem,model_Conroe,model_coreduo,model_core2duo,model_Penryn,model_n270
	cpuModel = Intel(R) Xeon(R) CPU E5520 @ 2.27GHz
	cpuSockets = 2
	cpuSpeed = 2266.000
	cpuThreads = 16
	emulatedMachines = ['rhel6.4.0', 'pc', 'rhel6.3.0', 'rhel6.2.0', 'rhel6.1.0', 'rhel6.0.0', 'rhel5.5.0', 'rhel5.4.4', 'rhel5.4.0']
	guestOverhead = 65
	hooks = {}
	kvmEnabled = true
	lastClient = 10.35.161.240
	lastClientIface = rhevm
	management_ip = 
	memSize = 32095
	netConfigDirty = False
	networks = {'rhevm': {'iface': 'rhevm', 'addr': '10.35.115.10', 'cfg': {'IPV6INIT': 'yes', 'DHCP_HOSTNAME': 'nott-vds1.qa.lab.tlv.redhat.com', 'HOSTNAME': 'nott-vds1.qa.lab.tlv.redhat.com', 'MTU': '1500', 'DELAY': '0', 'NM_CONTROLLED': 'no', 'BOOTPROTO': 'dhcp', 'DEVICE': 'rhevm', 'TYPE': 'Bridge', 'ONBOOT': 'yes', 'UUID': 'ea5b84dd-3530-4fae-9ee0-df853fac5125'}, 'mtu': '1500', 'netmask': '255.255.255.0', 'stp': 'off', 'bridged': True, 'gateway': '10.35.115.254', 'ports': ['eth0']}}
	nics = {'eth1': {'addr': '', 'cfg': {'DEVICE': 'eth1', 'UUID': '144387f8-f9ff-455d-82f6-fafc6fb86d45', 'DHCP_HOSTNAME': 'nott-vds1.qa.lab.tlv.redhat.com', 'NM_CONTROLLED': 'yes', 'BOOTPROTO': 'dhcp', 'HWADDR': '78:E7:D1:E4:8C:71', 'TYPE': 'Ethernet', 'ONBOOT': 'no'}, 'mtu': '1500', 'netmask': '', 'hwaddr': '78:e7:d1:e4:8c:71', 'speed': 0}, 'eth0': {'addr': '', 'cfg': {'BRIDGE': 'rhevm', 'DEVICE': 'eth0', 'IPV6INIT': 'yes', 'DHCP_HOSTNAME': 'nott-vds1.qa.lab.tlv.redhat.com', 'HOSTNAME': 'nott-vds1.qa.lab.tlv.redhat.com', 'MTU': '1500', 'NM_CONTROLLED': 'no', 'HWADDR': '78:e7:d1:e4:8c:70', 'ONBOOT': 'yes', 'UUID': 'ea5b84dd-3530-4fae-9ee0-df853fac5125'}, 'mtu': '1500', 'netmask': '', 'hwaddr': '78:e7:d1:e4:8c:70', 'speed': 1000}}
	operatingSystem = {'release': '6.4.0.4.el6', 'version': '6Server', 'name': 'RHEL'}
	packages2 = {'kernel': {'release': '358.2.1.el6.x86_64', 'buildtime': 1361355457.0, 'version': '2.6.32'}, 'spice-server': {'release': '12.el6', 'buildtime': 1358331735, 'version': '0.12.0'}, 'vdsm': {'release': '16.0.el6ev', 'buildtime': 1366880858, 'version': '4.10.2'}, 'qemu-kvm': {'release': '2.355.el6_4.3', 'buildtime': 1365612567, 'version': '0.12.1.2'}, 'libvirt': {'release': '18.el6_4.4', 'buildtime': 1365809427, 'version': '0.10.2'}, 'qemu-img': {'release': '2.355.el6_4.3', 'buildtime': 1365612567, 'version': '0.12.1.2'}}
	reservedMem = 321
	software_revision = 16.0
	software_version = 4.10
	supportedENGINEs = ['3.0', '3.1', '3.2']
	supportedProtocols = ['2.2', '2.3']
	supportedRHEVMs = ['3.0']
	uuid = bb62c08b-ba33-4a8a-b8b5-bfa1b06f6290
	version_name = Snow Man
	vlans = {}
	vmTypes = ['kvm']
Simon,

The Doc Text I've written here is pretty vague because I don't understand what the "certain case" is that causes VDSM to lose its connection to the libvirt socket. Can you direct me to someone who can explain what causes VDSM to lose its connection to the libvirt socket?

Thanks in advance.

Zac
Hey Zac,

libvirt uses keepalive logic: when the connection to vdsm breaks for a while, the socket between libvirt and vdsm is automatically closed. Because of a bug, under high-load scenarios the keepalive logic closed the socket and vdsm stayed up without being able to re-establish the communication. To fix the issue we improved the way vdsm recognizes this scenario, so that it restarts itself to establish the connection again. We also disable the keepalive feature in libvirt as part of vdsm initialization.
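As a rough sketch of that second mitigation (the helper name below is hypothetical; the real client-side call is libvirt-python's virConnect.setKeepAlive(interval, count), where a non-positive interval disables keepalive on the connection):

```python
# Hypothetical sketch of the vdsm initialization step that turns off
# libvirt keepalives for its own connection, so the daemon will not
# drop vdsm while it is temporarily unresponsive.
def disable_keepalive(conn):
    """Ask libvirt not to run keepalive probes on this connection.
    Returns True on success, False if the libvirt in use lacks
    keepalive support (older releases raise an error here)."""
    try:
        conn.setKeepAlive(-1, 0)  # interval < 0 disables keepalive
        return True
    except Exception:
        return False
```

This complements the self-restart logic: even if keepalive is somehow re-enabled server-side, a broken connection now takes vdsm down so it can reconnect cleanly.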
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHSA-2013-0886.html