Bug 1206884
| Field | Value |
|---|---|
| Summary | HE deployment fails due to libvirtError: internal error client socket is closed |
| Product | Red Hat Enterprise Virtualization Manager |
| Component | vdsm |
| Version | 3.5.1 |
| Hardware | x86_64 |
| OS | Linux |
| Status | CLOSED ERRATA |
| Severity | high |
| Priority | high |
| Reporter | Nikolai Sednev <nsednev> |
| Assignee | Sandro Bonazzola <sbonazzo> |
| QA Contact | Nikolai Sednev <nsednev> |
| CC | adahms, bazulay, danken, fdeutsch, gklein, huiwa, istein, lbopf, lpeer, lsurette, ncredi, nsednev, oourfali, pstehlik, sbonazzo, stirabos, ybronhei, ycui, yeylon, ykaul, ylavi |
| Target Milestone | ovirt-3.6.0-rc |
| Target Release | 3.6.0 |
| Keywords | Reopened, TestOnly, Triaged, ZStream |
| Flags | istein: needinfo+ |
| Doc Type | Bug Fix |
| Story Points | --- |
| Clones | 1235591 (view as bug list) |
| Last Closed | 2016-03-09 19:34:00 UTC |
| Type | Bug |
| Regression | --- |
| Mount Type | --- |
| Documentation | --- |
| Category | --- |
| oVirt Team | Integration |
| Cloudforms Team | --- |
| Bug Blocks | 1059435, 1198639, 1235591, 1250199 |
Description (Nikolai Sednev, 2015-03-29 10:04:33 UTC)

Created attachment 1007953 [details]
answers-20150323062421.conf
Nikolai, could you please also provide the sosreport from the host side? We need to find out what the exception is.

```
[root@alma03 /]# hosted-engine --deploy
[ INFO ] Stage: Initializing
[ INFO ] Generating a temporary VNC password.
[ INFO ] Stage: Environment setup
Continuing will configure this host for serving as hypervisor and create a VM where you have to install oVirt Engine afterwards.
Are you sure you want to continue? (Yes, No)[Yes]:
Configuration files: []
Log file: /var/log/ovirt-hosted-engine-setup/ovirt-hosted-engine-setup-20150402142028-1h4utr.log
Version: otopi-1.3.0 (otopi-1.3.0-2.el6ev)
It has been detected that this program is executed through an SSH connection without using screen.
Continuing with the installation may lead to broken installation if the network connection fails.
It is highly recommended to abort the installation and run it inside a screen session using command "screen".
Do you want to continue anyway? (Yes, No)[No]: yes
[ INFO ] Hardware supports virtualization
[ INFO ] Stage: Environment packages setup
[ INFO ] Stage: Programs detection
[ INFO ] Stage: Environment setup
[ INFO ] Generating libvirt-spice certificates
[ INFO ] Stage: Environment customization

--== STORAGE CONFIGURATION ==--

During customization use CTRL-D to abort.
Please specify the storage you would like to use (iscsi, nfs3, nfs4)[nfs3]:
Please specify the full shared storage connection path to use (example: host:/path): 10.35.160.108:/RHEV/nsednev_HE_3_5
[ INFO ] Installing on first host
Please provide storage domain name. [hosted_storage]:
Local storage datacenter name is an internal name and currently will not be shown in engine's admin UI.
Please enter local datacenter name [hosted_datacenter]:

--== SYSTEM CONFIGURATION ==--

--== NETWORK CONFIGURATION ==--

Please indicate a nic to set rhevm bridge on: (em1, em2, p1p1, p1p2) [em1]: p1p1
iptables was detected on your computer, do you wish setup to configure it? (Yes, No)[Yes]:
Please indicate a pingable gateway IP address [10.35.117.254]:

--== VM CONFIGURATION ==--

Please specify the device to boot the VM from (cdrom, disk, pxe) [cdrom]: pxe
Please specify an alias for the Hosted Engine image [hosted_engine]:
The following CPU types are supported by this host:
 - model_SandyBridge: Intel SandyBridge Family
 - model_Westmere: Intel Westmere Family
 - model_Nehalem: Intel Nehalem Family
 - model_Penryn: Intel Penryn Family
 - model_Conroe: Intel Conroe Family
Please specify the CPU type to be used by the VM [model_SandyBridge]:
Please specify the number of virtual CPUs for the VM [Defaults to minimum requirement: 2]:
Please specify the disk size of the VM in GB [Defaults to minimum requirement: 25]:
You may specify a unicast MAC address for the VM or accept a randomly generated default [00:16:3e:23:1a:fa]: 00:16:3E:7B:BB:BB
Please specify the memory size of the VM in MB [Defaults to minimum requirement: 4096]:
Please specify the console type you would like to use to connect to the VM (vnc, spice) [vnc]:

--== HOSTED ENGINE CONFIGURATION ==--

Enter the name which will be used to identify this host inside the Administrator Portal [hosted_engine_1]:
Enter 'admin@internal' user password that will be used for accessing the Administrator Portal:
Confirm 'admin@internal' user password:
Please provide the FQDN for the engine you would like to use. This needs to match the FQDN that you will use for the engine installation within the VM.
Note: This will be the FQDN of the VM you are now going to create, it should not point to the base host or to any other existing machine.
Engine FQDN: nsednev-he-2.qa.lab.tlv.redhat.com
Please provide the name of the SMTP server through which we will send notifications [localhost]:
Please provide the TCP port number of the SMTP server [25]:
Please provide the email address from which notifications will be sent [root@localhost]:
Please provide a comma-separated list of email addresses which will get notifications [root@localhost]:
[ INFO ] Stage: Setup validation

--== CONFIGURATION PREVIEW ==--

Bridge interface              : p1p1
Engine FQDN                   : nsednev-he-2.qa.lab.tlv.redhat.com
Bridge name                   : rhevm
SSH daemon port               : 22
Firewall manager              : iptables
Gateway address               : 10.35.117.254
Host name for web application : hosted_engine_1
Host ID                       : 1
Image alias                   : hosted_engine
Image size GB                 : 25
Storage connection            : 10.35.160.108:/RHEV/nsednev_HE_3_5
Console type                  : vnc
Memory size MB                : 4096
MAC address                   : 00:16:3E:7B:BB:BB
Boot type                     : pxe
Number of CPUs                : 2
CPU Type                      : model_SandyBridge

Please confirm installation settings (Yes, No)[Yes]:
[ INFO ] Stage: Transaction setup
[ INFO ] Stage: Misc configuration
[ INFO ] Stage: Package installation
[ INFO ] Stage: Misc configuration
[ INFO ] Configuring libvirt
[ INFO ] Configuring VDSM
[ INFO ] Starting vdsmd
[ INFO ] Waiting for VDSM hardware info
[ INFO ] Waiting for VDSM hardware info
[ INFO ] Configuring the management bridge
[ ERROR ] Failed to execute stage 'Misc configuration': Failed to setup networks {'rhevm': {'nic': 'p1p1', 'bootproto': 'dhcp', 'blockingdhcp': True}}. Error code: "16" message: "Unexpected exception"
[ INFO ] Stage: Clean up
[ INFO ] Generating answer file '/var/lib/ovirt-hosted-engine-setup/answers/answers-20150402142253.conf'
[ INFO ] Stage: Pre-termination
[ INFO ] Stage: Termination
```

Logs attached including the sosreport.

Created attachment 1010219 [details]
alma03 logs
Created attachment 1010220 [details]
sosreportalma03
Created attachment 1010221 [details]
answers-20150402142253.conf
(In reply to Nikolai Sednev from comment #5)
> Created attachment 1010220 [details]
> sosreportalma03

I see in the VDSM logs:

```
Thread-16::DEBUG::2015-04-02 14:22:53,586::BindingXMLRPC::1133::vds::(wrapper) client [127.0.0.1]::call setupNetworks with ({'rhevm': {'nic': 'p1p1', 'bootproto': 'dhcp', 'blockingdhcp': True}}, {}, {'connectivityCheck': False}) {}
Thread-16::ERROR::2015-04-02 14:22:53,597::BindingXMLRPC::1143::vds::(wrapper) libvirt error
Traceback (most recent call last):
  File "/usr/share/vdsm/rpc/BindingXMLRPC.py", line 1136, in wrapper
  File "/usr/share/vdsm/rpc/BindingXMLRPC.py", line 554, in setupNetworks
  File "/usr/share/vdsm/API.py", line 1394, in setupNetworks
  File "/usr/share/vdsm/supervdsm.py", line 50, in __call__
  File "/usr/share/vdsm/supervdsm.py", line 48, in <lambda>
  File "<string>", line 2, in setupNetworks
  File "/usr/lib64/python2.6/multiprocessing/managers.py", line 740, in _callmethod
libvirtError: internal error client socket is closed
```

With a corresponding supervdsm log:

```
MainProcess|Thread-16::DEBUG::2015-04-02 14:22:53,587::supervdsmServer::101::SuperVdsm.ServerCallback::(wrapper) call setupNetworks with ({'rhevm': {'nic': 'p1p1', 'bootproto': 'dhcp', 'blockingdhcp': True}}, {}, {'connectivityCheck': False}) {}
MainProcess|Thread-16::WARNING::2015-04-02 14:22:53,588::libvirtconnection::135::root::(wrapper) connection to libvirt broken. ecode: 1 edom: 7
MainProcess|Thread-16::CRITICAL::2015-04-02 14:22:53,588::libvirtconnection::137::root::(wrapper) taking calling process down.
MainThread::DEBUG::2015-04-02 14:22:53,588::supervdsmServer::453::SuperVdsm.Server::(main) Terminated normally
MainProcess|Thread-16::DEBUG::2015-04-02 14:22:53,588::libvirtconnection::143::root::(wrapper) Unknown libvirterror: ecode: 1 edom: 7 level: 2 message: internal error client socket is closed
MainProcess|Thread-16::ERROR::2015-04-02 14:22:53,589::supervdsmServer::105::SuperVdsm.ServerCallback::(wrapper) Error in setupNetworks
Traceback (most recent call last):
  File "/usr/share/vdsm/supervdsmServer", line 103, in wrapper
    res = func(*args, **kwargs)
  File "/usr/share/vdsm/supervdsmServer", line 223, in setupNetworks
    return setupNetworks(networks, bondings, **options)
  File "/usr/share/vdsm/network/api.py", line 612, in setupNetworks
  File "/usr/lib/python2.6/site-packages/vdsm/netinfo.py", line 112, in networks
  File "/usr/lib/python2.6/site-packages/vdsm/libvirtconnection.py", line 129, in wrapper
  File "/usr/lib64/python2.6/site-packages/libvirt.py", line 3405, in getLibVersion
libvirtError: internal error client socket is closed
```

and a libvirt log:

```
2015-04-02 14:22:26.347+0000: 17198: info : libvirt version: 0.10.2, package: 46.el6_6.3 (Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>, 2015-01-23-10:39:40, x86-026.build.eng.bos.redhat.com)
2015-04-02 14:22:26.347+0000: 17198: error : virNWFilterSnoopLeaseFileRefresh:1903 : open("/var/run/libvirt/network/nwfilter.ltmp"): No such file or directory
2015-04-02 14:22:26.398+0000: 17198: error : virExecWithHook:426 : Cannot find 'pm-is-supported' in path: No such file or directory
2015-04-02 14:22:26.398+0000: 17198: warning : qemuCapsInit:885 : Failed to get host power management capabilities
2015-04-02 14:22:26.809+0000: 17198: error : virExecWithHook:426 : Cannot find 'pm-is-supported' in path: No such file or directory
2015-04-02 14:22:26.809+0000: 17198: warning : lxcCapsInit:77 : Failed to get host power management capabilities
2015-04-02 14:22:51.820+0000: 17191: error : virExecWithHook:426 : Cannot find 'pm-is-supported' in path: No such file or directory
2015-04-02 14:22:51.820+0000: 17191: warning : qemuCapsInit:885 : Failed to get host power management capabilities
```

Looks like an issue on the vdsm side; moving to vdsm / network for review.

The log line

```
MainProcess|Thread-16::CRITICAL::2015-04-02 14:22:53,588::libvirtconnection::137::root::(wrapper) taking calling process down.
```

means that libvirtd died after vdsm connected to it. That's not a problem of vdsm proper. Can you tell what took libvirtd down? The copied 'pm-is-supported' errors remind me of an issue fixed in a newer libvirt.

(In reply to Dan Kenigsberg from comment #9)
> Can you tell what took libvirtd down? the copied 'pm-is-supported' errors
> reminds me of an issue fix in a newer libvirt.

All the info I have is in the attached sosreport provided by Nikolai. I can't tell what took libvirtd down after supervdsmd started. If you can provide the version of libvirt that had the issue, we can check in the sosreport whether the installed version is affected.

Yaniv, can you tell why libvirtd died?

Hi, I would change the log level to debug and see what libvirtd.log says - you need to modify libvirtd.conf for that.
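The supervdsm log above shows vdsm's libvirtconnection wrapper classifying the error (ecode 1, edom 7) as a broken connection and taking the calling process down. Below is a simplified, self-contained sketch of that classification pattern. The `FakeLibvirtError` class and the two constants are stand-ins for the real libvirt Python bindings (where VIR_ERR_INTERNAL_ERROR is 1 and, to my understanding, VIR_FROM_RPC is domain 7), and the real vdsm code terminates the process rather than raising:

```python
class FakeLibvirtError(Exception):
    """Stand-in for libvirt.libvirtError, carrying the same error metadata."""
    def __init__(self, ecode, edom, message):
        super().__init__(message)
        self.ecode = ecode
        self.edom = edom

    def get_error_code(self):
        return self.ecode

    def get_error_domain(self):
        return self.edom


VIR_ERR_INTERNAL_ERROR = 1  # matches "ecode: 1" in the logs above
VIR_FROM_RPC = 7            # matches "edom: 7" (error raised by the RPC layer)


def is_connection_broken(err):
    """True when the error means the libvirtd socket died, as in this bug."""
    return (err.get_error_code() == VIR_ERR_INTERNAL_ERROR
            and err.get_error_domain() == VIR_FROM_RPC)


def wrapper(func):
    """On a broken connection vdsm logs CRITICAL "taking calling process down"
    and exits; this sketch raises a distinct exception instead of exiting."""
    def wrapped(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except FakeLibvirtError as e:
            if is_connection_broken(e):
                raise RuntimeError("connection to libvirt broken") from e
            raise
    return wrapped
```

This is only meant to make the log flow readable: any libvirt call made through the wrapper after libvirtd dies surfaces as the "connection to libvirt broken" WARNING followed by the process going down.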
Can you do that and attach the output? In addition, I would ask someone from libvirt to take a look as well.

BTW, due to bug 1222154 comment 13, please make sure that you run

```
service libvirtd stop
chkconfig libvirtd off
```

before any el6 tests of ovirt-3.5's vdsm.

Any news regarding the debug level logs? Is it possible to produce them?

I was unable to reproduce this bug; it might be closed as not reproduced.

(In reply to Nikolai Sednev from comment #15)
> I was unable to reproduce this bug, might be closed as not reproduced.

I've failed to get this reproduced on RHEL. On RHEV-H we don't have any image with HE setup yet; for better reproduction, such an image is required. Once you have an image, please reopen if this reproduces.

I can reproduce this issue with the el6 build from comment 29 and the following network configuration:

```
# tail -n 40 /etc/sysconfig/network-scripts/ifcfg-*
==> /etc/sysconfig/network-scripts/ifcfg-bond0 <==
BONDING_OPTS="mode=4 miimon=100"
BOOTPROTO="dhcp"
DEVICE="bond0"
IPV6INIT="no"
IPV6_AUTOCONF="no"
NM_CONTROLLED="no"
ONBOOT="yes"
PEERNTP="yes"
TYPE="Bond"

==> /etc/sysconfig/network-scripts/ifcfg-eth0 <==
DEVICE="eth0"
HWADDR="52:54:00:2c:3c:8e"
MASTER="bond0"
NM_CONTROLLED="no"
ONBOOT="yes"
SLAVE="yes"

==> /etc/sysconfig/network-scripts/ifcfg-eth1 <==
DEVICE="eth1"
HWADDR="52:54:00:c0:7a:da"
MASTER="bond0"
NM_CONTROLLED="no"
ONBOOT="yes"
SLAVE="yes"

==> /etc/sysconfig/network-scripts/ifcfg-lo <==
DEVICE="lo"
IPADDR="127.0.0.1"
NETMASK="255.0.0.0"
ONBOOT="yes"
```

Changing mode=4 to mode=balance-rr in the bonding opts prevents this error, but leads to bug 1230638. It's probably a race, because I also encountered it with a plain nic (eth1); running the setup a second time without any changes solved the issue.

*** Bug 1231614 has been marked as a duplicate of this bug. ***
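The reproducer configuration quoted above requests bonding mode=4 (802.3ad link aggregation), which is what distinguished the failing runs from the mode=balance-rr ones. As a purely illustrative aid (this helper is hypothetical and not part of any oVirt tool), the requested mode can be read out of an ifcfg file's BONDING_OPTS like this:

```python
import re


def bonding_mode(ifcfg_text):
    """Return the bonding 'mode' value from an ifcfg file's BONDING_OPTS, or None."""
    m = re.search(r'BONDING_OPTS="([^"]*)"', ifcfg_text)
    if not m:
        return None
    for opt in m.group(1).split():
        key, _, value = opt.partition("=")
        if key == "mode":
            return value
    return None


# The bond0 configuration from the reproducer above.
ifcfg_bond0 = 'BONDING_OPTS="mode=4 miimon=100"\nBOOTPROTO="dhcp"\nDEVICE="bond0"\n'
```

On a live host, `cat /proc/net/bonding/bond0` shows the mode actually in effect, which is the authoritative check.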
First try: can't reproduce with rhev-hypervisor6-6.6-20150609.0.el6ev.

(In reply to Sandro Bonazzola from comment #34)
> first try, can't reproduce with
> rhev-hypervisor6-6.6-20150609.0.el6ev

It's 6.7, not 6.6. BTW, the second run worked as well on my network and host, so I guess it may really be some race condition depending on the power or the load of the system / network.

Created attachment 1040844 [details]
Config and logs from a failed attempt

According to the logs attached in comment #36, supervdsmd received a setupNetworks call:

```
MainThread::DEBUG::2015-06-19 10:00:41,416::vdsm-restore-net-config::77::root::(unified_restoration) Calling setupNetworks with networks ({}) and bond ({}).
```

which ended with:

```
MainThread::INFO::2015-06-19 10:04:42,886::vdsm-restore-net-config::145::root::(restore) networks already restored. doing nothing.
```

Then supervdsmd confirmed being up and running:

```
MainProcess|Thread-13::DEBUG::2015-06-19 10:04:45,431::supervdsmServer::102::SuperVdsm.ServerCallback::(wrapper) call getHardwareInfo with () {}
MainProcess|Thread-13::DEBUG::2015-06-19 10:04:45,431::supervdsmServer::109::SuperVdsm.ServerCallback::(wrapper) return getHardwareInfo with {'systemProductName': 'Standard PC (i440FX + PIIX, 1996)', 'systemUUID': 'D2F9C074-81CD-4057-FAA3-17B5980EEE1A', 'systemVersion': 'pc-i440fx-2.1', 'systemManufacturer': 'QEMU'}
```

before receiving the call to setupNetworks:

```
MainProcess|Thread-15::DEBUG::2015-06-19 10:04:46,327::supervdsmServer::102::SuperVdsm.ServerCallback::(wrapper) call setupNetworks with ({'rhevm': {'nic': 'eth1', 'netmask': '', 'ipaddr': ''}}, {}, {'connectivityCheck': False}) {}
MainProcess|Thread-15::WARNING::2015-06-19 10:04:46,328::libvirtconnection::135::root::(wrapper) connection to libvirt broken. ecode: 1 edom: 7
MainProcess|Thread-15::CRITICAL::2015-06-19 10:04:46,328::libvirtconnection::137::root::(wrapper) taking calling process down.
MainProcess|Thread-15::DEBUG::2015-06-19 10:04:46,328::libvirtconnection::143::root::(wrapper) Unknown libvirterror: ecode: 1 edom: 7 level: 2 message: internal error client socket is closed
MainProcess|Thread-15::ERROR::2015-06-19 10:04:46,328::supervdsmServer::106::SuperVdsm.ServerCallback::(wrapper) Error in setupNetworks
Traceback (most recent call last):
  File "/usr/share/vdsm/supervdsmServer", line 104, in wrapper
    res = func(*args, **kwargs)
  File "/usr/share/vdsm/supervdsmServer", line 224, in setupNetworks
    return setupNetworks(networks, bondings, **options)
  File "/usr/share/vdsm/network/api.py", line 624, in setupNetworks
  File "/usr/lib/python2.6/site-packages/vdsm/netinfo.py", line 112, in networks
  File "/usr/lib/python2.6/site-packages/vdsm/libvirtconnection.py", line 129, in wrapper
  File "/usr/lib64/python2.6/site-packages/libvirt.py", line 3405, in getLibVersion
libvirtError: internal error client socket is closed
```

libvirt was already up and running in the previous 20 seconds:

```
2015-06-19 10:04:25.265+0000: 16409: info : libvirt version: 0.10.2, package: 54.el6 (Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>, 2015-04-28-09:24:16, x86-033.build.eng.bos.redhat.com)
2015-06-19 10:04:25.265+0000: 16409: error : virExecWithHook:426 : Cannot find 'pm-is-supported' in path: No such file or directory
2015-06-19 10:04:25.265+0000: 16409: warning : qemuCapsInit:885 : Failed to get host power management capabilities
2015-06-19 10:04:25.599+0000: 16409: error : virExecWithHook:426 : Cannot find 'pm-is-supported' in path: No such file or directory
2015-06-19 10:04:25.599+0000: 16409: warning : lxcCapsInit:77 : Failed to get host power management capabilities
2015-06-19 10:04:44.256+0000: 16399: error : virExecWithHook:426 : Cannot find 'pm-is-supported' in path: No such file or directory
2015-06-19 10:04:44.256+0000: 16399: warning : qemuCapsInit:885 : Failed to get host power management capabilities
```

In vdsm.log I see:

```
clientIFinit::DEBUG::2015-06-19 10:04:44,244::libvirtconnection::150::root::(get) trying to connect libvirt
```

just 2 seconds before the call to setupNetworks. But I think the connection had been established, since there is also:

```
Thread-14::DEBUG::2015-06-19 10:04:45,923::caps::276::root::(getLiveMergeSupport) libvirt is missing 'VIR_DOMAIN_BLOCK_COMMIT_RELATIVE': live merge disabled
```

just one second before the call to setupNetworks.

I don't see anything related to restarts during the process, or anything related to hosted-engine calls to vdsm other than the call to setupNetworks. It doesn't seem to be something related to delays or race conditions. Looks like a bug in the way vdsm handles the connection with libvirt.

Fabian and I couldn't reproduce it easily. It happened a few times, and we tried to get the libvirtd debug logs (add log_filters="1:libvirt 1:util 1:qemu" and log_outputs="1:file:/var/log/libvirt/libvirtd.log" lines to libvirtd.conf). Then we need to restart the libvirtd service, and that might have helped to avoid the bug... but we're not sure exactly.
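For reference, the debug settings mentioned above would go into /etc/libvirt/libvirtd.conf roughly like this (a sketch assembled from the comment, not a tested configuration; on el6 the daemon is then restarted with `service libvirtd restart`):

```
# /etc/libvirt/libvirtd.conf -- debug logging as suggested in the comment above
log_filters="1:libvirt 1:util 1:qemu"
log_outputs="1:file:/var/log/libvirt/libvirtd.log"
```

Restarting libvirtd to pick up these settings may itself perturb the race, which matches the observation that enabling logging seemed to make the bug harder to hit.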
Having a libvirt debug log is necessary to understand why libvirt raises an exception - this exception caused this regression, and it must be related to hosted-engine network configuration, as it happened just after the setupNetworks call. I suggest leaving this bug for verification in the next 3.5.z release - this is a combination of a network configuration issue and an unknown libvirt error related to hosted-engine installation over node (quite a lot is involved in it).

I tried about 15 times to reproduce it with plain nics and bonds, both with the configured libvirtd logging and without. All testing was done on the el6 0609 build. We should document it as a known bug for now. Andrew, can we add this as a known issue to the 3.5.4 notes?

Moving it to integration, and to MODIFIED, based on Comment 39. If it re-occurs, and seems to be an infra issue, then we'll re-examine.

Hi Fabian,

Yes, we'll make sure this one is added to the release notes for you. Thank you for the notice!

Kind regards,
Andrew

Moving to ON_QA since this has been marked as TestOnly. QE: please wait for the build to be available before trying to test this.

*** Bug 1244740 has been marked as a duplicate of this bug. ***
Not reproduced on Red Hat Enterprise Virtualization Hypervisor release 7.2 (20151104.0.el7ev):

```
ovirt-vmconsole-host-1.0.0-1.el7ev.noarch
ovirt-node-branding-rhev-3.6.0-0.20.20151103git3d3779a.el7ev.noarch
ovirt-node-lib-3.6.0-0.20.20151103git3d3779a.el7ev.noarch
ovirt-node-3.6.0-0.20.20151103git3d3779a.el7ev.noarch
ovirt-node-plugin-snmp-logic-3.6.0-0.20.20151103git3d3779a.el7ev.noarch
ovirt-hosted-engine-setup-1.3.0-1.el7ev.noarch
ovirt-node-plugin-vdsm-0.6.1-2.el7ev.noarch
ovirt-setup-lib-1.0.0-1.el7ev.noarch
ovirt-vmconsole-1.0.0-1.el7ev.noarch
ovirt-node-lib-config-3.6.0-0.20.20151103git3d3779a.el7ev.noarch
ovirt-node-selinux-3.6.0-0.20.20151103git3d3779a.el7ev.noarch
ovirt-node-plugin-cim-logic-3.6.0-0.20.20151103git3d3779a.el7ev.noarch
ovirt-hosted-engine-ha-1.3.2.1-1.el7ev.noarch
ovirt-node-plugin-hosted-engine-0.3.0-2.el7ev.noarch
ovirt-node-plugin-snmp-3.6.0-0.20.20151103git3d3779a.el7ev.noarch
ovirt-node-plugin-rhn-3.6.0-0.20.20151103git3d3779a.el7ev.noarch
sanlock-3.2.4-1.el7.x86_64
ovirt-node-lib-legacy-3.6.0-0.20.20151103git3d3779a.el7ev.noarch
libvirt-1.2.17-13.el7.x86_64
qemu-kvm-rhev-2.3.0-31.el7.x86_64
vdsm-4.17.10.1-0.el7ev.noarch
ovirt-host-deploy-offline-1.4.0-1.el7ev.x86_64
ovirt-node-plugin-cim-3.6.0-0.20.20151103git3d3779a.el7ev.noarch
ovirt-host-deploy-1.4.0-1.el7ev.noarch
```

Moving 'requires_doc_text' to '-', as this bug is verified and no longer needs to be documented as a known issue.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-0362.html