Bug 1235591
Summary: HE deployment fails due to libvirtError: internal error client socket is closed

Product: Red Hat Enterprise Virtualization Manager
Reporter: rhev-integ
Component: ovirt-hosted-engine-setup
Assignee: Simone Tiraboschi <stirabos>
Status: CLOSED ERRATA
QA Contact: Nikolai Sednev <nsednev>
Severity: high
Priority: high
Version: 3.5.1
CC: adahms, bazulay, bmcclain, cshao, danken, ecohen, eedri, fdeutsch, gklein, huiwa, istein, lpeer, lsurette, ncredi, nsednev, pstehlik, sbonazzo, stirabos, ybronhei, ycui, yeylon, ylavi
Target Milestone: ---
Keywords: Reopened, Triaged, ZStream
Target Release: 3.5.4
Flags: ylavi: Triaged+
Hardware: x86_64
OS: Linux
Whiteboard: integration
Fixed In Version: vt16.4
Doc Type: Bug Fix
Doc Text: On Red Hat Enterprise Virtualization Hypervisors, the vdsmd and libvirtd services experienced restart errors while deploying the Self-Hosted Engine. The hosted-engine setup script changed the VDSM configuration file and restarted the service, but this alone was not enough; explicitly calling vdsm-tool configure is also required to make the changes effective and avoid any race condition on the services' start-up.
Story Points: ---
Clone Of: 1206884
Clones: 1244740 (view as bug list)
Last Closed: 2015-09-03 18:13:01 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
Category: ---
oVirt Team: ---
Cloudforms Team: ---
Bug Depends On: 1206884, 1246863
Bug Blocks: 1059435, 1234906, 1241426, 1250199
Comment 1
Nikolai Sednev
2015-07-01 10:25:35 UTC
*** Bug 1241470 has been marked as a duplicate of this bug. ***

We can still encounter this issue in rhev-hypervisor6-6.7-20150717.0.el6ev (vdsm-4.16.22-1.el6ev.x86_64).

Test version:
rhev-hypervisor6-6.7-20150717.0.el6ev
ovirt-node-3.2.3-11.el6.noarch
ovirt-node-plugin-hosted-engine-0.2.0-16.0.el6ev.noarch
ovirt-hosted-engine-setup-1.2.5.1-1.el6ev.noarch
vdsm-4.16.22-1.el6ev.x86_64

Test steps:
1. Clean install rhev-hypervisor6-6.7-20150717.0.el6ev
2. Create network with bond1 with BONDING_OPTS='mode=1 miimon=100'
3. Set up hosted engine with bond1 as network

Actual results:
1. It reports an error and quits the setup process.

Error in ovirt-hosted-engine-setup
==============================================================================================
2015-07-20 02:10:02 DEBUG otopi.context context._executeMethod:152 method exception
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/otopi/context.py", line 142, in _executeMethod
  File "/usr/share/ovirt-hosted-engine-setup/plugins/ovirt-hosted-engine-setup/network/bridge.py", line 207, in _misc
  File "/usr/share/ovirt-hosted-engine-setup/plugins/ovirt-hosted-engine-setup/network/bridge.py", line 225, in _setupNetworks
RuntimeError: Failed to setup networks {'rhevm': {'bonding': 'bond1', 'bootproto': 'dhcp', 'blockingdhcp': True}}. Error code: "16" message: "Unexpected exception"
2015-07-20 02:10:02 ERROR otopi.context context._executeMethod:161 Failed to execute stage 'Misc configuration': Failed to setup networks {'rhevm': {'bonding': 'bond1', 'bootproto': 'dhcp', 'blockingdhcp': True}}. Error code: "16" message: "Unexpected exception"

I noticed that the issue you verified is on rhevh7.1. Since rhevh6.7 still has this issue, should I reassign this bug or report a new one to track it?

We know that bonds have an impact on network issues, so I'd suggest that we use (and adjust) this bug for the failure with normal networking, and open a new bug for the failure with bonds.
Hui/Nikolai, if you agree, please open a new bug.

Hi Fabian, I'm a bit confused by the request to open another bug for an issue that is already filed and documented. Can you explain the reason for opening a new bug on the same issue? We have a general deployment failure for HE on RHEVH 6.7 over a VLAN-tagged network or a bonded network; the two bugs are:
https://bugzilla.redhat.com/show_bug.cgi?id=1235591
https://bugzilla.redhat.com/show_bug.cgi?id=1244636
Both bugs are already filed and have logs.

From the description we see that this bug (bug 1235591) is about a plain interface (p1p1), and bug 1244636 is encountered when VLAN tags are used. The failure in comment 3, however, is encountered when a bond is used. So to me these are three independent causes, and each should have its own bug to track its resolution.

I understand and totally agree with you, Fabian; these are three different deployment flows and deserve separate bugs. Here goes 1244740 for this one.

The current reproduction flow is based on an untagged network environment; the bug was not reproduced on such an environment, hence changing to VERIFIED.

Works for me, see comment #1. Need to retest on the newest RHEVH 6.7 build.

Created attachment 1054423 [details]
answers-20150721142009.conf
Failed again to deploy on Red Hat Enterprise Virtualization Hypervisor release 6.7 (20150707.0.el6ev), while using a flat network:

[root@alma04 ~]# rpm -qa vdsm* sanlock* qemu* mom* libvirt* ovirt* gluster*
vdsm-jsonrpc-4.16.21-1.el6ev.noarch
libvirt-python-0.10.2-54.el6.x86_64
ovirt-node-plugin-rhn-3.2.3-11.el6.noarch
libvirt-lock-sanlock-0.10.2-54.el6.x86_64
vdsm-hook-vhostmd-4.16.21-1.el6ev.noarch
sanlock-python-2.8-2.el6_5.x86_64
qemu-kvm-rhev-0.12.1.2-2.479.el6.x86_64
vdsm-hook-ethtool-options-4.16.21-1.el6ev.noarch
ovirt-node-plugin-hosted-engine-0.2.0-16.0.el6ev.noarch
ovirt-node-plugin-cim-3.2.3-11.el6.noarch
vdsm-cli-4.16.21-1.el6ev.noarch
glusterfs-devel-3.6.0.53-1.el6.x86_64
ovirt-host-deploy-1.3.0-2.el6ev.noarch
ovirt-node-branding-rhev-3.2.3-11.el6.noarch
vdsm-yajsonrpc-4.16.21-1.el6ev.noarch
glusterfs-api-3.6.0.53-1.el6.x86_64
qemu-img-rhev-0.12.1.2-2.479.el6.x86_64
sanlock-lib-2.8-2.el6_5.x86_64
glusterfs-libs-3.6.0.53-1.el6.x86_64
vdsm-python-zombiereaper-4.16.21-1.el6ev.noarch
vdsm-xmlrpc-4.16.21-1.el6ev.noarch
libvirt-0.10.2-54.el6.x86_64
ovirt-node-selinux-3.2.3-11.el6.noarch
libvirt-cim-0.6.1-12.el6.x86_64
vdsm-4.16.21-1.el6ev.x86_64
vdsm-reg-4.16.21-1.el6ev.noarch
ovirt-hosted-engine-setup-1.2.5.1-1.el6ev.noarch
ovirt-host-deploy-offline-1.3.0-3.el6ev.x86_64
glusterfs-rdma-3.6.0.53-1.el6.x86_64
ovirt-node-plugin-snmp-3.2.3-11.el6.noarch
glusterfs-fuse-3.6.0.53-1.el6.x86_64
qemu-kvm-rhev-tools-0.12.1.2-2.479.el6.x86_64
glusterfs-3.6.0.53-1.el6.x86_64
sanlock-2.8-2.el6_5.x86_64
libvirt-client-0.10.2-54.el6.x86_64
mom-0.4.1-5.el6ev.noarch
ovirt-node-3.2.3-11.el6.noarch
ovirt-hosted-engine-ha-1.2.6-2.el6ev.noarch
ovirt-node-plugin-vdsm-0.2.0-25.el6ev.noarch
vdsm-python-4.16.21-1.el6ev.noarch

[root@alma04 ~]# ifconfig
lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:102754 errors:0 dropped:0 overruns:0 frame:0
          TX packets:102754 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:18145895 (17.3 MiB)  TX bytes:18145895 (17.3 MiB)

p1p1      Link encap:Ethernet  HWaddr A0:36:9F:3B:16:7C
          inet addr:10.35.117.26  Bcast:10.35.117.255  Mask:255.255.255.0
          inet6 addr: fe80::a236:9fff:fe3b:167c/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:696845 errors:3 dropped:0 overruns:0 frame:3
          TX packets:4272 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:47110831 (44.9 MiB)  TX bytes:420354 (410.5 KiB)

[root@alma04 ~]# hosted-engine --deploy
[ INFO  ] Stage: Initializing
[ INFO  ] Generating a temporary VNC password.
[ INFO  ] Stage: Environment setup
          Continuing will configure this host for serving as hypervisor and create a VM where you have to install oVirt Engine afterwards.
          Are you sure you want to continue? (Yes, No)[Yes]:
          Configuration files: []
          Log file: /var/log/ovirt-hosted-engine-setup/ovirt-hosted-engine-setup-20150721141812-nzwhc3.log
          Version: otopi-1.3.2 (otopi-1.3.2-1.el6ev)
          It has been detected that this program is executed through an SSH connection without using screen. Continuing with the installation may lead to broken installation if the network connection fails. It is highly recommended to abort the installation and run it inside a screen session using command "screen".
          Do you want to continue anyway? (Yes, No)[No]: yes
[ INFO  ] Hardware supports virtualization
[ INFO  ] Stage: Environment packages setup
[ INFO  ] Stage: Programs detection
[ INFO  ] Stage: Environment setup
[ INFO  ] Stage: Environment customization

          --== STORAGE CONFIGURATION ==--

          During customization use CTRL-D to abort.
          Please specify the storage you would like to use (iscsi, nfs3, nfs4)[nfs3]:
          Please specify the full shared storage connection path to use (example: host:/path): 10.35.160.108:/RHEV/nsednev_HE_3_6_second_deployment_setup
[ INFO  ] Installing on first host
          Please provide storage domain name. [hosted_storage]:
          Local storage datacenter name is an internal name and currently will not be shown in engine's admin UI.
          Please enter local datacenter name [hosted_datacenter]:

          --== SYSTEM CONFIGURATION ==--

          --== NETWORK CONFIGURATION ==--

          Please indicate a nic to set rhevm bridge on: (em1, em2, p1p1, p1p2) [em1]: p1p1
          iptables was detected on your computer, do you wish setup to configure it? (Yes, No)[Yes]:
          Please indicate a pingable gateway IP address [10.35.117.254]:

          --== VM CONFIGURATION ==--

          Please specify the device to boot the VM from (cdrom, disk, pxe) [cdrom]: pxe
          Please specify an alias for the Hosted Engine image [hosted_engine]:
          The following CPU types are supported by this host:
           - model_SandyBridge: Intel SandyBridge Family
           - model_Westmere: Intel Westmere Family
           - model_Nehalem: Intel Nehalem Family
           - model_Penryn: Intel Penryn Family
           - model_Conroe: Intel Conroe Family
          Please specify the CPU type to be used by the VM [model_SandyBridge]:
          Please specify the number of virtual CPUs for the VM [Defaults to minimum requirement: 2]:
          Please specify the disk size of the VM in GB [Defaults to minimum requirement: 25]:
          You may specify a unicast MAC address for the VM or accept a randomly generated default [00:16:3e:0a:7a:2b]: 00:16:3E:7B:BB:BB
          Please specify the memory size of the VM in MB [Defaults to minimum requirement: 4096]:
          Please specify the console type you would like to use to connect to the VM (vnc, spice) [vnc]:

          --== HOSTED ENGINE CONFIGURATION ==--

          Enter the name which will be used to identify this host inside the Administrator Portal [hosted_engine_1]:
          Enter 'admin@internal' user password that will be used for accessing the Administrator Portal:
          Confirm 'admin@internal' user password:
          Please provide the FQDN for the engine you would like to use. This needs to match the FQDN that you will use for the engine installation within the VM.
          Note: This will be the FQDN of the VM you are now going to create, it should not point to the base host or to any other existing machine.
          Engine FQDN: nsednev-he-2.qa.lab.tlv.redhat.com
          Please provide the name of the SMTP server through which we will send notifications [localhost]:
          Please provide the TCP port number of the SMTP server [25]:
          Please provide the email address from which notifications will be sent [root@localhost]:
          Please provide a comma-separated list of email addresses which will get notifications [root@localhost]:
[ INFO  ] Stage: Setup validation

          --== CONFIGURATION PREVIEW ==--

          Bridge interface              : p1p1
          Engine FQDN                   : nsednev-he-2.qa.lab.tlv.redhat.com
          Bridge name                   : rhevm
          SSH daemon port               : 22
          Firewall manager              : iptables
          Gateway address               : 10.35.117.254
          Host name for web application : hosted_engine_1
          Host ID                       : 1
          Image alias                   : hosted_engine
          Image size GB                 : 25
          Storage connection            : 10.35.160.108:/RHEV/nsednev_HE_3_6_second_deployment_setup
          Console type                  : vnc
          Memory size MB                : 4096
          MAC address                   : 00:16:3E:7B:BB:BB
          Boot type                     : pxe
          Number of CPUs                : 2
          CPU Type                      : model_SandyBridge

          Please confirm installation settings (Yes, No)[Yes]:
[ INFO  ] Stage: Transaction setup
[ INFO  ] Stage: Misc configuration
[ INFO  ] Stage: Package installation
[ INFO  ] Stage: Misc configuration
[ INFO  ] Configuring libvirt
[ INFO  ] Configuring VDSM
[ INFO  ] Starting vdsmd
[ INFO  ] Waiting for VDSM hardware info
[ INFO  ] Waiting for VDSM hardware info
[ INFO  ] Configuring the management bridge
[ ERROR ] Failed to execute stage 'Misc configuration': Failed to setup networks {'rhevm': {'nic': 'p1p1', 'bootproto': 'dhcp', 'blockingdhcp': True}}. Error code: "16" message: "Unexpected exception"
[ INFO  ] Stage: Clean up
[ INFO  ] Generating answer file '/var/lib/ovirt-hosted-engine-setup/answers/answers-20150721142009.conf'
[ INFO  ] Stage: Pre-termination
[ INFO  ] Stage: Termination

Log is attached: answers-20150721142009.conf

Nikolai, could you please reproduce the bug, and then make that system available to us so Simone can take a look at the environment?

Created attachment 1054424 [details]
libvirtd.log
Created attachment 1054427 [details]
all logs from host
libvirtd somehow died:

Jul 21 14:18:38 alma04 kernel: device-mapper: table: 253:5: multipath: error getting device
Jul 21 14:18:38 alma04 kernel: device-mapper: ioctl: error adding target to table
Jul 21 14:18:39 alma04 kernel: device-mapper: table: 253:5: multipath: error getting device
Jul 21 14:18:39 alma04 kernel: device-mapper: ioctl: error adding target to table
Jul 21 14:19:44 alma04 libvirtd: Could not find keytab file: /etc/libvirt/krb5.tab: No such file or directory
Jul 21 14:19:44 alma04 kernel: lo: Disabled Privacy Extensions
Jul 21 14:20:07 alma04 kernel: device-mapper: table: 253:5: multipath: error getting device
Jul 21 14:20:07 alma04 kernel: device-mapper: ioctl: error adding target to table
Jul 21 14:20:09 alma04 vdsm vds ERROR libvirt error#012Traceback (most recent call last):#012  File "/usr/share/vdsm/rpc/BindingXMLRPC.py", line 1136, in wrapper#012  File "/usr/share/vdsm/rpc/BindingXMLRPC.py", line 554, in setupNetworks#012  File "/usr/share/vdsm/API.py", line 1398, in setupNetworks#012  File "/usr/share/vdsm/supervdsm.py", line 50, in __call__#012  File "/usr/share/vdsm/supervdsm.py", line 48, in <lambda>#012  File "<string>", line 2, in setupNetworks#012  File "/usr/lib64/python2.6/multiprocessing/managers.py", line 740, in _callmethod#012libvirtError: internal error client socket is closed
Jul 21 14:20:09 alma04 respawn: slave '/usr/share/vdsm/supervdsmServer --sockfile /var/run/vdsm/svdsm.sock --pidfile /var/run/vdsm/supervdsmd.pid' died, respawning slave
Jul 21 15:01:01 alma04 logrotate: ALERT exited abnormally with [1]

We need to understand why. Dan, can you please have a look?

Created attachment 1054722 [details]
the full libvirt log with debug inside
The issue is still there:

MainProcess|Thread-16::DEBUG::2015-07-22 09:24:23,709::supervdsmServer::102::SuperVdsm.ServerCallback::(wrapper) call setupNetworks with ({'rhevm': {'nic': 'p1p1', 'bootproto': 'dhcp', 'blockingdhcp': True}}, {}, {'connectivityCheck': False}) {}
MainProcess|Thread-16::WARNING::2015-07-22 09:24:23,710::libvirtconnection::135::root::(wrapper) connection to libvirt broken. ecode: 1 edom: 7
MainProcess|Thread-16::CRITICAL::2015-07-22 09:24:23,710::libvirtconnection::137::root::(wrapper) taking calling process down.
MainThread::DEBUG::2015-07-22 09:24:23,711::supervdsmServer::462::SuperVdsm.Server::(main) Terminated normally
MainProcess|Thread-16::DEBUG::2015-07-22 09:24:23,711::libvirtconnection::143::root::(wrapper) Unknown libvirterror: ecode: 1 edom: 7 level: 2 message: internal error client socket is closed
MainProcess|Thread-16::ERROR::2015-07-22 09:24:23,711::supervdsmServer::106::SuperVdsm.ServerCallback::(wrapper) Error in setupNetworks
Traceback (most recent call last):
  File "/usr/share/vdsm/supervdsmServer", line 104, in wrapper
    res = func(*args, **kwargs)
  File "/usr/share/vdsm/supervdsmServer", line 224, in setupNetworks
    return setupNetworks(networks, bondings, **options)
  File "/usr/share/vdsm/network/api.py", line 611, in setupNetworks
  File "/usr/lib/python2.6/site-packages/vdsm/netinfo.py", line 112, in networks
  File "/usr/lib/python2.6/site-packages/vdsm/libvirtconnection.py", line 129, in wrapper
  File "/usr/lib64/python2.6/site-packages/libvirt.py", line 3405, in getLibVersion
libvirtError: internal error client socket is closed
MainThread::DEBUG::2015-07-22 09:24:24,140::netconfpersistence::141::root::(_getConfigs) Non-existing config set.
MainThread::DEBUG::2015-07-22 09:24:24,140::netconfpersistence::141::root::(_getConfigs) Non-existing config set.
MainThread::DEBUG::2015-07-22 09:24:24,835::supervdsmServer::422::SuperVdsm.Server::(main) Making sure I'm root - SuperVdsm
MainThread::DEBUG::2015-07-22 09:24:24,836::supervdsmServer::431::SuperVdsm.Server::(main) Parsing cmd args
MainThread::DEBUG::2015-07-22 09:24:24,836::supervdsmServer::434::SuperVdsm.Server::(main) Cleaning old socket /var/run/vdsm/svdsm.sock
MainThread::DEBUG::2015-07-22 09:24:24,836::supervdsmServer::438::SuperVdsm.Server::(main) Setting up keep alive thread
MainThread::DEBUG::2015-07-22 09:24:24,836::supervdsmServer::444::SuperVdsm.Server::(main) Creating remote object manager
MainThread::DEBUG::2015-07-22 09:24:24,838::fileUtils::192::Storage.fileUtils::(chown) Changing owner for /var/run/vdsm/svdsm.sock, to (36:36)
MainThread::DEBUG::2015-07-22 09:24:24,838::supervdsmServer::455::SuperVdsm.Server::(main) Started serving super vdsm object
sourceRoute::DEBUG::2015-07-22 09:24:24,839::sourceroutethread::79::root::(_subscribeToInotifyLoop) sourceRouteThread.subscribeToInotifyLoop started

...and now we have full libvirtd debug logs.

Created attachment 1054736 [details]
the full libvirt log with debug inside
In the libvirtd log we have a gap of about 100 seconds exactly where the issue occurs (from 09:24:23.538+0000 to 09:26:02.026+0000). libvirtd was configured with log_level = 1, so we already have all we could get.

2015-07-22 09:24:23.538+0000: 60101: debug : virEventPollDispatchHandles:483 : EVENT_POLL_DISPATCH_HANDLE: watch=14 events=2
2015-07-22 09:24:23.538+0000: 60101: debug : virNetMessageFree:73 : msg=0xb47b50 nfds=0 cb=(nil)
2015-07-22 09:24:23.538+0000: 60101: debug : virNetServerClientCalculateHandleMode:152 : tls=(nil) hs=-1, rx=0xb443f0 tx=(nil)
2015-07-22 09:24:23.538+0000: 60101: debug : virNetServerClientCalculateHandleMode:183 : mode=1
2015-07-22 09:24:23.538+0000: 60101: debug : virEventPollUpdateHandle:146 : EVENT_POLL_UPDATE_HANDLE: watch=14 events=1
2015-07-22 09:24:23.538+0000: 60101: debug : virEventPollInterruptLocked:697 : Skip interrupt, 1 -356829088
2015-07-22 09:24:23.538+0000: 60101: debug : virEventPollCleanupTimeouts:501 : Cleanup 1
2015-07-22 09:24:23.538+0000: 60101: debug : virEventPollCleanupHandles:549 : Cleanup 9
2015-07-22 09:24:23.538+0000: 60101: debug : virEventRunDefaultImpl:244 : running default event implementation
2015-07-22 09:24:23.538+0000: 60101: debug : virEventPollCleanupTimeouts:501 : Cleanup 1
2015-07-22 09:24:23.538+0000: 60101: debug : virEventPollCleanupHandles:549 : Cleanup 9
2015-07-22 09:24:23.538+0000: 60101: debug : virEventPollMakePollFDs:378 : Prepare n=0 w=1, f=6 e=1 d=0
2015-07-22 09:24:23.538+0000: 60101: debug : virEventPollMakePollFDs:378 : Prepare n=1 w=2, f=8 e=1 d=0
2015-07-22 09:24:23.538+0000: 60101: debug : virEventPollMakePollFDs:378 : Prepare n=2 w=3, f=11 e=1 d=0
2015-07-22 09:24:23.538+0000: 60101: debug : virEventPollMakePollFDs:378 : Prepare n=3 w=4, f=12 e=1 d=0
2015-07-22 09:24:23.538+0000: 60101: debug : virEventPollMakePollFDs:378 : Prepare n=4 w=5, f=13 e=1 d=0
2015-07-22 09:24:23.538+0000: 60101: debug : virEventPollMakePollFDs:378 : Prepare n=5 w=6, f=14 e=1 d=0
2015-07-22 09:24:23.538+0000: 60101: debug : virEventPollMakePollFDs:378 : Prepare n=6 w=7, f=16 e=1 d=0
2015-07-22 09:24:23.538+0000: 60101: debug : virEventPollMakePollFDs:378 : Prepare n=7 w=8, f=18 e=1 d=0
2015-07-22 09:24:23.538+0000: 60101: debug : virEventPollMakePollFDs:378 : Prepare n=8 w=14, f=20 e=1 d=0
2015-07-22 09:24:23.538+0000: 60101: debug : virEventPollCalculateTimeout:320 : Calculate expiry of 1 timers
2015-07-22 09:24:23.538+0000: 60101: debug : virEventPollCalculateTimeout:346 : Timeout at 0 due in -1 ms
2015-07-22 09:24:23.538+0000: 60101: debug : virEventPollRunOnce:614 : EVENT_POLL_RUN: nhandles=9 timeout=-1
2015-07-22 09:26:02.026+0000: 60101: debug : virEventPollRunOnce:625 : Poll got 1 event(s)
2015-07-22 09:26:02.026+0000: 60101: debug : virEventPollDispatchTimeouts:410 : Dispatch 1
2015-07-22 09:26:02.026+0000: 60101: debug : virEventPollDispatchHandles:455 : Dispatch 9
2015-07-22 09:26:02.026+0000: 60101: debug : virEventPollDispatchHandles:469 : i=0 w=1
2015-07-22 09:26:02.026+0000: 60101: debug : virEventPollDispatchHandles:469 : i=1 w=2
2015-07-22 09:26:02.026+0000: 60101: debug : virEventPollDispatchHandles:469 : i=2 w=3
2015-07-22 09:26:02.026+0000: 60101: debug : virEventPollDispatchHandles:469 : i=3 w=4
2015-07-22 09:26:02.026+0000: 60101: debug : virEventPollDispatchHandles:483 : EVENT_POLL_DISPATCH_HANDLE: watch=4 events=1
2015-07-22 09:26:02.026+0000: 60101: debug : virNetSocketNew:147 : localAddr=0x7ffdd61a3060 remoteAddr=0x7ffdd61a2fd0 fd=23 errfd=-1 pid=0
2015-07-22 09:26:02.026+0000: 60101: debug : virObjectNew:110 : OBJECT_NEW: obj=0xb487f0 classname=virNetSocket
2015-07-22 09:26:02.026+0000: 60101: debug : virNetSocketNew:204 : RPC_SOCKET_NEW: sock=0xb487f0 fd=23 errfd=-1 pid=0 localAddr=127.0.0.1;0, remoteAddr=127.0.0.1;0
2015-07-22 09:26:02.026+0000: 60101: debug : virNetServerClientNew:403 : sock=0xb487f0 auth=2 tls=(nil)
2015-07-22 09:26:02.026+0000: 60101: debug : virObjectNew:110 : OBJECT_NEW: obj=0xb48440 classname=virNetServerClient
2015-07-22 09:26:02.026+0000: 60101: debug : virObjectRef:168 : OBJECT_REF: obj=0xb487f0
2015-07-22 09:26:02.026+0000: 60101: debug : virEventPollInterruptLocked:697 : Skip interrupt, 1 -356829088
2015-07-22 09:26:02.026+0000: 60101: debug : virEventPollAddTimeout:243 : EVENT_POLL_ADD_TIMEOUT: timer=7 frequency=-1 cb=0x7f88ea4be190 opaque=0xb48440 ff=(nil)
2015-07-22 09:26:02.026+0000: 60101: debug : virNetMessageNew:45 : msg=0xb48520 tracked=1
2015-07-22 09:26:02.026+0000: 60101: debug : virNetServerClientNewInternal:382 : RPC_SERVER_CLIENT_NEW: client=0xb48440 sock=0xb487f0
2015-07-22 09:26:02.026+0000: 60101: debug : virNetServerClientCalculateHandleMode:152 : tls=(nil) hs=-1, rx=0xb48520 tx=(nil)
2015-07-22 09:26:02.026+0000: 60101: debug : virNetServerClientCalculateHandleMode:183 : mode=1
2015-07-22 09:26:02.026+0000: 60101: debug : virObjectRef:168 : OBJECT_REF: obj=0xb48440
2015-07-22 09:26:02.026+0000: 60101: debug : virNetServerClientRegisterEvent:199 : Registering client event callback 1

And at the same instant (09:24:23) supervdsm died and was respawned:

Jul 22 09:23:58 alma04 libvirtd: Could not find keytab file: /etc/libvirt/krb5.tab: No such file or directory
Jul 22 09:23:58 alma04 kernel: lo: Disabled Privacy Extensions
Jul 22 09:24:22 alma04 kernel: device-mapper: table: 253:5: multipath: error getting device
Jul 22 09:24:22 alma04 kernel: device-mapper: ioctl: error adding target to table
Jul 22 09:24:23 alma04 vdsm vds ERROR libvirt error#012Traceback (most recent call last):#012  File "/usr/share/vdsm/rpc/BindingXMLRPC.py", line 1136, in wrapper#012  File "/usr/share/vdsm/rpc/BindingXMLRPC.py", line 554, in setupNetworks#012  File "/usr/share/vdsm/API.py", line 1398, in setupNetworks#012  File "/usr/share/vdsm/supervdsm.py", line 50, in __call__#012  File "/usr/share/vdsm/supervdsm.py", line 48, in <lambda>#012  File "<string>", line 2, in setupNetworks#012  File "/usr/lib64/python2.6/multiprocessing/managers.py", line 740, in _callmethod#012libvirtError: internal error client socket is closed
Jul 22 09:24:23 alma04 respawn: slave '/usr/share/vdsm/supervdsmServer --sockfile /var/run/vdsm/svdsm.sock --pidfile /var/run/vdsm/supervdsmd.pid' died, respawning slave

And after a few seconds mom also died, being unable to talk with supervdsm:

2015-07-22 09:24:22,036 - mom - INFO - MOM starting
2015-07-22 09:24:22,094 - mom.HostMonitor - INFO - Host Monitor starting
2015-07-22 09:24:22,095 - mom - INFO - hypervisor interface vdsm
2015-07-22 09:24:22,223 - mom.GuestManager - INFO - Guest Manager starting
2015-07-22 09:24:22,237 - mom.HostMonitor - INFO - HostMonitor is ready
2015-07-22 09:24:22,252 - mom.Policy - INFO - Loaded policy '00-defines'
2015-07-22 09:24:22,325 - mom.Policy - INFO - Loaded policy '02-balloon'
2015-07-22 09:24:22,406 - mom.Policy - INFO - Loaded policy '03-ksm'
2015-07-22 09:24:22,582 - mom.Policy - INFO - Loaded policy '04-cputune'
2015-07-22 09:24:22,583 - mom.PolicyEngine - INFO - Policy Engine starting
2015-07-22 09:24:22,587 - mom.RPCServer - INFO - RPC Server is disabled
2015-07-22 09:24:32,640 - mom.Controllers.KSM - INFO - Updating KSM configuration: pages_to_scan:0 run:0 sleep_millisecs:0
2015-07-22 09:24:32,653 - mom.PolicyEngine - ERROR - Policy Engine crashed
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/mom/PolicyEngine.py", line 131, in run
  File "/usr/lib/python2.6/site-packages/mom/PolicyEngine.py", line 122, in do_controls
  File "/usr/lib/python2.6/site-packages/mom/Controllers/KSM.py", line 53, in process
  File "/usr/lib/python2.6/site-packages/mom/HypervisorInterfaces/vdsmInterface.py", line 218, in ksmTune
  File "/usr/share/vdsm/supervdsm.py", line 55, in __call__
RuntimeError: Broken communication with supervdsm. Failed call to ksmTune
2015-07-22 09:24:37,599 - mom - ERROR - Thread 'PolicyEngine' has exited
2015-07-22 09:24:42,246 - mom.GuestManager - INFO - Guest Manager ending
2015-07-22 09:24:42,371 - mom.HostMonitor - INFO - Host Monitor ending
2015-07-22 09:24:42,371 - mom - INFO - MOM ending

Created attachment 1054850 [details]
sosreport

Might it be that libvirt was not up or responsive while vdsm called the setupNetwork verb? Does HE restart libvirtd during the installation? Maybe we should ping libvirt to initiate the communication in case it got broken (for example after a libvirtd restart) before running setupNetwork?

Manually running the setupNetwork verb worked properly there once the connection between vdsm and libvirt was established properly:

from vdsm import vdscli
c = vdscli.connect()
c.setupNetworks({'rhevm': {'nic': 'p1p1', 'bootproto': 'dhcp', 'blockingdhcp': True}}, {}, {'connectivityCheck': False})
{'status': {'message': 'Done', 'code': 0}}

(In reply to Yaniv Bronhaim from comment #25)
> might be that libvirt was not up or responsive while vdsm called the
> setupNetwork verb?

It could be, but we still need to understand why.

> do HE restart libvirtd during the installation?

No, it doesn't.

> manually running the setupNetwork verb worked properly there when the
> connection between vdsm to libvirt is established properly

I'll try on that host.

(In reply to Simone Tiraboschi from comment #26)
> I'll try on that host

It works as expected on that host.

OK, it's quite clear what happens.
Look at the syslog prints below:

Jul 22 13:03:55 alma04 init: /etc/init.conf: Unable to load configuration: No such file or directory

This line indicates a libvirtd restart - for some reason this is the only print the service outputs on el6. But it doesn't matter; the point is that something restarted libvirtd, or libvirtd died and started over. The next restart happens during the HE installation at Jul 22 13:14:12 - I know that because a few seconds afterwards (Jul 22 13:15:34) the call to setupNetwork occurs and fails. The logic is that the first call after libvirt is restarted fails with an exception; catching that exception restarts vdsm, and the reconnection is established normally by vdsm's start-up code. So it's obvious what the issue is here. I suspect that the restart occurs during the "vdsm-tool configure" call that probably happens after vdsm rpms installation. Therefore, it's my problem :)

Afair we had such a race with setupNetwork quite long ago - don't we retry the verb after a failure? If we can't do that, we can force a vdsm restart after the configure call in the hosted-engine setup code (great idea :)). Any other alternative you see here?

Btw, it also explains why it happens only on el6 - with systemd, a libvirtd restart will automatically trigger a vdsm restart.

Jul 22 13:03:53 alma04 dbus: [system] Reloaded configuration
Jul 22 13:03:55 alma04 init: /etc/init.conf: Unable to load configuration: No such file or directory
Jul 22 13:03:55 alma04 libvirtd: Could not find keytab file: /etc/libvirt/krb5.tab: No such file or directory
Jul 22 13:03:56 alma04 kernel: lo: Disabled Privacy Extensions
Jul 22 13:04:01 alma04 kernel: iscsi: registered transport (tcp)
Jul 22 13:04:01 alma04 kernel: iscsi: registered transport (iser)
Jul 22 13:04:01 alma04 iscsid: iSCSI logger with pid=15593 started!
Jul 22 13:04:02 alma04 iscsid: iSCSI daemon with pid=15594 started!
Jul 22 13:04:14 alma04 kernel: Bridge firewalling registered
Jul 22 13:04:16 alma04 kernel: Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)
Jul 22 13:04:28 alma04 multipathd: reconfigure (SIGHUP)
Jul 22 13:04:28 alma04 kernel: device-mapper: multipath round-robin: version 1.0.0 loaded
Jul 22 13:04:28 alma04 kernel: device-mapper: table: 253:5: multipath: error getting device
Jul 22 13:04:28 alma04 kernel: device-mapper: ioctl: error adding target to table
Jul 22 13:04:28 alma04 multipathd: 35000c50066168fc7: ignoring map
Jul 22 13:04:28 alma04 kernel: device-mapper: table: 253:5: multipath: error getting device
Jul 22 13:04:28 alma04 kernel: device-mapper: ioctl: error adding target to table
Jul 22 13:14:12 alma04 libvirtd: Could not find keytab file: /etc/libvirt/krb5.tab: No such file or directory
Jul 22 13:14:12 alma04 kernel: lo: Disabled Privacy Extensions
Jul 22 13:15:29 alma04 kernel: Slow work thread pool: Starting up
Jul 22 13:15:29 alma04 kernel: Slow work thread pool: Ready
Jul 22 13:15:29 alma04 kernel: FS-Cache: Loaded
Jul 22 13:15:29 alma04 kernel: NFS: Registering the id_resolver key type
Jul 22 13:15:29 alma04 kernel: FS-Cache: Netfs 'nfs' registered for caching
Jul 22 13:15:30 alma04 kernel: device-mapper: table: 253:5: multipath: error getting device
Jul 22 13:15:30 alma04 kernel: device-mapper: ioctl: error adding target to table
Jul 22 13:15:34 alma04 kernel: device-mapper: table: 253:5: multipath: error getting device
Jul 22 13:15:34 alma04 kernel: device-mapper: ioctl: error adding target to table
Jul 22 13:15:34 alma04 vdsm root WARNING connection to libvirt broken. ecode: 1 edom: 7
Jul 22 13:15:34 alma04 vdsm root CRITICAL taking calling process down.

Simone, can you check if this bug could be caught by the new CI job?

(In reply to Yaniv Bronhaim from comment #28)
> afair we had such race with setupNetwork quite long ago - don't we retry to
> call the verb after fail?

I'd prefer to avoid that.

> if we can't do that, we can force vdsm restart
> after the configure call in hosted engine setup code (great idea :)) .. any
> other alternative you see here?

But I could easily restart vdsm after re-configuring.

(In reply to Eyal Edri from comment #29)
> simone, can you check if this bug could be caught by the new CI job?

If ybronhei's explanation is correct, it should happen only on el6, while the new CI job covers only 3.6 on el7. But if it happened there, it would most probably be caught.

I posted a patch to hosted-engine-setup - the logic there was indeed wrong. See the differences between it and host-deploy, which does the configure phase correctly:
https://github.com/oVirt/ovirt-host-deploy/blob/master/src/plugins/ovirt-host-deploy/vdsm/packages.py

It seems that we got the issue: on the hosted-engine side we need to explicitly call vdsm-tool configure after tweaking the vdsm configuration files, instead of simply reloading the service. This seems to be enough to have all the involved services restarted in the proper order, avoiding the synchronization issue.

We tried this patch on my host two times and it passed OK; waiting for the patch to be merged into the build so it gets to QA after smoke test and acceptance.
Performed deployment on two RHEVH 6.7 hosts with HE 3.5.z running on RHEL 6.7:

[root@localhost ~]# rpm -qa libvirt* sanlock* qemu* mom vdsm* ovirt*
vdsm-yajsonrpc-4.16.23-1.el6ev.noarch
sanlock-2.8-2.el6_5.x86_64
libvirt-0.10.2-54.el6.x86_64
libvirt-cim-0.6.1-12.el6.x86_64
sanlock-python-2.8-2.el6_5.x86_64
vdsm-4.16.23-1.el6ev.x86_64
ovirt-node-plugin-hosted-engine-0.2.0-16.0.el6ev.noarch
vdsm-python-zombiereaper-4.16.23-1.el6ev.noarch
ovirt-node-branding-rhev-3.2.3-14.el6.noarch
vdsm-jsonrpc-4.16.23-1.el6ev.noarch
sanlock-lib-2.8-2.el6_5.x86_64
qemu-img-rhev-0.12.1.2-2.479.el6.x86_64
vdsm-python-4.16.23-1.el6ev.noarch
vdsm-cli-4.16.23-1.el6ev.noarch
ovirt-host-deploy-1.3.0-2.el6ev.noarch
qemu-kvm-rhev-tools-0.12.1.2-2.479.el6.x86_64
libvirt-client-0.10.2-54.el6.x86_64
libvirt-python-0.10.2-54.el6.x86_64
mom-0.4.1-5.el6ev.noarch
ovirt-node-plugin-rhn-3.2.3-14.el6.noarch
ovirt-node-3.2.3-14.el6.noarch
libvirt-lock-sanlock-0.10.2-54.el6.x86_64
vdsm-hook-vhostmd-4.16.23-1.el6ev.noarch
qemu-kvm-rhev-0.12.1.2-2.479.el6.x86_64
vdsm-reg-4.16.23-1.el6ev.noarch
ovirt-hosted-engine-setup-1.2.5.2-1.el6ev.noarch
vdsm-hook-ethtool-options-4.16.23-1.el6ev.noarch
ovirt-node-plugin-vdsm-0.2.0-25.el6ev.noarch
ovirt-node-plugin-cim-3.2.3-14.el6.noarch
ovirt-node-selinux-3.2.3-14.el6.noarch
ovirt-hosted-engine-ha-1.2.6-2.el6ev.noarch
ovirt-host-deploy-offline-1.3.0-3.el6ev.x86_64
ovirt-node-plugin-snmp-3.2.3-14.el6.noarch
vdsm-xmlrpc-4.16.23-1.el6ev.noarch

Host kernel:
Linux version 2.6.32-573.1.1.el6.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.4.7 20120313 (Red Hat 4.4.7-16) (GCC) ) #1 SMP Tue Jul 14 02:46:51 EDT 2015

Engine:
Linux version 2.6.32-573.el6.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.4.7 20120313 (Red Hat 4.4.7-16) (GCC) ) #1 SMP Wed Jul 1 18:23:37 EDT 2015
rhevm-3.5.4-1.2.el6ev.noarch
Red Hat Enterprise Virtualization Hypervisor release 6.7 (20150728.0.el6ev)

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-1729.html