Bug 1348103
| Field | Value |
|---|---|
| Summary | Add host failed for 5 minutes and then succeeded |
| Product | [oVirt] ovirt-engine |
| Component | BLL.Infra |
| Version | 4.0.0 |
| Hardware | x86_64 |
| OS | Linux |
| Status | CLOSED CURRENTRELEASE |
| Severity | high |
| Priority | high |
| Target Milestone | ovirt-4.0.2 |
| Target Release | --- |
| Reporter | Michael Burman <mburman> |
| Assignee | Piotr Kliczewski <pkliczew> |
| QA Contact | Petr Matyáš <pmatyas> |
| CC | bugs, cshao, danken, gklein, huzhao, leiwang, lsvaty, mburman, mgoldboi, mperina, nsednev, nsoffer, oourfali, pkliczew, stirabos, weiwang, yaniwang, ybronhei, ycui |
| Keywords | AutomationBlocker |
| Flags | rule-engine: ovirt-3.6.z+, rule-engine: ovirt-4.0.z+, rule-engine: blocker+, rule-engine: planning_ack+, oourfali: devel_ack+, lsvaty: testing_ack+ |
| oVirt Team | Infra |
| Type | Bug |
| Last Closed | 2016-08-12 14:22:33 UTC |
| Bug Depends On | 1350763 |
| Bug Blocks | 1167262, 1350718 |
| Attachments | Logs (1169740); new logs from second reproduction (1170061); no ovirtmgmt bridge after reboot (1170063) |
From the engine log I see that the old vdsm was removed at 2016-06-20 09:04:19,668 and reinstallation started at 2016-06-20 09:04:19,716. The host deploy process began, but network configuration failed on the engine side due to an InterruptedException:

2016-06-20 09:05:58,086 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.HostSetupNetworksVDSCommand] (org.ovirt.thread.pool-6-thread-38) [1611aa5b] Error: VDSGenericException: VDSNetworkException: Waiting for connect interrupted

On the vdsm side we received SIGTERM during the installation:

MainThread::DEBUG::2016-06-20 09:09:48,732::vdsm::72::vds::(sigtermHandler) Received signal 15

The engine reconnected at 2016-06-20 09:13:03,323 and sent setupNetworks:

jsonrpc.Executor/1::DEBUG::2016-06-20 09:13:03,392::__init__::522::jsonrpc.JsonRpcServer::(_serveRequest) Calling 'Host.setupNetworks' in bridge with {'bondings': {}, 'networks': {'ovirtmgmt': {'hostQos': {'out': {'ls': {'m2': 50}}}, 'nic': 'enp4s0', 'mtu': 1500, 'bootproto': 'dhcp', 'STP': 'no', 'bridged': 'true', 'defaultRoute': True}}, 'options': {'connectivityCheck': 'true', 'connectivityTimeout': 120}}

The engine was able to reconnect at 2016-06-20 09:14:05,581 and as a result we got:

jsonrpc.Executor/1::DEBUG::2016-06-20 09:14:08,573::__init__::550::jsonrpc.JsonRpcServer::(_serveRequest) Return 'Host.setupNetworks' in bridge with {'message': 'Done', 'code': 0}

Here are my questions about this setup:
1. Why did the engine call interrupt on the thread waiting for setupNetworks (was a SIGTERM sent to the engine as well)?
2. Why was vdsm rebooted during the setupNetworks process?
3. Why did the first connection take so long: 2016-06-20 09:05:33,757 - connect; 2016-06-20 09:05:36,220 - connected?

These are the questions we are all asking as well. I didn't do anything except add the host :) Nothing was happening on my env except this. I have no answers.

There has to be something unusual happening on your env, because vdsm should not be rebooted during the installation process, nor should the engine call interrupt. Please attempt to reproduce it on a different env, and please send steps for how we could get those signals on both ends.

There is nothing unusual on my env. Dan, any input from you?

This happened again; I will attach logs and the console log in a few minutes. It's not a one-time issue - it happens every few add-host attempts. I also did what masayag asked - kill -3 on the jboss process - so you can see the threads in console.log.

Created attachment 1170061 [details]
new logs from second reproduction
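For readability, the one-line 'Calling Host.setupNetworks' entry quoted in the first comment above corresponds to a JSON-RPC 2.0 request along these lines. This is a minimal sketch: the parameters are copied from the log, but the request id is a placeholder and the STOMP transport framing the engine actually uses is not shown.

```python
# Sketch of the JSON-RPC request behind the logged
# "Calling 'Host.setupNetworks'" line; parameters are copied from
# the log, the request id is a placeholder (the engine generates
# its own), and the STOMP transport framing is omitted.
import json

request = {
    'jsonrpc': '2.0',
    'id': 'placeholder-request-id',
    'method': 'Host.setupNetworks',
    'params': {
        'bondings': {},
        'networks': {
            'ovirtmgmt': {
                'hostQos': {'out': {'ls': {'m2': 50}}},
                'nic': 'enp4s0',
                'mtu': 1500,
                'bootproto': 'dhcp',
                'STP': 'no',
                'bridged': 'true',
                'defaultRoute': True,
            },
        },
        # vdsm rolls the new configuration back unless the engine
        # reconnects within connectivityTimeout seconds - here the
        # engine made it back at 09:14:05, inside the 120s window.
        'options': {'connectivityCheck': 'true', 'connectivityTimeout': 120},
    },
}

print(json.dumps(request, indent=2))
```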
Rebooting the server after the so-called successful add host (after the 5 minutes) leaves the system in a non-operational state, because vdsm fails to restore the ovirtmgmt network. It looks like it wasn't persisted properly when the host was added, or there is some other issue in restoring the network. The ovirtmgmt bridge no longer exists, and there is no ifcfg-ovirtmgmt:

restore-net::WARNING::2016-06-21 08:16:52,808::bridges::42::root::(ports) ovirtmgmt is not a Linux bridge
restore-net::INFO::2016-06-21 08:16:52,809::cache::213::root::(_getNetInfo) Obtaining info for net ovirtmgmt.
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/network/netinfo/cache.py", line 188, in _getNetInfo
    'stp': bridges.stp_state(iface)})
  File "/usr/lib/python2.7/site-packages/vdsm/network/netinfo/bridges.py", line 66, in stp_state
    with open(BRIDGING_OPT % (bridge, 'stp_state')) as stp_file:
IOError: [Errno 2] No such file or directory: '/sys/class/net/ovirtmgmt/bridge/stp_state'

Attaching logs after the reboot. This is all a continuation of the state from the original report, and I think it is related to the failed add host at the beginning. Dan, please take a look as well.

Created attachment 1170063 [details]
no ovirtmgmt bridge after reboot
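The traceback above pinpoints the failure mode: vdsm reads the STP state directly from sysfs, and that attribute only exists while the bridge device does. Below is a minimal sketch of the failing read plus a hypothetical defensive variant; it mirrors the two frames in the traceback but is not vdsm's actual code or fix.

```python
# Sketch of the sysfs read that fails in
# vdsm/network/netinfo/bridges.py once the ovirtmgmt bridge is gone;
# the guarded variant below is hypothetical, not vdsm's actual fix.
import errno

BRIDGING_OPT = '/sys/class/net/%s/bridge/%s'

def stp_state(bridge):
    # Raises IOError(ENOENT) when the bridge device does not exist -
    # exactly the traceback seen by restore-net after the reboot.
    with open(BRIDGING_OPT % (bridge, 'stp_state')) as stp_file:
        return stp_file.read().strip()

def stp_state_or_none(bridge):
    # Defensive variant: report "unknown" (None) instead of crashing
    # the network-info gathering when the bridge has vanished.
    try:
        return stp_state(bridge)
    except IOError as e:
        if e.errno == errno.ENOENT:
            return None
        raise
```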
The second issue seems to be a separate bug, so please open a separate one for it.

Vdsm being rebooted during setupNetworks was caused by:

Jun 21 07:53:46 orchid-vds2 systemd: Stopping MOM instance configured for VDSM purposes...
Jun 21 07:53:56 orchid-vds2 systemd: mom-vdsm.service stop-sigterm timed out. Killing.
Jun 21 07:53:56 orchid-vds2 systemd: mom-vdsm.service: main process exited, code=killed, status=9/KILL
Jun 21 07:53:56 orchid-vds2 systemd: Unit mom-vdsm.service entered failed state.
Jun 21 07:53:56 orchid-vds2 systemd: mom-vdsm.service failed.
Jun 21 07:53:56 orchid-vds2 systemd: Stopping Virtual Desktop Server Manager...
Jun 21 07:53:57 orchid-vds2 vdsmd_init_common.sh: vdsm: Running run_final_hooks
Jun 21 07:53:57 orchid-vds2 systemd: Starting Virtual Desktop Server Manager...

Yaniv, can you please take a look at why this is happening during setupNetworks?

Dropping needinfo, as bug 1348451 was opened to handle the unstable state due to a vdsm restart while in setupNetworks.

It seems that vdsm was soft-fenced by the engine:

MainThread::DEBUG::2016-06-20 09:09:48,732::vdsm::72::vds::(sigtermHandler) Received signal 15

2016-06-20 09:09:37,997 INFO [org.ovirt.engine.core.bll.pm.SshSoftFencingCommand] (org.ovirt.thread.pool-6-thread-43) [63a30e3e] Opening SSH Soft Fencing session on host 'orchid-vds2.qa.lab.tlv.redhat.com'
2016-06-20 09:09:38,601 INFO [org.ovirt.engine.core.bll.pm.SshSoftFencingCommand] (org.ovirt.thread.pool-6-thread-43) [63a30e3e] Executing SSH Soft Fencing command on host 'orchid-vds2.qa.lab.tlv.redhat.com'

This host was part of a previous installation and was SPM. It was later removed and re-added to the cluster.

I am not sure whether this is the correct log entry. One SIGTERM happened at 2016-06-20 09:09:48,732 and the other at Jun 21 07:53:56. It would seem like a bug, because the host was in the middle of host deploy, so it would be hard to soft-fence it.

We have two sets of reproductions here with both vdsm.log and engine.log - in both envs there is a correlation between the signal and the SshSoftFencingCommand run. The bug is in the SshSoftFencingCommand logic.

Hmm, let's ask Martin about it.

@Martin we are seeing the same issue in CI. Please let us know whether it is fencing related.

It is pretty reproducible on hosted-engine CI jobs with lago deploying additional hosts: hosted-engine-setup pauses because the host went into NonOperational state, asking the user to review on the engine side. After 5 minutes the AutoRecoveryManager kicks in and the host comes up by itself, so just retrying on the hosted-engine side is enough to continue. It's still a blocker for hosted-engine automated tests.

I briefly checked Simone's engine logs and I do not see any fencing-related log entries. I only see:

2016-06-26 20:10:57,565 INFO [org.ovirt.engine.core.bll.storage.domain.ImportHostedEngineStorageDomainCommand] (org.ovirt.thread.pool-8-thread-23) [5446dcb1] Lock Acquired to object 'EngineLock:{exclusiveLocks='[d4fe0b7c-1f94-4ed8-940f-5aaf6549c08e=<STORAGE, ACTION_TYPE_FAILED_STORAGE_DEVICE_LOCKED>]', sharedLocks='null'}'
2016-06-26 20:10:57,567 WARN [org.ovirt.engine.core.bll.storage.domain.ImportHostedEngineStorageDomainCommand] (org.ovirt.thread.pool-8-thread-23) [5446dcb1] Validation of action 'ImportHostedEngineStorageDomain' failed for user SYSTEM. Reasons: VAR__ACTION__ADD,VAR__TYPE__STORAGE__DOMAIN,ACTION_TYPE_FAILED_MASTER_STORAGE_DOMAIN_NOT_ACTIVE

Nir, is it possible that sanlock killed vdsm before it was installed?
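The correlation claimed above between SshSoftFencingCommand runs and the vdsm SIGTERM can be checked mechanically. Here is a throwaway helper (hypothetical, not part of oVirt; the log file names stand in for the attached files) that pulls both sets of timestamps out for side-by-side comparison:

```python
# Hypothetical helper to line up SshSoftFencingCommand runs in
# engine.log with "Received signal 15" entries in vdsm.log.
# The file names are placeholders for the attached logs.
import re

TS = re.compile(r'\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}')

def grep_times(path, needle):
    # Collect every timestamp found on a line containing `needle`.
    found = []
    with open(path) as f:
        for line in f:
            if needle in line:
                m = TS.search(line)
                if m:
                    found.append(m.group())
    return found

fencing = grep_times('engine.log', 'SshSoftFencingCommand')
sigterms = grep_times('vdsm.log', 'Received signal 15')
print('soft-fence entries: %s' % fencing)
print('vdsm SIGTERMs:      %s' % sigterms)
```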
I. I haven't found anything about fencing in Simone's log; the only suspicious thing is the ImportHostedEngineStorageDomain warning mentioned by Piotr above.

II. Looking at the logs from the 2nd reproduction, here are my notes:
1. Host orchid-vds2.qa.lab.tlv.redhat.com failed installation and became NonOperational at 2016-06-21 07:50:05,566
2. Host orchid-vds2.qa.lab.tlv.redhat.com became NonResponsive at 2016-06-21 07:51:55
3. SSH Soft Fencing was successfully executed at 2016-06-21 07:53:46

No idea why VDSM didn't respond to the engine between 2016-06-21 07:50:05,566 and 2016-06-21 07:53:46, although according to vdsm.log VDSM was running. From the fencing-flow point of view everything seems fine, so are we sure the engine could connect to VDSM even though ovirtmgmt was not configured properly?

The vdsm log shows a getCaps response returned at 2016-06-21 07:49:43,809. Next we lose connectivity and attempt to reconnect several times. At 07:50:05,533 the waiting thread is interrupted (could be a SIGTERM on the engine side) and later (07:53:56,885) vdsm receives SIGTERM. In Simone's log I can see that the thread was interrupted at 20:11:44,532, but for Simone vdsm was killed earlier, at 16:11:20 +4h.

We saw it also on 3.6.7; please take a look at https://bugzilla.redhat.com/show_bug.cgi?id=1350718

The same fix as in Bug 1350763 should solve this one. Leaving this open for verification purposes.

Piotr, isn't it modified?

Yes, it is.

*** Bug 1357615 has been marked as a duplicate of this bug. ***

Verified on 4.0.2-1

I forgot to remove needinfo
Created attachment 1169740 [details]
Logs

Description of problem:
Add host failed for 5 minutes and then succeeded.

- Failed to configure management network on host orchid-vds2.qa.lab.tlv.redhat.com due to setup networks failure.
- Host orchid-vds2.qa.lab.tlv.redhat.com installation failed. Failed to configure management network on the host.
- Could not get hardware information for host orchid-vds2.qa.lab.tlv.redhat.com
- Host orchid-vds2.qa.lab.tlv.redhat.com does not comply with the cluster CLUSTER1 networks; the following networks are missing on host: 'ovirtmgmt'
- Status of host orchid-vds2.qa.lab.tlv.redhat.com was set to NonOperational.
- Host orchid-vds2.qa.lab.tlv.redhat.com is rebooting.
- Status of host orchid-vds2.qa.lab.tlv.redhat.com was set to Up.

* Attaching all relevant logs.
* We were trying to understand with masayag and danken what is going on, and we currently don't understand it, but there is definitely an issue.

Version-Release number of selected component (if applicable):
4.0.0.5-0.1.el7ev
vdsm-4.18.3-0.el7ev.x86_64

How reproducible:
Hard to say; it happens sometimes, from time to time.

Steps to Reproduce:
1. Add a host to RHEV-M 4 (a sketch of doing this programmatically follows below).

Actual results:
Sometimes add host fails for a few minutes and then succeeds.
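For completeness, here is the single reproduction step expressed as a script: a minimal sketch using the oVirt Python SDK (ovirtsdk4), where the engine URL and all credentials are placeholders; only the host address and cluster name come from this report.

```python
# Minimal sketch of the reproduction step ("add host") via the
# oVirt Python SDK. Engine URL and credentials are placeholders;
# host address and cluster name are taken from this report.
import ovirtsdk4 as sdk
import ovirtsdk4.types as types

connection = sdk.Connection(
    url='https://engine.example.com/ovirt-engine/api',
    username='admin@internal',
    password='changeme',
    insecure=True,  # lab setup; use ca_file=... in production
)

hosts_service = connection.system_service().hosts_service()
hosts_service.add(
    types.Host(
        name='orchid-vds2',
        address='orchid-vds2.qa.lab.tlv.redhat.com',
        root_password='changeme',
        cluster=types.Cluster(name='CLUSTER1'),
    )
)

connection.close()
```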