Bug 1348103

Summary: Add host failed for 5 minutes and then succeeded
Product: [oVirt] ovirt-engine
Reporter: Michael Burman <mburman>
Component: BLL.Infra
Assignee: Piotr Kliczewski <pkliczew>
Status: CLOSED CURRENTRELEASE
QA Contact: Petr Matyáš <pmatyas>
Severity: high
Docs Contact:
Priority: high
Version: 4.0.0
CC: bugs, cshao, danken, gklein, huzhao, leiwang, lsvaty, mburman, mgoldboi, mperina, nsednev, nsoffer, oourfali, pkliczew, stirabos, weiwang, yaniwang, ybronhei, ycui
Target Milestone: ovirt-4.0.2
Keywords: AutomationBlocker
Target Release: ---
Flags: rule-engine: ovirt-3.6.z+
       rule-engine: ovirt-4.0.z+
       rule-engine: blocker+
       rule-engine: planning_ack+
       oourfali: devel_ack+
       lsvaty: testing_ack+
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-08-12 14:22:33 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Infra
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1350763
Bug Blocks: 1167262, 1350718
Attachments (Description / Flags):
  Logs (none)
  new logs from second reproduction (none)
  no ovirtmgmt bridge after reboot (none)

Description Michael Burman 2016-06-20 07:27:26 UTC
Created attachment 1169740 [details]
Logs

Description of problem:
Add host failed for 5 minutes and then succeeded. 

- Failed to configure management network on host orchid-vds2.qa.lab.tlv.redhat.com due to setup networks failure.

- Host orchid-vds2.qa.lab.tlv.redhat.com installation failed. Failed to configure management network on the host.

- Could not get hardware information for host orchid-vds2.qa.lab.tlv.redhat.com

- Host orchid-vds2.qa.lab.tlv.redhat.com does not comply with the cluster CLUSTER1 networks, the following networks are missing on host: 'ovirtmgmt'

- Status of host orchid-vds2.qa.lab.tlv.redhat.com was set to NonOperational.

- Host orchid-vds2.qa.lab.tlv.redhat.com is rebooting.

- Status of host orchid-vds2.qa.lab.tlv.redhat.com was set to Up.

* Attaching all relevant logs. 
* We tried to understand what is going on together with masayag and danken, and we currently don't understand it, but there is definitely an issue.

Version-Release number of selected component (if applicable):
4.0.0.5-0.1.el7ev
vdsm-4.18.3-0.el7ev.x86_64

How reproducible:
Can't say; it happens sometimes, from time to time.

Steps to Reproduce:
1. Add host to rhev-m 4

Actual results:
Sometimes adding a host fails for a few minutes and then succeeds.

Comment 1 Piotr Kliczewski 2016-06-20 09:14:49 UTC
From the engine log I see that the old vdsm was removed at 2016-06-20 09:04:19,668 and reinstallation started at 2016-06-20 09:04:19,716.

The host deploy process began, but network configuration failed on the engine side due to an InterruptedException:

2016-06-20 09:05:58,086 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.HostSetupNetworksVDSCommand] (org.ovirt.thread.pool-6-thread-38) [1611aa5b] Error: VDSGenericException: VDSNetworkException: Waiting for connect interrupted

On the vdsm side we received SIGTERM during the installation:

MainThread::DEBUG::2016-06-20 09:09:48,732::vdsm::72::vds::(sigtermHandler) Received signal 15

The engine reconnected at 2016-06-20 09:13:03,323 and sent setupNetworks:

jsonrpc.Executor/1::DEBUG::2016-06-20 09:13:03,392::__init__::522::jsonrpc.JsonRpcServer::(_serveRequest) Calling 'Host.setupNetworks' in bridge with {'bondings': {}, 'networks': {'ovirtmgmt': {'hostQos': {'out': {'ls': {'m2': 50}}}, 'nic': 'enp4s0', 'mtu': 1500, 'bootproto': 'dhcp', 'STP': 'no', 'bridged': 'true', 'defaultRoute': True}}, 'options': {'connectivityCheck': 'true', 'connectivityTimeout': 120}}
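
For readability, here are the same setupNetworks parameters from the log line above, expanded into a Python literal (nothing added or changed, only re-indented):

setup_networks_params = {
    'bondings': {},
    'networks': {
        'ovirtmgmt': {
            'hostQos': {'out': {'ls': {'m2': 50}}},  # outbound host QoS
            'nic': 'enp4s0',
            'mtu': 1500,
            'bootproto': 'dhcp',
            'STP': 'no',
            'bridged': 'true',
            'defaultRoute': True,
        },
    },
    'options': {'connectivityCheck': 'true', 'connectivityTimeout': 120},
}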

The engine was able to reconnect at 2016-06-20 09:14:05,581 and as a result we got:

jsonrpc.Executor/1::DEBUG::2016-06-20 09:14:08,573::__init__::550::jsonrpc.JsonRpcServer::(_serveRequest) Return 'Host.setupNetworks' in bridge with {'message': 'Done', 'code': 0}


Here are my questions about this setup:
1. Why did the engine call interrupt on the thread waiting for setupNetworks (was SIGTERM sent to the engine as well)?
2. Why was vdsm rebooted during the setupNetworks process?
3. Why did the first connection take so long:

2016-06-20 09:05:33,757 - connect
2016-06-20 09:05:36,220 - connected

Comment 2 Michael Burman 2016-06-20 09:23:24 UTC
These are the questions we are all asking as well.
I didn't do anything except add the host :)
There was nothing happening on my env except this.
I have no answers.

Comment 3 Piotr Kliczewski 2016-06-20 09:26:00 UTC
There has to be something unusual happening on your env, because vdsm should not be rebooted during the installation process, nor should the engine call interrupt.

Please attempt to reproduce it on a different env, and please send the steps for how those signals could be triggered on both ends.

Comment 4 Michael Burman 2016-06-20 09:30:14 UTC
There is nothing unusual on my env. 
Dan, any input from you?

Comment 5 Michael Burman 2016-06-21 04:54:57 UTC
This happened again. Will attach logs in a few minutes, along with the console log. It's not a one-time issue; it happens every few host additions.

Comment 6 Michael Burman 2016-06-21 05:01:35 UTC
I also did what masayag asked:
kill -3 on the jboss process, so you can see the threads in console.log.

Comment 7 Michael Burman 2016-06-21 05:06:16 UTC
Created attachment 1170061 [details]
new logs from second reproduction

Comment 8 Michael Burman 2016-06-21 05:29:30 UTC
Rebooting the server after the so-called successful add host (after the 5 minutes) leaves the system in a non-operational state, because vdsm fails to restore the ovirtmgmt network. It looks like it wasn't persisted well when the host was added, or there is some other issue in restoring the network.
The ovirtmgmt bridge no longer exists and there is no ifcfg-ovirtmgmt.


restore-net::WARNING::2016-06-21 08:16:52,808::bridges::42::root::(ports) ovirtmgmt is not a Linux bridge
restore-net::INFO::2016-06-21 08:16:52,809::cache::213::root::(_getNetInfo) Obtaining info for net ovirtmgmt.
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/network/netinfo/cache.py", line 188, in _getNetInfo
    'stp': bridges.stp_state(iface)})
  File "/usr/lib/python2.7/site-packages/vdsm/network/netinfo/bridges.py", line 66, in stp_state
    with open(BRIDGING_OPT % (bridge, 'stp_state')) as stp_file:
IOError: [Errno 2] No such file or directory: '/sys/class/net/ovirtmgmt/bridge/stp_state'
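
For reference, a minimal standalone sketch (not the vdsm code and not a proposed fix) of reading the same sysfs attribute defensively, so that a missing bridge yields None instead of raising IOError; the path format is taken from the traceback above:

import errno
import os

# Path format as seen in the traceback above.
BRIDGING_OPT = '/sys/class/net/%s/bridge/%s'

def stp_state(bridge):
    """Return the bridge's STP state, or None if the bridge is gone."""
    path = BRIDGING_OPT % (bridge, 'stp_state')
    if not os.path.exists(path):
        return None
    try:
        with open(path) as stp_file:
            return stp_file.read().strip()
    except IOError as e:
        # The bridge can disappear between the exists() check and open().
        if e.errno == errno.ENOENT:
            return None
        raise

print(stp_state('ovirtmgmt'))  # prints None when the bridge does not exist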

Attaching logs after the reboot. This is all a continuation of the state from the original report, and I think it is related to the failed add host at the beginning.

Dan, please take a look as well.

Comment 9 Michael Burman 2016-06-21 05:30:00 UTC
Created attachment 1170063 [details]
no ovirtmgmt bridge after reboot

Comment 10 Piotr Kliczewski 2016-06-21 07:31:05 UTC
The second issue seems to be a separate bug, so please open a separate one for it.

Vdsm being rebooted during setupNetworks was caused by:

Jun 21 07:53:46 orchid-vds2 systemd: Stopping MOM instance configured for VDSM purposes...
Jun 21 07:53:56 orchid-vds2 systemd: mom-vdsm.service stop-sigterm timed out. Killing.
Jun 21 07:53:56 orchid-vds2 systemd: mom-vdsm.service: main process exited, code=killed, status=9/KILL
Jun 21 07:53:56 orchid-vds2 systemd: Unit mom-vdsm.service entered failed state.
Jun 21 07:53:56 orchid-vds2 systemd: mom-vdsm.service failed.
Jun 21 07:53:56 orchid-vds2 systemd: Stopping Virtual Desktop Server Manager...
Jun 21 07:53:57 orchid-vds2 vdsmd_init_common.sh: vdsm: Running run_final_hooks
Jun 21 07:53:57 orchid-vds2 systemd: Starting Virtual Desktop Server Manager...

Yaniv, can you please take a look at why this is happening during setupNetworks?

Comment 11 Dan Kenigsberg 2016-06-22 14:42:04 UTC
Dropping needinfo, as bug 1348451 was opened to handle the unstable state caused by the vdsm restart during setupNetworks.

Comment 12 Dan Kenigsberg 2016-06-22 14:48:49 UTC
It seems that vdsm was soft-fenced by the engine.

MainThread::DEBUG::2016-06-20 09:09:48,732::vdsm::72::vds::(sigtermHandler) Received signal 15


2016-06-20 09:09:37,997 INFO  [org.ovirt.engine.core.bll.pm.SshSoftFencingCommand] (org.ovirt.thread.pool-6-thread-43) [63a30e3e] Opening SSH Soft Fencing session on host 'orchid-vds2.qa.lab.tlv.redhat.com'
2016-06-20 09:09:38,601 INFO  [org.ovirt.engine.core.bll.pm.SshSoftFencingCommand] (org.ovirt.thread.pool-6-thread-43) [63a30e3e] Executing SSH Soft Fencing command on host 'orchid-vds2.qa.lab.tlv.redhat.com'
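
To check that correlation across both sets of logs, here is a rough log-scanning sketch (the file names, search patterns and timestamp format are assumptions based on the excerpts in this bug):

import re
from datetime import datetime, timedelta

TS = re.compile(r'(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})')

def timestamps(path, needle):
    """Return the timestamps of all lines in `path` containing `needle`."""
    found = []
    with open(path) as log:
        for line in log:
            if needle in line:
                match = TS.search(line)
                if match:
                    found.append(datetime.strptime(match.group(1),
                                                   '%Y-%m-%d %H:%M:%S,%f'))
    return found

fencing = timestamps('engine.log', 'SshSoftFencingCommand')
sigterms = timestamps('vdsm.log', 'Received signal 15')

# Report every SIGTERM that follows a soft-fencing attempt within one minute.
for sig in sigterms:
    for fence in fencing:
        if timedelta(0) <= sig - fence <= timedelta(minutes=1):
            print('SIGTERM at %s follows soft fencing at %s' % (sig, fence))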

Comment 13 Piotr Kliczewski 2016-06-22 15:15:17 UTC
This host was part of a previous installation and it was the SPM. Later it was removed and re-added to the cluster. I am not sure whether this is the correct log entry. One SIGTERM happened at 2016-06-20 09:09:48,732 and the other at Jun 21 07:53:56.

It would seem like a bug, because the host was in the middle of host deploy, so it would be hard to soft-fence it.

Comment 14 Yaniv Bronhaim 2016-06-22 16:12:03 UTC
We have 2 sets of reproductions here, with both vdsm.log and engine.log. In both envs there is a correlation between the signal and the SshSoftFencingCommand run.
It's only that.
The bug is in the SshSoftFencingCommand logic.

Comment 15 Piotr Kliczewski 2016-06-23 08:12:50 UTC
Hmm, let's ask Martin about it

Comment 16 Piotr Kliczewski 2016-06-27 11:38:50 UTC
@Martin, we are seeing the same issue in CI. Please take a look and let us know whether it is fencing related.

Comment 17 Simone Tiraboschi 2016-06-27 11:43:15 UTC
It's pretty reproducible on hosted-engine CI jobs with lago deploying additional hosts: hosted-engine-setup pauses, since the host went into the NonOperational state, asking the user to review it on the engine side.

After 5 minutes the AutoRecoveryManager kicks in and the host goes up by itself, so just retrying on the hosted-engine side is enough to continue.
It's still a blocker for hosted-engine automated tests.

Comment 18 Piotr Kliczewski 2016-06-27 13:32:15 UTC
I briefly checked Simone's engine logs and I do not see any fencing-related log entries. I only see:

2016-06-26 20:10:57,565 INFO  [org.ovirt.engine.core.bll.storage.domain.ImportHostedEngineStorageDomainCommand] (org.ovirt.thread.pool-8-thread-23) [5446dcb1] Lock Acquired to object 'EngineLock:{exclusiveLocks='[d4fe0b7c-1f94-4ed8-940f-5aaf6549c08e=<STORAGE, ACTION_TYPE_FAILED_STORAGE_DEVICE_LOCKED>]', sharedLocks='null'}'
2016-06-26 20:10:57,567 WARN  [org.ovirt.engine.core.bll.storage.domain.ImportHostedEngineStorageDomainCommand] (org.ovirt.thread.pool-8-thread-23) [5446dcb1] Validation of action 'ImportHostedEngineStorageDomain' failed for user SYSTEM. Reasons: VAR__ACTION__ADD,VAR__TYPE__STORAGE__DOMAIN,ACTION_TYPE_FAILED_MASTER_STORAGE_DOMAIN_NOT_ACTIVE

Nir, is it possible that sanlock killed vdsm before it was installed?

Comment 19 Martin Perina 2016-06-27 14:52:29 UTC
I. I haven't found anything about fencing in Simone's log; the only suspicious thing is the ImportHostedEngineStorageDomain warning mentioned by Piotr above.


II. Looking at logs from 2nd reproduction here are my notes:

  1. Host orchid-vds2.qa.lab.tlv.redhat.com failed installation and became NonOperational at 2016-06-21 07:50:05,566
  2. Host orchid-vds2.qa.lab.tlv.redhat.com became NonResponsive at 2016-06-21 07:51:55
  3. SSH Soft Fencing was successfully executed at 2016-06-21 07:53:46
  
  No idea why VDSM didn't respond to the engine between 2016-06-21 07:50:05,566 and 2016-06-21 07:53:46, although according to vdsm.log VDSM was running. From the fencing flow execution everything seems to be fine, so are we sure that the engine can connect to VDSM even though ovirtmgmt was not configured properly?

Comment 20 Piotr Kliczewski 2016-06-27 15:08:08 UTC
Vdsm logs show a getCaps response at 2016-06-21 07:49:43,809. Next we lose connectivity and attempt to reconnect several times. At 07:50:05,533 the waiting thread is interrupted (could be a sigterm on the engine side) and later (07:53:56,885) vdsm receives a sigterm.

In Simone's log I can see that the thread was interrupted at 20:11:44,532, but for Simone vdsm was killed earlier, at 16:11:20 +4h.

Comment 21 Simone Tiraboschi 2016-06-28 14:15:33 UTC
We also saw it on 3.6.7; please take a look at https://bugzilla.redhat.com/show_bug.cgi?id=1350718

Comment 22 Oved Ourfali 2016-06-30 12:03:25 UTC
The same fix as in Bug 1350763 should solve this one.
Leaving this open for verification purposes.

Comment 23 Oved Ourfali 2016-07-05 14:52:37 UTC
Piotr, isn't it modified?

Comment 24 Piotr Kliczewski 2016-07-05 15:02:30 UTC
Yes, it is

Comment 25 Simone Tiraboschi 2016-07-19 09:09:35 UTC
*** Bug 1357615 has been marked as a duplicate of this bug. ***

Comment 26 Petr Matyáš 2016-07-25 07:58:35 UTC
Verified on 4.0.2-1

Comment 27 Piotr Kliczewski 2016-08-01 11:57:01 UTC
I forgot to remove needinfo