Bug 1337711 - After rebooting HE VM, the newly added vnic comes unplugged
Summary: After rebooting HE VM, the newly added vnic comes unplugged
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 3.6.5
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ovirt-4.0.1
Assignee: Roy Golan
QA Contact: Nikolai Sednev
URL:
Whiteboard:
Depends On: 1327267 1342988 1343980
Blocks: 1345782
 
Reported: 2016-05-19 21:50 UTC by Marina Kalinin
Modified: 2019-12-16 05:48 UTC
CC List: 17 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1345782 (view as bug list)
Environment:
Last Closed: 2016-08-23 20:39:57 UTC
oVirt Team: SLA
Target Upstream Version:
Embargoed:


Attachments
HE VM from OVF (9.47 KB, application/xml), 2016-05-28 02:28 UTC, Marina Kalinin
sosreport from host alma03 (13.16 MB, application/x-xz), 2016-06-13 15:17 UTC, Nikolai Sednev
sosreport from host alma03 (8.88 MB, application/x-xz), 2016-06-19 18:49 UTC, Nikolai Sednev


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1327267 0 high CLOSED HE Vm's ovf isn't updated according to 'OvfUpdateIntervalInMinutes' value 2021-02-22 00:41:40 UTC
Red Hat Bugzilla 1337641 0 high CLOSED Special procedure needed to upgrade HE environment with RHEL6 hosts to RHEL7 2022-05-16 06:48:27 UTC
Red Hat Knowledge Base (Solution) 2343941 0 None None None 2016-05-28 23:30:48 UTC
Red Hat Product Errata RHEA-2016:1743 0 normal SHIPPED_LIVE Red Hat Virtualization Manager 4.0 GA Enhancement (ovirt-engine) 2016-09-02 21:54:01 UTC

Internal Links: 1327267 1337641

Description Marina Kalinin 2016-05-19 21:50:01 UTC
Description of problem:
After rebooting HE VM, the newly added vnic comes unplugged.

Version-Release number of selected component (if applicable):
- 3.6.5
- Hosted Engine VM built from the RHEV-M appliance.

How reproducible:
100%

Steps to Reproduce:
1. Get RHEVM 3.6.5 in a Hosted Engine environment.
2. Edit the HE VM (a.k.a. the manager) through the Admin Portal and add an additional vNIC to it.
3. The vNIC is added successfully; the guest OS sees it and everything works.
4. Reboot the guest. (I did a cold reboot: turned the machine off and then started it.)

Actual results:
The second vnic appears unplugged.

Expected results:
The second vnic should be plugged.


Additional info:
I can provide logs, but the issue is very easily reproducible, so I'll skip that step for now. Let me know if you need them.

Comment 2 Roy Golan 2016-05-20 07:51:14 UTC
Marina, you probably hit Bug 1311151, which is verified in 3.6.6. Let me know if anything you see is different; otherwise I'll close this as a duplicate.

Comment 3 Frank DeLorey 2016-05-20 18:58:57 UTC
Hello Roy. Marina is out today, so I took a look at this case. It does seem like the same issue; however, please do not mark it as a duplicate until I confirm with Marina. Also, is there a workaround available, or does this customer need to wait for 3.6.6?

Regards,

Frank

Comment 4 Marina Kalinin 2016-05-25 17:43:43 UTC
Hi Roy,

It is a different bug.
Here, the NIC is not present in the guest at all.

In the guest:
# ip a s
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN 
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 00:16:3e:59:a6:03 brd ff:ff:ff:ff:ff:ff
    inet 10.10.176.36/21 brd 10.10.183.255 scope global eth0
    inet6 2620:52:0:ab0:216:3eff:fe59:a603/64 scope global dynamic 
       valid_lft 2591965sec preferred_lft 604765sec
    inet6 fe80::216:3eff:fe59:a603/64 scope link 
       valid_lft forever preferred_lft forever

No mention of eth1 at all.
Same here:

# ll /etc/sysconfig/network-scripts/ifcfg-*
-rw-r--r--. 1 root root 126 Mar  1 07:27 /etc/sysconfig/network-scripts/ifcfg-eth0
-rw-r--r--. 1 root root 254 Apr 12 05:14 /etc/sysconfig/network-scripts/ifcfg-lo

# ethtool eth1
Settings for eth1:
Cannot get device settings: No such device
Cannot get wake-on-lan settings: No such device
Cannot get message level: No such device
Cannot get link status: No such device
No data available

# lspci 
00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] (rev 02)
00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
00:01.1 IDE interface: Intel Corporation 82371SB PIIX3 IDE [Natoma/Triton II]
00:01.2 USB controller: Intel Corporation 82371SB PIIX3 USB [Natoma/Triton II] (rev 01)
00:01.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 03)
00:02.0 VGA compatible controller: Cirrus Logic GD 5446
00:03.0 Ethernet controller: Red Hat, Inc Virtio network device
00:04.0 SCSI storage controller: Red Hat, Inc Virtio SCSI
00:05.0 Communication controller: Red Hat, Inc Virtio console
00:06.0 SCSI storage controller: Red Hat, Inc Virtio block device


After editing the NIC in the Admin Portal to plug it in, it appears in the guest:
# lspci 
00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] (rev 02)
00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
00:01.1 IDE interface: Intel Corporation 82371SB PIIX3 IDE [Natoma/Triton II]
00:01.2 USB controller: Intel Corporation 82371SB PIIX3 USB [Natoma/Triton II] (rev 01)
00:01.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 03)
00:02.0 VGA compatible controller: Cirrus Logic GD 5446
00:03.0 Ethernet controller: Red Hat, Inc Virtio network device
00:04.0 SCSI storage controller: Red Hat, Inc Virtio SCSI
00:05.0 Communication controller: Red Hat, Inc Virtio console
00:06.0 SCSI storage controller: Red Hat, Inc Virtio block device
00:07.0 Ethernet controller: Red Hat, Inc Virtio network device   <=== additional network device showed up

It is present in the ip output:
# ip a s eth1
3: eth1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
    link/ether 00:1a:4a:16:01:54 brd ff:ff:ff:ff:ff:ff

..

Comment 5 Marina Kalinin 2016-05-25 19:00:29 UTC
I see that vm.conf does not change after I modify the VM in the Admin Portal to plug the NIC in. I am not sure whether this is by design of hosted-engine or not.
I would expect vm.conf to contain all the information about this VM. This is especially important in scenarios where the additional network serves as the display network.

I am going to modify the vm.conf on the shared storage and see if it has any effect. Hopefully that will turn out to be the right way to work around it, rather than modifying the database. (If the HE VM first starts and then reads the DB, I would assume vm.conf should contain all that information, rather than the DB.)

Thinking about this more, maybe it is something related to the OVF updater? That it didn't have time to update the OVF store with the new config? Probably not; the VM has been up long enough (an hour or more) for it to be updated.

Comment 6 Marina Kalinin 2016-05-25 21:16:27 UTC
I managed to modify vm.conf using this solution:
https://access.redhat.com/solutions/2209751

At step 4, where vm.conf is modified, here is the line to add:
devices={nicModel:pv,macAddr:<MAC_from_Admin_Portal>,linkActive:true,network:<logical_net_name>,deviceId:<device_id from vm_devices>,address:None,device:bridge,type:interface}

The only question is why the device type is always bridge.
I guess it is OK; it probably dates from the times when every logical network had to be a bridge, and we simply don't have a non-bridge network definition. I don't know.
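
(For illustration only, a minimal Python sketch that composes such a devices= line from the values read in the Admin Portal and the vm_devices table. The template simply mirrors the line above; the helper name and placeholders are assumptions, not an official format definition.)

def nic_device_line(mac, network, device_id):
    # Mirrors the workaround line from this comment, including address:None.
    return ("devices={nicModel:pv,macAddr:%s,linkActive:true,network:%s,"
            "deviceId:%s,address:None,device:bridge,type:interface}"
            % (mac, network, device_id))

# Example, using the same placeholders as above:
print(nic_device_line("<MAC_from_Admin_Portal>", "<logical_net_name>", "<device_id_from_vm_devices>"))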

Comment 8 Marina Kalinin 2016-05-25 21:22:24 UTC
I don't understand this: the HE VM boots correctly with the NIC up, yet /var/run/ovirt-hosted-engine-ha/vm.conf still does not contain the updated information. What does that mean? That from now on it will always be required to start the VM with that modified vm.conf?

What is the way to update vm.conf on the shared storage? I could not find it anywhere on the hosted engine storage domain.

Comment 9 Simone Tiraboschi 2016-05-26 07:24:46 UTC
Editing a custom vm.conf is just a recovery action for when something went wrong with a previous edit attempt and the system is not able to start the engine VM.

Local modifications to it are not persisted in any way: vm.conf gets periodically regenerated from the OVF description written by the engine into the OVF_STORE. So the only way to have modifications to the engine VM correctly persisted is to modify the VM on the engine side.
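
(Illustrative only: a minimal Python sketch of that refresh cycle, showing why local edits are lost. The helper names and the interval are assumptions, not the real ovirt-ha-agent code.)

import time

VM_CONF = "/var/run/ovirt-hosted-engine-ha/vm.conf"
REFRESH_SECONDS = 600  # assumed interval, for illustration only

def fetch_engine_vm_ovf():
    # Placeholder: read the engine VM's OVF out of the OVF_STORE volume.
    raise NotImplementedError

def ovf_to_vm_conf(ovf_xml):
    # Placeholder: convert the OVF description into vm.conf key=value lines.
    raise NotImplementedError

def refresh_loop():
    while True:
        vm_conf_text = ovf_to_vm_conf(fetch_engine_vm_ovf())
        with open(VM_CONF, "w") as f:
            f.write(vm_conf_text)  # any local edit to vm.conf is overwritten here
        time.sleep(REFRESH_SECONDS)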

Comment 11 Marina Kalinin 2016-05-28 02:28:46 UTC
Created attachment 1162586 [details]
HE VM from OVF

Comment 12 Marina Kalinin 2016-05-28 02:34:05 UTC
It seems the OVF file is not being updated with the other NICs.
See the attachment from comment 11.
I added 3 additional NICs, with and without AllowEditingHostedEngine,
and waited longer than OvfUpdateIntervalInMinutes to make sure the OVF file got updated. However, in the OVF store there is only 1 interface defined.

Logs are available, but this is very easily reproducible, so let me know if you need them.

Interestingly enough, there is no way to remove all those NICs from the UI now. :-) I should probably open a bug on that too.

Comment 13 Roy Golan 2016-05-29 07:48:51 UTC
Marina, this is a clone of Bug 1327267.

Please do open a bug if you can't remove NICs (and state there that we should make sure we don't leave the engine with no NICs at all).

Comment 14 Marina Kalinin 2016-05-29 21:56:56 UTC
(In reply to Roy Golan from comment #13)
> Marina, this is a clone of Bug 1327267.
> 
> Please do open a bug if you can't remove NICs (and state there that we
> should make sure we don't leave the engine with no NICs at all).

Roy, maybe. I agree with keeping this bug as TestOnly.

As for the NIC removal: once the NIC is unplugged, I can remove it. So we are good here.

Comment 16 Nikolai Sednev 2016-06-02 13:08:41 UTC
I've received this error on host:
hosted-engine --vm-status                                
Unable to read vm.conf, please check ovirt-ha-agent logs     

MainThread::ERROR::2016-06-02 16:02:14,018::agent::205::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Error: 'Connection to storage server failed' - trying to restart agent
MainThread::WARNING::2016-06-02 16:02:19,023::agent::208::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Restarting agent, attempt '3'
MainThread::INFO::2016-06-02 16:02:19,045::hosted_engine::244::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_get_hostname) Found certificate common name: alma04.qa.lab.tlv.redhat.com
MainThread::INFO::2016-06-02 16:02:19,046::hosted_engine::613::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_vdsm) Initializing VDSM
MainThread::INFO::2016-06-02 16:02:19,096::hosted_engine::658::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_storage_images) Connecting the storage
MainThread::INFO::2016-06-02 16:02:19,096::storage_server::218::ovirt_hosted_engine_ha.lib.storage_server.StorageServer::(connect_storage_server) Connecting storage server
MainThread::INFO::2016-06-02 16:02:19,334::storage_server::222::ovirt_hosted_engine_ha.lib.storage_server.StorageServer::(connect_storage_server) Connecting storage server
MainThread::ERROR::2016-06-02 16:02:19,373::agent::205::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Error: 'Connection to storage server failed' - trying to restart agent


systemctl status ovirt-ha-agent -l
● ovirt-ha-agent.service - oVirt Hosted Engine High Availability Monitoring Agent
   Loaded: loaded (/usr/lib/systemd/system/ovirt-ha-agent.service; enabled; vendor preset: disabled)
   Active: active (running) since Thu 2016-06-02 16:03:45 IDT; 39s ago
 Main PID: 19178 (ovirt-ha-agent)
   CGroup: /system.slice/ovirt-ha-agent.service
           └─19178 /usr/bin/python /usr/share/ovirt-hosted-engine-ha/ovirt-ha-agent --no-daemon

Jun 02 16:04:18 alma04.qa.lab.tlv.redhat.com ovirt-ha-agent[19178]: ovirt-ha-agent ovirt_hosted_engine_ha.agent.agent.Agent ERROR Error: 'Connection to storage server failed' - trying to restart agent
Jun 02 16:04:18 alma04.qa.lab.tlv.redhat.com ovirt-ha-agent[19178]: ERROR:ovirt_hosted_engine_ha.agent.agent.Agent:Error: 'Connection to storage server failed' - trying to restart agent
Jun 02 16:04:23 alma04.qa.lab.tlv.redhat.com ovirt-ha-agent[19178]: WARNING:ovirt_hosted_engine_ha.agent.agent.Agent:Restarting agent, attempt '6'
Jun 02 16:04:23 alma04.qa.lab.tlv.redhat.com ovirt-ha-agent[19178]: INFO:ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine:Found certificate common name: alma04.qa.lab.tlv.redhat.com
Jun 02 16:04:23 alma04.qa.lab.tlv.redhat.com ovirt-ha-agent[19178]: INFO:ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine:Initializing VDSM
Jun 02 16:04:23 alma04.qa.lab.tlv.redhat.com ovirt-ha-agent[19178]: INFO:ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine:Connecting the storage
Jun 02 16:04:23 alma04.qa.lab.tlv.redhat.com ovirt-ha-agent[19178]: INFO:ovirt_hosted_engine_ha.lib.storage_server.StorageServer:Connecting storage server
Jun 02 16:04:23 alma04.qa.lab.tlv.redhat.com ovirt-ha-agent[19178]: INFO:ovirt_hosted_engine_ha.lib.storage_server.StorageServer:Connecting storage server
Jun 02 16:04:24 alma04.qa.lab.tlv.redhat.com ovirt-ha-agent[19178]: ovirt-ha-agent ovirt_hosted_engine_ha.agent.agent.Agent ERROR Error: 'Connection to storage server failed' - trying to restart agent
Jun 02 16:04:24 alma04.qa.lab.tlv.redhat.com ovirt-ha-agent[19178]: ERROR:ovirt_hosted_engine_ha.agent.agent.Agent:Error: 'Connection to storage server failed' - trying to restart agent


systemctl status ovirt-ha-broker -l
● ovirt-ha-broker.service - oVirt Hosted Engine High Availability Communications Broker
   Loaded: loaded (/usr/lib/systemd/system/ovirt-ha-broker.service; enabled; vendor preset: disabled)
   Active: active (running) since Thu 2016-06-02 15:29:04 IDT; 35min ago
 Main PID: 855 (ovirt-ha-broker)
   CGroup: /system.slice/ovirt-ha-broker.service
           └─855 /usr/bin/python /usr/share/ovirt-hosted-engine-ha/ovirt-ha-broker --no-daemon

Jun 02 15:29:09 alma04.qa.lab.tlv.redhat.com ovirt-ha-broker[855]: INFO:ovirt_hosted_engine_ha.broker.monitor.Monitor:Loaded submonitor ping
Jun 02 15:29:09 alma04.qa.lab.tlv.redhat.com ovirt-ha-broker[855]: INFO:ovirt_hosted_engine_ha.broker.monitor.Monitor:Loaded submonitor mem-load
Jun 02 15:29:09 alma04.qa.lab.tlv.redhat.com ovirt-ha-broker[855]: INFO:ovirt_hosted_engine_ha.broker.monitor.Monitor:Loaded submonitor engine-health
Jun 02 15:29:09 alma04.qa.lab.tlv.redhat.com ovirt-ha-broker[855]: INFO:ovirt_hosted_engine_ha.broker.monitor.Monitor:Loaded submonitor engine-health
Jun 02 15:29:09 alma04.qa.lab.tlv.redhat.com ovirt-ha-broker[855]: INFO:ovirt_hosted_engine_ha.broker.monitor.Monitor:Loaded submonitor mem-free
Jun 02 15:29:09 alma04.qa.lab.tlv.redhat.com ovirt-ha-broker[855]: INFO:ovirt_hosted_engine_ha.broker.monitor.Monitor:Loaded submonitor cpu-load
Jun 02 15:29:09 alma04.qa.lab.tlv.redhat.com ovirt-ha-broker[855]: INFO:ovirt_hosted_engine_ha.broker.monitor.Monitor:Loaded submonitor cpu-load-no-engine
Jun 02 15:29:09 alma04.qa.lab.tlv.redhat.com ovirt-ha-broker[855]: INFO:ovirt_hosted_engine_ha.broker.monitor.Monitor:Finished loading submonitors
Jun 02 15:29:09 alma04.qa.lab.tlv.redhat.com ovirt-ha-broker[855]: INFO:ovirt_hosted_engine_ha.broker.listener.Listener:Initializing SocketServer
Jun 02 15:29:09 alma04.qa.lab.tlv.redhat.com ovirt-ha-broker[855]: INFO:ovirt_hosted_engine_ha.broker.listener.Listener:SocketServer ready


Thread-2552::DEBUG::2016-06-02 16:05:23,671::mount::229::Storage.Misc.excCmd::(_runcmd) /usr/bin/taskset --cpu-list 0-3 /usr/bin/sudo -n /usr/bin/
mount -t nfs -o soft,nosharecache,timeo=600,retrans=6,nfsvers=3,None 10.35.64.11:/vol/RHEV/Virt/nsednev_3_6_HE_2 /rhev/data-center/mnt/10.35.64.11
:_vol_RHEV_Virt_nsednev__3__6__HE__2 (cwd None)
Thread-2552::ERROR::2016-06-02 16:05:23,694::hsm::2473::Storage.HSM::(connectStorageServer) Could not connect to storageServer
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/hsm.py", line 2470, in connectStorageServer
    conObj.connect()
  File "/usr/share/vdsm/storage/storageServer.py", line 455, in connect
    return self._mountCon.connect()
  File "/usr/share/vdsm/storage/storageServer.py", line 237, in connect
    six.reraise(t, v, tb)
  File "/usr/share/vdsm/storage/storageServer.py", line 229, in connect
    self._mount.mount(self.options, self._vfsType, cgroup=self.CGROUP)
  File "/usr/share/vdsm/storage/mount.py", line 225, in mount
    return self._runcmd(cmd, timeout)
  File "/usr/share/vdsm/storage/mount.py", line 241, in _runcmd
    raise MountError(rc, ";".join((out, err)))
MountError: (32, ';mount.nfs: an incorrect mount option was specified\n')


Thread-2555::ERROR::2016-06-02 16:05:28,995::sdc::139::Storage.StorageDomainCache::(_findDomain) looking for unfetched domain 88615f12-cb6e-4e8f-acbd-fe149c289726
Thread-2555::ERROR::2016-06-02 16:05:28,995::sdc::156::Storage.StorageDomainCache::(_findUnfetchedDomain) looking for domain 88615f12-cb6e-4e8f-acbd-fe149c289726
Thread-2555::DEBUG::2016-06-02 16:05:28,996::lvm::370::Storage.OperationMutex::(_reloadvgs) Operation 'lvm reload operation' got the operation mutex
Thread-2555::DEBUG::2016-06-02 16:05:28,996::lvm::290::Storage.Misc.excCmd::(cmd) /usr/bin/taskset --cpu-list 0-3 /usr/bin/sudo -n /usr/sbin/lvm vgs --config ' devices { preferred_names = ["^/dev/mapper/"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter = [ '\''r|.*|'\'' ] }  global {  locking_type=1  prioritise_write_locks=1  wait_for_locks=1  use_lvmetad=0 }  backup {  retain_min = 50  retain_days = 0 } ' --noheadings --units b --nosuffix --separator '|' --ignoreskippedcluster -o uuid,name,attr,size,free,extent_size,extent_count,free_count,tags,vg_mda_size,vg_mda_free,lv_count,pv_count,pv_name 88615f12-cb6e-4e8f-acbd-fe149c289726 (cwd None)
Thread-2555::DEBUG::2016-06-02



Sosreport from host is attached.

How I got to this state:
1) Deployed an el6.8 HE 3.6.7 engine on an el7.2 host.
2) Added a data storage domain to get the HE-SD auto-imported.
3) Edited the HE-VM via the Web UI and added an additional vNIC to it, which was also visible from the CLI inside the HE-VM; all this without removing the already existing management vNIC.
4) Rebooted the host.
5) After the reboot, I logged in to the host and ran the "hosted-engine --vm-status" command.

Components on host:
ovirt-vmconsole-1.0.2-2.el7ev.noarch
ovirt-hosted-engine-setup-1.3.7.0-1.el7ev.noarch
sanlock-3.2.4-2.el7_2.x86_64
mom-0.5.3-1.el7ev.noarch
ovirt-host-deploy-1.4.1-1.el7ev.noarch
ovirt-setup-lib-1.0.1-1.el7ev.noarch
vdsm-4.17.29-0.el7ev.noarch
rhevm-sdk-python-3.6.5.1-1.el7ev.noarch
libvirt-client-1.2.17-13.el7_2.5.x86_64
ovirt-hosted-engine-ha-1.3.5.6-1.el7ev.noarch
ovirt-vmconsole-host-1.0.2-2.el7ev.noarch
rhevm-appliance-20160515.0-1.el7ev.noarch
Red Hat Enterprise Linux Server release 7.2 (Maipo)
Linux version 3.10.0-327.22.1.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-4) (GCC) ) #1 SMP Mon May 16 13:31:48 EDT 2016
Linux 3.10.0-327.22.1.el7.x86_64 #1 SMP Mon May 16 13:31:48 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux

Comment 17 Yaniv Lavi 2016-06-02 14:16:56 UTC
How is this failure related to the bug?

Comment 18 Nikolai Sednev 2016-06-02 14:41:26 UTC
Providing an external link to the logs here, as the file size is much bigger than what Bugzilla supports.
https://drive.google.com/a/redhat.com/file/d/0B85BEaDBcF88TEZqNHFDMDBVMTg/view?usp=sharing

(In reply to Yaniv Dary from comment #17)
> How is this failure related to the bug?

It's related: I followed exactly the same scenario as described by Marina and got to a situation in which the engine can't even start.

Until this issue is resolved, this bug can't be moved to ON_QA.

Comment 19 Doron Fediuck 2016-06-06 12:54:42 UTC
See comment 13 and the depends field.
This BZ should be verified after Bug 1327267.

Comment 23 Nikolai Sednev 2016-06-08 12:25:16 UTC
Followed these steps:
1. Get RHEVM 3.6.5 in a Hosted Engine environment.
2. Edit the HE VM (a.k.a. the manager) through the Admin Portal and add an additional vNIC to it.
3. The vNIC is added successfully; the guest OS sees it and everything works.
4. Reboot the host.


As a result, when the host booted up, the agent ended up in a failing state (details below).

Before the host was restarted, the engine had recognized the newly added NIC:

eth0      Link encap:Ethernet  HWaddr 00:16:3E:7B:BB:BB  
          inet addr:10.35.97.250  Bcast:10.35.97.255  Mask:255.255.255.0
          inet6 addr: fe80::216:3eff:fe7b:bbbb/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:591028 errors:0 dropped:0 overruns:0 frame:0
          TX packets:79145 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:4227960859 (3.9 GiB)  TX bytes:9473511 (9.0 MiB)

eth1      Link encap:Ethernet  HWaddr 00:1A:4A:16:01:51  
          BROADCAST MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)

lo        Link encap:Local Loopback  
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:249837 errors:0 dropped:0 overruns:0 frame:0
          TX packets:249837 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:212690803 (202.8 MiB)  TX bytes:212690803 (202.8 MiB)

After the host restarted, the agent was in a failing state:
systemctl status ovirt-ha-agent -l
● ovirt-ha-agent.service - oVirt Hosted Engine High Availability Monitoring Agent
   Loaded: loaded (/usr/lib/systemd/system/ovirt-ha-agent.service; enabled; vendor preset: disabled)
   Active: active (running) since Wed 2016-06-08 15:20:52 IDT; 33s ago
 Main PID: 12152 (ovirt-ha-agent)
   CGroup: /system.slice/ovirt-ha-agent.service
           └─12152 /usr/bin/python /usr/share/ovirt-hosted-engine-ha/ovirt-ha-agent --no-daemon

Jun 08 15:21:20 alma03.qa.lab.tlv.redhat.com ovirt-ha-agent[12152]: ovirt-ha-agent ovirt_hosted_engine_ha.agent.agent.Agent ERROR Error: 'Connection to storage server failed' - trying to restart agent
Jun 08 15:21:20 alma03.qa.lab.tlv.redhat.com ovirt-ha-agent[12152]: ERROR:ovirt_hosted_engine_ha.agent.agent.Agent:Error: 'Connection to storage server failed' - trying to restart agent
Jun 08 15:21:25 alma03.qa.lab.tlv.redhat.com ovirt-ha-agent[12152]: WARNING:ovirt_hosted_engine_ha.agent.agent.Agent:Restarting agent, attempt '5'
Jun 08 15:21:25 alma03.qa.lab.tlv.redhat.com ovirt-ha-agent[12152]: INFO:ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine:Found certificate common name: alma03.qa.lab.tlv.redhat.com
Jun 08 15:21:25 alma03.qa.lab.tlv.redhat.com ovirt-ha-agent[12152]: INFO:ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine:Initializing VDSM
Jun 08 15:21:25 alma03.qa.lab.tlv.redhat.com ovirt-ha-agent[12152]: INFO:ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine:Connecting the storage
Jun 08 15:21:25 alma03.qa.lab.tlv.redhat.com ovirt-ha-agent[12152]: INFO:ovirt_hosted_engine_ha.lib.storage_server.StorageServer:Connecting storage server
Jun 08 15:21:25 alma03.qa.lab.tlv.redhat.com ovirt-ha-agent[12152]: INFO:ovirt_hosted_engine_ha.lib.storage_server.StorageServer:Connecting storage server
Jun 08 15:21:25 alma03.qa.lab.tlv.redhat.com ovirt-ha-agent[12152]: ovirt-ha-agent ovirt_hosted_engine_ha.agent.agent.Agent ERROR Error: 'Connection to storage server failed' - trying to restart agent
Jun 08 15:21:25 alma03.qa.lab.tlv.redhat.com ovirt-ha-agent[12152]: ERROR:ovirt_hosted_engine_ha.agent.agent.Agent:Error: 'Connection to storage server failed' - trying to restart agent

Engine failed to start at all.

The components on host:
qemu-kvm-rhev-2.3.0-31.el7_2.15.x86_64
ovirt-vmconsole-host-1.0.2-2.el7ev.noarch
ovirt-setup-lib-1.0.1-1.el7ev.noarch
vdsm-4.17.30-1.el7ev.noarch
libvirt-client-1.2.17-13.el7_2.5.x86_64
ovirt-hosted-engine-ha-1.3.5.7-1.el7ev.noarch
ovirt-vmconsole-1.0.2-2.el7ev.noarch
ovirt-hosted-engine-setup-1.3.7.1-1.el7ev.noarch
sanlock-3.2.4-2.el7_2.x86_64
mom-0.5.3-1.el7ev.noarch
ovirt-host-deploy-1.4.1-1.el7ev.noarch
Red Hat Enterprise Linux Server release 7.2 (Maipo)
Linux version 3.10.0-327.22.1.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-4) (GCC) ) #1 SMP Mon May 16 13:31:48 EDT 2016
Linux 3.10.0-327.22.1.el7.x86_64 #1 SMP Mon May 16 13:31:48 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux

Comment 26 Nikolai Sednev 2016-06-13 15:16:38 UTC
1) I've deployed a clean HE-VM on a single host over NFS.
2) Added a data storage domain, so that the auto-import of the HE-SD could succeed.
3) Waited until the HE-SD was successfully auto-imported.
4) Added an empty vNIC to the engine's VM via the Web UI.
5) Set hosted-engine into global maintenance.
6) Powered off the HE-VM via the hosted-engine CLI.
7) Ran hosted-engine --vm-start.
8) Set maintenance to none.
9) The engine failed to start, and for more than 30 minutes I have been seeing "INFO:ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine:Current state EngineUnexpectedlyDown (score: 0)".
 


Host:
qemu-kvm-rhev-2.3.0-31.el7_2.15.x86_64
mom-0.5.4-1.el7ev.noarch
sanlock-3.2.4-2.el7_2.x86_64
ovirt-vmconsole-host-1.0.2-2.el7ev.noarch
ovirt-host-deploy-1.4.1-1.el7ev.noarch
ovirt-hosted-engine-ha-1.3.5.7-1.el7ev.noarch
libvirt-client-1.2.17-13.el7_2.5.x86_64
ovirt-setup-lib-1.0.1-1.el7ev.noarch
ovirt-hosted-engine-setup-1.3.7.2-1.el7ev.noarch
ovirt-vmconsole-1.0.2-2.el7ev.noarch
vdsm-4.17.31-0.el7ev.noarch
Linux version 3.10.0-327.22.2.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-4) (GCC) ) #1 SMP Thu Jun 9 10:09:10 EDT 2016
Linux alma03.qa.lab.tlv.redhat.com 3.10.0-327.22.2.el7.x86_64 #1 SMP Thu Jun 9 10:09:10 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.2 (Maipo)

Engine:
rhevm-dwh-setup-3.6.6-1.el6ev.noarch
rhevm-webadmin-portal-3.6.7.3-0.1.el6.noarch
rhevm-spice-client-x64-cab-3.6-7.el6.noarch
rhevm-setup-plugins-3.6.5-1.el6ev.noarch
rhevm-setup-3.6.7.3-0.1.el6.noarch
rhevm-tools-backup-3.6.7.3-0.1.el6.noarch
rhevm-doc-3.6.7-1.el6eng.noarch
rhevm-branding-rhev-3.6.0-10.el6ev.noarch
rhevm-setup-base-3.6.7.3-0.1.el6.noarch
rhevm-backend-3.6.7.3-0.1.el6.noarch
rhevm-dbscripts-3.6.7.3-0.1.el6.noarch
rhevm-dependencies-3.6.0-1.el6ev.noarch
rhevm-spice-client-x86-cab-3.6-7.el6.noarch
rhevm-sdk-python-3.6.7.0-1.el6ev.noarch
rhevm-guest-agent-common-1.0.11-6.el6ev.noarch
rhevm-image-uploader-3.6.0-1.el6ev.noarch
rhevm-setup-plugin-ovirt-engine-3.6.7.3-0.1.el6.noarch
rhevm-vmconsole-proxy-helper-3.6.7.3-0.1.el6.noarch
rhevm-reports-setup-3.6.5.1-1.el6ev.noarch
rhevm-restapi-3.6.7.3-0.1.el6.noarch
rhevm-3.6.7.3-0.1.el6.noarch
rhevm-log-collector-3.6.1-1.el6ev.noarch
rhevm-spice-client-x86-msi-3.6-7.el6.noarch
rhevm-setup-plugin-vmconsole-proxy-helper-3.6.7.3-0.1.el6.noarch
rhevm-extensions-api-impl-3.6.7.3-0.1.el6.noarch
rhevm-websocket-proxy-3.6.7.3-0.1.el6.noarch
rhevm-reports-3.6.5.1-1.el6ev.noarch
rhevm-tools-3.6.7.3-0.1.el6.noarch
rhevm-setup-plugin-ovirt-engine-common-3.6.7.3-0.1.el6.noarch
rhevm-dwh-3.6.6-1.el6ev.noarch
rhevm-userportal-3.6.7.3-0.1.el6.noarch
rhevm-spice-client-x64-msi-3.6-7.el6.noarch
rhevm-iso-uploader-3.6.0-1.el6ev.noarch
rhevm-lib-3.6.7.3-0.1.el6.noarch
rhevm-cli-3.6.2.1-1.el6ev.noarch
rhevm-setup-plugin-websocket-proxy-3.6.7.3-0.1.el6.noarch
Linux version 2.6.32-642.1.1.el6.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.4.7 20120313 (Red Hat 4.4.7-17) (GCC) ) #1 SMP Fri May 6 14:54:05 EDT 2016
Linux nsednev-he-2.qa.lab.tlv.redhat.com 2.6.32-642.1.1.el6.x86_64 #1 SMP Fri May 6 14:54:05 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 6.8 (Santiago)

Comment 27 Nikolai Sednev 2016-06-13 15:17:49 UTC
Created attachment 1167540 [details]
sosreport from host alma03

Comment 28 Simone Tiraboschi 2016-06-13 15:33:45 UTC
Thread-18503::INFO::2016-06-13 18:03:34,675::vm::1932::virt.vm::(_run) vmId=`3721aa2b-67d8-47ad-bfac-460a2a089710`::<?xml version="1.0" encoding="utf-8"?><domain type="kvm" xmlns:ovirt="http://ovirt.org/vm/tune/1.0">
	<name>HostedEngine</name>
	<uuid>3721aa2b-67d8-47ad-bfac-460a2a089710</uuid>
	<memory>4194304</memory>
	<currentMemory>4194304</currentMemory>
	<vcpu current="4">4</vcpu>
	<devices>
		<channel type="unix">
			<target name="com.redhat.rhevm.vdsm" type="virtio"/>
			<source mode="bind" path="/var/lib/libvirt/qemu/channels/3721aa2b-67d8-47ad-bfac-460a2a089710.com.redhat.rhevm.vdsm"/>
		</channel>
		<channel type="unix">
			<target name="org.qemu.guest_agent.0" type="virtio"/>
			<source mode="bind" path="/var/lib/libvirt/qemu/channels/3721aa2b-67d8-47ad-bfac-460a2a089710.org.qemu.guest_agent.0"/>
		</channel>
		<input bus="ps2" type="mouse"/>
		<memballoon model="none"/>
		<controller index="0" ports="16" type="virtio-serial">
			<address bus="0x00" domain="0x0000" function="0x0" slot="0x05" type="pci"/>
		</controller>
		<controller type="ide">
			<address bus="0x00" domain="0x0000" function="0x1" slot="0x01" type="pci"/>
		</controller>
		<controller model="virtio-scsi" type="scsi">
			<address bus="0x00" domain="0x0000" function="0x0" slot="0x04" type="pci"/>
		</controller>
		<controller type="usb">
			<address bus="0x00" domain="0x0000" function="0x2" slot="0x01" type="pci"/>
		</controller>
		<video>
			<address bus="0x00" domain="0x0000" function="0x0" slot="0x02" type="pci"/>
			<model heads="1" type="cirrus" vram="32768"/>
		</video>
		<graphics autoport="yes" defaultMode="secure" listen="0" passwd="*****" passwdValidTo="1970-01-01T00:00:01" port="-1" type="vnc"/>
		<interface type="bridge">
			<address bus="0x00" domain="0x0000" function="0x0" slot="0x03" type="pci"/>
			<mac address="00:16:3E:7B:BB:BB"/>
			<model type="virtio"/>
			<source bridge="ovirtmgmt"/>
			<link state="up"/>
		</interface>
		<interface address="None" type="bridge">
			<mac address="00:1a:4a:16:01:51"/>
			<model type="virtio"/>
			<source bridge="None"/>
			<link state="up"/>
		</interface>
		<disk device="disk" snapshot="no" type="file">
			<address bus="0x00" domain="0x0000" function="0x0" slot="0x06" type="pci"/>
			<source file="/var/run/vdsm/storage/a0903254-f31e-4798-be1b-e5f6150bcdec/501bc9ac-0463-4c8b-a23b-efd6cd1c523b/7f04725e-1926-48b6-bb0f-d14f62555b6b"/>
			<target bus="virtio" dev="vda"/>
			<serial>501bc9ac-0463-4c8b-a23b-efd6cd1c523b</serial>
			<boot order="1"/>
			<driver cache="none" error_policy="stop" io="threads" name="qemu" type="raw"/>
		</disk>
		<disk device="cdrom" snapshot="no" type="file">
			<address bus="1" controller="0" target="0" type="drive" unit="0"/>
			<source file="" startupPolicy="optional"/>
			<target bus="ide" dev="hdc"/>
			<readonly/>
		</disk>
		<lease>
			<key>7f04725e-1926-48b6-bb0f-d14f62555b6b</key>
			<lockspace>a0903254-f31e-4798-be1b-e5f6150bcdec</lockspace>
			<target offset="0" path="/rhev/data-center/mnt/10.35.64.11:_vol_RHEV_Virt_nsednev__3__6__HE__2/a0903254-f31e-4798-be1b-e5f6150bcdec/images/501bc9ac-0463-4c8b-a23b-efd6cd1c523b/7f04725e-1926-48b6-bb0f-d14f62555b6b.lease"/>
		</lease>
	<channel type="unix"><target name="org.ovirt.hosted-engine-setup.0" type="virtio"/><source mode="bind" path="/var/lib/libvirt/qemu/channels/3721aa2b-67d8-47ad-bfac-460a2a089710.org.ovirt.hosted-engine-setup.0"/></channel></devices>
	<metadata>
		<ovirt:qos/>
	</metadata>
	<os>
		<type arch="x86_64" machine="rhel6.5.0">hvm</type>
		<smbios mode="sysinfo"/>
	</os>
	<sysinfo type="smbios">
		<system>
			<entry name="manufacturer">Red Hat</entry>
			<entry name="product">RHEV Hypervisor</entry>
			<entry name="version">7.2-13.0.el7ev</entry>
			<entry name="serial">4C4C4544-0059-4410-8053-B7C04F573032</entry>
			<entry name="uuid">3721aa2b-67d8-47ad-bfac-460a2a089710</entry>
		</system>
	</sysinfo>
	<clock adjustment="0" offset="variable">
		<timer name="rtc" tickpolicy="catchup"/>
		<timer name="pit" tickpolicy="delay"/>
		<timer name="hpet" present="no"/>
	</clock>
	<features>
		<acpi/>
	</features>
	<cpu match="exact">
		<model>qemu64</model>
		<feature name="svm" policy="disable"/>
	</cpu>
<on_poweroff>destroy</on_poweroff><on_reboot>destroy</on_reboot><on_crash>destroy</on_crash></domain>
mailbox.SPMMonitor::DEBUG::2016-06-13 18:03:34,679::storage_mailbox::735::Storage.Misc.excCmd::(_checkForMail) SUCCESS: <err> = '1+0 records in\n1+0 records out\n1024000 bytes (1.0 MB) copied, 0.0450888 s, 22.7 MB/s\n'; <rc> = 0
Thread-18503::ERROR::2016-06-13 18:03:34,821::vm::759::virt.vm::(_startUnderlyingVm) vmId=`3721aa2b-67d8-47ad-bfac-460a2a089710`::The vm start process failed
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/vm.py", line 703, in _startUnderlyingVm
    self._run()
  File "/usr/share/vdsm/virt/vm.py", line 1941, in _run
    self._connection.createXML(domxml, flags),
  File "/usr/lib/python2.7/site-packages/vdsm/libvirtconnection.py", line 124, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 1313, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 3611, in createXML
    if ret is None:raise libvirtError('virDomainCreateXML() failed', conn=self)
libvirtError: Cannot get interface MTU on 'None': No such device
Thread-18503::INFO::2016-06-13 18:03:34,823::vm::1330::virt.vm::(setDownStatus) vmId=`3721aa2b-67d8-47ad-bfac-460a2a089710`::Changed state to Down: Cannot get interface MTU on 'None': No such device (code=1)

Comment 29 Simone Tiraboschi 2016-06-13 15:41:01 UTC
[root@alma03 ~]# cat /var/run/ovirt-hosted-engine-ha/vm.conf 
emulatedMachine=rhel6.5.0
vmId=3721aa2b-67d8-47ad-bfac-460a2a089710
smp=4
memSize=4096
spiceSecureChannels=smain,sdisplay,sinputs,scursor,splayback,srecord,ssmartcard,susbredir
vmName=HostedEngine
display=vnc
devices={index:0,iface:virtio,format:raw,bootOrder:1,address:{slot:0x06,bus:0x00,domain:0x0000,type:pci,function:0x0},volumeID:7f04725e-1926-48b6-bb0f-d14f62555b6b,imageID:501bc9ac-0463-4c8b-a23b-efd6cd1c523b,readonly:false,domainID:a0903254-f31e-4798-be1b-e5f6150bcdec,deviceId:501bc9ac-0463-4c8b-a23b-efd6cd1c523b,poolID:00000000-0000-0000-0000-000000000000,device:disk,shared:exclusive,propagateErrors:off,type:disk}
devices={nicModel:pv,macAddr:00:16:3E:7B:BB:BB,linkActive:true,network:ovirtmgmt,deviceId:be052cf9-8747-4982-805e-b3bcaa55b8d5,address:{slot:0x03,bus:0x00,domain:0x0000,type:pci,function:0x0},device:bridge,type:interface}
devices={nicModel:pv,macAddr:00:1a:4a:16:01:51,linkActive:true,network:None,deviceId:a65ffa7a-d182-4d73-967f-93eae6a755f8,address:None,device:bridge,type:interface}
devices={index:2,iface:ide,shared:false,readonly:true,deviceId:8c3179ac-b322-4f5c-9449-c52e3665e0ae,address:{controller:0,target:0,unit:0,bus:1,type:drive},device:cdrom,path:,type:disk}
devices={device:virtio-serial,type:controller,deviceId:aafc77c0-4a8d-4ced-8467-f92cd1868d90,address:{slot:0x05,bus:0x00,domain:0x0000,type:pci,function:0x0}}
devices={device:cirrus,alias:video0,type:video,deviceId:d691d9a5-36fb-490d-9c21-53eca8b02fce,address:{slot:0x02,bus:0x00,domain:0x0000,type:pci,function:0x0}}
devices={device:ide,type:controller,deviceId:25156b8d-9962-4b45-967f-8e9b26931425,address:{slot:0x01,bus:0x00,domain:0x0000,type:pci,function:0x1}}
devices={device:scsi,model:virtio-scsi,type:controller,deviceId:67589c37-bf92-4f1e-865d-b1b46d6c770f,address:{slot:0x04,bus:0x00,domain:0x0000,type:pci,function:0x0}}
devices={device:usb,type:controller,deviceId:72a87009-523f-40a9-b861-c61dc3a57c95,address:{slot:0x01,bus:0x00,domain:0x0000,type:pci,function:0x2}}

Comment 30 Nikolai Sednev 2016-06-13 15:43:26 UTC
This is what I saw on the engine right after the new, empty interface was created:
[root@nsednev-he-2 ~]# ifconfig -a
eth0      Link encap:Ethernet  HWaddr 00:16:3E:7B:BB:BB  
          inet addr:10.35.97.250  Bcast:10.35.97.255  Mask:255.255.255.0
          inet6 addr: fe80::216:3eff:fe7b:bbbb/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:1287636 errors:0 dropped:0 overruns:0 frame:0
          TX packets:8362 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:832955848 (794.3 MiB)  TX bytes:4428613 (4.2 MiB)

eth1      Link encap:Ethernet  HWaddr 00:1A:4A:16:01:51  
          BROADCAST MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)

lo        Link encap:Local Loopback  
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:122762 errors:0 dropped:0 overruns:0 frame:0
          TX packets:122762 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:48337150 (46.0 MiB)  TX bytes:48337150 (46.0 MiB)

So eth1 was the new interface, and it had the default Ethernet MTU value of 1500.

Reassigning the bug back.

Comment 31 Simone Tiraboschi 2016-06-13 15:44:09 UTC
In vm.conf we have two interfaces, but on the new one the whole address value is missing.

devices={nicModel:pv,macAddr:00:16:3E:7B:BB:BB,linkActive:true,network:ovirtmgmt,deviceId:be052cf9-8747-4982-805e-b3bcaa55b8d5,address:{slot:0x03,bus:0x00,domain:0x0000,type:pci,function:0x0},device:bridge,type:interface}
devices={nicModel:pv,macAddr:00:1a:4a:16:01:51,linkActive:true,network:None,deviceId:a65ffa7a-d182-4d73-967f-93eae6a755f8,address:None,device:bridge,type:interface}
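
(Purely as an illustration of the failure mode, not the actual ovirt-hosted-engine-ha code: a defensive OVF-to-vm.conf conversion could skip empty fields instead of serializing the literal string "None", which is what later surfaces as libvirt's "Cannot get interface MTU on 'None'" in comment 28. The helper name and structure below are assumptions.)

def nic_entry_from_ovf(mac, device_id, network=None, pci_address=None):
    # Hypothetical helper: build the key/value pairs for one vm.conf NIC entry.
    pairs = [("nicModel", "pv"), ("macAddr", mac), ("linkActive", "true")]
    if network:          # skip instead of emitting network:None
        pairs.append(("network", network))
    pairs.append(("deviceId", device_id))
    if pci_address:      # skip instead of emitting address:None
        pairs.append(("address", pci_address))
    pairs += [("device", "bridge"), ("type", "interface")]
    return "devices={%s}" % ",".join("%s:%s" % kv for kv in pairs)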

Comment 32 Simone Tiraboschi 2016-06-13 15:47:37 UTC
In the OVF_STORE we have:

<?xml version='1.0' encoding='UTF-8'?><ovf:Envelope xmlns:ovf="http://schemas.dmtf.org/ovf/envelope/1/" xmlns:rasd="http://schemas.dmtf.org/wbem/wscim/1/cim-schema/2/CIM_ResourceAllocationSettingData" xmlns:vssd="http://schemas.dmtf.org/wbem/wscim/1/cim-schema/2/CIM_VirtualSystemSettingData" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" ovf:version="3.6.0.0"><References><File ovf:href="501bc9ac-0463-4c8b-a23b-efd6cd1c523b/7f04725e-1926-48b6-bb0f-d14f62555b6b" ovf:id="7f04725e-1926-48b6-bb0f-d14f62555b6b" ovf:size="53687091200" ovf:description="Active VM" ovf:disk_storage_type="IMAGE" ovf:cinder_volume_type=""/><Nic ovf:id="be052cf9-8747-4982-805e-b3bcaa55b8d5"/><Nic ovf:id="a65ffa7a-d182-4d73-967f-93eae6a755f8"/></References><Section xsi:type="ovf:NetworkSection_Type"><Info>List of networks</Info><Network ovf:name="Network 1"/></Section><Section xsi:type="ovf:DiskSection_Type"><Info>List of Virtual Disks</Info><Disk ovf:diskId="7f04725e-1926-48b6-bb0f-d14f62555b6b" ovf:size="50" ovf:actual_size="6" ovf:vm_snapshot_id="d276a54e-be19-4902-89a3-68d78876241e" ovf:parentRef="" ovf:fileRef="501bc9ac-0463-4c8b-a23b-efd6cd1c523b/7f04725e-1926-48b6-bb0f-d14f62555b6b" ovf:format="http://www.vmware.com/specifications/vmdk.html#sparse" ovf:volume-format="RAW" ovf:volume-type="Preallocated" ovf:disk-interface="VirtIO" ovf:boot="false" ovf:disk-alias="virtio-disk0" ovf:wipe-after-delete="false"/></Section><Content ovf:id="out" xsi:type="ovf:VirtualSystem_Type"><CreationDate>2016/06/13 13:13:54</CreationDate><ExportDate>2016/06/13 14:40:17</ExportDate><DeleteProtected>false</DeleteProtected><SsoMethod>guest_agent</SsoMethod><IsSmartcardEnabled>false</IsSmartcardEnabled><TimeZone>Etc/GMT</TimeZone><default_boot_sequence>0</default_boot_sequence><Generation>7</Generation><VmType>0</VmType><ClusterCompatibilityVersion>3.6</ClusterCompatibilityVersion><MinAllocatedMem>4096</MinAllocatedMem><IsStateless>false</IsStateless><IsRunAndPause>false</IsRunAndPause><AutoStartup>false</AutoStartup><Priority>1</Priority><MigrationSupport>1</MigrationSupport><IsBootMenuEnabled>false</IsBootMenuEnabled><IsSpiceFileTransferEnabled>true</IsSpiceFileTransferEnabled><IsSpiceCopyPasteEnabled>true</IsSpiceCopyPasteEnabled><CustomEmulatedMachine>rhel6.5.0</CustomEmulatedMachine><CustomCpuName></CustomCpuName><PredefinedProperties></PredefinedProperties><UserDefinedProperties></UserDefinedProperties><Name>HostedEngine</Name><TemplateId>00000000-0000-0000-0000-000000000000</TemplateId><TemplateName>Blank</TemplateName><IsInitilized>false</IsInitilized><Origin>6</Origin><app_list>kernel-2.6.32-642.el6,kernel-2.6.32-642.1.1.el6,cloud-init-0.7.5-5.el6,rhevm-guest-agent-common-1.0.11-6.el6ev</app_list><DefaultDisplayType>0</DefaultDisplayType><TrustedService>false</TrustedService><UseHostCpu>false</UseHostCpu><UseLatestVersion>false</UseLatestVersion><Section ovf:id="3721aa2b-67d8-47ad-bfac-460a2a089710" ovf:required="false" xsi:type="ovf:OperatingSystemSection_Type"><Info>Guest Operating System</Info><Description>other_linux</Description></Section><Section xsi:type="ovf:VirtualHardwareSection_Type"><Info>4 CPU, 4096 Memeory</Info><System><vssd:VirtualSystemType>ENGINE 3.6.0.0</vssd:VirtualSystemType></System><Item><rasd:Caption>4 virtual cpu</rasd:Caption><rasd:Description>Number of virtual 
CPU</rasd:Description><rasd:InstanceId>1</rasd:InstanceId><rasd:ResourceType>3</rasd:ResourceType><rasd:num_of_sockets>4</rasd:num_of_sockets><rasd:cpu_per_socket>1</rasd:cpu_per_socket><rasd:threads_per_cpu>1</rasd:threads_per_cpu></Item><Item><rasd:Caption>4096 MB of memory</rasd:Caption><rasd:Description>Memory Size</rasd:Description><rasd:InstanceId>2</rasd:InstanceId><rasd:ResourceType>4</rasd:ResourceType><rasd:AllocationUnits>MegaBytes</rasd:AllocationUnits><rasd:VirtualQuantity>4096</rasd:VirtualQuantity></Item><Item><rasd:Caption>virtio-disk0</rasd:Caption><rasd:InstanceId>7f04725e-1926-48b6-bb0f-d14f62555b6b</rasd:InstanceId><rasd:ResourceType>17</rasd:ResourceType><rasd:HostResource>501bc9ac-0463-4c8b-a23b-efd6cd1c523b/7f04725e-1926-48b6-bb0f-d14f62555b6b</rasd:HostResource><rasd:Parent>00000000-0000-0000-0000-000000000000</rasd:Parent><rasd:Template>00000000-0000-0000-0000-000000000000</rasd:Template><rasd:ApplicationList></rasd:ApplicationList><rasd:StorageId>a0903254-f31e-4798-be1b-e5f6150bcdec</rasd:StorageId><rasd:StoragePoolId>00000001-0001-0001-0001-00000000017e</rasd:StoragePoolId><rasd:CreationDate>2016/06/13 13:13:55</rasd:CreationDate><rasd:LastModified>1970/01/01 00:00:00</rasd:LastModified><rasd:last_modified_date>2016/06/13 14:40:17</rasd:last_modified_date><Type>disk</Type><Device>disk</Device><rasd:Address>{slot=0x06, bus=0x00, domain=0x0000, type=pci, function=0x0}</rasd:Address><BootOrder>0</BootOrder><IsPlugged>true</IsPlugged><IsReadOnly>false</IsReadOnly><Alias>virtio-disk0</Alias></Item><Item><rasd:Caption>Ethernet adapter on ovirtmgmt</rasd:Caption><rasd:InstanceId>be052cf9-8747-4982-805e-b3bcaa55b8d5</rasd:InstanceId><rasd:ResourceType>10</rasd:ResourceType><rasd:OtherResourceType>ovirtmgmt</rasd:OtherResourceType><rasd:ResourceSubType>3</rasd:ResourceSubType><rasd:Connection>ovirtmgmt</rasd:Connection><rasd:Linked>true</rasd:Linked><rasd:Name>vnet0</rasd:Name><rasd:MACAddress>00:16:3E:7B:BB:BB</rasd:MACAddress><rasd:speed>1000</rasd:speed><Type>interface</Type><Device>bridge</Device><rasd:Address>{slot=0x03, bus=0x00, domain=0x0000, type=pci, function=0x0}</rasd:Address><BootOrder>0</BootOrder><IsPlugged>true</IsPlugged><IsReadOnly>false</IsReadOnly><Alias>net0</Alias></Item><Item><rasd:Caption>Ethernet adapter on [No Network]</rasd:Caption><rasd:InstanceId>a65ffa7a-d182-4d73-967f-93eae6a755f8</rasd:InstanceId><rasd:ResourceType>10</rasd:ResourceType><rasd:OtherResourceType/><rasd:ResourceSubType>3</rasd:ResourceSubType><rasd:Connection/><rasd:Linked>true</rasd:Linked><rasd:Name>nic1</rasd:Name><rasd:MACAddress>00:1a:4a:16:01:51</rasd:MACAddress><rasd:speed>1000</rasd:speed><Type>interface</Type><Device>bridge</Device><rasd:Address></rasd:Address><BootOrder>0</BootOrder><IsPlugged>true</IsPlugged><IsReadOnly>false</IsReadOnly><Alias></Alias></Item><Item><rasd:Caption>USB Controller</rasd:Caption><rasd:InstanceId>3</rasd:InstanceId><rasd:ResourceType>23</rasd:ResourceType><rasd:UsbPolicy>DISABLED</rasd:UsbPolicy></Item><Item><rasd:Caption>Graphical 
Framebuffer</rasd:Caption><rasd:InstanceId>6c3528ac-6357-430d-9824-7acadca0e0e2</rasd:InstanceId><rasd:ResourceType>26</rasd:ResourceType><Type>graphics</Type><Device>vnc</Device><rasd:Address></rasd:Address><BootOrder>0</BootOrder><IsPlugged>true</IsPlugged><IsReadOnly>false</IsReadOnly><Alias></Alias></Item><Item><rasd:Caption>CDROM</rasd:Caption><rasd:InstanceId>4c8813c9-e228-484c-8a13-377c5f8dae50</rasd:InstanceId><rasd:ResourceType>15</rasd:ResourceType><Type>disk</Type><Device>cdrom</Device><rasd:Address></rasd:Address><BootOrder>0</BootOrder><IsPlugged>false</IsPlugged><IsReadOnly>true</IsReadOnly><Alias></Alias><SpecParams><path></path></SpecParams></Item><Item><rasd:ResourceType>0</rasd:ResourceType><rasd:InstanceId>aafc77c0-4a8d-4ced-8467-f92cd1868d90</rasd:InstanceId><Type>controller</Type><Device>virtio-serial</Device><rasd:Address>{slot=0x05, bus=0x00, domain=0x0000, type=pci, function=0x0}</rasd:Address><BootOrder>0</BootOrder><IsPlugged>true</IsPlugged><IsReadOnly>false</IsReadOnly><Alias>virtio-serial0</Alias></Item><Item><rasd:ResourceType>0</rasd:ResourceType><rasd:InstanceId>d691d9a5-36fb-490d-9c21-53eca8b02fce</rasd:InstanceId><Type>video</Type><Device>cirrus</Device><rasd:Address>{slot=0x02, bus=0x00, domain=0x0000, type=pci, function=0x0}</rasd:Address><BootOrder>0</BootOrder><IsPlugged>true</IsPlugged><IsReadOnly>false</IsReadOnly><Alias>video0</Alias></Item><Item><rasd:ResourceType>0</rasd:ResourceType><rasd:InstanceId>0bc293b9-907d-4ac0-81a8-123c6cd77e60</rasd:InstanceId><Type>channel</Type><Device>unix</Device><rasd:Address>{bus=0, controller=0, type=virtio-serial, port=2}</rasd:Address><BootOrder>0</BootOrder><IsPlugged>true</IsPlugged><IsReadOnly>false</IsReadOnly><Alias>channel1</Alias></Item><Item><rasd:ResourceType>0</rasd:ResourceType><rasd:InstanceId>25156b8d-9962-4b45-967f-8e9b26931425</rasd:InstanceId><Type>controller</Type><Device>ide</Device><rasd:Address>{slot=0x01, bus=0x00, domain=0x0000, type=pci, function=0x1}</rasd:Address><BootOrder>0</BootOrder><IsPlugged>true</IsPlugged><IsReadOnly>false</IsReadOnly><Alias>ide</Alias></Item><Item><rasd:ResourceType>0</rasd:ResourceType><rasd:InstanceId>3ab4d2ac-7c67-41cc-bbed-5bdc059ab303</rasd:InstanceId><Type>channel</Type><Device>unix</Device><rasd:Address>{bus=0, controller=0, type=virtio-serial, port=3}</rasd:Address><BootOrder>0</BootOrder><IsPlugged>true</IsPlugged><IsReadOnly>false</IsReadOnly><Alias>channel2</Alias></Item><Item><rasd:ResourceType>0</rasd:ResourceType><rasd:InstanceId>617be138-60bd-458d-b7d2-de9ae5fc01ea</rasd:InstanceId><Type>channel</Type><Device>unix</Device><rasd:Address>{bus=0, controller=0, type=virtio-serial, port=1}</rasd:Address><BootOrder>0</BootOrder><IsPlugged>true</IsPlugged><IsReadOnly>false</IsReadOnly><Alias>channel0</Alias></Item><Item><rasd:ResourceType>0</rasd:ResourceType><rasd:InstanceId>638c4da0-fbb0-480f-8b0a-4dd9fe5be8f8</rasd:InstanceId><Type>disk</Type><Device>cdrom</Device><rasd:Address>{bus=1, controller=0, type=drive, target=0, unit=0}</rasd:Address><BootOrder>0</BootOrder><IsPlugged>true</IsPlugged><IsReadOnly>false</IsReadOnly><Alias>ide0-1-0</Alias></Item><Item><rasd:ResourceType>0</rasd:ResourceType><rasd:InstanceId>67589c37-bf92-4f1e-865d-b1b46d6c770f</rasd:InstanceId><Type>controller</Type><Device>scsi</Device><rasd:Address>{slot=0x04, bus=0x00, domain=0x0000, type=pci, 
function=0x0}</rasd:Address><BootOrder>0</BootOrder><IsPlugged>true</IsPlugged><IsReadOnly>false</IsReadOnly><Alias>scsi0</Alias></Item><Item><rasd:ResourceType>0</rasd:ResourceType><rasd:InstanceId>72a87009-523f-40a9-b861-c61dc3a57c95</rasd:InstanceId><Type>controller</Type><Device>usb</Device><rasd:Address>{slot=0x01, bus=0x00, domain=0x0000, type=pci, function=0x2}</rasd:Address><BootOrder>0</BootOrder><IsPlugged>true</IsPlugged><IsReadOnly>false</IsReadOnly><Alias>usb</Alias></Item><Item><rasd:ResourceType>0</rasd:ResourceType><rasd:InstanceId>94570f3f-3274-4acf-86a6-d85ab48df6ef</rasd:InstanceId><Type>sound</Type><Device>ich6</Device><rasd:Address></rasd:Address><BootOrder>0</BootOrder><IsPlugged>false</IsPlugged><IsReadOnly>true</IsReadOnly><Alias></Alias></Item></Section><Section xsi:type="ovf:SnapshotsSection_Type"><Snapshot ovf:id="d276a54e-be19-4902-89a3-68d78876241e"><Type>ACTIVE</Type><Description>Active VM</Description><CreationDate>2016/06/13 13:13:55</CreationDate></Snapshot></Section></Content></ovf:Envelope>
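
(A quick inspection sketch, for illustration: listing each NIC Item in the OVF above together with its rasd:Address shows that the newly added NIC has an empty address. The namespace URI is the one declared in the OVF itself; "engine_vm.ovf" is a hypothetical local copy of the extracted OVF.)

import xml.etree.ElementTree as ET

RASD = "{http://schemas.dmtf.org/wbem/wscim/1/cim-schema/2/CIM_ResourceAllocationSettingData}"

def dump_nic_addresses(ovf_path):
    root = ET.parse(ovf_path).getroot()
    for item in root.iter("Item"):
        if item.findtext("Type") == "interface":
            mac = item.findtext(RASD + "MACAddress")
            addr = item.findtext(RASD + "Address") or "<empty>"
            print(mac, addr)

# dump_nic_addresses("engine_vm.ovf")
# Expected here: 00:16:3E:7B:BB:BB has a PCI address, 00:1a:4a:16:01:51 prints <empty>.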

Comment 33 Nikolai Sednev 2016-06-19 08:39:03 UTC
Is there a chance that https://bugzilla.redhat.com/show_bug.cgi?id=1340628 could somehow be related to this bug? For this bug I reproduced without changing the default OVF update timing, i.e. I did not set "engine-config -s OvfUpdateIntervalInMinutes=1" before restarting the engine. Could this be the root cause?

Comment 34 Simone Tiraboschi 2016-06-19 13:52:57 UTC
Yes, you have to change that value and restart the engine; otherwise, after any modification to the engine VM, you have to wait at least one hour before restarting the engine VM.
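
(One way to check this in practice, offered only as a hedged sketch and not an official procedure: before powering the engine VM off, poll the locally regenerated vm.conf on the host until the new NIC's MAC shows up, which indicates the OVF_STORE copy has been refreshed. The path is the one seen in this bug; the timeout values are assumptions.)

import time

VM_CONF = "/var/run/ovirt-hosted-engine-ha/vm.conf"

def wait_for_nic_in_vm_conf(mac, timeout=3900, poll=60):
    # Poll vm.conf until the given MAC appears or the timeout expires.
    deadline = time.time() + timeout
    while time.time() < deadline:
        with open(VM_CONF) as f:
            if mac.lower() in f.read().lower():
                return True
        time.sleep(poll)
    return False

# Example (MAC of the NIC added in comment 30):
# wait_for_nic_in_vm_conf("00:1a:4a:16:01:51")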

Comment 35 Nikolai Sednev 2016-06-19 15:09:52 UTC
I've set "engine-config -s OvfUpdateIntervalInMinutes=1" and see the same symptoms as before: I powered the engine off and on and got the same results as described in comment #26. Then I also restarted the hosted engine host and hit https://bugzilla.redhat.com/show_bug.cgi?id=1343005 again. The engine fails to start.

Comment 36 Nikolai Sednev 2016-06-19 18:45:35 UTC
I've tried a "warm restart" of the HE-VM using these steps on the host first:
hosted-engine --set-maintenance --mode=global
hosted-engine --vm-poweroff
hosted-engine --vm-start
hosted-engine --set-maintenance --mode=none
hosted-engine --vm-status
hosted-engine --set-maintenance --mode=none
hosted-engine --vm-status



[root@alma03 ~]# systemctl status ovirt-ha-agent -l
● ovirt-ha-agent.service - oVirt Hosted Engine High Availability Monitoring Agent
   Loaded: loaded (/usr/lib/systemd/system/ovirt-ha-agent.service; enabled; vendor preset: disabled)
   Active: active (running) since Sun 2016-06-19 20:53:10 IDT; 27min ago
 Main PID: 2916 (ovirt-ha-agent)
   CGroup: /system.slice/ovirt-ha-agent.service
           └─2916 /usr/bin/python /usr/share/ovirt-hosted-engine-ha/ovirt-ha-agent --no-daemon

Jun 19 21:20:26 alma03.qa.lab.tlv.redhat.com ovirt-ha-agent[2916]: INFO:ovirt_hosted_engine_ha.lib.image.Image:Preparing images
Jun 19 21:20:31 alma03.qa.lab.tlv.redhat.com ovirt-ha-agent[2916]: INFO:ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine:Reloading vm.conf from the shared storage domain
Jun 19 21:20:31 alma03.qa.lab.tlv.redhat.com ovirt-ha-agent[2916]: INFO:ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine.config:Trying to get a fresher copy of vm configuration from the OVF_STORE
Jun 19 21:20:37 alma03.qa.lab.tlv.redhat.com ovirt-ha-agent[2916]: INFO:ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore:Found OVF_STORE: imgUUID:2474f383-99a7-418e-8ca7-d24c8336e363, volUUID:4b5556d0-f693-45e1-b5fb-4da06e91d190
Jun 19 21:20:37 alma03.qa.lab.tlv.redhat.com ovirt-ha-agent[2916]: INFO:ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore:Found OVF_STORE: imgUUID:7fc06428-2493-4f23-a58f-f534a33283f2, volUUID:d9a5c8ff-9235-4bd4-9b61-ac22fb36ada8
Jun 19 21:20:37 alma03.qa.lab.tlv.redhat.com ovirt-ha-agent[2916]: INFO:ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore:Extracting Engine VM OVF from the OVF_STORE
Jun 19 21:20:37 alma03.qa.lab.tlv.redhat.com ovirt-ha-agent[2916]: INFO:ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore:OVF_STORE volume path: /rhev/data-center/mnt/10.35.64.11:_vol_RHEV_Virt_nsednev__3__6__HE__2/b9b588f4-9c91-4784-a0b8-242b3e034bd6/images/7fc06428-2493-4f23-a58f-f534a33283f2/d9a5c8ff-9235-4bd4-9b61-ac22fb36ada8
Jun 19 21:20:37 alma03.qa.lab.tlv.redhat.com ovirt-ha-agent[2916]: INFO:ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine.config:Found an OVF for HE VM, trying to convert
Jun 19 21:20:37 alma03.qa.lab.tlv.redhat.com ovirt-ha-agent[2916]: INFO:ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine.config:Got vm.conf from OVF_STORE
Jun 19 21:20:37 alma03.qa.lab.tlv.redhat.com ovirt-ha-agent[2916]: INFO:ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine:Current state EngineUnexpectedlyDown (score: 0)
[root@alma03 ~]# hosted-engine --vm-status


--== Host 1 status ==--

Status up-to-date                  : True
Hostname                           : alma03.qa.lab.tlv.redhat.com
Host ID                            : 1
Engine status                      : {"reason": "bad vm status", "health": "bad", "vm": "down", "detail": "down"}
Score                              : 0
stopped                            : False
Local maintenance                  : False
crc32                              : adfb9a85
Host timestamp                     : 1728
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=1728 (Sun Jun 19 21:20:11 2016)
        host-id=1
        score=0
        maintenance=False
        state=EngineUnexpectedlyDown
        stopped=False
        timeout=Thu Jan  1 02:35:46 1970

Then I tried these steps:
[root@alma03 ~]# systemctl restart ovirt-ha-broker && systemctl restart ovirt-ha-agent

As a result, the score changed to 3400 and the engine was starting, but it eventually failed with:
hosted-engine --vm-status


--== Host 1 status ==--

Status up-to-date                  : True
Hostname                           : alma03.qa.lab.tlv.redhat.com
Host ID                            : 1
Engine status                      : {"reason": "bad vm status", "health": "bad", "vm": "down", "detail": "down"}
Score                              : 0
stopped                            : False
Local maintenance                  : False
crc32                              : f6866174
Host timestamp                     : 2101
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=2101 (Sun Jun 19 21:26:24 2016)
        host-id=1
        score=0
        maintenance=False
        state=EngineUnexpectedlyDown
        stopped=False
[root@alma03 ~]# systemctl status ovirt-ha-agent -l
● ovirt-ha-agent.service - oVirt Hosted Engine High Availability Monitoring Agent
   Loaded: loaded (/usr/lib/systemd/system/ovirt-ha-agent.service; enabled; vendor preset: disabled)
   Active: active (running) since Sun 2016-06-19 21:22:42 IDT; 4min 41s ago
 Main PID: 10420 (ovirt-ha-agent)
   CGroup: /system.slice/ovirt-ha-agent.service
           └─10420 /usr/bin/python /usr/share/ovirt-hosted-engine-ha/ovirt-ha-agent --no-daemon

Jun 19 21:27:03 alma03.qa.lab.tlv.redhat.com ovirt-ha-agent[10420]: INFO:ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine:Timeout set to Sun Jun 19 21:37:03 2016 while transitioning <class 'ovirt_hosted_engine_ha.agent.states.EngineUnexpectedlyDown'> -> <class 'ovirt_hosted_engine_ha.agent.states.EngineUnexpectedlyDown'>
Jun 19 21:27:03 alma03.qa.lab.tlv.redhat.com ovirt-ha-agent[10420]: INFO:ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine:Initializing VDSM
Jun 19 21:27:08 alma03.qa.lab.tlv.redhat.com ovirt-ha-agent[10420]: INFO:ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine:Connecting the storage
Jun 19 21:27:08 alma03.qa.lab.tlv.redhat.com ovirt-ha-agent[10420]: INFO:ovirt_hosted_engine_ha.lib.storage_server.StorageServer:Connecting storage server
Jun 19 21:27:18 alma03.qa.lab.tlv.redhat.com ovirt-ha-agent[10420]: INFO:ovirt_hosted_engine_ha.lib.storage_server.StorageServer:Connecting storage server
Jun 19 21:27:18 alma03.qa.lab.tlv.redhat.com ovirt-ha-agent[10420]: INFO:ovirt_hosted_engine_ha.lib.storage_server.StorageServer:Refreshing the storage domain
Jun 19 21:27:18 alma03.qa.lab.tlv.redhat.com ovirt-ha-agent[10420]: INFO:ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine:Preparing images
Jun 19 21:27:18 alma03.qa.lab.tlv.redhat.com ovirt-ha-agent[10420]: INFO:ovirt_hosted_engine_ha.lib.image.Image:Preparing images
Jun 19 21:27:23 alma03.qa.lab.tlv.redhat.com ovirt-ha-agent[10420]: INFO:ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine:Reloading vm.conf from the shared storage domain
Jun 19 21:27:23 alma03.qa.lab.tlv.redhat.com ovirt-ha-agent[10420]: INFO:ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine.config:Trying to get a fresher copy of vm configuration from the OVF_STORE

Then I waited more than 15 minutes to see if the score would rise back to 3400 on my host, which it did:
[root@alma03 ~]# systemctl status ovirt-ha-agent -l
● ovirt-ha-agent.service - oVirt Hosted Engine High Availability Monitoring Agent
   Loaded: loaded (/usr/lib/systemd/system/ovirt-ha-agent.service; enabled; vendor preset: disabled)
   Active: active (running) since Sun 2016-06-19 21:22:42 IDT; 15min ago
 Main PID: 10420 (ovirt-ha-agent)
   CGroup: /system.slice/ovirt-ha-agent.service
           └─10420 /usr/bin/python /usr/share/ovirt-hosted-engine-ha/ovirt-ha-agent --no-daemon

Jun 19 21:38:04 alma03.qa.lab.tlv.redhat.com ovirt-ha-agent[10420]: INFO:ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine:Engine down and local host has best score (3400), attempting to start engine VM
Jun 19 21:38:04 alma03.qa.lab.tlv.redhat.com ovirt-ha-agent[10420]: INFO:ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink:Trying: notify time=1466361484.36 type=state_transition detail=EngineDown-EngineStart hostname='alma03.qa.lab.tlv.redhat.com'
Jun 19 21:38:06 alma03.qa.lab.tlv.redhat.com ovirt-ha-agent[10420]: INFO:ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink:Success, was notification of state_transition (EngineDown-EngineStart) sent? sent
Jun 19 21:38:06 alma03.qa.lab.tlv.redhat.com ovirt-ha-agent[10420]: INFO:ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine:Initializing VDSM
Jun 19 21:38:11 alma03.qa.lab.tlv.redhat.com ovirt-ha-agent[10420]: INFO:ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine:Connecting the storage
Jun 19 21:38:11 alma03.qa.lab.tlv.redhat.com ovirt-ha-agent[10420]: INFO:ovirt_hosted_engine_ha.lib.storage_server.StorageServer:Connecting storage server
Jun 19 21:38:21 alma03.qa.lab.tlv.redhat.com ovirt-ha-agent[10420]: INFO:ovirt_hosted_engine_ha.lib.storage_server.StorageServer:Connecting storage server
Jun 19 21:38:21 alma03.qa.lab.tlv.redhat.com ovirt-ha-agent[10420]: INFO:ovirt_hosted_engine_ha.lib.storage_server.StorageServer:Refreshing the storage domain
Jun 19 21:38:21 alma03.qa.lab.tlv.redhat.com ovirt-ha-agent[10420]: INFO:ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine:Preparing images
Jun 19 21:38:21 alma03.qa.lab.tlv.redhat.com ovirt-ha-agent[10420]: INFO:ovirt_hosted_engine_ha.lib.image.Image:Preparing images
[root@alma03 ~]# hosted-engine --vm-status


--== Host 1 status ==--

Status up-to-date                  : True
Hostname                           : alma03.qa.lab.tlv.redhat.com
Host ID                            : 1
Engine status                      : {"reason": "bad vm status", "health": "bad", "vm": "down", "detail": "down"}
Score                              : 3400
stopped                            : False
Local maintenance                  : False
crc32                              : 00e62eca
Host timestamp                     : 2801
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=2801 (Sun Jun 19 21:38:04 2016)
        host-id=1
        score=3400
        maintenance=False
        state=EngineStart
        stopped=False

But the engine failed to start and the score was zeroed again:
[root@alma03 ~]# hosted-engine --vm-status


--== Host 1 status ==--

Status up-to-date                  : True
Hostname                           : alma03.qa.lab.tlv.redhat.com
Host ID                            : 1
Engine status                      : {"reason": "bad vm status", "health": "bad", "vm": "down", "detail": "down"}
Score                              : 0
stopped                            : False
Local maintenance                  : False
crc32                              : e47afaba
Host timestamp                     : 2884
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=2884 (Sun Jun 19 21:39:27 2016)
        host-id=1
        score=0
        maintenance=False
        state=EngineUnexpectedlyDown
        stopped=False
[root@alma03 ~]# systemctl status ovirt-ha-agent -l
● ovirt-ha-agent.service - oVirt Hosted Engine High Availability Monitoring Agent
   Loaded: loaded (/usr/lib/systemd/system/ovirt-ha-agent.service; enabled; vendor preset: disabled)
   Active: active (running) since Sun 2016-06-19 21:22:42 IDT; 18min ago
 Main PID: 10420 (ovirt-ha-agent)
   CGroup: /system.slice/ovirt-ha-agent.service
           └─10420 /usr/bin/python /usr/share/ovirt-hosted-engine-ha/ovirt-ha-agent --no-daemon

Jun 19 21:40:26 alma03.qa.lab.tlv.redhat.com ovirt-ha-agent[10420]: INFO:ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine.config:Trying to get a fresher copy of vm configuration from the OVF_STORE
Jun 19 21:40:32 alma03.qa.lab.tlv.redhat.com ovirt-ha-agent[10420]: INFO:ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore:Found OVF_STORE: imgUUID:2474f383-99a7-418e-8ca7-d24c8336e363, volUUID:4b5556d0-f693-45e1-b5fb-4da06e91d190
Jun 19 21:40:32 alma03.qa.lab.tlv.redhat.com ovirt-ha-agent[10420]: INFO:ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore:Found OVF_STORE: imgUUID:7fc06428-2493-4f23-a58f-f534a33283f2, volUUID:d9a5c8ff-9235-4bd4-9b61-ac22fb36ada8
Jun 19 21:40:32 alma03.qa.lab.tlv.redhat.com ovirt-ha-agent[10420]: INFO:ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore:Extracting Engine VM OVF from the OVF_STORE
Jun 19 21:40:32 alma03.qa.lab.tlv.redhat.com ovirt-ha-agent[10420]: INFO:ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore:OVF_STORE volume path: /rhev/data-center/mnt/10.35.64.11:_vol_RHEV_Virt_nsednev__3__6__HE__2/b9b588f4-9c91-4784-a0b8-242b3e034bd6/images/7fc06428-2493-4f23-a58f-f534a33283f2/d9a5c8ff-9235-4bd4-9b61-ac22fb36ada8
Jun 19 21:40:32 alma03.qa.lab.tlv.redhat.com ovirt-ha-agent[10420]: INFO:ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine.config:Found an OVF for HE VM, trying to convert
Jun 19 21:40:32 alma03.qa.lab.tlv.redhat.com ovirt-ha-agent[10420]: INFO:ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine.config:Got vm.conf from OVF_STORE
Jun 19 21:40:32 alma03.qa.lab.tlv.redhat.com ovirt-ha-agent[10420]: INFO:ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine:Score is 0 due to unexpected vm shutdown at Sun Jun 19 21:40:06 2016
Jun 19 21:40:32 alma03.qa.lab.tlv.redhat.com ovirt-ha-agent[10420]: INFO:ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine:Current state EngineUnexpectedlyDown (score: 0)
Jun 19 21:40:42 alma03.qa.lab.tlv.redhat.com ovirt-ha-agent[10420]: INFO:ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine:Initializing VDSM

I've also tried power-cycling the host; the result was the same, the engine could not start.
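
For anyone reproducing this, a convenient way to watch whether the agent recovers the score is to poll the HA status periodically (just a monitoring sketch using standard tools, not part of any fix):

# watch -n 10 'hosted-engine --vm-status | grep -E "Score|state="'

The score should climb back to 3400 once the EngineUnexpectedlyDown timeout expires; if it stays at 0, the agent is hitting the same start failure again.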

Comment 37 Nikolai Sednev 2016-06-19 18:49:04 UTC
Created attachment 1169587 [details]
sosreport from host alma03

Comment 38 Nikolai Sednev 2016-06-19 18:50:07 UTC
Adding components from host:
ovirt-iso-uploader-4.0.0-1.el7ev.noarch
ovirt-engine-restapi-4.0.0.5-0.1.el7ev.noarch
ovirt-vmconsole-1.0.3-1.el7ev.noarch
vdsm-4.18.3-0.el7ev.x86_64
ovirt-host-deploy-1.5.0-1.el7ev.noarch
ovirt-vmconsole-proxy-1.0.3-1.el7ev.noarch
ovirt-engine-extension-aaa-jdbc-1.1.0-1.el7ev.noarch
ovirt-engine-dwh-4.0.0-2.el7ev.noarch
ovirt-engine-setup-4.0.0.5-0.1.el7ev.noarch
sanlock-3.2.4-2.el7_2.x86_64
ovirt-engine-cli-3.6.7.0-1.el7ev.noarch
ovirt-engine-websocket-proxy-4.0.0.5-0.1.el7ev.noarch
ovirt-engine-sdk-python-3.6.7.0-1.el7ev.noarch
libvirt-client-1.2.17-13.el7_2.5.x86_64
ovirt-image-uploader-4.0.0-1.el7ev.noarch
ovirt-engine-userportal-4.0.0.5-0.1.el7ev.noarch
ovirt-engine-webadmin-portal-4.0.0.5-0.1.el7ev.noarch
ovirt-engine-tools-4.0.0.5-0.1.el7ev.noarch
ovirt-engine-setup-plugin-websocket-proxy-4.0.0.5-0.1.el7ev.noarch
ovirt-engine-backend-4.0.0.5-0.1.el7ev.noarch
ovirt-engine-vmconsole-proxy-helper-4.0.0.5-0.1.el7ev.noarch
ovirt-engine-dbscripts-4.0.0.5-0.1.el7ev.noarch
ovirt-hosted-engine-setup-2.0.0.2-1.el7ev.noarch
ovirt-engine-lib-4.0.0.5-0.1.el7ev.noarch
ovirt-engine-setup-plugin-ovirt-engine-common-4.0.0.5-0.1.el7ev.noarch
ovirt-engine-extensions-api-impl-4.0.0.5-0.1.el7ev.noarch
ovirt-host-deploy-java-1.5.0-1.el7ev.noarch
ovirt-engine-dwh-setup-4.0.0-2.el7ev.noarch
ovirt-engine-setup-plugin-vmconsole-proxy-helper-4.0.0.5-0.1.el7ev.noarch
ovirt-engine-4.0.0.5-0.1.el7ev.noarch
qemu-kvm-rhev-2.3.0-31.el7_2.16.x86_64
mom-0.5.4-1.el7ev.noarch
ovirt-engine-tools-backup-4.0.0.5-0.1.el7ev.noarch
ovirt-log-collector-4.0.0-1.el7ev.noarch
ovirt-engine-dashboard-1.0.0-20160615git43298a4.el7ev.x86_64
ovirt-engine-setup-plugin-ovirt-engine-4.0.0.5-0.1.el7ev.noarch
ovirt-vmconsole-host-1.0.3-1.el7ev.noarch
ovirt-hosted-engine-ha-2.0.0-1.el7ev.noarch
ovirt-setup-lib-1.0.2-1.el7ev.noarch
ovirt-engine-setup-base-4.0.0.5-0.1.el7ev.noarch
Linux version 3.10.0-327.22.2.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-4) (GCC) ) #1 SMP Thu Jun 9 10:09:10 EDT 2016
Linux 3.10.0-327.22.2.el7.x86_64 #1 SMP Thu Jun 9 10:09:10 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.2 (Maipo)

Comment 39 Nikolai Sednev 2016-06-20 06:16:30 UTC
I do see this in /var/run/ovirt-hosted-engine-ha/vm.conf:
[root@alma03 ~]# cat /var/run/ovirt-hosted-engine-ha/vm.conf
emulatedMachine=rhel6.5.0
vmId=df291a06-fa08-4e7a-8e7c-32a2903de512
smp=4
memSize=4096
spiceSecureChannels=smain,sdisplay,sinputs,scursor,splayback,srecord,ssmartcard,susbredir
vmName=HostedEngine
display=vnc
devices={index:0,iface:virtio,format:raw,bootOrder:1,address:{slot:0x06,bus:0x00,domain:0x0000,type:pci,function:0x0},volumeID:dd85c3e1-cbb0-4719-bc24-7b7c058510a4,imageID:fb42d97d-8b85-42ca-a55c-dcdc48c15850,readonly:false,domainID:b9b588f4-9c91-4784-a0b8-242b3e034bd6,deviceId:fb42d97d-8b85-42ca-a55c-dcdc48c15850,poolID:00000000-0000-0000-0000-000000000000,device:disk,shared:exclusive,propagateErrors:off,type:disk}
devices={nicModel:pv,macAddr:00:16:3E:7B:BB:BB,linkActive:true,network:ovirtmgmt,deviceId:c53ae1e8-3dad-4659-993a-1e2ff0b3b7d4,address:{slot:0x03,bus:0x00,domain:0x0000,type:pci,function:0x0},device:bridge,type:interface}
devices={nicModel:pv,macAddr:00:1a:4a:16:01:51,linkActive:true,network:None,deviceId:59d8c675-b7c5-4bf1-a716-91a9263b1d5e,address:{slot:0x07,bus:0x00,domain:0x0000,type:pci,function:0x0},device:bridge,type:interface}
devices={index:2,iface:ide,shared:false,readonly:true,deviceId:8c3179ac-b322-4f5c-9449-c52e3665e0ae,address:{controller:0,target:0,unit:0,bus:1,type:drive},device:cdrom,path:,type:disk}
devices={device:vga,alias:video0,type:video,deviceId:4d4f1a6c-db8e-4855-9d03-f8c34ee81135,address:{slot:0x02,bus:0x00,domain:0x0000,type:pci,function:0x0}}
devices={device:virtio-serial,type:controller,deviceId:81a03974-39fa-4140-b827-4b4b941c6006,address:{slot:0x05,bus:0x00,domain:0x0000,type:pci,function:0x0}}
devices={device:usb,type:controller,deviceId:ec58c2e6-3f86-44c6-8627-8025be9fd509,address:{slot:0x01,bus:0x00,domain:0x0000,type:pci,function:0x2}}
devices={device:scsi,model:virtio-scsi,type:controller,deviceId:fe7eb17d-56c7-4cf7-9db1-631ef165d4f9,address:{slot:0x04,bus:0x00,domain:0x0000,type:pci,function:0x0}}
devices={device:ide,type:controller,deviceId:3a9be9c2-9f12-4cd7-ace7-a26edeb45dcd,address:{slot:0x01,bus:0x00,domain:0x0000,type:pci,function:0x1}}

Comment 40 Nikolai Sednev 2016-06-20 10:54:04 UTC
I think the root cause of the engine failing to start is this:
Thread-27987::DEBUG::2016-06-20 12:04:43,927::bindingxmlrpc::1242::vds::(wrapper) return vmGetStats with {'status': {'message': 'Done', 'code': 0}
, 'statsList': [{'status': 'Down', 'exitMessage': "Cannot get interface MTU on 'None': No such device", 'statusTime': '4349468610', 'vmId': 'df291
a06-fa08-4e7a-8e7c-32a2903de512', 'exitReason': 1, 'exitCode': 1}]}

When an empty interface is created for the engine's VM, for some reason its type is 'None' instead of 'Dummy' and the MTU is empty, which prevents the engine's VM from starting. If an empty vNIC is added to a regular guest VM, its type is 'Dummy' and there is no problem starting it at all.
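
A quick way to confirm this on the host (a minimal check against the vm.conf format shown in comment 39, not an official procedure):

# grep -n 'network:None' /var/run/ovirt-hosted-engine-ha/vm.conf

A match means the HE VM definition still carries a vNIC that is not attached to any usable logical network, which is consistent with the "Cannot get interface MTU on 'None'" exit message above.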

Comment 41 Nikolai Sednev 2016-06-20 10:55:29 UTC
Can you please provide more information regarding the customer's vNIC? Was it assigned to an existing network, or was it simply created as empty?

Comment 42 Marina Kalinin 2016-06-20 20:53:23 UTC
The customer never shared that exact information with me.
From my tests: I didn't create any additional logical network and defined the second vNIC on the ovirtmgmt logical network.
I never got into a state where the engine would not start.
But I also never saw the OVF store updated with the NIC, and consequently I didn't see vm.conf modified.

However, let's establish a clear baseline to start from, because the code has changed since the customer and I tested this.

How I see the use cases (a verification sketch follows the list):
Case #1:
- Deploy the HE VM with its network on the ovirtmgmt logical network (the default/only option).
- Modify the VM in the UI and add an additional vNIC on a different network than the ovirtmgmt logical network; let's call it the display network. Mark it as a display network in the UI as well.
- Reboot the HE VM.
- Verify that the second vNIC, attached to the display logical network, is up and active.

Case #2:
Same as above, but instead of rebooting, test live migration of the HE VM.
Make sure that after the migration to another host the second vNIC is still plugged and active.

Case #3: (customer case)
Same as #1, with one difference: ovirtmgmt is on a private network.
The display network is the public network.
To access the HE VM's Admin Portal you either need to be on that private network or add the display network to the HE VM.
Reboot and make sure you can still access the VM/Admin Portal from the public network.

Case #4: (my testing environment; not sure if it is a valid use case though)
Same as #1, with the difference that the second vNIC is on the same logical network as the first vNIC, ovirtmgmt.
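
For all of the cases above, the post-reboot/post-migration check can be done the same way (an illustrative sketch, not a formal test plan):

On the host, both vNICs should appear as interface devices in the runtime VM definition:
  grep -c 'type:interface' /var/run/ovirt-hosted-engine-ha/vm.conf

Inside the HE VM, the second NIC should be listed and in state UP:
  ip link show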

Comment 45 Nikolai Sednev 2016-06-21 14:35:22 UTC
Followed these reproduction steps (the host-side commands are consolidated into one sequence below):
1) Ran "engine-config -s OvfUpdateIntervalInMinutes=1" on the engine.
2) Restarted the engine.
3) Added a new network to the host and attached it to its NIC.
4) Added a new vNIC to the HE VM and assigned it to the new network.
5) Set the host into global maintenance: "hosted-engine --set-maintenance --mode=global".
6) On the host: "hosted-engine --vm-shutdown".
7) On the host, "cat /var/run/ovirt-hosted-engine-ha/vm.conf" already showed the newly added vNIC; the HE VM was still powering down.
8) Started the HE VM on the host: "hosted-engine --vm-start".
9) On the host: "hosted-engine --set-maintenance --mode=none".
10) The engine booted up and the newly added vNIC remained connected to it.
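
The host-side part of steps 5-9 above as a single sequence (a sketch for easier copy/paste; exact behavior depends on the package versions listed below):

  hosted-engine --set-maintenance --mode=global
  hosted-engine --vm-shutdown
  cat /var/run/ovirt-hosted-engine-ha/vm.conf   # the new vNIC should already be listed here
  hosted-engine --vm-start
  hosted-engine --set-maintenance --mode=none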

Host:
libvirt-client-1.2.17-13.el7_2.5.x86_64
ovirt-vmconsole-1.0.3-1.el7ev.noarch
ovirt-engine-sdk-python-3.6.7.0-1.el7ev.noarch
vdsm-4.18.3-0.el7ev.x86_64
ovirt-setup-lib-1.0.2-1.el7ev.noarch
ovirt-hosted-engine-ha-2.0.0-1.el7ev.noarch
mom-0.5.4-1.el7ev.noarch
sanlock-3.2.4-2.el7_2.x86_64
ovirt-host-deploy-1.5.0-1.el7ev.noarch
qemu-kvm-rhev-2.3.0-31.el7_2.16.x86_64
ovirt-vmconsole-host-1.0.3-1.el7ev.noarch
ovirt-hosted-engine-setup-2.0.0.2-1.el7ev.noarch
Linux version 3.10.0-327.22.2.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-4) (GCC) ) #1 SMP Thu Jun 9 10:09:10 EDT 2016
Linux 3.10.0-327.22.2.el7.x86_64 #1 SMP Thu Jun 9 10:09:10 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.2 (Maipo)

Engine:
rhevm-doc-4.0.0-2.el7ev.noarch
rhev-release-4.0.0-18-001.noarch
rhevm-setup-plugins-4.0.0.1-1.el7ev.noarch
rhevm-spice-client-x64-msi-4.0-2.el7ev.noarch
rhevm-branding-rhev-4.0.0-1.el7ev.noarch
rhevm-4.0.0.6-0.1.el7ev.noarch
rhevm-guest-agent-common-1.0.12-2.el7ev.noarch
rhevm-dependencies-4.0.0-1.el7ev.noarch
rhevm-spice-client-x86-msi-4.0-2.el7ev.noarch
rhev-guest-tools-iso-4.0-2.el7ev.noarch
Linux version 3.10.0-327.22.2.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-4) (GCC) ) #1 SMP Thu Jun 9 10:09:10 EDT 2016
Linux 3.10.0-327.22.2.el7.x86_64 #1 SMP Thu Jun 9 10:09:10 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.2 (Maipo)

Comment 46 Nikolai Sednev 2016-06-21 14:40:05 UTC
Attaching a link to a screen capture of what I've been doing.
https://drive.google.com/a/redhat.com/file/d/0B85BEaDBcF88a09QZDE4WkVjLW8/view?usp=sharing

Comment 47 Nikolai Sednev 2016-06-21 16:15:59 UTC
When the host is power-cycled and the reproduction is done as described in comment 45, everything works just fine too.

Migration also passed OK.

Case #4 is not valid: only one vNIC can be on the management network while the HE VM is running, and since the first vNIC is already on that network, I can't add another newly created vNIC to the same mgmt network.

Case #3 is a bit problematic to reproduce in my environment; I'd better ask the network team for help here.

I'll split this bug into two possible scenarios:

1) The customer did as described in 1348640.
2) The customer did not wait for the OVF on shared storage to be updated with the new vNIC parameters before the HE VM was migrated/powered off and on, or before the host was power-cycled, forcing a cold restart of the HE VM. By default the OVF is updated every hour; the customer could shorten the timer by running "engine-config -s OvfUpdateIntervalInMinutes=1" on their engine and then restarting it (see the sketch below).
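
For scenario 2, the current interval can be checked and shortened like this (a small sketch; engine-config runs on the engine machine, and the ovirt-engine service must be restarted for the change to take effect):

  engine-config -g OvfUpdateIntervalInMinutes
  engine-config -s OvfUpdateIntervalInMinutes=1
  systemctl restart ovirt-engine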

Comment 51 Nikolai Sednev 2016-06-23 11:31:06 UTC
I've tried the following.
As a precondition, I ran "engine-config -s OvfUpdateIntervalInMinutes=1" on the engine and restarted the ovirt-engine service on the HE VM.
1) Added a network named "Monitoring1" in the networks tab.
2) Associated "Monitoring1" on both hosts with the additional physical interface "enp3s0f1"; the interface was unplugged from the network, so it had no link and no assigned IP address.
3) Added a new vNIC to the engine and associated it with the "Monitoring1" profile.
4) Migrated the HE VM back and forth between the hosts.
5) Set the host running the HE VM into maintenance and saw that the HE VM was migrated to the other active host.
6) Turned one host into local maintenance and then power-cycled the second host with the HE VM; once the host came online again, the HE VM powered up successfully without losing the new vNIC.

During these tests the vNIC did not disappear from the HE VM.
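
To confirm the HE VM came back healthy after each of these operations, the HA status can be checked (an ordinary status query, shown only for completeness):

  hosted-engine --vm-status | grep -E 'Engine status|Score'

The engine status should report "health": "good" and the score should be 3400 on the host running the VM.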


Works for me on these components:

Hosts:
ovirt-setup-lib-1.0.2-1.el7ev.noarch
qemu-kvm-rhev-2.3.0-31.el7_2.16.x86_64
mom-0.5.4-1.el7ev.noarch
rhev-release-4.0.0-19-001.noarch
vdsm-4.18.4-2.el7ev.x86_64
ovirt-vmconsole-host-1.0.3-1.el7ev.noarch
ovirt-hosted-engine-ha-2.0.0-1.el7ev.noarch
ovirt-engine-sdk-python-3.6.7.0-1.el7ev.noarch
libvirt-client-1.2.17-13.el7_2.5.x86_64
ovirt-host-deploy-1.5.0-1.el7ev.noarch
ovirt-hosted-engine-setup-2.0.0.2-1.el7ev.noarch
sanlock-3.2.4-2.el7_2.x86_64
ovirt-vmconsole-1.0.3-1.el7ev.noarch
Linux version 3.10.0-327.22.2.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-4) (GCC) ) #1 SMP Thu Jun 9 10:09:10 EDT 2016
Linux 3.10.0-327.22.2.el7.x86_64 #1 SMP Thu Jun 9 10:09:10 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.2 (Maipo)

Engine:
rhevm-doc-4.0.0-2.el7ev.noarch
rhevm-setup-plugins-4.0.0.1-1.el7ev.noarch
rhevm-spice-client-x64-msi-4.0-2.el7ev.noarch
rhevm-4.0.0.6-0.1.el7ev.noarch
rhev-release-4.0.0-19-001.noarch
rhevm-guest-agent-common-1.0.12-2.el7ev.noarch
rhevm-dependencies-4.0.0-1.el7ev.noarch
rhevm-branding-rhev-4.0.0-2.el7ev.noarch
rhevm-spice-client-x86-msi-4.0-2.el7ev.noarch
rhev-guest-tools-iso-4.0-2.el7ev.noarch
Linux version 3.10.0-327.22.2.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-4) (GCC) ) #1 SMP Thu Jun 9 10:09:10 EDT 2016
Linux 3.10.0-327.22.2.el7.x86_64 #1 SMP Thu Jun 9 10:09:10 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.2 (Maipo)

Comment 53 errata-xmlrpc 2016-08-23 20:39:57 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-1743.html

