Bug 1306825 - hosted-engine upgrade fails after upgrade hosts from el6 to el7
Summary: hosted-engine upgrade fails after upgrade hosts from el6 to el7
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-hosted-engine-setup
Classification: oVirt
Component: General
Version: 1.3.2.3
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: ovirt-3.6.4
Target Release: 1.3.4.0
Assignee: Simone Tiraboschi
QA Contact: Artyom
URL:
Whiteboard:
Duplicates: 1316143
Depends On: 1319721
Blocks: 1317895 1317901
 
Reported: 2016-02-11 19:43 UTC by Paul
Modified: 2017-05-11 09:27 UTC
9 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-04-05 13:55:05 UTC
oVirt Team: Integration
Embargoed:
rule-engine: ovirt-3.6.z+
ylavi: planning_ack+
sbonazzo: devel_ack+
mavital: testing_ack+


Attachments
VDSM logs on getImagesList (28.94 KB, text/plain)
2016-03-21 09:49 UTC, Simone Tiraboschi
no flags


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1274622 0 high CLOSED getImagesList fails if called on a file based storageDomain which is not connected to any storage pool 2021-02-22 00:41:40 UTC
Red Hat Bugzilla 1316143 0 urgent CLOSED 3.6 hosted-engine hosts can't be added properly to 3.6 host cluster that was started with 3.4. 2021-02-22 00:41:40 UTC
Red Hat Bugzilla 1319721 0 urgent CLOSED Call to getImagesList on NFS on host without connected storage pool but with SD, return {'status': {'message': 'OK', 'co... 2021-02-22 00:41:40 UTC
oVirt gerrit 54062 0 master MERGED upgrade: prevent deploying additional 3.6 host on a 3.5 setup 2020-02-25 08:53:35 UTC
oVirt gerrit 54307 0 ovirt-hosted-engine-setup-1.3 MERGED upgrade: prevent deploying additional 3.6 host on a 3.5 setup 2020-02-25 08:53:35 UTC
oVirt gerrit 54595 0 master MERGED storage: reloading other volumes uuid from disk 2020-02-25 08:53:35 UTC
oVirt gerrit 54737 0 ovirt-hosted-engine-setup-1.3 MERGED storage: reloading other volumes uuid from disk 2020-02-25 08:53:35 UTC

Internal Links: 1274622 1316143 1319721

Description Paul 2016-02-11 19:43:13 UTC
Description of problem:

The hosted-engine fails to upgrade / auto import the storage domain. 

Version-Release number of selected component (if applicable):

( on host ) rpm -qa ovirt* vdsm*
ovirt-vmconsole-1.0.0-1.el7.centos.noarch
ovirt-setup-lib-1.0.1-1.el7.centos.noarch
vdsm-yajsonrpc-4.17.18-1.el7.noarch
vdsm-cli-4.17.18-1.el7.noarch
ovirt-hosted-engine-ha-1.3.3.7-1.el7.centos.noarch
ovirt-release36-003-1.noarch
vdsm-xmlrpc-4.17.18-1.el7.noarch
vdsm-4.17.18-1.el7.noarch
ovirt-host-deploy-1.4.1-1.el7.centos.noarch
ovirt-hosted-engine-setup-1.3.2.3-1.el7.centos.noarch
vdsm-python-4.17.18-1.el7.noarch
vdsm-hook-vmfex-dev-4.17.18-1.el7.noarch
ovirt-release35-006-1.noarch
ovirt-vmconsole-host-1.0.0-1.el7.centos.noarch
ovirt-engine-sdk-python-3.6.2.1-1.el7.centos.noarch
vdsm-infra-4.17.18-1.el7.noarch
vdsm-jsonrpc-4.17.18-1.el7.noarch


How reproducible:

Upgrade hosts from el6 to el7 and upgrade ovirt to 3.6

 
Steps to Reproduce:

1. Upgrade oVirt to 3.6 on el6
2. Upgrade hosts to el7 (vdsmd is not updated on el6 anymore)
3. Deploy the reinstalled el7 hosts to oVirt
4. Restart ovirt-ha-agent (the host is in maintenance mode!)
5. The upgrade should start, but it fails

Actual results:

/var/log/ovirt-hosted-engine-ha/agent.log
MainThread::INFO::2016-02-11 20:06:19,191::hosted_engine::744::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_sanlock) Acquired lock on host id 2
MainThread::INFO::2016-02-11 20:06:19,192::upgrade::947::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(upgrade) Upgrading to current version
MainThread::INFO::2016-02-11 20:06:19,592::upgrade::819::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_is_in_engine_maintenance) This host is connected to other storage pools
MainThread::ERROR::2016-02-11 20:06:19,592::upgrade::950::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(upgrade) Unable to upgrade while not in maintenance mode: please put this host into maintenance mode from the engine, and manually restart this service when ready
 
on hosted-engine /var/log/ovirt-engine/engine.log



2016-02-11 20:06:19,304 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.FullListVDSCommand] (DefaultQuartzScheduler_Worker-19) [66a28e10] START, FullListVDSCommand(HostName = , FullListVDSCommandParameters:{runAsync='true', hostId='41894d95-ef99-45a8-bd5d-c59d6e4c5e2e', vds='Host[,41894d95-ef99-45a8-bd5d-c59d6e4c5e2e]', vmIds='[8cb6bafc-abd7-49d8-b781-a6f37e63430a]'}), log id: 7084c699
2016-02-11 20:06:20,311 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.FullListVDSCommand] (DefaultQuartzScheduler_Worker-19) [66a28e10] FINISH, FullListVDSCommand, return: [{status=Up, nicModel=rtl8139,pv, emulatedMachine=pc, guestDiskMapping={QEMU_DVD-ROM_={name=/dev/sr0}, 842b979e-9c0a-4337-b={name=/dev/vda}}, vmId=8cb6bafc-abd7-49d8-b781-a6f37e63430a, pid=385, devices=[Ljava.lang.Object;@1d0aa085, smp=2, vmType=kvm, displayIp=0, display=vnc, displaySecurePort=-1, memSize=8194, displayPort=5900, cpuType=Westmere, spiceSecureChannels=smain,sdisplay,sinputs,scursor,splayback,srecord,ssmartcard,susbredir, statusTime=4461066420, vmName=HostedEngine, clientIp=, pauseCode=NOERR}], log id: 7084c699
2016-02-11 20:06:20,322 INFO  [org.ovirt.engine.core.bll.storage.GetExistingStorageDomainListQuery] (org.ovirt.thread.pool-8-thread-35) [] START, GetExistingStorageDomainListQuery(GetExistingStorageDomainListParameters:{refresh='true', filtered='false'}), log id: 7d08449
2016-02-11 20:06:20,323 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMGetStorageDomainsListVDSCommand] (org.ovirt.thread.pool-8-thread-35) [] START, HSMGetStorageDomainsListVDSCommand(HostName = geisha-2.pazion.nl, HSMGetStorageDomainsListVDSCommandParameters:{runAsync='true', hostId='41894d95-ef99-45a8-bd5d-c59d6e4c5e2e', storagePoolId='00000000-0000-0000-0000-000000000000', storageType='null', storageDomainType='Data', path='null'}), log id: a9cd8a1
2016-02-11 20:06:21,837 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMGetStorageDomainsListVDSCommand] (org.ovirt.thread.pool-8-thread-35) [] FINISH, HSMGetStorageDomainsListVDSCommand, return: [7ebbf9af-f4aa-4639-be31-ee4aa38ccea6, 8744765f-729f-4687-905e-1edd3546a16e, 88b69eba-ef4f-4dbe-ba53-20dadd424d0e], log id: a9cd8a1
2016-02-11 20:06:21,862 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMGetStorageDomainInfoVDSCommand] (org.ovirt.thread.pool-8-thread-35) [] START, HSMGetStorageDomainInfoVDSCommand(HostName = geisha-2.pazion.nl, HSMGetStorageDomainInfoVDSCommandParameters:{runAsync='true', hostId='41894d95-ef99-45a8-bd5d-c59d6e4c5e2e', storageDomainId='88b69eba-ef4f-4dbe-ba53-20dadd424d0e'}), log id: 38c26d07
2016-02-11 20:06:22,877 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMGetStorageDomainInfoVDSCommand] (org.ovirt.thread.pool-8-thread-35) [] FINISH, HSMGetStorageDomainInfoVDSCommand, return: <StorageDomainStatic:{name='hostedengine_nfs', id='88b69eba-ef4f-4dbe-ba53-20dadd424d0e'}, 499b208c-9de9-4a2a-97de-30f410b4e6d4>, log id: 38c26d07
2016-02-11 20:06:22,877 INFO  [org.ovirt.engine.core.bll.storage.GetExistingStorageDomainListQuery] (org.ovirt.thread.pool-8-thread-35) [] FINISH, GetExistingStorageDomainListQuery, log id: 7d08449
2016-02-11 20:06:22,877 INFO  [org.ovirt.engine.core.bll.ImportHostedEngineStorageDomainCommand] (org.ovirt.thread.pool-8-thread-35) [46900451] Lock Acquired to object 'EngineLock:{exclusiveLocks='[]', sharedLocks='null'}'
2016-02-11 20:06:22,896 WARN  [org.ovirt.engine.core.bll.ImportHostedEngineStorageDomainCommand] (org.ovirt.thread.pool-8-thread-35) [46900451] CanDoAction of action 'ImportHostedEngineStorageDomain' failed for user SYSTEM. Reasons: VAR__ACTION__ADD,VAR__TYPE__STORAGE__DOMAIN,ACTION_TYPE_FAILED_STORAGE_DOMAIN_NOT_EXIST
2016-02-11 20:06:22,896 INFO  [org.ovirt.engine.core.bll.ImportHostedEngineStorageDomainCommand] (org.ovirt.thread.pool-8-thread-35) [46900451] Lock freed to object 'EngineLock:{exclusiveLocks='[]', sharedLocks='null'}'

Expected results:

An upgraded hosted engine with the storage domain available in the web GUI.


Additional info:

The oVirt install was created on 3.4 and has been upgraded multiple times with success.
The hosted engine is on NFSv3.
The master storage domain is on FC.

I had serious problems with this upgrade path (hosts el6 to el7):
- https://www.mail-archive.com/users@ovirt.org/msg30964.html
- during hosted-engine --deploy, /var/run/vdsm/storage was missing
- problems with spUUID= in hosted-engine.conf (no connection), so I reset it to the 0000-0000 value in hosted-engine.conf as suggested (http://screencast.com/t/n0yFcgd5gC)

Finally the hosted engine is working and I was able to set the cluster compatibility in the web GUI to 3.6 and save.

The only issue besides the failed upgrade is that I have commented out #conf_volume_UUID; if I don't, I get these in the logs:

MainThread::INFO::2016-02-11 20:40:01,424::config::205::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine.config::(refresh_local_conf_file) Trying to get a fresher copy of vm configuration from the OVF_STORE
MainThread::WARNING::2016-02-11 20:40:01,425::ovf_store::105::ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore::(scan) Unable to find OVF_STORE
MainThread::ERROR::2016-02-11 20:40:01,425::config::234::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine.config::(refresh_local_conf_file) Unable to get vm.conf from OVF_STORE, falling back to initial vm.conf
MainThread::ERROR::2016-02-11 20:40:01,425::agent::205::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Error: 'Path to volume 125f858b-6e2a-444b-bbfa-e45a328075f6 not found in /rhev/data-center/mnt' - trying to restart agent


Tried both local and global maintenance mode.
I tried to remove any existing connection to the host and did:

service ovirt-ha-broker stop
service ovirt-ha-agent stop
sanlock client shutdown -f 1
service sanlock stop
service vdsmd restart
umount /rhev/data-center/mnt/hostedstorage.pazion.nl\:_opt_hosted-engine/
service sanlock start
service ovirt-ha-broker start

I tried to restart the host and trigger the upgrade.
I also tried to shut down the hosted engine and restart ovirt-ha-agent.
Still the same error, which makes me believe I hit a bug...

I keep getting the 'connected to other storage pools' error and the 'host is not in maintenance mode' error.
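
For reference, whether VDSM still reports connected storage pools (which is what the 'Unable to upgrade while not in maintenance mode' check complains about) can be confirmed from the host with a couple of vdscli calls. A minimal sketch that only illustrates the idea behind the agent's check (the agent itself may check more than this):

 # Ask VDSM whether this host still has connected storage pools.
 from vdsm import vdscli

 cli = vdscli.connect(timeout=60)
 result = cli.getConnectedStoragePoolsList()
 pools = result.get('poollist', [])
 if result['status']['code'] == 0 and pools:
     print('Still connected to storage pools: %s' % pools)
     print('Put the host into maintenance from the engine, then restart ovirt-ha-agent.')
 else:
     print('No connected storage pools reported by VDSM.')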

Comment 1 Paul 2016-02-11 19:57:16 UTC
I found https://bugzilla.redhat.com/show_bug.cgi?id=1271771
and https://bugzilla.redhat.com/show_bug.cgi?id=1294457

So I tried to set HostedEngineStorageDomainName and restart ovirt-ha-agent, but this had no effect either: I got the same 'connected to storage pools' / 'put the host in maintenance' error again.

Comment 2 Paul 2016-02-11 19:58:34 UTC
Related to https://bugzilla.redhat.com/show_bug.cgi?id=1269768

Comment 3 Doron Fediuck 2016-02-21 08:46:06 UTC
(In reply to Paul from comment #0)
> Description of problem:
> 
> The hosted-engine fails to upgrade / auto import the storage domain. 
> 

>  
> Steps to Reproduce:
> 
> 1. Upgrade oVirt to 3.6 on el6
> 2. Upgrade hosts to el7 (vdsmd is not updated on el6 anymore)

Hi,
the above is not the proper upgrade path since you are moving between operating
systems.

The right way to upgrade is while still in 3.5. 
Here's a list of all steps to properly upgrade your HE setup:

Assumptions
------------
- All hosts are running latest 3.5 on el6
- Using 3.5 hosted engine.
- The 3.5 cluster has a redundancy of at least 1 host (since we need to be able to take down a host and its VMs will be evacuated).

Phase 1: el6 to el7 (must be done in 3.5)
-----------------------------------------
- Move one host to maintenance.
- Reinstall the host as el7
- Create new 3.5 cluster for el7
- Add the new host to the new cluster (in 3.5 compatibility mode: engine is still 3.5)
- Run hosted engine deploy on the new host.
- Migrate VMs directly to the new host.
- Repeat the above for the other hosts until you get to the last host running the HE VM.
- Stop the HE VM running in the el6 cluster. It should be automatically started in the el7 cluster.
- Take the last el6 host down to maintenance and reinstall it, add to the new el7 cluster.

Phase 2: 3.5 to 3.6
--------------------
- Move the HE VM to global maintenance.
- Upgrade the engine RPMs to 3.6.
- Start the engine.
- Upgrade all the hosts from 3.5 to 3.6 (you must move each host to maintenance from the UI so the engine sees it as in maintenance mode, and also put it into local maintenance mode).
- Once everything is stable and running, change cluster compatibility mode to 3.6
- Make sure you have an additional SD. If not, add it to ensure the VM is properly imported.

Comment 4 Doron Fediuck 2016-02-21 08:48:05 UTC
Please re-open if you find any issues while following the instructions in comment 3.

Comment 5 Paul 2016-02-21 09:56:04 UTC
Upgrade path followed:
I had problems during the upgrade, so what I did was downgrade one host to oVirt 3.5 and then run the upgrade path described in phase 1.

The problem is that the whole host upgrade was done because I found out 3.6 was not supported on el6, so by this time I had already upgraded my engine.

In the quest to get everything properly upgraded I set the DC compatibility to 3.6, as well as the cluster.

I have tried different options and also have a problem with the storage domain being imported (is this what you meant with "Make sure you have an additional SD. If not, add it to ensure the VM is properly imported."?):
http://lists.ovirt.org/pipermail/users/2016-February/038023.html


Possible solution?:
I am a bit stuck in the middle. Besides the above upgrade problem, my hosted_storage domain is named differently and is only imported (and locked) when the name is set to hostedengine_nfs. These might be closely related, though.

So my main question is: how can I upgrade my setup to 3.6?

Is there a way to downgrade the hosted engine to 3.5 (and the DC and cluster compatibility) and rerun the upgrade?

Or is there some fix in 3.6.3 which might provide a solution?

Or is my only solution to create a new hosted engine (clean install from 3.6) and then disconnect the old master storage domain and import the storage domain on the new hosted engine?

Comment 6 Simone Tiraboschi 2016-02-29 13:22:26 UTC
I'm for re-opening it:
currently trying to add a fresh el7 host to a 3.5 hosted-engine instance using hosted-engine-setup from 3.6 fails with:
 [ INFO  ] Stage: Setup validation
 [ ERROR ] Failed to execute stage 'Setup validation': Unable to prepare image: Unknown pool id, pool not connected: ('f8eee402-e5f8-4325-9f5c-81acc65e57aa',)
 [ INFO  ] Stage: Clean up
which is not that clear.

Really supporting it (a direct 3.5 on el6 -> 3.6 on el7 upgrade) is definitely not worth it, since the proposed way (upgrading from el6 -> el7 while on 3.5 and only after that upgrading 3.5 -> 3.6) works correctly.
But we should at least provide a clear error message when the user is not on the supported path, i.e. when the user tries to add a 3.6 host to an existing 3.5 instance.

The severity is not that high since it's just about providing a clear error message instead of an internal one.
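
For the record, a minimal illustration of the kind of guard introduced by the linked 'prevent deploying additional 3.6 host on a 3.5 setup' change: abort with a clear message when the configuration fetched from the existing HE host still describes a 3.5-style setup, i.e. a hosted-engine storage domain attached to a storage pool (non-blank spUUID). The detection key, file path and message below are assumptions for illustration only, not the merged code:

 BLANK_UUID = '00000000-0000-0000-0000-000000000000'

 def is_35_style_setup(conf_path):
     """Parse a key=value hosted-engine.conf and check whether spUUID is set."""
     values = {}
     with open(conf_path) as f:
         for line in f:
             line = line.strip()
             if line and not line.startswith('#') and '=' in line:
                 key, _, value = line.partition('=')
                 values[key.strip()] = value.strip()
     return values.get('spUUID', BLANK_UUID) not in ('', BLANK_UUID)

 # hypothetical location of the configuration fetched from the existing host
 if is_35_style_setup('/tmp/fetched-hosted-engine.conf'):
     raise RuntimeError(
         'The existing HE hosts look like a 3.5 setup: please upgrade them '
         'to the current release before adding this host.')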

Comment 7 Paul 2016-03-02 22:10:22 UTC
I upgraded the hosted engine to 3.6.3 and the storage domain, which was locked at first, became active in the web interface.

From there things started moving and I was able to put hosts in maintenance and the OVF stores were created and hosted-engine.conf got updated.

Looking back, I think there were several (related) issues I encountered that somehow got fixed in 3.6.3:

- pool UUID problem, with error not found and still connected
- differently named hosted_storage 

So after quite a struggle and thanks to the new 3.6.3 version, my hosts have been upgraded to el7 and my platform is upgraded to 3.6.

Thanks to everyone for the help, and especially to Simone Tiraboschi for helping me by answering my posts on the oVirt users list.

Thanks!

Comment 8 Artyom 2016-03-20 14:45:37 UTC
Upgrade 3.4 el6 -> 3.5 el6 -> 3.5 el7 -> 3.6 el7

3.4 el6 - Versions
================================================
ovirt-hosted-engine-setup-1.1.5-1.el6ev.noarch
ovirt-hosted-engine-ha-1.1.6-3.el6ev.noarch
vdsm-python-zombiereaper-4.14.18-7.el6ev.noarch
vdsm-cli-4.14.18-7.el6ev.noarch
vdsm-4.14.18-7.el6ev.x86_64
vdsm-xmlrpc-4.14.18-7.el6ev.noarch
vdsm-python-4.14.18-7.el6ev.x86_64


3.5 el6 - Versions:
================================================
kernel-2.6.32-573.el6.x86_64
ovirt-hosted-engine-setup-1.2.6.1-1.el6ev.noarch
ovirt-hosted-engine-ha-1.2.10-1.el6ev.noarch
vdsm-jsonrpc-4.16.36-1.el6ev.noarch
vdsm-python-4.16.36-1.el6ev.noarch
vdsm-cli-4.16.36-1.el6ev.noarch
vdsm-yajsonrpc-4.16.36-1.el6ev.noarch
vdsm-4.16.36-1.el6ev.x86_64
vdsm-python-zombiereaper-4.16.36-1.el6ev.noarch
vdsm-xmlrpc-4.16.36-1.el6ev.noarch

3.5 el7 Versions:
================================================
kernel-3.10.0-327.13.1.el7.x86_64
ovirt-hosted-engine-ha-1.2.10-1.el7ev.noarch
ovirt-hosted-engine-setup-1.2.6.1-1.el7ev.noarch
vdsm-4.16.36-1.el7ev.x86_64
vdsm-python-zombiereaper-4.16.36-1.el7ev.noarch
vdsm-xmlrpc-4.16.36-1.el7ev.noarch
vdsm-jsonrpc-4.16.36-1.el7ev.noarch
vdsm-hook-ethtool-options-4.16.36-1.el7ev.noarch
vdsm-python-4.16.36-1.el7ev.noarch
vdsm-yajsonrpc-4.16.36-1.el7ev.noarch
vdsm-cli-4.16.36-1.el7ev.noarch

3.6 el7 Versions:
================================================
kernel-3.10.0-327.13.1.el7.x86_64
ovirt-hosted-engine-setup-1.3.4.0-1.el7ev.noarch
ovirt-hosted-engine-ha-1.3.5-1.el7ev.noarch
vdsm-python-4.17.23.1-0.el7ev.noarch
vdsm-4.17.23.1-0.el7ev.noarch
vdsm-cli-4.17.23.1-0.el7ev.noarch
vdsm-hook-ethtool-options-4.17.23.1-0.el7ev.noarch
vdsm-jsonrpc-4.17.23.1-0.el7ev.noarch
vdsm-infra-4.17.23.1-0.el7ev.noarch
vdsm-yajsonrpc-4.17.23.1-0.el7ev.noarch
vdsm-hook-vmfex-dev-4.17.23.1-0.el7ev.noarch
vdsm-xmlrpc-4.17.23.1-0.el7ev.noarch
================================================

Steps 3.4 el6 -> 3.5 el6:

1) Start with a 3.4 engine (el6) and two hosts (el6) - the environment has an NFS storage domain and two additional running VMs
2) Set global maintenance
3) Upgrade engine to 3.5
4) Put first host to maintenance
5) Upgrade first host to 3.5(el6)
6) Activate first host
7) Put second host to maintenance
8) Upgrade second host to 3.5(el6)
9) Activate host
10) Change cluster and datacenter compatibility version to 3.5

PASS

Steps 3.5 el6 -> 3.5 el7:

1) Put first host to maintenance and remove it from engine
2) Reprovision first host to 3.5 el7
3) Redeploy first host via the hosted-engine tool (use the second host as the source for the config file and the W/A from https://bugzilla.redhat.com/show_bug.cgi?id=1308962)
4) Put second host to maintenance and remove it from engine
5) Reprovision second host to 3.5 el7
6) Redeploy second host via the hosted-engine tool (use the first host as the source for the config file and the W/A from https://bugzilla.redhat.com/show_bug.cgi?id=1308962)
PASS

Steps 3.5 el7 -> 3.6 el7:

1) Set global maintenance
2) Upgrade engine to 3.6
3) Put first host to maintenance
4) Upgrade first host to 3.6
5) Activate first host
6) Put second host to maintenance
7) Upgrade second host to 3.6
8) Activate host
9) Change cluster and datacenter compatibility version to 3.6

PASS

Deploy additional 3.6 host:
### Please upgrade the existing HE hosts to current release before adding this host.
### Please check the log file for more details.
***Q:STRING OVEHOSTED_PREVENT_MIXING_HE_35_CURRENT
### Replying "No" will abort Setup.
### Continue? (Yes, No) [No] Yes
Deploy succeeded

!!! ovirt-ha-agent dropped to a failed state !!!
agent.log
=============================
MainThread::DEBUG::2016-03-20 14:12:13,181::brokerlink::273::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(_communicate) Sending request: notify time=1458475933.18 type=state_transition detail=StartState-ReinitializeFSM hostname='alma07.qa.lab.tlv.redhat.com'
MainThread::DEBUG::2016-03-20 14:12:13,181::util::77::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(socket_readline) socket_readline with 30.0 seconds timeout
MainThread::DEBUG::2016-03-20 14:12:13,182::brokerlink::282::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(_communicate) Full response: failure <type 'exceptions.RuntimeError'>
MainThread::DEBUG::2016-03-20 14:12:13,182::brokerlink::258::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(_checked_communicate) Failed response from socket
MainThread::ERROR::2016-03-20 14:12:13,183::agent::205::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Error: 'Failed to start monitor <type 'type'>, options {'hostname': 'alma07.qa.lab.tlv.redhat.com'}: Request failed: <type 'exceptions.RuntimeError'>' - trying to restart agent

broker.log
=============================
Thread-32979::ERROR::2016-03-20 14:12:18,978::listener::192::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(handle) Error handling request, data: "notify time=1458475938.98 type=state_transition detail=StartState-ReinitializeFSM hostname='alma07.qa.lab.tlv.redhat.com'"
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/listener.py", line 166, in handle
    data)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/listener.py", line 302, in _dispatch
    if notifications.notify(**options):
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/notifications.py", line 54, in notify
    archive_fname=constants.NOTIFY_CONF_FILE_ARCHIVE_FNAME,
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/env/config.py", line 243, in refresh_local_conf_file
    conf_vol_id,
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/heconflib.py", line 273, in get_volume_path
    vol_uuid=vol_uuid,
RuntimeError: Path to volume None not found in /rhev/data-center/mnt

It looks like this happens because the answer file from the first host does not have values for:
OVEHOSTED_STORAGE/confImageUUID
OVEHOSTED_STORAGE/confVolUUID
so we end up with these values under hosted-engine.conf:
conf_volume_UUID=None
conf_image_UUID=None
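
A quick way to spot this on the new host before restarting the agent is to parse hosted-engine.conf and flag the two keys above (just a sketch, assuming the usual /etc/ovirt-hosted-engine/hosted-engine.conf location):

 # Flag conf_volume_UUID / conf_image_UUID values that are missing or 'None'.
 conf_path = '/etc/ovirt-hosted-engine/hosted-engine.conf'
 wanted = ('conf_volume_UUID', 'conf_image_UUID')

 values = {}
 with open(conf_path) as f:
     for line in f:
         line = line.strip()
         if line and not line.startswith('#') and '=' in line:
             key, _, value = line.partition('=')
             values[key.strip()] = value.strip()

 for key in wanted:
     if not values.get(key) or values[key] == 'None':
         print('%s is missing or None: copy the value from a working HE host '
               'before restarting ovirt-ha-agent' % key)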

After I added the values to the hosted-engine.conf file (taken from the other host) and restarted the host, I see that ovirt-ha-agent succeeds to start up, but:
MainThread::ERROR::2016-03-20 14:12:18,806::upgrade::980::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(upgrade_35_36) Unable to upgrade while not in maintenance mode: please put this host into maintenance mode from the engine, and manually restart this service when ready

I put the host into maintenance via the engine and restarted ovirt-ha-agent, but again I encountered an error message:
MainThread::ERROR::2016-03-20 15:01:07,890::agent::205::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Error: 'Unable to connect SP: Wrong Master domain or its version: 'SD=3ac831d6-6124-4b42-a060-f89c64be09a1, pool=52313664-01fd-4df9-b2ae-d87cf7e2c81a'' - trying to restart agent

from vdsm:
Thread-3066::ERROR::2016-03-20 15:04:18,304::task::866::Storage.TaskManager.Task::(_setError) Task=`86520e58-b47d-4a10-8a5a-c8f69a6f7533`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 873, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 49, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 1036, in connectStoragePool
    spUUID, hostID, msdUUID, masterVersion, domainsMap)
  File "/usr/share/vdsm/storage/hsm.py", line 1101, in _connectStoragePool
    res = pool.connect(hostID, msdUUID, masterVersion)
  File "/usr/share/vdsm/storage/sp.py", line 657, in connect
    self.__rebuild(msdUUID=msdUUID, masterVersion=masterVersion)
  File "/usr/share/vdsm/storage/sp.py", line 1231, in __rebuild
    self.setMasterDomain(msdUUID, masterVersion)
  File "/usr/share/vdsm/storage/sp.py", line 1447, in setMasterDomain
    raise se.StoragePoolWrongMaster(self.spUUID, msdUUID)
StoragePoolWrongMaster: Wrong Master domain or its version: 'SD=3ac831d6-6124-4b42-a060-f89c64be09a1, pool=52313664-01fd-4df9-b2ae-d87cf7e2c81a'

For some reason
# vdsClient -s 0 getImagesList 3ac831d6-6124-4b42-a060-f89c64be09a1
returns an empty list, while on the other host
# vdsClient -s 0 getImagesList 3ac831d6-6124-4b42-a060-f89c64be09a1
9ecd4e5f-bb24-4fd6-8c20-c442425b59b6
b6b637a4-37be-48e9-aacb-e3d4a6be29cc
995171f0-1abb-488b-9b18-3e17aad0c3de
4d1915e1-a9f7-4bca-b666-0997adec5ef4

Comment 9 Red Hat Bugzilla Rules Engine 2016-03-20 14:45:43 UTC
Target release should be placed once a package build is known to fix an issue. Since this bug is not modified, the target version has been reset. Please use the target milestone to plan a fix for an oVirt release.

Comment 10 Simone Tiraboschi 2016-03-21 09:22:15 UTC
(In reply to Artyom from comment #8)
> For some reason
> # vdsClient -s 0 getImagesList 3ac831d6-6124-4b42-a060-f89c64be09a1
> returns an empty list, while on the other host
> # vdsClient -s 0 getImagesList 3ac831d6-6124-4b42-a060-f89c64be09a1
> 9ecd4e5f-bb24-4fd6-8c20-c442425b59b6
> b6b637a4-37be-48e9-aacb-e3d4a6be29cc
> 995171f0-1abb-488b-9b18-3e17aad0c3de
> 4d1915e1-a9f7-4bca-b666-0997adec5ef4

The issue is right here:
the images are available on the NFS share

[root@alma07 images]# pwd
/rhev/data-center/mnt/10.35.64.11:_vol_RHEV_Virt_alukiano__HE__upgrade/3ac831d6-6124-4b42-a060-f89c64be09a1/images
[root@alma07 images]# ls -l
total 16
drwxr-xr-x. 2 vdsm kvm 4096 17 mar 15.36 4d1915e1-a9f7-4bca-b666-0997adec5ef4
drwxr-xr-x. 2 vdsm kvm 4096 18 mar 00.52 995171f0-1abb-488b-9b18-3e17aad0c3de
drwxr-xr-x. 2 vdsm kvm 4096 20 mar 15.40 9ecd4e5f-bb24-4fd6-8c20-c442425b59b6
drwxr-xr-x. 2 vdsm kvm 4096 18 mar 00.52 b6b637a4-37be-48e9-aacb-e3d4a6be29cc

but VDSM is not reporting them.

I reproduced it with a small Python script:
 from vdsm import vdscli

 sdUUID = '3ac831d6-6124-4b42-a060-f89c64be09a1'
 cli = vdscli.connect(timeout=60)
 result = cli.getImagesList(sdUUID)
 print(result)
 result = cli.getConnectedStoragePoolsList()
 print(result)

That prints:
 {'status': {'message': 'OK', 'code': 0}, 'imageslist': []}
 {'status': {'message': 'OK', 'code': 0}, 'poollist': []}

From the past we know that VDSM's getImagesList wasn't working on NFS when not connected to a storage pool (see rhbz#1274622), but in that case it was returning:
 {'status': {'message': 'list index out of range', 'code': 100}}
and we implemented a workaround for that.

But now it's still not working (imageslist = [] when the images are there) and it also returns a wrong error code of 0, hiding the issue, so our workaround doesn't trigger.
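
For reference, a sketch of the fallback idea (not the merged patch itself, which reloads the other volumes' UUIDs from disk): if getImagesList comes back OK but empty while the domain clearly has images on disk, read the image UUIDs straight from the storage domain's images/ directory on the mount, following the layout shown above:

 import glob
 import os

 from vdsm import vdscli

 sdUUID = '3ac831d6-6124-4b42-a060-f89c64be09a1'

 cli = vdscli.connect(timeout=60)
 result = cli.getImagesList(sdUUID)
 images = result.get('imageslist', []) if result['status']['code'] == 0 else []

 if not images:
     # Fall back to scanning /rhev/data-center/mnt/<server>/<sdUUID>/images/.
     pattern = os.path.join('/rhev/data-center/mnt', '*', sdUUID, 'images', '*')
     images = [os.path.basename(p) for p in glob.glob(pattern) if os.path.isdir(p)]

 print(images)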

Comment 11 Sandro Bonazzola 2016-03-21 09:26:22 UTC
So it's a regression in VDSM. Please open a BZ against VDSM and let's make it blocking for 3.6.4.
Moving this bug back to QA; please make the new BZ block this one too.

Comment 12 Simone Tiraboschi 2016-03-21 09:49:56 UTC
Created attachment 1138524 [details]
VDSM logs on getImagesList

Comment 13 Simone Tiraboschi 2016-03-21 12:23:58 UTC
*** Bug 1316143 has been marked as a duplicate of this bug. ***

Comment 14 Artyom 2016-03-23 09:57:58 UTC
Verified
Upgrade 3.4 el6 -> 3.5 el6 -> 3.5 el7 -> 3.6 el7

3.4 el6 - Versions
================================================
ovirt-hosted-engine-setup-1.1.5-1.el6ev.noarch
ovirt-hosted-engine-ha-1.1.6-3.el6ev.noarch
vdsm-python-zombiereaper-4.14.18-7.el6ev.noarch
vdsm-cli-4.14.18-7.el6ev.noarch
vdsm-4.14.18-7.el6ev.x86_64
vdsm-xmlrpc-4.14.18-7.el6ev.noarch
vdsm-python-4.14.18-7.el6ev.x86_64


3.5 el6 - Versions:
================================================
kernel-2.6.32-573.el6.x86_64
ovirt-hosted-engine-setup-1.2.6.1-1.el6ev.noarch
ovirt-hosted-engine-ha-1.2.10-1.el6ev.noarch
vdsm-jsonrpc-4.16.36-1.el6ev.noarch
vdsm-python-4.16.36-1.el6ev.noarch
vdsm-cli-4.16.36-1.el6ev.noarch
vdsm-yajsonrpc-4.16.36-1.el6ev.noarch
vdsm-4.16.36-1.el6ev.x86_64
vdsm-python-zombiereaper-4.16.36-1.el6ev.noarch
vdsm-xmlrpc-4.16.36-1.el6ev.noarch

3.5 el7 Versions:
================================================
kernel-3.10.0-327.13.1.el7.x86_64
ovirt-hosted-engine-ha-1.2.10-1.el7ev.noarch
ovirt-hosted-engine-setup-1.2.6.1-1.el7ev.noarch
vdsm-4.16.36-1.el7ev.x86_64
vdsm-python-zombiereaper-4.16.36-1.el7ev.noarch
vdsm-xmlrpc-4.16.36-1.el7ev.noarch
vdsm-jsonrpc-4.16.36-1.el7ev.noarch
vdsm-hook-ethtool-options-4.16.36-1.el7ev.noarch
vdsm-python-4.16.36-1.el7ev.noarch
vdsm-yajsonrpc-4.16.36-1.el7ev.noarch
vdsm-cli-4.16.36-1.el7ev.noarch

3.6 el7 Versions:
================================================
kernel-3.10.0-327.13.1.el7.x86_64
ovirt-hosted-engine-setup-1.3.4.0-1.el7ev.noarch
ovirt-hosted-engine-ha-1.3.5.1-1.el7ev.noarch
vdsm-python-4.17.23.1-0.el7ev.noarch
vdsm-4.17.23.1-0.el7ev.noarch
vdsm-cli-4.17.23.1-0.el7ev.noarch
vdsm-hook-ethtool-options-4.17.23.1-0.el7ev.noarch
vdsm-jsonrpc-4.17.23.1-0.el7ev.noarch
vdsm-infra-4.17.23.1-0.el7ev.noarch
vdsm-yajsonrpc-4.17.23.1-0.el7ev.noarch
vdsm-hook-vmfex-dev-4.17.23.1-0.el7ev.noarch
vdsm-xmlrpc-4.17.23.1-0.el7ev.noarch
================================================

Steps 3.4 el6 -> 3.5 el6:

1) Start with a 3.4 engine (el6) and two hosts (el6) - the environment has an NFS storage domain and two additional running VMs
2) Set global maintenance
3) Upgrade engine to 3.5
4) Put first host to maintenance
5) Upgrade first host to 3.5(el6)
6) Activate first host
7) Put second host to maintenance
8) Upgrade second host to 3.5(el6)
9) Activate host
10) Change cluster and datacenter compatibility version to 3.5

PASS

Steps 3.5 el6 -> 3.5 el7:

1) Put first host to maintenance and remove it from engine
2) Reprovision first host to 3.5 el7
3) Redeploy first host via the hosted-engine tool (use the second host as the source for the config file and the W/A from https://bugzilla.redhat.com/show_bug.cgi?id=1308962)
4) Put second host to maintenance and remove it from engine
5) Reprovision second host to 3.5 el7
6) Redeploy second host via the hosted-engine tool (use the first host as the source for the config file and the W/A from https://bugzilla.redhat.com/show_bug.cgi?id=1308962)
PASS

Steps 3.5 el7 -> 3.6 el7:

1) Set global maintenance
2) Upgrade engine to 3.6
3) Put first host to maintenance
4) Upgrade first host to 3.6
5) Activate first host
6) Put second host to maintenance
7) Upgrade second host to 3.6
8) Activate host
9) Change cluster and datacenter compatibility version to 3.6

PASS

Deploy additional 3.6 host:
### Please upgrade the existing HE hosts to current release before adding this host.
### Please check the log file for more details.
***Q:STRING OVEHOSTED_PREVENT_MIXING_HE_35_CURRENT
### Replying "No" will abort Setup.
### Continue? (Yes, No) [No] Yes
Deploy succeeded

The agent and broker services start successfully without any trouble.

NOTE:
After the host deploy, you will need to put it into maintenance via the engine and restart ovirt-ha-agent to finish the HE upgrade process.

