Bug 1276650

Summary: ovirt-ha-agent will hang during 3.5 -> 3.6 upgrade on NFS ('list index out of range' from getImagesList)
Product: [oVirt] ovirt-hosted-engine-ha Reporter: Simone Tiraboschi <stirabos>
Component: Agent Assignee: Simone Tiraboschi <stirabos>
Status: CLOSED CURRENTRELEASE QA Contact: Artyom <alukiano>
Severity: urgent Docs Contact:
Priority: urgent    
Version: 1.3.0 CC: alukiano, amureini, bugs, mavital, msivak, rnachimu, sbonazzo, ylavi
Target Milestone: ovirt-3.6.0-rc3 Keywords: Triaged
Target Release: 1.3.2 Flags: rule-engine: ovirt-3.6.0+
rule-engine: blocker+
ylavi: planning_ack+
sbonazzo: devel_ack+
mavital: testing_ack+
Hardware: Unspecified   
OS: Unspecified   
Whiteboard: integration
Fixed In Version: Doc Type: Bug Fix
Doc Text:
VDSM getImagesList raises a 'list index out of range' exception if called on a storage domain which is not connected to any SP. Avoid using it directly.
Story Points: ---
Clone Of: Environment:
Last Closed: 2015-12-22 13:30:41 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Integration RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1278130    
Bug Blocks: 1234906, 1285700    

Description Simone Tiraboschi 2015-10-30 11:20:13 UTC
Description of problem:
On NFS storage only, ovirt-ha-agent hangs during the 3.5 -> 3.6 upgrade because it uses vdscli.getImagesList, which is broken and returns {'status': {'message': 'list index out of range', 'code': 100}}

MainThread::INFO::2015-10-30 11:14:38,919::upgrade::125::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_is_conf_volume_there) Looking for conf volume
MainThread::DEBUG::2015-10-30 11:14:38,926::upgrade::131::ovirt_hosted_engine_ha.lib.upgrade.StorageServer::(_is_conf_volume_there) {'status': {'message': 'list index out of range', 'code': 100}}
MainThread::ERROR::2015-10-30 11:14:38,927::agent::205::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Error: 'list index out of range' - trying to restart agent

[root@c71het20151028 ~]# vdsClient -s 0 getImagesList acfcfc14-c2ff-404d-9dbd-89b1743ce10f
list index out of range
[root@c71het20151028 ~]# echo $?
1
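
A minimal sketch of how a caller could guard against this kind of response instead of assuming the call succeeded. It is not the change that went into the agent; the helper names (unwrap_response, list_images_or_none, VdsmCallError) are hypothetical, and the 'imageslist' payload key is an assumption about the response layout. Only the 'status'/'code'/'message' structure is taken from the log above.

```python
class VdsmCallError(Exception):
    """Raised when a VDSM-style response reports a non-zero status code."""


def unwrap_response(response, payload_key):
    """Return response[payload_key] if the status code is 0, else raise."""
    status = response.get('status', {})
    code = status.get('code')
    if code != 0:
        raise VdsmCallError(
            "code=%s message=%s" % (code, status.get('message'))
        )
    return response.get(payload_key, [])


def list_images_or_none(response, payload_key='imageslist'):
    """Return the image list, or None when the call failed.

    'imageslist' is an assumed key name, not confirmed by this report.
    Returning None lets the caller fall back to another lookup method
    instead of crashing and restarting in a loop.
    """
    try:
        return unwrap_response(response, payload_key)
    except VdsmCallError:
        return None


# The failing response reported in this bug:
broken = {'status': {'message': 'list index out of range', 'code': 100}}
print(list_images_or_none(broken))
```

With the response from the log, the helper returns None rather than propagating the error to the agent main loop.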

See also:
https://bugzilla.redhat.com/1274622

Version-Release number of selected component (if applicable):
ovirt-hosted-engine-ha.noarch                                                                                1.3.1-1.el7.centos

How reproducible:
100% (NFS only)

Steps to Reproduce:
1. deploy hosted-engine from 3.5 on NFS
2. upgrade to 3.6

Actual results:
It hangs with:
MainThread::ERROR::2015-10-30 11:14:38,927::agent::205::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Error: 'list index out of range' - trying to restart agent

Expected results:
The upgrade completes successfully.

Additional info:
NFS only, on iSCSI it works.

Comment 1 Sandro Bonazzola 2015-10-30 13:27:58 UTC
Yaniv, maybe we should respin just ovirt-hosted-engine-ha including this fix in 3.6.0 GA. What do you think?

Comment 2 Red Hat Bugzilla Rules Engine 2015-10-30 13:28:04 UTC
This bug is marked for z-stream, yet the milestone is for a major version, therefore the milestone has been reset.
Please set the correct milestone or drop the z stream flag.

Comment 3 Yaniv Lavi 2015-11-01 11:24:58 UTC
Should work in upgrades from 3.5 to 3.6. If you lose HE, you lose the env.

Comment 4 Red Hat Bugzilla Rules Engine 2015-11-01 11:25:02 UTC
This bug is marked for z-stream, yet the milestone is for a major version, therefore the milestone has been reset.
Please set the correct milestone or drop the z stream flag.

Comment 5 Simone Tiraboschi 2015-11-02 08:31:35 UTC
*** Bug 1277013 has been marked as a duplicate of this bug. ***

Comment 6 Sandro Bonazzola 2015-11-02 10:41:17 UTC
Dropping the dependency on bug #1274622 since we can work around it with External Bug ID: oVirt gerrit 47889. Moving it to See also.

Comment 7 Artyom 2015-12-01 16:23:44 UTC
Verified on ovirt-hosted-engine-ha-1.3.3-1.el7ev.noarch
1) Deploy hosted-engine 3.5 on two hosts and on NFS storage
2) Put first host to maintenance via webadmin
3) Upgrade packages and restart the host (the restart is a workaround for bug https://bugzilla.redhat.com/show_bug.cgi?id=1282187)
4) Wait for correct status via hosted-engine --vm-status (can take around 5-7 minutes)
5) Activate host via webadmin
6) Put second host to maintenance (wait until all VMs have migrated and the HE VM has migrated to the first host)
7) Upgrade packages and restart second host
8) Wait for correct status via hosted-engine --vm-status (can take around 5-7 minutes)
9) Activate second host via webadmin
10) Put environment to global maintenance
11) Update rhevm-setup.noarch package on engine
12) Run engine-setup on vm and finish upgrade process
13) Disable global maintenance via webadmin
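
The "wait for correct status" steps above could be automated with a simple polling loop. This is a rough sketch only: wait_until and its parameters are hypothetical, and the check callable is left to the tester (for example, something that runs hosted-engine --vm-status and inspects its output, whose format is not covered here). The default timeout of 420 seconds mirrors the upper end of the 5-7 minutes mentioned in the steps.

```python
import time


def wait_until(check, timeout=420.0, interval=10.0,
               clock=time.monotonic, sleep=time.sleep):
    """Call check() every `interval` seconds until it returns a true value.

    Returns True as soon as check() succeeds, or False once `timeout`
    seconds have elapsed. `clock` and `sleep` are injectable so the loop
    can be exercised without real waiting.
    """
    deadline = clock() + timeout
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

A caller would pass its own status check, for example wait_until(host_status_is_ok), where host_status_is_ok is a placeholder for whatever test the verifier considers "correct status".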

Comment 8 Sandro Bonazzola 2015-12-22 13:30:41 UTC
oVirt 3.6.0 has been released and the bz verified, moving to closed current release.