Bug 1007747

Summary: Network issue with storage on SPM host with a HA VM ends in having host in unassigned state and HA stuck in migration
Product: Red Hat Enterprise Virtualization Manager
Reporter: Jiri Belka <jbelka>
Component: ovirt-engine
Assignee: Gilad Chaplik <gchaplik>
Status: CLOSED CURRENTRELEASE
QA Contact: Artyom <alukiano>
Severity: urgent
Docs Contact:
Priority: unspecified
Version: 3.2.0
CC: acathrow, dfediuck, iheim, lpeer, mavital, ofrenkel, Rhev-m-bugs, yeylon
Target Milestone: ---
Target Release: 3.3.0
Hardware: Unspecified
OS: Unspecified
Whiteboard: sla
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2014-01-21 22:11:48 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: SLA
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Bug Depends On:
Bug Blocks: 1019461
Attachments: engine.log.gz, host4-vdsm.log.gz, host3-vdsm.log.gz (flags: none)

Description Jiri Belka 2013-09-13 09:12:05 UTC
Created attachment 797220 [details]
engine.log.gz host4-vdsm.log.gz host3-vdsm.log.gz

Description of problem:

I blocked network access from the SPM host to the NFS storage with iptables. This host (host4) was running an HA VM. After a while the engine discovered the issue and moved SPM to another host (host3). The engine also initiated migration of the HA VM (test-rh6-x64). However, the migration got stuck and no further life-saving action was taken for this VM; the HA VM still runs on the original host (host4), which has no access to storage! The original host is also in the Unassigned state, which means you cannot put it into maintenance, etc.

In the Hosts tab, host3 (now the SPM) shows 2 VMs under Virtual Machines (one of them is the stuck migration, test-rh6-x64), while host4 still reports the original HA VM (test-rh6-x64).

There are two running Tasks (newest first):

  * Setting Host dell-r210ii-04 to Non-Operational mode.
  * Migrating VM test-rh6-x64 because previous Host became non-operational

In the Events for the HA VM (test-rh6-x64) there is no event message about a migration being started due to HA functionality!

In the Events for the original SPM host (host4), there is no message that this host lost the SPM role; I would expect to see one there. The new SPM host (host3), however, does have a message:

  Storage Pool Manager runs on Host dell-r210ii-03 (Address: 10.34.63.222).

I tried to execute Cancel Migration from the VM's context menu, but it failed:

  Failed to cancel migration for VM: test-rh6-x64

Summary: while there is a network issue to storage, the host is in an odd state and the HA VM is neither migrated nor restarted on another host.

Version-Release number of selected component (if applicable):
sf20.1

How reproducible:
100%

Steps to Reproduce:
1. 2 hosts, nfs storage, one HA VM
2. on SPM host block outgoing access to NFS storage
3. wait and observe what happens
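The storage block in step 2 can be sketched roughly as follows. This is an illustration only: the NFS server address is a placeholder (the bug report does not give it), and the rules assume NFS over TCP on the standard port 2049; they must be run as root on the SPM host.

```shell
# Hypothetical helper for step 2; NFS_SERVER is a placeholder address,
# not taken from the bug report. NFS over TCP normally uses port 2049.
NFS_SERVER=192.0.2.10

# Rule that would block outgoing NFS traffic from the SPM host (run as root):
BLOCK_RULE="iptables -A OUTPUT -d $NFS_SERVER -p tcp --dport 2049 -j DROP"
# The matching delete rule to restore access afterwards (step for cleanup):
UNBLOCK_RULE="iptables -D OUTPUT -d $NFS_SERVER -p tcp --dport 2049 -j DROP"

# Printed as a dry run so the commands can be reviewed before applying:
echo "$BLOCK_RULE"
echo "$UNBLOCK_RULE"
```

Blocking only the storage path (rather than pulling the host's cable) keeps the engine-to-host management connection alive, which is what exercises the SPM failover and HA migration logic described above.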

Actual results:
the original SPM host is in an odd state, and the HA guarantee is not fulfilled because the VM is stuck in the Migrating From state

Expected results:
Since the SPM role was switched without fencing the original host, I would expect that if the migration did not succeed, the HA VM's process would be killed and the HA VM started on a host with working storage access, rather than being left to rot on the problematic host.

Additional info:
I will try to reproduce the same issue on the latest 3.3, but this situation casts real doubt on the HA functionality in the 3.2 version, which has been made public.

Related/similar to BZ962180 ?

Comment 1 Jiri Belka 2013-09-13 09:14:16 UTC
engine=> select * from job;
                job_id                |     action_type      |                              description                               | status  |               owner_id               | visible |         start_time         | end_time |      last_update_time      | correlation_id 
--------------------------------------+----------------------+------------------------------------------------------------------------+---------+--------------------------------------+---------+----------------------------+----------+----------------------------+----------------
 a04fc611-6e93-40f2-b438-956e86f6a727 | InternalMigrateVm    | Migrating VM test-rh6-x64 because previous Host became non-operational | STARTED | 00000000-0000-0000-0000-000000000000 | t       | 2013-09-13 10:20:54.162+02 |          | 2013-09-13 10:20:54.362+02 | 27886777
 945abad7-1c26-4d94-bd20-3c210acd70c9 | SetNonOperationalVds | Setting Host dell-r210ii-04 to Non-Operational mode.                   | STARTED | 00000000-0000-0000-0000-000000000000 | t       | 2013-09-13 10:26:03.468+02 |          | 2013-09-13 10:26:03.482+02 | 72575726
(2 rows)

engine=> select * from async_tasks;
 task_id | action_type | status | result | action_parameters | action_params_class | step_id | command_id | started_at | storage_pool_id | task_type | task_parameters | task_params_class 
---------+-------------+--------+--------+-------------------+---------------------+---------+------------+------------+-----------------+-----------+-----------------+-------------------
(0 rows)

Comment 2 Jiri Belka 2013-09-13 09:41:48 UTC
After all of this I tried to 'Power Off' the VM. The engine reported the action as successful and the state as 'Down', but in fact the qemu-kvm process is still running on the original host (host4).

I cleaned all tasks and killed the qemu-kvm process, but nothing changed; the host is still in the Unassigned state. Restarting vdsmd did not help either.

After doing some crazy things (removing the forgotten VM from vds_dynamics for the original host) and restarting the engine and vdsmd, I got the host into the 'Non Operational' state, which I suppose is the correct status. After "repairing" the network issue and unmounting the hung NFS storage, I was able to activate the host successfully.

Comment 3 Andrew Cathrow 2013-09-16 15:39:33 UTC
It's likely that this is already fixed in 3.2.2 (bz 984943).
Omer, please confirm?

Comment 4 Omer Frenkel 2013-09-22 11:59:38 UTC
Not related to bz 984943.
This is related to issues we encountered with host monitoring getting stuck,
which causes the host status to be stuck in Unassigned and all VM statuses on that host not to be refreshed (bz 977169).

Comment 5 Doron Fediuck 2013-09-23 09:19:07 UTC
Since bug 977169 has a fix in 3.3, which may handle this issue,
I suggest testing this bz to see if it can still be reproduced.

Comment 6 Itamar Heim 2013-12-11 07:30:59 UTC
doron, per comment 5 - why isn't this ON_QA to be tested?

Comment 7 Doron Fediuck 2013-12-11 10:59:32 UTC
After an additional check we were not able to reproduce the same results,
so I'd like this to be re-tested with the fix included in bug 977169.

Comment 8 Artyom 2013-12-12 17:06:51 UTC
Verified on is26.
After blocking the connection between the source host and storage, SPM moved to the second host and the HA VM also migrated to the second host without trouble; vdsClient showed no VMs on the source host, and the source host was in the Unassigned state.
After restoring the connection between the source host and storage, the host changed state to UP.

Comment 9 Itamar Heim 2014-01-21 22:11:48 UTC
Closing - RHEV 3.3 Released