Created attachment 797220 [details]
engine.log.gz, host4-vdsm.log.gz, host3-vdsm.log.gz

Description of problem:
I blocked network access from the SPM host to the NFS storage with iptables. This host (host4) was running an HA VM. After a while the engine discovered the issue and moved SPM to the other host (host3). The engine also initiated migration of the HA VM (test-rh6-x64). But the migration got stuck, and after that no other life-saving action for this VM happened: the HA VM still runs on the original host (host4), which has no access to storage! The original host is also in the Unassigned state, which means it cannot be put into maintenance etc.

In the Hosts tab, host3 (now the SPM) shows 2 VMs under Virtual Machines (one of them is the stuck migration, test-rh6-x64), while host4 still reports the original HA VM (test-rh6-x64).

There are two running Tasks (newest first):
* Setting Host dell-r210ii-04 to Non-Operational mode.
* Migrating VM test-rh6-x64 because previous Host became non-operational

In Events for the HA VM (test-rh6-x64) there is no event message about a migration started because of HA functionality! In Events for the original SPM host (host4) there is no message that this host lost the SPM role; I would expect to see one there. The new SPM host (host3) does have the message: Storage Pool Manager runs on Host dell-r210ii-03 (Address: 10.34.63.222).

I tried to execute Cancel Migration from the VM's context menu, but it failed:
Failed to cancel migration for VM: test-rh6-x64

Summary: while there is a network issue to the storage, the host is in an odd state and the HA VM is neither migrated nor restarted on another host.

Version-Release number of selected component (if applicable):
sf20.1

How reproducible:
100%

Steps to Reproduce:
1. 2 hosts, NFS storage, one HA VM
2. on the SPM host, block outgoing access to the NFS storage
3. wait and watch what happens

Actual results:
The original SPM host is in an odd state; the HA policy for the VM is not fulfilled, as the VM is stuck in the "Migrating From" state.

Expected results:
Since the SPM was switched without fencing the original host, I would expect that if the migration did not succeed, the HA VM process would be killed and the HA VM started on a host with working access to the storage, rather than being left to rot on the problematic host.

Additional info:
I will try to reproduce the same issue on the latest 3.3, but this situation casts a strange light on the HA functionality in the 3.2 version, which has been made public. Related/similar to BZ 962180?
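Step 2 of the reproducer can be sketched with an iptables rule. The NFS server address below is a hypothetical placeholder (the report does not give the storage address), and NFS is assumed to be reachable on its standard TCP port 2049. The commands are only printed here so they can be reviewed before being run as root on the SPM host:

```shell
# Hypothetical NFS server address; substitute the real storage address.
NFS_SERVER="192.0.2.10"

# Block outgoing NFS traffic from the SPM host (step 2 of the reproducer):
echo "iptables -A OUTPUT -d ${NFS_SERVER} -p tcp --dport 2049 -j DROP"

# Delete the rule again later to restore connectivity:
echo "iptables -D OUTPUT -d ${NFS_SERVER} -p tcp --dport 2049 -j DROP"
```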
engine=> select * from job;

  job_id           | a04fc611-6e93-40f2-b438-956e86f6a727
  action_type      | InternalMigrateVm
  description      | Migrating VM test-rh6-x64 because previous Host became non-operational
  status           | STARTED
  owner_id         | 00000000-0000-0000-0000-000000000000
  visible          | t
  start_time       | 2013-09-13 10:20:54.162+02
  end_time         |
  last_update_time | 2013-09-13 10:20:54.362+02
  correlation_id   | 27886777

  job_id           | 945abad7-1c26-4d94-bd20-3c210acd70c9
  action_type      | SetNonOperationalVds
  description      | Setting Host dell-r210ii-04 to Non-Operational mode.
  status           | STARTED
  owner_id         | 00000000-0000-0000-0000-000000000000
  visible          | t
  start_time       | 2013-09-13 10:26:03.468+02
  end_time         |
  last_update_time | 2013-09-13 10:26:03.482+02
  correlation_id   | 72575726

(2 rows)

engine=> select * from async_tasks;
(0 rows)
After all of this I tried to 'Power Off' the VM. The engine reported the action as successful and the VM state as 'Down', but in fact the qemu-kvm process was still running on the original host (host4). I cleaned all tasks and killed the qemu-kvm process, but nothing changed; the host was still in the Unassigned state. Restarting vdsmd did not help either. After doing some crazy things (removing the forgotten VM from vds_dynamics for the original host) and restarting the engine and vdsmd, I got the host into the 'Non Operational' state, which I suppose is the correct status. After "repairing" the network issue and unmounting the hung NFS storage, I was able to activate the host successfully.
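For reference, the manual cleanup described above corresponds roughly to the following command sequence. This is a sketch only: the qemu-kvm pid and the hung NFS mountpoint are placeholders, and the exact service name may differ between versions. The sequence is printed via a heredoc rather than executed, since it needs root on the affected host:

```shell
# Sketch of the manual cleanup on the stuck host (host4); <...> are placeholders.
cat <<'EOF'
pgrep -af qemu-kvm                  # locate the leftover qemu-kvm process
kill -9 <qemu-kvm-pid>              # force-kill it (pid is a placeholder)
service vdsmd restart               # restart the VDSM daemon
umount -l <hung-nfs-mountpoint>     # lazy-unmount the hung NFS mount
EOF
```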
It's likely that this is already fixed in 3.2.2 (bz 984943). Omer, please confirm?
Not related to bz 984943. This is related to issues we encountered with host monitoring getting stuck, which causes the host status to be stuck in Unassigned and all VM statuses on that host not to be refreshed (bz 977169).
Since bug 977169 has a fix in 3.3 which may handle this issue, I suggest testing this bz to see if it can still be reproduced.
doron, per comment 5 - why isn't this ON_QA to be tested?
After an additional check we were not able to reproduce the same results, so I'd like this to be re-tested with the fix included in bug 977169.
Verified on is26. After blocking the connection between the source host and the storage, SPM passed to the second host and the HA VM also migrated to the second host without trouble; vdsClient showed no VMs on the source host, and the source host was in the Unassigned state. After restoring the connection between the source host and the storage, the host changed state to Up.
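The vdsClient check mentioned above uses the legacy VDSM command-line client on the source host. A minimal sketch of the verification (printed only, since it requires a live VDSM; the `-s 0` flag selects the default SSL-secured connection to the local host):

```shell
# Verification sketch: on the source host, the table of running VMs should be empty.
cat <<'EOF'
vdsClient -s 0 list table
EOF
```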
Closing - RHEV 3.3 Released