Description of problem:
I upgraded a cluster from 4.2.6 to 4.2.7, first the nodes, then the engine. After I migrated a virtual machine, the engine now displays it as running on two nodes. Inspection with virsh on both nodes showed that it was only running on the node it was migrated to. The VM details page shows the VM as running on the node it was migrated from, which is not true. When the virtual machine is shut down, it vanishes from the list of VMs on the node it is actually running on, as expected. On the node the VM was migrated from, however, it is displayed as down.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. not clear
VM is displayed as running on two nodes
VM should only be displayed as running on the host it is actually running on
The cluster is running on Ceph Mimic.
I tried to migrate other machines, and they migrate without any problem. My idea for tackling the problem is to stop the VM, detach the storage, delete the VM definition, and create a new VM definition to which the old disk is attached.
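For anyone who wants to repeat the virsh check described above on several nodes at once, here is a minimal sketch. It assumes passwordless SSH to the nodes and uses a read-only libvirt connection (`virsh -r list --name`). The host names and the helper names (`parse_virsh_names`, `hosts_running`, `fetch_listing`) are hypothetical and not part of oVirt.

```python
import subprocess


def parse_virsh_names(output):
    """Turn the raw output of `virsh list --name` into a set of domain names."""
    return {line.strip() for line in output.splitlines() if line.strip()}


def hosts_running(vm_name, listings):
    """Return the hosts whose listing contains vm_name.

    listings maps a host name to the raw `virsh list --name` output
    collected on that host.
    """
    return sorted(
        host for host, out in listings.items()
        if vm_name in parse_virsh_names(out)
    )


def fetch_listing(host):
    """Collect the running domains on a host (assumes SSH access to it)."""
    result = subprocess.run(
        ["ssh", host, "virsh", "-r", "list", "--name"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

Usage would look roughly like `hosts_running("testceph", {h: fetch_listing(h) for h in ("node1", "node2")})`. If the result contains exactly one host while the engine shows two, the engine's view is stale rather than the VM actually running twice.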
Clarification: The cluster is running VMs on a Ceph Mimic storage domain.
Hi, without logs there's really nothing to look at.
Is it reproducible at all? Do you have logs from the engine and from both nodes for the failure you observed?
Created attachment 1507301 [details]
Current engine logs for VM-ID with described symptoms
These are the latest logs concerning the specific VM-ID.
Please attach full logs if possible (as much as I like only seeing the lines for the relevant VM).
I wonder if you hit https://bugzilla.redhat.com/show_bug.cgi?id=1647388 during the upgrade. Was the VM migrated as the other host was in maintenance and upgraded, maybe?
Created attachment 1507442 [details]
combined engine logs Part A
Created attachment 1507443 [details]
combined engine logs Part B
(In reply to Ryan Barry from comment #4)
> Please attach full logs if possible (as much as I like only seeing the lines
> for the relevant VM).
> I wonder if you hit https://bugzilla.redhat.com/show_bug.cgi?id=1647388
> during the upgrade. Was the VM migrated as the other host was in maintenance
> and upgraded, maybe?
No. While upgrading no other actions are performed on the cluster. My upgrade for nodes is as follows:
1. go to maintenance and let VMs migrate to other nodes.
2. upgrade through UI
3. Wait for node to return to maintenance mode
4. Log in to the node and run "yum update" to re-install more current Ceph libraries.
5. Reboot again via UI "ssh restart"
6. Wait for node to return to maintenance mode
8. Migrate testceph (the VM showing the symptoms) to the newly activated node for test purposes.
Hmm. I'm not able to reproduce this, and the logs look relatively normal (you even put the hosts into maintenance in order).
At what point did the symptom show up, and on which nodes? There are no log messages about unmanaged VMs being discovered, so it may have been a lock somewhere in the database, though the logs don't show anything.
OK. I will try to re-create the VM with a new definition. Let's see if I'm able to delete the defective VM definition.
- Removed the disk (keeping it). OK.
- Deleted VM in question. OK.
- Recreated VM and attached the old disk. OK.
- Completed a life cycle of startup and shutdown. OK.
The new VM definition is working as expected. It shows up on only one node and disappears from the nodes when it is down. Problem fixed.
If I remember correctly, I had one problem when upgrading. The problem with current oVirt upgrades is that every upgrade reverts the Ceph libraries to versions that are incompatible with the current Ceph release. I remember trying to migrate the test VM to a node that had been freshly upgraded but did not yet have the current libraries re-installed. I had forgotten to upgrade the libraries, so the migration failed because Ceph refused to serve data. My guess is that the problem started there.
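To avoid repeating this, a check along the following lines could be run on a node before activating it. This is only a sketch under assumptions: that the relevant client packages are `librbd1` and `librados2`, and that comparing the major version against 13 (Ceph Mimic releases are versioned 13.x) is a good enough compatibility test. The helper names are hypothetical.

```python
import subprocess

MIMIC_MAJOR = 13  # Ceph Mimic releases are versioned 13.x


def major_version(version):
    """Extract the leading major number from a version string like '13.2.2'."""
    return int(version.split(".")[0])


def mismatched_libraries(installed, expected_major=MIMIC_MAJOR):
    """Return the packages whose major version differs from expected_major.

    installed maps a package name to its installed version string.
    Comparing only the major version is an assumption of this sketch.
    """
    return sorted(
        pkg for pkg, ver in installed.items()
        if major_version(ver) != expected_major
    )


def query_installed(packages=("librbd1", "librados2")):
    """Ask rpm for the installed version of each package (assumed names)."""
    versions = {}
    for pkg in packages:
        result = subprocess.run(
            ["rpm", "-q", "--queryformat", "%{VERSION}", pkg],
            capture_output=True, text=True, check=True,
        )
        versions[pkg] = result.stdout.strip()
    return versions
```

Running `mismatched_libraries(query_installed())` on the node and only activating it when the result is empty would catch the case where the upgrade reverted the libraries and the "yum update" step was forgotten.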
So let's close this ticket.
Thanks for the root cause :)