Bug 1651211 - VM is displayed as running on two nodes
Summary: VM is displayed as running on two nodes
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: General
Version: 4.2.7
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: low
Target Milestone: ---
Target Release: ---
Assignee: bugs@ovirt.org
QA Contact: meital avital
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-11-19 12:34 UTC by Andreas Elvers
Modified: 2018-12-06 14:56 UTC
CC List: 4 users

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2018-12-06 14:56:12 UTC
oVirt Team: Virt
Embargoed:


Attachments
Current engine logs for VM-ID with described symptoms (4.19 KB, application/x-gzip)
2018-11-19 15:27 UTC, Andreas Elvers
combined engine logs Part A (15.00 MB, application/x-bzip)
2018-11-20 16:36 UTC, Andreas Elvers
combined engine logs Part B (8.80 MB, application/octet-stream)
2018-11-20 16:37 UTC, Andreas Elvers

Description Andreas Elvers 2018-11-19 12:34:10 UTC
Description of problem:

I upgraded a cluster from 4.2.6 to 4.2.7, first the nodes, then the engine. After migrating a virtual machine, it is now displayed in the engine as running on two nodes. Inspecting both nodes with virsh showed it was only running on the node it was migrated to. The VM details page shows the VM as running on the node it was migrated from, which is not true. When the virtual machine is shut down, it vanishes from the list of VMs on the node it is actually running on, as expected, but on the node it was migrated from it is still displayed as down.
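
For reference, a minimal sketch of how the engine's reported host for the VM can be queried through the oVirt Python SDK (ovirtsdk4), to compare with what virsh shows on the nodes. The engine URL and credentials below are placeholders ("testceph" is the affected VM):

import ovirtsdk4 as sdk

# Placeholder URL and credentials, not values from this cluster.
connection = sdk.Connection(
    url='https://engine.example.com/ovirt-engine/api',
    username='admin@internal',
    password='secret',
    insecure=True,  # or ca_file='/path/to/ca.pem'
)
try:
    vms_service = connection.system_service().vms_service()
    for vm in vms_service.list(search='name=testceph'):
        host_name = None
        # vm.host is only set while the engine believes the VM is running somewhere.
        if vm.host is not None:
            host_name = connection.follow_link(vm.host).name
        print(vm.name, vm.status, host_name)
finally:
    connection.close()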

Version-Release number of selected component (if applicable):

Engine 4.2.7.5-1.el7

How reproducible:

Not clear

Steps to Reproduce:
1. not clear
2.
3.

Actual results:

VM is displayed as running on two nodes

Expected results:

The VM should only be displayed as running on the host it is actually running on.

Additional info:

The cluster is running on Ceph Mimic.

I tried to migrate other machines, and they migrate without any problem. My plan to tackle the problem is to stop the VM, detach its storage, delete the VM definition and create a new VM definition to which the old disk is attached.

Comment 1 Andreas Elvers 2018-11-19 12:37:35 UTC
Clarification: The cluster is running VMs on a Ceph Mimic storage domain.

Comment 2 Michal Skrivanek 2018-11-19 12:58:06 UTC
Hi, without logs there's really nothing to look at.
Is it reproducible at all? Do you have logs from the engine and from both nodes for the failure you observed?

Comment 3 Andreas Elvers 2018-11-19 15:27:42 UTC
Created attachment 1507301 [details]
Current engine logs for VM-ID with described symptoms

These are the latest logs concerning the specific VM-ID.

Comment 4 Ryan Barry 2018-11-20 15:53:53 UTC
Please attach full logs if possible (as much as I like only seeing the lines for the relevant VM).

I wonder if you hit https://bugzilla.redhat.com/show_bug.cgi?id=1647388 during the upgrade. Was the VM migrated as the other host was in maintenance and upgraded, maybe?

Comment 5 Andreas Elvers 2018-11-20 16:36:32 UTC
Created attachment 1507442 [details]
combined engine logs Part A

Comment 6 Andreas Elvers 2018-11-20 16:37:19 UTC
Created attachment 1507443 [details]
combined engine logs Part B

Comment 7 Andreas Elvers 2018-11-20 16:48:53 UTC
(In reply to Ryan Barry from comment #4)
> Please attach full logs if possible (as much as I like only seeing the lines
> for the relevant VM).

Done that.

> 
> I wonder if you hit https://bugzilla.redhat.com/show_bug.cgi?id=1647388
> during the upgrade. Was the VM migrated as the other host was in maintenance
> and upgraded, maybe?

No. While upgrading, no other actions are performed on the cluster. My upgrade procedure for nodes is as follows (a scripted sketch of the engine-side steps follows the list):

1. Put the node into maintenance and let its VMs migrate to other nodes.
2. Upgrade through the UI.
3. Wait for the node to return to maintenance mode.
4. Log in to the node and run "yum update" to re-install more current Ceph libraries.
5. Reboot again via the UI ("SSH Restart").
6. Wait for the node to return to maintenance mode.
7. Activate the node.
8. Migrate testceph (the VM showing the symptoms) to the newly activated node for test purposes.
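
For completeness, a rough sketch of how the engine-side steps of that procedure (1-3 and 7) could be driven through the oVirt Python SDK (ovirtsdk4). The engine URL, credentials and host name are placeholders, host_service.upgrade() is assumed to be available in this 4.2-era SDK, and the on-node yum update and extra reboot are out of scope here:

import time
import ovirtsdk4 as sdk
import ovirtsdk4.types as types

# Placeholder URL and credentials, not values from this cluster.
connection = sdk.Connection(
    url='https://engine.example.com/ovirt-engine/api',
    username='admin@internal',
    password='secret',
    insecure=True,
)

def wait_for_status(host_service, wanted, timeout=1800):
    # Poll the host until it reaches the wanted status (simplified: a real
    # script would also watch for the intermediate installing/error states).
    deadline = time.time() + timeout
    while time.time() < deadline:
        if host_service.get().status == wanted:
            return
        time.sleep(10)
    raise RuntimeError('host did not reach status %s in time' % wanted)

try:
    hosts_service = connection.system_service().hosts_service()
    host = hosts_service.list(search='name=node01')[0]  # placeholder host name
    host_service = hosts_service.host_service(host.id)

    # Step 1: maintenance; running VMs migrate to other nodes.
    host_service.deactivate()
    wait_for_status(host_service, types.HostStatus.MAINTENANCE)

    # Steps 2-3: same as "Upgrade" in the UI; the node ends up back in maintenance.
    host_service.upgrade()
    wait_for_status(host_service, types.HostStatus.MAINTENANCE)

    # Steps 4-6 (yum update for the Ceph libraries and the extra reboot)
    # happen on the node itself and are not covered here.

    # Step 7: activate the node again.
    host_service.activate()
    wait_for_status(host_service, types.HostStatus.UP)
finally:
    connection.close()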

Comment 8 Ryan Barry 2018-11-29 00:34:49 UTC
Hmm. I'm not able to reproduce this, and the logs look relatively normal (you even put the hosts into maintenance in order).

At what point did the symptom show up, and on which nodes? There are no log messages about unmanaged VMs being discovered, so it may have been a lock somewhere in the database, though the logs don't show anything.
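
If stale database state is suspected, a hedged sketch of reading the engine's recorded run-host for a VM directly from the engine database (psycopg2). The connection parameters are placeholders, and the table and column names (vm_static, vm_dynamic.run_on_vds, vds_static) are assumptions based on the 4.2-era schema, so verify them against your own installation before relying on this:

import psycopg2

# Placeholder connection parameters for the engine database.
conn = psycopg2.connect(
    host='localhost', dbname='engine', user='engine', password='secret'
)
try:
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT s.vm_name, d.status, v.vds_name
              FROM vm_static s
              JOIN vm_dynamic d ON d.vm_guid = s.vm_guid
              LEFT JOIN vds_static v ON v.vds_id = d.run_on_vds
             WHERE s.vm_name = %s
            """,
            ('testceph',),
        )
        for vm_name, status, vds_name in cur.fetchall():
            # vm_dynamic.status is stored as an integer status code.
            print(vm_name, status, vds_name)
finally:
    conn.close()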

Comment 9 Andreas Elvers 2018-12-06 13:58:41 UTC
OK. I will try to re-create the VM with a new definition. Let's see if I'm able to delete the defective VM definition.

- Removed the disk (keeping it). OK.
- Deleted VM in question. OK.
- Recreated VM and attached the old disk. OK.
- Completed a full startup and shutdown cycle. OK.

The new VM definition is working as expected. It shows up on only one node and disappears from the node list when its status is down. Problem fixed.
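
For anyone who needs to repeat this recovery, a sketch of the same detach / delete / recreate / reattach sequence via the oVirt Python SDK (ovirtsdk4). The engine URL, credentials, cluster and template names are placeholders, the VM is assumed to be down with a single disk, and remove(detach_only=True) is assumed to be supported by this SDK version:

import ovirtsdk4 as sdk
import ovirtsdk4.types as types

# Placeholder URL and credentials, not values from this cluster.
connection = sdk.Connection(
    url='https://engine.example.com/ovirt-engine/api',
    username='admin@internal',
    password='secret',
    insecure=True,
)
try:
    vms_service = connection.system_service().vms_service()
    old_vm = vms_service.list(search='name=testceph')[0]
    old_vm_service = vms_service.vm_service(old_vm.id)

    # 1. Detach the (single) disk but keep the image (detach_only=True).
    attachments_service = old_vm_service.disk_attachments_service()
    attachment = attachments_service.list()[0]
    disk_id = attachment.disk.id
    attachments_service.attachment_service(attachment.id).remove(detach_only=True)

    # 2. Delete the defective VM definition (the VM must already be down).
    old_vm_service.remove()

    # 3. Recreate the VM from the Blank template (placeholder cluster name).
    new_vm = vms_service.add(types.Vm(
        name='testceph',
        cluster=types.Cluster(name='Default'),
        template=types.Template(name='Blank'),
    ))

    # 4. Attach the old disk to the new definition.
    new_attachments = vms_service.vm_service(new_vm.id).disk_attachments_service()
    new_attachments.add(types.DiskAttachment(
        disk=types.Disk(id=disk_id),
        interface=types.DiskInterface.VIRTIO,
        bootable=True,
        active=True,
    ))
finally:
    connection.close()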

If I remember correctly, I hit one problem while upgrading. The problem with current oVirt upgrades is that with every upgrade the Ceph libraries are reverted to versions that are incompatible with the current Ceph release. I remember trying to migrate the test VM to a node that had just been upgraded without the current libraries re-installed. I had forgotten to upgrade the libraries, so the migration failed because Ceph refused to serve data. My guess is that the problem started there.

So let's close this ticket.

Comment 10 Ryan Barry 2018-12-06 14:56:12 UTC
Thanks for the root cause :)

