Bug 1253043

Summary: Ghost VMs created with prefix "external-" by "null@N/A"
Product: Red Hat Enterprise Virtualization Manager Reporter: Robert McSwain <rmcswain>
Component: ovirt-engine Assignee: Francesco Romani <fromani>
Status: CLOSED ERRATA QA Contact: sefi litmanovich <slitmano>
Severity: high Docs Contact:
Priority: high    
Version: 3.5.3 CC: amureini, bcholler, dmoessne, fromani, fsun, gklein, inetkach, lpeer, lsurette, luisfa36, melewis, mgoldboi, michal.skrivanek, mkalinin, pspacek, rbalakri, Rhev-m-bugs, rmcswain, sbonazzo, simon.barrett, tjelinek, yeylon, ykaul
Target Milestone: ovirt-3.6.3   
Target Release: 3.6.3   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Previously, incorrect handling of the virtual machine recovery file when a virtual machine was shut down from inside the guest caused the recovery file to be leaked. Leaked recovery files could cause incorrect detection of virtual machines, especially in the case of network failures, leading the Red Hat Enterprise Virtualization Manager to find unknown external virtual machines. Now, VDSM no longer leaks recovery files: they are handled correctly, which prevents the Red Hat Enterprise Virtualization Manager from incorrectly detecting virtual machines.
Story Points: ---
Clone Of: Environment:
Last Closed: 2016-03-09 21:11:44 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Virt RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1213937, 1121089    

Description Robert McSwain 2015-08-12 19:38:50 UTC
Description of problem:
Empty VMs which had already been removed from the RHEV environment are reappearing, created by "null@N/A". For example, a VM was removed on July 15, but it reappeared on July 20:

2015-Jul-15, 18:03: VM CHG0007972_www01_remedy_prod_int_phx2 was successfully removed.
2015-Jul-20, 07:21: VM external-www01_remedy_prod_int_phx2 was created by null@N/A.

From splunk, another event right before the import:
 7/20/15 7:21:16.206 AM 	 2015-07-20 07:21:16,206 INFO  [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (DefaultQuartzScheduler_Worker-71) [1cda8be9] Importing VM www01_remedy_prod_int_phx2 as external-www01_remedy_prod_int_phx2, as it is running on the on Host, but does not exist in the engine.


Version-Release number of selected component (if applicable):
rhevm-3.5.3.1-1.4.el6ev.noarch

How reproducible:
Unknown

Actual results:
VM returned after 5 days

Expected results:
VM should stay deleted

Additional info:
Information coming in a future update

Comment 2 Omer Frenkel 2015-08-13 11:36:29 UTC
Usually this is the result of an unclean removal of a host: the VM is still running, or still reported by VDSM. You should stop it and then remove it again.

Please let me know if this helped.

Comment 3 Robert McSwain 2015-08-26 14:07:57 UTC
I've pinged the customer again and will let you know if we get any more info from them.

Comment 4 Michal Skrivanek 2015-10-22 09:20:42 UTC
any news?

Comment 8 Simon Barrett 2015-11-13 12:20:42 UTC
I have seen the same behaviour, where a number of previously removed VMs have come back as external. I see the output below in the engine log, but unfortunately I don't have the log from when the VMs were originally removed, so I can't tell whether anything went wrong during the removal. I was also upgrading my nodes from 3.4.2 to 3.5.4 just prior to these reappearing.

Not sure if that helps in any way, but I thought I'd share it anyway. I am running 3.5.4.

2015-11-10 05:33:21,685 INFO  [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (DefaultQuartzScheduler_Worker-32) [518940e8] Importing VM ny2-laa-012.demo as external-ny2-laa-012.demo, as it is running on the on Host, but does not exist in the engine.
2015-11-10 05:33:21,685 INFO  [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (DefaultQuartzScheduler_Worker-32) [518940e8] Importing VM ny2-laa-011-rhel.demo as external-ny2-laa-011-rhel.demo, as it is running on the on Host, but does not exist in the engine.
2015-11-10 05:33:21,686 INFO  [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (DefaultQuartzScheduler_Worker-32) [518940e8] Importing VM ny2-laa-013-rhel.demo as external-ny2-laa-013-rhel.demo, as it is running on the on Host, but does not exist in the engine.
2015-11-10 05:33:21,686 INFO  [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (DefaultQuartzScheduler_Worker-32) [518940e8] Importing VM ny2-laa-012-rhel.demo as external-ny2-laa-012-rhel.demo, as it is running on the on Host, but does not exist in the engine.
2015-11-10 05:33:21,687 INFO  [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (DefaultQuartzScheduler_Worker-32) [518940e8] Importing VM ny2-lda-004a.demo as external-ny2-lda-004a.demo, as it is running on the on Host, but does not exist in the engine.
2015-11-10 05:33:21,687 INFO  [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (DefaultQuartzScheduler_Worker-32) [518940e8] Importing VM ny2-laa-010.demo as external-ny2-laa-010.demo, as it is running on the on Host, but does not exist in the engine.
2015-11-10 05:33:21,688 INFO  [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (DefaultQuartzScheduler_Worker-32) [518940e8] Importing VM ny2-laa-014-rhel.demo as external-ny2-laa-014-rhel.demo, as it is running on the on Host, but does not exist in the engine.
2015-11-10 05:33:21,688 INFO  [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (DefaultQuartzScheduler_Worker-32) [518940e8] Importing VM ny2-laa-014.demo as external-ny2-laa-014.demo, as it is running on the on Host, but does not exist in the engine.
2015-11-10 05:33:21,689 INFO  [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (DefaultQuartzScheduler_Worker-32) [518940e8] Importing VM ny2-lda-004b.demo as external-ny2-lda-004b.demo, as it is running on the on Host, but does not exist in the engine.
2015-11-10 05:33:21,689 INFO  [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (DefaultQuartzScheduler_Worker-32) [518940e8] Importing VM ny2-laa-013.demo as external-ny2-laa-013.demo, as it is running on the on Host, but does not exist in the engine.
2015-11-10 05:33:21,690 INFO  [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (DefaultQuartzScheduler_Worker-32) [518940e8] Importing VM ny2-laa-011.demo as external-ny2-laa-011.demo, as it is running on the on Host, but does not exist in the engine.
2015-11-10 05:33:21,690 INFO  [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (DefaultQuartzScheduler_Worker-32) [518940e8] Importing VM ny2-laa-009.demo as external-ny2-laa-009.demo, as it is running on the on Host, but does not exist in the engine.
2015-11-10 05:33:21,690 INFO  [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (DefaultQuartzScheduler_Worker-32) [518940e8] Importing VM ny2-laa-010-rhel.demo as external-ny2-laa-010-rhel.demo, as it is running on the on Host, but does not exist in the engine.
2015-11-10 05:33:21,691 INFO  [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (DefaultQuartzScheduler_Worker-32) [518940e8] Importing VM ny2-laa-009-rhel.demo as external-ny2-laa-009-rhel.demo, as it is running on the on Host, but does not exist in the engine.


root> grep "was created by null" engine.log-20151111
2015-11-10 05:33:23,034 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-32) [593a4795] Correlation ID: 593a4795, Call Stack: null, Custom Event ID: -1, Message: VM external-ny2-laa-012.demo was created by null@N/A.
2015-11-10 05:33:23,180 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-32) [2795b718] Correlation ID: 2795b718, Call Stack: null, Custom Event ID: -1, Message: VM external-ny2-laa-011-rhel.demo was created by null@N/A.
2015-11-10 05:33:23,367 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-32) [2c7eccae] Correlation ID: 2c7eccae, Call Stack: null, Custom Event ID: -1, Message: VM external-ny2-laa-013-rhel.demo was created by null@N/A.
2015-11-10 05:33:23,488 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-32) [6d050906] Correlation ID: 6d050906, Call Stack: null, Custom Event ID: -1, Message: VM external-ny2-laa-012-rhel.demo was created by null@N/A.
2015-11-10 05:33:23,643 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-32) [77327bf0] Correlation ID: 77327bf0, Call Stack: null, Custom Event ID: -1, Message: VM external-ny2-lda-004a.demo was created by null@N/A.
2015-11-10 05:33:23,809 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-32) [7d8ece3f] Correlation ID: 7d8ece3f, Call Stack: null, Custom Event ID: -1, Message: VM external-ny2-laa-010.demo was created by null@N/A.
2015-11-10 05:33:23,897 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-32) [6d939916] Correlation ID: 6d939916, Call Stack: null, Custom Event ID: -1, Message: VM external-ny2-laa-014-rhel.demo was created by null@N/A.
2015-11-10 05:33:23,981 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-32) [5c6ca5bc] Correlation ID: 5c6ca5bc, Call Stack: null, Custom Event ID: -1, Message: VM external-ny2-laa-014.demo was created by null@N/A.
2015-11-10 05:33:24,146 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-32) [cbe9bfb] Correlation ID: cbe9bfb, Call Stack: null, Custom Event ID: -1, Message: VM external-ny2-lda-004b.demo was created by null@N/A.
2015-11-10 05:33:24,238 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-32) [5487ee3b] Correlation ID: 5487ee3b, Call Stack: null, Custom Event ID: -1, Message: VM external-ny2-laa-013.demo was created by null@N/A.
2015-11-10 05:33:24,328 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-32) [280caed6] Correlation ID: 280caed6, Call Stack: null, Custom Event ID: -1, Message: VM external-ny2-laa-011.demo was created by null@N/A.
2015-11-10 05:33:24,417 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-32) [6087c1f1] Correlation ID: 6087c1f1, Call Stack: null, Custom Event ID: -1, Message: VM external-ny2-laa-009.demo was created by null@N/A.
2015-11-10 05:33:24,505 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-32) [18e4ef1d] Correlation ID: 18e4ef1d, Call Stack: null, Custom Event ID: -1, Message: VM external-ny2-laa-010-rhel.demo was created by null@N/A.
2015-11-10 05:33:24,603 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-32) [92aba53] Correlation ID: 92aba53, Call Stack: null, Custom Event ID: -1, Message: VM external-ny2-laa-009-rhel.demo was created by null@N/A.

Comment 10 Michal Skrivanek 2015-11-18 12:55:32 UTC
related to bug 1121089 (same issue apparently)

Comment 11 Michal Skrivanek 2015-11-27 08:27:21 UTC
Investigation on a locally reproduced problem showed a possible reason: network connectivity/instability is triggering fencing actions, restarting VDSM multiple times. When the VDSM VM recovery is interrupted before it finishes its job, the recovery files are left behind and reused on the following restarts.

The current 3.5 code makes troubleshooting quite difficult, as it doesn't provide enough information to know for sure what's going on, so this is still more of a theory than a proof.
There are improvements to the logging, as well as to the logic within the recovery thread, on master; these are worth backporting to 3.6 at least.

Targeting 3.6 at the moment, though we may need to wait for more occurrences there.

Comment 12 Francesco Romani 2016-01-04 14:45:35 UTC
(In reply to Michal Skrivanek from comment #11)
> Investigation on locally reproduced problem showed a possible reason -
> network connectivity/instability is triggering fencing actions - restart
> vdsm multiple times. When the vdsm VM recovery is interrupted before it
> finishes its job, the recovery files are left behind and reused on following
> restarts. 
> 
> Current 3.5 code makes troubleshooting quite difficult as it doesn't provide
> sufficient information to know what's going on for sure, so this is still
> more of a theory than a proof.
> There are improvements to logging as well as logic within the recovery
> thread on master, worth backporting to 3.6 at least
> 
> Targeting 3.6 at the moment, though we may need to wait for more occurrences
> there

After more investigation, we think the most probable cause is Vdsm misdetecting shutdown from within the guest.
In that case Vdsm correctly deletes the recovery file, but recreates it after the deletion with 'Down' status.

The reason is Vdsm does 'saveState' after releaseVm in the onQemuDeath callback.

saveState has a safety guard to avoid this recreation, which triggers only in the 'destroy' path.

While we have pretty solid evidence this is the root cause, I haven't reproduced it yet. Will do ASAP, and post a patch if this is indeed the case.
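A minimal sketch of the flawed sequence described above. All names and paths here are hypothetical and only illustrate the ordering problem; the real VDSM code differs:

```python
import os

RECOVERY_DIR = "/tmp/recovery-demo"  # hypothetical; stands in for VDSM's run directory

def recovery_path(vm_id):
    return os.path.join(RECOVERY_DIR, vm_id + ".recovery")

def release_vm(vm_id):
    # Correct cleanup: the recovery file is removed when the VM is released.
    try:
        os.remove(recovery_path(vm_id))
    except FileNotFoundError:
        pass

def save_state(vm_id, status, destroyed=False):
    # Pre-fix guard only covered the 'destroy' path, so a guest-initiated
    # shutdown still rewrote the file after release_vm() had deleted it.
    if destroyed:
        return
    with open(recovery_path(vm_id), "w") as f:
        f.write(status)

def on_qemu_death(vm_id):
    # Buggy order: saveState runs after releaseVm and recreates the
    # recovery file with 'Down' status, leaking it.
    release_vm(vm_id)
    save_state(vm_id, "Down")

os.makedirs(RECOVERY_DIR, exist_ok=True)
on_qemu_death("vm-1234")
print(os.path.exists(recovery_path("vm-1234")))  # True: a stale file is left behind
```

On the next VDSM restart, recovery would find this stale file and report a VM the engine no longer knows about, which the engine then imports as "external-".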

Comment 13 Francesco Romani 2016-01-13 15:49:49 UTC
(In reply to Francesco Romani from comment #12)
> (In reply to Michal Skrivanek from comment #11)

> While we have pretty solid evidence this is the root cause, I haven't
> reproduced it yet. Will do ASAP, and post a patch if this is indeed the case.

Forgot to update (sorry about that): confirmed/reproduced.

Comment 14 Francesco Romani 2016-01-25 13:59:48 UTC
patches merged on 3.6 branch -> MODIFIED

Comment 15 sefi litmanovich 2016-02-10 10:37:12 UTC
Verified with rhevm-3.6.3-0.1.el6.noarch.

1. Created a VM with RHEL 7.2 installed.
2. Started the VM - recovery file created.
3. Shut down the VM from within the guest.
4. After the shutdown, once the engine reported the VM in Down status, no recovery file was re-created on the host.
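The verified behaviour corresponds to a flow like the following minimal sketch (hypothetical names and paths; the real fix in VDSM differs in detail, this only illustrates guarding the recovery-file write once the VM has been released):

```python
import os
import tempfile

recovery_dir = tempfile.mkdtemp()  # hypothetical stand-in for VDSM's run directory

def recovery_path(vm_id):
    return os.path.join(recovery_dir, vm_id + ".recovery")

class Vm:
    def __init__(self, vm_id):
        self.vm_id = vm_id
        self.released = False

    def save_state(self, status):
        # Post-fix guard: never recreate the recovery file after release.
        if self.released:
            return
        with open(recovery_path(self.vm_id), "w") as f:
            f.write(status)

    def release(self):
        # Guest shutdown: delete the recovery file and remember the release.
        self.released = True
        try:
            os.remove(recovery_path(self.vm_id))
        except FileNotFoundError:
            pass

vm = Vm("vm-1234")
vm.save_state("Up")    # recovery file exists while the VM runs
vm.release()           # shutdown from within the guest: file deleted
vm.save_state("Down")  # guard prevents re-creation
print(os.path.exists(recovery_path("vm-1234")))  # False: no leaked file
```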

Comment 23 errata-xmlrpc 2016-03-09 21:11:44 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-0376.html

Comment 24 Michal Skrivanek 2016-06-14 07:00:39 UTC
*** Bug 1345942 has been marked as a duplicate of this bug. ***