Bug 1477151 - Clean up old VM recovery information also by VM name
Status: ON_QA
Product: vdsm
Classification: oVirt
Component: Core
---
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: ovirt-4.2.0
Target Release: ---
Assigned To: Milan Zamazal
QA Contact: meital avital
Depends On:
Blocks:
Reported: 2017-08-01 06:57 EDT by Petr Matyáš
Modified: 2017-09-28 04:08 EDT (History)
CC: 8 users

See Also:
Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Virt
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
rule-engine: ovirt-4.2+


Attachments
logs (1.20 MB, application/x-gzip)
2017-08-01 06:57 EDT, Petr Matyáš
no flags Details
pg log (14.13 KB, text/plain)
2017-08-02 03:35 EDT, Petr Matyáš
no flags Details
pg log with a little debug (2.05 MB, application/x-gzip)
2017-08-02 09:10 EDT, Petr Matyáš
no flags Details
libvirt logs (9.09 KB, application/x-gzip)
2017-08-02 10:45 EDT, Petr Matyáš
no flags Details
all the logs with pg 9.2 (17.65 MB, application/x-gzip)
2017-08-07 06:13 EDT, Petr Matyáš
no flags Details


External Trackers
Tracker ID Priority Status Summary Last Updated
oVirt gerrit 80179 master MERGED virt: Remove stale domains with the same name 2017-08-08 07:45 EDT

Description Petr Matyáš 2017-08-01 06:57:57 EDT
Created attachment 1307510 [details]
logs

Description of problem:
We have postgres 9.5 as a remote DB; the engine fails to Run Once a VM that is pinned to a host (and explicitly started with Run Once on that particular host) with:
2017-08-01 13:26:42,495+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-8) [] EVENT_ID: VM_DOWN_ERROR(119), VM vm_ha is down with error. Exit message: operation failed: domain 'vm_ha' already exists with uuid $uuid.

When I just start the VM, it is started, but on a different host than the one it should be pinned to.

Version-Release number of selected component (if applicable):
postgresql95-server-9.5.7-1PGDG.rhel7.x86_64
ovirt-engine-4.2.0-0.0.master.20170730103259.gitb12378f.el7.centos.noarch
vdsm-4.20.1-271.gitac81a4d.el7.centos.x86_64

How reproducible:
always

Steps to Reproduce:
1. Create a VM that is pinned to some host.
2. Start that VM with Run Once.

Actual results:
vm fails to start

Expected results:
vm is started on specified host

Additional info:
Comment 1 Lukas Svaty 2017-08-01 07:49:12 EDT
Tested on regular setup with 9.2pg and local db, working in ovirt-4.2.
Comment 2 Eli Mesika 2017-08-01 10:28:18 EDT
Unable to reproduce with the scenario on master,
Can you please verify that this is consistent, i.e. try to add several more VMs and use Run Once on each?
Comment 3 Petr Matyáš 2017-08-01 11:05:08 EDT
Yes, this seemed consistent on our env with remote db.
However, I didn't test thoroughly enough, so to clarify: this happens only with a VM that is highly available and whose host got fenced. When the fencing is finished and the host is up again, the VM is powered off and removed. Then when you create the same VM and try to start it on the host that got fenced, it just won't start.
Comment 4 Eli Mesika 2017-08-01 16:08:18 EDT
Can you please also provide the relevant postgres server log?
Comment 5 Petr Matyáš 2017-08-02 03:35 EDT
Created attachment 1307987 [details]
pg log
Comment 6 Eli Mesika 2017-08-02 08:17:00 EDT
(In reply to Petr Matyáš from comment #5)
> Created attachment 1307987 [details]
> pg log

Are you sure that this is the correct PG log, aligned with the problem?

I see a very short PG session cycle that is terminated by the user (PG being stopped manually, as far as I understand).

Then at the end of the log a very weird error is reported:

< 2017-08-01 13:35:16.288 IDT >ERROR:  relation "vm_stati" does not exist at character 28
< 2017-08-01 13:35:16.288 IDT >STATEMENT:  select vm_name,status from vm_stati;
< 2017-08-01 13:35:18.599 IDT >ERROR:  column "status" does not exist at character 16
< 2017-08-01 13:35:18.599 IDT >STATEMENT:  select vm_name,status from vm_static;

Seems like a query is using "vm_stati" instead of "vm_static"

But I think that this log is not aligned with the reported scenario.

Can you please:

1) Stop the engine
2) Delete the relevant PG log
3) Restart PG (this will create a new PG log for the day)
4) Reproduce the reported scenario
5) Send the PG log again

Thanks
Comment 7 Petr Matyáš 2017-08-02 09:10 EDT
Created attachment 1308215 [details]
pg log with a little debug

When trying to reproduce this, engine went bananas every time I opened VMs tab.

I managed to get as far as fencing the host with the VM running, waited for the host to become up, powered off and removed the VM, and created the VM again. Then everything went to hell.

This log has the biggest debug level I could find, so hopefully you will manage to find the issue.
Comment 8 Eli Mesika 2017-08-02 09:46:33 EDT
(In reply to Petr Matyáš from comment #7)
> Created attachment 1308215 [details]
> This log has the biggest debug level I could find so hopefully, you will
> manage to find the issue.

Well, there is no ERROR in the PG log, which means that the complaint about the duplicate uuid does not come from a data violation in the DB.

When looking into the code, the only place that can cause this "Exit message" to be displayed when a VM is set to DOWN is in:

/backend/manager/modules/vdsbroker/src/main/java/org/ovirt/engine/core/vdsbroker/monitoring/VmAnalyzer.java :

This is the relevant code snippet :

---START---
private void auditVmOnDownError() {
    AuditLogableBase logable = Injector.injectMembers(
            new AuditLogableBase(vdsManager.getVdsId(), getVmId()));
    logable.addCustomValue("ExitMessage",
            vdsmVm.getVmDynamic().getExitMessage() != null
                    ? "Exit message: " + vdsmVm.getVmDynamic().getExitMessage()
                    : " ");
    auditLog(logable, AuditLogType.VM_DOWN_ERROR);
}
---END---

The "Exit Message" is set from the exit_message value returned by the call to VDSM.

So, AFAIU, there is no DB issue here and you should investigate what else was changed besides the PG version
Comment 9 Petr Matyáš 2017-08-02 09:58:26 EDT
From what I'm told it's the same configuration as any other engine we have for testing, only the db version is different.

I can give you access to our engine if you want, we have it till today only though.
Comment 10 Eli Mesika 2017-08-02 10:40:58 EDT
This is taken from the attached VDSM log:

libvirtError: operation failed: domain 'vm_ha' already exists with uuid 270a28b7-1d2a-4f66-8a69-e5c614770dcc
2017-08-01 13:26:42,433+0300 INFO  (vm/e52c945a) [virt.vm] (vmId='e52c945a-1550-4f1b-ad7e-996f5746cbce') Changed state to Down: operation failed: domain 'vm_ha' already exists with uuid 270a28b7-1d2a-4f66-8a69-e5c614770dcc (code=1) (vm:1441)

So, it's a libvirt error ...
Comment 11 Petr Matyáš 2017-08-02 10:45 EDT
Created attachment 1308288 [details]
libvirt logs
Comment 12 Michal Skrivanek 2017-08-03 02:57:21 EDT
please confirm 
- does it happen only with postgres 9.5?
- did you check the environment for existing VMs with the same UUID via virsh?
- did it reproduce in any other environment?
Comment 13 Petr Matyáš 2017-08-03 03:24:55 EDT
Yes.
No, but shouldn't this be checked for? I can't influence what UUID the VM will get.
No, we don't have any other env with PG 9.5.
Comment 14 Michal Skrivanek 2017-08-03 03:44:40 EDT
(In reply to Petr Matyáš from comment #13)
> Yes.
- how do you tell that it's only postgres 9.5? Can you perhaps downgrade postgres in your environment? Either way, I find it highly unlikely that it's postgres related

> No, but shouldn't this be checked for? I can't influence what UUID will the
> VM get.
then please check that. There is likely a stray VM registered in libvirt with the same UUID. What happened to that host, was it disconnected and reconnected to another environment? Reinstalled, upgraded, etc.?

> No, we don't have any other env with PG 9.5.
Comment 15 Michal Skrivanek 2017-08-03 03:45:39 EDT
Can you create a new HA VM and try the same?
Comment 16 Eli Mesika 2017-08-03 05:06:16 EDT
(In reply to Michal Skrivanek from comment #15)
> Can you create a new HA VM and try the same?

AFAIU this is already done, see my request in comment 2 and the response in comment 3
Comment 21 Milan Zamazal 2017-08-03 11:04:31 EDT
According to the Vdsm log, the VM can't be run because a VM with the same name already exists on the system. We clean up stale (persistent) VM domains by their UUID, not by name. Since the VM had been removed and re-created with the same name, the name hasn't changed but the UUID has, so the old domain, left over after the non-regular VM termination and still carrying the old UUID, was never removed. libvirt then detected the VM name clash and refused to run the new VM.

We should look up the stale VM domain during cleanup not only by the VM id but also by the VM name.
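The matching logic described above can be illustrated with a small sketch. This is not the actual Vdsm patch (gerrit 80179); the function name and the (uuid, name) tuples are illustrative stand-ins for what could be gathered from libvirt's domain listing:

```python
def find_stale_domains(defined_domains, vm_uuid, vm_name):
    """Return the (uuid, name) pairs that clash with the VM being started.

    defined_domains is an iterable of (uuid, name) tuples for the
    persistent domains currently defined on the host. Matching only on
    uuid misses a re-created VM that kept its name but got a new uuid;
    libvirt would then refuse to define the new domain due to the name
    clash, which is exactly the failure seen in this bug.
    """
    return [
        (uuid, name)
        for uuid, name in defined_domains
        if uuid == vm_uuid or name == vm_name
    ]


# Mirroring the bug's scenario: the stale domain kept the old uuid
# (270a28b7-...) while the re-created VM got a new one (e52c945a-...).
stale = find_stale_domains(
    [("270a28b7-1d2a-4f66-8a69-e5c614770dcc", "vm_ha")],
    vm_uuid="e52c945a-1550-4f1b-ad7e-996f5746cbce",
    vm_name="vm_ha",
)
# The stale domain is found via its name and can then be undefined
# before the new VM is started.
```

With uuid-only matching the list above would come back empty and the stale domain would survive to trigger the "domain 'vm_ha' already exists" error.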
Comment 22 Michal Skrivanek 2017-08-03 11:45:49 EDT
Why
Comment 23 Eli Mesika 2017-08-06 03:58:41 EDT
(In reply to Milan Zamazal from comment #21)
I still do not understand how, according to that, this explains the fact that it works with PG 9.2 and fails with PG 9.5.
The explanation you gave should result in a failure regardless of the PG DB version...
Comment 24 Michal Skrivanek 2017-08-06 04:04:09 EDT
AFAICT it wasn't tested on 9.2 at all
Comment 25 Petr Matyáš 2017-08-07 03:42:36 EDT
(In reply to Michal Skrivanek from comment #24)
> AFAICT it wasn't tested on 9.2 at all

You are wrong, I tested this on 9.2. Actually, this is being tested on 9.2 all the time, since it is an automated test case.
Comment 26 Michal Skrivanek 2017-08-07 05:58:26 EDT
Let's take a look then. Can you provide engine and vdsm logs of the same oVirt version with PG 9.2?
Comment 27 Petr Matyáš 2017-08-07 06:13 EDT
Created attachment 1309988 [details]
all the logs with pg 9.2
Comment 28 Michal Skrivanek 2017-08-07 06:56:51 EDT
That's not the same version as with PG 9.5; most notably it includes commit 59057bd64. Also, one test is using "vm_ha" and the other "vm-ha" - is it really the same automation code for both runs - comment #27 and comment #17?

Either way, I still do not see anything related to the postgres version. Let's wait and retest once patch 80179 is merged
Comment 29 Petr Matyáš 2017-08-07 07:33:50 EDT
I didn't use automation for these runs, hence the naming issue, but I just tested with the same name and it is working as well.

Don't know why this would be related to the DB, but let's wait for that patch and test with it on 9.2 and 9.5 :)
