Bug 1477151 - Clean up old VM recovery information also by VM name
Status: ON_QA
Product: vdsm
Classification: oVirt
Component: Core
---
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: ovirt-4.2.0
Target Release: ---
Assigned To: Milan Zamazal
QA Contact: meital avital
Depends On:
Blocks:
Reported: 2017-08-01 06:57 EDT by Petr Matyáš
Modified: 2017-09-28 04:08 EDT (History)
CC: 8 users

See Also:
Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Virt
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
rule-engine: ovirt-4.2+


Attachments
logs (1.20 MB, application/x-gzip)
2017-08-01 06:57 EDT, Petr Matyáš
no flags Details
pg log (14.13 KB, text/plain)
2017-08-02 03:35 EDT, Petr Matyáš
no flags Details
pg log with a little debug (2.05 MB, application/x-gzip)
2017-08-02 09:10 EDT, Petr Matyáš
no flags Details
libvirt logs (9.09 KB, application/x-gzip)
2017-08-02 10:45 EDT, Petr Matyáš
no flags Details
all the logs with pg 9.2 (17.65 MB, application/x-gzip)
2017-08-07 06:13 EDT, Petr Matyáš
no flags Details


External Trackers
Tracker ID Priority Status Summary Last Updated
oVirt gerrit 80179 master MERGED virt: Remove stale domains with the same name 2017-08-08 07:45 EDT

Description Petr Matyáš 2017-08-01 06:57:57 EDT
Created attachment 1307510 [details]
logs

Description of problem:
We have postgres 9.5 as a remote DB; the engine fails to Run Once a VM that is pinned to a host (and explicitly started with Run Once on that particular host) with:
2017-08-01 13:26:42,495+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-8) [] EVENT_ID: VM_DOWN_ERROR(119), VM vm_ha is down with error. Exit message: operation failed: domain 'vm_ha' already exists with uuid $uuid.

When I just start the VM, it is started, but on a different host than the one it should be pinned to.

Version-Release number of selected component (if applicable):
postgresql95-server-9.5.7-1PGDG.rhel7.x86_64
ovirt-engine-4.2.0-0.0.master.20170730103259.gitb12378f.el7.centos.noarch
vdsm-4.20.1-271.gitac81a4d.el7.centos.x86_64

How reproducible:
always

Steps to Reproduce:
1. Create a VM that is pinned to some host.
2. Start that VM with Run Once.

Actual results:
vm fails to start

Expected results:
vm is started on specified host

Additional info:
Comment 1 Lukas Svaty 2017-08-01 07:49:12 EDT
Tested on regular setup with 9.2pg and local db, working in ovirt-4.2.
Comment 2 Eli Mesika 2017-08-01 10:28:18 EDT
Unable to reproduce with the scenario on master,
Can you please verify that this is consistent, i.e. try to add several more VMs and use Run Once on each?
Comment 3 Petr Matyáš 2017-08-01 11:05:08 EDT
Yes, this seemed consistent on our env with remote db.
However, I didn't test thoroughly enough, so to clarify: this happens only with a VM that is highly available and whose host got fenced. When the fencing is finished and the host is up again, the VM is powered off and removed. Then when you create the same VM and try to start it on the host that got fenced, it just won't start.
Comment 4 Eli Mesika 2017-08-01 16:08:18 EDT
Can you please also provide the relevant postgres server log?
Comment 5 Petr Matyáš 2017-08-02 03:35 EDT
Created attachment 1307987 [details]
pg log
Comment 6 Eli Mesika 2017-08-02 08:17:00 EDT
(In reply to Petr Matyáš from comment #5)
> Created attachment 1307987 [details]
> pg log

Are you sure that this is the correct PG log, aligned with the problem?

I see a very short PG session cycle that is terminated by the user (PG being stopped manually, as far as I understand).

Then at the end of the log a very weird error is reported:

< 2017-08-01 13:35:16.288 IDT >ERROR:  relation "vm_stati" does not exist at character 28
< 2017-08-01 13:35:16.288 IDT >STATEMENT:  select vm_name,status from vm_stati;
< 2017-08-01 13:35:18.599 IDT >ERROR:  column "status" does not exist at character 16
< 2017-08-01 13:35:18.599 IDT >STATEMENT:  select vm_name,status from vm_static;

Seems like a query is using "vm_stati" instead of "vm_static"

But I think that this log is not aligned with the reported scenario.

Can you please:

1) Stop the engine
2) Delete the relevant PG log
3) Restart PG (this will create a new PG log for the day)
4) Reproduce the reported scenario
5) Send the PG log again

Thanks
Comment 7 Petr Matyáš 2017-08-02 09:10 EDT
Created attachment 1308215 [details]
pg log with a little debug

When trying to reproduce this, engine went bananas every time I opened VMs tab.

I managed to get as far as fencing the host with the VM running, waited for the host to become up, powered off and removed the VM, and created the VM again. Then everything went to hell.

This log has the biggest debug level I could find, so hopefully you will manage to find the issue.
Comment 8 Eli Mesika 2017-08-02 09:46:33 EDT
(In reply to Petr Matyáš from comment #7)
> Created attachment 1308215 [details]
> This log has the biggest debug level I could find so hopefully, you will
> manage to find the issue.

Well, there is no ERROR in the PG log, which means that the complaint about the duplicate uuid does not come from a data violation in the DB.

When looking into the code, the only place that can cause this "Exit message" to be displayed when a VM is set to DOWN is in:

/backend/manager/modules/vdsbroker/src/main/java/org/ovirt/engine/core/vdsbroker/monitoring/VmAnalyzer.java :

This is the relevant code snippet :

---START---
private void auditVmOnDownError() {
    AuditLogableBase logable = Injector.injectMembers(
            new AuditLogableBase(vdsManager.getVdsId(), getVmId()));
    logable.addCustomValue("ExitMessage",
            vdsmVm.getVmDynamic().getExitMessage() != null
                    ? "Exit message: " + vdsmVm.getVmDynamic().getExitMessage()
                    : " ");
    auditLog(logable, AuditLogType.VM_DOWN_ERROR);
}
---END---

The "Exit Message" is set from the exit_message value returned by the call to VDSM.

So, AFAIU, there is no DB issue here and you should investigate what else was changed besides the PG version
Comment 9 Petr Matyáš 2017-08-02 09:58:26 EDT
From what I'm told it's the same configuration as any other engine we have for testing, only the db version is different.

I can give you access to our engine if you want, we have it till today only though.
Comment 10 Eli Mesika 2017-08-02 10:40:58 EDT
This is taken from the attached VDSM log:

libvirtError: operation failed: domain 'vm_ha' already exists with uuid 270a28b7-1d2a-4f66-8a69-e5c614770dcc
2017-08-01 13:26:42,433+0300 INFO  (vm/e52c945a) [virt.vm] (vmId='e52c945a-1550-4f1b-ad7e-996f5746cbce') Changed state to Down: operation failed: domain 'vm_ha' already exists with uuid 270a28b7-1d2a-4f66-8a69-e5c614770dcc (code=1) (vm:1441)

So, it's a libvirt error ...
Comment 11 Petr Matyáš 2017-08-02 10:45 EDT
Created attachment 1308288 [details]
libvirt logs
Comment 12 Michal Skrivanek 2017-08-03 02:57:21 EDT
please confirm 
- does it happen only with postgres 9.5?
- did you check the environment for existing VMs with the same UUID via virsh?
- did it reproduce in any other environment?
Comment 13 Petr Matyáš 2017-08-03 03:24:55 EDT
Yes.
No, but shouldn't this be checked for? I can't influence what UUID the VM will get.
No, we don't have any other env with PG 9.5.
Comment 14 Michal Skrivanek 2017-08-03 03:44:40 EDT
(In reply to Petr Matyáš from comment #13)
> Yes.
- how do you tell that it's only postgres 9.5? Can you perhaps downgrade postgres in your environment? Either way, I find it highly unlikely that it's postgres related

> No, but shouldn't this be checked for? I can't influence what UUID will the
> VM get.
then please check that. There is likely a stray VM registered in libvirt with the same UUID. What happened to that host, was it disconnected and reconnected to another environment? Reinstalled, upgraded, etc.?

> No, we don't have any other env with PG 9.5.
Comment 15 Michal Skrivanek 2017-08-03 03:45:39 EDT
Can you create a new HA VM and try the same?
Comment 16 Eli Mesika 2017-08-03 05:06:16 EDT
(In reply to Michal Skrivanek from comment #15)
> Can you create a new HA VM and try the same?

AFAIU this is already done, see my request in comment 2 and the response in comment 3
Comment 21 Milan Zamazal 2017-08-03 11:04:31 EDT
According to the Vdsm log, the VM can't be run because a VM with the same name already exists on the system. We clean up stale (persistent) VM domains by their UUID, not by name. Since the VM had been removed and re-created with the same name, the name hasn't changed but the UUID has, so the old domain, left over after the non-regular VM termination and still carrying the old UUID, was never removed. libvirt then detected the VM name clash and refused to run the new VM.

We should look up the stale VM domain during cleanup not only by the VM id but also by the VM name.
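The matching logic described above can be illustrated with a small sketch. This is not the actual Vdsm patch (gerrit 80179); the function name and the (uuid, name) tuples are illustrative stand-ins for what could be gathered from libvirt's domain listing:

```python
def find_stale_domains(defined_domains, vm_uuid, vm_name):
    """Return the (uuid, name) pairs that clash with the VM being started.

    defined_domains is an iterable of (uuid, name) tuples for the
    persistent domains currently defined on the host. Matching only on
    uuid misses a re-created VM that kept its name but got a new uuid;
    libvirt would then refuse to define the new domain due to the name
    clash, which is exactly the failure seen in this bug.
    """
    return [
        (uuid, name)
        for uuid, name in defined_domains
        if uuid == vm_uuid or name == vm_name
    ]


# Mirroring the bug's scenario: the stale domain kept the old uuid
# (270a28b7-...) while the re-created VM got a new one (e52c945a-...).
stale = find_stale_domains(
    [("270a28b7-1d2a-4f66-8a69-e5c614770dcc", "vm_ha")],
    vm_uuid="e52c945a-1550-4f1b-ad7e-996f5746cbce",
    vm_name="vm_ha",
)
# The stale domain is found via its name and can then be undefined
# before the new VM is started.
```

With uuid-only matching the list above would come back empty and the stale domain would survive to trigger the "domain 'vm_ha' already exists" error.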
Comment 22 Michal Skrivanek 2017-08-03 11:45:49 EDT
Why
Comment 23 Eli Mesika 2017-08-06 03:58:41 EDT
(In reply to Milan Zamazal from comment #21)
I still do not understand how, according to that, this explains the fact that it works with PG 9.2 and fails with PG 9.5.
The explanation you gave should result in a failure regardless of the PG DB version...
Comment 24 Michal Skrivanek 2017-08-06 04:04:09 EDT
AFAICT it wasn't tested on 9.2 at all
Comment 25 Petr Matyáš 2017-08-07 03:42:36 EDT
(In reply to Michal Skrivanek from comment #24)
> AFAICT it wasn't tested on 9.2 at all

You are wrong, I tested this on 9.2. Actually, this is being tested on 9.2 all the time, since it is an automated test case.
Comment 26 Michal Skrivanek 2017-08-07 05:58:26 EDT
Let's take a look then. Can you provide engine and vdsm logs of the same oVirt version with PG 9.2?
Comment 27 Petr Matyáš 2017-08-07 06:13 EDT
Created attachment 1309988 [details]
all the logs with pg 9.2
Comment 28 Michal Skrivanek 2017-08-07 06:56:51 EDT
That's not the same version as with PG 9.5; most notably it includes commit 59057bd64. Also, one test is using "vm_ha" and the other "vm-ha" - is it really the same automation code for both runs - comment #27 and comment #17?

Either way, I still do not see anything related to the postgres version. Let's wait and retest once patch 80179 is merged
Comment 29 Petr Matyáš 2017-08-07 07:33:50 EDT
I didn't use automation for these runs, hence the naming issue, but I just tested with the same name and it is working as well.

Don't know why this would be related to the DB, but let's wait for that patch and test with it on 9.2 and 9.5 :)
