Bug 1477151
Summary: Clean up old VM recovery information also by VM name

Product: [oVirt] vdsm
Component: Core
Status: CLOSED CURRENTRELEASE
Severity: high
Priority: medium
Version: ---
Target Milestone: ovirt-4.2.0
Target Release: 4.20.9.1
Hardware: Unspecified
OS: Unspecified
Reporter: Petr Matyáš <pmatyas>
Assignee: Milan Zamazal <mzamazal>
QA Contact: Vladimir <vshypygu>
CC: bugs, emesika, lsvaty, michal.skrivanek, mperina, mzamazal, pmatyas, tjelinek
Flags: rule-engine: ovirt-4.2+
Doc Type: No Doc Update
Last Closed: 2018-02-22 09:58:31 UTC
Type: Bug
oVirt Team: Virt
Tested on a regular setup with PostgreSQL 9.2 and a local DB, working in ovirt-4.2. Unable to reproduce the scenario on master.

Can you please verify that this is consistent, i.e. try to add several more VMs and use Run Once on each?

Yes, this seemed consistent on our env with a remote DB. However, I didn't test thoroughly enough, so to clarify: this happens only on a VM that is highly available, where the host got fenced. When the fencing is finished and the host is up again, the VM is powered off and removed. Then, when you create the same VM and try to start it on the host that got fenced, it just won't start.

Can you please also provide the relevant Postgres server log?

Created attachment 1307987 [details]
pg log
(In reply to Petr Matyáš from comment #5)
> Created attachment 1307987 [details]
> pg log

Are you sure that this is the correct PG log, aligned with the problem? I see a very short cycle of a PG session that is terminated by the user (stopping PG manually, as far as I understand). Then, at the end of the log, a very weird error is reported:

< 2017-08-01 13:35:16.288 IDT >ERROR: relation "vm_stati" does not exist at character 28
< 2017-08-01 13:35:16.288 IDT >STATEMENT: select vm_name,status from vm_stati;
< 2017-08-01 13:35:18.599 IDT >ERROR: column "status" does not exist at character 16
< 2017-08-01 13:35:18.599 IDT >STATEMENT: select vm_name,status from vm_static;

It seems a query is using "vm_stati" instead of "vm_static". But I think this log is not aligned with the reported scenario. Can you please:
1) Stop the engine
2) Delete the relevant PG log
3) Restart PG (this will create a new PG log for the day)
4) Reproduce the reported scenario
5) Send the PG log again
Thanks

Created attachment 1308215 [details]
pg log with a little debug
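As an aside on the mistyped relation seen in the PG log above ("vm_stati" vs. "vm_static"): a near-miss like this is easy to confirm with Python's difflib. The table list below is an assumed subset of the engine schema, for illustration only:

```python
import difflib

# A few engine-schema table names (assumed subset, for illustration only).
known_tables = ["vm_static", "vm_dynamic", "vds_static", "vds_dynamic"]

# The relation name that PG rejected in the log.
typo = "vm_stati"

# difflib flags vm_static as the closest match, supporting the typo theory.
print(difflib.get_close_matches(typo, known_tables, n=1))  # ['vm_static']
```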
When trying to reproduce this, the engine went bananas every time I opened the VMs tab.
I managed to get as far as fencing the host with the VM running, waited for the host to come back up, powered off and removed the VM, and created the VM again. Then everything went to hell.
This log has the biggest debug level I could find, so hopefully you will manage to find the issue.
(In reply to Petr Matyáš from comment #7)
> Created attachment 1308215 [details]
> This log has the biggest debug level I could find so hopefully, you will
> manage to find the issue.

Well, there is no ERROR in the PG log, which means that the complaint about a duplicate UUID does not come from a data violation in the DB. Looking into the code, the only place that can cause this "Exit message" to be displayed when a VM is set to DOWN is in:

/backend/manager/modules/vdsbroker/src/main/java/org/ovirt/engine/core/vdsbroker/monitoring/VmAnalyzer.java

This is the relevant code snippet:

---START---
private void auditVmOnDownError() {
    AuditLogableBase logable = Injector.injectMembers(new AuditLogableBase(vdsManager.getVdsId(), getVmId()));
    logable.addCustomValue("ExitMessage",
            vdsmVm.getVmDynamic().getExitMessage() != null
                    ? "Exit message: " + vdsmVm.getVmDynamic().getExitMessage()
                    : " ");
    auditLog(logable, AuditLogType.VM_DOWN_ERROR);
}
---END---

The "Exit message" is set from the exit_message value returned by the call to VDSM. So, AFAIU, there is no DB issue here and you should investigate what else was changed besides the PG version.

From what I'm told, it's the same configuration as any other engine we have for testing; only the DB version is different. I can give you access to our engine if you want; we have it only until today, though.

This is taken from the attached VDSM log:

libvirtError: operation failed: domain 'vm_ha' already exists with uuid 270a28b7-1d2a-4f66-8a69-e5c614770dcc
2017-08-01 13:26:42,433+0300 INFO (vm/e52c945a) [virt.vm] (vmId='e52c945a-1550-4f1b-ad7e-996f5746cbce') Changed state to Down: operation failed: domain 'vm_ha' already exists with uuid 270a28b7-1d2a-4f66-8a69-e5c614770dcc (code=1) (vm:1441)

So, it's a libvirt error...

Created attachment 1308288 [details]
libvirt logs
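For reference, the ternary in the VmAnalyzer snippet quoted earlier can be mirrored in a few lines of Python. This is a simplified illustration of the assumed behavior, not actual engine code:

```python
def exit_message_field(exit_message):
    """Mirror of the audit-log ternary: prefix the Vdsm exit message,
    or fall back to a single space when Vdsm reported none."""
    return "Exit message: " + exit_message if exit_message is not None else " "

# The libvirt error from the Vdsm log flows through unchanged:
print(exit_message_field("operation failed: domain 'vm_ha' already exists"))
# -> Exit message: operation failed: domain 'vm_ha' already exists
print(repr(exit_message_field(None)))  # -> ' '
```

This matches what the audit log in the bug shows: the "Exit message:" prefix in the engine event is attached by the engine, while the text after it comes verbatim from Vdsm (here, originally from libvirt).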
Please confirm:
- does it happen only with Postgres 9.5?
- did you check the environment for existing VMs with the same UUID via virsh?
- did it reproduce in any other environment?

Yes. No, but shouldn't this be checked for? I can't influence what UUID the VM will get. No, we don't have any other env with PG 9.5.

(In reply to Petr Matyáš from comment #13)
> Yes.

How do you tell that it's only Postgres 9.5? Can you perhaps downgrade Postgres in your environment? Either way, I find it highly unlikely that it's Postgres related.

> No, but shouldn't this be checked for? I can't influence what UUID the VM will get.

Then please check that. There is likely a stray VM registered in libvirt with the same UUID. What happened to that host? Was it disconnected and reconnected to another environment? Reinstalled, upgraded, etc.?

> No, we don't have any other env with PG 9.5.

Can you create a new HA VM and try the same?

(In reply to Michal Skrivanek from comment #15)
> Can you create a new HA VM and try the same?

AFAIU this has already been done; see my request in comment 2 and the response in comment 3.

According to the Vdsm log, the VM can't be run since a VM with the same name already exists on the system. We clean up stale (persistent) VM domains by their UUID and not by their name. Since the VM had been removed and re-created with the same name, the name hasn't changed but the UUID has, and thus the old domain remaining after the non-regular VM termination, with the old UUID, hasn't been removed. Then libvirt detected the VM name clash and refused to run the new VM. We should look up the stale VM domain in cleanup not only by the VM id but also by the VM name.

(In reply to Milan Zamazal from comment #21)
I still do not understand how this explains the fact that it works in PG 9.2 and fails in PG 9.5. The explanation you gave should result in a failure without any dependency on the PG DB version...
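The lookup rule described in the comment above (match stale domains by UUID or by name, not by UUID alone) can be sketched as follows. This is a minimal illustration with invented data structures, not Vdsm's actual recovery code:

```python
def find_stale_domains(defined_domains, new_vm_id, new_vm_name):
    """Domains that clash with a VM being re-created: match by UUID or name."""
    return [d for d in defined_domains
            if d["uuid"] == new_vm_id or d["name"] == new_vm_name]

# A persistent domain left behind after a non-regular termination: it keeps
# the old UUID (270a...) while the re-created VM got a new one (e52c...).
stale = [{"uuid": "270a28b7-1d2a-4f66-8a69-e5c614770dcc", "name": "vm_ha"}]

new_id = "e52c945a-1550-4f1b-ad7e-996f5746cbce"
new_name = "vm_ha"

# Lookup by UUID alone (the old behavior) misses the leftover domain...
print([d for d in stale if d["uuid"] == new_id])  # -> []

# ...while matching by UUID or name finds it, so it can be removed first.
print(find_stale_domains(stale, new_id, new_name))
```

In the real code path the matched domain would then be undefined via libvirt before starting the new VM, which avoids the "domain 'vm_ha' already exists" error seen in the log.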
AFAICT it wasn't tested on 9.2 at all.

(In reply to Michal Skrivanek from comment #24)
> AFAICT it wasn't tested on 9.2 at all

You are wrong, I tested this on 9.2. Actually, this is being tested on 9.2 all the time, since it is an automated test case.

Let's take a look then. Can you provide engine and vdsm logs of the same oVirt version with PG 9.2?

Created attachment 1309988 [details]
all the logs with pg 9.2
That's not the same version as with PG 9.5; most notably it includes commit 59057bd64. Also, one test is using "vm_ha" and the other "vm-ha". Is it really the same automation code for both runs, comment #27 and comment #17? Either way, I still do not see anything related to the Postgres version. Let's wait with a retest until patch 80179 is merged.

I didn't use automation for these runs, hence the naming difference, but I just tested with the same name and it is failing as well. I don't know why this would be related to the DB, but let's wait for that patch and test with it on 9.2 and 9.5 :)

Do I understand correctly that it has nothing to do with remote Postgres or the Postgres version?

You understand it correctly.

Verified on 4.2.1.6-0.1.el7 with local Postgres; the VM was added successfully.

This bugzilla is included in oVirt 4.2.0 release, published on Dec 20th 2017.

Since the problem described in this bug report should be resolved in oVirt 4.2.0 release, published on Dec 20th 2017, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.
Created attachment 1307510 [details]
logs

Description of problem:
We have Postgres 9.5 as a remote DB. The engine fails to Run Once a VM that is pinned to a host (and explicitly started with Run Once on that particular host) with:

2017-08-01 13:26:42,495+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-8) [] EVENT_ID: VM_DOWN_ERROR(119), VM vm_ha is down with error. Exit message: operation failed: domain 'vm_ha' already exists with uuid $uuid.

When I just start the VM, it is started, but on a different host than the one it should be pinned to.

Version-Release number of selected component (if applicable):
postgresql95-server-9.5.7-1PGDG.rhel7.x86_64
ovirt-engine-4.2.0-0.0.master.20170730103259.gitb12378f.el7.centos.noarch
vdsm-4.20.1-271.gitac81a4d.el7.centos.x86_64

How reproducible:
always

Steps to Reproduce:
1. create a VM that is pinned to some host
2. run once that VM
3.

Actual results:
VM fails to start

Expected results:
VM is started on the specified host

Additional info:
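On the triage question raised in the thread about checking for stray domains via virsh, here is a hedged sketch of such a check. It parses the output of `virsh list --all --name`; the subprocess call assumes virsh is on PATH and a local libvirtd is running, and the parsing is kept in its own function so it can be exercised without libvirt:

```python
import subprocess

def parse_domain_names(virsh_output):
    """Turn `virsh list --all --name` output into a list of domain names."""
    return [line.strip() for line in virsh_output.splitlines() if line.strip()]

def stray_domain_exists(vm_name):
    # Assumes virsh is on PATH and a local libvirt daemon is reachable.
    out = subprocess.run(["virsh", "list", "--all", "--name"],
                         capture_output=True, text=True, check=True).stdout
    return vm_name in parse_domain_names(out)

# With output like the leftover domain from this bug, the clash is visible:
sample = "vm_ha\n\n"
print(parse_domain_names(sample))  # -> ['vm_ha']
```

Running `stray_domain_exists("vm_ha")` on the affected host before re-creating the VM would have revealed the leftover domain that libvirt later complained about.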