Bug 1348847
| Summary: | Multiple auto-start pool member VMs not starting back after network issues | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Virtualization Manager | Reporter: | Barak Korren <bkorren> |
| Component: | ovirt-engine | Assignee: | Nobody <nobody> |
| Status: | CLOSED DUPLICATE | QA Contact: | meital avital <mavital> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 3.6.7 | CC: | ahadas, bkorren, gklein, lsurette, michal.skrivanek, pzhukov, rbalakri, Rhev-m-bugs, srevivo, tjelinek, ykaul |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2016-07-12 13:14:05 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | Virt | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Created attachment 1170630 [details]
engine.log from June 21st

Adding the log from June 21st, as the previously uploaded log only has June 22nd.
One more thing to add:

```
engine=# select count(*) from job where status != 'FINISHED' and description like 'Launching VM nested-lab4%';
 count
-------
  1018
(1 row)

engine=# select count(*) from job where status = 'STARTED' and description like 'Launching VM nested-lab4%';
 count
-------
   982
(1 row)
```
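For a per-VM breakdown of those stuck jobs, a query along the same lines should work. A minimal sketch, assuming the same PostgreSQL `job` table and that every matching description has the form 'Launching VM <name>' (so the VM name is the third space-separated token):

```sql
-- Count unfinished 'Launching VM' jobs per VM name and status.
-- Uses only the job.status/job.description columns seen in the counts
-- above; split_part is standard PostgreSQL. The 'Launching VM <name>'
-- description format is an assumption based on the queries above.
select split_part(description, ' ', 3) as vm_name,
       status,
       count(*) as stuck_jobs
from job
where status != 'FINISHED'
  and description like 'Launching VM nested-lab4%'
group by 1, 2
order by stuck_jobs desc;
```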
@Arik: this looks a lot like a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1346270 - do you agree?

@Tomas - please note that not all of the issues we mentioned look like duplicates of that, only the last one (for '*-worker-3').

It seems like
nested-lab4-rgi-7.2-20160302.0-ge_worker-3
nested-lab4-rgi-7.2-20160302.0-ge_worker-7
nested-lab4-rgi-7.2-20160302.0-worker-3
were all failing to restart because they lost their disks (duplicate of bug 1346270).

Barak, why do you differentiate the last VM from the first two VMs?

The problem with nested-lab4-rgi-7.2-20160302.0-builder-3 seems to be different:
- on 2016-06-20 19:10:03,217 it was started and its lock was released
- on 2016-06-22 03:38:46,254 it fails to migrate because it is locked, and no further operation can be executed because of that lock

But the operation that actually locked the VM after 2016-06-20 19:10:03,217 is missing, and this information is crucial for the investigation. Barak, do you have the engine log for that period of time?
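One way to hunt for the missing lock operation is to list every job that touched builder-3 inside that window. A minimal sketch, assuming the `job` table also carries `start_time`/`end_time` timestamp columns (only `status` and `description` are confirmed by the counts earlier in this bug):

```sql
-- List jobs that ran against builder-3 between the lock release
-- (2016-06-20 19:10) and the failed migration (2016-06-22 03:38).
-- start_time/end_time are assumed column names, not confirmed here.
select description, status, start_time, end_time
from job
where description like '%nested-lab4-rgi-7.2-20160302.0-builder-3%'
  and start_time between '2016-06-20 19:10:03' and '2016-06-22 03:38:46'
order by start_time;
```

Any job still in a non-FINISHED state in that range would be a candidate for whatever left the VM locked.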
The June 21st log file includes everything I've got from '2016-06-20 03:44:14' until 2016-06-21 04:41:53. Do you need an earlier or a later log than this?

Created attachment 1178799 [details]
June 20th engine.log

Adding an earlier log in case it's needed (2016-06-19 03:32:04 till 2016-06-19 03:33:09)
(In reply to Barak Korren from comment #6)
> Do you need an earlier or a later log than this?

I need a later log that covers the time from 2016-06-21 04:41:53,980 to 2016-06-22 03:34:02,375.

Created attachment 1178843 [details]
June 22nd engine.log
Added requested log
Thanks. So that VM is locked because of a problem in internal-migrate VM that was already fixed as part of bz 1332039, but the fix was not backported to 3.6.z.

Michal, can we backport the fix to 3.6? It is supposed to be very simple.

Closing as a duplicate then. Feel free to raise a potential zstream request there.

*** This bug has been marked as a duplicate of bug 1332039 ***

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days
Created attachment 1170626 [details]
engine.log

Description of problem:
After a massive set of network issues in our labs that caused several hosts to go down, as well as storage connectivity issues, we found out that we have several VMs that cannot be started back up.

Version-Release number of selected component (if applicable):
3.6.7-0.1.el6

I will attach engine logs with details since the start of the network issues. The issues started around 18:00 IDT on Jun 21st, 2016.

We now see 4 different VMs that do not start up:

- nested-lab4-rgi-7.2-20160302.0-builder-3 - VM seems locked, cannot be started

- nested-lab4-rgi-7.2-20160302.0-ge_worker-3 - Failing to start with the following error message:

```
[org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-86) [75977d93] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM nested-lab4-rgi-7.2-20160302.0-ge_worker-3 is down with error. Exit message: Bad volume specification {'index': '0', u'iface': u'virtio', 'reqsize': '0', u'format': u'cow', u'optional': u'false', u'volumeID': u'23c6ea76-c643-4d50-8827-c48b066ed9d5', 'apparentsize': '1073741824', u'imageID': u'eaa0b3b0-22ae-4995-864b-488ef2cdb01c', u'specParams': {}, u'readonly': u'false', u'domainID': u'c05f309b-5460-4971-8dc2-758d8dfd6ea9', u'deviceId': u'eaa0b3b0-22ae-4995-864b-488ef2cdb01c', 'truesize': '1073741824', u'poolID': u'8b46552f-2793-4216-9b5d-01cd13a677b6', u'device': u'disk', u'shared': u'false', u'propagateErrors': u'off', u'type': u'disk'}.
[org.ovirt.engine.core.vdsbroker.VmAnalyzer] (DefaultQuartzScheduler_Worker-86) [75977d93] Running on vds during rerun failed vm: 'null'
```

- nested-lab4-rgi-7.2-20160302.0-ge_worker-7 - Seems to show a similar failure to 'ge_worker-3':

```
[org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-2) [69092234] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM nested-lab4-rgi-7.2-20160302.0-ge_worker-7 is down with error. Exit message: Bad volume specification {'index': '0', u'iface': u'virtio', 'reqsize': '0', u'format': u'cow', u'optional': u'false', u'volumeID': u'e784e96b-6af2-41b6-b53b-af7d32c9d848', 'apparentsize': '1073741824', u'imageID': u'80b543b1-7d92-4a0b-9598-922b211429ee', u'specParams': {}, u'readonly': u'false', u'domainID': u'c05f309b-5460-4971-8dc2-758d8dfd6ea9', u'deviceId': u'80b543b1-7d92-4a0b-9598-922b211429ee', 'truesize': '1073741824', u'poolID': u'8b46552f-2793-4216-9b5d-01cd13a677b6', u'device': u'disk', u'shared': u'false', u'propagateErrors': u'off', u'type': u'disk'}.
```

- nested-lab4-rgi-7.2-20160302.0-worker-3 - Seems to not be started at all because it lost its disk somehow
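For the two 'Bad volume specification' failures, one quick cross-check is whether the engine DB still knows the volume IDs from the error messages. A minimal sketch, assuming an `images` table keyed by `image_guid`/`image_group_id` (these table and column names are assumptions about the engine schema, not something confirmed in this report):

```sql
-- Does the engine DB still reference the ge_worker-3 volume/image
-- from the error above? Empty output would fit the "lost its disks"
-- diagnosis (bug 1346270). Schema names here are assumptions.
select image_guid, image_group_id
from images
where image_guid = '23c6ea76-c643-4d50-8827-c48b066ed9d5'
   or image_group_id = 'eaa0b3b0-22ae-4995-864b-488ef2cdb01c';
```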