Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1348847

Summary: Multiple auto-start pool member VMs not starting back after network issues
Product: Red Hat Enterprise Virtualization Manager
Reporter: Barak Korren <bkorren>
Component: ovirt-engine
Assignee: Nobody <nobody>
Status: CLOSED DUPLICATE
QA Contact: meital avital <mavital>
Severity: high
Docs Contact:
Priority: unspecified
Version: 3.6.7
CC: ahadas, bkorren, gklein, lsurette, michal.skrivanek, pzhukov, rbalakri, Rhev-m-bugs, srevivo, tjelinek, ykaul
Target Milestone: ---
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-07-12 13:14:05 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Virt
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
Description                      Flags
engine.log                       none
engine.log form from june 21st   none
June 20th engine.log             none
June 22nd engine.log             none

Description Barak Korren 2016-06-22 08:24:04 UTC
Created attachment 1170626 [details]
engine.log

Description of problem:
After a massive set of network issues in our labs that caused several hosts to go down, as well as storage connectivity issues, we found that several VMs cannot be started back up.

Version-Release number of selected component (if applicable):
3.6.7-0.1.el6

I will attach engine logs with details since the start of the network issues.
The issues started around 18:00 IDT on Jun 21st, 2016.

We now see 4 different VMs that do not start up:
	
- nested-lab4-rgi-7.2-20160302.0-builder-3 - VM seems locked, cannot be started

- nested-lab4-rgi-7.2-20160302.0-ge_worker-3 - Failing to start with the following error message:

[org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-86) [75977d93] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM nested-lab4-rgi-7.2-20160302.0-ge_worker-3 is down with error. Exit message: Bad volume specification {'index': '0', u'iface': u'virtio', 'reqsize': '0', u'format': u'cow', u'optional': u'false', u'volumeID': u'23c6ea76-c643-4d50-8827-c48b066ed9d5', 'apparentsize': '1073741824', u'imageID': u'eaa0b3b0-22ae-4995-864b-488ef2cdb01c', u'specParams': {}, u'readonly': u'false', u'domainID': u'c05f309b-5460-4971-8dc2-758d8dfd6ea9', u'deviceId': u'eaa0b3b0-22ae-4995-864b-488ef2cdb01c', 'truesize': '1073741824', u'poolID': u'8b46552f-2793-4216-9b5d-01cd13a677b6', u'device': u'disk', u'shared': u'false', u'propagateErrors': u'off', u'type': u'disk'}.
[org.ovirt.engine.core.vdsbroker.VmAnalyzer] (DefaultQuartzScheduler_Worker-86) [75977d93] Running on vds during rerun failed vm: 'null'

- nested-lab4-rgi-7.2-20160302.0-ge_worker-7 - Seems to show a failure similar to 'ge_worker-3':

[org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-2) [69092234] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM nested-lab4-rgi-7.2-20160302.0-ge_worker-7 is down with error. Exit message: Bad volume specification {'index': '0', u'iface': u'virtio', 'reqsize': '0', u'format': u'cow', u'optional': u'false', u'volumeID': u'e784e96b-6af2-41b6-b53b-af7d32c9d848', 'apparentsize': '1073741824', u'imageID': u'80b543b1-7d92-4a0b-9598-922b211429ee', u'specParams': {}, u'readonly': u'false', u'domainID': u'c05f309b-5460-4971-8dc2-758d8dfd6ea9', u'deviceId': u'80b543b1-7d92-4a0b-9598-922b211429ee', 'truesize': '1073741824', u'poolID': u'8b46552f-2793-4216-9b5d-01cd13a677b6', u'device': u'disk', u'shared': u'false', u'propagateErrors': u'off', u'type': u'disk'}.

- nested-lab4-rgi-7.2-20160302.0-worker-3 - Seems to not be started at all because it lost its disk somehow
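
The "Bad volume specification" messages quoted above embed the failing drive as a Python dict literal, so the volume, image and storage domain IDs can be extracted mechanically and compared across VMs. The following is a minimal sketch, assuming the message keeps the shape shown in the log lines above. The helper name parse_bad_volume_spec is hypothetical and not part of oVirt or VDSM, and the sample line is a shortened copy of the ge_worker-7 message.

```python
import ast
import re

# Matches the dict literal that follows "Bad volume specification" in the
# engine.log audit message quoted in this bug.
_SPEC_RE = re.compile(r"Bad volume specification (\{.*\})")


def parse_bad_volume_spec(line):
    """Return the drive spec dict embedded in an audit log line, or None.

    Hypothetical helper for illustration; not an oVirt/VDSM API.
    """
    match = _SPEC_RE.search(line)
    if match is None:
        return None
    # The spec is printed as a Python dict repr (with u'' prefixes), which
    # ast.literal_eval can read without executing any code.
    return ast.literal_eval(match.group(1))


if __name__ == "__main__":
    # Shortened copy of the ge_worker-7 message from this bug.
    sample = (
        "Message: VM nested-lab4-rgi-7.2-20160302.0-ge_worker-7 is down with "
        "error. Exit message: Bad volume specification {'index': '0', "
        "u'iface': u'virtio', "
        "u'volumeID': u'e784e96b-6af2-41b6-b53b-af7d32c9d848', "
        "u'imageID': u'80b543b1-7d92-4a0b-9598-922b211429ee', "
        "u'specParams': {}, "
        "u'domainID': u'c05f309b-5460-4971-8dc2-758d8dfd6ea9'}."
    )
    spec = parse_bad_volume_spec(sample)
    for key in ("volumeID", "imageID", "domainID"):
        print(key, spec[key])
```

Running the helper over every matching line in engine.log would give one spec per failed start, which makes it easier to see whether the affected VMs share a storage domain.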

Comment 1 Barak Korren 2016-06-22 08:30:10 UTC
Created attachment 1170630 [details]
engine.log form from june 21st

Adding the log from June 21st, as the previously uploaded log only covers June 22nd.

Comment 2 Pavel Zhukov 2016-06-23 14:40:46 UTC
One more thing to add:

engine=# select count(*) from job where status != 'FINISHED' and description like 'Launching VM nested-lab4%';
 count 
-------
  1018
(1 row)

# select count(*) from job where status = 'STARTED' and description like 'Launching VM nested-lab4%';
 count 
-------
   982
(1 row)
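
Both counts above could also be obtained in one pass by grouping on status. The sketch below is only an illustration under stated assumptions: it uses an in-memory SQLite table holding just the two columns the queries above reference (status, description) and invented rows, not the real engine PostgreSQL schema or data. The 'FAILED' status value and the function names are made up for the example.

```python
import sqlite3


def unfinished_jobs_by_status(conn, description_prefix):
    """Count jobs per status, excluding FINISHED, for a description prefix."""
    rows = conn.execute(
        "SELECT status, COUNT(*) FROM job "
        "WHERE status != 'FINISHED' AND description LIKE ? "
        "GROUP BY status ORDER BY status",
        (description_prefix + "%",),
    ).fetchall()
    return dict(rows)


def make_demo_db():
    """Build a throwaway in-memory table with invented rows (not real data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE job (status TEXT, description TEXT)")
    conn.executemany(
        "INSERT INTO job VALUES (?, ?)",
        [
            ("STARTED", "Launching VM nested-lab4-a"),
            ("STARTED", "Launching VM nested-lab4-b"),
            ("FAILED", "Launching VM nested-lab4-c"),  # invented status value
            ("FINISHED", "Launching VM nested-lab4-d"),
            ("STARTED", "Launching VM other-lab-e"),
        ],
    )
    return conn


if __name__ == "__main__":
    print(unfinished_jobs_by_status(make_demo_db(), "Launching VM nested-lab4"))
```

A per-status breakdown shows at a glance how many of the unfinished jobs are still marked as started and how many ended in some other state.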

Comment 3 Tomas Jelinek 2016-06-24 08:29:41 UTC
@Arik: this looks a lot like a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1346270 - do you agree?

Comment 4 Barak Korren 2016-06-26 09:09:42 UTC
@Tomas - please note that not all of the issues we mentioned look like duplicates of that bug, only the last one (for '*-worker-3').

Comment 5 Arik 2016-07-12 07:37:08 UTC
It seems like
nested-lab4-rgi-7.2-20160302.0-ge_worker-3
nested-lab4-rgi-7.2-20160302.0-ge_worker-7
nested-lab4-rgi-7.2-20160302.0-worker-3
were all failing to restart because they lost their disks (duplicate of 1346270)
Barak, why do you differentiate the last VM from the first two VMs?

The problem with nested-lab4-rgi-7.2-20160302.0-builder-3 seems to be different:
on 2016-06-20 19:10:03,217 it was started and its lock was released
on 2016-06-22 03:38:46,254 it fails to migrate because it is locked and no further operation can be executed because of that lock

But the operation that actually locked the VM after 2016-06-20 19:10:03,217 is missing and this information is crucial for the investigation.
Barak, do you have the engine log for that period of time?

Comment 6 Barak Korren 2016-07-12 07:48:54 UTC
The June 21st log file includes everything I've got from '2016-06-20 03:44:14' until '2016-06-21 04:41:53'.
Do you need an earlier or a later log than this?

Comment 7 Barak Korren 2016-07-12 07:53:01 UTC
Created attachment 1178799 [details]
June 20th engine.log

Adding an earlier log in case it's needed (2016-06-19 03:32:04 till 2016-06-19 03:33:09)

Comment 8 Arik 2016-07-12 08:22:37 UTC
(In reply to Barak Korren from comment #6)
> Do you need an earlier or a later log than this?
I need a later log that covers the time from 2016-06-21 04:41:53,980 to 2016-06-22 03:34:02,375
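
One way to check whether a given log file covers the requested window is to filter its lines by their leading timestamp. The following is a minimal sketch, assuming each engine.log entry starts with a "YYYY-MM-DD HH:MM:SS,mmm" timestamp like the ones quoted in this bug. The function names are hypothetical, and the demo lines are invented placeholders rather than real log content.

```python
from datetime import datetime

TS_FORMAT = "%Y-%m-%d %H:%M:%S,%f"
TS_LENGTH = len("2016-06-21 04:41:53,980")


def parse_ts(text):
    """Parse a timestamp in the format assumed for engine.log entries."""
    return datetime.strptime(text, TS_FORMAT)


def lines_in_window(lines, start, end):
    """Yield lines whose leading timestamp falls within [start, end].

    Lines without a parsable timestamp (for example stack trace
    continuations) are kept when they follow a line inside the window.
    """
    inside = False
    for line in lines:
        try:
            ts = parse_ts(line[:TS_LENGTH])
        except ValueError:
            if inside:
                yield line
            continue
        inside = start <= ts <= end
        if inside:
            yield line


if __name__ == "__main__":
    start = parse_ts("2016-06-21 04:41:53,980")
    end = parse_ts("2016-06-22 03:34:02,375")
    # Invented placeholder lines, only to exercise the filter.
    demo = [
        "2016-06-21 04:00:00,000 INFO before the window",
        "2016-06-21 18:00:00,000 ERROR inside the window",
        "2016-06-22 04:00:00,000 INFO after the window",
    ]
    for line in lines_in_window(demo, start, end):
        print(line)
```

To use it on a real file, pass an open file object as lines. An empty result means the file does not cover the requested period.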

Comment 9 Barak Korren 2016-07-12 10:17:02 UTC
Created attachment 1178843 [details]
June 22nd engine.log

Added requested log

Comment 10 Arik 2016-07-12 11:48:40 UTC
Thanks,
So that VM is locked because of a problem in internal-migrate VM that was already fixed as part of bz 1332039, but the fix was not backported to 3.6.z.

Michal, can we backport the fix to 3.6? It is supposed to be very simple.

Comment 11 Michal Skrivanek 2016-07-12 13:14:05 UTC
Closing as a duplicate then. Feel free to raise a potential z-stream request there.

*** This bug has been marked as a duplicate of bug 1332039 ***

Comment 12 Red Hat Bugzilla 2023-09-14 03:27:08 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days