Bug 1043562 - Can't start vm after upgrade
Summary: Can't start vm after upgrade
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: oVirt
Classification: Retired
Component: ovirt-engine-core
Version: 3.3
Hardware: Unspecified
OS: Unspecified
Severity: urgent
Priority: urgent
Target Milestone: ---
Target Release: 3.4.0
Assignee: Allon Mureinik
QA Contact: bugs@ovirt.org
URL:
Whiteboard: storage
Depends On:
Blocks:
 
Reported: 2013-12-16 16:14 UTC by Ohad Basan
Modified: 2016-02-10 19:00 UTC
CC List: 11 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2014-02-20 09:30:23 UTC
oVirt Team: Storage
Embargoed:


Attachments
engine log (13.20 MB, text/x-log), 2013-12-16 16:18 UTC, Ohad Basan
db1 (2.10 MB, text/x-sql), 2013-12-17 14:01 UTC, Ohad Basan
db2 (1.67 MB, text/x-sql), 2013-12-17 14:05 UTC, Ohad Basan
db3 (3.00 MB, text/x-sql), 2013-12-17 14:06 UTC, Ohad Basan

Description Ohad Basan 2013-12-16 16:14:03 UTC
Description of problem:
Can't start a VM after upgrading from 3.3.0 to 3.3.1.
No error message appears in the UI.
Attaching the engine.log.

Comment 1 Ohad Basan 2013-12-16 16:18:16 UTC
Created attachment 837304 [details]
engine log

Comment 3 Itamar Heim 2013-12-16 20:34:56 UTC
I'm pretty sure this has already been fixed in 3.3.2. Considering it's in RC, please try to reproduce with it first.

Comment 4 Ohad Basan 2013-12-17 09:14:25 UTC
Same problem after upgrading to 3.3.2 RC:
2013-12-17 11:12:19,068 ERROR [org.ovirt.engine.core.bll.SearchQuery] (ajp--127.0.0.1-8702-9) Query SearchQuery failed. Exception message is StatementCallback; bad SQL grammar [SELECT * FROM (SELECT * FROM storage_domains_for_search WHERE ( id IN (SELECT storage_domains_with_hosts_view.id FROM  storage_domains_with_hosts_view  ))  ORDER BY storage_name ASC ) as T1 OFFSET (1 -1) LIMIT 100]; nested exception is org.postgresql.util.PSQLException: ERROR: relation "tt_temp22" does not exist
  Where: SQL statement "truncate table tt_TEMP22"
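
For readers hitting the same error: it means PostgreSQL cannot find the relation that the stored procedure tries to truncate. A minimal diagnostic sketch (assuming psycopg2; the connection details below are placeholders, not taken from this setup):

# Diagnostic sketch: check whether the relation the stored procedure
# expects is visible in the engine DB. Connection parameters are
# placeholders, not from this bug.
import psycopg2

conn = psycopg2.connect(dbname="engine", user="engine", host="localhost")
cur = conn.cursor()
# Unquoted identifiers are folded to lowercase, so look up "tt_temp22".
cur.execute("SELECT 1 FROM pg_class WHERE relname = %s", ("tt_temp22",))
# No row here matches the 'relation "tt_temp22" does not exist' error.
print("exists" if cur.fetchone() else "missing")
conn.close()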

Comment 5 Eli Mesika 2013-12-17 09:23:40 UTC
Please attach a DB backup from before the upgrade.

Comment 6 Ohad Basan 2013-12-17 13:50:29 UTC
I'm attaching three SQL files: one for 3.3.0, one for 3.3.1, and one for 3.3.2 RC.

Comment 7 Ohad Basan 2013-12-17 14:01:13 UTC
Created attachment 837695 [details]
db1

Comment 8 Ohad Basan 2013-12-17 14:05:53 UTC
Created attachment 837698 [details]
db2

Comment 9 Ohad Basan 2013-12-17 14:06:27 UTC
Created attachment 837699 [details]
db3

Comment 10 Eli Mesika 2013-12-25 11:16:55 UTC
Ohad, is this still relevant?

Comment 11 Ohad Basan 2013-12-30 09:17:16 UTC
Yes, it is still relevant.

Comment 12 Eyal Edri 2014-01-01 21:26:43 UTC
Can this bug get assigned? It's currently blocking the use of nested virtualization on oVirt.

Comment 13 Eli Mesika 2014-01-01 21:55:36 UTC
The real cause of this is not the DB. Before that, there is a canDoAction failure of the RunVm command:

2013-12-16 17:50:31,187 WARN  [org.ovirt.engine.core.bll.RunVmCommand] (ajp--127.0.0.1-8702-2) [22dda41f] CanDoAction of action RunVm failed. Reasons:VAR__ACTION__RUN,VAR__TYPE__VM

Please also attach vdsm.log so we can see more info on this failure.

Comment 16 Eli Mesika 2014-01-07 08:55:52 UTC
After analysing with remote debug, it failed on SchedulingManager::canSchedule.
We got an empty list of available hosts although we have one host up and running (the host is filtered out by the scheduling/policy mechanism).

RunVmCommand fails in canDoAction on the following code:

--snip--

if (!SchedulingManager.getInstance().canSchedule(
        vdsGroup, vm, vdsBlackList, vdsWhiteList, destVds, messages)) {
    return false;
}

--snip--
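
To make the failure mode concrete, here is a hypothetical sketch (illustrative names, not oVirt's actual scheduler) of how a chain of policy filters can empty the candidate host list, so canSchedule fails even though a host is Up:

# Hypothetical sketch of filter-chain scheduling: each policy filter may
# drop hosts, and an empty result makes canSchedule fail even though a
# host is up. Names are illustrative, not oVirt's actual classes.
def can_schedule(hosts, filters, messages):
    candidates = list(hosts)
    for f in filters:
        candidates = [h for h in candidates if f(h)]
        if not candidates:
            messages.append("no hosts left after filter %s" % f.__name__)
            return False
    return True

def memory_filter(host):
    return host["free_mem_mb"] >= 1024

messages = []
hosts = [{"name": "host1", "status": "Up", "free_mem_mb": 512}]
print(can_schedule(hosts, [memory_filter], messages), messages)
# -> False ['no hosts left after filter memory_filter']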

Moving the bug to SLA

Comment 17 Doron Fediuck 2014-01-07 15:00:46 UTC
(In reply to Eli Mesika from comment #16)

..
..

> We got an empty list of available hosts although we have one host up and
> running (the host is filtered out by the scheduling/policy mechanism)

..
..

In such a case the log provides filtering information (which hosts were filtered out by which filter). The attached log does not show this.
Can you please provide a relevant log or access to the system?

Comment 19 Gilad Chaplik 2014-01-07 16:30:21 UTC
Doron is right; the logs are stale. Martin already provided a fix for the engine logging issues [1].

The current issue that Ohad is experiencing is a storage issue.

from vdsm log:

Thread-1585711::ERROR::2014-01-05 14:03:25,214::vm::2062::vm.Vm::(_startUnderlyingVm) vmId=`774c712d-a130-48ca-8280-be847e80f216`::The vm start process failed
Traceback (most recent call last):
  File "/usr/share/vdsm/vm.py", line 2022, in _startUnderlyingVm
    self._run()
  File "/usr/share/vdsm/vm.py", line 2820, in _run
    self.preparePaths(devices[DISK_DEVICES])
  File "/usr/share/vdsm/vm.py", line 2083, in preparePaths
    drive['path'] = self.cif.prepareVolumePath(drive, self.id)
  File "/usr/share/vdsm/clientIF.py", line 256, in prepareVolumePath
    raise vm.VolumeError(drive)
VolumeError: Bad volume specification {'index': 0, 'iface': 'virtio', 'reqsize': '0', 'format': 'cow', 'bootOrder': '2', 'poolID': '3e0452c8-21b5-4df1-aced-44bc4c568229', 'volumeID': 'ace57d6a-0f7c-4ee5-a072-46ecb4df8ffe', 'apparentsize': '1073741824', 'imageID': 'c6978895-ab61-4df1-889c-d1319957574c', 'specParams': {}, 'readonly': 'false', 'domainID': 'eb317662-1dab-458b-b9dd-82634e6222ae', 'optional': 'false', 'deviceId': 'c6978895-ab61-4df1-889c-d1319957574c', 'truesize': '1073741824', 'address': {'bus': '0x00', ' slot': '0x06', ' domain': '0x0000', ' type': 'pci', ' function': '0x0'}, 'device': 'disk', 'shared': 'false', 'propagateErrors': 'off', 'type': 'disk'}

We try to run the VM on the host, fail ^^, try to rerun, and are left with no hosts. That's why the second iteration of canSchedule fails with no hosts, and there is no error popup for it since it's internal.

[1] https://bugzilla.redhat.com/1002005
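
A hypothetical sketch of that rerun behavior (not engine code; names are illustrative): each failed start blacklists the host, so the retry's scheduling pass starts with no candidates and fails internally, with no UI popup:

# Hypothetical sketch of the rerun loop described above: a start failure
# blacklists the host, so the retry has no candidates and fails silently.
def try_start_vm(hosts, start_fn):
    blacklist = set()
    while True:
        candidates = [h for h in hosts if h not in blacklist]
        if not candidates:
            return None  # internal failure: nothing to pop up in the UI
        host = candidates[0]
        if start_fn(host):
            return host
        blacklist.add(host)  # VolumeError on this host; exclude and rerun

def failing_start(host):
    return False  # stands in for prepareVolumePath raising VolumeError

print(try_start_vm(["host1"], failing_start))  # -> None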

Comment 20 Doron Fediuck 2014-01-07 16:34:45 UTC
Re-assigning to storage based on Gilad's findings.

Comment 21 Itamar Heim 2014-01-12 08:42:36 UTC
Setting the target release to the current version for consideration and review. Please do not push non-RFE bugs to an undefined target release, to make sure bugs are reviewed for relevancy, fix, closure, etc.

Comment 22 Ayal Baron 2014-02-17 13:46:39 UTC
(In reply to Gilad Chaplik from comment #19)

..
..

> from vdsm log:
>
> Thread-1585711::ERROR::2014-01-05
> 14:03:25,214::vm::2062::vm.Vm::(_startUnderlyingVm)
> vmId=`774c712d-a130-48ca-8280-be847e80f216`::The vm start process failed

..
..

Where did you take this log from? The relevant part is missing from this excerpt.

Comment 23 Gilad Chaplik 2014-02-17 16:10:57 UTC
(In reply to Ayal Baron from comment #22)

..
..

> Where did you take this log from? the relevant part is missing from this
> excerpt

From the attached vdsm log.

Comment 24 Ayal Baron 2014-02-17 21:43:18 UTC
(In reply to Gilad Chaplik from comment #23)
> (In reply to Ayal Baron from comment #22)
> 
> ..
> ..
> 
> > Where did you take this log from? the relevant part is missing from this
> > excerpt
> 
> attached vdsm log.

Sorry, it looks like the initial download wasn't fully successful and I was looking at a partial log.

It looks like the images directory on the domain doesn't exist for some reason.

Thread-1585711::ERROR::2014-01-05 14:03:25,206::blockVolume::446::Storage.Volume::(validateImagePath) Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/blockVolume.py", line 444, in validateImagePath
    os.mkdir(imageDir, 0755)
OSError: [Errno 2] No such file or directory: '/rhev/data-center/3e0452c8-21b5-4df1-aced-44bc4c568229/eb317662-1dab-458b-b9dd-82634e6222ae/images/c6978895-ab61-4df1-889c-d1319957574c'
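
For context on that OSError: os.mkdir only creates the leaf directory, so errno 2 (ENOENT) here means a parent of imageDir, up to and including the domain's images/ directory, is missing. A minimal demonstration (0o755 is the Python 3 spelling of the 0755 in the traceback; the path below is a throwaway temp dir, not the real domain path):

# Demonstration of the failure mode: os.mkdir only creates the leaf
# directory, so a missing parent raises the same OSError (errno 2).
import errno
import os
import tempfile

base = tempfile.mkdtemp()
try:
    os.mkdir(os.path.join(base, "images", "c6978895"), 0o755)
except OSError as e:
    print(e.errno == errno.ENOENT)  # True: parent "images" does not exist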

Comment 25 Ayal Baron 2014-02-20 09:30:23 UTC
(In reply to Ayal Baron from comment #24)

..
..

> It looks like the images directory on the domain doesn't exist for some reason.

..
..

The logs don't contain enough info to understand how the setup reached this situation, and the logs no longer exist.
If this happens again, please reopen.

