Bug 1043562

Summary: Can't start vm after upgrade
Product: [Retired] oVirt
Reporter: Ohad Basan <obasan>
Component: ovirt-engine-core
Assignee: Allon Mureinik <amureini>
Status: CLOSED INSUFFICIENT_DATA
QA Contact: bugs <bugs>
Severity: urgent
Priority: urgent
Docs Contact:
Version: 3.3
CC: abaron, acanan, acathrow, bazulay, eedri, emesika, gchaplik, iheim, obasan, sbonazzo, yeylon
Target Milestone: ---
Keywords: Triaged
Target Release: 3.4.0
Hardware: Unspecified
OS: Unspecified
Whiteboard: storage
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2014-02-20 09:30:23 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Attachments:
engine log (flags: none)
db1 (flags: none)
db2 (flags: none)
db3 (flags: none)

Description Ohad Basan 2013-12-16 16:14:03 UTC
Description of problem:
Can't start a VM after upgrading from 3.3.0 to 3.3.1.
No error message appears in the UI.
Attaching the engine.log.

Comment 1 Ohad Basan 2013-12-16 16:18:16 UTC
Created attachment 837304 [details]
engine log

Comment 3 Itamar Heim 2013-12-16 20:34:56 UTC
I'm pretty sure this has already been fixed in 3.3.2. Considering it's in RC, please try to reproduce with it first.

Comment 4 Ohad Basan 2013-12-17 09:14:25 UTC
Same problem after upgrading to 3.3.2 RC:
2013-12-17 11:12:19,068 ERROR [org.ovirt.engine.core.bll.SearchQuery] (ajp--127.0.0.1-8702-9) Query SearchQuery failed. Exception message is StatementCallback; bad SQL grammar [SELECT * FROM (SELECT * FROM storage_domains_for_search WHERE ( id IN (SELECT storage_domains_with_hosts_view.id FROM  storage_domains_with_hosts_view  ))  ORDER BY storage_name ASC ) as T1 OFFSET (1 -1) LIMIT 100]; nested exception is org.postgresql.util.PSQLException: ERROR: relation "tt_temp22" does not exist
  Where: SQL statement "truncate table tt_TEMP22"
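
For context on this error: PostgreSQL folds unquoted identifiers to lower case, so tt_TEMP22 and "tt_temp22" name the same relation, and temporary tables are visible only in the session that created them. The error therefore means the temp table was never created in the session running the TRUNCATE, not that the name is mismatched. A minimal sketch of the expected behavior, assuming a reachable PostgreSQL and psycopg2 (the connection string is a placeholder):

--snip--

# Sketch only; the connection parameters are placeholders, not taken
# from the affected setup.
import psycopg2

conn = psycopg2.connect("dbname=engine user=engine")
cur = conn.cursor()

# Unquoted identifiers fold to lower case: this creates relation tt_temp22.
cur.execute("CREATE TEMPORARY TABLE tt_TEMP22 (id uuid)")

# Succeeds because the CREATE ran first in the same session; without it,
# PostgreSQL raises: relation "tt_temp22" does not exist.
cur.execute("TRUNCATE TABLE tt_TEMP22")
conn.close()

--snip--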

Comment 5 Eli Mesika 2013-12-17 09:23:40 UTC
Please attach a DB backup from before the upgrade.

Comment 6 Ohad Basan 2013-12-17 13:50:29 UTC
I'm attaching three SQL files: one for 3.3.0, one for 3.3.1, and one for 3.3.2 RC.

Comment 7 Ohad Basan 2013-12-17 14:01:13 UTC
Created attachment 837695 [details]
db1

Comment 8 Ohad Basan 2013-12-17 14:05:53 UTC
Created attachment 837698 [details]
db2

Comment 9 Ohad Basan 2013-12-17 14:06:27 UTC
Created attachment 837699 [details]
db3

Comment 10 Eli Mesika 2013-12-25 11:16:55 UTC
Ohad, is this still relevant?

Comment 11 Ohad Basan 2013-12-30 09:17:16 UTC
Yes, it is still relevant.

Comment 12 Eyal Edri 2014-01-01 21:26:43 UTC
Can this bug get assigned? It's currently blocking the use of nested virtualization on oVirt.

Comment 13 Eli Mesika 2014-01-01 21:55:36 UTC
The real cause of this is not the DB; before that, there is a canDoAction failure of the RunVm command:

2013-12-16 17:50:31,187 WARN  [org.ovirt.engine.core.bll.RunVmCommand] (ajp--127.0.0.1-8702-2) [22dda41f] CanDoAction of action RunVm failed. Reasons:VAR__ACTION__RUN,VAR__TYPE__VM

Please also attach vdsm.log so we can see more info on this failure.

Comment 16 Eli Mesika 2014-01-07 08:55:52 UTC
After analysing with remote debug, it failed in SchedulingManager::canSchedule:
we got an empty list of available hosts although we have one host Up and running
(the host is filtered out by the scheduling/policy mechanism).

RunVmCommand fails in canDoAction on the following code:

--snip--

 if (!SchedulingManager.getInstance().canSchedule(
                vdsGroup, vm, vdsBlackList, vdsWhiteList, destVds, messages)) {
            return false;
        }

--snip--
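
Schematically (an illustration only, not the actual SchedulingManager code), canSchedule applies each scheduling filter to the candidate host list and fails as soon as nothing survives, which matches what the remote debug showed: one host Up, yet zero candidates left:

--snip--

# Hedged sketch of the filter-chain behavior; the function and message
# names here are illustrative, not oVirt's.
def can_schedule(hosts, filters, messages):
    candidates = list(hosts)
    for host_filter in filters:
        candidates = [h for h in candidates if host_filter(h)]
        if not candidates:
            messages.append("all hosts were filtered out")  # placeholder text
            return False
    return True

--snip--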

Moving the bug to SLA

Comment 17 Doron Fediuck 2014-01-07 15:00:46 UTC
(In reply to Eli Mesika from comment #16)
> After analysing with remote debug, it failed in
> SchedulingManager::canSchedule: we got an empty list of available hosts
> although we have one host Up and running (the host is filtered out by the
> scheduling/policy mechanism)
> 
> RunVmCommand fails in canDoAction on the following code:
> 
> --snip--
> 
>  if (!SchedulingManager.getInstance().canSchedule(
>                 vdsGroup, vm, vdsBlackList, vdsWhiteList, destVds,
> messages)) {
>             return false;
>         }
> 
> --snip--
> 
> Moving the bug to SLA

In such a case the log provides filtering information (which hosts were
filtered out by which filter). The attached log does not show this.
Can you please provide a relevant log or access to the system?

Comment 19 Gilad Chaplik 2014-01-07 16:30:21 UTC
Doron is right; the logs are stale. Martin already provided a fix for the engine logging issues [1].

The current issue that Ohad is experiencing is a storage issue.

From the vdsm log:

Thread-1585711::ERROR::2014-01-05 14:03:25,214::vm::2062::vm.Vm::(_startUnderlyingVm) vmId=`774c712d-a130-48ca-8280-be847e80f216`::The vm start process failed
Traceback (most recent call last):
  File "/usr/share/vdsm/vm.py", line 2022, in _startUnderlyingVm
    self._run()
  File "/usr/share/vdsm/vm.py", line 2820, in _run
    self.preparePaths(devices[DISK_DEVICES])
  File "/usr/share/vdsm/vm.py", line 2083, in preparePaths
    drive['path'] = self.cif.prepareVolumePath(drive, self.id)
  File "/usr/share/vdsm/clientIF.py", line 256, in prepareVolumePath
    raise vm.VolumeError(drive)
VolumeError: Bad volume specification {'index': 0, 'iface': 'virtio', 'reqsize': '0', 'format': 'cow', 'bootOrder': '2', 'poolID': '3e0452c8-21b5-4df1-aced-44bc4c568229', 'volumeID': 'ace57d6a-0f7c-4ee5-a072-46ecb4df8ffe', 'apparentsize': '1073741824', 'imageID': 'c6978895-ab61-4df1-889c-d1319957574c', 'specParams': {}, 'readonly': 'false', 'domainID': 'eb317662-1dab-458b-b9dd-82634e6222ae', 'optional': 'false', 'deviceId': 'c6978895-ab61-4df1-889c-d1319957574c', 'truesize': '1073741824', 'address': {'bus': '0x00', ' slot': '0x06', ' domain': '0x0000', ' type': 'pci', ' function': '0x0'}, 'device': 'disk', 'shared': 'false', 'propagateErrors': 'off', 'type': 'disk'}

We try to run the VM on the host, fail (see the traceback above), try to rerun, and are left with no hosts. That's why the second iteration of canSchedule fails with no hosts, and there is no error popup for it since it's internal.

[1] https://bugzilla.redhat.com/1002005
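
A schematic sketch of that rerun flow (an illustration, not the engine's actual code): each failed attempt blacklists the host for the rerun, so with a single-host cluster the second canSchedule pass starts from an empty candidate list and fails internally, with nothing surfaced in the UI:

--snip--

# Hedged illustration of the rerun/blacklist behavior; try_run is a
# hypothetical stand-in for the real run attempt.
def run_with_rerun(hosts, try_run):
    blacklist = set()
    while True:
        candidates = [h for h in hosts if h not in blacklist]
        if not candidates:
            return None  # internal failure: no error popup for the user
        host = candidates[0]
        if try_run(host):
            return host  # VM started
        blacklist.add(host)  # exclude this host from the rerun

--snip--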

Comment 20 Doron Fediuck 2014-01-07 16:34:45 UTC
Re-assigning to storage based on Gilad's findings.

Comment 21 Itamar Heim 2014-01-12 08:42:36 UTC
Setting target release to the current version for consideration and review. Please do not push non-RFE bugs to an undefined target release, so that bugs are reviewed for relevancy, fix, closure, etc.

Comment 22 Ayal Baron 2014-02-17 13:46:39 UTC
(In reply to Gilad Chaplik from comment #19)
> Doron is right; the logs are stale. Martin already provided a fix for the
> engine logging issues [1].
> 
> The current issue that Ohad is experiencing is a storage issue.
> 
> From the vdsm log:
> 
> Thread-1585711::ERROR::2014-01-05
> 14:03:25,214::vm::2062::vm.Vm::(_startUnderlyingVm)
> vmId=`774c712d-a130-48ca-8280-be847e80f216`::The vm start process failed
> Traceback (most recent call last):
>   File "/usr/share/vdsm/vm.py", line 2022, in _startUnderlyingVm
>     self._run()
>   File "/usr/share/vdsm/vm.py", line 2820, in _run
>     self.preparePaths(devices[DISK_DEVICES])
>   File "/usr/share/vdsm/vm.py", line 2083, in preparePaths
>     drive['path'] = self.cif.prepareVolumePath(drive, self.id)
>   File "/usr/share/vdsm/clientIF.py", line 256, in prepareVolumePath
>     raise vm.VolumeError(drive)
> VolumeError: Bad volume specification {'index': 0, 'iface': 'virtio',
> 'reqsize': '0', 'format': 'cow', 'bootOrder': '2', 'poolID':
> '3e0452c8-21b5-4df1-aced-44bc4c568229', 'volumeID':
> 'ace57d6a-0f7c-4ee5-a072-46ecb4df8ffe', 'apparentsize': '1073741824',
> 'imageID': 'c6978895-ab61-4df1-889c-d1319957574c', 'specParams': {},
> 'readonly': 'false', 'domainID': 'eb317662-1dab-458b-b9dd-82634e6222ae',
> 'optional': 'false', 'deviceId': 'c6978895-ab61-4df1-889c-d1319957574c',
> 'truesize': '1073741824', 'address': {'bus': '0x00', ' slot': '0x06', '
> domain': '0x0000', ' type': 'pci', ' function': '0x0'}, 'device': 'disk',
> 'shared': 'false', 'propagateErrors': 'off', 'type': 'disk'}
> 
> We try to run the VM on the host, fail (see the traceback above), try to
> rerun, and are left with no hosts. That's why the second iteration of
> canSchedule fails with no hosts, and there is no error popup for it since
> it's internal.
> 
> [1] https://bugzilla.redhat.com/1002005

Where did you take this log from? the relevant part is missing from this excerpt

Comment 23 Gilad Chaplik 2014-02-17 16:10:57 UTC
(In reply to Ayal Baron from comment #22)

..
..

> Where did you take this log from? the relevant part is missing from this
> excerpt

attached vdsm log.

Comment 24 Ayal Baron 2014-02-17 21:43:18 UTC
(In reply to Gilad Chaplik from comment #23)
> (In reply to Ayal Baron from comment #22)
> 
> ..
> ..
> 
> > Where did you take this log from? the relevant part is missing from this
> > excerpt
> 
> attached vdsm log.

Sorry, it looks like the initial download wasn't fully successful and I was looking at a partial log.

Looks like the images directory on the domain doesn't exist for some reason.

Thread-1585711::ERROR::2014-01-05 14:03:25,206::blockVolume::446::Storage.Volume::(validateImagePath) Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/blockVolume.py", line 444, in validateImagePath
    os.mkdir(imageDir, 0755)
OSError: [Errno 2] No such file or directory: '/rhev/data-center/3e0452c8-21b5-4df1-aced-44bc4c568229/eb317662-1dab-458b-b9dd-82634e6222ae/images/c6978895-ab61-4df1-889c-d1319957574c'
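
For reference, [Errno 2] (ENOENT) from os.mkdir means a parent component of the path is missing: os.mkdir creates only the leaf directory. So a missing images/ directory (or anything above it) on the domain surfaces exactly like this. A quick self-contained illustration; the path below is a throwaway example, not from the affected setup:

--snip--

# Demonstrates the failure mode only; /tmp/no-such-parent is assumed not
# to exist.
import errno
import os

try:
    os.mkdir("/tmp/no-such-parent/child", 0o755)
except OSError as e:
    print(e.errno == errno.ENOENT)  # True: [Errno 2] No such file or directory

--snip--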

Comment 25 Ayal Baron 2014-02-20 09:30:23 UTC
(In reply to Ayal Baron from comment #24)
> (In reply to Gilad Chaplik from comment #23)
> > (In reply to Ayal Baron from comment #22)
> > 
> > ..
> > ..
> > 
> > > Where did you take this log from? the relevant part is missing from this
> > > excerpt
> > 
> > attached vdsm log.
> 
> Sorry, it looks like the initial download wasn't fully successful and I
> was looking at a partial log.
> 
> Looks like the images directory on the domain doesn't exist for some reason.
> 
> Thread-1585711::ERROR::2014-01-05
> 14:03:25,206::blockVolume::446::Storage.Volume::(validateImagePath)
> Unexpected error
> Traceback (most recent call last):
>   File "/usr/share/vdsm/storage/blockVolume.py", line 444, in
> validateImagePath
>     os.mkdir(imageDir, 0755)
> OSError: [Errno 2] No such file or directory:
> '/rhev/data-center/3e0452c8-21b5-4df1-aced-44bc4c568229/eb317662-1dab-458b-
> b9dd-82634e6222ae/images/c6978895-ab61-4df1-889c-d1319957574c'

The logs don't contain enough info to understand how the setup reached this situation, and the logs themselves no longer exist.
If this happens again, please reopen.