Bug 988299

| Summary: | Impossible to start VM from Gluster Storage Domain | | |
|---|---|---|---|
| Product: | [Retired] oVirt | Reporter: | Maciej Malesa <mmalesa> |
| Component: | ovirt-engine-core | Assignee: | Deepak C Shetty <deepakcs> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | |
| Severity: | high | Docs Contact: | |
| Priority: | urgent | | |
| Version: | 3.3 | CC: | acathrow, amureini, andrew, ar1666, barumuga, bharata.rao, cdhouch, chris.sullivan, deepakcs, hchiramm, iheim, mmalesa, sabose, samppah, sankarshan, yeylon |
| Target Milestone: | --- | | |
| Target Release: | 3.3 | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| URL: | http://gerrit.ovirt.org/#/c/17994/ | | |
| Whiteboard: | gluster | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2013-09-23 07:36:18 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 918494 | | |
| Attachments: | | | |
Description (Maciej Malesa, 2013-07-25 09:24:13 UTC)
Bala, could you check this? please add 1. engine logs 2. vdsm logs 3. supervdsm logs Hi Maciej, 1) Please look at the below links for the pre-req. needed to use Gluster volume (aka Gluster cluster) as part of oVirt Storage domain ... http://www.ovirt.org/Features/GlusterFS_Storage_Domain#Testing http://www.ovirt.org/Features/GlusterFS_Storage_Domain#Important_Pre-requisites So apart from 36:36, you need to set other glusterd and gluster volume options (manually for now) to consume gluster cluster as a storage domain. 2) Per the pre-req. mentioned in the above page, we need a minm. of libvirt version 1.0.1 and qemu version 1.3 , since these are the respective versions that have gluster support, which is needed by oVirt Gluster storage domain Can you provide the libvirt and qemu version on your test system ? 3) engine, vdsm and supervdsm logs will help as Bala already noted. For now it looks un-related to the libvirt/qemu version issue.. but can't be sure, unless I see the VDSM logs to understand the stack from where the above error is coming. Maciej, Between step 3 and 4, did you create a new VM disk on the storage domain or not ? (In reply to Deepak C Shetty from comment #3) > Hi Maciej, > > 1) Please look at the below links for the pre-req. needed to use Gluster > volume (aka Gluster cluster) as part of oVirt Storage domain ... > > http://www.ovirt.org/Features/GlusterFS_Storage_Domain#Testing > http://www.ovirt.org/Features/GlusterFS_Storage_Domain#Important_Pre- > requisites > > So apart from 36:36, you need to set other glusterd and gluster volume > options > (manually for now) to consume gluster cluster as a storage domain. is this needed even when creating the volume via ovirt-engine? > > 2) Per the pre-req. mentioned in the above page, we need a minm. of libvirt > version 1.0.1 and qemu version 1.3 , since these are the respective versions > that have gluster support, which is needed by oVirt Gluster storage domain doesn't vdsm require this minimal versions via its spec to avoid unneeded issues here? (In reply to Itamar Heim from comment #5) > (In reply to Deepak C Shetty from comment #3) > > Hi Maciej, > > > > 1) Please look at the below links for the pre-req. needed to use Gluster > > volume (aka Gluster cluster) as part of oVirt Storage domain ... > > > > http://www.ovirt.org/Features/GlusterFS_Storage_Domain#Testing > > http://www.ovirt.org/Features/GlusterFS_Storage_Domain#Important_Pre- > > requisites > > > > So apart from 36:36, you need to set other glusterd and gluster volume > > options > > (manually for now) to consume gluster cluster as a storage domain. > > is this needed even when creating the volume via ovirt-engine? Yes it is, since there is no way to set these options (as of today) via the 'optimize for virt. store' settings. I am working with Gluster India team to see how best this can be done from UI, so that in future, we don't have a need to do this manual steps.. but reaching a conclusion on that topic has taken some time! > > > > > 2) Per the pre-req. mentioned in the above page, we need a minm. of libvirt > > version 1.0.1 and qemu version 1.3 , since these are the respective versions > > that have gluster support, which is needed by oVirt Gluster storage domain > > doesn't vdsm require this minimal versions via its spec to avoid unneeded > issues here? Current vdsm.spec has qemu-kvm >=0.15 ... If i change that to qemu-kvm >= 1.3 i believe VDSM won't get installed for many of the distro's that doesn't have qemu 1.3.. 
so I believe changing vdsm.spec to force 1.3 is not the right thing. It also wouldn't work for the case where user doesn't want to use gluster storage domain but VDSM will force qemu 1.3 which isn't good too. I am not sure whether we need to handle this as part of oVirt Cluster version 3.3 dep ? is there a way we can force VDSM dep pkg version based on Cluster version being used ? It still may not be ideal, since User can use oVirt Cluster ver 3.3 but use NFS/iSCSI domain and not Gluster... so I am open to suggestions on how to handle this issue in the right way ? thanx, deepak P.S. Having said that I am still not sure if the qemu/libvirt version is the issue here.. since we don't have VDSM logs, we can't know the real reason for the reported issue yet! But that is separate from the issue you raised on how to handle the pkg version dep. (In reply to Deepak C Shetty from comment #4) > Maciej, > Between step 3 and 4, did you create a new VM disk on the storage domain > or not ? Hi. I created new VM and after that I created disk from the "wizard". It did not start. Next approach was to create disk first and attach it to the new VM. Then I issued VM start. Effect was the same. I'm seeing the same issue even after setting the pre-req options on the latest 3.3 beta. Caerie, Can you exactly provide the steps you did on the Gluster volume side ? Also provide details on where did you install the oVirt engine and VDSM from, so that I can try to re-create. Lastly, can u provide the VDSM side logs pls ? Without logs its too difficult to make out. I did a new CentOS 6.4 install. I used the following repos for GlusterFS and Ovirt: [ovirt-beta] name=Beta builds of the oVirt project baseurl=http://ovirt.org/releases/beta/rpm/EL/$releasever/ enabled=1 skip_if_unavailable=1 gpgcheck=0 [glusterfs-epel] name=GlusterFS is a clustered file-system capable of scaling to several petabytes. baseurl=http://download.gluster.org/pub/gluster/glusterfs/3.4/3.4.0/EPEL.repo/epel-$releasever/$basearch/ enabled=1 skip_if_unavailable=1 gpgcheck=0 I added the gluster nodes as Hosts in Ovirt as their own Cluster and attempted to create the Volume but could not through the GUI (I didn't know about the pre-reqs until I found this bug report today). So I created the Distributed Replcated volume with: gluster volume create vmstorage replica 2 192.168.12.108:/bricks/brick1 192.168.12.109:/bricks/brick1 The Ovirt GUI immediately saw the volume and added it. I then used the GUI to add 2 more bricks. I then made a second Cluster of VM servers and added those to Ovirt. I then created the Storage Domain using the gluster mount created above, 192.168.12.108:vmstorage Then I created a blank VM using the wizard and the disk 16G disk for it. When I tried to start it up I received the Bad volume specification error. I found this bug report, and followed the pre-req steps for setting the gluster options and the change to the glusterd.vol file. I restarted the glusterd the gluster volume. I was then able to click the Optimize for VirtStore option and it seemed to work. I deleted the original VM, created a new one, and receive the same error when trying to start it. From the ovirt-engine.log: INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQ uartzScheduler_Worker-31) Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM test3 is do wn. 
Exit message: Bad volume specification {'index': 0, 'iface': 'virtio', 'reqsize': '0', 'format': 'raw', 'type': 'disk', 'volumeID': '6511e8b7-64b1-4928-8e6c-52041e5a1733', 'apparentsize': '17179869184', 'imageID': '5b65e144-4c0a-4d22-b372-7bc55d8e2785', 'specParams': {}, 'readonly': 'false', 'domainID': '27fadc4e-a194-4bad-aef0-b85d73d1e51b', 'deviceId': '5b65e144-4c0a-4d22-b372-7bc55d8e2785', 'truesize': '0', 'poolID': 'a8276303-9512-46e1-a2ad-621123ab45dd', 'device': 'disk', 'shared': 'false', 'propagateErrors': 'off', 'optional': 'false'}.

Attaching the VDSM logs from the first gluster node and the ovirt-engine log.

Created attachment 781451 [details]
ovirt-engine.log AND all vdsm logs for gluster node
I looked at the engine.log and vdsm.log from your zip. It seems like the Error "Bad volume specification" is coming on the Engine side only. I don't see any error on the VDSM side. I looked around the 2013-08-01 00:12 to 2013-08-01 00:14 time window on engine and vdsm logs I see that the reported error is only on the Engine side. There is no ERROR or WARN related events in the vdsm.log. So someone from Engine side needs to debug this to figure why Engine thinks its a bad volume spec. Having said that.. I have few interesting observations... 1) On the VDSM side i see connectStorageServer but i don't see createStorageDomain and createVolume calls. which means.. the Engine never passed createStorageDomain and createVolume (when user created a 16GB disk) to VDSM... thats really strange! and i have never seen something like this before. 2) So what i see in the above time window on the VDSM side is.. connectStorageServer (which mounts the gluster volume on VDSM host) createStorageDomain - missing createVolume - missing connectStoragePool createVm - missing... which is expected, since Engine fails with the 'bad volume spec' error I am not much of an expert to debug the Engine side logs, so it would be good if someone can help debug from engine side. We also need to understnad what 'bad volume spec' error of engine actually means ? I have no clue looking at the engine.log thanx, deepak Just to try to narrow this down, I reinstalled everything and added back in only a two node glusterfs setup. Added the hosts to Ovirt but still couldn't create the Volume through the GUI (after making sure all pre-reqs were done) I kept getting a 500 GUI error. Created the glusterfs volume via the CLI and it showed up in the GUI. Created a virtual machine host to be the storage master. Created a new storage domain using the glusterf fs volume. Created a test disk in the domain. This creates all the files it should on the glusterfs volume. Attempted to remove the disk and it fails, setting the disk to illegal status. It looks like it's trying to find a volume named 0a414f79-0b29-4ca7-a2d7-733158969dfc and failing for some reason, even though that is what is located at the root of my glusterfs volume : total 4 drwxr-xr-x. 4 vdsm kvm 91 Aug 1 13:52 . drwxr-xr-x. 3 root root 4096 Jul 31 10:29 .. drwxr-xr-x. 5 vdsm kvm 45 Aug 1 13:53 0a414f79-0b29-4ca7-a2d7-733158969dfc -rwxr-xr-x. 1 vdsm kvm 0 Aug 1 13:52 __DIRECT_IO_TEST__ From the ovirt-engine.log 2013-08-01 16:06:03,684 INFO [org.ovirt.engine.core.bll.RemoveDiskCommand] (pool-6-thread-43) [4b1dfc61] Running command: RemoveDiskCommand internal: false. Entities affected : ID: 618ef3cb-7d43-4519-a9a4-bb718fe9ec68 Type: Disk 2013-08-01 16:06:03,687 INFO [org.ovirt.engine.core.bll.RemoveImageCommand] (pool-6-thread-43) Running command: RemoveImageCommand internal: true. 
Entities affected : ID: 0a414f79-0b29-4ca7-a2d7-733158969dfc Type: Storage 2013-08-01 16:06:03,697 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.DeleteImageGroupVDSCommand] (pool-6-thread-43) START, DeleteImageGroupVDSCommand( storagePool Id = dcffbd47-8adb-49da-9e3c-787338c58b58, ignoreFailoverLimit = false, storageDomainId = 0a414f79-0b29-4ca7-a2d7-733158969dfc, imageGroupId = 618ef3cb-7d43-4519-a9a4 -bb718fe9ec68, postZeros = false, forceDelete = false), log id: 2b13e97 2013-08-01 16:06:03,765 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.DeleteImageGroupVDSCommand] (pool-6-thread-43) Failed in DeleteImageGroupVDS method 2013-08-01 16:06:03,766 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.DeleteImageGroupVDSCommand] (pool-6-thread-43) Error code StorageDomainDoesNotExist and error message IRSGenericException: IRSErrorException: Failed to DeleteImageGroupVDS, error = Storage domain does not exist: ('0a414f79-0b29-4ca7-a2d7-733158969dfc',) 2013-08-01 16:06:03,767 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (pool-6-thread-43) IrsBroker::Failed::DeleteImageGroupVDS due to: IRSErrorException: IRSGenericException: IRSErrorException: Failed to DeleteImageGroupVDS, error = Storage domain does not exist: ('0a414f79-0b29-4ca7-a2d7-733158969dfc',) 2013-08-01 16:06:03,777 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.DeleteImageGroupVDSCommand] (pool-6-thread-43) FINISH, DeleteImageGroupVDSCommand, log id: 2b13e97 2013-08-01 16:06:03,777 ERROR [org.ovirt.engine.core.bll.RemoveImageCommand] (pool-6-thread-43) Command org.ovirt.engine.core.bll.RemoveImageCommand throw Vdc Bll exception. With error message VdcBLLException: org.ovirt.engine.core.vdsbroker.irsbroker.IRSErrorException: IRSGenericException: IRSErrorException: Failed to DeleteImageGroupVDS, error = Storage domain does not exist: ('0a414f79-0b29-4ca7-a2d7-733158969dfc',) (Failed with VDSM error StorageDomainDoesNotExist and code 358) 2013-08-01 16:06:03,783 INFO [org.ovirt.engine.core.bll.RemoveImageCommand] (pool-6-thread-43) Command [id=7cc38dc4-fd4f-498f-8581-bb471a75b30b]: Compensating CHANGED_STATUS_ONLY of org.ovirt.engine.core.common.businessentities.Image; snapshot: EntityStatusSnapshot [id=a5129d1b-7b5d-4210-a52e-1ff905175e4f, status=ILLEGAL]. 2013-08-01 16:06:03,794 INFO [org.ovirt.engine.core.bll.AsyncTaskManager] (pool-6-thread-43) Removed task 7014e15e-e1fa-440e-8e5b-f0f10762e56f from DataBase 2013-08-01 16:06:03,798 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (pool-6-thread-43) Correlation ID: 4b1dfc61, Job ID: 6c595a2f-46c0-4d12-bcf7-3bff04b6e421, Call Stack: null, Custom Event ID: -1, Message: User admin@internal finished removing disk mydisk_disk1 with storage failure in domain vmstoragedomain.
> 2013-08-01 16:06:03,765 ERROR
> [org.ovirt.engine.core.vdsbroker.irsbroker.DeleteImageGroupVDSCommand]
> (pool-6-thread-43) Failed in DeleteImageGroupVDS method
> 2013-08-01 16:06:03,766 ERROR
> [org.ovirt.engine.core.vdsbroker.irsbroker.DeleteImageGroupVDSCommand]
> (pool-6-thread-43) Error code StorageDomainDoesNotExist and error message
> IRSGenericException: IRSErrorException: Failed to DeleteImageGroupVDS, error
> = Storage domain does not exist: ('0a414f79-0b29-4ca7-a2d7-733158969dfc',)
> 2013-08-01 16:06:03,767 ERROR
> [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand]
> (pool-6-thread-43) IrsBroker::Failed::DeleteImageGroupVDS due to:
> IRSErrorException: IRSGenericException: IRSErrorException: Failed to
> DeleteImageGroupVDS, error = Storage domain does not exist:
> ('0a414f79-0b29-4ca7-a2d7-733158969dfc',)
This matches with what I saw and reported as observation on VDSM side, in my prev. comment. Since createStorageDomain is not being called on the VDSM side, there is really no storage domain created on the host.. which is what Engine log is saying here. But the root cause seems to be the fact that createStorageDomain is not passed by Engine to the VDSM... ideally it should have failed at the "Created a new storage domain using the glusterf fs volume." step itself... but strangely that goes thru.
Need more debug from the engine side to decipher why create storage domain is not passed by the Engine to VDSM.
Caerie,
Can you dump the output of 'df -h' on the host side after
"Created a new storage domain using the glusterf fs volume." is shown successful on the GUI ?
thanx,
deepak
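As an illustration of what the requested 'df -h' check is after, here is a minimal sketch; the mount root, server:volume spec and storage domain UUID are taken from the logs in this report and are assumptions for any other setup. If the gluster volume is mounted but the domain UUID directory is missing, createStorageDomain never ran on that host.

    import os

    # Values taken from the logs above; substitute your own server:volume and SD UUID.
    MOUNT_ROOT = "/rhev/data-center/mnt/glusterSD"
    VOLUME_SPEC = "192.168.12.108:vmstorage"
    SD_UUID = "27fadc4e-a194-4bad-aef0-b85d73d1e51b"

    mount_point = os.path.join(MOUNT_ROOT, VOLUME_SPEC)

    # Equivalent of eyeballing 'df -h': is the gluster volume actually mounted?
    with open("/proc/mounts") as mounts:
        mounted = any(mount_point in line for line in mounts)
    print("gluster volume mounted:", mounted)

    # If createStorageDomain ran, the domain UUID directory exists under the mount.
    print("storage domain directory present:", os.path.isdir(os.path.join(mount_point, SD_UUID)))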
(In reply to Deepak C Shetty from comment #14) > > 2013-08-01 16:06:03,765 ERROR > > [org.ovirt.engine.core.vdsbroker.irsbroker.DeleteImageGroupVDSCommand] > > (pool-6-thread-43) Failed in DeleteImageGroupVDS method > > 2013-08-01 16:06:03,766 ERROR > > [org.ovirt.engine.core.vdsbroker.irsbroker.DeleteImageGroupVDSCommand] > > (pool-6-thread-43) Error code StorageDomainDoesNotExist and error message > > IRSGenericException: IRSErrorException: Failed to DeleteImageGroupVDS, error > > = Storage domain does not exist: ('0a414f79-0b29-4ca7-a2d7-733158969dfc',) > > 2013-08-01 16:06:03,767 ERROR > > [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] > > (pool-6-thread-43) IrsBroker::Failed::DeleteImageGroupVDS due to: > > IRSErrorException: IRSGenericException: IRSErrorException: Failed to > > DeleteImageGroupVDS, error = Storage domain does not exist: > > ('0a414f79-0b29-4ca7-a2d7-733158969dfc',) > > This matches with what I saw and reported as observation on VDSM side, in my > prev. comment. Since createStorageDomain is not being called on the VDSM > side, there is really no storage domain created on the host.. which is what > Engine log is saying here. But the root cause seems to be the fact that > createStorageDomain is not passed by Engine to the VDSM... ideally it should > have failed at the "Created a new storage domain using the glusterf fs > volume." step itself... but strangely that goes thru. > > Need more debug from the engine side to decipher why create storage domain > is not passed by the Engine to VDSM. Deepak, 2013-07-31 15:29:59,996 INFO [org.ovirt.engine.core.vdsbroker.CreateVmVDSCommand] (pool-6-thread-39) START, CreateVmVDSCommand(HostName = vmhp61m2, HostId = ca4f4013-e3f2-4c8f-89fc-c6507f771eb3, vmId=70b48ccd-6479-40c8-a14f-8351dee1d144, vm=VM [test3]), log id: 4f3f3a43 Is it possible that the VDSM log attached if from the storage node and not hypervisore node vmhp61m2? Caerie - can you confirm. > > Caerie, > Can you dump the output of 'df -h' on the host side after > "Created a new storage domain using the glusterf fs volume." is shown > successful on the GUI ? > Also, in the engine logs (from comment 10)- 2013-07-31 15:26:51,803 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (pool-6-thread-39) vds nfsp14m2 reported domain 27fadc4e-a194-4bad-aef0-b85d73d1e51b:glusterstoredomain as in problem, moving the vds to status NonOperational Deepak, looks like there is some issue with storage domain ? 
But in the vdsm log Thread-68109::DEBUG::2013-07-31 15:26:56,438::task::579::TaskManager.Task::(_updateState) Task=`0b717e05-e630-4087-91de-8d6073ec259f`::moving from state init -> state preparing Thread-68109::INFO::2013-07-31 15:26:56,439::logUtils::44::dispatcher::(wrapper) Run and protect: repoStats(options=None) Thread-68109::INFO::2013-07-31 15:26:56,439::logUtils::47::dispatcher::(wrapper) Run and protect: repoStats, Return response: {'a806930b-177c-45c6-a8f8-3650892f9639': {'delay': '0.000398594', 'lastCheck': '8.5', 'code': 0, 'valid': True, 'version': 0}, '27fadc4e-a194-4bad-aef0-b85d73d1e51b': {'delay': '0.00172392', 'lastCheck': '8.6', 'code': 0, 'valid': True, 'version': 3}} > > thanx, > deepak (In reply to Sahina Bose from comment #15) > (In reply to Deepak C Shetty from comment #14) > > > 2013-08-01 16:06:03,765 ERROR > > > [org.ovirt.engine.core.vdsbroker.irsbroker.DeleteImageGroupVDSCommand] > > > (pool-6-thread-43) Failed in DeleteImageGroupVDS method > > > 2013-08-01 16:06:03,766 ERROR > > > [org.ovirt.engine.core.vdsbroker.irsbroker.DeleteImageGroupVDSCommand] > > > (pool-6-thread-43) Error code StorageDomainDoesNotExist and error message > > > IRSGenericException: IRSErrorException: Failed to DeleteImageGroupVDS, error > > > = Storage domain does not exist: ('0a414f79-0b29-4ca7-a2d7-733158969dfc',) > > > 2013-08-01 16:06:03,767 ERROR > > > [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] > > > (pool-6-thread-43) IrsBroker::Failed::DeleteImageGroupVDS due to: > > > IRSErrorException: IRSGenericException: IRSErrorException: Failed to > > > DeleteImageGroupVDS, error = Storage domain does not exist: > > > ('0a414f79-0b29-4ca7-a2d7-733158969dfc',) > > > > This matches with what I saw and reported as observation on VDSM side, in my > > prev. comment. Since createStorageDomain is not being called on the VDSM > > side, there is really no storage domain created on the host.. which is what > > Engine log is saying here. But the root cause seems to be the fact that > > createStorageDomain is not passed by Engine to the VDSM... ideally it should > > have failed at the "Created a new storage domain using the glusterf fs > > volume." step itself... but strangely that goes thru. > > > > Need more debug from the engine side to decipher why create storage domain > > is not passed by the Engine to VDSM. > > Deepak, > 2013-07-31 15:29:59,996 INFO > [org.ovirt.engine.core.vdsbroker.CreateVmVDSCommand] (pool-6-thread-39) > START, CreateVmVDSCommand(HostName = vmhp61m2, HostId = > ca4f4013-e3f2-4c8f-89fc-c6507f771eb3, > vmId=70b48ccd-6479-40c8-a14f-8351dee1d144, vm=VM [test3]), log id: 4f3f3a43 > > Is it possible that the VDSM log attached if from the storage node and not > hypervisore node vmhp61m2? > Caerie - can you confirm. Sahina, Agree.. I think Caerie is attaching the VDSM log from glsuter node, we need the log from VDSM / hypevisor host node. In fact that was my first doubt too.. but i overlooked it, since i saw connectStoragePool being called in the vdsm.log. But thinking abt this again, I guess the storagepool and storageserver concepts gets used on the storage node as well.. so your guess is right. Caerie, pls provide the vdsm.log from hypervisor host (vmhp61m2) > > > > Caerie, > > Can you dump the output of 'df -h' on the host side after > > "Created a new storage domain using the glusterf fs volume." is shown > > successful on the GUI ? 
> > > > Also, in the engine logs (from comment 10)- > 2013-07-31 15:26:51,803 WARN > [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] > (pool-6-thread-39) vds nfsp14m2 reported domain > 27fadc4e-a194-4bad-aef0-b85d73d1e51b:glusterstoredomain as in problem, > moving the vds to status NonOperational > > > Deepak, looks like there is some issue with storage domain ? > But in the vdsm log > > Thread-68109::DEBUG::2013-07-31 > 15:26:56,438::task::579::TaskManager.Task::(_updateState) > Task=`0b717e05-e630-4087-91de-8d6073ec259f`::moving from state init -> state > preparing > Thread-68109::INFO::2013-07-31 > 15:26:56,439::logUtils::44::dispatcher::(wrapper) Run and protect: > repoStats(options=None) > Thread-68109::INFO::2013-07-31 > 15:26:56,439::logUtils::47::dispatcher::(wrapper) Run and protect: > repoStats, Return response: {'a806930b-177c-45c6-a8f8-3650892f9639': > {'delay': '0.000398594', 'lastCheck': '8.5', 'code': 0, 'valid': True, > 'version': 0}, '27fadc4e-a194-4bad-aef0-b85d73d1e51b': {'delay': > '0.00172392', 'lastCheck': '8.6', 'code': 0, 'valid': True, 'version': 3}} Sahina, if your guess is right, the vdsm.log attached itself is wrong.. so no point in seeing repoStats in vdsm.log. Bala/Sahina, One another reason why i thought the vdsm.log provided was from the hyp. host is because i saw this... Thread-89425::INFO::2013-08-01 00:12:08,716::logUtils::44::dispatcher::(wrapper) Run and protect: connectStorageServer(domType=7, spUUID='00000000-0000-0000-0000-000000000000', conList=[{'port': '', 'connection': '192.168.12.108:vmstorage', 'iqn': '', 'portal': '', 'user': '', 'vfs_type': 'glusterfs', 'password': '******', 'id': 'c79ae912-4adb-4407-97e7-f57ebebe6b3b'}], options=None) Thread-89425::DEBUG::2013-08-01 00:12:08,738::mount::226::Storage.Misc.excCmd::(_runcmd) '/usr/bin/sudo -n /bin/mount -t glusterfs 192.168.12.108:vmstorage /rhev/data-center/mnt/glusterSD/192.168.12.108:vmstorage' (cwd None) Thread-89425::DEBUG::2013-08-01 00:12:09,101::hsm::2333::Storage.HSM::(__prefetchDomains) glusterDomPath: glusterSD/* Thread-89425::DEBUG::2013-08-01 00:12:09,123::hsm::2345::Storage.HSM::(__prefetchDomains) Found SD uuids: ('27fadc4e-a194-4bad-aef0-b85d73d1e51b',) Thread-89425::DEBUG::2013-08-01 00:12:09,123::hsm::2396::Storage.HSM::(connectStorageServer) knownSDs: {a806930b-177c-45c6-a8f8-3650892f9639: storage.nfsSD.findDomain, 27fadc4e-a194-4bad-aef0-b85d73d1e51b: storage.glusterSD.findDomain} Thread-89425::INFO::2013-08-01 00:12:09,124::logUtils::47::dispatcher::(wrapper) Run and protect: connectStorageServer, Return response: {'statuslist': [{'status': 0, 'id': 'c79ae912-4adb-4407-97e7-f57ebebe6b3b'}]} domType=7 is GLUSTERFS_DOMAIN Does connectStorageServer makes sense if this vdsm.log is from gluster only node ? Deepak, Looks like virt and storage are running on the same node. And this log is from that, but not from the node on which the VM creation was done? Yes, the vdsm logs were to the gluster node as stated in the file comments. I've since redone the entire setup from scratch and made it simpler. These logs go to: 1 - Virtual Machine host 2 - GlusterFS Host nodes 1 gluster volume with two bricks in distribute replicate 2 Including logs for: ovirt-engine virtual machine host vdsm logs gluster node 1 - glusterfs and vdsm logs gluster node 2 - glusterfs and vdsm logs Created attachment 782002 [details]
all logs for ovirt, virtual host, two gluster nodes
Thanks Caerie, I looked into the vmhost_vdsm.log and engine.log and found some details. From engine.log... 2013-08-02 09:48:44,527 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-54) Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM testvm1 is down. Exit message: Bad volume specification {'index': 0, 'iface': 'virtio', 'reqsize': '0', 'format': 'raw', 'type': 'disk', 'volumeID': '88ead0ec-9cb1-40a6-9755-d69806ac9146', 'apparentsize': '17179869184', 'imageID': '5242493d-2420-4079-9dcc-81dc032fb1d4', 'specParams': {}, 'readonly': 'false', 'domainID': '0a414f79-0b29-4ca7-a2d7-733158969dfc', 'deviceId': '5242493d-2420-4079-9dcc-81dc032fb1d4', 'truesize': '17179869184', 'poolID': 'dcffbd47-8adb-49da-9e3c-787338c58b58', 'device': 'disk', 'shared': 'false', 'propagateErrors': 'off', 'optional': 'false'}. So the problem is reported at 2013-08-02 09:48:44 time During the same time from vmhost_vdsm.log ... Thread-43199::ERROR::2013-08-02 09:48:48,414::task::850::TaskManager.Task::(_setError) Task=`e58043ac-226d-45c8-8fed-d85ccd148ed5`::Unexpected error Traceback (most recent call last): File "/usr/share/vdsm/storage/task.py", line 857, in _run return fn(*args, **kargs) File "/usr/share/vdsm/logUtils.py", line 45, in wrapper res = f(*args, **kwargs) File "/usr/share/vdsm/storage/hsm.py", line 3251, in prepareImage 'vmVolInfo': vol.getVmVolumeInfo()} File "/usr/share/vdsm/storage/glusterVolume.py", line 30, in getVmVolumeInfo volInfo = svdsmProxy.glusterVolumeInfo(volname, volfileServer) File "/usr/share/vdsm/supervdsm.py", line 50, in __call__ return callMethod() File "/usr/share/vdsm/supervdsm.py", line 48, in <lambda> **kwargs) File "<string>", line 2, in glusterVolumeInfo File "/usr/lib64/python2.6/multiprocessing/managers.py", line 740, in _callmethod raise convert_to_error(kind, result) OSError: [Errno 2] No such file or directory: gluster Thread-43199::DEBUG::2013-08-02 09:48:48,416::task::869::TaskManager.Task::(_run) Task=`e58043ac-226d-45c8-8fed-d85ccd148ed5`::Task._run: e58043ac-226d-45c8-8fed-d85ccd148ed5 ('0a414f79-0b29-4ca7-a2d7-733158969dfc', 'dcffbd47-8adb-49da-9e3c-787338c58b58', '5242493d-2420-4079-9dcc-81dc032fb1d4', '88ead0ec-9cb1-40a6-9755-d69806ac9146') {} failed - stopping task Thread-43199::DEBUG::2013-08-02 09:48:48,416::task::1194::TaskManager.Task::(stop) Task=`e58043ac-226d-45c8-8fed-d85ccd148ed5`::stopping in state preparing (force False) Thread-43199::DEBUG::2013-08-02 09:48:48,416::task::974::TaskManager.Task::(_decref) Task=`e58043ac-226d-45c8-8fed-d85ccd148ed5`::ref 1 aborting True Thread-43199::ERROR::2013-08-02 09:48:48,421::dispatcher::70::Storage.Dispatcher.Protect::(run) [Errno 2] No such file or directory: gluster Traceback (most recent call last): File "/usr/share/vdsm/storage/dispatcher.py", line 62, in run result = ctask.prepare(self.func, *args, **kwargs) File "/usr/share/vdsm/storage/task.py", line 1159, in prepare raise self.error OSError: [Errno 2] No such file or directory: gluster Thread-43199::DEBUG::2013-08-02 09:48:48,422::vm::2034::vm.Vm::(_startUnderlyingVm) vmId=`42b8b60f-7d89-4780-a27e-c0627cd49f8a`::_ongoingCreations released Thread-43199::ERROR::2013-08-02 09:48:48,422::vm::2060::vm.Vm::(_startUnderlyingVm) vmId=`42b8b60f-7d89-4780-a27e-c0627cd49f8a`::The vm start process failed Traceback (most recent call last): File "/usr/share/vdsm/vm.py", line 2020, in _startUnderlyingVm self._run() File "/usr/share/vdsm/vm.py", line 2815, in _run 
self.preparePaths(devices[DISK_DEVICES]) File "/usr/share/vdsm/vm.py", line 2081, in preparePaths drive['path'] = self.cif.prepareVolumePath(drive, self.id) File "/usr/share/vdsm/clientIF.py", line 256, in prepareVolumePath raise vm.VolumeError(drive) VolumeError: Bad volume specification {'index': 0, 'iface': 'virtio', 'reqsize': '0', 'format': 'raw', 'type': 'disk', 'volumeID': '88ead0ec-9cb1-40a6-9755-d69806ac9146', 'apparentsize': '17179869184', 'imageID': '5242493d-2420-4079-9dcc-81dc032fb1d4', 'specParams': {}, 'readonly': 'false', 'domainID': '0a414f79-0b29-4ca7-a2d7-733158969dfc', 'deviceId': '5242493d-2420-4079-9dcc-81dc032fb1d4', 'truesize': '17179869184', 'poolID': 'dcffbd47-8adb-49da-9e3c-787338c58b58', 'device': 'disk', 'shared': 'false', 'propagateErrors': 'off', 'optional': 'false'} Thread-43199::DEBUG::2013-08-02 09:48:48,429::vm::2444::vm.Vm::(setDownStatus) vmId=`42b8b60f-7d89-4780-a27e-c0627cd49f8a`::Changed state to Down: Bad volume specification {'index': 0, 'iface': 'virtio', 'reqsize': '0', 'format': 'raw', 'type': 'disk', 'volumeID': '88ead0ec-9cb1-40a6-9755-d69806ac9146', 'apparentsize': '17179869184', 'imageID': '5242493d-2420-4079-9dcc-81dc032fb1d4', 'specParams': {}, 'readonly': 'false', 'domainID': '0a414f79-0b29-4ca7-a2d7-733158969dfc', 'deviceId': '5242493d-2420-4079-9dcc-81dc032fb1d4', 'truesize': '17179869184', 'poolID': 'dcffbd47-8adb-49da-9e3c-787338c58b58', 'device': 'disk', 'shared': 'false', 'propagateErrors': 'off', 'optional': 'false'} The root cause being... OSError: [Errno 2] No such file or directory: gluster So it looks like on the VDSM host you don't have /usr/sbin/gluster or u have the gluster binary but not in the above path ? Caerie, can you confirm the below... 1) /usr/sbin/gluster present on vdsm host ? 2) Is the glusterd service running on the VDSM host ? 3) 'gluster volume info' on VDSM host 4) 'gluster volume info --remote-host=192.168.12.108' from VDSM host ? (assuming 192.168.12.108 is one of your gluster nodes) I still didn't see things like createStorageDomain etc in the vmhost_vdsm.log but thats probably bcos its rolled over.. nevertheless, given we see the VDSM error at the same time as engine.log, I think we don't need to see other vdsm host logs. Bala, So glusterVolume.py tries to do ... volInfo = svdsmProxy.glusterVolumeInfo(volname, volfileServer) which goes to the vdsm-gluster plugin/cli code in VDSM which is failing with OSError: [Errno 2] No such file or directory: gluster So any idea what are the cases when gluster vdsm plugin can fail with the above error ? One obvious is gluster cli not installed or present on VDSM host, but can there be other issues ? thanx, deepak Caerie, In addition to the above, pls provide the output of below commands executed on the VDSM host ( not the gluster node) 1) rpm -qa| grep gluster (In reply to Deepak C Shetty from comment #21) > Thanks Caerie, > I looked into the vmhost_vdsm.log and engine.log and found some details. > > From engine.log... > > 2013-08-02 09:48:44,527 INFO > [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] > (DefaultQuartzScheduler_Worker-54) Correlation ID: null, Call Stack: null, > Custom Event ID: -1, Message: VM testvm1 is down. 
Exit message: Bad volume > specification {'index': 0, 'iface': 'virtio', 'reqsize': '0', 'format': > 'raw', 'type': 'disk', 'volumeID': '88ead0ec-9cb1-40a6-9755-d69806ac9146', > 'apparentsize': '17179869184', 'imageID': > '5242493d-2420-4079-9dcc-81dc032fb1d4', 'specParams': {}, 'readonly': > 'false', 'domainID': '0a414f79-0b29-4ca7-a2d7-733158969dfc', 'deviceId': > '5242493d-2420-4079-9dcc-81dc032fb1d4', 'truesize': '17179869184', 'poolID': > 'dcffbd47-8adb-49da-9e3c-787338c58b58', 'device': 'disk', 'shared': 'false', > 'propagateErrors': 'off', 'optional': 'false'}. > > So the problem is reported at 2013-08-02 09:48:44 time > > During the same time from vmhost_vdsm.log ... > > Thread-43199::ERROR::2013-08-02 > 09:48:48,414::task::850::TaskManager.Task::(_setError) > Task=`e58043ac-226d-45c8-8fed-d85ccd148ed5`::Unexpected error > Traceback (most recent call last): > File "/usr/share/vdsm/storage/task.py", line 857, in _run > return fn(*args, **kargs) > File "/usr/share/vdsm/logUtils.py", line 45, in wrapper > res = f(*args, **kwargs) > File "/usr/share/vdsm/storage/hsm.py", line 3251, in prepareImage > 'vmVolInfo': vol.getVmVolumeInfo()} > File "/usr/share/vdsm/storage/glusterVolume.py", line 30, in > getVmVolumeInfo > volInfo = svdsmProxy.glusterVolumeInfo(volname, volfileServer) > File "/usr/share/vdsm/supervdsm.py", line 50, in __call__ > return callMethod() > File "/usr/share/vdsm/supervdsm.py", line 48, in <lambda> > **kwargs) > File "<string>", line 2, in glusterVolumeInfo > File "/usr/lib64/python2.6/multiprocessing/managers.py", line 740, in > _callmethod > raise convert_to_error(kind, result) > OSError: [Errno 2] No such file or directory: gluster > Thread-43199::DEBUG::2013-08-02 > 09:48:48,416::task::869::TaskManager.Task::(_run) > Task=`e58043ac-226d-45c8-8fed-d85ccd148ed5`::Task._run: > e58043ac-226d-45c8-8fed-d85ccd148ed5 > ('0a414f79-0b29-4ca7-a2d7-733158969dfc', > 'dcffbd47-8adb-49da-9e3c-787338c58b58', > '5242493d-2420-4079-9dcc-81dc032fb1d4', > '88ead0ec-9cb1-40a6-9755-d69806ac9146') {} failed - stopping task > Thread-43199::DEBUG::2013-08-02 > 09:48:48,416::task::1194::TaskManager.Task::(stop) > Task=`e58043ac-226d-45c8-8fed-d85ccd148ed5`::stopping in state preparing > (force False) > Thread-43199::DEBUG::2013-08-02 > 09:48:48,416::task::974::TaskManager.Task::(_decref) > Task=`e58043ac-226d-45c8-8fed-d85ccd148ed5`::ref 1 aborting True > > > Thread-43199::ERROR::2013-08-02 > 09:48:48,421::dispatcher::70::Storage.Dispatcher.Protect::(run) [Errno 2] No > such file or directory: gluster > Traceback (most recent call last): > File "/usr/share/vdsm/storage/dispatcher.py", line 62, in run > result = ctask.prepare(self.func, *args, **kwargs) > File "/usr/share/vdsm/storage/task.py", line 1159, in prepare > raise self.error > OSError: [Errno 2] No such file or directory: gluster > Thread-43199::DEBUG::2013-08-02 > 09:48:48,422::vm::2034::vm.Vm::(_startUnderlyingVm) > vmId=`42b8b60f-7d89-4780-a27e-c0627cd49f8a`::_ongoingCreations released > Thread-43199::ERROR::2013-08-02 > 09:48:48,422::vm::2060::vm.Vm::(_startUnderlyingVm) > vmId=`42b8b60f-7d89-4780-a27e-c0627cd49f8a`::The vm start process failed > Traceback (most recent call last): > File "/usr/share/vdsm/vm.py", line 2020, in _startUnderlyingVm > self._run() > File "/usr/share/vdsm/vm.py", line 2815, in _run > self.preparePaths(devices[DISK_DEVICES]) > File "/usr/share/vdsm/vm.py", line 2081, in preparePaths > drive['path'] = self.cif.prepareVolumePath(drive, self.id) > File "/usr/share/vdsm/clientIF.py", line 256, in 
prepareVolumePath > raise vm.VolumeError(drive) > VolumeError: Bad volume specification {'index': 0, 'iface': 'virtio', > 'reqsize': '0', 'format': 'raw', 'type': 'disk', 'volumeID': > '88ead0ec-9cb1-40a6-9755-d69806ac9146', 'apparentsize': '17179869184', > 'imageID': '5242493d-2420-4079-9dcc-81dc032fb1d4', 'specParams': {}, > 'readonly': 'false', 'domainID': '0a414f79-0b29-4ca7-a2d7-733158969dfc', > 'deviceId': '5242493d-2420-4079-9dcc-81dc032fb1d4', 'truesize': > '17179869184', 'poolID': 'dcffbd47-8adb-49da-9e3c-787338c58b58', 'device': > 'disk', 'shared': 'false', 'propagateErrors': 'off', 'optional': 'false'} > Thread-43199::DEBUG::2013-08-02 > 09:48:48,429::vm::2444::vm.Vm::(setDownStatus) > vmId=`42b8b60f-7d89-4780-a27e-c0627cd49f8a`::Changed state to Down: Bad > volume specification {'index': 0, 'iface': 'virtio', 'reqsize': '0', > 'format': 'raw', 'type': 'disk', 'volumeID': > '88ead0ec-9cb1-40a6-9755-d69806ac9146', 'apparentsize': '17179869184', > 'imageID': '5242493d-2420-4079-9dcc-81dc032fb1d4', 'specParams': {}, > 'readonly': 'false', 'domainID': '0a414f79-0b29-4ca7-a2d7-733158969dfc', > 'deviceId': '5242493d-2420-4079-9dcc-81dc032fb1d4', 'truesize': > '17179869184', 'poolID': 'dcffbd47-8adb-49da-9e3c-787338c58b58', 'device': > 'disk', 'shared': 'false', 'propagateErrors': 'off', 'optional': 'false'} > > > The root cause being... > OSError: [Errno 2] No such file or directory: gluster > > So it looks like on the VDSM host you don't have /usr/sbin/gluster > or u have the gluster binary but not in the above path ? > > Caerie, can you confirm the below... > > 1) /usr/sbin/gluster present on vdsm host ? > 2) Is the glusterd service running on the VDSM host ? > 3) 'gluster volume info' on VDSM host > 4) 'gluster volume info --remote-host=192.168.12.108' from VDSM host ? > (assuming 192.168.12.108 is one of your gluster nodes) > > I still didn't see things like createStorageDomain etc in the > vmhost_vdsm.log but thats probably bcos its rolled over.. nevertheless, > given we see the VDSM error at the same time as engine.log, I think we don't > need to see other vdsm host logs. > > Bala, > So glusterVolume.py tries to do ... > > volInfo = svdsmProxy.glusterVolumeInfo(volname, volfileServer) > which goes to the vdsm-gluster plugin/cli code in VDSM > gluster.cli is part of vdsm itself. I remember your storage domain patch moved from vdsm-gluster. > which is failing with > > OSError: [Errno 2] No such file or directory: gluster > > So any idea what are the cases when gluster vdsm plugin can fail with the > above error ? One obvious is gluster cli not installed or present on VDSM > host, but can there be other issues ? > > thanx, > deepak (In reply to Bala.FA from comment #23) > > Bala, > > So glusterVolume.py tries to do ... > > > > volInfo = svdsmProxy.glusterVolumeInfo(volname, volfileServer) > > which goes to the vdsm-gluster plugin/cli code in VDSM > > > > gluster.cli is part of vdsm itself. I remember your storage domain patch > moved from vdsm-gluster. Bala, Regret if my question was not clear. Yes gluster.cli is part of VDSM itself, but it depends on /usr/sbin/gluster, which probably is not present on the VDSM host, hence the failure. Caerie, Pls provide the requested info from VDSM host to help root cause further. Thanks. 
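For reference, the OSError in the traceback above is what surfaces when supervdsm tries to exec a gluster CLI that is not installed on the hypervisor host. A minimal sketch of the equivalent manual check follows; the CLI path (/usr/sbin/gluster) and the remote host (one of the gluster nodes in this setup) are assumptions for other environments.

    import os
    import subprocess

    GLUSTER_CLI = "/usr/sbin/gluster"    # assumed path; not present on a client-only install
    GLUSTER_NODE = "192.168.12.108"      # one of the gluster nodes in this setup

    if not os.path.exists(GLUSTER_CLI):
        # Exec of a missing binary fails with ENOENT, which shows up in vdsm.log as
        # "OSError: [Errno 2] No such file or directory: gluster".
        print("gluster CLI not installed on this host")
    else:
        # Roughly the query glusterVolumeInfo wraps, run against a remote gluster node.
        rc = subprocess.call([GLUSTER_CLI, "volume", "info",
                              "--remote-host=%s" % GLUSTER_NODE])
        print("gluster volume info exit status:", rc)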
For the Ovirt Engine box:

[root@ovirtp1m2 ~]# rpm -qa |grep gluster
glusterfs-3.4.0-3.el6.x86_64
glusterfs-fuse-3.4.0-3.el6.x86_64
glusterfs-rdma-3.4.0-3.el6.x86_64

For the Virtual Machine Host:

[root@vmhp60m2 ~]# rpm -qa |grep gluster
glusterfs-rdma-3.4.0-3.el6.x86_64
glusterfs-3.4.0-3.el6.x86_64
glusterfs-fuse-3.4.0-3.el6.x86_64

For the Gluster Nodes:

[root@nfsp12m2 ~]# rpm -qa |grep gluster
glusterfs-server-3.4.0-3.el6.x86_64
glusterfs-3.4.0-3.el6.x86_64
glusterfs-fuse-3.4.0-3.el6.x86_64
glusterfs-rdma-3.4.0-3.el6.x86_64
vdsm-gluster-4.12.0-0.1.rc3.el6.noarch
glusterfs-geo-replication-3.4.0-3.el6.x86_64

[root@nfsp13m2 ~]# rpm -qa |grep gluster
glusterfs-server-3.4.0-3.el6.x86_64
glusterfs-3.4.0-3.el6.x86_64
glusterfs-fuse-3.4.0-3.el6.x86_64
glusterfs-rdma-3.4.0-3.el6.x86_64
vdsm-gluster-4.12.0-0.1.rc3.el6.noarch
glusterfs-geo-replication-3.4.0-3.el6.x86_64

I have two Ovirt Clusters here: one has Gluster enabled and has the Gluster Hosts on it, the other has Virtualization enabled and has the Virtual Host server on it. That would explain the difference in what is installed on each. Now that we have an idea here I'm going to make the packages on the Virtual Machine Host and the Ovirt Engine match those on the Gluster Hosts. This will be great if it's just a package dependency issue.

After installing vdsm-gluster on the Virtual Machine host, the virtual machines are failing to start with an "unknown protocol gluster" message in Ovirt. I then made sure glusterd was running and there didn't seem to be a difference. At least this fixes the original problem of this bug report, "Bad Volume specification". I tried installing vdsm-gluster on the ovirt-engine just to see what would happen, and obviously that didn't change anything. Since this is failing with unknown protocol gluster, I'm going to try removing the virtual machine cluster and re-adding it with both virtualization and gluster service enabled in ovirt. Attaching ovirt-engine log and vdsm log from virtual machine host.

Snippet from engine.log:

2013-08-05 08:51:19,269 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (pool-6-thread-47) [1b05fd77] Correlation ID: 1b05fd77, Job ID: b77a5d81-8b8e-41ca-b4ef-46ac232065cf, Call Stack: null, Custom Event ID: -1, Message: VM test5 was started by admin@internal (Host: vmhp60m2).
2013-08-05 08:51:21,375 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (DefaultQuartzScheduler_Worker-62) START, DestroyVDSCommand(HostName = vmhp60m2, HostId = 9f7eb405-5207-4ac5-b80e-e40ac40354af, vmId=6f505f64-2422-4949-be52-131ae5dfda78, force=false, secondsToWait=0, gracefully=false), log id: 19971bf9
2013-08-05 08:51:21,484 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (DefaultQuartzScheduler_Worker-62) FINISH, DestroyVDSCommand, log id: 19971bf9
2013-08-05 08:51:21,495 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-62) Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM test5 is down. Exit message: internal error unknown protocol type 'gluster'.
[root@vmhp60m2 ~]# rpm -qa |grep gluster glusterfs-rdma-3.4.0-3.el6.x86_64 glusterfs-server-3.4.0-3.el6.x86_64 vdsm-gluster-4.12.0-0.1.rc3.el6.noarch glusterfs-3.4.0-3.el6.x86_64 glusterfs-fuse-3.4.0-3.el6.x86_64 [root@ovirtp1m2 ovirt-engine]# rpm -qa |grep gluster glusterfs-3.4.0-3.el6.x86_64 glusterfs-fuse-3.4.0-3.el6.x86_64 glusterfs-rdma-3.4.0-3.el6.x86_64 vdsm-gluster-4.12.0-18.git78903dd.el6.noarch glusterfs-server-3.4.0-3.el6.x86_64 Created attachment 782850 [details]
engine.log and vdsm.log after adding vdsm-gluster
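The "unknown protocol type 'gluster'" failure reported above is what libvirt returns when the underlying qemu has no gluster (libgfapi) block driver, as the next comment confirms. A minimal sketch of the same check that is later requested as ldd `which qemu-kvm` | grep libgf; the qemu binary path is an assumption (on EL6 it is typically /usr/libexec/qemu-kvm).

    import subprocess

    QEMU_BINARY = "/usr/libexec/qemu-kvm"   # assumed path; adjust for your distro

    # A qemu built without libgfapi cannot open gluster:// disks, and libvirt then
    # reports "internal error unknown protocol type 'gluster'".
    proc = subprocess.Popen(["ldd", QEMU_BINARY], stdout=subprocess.PIPE)
    out, _ = proc.communicate()
    print("libgfapi linked:", b"libgfapi" in out)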
Caerie, Thanks for confirming that after installing vdsm-gluster (which in turn installs glusterfs-server) the original reported problem in this BZ is gone. Ideally just install glsuterfs-server should have worked too. BTW, FYI there is a big mail thread going on in the gluster-devel mailing list on how to package gluster such that these kind of issues (and some other types) are not seen by the end User. I have ensured that the issue reported by this BZ has been reported as a feedback to which I have recd. favourable response... so in future, when the next set of Gluster RPM packaging changes are introduced, it should ensure that for hypervisor only host, all that is needed is glusterfs, glusterfs-libs glusterfs-fuse and glusterfs-cli (proposed) packages. I will be modifying the vdsm.spec (aka vdsm dep for gluster) once that is changed on the gluster project side. Having said that.. reg. the next problem you are seeing.. Here is my debug From vmhost_vdsm.log <disk device="disk" snapshot="no" type="network"> <source name="vmstorage/0a414f79-0b29-4ca7-a2d7-733158969dfc/images/9f88f7d6-1822-47a4-8d55-3747126b2a99/2c62aafc-6f04-4e5a-af48-9284316d35fe" protocol="gluster"> <host name="192.168.12.108" port="0" transport="tcp"/> </source> <target bus="virtio" dev="vda"/> <serial>9f88f7d6-1822-47a4-8d55-3747126b2a99</serial> <driver cache="none" error_policy="stop" io="threads" name="qemu" type="raw"/> </disk> Thread-196808::DEBUG::2013-08-05 08:51:29,754::vm::2034::vm.Vm::(_startUnderlyingVm) vmId=`6f505f64-2422-4949-be52-131ae5dfda78`::_ongoingCreations released Thread-196808::ERROR::2013-08-05 08:51:29,755::vm::2060::vm.Vm::(_startUnderlyingVm) vmId=`6f505f64-2422-4949-be52-131ae5dfda78`::The vm start process failed Traceback (most recent call last): File "/usr/share/vdsm/vm.py", line 2020, in _startUnderlyingVm self._run() File "/usr/share/vdsm/vm.py", line 2901, in _run self._connection.createXML(domxml, flags), File "/usr/lib64/python2.6/site-packages/vdsm/libvirtconnection.py", line 76, in wrapper ret = f(*args, **kwargs) File "/usr/lib64/python2.6/site-packages/libvirt.py", line 2645, in createXML if ret is None:raise libvirtError('virDomainCreateXML() failed', conn=self) libvirtError: internal error unknown protocol type 'gluster' Thread-196808::DEBUG::2013-08-05 08:51:29,759::vm::2444::vm.Vm::(setDownStatus) vmId=`6f505f64-2422-4949-be52-131ae5dfda78`::Changed state to Down: internal error unknown protocol type 'gluster' This means that VDSM generated the right libvirt XML that is used to create the VM. But looks like your qemu (either qemu-kvm or just qemu) on the VDSM hypervisor host is old and doesn't support 'gluster' as a supported protocol. The minimum qemu version needed in qemu 1.3 Can you pls give the o/p of the following commands from your hyp. host 1) rpm -qa| grep qemu 2) ldd `which qemu-kvm | grep libgf`. Replace qemu-kvm with the actual qemu binary if that is not applicable in your case. [root@vmhp60m2 ~]# rpm -qa |grep qemu gpxe-roms-qemu-0.9.7-6.9.el6.noarch qemu-kvm-tools-0.12.1.2-2.355.0.1.el6.centos.6.x86_64 qemu-kvm-0.12.1.2-2.355.0.1.el6.centos.6.x86_64 qemu-img-0.12.1.2-2.355.0.1.el6.centos.6.x86_64 This appears to be the latest qemu-kvm for Centos 6.4 which is what all of our servers run. I'll try to find something newer that's been backported. I foresee a LOT of people with this same problem that are running 6.4 across their Enterprise. FYI even the ovirt-node for EL6 in beta ISOs runs qemu-kvm-0.12 so that won't work either. Caerie, I agree... 
and I have taken up the same with vdsm-devel list @ https://lists.fedorahosted.org/pipermail/vdsm-devel/2013-August/002497.html Having said that here is the summary of this BZ... 1) Installing vdsm-gluster on the host resolved the "bad volume spec" issue. This will further be fixed on gluster is available post RPM re-structuring and vdsm.spec.in is changed to reflect the dep on the new gluster packages 2) Qemu 1.3 (once installed) should allow you to start a VM gracefully using Gluster storage domain. ( Googling: I don't see a repo for centos which has qemu 1.3) Can you try compiling it if possible to validate this ? thanx, deepak (In reply to Deepak C Shetty from comment #31) > Caerie, > I agree... and I have taken up the same with vdsm-devel list @ > https://lists.fedorahosted.org/pipermail/vdsm-devel/2013-August/002497.html > > Having said that here is the summary of this BZ... > > 1) Installing vdsm-gluster on the host resolved the "bad volume spec" issue. > This will further be fixed on gluster is available post RPM re-structuring > and vdsm.spec.in is changed to reflect the dep on the new gluster packages > > 2) Qemu 1.3 (once installed) should allow you to start a VM gracefully using > Gluster storage domain. ( Googling: I don't see a repo for centos which has > qemu 1.3) Can you try compiling it if possible to validate this ? RHEL 6.5 (and so centos 6.5) will get native gluster support Tracked in 848070 > > thanx, > deepak Update: 1) GlusterFS RPM packaging went thru few changes to accomodate the client & server side usecases. More details abt that discussion is @ http://lists.gnu.org/archive/html/gluster-devel/2013-07/msg00098.html http://lists.gnu.org/archive/html/gluster-devel/2013-08/msg00008.html So the new GlusterFS client side RPMs which are needed by VDSM when its acting as the hyp. host (not storage/gluster node) are available for all the distros in Q. Patch for updating vdsm.spec for this is @ http://gerrit.ovirt.org/#/c/17994/ 2) Per the discussion @ https://lists.fedorahosted.org/pipermail/vdsm-devel/2013-August/002497.html qemu w/ libgfapi support (typically qemu 1.3 but could be qemu <some other version> when libgfapi is backported) is available in F18/F19 repos... either in the distro repo or in the virt-preview repo @ fedorapeople.org/groups/virt/virt-preview/ Per the above thread, qemu w/ libgfapi support is being backported to rhel6 (hence should be available in centos too) and this is being tracked @ https://bugzilla.redhat.com/show_bug.cgi?id=848070 So VDSM as part of installing qemu or User installing it manually will be able to get the latest qemu which has libgfapi (aka gluster backend) support for F18/F19/RHEL/Centos from their respective repos. 3) Note: F17 dep is being removed as it has reached End of Life. See ... http://fedoraproject.org/wiki/End_of_life thanx, deepak Hi. I was wondering if it is possible to include correct version of qemu into ovirt 3.3 repository as it is required for one of ovirt's main new features - Gluster Storage Domain. Is there any release plan for RHEL 6.5? It seems that it will be a long time for having this feature in CentOS (supported by ovirt as I understand). 
Maciek Hi, I've experienced problems very similar to Caerie's using the following config: - Fedora 19 kernel 3.10.10-200 (engine and hosts) - ovirt-engine 3.3.0-0.7.rc2.fc19 - vdsm 4.12.1 - libvirt 1.0.5 - qemu 1.4 - gluster 3.4 - all hosts both Gluster and virt hosts Issues: - Could not start VM from Gluster domain - Could not delete VM disks from Gluster domain (deletion reports successful however disk remains in list with status Illegal) Notes: - All virt hosts mounted Gluster volume successfully - I was able to start a VM created through Ovirt, but I had to start it directly on the virt host via qemu-system-x86_64 -file gluster://... ovirt-engine.log and vdsm.log from virt host will be attached. I'm in the process of upgrading to ovirt-engine 3.3.0-1 and will report back if the problem persists or is resolved. Cheers, Chris Created attachment 794636 [details]
ovirt-engine.log and vdsm.log excerpts at time of VM start
From the logs: File "/usr/share/vdsm/storage/remoteFileHandler.py", line 199, in callCrabRPCFunction raise err OSError: [Errno 2] No such file or directory: '/rhev/data-center/mnt/glusterSD/192.168.12.108:vmstorage/27fadc4e-a194-4bad-aef0-b85d73d1e51b' Thread-4905::DEBUG::2013-07-31 12:39:22,319::domainMonitor::233::Storage.DomainMonitorThread::(_monitorDomain) Domain 27fadc4e-a194-4bad-aef0-b85d73d1e51b changed its status to Invalid Could you make sure you have set the required permissions on bricks and volume options as per http://www.ovirt.org/Features/GlusterFS_Storage_Domain#Important_Pre-requisites Created attachment 795388 [details]
VDSM log at point of VM start failure - ovirt 3.3.0-1 vdsm 4.12.1
(In reply to Sahina Bose from comment #37) > From the logs: > > File "/usr/share/vdsm/storage/remoteFileHandler.py", line 199, in > callCrabRPCFunction > raise err > OSError: [Errno 2] No such file or directory: > '/rhev/data-center/mnt/glusterSD/192.168.12.108:vmstorage/27fadc4e-a194-4bad- > aef0-b85d73d1e51b' > Thread-4905::DEBUG::2013-07-31 > 12:39:22,319::domainMonitor::233::Storage.DomainMonitorThread:: > (_monitorDomain) Domain 27fadc4e-a194-4bad-aef0-b85d73d1e51b changed its > status to Invalid > > Could you make sure you have set the required permissions on bricks and > volume options as per > http://www.ovirt.org/Features/GlusterFS_Storage_Domain#Important_Pre- > requisites Hi Sahina, I'm not sure that the above log excerpt is from the log extracts I provided - I can confirm that the Gluster volume is Active in the oVirt web interface and mounted on the virt host. I've uploaded a more complete VDSM log file as I realized that the previous excerpt did not include the Python tracebacks - my apologies! The VM start error occurs at line 110930 and appears to be a KeyError on the name of the Gluster volume. The pre-reqs you linked to have already been incorporated prior to the error occurring - Gluster volume information, /etc/glusterd.vol file and relevant RPM listing have been uploaded for confirmation. I've completed upgrading to ovirt-engine 3.3.0-1 and the problem persists. Thanks for your help, Chris Created attachment 795389 [details]
Gluster volume info
Created attachment 795390 [details]
glusterd.vol file
Created attachment 795391 [details]
GlusterFS/libvirt/qemu RPM list from virt host
(In reply to Chris Sullivan from comment #39) > Hi Sahina, > > I'm not sure that the above log excerpt is from the log extracts I provided > - I can confirm that the Gluster volume is Active in the oVirt web interface > and mounted on the virt host. > > I've uploaded a more complete VDSM log file as I realized that the previous > excerpt did not include the Python tracebacks - my apologies! The VM start > error occurs at line 110930 and appears to be a KeyError on the name of the > Gluster volume. > > The pre-reqs you linked to have already been incorporated prior to the error > occurring - Gluster volume information, /etc/glusterd.vol file and relevant > RPM listing have been uploaded for confirmation. > > I've completed upgrading to ovirt-engine 3.3.0-1 and the problem persists. > > Thanks for your help, > > Chris After investigating /usr/share/vdsm/storage/glusterVolume.py I identified the issue as being caused by the GlusterFS volume having an underscore in the name (i.e. r410-01:hades_storage). This is converted to be the mount point 'r410-01:hades__storage' underneath the /rhev/data-center/mnt/glusterSD folder - I'm assuming that the first underscore escapes the second. When glusterVolume.py splits the mount point it does not convert the double underscore back to a single one as per the actual volume name, and so it fails when looking up the volume info for the volume as the volume 'hades__storage' does not exist. I removed all the bricks from the hades_storage volume, cleared the fattrs and .glusterFS folders, then recreated the volume under the name hades. VMs now start successfully from the volume. Hope this helps resolve the issue :) I think that the best way to solve the issue would be to modify the VDSM Gluster code to handle volume names with underscores in a manner consistent with how ovirt-engine defines the mountpoints. Cheers, Chris Hi, Also having / (slash) in front of volume name causes "Bad volume specification" -error message when starting VM stored on Gluster Storage Domain. Ie. server.tld:/volname is bad. Mounting of volume is fine and creating storage domain shows no errors or warnings. As far as I know it's perfectly fine to use / with GlusterFS when mounting volumes. -samuli (In reply to Samuli Heinonen from comment #44) > Hi, > > Also having / (slash) in front of volume name causes "Bad volume > specification" -error message when starting VM stored on Gluster Storage > Domain. Ie. server.tld:/volname is bad. Mounting of volume is fine and > creating storage domain shows no errors or warnings. > > As far as I know it's perfectly fine to use / with GlusterFS when mounting > volumes. > > -samuli Hi Samuli, I found the same issue - it's because the / gets converted to an underscore which causes the same problem as an underscore in the actual volume name i.e. it is not properly stripped out when the volume information is being looked up. Cheers, Chris Chris, Samuli, Thanks for reporting these issues and special thanks to Chris for helping root cause it. I have fixed both the '_' and leading '/' issues in http://gerrit.ovirt.org/#/c/19219/ I tested with server:vol_name and server:/vol_name and it worked fine. The fix is very minimal, so pls see if you can validate in your env. by taking the patch and/or applying the patch changes directly in /usr/share/vdsm/storage/glusterVolume.py and restarting vdsmd and re-testing. Sahina, Thanks for the follow-up. thanx, deepak Thanks Deepak, I'm happy to confirm that it fixes the issue when using /. 
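For reference, a minimal sketch of the name handling that the '_' and leading '/' fix has to get right, assuming the escaping Chris observed ('_' doubled, '/' turned into '_') when the mount directory under /rhev/data-center/mnt/glusterSD is built from the server:volume spec. The exact transformation in vdsm may differ; this only illustrates why a naive split on ':' asks gluster about a volume name it does not know.

    def mount_dir_name(spec):
        # Illustrative escaping of a 'server:vol' or 'server:/vol' connection spec
        # into the mount directory name: '_' doubles, '/' becomes '_'.
        return spec.replace("_", "__").replace("/", "_")

    def volume_name_from_mount(mount_dir):
        # Recover the real gluster volume name by undoing the escaping; without this
        # step the lookup asks gluster about 'hades__storage' or '_volname' and fails.
        server, escaped = mount_dir.split(":", 1)
        vol = escaped.replace("__", "\0").replace("_", "/").replace("\0", "_")
        return server, vol.lstrip("/")

    # The two problem cases reported in this bug:
    print(volume_name_from_mount(mount_dir_name("r410-01:hades_storage")))  # ('r410-01', 'hades_storage')
    print(volume_name_from_mount(mount_dir_name("server.tld:/volname")))    # ('server.tld', 'volname')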
-samuli closing as this should be in 3.3 (doing so in bulk, so may be incorrect) |