Description of problem:
A system upgraded from 4.1.7 (with libgfapi not yet enabled) to 4.2. Three hosts in an HCI configuration.

Version-Release number of selected component (if applicable):
4.2.0.2

How reproducible:
Always

Steps to Reproduce:
1. Enable libgfapi:

[root@ovengine log]# engine-config -s LibgfApiSupported=true
Please select a version:
1. 3.6
2. 4.0
3. 4.1
4. 4.2
4
[root@ovengine log]# engine-config -g LibgfApiSupported
LibgfApiSupported: false version: 3.6
LibgfApiSupported: false version: 4.0
LibgfApiSupported: false version: 4.1
LibgfApiSupported: true version: 4.2

2. Restart the engine:

[root@ovengine log]# systemctl restart ovirt-engine
[root@ovengine log]#

3. Reconnect to the web admin portal and power on a CentOS 6 VM.

Actual results:
Failed to run VM. In events:

Jan 1, 2018, 11:53:35 PM Failed to run VM centos6 (User: admin@internal-authz).
Jan 1, 2018, 11:53:35 PM Failed to run VM centos6 on Host ovirt02.localdomain.local.
Jan 1, 2018, 11:53:35 PM Failed to run VM centos6 on Host ovirt03.localdomain.local.
Jan 1, 2018, 11:53:35 PM Failed to run VM centos6 on Host ovirt01.localdomain.local.

Expected results:
Able to run the VM, with the qemu-kvm option no longer set as "-drive file=/rhev/data-center/mnt/glusterSD/..." but using direct Gluster access instead.

Additional info:
In the engine.log file (going to attach the full log):

2018-01-01 23:53:34,996+01 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.CreateBrokerVDSCommand] (EE-ManagedThreadFactory-engine-Thread-2885) [8d7c68c6-b236-4e76-b7b2-f000e2b07425] Failed in 'CreateBrokerVDS' method, for vds: 'ovirt01.localdomain.local'; host: 'ovirt01.localdomain.local': 1
2018-01-01 23:53:34,996+01 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.CreateBrokerVDSCommand] (EE-ManagedThreadFactory-engine-Thread-2885) [8d7c68c6-b236-4e76-b7b2-f000e2b07425] Command 'CreateBrokerVDSCommand(HostName = ovirt01.localdomain.local, CreateVDSCommandParameters:{hostId='e5079118-1147-469e-876f-e20013276ece', vmId='64da5593-1022-4f66-ae3f-b273deda4c22', vm='VM [centos6]'})' execution failed: 1
2018-01-01 23:53:34,996+01 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.CreateBrokerVDSCommand] (EE-ManagedThreadFactory-engine-Thread-2885) [8d7c68c6-b236-4e76-b7b2-f000e2b07425] FINISH, CreateBrokerVDSCommand, log id: e3bbe56
2018-01-01 23:53:34,996+01 ERROR [org.ovirt.engine.core.vdsbroker.CreateVDSCommand] (EE-ManagedThreadFactory-engine-Thread-2885) [8d7c68c6-b236-4e76-b7b2-f000e2b07425] Failed to create VM: 1
2018-01-01 23:53:34,997+01 ERROR [org.ovirt.engine.core.vdsbroker.CreateVDSCommand] (EE-ManagedThreadFactory-engine-Thread-2885) [8d7c68c6-b236-4e76-b7b2-f000e2b07425] Command 'CreateVDSCommand( CreateVDSCommandParameters:{hostId='e5079118-1147-469e-876f-e20013276ece', vmId='64da5593-1022-4f66-ae3f-b273deda4c22', vm='VM [centos6]'})' execution failed: java.lang.ArrayIndexOutOfBoundsException: 1
2018-01-01 23:53:34,997+01 INFO [org.ovirt.engine.core.vdsbroker.CreateVDSCommand] (EE-ManagedThreadFactory-engine-Thread-2885) [8d7c68c6-b236-4e76-b7b2-f000e2b07425] FINISH, CreateVDSCommand, return: Down, log id: ab299ce
2018-01-01 23:53:34,997+01 WARN [org.ovirt.engine.core.bll.RunVmCommand] (EE-ManagedThreadFactory-engine-Thread-2885) [8d7c68c6-b236-4e76-b7b2-f000e2b07425] Failed to run VM 'centos6': EngineException: java.lang.RuntimeException: java.lang.ArrayIndexOutOfBoundsException: 1 (Failed with error ENGINE and code 5001)

Nothing particular in vdsm.log.

The storage domain where the VM disk resides seems OK:

[root@ovirt01 vdsm]# gluster volume info data

Volume Name: data
Type: Replicate
Volume ID: 2238c6db-48c5-4071-8929-879cedcf39bf
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x (2 + 1) = 3
Transport-type: tcp
Bricks:
Brick1: ovirt01.localdomain.local:/gluster/brick2/data
Brick2: ovirt02.localdomain.local:/gluster/brick2/data
Brick3: ovirt03.localdomain.local:/gluster/brick2/data (arbiter)
Options Reconfigured:
transport.address-family: inet
performance.readdir-ahead: on
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.stat-prefetch: off
cluster.eager-lock: enable
network.remote-dio: off
cluster.quorum-type: auto
cluster.server-quorum-type: server
storage.owner-uid: 36
storage.owner-gid: 36
features.shard: on
features.shard-block-size: 512MB
performance.low-prio-threads: 32
cluster.data-self-heal-algorithm: full
cluster.locking-scheme: granular
cluster.shd-wait-qlength: 10000
cluster.shd-max-threads: 6
network.ping-timeout: 30
user.cifs: off
nfs.disable: on
performance.strict-o-direct: on
[root@ovirt01 vdsm]#

[root@ovirt01 vdsm]# gluster volume heal data info
Brick ovirt01.localdomain.local:/gluster/brick2/data
Status: Connected
Number of entries: 0

Brick ovirt02.localdomain.local:/gluster/brick2/data
Status: Connected
Number of entries: 0

Brick ovirt03.localdomain.local:/gluster/brick2/data
Status: Connected
Number of entries: 0

[root@ovirt01 vdsm]#

I followed this guide to enable libgfapi: https://www.ovirt.org/develop/release-management/features/storage/glusterfs-storage-domain/

The cluster and DC are already at 4.2 compatibility version.
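Side note: if memory serves, the same setting can also be applied non-interactively by passing the compatibility version to engine-config (the --cver flag is assumed here from the engine-config help; the interactive prompt above achieves the same result):

engine-config -s LibgfApiSupported=true --cver=4.2
engine-config -g LibgfApiSupported
systemctl restart ovirt-engine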
Created attachment 1375684 [details] engine.log
Created attachment 1375685 [details] vdsm.log in xz format during time of attempt of VM start
Arik, can it be related to libvirt XML?
(In reply to Yaniv Kaul from comment #3)
> Arik, can it be related to libvirt XML?

Yes, it is most probably an issue during the generation of the XML.

@Gianluca, can you please reproduce it with debug logs enabled?
I have temporarily changed log settings this way:

[root@ovengine ovirt-engine]# pwd
/usr/share/ovirt-engine/services/ovirt-engine
[root@ovengine ovirt-engine]#
[root@ovengine ovirt-engine]# diff -u3 ovirt-engine.xml.in ovirt-engine.xml.in.orig
--- ovirt-engine.xml.in 2018-01-02 15:05:58.833430614 +0100
+++ ovirt-engine.xml.in.orig 2017-12-11 18:41:16.000000000 +0100
@@ -175,7 +175,7 @@
       <!-- Loggers for the engine: -->
       <logger category="org.ovirt" use-parent-handlers="false">
-        <level name="DEBUG"/>
+        <level name="INFO"/>
         <handlers>
           <handler name="ENGINE"/>
           <handler name="ovirt-logger"/>
@@ -186,13 +186,13 @@
       </logger>
       <!-- Loggers for the UI -->
       <logger category="org.ovirt.engine.ui.frontend.server.gwt.OvirtRemoteLoggingService" use-parent-handlers="false">
-        <level name="DEBUG"/>
+        <level name="INFO"/>
         <handlers>
           <handler name="UI"/>
         </handlers>
       </logger>
       <logger category="org.ovirt.engine.core.bll">
-        <level name="DEBUG"/>
+        <level name="INFO"/>
       </logger>
       <logger category="org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresJdbcTemplate">
         <level name="WARN"/>
[root@ovengine ovirt-engine]#

and systemctl restart ovirt-engine. I'm going to attach engine.log in gzip format.
Created attachment 1375781 [details] engine.log in gzip format after enabling debug level
(In reply to Gianluca Cecchi from comment #6)
> Created attachment 1375781 [details]
> engine.log in gzip format after enabling debug level

Thank you very much!

@Denis, could you please check why it fails on [1]?

[1] https://github.com/oVirt/ovirt-engine/blob/master/backend/manager/modules/vdsbroker/src/main/java/org/ovirt/engine/core/vdsbroker/builder/vminfo/LibvirtVmXmlBuilder.java#L1679
It looks like gluster's volume info is sometimes returned incorrectly, and when LibvirtVmXmlBuilder tries to use it, it fails. I posted a patch that falls back to file-based access in that case and, at the same time, logs it.

@Gianluca, could you please try to reproduce with my patch and search the engine logs for "Invalid volInfo value:"?
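To make the failure and the proposed guard concrete, here is a small standalone sketch (not the actual patch; only the "Invalid volInfo value:" wording is taken from the comment above, the class and method names are illustrative):

import java.util.Arrays;

// Illustration of the guard idea: if the stored connection does not split into
// {host, volume}, log it and fall back to file-based (FUSE mount) access.
public class VolInfoCheck {

    // Mirrors VmInfoBuildUtils.getGlusterVolInfo(): "host:/volume" -> {host, volume}
    static String[] getGlusterVolInfo(String connection) {
        return connection.split(":/");
    }

    public static void main(String[] args) {
        String[] connections = {
            "ovirt01.localdomain.local:/data", // expected format
            "ovirt01.localdomain.local:data"   // legacy format without the slash
        };
        for (String con : connections) {
            String[] volInfo = getGlusterVolInfo(con);
            if (volInfo.length != 2) {
                // Accessing volInfo[1] here would throw ArrayIndexOutOfBoundsException: 1,
                // which is exactly the error reported in engine.log above.
                System.out.println("Invalid volInfo value: " + Arrays.toString(volInfo)
                        + " -> fall back to file-based access");
            } else {
                System.out.println("gfapi host=" + volInfo[0] + " volume=" + volInfo[1]);
            }
        }
    }
}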
Where can I find the patch and how to apply it?
The simplest way is to wait for the build here https://gerrit.ovirt.org/#/c/85957/ and then grab the RPMs from Jenkins.
For example, for RHEL you can use these RPMs: http://jenkins.ovirt.org/job/ovirt-engine_master_check-patch-el7-x86_64/35344/artifact/exported-artifacts/
This bug report has Keywords: Regression or TestBlocker. Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.
Hello,
unfortunately it is not feasible for me to apply master modifications to my environment right now. Nevertheless I would like to dig more and help, in case my setup has something wrong or incorrectly inherited during the upgrades from 4.0, when it was created.

I'm far from being a programmer or a developer, but I see that many methods also rely on stored procedures and queries in the *_sp.sql files under /usr/share/ovirt-engine/dbscripts on the engine. Correct me if I'm wrong.

Could you confirm what the expected snippet of the domain XML file is for gluster disks backed by libgfapi? In my case I think it should be:

<disk type='network' device='disk'>
  <driver name='qemu' type='raw'/>
  <source protocol='gluster' name='data/190f4096-003e-4908-825a-6c231e60276d/images/02731d5e-c222-4697-8f1f-d26a6a23ec79/1836df76-835b-4625-9ce8-0856176dc30c'>
    <host name='ovirt01.localdomain.local' port='0'/>
  </source>
  <target dev='vda' bus='virtio'/>
</disk>

I see this in the code of LibvirtVmXmlBuilder.java:

private void writeDiskSource(Disk disk, String dev) {
    writer.writeStartElement("source");
    switch (disk.getDiskStorageType()) {
    ...
    case "network":
        writer.writeAttributeString("protocol", "gluster");
        String[] volInfo = vmInfoBuildUtils.getGlusterVolInfo(disk);
        writer.writeAttributeString(
                "name",
                String.format("%s/%s/images/%s/%s",
                        volInfo[1],
                        diskImage.getStorageIds().get(0),
                        diskImage.getId(),
                        diskImage.getImageId()));
        writer.writeStartElement("host");
        writer.writeAttributeString("name", volInfo[0]);
        writer.writeAttributeString("port", "0");
        writer.writeEndElement();
        break;
    ...
    writer.writeEndElement();
}

and VmInfoBuildUtils.java contains:

public String[] getGlusterVolInfo(Disk disk) {
    StorageDomainStatic dom = this.storageDomainStaticDao.get(((DiskImage) disk).getStorageIds().get(0));
    StorageServerConnections con = this.storageServerConnectionDao.getAllForDomain(dom.getId()).get(0);
    String path = con.getConnection(); // host:/volume
    return path.split(":/");
}

I have a doubt about the port, which seems to be set statically to "0" in your code... does it mean 24007? In the libvirt docs I read that it defaults to 24007 if not present, but nothing about a "0" value.
Another thing I checked is what the connection should be, if I understood correctly:

engine=# SELECT *
engine-# FROM storage_server_connections
engine-# WHERE EXISTS (
engine(#     SELECT 1
engine(#     FROM storage_domain_static
engine(#     WHERE storage_domain_static.id = '190f4096-003e-4908-825a-6c231e60276d'
engine(#     AND storage_domain_static.storage_type IN (
engine(#         1,
engine(#         4,
engine(#         6,
engine(#         7
engine(#     ) -- file storage domains - nfs,posix,local,gluster
engine(#     AND storage_server_connections.id = storage_domain_static.storage
engine(#
engine(#     UNION ALL
engine(#
engine(#     SELECT 1
engine(#     FROM storage_domain_static
engine(#     INNER JOIN luns
engine(#         ON storage_domain_static.storage = luns.volume_group_id
engine(#     INNER JOIN lun_storage_server_connection_map
engine(#         ON luns.lun_id = lun_storage_server_connection_map.lun_id
engine(#         AND storage_server_connections.id = lun_storage_server_connection_map.storage_server_connection
engine(#     WHERE storage_domain_static.id = '190f4096-003e-4908-825a-6c231e60276d'
engine(#     AND storage_domain_static.storage_type = 3 -- storage type = iscsi
engine(# );

                  id                  |           connection           | user_name | password | iqn | port | portal | storage_type |                                mount_options                                | vfs_type  | nfs_version | nfs_timeo | nfs_retrans | gluster_volume_id
--------------------------------------+--------------------------------+-----------+----------+-----+------+--------+--------------+-----------------------------------------------------------------------------+-----------+-------------+-----------+-------------+-------------------
 1170c066-41f3-43ca-8b52-0374a1175f9d | ovirt01.localdomain.local:data |           |          |     |      | 1      | 7            | backup-volfile-servers=ovirt02.localdomain.local:ovirt03.localdomain.local | glusterfs |             |           |             |
(1 row)

engine=#

In my case the "connection" field reads "ovirt01.localdomain.local:data" (without the slash), while in your code the split between "host" and "volume" is done this way:

return path.split(":/");

but perhaps I'm looking in the wrong part... Let me know if I can run any query against the engine DB to crosscheck what you expect as consistent...
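As a quicker cross-check, one could also list just the gluster-type connections and see whether the volume part is stored with or without the leading slash. A minimal sketch, assuming the storage_server_connections columns shown in the output above (storage_type 7 = glusterfs):

-- List glusterfs connections; legacy entries show up as host:volume instead of host:/volume
SELECT id, connection
FROM storage_server_connections
WHERE storage_type = 7;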
If it can be of any help, the history of this HCI environment has been:

Nov 2016 installed 4.0.5 with ansible and gdeploy
Jun 2017 upgrade to 4.1.2
Jul 2017 upgrade to 4.1.3
Nov 2017 upgrade to 4.1.7
Dec 2017 upgrade to 4.2.0
This is the root cause, thank you. I didn't know that the '/' part is optional. It should be correctly fixed with the latest patch.
Yes. I initially deployed 4.0.5 with ansible/gdeploy, based on this post by Jason Brooks: https://www.ovirt.org/blog/2016/08/up-and-running-with-ovirt-4-0-and-gluster-storage/

At that time the path to the volume in the guide was indeed of the kind host:volume. The Gluster version used was 3.7.

It seems that the 4.1 version of the guide from April 2017, here: https://ovirt.org/blog/2017/04/up-and-running-with-ovirt-4.1-and-gluster-storage/ uses the ":/" notation instead.

Possibly the problem affects only environments that started on 4.0.x and were then gradually updated to 4.2.

Let me know if you prefer me to test the patch once 4.2.1 is out, or if I can in any way modify the engine RDBMS tables to reflect the "new" ":/" syntax.
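For the record, a purely hypothetical way to normalize such a legacy entry at the database level could look like the statement below. This is an unsupported direct edit of the engine database, sketched only to illustrate what "reflecting the new syntax" would mean; backing up the database first and simply waiting for the fixed release (as suggested in the next comment) is the safer option.

-- Hypothetical, unsupported sketch: rewrite legacy gluster connections (host:volume -> host:/volume)
UPDATE storage_server_connections
SET connection = replace(connection, ':', ':/')
WHERE storage_type = 7              -- glusterfs connections only
  AND connection NOT LIKE '%:/%';   -- only entries still in the legacy form (single ':')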
The patch is merged and I believe it will be in 4.2.1, so you can wait for it.
Hi Denis, when is 4.2.1 expected to be released? Cheers,
Hello, I was able to update my environment from 4.2.0 to 4.2.1 RC1 and now it works as expected. The disk part changed from:

-drive file=/rhev/data-center/mnt/glusterSD/ovirt01.localdomain.local:data/190f4096-003e-4908-825a-6c231e60276d/images/02731d5e-c222-4697-8f1f-d26a6a23ec79/1836df76-835b-4625-9ce8-0856176dc30c,format=raw,if=none,id=drive-virtio-disk0,serial=02731d5e-c222-4697-8f1f-d26a6a23ec79,cache=none,werror=stop,rerror=stop,aio=threads

to:

-drive file=gluster://ovirt01.localdomain.local/data/190f4096-003e-4908-825a-6c231e60276d/images/02731d5e-c222-4697-8f1f-d26a6a23ec79/1836df76-835b-4625-9ce8-0856176dc30c,file.debug=4,format=raw,if=none,id=drive-virtio-disk0,serial=02731d5e-c222-4697-8f1f-d26a6a23ec79,cache=none,werror=stop,rerror=stop,aio=threads

btw: what is the debug=4 option?
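As a side note on verifying the access mechanism: the running qemu command line can be checked directly on the host with a simple grep of the process list (a small sketch; it only looks for a gluster:// drive path):

# On the host running the VM: prints the gluster:// URL when gfapi is in use,
# and nothing when the disk is still accessed via the FUSE mount under /rhev/data-center/...
ps -ef | grep -o 'file=gluster://[^,]*'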
> btw: what is the debug=4 option?

Denis, can you provide some info on this please?
In the meantime I notice that, from a libvirt XML point of view, the "dynamic" file centos6.xml under /etc/libvirt/qemu on the hypervisor hosting the VM does not seem to contain anything related to "debug"... I don't know the link between the "-drive file" parameter of the qemu-kvm process and the generated XML, though. The disk snippet is below:

<disk type='network' device='disk' snapshot='no'>
  <driver name='qemu' type='raw' cache='none' error_policy='stop' io='threads'/>
  <source protocol='gluster' name='data/190f4096-003e-4908-825a-6c231e60276d/images/02731d5e-c222-4697-8f1f-d26a6a23ec79/1836df76-835b-4625-9ce8-0856176dc30c'>
    <host name='ovirt01.localdomain.local' port='0'/>
  </source>
  <target dev='vda' bus='virtio'/>
  <serial>02731d5e-c222-4697-8f1f-d26a6a23ec79</serial>
  <boot order='1'/>
  <address type='pci' domain='0x0000' bus='0x00' slot='0x06' function='0x0'/>
</disk>
(In reply to Gianluca Cecchi from comment #19)
> Hello, I was able to update my environment from 4.2.0 to 4.2.1rc1 and now it
> works as expected.

Nice to hear that :)

> btw: what is the debug=4 option?

We are propagating debug levels from libvirt to qemu. It was introduced with patch [1] for bug [2].

[1] https://www.redhat.com/archives/libvir-list/2016-October/msg00916.html
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1376009
Tested with RHV 4.2.3 and Gluster 3.12:
0. Stopped all the VMs
1. Configured the engine to make use of gfapi
2. Started the VM
3. VM boots properly with the gfapi access mechanism
This bugzilla is included in oVirt 4.2.1 release, published on Feb 12th 2018. Since the problem described in this bug report should be resolved in oVirt 4.2.1 release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.