Bug 1530261
Summary: unable to boot VM with libgfapi in 4.2

Product: [oVirt] ovirt-engine
Component: BLL.Virt
Version: 4.2.0.2
Hardware: Unspecified
OS: Unspecified
Reporter: Gianluca Cecchi <gianluca.cecchi>
Assignee: Denis Chaplygin <dchaplyg>
QA Contact: SATHEESARAN <sasundar>
CC: ahadas, alejandro.cortina2, amureini, bugs, dchaplyg, gianluca.cecchi, lveyde, sasundar, tnisan
Status: CLOSED CURRENTRELEASE
Severity: high
Priority: unspecified
Keywords: Regression
Flags: rule-engine: ovirt-4.2+, rule-engine: blocker+
Target Milestone: ovirt-4.2.1
Target Release: 4.2.1.1
Fixed In Version: ovirt-engine-4.2.1.1
oVirt Team: Gluster
Type: Bug
Last Closed: 2018-05-10 06:28:35 UTC

Doc Type: Bug Fix
Doc Text:
Cause: The engine made an invalid assumption about the format of the gluster mount point URL.
Consequence: VMs could not start on volumes mounted with the alternative format of the gluster mount point.
Fix: Added proper support for both possible formats.
Result: VMs start correctly.
Description
Gianluca Cecchi
2018-01-02 11:50:54 UTC
Created attachment 1375684 [details]
engine.log
Created attachment 1375685 [details]
vdsm.log (xz format) from the time of the VM start attempt
Arik, can it be related to libvirt XML?

(In reply to Yaniv Kaul from comment #3)
> Arik, can it be related to libvirt XML?

Yes, it is most probably an issue during the generation of the XML.
@Gianluca, can you please reproduce it with debug logs enabled?

I have temporarily changed the log settings this way:

[root@ovengine ovirt-engine]# pwd
/usr/share/ovirt-engine/services/ovirt-engine
[root@ovengine ovirt-engine]# diff -u3 ovirt-engine.xml.in ovirt-engine.xml.in.orig
--- ovirt-engine.xml.in 2018-01-02 15:05:58.833430614 +0100
+++ ovirt-engine.xml.in.orig 2017-12-11 18:41:16.000000000 +0100
@@ -175,7 +175,7 @@
       <!-- Loggers for the engine: -->
       <logger category="org.ovirt" use-parent-handlers="false">
-        <level name="DEBUG"/>
+        <level name="INFO"/>
         <handlers>
           <handler name="ENGINE"/>
           <handler name="ovirt-logger"/>
@@ -186,13 +186,13 @@
       </logger>
       <!-- Loggers for the UI -->
       <logger category="org.ovirt.engine.ui.frontend.server.gwt.OvirtRemoteLoggingService" use-parent-handlers="false">
-        <level name="DEBUG"/>
+        <level name="INFO"/>
         <handlers>
           <handler name="UI"/>
         </handlers>
       </logger>
       <logger category="org.ovirt.engine.core.bll">
-        <level name="DEBUG"/>
+        <level name="INFO"/>
       </logger>
       <logger category="org.ovirt.engine.core.dal.dbbroker.PostgresDbEngineDialect$PostgresJdbcTemplate">
         <level name="WARN"/>

followed by systemctl restart ovirt-engine. I'm going to attach engine.log in gzip format.

Created attachment 1375781 [details]
engine.log in gzip format after enabling debug level
(In reply to Gianluca Cecchi from comment #6)
> Created attachment 1375781 [details]
> engine.log in gzip format after enabling debug level

Thank you very much!

@Denis, could you please check why it fails on [1]?

[1] https://github.com/oVirt/ovirt-engine/blob/master/backend/manager/modules/vdsbroker/src/main/java/org/ovirt/engine/core/vdsbroker/builder/vminfo/LibvirtVmXmlBuilder.java#L1679

It looks like gluster's volume info is sometimes returned incorrectly, and when LibvirtVmXmlBuilder tries to use it, it fails. I posted a patch that falls back to file-based access in that case and, at the same time, logs it.

@Gianluca, could you please try to reproduce with my patch and search the engine logs for "Invalid volInfo value:"?

Where can I find the patch and how do I apply it?

The simplest way is to wait for the build here:

https://gerrit.ovirt.org/#/c/85957/

and then grab the RPMs from Jenkins. For example, for RHEL you can use these RPMs:

http://jenkins.ovirt.org/job/ovirt-engine_master_check-patch-el7-x86_64/35344/artifact/exported-artifacts/

This bug report has Keywords: Regression or TestBlocker. Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.

Hello, unfortunately it is not feasible for me to apply master modifications to my environment right now. Nevertheless I would like to dig more and help, in case my setup has something wrong or incorrectly inherited during the upgrades from 4.0, when it was created. I'm far from being a programmer or a developer, but I see that many methods also rely on stored procedures and queries in the *_sp.sql files under /usr/share/ovirt-engine/dbscripts on the engine. Correct me if I'm wrong.

Could you confirm what the expected snippet of the domain XML file would be for gluster disks backed by libgfapi?
In my case I think it should be:

<disk type='network' device='disk'>
  <driver name='qemu' type='raw'/>
  <source protocol='gluster' name='data/190f4096-003e-4908-825a-6c231e60276d/images/02731d5e-c222-4697-8f1f-d26a6a23ec79/1836df76-835b-4625-9ce8-0856176dc30c'>
    <host name='ovirt01.localdomain.local' port='0'/>
  </source>
  <target dev='vda' bus='virtio'/>
</disk>

I see this in the code, in LibvirtVmXmlBuilder.java:

private void writeDiskSource(Disk disk, String dev) {
    writer.writeStartElement("source");
    switch (disk.getDiskStorageType()) {
    ...
    case "network":
        writer.writeAttributeString("protocol", "gluster");
        String[] volInfo = vmInfoBuildUtils.getGlusterVolInfo(disk);
        writer.writeAttributeString(
                "name",
                String.format("%s/%s/images/%s/%s",
                        volInfo[1],
                        diskImage.getStorageIds().get(0),
                        diskImage.getId(),
                        diskImage.getImageId()));
        writer.writeStartElement("host");
        writer.writeAttributeString("name", volInfo[0]);
        writer.writeAttributeString("port", "0");
        writer.writeEndElement();
        break;
    ...
    writer.writeEndElement();
}

and VmInfoBuildUtils.java contains:

public String[] getGlusterVolInfo(Disk disk) {
    StorageDomainStatic dom = this.storageDomainStaticDao.get(((DiskImage) disk).getStorageIds().get(0));
    StorageServerConnections con = this.storageServerConnectionDao.getAllForDomain(dom.getId()).get(0);
    String path = con.getConnection(); // host:/volume
    return path.split(":/");
}

I have a doubt about the port, which seems to be set statically to "0" in your code... does it mean 24007? In the libvirt docs I read that it is 24007 if not present, but nothing about a "0" value.
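The path.split(":/") call in getGlusterVolInfo above is exactly the fragile spot: it assumes the connection string always contains ":/". The sketch below, a hypothetical illustration rather than the actual merged oVirt patch, shows the failure mode on a legacy "host:volume" connection string and a tolerant parser that accepts both syntaxes:

```java
// Hypothetical sketch (not the actual merged oVirt patch) of parsing a
// gluster connection string so that both formats seen in this bug work:
//   "ovirt01.localdomain.local:/data"  newer syntax; split(":/") works
//   "ovirt01.localdomain.local:data"   legacy 4.0-era syntax; split(":/")
//                                      returns a single element, so the
//                                      later volInfo[1] access fails
public class GlusterConnectionParser {

    /** Returns {host, volume} for either "host:/volume" or "host:volume". */
    public static String[] parse(String connection) {
        // Split on the first ':' only; the host part cannot contain ':'.
        int idx = connection.indexOf(':');
        if (idx < 0) {
            throw new IllegalArgumentException("Not a gluster connection: " + connection);
        }
        String host = connection.substring(0, idx);
        String volume = connection.substring(idx + 1);
        // Strip the optional leading '/' so both syntaxes normalize
        // to the same {host, volume} pair.
        if (volume.startsWith("/")) {
            volume = volume.substring(1);
        }
        return new String[] { host, volume };
    }
}
```

With this normalization, environments created on 4.0 (which stored "host:volume") and newer ones (which store "host:/volume") would both produce a usable pair.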
Another thing I see in my case, if I understood correctly, is what the connection should be:

SELECT *
FROM storage_server_connections
WHERE EXISTS (
    SELECT 1
    FROM storage_domain_static
    WHERE storage_domain_static.id = '190f4096-003e-4908-825a-6c231e60276d'
    AND storage_domain_static.storage_type IN (1, 4, 6, 7) -- file storage domains: nfs, posix, local, gluster
    AND storage_server_connections.id = storage_domain_static.storage

    UNION ALL

    SELECT 1
    FROM storage_domain_static
    INNER JOIN luns
        ON storage_domain_static.storage = luns.volume_group_id
    INNER JOIN lun_storage_server_connection_map
        ON luns.lun_id = lun_storage_server_connection_map.lun_id
        AND storage_server_connections.id = lun_storage_server_connection_map.storage_server_connection
    WHERE storage_domain_static.id = '190f4096-003e-4908-825a-6c231e60276d'
    AND storage_domain_static.storage_type = 3 -- storage type = iscsi
);

This returns one row; its non-empty fields are:

id: 1170c066-41f3-43ca-8b52-0374a1175f9d
connection: ovirt01.localdomain.local:data
portal: 1
storage_type: 7
mount_options: backup-volfile-servers=ovirt02.localdomain.local:ovirt03.localdomain.local
vfs_type: glusterfs

In my case the "connection" field reads "ovirt01.localdomain.local:data" (without the slash), while in your code I see the split between "host" and "volume" done this way:

return path.split(":/");

but perhaps I'm looking in the wrong part... Let me know if I can run any query against the engine db to crosscheck what you expect as consistent...

If it can be of any help, the history of this HCI environment has been:

Nov 2016: installed 4.0.5 with ansible and gdeploy
Jun 2017: upgrade to 4.1.2
Jul 2017: upgrade to 4.1.3
Nov 2017: upgrade to 4.1.7
Dec 2017: upgrade to 4.2.0

This is the root cause, thank you. I didn't know that the '/' part is optional. It should be correctly fixed with the last patch.

Yes. I initially deployed 4.0.5 with ansible/gdeploy, based on this post by Jason Brooks:

https://www.ovirt.org/blog/2016/08/up-and-running-with-ovirt-4-0-and-gluster-storage/

At that time the path to the volume in the guide was indeed of the kind host:volume. The Gluster version used was 3.7. It seems that in the 4.1 version of the guide from April 2017:

https://ovirt.org/blog/2017/04/up-and-running-with-ovirt-4.1-and-gluster-storage/

the ":/" notation is used instead. Possibly the problem affects only environments that started on 4.0.x and then gradually updated to 4.2. Let me know if you prefer me to test the patch once 4.2.1 is out, or if I can modify the engine RDBMS tables in any way to reflect the "new" ":/" syntax.

The patch is merged and I believe it will be in 4.2.1, so you can wait for it.

Hi Denis, when is 4.2.1 expected to be released? Cheers,

Hello, I was able to update my environment from 4.2.0 to 4.2.1rc1 and now it works as expected.
The disk part of the qemu command line changed from:

-drive file=/rhev/data-center/mnt/glusterSD/ovirt01.localdomain.local:data/190f4096-003e-4908-825a-6c231e60276d/images/02731d5e-c222-4697-8f1f-d26a6a23ec79/1836df76-835b-4625-9ce8-0856176dc30c,format=raw,if=none,id=drive-virtio-disk0,serial=02731d5e-c222-4697-8f1f-d26a6a23ec79,cache=none,werror=stop,rerror=stop,aio=threads

to:

-drive file=gluster://ovirt01.localdomain.local/data/190f4096-003e-4908-825a-6c231e60276d/images/02731d5e-c222-4697-8f1f-d26a6a23ec79/1836df76-835b-4625-9ce8-0856176dc30c,file.debug=4,format=raw,if=none,id=drive-virtio-disk0,serial=02731d5e-c222-4697-8f1f-d26a6a23ec79,cache=none,werror=stop,rerror=stop,aio=threads

btw: what is the debug=4 option?
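The working command line above shows qemu being handed a gluster:// URL instead of a fuse mount path. Assembling that URL is straightforward once the host and volume have been parsed correctly; the helper below is a hypothetical illustration of the pattern, not oVirt code:

```java
// Assembles the qemu gluster:// URL from the pieces the engine knows:
// the gluster host, the volume name, and the image path inside the
// volume (storage domain ID / images / disk ID / volume ID).
public class GlusterUrlDemo {
    static String glusterUrl(String host, String volume, String imagePath) {
        return String.format("gluster://%s/%s/%s", host, volume, imagePath);
    }
}
```

Compare the resulting "gluster://host/volume/..." form with the old fuse path "/rhev/data-center/mnt/glusterSD/host:volume/...": the same host and volume appear in both, only the access mechanism differs.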
Denis, can you provide some info on this please?
In the meantime I notice that, from a libvirt XML point of view, the "dynamic" file centos6.xml under /etc/libvirt/qemu on the hypervisor in charge of the VM does not seem to contain anything related to "debug"... I don't know the link between the "-drive file" parameter of the qemu-kvm process and the generated XML, though. The disk snippet is below:

<disk type='network' device='disk' snapshot='no'>
  <driver name='qemu' type='raw' cache='none' error_policy='stop' io='threads'/>
  <source protocol='gluster' name='data/190f4096-003e-4908-825a-6c231e60276d/images/02731d5e-c222-4697-8f1f-d26a6a23ec79/1836df76-835b-4625-9ce8-0856176dc30c'>
    <host name='ovirt01.localdomain.local' port='0'/>
  </source>
  <target dev='vda' bus='virtio'/>
  <serial>02731d5e-c222-4697-8f1f-d26a6a23ec79</serial>
  <boot order='1'/>
  <address type='pci' domain='0x0000' bus='0x00' slot='0x06' function='0x0'/>
</disk>

(In reply to Gianluca Cecchi from comment #19)
> Hello, I was able to update my environment from 4.2.0 to 4.2.1rc1 and now it
> works as expected.

Nice to hear that :)

> btw: what is the debug=4 option?

We are propagating debug levels from libvirt to qemu. It was introduced with patch [1] for bug [2].

[1] https://www.redhat.com/archives/libvir-list/2016-October/msg00916.html
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1376009

Tested with RHV 4.2.3 and gluster 3.12:
0. Stopped all the VMs
1. Configured the engine to make use of gfapi
2. Started the VM
3. VMs boot properly with the gfapi access mechanism

This bugzilla is included in the oVirt 4.2.1 release, published on Feb 12th 2018.

Since the problem described in this bug report should be resolved in the oVirt 4.2.1 release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.