Description
-----------
Hosted Engine VM fails to boot after copying its image from local storage to a GlusterFS storage domain backed by a replica 3 gluster volume.

Version
-------
RHHI-V 1.8
RHVH 4.4
vdsm-4.40.13-1.el8ev.x86_64
vdsm-gluster-4.40.13-1.el8ev.x86_64
python3-ioprocess-1.4.1-1.el8ev.x86_64
ioprocess-1.4.1-1.el8ev.x86_64
glusterfs-6.0-32.el8rhgs.x86_64

How reproducible
----------------
Always

Steps to reproduce
------------------
1. Start with RHHI-V installation from the Web console
2. After Gluster deployment is complete, continue to Hosted Engine deployment
3. Provide values for the Hosted Engine configuration and start HE deployment

Actual results
--------------
Hosted Engine deployment fails, as the HE VM is unable to boot

Expected results
----------------
Hosted Engine VM should boot up successfully

--- Additional comment from RHEL Program Management on 2020-04-13 07:23:47 UTC ---

This bug is automatically being proposed for the RHHI-V 1.8 release of the Red Hat Hyperconverged Infrastructure for Virtualization product, by setting the release flag 'rhiv-1.8' to '?'.

If this bug should be proposed for a different release, please manually change the proposed release flag.

--- Additional comment from RHEL Program Management on 2020-04-13 07:23:47 UTC ---

This bug report has Keywords: Regression or TestBlocker. Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.
--- Additional comment from SATHEESARAN on 2020-04-13 07:27:31 UTC ---

Adding Internal_whiteboard value as rhv-4.4.0

--- Additional comment from SATHEESARAN on 2020-04-13 07:55:43 UTC ---

--- Additional comment from SATHEESARAN on 2020-04-13 07:58:30 UTC ---

Gluster volume info
-------------------
[root@rhsqa-grafton7 ~]# gluster volume info engine

Volume Name: engine
Type: Replicate
Volume ID: 2d3cabed-49ba-4064-b56e-6db56e9a2112
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: rhsqa-grafton7.lab.eng.blr.redhat.com:/gluster_bricks/engine/engine
Brick2: rhsqa-grafton8.lab.eng.blr.redhat.com:/gluster_bricks/engine/engine
Brick3: rhsqa-grafton9.lab.eng.blr.redhat.com:/gluster_bricks/engine/engine
Options Reconfigured:
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
performance.client-io-threads: on
nfs.disable: on
storage.fips-mode-rchecksum: on
transport.address-family: inet
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.low-prio-threads: 32
network.remote-dio: off
cluster.eager-lock: enable
cluster.quorum-type: auto
cluster.server-quorum-type: server
cluster.data-self-heal-algorithm: full
cluster.locking-scheme: granular
cluster.shd-max-threads: 8
cluster.shd-wait-qlength: 10000
features.shard: on
user.cifs: off
cluster.choose-local: off
client.event-threads: 4
server.event-threads: 4
storage.owner-uid: 36
storage.owner-gid: 36
network.ping-timeout: 30
performance.strict-o-direct: on
cluster.granular-entry-heal: enable

[root@rhsqa-grafton7 ~]# gluster volume status engine
Status of volume: engine
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick rhsqa-grafton7.lab.eng.blr.redhat.com
:/gluster_bricks/engine/engine              49152     0          Y       106846
Brick rhsqa-grafton8.lab.eng.blr.redhat.com
:/gluster_bricks/engine/engine              49152     0          Y       103352
Brick rhsqa-grafton9.lab.eng.blr.redhat.com
:/gluster_bricks/engine/engine              49152     0          Y       105014
Self-heal Daemon on localhost               N/A       N/A        Y       245951
Self-heal Daemon on rhsqa-grafton9.lab.eng.
blr.redhat.com                              N/A       N/A        Y       122547
Self-heal Daemon on rhsqa-grafton8.lab.eng.
blr.redhat.com                              N/A       N/A        Y       122592

Task Status of Volume engine
------------------------------------------------------------------------------
There are no active volume tasks

--- Additional comment from SATHEESARAN on 2020-04-13 08:01:33 UTC ---

--- Additional comment from SATHEESARAN on 2020-04-13 08:03:08 UTC ---

Captured the gluster volume profile of the engine volume during the setup.

--- Additional comment from SATHEESARAN on 2020-04-13 08:07:50 UTC ---

--- Additional comment from SATHEESARAN on 2020-04-13 08:08:38 UTC ---

[root@rhsqa-grafton7 ~]# hosted-engine --vm-status

--== Host rhsqa-grafton7-nic2.lab.eng.blr.redhat.com (id: 1) status ==--

Host ID                            : 1
Host timestamp                     : 85763
Score                              : 3400
Engine status                      : {"vm": "up", "health": "bad", "detail": "Up", "reason": "failed liveliness check"}
Hostname                           : rhsqa-grafton7-nic2.lab.eng.blr.redhat.com
Local maintenance                  : False
stopped                            : False
crc32                              : 3f4e8b05
conf_on_shared_storage             : True
local_conf_timestamp               : 85763
Status up-to-date                  : True
Extra metadata (valid at timestamp):
	metadata_parse_version=1
	metadata_feature_version=1
	timestamp=85763 (Mon Apr 13 08:08:17 2020)
	host-id=1
	score=3400
	vm_conf_refresh_time=85763 (Mon Apr 13 08:08:17 2020)
	conf_on_shared_storage=True
	maintenance=False
	state=EngineStarting
	stopped=False
Here are the logs:

ovirt-hosted-engine-setup.log:
https://bugzilla.redhat.com/attachment.cgi?id=1678399

Screenshot showing the VM failing to boot:
https://bugzilla.redhat.com/attachment.cgi?id=1678397

Gluster FUSE mount log:
https://bugzilla.redhat.com/attachment.cgi?id=1678396
Nir, I am seeing this problem where the HE VM fails to boot up:
https://bugzilla.redhat.com/attachment.cgi?id=1678397

Do you have any insights?
Hi Nir,

As I looked into the strace logs, all the reads that happened on the VM image file use a 1-byte block size only; there are no subsequent 512-byte block reads.

[root@rhsqa-grafton7-nic2 ~]# grep 7c16b3ea /brick_logs/* | grep pread
/brick_logs/brick-strace.log.478300:pread64(19</gluster_bricks/engine/engine/.glusterfs/7c/16/7c16b3ea-563f-40cf-a847-5f3807fd6842>, 0x7f4facfff000, 1, 0) = -1 EINVAL (Invalid argument)
/brick_logs/brick-strace.log.588522:pread64(19</gluster_bricks/engine/engine/.glusterfs/7c/16/7c16b3ea-563f-40cf-a847-5f3807fd6842>, 0x7f4facfff000, 1, 0) = -1 EINVAL (Invalid argument)
/brick_logs/brick-strace.log.588522:pread64(19</gluster_bricks/engine/engine/.glusterfs/7c/16/7c16b3ea-563f-40cf-a847-5f3807fd6842>, 0x7f4facfff000, 1, 0) = -1 EINVAL (Invalid argument)
/brick_logs/brick-strace.log.588523:pread64(19</gluster_bricks/engine/engine/.glusterfs/7c/16/7c16b3ea-563f-40cf-a847-5f3807fd6842>, 0x7f4facfff000, 1, 0) = -1 EINVAL (Invalid argument)
/brick_logs/brick-strace.log.588523:pread64(19</gluster_bricks/engine/engine/.glusterfs/7c/16/7c16b3ea-563f-40cf-a847-5f3807fd6842>, 0x7f4facfff000, 1, 0) = -1 EINVAL (Invalid argument)
/brick_logs/brick-strace.log.588523:pread64(19</gluster_bricks/engine/engine/.glusterfs/7c/16/7c16b3ea-563f-40cf-a847-5f3807fd6842>, 0x7f4facfff000, 1, 0) = -1 EINVAL (Invalid argument)
/brick_logs/brick-strace.log.588523:pread64(19</gluster_bricks/engine/engine/.glusterfs/7c/16/7c16b3ea-563f-40cf-a847-5f3807fd6842>, 0x7f4facfff000, 1, 0) = -1 EINVAL (Invalid argument)
/brick_logs/brick-strace.log.588523:pread64(19</gluster_bricks/engine/engine/.glusterfs/7c/16/7c16b3ea-563f-40cf-a847-5f3807fd6842>, 0x7f4facfff000, 1, 0) = -1 EINVAL (Invalid argument)
/brick_logs/brick-strace.log.588530:pread64(19</gluster_bricks/engine/engine/.glusterfs/7c/16/7c16b3ea-563f-40cf-a847-5f3807fd6842>, 0x7f4facfff000, 1, 0) = -1 EINVAL (Invalid argument)
/brick_logs/brick-strace.log.588530:pread64(19</gluster_bricks/engine/engine/.glusterfs/7c/16/7c16b3ea-563f-40cf-a847-5f3807fd6842>, 0x7f4facfff000, 1, 0) = -1 EINVAL (Invalid argument)
/brick_logs/brick-strace.log.595895:pread64(19</gluster_bricks/engine/engine/.glusterfs/7c/16/7c16b3ea-563f-40cf-a847-5f3807fd6842>, 0x7f4facfff000, 1, 0) = -1 EINVAL (Invalid argument)
/brick_logs/brick-strace.log.595895:pread64(19</gluster_bricks/engine/engine/.glusterfs/7c/16/7c16b3ea-563f-40cf-a847-5f3807fd6842>, 0x7f4facfff000, 1, 0) = -1 EINVAL (Invalid argument)
/brick_logs/brick-strace.log.595896:pread64(19</gluster_bricks/engine/engine/.glusterfs/7c/16/7c16b3ea-563f-40cf-a847-5f3807fd6842>, 0x7f4facfff000, 1, 0) = -1 EINVAL (Invalid argument)
/brick_logs/brick-strace.log.595897:pread64(20</gluster_bricks/engine/engine/.glusterfs/7c/16/7c16b3ea-563f-40cf-a847-5f3807fd6842>, 0x7f4facfff000, 1, 0) = -1 EINVAL (Invalid argument)
/brick_logs/brick-strace.log.596859:pread64(20</gluster_bricks/engine/engine/.glusterfs/7c/16/7c16b3ea-563f-40cf-a847-5f3807fd6842>, 0x7f4facfff000, 1, 0) = -1 EINVAL (Invalid argument)
/brick_logs/brick-strace.log.596860:pread64(19</gluster_bricks/engine/engine/.glusterfs/7c/16/7c16b3ea-563f-40cf-a847-5f3807fd6842>, 0x7f4facfff000, 1, 0) = -1 EINVAL (Invalid argument)
Can you please attach full sos report from the host?
Created attachment 1678662 [details] sosreport from the HE Node
(In reply to Sandro Bonazzola from comment #4)
> Can you please attach full sos report from the host?

Hello Sandro,

I have attached the sosreport of the RHVH node from which the HE deployment was initiated.
What is the block size of the storage, and what is it on the guest? If you use 4k on the storage, you must emulate 512 on the guest, otherwise the VM won't boot; see e.g.
https://docs.google.com/presentation/d/1ClLMZ4XAb8CPhYOw6mNpv0va5JWcAgpyFbJn-Bml_aY/edit#slide=id.g6dfd8a41d6_2_13
(In reply to Vojtech Juranek from comment #7)
> what is the block size of the storage and what on the guest? If you use 4k
> on the storage, you must emulate 512 on the guest, otherwise VM won't boot

Thanks Vojtech. The block size of the underlying bricks of the engine volume is 512B:

[root@rhsqa-grafton7-nic2 ~]# blockdev --getss /dev/mapper/gluster_vg_sdb-gluster_lv_engine
512

[root@rhsqa-grafton8-nic2 ~]# blockdev --getss /dev/mapper/gluster_vg_sdb-gluster_lv_engine
512

[root@rhsqa-grafton9-nic2 ~]# blockdev --getss /dev/mapper/gluster_vg_sdb-gluster_lv_engine
512
Checking ovirt-hosted-engine-setup.log, I see the following mount and, based on the name (gluster_vg_sdb-gluster_lv_engine), I guess this is where the engine VM is located. Is this correct?

{
    "block_available": 25942060,
    "block_size": 4096,
    "block_total": 26200576,
    "block_used": 258516,
    "device": "/dev/mapper/gluster_vg_sdb-gluster_lv_engine",
    "fstype": "xfs",
    "inode_available": 52424517,
    "inode_total": 52426752,
    "inode_used": 2235,
    "mount": "/gluster_bricks/engine",
    "options": "rw,seclabel,noatime,nodiratime,attr2,inode64,logbsize=256k,sunit=512,swidth=5120,noquota",
    "size_available": 106258677760,
    "size_total": 107317559296,
    "uuid": "98012f56-2657-4bb9-8942-e5c582abdc2b"
}

If this is correct, it seems that the block size there is 4k ("block_size": 4096).

Could you please check this?
(In reply to Vojtech Juranek from comment #9)
> checking ovirt-hosted-engine-setup.log, I see following mount and based on
> the name (gluster_vg_sdb-gluster_lv_engine), I guess this where engine VM is
> located. Is this correct?
>
> If it's correct, it seem that the block size there is 4k ("block_size":
> 4096,).
>
> Could you please check this?

Hello Vojtech,

I see that the XFS bricks are 512B block-sized only; I double-checked. But initially, during local HE VM (HostedEngineLocal) creation, the corresponding logs say block_size: 4096.

However, if you check the create_target_storage_domain log:

<snip>
[root@rhsqa-grafton7-nic2 ovirt-hosted-engine-setup]# grep -i block_size ./* | grep create_target
./ovirt-hosted-engine-setup-ansible-create_target_vm-2020313123743-7ipn23.log:            "block_size": 512,
./ovirt-hosted-engine-setup-ansible-create_target_vm-2020313123743-7ipn23.log:            "block_size": 512,
./ovirt-hosted-engine-setup-ansible-create_target_vm-2020313123743-7ipn23.log:            "block_size": 512,
./ovirt-hosted-engine-setup-ansible-create_target_vm-2020313161913-fmlqih.log:            "block_size": 512,
./ovirt-hosted-engine-setup-ansible-create_target_vm-2020313161913-fmlqih.log:            "block_size": 512,
./ovirt-hosted-engine-setup-ansible-create_target_vm-2020313161913-fmlqih.log:            "block_size": 512,
</snip>

Now I have tried one more test with a 4K gluster volume.
I still see the same issue: the HE VM is unable to boot after migrating to the gluster storage domain. I will attach the logs for that soon.
Created attachment 1678994 [details] sosreport testing with 4k gluster storage
Thanks for double-checking; there is no need to test with 4k gluster and post the logs. I wanted to be sure this is not an issue related to booting a VM from 4k storage, as that seemed to me the most probable root cause.
I see a gluster-related exception in the supervdsm log (the errors in the vdsm log basically mean that vdsm cannot read from gluster); the full exception is below. It fails when calling the glfs_new() function from libgfapi, so it seems to me more like a gluster issue than a vdsm issue. Do you have any idea why glfs_new() can fail, or what could be the root cause?

MainProcess|jsonrpc/2::ERROR::2020-04-14 09:07:35,085::supervdsm_server::97::SuperVdsm.ServerCallback::(wrapper) Error in volumeStatvfs
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/gluster/gfapi.py", line 234, in volumeStatvfs
    out = commands.run(command, env=env)
  File "/usr/lib/python3.6/site-packages/vdsm/common/commands.py", line 101, in run
    raise cmdutils.Error(args, p.returncode, out, err)
vdsm.common.cmdutils.Error: Command ['/usr/bin/python3', '-m', 'vdsm.gluster.gfapi', '-v', 'data', '-p', '24007', '-H', 'localhost', '-t', 'tcp', '-c', 'statvfs'] failed with rc=1 out=b'' err=b'Traceback (most recent call last):\n  File "/usr/lib64/python3.6/runpy.py", line 193, in _run_module_as_main\n    "__main__", mod_spec)\n  File "/usr/lib64/python3.6/runpy.py", line 85, in _run_code\n    exec(code, run_globals)\n  File "/usr/lib/python3.6/site-packages/vdsm/gluster/gfapi.py", line 296, in <module>\n    int(args.port), args.protocol)\n  File "/usr/lib/python3.6/site-packages/vdsm/gluster/gfapi.py", line 108, in volumeStatvfsGet\n    fs = glfsInit(volumeId, host, port, protocol)\n  File "/usr/lib/python3.6/site-packages/vdsm/gluster/gfapi.py", line 66, in glfsInit\n    fs = _glfs_new(volumeId)\nctypes.ArgumentError: argument 1: <class \'TypeError\'>: wrong type\n'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/supervdsm_server.py", line 95, in wrapper
    res = func(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/gluster/gfapi.py", line 236, in volumeStatvfs
    raise ge.GlfsStatvfsException(e.rc, [e.err])
vdsm.gluster.exception.GlfsStatvfsException: Failed to get Gluster volume Size info: rc=1 out=[b'Traceback (most recent call last):\n  File "/usr/lib64/python3.6/runpy.py", line 193, in _run_module_as_main\n    "__main__", mod_spec)\n  File "/usr/lib64/python3.6/runpy.py", line 85, in _run_code\n    exec(code, run_globals)\n  File "/usr/lib/python3.6/site-packages/vdsm/gluster/gfapi.py", line 296, in <module>\n    int(args.port), args.protocol)\n  File "/usr/lib/python3.6/site-packages/vdsm/gluster/gfapi.py", line 108, in volumeStatvfsGet\n    fs = glfsInit(volumeId, host, port, protocol)\n  File "/usr/lib/python3.6/site-packages/vdsm/gluster/gfapi.py", line 66, in glfsInit\n    fs = _glfs_new(volumeId)\nctypes.ArgumentError: argument 1: <class \'TypeError\'>: wrong type\n'] err=()
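For what it's worth, the "argument 1: TypeError: wrong type" ArgumentError above is the classic Python 3 ctypes pitfall: a foreign function declared with argtypes including c_char_p accepts bytes but rejects str (Python 2 code could pass a str and it worked). A minimal sketch of that failure mode, using libc's strlen as a stand-in for glfs_new since libgfapi may not be installed locally (an assumption; this is not vdsm's actual code):

```python
import ctypes

# Load libc via the running process; strlen stands in for glfs_new,
# which is also declared to take a single c_char_p argument.
libc = ctypes.CDLL(None)
strlen = libc.strlen
strlen.argtypes = [ctypes.c_char_p]
strlen.restype = ctypes.c_size_t

try:
    # Passing a Python 3 str, as gfapi.py does with volumeId, fails
    # exactly like _glfs_new(volumeId) in the traceback above.
    strlen("engine")
except ctypes.ArgumentError as e:
    print("same failure mode:", e)

# Passing bytes satisfies c_char_p.
print(strlen(b"engine"))  # → 6
```

If this is indeed what gfapi.py hits, encoding the volume name to bytes before the call would be the fix.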
(In reply to Vojtech Juranek from comment #13)
> It fails when calling the glfs_new() function from libgfapi, so it seems to
> me more like a gluster issue than a vdsm issue. Do you have any idea why
> glfs_new() can fail, or what could be the root cause?

Vojtech, thanks once again!

I am not sure about these errors, but looking at them, it looks like vdsm is making use of the 'gfapi' interface for gluster. I will loop in the gluster and RHHI developers to understand what's going wrong here.

@Gobinda, do you know why VDSM is trying to make use of gfapi? Or anyone else who knows about it.
I think gfapi is used for accessing the SD, but I am not sure. @Kaustav, can you please check this as a priority?
Hi Vojtech,

The above error you mentioned (File "/usr/lib/python3.6/site-packages/vdsm/gluster/gfapi.py", line 234, in volumeStatvfs) is from when we use gfapi to find the size of a volume; these errors don't have anything to do with reading from the storage domain, IMO.

But I am seeing the error below frequently in the vdsm log. Is there anything to look into for this error?

2020-04-15 09:44:33,838+0000 ERROR (qgapoller/1) [virt.periodic.Operation] <bound method QemuGuestAgentPoller._poller of <vdsm.virt.qemuguestagent.QemuGuestAgentPoller object at 0x7f7e5c2c5a20>> operation failed (periodic:188)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/virt/periodic.py", line 186, in __call__
    self._func()
  File "/usr/lib/python3.6/site-packages/vdsm/virt/qemuguestagent.py", line 316, in _poller
    self._qga_capability_check(vm_obj)
  File "/usr/lib/python3.6/site-packages/vdsm/virt/qemuguestagent.py", line 539, in _qga_capability_check
    ret = self.call_qga_command(vm, _QEMU_GUEST_INFO_COMMAND)
  File "/usr/lib/python3.6/site-packages/vdsm/virt/qemuguestagent.py", line 265, in call_qga_command
    ret = vm.qemu_agent_command(cmd, _COMMAND_TIMEOUT, 0)
  File "/usr/lib/python3.6/site-packages/vdsm/virt/vm.py", line 5606, in qemu_agent_command
    self._dom, command, timeout, flags)
  File "/usr/lib64/python3.6/site-packages/libvirt_qemu.py", line 39, in qemuAgentCommand
    ret = libvirtmod_qemu.virDomainQemuAgentCommand(domain._o, cmd, timeout, flags)
  File "/usr/lib/python3.6/site-packages/vdsm/virt/virdomain.py", line 50, in __getattr__
    % self.vmid)
vdsm.virt.virdomain.NotConnectedError: VM '41a0d36b-214a-4702-b068-99c12c486c0d' was not defined yet or was undefined

2020-04-15 08:20:08,213+0000 ERROR (qgapoller/3) [virt.periodic.Operation] <bound method QemuGuestAgentPoller._poller of <vdsm.virt.qemuguestagent.QemuGuestAgentPoller object at 0x7f7e5c2c5a20>> operation failed (periodic:188)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/virt/periodic.py", line 186, in __call__
    self._func()
  File "/usr/lib/python3.6/site-packages/vdsm/virt/qemuguestagent.py", line 316, in _poller
    self._qga_capability_check(vm_obj)
  File "/usr/lib/python3.6/site-packages/vdsm/virt/qemuguestagent.py", line 539, in _qga_capability_check
    ret = self.call_qga_command(vm, _QEMU_GUEST_INFO_COMMAND)
  File "/usr/lib/python3.6/site-packages/vdsm/virt/qemuguestagent.py", line 265, in call_qga_command
    ret = vm.qemu_agent_command(cmd, _COMMAND_TIMEOUT, 0)
  File "/usr/lib/python3.6/site-packages/vdsm/virt/vm.py", line 5606, in qemu_agent_command
    self._dom, command, timeout, flags)
  File "/usr/lib64/python3.6/site-packages/libvirt_qemu.py", line 39, in qemuAgentCommand
    ret = libvirtmod_qemu.virDomainQemuAgentCommand(domain._o, cmd, timeout, flags)
  File "/usr/lib/python3.6/site-packages/vdsm/virt/virdomain.py", line 50, in __getattr__
    % self.vmid)
vdsm.virt.virdomain.NotConnectedError: VM '41a0d36b-214a-4702-b068-99c12c486c0d' was not defined yet or was undefined
(In reply to SATHEESARAN from comment #3)
> As I looked in to the strace logs, all the reads that happened on the VM
> image file, is using 1 byte block size only, there is no subsequent 512 byte
> block reads.

Are you looking at all calls or only failed calls?

The expected flow is:

    read 1 byte -> fails with EINVAL
    read 512 bytes -> ok

This is expected each time qemu opens the image. But I don't think this is the issue; it looks like the VM disk is corrupted.

I don't think this is a vdsm or storage bug. This is a hosted engine setup bug. The hosted engine maintainers should debug the installation and make sure the VM disk is copied correctly to storage.

To verify that the disk was copied correctly, we can run this after the copy:

    qemu-img compare src.img dst.img

It would be useful to add this step in a special debug mode so that debugging deployment issues is easier.

If the image was copied correctly but the VM will not boot, this is a virt issue.
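The probe flow described above (1-byte read fails with EINVAL, then a read of the logical block size succeeds) can be sketched in Python with O_DIRECT. This is only an illustration of the mechanism, not vdsm's actual detection code; the function name probe_block_size is mine:

```python
import errno
import mmap
import os

def probe_block_size(path):
    """Open with O_DIRECT and try reads of increasing size; the first
    size that does not fail with EINVAL is the logical block size."""
    if not hasattr(os, "O_DIRECT"):
        return None  # platform without O_DIRECT
    try:
        fd = os.open(path, os.O_RDONLY | os.O_DIRECT)
    except OSError:
        return None  # filesystem does not support O_DIRECT (e.g. tmpfs)
    try:
        # O_DIRECT needs an aligned buffer; mmap memory is page-aligned.
        buf = mmap.mmap(-1, mmap.PAGESIZE)
        for size in (1, 512, 4096):
            try:
                os.preadv(fd, [memoryview(buf)[:size]], 0)
                return size
            except OSError as e:
                if e.errno != errno.EINVAL:
                    raise
        return None
    finally:
        os.close(fd)
```

On the 512B bricks shown earlier this should return 512 after the 1-byte read fails, which matches the `pread64(..., 1, 0) = -1 EINVAL` lines in the strace output; the missing follow-up 512-byte read is what makes the trace suspicious.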
(In reply to Nir Soffer from comment #17)
> Are you looking at all calls or only failed calls?
>
> The expected flow is:
>
>     read 1 byte -> fails with EINVAL
>     read 512 bytes -> ok

Hello Nir,

I was grepping for all preads on one brick for the GFID of the image file used by the HE VM. The second 512-byte read never happens. I can also check by collecting strace output from all the bricks.

> To verify that the disk was copied correctly, we can run this after the copy:
>
>     qemu-img compare src.img dst.img

I checked this; both images are the same:

[root@ ~]# qemu-img compare /var/tmp/localvmozx2vt8k/images/d4945538-65e8-400f-bda9-0e95f36278e3/532a5b0a-d368-47de-a0b7-ce8fc24042a6 /rhev/data-center/mnt/glusterSD/rhsqa-grafton7.lab.eng.blr.redhat.com\:_engine/7c084fc6-d2d0-4c92-846c-aa22c9de2c5f/images/eb7e672c-13d1-442a-abe6-3376d38c3cbe/245616af-20bc-4bf7-9bbd-9b0aaa5ad38f
Images are identical.
There is a discrepancy in the block_size reported in the logs. The actual gluster brick for engine is a 512B block-size device, but the logs report it as a 4096B block-size device.

<snip_from_create_target_vm.log>
{
    "block_available": 26008004,
    "block_size": 4096,
    "block_total": 26200576,
    "block_used": 192572,
    "device": "/dev/mapper/gluster_vg_sdb-gluster_lv_engine",
    "fstype": "xfs",
    "inode_available": 52423094,
    "inode_total": 52426752,
    "inode_used": 3658,
    "mount": "/gluster_bricks/engine",
    "options": "rw,seclabel,noatime,nodiratime,attr2,inode64,logbsize=256k,sunit=512,swidth=5120,noquota",
    "size_available": 106528784384,
    "size_total": 107317559296,
    "uuid": "34d4cd11-2347-441a-b4cc-fbabc5fb7e1c"
},
</snip_from_create_target_vm.log>

[root@ ~]# blockdev --getss /dev/mapper/gluster_vg_sdb-gluster_lv_engine
512

@Vojta, do you know why?
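One plausible explanation for the discrepancy (an assumption on my part, not confirmed in this thread): the "block_size": 4096 in the Ansible mount facts is the filesystem's statvfs() f_bsize (the preferred I/O block size, 4096 for a default XFS format), while blockdev --getss reports the device's logical sector size. The two are independent values, so both outputs can be correct at once. A quick sketch of the two figures:

```python
import glob
import os

# f_bsize: the filesystem block size, which is what the Ansible mount
# facts appear to report as "block_size".
st = os.statvfs("/")
print("statvfs f_bsize:", st.f_bsize)  # typically 4096 on xfs/ext4

# Logical sector size per block device, the figure blockdev --getss
# shows; exposed by the kernel under sysfs.
for path in sorted(glob.glob("/sys/block/*/queue/logical_block_size")):
    with open(path) as f:
        print(path.split("/")[3], f.read().strip())
```

If that is right, the 4096 in the setup log is harmless and the relevant value for the VM is the 512 that vdsm detected.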
Unfortunately, no idea; maybe Simone knows. From the storage point of view everything looks good, and the storage block size really is 512:

[root@rhsqa-grafton7-nic2 ~]# dd if=/dev/zero of=/gluster_bricks/engine/test oflag=direct bs=512 count=1
1+0 records in
1+0 records out
512 bytes copied, 0.000196591 s, 2.6 MB/s

The storage domain was created with block size autodetection:

2020-04-21 06:40:33,941+0000 INFO (jsonrpc/7) [vdsm.api] START createStorageDomain(storageType=7, sdUUID='40ec7af8-085b-47fc-a5d4-2bee467a2da6', domainName='hosted_storage', typeSpecificArg='rhsqa-grafton7.lab.eng.blr.redhat.com:/engine', domClass=1, domVersion='5', block_size=0, max_hosts=250, options=None) from=::ffff:192.168.222.150,43316, flow_id=371bbc8d, task_id=5b4349e4-7181-48bc-b88e-baa2bc1b86cf (api:48)

We don't have debug logs, but vdsm detects the block size correctly, as the SD metadata has 512 as the block size:

[root@rhsqa-grafton7-nic2 ~]# cat /rhev/data-center/mnt/glusterSD/rhsqa-grafton7.lab.eng.blr.redhat.com:_engine/40ec7af8-085b-47fc-a5d4-2bee467a2da6/dom_md/metadata
ALIGNMENT=1048576
BLOCK_SIZE=512
CLASS=Data
DESCRIPTION=hosted_storage
IOOPTIMEOUTSEC=10
LEASERETRIES=3
LEASETIMESEC=60
LOCKPOLICY=
LOCKRENEWALINTERVALSEC=5
MASTER_VERSION=1
POOL_DESCRIPTION=Default
POOL_DOMAINS=40ec7af8-085b-47fc-a5d4-2bee467a2da6:Active
POOL_SPM_ID=-1
POOL_SPM_LVER=-1
POOL_UUID=40bf7c46-832d-11ea-8993-004755204901
REMOTE_PATH=rhsqa-grafton7.lab.eng.blr.redhat.com:/engine
ROLE=Master
SDUUID=40ec7af8-085b-47fc-a5d4-2bee467a2da6
TYPE=GLUSTERFS
VERSION=5
_SHA_CKSUM=64b5a6020db8b9095a7052e3403d7b327dfcf6f8

So it seems to me more like a virt issue. I don't see any other issue in the vdsm log.
Right after creating HE VM I see this warning in vdsm log, but not sure if it's a real issue (probably not, if so, looks again like virt issue): 2020-04-21 07:04:24,654+0000 WARN (vm/8672a439) [virt.vm] (vmId='8672a439-ca76-4ad1-998f-8cdd96d54893') Cannot find device alias for _conf:{'type': 'lease', 'device': 'lease', 'lease_id': '2c9a272d-c57e-48a2-b508-3519098d51e9', 'sd_id': '40ec7af8-085b-47fc-a5d4-2bee467a2da6', 'path': 'LEASE-PATH:2c9a272d-c57e-48a2-b508-3519098d51e9:40ec7af8-085b-47fc-a5d4-2bee467a2da6', 'offset': 'LEASE-OFFSET:2c9a272d-c57e-48a2-b508-3519098d51e9:40ec7af8-085b-47fc-a5d4-2bee467a2da6'} _deviceXML:None alias:None config:<bound method Base.config of <lease.Device sd_id=40ec7af8-085b-47fc-a5d4-2bee467a2da6, lease_id=2c9a272d-c57e-48a2-b508-3519098d51e9, path=LEASE-PATH:2c9a272d-c57e-48a2-b508-3519098d51e9:40ec7af8-085b-47fc-a5d4-2bee467a2da6, offset=LEASE-OFFSET:2c9a272d-c57e-48a2-b508-3519098d51e9:40ec7af8-085b-47fc-a5d4-2bee467a2da6 at 0x7fbfe4113a98>> createXmlElem:<bound method Device.createXmlElem of <lease.Device sd_id=40ec7af8-085b-47fc-a5d4-2bee467a2da6, lease_id=2c9a272d-c57e-48a2-b508-3519098d51e9, path=LEASE-PATH:2c9a272d-c57e-48a2-b508-3519098d51e9:40ec7af8-085b-47fc-a5d4-2bee467a2da6, offset=LEASE-OFFSET:2c9a272d-c57e-48a2-b508-3519098d51e9:40ec7af8-085b-47fc-a5d4-2bee467a2da6 at 0x7fbfe4113a98>> custom:{} device:lease deviceId:None deviceType:None from_xml_tree:<bound method Device.from_xml_tree of <class 'vdsm.virt.vmdevices.lease.Device'>> getXML:<bound method Device.getXML of <lease.Device sd_id=40ec7af8-085b-47fc-a5d4-2bee467a2da6, lease_id=2c9a272d-c57e-48a2-b508-3519098d51e9, path=LEASE-PATH:2c9a272d-c57e-48a2-b508-3519098d51e9:40ec7af8-085b-47fc-a5d4-2bee467a2da6, offset=LEASE-OFFSET:2c9a272d-c57e-48a2-b508-3519098d51e9:40ec7af8-085b-47fc-a5d4-2bee467a2da6 at 0x7fbfe4113a98>> get_extra_xmls:<bound method Base.get_extra_xmls of <lease.Device sd_id=40ec7af8-085b-47fc-a5d4-2bee467a2da6, 
lease_id=2c9a272d-c57e-48a2-b508-3519098d51e9, path=LEASE-PATH:2c9a272d-c57e-48a2-b508-3519098d51e9:40ec7af8-085b-47fc-a5d4-2bee467a2da6, offset=LEASE-OFFSET:2c9a272d-c57e-48a2-b508-3519098d51e9:40ec7af8-085b-47fc-a5d4-2bee467a2da6 at 0x7fbfe4113a98>> get_identifying_attrs:<bound method Device.get_identifying_attrs of <class 'vdsm.virt.vmdevices.lease.Device'>> get_metadata:<bound method Base.get_metadata of <lease.Device sd_id=40ec7af8-085b-47fc-a5d4-2bee467a2da6, lease_id=2c9a272d-c57e-48a2-b508-3519098d51e9, path=LEASE-PATH:2c9a272d-c57e-48a2-b508-3519098d51e9:40ec7af8-085b-47fc-a5d4-2bee467a2da6, offset=LEASE-OFFSET:2c9a272d-c57e-48a2-b508-3519098d51e9:40ec7af8-085b-47fc-a5d4-2bee467a2da6 at 0x7fbfe4113a98>> hotunplug_event:<threading.Event object at 0x7fbfc476e6d8> is_hostdevice:False lease_id:2c9a272d-c57e-48a2-b508-3519098d51e9 log:<SimpleLogAdapter virt.vm (INFO)> offset:LEASE-OFFSET:2c9a272d-c57e-48a2-b508-3519098d51e9:40ec7af8-085b-47fc-a5d4-2bee467a2da6 path:LEASE-PATH:2c9a272d-c57e-48a2-b508-3519098d51e9:40ec7af8-085b-47fc-a5d4-2bee467a2da6 sd_id:40ec7af8-085b-47fc-a5d4-2bee467a2da6 setup:<bound method Base.setup of <lease.Device sd_id=40ec7af8-085b-47fc-a5d4-2bee467a2da6, lease_id=2c9a272d-c57e-48a2-b508-3519098d51e9, path=LEASE-PATH:2c9a272d-c57e-48a2-b508-3519098d51e9:40ec7af8-085b-47fc-a5d4-2bee467a2da6, offset=LEASE-OFFSET:2c9a272d-c57e-48a2-b508-3519098d51e9:40ec7af8-085b-47fc-a5d4-2bee467a2da6 at 0x7fbfe4113a98>> specParams:{} teardown:<bound method Base.teardown of <lease.Device sd_id=40ec7af8-085b-47fc-a5d4-2bee467a2da6, lease_id=2c9a272d-c57e-48a2-b508-3519098d51e9, path=LEASE-PATH:2c9a272d-c57e-48a2-b508-3519098d51e9:40ec7af8-085b-47fc-a5d4-2bee467a2da6, offset=LEASE-OFFSET:2c9a272d-c57e-48a2-b508-3519098d51e9:40ec7af8-085b-47fc-a5d4-2bee467a2da6 at 0x7fbfe4113a98>> type:lease update_device_info:<bound method Device.update_device_info of <class 'vdsm.virt.vmdevices.lease.Device'>> vmid:None (core:329)
Ryan, can someone from Virt have a look?
I'm talking to Sas right now. The VM actually starts, but the disks are not attached (it goes to a grub menu). This is likely not virt. I'll look at the logs, but my suspicion is that it's actually gluster storage, based on the fact that the disk sums are identical after the copy, the VM comes up, etc. Sas has a live environment which I'll look at, but it would be helpful for someone from storage and gluster to also look at it.

Gluster is not healthy:

Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/gluster/cli.py", line 111, in _execGluster
    return commands.run(cmd)
  File "/usr/lib/python3.6/site-packages/vdsm/common/commands.py", line 101, in run
    raise cmdutils.Error(args, p.returncode, out, err)
vdsm.common.cmdutils.Error: Command ['/usr/sbin/gluster', '--mode=script', 'volume', 'geo-replication', 'status', '--xml'] failed with rc=2 out=b'geo-replication command failed\n' err=b''

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/gluster/cli.py", line 1222, in volumeGeoRepStatus
    xmltree = _execGlusterXml(command)
  File "/usr/lib/python3.6/site-packages/vdsm/gluster/cli.py", line 134, in _execGlusterXml
    return _getTree(_execGluster(cmd))
  File "/usr/lib/python3.6/site-packages/vdsm/gluster/cli.py", line 113, in _execGluster
    raise ge.GlusterCmdFailedException(rc=e.rc, err=[e.msg])
vdsm.gluster.exception.GlusterCmdFailedException: Command failed: rc=2 out=() err=['Command {self.cmd} failed with rc={self.rc} out={self.out!r} err={self.err!r}']

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/supervdsm_server.py", line 95, in wrapper
    res = func(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/gluster/cli.py", line 1224, in volumeGeoRepStatus
    raise ge.GlusterGeoRepStatusFailedException(rc=e.rc, err=e.err)
vdsm.gluster.exception.GlusterGeoRepStatusFailedException: Geo Rep status failed: rc=2 out=() err=['Command {self.cmd} failed with rc={self.rc} out={self.out!r} err={self.err!r}']
Looked at the logs. Libvirt XML looks ok, and there are no exceptions in the libvirt logs. The disk is held by the qemu process (ownership is ok), there's just no additional read data. Not virt. Nir, any additional thoughts?
(In reply to Ryan Barry from comment #22)
> Gluster is not healthy:
> [geo-replication status traceback snipped; see comment 22]

This is because no geo-replication session is configured yet, so the command fails and returns rc=2:

[root@ ~]# gluster volume geo-replication status --xml
geo-replication command failed
[root@ ~]# echo $?
2

I will ask RHHI dev to fix this. In any case, that traceback should be a separate bug; it is not really associated with this failure of the HE VM to start on the gluster storage domain.
Created attachment 1680554 [details]
Hosted Engine VM config XML
Created attachment 1681558 [details] HostedEngineLocal XML Definition from retest on 24April2020
Created attachment 1681559 [details] HostedEngine XML Definition from retest on 24April2020
Created attachment 1681560 [details] HostedEngineLocal VM QEMU log from retest on 24April2020
Created attachment 1681561 [details] HostedEngine VM QEMU log from retest on 24April2020
I checked the environment on Apr 24.

First we verified that the image on shared storage is identical to the original image used to bootstrap the VM. We added a new task to the Ansible script copying the image:

    - name: Verify copy of VM disk
      command: qemu-img compare {{ local_vm_disk_path }} {{ he_virtio_disk_path }}
      environment: "{{ he_cmd_lang }}"
      become: true
      become_user: vdsm
      become_method: sudo
      changed_when: true

This runs right after the task that copies the image from the bootstrap image to the gluster mount. According to the Ansible log the check was successful:

Apr 24 14:11:45 rhsqa-grafton7-nic2 platform-python[2483803]: ansible-command Invoked with _raw_params=qemu-img compare /var/tmp/localvm2k8qg0na/images/d4945538-65e8-400f-bda9-0e95f36278e3/532a5b0a-d368-47de-a0b7-ce8fc24042a6 /rhev/data-center/mnt/glusterSD/rhsqa-grafton7.lab.eng.blr.redhat.com:_engine/517783ed-8479-4ccc-b0c2-c161764cfd93/images/106d9b8f-6bb7-4622-9ccf-21cf217bbcea/91a6eff5-ffa2-4a96-8eef-b02a863fddf1 warn=True _uses_shell=False stdin_add_newline=True strip_empty_ends=True argv=None chdir=None executable=None creates=None removes=None stdin=None

So we don't have any issue creating this image.

Next I checked the image while the VM was running (stuck in the grub rescue shell).
# qemu-img info 91a6eff5-ffa2-4a96-8eef-b02a863fddf1
image: 91a6eff5-ffa2-4a96-8eef-b02a863fddf1
file format: raw
virtual size: 58 GiB (62277025792 bytes)
disk size: 3.82 GiB

# qemu-img compare 91a6eff5-ffa2-4a96-8eef-b02a863fddf1 /var/tmp/localvm2k8qg0na/images/d4945538-65e8-400f-bda9-0e95f36278e3/532a5b0a-d368-47de-a0b7-ce8fc24042a6

# cat 91a6eff5-ffa2-4a96-8eef-b02a863fddf1.meta
CAP=62277025792
CTIME=1587737303
DESCRIPTION={"DiskAlias":"he_virtio_disk","DiskDescription":"Hosted-Engine disk"}
DISKTYPE=HEVD
DOMAIN=517783ed-8479-4ccc-b0c2-c161764cfd93
FORMAT=RAW
GEN=0
IMAGE=106d9b8f-6bb7-4622-9ccf-21cf217bbcea
LEGALITY=LEGAL
PUUID=00000000-0000-0000-0000-000000000000
TYPE=SPARSE
VOLTYPE=LEAF
EOF

qemu command line:

-blockdev '{"driver":"file","filename":"/var/run/vdsm/storage/517783ed-8479-4ccc-b0c2-c161764cfd93/106d9b8f-6bb7-4622-9ccf-21cf217bbcea/91a6eff5-ffa2-4a96-8eef-b02a863fddf1","aio":"threads","node-name":"libvirt-1-storage","cache":{"direct":true,"no-flush":false},"auto-read-only":true,"discard":"unmap"}' \
-blockdev '{"node-name":"libvirt-1-format","read-only":false,"cache":{"direct":true,"no-flush":false},"driver":"raw","file":"libvirt-1-storage"}' \

Everything looks normal, but booting from this image fails.

Next I moved hosted engine to global maintenance and shut down the VM:

# hosted-engine --set-maintenance --mode=global
# hosted-engine --vm-poweroff

Then I tested that we can boot from this image using an overlay:

# qemu-img create -f qcow2 -b /rhev/data-center/mnt/glusterSD/rhsqa-grafton7.lab.eng.blr.redhat.com\:_engine/517783ed-8479-4ccc-b0c2-c161764cfd93/images/106d9b8f-6bb7-4622-9ccf-21cf217bbcea/91a6eff5-ffa2-4a96-8eef-b02a863fddf1 -F raw overlay.qcow2
# /usr/libexec/qemu-kvm -m 4g -nographic -drive file=overlay.qcow2,format=qcow2,cache=none

The VM boots properly.
And that we can boot from a copy of this image:

# mkdir /rhev/data-center/mnt/glusterSD/rhsqa-grafton7.lab.eng.blr.redhat.com\:_engine/517783ed-8479-4ccc-b0c2-c161764cfd93/tmp

# qemu-img convert -p -f raw -O raw -t none -T none /rhev/data-center/mnt/glusterSD/rhsqa-grafton7.lab.eng.blr.redhat.com\:_engine/517783ed-8479-4ccc-b0c2-c161764cfd93/images/106d9b8f-6bb7-4622-9ccf-21cf217bbcea/91a6eff5-ffa2-4a96-8eef-b02a863fddf1 /rhev/data-center/mnt/glusterSD/rhsqa-grafton7.lab.eng.blr.redhat.com\:_engine/517783ed-8479-4ccc-b0c2-c161764cfd93/tmp/copy-of-91a6eff5-ffa2-4a96-8eef-b02a863fddf1
    (100.00/100%)

# ls -lhs /rhev/data-center/mnt/glusterSD/rhsqa-grafton7.lab.eng.blr.redhat.com\:_engine/517783ed-8479-4ccc-b0c2-c161764cfd93/tmp/copy-of-91a6eff5-ffa2-4a96-8eef-b02a863fddf1
4.3G -rw-r--r--. 1 root root 58G Apr 24 16:06 /rhev/data-center/mnt/glusterSD/rhsqa-grafton7.lab.eng.blr.redhat.com:_engine/517783ed-8479-4ccc-b0c2-c161764cfd93/tmp/copy-of-91a6eff5-ffa2-4a96-8eef-b02a863fddf1

# /usr/libexec/qemu-kvm -m 4g -nographic -drive file=/rhev/data-center/mnt/glusterSD/rhsqa-grafton7.lab.eng.blr.redhat.com:_engine/517783ed-8479-4ccc-b0c2-c161764cfd93/tmp/copy-of-91a6eff5-ffa2-4a96-8eef-b02a863fddf1,format=raw,cache=none

The VM boots properly.

Then I tried to build a VM XML using virt-install and add stuff from the original RHV XML. Finally I found that boot fails once I add <seclabel> to the disk.
This is the difference between an XML that boots and one that does not:

# diff -u reproduce-good.xml reproduce-bad.xml
--- reproduce-good.xml	2020-04-24 22:06:58.197707718 +0000
+++ reproduce-bad.xml	2020-04-24 22:10:40.979171859 +0000
@@ -70,6 +70,7 @@
     <disk type="file" device="disk" snapshot='no'>
       <driver name='qemu' type='raw' cache='none' error_policy='stop' io='threads' iothread='1'/>
       <source file="/rhev/data-center/mnt/glusterSD/rhsqa-grafton7.lab.eng.blr.redhat.com:_engine/517783ed-8479-4ccc-b0c2-c161764cfd93/tmp/copy-of-91a6eff5-ffa2-4a96-8eef-b02a863fddf1">
+        <seclabel model='dac' relabel='no'/>
       </source>
       <target dev="vda" bus="virtio"/>
     </disk>

Here is how I tested the XMLs. I repeated this test today (Apr 27) to make sure ownership and permissions are correct.

# ls -lh images
lrwxrwxrwx. 1 root root 119 Apr 27 17:22 images -> /rhev/data-center/mnt/glusterSD/rhsqa-grafton7.lab.eng.blr.redhat.com:_engine/517783ed-8479-4ccc-b0c2-c161764cfd93/tmp/

Before testing I changed ownership and permissions to match what RHV is using, and ran restorecon to get the same SELinux labels as other images under /rhev/data-center.../images/...

# ls -lhZ images/
total 5.9G
-rw-rw-r--. 1 vdsm kvm system_u:object_r:fusefs_t:s0 58G Apr 24 22:08 copy-of-91a6eff5-ffa2-4a96-8eef-b02a863fddf1

virsh # define reproduce-good.xml
Domain reproduce defined from reproduce-good.xml

virsh # start reproduce
Domain reproduce started

# virsh console reproduce --safe
...
Red Hat Enterprise Linux 8.2 (Ootpa)
Kernel 4.18.0-193.el8.x86_64 on an x86_64

Web console: https://hostedenginesm3.lab.eng.blr.redhat.com:9090/

hostedenginesm3 login:

virsh # destroy reproduce
Domain reproduce destroyed

So this XML, without <seclabel>, works. Testing that the images are still correct:

# ls -lhZ images/
total 5.9G
-rw-rw-r--. 1 vdsm kvm system_u:object_r:fusefs_t:s0 58G Apr 27 17:32 copy-of-91a6eff5-ffa2-4a96-8eef-b02a863fddf1

Now testing the bad XML, with <seclabel>, the same way RHV uses it:

virsh # define reproduce-bad.xml
Domain reproduce defined from reproduce-bad.xml

virsh # start reproduce
Domain reproduce started

# virsh console reproduce --safe
...
grub rescue> ls
(hd0) (hd0,msdos2) (hd0,msdos1)
grub rescue> ls (hd0,msdos1)/boot/grub
error: ../../grub-core/kern/disk_common.c:47:attempt to read or write outside of disk `hd0'.
error: ../../grub-core/kern/disk_common.c:47:attempt to read or write outside of disk `hd0'.
error: ../../grub-core/fs/fshelp.c:258:file `/boot/grub' not found.

# ls -lhZ images/
total 5.9G
-rw-rw-r--. 1 vdsm kvm system_u:object_r:fusefs_t:s0 58G Apr 27 17:32 copy-of-91a6eff5-ffa2-4a96-8eef-b02a863fddf1

We don't get any AVCs for the image, but we do see:

# ausearch -m avc -ts today
----
time->Mon Apr 27 17:30:44 2020
type=AVC msg=audit(1588008644.879:25273): avc: denied { read write } for pid=4026503 comm="qemu-kvm" path="/dev/mapper/control" dev="devtmpfs" ino=29741 scontext=system_u:system_r:svirt_t:s0:c401,c404 tcontext=system_u:object_r:lvm_control_t:s0 tclass=chr_file permissive=0
----
time->Mon Apr 27 17:38:09 2020
type=AVC msg=audit(1588009089.828:25297): avc: denied { read write } for pid=4029172 comm="qemu-kvm" path="/dev/mapper/control" dev="devtmpfs" ino=29741 scontext=system_u:system_r:svirt_t:s0:c123,c792 tcontext=system_u:object_r:lvm_control_t:s0 tclass=chr_file permissive=0

# getfattr --dump images/copy-of-91a6eff5-ffa2-4a96-8eef-b02a863fddf1
(no output)

# setfattr -n name -v value images/copy-of-91a6eff5-ffa2-4a96-8eef-b02a863fddf1; echo $?
setfattr: images/copy-of-91a6eff5-ffa2-4a96-8eef-b02a863fddf1: Operation not supported
1

Looks like gluster does not support extended attributes (or it is not configured properly).

I also verified that the "bad" XML works if we put the image on XFS. See reproduce-xfs.xml.
Based on this I think this is a libvirt bug, so moving to libvirt.
Versions tested:

# rpm -q libvirt-daemon qemu-kvm
libvirt-daemon-6.0.0-17.module+el8.2.0+6257+0d066c28.x86_64
qemu-kvm-4.2.0-19.module+el8.2.0+6296+6b821950.x86_64
Created attachment 1682250 [details] Screenshot showing failed boot
Created attachment 1682251 [details] VM XML that boots with disk on gluster storage
Created attachment 1682252 [details] VM XML that does not boot with disk on gluster storage
Created attachment 1682256 [details] Same "bad" XML with disk on XFS file system, works
(In reply to Gobinda Das from comment #15)
> I think gfapi used for accessing SD, not sure.
> @kaustav can you please check as a priority?

We should not be using gfapi (and it'd be good to remove it from the vdsm gluster code!)
(In reply to Nir Soffer from comment #32)
> # getfattr --dump images/copy-of-91a6eff5-ffa2-4a96-8eef-b02a863fddf1
> (no output)
>
> # setfattr -n name -v value images/copy-of-91a6eff5-ffa2-4a96-8eef-b02a863fddf1; echo $?
> setfattr: images/copy-of-91a6eff5-ffa2-4a96-8eef-b02a863fddf1:
> Operation not supported
> 1
>
> Looks like gluster does not support extended attributes (or it is not
> configured properly).

Gluster volumes do support XATTRs; the setfattr above likely failed because "name" lacks a valid namespace prefix (user.name works):

[root@]# df -h /rhev/data-center/mnt/glusterSD/rhsqa-grafton7.lab.eng.blr.redhat.com\:_engine/
Filesystem                                     Size  Used Avail Use% Mounted on
rhsqa-grafton7.lab.eng.blr.redhat.com:/engine  100G   13G   88G  13% /rhev/data-center/mnt/glusterSD/rhsqa-grafton7.lab.eng.blr.redhat.com:_engine

[root@ ~]# cd /rhev/data-center/mnt/glusterSD/rhsqa-grafton7.lab.eng.blr.redhat.com\:_engine/517783ed-8479-4ccc-b0c2-c161764cfd93/tmp/
[root@ ~]# touch file
[root@ ~]# setfattr -n user.name -v value file
[root@ ~]# getfattr -d -m. file
# file: file
security.selinux="system_u:object_r:fusefs_t:s0"
user.name="value"
Nir, Satheesaran,

in an interactive debugging session we were able to work around this by setting remember_owner=0 in qemu.conf. However, it's unclear to me why this feature would interfere with guest booting. True, the feature has something to do with seclabels (but mostly when restoring them, not when starting the VM); the problem is that in the debugging session we saw the disk with the same seclabels regardless of the remember_owner setting.

Can either of you attach debug logs please?

BTW: Those AVCs are related to something else (see bug 1823976).
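For reference, the workaround mentioned above would be applied like this (a sketch; on RHEL the file is /etc/libvirt/qemu.conf, and libvirtd needs a restart to pick up the change):

```ini
# /etc/libvirt/qemu.conf
# Disable libvirt's remembering of the original file owner/label in XATTRs.
# Workaround only: it also disables ownership restoration after VM shutdown.
remember_owner = 0
```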
Created attachment 1682720 [details] libvirtd log with debug enabled
(In reply to Michal Privoznik from comment #40)
> in an interactive debugging session we were able to work around this by
> setting remember_owner=0 in qemu.conf.
> [...]
> Can either of you attach debug logs please?

Hello Michal,

I have attached the libvirtd.log with debug enabled. Hope this is the one you were asking for.
Created attachment 1682728 [details] QEMU log for the domain - reproduce
Contents of /etc/libvirt/hooks/qemu:

[root@ ~]# cat /etc/libvirt/hooks/qemu
#!/bin/sh
DOMAIN="$1"
EVENT="$2"
PHASE="$3"
exec /usr/libexec/vdsm/vm_libvirt_hook.py "$DOMAIN" "$EVENT" "$PHASE"
So, after a whole day of debugging here are my findings. There are two ways to work around this bug (whether it is a libvirt bug remains undecided, read on):

1) Set remember_owner=0 in qemu.conf. This way, libvirt won't save any XATTRs on glusterfs (which it removes immediately anyway).

2) Tell libvirt not to set the SELinux label on the disk (since gluster via FUSE doesn't support it anyway):

   <seclabel model='dac' relabel='no'/>
   <seclabel model='selinux' relabel='no'/>

Now, this bug is very tricky to reproduce; in fact, on one try it reproduces and on another it doesn't (without changing anything), and yet this is not a race condition (at least not in libvirt). Here is the code that is triggered if neither workaround 1) nor 2) is in place:

https://gitlab.com/libvirt/libvirt/-/blob/master/src/security/security_selinux.c#L1327

It's the virSecuritySELinuxSetFilecon() function, which will be called like this:

virSecuritySELinuxSetFilecon(mgr,
                             path = "/path/to/gluster/disk",
                             tcon = "system_u:object_r:svirt_image_t:s0:c159,c181",
                             remember = true);

At this point transactions are turned off, so virSecuritySELinuxTransactionAppend() returns 0. Then, since @remember is true, the current label of the disk is written into XATTRs (econ = "system_u:object_r:fusefs_t:s0"), with @refcount = 1 because no one else is using the image. Then virSecuritySELinuxSetFileconImpl() is called, which tries to set the desired context @tcon on the @path; this fails (setfilecon_raw() returns -1 with errno = ENOTSUP) and thus the Impl() function returns 1. At this point rc equals 1, so @rollback won't be set to false, and therefore virSecuritySELinuxRestoreFileLabel() is called.
At this point, we have:
- some XATTRs on the file
- no SELinux label change

virSecuritySELinuxRestoreFileLabel() will then read the original SELinux label (which is the same as the file has now, because we failed to set it), will remove all XATTRs, and will again fail to set the SELinux label (the failure is ignored).

So there is something in this little dance that makes QEMU fail when reading the disk image. But the failure is very weird because there is nothing in the audit log, no AVC. I even enabled noaudit printings ("semodule -DB") and still no AVCs. However, if, say, remember_owner is turned off, then no XATTRs are stored on the file; the relabel still fails, but QEMU is able to read the file happily. Or, if the SELinux relabel is turned off for the disk, then the relabel is not even attempted and QEMU is able to read the file, again happily.

This brings me to the conclusion that libvirt is tickling a bug in SELinux/FUSE/gluster, and I lean towards gluster because this does not reproduce on any other filesystem.

To make things even weirder: if neither workaround is enabled, and I attach gdb and pause libvirtd right after it removed the XATTRs and failed to restore the context (not that there was anything to restore; nothing changed), then manually run "restorecon /path/to/gluster/disk" from another console, resume libvirtd and let QEMU run, guess what: the guest boots! QEMU is able to read the disk. Throughout this whole operation the SELinux label was never changed, at least not visibly (ls -lZ).

I will resume debugging tomorrow.
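The sequence above can be sketched as a tiny state machine. This is an illustrative simulation (FakeGlusterFile, set_filecon_with_remember and the xattr key are made-up names, not libvirt code); it only assumes that the SELinux relabel fails with ENOTSUP on the FUSE mount, as described above:

```python
# Sketch of the libvirt label-remembering sequence described above.
# All names are illustrative; this is not libvirt code.

ENOTSUP = "ENOTSUP"

class FakeGlusterFile:
    """FUSE-mounted gluster file: XATTRs work, SELinux relabel does not."""
    def __init__(self, label):
        self.label = label       # SELinux label (never changes here)
        self.xattrs = {}         # libvirt's bookkeeping XATTRs

    def setfilecon(self, new_label):
        return ENOTSUP           # gluster via FUSE rejects the relabel

def set_filecon_with_remember(f, target_label):
    """Mimics the remember=true path of virSecuritySELinuxSetFilecon()."""
    trace = []
    # 1. remember=true: record the original label in XATTRs first
    f.xattrs["trusted.libvirt.security.selinux"] = f.label
    trace.append("setxattr")
    # 2. attempt the actual relabel; fails with ENOTSUP on FUSE
    if f.setfilecon(target_label) == ENOTSUP:
        # 3. rollback: "restore" the unchanged label (a no-op) and
        #    strip the bookkeeping XATTRs again
        f.xattrs.clear()
        trace.append("removexattr")
    return trace

f = FakeGlusterFile("system_u:object_r:fusefs_t:s0")
trace = set_filecon_with_remember(
    f, "system_u:object_r:svirt_image_t:s0:c159,c181")
print(trace)     # ['setxattr', 'removexattr']
print(f.label)   # unchanged label
print(f.xattrs)  # {} - no visible change remains on the file
```

The point is that the only externally visible operations are the setxattr/removexattr pair with no net change to the file, which is exactly what the minimal reproducer in comment 49 exercises.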
Debugging, day two. Running qemu under strace shows some differences in the way QEMU reads data from the disk. I will attach the strace outputs in a little while. But the main difference is this. A successful boot looks like:

qemu_strace.3455:pread64(31</mnt/test/mprivozn/copy-of-91a6eff5-ffa2-4a96-8eef-b02a863fddf1>, "\xeb\x63\x90\x10\x8e\xd0\xbc\x00\xb0\xb8\x00\x00\x8e\xd8\x8e\xc0\xfb\xbe\x00\x7c\xbf\x00\x06\xb9\x00\x02\xf3\xa4\xea\x21\x06\x00"..., 512, 0) = 512
qemu_strace.3455:pread64(31</mnt/test/mprivozn/copy-of-91a6eff5-ffa2-4a96-8eef-b02a863fddf1>, "\x52\xbf\xf4\x81\x66\x8b\x2d\x83\x7d\x08\x00\x0f\x84\xe2\x00\x80\x7c\xff\x00\x74\x46\x66\x8b\x1d\x66\x8b\x4d\x04\x66\x31\xc0\xb0"..., 512, 512) = 512
...
qemu_strace.3455:pread64(31</mnt/test/mprivozn/copy-of-91a6eff5-ffa2-4a96-8eef-b02a863fddf1>, "\x89\x50\x08\x8b\x45\x94\x8b\x40\x0c\x8d\x14\x03\x8b\x45\x94\x89\x50\x0c\x90\x83\xc4\x6c\x5b\x5e\x5f\x5d\xc3\xf3\x0f\x1e\xfb\x55"..., 17408, 1096704) = 17408
qemu_strace.3455:pread64(31</mnt/test/mprivozn/copy-of-91a6eff5-ffa2-4a96-8eef-b02a863fddf1>, "\x49\x4e\x41\xc0\x03\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x04\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00"..., 11776, 806420480) = 11776
qemu_strace.3455:pread64(31</mnt/test/mprivozn/copy-of-91a6eff5-ffa2-4a96-8eef-b02a863fddf1>, "\x49\x4e\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00"..., 20992, 806432256) = 20992
qemu_strace.3455:pread64(31</mnt/test/mprivozn/copy-of-91a6eff5-ffa2-4a96-8eef-b02a863fddf1>, "\xe8\x0f\xb6\x45\xe8\x0f\xb6\xc0\x8b\x14\x85\x00\x00\x00\x00\x0f\xb6\x45\xe9\x0f\xb6\xc0\x8b\x04\x85\x00\x04\x00\x00\x31\xc2\x0f"..., 2560, 14286848) = 2560

And this is what an unsuccessful boot looks like:

qemu_strace.2862:pread64(32</mnt/test/mprivozn/copy-of-91a6eff5-ffa2-4a96-8eef-b02a863fddf1>, "\xeb\x63\x90\x10\x8e\xd0\xbc\x00\xb0\xb8\x00\x00\x8e\xd8\x8e\xc0\xfb\xbe\x00\x7c\xbf\x00\x06\xb9\x00\x02\xf3\xa4\xea\x21\x06\x00"..., 512, 0) = 512
qemu_strace.2862:pread64(32</mnt/test/mprivozn/copy-of-91a6eff5-ffa2-4a96-8eef-b02a863fddf1>, "\x52\xbf\xf4\x81\x66\x8b\x2d\x83\x7d\x08\x00\x0f\x84\xe2\x00\x80\x7c\xff\x00\x74\x46\x66\x8b\x1d\x66\x8b\x4d\x04\x66\x31\xc0\xb0"..., 512, 512) = 512
...
qemu_strace.2862:pread64(32</mnt/test/mprivozn/copy-of-91a6eff5-ffa2-4a96-8eef-b02a863fddf1>, "\x89\x50\x08\x8b\x45\x94\x8b\x40\x0c\x8d\x14\x03\x8b\x45\x94\x89\x50\x0c\x90\x83\xc4\x6c\x5b\x5e\x5f\x5d\xc3\xf3\x0f\x1e\xfb\x55"..., 17408, 1096704) = 17408
qemu_strace.2862:pread64(32</mnt/test/mprivozn/copy-of-91a6eff5-ffa2-4a96-8eef-b02a863fddf1>, "\x49\x4e\x81\xa4\x03\x02\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00"..., 24576, 17006592) = 24576
qemu_strace.2862:pread64(32</mnt/test/mprivozn/copy-of-91a6eff5-ffa2-4a96-8eef-b02a863fddf1>, "\x49\x4e\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00"..., 8192, 17031168) = 8192

Both straces start the same (the boot sector is read) and for the first few lines (shown as "...") they are identical. But while in the successful boot qemu reads from the file at offset 806420480, in the unsuccessful boot it reads from a much lower offset, 17006592 (these are ~750MB apart). Now, the interesting part is that qemu tries to determine the size of the virtual disk (the file), and again it gets different results.
In case of the successful boot:

qemu_strace.3144:fstat(17</mnt/test/mprivozn/copy-of-91a6eff5-ffa2-4a96-8eef-b02a863fddf1>, {st_dev=makedev(0, 0x34), st_ino=10159533956285893482, st_mode=S_IFREG|0644, st_nlink=1, st_uid=107, st_gid=107, st_blksize=131072, st_blocks=8004888, st_size=62277025792, st_atime=1588152829 /* 2020-04-29T11:33:49.465440461+0200 */, st_atime_nsec=465440461, st_mtime=1588227404 /* 2020-04-30T08:16:44.247680655+0200 */, st_mtime_nsec=247680655, st_ctime=1588228091 /* 2020-04-30T08:28:11.041818827+0200 */, st_ctime_nsec=41818827}) = 0

and the unsuccessful one:

qemu_strace.2664:fstat(18</mnt/test/mprivozn/copy-of-91a6eff5-ffa2-4a96-8eef-b02a863fddf1>, {st_dev=makedev(0, 0x34), st_ino=10159533956285893482, st_mode=S_IFREG|0644, st_nlink=1, st_uid=107, st_gid=107, st_blksize=131072, st_blocks=125960, st_size=65404928, st_atime=1588152828 /* 2020-04-29T11:33:48.755428985+0200 */, st_atime_nsec=755428985, st_mtime=1588166246 /* 2020-04-29T15:17:26.102212781+0200 */, st_mtime_nsec=102212781, st_ctime=1588228091 /* 2020-04-30T08:28:11.041818827+0200 */, st_ctime_nsec=41818827}) = 0

Notice the huge difference in st_size and st_blocks. This is what ls reports:

# ls -lhZ /mnt/test/mprivozn/copy-of-91a6eff5-ffa2-4a96-8eef-b02a863fddf1
-rw-r--r--. 1 qemu qemu system_u:object_r:fusefs_t:s0 58G Apr 30 08:45 /mnt/test/mprivozn/copy-of-91a6eff5-ffa2-4a96-8eef-b02a863fddf1

Therefore, I conclude that gluster intermittently reports a wrong file size, which confuses qemu and in turn leaves the guest unable to boot. I will switch this one over to gluster after I attach the promised straces.
Created attachment 1683198 [details]
qemu_straces.tar.xz

The promised straces. At first I thought the iothread was involved somehow, so I disabled it. Therefore there are two successful runs (qemu_success/ and qemu_success_no_iothread/) and two failed runs (qemu_fail/ and qemu_fail_no_iothread/).
Hi Sunil,

1. Please see comment 46, but before that, let's make sure we are NOT using libgfapi here.
2. Please assign someone to look at this promptly.
Alright, so I have a very simple reproducer that doesn't involve qemu or libvirt. In one console I'm watching the file size:

for ((i=0; i<100; i++)); do stat /mnt/test/mprivozn/copy-of-91a6eff5-ffa2-4a96-8eef-b02a863fddf1; sleep .1; done

and in the other I am setting an XATTR:

setfattr -n trusted.libvirt -v ble /mnt/test/mprivozn/copy-of-91a6eff5-ffa2-4a96-8eef-b02a863fddf1

And this is what I can see:

  File: /mnt/test/mprivozn/copy-of-91a6eff5-ffa2-4a96-8eef-b02a863fddf1
  Size: 62277025792	Blocks: 8004888    IO Block: 131072 regular file
Device: 34h/52d	Inode: 10159533956285893482  Links: 1
Access: (0644/-rw-r--r--)  Uid: ( 107/ qemu)   Gid: ( 107/ qemu)
Context: system_u:object_r:fusefs_t:s0
Access: 2020-04-29 11:33:49.465440461 +0200
Modify: 2020-04-30 08:45:00.812196690 +0200
Change: 2020-04-30 08:46:03.026205893 +0200
 Birth: -

  File: /mnt/test/mprivozn/copy-of-91a6eff5-ffa2-4a96-8eef-b02a863fddf1
  Size: 65404928  	Blocks: 125960     IO Block: 131072 regular file
Device: 34h/52d	Inode: 10159533956285893482  Links: 1
Access: (0644/-rw-r--r--)  Uid: ( 107/ qemu)   Gid: ( 107/ qemu)
Context: system_u:object_r:fusefs_t:s0
Access: 2020-04-29 11:33:48.755428985 +0200
Modify: 2020-04-29 15:17:26.102212781 +0200
Change: 2020-04-30 09:54:26.139738718 +0200
 Birth: -

  File: /mnt/test/mprivozn/copy-of-91a6eff5-ffa2-4a96-8eef-b02a863fddf1
  Size: 62277025792	Blocks: 8004888    IO Block: 131072 regular file
Device: 34h/52d	Inode: 10159533956285893482  Links: 1
Access: (0644/-rw-r--r--)  Uid: ( 107/ qemu)   Gid: ( 107/ qemu)
Context: system_u:object_r:fusefs_t:s0
Access: 2020-04-29 11:33:49.465440461 +0200
Modify: 2020-04-30 08:45:00.812196690 +0200
Change: 2020-04-30 09:54:26.139738718 +0200
 Birth: -

So not only does the reported file size drop; the mtime is also inconsistent.
(In reply to Yaniv Kaul from comment #48)
> 1. Please see comment 46, but before that - let's make sure we are NOT
> using libgfapi here.
> 2. Please assign someone to look at this promptly.

I have shared my test setup for this issue with Ryan Barry, Nir Soffer and Michal. It is not using libgfapi but the gluster FUSE mount.
Thanks Michal for your analysis and getting us so far. I am moving this bug to Red Hat Gluster Storage product.
I have tested this behaviour with replica 3, with sharding enabled, using the test procedure mentioned in comment 49. I could see the discrepancy in the size of the file; below is the output of stat on the file while setfattr is running against it.

<snip>
  Size: 4294967296	Blocks: 8388608    IO Block: 131072 regular file
  Size: 4294967296	Blocks: 8388608    IO Block: 131072 regular file
  Size: 4294967296	Blocks: 8388608    IO Block: 131072 regular file
  Size: 4294967296	Blocks: 8388608    IO Block: 131072 regular file
  Size: 67108864  	Blocks: 131072     IO Block: 131072 regular file
  Size: 67108864  	Blocks: 131072     IO Block: 131072 regular file
  Size: 67108864  	Blocks: 131072     IO Block: 131072 regular file
  Size: 67108864  	Blocks: 131072     IO Block: 131072 regular file
  Size: 67108864  	Blocks: 131072     IO Block: 131072 regular file
  Size: 67108864  	Blocks: 131072     IO Block: 131072 regular file
</snip>

[root@ ~]# grep -i size sample1 | uniq
  Size: 4294967296	Blocks: 8388608    IO Block: 131072 regular file
  Size: 67108864  	Blocks: 131072     IO Block: 131072 regular file
  Size: 4294967296	Blocks: 8388608    IO Block: 131072 regular file
  Size: 67108864  	Blocks: 131072     IO Block: 131072 regular file
  Size: 4294967296	Blocks: 8388608    IO Block: 131072 regular file

I have also repeated the test with a replica 3 volume without enabling shard, and I do not see any discrepancy:

[root@ ~]# grep -i size sample | uniq
  Size: 4294967296	Blocks: 8388608    IO Block: 131072 regular file

So this leads to the conclusion that the issue is seen with the sharding translator of glusterfs.
Thanks Ravi for your suggestions. With Ravi's input, I have repeated the test after turning off 'performance.stat-prefetch':

[localhost@ ~]# gluster volume set engine performance.stat-prefetch off

In this case, with stat-prefetch turned off, the size of the file no longer fluctuates while setting an xattr on it from the gluster FUSE mount.

@Krutika, refer to comment 49 for the simple steps to reproduce this issue. I also have a setup available for debugging; let me know if you need one.
I think this is a side-effect of commit 69f77d2, wherein in the setxattr fop md-cache caches the pre- and post-fop attributes received in the dict. While that patch did handle cases like EC, where file sizes returned by posix from multiple bricks need to be aggregated before unwinding the response, I think it missed making a similar change in the shard translator. In the case of shard, each brick returns only the base shard's size. This is what md-cache ends up caching, and it keeps serving it until the cache timeout.

The solution would be to "complete" what that patch missed: overload shard's setxattr as well to request the aggregated file size during the call, and use the aggregated file size returned in the dict to update the GF_{PRE,POST}STAT key values before unwinding the call.
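The caching mechanism described above can be illustrated with a simplified simulation (MdCache and every name here are made up, not gluster code); it only assumes a 64 MiB base shard and a 4 GiB file, matching the sizes seen in the earlier test output:

```python
# Toy model of the interaction described above. A 4 GiB file is stored as
# 64 MiB shards; stat() must aggregate shards, but the setxattr reply's
# post-op stat carries only the base shard's size, which md-cache caches.

SHARD_SIZE = 64 * 1024 * 1024        # base shard size: 67108864
REAL_SIZE = 4 * 1024 * 1024 * 1024   # aggregated size: 4294967296

class MdCache:
    """Caches the post-op stat attached to fop replies until timeout."""
    def __init__(self):
        self.cached_size = None

    def on_setxattr_reply(self, post_stat_size):
        # Bug: the reply's stat was never aggregated by shard,
        # so the base shard's size gets cached here.
        self.cached_size = post_stat_size

    def stat(self, backend_stat):
        # Serve from cache if present, else ask the backend (aggregated).
        return self.cached_size if self.cached_size is not None else backend_stat()

    def timeout(self):
        self.cached_size = None

cache = MdCache()
print(cache.stat(lambda: REAL_SIZE))  # 4294967296 - correct before setxattr
cache.on_setxattr_reply(SHARD_SIZE)   # setxattr poisons the cache
print(cache.stat(lambda: REAL_SIZE))  # 67108864 - wrong size served
cache.timeout()                       # cache entry expires...
print(cache.stat(lambda: REAL_SIZE))  # 4294967296 - correct again
```

This matches the fluctuating sizes seen in the reproducer: the reported size flips to the base shard's 67108864 bytes right after setfattr and recovers once the cache entry expires, which is also why disabling performance.stat-prefetch (md-cache) hides the symptom.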
0. So I'm assuming the file on which libvirt is setting the extended attribute is the VM image file. Is that correct?
1. How often does libvirt call SETXATTR on a VM during its lifetime?
2. And at what point in the VM's lifecycle is this SETXATTR called?

I have two solutions: one involves an extra network call, the other a potential backward-compatibility problem during upgrade. I need to decide which one to implement based on the answers to the above questions.

-Krutika
(In reply to Krutika Dhananjay from comment #58)
> 0. So I'm assuming the file on which libvirt is setting the extended
> attribute is the VM image file. Is that correct?

Yes. When starting a VM or attaching a disk to a running VM, libvirt must chown() + setfilecon() the file so that the qemu process can access it. Just before doing so, libvirt records the original owner and SELinux label in XATTRs. Then, when cleaning up after a terminated qemu or a detached disk, libvirt looks into the XATTRs, and if it finds previous records it uses them to restore the original owner. If not, a chown(root:root) + restorecon combo is called.

> 1. So how often does libvirt call SETXATTR on a VM during its lifetime?

This depends on the file. There are files which will be set only once (and once more have their XATTRs removed), like the pidfile. But there are also files which will have their XATTRs set and removed multiple times: as many times as a user wants to hotplug/hotunplug the disk during the VM's lifetime. If they want to do that every 10 seconds, then every 10 seconds the file representing the disk inside the VM will have its XATTRs modified.

> 2. And at what point in the VM's lifecycle is this SETXATTR called?

Again, this depends on the scenario. Files configured in the so-called inactive XML (the configuration the VM is started with) will have those XATTRs set just before the execve() of qemu, and removed when the qemu process dies (libvirt keeps a socket open and detects qemu dying by seeing HUP on the socket, which triggers the cleanup code where the XATTRs are removed). And because we have hotplug and hotunplug, the corresponding XATTR work is done on the file representing a device hotplugged to/hotunplugged from qemu.

> I have two solutions - one of them involving an extra network call and
> another involving potential backward compatibility problem during upgrade.
> I need to decide which one to implement based on the answers to the above
> questions.
Sounds good. If it makes any difference, I don't think that RHV uses gluster for anything other than storing VM disks. So we don't have to bother with other ("service") files, like the aforementioned pidfile, monitor socket file, etc.

> -Krutika
(In reply to Michal Privoznik from comment #59)
> (In reply to Krutika Dhananjay from comment #58)
> > 0. So I'm assuming the file on which libvirt is setting the extended
> > attribute is the VM image file. Is that correct?
>
> Yes. When starting a VM or attaching a disk to a running VM, libvirt must
> chown() + setfilecon() the file so that the qemu process can access it.
> Just before doing so, libvirt records the original owner and SELinux label
> in XATTRs. Then, when cleaning up after a terminated qemu or a detached
> disk, libvirt looks into the XATTRs and, if it finds previous records, uses
> them to restore the original owner. If not, the chown(root:root) +
> restorecon combo is called.
>
> > 1. So how often does libvirt call SETXATTR on a VM during its lifetime?
>
> This depends on the file. Some files have their XATTRs set only once (and
> removed once at the end) - like the pidfile. Others have their XATTRs set
> and removed multiple times - as many times as a user hotplugs/hotunplugs a
> disk during the VM's lifetime. If they want to do that every 10 seconds,
> then every 10 seconds the file backing the disk has its XATTRs modified.
>
> > 2. And at what point in the VM's lifecycle is this SETXATTR called?
>
> Again, this depends on the scenario. Files configured in the so-called
> inactive XML (the configuration the VM is started with) have their XATTRs
> set just before the execve() of qemu, and removed when the qemu process
> dies (libvirt keeps a socket open and detects qemu dying by seeing HUP on
> the socket, which triggers the cleanup code where the XATTRs are removed).
> But because we have hotplug and hotunplug, the corresponding XATTR work is
> also done on the file representing the device being hotplugged
> to/hotunplugged from qemu.
>
> > I have two solutions - one involving an extra network call and another
> > involving a potential backward-compatibility problem during upgrade. I
> > need to decide which one to implement based on the answers to the above
> > questions.
>
> Sounds good. If it makes any difference, I don't think that RHV uses
> gluster for anything other than storing VM disks, so we don't have to
> bother with other ("service") files, like the aforementioned pidfile,
> monitor socket file, etc.

So as far as RHV is concerned, is it safe for me to assume that there will be only an initial SETXATTR and a REMOVEXATTR at the end? In that case it makes more sense for me to go ahead with the additional-network-call approach (and it's not like the call WILL go over the network *EVERY* time - we store the aggregated file attribute in a cache in the shard translator, so sometimes it will be served right off the cache).
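For readers unfamiliar with the mechanism described above, libvirt's save/restore flow can be sketched roughly as follows. This is a minimal Python simulation, not libvirt's actual code (libvirt is C and records the labels in trusted.libvirt.* XATTRs on the real file); the XATTR store is modeled with a dict, and the key name and owner strings are illustrative assumptions:

```python
# Minimal sketch of libvirt-style security-label save/restore.
# XATTRs are simulated with a dict so the sketch runs anywhere;
# the key name below is a hypothetical stand-in.

XATTR_UID = "trusted.libvirt.security.uid"   # assumed key name

class Image:
    def __init__(self, owner):
        self.owner = owner          # current "uid:gid"
        self.xattrs = {}            # simulated extended attributes

def set_label(img, qemu_owner):
    """Before exec'ing qemu: record the original owner, then chown."""
    if XATTR_UID not in img.xattrs:             # record only once
        img.xattrs[XATTR_UID] = img.owner       # SETXATTR equivalent
    img.owner = qemu_owner

def restore_label(img):
    """After qemu dies / disk detach: restore from the XATTR record."""
    original = img.xattrs.pop(XATTR_UID, None)  # REMOVEXATTR equivalent
    # No record -> fall back to chown(root:root) + restorecon.
    img.owner = original if original is not None else "root:root"

disk = Image("vdsm:kvm")
set_label(disk, "qemu:qemu")        # VM start / disk hotplug
restore_label(disk)                 # VM shutdown / disk detach
assert disk.owner == "vdsm:kvm" and not disk.xattrs
```

This mirrors the "set once, remove once" pattern per attach/detach cycle that the comment describes; a hotplug/hotunplug loop simply repeats the pair.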
(In reply to Krutika Dhananjay from comment #60)
> So as far as RHV is concerned, then is it safe for me to assume that there
> will be only an initial SETXATTR and a REMOVEXATTR at the end?

Oh yeah, that sounds reasonable. XATTRs are very low traffic. It's NOT like libvirt would set them on each disk read/write - it sets them once and removes them once. Any overhead will spread over the VM's runtime.
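The cache-first lookup mentioned in the exchange above can be sketched as follows. This is a rough Python model with hypothetical names; the real logic lives in gluster's shard translator, in C:

```python
# Sketch: serve the aggregated file size from a cache, falling back to
# one extra network call only on a miss. All names are hypothetical.

class ShardCache:
    def __init__(self, fetch_remote):
        self.sizes = {}                  # gfid -> aggregated size (bytes)
        self.fetch_remote = fetch_remote # simulated over-the-wire lookup
        self.network_calls = 0

    def aggregated_size(self, gfid):
        if gfid in self.sizes:           # served right off the cache
            return self.sizes[gfid]
        self.network_calls += 1          # the "extra network call"
        size = self.fetch_remote(gfid)
        self.sizes[gfid] = size
        return size

cache = ShardCache(fetch_remote=lambda gfid: 50 * 1024**3)
cache.aggregated_size("img-1")           # miss: goes over the network
cache.aggregated_size("img-1")           # hit: no network call
assert cache.network_calls == 1
```

Given the once-per-VM-lifetime SETXATTR/REMOVEXATTR pattern confirmed above, the miss path is rarely taken, which is why the extra-call approach is acceptable here.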
Upstream github tracker - https://github.com/gluster/glusterfs/issues/1243 Patch coming right up!
(In reply to Krutika Dhananjay from comment #62)
> Upstream github tracker - https://github.com/gluster/glusterfs/issues/1243
>
> Patch coming right up!

Sorry, this is going to take some more time. The issue is not reproducible upstream in the first place, and applying the patch downstream for testing is leading to huge conflicts due to coding-style differences between the two trees. Will upload once I get to test the patch.
Upstream patch - https://review.gluster.org/c/glusterfs/+/24471
(In reply to Michal Privoznik from comment #61)
> Oh yeah, that sounds reasonable. XATTRs are very low traffic. It's NOT like
> libvirt would set them on each disk read/write. It sets them once and
> removes them once. Any overhead will spread over VM runtime.

Hi,

One last question: does the analysis in comment #46 hold only for SETXATTR, or do you see this even in REMOVEXATTR, since I believe that is also something you'd execute on the file at some point?

I've fixed the problem in SETXATTR, and I believe the same bug exists in REMOVEXATTR in gluster as well. If it is necessary, then I'll have to fix the REMOVEXATTR code too (which implies adding another 100+ lines of code).

-Krutika
(In reply to Krutika Dhananjay from comment #67)
> One last question. The analysis in comment #46 holds only for SETXATTR? Or
> do you see this even in REMOVEXATTR since I believe that is also something
> you'd execute on the file at some point?

Frankly, I don't know - I no longer have access to the testing environment. But you are right in assuming that libvirt does call REMOVEXATTR at some point: it does so when a VM is shut down, or on a device hot unplug (e.g. a disk detach). However, at that point the VM (aka the qemu process) is no longer using the associated file (the order is the reverse of starting a VM), and therefore the apparent file size is not an issue. But this is from libvirt's POV; other applications might do more XATTR work and rely on the apparent file size, so I'd say it is worth fixing the REMOVEXATTR case too.
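[Editor's note] If the root cause is as the thread suggests - the (SET|REMOVE)XATTR reply carrying the stat of only the base shard, so clients momentarily see a truncated apparent size - then both callbacks need the same fixup before unwinding. A rough Python sketch under that assumption, with all names hypothetical:

```python
# Sketch: on a sharded volume only the base shard is touched by
# (SET|REMOVE)XATTR, so the stat in the reply carries that shard's
# size. Patch the reply with the aggregated whole-file size before
# returning it to the client. All names are hypothetical.

def patch_reply_stat(reply, aggregated_size):
    """Overwrite the per-shard size with the whole-file size."""
    reply["size"] = aggregated_size
    return reply

def setxattr_cbk(reply, size_cache, gfid):
    return patch_reply_stat(reply, size_cache[gfid])

def removexattr_cbk(reply, size_cache, gfid):
    # Same fixup: without it, the bug reappears on the removal path.
    return patch_reply_stat(reply, size_cache[gfid])

size_cache = {"img-1": 50 * 1024**3}       # aggregated size, in bytes
reply = setxattr_cbk({"size": 4096}, size_cache, "img-1")
assert reply["size"] == 50 * 1024**3
```

This is why fixing only SETXATTR leaves a symmetric hole: any consumer that stats the file via the REMOVEXATTR reply would still see the base shard's size.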
(In reply to Michal Privoznik from comment #68)
> [...] other applications might do more XATTR work and rely on the apparent
> file size, so I'd say it is worth fixing the REMOVEXATTR case too.

Thanks, I'm on it!
Downstream backport for the SETXATTR changes - https://code.engineering.redhat.com/gerrit/#/c/201135/
https://review.gluster.org/c/glusterfs/+/24480 <-- removexattr fix
Tested with RHV 4.4.1 and RHGS 3.5.2 (glusterfs-6.0-37.el8rhgs):

1. RHHI-V deployment completes successfully, with the RHV Self-Hosted Engine consuming a glusterfs storage domain
2. Additional gluster storage domains are created and operational
3. VMs are created and various workloads are run inside the VMs
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:2572