Bug 1823423 - glusterfs reports invalid file size when XATTRs are set/removed
Summary: glusterfs reports invalid file size when XATTRs are set/removed
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: sharding
Version: rhgs-3.5
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: RHGS 3.5.z Batch Update 2
Assignee: Krutika Dhananjay
QA Contact: SATHEESARAN
URL:
Whiteboard:
Depends On:
Blocks: 1823286
 
Reported: 2020-04-13 16:36 UTC by SATHEESARAN
Modified: 2020-06-21 17:41 UTC
CC List: 23 users

Fixed In Version: glusterfs-6.0-36
Doc Type: No Doc Update
Doc Text:
Clone Of: 1823286
Environment:
Last Closed: 2020-06-16 06:19:39 UTC
Embargoed:


Attachments
- sosreport from the HE Node (14.32 MB, application/octet-stream), 2020-04-14 09:57 UTC, SATHEESARAN
- sosreport testing with 4k gluster storage (8.89 MB, application/octet-stream), 2020-04-15 09:47 UTC, SATHEESARAN
- Hosted engineVM configXML (16.02 KB, application/octet-stream), 2020-04-21 13:13 UTC, SATHEESARAN
- HostedEngineLocal XML Definition from retest on 24April2020 (6.61 KB, application/octet-stream), 2020-04-24 15:25 UTC, SATHEESARAN
- HostedEngine XML Definition from retest on 24April2020 (16.01 KB, application/octet-stream), 2020-04-24 15:26 UTC, SATHEESARAN
- HostedEngineLocal VM QEMU log from retest on 24April2020 (31.43 KB, application/octet-stream), 2020-04-24 15:27 UTC, SATHEESARAN
- HostedEngine VM QEMU log from retest on 24April2020 (627.82 KB, application/octet-stream), 2020-04-24 15:27 UTC, SATHEESARAN
- Screenshot showing failed boot (152.76 KB, image/png), 2020-04-27 19:38 UTC, Nir Soffer
- VM XML that boots with disk on gluster storage (3.17 KB, text/plain), 2020-04-27 19:40 UTC, Nir Soffer
- VM XML that does not boot with disk on gluster storage (3.21 KB, text/plain), 2020-04-27 19:41 UTC, Nir Soffer
- Same "bad" XML with disk on XFS file system, works (3.07 KB, text/plain), 2020-04-27 19:42 UTC, Nir Soffer
- libvirtd log with debug enabled (11.31 MB, application/octet-stream), 2020-04-29 06:27 UTC, SATHEESARAN
- QEMU log for the domain - reproduce (298.67 KB, application/octet-stream), 2020-04-29 06:42 UTC, SATHEESARAN
- qemu_straces.tar.xz (500.23 KB, application/x-xz), 2020-04-30 07:19 UTC, Michal Privoznik


Links
- Red Hat Product Errata RHBA-2020:2572, last updated 2020-06-16 06:19:56 UTC

Description SATHEESARAN 2020-04-13 16:36:52 UTC
Description
-------------
Hosted Engine VM fails to boot up after its image is copied from the local
storage to the GlusterFS storage domain backed by a replica 3 gluster volume.

Version
-------
RHHI-V 1.8
RHVH 4.4
vdsm-4.40.13-1.el8ev.x86_64
vdsm-gluster-4.40.13-1.el8ev.x86_64
python3-ioprocess-1.4.1-1.el8ev.x86_64
ioprocess-1.4.1-1.el8ev.x86_64
glusterfs-6.0-32.el8rhgs.x86_64

How reproducible
-----------------
Always

Steps to reproduce
-------------------
1. Start with RHHI-V installation from Web console
2. Post Gluster deployment is complete, continue to Hosted Engine deployment
3. Provide values for Hosted Engine configuration and start HE deployment

Actual results
---------------
Hosted Engine deployment fails, as the HE VM is unable to boot

Expected results
-----------------
Hosted Engine VM should boot up successfully

--- Additional comment from RHEL Program Management on 2020-04-13 07:23:47 UTC ---

This bug is automatically being proposed for the RHHI-V 1.8 release of the Red Hat Hyperconverged Infrastructure for Virtualization product, by setting the release flag 'rhiv-1.8' to '?'.

If this bug should be proposed for a different release, please manually change the proposed release flag.

--- Additional comment from RHEL Program Management on 2020-04-13 07:23:47 UTC ---

This bug report has Keywords: Regression or TestBlocker.

Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release.

Please resolve ASAP.

--- Additional comment from SATHEESARAN on 2020-04-13 07:27:31 UTC ---

Adding Internal_whiteboard value as rhv-4.4.0

--- Additional comment from SATHEESARAN on 2020-04-13 07:55:43 UTC ---



--- Additional comment from SATHEESARAN on 2020-04-13 07:58:30 UTC ---

Gluster volume info
--------------------
[root@rhsqa-grafton7 ~]# gluster volume info engine
 
Volume Name: engine
Type: Replicate
Volume ID: 2d3cabed-49ba-4064-b56e-6db56e9a2112
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: rhsqa-grafton7.lab.eng.blr.redhat.com:/gluster_bricks/engine/engine
Brick2: rhsqa-grafton8.lab.eng.blr.redhat.com:/gluster_bricks/engine/engine
Brick3: rhsqa-grafton9.lab.eng.blr.redhat.com:/gluster_bricks/engine/engine
Options Reconfigured:
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
performance.client-io-threads: on
nfs.disable: on
storage.fips-mode-rchecksum: on
transport.address-family: inet
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.low-prio-threads: 32
network.remote-dio: off
cluster.eager-lock: enable
cluster.quorum-type: auto
cluster.server-quorum-type: server
cluster.data-self-heal-algorithm: full
cluster.locking-scheme: granular
cluster.shd-max-threads: 8
cluster.shd-wait-qlength: 10000
features.shard: on
user.cifs: off
cluster.choose-local: off
client.event-threads: 4
server.event-threads: 4
storage.owner-uid: 36
storage.owner-gid: 36
network.ping-timeout: 30
performance.strict-o-direct: on
cluster.granular-entry-heal: enable

[root@rhsqa-grafton7 ~]# gluster volume status engine
Status of volume: engine
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick rhsqa-grafton7.lab.eng.blr.redhat.com
:/gluster_bricks/engine/engine              49152     0          Y       106846
Brick rhsqa-grafton8.lab.eng.blr.redhat.com
:/gluster_bricks/engine/engine              49152     0          Y       103352
Brick rhsqa-grafton9.lab.eng.blr.redhat.com
:/gluster_bricks/engine/engine              49152     0          Y       105014
Self-heal Daemon on localhost               N/A       N/A        Y       245951
Self-heal Daemon on rhsqa-grafton9.lab.eng.
blr.redhat.com                              N/A       N/A        Y       122547
Self-heal Daemon on rhsqa-grafton8.lab.eng.
blr.redhat.com                              N/A       N/A        Y       122592
 
Task Status of Volume engine
------------------------------------------------------------------------------
There are no active volume tasks

--- Additional comment from SATHEESARAN on 2020-04-13 08:01:33 UTC ---



--- Additional comment from SATHEESARAN on 2020-04-13 08:03:08 UTC ---

captured the gluster volume profile of the engine volume, during the setup

--- Additional comment from SATHEESARAN on 2020-04-13 08:07:50 UTC ---



--- Additional comment from SATHEESARAN on 2020-04-13 08:08:38 UTC ---

[root@rhsqa-grafton7 ~]# hosted-engine --vm-status


--== Host rhsqa-grafton7-nic2.lab.eng.blr.redhat.com (id: 1) status ==--

Host ID                            : 1
Host timestamp                     : 85763
Score                              : 3400
Engine status                      : {"vm": "up", "health": "bad", "detail": "Up", "reason": "failed liveliness check"}
Hostname                           : rhsqa-grafton7-nic2.lab.eng.blr.redhat.com
Local maintenance                  : False
stopped                            : False
crc32                              : 3f4e8b05
conf_on_shared_storage             : True
local_conf_timestamp               : 85763
Status up-to-date                  : True
Extra metadata (valid at timestamp):
	metadata_parse_version=1
	metadata_feature_version=1
	timestamp=85763 (Mon Apr 13 08:08:17 2020)
	host-id=1
	score=3400
	vm_conf_refresh_time=85763 (Mon Apr 13 08:08:17 2020)
	conf_on_shared_storage=True
	maintenance=False
	state=EngineStarting
	stopped=False

Comment 1 SATHEESARAN 2020-04-13 16:43:17 UTC
Here are the logs:

ovirt-hosted-engine-setup.log
https://bugzilla.redhat.com/attachment.cgi?id=1678399

Screenshot showing VM failing to boot
https://bugzilla.redhat.com/attachment.cgi?id=1678397

Gluster Fuse mount log
https://bugzilla.redhat.com/attachment.cgi?id=1678396

Comment 2 SATHEESARAN 2020-04-13 16:44:49 UTC
Nir,

I am seeing this problem, where the HE VM fails to boot up:
https://bugzilla.redhat.com/attachment.cgi?id=1678397

Do you have any insights ?

Comment 3 SATHEESARAN 2020-04-14 07:31:49 UTC
Hi Nir,

As I looked into the strace logs, all the reads that happened on the VM image file use a 1 byte block size only; there are no subsequent 512 byte block reads.


[root@rhsqa-grafton7-nic2 ~]# grep 7c16b3ea /brick_logs/*| grep pread
/brick_logs/brick-strace.log.478300:pread64(19</gluster_bricks/engine/engine/.glusterfs/7c/16/7c16b3ea-563f-40cf-a847-5f3807fd6842>, 0x7f4facfff000, 1, 0) = -1 EINVAL (Invalid argument)
/brick_logs/brick-strace.log.588522:pread64(19</gluster_bricks/engine/engine/.glusterfs/7c/16/7c16b3ea-563f-40cf-a847-5f3807fd6842>, 0x7f4facfff000, 1, 0) = -1 EINVAL (Invalid argument)
/brick_logs/brick-strace.log.588522:pread64(19</gluster_bricks/engine/engine/.glusterfs/7c/16/7c16b3ea-563f-40cf-a847-5f3807fd6842>, 0x7f4facfff000, 1, 0) = -1 EINVAL (Invalid argument)
/brick_logs/brick-strace.log.588523:pread64(19</gluster_bricks/engine/engine/.glusterfs/7c/16/7c16b3ea-563f-40cf-a847-5f3807fd6842>, 0x7f4facfff000, 1, 0) = -1 EINVAL (Invalid argument)
/brick_logs/brick-strace.log.588523:pread64(19</gluster_bricks/engine/engine/.glusterfs/7c/16/7c16b3ea-563f-40cf-a847-5f3807fd6842>, 0x7f4facfff000, 1, 0) = -1 EINVAL (Invalid argument)
/brick_logs/brick-strace.log.588523:pread64(19</gluster_bricks/engine/engine/.glusterfs/7c/16/7c16b3ea-563f-40cf-a847-5f3807fd6842>, 0x7f4facfff000, 1, 0) = -1 EINVAL (Invalid argument)
/brick_logs/brick-strace.log.588523:pread64(19</gluster_bricks/engine/engine/.glusterfs/7c/16/7c16b3ea-563f-40cf-a847-5f3807fd6842>, 0x7f4facfff000, 1, 0) = -1 EINVAL (Invalid argument)
/brick_logs/brick-strace.log.588523:pread64(19</gluster_bricks/engine/engine/.glusterfs/7c/16/7c16b3ea-563f-40cf-a847-5f3807fd6842>, 0x7f4facfff000, 1, 0) = -1 EINVAL (Invalid argument)
/brick_logs/brick-strace.log.588530:pread64(19</gluster_bricks/engine/engine/.glusterfs/7c/16/7c16b3ea-563f-40cf-a847-5f3807fd6842>, 0x7f4facfff000, 1, 0) = -1 EINVAL (Invalid argument)
/brick_logs/brick-strace.log.588530:pread64(19</gluster_bricks/engine/engine/.glusterfs/7c/16/7c16b3ea-563f-40cf-a847-5f3807fd6842>, 0x7f4facfff000, 1, 0) = -1 EINVAL (Invalid argument)
/brick_logs/brick-strace.log.595895:pread64(19</gluster_bricks/engine/engine/.glusterfs/7c/16/7c16b3ea-563f-40cf-a847-5f3807fd6842>, 0x7f4facfff000, 1, 0) = -1 EINVAL (Invalid argument)
/brick_logs/brick-strace.log.595895:pread64(19</gluster_bricks/engine/engine/.glusterfs/7c/16/7c16b3ea-563f-40cf-a847-5f3807fd6842>, 0x7f4facfff000, 1, 0) = -1 EINVAL (Invalid argument)
/brick_logs/brick-strace.log.595896:pread64(19</gluster_bricks/engine/engine/.glusterfs/7c/16/7c16b3ea-563f-40cf-a847-5f3807fd6842>, 0x7f4facfff000, 1, 0) = -1 EINVAL (Invalid argument)
/brick_logs/brick-strace.log.595897:pread64(20</gluster_bricks/engine/engine/.glusterfs/7c/16/7c16b3ea-563f-40cf-a847-5f3807fd6842>, 0x7f4facfff000, 1, 0) = -1 EINVAL (Invalid argument)
/brick_logs/brick-strace.log.596859:pread64(20</gluster_bricks/engine/engine/.glusterfs/7c/16/7c16b3ea-563f-40cf-a847-5f3807fd6842>, 0x7f4facfff000, 1, 0) = -1 EINVAL (Invalid argument)
/brick_logs/brick-strace.log.596860:pread64(19</gluster_bricks/engine/engine/.glusterfs/7c/16/7c16b3ea-563f-40cf-a847-5f3807fd6842>, 0x7f4facfff000, 1, 0) = -1 EINVAL (Invalid argument)

Comment 4 Sandro Bonazzola 2020-04-14 08:50:22 UTC
Can you please attach full sos report from the host?

Comment 5 SATHEESARAN 2020-04-14 09:57:53 UTC
Created attachment 1678662 [details]
sosreport from the HE Node

Comment 6 SATHEESARAN 2020-04-14 10:14:02 UTC
(In reply to Sandro Bonazzola from comment #4)
> Can you please attach full sos report from the host?

Hello Sandro,

I have attached the sosreport of the RHVH node, from where the HE deployment is initiated.

Comment 7 Vojtech Juranek 2020-04-14 10:29:22 UTC
What is the block size of the storage, and what is it on the guest? If you use 4k on the storage, you must emulate 512 on the guest, otherwise the VM won't boot, see e.g. https://docs.google.com/presentation/d/1ClLMZ4XAb8CPhYOw6mNpv0va5JWcAgpyFbJn-Bml_aY/edit#slide=id.g6dfd8a41d6_2_13

Comment 8 SATHEESARAN 2020-04-15 04:43:12 UTC
(In reply to Vojtech Juranek from comment #7)
> what is the block size of the storage and what on the guest? If you use 4k
> and the storage, you must emulate 512 on the guest, otherwise VM won't boot,
> see e.g.
> https://docs.google.com/presentation/d/1ClLMZ4XAb8CPhYOw6mNpv0va5JWcAgpyFbJn-
> Bml_aY/edit#slide=id.g6dfd8a41d6_2_13

Thanks Vojtech.

The block size of the underlying bricks of the engine volume is 512B:

[root@rhsqa-grafton7-nic2 ~]# blockdev --getss /dev/mapper/gluster_vg_sdb-gluster_lv_engine 
512

[root@rhsqa-grafton8-nic2 ~]# blockdev --getss /dev/mapper/gluster_vg_sdb-gluster_lv_engine 
512

[root@rhsqa-grafton9-nic2 ~]# blockdev --getss /dev/mapper/gluster_vg_sdb-gluster_lv_engine 
512

Comment 9 Vojtech Juranek 2020-04-15 08:52:49 UTC
Checking ovirt-hosted-engine-setup.log, I see the following mount, and based on the name (gluster_vg_sdb-gluster_lv_engine) I guess this is where the engine VM is located. Is this correct?

    {
        "block_available": 25942060,
        "block_size": 4096,
        "block_total": 26200576,
        "block_used": 258516,
        "device": "/dev/mapper/gluster_vg_sdb-gluster_lv_engine",
        "fstype": "xfs",
        "inode_available": 52424517,
        "inode_total": 52426752,
        "inode_used": 2235,
        "mount": "/gluster_bricks/engine",
        "options": "rw,seclabel,noatime,nodiratime,attr2,inode64,logbsize=256k,sunit=512,swidth=5120,noquota",
        "size_available": 106258677760,
        "size_total": 107317559296,
        "uuid": "98012f56-2657-4bb9-8942-e5c582abdc2b"
    }

If it's correct, it seems that the block size there is 4k ("block_size": 4096).

Could you please check this?

Comment 10 SATHEESARAN 2020-04-15 09:43:46 UTC
(In reply to Vojtech Juranek from comment #9)
> checking ovirt-hosted-engine-setup.log, I see following mount and based on
> the name (gluster_vg_sdb-gluster_lv_engine), I guess this where engine VM is
> located. Is this correct?
> 
>     {
>         "block_available": 25942060,
>         "block_size": 4096,
>         "block_total": 26200576,
>         "block_used": 258516,
>         "device": "/dev/mapper/gluster_vg_sdb-gluster_lv_engine",
>         "fstype": "xfs",
>         "inode_available": 52424517,
>         "inode_total": 52426752,
>         "inode_used": 2235,
>         "mount": "/gluster_bricks/engine",
>         "options":
> "rw,seclabel,noatime,nodiratime,attr2,inode64,logbsize=256k,sunit=512,
> swidth=5120,noquota",
>         "size_available": 106258677760,
>         "size_total": 107317559296,
>         "uuid": "98012f56-2657-4bb9-8942-e5c582abdc2b"
>     }
> 
> If it's correct, it seem that the block size there is 4k ("block_size":
> 4096,).
> 
> Could you please check this?

Hello Vojtech,

I see that the XFS bricks use a 512B block size only.
I double-checked this.

Initially, during the local HE VM (HostedEngineLocal) creation, the corresponding logs report block_size: 4096.
But if you check the create_target_vm logs:

<snip>
[root@rhsqa-grafton7-nic2 ovirt-hosted-engine-setup]# grep -i block_size ./* | grep create_target
./ovirt-hosted-engine-setup-ansible-create_target_vm-2020313123743-7ipn23.log:        "block_size": 512,
./ovirt-hosted-engine-setup-ansible-create_target_vm-2020313123743-7ipn23.log:                "block_size": 512,
./ovirt-hosted-engine-setup-ansible-create_target_vm-2020313123743-7ipn23.log:        "block_size": 512,
./ovirt-hosted-engine-setup-ansible-create_target_vm-2020313161913-fmlqih.log:        "block_size": 512,
./ovirt-hosted-engine-setup-ansible-create_target_vm-2020313161913-fmlqih.log:                "block_size": 512,
./ovirt-hosted-engine-setup-ansible-create_target_vm-2020313161913-fmlqih.log:        "block_size": 512,
</snip>

Now I have tried one more test with a 4K gluster volume. I still see the same issue: the HE VM is unable to boot after migrating to the gluster storage domain.
I will attach the logs for that soon.

Comment 11 SATHEESARAN 2020-04-15 09:47:54 UTC
Created attachment 1678994 [details]
sosreport testing with 4k gluster storage

Comment 12 Vojtech Juranek 2020-04-15 11:35:06 UTC
Thanks for double-checking. There is no need to test with 4k gluster and post the logs; I wanted to be sure this is not an issue related to booting a VM from 4k storage, as that seemed to me the most probable root cause.

Comment 13 Vojtech Juranek 2020-04-16 13:36:45 UTC
I see a gluster-related exception in the supervdsm log (errors in the vdsm log basically mean that vdsm cannot read from gluster); the full exception is below.
It basically fails when calling the glfs_new() function from libgfapi, so it seems to me more like a gluster issue than a vdsm issue.
Do you have any idea why glfs_new() can fail, or what the root cause could be?


MainProcess|jsonrpc/2::ERROR::2020-04-14 09:07:35,085::supervdsm_server::97::SuperVdsm.ServerCallback::(wrapper) Error in volumeStatvfs
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/gluster/gfapi.py", line 234, in volumeStatvfs
    out = commands.run(command, env=env)
  File "/usr/lib/python3.6/site-packages/vdsm/common/commands.py", line 101, in run
    raise cmdutils.Error(args, p.returncode, out, err)
vdsm.common.cmdutils.Error: Command ['/usr/bin/python3', '-m', 'vdsm.gluster.gfapi', '-v', 'data', '-p', '24007', '-H', 'localhost', '-t', 'tcp', '-c', 'statvfs'] failed with rc=1 out=b'' err=b'Traceback (most recent call last):\n  File "/usr/lib64/python3.6/runpy.py", line 193, in _run_module_as_main\n    "__main__", mod_spec)\n  File "/usr/lib64/python3.6/runpy.py", line 85, in _run_code\n    exec(code, run_globals)\n  File "/usr/lib/python3.6/site-packages/vdsm/gluster/gfapi.py", line 296, in <module>\n    int(args.port), args.protocol)\n  File "/usr/lib/python3.6/site-packages/vdsm/gluster/gfapi.py", line 108, in volumeStatvfsGet\n    fs = glfsInit(volumeId, host, port, protocol)\n  File "/usr/lib/python3.6/site-packages/vdsm/gluster/gfapi.py", line 66, in glfsInit\n    fs = _glfs_new(volumeId)\nctypes.ArgumentError: argument 1: <class \'TypeError\'>: wrong type\n'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/supervdsm_server.py", line 95, in wrapper
    res = func(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/gluster/gfapi.py", line 236, in volumeStatvfs
    raise ge.GlfsStatvfsException(e.rc, [e.err])
vdsm.gluster.exception.GlfsStatvfsException: Failed to get Gluster volume Size info: rc=1 out=[b'Traceback (most recent call last):\n  File "/usr/lib64/python3.6/runpy.py", line 193, in _run_module_as_main\n    "__main__", mod_spec)\n  File "/usr/lib64/python3.6/runpy.py", line 85, in _run_code\n    exec(code, run_globals)\n  File "/usr/lib/python3.6/site-packages/vdsm/gluster/gfapi.py", line 296, in <module>\n    int(args.port), args.protocol)\n  File "/usr/lib/python3.6/site-packages/vdsm/gluster/gfapi.py", line 108, in volumeStatvfsGet\n    fs = glfsInit(volumeId, host, port, protocol)\n  File "/usr/lib/python3.6/site-packages/vdsm/gluster/gfapi.py", line 66, in glfsInit\n    fs = _glfs_new(volumeId)\nctypes.ArgumentError: argument 1: <class \'TypeError\'>: wrong type\n'] err=()
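
For reference, the ctypes.ArgumentError above ("argument 1: <class 'TypeError'>: wrong type") is the classic Python 3 failure mode of passing a str where a c_char_p argument (bytes) is expected, which is what _glfs_new(volumeId) appears to hit. A minimal illustration of just that failure mode, not the vdsm code itself:

import ctypes

libc = ctypes.CDLL(None)                 # load libc symbols
strlen = libc.strlen
strlen.argtypes = [ctypes.c_char_p]      # same situation as glfs_new(const char *volname)
strlen.restype = ctypes.c_size_t

try:
    strlen("engine")                     # Python 3 str is rejected by c_char_p
except ctypes.ArgumentError as e:
    print(e)                             # argument 1: <class 'TypeError'>: wrong type

print(strlen(b"engine"))                 # passing bytes (e.g. volumeId.encode()) works: 6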

Comment 14 SATHEESARAN 2020-04-16 14:54:30 UTC
(In reply to Vojtech Juranek from comment #13)
> I see some gluster related exception in supervdsm log (errors in vdsm log
> basically means that vdsm cannot read from gluster), full exception is
> bellow.
> It basically fails when calling glfs_new() function from libgfapi. So it
> seems to me rather gluster issue than vdsm issue.
> Do you have any idea why glfs_new() can fail or what could be the root cuase?
> 
> 
> MainProcess|jsonrpc/2::ERROR::2020-04-14
> 09:07:35,085::supervdsm_server::97::SuperVdsm.ServerCallback::(wrapper)
> Error in volumeStatvfs
> Traceback (most recent call last):
>   File "/usr/lib/python3.6/site-packages/vdsm/gluster/gfapi.py", line 234,
> in volumeStatvfs
>     out = commands.run(command, env=env)
>   File "/usr/lib/python3.6/site-packages/vdsm/common/commands.py", line 101,
> in run
>     raise cmdutils.Error(args, p.returncode, out, err)
> vdsm.common.cmdutils.Error: Command ['/usr/bin/python3', '-m',
> 'vdsm.gluster.gfapi', '-v', 'data', '-p', '24007', '-H', 'localhost', '-t',
> 'tcp', '-c', 'statvfs'] failed with rc=1 out=b'' err=b'Traceback (most
> recent call last):\n  File "/usr/lib64/python3.6/runpy.py", line 193, in
> _run_module_as_main\n    "__main__", mod_spec)\n  File
> "/usr/lib64/python3.6/runpy.py", line 85, in _run_code\n    exec(code,
> run_globals)\n  File
> "/usr/lib/python3.6/site-packages/vdsm/gluster/gfapi.py", line 296, in
> <module>\n    int(args.port), args.protocol)\n  File
> "/usr/lib/python3.6/site-packages/vdsm/gluster/gfapi.py", line 108, in
> volumeStatvfsGet\n    fs = glfsInit(volumeId, host, port, protocol)\n  File
> "/usr/lib/python3.6/site-packages/vdsm/gluster/gfapi.py", line 66, in
> glfsInit\n    fs = _glfs_new(volumeId)\nctypes.ArgumentError: argument 1:
> <class \'TypeError\'>: wrong type\n'
> 
> During handling of the above exception, another exception occurred:
> 
> Traceback (most recent call last):
>   File "/usr/lib/python3.6/site-packages/vdsm/supervdsm_server.py", line 95,
> in wrapper
>     res = func(*args, **kwargs)
>   File "/usr/lib/python3.6/site-packages/vdsm/gluster/gfapi.py", line 236,
> in volumeStatvfs
>     raise ge.GlfsStatvfsException(e.rc, [e.err])
> vdsm.gluster.exception.GlfsStatvfsException: Failed to get Gluster volume
> Size info: rc=1 out=[b'Traceback (most recent call last):\n  File
> "/usr/lib64/python3.6/runpy.py", line 193, in _run_module_as_main\n   
> "__main__", mod_spec)\n  File "/usr/lib64/python3.6/runpy.py", line 85, in
> _run_code\n    exec(code, run_globals)\n  File
> "/usr/lib/python3.6/site-packages/vdsm/gluster/gfapi.py", line 296, in
> <module>\n    int(args.port), args.protocol)\n  File
> "/usr/lib/python3.6/site-packages/vdsm/gluster/gfapi.py", line 108, in
> volumeStatvfsGet\n    fs = glfsInit(volumeId, host, port, protocol)\n  File
> "/usr/lib/python3.6/site-packages/vdsm/gluster/gfapi.py", line 66, in
> glfsInit\n    fs = _glfs_new(volumeId)\nctypes.ArgumentError: argument 1:
> <class \'TypeError\'>: wrong type\n'] err=()


Vojtech,

Thanks once again !

I am not sure about these errors, but looking at them, it appears that vdsm is making use of the 'gfapi' interface for gluster.
I will loop in the gluster and RHHI developers to understand what's going wrong here.


@Gobinda, do you know why VDSM is trying to make use of gfapi?
Or anyone else who knows about it.

Comment 15 Gobinda Das 2020-04-17 06:14:10 UTC
I think gfapi is used for accessing the SD, but I am not sure.
@kaustav, can you please check this as a priority?

Comment 16 Gobinda Das 2020-04-17 10:10:07 UTC
Hi Vojtech,
The error you mentioned above,
File "/usr/lib/python3.6/site-packages/vdsm/gluster/gfapi.py", line 234, in volumeStatvfs
occurs when we use gfapi to find the size of a volume;
these errors don't have anything to do with reading from the storage domain, IMO.

But I am seeing the error below frequently in the vdsm log. Is there anything to look into for this error?

2020-04-15 09:44:33,838+0000 ERROR (qgapoller/1) [virt.periodic.Operation] <bound method QemuGuestAgentPoller._poller of <vdsm.virt.qemuguestagent.QemuGuestAgentPoller object at 0x7f7e5c2c5a20>> operation failed (periodic:188)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/virt/periodic.py", line 186, in __call__
    self._func()
  File "/usr/lib/python3.6/site-packages/vdsm/virt/qemuguestagent.py", line 316, in _poller
    self._qga_capability_check(vm_obj)
  File "/usr/lib/python3.6/site-packages/vdsm/virt/qemuguestagent.py", line 539, in _qga_capability_check
    ret = self.call_qga_command(vm, _QEMU_GUEST_INFO_COMMAND)
  File "/usr/lib/python3.6/site-packages/vdsm/virt/qemuguestagent.py", line 265, in call_qga_command
    ret = vm.qemu_agent_command(cmd, _COMMAND_TIMEOUT, 0)
  File "/usr/lib/python3.6/site-packages/vdsm/virt/vm.py", line 5606, in qemu_agent_command
    self._dom, command, timeout, flags)
  File "/usr/lib64/python3.6/site-packages/libvirt_qemu.py", line 39, in qemuAgentCommand
    ret = libvirtmod_qemu.virDomainQemuAgentCommand(domain._o, cmd, timeout, flags)
  File "/usr/lib/python3.6/site-packages/vdsm/virt/virdomain.py", line 50, in __getattr__
    % self.vmid)
vdsm.virt.virdomain.NotConnectedError: VM '41a0d36b-214a-4702-b068-99c12c486c0d' was not defined yet or was undefined


2020-04-15 08:20:08,213+0000 ERROR (qgapoller/3) [virt.periodic.Operation] <bound method QemuGuestAgentPoller._poller of <vdsm.virt.qemuguestagent.QemuGuestAgentPoller object at 0x7f7e5c2c5a20>> operation failed (periodic:188)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/virt/periodic.py", line 186, in __call__
    self._func()
  File "/usr/lib/python3.6/site-packages/vdsm/virt/qemuguestagent.py", line 316, in _poller
    self._qga_capability_check(vm_obj)
  File "/usr/lib/python3.6/site-packages/vdsm/virt/qemuguestagent.py", line 539, in _qga_capability_check
    ret = self.call_qga_command(vm, _QEMU_GUEST_INFO_COMMAND)
  File "/usr/lib/python3.6/site-packages/vdsm/virt/qemuguestagent.py", line 265, in call_qga_command
    ret = vm.qemu_agent_command(cmd, _COMMAND_TIMEOUT, 0)
  File "/usr/lib/python3.6/site-packages/vdsm/virt/vm.py", line 5606, in qemu_agent_command
    self._dom, command, timeout, flags)
  File "/usr/lib64/python3.6/site-packages/libvirt_qemu.py", line 39, in qemuAgentCommand
    ret = libvirtmod_qemu.virDomainQemuAgentCommand(domain._o, cmd, timeout, flags)
  File "/usr/lib/python3.6/site-packages/vdsm/virt/virdomain.py", line 50, in __getattr__
    % self.vmid)
vdsm.virt.virdomain.NotConnectedError: VM '41a0d36b-214a-4702-b068-99c12c486c0d' was not defined yet or was undefined

Comment 17 Nir Soffer 2020-04-17 11:07:49 UTC
(In reply to SATHEESARAN from comment #3)
> As I looked in to the strace logs, all the reads that happened on the VM
> image file, is using 1 byte block size only, there is no subsequent 512 byte
> block reads.

Are you looking at all calls or only failed calls?
 
The expected flow is:

read 1 byte -> fails with EINVAL
read 512 bytes -> ok

This is expected each time qemu opens the image.

But I don't think this is the issue; it looks like the vm disk is corrupted.

I don't think this is a vdsm or storage bug. This is a hosted engine setup bug.

Hosted engine maintainers should debug the installation and make sure the vm disk
is copied correctly to storage.

To verify that the disk was copied correctly, we can run this after the copy:

    qemu-img compare src.img dst.img

It would be useful to add this step in a special debug mode so that debugging deployment
issues is easier.

If the image was copied correctly but the vm will not boot, this is a virt issue.
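
For reference, a minimal sketch of the block-size probe described above (an illustration only, assuming Python >= 3.7 for os.preadv; this is not qemu's or vdsm's actual code): with O_DIRECT, an unaligned 1-byte read fails with EINVAL, after which aligned reads of 512 and then 4096 bytes reveal the sector size. This is what produces the 1-byte pread EINVAL entries seen in the brick strace in comment 3.

import errno, mmap, os

def probe_block_size(path):
    fd = os.open(path, os.O_RDONLY | os.O_DIRECT)
    try:
        try:
            # deliberately unaligned: the 1-byte pread seen failing with
            # EINVAL in the brick strace
            os.pread(fd, 1, 0)
            return 1                           # storage does not enforce alignment
        except OSError as e:
            if e.errno != errno.EINVAL:
                raise
        for size in (512, 4096):
            buf = mmap.mmap(-1, size)          # page-aligned buffer for O_DIRECT
            try:
                if os.preadv(fd, [buf], 0) == size:
                    return size                # first aligned size that works
            except OSError as e:
                if e.errno != errno.EINVAL:
                    raise
            finally:
                buf.close()
        raise RuntimeError("could not detect sector size")
    finally:
        os.close(fd)

print(probe_block_size("/gluster_bricks/engine/test"))   # 512 on these bricks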

Comment 18 SATHEESARAN 2020-04-20 14:52:01 UTC
(In reply to Nir Soffer from comment #17)
> (In reply to SATHEESARAN from comment #3)
> > As I looked in to the strace logs, all the reads that happened on the VM
> > image file, is using 1 byte block size only, there is no subsequent 512 byte
> > block reads.
> 
> Are you looking at all calls or only failed calls?
>  
> The expected flow is:
> 
> read 1 byte -> fails with EINVAL
> read 512 bytes -> ok

Hello Nir,

I was grepping for all preads on one brick for the GFID of the image file used by the HE VM.
The second 'read 512 bytes' never happens.
I can also check by collecting strace output from all the bricks.

> 
> This is expected each time qemu opens the image.
> 
> But I don't think this is the issue, it looks like the vm disk is corrupted.
> 
> I don't this is vdsm or storage bug. This is hosted engine setup bug.
> 
> Hosted engine maintainers should debug the installation and make sure the vm
> disk
> is copied correctly to storage.
> 
> To verify that the disk was copied correctly, we can run this after the copy:
> 
>     qemu-img compare src.img dst.img
> 
> It would be useful to add this step in special debug mode so debugging
> deployment
> issue is easier.
> 
> If the image was copied correctly but the vm will not boot this is a virt
> issue.

I checked this; both images are the same:
[root@ ~]# qemu-img compare /var/tmp/localvmozx2vt8k/images/d4945538-65e8-400f-bda9-0e95f36278e3/532a5b0a-d368-47de-a0b7-ce8fc24042a6 /rhev/data-center/mnt/glusterSD/rhsqa-grafton7.lab.eng.blr.redhat.com\:_engine/7c084fc6-d2d0-4c92-846c-aa22c9de2c5f/images/eb7e672c-13d1-442a-abe6-3376d38c3cbe/245616af-20bc-4bf7-9bbd-9b0aaa5ad38f
Images are identical.

Comment 19 SATHEESARAN 2020-04-21 08:00:24 UTC
There is a discrepancy in the block_size that I see in the logs.
The actual gluster brick of the engine volume is a 512B block-size device, but the logs
report that it is a 4096B block-size device.

<snip_from_create_target_vm.log>
    {
        "block_available": 26008004,
        "block_size": 4096,
        "block_total": 26200576,
        "block_used": 192572,
        "device": "/dev/mapper/gluster_vg_sdb-gluster_lv_engine",
        "fstype": "xfs",
        "inode_available": 52423094,
        "inode_total": 52426752,
        "inode_used": 3658,
        "mount": "/gluster_bricks/engine",
        "options": "rw,seclabel,noatime,nodiratime,attr2,inode64,logbsize=256k,sunit=512,swidth=5120,noquota",
        "size_available": 106528784384,
        "size_total": 107317559296,
        "uuid": "34d4cd11-2347-441a-b4cc-fbabc5fb7e1c"
    },
</snip_from_create_target_vm.log>

[root@ ~]# blockdev  --getss /dev/mapper/gluster_vg_sdb-gluster_lv_engine
512

@Vojta, do you know why ?
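
One likely explanation, offered here as an assumption: the 4096 value in the ansible facts comes from statvfs() on the mounted XFS brick (f_bsize, the filesystem block size), while "blockdev --getss" reports the device's logical sector size, so the two numbers answer different questions. A minimal sketch comparing them (paths taken from this setup):

import fcntl, os, struct

BLKSSZGET = 0x1268   # Linux ioctl that returns the logical sector size

def fs_block_size(mountpoint):
    # what the ansible mount facts report ("block_size": 4096)
    return os.statvfs(mountpoint).f_bsize

def logical_sector_size(device):
    # what `blockdev --getss` reports (512 here)
    with open(device, "rb") as dev:
        buf = fcntl.ioctl(dev, BLKSSZGET, struct.pack("i", 0))
        return struct.unpack("i", buf)[0]

print(fs_block_size("/gluster_bricks/engine"))
print(logical_sector_size("/dev/mapper/gluster_vg_sdb-gluster_lv_engine"))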

Comment 20 Vojtech Juranek 2020-04-21 09:47:55 UTC
Unfortunately no idea, maybe Simone knows.

From the storage point of view everything looks good; the storage block size is really 512:

[root@rhsqa-grafton7-nic2 ~]# dd if=/dev/zero of=/gluster_bricks/engine/test oflag=direct bs=512 count=1
1+0 records in
1+0 records out
512 bytes copied, 0.000196591 s, 2.6 MB/s

The storage domain block size was determined by autodetection (block_size=0 at creation):

2020-04-21 06:40:33,941+0000 INFO  (jsonrpc/7) [vdsm.api] START createStorageDomain(storageType=7, sdUUID='40ec7af8-085b-47fc-a5d4-2bee467a2da6', domainName='hosted_storage', typeSpecificArg='rhsqa-grafton7.lab.eng.blr.redhat.com:/engine', domClass=1, domVersion='5', block_size=0, max_hosts=250, options=None) from=::ffff:192.168.222.150,43316, flow_id=371bbc8d, task_id=5b4349e4-7181-48bc-b88e-baa2bc1b86cf (api:48)

We don't have debug logs, but vdsm detected the block size correctly, as the SD metadata has 512 as the block size:

[root@rhsqa-grafton7-nic2 ~]# cat /rhev/data-center/mnt/glusterSD/rhsqa-grafton7.lab.eng.blr.redhat.com:_engine/40ec7af8-085b-47fc-a5d4-2bee467a2da6/dom_md/metadata
ALIGNMENT=1048576
BLOCK_SIZE=512
CLASS=Data
DESCRIPTION=hosted_storage
IOOPTIMEOUTSEC=10
LEASERETRIES=3
LEASETIMESEC=60
LOCKPOLICY=
LOCKRENEWALINTERVALSEC=5
MASTER_VERSION=1
POOL_DESCRIPTION=Default
POOL_DOMAINS=40ec7af8-085b-47fc-a5d4-2bee467a2da6:Active
POOL_SPM_ID=-1
POOL_SPM_LVER=-1
POOL_UUID=40bf7c46-832d-11ea-8993-004755204901
REMOTE_PATH=rhsqa-grafton7.lab.eng.blr.redhat.com:/engine
ROLE=Master
SDUUID=40ec7af8-085b-47fc-a5d4-2bee467a2da6
TYPE=GLUSTERFS
VERSION=5
_SHA_CKSUM=64b5a6020db8b9095a7052e3403d7b327dfcf6f8


So it seems to me more like a virt issue. I don't see any other issue in the vdsm log.
Right after creating the HE VM I see this warning in the vdsm log, but I am not sure if it's a real issue (probably not; if it is, it again looks like a virt issue):

2020-04-21 07:04:24,654+0000 WARN  (vm/8672a439) [virt.vm] (vmId='8672a439-ca76-4ad1-998f-8cdd96d54893') Cannot find device alias for _conf:{'type': 'lease', 'device': 'lease', 'lease_id': '2c9a272d-c57e-48a2-b508-3519098d51e9', 'sd_id': '40ec7af8-085b-47fc-a5d4-2bee467a2da6', 'path': 'LEASE-PATH:2c9a272d-c57e-48a2-b508-3519098d51e9:40ec7af8-085b-47fc-a5d4-2bee467a2da6', 'offset': 'LEASE-OFFSET:2c9a272d-c57e-48a2-b508-3519098d51e9:40ec7af8-085b-47fc-a5d4-2bee467a2da6'} _deviceXML:None alias:None config:<bound method Base.config of <lease.Device sd_id=40ec7af8-085b-47fc-a5d4-2bee467a2da6, lease_id=2c9a272d-c57e-48a2-b508-3519098d51e9, path=LEASE-PATH:2c9a272d-c57e-48a2-b508-3519098d51e9:40ec7af8-085b-47fc-a5d4-2bee467a2da6, offset=LEASE-OFFSET:2c9a272d-c57e-48a2-b508-3519098d51e9:40ec7af8-085b-47fc-a5d4-2bee467a2da6 at 0x7fbfe4113a98>> createXmlElem:<bound method Device.createXmlElem of <lease.Device sd_id=40ec7af8-085b-47fc-a5d4-2bee467a2da6, lease_id=2c9a272d-c57e-48a2-b508-3519098d51e9, path=LEASE-PATH:2c9a272d-c57e-48a2-b508-3519098d51e9:40ec7af8-085b-47fc-a5d4-2bee467a2da6, offset=LEASE-OFFSET:2c9a272d-c57e-48a2-b508-3519098d51e9:40ec7af8-085b-47fc-a5d4-2bee467a2da6 at 0x7fbfe4113a98>> custom:{} device:lease deviceId:None deviceType:None from_xml_tree:<bound method Device.from_xml_tree of <class 'vdsm.virt.vmdevices.lease.Device'>> getXML:<bound method Device.getXML of <lease.Device sd_id=40ec7af8-085b-47fc-a5d4-2bee467a2da6, lease_id=2c9a272d-c57e-48a2-b508-3519098d51e9, path=LEASE-PATH:2c9a272d-c57e-48a2-b508-3519098d51e9:40ec7af8-085b-47fc-a5d4-2bee467a2da6, offset=LEASE-OFFSET:2c9a272d-c57e-48a2-b508-3519098d51e9:40ec7af8-085b-47fc-a5d4-2bee467a2da6 at 0x7fbfe4113a98>> get_extra_xmls:<bound method Base.get_extra_xmls of <lease.Device sd_id=40ec7af8-085b-47fc-a5d4-2bee467a2da6, lease_id=2c9a272d-c57e-48a2-b508-3519098d51e9, path=LEASE-PATH:2c9a272d-c57e-48a2-b508-3519098d51e9:40ec7af8-085b-47fc-a5d4-2bee467a2da6, offset=LEASE-OFFSET:2c9a272d-c57e-48a2-b508-3519098d51e9:40ec7af8-085b-47fc-a5d4-2bee467a2da6 at 0x7fbfe4113a98>> get_identifying_attrs:<bound method Device.get_identifying_attrs of <class 'vdsm.virt.vmdevices.lease.Device'>> get_metadata:<bound method Base.get_metadata of <lease.Device sd_id=40ec7af8-085b-47fc-a5d4-2bee467a2da6, lease_id=2c9a272d-c57e-48a2-b508-3519098d51e9, path=LEASE-PATH:2c9a272d-c57e-48a2-b508-3519098d51e9:40ec7af8-085b-47fc-a5d4-2bee467a2da6, offset=LEASE-OFFSET:2c9a272d-c57e-48a2-b508-3519098d51e9:40ec7af8-085b-47fc-a5d4-2bee467a2da6 at 0x7fbfe4113a98>> hotunplug_event:<threading.Event object at 0x7fbfc476e6d8> is_hostdevice:False lease_id:2c9a272d-c57e-48a2-b508-3519098d51e9 log:<SimpleLogAdapter virt.vm (INFO)> offset:LEASE-OFFSET:2c9a272d-c57e-48a2-b508-3519098d51e9:40ec7af8-085b-47fc-a5d4-2bee467a2da6 path:LEASE-PATH:2c9a272d-c57e-48a2-b508-3519098d51e9:40ec7af8-085b-47fc-a5d4-2bee467a2da6 sd_id:40ec7af8-085b-47fc-a5d4-2bee467a2da6 setup:<bound method Base.setup of <lease.Device sd_id=40ec7af8-085b-47fc-a5d4-2bee467a2da6, lease_id=2c9a272d-c57e-48a2-b508-3519098d51e9, path=LEASE-PATH:2c9a272d-c57e-48a2-b508-3519098d51e9:40ec7af8-085b-47fc-a5d4-2bee467a2da6, offset=LEASE-OFFSET:2c9a272d-c57e-48a2-b508-3519098d51e9:40ec7af8-085b-47fc-a5d4-2bee467a2da6 at 0x7fbfe4113a98>> specParams:{} teardown:<bound method Base.teardown of <lease.Device sd_id=40ec7af8-085b-47fc-a5d4-2bee467a2da6, lease_id=2c9a272d-c57e-48a2-b508-3519098d51e9, path=LEASE-PATH:2c9a272d-c57e-48a2-b508-3519098d51e9:40ec7af8-085b-47fc-a5d4-2bee467a2da6, 
offset=LEASE-OFFSET:2c9a272d-c57e-48a2-b508-3519098d51e9:40ec7af8-085b-47fc-a5d4-2bee467a2da6 at 0x7fbfe4113a98>> type:lease update_device_info:<bound method Device.update_device_info of <class 'vdsm.virt.vmdevices.lease.Device'>> vmid:None (core:329)

Comment 21 Tal Nisan 2020-04-21 11:15:49 UTC
Ryan, can someone from Virt have a look?

Comment 22 Ryan Barry 2020-04-21 11:22:33 UTC
I'm talking to Sas right now. The VM actually starts, but the disks are not attached (it goes to a grub menu). This is likely not virt. I'll look at the logs, but my suspicion is that it's actually gluster storage, based on the fact that the disk sums are identical after copy, the VM comes up, etc. Sas has a live environment which I'll look at, but it would be helpful for someone from storage and gluster to also look at it.

Gluster is not healthy:
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/gluster/cli.py", line 111, in _execGluster
    return commands.run(cmd)
  File "/usr/lib/python3.6/site-packages/vdsm/common/commands.py", line 101, in run
    raise cmdutils.Error(args, p.returncode, out, err)
vdsm.common.cmdutils.Error: Command ['/usr/sbin/gluster', '--mode=script', 'volume', 'geo-replication', 'status', '--xml'] failed with rc=2 out=b'geo-replication command failed\n' err=b''

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/gluster/cli.py", line 1222, in volumeGeoRepStatus
    xmltree = _execGlusterXml(command)
  File "/usr/lib/python3.6/site-packages/vdsm/gluster/cli.py", line 134, in _execGlusterXml
    return _getTree(_execGluster(cmd))
  File "/usr/lib/python3.6/site-packages/vdsm/gluster/cli.py", line 113, in _execGluster
    raise ge.GlusterCmdFailedException(rc=e.rc, err=[e.msg])
vdsm.gluster.exception.GlusterCmdFailedException: Command failed: rc=2 out=() err=['Command {self.cmd} failed with rc={self.rc} out={self.out!r} err={self.err!r}']

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/supervdsm_server.py", line 95, in wrapper
    res = func(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/gluster/cli.py", line 1224, in volumeGeoRepStatus
    raise ge.GlusterGeoRepStatusFailedException(rc=e.rc, err=e.err)
vdsm.gluster.exception.GlusterGeoRepStatusFailedException: Geo Rep status failed: rc=2 out=() err=['Command {self.cmd} failed with rc={self.rc} out={self.out!r} err={self.err!r}']

Comment 23 Ryan Barry 2020-04-21 11:46:34 UTC
Looked at the logs.

Libvirt XML looks ok, and there are no exceptions in the libvirt logs. The disk is held by the qemu process (ownership is ok), there's just no additional read data. Not virt. Nir, any additional thoughts?

Comment 25 SATHEESARAN 2020-04-21 12:27:47 UTC
(In reply to Ryan Barry from comment #22)
> I'm talking to Sas right now. THe VM actually starts, but the disks are not
> attached (it goes to a grub menu). This is likely not virt. I'll look at the
> logs, but my suspicion is that it's actually gluster storage based on the
> fact that the disk sums are identical after copy, the VM comes up, etc. Sas
> has a live environment which I'll look at, but it would be helpful for
> someone from storage and gluster to also look at it.
> 
> Gluster is not healthy:
> Traceback (most recent call last):
>   File "/usr/lib/python3.6/site-packages/vdsm/gluster/cli.py", line 111, in
> _execGluster
>     return commands.run(cmd)
>   File "/usr/lib/python3.6/site-packages/vdsm/common/commands.py", line 101,
> in run
>     raise cmdutils.Error(args, p.returncode, out, err)
> vdsm.common.cmdutils.Error: Command ['/usr/sbin/gluster', '--mode=script',
> 'volume', 'geo-replication', 'status', '--xml'] failed with rc=2
> out=b'geo-replication command failed\n' err=b''
> 
> During handling of the above exception, another exception occurred:
> 
> Traceback (most recent call last):
>   File "/usr/lib/python3.6/site-packages/vdsm/gluster/cli.py", line 1222, in
> volumeGeoRepStatus
>     xmltree = _execGlusterXml(command)
>   File "/usr/lib/python3.6/site-packages/vdsm/gluster/cli.py", line 134, in
> _execGlusterXml
>     return _getTree(_execGluster(cmd))
>   File "/usr/lib/python3.6/site-packages/vdsm/gluster/cli.py", line 113, in
> _execGluster
>     raise ge.GlusterCmdFailedException(rc=e.rc, err=[e.msg])
> vdsm.gluster.exception.GlusterCmdFailedException: Command failed: rc=2
> out=() err=['Command {self.cmd} failed with rc={self.rc} out={self.out!r}
> err={self.err!r}']
> 
> During handling of the above exception, another exception occurred:
> 
> Traceback (most recent call last):
>   File "/usr/lib/python3.6/site-packages/vdsm/supervdsm_server.py", line 95,
> in wrapper
>     res = func(*args, **kwargs)
>   File "/usr/lib/python3.6/site-packages/vdsm/gluster/cli.py", line 1224, in
> volumeGeoRepStatus
>     raise ge.GlusterGeoRepStatusFailedException(rc=e.rc, err=e.err)
> vdsm.gluster.exception.GlusterGeoRepStatusFailedException: Geo Rep status
> failed: rc=2 out=() err=['Command {self.cmd} failed with rc={self.rc}
> out={self.out!r} err={self.err!r}']

This is because no geo-replication is configured so far, and for that reason the command fails
and returns rc=2:
[root@ ~]# gluster volume geo-replication status --xml
geo-replication command failed
[root@ ~]# echo $?
2

I will ask the RHHI developers to fix this. In any case, that is a separate traceback, not really
associated with this failure of the HE VM to start on the gluster storage domain.

Comment 26 SATHEESARAN 2020-04-21 13:13:02 UTC
Created attachment 1680554 [details]
Hosted engineVM configXML

Comment 28 SATHEESARAN 2020-04-24 15:25:02 UTC
Created attachment 1681558 [details]
HostedEngineLocal XML Definition from retest on 24April2020

Comment 29 SATHEESARAN 2020-04-24 15:26:20 UTC
Created attachment 1681559 [details]
HostedEngine XML Definition from retest on 24April2020

Comment 30 SATHEESARAN 2020-04-24 15:27:09 UTC
Created attachment 1681560 [details]
HostedEngineLocal VM QEMU log from retest on 24April2020

Comment 31 SATHEESARAN 2020-04-24 15:27:44 UTC
Created attachment 1681561 [details]
HostedEngine VM QEMU log from retest on 24April2020

Comment 32 Nir Soffer 2020-04-27 19:31:37 UTC
I checked the environment on Apr 24.

First we verified that the image on shared storage is identical to
the original image used to bootstrap the vm. We added a new task
to the ansible script that copies the image:

  - name: Verify copy of VM disk
    command: qemu-img compare {{ local_vm_disk_path }} {{ he_virtio_disk_path }}
    environment: "{{ he_cmd_lang }}"
    become: true
    become_user: vdsm
    become_method: sudo
    changed_when: true

This runs right after the task copying the image from the bootstrap image
to the gluster mount.

According to ansible log the check was successful:

Apr 24 14:11:45 rhsqa-grafton7-nic2 platform-python[2483803]: ansible-command Invoked with _raw_params=qemu-img compare /var/tmp/localvm2k8qg0na/images/d4945538-65e8-400f-bda9-0e95f36278e3/532a5b0a-d368-47de-a0b7-ce8fc24042a6 /rhev/data-center/mnt/glusterSD/rhsqa-grafton7.lab.eng.blr.redhat.com:_engine/517783ed-8479-4ccc-b0c2-c161764cfd93/images/106d9b8f-6bb7-4622-9ccf-21cf217bbcea/91a6eff5-ffa2-4a96-8eef-b02a863fddf1 warn=True _uses_shell=False stdin_add_newline=True strip_empty_ends=True argv=None chdir=None executable=None creates=None removes=None stdin=None

So we don't have any issue creating this image.


Next I checked the image while the VM was running (stuck in grub rescue shell).

# qemu-img info 91a6eff5-ffa2-4a96-8eef-b02a863fddf1
image: 91a6eff5-ffa2-4a96-8eef-b02a863fddf1
file format: raw
virtual size: 58 GiB (62277025792 bytes)
disk size: 3.82 GiB

# qemu-img compare 91a6eff5-ffa2-4a96-8eef-b02a863fddf1 /var/tmp/localvm2k8qg0na/images/d4945538-65e8-400f-bda9-0e95f36278e3/532a5b0a-d368-47de-a0b7-ce8fc24042a6

# cat 91a6eff5-ffa2-4a96-8eef-b02a863fddf1.meta 
CAP=62277025792
CTIME=1587737303
DESCRIPTION={"DiskAlias":"he_virtio_disk","DiskDescription":"Hosted-Engine disk"}
DISKTYPE=HEVD
DOMAIN=517783ed-8479-4ccc-b0c2-c161764cfd93
FORMAT=RAW
GEN=0
IMAGE=106d9b8f-6bb7-4622-9ccf-21cf217bbcea
LEGALITY=LEGAL
PUUID=00000000-0000-0000-0000-000000000000
TYPE=SPARSE
VOLTYPE=LEAF
EOF

qemu command line:
-blockdev '{"driver":"file","filename":"/var/run/vdsm/storage/517783ed-8479-4ccc-b0c2-c161764cfd93/106d9b8f-6bb7-4622-9ccf-21cf217bbcea/91a6eff5-ffa2-4a96-8eef-b02a863fddf1","aio":"threads","node-name":"libvirt-1-storage","cache":{"direct":true,"no-flush":false},"auto-read-only":true,"discard":"unmap"}' \
-blockdev '{"node-name":"libvirt-1-format","read-only":false,"cache":{"direct":true,"no-flush":false},"driver":"raw","file":"libvirt-1-storage"}' \

Everything look normal, but booting from this image fails.

Next I moved hosted engine to global maintenance and shut down the VM:

# hosted-engine --set-maintenance --mode=global
# hosted-engine --vm-poweroff

I tested that we can boot from this image using an overlay:

qemu-img create -f qcow2 -b /rhev/data-center/mnt/glusterSD/rhsqa-grafton7.lab.eng.blr.redhat.com\:_engine/517783ed-8479-4ccc-b0c2-c161764cfd93/images/106d9b8f-6bb7-4622-9ccf-21cf217bbcea/91a6eff5-ffa2-4a96-8eef-b02a863fddf1 -F raw overlay.qcow2

# /usr/libexec/qemu-kvm -m 4g -nographic -drive file=overlay.qcow2,format=qcow2,cache=none

VM boots properly.

And that we can boot from a copy of this image:

# mkdir /rhev/data-center/mnt/glusterSD/rhsqa-grafton7.lab.eng.blr.redhat.com\:_engine/517783ed-8479-4ccc-b0c2-c161764cfd93/tmp

# qemu-img convert -p -f raw -O raw -t none -T none /rhev/data-center/mnt/glusterSD/rhsqa-grafton7.lab.eng.blr.redhat.com\:_engine/517783ed-8479-4ccc-b0c2-c161764cfd93/images/106d9b8f-6bb7-4622-9ccf-21cf217bbcea/91a6eff5-ffa2-4a96-8eef-b02a863fddf1 /rhev/data-center/mnt/glusterSD/rhsqa-grafton7.lab.eng.blr.redhat.com\:_engine/517783ed-8479-4ccc-b0c2-c161764cfd93/tmp/copy-of-91a6eff5-ffa2-4a96-8eef-b02a863fddf1
    (100.00/100%)

# ls -lhs /rhev/data-center/mnt/glusterSD/rhsqa-grafton7.lab.eng.blr.redhat.com\:_engine/517783ed-8479-4ccc-b0c2-c161764cfd93/tmp/copy-of-91a6eff5-ffa2-4a96-8eef-b02a863fddf1
4.3G -rw-r--r--. 1 root root 58G Apr 24 16:06 /rhev/data-center/mnt/glusterSD/rhsqa-grafton7.lab.eng.blr.redhat.com:_engine/517783ed-8479-4ccc-b0c2-c161764cfd93/tmp/copy-of-91a6eff5-ffa2-4a96-8eef-b02a863fddf1

# /usr/libexec/qemu-kvm -m 4g -nographic -drive file=/rhev/data-center/mnt/glusterSD/rhsqa-grafton7.lab.eng.blr.redhat.com:_engine/517783ed-8479-4ccc-b0c2-c161764cfd93/tmp/copy-of-91a6eff5-ffa2-4a96-8eef-b02a863fddf1,format=raw,cache=none

VM boots properly

Then I tried to build a VM XML using virt-install and add stuff from the original
RHV XML. Finally I found that boot fails once I add <seclabel> to the disk.

This is the difference between an XML that boots and one that does not:

# diff -u reproduce-good.xml reproduce-bad.xml
--- reproduce-good.xml 2020-04-24 22:06:58.197707718 +0000
+++ reproduce-bad.xml 2020-04-24 22:10:40.979171859 +0000
@@ -70,6 +70,7 @@
     <disk type="file" device="disk" snapshot='no'>
       <driver name='qemu' type='raw' cache='none' error_policy='stop' io='threads' iothread='1'/>
       <source file="/rhev/data-center/mnt/glusterSD/rhsqa-grafton7.lab.eng.blr.redhat.com:_engine/517783ed-8479-4ccc-b0c2-c161764cfd93/tmp/copy-of-91a6eff5-ffa2-4a96-8eef-b02a863fddf1">
+        <seclabel model='dac' relabel='no'/>
       </source>
       <target dev="vda" bus="virtio"/>
     </disk>


Here is how I tested the XMLs. I repeated this test today (Apr 27)
to make sure ownership and permissions are correct.

# ls -lh images
lrwxrwxrwx. 1 root root 119 Apr 27 17:22 images ->
/rhev/data-center/mnt/glusterSD/rhsqa-grafton7.lab.eng.blr.redhat.com:_engine/517783ed-8479-4ccc-b0c2-c161764cfd93/tmp/

Before testing I changed ownership and permissions to match what
RHV is using, and ran restorecon to have the same selinux labels as other
images under /rhev/data-center.../images/...

# ls -lhZ images/
total 5.9G
-rw-rw-r--. 1 vdsm kvm system_u:object_r:fusefs_t:s0  58G Apr 24 22:08
copy-of-91a6eff5-ffa2-4a96-8eef-b02a863fddf1


virsh # define reproduce-good.xml
Domain reproduce defined from reproduce-good.xml

virsh # start reproduce
Domain reproduce started

# virsh console reproduce --safe
...
Red Hat Enterprise Linux 8.2 (Ootpa)
Kernel 4.18.0-193.el8.x86_64 on an x86_64

Web console: https://hostedenginesm3.lab.eng.blr.redhat.com:9090/

hostedenginesm3 login:

virsh # destroy reproduce
Domain reproduce destroyed

So this xml - without <seclabel> works.

Testing that images are still correct:

# ls -lhZ images/
total 5.9G
-rw-rw-r--. 1 vdsm kvm system_u:object_r:fusefs_t:s0  58G Apr 27 17:32
copy-of-91a6eff5-ffa2-4a96-8eef-b02a863fddf1


Now testing the bad xml, with <seclabel> - the same way RHV uses it:

virsh # define reproduce-bad.xml
Domain reproduce defined from reproduce-bad.xml

virsh # start reproduce
Domain reproduce started


# virsh console reproduce --safe
...
grub rescue> ls
(hd0) (hd0,msdos2) (hd0,msdos1)
grub rescue> ls (hd0,msdos1)/boot/grub
error: ../../grub-core/kern/disk_common.c:47:attempt to read or write outside of
 disk `hd0'.
error: ../../grub-core/kern/disk_common.c:47:attempt to read or write outside of
 disk `hd0'.

error: ../../grub-core/fs/fshelp.c:258:file `/boot/grub' not found.


# ls -lhZ images/
total 5.9G
-rw-rw-r--. 1 vdsm kvm system_u:object_r:fusefs_t:s0  58G Apr 27 17:32
copy-of-91a6eff5-ffa2-4a96-8eef-b02a863fddf1

We don't get any AVCs for the image, but we see:


# ausearch -m avc -ts today
----
time->Mon Apr 27 17:30:44 2020
type=AVC msg=audit(1588008644.879:25273): avc:  denied  { read write }
for  pid=4026503 comm="qemu-kvm" path="/dev/mapper/control"
dev="devtmpfs" ino=29741
scontext=system_u:system_r:svirt_t:s0:c401,c404
tcontext=system_u:object_r:lvm_control_t:s0 tclass=chr_file
permissive=0
----
time->Mon Apr 27 17:38:09 2020
type=AVC msg=audit(1588009089.828:25297): avc:  denied  { read write }
for  pid=4029172 comm="qemu-kvm" path="/dev/mapper/control"
dev="devtmpfs" ino=29741
scontext=system_u:system_r:svirt_t:s0:c123,c792
tcontext=system_u:object_r:lvm_control_t:s0 tclass=chr_file
permissive=0

# getfattr --dump images/copy-of-91a6eff5-ffa2-4a96-8eef-b02a863fddf1
(no output)

# setfattr -n name -v value
images/copy-of-91a6eff5-ffa2-4a96-8eef-b02a863fddf1; echo $?
setfattr: images/copy-of-91a6eff5-ffa2-4a96-8eef-b02a863fddf1:
Operation not supported
1

Looks like gluster does not support extended attributes (or it is not
configured properly).


I also verified that the "bad" xml works if we put an image on xfs.
See reproduce-xfs.xml.


Based on this I think this is libvirt bug, so moving to libvirt.

Comment 33 Nir Soffer 2020-04-27 19:33:16 UTC
Versions tested:
# rpm -q libvirt-daemon qemu-kvm
libvirt-daemon-6.0.0-17.module+el8.2.0+6257+0d066c28.x86_64
qemu-kvm-4.2.0-19.module+el8.2.0+6296+6b821950.x86_64

Comment 34 Nir Soffer 2020-04-27 19:38:19 UTC
Created attachment 1682250 [details]
Screenshot showing failed boot

Comment 35 Nir Soffer 2020-04-27 19:40:47 UTC
Created attachment 1682251 [details]
VM XML that boots with disk on gluster storage

Comment 36 Nir Soffer 2020-04-27 19:41:28 UTC
Created attachment 1682252 [details]
VM XML that does not boot with disk on gluster storage

Comment 37 Nir Soffer 2020-04-27 19:42:35 UTC
Created attachment 1682256 [details]
Same "bad" XML with disk on XFS file system, works

Comment 38 Yaniv Kaul 2020-04-28 09:11:25 UTC
(In reply to Gobinda Das from comment #15)
> I think gfapi used for accessing SD, not sure.
> @kaustav can you please check as a priority?

We should not be using gfapi (and it'd be good to remove it from the vdsm gluster code!)

Comment 39 SATHEESARAN 2020-04-28 14:45:57 UTC
(In reply to Nir Soffer from comment #32)
> # getfattr --dump images/copy-of-91a6eff5-ffa2-4a96-8eef-b02a863fddf1
> (no output)
> 
> # setfattr -n name -v value
> images/copy-of-91a6eff5-ffa2-4a96-8eef-b02a863fddf1; echo $?
> setfattr: images/copy-of-91a6eff5-ffa2-4a96-8eef-b02a863fddf1:
> Operation not supported
> 1
> 
> Looks like gluster does not support extended attributes (or it is not
> configured properly).

Gluster volumes do support XATTRs:

[root@]# df -h /rhev/data-center/mnt/glusterSD/rhsqa-grafton7.lab.eng.blr.redhat.com\:_engine/
Filesystem                                     Size  Used Avail Use% Mounted on
rhsqa-grafton7.lab.eng.blr.redhat.com:/engine  100G   13G   88G  13% /rhev/data-center/mnt/glusterSD/rhsqa-grafton7.lab.eng.blr.redhat.com:_engine

[root@ ~]# cd /rhev/data-center/mnt/glusterSD/rhsqa-grafton7.lab.eng.blr.redhat.com\:_engine/517783ed-8479-4ccc-b0c2-c161764cfd93/tmp/
[root@ ~]# touch file
[root@ ~]# setfattr -n user.name -v value file
[root@ ~]#
 
[root@ ~]# getfattr -d -m. file
# file: file
security.selinux="system_u:object_r:fusefs_t:s0"
user.name="value"
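
A side note on the earlier setfattr failure in comment 32: on Linux, extended attribute names must carry a namespace prefix (user., trusted., security., system.); an unqualified name like "name" is rejected with "Operation not supported" even on filesystems that fully support xattrs, which is consistent with user.name working here. A minimal sketch (the path reuses the test file created above):

import errno, os

path = "/rhev/data-center/mnt/glusterSD/rhsqa-grafton7.lab.eng.blr.redhat.com:_engine/517783ed-8479-4ccc-b0c2-c161764cfd93/tmp/file"

try:
    os.setxattr(path, "name", b"value")       # no namespace prefix: rejected
except OSError as e:
    assert e.errno in (errno.ENOTSUP, errno.EOPNOTSUPP)

os.setxattr(path, "user.name", b"value")      # namespaced attribute: works
print(os.getxattr(path, "user.name"))         # b'value'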

Comment 40 Michal Privoznik 2020-04-28 15:48:15 UTC
Nir, Satheesaran,

In an interactive debugging session we were able to work around this by setting remember_owner=0 in qemu.conf. However, it's unclear to me why this feature would interfere with guest booting. True, the feature has something to do with seclabels (but mostly when restoring them, not when starting the VM), but the problem is that in the debugging session we saw the disk having the same seclabels regardless of the remember_owner setting.

Can either of you attach debug logs please?

BTW: Those AVCs are related to something else (see bug 1823976).

Comment 41 SATHEESARAN 2020-04-29 06:27:43 UTC
Created attachment 1682720 [details]
libvirtd log with debug enabled

Comment 42 SATHEESARAN 2020-04-29 06:29:24 UTC
(In reply to Michal Privoznik from comment #40)
> Nir, Satheesaran,
> 
> in an interactive debugging session we were able to work around this by
> setting remember_owner=0 in qemu.conf. However, it's unclear to me why would
> this feature interfere with guest booting. I mean, true the feature has
> something to do with seclabels (but mostly when restoring them, not when
> starting the VM), but the problem is in the debugging session we have seen
> the disk having the same seclabels regardless of the remember_owner status.
> 
> Can either of you attach debug logs please?
> 
> BTW: Those AVCs are related to something else (see bug 1823976).

Hello Michal,

I have attached the libvirtd.log with debug enabled.
I hope this is the one you were asking for.

Comment 43 SATHEESARAN 2020-04-29 06:42:57 UTC
Created attachment 1682728 [details]
QEMU log for the domain - reproduce

Comment 44 SATHEESARAN 2020-04-29 06:47:05 UTC
contents of /etc/libvirt/hooks/qemu

[root@ ~]# cat /etc/libvirt/hooks/qemu
#!/bin/sh

DOMAIN="$1"
EVENT="$2"
PHASE="$3"

exec /usr/libexec/vdsm/vm_libvirt_hook.py "$DOMAIN" "$EVENT" "$PHASE"

Comment 45 Michal Privoznik 2020-04-29 14:58:29 UTC
So, after a whole day of debugging, here are my findings:

There are two ways to work around this bug (whether it is a libvirt bug remains undecided, read on):

1) set remember_owner=0 in qemu.conf; this way, libvirt won't save any XATTRs on glusterfs (which it removes immediately anyway),

2) tell libvirt not to set an SELinux label on the disk (since gluster via FUSE doesn't support it anyway):

        <seclabel model='dac' relabel='no'/>
        <seclabel model='selinux' relabel='no'/>


Now, this bug is very tricky to reproduce; in fact, on one try it reproduces and on another it doesn't (without changing anything), and yet this is not a race condition (at least not in libvirt). But here is the code that is triggered if neither workaround 1) nor 2) is in place:

https://gitlab.com/libvirt/libvirt/-/blob/master/src/security/security_selinux.c#L1327

It's the virSecuritySELinuxSetFilecon() function which will be called like this:

virSecuritySELinuxSetFilecon(mgr, path = "/path/to/gluster/disk", tcon = "system_u:object_r:svirt_image_t:s0:c159,c181", remember = true);

At this point transactions are turned off, so virSecuritySELinuxTransactionAppend() returns 0. Then @remember is true, and hence the current label of the disk is written into XATTRs (econ = "system_u:object_r:fusefs_t:s0"), with @refcount = 1 because no one else is using the image. Then virSecuritySELinuxSetFileconImpl() is called, which will try to set the desired context @tcon on the @path, but this fails (setfilecon_raw() returns -1 with errno = ENOTSUP) and thus the Impl() function returns 1. At this point, rc is equal to 1, so @rollback is not set to false and therefore virSecuritySELinuxRestoreFileLabel() is called.

At this point, we have:
 - some XATTRs on the file
 - no SELinux label change

The virSecuritySELinuxRestoreFileLabel() will then read the original SELinux label (which is the same label the file has now, because we failed to set it), remove all XATTRs, and again fail to set the SELinux label (ignored).

So there is something in this little dance that makes QEMU fail when reading the disk image. But the failure is very weird because there is nothing in the audit log, no AVC. I even enabled logging of dontaudit denials ("semodule -DB") and still see no AVCs.

However, if remember_owner is turned off, then no XATTRs are stored on the file; the relabel still fails, but QEMU is able to read the file happily. Or, if the SELinux relabel is turned off for the disk, then relabelling is not even attempted and QEMU is again able to read the file.

This brings me to the conclusion that libvirt is tickling a bug in SELinux/FUSE/gluster, and I lean towards gluster because this does not reproduce on any other filesystem.

To make things even weirder: if neither workaround is enabled, and I attach gdb and pause libvirtd right after it has removed the XATTRs and failed to restore the context (not that there is anything to restore - again, nothing changed), and I manually run "restorecon /path/to/gluster/disk" from another console, then resume libvirtd and let QEMU run - guess what, the guest boots! QEMU is able to read the disk.

Throughout this whole operation the SELinux label was never changed - at least not visibly (ls -lZ).

I will resume debugging tomorrow.

Comment 46 Michal Privoznik 2020-04-30 07:15:31 UTC
Debugging day two.

Running qemu under strace shows some differences in the way QEMU reads data from the disk. I will attach the strace outputs in a little while. The main difference is this - a successful boot looks like:

qemu_strace.3455:pread64(31</mnt/test/mprivozn/copy-of-91a6eff5-ffa2-4a96-8eef-b02a863fddf1>, "\xeb\x63\x90\x10\x8e\xd0\xbc\x00\xb0\xb8\x00\x00\x8e\xd8\x8e\xc0\xfb\xbe\x00\x7c\xbf\x00\x06\xb9\x00\x02\xf3\xa4\xea\x21\x06\x00"..., 512, 0) = 512
qemu_strace.3455:pread64(31</mnt/test/mprivozn/copy-of-91a6eff5-ffa2-4a96-8eef-b02a863fddf1>, "\x52\xbf\xf4\x81\x66\x8b\x2d\x83\x7d\x08\x00\x0f\x84\xe2\x00\x80\x7c\xff\x00\x74\x46\x66\x8b\x1d\x66\x8b\x4d\x04\x66\x31\xc0\xb0"..., 512, 512) = 512
...
qemu_strace.3455:pread64(31</mnt/test/mprivozn/copy-of-91a6eff5-ffa2-4a96-8eef-b02a863fddf1>, "\x89\x50\x08\x8b\x45\x94\x8b\x40\x0c\x8d\x14\x03\x8b\x45\x94\x89\x50\x0c\x90\x83\xc4\x6c\x5b\x5e\x5f\x5d\xc3\xf3\x0f\x1e\xfb\x55"..., 17408, 1096704) = 17408
qemu_strace.3455:pread64(31</mnt/test/mprivozn/copy-of-91a6eff5-ffa2-4a96-8eef-b02a863fddf1>, "\x49\x4e\x41\xc0\x03\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x04\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00"..., 11776, 806420480) = 11776
qemu_strace.3455:pread64(31</mnt/test/mprivozn/copy-of-91a6eff5-ffa2-4a96-8eef-b02a863fddf1>, "\x49\x4e\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00"..., 20992, 806432256) = 20992
qemu_strace.3455:pread64(31</mnt/test/mprivozn/copy-of-91a6eff5-ffa2-4a96-8eef-b02a863fddf1>, "\xe8\x0f\xb6\x45\xe8\x0f\xb6\xc0\x8b\x14\x85\x00\x00\x00\x00\x0f\xb6\x45\xe9\x0f\xb6\xc0\x8b\x04\x85\x00\x04\x00\x00\x31\xc2\x0f"..., 2560, 14286848) = 2560


And this is what an unsuccessful boot looks like:

qemu_strace.2862:pread64(32</mnt/test/mprivozn/copy-of-91a6eff5-ffa2-4a96-8eef-b02a863fddf1>, "\xeb\x63\x90\x10\x8e\xd0\xbc\x00\xb0\xb8\x00\x00\x8e\xd8\x8e\xc0\xfb\xbe\x00\x7c\xbf\x00\x06\xb9\x00\x02\xf3\xa4\xea\x21\x06\x00"..., 512, 0) = 512
qemu_strace.2862:pread64(32</mnt/test/mprivozn/copy-of-91a6eff5-ffa2-4a96-8eef-b02a863fddf1>, "\x52\xbf\xf4\x81\x66\x8b\x2d\x83\x7d\x08\x00\x0f\x84\xe2\x00\x80\x7c\xff\x00\x74\x46\x66\x8b\x1d\x66\x8b\x4d\x04\x66\x31\xc0\xb0"..., 512, 512) = 512
...
qemu_strace.2862:pread64(32</mnt/test/mprivozn/copy-of-91a6eff5-ffa2-4a96-8eef-b02a863fddf1>, "\x89\x50\x08\x8b\x45\x94\x8b\x40\x0c\x8d\x14\x03\x8b\x45\x94\x89\x50\x0c\x90\x83\xc4\x6c\x5b\x5e\x5f\x5d\xc3\xf3\x0f\x1e\xfb\x55"..., 17408, 1096704) = 17408
qemu_strace.2862:pread64(32</mnt/test/mprivozn/copy-of-91a6eff5-ffa2-4a96-8eef-b02a863fddf1>, "\x49\x4e\x81\xa4\x03\x02\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00"..., 24576, 17006592) = 24576
qemu_strace.2862:pread64(32</mnt/test/mprivozn/copy-of-91a6eff5-ffa2-4a96-8eef-b02a863fddf1>, "\x49\x4e\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00"..., 8192, 17031168) = 8192


Both straces start the same way (the boot sector is read) and the first few lines (shown as "...") are identical. But while in the successful boot qemu reads from the file at offset 806420480, in the unsuccessful boot it reads from a much lower offset, 17006592 (the two are ~750 MB apart).

Now, the interesting part is that qemu tries to determine the size of the virtual disk (the file), and again it gets different results. In the case of the successful boot:

qemu_strace.3144:fstat(17</mnt/test/mprivozn/copy-of-91a6eff5-ffa2-4a96-8eef-b02a863fddf1>, {st_dev=makedev(0, 0x34), st_ino=10159533956285893482, st_mode=S_IFREG|0644, st_nlink=1, st_uid=107, st_gid=107, st_blksize=131072, st_blocks=8004888, st_size=62277025792, st_atime=1588152829 /* 2020-04-29T11:33:49.465440461+0200 */, st_atime_nsec=465440461, st_mtime=1588227404 /* 2020-04-30T08:16:44.247680655+0200 */, st_mtime_nsec=247680655, st_ctime=1588228091 /* 2020-04-30T08:28:11.041818827+0200 */, st_ctime_nsec=41818827}) = 0


and unsuccessful:

qemu_strace.2664:fstat(18</mnt/test/mprivozn/copy-of-91a6eff5-ffa2-4a96-8eef-b02a863fddf1>, {st_dev=makedev(0, 0x34), st_ino=10159533956285893482, st_mode=S_IFREG|0644, st_nlink=1, st_uid=107, st_gid=107, st_blksize=131072, st_blocks=125960, st_size=65404928, st_atime=1588152828 /* 2020-04-29T11:33:48.755428985+0200 */, st_atime_nsec=755428985, st_mtime=1588166246 /* 2020-04-29T15:17:26.102212781+0200 */, st_mtime_nsec=102212781, st_ctime=1588228091 /* 2020-04-30T08:28:11.041818827+0200 */, st_ctime_nsec=41818827}) = 0



Notice the huge difference in st_size and st_blocks. This is what ls reports the size as:

ls -lhZ /mnt/test/mprivozn/copy-of-91a6eff5-ffa2-4a96-8eef-b02a863fddf1 
-rw-r--r--. 1 qemu qemu system_u:object_r:fusefs_t:s0 58G Apr 30 08:45 /mnt/test/mprivozn/copy-of-91a6eff5-ffa2-4a96-8eef-b02a863fddf1


Therefore, I conclude that gluster intermittently reports different file sizes, which confuses qemu and in turn leaves the guest unable to boot. I will switch this one over to gluster after I attach the promised straces.

Comment 47 Michal Privoznik 2020-04-30 07:19:19 UTC
Created attachment 1683198 [details]
qemu_straces.tar.xz

The promised straces. At first I thought that an iothread was somehow involved, so I disabled it. There are therefore two successful runs (qemu_success/ and qemu_success_no_iothread/) and two failed runs (qemu_fail/ and qemu_fail_no_iothread/).

Comment 48 Yaniv Kaul 2020-04-30 07:57:43 UTC
Hi Sunil,

1. Please see comment 46, but before that, let's make sure we are NOT using libgfapi here.
2. Please assign someone to look at this promptly.

Comment 49 Michal Privoznik 2020-04-30 08:02:00 UTC
Alright, so I have a very simple reproducer that doesn't involve qemu or libvirt. In one console I'm watching the file size:

  for ((i=0; i<100; i++)); do stat /mnt/test/mprivozn/copy-of-91a6eff5-ffa2-4a96-8eef-b02a863fddf1 ; sleep .1; done

and in the other I am setting an XATTR:

  setfattr -n trusted.libvirt -v ble /mnt/test/mprivozn/copy-of-91a6eff5-ffa2-4a96-8eef-b02a863fddf1

And this is what I can see:

  File: /mnt/test/mprivozn/copy-of-91a6eff5-ffa2-4a96-8eef-b02a863fddf1
  Size: 62277025792     Blocks: 8004888    IO Block: 131072 regular file
Device: 34h/52d Inode: 10159533956285893482  Links: 1
Access: (0644/-rw-r--r--)  Uid: (  107/    qemu)   Gid: (  107/    qemu)
Context: system_u:object_r:fusefs_t:s0
Access: 2020-04-29 11:33:49.465440461 +0200
Modify: 2020-04-30 08:45:00.812196690 +0200
Change: 2020-04-30 08:46:03.026205893 +0200
 Birth: -

  File: /mnt/test/mprivozn/copy-of-91a6eff5-ffa2-4a96-8eef-b02a863fddf1
  Size: 65404928        Blocks: 125960     IO Block: 131072 regular file
Device: 34h/52d Inode: 10159533956285893482  Links: 1
Access: (0644/-rw-r--r--)  Uid: (  107/    qemu)   Gid: (  107/    qemu)
Context: system_u:object_r:fusefs_t:s0
Access: 2020-04-29 11:33:48.755428985 +0200
Modify: 2020-04-29 15:17:26.102212781 +0200
Change: 2020-04-30 09:54:26.139738718 +0200
 Birth: -

  File: /mnt/test/mprivozn/copy-of-91a6eff5-ffa2-4a96-8eef-b02a863fddf1
  Size: 62277025792     Blocks: 8004888    IO Block: 131072 regular file
Device: 34h/52d Inode: 10159533956285893482  Links: 1
Access: (0644/-rw-r--r--)  Uid: (  107/    qemu)   Gid: (  107/    qemu)
Context: system_u:object_r:fusefs_t:s0
Access: 2020-04-29 11:33:49.465440461 +0200
Modify: 2020-04-30 08:45:00.812196690 +0200
Change: 2020-04-30 09:54:26.139738718 +0200
 Birth: -


So not only does the reported file size drop, the mtime is also confusing.
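
The same flip can be observed from a single process via fstat(), which is the call qemu makes in the straces from comment 46. A minimal C sketch of the shell reproducer above (the default path is a placeholder; run it as root because of the trusted.* namespace):

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>
#include <sys/xattr.h>

int main(int argc, char **argv)
{
    /* Placeholder default; pass the real image path as argv[1]. */
    const char *path = argc > 1 ? argv[1] : "/path/to/gluster/disk";
    struct stat st;

    int fd = open(path, O_RDONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    if (fstat(fd, &st) == 0)
        printf("before setxattr: st_size=%lld st_blocks=%lld\n",
               (long long)st.st_size, (long long)st.st_blocks);

    /* Same kind of XATTR write the setfattr command above performs. */
    if (setxattr(path, "trusted.libvirt", "ble", 3, 0) < 0)
        perror("setxattr");

    if (fstat(fd, &st) == 0)
        printf("after  setxattr: st_size=%lld st_blocks=%lld\n",
               (long long)st.st_size, (long long)st.st_blocks);

    close(fd);
    return 0;
}

On an affected volume the two lines should show the same kind of size/blocks discrepancy as the stat output above.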

Comment 50 SATHEESARAN 2020-04-30 08:45:39 UTC
(In reply to Yaniv Kaul from comment #48)
> Hi Sunil,
> 
> 1. Please see comment 46 , but before that - let's make sure we are NOT
> using libgfapi here.
> 2. Please assign someone to look at this promptly.

I have shared my test setup for this issue with Ryan Barry, Nir Soffer and Michal.
It is not using libgfapi but the gluster FUSE mount.

Comment 51 SATHEESARAN 2020-04-30 10:52:41 UTC
Thanks, Michal, for your analysis and for getting us this far.
I am moving this bug to the Red Hat Gluster Storage product.

Comment 53 SATHEESARAN 2020-04-30 11:01:19 UTC
I have tested this behaviour on a replica 3 volume with sharding enabled, using the test procedure mentioned in comment 49.

I could see the discrepancy in the size of the file; below is the output from stat on the file
while setfattr was running on that file.
<snip>
  Size: 4294967296      Blocks: 8388608    IO Block: 131072 regular file
  Size: 4294967296      Blocks: 8388608    IO Block: 131072 regular file
  Size: 4294967296      Blocks: 8388608    IO Block: 131072 regular file
  Size: 4294967296      Blocks: 8388608    IO Block: 131072 regular file
  Size: 67108864        Blocks: 131072     IO Block: 131072 regular file
  Size: 67108864        Blocks: 131072     IO Block: 131072 regular file
  Size: 67108864        Blocks: 131072     IO Block: 131072 regular file
  Size: 67108864        Blocks: 131072     IO Block: 131072 regular file
  Size: 67108864        Blocks: 131072     IO Block: 131072 regular file
  Size: 67108864        Blocks: 131072     IO Block: 131072 regular file
</snip>

[root@ ~]# grep -i size sample1 |uniq
  Size: 4294967296	Blocks: 8388608    IO Block: 131072 regular file
  Size: 67108864  	Blocks: 131072     IO Block: 131072 regular file
  Size: 4294967296	Blocks: 8388608    IO Block: 131072 regular file
  Size: 67108864  	Blocks: 131072     IO Block: 131072 regular file
  Size: 4294967296	Blocks: 8388608    IO Block: 131072 regular file

I have also repeated the test with a replica 3 volume without sharding enabled, and I do not see any discrepancy:
[root@ ~]# grep -i size sample |uniq
  Size: 4294967296	Blocks: 8388608    IO Block: 131072 regular file

So this leads to the conclusion that this issue is specific to the sharding translator of glusterfs.

Comment 55 SATHEESARAN 2020-04-30 13:21:17 UTC
Thanks, Ravi, for your suggestions.
With Ravi's input, I have repeated the test after turning off 'performance.stat-prefetch':

[localhost@ ~]# gluster volume set engine performance.stat-prefetch off

In this case, with stat-prefetch turned off, the size of the file no longer changes
when an xattr is set on it from the gluster FUSE mount.

@Krutika, refer to comment 49 for the simple steps to reproduce this issue.
I also have a setup available for debugging; let me know if you need it.

Comment 56 Krutika Dhananjay 2020-05-05 06:39:16 UTC
I think this is a side-effect of commit 69f77d2, wherein, in the setxattr fop, md-cache caches the pre- and post-fop attributes obtained from the dict. While that patch did handle cases like EC, where the file sizes returned by posix from multiple bricks need to be aggregated before unwinding the response, I think it missed making a similar change in the shard translator. In the case of shard, each brick returns only the base shard's size, which md-cache then ends up caching and keeps serving until cache-timeout.

The solution would be to "complete" what the patch missed: overload shard's setxattr as well, so that it requests the aggregated file size during the call and uses the aggregated size returned in the dict to update the GF_{PRE,POST}STAT key values before unwinding the call.
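
As an illustration of that idea only (hypothetical stand-in types and names - this is not the actual shard xlator code), the fix amounts to overwriting the base shard's size/blocks in the post-op stat with the aggregated values before the response is unwound, so that md-cache caches the real file size:

#include <stdio.h>
#include <stdint.h>

/* Stand-ins for gluster's iatt and the per-inode shard context. */
struct post_op_stat {
    uint64_t ia_size;      /* as returned by posix: base shard only */
    uint64_t ia_blocks;
};

struct shard_inode_ctx {
    uint64_t aggregated_size;    /* true size across all shards */
    uint64_t aggregated_blocks;
};

/* Conceptually runs in the SETXATTR callback path, before the response
 * (and hence the GF_{PRE,POST}STAT values) reaches md-cache. */
static void shard_fixup_poststat(struct post_op_stat *post,
                                 const struct shard_inode_ctx *ctx)
{
    post->ia_size   = ctx->aggregated_size;
    post->ia_blocks = ctx->aggregated_blocks;
}

int main(void)
{
    /* Numbers from comment 53: base shard is 64 MiB, full file is 4 GiB. */
    struct post_op_stat post = { .ia_size = 67108864ULL, .ia_blocks = 131072ULL };
    struct shard_inode_ctx ctx = { .aggregated_size = 4294967296ULL,
                                   .aggregated_blocks = 8388608ULL };

    shard_fixup_poststat(&post, &ctx);
    printf("size=%llu blocks=%llu\n",
           (unsigned long long)post.ia_size,
           (unsigned long long)post.ia_blocks);
    return 0;
}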

Comment 58 Krutika Dhananjay 2020-05-07 12:31:36 UTC
0. So I'm assuming the file on which libvirt is setting the extended attribute is the VM image file. Is that correct?
1. So how often does libvirt call SETXATTR on a VM during its lifetime?
2. And at what point in the VM's lifecycle is this SETXATTR called?

I have two solutions - one involving an extra network call and the other involving a potential backward-compatibility problem during upgrade.
I need to decide which one to implement based on the answers to the above questions.

-Krutika

Comment 59 Michal Privoznik 2020-05-07 14:01:29 UTC
(In reply to Krutika Dhananjay from comment #58)
> 0. So I'm assuming the file on which libvirt is setting the extended
> attribute is the VM image file. Is that correct?

Yes. When starting a VM or attaching a disk to a running VM, libvirt must chown() + setfilecon() the file so that the qemu process can access it. Just before doing so, libvirt records the original owner and SELinux label in XATTRs. Then, when cleaning up after a terminated qemu or a detached disk, libvirt looks into the XATTRs and, if it finds previous records, uses them to restore the original owner. If not, a chown(root:root) + restorecon combo is called.

> 1. So how often does libvirt call SETXATTR on a VM during its lifetime?

This depends on the file. There are files whose XATTRs will be set only once (and, correspondingly, removed once) - like the pidfile. But there are also files whose XATTRs will be set and removed multiple times - as many times as a user hotplugs/hotunplugs the disk during the VM's lifetime. If they want to do that every 10 seconds, then every 10 seconds the file backing the disk inside the VM will have its XATTRs modified.

> 2. And at what point in the VM's lifecycle is this SETXATTR called?

Again, this depends on the scenario. Files that are configured in the so-called inactive XML (the configuration the VM is started with) will have their XATTRs set just before the execve() of qemu, and removed when the qemu process dies (libvirt keeps a socket open and detects qemu dying by seeing HUP on the socket, which triggers the cleanup code where the XATTRs are removed). But because we have hotplug and hotunplug, the corresponding XATTR work is also done on the file representing a device being hotplugged to/hotunplugged from qemu.

> 
> I have two solutions - one of them involving an extra network call and
> another involving potential backward compatibility problem during upgrade.
> I need to decide which one to implement based on the answers to the above
> questions.

Sounds good. If it makes any difference, I don't think that RHV uses gluster for anything other than storing VM disks, so we don't have to bother with other ("service") files, like the aforementioned pidfile, the monitor socket file, etc.

> 
> -Krutika

Comment 60 Krutika Dhananjay 2020-05-08 10:51:42 UTC
(In reply to Michal Privoznik from comment #59)
> (In reply to Krutika Dhananjay from comment #58)
> > 0. So I'm assuming the file on which libvirt is setting the extended
> > attribute is the VM image file. Is that correct?
> 
> Yes. When starting a VM or attaching a disk to a running VM, libvirt must
> chown() + setfilecon() the file so that the qemu process can access it. And
> just before doing so, libvirt will record original owner and SELinux label
> into XATTRs. Then, when cleaning up after terminated qemu or detached disk,
> libvirt looks into XATTRs and if it finds some previous records it will use
> that to restore the original owner. If not then chown(root:root) +
> restorecon combo is called.
> 
> > 1. So how often does libvirt call SETXATTR on a VM during its lifetime?
> 
> This depends on the file. There are files which will be set only once (well,
> and once more they will have XATTRs removed) - like pidfile. But then there
> are files which will have their XATTRs set and removed multiple times - as
> many times as an user wants to hotplug/hotunplug the disk during VM's
> lifetime. If they want to do it every 10 seconds then every 10 seconds the
> file representing the disk inside VM will have its XATTRs modified.
> 
> > 2. And at what point in the VM's lifecycle is this SETXATTR called?
> 
> Again, this depends on the scenario. Those files that are configured in so
> called inactive XML (configuration that VM is started with) will have those
> XATTRs set just before execve() of qemu; and removed on qemu process dying
> (libvirt keeps a socket open and it detects qemu dying by seeing HUP on the
> socket which triggers cleanup code run which is where XATTRs will be
> removed). But because we have hotplug and hotunplug, corresponding XATTR
> work is done on the file representing device hotplugged to/hotunplugged from
> qemu.
> 
> > 
> > I have two solutions - one of them involving an extra network call and
> > another involving potential backward compatibility problem during upgrade.
> > I need to decide which one to implement based on the answers to the above
> > questions.
> 
> Sounds good. If it makes any difference, I don't think that RHV uses gluster
> for anything other than storing VM disks. So we don't have to bother with
> other ("service") files, like aforementioned pidfile, monitor socket file,
> etc.
> 
> > 
> > -Krutika

So, as far as RHV is concerned, is it safe for me to assume that there will be only an initial SETXATTR and a REMOVEXATTR at the end?
In that case it makes more sense for me to go ahead with the additional network-call approach (and then again, it's not like the call WILL go over the network *EVERY* time - we even cache the aggregated file attributes in the shard translator, so sometimes it will be served right off the cache).

Comment 61 Michal Privoznik 2020-05-11 05:12:05 UTC
(In reply to Krutika Dhananjay from comment #60)

> So as far as RHV is concerned, then is it safe for me to assume that there
> will be only an initial SETXATTR and a REMOVEXATTR at the end?
> In which case it makes more sense for me to go ahead with the additional
> network-call approach (and then again it's not like the call WILL go over
> the network *EVERY* time, we even store the aggregated file attribute in
> cache in shard translator, so sometimes it will be served right off the
> cache).

Oh yeah, that sounds reasonable. XATTRs are very low traffic. It's NOT like libvirt would set them on each disk read/write - it sets them once and removes them once. Any overhead will be spread over the VM's runtime.

Comment 62 Krutika Dhananjay 2020-05-14 06:49:16 UTC
Upstream github tracker - https://github.com/gluster/glusterfs/issues/1243

Patch coming right up!

Comment 63 Krutika Dhananjay 2020-05-15 10:46:08 UTC
(In reply to Krutika Dhananjay from comment #62)
> Upstream github tracker - https://github.com/gluster/glusterfs/issues/1243
> 
> Patch coming right up!

Sorry, this is going to take some more time.
The issue is not reproducible upstream in the first place. Applying the patch downstream for testing purposes is leading to huge conflicts due to coding-style differences between the two. I will upload it once I get to test the patch.

Comment 66 Krutika Dhananjay 2020-05-20 07:26:53 UTC
Upstream patch - https://review.gluster.org/c/glusterfs/+/24471

Comment 67 Krutika Dhananjay 2020-05-21 11:11:47 UTC
(In reply to Michal Privoznik from comment #61)
> (In reply to Krutika Dhananjay from comment #60)
> 
> > So as far as RHV is concerned, then is it safe for me to assume that there
> > will be only an initial SETXATTR and a REMOVEXATTR at the end?
> > In which case it makes more sense for me to go ahead with the additional
> > network-call approach (and then again it's not like the call WILL go over
> > the network *EVERY* time, we even store the aggregated file attribute in
> > cache in shard translator, so sometimes it will be served right off the
> > cache).
> 
> Oh yeah, that sounds reasonable. XATTRs are very low traffic. It's NOT like
> libvirt would set them on each disk read/write. It sets them once and
> removes them once. Any overhead will spread over VM runtime.

Hi,

One last question: does the analysis in comment #46 hold only for SETXATTR, or do you see this with REMOVEXATTR as well? I believe that is also something you'd execute on the file at some point.

I've fixed the problem in SETXATTR. I believe the same bug exists in REMOVEXATTR in gluster as well. If it is necessary, I'll have to fix the REMOVEXATTR code too (which implies adding another 100+ lines of code).

-Krutika

Comment 68 Michal Privoznik 2020-05-21 12:25:49 UTC
(In reply to Krutika Dhananjay from comment #67)
> 
> One last question. The analysis in comment #46 holds only for SETXATTR? Or
> do you see this even in REMOVEXATTR since I believe that is also something
> you'd execute on the file at some point?
> 
> I've fixed the problem in SETXATTR. I believe the same bug exists in
> REMOVEXATTR as well in gluster. If it is necessary, then I'll have to fix
> REMOVEXATTR code as well (which implies I'll have to add another 100+ lines
> of code).

Frankly, I don't know. I no longer have access to the testing environment. But you are right in assuming that libvirt does call REMOVEXATTR at some point: it does so when a VM is shut down, or on a device hot unplug (e.g. a disk detach). However, at that point the VM (i.e. the qemu process) is no longer using the associated file (the order is the reverse of that used when starting a VM), and therefore the apparent file size is not an issue. But this is from libvirt's POV; other applications might do more XATTR work and rely on the apparent file size, so I'd say it is worth fixing the REMOVEXATTR case too.

Comment 69 Krutika Dhananjay 2020-05-22 04:49:11 UTC
(In reply to Michal Privoznik from comment #68)
> (In reply to Krutika Dhananjay from comment #67)
> > 
> > One last question. The analysis in comment #46 holds only for SETXATTR? Or
> > do you see this even in REMOVEXATTR since I believe that is also something
> > you'd execute on the file at some point?
> > 
> > I've fixed the problem in SETXATTR. I believe the same bug exists in
> > REMOVEXATTR as well in gluster. If it is necessary, then I'll have to fix
> > REMOVEXATTR code as well (which implies I'll have to add another 100+ lines
> > of code).
> 
> Frankly, I don't know. I no longer have access to the testing environment.
> But, you are right in assuming that libvirt does call REMOVEXATTR at some
> point. It does so when a VM is shut down. Or on a device hot unplug (e.g. a
> disk detach). However, at that point the VM (aka qemu process) is no longer
> using the associated file (the order is reversed to that when starting a
> VM), and therefore the apparent file size is not an issue. But, this is from
> libvirt's POV, other applications might do more XATTR work and rely on the
> apparent file size, so I'd say it is worth fixing the REMOVEXATTR case too.

Thanks, I'm on it!

Comment 70 Krutika Dhananjay 2020-05-22 06:43:58 UTC
Downstream backport for the SETXATTR changes - https://code.engineering.redhat.com/gerrit/#/c/201135/

Comment 71 Krutika Dhananjay 2020-05-22 08:28:46 UTC
https://review.gluster.org/c/glusterfs/+/24480 <-- removexattr fix

Comment 77 SATHEESARAN 2020-06-08 14:58:12 UTC
Tested with RHV 4.4.1 and RHGS 3.5.2 - glusterfs-6.0-37.el8rhgs.

1. RHHI-V deployment completed successfully, with the RHV Self-Hosted Engine consuming a glusterfs storage domain
2. Additional gluster storage domains are created and operational
3. VMs are created and various workloads are run inside VMs

Comment 79 errata-xmlrpc 2020-06-16 06:19:39 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2572

