Bug 1700319 - VM is going to pause state with "storage I/O error".
Summary: VM is going to pause state with "storage I/O error".
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 4.3.1
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ovirt-4.4.0
Target Release: 4.3.5
Assignee: Eyal Shenitzky
QA Contact: Evelina Shames
URL:
Whiteboard:
Depends On:
Blocks: 1700316 1728210
 
Reported: 2019-04-16 10:04 UTC by Mugdha Soni
Modified: 2020-08-04 13:19 UTC (History)
10 users

Fixed In Version: ovirt-engine-4.3.5.3
Doc Type: No Doc Update
Doc Text:
Clone Of: 1700316
: 1728210 (view as bug list)
Environment:
Last Closed: 2020-08-04 13:17:36 UTC
oVirt Team: Storage
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
vmstore.log (99.56 KB, text/plain)
2019-04-22 11:17 UTC, Mugdha Soni
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2020:3247 0 None None None 2020-08-04 13:19:18 UTC
oVirt gerrit 101174 0 'None' MERGED core: fix initial size allocation for preallocated file-based disks 2020-10-28 11:40:52 UTC
oVirt gerrit 101342 0 'None' MERGED core: fix initial size allocation for preallocated file-based disks 2020-10-28 11:40:52 UTC

Description Mugdha Soni 2019-04-16 10:04:05 UTC
+++ This bug was initially created as a clone of Bug #1700316 +++

Description of problem:
------------------------------
After the user creates a VM from a template and runs it, the VM goes into a paused state with a storage I/O error. Later investigation of the logs found an error stating "Stale file handle".


Version-Release number of selected component:
----------------------------------------------
glusterfs-server-3.12.2-47.el7rhgs.x86_64

gluster-ansible-cluster-1.0-1.el7rhgs.noarch
gluster-ansible-repositories-1.0-1.el7rhgs.noarch
gluster-ansible-features-1.0.4-5.el7rhgs.noarch
gluster-ansible-infra-1.0.3-3.el7rhgs.noarch
gluster-ansible-maintenance-1.0.1-1.el7rhgs.noarch
gluster-ansible-roles-1.0.4-4.el7rhgs.noarch

Red Hat Enterprise Linux Server release 7.6 (Maipo)

How reproducible:
-------------------
2/2

Steps to Reproduce:
---------------------
1. Successfully deploy gluster and the hosted engine.
2. Create a RHEL 7.6 template and then create a VM from it.
3. Run the VM; it goes into a paused state.

Actual results:
----------------------
The VM goes into a paused state with a storage I/O error.

Expected results:
----------------------
The VM should not go into a paused state.

--- Additional comment from RHEL Product and Program Management on 2019-04-16 09:59:33 UTC ---

This bug is automatically being proposed for the RHHI-V 1.7 release of the Red Hat Hyperconverged Infrastructure for Virtualization product, by setting the release flag 'rhiv-1.7' to '?'.

If this bug should be proposed for a different release, please manually change the proposed release flag.

Comment 3 Raghavendra G 2019-04-17 01:59:55 UTC
I see read (which is an operation on fd) failing with ESTALE. The only reason I can think of for this is using anonymous-fds. Can you turn off performance.open-behind and see whether it helps?
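
For reference, a minimal sketch of how this option would typically be disabled (assuming the volume name vmstore seen in the client logs, run from any node in the trusted storage pool):

# disable open-behind on the vmstore volume
gluster volume set vmstore performance.open-behind off
# confirm the new value
gluster volume get vmstore performance.open-behind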

Comment 4 Mugdha Soni 2019-04-22 11:17:02 UTC
Created attachment 1557151 [details]
vmstore.log

Comment 5 Mugdha Soni 2019-04-22 11:29:49 UTC
(In reply to comment #3)
I turned off performance.open-behind but could still see this issue. The logs have been attached.

Comment 6 Krutika Dhananjay 2019-04-23 06:08:38 UTC
Mugdha,

The logs seem to indicate that a lot of images were deleted.

[2019-04-22 06:21:03.286624] I [MSGID: 133022] [shard.c:3636:shard_delete_shards] 0-vmstore-shard: Deleted shards of gfid=a3180be9-e95b-4fc0-81a2-5f5ab9465a18 from backend
[2019-04-22 06:21:03.289603] I [MSGID: 133022] [shard.c:3636:shard_delete_shards] 0-vmstore-shard: Deleted shards of gfid=92a11c36-a0f0-4a17-9187-484c876106bc from backend
[2019-04-22 06:21:03.291567] I [MSGID: 133022] [shard.c:3636:shard_delete_shards] 0-vmstore-shard: Deleted shards of gfid=a535e911-fd7d-4bba-9eb9-91842f3ee2ca from backend
[2019-04-22 06:21:03.601443] I [MSGID: 133022] [shard.c:3636:shard_delete_shards] 0-vmstore-shard: Deleted shards of gfid=1f9cddce-d0cc-46f8-ab4b-c2918ffaf01b from backend
[2019-04-22 06:21:03.604608] I [MSGID: 133022] [shard.c:3636:shard_delete_shards] 0-vmstore-shard: Deleted shards of gfid=aebb9d5b-3af7-4fd3-b737-4a6402b5f677 from backend
[2019-04-22 06:21:03.607079] I [MSGID: 133022] [shard.c:3636:shard_delete_shards] 0-vmstore-shard: Deleted shards of gfid=dccb4887-6e03-428b-8a68-5466c38c5829 from backend
[2019-04-22 06:21:03.894875] I [MSGID: 133022] [shard.c:3636:shard_delete_shards] 0-vmstore-shard: Deleted shards of gfid=c244b562-504a-43e1-b2b5-521ff35776ec from backend
[2019-04-22 06:21:03.897581] I [MSGID: 133022] [shard.c:3636:shard_delete_shards] 0-vmstore-shard: Deleted shards of gfid=685e0108-35fa-45b2-b919-9d08d0b7c2cd from backend
[2019-04-22 06:21:03.899639] I [MSGID: 133022] [shard.c:3636:shard_delete_shards] 0-vmstore-shard: Deleted shards of gfid=8c4902db-4318-4005-a451-0d332f5aee59 from backend
[2019-04-22 06:21:04.205551] I [MSGID: 133022] [shard.c:3636:shard_delete_shards] 0-vmstore-shard: Deleted shards of gfid=fb6075c9-e744-4048-b30d-773bdbc2aa5a from backend
[2019-04-22 06:21:04.208530] I [MSGID: 133022] [shard.c:3636:shard_delete_shards] 0-vmstore-shard: Deleted shards of gfid=25f6cf12-050a-4d64-92e2-25a2ba640df4 from backend
[2019-04-22 06:21:04.210682] I [MSGID: 133022] [shard.c:3636:shard_delete_shards] 0-vmstore-shard: Deleted shards of gfid=f901881f-2d21-4ff1-9a33-0e6c929c0cc1 from backend
[2019-04-22 06:21:04.507627] I [MSGID: 133022] [shard.c:3636:shard_delete_shards] 0-vmstore-shard: Deleted shards of gfid=e2d262df-7cf7-4af0-b529-18960155b592 from backend
[2019-04-22 06:21:04.510694] I [MSGID: 133022] [shard.c:3636:shard_delete_shards] 0-vmstore-shard: Deleted shards of gfid=292e85a1-e197-4778-bb40-7441d31356c3 from backend
[2019-04-22 06:21:04.513183] I [MSGID: 133022] [shard.c:3636:shard_delete_shards] 0-vmstore-shard: Deleted shards of gfid=2555816a-88c2-47cd-837d-07113910592b from backend
[2019-04-22 06:21:04.789920] I [MSGID: 133022] [shard.c:3636:shard_delete_shards] 0-vmstore-shard: Deleted shards of gfid=45581e16-64c6-4c08-b0a0-e8796773c46d from backend
[2019-04-22 06:21:04.792762] I [MSGID: 133022] [shard.c:3636:shard_delete_shards] 0-vmstore-shard: Deleted shards of gfid=61536a7c-0b69-4220-81c9-2e0acdf53ee8 from backend
[2019-04-22 06:21:04.794852] I [MSGID: 133022] [shard.c:3636:shard_delete_shards] 0-vmstore-shard: Deleted shards of gfid=c99ac123-fc3a-4a0d-a142-ab3bdb144523 from backend
[2019-04-22 06:21:05.076846] I [MSGID: 133022] [shard.c:3636:shard_delete_shards] 0-vmstore-shard: Deleted shards of gfid=bfa89657-c62c-4555-96d8-0ca7e80c8ded from backend
[2019-04-22 06:21:05.079858] I [MSGID: 133022] [shard.c:3636:shard_delete_shards] 0-vmstore-shard: Deleted shards of gfid=c1351e11-46f4-4954-9c4c-844db7823ab7 from backend
[2019-04-22 06:21:05.082067] I [MSGID: 133022] [shard.c:3636:shard_delete_shards] 0-vmstore-shard: Deleted shards of gfid=06b6ca57-f81d-4a2d-9d4c-3dba7a825f83 from backend
[2019-04-22 06:29:34.520531] I [MSGID: 133022] [shard.c:3636:shard_delete_shards] 0-vmstore-shard: Deleted shards of gfid=376ebba2-cc30-4aa5-85db-f36f8998026d from backend

Was deletion part of the test case too? If so, can you share some details around that, and at what point it is done?

-Krutika

Comment 7 Raghavendra G 2019-04-23 06:14:07 UTC
Digging further into why fd-based operations fail with ESTALE errors, I had a discussion with Krutika on how shard handles open fds. fds are opened only on the base shard and hence they are not immune to deletions of the path. However, I am not sure whether there were any explicit deletions of shards. But I do see quite a few shard_delete_shards msgs like,

[2019-04-22 08:59:53.237690] W [fuse-bridge.c:2756:fuse_readv_cbk] 0-glusterfs-fuse: 3625714: READ => -1 gfid=ee412a06-262f-4710-948d-36ae97d4cbc3 fd=0x7ff6900c0358 (Stale file handle)
[2019-04-22 09:05:25.498899] I [MSGID: 133022] [shard.c:3636:shard_delete_shards] 0-vmstore-shard: Deleted shards of gfid=ee412a06-262f-4710-948d-36ae97d4cbc3 from backend

Though the readv failed earlier than the shard_delete_shards msg above, I would nevertheless like to explore further:
* what caused deletion of shards?
* is it possible that shards were deleted earlier than fd based operations?

I would request Krutika to explore the above two questions.

As to the relevance of this bug to DHT, the only error msgs present in the logs relevant to DHT are those that just say the file is not present on a particular subvolume, and these msgs don't indicate an error condition.

Comment 8 Krutika Dhananjay 2019-04-24 05:23:32 UTC
(In reply to Raghavendra G from comment #7)
> Digging further into why fd based operations fail with ESTALE errors, I had
> a discussion with Krutika on how shard handles open fds. fds are opened only
> on base shard and hence they are not immune to deletions of path. However, I
> am not sure whether there were any explicit deletion of shards. But, I do
> see quite a few shard_delete_shards msgs like,
> 
> [2019-04-22 08:59:53.237690] W [fuse-bridge.c:2756:fuse_readv_cbk]
> 0-glusterfs-fuse: 3625714: READ => -1
> gfid=ee412a06-262f-4710-948d-36ae97d4cbc3 fd=0x7ff6900c0358 (Stale file
> handle)
> [2019-04-22 09:05:25.498899] I [MSGID: 133022]
> [shard.c:3636:shard_delete_shards] 0-vmstore-shard: Deleted shards of
> gfid=ee412a06-262f-4710-948d-36ae97d4cbc3 from backend
> 
> Though readv failed earlier than shard_delete_shards msg above. Nevertheless
> I would like to explore further on:
> * what caused deletion of shards?
> * is it possible that shards were deleted earlier than fd based operations?
> 
> I would request Krutika to explore the above two questions.

Like I mentioned yesterday in the discussion on chat, deleting a vm is disallowed unless it's powered off.
In other words, qemu closes the fd on the base file long before deletion.

Moreover, the logs about deletion stem from QE/dev cleaning up old VMs before rerunning the test.
I tried the test case myself and can confirm that the deletions happened long after the pause, at
the time I cleaned up the setup before a rerun.

There are multiple reasons to believe deletions and reads are not happening in parallel, including the order
of events seen in the logs. I think we can safely drop that theory and explore other possibilities.

I'll debug further on qe setup and post my analysis.

> 
> As to relevance of this bug to dht, only error msgs present in logs relevant
> to DHT are those that just say the file is not present on a particular
> subvolume and these msgs doesn't indicate an error condition.

Comment 9 Raghavendra G 2019-04-24 06:35:46 UTC
(In reply to Mugdha Soni from comment #0)
> +++ This bug was initially created as a clone of Bug #1700316 +++
> 
> Description of problem:
> ------------------------------
> After user creates a VM out of template and runs it the VM is going to
> paused state with storage I/O storage error and later investigating the logs
> found an error stating " stale file handler ".

In which logs did you find "Stale file handle"? Is it the logs of the VM or the glusterfs client logs? I'm asking this question to identify which syscall's failure might've caused the VM to pause. From the glusterfs client logs attached I could see statfs and readv responses failing with ESTALE errors. I assume one of these would've caused the VM pause, but I just want to confirm the same.
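
A hedged example of where each would typically be checked on the hypervisor (the log file names below are illustrative, not confirmed from this setup):

# glusterfs FUSE client log for the vmstore mount
grep -i "stale file handle" /var/log/glusterfs/rhev-data-center-mnt-glusterSD-*.log
# qemu/libvirt log for the paused VM
grep -i error /var/log/libvirt/qemu/<vm-name>.log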

Comment 10 Krutika Dhananjay 2019-04-24 15:24:31 UTC
Here's the short summary of the debugging done today -

So the vm pauses during readv fop on it - specifically on the 12th shard of the vm when I reproduced it today.
The error that caused the pause is ESTALE which happens when -
shard creates anon fd on the 12th shard and sends readv on it.
DHT winds this down to AFR. AFR in turn winds this to its read child (child 0 in this case) as is evidenced by the occurrence of the following logs:

[2019-04-24 09:20:20.772460] W [MSGID: 114031] [client-rpc-fops.c:2927:client3_3_readv_cbk] 10-vmstore-client-0: remote operation failed [No such file or directory]


The brick returns ENOENT on the read. So afr attempts an inode refresh using fstat. The fstat also fails with ENOENT -
[2019-04-24 09:20:20.772959] W [MSGID: 114031] [client-rpc-fops.c:1464:client3_3_fstat_cbk] 10-vmstore-client-0: remote operation failed [No such file or directory]
[2019-04-24 09:20:20.773113] W [MSGID: 114031] [client-rpc-fops.c:1464:client3_3_fstat_cbk] 10-vmstore-client-2: remote operation failed [No such file or directory]
[2019-04-24 09:20:20.773161] W [MSGID: 114031] [client-rpc-fops.c:1464:client3_3_fstat_cbk] 10-vmstore-client-1: remote operation failed [No such file or directory]

The ENOENT is propagated upwards to DHT. DHT, in its readv cbk, on seeing that the error is ENOENT, launches the migration-check synctask, in which it sends a lookup on the gfid.
This lookup fails in the resolution phase in protocol/server and the server returns ESTALE -

[2019-04-24 09:21:58.005820] E [MSGID: 109040] [dht-helper.c:1387:dht_migration_complete_check_task] 10-vmstore-dht: 2da0e8e7-e2ba-44db-af25-22abae893d41: failed to lookup the file on vmstore-dht [Stale file handle]

DHT propagates it back to shard, and shard propagates it upwards to fuse, where the READV fails with ESTALE, following which qemu pauses the VM.

[2019-04-24 09:31:09.806926] W [fuse-bridge.c:2756:fuse_readv_cbk] 0-glusterfs-fuse: 13005629: READ => -1 gfid=1da75ccf-3de7-4db4-85d2-c5bef44aaec8 fd=0x7ff6900ace58 (Stale file handle)


However, this is where it gets interesting. I checked the backend for shard 12 of 1da75ccf-3de7-4db4-85d2-c5bef44aaec8 and it very much exists, along with its gfid handle, only it is a different gfid than the one on which anon fd is opened and readv sent. The readv was sent on gfid 2da0e8e7-e2ba-44db-af25-22abae893d41 (non-existent in backend) whereas the actual gfid in the backend of the shard is 1540624d-59e1-4c8c-b89e-360aa0e635c7 -

[root@rhsqa-grafton1-nic2 .shard]# ls -lai 1da75ccf-3de7-4db4-85d2-c5bef44aaec8.12
68786459214 -rw-rw----. 2 root root 67108864 Apr 24 14:12 1da75ccf-3de7-4db4-85d2-c5bef44aaec8.12
[root@rhsqa-grafton1-nic2 .shard]# cd ..
[root@rhsqa-grafton1-nic2 vmstore]# find . -inum 68786459214
./.glusterfs/15/40/1540624d-59e1-4c8c-b89e-360aa0e635c7
./.shard/1da75ccf-3de7-4db4-85d2-c5bef44aaec8.12
[root@rhsqa-grafton1-nic2 vmstore]# find . -inum 68786459214
./.glusterfs/15/40/1540624d-59e1-4c8c-b89e-360aa0e635c7
./.shard/1da75ccf-3de7-4db4-85d2-c5bef44aaec8.12
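
Another way to confirm the shard's on-disk gfid (equivalent to the find-by-inode-number above) would be to read the trusted.gfid xattr directly on the brick; a sketch, assuming the same brick root as above:

# prints the gfid stored on the backend file as a hex xattr;
# per the .glusterfs hard link above, this should show 0x1540624d59e14c8cb89e360aa0e635c7
getfattr -n trusted.gfid -e hex .shard/1da75ccf-3de7-4db4-85d2-c5bef44aaec8.12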


The question is: why is there a gfid mismatch between the in-memory inode and the backend inode? The in-memory inode object is in fact clean, with a non-zero ref, and is properly linked into the inode table -

(gdb) p local->inode_list[0]
$14 = (inode_t *) 0x7ff69972c848
(gdb) p *local->inode_list[0]
$15 = {table = 0x7ff68d4d4d30, gfid = "-\240\350\347\342\272Dۯ%\"\253\256\211=A", lock = {spinlock = 0, mutex = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = 0, __spins = 0, __elision = 0, __list = {
          __prev = 0x0, __next = 0x0}}, __size = '\000' <repeats 39 times>, __align = 0}}, nlookup = 1, fd_count = 1, active_fd_count = 1, ref = 4, ia_type = IA_IFREG, fd_list = {next = 0x7ff63c027db8, prev = 0x7ff63c027db8}, 
  dentry_list = {next = 0x7ff6a42817e8, prev = 0x7ff6a42817e8}, hash = {next = 0x7ff68d9d9a50, prev = 0x7ff68d9d9a50}, list = {next = 0x7ff6981d5d00, prev = 0x7ff6996a8270}, _ctx = 0x7ff6981f6b40, invalidate_sent = _gf_false}
(gdb) call inode_resolve(0x7ff68d4d4d30, "/.shard/1da75ccf-3de7-4db4-85d2-c5bef44aaec8.12")
[Thread 0x7ff6a132c700 (LWP 15827) exited]
[Thread 0x7ff6a136d700 (LWP 15826) exited]
[Thread 0x7ff6a1471700 (LWP 15821) exited]
[Thread 0x7ff6a1430700 (LWP 15823) exited]
$19 = (inode_t *) 0x7ff69972c848

This is the only thing left to debug, which I will complete tomorrow.

Thanks,
Krutika

P.S. - I'm still using the setup, so would like it to be left undisturbed.

Comment 11 Krutika Dhananjay 2019-04-25 11:24:50 UTC
Found the issue. I had to load the trace translator manually in the volfile to find it. Again, the bug was hit on shard #12 of the file.

The logs seen at the time of failure are -

<readv-estale-failure>
[2019-04-25 10:14:28.404340] W [MSGID: 114031] [client-rpc-fops.c:2927:client3_3_readv_cbk] 0-vmstore-client-1: remote operation failed [No such file or directory]
[2019-04-25 10:14:28.404684] W [MSGID: 114031] [client-rpc-fops.c:1464:client3_3_fstat_cbk] 0-vmstore-client-1: remote operation failed [No such file or directory]
[2019-04-25 10:14:28.404703] W [MSGID: 114031] [client-rpc-fops.c:1464:client3_3_fstat_cbk] 0-vmstore-client-2: remote operation failed [No such file or directory]
[2019-04-25 10:14:28.404848] W [MSGID: 114031] [client-rpc-fops.c:1464:client3_3_fstat_cbk] 0-vmstore-client-0: remote operation failed [No such file or directory]
[2019-04-25 10:14:28.406755] E [MSGID: 109040] [dht-helper.c:1387:dht_migration_complete_check_task] 0-vmstore-dht: 6f750303-3bf5-4d88-89f8-d054ab6df464: failed to lookup the file on vmstore-dht [Stale file handle]
[2019-04-25 10:14:28.406813] W [fuse-bridge.c:2756:fuse_readv_cbk] 0-glusterfs-fuse: 1068032: READ => -1 gfid=33789188-e848-4e9f-90e5-700f31f257b8 fd=0x7f5d2407eb88 (Stale file handle)
</readv-estale-failure>

So in this setup, when creation of vms from template was initiated, the fallocate for the vm in question was sent on node 3's gluster mountpoint -

[2019-04-25 09:27:43.098832] I [trace.c:715:trace_mknod_cbk] 0-vmstore-shard-child: 224: gfid=00000000-0000-0000-0000-000000000000 (op_ret=0 *stbuf = {gfid=6f750303-3bf5-4d88-89f8-d054ab6df464 ino=9941925239492310116, mode=100660, nlink=1, uid=0, gid=0, size=0, blocks=0, atime=2019-04-25-09:27:43 mtime=2019-04-25-09:27:43 ctime=2019-04-25-09:27:43 atime_sec=1556184463, atime_nsec=97066167, mtime_sec=1556184463, mtime_nsec=97066167, ctime_sec=1556184463, ctime_nsec=98066174}, *preparent = {gfid=be318638-e8a0-4c6d-977d-7a937aa84806 ino=10916015845652580358, mode=40755, nlink=3, uid=0, gid=0, size=4096, blocks=8, atime=2019-04-15-11:15:26 mtime=2019-04-25-09:27:43 ctime=2019-04-25-09:27:43 atime_sec=1555326926, atime_nsec=688394340, mtime_sec=1556184463, mtime_nsec=97105740, ctime_sec=1556184463, ctime_nsec=97105740}, *postparent = {gfid=be318638-e8a0-4c6d-977d-7a937aa84806 ino=10916015845652580358, mode=40755, nlink=3, uid=0, gid=0, size=4096, blocks=8, atime=2019-04-15-11:15:26 mtime=2019-04-25-09:27:43 ctime=2019-04-25-09:27:43 atime_sec=1555326926, atime_nsec=688394340, mtime_sec=1556184463, mtime_nsec=97105740, ctime_sec=1556184463, ctime_nsec=97105740})                

and as part of its creation, shard #12 got linked with gfid 6f750303-3bf5-4d88-89f8-d054ab6df464 at time 09:27:43.*. And shard remembers this association in memory so it doesn't have to look it up on disk every time before sending operations on it.


But around 6 seconds later, an ftruncate to size 0 followed by another fallocate is sent on this very same vm image. This truncate entails deleting all shards of the file, shard #12 being no exception:

<ftruncate-on-node-2>

[2019-04-25 09:27:49.330619] I [trace.c:1730:trace_lookup] 0-vmstore-shard-child: 396: gfid=00000000-0000-0000-0000-000000000000 path=/.shard/33789188-e848-4e9f-90e5-700f31f257b8.12
[2019-04-25 09:27:49.331083] I [trace.c:626:trace_lookup_cbk] 0-vmstore-shard-child: 396: gfid=6f750303-3bf5-4d88-89f8-d054ab6df464 (op_ret=0 *buf {gfid=6f750303-3bf5-4d88-89f8-d054ab6df464 ino=9941925239492310116, mode=100660, nlink=1, uid=0, gid=0, size=67108864, blocks=131072, atime=2019-04-25-09:27:43 mtime=2019-04-25-09:27:43 ctime=2019-04-25-09:27:43 atime_sec=1556184463, atime_nsec=97066167, mtime_sec=1556184463, mtime_nsec=197066836, ctime_sec=1556184463, ctime_nsec=199066849}, *postparent {gfid=be318638-e8a0-4c6d-977d-7a937aa84806 ino=10916015845652580358, mode=40755, nlink=3, uid=0, gid=0, size=4096, blocks=8, atime=2019-04-15-11:15:26 mtime=2019-04-25-09:27:49 ctime=2019-04-25-09:27:49 atime_sec=1555326926, atime_nsec=688394340, mtime_sec=1556184469, mtime_nsec=236149085, ctime_sec=1556184469, ctime_nsec=236149085}                                                                                                                   

*** ^^ as part of resolution of shards before truncate, shard 12 is looked up in node 2. At this point it's still associated with gfid 6f750303-3bf5-4d88-89f8-d054ab6df464. ***
[2019-04-25 09:27:49.422671] I [trace.c:1884:trace_unlink] 0-vmstore-shard-child: 396: gfid=6f750303-3bf5-4d88-89f8-d054ab6df464 path=/.shard/33789188-e848-4e9f-90e5-700f31f257b8.12 flag=0
[2019-04-25 09:27:49.422712] I [trace.c:486:trace_unlink_cbk] 0-vmstore-shard-child: 396: gfid=26c203c6-f958-405c-a98b-0257f105c233 op_ret=0,  *preparent = {gfid=be318638-e8a0-4c6d-977d-7a937aa84806 ino=10916015845652580358, mode=40755, nlink=3, uid=0, gid=0, size=4096, blocks=8, atime=2019-04-15-11:15:26 mtime=2019-04-25-09:27:49 ctime=2019-04-25-09:27:49 atime_sec=1555326926, atime_nsec=688394340, mtime_sec=1556184469, mtime_nsec=422150398, ctime_sec=1556184469, ctime_nsec=422150398}, *postparent = {gfid=be318638-e8a0-4c6d-977d-7a937aa84806 ino=10916015845652580358, mode=40755, nlink=3, uid=0, gid=0, size=4096, blocks=8, atime=2019-04-15-11:15:26 mtime=2019-04-25-09:27:49 ctime=2019-04-25-09:27:49 atime_sec=1555326926, atime_nsec=688394340, mtime_sec=1556184469, mtime_nsec=422150398, ctime_sec=1556184469, ctime_nsec=422150398})                                                                                                               

*** ^^^ the unlink above as part of truncate successfully deletes the shard. this is expected ***

[2019-04-25 09:27:49.505079] I [trace.c:2236:trace_truncate] 0-vmstore-shard-child: 396: gfid=33789188-e848-4e9f-90e5-700f31f257b8 path=(null), offset=0
[2019-04-25 09:27:49.506329] I [trace.c:935:trace_truncate_cbk] 0-vmstore-shard-child: 396: (op_ret=0, *prebuf = {gfid=33789188-e848-4e9f-90e5-700f31f257b8 ino=10440874521693149112, mode=100660, nlink=1, uid=36, gid=36, size=67108864, blocks=131072, atime=2019-04-25-09:27:42 mtime=2019-04-25-09:27:43 ctime=2019-04-25-09:27:49 atime_sec=1556184462, atime_nsec=961065257, mtime_sec=1556184463, mtime_nsec=195066822, ctime_sec=1556184469, ctime_nsec=505109024}, *postbuf = {gfid=33789188-e848-4e9f-90e5-700f31f257b8 ino=10440874521693149112, mode=100660, nlink=1, uid=36, gid=36, size=0, blocks=0, atime=2019-04-25-09:27:42 mtime=2019-04-25-09:27:49 ctime=2019-04-25-09:27:49 atime_sec=1556184462, atime_nsec=961065257, mtime_sec=1556184469, mtime_nsec=505109024, ctime_sec=1556184469, ctime_nsec=505109024} )                                                                                                                                                

*** ^^^ truncate on base file is also completed ***

[2019-04-25 09:27:49.506375] I [trace.c:1667:trace_xattrop] 0-vmstore-shard-child: 396: gfid=33789188-e848-4e9f-90e5-700f31f257b8 (path=(null) flags=1)
[2019-04-25 09:27:49.506929] I [trace.c:1342:trace_xattrop_cbk] 0-vmstore-shard-child: 396: gfid=33789188-e848-4e9f-90e5-700f31f257b8 op_ret=0, op_errno=0
[2019-04-25 09:27:49.506951] I [trace.c:1195:trace_ftruncate_cbk] 0-vmstore-shard-parent: 396: op_ret=0, *prebuf = {gfid=33789188-e848-4e9f-90e5-700f31f257b8 ino=10440874521693149112, mode=100660, nlink=1, uid=36, gid=36, size=42949672960, blocks=83886080, atime=2019-04-25-09:27:42 mtime=2019-04-25-09:27:43 ctime=2019-04-25-09:27:43 atime_sec=1556184462, atime_nsec=961065257, mtime_sec=1556184463, mtime_nsec=195066822, ctime_sec=1556184463, ctime_nsec=341067799}, *postbuf = {gfid=33789188-e848-4e9f-90e5-700f31f257b8 ino=10440874521693149112, mode=100660, nlink=1, uid=36, gid=36, size=0, blocks=83755008, atime=2019-04-25-09:27:42 mtime=2019-04-25-09:27:43 ctime=2019-04-25-09:27:43 atime_sec=1556184462, atime_nsec=961065257, mtime_sec=1556184463, mtime_nsec=195066822, ctime_sec=1556184463, ctime_nsec=341067799} )

*** ^^^ and now shard unwinds success to its parent which is passed on till fuse/qemu. this marks the completion of ftruncate ***
</ftruncate>

Not too long after this, the same client at node 2 initiates fallocate again on the vm. As part of this, block #12 gets recreated but now with a different gfid. Nothing surprising yet.

<mknod-block-12-again>
[2019-04-25 09:27:49.601551] I [trace.c:1823:trace_mknod] 0-vmstore-shard-child: 398: gfid=00000000-0000-0000-0000-000000000000 path=/.shard/33789188-e848-4e9f-90e5-700f31f257b8.12 mode=33200 umask=00, dev=0)
[2019-04-25 09:27:49.602291] I [trace.c:715:trace_mknod_cbk] 0-vmstore-shard-child: 398: gfid=00000000-0000-0000-0000-000000000000 (op_ret=0 *stbuf = {gfid=a53dd89c-c5c5-4821-a4df-71b3ce0671e5 ino=11880339359096271333, mode=100660, nlink=1, uid=0, gid=0, size=0, blocks=0, atime=2019-04-25-09:27:49 mtime=2019-04-25-09:27:49 ctime=2019-04-25-09:27:49 atime_sec=1556184469, atime_nsec=600927283, mtime_sec=1556184469, mtime_nsec=600927283, ctime_sec=1556184469, ctime_nsec=600927283}, *preparent = {gfid=be318638-e8a0-4c6d-977d-7a937aa84806 ino=10916015845652580358, mode=40755, nlink=3, uid=0, gid=0, size=4096, blocks=8, atime=2019-04-15-11:15:26 mtime=2019-04-25-09:27:49 ctime=2019-04-25-09:27:49 atime_sec=1555326926, atime_nsec=688394340, mtime_sec=1556184469, mtime_nsec=601151662, ctime_sec=1556184469, ctime_nsec=601151662}, *postparent = {gfid=be318638-e8a0-4c6d-977d-7a937aa84806 ino=10916015845652580358, mode=40755, nlink=3, uid=0, gid=0, size=4096, blocks=8, atime=2019-04-15-11:15:26 mtime=2019-04-25-09:27:49 ctime=2019-04-25-09:27:49 atime_sec=1555326926, atime_nsec=688394340, mtime_sec=1556184469, mtime_nsec=601151662, ctime_sec=1556184469, ctime_nsec=601151662})        
</mknod-block-12-again>

Now shard 12 is linked to gfid a53dd89c-c5c5-4821-a4df-71b3ce0671e5.

And when the creation of the VM from the template finishes, I attempt to boot the VM. The VM gets launched on host 3, which still holds the old gfid-to-block-12 association in memory. It creates an anon fd on this inode and sends a readv, which fails with ESTALE because the old .glusterfs gfid link no longer exists.

Comment 12 Krutika Dhananjay 2019-04-25 11:33:04 UTC
This new method of creating vms from templates breaks a major assumption in sharding about this workload being a single-writer/reader use-case.

I will think about whether there is an easy way to fix this without hurting performance in terms of having to resolve each shard before the read.

Meanwhile, I am curious to know why, in the new scheme (I say new because the algo seems to have changed since the previous version of RHV), RHV successfully fallocates the image through one mount, only to discard it all with a truncate to 0 and a reallocation (another fallocate) on another mount point. And the larger the image, the more time is spent in fallocation followed by truncation followed by reallocation. What exactly is the reason for VM creation from a template being done this way? Is this expected behavior?
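
Purely as an illustration of the sequence seen in the trace logs (paths and size are placeholders; in the actual flow these calls are issued internally, not via these exact commands), the observed order corresponds roughly to:

# on host 3's gluster mount
fallocate -l 40G /mnt/vmstore/<image>   # full-size preallocation; shard gfids get cached on host 3
# a few seconds later, on host 2's gluster mount
truncate -s 0 /mnt/vmstore/<image>      # deletes all shards
fallocate -l 40G /mnt/vmstore/<image>   # recreates the shards with new gfids
# later reads through host 3's mount still use the cached (now stale) gfids -> ESTALE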

-Krutika

Comment 13 Mugdha Soni 2019-04-25 13:02:27 UTC
(In reply to Raghavendra G from comment #9)
> (In reply to Mugdha Soni from comment #0)
> > +++ This bug was initially created as a clone of Bug #1700316 +++
> > 
> > Description of problem:
> > ------------------------------
> > After user creates a VM out of template and runs it the VM is going to
> > paused state with storage I/O storage error and later investigating the logs
> > found an error stating " stale file handler ".
> 
> Which logs did you find "Stale file handle"? Is it logs of VM or glusterfs
> client logs? Asking this question to identify failure of which syscall
> might've caused the VM to pause. From glusterfs client logs attached I could
> see statfs and readv responses fail with ESTALE errors. I assume one of
> these would've caused VM Pause, but just want to confirm the same.

Error was seen in glusterfs client logs.

Comment 14 Krutika Dhananjay 2019-04-25 13:45:03 UTC
Restoring needinfo on Benny Zlotnick.

Comment 15 Benny Zlotnik 2019-04-28 10:53:00 UTC
Eyal worked on this, moving needinfo

Comment 17 Krutika Dhananjay 2019-05-02 05:05:30 UTC
(In reply to Benny Zlotnik from comment #15)
> Eyal worked on this, moving needinfo

Hi,

Did you get a chance to check the queries?

Comment 18 Eyal Shenitzky 2019-05-07 06:07:46 UTC
Hi,

Can you please provide more information about the VM / template?

Gluster based?
Thin/Preallocated?
etc..

Comment 19 Krutika Dhananjay 2019-05-07 06:12:54 UTC
(In reply to Eyal Shenitzky from comment #18)
> Hi,
> 
> Can you please provide more information about the VM / template?
> 
> Gluster based?
> Thin/Preallocated?
> etc..

Yes, gluster-based.

I *think* it is preallocated but I'll let Sas confirm.

Comment 20 Krutika Dhananjay 2019-05-07 06:20:52 UTC
Also, I think that the truncate itself that I talked about in comment #12 was internally called by the `qemu-img` command.
So what I'm observing is a case of multiple hypervisors trying to execute qemu-img/preallocate on the same file a few seconds apart, with the second hypervisor effectively undoing and redoing everything that the first one did.

Comment 21 Krutika Dhananjay 2019-05-08 05:33:54 UTC
(In reply to Krutika Dhananjay from comment #20)
> Also, I think that the truncate itself that I talked about in comment #12
> was internally called by the `qemu-img` command.
> So what I'm observing is a case of multiple hypervisors trying to execute
> qemu-img/preallocate on the same file a few seconds apart, with the second
> hypervisor effectively undoing and redoing everything that the first one did.

Setting needinfo on Eyal over comments #12 and #20.

Comment 22 Eyal Shenitzky 2019-05-08 06:34:34 UTC
Please attach the engine log and the vdsm log from the hypervisor

Comment 23 SATHEESARAN 2019-05-22 08:10:48 UTC
(In reply to Krutika Dhananjay from comment #19)
> (In reply to Eyal Shenitzky from comment #18)
> > Hi,
> > 
> > Can you please provide more information about the VM / template?
> > 
> > Gluster based?
> > Thin/Preallocated?
> > etc..
> 
> Yes, gluster-based.
> 
> I *think* it is preallocated but I'll let Sas confirm.

Yes, it's Gluster-based and preallocated.

Comment 24 SATHEESARAN 2019-05-22 08:13:16 UTC
(In reply to Eyal Shenitzky from comment #22)
> Please attach the engine log and the vdsm log from the hypervisor

The sosreport from the hypervisor is available. Refer to comment #2.

Comment 27 Eyal Shenitzky 2019-06-12 07:56:30 UTC
(In reply to Krutika Dhananjay from comment #26)
> (In reply to Atin Mukherjee from comment #25)
> > Status please?
> 
> Setting needinfo on the right person.
> 
> Eyal, any news on this?

Can you please add the engine / vdsm version?

There was no change in the flow for creating raw-preallocated file-based volumes in vdsm for versions < 4.3.

In version 4.3 we create raw-preallocated file-based volume in the following flow:

1. truncate the volume to its size
2. fallocate the volume to the requested size
3. changing the volume permissions 

The flow that is used for version 4.3 when copying raw-preallocated file-based template volume for new VM that is based on it:

1. truncate the volume to its size
2. fallocate the volume to an initial size (1)
3. changing the volume permissions
4. fallocate the created volume again to its full size.

This change was done in order to keep the file-based volume as a preallocated volume (before version 4.3 it was converted to sparse) 
and to prevent fallocating the volume twice.
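
A rough shell equivalent of the 4-step flow above (purely illustrative; vdsm performs these steps through its own code, not these exact commands; path and size are placeholders, ownership 36:36 and mode 660 are taken from the trace logs in comment #11):

truncate -s 40G <volume-path>                          # 1. truncate the volume to its size
fallocate -l 1 <volume-path>                           # 2. fallocate to the initial size (1)
chown 36:36 <volume-path>; chmod 660 <volume-path>     # 3. change the volume permissions
fallocate -l 40G <volume-path>                         # 4. fallocate again to the full size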

Comment 28 Sahina Bose 2019-06-13 10:08:42 UTC
Given that there's a change in the fallocate process, do we need any changes?

Comment 29 Krutika Dhananjay 2019-06-18 06:19:42 UTC
(In reply to Eyal Shenitzky from comment #27)
> (In reply to Krutika Dhananjay from comment #26)
> > (In reply to Atin Mukherjee from comment #25)
> > > Status please?
> > 
> > Setting needinfo on the right person.
> > 
> > Eyal, any news on this?
> 
> Can you please add the engine / vdsm version?
> 
> There was no change in the flow for creating raw-preallocated file-based
> volumes in vdsm for versions < 4.3.
> 
> In version 4.3 we create raw-preallocated file-based volume in the following
> flow:
> 
> 1. truncate the volume to its size
> 2. fallocate the volume to the requested size
> 3. changing the volume permissions 
> 
> The flow that is used for version 4.3 when copying raw-preallocated
> file-based template volume for new VM that is based on it:
> 
> 1. truncate the volume to its size
> 2. fallocate the volume to an initial size (1)
> 3. changing the volume permissions
> 4. fallocate the created volume again to it's full size.
> 
> This change was done in order to keep the file-based volume as preallocated
> volume (before version 4.3 it was converted to sparse) 
> and to prevent fallocating the volume twice.

So what I see in the gluster trace logs is different from what is described above.

In particular, the initial fallocate is not done to size 1(byte?). In fact the file is fully fallocated to its real size of 40GB before being truncated to 0 and then fallocated again to full size.

The vdsm version used:
vdsm.x86_64                           4.30.13-1.el7ev            @rhv-4.3.3

Am I missing something here?

-Krutika

Comment 30 Krutika Dhananjay 2019-06-18 06:27:22 UTC
(In reply to Sahina Bose from comment #28)
> Given that there's a change in the fallocate process, do we need any changes?

I think if the first fallocate that is executed is of size < $features.shard-block-size, there are no changes needed and things should work fine without any pauses.
But I don't see that to be the case; it seems to be doing fallocate to full size both times.
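
For completeness, the shard block size in effect can be checked with the command below (volume name vmstore assumed from the logs; the 67108864-byte shards seen above suggest the 64MB default):

gluster volume get vmstore features.shard-block-size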

-Krutika

Comment 31 Eyal Shenitzky 2019-06-19 06:19:04 UTC
(In reply to Krutika Dhananjay from comment #29)
> (In reply to Eyal Shenitzky from comment #27)
> > (In reply to Krutika Dhananjay from comment #26)
> > > (In reply to Atin Mukherjee from comment #25)
> > > > Status please?
> > > 
> > > Setting needinfo on the right person.
> > > 
> > > Eyal, any news on this?
> > 
> > Can you please add the engine / vdsm version?
> > 
> > There was no change in the flow for creating raw-preallocated file-based
> > volumes in vdsm for versions < 4.3.
> > 
> > In version 4.3 we create raw-preallocated file-based volume in the following
> > flow:
> > 
> > 1. truncate the volume to its size
> > 2. fallocate the volume to the requested size
> > 3. changing the volume permissions 
> > 
> > The flow that is used for version 4.3 when copying raw-preallocated
> > file-based template volume for new VM that is based on it:
> > 
> > 1. truncate the volume to its size
> > 2. fallocate the volume to an initial size (1)
> > 3. changing the volume permissions
> > 4. fallocate the created volume again to it's full size.
> > 
> > This change was done in order to keep the file-based volume as preallocated
> > volume (before version 4.3 it was converted to sparse) 
> > and to prevent fallocating the volume twice.
> 
> So what I see in the gluster trace logs is different from what is described
> above.
> 
> In particular, the initial fallocate is not done to size 1(byte?). In fact
> the file is fully fallocated to its real size of 40GB before being truncated
> to 0 and then fallocated again to full size.
> 
> The vdsm version used:
> vdsm.x86_64                           4.30.13-1.el7ev            @rhv-4.3.3
> 
> Am I missing something here?
> 
> -Krutika

Can you please explain what looks problematic to you (on the RHV / VDSM side)?
What do you think is causing the problem?

Comment 32 Sahina Bose 2019-06-19 08:30:20 UTC
(In reply to Eyal Shenitzky from comment #31)
> (In reply to Krutika Dhananjay from comment #29)
> > (In reply to Eyal Shenitzky from comment #27)
> > > (In reply to Krutika Dhananjay from comment #26)
> > > > (In reply to Atin Mukherjee from comment #25)
> > > > > Status please?
> > > > 
> > > > Setting needinfo on the right person.
> > > > 
> > > > Eyal, any news on this?
> > > 
> > > Can you please add the engine / vdsm version?
> > > 
> > > There was no change in the flow for creating raw-preallocated file-based
> > > volumes in vdsm for versions < 4.3.
> > > 
> > > In version 4.3 we create raw-preallocated file-based volume in the following
> > > flow:
> > > 
> > > 1. truncate the volume to its size
> > > 2. fallocate the volume to the requested size
> > > 3. changing the volume permissions 
> > > 
> > > The flow that is used for version 4.3 when copying raw-preallocated
> > > file-based template volume for new VM that is based on it:
> > > 
> > > 1. truncate the volume to its size
> > > 2. fallocate the volume to an initial size (1)
> > > 3. changing the volume permissions
> > > 4. fallocate the created volume again to it's full size.
> > > 
> > > This change was done in order to keep the file-based volume as preallocated
> > > volume (before version 4.3 it was converted to sparse) 
> > > and to prevent fallocating the volume twice.
> > 
> > So what I see in the gluster trace logs is different from what is described
> > above.
> > 
> > In particular, the initial fallocate is not done to size 1(byte?). In fact
> > the file is fully fallocated to its real size of 40GB before being truncated
> > to 0 and then fallocated again to full size.
> > 
> > The vdsm version used:
> > vdsm.x86_64                           4.30.13-1.el7ev            @rhv-4.3.3
> > 
> > Am I missing something here?
> > 
> > -Krutika
> 
> Can you please explain what is look problematic to you (RHV / VDSM side)?
> What do you think that causing the problem?

Let me try to answer this. Eyal, as per the flow you mentioned in comment 27, the file is initially truncated to its size of 40G, and then fallocate is called with 1 byte.
But Krutika thinks fallocate was called with the real size of 40GB and the file was then truncated to 0.
This is where the mismatch is seen. Can you confirm whether the workflow is in fact calling fallocate (full size), truncate (0) and then fallocate (full size)?

Comment 33 Eyal Shenitzky 2019-06-19 08:40:00 UTC
(In reply to Sahina Bose from comment #32)
> (In reply to Eyal Shenitzky from comment #31)
> > (In reply to Krutika Dhananjay from comment #29)
> > > (In reply to Eyal Shenitzky from comment #27)
> > > > (In reply to Krutika Dhananjay from comment #26)
> > > > > (In reply to Atin Mukherjee from comment #25)
> > > > > > Status please?
> > > > > 
> > > > > Setting needinfo on the right person.
> > > > > 
> > > > > Eyal, any news on this?
> > > > 
> > > > Can you please add the engine / vdsm version?
> > > > 
> > > > There was no change in the flow for creating raw-preallocated file-based
> > > > volumes in vdsm for versions < 4.3.
> > > > 
> > > > In version 4.3 we create raw-preallocated file-based volume in the following
> > > > flow:
> > > > 
> > > > 1. truncate the volume to its size
> > > > 2. fallocate the volume to the requested size
> > > > 3. changing the volume permissions 
> > > > 
> > > > The flow that is used for version 4.3 when copying raw-preallocated
> > > > file-based template volume for new VM that is based on it:
> > > > 
> > > > 1. truncate the volume to its size
> > > > 2. fallocate the volume to an initial size (1)
> > > > 3. changing the volume permissions
> > > > 4. fallocate the created volume again to it's full size.
> > > > 
> > > > This change was done in order to keep the file-based volume as preallocated
> > > > volume (before version 4.3 it was converted to sparse) 
> > > > and to prevent fallocating the volume twice.
> > > 
> > > So what I see in the gluster trace logs is different from what is described
> > > above.
> > > 
> > > In particular, the initial fallocate is not done to size 1(byte?). In fact
> > > the file is fully fallocated to its real size of 40GB before being truncated
> > > to 0 and then fallocated again to full size.
> > > 
> > > The vdsm version used:
> > > vdsm.x86_64                           4.30.13-1.el7ev            @rhv-4.3.3
> > > 
> > > Am I missing something here?
> > > 
> > > -Krutika
> > 
> > Can you please explain what is look problematic to you (RHV / VDSM side)?
> > What do you think that causing the problem?
> 
> Let me try to answer this. Eyal, as per the flow you mentioned in comment
> 27, initially truncate file to size 40G, and then fallocate is called with 1
> byte
> But Krutika thinks fallocate was called with real size of 40GB and then
> truncated to 0.
> This is where there's a mismatch seen. Can you confirm if the work flow is
> infact calling fallocate (full size), truncate(0) and then fallocate (full
> size) ?

This I already understand; I am trying to understand how it affects you.

The situation where the volume is fully allocated twice happens only if deferred_preallocation is not supported by the host.

Comment 34 Krutika Dhananjay 2019-06-19 11:35:09 UTC
(In reply to Eyal Shenitzky from comment #33)
> (In reply to Sahina Bose from comment #32)
> > (In reply to Eyal Shenitzky from comment #31)
> > > (In reply to Krutika Dhananjay from comment #29)
> > > > (In reply to Eyal Shenitzky from comment #27)
> > > > > (In reply to Krutika Dhananjay from comment #26)
> > > > > > (In reply to Atin Mukherjee from comment #25)
> > > > > > > Status please?
> > > > > > 
> > > > > > Setting needinfo on the right person.
> > > > > > 
> > > > > > Eyal, any news on this?
> > > > > 
> > > > > Can you please add the engine / vdsm version?
> > > > > 
> > > > > There was no change in the flow for creating raw-preallocated file-based
> > > > > volumes in vdsm for versions < 4.3.
> > > > > 
> > > > > In version 4.3 we create raw-preallocated file-based volume in the following
> > > > > flow:
> > > > > 
> > > > > 1. truncate the volume to its size
> > > > > 2. fallocate the volume to the requested size
> > > > > 3. changing the volume permissions 
> > > > > 
> > > > > The flow that is used for version 4.3 when copying raw-preallocated
> > > > > file-based template volume for new VM that is based on it:
> > > > > 
> > > > > 1. truncate the volume to its size
> > > > > 2. fallocate the volume to an initial size (1)
> > > > > 3. changing the volume permissions
> > > > > 4. fallocate the created volume again to it's full size.
> > > > > 
> > > > > This change was done in order to keep the file-based volume as preallocated
> > > > > volume (before version 4.3 it was converted to sparse) 
> > > > > and to prevent fallocating the volume twice.
> > > > 
> > > > So what I see in the gluster trace logs is different from what is described
> > > > above.
> > > > 
> > > > In particular, the initial fallocate is not done to size 1(byte?). In fact
> > > > the file is fully fallocated to its real size of 40GB before being truncated
> > > > to 0 and then fallocated again to full size.
> > > > 
> > > > The vdsm version used:
> > > > vdsm.x86_64                           4.30.13-1.el7ev            @rhv-4.3.3
> > > > 
> > > > Am I missing something here?
> > > > 
> > > > -Krutika
> > > 
> > > Can you please explain what is look problematic to you (RHV / VDSM side)?
> > > What do you think that causing the problem?
> > 
> > Let me try to answer this. Eyal, as per the flow you mentioned in comment
> > 27, initially truncate file to size 40G, and then fallocate is called with 1
> > byte
> > But Krutika thinks fallocate was called with real size of 40GB and then
> > truncated to 0.
> > This is where there's a mismatch seen. Can you confirm if the work flow is
> > infact calling fallocate (full size), truncate(0) and then fallocate (full
> > size) ?
> 
> This I already understand, I am trying to understand how it is influencing
> on you.

This is a problem in RHHI as shard translator in gluster assumes that the writes are from a single writer.
And this was the case until this issue was hit, which breaks this assumption.
Even if full-size fallocates were done from the same gluster mount point for a given vm, we are good. But what I saw
was that the first full-size fallocate was executed on host 3's gluster mountpoint whereas the second was executed from
host 2's gluster mountpoint.


> 
> The situation that the volume will be fully allocated twice is only if
> deferred_preallocation is not supported by the host.

What's deferred_preallocation? Could you throw some light on that?

-Krutika

Comment 35 Krutika Dhananjay 2019-06-20 09:01:40 UTC
I simulated the flow where an image is first truncated to its actual size, then fallocated to 1B, and then fallocated to full size.
I can confirm that if these are the steps followed, there will be no I/O errors (which in turn implies there will be no VM pause at boot time).

-Krutika

Comment 36 SATHEESARAN 2019-06-20 14:04:06 UTC
I could reproduce this issue with RHV 4.3.5 + RHEL 7.7 + RHGS 3.5.0 (glusterfs-6.0-5) with the following steps:

1. Create a RHEL 7.6 VM with a 40GB boot disk, preallocated (using falloc) - all this is done with the VM creation workflow in RHV
2. Post OS installation, power off the VM and create a template of the VM
3. Create 10 VMs out of the template
4. Start these VMs all together.

One of the VMs paused.

Comment 37 Eyal Shenitzky 2019-06-23 06:28:05 UTC
> 
> This is a problem in RHHI as shard translator in gluster assumes that the
> writes are from a single writer.
> And this was the case until this issue was hit, which breaks this assumption.
> Even if full-size fallocates were done from the same gluster mount point for
> a given vm, we are good. But what I saw
> was that the first full-size fallocate was executed on host 3's gluster
> mountpoint whereas the second was executed from
> host 2's gluster mountpoint.

OK, now I understand.

Yes, this new flow breaks this assumption.
The allocation operation can be done from different hosts.

The first allocation can be done from host A and the second can be done from host B.
But, the allocations are done one after the other, not concurrently.


The ability to perform storage operations from different hosts that are not the SPM has existed since version 4.1.

> 
> 
> 
> What's deferred_preallocation? Could you throw some light on that?

The new deferred_preallocation capability is what I explained before.
It is the ability to postpone the full allocation to a later stage in the process.
In this case, the allocation is postponed to the point when the data is copied, not when the disk is created.

Hosts support this new capability since version 4.3.

> 
> -Krutika

Comment 38 Krutika Dhananjay 2019-06-24 09:32:34 UTC
(In reply to Eyal Shenitzky from comment #37)
> > 
> > This is a problem in RHHI as shard translator in gluster assumes that the
> > writes are from a single writer.
> > And this was the case until this issue was hit, which breaks this assumption.
> > Even if full-size fallocates were done from the same gluster mount point for
> > a given vm, we are good. But what I saw
> > was that the first full-size fallocate was executed on host 3's gluster
> > mountpoint whereas the second was executed from
> > host 2's gluster mountpoint.
> 
> OK, now I understand.
> 
> Yes, this new flow breaks this assumption.
> The allocation operation can be done from different hosts.
> 
> The first allocation can be done from host A and the second can be done from
> host B.
> But, the allocations are done one after the other, not concurrently.

Sure, but even if the first fallocate was indeed of 1B size and the second fallocate of full size with them being executed from different hosts, things will work.
I simulated the same operations on gluster with a 1B fallocate which is later followed by a full-size fallocate, and things work.

But in this case, what we're seeing is that both the first and second fallocate are being done to full-size.
In short, the problem is with two fallocates to full-size being executed from two different mounts, even if at different points of time.

Could you please check if that is the case in vdsm code? And also maybe point to the code in github where these two fallocates are done?

-Krutika

> 
> 
> The ability to perform storage operation from different hosts that are not
> the SPM
> existing since 4.1 version
> 
> > 
> > 
> > 
> > What's deferred_preallocation? Could you throw some light on that?
> 
> The new deferred_preallocation capability is what I explain before.
> It is the ability to postpone the full allocation to a later stage in the
> process.
> In this case, postpone the allocation to the point when the data is copied
> and not when the disk is allocated.
> 
> The host should support this new capability since version 4.3
> 
> > 
> > -Krutika

Comment 39 Eyal Shenitzky 2019-06-24 10:31:34 UTC
> Sure, but even if the first fallocate was indeed of 1B size and the second
> fallocate of full size with them being executed from different hosts, things
> will work.
> I simulated the same operations on gluster with a 1B fallocate which is
> later followed by a full-size fallocate, and things work.
> 
> But in this case, what we're seeing is that both the first and second
> fallocate are being done to full-size.
> In short, the problem is with two fallocates to full-size being executed
> from two different mounts, even if at different points of time.
> 
> Could you please check if that is the case in vdsm code? And also maybe
> point to the code in github where these two fallocates are done?

In order to check this, I need the VDSM logs. According to comment #11,
the bug appears in the logs around 2019-04-25 09:27:43, but the logs in the SOS report only go up to 2019-04-22.

This is the relevant code - https://github.com/oVirt/vdsm/blob/01b77155a367e3c863f29df079457e088cacc6fc/lib/vdsm/storage/fileVolume.py#L429

> 
> -Krutika

Comment 40 Krutika Dhananjay 2019-06-24 10:49:49 UTC
(In reply to Eyal Shenitzky from comment #39)
> > Sure, but even if the first fallocate was indeed of 1B size and the second
> > fallocate of full size with them being executed from different hosts, things
> > will work.
> > I simulated the same operations on gluster with a 1B fallocate which is
> > later followed by a full-size fallocate, and things work.
> > 
> > But in this case, what we're seeing is that both the first and second
> > fallocate are being done to full-size.
> > In short, the problem is with two fallocates to full-size being executed
> > from two different mounts, even if at different points of time.
> > 
> > Could you please check if that is the case in vdsm code? And also maybe
> > point to the code in github where these two fallocates are done?
> 
> In order to check this, I need the VDSM logs, according to comment #11 
> the bug appears in the logs around 2019-04-25 09:27:43 but the logs in the
> SOS report are until - 2019-04-22 only.
> 
> This is the relevant code -
> https://github.com/oVirt/vdsm/blob/01b77155a367e3c863f29df079457e088cacc6fc/
> lib/vdsm/storage/fileVolume.py#L429
> 
> > 
> > -Krutika

The sosreports are for older runs. Let me try to deduce the timestamp of vm pause from the old sosreports and get back.

-Krutika

Comment 41 Krutika Dhananjay 2019-06-24 11:48:27 UTC
(In reply to Krutika Dhananjay from comment #40)
> (In reply to Eyal Shenitzky from comment #39)
> > > Sure, but even if the first fallocate was indeed of 1B size and the second
> > > fallocate of full size with them being executed from different hosts, things
> > > will work.
> > > I simulated the same operations on gluster with a 1B fallocate which is
> > > later followed by a full-size fallocate, and things work.
> > > 
> > > But in this case, what we're seeing is that both the first and second
> > > fallocate are being done to full-size.
> > > In short, the problem is with two fallocates to full-size being executed
> > > from two different mounts, even if at different points of time.
> > > 
> > > Could you please check if that is the case in vdsm code? And also maybe
> > > point to the code in github where these two fallocates are done?
> > 
> > In order to check this, I need the VDSM logs, according to comment #11 
> > the bug appears in the logs around 2019-04-25 09:27:43 but the logs in the
> > SOS report are until - 2019-04-22 only.
> > 
> > This is the relevant code -
> > https://github.com/oVirt/vdsm/blob/01b77155a367e3c863f29df079457e088cacc6fc/
> > lib/vdsm/storage/fileVolume.py#L429
> > 
> > > 
> > > -Krutika
> 
> The sosreports are for older runs. Let me try to deduce the timestamp of vm
> pause from the old sosreports and get back.
> 
> -Krutika

Here's one that I found in the sosreports from comment #2:

[2019-04-16 06:42:51.201964] W [fuse-bridge.c:2756:fuse_readv_cbk] 0-glusterfs-fuse: 846853: READ => -1 gfid=1f9cddce-d0cc-46f8-ab4b-c2918ffaf01b fd=0x7ff690008128 (Stale file handle)

One more -
[2019-04-16 06:42:51.201964] W [fuse-bridge.c:2756:fuse_readv_cbk] 0-glusterfs-fuse: 846853: READ => -1 gfid=1f9cddce-d0cc-46f8-ab4b-c2918ffaf01b fd=0x7ff690008128 (Stale file handle)

Some more:
[2019-04-17 06:27:15.835814] W [fuse-bridge.c:2756:fuse_readv_cbk] 0-glusterfs-fuse: 1372629: READ => -1 gfid=1f9cddce-d0cc-46f8-ab4b-c2918ffaf01b fd=0x7ff690008128 (Stale file handle)
[2019-04-17 06:27:21.898401] W [fuse-bridge.c:2756:fuse_readv_cbk] 0-glusterfs-fuse: 1372645: READ => -1 gfid=1f9cddce-d0cc-46f8-ab4b-c2918ffaf01b fd=0x7ff690008128 (Stale file handle)
[2019-04-17 06:27:22.711071] W [fuse-bridge.c:2756:fuse_readv_cbk] 0-glusterfs-fuse: 1372647: READ => -1 gfid=1f9cddce-d0cc-46f8-ab4b-c2918ffaf01b fd=0x7ff690008128 (Stale file handle)
[2019-04-17 06:27:22.877170] W [fuse-bridge.c:2756:fuse_readv_cbk] 0-glusterfs-fuse: 1372649: READ => -1 gfid=1f9cddce-d0cc-46f8-ab4b-c2918ffaf01b fd=0x7ff690008128 (Stale file handle)

Please let me know if this is sufficient.
Also, all timestamps are in UTC.

-Krutika

Comment 42 Eyal Shenitzky 2019-06-25 08:40:22 UTC
From the VDSM logs, I see only a call for creating a new volume without an initial size, which means that the volume will be fully allocated twice (at creation and when copying the data).

Can you please verify that the host doesn't support the deferred_preallocation by running the following command on the hosts in the relevant environment and share the output:

vdsm-client Host getCapabilities | grep deferred_preallocation

Comment 43 Krutika Dhananjay 2019-06-25 08:54:06 UTC
(In reply to Eyal Shenitzky from comment #42)
> From the VDSM logs, I see only call for creating new volume without initial
> size, which means that the volume will be fully allocated twice (in the
> creation and when copying the data).
> 
> Can you please verify that the host doesn't support the
> deferred_preallocation by running the following command on the hosts in the
> relevant environment and share the output:
> 
> vdsm-client Host getCapabilities | grep deferred_preallocation

Asking Satheesaran and Mugdha since they have the setup.

Sas, Mugdha,

Could you share the output of the command above?

-Krutika

Comment 44 Krutika Dhananjay 2019-06-25 09:10:33 UTC
I do see these in the vdsm logs. A search for "deferred_preallocation" shows the following match: "u'deferred_preallocation': True,".
Let me know if this is the same as the output you're looking for from the vdsm command in comment #42.


2019-04-22 16:01:10,248+0530 INFO  (jsonrpc/6) [api.host] FINISH getCapabilities return={'status': {'message': 'Done', 'code': 0}, 'info': {u'HBAInventory': {u'iSCSI': [{u'InitiatorName': u'iqn.1994-05.com.redhat:4a694c96f48'}], u'FC': []}, u'packages2': {u'kernel': {u'release': u'957.10.1.el7.x86_64', u'version': u'3.10.0'}, u'glusterfs-rdma': {u'release': u'47.el7rhgs', u'version': u'3.12.2'}, u'glusterfs-fuse': {u'release': u'47.el7rhgs', u'version': u'3.12.2'}, u'spice-server': {u'release': u'6.el7_6.1', u'version': u'0.14.0'}, u'librbd1': {u'release': u'4.el7', u'version': u'10.2.5'}, u'vdsm': {u'release': u'1.el7ev', u'version': u'4.30.13'}, u'qemu-kvm': {u'release': u'18.el7_6.3', u'version': u'2.12.0'}, u'openvswitch': {u'release': u'5.el7ev', u'version': u'2.10'}, u'libvirt': {u'release': u'10.el7_6.6', u'version': u'4.5.0'}, u'ovirt-hosted-engine-ha': {u'release': u'1.el7ev', u'version': u'2.3.1'}, u'qemu-img': {u'release': u'18.el7_6.3', u'version': u'2.12.0'}, u'mom': {u'release': u'1.el7ev', u'version': u'0.5.12'}, u'glusterfs': {u'release': u'47.el7rhgs', u'version': u'3.12.2'}, u'glusterfs-cli': {u'release': u'47.el7rhgs', u'version': u'3.12.2'}, u'glusterfs-server': {u'release': u'47.el7rhgs', u'version': u'3.12.2'}, u'glusterfs-geo-replication': {u'release': u'47.el7rhgs', u'version': u'3.12.2'}}, u'numaNodeDistance': {u'1': [21, 10], u'0': [10, 21]}, u'cpuModel': u'Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz', u'nestedVirtualization': False, u'liveMerge': u'true', u'hooks': {u'after_vm_start': {u'openstacknet_utils.py': {u'md5': u'1ed38ddf30f8a9c7574589e77e2c0b1f'}, u'50_openstacknet': {u'md5': u'ea0a5a715da8c1badbcda28e8b8fa00e'}}, u'after_network_setup': {u'30_ethtool_options': {u'md5': u'ce1fbad7aa0389e3b06231219140bf0d'}}, u'before_vm_start': {u'50_hostedengine': {u'md5': u'95c810cdcfe4195302a59574a5148289'}, u'50_vhostmd': {u'md5': u'9206bc390bcbf208b06a8e899581be2d'}}, u'after_device_migrate_destination': {u'openstacknet_utils.py': {u'md5': u'1ed38ddf30f8a9c7574589e77e2c0b1f'}, u'50_openstacknet': {u'md5': u'6226fbc4d1602994828a3904fc1b875d'}}, u'before_device_migrate_destination': {u'50_vmfex': {u'md5': u'49caba1a5faadd8efacef966f79bc30a'}}, u'before_device_create': {u'50_vmfex': {u'md5': u'49caba1a5faadd8efacef966f79bc30a'}, u'openstacknet_utils.py': {u'md5': u'1ed38ddf30f8a9c7574589e77e2c0b1f'}, u'50_openstacknet': {u'md5': u'ec8d5d7ba063a109f749cd63c7a2b58d'}, u'20_ovirt_provider_ovn_vhostuser_hook': {u'md5': u'a8af653b7386c138b2e6e9738bd6b62c'}, u'10_ovirt_provider_ovn_hook': {u'md5': u'73822988042847bab1ea832a6b9fa837'}}, u'after_get_caps': {u'openstacknet_utils.py': {u'md5': u'1ed38ddf30f8a9c7574589e77e2c0b1f'}, u'50_openstacknet': {u'md5': u'5c3a9ab6e06e039bdd220c0216e45809'}, u'ovirt_provider_ovn_hook': {u'md5': u'4c4b1d2d5460e6a65114ae36cb775df6'}}, u'after_nic_hotplug': {u'openstacknet_utils.py': {u'md5': u'1ed38ddf30f8a9c7574589e77e2c0b1f'}, u'50_openstacknet': {u'md5': u'6226fbc4d1602994828a3904fc1b875d'}}, u'before_vm_migrate_destination': {u'50_vhostmd': {u'md5': u'9206bc390bcbf208b06a8e899581be2d'}}, u'after_device_create': {u'openstacknet_utils.py': {u'md5': u'1ed38ddf30f8a9c7574589e77e2c0b1f'}, u'50_openstacknet': {u'md5': u'6226fbc4d1602994828a3904fc1b875d'}}, u'before_vm_dehibernate': {u'50_vhostmd': {u'md5': u'9206bc390bcbf208b06a8e899581be2d'}}, u'before_nic_hotplug': {u'50_vmfex': {u'md5': u'49caba1a5faadd8efacef966f79bc30a'}, u'openstacknet_utils.py': {u'md5': u'1ed38ddf30f8a9c7574589e77e2c0b1f'}, u'50_openstacknet': {u'md5': 
u'ec8d5d7ba063a109f749cd63c7a2b58d'}, u'20_ovirt_provider_ovn_vhostuser_hook': {u'md5': u'a8af653b7386c138b2e6e9738bd6b62c'}, u'10_ovirt_provider_ovn_hook': {u'md5': u'73822988042847bab1ea832a6b9fa837'}}, u'before_network_setup': {u'50_fcoe': {u'md5': u'28c352339c8beef1e1b05c67d106d062'}}, u'after_vm_destroy': {u'delete_vhostuserclient_hook': {u'md5': u'c2f279cc9483a3f842f6c29df13994c1'}, u'50_vhostmd': {u'md5': u'bdf4802c0521cf1bae08f2b90a9559cf'}}}, u'supportsIPv6': True, u'realtimeKernel': False, u'vmTypes': [u'kvm'], u'liveSnapshot': u'true', u'cpuThreads': u'32', u'kdumpStatus': 0, u'deferred_preallocation': True, u'networks': {u'ovirtmgmt': {u'iface': u'ovirtmgmt', u'ipv6autoconf': True, u'addr': u'10.70.45.1', u'dhcpv6': False, u'ipv6addrs': [u'2620:52:0:462c:ec4:7aff:fe57:c748/64'], u'switch': u'legacy', u'bridged': True, u'southbound': u'ens3f0', u'dhcpv4': True, u'netmask': u'255.255.252.0', u'ipv4defaultroute': True, u'stp': u'off', u'ipv4addrs': [u'10.70.45.1/22'], u'mtu': u'1500', u'ipv6gateway': u'fe80:52:0:462c::3fe', u'gateway': u'10.70.47.254', u'ports': [u'vnet0', u'vnet1', u'vnet2', u'ens3f0']}, u'gluster_network': {u'iface': u'enp4s0f0', u'ipv6autoconf': True, u'addr': u'10.70.36.79', u'dhcpv6': False, u'ipv6addrs': [u'2620:52:0:4624:ec4:7aff:fe6f:3388/64'], u'switch': u'legacy', u'bridged': False, u'mtu': u'1500', u'dhcpv4': True, u'netmask': u'255.255.254.0', u'ipv4defaultroute': False, u'ipv4addrs': [u'10.70.36.79/23'], u'interface': u'enp4s0f0', u'southbound': u'enp4s0f0', u'ipv6gateway': u'::', u'gateway': u'10.70.37.254'}}, u'kernelArgs': u'BOOT_IMAGE=/vmlinuz-3.10.0-957.10.1.el7.x86_64 root=/dev/mapper/rhel_rhsqa--grafton1--nic2-root ro crashkernel=auto rd.lvm.lv=rhel_rhsqa-grafton1-nic2/root
rd.lvm.lv=rhel_rhsqa-grafton1-nic2/swap rhgb quiet LANG=en_US.UTF-8', u'domain_versions': [0, 2, 3, 4, 5], u'bridges': {u'ovirtmgmt': {u'ipv6autoconf': True, u'addr': u'10.70.45.1', u'dhcpv6': False, u'ipv6addrs': [u'2620:52:0:462c:ec4:7aff:fe57:c748/64'], u'mtu': u'1500', u'dhcpv4': True, u'netmask': u'255.255.252.0', u'ipv4defaultroute': True, u'stp': u'off', u'ipv4addrs': [u'10.70.45.1/22'], u'ipv6gateway': u'fe80:52:0:462c::3fe', u'gateway': u'10.70.47.254', u'opts': {u'multicast_last_member_count': u'2', u'vlan_protocol': u'0x8100', u'hash_elasticity': u'4', u'multicast_query_response_interval': u'1000', u'group_fwd_mask': u'0x0', u'multicast_snooping': u'1', u'multicast_startup_query_interval': u'3125', u'hello_timer': u'0', u'multicast_querier_interval': u'25500', u'max_age': u'2000', u'hash_max': u'512', u'stp_state': u'0', u'topology_change_detected': u'0', u'priority': u'32768', u'multicast_igmp_version': u'2', u'multicast_membership_interval': u'26000', u'root_path_cost': u'0', u'root_port': u'0', u'multicast_stats_enabled': u'0', u'multicast_startup_query_count': u'2', u'nf_call_iptables': u'0', u'vlan_stats_enabled': u'0', u'hello_time': u'200', u'topology_change': u'0', u'bridge_id': u'8000.0cc47a57c748', u'topology_change_timer': u'0', u'ageing_time': u'30000', u'nf_call_ip6tables': u'0', u'multicast_mld_version': u'1', u'gc_timer': u'1001', u'root_id': u'8000.0cc47a57c748', u'nf_call_arptables': u'0', u'group_addr': u'1:80:c2:0:0:0', u'multicast_last_member_interval': u'100', u'default_pvid': u'1', u'multicast_query_interval': u'12500', u'multicast_query_use_ifaddr': u'0', u'tcn_timer': u'0', u'multicast_router': u'1', u'vlan_filtering': u'0', u'multicast_querier': u'0', u'forward_delay': u'0'}, u'ports': [u'vnet0', u'vnet1', u'vnet2', u'ens3f0']}, u'virbr0': {u'ipv6autoconf': False, u'addr': u'192.168.122.1', u'dhcpv6': False, u'ipv6addrs': [], u'mtu': u'1500', u'dhcpv4': False, u'netmask': u'255.255.255.0', u'ipv4defaultroute': False, u'stp': u'on', u'ipv4addrs': [u'192.168.122.1/24'], u'ipv6gateway': u'::', u'gateway': u'', u'opts': {u'multicast_last_member_count': u'2', u'vlan_protocol': u'0x8100', u'hash_elasticity': u'4', u'multicast_query_response_interval': u'1000', u'group_fwd_mask': u'0x0', u'multicast_snooping': u'1', u'multicast_startup_query_interval': u'3125', u'hello_timer': u'109', u'multicast_querier_interval': u'25500', u'max_age': u'2000', u'hash_max': u'512', u'stp_state': u'1', u'topology_change_detected': u'0', u'priority': u'32768', u'multicast_igmp_version': u'2', u'multicast_membership_interval': u'26000', u'root_path_cost': u'0', u'root_port': u'0', u'multicast_stats_enabled': u'0', u'multicast_startup_query_count': u'2', u'nf_call_iptables': u'0', u'vlan_stats_enabled': u'0', u'hello_time': u'200', u'topology_change': u'0', u'bridge_id': u'8000.525400f77a36', u'topology_change_timer': u'0', u'ageing_time': u'30000', u'nf_call_ip6tables': u'0', u'multicast_mld_version': u'1', u'gc_timer': u'21993', u'root_id': u'8000.525400f77a36', u'nf_call_arptables': u'0', u'group_addr': u'1:80:c2:0:0:0', u'multicast_last_member_interval': u'100', u'default_pvid': u'1', u'multicast_query_interval': u'12500', u'multicast_query_use_ifaddr': u'0', u'tcn_timer': u'0', u'multicast_router': u'1', u'vlan_filtering': u'0', u'multicast_querier': u'0', u'forward_delay': u'200'}, u'ports': [u'virbr0-nic']}}, u'uuid': u'00000000-0000-0000-0000-0cc47a6f3388', u'onlineCpus': u'0,1,2,3,4,5,6,7,16,17,18,19,20,21,22,23,8,9,10,11,12,13,14,15,24,25,26,27,28,29,30,31', 
u'nameservers': [u'10.70.34.2', u'10.70.34.1', u'10.75.5.25'], u'nics': {u'ens3f1': {u'ipv6autoconf': True, u'addr': u'', u'speed': 0, u'dhcpv6': False, u'ipv6addrs': [], u'mtu': u'1500', u'dhcpv4': False, u'netmask': u'', u'ipv4defaultroute': False, u'ipv4addrs': [], u'hwaddr': u'0c:c4:7a:57:c7:49', u'ipv6gateway': u'::', u'gateway': u''}, u'ens3f0': {u'ipv6autoconf': False, u'addr': u'', u'speed': 1000, u'dhcpv6': False, u'ipv6addrs': [], u'mtu': u'1500', u'dhcpv4': False, u'netmask': u'', u'ipv4defaultroute': False, u'ipv4addrs': [], u'hwaddr': u'0c:c4:7a:57:c7:48', u'ipv6gateway': u'::', u'gateway': u''}, u'enp4s0f0': {u'ipv6autoconf': True, u'addr': u'10.70.36.79', u'speed': 10000, u'dhcpv6': False, u'ipv6addrs': [u'2620:52:0:4624:ec4:7aff:fe6f:3388/64'], u'mtu': u'1500', u'dhcpv4': True, u'netmask': u'255.255.254.0', u'ipv4defaultroute': False, u'ipv4addrs': [u'10.70.36.79/23'], u'hwaddr': u'0c:c4:7a:6f:33:88', u'ipv6gateway': u'::', u'gateway': u'10.70.37.254'}, u'enp4s0f1': {u'ipv6autoconf': True, u'addr': u'', u'speed': 0, u'dhcpv6': False, u'ipv6addrs': [], u'mtu': u'1500', u'dhcpv4': False, u'netmask': u'', u'ipv4defaultroute': False, u'ipv4addrs': [], u'hwaddr': u'0c:c4:7a:6f:33:89', u'ipv6gateway': u'::', u'gateway': u''}}, u'software_revision': u'1', u'hostdevPassthrough': u'false', u'clusterLevels': [u'4.1', u'4.2', u'4.3'], u'cpuFlags': u'fpu,vme,de,pse,tsc,msr,pae,mce,cx8,apic,sep,mtrr,pge,mca,cmov,pat,pse36,clflush,dts,acpi,mmx,fxsr,sse,sse2,ss,ht,tm,pbe,syscall,nx,pdpe1gb,rdtscp,lm,constant_tsc,arch_perfmon,pebs,bts,rep_good,nopl,xtopology,nonstop_tsc,aperfmperf,eagerfpu,pni,pclmulqdq,dtes64,monitor,ds_cpl,vmx,smx,est,tm2,ssse3,sdbg,fma,cx16,xtpr,pdcm,pcid,dca,sse4_1,sse4_2,x2apic,movbe,popcnt,tsc_deadline_timer,aes,xsave,avx,f16c,rdrand,lahf_lm,abm,epb,intel_ppin,ssbd,ibrs,ibpb,stibp,tpr_shadow,vnmi,flexpriority,ept,vpid,fsgsbase,tsc_adjust,bmi1,avx2,smep,bmi2,erms,invpcid,cqm,xsaveopt,cqm_llc,cqm_occup_llc,dtherm,ida,arat,pln,pts,spec_ctrl,intel_stibp,flush_l1d,model_Opteron_G2,model_kvm32,model_SandyBridge-IBRS,model_Haswell-noTSX,model_Opteron_G1,model_core2duo,model_pentium,model_Westmere-IBRS,model_qemu64,model_kvm64,model_coreduo,model_Nehalem,model_Nehalem-IBRS,model_IvyBridge,model_Westmere,model_SandyBridge,model_IvyBridge-IBRS,model_n270,model_Haswell-noTSX-IBRS,model_qemu32,model_Penryn,model_pentium2,model_pentium3,model_Conroe,model_486', u'kernelFeatures': {u'IBRS': 0, u'RETP': 1, u'SSBD': 3, u'PTI': 1}, u'ISCSIInitiatorName': u'iqn.1994-05.com.redhat:4a694c96f48', u'netConfigDirty': u'False', u'selinux': {u'mode': u'1'}, u'autoNumaBalancing': 1, u'reservedMem': u'321', u'bondings': {}, u'software_version': u'4.30.13', u'supportedENGINEs': [u'4.1', u'4.2', u'4.3'], u'vncEncrypted': False, u'backupEnabled': False, u'cpuSpeed': u'2900.097', u'numaNodes': {u'1': {u'totalMemory': u'131072', u'cpus': [8, 9, 10, 11, 12, 13, 14, 15, 24, 25, 26, 27, 28, 29, 30, 31]}, u'0': {u'totalMemory': u'130961', u'cpus': [0, 1, 2, 3, 4, 5, 6, 7, 16, 17, 18, 19, 20, 21, 22, 23]}}, u'cpuSockets': u'2', u'vlans': {}, u'version_name': u'Snow Man', 'lastClientIface': 'lo', u'cpuCores': u'16', u'hostedEngineDeployed': True, u'hugepages': [1048576, 2048], u'guestOverhead': u'65', u'additionalFeatures': [u'libgfapi_supported', u'GLUSTER_SNAPSHOT', u'GLUSTER_GEO_REPLICATION', u'GLUSTER_BRICK_MANAGEMENT'], u'openstack_binding_host_ids': {u'OPENSTACK_OVN': u'rhsqa-grafton1-nic2.lab.eng.blr.redhat.com', u'OPEN_VSWITCH': u'rhsqa-grafton1-nic2.lab.eng.blr.redhat.com', 
u'OVIRT_PROVIDER_OVN': u'6b595e68-2518-44be-a957-48b254709e5b'}, u'kvmEnabled': u'true', u'memSize': u'257675', u'emulatedMachines': [u'pc-i440fx-rhel7.1.0', u'pc-q35-rhel7.3.0', u'rhel6.3.0', u'pc-i440fx-rhel7.5.0', u'pc-i440fx-rhel7.0.0', u'rhel6.1.0', u'pc-q35-rhel7.6.0', u'pc-i440fx-rhel7.4.0', u'rhel6.6.0', u'pc-q35-rhel7.5.0', u'rhel6.2.0', u'pc', u'pc-i440fx-rhel7.3.0', u'q35', u'pc-i440fx-rhel7.2.0', u'rhel6.4.0', u'pc-q35-rhel7.4.0', u'pc-i440fx-rhel7.6.0', u'rhel6.0.0', u'rhel6.5.0'], u'rngSources': [u'hwrng', u'random'], u'operatingSystem': {u'release': u'4.el7', u'pretty_name': u'Red Hat Enterprise Linux Server 7.6 (Maipo)', u'version': u'7.6', u'name': u'RHEL'}}} from=::1,57490 (api:54)

-Krutika

Comment 45 Eyal Shenitzky 2019-06-25 10:41:27 UTC
Yes, it is the same.

So I investigated it, and it seems we had a bug in the RHV engine: it didn't set the
initial size of the volume to 1 byte, which caused the volume to be fully allocated twice.

Once that bug is fixed, will it solve this issue?

Comment 46 Krutika Dhananjay 2019-06-25 12:30:30 UTC
(In reply to Eyal Shenitzky from comment #45)
> Yes, it is the same.
> 
> So I investigated it, and it seems we had a bug in the RHV engine: it didn't
> set the initial size of the volume to 1 byte, which caused the volume to be
> fully allocated twice.
> 
> Once that bug is fixed, will it solve this issue?

Yes, fixing that bug should solve this issue, based on a test I did directly on gluster: an initial fallocate of 1 byte followed by another fallocate to the full size, each executed from a different mount, and then a read of the file from the first mount. But we can also test your patch if you can give us a build containing the fix.
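
To make that reproduction concrete, here is a minimal sketch of the two-mount test, assuming two separate FUSE mounts of the same Gluster volume; the mount paths, file name, and sizes are illustrative, and Python's os.posix_fallocate stands in for the fallocate calls described above:

#!/usr/bin/env python3
# Minimal sketch of the two-mount fallocate test (paths and sizes are assumptions).
import os

MOUNT_A = "/mnt/vmstore-a"      # first FUSE mount of the volume (hypothetical)
MOUNT_B = "/mnt/vmstore-b"      # second FUSE mount of the same volume (hypothetical)
NAME = "test.img"
FULL_SIZE = 3 * 1024 ** 3       # e.g. 3 GiB, matching the disk size seen in the logs

# 1) initial 1-byte allocation from the first mount
fd_a = os.open(os.path.join(MOUNT_A, NAME), os.O_CREAT | os.O_RDWR, 0o660)
os.posix_fallocate(fd_a, 0, 1)

# 2) full-size allocation of the same file from the second mount
fd_b = os.open(os.path.join(MOUNT_B, NAME), os.O_RDWR)
os.posix_fallocate(fd_b, 0, FULL_SIZE)
os.close(fd_b)

# 3) read back through the first mount; this is the read that reproduced
#    the ESTALE failure discussed earlier in this bug
os.lseek(fd_a, 0, os.SEEK_SET)
data = os.read(fd_a, 4096)
print("read %d bytes" % len(data))
os.close(fd_a)
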
-Krutika

Comment 48 Sahina Bose 2019-06-25 14:54:28 UTC
Eyal, assigning to you based on Comment 45

Comment 49 Avihai 2019-07-04 06:33:41 UTC
Please provide a clear scenario so I can see if I can ack.

Comment 50 RHV bug bot 2019-07-04 13:10:59 UTC
WARN: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason:

[Found non-acked flags: '{'rhevm-4.3.z': '?'}', ]

For more info please contact: rhv-devops

Comment 51 Eyal Shenitzky 2019-07-08 05:16:52 UTC
Steps to reproduce:

1. Create a VM with a raw-preallocated file-based disk (Gluster in this case)
2. Create a template from this VM
3. Clone a new VM from the template created in step 2

Before the fix:

The imageInitialSizeInBytes that was sent to CreateVolumeVDSCommand was null:

2019-07-08 08:08:10,199+03 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.CreateVolumeVDSCommand] (default task-4) [c28bb1f4-624a-4bd7-ad10-5b792ed8c03b] START, CreateVolumeVDSCommand( CreateVolumeVDSCommandParameters:{storagePoolId='f7c3d0eb-7e72-4eb3-8f33-321c2d8cebb7', ignoreFailoverLimit='false', storageDomainId='8ec58e25-d3ae-4079-8925-880873dd15ad', imageGroupId='773e23e0-cc1a-48b6-8947-4bbb0a36bfbb', imageSizeInBytes='3221225472', volumeFormat='RAW', newImageId='40d5f984-2801-4ad0-8009-8c1623ca47ea', imageType='Preallocated', newImageDescription='null', imageInitialSizeInBytes='null', imageId='00000000-0000-0000-0000-000000000000', sourceImageGroupId='00000000-0000-0000-0000-000000000000'}), log id: 6a8b4221

After the fix:

The imageInitialSizeInBytes that is sent to CreateVolumeVDSCommand is '1':

2019-07-08 08:08:10,199+03 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.CreateVolumeVDSCommand] (default task-4) [c28bb1f4-624a-4bd7-ad10-5b792ed8c03b] START, CreateVolumeVDSCommand( CreateVolumeVDSCommandParameters:{storagePoolId='f7c3d0eb-7e72-4eb3-8f33-321c2d8cebb7', ignoreFailoverLimit='false', storageDomainId='8ec58e25-d3ae-4079-8925-880873dd15ad', imageGroupId='773e23e0-cc1a-48b6-8947-4bbb0a36bfbb', imageSizeInBytes='3221225472', volumeFormat='RAW', newImageId='40d5f984-2801-4ad0-8009-8c1623ca47ea', imageType='Preallocated', newImageDescription='null', imageInitialSizeInBytes='1', imageId='00000000-0000-0000-0000-000000000000', sourceImageGroupId='00000000-0000-0000-0000-000000000000'}), log id: 6a8b4221
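
For illustration only, the engine-side behaviour described above can be sketched as follows; the actual fix lives in ovirt-engine (Java, see the linked gerrit patches), so the function and parameter names here are hypothetical:

# Hypothetical sketch of the initial-size decision for preallocated
# file-based disks; not the actual ovirt-engine code.
def initial_size_bytes(volume_format, volume_type, is_file_domain):
    # When cloning a RAW/preallocated disk on a file-based domain (e.g. Gluster),
    # send an initial size of 1 byte so the full preallocation is done only once
    # on the destination volume instead of twice.
    if is_file_domain and volume_format == "RAW" and volume_type == "Preallocated":
        return 1
    return None   # other cases keep the previous behaviour

print(initial_size_bytes("RAW", "Preallocated", True))   # -> 1, matching the log above
print(initial_size_bytes("RAW", "Preallocated", False))  # -> None (non-file domains unaffected)
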

Comment 55 RHV bug bot 2019-12-13 13:14:11 UTC
WARN: Bug status (ON_QA) wasn't changed but the following should be fixed:

[Found non-acked flags: '{}', ]

For more info please contact: rhv-devops

Comment 56 RHV bug bot 2019-12-20 17:44:06 UTC
WARN: Bug status (ON_QA) wasn't changed but the following should be fixed:

[Found non-acked flags: '{}', ]

For more info please contact: rhv-devops

Comment 57 Evelina Shames 2019-12-23 10:28:16 UTC
Verified on:
engine-4.4.0-0.13.master.el7

with the following steps:
1. Create a VM with a raw-preallocated file-based disk (Gluster)
2. Create a template from this VM
3. Clone a new VM from the template created in step 2
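
In addition to the steps above, a quick host-side sanity check (not part of the formal verification) is to confirm that the clone's volume file on the Gluster mount ends up fully allocated; a minimal sketch, assuming a hypothetical path to the cloned volume:

# Check that the raw preallocated volume is actually allocated on disk.
# IMAGE_PATH is a placeholder; substitute the real volume path under the
# storage domain's images/<image_uuid>/ directory.
import os

IMAGE_PATH = "/rhev/data-center/mnt/glusterSD/<server>:_vmstore/<sd_uuid>/images/<img_uuid>/<vol_uuid>"

st = os.stat(IMAGE_PATH)
allocated = st.st_blocks * 512    # st_blocks is reported in 512-byte units
print("virtual size: %d bytes, allocated: %d bytes" % (st.st_size, allocated))
# For a RAW preallocated volume, 'allocated' should be roughly equal to st.st_size.
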

Comment 58 RHV bug bot 2020-01-08 14:49:34 UTC
WARN: Bug status (VERIFIED) wasn't changed but the following should be fixed:

[Found non-acked flags: '{}', ]

For more info please contact: rhv-devops

Comment 59 RHV bug bot 2020-01-08 15:17:38 UTC
WARN: Bug status (VERIFIED) wasn't changed but the following should be fixed:

[Found non-acked flags: '{}', ]

For more info please contact: rhv-devops

Comment 60 RHV bug bot 2020-01-24 19:51:23 UTC
WARN: Bug status (VERIFIED) wasn't changed but the following should be fixed:

[Found non-acked flags: '{}', ]

For more info please contact: rhv-devops

Comment 62 errata-xmlrpc 2020-08-04 13:17:36 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: RHV Manager (ovirt-engine) 4.4 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:3247

