Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1488719

Summary:	[RHHI] cannot boot vms created from template when disk format = qcow2
Product:	[Community] GlusterFS	Reporter:	Krutika Dhananjay <kdhananj>
Component:	sharding	Assignee:	bugs <bugs>
Status:	CLOSED CURRENTRELEASE	QA Contact:	bugs <bugs>
Severity:	high	Docs Contact:
Priority:	high
Version:	3.10	CC:	amureini, bugs, bugs, dchaplyg, kdhananj, knarra, ndevos, rhs-bugs, rkavunga, sabose, sasundar, storage-qa-internal
Target Milestone:	---	Keywords:	Triaged
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:	glusterfs-3.10.6	Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:	1488546	Environment:
Last Closed:	2017-10-06 17:11:53 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	Gluster	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:	1482994, 1488546
Bug Blocks:	1411323, 1488718

Description Krutika Dhananjay 2017-09-06 05:50:56 UTC

+++ This bug was initially created as a clone of Bug #1488546 +++

+++ This bug was initially created as a clone of Bug #1482994 +++

Description of problem:
Once the template is created, any new vms created from that template fails to start.

Attaching the screenshot for the same.

Version-Release number of selected component (if applicable):
glusterfs-3.8.4-41.el7rhgs.x86_64
vdsm-4.19.28-1.el7ev.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Configure gluster and install Hosted Engine using cockpit UI
2. Add additional hosts and create storage domains
3. create a vm.
4. Now make a template out of the created vm.
5. create new vm from the template .

Actual results:
vm creation succeeds but fails to boot with the error shown in the screenshot.

Expected results:
Vm should get booted successfully.

Additional info:

no errors seen in qemu logs:
=====================================
2017-08-18 13:33:37.463+0000: starting up libvirt version: 3.2.0, package: 14.el7_4.2 (Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>, 2017-07-18-09:40:48, x86-039.build.eng.bos.redhat.com), qemu version: 2.9.0(qemu-kvm-rhev-2.9.0-16.el7_4.4), hostname: tettnang.lab.eng.blr.redhat.com
LC_ALL=C PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin QEMU_AUDIO_DRV=spice /usr/libexec/qemu-kvm -name guest=testvm_shard_64MB,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-14-testvm_shard_64MB/master-key.aes -machine pc-i440fx-rhel7.3.0,accel=kvm,usb=off,dump-guest-core=off -cpu Haswell-noTSX -m size=8388608k,slots=16,maxmem=33554432k -realtime mlock=off -smp 2,maxcpus=16,sockets=16,cores=1,threads=1 -numa node,nodeid=0,cpus=0-1,mem=8192 -uuid fea25030-23af-4069-98da-e74a544cb04f -smbios 'type=1,manufacturer=Red Hat,product=RHEV Hypervisor,version=7.4-18.el7rhgs,serial=36383737-3538-4753-4835-30375832434A,uuid=fea25030-23af-4069-98da-e74a544cb04f' -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-14-testvm_shard_64MB/monitor.sock,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=2017-08-18T13:33:36,driftfix=slew -global kvm-pit.lost_tick_policy=delay -no-hpet -no-shutdown -boot strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -device virtio-scsi-pci,id=scsi0,bus=pci.0,addr=0x4 -device virtio-serial-pci,id=virtio-serial0,max_ports=16,bus=pci.0,addr=0x5 -drive if=none,id=drive-ide0-1-0,readonly=on -device ide-cd,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 -drive file=gluster://10.70.36.78/data/2edc6cc1-7b31-425d-a969-c5df2287c84a/images/3ef6ed85-9647-4b48-a041-77d9a5a1e9f5/e07f8c83-b365-49f1-b276-fad2f8afc9dc,file.debug=4,format=qcow2,if=none,id=drive-scsi0-0-0-0,serial=3ef6ed85-9647-4b48-a041-77d9a5a1e9f5,cache=none,werror=stop,rerror=stop,aio=threads -device scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0,bootindex=1 -netdev tap,fd=31,id=hostnet0,vhost=on,vhostfd=33 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=00:45:55:21:48:15,bus=pci.0,addr=0x3 -chardev socket,id=charchannel0,path=/var/lib/libvirt/qemu/channels/fea25030-23af-4069-98da-e74a544cb04f.com.redhat.rhevm.vdsm,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=com.redhat.rhevm.vdsm -chardev socket,id=charchannel1,path=/var/lib/libvirt/qemu/channels/fea25030-23af-4069-98da-e74a544cb04f.org.qemu.guest_agent.0,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=2,chardev=charchannel1,id=channel1,name=org.qemu.guest_agent.0 -chardev spicevmc,id=charchannel2,name=vdagent -device virtserialport,bus=virtio-serial0.0,nr=3,chardev=charchannel2,id=channel2,name=com.redhat.spice.0 -spice tls-port=5900,addr=10.70.34.41,x509-dir=/etc/pki/vdsm/libvirt-spice,tls-channel=default,tls-channel=main,tls-channel=display,tls-channel=inputs,tls-channel=cursor,tls-channel=playback,tls-channel=record,tls-channel=smartcard,tls-channel=usbredir,seamless-migration=on -device qxl-vga,id=video0,ram_size=67108864,vram_size=8388608,vram64_size_mb=0,vgamem_mb=16,max_outputs=1,bus=pci.0,addr=0x2 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x6 -object rng-random,id=objrng0,filename=/dev/urandom -device virtio-rng-pci,rng=objrng0,id=rng0,bus=pci.0,addr=0x7 -msg timestamp=on
[2017-08-18 13:33:37.791278] I [afr.c:94:fix_quorum_options] 0-data-replicate-0: reindeer: incoming qtype = auto
[2017-08-18 13:33:37.791329] I [afr.c:116:fix_quorum_options] 0-data-replicate-0: reindeer: quorum_count = 2147483647
[2017-08-18 13:33:37.792934] W [MSGID: 101174] [graph.c:360:_log_if_unknown_option] 0-data-readdir-ahead: option 'parallel-readdir' is not recognized
[2017-08-18 13:33:37.792997] I [MSGID: 104045] [glfs-master.c:91:notify] 0-gfapi: New graph 74657474-6e61-6e67-2e6c-61622e656e67 (0) coming up
[2017-08-18 13:33:37.793027] I [MSGID: 114020] [client.c:2356:notify] 0-data-client-0: parent translators are ready, attempting connect on transport
[2017-08-18 13:33:37.797036] I [MSGID: 114020] [client.c:2356:notify] 0-data-client-1: parent translators are ready, attempting connect on transport
[2017-08-18 13:33:37.797683] I [rpc-clnt.c:2001:rpc_clnt_reconfig] 0-data-client-0: changing port to 49153 (from 0)
[2017-08-18 13:33:37.801016] I [MSGID: 114020] [client.c:2356:notify] 0-data-client-2: parent translators are ready, attempting connect on transport
[2017-08-18 13:33:37.805314] I [MSGID: 114057] [client-handshake.c:1450:select_server_supported_programs] 0-data-client-0: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2017-08-18 13:33:37.805424] I [rpc-clnt.c:2001:rpc_clnt_reconfig] 0-data-client-1: changing port to 49153 (from 0)
[2017-08-18 13:33:37.808857] I [MSGID: 114046] [client-handshake.c:1215:client_setvolume_cbk] 0-data-client-0: Connected to data-client-0, attached to remote volume '/gluster_bricks/data/data'.
[2017-08-18 13:33:37.808885] I [MSGID: 114047] [client-handshake.c:1226:client_setvolume_cbk] 0-data-client-0: Server and Client lk-version numbers are not same, reopening the fds
[2017-08-18 13:33:37.808965] I [MSGID: 108005] [afr-common.c:4706:afr_notify] 0-data-replicate-0: Subvolume 'data-client-0' came back up; going online.
[2017-08-18 13:33:37.809303] I [MSGID: 114035] [client-handshake.c:201:client_set_lk_version_cbk] 0-data-client-0: Server lk version = 1
[2017-08-18 13:33:37.809652] I [rpc-clnt.c:2001:rpc_clnt_reconfig] 0-data-client-2: changing port to 49153 (from 0)
[2017-08-18 13:33:37.812611] I [MSGID: 114057] [client-handshake.c:1450:select_server_supported_programs] 0-data-client-1: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2017-08-18 13:33:37.813306] I [MSGID: 114046] [client-handshake.c:1215:client_setvolume_cbk] 0-data-client-1: Connected to data-client-1, attached to remote volume '/gluster_bricks/data/data'.
[2017-08-18 13:33:37.813347] I [MSGID: 114047] [client-handshake.c:1226:client_setvolume_cbk] 0-data-client-1: Server and Client lk-version numbers are not same, reopening the fds
[2017-08-18 13:33:37.813489] I [MSGID: 114035] [client-handshake.c:201:client_set_lk_version_cbk] 0-data-client-1: Server lk version = 1
[2017-08-18 13:33:37.816687] I [MSGID: 114057] [client-handshake.c:1450:select_server_supported_programs] 0-data-client-2: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2017-08-18 13:33:37.813306] I [MSGID: 114046] [client-handshake.c:1215:client_setvolume_cbk] 0-data-client-1: Connected to data-client-1, attached to remote volume '/gluster_bricks/data/data'.
[2017-08-18 13:33:37.813347] I [MSGID: 114047] [client-handshake.c:1226:client_setvolume_cbk] 0-data-client-1: Server and Client lk-version numbers are not same, reopening the fds
[2017-08-18 13:33:37.813489] I [MSGID: 114035] [client-handshake.c:201:client_set_lk_version_cbk] 0-data-client-1: Server lk version = 1
[2017-08-18 13:33:37.816687] I [MSGID: 114057] [client-handshake.c:1450:select_server_supported_programs] 0-data-client-2: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2017-08-18 13:33:37.817323] I [MSGID: 114046] [client-handshake.c:1215:client_setvolume_cbk] 0-data-client-2: Connected to data-client-2, attached to remote volume '/gluster_bricks/data/data'.
[2017-08-18 13:33:37.817342] I [MSGID: 114047] [client-handshake.c:1226:client_setvolume_cbk] 0-data-client-2: Server and Client lk-version numbers are not same, reopening the fds
[2017-08-18 13:33:37.817531] I [MSGID: 114035] [client-handshake.c:201:client_set_lk_version_cbk] 0-data-client-2: Server lk version = 1
[2017-08-18 13:33:37.820304] I [MSGID: 108031] [afr-common.c:2264:afr_local_discovery_cbk] 0-data-replicate-0: selecting local read_child data-client-1
[2017-08-18 13:33:37.820727] I [MSGID: 104041] [glfs-resolve.c:970:__glfs_active_subvol] 0-data: switched to graph 74657474-6e61-6e67-2e6c-61622e656e67 (0)
[2017-08-18 13:33:37.825282] I [MSGID: 133017] [shard.c:4900:shard_seek] 0-data-shard: seek called on f7e757ae-e231-4e79-8dea-d9ff84da555d. [Operation not supported]
[2017-08-18 13:33:37.832548] I [MSGID: 133017] [shard.c:4900:shard_seek] 0-data-shard: seek called on 7771a6d6-6171-4a09-ac83-e2cf723833bf. [Operation not supported]
2017-08-18T13:33:37.832765Z qemu-kvm: warning: CPU(s) not present in any NUMA nodes: 2 3 4 5 6 7 8 9 10 11 12 13 14 15
2017-08-18T13:33:37.832797Z qemu-kvm: warning: All CPU(s) up to maxcpus should be described in NUMA config
main_channel_link: add main channel client
main_channel_handle_parsed: net test: latency 1.825000 ms, bitrate 1369899665 bps (1306.438127 Mbps)
inputs_connect: inputs channel client create

errors seen in brick logs:
================================
[2017-08-18 13:28:24.415782] E [MSGID: 113002] [posix.c:267:posix_lookup] 0-data-posix: buf->ia_gfid is null for /gluster_bricks/data/data/.shard/b1a937e5-4a69-4b41-96cb-07a
7f562db55.785 [No data available]
[2017-08-18 13:28:24.415845] E [MSGID: 115050] [server-rpc-fops.c:158:server_lookup_cbk] 0-data-server: 114652: LOOKUP /.shard/b1a937e5-4a69-4b41-96cb-07a7f562db55.785 (be31
8638-e8a0-4c6d-977d-7a937aa84806/b1a937e5-4a69-4b41-96cb-07a7f562db55.785) ==> (No data available) [No data available]
[2017-08-18 13:28:24.472442] W [MSGID: 113096] [posix-handle.c:773:posix_handle_hard] 0-data-posix: link /gluster_bricks/data/data/.shard/b1a937e5-4a69-4b41-96cb-07a7f562db5
5.786 -> /gluster_bricks/data/data/.glusterfs/ce/13/ce1330fe-e550-43d2-92ea-caa2c9510971failed  [File exists]
[2017-08-18 13:28:24.472492] E [MSGID: 113020] [posix.c:1395:posix_mknod] 0-data-posix: setting gfid on /gluster_bricks/data/data/.shard/b1a937e5-4a69-4b41-96cb-07a7f562db55
.786 failed
[2017-08-18 13:28:24.534927] E [MSGID: 113002] [posix.c:267:posix_lookup] 0-data-posix: buf->ia_gfid is null for /gluster_bricks/data/data/.shard/b1a937e5-4a69-4b41-96cb-07a
7f562db55.787 [No data available]
[2017-08-18 13:28:24.534981] E [MSGID: 115050] [server-rpc-fops.c:158:server_lookup_cbk] 0-data-server: 114743: LOOKUP /.shard/b1a937e5-4a69-4b41-96cb-07a7f562db55.787 (be31
8638-e8a0-4c6d-977d-7a937aa84806/b1a937e5-4a69-4b41-96cb-07a7f562db55.787) ==> (No data available) [No data available]
[2017-08-18 13:28:24.592544] W [MSGID: 113096] [posix-handle.c:773:posix_handle_hard] 0-data-posix: link /gluster_bricks/data/data/.shard/b1a937e5-4a69-4b41-96cb-07a7f562db5
5.788 -> /gluster_bricks/data/data/.glusterfs/83/00/830084a6-423b-4a29-a0bc-2ea90e2374a2failed  [File exists]
[2017-08-18 13:28:24.592544] W [MSGID: 113096] [posix-handle.c:773:posix_handle_hard] 0-data-posix: link /gluster_bricks/data/data/.shard/b1a937e5-4a69-4b41-96cb-07a7f562db5
5.788 -> /gluster_bricks/data/data/.glusterfs/83/00/830084a6-423b-4a29-a0bc-2ea90e2374a2failed  [File exists]
[2017-08-18 13:28:24.592562] E [MSGID: 113020] [posix.c:1395:posix_mknod] 0-data-posix: setting gfid on /gluster_bricks/data/data/.shard/b1a937e5-4a69-4b41-96cb-07a7f562db55
.788 failed
[2017-08-18 13:28:24.700997] W [MSGID: 113096] [posix-handle.c:773:posix_handle_hard] 0-data-posix: link /gluster_bricks/data/data/.shard/b1a937e5-4a69-4b41-96cb-07a7f562db5
5.790 -> /gluster_bricks/data/data/.glusterfs/fb/a9/fba964dc-2b80-4c97-a84a-0fd4970d12f4failed  [File exists]
[2017-08-18 13:28:24.701032] E [MSGID: 113020] [posix.c:1395:posix_mknod] 0-data-posix: setting gfid on /gluster_bricks/data/data/.shard/b1a937e5-4a69-4b41-96cb-07a7f562db55
.790 failed
[2017-08-18 13:28:24.761620] W [MSGID: 113096] [posix-handle.c:773:posix_handle_hard] 0-data-posix: link /gluster_bricks/data/data/.shard/b1a937e5-4a69-4b41-96cb-07a7f562db5
5.791 -> /gluster_bricks/data/data/.glusterfs/a7/f5/a7f5cc30-f0fd-49a0-91d6-2233563e9c65failed  [File exists]
[2017-08-18 13:28:24.761621] W [MSGID: 113096] [posix-handle.c:773:posix_handle_hard] 0-data-posix: link /gluster_bricks/data/data/.shard/b1a937e5-4a69-4b41-96cb-07a7f562db5
5.791 -> /gluster_bricks/data/data/.glusterfs/a7/f5/a7f5cc30-f0fd-49a0-91d6-2233563e9c65failed  [File exists]
[2017-08-18 13:28:24.761638] E [MSGID: 113020] [posix.c:1395:posix_mknod] 0-data-posix: setting gfid on /gluster_bricks/data/data/.shard/b1a937e5-4a69-4b41-96cb-07a7f562db55
.791 failed
[2017-08-18 13:28:24.890108] E [MSGID: 113002] [posix.c:267:posix_lookup] 0-data-posix: buf->ia_gfid is null for /gluster_bricks/data/data/.shard/b1a937e5-4a69-4b41-96cb-07a
7f562db55.793 [No data available]
[2017-08-18 13:28:24.890172] E [MSGID: 115050] [server-rpc-fops.c:158:server_lookup_cbk] 0-data-server: 115002: LOOKUP /.shard/b1a937e5-4a69-4b41-96cb-07a7f562db55.793 (be31
8638-e8a0-4c6d-977d-7a937aa84806/b1a937e5-4a69-4b41-96cb-07a7f562db55.793) ==> (No data available) [No data available]
[2017-08-18 13:28:25.003037] E [MSGID: 113002] [posix.c:267:posix_lookup] 0-data-posix: buf->ia_gfid is null for /gluster_bricks/data/data/.shard/b1a937e5-4a69-4b41-96cb-07a
7f562db55.795 [No data available]
[2017-08-18 13:28:25.003093] E [MSGID: 115050] [server-rpc-fops.c:158:server_lookup_cbk] 0-data-server: 115090: LOOKUP /.shard/b1a937e5-4a69-4b41-96cb-07a7f562db55.795 (be31
8638-e8a0-4c6d-977d-7a937aa84806/b1a937e5-4a69-4b41-96cb-07a7f562db55.795) ==> (No data available) [No data available]

--- Additional comment from RamaKasturi on 2017-08-18 10:34:00 EDT ---

sosreports can be found in the link below:
==========================================

http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/HC/1482994/

--- Additional comment from Sahina Bose on 2017-08-21 02:03:01 EDT ---

Moving from oVirt to RHGS

--- Additional comment from Red Hat Bugzilla Rules Engine on 2017-08-21 02:03:07 EDT ---

This bug is automatically being proposed for the current release of Red Hat Gluster Storage 3 under active development, by setting the release flag 'rhgs‑3.3.0' to '?'. 

If this bug should be proposed for a different release, please manually change the proposed release flag.

--- Additional comment from Krutika Dhananjay on 2017-08-21 03:13:55 EDT ---

Niels,

The only error that we see in the client logs are from shard which informs about SEEK_DATA and SEEK_HOLE not being supported. Technically things should work fine even if shard doesn't support it in terms of qemu falling back to its earlier mode of operation before seek was even implemented in gluster, at least until it is implemented in shard, which is a non-trivial effort.

I checked block/gluster.c in qemu codebase and found that with sharding enabled, supports_seek_data flag should be set to false by virtue of shard returning ENOTSUP.

The only place there this flag is read seems to be find_allocation() which will go to "exit" in this case and return -ENOTSUP to qemu_gluster_co_get_block_status(). Unfortunately I do not understand what would happen next. ret seems to be set to BDRV_BLOCK_DATA and I don't understand what that's supposed to mean. So I did a git-blame and saw that this particular piece of code was written by you as part of commit 947eb2030eef73b7a66167a6244b11eeb4f109ff. Could you please clarify what qemu does next and why vms won't boot because of this, for want of any kind of logging of fop failures in gfapi itself?

Thanks in advance,
Krutika

--- Additional comment from Mohammed Rafi KC on 2017-08-21 09:07 EDT ---



--- Additional comment from Mohammed Rafi KC on 2017-08-21 09:08 EDT ---



--- Additional comment from Mohammed Rafi KC on 2017-08-21 09:09 EDT ---



--- Additional comment from Mohammed Rafi KC on 2017-08-21 09:10:38 EDT ---

We have similar problem in our dev cluster set up [1].

Attached the log snipet for errors/warning from both client and bricks.


Do you think this problem can happen in fuse mount as well.


[1] : https://rhsdev-cluster.lab.eng.blr.redhat.com

--- Additional comment from Mohammed Rafi KC on 2017-08-21 09:12:37 EDT ---

rising the need info on Krutika for comment 8

--- Additional comment from SATHEESARAN on 2017-08-21 12:04:10 EDT ---

For the worth of information, I am also updating the results from the test that I have done.

Note: This observation should not taken as a indication to stall the current investigation, but to compare and figure out the problem

Test-1
------

Components    : qemu-kvm-rhev, libvirt ( virsh cmd utility )
SSL/TLS infa  : Self-signed certs as well as CA 
Observation   : I could start VM, boot VMs, install OS, create template, create snapshot, merge snapshots with VM disks accessed via gfapi access mechanism

Test-2
------
Components    : qemu-kvm-rhev, vdsm, RHV 4.1.5
SSL/TLS Infra : CA signed certs
Observation   : I could again create VMs, create templates, Create VMs from template.

Note: In this case, I have enabled engine config 
engine-config -s LibgfApiSupported=true

My total observation.
In my RHV 4.1.5 setup with CA signed certs, I observe no problems.

--- Additional comment from Niels de Vos on 2017-08-22 11:15:44 EDT ---

I don't see any seek problems in the logs here... not sure if this problem is the same as bz#1357919.

There have been some changes to how glfs_lseek() is handled by QEMU, but that has been months ago. Unfortunately the version of qemu is not mentioned in the bug. Rafi provided versions over IRC:

15:56 < rafi> ndevos, qemu-kvm-tools-rhev-2.6.0-28.el7_3.12.x86_64
15:56 < rafi> libvirt-daemon-kvm-2.0.0-10.el7_3.9.x86_64
15:56 < rafi> qemu-kvm-rhev-2.6.0-28.el7_3.12.x86_64
15:56 < rafi> qemu-kvm-common-rhev-2.6.0-28.el7_3.12.x86_64

--- Additional comment from Krutika Dhananjay on 2017-08-23 01:13:59 EDT ---

Rafi,

I checked the logs. None of them indicate anything about the issue that is seen in dev cluster. All of these logs represent a transient state.

-Krutika

--- Additional comment from Krutika Dhananjay on 2017-08-23 01:26:31 EDT ---

(In reply to Niels de Vos from comment #11)
> I don't see any seek problems in the logs here... not sure if this problem
> is the same as bz#1357919.
> 
> There have been some changes to how glfs_lseek() is handled by QEMU, but
> that has been months ago. Unfortunately the version of qemu is not mentioned
> in the bug. Rafi provided versions over IRC:
> 
> 15:56 < rafi> ndevos, qemu-kvm-tools-rhev-2.6.0-28.el7_3.12.x86_64
> 15:56 < rafi> libvirt-daemon-kvm-2.0.0-10.el7_3.9.x86_64
> 15:56 < rafi> qemu-kvm-rhev-2.6.0-28.el7_3.12.x86_64
> 15:56 < rafi> qemu-kvm-common-rhev-2.6.0-28.el7_3.12.x86_64

Kasturi,

What is the version of qemu that you are using? Could you please provide the output? Niels seems to suggest in comment 11 that there were some changes to how QEMU handles seek fop in the past.

-Krutika

--- Additional comment from RamaKasturi on 2017-08-23 02:37:56 EDT ---

Hi Neils & krutika,

   Apologies that i missed to put that information. Here is the qemu version.

[root@yarrow ~]# rpm -qa | grep qemu
qemu-kvm-rhev-2.9.0-16.el7_4.4.x86_64
qemu-kvm-tools-rhev-2.9.0-16.el7_4.4.x86_64
qemu-kvm-common-rhev-2.9.0-16.el7_4.4.x86_64
libvirt-daemon-driver-qemu-3.2.0-14.el7_4.2.x86_64
ipxe-roms-qemu-20170123-1.git4e85b27.el7.noarch
qemu-img-rhev-2.9.0-16.el7_4.4.x86_64


Thanks
kasturi

--- Additional comment from Niels de Vos on 2017-08-23 04:17:34 EDT ---

In case of problems related to seek, we need to verify if this upstream change was backported.

commit 223a23c198787328ae75bc65d84edf5fde33c0b6
Author: Jeff Cody <jcody>
Date:   Tue May 23 13:27:50 2017 -0400

    block/gluster: glfs_lseek() workaround

https://git.qemu.org/gitweb.cgi?p=qemu.git;a=commit;h=223a23c198787328ae75bc65d84edf5fde33c0b6

--- Additional comment from Niels de Vos on 2017-08-23 04:31:58 EDT ---

(In reply to Niels de Vos from comment #15)
> In case of problems related to seek, we need to verify if this upstream
> change was backported.
> 
> commit 223a23c198787328ae75bc65d84edf5fde33c0b6
> Author: Jeff Cody <jcody>
> Date:   Tue May 23 13:27:50 2017 -0400
> 
>     block/gluster: glfs_lseek() workaround
> 
> https://git.qemu.org/gitweb.cgi?p=qemu.git;a=commit;
> h=223a23c198787328ae75bc65d84edf5fde33c0b6

Bug 1451191 was used to get the fix in qemu-kvm-rhev-2.9.0-8.el7.x86_64. The version this was reported against is higher, so it should contain the backport as well.

Bug 1425293#c21 mentions that the seek problem has been fixed with the combination of

libvirt-3.2.0-10.el7.x86_64
qemu-kvm-rhev-2.9.0-10.el7.x86_64
glusterfs-3.8.4-28.el7rhgs.x86_64


There has been a fix in seek on the gluster side as well. http://review.gluster.org/15943 was backported to upstream glusterfs-3.8.7. This matches downstream bug 1454558 (git tag v3.8.4-26).


At the moment, I do not think we are missing anything from the seek side. Is it possible to verify that qemu continues with <something> after trying glfs_lseek()? You could do this by setting diagnostics.client-log-level=TRACE for the volume and the QEMU/gfapi logs should be much more verbose.

--- Additional comment from RamaKasturi on 2017-08-23 04:52:25 EDT ---

sure niels. I will set the option on volume and update the bug.

--- Additional comment from RamaKasturi on 2017-08-23 06:35:47 EDT ---

Hi Niels,

        I have set the volume option as you mentioned and i see errors only with seek returning operation not supported.


[2017-08-23 10:32:42.999569] D [MSGID: 0] [shard.c:4901:shard_seek] 0-stack-trace: stack-address: 0x5594f725a020, data-shard returned -1 error: Operation not supported [Operation not supported]
[2017-08-23 10:32:42.999585] D [MSGID: 0] [io-stats.c:2529:io_stats_seek_cbk] 0-stack-trace: stack-address: 0x5594f725a020, data returned -1 error: Operation not supported [Operation not supported]


complete qemu log file is present in the link below.

http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/HC/1482994/vm_data_trace1.log

Thanks
kasturi

--- Additional comment from Niels de Vos on 2017-08-23 07:22:06 EDT ---

Thanks for testing! After seek failed with ENOTSUPP, qemu does some reads from the vm-image on the volume. I therefore doubt that the problem is caused by seek.

It would be helpful to have similar logs from the test where sharding is disabled. If this also does not work, it may really not be a problem with seek (or if we're unlocky, a different seek issue).

--- Additional comment from RamaKasturi on 2017-08-23 08:47:15 EDT ---

Hi Niels,

 I tested this with out sharding enabled and below is what i see.


[root@yarrow qemu]# cat vm_noshard_1.log 
2017-08-18 07:47:56.357+0000: starting up libvirt version: 3.2.0, package: 14.el7_4.2 (Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>, 2017-07-18-09:40:48, x86-039.build.eng.bos.redhat.com), qemu version: 2.9.0(qemu-kvm-rhev-2.9.0-16.el7_4.4), hostname: yarrow.lab.eng.blr.redhat.com
LC_ALL=C PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin QEMU_AUDIO_DRV=spice /usr/libexec/qemu-kvm -name guest=vm_noshard_1,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-3-vm_noshard_1/master-key.aes -machine pc-i440fx-rhel7.3.0,accel=kvm,usb=off,dump-guest-core=off -cpu Haswell-noTSX -m size=8388608k,slots=16,maxmem=33554432k -realtime mlock=off -smp 2,maxcpus=16,sockets=16,cores=1,threads=1 -numa node,nodeid=0,cpus=0-1,mem=8192 -uuid 6c540118-b51e-4599-958e-a7a80e5d4f9c -smbios 'type=1,manufacturer=Red Hat,product=RHEV Hypervisor,version=7.4-18.el7rhgs,serial=36383737-3538-4753-4835-303758324348,uuid=6c540118-b51e-4599-958e-a7a80e5d4f9c' -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-3-vm_noshard_1/monitor.sock,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=2017-08-18T07:47:55,driftfix=slew -global kvm-pit.lost_tick_policy=delay -no-hpet -no-shutdown -boot strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -device virtio-scsi-pci,id=scsi0,bus=pci.0,addr=0x4 -device virtio-serial-pci,id=virtio-serial0,max_ports=16,bus=pci.0,addr=0x5 -drive if=none,id=drive-ide0-1-0,readonly=on -device ide-cd,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 -drive file=gluster://10.70.36.78/data/1cc3328f-0462-4b9e-9cd2-b22b1bce0bcc/images/6cd3f0a7-3e81-46db-a25d-7a92d9234a8f/c308f044-d17d-47b0-a164-cea091b908b9,file.debug=4,format=qcow2,if=none,id=drive-scsi0-0-0-0,serial=6cd3f0a7-3e81-46db-a25d-7a92d9234a8f,cache=none,werror=stop,rerror=stop,aio=threads -device scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0,bootindex=1 -netdev tap,fd=33,id=hostnet0,vhost=on,vhostfd=35 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=00:45:55:21:48:0d,bus=pci.0,addr=0x3 -chardev socket,id=charchannel0,path=/var/lib/libvirt/qemu/channels/6c540118-b51e-4599-958e-a7a80e5d4f9c.com.redhat.rhevm.vdsm,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=com.redhat.rhevm.vdsm -chardev socket,id=charchannel1,path=/var/lib/libvirt/qemu/channels/6c540118-b51e-4599-958e-a7a80e5d4f9c.org.qemu.guest_agent.0,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=2,chardev=charchannel1,id=channel1,name=org.qemu.guest_agent.0 -chardev spicevmc,id=charchannel2,name=vdagent -device virtserialport,bus=virtio-serial0.0,nr=3,chardev=charchannel2,id=channel2,name=com.redhat.spice.0 -spice tls-port=5901,addr=10.70.34.43,x509-dir=/etc/pki/vdsm/libvirt-spice,tls-channel=default,tls-channel=main,tls-channel=display,tls-channel=inputs,tls-channel=cursor,tls-channel=playback,tls-channel=record,tls-channel=smartcard,tls-channel=usbredir,seamless-migration=on -device qxl-vga,id=video0,ram_size=67108864,vram_size=8388608,vram64_size_mb=0,vgamem_mb=16,max_outputs=1,bus=pci.0,addr=0x2 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x6 -object rng-random,id=objrng0,filename=/dev/urandom -device virtio-rng-pci,rng=objrng0,id=rng0,bus=pci.0,addr=0x7 -msg timestamp=on
[2017-08-18 07:47:56.512892] I [afr.c:94:fix_quorum_options] 0-data-replicate-0: reindeer: incoming qtype = auto
[2017-08-18 07:47:56.512931] I [afr.c:116:fix_quorum_options] 0-data-replicate-0: reindeer: quorum_count = 2147483647
[2017-08-18 07:47:56.514517] W [MSGID: 101174] [graph.c:360:_log_if_unknown_option] 0-data-readdir-ahead: option 'parallel-readdir' is not recognized
[2017-08-18 07:47:56.514580] I [MSGID: 104045] [glfs-master.c:91:notify] 0-gfapi: New graph 79617272-6f77-2e6c-6162-2e656e672e62 (0) coming up
[2017-08-18 07:47:56.514613] I [MSGID: 114020] [client.c:2356:notify] 0-data-client-0: parent translators are ready, attempting connect on transport
[2017-08-18 07:47:56.518721] I [MSGID: 114020] [client.c:2356:notify] 0-data-client-1: parent translators are ready, attempting connect on transport
[2017-08-18 07:47:56.519082] I [rpc-clnt.c:2001:rpc_clnt_reconfig] 0-data-client-0: changing port to 49153 (from 0)
[2017-08-18 07:47:56.522525] I [MSGID: 114020] [client.c:2356:notify] 0-data-client-2: parent translators are ready, attempting connect on transport
[2017-08-18 07:47:56.526842] I [MSGID: 114057] [client-handshake.c:1450:select_server_supported_programs] 0-data-client-0: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2017-08-18 07:47:56.527094] I [rpc-clnt.c:2001:rpc_clnt_reconfig] 0-data-client-1: changing port to 49153 (from 0)
[2017-08-18 07:47:56.530333] I [MSGID: 114046] [client-handshake.c:1215:client_setvolume_cbk] 0-data-client-0: Connected to data-client-0, attached to remote volume '/gluster_bricks/data/data'.
[2017-08-18 07:47:56.530364] I [MSGID: 114047] [client-handshake.c:1226:client_setvolume_cbk] 0-data-client-0: Server and Client lk-version numbers are not same, reopening the fds
[2017-08-18 07:47:56.530464] I [MSGID: 108005] [afr-common.c:4706:afr_notify] 0-data-replicate-0: Subvolume 'data-client-0' came back up; going online.
[2017-08-18 07:47:56.530893] I [MSGID: 114035] [client-handshake.c:201:client_set_lk_version_cbk] 0-data-client-0: Server lk version = 1
[2017-08-18 07:47:56.531269] I [rpc-clnt.c:2001:rpc_clnt_reconfig] 0-data-client-2: changing port to 49153 (from 0)
[2017-08-18 07:47:56.534429] I [MSGID: 114057] [client-handshake.c:1450:select_server_supported_programs] 0-data-client-1: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2017-08-18 07:47:56.535133] I [MSGID: 114046] [client-handshake.c:1215:client_setvolume_cbk] 0-data-client-1: Connected to data-client-1, attached to remote volume '/gluster_bricks/data/data'.
[2017-08-18 07:47:56.535157] I [MSGID: 114047] [client-handshake.c:1226:client_setvolume_cbk] 0-data-client-1: Server and Client lk-version numbers are not same, reopening the fds
[2017-08-18 07:47:56.535348] I [MSGID: 114035] [client-handshake.c:201:client_set_lk_version_cbk] 0-data-client-1: Server lk version = 1
[2017-08-18 07:47:56.537986] I [MSGID: 114057] [client-handshake.c:1450:select_server_supported_programs] 0-data-client-2: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2017-08-18 07:47:56.538625] I [MSGID: 114046] [client-handshake.c:1215:client_setvolume_cbk] 0-data-client-2: Connected to data-client-2, attached to remote volume '/gluster_bricks/data/data'.
[2017-08-18 07:47:56.538644] I [MSGID: 114047] [client-handshake.c:1226:client_setvolume_cbk] 0-data-client-2: Server and Client lk-version numbers are not same, reopening the fds
[2017-08-18 07:47:56.538873] I [MSGID: 114035] [client-handshake.c:201:client_set_lk_version_cbk] 0-data-client-2: Server lk version = 1
[2017-08-18 07:47:56.541463] I [MSGID: 108031] [afr-common.c:2264:afr_local_discovery_cbk] 0-data-replicate-0: selecting local read_child data-client-0
[2017-08-18 07:47:56.542170] I [MSGID: 104041] [glfs-resolve.c:970:__glfs_active_subvol] 0-data: switched to graph 79617272-6f77-2e6c-6162-2e656e672e62 (0)
[2017-08-18 07:47:56.547327] W [MSGID: 114031] [client-rpc-fops.c:2211:client3_3_seek_cbk] 0-data-client-0: remote operation failed [No such device or address]
[2017-08-18 07:47:56.548794] I [MSGID: 114021] [client.c:2365:notify] 0-data-client-0: current graph is no longer active, destroying rpc_client 
[2017-08-18 07:47:56.548845] I [MSGID: 114021] [client.c:2365:notify] 0-data-client-1: current graph is no longer active, destroying rpc_client 
[2017-08-18 07:47:56.548878] I [MSGID: 114021] [client.c:2365:notify] 0-data-client-2: current graph is no longer active, destroying rpc_client 
[2017-08-18 07:47:56.548871] I [MSGID: 114018] [client.c:2280:client_rpc_notify] 0-data-client-0: disconnected from data-client-0. Client process will keep trying to connect to glusterd until brick's port is available
[2017-08-18 07:47:56.548916] I [MSGID: 114018] [client.c:2280:client_rpc_notify] 0-data-client-2: disconnected from data-client-2. Client process will keep trying to connect to glusterd until brick's port is available
[2017-08-18 07:47:56.548910] I [MSGID: 114018] [client.c:2280:client_rpc_notify] 0-data-client-1: disconnected from data-client-1. Client process will keep trying to connect to glusterd until brick's port is available
[2017-08-18 07:47:56.548935] W [MSGID: 108001] [afr-common.c:4820:afr_notify] 0-data-replicate-0: Client-quorum is not met
[2017-08-18 07:47:56.549058] E [MSGID: 108006] [afr-common.c:4731:afr_notify] 0-data-replicate-0: All subvolumes are down. Going offline until atleast one of them comes back up.
[2017-08-18 07:47:56.549468] I [io-stats.c:4006:fini] 0-data: io-stats translator unloaded
[2017-08-18 07:47:56.549951] I [MSGID: 101191] [event-epoll.c:633:event_dispatch_epoll_worker] 0-epoll: Exited thread with index 1
[2017-08-18 07:47:56.549979] I [MSGID: 101191] [event-epoll.c:633:event_dispatch_epoll_worker] 0-epoll: Exited thread with index 2
2017-08-18T07:47:57.496671Z qemu-kvm: -drive file=gluster://10.70.36.78/data/1cc3328f-0462-4b9e-9cd2-b22b1bce0bcc/images/6cd3f0a7-3e81-46db-a25d-7a92d9234a8f/c308f044-d17d-47b0-a164-cea091b908b9,file.debug=4,format=qcow2,if=none,id=drive-scsi0-0-0-0,serial=6cd3f0a7-3e81-46db-a25d-7a92d9234a8f,cache=none,werror=stop,rerror=stop,aio=threads: Could not read L1 table: Input/output error
2017-08-18 07:47:57.514+0000: shutting down, reason=failed

Thanks
kasturi

--- Additional comment from Niels de Vos on 2017-08-23 11:36:32 EDT ---

Sahina mentioned that a very similar problem has been reported earlier:

- https://bugs.launchpad.net/bugs/1644754
- https://lists.gnu.org/archive/html/qemu-devel/2016-11/msg04742.html

The same "Could not read L1 table: Input/output error" is mentioned there.

# qemu-img info gluster://gluster01/gv0/bar2.qcow2
  qemu-img: Could not open 'gluster://gluster01/gv0/bar2.qcow2': Could not read 
L1 table: Input/output error


The trace log from comment #18 has reasonable return values for the *_read_cbk functions. There is one exception though:

[2017-08-23 10:32:43.003671] T [MSGID: 0] [client-rpc-fops.c:3011:client3_3_readv_cbk] 0-stack-trace: stack-address: 0x5594f710f020, data-client-2 returned 8
[2017-08-23 10:32:43.003691] T [MSGID: 0] [afr-inode-read.c:1838:afr_readv_cbk] 0-stack-trace: stack-address: 0x5594f710f020, data-replicate-0 returned 8
[2017-08-23 10:32:43.003706] T [MSGID: 0] [dht-inode-read.c:502:dht_readv_cbk] 0-stack-trace: stack-address: 0x5594f710f020, data-dht returned 8
[2017-08-23 10:32:43.003720] T [MSGID: 0] [shard.c:3122:shard_readv_do_cbk] 0-stack-trace: stack-address: 0x5594f710f020, data-shard returned 512


Krutika, this suggests that shard changed the returned data that was read (I do not see the requested size in the log) from 8 bytes to 512. It looks rather unexpected to me that more bytes are returned than possibly have been requested. Do you have an idea if that can be intentional?


Kasturi, it is possible to capture a tcpdump on the hypervisor while starting the VM? That should give the exact requests and replies that go over the network.


The logs  from comment #20 do not have tracing enabled, so we can not see if the same behaviour is happening without sharding. Those logs have "remote operation failed [No such device or address]" as a reply on lseek. This should only be returned in the following case:

ENXIO  whence is SEEK_DATA or SEEK_HOLE, and the file offset is beyond the end of the file.

With a packet capture, we can see what seek operation is executed, and verify the size of the file on the brick(s).

--- Additional comment from Niels de Vos on 2017-08-23 11:39:56 EDT ---

Kasturi, is this an arbiter volume? Possibly the SEEK FOP is sent to the arbiter and there the size of the file will always be 0. Any seek operations will not behave correctly. It seems that there is no 'support' for SEEK in arbiter yet.

--- Additional comment from Niels de Vos on 2017-08-23 12:28:10 EDT ---

Arbiter bricks are always a little confusing to me. So this patch has been sent to get it clarified https://review.gluster.org/18103 (either with an implementation or a comment in the sources).

--- Additional comment from RamaKasturi on 2017-08-23 12:32:35 EDT ---

(In reply to Niels de Vos from comment #21)
> Sahina mentioned that a very similar problem has been reported earlier:
> 
> - https://bugs.launchpad.net/bugs/1644754
> - https://lists.gnu.org/archive/html/qemu-devel/2016-11/msg04742.html
> 
> The same "Could not read L1 table: Input/output error" is mentioned there.
> 
> # qemu-img info gluster://gluster01/gv0/bar2.qcow2
>   qemu-img: Could not open 'gluster://gluster01/gv0/bar2.qcow2': Could not
> read 
> L1 table: Input/output error
> 
> 
> The trace log from comment #18 has reasonable return values for the
> *_read_cbk functions. There is one exception though:
> 
> [2017-08-23 10:32:43.003671] T [MSGID: 0]
> [client-rpc-fops.c:3011:client3_3_readv_cbk] 0-stack-trace: stack-address:
> 0x5594f710f020, data-client-2 returned 8
> [2017-08-23 10:32:43.003691] T [MSGID: 0]
> [afr-inode-read.c:1838:afr_readv_cbk] 0-stack-trace: stack-address:
> 0x5594f710f020, data-replicate-0 returned 8
> [2017-08-23 10:32:43.003706] T [MSGID: 0]
> [dht-inode-read.c:502:dht_readv_cbk] 0-stack-trace: stack-address:
> 0x5594f710f020, data-dht returned 8
> [2017-08-23 10:32:43.003720] T [MSGID: 0] [shard.c:3122:shard_readv_do_cbk]
> 0-stack-trace: stack-address: 0x5594f710f020, data-shard returned 512
> 
> 
> Krutika, this suggests that shard changed the returned data that was read (I
> do not see the requested size in the log) from 8 bytes to 512. It looks
> rather unexpected to me that more bytes are returned than possibly have been
> requested. Do you have an idea if that can be intentional?
> 
> 
> Kasturi, it is possible to capture a tcpdump on the hypervisor while
> starting the VM? That should give the exact requests and replies that go
> over the network.

Niels, sure i can do that. I will capture it tomorrow and update the bug with the results. This is with sharding enabled right.
> 
> 
> The logs  from comment #20 do not have tracing enabled, so we can not see if
> the same behaviour is happening without sharding. Those logs have "remote
> operation failed [No such device or address]" as a reply on lseek. This
> should only be returned in the following case:
> 
> ENXIO  whence is SEEK_DATA or SEEK_HOLE, and the file offset is beyond the
> end of the file.
> 
> With a packet capture, we can see what seek operation is executed, and
> verify the size of the file on the brick(s).

--- Additional comment from RamaKasturi on 2017-08-23 12:33:17 EDT ---

(In reply to Niels de Vos from comment #22)
> Kasturi, is this an arbiter volume? Possibly the SEEK FOP is sent to the
> arbiter and there the size of the file will always be 0. Any seek operations
> will not behave correctly. It seems that there is no 'support' for SEEK in
> arbiter yet.

Niels, i am observing this with both replicate and arbiter volumes.

--- Additional comment from RamaKasturi on 2017-08-23 12:34:53 EDT ---

setting the need info's back as it got cleared with my update.

--- Additional comment from Krutika Dhananjay on 2017-08-23 21:57:03 EDT ---

(In reply to Niels de Vos from comment #21)
> Sahina mentioned that a very similar problem has been reported earlier:
> 
> - https://bugs.launchpad.net/bugs/1644754
> - https://lists.gnu.org/archive/html/qemu-devel/2016-11/msg04742.html
> 
> The same "Could not read L1 table: Input/output error" is mentioned there.
> 
> # qemu-img info gluster://gluster01/gv0/bar2.qcow2
>   qemu-img: Could not open 'gluster://gluster01/gv0/bar2.qcow2': Could not
> read 
> L1 table: Input/output error
> 
> 
> The trace log from comment #18 has reasonable return values for the
> *_read_cbk functions. There is one exception though:
> 
> [2017-08-23 10:32:43.003671] T [MSGID: 0]
> [client-rpc-fops.c:3011:client3_3_readv_cbk] 0-stack-trace: stack-address:
> 0x5594f710f020, data-client-2 returned 8
> [2017-08-23 10:32:43.003691] T [MSGID: 0]
> [afr-inode-read.c:1838:afr_readv_cbk] 0-stack-trace: stack-address:
> 0x5594f710f020, data-replicate-0 returned 8
> [2017-08-23 10:32:43.003706] T [MSGID: 0]
> [dht-inode-read.c:502:dht_readv_cbk] 0-stack-trace: stack-address:
> 0x5594f710f020, data-dht returned 8
> [2017-08-23 10:32:43.003720] T [MSGID: 0] [shard.c:3122:shard_readv_do_cbk]
> 0-stack-trace: stack-address: 0x5594f710f020, data-shard returned 512
> 
> 
> Krutika, this suggests that shard changed the returned data that was read (I
> do not see the requested size in the log) from 8 bytes to 512. It looks
> rather unexpected to me that more bytes are returned than possibly have been
> requested. Do you have an idea if that can be intentional?

Gah! Too bad we don't log the readv parameters in the wind path in trace.

But that's a good point, Niels. I checked code in shard and shard seems to return
the requested size in the call back too, irrespective of how many bytes could actually be read
from the backend. And then there have been two patches since May 2016 that touch shard's readv codepath.
Need to see if they are causing any problems.
I'm not sure to what extent tcpdump output will help here because it won't give the exact details of
what offset and length shard received and what it wound further down.
If tcpdump doesn't help, we can change the volfile manually to load trace translators below and above
shard and then test again.

-Krutika


> 
> 
> Kasturi, it is possible to capture a tcpdump on the hypervisor while
> starting the VM? That should give the exact requests and replies that go
> over the network.
> 
> 
> The logs  from comment #20 do not have tracing enabled, so we can not see if
> the same behaviour is happening without sharding. Those logs have "remote
> operation failed [No such device or address]" as a reply on lseek. This
> should only be returned in the following case:
> 
> ENXIO  whence is SEEK_DATA or SEEK_HOLE, and the file offset is beyond the
> end of the file.
> 
> With a packet capture, we can see what seek operation is executed, and
> verify the size of the file on the brick(s).

--- Additional comment from Krutika Dhananjay on 2017-08-23 21:58:19 EDT ---

Restoring the needinfo on Kasturi for the tcpdump output.

--- Additional comment from Niels de Vos on 2017-08-24 04:33:25 EDT ---

(In reply to Niels de Vos from comment #23)
> Arbiter bricks are always a little confusing to me. So this patch has been
> sent to get it clarified https://review.gluster.org/18103 (either with an
> implementation or a comment in the sources).

This is now tracked in bz1484722. AFR should prevent SEEK and other FOPs from reaching the arbiter.

--- Additional comment from Niels de Vos on 2017-08-24 04:37:16 EDT ---

(In reply to Krutika Dhananjay from comment #27)
> (In reply to Niels de Vos from comment #21)
..
> > Krutika, this suggests that shard changed the returned data that was read (I
> > do not see the requested size in the log) from 8 bytes to 512. It looks
> > rather unexpected to me that more bytes are returned than possibly have been
> > requested. Do you have an idea if that can be intentional?
> 
> Gah! Too bad we don't log the readv parameters in the wind path in trace.
> 
> But that's a good point, Niels. I checked code in shard and shard seems to
> return
> the requested size in the call back too, irrespective of how many bytes
> could actually be read
> from the backend. And then there have been two patches since May 2016 that
> touch shard's readv codepath.
> Need to see if they are causing any problems.

I do not suspect shard to be the problem here, because it also happens without it. That does not mean there are no potential problems with shards though,  maybe there are two different bugs being hit here.

> I'm not sure to what extent tcpdump output will help here because it won't
> give the exact details of
> what offset and length shard received and what it wound further down.
> If tcpdump doesn't help, we can change the volfile manually to load trace
> translators below and above
> shard and then test again.

I guess my request was not clear enough. The tcpdump would be most helpful when sharding is not used. Otherwise the requests might get split over different bricks, and things are a little more difficult to debug.

--- Additional comment from Krutika Dhananjay on 2017-08-24 07:05:37 EDT ---

Kasturi provided the tcpdump output for analysis.

Based on this, https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=13918723 is the build containing the fix to incorrect shard size being returned and has been given to Kasturi for testing. Let's see if it fixes the issue.

Clearing needinfos.

-Krutika

--- Additional comment from Krutika Dhananjay on 2017-08-24 09:41:53 EDT ---

Kasturi,

I checked the logs associated with kasturi_newbuild and this time the nunber-of-bytes-read is correctly returned by shard as 8. So there is definitely more to this than the fact that shard was converting 8 to 512 while unwinding readvs.

-Krutika

--- Additional comment from Sahina Bose on 2017-08-28 12:06:29 EDT ---

This issue doesn't seem to be consistently reproducible - I tried creating a template and starting the VM on your setup and was able to. I was also able to clone the VM from which you created template, and start it.


For the VM that fails to start from template, when mounting and checking the disk:

[root@dhcp46-21 ~]# xfs_repair -n /dev/sdb1
Phase 1 - find and verify superblock...
error reading superblock 1 -- seek to offset 268435456 failed
couldn't verify primary superblock - attempted to perform I/O beyond EOF !!!

attempting to find secondary superblock...
................................................Sorry, could not find valid secondary superblock
Exiting now.


Not sure how the disk got to this state.

--- Additional comment from Red Hat Bugzilla Rules Engine on 2017-08-28 13:50:26 EDT ---

This BZ having been considered, and subsequently not approved to be fixed at the RHGS 3.3.0 release, is being proposed for the next minor release of RHGS

--- Additional comment from Niels de Vos on 2017-08-28 16:22:34 EDT ---

(In reply to Sahina Bose from comment #33)
> For the VM that fails to start from template, when mounting and checking the
> disk:
> 
> [root@dhcp46-21 ~]# xfs_repair -n /dev/sdb1
> Phase 1 - find and verify superblock...
> error reading superblock 1 -- seek to offset 268435456 failed
> couldn't verify primary superblock - attempted to perform I/O beyond EOF !!!
> 
> attempting to find secondary superblock...
> ................................................Sorry, could not find valid
> secondary superblock
> Exiting now.
> 
> 
> Not sure how the disk got to this state.

If you specify which hypervisor this is, and how "xfs_repair" can be run inside the VM, we could run ltrace against the qemu process (needs qemu + glusterfs debuginfo) and see what libgfapi functions are called and what they return. There has been a report (bz1425293) where glfs_seek() returns bogus values, but I have not been able to reproduce that yet. glfs_seek() should return that it is not supported, and QEMU is expected to handle that gracefully.

--- Additional comment from Sahina Bose on 2017-08-29 02:13:50 EDT ---

(In reply to Niels de Vos from comment #35)
> (In reply to Sahina Bose from comment #33)
> > For the VM that fails to start from template, when mounting and checking the
> > disk:
> > 
> > [root@dhcp46-21 ~]# xfs_repair -n /dev/sdb1
> > Phase 1 - find and verify superblock...
> > error reading superblock 1 -- seek to offset 268435456 failed
> > couldn't verify primary superblock - attempted to perform I/O beyond EOF !!!
> > 
> > attempting to find secondary superblock...
> > ................................................Sorry, could not find valid
> > secondary superblock
> > Exiting now.
> > 
> > 
> > Not sure how the disk got to this state.
> 
> If you specify which hypervisor this is, and how "xfs_repair" can be run
> inside the VM, we could run ltrace against the qemu process (needs qemu +
> glusterfs debuginfo) and see what libgfapi functions are called and what
> they return. There has been a report (bz1425293) where glfs_seek() returns
> bogus values, but I have not been able to reproduce that yet. glfs_seek()
> should return that it is not supported, and QEMU is expected to handle that
> gracefully.

Have sent you the details of the hypervisor in a separate email

--- Additional comment from RamaKasturi on 2017-08-30 10:15:55 EDT ---

Hi,

   I tried to compare the setup where creation of vm was working from a template with mine. The difference i found was with the storage allocation policy. 

   I see that when the vm being created from the template which has storage allocation policy as 'thin' and disk format as 'qcow2' fails to boot. 

   If the vm being created from the template has storage allocation policy as 'clone' and disk format in 'raw' booting of vm works fine with out any issues.

   Attaching screenshots here for the same. So, i feel that issue does not lie in libgfapi as once i change the storage allocation policy i could see that vm boots on a setup where libgfapi is enabled.

   However i am not sure why creating a vm from template which has storage allocation policy as 'thin' and disk format as 'qcow2' does not work.

Thanks
kasturi

--- Additional comment from RamaKasturi on 2017-09-01 08:33:57 EDT ---

I have tried to create vms out of windows template as well and while booting the windows vm i see an error "A disk read error occured".

Attaching screenshot for the same

--- Additional comment from RamaKasturi on 2017-09-01 08:34 EDT ---



--- Additional comment from Krutika Dhananjay on 2017-09-05 05:55:05 EDT ---

The issue is happening by virtue of md-cache being enabled. Disabling md-cache causes the vms created from template to boot fine. Cancelling needinfo on Allon.

--- Additional comment from Krutika Dhananjay on 2017-09-05 12:10:53 EDT ---

The bug is in shard. When vm creation from template is initiated, RHV creates a hard link of the template under /xxxx../images/$UUID where UUID is the unique identifier of the vm being created.

In glusterfs, LINK fop is overloaded to return the attributes of the file as well in the callback, so that this information can be cached at various layers - md-cache being one of them - so that a subsequent stat/lookup can directly be served off the cache, and thereby help improve performance.

During LINK in this scenario, the following is what happened:

<snip1>

[2017-09-05 06:41:18.080640] I [trace.c:2012:trace_link] 0-vmstore-parent-mdc: 602: oldgfid=2ba927dc-d4ab-4ca0-bcf8-6ba629d17cf5 oldpath=/84d1d265-a58b-4951-a8f3-8bc7669b33ab/images/7807fb2b-420e-4dc7-b477-ba7796faa205/e52e483d-174d-4d95-89b5-41e0e37853c0 --> newgfid=2ba927dc-d4ab-4ca0-bcf8-6ba629d17cf5 newpath=/84d1d265-a58b-4951-a8f3-8bc7669b33ab/images/8e8d9148-4867-4d63-b6de-3092da32b1e8/e52e483d-174d-4d95-89b5-41e0e37853c0

[2017-09-05 06:41:18.080654] I [trace.c:2012:trace_link] 0-vmstore-parent-shard: 602: oldgfid=2ba927dc-d4ab-4ca0-bcf8-6ba629d17cf5 oldpath=/84d1d265-a58b-4951-a8f3-8bc7669b33ab/images/7807fb2b-420e-4dc7-b477-ba7796faa205/e52e483d-174d-4d95-89b5-41e0e37853c0 --> newgfid=2ba927dc-d4ab-4ca0-bcf8-6ba629d17cf5 newpath=/84d1d265-a58b-4951-a8f3-8bc7669b33ab/images/8e8d9148-4867-4d63-b6de-3092da32b1e8/e52e483d-174d-4d95-89b5-41e0e37853c0

[2017-09-05 06:41:18.080664] I [trace.c:2012:trace_link] 0-vmstore-child-shard: 602: oldgfid=2ba927dc-d4ab-4ca0-bcf8-6ba629d17cf5 oldpath=/84d1d265-a58b-4951-a8f3-8bc7669b33ab/images/7807fb2b-420e-4dc7-b477-ba7796faa205/e52e483d-174d-4d95-89b5-41e0e37853c0 --> newgfid=2ba927dc-d4ab-4ca0-bcf8-6ba629d17cf5 newpath=/84d1d265-a58b-4951-a8f3-8bc7669b33ab/images/8e8d9148-4867-4d63-b6de-3092da32b1e8/e52e483d-174d-4d95-89b5-41e0e37853c0

[2017-09-05 06:41:18.081413] I [trace.c:789:trace_link_cbk] 0-vmstore-child-shard: 602: (op_ret=0, *stbuf = {gfid=2ba927dc-d4ab-4ca0-bcf8-6ba629d17cf5 ino=13616751834765032693, mode=100660, nlink=2, uid=36, gid=36, ************size=67108864***********, blocks=129104, atime=2017-09-05-05:59:36 mtime=2017-09-05-05:59:37 ctime=2017-09-05-06:41:18 atime_sec=1504591176, atime_nsec=290615297, mtime_sec=1504591177, mtime_nsec=297623454, ctime_sec=1504593678, ctime_nsec=79878645},  *prebuf = {gfid=68d302a5-7fed-411e-9f81-bb2efcc282e4 ino=11493673534509712100, mode=40755, nlink=2, uid=36, gid=36, size=4096, blocks=8, atime=2017-09-05-06:41:18 mtime=2017-09-05-06:41:18 ctime=2017-09-05-06:41:18 atime_sec=1504593678, atime_nsec=27878000, mtime_sec=1504593678, mtime_nsec=27878000, ctime_sec=1504593678, ctime_nsec=29878241}, *postbuf = {gfid=68d302a5-7fed-411e-9f81-bb2efcc282e4 ino=11493673534509712100, mode=40755, nlink=2, uid=36, gid=36, size=4096, blocks=8, atime=2017-09-05-06:41:18 mtime=2017-09-05-06:41:18 ctime=2017-09-05-06:41:18 atime_sec=1504593678, atime_nsec=27878000, mtime_sec=1504593678, mtime_nsec=79878645, ctime_sec=1504593678, ctime_nsec=79878645})

[2017-09-05 06:41:18.081439] I [trace.c:789:trace_link_cbk] 0-vmstore-parent-shard: 602: (op_ret=0, *stbuf = {gfid=2ba927dc-d4ab-4ca0-bcf8-6ba629d17cf5 ino=13616751834765032693, mode=100660, nlink=2, uid=36, gid=36, *******size=67108864***********, blocks=129104, atime=2017-09-05-05:59:36 mtime=2017-09-05-05:59:37 ctime=2017-09-05-06:41:18 atime_sec=1504591176, atime_nsec=290615297, mtime_sec=1504591177, mtime_nsec=297623454, ctime_sec=1504593678, ctime_nsec=79878645},  *prebuf = {gfid=68d302a5-7fed-411e-9f81-bb2efcc282e4 ino=11493673534509712100, mode=40755, nlink=2, uid=36, gid=36, size=4096, blocks=8, atime=2017-09-05-06:41:18 mtime=2017-09-05-06:41:18 ctime=2017-09-05-06:41:18 atime_sec=1504593678, atime_nsec=27878000, mtime_sec=1504593678, mtime_nsec=27878000, ctime_sec=1504593678, ctime_nsec=29878241}, *postbuf = {gfid=68d302a5-7fed-411e-9f81-bb2efcc282e4 ino=11493673534509712100, mode=40755, nlink=2, uid=36, gid=36, size=4096, blocks=8, atime=2017-09-05-06:41:18 mtime=2017-09-05-06:41:18 ctime=2017-09-05-06:41:18 atime_sec=1504593678, atime_nsec=27878000, mtime_sec=1504593678, mtime_nsec=79878645, ctime_sec=1504593678, ctime_nsec=79878645})

[2017-09-05 06:41:18.081480] I [trace.c:789:trace_link_cbk] 0-vmstore-parent-mdc: 602: (op_ret=0, *stbuf = {gfid=2ba927dc-d4ab-4ca0-bcf8-6ba629d17cf5 ino=13616751834765032693, mode=100660, nlink=2, uid=36, gid=36, size=67108864, blocks=129104, atime=2017-09-05-05:59:36 mtime=2017-09-05-05:59:37 ctime=2017-09-05-06:41:18 atime_sec=1504591176, atime_nsec=290615297, mtime_sec=1504591177, mtime_nsec=297623454, ctime_sec=1504593678, ctime_nsec=79878645},  *prebuf = {gfid=68d302a5-7fed-411e-9f81-bb2efcc282e4 ino=11493673534509712100, mode=40755, nlink=2, uid=36, gid=36, size=4096, blocks=8, atime=2017-09-05-06:41:18 mtime=2017-09-05-06:41:18 ctime=2017-09-05-06:41:18 atime_sec=1504593678, atime_nsec=27878000, mtime_sec=1504593678, mtime_nsec=27878000, ctime_sec=1504593678, ctime_nsec=29878241}, *postbuf = {gfid=68d302a5-7fed-411e-9f81-bb2efcc282e4 ino=11493673534509712100, mode=40755, nlink=2, uid=36, gid=36, size=4096, blocks=8, atime=2017-09-05-06:41:18 mtime=2017-09-05-06:41:18 ctime=2017-09-05-06:41:18 atime_sec=1504593678, atime_nsec=27878000, mtime_sec=1504593678, mtime_nsec=79878645, ctime_sec=1504593678, ctime_nsec=79878645})

</snip1>

Basically shard is supposed to return among other attributes the aggregated size of the image and not the physical size of just the base shard. While this is correctly handled in all other fops, in LINK it continued to return only the physical size of the base shard which is nothing but the shard-block-size. So instead of returning 20GB in iatt, it ended up returning 67108864 which is equal to 64MB.

The problem is not so much that it returned the incorrect size. The real problem starts when md-cache caches this and uses this to serve the immediate stat on the file, which is exactly what happened - 

<snip2>

[2017-09-05 06:41:18.088070] I [trace.c:1746:trace_stat] 0-vmstore-parent-mdc: 611: gfid=2ba927dc-d4ab-4ca0-bcf8-6ba629d17cf5 path=/84d1d265-a58b-4951-a8f3-8bc7669b33ab/images/8e8d9148-4867-4d63-b6de-3092da32b1e8/e52e483d-174d-4d95-89b5-41e0e37853c0

[2017-09-05 06:41:18.088087] I [trace.c:182:trace_stat_cbk] 0-vmstore-parent-mdc: 611: gfid=2ba927dc-d4ab-4ca0-bcf8-6ba629d17cf5 op_ret=0 buf=gfid=2ba927dc-d4ab-4ca0-bcf8-6ba629d17cf5 ino=13616751834765032693, mode=100660, nlink=2, uid=36, gid=36, ***********size=67108864*************, blocks=129104, atime=2017-09-05-05:59:36 mtime=2017-09-05-05:59:37 ctime=2017-09-05-06:41:18 atime_sec=1504591176, atime_nsec=290615297, mtime_sec=1504591177, mtime_nsec=297623454, ctime_sec=1504593678, ctime_nsec=79878645

</snip2>

The problem is that all the remaining steps towards creating the vm from template depend on the correctness of the size returned in STAT. And getting the size wrong implies everything else will go wrong.

I fixed this bug in shard, made a brew build and tested it on Kasturi's setup and vms booted fine even with stat-prefetch on. I tried this on 3 vms and all 3 of them ran fine.

The patch itself isn't all that big - 42 lines of code added in shard.

Big thanks to Kasturi for her cooperation and patience in resolving the bug. :)

I'll prettify the patch and add a test case if possible and send it on upstream soon.

-Krutika

--- Additional comment from Worker Ant on 2017-09-05 12:39:45 EDT ---

REVIEW: https://review.gluster.org/18209 (features/shard: Return aggregated size in stbuf of LINK fop) posted (#1) for review on master by Krutika Dhananjay (kdhananj)

--- Additional comment from Worker Ant on 2017-09-05 22:31:13 EDT ---

COMMIT: https://review.gluster.org/18209 committed in master by Pranith Kumar Karampuri (pkarampu) 
------
commit 91430817ce5bcbeabf057e9c978485728a85fb2b
Author: Krutika Dhananjay <kdhananj>
Date:   Tue Sep 5 21:42:26 2017 +0530

    features/shard: Return aggregated size in stbuf of LINK fop
    
    Change-Id: I42df7679d63fec9b4c03b8dbc66c5625f097fac0
    BUG: 1488546
    Signed-off-by: Krutika Dhananjay <kdhananj>
    Reviewed-on: https://review.gluster.org/18209
    Smoke: Gluster Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
    Reviewed-by: Pranith Kumar Karampuri <pkarampu>

Comment 1 Worker Ant 2017-09-06 06:00:56 UTC

REVIEW: https://review.gluster.org/18212 (features/shard: Return aggregated size in stbuf of LINK fop) posted (#1) for review on release-3.10 by Krutika Dhananjay (kdhananj)

Comment 2 Worker Ant 2017-09-17 12:48:26 UTC

COMMIT: https://review.gluster.org/18212 committed in release-3.10 by Shyamsundar Ranganathan (srangana) 
------
commit 3fccf92b8af269324ef94e3d42e1746a01ec9d89
Author: Krutika Dhananjay <kdhananj>
Date:   Tue Sep 5 21:42:26 2017 +0530

    features/shard: Return aggregated size in stbuf of LINK fop
    
            Backport of:
            > Change-Id: I42df7679d63fec9b4c03b8dbc66c5625f097fac0
            > Reviewed-on: https://review.gluster.org/18209
            > BUG: 1488546
            > cherry-picked from 91430817ce5bcbeabf057e9c978485728a85fb2b
    
    Change-Id: I42df7679d63fec9b4c03b8dbc66c5625f097fac0
    BUG: 1488719
    Signed-off-by: Krutika Dhananjay <kdhananj>
    Reviewed-on: https://review.gluster.org/18212
    Smoke: Gluster Build System <jenkins.org>
    Reviewed-by: Pranith Kumar Karampuri <pkarampu>
    CentOS-regression: Gluster Build System <jenkins.org>

Comment 3 Shyamsundar 2017-10-06 17:11:53 UTC

This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.10.6, please open a new bug report.

glusterfs-3.10.6 has been announced on the Gluster mailinglists [1], packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailinglist [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/announce/2017-October/000084.html
[2] https://www.gluster.org/pipermail/gluster-users/