Description of problem:
Once a template is created, any new VM created from that template fails to start. Attaching a screenshot of the error.

Version-Release number of selected component (if applicable):
glusterfs-3.8.4-41.el7rhgs.x86_64
vdsm-4.19.28-1.el7ev.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Configure gluster and install Hosted Engine using the Cockpit UI.
2. Add additional hosts and create storage domains.
3. Create a VM.
4. Make a template out of the created VM.
5. Create a new VM from the template.

Actual results:
VM creation succeeds, but the VM fails to boot with the error shown in the screenshot.

Expected results:
The VM should boot successfully.

Additional info:
No errors seen in the qemu logs:
=====================================
2017-08-18 13:33:37.463+0000: starting up libvirt version: 3.2.0, package: 14.el7_4.2 (Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>, 2017-07-18-09:40:48, x86-039.build.eng.bos.redhat.com), qemu version: 2.9.0(qemu-kvm-rhev-2.9.0-16.el7_4.4), hostname: tettnang.lab.eng.blr.redhat.com LC_ALL=C PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin QEMU_AUDIO_DRV=spice /usr/libexec/qemu-kvm -name guest=testvm_shard_64MB,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-14-testvm_shard_64MB/master-key.aes -machine pc-i440fx-rhel7.3.0,accel=kvm,usb=off,dump-guest-core=off -cpu Haswell-noTSX -m size=8388608k,slots=16,maxmem=33554432k -realtime mlock=off -smp 2,maxcpus=16,sockets=16,cores=1,threads=1 -numa node,nodeid=0,cpus=0-1,mem=8192 -uuid fea25030-23af-4069-98da-e74a544cb04f -smbios 'type=1,manufacturer=Red Hat,product=RHEV Hypervisor,version=7.4-18.el7rhgs,serial=36383737-3538-4753-4835-30375832434A,uuid=fea25030-23af-4069-98da-e74a544cb04f' -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-14-testvm_shard_64MB/monitor.sock,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc
base=2017-08-18T13:33:36,driftfix=slew -global kvm-pit.lost_tick_policy=delay -no-hpet -no-shutdown -boot strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -device virtio-scsi-pci,id=scsi0,bus=pci.0,addr=0x4 -device virtio-serial-pci,id=virtio-serial0,max_ports=16,bus=pci.0,addr=0x5 -drive if=none,id=drive-ide0-1-0,readonly=on -device ide-cd,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 -drive file=gluster://10.70.36.78/data/2edc6cc1-7b31-425d-a969-c5df2287c84a/images/3ef6ed85-9647-4b48-a041-77d9a5a1e9f5/e07f8c83-b365-49f1-b276-fad2f8afc9dc,file.debug=4,format=qcow2,if=none,id=drive-scsi0-0-0-0,serial=3ef6ed85-9647-4b48-a041-77d9a5a1e9f5,cache=none,werror=stop,rerror=stop,aio=threads -device scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0,bootindex=1 -netdev tap,fd=31,id=hostnet0,vhost=on,vhostfd=33 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=00:45:55:21:48:15,bus=pci.0,addr=0x3 -chardev socket,id=charchannel0,path=/var/lib/libvirt/qemu/channels/fea25030-23af-4069-98da-e74a544cb04f.com.redhat.rhevm.vdsm,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=com.redhat.rhevm.vdsm -chardev socket,id=charchannel1,path=/var/lib/libvirt/qemu/channels/fea25030-23af-4069-98da-e74a544cb04f.org.qemu.guest_agent.0,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=2,chardev=charchannel1,id=channel1,name=org.qemu.guest_agent.0 -chardev spicevmc,id=charchannel2,name=vdagent -device virtserialport,bus=virtio-serial0.0,nr=3,chardev=charchannel2,id=channel2,name=com.redhat.spice.0 -spice tls-port=5900,addr=10.70.34.41,x509-dir=/etc/pki/vdsm/libvirt-spice,tls-channel=default,tls-channel=main,tls-channel=display,tls-channel=inputs,tls-channel=cursor,tls-channel=playback,tls-channel=record,tls-channel=smartcard,tls-channel=usbredir,seamless-migration=on -device 
qxl-vga,id=video0,ram_size=67108864,vram_size=8388608,vram64_size_mb=0,vgamem_mb=16,max_outputs=1,bus=pci.0,addr=0x2 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x6 -object rng-random,id=objrng0,filename=/dev/urandom -device virtio-rng-pci,rng=objrng0,id=rng0,bus=pci.0,addr=0x7 -msg timestamp=on [2017-08-18 13:33:37.791278] I [afr.c:94:fix_quorum_options] 0-data-replicate-0: reindeer: incoming qtype = auto [2017-08-18 13:33:37.791329] I [afr.c:116:fix_quorum_options] 0-data-replicate-0: reindeer: quorum_count = 2147483647 [2017-08-18 13:33:37.792934] W [MSGID: 101174] [graph.c:360:_log_if_unknown_option] 0-data-readdir-ahead: option 'parallel-readdir' is not recognized [2017-08-18 13:33:37.792997] I [MSGID: 104045] [glfs-master.c:91:notify] 0-gfapi: New graph 74657474-6e61-6e67-2e6c-61622e656e67 (0) coming up [2017-08-18 13:33:37.793027] I [MSGID: 114020] [client.c:2356:notify] 0-data-client-0: parent translators are ready, attempting connect on transport [2017-08-18 13:33:37.797036] I [MSGID: 114020] [client.c:2356:notify] 0-data-client-1: parent translators are ready, attempting connect on transport [2017-08-18 13:33:37.797683] I [rpc-clnt.c:2001:rpc_clnt_reconfig] 0-data-client-0: changing port to 49153 (from 0) [2017-08-18 13:33:37.801016] I [MSGID: 114020] [client.c:2356:notify] 0-data-client-2: parent translators are ready, attempting connect on transport [2017-08-18 13:33:37.805314] I [MSGID: 114057] [client-handshake.c:1450:select_server_supported_programs] 0-data-client-0: Using Program GlusterFS 3.3, Num (1298437), Version (330) [2017-08-18 13:33:37.805424] I [rpc-clnt.c:2001:rpc_clnt_reconfig] 0-data-client-1: changing port to 49153 (from 0) [2017-08-18 13:33:37.808857] I [MSGID: 114046] [client-handshake.c:1215:client_setvolume_cbk] 0-data-client-0: Connected to data-client-0, attached to remote volume '/gluster_bricks/data/data'. 
[2017-08-18 13:33:37.808885] I [MSGID: 114047] [client-handshake.c:1226:client_setvolume_cbk] 0-data-client-0: Server and Client lk-version numbers are not same, reopening the fds [2017-08-18 13:33:37.808965] I [MSGID: 108005] [afr-common.c:4706:afr_notify] 0-data-replicate-0: Subvolume 'data-client-0' came back up; going online. [2017-08-18 13:33:37.809303] I [MSGID: 114035] [client-handshake.c:201:client_set_lk_version_cbk] 0-data-client-0: Server lk version = 1 [2017-08-18 13:33:37.809652] I [rpc-clnt.c:2001:rpc_clnt_reconfig] 0-data-client-2: changing port to 49153 (from 0) [2017-08-18 13:33:37.812611] I [MSGID: 114057] [client-handshake.c:1450:select_server_supported_programs] 0-data-client-1: Using Program GlusterFS 3.3, Num (1298437), Version (330) [2017-08-18 13:33:37.813306] I [MSGID: 114046] [client-handshake.c:1215:client_setvolume_cbk] 0-data-client-1: Connected to data-client-1, attached to remote volume '/gluster_bricks/data/data'. [2017-08-18 13:33:37.813347] I [MSGID: 114047] [client-handshake.c:1226:client_setvolume_cbk] 0-data-client-1: Server and Client lk-version numbers are not same, reopening the fds [2017-08-18 13:33:37.813489] I [MSGID: 114035] [client-handshake.c:201:client_set_lk_version_cbk] 0-data-client-1: Server lk version = 1 [2017-08-18 13:33:37.816687] I [MSGID: 114057] [client-handshake.c:1450:select_server_supported_programs] 0-data-client-2: Using Program GlusterFS 3.3, Num (1298437), Version (330) 
[2017-08-18 13:33:37.817323] I [MSGID: 114046] [client-handshake.c:1215:client_setvolume_cbk] 0-data-client-2: Connected to data-client-2, attached to remote volume '/gluster_bricks/data/data'. [2017-08-18 13:33:37.817342] I [MSGID: 114047] [client-handshake.c:1226:client_setvolume_cbk] 0-data-client-2: Server and Client lk-version numbers are not same, reopening the fds [2017-08-18 13:33:37.817531] I [MSGID: 114035] [client-handshake.c:201:client_set_lk_version_cbk] 0-data-client-2: Server lk version = 1 [2017-08-18 13:33:37.820304] I [MSGID: 108031] [afr-common.c:2264:afr_local_discovery_cbk] 0-data-replicate-0: selecting local read_child data-client-1 [2017-08-18 13:33:37.820727] I [MSGID: 104041] [glfs-resolve.c:970:__glfs_active_subvol] 0-data: switched to graph 74657474-6e61-6e67-2e6c-61622e656e67 (0) [2017-08-18 13:33:37.825282] I [MSGID: 133017] [shard.c:4900:shard_seek] 0-data-shard: seek called on f7e757ae-e231-4e79-8dea-d9ff84da555d. [Operation not supported] [2017-08-18 13:33:37.832548] I [MSGID: 133017] [shard.c:4900:shard_seek] 0-data-shard: seek called on 7771a6d6-6171-4a09-ac83-e2cf723833bf. 
[Operation not supported] 2017-08-18T13:33:37.832765Z qemu-kvm: warning: CPU(s) not present in any NUMA nodes: 2 3 4 5 6 7 8 9 10 11 12 13 14 15 2017-08-18T13:33:37.832797Z qemu-kvm: warning: All CPU(s) up to maxcpus should be described in NUMA config main_channel_link: add main channel client main_channel_handle_parsed: net test: latency 1.825000 ms, bitrate 1369899665 bps (1306.438127 Mbps) inputs_connect: inputs channel client create

errors seen in brick logs:
================================
[2017-08-18 13:28:24.415782] E [MSGID: 113002] [posix.c:267:posix_lookup] 0-data-posix: buf->ia_gfid is null for /gluster_bricks/data/data/.shard/b1a937e5-4a69-4b41-96cb-07a7f562db55.785 [No data available] [2017-08-18 13:28:24.415845] E [MSGID: 115050] [server-rpc-fops.c:158:server_lookup_cbk] 0-data-server: 114652: LOOKUP /.shard/b1a937e5-4a69-4b41-96cb-07a7f562db55.785 (be318638-e8a0-4c6d-977d-7a937aa84806/b1a937e5-4a69-4b41-96cb-07a7f562db55.785) ==> (No data available) [No data available] [2017-08-18 13:28:24.472442] W [MSGID: 113096] [posix-handle.c:773:posix_handle_hard] 0-data-posix: link /gluster_bricks/data/data/.shard/b1a937e5-4a69-4b41-96cb-07a7f562db55.786 -> /gluster_bricks/data/data/.glusterfs/ce/13/ce1330fe-e550-43d2-92ea-caa2c9510971failed [File exists] [2017-08-18 13:28:24.472492] E [MSGID: 113020] [posix.c:1395:posix_mknod] 0-data-posix: setting gfid on /gluster_bricks/data/data/.shard/b1a937e5-4a69-4b41-96cb-07a7f562db55.786 failed [2017-08-18 13:28:24.534927] E [MSGID: 113002] [posix.c:267:posix_lookup] 0-data-posix: buf->ia_gfid is null for /gluster_bricks/data/data/.shard/b1a937e5-4a69-4b41-96cb-07a7f562db55.787 [No data available] [2017-08-18 13:28:24.534981] E [MSGID: 115050] [server-rpc-fops.c:158:server_lookup_cbk] 0-data-server: 114743: LOOKUP /.shard/b1a937e5-4a69-4b41-96cb-07a7f562db55.787 (be318638-e8a0-4c6d-977d-7a937aa84806/b1a937e5-4a69-4b41-96cb-07a7f562db55.787) ==> (No data available) [No data available] [2017-08-18 
13:28:24.592544] W [MSGID: 113096] [posix-handle.c:773:posix_handle_hard] 0-data-posix: link /gluster_bricks/data/data/.shard/b1a937e5-4a69-4b41-96cb-07a7f562db55.788 -> /gluster_bricks/data/data/.glusterfs/83/00/830084a6-423b-4a29-a0bc-2ea90e2374a2failed [File exists] [2017-08-18 13:28:24.592562] E [MSGID: 113020] [posix.c:1395:posix_mknod] 0-data-posix: setting gfid on /gluster_bricks/data/data/.shard/b1a937e5-4a69-4b41-96cb-07a7f562db55.788 failed [2017-08-18 13:28:24.700997] W [MSGID: 113096] [posix-handle.c:773:posix_handle_hard] 0-data-posix: link /gluster_bricks/data/data/.shard/b1a937e5-4a69-4b41-96cb-07a7f562db55.790 -> /gluster_bricks/data/data/.glusterfs/fb/a9/fba964dc-2b80-4c97-a84a-0fd4970d12f4failed [File exists] [2017-08-18 13:28:24.701032] E [MSGID: 113020] [posix.c:1395:posix_mknod] 0-data-posix: setting gfid on /gluster_bricks/data/data/.shard/b1a937e5-4a69-4b41-96cb-07a7f562db55.790 failed [2017-08-18 13:28:24.761620] W [MSGID: 113096] [posix-handle.c:773:posix_handle_hard] 0-data-posix: link /gluster_bricks/data/data/.shard/b1a937e5-4a69-4b41-96cb-07a7f562db55.791 -> /gluster_bricks/data/data/.glusterfs/a7/f5/a7f5cc30-f0fd-49a0-91d6-2233563e9c65failed [File exists] [2017-08-18 13:28:24.761621] W [MSGID: 113096] [posix-handle.c:773:posix_handle_hard] 0-data-posix: link /gluster_bricks/data/data/.shard/b1a937e5-4a69-4b41-96cb-07a7f562db55.791 -> /gluster_bricks/data/data/.glusterfs/a7/f5/a7f5cc30-f0fd-49a0-91d6-2233563e9c65failed [File exists] [2017-08-18 13:28:24.761638] E [MSGID: 113020] [posix.c:1395:posix_mknod] 0-data-posix: setting gfid on /gluster_bricks/data/data/.shard/b1a937e5-4a69-4b41-96cb-07a7f562db55.791 failed [2017-08-18 13:28:24.890108] E [MSGID: 
113002] [posix.c:267:posix_lookup] 0-data-posix: buf->ia_gfid is null for /gluster_bricks/data/data/.shard/b1a937e5-4a69-4b41-96cb-07a7f562db55.793 [No data available] [2017-08-18 13:28:24.890172] E [MSGID: 115050] [server-rpc-fops.c:158:server_lookup_cbk] 0-data-server: 115002: LOOKUP /.shard/b1a937e5-4a69-4b41-96cb-07a7f562db55.793 (be318638-e8a0-4c6d-977d-7a937aa84806/b1a937e5-4a69-4b41-96cb-07a7f562db55.793) ==> (No data available) [No data available] [2017-08-18 13:28:25.003037] E [MSGID: 113002] [posix.c:267:posix_lookup] 0-data-posix: buf->ia_gfid is null for /gluster_bricks/data/data/.shard/b1a937e5-4a69-4b41-96cb-07a7f562db55.795 [No data available] [2017-08-18 13:28:25.003093] E [MSGID: 115050] [server-rpc-fops.c:158:server_lookup_cbk] 0-data-server: 115090: LOOKUP /.shard/b1a937e5-4a69-4b41-96cb-07a7f562db55.795 (be318638-e8a0-4c6d-977d-7a937aa84806/b1a937e5-4a69-4b41-96cb-07a7f562db55.795) ==> (No data available) [No data available]
sosreports can be found in the link below:
==========================================
http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/HC/1482994/
Moving from oVirt to RHGS
Niels,

The only error we see in the client logs is from shard, reporting that SEEK_DATA and SEEK_HOLE are not supported. Technically things should still work even if shard doesn't support seek: qemu should fall back to its earlier mode of operation from before seek was implemented in gluster, at least until it is implemented in shard, which is a non-trivial effort.

I checked block/gluster.c in the qemu codebase and found that with sharding enabled, the supports_seek_data flag should be set to false by virtue of shard returning ENOTSUP. The only place where this flag is read seems to be find_allocation(), which will jump to "exit" in this case and return -ENOTSUP to qemu_gluster_co_get_block_status(). Unfortunately I do not understand what happens next: ret seems to be set to BDRV_BLOCK_DATA, and I don't understand what that is supposed to mean.

A git-blame shows that this particular piece of code was written by you as part of commit 947eb2030eef73b7a66167a6244b11eeb4f109ff. Could you please clarify what qemu does next, and why VMs won't boot because of this, given the lack of any logging of fop failures in gfapi itself?

Thanks in advance,
Krutika
I don't see any seek problems in the logs here... not sure if this problem is the same as bz#1357919.

There have been some changes to how glfs_lseek() is handled by QEMU, but those were made months ago. Unfortunately the qemu version is not mentioned in the bug. Rafi provided the versions over IRC:

15:56 < rafi> ndevos, qemu-kvm-tools-rhev-2.6.0-28.el7_3.12.x86_64
15:56 < rafi> libvirt-daemon-kvm-2.0.0-10.el7_3.9.x86_64
15:56 < rafi> qemu-kvm-rhev-2.6.0-28.el7_3.12.x86_64
15:56 < rafi> qemu-kvm-common-rhev-2.6.0-28.el7_3.12.x86_64
Rafi,

I checked the logs. None of them indicates anything about the issue seen in the dev cluster; all of these logs represent a transient state.

-Krutika
(In reply to Niels de Vos from comment #11)
> I don't see any seek problems in the logs here... not sure if this problem
> is the same as bz#1357919.
>
> There have been some changes to how glfs_lseek() is handled by QEMU, but
> that has been months ago. Unfortunately the version of qemu is not mentioned
> in the bug. Rafi provided versions over IRC:
>
> 15:56 < rafi> ndevos, qemu-kvm-tools-rhev-2.6.0-28.el7_3.12.x86_64
> 15:56 < rafi> libvirt-daemon-kvm-2.0.0-10.el7_3.9.x86_64
> 15:56 < rafi> qemu-kvm-rhev-2.6.0-28.el7_3.12.x86_64
> 15:56 < rafi> qemu-kvm-common-rhev-2.6.0-28.el7_3.12.x86_64

Kasturi,

What version of qemu are you using? Could you please provide the output? Niels suggests in comment 11 that there were changes to how QEMU handles the seek fop in the past.

-Krutika
Hi Niels & Krutika,

Apologies, I missed including that information. Here is the qemu version:

[root@yarrow ~]# rpm -qa | grep qemu
qemu-kvm-rhev-2.9.0-16.el7_4.4.x86_64
qemu-kvm-tools-rhev-2.9.0-16.el7_4.4.x86_64
qemu-kvm-common-rhev-2.9.0-16.el7_4.4.x86_64
libvirt-daemon-driver-qemu-3.2.0-14.el7_4.2.x86_64
ipxe-roms-qemu-20170123-1.git4e85b27.el7.noarch
qemu-img-rhev-2.9.0-16.el7_4.4.x86_64

Thanks,
Kasturi
In case of problems related to seek, we need to verify whether this upstream change was backported:

commit 223a23c198787328ae75bc65d84edf5fde33c0b6
Author: Jeff Cody <jcody>
Date: Tue May 23 13:27:50 2017 -0400

    block/gluster: glfs_lseek() workaround

https://git.qemu.org/gitweb.cgi?p=qemu.git;a=commit;h=223a23c198787328ae75bc65d84edf5fde33c0b6
(In reply to Niels de Vos from comment #15)
> In case of problems related to seek, we need to verify if this upstream
> change was backported.
>
> commit 223a23c198787328ae75bc65d84edf5fde33c0b6
> Author: Jeff Cody <jcody>
> Date: Tue May 23 13:27:50 2017 -0400
>
>     block/gluster: glfs_lseek() workaround
>
> https://git.qemu.org/gitweb.cgi?p=qemu.git;a=commit;h=223a23c198787328ae75bc65d84edf5fde33c0b6

Bug 1451191 was used to get the fix into qemu-kvm-rhev-2.9.0-8.el7.x86_64. The version this was reported against is higher, so it should contain the backport as well.

Bug 1425293#c21 mentions that the seek problem has been fixed with the combination of:
libvirt-3.2.0-10.el7.x86_64
qemu-kvm-rhev-2.9.0-10.el7.x86_64
glusterfs-3.8.4-28.el7rhgs.x86_64

There has been a fix for seek on the gluster side as well: http://review.gluster.org/15943 was backported to upstream glusterfs-3.8.7. This matches downstream bug 1454558 (git tag v3.8.4-26).

At the moment I do not think we are missing anything on the seek side. Is it possible to verify that qemu continues with <something> after trying glfs_lseek()? You could do this by setting diagnostics.client-log-level=TRACE for the volume; the QEMU/gfapi logs should then be much more verbose.
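The suggested tracing step, as a CLI fragment (assuming the volume is named "data", as in the logs above; not meant to be run outside the test cluster):

```shell
# Enable client-side trace logging for the volume, reproduce the VM start,
# then lower the log level again to avoid flooding the logs:
gluster volume set data diagnostics.client-log-level TRACE
# ... start the VM and collect the qemu/gfapi log ...
gluster volume set data diagnostics.client-log-level INFO
```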
Sure Niels, I will set the option on the volume and update the bug.
Hi Niels,

I have set the volume option as you suggested, and I see errors only from seek returning 'Operation not supported':

[2017-08-23 10:32:42.999569] D [MSGID: 0] [shard.c:4901:shard_seek] 0-stack-trace: stack-address: 0x5594f725a020, data-shard returned -1 error: Operation not supported [Operation not supported]
[2017-08-23 10:32:42.999585] D [MSGID: 0] [io-stats.c:2529:io_stats_seek_cbk] 0-stack-trace: stack-address: 0x5594f725a020, data returned -1 error: Operation not supported [Operation not supported]

The complete qemu log file is available at:
http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/HC/1482994/vm_data_trace1.log

Thanks,
Kasturi
Thanks for testing! After seek failed with ENOTSUP, qemu does some reads from the VM image on the volume. I therefore doubt that the problem is caused by seek.

It would be helpful to have similar logs from a test where sharding is disabled. If that also does not work, it may really not be a problem with seek (or, if we're unlucky, a different seek issue).
Hi Niels,

I tested this without sharding enabled, and below is what I see:

[root@yarrow qemu]# cat vm_noshard_1.log
2017-08-18 07:47:56.357+0000: starting up libvirt version: 3.2.0, package: 14.el7_4.2 (Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>, 2017-07-18-09:40:48, x86-039.build.eng.bos.redhat.com), qemu version: 2.9.0(qemu-kvm-rhev-2.9.0-16.el7_4.4), hostname: yarrow.lab.eng.blr.redhat.com LC_ALL=C PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin QEMU_AUDIO_DRV=spice /usr/libexec/qemu-kvm -name guest=vm_noshard_1,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-3-vm_noshard_1/master-key.aes -machine pc-i440fx-rhel7.3.0,accel=kvm,usb=off,dump-guest-core=off -cpu Haswell-noTSX -m size=8388608k,slots=16,maxmem=33554432k -realtime mlock=off -smp 2,maxcpus=16,sockets=16,cores=1,threads=1 -numa node,nodeid=0,cpus=0-1,mem=8192 -uuid 6c540118-b51e-4599-958e-a7a80e5d4f9c -smbios 'type=1,manufacturer=Red Hat,product=RHEV Hypervisor,version=7.4-18.el7rhgs,serial=36383737-3538-4753-4835-303758324348,uuid=6c540118-b51e-4599-958e-a7a80e5d4f9c' -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-3-vm_noshard_1/monitor.sock,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=2017-08-18T07:47:55,driftfix=slew -global kvm-pit.lost_tick_policy=delay -no-hpet -no-shutdown -boot strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -device virtio-scsi-pci,id=scsi0,bus=pci.0,addr=0x4 -device virtio-serial-pci,id=virtio-serial0,max_ports=16,bus=pci.0,addr=0x5 -drive if=none,id=drive-ide0-1-0,readonly=on -device ide-cd,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 -drive 
file=gluster://10.70.36.78/data/1cc3328f-0462-4b9e-9cd2-b22b1bce0bcc/images/6cd3f0a7-3e81-46db-a25d-7a92d9234a8f/c308f044-d17d-47b0-a164-cea091b908b9,file.debug=4,format=qcow2,if=none,id=drive-scsi0-0-0-0,serial=6cd3f0a7-3e81-46db-a25d-7a92d9234a8f,cache=none,werror=stop,rerror=stop,aio=threads -device scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0,bootindex=1 -netdev tap,fd=33,id=hostnet0,vhost=on,vhostfd=35 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=00:45:55:21:48:0d,bus=pci.0,addr=0x3 -chardev socket,id=charchannel0,path=/var/lib/libvirt/qemu/channels/6c540118-b51e-4599-958e-a7a80e5d4f9c.com.redhat.rhevm.vdsm,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=com.redhat.rhevm.vdsm -chardev socket,id=charchannel1,path=/var/lib/libvirt/qemu/channels/6c540118-b51e-4599-958e-a7a80e5d4f9c.org.qemu.guest_agent.0,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=2,chardev=charchannel1,id=channel1,name=org.qemu.guest_agent.0 -chardev spicevmc,id=charchannel2,name=vdagent -device virtserialport,bus=virtio-serial0.0,nr=3,chardev=charchannel2,id=channel2,name=com.redhat.spice.0 -spice tls-port=5901,addr=10.70.34.43,x509-dir=/etc/pki/vdsm/libvirt-spice,tls-channel=default,tls-channel=main,tls-channel=display,tls-channel=inputs,tls-channel=cursor,tls-channel=playback,tls-channel=record,tls-channel=smartcard,tls-channel=usbredir,seamless-migration=on -device qxl-vga,id=video0,ram_size=67108864,vram_size=8388608,vram64_size_mb=0,vgamem_mb=16,max_outputs=1,bus=pci.0,addr=0x2 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x6 -object rng-random,id=objrng0,filename=/dev/urandom -device virtio-rng-pci,rng=objrng0,id=rng0,bus=pci.0,addr=0x7 -msg timestamp=on [2017-08-18 07:47:56.512892] I [afr.c:94:fix_quorum_options] 0-data-replicate-0: reindeer: incoming qtype = auto [2017-08-18 07:47:56.512931] I [afr.c:116:fix_quorum_options] 0-data-replicate-0: reindeer: 
quorum_count = 2147483647 [2017-08-18 07:47:56.514517] W [MSGID: 101174] [graph.c:360:_log_if_unknown_option] 0-data-readdir-ahead: option 'parallel-readdir' is not recognized [2017-08-18 07:47:56.514580] I [MSGID: 104045] [glfs-master.c:91:notify] 0-gfapi: New graph 79617272-6f77-2e6c-6162-2e656e672e62 (0) coming up [2017-08-18 07:47:56.514613] I [MSGID: 114020] [client.c:2356:notify] 0-data-client-0: parent translators are ready, attempting connect on transport [2017-08-18 07:47:56.518721] I [MSGID: 114020] [client.c:2356:notify] 0-data-client-1: parent translators are ready, attempting connect on transport [2017-08-18 07:47:56.519082] I [rpc-clnt.c:2001:rpc_clnt_reconfig] 0-data-client-0: changing port to 49153 (from 0) [2017-08-18 07:47:56.522525] I [MSGID: 114020] [client.c:2356:notify] 0-data-client-2: parent translators are ready, attempting connect on transport [2017-08-18 07:47:56.526842] I [MSGID: 114057] [client-handshake.c:1450:select_server_supported_programs] 0-data-client-0: Using Program GlusterFS 3.3, Num (1298437), Version (330) [2017-08-18 07:47:56.527094] I [rpc-clnt.c:2001:rpc_clnt_reconfig] 0-data-client-1: changing port to 49153 (from 0) [2017-08-18 07:47:56.530333] I [MSGID: 114046] [client-handshake.c:1215:client_setvolume_cbk] 0-data-client-0: Connected to data-client-0, attached to remote volume '/gluster_bricks/data/data'. [2017-08-18 07:47:56.530364] I [MSGID: 114047] [client-handshake.c:1226:client_setvolume_cbk] 0-data-client-0: Server and Client lk-version numbers are not same, reopening the fds [2017-08-18 07:47:56.530464] I [MSGID: 108005] [afr-common.c:4706:afr_notify] 0-data-replicate-0: Subvolume 'data-client-0' came back up; going online. 
[2017-08-18 07:47:56.530893] I [MSGID: 114035] [client-handshake.c:201:client_set_lk_version_cbk] 0-data-client-0: Server lk version = 1 [2017-08-18 07:47:56.531269] I [rpc-clnt.c:2001:rpc_clnt_reconfig] 0-data-client-2: changing port to 49153 (from 0) [2017-08-18 07:47:56.534429] I [MSGID: 114057] [client-handshake.c:1450:select_server_supported_programs] 0-data-client-1: Using Program GlusterFS 3.3, Num (1298437), Version (330) [2017-08-18 07:47:56.535133] I [MSGID: 114046] [client-handshake.c:1215:client_setvolume_cbk] 0-data-client-1: Connected to data-client-1, attached to remote volume '/gluster_bricks/data/data'. [2017-08-18 07:47:56.535157] I [MSGID: 114047] [client-handshake.c:1226:client_setvolume_cbk] 0-data-client-1: Server and Client lk-version numbers are not same, reopening the fds [2017-08-18 07:47:56.535348] I [MSGID: 114035] [client-handshake.c:201:client_set_lk_version_cbk] 0-data-client-1: Server lk version = 1 [2017-08-18 07:47:56.537986] I [MSGID: 114057] [client-handshake.c:1450:select_server_supported_programs] 0-data-client-2: Using Program GlusterFS 3.3, Num (1298437), Version (330) [2017-08-18 07:47:56.538625] I [MSGID: 114046] [client-handshake.c:1215:client_setvolume_cbk] 0-data-client-2: Connected to data-client-2, attached to remote volume '/gluster_bricks/data/data'. 
[2017-08-18 07:47:56.538644] I [MSGID: 114047] [client-handshake.c:1226:client_setvolume_cbk] 0-data-client-2: Server and Client lk-version numbers are not same, reopening the fds [2017-08-18 07:47:56.538873] I [MSGID: 114035] [client-handshake.c:201:client_set_lk_version_cbk] 0-data-client-2: Server lk version = 1 [2017-08-18 07:47:56.541463] I [MSGID: 108031] [afr-common.c:2264:afr_local_discovery_cbk] 0-data-replicate-0: selecting local read_child data-client-0 [2017-08-18 07:47:56.542170] I [MSGID: 104041] [glfs-resolve.c:970:__glfs_active_subvol] 0-data: switched to graph 79617272-6f77-2e6c-6162-2e656e672e62 (0) [2017-08-18 07:47:56.547327] W [MSGID: 114031] [client-rpc-fops.c:2211:client3_3_seek_cbk] 0-data-client-0: remote operation failed [No such device or address] [2017-08-18 07:47:56.548794] I [MSGID: 114021] [client.c:2365:notify] 0-data-client-0: current graph is no longer active, destroying rpc_client [2017-08-18 07:47:56.548845] I [MSGID: 114021] [client.c:2365:notify] 0-data-client-1: current graph is no longer active, destroying rpc_client [2017-08-18 07:47:56.548878] I [MSGID: 114021] [client.c:2365:notify] 0-data-client-2: current graph is no longer active, destroying rpc_client [2017-08-18 07:47:56.548871] I [MSGID: 114018] [client.c:2280:client_rpc_notify] 0-data-client-0: disconnected from data-client-0. Client process will keep trying to connect to glusterd until brick's port is available [2017-08-18 07:47:56.548916] I [MSGID: 114018] [client.c:2280:client_rpc_notify] 0-data-client-2: disconnected from data-client-2. Client process will keep trying to connect to glusterd until brick's port is available [2017-08-18 07:47:56.548910] I [MSGID: 114018] [client.c:2280:client_rpc_notify] 0-data-client-1: disconnected from data-client-1. 
Client process will keep trying to connect to glusterd until brick's port is available [2017-08-18 07:47:56.548935] W [MSGID: 108001] [afr-common.c:4820:afr_notify] 0-data-replicate-0: Client-quorum is not met [2017-08-18 07:47:56.549058] E [MSGID: 108006] [afr-common.c:4731:afr_notify] 0-data-replicate-0: All subvolumes are down. Going offline until atleast one of them comes back up. [2017-08-18 07:47:56.549468] I [io-stats.c:4006:fini] 0-data: io-stats translator unloaded [2017-08-18 07:47:56.549951] I [MSGID: 101191] [event-epoll.c:633:event_dispatch_epoll_worker] 0-epoll: Exited thread with index 1 [2017-08-18 07:47:56.549979] I [MSGID: 101191] [event-epoll.c:633:event_dispatch_epoll_worker] 0-epoll: Exited thread with index 2 2017-08-18T07:47:57.496671Z qemu-kvm: -drive file=gluster://10.70.36.78/data/1cc3328f-0462-4b9e-9cd2-b22b1bce0bcc/images/6cd3f0a7-3e81-46db-a25d-7a92d9234a8f/c308f044-d17d-47b0-a164-cea091b908b9,file.debug=4,format=qcow2,if=none,id=drive-scsi0-0-0-0,serial=6cd3f0a7-3e81-46db-a25d-7a92d9234a8f,cache=none,werror=stop,rerror=stop,aio=threads: Could not read L1 table: Input/output error 2017-08-18 07:47:57.514+0000: shutting down, reason=failed Thanks kasturi
Sahina mentioned that a very similar problem has been reported earlier:

- https://bugs.launchpad.net/bugs/1644754
- https://lists.gnu.org/archive/html/qemu-devel/2016-11/msg04742.html

The same "Could not read L1 table: Input/output error" is mentioned there.

# qemu-img info gluster://gluster01/gv0/bar2.qcow2
qemu-img: Could not open 'gluster://gluster01/gv0/bar2.qcow2': Could not read L1 table: Input/output error

The trace log from comment #18 has reasonable return values for the *_read_cbk functions. There is one exception though:

[2017-08-23 10:32:43.003671] T [MSGID: 0] [client-rpc-fops.c:3011:client3_3_readv_cbk] 0-stack-trace: stack-address: 0x5594f710f020, data-client-2 returned 8
[2017-08-23 10:32:43.003691] T [MSGID: 0] [afr-inode-read.c:1838:afr_readv_cbk] 0-stack-trace: stack-address: 0x5594f710f020, data-replicate-0 returned 8
[2017-08-23 10:32:43.003706] T [MSGID: 0] [dht-inode-read.c:502:dht_readv_cbk] 0-stack-trace: stack-address: 0x5594f710f020, data-dht returned 8
[2017-08-23 10:32:43.003720] T [MSGID: 0] [shard.c:3122:shard_readv_do_cbk] 0-stack-trace: stack-address: 0x5594f710f020, data-shard returned 512

Krutika, this suggests that shard changed the amount of data that was read (I do not see the requested size in the log) from 8 bytes to 512. It looks rather unexpected to me that more bytes are returned than could have been requested. Do you have an idea whether that can be intentional?

Kasturi, is it possible to capture a tcpdump on the hypervisor while starting the VM? That should give the exact requests and replies that go over the network.

The logs from comment #20 do not have tracing enabled, so we cannot see whether the same behaviour happens without sharding. Those logs have "remote operation failed [No such device or address]" as the reply to lseek. This should only be returned in the following case:

ENXIO  whence is SEEK_DATA or SEEK_HOLE, and the file offset is beyond the end of the file.
With a packet capture, we can see what seek operation is executed, and verify the size of the file on the brick(s).
Kasturi, is this an arbiter volume? Possibly the SEEK FOP is sent to the arbiter brick, where the size of the file is always 0, so any seek operation will not behave correctly. It seems that there is no 'support' for SEEK in arbiter yet.
Arbiter bricks are always a little confusing to me, so this patch has been sent to get it clarified (either with an implementation or a comment in the sources): https://review.gluster.org/18103
(In reply to Niels de Vos from comment #21)
> Sahina mentioned that a very similar problem has been reported earlier:
>
> - https://bugs.launchpad.net/bugs/1644754
> - https://lists.gnu.org/archive/html/qemu-devel/2016-11/msg04742.html
>
> The same "Could not read L1 table: Input/output error" is mentioned there.
>
> # qemu-img info gluster://gluster01/gv0/bar2.qcow2
> qemu-img: Could not open 'gluster://gluster01/gv0/bar2.qcow2': Could not read
> L1 table: Input/output error
>
> The trace log from comment #18 has reasonable return values for the
> *_read_cbk functions. There is one exception though:
>
> [2017-08-23 10:32:43.003671] T [MSGID: 0]
> [client-rpc-fops.c:3011:client3_3_readv_cbk] 0-stack-trace: stack-address:
> 0x5594f710f020, data-client-2 returned 8
> [2017-08-23 10:32:43.003691] T [MSGID: 0]
> [afr-inode-read.c:1838:afr_readv_cbk] 0-stack-trace: stack-address:
> 0x5594f710f020, data-replicate-0 returned 8
> [2017-08-23 10:32:43.003706] T [MSGID: 0]
> [dht-inode-read.c:502:dht_readv_cbk] 0-stack-trace: stack-address:
> 0x5594f710f020, data-dht returned 8
> [2017-08-23 10:32:43.003720] T [MSGID: 0] [shard.c:3122:shard_readv_do_cbk]
> 0-stack-trace: stack-address: 0x5594f710f020, data-shard returned 512
>
> Krutika, this suggests that shard changed the returned data that was read (I
> do not see the requested size in the log) from 8 bytes to 512. It looks
> rather unexpected to me that more bytes are returned than possibly have been
> requested. Do you have an idea if that can be intentional?
>
> Kasturi, it is possible to capture a tcpdump on the hypervisor while
> starting the VM? That should give the exact requests and replies that go
> over the network.

Niels, sure I can do that. I will capture it tomorrow and update the bug with the results. This is with sharding enabled, right?

> The logs from comment #20 do not have tracing enabled, so we can not see if
> the same behaviour is happening without sharding. Those logs have "remote
> operation failed [No such device or address]" as a reply on lseek. This
> should only be returned in the following case:
>
> ENXIO whence is SEEK_DATA or SEEK_HOLE, and the file offset is beyond the
> end of the file.
>
> With a packet capture, we can see what seek operation is executed, and
> verify the size of the file on the brick(s).
(In reply to Niels de Vos from comment #22)
> Kasturi, is this an arbiter volume? Possibly the SEEK FOP is sent to the
> arbiter and there the size of the file will always be 0. Any seek operations
> will not behave correctly. It seems that there is no 'support' for SEEK in
> arbiter yet.

Niels, I am observing this with both replicate and arbiter volumes.
Setting the needinfos back as they got cleared with my update.
(In reply to Niels de Vos from comment #21)
> Sahina mentioned that a very similar problem has been reported earlier:
>
> - https://bugs.launchpad.net/bugs/1644754
> - https://lists.gnu.org/archive/html/qemu-devel/2016-11/msg04742.html
>
> The same "Could not read L1 table: Input/output error" is mentioned there.
>
> # qemu-img info gluster://gluster01/gv0/bar2.qcow2
> qemu-img: Could not open 'gluster://gluster01/gv0/bar2.qcow2': Could not read
> L1 table: Input/output error
>
> The trace log from comment #18 has reasonable return values for the
> *_read_cbk functions. There is one exception though:
>
> [2017-08-23 10:32:43.003671] T [MSGID: 0]
> [client-rpc-fops.c:3011:client3_3_readv_cbk] 0-stack-trace: stack-address:
> 0x5594f710f020, data-client-2 returned 8
> [2017-08-23 10:32:43.003691] T [MSGID: 0]
> [afr-inode-read.c:1838:afr_readv_cbk] 0-stack-trace: stack-address:
> 0x5594f710f020, data-replicate-0 returned 8
> [2017-08-23 10:32:43.003706] T [MSGID: 0]
> [dht-inode-read.c:502:dht_readv_cbk] 0-stack-trace: stack-address:
> 0x5594f710f020, data-dht returned 8
> [2017-08-23 10:32:43.003720] T [MSGID: 0] [shard.c:3122:shard_readv_do_cbk]
> 0-stack-trace: stack-address: 0x5594f710f020, data-shard returned 512
>
> Krutika, this suggests that shard changed the returned data that was read (I
> do not see the requested size in the log) from 8 bytes to 512. It looks
> rather unexpected to me that more bytes are returned than possibly have been
> requested. Do you have an idea if that can be intentional?

Gah! Too bad we don't log the readv parameters in the wind path in trace.

But that's a good point, Niels. I checked the code in shard, and shard seems to return the requested size in the callback too, irrespective of how many bytes could actually be read from the backend. There have also been two patches since May 2016 that touch shard's readv codepath. Need to see if they are causing any problems.

I'm not sure to what extent tcpdump output will help here, because it won't give the exact details of what offset and length shard received and what it wound further down. If tcpdump doesn't help, we can change the volfile manually to load trace translators below and above shard and then test again.

-Krutika

> Kasturi, it is possible to capture a tcpdump on the hypervisor while
> starting the VM? That should give the exact requests and replies that go
> over the network.
>
> The logs from comment #20 do not have tracing enabled, so we can not see if
> the same behaviour is happening without sharding. Those logs have "remote
> operation failed [No such device or address]" as a reply on lseek. This
> should only be returned in the following case:
>
> ENXIO whence is SEEK_DATA or SEEK_HOLE, and the file offset is beyond the
> end of the file.
>
> With a packet capture, we can see what seek operation is executed, and
> verify the size of the file on the brick(s).
Restoring the needinfo on Kasturi for the tcpdump output.
(In reply to Niels de Vos from comment #23)
> Arbiter bricks are always a little confusing to me. So this patch has been
> sent to get it clarified https://review.gluster.org/18103 (either with an
> implementation or a comment in the sources).

This is now tracked in bz1484722. AFR should prevent SEEK and other FOPs from reaching the arbiter.
(In reply to Krutika Dhananjay from comment #27)
> (In reply to Niels de Vos from comment #21)
..
> > Krutika, this suggests that shard changed the returned data that was read (I
> > do not see the requested size in the log) from 8 bytes to 512. It looks
> > rather unexpected to me that more bytes are returned than possibly have been
> > requested. Do you have an idea if that can be intentional?
>
> Gah! Too bad we don't log the readv parameters in the wind path in trace.
>
> But that's a good point, Niels. I checked code in shard and shard seems to
> return the requested size in the call back too, irrespective of how many
> bytes could actually be read from the backend. And then there have been two
> patches since May 2016 that touch shard's readv codepath.
> Need to see if they are causing any problems.

I do not suspect shard to be the problem here, because it also happens without it. That does not mean there are no potential problems with shard, though; maybe there are two different bugs being hit here.

> I'm not sure to what extent tcpdump output will help here because it won't
> give the exact details of what offset and length shard received and what it
> wound further down.
> If tcpdump doesn't help, we can change the volfile manually to load trace
> translators below and above shard and then test again.

I guess my request was not clear enough. The tcpdump would be most helpful when sharding is not used. Otherwise the requests might get split over different bricks, and things are a little more difficult to debug.
Kasturi provided the tcpdump output for analysis. Based on this, https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=13918723 is the build containing the fix for the incorrect shard size being returned; it has been given to Kasturi for testing. Let's see if it fixes the issue. Clearing needinfos.

-Krutika
Kasturi, I checked the logs associated with kasturi_newbuild, and this time the number-of-bytes-read is correctly returned by shard as 8. So there is definitely more to this than shard converting 8 to 512 while unwinding readvs.

-Krutika
This issue doesn't seem to be consistently reproducible - I tried creating a template and starting the VM on your setup and was able to. I was also able to clone the VM from which you created the template, and start it.

For the VM that fails to start from template, when mounting and checking the disk:

[root@dhcp46-21 ~]# xfs_repair -n /dev/sdb1
Phase 1 - find and verify superblock...
error reading superblock 1 -- seek to offset 268435456 failed
couldn't verify primary superblock - attempted to perform I/O beyond EOF !!!

attempting to find secondary superblock...
................................................Sorry, could not find valid secondary superblock
Exiting now.

Not sure how the disk got to this state.
(In reply to Sahina Bose from comment #33)
> For the VM that fails to start from template, when mounting and checking the
> disk:
>
> [root@dhcp46-21 ~]# xfs_repair -n /dev/sdb1
> Phase 1 - find and verify superblock...
> error reading superblock 1 -- seek to offset 268435456 failed
> couldn't verify primary superblock - attempted to perform I/O beyond EOF !!!
>
> attempting to find secondary superblock...
> ................................................Sorry, could not find valid
> secondary superblock
> Exiting now.
>
> Not sure how the disk got to this state.

If you specify which hypervisor this is, and how "xfs_repair" can be run inside the VM, we could run ltrace against the qemu process (needs qemu + glusterfs debuginfo) and see what libgfapi functions are called and what they return. There has been a report (bz1425293) where glfs_seek() returns bogus values, but I have not been able to reproduce that yet. glfs_seek() should return that it is not supported, and QEMU is expected to handle that gracefully.
(In reply to Niels de Vos from comment #35)
> (In reply to Sahina Bose from comment #33)
> > For the VM that fails to start from template, when mounting and checking the
> > disk:
> >
> > [root@dhcp46-21 ~]# xfs_repair -n /dev/sdb1
> > Phase 1 - find and verify superblock...
> > error reading superblock 1 -- seek to offset 268435456 failed
> > couldn't verify primary superblock - attempted to perform I/O beyond EOF !!!
> >
> > attempting to find secondary superblock...
> > ................................................Sorry, could not find valid
> > secondary superblock
> > Exiting now.
> >
> > Not sure how the disk got to this state.
>
> If you specify which hypervisor this is, and how "xfs_repair" can be run
> inside the VM, we could run ltrace against the qemu process (needs qemu +
> glusterfs debuginfo) and see what libgfapi functions are called and what
> they return. There has been a report (bz1425293) where glfs_seek() returns
> bogus values, but I have not been able to reproduce that yet. glfs_seek()
> should return that it is not supported, and QEMU is expected to handle that
> gracefully.

Have sent you the details of the hypervisor in a separate email.
So I loaded the trace translator above the md-cache translator on Kasturi's setup to trace the different file system calls, the offsets and lengths of the IO, etc.

Right after we create "New VM" from the template, the UUID directory under images gets created, and another UUID file (call it UUID2), its .meta and .lease files also get created under this directory, along with hard links to the original template file.

When I checked the file operations on the UUID2 file, it seems that for some time writes are sent on it at meaningful offsets and lengths - like offset=0,size=65536, offset=131072,size=65536, etc. At this point, the total file size is 131072 + 65536 = 196608, a number exactly divisible by 512, O_DIRECT compliant. All good till now.

But after this, there are a bunch of writes that gluster receives that are at rather odd offsets and sizes (like 8 bytes, 12 bytes, etc.) and NOT O_DIRECT compliant, such as the following:

[2017-08-30 09:33:56.320058] I [trace.c:2310:trace_writev] 0-vmstore-md-cache-parent: 3721: gfid=e69cc192-5207-45ee-bfb1-e98414f3ab4a fd=0x7f0734077630, count=1, offset=196608 flags=08002 write_size=8
[2017-08-30 09:33:56.320144] I [trace.c:2310:trace_writev] 0-vmstore-md-cache-parent: 3723: gfid=e69cc192-5207-45ee-bfb1-e98414f3ab4a fd=0x7f0734077630, count=1, offset=36 flags=08002 write_size=12
[2017-08-30 09:33:56.320306] I [trace.c:2310:trace_writev] 0-vmstore-md-cache-parent: 3725: gfid=e69cc192-5207-45ee-bfb1-e98414f3ab4a fd=0x7f0734077630, count=1, offset=24 flags=08002 write_size=8

This doesn't seem normal to me. Denis, could you confirm if these writes are normal?

-Krutika
Hi,

I tried to compare a setup where creating a VM from a template was working with mine. The difference I found was the storage allocation policy. A VM created from a template with storage allocation policy 'thin' and disk format 'qcow2' fails to boot. If the VM being created from the template has storage allocation policy 'clone' and disk format 'raw', it boots fine without any issues. Attaching screenshots for the same.

So I feel that the issue does not lie in libgfapi, as once I change the storage allocation policy the VM boots on a setup where libgfapi is enabled. However, I am not sure why creating a VM from a template with storage allocation policy 'thin' and disk format 'qcow2' does not work.

Thanks
kasturi
Created attachment 1320108 [details] screenshot where storage allocation policy was clone and disk format is qcow2
Created attachment 1320111 [details] screenshot where storage allocation policy was clone and disk format is raw
Created attachment 1320112 [details] screenshot where storage allocation policy was thin and disk format is qcow2
Sahina, I'm not quite sure what info is requested from me here. All RHV does here is orchestrate the call to qemu-img convert that creates the disk (either as a raw or qcow2 volume). Given that, I'm not quite sure what VDSM can do here.

- It could be we're using the wrong command line to create the volume, or the wrong parameters to access it. If so, we should fix it, but we'd need some guidance as to what we should do differently.
- If we're using the correct calls, the bug is somewhere between qemu and gluster.
Hi Allon,

What we're suspecting at the moment is the occurrence of writes, during VM creation from a template, which are not O_DIRECT-aligned and also appear to be really short:

[2017-08-30 09:33:56.320058] I [trace.c:2310:trace_writev] 0-vmstore-md-cache-parent: 3721: gfid=e69cc192-5207-45ee-bfb1-e98414f3ab4a fd=0x7f0734077630, count=1, offset=196608 flags=08002 write_size=8
[2017-08-30 09:33:56.320144] I [trace.c:2310:trace_writev] 0-vmstore-md-cache-parent: 3723: gfid=e69cc192-5207-45ee-bfb1-e98414f3ab4a fd=0x7f0734077630, count=1, offset=36 flags=08002 write_size=12
[2017-08-30 09:33:56.320306] I [trace.c:2310:trace_writev] 0-vmstore-md-cache-parent: 3725: gfid=e69cc192-5207-45ee-bfb1-e98414f3ab4a fd=0x7f0734077630, count=1, offset=24 flags=08002 write_size=8

The first write was of size 8 bytes. The second write was at offset 36 and size 12. The third was at offset 24 and size 8.

At the moment this is the only oddity we could recognise. If you think this is the job of qemu and the right persons to ask are qemu developers, just let us know and we'll reach out to them.

-Krutika
I have tried to create VMs from a Windows template as well, and while booting the Windows VM I see the error "A disk read error occurred". Attaching a screenshot for the same.
Created attachment 1320912 [details] Attaching screenshot while booting windows vm
The issue is happening by virtue of md-cache being enabled. Disabling md-cache causes the vms created from template to boot fine. Cancelling needinfo on Allon.
The bug is in shard.

When VM creation from a template is initiated, RHV creates a hard link of the template under /xxxx../images/$UUID, where UUID is the unique identifier of the VM being created. In glusterfs, the LINK fop is overloaded to also return the attributes of the file in the callback, so that this information can be cached at various layers - md-cache being one of them - and a subsequent stat/lookup can then be served directly off the cache, thereby helping performance.

During LINK in this scenario, the following is what happened:

<snip1>
[2017-09-05 06:41:18.080640] I [trace.c:2012:trace_link] 0-vmstore-parent-mdc: 602: oldgfid=2ba927dc-d4ab-4ca0-bcf8-6ba629d17cf5 oldpath=/84d1d265-a58b-4951-a8f3-8bc7669b33ab/images/7807fb2b-420e-4dc7-b477-ba7796faa205/e52e483d-174d-4d95-89b5-41e0e37853c0 --> newgfid=2ba927dc-d4ab-4ca0-bcf8-6ba629d17cf5 newpath=/84d1d265-a58b-4951-a8f3-8bc7669b33ab/images/8e8d9148-4867-4d63-b6de-3092da32b1e8/e52e483d-174d-4d95-89b5-41e0e37853c0
[2017-09-05 06:41:18.080654] I [trace.c:2012:trace_link] 0-vmstore-parent-shard: 602: oldgfid=2ba927dc-d4ab-4ca0-bcf8-6ba629d17cf5 oldpath=/84d1d265-a58b-4951-a8f3-8bc7669b33ab/images/7807fb2b-420e-4dc7-b477-ba7796faa205/e52e483d-174d-4d95-89b5-41e0e37853c0 --> newgfid=2ba927dc-d4ab-4ca0-bcf8-6ba629d17cf5 newpath=/84d1d265-a58b-4951-a8f3-8bc7669b33ab/images/8e8d9148-4867-4d63-b6de-3092da32b1e8/e52e483d-174d-4d95-89b5-41e0e37853c0
[2017-09-05 06:41:18.080664] I [trace.c:2012:trace_link] 0-vmstore-child-shard: 602: oldgfid=2ba927dc-d4ab-4ca0-bcf8-6ba629d17cf5 oldpath=/84d1d265-a58b-4951-a8f3-8bc7669b33ab/images/7807fb2b-420e-4dc7-b477-ba7796faa205/e52e483d-174d-4d95-89b5-41e0e37853c0 --> newgfid=2ba927dc-d4ab-4ca0-bcf8-6ba629d17cf5 newpath=/84d1d265-a58b-4951-a8f3-8bc7669b33ab/images/8e8d9148-4867-4d63-b6de-3092da32b1e8/e52e483d-174d-4d95-89b5-41e0e37853c0
[2017-09-05 06:41:18.081413] I [trace.c:789:trace_link_cbk] 0-vmstore-child-shard: 602: (op_ret=0, *stbuf =
{gfid=2ba927dc-d4ab-4ca0-bcf8-6ba629d17cf5 ino=13616751834765032693, mode=100660, nlink=2, uid=36, gid=36, ************size=67108864***********, blocks=129104, atime=2017-09-05-05:59:36 mtime=2017-09-05-05:59:37 ctime=2017-09-05-06:41:18 atime_sec=1504591176, atime_nsec=290615297, mtime_sec=1504591177, mtime_nsec=297623454, ctime_sec=1504593678, ctime_nsec=79878645}, *prebuf = {gfid=68d302a5-7fed-411e-9f81-bb2efcc282e4 ino=11493673534509712100, mode=40755, nlink=2, uid=36, gid=36, size=4096, blocks=8, atime=2017-09-05-06:41:18 mtime=2017-09-05-06:41:18 ctime=2017-09-05-06:41:18 atime_sec=1504593678, atime_nsec=27878000, mtime_sec=1504593678, mtime_nsec=27878000, ctime_sec=1504593678, ctime_nsec=29878241}, *postbuf = {gfid=68d302a5-7fed-411e-9f81-bb2efcc282e4 ino=11493673534509712100, mode=40755, nlink=2, uid=36, gid=36, size=4096, blocks=8, atime=2017-09-05-06:41:18 mtime=2017-09-05-06:41:18 ctime=2017-09-05-06:41:18 atime_sec=1504593678, atime_nsec=27878000, mtime_sec=1504593678, mtime_nsec=79878645, ctime_sec=1504593678, ctime_nsec=79878645}) [2017-09-05 06:41:18.081439] I [trace.c:789:trace_link_cbk] 0-vmstore-parent-shard: 602: (op_ret=0, *stbuf = {gfid=2ba927dc-d4ab-4ca0-bcf8-6ba629d17cf5 ino=13616751834765032693, mode=100660, nlink=2, uid=36, gid=36, *******size=67108864***********, blocks=129104, atime=2017-09-05-05:59:36 mtime=2017-09-05-05:59:37 ctime=2017-09-05-06:41:18 atime_sec=1504591176, atime_nsec=290615297, mtime_sec=1504591177, mtime_nsec=297623454, ctime_sec=1504593678, ctime_nsec=79878645}, *prebuf = {gfid=68d302a5-7fed-411e-9f81-bb2efcc282e4 ino=11493673534509712100, mode=40755, nlink=2, uid=36, gid=36, size=4096, blocks=8, atime=2017-09-05-06:41:18 mtime=2017-09-05-06:41:18 ctime=2017-09-05-06:41:18 atime_sec=1504593678, atime_nsec=27878000, mtime_sec=1504593678, mtime_nsec=27878000, ctime_sec=1504593678, ctime_nsec=29878241}, *postbuf = {gfid=68d302a5-7fed-411e-9f81-bb2efcc282e4 ino=11493673534509712100, mode=40755, nlink=2, uid=36, gid=36, 
size=4096, blocks=8, atime=2017-09-05-06:41:18 mtime=2017-09-05-06:41:18 ctime=2017-09-05-06:41:18 atime_sec=1504593678, atime_nsec=27878000, mtime_sec=1504593678, mtime_nsec=79878645, ctime_sec=1504593678, ctime_nsec=79878645})
[2017-09-05 06:41:18.081480] I [trace.c:789:trace_link_cbk] 0-vmstore-parent-mdc: 602: (op_ret=0, *stbuf = {gfid=2ba927dc-d4ab-4ca0-bcf8-6ba629d17cf5 ino=13616751834765032693, mode=100660, nlink=2, uid=36, gid=36, size=67108864, blocks=129104, atime=2017-09-05-05:59:36 mtime=2017-09-05-05:59:37 ctime=2017-09-05-06:41:18 atime_sec=1504591176, atime_nsec=290615297, mtime_sec=1504591177, mtime_nsec=297623454, ctime_sec=1504593678, ctime_nsec=79878645}, *prebuf = {gfid=68d302a5-7fed-411e-9f81-bb2efcc282e4 ino=11493673534509712100, mode=40755, nlink=2, uid=36, gid=36, size=4096, blocks=8, atime=2017-09-05-06:41:18 mtime=2017-09-05-06:41:18 ctime=2017-09-05-06:41:18 atime_sec=1504593678, atime_nsec=27878000, mtime_sec=1504593678, mtime_nsec=27878000, ctime_sec=1504593678, ctime_nsec=29878241}, *postbuf = {gfid=68d302a5-7fed-411e-9f81-bb2efcc282e4 ino=11493673534509712100, mode=40755, nlink=2, uid=36, gid=36, size=4096, blocks=8, atime=2017-09-05-06:41:18 mtime=2017-09-05-06:41:18 ctime=2017-09-05-06:41:18 atime_sec=1504593678, atime_nsec=27878000, mtime_sec=1504593678, mtime_nsec=79878645, ctime_sec=1504593678, ctime_nsec=79878645})
</snip1>

Basically, shard is supposed to return, among other attributes, the aggregated size of the image, not the physical size of just the base shard. While this is correctly handled in all other fops, in LINK it continued to return only the physical size of the base shard, which is nothing but the shard-block-size. So instead of returning 20GB in the iatt, it ended up returning 67108864, which is equal to 64MB. The problem is not so much that it returned the incorrect size.
The real problem starts when md-cache caches this and uses it to serve the immediate stat on the file, which is exactly what happened:

<snip2>
[2017-09-05 06:41:18.088070] I [trace.c:1746:trace_stat] 0-vmstore-parent-mdc: 611: gfid=2ba927dc-d4ab-4ca0-bcf8-6ba629d17cf5 path=/84d1d265-a58b-4951-a8f3-8bc7669b33ab/images/8e8d9148-4867-4d63-b6de-3092da32b1e8/e52e483d-174d-4d95-89b5-41e0e37853c0
[2017-09-05 06:41:18.088087] I [trace.c:182:trace_stat_cbk] 0-vmstore-parent-mdc: 611: gfid=2ba927dc-d4ab-4ca0-bcf8-6ba629d17cf5 op_ret=0 buf=gfid=2ba927dc-d4ab-4ca0-bcf8-6ba629d17cf5 ino=13616751834765032693, mode=100660, nlink=2, uid=36, gid=36, ***********size=67108864*************, blocks=129104, atime=2017-09-05-05:59:36 mtime=2017-09-05-05:59:37 ctime=2017-09-05-06:41:18 atime_sec=1504591176, atime_nsec=290615297, mtime_sec=1504591177, mtime_nsec=297623454, ctime_sec=1504593678, ctime_nsec=79878645
</snip2>

The problem is that all the remaining steps towards creating the VM from the template depend on the correctness of the size returned in STAT. And getting the size wrong implies everything else will go wrong.

I fixed this bug in shard, made a brew build and tested it on Kasturi's setup, and VMs booted fine even with stat-prefetch on. I tried this on 3 VMs and all 3 of them ran fine. The build that contains the fix is available here - https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=14003344 - in case anyone is interested. The patch itself isn't all that big - 42 lines of code added in shard.

Big thanks to Kasturi for her cooperation and patience in resolving the bug. :)

I'll prettify the patch and add a test case if possible and send it upstream soon.

-Krutika
https://review.gluster.org/18209
I am proposing this bug as a blocker.

Without the fix, one can't create VMs from a thin-provisioned template. It also affects the classic RHV-RHGS use case as well as RHHI 1.1.
(In reply to SATHEESARAN from comment #49)
> I am proposing this bug as a blocker.
>
> Without the fix, one can't create VMs out of thin template.
> It also affects classic RHV-RHGS usecase as well as RHHI 1.1

This bug is denied blocker status for RHGS 3.3.0. The workaround will be documented. Proposing it for 3.3.1.
upstream mainline patch: https://review.gluster.org/#/c/18209/
upstream 3.12 patch: https://review.gluster.org/#/c/18211/
downstream patch: https://code.engineering.redhat.com/gerrit/117426
Looks good.
Verified and works fine with build glusterfs-3.8.4-45.el7rhgs.x86_64. With stat-prefetch on, I see that VMs created from a template boot fine and there is no issue.

[root@rhsqa-grafton1 ~]# gluster volume get vmstore performance.stat-prefetch
Option                                  Value
------                                  -----
performance.stat-prefetch               on
Hi Krutika, I've edited the Doc Text for its associated errata. Please review it and let me know in case of any concerns. If no changes are required, please provide your approval.
Looks good to me.
Minor changes to Doc Text based on peer review.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:3276