Bug 1579981

Summary:	When the customer tries to migrate an RHV 4.1 disk from one storage domain to another, the glusterfsd core dumps.
Product:	[Red Hat Storage] Red Hat Gluster Storage	Reporter:	Andrew Robinson <anrobins>
Component:	disperse	Assignee:	Ashish Pandey <aspandey>
Status:	CLOSED ERRATA	QA Contact:	Upasana <ubansal>
Severity:	high	Docs Contact:
Priority:	unspecified
Version:	rhgs-3.3	CC:	amukherj, anrobins, apaladug, aspandey, bkunal, bturner, jahernan, kdhananj, pkarampu, rcyriac, rhinduja, rhs-bugs, rhsc-qe-bugs, sankarshan, sasundar, storage-qa-internal, vdas
Target Milestone:	---
Target Release:	RHGS 3.4.0
Hardware:	x86_64
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2018-09-04 06:48:05 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	1503138

Description Andrew Robinson 2018-05-18 21:11:07 UTC

Description of problem:

The customer has RHV-M 4.1.9.2 with Gluster and iSCSI storage domains. When the customer tries to live migrate a VM disk from its current Gluster storage domain to another storage domain (iSCSI), glusterfsd dumps core.

Core files:

~~~
https://access.redhat.com/rs/cases/02071230/attachments/cf3a653a-5db7-4e26-9a96-ecfd5cad934c
~~~

From the VDSM logs:

As soon as the image copy starts, the following errors are logged:

~~~
2018-04-24 05:10:11,405-0700 ERROR (tasks/1) [storage.Image] Copy image error: image=48eb2429-e20e-41be-bdd1-05d79ab8c959, src domain=9daf8c8f-53dc-45fe-8833-4500599d75f6, dst domain=873b7bc2-1c92-4416-86ff-556b95ef614e (image:535)
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/image.py", line 526, in _interImagesCopy
    self._wait_for_qemuimg_operation(operation)
  File "/usr/share/vdsm/storage/image.py", line 141, in _wait_for_qemuimg_operation
    operation.wait_for_completion()
  File "/usr/lib/python2.7/site-packages/vdsm/qemuimg.py", line 339, in wait_for_completion
    self.poll(timeout)
  File "/usr/lib/python2.7/site-packages/vdsm/qemuimg.py", line 334, in poll
    self.error)
QImgError: cmd=['/usr/bin/taskset', '--cpu-list', '0-7', '/usr/bin/nice', '-n', '19', '/usr/bin/ionice', '-c', '3', '/usr/bin/qemu-img', 'convert', '-p', '-t', 'none', '-T', 'none', '-f', 'raw', u'/rhev/data-center/mnt/glusterSD/172.16.49.253:_gv0/9daf8c8f-53dc-45fe-8833-4500599d75f6/images/48eb2429-e20e-41be-bdd1-05d79ab8c959/3e032791-8181-43c4-8f71-d09452bc4243', '-O', 'raw', u'/rhev/data-center/mnt/blockSD/873b7bc2-1c92-4416-86ff-556b95ef614e/images/48eb2429-e20e-41be-bdd1-05d79ab8c959/3e032791-8181-43c4-8f71-d09452bc4243'], ecode=-6, stdout=, stderr=qemu-img: error while reading sector 0: Transport endpoint is not connected


2018-04-24 05:10:12,456-0700 ERROR (check/loop) [storage.Monitor] Error checking path /rhev/data-center/mnt/glusterSD/172.16.49.253:_gv0/9daf8c8f-53dc-45fe-8833-4500599d75f6/dom_md/metadata (monitor:500)
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/monitor.py", line 498, in _pathChecked
    delay = result.delay()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/check.py", line 365, in delay
    raise exception.MiscFileReadException(self.path, self.rc, self.err)
MiscFileReadException: Internal file read failure: (u'/rhev/data-center/mnt/glusterSD/172.16.49.253:_gv0/9daf8c8f-53dc-45fe-8833-4500599d75f6/dom_md/metadata', 1, bytearray(b"/usr/bin/dd: failed to open \'/rhev/data-center/mnt/glusterSD/172.16.49.253:_gv0/9daf8c8f-53dc-45fe-8833-4500599d75f6/dom_md/metadata\': Transport endpoint is not connected\n"))
~~~
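
For reference, the ecode=-6 in the QImgError above is consistent with qemu-img being terminated by a signal rather than exiting on its own, which matches the SIGABRT that abrt records for qemu-img in the messages log. The sketch below assumes the exit code follows the Python subprocess convention, where a negative value -N means "killed by signal N". The helper name describe_exit_code is hypothetical and is not part of VDSM.

```python
import signal


def describe_exit_code(ecode):
    """Describe an exit code, assuming the Python subprocess convention:
    a negative value -N means the child was terminated by signal N."""
    if ecode < 0:
        return "killed by " + signal.Signals(-ecode).name
    return "exited with status %d" % ecode


# ecode reported by VDSM for qemu-img in the traceback above
print(describe_exit_code(-6))   # killed by SIGABRT (on Linux)
# a process killed by signal 11, the signal the crashing gluster process received
print(describe_exit_code(-11))  # killed by SIGSEGV (on Linux)
```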

From the messages logs:

~~~
Apr 24 05:10:11 2A-J01-F37-NODE7 rhev-data-center-mnt-glusterSD-172.16.49.253:_gv0[4386]: pending frames:
Apr 24 05:10:11 2A-J01-F37-NODE7 rhev-data-center-mnt-glusterSD-172.16.49.253:_gv0[4386]: frame : type(1) op(SEEK)
Apr 24 05:10:11 2A-J01-F37-NODE7 rhev-data-center-mnt-glusterSD-172.16.49.253:_gv0[4386]: frame : type(1) op(SEEK)
Apr 24 05:10:11 2A-J01-F37-NODE7 rhev-data-center-mnt-glusterSD-172.16.49.253:_gv0[4386]: frame : type(1) op(READ)
Apr 24 05:10:11 2A-J01-F37-NODE7 rhev-data-center-mnt-glusterSD-172.16.49.253:_gv0[4386]: frame : type(1) op(READ)
Apr 24 05:10:11 2A-J01-F37-NODE7 rhev-data-center-mnt-glusterSD-172.16.49.253:_gv0[4386]: frame : type(1) op(OPENDIR)
Apr 24 05:10:11 2A-J01-F37-NODE7 rhev-data-center-mnt-glusterSD-172.16.49.253:_gv0[4386]: frame : type(0) op(0)
Apr 24 05:10:11 2A-J01-F37-NODE7 rhev-data-center-mnt-glusterSD-172.16.49.253:_gv0[4386]: patchset: git://git.gluster.com/glusterfs.git
Apr 24 05:10:11 2A-J01-F37-NODE7 rhev-data-center-mnt-glusterSD-172.16.49.253:_gv0[4386]: signal received: 11
Apr 24 05:10:11 2A-J01-F37-NODE7 rhev-data-center-mnt-glusterSD-172.16.49.253:_gv0[4386]: time of crash:
Apr 24 05:10:11 2A-J01-F37-NODE7 rhev-data-center-mnt-glusterSD-172.16.49.253:_gv0[4386]: 2018-04-24 12:10:11
Apr 24 05:10:11 2A-J01-F37-NODE7 rhev-data-center-mnt-glusterSD-172.16.49.253:_gv0[4386]: configuration details:
Apr 24 05:10:11 2A-J01-F37-NODE7 rhev-data-center-mnt-glusterSD-172.16.49.253:_gv0[4386]: argp 1
Apr 24 05:10:11 2A-J01-F37-NODE7 rhev-data-center-mnt-glusterSD-172.16.49.253:_gv0[4386]: backtrace 1
Apr 24 05:10:11 2A-J01-F37-NODE7 rhev-data-center-mnt-glusterSD-172.16.49.253:_gv0[4386]: dlfcn 1
Apr 24 05:10:11 2A-J01-F37-NODE7 rhev-data-center-mnt-glusterSD-172.16.49.253:_gv0[4386]: libpthread 1
Apr 24 05:10:11 2A-J01-F37-NODE7 rhev-data-center-mnt-glusterSD-172.16.49.253:_gv0[4386]: llistxattr 1
Apr 24 05:10:11 2A-J01-F37-NODE7 rhev-data-center-mnt-glusterSD-172.16.49.253:_gv0[4386]: setfsid 1
Apr 24 05:10:11 2A-J01-F37-NODE7 rhev-data-center-mnt-glusterSD-172.16.49.253:_gv0[4386]: spinlock 1
Apr 24 05:10:11 2A-J01-F37-NODE7 rhev-data-center-mnt-glusterSD-172.16.49.253:_gv0[4386]: epoll.h 1
Apr 24 05:10:11 2A-J01-F37-NODE7 rhev-data-center-mnt-glusterSD-172.16.49.253:_gv0[4386]: xattr.h 1
Apr 24 05:10:11 2A-J01-F37-NODE7 rhev-data-center-mnt-glusterSD-172.16.49.253:_gv0[4386]: st_atim.tv_nsec 1
Apr 24 05:10:11 2A-J01-F37-NODE7 rhev-data-center-mnt-glusterSD-172.16.49.253:_gv0[4386]: package-string: glusterfs 3.8.4
Apr 24 05:10:11 2A-J01-F37-NODE7 rhev-data-center-mnt-glusterSD-172.16.49.253:_gv0[4386]: ---------
Apr 24 05:10:11 2A-J01-F37-NODE7 abrt-hook-ccpp: Process 4386 (glusterfsd) of user 0 killed by SIGSEGV - dumping core
Apr 24 05:10:11 2A-J01-F37-NODE7 abrt-hook-ccpp: Process 2638 (qemu-img) of user 36 killed by SIGABRT - dumping core
Apr 24 05:10:11 2A-J01-F37-NODE7 libvirtd: 2018-04-24 12:10:11.539+0000: 3492: error : qemuOpenFileAs:3234 : Failed to open file '/rhev/data-center/0c6eb4e7-aae9-42d2-96a0-f13c33de1737/9daf8c8f-53dc-45fe-8833-4500599d75f6/images/48eb2429-e20e-41be-bdd1-05d79ab8c959/1aa4ea05-1e90-44a6-91b8-0ee5c4c68968': Transport endpoint is not connected
~~~
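
The crash banner in the messages log lists the operations that were pending when the signal arrived. As a rough triage aid, the sketch below counts the pending frame operations and extracts the signal number from such lines. The function name summarize_crash and the regular expressions are assumptions based only on the line format shown in this report, not on any Gluster tooling.

```python
import re
from collections import Counter

# Patterns assumed from the crash banner lines quoted in this report.
FRAME_RE = re.compile(r"frame : type\((\d+)\) op\(([^)]+)\)")
SIGNAL_RE = re.compile(r"signal received: (\d+)")


def summarize_crash(lines):
    """Return (Counter of pending frame ops, signal number or None)."""
    ops = Counter()
    sig = None
    for line in lines:
        frame = FRAME_RE.search(line)
        if frame:
            ops[frame.group(2)] += 1
            continue
        received = SIGNAL_RE.search(line)
        if received:
            sig = int(received.group(1))
    return ops, sig


# Sample lines copied from the messages log excerpt in this report.
prefix = "Apr 24 05:10:11 2A-J01-F37-NODE7 rhev-data-center-mnt-glusterSD-172.16.49.253:_gv0[4386]: "
sample = [
    prefix + "pending frames:",
    prefix + "frame : type(1) op(SEEK)",
    prefix + "frame : type(1) op(SEEK)",
    prefix + "frame : type(1) op(READ)",
    prefix + "frame : type(1) op(READ)",
    prefix + "frame : type(1) op(OPENDIR)",
    prefix + "frame : type(0) op(0)",
    prefix + "signal received: 11",
]

ops, sig = summarize_crash(sample)
print(dict(ops), sig)
```

Run against the excerpt, this reports two SEEK, two READ, and one OPENDIR frame pending at the time signal 11 was received.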


Version-Release number of selected component (if applicable):

vdsm-4.19.50-1.el7ev.x86_64                                 Tue Apr 10 07:52:52 2018
glusterfs-3.8.4-54.4.el7rhgs.x86_64                         Tue Apr 10 07:50:00 2018
libvirt-daemon-3.9.0-14.el7_5.2.x86_64                      Tue Apr 10 07:51:38 2018
sanlock-3.6.0-1.el7.x86_64                                  Tue Apr 10 07:50:04 2018
qemu-kvm-rhev-2.10.0-21.el7_5.1.x86_64                      Tue Apr 10 07:52:47 2018


How reproducible:

Any time the customer tries to live migrate a VM disk from its current Gluster storage domain to another storage domain (iSCSI), glusterfsd dumps core. The VM enters a paused state and requires a restart.


Actual results:

Hung VM, core dump

Expected results:

VM disk migrated to new storage domain.


Additional info:

1. The stack trace has been matched to Bugzilla 1502812:

Bug 1502812 - [GSS] Client segfaults when grepping $UUID.meta files on EC vol
BZ https://bugzilla.redhat.com/show_bug.cgi?id=1502812

However, the process that crashes in that bug is different from the one that crashes here.


2. We note that the customer is running dispersed volumes, which are not supported for virtualization. We will ask the customer to migrate to a form of three-way replicated volumes.

Reference doc: https://access.redhat.com/documentation/en-us/red_hat_gluster_storage/3.3/html-single/configuring_red_hat_virtualization_with_red_hat_gluster_storage/


3. The customer notes that the VMs whose disks fail to transfer are all running Windows Server 2012.

Comment 13 Bipin Kunal 2018-05-24 07:19:44 UTC

Vivek/Ashish,

    I see that this has been accepted for the 3.4 release. I just wanted to understand how we plan to test it. Do we test RHV with disperse volumes?

-Bipin

Comment 22 Bipin Kunal 2018-06-01 10:08:34 UTC

Raising needinfo on PM for hotfix approval

Comment 34 Bipin Kunal 2018-06-13 10:59:05 UTC

Apart from that, I see that the package version used is "3.8.4-54.8.1.gitf96d95b.el7rhgs.1.HOTFIX.Case02071230.BZ1579981". I am not sure why it contains the substring "gitf96d95b". Are we using the downstream dist-git repo to build the RPMs?

I checked the dist-git branches and could not find a branch corresponding to this build.

When providing the build link, please also add information such as: 1) the base version used for hotfix creation, 2) the patches used, and 3) the dist-git branch link.

Comment 64 errata-xmlrpc 2018-09-04 06:48:05 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2607