Description of problem:

The customer has RHV-M 4.1.9.2 with Gluster and iSCSI storage domains. When the customer tries to live migrate a VM disk from the current Gluster storage domain to another storage domain (iSCSI), glusterfsd core dumps.

Core files:
~~~
https://access.redhat.com/rs/cases/02071230/attachments/cf3a653a-5db7-4e26-9a96-ecfd5cad934c
~~~

From the VDSM logs, as soon as the copy image operation starts:
~~~
2018-04-24 05:10:11,405-0700 ERROR (tasks/1) [storage.Image] Copy image error: image=48eb2429-e20e-41be-bdd1-05d79ab8c959, src domain=9daf8c8f-53dc-45fe-8833-4500599d75f6, dst domain=873b7bc2-1c92-4416-86ff-556b95ef614e (image:535)
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/image.py", line 526, in _interImagesCopy
    self._wait_for_qemuimg_operation(operation)
  File "/usr/share/vdsm/storage/image.py", line 141, in _wait_for_qemuimg_operation
    operation.wait_for_completion()
  File "/usr/lib/python2.7/site-packages/vdsm/qemuimg.py", line 339, in wait_for_completion
    self.poll(timeout)
  File "/usr/lib/python2.7/site-packages/vdsm/qemuimg.py", line 334, in poll
    self.error)
QImgError: cmd=['/usr/bin/taskset', '--cpu-list', '0-7', '/usr/bin/nice', '-n', '19', '/usr/bin/ionice', '-c', '3', '/usr/bin/qemu-img', 'convert', '-p', '-t', 'none', '-T', 'none', '-f', 'raw', u'/rhev/data-center/mnt/glusterSD/172.16.49.253:_gv0/9daf8c8f-53dc-45fe-8833-4500599d75f6/images/48eb2429-e20e-41be-bdd1-05d79ab8c959/3e032791-8181-43c4-8f71-d09452bc4243', '-O', 'raw', u'/rhev/data-center/mnt/blockSD/873b7bc2-1c92-4416-86ff-556b95ef614e/images/48eb2429-e20e-41be-bdd1-05d79ab8c959/3e032791-8181-43c4-8f71-d09452bc4243'], ecode=-6, stdout=, stderr=qemu-img: error while reading sector 0: Transport endpoint is not connected

2018-04-24 05:10:12,456-0700 ERROR (check/loop) [storage.Monitor] Error checking path /rhev/data-center/mnt/glusterSD/172.16.49.253:_gv0/9daf8c8f-53dc-45fe-8833-4500599d75f6/dom_md/metadata (monitor:500)
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/monitor.py", line 498, in _pathChecked
    delay = result.delay()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/check.py", line 365, in delay
    raise exception.MiscFileReadException(self.path, self.rc, self.err)
MiscFileReadException: Internal file read failure: (u'/rhev/data-center/mnt/glusterSD/172.16.49.253:_gv0/9daf8c8f-53dc-45fe-8833-4500599d75f6/dom_md/metadata', 1, bytearray(b"/usr/bin/dd: failed to open \'/rhev/data-center/mnt/glusterSD/172.16.49.253:_gv0/9daf8c8f-53dc-45fe-8833-4500599d75f6/dom_md/metadata\': Transport endpoint is not connected\n"))
~~~
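For context on the pending type(1) op(SEEK) frames in the messages log below: qemu-img convert probes raw source images for sparse regions with lseek(SEEK_DATA)/lseek(SEEK_HOLE), and on a FUSE mount those become SEEK operations handled by the gluster client. A minimal sketch of that access pattern, assuming a Python 3 host and an illustrative image path (not taken from the case), would be:

~~~
#!/usr/bin/env python3
# Hypothetical reproducer sketch: walk a file's allocated regions the
# way qemu-img convert does when detecting sparseness. Each
# SEEK_DATA/SEEK_HOLE call is served by the FUSE SEEK op seen pending
# in the glusterfsd crash. The path below is illustrative only.
import errno
import os

PATH = "/rhev/data-center/mnt/glusterSD/<server>:_gv0/<sd>/images/<img>/<vol>"

fd = os.open(PATH, os.O_RDONLY)
try:
    size = os.fstat(fd).st_size
    offset = 0
    while offset < size:
        try:
            # Jump to the next region that contains data...
            data = os.lseek(fd, offset, os.SEEK_DATA)
        except OSError as e:
            if e.errno == errno.ENXIO:  # only holes remain past offset
                break
            raise
        # ...then find where that data region ends (the next hole).
        hole = os.lseek(fd, data, os.SEEK_HOLE)
        print("data region: %d-%d" % (data, hole))
        offset = hole
finally:
    os.close(fd)
~~~

If the client's SEEK handling is the trigger, a loop like this against a file on the affected mount should make the mount process segfault and reproduce the "Transport endpoint is not connected" errors.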
From the messages logs:
~~~
Apr 24 05:10:11 2A-J01-F37-NODE7 rhev-data-center-mnt-glusterSD-172.16.49.253:_gv0[4386]: pending frames:
Apr 24 05:10:11 2A-J01-F37-NODE7 rhev-data-center-mnt-glusterSD-172.16.49.253:_gv0[4386]: frame : type(1) op(SEEK)
Apr 24 05:10:11 2A-J01-F37-NODE7 rhev-data-center-mnt-glusterSD-172.16.49.253:_gv0[4386]: frame : type(1) op(SEEK)
Apr 24 05:10:11 2A-J01-F37-NODE7 rhev-data-center-mnt-glusterSD-172.16.49.253:_gv0[4386]: frame : type(1) op(READ)
Apr 24 05:10:11 2A-J01-F37-NODE7 rhev-data-center-mnt-glusterSD-172.16.49.253:_gv0[4386]: frame : type(1) op(READ)
Apr 24 05:10:11 2A-J01-F37-NODE7 rhev-data-center-mnt-glusterSD-172.16.49.253:_gv0[4386]: frame : type(1) op(OPENDIR)
Apr 24 05:10:11 2A-J01-F37-NODE7 rhev-data-center-mnt-glusterSD-172.16.49.253:_gv0[4386]: frame : type(0) op(0)
Apr 24 05:10:11 2A-J01-F37-NODE7 rhev-data-center-mnt-glusterSD-172.16.49.253:_gv0[4386]: patchset: git://git.gluster.com/glusterfs.git
Apr 24 05:10:11 2A-J01-F37-NODE7 rhev-data-center-mnt-glusterSD-172.16.49.253:_gv0[4386]: signal received: 11
Apr 24 05:10:11 2A-J01-F37-NODE7 rhev-data-center-mnt-glusterSD-172.16.49.253:_gv0[4386]: time of crash:
Apr 24 05:10:11 2A-J01-F37-NODE7 rhev-data-center-mnt-glusterSD-172.16.49.253:_gv0[4386]: 2018-04-24 12:10:11
Apr 24 05:10:11 2A-J01-F37-NODE7 rhev-data-center-mnt-glusterSD-172.16.49.253:_gv0[4386]: configuration details:
Apr 24 05:10:11 2A-J01-F37-NODE7 rhev-data-center-mnt-glusterSD-172.16.49.253:_gv0[4386]: argp 1
Apr 24 05:10:11 2A-J01-F37-NODE7 rhev-data-center-mnt-glusterSD-172.16.49.253:_gv0[4386]: backtrace 1
Apr 24 05:10:11 2A-J01-F37-NODE7 rhev-data-center-mnt-glusterSD-172.16.49.253:_gv0[4386]: dlfcn 1
Apr 24 05:10:11 2A-J01-F37-NODE7 rhev-data-center-mnt-glusterSD-172.16.49.253:_gv0[4386]: libpthread 1
Apr 24 05:10:11 2A-J01-F37-NODE7 rhev-data-center-mnt-glusterSD-172.16.49.253:_gv0[4386]: llistxattr 1
Apr 24 05:10:11 2A-J01-F37-NODE7 rhev-data-center-mnt-glusterSD-172.16.49.253:_gv0[4386]: setfsid 1
Apr 24 05:10:11 2A-J01-F37-NODE7 rhev-data-center-mnt-glusterSD-172.16.49.253:_gv0[4386]: spinlock 1
Apr 24 05:10:11 2A-J01-F37-NODE7 rhev-data-center-mnt-glusterSD-172.16.49.253:_gv0[4386]: epoll.h 1
Apr 24 05:10:11 2A-J01-F37-NODE7 rhev-data-center-mnt-glusterSD-172.16.49.253:_gv0[4386]: xattr.h 1
Apr 24 05:10:11 2A-J01-F37-NODE7 rhev-data-center-mnt-glusterSD-172.16.49.253:_gv0[4386]: st_atim.tv_nsec 1
Apr 24 05:10:11 2A-J01-F37-NODE7 rhev-data-center-mnt-glusterSD-172.16.49.253:_gv0[4386]: package-string: glusterfs 3.8.4
Apr 24 05:10:11 2A-J01-F37-NODE7 rhev-data-center-mnt-glusterSD-172.16.49.253:_gv0[4386]: ---------
Apr 24 05:10:11 2A-J01-F37-NODE7 abrt-hook-ccpp: Process 4386 (glusterfsd) of user 0 killed by SIGSEGV - dumping core
Apr 24 05:10:11 2A-J01-F37-NODE7 abrt-hook-ccpp: Process 2638 (qemu-img) of user 36 killed by SIGABRT - dumping core
Apr 24 05:10:11 2A-J01-F37-NODE7 libvirtd: 2018-04-24 12:10:11.539+0000: 3492: error : qemuOpenFileAs:3234 : Failed to open file '/rhev/data-center/0c6eb4e7-aae9-42d2-96a0-f13c33de1737/9daf8c8f-53dc-45fe-8833-4500599d75f6/images/48eb2429-e20e-41be-bdd1-05d79ab8c959/1aa4ea05-1e90-44a6-91b8-0ee5c4c68968': Transport endpoint is not connected
~~~

Version-Release number of selected component (if applicable):

vdsm-4.19.50-1.el7ev.x86_64               Tue Apr 10 07:52:52 2018
glusterfs-3.8.4-54.4.el7rhgs.x86_64       Tue Apr 10 07:50:00 2018
libvirt-daemon-3.9.0-14.el7_5.2.x86_64    Tue Apr 10 07:51:38 2018
sanlock-3.6.0-1.el7.x86_64                Tue Apr 10 07:50:04 2018
qemu-kvm-rhev-2.10.0-21.el7_5.1.x86_64    Tue Apr 10 07:52:47 2018

How reproducible:

Every time the customer tries to live migrate a VM disk from the current Gluster storage domain to another storage domain (iSCSI), glusterfsd core dumps. The VM enters a paused state and requires a restart.

Actual results:

Hung VM, core dump.

Expected results:

VM disk migrated to the new storage domain.

Additional info:

1. The stack trace has been matched to Bug 1502812 - [GSS] Client segfaults when grepping $UUID.meta files on EC vol (https://bugzilla.redhat.com/show_bug.cgi?id=1502812). However, the process that crashes there is different from the one here.

2. We note that the customer is running dispersed volumes, which are not supported for Virtualization. We will ask the customer to migrate to a form of three-way replicated volume; a quick check for affected volumes is sketched after this list. Relevant doc: https://access.redhat.com/documentation/en-us/red_hat_gluster_storage/3.3/html-single/configuring_red_hat_virtualization_with_red_hat_gluster_storage/
3. The customer notes that the VMs whose disks fail to transfer are all Windows Server 2012.
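To confirm which volumes are affected by the unsupported layout noted in item 2, a sketch along these lines (assuming the gluster CLI is available on a storage node) can flag disperse-type volumes by parsing the "Type:" field of standard `gluster volume info` output:

~~~
#!/usr/bin/env python3
# Hypothetical pre-check: flag Gluster volumes whose type is any
# disperse variant, since dispersed volumes are not supported as
# RHV storage domains.
import subprocess

def unsupported_volumes():
    out = subprocess.check_output(
        ["gluster", "volume", "info"], universal_newlines=True)
    name, flagged = None, []
    for line in out.splitlines():
        if line.startswith("Volume Name:"):
            name = line.split(":", 1)[1].strip()
        elif line.startswith("Type:") and "Disperse" in line:
            flagged.append((name, line.split(":", 1)[1].strip()))
    return flagged

if __name__ == "__main__":
    for vol, vtype in unsupported_volumes():
        print("volume %s has a type unsupported for RHV: %s" % (vol, vtype))
~~~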
Vivek/Ashish, I see that this has been accepted for the 3.4 release. I just wanted to understand how we plan to test it. Do we test RHV with a disperse volume? -Bipin
Raising needinfo on PM for hotfix approval
Apart from that, I see that the package version used is "3.8.4-54.8.1.gitf96d95b.el7rhgs.1.HOTFIX.Case02071230.BZ1579981". I am not sure why it contains the sub-string "gitf96d95b" — are we using the downstream dist-git repo to build the RPMs? I checked dist-git and was not able to find a branch corresponding to this build. While providing the build link, please also add information such as: 1) the base version used for hotfix creation, 2) the patches used, 3) a link to the dist-git branch.
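For verifying what is actually installed on the customer hosts, a small sketch like the following could print each client package's version-release and whether it carries the hotfix tag. The package names and the expected HOTFIX substring are assumptions based on the version string quoted above; it uses the standard rpm query format:

~~~
#!/usr/bin/env python3
# Hypothetical check: report installed glusterfs client package builds
# and flag whether the release string carries the case's HOTFIX tag.
import subprocess

for pkg in ("glusterfs", "glusterfs-fuse", "glusterfs-client-xlators"):
    try:
        vr = subprocess.check_output(
            ["rpm", "-q", "--qf", "%{VERSION}-%{RELEASE}\n", pkg],
            universal_newlines=True).strip()
    except subprocess.CalledProcessError:
        print("%s: not installed" % pkg)
        continue
    status = "hotfix" if "HOTFIX.Case02071230" in vr else "stock"
    print("%s: %s (%s)" % (pkg, vr, status))
~~~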
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2018:2607