Bug 1638352 - Multipath devices not cleaned up and call traces observed on node reboot (initiator and target on same OCP node)
Summary: Multipath devices not cleaned up and call traces observed on node reboot (initiator and target on same OCP node)
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: gluster-block
Version: ocs-3.11
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: OCS 3.11
Assignee: Prasanna Kumar Kalever
QA Contact: Neha Berry
URL:
Whiteboard:
Depends On:
Blocks: 1629581
 
Reported: 2018-10-11 11:23 UTC by Neha Berry
Modified: 2019-02-08 06:23 UTC
CC: 12 users

Fixed In Version: gluster-block-0.2.1-28.el7rhgs
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-10-24 04:52:58 UTC
Embargoed:


Attachments


Links
System ID: Red Hat Product Errata RHEA-2018:2987
Private: 0
Priority: None
Status: None
Summary: None
Last Updated: 2018-10-24 04:54:38 UTC

Description Neha Berry 2018-10-11 11:23:14 UTC
This bug looks similar to https://bugzilla.redhat.com/show_bug.cgi?id=1511010, but in addition to the systemd-udev call traces, we also see other call traces in the dmesg logs.

Multipath devices not cleaned up and call traces observed on node reboot (initiator and target on same OCP node)

Description of problem:
+++++++++++++++++++++++
We are seeing call traces after rebooting a node which was the initiator as well as the target node for 3 app pods.

Steps Performed
--------------
1. Node X was the initiator and target node for 3 app pods. We rebooted the node and observed that login to itself did not take place for 2 pods (BZ#1597726). There were 7 logins.
2. While testing BZ#1597726, we again rebooted the node 10.70.47.20 at 'Thu Oct 11 14:22:41 IST 2018'.

# date && reboot 
Thu Oct 11 14:22:41 IST 2018

3. The node took a long time to reboot, so while it was coming up, the OpenShift scheduler moved the 3 pods to a different node. In effect, the mpath devices were unmounted from this node.

4. The glusterfs pod took a long time to come into Ready state. On the node, a lot of call traces with hung kernel tasks were seen on the console as well as in the dmesg logs.
dmesg start time after reboot: Thu Oct 11 14:26:58 2018

Call traces are copied in the next comment; an illustrative way to pull them from dmesg is sketched below.
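(Illustrative, not part of the original report: one way to pull these traces out of the logs on the rebooted node. The grep context width is a guess; adjust as needed.)

# dmesg -T | grep -B2 -A25 "blocked for more than"

The kernel reports hung tasks as "INFO: task ... blocked for more than 120 seconds." followed by the call trace, so the trailing context (-A25) keeps the stack that belongs to each report.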



# pvscan
  /dev/mapper/mpatha: read failed after 0 of 4096 at 0: Input/output error
  /dev/mapper/mpatha: read failed after 0 of 4096 at 5368643584: Input/output error
  /dev/mapper/mpatha: read failed after 0 of 4096 at 5368700928: Input/output error
  /dev/mapper/mpatha: read failed after 0 of 4096 at 4096: Input/output error
  /dev/mapper/mpathb: read failed after 0 of 4096 at 0: Input/output error
  /dev/mapper/mpathb: read failed after 0 of 4096 at 3221159936: Input/output error
  /dev/mapper/mpathb: read failed after 0 of 4096 at 3221217280: Input/output error
  /dev/mapper/mpathb: read failed after 0 of 4096 at 4096: Input/output error
  PV /dev/sdd1   VG docker-vg                             lvm2 [<50.00 GiB / 30.00 GiB free]
  PV /dev/sdc    VG vg_d0cba9e35aeb34920a39851bdf74bbb6   lvm2 [1023.87 GiB / <619.27 GiB free]
  PV /dev/sda2   VG rhel_dhcp47-20                        lvm2 [95.00 GiB / 4.00 MiB free]
  Total: 3 [1.14 TiB] / in use: 3 [1.14 TiB] / in no VG: 0 [0   ]

# ls -l /dev|grep dm-222
brw-rw----. 1 root disk    253, 222 Oct 11 15:31 dm-222
[root@dhcp47-20 new-bz]# ls -l /dev|grep dm-223
brw-rw----. 1 root disk    253, 223 Oct 11 15:31 dm-223
[root@dhcp47-20 new-bz]# 


mpatha                                                                                         253:222  0    5G  0 mpath 
mpathb                                                                                         253:223  0    3G  0 mpath 
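(Workaround sketch, not part of the original report: once the app pods have fully moved off the node, the stale maps can in principle be flushed by hand. The map names mpatha/mpathb are taken from the output above.)

# multipath -f mpatha
# multipath -f mpathb

or, to flush all unused multipath maps at once:

# multipath -F

multipath -f removes one named map and -F removes every unused map; whether this is safe here depends on nothing on the node still holding the devices open.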




Version-Release number of selected component (if applicable):
+++++++++++++++++++++++++++++

# uname -a
Linux dhcp47-20.lab.eng.blr.redhat.com 3.10.0-862.11.6.el7.x86_64 #1 SMP Fri Aug 10 16:55:11 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

# oc rsh glusterfs-storage-995g4 rpm -qa|grep gluster
glusterfs-libs-3.12.2-18.1.el7rhgs.x86_64
glusterfs-3.12.2-18.1.el7rhgs.x86_64
glusterfs-api-3.12.2-18.1.el7rhgs.x86_64
python2-gluster-3.12.2-18.1.el7rhgs.x86_64
glusterfs-fuse-3.12.2-18.1.el7rhgs.x86_64
glusterfs-server-3.12.2-18.1.el7rhgs.x86_64
gluster-block-0.2.1-27.el7rhgs.x86_64
glusterfs-client-xlators-3.12.2-18.1.el7rhgs.x86_64
glusterfs-cli-3.12.2-18.1.el7rhgs.x86_64
glusterfs-geo-replication-3.12.2-18.1.el7rhgs.x86_64


# oc rsh heketi-storage-1-n2xr2 rpm -qa|grep heketi
heketi-client-7.0.0-14.el7rhgs.x86_64
heketi-7.0.0-14.el7rhgs.x86_64



How reproducible:
+++++++++++++++++
Similar call traces are seen when we reboot a node which is both an initiator and a target. We used to see issues similar to BZ#1511010, but in this case a number of call traces are observed on node reboot.

Steps to Reproduce:
+++++++++++
1. With a node running as initiator and target, reboot the node.
2. If the node takes a considerable time to boot up, the OpenShift scheduler moves the pods to another node.
3. Check the dmesg logs and multipath -ll once the node comes up (illustrative check commands follow below). We see stale multipath entries and call traces, along with kernel hung tasks.
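(Illustrative check commands for step 3, assuming the same setup as in the description:)

# multipath -ll
# lsblk | grep mpath
# dmesg -T | grep -B2 -A25 "blocked for more than"

The first two commands show the stale mpath maps that no pod is using any more; the last one lists the kernel hung-task call traces.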


Actual results:
++++++++++
The multipath devices are not cleaned up even though the pods have now been moved to another node. We also keep seeing I/O errors on the stale devices (an illustrative read check follows below).
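(Illustrative, using the device names from this report: the errors can be reproduced on demand with a direct read of a stale map.)

# dd if=/dev/mapper/mpatha of=/dev/null bs=4096 count=1 iflag=direct

On the stale maps this read fails with the same Input/output error that pvscan reports above.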

Expected results:
+++++++++++++
Since the pods were moved, multipath -ll should be clean. It is also unclear whether the observed call traces can lead to any wider system issue.

Comment 23 errata-xmlrpc 2018-10-24 04:52:58 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:2987

