This bug looks similar to https://bugzilla.redhat.com/show_bug.cgi?id=1511010, but in addition to the systemd-udev call traces, we also see other call traces in the dmesg logs.

Multipath devices are not cleaned up, and call traces are observed on node reboot (initiator and target on the same OCP node).

Description of problem:
+++++++++++++++++++++++
We are seeing call traces after rebooting a node which was both the initiator and the target node for 3 app pods.

Steps Performed
---------------
1. Node X was the initiator and target node for 3 app pods. We rebooted the node and observed that login to itself did not take place for 2 pods (BZ#1597726). There were 7 logins.

2. While testing BZ#1597726, we rebooted the node 10.70.47.20 again at 'Thu Oct 11 14:22:41 IST 2018':

# date && reboot
Thu Oct 11 14:22:41 IST 2018

3. The node took a long time to reboot, so while it was coming up, oc moved the 3 pods to a different node. In effect, the mpath devices were unmounted from this node.

4. The glusterfs pod took a long time to come into Ready state.
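The rescheduling in step 3 can be confirmed from the NODE column of `oc get pods -o wide`. As a minimal sketch under assumptions: pods_on_node is a hypothetical helper (not part of oc), and the default 7-column output layout of oc on OCP 3.x is assumed.

```shell
# Hypothetical helper, not part of oc: given `oc get pods -o wide` output on
# stdin, print the names of pods scheduled on the node passed as $1.
# Assumes the default 7-column layout (NAME READY STATUS RESTARTS AGE IP NODE).
pods_on_node() {
    awk -v node="$1" 'NR > 1 && $7 == node { print $1 }'
}
```

For example, `oc get pods -o wide | pods_on_node dhcp47-20.lab.eng.blr.redhat.com` should print nothing once all three app pods have been moved away from the rebooted node.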
On the node, a lot of "Call Traces" with kernel hung tasks were seen on the console as well as in the dmesg logs.

dmesg start time after reboot - Thu Oct 11 14:26:58 2018
Call traces are copied in the next comment.

# pvscan
  /dev/mapper/mpatha: read failed after 0 of 4096 at 0: Input/output error
  /dev/mapper/mpatha: read failed after 0 of 4096 at 5368643584: Input/output error
  /dev/mapper/mpatha: read failed after 0 of 4096 at 5368700928: Input/output error
  /dev/mapper/mpatha: read failed after 0 of 4096 at 4096: Input/output error
  /dev/mapper/mpathb: read failed after 0 of 4096 at 0: Input/output error
  /dev/mapper/mpathb: read failed after 0 of 4096 at 3221159936: Input/output error
  /dev/mapper/mpathb: read failed after 0 of 4096 at 3221217280: Input/output error
  /dev/mapper/mpathb: read failed after 0 of 4096 at 4096: Input/output error
  PV /dev/sdd1   VG docker-vg                            lvm2 [<50.00 GiB / 30.00 GiB free]
  PV /dev/sdc    VG vg_d0cba9e35aeb34920a39851bdf74bbb6  lvm2 [1023.87 GiB / <619.27 GiB free]
  PV /dev/sda2   VG rhel_dhcp47-20                       lvm2 [95.00 GiB / 4.00 MiB free]
  Total: 3 [1.14 TiB] / in use: 3 [1.14 TiB] / in no VG: 0 [0   ]

# ls -l /dev | grep dm-222
brw-rw----. 1 root disk 253, 222 Oct 11 15:31 dm-222

[root@dhcp47-20 new-bz]# ls -l /dev | grep dm-223
brw-rw----. 1 root disk 253, 223 Oct 11 15:31 dm-223

[root@dhcp47-20 new-bz]#
mpatha 253:222 0 5G 0 mpath
mpathb 253:223 0 3G 0 mpath

Version-Release number of selected component (if applicable):
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
# uname -a
Linux dhcp47-20.lab.eng.blr.redhat.com 3.10.0-862.11.6.el7.x86_64 #1 SMP Fri Aug 10 16:55:11 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

# oc rsh glusterfs-storage-995g4 rpm -qa | grep gluster
glusterfs-libs-3.12.2-18.1.el7rhgs.x86_64
glusterfs-3.12.2-18.1.el7rhgs.x86_64
glusterfs-api-3.12.2-18.1.el7rhgs.x86_64
python2-gluster-3.12.2-18.1.el7rhgs.x86_64
glusterfs-fuse-3.12.2-18.1.el7rhgs.x86_64
glusterfs-server-3.12.2-18.1.el7rhgs.x86_64
gluster-block-0.2.1-27.el7rhgs.x86_64
glusterfs-client-xlators-3.12.2-18.1.el7rhgs.x86_64
glusterfs-cli-3.12.2-18.1.el7rhgs.x86_64
glusterfs-geo-replication-3.12.2-18.1.el7rhgs.x86_64

# oc rsh heketi-storage-1-n2xr2 rpm -qa | grep heketi
heketi-client-7.0.0-14.el7rhgs.x86_64
heketi-7.0.0-14.el7rhgs.x86_64

How reproducible:
+++++++++++++++++
Similar call traces are seen whenever we reboot a node which is both an initiator and a target. We used to see issues similar to BZ#1511010, but in this case a number of call traces are observed on node reboot.

Steps to Reproduce:
+++++++++++++++++++
1. With a node running as both initiator and target, reboot the node.
2. If the node takes a considerable time to boot up, the oc scheduler moves the pods to another node.
3. Check the dmesg logs and multipath -ll once the node comes up. We see uncleaned multipath entries and call traces, along with kernel hung tasks.

Actual results:
+++++++++++++++
The multipath devices are not cleaned up even though the pods have been moved to another node. We also keep seeing I/O errors on the uncleaned devices.

Expected results:
+++++++++++++++++
Since the pods were moved, multipath -ll output should be clean. It is also unclear whether the observed call traces can lead to any system issue.
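A rough way to spot the leftover maps described above is to look for maps in multipath -ll output that have no healthy path left. This is only a sketch under assumptions: list_stale_maps is a hypothetical helper (not part of multipath-tools), and the parsing assumes the usual -ll layout, where each map starts a new block at column 0 as "name (wwid) dm-N ..." and healthy paths are reported as "active ready".

```shell
# Hypothetical helper, not part of multipath-tools: read `multipath -ll`
# output on stdin and print the names of maps with no "active ready" path.
# Assumes each map block starts at column 0 as "name (wwid) dm-N ...".
list_stale_maps() {
    awk '
        /^[A-Za-z0-9_-]+ \(/ {             # start of a new map block
            if (name != "" && !active) print name
            name = $1; active = 0
        }
        / active ready / { active = 1 }    # at least one healthy path seen
        END { if (name != "" && !active) print name }
    '
}
```

On the node from this report, multipath -ll | list_stale_maps would be expected to list mpatha and mpathb; once nothing mounts them, such maps can be flushed with multipath -f <name>, or multipath -F to flush all unused maps.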
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:2987