Bug 504133 - /sbin/multipath threads are waiting forever
Status: CLOSED ERRATA
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: realtime-kernel
Version: 1.1
Hardware: x86_64 All
Priority: low  Severity: high
Target Milestone: 1.1.5
Assigned To: Red Hat Real Time Maintenance
QA Contact: David Sommerseth
Reported: 2009-06-04 08:20 EDT by IBM Bug Proxy
Modified: 2016-05-22 19:28 EDT
3 users

Doc Type: Bug Fix
Last Closed: 2009-07-14 15:12:00 EDT


Attachments
External patch (1.24 KB, text/plain)
2009-06-04 08:20 EDT, IBM Bug Proxy


External Trackers
Tracker ID Priority Status Summary Last Updated
IBM Linux Technology Center 53407 None None None Never

Description IBM Bug Proxy 2009-06-04 08:20:40 EDT
=Comment: #0=================================================
Venkateswarara Jujjuri <jvrao@us.ibm.com> - 
I bounced a path while blast was running; when the path went down, failover happened as expected.
But when I brought the path back up, I see inconsistent behavior.

Some of the dm devices did not have the second path.
When I ran the multipath command by hand, these paths showed up, with a message in /var/log/messages.

The message in /var/log/messages is something like:
May 13 22:05:26 elm3c28 multipathd: dm-0: add map (uevent)
May 13 22:05:26 elm3c28 multipathd: dm-0: devmap already registered


=========== Details ======================

I have a total of 7 disks, and it is a SAN boot.

[root@elm3c28 BLAST]# cat /var/log/messages |  grep "Attached rdac" | grep "May 12 23"
May 12 23:21:46 elm3c28 kernel: scsi 0:0:0:0: Attached rdac.
May 12 23:22:04 elm3c28 kernel: scsi 0:0:0:1: Attached rdac.
May 12 23:22:30 elm3c28 kernel: scsi 0:0:0:2: Attached rdac.
May 12 23:23:12 elm3c28 kernel: scsi 0:0:0:3: Attached rdac.
May 12 23:23:12 elm3c28 kernel: scsi 0:0:0:4: Attached rdac.
May 12 23:23:29 elm3c28 kernel: scsi 0:0:0:5: Attached rdac.
May 12 23:23:58 elm3c28 kernel: scsi 0:0:0:6: Attached rdac.

>>>>> /var/log/messages shows that all 7 paths attached.


[root@elm3c28 BLAST]#  lsmod | grep rdac
scsi_dh_rdac           14592  14  <<<<<<<<<<<<<<<<<<< This also confirms it.
scsi_dh                11648  2 dm_multipath,scsi_dh_rdac
scsi_mod              153560  9
sr_mod,usb_storage,sg,qla2xxx,scsi_transport_fc,scsi_tgt,scsi_dh_rdac,scsi_dh,sd_mod


>>>>>> But multipath -ll doesn't show a second path for mpath4 and mpath0

[root@elm3c28 BLAST]# multipath -ll
mpath2 (3600a0b8000421b1400001c304a035feb) dm-6 IBM,1814      FAStT
[size=20G][features=1 queue_if_no_path][hwhandler=1 rdac]
\_ round-robin 0 [prio=100][enabled]
 \_ 0:0:0:2 sdd 8:48  [active][ready]
\_ round-robin 0 [prio=0][enabled]
 \_ 1:0:0:2 sdh 8:112 [active][ghost]
mpath1 (3600a0b8000421b1400001c2e4a035fca) dm-5 IBM,1814      FAStT
[size=20G][features=1 queue_if_no_path][hwhandler=1 rdac]
\_ round-robin 0 [prio=100][active]
 \_ 0:0:0:1 sdb 8:16  [active][ready]
\_ round-robin 0 [prio=0][enabled]
 \_ 1:0:0:1 sdf 8:80  [active][ghost]
mpath0 (3600a0b8000421b14000006c748cfa29b) dm-0 IBM,1814      FAStT
[size=70G][features=1 queue_if_no_path][hwhandler=1 rdac]
\_ round-robin 0 [prio=100][active]
 \_ 1:0:0:0 sdc 8:32  [active][ready]<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< No Second path
mpath6 (3600a0b8000421b1400001c384a036064) dm-10 IBM,1814      FAStT
[size=20G][features=1 queue_if_no_path][hwhandler=1 rdac]
\_ round-robin 0 [prio=100][enabled]
 \_ 0:0:0:6 sdl 8:176 [active][ready]
\_ round-robin 0 [prio=0][enabled]
 \_ 1:0:0:6 sdn 8:208 [active][ghost]
mpath5 (3600a0b8000421b1400001c364a03604a) dm-9 IBM,1814      FAStT
[size=20G][features=1 queue_if_no_path][hwhandler=1 rdac]
\_ round-robin 0 [prio=100][enabled]
 \_ 0:0:0:5 sdi 8:128 [active][ready]
\_ round-robin 0 [prio=0][enabled]
 \_ 1:0:0:5 sdm 8:192 [active][ghost]
mpath4 (3600a0b8000421b1400001c344a036030) dm-8 IBM,1814      FAStT
[size=20G][features=1 queue_if_no_path][hwhandler=1 rdac]
\_ round-robin 0 [prio=0][active]
 \_ 1:0:0:4 sdk 8:160 [active][ready]<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< No Second path
mpath3 (3600a0b8000421b1400001c324a036006) dm-7 IBM,1814      FAStT
[size=20G][features=1 queue_if_no_path][hwhandler=1 rdac]
\_ round-robin 0 [prio=100][enabled]
 \_ 0:0:0:3 sde 8:64  [active][ready]
\_ round-robin 0 [prio=0][enabled]
 \_ 1:0:0:3 sdj 8:144 [active][ghost]
=Comment: #1=================================================
Venkateswarara Jujjuri <jvrao@us.ibm.com> - 
Further investigation reveals that multipathd does not wait long enough for the path node to
show up.

We have failed to create/add the second path to the mpath0 device.


tid:14193 May 15 19:30:04 | uevent 'add' from '/block/sda'
tid:14193 May 15 19:30:04 | uevent 'add' from '/block/sda/sda1'
tid:14193 May 15 19:30:04 | uevent 'add' from '/block/sda/sda2'

 <<< The uevent_listen() thread received the uevent to add this path.
          This confirms that the udev event was not missed.

tid:14195 May 15 19:30:04 | sda: add path (uevent)

<<< The worker thread immediately picked up the uevent for servicing it.


tid:14195 May 15 19:30:04 | sda: not found in pathvec
tid:14195 May 15 19:30:04 | wait_for_file: for /sys/block/sda/dev
tid:14195 May 15 19:30:04 | wait_for_file: return 0 for /sys/block/sda/dev
tid:14195 May 15 19:30:04 | wait_for_file: for /sys/block/sda/device/vendor
tid:14195 May 15 19:30:04 | wait_for_file: return 0 for /sys/block/sda/device/vendor
tid:14195 May 15 19:30:04 | wait_for_file: for /sys/block/sda/device/model
tid:14195 May 15 19:30:04 | wait_for_file: return 0 for /sys/block/sda/device/model
tid:14195 May 15 19:30:04 | wait_for_file: for /sys/block/sda/device/rev
tid:14195 May 15 19:30:04 | wait_for_file: return 0 for /sys/block/sda/device/rev
tid:14195 May 15 19:30:04 | wait_for_file: for /dev/sda
tid:14195 May 15 19:31:04 | wait_for_file: return 1 for /dev/sda on time expiry :-1

<<< Here multipathd is waiting for the /dev/sda file to be created by the udev daemon. Because of
blast's heavy IO, udev presumably took more than a minute to create the /dev/sda node.
multipathd gave up waiting after a minute and posted an error.


tid:14195 May 15 19:31:04 | failed to open /dev/sda
tid:14195 May 15 19:31:04 | sda: failed to get path uid

The fix could be as simple as increasing the wait time. But because of the costs associated with
creating a user-level build infrastructure, we may need to
investigate whether a kernel-side change can fix this issue.

To validate my understanding, I am going to increase the wait time and fire off another test run
over the weekend.
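
For illustration, a rough sketch of the kind of polling loop wait_for_file()
implements (simplified, hypothetical code; not the actual multipath-tools source):

/* Hypothetical sketch of a wait_for_file()-style polling loop. */
#include <stdio.h>
#include <unistd.h>
#include <sys/stat.h>

#define FILE_TIMEOUT 60                 /* seconds; raising this is the experiment below */

static int wait_for_file(const char *path)
{
        struct stat st;

        for (int waited = 0; waited < FILE_TIMEOUT; waited++) {
                if (stat(path, &st) == 0)
                        return 0;       /* udev has created the node */
                sleep(1);               /* poll once per second until the timeout */
        }
        fprintf(stderr, "timed out waiting for %s\n", path);
        return 1;                       /* corresponds to the "return 1 ... time expiry" log line */
}

int main(void)
{
        return wait_for_file("/dev/sda");
}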

=Comment: #2=================================================
Venkateswarara Jujjuri <jvrao@us.ibm.com> - 
I increased the wait in multipathd's wait_for_file() to 180 sec from the original 60 sec.
Path bouncing tests revealed that in 6 out of 91 path failovers, we would have seen this issue
if the wait time had not been increased.

**tid:8059 May 16 17:19:37 | wait_for_file: for /dev/sda
**tid:8059 May 16 17:20:41 | wait_for_file: return 0 for /dev/sda

>>>> udev took 64 sec to create the path.

**tid:8059 May 16 18:24:09 | wait_for_file: for /dev/sda
**tid:8059 May 16 18:25:10 | wait_for_file: return 0 for /dev/sda

>>> udev took 61 sec to create the path.

**tid:8059 May 17 00:55:35 | wait_for_file: for /dev/sda
**tid:8059 May 17 00:56:53 | wait_for_file: return 0 for /dev/sda

>>> udev took 78 sec to create the path.

**tid:8059 May 17 02:37:49 | wait_for_file: for /dev/sda
**tid:8059 May 17 02:39:05 | wait_for_file: return 0 for /dev/sda

>>> udev took 67 sec to create the path.

**tid:8059 May 17 03:10:51 | wait_for_file: for /dev/sda
**tid:8059 May 17 03:11:58 | wait_for_file: return 0 for /dev/sda

>>> udev took 67 sec to create the path.

**tid:8059 May 17 04:50:19 | wait_for_file: for /dev/sda
**tid:8059 May 17 04:51:24 | wait_for_file: return 0 for /dev/sda

>>> udev took 65 sec to create the path.

=Comment: #3=================================================
Venkateswarara Jujjuri <jvrao@us.ibm.com> - 
Next experiment:

I have increased the rtprio of the udevd and restored the multipathd wait time to 60. 

[root@elm3c28 multipathd]# ps -eLo rtprio,comm,pid,cmd | grep udev
     - udevd             671 /sbin/udevd -d
     - grep            29089 grep udev
[root@elm3c28 multipathd]# chrt -p 95 `pidof udevd`
[root@elm3c28 multipathd]# ps -eLo rtprio,comm,pid,cmd | grep udev
    95 udevd             671 /sbin/udevd -d
     - grep            29107 grep udev

Tests started again; let us see if the increased priority of udevd gives it
enough time to catch up and service the dev creation request within the
60 sec time frame.
=Comment: #5=================================================
Venkateswarara Jujjuri <jvrao@us.ibm.com> - 
(In reply to comment #3)
> Next experiment:
> 
> I have increased the rtprio of the udevd and restored the multipathd wait time
> to 60. 
> 
> [root@elm3c28 multipathd]# ps -eLo rtprio,comm,pid,cmd | grep udev
>      - udevd             671 /sbin/udevd -d
>      - grep            29089 grep udev
> [root@elm3c28 multipathd]# chrt -p 95 `pidof udevd`
> [root@elm3c28 multipathd]# ps -eLo rtprio,comm,pid,cmd | grep udev
>     95 udevd             671 /sbin/udevd -d
>      - grep            29107 grep udev
> 
> Tests started again, and let us see if the increased priority of the 
> udevd gives enough time for it to catchup and services the 
> dev creation request within 60sec time frame.
> 

This did not work. Even with rtprio 95, the bug was reproduced. It took 5 hours
of blast and port bouncing to recreate the problem.
=Comment: #6=================================================
Venkateswarara Jujjuri <jvrao@us.ibm.com> - 
Noticed that AVT has been turned on on the DS4k. Because of this, udev's vol_id is taking
longer; hence multipathd's 60 sec wait time was not enough.

I disabled it a while back, but the recent DS4k f/w upgrade might have set it back
to the default (on by default).

After disabling AVT, I fired off the tests again. Still, I see one path missing.
Investigating this new issue.

multipathd> show paths
hcil    dev dev_t pri dm_st   chk_st   next_check
1:0:0:0 sdc 8:32  100 [active][ready]  XXXXXXXXX. 19/20
1:0:0:1 sdd 8:48  0   [active][ghost]  XXXX...... 9/20
1:0:0:2 sde 8:64  0   [active][ghost]  XXXXXXXXX. 19/20
1:0:0:3 sdf 8:80  0   [active][ghost]  XXXXXXXXX. 19/20
1:0:0:4 sdg 8:96  0   [active][ghost]  XXXXXXXXX. 19/20
1:0:0:5 sdh 8:112 0   [active][ghost]  XXXXXXXXX. 19/20
1:0:0:6 sdi 8:128 0   [active][ghost]  XXXXXXXXX. 19/20
0:0:0:0 sda 8:0   0   [undef] [faulty] [orphan]
0:0:0:1 sdb 8:16  100 [active][ready]  XXXX...... 8/20
0:0:0:2 sdj 8:144 100 [active][ready]  XXXXXXX... 14/20
0:0:0:3 sdk 8:160 100 [active][ready]  XXXXXXX... 15/20
0:0:0:4 sdl 8:176 100 [active][ready]  XXXXX..... 10/20
0:0:0:5 sdm 8:192 100 [active][ready]  XXXXXXXX.. 16/20
0:0:0:6 sdn 8:208 100 [active][ready]  XXXXXXXX.. 17/20
multipathd>
 
=Comment: #7=================================================
Venkateswarara Jujjuri <jvrao@us.ibm.com> - 
The problem appears to be prominent and reproducible even with AVT disabled.

The main problem is that blast is making the system so slow that udev does not
get enough cycles to complete the task of creating paths within the 60 sec time frame.

tid:11082 May 19 12:40:37 | uevent 'add' from '/block/sda'
<<<  multipathd socket saw the request to add the path /block/sda

tid:11084 May 19 12:40:37 | wait_for_file: for /dev/sda
tid:11084 May 19 12:41:36 | wait_for_file: return 1 for /dev/sda on time expiry :-1
<< The wait thread started immediately and returned an error after exactly 60 sec.

*** Below log shows that udevd took almost 120 secs to create the path.

May 19 12:41:17 elm3c29 udevd-event[12663]: run_program: '/lib/udev/usb_id -x'
May 19 12:41:17 elm3c29 udevd[694]: udev_event_run: seq 1297 forked, pid [12663], 'add' 'block', 40
seconds old

<<< udevd picked up this request to create the path only after 40 sec!

May 19 12:41:18 elm3c29 udevd-event[12663]: run_program: Waiting 1 seconds for output of
'/lib/udev/usb_id -x(12666)'
May 19 12:41:31 elm3c29 udevd-event[12663]: run_program: '/lib/udev/usb_id' returned with status 1
May 19 12:41:35 elm3c29 udevd-event[12663]: run_program: '/lib/udev/scsi_id -g -x -s /block/sda -d
/dev/.tmp-8-0'
May 19 12:41:36 elm3c29 udevd-event[12663]: run_program: Waiting 1 seconds for output of
'/lib/udev/scsi_id -g -x -s /block/sda -d /dev/.tmp-8-0(12683)'

<<< 7 sec scheduling delay??

May 19 12:41:43 elm3c29 udevd-event[12663]: run_program: '/lib/udev/scsi_id' (stdout) 'ID_VENDOR=IBM'
May 19 12:41:43 elm3c29 udevd-event[12663]: run_program: '/lib/udev/scsi_id' (stdout)
'ID_MODEL=1814_FAStT'
May 19 12:41:43 elm3c29 udevd-event[12663]: run_program: '/lib/udev/scsi_id' (stdout) 'ID_REVISION=0916'
May 19 12:41:43 elm3c29 udevd-event[12663]: run_program: '/lib/udev/scsi_id' (stdout)
'ID_SERIAL=3600a0b8000421b14000003824888dfd4'
May 19 12:41:43 elm3c29 udevd-event[12663]: run_program: '/lib/udev/scsi_id' (stdout) 'ID_TYPE=disk'
May 19 12:41:43 elm3c29 udevd-event[12663]: run_program: '/lib/udev/scsi_id' (stdout) 'ID_BUS=scsi'
May 19 12:41:43 elm3c29 udevd-event[12663]: run_program: '/lib/udev/scsi_id' returned with status 0
May 19 12:41:43 elm3c29 udevd-event[12663]: udev_rules_get_name: add symlink
'disk/by-id/scsi-3600a0b8000421b14000003824888dfd4'
May 19 12:41:43 elm3c29 udevd-event[12663]: run_program: '/lib/udev/path_id /block/sda'

<<< 10 sec delay. When I run the command by hand, it returns quickly.

May 19 12:41:53 elm3c29 udevd-event[12663]: run_program: '/lib/udev/path_id' (stdout)
'ID_PATH=pci-0000:03:05.0-fc-0x200400a0b8421b15:0x0000000000000000'
May 19 12:41:53 elm3c29 udevd-event[12663]: run_program: '/lib/udev/path_id' returned with status 0
May 19 12:41:53 elm3c29 udevd-event[12663]: udev_rules_get_name: add symlink
'disk/by-path/pci-0000:03:05.0-fc-0x200400a0b8421b15:0x0000000000000000'
May 19 12:41:53 elm3c29 udevd-event[12663]: run_program: '/lib/udev/vol_id --export /dev/.tmp-8-0'
May 19 12:41:54 elm3c29 udevd-event[12663]: run_program: Waiting 1 seconds for output of
'/lib/udev/vol_id --export /dev/.tmp-8-0(12731)'

<<< Another 11 sec delay.

May 19 12:42:07 elm3c29 udevd-event[12663]: run_program: '/lib/udev/vol_id' returned with status 4

<< 5 secs to execute two consecutive steps ?

May 19 12:42:12 elm3c29 udevd-event[12663]: run_program: '/lib/udev/edd_id --export /dev/.tmp-8-0'
May 19 12:42:15 elm3c29 udevd-event[12663]: run_program: Waiting 1 seconds for output of
'/lib/udev/edd_id --export /dev/.tmp-8-0(12748)'

<< Another 10+ sec delay.

May 19 12:42:26 elm3c29 udevd-event[12663]: run_program: '/lib/udev/edd_id' (stderr) 'no kernel EDD
support'
May 19 12:42:27 elm3c29 udevd-event[12663]: run_program: '/lib/udev/edd_id' returned with status 2
May 19 12:42:27 elm3c29 udevd-event[12663]: udev_rules_get_name: no node name set, will use kernel
name 'sda'
May 19 12:42:27 elm3c29 udevd-event[12663]: udev_db_get_device: no db file to read
/dev/.udev/db/block@sda: No such file or directory
May 19 12:42:27 elm3c29 udevd-event[12663]: udev_node_add: creating device node '/dev/sda', major =
'8', minor = '0', mode = '0640', uid = '0', gid = '6'
May 19 12:42:27 elm3c29 udevd-event[12663]: udev_node_add: creating symlink
'/dev/disk/by-id/scsi-3600a0b8000421b14000003824888dfd4' to '../../sda'
May 19 12:42:27 elm3c29 udevd-event[12663]: udev_node_symlink: link
'/dev/disk/by-id/scsi-3600a0b8000421b14000003824888dfd4' points to different target '../../sdc',
delete it
May 19 12:42:27 elm3c29 udevd-event[12663]: udev_node_symlink: creating symlink
'/dev/disk/by-id/scsi-3600a0b8000421b14000003824888dfd4' to '../../sda'
May 19 12:42:27 elm3c29 udevd-event[12663]: udev_node_add: creating symlink
'/dev/disk/by-path/pci-0000:03:05.0-fc-0x200400a0b8421b15:0x0000000000000000' to '../../sda'
May 19 12:42:27 elm3c29 udevd-event[12663]: udev_node_symlink: creating symlink
'/dev/disk/by-path/pci-0000:03:05.0-fc-0x200400a0b8421b15:0x0000000000000000' to '../../sda'
May 19 12:42:27 elm3c29 udevd-event[12663]: pass_env_to_socket: passed -1 bytes to socket
'/org/kernel/udev/monitor',
May 19 12:42:27 elm3c29 udevd-event[12663]: run_program: '/lib/udev/udev_run_hotplugd'
May 19 12:42:27 elm3c29 udevd-event[12663]: run_program: '/lib/udev/udev_run_hotplugd' returned with
status 0
May 19 12:42:27 elm3c29 udevd-event[12663]: run_program: '/lib/udev/udev_run_devd'
May 19 12:42:27 elm3c29 udevd-event[12663]: pass_env_to_socket: passed 569 bytes to socket
'/org/freedesktop/hal/udev_event',
May 19 12:42:47 elm3c29 udevd-event[12663]: run_program: '/sbin/pam_console_apply /dev/sda
/dev/disk/by-id/scsi-3600a0b8000421b14000003824888dfd4
/dev/disk/by-path/pci-0000:03:05.0-fc-0x200400a0b8421b15:0x0000000000000000'
May 19 12:42:47 elm3c29 udevd-event[12663]: run_program: '/sbin/pam_console_apply' returned with
status 0
May 19 12:42:49 elm3c29 udevd-event[12663]: udev_event_run: seq 1297 finished

=Comment: #9=================================================
Venkateswarara Jujjuri <jvrao@us.ibm.com> - 
Instrumented the kernel and started the tests, and the system quickly hung.
Unfortunately I had not configured the system to dump on NMI, hence no luck.

After that, I configured the system to take a dump on NMI and booted the
instrumented kernel.

Blast and port bounces have been running for 8 hours, and no issues. :(

I will let it run overnight and hope to catch the bug.
=Comment: #10=================================================
Venkateswarara Jujjuri <jvrao@us.ibm.com> - 
This problem is very easy to recreate on slow machines like the LS20 and LS21, but on the HS21 it is
extremely difficult.

It took almost 36 hours to recreate with 4 blast threads and
bouncing paths every 5 minutes with a 10-minute gap between alternate paths.

Now I am trying after uncommenting the following rule in multipath.rules on the HS21:

KERNEL!="dm-[0-9]*", ACTION=="add", PROGRAM=="/bin/bash -c '/sbin/lsmod | /bin/grep ^dm_multipath'",
RUN+="/sbin/multipath -v0 %M:%m"

=Comment: #11=================================================
Venkateswarara Jujjuri <jvrao@us.ibm.com> - 
Got it reproduced:

1. Activated /sbin/multipath rule on ADD event
2. Ran blast on 4 threads
3. Bounced paths.
** Out of 7 mpath devices, only 3 showed up after bringing the path back in

[root@elm3c19 ~]# multipath -l
mpath2 (3600a0b8000421b1400001bc44a021952) dm-7 IBM,1814      FAStT
[size=20G][features=1 queue_if_no_path][hwhandler=1 rdac]
\_ round-robin 0 [prio=0][enabled]
 \_ 2:0:0:2 sdc 8:32  [active][undef]
\_ round-robin 0 [prio=0][enabled]
 \_ 3:0:0:2 sdg 8:96  [active][undef]
mpath1 (3600a0b8000421b1400001bb64a00bffa) dm-10 IBM,1814      FAStT
[size=20G][features=1 queue_if_no_path][hwhandler=1 rdac]
\_ round-robin 0 [prio=0][active]
 \_ 2:0:0:1 sdb 8:16  [active][undef]
\_ round-robin 0 [prio=0][enabled]
 \_ 3:0:0:1 sdf 8:80  [active][undef]
mpath0 (3600a0b8000421a90000007a048eb95cb) dm-0 IBM,1814      FAStT
[size=70G][features=1 queue_if_no_path][hwhandler=1 rdac]
\_ round-robin 0 [prio=0][active]
 \_ 3:0:0:0 sde 8:64  [active][undef]
\_ round-robin 0 [prio=0][enabled]
 \_ 2:0:0:0 sda 8:0   [active][undef]
[root@elm3c19 ~]# 

====> and you can see 4 /sbin/multipath processes stuck on these 4 paths.
[root@elm3c19 ~]# ps -ef|grep multipath
root      9051     1  0 08:23 ?        00:00:00 /sbin/multipathd
root     11560 11537  0 08:43 ?        00:00:00 /sbin/multipath -v0 8:48
root     11623 11600  0 08:43 ?        00:00:00 /sbin/multipath -v0 8:112
root     11705 11682  0 08:43 ?        00:00:00 /sbin/multipath -v0 8:192
root     11721 11655  0 08:43 ?        00:00:00 /sbin/multipath -v0 8:176

Here are the work queue pointers from the instrumentation

May 29 08:40:55 elm3c19 kernel: multipath_dtr:calling flush_workqueue for
kmpath_handlerd:ffff81044adba280
May 29 08:40:56 elm3c19 kernel: multipath_dtr:calling flush_workqueue for kmultipathd:ffff81044adba300
May 29 08:40:56 elm3c19 kernel: jv_workqueue_debug_info: Entered with wq:ffff81044adba280
May 29 08:40:57 elm3c19 kernel: jv_workqueue_debug_info: cpu:0  cwq: ffff81044b997e00
May 29 08:40:58 elm3c19 kernel: jv_workqueue_debug_info: Entered with wq:ffff81044adba300
May 29 08:40:58 elm3c19 kernel: jv_workqueue_debug_info: cpu:0  cwq: ffff81044b996e00
May 29 08:40:58 elm3c19 kernel: jv_workqueue_debug_info: cpu:1  cwq: ffff81044b996c00
May 29 08:40:59 elm3c19 kernel: jv_workqueue_debug_info: cpu:2  cwq: ffff81044b996a00
May 29 08:40:59 elm3c19 kernel: jv_workqueue_debug_info: cpu:3  cwq: ffff81044b996800
May 29 08:40:59 elm3c19 kernel: jv_workqueue_debug_info: cpu:4  cwq: ffff81044b996600
May 29 08:41:00 elm3c19 multipathd: dm-9: add map (uevent) 
May 29 08:41:01 elm3c19 kernel: jv_workqueue_debug_info: cpu:5  cwq: ffff81044b996400
May 29 08:41:02 elm3c19 multipathd: dm-9: devmap already registered 
May 29 08:41:03 elm3c19 kernel: jv_workqueue_debug_info: cpu:6  cwq: ffff81044b996200
May 29 08:41:03 elm3c19 multipathd: 8:48: mark as failed 
May 29 08:41:03 elm3c19 kernel: jv_workqueue_debug_info: cpu:7  cwq: ffff81044b996000

=Comment: #12=================================================
Venkateswarara Jujjuri <jvrao@us.ibm.com> - 
Alright, made some progress getting closer to the problem, but still no clear picture.

Main observations:

- 4 mpath devices did not show up.
- 4 mpath threads are hung waiting for their barrier to be done.
- Only 1 cpu_workqueue has non-NULL barrier field (CPU1)
   crash> struct cpu_workqueue_struct ffff81044b996c00|grep bar
   barrier = 0xffff8101b24abb18

- This workqueue has 4 barriers hanging out of it.

     crash> struct wq_barrier 0xffff8101b24abb18 |grep prev_barrier
      prev_barrier = 0xffff81026311bb18,
  
    crash> struct wq_barrier 0xffff81026311bb18|grep prev_barrier
      prev_barrier = 0xffff8101f2073b18,

    crash> struct wq_barrier 0xffff8101f2073b18|grep prev_barrier
      prev_barrier = 0xffff8102f81ddb18,

    crash> struct wq_barrier 0xffff8102f81ddb18 |grep prev_barrier
     prev_barrier = 0x0,

Everything is consistent so far; it all fits the picture.

But the inconsistency starts here.

- All worklists in the wq_barriers and cpu_workqueues are empty.
- All the task_lists are also empty.

One barrier structure is shown below; all 4 barriers are in the same state.

crash> struct wq_barrier 0xffff8101b24abb18
struct wq_barrier {
  work = {
    data = {
      counter = -139619528512512 // Discussed below.
    },
    entry = {
      prio = 140,
      plist = {
        prio_list = {
          next = 0xffff8101b24abb28,  // +16 & next==prev  => Empty
          prev = 0xffff8101b24abb28
        },
        node_list = {
          next = 0xffff8101b24abb38,  // +16 & next==prev  => Empty
          prev = 0xffff8101b24abb38
        }
      }
    },
    func = 0xffffffff8104e6b8 <wq_barrier_func>
  },  ==> Size is 0x56
  worklist = {
    prio_list = {
      next = 0xffff8101b24abb50,  // +56 & next==prev  => Empty
      prev = 0xffff8101b24abb50
    },
    node_list = {
      next = 0xffff8101b24abb60,  // +16 & next==prev  => Empty
      prev = 0xffff8101b24abb60
    }
  },
  prev_barrier = 0xffff81026311bb18,
  prev_prio = -2130385224,
  waiter_prio = 118,
  cwq = 0xffff81044b996c00,  // The workqueue we're hanging off of.
  done = {
    done = 0,
    wait = {
      lock = {
        lock = {
          wait_lock = {
            raw_lock = {
              slock = 0
            },
            break_lock = 0
          },
          wait_list = {
            prio_list = {
              next = 0xffff8101b24abb98, // Empty
              prev = 0xffff8101b24abb98
            }
          },
          owner = 0x0
        },
        break_lock = 0
      },
      task_list = {
        next = 0xffff8101b24abac0, // Empty.
        prev = 0xffff8101b24abac0
      }
    }
  }
}

Unanswered questions:

1. counter = -139619528512512
    struct cpu_workqueue_struct *get_wq_data(struct work_struct *work)
    {
        return (void *) (atomic_long_read(&work->data) &   
                                                    WORK_STRUCT_WQ_DATA_MASK);
    }
   If I substitute the above value here, I get 0xfffffebaec46a778, which is
   supposed to be the workqueue.
   crash> struct cpu_workqueue_struct 0xfffffebaec46a778 
    struct cpu_workqueue_struct struct: read error: kernel virtual address:    
    fffffebaec46a778  type: "gdb_readmem_callback"
    Cannot access memory at address 0xfffffebaec46a778

So is it corruption, or maybe I am not calculating it right? (A quick re-check of
the arithmetic is sketched after this list.) All the barriers have the same value.

2. done=0 (hence the work is not done); the task list is empty, but the barriers
    are still hanging in there.

3. In run_workqueue() we appear to call complete(&barrier->done);
    but the task list is empty, so I am not sure how it can wake up these waiting threads.
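
As a side check on question 1, redoing the two's-complement conversion (a quick
sketch, assuming WORK_STRUCT_WQ_DATA_MASK only clears the low flag bits, as in
the get_wq_data() snippet quoted above):

#include <stdio.h>

int main(void)
{
        long counter = -139619528512512L;
        /* assumption: the mask only strips the low flag bits */
        unsigned long wq_data = (unsigned long)counter & ~3UL;
        printf("%#lx\n", wq_data);      /* prints 0xffff81044b996c00 */
        return 0;
}

If this arithmetic is right, the value decodes to 0xffff81044b996c00, which matches
the cwq address seen in the barrier dump above, so the counter may simply be the
cwq pointer rather than corruption.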


Investigation continues....

=Comment: #13=================================================
Venkateswarara Jujjuri <jvrao@us.ibm.com> - 
I think I found the issue. The problem is in kernel/workqueue.c.

The problem is in run_workqueue().

This function does not handle the case where more than one barrier is
inserted sequentially into a cwq.

The current logic works like this:

static void run_workqueue(struct cpu_workqueue_struct *cwq)
{

again:

        while (!plist_head_empty(worklist)) {
            
***  ( Process the entire work list )
           ....

                if (unlikely(cwq->barrier))
                        worklist = &cwq->barrier->worklist;  
*** Here we pick the first barrier.
        }

***    We insert barriers using insert_wq_barrier(); their worklists are
***    empty because there is no work to be done, only a waiter to wake up
***    when we hit the barrier.
***    Hence we come out of this while loop.

        if (unlikely(worklist != &cwq->worklist)) {
                struct wq_barrier *barrier = cwq->barrier;

*** Control enters here

                BUG_ON(!barrier);
                cwq->barrier = barrier->prev_barrier;

*** Get the prev_barrier into the structure

                complete(&barrier->done);

*** call complete for the first barrier

                if (unlikely(cwq->barrier))
                        worklist = &cwq->barrier->worklist;
                else
                        worklist = &cwq->worklist;

                if ( !plist_head_empty(worklist)) 
                    goto again;
*** As you can see, the above if statement is going to be false because
*** the barrier's worklist is empty, so we fall out of this function.
*** Hence we never wake up the threads sleeping on the remaining
*** barriers, and our multipath processes wait forever.
        }

Fix is simple.

if (unlikely(cwq->barrier)) {
    worklist = &cwq->barrier->worklist;
    goto again;  <<< FIX >>>
}
else
    worklist = &cwq->worklist;

I tried the following patch.

Index: linux-2.6.24.7-ibmrt2.1.05-view/kernel/workqueue.c
===================================================================
--- linux-2.6.24.7-ibmrt2.1.05-view.orig/kernel/workqueue.c
+++ linux-2.6.24.7-ibmrt2.1.05-view/kernel/workqueue.c
@@ -353,8 +353,12 @@ again:
                else
                        worklist = &cwq->worklist;

-               if (!plist_head_empty(worklist))
-                       goto again;
+               if (unlikely(cwq->barrier) || !plist_head_empty(worklist)) {
+                   if (plist_head_empty(worklist)) {
+                        printk("run_workqueue: WOULD have hit the bug for cwq:%p and barrier:%p\n", cwq, cwq->barrier);
+                    }
+                    goto again;
+                }
        }

 I see "WOULD have hit the bug" print 
in the /var/log/mesages and No multipath threads are waiting after
path bounces...and paths come back as expected.

I will do further testing before sending out the patch.
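
To make the control flow concrete, here is a minimal user-space model of the loop
(heavily simplified, hypothetical code; not the real kernel code). It assumes the
worklists are all empty, as in the crash dump, and only models the barrier chain:

#include <stdio.h>
#include <stdbool.h>
#include <stddef.h>

struct wq_barrier {
        struct wq_barrier *prev_barrier;
        bool done;                      /* stands in for complete(&barrier->done) */
};

struct cwq {
        struct wq_barrier *barrier;     /* newest barrier in the chain */
};

/* Returns how many barriers were completed before the loop exits. */
static int run_workqueue_model(struct cwq *cwq, bool fixed)
{
        int completed = 0;
again:
        /* The real code drains the worklist here; in the hang it is empty. */
        if (cwq->barrier) {
                struct wq_barrier *barrier = cwq->barrier;

                cwq->barrier = barrier->prev_barrier;
                barrier->done = true;   /* wakes one waiter */
                completed++;
                /*
                 * The broken code only loops again when the next worklist is
                 * non-empty, which it never is here, so it stops after the
                 * first barrier.  The fix loops again whenever another
                 * barrier is still queued.
                 */
                if (fixed && cwq->barrier)
                        goto again;
        }
        return completed;
}

int main(void)
{
        /* A chain of 4 barriers, like the 4 stuck /sbin/multipath invocations. */
        struct wq_barrier b[4] = {
                { NULL, false }, { &b[0], false }, { &b[1], false }, { &b[2], false }
        };
        struct cwq cwq = { &b[3] };

        printf("broken: completed %d of 4 barriers\n",
               run_workqueue_model(&cwq, false));

        cwq.barrier = &b[3];
        printf("fixed:  completed %d of 4 barriers\n",
               run_workqueue_model(&cwq, true));
        return 0;
}

The broken variant completes only the first barrier and returns, leaving the
other waiters asleep; the fixed variant drains the whole chain.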

=Comment: #15=================================================
Venkateswarara Jujjuri <jvrao@us.ibm.com> - 

The run_workqueue() routine exits prematurely, before processing
all barriers in the queue.
This causes waiters (/sbin/multipath in our case) to wait forever,
and hence the corresponding new paths do not show up.

Fixed so that run_workqueue() finishes processing all barriers
before exiting.

Testing:
Ran blast for over 24 hours with path failovers while the udev rule to run /sbin/multipath was
active. Paths showed up as expected.

A similar configuration without the patch could not get beyond a 30-minute run.
Comment 1 IBM Bug Proxy 2009-06-04 08:20:48 EDT
Created attachment 346524 [details]
External patch
Comment 6 errata-xmlrpc 2009-07-14 15:12:00 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1157.html
