Bug 1300453

Summary: [NTAP CQ 822554] systemd attempts to unmount filesystem during multipath failure test
Product: Red Hat Enterprise Linux 7
Reporter: Ryan <Ryan.Rodine>
Component: device-mapper-multipath
Assignee: LVM and device-mapper development team <lvm-team>
Status: CLOSED DUPLICATE
QA Contact: Lin Li <lilin>
Severity: urgent
Docs Contact:
Priority: urgent
Version: 7.2
CC: abdel.sadek, agk, bmarzins, ctatman, heinzm, lilin, lnykryn, matt.schulte, msekleta, msnitzer, ng-hsg-engcustomer-iop-bz, prajnoha, qguo, rsawhill, Ryan.Rodine, systemd-maint-list
Target Milestone: rc
Target Release: 7.3
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-02-26 08:33:36 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1269644
Attachments:
Output of journalctl -b (flags: none)
Output of multipath -ll (flags: none)
Output of journalctl -b (flags: none)
udevadm monitor output (flags: none)
journalctl output (flags: none)

Description Ryan 2016-01-20 21:14:10 UTC

During storage controller reboot testing, in which half the paths to a multipath-backed filesystem mount are repeatedly removed and restored (by resetting 1 of 2 storage controllers), systemd attempts to unmount the filesystem even though active paths remain.



Reproducible: Sometimes

Steps to Reproduce:
1. Format and mount a multipathed (4-path) device. In this case we are using xfs, but we do not believe the filesystem is relevant.
2. Exercise path offline/online. In this case we are resetting a controller every 5 minutes, which takes half of the paths offline and brings them back online.
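
For reference, a minimal sketch of the setup (a sketch only: the device name mpathcw and mount point /iomnt-mpathcw are from our config, and reset_controller_a/b are hypothetical placeholders for the array-specific reset commands):

# Format and mount the multipathed (4-path) device
mkfs.xfs /dev/mapper/mpathcw
mkdir -p /iomnt-mpathcw
mount /dev/mapper/mpathcw /iomnt-mpathcw

# Exercise path offline/online: reset one of the two storage controllers
# every 5 minutes, which takes half of the paths down and back up
while true; do
    reset_controller_a   # hypothetical placeholder for the array-specific reset
    sleep 300
    reset_controller_b   # hypothetical placeholder for the array-specific reset
    sleep 300
done
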
Actual Results:  
systemd reports dev-mapper-mpathcw.device as dead and tries to unmount the filesystem.

Expected Results:  
Since there should still be active paths to the LUN, systemd should not attempt to unmount.

In the attached example, systemd attempts to unmount /iomnt-mpathcw (multipath device mpathcw) shortly after a path is restored, bringing current paths to 3:

Jan 20 11:17:23 ICTM19Z1S02H3.ict.englab.netapp.com multipathd[1822]: mpathcw: sdgo - rdac checker reports path is up
Jan 20 11:17:23 ICTM19Z1S02H3.ict.englab.netapp.com multipathd[1822]: 132:64: reinstated
Jan 20 11:17:23 ICTM19Z1S02H3.ict.englab.netapp.com multipathd[1822]: mpathcw: remaining active paths: 3
Jan 20 11:17:23 ICTM19Z1S02H3.ict.englab.netapp.com multipathd[1822]: mpathcw: load table [0 37748736 multipath 4 pg_init_retries 50 queue_if_no_path retain_attached_hw_handler 1 rdac 3 2 service-time 0 1 1 68:32 1 service-time 0 1 1 132:64 1 service-time 0 2 1 134:64 1 70:32 1]

For some reason systemd believes dev-mapper-mpathcw.device to be dead and tries to unmount it:

Jan 20 11:17:26 ICTM19Z1S02H3.ict.englab.netapp.com systemd[1]: Trying to enqueue job iomnt\x2dmpathcw.mount/stop/replace
Jan 20 11:17:26 ICTM19Z1S02H3.ict.englab.netapp.com systemd[1]: Installed new job remote-fs.target/stop as 1850
Jan 20 11:17:26 ICTM19Z1S02H3.ict.englab.netapp.com systemd[1]: Installed new job iomnt\x2dmpathcw.mount/stop as 1849
Jan 20 11:17:26 ICTM19Z1S02H3.ict.englab.netapp.com systemd[1]: Enqueued job iomnt\x2dmpathcw.mount/stop as 1849
Jan 20 11:17:26 ICTM19Z1S02H3.ict.englab.netapp.com systemd[1]: dev-disk-by\x2did-dm\x2duuid\x2dmpath\x2d360080e50002909c400007a255695b068.device changed plugged -> dead
Jan 20 11:17:26 ICTM19Z1S02H3.ict.englab.netapp.com systemd[1]: dev-disk-by\x2did-dm\x2dname\x2dmpathcw.device changed plugged -> dead
Jan 20 11:17:26 ICTM19Z1S02H3.ict.englab.netapp.com systemd[1]: dev-dm\x2d37.device changed plugged -> dead
Jan 20 11:17:26 ICTM19Z1S02H3.ict.englab.netapp.com systemd[1]: sys-devices-virtual-block-dm\x2d37.device changed plugged -> dead
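
One way to watch systemd's view of the device unit while the test runs (a sketch; the unit name is derived from the /dev/mapper/mpathcw path):

# SubState flips from "plugged" to "dead" at the moment of the failure
watch -n1 'systemctl show -p ActiveState -p SubState dev-mapper-mpathcw.device'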


journalctl -b and multipath -ll output will be attached.

Comment 1 Ryan 2016-01-20 21:16:23 UTC
Created attachment 1116717 [details]
Output of journalctl -b

Comment 2 Ryan 2016-01-20 21:17:05 UTC
Created attachment 1116718 [details]
Output of multipath -ll

Comment 4 Ryan 2016-01-20 21:47:02 UTC
This does not appear, at least on the surface, to be the same root cause as NTAP CQ 669667/BZ 1187403, which was caused by bad udev rules (I checked and didn't see those rules here).

Comment 5 Lukáš Nykrýn 2016-01-21 14:13:52 UTC
It is a long shot, but can you try this build?
https://people.redhat.com/lnykryn/systemd-219-19.el7_2.3/

Comment 6 Ryan 2016-01-21 20:34:41 UTC
I tried the build at https://people.redhat.com/lnykryn/systemd-219-19.el7_2.3/ and am still seeing the same behavior.

Comment 7 Matt Schulte 2016-02-08 14:50:28 UTC
This phantom unmount behavior is now being seen on other test configs, not just Ryan's.  This is endangering our upcoming program release and we would like to know what can be done to escalate this problem.  We need to find someone with some pretty deep systemd knowledge to help us understand this issue.

Comment 8 Lukáš Nykrýn 2016-02-08 15:23:53 UTC
(In reply to Ryan from comment #1)
> Created attachment 1116717 [details]
> Output of journalctl -b

At least on my machine this attachment looks broken; can you please reattach the log?

Comment 9 Ryan 2016-02-08 16:00:11 UTC
The MIME type was incorrectly set to text. Changed to application/zip.

(In reply to Lukáš Nykrýn from comment #8)
> (In reply to Ryan from comment #1)
> > Created attachment 1116717 [details]
> > Output of journalctl -b
> 
> At least on my machine this looks broken, can you please reattach the log?

Comment 10 Chris Tatman 2016-02-09 22:52:33 UTC
Hi Lukas,

Are you now able to view the logs that NetApp just attached? What else do you need from them to get this debugged?

Thanks!

--Chris

Comment 11 Lukáš Nykrýn 2016-02-10 10:18:21 UTC
Can you please create the debug log once again, this time using this build?
https://copr.fedorainfracloud.org/coprs/lnykryn/systemd-rhel-staging/
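
If it helps, a sketch of how such a debug log can be captured (booting with systemd.log_level=debug on the kernel command line should work as well):

# Raise systemd's log level at runtime, reproduce the issue, then dump the journal
systemd-analyze set-log-level debug
journalctl -b > journalctl-debug.txt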

Comment 12 Michal Sekletar 2016-02-10 16:32:07 UTC
I've tried to simulate the scenario from the bug description with 4 iscsi sessions to the same LUN. However, I was not able to reproduce it: I was still able to access the filesystems when I brought two paths offline and then back online, and systemd didn't unmount anything.

Note that I was testing with the newest build prepared for the next z-stream batch. However, it contains only a couple of unrelated patches on top of https://people.redhat.com/lnykryn/systemd-219-19.el7_2.3/.

Is this reproducible on a smaller setup? If yes, it would be great to gather the requested debug log on such a setup.
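
For the record, a sketch of the offline/online step I used in place of a controller reset (standard SCSI path state toggling; sdc and sdd stand in for two of the four path devices):

echo offline > /sys/block/sdc/device/state
echo offline > /sys/block/sdd/device/state
sleep 60
echo running > /sys/block/sdc/device/state
echo running > /sys/block/sdd/device/state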

Comment 14 Chris Tatman 2016-02-10 18:35:58 UTC
Hi Ryan,

Can you help capture debug logs per comments 11 & 12 please?

Thanks!

--Chris

Comment 15 Ryan 2016-02-10 22:32:16 UTC
Not sure if it will reproduce with fewer LUNs, but I'll give it a shot and use the build mentioned by Lukáš.

One thing I should have mentioned in the description is that the issue doesn't always happen on the first down/up; sometimes several iterations pass before it occurs.

Comment 16 Ryan 2016-02-11 19:59:51 UTC
Created attachment 1123289 [details]
Output of journalctl -b

Comment 17 Ryan 2016-02-11 20:00:20 UTC
Reproduced the issue using the build referenced by Lukáš Nykrýn. Attaching the journalctl debug log.

I'm now going to try reproducing it with fewer LUNs. We are currently testing with many raw LUNs and a handful of filesystems; I'm going to remove the raw LUNs and see if I can reproduce with only the filesystems attached.

Comment 18 Michal Sekletar 2016-02-12 16:10:58 UTC
Ryan, 

While trying to reproduce, please also try to capture the output of the following command:

udevadm monitor --udev --property

It will print each udev event after rule processing, together with its properties.
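
To cut down on noise, the capture can optionally be restricted to block events and redirected to a file (a sketch):

udevadm monitor --udev --property --subsystem-match=block > udevadm-monitor.txt 2>&1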

Comment 19 Ryan 2016-02-13 16:53:57 UTC
I've reproduced the issue with only a single mounted LUN and filesystem.  Attaching journalctl and udevadm monitor output.

Comment 20 Ryan 2016-02-13 16:56:27 UTC
Created attachment 1123832 [details]
udevadm monitor output

udevadm monitor output while recreating with a single mount

Comment 21 Ryan 2016-02-13 16:57:24 UTC
Created attachment 1123833 [details]
journalctl output

journalctl log while recreating with a single mount

Comment 22 Michal Sekletar 2016-02-15 15:30:12 UTC
In the udevadm monitor output I see the following event logged:

UDEV  [94891.553203] change   /devices/virtual/block/dm-5 (block)
ACTION=change
DEVLINKS=/dev/disk/by-id/dm-name-mpathak /dev/disk/by-id/dm-uuid-mpath-360080e500029f68800008f33564b3cf2 /dev/mapper/mpathak
DEVNAME=/dev/dm-5
DEVPATH=/devices/virtual/block/dm-5
DEVTYPE=disk
DM_ACTION=PATH_FAILED
DM_MULTIPATH_TIMESTAMP=1455225500
DM_NAME=mpathak
DM_NR_VALID_PATHS=3
DM_PATH=8:80
DM_SEQNUM=488
DM_SUSPENDED=1
DM_TARGET=multipath
DM_UDEV_DISABLE_LIBRARY_FALLBACK_FLAG=1
DM_UDEV_DISABLE_OTHER_RULES_FLAG=1
DM_UDEV_PRIMARY_SOURCE_FLAG=1
DM_UDEV_RULES_VSN=2
DM_UUID=mpath-360080e500029f68800008f33564b3cf2
MAJOR=253
MINOR=5
MPATH_SBIN_PATH=/sbin
SEQNUM=9741
SUBSYSTEM=block
SYSTEMD_READY=0
TAGS=:systemd:
USEC_INITIALIZED=2538

From the journal output it seems that processing of the above event finished just before the umount attempt. Note that this event has SYSTEMD_READY=0 in its environment. This basically tells systemd that if this device (/dev/dm-5) was previously announced by udev, it is no longer supposed to be used. systemd therefore transitions the device to the "dead" state, and the umount of the dependent mount unit is triggered because of this.
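
For comparison, both sides can be inspected after such an event (a sketch: the properties udev recorded for the device, and the state systemd holds for the corresponding unit):

# What udev last recorded for the device
udevadm info --name=/dev/dm-5 | grep -E 'SYSTEMD_READY|DM_SUSPENDED|DM_UDEV_DISABLE_OTHER_RULES_FLAG'

# What systemd currently thinks of the matching .device unit
systemctl show -p ActiveState -p SubState 'dev-dm\x2d5.device'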

When I was (unsuccessfully) trying to reproduce this bug, I did not observe change events about path failure or reinstatement with the following flags set:

DM_UDEV_DISABLE_OTHER_RULES_FLAG=1
SYSTEMD_READY=0

This seems like a bug in the device-mapper-multipath rules. Reassigning.

Comment 23 Ben Marzinski 2016-02-15 21:04:16 UTC
Something looks wrong here.

In 10-dm.rules, DM_UDEV_DISABLE_OTHER_RULES_FLAG=1 is set when the device is suspended:

# Avoid processing and scanning a DM device in the other (foreign)
# rules if it is in suspended state. However, we still keep 'disk'
# and 'DM subsystem' related rules enabled in this case.
ENV{DM_SUSPENDED}=="1", ENV{DM_UDEV_DISABLE_OTHER_RULES_FLAG}="1"

This looks like the most likely reason why DM_UDEV_DISABLE_OTHER_RULES_FLAG is set in this case.

In 99-systemd.rules, SYSTEMD_READY=0 is set when DM_UDEV_DISABLE_OTHER_RULES_FLAG is set:

SUBSYSTEM=="block", KERNEL!="ram*", ENV{DM_UDEV_DISABLE_OTHER_RULES_FLAG}=="1", ENV{SYSTEMD_READY}="0"

So it looks like whenever udev is processing an event on a suspended dm device, SYSTEMD_READY will be set to 0. While it makes sense not to act on a dm device before it is ready, once the device is up it shouldn't be unmounted simply because it is suspended for a second.
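
A sketch of how this chain can be confirmed without a storage array: suspend the map by hand and replay rule processing for a change event (note that dmsetup suspend freezes I/O on the device until it is resumed):

dmsetup suspend mpathak
udevadm test --action=change /sys/block/dm-5 2>&1 | grep -E 'SYSTEMD_READY|DM_SUSPENDED'
dmsetup resume mpathak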

Peter, am I missing something here?

Comment 24 Peter Rajnoha 2016-02-16 07:42:12 UTC
(In reply to Ben Marzinski from comment #23)
> SUBSYSTEM=="block", KERNEL!="ram*",
> ENV{DM_UDEV_DISABLE_OTHER_RULES_FLAG}=="1", ENV{SYSTEMD_READY}="0"
> 
> So it looks like whenever a udev is processing an event on a suspended dm
> device, SYSTEMD_READY will be set to 0.  While it makes sense to not act on a
> dm device before it is ready, once the device is up, it shouldn't be
> unmounted simply because it is suspended for a second.
> 
> Peter, am I missing something here?

Yes, it's exactly as you said. Now I need to find out how we're going to fix this...

Comment 25 Ryan Sawhill 2016-02-25 19:15:16 UTC
So Peter, to be clear, bz1312011 is the answer here as well?

Comment 26 Peter Rajnoha 2016-02-26 07:50:10 UTC
(In reply to Ryan Sawhill from comment #25)
> So Peter, to be clear, bz1312011 is the answer here as well?

Yes, after further discussion with the systemd team we also tracked the history, and it seems the rule setting SYSTEMD_READY=0 got there by mistake - the rules were only supposed to be skipped.

See also https://github.com/systemd/systemd/commit/b7cf1b68a79833411851fa547ac9b4906207c224 - so we should put back this rule:

SUBSYSTEM=="block", KERNEL!="ram*", ENV{DM_UDEV_DISABLE_OTHER_RULES_FLAG}=="1", GOTO="systemd_end"

(so GOTO="systemd_end" instead of that SYSTEMD_READY=0)

I think we don't need to do anything else on the dm side of the rules, so we can probably close this bug and let bug #1312011 track the issue.

Comment 27 Peter Rajnoha 2016-02-26 07:55:31 UTC
(In reply to Peter Rajnoha from comment #26)
> so we should put back this rule:
> 
> SUBSYSTEM=="block", KERNEL!="ram*",
> ENV{DM_UDEV_DISABLE_OTHER_RULES_FLAG}=="1", GOTO="systemd_end"
> 
> (so GOTO="systemd_end" instead of that SYSTEMD_READY=0)
> 

Currently, 99-systemd.rules doesn't process DM devices any further, so removing the rule completely would have the same effect.

Comment 28 Peter Rajnoha 2016-02-26 08:28:54 UTC
Sorry, I overlooked one important thing! We need to make sure that the DM device is not marked as ready on the ADD event. So the rule in 99-systemd.rules should be:

SUBSYSTEM=="block", ACTION=="add", ENV{DM_UDEV_DISABLE_OTHER_RULES_FLAG}=="1", ENV{SYSTEMD_READY}="0"

Comment 29 Peter Rajnoha 2016-02-26 08:29:21 UTC
(I've also updated bug #1312011.)

Comment 30 Peter Rajnoha 2016-02-26 08:33:36 UTC

*** This bug has been marked as a duplicate of bug 1312011 ***