Bug 1701234
| Summary: | blk-availability.service doesn't respect unit order on shutdown | | |
| --- | --- | --- | --- |
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Renaud Métrich <rmetrich> |
| Component: | lvm2 | Assignee: | Peter Rajnoha <prajnoha> |
| lvm2 sub component: | Default / Unclassified | QA Contact: | cluster-qe <cluster-qe> |
| Status: | CLOSED DEFERRED | Docs Contact: | |
| Severity: | high | | |
| Priority: | high | CC: | agk, cbesson, cmarthal, erlend, heinzm, jbrassow, jmagrini, loberman, mbliss, mrichter, msnitzer, paelzer, pdwyer, prajnoha, qguo, revers, rhandlin, zkabelac |
| Version: | 7.6 | Keywords: | Reopened, Triaged |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-11-18 07:15:31 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1784876 | | |
Description
Renaud Métrich
2019-04-18 12:56:02 UTC
Normally, I'd add Before=local-fs-pre.target to blk-availability.service so that on shutdown its ExecStop would execute after all local mount points are unmounted. The problem might be with all the dependencies like the iscsi, fcoe and rbdmap services, where we need to make sure that these are executed *after* blk-availability. So I need to find a proper target that we can hook onto so that it also fits all the dependencies. It's possible we need to create a completely new target so we can properly synchronize all the services on shutdown. I'll see what I can do...

Indeed, I wasn't able to find a proper target; none exists.

I believe blk-availability itself needs to be modified to only deactivate non-local disks (hopefully there is a way to distinguish).

Hi Peter,

Could you explain why blk-availability is needed when using multipath or iscsi? With systemd ordering dependencies in units, is that really needed?

(In reply to Renaud Métrich from comment #4)
> Hi Peter,
>
> Could you explain why blk-availability is needed when using multipath or iscsi?
> With systemd ordering dependencies in units, is that really needed?

It is still needed because otherwise there wouldn't be anything else to properly deactivate the stack. Even though blk-availability.service with its blkdeactivate call is still not perfect, it's still better than nothing and better than letting systemd shoot down the devices on its own within its "last-resort" device deactivation loop that happens in the shutdown initramfs (at that point, the iscsi/fcoe and all the other devices are already disconnected anyway, so anything else on top can't be properly deactivated).

We've just received a related report on GitHub too (https://github.com/lvmteam/lvm2/issues/18). I'm revisiting this problem now. The correct solution requires more patching - this part is very fragile at the moment (...easy to break other functionality).

(In reply to Renaud Métrich from comment #3)
> I believe blk-availability itself needs to be modified to only deactivate
> non-local disks (hopefully there is a way to distinguish).

It's possible that we need to split blk-availability (and blkdeactivate) in two because of this... There is a way to distinguish, I hope (definitely for iscsi/fcoe), but there currently isn't a central authority to decide on this, so it must be done manually (checking certain properties in sysfs "manually").

I must be missing something. This service is used to deactivate "remote" block devices requiring the network, such as iscsi or fcoe. Why aren't these services deactivating the block devices by themselves? That way systemd won't kill everything abruptly.

(In reply to Renaud Métrich from comment #7)
> I must be missing something. This service is used to deactivate "remote"
> block devices requiring the network, such as iscsi or fcoe.

Nope, ALL storage, remote as well as local, if possible. We need to look at the complete stack (e.g. device-mapper devices, which are layered on top of other layers, are set up locally).

> Why aren't these services deactivating the block devices by themselves?

Well, honestly, because nobody has ever solved that :) At the beginning it probably wasn't that necessary, and if you just shut your system down and left the devices as they were (unattached, not deactivated), it wasn't such a problem. But now, with various caching layers, thin pools... it's getting quite important to deactivate the stack properly so that any metadata and data are also properly flushed.

Of course, we still need to account for the situation where there's a power outage and the machine is not backed by any other power source, so the machine goes down immediately (for that there are various checking and repair mechanisms). But it's certainly better to avoid this situation, as you could still lose some data. Systemd's loop in the shutdown initramfs is really the last-resort thing to execute, but we can't rely on that (it's just a loop over the device list with a limited loop count; it doesn't look at the real nature of each layer in the stack).
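For readers wondering what "checking certain properties in sysfs manually" could look like in practice, below is a minimal sketch of the idea discussed above. It is only an illustration, not code from blkdeactivate or from any proposed patch; the transport names and the sysfs "session" heuristic are assumptions on my part.

```
#!/bin/sh
# Illustrative sketch only -- not part of blkdeactivate or lvm2.
# Classify top-level block devices as "remote" (network transports such as
# iscsi/fc/fcoe) or "local", using the transport reported by lsblk and,
# as a fallback, the iSCSI "session" component in the sysfs device path.

lsblk -d -n -o NAME,TRAN | while read -r name tran; do
    case "$tran" in
        iscsi|fc|fcoe)
            echo "$name: remote (transport: $tran)"
            ;;
        *)
            # iSCSI-backed disks typically resolve to a sysfs path that
            # contains a ".../sessionN/..." component.
            if readlink -f "/sys/block/$name" | grep -q '/session[0-9]'; then
                echo "$name: remote (iSCSI session in sysfs path)"
            else
                echo "$name: local (transport: ${tran:-none})"
            fi
            ;;
    esac
done
```

A real split of blkdeactivate would presumably need something more robust than this per-device heuristic; the later comments point to SID as the longer-term answer.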
OK, then we need a "blk-availability-local" service and a "blk-availability-remote" service, and maybe associated targets, similar to "local-fs.target" and "remote-fs.target". Probably this should be handled by the systemd package itself, typically by analyzing the device properties when a device shows up in udev.

Based on the report here, this affects only setups with custom services/systemd units. Also, blk-availability/blkdeactivate has been in RHEL 7 since 7.0 and this seems to be the only report we have received so far (therefore, I don't expect many users to be affected by this issue). I also think there is less risk in adding the extra dependency as already described in https://access.redhat.com/solutions/4154611 than in splitting blk-availability/blkdeactivate into (at least) two parts running at different times. If we did that, we'd need to introduce a new synchronization point (like a systemd target) that other services would need to depend on, and so it would require many more changes in various other components, which involves risks. In the future, we'll try to cover this shutdown scenario in a more proper way with the new Storage Instantiation Daemon (SID).

Red Hat Enterprise Linux 7 shipped its final minor release on September 29th, 2020. 7.9 was the last minor release scheduled for RHEL 7. From initial triage it does not appear that the remaining Bugzillas meet the inclusion criteria for Maintenance Phase 2, and they will now be closed. From the RHEL life cycle page: https://access.redhat.com/support/policy/updates/errata#Maintenance_Support_2_Phase

"During Maintenance Support 2 Phase for Red Hat Enterprise Linux version 7, Red Hat defined Critical and Important impact Security Advisories (RHSAs) and selected (at Red Hat discretion) Urgent Priority Bug Fix Advisories (RHBAs) may be released as they become available."

If this BZ was closed in error and meets the above criteria, please re-open it, flag it for 7.9.z, provide suitable business and technical justifications, and follow the process for Accelerated Fixes: https://source.redhat.com/groups/public/pnt-cxno/pnt_customer_experience_and_operations_wiki/support_delivery_accelerated_fix_release_handbook

Feature Requests can be re-opened and moved to RHEL 8 if the desired functionality is not already present in the product.

Please reach out to the applicable Product Experience Engineer[0] if you have any questions or concerns.

[0] https://bugzilla.redhat.com/page.cgi?id=agile_component_mapping.html&product=Red+Hat+Enterprise+Linux+7

Apologies for the inadvertent closure.
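For reference, the "extra dependency" workaround mentioned earlier in the thread (https://access.redhat.com/solutions/4154611) is, as this thread describes it, an ordering dependency so that custom units using LVM/iSCSI-backed storage stop before blk-availability.service runs blkdeactivate. A minimal sketch of such a drop-in is shown below; the unit name my-app.service is hypothetical, and the exact content recommended by the linked article may differ.

```
# Hypothetical example: make a custom unit ("my-app.service", made-up name)
# stop before blk-availability.service runs its ExecStop (blkdeactivate).
# systemd stops units in the reverse of their start-up ordering, so
# "After=blk-availability.service" at start-up means "stop first" at shutdown.

mkdir -p /etc/systemd/system/my-app.service.d

cat > /etc/systemd/system/my-app.service.d/order-after-blk-availability.conf <<'EOF'
[Unit]
After=blk-availability.service
EOF

systemctl daemon-reload
```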