Bug 1730055

Summary: A systemd package update explicitly stops, and then starts, the udev services, causing hangs in DM event-completion-dependent operations
Product: Red Hat Enterprise Linux 7
Reporter: Kyle Walker <kwalker>
Component: systemd
Assignee: systemd-maint
Status: CLOSED WONTFIX
QA Contact: Frantisek Sumsal <fsumsal>
Severity: high
Docs Contact:
Priority: unspecified
Version: 7.6
CC: agk, heinzm, jbrassow, msnitzer, prajnoha, systemd-maint-list, uobergfe, yuokada, zkabelac
Target Milestone: rc
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-11-11 21:39:47 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1801675

Description Kyle Walker 2019-07-15 18:12:21 UTC
Description of problem:
 Since the early days of the udev sources being merged into systemd, the package update path has included an explicit "stop" and then "start" of the udev services. This was introduced by the following commit:

    commit 973c8d4ab5276398ad5e1ab1c78e8fa64207ba61
    Author: Kay Sievers <kay>
    Date:   Thu May 24 22:42:11 2012 +0200

        New upstream release including udev merge


Where the following can be observed:

    %pre
    <snip>
    systemctl stop systemd-udev.service systemd-udev-control.socket systemd-udev-kernel.socket >/dev/null 2>&1 || :
    <snip>

    %post
    <snip>
    /bin/systemctl start systemd-udev.service >/dev/null 2>&1 || :

For containerized workloads, which rely on "dmsetup udevcomplete" events, the above can result in hard hangs. Specifically, a hang waiting for the following rule to be exercised:

/usr/lib/udev/rules.d/95-dm-notify.rules
  ENV{DM_COOKIE}=="?*", RUN+="/usr/sbin/dmsetup udevcomplete $env{DM_COOKIE}"
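
For context, the waiting side of this synchronization lives in libdevmapper. The following is only a minimal sketch of that pattern, not docker's actual wrapper code; the device name "test-dev" is made up, error handling is trimmed, and it builds with -ldevmapper:

    /* Remove a DM device with udev synchronization enabled. */
    #include <stdio.h>
    #include <libdevmapper.h>

    int main(void)
    {
        struct dm_task *dmt;
        uint32_t cookie = 0;

        if (!(dmt = dm_task_create(DM_DEVICE_REMOVE)))
            return 1;
        if (!dm_task_set_name(dmt, "test-dev"))
            return 1;

        /* Registers a udev-sync cookie (backed by a SysV semaphore). */
        if (!dm_task_set_cookie(dmt, &cookie, 0))
            return 1;

        if (!dm_task_run(dmt))
            fprintf(stderr, "remove ioctl failed\n");

        /* Blocks until "dmsetup udevcomplete <cookie>" is run from
         * 95-dm-notify.rules. With udevd stopped, the uevent is lost,
         * nothing ever completes the cookie, and this call never
         * returns - the dm_udev_wait() seen in the docker backtrace. */
        dm_udev_wait(cookie);

        dm_task_destroy(dmt);
        return 0;
    }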

Version-Release number of selected component (if applicable):
 systemd-219-62.el7_6.7.x86_64
 docker-1.13.1-96.gitb2f74b2.el7.x86_64
 device-mapper-1.02.149-10.el7_6.8.x86_64


How reproducible:
 Easily

Steps to Reproduce:
1. Install docker

    # yum install docker -y

2. Set up docker storage to use devicemapper. The example below is from a system with an unused /dev/vdb

    # cat << EOF > /etc/sysconfig/docker-storage-setup 
    VG=docker-vg
    DEVS=/dev/vdb
    EOF

    # docker-storage-setup

3. Start a "test" container

    # docker run -d --name test registry.access.redhat.com/rhel7-minimal bash -c "while true; do sleep 1; done"

4. Simulate an update race window by stopping the udev services in the manner below

    # systemctl stop systemd-udevd-control.socket systemd-udevd-kernel.socket systemd-udevd.service

5. Issue a "docker kill" to the test container

    # docker kill test &

6. Further docker commands hang:

    # docker ps

7. Capture a stack trace of the docker daemon

    # kill -SIGUSR1 $(cat /var/run/docker.pid)

8. Verify that one of the goroutines is blocked waiting in a call path similar to the following:

<snip>
    goroutine 233 [syscall]:
    github.com/docker/docker/pkg/devicemapper._Cfunc_dm_udev_wait(0xc40d4d580a, 0x0)
            _cgo_gotypes.go:411 +0x4d
    github.com/docker/docker/pkg/devicemapper.dmUdevWaitFct(0xd4d580a, 0x43e044)
            /builddir/build/BUILD/docker-b2f74b220126bddbae585cfe2406498ced3ae065/_build/src/github.com/docker/docker/pkg/devicemapper/devmapper_wrapper.go:223 +0x2a
    github.com/docker/docker/pkg/devicemapper.UdevWait(0xc420add4f8, 0x0, 0x0)
            /builddir/build/BUILD/docker-b2f74b220126bddbae585cfe2406498ced3ae065/_build/src/github.com/docker/docker/pkg/devicemapper/devmapper.go:255 +0x39
    github.com/docker/docker/pkg/devicemapper.RemoveDeviceDeferred(0xc420d73200, 0x56, 0x0, 0x0)
            /builddir/build/BUILD/docker-b2f74b220126bddbae585cfe2406498ced3ae065/_build/src/github.com/docker/docker/pkg/devicemapper/devmapper.go:377 +0x3bb
    github.com/docker/docker/daemon/graphdriver/devmapper.(*DeviceSet).deactivateDeviceMode(0xc4203e24e0, 0xc42033c940, 0x1, 0x0, 0x0)
            /builddir/build/BUILD/docker-b2f74b220126bddbae585cfe2406498ced3ae065/_build/src/github.com/docker/docker/daemon/graphdriver/devmapper/deviceset.go:2125 +0x1a9
    github.com/docker/docker/daemon/graphdriver/devmapper.(*DeviceSet).deactivateDevice(0xc4203e24e0, 0xc42033c940, 0x0, 0x0)
            /builddir/build/BUILD/docker-b2f74b220126bddbae585cfe2406498ced3ae065/_build/src/github.com/docker/docker/daemon/graphdriver/devmapper/deviceset.go:2107 +0x40
    github.com/docker/docker/daemon/graphdriver/devmapper.(*DeviceSet).UnmountDevice(0xc4203e24e0, 0xc42159d200, 0x40, 0xc420fe6cb0, 0x61, 0x0, 0x0)
            /builddir/build/BUILD/docker-b2f74b220126bddbae585cfe2406498ced3ae065/_build/src/github.com/docker/docker/daemon/graphdriver/devmapper/deviceset.go:2413 +0x46d
    github.com/docker/docker/daemon/graphdriver/devmapper.(*Driver).Put(0xc420350910, 0xc42159d200, 0x40, 0xeb3ab4, 0xc420d730e0)
            /builddir/build/BUILD/docker-b2f74b220126bddbae585cfe2406498ced3ae065/_build/src/github.com/docker/docker/daemon/graphdriver/devmapper/driver.go:222 +0x137
    github.com/docker/docker/layer.(*referencedRWLayer).Unmount(0xc4200c0e48, 0xa8b3dfa, 0xed4beb799)
            /builddir/build/BUILD/docker-b2f74b220126bddbae585cfe2406498ced3ae065/_build/src/github.com/docker/docker/layer/mounted_layer.go:98 +0x4d
    github.com/docker/docker/daemon.(*Daemon).Unmount(0xc420435400, 0xc420b29a00, 0x0, 0xffffffffffffffff)
            /builddir/build/BUILD/docker-b2f74b220126bddbae585cfe2406498ced3ae065/_build/src/github.com/docker/docker/daemon/daemon.go:920 +0x38
    github.com/docker/docker/daemon.(*Daemon).conditionalUnmountOnCleanup(0xc420435400, 0xc420b29a00, 0xc42101f830, 0x5d2cc099)
            /builddir/build/BUILD/docker-b2f74b220126bddbae585cfe2406498ced3ae065/_build/src/github.com/docker/docker/daemon/daemon_unix.go:1134 +0x35
    github.com/docker/docker/daemon.(*Daemon).Cleanup(0xc420435400, 0xc420b29a00)
            /builddir/build/BUILD/docker-b2f74b220126bddbae585cfe2406498ced3ae065/_build/src/github.com/docker/docker/daemon/start.go:208 +0x8b
    github.com/docker/docker/daemon.(*Daemon).StateChanged(0xc420435400, 0xc420f1a0c0, 0x40, 0xc420adcbdc, 0x4, 0x8900000000, 0x0, 0x0, 0x0, 0x0, ...)
            /builddir/build/BUILD/docker-b2f74b220126bddbae585cfe2406498ced3ae065/_build/src/github.com/docker/docker/daemon/monitor.go:55 +0x473
    github.com/docker/docker/libcontainerd.(*container).handleEvent.func1()
            /builddir/build/BUILD/docker-b2f74b220126bddbae585cfe2406498ced3ae065/_build/src/github.com/docker/docker/libcontainerd/container_unix.go:217 +0x78
    github.com/docker/docker/libcontainerd.(*queue).append.func1(0xc420f1a000, 0x0, 0xc4200c34a0, 0xc4203b95c0)
            /builddir/build/BUILD/docker-b2f74b220126bddbae585cfe2406498ced3ae065/_build/src/github.com/docker/docker/libcontainerd/queue_unix.go:28 +0x30
    created by github.com/docker/docker/libcontainerd.(*queue).append
            /builddir/build/BUILD/docker-b2f74b220126bddbae585cfe2406498ced3ae065/_build/src/github.com/docker/docker/libcontainerd/queue_unix.go:24 +0x161
<snip>


Actual results:
 The kill of the "test" container hangs and the above backtrace is observed

Expected results:
 No hang during kill of the "test" container, the daemon remains responsive, and there is no need to capture the described backtrace.


Additional info:
 Following the split of udev back out of the main package into a subpackage, this problem appears to no longer be present. The following commit altered the operation to a simple restart of the service rather than a full stop/start:

    commit b7ec953f258e07a2996e30ceb12394de9ed56748
    Author: Zbigniew Jędrzejewski-Szmek <zbyszek.pl>
    Date:   Fri Apr 22 14:21:31 2016 -0400
    
        Move udev-related commands to udev's %post
    
        This removed the operation of stopping udevd in %pre, and starting it
        in %post. I think this shouldn't be necessary anymore.

    https://src.fedoraproject.org/rpms/systemd/c/b7ec953f258e07a2996e30ceb12394de9ed56748
Comment 3 Jan Synacek 2019-07-16 07:40:48 UTC
This looks like a docker problem to me. Does devicemapper.dmUdevWaitFct() recover after the udev sockets are back up?

Comment 6 Peter Rajnoha 2019-07-17 11:49:20 UTC
Well, this is an interesting one!

I've just discussed this with Michal Sekletar from systemd and it looks like we need systemd-udevd to support handing over the open uevent netlink socket from the old to the new instance of the udev daemon, hence not closing it and not losing the uevents this way. He's going to investigate this possibility. Actually, we already do something very similar in dmeventd...

Otherwise, from the LVM/DM point of view (as well as docker's DM wrapper), I don't think we can do anything else here than get udevd's uevent processing timeout somehow (e.g. via libudev) and then set that timeout on the semaphores we use for the DM uevent processing notification. But there are still two issues with this:

  - getting the timeout this way is race-prone (but at least "some" timeout would be there preventing complete hang, even though the timeout could be set incorrectly if we restart udevd with different timeout value after we get the timeout value);

  - and then there's the fact that the uevents are still lost in between udevd stop and start (the "restart" during upgrade is still "stop" and "start" inside).
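
For illustration, a bounded wait of the kind described above could be built on semtimedop(2). This is only a sketch, not libdm code; it assumes the cookie's semaphore id (semid) and the timeout value have already been obtained:

    #define _GNU_SOURCE
    #include <errno.h>
    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/ipc.h>
    #include <sys/sem.h>
    #include <time.h>

    /* Wait for the notification semaphore to reach zero, but give up
     * after timeout_sec instead of blocking forever. */
    int wait_for_udev_notification(int semid, time_t timeout_sec)
    {
        struct sembuf op = { .sem_num = 0, .sem_op = 0, .sem_flg = 0 };
        struct timespec ts = { .tv_sec = timeout_sec, .tv_nsec = 0 };

        if (semtimedop(semid, &op, 1, &ts) < 0) {
            if (errno == EAGAIN)
                fprintf(stderr, "udev notification timed out after %lds\n",
                        (long) timeout_sec);
            return -1;
        }
        return 0;
    }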

Comment 7 Jan Synacek 2019-07-18 08:34:28 UTC
As demonstrated by this bug, the assumption that the event is not going to disappear is not safe and makes the code deadlock, which shouldn't ever happen. IMHO it's better to time out and continue running with some data possibly lost (and log about that, of course) than to deadlock because of a false assumption.

(In reply to Peter Rajnoha from comment #6)
>   - getting the timeout this way is race-prone (but at least "some" timeout
> would be there preventing complete hang, even though the timeout could be
> set incorrectly if we restart udevd with different timeout value after we
> get the timeout value);

I don't know how lvm handles this, but I'm not sure where the race would be. If there's no data on the socket within let's say 1 sec, time out and don't read from the socket anymore. Where's the race, exactly?

>   - and then there's the fact that the uevents are still lost in between
> udevd stop and start (the "restart" during upgrade is still "stop" and
> "start" inside).

Well, that's the problem that we're trying to solve... Even with the implementation of the socket handover, it's always a good defensive practice to assume that socket data gets lost.

So, ideally, the fix should be on both sides, correct?

Comment 8 Peter Rajnoha 2019-07-18 10:06:00 UTC
(In reply to Jan Synacek from comment #7)
> As demonstrated by this bug, the assumption that the event is not going to
> disappear is not safe and makes the code deadlock, which shouldn't ever
> happen. IMHO it's better to time out and continue running with some data
> possibly lost (and log about that, of course) than to deadlock because of a
> false assumption.

Timeouts are always wrong. You never know what the best timeout value is. You have different environments with different workloads and different timing - one value can work perfectly in one environment but fail horribly in another.

But as I said, if we were going to use SOME timeout value for the semaphores in libdm, it would be the timeout that udevd is currently set to (...because it doesn't make sense to wait longer than the time udevd has to process the uevent, since the notification that unlocks the libdm semaphore is part of that uevent processing in udevd). But...

> 
> (In reply to Peter Rajnoha from comment #6)
> >   - getting the timeout this way is race-prone (but at least "some" timeout
> > would be there preventing complete hang, even though the timeout could be
> > set incorrectly if we restart udevd with different timeout value after we
> > get the timeout value);
> 
> I don't know how lvm handles this, but I'm not sure where the race would be.
> If there's no data on the socket within let's say 1 sec, time out and don't read
> from the socket anymore. Where's the race, exactly? 

1) the external tool (like LVM) would get the udevd event processing timeout value (say via libudev, and it's 180s)
2) udevd is stopped and started with a different timeout value and processes the uevent with it (say the timeout is now higher, e.g. 360s)
3) the external tool uses the outdated timeout value (so here it would time out sooner than it should - after 180s)

> 
> >   - and then there's the fact that the uevents are still lost in between
> > udevd stop and start (the "restart" during upgrade is still "stop" and
> > "start" inside).
> 
> Well, that's the problem that we're trying to solve... Even with the
> implementation of the socket handover, it's always a good defensive practice
> to assume that socket data gets lost.
> 
> So, ideally, the fix should be on both sides, correct?

Udevd is a core system service; it can't be losing uevents! It MUST do the very best it can to not lose them, as it is the main consumer of them and various other system components depend on it.

Of course, the socket buffer is not infinite, so if it gets full within the time we're restarting udevd, well, that's a problem. But I think we can detect a buffer overrun(?). In that case, udevd can at least execute "udevadm trigger" to re-execute the udev rules. However, if the buffer is sized appropriately, we should be able to reduce this condition to a minimum. Otherwise, all the socket activation for services in systemd would be useless and spoiled - it depends on socket buffers and on services reading that buffer after systemd instantiates the service itself.
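
As a rough sketch of the overrun detection mentioned above (illustration only, not udevd's implementation): a uevent listener sees recv(2) fail with ENOBUFS once the kernel has dropped messages on the netlink socket, and can then fall back to re-triggering events:

    #include <errno.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/socket.h>
    #include <linux/netlink.h>

    int main(void)
    {
        struct sockaddr_nl addr = {
            .nl_family = AF_NETLINK,
            .nl_groups = 1,            /* kernel uevent multicast group */
        };
        char buf[8192];
        int fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_KOBJECT_UEVENT);

        if (fd < 0 || bind(fd, (struct sockaddr *) &addr, sizeof(addr)) < 0)
            return 1;

        for (;;) {
            ssize_t len = recv(fd, buf, sizeof(buf) - 1, 0);

            if (len < 0 && errno == ENOBUFS) {
                /* Receive buffer overran: uevents were dropped. Replay
                 * events for existing devices as a recovery measure. */
                fprintf(stderr, "uevent overrun, re-triggering\n");
                system("udevadm trigger --action=change");
                continue;
            }
            if (len <= 0)
                break;

            buf[len] = '\0';
            printf("uevent: %s\n", buf);   /* "ACTION@devpath" header */
        }

        close(fd);
        return 0;
    }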

But there are theoretical and practical issues here.

There's also the fact that, after all these years, I think this is the very first time I've seen the problem of udevd not handing over the open netlink socket during its restart actually manifest. So for a long time this was just a theoretical issue. That's because situations where udevd is restarted while device reconfigurations are happening in parallel weren't that common - and now they are, it seems.

Comment 9 Peter Rajnoha 2019-07-18 10:20:14 UTC
If LVM/libdm didn't deadlock, we wouldn't have found out so quickly that there's a problem with udevd not handing over the socket properly during restarts - we're not hiding the problems this way, and they at least get the proper urgency to be resolved. When timeouts just happen, the issues simply get less urgency - iow, who cares about some timeout somewhere if the system keeps going. We've had a few sporadically reported hangs with these semaphores over the 10 years we've used them in libdm, but each one pointed to some real problem that then got the urgency and was resolved (or at least it pointed to a problematic component).

Comment 13 Peter Rajnoha 2019-08-20 09:44:08 UTC
This should be fixed on the udevd side - udevd needs to hand over the open netlink socket from the old to the new instance on restart. Otherwise, udevd can lose important uevents...
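
One conceivable handover mechanism (purely illustrative - not necessarily what was or would be implemented for udevd) is systemd's file descriptor store: the exiting instance parks the open netlink fd with the service manager and the next instance retrieves it, so the socket and any queued uevents survive the restart. The sd_* calls below exist only in newer libsystemd than RHEL 7's systemd 219, and the unit would need FileDescriptorStoreMax= set:

    #include <stdlib.h>
    #include <string.h>
    #include <systemd/sd-daemon.h>

    /* Before shutdown: hand the open uevent netlink socket to systemd. */
    void store_uevent_fd(int netlink_fd)
    {
        sd_pid_notify_with_fds(0, 0, "FDSTORE=1\nFDNAME=uevent",
                               &netlink_fd, 1);
    }

    /* On startup: look for a previously stored "uevent" fd. */
    int restore_uevent_fd(void)
    {
        char **names = NULL;
        int n, i, fd = -1;

        n = sd_listen_fds_with_names(1, &names);
        for (i = 0; i < n; i++) {
            if (names && strcmp(names[i], "uevent") == 0)
                fd = SD_LISTEN_FDS_START + i;
            if (names)
                free(names[i]);
        }
        free(names);
        return fd;   /* -1 if nothing was handed over */
    }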

Comment 15 Chris Williams 2020-11-11 21:39:47 UTC
Red Hat Enterprise Linux 7 shipped its final minor release on September 29th, 2020. 7.9 was the last minor release scheduled for RHEL 7.
From initial triage it does not appear that the remaining Bugzillas meet the inclusion criteria for Maintenance Phase 2, so they will now be closed.

From the RHEL life cycle page:
https://access.redhat.com/support/policy/updates/errata#Maintenance_Support_2_Phase
"During Maintenance Support 2 Phase for Red Hat Enterprise Linux version 7,Red Hat defined Critical and Important impact Security Advisories (RHSAs) and selected (at Red Hat discretion) Urgent Priority Bug Fix Advisories (RHBAs) may be released as they become available."

If this BZ was closed in error and meets the above criteria, please re-open it, flag it for 7.9.z, provide suitable business and technical justifications, and follow the process for Accelerated Fixes:
https://source.redhat.com/groups/public/pnt-cxno/pnt_customer_experience_and_operations_wiki/support_delivery_accelerated_fix_release_handbook  

Feature Requests can be re-opened and moved to RHEL 8 if the desired functionality is not already present in the product.

Please reach out to the applicable Product Experience Engineer[0] if you have any questions or concerns.  

[0] https://bugzilla.redhat.com/page.cgi?id=agile_component_mapping.html&product=Red+Hat+Enterprise+Linux+7