Bug 1730055
| Summary: | A systemd package update explicitly stops, and then starts, the udev services, causing hangs in operations that depend on DM event completion | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Kyle Walker <kwalker> |
| Component: | systemd | Assignee: | systemd-maint |
| Status: | CLOSED WONTFIX | QA Contact: | Frantisek Sumsal <fsumsal> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 7.6 | CC: | agk, heinzm, jbrassow, msnitzer, prajnoha, systemd-maint-list, uobergfe, yuokada, zkabelac |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-11-11 21:39:47 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1801675 | | |
This looks like a docker problem to me. Does devicemapper.dmUdevWaitFct() recover after the udev sockets are back up?

Well, this is an interesting one! I've just discussed this with Michal Sekletar from systemd and it looks like we need systemd-udevd to support handing over the open uevent netlink socket from the old to the new instance of the udev daemon, hence not closing it and not losing the uevents this way. He's going to investigate this possibility. Actually, we do something very similar in dmeventd already...

Otherwise, from the LVM/DM point of view (as well as docker's DM wrapper), I don't think we can do anything here other than getting udevd's uevent processing timeout somehow (e.g. via libudev) and then setting that timeout for the semaphores we use for the DM uevent processing notification. But there are still two issues with this:

- getting the timeout this way is race-prone (but at least "some" timeout would be there preventing a complete hang, even though the timeout could be set incorrectly if we restart udevd with a different timeout value after we get the timeout value);
- and then there's the fact that the uevents are still lost between the udevd stop and start (the "restart" during upgrade is still "stop" and "start" inside).

As demonstrated by this bug, the assumption that the event is not going to disappear is not safe and makes the code deadlock, which should never happen. IMHO it's better to time out and continue running with some data possibly lost (and log about that, of course) than to deadlock because of a false assumption.

(In reply to Peter Rajnoha from comment #6)
> - getting the timeout this way is race-prone (but at least "some" timeout
> would be there preventing complete hang, even though the timeout could be
> set incorrectly if we restart udevd with different timeout value after we
> get the timeout value);

I don't know how lvm handles this, but I'm not sure where the race would be.
If there's no data on the socket within, let's say, 1 sec, time out and don't read from the socket anymore. Where's the race, exactly?

> - and then there's the fact that the uevents are still lost in between
> udevd stop and start (the "restart" during upgrade is still "stop" and
> "start" inside).

Well, that's the problem that we're trying to solve... Even with the implementation of the socket handover, it's always good defensive practice to assume that socket data gets lost. So, ideally, the fix should be on both sides, correct?

(In reply to Jan Synacek from comment #7)
> As demonstrated by this bug, the assumption that the event is not going to
> disappear is not safe and makes the code deadlock, which shouldn't ever
> happen. IMHO it's better to time out and continue running with some data
> possibly lost (and log about that, of course) than to deadlock because of a
> false assumption.

Timeouts are always wrong. You never know what the best timeout value is. You have different environments with different workloads and different timing - one value can work perfectly in one environment, but fail horribly in another. But as I said, if we were going to use SOME timeout value for the semaphores in libdm, it would be the timeout that udevd is currently set to (...because it doesn't make sense to wait longer than the time udevd has to process the uevent, since the notification that unlocks the libdm semaphore is part of that uevent processing in udevd). But...

> (In reply to Peter Rajnoha from comment #6)
> > - getting the timeout this way is race-prone (but at least "some" timeout
> > would be there preventing complete hang, even though the timeout could be
> > set incorrectly if we restart udevd with different timeout value after we
> > get the timeout value);
>
> I don't know how lvm handles this, but I'm not sure where the race would be.
> If there's no data on the socket within let's say 1 sec, time out don't read
> from the socket anymore. Where's the race, exactly?
1) the external tool (like LVM) would get the udevd event processing timeout value (say via libudev, and it's 180s)
2) udevd is stopped and started with a different timeout value for processing the uevent (say the timeout is now higher, e.g. 360s)
3) the external tool uses the outdated timeout value (so here it would time out sooner than it should - after 180s)

> > - and then there's the fact that the uevents are still lost in between
> > udevd stop and start (the "restart" during upgrade is still "stop" and
> > "start" inside).
>
> Well, that's the problem that we're trying to solve... Even with the
> implementation of the socket handover, it's always good defensive practice
> to assume that socket data gets lost.
>
> So, ideally, the fix should be on both sides, correct?

Udevd is a core system service; it can't be losing uevents! It MUST do the very best it can to not lose them, as it's the main consumer of them and various other system components depend on it. Of course, the socket buffer is not infinite, so if it gets full within the time we're restarting udevd, well, that's a problem. But I think we can detect a buffer overrun(?). In that case, udevd can at least execute "udevadm trigger" to re-execute the udev rules. However, if the buffer is sized appropriately, we should reduce this condition to a bare minimum. Otherwise, all the socket activation for services in systemd would be useless and spoiled - it depends on socket buffers and on services reading those buffers after systemd instantiates the service itself. But there are theoretical and practical issues here.

There's also the fact that after all those years, I think this is the very first time I've seen this problem with udevd not handing over the open netlink socket during its restart actually manifest. So for a long time, this was just a theoretical issue.
It's because the situation where there is a udevd restart and, at the same time, device reconfigurations happening in parallel wasn't that common, and now it is, it seems. If LVM/libdm didn't deadlock, we wouldn't have found out so quickly that there's a problem with udevd not handing over the socket properly during restarts - we're not hiding the problems this way, and they at least get the proper urgency to be resolved. When timeouts just happen, the issues simply get less urgency - iow, who cares about some timeout somewhere if the system keeps going. We've had a few sporadically reported hangs with these semaphores over the 10 years we've used them in libdm, but each one always pointed to some problem that had urgency then, and it got resolved (or at least it pointed to a problematic component).

This should be fixed on the udevd side - udevd needs to hand over the open netlink socket from the old to the new instance on restart. Otherwise, udevd can lose important uevents...

Red Hat Enterprise Linux 7 shipped its final minor release on September 29th, 2020. 7.9 was the last minor release scheduled for RHEL 7. From initial triage it does not appear the remaining Bugzillas meet the inclusion criteria for Maintenance Phase 2, and they will now be closed.

From the RHEL life cycle page: https://access.redhat.com/support/policy/updates/errata#Maintenance_Support_2_Phase

"During Maintenance Support 2 Phase for Red Hat Enterprise Linux version 7, Red Hat defined Critical and Important impact Security Advisories (RHSAs) and selected (at Red Hat discretion) Urgent Priority Bug Fix Advisories (RHBAs) may be released as they become available."
If this BZ was closed in error and meets the above criteria, please re-open it, flag it for 7.9.z, provide suitable business and technical justifications, and follow the process for Accelerated Fixes: https://source.redhat.com/groups/public/pnt-cxno/pnt_customer_experience_and_operations_wiki/support_delivery_accelerated_fix_release_handbook

Feature Requests can be re-opened and moved to RHEL 8 if the desired functionality is not already present in the product.

Please reach out to the applicable Product Experience Engineer[0] if you have any questions or concerns.

[0] https://bugzilla.redhat.com/page.cgi?id=agile_component_mapping.html&product=Red+Hat+Enterprise+Linux+7
Description of problem:

Since the early days of the udev source migrating to systemd, the update path has included an explicit "stop" and then "start" of the udev services. This was implemented in the earliest days in the following commit:

commit 973c8d4ab5276398ad5e1ab1c78e8fa64207ba61
Author: Kay Sievers <kay>
Date: Thu May 24 22:42:11 2012 +0200

    New upstream release including udev merge

Where the following can be observed:

%pre
<snip>
systemctl stop systemd-udev.service systemd-udev-control.socket systemd-udev-kernel.socket >/dev/null 2>&1 || :
<snip>

%post
<snip>
/bin/systemctl start systemd-udev.service >/dev/null 2>&1 || :

For containerized workloads, which rely on "dmsetup udevcomplete" events, the above can result in hard hangs. Specifically, a hang waiting for the following rule to be exercised:

/usr/lib/udev/rules.d/95-dm-notify.rules
ENV{DM_COOKIE}=="?*", RUN+="/usr/sbin/dmsetup udevcomplete $env{DM_COOKIE}"

Version-Release number of selected component (if applicable):
systemd-219-62.el7_6.7.x86_64
docker-1.13.1-96.gitb2f74b2.el7.x86_64
device-mapper-1.02.149-10.el7_6.8.x86_64

How reproducible:
Easily

Steps to Reproduce:
1. Install docker
    # yum install docker -y
2. Set up docker storage to use devicemapper. Example below from a system with an unused /dev/vdb
    # cat << EOF > /etc/sysconfig/docker-storage-setup
    VG=docker-vg
    DEVS=/dev/vdb
    EOF
    # docker-storage-setup
3. Start a "test" container
    # docker run -d --name test registry.access.redhat.com/rhel7-minimal bash -c "while true; do sleep 1; done"
4. Simulate an update race window by stopping the udev services in the manner below
    # systemctl stop systemd-udevd-control.socket systemd-udevd-kernel.socket systemd-udevd.service
5. Issue a "docker kill" to the test container
    # docker kill test &
6. Observe that further docker commands hang:
    # docker ps
7. Capture a stack trace of the docker daemon
    # kill -SIGUSR1 $(cat /var/run/docker.pid)
8.
Verify that one of the goroutines is blocked waiting in a callpath similar to the following:

<snip>
goroutine 233 [syscall]:
github.com/docker/docker/pkg/devicemapper._Cfunc_dm_udev_wait(0xc40d4d580a, 0x0)
	_cgo_gotypes.go:411 +0x4d
github.com/docker/docker/pkg/devicemapper.dmUdevWaitFct(0xd4d580a, 0x43e044)
	/builddir/build/BUILD/docker-b2f74b220126bddbae585cfe2406498ced3ae065/_build/src/github.com/docker/docker/pkg/devicemapper/devmapper_wrapper.go:223 +0x2a
github.com/docker/docker/pkg/devicemapper.UdevWait(0xc420add4f8, 0x0, 0x0)
	/builddir/build/BUILD/docker-b2f74b220126bddbae585cfe2406498ced3ae065/_build/src/github.com/docker/docker/pkg/devicemapper/devmapper.go:255 +0x39
github.com/docker/docker/pkg/devicemapper.RemoveDeviceDeferred(0xc420d73200, 0x56, 0x0, 0x0)
	/builddir/build/BUILD/docker-b2f74b220126bddbae585cfe2406498ced3ae065/_build/src/github.com/docker/docker/pkg/devicemapper/devmapper.go:377 +0x3bb
github.com/docker/docker/daemon/graphdriver/devmapper.(*DeviceSet).deactivateDeviceMode(0xc4203e24e0, 0xc42033c940, 0x1, 0x0, 0x0)
	/builddir/build/BUILD/docker-b2f74b220126bddbae585cfe2406498ced3ae065/_build/src/github.com/docker/docker/daemon/graphdriver/devmapper/deviceset.go:2125 +0x1a9
github.com/docker/docker/daemon/graphdriver/devmapper.(*DeviceSet).deactivateDevice(0xc4203e24e0, 0xc42033c940, 0x0, 0x0)
	/builddir/build/BUILD/docker-b2f74b220126bddbae585cfe2406498ced3ae065/_build/src/github.com/docker/docker/daemon/graphdriver/devmapper/deviceset.go:2107 +0x40
github.com/docker/docker/daemon/graphdriver/devmapper.(*DeviceSet).UnmountDevice(0xc4203e24e0, 0xc42159d200, 0x40, 0xc420fe6cb0, 0x61, 0x0, 0x0)
	/builddir/build/BUILD/docker-b2f74b220126bddbae585cfe2406498ced3ae065/_build/src/github.com/docker/docker/daemon/graphdriver/devmapper/deviceset.go:2413 +0x46d
github.com/docker/docker/daemon/graphdriver/devmapper.(*Driver).Put(0xc420350910, 0xc42159d200, 0x40, 0xeb3ab4, 0xc420d730e0)
	/builddir/build/BUILD/docker-b2f74b220126bddbae585cfe2406498ced3ae065/_build/src/github.com/docker/docker/daemon/graphdriver/devmapper/driver.go:222 +0x137
github.com/docker/docker/layer.(*referencedRWLayer).Unmount(0xc4200c0e48, 0xa8b3dfa, 0xed4beb799)
	/builddir/build/BUILD/docker-b2f74b220126bddbae585cfe2406498ced3ae065/_build/src/github.com/docker/docker/layer/mounted_layer.go:98 +0x4d
github.com/docker/docker/daemon.(*Daemon).Unmount(0xc420435400, 0xc420b29a00, 0x0, 0xffffffffffffffff)
	/builddir/build/BUILD/docker-b2f74b220126bddbae585cfe2406498ced3ae065/_build/src/github.com/docker/docker/daemon/daemon.go:920 +0x38
github.com/docker/docker/daemon.(*Daemon).conditionalUnmountOnCleanup(0xc420435400, 0xc420b29a00, 0xc42101f830, 0x5d2cc099)
	/builddir/build/BUILD/docker-b2f74b220126bddbae585cfe2406498ced3ae065/_build/src/github.com/docker/docker/daemon/daemon_unix.go:1134 +0x35
github.com/docker/docker/daemon.(*Daemon).Cleanup(0xc420435400, 0xc420b29a00)
	/builddir/build/BUILD/docker-b2f74b220126bddbae585cfe2406498ced3ae065/_build/src/github.com/docker/docker/daemon/start.go:208 +0x8b
github.com/docker/docker/daemon.(*Daemon).StateChanged(0xc420435400, 0xc420f1a0c0, 0x40, 0xc420adcbdc, 0x4, 0x8900000000, 0x0, 0x0, 0x0, 0x0, ...)
	/builddir/build/BUILD/docker-b2f74b220126bddbae585cfe2406498ced3ae065/_build/src/github.com/docker/docker/daemon/monitor.go:55 +0x473
github.com/docker/docker/libcontainerd.(*container).handleEvent.func1()
	/builddir/build/BUILD/docker-b2f74b220126bddbae585cfe2406498ced3ae065/_build/src/github.com/docker/docker/libcontainerd/container_unix.go:217 +0x78
github.com/docker/docker/libcontainerd.(*queue).append.func1(0xc420f1a000, 0x0, 0xc4200c34a0, 0xc4203b95c0)
	/builddir/build/BUILD/docker-b2f74b220126bddbae585cfe2406498ced3ae065/_build/src/github.com/docker/docker/libcontainerd/queue_unix.go:28 +0x30
created by github.com/docker/docker/libcontainerd.(*queue).append
	/builddir/build/BUILD/docker-b2f74b220126bddbae585cfe2406498ced3ae065/_build/src/github.com/docker/docker/libcontainerd/queue_unix.go:24 +0x161
<snip>

Actual results:
The kill of the "test" container hangs, and the above backtrace is observed.

Expected results:
No hang during the kill of the "test" container; the daemon remains responsive, and there is no need to capture the described backtrace.

Additional info:
Following the split of udev back out of the main package into a subpackage, this problem seems to no longer be present. The following commit[^2] altered the operation to just restart the service as opposed to a full stop/start:

commit b7ec953f258e07a2996e30ceb12394de9ed56748
Author: Zbigniew Jędrzejewski-Szmek <zbyszek.pl>
Date: Fri Apr 22 14:21:31 2016 -0400

    Move udev-related commands to udev's %post

This removed the operation of stopping udevd in %pre and starting it in %post; I think this shouldn't be necessary anymore.

[^2]: https://src.fedoraproject.org/rpms/systemd/c/b7ec953f258e07a2996e30ceb12394de9ed56748