Red Hat Bugzilla – Bug 1290249
sd-event malfunction can cause an event loop breakage, systemctl hang/reboot needed
Last modified: 2016-01-25 22:21:06 EST
Created attachment 1104182 [details]
strace from systemd while in broken event loop state
Description of problem:
The currently available version of systemd-219 for Fedora 22 is subject to a malfunction in the sd-event pending_prioq_compare function, which can swap a disabled event source with an enabled one. This breaks the event loop, after which systemctl calls hang and a reboot is eventually required.
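To illustrate how a priority-queue comparator can put a disabled source ahead of an enabled one, here is a minimal sketch. It is not systemd's code: the `Source` class, its fields, and both comparator functions are hypothetical, and it assumes purely for illustration that the faulty ordering consults another key (an iteration counter) before the enabled state. The actual function and patch may differ.

```python
from dataclasses import dataclass
from functools import cmp_to_key


@dataclass
class Source:
    """Hypothetical stand-in for an event source (not systemd's struct)."""
    name: str
    enabled: bool
    iteration: int  # assumed: last loop iteration that touched this source


def _cmp(a, b):
    # Three-way comparison: -1, 0 or 1.
    return (a > b) - (a < b)


def compare_iteration_first(x, y):
    # Assumed faulty ordering: the iteration counter wins over enabled state,
    # so a stale disabled source can sort ahead of an enabled one.
    return _cmp(x.iteration, y.iteration) or _cmp(not x.enabled, not y.enabled)


def compare_enabled_first(x, y):
    # Ordering with the enabled check first: enabled sources always lead.
    return _cmp(not x.enabled, not y.enabled) or _cmp(x.iteration, y.iteration)


sources = [
    Source("stale-disabled", enabled=False, iteration=1),
    Source("fresh-enabled", enabled=True, iteration=2),
]

head_bad = sorted(sources, key=cmp_to_key(compare_iteration_first))[0]
head_good = sorted(sources, key=cmp_to_key(compare_enabled_first))[0]

print(head_bad.name)   # stale-disabled
print(head_good.name)  # fresh-enabled
```

A loop that only looks at the head of the queue and stops when it finds a disabled source would, under the first ordering, never reach the enabled one behind it.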
The issue has been identified and fixed upstream:
Version-Release number of selected component (if applicable):
With syscall tracing we were able to observe (under production load) PID 1 looping on epoll_wait with POLLOUT events, after our monitoring system noticed 'systemctl' processes piling up. The systems in question run between 5000 and 7000 units. I've attached the strace.
We were _not_ able to reproduce the issue by following the systemd mailing list post, which described a later version (227) than the 219 we run.
Steps to Reproduce:
1. Attach gdb to PID 1 and set a breakpoint: b pending_prioq_compare
2. break the sd-event queue
3. Inspect the x and y locals and look for a disabled event source; if there is none, continue (a gdb script can help)
4. strace PID 1 to observe the infinite POLLOUT loop / broken sd-event loop
Actual results:
- systemctl processes pile up non-deterministically
- heavy epoll_wait activity by PID 1 with an ever-growing POLLOUT list
Expected results:
- systemctl processes not piling up
- normal (paired) epoll_wait behavior from PID 1, no disabled event sources swapped with enabled ones
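The heavy epoll_wait activity described above is what level-triggered epoll does whenever a writable fd stays registered for output and nothing acts on the event. The following Linux-only sketch reproduces that symptom on a plain pipe. It does not involve sd-event or systemd at all, and `count_wakeups` and its parameters are hypothetical names chosen for the example.

```python
import os
import select


def count_wakeups(handle_event, rounds=5, timeout=0.05):
    """Count how many of `rounds` polls report EPOLLOUT on a pipe's write end."""
    r, w = os.pipe()
    ep = select.epoll()
    ep.register(w, select.EPOLLOUT)  # level-triggered by default
    wakeups = 0
    try:
        for _ in range(rounds):
            if ep.poll(timeout):
                wakeups += 1
                if handle_event:
                    # A working loop reacts to the event; here it simply
                    # drops interest in the fd so it is not reported again.
                    ep.unregister(w)
    finally:
        ep.close()
        os.close(r)
        os.close(w)
    return wakeups


print(count_wakeups(handle_event=False))  # 5: every poll returns immediately
print(count_wakeups(handle_event=True))   # 1: later polls just time out
```

If the source that should handle the event is never dispatched, every poll returns at once with the same POLLOUT event, which would match the looping seen in the attached strace.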
We have deployed a custom build with this patch and have so far not been able to observe the infinite loop/sd-event malfunction under load.
Would it be possible to get this patch (the one in the GitHub pull request link) backported to Fedora 22's systemd-219 RPM?
Created attachment 1104183 [details]
Patch from systemd/systemd#1366 in .patch format
Downloaded the .patch from https://github.com/pocek/systemd/commit/8046c4576a68977a1089d2585866bfab8152661b.patch and uploaded it here.
systemd-219-27.fc22 has been submitted as an update to Fedora 22. https://bodhi.fedoraproject.org/updates/FEDORA-2016-7365dd5df4
systemd-219-27.fc22 has been pushed to the Fedora 22 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2016-7365dd5df4
systemd-219-27.fc22 has been pushed to the Fedora 22 stable repository. If problems still persist, please make note of it in this bug report.