Bug 1290249 - sd-event malfunction can cause an event loop breakage, systemctl hang/reboot needed
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: systemd
Version: 22
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: systemd-maint
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2015-12-09 22:51 UTC by AJ Christensen
Modified: 2016-01-26 03:21 UTC
CC List: 7 users

Fixed In Version: systemd-219-27.fc22
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-01-26 03:21:06 UTC
Type: Bug
Embargoed:


Attachments
strace from systemd while in broken event loop state (1.84 KB, text/plain)
2015-12-09 22:51 UTC, AJ Christensen
Patch from systemd/systemd#1366 in .patch format (1.77 KB, patch)
2015-12-09 22:56 UTC, AJ Christensen

Description AJ Christensen 2015-12-09 22:51:14 UTC
Created attachment 1104182 [details]
strace from systemd while in broken event loop state

Description of problem:
The systemd-219 build currently available for Fedora 22 is affected by a bug in the sd-event pending_prioq_compare function, which can swap a disabled event source with an enabled one. This breaks the event loop; systemctl calls then hang, and a reboot is eventually required.

The issue has been identified and fixed upstream:
http://lists.freedesktop.org/archives/systemd-devel/2015-September/034356.html
https://github.com/systemd/systemd/pull/1366
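
To illustrate the failure mode described above, here is a minimal Python sketch of a priority queue whose ordering either ignores or respects the enabled state. It is a simplified model, not systemd's code: the `Source` class, its fields, and both key functions are hypothetical names chosen for illustration, and the real comparator has more keys than shown here.

```python
import heapq
from dataclasses import dataclass


@dataclass
class Source:
    # Hypothetical stand-in for an event source; not systemd's struct.
    name: str
    enabled: bool
    priority: int  # lower value is served first


def key_priority_only(s):
    # Flawed ordering: the enabled state is not consulted at all.
    return (s.priority,)


def key_enabled_first(s):
    # Enabled sources rank ahead of disabled ones (False sorts before True).
    return (not s.enabled, s.priority)


def head(sources, key):
    # Build a min-heap and return the source the loop would pick next.
    # The index breaks ties so Source objects are never compared directly.
    heap = [(key(s), i, s) for i, s in enumerate(sources)]
    heapq.heapify(heap)
    return heap[0][2]
```

With a disabled source at priority -10 and an enabled one at priority 0, `key_priority_only` leaves the disabled source at the head of the queue, while `key_enabled_first` selects the enabled one.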


Version-Release number of selected component (if applicable):
systemd-219-25.fc22.x86_64


How reproducible:
6/10

Using syscall tracing under production load, we observed epoll_wait looping on POLLOUT after our monitoring system noticed 'systemctl' processes piling up. The affected systems each run between 5000 and 7000 units. I've attached the strace.

We were _not_ able to reproduce the issue using the method from the systemd mailing list post, which describes a later version (227, versus 219 here).

Steps to Reproduce:
1. Attach gdb to PID 1 and set a breakpoint: b pending_prioq_compare
2. Break the sd-event queue.
3. Inspect the x and y locals and look for a disabled event source; if there is none, continue (a gdb script can help).
4. strace PID 1 to observe the infinite POLLOUT loop / broken sd-event loop.
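
For the last step, a small script can make the loop easier to spot in a saved trace. The following Python sketch counts epoll_wait calls that report POLLOUT, grouped by epoll file descriptor. It assumes the trace was saved as plain strace text with one syscall per line; the function name is hypothetical, and the "POLLOUT" substring match is an assumption about how the event is printed (it also matches EPOLLOUT).

```python
import re
from collections import Counter

# Matches the epoll file descriptor in a line such as "epoll_wait(4, ...".
EPOLL_WAIT_RE = re.compile(r"\bepoll_wait\((\d+),")


def count_pollout_wakeups(lines):
    """Count epoll_wait calls that mention POLLOUT, keyed by epoll fd."""
    counts = Counter()
    for line in lines:
        match = EPOLL_WAIT_RE.search(line)
        if match and "POLLOUT" in line:
            counts[int(match.group(1))] += 1
    return counts
```

Passing an open trace file to `count_pollout_wakeups` returns a `Counter`; a count that keeps growing between two captures taken a few seconds apart is consistent with the loop described in this report, though the threshold for "too many" is a judgment call.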

Actual results:
- systemctl processes pile up non-deterministically
- heavy epoll_wait activity by PID 1 with an ever-growing POLLOUT list

Expected results:
- systemctl processes not piling up
- normal (paired) epoll_wait behavior from PID 1, no disabled event sources swapped with enabled ones

Additional info:
We have deployed a custom build with this patch and have so far not been able to observe the infinite loop/sd-event malfunction under load.

Comment 1 Joe Miller 2015-12-09 22:54:00 UTC
Would it be possible to get this patch (the one in the GitHub pull request linked above) backported to the Fedora 22 systemd-219 RPM?

Comment 2 AJ Christensen 2015-12-09 22:56:38 UTC
Created attachment 1104183 [details]
Patch from systemd/systemd#1366 in .patch format

Downloaded the .patch from https://github.com/pocek/systemd/commit/8046c4576a68977a1089d2585866bfab8152661b.patch and uploaded it here.

Comment 4 Fedora Update System 2016-01-07 12:34:03 UTC
systemd-219-27.fc22 has been submitted as an update to Fedora 22. https://bodhi.fedoraproject.org/updates/FEDORA-2016-7365dd5df4

Comment 5 Fedora Update System 2016-01-09 04:25:57 UTC
systemd-219-27.fc22 has been pushed to the Fedora 22 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2016-7365dd5df4

Comment 6 Fedora Update System 2016-01-26 03:20:58 UTC
systemd-219-27.fc22 has been pushed to the Fedora 22 stable repository. If problems still persist, please make note of it in this bug report.

