Bug 1290249

Summary: sd-event malfunction can cause an event loop breakage, systemctl hang/reboot needed
Product: [Fedora] Fedora
Reporter: AJ Christensen <aj>
Component: systemd
Assignee: systemd-maint
Status: CLOSED ERRATA
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: high
Docs Contact:
Priority: unspecified
Version: 22
CC: joeym, johannbg, lnykryn, msekleta, s, systemd-maint, zbyszek
Target Milestone: ---
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version: systemd-219-27.fc22
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-01-26 03:21:06 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
- strace from systemd while in broken event loop state (flags: none)
- Patch from systemd/systemd#1366 in .patch format (flags: none)

Description AJ Christensen 2015-12-09 22:51:14 UTC
Created attachment 1104182
strace from systemd while in broken event loop state

Description of problem:
The currently available systemd-219 build for Fedora 22 is affected by a malfunction in the sd-event pending_prioq_compare function, which can swap a disabled event source with an enabled one. This breaks the event loop; systemctl then hangs, and the system eventually requires a reboot.
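
To illustrate the kind of ordering involved, here is a minimal sketch in Python. It is not systemd's C code. It assumes a queue that is meant to keep enabled event sources ahead of disabled ones, and shows how a comparison key that ignores the enabled flag lets a disabled source reach the front. The names `Source`, `good_key`, `bad_key`, and `front` are hypothetical.

```python
import heapq
from dataclasses import dataclass


@dataclass(frozen=True)
class Source:
    """Hypothetical stand-in for an event source (not systemd's struct)."""
    name: str
    enabled: bool
    priority: int  # lower value = dispatched earlier


def good_key(s: Source):
    # Enabled sources sort first (False < True), then by priority.
    return (not s.enabled, s.priority)


def bad_key(s: Source):
    # Assumed faulty variant: ignores the enabled flag entirely.
    return (s.priority,)


def front(sources, key):
    """Return the source a min-heap built with `key` would yield first."""
    # The index i breaks ties so Source objects are never compared directly.
    heap = [(key(s), i, s) for i, s in enumerate(sources)]
    heapq.heapify(heap)
    return heap[0][2]


sources = [
    Source("disabled-urgent", enabled=False, priority=-10),
    Source("enabled-normal", enabled=True, priority=0),
]
print(front(sources, good_key).name)
print(front(sources, bad_key).name)
```

With `good_key` the enabled source is at the front; with `bad_key` the disabled one takes its place, which is the sort of swap described above.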

The issue has been identified and fixed upstream:
http://lists.freedesktop.org/archives/systemd-devel/2015-September/034356.html
https://github.com/systemd/systemd/pull/1366


Version-Release number of selected component (if applicable):
systemd-219-25.fc22.x86_64


How reproducible:
6/10

With syscall tracing we were able to observe (under production load) epoll_wait looping on POLLOUT after our monitoring system noticed 'systemctl' processes piling up. The systems in question run between 5000 and 7000 units. I've attached the strace.

We were _not_ able to reproduce the issue by following the systemd mailing list post (which described a later version, 227, rather than 219).

Steps to Reproduce:
1. Attach gdb to PID 1 and set a breakpoint: b pending_prioq_compare
2. Break the sd-event queue
3. Inspect the x and y locals for a disabled event source; if there is none, continue (a gdb script can help)
4. strace PID 1 to observe the infinite POLLOUT loop / broken sd-event loop

Actual results:
- systemctl processes pile up non-deterministically
- heavy epoll_wait activity by PID 1 with an ever-growing POLLOUT list
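
A pile-up of systemctl processes can be watched for with a small check such as the following sketch. It is not the monitoring used in this report. It assumes a Linux /proc layout in which /proc/<pid>/comm holds the command name; the function name `count_processes` and the threshold value are hypothetical.

```python
import os


def count_processes(name: str, proc_root: str = "/proc") -> int:
    """Count processes whose command name equals `name`."""
    try:
        entries = os.listdir(proc_root)
    except OSError:
        return 0  # no /proc available (e.g. not Linux)
    count = 0
    for entry in entries:
        if not entry.isdigit():
            continue  # only numeric directories are PIDs
        try:
            with open(os.path.join(proc_root, entry, "comm")) as f:
                if f.read().strip() == name:
                    count += 1
        except OSError:
            continue  # process exited or is not readable
    return count


if __name__ == "__main__":
    THRESHOLD = 20  # hypothetical alert level
    n = count_processes("systemctl")
    if n > THRESHOLD:
        print(f"warning: {n} systemctl processes running")
```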

Expected results:
- systemctl processes not piling up
- normal (paired) epoll_wait behavior from PID 1, no disabled event sources swapped with enabled ones

Additional info:
We have deployed a custom build with this patch and have so far not been able to observe the infinite loop/sd-event malfunction under load.

Comment 1 Joe Miller 2015-12-09 22:54:00 UTC
Would it be possible to get this patch (the one in the GitHub pull request link) backported to Fedora 22's systemd-219 RPM?

Comment 2 AJ Christensen 2015-12-09 22:56:38 UTC
Created attachment 1104183
Patch from systemd/systemd#1366 in .patch format

Downloaded the .patch from https://github.com/pocek/systemd/commit/8046c4576a68977a1089d2585866bfab8152661b.patch and uploaded it here.

Comment 4 Fedora Update System 2016-01-07 12:34:03 UTC
systemd-219-27.fc22 has been submitted as an update to Fedora 22. https://bodhi.fedoraproject.org/updates/FEDORA-2016-7365dd5df4

Comment 5 Fedora Update System 2016-01-09 04:25:57 UTC
systemd-219-27.fc22 has been pushed to the Fedora 22 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2016-7365dd5df4

Comment 6 Fedora Update System 2016-01-26 03:20:58 UTC
systemd-219-27.fc22 has been pushed to the Fedora 22 stable repository. If problems still persist, please make note of it in this bug report.