Bug 1290249 - sd-event malfunction can cause an event loop breakage, systemctl hang/reboot needed
Status: CLOSED ERRATA
Product: Fedora
Classification: Fedora
Component: systemd
Version: 22
Hardware: x86_64 Linux
Priority: unspecified
Severity: high
Assigned To: systemd-maint
QA Contact: Fedora Extras Quality Assurance
Reported: 2015-12-09 17:51 EST by AJ Christensen
Modified: 2016-01-25 22:21 EST (History)
Fixed In Version: systemd-219-27.fc22
Doc Type: Bug Fix
Last Closed: 2016-01-25 22:21:06 EST
Type: Bug


Attachments
strace from systemd while in broken event loop state (1.84 KB, text/plain)
2015-12-09 17:51 EST, AJ Christensen
Patch from systemd/systemd#1366 in .patch format (1.77 KB, patch)
2015-12-09 17:56 EST, AJ Christensen

Description AJ Christensen 2015-12-09 17:51:14 EST
Created attachment 1104182 [details]
strace from systemd while in broken event loop state

Description of problem:
The systemd-219 build currently available for Fedora 22 is affected by a bug in the sd-event pending_prioq_compare function, which can swap a disabled event source with an enabled one. This breaks the event loop; systemctl calls then hang, and the system eventually needs a reboot.

The issue has been identified and fixed upstream:
http://lists.freedesktop.org/archives/systemd-devel/2015-September/034356.html
https://github.com/systemd/systemd/pull/1366
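
The failure mode described above can be illustrated with a small, self-contained sketch. This is not the systemd code and not the upstream patch; the Source fields and both comparators are hypothetical names made up for illustration. It only shows the general idea: if a priority-queue comparator checks some secondary key before the enabled state, a disabled source can end up ahead of an enabled one.

```python
from dataclasses import dataclass
from functools import cmp_to_key


@dataclass(frozen=True)
class Source:
    # Hypothetical stand-in for an event source; not the sd-event struct.
    name: str
    enabled: bool
    iteration: int  # hypothetical "last serviced" counter
    priority: int


def cmp_int(a, b):
    # Classic three-way comparison: -1, 0 or 1.
    return (a > b) - (a < b)


def compare_iteration_first(x, y):
    # Flawed ordering (assumption): the counter is compared before the
    # enabled state, so a disabled source with a low counter sorts first.
    return (cmp_int(x.iteration, y.iteration)
            or cmp_int(not x.enabled, not y.enabled)
            or cmp_int(x.priority, y.priority))


def compare_enabled_first(x, y):
    # Enabled sources always sort ahead of disabled ones.
    return (cmp_int(not x.enabled, not y.enabled)
            or cmp_int(x.iteration, y.iteration)
            or cmp_int(x.priority, y.priority))


def head(sources, compare):
    # The element a priority queue built on `compare` would hand out first.
    return min(sources, key=cmp_to_key(compare))


sources = [
    Source("timer", enabled=True, iteration=5, priority=0),
    Source("stale-io", enabled=False, iteration=1, priority=0),
]

print(head(sources, compare_iteration_first).name)
print(head(sources, compare_enabled_first).name)
```

With the first comparator the disabled "stale-io" source reaches the head of the queue; with the second the enabled "timer" source does. The real change is in the pull request linked above.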


Version-Release number of selected component (if applicable):
systemd-219-25.fc22.x86_64


How reproducible:
6/10

With syscall tracing we were able to observe (under production load) the epoll_wait POLLOUT loop after our monitoring system noticed 'systemctl' processes piling up. The systems in question run between 5000 and 7000 units. I've attached the strace.
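
For anyone who wants a similar check for piled-up 'systemctl' processes, here is a minimal sketch, assuming a Linux /proc layout where each numeric directory has a comm file. It is not our monitoring system; the function name and the proc_root parameter are hypothetical, and any alert threshold is left to the reader.

```python
import os


def count_processes_named(name, proc_root="/proc"):
    """Count processes whose comm matches `name` (hypothetical helper)."""
    count = 0
    for entry in os.listdir(proc_root):
        # Only numeric directory names are PIDs.
        if not entry.isdigit():
            continue
        comm_path = os.path.join(proc_root, entry, "comm")
        try:
            with open(comm_path, "r", encoding="utf-8", errors="replace") as fh:
                comm = fh.read().strip()
        except OSError:
            # The process exited between listdir() and open(), or is unreadable.
            continue
        if comm == name:
            count += 1
    return count


if __name__ == "__main__":
    print(count_processes_named("systemctl"))
```

A steadily growing count over successive runs is the symptom we saw before the hang.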

We were _not_ able to reproduce the issue using the steps in the systemd mailing list post, which described a later version (227, versus our 219).

Steps to Reproduce:
1. Attach gdb to PID 1 and set a breakpoint: b pending_prioq_compare
2. Break the sd-event queue
3. Inspect the x and y locals and look for a disabled event source; if there is none, continue (a gdb script can help)
4. strace PID 1 to observe the infinite POLLOUT loop / broken sd-event loop

Actual results:
- systemctl processes piling up non-deterministically
- heavy epoll_wait activity by PID 1 with an ever-growing POLLOUT list

Expected results:
- systemctl processes not piling up
- normal (paired) epoll_wait behavior from PID 1, no disabled event sources swapped with enabled ones

Additional info:
We have deployed a custom build with this patch and have so far not observed the infinite loop/sd-event malfunction under load.
Comment 1 Joe Miller 2015-12-09 17:54:00 EST
Would it be possible to get this patch (the one in the GitHub pull request linked above) back-ported to Fedora 22's systemd-219 RPM?
Comment 2 AJ Christensen 2015-12-09 17:56 EST
Created attachment 1104183 [details]
Patch from systemd/systemd#1366 in .patch format

Downloaded the .patch from https://github.com/pocek/systemd/commit/8046c4576a68977a1089d2585866bfab8152661b.patch and uploaded it here.
Comment 4 Fedora Update System 2016-01-07 07:34:03 EST
systemd-219-27.fc22 has been submitted as an update to Fedora 22. https://bodhi.fedoraproject.org/updates/FEDORA-2016-7365dd5df4
Comment 5 Fedora Update System 2016-01-08 23:25:57 EST
systemd-219-27.fc22 has been pushed to the Fedora 22 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2016-7365dd5df4
Comment 6 Fedora Update System 2016-01-25 22:20:58 EST
systemd-219-27.fc22 has been pushed to the Fedora 22 stable repository. If problems still persist, please make note of it in this bug report.