Red Hat Bugzilla – Bug 848182
Runaway site processes are consuming 100% cpu
Last modified: 2015-05-14 21:13:21 EDT
Description of problem:
Some site processes consume a lot of CPU while apparently doing no work, and they never seem to finish.
If I restart Apache, things go back to normal for a while, but after a few hours these processes start appearing again.
Here's what they look like in top (note the unusual TIME+):
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM     TIME+ COMMAND
17598 libra_pa  20   0  213m  99m 3652 R 66.7  1.4 659:24.89 Rails: /var/www/stickshift/site
19645 libra_pa  20   0  206m  92m 1984 R 62.5  1.3 653:30.31 Rails: /var/www/stickshift/site
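For scale, the TIME+ values can be converted to hours. This is a minimal sketch that assumes top's default TIME+ layout of minutes:seconds.hundredths; the helper name `time_plus_to_seconds` is made up for illustration.

```python
def time_plus_to_seconds(value: str) -> float:
    """Convert a TIME+ value such as '659:24.89' to seconds.

    Assumes the minutes:seconds.hundredths layout shown in the report.
    """
    minutes, seconds = value.split(":")
    return int(minutes) * 60 + float(seconds)


if __name__ == "__main__":
    # PIDs and TIME+ values copied from the top output in this report.
    for pid, time_plus in [(17598, "659:24.89"), (19645, "653:30.31")]:
        hours = time_plus_to_seconds(time_plus) / 3600
        print(f"PID {pid}: about {hours:.1f} hours of CPU time")
```

Both processes have accumulated roughly 11 hours of CPU time each, which is a lot for workers that have handled only a handful of requests.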
I can't figure out what they're doing, because strace prints nothing (note that I let each strace run for a few minutes before interrupting it).
# time strace -f -s600 -p 17598
Process 17598 attached - interrupt to quit
^CProcess 17598 detached
# time strace -f -s600 -p 19645
Process 19645 attached - interrupt to quit
^CProcess 19645 detached
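strace only reports system calls and signals, so a silent trace combined with high CPU use is consistent with a process spinning in user space. One way to check this, sketched here as an assumption rather than something done in this report, is to compare the user and system CPU counters in /proc/<pid>/stat (fields 14 and 15, utime and stime, in clock ticks). The function names and the sample line below are hypothetical.

```python
from typing import Tuple


def cpu_ticks(stat_line: str) -> Tuple[int, int]:
    """Return (utime, stime) in clock ticks from a /proc/<pid>/stat line.

    The command name (field 2) is wrapped in parentheses and may contain
    spaces, so split after the last ')' before counting fields.
    """
    rest = stat_line.rsplit(")", 1)[1].split()
    # rest[0] is field 3 (state), so field 14 is rest[11] and field 15 is rest[12].
    return int(rest[11]), int(rest[12])


def user_share(stat_line: str) -> float:
    """Fraction of the accumulated CPU time that was spent in user mode."""
    utime, stime = cpu_ticks(stat_line)
    total = utime + stime
    return utime / total if total else 0.0


if __name__ == "__main__":
    # Hypothetical sample line for illustration only, not taken from the report.
    sample = "1234 (my app) R 1 1234 1234 0 -1 4194304 500 0 0 0 9000 1000 0 0 20 0 1 0 100 1000000 200"
    print(f"user-mode share: {user_share(sample):.0%}")
```

On a live system the line would come from reading `/proc/<pid>/stat` for the suspect PID. A share close to 100% that keeps growing while strace stays silent would point to a busy loop in application code rather than blocked I/O.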
Passenger still shows them as active:
App root: /var/www/stickshift/site
* PID: 28694 Sessions: 0 Processed: 3598 Uptime: 6m 46s
* PID: 8410 Sessions: 0 Processed: 5568 Uptime: 18m 2s
* PID: 28708 Sessions: 0 Processed: 13292 Uptime: 41m 30s
* PID: 28384 Sessions: 0 Processed: 8214 Uptime: 43m 10s
* PID: 28678 Sessions: 0 Processed: 213 Uptime: 6m 47s
* PID: 28949 Sessions: 0 Processed: 156 Uptime: 6m 34s
* PID: 28718 Sessions: 0 Processed: 8636 Uptime: 41m 22s
* PID: 17598 Sessions: 1 Processed: 31 Uptime: 14h 19m 27s
* PID: 19645 Sessions: 1 Processed: 5 Uptime: 14h 8m 20s
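The two stuck workers stand out in this listing because they hold an open session, have long uptimes, and have processed very few requests. A minimal sketch of that check follows, assuming the line layout shown above; the function names and both thresholds are arbitrary choices for illustration, not Passenger settings.

```python
import re

# Matches lines like "* PID: 17598 Sessions: 1 Processed: 31 Uptime: 14h 19m 27s".
LINE_RE = re.compile(
    r"PID:\s*(\d+)\s+Sessions:\s*(\d+)\s+Processed:\s*(\d+)\s+Uptime:\s*(.+)"
)
UNIT_SECONDS = {"h": 3600, "m": 60, "s": 1}


def uptime_seconds(text: str) -> int:
    """Convert an uptime such as '14h 19m 27s' to seconds (h, m, s units only)."""
    return sum(int(n) * UNIT_SECONDS[u] for n, u in re.findall(r"(\d+)([hms])", text))


def suspects(status_text: str, min_uptime: int = 3600, max_rate_per_hour: float = 10.0):
    """Return PIDs with an open session, long uptime, and a very low request rate."""
    found = []
    for match in LINE_RE.finditer(status_text):
        pid, sessions, processed = (int(match.group(i)) for i in (1, 2, 3))
        up = uptime_seconds(match.group(4))
        rate = processed / (up / 3600) if up else float("inf")
        if sessions > 0 and up >= min_uptime and rate <= max_rate_per_hour:
            found.append(pid)
    return found


# A subset of the lines from this report.
SAMPLE = """\
* PID: 28694 Sessions: 0 Processed: 3598 Uptime: 6m 46s
* PID: 28708 Sessions: 0 Processed: 13292 Uptime: 41m 30s
* PID: 17598 Sessions: 1 Processed: 31 Uptime: 14h 19m 27s
* PID: 19645 Sessions: 1 Processed: 5 Uptime: 14h 8m 20s
"""

if __name__ == "__main__":
    print(suspects(SAMPLE))
```

With the sample lines, only PIDs 17598 and 19645 are flagged, matching the processes seen in top.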
Version-Release number of selected component (if applicable):
How reproducible:
Sporadic, but reliably reproducible.
Steps to Reproduce:
1. Unknown. I just restart httpd and then after a few hours I see these processes taking up a lot of CPU.
Actual results:
Processes that take up a lot of CPU but do not seem to be doing anything.
Expected results:
Processes that use CPU only when they are actually doing work.
Does this happen in the broker as well?
It doesn't seem to.
Thomas, any updates on getting repro info from production?
Agreed in defect triage that this can miss the sprint while we debug it.
We saw it twice in one day. I restarted httpd both times, and we haven't seen the problem since the second restart.
I'm going to close this bug; if it happens again, I'll reopen it.