| Summary: | OSD SimpleMessenger thread gets stuck in a loop and burns CPU | ||
|---|---|---|---|
| Product: | Red Hat Ceph Storage | Reporter: | Vikhyat Umrao <vumrao> |
| Component: | RADOS | Assignee: | Samuel Just <sjust> |
| Status: | CLOSED NOTABUG | QA Contact: | ceph-qe-bugs <ceph-qe-bugs> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | ||
| Version: | 2.1 | CC: | ceph-eng-bugs, dzafman, icolle, kchai, kdreyer, kurs, nlevine, sjust, skinjo, sweil, vakulkar, vumrao |
| Target Milestone: | rc | ||
| Target Release: | 2.1 | ||
| Hardware: | x86_64 | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | Bug Fix | |
| Doc Text: |
Cause: The Ceph OSD messenger thread could enter an indefinite loop in some scenarios where the network is interrupted between Ceph clients and OSDs.
Consequence: OSDs could consume excessive CPU and become unresponsive, and cluster service could be degraded.
Fix: The OSD code has been altered to avoid infinitely looping in this scenario.
Result: OSDs are more resilient to scenarios that trigger this bug.
|
Story Points: | --- |
| Clone Of: | Environment: | ||
| Last Closed: | 2016-12-08 16:53:30 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
|
Description
Vikhyat Umrao
2016-12-06 19:04:19 UTC
Looks like the jewel backport in https://github.com/ceph/ceph/pull/12341 will probably make v10.2.4.

Problem: A client disconnect can put SimpleMessenger threads in an infinite loop in which the thread tries to read from the socket, gets EAGAIN, and retries. It is unclear exactly what environmental circumstances lead to this state, but Zheng was hitting it in his dev environment when he submitted the fix, and a customer was hitting it on seemingly every OSD on most hosts (pushing the load over 200 on an otherwise idle cluster).

Customer impact: A SimpleMessenger thread gets stuck in a loop and burns CPU. No other known impact (besides the additional system load).

How widespread: No idea. For this customer it happened to all OSDs on most hosts in the cluster, and the OSDs reentered this state shortly after the host was rebooted. It is unclear exactly why this cluster was susceptible when others have not seen the problem.

QE has a few questions that need clarification:

1. Is this QE testable? If "YES", can you please provide the steps to reproduce the bug? If "NO", QE will run the automated regression suite.

Sam, Sage, mind answering Kiran's questions in Comment 18 above? |
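For readers unfamiliar with the failure mode in the problem statement (read from the socket, get EAGAIN, loop), the following is a minimal sketch of the general pattern. It is not the SimpleMessenger code and not the fix in the pull request. The function names `count_eagain_spins` and `wait_then_read` are hypothetical, and a local socket pair stands in for the client/OSD connection. The first function retries a non-blocking read immediately and so spins; the second sleeps in `select()` until the socket is readable or a timeout expires.

```python
import select
import socket


def count_eagain_spins(sock, max_attempts):
    """Retry a non-blocking recv immediately; return how many attempts hit EAGAIN.

    Hypothetical illustration only. The loop is bounded so the demo ends;
    an unbounded version would keep one CPU core busy indefinitely.
    """
    spins = 0
    for _ in range(max_attempts):
        try:
            sock.recv(4096)
        except BlockingIOError:
            # Python raises BlockingIOError for EAGAIN / EWOULDBLOCK:
            # there is nothing to read right now.
            spins += 1
    return spins


def wait_then_read(sock, timeout_s):
    """Sleep in select() until the socket is readable or the timeout expires."""
    readable, _, _ = select.select([sock], [], [], timeout_s)
    if not readable:
        return None  # timed out while idle, without spinning
    return sock.recv(4096)


# A connected local socket pair stands in for the real connection (assumption).
reader, writer = socket.socketpair()
reader.setblocking(False)

# No data has been written yet, so every immediate retry fails with EAGAIN.
spins = count_eagain_spins(reader, max_attempts=1000)

# Waiting for readiness returns nothing while the socket is idle...
idle_result = wait_then_read(reader, timeout_s=0.05)

# ...and returns the payload once the peer has written something.
writer.sendall(b"ping")
data_result = wait_then_read(reader, timeout_s=1.0)

reader.close()
writer.close()
```

Every immediate retry in the first function fails the same way, which is the kind of behaviour that would show up as high CPU use on an otherwise idle host. Whether the actual fix waits for readiness, closes the connection, or does something else is not stated in this report.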