Bug 1402098

Summary: OSD SimpleMessenger thread gets stuck in a loop and burns CPU
Product: Red Hat Ceph Storage
Reporter: Vikhyat Umrao <vumrao>
Component: RADOS
Assignee: Samuel Just <sjust>
Status: CLOSED NOTABUG
QA Contact: ceph-qe-bugs <ceph-qe-bugs>
Severity: medium
Docs Contact:
Priority: medium
Version: 2.1
CC: ceph-eng-bugs, dzafman, icolle, kchai, kdreyer, kurs, nlevine, sjust, skinjo, sweil, vakulkar, vumrao
Target Milestone: rc
Target Release: 2.1
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The Ceph OSD messenger thread could enter an infinite loop in some scenarios where the network is interrupted between Ceph clients and OSDs.
Consequence: OSDs could consume excessive CPU and become unresponsive, and cluster service could be degraded.
Fix: The OSD code has been altered to avoid looping infinitely in this scenario.
Result: OSDs are more resilient to scenarios that trigger this bug.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-12-08 16:53:30 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:

Description Vikhyat Umrao 2016-12-06 19:04:19 UTC
Description of problem:

Backport: http://tracker.ceph.com/issues/14120 : Pipe::do_recv() may loop infinitely


Version-Release number of selected component (if applicable):
Red Hat Ceph Storage 2.1

Comment 2 Ken Dreyer (Red Hat) 2016-12-06 20:25:31 UTC
Looks like the jewel backport in https://github.com/ceph/ceph/pull/12341 will probably make v10.2.4.

Comment 7 Sage Weil 2016-12-07 01:55:48 UTC
Problem:

A client disconnect can put SimpleMessenger threads in an infinite loop that tries to read from the socket, gets EAGAIN, and loops.
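
For illustration only, the general failure mode can be sketched outside of Ceph. The Python below is not the SimpleMessenger code and not the actual fix; `count_busy_spins` and `read_with_wait` are hypothetical helpers, and the timeout and retry limits are arbitrary assumptions. The first function mimics a reader that retries immediately on EAGAIN (capped so the demo terminates); the second waits for readability before retrying and gives up after a bounded number of idle waits.

```python
import select
import socket


def count_busy_spins(sock, nbytes, max_spins):
    """Mimic the problematic pattern: retry recv() immediately on EAGAIN.

    Hypothetical helper. The real loop has no cap; max_spins exists only
    so this demonstration terminates.
    """
    spins = 0
    while spins < max_spins:
        try:
            sock.recv(nbytes)
            break
        except BlockingIOError:  # EAGAIN / EWOULDBLOCK: no data available
            spins += 1
    return spins


def read_with_wait(sock, nbytes, timeout_s=0.2, max_idle_polls=3):
    """Read from a non-blocking socket without spinning on EAGAIN.

    Hypothetical helper. Returns the bytes read (b"" means the peer closed
    the connection) or raises TimeoutError after max_idle_polls idle waits.
    """
    idle_polls = 0
    while True:
        try:
            return sock.recv(nbytes)
        except BlockingIOError:
            pass  # nothing to read yet; wait below instead of retrying at once
        readable, _, _ = select.select([sock], [], [], timeout_s)
        if not readable:
            idle_polls += 1
            if idle_polls >= max_idle_polls:
                raise TimeoutError("no data after %d idle polls" % idle_polls)


if __name__ == "__main__":
    a, b = socket.socketpair()
    a.setblocking(False)
    # With no data pending, every immediate retry hits EAGAIN.
    print("busy spins:", count_busy_spins(a, 16, 1000))
    b.sendall(b"ping")
    print("read:", read_with_wait(a, 16))
    a.close()
    b.close()
```

With no data pending, the first helper uses every allowed iteration without sleeping, which is the CPU-burning behaviour described above. The second helper sleeps in `select()` between attempts and treats repeated timeouts as an error.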

It is unclear exactly what environmental circumstances lead to this state, but Zheng was hitting it in his dev environment when he submitted the fix, and a customer was hitting it on seemingly every OSD on most hosts (pushing the load over 200 on an otherwise idle cluster).


Customer impact:

A SimpleMessenger thread gets stuck in a loop and burns CPU.

No other known impact (besides the additional system load).


How widespread:

No idea.  For this customer it happened to all OSDs on most hosts in the cluster, and the OSDs reentered this state shortly after the host was rebooted.  It is unclear exactly why this cluster was susceptible when others haven't seen the problem.

Comment 18 kiran raje urs J 2016-12-07 18:45:14 UTC
QE has a question that needs clarification:
1. Is this QE testable? If "YES", can you please provide the steps to reproduce the bug? If "NO", QE will run the automated regression suite.

Comment 19 Ken Dreyer (Red Hat) 2016-12-07 18:52:56 UTC
Sam, Sage, mind answering Kiran's questions in Comment 18 above?