Bug 1402098 - OSD SimpleMessenger thread gets stuck in a loop and burns CPU
Summary: OSD SimpleMessenger thread gets stuck in a loop and burns CPU
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat
Component: RADOS
Version: 2.1
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: rc
Target Release: 2.1
Assignee: Samuel Just
QA Contact: ceph-qe-bugs
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2016-12-06 19:04 UTC by Vikhyat Umrao
Modified: 2020-01-17 16:18 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The Ceph OSD messenger thread could enter an infinite loop in some scenarios where the network is interrupted between Ceph clients and OSDs. Consequence: OSDs could consume excessive CPU and become unresponsive, and cluster service could be degraded. Fix: The OSD code has been altered to avoid looping infinitely in this scenario. Result: OSDs are more resilient to scenarios that trigger this bug.
Clone Of:
Environment:
Last Closed: 2016-12-08 16:53:30 UTC
Target Upstream Version:


Links
System ID Private Priority Status Summary Last Updated
Ceph Project Bug Tracker 14120 0 None None None 2016-12-06 20:25:30 UTC

Description Vikhyat Umrao 2016-12-06 19:04:19 UTC
Description of problem:

Backport: http://tracker.ceph.com/issues/14120 : Pipe::do_recv() may loop infinitely


Version-Release number of selected component (if applicable):
Red Hat Ceph Storage 2.1

Comment 2 Ken Dreyer (Red Hat) 2016-12-06 20:25:31 UTC
Looks like the jewel backport in https://github.com/ceph/ceph/pull/12341 will probably make v10.2.4.

Comment 7 Sage Weil 2016-12-07 01:55:48 UTC
Problem:

A client disconnect can put SimpleMessenger threads in an infinite loop that tries to read from the socket, gets EAGAIN, and loops.
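
The loop described above can be illustrated with a minimal Python sketch. This is not Ceph's code and not the fix from the referenced pull request; the function name `recv_bounded` and the `max_eagain` parameter are hypothetical, chosen only for illustration. It shows a non-blocking `recv()` that keeps failing with EAGAIN (surfaced in Python as `BlockingIOError`) and, as one possible mitigation, caps the number of retries so the caller regains control instead of spinning.

```python
import socket


def recv_bounded(sock, nbytes, max_eagain):
    """Read from a non-blocking socket, retrying on EAGAIN at most max_eagain times.

    Hypothetical helper for illustration only. Returns (data, eagain_seen);
    data is None if the retry budget ran out before anything could be read.
    """
    eagain_seen = 0
    while True:
        try:
            # On a non-blocking socket with nothing to read, recv() fails
            # with EAGAIN/EWOULDBLOCK, which Python raises as BlockingIOError.
            return sock.recv(nbytes), eagain_seen
        except BlockingIOError:
            eagain_seen += 1
            # Retrying here without any bound is the busy loop that burns
            # CPU. Giving up after a fixed budget hands control back.
            if eagain_seen >= max_eagain:
                return None, eagain_seen


if __name__ == "__main__":
    # Assumes a POSIX platform where socketpair() gives a connected pair.
    reader, writer = socket.socketpair()
    reader.setblocking(False)
    print(recv_bounded(reader, 4096, 1000))  # nothing sent yet: budget is exhausted
    writer.sendall(b"ping")
    print(recv_bounded(reader, 4096, 1000))  # data available: returns immediately
    reader.close()
    writer.close()
```

In real code, waiting for readability (for example with `select` or `poll`) before retrying would usually be preferable to a retry counter; the counter is used here only to keep the sketch short.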

It is unclear exactly what environmental circumstances lead to this state, but Zheng was hitting it in his dev environment when he submitted the fix, and a customer was hitting it on seemingly every OSD on most hosts (pushing the load over 200 on an otherwise idle cluster).


Customer impact:

A SimpleMessenger thread gets stuck in a loop and burns CPU.

No other known impact (besides the additional system load).


How widespread:

No idea. For this customer it happened to all OSDs on most hosts in the cluster, and the OSDs reentered this state shortly after the host was rebooted. It is unclear why this cluster was susceptible when others haven't seen the problem.

Comment 18 kiran raje urs J 2016-12-07 18:45:14 UTC
QE has a few questions that need clarification:
1. Is this QE testable? If "YES", can you please provide the steps to reproduce the bug? If "NO", QE will run the automated regression suite.

Comment 19 Ken Dreyer (Red Hat) 2016-12-07 18:52:56 UTC
Sam, Sage, mind answering Kiran's questions in Comment 18 above?

