Bug 1558185

Summary: Upgrade to 2.5, Slow OSD startups and unstable OSDs
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: tbrekke
Component: RADOS
Assignee: Josh Durgin <jdurgin>
Status: CLOSED NOTABUG
QA Contact: Manohar Murthy <mmurthy>
Severity: medium
Docs Contact:
Priority: unspecified
Version: 2.5
CC: anharris, ceph-eng-bugs, dzafman, eric.goirand, hnallurv, kchai, kdreyer, linuxkidd, mhackett, tbrekke, tpetr
Target Milestone: z5
Target Release: 2.5
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2019-08-14 15:19:43 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1548481
Bug Blocks:

Description tbrekke 2018-03-19 19:22:32 UTC
Description of problem:

After upgrading a few hosts to 2.5, the OSDs had slightly longer than normal startups. During this time nothing was being logged, just 100% CPU usage.

We updated a few of the hosts without too many issues, but eventually some of the OSDs started crashing (they stopped responding and then hit suicide timeouts).

Once the crashed OSDs were restarted, they were stuck on boot, with nothing showing in the logs after:

2018-03-16 19:34:52.555544 7f4f2d454700 20 filestore(/var/lib/ceph/osd/ceph-162) sync_entry woke after 5.000164


perf top showed all the CPU time going towards iterating through leveldb, likely doing the scanning/enumeration:

  15.58%  libleveldb.so.1.0.7   [.] leveldb::Block::Iter::Prev
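
(For context: leveldb::Block::Iter::Prev is the hot path for backward iteration; each Prev() backs up to the previous restart point in the block and re-scans forward to find the prior entry, so a long reverse scan over a large omap store burns a lot of CPU. The sketch below shows that kind of scan against the stock leveldb C++ API; the database path is illustrative (with FileStore the omap leveldb normally lives under the OSD data dir at current/omap) and this is not the actual OSD code.)

// Sketch only (not the OSD code): a full reverse scan of a leveldb store,
// the access pattern that keeps Block::Iter::Prev hot in perf.
// The path is illustrative.
#include <cstdint>
#include <iostream>
#include <leveldb/db.h>

int main() {
  leveldb::DB* db = nullptr;
  leveldb::Options options;
  options.create_if_missing = false;

  leveldb::Status s =
      leveldb::DB::Open(options, "/var/lib/ceph/osd/ceph-162/current/omap", &db);
  if (!s.ok()) {
    std::cerr << "open failed: " << s.ToString() << std::endl;
    return 1;
  }

  // Each it->Prev() lands in leveldb::Block::Iter::Prev, which backs up to
  // the previous restart point and re-scans forward to find the prior entry.
  leveldb::Iterator* it = db->NewIterator(leveldb::ReadOptions());
  uint64_t keys = 0;
  for (it->SeekToLast(); it->Valid(); it->Prev()) {
    ++keys;
  }
  std::cout << "scanned " << keys << " keys in reverse" << std::endl;

  delete it;
  delete db;
  return 0;
}

(Running something like this against a copy of a large omap directory gives a rough idea of how long a full enumeration takes.)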

strace showed this every few seconds:

[pid 1468613] <... poll resumed> )      = 1 ([{fd=10, revents=POLLIN}])
[pid 1468613] accept(10, {sa_family=AF_LOCAL, NULL}, [2]) = 19
[pid 1468613] read(19, "{", 1)          = 1
[pid 1468613] read(19, " ", 1)          = 1
[pid 1468613] read(19, "\"", 1)         = 1
[pid 1468613] read(19, "p", 1)          = 1
[pid 1468613] read(19, "r", 1)          = 1
[pid 1468613] read(19, "e", 1)          = 1
[pid 1468613] read(19, "f", 1)          = 1
[pid 1468613] read(19, "i", 1)          = 1
[pid 1468613] read(19, "x", 1)          = 1
[pid 1468613] read(19, "\"", 1)         = 1
[pid 1468613] read(19, ":", 1)          = 1
[pid 1468613] read(19, " ", 1)          = 1
[pid 1468613] read(19, "\"", 1)         = 1
[pid 1468613] read(19, "1", 1)          = 1
[pid 1468613] read(19, "\"", 1)         = 1
[pid 1468613] read(19, " ", 1)          = 1
[pid 1468613] read(19, "}", 1)          = 1
[pid 1468613] read(19, "\n", 1)         = 1
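
(The one-byte reads above are the OSD's admin socket thread accepting a connection and parsing a newline-terminated JSON command of the form { "prefix": ... }, most likely from whatever is polling the daemon socket; this is separate from the leveldb iteration seen in perf. For reference, a minimal sketch of the client side of that exchange is below. The socket path and the "perf dump" command are illustrative, and the reply framing, a 4-byte big-endian length followed by the JSON payload, is an assumption based on how ceph --admin-daemon talks to the socket.)

// Sketch only: a client sending one JSON command to a Ceph daemon admin
// socket over a UNIX stream socket. Path and command are illustrative;
// normally "ceph --admin-daemon <sock> <cmd>" does this for you.
#include <arpa/inet.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <iostream>
#include <string>
#include <vector>

int main() {
  const char* path = "/var/run/ceph/ceph-osd.162.asok";     // illustrative path
  const std::string cmd = "{\"prefix\": \"perf dump\"}\n";  // newline-terminated, as in the strace

  int fd = socket(AF_UNIX, SOCK_STREAM, 0);
  if (fd < 0) { perror("socket"); return 1; }

  sockaddr_un addr{};
  addr.sun_family = AF_UNIX;
  std::strncpy(addr.sun_path, path, sizeof(addr.sun_path) - 1);
  if (connect(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) < 0) {
    perror("connect");
    return 1;
  }

  // The daemon side reads this command a byte at a time, which is exactly
  // the run of read(fd, ..., 1) calls in the strace output above.
  if (write(fd, cmd.data(), cmd.size()) != static_cast<ssize_t>(cmd.size())) {
    perror("write");
    return 1;
  }

  // Assumed reply framing: 4-byte big-endian length, then the JSON payload.
  uint32_t len_be = 0;
  if (read(fd, &len_be, sizeof(len_be)) != static_cast<ssize_t>(sizeof(len_be))) {
    perror("read length");
    return 1;
  }
  uint32_t len = ntohl(len_be);

  std::vector<char> buf(len);
  size_t got = 0;
  while (got < len) {
    ssize_t n = read(fd, buf.data() + got, len - got);
    if (n <= 0) break;
    got += static_cast<size_t>(n);
  }
  std::cout.write(buf.data(), static_cast<std::streamsize>(got)) << std::endl;

  close(fd);
  return 0;
}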


This process took 20 minutes on some of the OSDs and up to 4 hours on others. 

Once all the OSDs were up and in, we stopped the upgrades and left the cluster in a mixed state over the weekend. The issue then reoccurred: many OSDs went down and got stuck in the long booting state. Reverting the hosts back to 2.4 resolved the issue and restored stability. (The mons are currently still on 2.5.)

Will have logs shortly.

Version-Release number of selected component (if applicable):


How reproducible:

Steps to Reproduce:
1. Upgrade OSDs to 2.5
2. Wait for OSDs to crash
3. On boot, they take a very long time to come up, with one core pinned at 100%.

Actual results:


Expected results:


Additional info:

Comment 20 Giridhar Ramaraju 2019-08-05 13:09:45 UTC
Updating the QA Contact to Hemant. Hemant will be rerouting these to the appropriate QE Associate.

Regards,
Giri

Comment 22 Josh Durgin 2019-08-14 15:19:43 UTC
Closing per comment #17.