Bug 1558185

Summary: Upgrade to 2.5, Slow OSD startups and unstable OSDs
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: tbrekke
Component: RADOS
Assignee: Josh Durgin <jdurgin>
Status: CLOSED NOTABUG
QA Contact: Manohar Murthy <mmurthy>
Severity: medium
Docs Contact:
Priority: unspecified
Version: 2.5
CC: anharris, ceph-eng-bugs, dzafman, eric.goirand, hnallurv, kchai, kdreyer, linuxkidd, mhackett, tbrekke, tpetr
Target Milestone: z5
Target Release: 2.5
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2019-08-14 15:19:43 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1548481
Bug Blocks:

Description tbrekke 2018-03-19 19:22:32 UTC
Description of problem:

After upgrading a few hosts to 2.5, the OSDs had slightly longer than normal startups. During this time nothing was being logged, just 100% CPU usage.

We updated a few of the hosts without too many issues, but eventually some of the OSDs started crashing (they stopped responding and then hit suicide timeouts).

Once the crashed OSDs were restarted, they were stuck on boot, with nothing showing in the logs after:

2018-03-16 19:34:52.555544 7f4f2d454700 20 filestore(/var/lib/ceph/osd/ceph-162) sync_entry woke after 5.000164


perf top showed all the CPU time going towards iterating through leveldb, likely doing the scanning/enumeration:

  15.58%  libleveldb.so.1.0.7   [.] leveldb::Block::Iter::Prev
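
(For context: leveldb::Block::Iter::Prev is the hot path for backward iteration; each Prev() backs up to the previous restart point in the block and re-scans forward to find the prior entry, so a long reverse scan over a large omap store burns a lot of CPU. The sketch below shows that kind of scan against the stock leveldb C++ API; the database path is illustrative (with FileStore the omap leveldb normally lives under the OSD data dir at current/omap) and this is not the actual OSD code.)

// Sketch only (not the OSD code): a full reverse scan of a leveldb store,
// the access pattern that keeps Block::Iter::Prev hot in perf.
// The path is illustrative.
#include <cstdint>
#include <iostream>
#include <leveldb/db.h>

int main() {
  leveldb::DB* db = nullptr;
  leveldb::Options options;
  options.create_if_missing = false;

  leveldb::Status s =
      leveldb::DB::Open(options, "/var/lib/ceph/osd/ceph-162/current/omap", &db);
  if (!s.ok()) {
    std::cerr << "open failed: " << s.ToString() << std::endl;
    return 1;
  }

  // Each it->Prev() lands in leveldb::Block::Iter::Prev, which backs up to
  // the previous restart point and re-scans forward to find the prior entry.
  leveldb::Iterator* it = db->NewIterator(leveldb::ReadOptions());
  uint64_t keys = 0;
  for (it->SeekToLast(); it->Valid(); it->Prev()) {
    ++keys;
  }
  std::cout << "scanned " << keys << " keys in reverse" << std::endl;

  delete it;
  delete db;
  return 0;
}

(Running something like this against a copy of a large omap directory gives a rough idea of how long a full enumeration takes.)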

strace showed this every few seconds:

[pid 1468613] <... poll resumed> )      = 1 ([{fd=10, revents=POLLIN}])
[pid 1468613] accept(10, {sa_family=AF_LOCAL, NULL}, [2]) = 19
[pid 1468613] read(19, "{", 1)          = 1
[pid 1468613] read(19, " ", 1)          = 1
[pid 1468613] read(19, "\"", 1)         = 1
[pid 1468613] read(19, "p", 1)          = 1
[pid 1468613] read(19, "r", 1)          = 1
[pid 1468613] read(19, "e", 1)          = 1
[pid 1468613] read(19, "f", 1)          = 1
[pid 1468613] read(19, "i", 1)          = 1
[pid 1468613] read(19, "x", 1)          = 1
[pid 1468613] read(19, "\"", 1)         = 1
[pid 1468613] read(19, ":", 1)          = 1
[pid 1468613] read(19, " ", 1)          = 1
[pid 1468613] read(19, "\"", 1)         = 1
[pid 1468613] read(19, "1", 1)          = 1
[pid 1468613] read(19, "\"", 1)         = 1
[pid 1468613] read(19, " ", 1)          = 1
[pid 1468613] read(19, "}", 1)          = 1
[pid 1468613] read(19, "\n", 1)         = 1
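
(The one-byte reads above are the OSD's admin socket thread accepting a connection and parsing a newline-terminated JSON command of the form { "prefix": ... }, most likely from whatever is polling the daemon socket; this is separate from the leveldb iteration seen in perf. For reference, a minimal sketch of the client side of that exchange is below. The socket path and the "perf dump" command are illustrative, and the reply framing, a 4-byte big-endian length followed by the JSON payload, is an assumption based on how ceph --admin-daemon talks to the socket.)

// Sketch only: a client sending one JSON command to a Ceph daemon admin
// socket over a UNIX stream socket. Path and command are illustrative;
// normally "ceph --admin-daemon <sock> <cmd>" does this for you.
#include <arpa/inet.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <iostream>
#include <string>
#include <vector>

int main() {
  const char* path = "/var/run/ceph/ceph-osd.162.asok";     // illustrative path
  const std::string cmd = "{\"prefix\": \"perf dump\"}\n";  // newline-terminated, as in the strace

  int fd = socket(AF_UNIX, SOCK_STREAM, 0);
  if (fd < 0) { perror("socket"); return 1; }

  sockaddr_un addr{};
  addr.sun_family = AF_UNIX;
  std::strncpy(addr.sun_path, path, sizeof(addr.sun_path) - 1);
  if (connect(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) < 0) {
    perror("connect");
    return 1;
  }

  // The daemon side reads this command a byte at a time, which is exactly
  // the run of read(fd, ..., 1) calls in the strace output above.
  if (write(fd, cmd.data(), cmd.size()) != static_cast<ssize_t>(cmd.size())) {
    perror("write");
    return 1;
  }

  // Assumed reply framing: 4-byte big-endian length, then the JSON payload.
  uint32_t len_be = 0;
  if (read(fd, &len_be, sizeof(len_be)) != static_cast<ssize_t>(sizeof(len_be))) {
    perror("read length");
    return 1;
  }
  uint32_t len = ntohl(len_be);

  std::vector<char> buf(len);
  size_t got = 0;
  while (got < len) {
    ssize_t n = read(fd, buf.data() + got, len - got);
    if (n <= 0) break;
    got += static_cast<size_t>(n);
  }
  std::cout.write(buf.data(), static_cast<std::streamsize>(got)) << std::endl;

  close(fd);
  return 0;
}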


This process took 20 minutes on some of the OSDs and up to 4 hours on others. 

Once all the OSDs were up and in, we stopped the upgrades and left the cluster in a mixed state over the weekend. The issue then reoccurred: many OSDs went down and got stuck in the long booting state. Reverting the hosts back to 2.4 resolved the issue and restored stability. (The mons are currently still on 2.5.)

Will have logs shortly.

Version-Release number of selected component (if applicable):


How reproducible:

Steps to Reproduce:
1. Upgrade OSDs to 2.5
2. Wait for OSDs to crash
3. On boot, they take a very long time to come up, with one core pinned at 100%.

Actual results:


Expected results:


Additional info:

Comment 20 Giridhar Ramaraju 2019-08-05 13:09:45 UTC
Updating the QA Contact to Hemant. Hemant will be rerouting these to the appropriate QE Associate.

Regards,
Giri

Comment 22 Josh Durgin 2019-08-14 15:19:43 UTC
Closing per comment #17.