Bug 1558185 - Upgrade to 2.5, Slow OSD startups and unstable OSDs
Summary: Upgrade to 2.5, Slow OSD startups and unstable OSDs
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: RADOS
Version: 2.5
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: z5
Target Release: 2.5
Assignee: Josh Durgin
QA Contact: Manohar Murthy
URL:
Whiteboard:
Depends On: 1548481
Blocks:
 
Reported: 2018-03-19 19:22 UTC by tbrekke
Modified: 2022-02-21 18:05 UTC
CC: 11 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-08-14 15:19:43 UTC
Embargoed:




Links
Red Hat Knowledge Base (Solution) 3390451: Ceph - During upgrade on Red Hat Ceph Storage OSD's are taking extended time to boot. (last updated 2019-05-30 12:53:04 UTC)

Description tbrekke 2018-03-19 19:22:32 UTC
Description of problem:

After upgrading a few hosts to 2.5, the OSDs took slightly longer than normal to start. During this time nothing was being logged, just 100% CPU usage.

We updated a few of the hosts without too many issues, but eventually some of the OSDs started crashing (they stopped responding and then hit the suicide timeouts).

Once the crashed OSDs were restarted, they were stuck booting, with nothing showing in the logs after:

2018-03-16 19:34:52.555544 7f4f2d454700 20 filestore(/var/lib/ceph/osd/ceph-162) sync_entry woke after 5.000164
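
A hedged sketch of how we could get more visibility into this silent phase next time: raise the filestore and leveldb debug levels for the affected OSD before starting it (the levels below are illustrative, not something we have verified against this issue).

# Illustrative only: bump debug logging in ceph.conf (or via --debug-* on the
# command line) for the OSD that is stuck, then watch its log while it churns
# through leveldb on boot.
[osd]
    debug filestore = 20
    debug leveldb = 20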


perf top showed all the CPU time going toward iterating through leveldb, likely doing the scanning/enumeration:

  15.58%  libleveldb.so.1.0.7   [.] leveldb::Block::Iter::Prev
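
For reference, output like the above can be gathered against a single stuck OSD with something along these lines (the PID is a placeholder for the ceph-osd process that is spinning):

# Placeholder PID: attach perf to the ceph-osd process burning 100% CPU.
perf top -p <ceph-osd pid>
# Per-thread view to confirm a single thread has one core pinned:
top -H -p <ceph-osd pid>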

strace showed this every few seconds:

[pid 1468613] <... poll resumed> )      = 1 ([{fd=10, revents=POLLIN}])
[pid 1468613] accept(10, {sa_family=AF_LOCAL, NULL}, [2]) = 19
[pid 1468613] read(19, "{", 1)          = 1
[pid 1468613] read(19, " ", 1)          = 1
[pid 1468613] read(19, "\"", 1)         = 1
[pid 1468613] read(19, "p", 1)          = 1
[pid 1468613] read(19, "r", 1)          = 1
[pid 1468613] read(19, "e", 1)          = 1
[pid 1468613] read(19, "f", 1)          = 1
[pid 1468613] read(19, "i", 1)          = 1
[pid 1468613] read(19, "x", 1)          = 1
[pid 1468613] read(19, "\"", 1)         = 1
[pid 1468613] read(19, ":", 1)          = 1
[pid 1468613] read(19, " ", 1)          = 1
[pid 1468613] read(19, "\"", 1)         = 1
[pid 1468613] read(19, "1", 1)          = 1
[pid 1468613] read(19, "\"", 1)         = 1
[pid 1468613] read(19, " ", 1)          = 1
[pid 1468613] read(19, "}", 1)          = 1
[pid 1468613] read(19, "\n", 1)         = 1
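
The reads above look like something polling the OSD admin socket. While an OSD is stuck booting like this, the admin socket can be queried directly from the OSD's host to check its state; a sketch, using osd.162 from the log line above:

# Ask the slow-booting OSD for its current state via the admin socket.
ceph daemon osd.162 status
# Equivalent form using the socket path directly:
ceph --admin-daemon /var/run/ceph/ceph-osd.162.asok status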


This process took 20 minutes on some of the OSDs and up to 4 hours on others. 

Once all the OSDs were up and in, we stopped the upgrades and left the cluster in a mixed state over the weekend. The issue then reoccurred: many OSDs went down and got stuck in the long booting state. Reverting the hosts back to 2.4 resolved the issue and restored stability. (The mons are currently still on 2.5.)
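
One precaution we could take while OSDs are booting this slowly (just a sketch, not something we have tried against this issue) is to keep them from being marked out, so rebalancing does not pile on top of the flapping:

# Standard flag to prevent down OSDs from being marked out during the upgrade.
ceph osd set noout
# ...restart/upgrade the OSDs...
ceph osd unset noout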

Will have logs shortly.

Version-Release number of selected component (if applicable):


How reproducible:

Steps to Reproduce:
1. Upgrade OSDs to 2.5
2. Wait for OSDs to crash
3. On boot they take a very long time to come up, with 1 core pinned.

Actual results:


Expected results:


Additional info:

Comment 20 Giridhar Ramaraju 2019-08-05 13:09:45 UTC
Updating the QA Contact to Hemant. Hemant will be rerouting them to the appropriate QE Associate.

Regards,
Giri

Comment 21 Giridhar Ramaraju 2019-08-05 13:10:58 UTC
Updating the QA Contact to Hemant. Hemant will be rerouting them to the appropriate QE Associate.

Regards,
Giri

Comment 22 Josh Durgin 2019-08-14 15:19:43 UTC
Closing per comment #17.

