Description of problem:

After upgrading a few hosts to 2.5, the OSDs had slightly longer than normal startups. During this time nothing was being logged, just 100% CPU usage. We updated a few of the hosts without too many issues, but eventually some of the OSDs started crashing (they stopped responding and then hit the suicide timeout). Once the crashed OSDs were restarted they got stuck on boot, with nothing showing in the logs after:

2018-03-16 19:34:52.555544 7f4f2d454700 20 filestore(/var/lib/ceph/osd/ceph-162) sync_entry woke after 5.000164

perf top showed all the CPU going towards iterating through leveldb, likely doing the scanning/enumeration:

15.58%  libleveldb.so.1.0.7  [.] leveldb::Block::Iter::Prev

strace showed this every few seconds:

[pid 1468613] <... poll resumed> ) = 1 ([{fd=10, revents=POLLIN}])
[pid 1468613] accept(10, {sa_family=AF_LOCAL, NULL}, [2]) = 19
[pid 1468613] read(19, "{", 1) = 1
[pid 1468613] read(19, " ", 1) = 1
[pid 1468613] read(19, "\"", 1) = 1
[pid 1468613] read(19, "p", 1) = 1
[pid 1468613] read(19, "r", 1) = 1
[pid 1468613] read(19, "e", 1) = 1
[pid 1468613] read(19, "f", 1) = 1
[pid 1468613] read(19, "i", 1) = 1
[pid 1468613] read(19, "x", 1) = 1
[pid 1468613] read(19, "\"", 1) = 1
[pid 1468613] read(19, ":", 1) = 1
[pid 1468613] read(19, " ", 1) = 1
[pid 1468613] read(19, "\"", 1) = 1
[pid 1468613] read(19, "1", 1) = 1
[pid 1468613] read(19, "\"", 1) = 1
[pid 1468613] read(19, " ", 1) = 1
[pid 1468613] read(19, "}", 1) = 1
[pid 1468613] read(19, "\n", 1) = 1

This process took 20 minutes on some of the OSDs and up to 4 hours on others. Once all the OSDs were up and in, we stopped the upgrades and left the cluster in a mixed state over the weekend. The issue then reoccurred: many OSDs went down and got stuck in the long booting state. Reverting the hosts back to 2.4 resolved the issue and restored stability (the mons are currently still on 2.5). Will have logs shortly.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Upgrade OSDs to 2.5
2. Wait for OSDs to crash
3. On boot they take a very long time to come up, with 1 core pinned.

Actual results:

Expected results:

Additional info:
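For reference, a rough sketch of the diagnostics behind the excerpts above, run on the affected host. osd.162 is the id from the log excerpt; the pgrep pattern is an assumption and may need adjusting to the local ceph-osd command line:

# Pick out the ceph-osd process that is pinning a core.
OSD_ID=162
PID="$(pgrep -f "ceph-osd.*${OSD_ID}" | head -n1)"

# Where is the CPU time going? Here it was leveldb::Block::Iter::Prev in libleveldb.
perf top -p "${PID}"

# What syscalls is it making? The byte-by-byte admin-socket reads quoted above came from a trace like this.
strace -f -p "${PID}"

# What does the daemon itself report while booting? (Requires the admin socket to be up.)
ceph daemon "osd.${OSD_ID}" status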
Updating the QA Contact to Hemant. Hemant will be rerouting this to the appropriate QE Associate. Regards, Giri
Closing per comment#17