Hide Forgot
Description of problem: It appears that OSDs begin to flap following the starting of recovery/backfill after an OSD was removed. Unsure if the osd removal or recovery/backfill starting was the trigger for the osd flapping. It's worth noting PG scrubbing is occurring throughout these issues, but nodeep-scrub is set. The customer provided a good consensus of the behavior seen in the ceph.log (full ceph.log can be found in the SFDC case attached to the BZ): 2016-10-22 08:03:01.936205 mon.0 192.168.1.125:6789/0 32311847 : cluster [INF] osdmap e43387: 320 osds: 315 up, 315 in old osd.132 removed (change) 2016-10-22 08:03:02.938214 osd.196 192.168.1.140:6813/1158359 1472 : cluster [INF] 3.ca3 restarting backfill on osd.64 from (0'0,0'0] MAX to 43385'3101774 backfilling starts (ie from osd removal) 2016-10-22 09:43:44.604357 mon.0 192.168.1.125:6789/0 32340175 : cluster [INF] osd.224 marked itself down 2016-10-22 09:43:44.604449 mon.0 192.168.1.125:6789/0 32340176 : cluster [INF] osd.186 marked itself down 2016-10-22 09:43:44.604499 mon.0 192.168.1.125:6789/0 32340177 : cluster [INF] osd.213 marked itself down 2016-10-22 09:43:44.604553 mon.0 192.168.1.125:6789/0 32340178 : cluster [INF] osd.107 marked itself down 2016-10-22 09:43:44.606239 mon.0 192.168.1.125:6789/0 32340179 : cluster [INF] osd.245 marked itself down 2016-10-22 09:43:44.615770 mon.0 192.168.1.125:6789/0 32340180 : cluster [INF] osd.162 marked itself down 2016-10-22 09:43:44.616829 mon.0 192.168.1.125:6789/0 32340181 : cluster [INF] osd.150 marked itself down 2016-10-22 09:43:44.618316 mon.0 192.168.1.125:6789/0 32340182 : cluster [INF] osd.22 marked itself down 2016-10-22 09:43:44.619419 mon.0 192.168.1.125:6789/0 32340183 : cluster [INF] osd.3 marked itself down 2016-10-22 09:43:44.634200 mon.0 192.168.1.125:6789/0 32340184 : cluster [INF] osd.14 marked itself down 2016-10-22 09:43:44.639197 mon.0 192.168.1.125:6789/0 32340185 : cluster [INF] osd.50 marked itself down 2016-10-22 09:43:44.639466 mon.0 192.168.1.125:6789/0 32340186 : cluster [INF] osd.240 marked itself down 2016-10-22 09:43:44.649004 mon.0 192.168.1.125:6789/0 32340187 : cluster [INF] osd.118 marked itself down 2016-10-22 09:43:44.652766 mon.0 192.168.1.125:6789/0 32340188 : cluster [INF] osd.207 marked itself down 2016-10-22 09:43:44.659188 mon.0 192.168.1.125:6789/0 32340189 : cluster [INF] osd.183 marked itself down 2016-10-22 09:43:44.668634 mon.0 192.168.1.125:6789/0 32340190 : cluster [INF] osd.49 marked itself down 2016-10-22 09:43:44.701021 mon.0 192.168.1.125:6789/0 32340191 : cluster [INF] osd.64 marked itself down 2016-10-22 09:43:44.705284 mon.0 192.168.1.125:6789/0 32340192 : cluster [INF] osd.76 marked itself down 2016-10-22 09:43:44.706894 mon.0 192.168.1.125:6789/0 32340193 : cluster [INF] osd.92 marked itself down 2016-10-22 09:43:44.982770 mon.0 192.168.1.125:6789/0 32340194 : cluster [INF] osdmap e43851: 319 osds: 296 up, 315 in 2016-10-22 09:59:03.022274 mon.0 192.168.1.125:6789/0 32344110 : cluster [INF] osd.3 192.168.1.143:6800/26750 boot 2016-10-22 09:59:05.086716 mon.0 192.168.1.125:6789/0 32344123 : cluster [INF] osd.14 192.168.1.143:6802/27033 boot 2016-10-22 09:59:06.098339 mon.0 192.168.1.125:6789/0 32344129 : cluster [INF] osd.50 192.168.1.143:6808/27528 boot 2016-10-22 09:59:07.104064 mon.0 192.168.1.125:6789/0 32344137 : cluster [INF] osd.92 192.168.1.143:6814/28032 boot 2016-10-22 09:59:08.109739 mon.0 192.168.1.125:6789/0 32344142 : cluster [INF] osd.49 192.168.1.143:6806/27377 boot 2016-10-22 09:59:08.112917 mon.0 192.168.1.125:6789/0 32344143 : cluster [INF] osd.183 192.168.1.143:6824/28821 boot 2016-10-22 09:59:08.113109 mon.0 192.168.1.125:6789/0 32344144 : cluster [INF] osd.162 192.168.1.143:6822/28660 boot 2016-10-22 09:59:08.113315 mon.0 192.168.1.125:6789/0 32344145 : cluster [INF] osd.207 192.168.1.143:6828/29147 boot 2016-10-22 09:59:08.113544 mon.0 192.168.1.125:6789/0 32344146 : cluster [INF] osd.64 192.168.1.143:6810/27680 boot 2016-10-22 09:59:09.123305 mon.0 192.168.1.125:6789/0 32344155 : cluster [INF] osd.22 192.168.1.143:6804/27224 boot 2016-10-22 09:59:09.123535 mon.0 192.168.1.125:6789/0 32344156 : cluster [INF] osd.107 192.168.1.143:6816/28186 boot 2016-10-22 09:59:09.123746 mon.0 192.168.1.125:6789/0 32344157 : cluster [INF] osd.118 192.168.1.143:6818/28354 boot 2016-10-22 09:59:10.128166 mon.0 192.168.1.125:6789/0 32344162 : cluster [INF] osd.213 192.168.1.143:6830/29306 boot 2016-10-22 09:59:10.134517 mon.0 192.168.1.125:6789/0 32344163 : cluster [INF] osd.186 192.168.1.143:6826/28992 boot 2016-10-22 09:59:11.139538 mon.0 192.168.1.125:6789/0 32344168 : cluster [INF] osd.224 192.168.1.143:6832/29478 boot 2016-10-22 09:59:11.139754 mon.0 192.168.1.125:6789/0 32344169 : cluster [INF] osd.76 192.168.1.143:6812/27861 boot 2016-10-22 09:59:11.139987 mon.0 192.168.1.125:6789/0 32344170 : cluster [INF] osd.240 192.168.1.143:6834/29634 boot 2016-10-22 09:59:12.509281 mon.0 192.168.1.125:6789/0 32344178 : cluster [INF] osd.245 192.168.1.143:6836/29799 boot 2016-10-22 09:59:12.509736 mon.0 192.168.1.125:6789/0 32344179 : cluster [INF] osd.150 192.168.1.143:6820/28504 boot 2016-10-22 10:12:58.708200 mon.0 192.168.1.125:6789/0 32347722 : cluster [INF] pgmap v21827208: 18432 pgs: 18432 active+clean; 18640 GB data, 155 TB used, 360 TB / 515 TB avail; 4152 kB/s rd, 5199 kB/s wr, 2311 op/s backfilling & recovery complete 2016-10-22 10:16:24.436915 mon.0 192.168.1.125:6789/0 32348601 : cluster [INF] osdmap e43893: 320 osds: 315 up, 315 in osd.132 re-added (change) 2016-10-22 10:20:38.665552 mon.0 192.168.1.125:6789/0 32349653 : cluster [INF] osdmap e43894: 320 osds: 315 up, 315 in 2016-10-22 10:20:39.692814 mon.0 192.168.1.125:6789/0 32349656 : cluster [INF] osd.132 192.168.1.143:6838/82114 boot 2016-10-22 10:20:39.702909 mon.0 192.168.1.125:6789/0 32349657 : cluster [INF] osdmap e43895: 320 osds: 316 up, 316 in osd.132 started (change) 2016-10-22 10:20:40.697784 osd.33 192.168.1.140:6809/1147419 1532 : cluster [INF] 4.72e restarting backfill on osd.207 from (0'0,0'0] MAX to 43864'2889519 backfill & recovery starts (from the osd start) 2016-10-22 12:18:08.231991 mon.0 192.168.1.125:6789/0 32384009 : cluster [INF] osd.227 marked itself down 2016-10-22 12:18:10.630392 mon.0 192.168.1.125:6789/0 32384024 : cluster [INF] osd.214 marked itself down 2016-10-22 12:18:12.990507 mon.0 192.168.1.125:6789/0 32384034 : cluster [INF] osd.201 marked itself down 2016-10-22 12:18:15.367589 mon.0 192.168.1.125:6789/0 32384048 : cluster [INF] osd.187 marked itself down 2016-10-22 12:18:17.701925 mon.0 192.168.1.125:6789/0 32384062 : cluster [INF] osd.175 marked itself down 2016-10-22 12:18:20.038557 mon.0 192.168.1.125:6789/0 32384070 : cluster [INF] osd.158 marked itself down 2016-10-22 12:18:21.993274 mon.0 192.168.1.125:6789/0 32384075 : cluster [INF] osdmap e46185: 320 osds: 310 up, 316 in 2016-10-22 12:18:22.358472 mon.0 192.168.1.125:6789/0 32384077 : cluster [INF] osd.142 marked itself down 2016-10-22 12:18:22.994061 mon.0 192.168.1.125:6789/0 32384083 : cluster [INF] osd.0 192.168.1.140:6802/2599826 boot 2016-10-22 12:18:23.001285 mon.0 192.168.1.125:6789/0 32384084 : cluster [INF] osdmap e46186: 320 osds: 310 up, 317 in 2nd osd online 2016-10-22 12:18:24.677577 mon.0 192.168.1.125:6789/0 32384093 : cluster [INF] osd.137 marked itself down 2016-10-22 12:18:27.013172 mon.0 192.168.1.125:6789/0 32384108 : cluster [INF] osd.126 marked itself down 2016-10-22 12:18:29.338712 mon.0 192.168.1.125:6789/0 32384125 : cluster [INF] osd.112 marked itself down 2016-10-22 12:18:31.541036 mon.0 192.168.1.125:6789/0 32384145 : cluster [INF] osd.102 marked itself down 2016-10-22 12:18:33.856201 mon.0 192.168.1.125:6789/0 32384159 : cluster [INF] osd.89 marked itself down 2016-10-22 12:18:34.452716 mon.0 192.168.1.125:6789/0 32384163 : cluster [INF] osdmap e46197: 320 osds: 305 up, 317 in this is a problem !!!! 2016-10-22 12:18:35.870748 mon.0 192.168.1.125:6789/0 32384170 : cluster [INF] osd.137 192.168.1.147:6800/2968335 boot 2016-10-22 12:18:35.871394 mon.0 192.168.1.125:6789/0 32384171 : cluster [INF] osd.142 192.168.1.147:6804/2968496 boot 2016-10-22 12:18:45.120610 mon.0 192.168.1.125:6789/0 32384185 : cluster [INF] osd.158 192.168.1.147:6808/2968654 boot 2016-10-22 12:18:45.599912 mon.0 192.168.1.125:6789/0 32384192 : cluster [INF] osd.25 marked itself down 2016-10-22 12:18:46.134602 mon.0 192.168.1.125:6789/0 32384197 : cluster [INF] osd.201 192.168.1.147:6817/2969203 boot 2016-10-22 12:18:46.148214 mon.0 192.168.1.125:6789/0 32384198 : cluster [INF] osd.175 192.168.1.147:6810/2968863 boot 2016-10-22 12:18:46.149837 mon.0 192.168.1.125:6789/0 32384199 : cluster [INF] osd.214 192.168.1.147:6820/2969358 boot 2016-10-22 12:18:46.151152 mon.0 192.168.1.125:6789/0 32384200 : cluster [INF] osd.187 192.168.1.147:6814/2969029 boot 2016-10-22 12:18:46.153278 mon.0 192.168.1.125:6789/0 32384201 : cluster [INF] osd.227 192.168.1.147:6822/2969547 boot Version-Release number of selected component (if applicable): ceph-0.94.3-3.el7cp.x86_64 How reproducible: Unknown Steps to Reproduce: Unknown Actual results: osd's are marked down/booted (flap) throughout recovery/backfill Expected results: Recovery/backfill occur and no osd's are marked down. Additional info: * The nodeep-scrub flag is set in this environment during this issue, so this issue does not appear to be a match for BZ 1373653. * I've requested debug_ms=1, debug_osd=20, debug_filestore=20 when the customer next sees this issue
> The sosreport would also give us whatever logs we would need to see systemd service changes, right? Yeah, they should contain that data, I'll request it separately if the sosreport fails to capture it.