Description of problem:

After upgrading from RHCS 4.0 to RHCS 4.1, all OSDs on one of the nodes are not booting up cleanly. As a result they are being marked down by peer OSDs/MONs. I am seeing a bunch of these messages after enabling debug logs:

2020-05-19 08:35:36.784 7fca97bcddc0 20 osd.8 43581 clearing temps in 18.987s5_head pgid 18.987s5
2020-05-19 08:35:37.394 7fca97bcddc0 20 osd.8 43581 clearing temps in 18.7a2s3_head pgid 18.7a2s3
2020-05-19 08:35:46.022 7fca97bcddc0 20 osd.8 43581 clearing temps in 18.1728s1_head pgid 18.1728s1
2020-05-19 08:35:48.081 7fca97bcddc0 20 osd.8 43581 clearing temps in 18.148s5_head pgid 18.148s5
2020-05-19 08:35:48.670 7fca97bcddc0 20 osd.8 43581 clearing temps in 18.27f1s2_head pgid 18.27f1s2
2020-05-19 08:35:50.809 7fca97bcddc0 20 osd.8 43581 clearing temps in 18.211cs4_head pgid 18.211cs4
2020-05-19 08:35:52.101 7fca97bcddc0 20 osd.8 43581 clearing temps in 18.4f8s3_head pgid 18.4f8s3
2020-05-19 08:36:00.470 7fca97bcddc0 20 osd.8 43581 clearing temps in 18.35c7s4_head pgid 18.35c7s4
2020-05-19 08:36:01.737 7fca97bcddc0 20 osd.8 43581 clearing temps in 18.2123s0_head pgid 18.2123s0
2020-05-19 08:36:03.773 7fca97bcddc0 20 osd.8 43581 clearing temps in 18.2134s5_head pgid 18.2134s5
2020-05-19 08:36:04.417 7fca97bcddc0 20 osd.8 43581 clearing temps in 18.12d4s0_head pgid 18.12d4s0
2020-05-19 08:36:06.423 7fca97bcddc0 20 osd.8 43581 clearing temps in 17.710_head pgid 17.710
2020-05-19 08:36:06.423 7fca97bcddc0 20 osd.8 43581 clearing temps in 18.2ff7s3_head pgid 18.2ff7s3
2020-05-19 08:36:14.785 7fca97bcddc0 20 osd.8 43581 clearing temps in 18.3f22s5_head pgid 18.3f22s5
2020-05-19 08:36:15.391 7fca97bcddc0 20 osd.8 43581 clearing temps in 18.1daes0_head pgid 18.1daes0
2020-05-19 08:36:17.243 7fca97bcddc0 20 osd.8 43581 clearing temps in 18.2149s0_head pgid 18.2149s0
2020-05-19 08:36:19.015 7fca97bcddc0 20 osd.8 43581 clearing temps in 18.5c4s5_head pgid 18.5c4s5
2020-05-19 08:36:19.591 7fca97bcddc0 20 osd.8 43581 clearing temps in 18.3cabs3_head pgid 18.3cabs3
2020-05-19 08:36:27.661 7fca97bcddc0 20 osd.8 43581 clearing temps in 18.13bas4_head pgid 18.13bas4
2020-05-19 08:36:28.874 7fca97bcddc0 20 osd.8 43581 clearing temps in 18.2d1fs2_head pgid 18.2d1fs2
2020-05-19 08:36:30.947 7fca97bcddc0 20 osd.8 43581 clearing temps in 18.216as4_head pgid 18.216as4
2020-05-19 08:36:32.168 7fca97bcddc0 20 osd.8 43581 clearing temps in 18.21aas5_head pgid 18.21aas5
2020-05-19 08:36:32.750 7fca97bcddc0 20 osd.8 43581 clearing temps in 18.34das2_head pgid 18.34das2
2020-05-19 08:36:34.896 7fca97bcddc0 20 osd.8 43581 clearing temps in 18.21c2s5_head pgid 18.21c2s5
2020-05-19 08:36:35.485 7fca97bcddc0 20 osd.8 43581 clearing temps in 18.21e6s4_head pgid 18.21e6s4
2020-05-19 08:36:36.713 7fca97bcddc0 20 osd.8 43581 clearing temps in 18.22e2s3_head pgid 18.22e2s3

Version-Release number of selected component (if applicable):
RHCS 4.1

How reproducible:

Steps to Reproduce:
1. Keep several million objects in the cluster
2. Upgrade Ceph from RHCS 4.0 to 4.1
3. Check whether ALL OSDs are up and running

Actual results:
All OSDs on one node are down. While the OSDs are trying their best to boot up, I am seeing a bunch of "clearing temps" messages in the PG logs (above).

Expected results:
Like the other nodes, the OSDs on this node should also come up cleanly.

Additional info:
Debug logs from affected OSDs: https://pastebin.com/raw/RLwec9mT
Other outputs: https://pastebin.com/raw/fYJDm1Th
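For a quick read on how far a booting OSD has progressed, the "clearing temps" lines above can be grouped per PG. A minimal sketch (hypothetical helper, not part of Ceph; it only assumes the exact message format shown above):

```python
import re
from collections import Counter

# Count "clearing temps" occurrences per PG from pasted OSD debug logs.
# Assumes the "clearing temps in <pg>_head pgid <pg>" format shown above.
CLEAR_RE = re.compile(r"clearing temps in (\S+)_head pgid (\S+)")

def pgs_cleared(log_lines):
    """Return a Counter mapping pgid -> number of clearing-temps lines."""
    counts = Counter()
    for line in log_lines:
        m = CLEAR_RE.search(line)
        if m:
            counts[m.group(2)] += 1
    return counts

sample = [
    "2020-05-19 08:35:36.784 7fca97bcddc0 20 osd.8 43581 clearing temps in 18.987s5_head pgid 18.987s5",
    "2020-05-19 08:35:37.394 7fca97bcddc0 20 osd.8 43581 clearing temps in 18.7a2s3_head pgid 18.7a2s3",
]
print(len(pgs_cleared(sample)))  # → 2 distinct PGs seen so far
```

Run against the full debug log, this gives a rough sense of how many PGs the OSD still has to iterate over before finishing boot.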
Here is the output for one of the OSDs, which has been running for the last 2 hours:

[root@rgw-1 ceph]# podman ps | head -1
CONTAINER ID  IMAGE                                                 COMMAND  CREATED      STATUS          PORTS  NAMES
93b496fdfa83  registry.redhat.io/rhceph-beta/rhceph-4-rhel8:latest           2 hours ago  Up 2 hours ago         ceph-osd-19

Logs: https://pastebin.com/raw/bMEXescH (unfortunately these are not debug logs, I enabled debugging later)
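Since debugging was only enabled after the fact on this OSD, one option for next time is to set debug levels in ceph.conf before the OSD container restarts, so the boot sequence itself is captured. A minimal fragment (the usual OSD debug subsystem knobs, shown as an illustration; the value is log-level/in-memory-level):

```ini
[osd]
debug osd = 20/20
debug bluestore = 20/20
debug ms = 1
```

With this in place before restart, the early per-PG load phase (load_pgs, read_log_and_missing) is logged from the very first second of boot.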
Some more logs from the same OSD 8156-f6950be79fa2.1523928.1104_readprepround97040:head by client.1552659.0:7485602 2020-05-18 23:08:11.895158 0 2020-05-19 08:50:23.855 7fca97bcddc0 20 read_log_and_missing 40489'1516 (0'0) modify 18:71f11d72:::747eb73a-0107-4d9f-8156-f6950be79fa2.1523928.768_readprepround80394:head by client.1527012.0:6742513 2020-05-18 23:08:12.146016 0 2020-05-19 08:50:23.855 7fca97bcddc0 20 read_log_and_missing 40489'1517 (0'0) modify 18:71f360a4:::747eb73a-0107-4d9f-8156-f6950be79fa2.1523928.806_readprepround3492:head by client.1524210.0:6836207 2020-05-18 23:08:13.612280 0 2020-05-19 08:50:23.855 7fca97bcddc0 20 read_log_and_missing 40506'1518 (0'0) modify 18:71f1334b:::747eb73a-0107-4d9f-8156-f6950be79fa2.1523928.1120_readprepround87252:head by client.1552644.0:7779797 2020-05-18 23:08:48.789095 0 2020-05-19 08:50:23.855 7fca97bcddc0 20 read_log_and_missing 40623'1519 (0'0) modify 18:71f291cd:::747eb73a-0107-4d9f-8156-f6950be79fa2.1523928.1120_readprepround89767:head by client.1552644.0:7787160 2020-05-18 23:17:01.783024 0 2020-05-19 08:50:23.861 7fca97bcddc0 10 read_log_and_missing done 2020-05-19 08:50:23.861 7fca97bcddc0 10 osd.8 pg_epoch: 43579 pg[18.f8es1( v 40623'1519 lc 40068'1384 (0'0,40623'1519] local-lis/les=40550/40551 n=1457 ec=38437/36901 lis/c 43546/39072 les/c/f 43547/39095/0 43579/43579/42441) [97,8,144,129,255,394]/[97,357,144,129,255,394]p97(0) r=-1 lpr=0 pi=[39072,43579)/3 crt=40506'1518 lcod 0'0 unknown m=118 mbc={}] handle_initialize 2020-05-19 08:50:23.861 7fca97bcddc0 5 osd.8 pg_epoch: 43579 pg[18.f8es1( v 40623'1519 lc 40068'1384 (0'0,40623'1519] local-lis/les=40550/40551 n=1457 ec=38437/36901 lis/c 43546/39072 les/c/f 43547/39095/0 43579/43579/42441) [97,8,144,129,255,394]/[97,357,144,129,255,394]p97(0) r=-1 lpr=0 pi=[39072,43579)/3 crt=40506'1518 lcod 0'0 unknown m=118 mbc={}] exit Initial 0.036784 0 0.000000 2020-05-19 08:50:23.861 7fca97bcddc0 5 osd.8 pg_epoch: 43579 pg[18.f8es1( v 40623'1519 lc 40068'1384 
(0'0,40623'1519] local-lis/les=40550/40551 n=1457 ec=38437/36901 lis/c 43546/39072 les/c/f 43547/39095/0 43579/43579/42441) [97,8,144,129,255,394]/[97,357,144,129,255,394]p97(0) r=-1 lpr=0 pi=[39072,43579)/3 crt=40506'1518 lcod 0'0 unknown m=118 mbc={}] enter Reset 2020-05-19 08:50:23.861 7fca97bcddc0 20 osd.8 pg_epoch: 43579 pg[18.f8es1( v 40623'1519 lc 40068'1384 (0'0,40623'1519] local-lis/les=40550/40551 n=1457 ec=38437/36901 lis/c 43546/39072 les/c/f 43547/39095/0 43579/43579/42441) [97,8,144,129,255,394]/[97,357,144,129,255,394]p97(0) r=-1 lpr=0 pi=[39072,43579)/3 crt=40506'1518 lcod 0'0 unknown m=118 mbc={}] set_last_peering_reset 43579 2020-05-19 08:50:23.861 7fca97bcddc0 10 osd.8 pg_epoch: 43579 pg[18.f8es1( v 40623'1519 lc 40068'1384 (0'0,40623'1519] local-lis/les=40550/40551 n=1457 ec=38437/36901 lis/c 43546/39072 les/c/f 43547/39095/0 43579/43579/42441) [97,8,144,129,255,394]/[97,357,144,129,255,394]p97(0) r=-1 lpr=43579 pi=[39072,43579)/3 crt=40506'1518 lcod 0'0 unknown m=118 mbc={}] Clearing blocked outgoing recovery messages 2020-05-19 08:50:23.861 7fca97bcddc0 10 osd.8 pg_epoch: 43579 pg[18.f8es1( v 40623'1519 lc 40068'1384 (0'0,40623'1519] local-lis/les=40550/40551 n=1457 ec=38437/36901 lis/c 43546/39072 les/c/f 43547/39095/0 43579/43579/42441) [97,8,144,129,255,394]/[97,357,144,129,255,394]p97(0) r=-1 lpr=43579 pi=[39072,43579)/3 crt=40506'1518 lcod 0'0 unknown m=118 mbc={}] Not blocking outgoing recovery messages 2020-05-19 08:50:23.861 7fca97bcddc0 6 write_log_and_missing with: dirty_to: 0'0, dirty_from: 4294967295'18446744073709551615, writeout_from: 4294967295'18446744073709551615, trimmed: , trimmed_dups: , clear_divergent_priors: 0 2020-05-19 08:50:23.861 7fca97bcddc0 10 osd.8 43581 load_pgs loaded pg[18.f8es1( v 40623'1519 lc 40068'1384 (0'0,40623'1519] local-lis/les=40550/40551 n=1457 ec=38437/36901 lis/c 43546/39072 les/c/f 43547/39095/0 43579/43579/42441) [97,8,144,129,255,394]/[97,357,144,129,255,394]p97(0) r=-1 lpr=43579 
pi=[39072,43579)/3 crt=40506'1518 lcod 0'0 unknown m=118 mbc={}] 2020-05-19 08:50:23.861 7fca97bcddc0 20 osd.8 43581 register_pg 18.f8es1 0x5585497bc000 2020-05-19 08:50:23.861 7fca97bcddc0 10 osd.8:2._attach_pg 18.f8es1 0x5585497bc000 2020-05-19 08:50:23.864 7fca97bcddc0 10 osd.8 43581 pgid 18.f78s4 coll 18.f78s4_head 2020-05-19 08:50:23.868 7fca97bcddc0 10 osd.8 43581 _make_pg 18.f78s4 2020-05-19 08:50:23.868 7fca97bcddc0 5 osd.8 pg_epoch: 43579 pg[18.f78s4(unlocked)] enter Initial 2020-05-19 08:50:23.868 7fca97bcddc0 20 ErasureCodePluginJerasure: factory: {crush-device-class=,crush-failure-domain=host,crush-root=default,jerasure-per-chunk-alignment=false,k=4,m=2,plugin=jerasure,technique=reed_sol_van,w=8} 2020-05-19 08:50:23.868 7fca97bcddc0 10 ErasureCodeJerasure: technique=reed_sol_van 2020-05-19 08:50:23.868 7fca97bcddc0 20 osd.8 pg_epoch: 43579 pg[18.f78s4(unlocked)] enter NotTrimming 2020-05-19 08:50:23.868 7fca97bcddc0 20 read_log_and_missing coll 18.f78s4_head 4#18:1ef00000::::head# 2020-05-19 08:50:23.870 7fca97bcddc0 20 read_log_and_missing 39504'1 (0'0) modify 18:1ef28f21:::747eb73a-0107-4d9f-8156-f6950be79fa2.1523928.10_finaltestround13376:head by client.1552644.0:8252 2020-05-18 18:48:10.321021 0 2020-05-19 08:50:23.870 7fca97bcddc0 20 read_log_and_missing 39504'2 (0'0) modify 18:1ef1da7f:::747eb73a-0107-4d9f-8156-f6950be79fa2.1523928.112_finaltestround1638:head by client.1552659.0:26018 2020-05-18 18:48:15.220968 0 2020-05-19 08:50:23.870 7fca97bcddc0 20 read_log_and_missing 39504'3 (0'0) modify 18:1ef07e21:::747eb73a-0107-4d9f-8156-f6950be79fa2.1523928.174_finaltestround12501:head by client.1552659.0:53418 2020-05-18 18:48:20.852752 0 2020-05-19 08:50:23.870 7fca97bcddc0 20 read_log_and_missing 39504'4 (0'0) modify 18:1ef2f6ae:::747eb73a-0107-4d9f-8156-f6950be79fa2.1523928.432_finaltestround13304:head by client.1552659.0:102456 2020-05-18 18:48:28.083833 0 2020-05-19 08:50:23.870 7fca97bcddc0 20 read_log_and_missing 39504'5 (0'0) modify 
18:1ef3676a:::747eb73a-0107-4d9f-8156-f6950be79fa2.1523928.591_finaltestround12609:head by client.1552644.0:102682 2020-05-18 18:48:30.424963 0 2020-05-19 08:50:23.870 7fca97bcddc0 20 read_log_and_missing 39504'6 (0'0) modify 18:1ef3f72a:::747eb73a-0107-4d9f-8156-f6950be79fa2.1523928.329_finaltestround14045:head by client.1552659.0:139482 2020-05-18 18:48:32.834324 0 2020-05-19 08:50:23.870 7fca97bcddc0 20 read_log_and_missing 39504'7 (0'0) modify 18:1ef1536c:::747eb73a-0107-4d9f-8156-f6950be79fa2.1523928.115_finaltestround1143:head by client.1552644.0:205218 2020-05-18 18:48:42.736130 0 2020-05-19 08:50:23.870 7fca97bcddc0 20 read_log_and_missing 39504'8 (0'0) modify 18:1ef1c391:::747eb73a-0107-4d9f-8156-f6950be79fa2.1523928.128_finaltestround13711:head by client.1552659.0:246120 2020-05-18 18:48:44.733724 0 2020-05-19 08:50:23.870 7fca97bcddc0 20 read_log_and_missing 39504'9 (0'0) modify 18:1ef25102:::747eb73a-0107-4d9f-8156-f6950be79fa2.1523928.161_finaltestround13765:head by client.1552644.0:308222 2020-05-18 18:48:53.713128 0 2020-05-19 08:50:23.870 7fca97bcddc0 20 read_log_and_missing 39504'10 (0'0) modify 18:1ef34789:::747eb73a-0107-4d9f-8156-f6950be79fa2.1523928.23_finaltestround12764:head by client.1552659.0:333352 2020-05-18 18:48:54.128010 0 2020-05-19 08:50:23.870 7fca97bcddc0 20 read_log_and_missing 39504'11 (0'0) modify 18:1ef10fd4:::747eb73a-0107-4d9f-8156-f6950be79fa2.1523928.329_finaltestround12067:head by client.1552659.0:379022 2020-05-18 18:48:58.510291 0
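The read_log_and_missing replay above happens per PG during boot, so one rough way to see where the time goes is to diff consecutive log timestamps and look for the largest gaps. A sketch (hypothetical helper, assuming only the standard "YYYY-MM-DD HH:MM:SS.mmm" prefix these logs use):

```python
from datetime import datetime

# Diff consecutive Ceph log timestamps to spot slow steps during OSD boot.
# Assumes each line starts with "YYYY-MM-DD HH:MM:SS.fff" as in the logs above.
def slowest_gaps(lines, top=3):
    stamps = [datetime.strptime(l[:23], "%Y-%m-%d %H:%M:%S.%f") for l in lines]
    gaps = [((b - a).total_seconds(), lines[i + 1][:23])
            for i, (a, b) in enumerate(zip(stamps, stamps[1:]))]
    return sorted(gaps, reverse=True)[:top]

lines = [
    "2020-05-19 08:35:36.784 ... clearing temps in 18.987s5_head",
    "2020-05-19 08:35:46.022 ... clearing temps in 18.1728s1_head",
    "2020-05-19 08:35:48.081 ... clearing temps in 18.148s5_head",
]
print(slowest_gaps(lines, top=1))  # largest (seconds, timestamp) gap first
```

In the pasted snippets above, some PGs take several seconds each; multiplied across thousands of PG shards that alone can explain a very long boot.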
Created attachment 1689970 [details] As requested, the logs are attached.
Hi Neha,

We did a few more things. The RGW data pool had some 45 million objects (COSBench data), so we deleted all pools; I assumed that once the pools were gone the OSDs would stop doing that per-PG checking. Just after deleting the pools, OSD utilization on all nodes skyrocketed, so I had to reboot all nodes one by one. Even after the reboot, CPU consumption by the OSD containers is still very high (see the attached utilization screenshots).

As of now, the cluster has:
- 0 objects
- 1 pool (.rgw.root)
- most OSDs on host rgw-1 down
- extremely high CPU utilization on all nodes except rgw-1 (even though I rebooted all nodes)

[root@rgw-5 ~]# ceph df
RAW STORAGE:
    CLASS     SIZE        AVAIL       USED        RAW USED     %RAW USED
    hdd       3.6 PiB     3.4 PiB     198 TiB      198 TiB          5.39
    TOTAL     3.6 PiB     3.4 PiB     198 TiB      198 TiB          5.39

POOLS:
    POOL          ID     STORED     OBJECTS     USED     %USED     MAX AVAIL
    .rgw.root     21        0 B           0      0 B         0       1.4 PiB
[root@rgw-5 ~]#
[root@rgw-5 ~]# ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF -1 4878.71826 root default -7 813.11523 host rgw-1 0 hdd 15.34180 osd.0 down 1.00000 1.00000 1 hdd 15.34180 osd.1 up 1.00000 1.00000 2 hdd 15.34180 osd.2 down 1.00000 1.00000 3 hdd 15.34180 osd.3 up 1.00000 1.00000 4 hdd 15.34180 osd.4 up 1.00000 1.00000 5 hdd 15.34180 osd.5 down 1.00000 1.00000 6 hdd 15.34180 osd.6 down 1.00000 1.00000 7 hdd 15.34180 osd.7 up 1.00000 1.00000 8 hdd 15.34180 osd.8 up 1.00000 1.00000 9 hdd 15.34180 osd.9 up 1.00000 1.00000 10 hdd 15.34180 osd.10 down 1.00000 1.00000 11 hdd 15.34180 osd.11 down 1.00000 1.00000 14 hdd 15.34180 osd.14 down 1.00000 1.00000 15 hdd 15.34180 osd.15 down 1.00000 1.00000 18 hdd 15.34180 osd.18 down 1.00000 1.00000 19 hdd 15.34180 osd.19 down 1.00000 1.00000 20 hdd 15.34180 osd.20 down 1.00000 1.00000 23 hdd 15.34180 osd.23 up 1.00000 1.00000 24 hdd 15.34180 osd.24 up 1.00000 1.00000 26 hdd 15.34180 osd.26 up 1.00000 1.00000 28 hdd 15.34180 osd.28 down 1.00000 1.00000 29 hdd 15.34180 osd.29 down 1.00000 1.00000 32 hdd 15.34180 osd.32 up 1.00000 1.00000 33 hdd 15.34180 osd.33
down 1.00000 1.00000 35 hdd 15.34180 osd.35 down 1.00000 1.00000 37 hdd 15.34180 osd.37 up 1.00000 1.00000 38 hdd 15.34180 osd.38 down 1.00000 1.00000 41 hdd 15.34180 osd.41 up 1.00000 1.00000 42 hdd 15.34180 osd.42 down 1.00000 1.00000 45 hdd 15.34180 osd.45 down 1.00000 1.00000 46 hdd 15.34180 osd.46 up 1.00000 1.00000 47 hdd 15.34180 osd.47 down 1.00000 1.00000 50 hdd 15.34180 osd.50 up 1.00000 1.00000 51 hdd 15.34180 osd.51 up 1.00000 1.00000 54 hdd 15.34180 osd.54 down 1.00000 1.00000 55 hdd 15.34180 osd.55 down 1.00000 1.00000 58 hdd 15.34180 osd.58 down 1.00000 1.00000 59 hdd 15.34180 osd.59 up 1.00000 1.00000 61 hdd 15.34180 osd.61 down 1.00000 1.00000 63 hdd 15.34180 osd.63 down 1.00000 1.00000 65 hdd 15.34180 osd.65 down 1.00000 1.00000 67 hdd 15.34180 osd.67 down 1.00000 1.00000 68 hdd 15.34180 osd.68 up 1.00000 1.00000 71 hdd 15.34180 osd.71 down 1.00000 1.00000 72 hdd 15.34180 osd.72 down 1.00000 1.00000 75 hdd 15.34180 osd.75 down 1.00000 1.00000 76 hdd 15.34180 osd.76 down 1.00000 1.00000 79 hdd 15.34180 osd.79 down 1.00000 1.00000 80 hdd 15.34180 osd.80 up 1.00000 1.00000 83 hdd 15.34180 osd.83 up 1.00000 1.00000 84 hdd 15.34180 osd.84 down 1.00000 1.00000 87 hdd 15.34180 osd.87 down 1.00000 1.00000 88 hdd 15.34180 osd.88 down 1.00000 1.00000 -5 813.11523 host rgw-2 13 hdd 15.34180 osd.13 up 1.00000 1.00000 17 hdd 15.34180 osd.17 up 1.00000 1.00000 22 hdd 15.34180 osd.22 up 1.00000 1.00000 27 hdd 15.34180 osd.27 up 1.00000 1.00000 31 hdd 15.34180 osd.31 up 1.00000 1.00000 36 hdd 15.34180 osd.36 up 1.00000 1.00000 40 hdd 15.34180 osd.40 up 1.00000 1.00000 44 hdd 15.34180 osd.44 up 1.00000 1.00000 49 hdd 15.34180 osd.49 up 1.00000 1.00000 53 hdd 15.34180 osd.53 up 1.00000 1.00000 57 hdd 15.34180 osd.57 up 1.00000 1.00000 62 hdd 15.34180 osd.62 up 1.00000 1.00000 66 hdd 15.34180 osd.66 up 1.00000 1.00000 70 hdd 15.34180 osd.70 up 1.00000 1.00000 74 hdd 15.34180 osd.74 up 1.00000 1.00000 78 hdd 15.34180 osd.78 up 1.00000 1.00000 82 hdd 15.34180 osd.82 
up 1.00000 1.00000 86 hdd 15.34180 osd.86 up 1.00000 1.00000 90 hdd 15.34180 osd.90 up 1.00000 1.00000 92 hdd 15.34180 osd.92 up 1.00000 1.00000 94 hdd 15.34180 osd.94 up 1.00000 1.00000 96 hdd 15.34180 osd.96 up 1.00000 1.00000 98 hdd 15.34180 osd.98 up 1.00000 1.00000 100 hdd 15.34180 osd.100 up 1.00000 1.00000 102 hdd 15.34180 osd.102 up 1.00000 1.00000 104 hdd 15.34180 osd.104 up 1.00000 1.00000 106 hdd 15.34180 osd.106 up 1.00000 1.00000 108 hdd 15.34180 osd.108 up 1.00000 1.00000 110 hdd 15.34180 osd.110 up 1.00000 1.00000 112 hdd 15.34180 osd.112 up 1.00000 1.00000 114 hdd 15.34180 osd.114 up 1.00000 1.00000 116 hdd 15.34180 osd.116 up 1.00000 1.00000 118 hdd 15.34180 osd.118 up 1.00000 1.00000 120 hdd 15.34180 osd.120 up 1.00000 1.00000 122 hdd 15.34180 osd.122 up 1.00000 1.00000 124 hdd 15.34180 osd.124 up 1.00000 1.00000 126 hdd 15.34180 osd.126 up 1.00000 1.00000 128 hdd 15.34180 osd.128 up 1.00000 1.00000 130 hdd 15.34180 osd.130 up 1.00000 1.00000 132 hdd 15.34180 osd.132 up 1.00000 1.00000 134 hdd 15.34180 osd.134 up 1.00000 1.00000 136 hdd 15.34180 osd.136 up 1.00000 1.00000 138 hdd 15.34180 osd.138 up 1.00000 1.00000 140 hdd 15.34180 osd.140 up 1.00000 1.00000 142 hdd 15.34180 osd.142 up 1.00000 1.00000 144 hdd 15.34180 osd.144 up 1.00000 1.00000 146 hdd 15.34180 osd.146 up 1.00000 1.00000 148 hdd 15.34180 osd.148 up 1.00000 1.00000 150 hdd 15.34180 osd.150 up 1.00000 1.00000 152 hdd 15.34180 osd.152 up 1.00000 1.00000 154 hdd 15.34180 osd.154 up 1.00000 1.00000 156 hdd 15.34180 osd.156 up 1.00000 1.00000 158 hdd 15.34180 osd.158 up 1.00000 1.00000 -3 813.11523 host rgw-3 43 hdd 15.34180 osd.43 up 1.00000 1.00000 52 hdd 15.34180 osd.52 up 1.00000 1.00000 107 hdd 15.34180 osd.107 up 1.00000 1.00000 109 hdd 15.34180 osd.109 up 1.00000 1.00000 111 hdd 15.34180 osd.111 up 1.00000 1.00000 113 hdd 15.34180 osd.113 up 1.00000 1.00000 115 hdd 15.34180 osd.115 up 1.00000 1.00000 117 hdd 15.34180 osd.117 up 1.00000 1.00000 119 hdd 15.34180 osd.119 up 1.00000 
1.00000 121 hdd 15.34180 osd.121 up 1.00000 1.00000 123 hdd 15.34180 osd.123 up 1.00000 1.00000 125 hdd 15.34180 osd.125 up 1.00000 1.00000 127 hdd 15.34180 osd.127 up 1.00000 1.00000 129 hdd 15.34180 osd.129 up 1.00000 1.00000 131 hdd 15.34180 osd.131 up 1.00000 1.00000 133 hdd 15.34180 osd.133 up 1.00000 1.00000 135 hdd 15.34180 osd.135 up 1.00000 1.00000 137 hdd 15.34180 osd.137 up 1.00000 1.00000 139 hdd 15.34180 osd.139 up 1.00000 1.00000 141 hdd 15.34180 osd.141 up 1.00000 1.00000 143 hdd 15.34180 osd.143 up 1.00000 1.00000 145 hdd 15.34180 osd.145 up 1.00000 1.00000 147 hdd 15.34180 osd.147 up 1.00000 1.00000 149 hdd 15.34180 osd.149 up 1.00000 1.00000 151 hdd 15.34180 osd.151 up 1.00000 1.00000 153 hdd 15.34180 osd.153 up 1.00000 1.00000 155 hdd 15.34180 osd.155 up 1.00000 1.00000 157 hdd 15.34180 osd.157 up 1.00000 1.00000 187 hdd 15.34180 osd.187 up 1.00000 1.00000 188 hdd 15.34180 osd.188 up 1.00000 1.00000 190 hdd 15.34180 osd.190 up 1.00000 1.00000 191 hdd 15.34180 osd.191 up 1.00000 1.00000 192 hdd 15.34180 osd.192 up 1.00000 1.00000 193 hdd 15.34180 osd.193 up 1.00000 1.00000 194 hdd 15.34180 osd.194 up 1.00000 1.00000 196 hdd 15.34180 osd.196 up 1.00000 1.00000 197 hdd 15.34180 osd.197 up 1.00000 1.00000 198 hdd 15.34180 osd.198 up 1.00000 1.00000 199 hdd 15.34180 osd.199 up 1.00000 1.00000 200 hdd 15.34180 osd.200 up 1.00000 1.00000 202 hdd 15.34180 osd.202 up 1.00000 1.00000 203 hdd 15.34180 osd.203 up 1.00000 1.00000 204 hdd 15.34180 osd.204 up 1.00000 1.00000 205 hdd 15.34180 osd.205 up 1.00000 1.00000 206 hdd 15.34180 osd.206 up 1.00000 1.00000 208 hdd 15.34180 osd.208 up 1.00000 1.00000 209 hdd 15.34180 osd.209 up 1.00000 1.00000 210 hdd 15.34180 osd.210 up 1.00000 1.00000 211 hdd 15.34180 osd.211 up 1.00000 1.00000 212 hdd 15.34180 osd.212 up 1.00000 1.00000 214 hdd 15.34180 osd.214 up 1.00000 1.00000 215 hdd 15.34180 osd.215 up 1.00000 1.00000 216 hdd 15.34180 osd.216 up 1.00000 1.00000 -9 813.11523 host rgw-4 265 hdd 15.34180 osd.265 up 
1.00000 1.00000 268 hdd 15.34180 osd.268 up 1.00000 1.00000 271 hdd 15.34180 osd.271 up 1.00000 1.00000 273 hdd 15.34180 osd.273 up 1.00000 1.00000 276 hdd 15.34180 osd.276 up 1.00000 1.00000 279 hdd 15.34180 osd.279 up 1.00000 1.00000 281 hdd 15.34180 osd.281 up 1.00000 1.00000 284 hdd 15.34180 osd.284 up 1.00000 1.00000 287 hdd 15.34180 osd.287 up 1.00000 1.00000 290 hdd 15.34180 osd.290 up 1.00000 1.00000 293 hdd 15.34180 osd.293 up 1.00000 1.00000 296 hdd 15.34180 osd.296 up 1.00000 1.00000 299 hdd 15.34180 osd.299 up 1.00000 1.00000 301 hdd 15.34180 osd.301 up 1.00000 1.00000 304 hdd 15.34180 osd.304 up 1.00000 1.00000 307 hdd 15.34180 osd.307 up 1.00000 1.00000 310 hdd 15.34180 osd.310 up 1.00000 1.00000 313 hdd 15.34180 osd.313 up 1.00000 1.00000 316 hdd 15.34180 osd.316 up 1.00000 1.00000 319 hdd 15.34180 osd.319 up 1.00000 1.00000 321 hdd 15.34180 osd.321 up 1.00000 1.00000 324 hdd 15.34180 osd.324 up 1.00000 1.00000 327 hdd 15.34180 osd.327 up 1.00000 1.00000 329 hdd 15.34180 osd.329 up 1.00000 1.00000 332 hdd 15.34180 osd.332 up 1.00000 1.00000 335 hdd 15.34180 osd.335 up 1.00000 1.00000 338 hdd 15.34180 osd.338 up 1.00000 1.00000 341 hdd 15.34180 osd.341 up 1.00000 1.00000 344 hdd 15.34180 osd.344 up 1.00000 1.00000 347 hdd 15.34180 osd.347 up 1.00000 1.00000 349 hdd 15.34180 osd.349 up 1.00000 1.00000 352 hdd 15.34180 osd.352 up 1.00000 1.00000 354 hdd 15.34180 osd.354 up 1.00000 1.00000 357 hdd 15.34180 osd.357 up 1.00000 1.00000 360 hdd 15.34180 osd.360 up 1.00000 1.00000 363 hdd 15.34180 osd.363 up 1.00000 1.00000 366 hdd 15.34180 osd.366 up 1.00000 1.00000 369 hdd 15.34180 osd.369 up 1.00000 1.00000 372 hdd 15.34180 osd.372 up 1.00000 1.00000 375 hdd 15.34180 osd.375 up 1.00000 1.00000 378 hdd 15.34180 osd.378 up 1.00000 1.00000 380 hdd 15.34180 osd.380 up 1.00000 1.00000 382 hdd 15.34180 osd.382 up 1.00000 1.00000 385 hdd 15.34180 osd.385 up 1.00000 1.00000 388 hdd 15.34180 osd.388 up 1.00000 1.00000 391 hdd 15.34180 osd.391 up 1.00000 1.00000 394 
hdd 15.34180 osd.394 up 1.00000 1.00000 397 hdd 15.34180 osd.397 up 1.00000 1.00000 399 hdd 15.34180 osd.399 up 1.00000 1.00000 402 hdd 15.34180 osd.402 up 1.00000 1.00000 404 hdd 15.34180 osd.404 up 1.00000 1.00000 407 hdd 15.34180 osd.407 up 1.00000 1.00000 410 hdd 15.34180 osd.410 up 1.00000 1.00000 -13 813.14203 host rgw-5 189 hdd 15.34279 osd.189 up 1.00000 1.00000 195 hdd 15.34279 osd.195 up 1.00000 1.00000 201 hdd 15.34180 osd.201 up 1.00000 1.00000 207 hdd 15.34180 osd.207 up 1.00000 1.00000 213 hdd 15.34180 osd.213 up 1.00000 1.00000 217 hdd 15.34180 osd.217 up 1.00000 1.00000 218 hdd 15.34279 osd.218 up 1.00000 1.00000 219 hdd 15.34180 osd.219 up 1.00000 1.00000 220 hdd 15.34279 osd.220 up 1.00000 1.00000 221 hdd 15.34180 osd.221 up 1.00000 1.00000 222 hdd 15.34279 osd.222 up 1.00000 1.00000 223 hdd 15.34180 osd.223 up 1.00000 1.00000 224 hdd 15.34279 osd.224 up 1.00000 1.00000 225 hdd 15.34180 osd.225 up 1.00000 1.00000 226 hdd 15.34279 osd.226 up 1.00000 1.00000 227 hdd 15.34180 osd.227 up 1.00000 1.00000 228 hdd 15.34279 osd.228 up 1.00000 1.00000 229 hdd 15.34180 osd.229 up 1.00000 1.00000 230 hdd 15.34279 osd.230 up 1.00000 1.00000 231 hdd 15.34180 osd.231 up 1.00000 1.00000 232 hdd 15.34279 osd.232 up 1.00000 1.00000 233 hdd 15.34180 osd.233 up 1.00000 1.00000 234 hdd 15.34180 osd.234 up 1.00000 1.00000 235 hdd 15.34279 osd.235 up 1.00000 1.00000 236 hdd 15.34180 osd.236 up 1.00000 1.00000 237 hdd 15.34279 osd.237 up 1.00000 1.00000 238 hdd 15.34180 osd.238 up 1.00000 1.00000 239 hdd 15.34279 osd.239 up 1.00000 1.00000 240 hdd 15.34180 osd.240 up 1.00000 1.00000 241 hdd 15.34279 osd.241 up 1.00000 1.00000 242 hdd 15.34180 osd.242 up 1.00000 1.00000 243 hdd 15.34279 osd.243 up 1.00000 1.00000 244 hdd 15.34180 osd.244 up 1.00000 1.00000 245 hdd 15.34279 osd.245 up 1.00000 1.00000 246 hdd 15.34180 osd.246 up 1.00000 1.00000 247 hdd 15.34279 osd.247 up 1.00000 1.00000 248 hdd 15.34180 osd.248 up 1.00000 1.00000 249 hdd 15.34279 osd.249 up 1.00000 
1.00000 250 hdd 15.34180 osd.250 up 1.00000 1.00000 251 hdd 15.34279 osd.251 up 1.00000 1.00000 252 hdd 15.34180 osd.252 up 1.00000 1.00000 253 hdd 15.34279 osd.253 up 1.00000 1.00000 254 hdd 15.34180 osd.254 up 1.00000 1.00000 255 hdd 15.34279 osd.255 up 1.00000 1.00000 256 hdd 15.34180 osd.256 up 1.00000 1.00000 257 hdd 15.34279 osd.257 up 1.00000 1.00000 258 hdd 15.34180 osd.258 up 1.00000 1.00000 259 hdd 15.34279 osd.259 up 1.00000 1.00000 260 hdd 15.34180 osd.260 up 1.00000 1.00000 261 hdd 15.34279 osd.261 up 1.00000 1.00000 262 hdd 15.34279 osd.262 up 1.00000 1.00000 263 hdd 15.34279 osd.263 up 1.00000 1.00000 264 hdd 15.34279 osd.264 up 1.00000 1.00000 -11 813.11523 host rgw-6 12 hdd 15.34180 osd.12 up 1.00000 1.00000 16 hdd 15.34180 osd.16 up 1.00000 1.00000 21 hdd 15.34180 osd.21 up 1.00000 1.00000 25 hdd 15.34180 osd.25 up 1.00000 1.00000 30 hdd 15.34180 osd.30 up 1.00000 1.00000 34 hdd 15.34180 osd.34 up 1.00000 1.00000 39 hdd 15.34180 osd.39 up 1.00000 1.00000 48 hdd 15.34180 osd.48 up 1.00000 1.00000 56 hdd 15.34180 osd.56 up 1.00000 1.00000 60 hdd 15.34180 osd.60 up 1.00000 1.00000 64 hdd 15.34180 osd.64 up 1.00000 1.00000 69 hdd 15.34180 osd.69 up 1.00000 1.00000 73 hdd 15.34180 osd.73 up 1.00000 1.00000 77 hdd 15.34180 osd.77 up 1.00000 1.00000 81 hdd 15.34180 osd.81 up 1.00000 1.00000 85 hdd 15.34180 osd.85 up 1.00000 1.00000 89 hdd 15.34180 osd.89 up 1.00000 1.00000 91 hdd 15.34180 osd.91 up 1.00000 1.00000 93 hdd 15.34180 osd.93 up 1.00000 1.00000 95 hdd 15.34180 osd.95 up 1.00000 1.00000 97 hdd 15.34180 osd.97 up 1.00000 1.00000 99 hdd 15.34180 osd.99 up 1.00000 1.00000 101 hdd 15.34180 osd.101 up 1.00000 1.00000 103 hdd 15.34180 osd.103 up 1.00000 1.00000 105 hdd 15.34180 osd.105 up 1.00000 1.00000 159 hdd 15.34180 osd.159 up 1.00000 1.00000 160 hdd 15.34180 osd.160 up 1.00000 1.00000 161 hdd 15.34180 osd.161 up 1.00000 1.00000 162 hdd 15.34180 osd.162 up 1.00000 1.00000 163 hdd 15.34180 osd.163 up 1.00000 1.00000 164 hdd 15.34180 osd.164 up 
1.00000 1.00000 165 hdd 15.34180 osd.165 up 1.00000 1.00000 166 hdd 15.34180 osd.166 up 1.00000 1.00000 167 hdd 15.34180 osd.167 up 1.00000 1.00000 168 hdd 15.34180 osd.168 up 1.00000 1.00000 169 hdd 15.34180 osd.169 up 1.00000 1.00000 170 hdd 15.34180 osd.170 up 1.00000 1.00000 171 hdd 15.34180 osd.171 up 1.00000 1.00000 172 hdd 15.34180 osd.172 up 1.00000 1.00000 173 hdd 15.34180 osd.173 up 1.00000 1.00000 174 hdd 15.34180 osd.174 up 1.00000 1.00000 175 hdd 15.34180 osd.175 up 1.00000 1.00000 176 hdd 15.34180 osd.176 up 1.00000 1.00000 177 hdd 15.34180 osd.177 up 1.00000 1.00000 178 hdd 15.34180 osd.178 up 1.00000 1.00000 179 hdd 15.34180 osd.179 up 1.00000 1.00000 180 hdd 15.34180 osd.180 up 1.00000 1.00000 181 hdd 15.34180 osd.181 up 1.00000 1.00000 182 hdd 15.34180 osd.182 up 1.00000 1.00000 183 hdd 15.34180 osd.183 up 1.00000 1.00000 184 hdd 15.34180 osd.184 up 1.00000 1.00000 185 hdd 15.34180 osd.185 up 1.00000 1.00000 186 hdd 15.34180 osd.186 up 1.00000 1.00000
[root@rgw-5 ~]#
[root@rgw-5 ~]# ceph -s
  cluster:
    id:     ebe0aa4b-4fb5-4c68-84ab-cbf1118937a2
    health: HEALTH_WARN
            nodown,noout,norebalance,norecover flag(s) set
            34 osds down
            Long heartbeat ping times on back interface seen, longest is 27716.598 msec
            Long heartbeat ping times on front interface seen, longest is 27723.520 msec
            Reduced data availability: 16 pgs inactive, 12 pgs peering, 2 pgs stale

  services:
    mon: 3 daemons, quorum rgw-1,rgw-2,rgw-3 (age 36m)
    mgr: rgw-4(active, since 100m), standbys: rgw-6, rgw-5
    osd: 318 osds: 284 up (since 83s), 318 in (since 4h); 10 remapped pgs
         flags nodown,noout,norebalance,norecover

  data:
    pools:   1 pools, 32 pgs
    objects: 0 objects, 0 B
    usage:   199 TiB used, 3.4 PiB / 3.6 PiB avail
    pgs:     9.375% pgs unknown
             40.625% pgs not active
             11 active+clean
             7  peering
             5  active+clean+remapped
             3  unknown
             3  remapped+peering
             2  stale+peering
             1  activating
[root@rgw-5 ~]#
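One note on the symptom: deleting a pool with ~45 million objects makes every OSD remove its PG data in the background, which can drive exactly this kind of sustained CPU/disk load even after the pool is gone. If throttling those background deletes is desired, Nautilus-based builds expose sleep knobs for PG removal; a fragment like the following is one option (option availability in this exact RHCS 4.1 build is an assumption; verify with `ceph config help osd_delete_sleep`):

```ini
[osd]
# Seconds to sleep between PG-removal transactions on HDD-backed OSDs,
# trading slower cleanup for lower background load.
osd_delete_sleep_hdd = 1
```

This does not bring the down OSDs on rgw-1 back by itself, but it can take pressure off the nodes that are currently pegged on delete work.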
Created attachment 1689971 [details] OSD node CPU utilization Grafana screenshots
Created attachment 1689972 [details] Ceph PG Dump

One more interesting output is the ceph pg dump (attached here). As I mentioned, I deleted all the pools (approx. 45 million objects), and then the .rgw.root pool was auto-created by one of the running RGWs with 128 PGs. The PG dump output has a lot of stale/undeleted PG entries, which should have been deleted when I deleted the pools. Do you think these stale entries are 1) causing the very high CPU utilization on all OSD nodes, and 2) preventing the OSDs on rgw-1 from booting up?
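To check whether those entries really belong to the deleted pools: the pool id is the numeric prefix of each pgid (e.g. 18.f8e is pool 18), so the pgids from the pg dump can be grouped by pool and anything outside .rgw.root (pool 21) is leftover. A sketch (hypothetical helper; assumes plain pgid strings as they appear in pg dump output):

```python
from collections import defaultdict

# Group pgids from a pg dump by pool id; the pool id is the part before the
# first dot, so entries for deleted pools stand out next to .rgw.root (id 21).
def pgids_by_pool(pgids):
    by_pool = defaultdict(list)
    for pgid in pgids:
        pool_id = int(pgid.split(".", 1)[0])
        by_pool[pool_id].append(pgid)
    return dict(by_pool)

sample = ["18.f8e", "18.7a2", "17.710", "21.0"]
stale = {p: pgs for p, pgs in pgids_by_pool(sample).items() if p != 21}
print(sorted(stale))  # → [17, 18]: pool ids with leftover PG entries
```

If the real pg dump shows many pgids with prefixes 17/18 (the deleted pools), that would support the theory that PG cleanup is what is keeping the OSDs busy.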
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days