Bug 2229651
| Summary: | [EC 2+2@4] Observing Slow OSD heartbeats on front and back post Host down and reboot | ||
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Ceph Storage | Reporter: | Pawan <pdhiran> |
| Component: | RADOS | Assignee: | Michael J. Kidd <linuxkidd> |
| Status: | NEW --- | QA Contact: | Pawan <pdhiran> |
| Severity: | medium | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | 6.1 | CC: | bhubbard, ceph-eng-bugs, cephqe-warriors, nojha, vumrao |
| Target Milestone: | --- | ||
| Target Release: | 7.1 | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
Description of problem:
We are observing slow heartbeat errors post host down (30+ min downtime) and reboot. All the OSDs which are reported to have slow heartbeats belong to the same host that was brought down.

    # ceph health detail
    HEALTH_WARN Slow OSD heartbeats on back (longest 2082.589ms); Slow OSD heartbeats on front (longest 2210.562ms)
    [WRN] OSD_SLOW_PING_TIME_BACK: Slow OSD heartbeats on back (longest 2082.589ms)
        Slow OSD heartbeats on back from osd.21 [] to osd.9 [] 2082.589 msec possibly improving
        Slow OSD heartbeats on back from osd.21 [] to osd.1 [] 1880.528 msec possibly improving
        Slow OSD heartbeats on back from osd.13 [] to osd.9 [] 1274.215 msec possibly improving
    [WRN] OSD_SLOW_PING_TIME_FRONT: Slow OSD heartbeats on front (longest 2210.562ms)
        Slow OSD heartbeats on front from osd.21 [] to osd.9 [] 2210.562 msec possibly improving
        Slow OSD heartbeats on front from osd.13 [] to osd.9 [] 1659.351 msec possibly improving
        Slow OSD heartbeats on front from osd.1 [] to osd.9 [] 1237.286 msec possibly improving
        Slow OSD heartbeats on front from osd.21 [] to osd.1 [] 1069.686 msec possibly improving

    [root@argo012 ~]# ceph -s
      cluster:
        id:     66070a80-2f84-11ee-bc2c-0cc47af3ea56
        health: HEALTH_WARN
                Slow OSD heartbeats on back (longest 2082.589ms)
                Slow OSD heartbeats on front (longest 2210.562ms)

      services:
        mon: 3 daemons, quorum argo012,argo013,argo014 (age 6d)
        mgr: argo012.odttqx(active, since 6d), standbys: argo013.akdhka, argo014.xfhnzv
        osd: 36 osds: 36 up (since 23m), 36 in (since 23m)
        rgw: 4 daemons active (4 hosts, 1 zones)

      data:
        pools:   7 pools, 1185 pgs
        objects: 1.65M objects, 1.2 TiB
        usage:   3.0 TiB used, 12 TiB / 15 TiB avail
        pgs:     1180 active+clean
                 3    active+clean+scrubbing
                 2    active+clean+scrubbing+deep

      io:
        client:   19 KiB/s rd, 2.3 MiB/s wr, 26 op/s rd, 53 op/s wr

OSD dump:

    osd.0 up in weight 1 up_from 329 up_thru 7984 down_at 322 last_clean_interval [51,321) [v2:10.8.128.213:6800/3313869263,v1:10.8.128.213:6801/3313869263] [v2:10.8.128.213:6802/3313869263,v1:10.8.128.213:6803/3313869263] exists,up 6bfdc645-9cf9-4c5f-8227-621468bcd219
    osd.1 up in weight 1 up_from 7850 up_thru 8018 down_at 7458 last_clean_interval [3815,7457) [v2:10.8.128.215:6800/4115655657,v1:10.8.128.215:6801/4115655657] [v2:10.8.128.215:6805/4115655657,v1:10.8.128.215:6808/4115655657] exists,up 58ae0f72-c629-4225-86ba-f45ec774f4ca
    osd.2 up in weight 1 up_from 356 up_thru 7998 down_at 349 last_clean_interval [74,348) [v2:10.8.128.214:6824/787684420,v1:10.8.128.214:6825/787684420] [v2:10.8.128.214:6826/787684420,v1:10.8.128.214:6827/787684420] exists,up 2e8bfd8a-c71f-4fea-9d89-de69f96d87a5
    osd.3 up in weight 1 up_from 265 up_thru 8026 down_at 257 last_clean_interval [72,256) [v2:10.8.128.212:6842/1074866389,v1:10.8.128.212:6843/1074866389] [v2:10.8.128.212:6844/1074866389,v1:10.8.128.212:6845/1074866389] exists,up ae682a2e-7c22-40c1-b0ca-755a30af6e8b
    osd.4 up in weight 1 up_from 320 up_thru 7930 down_at 312 last_clean_interval [83,311) [v2:10.8.128.213:6856/3771243220,v1:10.8.128.213:6857/3771243220] [v2:10.8.128.213:6858/3771243220,v1:10.8.128.213:6859/3771243220] exists,up 19ec550a-4b5a-40d3-a9c3-0683600bb820
    osd.5 up in weight 1 up_from 7868 up_thru 7969 down_at 7805 last_clean_interval [3815,7804) [v2:10.8.128.215:6802/3662024116,v1:10.8.128.215:6804/3662024116] [v2:10.8.128.215:6807/3662024116,v1:10.8.128.215:6810/3662024116] exists,up 0c89c728-e935-4b83-8558-baa6f56eb13a
    osd.6 up in weight 1 up_from 296 up_thru 8032 down_at 289 last_clean_interval [84,288) [v2:10.8.128.212:6866/189185225,v1:10.8.128.212:6867/189185225] [v2:10.8.128.212:6868/189185225,v1:10.8.128.212:6869/189185225] exists,up ece9252a-ef39-4af6-8b47-1d2088971c65
    osd.7 up in weight 1 up_from 347 up_thru 8013 down_at 339 last_clean_interval [85,338) [v2:10.8.128.214:6864/2038155548,v1:10.8.128.214:6865/2038155548] [v2:10.8.128.214:6866/2038155548,v1:10.8.128.214:6867/2038155548] exists,up 6ba5e8d2-72ed-4767-9b82-846ca0c8e3b3
    osd.8 up in weight 1 up_from 333 up_thru 7976 down_at 326 last_clean_interval [94,325) [v2:10.8.128.213:6864/2744938019,v1:10.8.128.213:6865/2744938019] [v2:10.8.128.213:6866/2744938019,v1:10.8.128.213:6867/2744938019] exists,up 7c5978b0-7d50-42d0-8c6d-8ea0d05c106d
    osd.9 up in weight 1 up_from 7844 up_thru 7944 down_at 7811 last_clean_interval [3815,7810) [v2:10.8.128.215:6856/1001536700,v1:10.8.128.215:6857/1001536700] [v2:10.8.128.215:6858/1001536700,v1:10.8.128.215:6859/1001536700] exists,up 6cff0521-bb0e-4f86-98ce-51f81b0c2ec4
    osd.10 up in weight 1 up_from 260 up_thru 7919 down_at 255 last_clean_interval [54,254) [v2:10.8.128.212:6802/3459447412,v1:10.8.128.212:6803/3459447412] [v2:10.8.128.212:6804/3459447412,v1:10.8.128.212:6805/3459447412] exists,up fad0e532-cf28-4380-b7d7-badecc7a5507
    osd.11 up in weight 1 up_from 369 up_thru 7892 down_at 362 last_clean_interval [56,361) [v2:10.8.128.214:6800/2818272725,v1:10.8.128.214:6801/2818272725] [v2:10.8.128.214:6802/2818272725,v1:10.8.128.214:6803/2818272725] exists,up b51fc3b0-bcb5-459d-add5-fbd3ef38e996
    osd.12 up in weight 1 up_from 308 up_thru 7883 down_at 293 last_clean_interval [59,292) [v2:10.8.128.213:6808/4220875889,v1:10.8.128.213:6809/4220875889] [v2:10.8.128.213:6810/4220875889,v1:10.8.128.213:6811/4220875889] exists,up e4fbca6f-2cfd-4adf-aa76-934f1ee9b4de
    osd.13 up in weight 1 up_from 7850 up_thru 8019 down_at 7460 last_clean_interval [3804,7459) [v2:10.8.128.215:6832/3062302921,v1:10.8.128.215:6833/3062302921] [v2:10.8.128.215:6834/3062302921,v1:10.8.128.215:6835/3062302921] exists,up 2843c285-10ad-4fb5-8b15-55e82def853e
    osd.14 up in weight 1 up_from 342 up_thru 7899 down_at 334 last_clean_interval [67,333) [v2:10.8.128.214:6808/79306502,v1:10.8.128.214:6809/79306502] [v2:10.8.128.214:6810/79306502,v1:10.8.128.214:6811/79306502] exists,up 59a6f548-842e-4ebb-9436-5a115b7a51a2
    osd.15 up in weight 1 up_from 283 up_thru 7986 down_at 275 last_clean_interval [63,274) [v2:10.8.128.212:6810/77006824,v1:10.8.128.212:6811/77006824] [v2:10.8.128.212:6812/77006824,v1:10.8.128.212:6813/77006824] exists,up 7c1d296d-f101-4419-a5dd-8c82d43497cf
    osd.16 up in weight 1 up_from 315 up_thru 8024 down_at 308 last_clean_interval [56,307) [v2:10.8.128.213:6816/3922750587,v1:10.8.128.213:6817/3922750587] [v2:10.8.128.213:6818/3922750587,v1:10.8.128.213:6819/3922750587] exists,up fa0b85ef-5abd-47d0-bb8f-d24f1d293bb7
    osd.17 up in weight 1 up_from 7880 up_thru 7923 down_at 7462 last_clean_interval [3799,7461) [v2:10.8.128.215:6824/3380354607,v1:10.8.128.215:6825/3380354607] [v2:10.8.128.215:6826/3380354607,v1:10.8.128.215:6827/3380354607] exists,up 9470c87f-aa05-48e3-bd76-f3fbb9829547
    osd.18 up in weight 1 up_from 380 up_thru 7988 down_at 366 last_clean_interval [57,365) [v2:10.8.128.214:6816/2008617659,v1:10.8.128.214:6817/2008617659] [v2:10.8.128.214:6818/2008617659,v1:10.8.128.214:6819/2008617659] exists,up f109dde3-809a-4341-a4c3-ae76bc94f7a6
    osd.19 up in weight 1 up_from 270 up_thru 8028 down_at 262 last_clean_interval [59,261) [v2:10.8.128.212:6818/3144902093,v1:10.8.128.212:6819/3144902093] [v2:10.8.128.212:6820/3144902093,v1:10.8.128.212:6821/3144902093] exists,up 0289665b-cc42-48f7-a55f-bb673799c9ca
    osd.20 up in weight 1 up_from 307 up_thru 7968 down_at 301 last_clean_interval [75,300) [v2:10.8.128.213:6824/4112410208,v1:10.8.128.213:6825/4112410208] [v2:10.8.128.213:6826/4112410208,v1:10.8.128.213:6827/4112410208] exists,up bfa91559-6140-444c-849e-2a65a404d3d1
    osd.21 up in weight 1 up_from 7851 up_thru 7981 down_at 7464 last_clean_interval [3799,7463) [v2:10.8.128.215:6864/2402257169,v1:10.8.128.215:6865/2402257169] [v2:10.8.128.215:6866/2402257169,v1:10.8.128.215:6867/2402257169] exists,up 6a3c703d-d2fd-4e5a-8dae-1594df612a47
    osd.22 up in weight 1 up_from 365 up_thru 7990 down_at 357 last_clean_interval [67,356) [v2:10.8.128.214:6832/1448249484,v1:10.8.128.214:6833/1448249484] [v2:10.8.128.214:6834/1448249484,v1:10.8.128.214:6835/1448249484] exists,up 60b77f38-edd4-41d1-80d6-5c0ba4b20246
    osd.23 up in weight 1 up_from 274 up_thru 7972 down_at 267 last_clean_interval [74,266) [v2:10.8.128.212:6826/190209101,v1:10.8.128.212:6827/190209101] [v2:10.8.128.212:6828/190209101,v1:10.8.128.212:6829/190209101] exists,up a2e1020f-497b-42a0-9c69-424907dfc60f
    osd.24 up in weight 1 up_from 325 up_thru 8022 down_at 317 last_clean_interval [80,316) [v2:10.8.128.213:6832/2627256784,v1:10.8.128.213:6833/2627256784] [v2:10.8.128.213:6834/2627256784,v1:10.8.128.213:6835/2627256784] exists,up 7dccce66-a574-4622-8866-fa6a68648c45
    osd.25 up in weight 1 up_from 7876 up_thru 8008 down_at 7765 last_clean_interval [3813,7764) [v2:10.8.128.215:6803/3478994007,v1:10.8.128.215:6806/3478994007] [v2:10.8.128.215:6809/3478994007,v1:10.8.128.215:6812/3478994007] exists,up 657c5d3d-50e0-4929-b6b2-073d3e1d1b56
    osd.26 up in weight 1 up_from 337 up_thru 7980 down_at 330 last_clean_interval [82,329) [v2:10.8.128.214:6840/2828642581,v1:10.8.128.214:6841/2828642581] [v2:10.8.128.214:6842/2828642581,v1:10.8.128.214:6843/2828642581] exists,up 5149280c-7369-41d8-ae6a-7bd5c816dacb
    osd.27 up in weight 1 up_from 288 up_thru 7899 down_at 280 last_clean_interval [80,279) [v2:10.8.128.212:6834/3922396735,v1:10.8.128.212:6835/3922396735] [v2:10.8.128.212:6836/3922396735,v1:10.8.128.212:6837/3922396735] exists,up 5b4ce98c-03d1-4d59-9945-801f82b80ce0
    osd.28 up in weight 1 up_from 311 up_thru 7996 down_at 304 last_clean_interval [84,303) [v2:10.8.128.213:6840/233969254,v1:10.8.128.213:6841/233969254] [v2:10.8.128.213:6842/233969254,v1:10.8.128.213:6843/233969254] exists,up 8bc5f3c5-4767-4944-8084-b3f26c2d5ef6
    osd.29 up in weight 1 up_from 7868 up_thru 7929 down_at 7768 last_clean_interval [3815,7767) [v2:10.8.128.215:6848/494241297,v1:10.8.128.215:6849/494241297] [v2:10.8.128.215:6850/494241297,v1:10.8.128.215:6851/494241297] exists,up 1a8d0339-6783-4694-bac1-b855b7b6d04c
    osd.30 up in weight 1 up_from 360 up_thru 8030 down_at 353 last_clean_interval [84,352) [v2:10.8.128.214:6848/3895757485,v1:10.8.128.214:6849/3895757485] [v2:10.8.128.214:6850/3895757485,v1:10.8.128.214:6851/3895757485] exists,up d9a2b3fa-adce-44a7-8157-39dcb9ada8fb
    osd.31 up in weight 1 up_from 292 up_thru 8002 down_at 285 last_clean_interval [88,284) [v2:10.8.128.212:6850/1920787905,v1:10.8.128.212:6851/1920787905] [v2:10.8.128.212:6852/1920787905,v1:10.8.128.212:6853/1920787905] exists,up ca0f750a-ba77-4107-ba10-ba6c389d5fc0
    osd.32 up in weight 1 up_from 304 up_thru 8000 down_at 298 last_clean_interval [93,297) [v2:10.8.128.213:6848/524570476,v1:10.8.128.213:6849/524570476] [v2:10.8.128.213:6850/524570476,v1:10.8.128.213:6851/524570476] exists,up 21287e7b-9429-472a-bdcb-9dc5e7aa3f47
    osd.33 up in weight 1 up_from 7862 up_thru 7922 down_at 7799 last_clean_interval [3803,7799) [v2:10.8.128.215:6840/2698895660,v1:10.8.128.215:6841/2698895660] [v2:10.8.128.215:6842/2698895660,v1:10.8.128.215:6843/2698895660] exists,up 8e07bd19-8d9c-4134-b3e9-e39a3e9d207e
    osd.34 up in weight 1 up_from 352 up_thru 8010 down_at 344 last_clean_interval [90,343) [v2:10.8.128.214:6856/1261359665,v1:10.8.128.214:6857/1261359665] [v2:10.8.128.214:6858/1261359665,v1:10.8.128.214:6859/1261359665] exists,up 5a6db3ff-d874-4685-9563-d71746061363
    osd.35 up in weight 1 up_from 562 up_thru 7898 down_at 534 last_clean_interval [278,533) [v2:10.8.128.212:6858/3358448506,v1:10.8.128.212:6859/3358448506] [v2:10.8.128.212:6860/3358448506,v1:10.8.128.212:6861/3358448506] exists,up 5c883920-b9bb-44e7-bec5-8ca258d29722

    # ceph osd perf
    osd  commit_latency(ms)  apply_latency(ms)
     35                   6                  6
     34                   6                  6
     33                  19                 19
     32                   0                  0
     31                   0                  0
     30                   2                  2
     29                   0                  0
     12                   7                  7
     11                   0                  0
     10                   8                  8
      9                   4                  4
      8                   4                  4
      7                   6                  6
      6                   4                  4
      5                   6                  6
      4                   2                  2
      3                   0                  0
      2                   0                  0
      1                  10                 10
      0                   2                  2
     13                   0                  0
     14                   0                  0
     15                   0                  0
     16                   0                  0
     17                   0                  0
     18                   0                  0
     19                   1                  1
     20                   8                  8
     21                   1                  1
     22                   6                  6
     23                   0                  0
     24                   4                  4
     25                   3                  3
     26                  11                 11
     27                  11                 11
     28                   0                  0

Version-Release number of selected component (if applicable):
ceph version 17.2.6-70.el9cp (fe62dcdbb2c6e05782a3e2b67d025b84ff5047cc) quincy (stable)

How reproducible:
1/1

Steps to Reproduce:
1. Deploy an RHCS cluster and write data.
2. Bring down a host for 30+ minutes. Wait for its OSDs to be marked out and for the data to be backfilled onto other OSDs.
3. Bring the host back up. Observe that slow heartbeat warnings are still reported even 1 hour after the host is powered up (PGs were already active+clean).

Actual results:
Health warnings were observed even after the cluster had completed recovery and all PGs were active+clean.

Expected results:
No warnings to be seen.

Additional info:
The warnings cleared from the cluster about 1 hour after the host shutdown and power-up. The cluster had already reached active+clean about 20 minutes after the host shutdown and power-up.
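
The description states that every OSD named in the slow heartbeat warnings belongs to the host that was brought down. A minimal Python sketch of how that could be cross-checked against the OSD dump follows. The function names (`parse_slow_heartbeats`, `osd_hosts`, `hosts_involved`) are hypothetical helpers, not part of Ceph, and the regular expressions only assume the plain-text line shapes pasted in this report. The sample inputs are excerpts from the report, with each OSD dump entry cut after its first address vector.

```python
import re

# Assumed line shape, taken from the "ceph health detail" output in this report:
#   Slow OSD heartbeats on back from osd.21 [] to osd.9 [] 2082.589 msec possibly improving
HEARTBEAT_RE = re.compile(
    r"Slow OSD heartbeats on (back|front) from osd\.(\d+) \[\S*\] "
    r"to osd\.(\d+) \[\S*\] ([\d.]+) msec"
)

# Assumed line shape, taken from the OSD dump in this report: the first
# "[v2:<ip>:<port>/<nonce>" after "osd.N up" is treated as the OSD's host address.
OSD_ADDR_RE = re.compile(r"osd\.(\d+) up .*?\[v2:(\d+\.\d+\.\d+\.\d+):")


def parse_slow_heartbeats(health_detail):
    """Return (network, source_osd, target_osd, latency_ms) for each detail line."""
    return [
        (net, int(src), int(dst), float(ms))
        for net, src, dst, ms in HEARTBEAT_RE.findall(health_detail)
    ]


def osd_hosts(osd_dump):
    """Map OSD id to the IP address found in its first v2 address."""
    return {int(osd): ip for osd, ip in OSD_ADDR_RE.findall(osd_dump)}


def hosts_involved(health_detail, osd_dump):
    """Return the set of host IPs of every OSD named in a slow heartbeat line."""
    addr = osd_hosts(osd_dump)
    hosts = set()
    for _net, src, dst, _ms in parse_slow_heartbeats(health_detail):
        hosts.add(addr.get(src, "unknown"))
        hosts.add(addr.get(dst, "unknown"))
    return hosts


# Excerpts from this report (OSD dump entries truncated after the first address vector).
HEALTH_DETAIL = """\
Slow OSD heartbeats on back from osd.21 [] to osd.9 [] 2082.589 msec possibly improving
Slow OSD heartbeats on back from osd.13 [] to osd.9 [] 1274.215 msec possibly improving
Slow OSD heartbeats on front from osd.1 [] to osd.9 [] 1237.286 msec possibly improving
"""

OSD_DUMP = """\
osd.0 up in weight 1 up_from 329 up_thru 7984 down_at 322 last_clean_interval [51,321) [v2:10.8.128.213:6800/3313869263,v1:10.8.128.213:6801/3313869263]
osd.1 up in weight 1 up_from 7850 up_thru 8018 down_at 7458 last_clean_interval [3815,7457) [v2:10.8.128.215:6800/4115655657,v1:10.8.128.215:6801/4115655657]
osd.9 up in weight 1 up_from 7844 up_thru 7944 down_at 7811 last_clean_interval [3815,7810) [v2:10.8.128.215:6856/1001536700,v1:10.8.128.215:6857/1001536700]
osd.13 up in weight 1 up_from 7850 up_thru 8019 down_at 7460 last_clean_interval [3804,7459) [v2:10.8.128.215:6832/3062302921,v1:10.8.128.215:6833/3062302921]
osd.21 up in weight 1 up_from 7851 up_thru 7981 down_at 7464 last_clean_interval [3799,7463) [v2:10.8.128.215:6864/2402257169,v1:10.8.128.215:6865/2402257169]
"""

if __name__ == "__main__":
    print(hosts_involved(HEALTH_DETAIL, OSD_DUMP))  # {'10.8.128.215'}
```

On these excerpts the only address returned is 10.8.128.215, which matches the statement that all affected OSDs sit on a single host. The sketch keys on IP addresses rather than hostnames because the OSD dump in this report does not include hostnames.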