Bug 2229651 - [EC 2+2@4] Observing Slow OSD heartbeats on front and back post Host down and reboot
Keywords:
Status: NEW
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: RADOS
Version: 6.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 7.1
Assignee: Michael J. Kidd
QA Contact: Pawan
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2023-08-07 08:05 UTC by Pawan
Modified: 2023-08-10 06:37 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Embargoed:



Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker RHCEPH-7169 0 None None None 2023-08-07 08:09:00 UTC

Description Pawan 2023-08-07 08:05:25 UTC
Description of problem:
We are observing slow OSD heartbeat warnings after a host was brought down (30+ minutes of downtime) and rebooted.

All the OSDs reported as having slow heartbeats belong to the same host, the one that was brought down.

# ceph health detail
HEALTH_WARN Slow OSD heartbeats on back (longest 2082.589ms); Slow OSD heartbeats on front (longest 2210.562ms)
[WRN] OSD_SLOW_PING_TIME_BACK: Slow OSD heartbeats on back (longest 2082.589ms)
    Slow OSD heartbeats on back from osd.21 [] to osd.9 [] 2082.589 msec possibly improving
    Slow OSD heartbeats on back from osd.21 [] to osd.1 [] 1880.528 msec possibly improving
    Slow OSD heartbeats on back from osd.13 [] to osd.9 [] 1274.215 msec possibly improving
[WRN] OSD_SLOW_PING_TIME_FRONT: Slow OSD heartbeats on front (longest 2210.562ms)
    Slow OSD heartbeats on front from osd.21 [] to osd.9 [] 2210.562 msec possibly improving
    Slow OSD heartbeats on front from osd.13 [] to osd.9 [] 1659.351 msec possibly improving
    Slow OSD heartbeats on front from osd.1 [] to osd.9 [] 1237.286 msec possibly improving
    Slow OSD heartbeats on front from osd.21 [] to osd.1 [] 1069.686 msec possibly improving
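The warning lines above can be tallied per OSD to see which daemons are involved. The following is a minimal Python sketch, assuming the `ceph health detail` text layout shown above. The names `parse_slow_heartbeats` and `osds_involved` are illustrative helpers, not part of Ceph.

```python
import re
from collections import Counter

# Matches detail lines such as:
#   Slow OSD heartbeats on back from osd.21 [] to osd.9 [] 2082.589 msec possibly improving
# Summary lines ("... on back (longest 2082.589ms)") do not match and are skipped.
LINE_RE = re.compile(
    r"Slow OSD heartbeats on (?P<net>front|back) "
    r"from osd\.(?P<src>\d+) \[.*?\] to osd\.(?P<dst>\d+) \[.*?\] "
    r"(?P<ms>[\d.]+) msec"
)

def parse_slow_heartbeats(text):
    """Return a list of (network, src_osd, dst_osd, latency_ms) tuples."""
    return [
        (m["net"], int(m["src"]), int(m["dst"]), float(m["ms"]))
        for m in LINE_RE.finditer(text)
    ]

def osds_involved(rows):
    """Count how often each OSD appears as either endpoint of a slow ping."""
    counts = Counter()
    for _net, src, dst, _ms in rows:
        counts[src] += 1
        counts[dst] += 1
    return counts

# Output copied from the report above.
SAMPLE = """\
[WRN] OSD_SLOW_PING_TIME_BACK: Slow OSD heartbeats on back (longest 2082.589ms)
    Slow OSD heartbeats on back from osd.21 [] to osd.9 [] 2082.589 msec possibly improving
    Slow OSD heartbeats on back from osd.21 [] to osd.1 [] 1880.528 msec possibly improving
    Slow OSD heartbeats on back from osd.13 [] to osd.9 [] 1274.215 msec possibly improving
[WRN] OSD_SLOW_PING_TIME_FRONT: Slow OSD heartbeats on front (longest 2210.562ms)
    Slow OSD heartbeats on front from osd.21 [] to osd.9 [] 2210.562 msec possibly improving
    Slow OSD heartbeats on front from osd.13 [] to osd.9 [] 1659.351 msec possibly improving
    Slow OSD heartbeats on front from osd.1 [] to osd.9 [] 1237.286 msec possibly improving
    Slow OSD heartbeats on front from osd.21 [] to osd.1 [] 1069.686 msec possibly improving
"""

if __name__ == "__main__":
    rows = parse_slow_heartbeats(SAMPLE)
    for osd, n in sorted(osds_involved(rows).items()):
        print(f"osd.{osd}: appears in {n} slow-heartbeat entries")
```

For this output, only osd.1, osd.9, osd.13 and osd.21 appear, as either source or destination.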


[root@argo012 ~]# ceph -s
  cluster:
    id:     66070a80-2f84-11ee-bc2c-0cc47af3ea56
    health: HEALTH_WARN
            Slow OSD heartbeats on back (longest 2082.589ms)
            Slow OSD heartbeats on front (longest 2210.562ms)

  services:
    mon: 3 daemons, quorum argo012,argo013,argo014 (age 6d)
    mgr: argo012.odttqx(active, since 6d), standbys: argo013.akdhka, argo014.xfhnzv
    osd: 36 osds: 36 up (since 23m), 36 in (since 23m)
    rgw: 4 daemons active (4 hosts, 1 zones)

  data:
    pools:   7 pools, 1185 pgs
    objects: 1.65M objects, 1.2 TiB
    usage:   3.0 TiB used, 12 TiB / 15 TiB avail
    pgs:     1180 active+clean
             3    active+clean+scrubbing
             2    active+clean+scrubbing+deep

  io:
    client:   19 KiB/s rd, 2.3 MiB/s wr, 26 op/s rd, 53 op/s wr


OSD dump:
osd.0 up   in  weight 1 up_from 329 up_thru 7984 down_at 322 last_clean_interval [51,321) [v2:10.8.128.213:6800/3313869263,v1:10.8.128.213:6801/3313869263] [v2:10.8.128.213:6802/3313869263,v1:10.8.128.213:6803/3313869263] exists,up 6bfdc645-9cf9-4c5f-8227-621468bcd219
osd.1 up   in  weight 1 up_from 7850 up_thru 8018 down_at 7458 last_clean_interval [3815,7457) [v2:10.8.128.215:6800/4115655657,v1:10.8.128.215:6801/4115655657] [v2:10.8.128.215:6805/4115655657,v1:10.8.128.215:6808/4115655657] exists,up 58ae0f72-c629-4225-86ba-f45ec774f4ca
osd.2 up   in  weight 1 up_from 356 up_thru 7998 down_at 349 last_clean_interval [74,348) [v2:10.8.128.214:6824/787684420,v1:10.8.128.214:6825/787684420] [v2:10.8.128.214:6826/787684420,v1:10.8.128.214:6827/787684420] exists,up 2e8bfd8a-c71f-4fea-9d89-de69f96d87a5
osd.3 up   in  weight 1 up_from 265 up_thru 8026 down_at 257 last_clean_interval [72,256) [v2:10.8.128.212:6842/1074866389,v1:10.8.128.212:6843/1074866389] [v2:10.8.128.212:6844/1074866389,v1:10.8.128.212:6845/1074866389] exists,up ae682a2e-7c22-40c1-b0ca-755a30af6e8b
osd.4 up   in  weight 1 up_from 320 up_thru 7930 down_at 312 last_clean_interval [83,311) [v2:10.8.128.213:6856/3771243220,v1:10.8.128.213:6857/3771243220] [v2:10.8.128.213:6858/3771243220,v1:10.8.128.213:6859/3771243220] exists,up 19ec550a-4b5a-40d3-a9c3-0683600bb820
osd.5 up   in  weight 1 up_from 7868 up_thru 7969 down_at 7805 last_clean_interval [3815,7804) [v2:10.8.128.215:6802/3662024116,v1:10.8.128.215:6804/3662024116] [v2:10.8.128.215:6807/3662024116,v1:10.8.128.215:6810/3662024116] exists,up 0c89c728-e935-4b83-8558-baa6f56eb13a
osd.6 up   in  weight 1 up_from 296 up_thru 8032 down_at 289 last_clean_interval [84,288) [v2:10.8.128.212:6866/189185225,v1:10.8.128.212:6867/189185225] [v2:10.8.128.212:6868/189185225,v1:10.8.128.212:6869/189185225] exists,up ece9252a-ef39-4af6-8b47-1d2088971c65
osd.7 up   in  weight 1 up_from 347 up_thru 8013 down_at 339 last_clean_interval [85,338) [v2:10.8.128.214:6864/2038155548,v1:10.8.128.214:6865/2038155548] [v2:10.8.128.214:6866/2038155548,v1:10.8.128.214:6867/2038155548] exists,up 6ba5e8d2-72ed-4767-9b82-846ca0c8e3b3
osd.8 up   in  weight 1 up_from 333 up_thru 7976 down_at 326 last_clean_interval [94,325) [v2:10.8.128.213:6864/2744938019,v1:10.8.128.213:6865/2744938019] [v2:10.8.128.213:6866/2744938019,v1:10.8.128.213:6867/2744938019] exists,up 7c5978b0-7d50-42d0-8c6d-8ea0d05c106d
osd.9 up   in  weight 1 up_from 7844 up_thru 7944 down_at 7811 last_clean_interval [3815,7810) [v2:10.8.128.215:6856/1001536700,v1:10.8.128.215:6857/1001536700] [v2:10.8.128.215:6858/1001536700,v1:10.8.128.215:6859/1001536700] exists,up 6cff0521-bb0e-4f86-98ce-51f81b0c2ec4
osd.10 up   in  weight 1 up_from 260 up_thru 7919 down_at 255 last_clean_interval [54,254) [v2:10.8.128.212:6802/3459447412,v1:10.8.128.212:6803/3459447412] [v2:10.8.128.212:6804/3459447412,v1:10.8.128.212:6805/3459447412] exists,up fad0e532-cf28-4380-b7d7-badecc7a5507
osd.11 up   in  weight 1 up_from 369 up_thru 7892 down_at 362 last_clean_interval [56,361) [v2:10.8.128.214:6800/2818272725,v1:10.8.128.214:6801/2818272725] [v2:10.8.128.214:6802/2818272725,v1:10.8.128.214:6803/2818272725] exists,up b51fc3b0-bcb5-459d-add5-fbd3ef38e996
osd.12 up   in  weight 1 up_from 308 up_thru 7883 down_at 293 last_clean_interval [59,292) [v2:10.8.128.213:6808/4220875889,v1:10.8.128.213:6809/4220875889] [v2:10.8.128.213:6810/4220875889,v1:10.8.128.213:6811/4220875889] exists,up e4fbca6f-2cfd-4adf-aa76-934f1ee9b4de
osd.13 up   in  weight 1 up_from 7850 up_thru 8019 down_at 7460 last_clean_interval [3804,7459) [v2:10.8.128.215:6832/3062302921,v1:10.8.128.215:6833/3062302921] [v2:10.8.128.215:6834/3062302921,v1:10.8.128.215:6835/3062302921] exists,up 2843c285-10ad-4fb5-8b15-55e82def853e
osd.14 up   in  weight 1 up_from 342 up_thru 7899 down_at 334 last_clean_interval [67,333) [v2:10.8.128.214:6808/79306502,v1:10.8.128.214:6809/79306502] [v2:10.8.128.214:6810/79306502,v1:10.8.128.214:6811/79306502] exists,up 59a6f548-842e-4ebb-9436-5a115b7a51a2
osd.15 up   in  weight 1 up_from 283 up_thru 7986 down_at 275 last_clean_interval [63,274) [v2:10.8.128.212:6810/77006824,v1:10.8.128.212:6811/77006824] [v2:10.8.128.212:6812/77006824,v1:10.8.128.212:6813/77006824] exists,up 7c1d296d-f101-4419-a5dd-8c82d43497cf
osd.16 up   in  weight 1 up_from 315 up_thru 8024 down_at 308 last_clean_interval [56,307) [v2:10.8.128.213:6816/3922750587,v1:10.8.128.213:6817/3922750587] [v2:10.8.128.213:6818/3922750587,v1:10.8.128.213:6819/3922750587] exists,up fa0b85ef-5abd-47d0-bb8f-d24f1d293bb7
osd.17 up   in  weight 1 up_from 7880 up_thru 7923 down_at 7462 last_clean_interval [3799,7461) [v2:10.8.128.215:6824/3380354607,v1:10.8.128.215:6825/3380354607] [v2:10.8.128.215:6826/3380354607,v1:10.8.128.215:6827/3380354607] exists,up 9470c87f-aa05-48e3-bd76-f3fbb9829547
osd.18 up   in  weight 1 up_from 380 up_thru 7988 down_at 366 last_clean_interval [57,365) [v2:10.8.128.214:6816/2008617659,v1:10.8.128.214:6817/2008617659] [v2:10.8.128.214:6818/2008617659,v1:10.8.128.214:6819/2008617659] exists,up f109dde3-809a-4341-a4c3-ae76bc94f7a6
osd.19 up   in  weight 1 up_from 270 up_thru 8028 down_at 262 last_clean_interval [59,261) [v2:10.8.128.212:6818/3144902093,v1:10.8.128.212:6819/3144902093] [v2:10.8.128.212:6820/3144902093,v1:10.8.128.212:6821/3144902093] exists,up 0289665b-cc42-48f7-a55f-bb673799c9ca
osd.20 up   in  weight 1 up_from 307 up_thru 7968 down_at 301 last_clean_interval [75,300) [v2:10.8.128.213:6824/4112410208,v1:10.8.128.213:6825/4112410208] [v2:10.8.128.213:6826/4112410208,v1:10.8.128.213:6827/4112410208] exists,up bfa91559-6140-444c-849e-2a65a404d3d1
osd.21 up   in  weight 1 up_from 7851 up_thru 7981 down_at 7464 last_clean_interval [3799,7463) [v2:10.8.128.215:6864/2402257169,v1:10.8.128.215:6865/2402257169] [v2:10.8.128.215:6866/2402257169,v1:10.8.128.215:6867/2402257169] exists,up 6a3c703d-d2fd-4e5a-8dae-1594df612a47
osd.22 up   in  weight 1 up_from 365 up_thru 7990 down_at 357 last_clean_interval [67,356) [v2:10.8.128.214:6832/1448249484,v1:10.8.128.214:6833/1448249484] [v2:10.8.128.214:6834/1448249484,v1:10.8.128.214:6835/1448249484] exists,up 60b77f38-edd4-41d1-80d6-5c0ba4b20246
osd.23 up   in  weight 1 up_from 274 up_thru 7972 down_at 267 last_clean_interval [74,266) [v2:10.8.128.212:6826/190209101,v1:10.8.128.212:6827/190209101] [v2:10.8.128.212:6828/190209101,v1:10.8.128.212:6829/190209101] exists,up a2e1020f-497b-42a0-9c69-424907dfc60f
osd.24 up   in  weight 1 up_from 325 up_thru 8022 down_at 317 last_clean_interval [80,316) [v2:10.8.128.213:6832/2627256784,v1:10.8.128.213:6833/2627256784] [v2:10.8.128.213:6834/2627256784,v1:10.8.128.213:6835/2627256784] exists,up 7dccce66-a574-4622-8866-fa6a68648c45
osd.25 up   in  weight 1 up_from 7876 up_thru 8008 down_at 7765 last_clean_interval [3813,7764) [v2:10.8.128.215:6803/3478994007,v1:10.8.128.215:6806/3478994007] [v2:10.8.128.215:6809/3478994007,v1:10.8.128.215:6812/3478994007] exists,up 657c5d3d-50e0-4929-b6b2-073d3e1d1b56
osd.26 up   in  weight 1 up_from 337 up_thru 7980 down_at 330 last_clean_interval [82,329) [v2:10.8.128.214:6840/2828642581,v1:10.8.128.214:6841/2828642581] [v2:10.8.128.214:6842/2828642581,v1:10.8.128.214:6843/2828642581] exists,up 5149280c-7369-41d8-ae6a-7bd5c816dacb
osd.27 up   in  weight 1 up_from 288 up_thru 7899 down_at 280 last_clean_interval [80,279) [v2:10.8.128.212:6834/3922396735,v1:10.8.128.212:6835/3922396735] [v2:10.8.128.212:6836/3922396735,v1:10.8.128.212:6837/3922396735] exists,up 5b4ce98c-03d1-4d59-9945-801f82b80ce0
osd.28 up   in  weight 1 up_from 311 up_thru 7996 down_at 304 last_clean_interval [84,303) [v2:10.8.128.213:6840/233969254,v1:10.8.128.213:6841/233969254] [v2:10.8.128.213:6842/233969254,v1:10.8.128.213:6843/233969254] exists,up 8bc5f3c5-4767-4944-8084-b3f26c2d5ef6
osd.29 up   in  weight 1 up_from 7868 up_thru 7929 down_at 7768 last_clean_interval [3815,7767) [v2:10.8.128.215:6848/494241297,v1:10.8.128.215:6849/494241297] [v2:10.8.128.215:6850/494241297,v1:10.8.128.215:6851/494241297] exists,up 1a8d0339-6783-4694-bac1-b855b7b6d04c
osd.30 up   in  weight 1 up_from 360 up_thru 8030 down_at 353 last_clean_interval [84,352) [v2:10.8.128.214:6848/3895757485,v1:10.8.128.214:6849/3895757485] [v2:10.8.128.214:6850/3895757485,v1:10.8.128.214:6851/3895757485] exists,up d9a2b3fa-adce-44a7-8157-39dcb9ada8fb
osd.31 up   in  weight 1 up_from 292 up_thru 8002 down_at 285 last_clean_interval [88,284) [v2:10.8.128.212:6850/1920787905,v1:10.8.128.212:6851/1920787905] [v2:10.8.128.212:6852/1920787905,v1:10.8.128.212:6853/1920787905] exists,up ca0f750a-ba77-4107-ba10-ba6c389d5fc0
osd.32 up   in  weight 1 up_from 304 up_thru 8000 down_at 298 last_clean_interval [93,297) [v2:10.8.128.213:6848/524570476,v1:10.8.128.213:6849/524570476] [v2:10.8.128.213:6850/524570476,v1:10.8.128.213:6851/524570476] exists,up 21287e7b-9429-472a-bdcb-9dc5e7aa3f47
osd.33 up   in  weight 1 up_from 7862 up_thru 7922 down_at 7799 last_clean_interval [3803,7799) [v2:10.8.128.215:6840/2698895660,v1:10.8.128.215:6841/2698895660] [v2:10.8.128.215:6842/2698895660,v1:10.8.128.215:6843/2698895660] exists,up 8e07bd19-8d9c-4134-b3e9-e39a3e9d207e
osd.34 up   in  weight 1 up_from 352 up_thru 8010 down_at 344 last_clean_interval [90,343) [v2:10.8.128.214:6856/1261359665,v1:10.8.128.214:6857/1261359665] [v2:10.8.128.214:6858/1261359665,v1:10.8.128.214:6859/1261359665] exists,up 5a6db3ff-d874-4685-9563-d71746061363
osd.35 up   in  weight 1 up_from 562 up_thru 7898 down_at 534 last_clean_interval [278,533) [v2:10.8.128.212:6858/3358448506,v1:10.8.128.212:6859/3358448506] [v2:10.8.128.212:6860/3358448506,v1:10.8.128.212:6861/3358448506] exists,up 5c883920-b9bb-44e7-bec5-8ca258d29722
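To cross-check that the OSDs named in the warnings sit on the rebooted host, the dump lines above can be reduced to an OSD-to-address map. The following is a rough Python sketch, assuming the `osd.N up in weight ... up_from ... [v2:IP:port/...]` layout shown above and taking the IP from the first address vector on each line. `parse_osd_dump` is a hypothetical helper name.

```python
import re

# Captures the OSD id, its up_from epoch, and the IP of the first v2 address.
OSD_RE = re.compile(
    r"^osd\.(?P<id>\d+)\s+up\s+in\s+weight\s+\S+\s+up_from\s+(?P<up_from>\d+)"
    r".*?\[v2:(?P<ip>\d+\.\d+\.\d+\.\d+):",
    re.MULTILINE,
)

def parse_osd_dump(text):
    """Map OSD id -> (ip, up_from epoch) for every 'up in' OSD line."""
    return {
        int(m["id"]): (m["ip"], int(m["up_from"]))
        for m in OSD_RE.finditer(text)
    }

# A subset of the dump lines from the report above, copied verbatim.
SAMPLE = """\
osd.0 up   in  weight 1 up_from 329 up_thru 7984 down_at 322 last_clean_interval [51,321) [v2:10.8.128.213:6800/3313869263,v1:10.8.128.213:6801/3313869263] [v2:10.8.128.213:6802/3313869263,v1:10.8.128.213:6803/3313869263] exists,up 6bfdc645-9cf9-4c5f-8227-621468bcd219
osd.1 up   in  weight 1 up_from 7850 up_thru 8018 down_at 7458 last_clean_interval [3815,7457) [v2:10.8.128.215:6800/4115655657,v1:10.8.128.215:6801/4115655657] [v2:10.8.128.215:6805/4115655657,v1:10.8.128.215:6808/4115655657] exists,up 58ae0f72-c629-4225-86ba-f45ec774f4ca
osd.9 up   in  weight 1 up_from 7844 up_thru 7944 down_at 7811 last_clean_interval [3815,7810) [v2:10.8.128.215:6856/1001536700,v1:10.8.128.215:6857/1001536700] [v2:10.8.128.215:6858/1001536700,v1:10.8.128.215:6859/1001536700] exists,up 6cff0521-bb0e-4f86-98ce-51f81b0c2ec4
osd.13 up   in  weight 1 up_from 7850 up_thru 8019 down_at 7460 last_clean_interval [3804,7459) [v2:10.8.128.215:6832/3062302921,v1:10.8.128.215:6833/3062302921] [v2:10.8.128.215:6834/3062302921,v1:10.8.128.215:6835/3062302921] exists,up 2843c285-10ad-4fb5-8b15-55e82def853e
osd.21 up   in  weight 1 up_from 7851 up_thru 7981 down_at 7464 last_clean_interval [3799,7463) [v2:10.8.128.215:6864/2402257169,v1:10.8.128.215:6865/2402257169] [v2:10.8.128.215:6866/2402257169,v1:10.8.128.215:6867/2402257169] exists,up 6a3c703d-d2fd-4e5a-8dae-1594df612a47
"""

if __name__ == "__main__":
    osds = parse_osd_dump(SAMPLE)
    for osd_id in (1, 9, 13, 21):  # OSDs named in the slow-heartbeat warnings
        ip, up_from = osds[osd_id]
        print(f"osd.{osd_id}: {ip} (up_from epoch {up_from})")
```

In the full dump, the four OSDs named in the warnings all report the address 10.8.128.215, and the OSDs at that address have much higher `up_from` epochs (7844 and above) than those on the other hosts, consistent with a recent restart.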


# ceph osd perf
osd  commit_latency(ms)  apply_latency(ms)
 35                   6                  6
 34                   6                  6
 33                  19                 19
 32                   0                  0
 31                   0                  0
 30                   2                  2
 29                   0                  0
 12                   7                  7
 11                   0                  0
 10                   8                  8
  9                   4                  4
  8                   4                  4
  7                   6                  6
  6                   4                  4
  5                   6                  6
  4                   2                  2
  3                   0                  0
  2                   0                  0
  1                  10                 10
  0                   2                  2
 13                   0                  0
 14                   0                  0
 15                   0                  0
 16                   0                  0
 17                   0                  0
 18                   0                  0
 19                   1                  1
 20                   8                  8
 21                   1                  1
 22                   6                  6
 23                   0                  0
 24                   4                  4
 25                   3                  3
 26                  11                 11
 27                  11                 11
 28                   0                  0



Version-Release number of selected component (if applicable):
ceph version 17.2.6-70.el9cp (fe62dcdbb2c6e05782a3e2b67d025b84ff5047cc) quincy (stable)

How reproducible:
1/1

Steps to Reproduce:
1. Deploy RHCS cluster, write data.
2. Bring down a host for 30+ minutes, wait for OSDs to be marked out, and data to be backfilled into other OSDs
3. Bring the host back up. Slow heartbeat warnings persist even 1 hour after the host is powered up (PGs were already active+clean).
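To time how long the warnings persist after step 3, the health state can be polled until both slow-heartbeat checks disappear. The following Python sketch assumes that `ceph health detail --format json` returns an object with a top-level `checks` map keyed by health code. The helper names, the polling interval and the timeout are illustrative choices, not values from this report.

```python
import json
import subprocess
import time

# Health codes taken from the `ceph health detail` output in this report.
SLOW_PING_CODES = {"OSD_SLOW_PING_TIME_BACK", "OSD_SLOW_PING_TIME_FRONT"}

def active_slow_ping_checks(health):
    """Return the slow-heartbeat codes present in a parsed health dict."""
    return SLOW_PING_CODES & set(health.get("checks", {}))

def fetch_health():
    """Run the ceph CLI and parse its JSON output (needs cluster access)."""
    out = subprocess.run(
        ["ceph", "health", "detail", "--format", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    return json.loads(out)

def wait_until_clear(fetch=fetch_health, interval_s=60, timeout_s=7200,
                     sleep=time.sleep, clock=time.monotonic):
    """Poll until no slow-heartbeat check is active.

    Returns the elapsed seconds, or None if timeout_s passes first.
    """
    start = clock()
    while clock() - start < timeout_s:
        if not active_slow_ping_checks(fetch()):
            return clock() - start
        sleep(interval_s)
    return None

if __name__ == "__main__":
    elapsed = wait_until_clear()
    if elapsed is None:
        print("slow heartbeat warnings still present at timeout")
    else:
        print(f"slow heartbeat warnings cleared after {elapsed:.0f} s")
```

Starting the script as soon as the host is powered back on gives a rough figure for how long the warnings outlast the return to active+clean.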

Actual results:
Health warnings were observed even after the cluster had completed recovery and all PGs were active+clean.

Expected results:
No warnings should be seen.

Additional info:
The warnings cleared about 1 hour after the host shutdown and power-up. The cluster had already reached active+clean about 20 minutes after the host shutdown and power-up.

