Bug 2210217 - "stuck peering for" warning is misleading
Summary: "stuck peering for" warning is misleading
Keywords:
Status: NEW
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: RADOS
Version: 5.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: low
Target Milestone: ---
Target Release: 6.1z2
Assignee: Shreyansh Sancheti
QA Contact: Pawan
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2023-05-26 06:42 UTC by Shreyansh Sancheti
Modified: 2023-07-12 12:31 UTC
CC List: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Embargoed:
Flags: ssanchet: needinfo+




Links
    Ceph Project Bug Tracker 51688 (last updated 2023-05-26 06:42:23 UTC)
    Red Hat Issue Tracker RHCEPH-6744 (last updated 2023-05-26 06:43:40 UTC)
    Red Hat Knowledge Base (Solution) 7017024 (last updated 2023-06-05 08:38:55 UTC)

Description Shreyansh Sancheti 2023-05-26 06:42:23 UTC
When OSDs restart or CRUSH maps change, it is common to see a HEALTH_WARN claiming that PGs have been stuck peering for a long time, even though they were active just seconds ago.
It would be preferable to report PG_AVAILABILITY issues only when PGs really have been stuck peering for longer than 60s.

E.g.

HEALTH_WARN Reduced data availability: 50 pgs peering
PG_AVAILABILITY Reduced data availability: 50 pgs peering
    pg 3.7df is stuck peering for 792.178587, current state remapped+peering, last acting [100,113,352]
    pg 3.8ae is stuck peering for 280.567053, current state remapped+peering, last acting [226,345,350]
    pg 3.c0b is stuck peering for 1018.081127, current state remapped+peering, last acting [62,246,249]
    pg 3.fc9 is stuck peering for 65.799756, current state remapped+peering, last acting [123,447,351]
    pg 4.c is stuck peering for 208.471034, current state remapped+peering, last acting [123,501,247]
...
(Related: I proposed to change PG_AVAILABILITY issues to HEALTH_ERR at https://tracker.ceph.com/issues/23565 and https://github.com/ceph/ceph/pull/42192, so this needs to be fixed before merging that.)

I tracked this to `PGMap::get_health_checks`, which marks a PG as stuck peering if now - last_peered > mon_pg_stuck_threshold.
The problem is that last_peered is only updated when there is IO on a PG -- an OSD doesn't send pgstats while it is idle, so the timestamp can be arbitrarily stale.
To fix this, we could update last_active/last_peered etc. and send a pg stats update more frequently even when idle?
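
A minimal, self-contained sketch of the check as described above (this is not the actual Ceph source; the struct, field, and variable names are made up for illustration):

// Simplified illustration of the health-check logic described above:
// a PG is flagged "stuck peering" when now - last_peered exceeds
// mon_pg_stuck_threshold. Because last_peered is only refreshed when
// the OSD sends pg stats, an idle PG keeps a stale timestamp and the
// reported duration can far exceed the time actually spent peering.
#include <iostream>
#include <string>
#include <vector>

struct pg_stat_sketch {        // stand-in for the real pg_stat_t
  std::string pgid;
  std::string state;           // e.g. "remapped+peering"
  double last_peered;          // seconds; only advanced on pg stats updates
};

int main() {
  const double mon_pg_stuck_threshold = 60.0;  // default threshold (seconds)
  const double now = 100000.0;                 // stand-in for the current time

  std::vector<pg_stat_sketch> pgs = {
    {"3.7df", "remapped+peering", now - 792.18},  // idle PG, stale last_peered
    {"3.fc9", "remapped+peering", now - 65.80},
    {"4.c",   "active+clean",     now - 5.0},
  };

  for (const auto& pg : pgs) {
    const bool peering = pg.state.find("peering") != std::string::npos;
    const double stuck_for = now - pg.last_peered;
    if (peering && stuck_for > mon_pg_stuck_threshold) {
      std::cout << "pg " << pg.pgid << " is stuck peering for " << stuck_for
                << ", current state " << pg.state << "\n";
    }
  }
  return 0;
}

In this sketch the active+clean PG is skipped, while the peering PGs are reported with durations derived from their stale timestamps, which is exactly the misleading output shown above.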

Clearly osd_pg_stat_report_interval_max is related here, but its default is 500 and we see some PGs reported stuck peering for longer than 500s, so something is still missing.

We observe this in nautilus, but the code hasn't changed much in master AFAICT.



To reproduce:
#!/bin/bash

../src/stop.sh
MON=1 MGR=1 OSD=3 ../src/vstart.sh -n -d
./bin/ceph osd pool create rbd 8 8
./bin/ceph osd pool set rbd min_size 1
./bin/ceph df
./bin/rados bench 20 write -p rbd --no-cleanup
./bin/ceph osd tree
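# take two of the three OSDs down while client writes continue (min_size is 1)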
sudo ./bin/init-ceph stop osd.1
sudo ./bin/init-ceph stop osd.2
./bin/rados bench 20 write -p rbd --no-cleanup
./bin/rados bench 20 write -p rbd --no-cleanup
sudo ./bin/init-ceph stop osd.0
sudo ./bin/init-ceph start osd.1
sudo ./bin/init-ceph start osd.2
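# mark the stopped osd.0 down and then lost, so its PGs must re-peer on the restarted OSDs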
./bin/ceph osd down 0
./bin/ceph osd lost osd.0 --yes-i-really-mean-it
./bin/rados bench 20 write -p rbd --no-cleanup
./bin/rados bench 20 write -p rbd --no-cleanup
sudo ./bin/init-ceph start osd.0

Comment 1 Shreyansh Sancheti 2023-06-01 16:09:55 UTC
Steps to reproduce:

../src/stop.sh
MON=1 MGR=1 OSD=4 ../src/vstart.sh -n -d

./bin/ceph osd pool create rbd 512 512
./bin/ceph osd pool set rbd min_size 1
./bin/ceph df
./bin/rados bench 30 write -p rbd --no-cleanup
./bin/ceph osd tree
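# mark two of the four OSDs out while client writes continue (min_size is 1)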
sudo ./bin/ceph osd out osd.1
sudo ./bin/ceph osd out osd.2
./bin/rados bench 120 write -p rbd --no-cleanup
./bin/rados bench 120 write -p rbd --no-cleanup
sudo ./bin/ceph osd out osd.3
sudo ./bin/ceph osd in osd.1
sudo ./bin/ceph osd in osd.2
./bin/ceph osd lost osd.3 --yes-i-really-mean-it
./bin/rados bench 120 write -p rbd --no-cleanup
./bin/rados bench 120 write -p rbd --no-cleanup
sudo ./bin/ceph osd in osd.3

Behaviour seen in ceph health status (the warning comes and goes):

[WRN] PG_AVAILABILITY: Reduced data availability: 10 pgs inactive, 18 pgs peering
    pg 2.6 is stuck peering for 8m, current state peering, last acting [1,0,2]
    pg 2.e is stuck peering for 8m, current state peering, last acting [1,0,2]

Comment 4 Scott Ostapovicz 2023-07-12 12:31:24 UTC
Missed the 6.1 z1 window.  Retargeting to 6.1 z2.

