Bug 2210217 - "stuck peering for" warning is misleading
Summary: "stuck peering for" warning is misleading
Keywords:
Status: NEW
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: RADOS
Version: 5.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: low
Target Milestone: ---
Target Release: 6.1z2
Assignee: Shreyansh Sancheti
QA Contact: Pawan
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2023-05-26 06:42 UTC by Shreyansh Sancheti
Modified: 2023-07-12 12:31 UTC
CC List: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Embargoed:
Flags: ssanchet: needinfo+




Links
    Ceph Project Bug Tracker 51688 (last updated 2023-05-26 06:42:23 UTC)
    Red Hat Issue Tracker RHCEPH-6744 (last updated 2023-05-26 06:43:40 UTC)
    Red Hat Knowledge Base (Solution) 7017024 (last updated 2023-06-05 08:38:55 UTC)

Description Shreyansh Sancheti 2023-05-26 06:42:23 UTC
When OSDs restart or CRUSH maps change, it is common to see a HEALTH_WARN claiming that PGs have been stuck peering for a long time, even though they were active just seconds ago.
It would be preferable to report PG_AVAILABILITY issues only when PGs really have been stuck peering for longer than 60s.

E.g.

HEALTH_WARN Reduced data availability: 50 pgs peering
PG_AVAILABILITY Reduced data availability: 50 pgs peering
    pg 3.7df is stuck peering for 792.178587, current state remapped+peering, last acting [100,113,352]
    pg 3.8ae is stuck peering for 280.567053, current state remapped+peering, last acting [226,345,350]
    pg 3.c0b is stuck peering for 1018.081127, current state remapped+peering, last acting [62,246,249]
    pg 3.fc9 is stuck peering for 65.799756, current state remapped+peering, last acting [123,447,351]
    pg 4.c is stuck peering for 208.471034, current state remapped+peering, last acting [123,501,247]
...
(Related: I proposed to change PG_AVAILABILITY issues to HEALTH_ERR at https://tracker.ceph.com/issues/23565 and https://github.com/ceph/ceph/pull/42192, so this needs to be fixed before merging that.)

I tracked this to `PGMap::get_health_checks`, which marks a PG as stuck peering if now - last_peered > mon_pg_stuck_threshold.
The problem is that last_peered is only updated when there is IO on a PG -- an OSD doesn't send pgstats while it is idle, so the timestamp can be arbitrarily stale.
To fix this, we could update last_active/last_peered etc. and send a pg stats update more frequently even when idle?
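
A minimal, self-contained sketch of the check as described above (this is not the actual Ceph source; the struct, field, and variable names are made up for illustration):

// Simplified illustration of the health-check logic described above:
// a PG is flagged "stuck peering" when now - last_peered exceeds
// mon_pg_stuck_threshold. Because last_peered is only refreshed when
// the OSD sends pg stats, an idle PG keeps a stale timestamp and the
// reported duration can far exceed the time actually spent peering.
#include <iostream>
#include <string>
#include <vector>

struct pg_stat_sketch {        // stand-in for the real pg_stat_t
  std::string pgid;
  std::string state;           // e.g. "remapped+peering"
  double last_peered;          // seconds; only advanced on pg stats updates
};

int main() {
  const double mon_pg_stuck_threshold = 60.0;  // default threshold (seconds)
  const double now = 100000.0;                 // stand-in for the current time

  std::vector<pg_stat_sketch> pgs = {
    {"3.7df", "remapped+peering", now - 792.18},  // idle PG, stale last_peered
    {"3.fc9", "remapped+peering", now - 65.80},
    {"4.c",   "active+clean",     now - 5.0},
  };

  for (const auto& pg : pgs) {
    const bool peering = pg.state.find("peering") != std::string::npos;
    const double stuck_for = now - pg.last_peered;
    if (peering && stuck_for > mon_pg_stuck_threshold) {
      std::cout << "pg " << pg.pgid << " is stuck peering for " << stuck_for
                << ", current state " << pg.state << "\n";
    }
  }
  return 0;
}

In this sketch the active+clean PG is skipped, while the peering PGs are reported with durations derived from their stale timestamps, which is exactly the misleading output shown above.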

Clearly osd_pg_stat_report_interval_max is related here, but its default is 500 and we see some PGs reported stuck peering for longer than 500s, so something is still missing.

We observe this in nautilus, but the code hasn't changed much in master AFAICT.



To reproduce:
#!/bin/bash

../src/stop.sh
MON=1 MGR=1 OSD=3 ../src/vstart.sh -n -d
./bin/ceph osd pool create rbd 8 8
./bin/ceph osd pool set rbd min_size 1
./bin/ceph df
./bin/rados bench 20 write -p rbd --no-cleanup
./bin/ceph osd tree
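# take two of the three OSDs down while client writes continue (min_size is 1)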
sudo ./bin/init-ceph stop osd.1
sudo ./bin/init-ceph stop osd.2
./bin/rados bench 20 write -p rbd --no-cleanup
./bin/rados bench 20 write -p rbd --no-cleanup
sudo ./bin/init-ceph stop osd.0
sudo ./bin/init-ceph start osd.1
sudo ./bin/init-ceph start osd.2
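# mark the stopped osd.0 down and then lost, so its PGs must re-peer on the restarted OSDs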
./bin/ceph osd down 0
./bin/ceph osd lost osd.0 --yes-i-really-mean-it
./bin/rados bench 20 write -p rbd --no-cleanup
./bin/rados bench 20 write -p rbd --no-cleanup
sudo ./bin/init-ceph start osd.0

Comment 1 Shreyansh Sancheti 2023-06-01 16:09:55 UTC
Steps to reproduce:

../src/stop.sh
MON=1 MGR=1 OSD=4 ../src/vstart.sh -n -d

./bin/ceph osd pool create rbd 512 512
./bin/ceph osd pool set rbd min_size 1
./bin/ceph df
./bin/rados bench 30 write -p rbd --no-cleanup
./bin/ceph osd tree
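# mark two of the four OSDs out while client writes continue (min_size is 1)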
sudo ./bin/ceph osd out osd.1
sudo ./bin/ceph osd out osd.2
./bin/rados bench 120 write -p rbd --no-cleanup
./bin/rados bench 120 write -p rbd --no-cleanup
sudo ./bin/ceph osd out osd.3
sudo ./bin/ceph osd in osd.1
sudo ./bin/ceph osd in osd.2
./bin/ceph osd lost osd.3 --yes-i-really-mean-it
./bin/rados bench 120 write -p rbd --no-cleanup
./bin/rados bench 120 write -p rbd --no-cleanup
sudo ./bin/ceph osd in osd.3

Behaviour seen in ceph health status (the warning comes and goes):

[WRN] PG_AVAILABILITY: Reduced data availability: 10 pgs inactive, 18 pgs peering
    pg 2.6 is stuck peering for 8m, current state peering, last acting [1,0,2]
    pg 2.e is stuck peering for 8m, current state peering, last acting [1,0,2]

Comment 4 Scott Ostapovicz 2023-07-12 12:31:24 UTC
Missed the 6.1 z1 window.  Retargeting to 6.1 z2.

