Bug 1351320
| Summary: | OSD's assert during snap trim osd/ReplicatedPG.cc: 2655: FAILED assert(0) | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Ceph Storage | Reporter: | Mike Hackett <mhackett> |
| Component: | RADOS | Assignee: | David Zafman <dzafman> |
| Status: | CLOSED UPSTREAM | QA Contact: | ceph-qe-bugs <ceph-qe-bugs> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 1.3.2 | CC: | ceph-eng-bugs, dzafman, flucifre, kchai, kdreyer, nlevine, olim, vumrao |
| Target Milestone: | rc | | |
| Target Release: | 1.3.3 | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2016-07-18 14:20:41 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Mike Hackett
2016-06-29 17:48:20 UTC
The snap trim queue for PG 0.1ef1 reports: trim_objectcould not find coid 0/bddc9ef1/rbd_data.b77eb164a531e5.0000000000004fdf/1e73e
Reviewing the provided log for OSD.234, we can see that the OSD is actively trying to trim the snapshots but cannot, because an object is corrupted for some reason. This leads to the OSD hitting an assert.
***Review***
In the OSD logs we can see the following occurring:
On PG 0.1ef1 we start snaptrimming for object rbd_data.b77eb164a531e5.0000000000004fdf:
SnapTrimmer state<Trimming/TrimmingObjects>: TrimmingObjects react trimming 0/bddc9ef1/rbd_data.b77eb164a531e5.0000000000004fdf/1e73e
-3> 2016-06-27 08:11:06.781905 7f08f1268700 10 osd.234 pg_epoch: 547937 pg[0.1ef1( v 547936'1181836 (543861'1178763,547936'1181836] local-les=547936 n=71899 ec=1 les/c 547936/547936 547935/547935/547935) [234,259,19] r=0 lpr=547935 crt=547932'1181831 lcod 547932'1181834 mlcod 547932'1181834 active+clean
Note: this PG has a very large snaptrimq, so it is not listed here.
When trimming, the OSD attempts to obtain the object info and cannot find it. It is trying to getattr OI_ATTR, which is the object info attribute:
get_object_context: obc NOT found in cache: 0/bddc9ef1/rbd_data.b77eb164a531e5.0000000000004fdf/1e73e
-2> 2016-06-27 08:11:06.784323 7f08f1268700 10 osd.234 pg_epoch: 547937 pg[0.1ef1( v 547936'1181836 (543861'1178763,547936'1181836] local-les=547936 n=71899 ec=1 les/c 547936/547936 547935/547935/547935) [234,259,19] r=0 lpr=547935 crt=547932'1181831 lcod 547932'1181834 mlcod 547932'1181834 active+clean
We report that no object info can be obtained.
get_object_context: no obc for soid 0/bddc9ef1/rbd_data.b77eb164a531e5.0000000000004fdf/1e73e and !can_create
-1> 2016-06-27 08:11:06.786646 7f08f1268700 -1 osd.234 pg_epoch: 547937 pg[0.1ef1( v 547936'1181836 (543861'1178763,547936'1181836] local-les=547936 n=71899 ec=1 les/c 547936/547936 547935/547935/547935) [234,259,19] r=0 lpr=547935 crt=547932'1181831 lcod 547932'1181834 mlcod 547932'1181834 active+clean
snaptrim on object fails:
trim_objectcould not find coid 0/bddc9ef1/rbd_data.b77eb164a531e5.0000000000004fdf/1e73e
The OSD then asserts.
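
As a cross-check on the review above, the coid from the log can be split into its parts and mapped back to a PG id. This is a minimal sketch, assuming the coid is printed as pool/hash/name/snap with the hash and snap in hexadecimal (the leading 0 is consistent with PG 0.1ef1 living in pool 0). `stable_mod` follows the logic of Ceph's `ceph_stable_mod()`; the helper names and the `pg_num` value of 8192 are hypothetical, since the report does not give the pool's pg_num.

```python
import re

# Assumed layout of the coid in the log above: pool/hash/name/snap,
# with hash and snap printed in hexadecimal.
COID_RE = re.compile(r"^(\d+)/([0-9a-f]+)/(.+)/([0-9a-f]+)$")


def parse_coid(coid):
    """Split a coid string into (pool, hash, object name, snap id)."""
    m = COID_RE.match(coid)
    if m is None:
        raise ValueError("unrecognised coid: %r" % coid)
    pool, hash_hex, name, snap_hex = m.groups()
    return int(pool), int(hash_hex, 16), name, int(snap_hex, 16)


def stable_mod(x, b, bmask):
    """Same logic as Ceph's ceph_stable_mod(): fold hash x onto b PGs."""
    if (x & bmask) < b:
        return x & bmask
    return x & (bmask >> 1)


def pgid_for(coid, pg_num):
    """Return the 'pool.seed' PG id for a coid, given the pool's pg_num."""
    pool, obj_hash, _name, _snap = parse_coid(coid)
    # Smallest all-ones mask that covers pg_num - 1.
    bmask = (1 << (pg_num - 1).bit_length()) - 1
    return "%d.%x" % (pool, stable_mod(obj_hash, pg_num, bmask))


coid = "0/bddc9ef1/rbd_data.b77eb164a531e5.0000000000004fdf/1e73e"
# pg_num=8192 is a hypothetical value; the report does not state it.
print(pgid_for(coid, 8192))  # prints 0.1ef1
```

With that assumed pg_num, the hash bddc9ef1 lands in PG 0.1ef1, matching the PG named in the log lines.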
***Recovery***
After discussion with engineering, our recovery actions for this issue are the following steps:
1. OSD.234 is the acting primary OSD for PG 0.1ef1. Perform the following on the OSD node hosting OSD.234.
2. Look in /var/lib/ceph/osd/ceph-234/current/ for 0.1ef1 to validate that the object is present in the filesystem.
'sudo find /var/lib/ceph/osd/ceph-234/current/0.1ef1_head/ -name 'rbd\\udata.b77eb164a531e5.0000000000004fdf*' -ls'
***NOTE***
If the object is present, continue with the action plan (steps 3-5). If it is not, please report back, because then the issue is the snapmapper pointing to an invalid object and NOT a corrupt/missing object info.
3. Find a window when snaptrimming is not occurring, or turn off snaptrimming.
4. On OSD.234 run the following:
'ceph pg repair 0.1ef1'
Validate that the PG repairs successfully with the pg repair command; the repair should report the result.
5. Turn on snaptrimming (if it was turned off previously)
Setting the config value osd_pg_max_concurrent_snap_trims to 0 disables snap trims. The default is 2.
Issue looks to be: http://tracker.ceph.com/issues/13837
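
The recovery steps above could be scripted roughly as follows. This is only a sketch under stated assumptions, not a tested procedure: `find_object_files` repeats the step 2 check in Python and handles only the `_` to `\u` escaping visible in the find command, and `recovery_commands` merely builds the command lines for steps 3-5 without running them. Using `ceph tell ... injectargs` to change osd_pg_max_concurrent_snap_trims at runtime is an assumption; the report only says that setting the value to 0 disables snap trims. All function names are hypothetical.

```python
import fnmatch
import os


def filestore_pattern(object_name):
    """Glob for an object's on-disk file name, escaping '_' as '\\u'."""
    # Only the escape seen in the find command above is handled here.
    return object_name.replace("_", "\\u") + "*"


def find_object_files(pg_dir, object_name):
    """Return sorted paths under pg_dir whose file name matches the object."""
    pattern = filestore_pattern(object_name)
    hits = []
    for root, _dirs, files in os.walk(pg_dir):
        for name in files:
            # fnmatchcase treats the backslash as a literal character.
            if fnmatch.fnmatchcase(name, pattern):
                hits.append(os.path.join(root, name))
    return sorted(hits)


def recovery_commands(osd_id, pgid, default_trims=2):
    """Build (but do not run) the commands for steps 3-5."""
    opt = "--osd_pg_max_concurrent_snap_trims"
    osd = "osd.%d" % osd_id
    return [
        # Step 3: disable snap trimming (runtime injectargs is an assumption).
        ["ceph", "tell", osd, "injectargs", "%s 0" % opt],
        # Step 4: repair the PG.
        ["ceph", "pg", "repair", pgid],
        # Step 5: restore the default of 2 stated in the report.
        ["ceph", "tell", osd, "injectargs", "%s %d" % (opt, default_trims)],
    ]


pg_dir = "/var/lib/ceph/osd/ceph-234/current/0.1ef1_head/"
obj = "rbd_data.b77eb164a531e5.0000000000004fdf"
if find_object_files(pg_dir, obj):
    for cmd in recovery_commands(234, "0.1ef1"):
        print(" ".join(cmd))
else:
    print("object not found under %s; do not proceed to steps 3-5" % pg_dir)
```

Printing the commands instead of executing them keeps the decision to proceed with the operator, which matches the NOTE in step 2: if the object is absent, the plan stops there.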