Bug 1289313

Summary: osd/PG.cc: 288: FAILED assert(info.last_epoch_started >= info.history.last_epoch_started)
Product: [Red Hat Storage] Red Hat Ceph Storage    Reporter: Moses Muir <mmuir>
Component: RADOS    Assignee: David Zafman <dzafman>
Status: CLOSED ERRATA QA Contact: ceph-qe-bugs <ceph-qe-bugs>
Severity: medium Docs Contact:
Priority: unspecified    
Version: 1.3.1    CC: bhubbard, ceph-eng-bugs, dzafman, flucifre, hnallurv, kchai, kdreyer, shmohan, smanjara, vumrao
Target Milestone: rc    Keywords: Patch
Target Release: 1.3.2   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: RHEL: ceph-0.94.5-1.el7cp Ubuntu: ceph_0.94.5-2redhat1 Doc Type: Bug Fix
Doc Text:
Cause/Consequence: When a PG (placement group) gets stuck incomplete, the OSD could crash with an assert due to last_epoch_started mismatches. Fix: The OSD code has been altered to skip the assert when the user sets osd_find_best_info_ignore_history_les to "true". Result: The OSD does not crash when recovering an incomplete PG and osd_find_best_info_ignore_history_les is set.
Story Points: ---
Clone Of: Environment:
Last Closed: 2016-02-29 14:44:28 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Moses Muir 2015-12-07 20:45:43 UTC
Description of problem:

There is an assert (osd/PG.cc: 288: FAILED assert(info.last_epoch_started >= info.history.last_epoch_started)) that is hit when data loss has occurred and recovery of stale data is being initiated (usually by support) by setting the osd_find_best_info_ignore_history_les config variable.
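For context, the option defaults to "false", and the admin socket can be used to confirm what a given OSD is currently running with (osd.0 below is only a placeholder for the primary of the affected PG):

$ sudo ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config show | grep osd_find_best_info_ignore_history_les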


Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 2 Ken Dreyer (Red Hat) 2015-12-07 22:15:52 UTC
David, please let me know what patch to pull in for this.

Comment 3 David Zafman 2015-12-07 22:26:37 UTC
Can you get commit 02a9a41f151a3d968bf8066749658659dc6e3ac4 from git:dzafman/ceph.git or do you need the patch in another form?

Here it is on github.com:

https://github.com/dzafman/ceph/commit/02a9a41f151a3d968bf8066749658659dc6e3ac4

Comment 6 Ken Dreyer (Red Hat) 2015-12-09 16:06:22 UTC
David, could you please file a bug upstream for this issue so that we ensure that it gets backported to the hammer and jewel branches upstream? http://tracker.ceph.com/projects/ceph/issues/new

For the doc text in this bugzilla, I'm thinking of something like the following:

"When a PG gets stuck incomplete, the OSD could crash with an assert due to last_epoch_started mismatches. With this change, an OSD in this condition will not crash if the user sets osd_find_best_info_ignore_history_les to "true". "

Is that text accurate? What would you change there?

Comment 7 David Zafman 2015-12-09 17:42:58 UTC
Ken, I have already filed upstream tracker 14015 (http://tracker.ceph.com/issues/14015), which is marked Pending Backport.

Comment 8 Ken Dreyer (Red Hat) 2015-12-09 17:51:37 UTC
Thanks, can you please review the "Doc Text" field, and update it as needed?

Comment 10 David Zafman 2015-12-10 00:39:38 UTC
This is what I'd say:

When a PG gets stuck incomplete, it is possible to recover partial data. To recover, the user sets osd_find_best_info_ignore_history_les to "true" and marks the primary OSD down to force a re-peer. With this change, an OSD in this condition will not crash.
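A minimal sketch of that procedure, assuming the primary of the stuck PG is osd.0 (a placeholder) and that the option is injected at runtime:

# set the option on the primary of the incomplete PG
$ sudo ceph tell osd.0 injectargs '--osd_find_best_info_ignore_history_les=true'

# mark the primary down; it comes right back up and re-peers with the option in effect
$ sudo ceph osd down 0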

Comment 11 David Zafman 2015-12-10 00:40:10 UTC
This change will be going into upstream anyway.

Comment 12 Ken Dreyer (Red Hat) 2015-12-10 20:37:35 UTC
Targeting to RHCS 1.3.2 in the meantime.

Comment 13 Vikhyat Umrao 2015-12-25 09:03:34 UTC
Hi Federico,

Now that the patch is merged upstream, I think we should take it for 1.3.2.

Regards,
Vikhyat

Comment 17 Federico Lucifredi 2016-01-11 15:47:33 UTC
Needinfo Ken to see if he is open to picking this up in a future respin.

Comment 18 Federico Lucifredi 2016-01-11 16:40:27 UTC
Ken confirms this is okay to include in 1.3.2 at a respin point.

Comment 21 shylesh 2016-02-11 18:37:41 UTC

Mail snippet capturing the test case used for validation.
=============================================
1. [ubuntu@magna105 ~]$ sudo ceph osd pool create redhat 1 1 replicated replicated_ruleset
pool 'redhat' created

2. [ubuntu@magna105 ~]$ sudo ceph osd pool set redhat min_size 1
set pool 625 min_size to 1
[ubuntu@magna105 ~]$ sudo ceph pg dump | grep 625
dumped all in format plain
625.0   0       0       0       0       0       0       0       0       active+clean    2016-02-09 10:35:48.766861      0'0     1598:11 [1,0,2] 1       [1,0,2] 1       0'0     2016-02-09 10:35:26.639700      0'0     2016-02-09 10:35:26.639700
pool 625        0       0       0       0       0       0       0       0

3. Did some I/O.
4. Brought down [1,0] and did some I/O.
5. Brought down [2], then brought back [1,0].
After this, the state of the pool is:

[ubuntu@magna105 ~]$ sudo ceph pg dump | grep 625
dumped all in format plain
625.0   100     0       0       0       0       168200  100     100     down+peering    2016-02-09 10:40:16.451085      1598'100        1607:122        [1,0]   1       [1,0]   1       0'0     2016-02-09 10:35:26.639700      0'0     2016-02-09 10:35:26.639700
pool 625        100     0       0       0       0       168200  100     100


Later I marked osd.2 as lost. Still the same issue.

dumped all in format plain
stamp 2016-02-10 04:31:07.562590
625.0   100     0       0       0       0       168200  100     100     down+peering    2016-02-09 10:40:16.451085      1598'100        1613:124        [1,0]   1       [1,0]   1       0'0     2016-02-09 10:35:26.639700      0'0     2016-02-09 10:35:26.639700
pool 625        100     0       0       0       0       168200  100     100

Now mark osd.1 down to force peering again.
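(The exact commands for the "mark lost" and "mark down" steps are not in the snippet; presumably they were along these lines, with the OSD ids taken from the steps above.)

$ sudo ceph osd lost 2 --yes-i-really-mean-it
$ sudo ceph osd down 1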

[ubuntu@magna105 ceph]$ sudo ceph pg dump | grep 625
dumped all in format plain
625.0   100     0       100     0       0       168200  100     100     active+undersized+degraded      2016-02-10 04:52:31.564448        1598'100        1620:138        [1,0]   1       [1,0]   1       0'0     2016-02-09 10:35:26.639700      0'0       2016-02-09 10:35:26.639700
pool 625        100     0       100     0       0       168200  100     100


It became active, but not incomplete.


@dzafman,
I am not able to reach the incomplete state you mentioned in the mail. Could you please have a look so that I can verify this bug ASAP?

Comment 22 David Zafman 2016-02-13 02:46:42 UTC
I wouldn't let this testing hold anything up. This change doesn't affect anything except manual repair (usually by support), which is why existing test coverage wouldn't exercise it. At worst, if the bug isn't fixed, it would only affect support trying to bring a cluster back to life after data loss has occurred.

Sam reminded me that backfill is required to get an incomplete PG rather than one that is just stuck peering.

To get an incomplete PG to test, try this:

Start a cluster with at least 4 osds and the following osd config to get backfill to happen:
osd_min_pg_log_entries = 5
osd_max_pg_log_entries = 5

Create a pool with pg_num/pgp_num 1, size 3, and min_size 1.

Say the PG's replicas are on osds 1, 2, and 3.
Write enough data to a bunch of objects so that there is a non-trivial amount of data to backfill.
Kill osd.1 while still writing some additional objects.
Make sure it is still backfilling to a new OSD, say osd.4.
Kill osd.2 and osd.3 BEFORE backfill to osd.4 completes.
Bring osd.1 back up.
The PG should be stuck peering, wanting to query osd.2 or osd.3.
Mark osd.2 and osd.3 lost.
The PG should go incomplete.

Now test the osd_find_best_info_ignore_history_les config value by injecting it into the PG primary and marking the primary down to re-peer (see the sketch below).
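A rough sketch of how this setup translates into configuration and commands, assuming a pool named "test" and using osd.1 as a stand-in for whichever OSD ends up primary (both are placeholders, not from the report):

# ceph.conf on the OSD nodes: keep the pg log tiny so backfill is forced
[osd]
    osd_min_pg_log_entries = 5
    osd_max_pg_log_entries = 5

# one PG, three copies, min_size 1
$ sudo ceph osd pool create test 1 1
$ sudo ceph osd pool set test size 3
$ sudo ceph osd pool set test min_size 1

# once the PG is incomplete: inject the option into the primary and mark it
# down to force a re-peer
$ sudo ceph tell osd.1 injectargs '--osd_find_best_info_ignore_history_les=true'
$ sudo ceph osd down 1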

Comment 23 shilpa 2016-02-16 11:38:35 UTC
(In reply to David Zafman from comment #22)

Hit a crash while testing the above steps.

The PG's OSDs are [0,6,3]; osd.1 is the new OSD. The PG is 12.0. Brought the PG to the incomplete state.
Before injecting the osd_find_best_info_ignore_history_les config:

$ sudo ceph pg dump | grep incomplete
dumped all in format plain
12.0	60	0	0	0	0	797276160	40	40	incomplete	2016-02-16 10:51:22.274293	141'361	167:381	[0]	0	

pgmap v5617: 1153 pgs, 12 pools, 760 MB data, 107 objects
            1041 MB used, 1851 GB / 1852 GB avail
            58/321 objects degraded (18.069%)
            36/321 objects misplaced (11.215%)
                 723 active+undersized+degraded+remapped
                 429 undersized+degraded+peered
                   1 incomplete


After injecting the osd_find_best_info_ignore_history_les config and marking osd.0 down, the PG was back in an active state:

$ sudo ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config show | grep osd_find_best_info_ignore_history_les
    "osd_find_best_info_ignore_history_les": "true",


$ sudo ceph pg dump | grep ^12.0
dumped all in format plain
12.0	60	0	120	0	0	797276160	40	40	active+undersized+degraded	2016-02-16 11:01:45.991202	141'361	172:387	

$ sudo ceph health
HEALTH_WARN 1153 pgs degraded; 1153 pgs stuck degraded; 465 pgs stuck inactive; 1153 pgs stuck unclean; 1153 pgs stuck undersized; 1153 pgs undersized; 98 requests are blocked > 32 sec; recovery 176/321 objects degraded (54.829%); recovery 38/321 objects misplaced (11.838%); too many PGs per OSD (576 > max 300)


However, while trying to bring osd.6 and osd.3 back up from 'down' state, I hit a crash.

0> 2016-02-16 11:12:18.848291 7f246120d700 -1 osd/PGLog.cc: In function 'void PGLog::rewind_divergent_log(ObjectStore::Transaction&, eversion_t, pg_info_t&, PGLog::LogEntryHandler*, bool&, bool&)' thread 7f246120d700 time 2016-02-16 11:12:18.844362
osd/PGLog.cc: 503: FAILED assert(newhead >= log.tail)

 ceph version 0.94.5-5redhat1trusty (deef183a81111fa5e128ec88c90a32c9587c615d)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0xba86eb]
 2: (PGLog::rewind_divergent_log(ObjectStore::Transaction&, eversion_t, pg_info_t&, PGLog::LogEntryHandler*, bool&, bool&)+0x60b) [0x76b1db]
 3: (PG::rewind_divergent_log(ObjectStore::Transaction&, eversion_t)+0x97) [0x7bd7b7]
 4: (PG::RecoveryState::Stray::react(PG::MInfoRec const&)+0x22f) [0x7df2af]
 5: (boost::statechart::simple_state<PG::RecoveryState::Stray, PG::RecoveryState::Started, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x216) [0x81aab6]
 6: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::send_event(boost::statechart::event_base const&)+0x5b) [0x80628b]
 7: (PG::handle_peering_event(std::tr1::shared_ptr<PG::CephPeeringEvt>, PG::RecoveryCtx*)+0x1ce) [0x7b3fde]
 8: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x2c0) [0x6a7990]
 9: (OSD::PeeringWQ::_process(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x18) [0x700068]
 10: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa5e) [0xb9914e]
 11: (ThreadPool::WorkThread::entry()+0x10) [0xb9a1f0]
 12: (()+0x8182) [0x7f247b790182]
 13: (clone()+0x6d) [0x7f2479cfb47d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Comment 25 David Zafman 2016-02-16 19:15:39 UTC
Here is what I got with debug_osd set to 30 on osd.3:

2016-02-16 18:44:40.937390 7f6843352700 10 osd.3 pg_epoch: 272 pg[12.0( v 143'445 (143'421,143'445] local-les=146 n=81 ec=109 les/c 269/269 270/270/268) [0,3] r=1 lpr=270 pi=109-269/63 crt=143'443 lcod 0'0 inactive NOTIFY] handle_peering_event: epoch_sent: 271 epoch_requested: 271 MInfoRec from 0 info: 12.0( v 141'361 (124'321,141'361] local-les=271 n=60 ec=109 les/c 269/269 270/270/268)
2016-02-16 18:44:40.937403 7f6843352700 10 osd.3 pg_epoch: 272 pg[12.0( v 143'445 (143'421,143'445] local-les=146 n=81 ec=109 les/c 269/269 270/270/268) [0,3] r=1 lpr=270 pi=109-269/63 crt=143'443 lcod 0'0 inactive NOTIFY] state<Started/Stray>: got info from osd.0 12.0( v 141'361 (124'321,141'361] local-les=271 n=60 ec=109 les/c 269/269 270/270/268)
2016-02-16 18:44:40.937414 7f6843352700 10 rewind_divergent_log truncate divergent future 141'361
2016-02-16 18:44:40.941006 7f6843352700 -1 osd/PGLog.cc: In function 'void PGLog::rewind_divergent_log(ObjectStore::Transaction&, eversion_t, pg_info_t&, PGLog::LogEntryHandler*, bool&, bool&)' thread 7f6843352700 time 2016-02-16 18:44:40.937417
osd/PGLog.cc: 503: FAILED assert(newhead >= log.tail)

Comment 26 David Zafman 2016-02-16 19:57:33 UTC
Please re-run this test. The osd.3 that crashed here was marked lost and should not have been restarted. The two OSDs are marked lost to simulate data that is not recoverable, and those OSDs must not be restarted.

There should have been two OSDs with information about the PG up/in, and two OSDs down/out and marked lost. Once marked lost, those OSDs should not be started. The two up/in OSDs will go active after setting the config and forcing a re-peer of the primary by marking it down (it will come right back up and peer).

Comment 27 shilpa 2016-02-17 09:12:05 UTC
OK, without osd.3 and osd.6 we have only osd.0 and osd.1; osd.0 came back up and re-peered.
But ceph health was stuck at:
HEALTH_WARN 1153 pgs degraded; 1153 pgs stuck degraded; 465 pgs stuck inactive; 1153 pgs stuck unclean; 1153 pgs stuck undersized; 1153 pgs undersized; 98 requests are blocked > 32 sec; recovery 176/321 objects degraded (54.829%); recovery 38/321 objects misplaced (11.838%); too many PGs per OSD (576 > max 300)

I added two more OSDs to the cluster and the cluster came back to the active+clean state.

Comment 28 shilpa 2016-02-17 09:46:09 UTC
Further I/Os are working. We were able to recover from the PG incomplete state. Moving the bug to verified.

Comment 30 errata-xmlrpc 2016-02-29 14:44:28 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2016:0313