Bug 1289313
Summary: | osd/PG.cc: 288: FAILED assert(info.last_epoch_started >= info.history.last_epoch_started) | | |
---|---|---|---|
Product: | [Red Hat Storage] Red Hat Ceph Storage | Reporter: | Moses Muir <mmuir> |
Component: | RADOS | Assignee: | David Zafman <dzafman> |
Status: | CLOSED ERRATA | QA Contact: | ceph-qe-bugs <ceph-qe-bugs> |
Severity: | medium | Docs Contact: | |
Priority: | unspecified | | |
Version: | 1.3.1 | CC: | bhubbard, ceph-eng-bugs, dzafman, flucifre, hnallurv, kchai, kdreyer, shmohan, smanjara, vumrao |
Target Milestone: | rc | Keywords: | Patch |
Target Release: | 1.3.2 | | |
Hardware: | x86_64 | | |
OS: | Linux | | |
Whiteboard: | | | |
Fixed In Version: | RHEL: ceph-0.94.5-1.el7cp; Ubuntu: ceph_0.94.5-2redhat1 | Doc Type: | Bug Fix |
Doc Text: | Cause/Consequence: When a PG (placement group) gets stuck incomplete, the OSD could crash with an assert because of a last_epoch_started mismatch. Fix: The OSD code has been changed to skip the assert when the user sets osd_find_best_info_ignore_history_les to "true". Result: The OSD no longer crashes when recovering an incomplete PG while osd_find_best_info_ignore_history_les is set. (A hedged command sketch of this workaround follows the table.) | | |
Story Points: | --- | | |
Clone Of: | | Environment: | |
Last Closed: | 2016-02-29 14:44:28 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | --- |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | | | |
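The workaround named in the Doc Text can be applied persistently or at runtime. Below is a minimal sketch; osd.0 is an illustrative OSD ID, and runtime injection is assumed to work for this option (the comments later in this report show it being injected and read back via the admin socket):

```sh
# Persistent form: add to ceph.conf on the affected OSD host, then
# restart that OSD.
# [osd]
# osd_find_best_info_ignore_history_les = true

# Runtime form: inject into the acting primary of the incomplete PG.
ceph tell osd.0 injectargs '--osd_find_best_info_ignore_history_les=true'

# Mark the primary down to force a re-peer; it comes back up on its
# own, and the PG peers without hitting the last_epoch_started assert.
ceph osd down 0
```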
Description: Moses Muir, 2015-12-07 20:45:43 UTC
David, please let me know what patch to pull in for this. Can you get commit 02a9a41f151a3d968bf8066749658659dc6e3ac4 from git:dzafman/ceph.git, or do you need the patch in another form? Here it is on github.com: https://github.com/dzafman/ceph/commit/02a9a41f151a3d968bf8066749658659dc6e3ac4

David, could you please file a bug upstream for this issue so that we ensure it gets backported to the hammer and jewel branches upstream? http://tracker.ceph.com/projects/ceph/issues/new

For the doc text in this bugzilla, I'm thinking of something like the following: "When a PG gets stuck incomplete, the OSD could crash with an assert due to last_epoch_started mismatches. With this change, an OSD in this condition will not crash if the user sets osd_find_best_info_ignore_history_les to "true"." Is that text accurate? What would you change there?

Ken, I have already filed upstream tracker 14015, marked Pending Backport: http://tracker.ceph.com/issues/14015

Thanks. Can you please review the "Doc Text" field and update it as needed? This is what I'd say: When a PG gets stuck incomplete, it is possible to recover partial data. To recover, the user sets osd_find_best_info_ignore_history_les to "true" and the primary OSD is marked down to force a re-peer. With this change, an OSD in this condition will not crash.

This change will be going into upstream anyway. Targeting RHCS 1.3.2 in the meantime.

Hi Federico, now that the patch is merged upstream, I think we should take it for 1.3.2. Regards, Vikhyat

Needinfo Ken to see if he is open to picking this up in a future respin.

Ken confirms this is okay to include in 1.3.2 at a respin point.

Mail snippet capturing the testcase for validation:
=============================================

1. Create a one-PG replicated pool:

    [ubuntu@magna105 ~]$ sudo ceph osd pool create redhat 1 1 replicated replicated_ruleset
    pool 'redhat' created

2. Set min_size to 1:

    [ubuntu@magna105 ~]$ sudo ceph osd pool set redhat min_size 1
    set pool 625 min_size to 1
    [ubuntu@magna105 ~]$ sudo ceph pg dump | grep 625
    dumped all in format plain
    625.0 0 0 0 0 0 0 0 0 active+clean 2016-02-09 10:35:48.766861 0'0 1598:11 [1,0,2] 1 [1,0,2] 1 0'0 2016-02-09 10:35:26.639700 0'0 2016-02-09 10:35:26.639700
    pool 625 0 0 0 0 0 0 0 0

3. Did some I/O.

4. Brought down [1,0] and did some I/O.

5. Brought down [2], then brought [1,0] back. After this the state of the pool is:

    [ubuntu@magna105 ~]$ sudo ceph pg dump | grep 625
    dumped all in format plain
    625.0 100 0 0 0 0 168200 100 100 down+peering 2016-02-09 10:40:16.451085 1598'100 1607:122 [1,0] 1 [1,0] 1 0'0 2016-02-09 10:35:26.639700 0'0 2016-02-09 10:35:26.639700
    pool 625 100 0 0 0 0 168200 100 100

Later I marked osd.2 as lost. Still the same issue:

    dumped all in format plain
    stamp 2016-02-10 04:31:07.562590
    625.0 100 0 0 0 0 168200 100 100 down+peering 2016-02-09 10:40:16.451085 1598'100 1613:124 [1,0] 1 [1,0] 1 0'0 2016-02-09 10:35:26.639700 0'0 2016-02-09 10:35:26.639700
    pool 625 100 0 0 0 0 168200 100 100

Now mark osd.1 down to force peering again:

    [ubuntu@magna105 ceph]$ sudo ceph pg dump | grep 625
    dumped all in format plain
    625.0 100 0 100 0 0 168200 100 100 active+undersized+degraded 2016-02-10 04:52:31.564448 1598'100 1620:138 [1,0] 1 [1,0] 1 0'0 2016-02-09 10:35:26.639700 0'0 2016-02-09 10:35:26.639700
    pool 625 100 0 100 0 0 168200 100 100

It became active, not incomplete. @dzafman, I am not able to reach the incomplete state you mentioned in the mail. Could you please have a look into this so that I can verify this bug asap?
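For quick reuse, the testcase above reduces to roughly the following commands. This is a sketch under the assumptions of that run (pool "redhat", replicas on osds 1, 0, 2, Upstart-managed OSDs on an Ubuntu host like magna105); the stop/start lines are illustrative:

```sh
# Steps 1-2: a one-PG replicated pool that accepts I/O with one replica.
sudo ceph osd pool create redhat 1 1 replicated replicated_ruleset
sudo ceph osd pool set redhat min_size 1

# Steps 3-5: write objects, stop osd.1 and osd.0, write more, stop
# osd.2, then restart osd.1 and osd.0, leaving osd.2 down. For example,
# on each OSD host (Upstart):
#   sudo stop ceph-osd id=1
#   sudo start ceph-osd id=1

# Declare the still-down replica unrecoverable.
sudo ceph osd lost 2 --yes-i-really-mean-it

# Force the PG to re-peer by marking the current primary down.
sudo ceph osd down 1
sudo ceph pg dump | grep ^625
```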
I wouldn't let this testing hold anything up. This change doesn't affect anything except manual repair (usually by support); that's why existing test coverage wouldn't exercise it. At worst, if the bug isn't fixed, it would only affect support trying to bring a cluster back to life after data loss has occurred.

Sam reminded me that backfill is required to get incomplete instead of just stuck peering. To get an incomplete PG to test, try this (condensed as a command sketch below):

Start a cluster with at least 4 OSDs and the following osd config to get backfill to happen:

    osd_min_pg_log_entries = 5
    osd_max_pg_log_entries = 5

Create a pool with pg_num/pgp_num 1, size 3, and min_size 1. Say the PG's replicas are on osds 1, 2, 3:

- Write enough data to a bunch of objects, so there is a non-trivial amount of data to backfill.
- Kill osd.1 while still writing some additional objects.
- Make sure the PG is still backfilling to a new osd, say osd.4.
- Kill osd.2 and osd.3 BEFORE backfill completes to osd.4.
- Bring osd.1 back up. The PG should be stuck peering, wanting to query osd.2 or osd.3.
- Mark osd.2 and osd.3 lost. The PG should go incomplete.

Now test the osd_find_best_info_ignore_history_les config value by injecting it into the PG primary and marking it down to re-peer.
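A condensed sketch of this reproduction recipe, under the assumptions in the comment above; the pool name "testpool" and the rados bench write are illustrative stand-ins for "write enough data":

```sh
# In ceph.conf on every OSD, before starting the cluster, so that any
# interrupted recovery must fall back to backfill:
# [osd]
# osd_min_pg_log_entries = 5
# osd_max_pg_log_entries = 5

sudo ceph osd pool create testpool 1 1
sudo ceph osd pool set testpool size 3
sudo ceph osd pool set testpool min_size 1

# Write a non-trivial amount of data, e.g.:
sudo rados -p testpool bench 60 write --no-cleanup

# With replicas on osd.1,2,3: stop osd.1 while writing more objects,
# confirm backfill toward a new osd (osd.4) has started, stop osd.2
# and osd.3 before it completes, then restart osd.1.
# The PG is now stuck peering, wanting to query osd.2 or osd.3.

sudo ceph osd lost 2 --yes-i-really-mean-it
sudo ceph osd lost 3 --yes-i-really-mean-it
# The PG should now show as incomplete in 'ceph pg dump'.

# Exercise the fix: inject the option into the PG primary and mark it
# down to force a re-peer.
sudo ceph tell osd.1 injectargs '--osd_find_best_info_ignore_history_les=true'
sudo ceph osd down 1
```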
(In reply to David Zafman from comment #22)

Hit a crash while testing the above steps. The PG's acting set was [0,6,3]; osd.1 is the new OSD; the PG is 12.0. Brought the PG to the incomplete state.

Before injecting the osd_find_best_info_ignore_history_les config:

    $ sudo ceph pg dump | grep incomplete
    dumped all in format plain
    12.0 60 0 0 0 0 797276160 40 40 incomplete 2016-02-16 10:51:22.274293 141'361 167:381 [0] 0

    pgmap v5617: 1153 pgs, 12 pools, 760 MB data, 107 objects
    1041 MB used, 1851 GB / 1852 GB avail
    58/321 objects degraded (18.069%)
    36/321 objects misplaced (11.215%)
    723 active+undersized+degraded+remapped
    429 undersized+degraded+peered
    1 incomplete

After injecting the osd_find_best_info_ignore_history_les config and marking osd.0 down, the PG was back in the active state:

    $ sudo ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config show | grep osd_find_best_info_ignore_history_les
    "osd_find_best_info_ignore_history_les": "true",

    $ sudo ceph pg dump | grep ^12.0
    dumped all in format plain
    12.0 60 0 120 0 0 797276160 40 40 active+undersized+degraded 2016-02-16 11:01:45.991202 141'361 172:387

    $ sudo ceph health
    HEALTH_WARN 1153 pgs degraded; 1153 pgs stuck degraded; 465 pgs stuck inactive; 1153 pgs stuck unclean; 1153 pgs stuck undersized; 1153 pgs undersized; 98 requests are blocked > 32 sec; recovery 176/321 objects degraded (54.829%); recovery 38/321 objects misplaced (11.838%); too many PGs per OSD (576 > max 300)

However, while trying to bring osd.6 and osd.3 back up from the 'down' state, I hit a crash:

    0> 2016-02-16 11:12:18.848291 7f246120d700 -1 osd/PGLog.cc: In function 'void PGLog::rewind_divergent_log(ObjectStore::Transaction&, eversion_t, pg_info_t&, PGLog::LogEntryHandler*, bool&, bool&)' thread 7f246120d700 time 2016-02-16 11:12:18.844362
    osd/PGLog.cc: 503: FAILED assert(newhead >= log.tail)

    ceph version 0.94.5-5redhat1trusty (deef183a81111fa5e128ec88c90a32c9587c615d)
    1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0xba86eb]
    2: (PGLog::rewind_divergent_log(ObjectStore::Transaction&, eversion_t, pg_info_t&, PGLog::LogEntryHandler*, bool&, bool&)+0x60b) [0x76b1db]
    3: (PG::rewind_divergent_log(ObjectStore::Transaction&, eversion_t)+0x97) [0x7bd7b7]
    4: (PG::RecoveryState::Stray::react(PG::MInfoRec const&)+0x22f) [0x7df2af]
    5: (boost::statechart::simple_state<PG::RecoveryState::Stray, PG::RecoveryState::Started, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x216) [0x81aab6]
    6: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::send_event(boost::statechart::event_base const&)+0x5b) [0x80628b]
    7: (PG::handle_peering_event(std::tr1::shared_ptr<PG::CephPeeringEvt>, PG::RecoveryCtx*)+0x1ce) [0x7b3fde]
    8: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x2c0) [0x6a7990]
    9: (OSD::PeeringWQ::_process(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x18) [0x700068]
    10: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa5e) [0xb9914e]
    11: (ThreadPool::WorkThread::entry()+0x10) [0xb9a1f0]
    12: (()+0x8182) [0x7f247b790182]
    13: (clone()+0x6d) [0x7f2479cfb47d]
    NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Here is what I got with debug_osd set to 30 on osd.3:

    2016-02-16 18:44:40.937390 7f6843352700 10 osd.3 pg_epoch: 272 pg[12.0( v 143'445 (143'421,143'445] local-les=146 n=81 ec=109 les/c 269/269 270/270/268) [0,3] r=1 lpr=270 pi=109-269/63 crt=143'443 lcod 0'0 inactive NOTIFY] handle_peering_event: epoch_sent: 271 epoch_requested: 271 MInfoRec from 0 info: 12.0( v 141'361 (124'321,141'361] local-les=271 n=60 ec=109 les/c 269/269 270/270/268)
    2016-02-16 18:44:40.937403 7f6843352700 10 osd.3 pg_epoch: 272 pg[12.0( v 143'445 (143'421,143'445] local-les=146 n=81 ec=109 les/c 269/269 270/270/268) [0,3] r=1 lpr=270 pi=109-269/63 crt=143'443 lcod 0'0 inactive NOTIFY] state<Started/Stray>: got info from osd.0 12.0( v 141'361 (124'321,141'361] local-les=271 n=60 ec=109 les/c 269/269 270/270/268)
    2016-02-16 18:44:40.937414 7f6843352700 10 rewind_divergent_log truncate divergent future 141'361 2
    2016-02-16 18:44:40.941006 7f6843352700 -1 osd/PGLog.cc: In function 'void PGLog::rewind_divergent_log(ObjectStore::Transaction&, eversion_t, pg_info_t&, PGLog::LogEntryHandler*, bool&, bool&)' thread 7f6843352700 time 2016-02-16 18:44:40.937417
    osd/PGLog.cc: 503: FAILED assert(newhead >= log.tail)

Please re-run this test. The osd.3 that crashed here was marked lost and should not have been restarted. The two OSDs are marked lost to simulate data that is not recoverable, and they must not be restarted. There should have been two OSDs with information about the PG up/in, and two OSDs down/out and marked lost; once marked lost, those OSDs should not be started. The two up/in OSDs will go active after you set the config and force a re-peer of the primary: mark it down (it will come right back up and peer).

OK. Without osd.3 and osd.6, we have only osd.0 and osd.1, where osd.0 came back up and re-peered. But ceph health was stuck at:

    HEALTH_WARN 1153 pgs degraded; 1153 pgs stuck degraded; 465 pgs stuck inactive; 1153 pgs stuck unclean; 1153 pgs stuck undersized; 1153 pgs undersized; 98 requests are blocked > 32 sec; recovery 176/321 objects degraded (54.829%); recovery 38/321 objects misplaced (11.838%); too many PGs per OSD (576 > max 300)

I added two more OSDs to the cluster, and the cluster health came to the active+clean state. Further I/O is succeeding, and we were able to recover from the PG incomplete state. Moving the bug to verified.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2016:0313
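For reference, the verified recovery flow exercised above reduces to roughly the following commands. This is a sketch, not an exact transcript: the OSD IDs match the test run, and the exact way replacement OSDs are added will vary by deployment:

```sh
# Leave the lost OSDs (osd.3 and osd.6 here) down; never restart an
# OSD that has been marked lost.

# Make the surviving primary ignore history.last_epoch_started, then
# force a re-peer by marking it down (it rejoins on its own).
sudo ceph tell osd.0 injectargs '--osd_find_best_info_ignore_history_les=true'
sudo ceph osd down 0

# The PG goes active+undersized+degraded; restore redundancy by adding
# replacement OSDs, then watch the cluster reach active+clean.
sudo ceph osd tree
sudo ceph -w
```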
Here is what I got with debug_osd set to 30 on osd.3: 2016-02-16 18:44:40.937390 7f6843352700 10 osd.3 pg_epoch: 272 pg[12.0( v 143'445 (143'421,143'445] local-les=146 n=81 ec=109 les/c 269/269 270/270/268) [0,3] r=1 lpr=270 pi=109-269/63 crt=143'443 lcod 0'0 inactive NOTIFY] handle_peering_event: epoch_sent: 271 epoch_requested: 271 MInfoRec from 0 info: 12.0( v 141'361 (124'321,141'361] local-les=271 n=60 ec=109 les/c 269/269 270/270/268) 2016-02-16 18:44:40.937403 7f6843352700 10 osd.3 pg_epoch: 272 pg[12.0( v 143'445 (143'421,143'445] local-les=146 n=81 ec=109 les/c 269/269 270/270/268) [0,3] r=1 lpr=270 pi=109-269/63 crt=143'443 lcod 0'0 inactive NOTIFY] state<Started/Stray>: got info from osd.0 12.0( v 141'361 (124'321,141'361] local-les=271 n=60 ec=109 les/c 269/269 270/270/268) 2016-02-16 18:44:40.937414 7f6843352700 10 rewind_divergent_log truncate divergent future 141'361 2 2016-02-16 18:44:40.941006 7f6843352700 -1 osd/PGLog.cc: In function 'void PGLog::rewind_divergent_log(ObjectStore::Transaction&, eversion_t, pg_info_t&, PGLog::LogEntryHandler*, bool&, bool&)' thread 7f6843352700 time 2016-02-16 18:44:40.937417 osd/PGLog.cc: 503: FAILED assert(newhead >= log.tail) Please re-run this test. The osd.3 which crashed here was marked lost and should not have been restarted. The 2 OSDs are marked lost to simulate data that is not recoverable and these OSDs must not be restarted. There should have been 2 OSDs with information about the pg up/in and 2 pgs down/out and marked lost. Once marked lost those pgs should not be started. The 2 up/in osds are going to go active after setting the config and forcing a re-peer of the primary mark it down (it'll come right back up and peer). Ok without osd.3 and osd.6, we have only osd.0 and osd.1, where osd.0 came back up and re-peered. But ceph health was stuck at: HEALTH_WARN 1153 pgs degraded; 1153 pgs stuck degraded; 465 pgs stuck inactive; 1153 pgs stuck unclean; 1153 pgs stuck undersized; 1153 pgs undersized; 98 requests are blocked > 32 sec; recovery 176/321 objects degraded (54.829%); recovery 38/321 objects misplaced (11.838%); too many PGs per OSD (576 > max 300) I added two more OSD's to the cluster and the cluster health came to active+clean state. Further I/O's are happening. Able to recover from PG Incomplete state. Moving the bug to verified. Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2016:0313 |