Bug 1600138

Summary: [Bluestore]: one of the osds flapped multiple times with 1525: FAILED assert(0 == "bluefs enospc")
Product: [Red Hat Storage] Red Hat Ceph Storage Reporter: Parikshith <pbyregow>
Component: RADOS Assignee: Sage Weil <sweil>
Status: CLOSED ERRATA QA Contact: Manohar Murthy <mmurthy>
Severity: high Docs Contact:
Priority: high    
Version: 3.1 CC: akupczyk, anharris, assingh, ceph-eng-bugs, dzafman, edonnell, gsitlani, hnallurv, jdurgin, kchai, linuxkidd, mamccoma, mhackett, mmurthy, nojha, pasik, shalygin.k, sweil, tchandra, tonay, tpetr, tserlin, vumrao
Target Milestone: z2   
Target Release: 3.2   
Hardware: Unspecified   
OS: Linux   
Whiteboard:
Fixed In Version: RHEL: ceph-12.2.8-108.el7cp Ubuntu: ceph_12.2.8-93redhat1xenial Doc Type: Bug Fix
Doc Text:
.RocksDB compaction no longer exhausts free space of BlueFS
Previously, the balancing of free space between the main storage and the storage for RocksDB, which is managed by BlueFS, happened only while write operations were underway. As a result, BlueFS returned an `ENOSPC` error when RocksDB compaction was triggered right before a long interval without write operations. With this update, the code periodically checks the free space balance even when no write operations are ongoing, so compaction no longer exhausts the free space of BlueFS.
Story Points: ---
Clone Of: Environment:
Last Closed: 2019-04-30 15:56:43 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1629656    
Attachments:
osd.10.log
ceph.log
mon logs

Description Parikshith 2018-07-11 13:57:13 UTC
Created attachment 1458119 [details]
osd.10.log

Description of problem:
One of the OSDs randomly flapped multiple times with: 1525: FAILED assert(0 == "bluefs enospc")

Version-Release number of selected component (if applicable):
ceph version 12.2.5-27.el7cp

How reproducible:  
Intermittent

Steps: 
1. Created cluster with only bluestore osds. Filled the cluster with ~70% data.
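For reference, one way to drive a similar fill (an illustrative sketch; the pool name, run time, and target utilization are assumptions, not from the original report):

 # write 4 MiB objects until raw utilization approaches ~70%
 rados bench -p rbd 600 write --no-cleanup
 # check raw utilization
 ceph df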

Actual results:
One of the osds (osd.10) started flapping.

2018-07-11 07:35:48.258555 7f354e0ae700 -1 bluefs _allocate failed to allocate 0x3000000 on bdev 2, free 0x1700000
2018-07-11 07:35:48.258569 7f354e0ae700  0 stupidalloc 0x0x55a30f764070 dump free bin 0: 0 extents
2018-07-11 07:35:48.258574 7f354e0ae700  0 stupidalloc 0x0x55a30f764070 dump free bin 1: 0 extents
2018-07-11 07:35:48.258576 7f354e0ae700  0 stupidalloc 0x0x55a30f764070 dump free bin 2: 0 extents
2018-07-11 07:35:48.258577 7f354e0ae700  0 stupidalloc 0x0x55a30f764070 dump free bin 3: 0 extents
2018-07-11 07:35:48.258579 7f354e0ae700  0 stupidalloc 0x0x55a30f764070 dump free bin 4: 0 extents
2018-07-11 07:35:48.258580 7f354e0ae700  0 stupidalloc 0x0x55a30f764070 dump free bin 5: 0 extents
2018-07-11 07:35:48.258582 7f354e0ae700  0 stupidalloc 0x0x55a30f764070 dump free bin 6: 0 extents
2018-07-11 07:35:48.258584 7f354e0ae700  0 stupidalloc 0x0x55a30f764070 dump free bin 7: 0 extents
2018-07-11 07:35:48.258585 7f354e0ae700  0 stupidalloc 0x0x55a30f764070 dump free bin 8: 0 extents
2018-07-11 07:35:48.258587 7f354e0ae700  0 stupidalloc 0x0x55a30f764070 dump free bin 9: 14 extents
2018-07-11 07:35:48.258588 7f354e0ae700  0 stupidalloc 0x0x55a30f764070 dump  0x26fcc00000~100000
2018-07-11 07:35:48.258590 7f354e0ae700  0 stupidalloc 0x0x55a30f764070 dump  0x26ff300000~100000
2018-07-11 07:35:48.258591 7f354e0ae700  0 stupidalloc 0x0x55a30f764070 dump  0x33ed800000~100000
2018-07-11 07:35:48.258592 7f354e0ae700  0 stupidalloc 0x0x55a30f764070 dump  0x7649900000~100000
2018-07-11 07:35:48.258594 7f354e0ae700  0 stupidalloc 0x0x55a30f764070 dump  0x9c2d700000~100000
2018-07-11 07:35:48.258595 7f354e0ae700  0 stupidalloc 0x0x55a30f764070 dump  0x9c4d200000~100000
2018-07-11 07:35:48.258597 7f354e0ae700  0 stupidalloc 0x0x55a30f764070 dump  0x9d69d00000~200000
2018-07-11 07:35:48.258598 7f354e0ae700  0 stupidalloc 0x0x55a30f764070 dump  0xa1f8e00000~100000
2018-07-11 07:35:48.258599 7f354e0ae700  0 stupidalloc 0x0x55a30f764070 dump  0xa3b9d00000~100000
2018-07-11 07:35:48.258601 7f354e0ae700  0 stupidalloc 0x0x55a30f764070 dump  0xc70a100000~500000
2018-07-11 07:35:48.258602 7f354e0ae700  0 stupidalloc 0x0x55a30f764070 dump  0xc72bb00000~200000
2018-07-11 07:35:48.258603 7f354e0ae700  0 stupidalloc 0x0x55a30f764070 dump  0xc76ae00000~100000
2018-07-11 07:35:48.258604 7f354e0ae700  0 stupidalloc 0x0x55a30f764070 dump  0xc781400000~200000
2018-07-11 07:35:48.258606 7f354e0ae700  0 stupidalloc 0x0x55a30f764070 dump  0xc98f300000~300000
2018-07-11 07:35:48.258608 7f354e0ae700 -1 bluefs _flush_range allocated: 0x0 offset: 0x0 length: 0x2f0901c
2018-07-11 07:35:48.264086 7f354e0ae700 -1 /builddir/build/BUILD/ceph-12.2.5/src/os/bluestore/BlueFS.cc: In function 'int BlueFS::_flush_range(BlueFS::FileWriter*, uint64_t, uint64_t)' thread 7f354e0ae700 time 2018-07-11 07:35:48.258621
/builddir/build/BUILD/ceph-12.2.5/src/os/bluestore/BlueFS.cc: 1525: FAILED assert(0 == "bluefs enospc")

 ceph version 12.2.5-27.el7cp (6b7d1a8cf2e2376fd0a12424947c8c97aa1c0890) luminous (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0x55a304abd710]
 2: (BlueFS::_flush_range(BlueFS::FileWriter*, unsigned long, unsigned long)+0x1b1b) [0x55a304a3c13b]
 3: (BlueFS::_flush(BlueFS::FileWriter*, bool)+0x188) [0x55a304a3c6a8]
 4: (BlueRocksWritableFile::Flush()+0x3c) [0x55a304a5823c]
 5: (rocksdb::WritableFileWriter::Flush()+0xaf) [0x55a304e2755f]
 6: (rocksdb::WritableFileWriter::Sync(bool)+0x33) [0x55a304e27773]
 7: (rocksdb::CompactionJob::FinishCompactionOutputFile(rocksdb::Status const&, rocksdb::CompactionJob::SubcompactionState*, rocksdb::RangeDelAggregator*, CompactionIterationStats*, rocksdb::Slice const*)+0xa1b) [0x55a304eed7ab]
 8: (rocksdb::CompactionJob::ProcessKeyValueCompaction(rocksdb::CompactionJob::SubcompactionState*)+0xea2) [0x55a304ef0222]
 9: (rocksdb::CompactionJob::Run()+0x1de) [0x55a304ef0cde]
 10: (rocksdb::DBImpl::BackgroundCompaction(bool*, rocksdb::JobContext*, rocksdb::LogBuffer*, void*)+0x9c7) [0x55a304e7c687]
 11: (rocksdb::DBImpl::BackgroundCallCompaction(void*)+0x283) [0x55a304e7fc73]
 12: (rocksdb::ThreadPoolImpl::Impl::BGThread(unsigned long)+0x271) [0x55a304ee6e31]
 13: (rocksdb::ThreadPoolImpl::Impl::BGThreadWrapper(void*)+0x3d) [0x55a304ee701d]
 14: (()+0xb5070) [0x7f355fafc070]
 15: (()+0x7dd5) [0x7f356016fdd5]
 16: (clone()+0x6d) [0x7f355f260b3d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- begin dump of recent events ---

Additional info:
There was no I/O going on in the cluster.
There were slow/blocked requests while scrubbing was going on.
One PG was stuck in the active+clean+remapped state (bz #1600040).
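A quick way to see how close BlueFS is to running out of DB space on the affected OSD (an illustrative check via the admin socket; counter names can differ slightly between releases):

 # on the host running osd.10
 ceph daemon osd.10 perf dump bluefs
 # compare db_used_bytes against db_total_bytes (and slow_used_bytes vs slow_total_bytes)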

Comment 7 Parikshith 2018-07-13 18:02:17 UTC
Created attachment 1458772 [details]
ceph.log

Comment 8 Parikshith 2018-07-13 18:03:07 UTC
Created attachment 1458773 [details]
mon logs

Comment 11 Sage Weil 2018-07-24 15:56:46 UTC
How reproducible is this?  I'm looking at the luminous code and I think the only way this would happen is if the device is fast and/or the osd is idle, but rocksdb is doing compaction.  We have a configurable that gives bluefs a minimum of 1 GB of free space.  I think the way to address this is to bump that to, say, 5GB or 10GB.  It's hard to tell if that will do the trick, though, without being able to reproduce...

Can you verify it reproduces, and then try it again with

 bluestore_bluefs_min_free = 10737418240   # 10gb (current default is 1gb)

?
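For reference, a sketch of how the suggested value could be applied (the ceph.conf placement and the restart step are assumptions, not part of the comment above):

 # /etc/ceph/ceph.conf on the OSD hosts
 [osd]
 bluestore_bluefs_min_free = 10737418240   # 10 GB; the default is 1073741824 (1 GB)

 # restart the OSD(s) so the new value takes effect, e.g.
 systemctl restart ceph-osd@10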

Comment 13 Josh Durgin 2018-09-14 22:29:57 UTC
Moving to 3.* unless we can reproduce to test setting bluestore_bluefs_min_free to a higher value.

Comment 21 Neha Ojha 2019-02-15 17:51:31 UTC
Tomas, what is the current state of the customer cluster? We are working on backporting this patch, but that might take some time.

Comment 25 Neha Ojha 2019-02-15 23:32:29 UTC
Adam(akupczyk) is investigating a way to get around this issue without applying the patch immediately.

Comment 39 Neha Ojha 2019-02-22 18:54:17 UTC
Sounds good. Thanks Vikhyat.

Comment 65 Neha Ojha 2019-03-07 19:00:06 UTC
We will be backporting https://github.com/ceph/ceph/pull/26735 to luminous and shipping it in 3.2z2.

Comment 74 errata-xmlrpc 2019-04-30 15:56:43 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2019:0911