Bug 1501958
Summary: | [CephFS]: Cluster ended up in "damaged" MDS when subtree pinning is in progress and an MDS failover is attempted | | |
---|---|---|---|
Product: | [Red Hat Storage] Red Hat Ceph Storage | Reporter: | shylesh <shmohan> |
Component: | CephFS | Assignee: | Patrick Donnelly <pdonnell> |
Status: | CLOSED ERRATA | QA Contact: | Ramakrishnan Periyasamy <rperiyas> |
Severity: | urgent | Docs Contact: | |
Priority: | high | ||
Version: | 3.0 | CC: | ceph-eng-bugs, hnallurv, john.spray, kdreyer, pdonnell, rperiyas, shmohan, tserlin, zyan |
Target Milestone: | z2 | ||
Target Release: | 3.0 | ||
Hardware: | x86_64 | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | RHEL: ceph-12.2.4-4.el7cp Ubuntu: ceph_12.2.4-5redhat1xenial | Doc Type: | If docs needed, set a value |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2018-04-26 17:38:39 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
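For context on the scenario in the summary: in Ceph Luminous, subtree pinning is done by setting the `ceph.dir.pin` extended attribute on a directory, and a failover can be forced with `ceph mds fail`. A minimal sketch; the mount point `/mnt/cephfs`, directory name, and rank numbers below are illustrative, not taken from this report.

```shell
# Pin a directory subtree to MDS rank 1 (path and rank are illustrative).
setfattr -n ceph.dir.pin -v 1 /mnt/cephfs/pinned_dir

# Verify the pin took effect.
getfattr -n ceph.dir.pin /mnt/cephfs/pinned_dir

# Force a failover of MDS rank 0, as in the reported test.
ceph mds fail 0
```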
Comment 5
Yan, Zheng
2017-10-16 13:25:58 UTC
I wrongly interpreted the log. It looks like two MDS daemons wrote to object 200.00004273 at the same time, so something must be wrong with blacklisting.

In osd.3.log at magna103:/var/log/ceph:

<pre>
2017-10-13 14:10:16.312400 7f308a412700 10 osd.3 pg_epoch: 849 pg[2.3( v 849'910079 (841'908563,849'910079] local-lis/les=668/669 n=76028 ec=3/3 lis/c 668/668 les/c/f 669/670/0 668/668/371) [3,0,8] r=0 lpr=668 luod=849'910050 lua=849'910055 crt=849'910079 lcod 848'910049 mlcod 845'910047 active+clean] sending reply on osd_op(mds.0.2269:12216 2.3 2:c78e7855:::200.00004273:head [write 842784~1373 [fadvise_dontneed]] snapc 0=[] ondisk+write+known_if_redirected+full_force e849) v8 0x9914830a80
...
2017-10-13 14:11:10.061530 7f309ac33700 10 osd.3 pg_epoch: 851 pg[2.3( v 851'910221 (841'908663,851'910221] local-lis/les=668/669 n=76028 ec=3/3 lis/c 668/668 les/c/f 669/670/0 668/668/371) [3,0,8] r=0 lpr=668 luod=851'910216 lua=851'910215 crt=851'910221 lcod 851'910215 mlcod 851'910214 active+clean] sending reply on osd_op(mds.0.2207:27831 2.3 2:c78e7855:::200.00004273:head [write 842784~2354 [fadvise_dontneed]] snapc 0=[] ondisk+write+known_if_redirected+full_force e846) v8 0x99126e2700
</pre>

mds.0.2269 first wrote a log entry at offset 842784, then mds.0.2207 wrote another log entry at the same offset. mds.0.2207 was the laggy MDS, which should have been blacklisted.

<pre>
sudo ceph -c /etc/ceph/cfs.conf daemon mon.magna023 config get mds_blacklist_interval
{
    "mds_blacklist_interval": "5.000000"
}
</pre>

5 seconds is far too short; you should use the default value. The issue was caused by a wrong config. mds_blacklist_interval is only used on monitor daemons, and you do not need to change it from the default. Setting a short blacklist interval is effectively the same as preventing the monitors from blacklisting failed MDSs, and it will break the system.
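The remediation implied above can be sketched as follows. The config path and monitor name reuse those from the report; the assumption that `mds_blacklist_interval` was overridden in the `[mon]` section of ceph.conf is mine, not stated in the report.

```shell
# Check the value currently applied on a monitor
# (in this report it was 5.000000, far too short).
sudo ceph -c /etc/ceph/cfs.conf daemon mon.magna023 config get mds_blacklist_interval

# Remove any mds_blacklist_interval override from the [mon] section of
# /etc/ceph/cfs.conf on the monitor hosts, then restart the monitors so
# the built-in default applies again.

# After the next MDS failover, confirm the failed daemon's address
# actually appears in the OSD blacklist:
ceph -c /etc/ceph/cfs.conf osd blacklist ls
```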
Provided QA_ACK, clearing needinfo. Please move the bug to ON_QA.

Ken, could you please move this bug to ON_QA?

(In reply to Ramakrishnan Periyasamy from comment #21)
> Ken, could you please move this bug to ON_QA

Done.

Thomas

Moving this bug to the verified state; updated the command output in comment 20. Tested in ceph version ceph-12.2.4-4.el7cp.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:1259