External Bug ID: Ceph Project Bug Tracker 21812
It seems that the standby-replay MDS submitted a log entry.
I misinterpreted the log. It looks like two MDS daemons wrote to object 200.00004273 at the same time. Something must be wrong with blacklisting.

In osd.3.log at magna103:/var/log/ceph:
<pre>
2017-10-13 14:10:16.312400 7f308a412700 10 osd.3 pg_epoch: 849 pg[2.3( v 849'910079 (841'908563,849'910079] local-lis/les=668/669 n=76028 ec=3/3 lis/c 668/668 les/c/f 669/670/0 668/668/371) [3,0,8] r=0 lpr=668 luod=849'910050 lua=849'910055 crt=849'910079 lcod 848'910049 mlcod 845'910047 active+clean] sending reply on osd_op(mds.0.2269:12216 2.3 2:c78e7855:::200.00004273:head [write 842784~1373 [fadvise_dontneed]] snapc 0=[] ondisk+write+known_if_redirected+full_force e849) v8 0x9914830a80
...
2017-10-13 14:11:10.061530 7f309ac33700 10 osd.3 pg_epoch: 851 pg[2.3( v 851'910221 (841'908663,851'910221] local-lis/les=668/669 n=76028 ec=3/3 lis/c 668/668 les/c/f 669/670/0 668/668/371) [3,0,8] r=0 lpr=668 luod=851'910216 lua=851'910215 crt=851'910221 lcod 851'910215 mlcod 851'910214 active+clean] sending reply on osd_op(mds.0.2207:27831 2.3 2:c78e7855:::200.00004273:head [write 842784~2354 [fadvise_dontneed]] snapc 0=[] ondisk+write+known_if_redirected+full_force e846) v8 0x99126e2700
</pre>
mds.0.2269 first wrote a log entry at offset 842784, then mds.0.2207 wrote another log entry at the same offset. mds.0.2207 was the laggy MDS, which should have been blacklisted.
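For anyone who wants to repeat this check on a larger OSD log, here is a rough Python sketch that flags journal objects written at the same offset by more than one MDS instance. The regular expression is an assumption derived only from the two log lines quoted above, and `find_conflicting_writes` is a hypothetical helper name, not part of any Ceph tooling. The sample input is the `osd_op(...)` portion of those two lines, trimmed for brevity.

```python
import re
from collections import defaultdict

# Matches the osd_op(...) portion of an OSD debug log line for a write op.
# The pattern is an assumption based on the two lines quoted in this bug.
WRITE_OP = re.compile(
    r"osd_op\((?P<client>\S+?):(?P<tid>\d+) \S+ \S+:::(?P<obj>[^:]+):head "
    r"\[write (?P<off>\d+)~(?P<len>\d+)"
)


def find_conflicting_writes(lines):
    """Return (object, offset, [clients]) for offsets written by >1 client."""
    writers = defaultdict(list)
    for line in lines:
        m = WRITE_OP.search(line)
        if m is None:
            continue  # not a write op we recognise
        key = (m.group("obj"), int(m.group("off")))
        client = m.group("client")
        if client not in writers[key]:
            writers[key].append(client)  # keep first-seen order
    return [
        (obj, off, clients)
        for (obj, off), clients in writers.items()
        if len(clients) > 1
    ]


# osd_op portions of the two log lines quoted above (rest of each line trimmed).
SAMPLE = [
    "sending reply on osd_op(mds.0.2269:12216 2.3 2:c78e7855:::200.00004273:head "
    "[write 842784~1373 [fadvise_dontneed]] snapc 0=[] "
    "ondisk+write+known_if_redirected+full_force e849) v8",
    "sending reply on osd_op(mds.0.2207:27831 2.3 2:c78e7855:::200.00004273:head "
    "[write 842784~2354 [fadvise_dontneed]] snapc 0=[] "
    "ondisk+write+known_if_redirected+full_force e846) v8",
]

for obj, off, clients in find_conflicting_writes(SAMPLE):
    print(f"object {obj} offset {off} written by: {', '.join(clients)}")
```

In practice you would pass the lines of osd.3.log instead of `SAMPLE`. A hit only shows that two client instances wrote at the same offset; whether that is a blacklisting failure still needs the kind of manual reading done above.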
<pre>
sudo ceph -c /etc/ceph/cfs.conf daemon mon.magna023 config get mds_blacklist_interval
{
    "mds_blacklist_interval": "5.000000"
}
</pre>
5 seconds is too short; you should use the default value. The issue was caused by a wrong config.
mds_blacklist_interval is only used on monitor daemons. You do not need to modify this from the default. Setting a short blacklist interval is effectively the same as preventing the monitors from blacklisting failed MDSs, and it will break the system.
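A small sketch of how the `config get` output could be checked automatically, for example in a test setup script. The default of 24 * 60 seconds is my assumption about the upstream default for this release and should be verified against your Ceph version. `blacklist_interval_is_default` is a hypothetical helper, not a Ceph API.

```python
import json

# Assumed upstream default (24 * 60 seconds); verify against your Ceph release.
ASSUMED_DEFAULT_SECONDS = 24.0 * 60.0


def blacklist_interval_is_default(config_get_output, default=ASSUMED_DEFAULT_SECONDS):
    """Parse `config get mds_blacklist_interval` JSON; return (is_default, value)."""
    value = float(json.loads(config_get_output)["mds_blacklist_interval"])
    return value == default, value


# Output reported for mon.magna023 in this bug.
ok, value = blacklist_interval_is_default('{ "mds_blacklist_interval": "5.000000" }')
print(f"mds_blacklist_interval={value} default={ok}")
```

The JSON string would normally come from running the `ceph daemon mon.<id> config get mds_blacklist_interval` command against each monitor.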
Provided QA_ACK and cleared the needinfo. Please move the bug to ON_QA.
Ken, could you please move this bug to ON_QA
(In reply to Ramakrishnan Periyasamy from comment #21)
> Ken, could you please move this bug to ON_QA

Done.

Thomas
Moving this bug to the verified state. Updated the command output in comment 20; tested with ceph version ceph-12.2.4-4.el7cp.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:1259