Description of problem:
Objects are deleted but the index still thinks they are there. This issue was thought to be resolved in BZ https://bugzilla.redhat.com/show_bug.cgi?id=1464099

Version-Release number of selected component (if applicable):

How reproducible:
See http://tracker.ceph.com/issues/20380

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
Was able to reproduce the error by repeatedly running the following set of scripts (modified versions of those supplied by the customer) on a cluster with three RADOS gateways. The condition seems to occur in around 1 out of approx. 20 runs. Currently combing through the OSD logs to see if there are any clues as to the underlying problem.

==== delete_create_script.sh
#!/bin/sh

trap 'kill $(jobs -p)' SIGINT SIGTERM # EXIT

swift -A http://localhost:8000/auth -U test:tester -K testing post test

dir=$(dirname $0)

${dir}/delete_create_object.sh one &
${dir}/delete_create_object.sh two &
${dir}/delete_create_object.sh three &
${dir}/delete_create_object.sh four &
${dir}/delete_create_object.sh five &
${dir}/delete_create_object.sh six &
${dir}/delete_create_object.sh seven &
${dir}/delete_create_object.sh eight &
${dir}/delete_create_object.sh nine &
${dir}/delete_create_object.sh ten &

wait
echo Done all

==== delete_create_object.sh
#!/bin/sh

port_lo=8000
port_hi=8002

trap 'kill $(jobs -p)' SIGINT SIGTERM # EXIT

i=1
objects=100
prefix=$1
list_out=$1.list

echo $(date) > $list_out

while [ $i -lt $objects ]
do
    object=$prefix.`date +%Y-%m-%d:%H:%M:%S`.$i
    touch $object
    port1=$(( RANDOM % (port_hi - port_lo + 1 ) + port_lo ))
    port2=$(( RANDOM % (port_hi - port_lo + 1 ) + port_lo ))
    port3=$(( RANDOM % (port_hi - port_lo + 1 ) + port_lo ))
    swift -A http://localhost:${port1}/auth -U test:tester -K testing upload test $object >/dev/null
    swift -A http://localhost:${port2}/auth -U test:tester -K testing delete test $object >/dev/null &
    swift -A http://localhost:${port3}/auth -U test:tester -K testing list test >>$list_out
    i=`expr $i + 1`
    rm -f $object &
done

wait
echo Done $1
====
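One possible way to check for the faulty condition after a run is sketched below. It is not part of the original scripts. The helper name find_stale_entries is hypothetical, and the sketch assumes that "swift stat" exits non-zero for an object that no longer exists. It reads object names from a listing on stdin and prints every name for which the supplied check command fails, i.e. entries that are still listed but cannot be found.

```shell
#!/bin/sh
# Hypothetical helper (not from the original report): print every listed
# object name for which the given check command fails.
# Usage: find_stale_entries CHECK_CMD [ARGS...] < listing
find_stale_entries() {
    while IFS= read -r name; do
        # Skip blank lines in the listing.
        [ -n "$name" ] || continue
        # Redirect stdin from /dev/null so the check cannot consume the listing.
        if ! "$@" "$name" </dev/null >/dev/null 2>&1; then
            printf '%s\n' "$name"
        fi
    done
}

# Example against the reproducer's container (credentials copied from the
# scripts above; assumes "swift stat" fails for a missing object):
# swift -A http://localhost:8000/auth -U test:tester -K testing list test |
#     find_stale_entries swift -A http://localhost:8000/auth -U test:tester -K testing stat test
```

If all uploads were followed by successful deletes, any name printed here would be a candidate stale index entry worth inspecting further.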
*** Bug 1496568 has been marked as a duplicate of this bug. ***
upstream jewel backport: https://github.com/ceph/ceph/pull/16856
git tag --contains ff67388e24c93ca16553839c16f51030fa322917
v10.2.10
I believe I have this fixed. In local testing I was generally able to create the faulty condition in about 30 minutes. My local testing with the fix has now run for over 3 hours without generating the faulty condition. The proposed bug fix is currently a DNM PR: https://github.com/ceph/ceph/pull/18709. The PR is currently being built by the ceph-ci tooling, and when that's done I will run it through the RGW suite to look for any regressions.
The fix on master has been running for close to 72 hours without recreating the error state. As mentioned previously I had been able to generate the error state in about 30 minutes. So this fix does seem to address the error reported. Regression testing found no regressions. See the PR (https://github.com/ceph/ceph/pull/18709) for details.
(In reply to Eric Ivancich from comment #15)
> The fix on master has been running for close to 72 hours without recreating
> the error state. As mentioned previously I had been able to generate the
> error state in about 30 minutes. So this fix does seem to address the error
> reported.
>
> Regression testing found no regressions. See the PR
> (https://github.com/ceph/ceph/pull/18709) for details.

Thank you Eric. Awesome work!

Matt and Eric - I think we need to pull this patch (https://github.com/ceph/ceph/pull/18709) into 3.0 (luminous) as well. Do we have a bug for it, or do you want me to open one?
This bug is targeted for RHCEPH 2.5 and this fix is not in RHCEPH 3. Would you please cherry-pick the change to ceph-3.0-rhel-patches (with the RHCEPH 3 clone ID number, "Resolves: rhbz#1530784") so customers do not experience a regression?
Hi all,

Using the script mentioned in comment #3, I created and deleted 10k objects and did not see the issue. Moving the BZ to VERIFIED for ceph version 10.2.10-11.el7cp.

Thanks,
Vidushi
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:0340