Bug 1500904 - Stale bucket index entries are left over after object deletions
Summary: Stale bucket index entries are left over after object deletions
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: RGW
Version: 2.3
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: rc
Target Release: 2.5
Assignee: Matt Benjamin (redhat)
QA Contact: Vidushi Mishra
Docs Contact: Aron Gunn
URL:
Whiteboard:
Duplicates: 1496568 (view as bug list)
Depends On: 1530784
Blocks: 1473188 1491723 1536401
 
Reported: 2017-10-11 17:21 UTC by Benjamin Schmaus
Modified: 2021-03-11 15:58 UTC (History)
CC: 14 users

Fixed In Version: RHEL: ceph-10.2.10-9.el7cp Ubuntu: ceph_10.2.10-6redhat1xenial
Doc Type: Bug Fix
Doc Text:
.Stale bucket index entries are no longer left over after object deletions
Previously, under certain circumstances, deleted objects were incorrectly interpreted as incomplete delete transactions because of an incorrect time comparison. As a consequence, the delete operations were reported as successful in the Ceph Object Gateway logs, but the deleted objects were not correctly removed from the bucket indexes. The incorrect time comparison has been fixed, and deleting objects now works correctly.
Clone Of:
: 1530784 (view as bug list)
Environment:
Last Closed: 2018-02-21 19:44:55 UTC
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Ceph Project Bug Tracker 20380 0 None None None 2017-10-11 17:21:16 UTC
Ceph Project Bug Tracker 20895 0 None None None 2017-11-02 15:37:26 UTC
Ceph Project Bug Tracker 22555 0 None None None 2018-01-03 19:00:33 UTC
Red Hat Product Errata RHBA-2018:0340 0 normal SHIPPED_LIVE Red Hat Ceph Storage 2.5 bug fix and enhancement update 2018-02-22 00:50:32 UTC

Description Benjamin Schmaus 2017-10-11 17:21:17 UTC
Description of problem:

Objects are deleted, but the bucket index still lists them.  This issue was thought to have been resolved in BZ https://bugzilla.redhat.com/show_bug.cgi?id=1464099


Version-Release number of selected component (if applicable):


How reproducible:

See http://tracker.ceph.com/issues/20380

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:
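For reference, a minimal way to observe the symptom (bucket name, object name, and credentials below are illustrative, not taken from the customer environment): the bucket index still lists an object that no longer exists, so the index listing and a direct stat disagree.

# The bucket index listing still shows the deleted object:
radosgw-admin bucket list --bucket=test | grep obj1

# ...but the object itself is gone (swift reports 404 Not Found):
swift -A http://localhost:8000/auth -U test:tester -K testing stat test obj1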

Comment 3 J. Eric Ivancich 2017-10-23 17:44:10 UTC
Was able to reproduce the error after multiple runs (the condition appears in roughly 1 out of 20 runs) of the following set of scripts (modified versions of those supplied by the customer) on a cluster with three RADOS gateways. Currently combing through the OSD logs to see if there are any clues as to the underlying problem.

==== delete_create_script.sh

#!/bin/sh

# Kill any background workers if the script is interrupted.
trap 'kill $(jobs -p)' SIGINT SIGTERM # EXIT

# Create the "test" container via the gateway on port 8000.
swift -A http://localhost:8000/auth -U test:tester -K testing post test

dir=$(dirname $0)

${dir}/delete_create_object.sh one &
${dir}/delete_create_object.sh two &
${dir}/delete_create_object.sh three &
${dir}/delete_create_object.sh four &
${dir}/delete_create_object.sh five &
${dir}/delete_create_object.sh six &
${dir}/delete_create_object.sh seven &
${dir}/delete_create_object.sh eight &
${dir}/delete_create_object.sh nine &
${dir}/delete_create_object.sh ten &

wait

echo Done all

==== delete_create_object.sh

#!/bin/bash
# bash rather than plain sh: the script relies on $RANDOM below.

# The three rados gateways listen on ports 8000-8002.
port_lo=8000
port_hi=8002

# Kill any backgrounded deletes if the script is interrupted.
trap 'kill $(jobs -p)' SIGINT SIGTERM # EXIT

i=1
objects=100
prefix=$1

# Each worker appends its bucket listings to <prefix>.list.
list_out=$1.list
echo $(date) > $list_out

while [ $i -lt $objects ]
do
    object=$prefix.`date +%Y-%m-%d:%H:%M:%S`.$i
    touch $object

    # Pick a (possibly different) gateway for each of the three operations.
    port1=$(( RANDOM % (port_hi - port_lo + 1 ) + port_lo ))
    port2=$(( RANDOM % (port_hi - port_lo + 1 ) + port_lo ))
    port3=$(( RANDOM % (port_hi - port_lo + 1 ) + port_lo ))

    # Upload synchronously, then delete in the background while listing the
    # bucket, so deletes race with listings and later uploads across gateways.
    swift -A http://localhost:${port1}/auth -U test:tester -K testing upload test $object >/dev/null
    swift -A http://localhost:${port2}/auth -U test:tester -K testing delete test $object >/dev/null &
    swift -A http://localhost:${port3}/auth -U test:tester -K testing list test >>$list_out

    i=`expr $i + 1`
    rm -f $object &
done

wait

echo Done $1

====
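After a run, one way to look for stale entries (a sketch, not part of the original reproducer; the bucket name "test" matches the scripts above) is to compare what the bucket index reports against the actual objects and run an index consistency check:

# What the bucket index thinks is in the bucket:
radosgw-admin bucket list --bucket=test

# Report index inconsistencies for the bucket (add --fix to repair them):
radosgw-admin bucket check --bucket=test --check-objects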

Comment 4 Vikhyat Umrao 2017-11-02 15:23:08 UTC
*** Bug 1496568 has been marked as a duplicate of this bug. ***

Comment 5 Vikhyat Umrao 2017-11-02 15:28:33 UTC
upstream jewel backport: https://github.com/ceph/ceph/pull/16856

Comment 6 Vikhyat Umrao 2017-11-02 15:33:49 UTC
git tag --contains ff67388e24c93ca16553839c16f51030fa322917
v10.2.10
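On an installed system, a rough downstream equivalent (assuming the RHEL packaging; a changelog entry referencing this rhbz may or may not be present) is to compare the package NVR against the Fixed In Version and grep the package changelog:

rpm -q ceph-radosgw
rpm -q --changelog ceph-radosgw | grep -i 1500904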

Comment 9 J. Eric Ivancich 2017-11-03 17:08:03 UTC
I believe I have this fixed. Under local testing I was generally able to create this faulty condition in about 30 minutes. My local testing with the fix has now run for over 3 hours and has not generated the faulty condition.

The proposed bug fix is currently a DNM PR: https://github.com/ceph/ceph/pull/18709

The PR is currently being built by the ceph-ci tooling, and when that's done I will run it through the RGW suite to look for any regressions.

Comment 15 J. Eric Ivancich 2017-11-06 14:48:44 UTC
The fix on master has been running for close to 72 hours without recreating the error state. As mentioned previously I had been able to generate the error state in about 30 minutes. So this fix does seem to address the error reported.

Regression testing found no regressions. See the PR (https://github.com/ceph/ceph/pull/18709) for details.

Comment 16 Vikhyat Umrao 2017-11-06 14:51:42 UTC
(In reply to Eric Ivancich from comment #15)
> The fix on master has been running for close to 72 hours without recreating
> the error state. As mentioned previously I had been able to generate the
> error state in about 30 minutes. So this fix does seem to address the error
> reported.
> 
> Regression testing found no regressions. See the PR
> (https://github.com/ceph/ceph/pull/18709) for details.

Thank you Eric. Awesome work!

Matt and Eric - I think we need to pull this patch (https://github.com/ceph/ceph/pull/18709) into 3.0 (Luminous) as well. Do we have a bug for it, or do you want me to open one?

Comment 22 Ken Dreyer (Red Hat) 2018-01-03 19:04:53 UTC
This bug is targeted for RHCEPH 2.5 and this fix is not in RHCEPH 3.

Would you please cherry-pick the change to ceph-3.0-rhel-patches (with the RHCEPH 3 clone ID number, "Resolves: rhbz#1530784") so customers do not experience a regression?
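A sketch of the requested backport flow (the commit SHA is a placeholder for the upstream fix commit; -x records the original commit ID in the message):

git checkout ceph-3.0-rhel-patches
git cherry-pick -x <upstream-commit-sha>
# Amend the commit message to add the downstream reference:
#   Resolves: rhbz#1530784
git commit --amend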

Comment 27 Vidushi Mishra 2018-01-17 05:47:21 UTC
Hi all,

Using the scripts mentioned in comment#3, I created and deleted 10k objects.
I did not see the issue, so I am moving the BZ to VERIFIED for ceph version 10.2.10-11.el7cp.

Thanks,
Vidushi

Comment 32 errata-xmlrpc 2018-02-21 19:44:55 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:0340

