Bug 1500904 - Stale bucket index entries are left over after object deletions
Summary: Stale bucket index entries are left over after object deletions
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: RGW
Version: 2.3
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: rc
Target Release: 2.5
Assignee: Matt Benjamin (redhat)
QA Contact: Vidushi Mishra
Docs Contact: Aron Gunn
URL:
Whiteboard:
Duplicates: 1496568 (view as bug list)
Depends On: 1530784
Blocks: 1473188 1491723 1536401
 
Reported: 2017-10-11 17:21 UTC by Benjamin Schmaus
Modified: 2021-03-11 15:58 UTC (History)
CC: 14 users

Fixed In Version: RHEL: ceph-10.2.10-9.el7cp Ubuntu: ceph_10.2.10-6redhat1xenial
Doc Type: Bug Fix
Doc Text:
.Stale bucket index entries are no longer left over after object deletions
Previously, under certain circumstances, deleted objects were incorrectly interpreted as incomplete delete transactions because of an incorrect time comparison. As a consequence, the delete operations were reported as successful in the Ceph Object Gateway logs, but the deleted objects were not correctly removed from the bucket indexes. The incorrect time comparison has been fixed, and deleting objects now works correctly.
Clone Of:
: 1530784 (view as bug list)
Environment:
Last Closed: 2018-02-21 19:44:55 UTC
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Ceph Project Bug Tracker 20380 0 None None None 2017-10-11 17:21:16 UTC
Ceph Project Bug Tracker 20895 0 None None None 2017-11-02 15:37:26 UTC
Ceph Project Bug Tracker 22555 0 None None None 2018-01-03 19:00:33 UTC
Red Hat Product Errata RHBA-2018:0340 0 normal SHIPPED_LIVE Red Hat Ceph Storage 2.5 bug fix and enhancement update 2018-02-22 00:50:32 UTC

Description Benjamin Schmaus 2017-10-11 17:21:17 UTC
Description of problem:

Objects are deleted, but the bucket index still lists them.  This issue was thought to have been resolved in BZ https://bugzilla.redhat.com/show_bug.cgi?id=1464099


Version-Release number of selected component (if applicable):


How reproducible:

See http://tracker.ceph.com/issues/20380

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:
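For reference, a minimal way to observe the symptom (bucket name, object name, and credentials below are illustrative, not taken from the customer environment): the bucket index still lists an object that no longer exists, so the index listing and a direct stat disagree.

# The bucket index listing still shows the deleted object:
radosgw-admin bucket list --bucket=test | grep obj1

# ...but the object itself is gone (swift reports 404 Not Found):
swift -A http://localhost:8000/auth -U test:tester -K testing stat test obj1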

Comment 3 J. Eric Ivancich 2017-10-23 17:44:10 UTC
Was able to reproduce the error after multiple runs (the condition appears in roughly 1 out of 20 runs) of the following set of scripts (modified versions of those supplied by the customer) on a cluster with three RADOS gateways. Currently combing through the OSD logs to see if there are any clues as to the underlying problem.

==== delete_create_script.sh

#!/bin/sh

# Kill any background workers if the script is interrupted.
trap 'kill $(jobs -p)' SIGINT SIGTERM # EXIT

# Create the "test" container via the gateway on port 8000.
swift -A http://localhost:8000/auth -U test:tester -K testing post test

dir=$(dirname $0)

${dir}/delete_create_object.sh one &
${dir}/delete_create_object.sh two &
${dir}/delete_create_object.sh three &
${dir}/delete_create_object.sh four &
${dir}/delete_create_object.sh five &
${dir}/delete_create_object.sh six &
${dir}/delete_create_object.sh seven &
${dir}/delete_create_object.sh eight &
${dir}/delete_create_object.sh nine &
${dir}/delete_create_object.sh ten &

wait

echo Done all

==== delete_create_object.sh

#!/bin/bash
# bash rather than plain sh: the script relies on $RANDOM below.

# The three rados gateways listen on ports 8000-8002.
port_lo=8000
port_hi=8002

# Kill any backgrounded deletes if the script is interrupted.
trap 'kill $(jobs -p)' SIGINT SIGTERM # EXIT

i=1
objects=100
prefix=$1

# Each worker appends its bucket listings to <prefix>.list.
list_out=$1.list
echo $(date) > $list_out

while [ $i -lt $objects ]
do
    object=$prefix.`date +%Y-%m-%d:%H:%M:%S`.$i
    touch $object

    # Pick a (possibly different) gateway for each of the three operations.
    port1=$(( RANDOM % (port_hi - port_lo + 1 ) + port_lo ))
    port2=$(( RANDOM % (port_hi - port_lo + 1 ) + port_lo ))
    port3=$(( RANDOM % (port_hi - port_lo + 1 ) + port_lo ))

    # Upload synchronously, then delete in the background while listing the
    # bucket, so deletes race with listings and later uploads across gateways.
    swift -A http://localhost:${port1}/auth -U test:tester -K testing upload test $object >/dev/null
    swift -A http://localhost:${port2}/auth -U test:tester -K testing delete test $object >/dev/null &
    swift -A http://localhost:${port3}/auth -U test:tester -K testing list test >>$list_out

    i=`expr $i + 1`
    rm -f $object &
done

wait

echo Done $1

====
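After a run, one way to look for stale entries (a sketch, not part of the original reproducer; the bucket name "test" matches the scripts above) is to compare what the bucket index reports against the actual objects and run an index consistency check:

# What the bucket index thinks is in the bucket:
radosgw-admin bucket list --bucket=test

# Report index inconsistencies for the bucket (add --fix to repair them):
radosgw-admin bucket check --bucket=test --check-objects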

Comment 4 Vikhyat Umrao 2017-11-02 15:23:08 UTC
*** Bug 1496568 has been marked as a duplicate of this bug. ***

Comment 5 Vikhyat Umrao 2017-11-02 15:28:33 UTC
upstream jewel backport: https://github.com/ceph/ceph/pull/16856

Comment 6 Vikhyat Umrao 2017-11-02 15:33:49 UTC
git tag --contains ff67388e24c93ca16553839c16f51030fa322917
v10.2.10
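On an installed system, a rough downstream equivalent (assuming the RHEL packaging; a changelog entry referencing this rhbz may or may not be present) is to compare the package NVR against the Fixed In Version and grep the package changelog:

rpm -q ceph-radosgw
rpm -q --changelog ceph-radosgw | grep -i 1500904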

Comment 9 J. Eric Ivancich 2017-11-03 17:08:03 UTC
I believe I have this fixed. Under local testing I was generally able to create this faulty condition in about 30 minutes. My local testing with the fix has now run for over 3 hours and has not generated the faulty condition.

The proposed bug fix is currently a DNM PR: https://github.com/ceph/ceph/pull/18709

The PR is currently being built by the ceph-ci tooling, and when that's done I will run it through the RGW suite to look for any regressions.

Comment 15 J. Eric Ivancich 2017-11-06 14:48:44 UTC
The fix on master has been running for close to 72 hours without recreating the error state. As mentioned previously I had been able to generate the error state in about 30 minutes. So this fix does seem to address the error reported.

Regression testing found no regressions. See the PR (https://github.com/ceph/ceph/pull/18709) for details.

Comment 16 Vikhyat Umrao 2017-11-06 14:51:42 UTC
(In reply to Eric Ivancich from comment #15)
> The fix on master has been running for close to 72 hours without recreating
> the error state. As mentioned previously I had been able to generate the
> error state in about 30 minutes. So this fix does seem to address the error
> reported.
> 
> Regression testing found no regressions. See the PR
> (https://github.com/ceph/ceph/pull/18709) for details.

Thank you Eric. Awesome work!

Matt and Eric - I think we need to pull this patch (https://github.com/ceph/ceph/pull/18709) into 3.0 (Luminous) as well. Do we have a bug for it, or do you want me to open one?

Comment 22 Ken Dreyer (Red Hat) 2018-01-03 19:04:53 UTC
This bug is targeted for RHCEPH 2.5 and this fix is not in RHCEPH 3.

Would you please cherry-pick the change to ceph-3.0-rhel-patches (with the RHCEPH 3 clone ID number, "Resolves: rhbz#1530784") so customers do not experience a regression?
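A sketch of the requested backport flow (the commit SHA is a placeholder for the upstream fix commit; -x records the original commit ID in the message):

git checkout ceph-3.0-rhel-patches
git cherry-pick -x <upstream-commit-sha>
# Amend the commit message to add the downstream reference:
#   Resolves: rhbz#1530784
git commit --amend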

Comment 27 Vidushi Mishra 2018-01-17 05:47:21 UTC
Hi all,

Using the scripts mentioned in comment#3, I created and deleted 10k objects.
I did not see the issue, so I am moving the BZ to VERIFIED for ceph version 10.2.10-11.el7cp.

Thanks,
Vidushi

Comment 32 errata-xmlrpc 2018-02-21 19:44:55 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:0340

