Bug 1500904

Summary: Stale bucket index entries are left over after object deletions
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: Benjamin Schmaus <bschmaus>
Component: RGW
Assignee: Matt Benjamin (redhat) <mbenjamin>
Status: CLOSED ERRATA
QA Contact: Vidushi Mishra <vimishra>
Severity: high
Docs Contact: Aron Gunn <agunn>
Priority: unspecified
Version: 2.3
CC: agunn, anharris, cbodley, ceph-eng-bugs, hnallurv, ivancich, kbader, kdreyer, mbenjamin, mhackett, owasserm, sweil, tserlin, vumrao
Target Milestone: rc
Target Release: 2.5
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version: RHEL: ceph-10.2.10-9.el7cp; Ubuntu: ceph_10.2.10-6redhat1xenial
Doc Type: Bug Fix
Doc Text:
.Stale bucket index entries are no longer left over after object deletions
Previously, under certain circumstances, deleted objects were incorrectly interpreted as incomplete delete transactions because of an incorrect time comparison. As a consequence, the delete operations were reported as successful in the Ceph Object Gateway logs, but the deleted objects were not removed from the bucket indexes. The incorrect time comparison has been fixed, and deleting objects now works correctly.
Story Points: ---
Clone Of:
Cloned to: 1530784 (view as bug list)
Environment:
Last Closed: 2018-02-21 19:44:55 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1530784    
Bug Blocks: 1473188, 1491723, 1536401    

Description Benjamin Schmaus 2017-10-11 17:21:17 UTC
Description of problem:

Objects are deleted, but the bucket index still lists them. This issue was thought to have been resolved in BZ https://bugzilla.redhat.com/show_bug.cgi?id=1464099
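
In practice the mismatch shows up as names in the bucket index listing whose objects no longer exist. A minimal way to spot it (a sketch only; the Swift credentials and the container/object names are placeholders in the style of the reproducer scripts later in this bug):

# Listing is served from the bucket index, so stale entries still show up here:
swift -A http://localhost:8000/auth -U test:tester -K testing list test

# Stat a listed name directly; for a stale entry the object itself is gone
# even though the index still lists it:
radosgw-admin object stat --bucket=test --object=<listed-object-name>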


Version-Release number of selected component (if applicable):


How reproducible:

See http://tracker.ceph.com/issues/20380

Steps to Reproduce:
(See the upstream tracker issue above and the reproducer scripts in comment 3.)

Actual results:

Deleted objects still appear in the bucket index after the delete operation is reported successful.

Expected results:

Deleted objects are removed from the bucket index.

Additional info:

Comment 3 J. Eric Ivancich 2017-10-23 17:44:10 UTC
Was able to reproduce the error after multiple runs (the condition appears in roughly 1 out of 20 runs) of the following set of scripts (modified versions of those supplied by the customer) on a cluster with three rados gateways; a short usage sketch follows the scripts. Currently combing through the OSD logs to see if there are any clues as to the underlying problem.

==== delete_create_script.sh

#!/bin/sh

# Kill any background workers if the driver is interrupted.
trap 'kill $(jobs -p)' SIGINT SIGTERM # EXIT

# Create the "test" container (harmless if it already exists).
swift -A http://localhost:8000/auth -U test:tester -K testing post test

dir=$(dirname $0)

# Launch ten concurrent workers, each uploading/deleting/listing with its own prefix.
${dir}/delete_create_object.sh one &
${dir}/delete_create_object.sh two &
${dir}/delete_create_object.sh three &
${dir}/delete_create_object.sh four &
${dir}/delete_create_object.sh five &
${dir}/delete_create_object.sh six &
${dir}/delete_create_object.sh seven &
${dir}/delete_create_object.sh eight &
${dir}/delete_create_object.sh nine &
${dir}/delete_create_object.sh ten &

wait

echo Done all

==== delete_create_object.sh

#!/bin/bash
# $RANDOM is used below, so run this under bash.

# The three rados gateways listen on ports 8000-8002.
port_lo=8000
port_hi=8002

trap 'kill $(jobs -p)' SIGINT SIGTERM # EXIT

i=1
objects=100
prefix=$1

# Record every container listing so stale entries can be spotted afterwards.
list_out=$1.list
echo $(date) > $list_out

while [ $i -lt $objects ]
do
    # Unique object name: <prefix>.<timestamp>.<counter>
    object=$prefix.`date +%Y-%m-%d:%H:%M:%S`.$i
    touch $object

    # Pick a (possibly different) gateway for each operation so the same
    # bucket index is updated concurrently from multiple RGW instances.
    port1=$(( RANDOM % (port_hi - port_lo + 1 ) + port_lo ))
    port2=$(( RANDOM % (port_hi - port_lo + 1 ) + port_lo ))
    port3=$(( RANDOM % (port_hi - port_lo + 1 ) + port_lo ))

    # Upload synchronously, delete in the background, then list the container.
    swift -A http://localhost:${port1}/auth -U test:tester -K testing upload test $object >/dev/null
    swift -A http://localhost:${port2}/auth -U test:tester -K testing delete test $object >/dev/null &
    swift -A http://localhost:${port3}/auth -U test:tester -K testing list test >>$list_out

    i=`expr $i + 1`
    rm -f $object &
done

wait

echo Done $1

====
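
As a usage sketch (not part of the original scripts): with the three gateways up on ports 8000-8002, run the driver and then list the container. Every object uploaded by the workers is also deleted, so a clean run ends with an empty container; any names still listed afterwards are stale bucket index entries. Credentials and names below mirror the scripts above.

./delete_create_script.sh

# After the run, the "test" container should list nothing; any leftover
# names here are stale bucket index entries:
swift -A http://localhost:8000/auth -U test:tester -K testing list test

# Optionally have RGW check the bucket index as well:
radosgw-admin bucket check --bucket=test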

Comment 4 Vikhyat Umrao 2017-11-02 15:23:08 UTC
*** Bug 1496568 has been marked as a duplicate of this bug. ***

Comment 5 Vikhyat Umrao 2017-11-02 15:28:33 UTC
upstream jewel backport: https://github.com/ceph/ceph/pull/16856

Comment 6 Vikhyat Umrao 2017-11-02 15:33:49 UTC
git tag --contains ff67388e24c93ca16553839c16f51030fa322917
v10.2.10

Comment 9 J. Eric Ivancich 2017-11-03 17:08:03 UTC
I believe I have this fixed. Under local testing I was generally able to create this faulty condition within about 30 minutes. My local testing with the fix has now run for over 3 hours and has not generated the faulty condition.

The proposed bug fix is currently a do-not-merge (DNM) PR: https://github.com/ceph/ceph/pull/18709

The PR is currently being built by the ceph-ci tooling, and when that's done I will run it through the RGW suite to look for any regressions.

Comment 15 J. Eric Ivancich 2017-11-06 14:48:44 UTC
The fix on master has been running for close to 72 hours without recreating the error state. As mentioned previously I had been able to generate the error state in about 30 minutes. So this fix does seem to address the error reported.

Regression testing found no regressions. See the PR (https://github.com/ceph/ceph/pull/18709) for details.

Comment 16 Vikhyat Umrao 2017-11-06 14:51:42 UTC
(In reply to Eric Ivancich from comment #15)
> The fix on master has been running for close to 72 hours without recreating
> the error state. As mentioned previously I had been able to generate the
> error state in about 30 minutes. So this fix does seem to address the error
> reported.
> 
> Regression testing found no regressions. See the PR
> (https://github.com/ceph/ceph/pull/18709) for details.

Thank you Eric. Awesome work!

Matt and Eric - I think we also need to pull this patch - https://github.com/ceph/ceph/pull/18709 - into 3.0 (luminous). Do we have a bug for that, or do you want me to open one?

Comment 22 Ken Dreyer (Red Hat) 2018-01-03 19:04:53 UTC
This bug is targeted for RHCEPH 2.5 and this fix is not in RHCEPH 3.

Would you please cherry-pick the change to ceph-3.0-rhel-patches (with the RHCEPH 3 clone ID number, "Resolves: rhbz#1530784") so customers do not experience a regression?
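
For reference, the requested backport is roughly the following (a sketch only; <upstream-commit-sha> stands for the upstream fix commit referenced in comments 5 and 6, and the exact downstream workflow may differ):

git checkout ceph-3.0-rhel-patches
git cherry-pick -x <upstream-commit-sha>
# Make sure the resulting commit message carries the RHCEPH 3 clone reference:
#   Resolves: rhbz#1530784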

Comment 27 Vidushi Mishra 2018-01-17 05:47:21 UTC
Hi all,

Using the scripts mentioned in comment #3, we created and deleted 10k objects.
We did not see the issue, so we are moving this BZ to VERIFIED for ceph version 10.2.10-11.el7cp.
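
(For reference, a post-run sanity check along these lines can confirm the index is clean; the container name "test" is the one used by the reproducer scripts.)

radosgw-admin bucket stats --bucket=test   # object count reported from the bucket index
radosgw-admin bucket check --bucket=test   # ask RGW to check the bucket index for inconsistencies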

Thanks,
Vidushi

Comment 32 errata-xmlrpc 2018-02-21 19:44:55 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:0340