Bug 1500904

Summary: Stale bucket index entries are left over after object deletions
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: Benjamin Schmaus <bschmaus>
Component: RGW
Assignee: Matt Benjamin (redhat) <mbenjamin>
Status: CLOSED ERRATA
QA Contact: Vidushi Mishra <vimishra>
Severity: high
Docs Contact: Aron Gunn <agunn>
Priority: unspecified
Version: 2.3
CC: agunn, anharris, cbodley, ceph-eng-bugs, hnallurv, ivancich, kbader, kdreyer, mbenjamin, mhackett, owasserm, sweil, tserlin, vumrao
Target Milestone: rc
Target Release: 2.5
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version: RHEL: ceph-10.2.10-9.el7cp; Ubuntu: ceph_10.2.10-6redhat1xenial
Doc Type: Bug Fix
Doc Text:
.Stale bucket index entries are no longer left over after object deletions
Previously, under certain circumstances, deleted objects were incorrectly interpreted as incomplete delete transactions because of an incorrect time comparison. As a consequence, the delete operations were reported as successful in the Ceph Object Gateway logs, but the deleted objects were not removed from the bucket indexes. The incorrect time comparison has been fixed, and deleting objects now works correctly.
Story Points: ---
Clone Of:
Cloned to: 1530784 (view as bug list)
Environment:
Last Closed: 2018-02-21 19:44:55 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1530784    
Bug Blocks: 1473188, 1491723, 1536401    

Description Benjamin Schmaus 2017-10-11 17:21:17 UTC
Description of problem:

Objects are deleted, but the bucket index still lists them. This issue was thought to have been resolved in BZ https://bugzilla.redhat.com/show_bug.cgi?id=1464099
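
In practice the mismatch shows up as names in the bucket index listing whose objects no longer exist. A minimal way to spot it (a sketch only; the Swift credentials and the container/object names are placeholders in the style of the reproducer scripts later in this bug):

# Listing is served from the bucket index, so stale entries still show up here:
swift -A http://localhost:8000/auth -U test:tester -K testing list test

# Stat a listed name directly; for a stale entry the object itself is gone
# even though the index still lists it:
radosgw-admin object stat --bucket=test --object=<listed-object-name>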


Version-Release number of selected component (if applicable):


How reproducible:

See http://tracker.ceph.com/issues/20380

Steps to Reproduce:
(See the upstream tracker issue above and the reproducer scripts in comment 3.)

Actual results:

Deleted objects still appear in the bucket index after the delete operation is reported successful.

Expected results:

Deleted objects are removed from the bucket index.

Additional info:

Comment 3 J. Eric Ivancich 2017-10-23 17:44:10 UTC
Was able to reproduce the error after multiple runs (the condition appears in roughly 1 out of 20 runs) of the following set of scripts (modified versions of those supplied by the customer) on a cluster with three rados gateways; a short usage sketch follows the scripts. Currently combing through the OSD logs to see if there are any clues as to the underlying problem.

==== delete_create_script.sh

#!/bin/sh

# Kill any background workers if the driver is interrupted.
trap 'kill $(jobs -p)' SIGINT SIGTERM # EXIT

# Create the "test" container (harmless if it already exists).
swift -A http://localhost:8000/auth -U test:tester -K testing post test

dir=$(dirname $0)

# Launch ten concurrent workers, each uploading/deleting/listing with its own prefix.
${dir}/delete_create_object.sh one &
${dir}/delete_create_object.sh two &
${dir}/delete_create_object.sh three &
${dir}/delete_create_object.sh four &
${dir}/delete_create_object.sh five &
${dir}/delete_create_object.sh six &
${dir}/delete_create_object.sh seven &
${dir}/delete_create_object.sh eight &
${dir}/delete_create_object.sh nine &
${dir}/delete_create_object.sh ten &

wait

echo Done all

==== delete_create_object.sh

#!/bin/bash
# $RANDOM is used below, so run this under bash.

# The three rados gateways listen on ports 8000-8002.
port_lo=8000
port_hi=8002

trap 'kill $(jobs -p)' SIGINT SIGTERM # EXIT

i=1
objects=100
prefix=$1

# Record every container listing so stale entries can be spotted afterwards.
list_out=$1.list
echo $(date) > $list_out

while [ $i -lt $objects ]
do
    # Unique object name: <prefix>.<timestamp>.<counter>
    object=$prefix.`date +%Y-%m-%d:%H:%M:%S`.$i
    touch $object

    # Pick a (possibly different) gateway for each operation so the same
    # bucket index is updated concurrently from multiple RGW instances.
    port1=$(( RANDOM % (port_hi - port_lo + 1 ) + port_lo ))
    port2=$(( RANDOM % (port_hi - port_lo + 1 ) + port_lo ))
    port3=$(( RANDOM % (port_hi - port_lo + 1 ) + port_lo ))

    # Upload synchronously, delete in the background, then list the container.
    swift -A http://localhost:${port1}/auth -U test:tester -K testing upload test $object >/dev/null
    swift -A http://localhost:${port2}/auth -U test:tester -K testing delete test $object >/dev/null &
    swift -A http://localhost:${port3}/auth -U test:tester -K testing list test >>$list_out

    i=`expr $i + 1`
    rm -f $object &
done

wait

echo Done $1

====
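
As a usage sketch (not part of the original scripts): with the three gateways up on ports 8000-8002, run the driver and then list the container. Every object uploaded by the workers is also deleted, so a clean run ends with an empty container; any names still listed afterwards are stale bucket index entries. Credentials and names below mirror the scripts above.

./delete_create_script.sh

# After the run, the "test" container should list nothing; any leftover
# names here are stale bucket index entries:
swift -A http://localhost:8000/auth -U test:tester -K testing list test

# Optionally have RGW check the bucket index as well:
radosgw-admin bucket check --bucket=test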

Comment 4 Vikhyat Umrao 2017-11-02 15:23:08 UTC
*** Bug 1496568 has been marked as a duplicate of this bug. ***

Comment 5 Vikhyat Umrao 2017-11-02 15:28:33 UTC
upstream jewel backport: https://github.com/ceph/ceph/pull/16856

Comment 6 Vikhyat Umrao 2017-11-02 15:33:49 UTC
git tag --contains ff67388e24c93ca16553839c16f51030fa322917
v10.2.10

Comment 9 J. Eric Ivancich 2017-11-03 17:08:03 UTC
I believe I have this fixed. Under local testing I was generally able to create this faulty condition within about 30 minutes. My local testing with the fix has now run for over 3 hours and has not generated the faulty condition.

The proposed bug fix is currently a do-not-merge (DNM) PR: https://github.com/ceph/ceph/pull/18709

The PR is currently being built by the ceph-ci tooling, and when that's done I will run it through the RGW suite to look for any regressions.

Comment 15 J. Eric Ivancich 2017-11-06 14:48:44 UTC
The fix on master has been running for close to 72 hours without recreating the error state. As mentioned previously I had been able to generate the error state in about 30 minutes. So this fix does seem to address the error reported.

Regression testing found no regressions. See the PR (https://github.com/ceph/ceph/pull/18709) for details.

Comment 16 Vikhyat Umrao 2017-11-06 14:51:42 UTC
(In reply to Eric Ivancich from comment #15)
> The fix on master has been running for close to 72 hours without recreating
> the error state. As mentioned previously I had been able to generate the
> error state in about 30 minutes. So this fix does seem to address the error
> reported.
> 
> Regression testing found no regressions. See the PR
> (https://github.com/ceph/ceph/pull/18709) for details.

Thank you Eric. Awesome work!

Matt and Eric - I think we also need to pull this patch - https://github.com/ceph/ceph/pull/18709 - into 3.0 (luminous). Do we have a bug for that, or do you want me to open one?

Comment 22 Ken Dreyer (Red Hat) 2018-01-03 19:04:53 UTC
This bug is targeted for RHCEPH 2.5 and this fix is not in RHCEPH 3.

Would you please cherry-pick the change to ceph-3.0-rhel-patches (with the RHCEPH 3 clone ID number, "Resolves: rhbz#1530784") so customers do not experience a regression?
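
For reference, the requested backport is roughly the following (a sketch only; <upstream-commit-sha> stands for the upstream fix commit referenced in comments 5 and 6, and the exact downstream workflow may differ):

git checkout ceph-3.0-rhel-patches
git cherry-pick -x <upstream-commit-sha>
# Make sure the resulting commit message carries the RHCEPH 3 clone reference:
#   Resolves: rhbz#1530784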

Comment 27 Vidushi Mishra 2018-01-17 05:47:21 UTC
Hi all,

Using the scripts mentioned in comment #3, we created and deleted 10k objects.
We did not see the issue, so we are moving this BZ to VERIFIED for ceph version 10.2.10-11.el7cp.
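
(For reference, a post-run sanity check along these lines can confirm the index is clean; the container name "test" is the one used by the reproducer scripts.)

radosgw-admin bucket stats --bucket=test   # object count reported from the bucket index
radosgw-admin bucket check --bucket=test   # ask RGW to check the bucket index for inconsistencies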

Thanks,
Vidushi

Comment 32 errata-xmlrpc 2018-02-21 19:44:55 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:0340