Bug 1856100 - [RGW] Lifecycle policies stopped processing after upgrade
Summary: [RGW] Lifecycle policies stopped processing after upgrade
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: RGW
Version: 3.3
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: z6
Target Release: 3.3
Assignee: Matt Benjamin (redhat)
QA Contact: Vidushi Mishra
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-07-12 17:59 UTC by Steve Baldwin
Modified: 2023-12-15 18:26 UTC
CC: 15 users

Fixed In Version: RHEL: ceph-12.2.12-124.el7cp Ubuntu: ceph_12.2.12-111redhat1
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-08-18 18:05:58 UTC
Embargoed:




Links
Ceph Project Bug Tracker 46677 (last updated 2020-07-22 18:41:21 UTC)
Red Hat Issue Tracker RHCEPH-7673 (last updated 2023-10-06 21:08:05 UTC)
Red Hat Product Errata RHSA-2020:3504 (last updated 2020-08-18 18:06:10 UTC)

Description Steve Baldwin 2020-07-12 17:59:56 UTC
Description of problem:
After upgrading to 3.3z5 / 12.2.12-115, all lifecycle policies stopped processing. When lc attempts to process, "failed to acquire lock on" and "failed to update lc object" messages are reported in the logs.
Also, attempting to list the lifecycle policies results in Input/output errors.

Version-Release number of selected component (if applicable):
3.3z5 / 12.2.12-115

How reproducible:
Always on the customer site 

Steps to Reproduce:
1. Upgrade from 3.3z2 to 3.3z5

Actual results:
- Lifecycle policies stopped processing, resulting in errors.
- Listing lifecycle policies with 'radosgw-admin lc list' results in Input/output errors.

Expected results:
- Lifecycle policies should continue to process post-upgrade.
- Listing lifecycle policies with 'radosgw-admin lc list' should return all policies.

Additional info:
- This cluster is using multisite.
- This affects all lifecycle policies. Data that should be deleted by the lifecycle policies per the customer's regulations is being kept past its lifetime.

Comment 1 Steve Baldwin 2020-07-12 18:13:37 UTC
The errors below are reported in the rgw logs when lc attempts to process -

2020-07-08 00:00:00.634619 7f3d1c1dc700  0 RGWLC::process() failed to acquire lock on, sleep 5, try againlc.14
2020-07-08 00:00:00.792339 7f3d1e1e0700  0 RGWLC::process() failed to acquire lock on, sleep 5, try againlc.15
2020-07-08 00:00:00.805290 7f3d1a1d8700  0 RGWLC::process() failed to acquire lock on, sleep 5, try againlc.30
2020-07-08 00:00:05.768541 7f3d1c1dc700  0 RGWLC::process() failed to update lc object lc.19-5
2020-07-08 00:00:05.900899 7f3d1a1d8700  0 RGWLC::process() failed to update lc object lc.19-5
2020-07-08 00:00:06.041900 7f3d1e1e0700  0 RGWLC::process() failed to update lc object lc.19-5
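
For anyone triaging the lock messages above: the lifecycle shards are stored as objects lc.0 through lc.N in the zone's log pool (namespace "lc" with the Luminous defaults), and the rados tool can show whether another radosgw instance is holding, or has leaked, the advisory lock on a shard. The pool name, shard number, and lock name below are assumptions, not values taken from this cluster:

$ rados -p default.rgw.log -N lc lock list lc.19
$ rados -p default.rgw.log -N lc lock info lc.19 lc_process    # "lc_process" is the assumed lock name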

When the user attempts to list the lc policies it results in the following error.

$ radosgw-admin lc list
ERROR: failed to list objs: (5) Input/output error
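
A rough way to check whether the lc shard objects themselves are readable (the pool/namespace are the Luminous defaults and the shard name is just an example; they may differ in this deployment):

$ rados -p default.rgw.log -N lc ls                    # should list lc.0 through lc.31 with the default 32 shards
$ rados -p default.rgw.log -N lc listomapkeys lc.19    # per-bucket entries for one shard (assuming entries are stored in omap)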


I had the customer run the lc list with debugging on and will attach the output to the BZ. I have a bucket list and stats as well as the rgw logs; however, the lc logging at the default level is sparse and I could only find the errors mentioned above relating to the lifecycle.

Let me know if you need debug logs from when lc attempts to process, and I will have the customer proceed with that data capture.
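
If it helps, a rough sketch of that capture (the admin socket path is an example and would need to match the node):

$ ceph daemon /var/run/ceph/ceph-client.rgw.<host>.asok config set debug_rgw 20
$ ceph daemon /var/run/ceph/ceph-client.rgw.<host>.asok config set debug_ms 1
$ radosgw-admin lc process    # optionally trigger a lifecycle pass instead of waiting for the next window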

Thanks,
 - Steve

Comment 6 Steve Baldwin 2020-07-16 06:51:15 UTC
The customer provided some additional data from their staging cluster, which is running 3.3z5 and shows the same errors, plus these bucket_lc_prepare error messages.

======== Case 02698166 : Comment #25  ================

FYI, we get the same error message on our staging cluster when running LC commands. However, the staging cluster has these errors logged:

2020-07-16 00:40:09.956696 7f621caa9700  0 RGWLC::bucket_lc_prepare() failed to set entry lc.17
2020-07-16 00:40:09.965018 7f621aaa5700  0 RGWLC::bucket_lc_prepare() failed to set entry lc.1
2020-07-16 00:40:09.965737 7f621eaad700  0 RGWLC::bucket_lc_prepare() failed to set entry lc.23
2020-07-16 00:40:09.973526 7f621caa9700  0 RGWLC::bucket_lc_prepare() failed to set entry lc.17
2020-07-16 00:40:09.981725 7f621aaa5700  0 RGWLC::bucket_lc_prepare() failed to set entry lc.1

The RGW log pool there has high IOPS (~2000) but is otherwise idle. I'm guessing it's writing lots of error logs or marker info?

=======================================================
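
To pin down what is driving the load on the log pool, the per-pool stats are probably the quickest check (the pool name below is the default and may differ in this environment):

$ ceph osd pool stats default.rgw.log    # client read/write ops per second for just this pool
$ ceph df detail                         # object counts and usage per pool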

Comment 7 Steve Baldwin 2020-07-17 18:05:54 UTC
Hello Casey,

The errors included in my last update (c#6) are from the customer's staging cluster. I asked the customer to verify ceph versions and confirm that all components in the staging cluster were at 12.2.12-115.
I had verified the production versions but not the staging cluster. The customer did in fact have one node running the older code and has since upgraded that node in staging; the "bucket_lc_prepare() failed to set entry" errors are no longer being reported. See the update from the customer below:

======== Case 02698166 : Comment #29  ================ 

There was 1 server that was running old code due to an ansible SNAFU. After upgrading, the "failed to set entry" message has gone away.  The error -5 is still thrown when running radosgw-admin lc list.

======================================================
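
For reference, a quick way to confirm that every node and daemon is on the same build (package names assume the RHEL packaging):

$ ceph versions                       # cluster-wide daemon version summary
$ rpm -q ceph-radosgw ceph-common     # run on each RGW node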

Thanks,
 - Steve

Comment 12 Matt Benjamin (redhat) 2020-07-22 18:14:35 UTC
Reproduced and fix pushed for 3.3z6, as discussed in RHCS-LT call.

thanks!

Matt

Comment 33 errata-xmlrpc 2020-08-18 18:05:58 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat Ceph Storage 3.3 security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:3504

