Description of problem:

After upgrading to 3.3z5 / 12.2.12-115, all lifecycle policies stopped processing. When lc attempts to process, "failed to acquire lock on" and "failed to update lc object" messages are reported in the logs. Attempting to list the lifecycle policies also results in Input/output errors.

Version-Release number of selected component (if applicable):
3.3z5 / 12.2.12-115

How reproducible:
Always on the customer site

Steps to Reproduce:
1. Upgrade from 3.3z2 to 3.3z5

Actual results:
- Lifecycle policies stopped processing, resulting in errors.
- Listing lifecycle policies with 'radosgw-admin lc list' results in Input/output errors.

Expected results:
- Lifecycle policies should continue to process post upgrade.
- Listing lifecycle policies with 'radosgw-admin lc list' should return all policies.

Additional info:
- This cluster is using multisite.
- This is affecting all lifecycle policies. Data that should be deleted by the lifecycle policies, per the customer's regulations, is being kept past its lifetime.
The errors below are reported in the rgw logs when lc attempts to process:

2020-07-08 00:00:00.634619 7f3d1c1dc700 0 RGWLC::process() failed to acquire lock on, sleep 5, try againlc.14
2020-07-08 00:00:00.792339 7f3d1e1e0700 0 RGWLC::process() failed to acquire lock on, sleep 5, try againlc.15
2020-07-08 00:00:00.805290 7f3d1a1d8700 0 RGWLC::process() failed to acquire lock on, sleep 5, try againlc.30
2020-07-08 00:00:05.768541 7f3d1c1dc700 0 RGWLC::process() failed to update lc object lc.19-5
2020-07-08 00:00:05.900899 7f3d1a1d8700 0 RGWLC::process() failed to update lc object lc.19-5
2020-07-08 00:00:06.041900 7f3d1e1e0700 0 RGWLC::process() failed to update lc object lc.19-5

When the user attempts to list the lc policies, it results in the following error:

$ radosgw-admin lc list
ERROR: failed to list objs: (5) Input/output error

I had the customer run the lc list with debugging on and will attach the output to the BZ. I have a bucket list and stats as well as the rgw logs, but lc logging at the default level is sparse and I could only find the errors mentioned above relating to the lifecycle. Let me know if you need debug logs from when lc attempts to process, and I will have the customer proceed with that data capture.

Thanks,
- Steve
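For reference, this is the kind of debug capture I would have the customer run for the lc list failure; it is only a sketch, and the debug levels and output file name here are my suggestions rather than something already collected:

$ radosgw-admin lc list --debug-rgw=20 --debug-ms=1 2>&1 | tee /tmp/lc-list-debug.log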
The customer provided some additional data from their staging cluster that is running 3.3z5, which has the same errors plus these bucket_lc_prepare error messages.

======== Case 02698166 : Comment #25 ================
FYI, we get the same error message on our staging cluster when running LC commands. However, the staging cluster has these errors logged:

2020-07-16 00:40:09.956696 7f621caa9700 0 RGWLC::bucket_lc_prepare() failed to set entry lc.17
2020-07-16 00:40:09.965018 7f621aaa5700 0 RGWLC::bucket_lc_prepare() failed to set entry lc.1
2020-07-16 00:40:09.965737 7f621eaad700 0 RGWLC::bucket_lc_prepare() failed to set entry lc.23
2020-07-16 00:40:09.973526 7f621caa9700 0 RGWLC::bucket_lc_prepare() failed to set entry lc.17
2020-07-16 00:40:09.981725 7f621aaa5700 0 RGWLC::bucket_lc_prepare() failed to set entry lc.1

The RGW log pool there has high IOPS (~2000) but is otherwise idle. I'm guessing it's writing lots of error logs or marker info?
=======================================================
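A note on the lc.N names in those errors: they are the lifecycle shard objects RGW keeps in the zone's lc pool, which by default is a namespace of the log pool, so they can be inspected directly. A rough sketch, assuming the default zone layout where the shards live under the 'lc' namespace of default.rgw.log; the actual pool/namespace comes from the zone's lc_pool setting:

$ radosgw-admin zone get | grep lc_pool
$ rados -p default.rgw.log --namespace lc ls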
Hello Casey,

The errors included in my last update (c#6) are from the customer's staging cluster. I asked the customer to verify that all components in the staging cluster were at version 12.2.12-115; I had verified the production versions but not the staging cluster. The customer did in fact have one node still running the older code. After upgrading that node in staging, the "bucket_lc_prepare() failed to set entry" errors are no longer being reported. See the update from the customer below:

======== Case 02698166 : Comment #29 ================
There was 1 server that was running old code due to an ansible SNAFU. After upgrading, the "failed to set entry" message has gone away.

The error -5 is still thrown when running radosgw-admin lc list.
======================================================

Thanks,
- Steve
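For anyone else chasing a mixed-version node like this, a quick double-check is sketched below; 'ceph versions' covers the mon/mgr/osd/mds daemons reported by the cluster, and the per-node package query covers the RGW hosts. The 'rgws' inventory group name is just the usual ceph-ansible convention, so adjust it to match the actual inventory:

$ ceph versions
$ ansible rgws -m command -a 'rpm -q ceph-radosgw'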
Reproduced, and a fix has been pushed for 3.3z6, as discussed on the RHCS-LT call.

Thanks!
Matt
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: Red Hat Ceph Storage 3.3 security and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:3504