+++ This bug was initially created as a clone of Bug #2214981 +++

Description of problem:

Customer is observing frequent "Connection timed out" errors in the RGW logs, and because of this, jobs are getting cancelled and users have to rerun them. There is no pattern to the timeouts. Per the customer, this happens when they are copying huge amounts of data via RGW.

~~~
2023-05-30T06:02:26.098+0000 7f8cbf8b5700 1 req 2251681747474232490 10.014173508s int RGWSI_Notify::robust_notify(const DoutPrefixProvider*, RGWSI_RADOS::Obj&, const RGWCacheNotifyInfo&, optional_yield):402 Notify failed on object xx.rgw.meta:users.uid:xxx: (110) Connection timed out
2023-05-30T06:02:26.098+0000 7f8cbf8b5700 1 req 2251681747474232490 10.014173508s int RGWSI_Notify::robust_notify(const DoutPrefixProvider*, RGWSI_RADOS::Obj&, const RGWCacheNotifyInfo&, optional_yield):418 Invalidating obj=xx.rgw.meta:users.uid:xxx tries=0
~~~

Version-Release number of selected component (if applicable):
RHCS 5.3z3 (16.2.10-160.el8cp)
RHEL 8.4

How reproducible:
Customer environment specific

Steps to Reproduce:
NA

Actual results:
"Connection timed out" errors are occurring

Expected results:
There should be no "Connection timed out" messages

Additional info:
Customer is using the Hadoop credential provider for OIDC+STS authentication

--- Additional comment from Tridibesh Chakraborty on 2023-06-14 10:41:59 UTC ---

In the RGW debug logs, I am seeing the below error messages:

~~~
2023-05-30T06:02:26.098+0000 7f8cbf8b5700 1 req 2251681747474232490 10.014173508s int RGWSI_Notify::robust_notify(const DoutPrefixProvider*, RGWSI_RADOS::Obj&, const RGWCacheNotifyInfo&, optional_yield):402 Notify failed on object wdc-prod.rgw.meta:users.uid:prod-cep-admin: (110) Connection timed out
2023-05-30T06:02:26.098+0000 7f8cbf8b5700 1 req 2251681747474232490 10.014173508s int RGWSI_Notify::robust_notify(const DoutPrefixProvider*, RGWSI_RADOS::Obj&, const RGWCacheNotifyInfo&, optional_yield):418 Invalidating obj=wdc-prod.rgw.meta:users.uid:prod-cep-admin tries=0
2023-05-30T06:02:36.099+0000 7f8c2277b700 1 req 2251681747474232490 20.015346527s int RGWSI_Notify::robust_notify(const DoutPrefixProvider*, RGWSI_RADOS::Obj&, const RGWCacheNotifyInfo&, optional_yield):423 invalidation attempt 0 failed: (110) Connection timed out
2023-05-30T06:02:36.099+0000 7f8c2277b700 1 req 2251681747474232490 20.015346527s int RGWSI_Notify::robust_notify(const DoutPrefixProvider*, RGWSI_RADOS::Obj&, const RGWCacheNotifyInfo&, optional_yield):418 Invalidating obj=wdc-prod.rgw.meta:users.uid:prod-cep-admin tries=1
2023-05-30T06:02:46.100+0000 7f8c2277b700 1 req 2251681747474232490 30.016517639s int RGWSI_Notify::robust_notify(const DoutPrefixProvider*, RGWSI_RADOS::Obj&, const RGWCacheNotifyInfo&, optional_yield):423 invalidation attempt 1 failed: (110) Connection timed out
2023-05-30T06:02:46.101+0000 7f8c2277b700 1 req 2251681747474232490 30.017517090s int RGWSI_Notify::robust_notify(const DoutPrefixProvider*, RGWSI_RADOS::Obj&, const RGWCacheNotifyInfo&, optional_yield):418 Invalidating obj=wdc-prod.rgw.meta:users.uid:prod-cep-admin tries=2
2023-05-30T06:02:56.102+0000 7f8c56fe4700 1 req 2251681747474232490 40.018688202s int RGWSI_Notify::robust_notify(const DoutPrefixProvider*, RGWSI_RADOS::Obj&, const RGWCacheNotifyInfo&, optional_yield):423 invalidation attempt 2 failed: (110) Connection timed out
2023-05-30T06:02:56.102+0000 7f8c56fe4700 1 req 2251681747474232490 40.018688202s int RGWSI_Notify::robust_notify(const DoutPrefixProvider*, RGWSI_RADOS::Obj&, const RGWCacheNotifyInfo&, optional_yield):418 Invalidating obj=wdc-prod.rgw.meta:users.uid:prod-cep-admin tries=3
~~~

Also, I found a BZ (BZ#2002220) where I can see a similar "RGWSI_Notify::robust_notify(const DoutPrefixProvider*, RGWSI_RADOS::Obj&, const RGWCacheNotifyInfo&, optional_yield):418 Invalidating obj" message, which was fixed in RHCS 5.0z1. Also, I have found an upstream issue tracker.
Not sure whether it is related to the same issue or not: https://tracker.ceph.com/issues/54397

I have asked the customer whether the user prod-cep-admin is present in RGW, as the logs are complaining about this user.

Customer: Tesco Plc
Account#: 1382945
SFDC case#: 03510446

Sosreport and related debug logs are downloaded into supportshell under the below location:
/cases/03510446

Customer has shared a sosreport from one of the RGW nodes (0070-sosreport-whds0046-03510446-2023-05-31-shoajtd.tar.xz). This sosreport contains the RGW debug logs under the below location:
/cases/03510446/sosreport-20230531-131308/whds0046.global.tesco.org/var/log/ceph/4e4ba66d-f744-4b46-be8b-5d557ad551ab

~~~
[trchakra@supportshell-1 4e4ba66d-f744-4b46-be8b-5d557ad551ab]$ ll | grep rgw
-rw-rw-rw-. 1 yank yank 1324842546 May 31 13:15 ceph-client.rgw.whds0046.global-prod.wdc-prod.8080.whds0046.lhjojv.log
-rw-rw-rw-. 1 yank yank 148024705 Mar 30 23:01 ceph-client.rgw.whds0046.global-prod.wdc-prod.8080.whds0046.lhjojv.log-20230331.gz
-rw-rw-rw-. 1 yank yank 460102866 Mar 31 23:01 ceph-client.rgw.whds0046.global-prod.wdc-prod.8080.whds0046.lhjojv.log-20230401.gz
-rw-rw-rw-. 1 yank yank 204506082 Apr 1 23:01 ceph-client.rgw.whds0046.global-prod.wdc-prod.8080.whds0046.lhjojv.log-20230402.gz
-rw-rw-rw-. 1 yank yank 333590708 Apr 2 23:01 ceph-client.rgw.whds0046.global-prod.wdc-prod.8080.whds0046.lhjojv.log-20230403.gz
-rw-rw-rw-. 1 yank yank 326798384 Apr 3 23:01 ceph-client.rgw.whds0046.global-prod.wdc-prod.8080.whds0046.lhjojv.log-20230404.gz
-rw-rw-rw-. 1 yank yank 438984797 Apr 4 23:01 ceph-client.rgw.whds0046.global-prod.wdc-prod.8080.whds0046.lhjojv.log-20230405.gz
-rw-rw-rw-. 1 yank yank 62165386 Apr 28 23:01 ceph-client.rgw.whds0046.global-prod.wdc-prod.8080.whds0046.lhjojv.log-20230429.gz
-rw-rw-rw-. 1 yank yank 29442276 May 24 23:01 ceph-client.rgw.whds0046.global-prod.wdc-prod.8080.whds0046.lhjojv.log-20230525-0001.gz
-rw-rw-rw-. 1 yank yank 70509279 May 25 23:01 ceph-client.rgw.whds0046.global-prod.wdc-prod.8080.whds0046.lhjojv.log-20230526-0001.gz
-rw-rw-rw-. 1 yank yank 87535180 May 26 23:01 ceph-client.rgw.whds0046.global-prod.wdc-prod.8080.whds0046.lhjojv.log-20230527-0001.gz
-rw-rw-rw-. 1 yank yank 48863228 May 27 23:00 ceph-client.rgw.whds0046.global-prod.wdc-prod.8080.whds0046.lhjojv.log-20230528-0001.gz
-rw-rw-rw-. 1 yank yank 95360887 May 28 23:01 ceph-client.rgw.whds0046.global-prod.wdc-prod.8080.whds0046.lhjojv.log-20230529-0001.gz
-rw-rw-rw-. 1 yank yank 137448007 May 29 23:01 ceph-client.rgw.whds0046.global-prod.wdc-prod.8080.whds0046.lhjojv.log-20230530-0001.gz
-rw-rw-rw-. 1 yank yank 155697581 May 30 23:01 ceph-client.rgw.whds0046.global-prod.wdc-prod.8080.whds0046.lhjojv.log-20230531-0001.gz
-rw-rw-rw-. 1 yank yank 948619563 May 31 13:15 rgwops.log
-rw-rw-rw-. 1 yank yank 89684444 Mar 30 23:01 rgwops.log-20230331.gz
-rw-rw-rw-. 1 yank yank 103510414 Mar 31 23:01 rgwops.log-20230401.gz
-rw-rw-rw-. 1 yank yank 69866360 Apr 1 23:01 rgwops.log-20230402.gz
-rw-rw-rw-. 1 yank yank 77197450 Apr 2 23:00 rgwops.log-20230403.gz
-rw-rw-rw-. 1 yank yank 103942227 Apr 3 23:01 rgwops.log-20230404.gz
-rw-rw-rw-. 1 yank yank 316809268 Apr 4 23:01 rgwops.log-20230405.gz
-rw-rw-rw-. 1 yank yank 50589095 Apr 28 23:01 rgwops.log-20230429.gz
-rw-rw-rw-. 1 yank yank 21411080 May 24 23:01 rgwops.log-20230525-0001.gz
-rw-rw-rw-. 1 yank yank 27623030 May 25 23:01 rgwops.log-20230526-0001.gz
-rw-rw-rw-. 1 yank yank 29317867 May 26 23:01 rgwops.log-20230527-0001.gz
-rw-rw-rw-. 1 yank yank 20790332 May 27 23:00 rgwops.log-20230528-0001.gz
-rw-rw-rw-. 1 yank yank 46030965 May 28 23:01 rgwops.log-20230529-0001.gz
-rw-rw-rw-. 1 yank yank 51457895 May 29 23:01 rgwops.log-20230530-0001.gz
-rw-rw-rw-. 1 yank yank 54903543 May 30 23:01 rgwops.log-20230531-0001.gz
~~~

--- Additional comment from Tridibesh Chakraborty on 2023-06-23 09:40:31 UTC ---

Hello Matt,

Can you please let me know whether we are hitting any bug here, or whether it is a misconfiguration?

Thanks,
Tridibesh

--- Additional comment from Tridibesh Chakraborty on 2023-06-28 07:20:58 UTC ---

Trying to find anything meaningful myself. One of the timed out messages:

~~~
2023-05-12T08:11:54.815+0000 7f8c4c7cf700 1 req 6699870219808022696 10.011156082s int RGWSI_Notify::robust_notify(const DoutPrefixProvider*, RGWSI_RADOS::Obj&, const RGWCacheNotifyInfo&, optional_yield):402 Notify failed on object wdc-prod.rgw.meta:users.uid:prod-cep-admin: (110) Connection timed out
~~~

CPU usage is normal during that time:

~~~
06:40:04 AM all 1.45 0.00 1.49 0.01 0.00 97.03
06:50:04 AM all 1.69 0.00 1.75 0.01 0.00 96.54
07:00:02 AM all 1.66 0.01 1.51 0.01 0.00 96.81
07:10:04 AM all 2.10 0.00 2.31 0.02 0.00 95.58
07:20:04 AM all 2.24 0.00 2.12 0.04 0.00 95.60
07:30:02 AM all 1.68 0.01 1.64 0.02 0.00 96.66
07:40:04 AM all 1.86 0.00 1.93 0.02 0.00 96.19
07:50:04 AM all 1.89 0.00 2.00 0.02 0.00 96.08
08:00:03 AM all 1.64 0.01 1.66 0.02 0.00 96.67
08:10:04 AM all 2.07 0.00 2.07 0.18 0.00 95.68
08:20:04 AM all 1.74 0.00 1.86 0.09 0.00 96.31
08:30:02 AM all 1.45 0.01 1.67 0.02 0.00 96.86
08:40:03 AM all 3.26 0.00 3.52 0.32 0.00 92.90
~~~

Memory usage is on the higher side:

~~~
06:20:04 AM 9153884 184415336 254354908 96.53 181203776 5420280 85141192 32.31 117878496 127612920 3924
06:30:03 AM 8975448 184490308 254533344 96.59 181400932 5476608 85092168 32.29 118126084 127525156 2188
06:40:04 AM 9331848 184462592 254176944 96.46 180998176 5497024 85095868 32.29 117642556 127680800 1340
06:50:04 AM 9861768 184448104 253647024 96.26 180409832 5542356 84967680 32.24 116695508 128087592 2148
07:00:02 AM 9328664 184474632 254180128 96.46 180908800 5603180 85141456 32.31 117443168 127882680 3000
07:10:04 AM 9569740 184350308 253939052 96.37 180483080 5664924 85117880 32.30 116538396 128546256 1752
07:20:04 AM 8040520 184269568 255468272 96.95 181870664 5725772 85048860 32.28 117987868 128618580 1936
07:30:02 AM 7829496 184412716 255679296 97.03 182218312 5732208 85131732 32.31 118243332 128580140 1936
07:40:04 AM 7740504 184479268 255768288 97.06 182364476 5741740 85031196 32.27 118583244 128329280 2040
07:50:04 AM 7420276 184280780 256088516 97.18 182467656 5761556 84965956 32.24 119011660 128213368 3252
08:00:03 AM 7192792 184356568 256316000 97.27 182765576 5766776 85109100 32.30 119255540 128207600 3660
08:10:04 AM 6920976 184319864 256587816 97.37 182937164 5832392 85310572 32.37 119569988 128164088 1448
08:20:04 AM 7177172 184511812 256331620 97.28 182918084 5789408 85157376 32.32 120340676 127135952 1560
08:30:02 AM 6909440 184428624 256599352 97.38 183079164 5812472 85133024 32.31 120972512 126770908 1104
08:40:03 AM 8958964 184705012 254549828 96.60 181305188 5651932 85654424 32.51 108196220 137174192 9448
~~~

Load average is normal:

~~~
06:20:04 AM 0 2717 1.86 1.83 1.81 0
06:30:03 AM 1 2561 0.82 1.38 1.65 0
06:40:04 AM 0 2559 1.48 1.28 1.45 0
06:50:04 AM 0 2727 1.66 1.55 1.54 0
07:00:02 AM 3 2564 1.80 1.75 1.67 0
07:10:04 AM 2 2560 1.72 1.92 1.83 0
07:20:04 AM 1 2723 2.04 2.06 1.96 0
07:30:02 AM 0 2564 1.05 1.52 1.75 0
07:40:04 AM 0 2564 1.60 1.64 1.71 0
07:50:04 AM 0 2704 1.42 1.70 1.76 0
08:00:03 AM 2 2559 2.52 1.87 1.79 0
08:10:04 AM 2 2565 1.32 1.86 1.87 0
08:20:04 AM 1 2564 1.31 1.74 1.82 0
08:30:02 AM 6 2567 2.23 1.62 1.66 0
08:40:03 AM 10 2599 5.53 4.55 3.03 0
~~~

I noticed a high block read rate:

~~~
06:40:04 AM 820.02 648.40 171.62 185589.35 8831.89
06:50:04 AM 787.46 677.09 110.37 190159.80 1423.48
07:00:02 AM 800.44 711.38 89.06 199348.78 1112.41
07:10:04 AM 953.48 739.67 213.82 229423.06 2424.51
07:20:04 AM 1121.73 869.30 252.43 234740.23 3025.41
07:30:02 AM 982.46 773.94 208.52 260530.12 2550.83
07:40:04 AM 935.18 751.69 183.49 228753.50 2512.17
07:50:04 AM 988.02 836.12 151.90 241907.94 1752.07
08:00:03 AM 995.31 717.67 277.64 237473.34 3000.98
08:10:04 AM 2233.51 664.50 1569.02 214436.12 27850.18
08:20:04 AM 3005.57 2233.33 772.23 332121.67 14138.59
08:30:02 AM 850.32 694.54 155.78 191332.81 2508.30
08:40:03 AM 3495.29 993.85 2501.43 192695.07 114584.07
~~~

Disk utilization is quite high (reaching above 90%, which will obviously cause slow performance):

~~~
05:50:03 AM sdj 134.05 20846.78 145.42 156.60 2.46 18.34 7.24 97.00
06:00:02 AM sdj 131.87 19737.11 411.62 152.79 2.58 19.54 7.25 95.64
06:00:02 AM sdg 113.06 17066.43 401.35 154.50 2.12 18.71 7.19 81.31
06:10:04 AM sdj 131.57 19885.95 344.87 153.77 2.56 19.46 7.26 95.49
06:10:04 AM sdg 114.46 17235.29 385.35 153.95 2.16 18.85 7.15 81.85
06:40:04 AM sdg 129.39 20341.50 326.33 159.73 3.40 26.25 6.95 89.89
06:50:04 AM sde 138.27 21588.66 9.56 156.20 3.96 28.67 6.82 94.30
06:50:04 AM sdg 123.47 19540.45 11.25 158.35 3.21 25.96 7.00 86.43
07:00:02 AM sde 138.52 22137.14 5.12 159.84 3.84 27.70 6.83 94.61
07:00:02 AM sdg 122.04 19462.56 3.37 159.50 3.14 25.70 7.04 85.96
07:10:04 AM sdi 132.11 20481.59 4.28 155.07 3.60 27.24 6.74 89.06
07:10:04 AM sde 135.21 20957.56 9.24 155.07 3.94 29.11 6.80 91.89
07:10:04 AM sdg 131.55 20317.20 5.58 154.48 3.75 28.50 6.84 89.92
07:20:04 AM sdi 133.90 20483.81 6.98 153.04 3.66 27.37 6.75 90.32
07:20:04 AM sde 129.25 19781.57 6.75 153.11 3.57 27.65 6.78 87.64
07:20:04 AM sdg 130.36 20006.79 7.55 153.53 3.49 26.75 6.86 89.47
07:30:02 AM sdf 136.53 21564.84 14.72 158.06 4.08 29.85 6.95 94.88
07:30:02 AM sdd 143.00 21101.37 16.25 147.67 2.59 18.12 6.66 95.29
07:30:02 AM sdi 135.95 21021.53 13.74 154.73 3.72 27.35 6.82 92.69
07:30:02 AM sdh 106.39 26908.52 11.10 253.03 2.44 22.90 8.21 87.29
07:40:04 AM sdf 135.99 20389.66 10.89 150.01 4.06 29.83 6.87 93.48
07:40:04 AM sdd 141.39 19957.73 16.54 141.27 2.73 19.29 6.61 93.52
07:40:04 AM sdi 134.45 19627.73 8.92 146.05 3.74 27.83 6.71 90.17
07:40:04 AM sdh 107.22 25501.50 11.06 237.95 2.52 23.51 8.08 86.63
07:50:04 AM sdf 136.31 20027.55 5.89 146.97 3.99 29.30 6.78 92.42
07:50:04 AM sdd 143.84 19977.27 5.11 138.92 2.73 18.95 6.46 92.86
07:50:04 AM sdh 117.50 23465.00 6.42 199.76 2.75 23.44 7.31 85.91
08:00:03 AM sdf 135.66 20959.82 20.65 154.66 3.89 28.71 6.91 93.70
08:00:03 AM sdd 143.10 21179.39 14.06 148.10 2.59 18.08 6.75 96.56
08:00:03 AM sdh 140.77 21242.44 15.24 151.01 3.56 25.29 6.32 88.95
08:10:04 AM sdf 135.20 20827.60 3.58 154.07 3.92 28.98 6.93 93.65
08:10:04 AM sdd 145.74 21302.24 2.78 146.18 2.57 17.62 6.68 97.29
08:10:04 AM sdh 143.34 21507.47 4.29 150.08 3.54 24.72 6.31 90.46
08:20:04 AM sdd 146.15 21059.81 159.45 145.19 3.13 21.39 6.27 91.58
08:30:02 AM sdd 156.02 22307.99 23.02 143.13 4.52 29.00 6.00 93.59
08:40:03 AM sdd 151.17 21887.74 369.49 147.23 4.48 29.64 6.21 93.88
~~~

Network utilization is low:

~~~
07:00:02 AM IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s %ifutil
07:50:04 AM eth3 3967.72 4227.08 11208.65 13429.43 0.00 0.00 0.60 0.11
07:50:04 AM bond1 8026.98 8341.91 22885.19 26816.57 0.00 0.00 0.70 0.11
07:50:04 AM eth0 225.25 2265.20 76.59 3167.11 0.00 0.00 0.60 0.10
07:50:04 AM bond0 380.63 4425.63 91.85 6325.13 0.00 0.00 0.70 0.10
07:50:04 AM eth1 155.38 2160.44 15.25 3158.01 0.00 0.00 0.10 0.10
07:50:04 AM eth2 4059.27 4114.83 11676.54 13387.13 0.00 0.00 0.10 0.11
08:00:02 AM eth3 2956.72 2900.38 4310.54 4360.87 0.00 0.00 0.60 0.04
08:00:02 AM bond1 5795.80 5723.51 8174.45 8377.80 0.00 0.00 0.70 0.03
08:00:02 AM eth0 92.28 254.05 64.77 256.93 0.00 0.00 0.60 0.01
08:00:02 AM bond0 154.41 415.61 118.54 418.84 0.00 0.00 0.70 0.01
08:00:02 AM eth1 62.13 161.57 53.78 161.92 0.00 0.00 0.10 0.01
08:00:02 AM eth2 2839.08 2823.13 3863.91 4016.93 0.00 0.00 0.10 0.03
08:10:04 AM eth3 5674.71 7149.43 19677.73 33063.69 0.00 0.00 0.60 0.27
08:10:04 AM bond1 11895.28 14652.54 41230.98 68568.04 0.00 0.00 0.70 0.28
08:10:04 AM eth0 104.27 957.90 26.45 1291.77 0.00 0.00 0.60 0.04
08:10:04 AM bond0 265.83 1882.56 126.39 2580.27 0.00 0.00 0.70 0.04
08:10:04 AM eth1 161.56 924.66 99.94 1288.50 0.00 0.00 0.10 0.04
08:10:04 AM eth2 6220.56 7503.12 21553.25 35504.35 0.00 0.00 0.10 0.29
08:20:04 AM eth3 4779.80 5680.79 12828.09 20859.93 0.00 0.00 0.60 0.17
08:20:04 AM bond1 9923.10 11354.47 26486.59 41782.66 0.00 0.00 0.70 0.17
08:20:04 AM eth0 118.79 1176.19 34.19 1675.39 0.00 0.00 0.60 0.05
08:20:04 AM bond0 215.43 2062.88 75.23 2907.40 0.00 0.00 0.70 0.05
08:20:04 AM eth1 96.64 886.70 41.04 1232.01 0.00 0.00 0.10 0.04
08:20:04 AM eth2 5143.30 5673.68 13658.50 20922.73 0.00 0.00 0.10 0.17
08:30:02 AM eth3 3696.28 4296.09 8136.76 13917.48 0.00 0.00 0.60 0.11
08:30:02 AM bond1 7605.12 8614.60 16916.80 28081.38 0.00 0.00 0.70 0.12
~~~

The total number of sockets reaches around 11k. This includes OSD connections + RGW connections + ssh and other housekeeping connections.

~~~
07:00:02 AM totsck tcpsck udpsck rawsck ip-frag tcp-tw
07:10:04 AM 11345 10945 8 0 0 68
07:20:04 AM 11680 11280 8 0 0 297
07:30:03 AM 10996 10596 8 0 0 1170
07:40:04 AM 12011 11611 8 0 0 17
07:50:04 AM 11482 11080 8 0 0 34
08:00:02 AM 11993 11593 8 0 0 217
08:10:04 AM 11864 11464 8 0 0 29
08:20:04 AM 12016 11616 8 0 0 47
08:30:02 AM 11182 10782 8 0 0 741
08:40:04 AM 10802 10402 8 0 0 826
08:50:04 AM 11051 10651 8 0 0 475
09:00:03 AM 11498 11098 8 0 0 106
09:10:04 AM 11915 11515 8 0 0 53
09:20:04 AM 11081 10681 8 0 0 446
09:30:02 AM 12027 11627 8 0 0 585
09:40:03 AM 11762 11338 8 0 0 96
~~~

Requested the below details from the customer:

The number of active connections to RGW:

~~~
$ cd /var/run/ceph/<fsid>
$ while sleep 60 ; do echo -n "$(date):" ; ceph --admin-daemon ceph-client.rgw.*.asok perf dump |grep "qactive" ; done >> /tmp/rgw_qactive
#### Let it run for around an hour and then stop it
~~~

The configuration values of rgw_max_concurrent_requests & rgw_thread_pool_size.
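Once the sampling loop above has been left running, the collected file can be summarized to see how close the RGW gets to its thread-pool limit. A minimal sketch, assuming a hypothetical sample of the loop's output format (a `date` prefix followed by the `"qactive"` line that `grep` pulls out of `perf dump`); the file name and values below are made up for illustration:

```shell
# Hypothetical sample of what the sampling loop collects, one line per minute.
cat > /tmp/rgw_qactive.sample <<'EOF'
Tue May 30 06:00:00 UTC 2023:        "qactive": 412,
Tue May 30 06:01:00 UTC 2023:        "qactive": 987,
Tue May 30 06:02:00 UTC 2023:        "qactive": 1024,
EOF

# Extract the qactive values and report the peak. A peak sitting at or near
# rgw_thread_pool_size would suggest the RGW is saturating its request threads.
awk -F'"qactive": ' '{gsub(/,/, "", $2); if ($2+0 > max) max = $2+0} END {print "peak qactive:", max}' /tmp/rgw_qactive.sample
```

For the sample above this prints `peak qactive: 1024`; comparing that peak against the configured rgw_thread_pool_size / rgw_max_concurrent_requests is the point of the exercise.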
I am hoping they are set to the default values.

~~~
$ ceph --admin-daemon ceph-client.rgw.*.asok config show | egrep -i "rgw_max_concurrent_requests|rgw_thread_pool_size"
~~~

We may need to suggest increasing rgw_max_concurrent_requests & rgw_thread_pool_size.

Please let me know your thoughts.

Thanks,
Tridibesh

--- Additional comment from Matt Benjamin (redhat) on 2023-06-28 11:36:26 UTC ---

Hi Tridibesh,

The frequent notify timeouts indicate a problem, and might be related, but I doubt they are the cause. The timeout could indicate that an RGW has crashed or hung, but could also indicate a network problem.

To debug the main issue, I think what we would be interested in is log information from an attempted upload that failed. That might not be related to the notify timeout (unless it crashed an RGW, and other RGWs are now timing out reaching it).

Also, though, Tesco is using custom logic (written by us) to renew STS credentials in precisely the case of a long-running job, and it may be important to figure out whether the actual cause of failure is a problem with a token renewal that would be expected to have happened. We might be able to compare the request run time with the token expiration rules to make a correlation.

Matt

--- Additional comment from Casey Bodley on 2023-06-28 13:10:50 UTC ---

RGWSI_Notify::robust_notify() is called when we write changes to a metadata object. It broadcasts a message to the other radosgws (and radosgw-admins) so they can invalidate their metadata cache of that object. robust_notify() waits up to 15s for all radosgw[-admin]s to respond before continuing, so if any of those clients are unresponsive, you'd see this connection timeout error.

I recently found that AssumeRole was writing an unnecessary update to the user metadata - see the linked ceph tracker issue and pull request.

Given that their workload relies on AssumeRole, this metadata write is likely the operation that ends up failing with "Notify failed on object xx.rgw.meta:users.uid:xxx". This metadata write is unlikely to *cause* the connection timeout errors, but providing that fix may reduce their frequency and lessen the impact on clients.

In the past, we've seen these kinds of timeouts because of a defunct radosgw-admin command that never shut down cleanly. You can check for this by listing all of the endpoints that are registered for these notifications with the rados tool:

~~~
$ rados -p xx.rgw.control listwatchers notify.0
watcher=192.168.245.130:0/803955432 client.4147 cookie=94838542249216
~~~

--- Additional comment from Matt Benjamin (redhat) on 2023-06-28 16:12:37 UTC ---

Tridibesh,

If the customer has a test environment available, we could make a test fix with the change Casey is describing.

Matt

--- Additional comment from Tridibesh Chakraborty on 2023-06-29 06:39:55 UTC ---

Hi Matt & Casey,

Thank you for picking this up and responding with your findings. I can check with the customer whether they have a test environment where the fix can be applied and verified. If I remember correctly, when I last asked them about a test environment (initially a multisite and later a standalone setup), they said they did not have one. But I will validate this with them and confirm back. I am keeping this as a pending action item on me.

Thanks,
Tridibesh

--- Additional comment from Tridibesh Chakraborty on 2023-06-30 11:50:08 UTC ---

Hi Matt/Casey,

I have shared the analysis with the customer and also asked whether a test environment is available to test the fix. I am yet to receive a response from them.

BTW, for my own reference, do we have any internal doc or article that explains the various types of alerts in the RGW debug logs in depth (e.g. RGWSI_Notify::robust_notify()) and their significance, so that I can at least look them up and correlate these debug messages?

Thanks,
Tridibesh

--- Additional comment from Matt Benjamin (redhat) on 2023-06-30 11:55:43 UTC ---

Tridibesh,

I'm not aware of any special doc on these messages; they're part of an internal workflow. If there is anything, it might be in a KCS.

Matt

--- Additional comment from Tridibesh Chakraborty on 2023-06-30 12:02:48 UTC ---

Hi Matt,

Thanks. Let me check the KCSs to see if there is any in-depth discussion there.

Regarding the confirmation of a test environment, it looks like the customer is on leave till 3rd July, so we may not receive a response until then.

Thanks,
Tridibesh

--- Additional comment from Matt Benjamin (redhat) on 2023-07-05 12:42:05 UTC ---

bipin/vikhyat, could you perhaps help triage the customer interaction here? this risks getting stalled...

Matt

--- Additional comment from Tridibesh Chakraborty on 2023-07-05 13:13:32 UTC ---

Hello Matt,

The customer was on leave till yesterday, so I have put a reminder in the case again. Hopefully we will get a response back today or tomorrow.

Thanks,
Tridibesh

--- Additional comment from Tridibesh Chakraborty on 2023-07-06 07:10:27 UTC ---

Hello Matt/Casey,

I have got a response from the customer:

---
We have only our backup clusters for testing; however, those clusters are not integrated with RHSSO for AD integration. Could you please let us know whether we can test this issue on a cluster which has only local users and local roles?
---

Can you please confirm whether this will work, or whether they need a setup similar to their production with OIDC and AssumeRole? I am guessing they need a similar setup.

Thanks,
Tridibesh

--- Additional comment from Casey Bodley on 2023-07-06 14:02:34 UTC ---

(In reply to Tridibesh Chakraborty from comment #13)
> Can you please confirm if this will work or they need to have similar setup
> like their production with OIDC and assume role? I am guessing they need to
> have similar setup.

They would want to recreate a workload that includes AssumeRole for this testing. I believe you can do this with local users/roles.

--- Additional comment from Tridibesh Chakraborty on 2023-07-18 08:09:12 UTC ---

Hello Casey/Matt,

I am not getting much response from the customer on the test cluster setup. So if we assume that the customer can't set up a test cluster and test the fix, can we have some alternative approach to validate the solution and provide a fix to the customer?

Thanks,
Tridibesh

--- Additional comment from Casey Bodley on 2023-07-18 13:12:05 UTC ---

(In reply to Tridibesh Chakraborty from comment #15)
> I am not getting much response from customer on the test cluster setup. So
> if we assume that customer can't setup a test cluster and test the fix, can
> we have some other alternative approach to validate the solution and provide
> a fix to customer?

AFAIK we don't have a reproducer for the customer's issue (notify timeouts). We have a fix for something that's related, and we can test that on its own (it was tested/merged upstream), but we don't know that it will resolve the customer case.

Are the RGWSI_Notify::robust_notify errors still frequent? In https://bugzilla.redhat.com/show_bug.cgi?id=2214981#c5 above I recommended debugging with 'rados listwatchers'. Has anyone tried that?

--- Additional comment from Bipin Kunal on 2023-07-20 05:42:42 UTC ---

Removing the needinfo as support is already on top of it.

--- Additional comment from Tridibesh Chakraborty on 2023-07-25 10:13:56 UTC ---

Hello Casey,

I have asked for the output of the command and got the below output from the customer infra.
~~~
[root@whds0001 ~]# rados -p wdc-prod.rgw.control listwatchers notify.0
watcher=172.17.182.29:0/3406395733 client.8502909 cookie=93848966864896
watcher=172.17.182.150:0/3446051895 client.8503179 cookie=93855039410560
watcher=172.17.182.77:0/276458263 client.8501351 cookie=93878190430848
watcher=172.17.182.28:0/3862430847 client.8487829 cookie=93930343931904
watcher=172.17.182.76:0/3194749452 client.8500730 cookie=93986595453568
watcher=172.17.182.149:0/2617041512 client.8488282 cookie=93997356120704
watcher=172.17.182.93:0/1053154184 client.8503011 cookie=94113522858624
watcher=172.17.182.92:0/2122245614 client.8488057 cookie=94127103244928
watcher=172.17.182.108:0/3092808963 client.8488126 cookie=94186202534272
watcher=172.17.182.166:0/3974953876 client.8503212 cookie=94216629256192
watcher=172.17.182.118:0/457252313 client.8501483 cookie=94234440880512
watcher=172.17.182.165:0/341660400 client.8502696 cookie=94385907513984
watcher=172.17.182.45:0/2356641433 client.8501297 cookie=94391517215360
watcher=172.17.182.117:0/3327240474 client.8502540 cookie=94471223401088
watcher=172.17.182.109:0/93312324 client.8501405 cookie=94480676133504
watcher=172.17.182.134:0/2542382258 client.8503158 cookie=94531319347840
watcher=172.17.182.12:0/1970182782 client.8500505 cookie=94551361460864
watcher=172.17.182.61:0/3106670398 client.8488594 cookie=94593773992576
watcher=172.17.182.181:0/3652166781 client.8501180 cookie=94604185006464
watcher=172.17.182.44:0/1666038841 client.8502252 cookie=94681847044736
watcher=172.17.182.133:0/4008072637 client.8500967 cookie=94689492099072
watcher=172.17.182.60:0/2562545797 client.8500685 cookie=94858030715264
watcher=172.17.182.182:0/3240503417 client.8488873 cookie=94893928585856
[root@whds0001 ~]#
~~~

Thanks,
Tridibesh

--- Additional comment from Casey Bodley on 2023-07-25 12:37:52 UTC ---

(In reply to Tridibesh Chakraborty from comment #18)
> Hello Casey,
>
> I have asked the output of the command and got below output from the
> customer infra.

Thanks. The next step is to correlate the IPs with their running radosgws and look for watchers that don't belong there. You might find a radosgw-admin command running somewhere they didn't intend.

p.s. Are the RGWSI_Notify::robust_notify errors still frequent? Are we still collecting data about this?

--- Additional comment from Tridibesh Chakraborty on 2023-07-31 08:51:03 UTC ---

Hi Casey,

The customer confirmed that these are the IPs of their radosgws. They also checked for any running radosgw-admin command on those hosts and did not find any.

Regarding the error messages, they are still seeing them. When they last checked, there were around 1500 messages in a single day (26th July), which may be less than what they were getting initially.

Thanks,
Tridibesh

--- Additional comment from Matt Benjamin (redhat) on 2023-08-03 13:48:31 UTC ---

Hi Tridibesh,

Casey and Adam have two changes that Casey thinks won't fix the root cause, but 1) will reduce the incidence and 2) will provide debugging information when it reproduces.

Could you ask whether the customer could potentially run a debugging test fix?

Matt

--- Additional comment from Tridibesh Chakraborty on 2023-08-04 02:31:50 UTC ---

Hi Matt,

Sure, I have asked the customer if they can run a debugging test fix. So if I understand correctly, you will be providing a test fix which will enable verbose debugging of the issue so that we have more information to get this fixed. Is that correct?

Thanks,
Tridibesh

--- Additional comment from Tridibesh Chakraborty on 2023-08-11 02:54:14 UTC ---

Hi Casey/Matt,

The customer has a couple of queries.

> What kind of fix is suggested by the engineering team?

If I am not wrong, this fix will not eliminate the root cause; it will just reduce the incidence and provide more debugging information so that we can trace back the actual cause and provide a fix for that.

> This is our production storage; we should test any fix in lower environments before applying it in prod. But we don't have this kind of issue in other environments.

The customer doesn't have an exact replica of their production system where they can test the fix. So, if that is the case, then I guess we need to provide the fix via the next Z release, as the customer will not be able to validate it and we need to carry out the necessary testing. Is my understanding correct? If not, please share your thoughts.

Thanks,
Tridibesh

--- Additional comment from Matt Benjamin (redhat) on 2023-09-10 17:28:31 UTC ---

(As there is no RCA yet, moving to rhcs-7.0z1.)

--- Additional comment from Tridibesh Chakraborty on 2023-09-15 01:11:30 UTC ---

Hi Matt/Casey,

Can you please let me know if we have any code fix/workaround for this bug in RHCS 5.3z5? It looks like the customer is unhappy with their experience with Ceph, and they don't see much meaningful progress on this issue. There is an active escalation on this case now.

So if I understand correctly, in RHCS 5.3z5 the issue is not fixed, but it has the code fix to reduce the occurrence of these error messages, and it will try to capture more meaningful debug logs so that we can find the root cause of this issue and provide a fix to the customer. Please correct me if my understanding is wrong.

Thanks,
Tridibesh
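The correlation step Casey suggested earlier in the thread (matching the `listwatchers` IPs against the hosts actually running radosgws) can be scripted rather than eyeballed. A minimal sketch with made-up sample data; the file names, the `10.0.0.99` stray watcher, and the host list are all hypothetical:

```shell
# Hypothetical inputs: a watcher dump from
# `rados -p <zone>.rgw.control listwatchers notify.0`, plus a list of IPs
# where radosgws are known to run.
cat > /tmp/watchers.txt <<'EOF'
watcher=172.17.182.29:0/3406395733 client.8502909 cookie=93848966864896
watcher=172.17.182.150:0/3446051895 client.8503179 cookie=93855039410560
watcher=10.0.0.99:0/123456789 client.9999999 cookie=94000000000000
EOF
cat > /tmp/rgw_hosts.txt <<'EOF'
172.17.182.29
172.17.182.150
EOF

# Print watcher IPs that do not belong to a known radosgw. A stray entry can
# point at a defunct radosgw-admin process that never unregistered its watch.
sed -n 's/^watcher=\([^:]*\):.*/\1/p' /tmp/watchers.txt | sort -u |
  grep -Fxv -f /tmp/rgw_hosts.txt
```

With this sample data the pipeline prints only `10.0.0.99`, the watcher with no corresponding radosgw host; an empty output would match what the customer reported (all watchers accounted for).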
I am confused by this issue. It seems to be a problem with system instability, not an RGW issue. A "fix" was provided to the 7.0 codebase (which suppressed error reporting and increased debugging info), but was this fix also already delivered as a hotfix based on 5.3z6? If so, then I assume the fix should already have been merged into the 5.3 codebase as well, so the next 5.3z6 build should already have it, correct? In any case, please update this ticket accordingly. We are nominally past the point where issues should be accepted into this release, but if the fix is already present, I should not need to remove it from the release.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: Red Hat Ceph Storage 5.3 Security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2024:0745