+++ This bug was initially created as a clone of Bug #2214981 +++
Description of problem:
Customer is observing frequent connection timed out error in the RGW logs and because of this jobs are getting cancelled and users have to rerun the jobs. There is no pattern for the connection timed out BTW as per customer this is happening when they are coping huge data via RGW.
~~~
2023-05-30T06:02:26.098+0000 7f8cbf8b5700 1 req 2251681747474232490 10.014173508s int RGWSI_Notify::robust_notify(const DoutPrefixProvider*, RGWSI_RADOS::Obj&, const RGWCacheNotifyInfo&, optional_yield):402 Notify failed on object xx.rgw.meta:users.uid:xxx: (110) Connection timed out
2023-05-30T06:02:26.098+0000 7f8cbf8b5700 1 req 2251681747474232490 10.014173508s int RGWSI_Notify::robust_notify(const DoutPrefixProvider*, RGWSI_RADOS::Obj&, const RGWCacheNotifyInfo&, optional_yield):418 Invalidating obj=xx.rgw.meta:users.uid:xxx tries=0
~~~
Version-Release number of selected component (if applicable):
RHCS 5.3z3 (16.2.10-160.el8cp)
RHEL 8.4
How reproducible:
Customer environment specific
Steps to Reproduce:
NA
Actual results:
Connection timed out happening
Expected results:
There should not be connection timed out messages
Additional info:
Customer is using Hadoop credential provider for OIDC+STS authentication
--- Additional comment from Tridibesh Chakraborty on 2023-06-14 10:41:59 UTC ---
In the RGW debug logs, I am seeing below error messages:
~~~
2023-05-30T06:02:26.098+0000 7f8cbf8b5700 1 req 2251681747474232490 10.014173508s int RGWSI_Notify::robust_notify(const DoutPrefixProvider*, RGWSI_RADOS::Obj&, const RGWCacheNotifyInfo&, optional_yield):402 Notify failed on object wdc-prod.rgw.meta:users.uid:prod-cep-admin: (110) Connection timed out
2023-05-30T06:02:26.098+0000 7f8cbf8b5700 1 req 2251681747474232490 10.014173508s int RGWSI_Notify::robust_notify(const DoutPrefixProvider*, RGWSI_RADOS::Obj&, const RGWCacheNotifyInfo&, optional_yield):418 Invalidating obj=wdc-prod.rgw.meta:users.uid:prod-cep-admin tries=0
2023-05-30T06:02:36.099+0000 7f8c2277b700 1 req 2251681747474232490 20.015346527s int RGWSI_Notify::robust_notify(const DoutPrefixProvider*, RGWSI_RADOS::Obj&, const RGWCacheNotifyInfo&, optional_yield):423 invalidation attempt 0 failed: (110) Connection timed out
2023-05-30T06:02:36.099+0000 7f8c2277b700 1 req 2251681747474232490 20.015346527s int RGWSI_Notify::robust_notify(const DoutPrefixProvider*, RGWSI_RADOS::Obj&, const RGWCacheNotifyInfo&, optional_yield):418 Invalidating obj=wdc-prod.rgw.meta:users.uid:prod-cep-admin tries=1
2023-05-30T06:02:46.100+0000 7f8c2277b700 1 req 2251681747474232490 30.016517639s int RGWSI_Notify::robust_notify(const DoutPrefixProvider*, RGWSI_RADOS::Obj&, const RGWCacheNotifyInfo&, optional_yield):423 invalidation attempt 1 failed: (110) Connection timed out
2023-05-30T06:02:46.101+0000 7f8c2277b700 1 req 2251681747474232490 30.017517090s int RGWSI_Notify::robust_notify(const DoutPrefixProvider*, RGWSI_RADOS::Obj&, const RGWCacheNotifyInfo&, optional_yield):418 Inv
alidating obj=wdc-prod.rgw.meta:users.uid:prod-cep-admin tries=2
2023-05-30T06:02:56.102+0000 7f8c56fe4700 1 req 2251681747474232490 40.018688202s int RGWSI_Notify::robust_notify(const DoutPrefixProvider*, RGWSI_RADOS::Obj&, const RGWCacheNotifyInfo&, optional_yield):423 invalidation attempt 2 failed: (110) Connection timed out
2023-05-30T06:02:56.102+0000 7f8c56fe4700 1 req 2251681747474232490 40.018688202s int RGWSI_Notify::robust_notify(const DoutPrefixProvider*, RGWSI_RADOS::Obj&, const RGWCacheNotifyInfo&, optional_yield):418 Invalidating obj=wdc-prod.rgw.meta:users.uid:prod-cep-admin tries=3
~~~
Also, I found a BZ (BZ#2002220) where I can see similar RGWSI_Notify::robust_notify(const DoutPrefixProvider*, RGWSI_RADOS::Obj&, const RGWCacheNotifyInfo&, optional_yield):418 Invalidating obj message which was fixed on RHCS 5.0z1.
Also, I have found an upstream issue tracker. Not sure if it is related to the same or not.
https://tracker.ceph.com/issues/54397
I have asked customer whether this user prod-cep-admin is present in the RGW as the logs are complaining on this user.
Customer: Tesco Plc
Account#: 1382945
SFDC case#: 03510446
Sosreport and related debug logs are downloaded into supportshell under below location:
/cases/03510446
Customer have shared sosreport from one of the RGW node (0070-sosreport-whds0046-03510446-2023-05-31-shoajtd.tar.xz). This sosreport contains the RGW debug logs under below location:
/cases/03510446/sosreport-20230531-131308/whds0046.global.tesco.org/var/log/ceph/4e4ba66d-f744-4b46-be8b-5d557ad551ab
~~~
[trchakra@supportshell-1 4e4ba66d-f744-4b46-be8b-5d557ad551ab]$ ll | grep rgw
-rw-rw-rw-. 1 yank yank 1324842546 May 31 13:15 ceph-client.rgw.whds0046.global-prod.wdc-prod.8080.whds0046.lhjojv.log
-rw-rw-rw-. 1 yank yank 148024705 Mar 30 23:01 ceph-client.rgw.whds0046.global-prod.wdc-prod.8080.whds0046.lhjojv.log-20230331.gz
-rw-rw-rw-. 1 yank yank 460102866 Mar 31 23:01 ceph-client.rgw.whds0046.global-prod.wdc-prod.8080.whds0046.lhjojv.log-20230401.gz
-rw-rw-rw-. 1 yank yank 204506082 Apr 1 23:01 ceph-client.rgw.whds0046.global-prod.wdc-prod.8080.whds0046.lhjojv.log-20230402.gz
-rw-rw-rw-. 1 yank yank 333590708 Apr 2 23:01 ceph-client.rgw.whds0046.global-prod.wdc-prod.8080.whds0046.lhjojv.log-20230403.gz
-rw-rw-rw-. 1 yank yank 326798384 Apr 3 23:01 ceph-client.rgw.whds0046.global-prod.wdc-prod.8080.whds0046.lhjojv.log-20230404.gz
-rw-rw-rw-. 1 yank yank 438984797 Apr 4 23:01 ceph-client.rgw.whds0046.global-prod.wdc-prod.8080.whds0046.lhjojv.log-20230405.gz
-rw-rw-rw-. 1 yank yank 62165386 Apr 28 23:01 ceph-client.rgw.whds0046.global-prod.wdc-prod.8080.whds0046.lhjojv.log-20230429.gz
-rw-rw-rw-. 1 yank yank 29442276 May 24 23:01 ceph-client.rgw.whds0046.global-prod.wdc-prod.8080.whds0046.lhjojv.log-20230525-0001.gz
-rw-rw-rw-. 1 yank yank 70509279 May 25 23:01 ceph-client.rgw.whds0046.global-prod.wdc-prod.8080.whds0046.lhjojv.log-20230526-0001.gz
-rw-rw-rw-. 1 yank yank 87535180 May 26 23:01 ceph-client.rgw.whds0046.global-prod.wdc-prod.8080.whds0046.lhjojv.log-20230527-0001.gz
-rw-rw-rw-. 1 yank yank 48863228 May 27 23:00 ceph-client.rgw.whds0046.global-prod.wdc-prod.8080.whds0046.lhjojv.log-20230528-0001.gz
-rw-rw-rw-. 1 yank yank 95360887 May 28 23:01 ceph-client.rgw.whds0046.global-prod.wdc-prod.8080.whds0046.lhjojv.log-20230529-0001.gz
-rw-rw-rw-. 1 yank yank 137448007 May 29 23:01 ceph-client.rgw.whds0046.global-prod.wdc-prod.8080.whds0046.lhjojv.log-20230530-0001.gz
-rw-rw-rw-. 1 yank yank 155697581 May 30 23:01 ceph-client.rgw.whds0046.global-prod.wdc-prod.8080.whds0046.lhjojv.log-20230531-0001.gz
-rw-rw-rw-. 1 yank yank 948619563 May 31 13:15 rgwops.log
-rw-rw-rw-. 1 yank yank 89684444 Mar 30 23:01 rgwops.log-20230331.gz
-rw-rw-rw-. 1 yank yank 103510414 Mar 31 23:01 rgwops.log-20230401.gz
-rw-rw-rw-. 1 yank yank 69866360 Apr 1 23:01 rgwops.log-20230402.gz
-rw-rw-rw-. 1 yank yank 77197450 Apr 2 23:00 rgwops.log-20230403.gz
-rw-rw-rw-. 1 yank yank 103942227 Apr 3 23:01 rgwops.log-20230404.gz
-rw-rw-rw-. 1 yank yank 316809268 Apr 4 23:01 rgwops.log-20230405.gz
-rw-rw-rw-. 1 yank yank 50589095 Apr 28 23:01 rgwops.log-20230429.gz
-rw-rw-rw-. 1 yank yank 21411080 May 24 23:01 rgwops.log-20230525-0001.gz
-rw-rw-rw-. 1 yank yank 27623030 May 25 23:01 rgwops.log-20230526-0001.gz
-rw-rw-rw-. 1 yank yank 29317867 May 26 23:01 rgwops.log-20230527-0001.gz
-rw-rw-rw-. 1 yank yank 20790332 May 27 23:00 rgwops.log-20230528-0001.gz
-rw-rw-rw-. 1 yank yank 46030965 May 28 23:01 rgwops.log-20230529-0001.gz
-rw-rw-rw-. 1 yank yank 51457895 May 29 23:01 rgwops.log-20230530-0001.gz
-rw-rw-rw-. 1 yank yank 54903543 May 30 23:01 rgwops.log-20230531-0001.gz
~~~
--- Additional comment from Tridibesh Chakraborty on 2023-06-23 09:40:31 UTC ---
Hello Matt,
Can you please let me know if we are hitting here any bug? Or it is a misconfiguration?
Thanks,
Tridibesh
--- Additional comment from Tridibesh Chakraborty on 2023-06-28 07:20:58 UTC ---
Trying to find anything meaningful myself.
One of the timed out message:
~~~
2023-05-12T08:11:54.815+0000 7f8c4c7cf700 1 req 6699870219808022696 10.011156082s int RGWSI_Notify::robust_notify(const DoutPrefixProvider*, RGWSI_RADOS::Obj&, const RGWCacheNotifyInfo&, optional_yield):402 Notify failed on object wdc-prod.rgw.meta:users.uid:prod-cep-admin: (110) Connection timed out
~~~
CPU usage is normal during that time:
~~~
06:40:04 AM all 1.45 0.00 1.49 0.01 0.00 97.03
06:50:04 AM all 1.69 0.00 1.75 0.01 0.00 96.54
07:00:02 AM all 1.66 0.01 1.51 0.01 0.00 96.81
07:10:04 AM all 2.10 0.00 2.31 0.02 0.00 95.58
07:20:04 AM all 2.24 0.00 2.12 0.04 0.00 95.60
07:30:02 AM all 1.68 0.01 1.64 0.02 0.00 96.66
07:40:04 AM all 1.86 0.00 1.93 0.02 0.00 96.19
07:50:04 AM all 1.89 0.00 2.00 0.02 0.00 96.08
08:00:03 AM all 1.64 0.01 1.66 0.02 0.00 96.67
08:10:04 AM all 2.07 0.00 2.07 0.18 0.00 95.68
08:20:04 AM all 1.74 0.00 1.86 0.09 0.00 96.31
08:30:02 AM all 1.45 0.01 1.67 0.02 0.00 96.86
08:40:03 AM all 3.26 0.00 3.52 0.32 0.00 92.90
~~~
Memory usage can see on higher side
~~~
06:20:04 AM 9153884 184415336 254354908 96.53 181203776 5420280 85141192 32.31 117878496 127612920 3924
06:30:03 AM 8975448 184490308 254533344 96.59 181400932 5476608 85092168 32.29 118126084 127525156 2188
06:40:04 AM 9331848 184462592 254176944 96.46 180998176 5497024 85095868 32.29 117642556 127680800 1340
06:50:04 AM 9861768 184448104 253647024 96.26 180409832 5542356 84967680 32.24 116695508 128087592 2148
07:00:02 AM 9328664 184474632 254180128 96.46 180908800 5603180 85141456 32.31 117443168 127882680 3000
07:10:04 AM 9569740 184350308 253939052 96.37 180483080 5664924 85117880 32.30 116538396 128546256 1752
07:20:04 AM 8040520 184269568 255468272 96.95 181870664 5725772 85048860 32.28 117987868 128618580 1936
07:30:02 AM 7829496 184412716 255679296 97.03 182218312 5732208 85131732 32.31 118243332 128580140 1936
07:40:04 AM 7740504 184479268 255768288 97.06 182364476 5741740 85031196 32.27 118583244 128329280 2040
07:50:04 AM 7420276 184280780 256088516 97.18 182467656 5761556 84965956 32.24 119011660 128213368 3252
08:00:03 AM 7192792 184356568 256316000 97.27 182765576 5766776 85109100 32.30 119255540 128207600 3660
08:10:04 AM 6920976 184319864 256587816 97.37 182937164 5832392 85310572 32.37 119569988 128164088 1448
08:20:04 AM 7177172 184511812 256331620 97.28 182918084 5789408 85157376 32.32 120340676 127135952 1560
08:30:02 AM 6909440 184428624 256599352 97.38 183079164 5812472 85133024 32.31 120972512 126770908 1104
08:40:03 AM 8958964 184705012 254549828 96.60 181305188 5651932 85654424 32.51 108196220 137174192 9448
~~~
Load average is normal:
~~~
06:20:04 AM 0 2717 1.86 1.83 1.81 0
06:30:03 AM 1 2561 0.82 1.38 1.65 0
06:40:04 AM 0 2559 1.48 1.28 1.45 0
06:50:04 AM 0 2727 1.66 1.55 1.54 0
07:00:02 AM 3 2564 1.80 1.75 1.67 0
07:10:04 AM 2 2560 1.72 1.92 1.83 0
07:20:04 AM 1 2723 2.04 2.06 1.96 0
07:30:02 AM 0 2564 1.05 1.52 1.75 0
07:40:04 AM 0 2564 1.60 1.64 1.71 0
07:50:04 AM 0 2704 1.42 1.70 1.76 0
08:00:03 AM 2 2559 2.52 1.87 1.79 0
08:10:04 AM 2 2565 1.32 1.86 1.87 0
08:20:04 AM 1 2564 1.31 1.74 1.82 0
08:30:02 AM 6 2567 2.23 1.62 1.66 0
08:40:03 AM 10 2599 5.53 4.55 3.03 0
~~~
I noticed High block read:
~~~
06:40:04 AM 820.02 648.40 171.62 185589.35 8831.89
06:50:04 AM 787.46 677.09 110.37 190159.80 1423.48
07:00:02 AM 800.44 711.38 89.06 199348.78 1112.41
07:10:04 AM 953.48 739.67 213.82 229423.06 2424.51
07:20:04 AM 1121.73 869.30 252.43 234740.23 3025.41
07:30:02 AM 982.46 773.94 208.52 260530.12 2550.83
07:40:04 AM 935.18 751.69 183.49 228753.50 2512.17
07:50:04 AM 988.02 836.12 151.90 241907.94 1752.07
08:00:03 AM 995.31 717.67 277.64 237473.34 3000.98
08:10:04 AM 2233.51 664.50 1569.02 214436.12 27850.18
08:20:04 AM 3005.57 2233.33 772.23 332121.67 14138.59
08:30:02 AM 850.32 694.54 155.78 191332.81 2508.30
08:40:03 AM 3495.29 993.85 2501.43 192695.07 114584.07
~~~
Disk utilization is quiet high (reaching above 90%, obviously will cause slow performance)
~~~
05:50:03 AM sdj 134.05 20846.78 145.42 156.60 2.46 18.34 7.24 97.00
06:00:02 AM sdj 131.87 19737.11 411.62 152.79 2.58 19.54 7.25 95.64
06:00:02 AM sdg 113.06 17066.43 401.35 154.50 2.12 18.71 7.19 81.31
06:10:04 AM sdj 131.57 19885.95 344.87 153.77 2.56 19.46 7.26 95.49
06:10:04 AM sdg 114.46 17235.29 385.35 153.95 2.16 18.85 7.15 81.85
06:40:04 AM sdg 129.39 20341.50 326.33 159.73 3.40 26.25 6.95 89.89
06:50:04 AM sde 138.27 21588.66 9.56 156.20 3.96 28.67 6.82 94.30
06:50:04 AM sdg 123.47 19540.45 11.25 158.35 3.21 25.96 7.00 86.43
07:00:02 AM sde 138.52 22137.14 5.12 159.84 3.84 27.70 6.83 94.61
07:00:02 AM sdg 122.04 19462.56 3.37 159.50 3.14 25.70 7.04 85.96
07:10:04 AM sdi 132.11 20481.59 4.28 155.07 3.60 27.24 6.74 89.06
07:10:04 AM sde 135.21 20957.56 9.24 155.07 3.94 29.11 6.80 91.89
07:10:04 AM sdg 131.55 20317.20 5.58 154.48 3.75 28.50 6.84 89.92
07:20:04 AM sdi 133.90 20483.81 6.98 153.04 3.66 27.37 6.75 90.32
07:20:04 AM sde 129.25 19781.57 6.75 153.11 3.57 27.65 6.78 87.64
07:20:04 AM sdg 130.36 20006.79 7.55 153.53 3.49 26.75 6.86 89.47
07:30:02 AM sdf 136.53 21564.84 14.72 158.06 4.08 29.85 6.95 94.88
07:30:02 AM sdd 143.00 21101.37 16.25 147.67 2.59 18.12 6.66 95.29
07:30:02 AM sdi 135.95 21021.53 13.74 154.73 3.72 27.35 6.82 92.69
07:30:02 AM sdh 106.39 26908.52 11.10 253.03 2.44 22.90 8.21 87.29
07:40:04 AM sdf 135.99 20389.66 10.89 150.01 4.06 29.83 6.87 93.48
07:40:04 AM sdd 141.39 19957.73 16.54 141.27 2.73 19.29 6.61 93.52
07:40:04 AM sdi 134.45 19627.73 8.92 146.05 3.74 27.83 6.71 90.17
07:40:04 AM sdh 107.22 25501.50 11.06 237.95 2.52 23.51 8.08 86.63
07:50:04 AM sdf 136.31 20027.55 5.89 146.97 3.99 29.30 6.78 92.42
07:50:04 AM sdd 143.84 19977.27 5.11 138.92 2.73 18.95 6.46 92.86
07:50:04 AM sdh 117.50 23465.00 6.42 199.76 2.75 23.44 7.31 85.91
08:00:03 AM sdf 135.66 20959.82 20.65 154.66 3.89 28.71 6.91 93.70
08:00:03 AM sdd 143.10 21179.39 14.06 148.10 2.59 18.08 6.75 96.56
08:00:03 AM sdh 140.77 21242.44 15.24 151.01 3.56 25.29 6.32 88.95
08:10:04 AM sdf 135.20 20827.60 3.58 154.07 3.92 28.98 6.93 93.65
08:10:04 AM sdd 145.74 21302.24 2.78 146.18 2.57 17.62 6.68 97.29
08:10:04 AM sdh 143.34 21507.47 4.29 150.08 3.54 24.72 6.31 90.46
08:20:04 AM sdd 146.15 21059.81 159.45 145.19 3.13 21.39 6.27 91.58
08:30:02 AM sdd 156.02 22307.99 23.02 143.13 4.52 29.00 6.00 93.59
08:40:03 AM sdd 151.17 21887.74 369.49 147.23 4.48 29.64 6.21 93.88
~~~
Network utilization is low:
~~~
07:00:02 AM IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s %ifutil
07:50:04 AM eth3 3967.72 4227.08 11208.65 13429.43 0.00 0.00 0.60 0.11
07:50:04 AM bond1 8026.98 8341.91 22885.19 26816.57 0.00 0.00 0.70 0.11
07:50:04 AM eth0 225.25 2265.20 76.59 3167.11 0.00 0.00 0.60 0.10
07:50:04 AM bond0 380.63 4425.63 91.85 6325.13 0.00 0.00 0.70 0.10
07:50:04 AM eth1 155.38 2160.44 15.25 3158.01 0.00 0.00 0.10 0.10
07:50:04 AM eth2 4059.27 4114.83 11676.54 13387.13 0.00 0.00 0.10 0.11
08:00:02 AM eth3 2956.72 2900.38 4310.54 4360.87 0.00 0.00 0.60 0.04
08:00:02 AM bond1 5795.80 5723.51 8174.45 8377.80 0.00 0.00 0.70 0.03
08:00:02 AM eth0 92.28 254.05 64.77 256.93 0.00 0.00 0.60 0.01
08:00:02 AM bond0 154.41 415.61 118.54 418.84 0.00 0.00 0.70 0.01
08:00:02 AM eth1 62.13 161.57 53.78 161.92 0.00 0.00 0.10 0.01
08:00:02 AM eth2 2839.08 2823.13 3863.91 4016.93 0.00 0.00 0.10 0.03
08:10:04 AM eth3 5674.71 7149.43 19677.73 33063.69 0.00 0.00 0.60 0.27
08:10:04 AM bond1 11895.28 14652.54 41230.98 68568.04 0.00 0.00 0.70 0.28
08:10:04 AM eth0 104.27 957.90 26.45 1291.77 0.00 0.00 0.60 0.04
08:10:04 AM bond0 265.83 1882.56 126.39 2580.27 0.00 0.00 0.70 0.04
08:10:04 AM eth1 161.56 924.66 99.94 1288.50 0.00 0.00 0.10 0.04
08:10:04 AM eth2 6220.56 7503.12 21553.25 35504.35 0.00 0.00 0.10 0.29
08:20:04 AM eth3 4779.80 5680.79 12828.09 20859.93 0.00 0.00 0.60 0.17
08:20:04 AM bond1 9923.10 11354.47 26486.59 41782.66 0.00 0.00 0.70 0.17
08:20:04 AM eth0 118.79 1176.19 34.19 1675.39 0.00 0.00 0.60 0.05
08:20:04 AM bond0 215.43 2062.88 75.23 2907.40 0.00 0.00 0.70 0.05
08:20:04 AM eth1 96.64 886.70 41.04 1232.01 0.00 0.00 0.10 0.04
08:20:04 AM eth2 5143.30 5673.68 13658.50 20922.73 0.00 0.00 0.10 0.17
08:30:02 AM eth3 3696.28 4296.09 8136.76 13917.48 0.00 0.00 0.60 0.11
08:30:02 AM bond1 7605.12 8614.60 16916.80 28081.38 0.00 0.00 0.70 0.12
~~~
Total number of socket reaching around 11k. So this includes OSD connections+RGW connections+ssh and other housekeeping connections.
~~~
07:00:02 AM totsck tcpsck udpsck rawsck ip-frag tcp-tw
07:10:04 AM 11345 10945 8 0 0 68
07:20:04 AM 11680 11280 8 0 0 297
07:30:03 AM 10996 10596 8 0 0 1170
07:40:04 AM 12011 11611 8 0 0 17
07:50:04 AM 11482 11080 8 0 0 34
08:00:02 AM 11993 11593 8 0 0 217
08:10:04 AM 11864 11464 8 0 0 29
08:20:04 AM 12016 11616 8 0 0 47
08:30:02 AM 11182 10782 8 0 0 741
08:40:04 AM 10802 10402 8 0 0 826
08:50:04 AM 11051 10651 8 0 0 475
09:00:03 AM 11498 11098 8 0 0 106
09:10:04 AM 11915 11515 8 0 0 53
09:20:04 AM 11081 10681 8 0 0 446
09:30:02 AM 12027 11627 8 0 0 585
09:40:03 AM 11762 11338 8 0 0 96
~~~
Requested below details from customer:
The number of active connections to RGW:
~~~
$ cd /var/run/ceph/<fsid>
$ while sleep 60 ; do echo -n "$(date):" ; ceph --admin-daemon ceph-client.rgw.*.asok perf dump |grep "qactive" ; done >> /tmp/rgw_qactive #### Let is run for around an hour and then stop it
~~~
The configuration value of rgw_max_concurrent_requests & rgw_thread_pool_size. I am hoping they are set to default value
~~~
$ ceph --admin-daemon ceph-client.rgw.*.asok config show | egrep -i "rgw_max_concurrent_requests|rgw_thread_pool_size"
~~~
We may need to suggest to increase rgw_max_concurrent_requests & rgw_thread_pool_size.
Please let me know your thoughts.
Thanks,
Tridibesh
--- Additional comment from Matt Benjamin (redhat) on 2023-06-28 11:36:26 UTC ---
Hi Tridibesh,
The frequent notify timeouts indicate a problem, and might be related, but I doubt they are the cause. The timeout could indicate that an RGW has crashed or hung, but could also indicate a network problem.
To debug the main issue, I think what we would be interested in is log information from an attempted upload, that failed. That might not be related to the notify timeout (unless it crashed an RGW, and other RGWs are now timing out reaching it.) Also, though, Tesco is using custom logic (by us) to renew STS credentials in precisely the case of a long-running job, and it may be important to figure out of maybe the actual cause of failure is a problem with a token renewal that probably be expected to have happened. We might be able to compare the request run time with the token expiration rules to make a correlation.
Matt
--- Additional comment from Casey Bodley on 2023-06-28 13:10:50 UTC ---
RGWSI_Notify::robust_notify() is called when we write changes to a metadata object. it broadcasts a message to other radosgws (and radosgw-admins) so they can invalidate their metadata cache of that object. robust_notify() waits up to 15s for all radosgw[-admin]s to respond before continuing. so if any of those clients are unresponsive, you'd see this connection timeout error
i recently found that AssumeRole was writing an unnecessary update to the user metadata - see the linked ceph tracker issue and pull request. given that their workload relies on AssumeRole, this metadata write is likely the operation that ends up failing with "Notify failed on object xx.rgw.meta:users.uid:xxx". this metadata write is unlikely to *cause* the connection timeout errors, but providing that fix may reduce their frequency and lessen the impact on clients
in the past, we've seen these kind of timeouts because of a defunct radosgw-admin command that never shut down cleanly. you can check for this by listing all of the endpoints that are registered for these notifications with the rados tool:
$ rados -p xx.rgw.control listwatchers notify.0
watcher=192.168.245.130:0/803955432 client.4147 cookie=94838542249216
--- Additional comment from Matt Benjamin (redhat) on 2023-06-28 16:12:37 UTC ---
Tridibesh,
If the customer has a test environment available, we could make a test fix with the change Casey is describing.
Matt
--- Additional comment from Tridibesh Chakraborty on 2023-06-29 06:39:55 UTC ---
Hi Matt & Casey,
Thank you for picking this up and responding with your findings.
I can cross verify with customer whether they have any test environment to apply the fix and check there. But if I am not wrong, when last time when I checked with them for a test environment (initially multisite and later standalone setup), they told they didn't have. But better I can have this validated with them and confirm you back.
I am keeping this as a pending action item on me.
Thanks,
Tridibesh
--- Additional comment from Tridibesh Chakraborty on 2023-06-30 11:50:08 UTC ---
Hi Matt/Casey,
I have shared the analysis with the customer and also asked whether there is any test environment available to test the fix. I am yet to receive a response from them.
BTW, for my own reference, do we have any internal doc or article which explains these various types of alerts in debug logs of RGW in depth (e.g. RGWSI_Notify::robust_notify()) and what is their significance, so that atleast can look up there and correlate these debug messages?
Thanks,
Tridibesh
--- Additional comment from Matt Benjamin (redhat) on 2023-06-30 11:55:43 UTC ---
Tridibesh,
I'm not aware of any special doc on these messages, they're part of an internal workflow. If there is anything, it might be in a KCS.
Matt
--- Additional comment from Tridibesh Chakraborty on 2023-06-30 12:02:48 UTC ---
Hi Matt,
Thanks. Let me check for the KCSs to see if there are any indepth discussion there.
Regarding the confirmation of test environment, looks like customer is in leave till 3rd July. So we may not receive any response till then.
Thanks,
Tridibesh
--- Additional comment from Matt Benjamin (redhat) on 2023-07-05 12:42:05 UTC ---
bipin/vikhyat, could you perhaps help triage the customer interaction here? this risks getting stalled...
Matt
--- Additional comment from Tridibesh Chakraborty on 2023-07-05 13:13:32 UTC ---
Hello Matt,
Customer was on leave till yesterday. So I have put a reminder in the case again. Hopefully will get a response back today or tomorrow.
Thanks,
Tridibesh
--- Additional comment from Tridibesh Chakraborty on 2023-07-06 07:10:27 UTC ---
Hello Matt/Casey,
I have got a response from customer.
---
We have only our backup clusters for testing, however that clusters are not integrated with RHSSO for AD integration. Could you please let us know whether we can test this issue on a cluster, which has only local users and local roles ?
---
Can you please confirm if this will work or they need to have similar setup like their production with OIDC and assume role? I am guessing they need to have similar setup.
Thanks,
Tridibesh
--- Additional comment from Casey Bodley on 2023-07-06 14:02:34 UTC ---
(In reply to Tridibesh Chakraborty from comment #13)
>
> Can you please confirm if this will work or they need to have similar setup
> like their production with OIDC and assume role? I am guessing they need to
> have similar setup.
they would want to recreate a workload that includes AssumeRole for this testing. i believe you can do this with local users/roles
--- Additional comment from Tridibesh Chakraborty on 2023-07-18 08:09:12 UTC ---
Hello Casey/Matt,
I am not getting much response from customer on the test cluster setup. So if we assume that customer can't setup a test cluster and test the fix, can we have some other alternative approach to validate the solution and provide a fix to customer?
Thanks,
Tridibesh
--- Additional comment from Casey Bodley on 2023-07-18 13:12:05 UTC ---
(In reply to Tridibesh Chakraborty from comment #15)
> Hello Casey/Matt,
>
> I am not getting much response from customer on the test cluster setup. So
> if we assume that customer can't setup a test cluster and test the fix, can
> we have some other alternative approach to validate the solution and provide
> a fix to customer?
afaik we don't have a reproducer for the customer's issue (notify timeouts). we have a fix for something that's related, and we can test that on its own (it was tested/merged upstream). but we don't know that it will resolve the customer case
are the RGWSI_Notify::robust_notify errors still frequent? in https://bugzilla.redhat.com/show_bug.cgi?id=2214981#c5 above i recommended debugging with 'rados listwatchers'. has anyone tried that?
--- Additional comment from Bipin Kunal on 2023-07-20 05:42:42 UTC ---
Removing the need info as support is already on top of it.
--- Additional comment from Tridibesh Chakraborty on 2023-07-25 10:13:56 UTC ---
Hello Casey,
I have asked the output of the command and got below output from the customer infra.
~~~
[root@whds0001 ~]# rados -p wdc-prod.rgw.control listwatchers notify.0
watcher=172.17.182.29:0/3406395733 client.8502909 cookie=93848966864896
watcher=172.17.182.150:0/3446051895 client.8503179 cookie=93855039410560
watcher=172.17.182.77:0/276458263 client.8501351 cookie=93878190430848
watcher=172.17.182.28:0/3862430847 client.8487829 cookie=93930343931904
watcher=172.17.182.76:0/3194749452 client.8500730 cookie=93986595453568
watcher=172.17.182.149:0/2617041512 client.8488282 cookie=93997356120704
watcher=172.17.182.93:0/1053154184 client.8503011 cookie=94113522858624
watcher=172.17.182.92:0/2122245614 client.8488057 cookie=94127103244928
watcher=172.17.182.108:0/3092808963 client.8488126 cookie=94186202534272
watcher=172.17.182.166:0/3974953876 client.8503212 cookie=94216629256192
watcher=172.17.182.118:0/457252313 client.8501483 cookie=94234440880512
watcher=172.17.182.165:0/341660400 client.8502696 cookie=94385907513984
watcher=172.17.182.45:0/2356641433 client.8501297 cookie=94391517215360
watcher=172.17.182.117:0/3327240474 client.8502540 cookie=94471223401088
watcher=172.17.182.109:0/93312324 client.8501405 cookie=94480676133504
watcher=172.17.182.134:0/2542382258 client.8503158 cookie=94531319347840
watcher=172.17.182.12:0/1970182782 client.8500505 cookie=94551361460864
watcher=172.17.182.61:0/3106670398 client.8488594 cookie=94593773992576
watcher=172.17.182.181:0/3652166781 client.8501180 cookie=94604185006464
watcher=172.17.182.44:0/1666038841 client.8502252 cookie=94681847044736
watcher=172.17.182.133:0/4008072637 client.8500967 cookie=94689492099072
watcher=172.17.182.60:0/2562545797 client.8500685 cookie=94858030715264
watcher=172.17.182.182:0/3240503417 client.8488873 cookie=94893928585856
[root@whds0001 ~]#
~~~
Thanks,
Tridibesh
--- Additional comment from Casey Bodley on 2023-07-25 12:37:52 UTC ---
(In reply to Tridibesh Chakraborty from comment #18)
> Hello Casey,
>
> I have asked the output of the command and got below output from the
> customer infra.
thanks. the next step is to correlate the IPs with their running radosgws, and look for watchers that don't belong there. you might find a radosgw-admin command running somewhere they didn't intend
p.s. are the RGWSI_Notify::robust_notify errors still frequent? are we still collecting data about this?
--- Additional comment from Tridibesh Chakraborty on 2023-07-31 08:51:03 UTC ---
Hi Casey,
Customer confirmed that these are the IPs of their Radosgws. Also they checked for any running radosgw-command on them and did not find any. Regarding the error message, they are still seeing them. When they checked last time it was around 1500 messages for a single day (26th July) and it may be less than what they were getting initially.
Thanks,
Tridibesh
--- Additional comment from Matt Benjamin (redhat) on 2023-08-03 13:48:31 UTC ---
Hi Tridibesh,
Casey and Adam have two changes that Casey thinks won't fix the root cause but 1) will reduce incidence and 2) provide debugging information when it reproducies.
Could you ask whether the customer could potentially run a debugging test fix?
Matt
--- Additional comment from Tridibesh Chakraborty on 2023-08-04 02:31:50 UTC ---
Hi Matt,
Sure, I have asked customer if they can run a debugging test fix. So if I understand correctly, you will be providing a testfix which will enable the verbose debugging of the issue so that we have more information to get this fixed. Is it correct?
Thanks,
Tridibesh
--- Additional comment from Tridibesh Chakraborty on 2023-08-11 02:54:14 UTC ---
Hi Casey/Matt,
Customer have couple of queries.
>What kind of fix are suggested by engineering team?
If I am not wrong, this fix will not eliminate the root cause, it will just reduce the incident and provide more debugging information so that we can trace back the actual cause and provide a fix for that.
>This is our production storage, we should test any fix in lower environments before applying in prod. But we don't have this kind of issues in other environments.
Customer don't have exact replica of their production system where they can test the fix. So, if that is the case, then I guess we need to provide the fix via next Z release as customer will not be able to validate it and we need to carry out the necessary testing.
Is my understanding correct? If not, please share your thoughts.
Thanks,
Tridibesh
--- Additional comment from Matt Benjamin (redhat) on 2023-09-10 17:28:31 UTC ---
(As there is no RCA yet, moving to rhcs-7.0z1.)
--- Additional comment from Tridibesh Chakraborty on 2023-09-15 01:11:30 UTC ---
Hi Matt/Casey,
Can you please let me know if we have any code fix/workaround for this bug on RHCS 5.3z5? Looks like customer is unhappy with their experience with Ceph and they don't see much meaningful progress on this issue. There is an active escalation on this case now. So if I understand correctly on RHCS 5.3z5 the issue is not fixed, but it has the code fix to reduce the occurrence of this error messages and it will try to capture more meaningful debug logs so that we can find the root cause of this issue and provide a fix to customer. Please correct me if my understanding is wrong.
Thanks,
Tridibesh