# Clone of 2184612 for 6.1z8

### Description of problem:

* RHCS 5.3 is running in several different Ceph clusters. The customer (Inditex) uses the Ceph restful mgr module to make API calls. At random times, and on different clusters, these calls stop working until the mgr service is restarted:

```
❯ curl -k -X POST "https://cnec1cs1mgrlb1.ecommerce.inditex.grp/api/auth" -H "Accept: application/vnd.ceph.api.v1.0+json" -H "Content-Type: application/json" -d '{"username": "ceph-inventory", "password": "****"}' | jq '.token'
curl: (55) Send failure: Broken pipe

❯ curl -k -X POST "https://cnec1ocs1mgrlb1.ecommerce.inditex.grp/api/auth" -H "Accept: application/vnd.ceph.api.v1.0+json" -H "Content-Type: application/json" -d '{"username": "ceph-inventory", "password": "****"}' | jq '.token'
curl: (55) Send failure: Broken pipe

❯ curl -k -X POST "https://ieec1cs1mgrlb1.ecommerce.inditex.grp/api/auth" -H "Accept: application/vnd.ceph.api.v1.0+json" -H "Content-Type: application/json" -d '{"username": "ceph-inventory", "password": "****"}' | jq '.token'
curl: (55) Send failure: Broken pipe

❯ curl -k -X POST "https://ieec1ocs1mgrlb1.ecommerce.inditex.grp/api/auth" -H "Accept: application/vnd.ceph.api.v1.0+json" -H "Content-Type: application/json" -d '{"username": "ceph-inventory", "password": "****"}' | jq '.token'
curl: (55) Send failure: Broken pipe
```

* Restarting the restful module itself does not help:

```
[root@cnec1cs1o01 ~]$ sudo ceph mgr module disable restful && sudo ceph mgr module enable restful
[root@cnec1cs1o01 ~]$

❯ curl -k -X POST "https://cnec1cs1mgrlb1.ecommerce.inditex.grp/api/auth" -H "Accept: application/vnd.ceph.api.v1.0+json" -H "Content-Type: application/json" -d '{"username": "ceph-inventory", "password": "****"}' | jq '.token'
curl: (35) Recv failure: Connection reset by peer
```

* The active MGR has to be failed to recover access to the API:

```
[root@cnec1cs1o01 ~]# ceph -s
  cluster:
    id:     e4304a5b-2f71-4ccc-a6a6-652c59a9e671
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum cnec1cs1o01,cnec1cs1o02,cnec1cs1o03 (age 4w)
    mgr: cnec1cs1o01.dvbgnj(active, since 5m), standbys: cnec1cs1o02.bqqjaa, cnec1cs1o03.hkunpo
    osd: 60 osds: 60 up (since 4w), 60 in (since 8M)
    rgw: 6 daemons active (6 hosts, 2 zones)

  data:
    pools:   15 pools, 1121 pgs
    objects: 5.49M objects, 2.3 TiB
    usage:   7.2 TiB used, 79 TiB / 86 TiB avail
    pgs:     1121 active+clean

  io:
    client:   304 KiB/s rd, 0 B/s wr, 304 op/s rd, 0 op/s wr

[root@cnec1cs1o01 ~]# ceph mgr fail cnec1cs1o01.dvbgnj

❯ curl -k -X POST "https://cnec1cs1mgrlb1.ecommerce.inditex.grp/api/auth" -H "Accept: application/vnd.ceph.api.v1.0+json" -H "Content-Type: application/json" -d '{"username": "ceph-inventory", "password": "****"}' | jq '.token'
"eyJ*********************q2o"
```

* The customer added the -v option to the curl command to determine where the request gets stuck. The TLS handshake completes and the request body is sent, but no HTTP response ever arrives; the last data received are TLS NewSessionTicket messages:

```
[digfernandodlhdt@ieec1cs1o01 ~]$ curl -k -X POST "https://ieec1cs1o01.ecommerce.inditex.grp:8443/api/auth" -H "Accept: application/vnd.ceph.api.v1.0+json" -H "Content-Type: application/json" -d '{"username": "ceph-inventory", "password": "****"}' -v
Note: Unnecessary use of -X or --request, POST is already inferred.
*   Trying 10.68.31.41...
* TCP_NODELAY set
* Connected to ieec1cs1o01.ecommerce.inditex.grp (10.68.31.41) port 8443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: /etc/pki/tls/certs/ca-bundle.crt
  CApath: none
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.3 (IN), TLS handshake, [no content] (0):
* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
* TLSv1.3 (IN), TLS handshake, [no content] (0):
* TLSv1.3 (IN), TLS handshake, Certificate (11):
* TLSv1.3 (IN), TLS handshake, [no content] (0):
* TLSv1.3 (IN), TLS handshake, CERT verify (15):
* TLSv1.3 (IN), TLS handshake, [no content] (0):
* TLSv1.3 (IN), TLS handshake, Finished (20):
* TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.3 (OUT), TLS handshake, [no content] (0):
* TLSv1.3 (OUT), TLS handshake, Finished (20):
* SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384
* ALPN, server did not agree to a protocol
* Server certificate:
*  subject: O=ECOMMERCE.INDITEX.GRP; CN=ieec1cs1n01.ecommerce.inditex.grp
*  start date: Apr 28 16:14:47 2021 GMT
*  expire date: Apr 29 16:14:47 2023 GMT
*  issuer: O=ECOMMERCE.INDITEX.GRP; CN=Certificate Authority
*  SSL certificate verify ok.
* TLSv1.3 (OUT), TLS app data, [no content] (0):
> POST /api/auth HTTP/1.1
> Host: ieec1cs1o01.ecommerce.inditex.grp:8443
> User-Agent: curl/7.61.1
> Accept: application/vnd.ceph.api.v1.0+json
> Content-Type: application/json
> Content-Length: 64
>
* upload completely sent off: 64 out of 64 bytes
* TLSv1.3 (IN), TLS handshake, [no content] (0):
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* TLSv1.3 (IN), TLS handshake, [no content] (0):
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
^C
```

* Sosreports were collected from mgr nodes of the different Ceph clusters where this issue was observed. They are on support-shell, in directory /cases/03459963:

  - 0030-sosreport-ieec1cs1o01-03459963-2023-03-13-ejwuuwl.tar.xz --> restful API **not restored** before uploading.
  - 0040-sosreport-ieec1ocs1n001-03459963-2023-03-14-meuoavq.tar.xz --> restful API restored before uploading the sos report.
  - 0050-sosreport-cnec1ocs1n001-03459963-2023-03-14-nlueqys.tar.xz --> restful API restored before uploading the sos report.

* The sosreport from node ieec1cs1o01 is the only one gathered while the issue was present. Reviewing its ss_-peaonmi output, there are almost 11K connections to port 8443 in CLOSE-WAIT state:

```
[...]
tcp   CLOSE-WAIT  328  0  10.68.31.41:8443  10.68.31.3:43428  users:(("ceph-mgr",pid=6591,fd=7051))   uid:167 ino:100729770 sk:28e47 -->
tcp   CLOSE-WAIT  328  0  10.68.31.41:8443  10.68.31.4:33658  users:(("ceph-mgr",pid=6591,fd=26962))  uid:167 ino:109079723 sk:293d2 -->
tcp   CLOSE-WAIT  328  0  10.68.31.41:8443  10.68.31.3:54855  users:(("ceph-mgr",pid=6591,fd=22457))  uid:167 ino:106448541 sk:28e48 -->
tcp   CLOSE-WAIT  328  0  10.68.31.41:8443  10.68.31.3:38439  users:(("ceph-mgr",pid=6591,fd=1991))   uid:167 ino:99138254  sk:28e49 -->
[...]

grep ":8443" ss_-peaonmi | grep CLOSE-WAIT | wc -l
10900
```

* Most of these connections come from IPs 10.68.31.3 and 10.68.31.4:

```
grep ":8443" ss_-peaonmi | grep CLOSE-WAIT | awk '{print $6}' | awk -F: '{print $1}' | sort | uniq -c
    114 10.68.30.126
   5374 10.68.31.3
   5399 10.68.31.4
     13 10.68.31.41
```

* These IPs belong to two F5 load balancers. The pattern is consistent: every time this issue is observed, a large number of CLOSE-WAIT connections appears.
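* For checking a live system (rather than the ss output captured in a sosreport), a small script along the following lines could be run on the active mgr node to watch the CLOSE-WAIT build-up. This is only a sketch: the port, the alert threshold, and the IPv4-only parsing are assumptions based on the output above.

```bash
#!/usr/bin/env bash
# Sketch: count CLOSE-WAIT sockets on the mgr's HTTPS port and break them
# down by peer address. PORT and THRESHOLD are placeholder values.
PORT=8443
THRESHOLD=1000

# With a state filter, ss omits the State column; the peer address:port is the last field.
COUNT=$(ss -Htn state close-wait "( sport = :${PORT} )" | wc -l)
echo "CLOSE-WAIT sockets on :${PORT}: ${COUNT}"

# Per-peer breakdown (live equivalent of the grep/awk analysis of ss_-peaonmi above).
# Assumes IPv4 peers, as in the output above.
ss -Htn state close-wait "( sport = :${PORT} )" \
    | awk '{split($NF, a, ":"); print a[1]}' | sort | uniq -c | sort -rn

if [ "${COUNT}" -ge "${THRESHOLD}" ]; then
    echo "WARNING: ${COUNT} CLOSE-WAIT sockets on port ${PORT}" | logger -t mgr-closewait
fi
```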
* We waited for another occurrence of this issue and collected debug logs for the MGR and the restful module:

```
ceph config set mgr debug_mgr 20
ceph config set mgr mgr/dashboard/log_level debug
```

This information is also on support-shell, in folder /cases/03459963/sosreport-20230321-090454/cnec1cs1o02.ecommerce.inditex.grp

The debug MGR logs start at this point:

```
2023-03-21T08:54:03.199+0000 7f1502b68700  0 [progress INFO root] Processing OSDMap change 26713..26713
2023-03-21T08:54:04.194+0000 7f16dd1b0700 10 mgr ms_dispatch2 active kv_data(v2166132 prefix config/, incremental, 1 keys) v1
2023-03-21T08:54:04.194+0000 7f16dd1b0700 10 mgr ms_dispatch2 kv_data(v2166132 prefix config/, incremental, 1 keys) v1
2023-03-21T08:54:04.194+0000 7f16dd1b0700 10 mgr update_kv_data incremental update on config/
2023-03-21T08:54:04.194+0000 7f16dd1b0700 20 mgr update_kv_data set config/mgr/debug_mgr = 20/20
```

and end at:

```
2023-03-21T09:00:21.829+0000 7f1557b72700  0 [balancer DEBUG root] setting log level based on debug_mgr: INFO (2/5)
2023-03-21T09:00:21.830+0000 7f1557b72700  0 [cephadm DEBUG root] setting log level based on debug_mgr: INFO (2/5)
2023-03-21T09:00:21.833+0000 7f1557b72700  0 [crash DEBUG root] setting log level based on debug_mgr: INFO (2/5)
2023-03-21T09:00:21.835+0000 7f1557b72700  0 [devicehealth DEBUG root] setting log level based on debug_mgr: INFO (2/5)
2023-03-21T09:00:21.837+0000 7f1557b72700  0 [orchestrator DEBUG root] setting log level based on debug_mgr: INFO (2/5)
2023-03-21T09:00:21.837+0000 7f1557b72700  0 [pg_autoscaler DEBUG root] setting log level based on debug_mgr: INFO (2/5)
2023-03-21T09:00:21.839+0000 7f1557b72700  0 [progress DEBUG root] setting log level based on debug_mgr: INFO (2/5)
```

So we have approximately 6 minutes of debug information. Unfortunately, I could not spot anything wrong in it. I would appreciate it if you could have a look in case I am missing something.

* On a third occurrence of this issue, the customer collected a tcpdump on the Ceph node, filtered on the IPs of the load balancers:

```
timeout 5m tcpdump -vvv -enni bond0 host 10.71.64.3 or host 10.71.64.4 -w /dev/shm/tcpdump.pcap
dropped privs to tcpdump
tcpdump: listening on bond0, link-type EN10MB (Ethernet), capture size 262144 bytes
735 packets captured
827 packets received by filter
0 packets dropped by kernel
```

This data is also on support-shell, file /cases/03459963/0170-tcpdump.pcap

In the capture, the F5 load balancer closes its side of the connection by sending a FIN,ACK; the Ceph mgr ACKs it but never sends a FIN of its own, which leaves the socket stuck in CLOSE-WAIT.
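To confirm that this pattern holds across the whole capture, and not just for the two streams shown below, a tshark comparison along these lines could be run against the pcap. This is only a sketch; it assumes tshark is available and uses a placeholder capture file name:

```bash
# Streams in which a load balancer sent a FIN towards the mgr (dst port 8443).
tshark -r tcpdump.pcap -Y 'tcp.flags.fin == 1 && tcp.dstport == 8443' \
       -T fields -e tcp.stream | sort -u > client_fin.txt

# Streams in which the mgr sent a FIN of its own (src port 8443).
tshark -r tcpdump.pcap -Y 'tcp.flags.fin == 1 && tcp.srcport == 8443' \
       -T fields -e tcp.stream | sort -u > server_fin.txt

# Streams the load balancer closed but the mgr never did; these are the
# connections expected to be left in CLOSE-WAIT on the mgr side.
comm -23 client_fin.txt server_fin.txt
```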
The examples below are taken from this Wireshark output:

```
625 248.826240 10.71.64.3   → 10.71.64.140 TCP 74  5740 → 8443 [SYN] Seq=0 Win=29200 Len=0 MSS=1460 SACK_PERM=1 TSval=579299640 TSecr=0 WS=128
626 248.826310 10.71.64.140 → 10.71.64.3   TCP 74  8443 → 5740 [SYN, ACK] Seq=0 Ack=1 Win=26844 Len=0 MSS=8960 SACK_PERM=1 TSval=3659855381 TSecr=579299640 WS=8192
627 248.827191 10.71.64.3   → 10.71.64.140 TCP 66  5740 → 8443 [ACK] Seq=1 Ack=1 Win=29312 Len=0 TSval=579299641 TSecr=3659855381
629 248.827265 10.71.64.140 → 10.71.64.3   TCP 66  8443 → 5740 [ACK] Seq=1 Ack=322 Win=32768 Len=0 TSval=3659855381 TSecr=579299641
631 248.828654 10.71.64.3   → 10.71.64.140 TCP 66  5740 → 8443 [ACK] Seq=322 Ack=138 Win=30336 Len=0 TSval=579299643 TSecr=3659855382
634 248.838287 10.71.64.3   → 10.71.64.140 TCP 140 [TCP Retransmission] 5740 → 8443 [PSH, ACK] Seq=373 Ack=138 Win=30336 Len=74 TSval=579299653 TSecr=3659855382
635 248.838297 10.71.64.140 → 10.71.64.3   TCP 78  8443 → 5740 [ACK] Seq=138 Ack=447 Win=32768 Len=0 TSval=3659855392 TSecr=579299643 SLE=373 SRE=447
641 253.821406 10.71.64.140 → 10.71.64.3   TCP 66  8443 → 5740 [ACK] Seq=138 Ack=521 Win=32768 Len=0 TSval=3659860376 TSecr=579304636
649 258.791470 10.71.64.140 → 10.71.64.3   TCP 66  8443 → 5740 [ACK] Seq=138 Ack=595 Win=32768 Len=0 TSval=3659865346 TSecr=579309606
666 263.776157 10.71.64.140 → 10.71.64.3   TCP 66  8443 → 5740 [ACK] Seq=138 Ack=669 Win=32768 Len=0 TSval=3659870330 TSecr=579314590
668 263.776179 10.71.64.140 → 10.71.64.3   TCP 66  8443 → 5740 [ACK] Seq=138 Ack=700 Win=32768 Len=0 TSval=3659870330 TSecr=579314590
696 280.020639 10.71.64.3   → 10.71.64.140 TCP 66  5740 → 8443 [FIN, ACK] Seq=700 Ack=138 Win=30336 Len=0 TSval=579330835 TSecr=3659870330
697 280.061373 10.71.64.140 → 10.71.64.3   TCP 66  8443 → 5740 [ACK] Seq=138 Ack=701 Win=32768 Len=0 TSval=3659886616 TSecr=579330835

652 261.356912 10.71.64.4   → 10.71.64.140 TCP 74  36025 → 8443 [SYN] Seq=0 Win=29200 Len=0 MSS=1460 SACK_PERM=1 TSval=579312170 TSecr=0 WS=128
653 261.357004 10.71.64.140 → 10.71.64.4   TCP 74  8443 → 36025 [SYN, ACK] Seq=0 Ack=1 Win=26844 Len=0 MSS=8960 SACK_PERM=1 TSval=4284243230 TSecr=579312170 WS=8192
654 261.358061 10.71.64.4   → 10.71.64.140 TCP 66  36025 → 8443 [ACK] Seq=1 Ack=1 Win=29312 Len=0 TSval=579312171 TSecr=4284243230
656 261.358112 10.71.64.140 → 10.71.64.4   TCP 66  8443 → 36025 [ACK] Seq=1 Ack=322 Win=32768 Len=0 TSval=4284243231 TSecr=579312172
658 261.358897 10.71.64.4   → 10.71.64.140 TCP 66  36025 → 8443 [ACK] Seq=322 Ack=138 Win=30336 Len=0 TSval=579312172 TSecr=4284243232
661 261.368264 10.71.64.4   → 10.71.64.140 TCP 140 [TCP Retransmission] 36025 → 8443 [PSH, ACK] Seq=373 Ack=138 Win=30336 Len=74 TSval=579312182 TSecr=4284243232
662 261.368302 10.71.64.140 → 10.71.64.4   TCP 78  8443 → 36025 [ACK] Seq=138 Ack=447 Win=32768 Len=0 TSval=4284243241 TSecr=579312172 SLE=373 SRE=447
670 266.375512 10.71.64.140 → 10.71.64.4   TCP 66  8443 → 36025 [ACK] Seq=138 Ack=521 Win=32768 Len=0 TSval=4284248249 TSecr=579317189
685 271.333848 10.71.64.140 → 10.71.64.4   TCP 66  8443 → 36025 [ACK] Seq=138 Ack=595 Win=32768 Len=0 TSval=4284253207 TSecr=579322147
691 276.350680 10.71.64.140 → 10.71.64.4   TCP 66  8443 → 36025 [ACK] Seq=138 Ack=669 Win=32768 Len=0 TSval=4284258224 TSecr=579327164
693 276.350702 10.71.64.140 → 10.71.64.4   TCP 66  8443 → 36025 [ACK] Seq=138 Ack=700 Win=32768 Len=0 TSval=4284258224 TSecr=579327164
729 292.314216 10.71.64.4   → 10.71.64.140 TCP 66  36025 → 8443 [FIN, ACK] Seq=700 Ack=138 Win=30336 Len=0 TSval=579343127 TSecr=4284258224
730 292.354370 10.71.64.140 → 10.71.64.4   TCP 66  8443 → 36025 [ACK] Seq=138 Ack=701 Win=32768 Len=0 TSval=4284274228 TSecr=579343127
```

We need some assistance to resolve this issue. The Ceph MGR keeps these connections open even though the F5 load balancers have closed their end of the connection (FIN,ACK received, sockets left in CLOSE-WAIT).

### How reproducible:

Only at the customer site so far. However, similar issues have been reported:

[1] https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/LKB7PCMHREU4SJDEDYCIHOLGXK2MTUHE/
[2] https://bugzilla.redhat.com/show_bug.cgi?id=2135369#c19

### Actual results:

The MGR does not close these connections correctly.

### Expected results:

The MGR should close these connections correctly.

### Additional info:

N/A - the locations of the collected data are posted above.
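One additional note: until the underlying bug is fixed, the `ceph mgr fail` workaround shown in the description could be automated with a small watchdog run periodically from an admin node. This is only a sketch based on the behaviour described above (failing the active mgr restores the API); the URL, credentials, timeout, and logging are placeholders that would need to be adapted and reviewed before use:

```bash
#!/usr/bin/env bash
# Sketch: probe the restful API with a hard timeout and fail the active mgr
# if the request hangs or errors out. All values below are placeholders.
URL="https://cnec1cs1mgrlb1.ecommerce.inditex.grp/api/auth"
TIMEOUT=15

if ! curl -k -sS --max-time "${TIMEOUT}" -o /dev/null \
        -H "Accept: application/vnd.ceph.api.v1.0+json" \
        -H "Content-Type: application/json" \
        -d '{"username": "ceph-inventory", "password": "****"}' \
        "${URL}"; then
    # Look up the currently active mgr and fail it, forcing a standby to take over.
    ACTIVE_MGR=$(ceph mgr stat -f json | python3 -c 'import json,sys; print(json.load(sys.stdin)["active_name"])')
    echo "restful API unresponsive, failing active mgr ${ACTIVE_MGR}" | logger -t mgr-watchdog
    ceph mgr fail "${ACTIVE_MGR}"
fi
```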
See KCS: https://access.redhat.com/solutions/7084560

BR,
Manny
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: Red Hat Ceph Storage 6.1 bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2025:4238