Bug 2308164 - [RHCS 7][Ceph-mgr][Thousands of connections in CLOSE-WAIT status]
Summary: [RHCS 7][Ceph-mgr][Thousands of connections in CLOSE-WAIT status]
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: Ceph-Dashboard
Version: 7.1
Hardware: All
OS: All
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 7.1z3
Assignee: Nizamudeen
QA Contact: Sayalee
Docs Contact: Disha Walvekar
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2024-08-27 14:00 UTC by Michaela Lang
Modified: 2025-02-24 15:42 UTC
CC List: 8 users

Fixed In Version: python-cheroot-10.0.1-4.el9cp, rhceph-container-7-468
Doc Type: Bug Fix
Doc Text:
.`cherrypy` no longer gets stuck during network security scanning
Previously, due to a bug in the cheroot package, `cherrypy` would get stuck during some security scans of the network. As a result, the Ceph Dashboard became unresponsive and the mgr module had to be restarted. With this fix, the cheroot package is updated and the issue is resolved.
Clone Of: 2184612
Environment:
Last Closed: 2025-02-24 15:42:23 UTC
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker RHCEPH-9579 0 None None None 2024-08-27 14:01:26 UTC
Red Hat Issue Tracker RHCSDASH-1574 0 None None None 2024-08-27 14:03:04 UTC
Red Hat Knowledge Base (Solution) 7084560 0 None None None 2024-08-27 17:49:17 UTC
Red Hat Product Errata RHBA-2025:1770 0 None None None 2025-02-24 15:42:26 UTC

Description Michaela Lang 2024-08-27 14:00:44 UTC
# Clone of 2184612 for 7.1z2 

### Description of problem:

* RHCS 5.3 is running in several Ceph clusters. The customer (Inditex) uses the Ceph restful mgr module to make API calls. Sometimes these calls stop working until the mgr service is restarted. This occurs randomly and on different Ceph clusters.

		  ❯ curl -k -X POST "https://cnec1cs1mgrlb1.ecommerce.inditex.grp/api/auth" -H  "Accept: application/vnd.ceph.api.v1.0+json" -H  "Content-Type: application/json" -d '{"username": "ceph-inventory", "password": "****"}' | jq '.token'
		  curl: (55) Send failure: Broken pipe
		  
		  ❯ curl -k -X POST "https://cnec1ocs1mgrlb1.ecommerce.inditex.grp/api/auth" -H  "Accept: application/vnd.ceph.api.v1.0+json" -H  "Content-Type: application/json" -d '{"username": "ceph-inventory", "password": "****"}' | jq '.token'
		  curl: (55) Send failure: Broken pipe
		  
		  ❯ curl -k -X POST "https://ieec1cs1mgrlb1.ecommerce.inditex.grp/api/auth" -H  "Accept: application/vnd.ceph.api.v1.0+json" -H  "Content-Type: application/json" -d '{"username": "ceph-inventory", "password": "****"}' | jq '.token'
		  curl: (55) Send failure: Broken pipe
		  
		  ❯ curl -k -X POST "https://ieec1ocs1mgrlb1.ecommerce.inditex.grp/api/auth" -H  "Accept: application/vnd.ceph.api.v1.0+json" -H  "Content-Type: application/json" -d '{"username": "ceph-inventory", "password": "****"}' | jq '.token'
		  curl: (55) Send failure: Broken pipe
  
* Restarting the restful module itself doesn't help:

		  [root@cnec1cs1o01 ~]$ sudo ceph mgr module disable restful && sudo ceph mgr module enable restful
		  [root@cnec1cs1o01 ~]$
		  
		  ❯ curl -k -X POST "https://cnec1cs1mgrlb1.ecommerce.inditex.grp/api/auth" -H  "Accept: application/vnd.ceph.api.v1.0+json" -H  "Content-Type: application/json" -d '{"username": "ceph-inventory", "password": "****"}' | jq '.token'
		  curl: (35) Recv failure: Connection reset by peer
  

* The MGR needs to be failed to recover access to the API:

		  [root@cnec1cs1o01 ~]# ceph -s
		    cluster:
		      id:     e4304a5b-2f71-4ccc-a6a6-652c59a9e671
		      health: HEALTH_OK
		  
		    services:
		      mon: 3 daemons, quorum cnec1cs1o01,cnec1cs1o02,cnec1cs1o03 (age 4w)
		      mgr: cnec1cs1o01.dvbgnj(active, since 5m), standbys: cnec1cs1o02.bqqjaa, cnec1cs1o03.hkunpo
		      osd: 60 osds: 60 up (since 4w), 60 in (since 8M)
		      rgw: 6 daemons active (6 hosts, 2 zones)
		  
		    data:
		      pools:   15 pools, 1121 pgs
		      objects: 5.49M objects, 2.3 TiB
		      usage:   7.2 TiB used, 79 TiB / 86 TiB avail
		      pgs:     1121 active+clean
		  
		    io:
		      client:   304 KiB/s rd, 0 B/s wr, 304 op/s rd, 0 op/s wr
		  
		  [root@cnec1cs1o01 ~]# ceph mgr fail cnec1cs1o01.dvbgnj
		  
		  ❯ curl -k -X POST "https://cnec1cs1mgrlb1.ecommerce.inditex.grp/api/auth" -H  "Accept: application/vnd.ceph.api.v1.0+json" -H  "Content-Type: application/json" -d '{"username": "ceph-inventory", "password": "****"}' | jq '.token'
		  "eyJ*********************q2o"
  

* The customer added the `-v` option to the curl command to determine where the request gets stuck. The TLS handshake completes, but the request then hangs waiting for the HTTP response:

		  [digfernandodlhdt@ieec1cs1o01 ~]$ curl -k -X POST "https://ieec1cs1o01.ecommerce.inditex.grp:8443/api/auth" -H  "Accept: application/vnd.ceph.api.v1.0+json" -H  "Content-Type: application/json" -d '{"username": "ceph-inventory", "password": "****"}' -v
		  Note: Unnecessary use of -X or --request, POST is already inferred.
		  *   Trying 10.68.31.41...
		  * TCP_NODELAY set
		  * Connected to ieec1cs1o01.ecommerce.inditex.grp (10.68.31.41) port 8443 (#0)
		  * ALPN, offering h2
		  * ALPN, offering http/1.1
		  * successfully set certificate verify locations:
		  *   CAfile: /etc/pki/tls/certs/ca-bundle.crt
		    CApath: none
		  * TLSv1.3 (OUT), TLS handshake, Client hello (1):
		  * TLSv1.3 (IN), TLS handshake, Server hello (2):
		  * TLSv1.3 (IN), TLS handshake, [no content] (0):
		  * TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
		  * TLSv1.3 (IN), TLS handshake, [no content] (0):
		  * TLSv1.3 (IN), TLS handshake, Certificate (11):
		  * TLSv1.3 (IN), TLS handshake, [no content] (0):
		  * TLSv1.3 (IN), TLS handshake, CERT verify (15):
		  * TLSv1.3 (IN), TLS handshake, [no content] (0):
		  * TLSv1.3 (IN), TLS handshake, Finished (20):
		  * TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
		  * TLSv1.3 (OUT), TLS handshake, [no content] (0):
		  * TLSv1.3 (OUT), TLS handshake, Finished (20):
		  * SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384
		  * ALPN, server did not agree to a protocol
		  * Server certificate:
		  *  subject: O=ECOMMERCE.INDITEX.GRP; CN=ieec1cs1n01.ecommerce.inditex.grp
		  *  start date: Apr 28 16:14:47 2021 GMT
		  *  expire date: Apr 29 16:14:47 2023 GMT
		  *  issuer: O=ECOMMERCE.INDITEX.GRP; CN=Certificate Authority
		  *  SSL certificate verify ok.
		  * TLSv1.3 (OUT), TLS app data, [no content] (0):
		  > POST /api/auth HTTP/1.1
		  > Host: ieec1cs1o01.ecommerce.inditex.grp:8443
		  > User-Agent: curl/7.61.1
		  > Accept: application/vnd.ceph.api.v1.0+json
		  > Content-Type: application/json
		  > Content-Length: 64
		  >
		  * upload completely sent off: 64 out of 64 bytes
		  * TLSv1.3 (IN), TLS handshake, [no content] (0):
		  * TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
		  * TLSv1.3 (IN), TLS handshake, [no content] (0):
		  * TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
		  ^C
  
* Several sosreports were collected from mgr nodes of the different Ceph clusters where this issue was observed. The information is at support-shell, directory /cases/03459963. This is the location of the data:

		0030-sosreport-ieec1cs1o01-03459963-2023-03-13-ejwuuwl.tar.xz    --> restful API **not restored** before uploading.
		0040-sosreport-ieec1ocs1n001-03459963-2023-03-14-meuoavq.tar.xz  --> restful API restored before uploading sos report.
		0050-sosreport-cnec1ocs1n001-03459963-2023-03-14-nlueqys.tar.xz  --> restful API restored before uploading sos report.

   
* The sosreport of node ieec1cs1o01 was the only one gathered while the issue was present. Reviewing the `ss_-peaonmi` file (the output of `ss -peaonmi`), I found almost 11K connections to port 8443 in CLOSE-WAIT status:
		
		  [...]
		  
		  tcp   CLOSE-WAIT 328    0                                                                                                       10.68.31.41:8443                  10.68.31.3:43428     users:(("ceph-mgr",pid=6591,fd=7051)) uid:167 ino:100729770 sk:28e47 -->
		  tcp   CLOSE-WAIT 328    0                                                                                                       10.68.31.41:8443                  10.68.31.4:33658     users:(("ceph-mgr",pid=6591,fd=26962)) uid:167 ino:109079723 sk:293d2 -->
		  tcp   CLOSE-WAIT 328    0                                                                                                       10.68.31.41:8443                  10.68.31.3:54855     users:(("ceph-mgr",pid=6591,fd=22457)) uid:167 ino:106448541 sk:28e48 -->
		  tcp   CLOSE-WAIT 328    0                                                                                                       10.68.31.41:8443                  10.68.31.3:38439     users:(("ceph-mgr",pid=6591,fd=1991)) uid:167 ino:99138254 sk:28e49 -->
		  
		  [...] 
		  
		  
		   grep ":8443" ss_-peaonmi  | grep CLOSE-WAIT | wc -l
		  10900 
  

* Most of these connections are coming from IPs 10.68.31.3 and 10.68.31.4:

		     grep ":8443" ss_-peaonmi  | grep CLOSE-WAIT | awk '{print $6}' | awk -F: '{print $1}' | sort | uniq -c
		      114 10.68.30.126
		     5374 10.68.31.3
		     5399 10.68.31.4
		       13 10.68.31.41
       

* These IPs correspond to two F5 load balancers. The pattern is consistent: every time this issue is observed, many CLOSE-WAIT connections appear.
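For future occurrences, the per-peer breakdown above can be reproduced with a small helper run against any saved `ss` dump. This is a hedged sketch: it assumes the dashboard port is 8443 and that the peer address is field 6, as in the `ss_-peaonmi` output shown earlier.

```shell
# close_wait_peers FILE: count CLOSE-WAIT connections on port 8443 per
# peer IP, busiest peer printed last. In `ss -peaonmi` output, field 6
# is the peer address:port.
close_wait_peers() {
    grep ':8443' "$1" | grep CLOSE-WAIT \
        | awk '{split($6, a, ":"); print a[1]}' \
        | sort | uniq -c | sort -n
}
```

Running it against the `ss_-peaonmi` file from the sosreport should reproduce the per-IP counts shown above.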

* We waited for another occurrence of this issue and collected debug logs for the MGR and the restful module:

		ceph config set mgr debug_mgr 20
		ceph config set mgr mgr/dashboard/log_level debug		
  
  This information is also at support-shell, folder: /cases/03459963/sosreport-20230321-090454/cnec1cs1o02.ecommerce.inditex.grp
  
  The debug MGR logs start at this point:

    2023-03-21T08:54:03.199+0000 7f1502b68700  0 [progress INFO root] Processing OSDMap change 26713..26713
    2023-03-21T08:54:04.194+0000 7f16dd1b0700 10 mgr ms_dispatch2 active kv_data(v2166132 prefix config/, incremental, 1 keys) v1
    2023-03-21T08:54:04.194+0000 7f16dd1b0700 10 mgr ms_dispatch2 kv_data(v2166132 prefix config/, incremental, 1 keys) v1
    2023-03-21T08:54:04.194+0000 7f16dd1b0700 10 mgr update_kv_data incremental update on config/
    2023-03-21T08:54:04.194+0000 7f16dd1b0700 20 mgr update_kv_data  set config/mgr/debug_mgr = 20/20

Until:

    2023-03-21T09:00:21.829+0000 7f1557b72700  0 [balancer DEBUG root] setting log level based on debug_mgr: INFO (2/5)
    2023-03-21T09:00:21.830+0000 7f1557b72700  0 [cephadm DEBUG root] setting log level based on debug_mgr: INFO (2/5)
    2023-03-21T09:00:21.833+0000 7f1557b72700  0 [crash DEBUG root] setting log level based on debug_mgr: INFO (2/5)
    2023-03-21T09:00:21.835+0000 7f1557b72700  0 [devicehealth DEBUG root] setting log level based on debug_mgr: INFO (2/5)
    2023-03-21T09:00:21.837+0000 7f1557b72700  0 [orchestrator DEBUG root] setting log level based on debug_mgr: INFO (2/5)
    2023-03-21T09:00:21.837+0000 7f1557b72700  0 [pg_autoscaler DEBUG root] setting log level based on debug_mgr: INFO (2/5)
    2023-03-21T09:00:21.839+0000 7f1557b72700  0 [progress DEBUG root] setting log level based on debug_mgr: INFO (2/5)
    

So we have approximately 6 minutes of information. Unfortunately, I couldn't spot anything wrong there. I'd appreciate it if you could have a look in case I'm missing anything.

* On a third occurrence of this issue, the customer collected a tcpdump output from the Ceph node, filtering by the IPs of the load-balancers:

		timeout 5m tcpdump -vvv -enni bond0 host 10.71.64.3 or host 10.71.64.4 -w /dev/shm/tcpdump.pcap
		dropped privs to tcpdump
		tcpdump: listening on bond0, link-type EN10MB (Ethernet), capture size 262144 bytes
		735 packets captured
		827 packets received by filter
		0 packets dropped by kernel
		
   This data is also at support-shell, file: /cases/03459963/0170-tcpdump.pcap
   
   I noticed that the F5 load balancer closes the connection by sending a `FIN,ACK`, but the Ceph mgr only ACKs it and never sends its own FIN, leaving the socket in CLOSE-WAIT. The examples below are taken from the Wireshark output:
   
		 625 248.826240   10.71.64.3 → 10.71.64.140 TCP 74 5740 → 8443 [SYN] Seq=0 Win=29200 Len=0 MSS=1460 SACK_PERM=1 TSval=579299640 TSecr=0 WS=128
		 626 248.826310 10.71.64.140 → 10.71.64.3   TCP 74 8443 → 5740 [SYN, ACK] Seq=0 Ack=1 Win=26844 Len=0 MSS=8960 SACK_PERM=1 TSval=3659855381 TSecr=579299640 WS=8192
		 627 248.827191   10.71.64.3 → 10.71.64.140 TCP 66 5740 → 8443 [ACK] Seq=1 Ack=1 Win=29312 Len=0 TSval=579299641 TSecr=3659855381
		 629 248.827265 10.71.64.140 → 10.71.64.3   TCP 66 8443 → 5740 [ACK] Seq=1 Ack=322 Win=32768 Len=0 TSval=3659855381 TSecr=579299641
		 631 248.828654   10.71.64.3 → 10.71.64.140 TCP 66 5740 → 8443 [ACK] Seq=322 Ack=138 Win=30336 Len=0 TSval=579299643 TSecr=3659855382
		 634 248.838287   10.71.64.3 → 10.71.64.140 TCP 140 [TCP Retransmission] 5740 → 8443 [PSH, ACK] Seq=373 Ack=138 Win=30336 Len=74 TSval=579299653 TSecr=3659855382
		 635 248.838297 10.71.64.140 → 10.71.64.3   TCP 78 8443 → 5740 [ACK] Seq=138 Ack=447 Win=32768 Len=0 TSval=3659855392 TSecr=579299643 SLE=373 SRE=447
		 641 253.821406 10.71.64.140 → 10.71.64.3   TCP 66 8443 → 5740 [ACK] Seq=138 Ack=521 Win=32768 Len=0 TSval=3659860376 TSecr=579304636
		 649 258.791470 10.71.64.140 → 10.71.64.3   TCP 66 8443 → 5740 [ACK] Seq=138 Ack=595 Win=32768 Len=0 TSval=3659865346 TSecr=579309606
		 666 263.776157 10.71.64.140 → 10.71.64.3   TCP 66 8443 → 5740 [ACK] Seq=138 Ack=669 Win=32768 Len=0 TSval=3659870330 TSecr=579314590
		 668 263.776179 10.71.64.140 → 10.71.64.3   TCP 66 8443 → 5740 [ACK] Seq=138 Ack=700 Win=32768 Len=0 TSval=3659870330 TSecr=579314590
		 696 280.020639   10.71.64.3 → 10.71.64.140 TCP 66 5740 → 8443 [FIN, ACK] Seq=700 Ack=138 Win=30336 Len=0 TSval=579330835 TSecr=3659870330
		 697 280.061373 10.71.64.140 → 10.71.64.3   TCP 66 8443 → 5740 [ACK] Seq=138 Ack=701 Win=32768 Len=0 TSval=3659886616 TSecr=579330835
		
		
		
		 652 261.356912   10.71.64.4 → 10.71.64.140 TCP 74 36025 → 8443 [SYN] Seq=0 Win=29200 Len=0 MSS=1460 SACK_PERM=1 TSval=579312170 TSecr=0 WS=128
		 653 261.357004 10.71.64.140 → 10.71.64.4   TCP 74 8443 → 36025 [SYN, ACK] Seq=0 Ack=1 Win=26844 Len=0 MSS=8960 SACK_PERM=1 TSval=4284243230 TSecr=579312170 WS=8192
		 654 261.358061   10.71.64.4 → 10.71.64.140 TCP 66 36025 → 8443 [ACK] Seq=1 Ack=1 Win=29312 Len=0 TSval=579312171 TSecr=4284243230
		 656 261.358112 10.71.64.140 → 10.71.64.4   TCP 66 8443 → 36025 [ACK] Seq=1 Ack=322 Win=32768 Len=0 TSval=4284243231 TSecr=579312172
		 658 261.358897   10.71.64.4 → 10.71.64.140 TCP 66 36025 → 8443 [ACK] Seq=322 Ack=138 Win=30336 Len=0 TSval=579312172 TSecr=4284243232
		 661 261.368264   10.71.64.4 → 10.71.64.140 TCP 140 [TCP Retransmission] 36025 → 8443 [PSH, ACK] Seq=373 Ack=138 Win=30336 Len=74 TSval=579312182 TSecr=4284243232
		 662 261.368302 10.71.64.140 → 10.71.64.4   TCP 78 8443 → 36025 [ACK] Seq=138 Ack=447 Win=32768 Len=0 TSval=4284243241 TSecr=579312172 SLE=373 SRE=447
		 670 266.375512 10.71.64.140 → 10.71.64.4   TCP 66 8443 → 36025 [ACK] Seq=138 Ack=521 Win=32768 Len=0 TSval=4284248249 TSecr=579317189
		 685 271.333848 10.71.64.140 → 10.71.64.4   TCP 66 8443 → 36025 [ACK] Seq=138 Ack=595 Win=32768 Len=0 TSval=4284253207 TSecr=579322147
		 691 276.350680 10.71.64.140 → 10.71.64.4   TCP 66 8443 → 36025 [ACK] Seq=138 Ack=669 Win=32768 Len=0 TSval=4284258224 TSecr=579327164
		 693 276.350702 10.71.64.140 → 10.71.64.4   TCP 66 8443 → 36025 [ACK] Seq=138 Ack=700 Win=32768 Len=0 TSval=4284258224 TSecr=579327164
		 729 292.314216   10.71.64.4 → 10.71.64.140 TCP 66 36025 → 8443 [FIN, ACK] Seq=700 Ack=138 Win=30336 Len=0 TSval=579343127 TSecr=4284258224
		 730 292.354370 10.71.64.140 → 10.71.64.4   TCP 66 8443 → 36025 [ACK] Seq=138 Ack=701 Win=32768 Len=0 TSval=4284274228 TSecr=579343127
   
Please, we need assistance to resolve this issue. The Ceph MGR is holding these connections open despite having received a close notification (FIN) from the F5 load balancers.
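Until the root cause is fixed, the manual `ceph mgr fail` shown earlier is the only known mitigation. Below is a minimal sketch of automating that check; the port, the threshold of 1000, and the use of `jq` are illustrative assumptions, not part of the customer's setup.

```shell
# should_fail_mgr COUNT THRESHOLD: succeed when the CLOSE-WAIT backlog
# is large enough to warrant a mgr failover.
should_fail_mgr() {
    [ "$1" -gt "$2" ]
}

# On a cluster node (assumes jq is installed), something like:
#   count=$(ss -Htan state close-wait '( sport = :8443 )' | wc -l)
#   should_fail_mgr "$count" 1000 \
#       && ceph mgr fail "$(ceph mgr dump | jq -r '.active_name')"
```

Keeping the threshold check in a pure helper makes it testable in isolation; the live `ss` and `ceph` invocations stay in the commented usage.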

### How reproducible:

Reproducible at the customer site. Similar issues have been reported elsewhere:

[1] https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/LKB7PCMHREU4SJDEDYCIHOLGXK2MTUHE/
[2] https://bugzilla.redhat.com/show_bug.cgi?id=2135369#c19


### Actual results:

The MGR is not correctly closing these connections. 


### Expected results:

The MGR should be correctly closing these connections. 


### Additional info:

N/A - The location of the data collected is posted above.

Comment 1 Manny 2024-08-27 17:49:18 UTC
See KCS (https://access.redhat.com/solutions/7084560)

BR
Manny

Comment 7 errata-xmlrpc 2025-02-24 15:42:23 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat Ceph Storage 7.1 security, bug fix, enhancement, and known issue updates), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2025:1770

