# Clone of 2184612 for 7.1z2

### Description of problem:

* RHCS 5.3 is running in several different Ceph clusters. The customer (Inditex) uses the Ceph restful mgr module to make API calls. Sometimes these calls stop working until the mgr service is restarted. This occurs randomly and on different Ceph clusters.

```
❯ curl -k -X POST "https://cnec1cs1mgrlb1.ecommerce.inditex.grp/api/auth" -H "Accept: application/vnd.ceph.api.v1.0+json" -H "Content-Type: application/json" -d '{"username": "ceph-inventory", "password": "****"}' | jq '.token'
curl: (55) Send failure: Broken pipe

❯ curl -k -X POST "https://cnec1ocs1mgrlb1.ecommerce.inditex.grp/api/auth" -H "Accept: application/vnd.ceph.api.v1.0+json" -H "Content-Type: application/json" -d '{"username": "ceph-inventory", "password": "****"}' | jq '.token'
curl: (55) Send failure: Broken pipe

❯ curl -k -X POST "https://ieec1cs1mgrlb1.ecommerce.inditex.grp/api/auth" -H "Accept: application/vnd.ceph.api.v1.0+json" -H "Content-Type: application/json" -d '{"username": "ceph-inventory", "password": "****"}' | jq '.token'
curl: (55) Send failure: Broken pipe

❯ curl -k -X POST "https://ieec1ocs1mgrlb1.ecommerce.inditex.grp/api/auth" -H "Accept: application/vnd.ceph.api.v1.0+json" -H "Content-Type: application/json" -d '{"username": "ceph-inventory", "password": "****"}' | jq '.token'
curl: (55) Send failure: Broken pipe
```

* Restarting the restful module itself does not help:

```
[root@cnec1cs1o01 ~]$ sudo ceph mgr module disable restful && sudo ceph mgr module enable restful
[root@cnec1cs1o01 ~]$

❯ curl -k -X POST "https://cnec1cs1mgrlb1.ecommerce.inditex.grp/api/auth" -H "Accept: application/vnd.ceph.api.v1.0+json" -H "Content-Type: application/json" -d '{"username": "ceph-inventory", "password": "****"}' | jq '.token'
curl: (35) Recv failure: Connection reset by peer
```

* The active MGR has to be failed to recover access to the API:

```
[root@cnec1cs1o01 ~]# ceph -s
  cluster:
    id:     e4304a5b-2f71-4ccc-a6a6-652c59a9e671
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum cnec1cs1o01,cnec1cs1o02,cnec1cs1o03 (age 4w)
    mgr: cnec1cs1o01.dvbgnj(active, since 5m), standbys: cnec1cs1o02.bqqjaa, cnec1cs1o03.hkunpo
    osd: 60 osds: 60 up (since 4w), 60 in (since 8M)
    rgw: 6 daemons active (6 hosts, 2 zones)

  data:
    pools:   15 pools, 1121 pgs
    objects: 5.49M objects, 2.3 TiB
    usage:   7.2 TiB used, 79 TiB / 86 TiB avail
    pgs:     1121 active+clean

  io:
    client:   304 KiB/s rd, 0 B/s wr, 304 op/s rd, 0 op/s wr

[root@cnec1cs1o01 ~]# ceph mgr fail cnec1cs1o01.dvbgnj

❯ curl -k -X POST "https://cnec1cs1mgrlb1.ecommerce.inditex.grp/api/auth" -H "Accept: application/vnd.ceph.api.v1.0+json" -H "Content-Type: application/json" -d '{"username": "ceph-inventory", "password": "****"}' | jq '.token'
"eyJ*********************q2o"
```
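For reference, the failing request is the same authentication endpoint used by the customer's inventory tooling. A minimal Python equivalent of the curl calls above is sketched below; the URL, headers, endpoint, and username are taken from those commands, the password stays masked, and the `requests` package is assumed to be available. The timeout simply makes a hang visible instead of blocking indefinitely.

```python
# Minimal sketch of the same /api/auth request the customer issues with curl.
# verify=False mirrors curl -k; the credentials below are the masked ones
# from the report, not real values.
import requests

url = "https://cnec1cs1mgrlb1.ecommerce.inditex.grp/api/auth"
headers = {
    "Accept": "application/vnd.ceph.api.v1.0+json",
    "Content-Type": "application/json",
}
payload = {"username": "ceph-inventory", "password": "****"}

resp = requests.post(url, json=payload, headers=headers, verify=False, timeout=10)
resp.raise_for_status()
print(resp.json()["token"])  # same field extracted with jq '.token' above
```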
* The customer added the -v option to the curl command to determine where the request gets stuck. The TLS handshake actually completes and the POST body is fully sent, but no HTTP response ever comes back:

```
[digfernandodlhdt@ieec1cs1o01 ~]$ curl -k -X POST "https://ieec1cs1o01.ecommerce.inditex.grp:8443/api/auth" -H "Accept: application/vnd.ceph.api.v1.0+json" -H "Content-Type: application/json" -d '{"username": "ceph-inventory", "password": "****"}' -v
Note: Unnecessary use of -X or --request, POST is already inferred.
*   Trying 10.68.31.41...
* TCP_NODELAY set
* Connected to ieec1cs1o01.ecommerce.inditex.grp (10.68.31.41) port 8443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: /etc/pki/tls/certs/ca-bundle.crt
    CApath: none
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.3 (IN), TLS handshake, [no content] (0):
* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
* TLSv1.3 (IN), TLS handshake, [no content] (0):
* TLSv1.3 (IN), TLS handshake, Certificate (11):
* TLSv1.3 (IN), TLS handshake, [no content] (0):
* TLSv1.3 (IN), TLS handshake, CERT verify (15):
* TLSv1.3 (IN), TLS handshake, [no content] (0):
* TLSv1.3 (IN), TLS handshake, Finished (20):
* TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.3 (OUT), TLS handshake, [no content] (0):
* TLSv1.3 (OUT), TLS handshake, Finished (20):
* SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384
* ALPN, server did not agree to a protocol
* Server certificate:
*  subject: O=ECOMMERCE.INDITEX.GRP; CN=ieec1cs1n01.ecommerce.inditex.grp
*  start date: Apr 28 16:14:47 2021 GMT
*  expire date: Apr 29 16:14:47 2023 GMT
*  issuer: O=ECOMMERCE.INDITEX.GRP; CN=Certificate Authority
*  SSL certificate verify ok.
* TLSv1.3 (OUT), TLS app data, [no content] (0):
> POST /api/auth HTTP/1.1
> Host: ieec1cs1o01.ecommerce.inditex.grp:8443
> User-Agent: curl/7.61.1
> Accept: application/vnd.ceph.api.v1.0+json
> Content-Type: application/json
> Content-Length: 64
>
* upload completely sent off: 64 out of 64 bytes
* TLSv1.3 (IN), TLS handshake, [no content] (0):
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* TLSv1.3 (IN), TLS handshake, [no content] (0):
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
^C
```

* There are sosreports collected from mgr nodes of the different Ceph clusters where this issue was observed. The data is on support-shell, in directory /cases/03459963:

  * 0030-sosreport-ieec1cs1o01-03459963-2023-03-13-ejwuuwl.tar.xz --> restful API **not restored** before uploading
  * 0040-sosreport-ieec1ocs1n001-03459963-2023-03-14-meuoavq.tar.xz --> restful API restored before uploading the sosreport
  * 0050-sosreport-cnec1ocs1n001-03459963-2023-03-14-nlueqys.tar.xz --> restful API restored before uploading the sosreport

* The sosreport of node ieec1cs1o01 was the only one gathered while the issue was present. Reviewing its ss_-peaonmi output, I found almost 11K connections to port 8443 in CLOSE-WAIT state:

```
[...]
tcp  CLOSE-WAIT  328  0  10.68.31.41:8443  10.68.31.3:43428  users:(("ceph-mgr",pid=6591,fd=7051))  uid:167 ino:100729770 sk:28e47 -->
tcp  CLOSE-WAIT  328  0  10.68.31.41:8443  10.68.31.4:33658  users:(("ceph-mgr",pid=6591,fd=26962)) uid:167 ino:109079723 sk:293d2 -->
tcp  CLOSE-WAIT  328  0  10.68.31.41:8443  10.68.31.3:54855  users:(("ceph-mgr",pid=6591,fd=22457)) uid:167 ino:106448541 sk:28e48 -->
tcp  CLOSE-WAIT  328  0  10.68.31.41:8443  10.68.31.3:38439  users:(("ceph-mgr",pid=6591,fd=1991))  uid:167 ino:99138254  sk:28e49 -->
[...]

grep ":8443" ss_-peaonmi | grep CLOSE-WAIT | wc -l
10900
```

* Most of these connections come from IPs 10.68.31.3 and 10.68.31.4:

```
grep ":8443" ss_-peaonmi | grep CLOSE-WAIT | awk '{print $6}' | awk -F: '{print $1}' | sort | uniq -c
    114 10.68.30.126
   5374 10.68.31.3
   5399 10.68.31.4
     13 10.68.31.41
```

* These IPs correspond to two F5 load balancers. The pattern is consistent: every time this issue is observed, a large number of CLOSE-WAIT connections appears.
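To make this check repeatable on a live node (rather than from a sosreport), the same per-peer CLOSE-WAIT breakdown can be produced with a short script. This is only a convenience sketch, assuming `ss` from iproute2 is in PATH and that the API listens on 8443 as above:

```python
# Count CLOSE-WAIT sockets on port 8443 per remote peer, reproducing the
# grep/awk pipeline above directly against a live node.
import subprocess
from collections import Counter

# -H: no header, -t: TCP, -a: all sockets, -n: numeric addresses/ports
out = subprocess.run(["ss", "-Htan"], capture_output=True, text=True, check=True).stdout

peers = Counter()
for line in out.splitlines():
    fields = line.split()
    # Columns: State Recv-Q Send-Q Local-Address:Port Peer-Address:Port ...
    if len(fields) < 5 or fields[0] != "CLOSE-WAIT":
        continue
    local, peer = fields[3], fields[4]
    if local.endswith(":8443"):
        peers[peer.rsplit(":", 1)[0]] += 1   # strip the ephemeral source port

for ip, count in peers.most_common():
    print(f"{count:>6} {ip}")
```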
* We waited for another occurrence of this issue and collected debug logs for the MGR and the restful module:

```
ceph config set mgr debug_mgr 20
ceph config set mgr mgr/dashboard/log_level debug
```

This information is also on support-shell, in folder /cases/03459963/sosreport-20230321-090454/cnec1cs1o02.ecommerce.inditex.grp

The debug MGR logs start at this point:

```
2023-03-21T08:54:03.199+0000 7f1502b68700  0 [progress INFO root] Processing OSDMap change 26713..26713
2023-03-21T08:54:04.194+0000 7f16dd1b0700 10 mgr ms_dispatch2 active kv_data(v2166132 prefix config/, incremental, 1 keys) v1
2023-03-21T08:54:04.194+0000 7f16dd1b0700 10 mgr ms_dispatch2 kv_data(v2166132 prefix config/, incremental, 1 keys) v1
2023-03-21T08:54:04.194+0000 7f16dd1b0700 10 mgr update_kv_data incremental update on config/
2023-03-21T08:54:04.194+0000 7f16dd1b0700 20 mgr update_kv_data set config/mgr/debug_mgr = 20/20
```

and run until:

```
2023-03-21T09:00:21.829+0000 7f1557b72700  0 [balancer DEBUG root] setting log level based on debug_mgr: INFO (2/5)
2023-03-21T09:00:21.830+0000 7f1557b72700  0 [cephadm DEBUG root] setting log level based on debug_mgr: INFO (2/5)
2023-03-21T09:00:21.833+0000 7f1557b72700  0 [crash DEBUG root] setting log level based on debug_mgr: INFO (2/5)
2023-03-21T09:00:21.835+0000 7f1557b72700  0 [devicehealth DEBUG root] setting log level based on debug_mgr: INFO (2/5)
2023-03-21T09:00:21.837+0000 7f1557b72700  0 [orchestrator DEBUG root] setting log level based on debug_mgr: INFO (2/5)
2023-03-21T09:00:21.837+0000 7f1557b72700  0 [pg_autoscaler DEBUG root] setting log level based on debug_mgr: INFO (2/5)
2023-03-21T09:00:21.839+0000 7f1557b72700  0 [progress DEBUG root] setting log level based on debug_mgr: INFO (2/5)
```

So we have approximately 6 minutes of information. Unfortunately, I could not spot anything wrong there. I would appreciate it if you could have a look in case I am missing something.

* On a third occurrence of the issue, the customer collected a tcpdump on the Ceph node, filtering on the IPs of the load balancers:

```
timeout 5m tcpdump -vvv -enni bond0 host 10.71.64.3 or host 10.71.64.4 -w /dev/shm/tcpdump.pcap
dropped privs to tcpdump
tcpdump: listening on bond0, link-type EN10MB (Ethernet), capture size 262144 bytes
735 packets captured
827 packets received by filter
0 packets dropped by kernel
```

This data is also on support-shell, file /cases/03459963/0170-tcpdump.pcap

I noticed that the F5 load balancer closes the connection by sending a FIN,ACK, but the Ceph mgr only ACKs it and never sends its own FIN to finish closing the connection.
The examples below are taken from this Wireshark output (one conversation per load balancer, 10.71.64.140 being the Ceph node):

```
625 248.826240 10.71.64.3 → 10.71.64.140 TCP  74 5740 → 8443 [SYN] Seq=0 Win=29200 Len=0 MSS=1460 SACK_PERM=1 TSval=579299640 TSecr=0 WS=128
626 248.826310 10.71.64.140 → 10.71.64.3 TCP  74 8443 → 5740 [SYN, ACK] Seq=0 Ack=1 Win=26844 Len=0 MSS=8960 SACK_PERM=1 TSval=3659855381 TSecr=579299640 WS=8192
627 248.827191 10.71.64.3 → 10.71.64.140 TCP  66 5740 → 8443 [ACK] Seq=1 Ack=1 Win=29312 Len=0 TSval=579299641 TSecr=3659855381
629 248.827265 10.71.64.140 → 10.71.64.3 TCP  66 8443 → 5740 [ACK] Seq=1 Ack=322 Win=32768 Len=0 TSval=3659855381 TSecr=579299641
631 248.828654 10.71.64.3 → 10.71.64.140 TCP  66 5740 → 8443 [ACK] Seq=322 Ack=138 Win=30336 Len=0 TSval=579299643 TSecr=3659855382
634 248.838287 10.71.64.3 → 10.71.64.140 TCP 140 [TCP Retransmission] 5740 → 8443 [PSH, ACK] Seq=373 Ack=138 Win=30336 Len=74 TSval=579299653 TSecr=3659855382
635 248.838297 10.71.64.140 → 10.71.64.3 TCP  78 8443 → 5740 [ACK] Seq=138 Ack=447 Win=32768 Len=0 TSval=3659855392 TSecr=579299643 SLE=373 SRE=447
641 253.821406 10.71.64.140 → 10.71.64.3 TCP  66 8443 → 5740 [ACK] Seq=138 Ack=521 Win=32768 Len=0 TSval=3659860376 TSecr=579304636
649 258.791470 10.71.64.140 → 10.71.64.3 TCP  66 8443 → 5740 [ACK] Seq=138 Ack=595 Win=32768 Len=0 TSval=3659865346 TSecr=579309606
666 263.776157 10.71.64.140 → 10.71.64.3 TCP  66 8443 → 5740 [ACK] Seq=138 Ack=669 Win=32768 Len=0 TSval=3659870330 TSecr=579314590
668 263.776179 10.71.64.140 → 10.71.64.3 TCP  66 8443 → 5740 [ACK] Seq=138 Ack=700 Win=32768 Len=0 TSval=3659870330 TSecr=579314590
696 280.020639 10.71.64.3 → 10.71.64.140 TCP  66 5740 → 8443 [FIN, ACK] Seq=700 Ack=138 Win=30336 Len=0 TSval=579330835 TSecr=3659870330
697 280.061373 10.71.64.140 → 10.71.64.3 TCP  66 8443 → 5740 [ACK] Seq=138 Ack=701 Win=32768 Len=0 TSval=3659886616 TSecr=579330835

652 261.356912 10.71.64.4 → 10.71.64.140 TCP  74 36025 → 8443 [SYN] Seq=0 Win=29200 Len=0 MSS=1460 SACK_PERM=1 TSval=579312170 TSecr=0 WS=128
653 261.357004 10.71.64.140 → 10.71.64.4 TCP  74 8443 → 36025 [SYN, ACK] Seq=0 Ack=1 Win=26844 Len=0 MSS=8960 SACK_PERM=1 TSval=4284243230 TSecr=579312170 WS=8192
654 261.358061 10.71.64.4 → 10.71.64.140 TCP  66 36025 → 8443 [ACK] Seq=1 Ack=1 Win=29312 Len=0 TSval=579312171 TSecr=4284243230
656 261.358112 10.71.64.140 → 10.71.64.4 TCP  66 8443 → 36025 [ACK] Seq=1 Ack=322 Win=32768 Len=0 TSval=4284243231 TSecr=579312172
658 261.358897 10.71.64.4 → 10.71.64.140 TCP  66 36025 → 8443 [ACK] Seq=322 Ack=138 Win=30336 Len=0 TSval=579312172 TSecr=4284243232
661 261.368264 10.71.64.4 → 10.71.64.140 TCP 140 [TCP Retransmission] 36025 → 8443 [PSH, ACK] Seq=373 Ack=138 Win=30336 Len=74 TSval=579312182 TSecr=4284243232
662 261.368302 10.71.64.140 → 10.71.64.4 TCP  78 8443 → 36025 [ACK] Seq=138 Ack=447 Win=32768 Len=0 TSval=4284243241 TSecr=579312172 SLE=373 SRE=447
670 266.375512 10.71.64.140 → 10.71.64.4 TCP  66 8443 → 36025 [ACK] Seq=138 Ack=521 Win=32768 Len=0 TSval=4284248249 TSecr=579317189
685 271.333848 10.71.64.140 → 10.71.64.4 TCP  66 8443 → 36025 [ACK] Seq=138 Ack=595 Win=32768 Len=0 TSval=4284253207 TSecr=579322147
691 276.350680 10.71.64.140 → 10.71.64.4 TCP  66 8443 → 36025 [ACK] Seq=138 Ack=669 Win=32768 Len=0 TSval=4284258224 TSecr=579327164
693 276.350702 10.71.64.140 → 10.71.64.4 TCP  66 8443 → 36025 [ACK] Seq=138 Ack=700 Win=32768 Len=0 TSval=4284258224 TSecr=579327164
729 292.314216 10.71.64.4 → 10.71.64.140 TCP  66 36025 → 8443 [FIN, ACK] Seq=700 Ack=138 Win=30336 Len=0 TSval=579343127 TSecr=4284258224
730 292.354370 10.71.64.140 → 10.71.64.4 TCP  66 8443 → 36025 [ACK] Seq=138 Ack=701 Win=32768 Len=0 TSval=4284274228 TSecr=579343127
```
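This is exactly the TCP pattern that produces CLOSE-WAIT sockets: the kernel on the Ceph node ACKs the load balancer's FIN, but the application (ceph-mgr) never calls close() on its end, so the socket and its file descriptor stay allocated. The stand-alone snippet below is only an illustration of that state, not the mgr code; it opens a local connection, closes the client side, and leaves the server side unclosed so it parks in CLOSE-WAIT:

```python
# Illustration only (not the mgr code): a server that never close()s an
# accepted socket leaves it in CLOSE-WAIT once the client has sent its FIN.
import socket
import subprocess
import threading
import time

held = []  # keep accepted sockets referenced so they are never closed

def acceptor(listener):
    while True:
        conn, _ = listener.accept()
        held.append(conn)          # the application "forgets" to close it

listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))    # any free port
listener.listen(8)
threading.Thread(target=acceptor, args=(listener,), daemon=True).start()

client = socket.create_connection(listener.getsockname())
client.close()                     # client sends its FIN, like the F5 does
time.sleep(1)                      # give the kernel time to process the FIN

# The accepted socket now shows up as CLOSE-WAIT and stays that way until
# the process finally calls close() on it or exits.
print(subprocess.run(["ss", "-tn", "state", "close-wait"],
                     capture_output=True, text=True).stdout)
```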
Please, we need some assistance to resolve this issue. The Ceph MGR is holding these connections open despite having received the connection close (FIN,ACK) from the F5 load balancers.

### How reproducible:

At the customer site. There are, however, similar issues reported elsewhere:

[1] https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/LKB7PCMHREU4SJDEDYCIHOLGXK2MTUHE/
[2] https://bugzilla.redhat.com/show_bug.cgi?id=2135369#c19

### Actual results:

The MGR is not closing these connections.

### Expected results:

The MGR should close these connections once the peer has closed its side.

### Additional info:

The location of the collected data is posted above.
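Until the root cause is identified, the only recovery that has worked is failing the active mgr. A possible interim watchdog is sketched below; it is only a sketch, assuming the ceph CLI and `ss` are available on the mgr node, and the threshold is an arbitrary value, not a recommendation:

```python
# Sketch of a possible interim watchdog (assumptions: ceph CLI and ss are
# available on the mgr node; the threshold is arbitrary). If CLOSE-WAIT
# sockets on 8443 pile up, fail the active mgr, which is the only action
# that has restored the API so far. Run it from cron or a systemd timer.
import json
import subprocess

THRESHOLD = 1000  # arbitrary; well below the ~11K seen in the sosreport
PORT = 8443

def close_wait_count(port):
    out = subprocess.run(["ss", "-Htan"], capture_output=True, text=True, check=True).stdout
    count = 0
    for line in out.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[0] == "CLOSE-WAIT" and fields[3].endswith(f":{port}"):
            count += 1
    return count

if close_wait_count(PORT) > THRESHOLD:
    dump = subprocess.run(["ceph", "mgr", "dump", "-f", "json"],
                          capture_output=True, text=True, check=True).stdout
    active = json.loads(dump)["active_name"]
    subprocess.run(["ceph", "mgr", "fail", active], check=True)
    print(f"CLOSE-WAIT above threshold; failed active mgr {active}")
```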
See KCS: https://access.redhat.com/solutions/7084560

BR, Manny
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat Ceph Storage 7.1 security, bug fix, enhancement, and known issue updates), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2025:1770