
Bug 2231684

Summary: keep-alive timeouts and host disconnects when working with ~100+ namespaces
Product: [Red Hat Storage] Red Hat Ceph Storage
Component: NVMeOF
Version: 7.0
Target Release: 7.1
Hardware: All
OS: Linux
Status: CLOSED WORKSFORME
Severity: high
Priority: unspecified
Keywords: Reopened
Reporter: Aviv Caro <acaro>
Assignee: Aviv Caro <aviv.caro>
QA Contact: Manohar Murthy <mmurthy>
Docs Contact: ceph-doc-bot <ceph-doc-bugzilla>
CC: cephqe-warriors, idryomov
Doc Type: If docs needed, set a value
Type: Bug
Last Closed: 2024-01-25 14:32:44 UTC

Description Aviv Caro 2023-08-13 09:02:34 UTC
For more details see https://github.com/ceph/ceph-nvmeof/issues/161
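
The failure mode in the summary is keep-alive expirations followed by host disconnects. As a minimal sketch of how this surfaces on the initiator side (the address, port, and subsystem NQN below are placeholders, not values from this bug), a host connects with an explicit keep-alive timeout and KA expirations then show up in the kernel log:

    # Placeholder gateway address and subsystem NQN - substitute the real ones.
    GW_IP=10.0.0.1
    NQN=nqn.2016-06.io.spdk:cnode1

    # Connect over NVMe/TCP with an explicit keep-alive timeout (seconds).
    nvme connect -t tcp -a "$GW_IP" -s 4420 -n "$NQN" --keep-alive-tmo 30

    # Keep-alive expirations and the resulting reconnect attempts are logged by the kernel.
    dmesg | grep -iE 'keep-alive|nvme.*(timeout|reconnect)'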

Comment 1 Aviv Caro 2023-08-16 12:18:03 UTC
The issue seems to be fixed after bumping the version to 23.01.1 LTS.

Comment 2 Rahul Lepakshi 2023-08-21 06:27:25 UTC
Re-opening, as the issue is still seen with 23.01.1. Details at https://github.com/ceph/ceph-nvmeof/issues/161#issuecomment-1685718132

Comment 3 Rahul Lepakshi 2023-08-23 09:08:50 UTC
Observations after some test runs (a memory-monitoring sketch follows the list):
1) With a 16GB RAM GW node - the test passes with no KA timeouts; host and subsystem connections are intact even after a day -
   1) http://magna002.ceph.redhat.com/cephci-jenkins/cephci-run-BPXH6Y/Scale_to_256_namespaces_in_single_subsystem_on_NVMeOF_GW_0.log
   2) http://magna002.ceph.redhat.com/cephci-jenkins/cephci-run-BPXH6Y/GW_server.log
   3) [root@ceph-nvmf3-bpxh6y-node5 ceph-nvmeof]# cat /proc/meminfo
      MemTotal:       16107316 kB
      MemFree:          476232 kB
      MemAvailable:    1078012 kB

2) With an 8GB RAM GW node - the test fails, though not with a KA timeout message; the GW crashes
   1) http://magna002.ceph.redhat.com/cephci-jenkins/cephci-run-GRYB7N/Scale_to_256_namespaces_in_single_subsystem_on_NVMeOF_GW_0.log
   2) http://magna002.ceph.redhat.com/cephci-jenkins/cephci-run-GRYB7N/GW_server.log
   3) [root@ceph-nvmf1-gryb7n-node5 ceph-nvmeof]# cat /proc/meminfo
      MemTotal:        7862076 kB
      MemFree:          939148 kB
      MemAvailable:    1239144 kB
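
Since both runs tie the outcome to available memory on the GW node, sampling /proc/meminfo during the scale run makes the trend visible. A minimal sketch (the 30-second interval and log path are arbitrary choices, not from these runs):

    # On the GW node: log memory headroom every 30 seconds while namespaces are added.
    while true; do
        echo "$(date -u +%FT%TZ) $(grep -E 'MemTotal|MemFree|MemAvailable' /proc/meminfo | tr '\n' ' ')" >> /tmp/gw_meminfo.log
        sleep 30
    done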

Comment 4 Aviv Caro 2023-08-23 09:19:03 UTC
After some discussions with @orit.was, rlepaksh, and manohar.m, we agreed that for the 7.0 TP we will need at least 16 GB for the GW. We also agreed to reconsider whether we can work with less memory for the GA in 7.1, so the target release needs to change to 7.1.

Comment 5 Rahul Lepakshi 2023-09-13 13:43:26 UTC
With the upstream container build, the GW also crashes on a 4GB RAM node, at 135 namespaces with IO - http://magna002.ceph.redhat.com/cephci-jenkins/cephci-run-KDQSHS
On a 16GB RAM node it crashes at 391 namespaces with IO - http://magna002.ceph.redhat.com/cephci-jenkins/cephci-run-Y4GMR7/test_1k_namespace_with_1_subsystem_in_Single_GW_0.log
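
To distinguish a gateway crash from a host-side disconnect, checking the container state on the GW node at the failure point helps. A hedged sketch - the name filter and container name are assumptions about how the gateway container is deployed, not taken from these runs:

    # On the GW node: is the gateway container still running, or was it restarted?
    podman ps -a --filter name=nvmeof
    # Recent gateway logs around the crash (container name is a placeholder).
    podman logs --tail 200 <nvmeof-container-name>
    # Kernel-side evidence that the process was killed, e.g. by the OOM killer.
    dmesg | grep -iE 'out of memory|oom-kill'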