Bug 2046533

Summary: nvme-cli: nvme connect-all returns Failed to write to /dev/nvme-fabrics
Product: Red Hat Enterprise Linux 8
Component: nvme-cli
Version: 8.6
Status: CLOSED WONTFIX
Severity: unspecified
Priority: medium
Reporter: Marco Patalano <mpatalan>
Assignee: Maurizio Lombardi <mlombard>
QA Contact: Marco Patalano <mpatalan>
CC: arun.c, jbrassow, jmeneghi, thomasberryiif
Keywords: Triaged
Target Milestone: rc
Hardware: Unspecified
OS: Unspecified
Whiteboard: NVMe_P2
Type: Bug
Last Closed: 2023-07-26 07:28:16 UTC

Description Marco Patalano 2022-01-26 21:50:18 UTC
Description of problem: I am encountering two issues when running nvme connect-all in my NVMe-TCP environment: misleading "is already connected" messages and spurious "Failed to write to /dev/nvme-fabrics" errors. In the first scenario, /etc/nvme/discovery.conf looks as follows:

# cat /etc/nvme/discovery.conf 
-t tcp -a 172.16.0.101 -s 4420
-t tcp -a 172.16.1.101 -s 4420
-t tcp -a 172.16.0.102 -s 4420
-t tcp -a 172.16.1.102 -s 4420
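
For reference, nvme connect-all processes each of these lines as a discovery request followed by a connect for every discovery log entry; a rough per-entry equivalent (the subsystem NQN is the one reported by list-subsys below) would be:

# nvme discover -t tcp -a 172.16.0.101 -s 4420
# nvme connect -t tcp -a 172.16.0.101 -s 4420 -n nqn.1992-08.com.netapp:sn.f9f91ad1ea5811ebb38f00a098cbcac6:subsystem.tcp_nvme_ss_1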

No connections have been made at this point and nvme list is empty:
# nvme list
Node                  SN                   Model                                    Namespace Usage                      Format           FW Rev  
--------------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------

I then issue nvme connect-all:

# nvme connect-all
traddr=172.16.1.101 is already connected
traddr=172.16.0.101 is already connected
traddr=172.16.1.102 is already connected
traddr=172.16.0.102 is already connected
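
Despite the "already connected" messages, the controllers are newly created; their state can be spot-checked in sysfs (assuming the controller names nvme0-nvme3 reported by list-subsys below), and each reports "live":

# for c in /sys/class/nvme/nvme[0-3]; do echo "$c: $(cat $c/state)"; done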

The connections are made successfully, but no connections existed before the command ran, so the "is already connected" messages are misleading and would, I believe, be confusing to the customer.

# nvme list-subsys /dev/nvme0n1
nvme-subsys0 - NQN=nqn.1992-08.com.netapp:sn.f9f91ad1ea5811ebb38f00a098cbcac6:subsystem.tcp_nvme_ss_1
\
 +- nvme0 tcp traddr=172.16.1.101 trsvcid=4420 live optimized
 +- nvme1 tcp traddr=172.16.0.101 trsvcid=4420 live optimized
 +- nvme2 tcp traddr=172.16.1.102 trsvcid=4420 live non-optimized
 +- nvme3 tcp traddr=172.16.0.102 trsvcid=4420 live non-optimized

In the next scenario, I specify the host-traddr:

# cat /etc/nvme/discovery.conf 
-t tcp -a 172.16.0.101 -w 172.16.0.110 -s 4420
-t tcp -a 172.16.1.101 -w 172.16.1.110 -s 4420
-t tcp -a 172.16.0.102 -w 172.16.0.110 -s 4420
-t tcp -a 172.16.1.102 -w 172.16.1.110 -s 4420
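
Before running connect-all, I confirm that the -w host-traddr values are addresses assigned to local interfaces, e.g.:

# ip -br addr show | grep -E '172\.16\.[01]\.110'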

I then issue nvme connect-all (the connections were removed with nvme disconnect-all prior to this test):

# nvme connect-all
Failed to write to /dev/nvme-fabrics: Connection timed out
Failed to write to /dev/nvme-fabrics: Connection timed out
Failed to write to /dev/nvme-fabrics: Connection timed out
Failed to write to /dev/nvme-fabrics: Connection timed out

From the output above, the command appears to fail, yet all the connections are in fact successful:

# nvme list-subsys /dev/nvme0n1
nvme-subsys0 - NQN=nqn.1992-08.com.netapp:sn.f9f91ad1ea5811ebb38f00a098cbcac6:subsystem.tcp_nvme_ss_1
\
 +- nvme0 tcp traddr=172.16.0.101 trsvcid=4420 host_traddr=172.16.0.110 live optimized
 +- nvme1 tcp traddr=172.16.1.101 trsvcid=4420 host_traddr=172.16.1.110 live optimized
 +- nvme2 tcp traddr=172.16.0.102 trsvcid=4420 host_traddr=172.16.0.110 live non-optimized
 +- nvme3 tcp traddr=172.16.1.102 trsvcid=4420 host_traddr=172.16.1.110 live non-optimized
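
The kernel log confirms the same thing: nvme-tcp prints a "new ctrl" line for each association that actually completes (exact wording varies by kernel version), which suggests the write error is cosmetic:

# dmesg | grep -i 'new ctrl'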


Version-Release number of selected component (if applicable):
# rpm -qa nvme-cli
nvme-cli-1.16-3.el8.x86_64

How reproducible: Often


Steps to Reproduce:
1. see above

Comment 1 John Meneghini 2022-08-30 12:22:19 UTC
Marco, what's the storage array used to find this bug? Dell EMC? I think this is a long-standing problem that either needs to be fixed upstream or pushed back to the vendor. I don't think we see this problem with anything but this specific vendor's array.

Comment 2 TimmyJ 2022-12-29 09:54:36 UTC
Did you manage to fix the write error? I am hitting a similar error and wonder what solution you found.

Comment 4 John Meneghini 2023-04-17 20:48:20 UTC
Marco, is this still a problem with RHEL 9.2?

Can we mark this as fixed in the current release?

Note: nvme/tcp is not supported in RHEL 8

Comment 5 Marco Patalano 2023-04-19 13:14:23 UTC
Hello John,

For RHEL-9.2, I no longer see the following message:

Failed to write to /dev/nvme-fabrics: Connection timed out

However, for both RHEL-8.8 and RHEL-9.2, I continue to see these messages when issuing "nvme connect-all":

# nvme connect-all
traddr=172.18.210.60 is already connected
traddr=172.18.210.61 is already connected
traddr=172.18.220.61 is already connected
traddr=172.18.220.60 is already connected

As mentioned previously, this may be confusing to customers as we did not establish any connections prior to issuing the command.
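
To rule out stale state, each run starts from a clean host (nvme list-subsys returns nothing and nvme list shows only its header beforehand):

# nvme disconnect-all
# nvme list-subsys
# nvme list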

Finally, for RHEL-8.8, I continue to see the following, but only when using a Dell PowerStore array on the backend:

# nvme connect-all
Failed to write to /dev/nvme-fabrics: Connection timed out
Failed to write to /dev/nvme-fabrics: Connection timed out
Failed to write to /dev/nvme-fabrics: Connection timed out
Failed to write to /dev/nvme-fabrics: Connection timed out
Failed to write to /dev/nvme-fabrics: Connection timed out
Failed to write to /dev/nvme-fabrics: Connection timed out

Since NVMe-TCP is tech preview in RHEL-8, I am OK with closing this BZ. However, should I open a separate BZ to determine if the messaging for successful connections needs to be fixed?

Marco

Comment 7 RHEL Program Management 2023-07-26 07:28:16 UTC
After evaluating this issue, there are no plans to address it further or fix it in an upcoming release.  Therefore, it is being closed.  If plans change such that this issue will be fixed in an upcoming release, then the bug can be reopened.