Bug 1753468

Summary: [RHEL-8/RDMA] ibsrpdm segmentation fault
Product: Red Hat Enterprise Linux 8
Component: rdma-core
Version: 8.2
Hardware: Unspecified
OS: Unspecified
Status: CLOSED WONTFIX
Severity: unspecified
Priority: unspecified
Reporter: Honggang LI <honli>
Assignee: Honggang LI <honli>
QA Contact: zguo <zguo>
CC: hwkernel-mgr, rdma-dev-team, zguo
Keywords: TestOnly
Target Milestone: rc
Target Release: 8.0
Fixed In Version: rdma-core-26.0-3.el8
Last Closed: 2021-03-19 07:30:52 UTC
Type: Bug
Bug Depends On: 1722257

Description Honggang LI 2019-09-19 02:58:35 UTC
Description of problem:

[root@rdma-dev-19 ~]$ grep -i distro /etc/motd
                           DISTRO=RHEL-8.1.0-20190916.n.0

[root@rdma-dev-19 ~]$ ibstat
CA 'mlx5_2'
	CA type: MT4115
	Number of ports: 1
	Firmware version: 12.23.1020
	Hardware version: 0
	Node GUID: 0x248a07030049d338
	System image GUID: 0x248a07030049d338
	Port 1:
		State: Active
		Physical state: LinkUp
		Rate: 100
		Base lid: 13
		LMC: 0
		SM lid: 1
		Capability mask: 0x2659e848
		Port GUID: 0x248a07030049d338
		Link layer: InfiniBand
CA 'mlx5_3'
	CA type: MT4115
	Number of ports: 1
	Firmware version: 12.23.1020
	Hardware version: 0
	Node GUID: 0x248a07030049d339
	System image GUID: 0x248a07030049d338
	Port 1:
		State: Active
		Physical state: LinkUp
		Rate: 40
		Base lid: 38
		LMC: 1
		SM lid: 36
		Capability mask: 0x2659e848
		Port GUID: 0x248a07030049d339
		Link layer: InfiniBand
CA 'mlx5_bond_0'
	CA type: MT4117
	Number of ports: 1
	Firmware version: 14.23.1020
	Hardware version: 0
	Node GUID: 0x7cfe900300cb743a
	System image GUID: 0x7cfe900300cb743a
	Port 1:
		State: Active
		Physical state: LinkUp
		Rate: 10
		Base lid: 0
		LMC: 0
		SM lid: 0
		Capability mask: 0x00010000
		Port GUID: 0x7efe90fffecb743a
		Link layer: Ethernet
[root@rdma-dev-19 ~]$ 
[root@rdma-dev-19 ~]$ 
[root@rdma-dev-19 ~]$ ibsrpdm -vc -d /dev/infiniband/umad0
Device mlx5_bond_0 was found
CQ was created with 10 CQEs
CQ was created with 1 CQEs
MR was created with addr=0x55da2e222010, lkey=0x614b,
QP was created, QP number=0x11e8
QPs were modified to RTS
SM LID is 0, maybe no SM is running
Querying SRP targets failed
[root@rdma-dev-19 ~]$ 
[root@rdma-dev-19 ~]$ 

[root@rdma-dev-19 ~]$ ibsrpdm -vc -d /dev/infiniband/umad1
Couldn't read ibdev attribute
Fail to translate umad to ibdev and port
Failed to build config
free(): double free detected in tcache 2
Aborted (core dumped)

[root@rdma-dev-19 ~]$ 
[root@rdma-dev-19 ~]$ 
[root@rdma-dev-19 ~]$ ibsrpdm -vc -d /dev/infiniband/umad2
Device mlx5_2 was found
CQ was created with 10 CQEs
CQ was created with 1 CQEs
MR was created with addr=0x5584da90d010, lkey=0x4ce2,
QP was created, QP number=0x94
QPs were modified to RTS
Advanced SM, performing a capability query
discover Targets for P_key ffff (index 0)
discover Targets for P_key 0280 (index 1)
discover Targets for P_key 0480 (index 2)
discover Targets for P_key 0680 (index 3)
discover Targets for P_key 0880 (index 4)
enter do_port
Allowing SRP target with id_ext 0002c90300317810 because not using a rules file
Found an SRP target with id_ext 0002c90300317810 - check if it is already connected
id_ext=0002c90300317810,ioc_guid=0002c90300317810,dgid=fe800000000000000002c90300317811,pkey=ffff,service_id=0002c90300317810
enter do_port
Allowing SRP target with id_ext 0002c90300317810 because not using a rules file
Found an SRP target with id_ext 0002c90300317810 - check if it is already connected
id_ext=0002c90300317810,ioc_guid=0002c90300317810,dgid=fe800000000000000002c90300317811,pkey=8002,service_id=0002c90300317810
enter do_port
Allowing SRP target with id_ext 0002c90300317810 because not using a rules file
Found an SRP target with id_ext 0002c90300317810 - check if it is already connected
id_ext=0002c90300317810,ioc_guid=0002c90300317810,dgid=fe800000000000000002c90300317811,pkey=8004,service_id=0002c90300317810
enter do_port
Allowing SRP target with id_ext 0002c90300317810 because not using a rules file
Found an SRP target with id_ext 0002c90300317810 - check if it is already connected
id_ext=0002c90300317810,ioc_guid=0002c90300317810,dgid=fe800000000000000002c90300317811,pkey=8006,service_id=0002c90300317810
enter do_port
Allowing SRP target with id_ext 0002c90300317810 because not using a rules file
Found an SRP target with id_ext 0002c90300317810 - check if it is already connected
id_ext=0002c90300317810,ioc_guid=0002c90300317810,dgid=fe800000000000000002c90300317811,pkey=8008,service_id=0002c90300317810
discover Targets for P_key ffff (index 0)
discover Targets for P_key 0280 (index 1)
discover Targets for P_key 0480 (index 2)
discover Targets for P_key 0680 (index 3)
discover Targets for P_key 0880 (index 4)
enter do_port
Allowing SRP target with id_ext 001175000077d81a because not using a rules file
Found an SRP target with id_ext 001175000077d81a - check if it is already connected
id_ext=001175000077d81a,ioc_guid=001175000077d81a,dgid=fe80000000000000001175000077d81a,pkey=ffff,service_id=001175000077d81a
enter do_port
Allowing SRP target with id_ext 001175000077d81a because not using a rules file
Found an SRP target with id_ext 001175000077d81a - check if it is already connected
id_ext=001175000077d81a,ioc_guid=001175000077d81a,dgid=fe80000000000000001175000077d81a,pkey=8002,service_id=001175000077d81a
enter do_port
Allowing SRP target with id_ext 001175000077d81a because not using a rules file
Found an SRP target with id_ext 001175000077d81a - check if it is already connected
id_ext=001175000077d81a,ioc_guid=001175000077d81a,dgid=fe80000000000000001175000077d81a,pkey=8004,service_id=001175000077d81a
enter do_port
Allowing SRP target with id_ext 001175000077d81a because not using a rules file
Found an SRP target with id_ext 001175000077d81a - check if it is already connected
id_ext=001175000077d81a,ioc_guid=001175000077d81a,dgid=fe80000000000000001175000077d81a,pkey=8006,service_id=001175000077d81a
enter do_port


Version-Release number of selected component (if applicable):
Both the RHEL-8.1 in-box srp_daemon and the latest upstream code are impacted by this issue.
srp_daemon-22.3-1.el8.x86_64 (rhel-8.1 in-box)
srp_daemon-26.0-1.el8.x86_64 (built from upstream repo)

How reproducible:
100%

Steps to Reproduce:
1. Identify a umad char device node that does not exist on the host (here, /dev/infiniband/umad1; see comment 1).
2. Run: ibsrpdm -vc -d /dev/infiniband/umad1
3. ibsrpdm aborts with a double-free error.

Actual results:
[root@rdma-dev-19 ~]$ ibsrpdm -vc -d /dev/infiniband/umad1
Couldn't read ibdev attribute
Fail to translate umad to ibdev and port
Failed to build config
free(): double free detected in tcache 2
Aborted (core dumped)

Expected results:
No segmentation fault; ibsrpdm should fail gracefully with an error message.

Additional info:

Comment 1 Honggang LI 2019-09-19 05:54:37 UTC
There is no 'umad1' char device file on rdma-dev-19.

[root@rdma-dev-19 ~]$ ls /sys/class/infiniband_mad/
abi_version  issm0  issm2  issm3  umad0  umad2  umad3

[root@rdma-dev-19 ~]$ ls  -l /dev/infiniband/umad*
crw-------. 1 root root 231, 0 Sep 18 22:42 /dev/infiniband/umad0
crw-------. 1 root root 231, 2 Sep 18 22:42 /dev/infiniband/umad2
crw-------. 1 root root 231, 3 Sep 18 22:42 /dev/infiniband/umad3
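
The "Couldn't read ibdev attribute" message comes from srp_daemon's attempt to map the umad device back to an RDMA device name via sysfs, by reading /sys/class/infiniband_mad/<umad>/ibdev. A minimal C sketch of that lookup (illustrative names, not srp_daemon's exact helper) shows why a nonexistent umad1 makes the translation step fail:

#include <stdio.h>
#include <string.h>

/* Read /sys/class/infiniband_mad/<umad_name>/ibdev into 'ibdev'.
 * For a device like umad1 that has no sysfs entry, fopen() fails,
 * which corresponds to the "Couldn't read ibdev attribute" path. */
static int read_ibdev(const char *umad_name, char *ibdev, size_t len)
{
	char path[256];
	FILE *f;

	snprintf(path, sizeof(path),
		 "/sys/class/infiniband_mad/%s/ibdev", umad_name);
	f = fopen(path, "r");
	if (!f) {
		fprintf(stderr, "Couldn't read ibdev attribute\n");
		return -1;
	}
	if (!fgets(ibdev, (int)len, f)) {
		fclose(f);
		return -1;
	}
	ibdev[strcspn(ibdev, "\n")] = '\0';	/* strip trailing newline */
	fclose(f);
	return 0;
}

int main(void)
{
	char ibdev[64];

	if (read_ibdev("umad1", ibdev, sizeof(ibdev)) == 0)
		printf("umad1 -> %s\n", ibdev);
	return 0;
}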

Comment 2 Honggang LI 2019-09-19 06:44:36 UTC
Sent this simple patch upstream for review and set the devel+ flag.

diff --git a/srp_daemon/srp_daemon.c b/srp_daemon/srp_daemon.c
index 337b21c7..f0bcf923 100644
--- a/srp_daemon/srp_daemon.c
+++ b/srp_daemon/srp_daemon.c
@@ -727,6 +727,7 @@ end:
        if (ret) {
                free(*ibport);
                free(*ibdev);
+               *ibdev = NULL;
        }
        free(class_dev_path);
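
For context, here is a minimal, self-contained C sketch (hypothetical names, not the actual srp_daemon code) of the pattern this one-liner fixes: an error path frees its out-parameters but leaves the caller's pointers dangling, so the caller's own cleanup free()s the same memory again. Because free(NULL) is a no-op, resetting the pointer after free() breaks the double free. The sketch resets both pointers for safety; the posted patch resets *ibdev, evidently the pointer the caller frees again.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the umad-to-ibdev translation helper. */
static int translate(char **ibdev, char **ibport)
{
	int ret = -1;	/* simulate "Couldn't read ibdev attribute" */

	*ibdev = strdup("mlx5_2");
	*ibport = strdup("1");

	if (ret) {
		free(*ibport);
		free(*ibdev);
		*ibdev = NULL;	/* the fix: leave no dangling pointer */
		*ibport = NULL;
	}
	return ret;
}

int main(void)
{
	char *ibdev = NULL, *ibport = NULL;

	if (translate(&ibdev, &ibport))
		fprintf(stderr, "Failed to build config\n");

	/* Caller cleanup: free(NULL) is a no-op, but without the
	 * resets above these would free already-freed memory. */
	free(ibdev);
	free(ibport);
	return 0;
}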

Comment 3 Jarod Wilson 2019-11-21 18:13:35 UTC
This patch was included in upstream rdma-core v26, which is now built for RHEL-8.2, so marking TestOnly and moving to MODIFIED.

Comment 9 zguo 2019-12-05 02:27:35 UTC
Set to VERIFIED per comments 5 and 8.

Comment 12 RHEL Program Management 2021-03-19 07:30:52 UTC
After evaluating this issue, we have no plans to address it further or fix it in an upcoming release; therefore, it is being closed. If plans change such that this issue will be fixed in an upcoming release, the bug can be reopened.