Bug 2037144

Summary: [RHEL-9] The output of ethfindgood is a little confusing for me
Product: Red Hat Enterprise Linux 9
Component: eth-tools
Version: 9.0
Status: CLOSED WONTFIX
Severity: medium
Priority: unspecified
Target Milestone: rc
Hardware: Unspecified
OS: Unspecified
Reporter: zguo <zguo>
Assignee: Kamal Heib <kheib>
QA Contact: Infiniband QE <infiniband-qe>
CC: jijun.wang, kheib, rdma-dev-team
Type: Bug
Last Closed: 2023-07-05 07:28:26 UTC

Description zguo 2022-01-05 04:02:48 UTC
Description of problem:


Version-Release number of selected component (if applicable):
$ rpm -qa | grep eth-
eth-tools-basic-11.1.0.1-5.el9.x86_64
eth-tools-fastfabric-11.1.0.1-5.el9.x86_64


How reproducible:
Always

Steps to Reproduce:
1.
$ cat /etc/eth-tools/hosts 
172.31.40.130
172.31.40.131
2. $ /usr/sbin/ethsetupssh -S -p -f /etc/eth-tools/hosts
3. $ /usr/sbin/ethsetupsnmp -p -L -f /etc/eth-tools/hosts
4. 
$ ethfindgood
Warning: backed up existing /etc/eth-tools/{alive,running,active,good,bad} as *.bak files.
2 hosts will be checked
2 hosts are pingable (alive)
2 hosts are ssh'able (running)
bash: line 10: [: too many arguments
IB device '/sys/devices/pci0000:40/0000:40:00.2/iommu/ivhd1/devices/0000:44:00.1/infiniband:' wasn't found
bash: line 10: [: too many arguments
IB device '/sys/devices/pci0000:40/0000:40:00.2/iommu/ivhd1/devices/0000:44:00.1/infiniband:' wasn't found
0 total hosts have RDMA active ports on one or more fabrics (active)
0 hosts are alive, running, active (good)
2 hosts are bad (bad)
Bad hosts have been added to /root/punchlist.csv

$ cat /root/punchlist.csv
2022/01/04 22:06:34;172.31.40.130;Has inactive RDMA port(s)
2022/01/04 22:06:34;172.31.40.131;Has inactive RDMA port(s)
2022/01/04 22:07:32;172.31.40.130;Has inactive RDMA port(s)
2022/01/04 22:07:32;172.31.40.131;Has inactive RDMA port(s)
2022/01/04 22:08:54;172.31.40.130;Has inactive RDMA port(s)
2022/01/04 22:08:54;172.31.40.131;Has inactive RDMA port(s)
2022/01/04 22:17:18;172.31.40.130;Has inactive RDMA port(s)
2022/01/04 22:17:18;172.31.40.131;Has inactive RDMA port(s)
2022/01/04 22:41:18;172.31.40.130;Has inactive RDMA port(s)
2022/01/04 22:41:18;172.31.40.131;Has inactive RDMA port(s)
2022/01/04 22:53:11;172.31.40.130;Has inactive RDMA port(s)
2022/01/04 22:53:11;172.31.40.131;Has inactive RDMA port(s)

$ ibstatus
Infiniband device 'irdma0' port 1 status:
	default gid:	 fe80:0000:0000:0000:b696:91ff:fead:8588
	base lid:	 0x1
	sm lid:		 0x0
	state:		 4: ACTIVE
	phys state:	 5: LinkUp
	rate:		 100 Gb/sec (4X EDR)
	link_layer:	 Ethernet

Infiniband device 'irdma1' port 1 status:
	default gid:	 fe80:0000:0000:0000:b696:91ff:fead:8589
	base lid:	 0x1
	sm lid:		 0x0
	state:		 1: DOWN
	phys state:	 3: Disabled
	rate:		 100 Gb/sec (4X EDR)
	link_layer:	 Ethernet




Actual results:
$ ethfindgood
Warning: backed up existing /etc/eth-tools/{alive,running,active,good,bad} as *.bak files.
2 hosts will be checked
2 hosts are pingable (alive)
2 hosts are ssh'able (running)
bash: line 10: [: too many arguments
IB device '/sys/devices/pci0000:40/0000:40:00.2/iommu/ivhd1/devices/0000:44:00.1/infiniband:' wasn't found
bash: line 10: [: too many arguments
IB device '/sys/devices/pci0000:40/0000:40:00.2/iommu/ivhd1/devices/0000:44:00.1/infiniband:' wasn't found
0 total hosts have RDMA active ports on one or more fabrics (active)
0 hosts are alive, running, active (good)
2 hosts are bad (bad)
Bad hosts have been added to /root/punchlist.csv

Expected results:
Each host has 2 irdma HCAs, and each HCA has 1 port. On every host, 1 HCA is active and 1 is inactive, so at least 1 HCA works on each host. But the output shows
"0 total hosts have RDMA active ports on one or more fabrics (active)
0 hosts are alive, running, active (good)".

I think it should be
"2 total hosts have RDMA active ports on one or more fabrics (active)
2 hosts are alive, running, active (good)
2 hosts have inactive RDMA port(s) (bad)"

Besides, it shows an unexpected "bash: line 10: [: too many arguments" error.

Additional info:

Comment 1 Honggang LI 2022-01-05 05:07:51 UTC
[test@rdma-dev-30 sbin]$ sh -x /usr/sbin/ethfindgood
+ '[' -f /etc/eth-tools/ethfastfabric.conf ']'
+ . /etc/eth-tools/ethfastfabric.conf
++ '[' '' = '' ']'
++ CONFIG_DIR=/etc
++ export CONFIG_DIR
++ export HOSTS_FILE=/etc/eth-tools/hosts
++ HOSTS_FILE=/etc/eth-tools/hosts
++ export SWITCHES_FILE=/etc/eth-tools/switches
++ SWITCHES_FILE=/etc/eth-tools/switches
++ export MGMT_HOST=localhost
++ MGMT_HOST=localhost
++ export FF_MAX_PARALLEL=1000
++ FF_MAX_PARALLEL=1000
++ export FF_TIMEOUT_MULT=2
++ FF_TIMEOUT_MULT=2
++ export FF_RESULT_DIR=/home/test
++ FF_RESULT_DIR=/home/test
+++ cat /usr/lib/eth-tools/osid_wrapper
++ export FF_PRODUCT=IntelEth-Basic.RHEL9-x86_64
++ FF_PRODUCT=IntelEth-Basic.RHEL9-x86_64
+++ cat /etc/eth-tools/version_wrapper
++ export FF_PRODUCT_VERSION=
++ FF_PRODUCT_VERSION=
++ export 'FF_PACKAGES=eth eth_rdma'
++ FF_PACKAGES='eth eth_rdma'
++ export 'FF_INSTALL_OPTIONS= '
++ FF_INSTALL_OPTIONS=' '
++ export 'FF_UPGRADE_OPTIONS= '
++ FF_UPGRADE_OPTIONS=' '
++ export UPLOADS_DIR=./uploads
++ UPLOADS_DIR=./uploads
++ export DOWNLOADS_DIR=./downloads
++ DOWNLOADS_DIR=./downloads
++ export FF_ANALYSIS_DIR=/var/usr/lib/eth-tools/analysis
++ FF_ANALYSIS_DIR=/var/usr/lib/eth-tools/analysis
++ export FF_LOGIN_METHOD=ssh
++ FF_LOGIN_METHOD=ssh
++ export FF_USERNAME=root
++ FF_USERNAME=root
++ export FF_PASSWORD=
++ FF_PASSWORD=
++ export FF_ROOTPASS=
++ FF_ROOTPASS=
++ export 'FF_FABRIC_HEALTH= -o errors -o slowlinks'
++ FF_FABRIC_HEALTH=' -o errors -o slowlinks'
++ export FF_ALL_ANALYSIS=fabric
++ FF_ALL_ANALYSIS=fabric
++ export 'FF_DIFF_CMD=diff -C 1'
++ FF_DIFF_CMD='diff -C 1'
++ export FF_MPI_APPS_DIR=/home/test/mpi_apps
++ FF_MPI_APPS_DIR=/home/test/mpi_apps
++ export FF_CUDA_DIR=/usr/local/cuda
++ FF_CUDA_DIR=/usr/local/cuda
++ export FF_MPI_ENV=
++ FF_MPI_ENV=
++ export 'FF_DEVIATION_ARGS=-bwtol 20 -lattol 50 -c'
++ FF_DEVIATION_ARGS='-bwtol 20 -lattol 50 -c'
++ export FF_SERIALIZE_OUTPUT=yes
++ FF_SERIALIZE_OUTPUT=yes
++ export FF_HOSTVERIFY_DIR=/root
++ FF_HOSTVERIFY_DIR=/root
+ . /usr/lib/eth-tools/ethfastfabric.conf.def
++ '[' /etc = '' ']'
++ export HOSTS_FILE=/etc/eth-tools/hosts
++ HOSTS_FILE=/etc/eth-tools/hosts
++ export SWITCHES_FILE=/etc/eth-tools/switches
++ SWITCHES_FILE=/etc/eth-tools/switches
++ export MGMT_HOST=localhost
++ MGMT_HOST=localhost
++ export FF_MAX_PARALLEL=1000
++ FF_MAX_PARALLEL=1000
++ export FF_TIMEOUT_MULT=2
++ FF_TIMEOUT_MULT=2
++ export FF_RESULT_DIR=/home/test
++ FF_RESULT_DIR=/home/test
++ export FF_PRODUCT=IntelEth-Basic.RHEL9-x86_64
++ FF_PRODUCT=IntelEth-Basic.RHEL9-x86_64
+++ cat /etc/eth-tools/version_wrapper
++ export FF_PRODUCT_VERSION=
++ FF_PRODUCT_VERSION=
++ export 'FF_PACKAGES=eth eth_rdma'
++ FF_PACKAGES='eth eth_rdma'
++ export 'FF_INSTALL_OPTIONS= '
++ FF_INSTALL_OPTIONS=' '
++ export 'FF_UPGRADE_OPTIONS= '
++ FF_UPGRADE_OPTIONS=' '
++ export UPLOADS_DIR=./uploads
++ UPLOADS_DIR=./uploads
++ export DOWNLOADS_DIR=./downloads
++ DOWNLOADS_DIR=./downloads
++ export FF_ANALYSIS_DIR=/var/usr/lib/eth-tools/analysis
++ FF_ANALYSIS_DIR=/var/usr/lib/eth-tools/analysis
++ export FF_LOGIN_METHOD=ssh
++ FF_LOGIN_METHOD=ssh
++ export FF_USERNAME=root
++ FF_USERNAME=root
++ export FF_PASSWORD=
++ FF_PASSWORD=
++ export FF_ROOTPASS=
++ FF_ROOTPASS=
++ export 'FF_FABRIC_HEALTH= -o errors -o slowlinks'
++ FF_FABRIC_HEALTH=' -o errors -o slowlinks'
++ export FF_ALL_ANALYSIS=fabric
++ FF_ALL_ANALYSIS=fabric
++ export 'FF_DIFF_CMD=diff -C 1'
++ FF_DIFF_CMD='diff -C 1'
++ export FF_MPI_APPS_DIR=/home/test/mpi_apps
++ FF_MPI_APPS_DIR=/home/test/mpi_apps
++ export FF_CUDA_DIR=/usr/local/cuda
++ FF_CUDA_DIR=/usr/local/cuda
++ export FF_MPI_ENV=
++ FF_MPI_ENV=
++ export 'FF_DEVIATION_ARGS=-bwtol 20 -lattol 50 -c'
++ FF_DEVIATION_ARGS='-bwtol 20 -lattol 50 -c'
++ export FF_SERIALIZE_OUTPUT=yes
++ FF_SERIALIZE_OUTPUT=yes
++ export FF_HOSTVERIFY_DIR=/root
++ FF_HOSTVERIFY_DIR=/root
+ . /usr/lib/eth-tools/ff_funcs
++ FF_PRD_NAME=eth-tools
++ declare -A LC_NODE_PORTS
++ '[' /etc = '' ']'
+ trap 'exit 1' SIGHUP SIGTERM SIGINT
+ punchlist=/home/test/punchlist.csv
+ del=';'
++ date '+%Y/%m/%d %T'
+ timestamp='2022/01/05 00:07:21'
++ basename /usr/sbin/ethfindgood
+ readonly BASENAME=ethfindgood
+ BASENAME=ethfindgood
+ '[' x = x--help ']'
+ skip_ssh=n
+ skip_active=n
+ dir=/etc/eth-tools
+ timelimit=20
+ getopts d:f:h:RAT: param
+ shift 0
+ '[' 0 -gt 0 ']'
+ check_host_args ethfindgood
+ local l_hosts_file
+ '[' /etc/eth-tools/hosts = '' ']'
+ '[' '' = '' ']'
+ l_hosts_file=/etc/eth-tools/hosts
++ resolve_file ethfindgood /etc/eth-tools/hosts
++ '[' -f /etc/eth-tools/hosts ']'
++ echo /etc/eth-tools/hosts
+ HOSTS_FILE=/etc/eth-tools/hosts
+ '[' /etc/eth-tools/hosts = '' ']'
++ expand_file ethfindgood /etc/eth-tools/hosts
++ local file
++ cat /etc/eth-tools/hosts
++ ff_filter_comments
++ read line
++ egrep -v '^[[:space:]]*#'
++ egrep -v '^[[:space:]]*$'
+++ expr 172.31.40.130 : '\([^ 	]*\).*'
++ f1=172.31.40.130
++ '[' x172.31.40.130 = xinclude ']'
++ echo 172.31.40.130
++ cut -f1
++ read line
+++ expr 172.31.40.131 : '\([^ 	]*\).*'
++ f1=172.31.40.131
++ '[' x172.31.40.131 = xinclude ']'
++ echo 172.31.40.131
++ cut -f1
++ read line
+ CONTENTS='172.31.40.130
172.31.40.131'
++ extract_device_name ethfindgood '172.31.40.130
172.31.40.131'
++ echo '172.31.40.130
172.31.40.131'
++ read line
++ echo 172.31.40.130
++ awk -F '[:,[({]' '{print $1}'
++ read line
++ echo 172.31.40.131
++ awk -F '[:,[({]' '{print $1}'
++ read line
+ HOSTS='172.31.40.130
172.31.40.131'
+ '[' '172.31.40.130
172.31.40.131' = '' ']'
+ extract_node_ports '172.31.40.130
172.31.40.131'
+ content='172.31.40.130
172.31.40.131'
+ for line in $content
+ raw_node=172.31.40.130
++ trim_string 172.31.40.130
++ str=172.31.40.130
++ echo 172.31.40.130
++ sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//'
+ node=172.31.40.130
+ [[ 172.31.40.130 = \1\7\2\.\3\1\.\4\0\.\1\3\0 ]]
+ LC_NODE_PORTS[${node,,}]=
+ for line in $content
+ raw_node=172.31.40.131
++ trim_string 172.31.40.131
++ str=172.31.40.131
++ echo 172.31.40.131
++ sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//'
+ node=172.31.40.131
+ [[ 172.31.40.131 = \1\7\2\.\3\1\.\4\0\.\1\3\1 ]]
+ LC_NODE_PORTS[${node,,}]=
+ export HOSTS
+ unset HOSTS_FILE
+ good_meaning=
+ good_file=
++ mktemp
+ alive_hostonly=/tmp/tmp.m4wnmu0F5J
++ mktemp
+ running_hostonly=/tmp/tmp.EblW5uMxES
+ bak_files=
+ for file in alive running active good bad
+ '[' -f /etc/eth-tools/alive ']'
+ mv -f /etc/eth-tools/alive /etc/eth-tools/alive.bak
mv: cannot move '/etc/eth-tools/alive' to '/etc/eth-tools/alive.bak': Permission denied
+ [[ -z '' ]]
+ bak_files=alive
+ for file in alive running active good bad
+ '[' -f /etc/eth-tools/running ']'
+ mv -f /etc/eth-tools/running /etc/eth-tools/running.bak
mv: cannot move '/etc/eth-tools/running' to '/etc/eth-tools/running.bak': Permission denied
+ [[ -z alive ]]
+ bak_files=alive,running
+ for file in alive running active good bad
+ '[' -f /etc/eth-tools/active ']'
+ mv -f /etc/eth-tools/active /etc/eth-tools/active.bak
mv: cannot move '/etc/eth-tools/active' to '/etc/eth-tools/active.bak': Permission denied
+ [[ -z alive,running ]]
+ bak_files=alive,running,active
+ for file in alive running active good bad
+ '[' -f /etc/eth-tools/good ']'
+ mv -f /etc/eth-tools/good /etc/eth-tools/good.bak
mv: cannot move '/etc/eth-tools/good' to '/etc/eth-tools/good.bak': Permission denied
+ [[ -z alive,running,active ]]
+ bak_files=alive,running,active,good
+ for file in alive running active good bad
+ '[' -f /etc/eth-tools/bad ']'
+ mv -f /etc/eth-tools/bad /etc/eth-tools/bad.bak
mv: cannot move '/etc/eth-tools/bad' to '/etc/eth-tools/bad.bak': Permission denied
+ [[ -z alive,running,active,good ]]
+ bak_files=alive,running,active,good,bad
+ [[ -n alive,running,active,good,bad ]]
+ echo 'Warning: backed up existing /etc/eth-tools/{alive,running,active,good,bad} as *.bak files.'
Warning: backed up existing /etc/eth-tools/{alive,running,active,good,bad} as *.bak files.
++ ff_var_filter_dups_to_stdout '172.31.40.130
172.31.40.131'
++ wc -l
++ ff_var_to_stdout '172.31.40.130
172.31.40.131'
++ ff_filter_dups
++ echo '172.31.40.130
172.31.40.131'
++ ff_to_lc
++ tr A-Z a-z
++ tr -s ' ' '\n'
++ sort -u
++ sed -e '/^$/d'
+ echo '2 hosts will be checked'
2 hosts will be checked
+ ethpingall -p
+ grep 'is alive'
+ sed -e 's/:.*//'
+ ff_filter_dups
+ ethsorthosts
+ ff_to_lc
+ tr A-Z a-z
+ sort -u
+ append_punchlist /dev/fd/63 /tmp/tmp.m4wnmu0F5J 'Doesn'\''t ping'
++ ff_var_filter_dups_to_stdout '172.31.40.130
172.31.40.131'
++ ff_var_to_stdout '172.31.40.130
172.31.40.131'
+ ethsorthosts
++ ff_filter_dups
++ sort /dev/fd/63
+ read host
++ echo '172.31.40.130
+ comm -23 /dev/fd/62 /dev/fd/61
172.31.40.131'
++ ff_to_lc
++ sort /tmp/tmp.m4wnmu0F5J
++ tr A-Z a-z
++ tr -s ' ' '\n'
++ sort -u
++ sed -e '/^$/d'
+ good_meaning=alive
+ good_file_hostonly=/tmp/tmp.m4wnmu0F5J
+ to_nodes_ports /tmp/tmp.m4wnmu0F5J /etc/eth-tools/alive
+ src=/tmp/tmp.m4wnmu0F5J
+ dst=/etc/eth-tools/alive
++ cat /tmp/tmp.m4wnmu0F5J
+ get_nodes_ports '172.31.40.130
172.31.40.131'
/usr/sbin/ethfindgood: line 186: /etc/eth-tools/alive: Permission denied
+ good_file=/etc/eth-tools/alive
++ cat /etc/eth-tools/alive
++ wc -l
+ echo '2 hosts are pingable (alive)'
2 hosts are pingable (alive)
+ '[' n = n ']'
+ ethsorthosts
++ to_canon
+ mycomm12 /dev/fd/63 /dev/fd/62
+ /usr/lib/eth-tools/comm12 /dev/fd/63 /dev/fd/62
++ read line
++ sort --ignore-case -t ' ' -k1,1
++ ethcmdall -h '' -f /tmp/tmp.m4wnmu0F5J -P -p -T 20 'echo 123'
++ grep ': 123'
++ sed 's/:.*//'
+++ echo 172.31.40.130
++ ff_filter_dups
+++ ff_to_lc
+++ tr A-Z a-z
++ to_canon
++ ff_to_lc
++ tr A-Z a-z
++ sort -u
++ read line
++ sort --ignore-case -t ' ' -k1,1
++ canon=172.31.40.130
++ echo '172.31.40.130 172.31.40.130'
++ read line
+++ echo 172.31.40.131
+++ ff_to_lc
+++ tr A-Z a-z
++ canon=172.31.40.131
++ echo '172.31.40.131 172.31.40.131'
++ read line
+++ echo 172.31.40.130
+++ ff_to_lc
+++ tr A-Z a-z
++ canon=172.31.40.130
++ echo '172.31.40.130 172.31.40.130'
++ read line
+++ echo 172.31.40.131
+++ ff_to_lc
+++ tr A-Z a-z
++ canon=172.31.40.131
++ echo '172.31.40.131 172.31.40.131'
++ read line
+ append_punchlist /tmp/tmp.m4wnmu0F5J /tmp/tmp.EblW5uMxES 'Can'\''t ssh'
+ ethsorthosts
+ read host
++ sort /tmp/tmp.m4wnmu0F5J
+ comm -23 /dev/fd/63 /dev/fd/62
++ sort /tmp/tmp.EblW5uMxES
+ to_nodes_ports /tmp/tmp.EblW5uMxES /etc/eth-tools/running
+ src=/tmp/tmp.EblW5uMxES
+ dst=/etc/eth-tools/running
++ cat /tmp/tmp.EblW5uMxES
+ get_nodes_ports '172.31.40.130
172.31.40.131'
/usr/sbin/ethfindgood: line 186: /etc/eth-tools/running: Permission denied
+ good_meaning='alive, running'
+ good_file=/etc/eth-tools/running
++ cat /etc/eth-tools/running
++ wc -l
+ echo '2 hosts are ssh'\''able (running)'
2 hosts are ssh'able (running)
+ rm -f /tmp/tmp.EblW5uMxES
+ rm -f /tmp/tmp.m4wnmu0F5J
+ '[' n = n ']'
+ ff_filter_dups
+ ethsorthosts
++ cat /etc/eth-tools/running
+ ff_to_lc
+ tr A-Z a-z
/usr/sbin/ethfindgood: line 277: /etc/eth-tools/active: Permission denied
+ sort -u
+ for line in $(cat $good_file)
+ host=172.31.40.130
++ get_node_ports 172.31.40.130
++ node=172.31.40.130
++ echo ''
+ ports=
+ [[ -z '' ]]
+ cmds='
				ports="$(ls -l /sys/class/net/*/device/driver | grep '\''ice$'\'' | awk '\''{print $9}'\'' | cut -d '\''/'\'' -f5)"
				[ -z "$ports" ] && exit 1
				for port in $ports; do
			'
+ cmds='type ibv_devinfo > /dev/null 2>&1 || exit 1
			
				ports="$(ls -l /sys/class/net/*/device/driver | grep '\''ice$'\'' | awk '\''{print $9}'\'' | cut -d '\''/'\'' -f5)"
				[ -z "$ports" ] && exit 1
				for port in $ports; do
			
				slot=$(ls -l /sys/class/net | grep $port | awk '\''{print $11}'\'' | cut -d '\''/'\'' -f 6)
				[ -z $slot ] && exit 1
				irdma_dev=$(ls $(find /sys/devices/ -name $slot)/infiniband 2> /dev/null)
				[ -z $irdma_dev ] && exit 1
				ibv_devinfo -d $irdma_dev | grep '\''^\s*state:\s*PORT_ACTIVE'\'' > /dev/null 2>&1 || exit 1
			done
		'
+ ssh 172.31.40.130 'type ibv_devinfo > /dev/null 2>&1 || exit 1
			
				ports="$(ls -l /sys/class/net/*/device/driver | grep '\''ice$'\'' | awk '\''{print $9}'\'' | cut -d '\''/'\'' -f5)"
				[ -z "$ports" ] && exit 1
				for port in $ports; do
			
				slot=$(ls -l /sys/class/net | grep $port | awk '\''{print $11}'\'' | cut -d '\''/'\'' -f 6)
				[ -z $slot ] && exit 1
				irdma_dev=$(ls $(find /sys/devices/ -name $slot)/infiniband 2> /dev/null)
				[ -z $irdma_dev ] && exit 1
				ibv_devinfo -d $irdma_dev | grep '\''^\s*state:\s*PORT_ACTIVE'\'' > /dev/null 2>&1 || exit 1
			done
		'
bash: line 10: [: too many arguments
IB device '/sys/devices/pci0000:40/0000:40:00.2/iommu/ivhd1/devices/0000:44:00.1/infiniband:' wasn't found
+ for line in $(cat $good_file)
+ host=172.31.40.131
++ get_node_ports 172.31.40.131
++ node=172.31.40.131
++ echo ''
+ ports=
+ [[ -z '' ]]
+ cmds='
				ports="$(ls -l /sys/class/net/*/device/driver | grep '\''ice$'\'' | awk '\''{print $9}'\'' | cut -d '\''/'\'' -f5)"
				[ -z "$ports" ] && exit 1
				for port in $ports; do
			'
+ cmds='type ibv_devinfo > /dev/null 2>&1 || exit 1
			
				ports="$(ls -l /sys/class/net/*/device/driver | grep '\''ice$'\'' | awk '\''{print $9}'\'' | cut -d '\''/'\'' -f5)"
				[ -z "$ports" ] && exit 1
				for port in $ports; do
			
				slot=$(ls -l /sys/class/net | grep $port | awk '\''{print $11}'\'' | cut -d '\''/'\'' -f 6)
				[ -z $slot ] && exit 1
				irdma_dev=$(ls $(find /sys/devices/ -name $slot)/infiniband 2> /dev/null)
				[ -z $irdma_dev ] && exit 1
				ibv_devinfo -d $irdma_dev | grep '\''^\s*state:\s*PORT_ACTIVE'\'' > /dev/null 2>&1 || exit 1
			done
		'
+ ssh 172.31.40.131 'type ibv_devinfo > /dev/null 2>&1 || exit 1
			
				ports="$(ls -l /sys/class/net/*/device/driver | grep '\''ice$'\'' | awk '\''{print $9}'\'' | cut -d '\''/'\'' -f5)"
				[ -z "$ports" ] && exit 1
				for port in $ports; do
			
				slot=$(ls -l /sys/class/net | grep $port | awk '\''{print $11}'\'' | cut -d '\''/'\'' -f 6)
				[ -z $slot ] && exit 1
				irdma_dev=$(ls $(find /sys/devices/ -name $slot)/infiniband 2> /dev/null)
				[ -z $irdma_dev ] && exit 1
				ibv_devinfo -d $irdma_dev | grep '\''^\s*state:\s*PORT_ACTIVE'\'' > /dev/null 2>&1 || exit 1
			done
		'
bash: line 10: [: too many arguments
IB device '/sys/devices/pci0000:40/0000:40:00.2/iommu/ivhd1/devices/0000:44:00.1/infiniband:' wasn't found
+ append_punchlist /etc/eth-tools/running /etc/eth-tools/active 'Has inactive RDMA port(s)'
+ ethsorthosts
++ sort /etc/eth-tools/running
+ read host
+ comm -23 /dev/fd/63 /dev/fd/62
++ sort /etc/eth-tools/active
+ echo '2022/01/05 00:07:21;172.31.40.130;Has inactive RDMA port(s)'
+ read host
+ echo '2022/01/05 00:07:21;172.31.40.131;Has inactive RDMA port(s)'
+ read host
+ ethsorthosts
++ to_canon
/usr/sbin/ethfindgood: line 280: /etc/eth-tools/good: Permission denied
+ mycomm12 /dev/fd/63 /dev/fd/62
++ to_canon
+ /usr/lib/eth-tools/comm12 /dev/fd/63 /dev/fd/62
++ read line
++ sort --ignore-case -t ' ' -k1,1
++ read line
++ sort --ignore-case -t ' ' -k1,1
+++ echo 172.31.40.130
+++ ff_to_lc
+++ tr A-Z a-z
++ canon=172.31.40.130
++ echo '172.31.40.130 172.31.40.130'
++ read line
+++ echo 172.31.40.131
+++ ff_to_lc
+++ tr A-Z a-z
++ canon=172.31.40.131
++ echo '172.31.40.131 172.31.40.131'
++ read line
+ good_meaning='alive, running, active'
++ cat /etc/eth-tools/active
++ wc -l
+ echo '0 total hosts have RDMA active ports on one or more fabrics (active)'
0 total hosts have RDMA active ports on one or more fabrics (active)
++ cat /etc/eth-tools/good
++ wc -l
+ echo '0 hosts are alive, running, active (good)'
0 hosts are alive, running, active (good)
+ ethsorthosts
/usr/sbin/ethfindgood: line 290: /etc/eth-tools/bad: Permission denied
+ comm -23 /dev/fd/63 /dev/fd/62
++ sort /etc/eth-tools/good
+++ get_nodes_ports '172.31.40.130
172.31.40.131'
+++ nodes='172.31.40.130
172.31.40.131'
+++ for node in $nodes
+++ ports=
+++ [[ -z '' ]]
+++ echo 172.31.40.130
+++ for node in $nodes
+++ ports=
+++ [[ -z '' ]]
+++ echo 172.31.40.131
++ ff_var_filter_dups_to_stdout '172.31.40.130
172.31.40.131'
++ ff_var_to_stdout '172.31.40.130
172.31.40.131'
++ ff_filter_dups
++ echo '172.31.40.130
172.31.40.131'
++ ff_to_lc
++ tr A-Z a-z
++ tr -s ' ' '\n'
++ sort -u
++ sed -e '/^$/d'
++ cat /etc/eth-tools/bad
++ wc -l
+ echo '2 hosts are bad (bad)'
2 hosts are bad (bad)
+ echo 'Bad hosts have been added to /home/test/punchlist.csv'
Bad hosts have been added to /home/test/punchlist.csv
+ exit 0
[test@rdma-dev-30 sbin]$

Comment 2 Honggang LI 2022-01-05 05:58:45 UTC
$  diff -Nurp ethfindgood.orig ethfindgood.new
--- ethfindgood.orig	2022-01-05 00:23:42.792600133 -0500
+++ ethfindgood.new	2022-01-05 00:57:26.236753421 -0500
@@ -268,7 +268,7 @@ then
 			$cmds
 				slot=\$(ls -l /sys/class/net | grep \$port | awk '{print \$11}' | cut -d '/' -f 6)
 				[ -z \$slot ] && exit 1
-				irdma_dev=\$(ls \$(find /sys/devices/ -name \$slot)/infiniband 2> /dev/null)
+				irdma_dev=\$(ls \$(find /sys/devices/ -ipath */\$slot/infiniband) 2> /dev/null)
 				[ -z \$irdma_dev ] && exit 1
 				ibv_devinfo -d \$irdma_dev | grep '^\s*state:\s*PORT_ACTIVE' > /dev/null 2>&1 || exit 1
 			done
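
For context, here is a small stand-alone illustration of why the original `find -name` lookup can hand `ls` more than one directory, while a `-path` pattern anchored on `.../$slot/infiniband` does not. It is only a sketch: it builds a throwaway tree under a `mktemp -d` directory that imitates the two locations seen in the trace (the real PCI device directory and an iommu entry pointing at it). The directory names are copied from the trace, but the tree, the variable names, and the assumption that the iommu entry is a symlink are made up for the demo. It also uses `-path` with a quoted pattern, whereas the patch above uses `-ipath` with an unquoted one.

```shell
#!/bin/bash
# Build a throwaway tree; nothing under /sys is touched.
demo_root=$(mktemp -d)
slot='0000:44:00.1'
mkdir -p "$demo_root/pci0000:40/0000:40:03.1/$slot/infiniband/irdma1"
mkdir -p "$demo_root/pci0000:40/0000:40:00.2/iommu/ivhd1/devices"
ln -s "$demo_root/pci0000:40/0000:40:03.1/$slot" \
      "$demo_root/pci0000:40/0000:40:00.2/iommu/ivhd1/devices/$slot"

# -name matches the real directory and the symlink: two results.
by_name=$(find "$demo_root" -name "$slot" | wc -l)

# -path matches only the infiniband directory under the real device,
# because find does not descend through the symlink by default.
irdma_dev=$(ls $(find "$demo_root" -path "*/$slot/infiniband"))

echo "paths matched by -name: $by_name"
echo "irdma device via -path: $irdma_dev"
rm -rf "$demo_root"
```

With two `-name` matches, appending `/infiniband` to the command substitution only affects the last path, which would explain the multi-directory `ls` listing that ends up in `irdma_dev`.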

Comment 3 Honggang LI 2022-01-05 06:00:33 UTC
(In reply to zguo from comment #0)

> "
> Besides, it show unexpected "bash: line 10: [: too many arguments
> "


Please test this patch.
https://bugzilla.redhat.com/show_bug.cgi?id=2037144#c2

Comment 4 zguo 2022-01-05 07:25:38 UTC
(In reply to Honggang LI from comment #3)
> (In reply to zguo from comment #0)
> 
> > "
> > Besides, it show unexpected "bash: line 10: [: too many arguments
> > "
> 
> 
> Please test this patch.
> https://bugzilla.redhat.com/show_bug.cgi?id=2037144#c2

[root@rdma-dev-30 ~]$ /usr/sbin/ethfindgood
Warning: backed up existing /etc/eth-tools/{alive,running,active,good,bad} as *.bak files.
2 hosts will be checked
2 hosts are pingable (alive)
2 hosts are ssh'able (running)
0 total hosts have RDMA active ports on one or more fabrics (active)
0 hosts are alive, running, active (good)
2 hosts are bad (bad)
Bad hosts have been added to /root/punchlist.csv

Comment 5 Jijun Wang 2022-01-05 15:42:09 UTC
The script intends to find the devices that use the ice driver, then find each device's slot, and then the irdma device name.
Could you run the script below and post the output here?

#!/bin/bash

set -x
ports="$(ls -l /sys/class/net/*/device/driver | grep 'ice$' | awk '{print $9}' | cut -d '/' -f5)"
[ -z "$ports" ] && exit 1
for port in $ports; do
  slot=$(ls -l /sys/class/net | grep $port | awk '{print $11}' | cut -d '/' -f 6)
  [ -z $slot ] && exit 1
  irdma_dev=$(ls $(find /sys/devices/ -name $slot)/infiniband 2> /dev/null)
  [ -z $irdma_dev ] && exit 1
  ibv_devinfo -d $irdma_dev
done
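
As a side note, the same three-step lookup (ice netdev, then PCI function, then irdma device) can probably be done without parsing `ls -l` output. The sketch below assumes that each physical netdev under /sys/class/net has a `device` link, and that the PCI device directory behind it holds both the `driver` link and, when RDMA is registered, an `infiniband` subdirectory. The function name and its optional second argument are hypothetical; the argument exists only so the function can be tried against a fake tree. This is not what ethfindgood does.

```shell
#!/bin/bash
# Print the RDMA device names that sit behind netdevs bound to a given driver.
# Usage: irdma_devs_for_driver <driver> [net_class_dir]
irdma_devs_for_driver() {
    driver=$1
    net_dir=${2:-/sys/class/net}
    for dev_link in "$net_dir"/*/device; do
        # Skip entries that have no bound driver (or an unmatched glob).
        [ -e "$dev_link/driver" ] || continue
        # The driver entry is a symlink; its basename is the driver name.
        [ "$(basename "$(readlink "$dev_link/driver")")" = "$driver" ] || continue
        # List RDMA devices registered for this PCI function, if any.
        [ -d "$dev_link/infiniband" ] && ls "$dev_link/infiniband"
    done
    return 0
}

# Example (on a host with ice NICs): irdma_devs_for_driver ice
```

Because every expansion is quoted and each netdev is handled on its own, a port name that is a prefix of another name cannot pull in extra entries.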

Comment 6 Honggang LI 2022-01-06 12:25:14 UTC
[root@rdma-dev-31 ~]$ cat -n /tmp/a.sh 
     1	#!/bin/bash
     2	set -x
     3	
     4	ports="$(ls -l /sys/class/net/*/device/driver | grep 'ice$' | awk '{print $9}' | cut -d '/' -f5)"
     5	[ -z "$ports" ] && exit 1
     6	for port in $ports; do
     7	  slot=$(ls -l /sys/class/net | grep $port | awk '{print $11}' | cut -d '/' -f6)
     8	  [ -z $slot ] && exit 1
     9	  irdma_dev=$(ls $(find /sys/devices/ -name $slot)/infiniband 2> /dev/null)
    10	  [ -z $irdma_dev ] && exit 1
    11	  ibv_devinfo -d $irdma_dev
    12	done
[root@rdma-dev-31 ~]$ sh /tmp/a.sh 
++ grep 'ice$'
++ ls -l /sys/class/net/i810_off/device/driver /sys/class/net/i810_roce/device/driver /sys/class/net/lom_1/device/driver /sys/class/net/lom_2/device/driver /sys/class/net/lom_3/device/driver /sys/class/net/lom_4/device/driver
++ cut -d / -f5
++ awk '{print $9}'
+ ports='i810_off
i810_roce'
+ '[' -z 'i810_off
i810_roce' ']'
+ for port in $ports
++ ls -l /sys/class/net
++ grep i810_off
++ awk '{print $11}'
++ cut -d / -f6
+ slot=0000:44:00.1
+ '[' -z 0000:44:00.1 ']'
+++ find /sys/devices/ -name 0000:44:00.1
++ ls /sys/devices/pci0000:40/0000:40:03.1/0000:44:00.1 /sys/devices/pci0000:40/0000:40:00.2/iommu/ivhd1/devices/0000:44:00.1/infiniband
+ irdma_dev='/sys/devices/pci0000:40/0000:40:00.2/iommu/ivhd1/devices/0000:44:00.1/infiniband:
irdma1

/sys/devices/pci0000:40/0000:40:03.1/0000:44:00.1:
aer_dev_correctable
aer_dev_fatal
aer_dev_nonfatal
ari_enabled
broken_parity_status
class
config
consistent_dma_mask_bits
current_link_speed
current_link_width
d3cold_allowed
device
dma_mask_bits
driver
driver_override
enable
firmware_node
ice.roce.1
infiniband
infiniband_verbs
iommu
iommu_group
irq
link
local_cpulist
local_cpus
max_link_speed
max_link_width
modalias
msi_bus
msi_irqs
net
numa_node
power
power_state
remove
rescan
reset
resource
resource0
resource0_wc
resource3
resource3_wc
revision
rom
sriov_drivers_autoprobe
sriov_numvfs
sriov_offset
sriov_stride
sriov_totalvfs
sriov_vf_device
sriov_vf_total_msix
subsystem
subsystem_device
subsystem_vendor
uevent
vendor
vpd'
+ '[' -z /sys/devices/pci0000:40/0000:40:00.2/iommu/ivhd1/devices/0000:44:00.1/infiniband: irdma1 /sys/devices/pci0000:40/0000:40:03.1/0000:44:00.1: aer_dev_correctable aer_dev_fatal aer_dev_nonfatal ari_enabled broken_parity_status class config consistent_dma_mask_bits current_link_speed current_link_width d3cold_allowed device dma_mask_bits driver driver_override enable firmware_node ice.roce.1 infiniband infiniband_verbs iommu iommu_group irq link local_cpulist local_cpus max_link_speed max_link_width modalias msi_bus msi_irqs net numa_node power power_state remove rescan reset resource resource0 resource0_wc resource3 resource3_wc revision rom sriov_drivers_autoprobe sriov_numvfs sriov_offset sriov_stride sriov_totalvfs sriov_vf_device sriov_vf_total_msix subsystem subsystem_device subsystem_vendor uevent vendor vpd ']'
/tmp/a.sh: line 10: [: too many arguments
+ ibv_devinfo -d /sys/devices/pci0000:40/0000:40:00.2/iommu/ivhd1/devices/0000:44:00.1/infiniband: irdma1 /sys/devices/pci0000:40/0000:40:03.1/0000:44:00.1: aer_dev_correctable aer_dev_fatal aer_dev_nonfatal ari_enabled broken_parity_status class config consistent_dma_mask_bits current_link_speed current_link_width d3cold_allowed device dma_mask_bits driver driver_override enable firmware_node ice.roce.1 infiniband infiniband_verbs iommu iommu_group irq link local_cpulist local_cpus max_link_speed max_link_width modalias msi_bus msi_irqs net numa_node power power_state remove rescan reset resource resource0 resource0_wc resource3 resource3_wc revision rom sriov_drivers_autoprobe sriov_numvfs sriov_offset sriov_stride sriov_totalvfs sriov_vf_device sriov_vf_total_msix subsystem subsystem_device subsystem_vendor uevent vendor vpd
IB device '/sys/devices/pci0000:40/0000:40:00.2/iommu/ivhd1/devices/0000:44:00.1/infiniband:' wasn't found
+ for port in $ports
++ ls -l /sys/class/net
++ grep i810_roce
++ awk '{print $11}'
++ cut -d / -f6
+ slot='0000:44:00.0
i810_roce.43
i810_roce.45'
+ '[' -z 0000:44:00.0 i810_roce.43 i810_roce.45 ']'
/tmp/a.sh: line 8: [: too many arguments
+++ find /sys/devices/ -name 0000:44:00.0 i810_roce.43 i810_roce.45
find: paths must precede expression: `i810_roce.43'
++ ls /infiniband
+ irdma_dev=
+ '[' -z ']'
+ exit 1
[root@rdma-dev-31 ~]$
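
The "too many arguments" messages in this trace come from the unquoted `[ -z $slot ]` and `[ -z $irdma_dev ]` tests. A minimal reproduction, using the multi-line `slot` value from the trace (the two helper function names are made up for the demo):

```shell
#!/bin/bash
# Value copied from the trace: three words separated by newlines.
slot='0000:44:00.0
i810_roce.43
i810_roce.45'

# Unquoted: word splitting hands `[` four arguments, so the test itself
# fails with a usage error (status 2) instead of answering the question.
unquoted_status() {
    if [ -z $slot ] 2>/dev/null; then echo 0; else echo $?; fi
}

# Quoted: `[` sees a single argument; status 1 simply means "not empty".
quoted_status() {
    if [ -z "$slot" ]; then echo 0; else echo $?; fi
}

echo "unquoted: $(unquoted_status), quoted: $(quoted_status)"
```

Since the failed test returns non-zero, the `&& exit 1` guard never fires and the script carries on with the bad value, which matches what the trace shows for `irdma_dev`.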

Comment 7 Jijun Wang 2022-01-06 13:30:33 UTC
Thanks. There are 2 issues.
The first one is in finding the irdma dev name. Honggang's patch should fix it.
The second one is in finding the slot number. It was fixed in our 11.2 version. Please change the following line in ethfindgood

slot=\$(ls -l /sys/class/net | grep \$port | awk '{print \$11}' | cut -d '/' -f 6)

to

slot=\$(ls -l /sys/class/net | grep \"\$port \" | awk '{print \$11}' | cut -d '/' -f 6)
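
To illustrate the difference between the two `grep` patterns, here is a sketch that runs both pipelines over a canned listing. The listing is hypothetical: it is shaped like `ls -l /sys/class/net` output and chosen so that the loose pattern reproduces the three-line `slot` value from the trace. The function names are made up as well.

```shell
#!/bin/bash
# Hypothetical `ls -l /sys/class/net` excerpt: one physical port and two
# VLAN interfaces whose names start with the physical port's name.
listing='lrwxrwxrwx. 1 root root 0 Jan  6 07:00 i810_roce -> ../../devices/pci0000:40/0000:40:03.1/0000:44:00.0/net/i810_roce
lrwxrwxrwx. 1 root root 0 Jan  6 07:00 i810_roce.43 -> ../../devices/virtual/net/i810_roce.43
lrwxrwxrwx. 1 root root 0 Jan  6 07:00 i810_roce.45 -> ../../devices/virtual/net/i810_roce.45'

# Original pattern: matches every line that contains the port name.
slot_loose() {
    printf '%s\n' "$listing" | grep "$1" | awk '{print $11}' | cut -d '/' -f 6
}

# Fixed pattern: the trailing space only matches the name column of the
# exact interface, because the name is followed by " -> target".
slot_exact() {
    printf '%s\n' "$listing" | grep "$1 " | awk '{print $11}' | cut -d '/' -f 6
}

echo "loose:"; slot_loose i810_roce
echo "exact:"; slot_exact i810_roce
```

The trailing-space trick relies on the `name -> target` layout of the long listing, so it is tied to `ls -l` output rather than being a general exact-match mechanism.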

Comment 8 zguo 2022-01-07 03:31:44 UTC
(In reply to Jijun Wang from comment #7)
> Thanks. There are 2 issues.
> The first one is in finding the irdma dev name. Honggang's patch should fix it.
> The second one is in finding the slot number. It was fixed in our 11.2 version.
> Please change the following line in ethfindgood
> 
> slot=\$(ls -l /sys/class/net | grep \$port | awk '{print \$11}' | cut -d '/' -f 6)
> 
> to
> 
> slot=\$(ls -l /sys/class/net | grep \"\$port \" | awk '{print \$11}' | cut -d '/' -f 6)

It looks good now.

[root@rdma-dev-31 ~]$ ethfindgood 
Warning: backed up existing /etc/eth-tools/{alive,good,bad} as *.bak files.
2 hosts will be checked
2 hosts are pingable (alive)
2 hosts are ssh'able (running)
2 total hosts have RDMA active ports on one or more fabrics (active)
2 hosts are alive, running, active (good)
0 hosts are bad (bad)
Bad hosts have been added to /root/punchlist.csv

Comment 9 Jijun Wang 2022-01-07 14:57:07 UTC
Thanks. I will update eth-tools-fastfabric

Comment 10 Jijun Wang 2022-01-07 16:24:35 UTC
Updated eth-tools to 11.1.0.1-6
Here is the f36 build https://koji.fedoraproject.org/koji/taskinfo?taskID=80960413

Comment 11 Honggang LI 2022-01-08 02:45:41 UTC
(In reply to Jijun Wang from comment #10)
> Updated eth-tools to 11.1.0.1-6
> Here is the f36 build
> https://koji.fedoraproject.org/koji/taskinfo?taskID=80960413

I built it for rhel-9.0.0, but it still needs improvement.

[root@rdma-dev-30 ~]$ /usr/sbin/ethsetupsnmp -p -L -f /etc/eth-tools/hosts
Configuring SNMP...
Enter space separated list of admin hosts (rdma-dev-30.rdma.lab.eng.rdu2.redhat.com): 
Enter SNMP community string (public): 
Fast Fabric requires the following MIBs:
	1.3.6.1.2.1.1 (SNMPv2-MIB:system)
	1.3.6.1.2.1.2 (IF-MIB:interfaces)
	1.3.6.1.2.1.4 (IP-MIB:ip)
	1.3.6.1.2.1.10.7 (EtherLike-MIB:dot3)
	1.3.6.1.2.1.31.1 (IP-MIB:ifMIBObjects) 
Do you accept these MIBs [y/n] (y): 
Enter space separated list of extra MIBs to support (NONE): 

Will config SNMP with the following settings:
  admin hosts: rdma-dev-30.rdma.lab.eng.rdu2.redhat.com
  community: public
  MIBs: 1.3.6.1.2.1.1 1.3.6.1.2.1.2 1.3.6.1.2.1.4 1.3.6.1.2.1.10.7 1.3.6.1.2.1.31.1 
Do you accept these settings [y/n] (y): 
mv: cannot stat '/etc/snmp/snmpd.conf': No such file or directory
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

So the package should require `net-snmp`.

========================================================================


[root@rdma-dev-30 ~]$ /usr/sbin/ethfindgood
Warning: backed up existing /etc/eth-tools/{alive,running,active,good,bad} as *.bak files.
2 hosts will be checked
2 hosts are pingable (alive)
2 hosts are ssh'able (running)
0 total hosts have RDMA active ports on one or more fabrics (active)  <====
0 hosts are alive, running, active (good)                             <====
2 hosts are bad (bad)
Bad hosts have been added to /root/punchlist.csv
[root@rdma-dev-30 ~]$ rpm -q eth-tools-fastfabric
eth-tools-fastfabric-11.1.0.1-6.el9.x86_64


267                 cmds="type ibv_devinfo > /dev/null 2>&1 || exit 1
268                         $cmds
269                                 slot=\$(ls -l /sys/class/net | grep \"\$port \" | awk '{print \$11}' | cut -d '/' -f 6)
270                                 [ -z \$slot ] && exit 1
271                                 irdma_dev=\$(ls \$(find /sys/devices/ -path */\$slot/infiniband) 2> /dev/null)
272                                 [ -z \$irdma_dev ] && exit 1
273                                 ibv_devinfo -d \$irdma_dev | grep '^\s*state:\s*PORT_ACTIVE' > /dev/null 2>&1 || exit 1

The `exit 1` on line 273 ends the for loop (and the remote script) during the first iteration, because the port of the first ice device is down. That is why ethfindgood can't detect the active port on the *second* ice device. But if we simply remove the `exit 1` on line 273, the result depends only on the last loop iteration, so ethfindgood will miss bad hosts whose last ice device port is active.
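
Both effects can be reproduced without any hardware. In the sketch below each argument stands for the state of one port, in loop order. The function names are made up, and the bodies are subshells so that `exit` ends only the check, much as it ends only the remote shell in the real script.

```shell
#!/bin/bash
# Current behaviour: bail out at the first port that is not active.
check_with_exit() (
    for state in "$@"; do
        [ "$state" = PORT_ACTIVE ] || exit 1
    done
)

# With `|| exit 1` removed, the loop's status is that of its last iteration.
check_without_exit() (
    for state in "$@"; do
        [ "$state" = PORT_ACTIVE ]
    done
)

# Turn an exit status into a word, without tripping `set -e` style runners.
verdict() { if "$@"; then echo good; else echo bad; fi; }

echo "with exit, down then active:    $(verdict check_with_exit PORT_DOWN PORT_ACTIVE)"
echo "without exit, down then active: $(verdict check_without_exit PORT_DOWN PORT_ACTIVE)"
echo "without exit, active then down: $(verdict check_without_exit PORT_ACTIVE PORT_DOWN)"
```

The first case is the situation on these hosts (the active port is never examined), and the second is the false "good" that removing the `exit 1` alone would introduce.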

Comment 12 Jijun Wang 2022-01-10 14:01:08 UTC
You are right. I will fix them.

Comment 13 Jijun Wang 2022-01-11 04:00:46 UTC
Updated eth-tools to 11.1.0.1-7

The changes are:
- When a user specifies ports for a node, we check that all of those ports are active RDMA ports
- If a user doesn't specify ports for a node, we check that at least one port is an active RDMA port
- Added net-snmp as a dependency of the eth-tools rpm

Here is the f36 build https://koji.fedoraproject.org/koji/taskinfo?taskID=81083085
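
For readers following along, the two checks described in this comment could look roughly like the sketch below. It is not the code shipped in 11.1.0.1-7. `port_state` is a hypothetical stand-in for querying a device (the real script greps `ibv_devinfo` output), hard-wired here to the states that `ibstatus` reported in the description: irdma0 active, irdma1 down.

```shell
#!/bin/bash
# Hypothetical stand-in for a real device query.
port_state() {
    case "$1" in
        irdma0) echo PORT_ACTIVE ;;
        *)      echo PORT_DOWN ;;
    esac
}

is_active() { [ "$(port_state "$1")" = PORT_ACTIVE ]; }

# Ports were listed explicitly: every one of them must be active.
all_active() {
    for dev in "$@"; do
        is_active "$dev" || return 1
    done
    return 0
}

# No ports were listed: a single active port is enough.
any_active() {
    for dev in "$@"; do
        is_active "$dev" && return 0
    done
    return 1
}

if any_active irdma0 irdma1; then echo "any: good"; else echo "any: bad"; fi
if all_active irdma0 irdma1; then echo "all: good"; else echo "all: bad"; fi
```

Under the "at least one" rule the hosts in this report count as good, while explicitly listing both ports for a node would still flag them as bad.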

Comment 16 RHEL Program Management 2023-07-05 07:28:26 UTC
After evaluating this issue, there are no plans to address it further or fix it in an upcoming release.  Therefore, it is being closed.  If plans change such that this issue will be fixed in an upcoming release, then the bug can be reopened.