Description of problem:

Version-Release number of selected component (if applicable):
$ rpm -qa | grep eth-
eth-tools-basic-11.1.0.1-5.el9.x86_64
eth-tools-fastfabric-11.1.0.1-5.el9.x86_64

How reproducible:
Always

Steps to Reproduce:
1. $ cat /etc/eth-tools/hosts
172.31.40.130
172.31.40.131
2. $ /usr/sbin/ethsetupssh -S -p -f /etc/eth-tools/hosts
3. $ /usr/sbin/ethsetupsnmp -p -L -f /etc/eth-tools/hosts
4. $ ethfindgood
Warning: backed up existing /etc/eth-tools/{alive,running,active,good,bad} as *.bak files.
2 hosts will be checked
2 hosts are pingable (alive)
2 hosts are ssh'able (running)
bash: line 10: [: too many arguments
IB device '/sys/devices/pci0000:40/0000:40:00.2/iommu/ivhd1/devices/0000:44:00.1/infiniband:' wasn't found
bash: line 10: [: too many arguments
IB device '/sys/devices/pci0000:40/0000:40:00.2/iommu/ivhd1/devices/0000:44:00.1/infiniband:' wasn't found
0 total hosts have RDMA active ports on one or more fabrics (active)
0 hosts are alive, running, active (good)
2 hosts are bad (bad)
Bad hosts have been added to /root/punchlist.csv

$ cat /root/punchlist.csv
2022/01/04 22:06:34;172.31.40.130;Has inactive RDMA port(s)
2022/01/04 22:06:34;172.31.40.131;Has inactive RDMA port(s)
2022/01/04 22:07:32;172.31.40.130;Has inactive RDMA port(s)
2022/01/04 22:07:32;172.31.40.131;Has inactive RDMA port(s)
2022/01/04 22:08:54;172.31.40.130;Has inactive RDMA port(s)
2022/01/04 22:08:54;172.31.40.131;Has inactive RDMA port(s)
2022/01/04 22:17:18;172.31.40.130;Has inactive RDMA port(s)
2022/01/04 22:17:18;172.31.40.131;Has inactive RDMA port(s)
2022/01/04 22:41:18;172.31.40.130;Has inactive RDMA port(s)
2022/01/04 22:41:18;172.31.40.131;Has inactive RDMA port(s)
2022/01/04 22:53:11;172.31.40.130;Has inactive RDMA port(s)
2022/01/04 22:53:11;172.31.40.131;Has inactive RDMA port(s)

$ ibstatus
Infiniband device 'irdma0' port 1 status:
    default gid: fe80:0000:0000:0000:b696:91ff:fead:8588
    base lid: 0x1
    sm lid: 0x0
    state: 4: ACTIVE
    phys state: 5: LinkUp
    rate: 100 Gb/sec (4X EDR)
    link_layer: Ethernet

Infiniband device 'irdma1' port 1 status:
    default gid: fe80:0000:0000:0000:b696:91ff:fead:8589
    base lid: 0x1
    sm lid: 0x0
    state: 1: DOWN
    phys state: 3: Disabled
    rate: 100 Gb/sec (4X EDR)
    link_layer: Ethernet

Actual results:
$ ethfindgood
Warning: backed up existing /etc/eth-tools/{alive,running,active,good,bad} as *.bak files.
2 hosts will be checked
2 hosts are pingable (alive)
2 hosts are ssh'able (running)
bash: line 10: [: too many arguments
IB device '/sys/devices/pci0000:40/0000:40:00.2/iommu/ivhd1/devices/0000:44:00.1/infiniband:' wasn't found
bash: line 10: [: too many arguments
IB device '/sys/devices/pci0000:40/0000:40:00.2/iommu/ivhd1/devices/0000:44:00.1/infiniband:' wasn't found
0 total hosts have RDMA active ports on one or more fabrics (active)
0 hosts are alive, running, active (good)
2 hosts are bad (bad)
Bad hosts have been added to /root/punchlist.csv

Expected results:
There are 2 irdma HCAs on each host, and each HCA has 1 port. On every host, 1 HCA is active and 1 is inactive, so at least 1 HCA works on each host. But the output shows
"0 total hosts have RDMA active ports on one or more fabrics (active)
0 hosts are alive, running, active (good)".

I think it should be
"2 total hosts have RDMA active ports on one or more fabrics (active)
2 hosts are alive, running, active (good)
2 hosts have inactive RDMA port(s) (bad)
"
Besides, it shows an unexpected "bash: line 10: [: too many arguments" error.

Additional info:
[test@rdma-dev-30 sbin]$ sh -x /usr/sbin/ethfindgood + '[' -f /etc/eth-tools/ethfastfabric.conf ']' + . /etc/eth-tools/ethfastfabric.conf ++ '[' '' = '' ']' ++ CONFIG_DIR=/etc ++ export CONFIG_DIR ++ export HOSTS_FILE=/etc/eth-tools/hosts ++ HOSTS_FILE=/etc/eth-tools/hosts ++ export SWITCHES_FILE=/etc/eth-tools/switches ++ SWITCHES_FILE=/etc/eth-tools/switches ++ export MGMT_HOST=localhost ++ MGMT_HOST=localhost ++ export FF_MAX_PARALLEL=1000 ++ FF_MAX_PARALLEL=1000 ++ export FF_TIMEOUT_MULT=2 ++ FF_TIMEOUT_MULT=2 ++ export FF_RESULT_DIR=/home/test ++ FF_RESULT_DIR=/home/test +++ cat /usr/lib/eth-tools/osid_wrapper ++ export FF_PRODUCT=IntelEth-Basic.RHEL9-x86_64 ++ FF_PRODUCT=IntelEth-Basic.RHEL9-x86_64 +++ cat /etc/eth-tools/version_wrapper ++ export FF_PRODUCT_VERSION= ++ FF_PRODUCT_VERSION= ++ export 'FF_PACKAGES=eth eth_rdma' ++ FF_PACKAGES='eth eth_rdma' ++ export 'FF_INSTALL_OPTIONS= ' ++ FF_INSTALL_OPTIONS=' ' ++ export 'FF_UPGRADE_OPTIONS= ' ++ FF_UPGRADE_OPTIONS=' ' ++ export UPLOADS_DIR=./uploads ++ UPLOADS_DIR=./uploads ++ export DOWNLOADS_DIR=./downloads ++ DOWNLOADS_DIR=./downloads ++ export FF_ANALYSIS_DIR=/var/usr/lib/eth-tools/analysis ++ FF_ANALYSIS_DIR=/var/usr/lib/eth-tools/analysis ++ export FF_LOGIN_METHOD=ssh ++ FF_LOGIN_METHOD=ssh ++ export FF_USERNAME=root ++ FF_USERNAME=root ++ export FF_PASSWORD= ++ FF_PASSWORD= ++ export FF_ROOTPASS= ++ FF_ROOTPASS= ++ export 'FF_FABRIC_HEALTH= -o errors -o slowlinks' ++ FF_FABRIC_HEALTH=' -o errors -o slowlinks' ++ export FF_ALL_ANALYSIS=fabric ++ FF_ALL_ANALYSIS=fabric ++ export 'FF_DIFF_CMD=diff -C 1' ++ FF_DIFF_CMD='diff -C 1' ++ export FF_MPI_APPS_DIR=/home/test/mpi_apps ++ FF_MPI_APPS_DIR=/home/test/mpi_apps ++ export FF_CUDA_DIR=/usr/local/cuda ++ FF_CUDA_DIR=/usr/local/cuda ++ export FF_MPI_ENV= ++ FF_MPI_ENV= ++ export 'FF_DEVIATION_ARGS=-bwtol 20 -lattol 50 -c' ++ FF_DEVIATION_ARGS='-bwtol 20 -lattol 50 -c' ++ export FF_SERIALIZE_OUTPUT=yes ++ FF_SERIALIZE_OUTPUT=yes ++ export 
FF_HOSTVERIFY_DIR=/root ++ FF_HOSTVERIFY_DIR=/root + . /usr/lib/eth-tools/ethfastfabric.conf.def ++ '[' /etc = '' ']' ++ export HOSTS_FILE=/etc/eth-tools/hosts ++ HOSTS_FILE=/etc/eth-tools/hosts ++ export SWITCHES_FILE=/etc/eth-tools/switches ++ SWITCHES_FILE=/etc/eth-tools/switches ++ export MGMT_HOST=localhost ++ MGMT_HOST=localhost ++ export FF_MAX_PARALLEL=1000 ++ FF_MAX_PARALLEL=1000 ++ export FF_TIMEOUT_MULT=2 ++ FF_TIMEOUT_MULT=2 ++ export FF_RESULT_DIR=/home/test ++ FF_RESULT_DIR=/home/test ++ export FF_PRODUCT=IntelEth-Basic.RHEL9-x86_64 ++ FF_PRODUCT=IntelEth-Basic.RHEL9-x86_64 +++ cat /etc/eth-tools/version_wrapper ++ export FF_PRODUCT_VERSION= ++ FF_PRODUCT_VERSION= ++ export 'FF_PACKAGES=eth eth_rdma' ++ FF_PACKAGES='eth eth_rdma' ++ export 'FF_INSTALL_OPTIONS= ' ++ FF_INSTALL_OPTIONS=' ' ++ export 'FF_UPGRADE_OPTIONS= ' ++ FF_UPGRADE_OPTIONS=' ' ++ export UPLOADS_DIR=./uploads ++ UPLOADS_DIR=./uploads ++ export DOWNLOADS_DIR=./downloads ++ DOWNLOADS_DIR=./downloads ++ export FF_ANALYSIS_DIR=/var/usr/lib/eth-tools/analysis ++ FF_ANALYSIS_DIR=/var/usr/lib/eth-tools/analysis ++ export FF_LOGIN_METHOD=ssh ++ FF_LOGIN_METHOD=ssh ++ export FF_USERNAME=root ++ FF_USERNAME=root ++ export FF_PASSWORD= ++ FF_PASSWORD= ++ export FF_ROOTPASS= ++ FF_ROOTPASS= ++ export 'FF_FABRIC_HEALTH= -o errors -o slowlinks' ++ FF_FABRIC_HEALTH=' -o errors -o slowlinks' ++ export FF_ALL_ANALYSIS=fabric ++ FF_ALL_ANALYSIS=fabric ++ export 'FF_DIFF_CMD=diff -C 1' ++ FF_DIFF_CMD='diff -C 1' ++ export FF_MPI_APPS_DIR=/home/test/mpi_apps ++ FF_MPI_APPS_DIR=/home/test/mpi_apps ++ export FF_CUDA_DIR=/usr/local/cuda ++ FF_CUDA_DIR=/usr/local/cuda ++ export FF_MPI_ENV= ++ FF_MPI_ENV= ++ export 'FF_DEVIATION_ARGS=-bwtol 20 -lattol 50 -c' ++ FF_DEVIATION_ARGS='-bwtol 20 -lattol 50 -c' ++ export FF_SERIALIZE_OUTPUT=yes ++ FF_SERIALIZE_OUTPUT=yes ++ export FF_HOSTVERIFY_DIR=/root ++ FF_HOSTVERIFY_DIR=/root + . 
/usr/lib/eth-tools/ff_funcs ++ FF_PRD_NAME=eth-tools ++ declare -A LC_NODE_PORTS ++ '[' /etc = '' ']' + trap 'exit 1' SIGHUP SIGTERM SIGINT + punchlist=/home/test/punchlist.csv + del=';' ++ date '+%Y/%m/%d %T' + timestamp='2022/01/05 00:07:21' ++ basename /usr/sbin/ethfindgood + readonly BASENAME=ethfindgood + BASENAME=ethfindgood + '[' x = x--help ']' + skip_ssh=n + skip_active=n + dir=/etc/eth-tools + timelimit=20 + getopts d:f:h:RAT: param + shift 0 + '[' 0 -gt 0 ']' + check_host_args ethfindgood + local l_hosts_file + '[' /etc/eth-tools/hosts = '' ']' + '[' '' = '' ']' + l_hosts_file=/etc/eth-tools/hosts ++ resolve_file ethfindgood /etc/eth-tools/hosts ++ '[' -f /etc/eth-tools/hosts ']' ++ echo /etc/eth-tools/hosts + HOSTS_FILE=/etc/eth-tools/hosts + '[' /etc/eth-tools/hosts = '' ']' ++ expand_file ethfindgood /etc/eth-tools/hosts ++ local file ++ cat /etc/eth-tools/hosts ++ ff_filter_comments ++ read line ++ egrep -v '^[[:space:]]*#' ++ egrep -v '^[[:space:]]*$' +++ expr 172.31.40.130 : '\([^ ]*\).*' ++ f1=172.31.40.130 ++ '[' x172.31.40.130 = xinclude ']' ++ echo 172.31.40.130 ++ cut -f1 ++ read line +++ expr 172.31.40.131 : '\([^ ]*\).*' ++ f1=172.31.40.131 ++ '[' x172.31.40.131 = xinclude ']' ++ echo 172.31.40.131 ++ cut -f1 ++ read line + CONTENTS='172.31.40.130 172.31.40.131' ++ extract_device_name ethfindgood '172.31.40.130 172.31.40.131' ++ echo '172.31.40.130 172.31.40.131' ++ read line ++ echo 172.31.40.130 ++ awk -F '[:,[({]' '{print $1}' ++ read line ++ echo 172.31.40.131 ++ awk -F '[:,[({]' '{print $1}' ++ read line + HOSTS='172.31.40.130 172.31.40.131' + '[' '172.31.40.130 172.31.40.131' = '' ']' + extract_node_ports '172.31.40.130 172.31.40.131' + content='172.31.40.130 172.31.40.131' + for line in $content + raw_node=172.31.40.130 ++ trim_string 172.31.40.130 ++ str=172.31.40.130 ++ echo 172.31.40.130 ++ sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' + node=172.31.40.130 + [[ 172.31.40.130 = \1\7\2\.\3\1\.\4\0\.\1\3\0 ]] + 
LC_NODE_PORTS[${node,,}]= + for line in $content + raw_node=172.31.40.131 ++ trim_string 172.31.40.131 ++ str=172.31.40.131 ++ echo 172.31.40.131 ++ sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' + node=172.31.40.131 + [[ 172.31.40.131 = \1\7\2\.\3\1\.\4\0\.\1\3\1 ]] + LC_NODE_PORTS[${node,,}]= + export HOSTS + unset HOSTS_FILE + good_meaning= + good_file= ++ mktemp + alive_hostonly=/tmp/tmp.m4wnmu0F5J ++ mktemp + running_hostonly=/tmp/tmp.EblW5uMxES + bak_files= + for file in alive running active good bad + '[' -f /etc/eth-tools/alive ']' + mv -f /etc/eth-tools/alive /etc/eth-tools/alive.bak mv: cannot move '/etc/eth-tools/alive' to '/etc/eth-tools/alive.bak': Permission denied + [[ -z '' ]] + bak_files=alive + for file in alive running active good bad + '[' -f /etc/eth-tools/running ']' + mv -f /etc/eth-tools/running /etc/eth-tools/running.bak mv: cannot move '/etc/eth-tools/running' to '/etc/eth-tools/running.bak': Permission denied + [[ -z alive ]] + bak_files=alive,running + for file in alive running active good bad + '[' -f /etc/eth-tools/active ']' + mv -f /etc/eth-tools/active /etc/eth-tools/active.bak mv: cannot move '/etc/eth-tools/active' to '/etc/eth-tools/active.bak': Permission denied + [[ -z alive,running ]] + bak_files=alive,running,active + for file in alive running active good bad + '[' -f /etc/eth-tools/good ']' + mv -f /etc/eth-tools/good /etc/eth-tools/good.bak mv: cannot move '/etc/eth-tools/good' to '/etc/eth-tools/good.bak': Permission denied + [[ -z alive,running,active ]] + bak_files=alive,running,active,good + for file in alive running active good bad + '[' -f /etc/eth-tools/bad ']' + mv -f /etc/eth-tools/bad /etc/eth-tools/bad.bak mv: cannot move '/etc/eth-tools/bad' to '/etc/eth-tools/bad.bak': Permission denied + [[ -z alive,running,active,good ]] + bak_files=alive,running,active,good,bad + [[ -n alive,running,active,good,bad ]] + echo 'Warning: backed up existing /etc/eth-tools/{alive,running,active,good,bad} as *.bak files.' 
Warning: backed up existing /etc/eth-tools/{alive,running,active,good,bad} as *.bak files. ++ ff_var_filter_dups_to_stdout '172.31.40.130 172.31.40.131' ++ wc -l ++ ff_var_to_stdout '172.31.40.130 172.31.40.131' ++ ff_filter_dups ++ echo '172.31.40.130 172.31.40.131' ++ ff_to_lc ++ tr A-Z a-z ++ tr -s ' ' '\n' ++ sort -u ++ sed -e '/^$/d' + echo '2 hosts will be checked' 2 hosts will be checked + ethpingall -p + grep 'is alive' + sed -e 's/:.*//' + ff_filter_dups + ethsorthosts + ff_to_lc + tr A-Z a-z + sort -u + append_punchlist /dev/fd/63 /tmp/tmp.m4wnmu0F5J 'Doesn'\''t ping' ++ ff_var_filter_dups_to_stdout '172.31.40.130 172.31.40.131' ++ ff_var_to_stdout '172.31.40.130 172.31.40.131' + ethsorthosts ++ ff_filter_dups ++ sort /dev/fd/63 + read host ++ echo '172.31.40.130 + comm -23 /dev/fd/62 /dev/fd/61 172.31.40.131' ++ ff_to_lc ++ sort /tmp/tmp.m4wnmu0F5J ++ tr A-Z a-z ++ tr -s ' ' '\n' ++ sort -u ++ sed -e '/^$/d' + good_meaning=alive + good_file_hostonly=/tmp/tmp.m4wnmu0F5J + to_nodes_ports /tmp/tmp.m4wnmu0F5J /etc/eth-tools/alive + src=/tmp/tmp.m4wnmu0F5J + dst=/etc/eth-tools/alive ++ cat /tmp/tmp.m4wnmu0F5J + get_nodes_ports '172.31.40.130 172.31.40.131' /usr/sbin/ethfindgood: line 186: /etc/eth-tools/alive: Permission denied + good_file=/etc/eth-tools/alive ++ cat /etc/eth-tools/alive ++ wc -l + echo '2 hosts are pingable (alive)' 2 hosts are pingable (alive) + '[' n = n ']' + ethsorthosts ++ to_canon + mycomm12 /dev/fd/63 /dev/fd/62 + /usr/lib/eth-tools/comm12 /dev/fd/63 /dev/fd/62 ++ read line ++ sort --ignore-case -t ' ' -k1,1 ++ ethcmdall -h '' -f /tmp/tmp.m4wnmu0F5J -P -p -T 20 'echo 123' ++ grep ': 123' ++ sed 's/:.*//' +++ echo 172.31.40.130 ++ ff_filter_dups +++ ff_to_lc +++ tr A-Z a-z ++ to_canon ++ ff_to_lc ++ tr A-Z a-z ++ sort -u ++ read line ++ sort --ignore-case -t ' ' -k1,1 ++ canon=172.31.40.130 ++ echo '172.31.40.130 172.31.40.130' ++ read line +++ echo 172.31.40.131 +++ ff_to_lc +++ tr A-Z a-z ++ canon=172.31.40.131 ++ echo '172.31.40.131 
172.31.40.131' ++ read line +++ echo 172.31.40.130 +++ ff_to_lc +++ tr A-Z a-z ++ canon=172.31.40.130 ++ echo '172.31.40.130 172.31.40.130' ++ read line +++ echo 172.31.40.131 +++ ff_to_lc +++ tr A-Z a-z ++ canon=172.31.40.131 ++ echo '172.31.40.131 172.31.40.131' ++ read line + append_punchlist /tmp/tmp.m4wnmu0F5J /tmp/tmp.EblW5uMxES 'Can'\''t ssh' + ethsorthosts + read host ++ sort /tmp/tmp.m4wnmu0F5J + comm -23 /dev/fd/63 /dev/fd/62 ++ sort /tmp/tmp.EblW5uMxES + to_nodes_ports /tmp/tmp.EblW5uMxES /etc/eth-tools/running + src=/tmp/tmp.EblW5uMxES + dst=/etc/eth-tools/running ++ cat /tmp/tmp.EblW5uMxES + get_nodes_ports '172.31.40.130 172.31.40.131' /usr/sbin/ethfindgood: line 186: /etc/eth-tools/running: Permission denied + good_meaning='alive, running' + good_file=/etc/eth-tools/running ++ cat /etc/eth-tools/running ++ wc -l + echo '2 hosts are ssh'\''able (running)' 2 hosts are ssh'able (running) + rm -f /tmp/tmp.EblW5uMxES + rm -f /tmp/tmp.m4wnmu0F5J + '[' n = n ']' + ff_filter_dups + ethsorthosts ++ cat /etc/eth-tools/running + ff_to_lc + tr A-Z a-z /usr/sbin/ethfindgood: line 277: /etc/eth-tools/active: Permission denied + sort -u + for line in $(cat $good_file) + host=172.31.40.130 ++ get_node_ports 172.31.40.130 ++ node=172.31.40.130 ++ echo '' + ports= + [[ -z '' ]] + cmds=' ports="$(ls -l /sys/class/net/*/device/driver | grep '\''ice$'\'' | awk '\''{print $9}'\'' | cut -d '\''/'\'' -f5)" [ -z "$ports" ] && exit 1 for port in $ports; do ' + cmds='type ibv_devinfo > /dev/null 2>&1 || exit 1 ports="$(ls -l /sys/class/net/*/device/driver | grep '\''ice$'\'' | awk '\''{print $9}'\'' | cut -d '\''/'\'' -f5)" [ -z "$ports" ] && exit 1 for port in $ports; do slot=$(ls -l /sys/class/net | grep $port | awk '\''{print $11}'\'' | cut -d '\''/'\'' -f 6) [ -z $slot ] && exit 1 irdma_dev=$(ls $(find /sys/devices/ -name $slot)/infiniband 2> /dev/null) [ -z $irdma_dev ] && exit 1 ibv_devinfo -d $irdma_dev | grep '\''^\s*state:\s*PORT_ACTIVE'\'' > /dev/null 2>&1 || exit 1 
done ' + ssh 172.31.40.130 'type ibv_devinfo > /dev/null 2>&1 || exit 1 ports="$(ls -l /sys/class/net/*/device/driver | grep '\''ice$'\'' | awk '\''{print $9}'\'' | cut -d '\''/'\'' -f5)" [ -z "$ports" ] && exit 1 for port in $ports; do slot=$(ls -l /sys/class/net | grep $port | awk '\''{print $11}'\'' | cut -d '\''/'\'' -f 6) [ -z $slot ] && exit 1 irdma_dev=$(ls $(find /sys/devices/ -name $slot)/infiniband 2> /dev/null) [ -z $irdma_dev ] && exit 1 ibv_devinfo -d $irdma_dev | grep '\''^\s*state:\s*PORT_ACTIVE'\'' > /dev/null 2>&1 || exit 1 done ' bash: line 10: [: too many arguments IB device '/sys/devices/pci0000:40/0000:40:00.2/iommu/ivhd1/devices/0000:44:00.1/infiniband:' wasn't found + for line in $(cat $good_file) + host=172.31.40.131 ++ get_node_ports 172.31.40.131 ++ node=172.31.40.131 ++ echo '' + ports= + [[ -z '' ]] + cmds=' ports="$(ls -l /sys/class/net/*/device/driver | grep '\''ice$'\'' | awk '\''{print $9}'\'' | cut -d '\''/'\'' -f5)" [ -z "$ports" ] && exit 1 for port in $ports; do ' + cmds='type ibv_devinfo > /dev/null 2>&1 || exit 1 ports="$(ls -l /sys/class/net/*/device/driver | grep '\''ice$'\'' | awk '\''{print $9}'\'' | cut -d '\''/'\'' -f5)" [ -z "$ports" ] && exit 1 for port in $ports; do slot=$(ls -l /sys/class/net | grep $port | awk '\''{print $11}'\'' | cut -d '\''/'\'' -f 6) [ -z $slot ] && exit 1 irdma_dev=$(ls $(find /sys/devices/ -name $slot)/infiniband 2> /dev/null) [ -z $irdma_dev ] && exit 1 ibv_devinfo -d $irdma_dev | grep '\''^\s*state:\s*PORT_ACTIVE'\'' > /dev/null 2>&1 || exit 1 done ' + ssh 172.31.40.131 'type ibv_devinfo > /dev/null 2>&1 || exit 1 ports="$(ls -l /sys/class/net/*/device/driver | grep '\''ice$'\'' | awk '\''{print $9}'\'' | cut -d '\''/'\'' -f5)" [ -z "$ports" ] && exit 1 for port in $ports; do slot=$(ls -l /sys/class/net | grep $port | awk '\''{print $11}'\'' | cut -d '\''/'\'' -f 6) [ -z $slot ] && exit 1 irdma_dev=$(ls $(find /sys/devices/ -name $slot)/infiniband 2> /dev/null) [ -z $irdma_dev ] && exit 1 
ibv_devinfo -d $irdma_dev | grep '\''^\s*state:\s*PORT_ACTIVE'\'' > /dev/null 2>&1 || exit 1 done ' bash: line 10: [: too many arguments IB device '/sys/devices/pci0000:40/0000:40:00.2/iommu/ivhd1/devices/0000:44:00.1/infiniband:' wasn't found + append_punchlist /etc/eth-tools/running /etc/eth-tools/active 'Has inactive RDMA port(s)' + ethsorthosts ++ sort /etc/eth-tools/running + read host + comm -23 /dev/fd/63 /dev/fd/62 ++ sort /etc/eth-tools/active + echo '2022/01/05 00:07:21;172.31.40.130;Has inactive RDMA port(s)' + read host + echo '2022/01/05 00:07:21;172.31.40.131;Has inactive RDMA port(s)' + read host + ethsorthosts ++ to_canon /usr/sbin/ethfindgood: line 280: /etc/eth-tools/good: Permission denied + mycomm12 /dev/fd/63 /dev/fd/62 ++ to_canon + /usr/lib/eth-tools/comm12 /dev/fd/63 /dev/fd/62 ++ read line ++ sort --ignore-case -t ' ' -k1,1 ++ read line ++ sort --ignore-case -t ' ' -k1,1 +++ echo 172.31.40.130 +++ ff_to_lc +++ tr A-Z a-z ++ canon=172.31.40.130 ++ echo '172.31.40.130 172.31.40.130' ++ read line +++ echo 172.31.40.131 +++ ff_to_lc +++ tr A-Z a-z ++ canon=172.31.40.131 ++ echo '172.31.40.131 172.31.40.131' ++ read line + good_meaning='alive, running, active' ++ cat /etc/eth-tools/active ++ wc -l + echo '0 total hosts have RDMA active ports on one or more fabrics (active)' 0 total hosts have RDMA active ports on one or more fabrics (active) ++ cat /etc/eth-tools/good ++ wc -l + echo '0 hosts are alive, running, active (good)' 0 hosts are alive, running, active (good) + ethsorthosts /usr/sbin/ethfindgood: line 290: /etc/eth-tools/bad: Permission denied + comm -23 /dev/fd/63 /dev/fd/62 ++ sort /etc/eth-tools/good +++ get_nodes_ports '172.31.40.130 172.31.40.131' +++ nodes='172.31.40.130 172.31.40.131' +++ for node in $nodes +++ ports= +++ [[ -z '' ]] +++ echo 172.31.40.130 +++ for node in $nodes +++ ports= +++ [[ -z '' ]] +++ echo 172.31.40.131 ++ ff_var_filter_dups_to_stdout '172.31.40.130 172.31.40.131' ++ ff_var_to_stdout '172.31.40.130 
172.31.40.131' ++ ff_filter_dups ++ echo '172.31.40.130 172.31.40.131' ++ ff_to_lc ++ tr A-Z a-z ++ tr -s ' ' '\n' ++ sort -u ++ sed -e '/^$/d' ++ cat /etc/eth-tools/bad ++ wc -l + echo '2 hosts are bad (bad)' 2 hosts are bad (bad) + echo 'Bad hosts have been added to /home/test/punchlist.csv' Bad hosts have been added to /home/test/punchlist.csv + exit 0 [test@rdma-dev-30 sbin]$
$ diff -Nurp ethfindgood.orig ethfindgood.new
--- ethfindgood.orig 2022-01-05 00:23:42.792600133 -0500
+++ ethfindgood.new 2022-01-05 00:57:26.236753421 -0500
@@ -268,7 +268,7 @@ then
 $cmds
 slot=\$(ls -l /sys/class/net | grep \$port | awk '{print \$11}' | cut -d '/' -f 6)
 [ -z \$slot ] && exit 1
-irdma_dev=\$(ls \$(find /sys/devices/ -name \$slot)/infiniband 2> /dev/null)
+irdma_dev=\$(ls \$(find /sys/devices/ -ipath */\$slot/infiniband) 2> /dev/null)
 [ -z \$irdma_dev ] && exit 1
 ibv_devinfo -d \$irdma_dev | grep '^\s*state:\s*PORT_ACTIVE' > /dev/null 2>&1 || exit 1
 done
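For anyone following along, here is a rough sketch of why the lookup changed in the patch above seems to misbehave. It rebuilds a tiny stand-in for the sysfs tree in a temporary directory; the directory names are copied from the trace, but the tree itself, the `root`/`real` variables, and the assumption that the iommu entry is a symlink to the device directory are illustrative only. The sketch uses `-path` with a quoted pattern rather than the unquoted `-ipath` of the patch.

```shell
#!/bin/bash
# Hypothetical miniature of /sys/devices, built in a temp dir (assumption:
# the iommu "devices" entry is a symlink to the real PCI device directory).
root=$(mktemp -d)
slot=0000:44:00.1
real="$root/pci0000:40/0000:40:03.1/$slot"
mkdir -p "$real/infiniband/irdma1" "$real/net"
mkdir -p "$root/pci0000:40/0000:40:00.2/iommu/ivhd1/devices"
ln -s "$real" "$root/pci0000:40/0000:40:00.2/iommu/ivhd1/devices/$slot"

# Original lookup: -name matches the device directory AND the symlink, and
# "/infiniband" is appended only to the last path that find prints. ls then
# receives two directories and prints "dir:" headers plus both listings.
name_hits=$(find "$root" -name "$slot" | wc -l)
old_dev=$(ls $(find "$root" -name "$slot")/infiniband 2> /dev/null)

# Patched lookup: match the infiniband directory itself. find does not
# descend through the symlink, so only the real path matches.
new_dev=$(ls $(find "$root" -path "*/$slot/infiniband") 2> /dev/null)

echo "-name matches: $name_hits"
echo "old irdma_dev has $(echo $old_dev | wc -w) words"
echo "new irdma_dev: $new_dev"

rm -rf "$root"
```

With a multi-word `irdma_dev`, the unquoted `[ -z $irdma_dev ]` test receives several arguments, which would explain the "too many arguments" message in the report.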
(In reply to zguo from comment #0)
> "
> Besides, it show unexpected "bash: line 10: [: too many arguments
> "

Please test this patch.
https://bugzilla.redhat.com/show_bug.cgi?id=2037144#c2
(In reply to Honggang LI from comment #3)
> (In reply to zguo from comment #0)
> 
> > "
> > Besides, it show unexpected "bash: line 10: [: too many arguments
> > "
> 
> 
> Please test this patch.
> https://bugzilla.redhat.com/show_bug.cgi?id=2037144#c2

[root@rdma-dev-30 ~]$ /usr/sbin/ethfindgood
Warning: backed up existing /etc/eth-tools/{alive,running,active,good,bad} as *.bak files.
2 hosts will be checked
2 hosts are pingable (alive)
2 hosts are ssh'able (running)
0 total hosts have RDMA active ports on one or more fabrics (active)
0 hosts are alive, running, active (good)
2 hosts are bad (bad)
Bad hosts have been added to /root/punchlist.csv
The script intends to find the devices that use the ice driver, then find each device's slot, and then the irdma device name. Could you run the script below and post your output here?

#!/bin/bash
set -x

ports="$(ls -l /sys/class/net/*/device/driver | grep 'ice$' | awk '{print $9}' | cut -d '/' -f5)"
[ -z "$ports" ] && exit 1
for port in $ports; do
    slot=$(ls -l /sys/class/net | grep $port | awk '{print $11}' | cut -d '/' -f 6)
    [ -z $slot ] && exit 1
    irdma_dev=$(ls $(find /sys/devices/ -name $slot)/infiniband 2> /dev/null)
    [ -z $irdma_dev ] && exit 1
    ibv_devinfo -d $irdma_dev
done
[root@rdma-dev-31 ~]$ cat -n /tmp/a.sh 1 #!/bin/bash 2 set -x 3 4 ports="$(ls -l /sys/class/net/*/device/driver | grep 'ice$' | awk '{print $9}' | cut -d '/' -f5)" 5 [ -z "$ports" ] && exit 1 6 for port in $ports; do 7 slot=$(ls -l /sys/class/net | grep $port | awk '{print $11}' | cut -d '/' -f6) 8 [ -z $slot ] && exit 1 9 irdma_dev=$(ls $(find /sys/devices/ -name $slot)/infiniband 2> /dev/null) 10 [ -z $irdma_dev ] && exit 1 11 ibv_devinfo -d $irdma_dev 12 done [root@rdma-dev-31 ~]$ sh /tmp/a.sh ++ grep 'ice$' ++ ls -l /sys/class/net/i810_off/device/driver /sys/class/net/i810_roce/device/driver /sys/class/net/lom_1/device/driver /sys/class/net/lom_2/device/driver /sys/class/net/lom_3/device/driver /sys/class/net/lom_4/device/driver ++ cut -d / -f5 ++ awk '{print $9}' + ports='i810_off i810_roce' + '[' -z 'i810_off i810_roce' ']' + for port in $ports ++ ls -l /sys/class/net ++ grep i810_off ++ awk '{print $11}' ++ cut -d / -f6 + slot=0000:44:00.1 + '[' -z 0000:44:00.1 ']' +++ find /sys/devices/ -name 0000:44:00.1 ++ ls /sys/devices/pci0000:40/0000:40:03.1/0000:44:00.1 /sys/devices/pci0000:40/0000:40:00.2/iommu/ivhd1/devices/0000:44:00.1/infiniband + irdma_dev='/sys/devices/pci0000:40/0000:40:00.2/iommu/ivhd1/devices/0000:44:00.1/infiniband: irdma1 /sys/devices/pci0000:40/0000:40:03.1/0000:44:00.1: aer_dev_correctable aer_dev_fatal aer_dev_nonfatal ari_enabled broken_parity_status class config consistent_dma_mask_bits current_link_speed current_link_width d3cold_allowed device dma_mask_bits driver driver_override enable firmware_node ice.roce.1 infiniband infiniband_verbs iommu iommu_group irq link local_cpulist local_cpus max_link_speed max_link_width modalias msi_bus msi_irqs net numa_node power power_state remove rescan reset resource resource0 resource0_wc resource3 resource3_wc revision rom sriov_drivers_autoprobe sriov_numvfs sriov_offset sriov_stride sriov_totalvfs sriov_vf_device sriov_vf_total_msix subsystem subsystem_device subsystem_vendor uevent vendor 
vpd' + '[' -z /sys/devices/pci0000:40/0000:40:00.2/iommu/ivhd1/devices/0000:44:00.1/infiniband: irdma1 /sys/devices/pci0000:40/0000:40:03.1/0000:44:00.1: aer_dev_correctable aer_dev_fatal aer_dev_nonfatal ari_enabled broken_parity_status class config consistent_dma_mask_bits current_link_speed current_link_width d3cold_allowed device dma_mask_bits driver driver_override enable firmware_node ice.roce.1 infiniband infiniband_verbs iommu iommu_group irq link local_cpulist local_cpus max_link_speed max_link_width modalias msi_bus msi_irqs net numa_node power power_state remove rescan reset resource resource0 resource0_wc resource3 resource3_wc revision rom sriov_drivers_autoprobe sriov_numvfs sriov_offset sriov_stride sriov_totalvfs sriov_vf_device sriov_vf_total_msix subsystem subsystem_device subsystem_vendor uevent vendor vpd ']' /tmp/a.sh: line 10: [: too many arguments + ibv_devinfo -d /sys/devices/pci0000:40/0000:40:00.2/iommu/ivhd1/devices/0000:44:00.1/infiniband: irdma1 /sys/devices/pci0000:40/0000:40:03.1/0000:44:00.1: aer_dev_correctable aer_dev_fatal aer_dev_nonfatal ari_enabled broken_parity_status class config consistent_dma_mask_bits current_link_speed current_link_width d3cold_allowed device dma_mask_bits driver driver_override enable firmware_node ice.roce.1 infiniband infiniband_verbs iommu iommu_group irq link local_cpulist local_cpus max_link_speed max_link_width modalias msi_bus msi_irqs net numa_node power power_state remove rescan reset resource resource0 resource0_wc resource3 resource3_wc revision rom sriov_drivers_autoprobe sriov_numvfs sriov_offset sriov_stride sriov_totalvfs sriov_vf_device sriov_vf_total_msix subsystem subsystem_device subsystem_vendor uevent vendor vpd IB device '/sys/devices/pci0000:40/0000:40:00.2/iommu/ivhd1/devices/0000:44:00.1/infiniband:' wasn't found + for port in $ports ++ ls -l /sys/class/net ++ grep i810_roce ++ awk '{print $11}' ++ cut -d / -f6 + slot='0000:44:00.0 i810_roce.43 i810_roce.45' + '[' -z 0000:44:00.0 
i810_roce.43 i810_roce.45 ']' /tmp/a.sh: line 8: [: too many arguments +++ find /sys/devices/ -name 0000:44:00.0 i810_roce.43 i810_roce.45 find: paths must precede expression: `i810_roce.43' ++ ls /infiniband + irdma_dev= + '[' -z ']' + exit 1 [root@rdma-dev-31 ~]$
Thanks. There are 2 issues.
The first one is in finding the irdma device name. Honggang's patch should fix it.
The second one is in finding the slot number. It was fixed in our 11.2 version. Please change the following line in ethfindgood

slot=\$(ls -l /sys/class/net | grep \$port | awk '{print \$11}' | cut -d '/' -f 6)

to

slot=\$(ls -l /sys/class/net | grep \"\$port \" | awk '{print \$11}' | cut -d '/' -f 6)
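To illustrate the slot issue, here is a small sketch that runs both filters against canned `ls -l /sys/class/net` output instead of a live system. The interface names and slot numbers come from the trace in this bug; the other columns and the link targets of the two VLAN interfaces are made up for the example.

```shell
#!/bin/bash
# Canned listing (assumption: VLAN interfaces link under devices/virtual/net).
listing='lrwxrwxrwx. 1 root root 0 Jan  5 00:00 i810_off -> ../../devices/pci0000:40/0000:40:03.1/0000:44:00.1/net/i810_off
lrwxrwxrwx. 1 root root 0 Jan  5 00:00 i810_roce -> ../../devices/pci0000:40/0000:40:03.1/0000:44:00.0/net/i810_roce
lrwxrwxrwx. 1 root root 0 Jan  5 00:00 i810_roce.43 -> ../../devices/virtual/net/i810_roce.43
lrwxrwxrwx. 1 root root 0 Jan  5 00:00 i810_roce.45 -> ../../devices/virtual/net/i810_roce.45'
port=i810_roce

# Original filter: a substring match, so the VLAN interfaces i810_roce.43
# and i810_roce.45 are selected as well and "slot" ends up with three words.
old_slot=$(printf '%s\n' "$listing" | grep $port | awk '{print $11}' | cut -d '/' -f 6)

# Fixed filter: the trailing space only matches the name column, where the
# interface name is followed by " -> ".
new_slot=$(printf '%s\n' "$listing" | grep "$port " | awk '{print $11}' | cut -d '/' -f 6)

echo "old slot: $(echo $old_slot)"
echo "new slot: $new_slot"
```

The three-word value is what makes the unquoted `[ -z $slot ]` test and the following `find` call fail in the trace from rdma-dev-31.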
(In reply to Jijun Wang from comment #7)
> Thanks. There are 2 issues.
> The first one is on finding irdma dev name. Honggang's patch shall fix it.
> The send one is on finding slot number. It was fixed in our 11.2 version.
> Please change the following line in ethfindgood
> 
> slot=\$(ls -l /sys/class/net | grep \$port | awk '{print \$11}' | cut -d '/'
> -f 6)
> 
> to
> 
> slot=\$(ls -l /sys/class/net | grep \"\$port \" | awk '{print \$11}' | cut
> -d '/' -f 6)

It looks good now.

[root@rdma-dev-31 ~]$ ethfindgood
Warning: backed up existing /etc/eth-tools/{alive,good,bad} as *.bak files.
2 hosts will be checked
2 hosts are pingable (alive)
2 hosts are ssh'able (running)
2 total hosts have RDMA active ports on one or more fabrics (active)
2 hosts are alive, running, active (good)
0 hosts are bad (bad)
Bad hosts have been added to /root/punchlist.csv
Thanks. I will update eth-tools-fastfabric
Updated eth-tools to 11.1.0.1-6.
Here is the f36 build:
https://koji.fedoraproject.org/koji/taskinfo?taskID=80960413
(In reply to Jijun Wang from comment #10)
> Updated eth-tools to 11.1.0.1-6
> Here is the f36 build
> https://koji.fedoraproject.org/koji/taskinfo?taskID=80960413

I built it for rhel-9.0.0, but it still needs improvement.

[root@rdma-dev-30 ~]$ /usr/sbin/ethsetupsnmp -p -L -f /etc/eth-tools/hosts
Configuring SNMP...
Enter space separated list of admin hosts (rdma-dev-30.rdma.lab.eng.rdu2.redhat.com):
Enter SNMP community string (public):
Fast Fabric requires the following MIBs:
    1.3.6.1.2.1.1 (SNMPv2-MIB:system)
    1.3.6.1.2.1.2 (IF-MIB:interfaces)
    1.3.6.1.2.1.4 (IP-MIB:ip)
    1.3.6.1.2.1.10.7 (EtherLike-MIB:dot3)
    1.3.6.1.2.1.31.1 (IP-MIB:ifMIBObjects)
Do you accept these MIBs [y/n] (y):
Enter space separated list of extra MIBs to support (NONE):
Will config SNMP with the following settings:
    admin hosts: rdma-dev-30.rdma.lab.eng.rdu2.redhat.com
    community: public
    MIBs: 1.3.6.1.2.1.1 1.3.6.1.2.1.2 1.3.6.1.2.1.4 1.3.6.1.2.1.10.7 1.3.6.1.2.1.31.1
Do you accept these settings [y/n] (y):
mv: cannot stat '/etc/snmp/snmpd.conf': No such file or directory
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
So the package should require `net-snmp`.

========================================================================

[root@rdma-dev-30 ~]$ /usr/sbin/ethfindgood
Warning: backed up existing /etc/eth-tools/{alive,running,active,good,bad} as *.bak files.
2 hosts will be checked
2 hosts are pingable (alive)
2 hosts are ssh'able (running)
0 total hosts have RDMA active ports on one or more fabrics (active) <====
0 hosts are alive, running, active (good) <====
2 hosts are bad (bad)
Bad hosts have been added to /root/punchlist.csv

[root@rdma-dev-30 ~]$ rpm -q eth-tools-fastfabric
eth-tools-fastfabric-11.1.0.1-6.el9.x86_64

267 cmds="type ibv_devinfo > /dev/null 2>&1 || exit 1
268 $cmds
269 slot=\$(ls -l /sys/class/net | grep \"\$port \" | awk '{print \$11}' | cut -d '/' -f 6)
270 [ -z \$slot ] && exit 1
271 irdma_dev=\$(ls \$(find /sys/devices/ -path */\$slot/infiniband) 2> /dev/null)
272 [ -z \$irdma_dev ] && exit 1
273 ibv_devinfo -d \$irdma_dev | grep '^\s*state:\s*PORT_ACTIVE' > /dev/null 2>&1 || exit 1

The `exit 1` in line 273 terminates the for loop because the port of the first ice device is down. That is why ethfindgood can't detect the active port on the *second* ice device. But if we simply remove the `exit 1` in line 273, the result depends only on the last ice port checked, so ethfindgood will fail to flag bad hosts whose last ice port is active.
You are right. I will fix them.
Updated eth-tools to 11.1.0.1-7.
The changes are:
- When a user specifies ports for a node, we check to ensure all of those ports are active RDMA ports
- If a user doesn't specify ports for a node, we check to ensure at least one port is an active RDMA port
- Added net-snmp to the eth-tools rpm dependencies

Here is the f36 build:
https://koji.fedoraproject.org/koji/taskinfo?taskID=81083085
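The two port-checking rules described in this update can be sketched roughly as below. This is not the code shipped in 11.1.0.1-7: `check_host`, `port_is_active`, and the `ACTIVE_PORTS` variable are hypothetical names, and `port_is_active` is only a stand-in for the real `ibv_devinfo ... PORT_ACTIVE` probe.

```shell
#!/bin/bash
# Stand-in probe (assumption): a port counts as active when its name appears
# in the space-separated ACTIVE_PORTS list.
port_is_active() {
    case " $ACTIVE_PORTS " in
        *" $1 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# check_host MODE PORT...
#   MODE=all: every listed port must be active (the user named the ports)
#   MODE=any: one active port is enough (ports were discovered automatically)
check_host() {
    local mode=$1 port found=1
    shift
    for port in "$@"; do
        if port_is_active "$port"; then
            found=0
        elif [ "$mode" = all ]; then
            return 1    # a named port is down: the host is bad
        fi
    done
    return $found       # 0 only if at least one port was active
}

# State from the ibstatus output in the description: irdma0 up, irdma1 down.
ACTIVE_PORTS="irdma0"
check_host any irdma1 irdma0 && echo "any: good"
check_host all irdma1 irdma0 || echo "all: bad"
```

In `any` mode an inactive port no longer ends the loop early, which is the behavior that made the -6 build report 0 active hosts.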
After evaluating this issue, there are no plans to address it further or fix it in an upcoming release. Therefore, it is being closed. If plans change such that this issue will be fixed in an upcoming release, then the bug can be reopened.