Bug 1860674
| Summary: | [RHEL8.3] All OPENMPI benchmarks fail after upgrading to "environment-modules-4.5.1-1.el8.x86_64" | ||||||||
|---|---|---|---|---|---|---|---|---|---|
| Product: | Red Hat Enterprise Linux 8 | Reporter: | Brian Chae <bchae> | ||||||
| Component: | environment-modules | Assignee: | Lukáš Nykrýn <lnykryn> | ||||||
| Status: | CLOSED ERRATA | QA Contact: | Frantisek Sumsal <fsumsal> | ||||||
| Severity: | unspecified | Docs Contact: | |||||||
| Priority: | unspecified | ||||||||
| Version: | 8.3 | CC: | bchae, fsumsal, honli, rdma-dev-team, tmichael, xavier.delaruelle | ||||||
| Target Milestone: | rc | Keywords: | Regression | ||||||
| Target Release: | 8.0 | Flags: | pm-rhel: mirror+ | ||||||
| Hardware: | Unspecified | ||||||||
| OS: | Unspecified | ||||||||
| Whiteboard: | |||||||||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |||||||
| Doc Text: | Story Points: | --- | |||||||
| Clone Of: | Environment: | ||||||||
| Last Closed: | 2020-11-04 02:13:47 UTC | Type: | Bug | ||||||
| Regression: | --- | Mount Type: | --- | ||||||
| Documentation: | --- | CRM: | |||||||
| Verified Versions: | Category: | --- | |||||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||||
| Embargoed: | |||||||||
| Bug Depends On: | |||||||||
| Bug Blocks: | 1842946 | ||||||||
| Attachments: | ||||||||
Likely a regression of https://bugzilla.redhat.com/show_bug.cgi?id=1642837. @Jan, could you please have a look?

Newer versions of environment-modules (>=4.2) now ensure consistency of the loaded environment.
The following lines of logs:
Loading /etc/modulefiles/mpi/openmpi-x86_64
ERROR: /etc/modulefiles/mpi/openmpi-x86_64 cannot be loaded due to a conflict.
HINT: Might try "module unload mpi" first.
seem to indicate that an "mpi" module is already loaded prior to the attempt to load "/etc/modulefiles/mpi/openmpi-x86_64". As those modulefiles declare a conflict with any other "mpi" module, an error is raised. No error was raised on older versions of environment-modules (<4.2) because conflict detection was incomplete.
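For anyone triaging similar runs, here is a minimal sketch of how a test harness might flag this error in a captured log. The function name and the idea of passing a log file path are assumptions for illustration; only the matched message text comes from the log above.

```shell
#!/bin/bash
# Sketch only: report whether a captured log contains the conflict error
# quoted above. has_module_conflict and its log-path argument are hypothetical.
has_module_conflict() {
    # -q: no output, exit status only; -F: treat the pattern as a fixed string
    grep -qF 'cannot be loaded due to a conflict' "$1"
}
```

It could be called as `has_module_conflict client.log && echo "module conflict detected"`, where `client.log` stands in for whatever file the harness writes.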
I would suggest looking at the user environment right before the test is launched and adding a "module unload mpi" (or a "module purge") right before the "module load /etc/modulefiles/mpi/openmpi-x86_64" command.
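The suggestion above could be sketched roughly as follows. This is only an illustration of the diagnostic step, not a fix for the underlying bug: the function name `load_mpi_clean` and the variable `MPI_MODULEFILE` are hypothetical, and the sketch assumes the shell has already sourced the environment-modules init scripts so that `module` is defined.

```shell
#!/bin/bash
# Modulefile path as quoted in the log; the variable name is hypothetical.
MPI_MODULEFILE=/etc/modulefiles/mpi/openmpi-x86_64

# Hypothetical helper: show the loaded modules, clear them, then load Open MPI.
load_mpi_clean() {
    # `module` is normally a shell function set up by the environment-modules
    # init scripts; stop early if this shell does not have it.
    if ! type module >/dev/null 2>&1; then
        echo "module command not available in this shell" >&2
        return 1
    fi
    module list                     # inspect the environment before the test
    module purge                    # or: module unload mpi
    module load "$MPI_MODULEFILE"   # return status is that of the load
}
```

If `module list` already shows nothing loaded and the conflict error still appears after the purge, the problem is more likely in environment-modules itself than in the test environment.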
(In reply to Xavier Delaruelle from comment #2)
> I would suggest to look at user environment right before test is launched
> and add a "module unload mpi" (or a "module purge") right before the "module
> load /etc/modulefiles/mpi/openmpi-x86_64" command.

Yes, 'module purge' was executed before running 'module load /etc/modulefiles/mpi/openmpi-x86_64'. Please see the attachment for details.
https://bugzilla.redhat.com/attachment.cgi?id=1702441

Created attachment 1702816 [details]
environment-modules fix
Thanks for the clarification.
This is clearly a bug on the environment-modules side.
I have just made a fix for it (see the attached patch). It can be applied right away to the SRPM if you want to quickly build a fixed version of the environment-modules package. I will release v4.5.2 upstream in the next few days, which will include this fix.
(In reply to Xavier Delaruelle from comment #4)
> Created attachment 1702816 [details]
> environment-modules fix

Confirmed this patch works for me. Thank you!

(In reply to Honggang LI from comment #5)
> (In reply to Xavier Delaruelle from comment #4)
> > Created attachment 1702816 [details]
> > environment-modules fix
>
> Confirmed this patch works for me. Thank you!

I also tested with the stated patch to /usr/share/Modules/libexec/modulecmd.tcl and openmpi tests passed.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (environment-modules bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHBA-2020:4593
Created attachment 1702441 [details]
client log for openmpi showing all benchmark failures due to environment-modules-5.4.1

Description of problem:

All OPENMPI benchmarks fail after upgrading to the "environment-modules-4.5.1-1.el8.x86_64" package from "environment-modules-4.1.4-4.el8.x86_64", when "mpirun" without the full path is used as the benchmark command.

Workarounds:
1. Use the full path to the "mpirun" command instead, "/usr/lib64/openmpi/bin/mpirun", when "environment-modules-4.5.1-1.el8.x86_64" is installed.
2. Or install the package environment-modules-4.1.4-4.el8.x86_64.

Version-Release number of selected component (if applicable):

DISTRO=RHEL-8.3.0-20200701.2
+ [20-07-25 07:05:45] cat /etc/redhat-release
Red Hat Enterprise Linux release 8.3 Beta (Ootpa)
+ [20-07-25 07:05:45] uname -a
Linux rdma-virt-03.lab.bos.redhat.com 4.18.0-221.el8.x86_64 #1 SMP Thu Jun 25 20:58:19 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
+ [20-07-25 07:05:45] cat /proc/cmdline
BOOT_IMAGE=(hd0,msdos1)/vmlinuz-4.18.0-221.el8.x86_64 root=/dev/mapper/rhel_rdma--virt--03-root ro intel_idle.max_cstate=0 processor.max_cstate=0 intel_iommu=on iommu=on console=tty0 rd_NO_PLYMOUTH crashkernel=auto resume=/dev/mapper/rhel_rdma--virt--03-swap rd.lvm.lv=rhel_rdma-virt-03/root rd.lvm.lv=rhel_rdma-virt-03/swap console=ttyS1,115200n81
+ [20-07-25 07:05:45] rpm -q rdma-core linux-firmware
rdma-core-29.0-3.el8.x86_64
linux-firmware-20200619-99.git3890db36.el8.noarch
+ [20-07-25 07:05:45] tail /sys/class/infiniband/mlx5_0/fw_ver /sys/class/infiniband/mlx5_1/fw_ver /sys/class/infiniband/mlx5_bond_0/fw_ver
==> /sys/class/infiniband/mlx5_0/fw_ver <==
12.25.1020
==> /sys/class/infiniband/mlx5_1/fw_ver <==
12.25.1020
==> /sys/class/infiniband/mlx5_bond_0/fw_ver <==
14.27.1016
+ [20-07-25 07:05:45] lspci
+ [20-07-25 07:05:45] grep -i -e ethernet -e infiniband -e omni -e ConnectX
02:00.0 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 2-port Gigabit Ethernet PCIe
02:00.1 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 2-port Gigabit Ethernet PCIe
03:00.0 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 2-port Gigabit Ethernet PCIe
03:00.1 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 2-port Gigabit Ethernet PCIe
04:00.0 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]
04:00.1 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]
05:00.0 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx]
05:00.1 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx]
+ [20-07-25 07:05:50] dnf install -y --setopt=strict=0 --nogpgcheck openmpi mpitests-openmpi environment-modules
Updating Subscription Management repositories.
Unable to read consumer identity
This system is not registered to Red Hat Subscription Management. You can use subscription-manager to register.
Last metadata expiration check: 0:00:19 ago on Sat 25 Jul 2020 07:05:31 AM EDT.
Package environment-modules-4.1.4-4.el8.x86_64 is already installed.
Dependencies resolved.
================================================================================
 Package                 Arch       Version            Repository          Size
================================================================================
Installing:
 mpitests-openmpi        x86_64     5.6.2-1.el8        beaker-AppStream   943 k
 openmpi                 x86_64     4.0.3-1.el8        beaker-AppStream   2.8 M
Upgrading:
 environment-modules     x86_64     4.5.1-1.el8        brew               419 k

Transaction Summary
================================================================================
Install  2 Packages
Upgrade  1 Package

Total download size: 4.1 M
Downloading Packages:
(1/3): mpitests-openmpi-5.6.2-1.el8.x86_64.rpm   40 MB/s | 943 kB     00:00
(2/3): environment-modules-4.5.1-1.el8.x86_64.r  12 MB/s | 419 kB     00:00
(3/3): openmpi-4.0.3-1.el8.x86_64.rpm            40 MB/s | 2.8 MB     00:00
--------------------------------------------------------------------------------
Total                                            59 MB/s | 4.1 MB     00:00
Running transaction check
Transaction check succeeded.
Running transaction test
Transaction test succeeded.
Running transaction
  Preparing        :                                                        1/1
  Upgrading        : environment-modules-4.5.1-1.el8.x86_64                 1/4
  Running scriptlet: environment-modules-4.5.1-1.el8.x86_64                 1/4
  Installing       : openmpi-4.0.3-1.el8.x86_64                             2/4
  Installing       : mpitests-openmpi-5.6.2-1.el8.x86_64                    3/4
  Cleanup          : environment-modules-4.1.4-4.el8.x86_64                 4/4
  Running scriptlet: environment-modules-4.1.4-4.el8.x86_64                 4/4
  Verifying        : mpitests-openmpi-5.6.2-1.el8.x86_64                    1/4
  Verifying        : openmpi-4.0.3-1.el8.x86_64                             2/4
  Verifying        : environment-modules-4.5.1-1.el8.x86_64                 3/4
  Verifying        : environment-modules-4.1.4-4.el8.x86_64                 4/4
Installed products updated.

Upgraded:
  environment-modules-4.5.1-1.el8.x86_64   <<<==============
Installed:
  mpitests-openmpi-5.6.2-1.el8.x86_64
  openmpi-4.0.3-1.el8.x86_64

How reproducible:

100%

Steps to Reproduce:
1. Upgrade the package "environment-modules" from "environment-modules-4.1.4-4.el8.x86_64" to "environment-modules-4.5.1-1.el8.x86_64".
2. Make sure the path to "mpirun" exists:
+ [20-07-25 07:06:32] which mpirun
/usr/lib64/openmpi/bin/mpirun
3. Run an OPENMPI benchmark using "mpirun", as follows:
timeout --preserve-status --kill-after=5m 3m mpirun -hostfile /root/hfile_one_core -np 2 --allow-run-as-root --map-by node -mca btl_openib_warn_nonexistent_if 0 -mca btl_openib_if_include mlx5_0:1 -mca mtl '^psm2,psm,ofi' -mca btl openib,self -mca btl_openib_allow_ib 1 -x UCX_NET_DEVICES=mlx5_ib0 /usr/lib64/openmpi/bin/mpitests-osu_acc_latency

Actual results:

+ [20-07-25 07:06:05] which mpirun
/usr/lib64/openmpi/bin/mpirun
+ [20-07-25 07:06:05] '[' 0 -ne 0 ']'
++ [20-07-25 07:06:05] cat imb_mpi.txt
+ [20-07-25 07:06:05] for app in $(cat imb_mpi.txt)
+ [20-07-25 07:06:05] timeout --preserve-status --kill-after=5m 3m mpirun -hostfile /root/hfile_one_core -np 2 --allow-run-as-root --map-by node -mca btl_openib_warn_nonexistent_if 0 -mca btl_openib_if_include mlx5_0:1 -mca mtl '^psm2,psm,ofi' -mca btl openib,self -mca btl_openib_allow_ib 1 -x UCX_NET_DEVICES=mlx5_ib0 mpitests-IMB-MPI1 PingPong -time 1.5
Loading /etc/modulefiles/mpi/openmpi-x86_64
ERROR: /etc/modulefiles/mpi/openmpi-x86_64 cannot be loaded due to a conflict.
HINT: Might try "module unload mpi" first.
bash: orted: command not found
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:
* not finding the required libraries and/or binaries on one or more nodes. Please check your PATH and LD_LIBRARY_PATH settings, or configure OMPI with --enable-orterun-prefix-by-default
* lack of authority to execute on one or more specified nodes. Please verify your allocation and authorities.
* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base). Please check with your sys admin to determine the correct location to use.
* compilation of the orted with dynamic libraries when static are required (e.g., on Cray). Please check your configure cmd line and consider using one of the contrib/platform definitions for your system type.
* an inability to create a connection back to mpirun due to a lack of common network interfaces and/or no route found between them. Please check network connectivity (including firewalls and network routing requirements).
--------------------------------------------------------------------------
+ [20-07-25 07:06:05] __MPI_check_result 127 mpitests-openmpi IMB-MPI1 PingPong mpirun /root/hfile_one_core

Expected results:

+ [20-07-26 08:47:16] which mpirun
/usr/lib64/openmpi/bin/mpirun
+ [20-07-26 08:47:16] '[' 0 -ne 0 ']'
++ [20-07-26 08:47:16] cat imb_mpi.txt
+ [20-07-26 08:47:16] for app in $(cat imb_mpi.txt)
+ [20-07-26 08:47:16] timeout --preserve-status --kill-after=5m 3m mpirun -hostfile /root/hfile_one_core -np 2 --allow-run-as-root --map-by node -mca btl_openib_warn_nonexistent_if 0 -mca btl_openib_if_include mlx5_2:1 -mca mtl '^psm2,psm,ofi' -mca btl openib,self -mca btl_openib_allow_ib 1 -x UCX_NET_DEVICES=mlx5_ib0 mpitests-IMB-MPI1 PingPong -time 1.5
#------------------------------------------------------------
# Intel(R) MPI Benchmarks 2019 Update 6, MPI-1 part
#------------------------------------------------------------
# Date : Sun Jul 26 08:47:16 2020
# Machine : x86_64
# System : Linux
# Release : 4.18.0-221.el8.x86_64
# Version : #1 SMP Thu Jun 25 20:58:19 UTC 2020
# MPI Version : 3.1
# MPI Thread Environment:
# Calling sequence was:
# mpitests-IMB-MPI1 PingPong -time 1.5
# Minimum message length in bytes: 0
# Maximum message length in bytes: 4194304
#
# MPI_Datatype : MPI_BYTE
# MPI_Datatype for reductions : MPI_FLOAT
# MPI_Op : MPI_SUM
#
#
# List of Benchmarks to run:
# PingPong
#---------------------------------------------------
# Benchmarking PingPong
# #processes = 2
#---------------------------------------------------
       #bytes #repetitions      t[usec]   Mbytes/sec
            0         1000         9.66         0.00
            1         1000         9.77         0.10
            2         1000         9.93         0.20
            4         1000         9.74         0.41
            8         1000         9.45         0.85
           16         1000         9.59         1.67
           32         1000         9.78         3.27
           64         1000         9.97         6.42
          128         1000        10.06        12.72
          256         1000        10.13        25.27
          512         1000        10.38        49.34
         1024         1000        10.21       100.32
         2048         1000        18.77       109.12
         4096         1000        19.41       211.07
         8192         1000        21.46       381.79
        16384         1000        30.78       532.30
        32768         1000        44.40       737.99
        65536          640        96.45       679.45
       131072          320       120.11      1091.27
       262144          160       206.78      1267.72
       524288           80       284.22      1844.65
      1048576           40       450.47      2327.75
      2097152           20       701.05      2991.46
      4194304           10      1293.59      3242.36
# All processes entering MPI_Finalize
+ [20-07-26 08:47:21] __MPI_check_result 0 mpitests-openmpi IMB-MPI1 PingPong mpirun /root/hfile_one_core

Additional info:

This behavior started as soon as RHEL8.3 was upgraded with "environment-modules-4.5.1-1.el8.x86_64". With "environment-modules-4.1.4-4.el8.x86_64" installed, no such issue exists.
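
Workaround 1 from the description could be wired into a test script along these lines. This is a minimal sketch under stated assumptions: the helper `resolve_mpirun` and the override variable `MPIRUN_PATH` are hypothetical names, and only the default path comes from this report.

```shell
#!/bin/bash
# Default is the full path quoted in this report; MPIRUN_PATH is a
# hypothetical override variable introduced for this sketch.
MPIRUN_PATH=${MPIRUN_PATH:-/usr/lib64/openmpi/bin/mpirun}

# Hypothetical helper: print the mpirun to use, preferring the absolute path.
resolve_mpirun() {
    if [ -x "$MPIRUN_PATH" ]; then
        printf '%s\n' "$MPIRUN_PATH"
    else
        # Fall back to whatever PATH provides (fails if mpirun is not found).
        command -v mpirun
    fi
}
```

A benchmark line would then start with `"$(resolve_mpirun)" -hostfile /root/hfile_one_core -np 2 ...` in place of a bare `mpirun`. The Open MPI mpirun man page describes invoking mpirun by absolute path as equivalent to passing `--prefix`, which is presumably why this avoids the "orted: command not found" failure on the remote side.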