Bug 1500952 - Infiniband - disable a port and you will not be able to bring it back up
Summary: Infiniband - disable a port and you will not be able to bring it back up
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: infiniband-diags
Version: 7.4
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: rc
Target Release: ---
Assignee: Honggang LI
QA Contact: Infiniband QE
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-10-11 20:03 UTC by lejeczek
Modified: 2021-09-06 15:29 UTC (History)
6 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-01-15 07:43:27 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
full dmesg (237.90 KB, text/plain)
2017-10-12 20:49 UTC, lejeczek
no flags

Description lejeczek 2017-10-11 20:03:28 UTC
Description of problem:

I don't know if it is kernel or userspace, but I do:

$ ibportstate 1 1 off
Initial CA PortInfo:
# Port info: Lid 1 port 1
LinkState:.......................Active
PhysLinkState:...................LinkUp
Lid:.............................1
SMLid:...........................1
LMC:.............................0
LinkWidthSupported:..............1X or 4X
LinkWidthEnabled:................1X or 4X
LinkWidthActive:.................4X
LinkSpeedSupported:..............2.5 Gbps
LinkSpeedEnabled:................2.5 Gbps
LinkSpeedActive:.................2.5 Gbps
Mkey:............................<not displayed>
MkeyLeasePeriod:.................0
ProtectBits:.....................0
Disable may be irreversible

After PortInfo set:
# Port info: Lid 1 port 1
LinkState:.......................Active
PhysLinkState:...................Disabled
Lid:.............................1
SMLid:...........................1
LMC:.............................0
LinkWidthSupported:..............1X or 4X
LinkWidthEnabled:................1X or 4X
LinkWidthActive:.................4X
LinkSpeedSupported:..............2.5 Gbps
LinkSpeedEnabled:................2.5 Gbps
LinkSpeedActive:.................2.5 Gbps
Mkey:............................<not displayed>
MkeyLeasePeriod:.................0
ProtectBits:.....................0

then I do:

$ ibportstate 1 1 on
ibwarn: [74878] _do_madrpc: recv failed: Connection timed out
ibwarn: [74878] mad_rpc: _do_madrpc failed; dport (Lid 1)
ibportstate: iberror: failed: smp query nodeinfo failed


Version-Release number of selected component (if applicable):

09:00.0 InfiniBand: Mellanox Technologies MT25408 [ConnectX VPI - IB SDR / 10GigE] (rev a0)
        Subsystem: Mellanox Technologies Device 0003
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 54
        NUMA node: 0
        Region 0: Memory at ef500000 (64-bit, non-prefetchable) [size=1M]
        Region 2: Memory at e4800000 (64-bit, prefetchable) [size=8M]
        Capabilities: [40] Power Management version 3
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [48] Vital Product Data
                Product Name: Eagle IB DDR
                Read-only fields:
                        [PN] Part number: HCA-00001            
                        [EC] Engineering changes: A5
                        [SN] Serial number: ML2410001968            
                        [V0] Vendor specific: HCA 500Ex-D     


3.10.0-693.2.2.el7.x86_64
infiniband-diags-1.6.7-1.el7.x86_64

How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 2 Don Dutile (Red Hat) 2017-10-11 21:15:53 UTC
mellanox device firmware version?

Comment 3 lejeczek 2017-10-11 23:05:58 UTC
The latest version I could find for my HCA model, I believe.


	CA type: MT25408
	Firmware version: 2.9.1000

Comment 4 Don Dutile (Red Hat) 2017-10-11 23:17:33 UTC
(In reply to lejeczek from comment #3)
> The latest version I could find for my HCA model, I believe.
> 
> 
> 	CA type: MT25408
> 	Firmware version: 2.9.1000

full output from lspci -vvv of that slot -- you chopped some of it off... thanks.
Also, the bz is reported against 7.3, but the kernel is a 7.4 kernel.
Is the kernel version correct, and bz-Version field wrong?

Comment 5 lejeczek 2017-10-12 08:05:32 UTC
sorry, yes, 7.4

$ lspci -s 09:00.0 -vvv 
09:00.0 InfiniBand: Mellanox Technologies MT25408 [ConnectX VPI - IB SDR / 10GigE] (rev a0)
	Subsystem: Mellanox Technologies Device 0003
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 64 bytes
	Interrupt: pin A routed to IRQ 54
	NUMA node: 0
	Region 0: Memory at ef500000 (64-bit, non-prefetchable) [size=1M]
	Region 2: Memory at e4800000 (64-bit, prefetchable) [size=8M]
	Capabilities: [40] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
		Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [48] Vital Product Data
		Product Name: Eagle IB DDR
		Read-only fields:
			[PN] Part number: HCA-00001            
			[EC] Engineering changes: A5
			[SN] Serial number: ML2410001968            
			[V0] Vendor specific: HCA 500Ex-D     
			[RV] Reserved: checksum good, 0 byte(s) reserved
		Read/write fields:
			[V1] Vendor specific: N/A    
			[YA] Asset tag: N/A                             
			[RW] Read-write area: 107 byte(s) free
		End
	Capabilities: [9c] MSI-X: Enable+ Count=128 Masked-
		Vector table: BAR=0 offset=0007c000
		PBA: BAR=0 offset=0007d000
	Capabilities: [60] Express (v2) Endpoint, MSI 00
		DevCap:	MaxPayload 256 bytes, PhantFunc 0, Latency L0s <64ns, L1 unlimited
			ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 25.000W
		DevCtl:	Report errors: Correctable- Non-Fatal+ Fatal+ Unsupported+
			RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop- FLReset-
			MaxPayload 128 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
		LnkCap:	Port #8, Speed 2.5GT/s, Width x8, ASPM L0s, Exit Latency L0s unlimited, L1 unlimited
			ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp-
		LnkCtl:	ASPM Disabled; RCB 64 bytes Disabled- CommClk-
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 2.5GT/s, Width x8, TrErr- Train- SlotClk- DLActive- BWMgmt- ABWMgmt-
		DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR-, OBFF Not Supported
		DevCtl2: Completion Timeout: 65ms to 210ms, TimeoutDis-, LTR-, OBFF Disabled
		LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-
			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
			 Compliance De-emphasis: -6dB
		LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
			 EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
	Capabilities: [100 v1] Alternative Routing-ID Interpretation (ARI)
		ARICap:	MFVC- ACS-, Next Function: 1
		ARICtl:	MFVC- ACS-, Function Group: 0
	Kernel driver in use: mlx4_core
	Kernel modules: mlx4_core

Comment 6 lejeczek 2017-10-12 12:23:42 UTC
Just a bit more on my setup (it's simplistic): two nodes with one HCA each, each HCA has two ports, connected directly: port1 to port1 and port2 to port2.

Maybe it's in the kernel(?)

in /etc/sysconfig/opensm:
GUIDS="0x0008f104039a62a1 0x0008f104039a62a2"

in /etc/rdma/mlx4.conf:
0000:09:00.0 eth eth

and both ports (when the system boots) are in: Link layer: InfiniBand
When I do:

$ /usr/libexec/mlx4-setup.sh < /etc/rdma/mlx4.conf 
Failed to set port2 to eth mode


$ ibstat
CA 'mlx4_0'
	CA type: MT25408
	Number of ports: 2
	Firmware version: 2.9.1000
	Hardware version: a0
	Node GUID: 0x0008f104039a62a0
	System image GUID: 0x0008f104039a62a3
	Port 1:
		State: Active
		Physical state: LinkUp
		Rate: 10
		Base lid: 1
		LMC: 0
		SM lid: 1
		Capability mask: 0x0259086a
		Port GUID: 0x0008f104039a62a1
		Link layer: InfiniBand
	Port 2:
		State: Active
		Physical state: LinkUp
		Rate: 10
		Base lid: 3
		LMC: 0
		SM lid: 3
		Capability mask: 0x0259086a
		Port GUID: 0x0008f104039a62a2
		Link layer: InfiniBand

Comment 7 Don Dutile (Red Hat) 2017-10-12 13:22:47 UTC
Re-assigning to partner engineers to see if they can isolate the error.

Comment 8 Don Dutile (Red Hat) 2017-10-12 13:51:31 UTC
Well, we don't have back-to-back configs to dupe this bug.
I'm now wondering if this is the link-up failure -- endpoint-to-endpoint setups don't handle this situation properly, or opensm needs a tweak (per-port control?) to enable it.  I'd be curious whether a switch-based fabric would have this issue.

Comment 9 lejeczek 2017-10-12 15:54:11 UTC
Configs are pretty much plain vanilla; I touched only the two files mentioned above, everything else is the distro's defaults.
Unfortunately I cannot test it with a switch in between, as I have no switch; it's a poor man's setup, but that should be exactly why it would be popular among users.

On 'off' port, reboot brings it back, but that's not even a workaround, too ugly to be.

Comment 10 lejeczek 2017-10-12 16:08:24 UTC
My latest (probably last for now) observation.
With this "working" config, with Link layer: InfiniBand, having set up an nmcli connection of type infiniband and simply testing (I realize that some tuning might help):

$ iperf -c 10.5.4.100
------------------------------------------------------------
Client connecting to 10.5.4.100, TCP port 5001
TCP window size: 2.50 MByte (default)
------------------------------------------------------------
[  3] local 10.5.4.49 port 53276 connected with 10.5.4.100 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.2 sec   532 MBytes   439 Mbits/sec

That looks very wrong, no? I'm far from being a kernel expert but something smells fishy to me.

Rest of the HW stack (if it may help) - that is to say, it's not a no-name rag; maybe worth passing on to Dell, as the HCAs sit in PE R815s.

Comment 11 Don Dutile (Red Hat) 2017-10-12 18:21:26 UTC
Ah, Dell PE R815s -- please send the boot log.
Try with iommu=off (the IOMMU is on by default on AMD systems) on the kernel command line.
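
For reference, a minimal sketch of adding that parameter persistently on RHEL 7 with grubby (assuming the stock boot-loader setup; check /proc/cmdline after rebooting):

$ grubby --update-kernel=ALL --args="iommu=off"
$ reboot
$ cat /proc/cmdline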

Comment 12 lejeczek 2017-10-12 20:49:33 UTC
Created attachment 1337966 [details]
full dmesg

Comment 13 lejeczek 2017-10-12 20:50:44 UTC
I believe I've had it disabled, have I?

...
[Thu Oct 12 18:53:29 2017] PCI-DMA: aperture base @ c4000000 size 65536 KB
[Thu Oct 12 18:53:29 2017] PCI-DMA: using GART IOMMU.
[Thu Oct 12 18:53:29 2017] PCI-DMA: Reserving 64MB of IOMMU area in the AGP aperture
[Thu Oct 12 18:53:29 2017] perf: AMD NB counters detected
...

full dmesg attached earlier.

Comment 14 lejeczek 2017-10-12 21:00:39 UTC
wrong paste, this one:

[Thu Oct 12 18:53:28 2017] AGP: Please enable the IOMMU option in the BIOS setup

Comment 15 Doug Ledford 2017-10-12 21:33:41 UTC
For a back to back connection, you shouldn't be trying to set the ports to Eth mode.  They won't work reliably back to back in Eth mode is my experience.  Stick with Infiniband.  Also, if you do put them in Eth mode, then you don't need opensm.  That's only for when they are in IB mode.

I can't say for certain, but your description sounds like you downed the interface that was running opensm.  If you down that interface, then nothing will work after that because you've disconnected the SM from the fabric, and while it may not be obvious, lots of fabric tools need the SM in order to reach the ports they work on.  The entire IB fabric design is such that you, as an administrator on machine A, can decide that the port on machine B is bad and, with a single command, shut that port down.  The command actually goes out and queries the subnet manager for a bunch of info, then contacts the card (not even the kernel stack, mind you) at the other end, tells the card to shut down, and it does and delivers a message to the kernel (which is why you have to have root permissions to run lots of these tools).  If you down the machine that has opensm on it, you've just shut off the entire network in effect.  Attempts to bring it back up won't work, but if you restart opensm at the other end, that might.  That would also handily explain why a reboot works.
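
For reference, a hedged sketch of the recovery path described above, with placeholder LID/port values rather than commands taken from this report: restart opensm on the node that still runs the subnet manager, confirm an SM is reachable with sminfo, then try re-enabling the port.

$ systemctl restart opensm
$ sminfo
$ ibportstate <lid> <port> enable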

Comment 16 lejeczek 2017-10-13 12:08:46 UTC
@ #15 ok,
What I fiddled with yesterday revealed something similar - forcing the port mode to "eth" by echoing "eth" to the port's sysfs file (it would not work straightforwardly; if I remember correctly, only if I first echoed to port2 and then to port1 would the card's ports switch mode, as this model apparently has to have both ports in the same mode).
But even though the above method would put the ports in "eth", the port status in ibstat would still be Disabled, whereas it had been LinkUp and fine just before, when the ports were in "infiniband" mode.

I wouldn't mind sticking with "infiniband" and IPoIB, but those iperf test results are miserable.
Are such results expected? Surely not? Where is that near-10Gbps performance?
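
For reference, the sysfs interface being described is the mlx4 per-port type file under the PCI device; a sketch using the PCI address from the lspci output above, with the write order (port2 before port1) following the observation above:

$ cat /sys/bus/pci/devices/0000:09:00.0/mlx4_port1
$ cat /sys/bus/pci/devices/0000:09:00.0/mlx4_port2
$ echo eth > /sys/bus/pci/devices/0000:09:00.0/mlx4_port2
$ echo eth > /sys/bus/pci/devices/0000:09:00.0/mlx4_port1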

Comment 17 lejeczek 2017-10-23 12:56:34 UTC
and I've been getting these:

[Mon Oct 23 12:48:39 2017] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[Mon Oct 23 12:48:39 2017] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[Mon Oct 23 12:48:39 2017] Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[Mon Oct 23 12:48:39 2017] Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[Mon Oct 23 12:48:39 2017] Node 2 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[Mon Oct 23 12:48:39 2017] Node 2 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[Mon Oct 23 12:48:39 2017] Node 3 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[Mon Oct 23 12:48:39 2017] Node 3 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[Mon Oct 23 12:48:39 2017] Node 4 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[Mon Oct 23 12:48:39 2017] Node 4 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[Mon Oct 23 12:48:39 2017] Node 5 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[Mon Oct 23 12:48:39 2017] Node 5 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[Mon Oct 23 12:48:39 2017] Node 6 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[Mon Oct 23 12:48:39 2017] Node 6 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[Mon Oct 23 12:48:39 2017] Node 7 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[Mon Oct 23 12:48:39 2017] Node 7 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[Mon Oct 23 12:48:39 2017] 28817542 total pagecache pages
[Mon Oct 23 12:48:39 2017] 0 pages in swap cache
[Mon Oct 23 12:48:39 2017] Swap cache stats: add 0, delete 0, find 0/0
[Mon Oct 23 12:48:39 2017] Free swap  = 0kB
[Mon Oct 23 12:48:39 2017] Total swap = 0kB
[Mon Oct 23 12:48:39 2017] 33547797 pages RAM
[Mon Oct 23 12:48:39 2017] 0 pages HighMem/MovableOnly
[Mon Oct 23 12:48:39 2017] 590046 pages reserved
[Mon Oct 23 12:51:04 2017] kworker/u128:0: page allocation failure: order:4, mode:0x8010
[Mon Oct 23 12:51:04 2017] CPU: 2 PID: 1201816 Comm: kworker/u128:0 Not tainted 3.10.0-693.2.2.el7.x86_64 #1
[Mon Oct 23 12:51:04 2017] Hardware name: Dell Inc. PowerEdge R815/04Y8PT, BIOS 3.2.2 09/15/2014
[Mon Oct 23 12:51:04 2017] Workqueue: ipoib_wq ipoib_cm_tx_start [ib_ipoib]
[Mon Oct 23 12:51:04 2017]  0000000000008010 000000003d79fcfb ffff881031273818 ffffffff816a3db1
[Mon Oct 23 12:51:04 2017]  ffff8810312738a8 ffffffff81188810 0000000000000000 ffff88042ffdb000
[Mon Oct 23 12:51:04 2017]  0000000000000004 0000000000008010 ffff8810312738a8 000000003d79fcfb
[Mon Oct 23 12:51:04 2017] Call Trace:
[Mon Oct 23 12:51:04 2017]  [<ffffffff816a3db1>] dump_stack+0x19/0x1b
[Mon Oct 23 12:51:04 2017]  [<ffffffff81188810>] warn_alloc_failed+0x110/0x180
[Mon Oct 23 12:51:04 2017]  [<ffffffff8169fd8a>] __alloc_pages_slowpath+0x6b6/0x724
[Mon Oct 23 12:51:04 2017]  [<ffffffff8118cd85>] __alloc_pages_nodemask+0x405/0x420
[Mon Oct 23 12:51:04 2017]  [<ffffffff81030f8f>] dma_generic_alloc_coherent+0x8f/0x140
[Mon Oct 23 12:51:04 2017]  [<ffffffff81065c0d>] gart_alloc_coherent+0x2d/0x40
[Mon Oct 23 12:51:04 2017]  [<ffffffffc012e4d3>] mlx4_buf_direct_alloc.isra.6+0xd3/0x1a0 [mlx4_core]
[Mon Oct 23 12:51:04 2017]  [<ffffffffc012e76b>] mlx4_buf_alloc+0x1cb/0x240 [mlx4_core]
[Mon Oct 23 12:51:04 2017]  [<ffffffffc04dd85e>] create_qp_common.isra.31+0x62e/0x10d0 [mlx4_ib]
[Mon Oct 23 12:51:04 2017]  [<ffffffffc04de44e>] mlx4_ib_create_qp+0x14e/0x480 [mlx4_ib]
[Mon Oct 23 12:51:04 2017]  [<ffffffffc06df20c>] ? ipoib_cm_tx_init+0x5c/0x400 [ib_ipoib]
[Mon Oct 23 12:51:04 2017]  [<ffffffffc0639c3a>] ib_create_qp+0x7a/0x2f0 [ib_core]
[Mon Oct 23 12:51:04 2017]  [<ffffffffc06df2b3>] ipoib_cm_tx_init+0x103/0x400 [ib_ipoib]
[Mon Oct 23 12:51:04 2017]  [<ffffffffc06e1608>] ipoib_cm_tx_start+0x268/0x3f0 [ib_ipoib]
[Mon Oct 23 12:51:04 2017]  [<ffffffff810a881a>] process_one_work+0x17a/0x440
[Mon Oct 23 12:51:04 2017]  [<ffffffff810a94e6>] worker_thread+0x126/0x3c0
[Mon Oct 23 12:51:04 2017]  [<ffffffff810a93c0>] ? manage_workers.isra.24+0x2a0/0x2a0
[Mon Oct 23 12:51:04 2017]  [<ffffffff810b098f>] kthread+0xcf/0xe0
[Mon Oct 23 12:51:04 2017]  [<ffffffff810b08c0>] ? insert_kthread_work+0x40/0x40
[Mon Oct 23 12:51:04 2017]  [<ffffffff816b4f58>] ret_from_fork+0x58/0x90
[Mon Oct 23 12:51:04 2017]  [<ffffffff810b08c0>] ? insert_kthread_work+0x40/0x40
[Mon Oct 23 12:51:04 2017] Mem-Info:
[Mon Oct 23 12:51:04 2017] active_anon:2390093 inactive_anon:13567 isolated_anon:0
 active_file:11458273 inactive_file:17301026 isolated_file:0
 unevictable:23865 dirty:2205 writeback:19161 unstable:0
 slab_reclaimable:1154991 slab_unreclaimable:131607
 mapped:73010 shmem:23572 pagetables:29384 bounce:0
 free:158939 free_pcp:159 free_cma:0

Comment 18 lejeczek 2017-10-23 13:57:15 UTC
Ok,
if this is not specific to the hardware, I wonder if this is reproducible?
If it is, then it would be urgent.
The failures above in #17 seem to happen when RDMA in IPoIB and a gluster vol (in replica mode, if that matters) run across such a link, and then libvirt/qemu stores the qcow images that the guests run from.

Comment 19 lejeczek 2017-10-23 15:49:24 UTC
And it does not happen on the libvirt host but on the host that runs the IB subnet manager (if that helps).

I start a kvm guest and ~10 sec later:

[Mon Oct 23 16:43:32 2017] kworker/u128:2: page allocation failure: order:4, mode:0x8010
[Mon Oct 23 16:43:32 2017] CPU: 1 PID: 1343862 Comm: kworker/u128:2 Not tainted 3.10.0-693.2.2.el7.x86_64 #1
[Mon Oct 23 16:43:32 2017] Hardware name: Dell Inc. PowerEdge R815/04Y8PT, BIOS 3.2.2 09/15/2014
[Mon Oct 23 16:43:32 2017] Workqueue: ipoib_wq ipoib_cm_tx_start [ib_ipoib]
[Mon Oct 23 16:43:32 2017]  0000000000008010 00000000553c90b1 ffff880c1c6eb818 ffffffff816a3db1
[Mon Oct 23 16:43:32 2017]  ffff880c1c6eb8a8 ffffffff81188810 0000000000000000 ffff88042ffdb000
[Mon Oct 23 16:43:32 2017]  0000000000000004 0000000000008010 ffff880c1c6eb8a8 00000000553c90b1
[Mon Oct 23 16:43:32 2017] Call Trace:
[Mon Oct 23 16:43:32 2017]  [<ffffffff816a3db1>] dump_stack+0x19/0x1b
[Mon Oct 23 16:43:32 2017]  [<ffffffff81188810>] warn_alloc_failed+0x110/0x180
[Mon Oct 23 16:43:32 2017]  [<ffffffff8169fd8a>] __alloc_pages_slowpath+0x6b6/0x724
[Mon Oct 23 16:43:32 2017]  [<ffffffff8118cd85>] __alloc_pages_nodemask+0x405/0x420
[Mon Oct 23 16:43:32 2017]  [<ffffffff81030f8f>] dma_generic_alloc_coherent+0x8f/0x140
[Mon Oct 23 16:43:32 2017]  [<ffffffff81065c0d>] gart_alloc_coherent+0x2d/0x40
[Mon Oct 23 16:43:32 2017]  [<ffffffffc012e4d3>] mlx4_buf_direct_alloc.isra.6+0xd3/0x1a0 [mlx4_core]
[Mon Oct 23 16:43:32 2017]  [<ffffffffc012e76b>] mlx4_buf_alloc+0x1cb/0x240 [mlx4_core]
[Mon Oct 23 16:43:32 2017]  [<ffffffffc04dd85e>] create_qp_common.isra.31+0x62e/0x10d0 [mlx4_ib]
[Mon Oct 23 16:43:32 2017]  [<ffffffffc04de44e>] mlx4_ib_create_qp+0x14e/0x480 [mlx4_ib]
[Mon Oct 23 16:43:32 2017]  [<ffffffffc06df20c>] ? ipoib_cm_tx_init+0x5c/0x400 [ib_ipoib]
[Mon Oct 23 16:43:32 2017]  [<ffffffffc0639c3a>] ib_create_qp+0x7a/0x2f0 [ib_core]
[Mon Oct 23 16:43:32 2017]  [<ffffffffc06df2b3>] ipoib_cm_tx_init+0x103/0x400 [ib_ipoib]
[Mon Oct 23 16:43:32 2017]  [<ffffffffc06e1608>] ipoib_cm_tx_start+0x268/0x3f0 [ib_ipoib]
[Mon Oct 23 16:43:32 2017]  [<ffffffff810a881a>] process_one_work+0x17a/0x440
[Mon Oct 23 16:43:32 2017]  [<ffffffff810a94e6>] worker_thread+0x126/0x3c0
[Mon Oct 23 16:43:32 2017]  [<ffffffff810a93c0>] ? manage_workers.isra.24+0x2a0/0x2a0
[Mon Oct 23 16:43:32 2017]  [<ffffffff810b098f>] kthread+0xcf/0xe0
[Mon Oct 23 16:43:32 2017]  [<ffffffff810b08c0>] ? insert_kthread_work+0x40/0x40
[Mon Oct 23 16:43:32 2017]  [<ffffffff816b4f58>] ret_from_fork+0x58/0x90
[Mon Oct 23 16:43:32 2017]  [<ffffffff810b08c0>] ? insert_kthread_work+0x40/0x40
[Mon Oct 23 16:43:32 2017] Mem-Info:
[Mon Oct 23 16:43:32 2017] active_anon:2389656 inactive_anon:17792 isolated_anon:0
 active_file:14294829 inactive_file:14609973 isolated_file:0
 unevictable:24185 dirty:11846 writeback:9907 unstable:0
 slab_reclaimable:1024309 slab_unreclaimable:127961
 mapped:74895 shmem:28096 pagetables:30088 bounce:0
 free:142329 free_pcp:249 free_cma:0
[Mon Oct 23 16:43:32 2017] Node 0 DMA free:15320kB min:24kB low:28kB high:36kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15984kB managed:15900kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:64kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
[Mon Oct 23 16:43:32 2017] lowmem_reserve[]: 0 3054 15853 15853
[Mon Oct 23 16:43:32 2017] Node 0 DMA32 free:70776kB min:5344kB low:6680kB high:8016kB active_anon:348892kB inactive_anon:788kB active_file:1238956kB inactive_file:1245568kB unevictable:128kB isolated(anon):0kB isolated(file):0kB present:3381732kB managed:3129612kB mlocked:128kB dirty:988kB writeback:0kB mapped:5316kB shmem:904kB slab_reclaimable:130524kB slab_unreclaimable:27500kB kernel_stack:2720kB pagetables:4804kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[Mon Oct 23 16:43:32 2017] lowmem_reserve[]: 0 0 12798 12798
[Mon Oct 23 16:43:32 2017] Node 0 Normal free:42204kB min:22392kB low:27988kB high:33588kB active_anon:2029008kB inactive_anon:1464kB active_file:5031776kB inactive_file:5031796kB unevictable:184kB isolated(anon):0kB isolated(file):0kB present:13369344kB managed:13105676kB mlocked:184kB dirty:5076kB writeback:2920kB mapped:15232kB shmem:1636kB slab_reclaimable:627160kB slab_unreclaimable:49516kB kernel_stack:6192kB pagetables:19556kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[Mon Oct 23 16:43:32 2017] lowmem_reserve[]: 0 0 0 0
[Mon Oct 23 16:43:32 2017] Node 1 Normal free:54268kB min:28216kB low:35268kB high:42324kB active_anon:892572kB inactive_anon:5152kB active_file:7507692kB inactive_file:7564224kB unevictable:30460kB isolated(anon):0kB isolated(file):0kB present:16777216kB managed:16513752kB mlocked:30460kB dirty:5572kB writeback:6748kB mapped:24672kB shmem:13960kB slab_reclaimable:236700kB slab_unreclaimable:63856kB kernel_stack:4944kB pagetables:35108kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[Mon Oct 23 16:43:32 2017] lowmem_reserve[]: 0 0 0 0
[Mon Oct 23 16:43:32 2017] Node 2 Normal free:70176kB min:28216kB low:35268kB high:42324kB active_anon:433084kB inactive_anon:28416kB active_file:7454452kB inactive_file:8063268kB unevictable:500kB isolated(anon):0kB isolated(file):0kB present:16777216kB managed:16513752kB mlocked:500kB dirty:7036kB writeback:5808kB mapped:48424kB shmem:29264kB slab_reclaimable:305152kB slab_unreclaimable:34748kB kernel_stack:2992kB pagetables:12412kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[Mon Oct 23 16:43:32 2017] lowmem_reserve[]: 0 0 0 0
[Mon Oct 23 16:43:32 2017] Node 3 Normal free:56032kB min:28216kB low:35268kB high:42324kB active_anon:1742132kB inactive_anon:7440kB active_file:6872352kB inactive_file:7278428kB unevictable:432kB isolated(anon):0kB isolated(file):0kB present:16777216kB managed:16513748kB mlocked:432kB dirty:4020kB writeback:4808kB mapped:43540kB shmem:7816kB slab_reclaimable:323600kB slab_unreclaimable:88564kB kernel_stack:5088kB pagetables:11252kB unstable:0kB bounce:0kB free_pcp:116kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[Mon Oct 23 16:43:32 2017] lowmem_reserve[]: 0 0 0 0
[Mon Oct 23 16:43:32 2017] Node 4 Normal free:62396kB min:28216kB low:35268kB high:42324kB active_anon:449844kB inactive_anon:1540kB active_file:7761004kB inactive_file:7807132kB unevictable:1900kB isolated(anon):0kB isolated(file):0kB present:16777216kB managed:16513752kB mlocked:1900kB dirty:7772kB writeback:5344kB mapped:38548kB shmem:1696kB slab_reclaimable:281812kB slab_unreclaimable:33564kB kernel_stack:6064kB pagetables:7596kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[Mon Oct 23 16:43:32 2017] lowmem_reserve[]: 0 0 0 0
[Mon Oct 23 16:43:32 2017] Node 5 Normal free:67528kB min:28216kB low:35268kB high:42324kB active_anon:453252kB inactive_anon:11184kB active_file:7702536kB inactive_file:7753640kB unevictable:2160kB isolated(anon):0kB isolated(file):0kB present:16777216kB managed:16513752kB mlocked:2160kB dirty:4192kB writeback:4468kB mapped:50780kB shmem:11408kB slab_reclaimable:383980kB slab_unreclaimable:31732kB kernel_stack:1376kB pagetables:6112kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[Mon Oct 23 16:43:32 2017] lowmem_reserve[]: 0 0 0 0
[Mon Oct 23 16:43:32 2017] Node 6 Normal free:63364kB min:28216kB low:35268kB high:42324kB active_anon:2323384kB inactive_anon:6884kB active_file:6143308kB inactive_file:6204420kB unevictable:36660kB isolated(anon):0kB isolated(file):0kB present:16777216kB managed:16513752kB mlocked:36660kB dirty:6472kB writeback:3932kB mapped:45724kB shmem:36700kB slab_reclaimable:1510980kB slab_unreclaimable:57204kB kernel_stack:2096kB pagetables:8264kB unstable:0kB bounce:0kB free_pcp:124kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[Mon Oct 23 16:43:32 2017] lowmem_reserve[]: 0 0 0 0
[Mon Oct 23 16:43:32 2017] Node 7 Normal free:67812kB min:28188kB low:35232kB high:42280kB active_anon:885896kB inactive_anon:8300kB active_file:7467240kB inactive_file:7491416kB unevictable:24316kB isolated(anon):0kB isolated(file):0kB present:16760832kB managed:16497308kB mlocked:24316kB dirty:6256kB writeback:5600kB mapped:27344kB shmem:8344kB slab_reclaimable:297328kB slab_unreclaimable:125096kB kernel_stack:8928kB pagetables:15248kB unstable:0kB bounce:0kB free_pcp:824kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[Mon Oct 23 16:43:32 2017] lowmem_reserve[]: 0 0 0 0
[Mon Oct 23 16:43:32 2017] Node 0 DMA: 0*4kB 1*8kB (U) 1*16kB (U) 0*32kB 1*64kB (U) 1*128kB (U) 1*256kB (U) 1*512kB (U) 0*1024kB 1*2048kB (M) 3*4096kB (M) = 15320kB
[Mon Oct 23 16:43:32 2017] Node 0 DMA32: 8167*4kB (UEM) 4085*8kB (UEM) 277*16kB (UEM) 26*32kB (UM) 1*64kB (U) 0*128kB 1*256kB (M) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 70932kB
[Mon Oct 23 16:43:32 2017] Node 0 Normal: 9685*4kB (UEM) 428*8kB (UEM) 58*16kB (UM) 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 43092kB
[Mon Oct 23 16:43:32 2017] Node 1 Normal: 4961*4kB (UEM) 2977*8kB (UEM) 756*16kB (UM) 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 55756kB
[Mon Oct 23 16:43:32 2017] Node 2 Normal: 15226*4kB (UEM) 1341*8kB (UEM) 5*16kB (UM) 1*32kB (U) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 71744kB
[Mon Oct 23 16:43:32 2017] Node 3 Normal: 3278*4kB (UEM) 2200*8kB (UEM) 1113*16kB (UEM) 282*32kB (UEM) 1*64kB (U) 1*128kB (U) 1*256kB (M) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 57992kB
[Mon Oct 23 16:43:32 2017] Node 4 Normal: 6713*4kB (UEM) 3355*8kB (UEM) 608*16kB (UM) 15*32kB (UM) 3*64kB (UM) 1*128kB (M) 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 64220kB
[Mon Oct 23 16:43:32 2017] Node 5 Normal: 8039*4kB (UEM) 4611*8kB (UEM) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 69044kB
[Mon Oct 23 16:43:32 2017] Node 6 Normal: 12841*4kB (UEM) 1626*8kB (UM) 1*16kB (U) 1*32kB (U) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 64420kB
[Mon Oct 23 16:43:32 2017] Node 7 Normal: 12563*4kB (UEM) 1081*8kB (UEM) 653*16kB (EM) 12*32kB (UM) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 69732kB
[Mon Oct 23 16:43:32 2017] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[Mon Oct 23 16:43:32 2017] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[Mon Oct 23 16:43:32 2017] Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[Mon Oct 23 16:43:32 2017] Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[Mon Oct 23 16:43:32 2017] Node 2 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[Mon Oct 23 16:43:32 2017] Node 2 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[Mon Oct 23 16:43:32 2017] Node 3 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[Mon Oct 23 16:43:32 2017] Node 3 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[Mon Oct 23 16:43:32 2017] Node 4 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[Mon Oct 23 16:43:32 2017] Node 4 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[Mon Oct 23 16:43:32 2017] Node 5 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[Mon Oct 23 16:43:32 2017] Node 5 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[Mon Oct 23 16:43:32 2017] Node 6 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[Mon Oct 23 16:43:32 2017] Node 6 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[Mon Oct 23 16:43:32 2017] Node 7 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[Mon Oct 23 16:43:32 2017] Node 7 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[Mon Oct 23 16:43:32 2017] 28935510 total pagecache pages
[Mon Oct 23 16:43:32 2017] 0 pages in swap cache
[Mon Oct 23 16:43:32 2017] Swap cache stats: add 0, delete 0, find 0/0
[Mon Oct 23 16:43:32 2017] Free swap  = 0kB
[Mon Oct 23 16:43:32 2017] Total swap = 0kB
[Mon Oct 23 16:43:32 2017] 33547797 pages RAM
[Mon Oct 23 16:43:32 2017] 0 pages HighMem/MovableOnly
[Mon Oct 23 16:43:32 2017] 590046 pages reserved


Should I make it a separate bug report?

Comment 20 Doug Ledford 2017-10-25 13:48:22 UTC
(In reply to lejeczek from comment #0)
> Description of problem:
> 
> I don't know if it is kernel or userspace, but I do:
> 
> $ ibportstate 1 1 off
> Initial CA PortInfo:
> # Port info: Lid 1 port 1
> LinkState:.......................Active
> PhysLinkState:...................LinkUp
> Lid:.............................1
> SMLid:...........................1
> LMC:.............................0
> LinkWidthSupported:..............1X or 4X
> LinkWidthEnabled:................1X or 4X
> LinkWidthActive:.................4X
> LinkSpeedSupported:..............2.5 Gbps
> LinkSpeedEnabled:................2.5 Gbps
> LinkSpeedActive:.................2.5 Gbps
> Mkey:............................<not displayed>
> MkeyLeasePeriod:.................0
> ProtectBits:.....................0
> Disable may be irreversible
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^

The program you are using is even telling you that what you are doing may not be fixable.


> then I do:
> 
> $ ibportstate 1 1 on
> ibwarn: [74878] _do_madrpc: recv failed: Connection timed out
> ibwarn: [74878] mad_rpc: _do_madrpc failed; dport (Lid 1)
> ibportstate: iberror: failed: smp query nodeinfo failed

And just like it warned you about, you put yourself into an unfixable position.  Your link to the subnet manager is severed and re-enabling the port requires access to the subnet manager.

So, the initial issue in this bug is "NOTABUG".

Comment 21 Doug Ledford 2017-10-25 13:49:44 UTC
(In reply to lejeczek from comment #9)
> On 'off' port, reboot brings it back, but that's not even a workaround, too
> ugly to be.

It's exactly what you have to do if you cut yourself off from the subnet manager.  Lack of understanding of the consequences of your actions when playing with a new fabric is understandable, but it doesn't mean there is a bug.

Comment 22 Doug Ledford 2017-10-25 13:55:39 UTC
(In reply to lejeczek from comment #10)
> My latest(probably last for now) observation.
> With this "working" config, with Link layer: InfiniBand, having set up nmcli
> c infiniband type and simply testing(I realize that some tunning might help):
> 
> $ iperf -c 10.5.4.100
> ------------------------------------------------------------
> Client connecting to 10.5.4.100, TCP port 5001
> TCP window size: 2.50 MByte (default)
> ------------------------------------------------------------
> [  3] local 10.5.4.49 port 53276 connected with 10.5.4.100 port 5001
> [ ID] Interval       Transfer     Bandwidth
> [  3]  0.0-10.2 sec   532 MBytes   439 Mbits/sec
> 
> That looks very wrong, no? I'm far from being kernel expert but something
> smells fishy to me.
> 
> Rest of the HW stack(it it may help) - to say it's not a no-name rag, maybe
> worth passing onto Dell as HCAs sit in PE r815s.

You are running on Dell PE R815s?  I ask because those may be big beefy machines, but they are also old, AMD Opteron based machines, yes?  If so, then this speed is not out of line.  You are doing a single threaded IP test where all packets must be copied from user space to kernel space by the CPU.  Because the test is TCP, it is necessarily serialized, and so you don't get any benefit from multiple CPUs.  You are CPU bound on whatever CPU the test is running on.  If you download sockstream from github and use it in multi-port UDP mode and lock the send/recv threads for each port to different CPUs, your performance will go up.  And likewise, in real world conditions with multiple processes and multiple TCP connections in use, your performance will go up.  But you will be suffering from manual copies of all data going over the IPoIB interface until the IPoIB accelerator patches make it into a Red Hat kernel, which considering they are just now landing upstream, will be no sooner than 7.5, and maybe later.

If you want to see performance without IPoIB as the bottleneck, use the ib_send_bw and friends from the perftest package or qperf.  Those are RDMA aware tests and will exercise your actual RDMA offload capability.  Then, in your actual use of the cards, if you can setup kernel protocols such as iSER or SRP, then at least you can share disks from one machine to others and get RDMA performance instead of IPoIB performance.
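
A minimal sketch of such a test with the perftest package, assuming the device name shown by ibstat above; run the first command on one node and the second on the other:

$ ib_send_bw -d mlx4_0 -i 1
$ ib_send_bw -d mlx4_0 -i 1 10.5.4.100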

Comment 23 Doug Ledford 2017-10-25 13:58:04 UTC
(In reply to lejeczek from comment #16)
> @ #15 ok, 
> What I fiddled with yesterday revealed similar - forcing port mode to eht by
> echoing "eth" to sysfs port(it would not work straightforward, if I remember
> if first echoed to port2 and then to port1 then card's(as this model
> apparently has to have both ports in the same mode) ports would switch mode.
> But even though above method would put ports in "eth", ports status with
> ibstat would still be disabled, which was "link-up" and fine just before
> when was in "infiniband".
> 
> I'd not mind to stick with "infiniband" and IPoIB but, that iperf test
> miserable results.
> Are such results expected, surely not? Where is that near 10Gbs performance?

You can't do ethernet back to back with these cards (well, maybe you can if you download and run the lldp daemon and configure one of them to act as an lldp server).  The cards expect the switch to tell them about the link using the lldp protocol.  Two cards back to back are like two employees and no boss: nothing works.  When in InfiniBand mode, the subnet manager is said boss.
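
If anyone does want to experiment with that, a hedged sketch using the lldpad package (the interface name is a placeholder; this was not tested in this report):

$ yum install lldpad
$ systemctl start lldpad
$ lldptool set-lldp -i eth2 adminStatus=rxtx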

Comment 24 lejeczek 2017-10-25 14:04:47 UTC
I've stuck with IPoIB since soon after I found out the other bits could not work, so it's been a while.
With: iperf -c 10.5.4.100 -P 16 it looks a bit better, ~5Gbps.

However, whether it's hardware (purely IPoIB) or the kernel, or a combination of both, I get the segfaults shown in my earlier comments above.

I've filed bug #1506252.

Comment 25 Doug Ledford 2017-10-25 14:09:15 UTC
(In reply to lejeczek from comment #19)
> And it does not happen on the libvirt host but on the host that runs IB
> subnet manager(if it helps)
> 
> I start a kvm quest and ~10 sec later:
> 
> [Mon Oct 23 16:43:32 2017] kworker/u128:2: page allocation failure: order:4,
> mode:0x8010

These are warnings, nothing more.  The machine should continue on after this just fine.  They are a result of highly fragmented memory on the system combined with trying to use IPoIB in connected mode.  When IPoIB is in connected mode, and it needs to start communicating with a new host, it tries to allocate a new queue pair.  Depending on the queue depth you have configured for IPoIB connected mode queue pairs, the overall allocation can be quite large.  An order 4 allocation is an attempt to find 64K of contiguous free physical memory (I think 64k is right, it might be off a bit).  It can't find it, so the allocation fails, and when that happens, IPoIB falls back to datagram mode for that host and continues on.
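
For reference, the IPoIB mode can be inspected and switched per interface through sysfs; a sketch, assuming the interface is named ib0:

$ cat /sys/class/net/ib0/mode
$ echo datagram > /sys/class/net/ib0/mode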


> [Mon Oct 23 16:43:32 2017] Node 0 Normal: 9685*4kB (UEM) 428*8kB (UEM)
> 58*16kB (UM) 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB
> 0*4096kB = 43092kB
> [Mon Oct 23 16:43:32 2017] Node 1 Normal: 4961*4kB (UEM) 2977*8kB (UEM)
> 756*16kB (UM) 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB
> 0*4096kB = 55756kB
> [Mon Oct 23 16:43:32 2017] Node 2 Normal: 15226*4kB (UEM) 1341*8kB (UEM)
> 5*16kB (UM) 1*32kB (U) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB
> 0*4096kB = 71744kB
> [Mon Oct 23 16:43:32 2017] Node 3 Normal: 3278*4kB (UEM) 2200*8kB (UEM)
> 1113*16kB (UEM) 282*32kB (UEM) 1*64kB (U) 1*128kB (U) 1*256kB (M) 0*512kB
> 0*1024kB 0*2048kB 0*4096kB = 57992kB
> [Mon Oct 23 16:43:32 2017] Node 4 Normal: 6713*4kB (UEM) 3355*8kB (UEM)
> 608*16kB (UM) 15*32kB (UM) 3*64kB (UM) 1*128kB (M) 0*256kB 0*512kB 0*1024kB
> 0*2048kB 0*4096kB = 64220kB
> [Mon Oct 23 16:43:32 2017] Node 5 Normal: 8039*4kB (UEM) 4611*8kB (UEM)
> 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB =
> 69044kB
> [Mon Oct 23 16:43:32 2017] Node 6 Normal: 12841*4kB (UEM) 1626*8kB (UM)
> 1*16kB (U) 1*32kB (U) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB
> 0*4096kB = 64420kB
> [Mon Oct 23 16:43:32 2017] Node 7 Normal: 12563*4kB (UEM) 1081*8kB (UEM)
> 653*16kB (EM) 12*32kB (UM) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB
> 0*4096kB = 69732kB

All of your nodes are running very low on large chunks of contiguous memory.

> [Mon Oct 23 16:43:32 2017] 0 pages in swap cache
> [Mon Oct 23 16:43:32 2017] Swap cache stats: add 0, delete 0, find 0/0
> [Mon Oct 23 16:43:32 2017] Free swap  = 0kB
> [Mon Oct 23 16:43:32 2017] Total swap = 0kB

You should enable swap.  Even just a small amount, say 10GB.  It allows the kernel to shuffle pages out and back in and rearrange memory for more contiguous pages in the process.  It also might help to configure some of your memory as hugepages.
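
A minimal sketch of both suggestions (the sizes are arbitrary examples, not recommendations from this report):

$ dd if=/dev/zero of=/swapfile bs=1M count=10240
$ chmod 600 /swapfile
$ mkswap /swapfile
$ swapon /swapfile
$ sysctl vm.nr_hugepages=512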

Comment 26 Doug Ledford 2017-10-25 14:11:05 UTC
(In reply to lejeczek from comment #18)
> Ok,
> if this is not specific to a hardware I wonder if this is reproducible?
> If is then would be urgent.
> Above, #17, failures seem that happen when rdma in IPoIB and a gluster
> vol(in replica mode if that may matter)

Gluster has RDMA support last I knew, it should be able to run using the IB link without using IPoIB.  How is your setup configured?

> runs across such a link, and then
> libvirt quemu stores qcow images that guests run.

How are you accessing these images?
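
For context on the RDMA transport question above, a hedged sketch of creating a Gluster volume with the rdma transport (hostnames and brick paths are made up):

$ gluster volume create gv0 replica 2 transport rdma host1:/bricks/gv0 host2:/bricks/gv0
$ gluster volume start gv0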

Comment 27 lejeczek 2017-11-03 17:02:46 UTC
IOMMU - is it possible to have it ON somehow? With tweaking/tuning perhaps?

If I just have it on, the kernel keeps spitting:

AMD-Vi: Event logged [IO_PAGE_FAULT ...]

OS boots ok but then:

mlx4_core ... device is going to be reset
mlx4_core .... internal error detected...
ib_srpt receiving unrecognized IB event 8
ib_srpt disabling MAD processing failed.
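
Not an answer given in this report, but a commonly tried alternative to disabling the IOMMU outright is passthrough mode; a sketch:

$ grubby --update-kernel=ALL --args="iommu=pt"
$ reboot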

Comment 30 RHEL Program Management 2021-01-15 07:43:27 UTC
After evaluating this issue, there are no plans to address it further or fix it in an upcoming release.  Therefore, it is being closed.  If plans change such that this issue will be fixed in an upcoming release, then the bug can be reopened.

Comment 31 Yuri Arabadji 2021-09-06 15:29:50 UTC
The same thing is happening on QLogic adapters. I wanted to force the link width to 4x instead of the 1x that had been selected by default, and wanted to trigger a port "reset" with an off/on sequence and, lo and behold, the port went into the down state with no way of bringing it back up.

IB specs have been designed by morons, that's the only answer.

