Bug 1303904 - Failed to provision rdma-qe-xx machines to RHEL-6.8-20160125.0
Status: CLOSED NOTABUG
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: rdma
Version: 6.7
Hardware: x86_64 Linux
Priority: urgent
Severity: urgent
Target Milestone: rc
Target Release: ---
Assigned To: Doug Ledford
QA Contact: zguo
Keywords: TestBlocker
Depends On:
Blocks:
Reported: 2016-02-02 06:49 EST by zguo
Modified: 2016-07-04 21:55 EDT

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-02-03 00:09:11 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
zguo: needinfo-


Description zguo 2016-02-02 06:49:05 EST
Description of problem:

I'm not sure what causes this. It may be an RDMA network configuration issue. I guessed at the component; please correct it if it is wrong.

Version-Release number of selected component (if applicable):
RHEL-6.8-20160125.0

How reproducible:
Always

Steps to Reproduce:
1. Submit Beaker jobs to rdma-qe-xx machines. All of them abort.
The testing jobs I submitted:
https://beaker.engineering.redhat.com/jobs/1208450
https://beaker.engineering.redhat.com/jobs/1208449
https://beaker.engineering.redhat.com/jobs/1208448
https://beaker.engineering.redhat.com/jobs/1208447

2. Submit beaker jobs to rdma-dev-xx machines
Some jobs passed, and some aborted.
Passed: 
https://beaker.engineering.redhat.com/jobs/1210043
https://beaker.engineering.redhat.com/jobs/1210042
Failed:
https://beaker.engineering.redhat.com/jobs/1210044
https://beaker.engineering.redhat.com/jobs/1210048

Actual results:
Jobs aborted.

Expected results:
rdma-qe-xx machines can be provisioned to RHEL-6.8-20160125.0 successfully

Additional info:
1. rdma-qe-xx machines can be provisioned to RHEL-6.7 successfully
Comment 3 zguo 2016-02-02 21:46:50 EST
Hi Honggang or other developers who are available,

Could you please help take a look at this issue which is blocking our testing?

Thanks
Zhaojuan
Comment 4 Honggang LI 2016-02-02 22:18:56 EST
I will check this issue this afternoon.
Comment 5 Honggang LI 2016-02-02 22:50:00 EST
This should be a tg3 Ethernet device driver issue. All rdma-qe-xx machines are connected to the Beaker network via a tg3 NIC. The tg3 device never comes up after the post-installation reboot. As a result, the Beaker jobs time out because the rdma-qe-xx machines can't detect the Beaker server's heartbeat.

[file.bos.redhat.com] [10:43:18 PM]
[honli@file machines]$ grep tg3 rdma*
rdma-dev-10:Create_Interface tg3_1 Ethernet yes hwaddr 2c:59:e5:9a:2a:20 dhcp defroute
rdma-dev-10:Create_Interface tg3_2 Ethernet no hwaddr 2c:59:e5:9a:2a:21
rdma-dev-11:Create_Interface tg3_1 Ethernet yes hwaddr 2c:59:e5:9a:2a:84 dhcp defroute
rdma-dev-11:Create_Interface tg3_2 Ethernet no hwaddr 2c:59:e5:9a:2a:85
rdma-dev-12:Create_Interface tg3_1 Ethernet yes hwaddr 2c:59:e5:9a:3d:a4 dhcp defroute
rdma-dev-12:Create_Interface tg3_2 Ethernet no hwaddr 2c:59:e5:9a:3d:a5
rdma-dev-13:Create_Interface tg3_1 Ethernet yes hwaddr 2c:59:e5:9a:23:3c dhcp defroute
rdma-dev-13:Create_Interface tg3_2 Ethernet no hwaddr 2c:59:e5:9a:23:3d
rdma-dev-14:Create_Interface tg3_1 Ethernet yes hwaddr 40:f2:e9:5c:51:1c bridge lab-bridge
rdma-dev-14:Create_Interface tg3_2 Ethernet no hwaddr 40:f2:e9:5c:51:1d
rdma-dev-14:Create_Interface tg3_3 Ethernet no hwaddr 40:f2:e9:5c:51:1e
rdma-dev-14:Create_Interface tg3_4 Ethernet no hwaddr 40:f2:e9:5c:51:1f
rdma-dev-15:Create_Interface tg3_1 Ethernet yes hwaddr 44:a8:42:2b:ab:4f dhcp defroute
rdma-dev-15:Create_Interface tg3_2 Ethernet no hwaddr 44:a8:42:2b:ab:50
rdma-dev-15:Create_Interface tg3_3 Ethernet no hwaddr 44:a8:42:2b:ab:51
rdma-dev-15:Create_Interface tg3_4 Ethernet no hwaddr 44:a8:42:2b:ab:52
rdma-dev-16:Create_Interface tg3_1 Ethernet yes hwaddr 44:a8:42:2b:b2:9d dhcp defroute
rdma-dev-16:Create_Interface tg3_2 Ethernet no hwaddr 44:a8:42:2b:b2:9e
rdma-dev-16:Create_Interface tg3_3 Ethernet no hwaddr 44:a8:42:2b:b2:9f
rdma-dev-16:Create_Interface tg3_4 Ethernet no hwaddr 44:a8:42:2b:b2:a0
rdma-master:Create_Interface tg3_1 Ethernet yes hwaddr e0:db:55:0b:b4:a8 bridge lab-bridge
rdma-master:Create_Interface tg3_2 Ethernet no hwaddr e0:db:55:0b:b4:a9
rdma-master:Create_Interface tg3_3 Ethernet no hwaddr e0:db:55:0b:b4:aa
rdma-master:Create_Interface tg3_4 Ethernet no hwaddr e0:db:55:0b:b4:ab
rdma-perf-00:Create_Interface tg3_1 Ethernet yes hwaddr d8:9d:67:14:1e:f8 bridge lab-bridge
rdma-perf-00:Create_Interface tg3_2 Ethernet no hwaddr d8:9d:67:14:1e:f9
rdma-perf-00:Create_Interface tg3_3 Ethernet no hwaddr d8:9d:67:14:1e:fa
rdma-perf-00:Create_Interface tg3_4 Ethernet no hwaddr d8:9d:67:14:1e:fb
rdma-perf-01:Create_Interface tg3_1 Ethernet yes hwaddr d8:9d:67:14:6c:6c bridge lab-bridge
rdma-perf-01:Create_Interface tg3_2 Ethernet no hwaddr d8:9d:67:14:6c:6d
rdma-perf-01:Create_Interface tg3_3 Ethernet no hwaddr d8:9d:67:14:6c:6e
rdma-perf-01:Create_Interface tg3_4 Ethernet no hwaddr d8:9d:67:14:6c:6f
rdma-perf-02:Create_Interface tg3_1 Ethernet yes hwaddr d8:9d:67:13:c8:80 bridge lab-bridge
rdma-perf-02:Create_Interface tg3_2 Ethernet no hwaddr d8:9d:67:13:c8:81
rdma-perf-02:Create_Interface tg3_3 Ethernet no hwaddr d8:9d:67:13:c8:82
rdma-perf-02:Create_Interface tg3_4 Ethernet no hwaddr d8:9d:67:13:c8:83
rdma-perf-03:Create_Interface tg3_1 Ethernet yes hwaddr d8:9d:67:14:87:8c bridge lab-bridge
rdma-perf-03:Create_Interface tg3_2 Ethernet no hwaddr d8:9d:67:14:87:8d
rdma-perf-03:Create_Interface tg3_3 Ethernet no hwaddr d8:9d:67:14:87:8e
rdma-perf-03:Create_Interface tg3_4 Ethernet no hwaddr d8:9d:67:14:87:8f
rdma-qe-02:Create_Interface tg3_1 Ethernet yes hwaddr 40:a8:f0:75:ff:68 dhcp defroute
rdma-qe-02:Create_Interface tg3_2 Ethernet no hwaddr 40:a8:f0:75:ff:69
rdma-qe-03:Create_Interface tg3_1 Ethernet yes hwaddr 9c:b6:54:bb:4a:90 dhcp defroute
rdma-qe-03:Create_Interface tg3_2 Ethernet no hwaddr 9c:b6:54:bb:4a:91
rdma-qe-04:Create_Interface tg3_1 Ethernet yes hwaddr 9c:b6:54:bb:48:84 dhcp defroute
rdma-qe-04:Create_Interface tg3_2 Ethernet no hwaddr 9c:b6:54:bb:48:85
rdma-qe-05:Create_Interface tg3_1 Ethernet yes hwaddr 9c:b6:54:bb:79:6c dhcp defroute
rdma-qe-05:Create_Interface tg3_2 Ethernet no hwaddr 9c:b6:54:bb:79:6d
rdma-qe-06:Create_Interface tg3_1 Ethernet yes hwaddr 2c:59:e5:9a:21:24 dhcp defroute
rdma-qe-06:Create_Interface tg3_2 Ethernet no hwaddr 2c:59:e5:9a:21:25
rdma-qe-07:Create_Interface tg3_1 Ethernet yes hwaddr 2c:59:e5:9a:27:0c dhcp defroute
rdma-qe-07:Create_Interface tg3_2 Ethernet no hwaddr 2c:59:e5:9a:27:0d
rdma-qe-08:Create_Interface tg3_1 Ethernet yes hwaddr 2c:59:e5:9a:3d:e0 dhcp defroute
rdma-qe-08:Create_Interface tg3_2 Ethernet no hwaddr 2c:59:e5:9a:3d:e1
rdma-qe-09:Create_Interface tg3_1 Ethernet yes hwaddr 2c:59:e5:9a:3b:ec dhcp defroute
rdma-qe-09:Create_Interface tg3_2 Ethernet no hwaddr 2c:59:e5:9a:3b:ed
rdma-qe-10:Create_Interface tg3_1 Ethernet yes hwaddr 2c:59:e5:9a:21:18 dhcp defroute
rdma-qe-10:Create_Interface tg3_2 Ethernet no hwaddr 2c:59:e5:9a:21:19
rdma-qe-11:Create_Interface tg3_1 Ethernet yes hwaddr 2c:59:e5:9a:3b:d4 dhcp defroute
rdma-qe-11:Create_Interface tg3_2 Ethernet no hwaddr 2c:59:e5:9a:3b:d5
rdma-qe-12:Create_Interface tg3_1 Ethernet yes hwaddr 34:64:a9:95:c9:3c dhcp defroute
rdma-qe-12:Create_Interface tg3_2 Ethernet no hwaddr 34:64:a9:95:c9:3d
rdma-qe-13:Create_Interface tg3_1 Ethernet yes hwaddr 40:a8:f0:75:fc:18 dhcp defroute
rdma-qe-13:Create_Interface tg3_2 Ethernet no hwaddr 40:a8:f0:75:fc:19
rdma-qe-14:Create_Interface tg3_1 Ethernet yes hwaddr 44:a8:42:2b:af:30 dhcp defroute
rdma-qe-14:Create_Interface tg3_2 Ethernet no hwaddr 44:a8:42:2b:af:31
rdma-qe-14:Create_Interface tg3_3 Ethernet no hwaddr 44:a8:42:2b:af:32
rdma-qe-14:Create_Interface tg3_4 Ethernet no hwaddr 44:a8:42:2b:af:33
rdma-qe-15:Create_Interface tg3_1 Ethernet yes hwaddr 44:a8:42:2b:b0:34 dhcp defroute
rdma-qe-15:Create_Interface tg3_2 Ethernet no hwaddr 44:a8:42:2b:b0:35
rdma-qe-15:Create_Interface tg3_3 Ethernet no hwaddr 44:a8:42:2b:b0:36
rdma-qe-15:Create_Interface tg3_4 Ethernet no hwaddr 44:a8:42:2b:b0:37
rdma-storage-02:Create_Interface tg3_1 Ethernet yes hwaddr 54:9f:35:0c:24:70 dhcp defroute
rdma-storage-02:Create_Interface tg3_2 Ethernet no hwaddr 54:9f:35:0c:24:71
rdma-storage-02:Create_Interface tg3_3 Ethernet no hwaddr 54:9f:35:0c:24:72
rdma-storage-02:Create_Interface tg3_4 Ethernet no hwaddr 54:9f:35:0c:24:73
rdma-storage-03:Create_Interface tg3_1 Ethernet yes hwaddr 54:9f:35:0c:1b:74 dhcp defroute
rdma-storage-03:Create_Interface tg3_2 Ethernet no hwaddr 54:9f:35:0c:1b:75
rdma-storage-03:Create_Interface tg3_3 Ethernet no hwaddr 54:9f:35:0c:1b:76
rdma-storage-03:Create_Interface tg3_4 Ethernet no hwaddr 54:9f:35:0c:1b:77
rdma-storage-04:Create_Interface tg3_1 Ethernet yes hwaddr 54:9f:35:0c:2a:94 dhcp defroute
rdma-storage-04:Create_Interface tg3_2 Ethernet no hwaddr 54:9f:35:0c:2a:95
rdma-storage-04:Create_Interface tg3_3 Ethernet no hwaddr 54:9f:35:0c:2a:96
rdma-storage-04:Create_Interface tg3_4 Ethernet no hwaddr 54:9f:35:0c:2a:97
rdma-virt-00:Create_Interface tg3_1 Ethernet yes hwaddr 44:a8:42:01:24:04 bridge lab-bridge
rdma-virt-00:Create_Interface tg3_2 Ethernet no hwaddr 44:a8:42:01:24:05
rdma-virt-00:Create_Interface tg3_3 Ethernet no hwaddr 44:a8:42:01:24:06
rdma-virt-00:Create_Interface tg3_4 Ethernet no hwaddr 44:a8:42:01:24:07
rdma-virt-01:Create_Interface tg3_1 Ethernet yes hwaddr 44:a8:42:01:31:d9 bridge lab-bridge
rdma-virt-01:Create_Interface tg3_2 Ethernet no hwaddr 44:a8:42:01:31:da
rdma-virt-01:Create_Interface tg3_3 Ethernet no hwaddr 44:a8:42:01:31:db
rdma-virt-01:Create_Interface tg3_4 Ethernet no hwaddr 44:a8:42:01:31:dc
rdma-virt-02:Create_Interface tg3_1 Ethernet yes hwaddr 44:a8:42:01:1b:a2 bridge lab-bridge
rdma-virt-02:Create_Interface tg3_2 Ethernet no hwaddr 44:a8:42:01:1b:a3
rdma-virt-02:Create_Interface tg3_3 Ethernet no hwaddr 44:a8:42:01:1b:a4
rdma-virt-02:Create_Interface tg3_4 Ethernet no hwaddr 44:a8:42:01:1b:a5
rdma-virt-03:Create_Interface tg3_1 Ethernet yes hwaddr 44:a8:42:01:22:29 bridge lab-bridge
rdma-virt-03:Create_Interface tg3_2 Ethernet no hwaddr 44:a8:42:01:22:2a
rdma-virt-03:Create_Interface tg3_3 Ethernet no hwaddr 44:a8:42:01:22:2b
rdma-virt-03:Create_Interface tg3_4 Ethernet no hwaddr 44:a8:42:01:22:2c
[file.bos.redhat.com] [10:43:26 PM]
[honli@file machines]$
Comment 6 Honggang LI 2016-02-03 00:08:47 EST
The buggy Create_Interface function in "rdma-function.sh" only updates udev rules for InfiniBand/IPoIB interfaces and ignores Ethernet interfaces. The default /etc/udev/rules.d/70-persistent-net.rules file names the tg3 interfaces ethX, so the network service failed to bring up the tg3_X interfaces because udev never renamed them. I will fix the issue ASAP.
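For illustration, the kind of update described above could look roughly like the following. This is a minimal sketch, not the actual rdma-functions.sh code: the function name rename_udev_rule and its arguments are hypothetical, and it assumes the rules file uses the usual RHEL 6 one-line-per-device format with an ATTR{address}=="<mac>" match and a NAME="<name>" assignment.

```shell
#!/bin/bash
# Hypothetical helper for illustration; NOT the real rdma-functions.sh code.
# Points the persistent-net rule for one MAC address at a new interface name.
rename_udev_rule() {
    local rules_file=$1 mac=$2 new_name=$3 tmp rc
    # udev rules store MAC addresses in lower case.
    mac=$(printf '%s' "$mac" | tr 'A-F' 'a-f')
    tmp=$(mktemp) || return 1
    # Only touch the line whose ATTR{address} matches; rewrite its NAME="..." field.
    sed "/ATTR{address}==\"$mac\"/ s/NAME=\"[^\"]*\"/NAME=\"$new_name\"/" \
        "$rules_file" > "$tmp" && cat "$tmp" > "$rules_file"
    rc=$?
    rm -f "$tmp"
    return $rc
}
```

Under those assumptions, a call such as rename_udev_rule /etc/udev/rules.d/70-persistent-net.rules 2c:59:e5:9a:2a:20 tg3_1 would leave every other rule untouched and change only the name assigned to that one MAC address.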
Comment 9 Doug Ledford 2016-02-03 11:33:26 EST
(In reply to Honggang LI from comment #6)
> Buggy Create_Interface function in "rdma-function.sh" only updates udev
> rules for InfiniBand/IPoIB interfaces. It ignores Enthernet interfaces. The
> default /etc/udev/rules.d/70-persistent-net.rules file names the tg3
> interfaces as ethX. So, network service failed to up the tg3_X interfaces as
> they did not rename by udev. I will fix the issue ASAP.

This didn't used to be necessary.  Just deleting the original device config files and creating new ones would override the previous device names stored in the udev rules file.  Has this changed with the latest rhel6 then?
Comment 10 Honggang LI 2016-02-03 11:42:09 EST
(In reply to Doug Ledford from comment #9) 
> This didn't used to be necessary.  Just deleting the original device config

The old script, without my update, works for RHEL-6.7 (and the older 6.x releases). I suspect my F23 provisioning jobs may have timed out because of this too. I have not run Fedora on the rdma cluster, so I'm not sure.

> files and creating new ones would override the previous device names stored
> in the udev rules file.  Has this changed with the latest rhel6 then?

Not sure. The weird thing is that machines with bnx Ethernet NICs work with RHEL-6.8-20160125.0; only machines with tg3 NICs failed.
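To compare the tg3 and bnx machines, it may help to confirm which driver is actually bound to each interface. A minimal sketch, assuming the standard sysfs layout where /sys/class/net/<iface>/device/driver is a symlink to the driver directory; the function name iface_driver and the optional second argument (an alternate sysfs root, used only for testing) are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: print the kernel driver bound to a network interface.
# $1 = interface name, $2 = sysfs net directory (defaults to /sys/class/net).
iface_driver() {
    local iface=$1 sys_root=${2:-/sys/class/net} link
    # The "driver" entry is a symlink whose last path component is the driver name.
    link=$(readlink "$sys_root/$iface/device/driver") || return 1
    basename "$link"
}
```

On a live machine, iface_driver eth0 would print the driver name for eth0, which makes it easy to see whether a failing host uses tg3 or one of the bnx drivers.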
Comment 11 Don Dutile 2016-02-03 17:54:11 EST
(In reply to Honggang LI from comment #10)
> (In reply to Doug Ledford from comment #9) 
> > This didn't used to be necessary.  Just deleting the original device config
> 
> The old script without my update works for RHEL-6.7 (and the older 6.x
> distros). I suspect my F23 provision jobs timed out may because it too. I
> did not play Fedora over the rdma cluster, so I'm not sure.
> 
> > files and creating new ones would override the previous device names stored
> > in the udev rules file.  Has this changed with the latest rhel6 then?
> 
> Not sure. And the weird thing is that machines with bnx Ethernet NICs work
> with RHEL-6.8-20160125.0, only machines with tg3 NIC failed.

Is this script unique to our test machines?
Or is it a pkg change?
Comment 12 Doug Ledford 2016-02-03 18:17:15 EST
(In reply to Don Dutile from comment #11)
> (In reply to Honggang LI from comment #10)
> > (In reply to Doug Ledford from comment #9) 
> > > This didn't used to be necessary.  Just deleting the original device config
> > 
> > The old script without my update works for RHEL-6.7 (and the older 6.x
> > distros). I suspect my F23 provision jobs timed out may because it too. I
> > did not play Fedora over the rdma cluster, so I'm not sure.
> > 
> > > files and creating new ones would override the previous device names stored
> > > in the udev rules file.  Has this changed with the latest rhel6 then?
> > 
> > Not sure. And the weird thing is that machines with bnx Ethernet NICs work
> > with RHEL-6.8-20160125.0, only machines with tg3 NIC failed.
> 
> Is this a unique script to our test machines?

It's one of the primary function routines in rdma-functions.sh that gets installed on all the rdma-* machines.

> a pkg change?

Maybe.  We recently changed all rhel6 installs from using NetworkManager (which would read the device name from the ifcfg-* file and set the device name appropriately) back to using the old SysV init network package.  This was due to the rhel6 NetworkManager not supporting vlans or maybe pkeys, I can't remember off the top of my head.  One of the recent updates we made to the default network interface setup was not supported by the rhel6 NetworkManager but was supported by the network service, and so we switched back.  With that comes a concurrent change in how the device naming is done.  The SysV init network script will attempt to change a device name, but I think it defers to the udev persistent device rules.  The SysV init network scripts for InfiniBand devices are actually part of the rhel6 rdma package, and they will change the IB device name from rhel6.0 through about 6.5 I think; after that they switched to using a udev rule.  So the matrix of when we used udev rules and when we relied on the network service to rename the device is complex in the rhel6 lifetime.  I don't have an explanation for why it would have changed recently, nor why it would have only affected the tg3 devices and not the bnx devices.  That makes no sense and makes me think that Honggang's change might have papered over the issue, but we don't really know what the true root cause of the issue was.
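One way to see whether a machine is in the state described in this thread (the ifcfg-* file asks for one device name while the udev persistent rules assign another) is to compare the two by MAC address. The following is only a diagnostic sketch under stated assumptions: the function name check_name_mismatch is hypothetical, and it assumes each ifcfg file carries DEVICE= and HWADDR= lines and that the rules file uses ATTR{address}=="<mac>" with NAME="<name>".

```shell
#!/bin/bash
# Hypothetical diagnostic: report interfaces whose ifcfg DEVICE name differs
# from the NAME that the udev persistent-net rules assign to the same MAC.
# $1 = directory holding ifcfg-* files, $2 = udev rules file.
check_name_mismatch() {
    local cfg_dir=$1 rules_file=$2 cfg dev mac udev_name
    for cfg in "$cfg_dir"/ifcfg-*; do
        [ -f "$cfg" ] || continue
        dev=$(sed -n 's/^DEVICE=//p' "$cfg" | tr -d '"')
        mac=$(sed -n 's/^HWADDR=//p' "$cfg" | tr -d '"' | tr 'A-F' 'a-f')
        # Skip files without both fields (for example ifcfg-lo).
        [ -n "$dev" ] && [ -n "$mac" ] || continue
        # Pull the NAME value from the rule that matches this MAC, if any.
        udev_name=$(sed -n "/ATTR{address}==\"$mac\"/ s/.*NAME=\"\([^\"]*\)\".*/\1/p" "$rules_file")
        if [ -n "$udev_name" ] && [ "$udev_name" != "$dev" ]; then
            echo "$dev ($mac): udev rule names it $udev_name"
        fi
    done
}
```

Running check_name_mismatch /etc/sysconfig/network-scripts /etc/udev/rules.d/70-persistent-net.rules on an affected host would be expected to list the tg3_X interfaces if the rules file still calls them ethX; it does not explain why the bnx machines behave differently.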
Comment 13 Honggang LI 2016-02-04 03:23:43 EST
(In reply to Doug Ledford from comment #12)

> Maybe.  We recently changed all rhel6 installs from using NetworkManager
> (which would read the device name from the ifcfg-* file and set the device
> name appropriately) back to using the old SysV init network package.  This
> was due to the rhel6 NetworkManager not supporting vlans or maybe pkeys,
> can't remember off the top of my head, but one of the recent updates we made

https://bugzilla.redhat.com/show_bug.cgi?id=1276030
https://bugzilla.redhat.com/show_bug.cgi?id=1284115
Comment 14 Don Dutile 2016-02-04 10:28:54 EST
(In reply to Honggang LI from comment #13)
> (In reply to Doug Ledford from comment #12)
> 
> > Maybe.  We recently changed all rhel6 installs from using NetworkManager
> > (which would read the device name from the ifcfg-* file and set the device
> > name appropriately) back to using the old SysV init network package.  This
> > was due to the rhel6 NetworkManager not supporting vlans or maybe pkeys,
> > can't remember off the top of my head, but one of the recent updates we made
> 
> https://bugzilla.redhat.com/show_bug.cgi?id=1276030
> https://bugzilla.redhat.com/show_bug.cgi?id=1284115
There are two patches for 1284115.
Have they been tried?  Should we get NM patched for 6.8, plus an additional doc update for no-NM support for IB VLANs?
Comment 15 Honggang LI 2016-02-13 21:56:19 EST
(In reply to Don Dutile from comment #14)
> There are two patches for 1284115.
> Have they been tried ?   Should we get NM patchesd for 6.8 & additional doc
> update for no-NM support for ib-vlan's ?

Zhaojuan, please test the patches for bz1284115.
