Bug 1774451 - [sriov] VF cannot be init in rhel worker
Summary: [sriov] VF cannot be init in rhel worker
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.3.0
Hardware: All
OS: All
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.3.0
Assignee: Peng Liu
QA Contact: zhaozhanqi
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-11-20 10:30 UTC by zhaozhanqi
Modified: 2020-01-23 11:13 UTC
CC: 2 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-01-23 11:13:20 UTC
Target Upstream Version:
Embargoed:


Attachments
config daemon logs (1.96 MB, text/plain)
2019-12-10 14:07 UTC, zhaozhanqi


Links
GitHub openshift/sriov-network-operator pull 137 (closed): Bug 1782147: [release-4.3] Block config daemon until initializing status of SriovNetworkNodeStat… (last updated 2020-06-10 04:35:43 UTC)
Red Hat Product Errata RHBA-2020:0062 (last updated 2020-01-23 11:13:36 UTC)

Description zhaozhanqi 2019-11-20 10:30:11 UTC
Description of problem:
When the SriovNetworkNodePolicy is created, the VFs are not initialized; the worker was found not to have been rebooted.

cat /proc/cmdline 
BOOT_IMAGE=/vmlinuz-3.10.0-1062.4.3.el7.x86_64 root=/dev/mapper/rhel_dell--per740--14-root ro crashkernel=auto rd.lvm.lv=rhel_dell-per740-14/root rd.lvm.lv=rhel_dell-per740-14/swap rhgb quiet LANG=en_US.UTF-8
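
A direct node-side check, as a minimal sketch (the interface name p2p1 and root device 0000:60:00.0 are taken from the policy in the steps below; the paths are standard Linux sysfs):

# Number of VFs the kernel has actually created on the PF; should read 2 once the policy applies
cat /sys/class/net/p2p1/device/sriov_numvfs
# Maximum number of VFs the NIC supports
cat /sys/class/net/p2p1/device/sriov_totalvfs
# Created VFs also appear as virtfn* links under the PF's PCI device
ls /sys/bus/pci/devices/0000:60:00.0/ | grep virtfn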


Version-Release number of selected component (if applicable):
 
quay.io/openshift-release-dev/ocp-v4.0-art-dev:v4.3.0-201911150628-ose-sriov-network-operator

How reproducible:
always

Steps to Reproduce:
1. setup cluster and add one rhel worker into cluster
2. Create one SriovNetworkNodePolicy CR
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: mlx277rhel-netdevice
  namespace: openshift-sriov-network-operator
spec:
  mtu: 1500
  nicSelector:
    pfNames:
      - p2p1
    rootDevices:
      - '0000:60:00.0'
    vendor: '15b3'
  nodeSelector:
    feature.node.kubernetes.io/sriov-capable-rhel: 'true'
  numVfs: 2
  resourceName: mlx277ndrhel
3. Check the logs of the device plugin (dp)
4. Check `cat /proc/cmdline` on the worker (see also the operator-side checks sketched below)
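
Operator-side state can be checked as well; a minimal sketch, assuming the stock SriovNetworkNodeState CR that the operator maintains per node (<rhel-worker> is a placeholder):

oc -n openshift-sriov-network-operator get sriovnetworknodestates
oc -n openshift-sriov-network-operator get sriovnetworknodestate <rhel-worker> -o yaml
# status.syncStatus should reach Succeeded and status.interfaces should list the configured VFs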


Actual results:

 oc logs sriov-device-plugin-r76zx
I1120 10:03:48.154783      20 manager.go:70] Using Kubelet Plugin Registry Mode
I1120 10:03:48.154983      20 main.go:44] resource manager reading configs
I1120 10:03:48.155585      20 manager.go:98] ResourceList: [{ResourceName:intelnetdevicerhel IsRdma:false Selectors:{Vendors:[8086] Devices:[] Drivers:[iavf mlx5_core i40evf ixgbevf] PfNames:[p1p1] LinkTypes:[]}} {ResourceName:mlx277ndrhel IsRdma:false Selectors:{Vendors:[15b3] Devices:[] Drivers:[iavf mlx5_core i40evf ixgbevf] PfNames:[p2p1] LinkTypes:[]}}]
I1120 10:03:48.155750      20 main.go:60] Discovering host network devices
I1120 10:03:48.155776      20 manager.go:179] discovering host network devices
I1120 10:03:48.238914      20 manager.go:209] discoverDevices(): device found: 0000:18:00.0	02          	Intel Corporation   	I350 Gigabit Network Connection         
I1120 10:03:48.239622      20 manager.go:259] excluding interface em1:  default route found: {Ifindex: 3 Dst: <nil> Src: <nil> Gw: 10.73.131.254 Flags: [] Table: 254}
I1120 10:03:48.239706      20 manager.go:209] discoverDevices(): device found: 0000:18:00.1	02          	Intel Corporation   	I350 Gigabit Network Connection         
I1120 10:03:48.240264      20 manager.go:279] em2 added to linkWatchList
I1120 10:03:48.240812      20 manager.go:209] discoverDevices(): device found: 0000:18:00.2	02          	Intel Corporation   	I350 Gigabit Network Connection         
I1120 10:03:48.241354      20 manager.go:279] em3 added to linkWatchList
I1120 10:03:48.241875      20 manager.go:209] discoverDevices(): device found: 0000:18:00.3	02          	Intel Corporation   	I350 Gigabit Network Connection         
I1120 10:03:48.242368      20 manager.go:279] em4 added to linkWatchList
I1120 10:03:48.242831      20 manager.go:209] discoverDevices(): device found: 0000:3b:00.0	02          	Intel Corporation   	Ethernet Controller XXV710 for 25GbE ...
I1120 10:03:48.243297      20 manager.go:279] p1p1 added to linkWatchList
I1120 10:03:48.243610      20 manager.go:209] discoverDevices(): device found: 0000:3b:00.1	02          	Intel Corporation   	Ethernet Controller XXV710 for 25GbE ...
I1120 10:03:48.243941      20 manager.go:279] p1p2 added to linkWatchList
I1120 10:03:48.244262      20 manager.go:209] discoverDevices(): device found: 0000:5e:00.0	02          	Mellanox Technolo...	MT27800 Family [ConnectX-5]             
I1120 10:03:48.244605      20 manager.go:279] p3p1 added to linkWatchList
I1120 10:03:48.245312      20 manager.go:209] discoverDevices(): device found: 0000:5e:00.1	02          	Mellanox Technolo...	MT27800 Family [ConnectX-5]             
I1120 10:03:48.245647      20 manager.go:279] p3p2 added to linkWatchList
I1120 10:03:48.246424      20 manager.go:209] discoverDevices(): device found: 0000:60:00.0	02          	Mellanox Technolo...	MT27710 Family [ConnectX-4 Lx]          
I1120 10:03:48.246773      20 manager.go:279] p2p1 added to linkWatchList
I1120 10:03:48.247769      20 manager.go:209] discoverDevices(): device found: 0000:60:00.1	02          	Mellanox Technolo...	MT27710 Family [ConnectX-4 Lx]          
I1120 10:03:48.248279      20 manager.go:279] p2p2 added to linkWatchList
I1120 10:03:48.249195      20 main.go:66] Initializing resource servers
I1120 10:03:48.249205      20 manager.go:108] number of config: 2
I1120 10:03:48.249211      20 manager.go:111] 
I1120 10:03:48.249215      20 manager.go:112] Creating new ResourcePool: intelnetdevicerhel
I1120 10:03:48.249239      20 manager.go:125] New resource server is created for intelnetdevicerhel ResourcePool
I1120 10:03:48.249244      20 manager.go:111] 
I1120 10:03:48.249247      20 manager.go:112] Creating new ResourcePool: mlx277ndrhel
I1120 10:03:48.249265      20 factory.go:144] device added: [pciAddr: 0000:60:00.0, vendor: 15b3, device: 1015, driver: mlx5_core]
I1120 10:03:48.249273      20 manager.go:125] New resource server is created for mlx277ndrhel ResourcePool
I1120 10:03:48.249278      20 main.go:72] Starting all servers...
I1120 10:03:48.249350      20 server.go:190] starting intelnetdevicerhel device plugin endpoint at: intelnetdevicerhel.sock
I1120 10:03:48.250282      20 server.go:216] intelnetdevicerhel device plugin endpoint started serving
I1120 10:03:48.250449      20 server.go:190] starting mlx277ndrhel device plugin endpoint at: mlx277ndrhel.sock
I1120 10:03:48.251256      20 server.go:216] mlx277ndrhel device plugin endpoint started serving
I1120 10:03:48.251314      20 main.go:77] All servers started.
I1120 10:03:48.251332      20 main.go:78] Listening for term signals
I1120 10:03:50.058938      20 server.go:105] Plugin: mlx277ndrhel.sock gets registered successfully at Kubelet
I1120 10:03:50.059082      20 server.go:130] ListAndWatch(mlx277ndrhel) invoked
I1120 10:03:50.059116      20 server.go:138] ListAndWatch(mlx277ndrhel): send devices &ListAndWatchResponse{Devices:[]*Device{&Device{ID:0000:60:00.0,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:0,},},},},},}
I1120 10:03:50.387395      20 server.go:105] Plugin: intelnetdevicerhel.sock gets registered successfully at Kubelet
I1120 10:03:50.387507      20 server.go:130] ListAndWatch(intelnetdevicerhel) invoked
I1120 10:03:50.387530      20 server.go:138] ListAndWatch(intelnetdevicerhel): send devices &ListAndWatchResponse{Devices:[]*Device{},}
I1120 10:03:51.388680      20 server.go:105] Plugin: mlx277ndrhel.sock gets registered successfully at Kubelet
I1120 10:03:51.388710      20 server.go:130] ListAndWatch(mlx277ndrhel) invoked
I1120 10:03:51.388738      20 server.go:138] ListAndWatch(mlx277ndrhel): send devices &ListAndWatchResponse{Devices:[]*Device{&Device{ID:0000:60:00.0,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:0,},},},},},}
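
For completeness, the absence of VFs can be confirmed on the worker itself; a minimal sketch (the interface name comes from the policy above):

# With VFs present, 'vf 0' and 'vf 1' lines are listed under the PF
ip link show p2p1
# VFs would also show up as separate PCI functions
lspci -nn | grep -i 'Virtual Function'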

Expected results:

The 2 VFs requested by the policy are initialized on p2p1 and advertised by the mlx277ndrhel device pool; the worker is rebooted so the policy takes effect.

Additional info:

Comment 1 Peng Liu 2019-11-26 04:58:59 UTC
I cannot reproduce it in my environment.

Comment 2 Ben Bennett 2019-11-26 13:47:41 UTC
@zhaozhanqi can you help Peng Liu work out how to reproduce this?  Or has it been resolved by a fix in the latest builds?  Thanks

Comment 3 zhaozhanqi 2019-11-27 09:08:42 UTC
OK, I will try to reproduce this issue again. I suspect I created a SriovNetworkNodePolicy CR that did not fully match the node's devices, and that triggered this issue.
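
If a non-matching policy is indeed the trigger, the rendered device-plugin selectors are worth capturing; a sketch, assuming the operator publishes them in a per-node device-plugin-config ConfigMap (the ConfigMap name is an assumption):

oc -n openshift-sriov-network-operator get cm device-plugin-config -o yaml
# compare the selectors for each resource against the devices actually present on the node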

Comment 4 zenghui.shi 2019-12-06 04:03:55 UTC
This is not reproducible in the dev environment.
@zhaozhanqi, does this failure always happen on RHEL worker nodes?

Comment 5 zhaozhanqi 2019-12-06 04:08:19 UTC
For now the cluster cannot be installed due to this bug: https://bugzilla.redhat.com/show_bug.cgi?id=1779222

I will try again once the new cluster is set up.

Comment 7 zhaozhanqi 2019-12-10 14:07:05 UTC
Created attachment 1643653 [details]
config daemon logs

Comment 8 Peng Liu 2019-12-10 15:03:56 UTC
@zhaozhanqi please also verify this bug once the fix for BZ#1781718 is in.

Comment 11 zhaozhanqi 2019-12-13 05:07:59 UTC
Verified this bug on quay.io/openshift-release-dev/ocp-v4.0-art-dev:v4.3.0-201912111117-ose-sriov-network-operator
using the steps from comment 6
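
For reference, a minimal sketch of the verification checks (the openshift.io resource prefix is the operator default; <rhel-worker> is a placeholder):

# the node should advertise the 2 VFs as an allocatable extended resource
oc get node <rhel-worker> -o jsonpath='{.status.allocatable.openshift\.io/mlx277ndrhel}'
# and on the worker, after the config daemon has rebooted it:
cat /sys/class/net/p2p1/device/sriov_numvfs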

Comment 13 errata-xmlrpc 2020-01-23 11:13:20 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0062

