Description of problem:
When a SriovNetworkNodePolicy is created, the VFs cannot be initialized. The worker node was found not to have been rebooted.

cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-3.10.0-1062.4.3.el7.x86_64 root=/dev/mapper/rhel_dell--per740--14-root ro crashkernel=auto rd.lvm.lv=rhel_dell-per740-14/root rd.lvm.lv=rhel_dell-per740-14/swap rhgb quiet LANG=en_US.UTF-8

Version-Release number of selected component (if applicable):
quay.io/openshift-release-dev/ocp-v4.0-art-dev:v4.3.0-201911150628-ose-sriov-network-operator

How reproducible:
always

Steps to Reproduce:
1. Set up a cluster and add one RHEL worker to it.
2. Create one SriovNetworkNodePolicy CR:

apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: mlx277rhel-netdevice
  namespace: openshift-sriov-network-operator
spec:
  mtu: 1500
  nicSelector:
    pfNames:
      - p2p1
    rootDevices:
      - '0000:60:00.0'
    vendor: '15b3'
  nodeSelector:
    feature.node.kubernetes.io/sriov-capable-rhel: 'true'
  numVfs: 2
  resourceName: mlx277ndrhel

3. Check the logs of the device plugin (dp).
4. Check the output of `cat /proc/cmdline` on the worker.

Actual results:
oc logs sriov-device-plugin-r76zx
I1120 10:03:48.154783 20 manager.go:70] Using Kubelet Plugin Registry Mode
I1120 10:03:48.154983 20 main.go:44] resource manager reading configs
I1120 10:03:48.155585 20 manager.go:98] ResourceList: [{ResourceName:intelnetdevicerhel IsRdma:false Selectors:{Vendors:[8086] Devices:[] Drivers:[iavf mlx5_core i40evf ixgbevf] PfNames:[p1p1] LinkTypes:[]}} {ResourceName:mlx277ndrhel IsRdma:false Selectors:{Vendors:[15b3] Devices:[] Drivers:[iavf mlx5_core i40evf ixgbevf] PfNames:[p2p1] LinkTypes:[]}}]
I1120 10:03:48.155750 20 main.go:60] Discovering host network devices
I1120 10:03:48.155776 20 manager.go:179] discovering host network devices
I1120 10:03:48.238914 20 manager.go:209] discoverDevices(): device found: 0000:18:00.0 02 Intel Corporation I350 Gigabit Network Connection
I1120 10:03:48.239622 20 manager.go:259] excluding interface em1: default route found: {Ifindex: 3 Dst: <nil> Src: <nil> Gw: 10.73.131.254 Flags: [] Table: 254}
I1120 10:03:48.239706 20 manager.go:209] discoverDevices(): device found: 0000:18:00.1 02 Intel Corporation I350 Gigabit Network Connection
I1120 10:03:48.240264 20 manager.go:279] em2 added to linkWatchList
I1120 10:03:48.240812 20 manager.go:209] discoverDevices(): device found: 0000:18:00.2 02 Intel Corporation I350 Gigabit Network Connection
I1120 10:03:48.241354 20 manager.go:279] em3 added to linkWatchList
I1120 10:03:48.241875 20 manager.go:209] discoverDevices(): device found: 0000:18:00.3 02 Intel Corporation I350 Gigabit Network Connection
I1120 10:03:48.242368 20 manager.go:279] em4 added to linkWatchList
I1120 10:03:48.242831 20 manager.go:209] discoverDevices(): device found: 0000:3b:00.0 02 Intel Corporation Ethernet Controller XXV710 for 25GbE ...
I1120 10:03:48.243297 20 manager.go:279] p1p1 added to linkWatchList
I1120 10:03:48.243610 20 manager.go:209] discoverDevices(): device found: 0000:3b:00.1 02 Intel Corporation Ethernet Controller XXV710 for 25GbE ...
I1120 10:03:48.243941 20 manager.go:279] p1p2 added to linkWatchList
I1120 10:03:48.244262 20 manager.go:209] discoverDevices(): device found: 0000:5e:00.0 02 Mellanox Technolo... MT27800 Family [ConnectX-5]
I1120 10:03:48.244605 20 manager.go:279] p3p1 added to linkWatchList
I1120 10:03:48.245312 20 manager.go:209] discoverDevices(): device found: 0000:5e:00.1 02 Mellanox Technolo... MT27800 Family [ConnectX-5]
I1120 10:03:48.245647 20 manager.go:279] p3p2 added to linkWatchList
I1120 10:03:48.246424 20 manager.go:209] discoverDevices(): device found: 0000:60:00.0 02 Mellanox Technolo... MT27710 Family [ConnectX-4 Lx]
I1120 10:03:48.246773 20 manager.go:279] p2p1 added to linkWatchList
I1120 10:03:48.247769 20 manager.go:209] discoverDevices(): device found: 0000:60:00.1 02 Mellanox Technolo... MT27710 Family [ConnectX-4 Lx]
I1120 10:03:48.248279 20 manager.go:279] p2p2 added to linkWatchList
I1120 10:03:48.249195 20 main.go:66] Initializing resource servers
I1120 10:03:48.249205 20 manager.go:108] number of config: 2
I1120 10:03:48.249211 20 manager.go:111]
I1120 10:03:48.249215 20 manager.go:112] Creating new ResourcePool: intelnetdevicerhel
I1120 10:03:48.249239 20 manager.go:125] New resource server is created for intelnetdevicerhel ResourcePool
I1120 10:03:48.249244 20 manager.go:111]
I1120 10:03:48.249247 20 manager.go:112] Creating new ResourcePool: mlx277ndrhel
I1120 10:03:48.249265 20 factory.go:144] device added: [pciAddr: 0000:60:00.0, vendor: 15b3, device: 1015, driver: mlx5_core]
I1120 10:03:48.249273 20 manager.go:125] New resource server is created for mlx277ndrhel ResourcePool
I1120 10:03:48.249278 20 main.go:72] Starting all servers...
I1120 10:03:48.249350 20 server.go:190] starting intelnetdevicerhel device plugin endpoint at: intelnetdevicerhel.sock
I1120 10:03:48.250282 20 server.go:216] intelnetdevicerhel device plugin endpoint started serving
I1120 10:03:48.250449 20 server.go:190] starting mlx277ndrhel device plugin endpoint at: mlx277ndrhel.sock
I1120 10:03:48.251256 20 server.go:216] mlx277ndrhel device plugin endpoint started serving
I1120 10:03:48.251314 20 main.go:77] All servers started.
I1120 10:03:48.251332 20 main.go:78] Listening for term signals
I1120 10:03:50.058938 20 server.go:105] Plugin: mlx277ndrhel.sock gets registered successfully at Kubelet
I1120 10:03:50.059082 20 server.go:130] ListAndWatch(mlx277ndrhel) invoked
I1120 10:03:50.059116 20 server.go:138] ListAndWatch(mlx277ndrhel): send devices &ListAndWatchResponse{Devices:[]*Device{&Device{ID:0000:60:00.0,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:0,},},},},},}
I1120 10:03:50.387395 20 server.go:105] Plugin: intelnetdevicerhel.sock gets registered successfully at Kubelet
I1120 10:03:50.387507 20 server.go:130] ListAndWatch(intelnetdevicerhel) invoked
I1120 10:03:50.387530 20 server.go:138] ListAndWatch(intelnetdevicerhel): send devices &ListAndWatchResponse{Devices:[]*Device{},}
I1120 10:03:51.388680 20 server.go:105] Plugin: mlx277ndrhel.sock gets registered successfully at Kubelet
I1120 10:03:51.388710 20 server.go:130] ListAndWatch(mlx277ndrhel) invoked
I1120 10:03:51.388738 20 server.go:138] ListAndWatch(mlx277ndrhel): send devices &ListAndWatchResponse{Devices:[]*Device{&Device{ID:0000:60:00.0,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:0,},},},},},}

Expected results:

Additional info:
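The manual checks in the report (inspecting /proc/cmdline and confirming whether VFs exist on the PF) could be scripted roughly as follows. This is only a sketch under stated assumptions: the report does not say which kernel arguments were expected after the reboot, so the `intel_iommu=on` and `iommu=pt` defaults below are an assumption, and the helper names `missing_kernel_args` and `configured_vfs` are hypothetical. The `sriov_numvfs` file is the standard Linux sysfs attribute for a PF's current VF count.

```python
from pathlib import Path
from typing import Optional, Sequence

# ASSUMPTION: the arguments an SR-IOV-ready node is expected to boot with.
# They are not stated in this report; adjust the tuple for your platform.
EXPECTED_ARGS = ("intel_iommu=on", "iommu=pt")


def missing_kernel_args(cmdline: str,
                        expected: Sequence[str] = EXPECTED_ARGS) -> list:
    """Return the expected arguments that are absent from a kernel command line."""
    present = set(cmdline.split())
    return [arg for arg in expected if arg not in present]


def configured_vfs(pci_addr: str,
                   sysfs_root: str = "/sys/bus/pci/devices") -> Optional[int]:
    """Read sriov_numvfs for a PF; return None if it cannot be read or parsed."""
    path = Path(sysfs_root) / pci_addr / "sriov_numvfs"
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None
```

On the worker, one might call `missing_kernel_args(Path("/proc/cmdline").read_text())` and `configured_vfs("0000:60:00.0")`. A non-empty list from the first call, or a VF count other than the policy's `numVfs: 2` from the second, would be consistent with the device plugin log above, where only the PF 0000:60:00.0 is advertised for mlx277ndrhel.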
I cannot reproduce it in my environment.
@zhaozhanqi can you help Peng Liu work out how to reproduce this? Or has it been resolved by a fix in the latest builds? Thanks
OK, I will try to reproduce this issue again. My guess is that I created a SriovNetworkNodePolicy CR that did not fully match, and that this caused the issue.
This is not reproducible in the dev environment. @zhaozhanqi, does this failure always happen on the RHEL worker node?
For now, the cluster cannot be installed due to this bug: https://bugzilla.redhat.com/show_bug.cgi?id=1779222
I will try again when the new cluster is set up.
Created attachment 1643653: config daemon logs
@zhaozhanqi please also verify this bug once the fix for BZ#1781718 is available.
PR https://github.com/openshift/sriov-network-operator/pull/137
Verified this bug on quay.io/openshift-release-dev/ocp-v4.0-art-dev:v4.3.0-201912111117-ose-sriov-network-operator using the steps in comment 6.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:0062