Bug 1960263

Summary: SR-IOV obliviously reboot the node
Product: OpenShift Container Platform Reporter: Peng Liu <pliu>
Component: Networking Assignee: Peng Liu <pliu>
Networking sub component: SR-IOV QA Contact: zhaozhanqi <zzhao>
Status: CLOSED CURRENTRELEASE Docs Contact:
Severity: high    
Priority: high CC: dosmith, keyoung, vlaad, zzhao
Version: 4.6 Keywords: Reopened
Target Milestone: ---   
Target Release: 4.6.z   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 1960103 Environment:
Last Closed: 2022-08-26 14:18:24 UTC Type: ---
Bug Depends On: 1960103    
Bug Blocks:    

Comment 4 zhaozhanqi 2021-06-07 10:52:12 UTC
Verified this bug on 4.6.0-202106032244

Comment 6 errata-xmlrpc 2021-06-15 19:30:17 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6.34 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:2267

Comment 7 Peng Liu 2021-06-17 05:33:16 UTC
One more patch needs to be backported. It fixes the scenario where a custom MCP (MachineConfigPool) is created.

Comment 9 zhaozhanqi 2021-06-28 02:34:43 UTC

Verified this bug on 4.6.0-202106232234

Create the following two YAML files at the same time:

# cat 1g-mc.yaml intel-dpdk.yaml 
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
    labels:
        machineconfiguration.openshift.io/role: worker
    name: 50-kargs-1g-hugepages
spec:
    kernelArguments:
        - default_hugepagesz=1G
        - hugepagesz=1G
        - hugepages=4 
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: intel-dpdk
  namespace: openshift-sriov-network-operator
spec:
  deviceType: vfio-pci
  mtu: 1700
  nicSelector:
    deviceID: "158b"
    pfNames:
      - ens1f1
    rootDevices:
      - '0000:3b:00.1'
    vendor: '8086'
  nodeSelector:
    feature.node.kubernetes.io/sriov-capable: 'true'
  numVfs: 2
  priority: 99
  resourceName: inteldpdk

# oc logs sriov-network-config-daemon-klpbk | grep MCP
I0628 02:03:05.797224  265835 daemon.go:754] drainNode():MCP worker is not ready: [{RenderDegraded False 2021-06-25 09:39:42 +0000 UTC  } {NodeDegraded False 2021-06-25 09:39:47 +0000 UTC  } {Degraded False 2021-06-25 09:39:47 +0000 UTC  } {Updated False 2021-06-28 02:02:24 +0000 UTC  } {Updating True 2021-06-28 02:02:24 +0000 UTC  All nodes are updating to rendered-worker-d2dd550696cbfafc253011805efcfe77}], wait...
I0628 02:03:35.801191  265835 daemon.go:754] drainNode():MCP worker is not ready: [{RenderDegraded False 2021-06-25 09:39:42 +0000 UTC  } {NodeDegraded False 2021-06-25 09:39:47 +0000 UTC  } {Degraded False 2021-06-25 09:39:47 +0000 UTC  } {Updated False 2021-06-28 02:02:24 +0000 UTC  } {Updating True 2021-06-28 02:02:24 +0000 UTC  All nodes are updating to rendered-worker-d2dd550696cbfafc253011805efcfe77}], wait...
I0628 02:04:05.802850  265835 daemon.go:754] drainNode():MCP worker is not ready: [{RenderDegraded False 2021-06-25 09:39:42 +0000 UTC  } {NodeDegraded False 2021-06-25 09:39:47 +0000 UTC  } {Degraded False 2021-06-25 09:39:47 +0000 UTC  } {Updated False 2021-06-28 02:02:24 +0000 UTC  } {Updating True 2021-06-28 02:02:24 +0000 UTC  All nodes are updating to rendered-worker-d2dd550696cbfafc253011805efcfe77}], wait...
I0628 02:09:41.253031    3422 daemon.go:486] completeDrain(): resume MCP worker
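
The "MCP worker is not ready" lines above print the pool's status conditions as a bracketed list. A minimal Python sketch of reading one such line follows. The regular expression and the helper names `parse_mcp_conditions` and `pool_is_settled` are illustrative assumptions, not part of the SR-IOV operator, and the notion of "settled" used here (Updated=True, Updating=False, nothing degraded) is a simplification rather than the daemon's actual check.

```python
import re

# One log line copied from the output above.
SAMPLE_LINE = (
    "I0628 02:03:05.797224  265835 daemon.go:754] drainNode():MCP worker is not ready: "
    "[{RenderDegraded False 2021-06-25 09:39:42 +0000 UTC  } "
    "{NodeDegraded False 2021-06-25 09:39:47 +0000 UTC  } "
    "{Degraded False 2021-06-25 09:39:47 +0000 UTC  } "
    "{Updated False 2021-06-28 02:02:24 +0000 UTC  } "
    "{Updating True 2021-06-28 02:02:24 +0000 UTC  "
    "All nodes are updating to rendered-worker-d2dd550696cbfafc253011805efcfe77}], wait..."
)

# Assumed layout of each entry: {Type Status <timestamp> +0000 UTC <reason/message>}
COND_RE = re.compile(
    r"\{(?P<type>\w+) (?P<status>True|False|Unknown) "
    r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} \+0000 UTC(?P<rest>[^}]*)\}"
)


def parse_mcp_conditions(line):
    """Return {condition type: status string} for one log line."""
    return {m.group("type"): m.group("status") for m in COND_RE.finditer(line)}


def pool_is_settled(conditions):
    """Simplified readiness test (an assumption, not the daemon's own logic)."""
    degraded = any(
        status == "True"
        for ctype, status in conditions.items()
        if ctype.endswith("Degraded")
    )
    return (
        conditions.get("Updated") == "True"
        and conditions.get("Updating") == "False"
        and not degraded
    )


if __name__ == "__main__":
    conds = parse_mcp_conditions(SAMPLE_LINE)
    print(conds)
    print("settled:", pool_is_settled(conds))
```

For the sample line the pool reports Updating=True and Updated=False, which is consistent with the daemon logging "wait..." instead of draining the node.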



# oc describe node dell-per740-14.rhts.eng.pek2.redhat.com 
Name:               dell-per740-14.rhts.eng.pek2.redhat.com
Roles:              worker
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    feature.node.kubernetes.io/sriov-capable=true
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=dell-per740-14.rhts.eng.pek2.redhat.com
                    kubernetes.io/os=linux
                    node-role.kubernetes.io/worker=
                    node.openshift.io/os_id=rhcos
Annotations:        k8s.ovn.org/l3-gateway-config:
                      {"default":{"mode":"shared","interface-id":"br-ex_dell-per740-14.rhts.eng.pek2.redhat.com","mac-address":"e4:43:4b:5b:6c:28","ip-addresses...
                    k8s.ovn.org/node-chassis-id: 44557c31-ea74-49f4-abae-78316e0dffa3
                    k8s.ovn.org/node-join-subnets: {"default":"100.64.3.0/29"}
                    k8s.ovn.org/node-local-nat-ip: {"default":["169.254.12.7"]}
                    k8s.ovn.org/node-mgmt-port-mac-address: 36:26:ff:f3:a3:8b
                    k8s.ovn.org/node-primary-ifaddr: {"ipv4":"10.73.116.62/23","ipv6":"2620:52:0:4974:928e:6695:d41e:b1a4/64"}
                    k8s.ovn.org/node-subnets: {"default":"10.128.2.0/23"}
                    machineconfiguration.openshift.io/currentConfig: rendered-worker-d2dd550696cbfafc253011805efcfe77
                    machineconfiguration.openshift.io/desiredConfig: rendered-worker-d2dd550696cbfafc253011805efcfe77
                    machineconfiguration.openshift.io/reason: 
                    machineconfiguration.openshift.io/state: Done
                    sriovnetwork.openshift.io/state: Idle
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Fri, 25 Jun 2021 06:04:40 -0400
Taints:             <none>
Unschedulable:      false
Lease:
  HolderIdentity:  dell-per740-14.rhts.eng.pek2.redhat.com
  AcquireTime:     <unset>
  RenewTime:       Sun, 27 Jun 2021 22:28:43 -0400
Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  MemoryPressure   False   Sun, 27 Jun 2021 22:24:52 -0400   Sun, 27 Jun 2021 22:08:10 -0400   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Sun, 27 Jun 2021 22:24:52 -0400   Sun, 27 Jun 2021 22:08:10 -0400   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Sun, 27 Jun 2021 22:24:52 -0400   Sun, 27 Jun 2021 22:08:10 -0400   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            True    Sun, 27 Jun 2021 22:24:52 -0400   Sun, 27 Jun 2021 22:08:20 -0400   KubeletReady                 kubelet is posting ready status
Addresses:
  InternalIP:  10.73.116.62
  Hostname:    dell-per740-14.rhts.eng.pek2.redhat.com
Capacity:
  cpu:                     32
  ephemeral-storage:       584963052Ki
  hugepages-1Gi:           4Gi
  memory:                  32479680Ki
  openshift.io/inteldpdk:  2
  pods:                    250
Allocatable:
  cpu:                     31500m
  ephemeral-storage:       538028206007
  hugepages-1Gi:           4Gi
  memory:                  27134400Ki
  openshift.io/inteldpdk:  2
  pods:                    250
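
The node description above can be cross-checked against the two manifests: `hugepages=4` with `hugepagesz=1G` should surface as `hugepages-1Gi: 4Gi`, and `numVfs: 2` with `resourceName: inteldpdk` as `openshift.io/inteldpdk: 2`. The Python sketch below encodes that comparison with values copied from the output. The helper `check_node` and the choice of pass criteria are assumptions made for illustration; the report itself does not list explicit criteria.

```python
# Values copied from the "oc describe node" output above.
ANNOTATIONS = {
    "machineconfiguration.openshift.io/currentConfig":
        "rendered-worker-d2dd550696cbfafc253011805efcfe77",
    "machineconfiguration.openshift.io/desiredConfig":
        "rendered-worker-d2dd550696cbfafc253011805efcfe77",
    "machineconfiguration.openshift.io/state": "Done",
    "sriovnetwork.openshift.io/state": "Idle",
}
ALLOCATABLE = {"hugepages-1Gi": "4Gi", "openshift.io/inteldpdk": "2"}


def check_node(annotations, allocatable, hugepages_1g, resource, num_vfs):
    """Return a list of mismatches; an empty list means every check passed.

    Hypothetical helper: the checks are this sketch's reading of the output,
    not an official verification procedure.
    """
    mc = "machineconfiguration.openshift.io/"
    problems = []
    if annotations.get(mc + "currentConfig") != annotations.get(mc + "desiredConfig"):
        problems.append("currentConfig differs from desiredConfig")
    if annotations.get(mc + "state") != "Done":
        problems.append("machine config state is not Done")
    if annotations.get("sriovnetwork.openshift.io/state") != "Idle":
        problems.append("SR-IOV state is not Idle")
    if allocatable.get("hugepages-1Gi") != f"{hugepages_1g}Gi":
        problems.append("unexpected hugepages-1Gi allocation")
    if allocatable.get(resource) != str(num_vfs):
        problems.append(f"unexpected {resource} count")
    return problems


if __name__ == "__main__":
    # Expected values come from 1g-mc.yaml (hugepages=4) and intel-dpdk.yaml (numVfs: 2).
    issues = check_node(ANNOTATIONS, ALLOCATABLE, 4, "openshift.io/inteldpdk", 2)
    print(issues if issues else "all checks passed")
```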