Bug 1960263 - SR-IOV obliviously reboot the node
Summary: SR-IOV obliviously reboot the node
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.6.z
Assignee: Peng Liu
QA Contact: zhaozhanqi
URL:
Whiteboard:
Depends On: 1960103
Blocks:
Reported: 2021-05-13 13:11 UTC by Peng Liu
Modified: 2022-08-26 14:18 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1960103
Environment:
Last Closed: 2022-08-26 14:18:24 UTC
Target Upstream Version:
Embargoed:



Links
System ID | Private | Priority | Status | Summary | Last Updated
Github openshift sriov-network-operator pull 506 | 0 | None | open | [Release-4.6] Bug 1960263: Pause MCP before draining/rebooting node | 2021-05-28 01:00:51 UTC
Github openshift sriov-network-operator pull 522 | 0 | None | open | [Release-4.6] Bug 1960263: Find the MCP based on the owner of node's desired MC | 2021-06-21 02:01:59 UTC
Red Hat Product Errata RHBA-2021:2267 | 0 | None | None | None | 2021-06-15 19:30:28 UTC

Comment 4 zhaozhanqi 2021-06-07 10:52:12 UTC
Verified this bug on 4.6.0-202106032244

Comment 6 errata-xmlrpc 2021-06-15 19:30:17 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6.34 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:2267

Comment 7 Peng Liu 2021-06-17 05:33:16 UTC
One more patch needs to be backported; it fixes the scenario where a custom MCP has been created.
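
The lookup named in pull 522 ("Find the MCP based on the owner of node's desired MC") can be illustrated with a minimal sketch. It is not the operator's code: it works on plain dictionaries shaped like Kubernetes objects, and it assumes that a rendered MachineConfig lists its MachineConfigPool in metadata.ownerReferences. The function name find_pool_for_node and the sample data are hypothetical.

```python
DESIRED_CONFIG_ANNOTATION = "machineconfiguration.openshift.io/desiredConfig"


def find_pool_for_node(node, machine_configs):
    """Return the name of the pool that owns the node's desired MC, or None.

    node: dict shaped like a Node object.
    machine_configs: dict mapping MachineConfig name -> MachineConfig dict.
    """
    annotations = node.get("metadata", {}).get("annotations", {})
    desired = annotations.get(DESIRED_CONFIG_ANNOTATION)
    mc = machine_configs.get(desired)
    if mc is None:
        return None
    # Assumption: the owning pool appears as an ownerReference of the MC.
    for owner in mc.get("metadata", {}).get("ownerReferences", []):
        if owner.get("kind") == "MachineConfigPool":
            return owner.get("name")
    return None


# Sample data: the annotation value is taken from the node output in this
# bug; the ownerReferences entry is an illustrative assumption.
node = {
    "metadata": {
        "annotations": {
            DESIRED_CONFIG_ANNOTATION: "rendered-worker-d2dd550696cbfafc253011805efcfe77",
        }
    }
}
machine_configs = {
    "rendered-worker-d2dd550696cbfafc253011805efcfe77": {
        "metadata": {
            "ownerReferences": [{"kind": "MachineConfigPool", "name": "worker"}]
        }
    }
}

pool = find_pool_for_node(node, machine_configs)
print(pool)
```

Resolving the pool through the desired MachineConfig rather than through node role labels would also cover a node that belongs to a custom pool, which is the scenario this comment refers to.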

Comment 9 zhaozhanqi 2021-06-28 02:34:43 UTC

Verified this bug on 4.6.0-202106232234

Create the following YAML files at the same time:

# cat 1g-mc.yaml intel-dpdk.yaml 
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
    labels:
        machineconfiguration.openshift.io/role: worker
    name: 50-kargs-1g-hugepages
spec:
    kernelArguments:
        - default_hugepagesz=1G
        - hugepagesz=1G
        - hugepages=4 
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: intel-dpdk
  namespace: openshift-sriov-network-operator
spec:
  deviceType: vfio-pci
  mtu: 1700
  nicSelector:
    deviceID: "158b"
    pfNames:
      - ens1f1
    rootDevices:
      - '0000:3b:00.1'
    vendor: '8086'
  nodeSelector:
    feature.node.kubernetes.io/sriov-capable: 'true'
  numVfs: 2
  priority: 99
  resourceName: inteldpdk

# oc logs sriov-network-config-daemon-klpbk | grep MCP
I0628 02:03:05.797224  265835 daemon.go:754] drainNode():MCP worker is not ready: [{RenderDegraded False 2021-06-25 09:39:42 +0000 UTC  } {NodeDegraded False 2021-06-25 09:39:47 +0000 UTC  } {Degraded False 2021-06-25 09:39:47 +0000 UTC  } {Updated False 2021-06-28 02:02:24 +0000 UTC  } {Updating True 2021-06-28 02:02:24 +0000 UTC  All nodes are updating to rendered-worker-d2dd550696cbfafc253011805efcfe77}], wait...
I0628 02:03:35.801191  265835 daemon.go:754] drainNode():MCP worker is not ready: [{RenderDegraded False 2021-06-25 09:39:42 +0000 UTC  } {NodeDegraded False 2021-06-25 09:39:47 +0000 UTC  } {Degraded False 2021-06-25 09:39:47 +0000 UTC  } {Updated False 2021-06-28 02:02:24 +0000 UTC  } {Updating True 2021-06-28 02:02:24 +0000 UTC  All nodes are updating to rendered-worker-d2dd550696cbfafc253011805efcfe77}], wait...
I0628 02:04:05.802850  265835 daemon.go:754] drainNode():MCP worker is not ready: [{RenderDegraded False 2021-06-25 09:39:42 +0000 UTC  } {NodeDegraded False 2021-06-25 09:39:47 +0000 UTC  } {Degraded False 2021-06-25 09:39:47 +0000 UTC  } {Updated False 2021-06-28 02:02:24 +0000 UTC  } {Updating True 2021-06-28 02:02:24 +0000 UTC  All nodes are updating to rendered-worker-d2dd550696cbfafc253011805efcfe77}], wait...
I0628 02:09:41.253031    3422 daemon.go:486] completeDrain(): resume MCP worker
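
For reading the drainNode() lines above, the following sketch extracts the MCP condition list from one log line. The readiness rule in mcp_is_ready (Updated is True, Updating is False, Degraded is False) is an assumption made for illustration; the daemon's actual check is not shown in this bug. Both function names are hypothetical.

```python
import re

# Each condition is logged as "{Type Status timestamp... message}".
CONDITION_RE = re.compile(r"\{(\w+) (True|False|Unknown) [^}]*\}")


def parse_mcp_conditions(log_line):
    """Map each condition type in a drainNode() log line to its status string."""
    return {ctype: status for ctype, status in CONDITION_RE.findall(log_line)}


def mcp_is_ready(conditions):
    """Assumed rule: the pool is ready when it is updated, not updating, not degraded."""
    return (
        conditions.get("Updated") == "True"
        and conditions.get("Updating") == "False"
        and conditions.get("Degraded") == "False"
    )


line = (
    "drainNode():MCP worker is not ready: "
    "[{RenderDegraded False 2021-06-25 09:39:42 +0000 UTC  } "
    "{NodeDegraded False 2021-06-25 09:39:47 +0000 UTC  } "
    "{Degraded False 2021-06-25 09:39:47 +0000 UTC  } "
    "{Updated False 2021-06-28 02:02:24 +0000 UTC  } "
    "{Updating True 2021-06-28 02:02:24 +0000 UTC  All nodes are updating to "
    "rendered-worker-d2dd550696cbfafc253011805efcfe77}], wait..."
)

conditions = parse_mcp_conditions(line)
print(conditions)
print(mcp_is_ready(conditions))
```

In the excerpt the pool reports Updating True, so under the assumed rule the daemon keeps waiting; the later completeDrain() line shows the pool being resumed once the drain is complete.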



# oc describe node dell-per740-14.rhts.eng.pek2.redhat.com 
Name:               dell-per740-14.rhts.eng.pek2.redhat.com
Roles:              worker
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    feature.node.kubernetes.io/sriov-capable=true
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=dell-per740-14.rhts.eng.pek2.redhat.com
                    kubernetes.io/os=linux
                    node-role.kubernetes.io/worker=
                    node.openshift.io/os_id=rhcos
Annotations:        k8s.ovn.org/l3-gateway-config:
                      {"default":{"mode":"shared","interface-id":"br-ex_dell-per740-14.rhts.eng.pek2.redhat.com","mac-address":"e4:43:4b:5b:6c:28","ip-addresses...
                    k8s.ovn.org/node-chassis-id: 44557c31-ea74-49f4-abae-78316e0dffa3
                    k8s.ovn.org/node-join-subnets: {"default":"100.64.3.0/29"}
                    k8s.ovn.org/node-local-nat-ip: {"default":["169.254.12.7"]}
                    k8s.ovn.org/node-mgmt-port-mac-address: 36:26:ff:f3:a3:8b
                    k8s.ovn.org/node-primary-ifaddr: {"ipv4":"10.73.116.62/23","ipv6":"2620:52:0:4974:928e:6695:d41e:b1a4/64"}
                    k8s.ovn.org/node-subnets: {"default":"10.128.2.0/23"}
                    machineconfiguration.openshift.io/currentConfig: rendered-worker-d2dd550696cbfafc253011805efcfe77
                    machineconfiguration.openshift.io/desiredConfig: rendered-worker-d2dd550696cbfafc253011805efcfe77
                    machineconfiguration.openshift.io/reason: 
                    machineconfiguration.openshift.io/state: Done
                    sriovnetwork.openshift.io/state: Idle
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Fri, 25 Jun 2021 06:04:40 -0400
Taints:             <none>
Unschedulable:      false
Lease:
  HolderIdentity:  dell-per740-14.rhts.eng.pek2.redhat.com
  AcquireTime:     <unset>
  RenewTime:       Sun, 27 Jun 2021 22:28:43 -0400
Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  MemoryPressure   False   Sun, 27 Jun 2021 22:24:52 -0400   Sun, 27 Jun 2021 22:08:10 -0400   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Sun, 27 Jun 2021 22:24:52 -0400   Sun, 27 Jun 2021 22:08:10 -0400   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Sun, 27 Jun 2021 22:24:52 -0400   Sun, 27 Jun 2021 22:08:10 -0400   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            True    Sun, 27 Jun 2021 22:24:52 -0400   Sun, 27 Jun 2021 22:08:20 -0400   KubeletReady                 kubelet is posting ready status
Addresses:
  InternalIP:  10.73.116.62
  Hostname:    dell-per740-14.rhts.eng.pek2.redhat.com
Capacity:
  cpu:                     32
  ephemeral-storage:       584963052Ki
  hugepages-1Gi:           4Gi
  memory:                  32479680Ki
  openshift.io/inteldpdk:  2
  pods:                    250
Allocatable:
  cpu:                     31500m
  ephemeral-storage:       538028206007
  hugepages-1Gi:           4Gi
  memory:                  27134400Ki
  openshift.io/inteldpdk:  2
  pods:                    250
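
The node output can be cross-checked against the two manifests: hugepages=4 with hugepagesz=1G should give hugepages-1Gi: 4Gi, and numVfs: 2 with resourceName inteldpdk should give openshift.io/inteldpdk: 2. A minimal sketch of that comparison, with values copied from this comment and a hypothetical helper name:

```python
def expected_hugepages_1gi(kernel_args):
    """Return the hugepages-1Gi quantity implied by the kernel arguments."""
    args = dict(arg.split("=", 1) for arg in kernel_args)
    if args.get("hugepagesz") != "1G":
        raise ValueError("this sketch only handles 1G hugepages")
    return f"{int(args['hugepages'])}Gi"


# Values copied from the MachineConfig and SriovNetworkNodePolicy above.
kernel_args = ["default_hugepagesz=1G", "hugepagesz=1G", "hugepages=4"]
num_vfs = 2

# Values copied from the Capacity section of the node description.
capacity = {"hugepages-1Gi": "4Gi", "openshift.io/inteldpdk": "2"}

hugepages_ok = capacity["hugepages-1Gi"] == expected_hugepages_1gi(kernel_args)
vfs_ok = int(capacity["openshift.io/inteldpdk"]) == num_vfs
print(hugepages_ok, vfs_ok)
```

Both checks passing indicates that the MachineConfig change and the SR-IOV policy were each applied to the node.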

