Bug 1921321 - SR-IOV obliviously reboots the node
Summary: SR-IOV obliviously reboots the node
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: urgent
Target Milestone: ---
Target Release: 4.8.0
Assignee: Peng Liu
QA Contact: zhaozhanqi
Docs Contact: Stephen
URL:
Whiteboard:
Duplicates: 1928265 (view as bug list)
Depends On:
Blocks: 1960103
 
Reported: 2021-01-27 21:47 UTC by Yuval Kashtan
Modified: 2021-12-09 07:55 UTC
CC List: 12 users

Fixed In Version:
Doc Type: Known Issue
Doc Text:
A reboot may be required to enact SR-IOV changes on supported NICs. SR-IOV currently issues the reboot when it is ready. If this reboot coincides with changes in the Machine Config policy, the node can be left in an undetermined state: the Machine Config Operator assumes that the updated policy has been applied when it has not. [NOTE] ==== This race condition can also be triggered by adding a node to a Machine Config Pool that has pending MCP and SR-IOV changes. ==== To avoid this issue, apply MCO and SR-IOV changes for new nodes sequentially. First, apply all MCO configuration and wait for the nodes to settle. Then apply the SR-IOV configuration. If a new node is being added to a Machine Config Pool that includes SR-IOV, this issue can be avoided by removing the SR-IOV policy from the Machine Config Pool, adding the new worker, and then re-applying the SR-IOV policy.
Clone Of:
Environment:
Last Closed: 2021-07-27 22:36:45 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
reproducer cluster objects (1.77 KB, text/plain)
2021-01-27 21:47 UTC, Yuval Kashtan
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Github openshift sriov-network-operator pull 418 0 None closed Use MCO to apply switchdev configuration 2021-02-17 13:19:28 UTC
Github openshift sriov-network-operator pull 487 0 None open Bug 1921321: Sync upstream 2021-4-6 2021-04-07 14:18:40 UTC
Github openshift sriov-network-operator pull 494 0 None open Bug 1921321: Sync upstream: 2021-4-15 2021-04-15 02:14:36 UTC
Red Hat Product Errata RHSA-2021:2438 0 None None None 2021-07-27 22:37:15 UTC

Description Yuval Kashtan 2021-01-27 21:47:01 UTC
Created attachment 1751432 [details]
reproducer cluster objects

Description of problem:
When applying a SriovNetworkNodePolicy in conjunction with a MachineConfig that takes a while to apply (like switching to the rt-kernel),
SR-IOV reboots the node in the middle of that process.
When the node comes back online, it is left in an intermediate state it cannot reconcile.

IMHO this is a design bug;
all node configuration changes should be done through MCO.

Version-Release number of selected component (if applicable):
4.7


How reproducible:
Very often, with the steps below.


Steps to Reproduce:
This needs a node with an Intel SR-IOV capable NIC.
Make sure to update the SriovNetworkNodePolicy with that NIC name (an illustrative sketch of the objects involved follows these steps), then:
1. oc apply -f reproducer.yaml # it is expected to fail on missing CRDs
2. Wait for the cluster to settle and sriov-network-operator to become operational.
3. Apply the worker-duprofile role to the node.
4. oc apply -f reproducer.yaml # again, to apply the previously missing CRs
5. Inspect sriov-daemon and machine-config-daemon on that node to see what is happening.
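
For reference, a minimal sketch of the kind of objects involved. This is illustrative only, not the attached reproducer.yaml; the custom role name, NIC interface name, VF count and resource name are assumptions:

# Hypothetical MachineConfig that switches the custom pool to the realtime kernel
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-worker-duprofile-realtime
  labels:
    machineconfiguration.openshift.io/role: worker-duprofile   # assumed custom role
spec:
  kernelType: realtime
---
# Hypothetical SriovNetworkNodePolicy; replace pfNames with the Intel NIC present on the node
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: policy-intel
  namespace: openshift-sriov-network-operator
spec:
  resourceName: intelnics
  nodeSelector:
    node-role.kubernetes.io/worker-duprofile: ""
  numVfs: 8
  nicSelector:
    pfNames: ["ens2f0"]   # assumed interface name
  deviceType: netdevice

Applying both at once is exactly what triggers the race: the SR-IOV config daemon may reboot the node while the machine-config-daemon is still rolling out the kernel switch.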

Actual results:
no kernel-rt on node

Expected results:
kernel-rt on node

Additional info:
This is the BZ for the MCO part: https://bugzilla.redhat.com/show_bug.cgi?id=1916169

Comment 2 Yuval Kashtan 2021-01-29 13:21:50 UTC
It seems like there is already a mechanism upstream to interact with MCO:
https://github.com/openshift/sriov-network-operator/commit/d45a8e35feec3d7b2e183052c07b56d93ff1e0a3

I don't think it resolves the issue (because reqReboot can still be set),
but I think it can be enhanced to solve it.

Comment 4 Ken Young 2021-02-04 19:52:44 UTC
More discussion here: http://post-office.corp.redhat.com/archives/aos-devel/2021-February/msg00086.html

Updating the severity of this to urgent.

Comment 6 Ken Young 2021-02-11 21:21:16 UTC
Zenghui,

I believe this issue should be documented in the 4.7 Release notes.  I am thinking something like this:

Cause: To enact SRIOV changes on an Intel NIC, a reboot is required.  SRIOV currently issues the reboot when it is ready.  If this reboot coincides with changes in the Machine Config policy, the node can be left in an undetermined state.  Machine Config Operator believes that updated policy has been applied when it actually has not.  Note that this race condition can also be caused by adding a node to a machine config pool which has MCP and SRIOV changes.

Consequence: The node is left in an indeterminate state.

Workaround (if any): To avoid this issue, new nodes requiring SRIOV and MCO changes should do so in a step wise fashion.  First apply all MCO configuration and wait for the nodes to settle.  Then apply the SRIOV configuration.  If a new node is being added to a machine config pool which includes SRIOV, this issue can be avoided by removing the SRIOV policy from the machine configuration pool and then adding the new worker.  Then re-apply the SRIOV policy.

Result: If the configuration of MCO and SRIOV is completed sequentially, the node will provision correctly.
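
For illustration, the sequential flow described in the workaround above might look like the following from the command line (a sketch only; the pool name and manifest file names are assumptions based on the reproducer):

# 1. Apply the MachineConfig changes first and wait for the pool to finish updating
oc apply -f mc-realtime.yaml
oc wait mcp/worker-duprofile --for=condition=Updated --timeout=60m

# 2. Only then apply the SR-IOV policy and let the SR-IOV operator do its own reboot, if needed
oc apply -f sriov-policy.yaml
oc get sriovnetworknodestates -n openshift-sriov-network-operator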

What do you think?  By the way, the deadline for identifying bugs for Release Notes is close of business tomorrow.

Regards,
Ken Y

Comment 11 zenghui.shi 2021-02-18 11:26:20 UTC
(In reply to Ken Young from comment #6)
> Zenghui,
> 
> I believe this issue should be documented in the 4.7 Release notes.  I am
> thinking something like this:
> 
> Cause: To enact SRIOV changes on an Intel NIC, a reboot is required.  

SR-IOV config on Mellanox NICs also requires a reboot to take effect.
Perhaps make this a general statement:
"A reboot is sometimes required to enact SR-IOV changes on supported NICs." Wdyt?

> currently issues the reboot when it is ready.  If this reboot coincides with
> changes in the Machine Config policy, the node can be left in an
> undetermined state.  Machine Config Operator believes that updated policy
> has been applied when it actually has not.  Note that this race condition
> can also be caused by adding a node to a machine config pool which has MCP
> and SRIOV changes.
> 
> Consequence: The node is left in an indeterminate state.
> 
> Workaround (if any): To avoid this issue, new nodes requiring SRIOV and MCO
> changes should do so in a step wise fashion.  First apply all MCO
> configuration and wait for the nodes to settle.  Then apply the SRIOV
> configuration.  If a new node is being added to a machine config pool which
> includes SRIOV, this issue can be avoided by removing the SRIOV policy from
> the machine configuration pool and then adding the new worker.  Then
> re-apply the SRIOV policy.
> 
> Result: If the configuration of MCO and SRIOV is completed sequentially, the
> node will provision correctly.
> 
> What do you think?  By the way, the deadline for identifying bugs for
> Release Notes is close of business tomorrow.
> 

The rest looks good to me.

Comment 13 Stephen 2021-02-24 14:11:11 UTC
This has been published in the 4.7 Release Notes here: https://access.redhat.com/documentation/en-us/openshift_container_platform/4.7/html/release_notes/ocp-4-7-release-notes

Comment 15 Ken Young 2021-03-27 16:41:13 UTC
*** Bug 1928265 has been marked as a duplicate of this bug. ***

Comment 17 Peng Liu 2021-04-06 16:42:42 UTC
Here's the upstream PR https://github.com/k8snetworkplumbingwg/sriov-network-operator/pull/93

Comment 19 Sabina Aledort 2021-04-13 13:32:11 UTC
Hi,

We just tried the fix and it seems that sriov is waiting for the wrong MCP. It is waiting for the 'worker' MCP while it should wait for the 'worker-duprofile' MCP.

[root@cnfdd3-installer cnf-internal-deploy]# oc get node
NAME                                             STATUS     ROLES                     AGE     VERSION
cnfdd3.clus2.t5g.lab.eng.bos.redhat.com          NotReady   worker,worker-duprofile   160m    v1.20.0+5f82cdb
dhcp19-17-102.clus2.t5g.lab.eng.bos.redhat.com   Ready      master,virtual            3h47m   v1.20.0+5f82cdb
dhcp19-17-118.clus2.t5g.lab.eng.bos.redhat.com   Ready      master,virtual            3h46m   v1.20.0+5f82cdb
dhcp19-17-128.clus2.t5g.lab.eng.bos.redhat.com   Ready      master,virtual            3h46m   v1.20.0+5f82cdb
dhcp19-17-5.clus2.t5g.lab.eng.bos.redhat.com     Ready      worker                    3h17m   v1.20.0+5f82cdb

[root@cnfdd3-installer cnf-internal-deploy]# oc get mcp -A
NAME               CONFIG                                                       UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master             rendered-master-35194b0693787f3b1f3134ea0a3488ec             True      False      False      3              3                   3                     0                      3h46m
worker             rendered-worker-2d38b8340641ab4c1b1af1479c7386d6             True      False      False      1              1                   1                     0                      3h46m
worker-duprofile   rendered-worker-duprofile-73fff1411b4fe061d3f875cbdfc5816c   False     True       False      1              0                   1                     0                      149m

[root@cnfdd3-installer ~]# oc logs -n openshift-sriov-network-operator sriov-network-config-daemon-pmrsl | grep -in mcp
246:I0413 11:01:42.802272   51194 daemon.go:786] getNodeMachinePool(): find node in MCP worker
251:I0413 11:01:49.166330   51194 daemon.go:839] drainNode(): MCP worker is ready
252:I0413 11:01:49.166343   51194 daemon.go:849] drainNode(): pause MCP worker
253:I0413 11:01:49.175907   51194 daemon.go:731] annotateNode(): Annotate node cnfdd3.clus2.t5g.lab.eng.bos.redhat.com with: Draining_MCP_Paused
254:I0413 11:01:49.205714   51194 daemon.go:839] drainNode(): MCP worker is ready
255:I0413 11:01:49.205728   51194 daemon.go:841] drainNode(): stop MCP informerworker
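
For reference, one way to see which pool a node actually belongs to is to compare the node's role labels with each pool's node selector. The commands below are a sketch; the node and pool names are taken from the output above:

oc get node cnfdd3.clus2.t5g.lab.eng.bos.redhat.com --show-labels
oc get mcp worker-duprofile -o jsonpath='{.spec.nodeSelector}{"\n"}'
oc get mcp worker -o jsonpath='{.spec.nodeSelector}{"\n"}'

A node carrying both role labels matches both pools' selectors; the daemon needs to wait on the pool that is actually applying the rendered config (worker-duprofile here), which is the behavior seen with the later build in comment 22.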

Comment 21 zhaozhanqi 2021-04-15 11:23:39 UTC
@saledort 

Could you try again with the new build?

Comment 22 Sabina Aledort 2021-04-19 12:54:57 UTC
The new build looks good.

We got the RT kernel on the node and the SR-IOV policy was set successfully.

[root@cnfdd3-installer cnf-internal-deploy]# oc get node -o wide
NAME                                             STATUS   ROLES                     AGE     VERSION                INTERNAL-IP    EXTERNAL-IP   OS-IMAGE                                                       KERNEL-VERSION                 CONTAINER-RUNTIME
cnfdd3.clus2.t5g.lab.eng.bos.redhat.com          Ready    worker,worker-duprofile   162m    v1.21.0-rc.0+2993be8   10.19.16.100   <none>        Red Hat Enterprise Linux CoreOS 48.84.202104171300-0 (Ootpa)   4.18.0-293.rt7.59.el8.x86_64   cri-o://1.21.0-74.rhaos4.8.gitbc1ef35.el8
dhcp19-17-102.clus2.t5g.lab.eng.bos.redhat.com   Ready    master,virtual            3h25m   v1.21.0-rc.0+2993be8   10.19.17.102   <none>        Red Hat Enterprise Linux CoreOS 48.84.202104171300-0 (Ootpa)   4.18.0-293.el8.x86_64          cri-o://1.21.0-74.rhaos4.8.gitbc1ef35.el8
dhcp19-17-118.clus2.t5g.lab.eng.bos.redhat.com   Ready    master,virtual            3h25m   v1.21.0-rc.0+2993be8   10.19.17.118   <none>        Red Hat Enterprise Linux CoreOS 48.84.202104171300-0 (Ootpa)   4.18.0-293.el8.x86_64          cri-o://1.21.0-74.rhaos4.8.gitbc1ef35.el8
dhcp19-17-128.clus2.t5g.lab.eng.bos.redhat.com   Ready    master,virtual            3h25m   v1.21.0-rc.0+2993be8   10.19.17.128   <none>        Red Hat Enterprise Linux CoreOS 48.84.202104171300-0 (Ootpa)   4.18.0-293.el8.x86_64          cri-o://1.21.0-74.rhaos4.8.gitbc1ef35.el8
dhcp19-17-56.clus2.t5g.lab.eng.bos.redhat.com    Ready    worker                    175m    v1.21.0-rc.0+2993be8   10.19.17.56    <none>        Red Hat Enterprise Linux CoreOS 48.84.202104171300-0 (Ootpa)   4.18.0-293.el8.x86_64          cri-o://1.21.0-74.rhaos4.8.gitbc1ef35.el8

Allocatable:                                                
  cpu:                                    47                                                                                                         
  ephemeral-storage:                      431049040797 
  hugepages-1Gi:                          16Gi                                                                                                       
  hugepages-2Mi:                          0                            
  memory:                                 79643004Ki                                                                                                 
  openshift.io/mh_u_site_1_fqdn_worker1:  4                                                                                                                  
  pods:                                   250                                                  

[root@cnfdd3-installer cnf-internal-deploy]# oc get mcp -A
NAME               CONFIG                                                       UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master             rendered-master-9e5d7db0cf72ef9267abfa7a0c038380             True      False      False      3              3                   3                     0                      3h21m
worker             rendered-worker-f5439a2461aae76736ae85ef75f8a3d2             True      False      False      1              1                   1                     0                      3h21m
worker-duprofile   rendered-worker-duprofile-df0ea7c177cad5fc144601884ba055b8   True      False      False      1              1                   1                     0                      84m

[root@cnfdd3-installer cnf-internal-deploy]# oc logs -n openshift-sriov-network-operator sriov-network-config-daemon-4xrzs | grep MCP
I0419 11:37:11.303454    9252 daemon.go:880] drainNode():MCP worker-duprofile is not ready: [{RenderDegraded False 2021-04-19 11:28:57 +0000 UTC  } {NodeDegraded False 2021-04-19 11:29:02 +0000 UTC  } {Degraded False 2021-04-19 11:29:02 +0000 UTC  } {Updated False 2021-04-19 11:29:24 +0000 UTC  } {Updating True 2021-04-19 11:29:24 +0000 UTC  All nodes are updating to rendered-worker-duprofile-2e3f4a90e05613fa4e39dfb14226921a}], wait...
I0419 11:37:41.302725    9252 daemon.go:880] drainNode():MCP worker-duprofile is not ready: [{RenderDegraded False 2021-04-19 11:28:57 +0000 UTC  } {NodeDegraded False 2021-04-19 11:29:02 +0000 UTC  } {Degraded False 2021-04-19 11:29:02 +0000 UTC  } {Updated False 2021-04-19 11:29:24 +0000 UTC  } {Updating True 2021-04-19 11:29:24 +0000 UTC  All nodes are updating to rendered-worker-duprofile-2e3f4a90e05613fa4e39dfb14226921a}], wait...
I0419 11:38:11.303793    9252 daemon.go:880] drainNode():MCP worker-duprofile is not ready: [{RenderDegraded False 2021-04-19 11:28:57 +0000 UTC  } {NodeDegraded False 2021-04-19 11:29:02 +0000 UTC  } {Degraded False 2021-04-19 11:29:02 +0000 UTC  } {Updated False 2021-04-19 11:29:24 +0000 UTC  } {Updating True 2021-04-19 11:29:24 +0000 UTC  All nodes are updating to rendered-worker-duprofile-2e3f4a90e05613fa4e39dfb14226921a}], wait...
I0419 11:38:41.304807    9252 daemon.go:880] drainNode():MCP worker-duprofile is not ready: [{RenderDegraded False 2021-04-19 11:28:57 +0000 UTC  } {NodeDegraded False 2021-04-19 11:29:02 +0000 UTC  } {Degraded False 2021-04-19 11:29:02 +0000 UTC  } {Updated False 2021-04-19 11:29:24 +0000 UTC  } {Updating True 2021-04-19 11:29:24 +0000 UTC  All nodes are updating to rendered-worker-duprofile-2e3f4a90e05613fa4e39dfb14226921a}], wait...
I0419 11:39:11.305213    9252 daemon.go:880] drainNode():MCP worker-duprofile is not ready: [{RenderDegraded False 2021-04-19 11:28:57 +0000 UTC  } {NodeDegraded False 2021-04-19 11:29:02 +0000 UTC  } {Degraded False 2021-04-19 11:29:02 +0000 UTC  } {Updated False 2021-04-19 11:29:24 +0000 UTC  } {Updating True 2021-04-19 11:29:24 +0000 UTC  All nodes are updating to rendered-worker-duprofile-2e3f4a90e05613fa4e39dfb14226921a}], wait...
I0419 11:39:41.306091    9252 daemon.go:880] drainNode():MCP worker-duprofile is not ready: [{RenderDegraded False 2021-04-19 11:28:57 +0000 UTC  } {NodeDegraded False 2021-04-19 11:29:02 +0000 UTC  } {Degraded False 2021-04-19 11:29:02 +0000 UTC  } {Updated False 2021-04-19 11:29:24 +0000 UTC  } {Updating True 2021-04-19 11:29:24 +0000 UTC  All nodes are updating to rendered-worker-duprofile-2e3f4a90e05613fa4e39dfb14226921a}], wait...
I0419 11:40:11.306728    9252 daemon.go:880] drainNode():MCP worker-duprofile is not ready: [{RenderDegraded False 2021-04-19 11:28:57 +0000 UTC  } {NodeDegraded False 2021-04-19 11:29:02 +0000 UTC  } {Degraded False 2021-04-19 11:29:02 +0000 UTC  } {Updated False 2021-04-19 11:29:24 +0000 UTC  } {Updating True 2021-04-19 11:29:24 +0000 UTC  All nodes are updating to rendered-worker-duprofile-2e3f4a90e05613fa4e39dfb14226921a}], wait...
I0419 11:40:41.307084    9252 daemon.go:880] drainNode():MCP worker-duprofile is not ready: [{RenderDegraded False 2021-04-19 11:28:57 +0000 UTC  } {NodeDegraded False 2021-04-19 11:29:02 +0000 UTC  } {Degraded False 2021-04-19 11:29:02 +0000 UTC  } {Updated False 2021-04-19 11:29:24 +0000 UTC  } {Updating True 2021-04-19 11:29:24 +0000 UTC  All nodes are updating to rendered-worker-duprofile-2e3f4a90e05613fa4e39dfb14226921a}], wait...
I0419 11:47:45.238599   21005 daemon.go:598] completeDrain(): resume MCP worker-duprofile

Comment 23 zhaozhanqi 2021-04-20 01:54:40 UTC
Thanks Sabina. Moving this bug to VERIFIED per comment 22.

Comment 26 errata-xmlrpc 2021-07-27 22:36:45 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

