Bug 1752144 - applying a MC with `nosmt` karg can prevent the MC from being rolled out to all nodes
Summary: applying a MC with `nosmt` karg can prevent the MC from being rolled out to all nodes
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Machine Config Operator
Version: 4.2.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: 4.3.0
Assignee: Sinny Kumari
QA Contact: Michael Nguyen
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-09-13 21:33 UTC by Micah Abbott
Modified: 2020-01-28 02:14 UTC (History)
3 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-01-23 11:05:53 UTC
Target Upstream Version:
Embargoed:




Links
- Github: openshift machine-config-operator pull 1251 (closed) - Bug 1752144: docs: Add doc for nosmt Kernel Argument (last updated 2020-12-02 21:17:21 UTC)
- Red Hat Product Errata: RHBA-2020:0062 (last updated 2020-01-23 11:06:22 UTC)

Description Micah Abbott 2019-09-13 21:33:34 UTC
Description of problem:

Applying a MachineConfig with the `nosmt` kernel argument can prevent the MachineConfig from being completely rolled out to all nodes.


Version-Release number of selected component (if applicable):

4.2.0-0.nightly-2019-09-12-034447

How reproducible:

100%


Steps to Reproduce:
1.  Create a 3 master / 3 worker cluster in AWS
2.  Create an MC with the `nosmt` kernel argument targeting the worker nodes (see comment 2 for the MC)
3.  Apply the MC to the cluster


Actual results:

2/3 nodes will be updated successfully; the last node will not

Expected results:

All nodes updated successfully

Additional info:

What ends up happening is that the last node will not fully drain because the router pod cannot be rescheduled onto the other nodes due to insufficient CPU capacity.

```
$ oc -n openshift-machine-config-operator logs po/machine-config-daemon-fwfnh --tail=20
I0913 21:27:18.955122    3129 update.go:89] pod "kube-state-metrics-77874d5cd7-7x2j4" removed (evicted)
I0913 21:27:19.555060    3129 update.go:89] pod "telemeter-client-58c7d48694-bhjhd" removed (evicted)
I0913 21:27:20.354975    3129 update.go:89] pod "prometheus-adapter-5b64c99bbb-z84rg" removed (evicted)
I0913 21:27:20.755345    3129 update.go:89] pod "alertmanager-main-0" removed (evicted)
I0913 21:27:20.954960    3129 update.go:89] pod "grafana-9fd6f7c74-f4vvv" removed (evicted)
I0913 21:27:22.127498    3129 update.go:89] pod "prometheus-k8s-0" removed (evicted)
I0913 21:27:22.554936    3129 update.go:89] pod "openshift-state-metrics-8954bf77-dbrjb" removed (evicted)
I0913 21:27:22.851601    3129 update.go:89] error when evicting pod "router-default-57bcb847dc-757wb" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I0913 21:27:23.126583    3129 update.go:89] pod "prometheus-adapter-5b64c99bbb-smlw9" removed (evicted)
I0913 21:27:24.126973    3129 update.go:89] pod "image-registry-fb4879955-r9fjg" removed (evicted)
I0913 21:27:27.857551    3129 update.go:89] error when evicting pod "router-default-57bcb847dc-757wb" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I0913 21:27:32.863517    3129 update.go:89] error when evicting pod "router-default-57bcb847dc-757wb" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I0913 21:27:37.869370    3129 update.go:89] error when evicting pod "router-default-57bcb847dc-757wb" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I0913 21:27:42.875157    3129 update.go:89] error when evicting pod "router-default-57bcb847dc-757wb" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I0913 21:27:47.880799    3129 update.go:89] error when evicting pod "router-default-57bcb847dc-757wb" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.

$ oc -n openshift-ingress get po
NAME                              READY   STATUS    RESTARTS   AGE
router-default-57bcb847dc-4np2v   0/1     Pending   0          6m9s
router-default-57bcb847dc-757wb   1/1     Running   0          22m

$ oc -n openshift-ingress get po/router-default-57bcb847dc-4np2v -o jsonpath='{.status.conditions[0].message}{"\n"}'
0/6 nodes are available: 1 node(s) were unschedulable, 2 Insufficient cpu, 3 node(s) didn't match node selector.
```
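For reference, a quick way to confirm what is blocking the drain is to look at the router's PodDisruptionBudget and the pending replacement pod (the `router-default` PDB name below is an assumption based on the default ingress controller):

```
# Show the PDB that is preventing eviction of the last running router pod
$ oc -n openshift-ingress get poddisruptionbudget
$ oc -n openshift-ingress describe poddisruptionbudget router-default

# Confirm the replacement router pod is stuck Pending on CPU
$ oc -n openshift-ingress describe pod router-default-57bcb847dc-4np2v
```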

As the `nosmt` flag is rolled out to the worker nodes, it appears the overall CPU capacity of the workers is reduced and there is not enough capacity in the cluster to schedule all the previously scheduled pods.

Before `nosmt` rollout:

```
$ oc get nodes -l node-role.kubernetes.io/worker= -o jsonpath='{range .items[*]}{"Allocated: "}{.status.allocatable.cpu}{" Capacity: "}{.status.capacity.cpu}{"\n"}{end}'
Allocated: 1500m Capacity: 2
Allocated: 1500m Capacity: 2
Allocated: 1500m Capacity: 2
```

As `nosmt` rolls out:

```
$ oc get nodes -l node-role.kubernetes.io/worker= -o jsonpath='{range .items[*]}{"Allocated: "}{.status.allocatable.cpu}{" Capacity: "}{.status.capacity.cpu}{"\n"}{end}'
Allocated: 1500m Capacity: 2
Allocated: 1500m Capacity: 2
Allocated: 500m Capacity: 1
```

Workaround:

Scale up one of the MachineSets to increase CPU capacity as needed.
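
For example, a minimal sketch of that workaround (the MachineSet name is a placeholder; use whatever `oc get machinesets` reports for your cluster):

```
# List the worker MachineSets; names include the cluster's infrastructure ID
$ oc -n openshift-machine-api get machinesets

# Scale one of them up to regain CPU headroom (adjust the name and replica count)
$ oc -n openshift-machine-api scale machineset <infra-id>-worker-us-east-1a --replicas=2
```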


Possibly Related:
https://bugzilla.redhat.com/show_bug.cgi?id=1752111

Comment 1 Micah Abbott 2019-09-13 21:35:14 UTC
I recognize this isn't the fault of the MCO directly, but starting here in hopes of landing in the right place.

Comment 2 Micah Abbott 2019-09-13 21:35:56 UTC
MC for reference:

```
$ cat ../machineConfigs/kargs-add-nosmt.yaml 
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: 60-nosmt-kargs
spec:
    kernelArguments:
      - nosmt
```
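
For reference, the MC above can be applied and the rollout watched with standard commands:

```
# Apply the MachineConfig
$ oc create -f kargs-add-nosmt.yaml

# Watch the worker pool; UPDATED only flips to True once every worker has the new config
$ oc get machineconfigpool worker -w

# Check which workers have picked up the change / rebooted
$ oc get nodes -l node-role.kubernetes.io/worker=
```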

Comment 3 Antonio Murdaca 2019-09-16 15:09:12 UTC
I wonder if something changed resource-wise that now prevents this from going through.

Comment 4 Sinny Kumari 2019-09-19 12:44:22 UTC
I can reproduce this issue with a 3 master / 3 worker cluster (2 CPU cores on each node) launched in AWS with the latest installer (d9a9648cf2330d467cca9f2988846d031464125e) and the MachineConfig from comment #2 applied.

A few observations:

Before applying the 60-nosmt-kargs MachineConfig, the allocatable CPU is 1500m on each worker node:
$ oc get nodes -l node-role.kubernetes.io/worker= -o jsonpath='{range .items[*]}{"Allocated: "}{.status.allocatable.cpu}{" Capacity: "}{.status.capacity.cpu}{"\n"}{end}'
Allocated: 1500m Capacity: 2
Allocated: 1500m Capacity: 2
Allocated: 1500m Capacity: 2

While the MachineConfig is being applied to the first worker node, the CPU requests on the two remaining worker nodes look like this (one row per node):

  Resource                    Requests      Limits
  --------                    --------      ------
  cpu                         1200m (80%)   300m (20%)
  cpu                         1010m (67%)   300m (20%)

The update is able to proceed at this point because there is enough CPU on the two remaining workers (CPU=2 each) to launch the pods from the drained worker.

After applying the 60-nosmt-kargs MachineConfig on two workers, the allocatable CPU on those worker nodes drops to 500m each. This is because with nosmt enabled we end up with only 1 CPU on each worker node.

Once two of the nodes have been updated with the 60-nosmt-kargs MachineConfig, I believe the CPU load on the available worker nodes increases to full capacity, and hence some pods get stuck in scheduling while waiting on CPU resources.


Note: With worker nodes having CPU=4, the 60-nosmt-kargs MachineConfig is applied successfully on all worker nodes.
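
For reference, the per-node Requests/Limits shown above come from the node description; a minimal way to pull it (the node name is a placeholder):

```
# Allocatable CPU vs. what is already requested on a given worker
$ oc describe node <worker-node-name> | grep -A 8 'Allocated resources'
```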

Comment 5 Antonio Murdaca 2019-09-19 13:03:59 UTC
The cluster was provisioned without nosmt, and the case here is that we add `nosmt` afterwards. `nosmt` is known to trade performance for security, and it's advisable to provision more powerful instances if you're going to use it. So this isn't a bug per se, but something we should probably document somewhere. It doesn't impact upgrades either; it just prevents the nosmt rollout from completing on all nodes.
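
A hedged sketch of what "provision more powerful instances" can look like on AWS (the MachineSet name and field path assume the AWS provider; adjust for your platform):

```
# Check the instance type the worker MachineSet currently uses (AWS provider assumed)
$ oc -n openshift-machine-api get machineset <infra-id>-worker-us-east-1a \
    -o jsonpath='{.spec.template.spec.providerSpec.value.instanceType}{"\n"}'

# Edit it to a larger type so replacement workers come up with more CPU,
# then scale the Machines so new instances are created with the new type
$ oc -n openshift-machine-api edit machineset <infra-id>-worker-us-east-1a
```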

Comment 7 Micah Abbott 2019-11-18 19:38:33 UTC
The linked PR adds documentation for the `nosmt` kernel argument; I can confirm the docs have been updated.

Comment 9 errata-xmlrpc 2020-01-23 11:05:53 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0062

