Bug 2071854

Summary: After applying a KubeletConfig created by the Compliance Operator, the node goes into NotReady
Product: OpenShift Container Platform Reporter: Vladislav Walek <vwalek>
Component: Compliance Operator    Assignee: Vincent Shen <wenshen>
Status: CLOSED ERRATA QA Contact: xiyuan
Severity: medium Docs Contact:
Priority: medium    
Version: 4.8    CC: antaylor, jhrozek, jmittapa, lbragsta, mrogers, suprs, wenshen, xiyuan
Target Milestone: ---   
Target Release: 4.11.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version:    Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2022-06-06 14:39:50 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Vladislav Walek 2022-04-04 23:38:49 UTC
Description of problem:

I have a customer (IHAC) where, after installing the Compliance Operator and applying the new KubeletConfig, the node goes into the NotReady state.

The kubelet shows the error:

Apr 04 14:57:37 <node> hyperkube[722632]: E0404 14:57:37.897575  722632 server.go:292] "Failed to run kubelet" err="failed to run Kubelet: failed to create kubelet: grace period must be specified for the soft eviction threshold imagefs.available"
Apr 04 14:57:37 <node> systemd[1]: kubelet.service: Main process exited, code=exited, status=1/FAILURE

We see that error on 2 nodes already, each from a different pool.
The MCPs also show as progressing, so it seems that after the Machine Config Daemon applied the new config, the node became NotReady.

I see that the kubelet configuration on the node contains the values defined in the KubeletConfig, so the KubeletConfig itself appears to have been applied.
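
For reference, the kubelet refuses to start when a soft eviction threshold is set without a matching grace period. A minimal illustrative fragment (an assumed failing shape, not taken from the affected node) that would trigger the error above:

  "evictionSoft": {
    "imagefs.available": "15%"
  },
  "evictionSoftGracePeriod": {}

Every key under evictionSoft needs a corresponding entry under evictionSoftGracePeriod (for example "imagefs.available": "1m30s"); otherwise the kubelet exits at startup with the error shown.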

Version-Release number of selected component (if applicable):
OpenShift Container Platform 4.8.22

How reproducible:
not at the moment

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 3 Vincent Shen 2022-04-05 06:30:11 UTC
Sorry for the inconvenience. I just went through the log and I think this is related to a bug we recently discovered; we will work on a fix this Sprint. The problem is that the KubeletConfig was not fully rendered into the MachineConfig: the remediation patches the same KubeletConfig over and over, and the Compliance Operator unpaused the pool too soon, so some nodes did not get a fully rendered MachineConfig. You can see this by comparing the 2 most recent rendered-worker-* configs.
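
As a rough diagnostic sketch (the rendered-worker names are placeholders, not from this cluster), you can compare the two newest rendered worker configs:

$ oc get mc --sort-by=.metadata.creationTimestamp | grep rendered-worker | tail -2
$ oc get mc <older-rendered-worker> -o yaml > old.yaml
$ oc get mc <newer-rendered-worker> -o yaml > new.yaml
$ diff old.yaml new.yaml

The diff shows what a node stuck on the older rendered config is missing (for example the evictionSoftGracePeriod entries); which rendered config a node is on is visible in its machineconfiguration.openshift.io/currentConfig annotation.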

There are a couple of ways to temporarily work around this issue while a patch is on the way:

1. Manually create a KubeletConfig for the pool with the desired settings before scanning, so the operator does not need to generate one.
2. Disable auto-apply remediation, manually pause the pool, apply all remediations at once, wait about 20s, then unpause the pool (see the sketch after this list).
3. Use a tailored profile to disable the KubeletConfig rules, and create a separate profile containing only those KubeletConfig checks with auto-apply remediation turned off.
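
A rough sketch of option 2 (the pool and remediation names are examples; the actual ComplianceRemediation names depend on the scan, and auto-apply is avoided by using a ScanSetting such as "default" instead of "default-auto-apply"):

$ oc patch mcp worker --type merge -p '{"spec":{"paused":true}}'
$ oc get complianceremediations -n openshift-compliance
$ oc patch complianceremediation <remediation-name> -n openshift-compliance --type merge -p '{"spec":{"apply":true}}'   # repeat for each outstanding remediation
$ sleep 20
$ oc patch mcp worker --type merge -p '{"spec":{"paused":false}}'

Pausing the pool keeps the MachineConfigPool from rolling out a partially rendered MachineConfig while the remediations are still being applied.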

Comment 4 Vladislav Walek 2022-04-05 18:37:45 UTC
Hello Vincent

Thank you for the reply. I was checking the workaround, but I am worried because the pool is currently being updated and it has already failed on those first 2 nodes.
How can we undo it so we can apply the workaround?

Should we pause the pool, apply the workaround by manually creating the KubeletConfig, and then unpause it?

Or should we remove the Compliance Operator's KubeletConfig and force the pool to revert the changes, then pause it and apply the workaround?

I am trying to figure out what we can do now to bring the cluster back to a stable state.

Comment 5 Lance Bragstad 2022-04-06 19:15:53 UTC
Following up: we're actively reviewing the fix upstream, but it hasn't merged yet.

Comment 6 Vladislav Walek 2022-04-26 00:54:14 UTC
(In reply to Vincent Shen from comment #3)
> Sorry for the inconvenience. I just went through the log and I think this is
> related to a bug we recently discovered; we will work on a fix this Sprint.
> The problem is that the KubeletConfig was not fully rendered into the
> MachineConfig: the remediation patches the same KubeletConfig over and over,
> and the Compliance Operator unpaused the pool too soon, so some nodes did not
> get a fully rendered MachineConfig. You can see this by comparing the 2 most
> recent rendered-worker-* configs.
> 
> There are a couple of ways to temporarily work around this issue while a
> patch is on the way:
> 
> 1. Manually create a KubeletConfig for the pool with the desired settings
> before scanning, so the operator does not need to generate one.
> 2. Disable auto-apply remediation, manually pause the pool, apply all
> remediations at once, wait about 20s, then unpause the pool.
> 3. Use a tailored profile to disable the KubeletConfig rules, and create a
> separate profile containing only those KubeletConfig checks with auto-apply
> remediation turned off.

Hello Vincent,

I was checking the issue; however, I see a problem where the configuration in the kubelet config file on the node prevents the kubelet from starting.
It seems the issue also lies with the kubelet itself.

This is the kubelet config on the node where the kubelet fails to start:

# cat kubelet.conf 
{
  "kind": "KubeletConfiguration",
  "apiVersion": "kubelet.config.k8s.io/v1beta1",
  "staticPodPath": "/etc/kubernetes/manifests",
  "syncFrequency": "0s",
  "fileCheckFrequency": "0s",
  "httpCheckFrequency": "0s",
  "tlsCipherSuites": [
    "TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256",
    "TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256",
    "TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384",
    "TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384",
    "TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256",
    "TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256"
  ],
  "tlsMinVersion": "VersionTLS12",
  "rotateCertificates": true,
  "serverTLSBootstrap": true,
  "authentication": {
    "x509": {
      "clientCAFile": "/etc/kubernetes/kubelet-ca.crt"
    },
    "webhook": {
      "cacheTTL": "0s"
    },
    "anonymous": {
      "enabled": false
    }
  },
  "authorization": {
    "webhook": {
      "cacheAuthorizedTTL": "0s",
      "cacheUnauthorizedTTL": "0s"
    }
  },
  "eventRecordQPS": 10,
  "clusterDomain": "cluster.local",
  "clusterDNS": [
    "172.30.0.10"
  ],
  "streamingConnectionIdleTimeout": "0s",
  "nodeStatusUpdateFrequency": "0s",
  "nodeStatusReportFrequency": "0s",
  "imageMinimumGCAge": "0s",
  "volumeStatsAggPeriod": "0s",
  "systemCgroups": "/system.slice",
  "cgroupRoot": "/",
  "cgroupDriver": "systemd",
  "cpuManagerReconcilePeriod": "0s",
  "runtimeRequestTimeout": "0s",
  "maxPods": 250,
  "kubeAPIQPS": 50,
  "kubeAPIBurst": 100,
  "serializeImagePulls": false,
  "evictionHard": {
    "imagefs.available": "10%",
    "imagefs.inodesFree": "5%",
    "memory.available": "200Mi",
    "nodefs.available": "5%",
    "nodefs.inodesFree": "4%"
  },
  "evictionSoft": {
    "imagefs.available": "15%",
    "imagefs.inodesFree": "10%",
    "memory.available": "500Mi",
    "nodefs.available": "10%",
    "nodefs.inodesFree": "5%"
  },
  "evictionSoftGracePeriod": {
    "imagefs.available": "1m30s",
    "imagefs.inodesFree": "1m30s",
    "memory.available": "1m30s",
    "nodefs.available": "1m30s",
    "nodefs.inodesFree": "1m30s"
  },
  "evictionPressureTransitionPeriod": "0s",
  "protectKernelDefaults": true,
  "makeIPTablesUtilChains": true,
  "featureGates": {
    "APIPriorityAndFairness": true,
    "CSIMigrationAWS": false,
    "CSIMigrationAzureDisk": false,
    "CSIMigrationAzureFile": false,
    "CSIMigrationGCE": false,
    "CSIMigrationOpenStack": false,
    "CSIMigrationvSphere": false,
    "DownwardAPIHugePages": true,
    "LegacyNodeRoleBehavior": false,
    "NodeDisruptionExclusion": true,
    "PodSecurity": true,
    "RotateKubeletServerCertificate": true,
    "ServiceNodeExclusion": true,
    "SupportPodPidsLimit": true
  },
  "memorySwap": {},
  "containerLogMaxSize": "50Mi",
  "systemReserved": {
    "ephemeral-storage": "1Gi"
  },
  "logging": {
    "flushFrequency": 0,
    "verbosity": 0,
    "options": {
      "json": {
        "infoBufferSize": "0"
      }
    }
  },
  "shutdownGracePeriod": "0s",
  "shutdownGracePeriodCriticalPods": "0s"
}
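
A quick way to check whether every soft eviction threshold in a kubelet.conf has a matching grace period (a diagnostic sketch; assumes jq is available wherever the file is inspected):

$ jq -r '.evictionSoft // {} | keys[]' kubelet.conf | while read k; do
    jq -e --arg k "$k" '.evictionSoftGracePeriod[$k]' kubelet.conf >/dev/null \
      || echo "missing grace period for $k"
  done

Any threshold listed under evictionSoft without a corresponding evictionSoftGracePeriod entry is printed; in the config pasted above, all five keys are paired.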

Comment 7 Vincent Shen 2022-05-13 05:35:48 UTC
Hi, sorry for the late reply. The new patch just got merged and it should fix these issues. We will have a new downstream release soon.

Comment 12 xiyuan 2022-05-31 04:29:43 UTC
$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-05-25-193227   True        False         179m    Cluster version is 4.11.0-0.nightly-2022-05-25-193227
$ oc get ip
NAME            CSV                           APPROVAL    APPROVED
install-6cg2j   compliance-operator.v0.1.52   Automatic   true
$ oc get csv
NAME                           DISPLAY                            VERSION   REPLACES   PHASE
compliance-operator.v0.1.52    Compliance Operator                0.1.52               Succeeded
elasticsearch-operator.5.4.2   OpenShift Elasticsearch Operator   5.4.2                Succeeded

$ oc apply -f -<<EOF
apiVersion: compliance.openshift.io/v1alpha1
kind: ScanSettingBinding
metadata:
  name: my-ssb-r
profiles:
  - name: ocp4-cis
    kind: Profile
    apiGroup: compliance.openshift.io/v1alpha1
  - name: ocp4-cis-node
    kind: Profile
    apiGroup: compliance.openshift.io/v1alpha1
settingsRef:
  name: default-auto-apply
  kind: ScanSetting
  apiGroup: compliance.openshift.io/v1alpha1
EOF

scansettingbinding.compliance.openshift.io/my-ssb-r created

$ oc get mcp -w
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-73b7ff2017113e3b10837c8ba5708074   False     True       False      3              0                   0                     0                      94m
worker   rendered-worker-05cc352994476d576bfe32ee05d54ac5   False     True       False      3              0                   0                     0                      94m
worker   rendered-worker-05cc352994476d576bfe32ee05d54ac5   False     True       False      3              1                   1                     0                      96m
worker   rendered-worker-05cc352994476d576bfe32ee05d54ac5   False     True       False      3              1                   1                     0                      97m
master   rendered-master-73b7ff2017113e3b10837c8ba5708074   False     True       False      3              1                   1                     0                      99m
worker   rendered-worker-05cc352994476d576bfe32ee05d54ac5   False     True       False      3              1                   2                     0                      99m
worker   rendered-worker-05cc352994476d576bfe32ee05d54ac5   False     True       False      3              2                   2                     0                      99m
...

master   rendered-master-73b7ff2017113e3b10837c8ba5708074   False     True       False      3              2                   2                     0                      111m
worker   rendered-worker-00942673568ca9a81884b3b66701add2   True      False      False      3              3                   3                     0                      111m
master   rendered-master-73b7ff2017113e3b10837c8ba5708074   False     True       False      3              2                   3                     0                      112m
master   rendered-master-9fab9add783b2721edb5241b8c9acb23   True      False      False      3              3                   3                     0                      112m
$ oc get suite -w
NAME       PHASE     RESULT
my-ssb-r   RUNNING   NOT-AVAILABLE
my-ssb-r   RUNNING   NOT-AVAILABLE
my-ssb-r   RUNNING   NOT-AVAILABLE
my-ssb-r   AGGREGATING   NOT-AVAILABLE
my-ssb-r   AGGREGATING   NOT-AVAILABLE
my-ssb-r   AGGREGATING   NOT-AVAILABLE
my-ssb-r   DONE          NON-COMPLIANT
my-ssb-r   DONE          NON-COMPLIANT

$ oc get mc
NAME                                                           GENERATEDBYCONTROLLER                      IGNITIONVERSION   AGE
00-master                                                      b52e75eabe58c8e13edf3948dc9509ea9f93b3d9   3.2.0             3h14m
00-worker                                                      b52e75eabe58c8e13edf3948dc9509ea9f93b3d9   3.2.0             3h14m
01-master-container-runtime                                    b52e75eabe58c8e13edf3948dc9509ea9f93b3d9   3.2.0             3h14m
01-master-kubelet                                              b52e75eabe58c8e13edf3948dc9509ea9f93b3d9   3.2.0             3h14m
01-worker-container-runtime                                    b52e75eabe58c8e13edf3948dc9509ea9f93b3d9   3.2.0             3h14m
01-worker-kubelet                                              b52e75eabe58c8e13edf3948dc9509ea9f93b3d9   3.2.0             3h14m
75-ocp4-cis-node-master-kubelet-enable-protect-kernel-sysctl                                              3.1.0             55m
75-ocp4-cis-node-worker-kubelet-enable-protect-kernel-sysctl                                              3.1.0             55m
99-master-fips                                                                                            3.2.0             3h18m
99-master-generated-kubelet                                    b52e75eabe58c8e13edf3948dc9509ea9f93b3d9   3.2.0             101m
99-master-generated-registries                                 b52e75eabe58c8e13edf3948dc9509ea9f93b3d9   3.2.0             3h14m
99-master-ssh                                                                                             3.2.0             3h18m
99-worker-fips                                                                                            3.2.0             3h18m
99-worker-generated-kubelet                                    b52e75eabe58c8e13edf3948dc9509ea9f93b3d9   3.2.0             101m
99-worker-generated-registries                                 b52e75eabe58c8e13edf3948dc9509ea9f93b3d9   3.2.0             3h14m
99-worker-ssh                                                                                             3.2.0             3h18m
rendered-master-0c647db20fdc963301525b4a8a00c3f6               b52e75eabe58c8e13edf3948dc9509ea9f93b3d9   3.2.0             55m
rendered-master-235a6dd9a0d73e311cdc21cd1fef71f9               b52e75eabe58c8e13edf3948dc9509ea9f93b3d9   3.2.0             3h14m
rendered-master-2a697c77660898065dff335f4472297a               b52e75eabe58c8e13edf3948dc9509ea9f93b3d9   3.2.0             55m
rendered-master-71646497579869b3718e1cef277d1fa2               b52e75eabe58c8e13edf3948dc9509ea9f93b3d9   3.2.0             54m
rendered-master-73b7ff2017113e3b10837c8ba5708074               b52e75eabe58c8e13edf3948dc9509ea9f93b3d9   3.2.0             175m
rendered-master-9fab9add783b2721edb5241b8c9acb23               b52e75eabe58c8e13edf3948dc9509ea9f93b3d9   3.2.0             101m
rendered-worker-00942673568ca9a81884b3b66701add2               b52e75eabe58c8e13edf3948dc9509ea9f93b3d9   3.2.0             101m
rendered-worker-05cc352994476d576bfe32ee05d54ac5               b52e75eabe58c8e13edf3948dc9509ea9f93b3d9   3.2.0             175m
rendered-worker-23aad77b69701757a79e5c4a840a5df8               b52e75eabe58c8e13edf3948dc9509ea9f93b3d9   3.2.0             3h14m
rendered-worker-51facbf4656efaf285a59d6e7442c85d               b52e75eabe58c8e13edf3948dc9509ea9f93b3d9   3.2.0             55m
rendered-worker-821640b98dc113b6cc7fc38d7e17a8cf               b52e75eabe58c8e13edf3948dc9509ea9f93b3d9   3.2.0             55m
rendered-worker-9cdf69f635e842ef58dded909894280a               b52e75eabe58c8e13edf3948dc9509ea9f93b3d9   3.2.0             54m
$ oc get kubeletconfig
NAME                                 AGE
compliance-operator-kubelet-master   101m
compliance-operator-kubelet-worker   101m
$ oc get kubeletconfig compliance-operator-kubelet-master -o yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  annotations:
    machineconfiguration.openshift.io/mc-name-suffix: ""
  creationTimestamp: "2022-05-31T02:43:15Z"
  finalizers:
  - 99-master-generated-kubelet
  generation: 19
  labels:
    compliance.openshift.io/scan-name: test-node-master
    compliance.openshift.io/suite: ocp4-kubelet-configure-tls-cipher-test
  name: compliance-operator-kubelet-master
  resourceVersion: "177532"
  uid: e2367857-3b1a-4068-89e5-d818f5fc2dfb
spec:
  kubeletConfig:
    eventRecordQPS: 10
    evictionHard:
      imagefs.available: 10%
      imagefs.inodesFree: 5%
      memory.available: 200Mi
      nodefs.available: 5%
      nodefs.inodesFree: 4%
    evictionPressureTransitionPeriod: 0s
    evictionSoft:
      imagefs.available: 15%
      imagefs.inodesFree: 10%
      memory.available: 500Mi
      nodefs.available: 10%
      nodefs.inodesFree: 5%
    evictionSoftGracePeriod:
      imagefs.available: 1m30s
      imagefs.inodesFree: 1m30s
      memory.available: 1m30s
      nodefs.available: 1m30s
      nodefs.inodesFree: 1m30s
    makeIPTablesUtilChains: true
    tlsCipherSuites:
    - TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
    - TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384
    - TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
    - TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256
  machineConfigPoolSelector:
    matchLabels:
      pools.operator.machineconfiguration.openshift.io/master: ""
status:
  conditions:
  - lastTransitionTime: "2022-05-31T03:44:39Z"
    message: Success
    status: "True"
    type: Success
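
To confirm that the generated kubelet MachineConfig carries the complete kubelet configuration, a verification sketch (the file path and data-URL encoding can vary between versions):

$ oc get mc 99-worker-generated-kubelet -o jsonpath='{.spec.config.storage.files[?(@.path=="/etc/kubernetes/kubelet.conf")].contents.source}'
  # decode the data: URL (URL-encoded or base64, depending on the version) and check that
  # evictionSoftGracePeriod has an entry for every key listed under evictionSoft

In the run above, both pools end up UPDATED=True with all machines ready, so the full configuration was rolled out.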

Comment 15 errata-xmlrpc 2022-06-06 14:39:50 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Compliance Operator bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:4657

Comment 16 Red Hat Bugzilla 2023-09-15 01:53:38 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 365 days