Bug 2071854
| Summary: | After applying KubeletConfig by Compliance operator - the Kubelet goes into NotReady | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Vladislav Walek <vwalek> |
| Component: | Compliance Operator | Assignee: | Vincent Shen <wenshen> |
| Status: | CLOSED ERRATA | QA Contact: | xiyuan |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 4.8 | CC: | antaylor, jhrozek, jmittapa, lbragsta, mrogers, suprs, wenshen, xiyuan |
| Target Milestone: | --- | | |
| Target Release: | 4.11.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2022-06-06 14:39:50 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Vladislav Walek
2022-04-04 23:38:49 UTC
Sorry for the inconvenience. I just went through the log, and I think this is related to a bug we recently discovered; we will work on a fix this sprint. The KubeletConfig was not fully rendered into a MachineConfig, because the remediation patched the same KubeletConfig over and over again and the Compliance Operator unpaused the pool too soon, so some nodes did not get a fully rendered MachineConfig. You can see this by comparing the two most recent rendered-worker-* MachineConfigs.

There are a couple of ways to temporarily band-aid this issue while a patch is on the way:

1. Manually create a KubeletConfig for the pool with the desired settings before the scans run, so that no KubeletConfig is generated by the operator.
2. Disable auto-apply remediation, manually pause the pool, apply all remediations at once, wait about 20 seconds, then unpause the pool (see the sketch below).
3. Use a tailored profile to disable those KubeletConfig rules, and create a separate profile containing only the KubeletConfig checks, with auto-apply remediation turned off.

Hello Vincent, thank you for the reply. I was checking the workaround, but I am worried because the pool is now being updated and it has already failed on the first two nodes. How can we undo this so we can apply the workaround? Should we pause the pool, apply the workaround of manually creating the KubeletConfig, and then unpause it? Or should we remove the compliance-generated KubeletConfig and force the pool to revert the changes, then pause it and apply the workaround? I am trying to work out what we can do now to stabilize the cluster.

Following up: we are actively reviewing the fix upstream, but it has not merged yet.

(In reply to Vincent Shen from comment #3)

Hello Vincent, I was checking the issue; however, I see that the configuration in the kubelet config on the node prevents the kubelet from starting. It seems the issue is also with the kubelet itself.
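For reference, a minimal illustrative sketch of option 2 above, not an official procedure: it assumes the operator runs in the openshift-compliance namespace, that the affected pool is worker, and that the binding uses a ScanSetting without auto-apply (for example default instead of default-auto-apply).

$ oc patch mcp worker --type merge -p '{"spec":{"paused":true}}'      # pause the worker pool
$ for r in $(oc -n openshift-compliance get complianceremediations -o name); do
    oc -n openshift-compliance patch "$r" --type merge -p '{"spec":{"apply":true}}'   # apply all remediations at once
  done
$ sleep 20                                                            # give the KubeletConfig controller time to render
$ oc patch mcp worker --type merge -p '{"spec":{"paused":false}}'     # unpause the pool

When the kubelet itself refuses to start, its logs and current config can be pulled from the affected node; a minimal sketch, where <node-name> is a placeholder:

$ oc debug node/<node-name> -- chroot /host journalctl -u kubelet --no-pager | tail -n 50
$ oc debug node/<node-name> -- chroot /host cat /etc/kubernetes/kubelet.conf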
This is my kubelet config where it fails to start:

# cat kubelet.conf
{
  "kind": "KubeletConfiguration",
  "apiVersion": "kubelet.config.k8s.io/v1beta1",
  "staticPodPath": "/etc/kubernetes/manifests",
  "syncFrequency": "0s",
  "fileCheckFrequency": "0s",
  "httpCheckFrequency": "0s",
  "tlsCipherSuites": [ "TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256", "TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256", "TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384", "TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384", "TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256", "TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256" ],
  "tlsMinVersion": "VersionTLS12",
  "rotateCertificates": true,
  "serverTLSBootstrap": true,
  "authentication": { "x509": { "clientCAFile": "/etc/kubernetes/kubelet-ca.crt" }, "webhook": { "cacheTTL": "0s" }, "anonymous": { "enabled": false } },
  "authorization": { "webhook": { "cacheAuthorizedTTL": "0s", "cacheUnauthorizedTTL": "0s" } },
  "eventRecordQPS": 10,
  "clusterDomain": "cluster.local",
  "clusterDNS": [ "172.30.0.10" ],
  "streamingConnectionIdleTimeout": "0s",
  "nodeStatusUpdateFrequency": "0s",
  "nodeStatusReportFrequency": "0s",
  "imageMinimumGCAge": "0s",
  "volumeStatsAggPeriod": "0s",
  "systemCgroups": "/system.slice",
  "cgroupRoot": "/",
  "cgroupDriver": "systemd",
  "cpuManagerReconcilePeriod": "0s",
  "runtimeRequestTimeout": "0s",
  "maxPods": 250,
  "kubeAPIQPS": 50,
  "kubeAPIBurst": 100,
  "serializeImagePulls": false,
  "evictionHard": { "imagefs.available": "10%", "imagefs.inodesFree": "5%", "memory.available": "200Mi", "nodefs.available": "5%", "nodefs.inodesFree": "4%" },
  "evictionSoft": { "imagefs.available": "15%", "imagefs.inodesFree": "10%", "memory.available": "500Mi", "nodefs.available": "10%", "nodefs.inodesFree": "5%" },
  "evictionSoftGracePeriod": { "imagefs.available": "1m30s", "imagefs.inodesFree": "1m30s", "memory.available": "1m30s", "nodefs.available": "1m30s", "nodefs.inodesFree": "1m30s" },
  "evictionPressureTransitionPeriod": "0s",
  "protectKernelDefaults": true,
  "makeIPTablesUtilChains": true,
  "featureGates": { "APIPriorityAndFairness": true, "CSIMigrationAWS": false, "CSIMigrationAzureDisk": false, "CSIMigrationAzureFile": false, "CSIMigrationGCE": false, "CSIMigrationOpenStack": false, "CSIMigrationvSphere": false, "DownwardAPIHugePages": true, "LegacyNodeRoleBehavior": false, "NodeDisruptionExclusion": true, "PodSecurity": true, "RotateKubeletServerCertificate": true, "ServiceNodeExclusion": true, "SupportPodPidsLimit": true },
  "memorySwap": {},
  "containerLogMaxSize": "50Mi",
  "systemReserved": { "ephemeral-storage": "1Gi" },
  "logging": { "flushFrequency": 0, "verbosity": 0, "options": { "json": { "infoBufferSize": "0" } } },
  "shutdownGracePeriod": "0s",
  "shutdownGracePeriodCriticalPods": "0s"
}

Hi, sorry for the late reply; the new patch just got merged; it should fix those issues. We will have a new downstream release soon.

$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.11.0-0.nightly-2022-05-25-193227 True False 179m Cluster version is 4.11.0-0.nightly-2022-05-25-193227
$ oc get ip
NAME CSV APPROVAL APPROVED
install-6cg2j compliance-operator.v0.1.52 Automatic true
$ oc get csv
NAME DISPLAY VERSION REPLACES PHASE
compliance-operator.v0.1.52 Compliance Operator 0.1.52 Succeeded
elasticsearch-operator.5.4.2 OpenShift Elasticsearch Operator 5.4.2 Succeeded
$ oc apply -f -<<EOF
apiVersion: compliance.openshift.io/v1alpha1
kind: ScanSettingBinding
metadata:
  name: my-ssb-r
profiles:
  - name: ocp4-cis
    kind: Profile
    apiGroup: compliance.openshift.io/v1alpha1
  - name: ocp4-cis-node
    kind: Profile
    apiGroup: compliance.openshift.io/v1alpha1
settingsRef:
  name: default-auto-apply
  kind: ScanSetting
  apiGroup: compliance.openshift.io/v1alpha1
EOF
scansettingbinding.compliance.openshift.io/my-ssb-r created
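As an optional check that is not part of the original log, the scans created by the binding and the overall suite phase can be inspected while the run is in progress; a sketch, assuming the operator's default openshift-compliance namespace:

$ oc -n openshift-compliance get compliancescan                       # one scan per profile/role in the binding
$ oc -n openshift-compliance get compliancesuite my-ssb-r -o jsonpath='{.status.phase}'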
$ oc get mcp -w
NAME CONFIG UPDATED UPDATING DEGRADED MACHINECOUNT READYMACHINECOUNT UPDATEDMACHINECOUNT DEGRADEDMACHINECOUNT AGE
master rendered-master-73b7ff2017113e3b10837c8ba5708074 False True False 3 0 0 0 94m
worker rendered-worker-05cc352994476d576bfe32ee05d54ac5 False True False 3 0 0 0 94m
worker rendered-worker-05cc352994476d576bfe32ee05d54ac5 False True False 3 1 1 0 96m
worker rendered-worker-05cc352994476d576bfe32ee05d54ac5 False True False 3 1 1 0 97m
master rendered-master-73b7ff2017113e3b10837c8ba5708074 False True False 3 1 1 0 99m
worker rendered-worker-05cc352994476d576bfe32ee05d54ac5 False True False 3 1 2 0 99m
worker rendered-worker-05cc352994476d576bfe32ee05d54ac5 False True False 3 2 2 0 99m
...
master rendered-master-73b7ff2017113e3b10837c8ba5708074 False True False 3 2 2 0 111m
worker rendered-worker-00942673568ca9a81884b3b66701add2 True False False 3 3 3 0 111m
master rendered-master-73b7ff2017113e3b10837c8ba5708074 False True False 3 2 3 0 112m
master rendered-master-9fab9add783b2721edb5241b8c9acb23 True False False 3 3 3 0 112m
$ oc get suite -w
NAME PHASE RESULT
my-ssb-r RUNNING NOT-AVAILABLE
my-ssb-r RUNNING NOT-AVAILABLE
my-ssb-r RUNNING NOT-AVAILABLE
my-ssb-r AGGREGATING NOT-AVAILABLE
my-ssb-r AGGREGATING NOT-AVAILABLE
my-ssb-r AGGREGATING NOT-AVAILABLE
my-ssb-r DONE NON-COMPLIANT
my-ssb-r DONE NON-COMPLIANT
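With the suite reporting NON-COMPLIANT, the failed checks and the remediations that the default-auto-apply setting applied can be listed; a sketch, assuming the standard Compliance Operator labels and namespace:

$ oc -n openshift-compliance get compliancecheckresults \
    -l compliance.openshift.io/suite=my-ssb-r,compliance.openshift.io/check-status=FAIL
$ oc -n openshift-compliance get complianceremediations -l compliance.openshift.io/suite=my-ssb-r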
$ oc get mc
NAME GENERATEDBYCONTROLLER IGNITIONVERSION AGE
00-master b52e75eabe58c8e13edf3948dc9509ea9f93b3d9 3.2.0 3h14m
00-worker b52e75eabe58c8e13edf3948dc9509ea9f93b3d9 3.2.0 3h14m
01-master-container-runtime b52e75eabe58c8e13edf3948dc9509ea9f93b3d9 3.2.0 3h14m
01-master-kubelet b52e75eabe58c8e13edf3948dc9509ea9f93b3d9 3.2.0 3h14m
01-worker-container-runtime b52e75eabe58c8e13edf3948dc9509ea9f93b3d9 3.2.0 3h14m
01-worker-kubelet b52e75eabe58c8e13edf3948dc9509ea9f93b3d9 3.2.0 3h14m
75-ocp4-cis-node-master-kubelet-enable-protect-kernel-sysctl 3.1.0 55m
75-ocp4-cis-node-worker-kubelet-enable-protect-kernel-sysctl 3.1.0 55m
99-master-fips 3.2.0 3h18m
99-master-generated-kubelet b52e75eabe58c8e13edf3948dc9509ea9f93b3d9 3.2.0 101m
99-master-generated-registries b52e75eabe58c8e13edf3948dc9509ea9f93b3d9 3.2.0 3h14m
99-master-ssh 3.2.0 3h18m
99-worker-fips 3.2.0 3h18m
99-worker-generated-kubelet b52e75eabe58c8e13edf3948dc9509ea9f93b3d9 3.2.0 101m
99-worker-generated-registries b52e75eabe58c8e13edf3948dc9509ea9f93b3d9 3.2.0 3h14m
99-worker-ssh 3.2.0 3h18m
rendered-master-0c647db20fdc963301525b4a8a00c3f6 b52e75eabe58c8e13edf3948dc9509ea9f93b3d9 3.2.0 55m
rendered-master-235a6dd9a0d73e311cdc21cd1fef71f9 b52e75eabe58c8e13edf3948dc9509ea9f93b3d9 3.2.0 3h14m
rendered-master-2a697c77660898065dff335f4472297a b52e75eabe58c8e13edf3948dc9509ea9f93b3d9 3.2.0 55m
rendered-master-71646497579869b3718e1cef277d1fa2 b52e75eabe58c8e13edf3948dc9509ea9f93b3d9 3.2.0 54m
rendered-master-73b7ff2017113e3b10837c8ba5708074 b52e75eabe58c8e13edf3948dc9509ea9f93b3d9 3.2.0 175m
rendered-master-9fab9add783b2721edb5241b8c9acb23 b52e75eabe58c8e13edf3948dc9509ea9f93b3d9 3.2.0 101m
rendered-worker-00942673568ca9a81884b3b66701add2 b52e75eabe58c8e13edf3948dc9509ea9f93b3d9 3.2.0 101m
rendered-worker-05cc352994476d576bfe32ee05d54ac5 b52e75eabe58c8e13edf3948dc9509ea9f93b3d9 3.2.0 175m
rendered-worker-23aad77b69701757a79e5c4a840a5df8 b52e75eabe58c8e13edf3948dc9509ea9f93b3d9 3.2.0 3h14m
rendered-worker-51facbf4656efaf285a59d6e7442c85d b52e75eabe58c8e13edf3948dc9509ea9f93b3d9 3.2.0 55m
rendered-worker-821640b98dc113b6cc7fc38d7e17a8cf b52e75eabe58c8e13edf3948dc9509ea9f93b3d9 3.2.0 55m
rendered-worker-9cdf69f635e842ef58dded909894280a b52e75eabe58c8e13edf3948dc9509ea9f93b3d9 3.2.0 54m
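Per the earlier comment, one way to confirm that the kubelet settings were fully rendered is to compare the two most recent rendered-worker-* MachineConfigs; a sketch (substitute the names from the listing above):

$ oc get mc -o name --sort-by=.metadata.creationTimestamp | grep rendered-worker | tail -n 2
$ oc get mc <older-rendered-worker> -o yaml > old.yaml
$ oc get mc <newer-rendered-worker> -o yaml > new.yaml
$ diff old.yaml new.yaml                                              # the newest render should contain the kubelet changes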
$ oc get kubeletconfig
NAME AGE
compliance-operator-kubelet-master 101m
compliance-operator-kubelet-worker 101m
$ oc get kubeletconfig compliance-operator-kubelet-master -o yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  annotations:
    machineconfiguration.openshift.io/mc-name-suffix: ""
  creationTimestamp: "2022-05-31T02:43:15Z"
  finalizers:
  - 99-master-generated-kubelet
  generation: 19
  labels:
    compliance.openshift.io/scan-name: test-node-master
    compliance.openshift.io/suite: ocp4-kubelet-configure-tls-cipher-test
  name: compliance-operator-kubelet-master
  resourceVersion: "177532"
  uid: e2367857-3b1a-4068-89e5-d818f5fc2dfb
spec:
  kubeletConfig:
    eventRecordQPS: 10
    evictionHard:
      imagefs.available: 10%
      imagefs.inodesFree: 5%
      memory.available: 200Mi
      nodefs.available: 5%
      nodefs.inodesFree: 4%
    evictionPressureTransitionPeriod: 0s
    evictionSoft:
      imagefs.available: 15%
      imagefs.inodesFree: 10%
      memory.available: 500Mi
      nodefs.available: 10%
      nodefs.inodesFree: 5%
    evictionSoftGracePeriod:
      imagefs.available: 1m30s
      imagefs.inodesFree: 1m30s
      memory.available: 1m30s
      nodefs.available: 1m30s
      nodefs.inodesFree: 1m30s
    makeIPTablesUtilChains: true
    tlsCipherSuites:
    - TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
    - TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384
    - TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
    - TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256
  machineConfigPoolSelector:
    matchLabels:
      pools.operator.machineconfiguration.openshift.io/master: ""
status:
  conditions:
  - lastTransitionTime: "2022-05-31T03:44:39Z"
    message: Success
    status: "True"
    type: Success
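To double-check that this KubeletConfig made it into the MCO-generated MachineConfig, the embedded kubelet config file can be located in 99-master-generated-kubelet; a sketch, assuming the file path /etc/kubernetes/kubelet.conf (the contents.source value is a data: URL and may be URL-encoded or base64-encoded):

$ oc get mc 99-master-generated-kubelet -o jsonpath='{.spec.config.storage.files[*].path}'
$ oc get mc 99-master-generated-kubelet \
    -o jsonpath='{.spec.config.storage.files[?(@.path=="/etc/kubernetes/kubelet.conf")].contents.source}'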
For the issues in https://bugzilla.redhat.com/show_bug.cgi?id=2071854#c10, bug https://bugzilla.redhat.com/show_bug.cgi?id=2091546 was created to track them.

Per https://bugzilla.redhat.com/show_bug.cgi?id=2071854#c10 and https://bugzilla.redhat.com/show_bug.cgi?id=2071854#c12, moving this bug to verified.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Compliance Operator bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:4657

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 365 days.