Bug 2071854
Summary: | After applying KubeletConfig by Compliance operator - the Kubelet goes into NotReady | |
---|---|---|---
Product: | OpenShift Container Platform | Reporter: | Vladislav Walek <vwalek>
Component: | Compliance Operator | Assignee: | Vincent Shen <wenshen>
Status: | CLOSED ERRATA | QA Contact: | xiyuan
Severity: | medium | Docs Contact: |
Priority: | medium | |
Version: | 4.8 | CC: | antaylor, jhrozek, jmittapa, lbragsta, mrogers, suprs, wenshen, xiyuan
Target Milestone: | --- | |
Target Release: | 4.11.0 | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | No Doc Update
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2022-06-06 14:39:50 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Description
Vladislav Walek
2022-04-04 23:38:49 UTC
Sorry for the inconvenience. I just went through the log, and I think this is related to a bug we recently discovered; we will work on a fix this sprint. The KubeletConfig was not being fully rendered into a MachineConfig, because the remediation patches the same KubeletConfig over and over and the Compliance Operator unpauses the pool too soon, so some nodes did not get a fully rendered MachineConfig. You can see this by comparing the two most recent rendered-worker-* configs.

There are a couple of ways to temporarily band-aid this issue while a patch is on the way:

1. Manually create a KubeletConfig for the pool with the desired settings before the scans, so that no KubeletConfig is generated by the remediations.
2. Disable auto-apply remediation, manually pause the pool, apply all remediations at once, wait about 20 seconds, then unpause the pool (a command sketch follows further below).
3. Use a tailored profile to disable those KubeletConfig rules, and create a separate profile that contains only the KubeletConfig checks with auto-apply remediation off.

Hello Vincent, thank you for the reply. I was checking the workaround, but I am worried because the pool is being updated now and it already failed on the first two nodes. How can we undo that so we can apply the workaround? Should we pause the pool, apply the workaround of manually creating the KubeletConfig, and then unpause it? Or should we remove the Compliance Operator KubeletConfig and force the pool to revert the changes, then pause it and apply the workaround? I am trying to figure out what we can do now to stabilize the cluster.

Following up: we are actively reviewing the fix upstream, but it has not merged yet.

(In reply to Vincent Shen from comment #3, quoting the workarounds above)

Hello Vincent, I was checking the issue; however, I see a problem where the configuration in the kubelet config on the node prevents the kubelet from starting. It seems the issue is also with the kubelet itself.
This is my kubelet config where it fails to start:

# cat kubelet.conf
{
  "kind": "KubeletConfiguration",
  "apiVersion": "kubelet.config.k8s.io/v1beta1",
  "staticPodPath": "/etc/kubernetes/manifests",
  "syncFrequency": "0s",
  "fileCheckFrequency": "0s",
  "httpCheckFrequency": "0s",
  "tlsCipherSuites": [
    "TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256",
    "TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256",
    "TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384",
    "TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384",
    "TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256",
    "TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256"
  ],
  "tlsMinVersion": "VersionTLS12",
  "rotateCertificates": true,
  "serverTLSBootstrap": true,
  "authentication": {
    "x509": { "clientCAFile": "/etc/kubernetes/kubelet-ca.crt" },
    "webhook": { "cacheTTL": "0s" },
    "anonymous": { "enabled": false }
  },
  "authorization": {
    "webhook": { "cacheAuthorizedTTL": "0s", "cacheUnauthorizedTTL": "0s" }
  },
  "eventRecordQPS": 10,
  "clusterDomain": "cluster.local",
  "clusterDNS": [ "172.30.0.10" ],
  "streamingConnectionIdleTimeout": "0s",
  "nodeStatusUpdateFrequency": "0s",
  "nodeStatusReportFrequency": "0s",
  "imageMinimumGCAge": "0s",
  "volumeStatsAggPeriod": "0s",
  "systemCgroups": "/system.slice",
  "cgroupRoot": "/",
  "cgroupDriver": "systemd",
  "cpuManagerReconcilePeriod": "0s",
  "runtimeRequestTimeout": "0s",
  "maxPods": 250,
  "kubeAPIQPS": 50,
  "kubeAPIBurst": 100,
  "serializeImagePulls": false,
  "evictionHard": {
    "imagefs.available": "10%",
    "imagefs.inodesFree": "5%",
    "memory.available": "200Mi",
    "nodefs.available": "5%",
    "nodefs.inodesFree": "4%"
  },
  "evictionSoft": {
    "imagefs.available": "15%",
    "imagefs.inodesFree": "10%",
    "memory.available": "500Mi",
    "nodefs.available": "10%",
    "nodefs.inodesFree": "5%"
  },
  "evictionSoftGracePeriod": {
    "imagefs.available": "1m30s",
    "imagefs.inodesFree": "1m30s",
    "memory.available": "1m30s",
    "nodefs.available": "1m30s",
    "nodefs.inodesFree": "1m30s"
  },
  "evictionPressureTransitionPeriod": "0s",
  "protectKernelDefaults": true,
  "makeIPTablesUtilChains": true,
  "featureGates": {
    "APIPriorityAndFairness": true,
    "CSIMigrationAWS": false,
    "CSIMigrationAzureDisk": false,
    "CSIMigrationAzureFile": false,
    "CSIMigrationGCE": false,
    "CSIMigrationOpenStack": false,
    "CSIMigrationvSphere": false,
    "DownwardAPIHugePages": true,
    "LegacyNodeRoleBehavior": false,
    "NodeDisruptionExclusion": true,
    "PodSecurity": true,
    "RotateKubeletServerCertificate": true,
    "ServiceNodeExclusion": true,
    "SupportPodPidsLimit": true
  },
  "memorySwap": {},
  "containerLogMaxSize": "50Mi",
  "systemReserved": { "ephemeral-storage": "1Gi" },
  "logging": {
    "flushFrequency": 0,
    "verbosity": 0,
    "options": { "json": { "infoBufferSize": "0" } }
  },
  "shutdownGracePeriod": "0s",
  "shutdownGracePeriodCriticalPods": "0s"
}

Hi, sorry for the late reply; the new patch just got merged and should fix those issues. We will have a new downstream release soon.
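A minimal sketch of workaround 2 above, assuming the operator runs in the default openshift-compliance namespace, that the affected pool is worker, and that auto-apply remediation has been disabled in the ScanSetting; object and pool names should be adjusted for your cluster:

$ # Pause the pool so all remediations land in a single rendered MachineConfig
$ oc patch mcp worker --type merge -p '{"spec":{"paused":true}}'
$ # Apply every outstanding remediation at once
$ for r in $(oc get complianceremediations -n openshift-compliance -o name); do
>   oc patch "$r" -n openshift-compliance --type merge -p '{"spec":{"apply":true}}'
> done
$ # Give the KubeletConfig controller a moment to render, then unpause the pool
$ sleep 20
$ oc patch mcp worker --type merge -p '{"spec":{"paused":false}}'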
$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-05-25-193227   True        False         179m    Cluster version is 4.11.0-0.nightly-2022-05-25-193227

$ oc get ip
NAME            CSV                           APPROVAL    APPROVED
install-6cg2j   compliance-operator.v0.1.52   Automatic   true

$ oc get csv
NAME                           DISPLAY                            VERSION   REPLACES   PHASE
compliance-operator.v0.1.52    Compliance Operator                0.1.52               Succeeded
elasticsearch-operator.5.4.2   OpenShift Elasticsearch Operator   5.4.2                Succeeded

$ oc apply -f - <<EOF
apiVersion: compliance.openshift.io/v1alpha1
kind: ScanSettingBinding
metadata:
  name: my-ssb-r
profiles:
  - name: ocp4-cis
    kind: Profile
    apiGroup: compliance.openshift.io/v1alpha1
  - name: ocp4-cis-node
    kind: Profile
    apiGroup: compliance.openshift.io/v1alpha1
settingsRef:
  name: default-auto-apply
  kind: ScanSetting
  apiGroup: compliance.openshift.io/v1alpha1
EOF
scansettingbinding.compliance.openshift.io/my-ssb-r created

$ oc get mcp -w
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-73b7ff2017113e3b10837c8ba5708074   False     True       False      3              0                   0                     0                      94m
worker   rendered-worker-05cc352994476d576bfe32ee05d54ac5   False     True       False      3              0                   0                     0                      94m
worker   rendered-worker-05cc352994476d576bfe32ee05d54ac5   False     True       False      3              1                   1                     0                      96m
worker   rendered-worker-05cc352994476d576bfe32ee05d54ac5   False     True       False      3              1                   1                     0                      97m
master   rendered-master-73b7ff2017113e3b10837c8ba5708074   False     True       False      3              1                   1                     0                      99m
worker   rendered-worker-05cc352994476d576bfe32ee05d54ac5   False     True       False      3              1                   2                     0                      99m
worker   rendered-worker-05cc352994476d576bfe32ee05d54ac5   False     True       False      3              2                   2                     0                      99m
...
master   rendered-master-73b7ff2017113e3b10837c8ba5708074   False     True       False      3              2                   2                     0                      111m
worker   rendered-worker-00942673568ca9a81884b3b66701add2   True      False      False      3              3                   3                     0                      111m
master   rendered-master-73b7ff2017113e3b10837c8ba5708074   False     True       False      3              2                   3                     0                      112m
master   rendered-master-9fab9add783b2721edb5241b8c9acb23   True      False      False      3              3                   3                     0                      112m

$ oc get suite -w
NAME       PHASE         RESULT
my-ssb-r   RUNNING       NOT-AVAILABLE
my-ssb-r   RUNNING       NOT-AVAILABLE
my-ssb-r   RUNNING       NOT-AVAILABLE
my-ssb-r   AGGREGATING   NOT-AVAILABLE
my-ssb-r   AGGREGATING   NOT-AVAILABLE
my-ssb-r   AGGREGATING   NOT-AVAILABLE
my-ssb-r   DONE          NON-COMPLIANT
my-ssb-r   DONE          NON-COMPLIANT
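Not part of the original transcript: once the suite reports DONE, the check results and remediations can also be listed as a sanity check. A rough sketch, assuming the operator's default openshift-compliance namespace and its usual check-status labelling:

$ # Checks that failed and therefore produced remediations
$ oc get compliancecheckresults -n openshift-compliance -l compliance.openshift.io/check-status=FAIL
$ # Remediations created by the scans, and whether each has been applied
$ oc get complianceremediations -n openshift-compliance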
$ oc get mc
NAME                                                            GENERATEDBYCONTROLLER                      IGNITIONVERSION   AGE
00-master                                                       b52e75eabe58c8e13edf3948dc9509ea9f93b3d9   3.2.0             3h14m
00-worker                                                       b52e75eabe58c8e13edf3948dc9509ea9f93b3d9   3.2.0             3h14m
01-master-container-runtime                                     b52e75eabe58c8e13edf3948dc9509ea9f93b3d9   3.2.0             3h14m
01-master-kubelet                                               b52e75eabe58c8e13edf3948dc9509ea9f93b3d9   3.2.0             3h14m
01-worker-container-runtime                                     b52e75eabe58c8e13edf3948dc9509ea9f93b3d9   3.2.0             3h14m
01-worker-kubelet                                               b52e75eabe58c8e13edf3948dc9509ea9f93b3d9   3.2.0             3h14m
75-ocp4-cis-node-master-kubelet-enable-protect-kernel-sysctl                                               3.1.0             55m
75-ocp4-cis-node-worker-kubelet-enable-protect-kernel-sysctl                                               3.1.0             55m
99-master-fips                                                                                             3.2.0             3h18m
99-master-generated-kubelet                                     b52e75eabe58c8e13edf3948dc9509ea9f93b3d9   3.2.0             101m
99-master-generated-registries                                  b52e75eabe58c8e13edf3948dc9509ea9f93b3d9   3.2.0             3h14m
99-master-ssh                                                                                              3.2.0             3h18m
99-worker-fips                                                                                             3.2.0             3h18m
99-worker-generated-kubelet                                     b52e75eabe58c8e13edf3948dc9509ea9f93b3d9   3.2.0             101m
99-worker-generated-registries                                  b52e75eabe58c8e13edf3948dc9509ea9f93b3d9   3.2.0             3h14m
99-worker-ssh                                                                                              3.2.0             3h18m
rendered-master-0c647db20fdc963301525b4a8a00c3f6                b52e75eabe58c8e13edf3948dc9509ea9f93b3d9   3.2.0             55m
rendered-master-235a6dd9a0d73e311cdc21cd1fef71f9                b52e75eabe58c8e13edf3948dc9509ea9f93b3d9   3.2.0             3h14m
rendered-master-2a697c77660898065dff335f4472297a                b52e75eabe58c8e13edf3948dc9509ea9f93b3d9   3.2.0             55m
rendered-master-71646497579869b3718e1cef277d1fa2                b52e75eabe58c8e13edf3948dc9509ea9f93b3d9   3.2.0             54m
rendered-master-73b7ff2017113e3b10837c8ba5708074                b52e75eabe58c8e13edf3948dc9509ea9f93b3d9   3.2.0             175m
rendered-master-9fab9add783b2721edb5241b8c9acb23                b52e75eabe58c8e13edf3948dc9509ea9f93b3d9   3.2.0             101m
rendered-worker-00942673568ca9a81884b3b66701add2                b52e75eabe58c8e13edf3948dc9509ea9f93b3d9   3.2.0             101m
rendered-worker-05cc352994476d576bfe32ee05d54ac5                b52e75eabe58c8e13edf3948dc9509ea9f93b3d9   3.2.0             175m
rendered-worker-23aad77b69701757a79e5c4a840a5df8                b52e75eabe58c8e13edf3948dc9509ea9f93b3d9   3.2.0             3h14m
rendered-worker-51facbf4656efaf285a59d6e7442c85d                b52e75eabe58c8e13edf3948dc9509ea9f93b3d9   3.2.0             55m
rendered-worker-821640b98dc113b6cc7fc38d7e17a8cf                b52e75eabe58c8e13edf3948dc9509ea9f93b3d9   3.2.0             55m
rendered-worker-9cdf69f635e842ef58dded909894280a                b52e75eabe58c8e13edf3948dc9509ea9f93b3d9   3.2.0             54m

[xiyuan@MiWiFi-RA69-srv func]$ oc get kube
kubeapiservers.operator.openshift.io                 kubeletconfigs.machineconfiguration.openshift.io
kubestorageversionmigrators.operator.openshift.io    kubecontrollermanagers.operator.openshift.io
kubeschedulers.operator.openshift.io

$ oc get kubeletconfig
NAME                                 AGE
compliance-operator-kubelet-master   101m
compliance-operator-kubelet-worker   101m

$ oc get kubeletconfig compliance-operator-kubelet-master -o yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  annotations:
    machineconfiguration.openshift.io/mc-name-suffix: ""
  creationTimestamp: "2022-05-31T02:43:15Z"
  finalizers:
  - 99-master-generated-kubelet
  generation: 19
  labels:
    compliance.openshift.io/scan-name: test-node-master
    compliance.openshift.io/suite: ocp4-kubelet-configure-tls-cipher-test
  name: compliance-operator-kubelet-master
  resourceVersion: "177532"
  uid: e2367857-3b1a-4068-89e5-d818f5fc2dfb
spec:
  kubeletConfig:
    eventRecordQPS: 10
    evictionHard:
      imagefs.available: 10%
      imagefs.inodesFree: 5%
      memory.available: 200Mi
      nodefs.available: 5%
      nodefs.inodesFree: 4%
    evictionPressureTransitionPeriod: 0s
    evictionSoft:
      imagefs.available: 15%
      imagefs.inodesFree: 10%
      memory.available: 500Mi
      nodefs.available: 10%
      nodefs.inodesFree: 5%
    evictionSoftGracePeriod:
      imagefs.available: 1m30s
      imagefs.inodesFree: 1m30s
      memory.available: 1m30s
      nodefs.available: 1m30s
      nodefs.inodesFree: 1m30s
    makeIPTablesUtilChains: true
    tlsCipherSuites:
    - TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
    - TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384
    - TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
    - TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256
  machineConfigPoolSelector:
    matchLabels:
      pools.operator.machineconfiguration.openshift.io/master: ""
status:
  conditions:
  - lastTransitionTime: "2022-05-31T03:44:39Z"
    message: Success
    status: "True"
    type: Success

For the issues in https://bugzilla.redhat.com/show_bug.cgi?id=2071854#c10, bug https://bugzilla.redhat.com/show_bug.cgi?id=2091546 was created to track them. Per https://bugzilla.redhat.com/show_bug.cgi?id=2071854#c10 and https://bugzilla.redhat.com/show_bug.cgi?id=2071854#c12, moving this bug to verified.
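For completeness (illustrative, not taken from the bug's verification logs), a short sketch of how one might spot-check that the worker nodes picked up the merged kubelet configuration and stayed Ready; the node name is a placeholder:

$ # Rendered config the worker pool has converged on
$ oc get mcp worker -o jsonpath='{.status.configuration.name}{"\n"}'
$ # All worker nodes should report Ready
$ oc get nodes -l node-role.kubernetes.io/worker=
$ # Spot-check the merged kubelet.conf and kubelet health on one node
$ oc debug node/<worker-node> -- chroot /host cat /etc/kubernetes/kubelet.conf
$ oc debug node/<worker-node> -- chroot /host systemctl is-active kubelet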
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Compliance Operator bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:4657

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 365 days.