Bug 1782152
| Summary: | Pull CI: Cluster operator machine-config Degraded is True with RequiredPoolsFailed: machineconfig.machineconfiguration.openshift.io rendered-master-* not found | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Vadim Rutkovsky <vrutkovs> |
| Component: | Machine Config Operator | Assignee: | Colin Walters <cwalters> |
| Status: | CLOSED ERRATA | QA Contact: | Michael Nguyen <mnguyen> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.2.z | CC: | cwalters, kgarriso, mnguyen |
| Target Milestone: | --- | | |
| Target Release: | 4.2.z | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | 1782149 | Environment: | |
| Last Closed: | 2020-02-12 12:16:16 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1781708, 1782149 | | |
| Bug Blocks: | | | |
Description
Vadim Rutkovsky
2019-12-11 09:52:00 UTC
Verified on 4.2.0-0.nightly-2020-02-03-234322. The masters and workers are using the same /etc/containers/storage.conf with the comment showing it was generated by the MCO containerruntimeconfig controller.

```
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.2.0-0.nightly-2020-02-03-234322   True        False         93m     Cluster version is 4.2.0-0.nightly-2020-02-03-234322

$ oc get node
NAME                                         STATUS   ROLES    AGE    VERSION
ip-10-0-130-217.us-west-2.compute.internal   Ready    master   108m   v1.14.6+0a21dd3b3
ip-10-0-130-47.us-west-2.compute.internal    Ready    worker   101m   v1.14.6+0a21dd3b3
ip-10-0-139-157.us-west-2.compute.internal   Ready    master   107m   v1.14.6+0a21dd3b3
ip-10-0-143-116.us-west-2.compute.internal   Ready    worker   101m   v1.14.6+0a21dd3b3
ip-10-0-147-60.us-west-2.compute.internal    Ready    master   107m   v1.14.6+0a21dd3b3
ip-10-0-148-107.us-west-2.compute.internal   Ready    worker   101m   v1.14.6+0a21dd3b3

$ oc debug node/ip-10-0-130-217.us-west-2.compute.internal
Starting pod/ip-10-0-130-217us-west-2computeinternal-debug ...
To use host binaries, run `chroot /host`
If you don't see a command prompt, try pressing enter.
sh-4.2# exit
exit
Removing debug pod ...

$ oc debug node/ip-10-0-130-217.us-west-2.compute.internal -- chroot /host cat /etc/containers/storage.conf
Starting pod/ip-10-0-130-217us-west-2computeinternal-debug ...
To use host binaries, run `chroot /host`
# This file is generated by the Machine Config Operator's containerruntimeconfig controller.
#
# storage.conf is the configuration file for all tools
# that share the containers/storage libraries
# See man 5 containers-storage.conf for more information
# The "container storage" table contains all of the server options.
[storage]

# Default Storage Driver
driver = "overlay"

# Temporary storage location
runroot = "/var/run/containers/storage"

# Primary Read/Write location of container storage
graphroot = "/var/lib/containers/storage"

[storage.options]
# Storage options to be passed to underlying storage drivers

# AdditionalImageStores is used to pass paths to additional Read/Only image stores
# Must be comma separated list.
additionalimagestores = [
]

# Size is used to set a maximum size of the container image. Only supported by
# certain container storage drivers.
size = ""

# OverrideKernelCheck tells the driver to ignore kernel checks based on kernel version
override_kernel_check = "true"

# Remap-UIDs/GIDs is the mapping from UIDs/GIDs as they should appear inside of
# a container, to UIDs/GIDs as they should appear outside of the container, and
# the length of the range of UIDs/GIDs. Additional mapped sets can be listed
# and will be heeded by libraries, but there are limits to the number of
# mappings which the kernel will allow when you later attempt to run a
# container.
#
# remap-uids = 0:1668442479:65536
# remap-gids = 0:1668442479:65536

# Remap-User/Group is a name which can be used to look up one or more UID/GID
# ranges in the /etc/subuid or /etc/subgid file. Mappings are set up starting
# with an in-container ID of 0 and the a host-level ID taken from the lowest
# range that matches the specified name, and using the length of that range.
# Additional ranges are then assigned, using the ranges which specify the
# lowest host-level IDs first, to the lowest not-yet-mapped container-level ID,
# until all of the entries have been used for maps.
#
# remap-user = "storage"
# remap-group = "storage"

[storage.options.thinpool]
# Storage Options for thinpool

# autoextend_percent determines the amount by which pool needs to be
# grown. This is specified in terms of % of pool size. So a value of 20 means
# that when threshold is hit, pool will be grown by 20% of existing
# pool size.
# autoextend_percent = "20"

# autoextend_threshold determines the pool extension threshold in terms
# of percentage of pool size. For example, if threshold is 60, that means when
# pool is 60% full, threshold has been hit.
# autoextend_threshold = "80"

# basesize specifies the size to use when creating the base device, which
# limits the size of images and containers.
# basesize = "10G"

# blocksize specifies a custom blocksize to use for the thin pool.
# blocksize="64k"

# directlvm_device specifies a custom block storage device to use for the
# thin pool. Required if you setup devicemapper
# directlvm_device = ""

# directlvm_device_force wipes device even if device already has a filesystem
# directlvm_device_force = "True"

# fs specifies the filesystem type to use for the base device.
# fs="xfs"

# log_level sets the log level of devicemapper.
# 0: LogLevelSuppress 0 (Default)
# 2: LogLevelFatal
# 3: LogLevelErr
# 4: LogLevelWarn
# 5: LogLevelNotice
# 6: LogLevelInfo
# 7: LogLevelDebug
# log_level = "7"

# min_free_space specifies the min free space percent in a thin pool require for
# new device creation to succeed. Valid values are from 0% - 99%.
# Value 0% disables
# min_free_space = "10%"

# mkfsarg specifies extra mkfs arguments to be used when creating the base
# device.
# mkfsarg = ""

# mountopt specifies extra mount options used when mounting the thin devices.
# mountopt = ""

# use_deferred_removal Marking device for deferred removal
# use_deferred_removal = "True"

# use_deferred_deletion Marking device for deferred deletion
# use_deferred_deletion = "True"

# xfs_nospace_max_retries specifies the maximum number of retries XFS should
# attempt to complete IO when ENOSPC (no space) error is returned by
# underlying storage device.
# xfs_nospace_max_retries = "0"
Removing debug pod ...
```

The same command against the worker node returned an identical file:

```
$ oc debug node/ip-10-0-130-47.us-west-2.compute.internal -- chroot /host cat /etc/containers/storage.conf
Starting pod/ip-10-0-130-47us-west-2computeinternal-debug ...
To use host binaries, run `chroot /host`
# This file is generated by the Machine Config Operator's containerruntimeconfig controller.
[...identical to the master output above...]
Removing debug pod ...
```

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0395

This Assessment/Impact Template is still in progress, but I wanted to get what I already have into the BZ to not block the process:
Generic bug assessment:
What symptoms (in Telemetry, Insights, etc.) does a cluster experiencing this bug exhibit?
The upgrade from 4.2.16 -> 4.3.0 fails.
The MCO is Degraded, and the MCD logs show:

```
I1210 16:15:51.104169   11181 daemon.go:955] Validating against pending config rendered-master-bad39270d7f2359e7d7c35c302c3178c
E1210 16:15:51.105286   11181 daemon.go:1350] content mismatch for file /etc/containers/storage.conf:
```
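For reference, a minimal sketch of how one might pull these lines from the daemon on an affected node (the pod name below is a placeholder; the MCD pods live in the openshift-machine-config-operator namespace):

```sh
# Find the machine-config-daemon pod running on the affected node,
# then search its log for the mismatch message.
oc -n openshift-machine-config-operator get pods -o wide | grep machine-config-daemon
oc -n openshift-machine-config-operator logs machine-config-daemon-xxxxx | grep 'content mismatch'
```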
What kind of clusters are impacted because of the bug?
Clusters upgrading from 4.2.16 -> 4.3.0.
What is the impact of the bug on the cluster?
The MCO is degraded, which prevents the upgrade from finishing.
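For anyone triaging, the degraded state is visible with standard commands (a minimal sketch; exact output columns vary by release):

```sh
# The machine-config ClusterOperator reports Degraded=True with the
# RequiredPoolsFailed reason, and the affected pool shows DEGRADED=True.
oc get clusteroperator machine-config
oc get machineconfigpools
```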
What cluster functionality is degraded while hitting the bug?
The Machine Config Operator is Degraded.
Might any customer data be lost because of the bug?
Don’t think so
Is it possible to recover the cluster from the bug?
Requires extensive cluster-admin intervention
Yes, but currently the way to fix it is to manually edit the /etc/containers/storage.conf file (via oc debug node…) on each affected node to have the right/expected contents from the newest rendered-XXX machine config; detailed workaround steps are in a later comment below.
The MCO team is currently trying to find a better workaround, as this is a time-consuming fix; we will update the bug if we find one.
What is the observed rate of failure we see in CI?
How long before the bug is fixed?
The bug has been fixed in 4.2.18
Update-specific assessment:
What is the expected rate of the failure (%) for vulnerable clusters which attempt the update?
We do not have a specific %, but historically this bug took down promotion CI runs, which is how it was discovered and subsequently fixed. We expect that any cluster upgrading from 4.2.16 -> 4.3.0 can hit this, as it does not depend on customer action and is the result of both the RPM and the MCO updating storage.conf and subsequently failing the MCD checks.
Picking up the fix in 4.2.18 will prevent this from happening
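A minimal sketch of picking up the fixed release, assuming 4.2.18 is visible in the cluster's update channel:

```sh
# List the updates the cluster can see, then request the fixed release.
oc adm upgrade
oc adm upgrade --to=4.2.18
```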
Does the upgrade succeed?
No
Is the cluster usable after the upgrade?
All of the other operators have upgraded; it is the MCO that is degraded, which will prevent changes via MachineConfigs and will block further upgrades from proceeding until the degraded state is fixed.
Relevant Other Links:
See: https://github.com/openshift/machine-config-operator/issues/1319
Just updating here the specific manual workaround instructions. This is to be done for each node.

If you hit this bug, in the MCD logs you should see:

```
I0220 01:55:27.683493  192713 daemon.go:955] Validating against pending config rendered-master-1234abcd
E0220 01:55:27.684114  192713 daemon.go:1350] content mismatch for file /etc/containers/storage.conf:
```

First, dump the pending rendered config:

```
$ oc get machineconfig rendered-master-1234abcd -o yaml
```

Copy the storage.conf file contents; they will be URL-encoded and look like:

```
%23%20storage.conf%20is%20the%20configuration%20file%20for......
```

Take that URL-encoded content and use a decoding tool (like https://www.url-encode-decode.com/) to decode the file contents correctly. Then hop onto your target node:

```
$ oc debug -t node/ip-123...
...
sh-4.2# chroot /host
sh-4.2# vi /etc/containers/storage.conf
```

Paste the decoded content (watch out here for any extra newlines accidentally added, etc.). After you are finished, exit the debug shell and restart the machine-config-daemon we looked at above:

```
$ oc delete pod -n openshift-machine-config-operator machine-config-daemon-123
```

After it restarts, you should see in the restarted MCD logs:

```
I0220 02:31:22.536307  250361 daemon.go:955] Validating against pending config rendered-master-1234abcd
I0220 02:31:22.544403  250361 daemon.go:971] Validated on-disk state
```
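As a possible shortcut for the copy/decode step above, a minimal sketch that pulls the file straight out of the rendered config and URL-decodes it locally. This assumes the rendered config embeds the file as a URL-encoded `data:,` URI (as Ignition spec 2 configs do); rendered-master-1234abcd is the placeholder name from the logs above:

```sh
# Extract the storage.conf entry from the rendered MachineConfig and decode it.
# rendered-master-1234abcd is a placeholder; substitute your pending config name.
oc get machineconfig rendered-master-1234abcd \
  -o jsonpath='{.spec.config.storage.files[?(@.path=="/etc/containers/storage.conf")].contents.source}' \
  | sed 's/^data:,//' \
  | python3 -c 'import sys, urllib.parse; sys.stdout.write(urllib.parse.unquote(sys.stdin.read()))'
```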