Bug 1782152 - Pull CI: Cluster operator machine-config Degraded is True with RequiredPoolsFailed: machineconfig.machineconfiguration.openshift.io rendered-master-* not found
Summary: Pull CI: Cluster operator machine-config Degraded is True with RequiredPoolsF...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Machine Config Operator
Version: 4.2.z
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: 4.2.z
Assignee: Colin Howell Walters
QA Contact: Michael Nguyen
URL:
Whiteboard:
Depends On: 1781708 1782149
Blocks:
 
Reported: 2019-12-11 09:52 UTC by Vadim Rutkovsky
Modified: 2020-02-20 03:09 UTC
3 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1782149
Environment:
Last Closed: 2020-02-12 12:16:16 UTC
Target Upstream Version:


Attachments


Links
System ID Priority Status Summary Last Updated
Github openshift machine-config-operator pull 1323 None closed Bug 1782152: templates/container-storage: Add a "this is generated" note 2020-05-14 09:11:59 UTC
Red Hat Product Errata RHBA-2020:0395 None None None 2020-02-12 12:16:42 UTC

Description Vadim Rutkovsky 2019-12-11 09:52:00 UTC
+++ This bug was initially created as a clone of Bug #1782149 +++

This bug was initially created as a copy of Bug #1781708


Description of problem:
A number of pull-ci-* jobs are failing with:

level=error msg="Cluster operator machine-config Degraded is True with RequiredPoolsFailed: Failed to resync 0.0.1-2019-12-10-092928 because: timed out waiting for the condition during syncRequiredMachineConfigPools: pool master has not progressed to latest configuration: configuration status for pool master is empty: pool is degraded because nodes fail with \"3 nodes are reporting degraded status on sync\": \"Node ip-10-0-129-171.ec2.internal is reporting: \\\"machineconfig.machineconfiguration.openshift.io \\\\\\\"rendered-master-9ffdae4ce3763dbc967f7e9e041d4de1\\\\\\\" not found\\\", Node ip-10-0-145-21.ec2.internal is reporting: \\\"machineconfig.machineconfiguration.openshift.io \\\\\\\"rendered-master-9ffdae4ce3763dbc967f7e9e041d4de1\\\\\\\" not found\\\", Node ip-10-0-140-8.ec2.internal is reporting: \\\"machineconfig.machineconfiguration.openshift.io \\\\\\\"rendered-master-9ffdae4ce3763dbc967f7e9e041d4de1\\\\\\\" not found\\\"\", retrying"

Additionally, the machine-config daemon (logs at https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/2765/pull-ci-openshift-installer-master-e2e-aws/8952/artifacts/e2e-aws/pods/openshift-machine-config-operator_machine-config-daemon-6lqbw_machine-config-daemon.log) complains about the following (the storage.conf file is too long to include in full):
```
E1210 09:19:15.517411   13932 daemon.go:1350] content mismatch for file /etc/containers/storage.conf: # 

A: This file is is the configuration file for all tools
# that use the containers/storage library.
# See man 5 containers-storage.conf for more information
# The "container storage" table contains all of the server options.
[storage]
...
```

The error message is repeated every minute until 09:50:16.231021.
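
For reference, a rough way to spot this on a live cluster is to grep the machine-config-daemon logs directly (the label selector below is an assumption about how the MCD pods are labelled; the pod name is the one from the log link above):

$ oc -n openshift-machine-config-operator get pods -l k8s-app=machine-config-daemon
$ oc -n openshift-machine-config-operator logs machine-config-daemon-6lqbw | grep 'content mismatch'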

Known jobs:
- https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-api-provider-aws/280/pull-ci-openshift-cluster-api-provider-aws-master-e2e-aws-operator/775
- https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_installer/2765/pull-ci-openshift-installer-master-e2e-aws/8952
- https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_console/3559/pull-ci-openshift-console-master-e2e-gcp-console/5884
- https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_telemeter/273/pull-ci-openshift-telemeter-master-e2e-aws/517

You can find the same error message in the remaining five daemon logs.

Version-Release number of selected component (if applicable):
Master branch of installer: registry.svc.ci.openshift.org/ci-op-0l9nfi34/release@sha256:1dc8db6e093e8484d0a75ad69be3d617b0d7502cdebdd892d0119cac43150cc9

How reproducible:
Always

Steps to Reproduce:
- https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_installer/2765/pull-ci-openshift-installer-master-e2e-aws/8952

Actual results:
- MCO is Degraded

Expected results:
- MCO is not Degraded and the cluster is installed successfully 

Additional info:

Comment 2 Michael Nguyen 2020-02-04 17:08:28 UTC
Verified on 4.2.0-0.nightly-2020-02-03-234322. The masters and workers are using the same /etc/containers/storage.conf, which includes the comment noting that the file is generated by the MCO's containerruntimeconfig controller.

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.2.0-0.nightly-2020-02-03-234322   True        False         93m     Cluster version is 4.2.0-0.nightly-2020-02-03-234322
$ oc get node
NAME                                         STATUS   ROLES    AGE    VERSION
ip-10-0-130-217.us-west-2.compute.internal   Ready    master   108m   v1.14.6+0a21dd3b3
ip-10-0-130-47.us-west-2.compute.internal    Ready    worker   101m   v1.14.6+0a21dd3b3
ip-10-0-139-157.us-west-2.compute.internal   Ready    master   107m   v1.14.6+0a21dd3b3
ip-10-0-143-116.us-west-2.compute.internal   Ready    worker   101m   v1.14.6+0a21dd3b3
ip-10-0-147-60.us-west-2.compute.internal    Ready    master   107m   v1.14.6+0a21dd3b3
ip-10-0-148-107.us-west-2.compute.internal   Ready    worker   101m   v1.14.6+0a21dd3b3
$ oc debug node/ip-10-0-130-217.us-west-2.compute.internal
Starting pod/ip-10-0-130-217us-west-2computeinternal-debug ...
To use host binaries, run `chroot /host`
If you don't see a command prompt, try pressing enter.
sh-4.2# exit
exit

Removing debug pod ...
$ oc debug node/ip-10-0-130-217.us-west-2.compute.internal -- chroot /host cat /etc/containers/storage.conf
Starting pod/ip-10-0-130-217us-west-2computeinternal-debug ...
To use host binaries, run `chroot /host`
# This file is generated by the Machine Config Operator's containerruntimeconfig controller.
#
# storage.conf is the configuration file for all tools
# that share the containers/storage libraries
# See man 5 containers-storage.conf for more information
# The "container storage" table contains all of the server options.
[storage]

# Default Storage Driver
driver = "overlay"

# Temporary storage location
runroot = "/var/run/containers/storage"

# Primary Read/Write location of container storage
graphroot = "/var/lib/containers/storage"

[storage.options]
# Storage options to be passed to underlying storage drivers

# AdditionalImageStores is used to pass paths to additional Read/Only image stores
# Must be comma separated list.
additionalimagestores = [
]

# Size is used to set a maximum size of the container image.  Only supported by
# certain container storage drivers.
size = ""

# OverrideKernelCheck tells the driver to ignore kernel checks based on kernel version
override_kernel_check = "true"

# Remap-UIDs/GIDs is the mapping from UIDs/GIDs as they should appear inside of
# a container, to UIDs/GIDs as they should appear outside of the container, and
# the length of the range of UIDs/GIDs.  Additional mapped sets can be listed
# and will be heeded by libraries, but there are limits to the number of
# mappings which the kernel will allow when you later attempt to run a
# container.
#
# remap-uids = 0:1668442479:65536
# remap-gids = 0:1668442479:65536

# Remap-User/Group is a name which can be used to look up one or more UID/GID
# ranges in the /etc/subuid or /etc/subgid file.  Mappings are set up starting
# with an in-container ID of 0 and the a host-level ID taken from the lowest
# range that matches the specified name, and using the length of that range.
# Additional ranges are then assigned, using the ranges which specify the
# lowest host-level IDs first, to the lowest not-yet-mapped container-level ID,
# until all of the entries have been used for maps.
#
# remap-user = "storage"
# remap-group = "storage"

[storage.options.thinpool]
# Storage Options for thinpool

# autoextend_percent determines the amount by which pool needs to be
# grown. This is specified in terms of % of pool size. So a value of 20 means
# that when threshold is hit, pool will be grown by 20% of existing
# pool size.
# autoextend_percent = "20"

# autoextend_threshold determines the pool extension threshold in terms
# of percentage of pool size. For example, if threshold is 60, that means when
# pool is 60% full, threshold has been hit.
# autoextend_threshold = "80"

# basesize specifies the size to use when creating the base device, which
# limits the size of images and containers.
# basesize = "10G"

# blocksize specifies a custom blocksize to use for the thin pool.
# blocksize="64k"

# directlvm_device specifies a custom block storage device to use for the
# thin pool. Required if you setup devicemapper
# directlvm_device = ""

# directlvm_device_force wipes device even if device already has a filesystem
# directlvm_device_force = "True"

# fs specifies the filesystem type to use for the base device.
# fs="xfs"

# log_level sets the log level of devicemapper.
# 0: LogLevelSuppress 0 (Default)
# 2: LogLevelFatal
# 3: LogLevelErr
# 4: LogLevelWarn
# 5: LogLevelNotice
# 6: LogLevelInfo
# 7: LogLevelDebug
# log_level = "7"

# min_free_space specifies the min free space percent in a thin pool require for
# new device creation to succeed. Valid values are from 0% - 99%.
# Value 0% disables
# min_free_space = "10%"

# mkfsarg specifies extra mkfs arguments to be used when creating the base
# device.
# mkfsarg = ""

# mountopt specifies extra mount options used when mounting the thin devices.
# mountopt = ""

# use_deferred_removal Marking device for deferred removal
# use_deferred_removal = "True"

# use_deferred_deletion Marking device for deferred deletion
# use_deferred_deletion = "True"

# xfs_nospace_max_retries specifies the maximum number of retries XFS should
# attempt to complete IO when ENOSPC (no space) error is returned by
# underlying storage device.
# xfs_nospace_max_retries = "0"

Removing debug pod ...
$ oc debug node/ip-10-0-130-47.us-west-2.compute.internal -- chroot /host cat /etc/containers/storage.conf
Starting pod/ip-10-0-130-47us-west-2computeinternal-debug ...
To use host binaries, run `chroot /host`
# This file is generated by the Machine Config Operator's containerruntimeconfig controller.
#
# storage.conf is the configuration file for all tools
# that share the containers/storage libraries
# See man 5 containers-storage.conf for more information
# The "container storage" table contains all of the server options.
[storage]

# Default Storage Driver
driver = "overlay"

# Temporary storage location
runroot = "/var/run/containers/storage"

# Primary Read/Write location of container storage
graphroot = "/var/lib/containers/storage"

[storage.options]
# Storage options to be passed to underlying storage drivers

# AdditionalImageStores is used to pass paths to additional Read/Only image stores
# Must be comma separated list.
additionalimagestores = [
]

# Size is used to set a maximum size of the container image.  Only supported by
# certain container storage drivers.
size = ""

# OverrideKernelCheck tells the driver to ignore kernel checks based on kernel version
override_kernel_check = "true"

# Remap-UIDs/GIDs is the mapping from UIDs/GIDs as they should appear inside of
# a container, to UIDs/GIDs as they should appear outside of the container, and
# the length of the range of UIDs/GIDs.  Additional mapped sets can be listed
# and will be heeded by libraries, but there are limits to the number of
# mappings which the kernel will allow when you later attempt to run a
# container.
#
# remap-uids = 0:1668442479:65536
# remap-gids = 0:1668442479:65536

# Remap-User/Group is a name which can be used to look up one or more UID/GID
# ranges in the /etc/subuid or /etc/subgid file.  Mappings are set up starting
# with an in-container ID of 0 and the a host-level ID taken from the lowest
# range that matches the specified name, and using the length of that range.
# Additional ranges are then assigned, using the ranges which specify the
# lowest host-level IDs first, to the lowest not-yet-mapped container-level ID,
# until all of the entries have been used for maps.
#
# remap-user = "storage"
# remap-group = "storage"

[storage.options.thinpool]
# Storage Options for thinpool

# autoextend_percent determines the amount by which pool needs to be
# grown. This is specified in terms of % of pool size. So a value of 20 means
# that when threshold is hit, pool will be grown by 20% of existing
# pool size.
# autoextend_percent = "20"

# autoextend_threshold determines the pool extension threshold in terms
# of percentage of pool size. For example, if threshold is 60, that means when
# pool is 60% full, threshold has been hit.
# autoextend_threshold = "80"

# basesize specifies the size to use when creating the base device, which
# limits the size of images and containers.
# basesize = "10G"

# blocksize specifies a custom blocksize to use for the thin pool.
# blocksize="64k"

# directlvm_device specifies a custom block storage device to use for the
# thin pool. Required if you setup devicemapper
# directlvm_device = ""

# directlvm_device_force wipes device even if device already has a filesystem
# directlvm_device_force = "True"

# fs specifies the filesystem type to use for the base device.
# fs="xfs"

# log_level sets the log level of devicemapper.
# 0: LogLevelSuppress 0 (Default)
# 2: LogLevelFatal
# 3: LogLevelErr
# 4: LogLevelWarn
# 5: LogLevelNotice
# 6: LogLevelInfo
# 7: LogLevelDebug
# log_level = "7"

# min_free_space specifies the min free space percent in a thin pool require for
# new device creation to succeed. Valid values are from 0% - 99%.
# Value 0% disables
# min_free_space = "10%"

# mkfsarg specifies extra mkfs arguments to be used when creating the base
# device.
# mkfsarg = ""

# mountopt specifies extra mount options used when mounting the thin devices.
# mountopt = ""

# use_deferred_removal Marking device for deferred removal
# use_deferred_removal = "True"

# use_deferred_deletion Marking device for deferred deletion
# use_deferred_deletion = "True"

# xfs_nospace_max_retries specifies the maximum number of retries XFS should
# attempt to complete IO when ENOSPC (no space) error is returned by
# underlying storage device.
# xfs_nospace_max_retries = "0"

Removing debug pod ...

Comment 4 errata-xmlrpc 2020-02-12 12:16:16 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0395

Comment 5 Kirsten Garrison 2020-02-20 00:16:53 UTC
This Assessment/Impact Template is still in progress but I wanted to get what I already have into the BZ to not block the process:

Generic bug assessment:
   What symptoms (in Telemetry, Insights, etc.) does a cluster experiencing this bug exhibit?
     The upgrade from 4.2.16 -> 4.3.0 fails
     The MCO is Degraded and showing in MCD logs:
       I1210 16:15:51.104169   11181 daemon.go:955] Validating against pending config rendered-master-bad39270d7f2359e7d7c35c302c3178c
       E1210 16:15:51.105286   11181 daemon.go:1350] content mismatch for file /etc/containers/storage.conf:
   What kind of clusters are impacted because of the bug? 
       Clusters upgrading from 4.2.16 -> 4.3.0
   What is the impact of the bug on the cluster?
       The MCO is degraded, which leads to the upgrade not finishing
   What cluster functionality is degraded while hitting the bug?
       Machine-Config-Operator is Degraded
   Might any customer data be lost because of the bug?
       Don’t think so
   Is it possible to recover the cluster from the bug?
       Requires extensive cluster-admin intervention.
       Yes, but currently the way to fix it is to manually edit the /etc/containers/storage.conf file (via oc debug node…) to have the correct/expected contents from the newest rendered-XXX machine config.
       The MCO team is currently trying to find a better workaround, as this is a time-consuming fix. Will update if we find one.
   What is the observed rate of failure we see in CI? 
   How long before the bug is fixed?
      The bug has been fixed in 4.2.18 

Update-specific assessment:
   What is the expected rate of the failure (%) for vulnerable clusters which attempt the update?
       Do not have a specific %, but historically this bug took down promotion CI runs, which is how it was discovered and subsequently fixed. We expect that any cluster upgrading from 4.2.16 -> 4.3.0 can hit this, as it does not depend on customer action and is the result of both the RPM and the MCO updating storage.conf, with the MCD checks subsequently failing.
       Picking up the fix in 4.2.18 will prevent this from happening.
   Does the upgrade succeed? 
      No
   Is the cluster usable after the upgrade? 
       All of the other operators have upgraded; it is the MCO that is degraded, which will prevent changes via MachineConfigs and further upgrades from proceeding until the degraded state is fixed.
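
For anyone assessing impact on a live cluster, a quick way to confirm this state (a sketch, not an official diagnostic procedure) is to look at the operator and pool status and the daemon logs:

$ oc get clusteroperator machine-config
$ oc get machineconfigpool
$ oc -n openshift-machine-config-operator logs machine-config-daemon-123 | grep 'content mismatch'
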
Relevant Other Links:
See: https://github.com/openshift/machine-config-operator/issues/1319

Comment 6 Kirsten Garrison 2020-02-20 03:09:54 UTC
Updating here with the specific manual workaround instructions. This must be done for each node:

If you hit this bug, you should see the following in the MCD logs:
I0220 01:55:27.683493  192713 daemon.go:955] Validating against pending config rendered-master-1234abcd
E0220 01:55:27.684114  192713 daemon.go:1350] content mismatch for file /etc/containers/storage.conf: # 

First, fetch the pending rendered machine config:
$ oc get machineconfig rendered-master-1234abcd -o yaml

Copy the storage.conf file contents (they will be URL-encoded and look like: %23%20storage.conf%20is%20the%20configuration%20file%20for......).

Take that URL-encoded content and use a decoding tool (for example https://www.url-encode-decode.com/) to decode the file contents correctly.
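
Alternatively, a rough sketch for pulling and decoding the contents in one go with oc and python3 (the spec.config.storage.files layout is an assumption based on the Ignition config embedded in the MachineConfig):

$ oc get machineconfig rendered-master-1234abcd -o json \
    | python3 -c '
import sys, json, urllib.parse
cfg = json.load(sys.stdin)
for f in cfg["spec"]["config"]["storage"]["files"]:
    if f["path"] == "/etc/containers/storage.conf":
        # contents.source is a data: URL; strip the prefix and URL-decode the rest
        print(urllib.parse.unquote(f["contents"]["source"].split(",", 1)[1]))
'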

Then hop onto your target node:
$ oc debug -t node/ip-123...
...
sh-4.2# chroot /host
sh-4.2# vi /etc/containers/storage.conf

Paste the decoded content (watch out for any extra newlines accidentally added, etc.).

After you are finished, exit the debug shell and restart the machine-config-daemon pod we looked at above:
$ oc delete pod -n openshift-machine-config-operator machine-config-daemon-123
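
If you are not sure which machine-config-daemon pod belongs to the node you just edited, one way to find it (a sketch; the node name is the placeholder from above) is with a field selector:

$ oc -n openshift-machine-config-operator get pods --field-selector spec.nodeName=ip-123... -o wide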

After restarting, you should see the following in the restarted MCD logs:
I0220 02:31:22.536307  250361 daemon.go:955] Validating against pending config rendered-master-1234abcd
I0220 02:31:22.544403  250361 daemon.go:971] Validated on-disk state

