Bug 1782153 - Pull CI: Cluster operator machine-config Degraded is True with RequiredPoolsFailed: machineconfig.machineconfiguration.openshift.io rendered-master-* not found
Summary: Pull CI: Cluster operator machine-config Degraded is True with RequiredPoolsF...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Machine Config Operator
Version: 4.1.z
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.1.z
Assignee: Colin Howell Walters
QA Contact: Michael Nguyen
URL:
Whiteboard:
Depends On: 1781708
Blocks:
 
Reported: 2019-12-11 09:54 UTC by Vadim Rutkovsky
Modified: 2020-02-13 06:14 UTC
CC: 17 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1781708
Environment:
Last Closed: 2020-02-13 06:14:02 UTC
Target Upstream Version:
Embargoed:




Links:
- GitHub: openshift/machine-config-operator pull 1382 (closed) - Bug 1782153: templates: Add a "this is generated" note (last updated 2020-07-24 16:59:01 UTC)
- Red Hat Product Errata: RHBA-2020:0399 (last updated 2020-02-13 06:14:11 UTC)

Description Vadim Rutkovsky 2019-12-11 09:54:22 UTC
+++ This bug was initially created as a clone of Bug #1781708 +++

Description of problem:
A bunch of pull-ci-* jobs are failing with:

level=error msg="Cluster operator machine-config Degraded is True with RequiredPoolsFailed: Failed to resync 0.0.1-2019-12-10-092928 because: timed out waiting for the condition during syncRequiredMachineConfigPools: pool master has not progressed to latest configuration: configuration status for pool master is empty: pool is degraded because nodes fail with \"3 nodes are reporting degraded status on sync\": \"Node ip-10-0-129-171.ec2.internal is reporting: \\\"machineconfig.machineconfiguration.openshift.io \\\\\\\"rendered-master-9ffdae4ce3763dbc967f7e9e041d4de1\\\\\\\" not found\\\", Node ip-10-0-145-21.ec2.internal is reporting: \\\"machineconfig.machineconfiguration.openshift.io \\\\\\\"rendered-master-9ffdae4ce3763dbc967f7e9e041d4de1\\\\\\\" not found\\\", Node ip-10-0-140-8.ec2.internal is reporting: \\\"machineconfig.machineconfiguration.openshift.io \\\\\\\"rendered-master-9ffdae4ce3763dbc967f7e9e041d4de1\\\\\\\" not found\\\"\", retrying"
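
A minimal diagnostic sketch for this failure mode (assuming ordinary oc access to an affected cluster): check whether the rendered-master MachineConfig the nodes are asking for actually exists, and what the pool and cluster operator report.

```
# Pool status and any degraded conditions
oc get machineconfigpool
oc describe machineconfigpool master

# The rendered config named in the node errors should be listed here
oc get machineconfig | grep rendered-master

# Cluster operator view of the same Degraded condition
oc get clusteroperator machine-config -o yaml
```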

Additionally, the machine-config-daemon (logs at https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/2765/pull-ci-openshift-installer-master-e2e-aws/8952/artifacts/e2e-aws/pods/openshift-machine-config-operator_machine-config-daemon-6lqbw_machine-config-daemon.log) complains about the following (the storage.conf file is too long to include in full):
```
E1210 09:19:15.517411   13932 daemon.go:1350] content mismatch for file /etc/containers/storage.conf: # 

A: This file is is the configuration file for all tools
# that use the containers/storage library.
# See man 5 containers-storage.conf for more information
# The "container storage" table contains all of the server options.
[storage]
...
```

The error message is repeated every minute until 09:50:16.231021.
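
To see how many nodes hit the mismatch on a live cluster, a sketch along these lines should work (the namespace, pod label, and container name are the usual machine-config-daemon defaults and may differ between releases):

```
# Grep every machine-config-daemon pod's log for the content-mismatch message
for pod in $(oc -n openshift-machine-config-operator get pods \
    -l k8s-app=machine-config-daemon -o name); do
  echo "=== ${pod} ==="
  oc -n openshift-machine-config-operator logs "${pod}" -c machine-config-daemon \
    | grep 'content mismatch' || true
done
```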

Known jobs:
- https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-api-provider-aws/280/pull-ci-openshift-cluster-api-provider-aws-master-e2e-aws-operator/775
- https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_installer/2765/pull-ci-openshift-installer-master-e2e-aws/8952
- https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_console/3559/pull-ci-openshift-console-master-e2e-gcp-console/5884
- https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_telemeter/273/pull-ci-openshift-telemeter-master-e2e-aws/517

You can find the same error message in the remaining five daemon logs.

Version-Release number of selected component (if applicable):
Master branch of installer: registry.svc.ci.openshift.org/ci-op-0l9nfi34/release@sha256:1dc8db6e093e8484d0a75ad69be3d617b0d7502cdebdd892d0119cac43150cc9

How reproducible:
Always

Steps to Reproduce:
- https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_installer/2765/pull-ci-openshift-installer-master-e2e-aws/8952

Actual results:
- MCO is Degraded

Expected results:
- MCO is not Degraded and the cluster is installed successfully 

Additional info:

--- Additional comment from Vadim Rutkovsky on 2019-12-10 12:59:00 UTC ---

An older podman was pulled in due to incorrect tagging of slirp4netns.

--- Additional comment from Michal Fojtik on 2019-12-10 14:45:03 UTC ---

Moving to urgent as this is affecting a lot of CI jobs.

--- Additional comment from Jindrich Novy on 2019-12-10 15:03:57 UTC ---

slirp4netns is now whitelisted by RCM and built to rhaos-4.3-rhel-8-candidate:

https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=25264758

--- Additional comment from Colin Walters on 2019-12-10 20:49:53 UTC ---

https://github.com/openshift/machine-config-operator/pull/1320 is a probable fix.
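
One way to spot-check whether a given build carries that change (a sketch; the header text is the one shown in the verification below) is to look for the "generated by" comment at the top of the rendered file on any node:

```
# Hypothetical spot-check: the fixed template prepends a "generated by" header
# to /etc/containers/storage.conf; replace <node-name> with a real node name.
oc debug node/<node-name> -- chroot /host head -n 2 /etc/containers/storage.conf
```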

Comment 4 Michael Nguyen 2020-02-04 19:56:56 UTC
Verified on 4.1.0-0.nightly-2020-02-04-143926. /etc/containers/storage.conf is the same on both masters and workers and includes the "generated by" header.

$ oc get node
NAME                           STATUS   ROLES    AGE   VERSION
ip-10-0-128-22.ec2.internal    Ready    worker   45m   v1.13.4+125a3c441
ip-10-0-134-62.ec2.internal    Ready    master   49m   v1.13.4+125a3c441
ip-10-0-138-243.ec2.internal   Ready    worker   45m   v1.13.4+125a3c441
ip-10-0-139-148.ec2.internal   Ready    master   49m   v1.13.4+125a3c441
ip-10-0-150-238.ec2.internal   Ready    worker   45m   v1.13.4+125a3c441
ip-10-0-159-4.ec2.internal     Ready    master   49m   v1.13.4+125a3c441
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.1.0-0.nightly-2020-02-04-143926   True        False         38m     Cluster version is 4.1.0-0.nightly-2020-02-04-143926
$ oc debug node/ip-10-0-128-22.ec2.internal -- chroot /host cat /etc/containers/storage.conf
Starting pod/ip-10-0-128-22ec2internal-debug ...
To use host binaries, run `chroot /host`
# This file is generated by the Machine Config Operator's containerruntimeconfig controller.
#

# storage.conf is the configuration file for all tools
# that share the containers/storage libraries
# See man 5 containers-storage.conf for more information
# The "container storage" table contains all of the server options.
[storage]

# Default Storage Driver
driver = "overlay"

# Temporary storage location
runroot = "/var/run/containers/storage"

# Primary Read/Write location of container storage
graphroot = "/var/lib/containers/storage"

[storage.options]
# Storage options to be passed to underlying storage drivers

# AdditionalImageStores is used to pass paths to additional Read/Only image stores
# Must be comma separated list.
additionalimagestores = [
]

# Size is used to set a maximum size of the container image.  Only supported by
# certain container storage drivers.
size = ""

# OverrideKernelCheck tells the driver to ignore kernel checks based on kernel version
override_kernel_check = "true"

# Remap-UIDs/GIDs is the mapping from UIDs/GIDs as they should appear inside of
# a container, to UIDs/GIDs as they should appear outside of the container, and
# the length of the range of UIDs/GIDs.  Additional mapped sets can be listed
# and will be heeded by libraries, but there are limits to the number of
# mappings which the kernel will allow when you later attempt to run a
# container.
#
# remap-uids = 0:1668442479:65536
# remap-gids = 0:1668442479:65536

# Remap-User/Group is a name which can be used to look up one or more UID/GID
# ranges in the /etc/subuid or /etc/subgid file.  Mappings are set up starting
# with an in-container ID of 0 and the a host-level ID taken from the lowest
# range that matches the specified name, and using the length of that range.
# Additional ranges are then assigned, using the ranges which specify the
# lowest host-level IDs first, to the lowest not-yet-mapped container-level ID,
# until all of the entries have been used for maps.
#
# remap-user = "storage"
# remap-group = "storage"

[storage.options.thinpool]
# Storage Options for thinpool

# autoextend_percent determines the amount by which pool needs to be
# grown. This is specified in terms of % of pool size. So a value of 20 means
# that when threshold is hit, pool will be grown by 20% of existing
# pool size.
# autoextend_percent = "20"

# autoextend_threshold determines the pool extension threshold in terms
# of percentage of pool size. For example, if threshold is 60, that means when
# pool is 60% full, threshold has been hit.
# autoextend_threshold = "80"

# basesize specifies the size to use when creating the base device, which
# limits the size of images and containers.
# basesize = "10G"

# blocksize specifies a custom blocksize to use for the thin pool.
# blocksize="64k"

# directlvm_device specifies a custom block storage device to use for the
# thin pool. Required if you setup devicemapper
# directlvm_device = ""

# directlvm_device_force wipes device even if device already has a filesystem
# directlvm_device_force = "True"

# fs specifies the filesystem type to use for the base device.
# fs="xfs"

# log_level sets the log level of devicemapper.
# 0: LogLevelSuppress 0 (Default)
# 2: LogLevelFatal
# 3: LogLevelErr
# 4: LogLevelWarn
# 5: LogLevelNotice
# 6: LogLevelInfo
# 7: LogLevelDebug
# log_level = "7"

# min_free_space specifies the min free space percent in a thin pool require for
# new device creation to succeed. Valid values are from 0% - 99%.
# Value 0% disables
# min_free_space = "10%"

# mkfsarg specifies extra mkfs arguments to be used when creating the base
# device.
# mkfsarg = ""

# mountopt specifies extra mount options used when mounting the thin devices.
# mountopt = ""

# use_deferred_removal Marking device for deferred removal
# use_deferred_removal = "True"

# use_deferred_deletion Marking device for deferred deletion
# use_deferred_deletion = "True"

# xfs_nospace_max_retries specifies the maximum number of retries XFS should
# attempt to complete IO when ENOSPC (no space) error is returned by
# underlying storage device.
# xfs_nospace_max_retries = "0"

Removing debug pod ...
$ oc debug node/ip-10-0-134-62.ec2.internal -- chroot /host cat /etc/containers/storage.conf
Starting pod/ip-10-0-134-62ec2internal-debug ...
To use host binaries, run `chroot /host`
# This file is generated by the Machine Config Operator's containerruntimeconfig controller.
#

[remainder of the file is identical to the worker node output shown above]

Removing debug pod ...
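
A more compact way to confirm the same result across every node (a sketch, not part of the original verification; it simply hashes the rendered file on each host):

```
# Identical checksums on all nodes confirm the MCO rendered the same
# /etc/containers/storage.conf for masters and workers.
for node in $(oc get nodes -o name); do
  echo -n "${node}: "
  oc debug "${node}" -- chroot /host sha256sum /etc/containers/storage.conf
done
```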

Comment 6 errata-xmlrpc 2020-02-13 06:14:02 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0399

