Description of problem:

machine-config cannot succeed during upgrade from 4.2.9 to 4.2.10:

```
Status:
  Conditions:
    Last Transition Time:  2019-12-09T14:51:33Z
    Message:               Cluster not available for 4.2.10
    Status:                False
    Type:                  Available
    Last Transition Time:  2019-12-09T14:37:38Z
    Message:               Working towards 4.2.10
    Status:                True
    Type:                  Progressing
    Last Transition Time:  2019-12-09T14:51:33Z
    Message:               Unable to apply 4.2.10: timed out waiting for the condition during syncRequiredMachineConfigPools: error pool master is not ready, retrying. Status: (pool degraded: true total: 3, ready 0, updated: 0, unavailable: 0)
    Reason:                RequiredPoolsFailed
    Status:                True
    Type:                  Degraded
    Last Transition Time:  2019-12-09T10:40:33Z
    Reason:                AsExpected
    Status:                True
    Type:                  Upgradeable
  Extension:
    Last Sync Error:  error pool master is not ready, retrying. Status: (pool degraded: true total: 3, ready 0, updated: 0, unavailable: 0)
    Master:           pool is degraded because nodes fail with "1 nodes are reporting degraded status on sync": "Node ip.us-east-2.compute.internal is reporting: \"failed to run pivot: failed to start machine-config-daemon-host.service: exit status 1\""
    Worker:           all 3 nodes are at latest configuration rendered-worker-bd5671078c730a55e305ab9c6b835be3
```

Debugging the degraded master node shows the failing unit:

```
[wzheng@openshift-qe 3]$ oc debug nodes/ip.us-east-2.compute.internal
Starting pod/ipus-east-2computeinternal-debug ...
To use host binaries, run `chroot /host`
If you don't see a command prompt, try pressing enter.
sh-4.2# chroot /host
sh-4.4# service machine-config-daemon-host.service status
Redirecting to /bin/systemctl status machine-config-daemon-host.service
Warning: The unit file, source configuration file or drop-ins of machine-config-daemon-host.service changed on disk. Run 'systemctl daemon-reload' to reload units.
● machine-config-daemon-host.service - Machine Config Daemon Initial
   Loaded: loaded (/etc/systemd/system/machine-config-daemon-host.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/machine-config-daemon-host.service.d
           └─10-default-env.conf
   Active: failed (Result: exit-code) since Tue 2019-12-10 02:24:58 UTC; 2s ago
  Process: 2049448 ExecStart=/usr/libexec/machine-config-daemon pivot (code=exited, status=1/FAILURE)
 Main PID: 2049448 (code=exited, status=1/FAILURE)
      CPU: 942ms

Dec 10 02:24:57 podman[2049467]: 2019-12-10 02:24:57.750935927 +0000 UTC m=+1.961631647 image pull
Dec 10 02:24:57 machine-config-daemon[2049448]: 17ef259af6996da8bcb42b19a50c78dfd1e02395fef1ed30dc8d9286decbcfbc
Dec 10 02:24:57 machine-config-daemon[2049448]: I1210 02:24:57.758345 2049448 rpm-ostree.go:356] Running captured: podman inspect --type=image quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256>
Dec 10 02:24:58 machine-config-daemon[2049448]: I1210 02:24:58.110357 2049448 rpm-ostree.go:356] Running captured: podman create --net=none --name ostree-container-pivot quay.io/openshift-release->
Dec 10 02:24:58 machine-config-daemon[2049448]: Error: error creating container storage: the container name "ostree-container-pivot" is already in use by "a8d1e4db7e1388df9fc46c63cd8336498d0dd2574>
Dec 10 02:24:58 ip-10-0-158-47 machine-config-daemon[2049448]: error: exit status 125
Dec 10 02:24:58 systemd[1]: machine-config-daemon-host.service: Main process exited, code=exited, status=1/FAILURE
Dec 10 02:24:58 systemd[1]: machine-config-daemon-host.service: Failed with result 'exit-code'.
Dec 10 02:24:58 systemd[1]: Failed to start Machine Config Daemon Initial.
Dec 10 02:24:58 systemd[1]: machine-config-daemon-host.service: Consumed 942ms CPU time
```

Version-Release number of selected component (if applicable):
4.2.9 -> 4.2.10

How reproducible:
20%

Steps to Reproduce:
1. Set up a 4.2.9 cluster
2. Upgrade it to 4.2.10
3. Watch the cluster operator status

Actual results:
machine-config remains at 4.2.9

Expected results:
It should upgrade to 4.2.10.

Additional info:
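Based on the journal above, the unit fails because `podman create` collides with a leftover `ostree-container-pivot` container from an earlier interrupted pivot. A hedged sketch of a manual recovery, to be run from `oc debug nodes/<node>` after `chroot /host` (the function name is illustrative and the steps are an assumption mirroring the failing log lines, not an officially documented workaround):

```shell
# Hypothetical manual recovery for the stuck pivot on the degraded node.
cleanup_stale_pivot() {
  # Drop the leftover container that makes `podman create` fail with exit 125;
  # ignore the error if it is already gone.
  podman rm -f ostree-container-pivot 2>/dev/null || true
  # The status output warned that the unit file changed on disk.
  systemctl daemon-reload
  # Retry the failed unit so the MCD can re-attempt the pivot.
  systemctl restart machine-config-daemon-host.service
}
```

Call `cleanup_stale_pivot` only on the affected RHCOS host; on success the machine-config-daemon should re-run the pivot and the master pool can progress.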
@wenjing, you say you see it 20% of the time. Out of how many attempts? Thanks.
The must gather indicates that there are no degraded nodes:

```
degradedMachineCount: 0
machineCount: 2
observedGeneration: 2
readyMachineCount: 2
unavailableMachineCount: 0
updatedMachineCount: 2
```

```
degradedMachineCount: 0
machineCount: 3
observedGeneration: 2
readyMachineCount: 3
unavailableMachineCount: 0
updatedMachineCount: 3
```

```
conditions:
- lastTransitionTime: 2019-09-09T01:04:18Z
  message: Cluster has deployed 4.2.0-0.nightly-2019-09-08-180038
  status: "True"
  type: Available
- lastTransitionTime: 2019-09-09T01:04:18Z
  message: Cluster version is 4.2.0-0.nightly-2019-09-08-180038
  status: "False"
  type: Progressing
- lastTransitionTime: 2019-09-09T01:03:26Z
  status: "False"
  type: Degraded
- lastTransitionTime: 2019-09-09T01:04:18Z
  reason: AsExpected
  status: "True"
  type: Upgradeable
extension:
  master: all 3 nodes are at latest configuration rendered-master-7f57c7f00707be89c871da3dee48edd3
  worker: all 2 nodes are at latest configuration rendered-worker-dc11fa50a1bc04b1960564a588524738
```
Looking a little closer... this must gather is from September (2019-09-09). Please attach the current must gather so we can investigate.
This is the 4.2 backport of https://github.com/openshift/machine-config-operator/pull/1285
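For context on the failure mode: the pivot's container-create step is not idempotent, so a container left behind by a previous interrupted run blocks every retry. A sketch of the idempotent-create pattern that addresses this class of bug (an illustration only, not the actual machine-config-operator patch; the function name and `$1` image argument are placeholders):

```shell
# Pattern sketch (not the actual MCO code): make container creation
# idempotent by force-removing any leftover container first.
recreate_pivot_container() {
  image="$1"
  # Ignore the error if no stale container exists.
  podman rm -f ostree-container-pivot 2>/dev/null || true
  # The name is now guaranteed to be free.
  podman create --net=none --name ostree-container-pivot "$image"
}
```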
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:0460