Bug 1781665

Summary: Upgrade from 4.2.9 to 4.2.10 stuck in machine-config occasionally
Product: OpenShift Container Platform
Reporter: Wenjing Zheng <wzheng>
Component: RHCOS
Assignee: Micah Abbott <miabbott>
Status: CLOSED ERRATA
QA Contact: Michael Nguyen <mnguyen>
Severity: medium
Docs Contact:
Priority: low
Version: 4.2.z
CC: amurdaca, aos-bugs, bbreard, dustymabe, imcleod, jligon, jokerman, kgarriso, nstielau
Target Milestone: ---
Target Release: 4.2.z
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1784979 (view as bug list)
Environment:
Last Closed: 2020-02-24 16:52:45 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1784981
Bug Blocks: 1783621

Description Wenjing Zheng 2019-12-10 11:25:47 UTC
Description of problem:
The machine-config cluster operator cannot finish the upgrade from 4.2.9 to 4.2.10:
Status:
  Conditions:
    Last Transition Time:  2019-12-09T14:51:33Z
    Message:               Cluster not available for 4.2.10
    Status:                False
    Type:                  Available
    Last Transition Time:  2019-12-09T14:37:38Z
    Message:               Working towards 4.2.10
    Status:                True
    Type:                  Progressing
    Last Transition Time:  2019-12-09T14:51:33Z
    Message:               Unable to apply 4.2.10: timed out waiting for the condition during syncRequiredMachineConfigPools: error pool master is not ready, retrying. Status: (pool degraded: true total: 3, ready 0, updated: 0, unavailable: 0)
    Reason:                RequiredPoolsFailed
    Status:                True
    Type:                  Degraded
    Last Transition Time:  2019-12-09T10:40:33Z
    Reason:                AsExpected
    Status:                True
    Type:                  Upgradeable
  Extension:
    Last Sync Error:  error pool master is not ready, retrying. Status: (pool degraded: true total: 3, ready 0, updated: 0, unavailable: 0)
    Master:           pool is degraded because nodes fail with "1 nodes are reporting degraded status on sync": "Node ip.us-east-2.compute.internal is reporting: \"failed to run pivot: failed to start machine-config-daemon-host.service: exit status 1\""
    Worker:           all 3 nodes are at latest configuration rendered-worker-bd5671078c730a55e305ab9c6b835be3
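
For reference, this operator status can typically be pulled from a live cluster with commands like the following (a minimal sketch; resource names as on a standard 4.2 cluster):

```
# Overall operator health during the upgrade
oc get clusteroperators
# Full status of the machine-config operator (the YAML shown above)
oc get clusteroperator machine-config -o yaml
# Per-pool progress; the master pool is the one reported as degraded here
oc get machineconfigpool
```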

[wzheng@openshift-qe 3]$ oc debug nodes/ip.us-east-2.compute.internal
Starting pod/ipus-east-2computeinternal-debug ...
To use host binaries, run `chroot /host`
If you don't see a command prompt, try pressing enter.
sh-4.2# chroot /host
sh-4.4# service machine-config-daemon-host.service status
Redirecting to /bin/systemctl status machine-config-daemon-host.service
Warning: The unit file, source configuration file or drop-ins of machine-config-daemon-host.service changed on disk. Run 'systemctl daemon-reload' to reload units.
● machine-config-daemon-host.service - Machine Config Daemon Initial
   Loaded: loaded (/etc/systemd/system/machine-config-daemon-host.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/machine-config-daemon-host.service.d
           └─10-default-env.conf
   Active: failed (Result: exit-code) since Tue 2019-12-10 02:24:58 UTC; 2s ago
  Process: 2049448 ExecStart=/usr/libexec/machine-config-daemon pivot (code=exited, status=1/FAILURE)
 Main PID: 2049448 (code=exited, status=1/FAILURE)
      CPU: 942ms

Dec 10 02:24:57  podman[2049467]: 2019-12-10 02:24:57.750935927 +0000 UTC m=+1.961631647 image pull  
Dec 10 02:24:57  machine-config-daemon[2049448]: 17ef259af6996da8bcb42b19a50c78dfd1e02395fef1ed30dc8d9286decbcfbc
Dec 10 02:24:57  machine-config-daemon[2049448]: I1210 02:24:57.758345 2049448 rpm-ostree.go:356] Running captured: podman inspect --type=image quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256>
Dec 10 02:24:58  machine-config-daemon[2049448]: I1210 02:24:58.110357 2049448 rpm-ostree.go:356] Running captured: podman create --net=none --name ostree-container-pivot quay.io/openshift-release->
Dec 10 02:24:58  machine-config-daemon[2049448]: Error: error creating container storage: the container name "ostree-container-pivot" is already in use by "a8d1e4db7e1388df9fc46c63cd8336498d0dd2574>
Dec 10 02:24:58 ip-10-0-158-47 machine-config-daemon[2049448]: error: exit status 125
Dec 10 02:24:58  systemd[1]: machine-config-daemon-host.service: Main process exited, code=exited, status=1/FAILURE
Dec 10 02:24:58  systemd[1]: machine-config-daemon-host.service: Failed with result 'exit-code'.
Dec 10 02:24:58 systemd[1]: Failed to start Machine Config Daemon Initial.
Dec 10 02:24:58  systemd[1]: machine-config-daemon-host.service: Consumed 942ms CPU time
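
The failing step in the log above is the `podman create --name ostree-container-pivot ...` call: a container with that name is left over from an earlier pivot attempt, so re-creating it fails with exit status 125. A possible manual cleanup from the same `oc debug` / `chroot /host` shell is sketched below (a sketch only, not a verified workaround; only safe if no pivot is actively running):

```
# Confirm the stale container that blocks the retry
podman ps -a --filter name=ostree-container-pivot
# Remove it so the next pivot attempt can recreate it
podman rm -f ostree-container-pivot
# Let the machine-config daemon retry the pivot
systemctl start machine-config-daemon-host.service
systemctl status machine-config-daemon-host.service
```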


Version-Release number of selected component (if applicable):
4.2.9->4.2.10

How reproducible:
20%

Steps to Reproduce:
1. Set up a 4.2.9 cluster
2. Upgrade it to 4.2.10
3. Watch the cluster operator status (see the command sketch below)
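
A minimal command sketch for these steps (assuming a connected cluster and that 4.2.10 is available as an update target; exact channel handling may differ):

```
# 1. Confirm the starting version
oc get clusterversion
# 2. Trigger the upgrade
oc adm upgrade --to=4.2.10
# 3. Watch the cluster operators, in particular machine-config
watch oc get clusteroperators
```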

Actual results:
The machine-config operator remains at 4.2.9.

Expected results:
It should upgrade to 4.2.10.

Additional info:

Comment 2 Kirsten Garrison 2019-12-10 18:03:01 UTC
@wenjing, you say you see it 20% of the time. Out of how many attempts? Thanks.

Comment 3 Kirsten Garrison 2019-12-10 18:13:01 UTC
The must-gather indicates that there are no degraded nodes:
```
  degradedMachineCount: 0
  machineCount: 2
  observedGeneration: 2
  readyMachineCount: 2
  unavailableMachineCount: 0
  updatedMachineCount: 2
```

```
  degradedMachineCount: 0
  machineCount: 3
  observedGeneration: 2
  readyMachineCount: 3
  unavailableMachineCount: 0
  updatedMachineCount: 3

```
```
  conditions:
  - lastTransitionTime: 2019-09-09T01:04:18Z
    message: Cluster has deployed 4.2.0-0.nightly-2019-09-08-180038
    status: "True"
    type: Available
  - lastTransitionTime: 2019-09-09T01:04:18Z
    message: Cluster version is 4.2.0-0.nightly-2019-09-08-180038
    status: "False"
    type: Progressing
  - lastTransitionTime: 2019-09-09T01:03:26Z
    status: "False"
    type: Degraded
  - lastTransitionTime: 2019-09-09T01:04:18Z
    reason: AsExpected
    status: "True"
    type: Upgradeable
  extension:
    master: all 3 nodes are at latest configuration rendered-master-7f57c7f00707be89c871da3dee48edd3
    worker: all 2 nodes are at latest configuration rendered-worker-dc11fa50a1bc04b1960564a588524738
```

Comment 4 Kirsten Garrison 2019-12-10 18:14:00 UTC
Looking a little closer... this must-gather is from September (2019-09-09). Please attach a current must-gather so we can investigate.
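
For reference, a fresh must-gather can be collected and attached with something like this (a minimal sketch; the destination directory is arbitrary):

```
# Collect a current must-gather from the cluster
oc adm must-gather --dest-dir=./must-gather-current
# Compress it for attaching to the bug
tar czf must-gather-current.tar.gz ./must-gather-current
```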

Comment 9 Kirsten Garrison 2019-12-18 20:04:36 UTC
This is the 4.2 backport of https://github.com/openshift/machine-config-operator/pull/1285
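
As a sketch of how to check whether a given 4.2.z payload already carries that machine-config-operator change (here `<release-pullspec>` is a placeholder for the release image, e.g. one listed by `oc adm upgrade`):

```
# List per-component commits in a release payload and find the MCO commit
oc adm release info <release-pullspec> --commits | grep machine-config-operator
```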

Comment 15 errata-xmlrpc 2020-02-24 16:52:45 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0460