Bug 1781665
| Summary: | Upgrade from 4.2.9 to 4.2.10 stuck in machine-config occasionally | |||
|---|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Wenjing Zheng <wzheng> | |
| Component: | RHCOS | Assignee: | Micah Abbott <miabbott> | |
| Status: | CLOSED ERRATA | QA Contact: | Michael Nguyen <mnguyen> | |
| Severity: | medium | Docs Contact: | ||
| Priority: | low | |||
| Version: | 4.2.z | CC: | amurdaca, aos-bugs, bbreard, dustymabe, imcleod, jligon, jokerman, kgarriso, nstielau | |
| Target Milestone: | --- | |||
| Target Release: | 4.2.z | |||
| Hardware: | Unspecified | |||
| OS: | Unspecified | |||
| Whiteboard: | ||||
| Fixed In Version: | Doc Type: | If docs needed, set a value | ||
| Doc Text: | Story Points: | --- | ||
| Clone Of: | ||||
| : | 1784979 (view as bug list) | Environment: | ||
| Last Closed: | 2020-02-24 16:52:45 UTC | Type: | Bug | |
| Regression: | --- | Mount Type: | --- | |
| Documentation: | --- | CRM: | ||
| Verified Versions: | Category: | --- | ||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
| Cloudforms Team: | --- | Target Upstream Version: | ||
| Embargoed: | ||||
| Bug Depends On: | 1784981 | |||
| Bug Blocks: | 1783621 | |||
@wenjing, you say you see it 20% of the time. Out of how many attempts? Thanks.

The must-gather indicates that there are no degraded nodes:
```
degradedMachineCount: 0
machineCount: 2
observedGeneration: 2
readyMachineCount: 2
unavailableMachineCount: 0
updatedMachineCount: 2
```
```
degradedMachineCount: 0
machineCount: 3
observedGeneration: 2
readyMachineCount: 3
unavailableMachineCount: 0
updatedMachineCount: 3
```
```
conditions:
- lastTransitionTime: 2019-09-09T01:04:18Z
  message: Cluster has deployed 4.2.0-0.nightly-2019-09-08-180038
  status: "True"
  type: Available
- lastTransitionTime: 2019-09-09T01:04:18Z
  message: Cluster version is 4.2.0-0.nightly-2019-09-08-180038
  status: "False"
  type: Progressing
- lastTransitionTime: 2019-09-09T01:03:26Z
  status: "False"
  type: Degraded
- lastTransitionTime: 2019-09-09T01:04:18Z
  reason: AsExpected
  status: "True"
  type: Upgradeable
extension:
  master: all 3 nodes are at latest configuration rendered-master-7f57c7f00707be89c871da3dee48edd3
  worker: all 2 nodes are at latest configuration rendered-worker-dc11fa50a1bc04b1960564a588524738
```
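The pool counters quoted above can be checked mechanically. The following is only a minimal sketch of one plausible reading of those fields; the function name `pool_is_settled` is hypothetical, and this is not the machine-config-operator's actual readiness logic. The third sample uses the numbers from the "Last Sync Error" text in the bug description (total 3, ready 0, updated 0, unavailable 0, one node reporting degraded).

```python
from typing import Mapping


def pool_is_settled(status: Mapping[str, int]) -> bool:
    """Return True when every machine in the pool is updated, ready, and not degraded.

    Hypothetical helper: an illustration of reading the counters,
    not the operator's own check.
    """
    total = status["machineCount"]
    return (
        status["degradedMachineCount"] == 0
        and status["unavailableMachineCount"] == 0
        and status["readyMachineCount"] == total
        and status["updatedMachineCount"] == total
    )


# Counters copied from the two must-gather excerpts above.
two_node_pool = {
    "degradedMachineCount": 0,
    "machineCount": 2,
    "readyMachineCount": 2,
    "unavailableMachineCount": 0,
    "updatedMachineCount": 2,
}
three_node_pool = {
    "degradedMachineCount": 0,
    "machineCount": 3,
    "readyMachineCount": 3,
    "unavailableMachineCount": 0,
    "updatedMachineCount": 3,
}
# Counters taken from the failing upgrade's status message in the description.
stuck_master_pool = {
    "degradedMachineCount": 1,
    "machineCount": 3,
    "readyMachineCount": 0,
    "unavailableMachineCount": 0,
    "updatedMachineCount": 0,
}

for name, pool in [
    ("two_node_pool", two_node_pool),
    ("three_node_pool", three_node_pool),
    ("stuck_master_pool", stuck_master_pool),
]:
    print(name, pool_is_settled(pool))
```

Under this reading, both pools in the September must-gather are settled, which is consistent with the observation that it does not capture the failing upgrade.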
Looking a little closer... this must-gather is from September (2019-09-09). Please attach the current must-gather so we can investigate.

This is the 4.2 backport of https://github.com/openshift/machine-config-operator/pull/1285

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0460
Description of problem:
machine-config cannot succeed during upgrade from 4.2.9 to 4.2.10:

Status:
  Conditions:
    Last Transition Time: 2019-12-09T14:51:33Z
    Message: Cluster not available for 4.2.10
    Status: False
    Type: Available
    Last Transition Time: 2019-12-09T14:37:38Z
    Message: Working towards 4.2.10
    Status: True
    Type: Progressing
    Last Transition Time: 2019-12-09T14:51:33Z
    Message: Unable to apply 4.2.10: timed out waiting for the condition during syncRequiredMachineConfigPools: error pool master is not ready, retrying. Status: (pool degraded: true total: 3, ready 0, updated: 0, unavailable: 0)
    Reason: RequiredPoolsFailed
    Status: True
    Type: Degraded
    Last Transition Time: 2019-12-09T10:40:33Z
    Reason: AsExpected
    Status: True
    Type: Upgradeable
  Extension:
    Last Sync Error: error pool master is not ready, retrying. Status: (pool degraded: true total: 3, ready 0, updated: 0, unavailable: 0)
    Master: pool is degraded because nodes fail with "1 nodes are reporting degraded status on sync": "Node ip.us-east-2.compute.internal is reporting: \"failed to run pivot: failed to start machine-config-daemon-host.service: exit status 1\""
    Worker: all 3 nodes are at latest configuration rendered-worker-bd5671078c730a55e305ab9c6b835be3

[wzheng@openshift-qe 3]$ oc debug nodes/ip.us-east-2.compute.internal
Starting pod/ipus-east-2computeinternal-debug ...
To use host binaries, run `chroot /host`
If you don't see a command prompt, try pressing enter.
sh-4.2# chroot /host
sh-4.4# service machine-config-daemon-host.service status
Redirecting to /bin/systemctl status machine-config-daemon-host.service
Warning: The unit file, source configuration file or drop-ins of machine-config-daemon-host.service changed on disk. Run 'systemctl daemon-reload' to reload units.
● machine-config-daemon-host.service - Machine Config Daemon Initial
   Loaded: loaded (/etc/systemd/system/machine-config-daemon-host.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/machine-config-daemon-host.service.d
           └─10-default-env.conf
   Active: failed (Result: exit-code) since Tue 2019-12-10 02:24:58 UTC; 2s ago
  Process: 2049448 ExecStart=/usr/libexec/machine-config-daemon pivot (code=exited, status=1/FAILURE)
 Main PID: 2049448 (code=exited, status=1/FAILURE)
      CPU: 942ms

Dec 10 02:24:57 podman[2049467]: 2019-12-10 02:24:57.750935927 +0000 UTC m=+1.961631647 image pull
Dec 10 02:24:57 machine-config-daemon[2049448]: 17ef259af6996da8bcb42b19a50c78dfd1e02395fef1ed30dc8d9286decbcfbc
Dec 10 02:24:57 machine-config-daemon[2049448]: I1210 02:24:57.758345 2049448 rpm-ostree.go:356] Running captured: podman inspect --type=image quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256>
Dec 10 02:24:58 machine-config-daemon[2049448]: I1210 02:24:58.110357 2049448 rpm-ostree.go:356] Running captured: podman create --net=none --name ostree-container-pivot quay.io/openshift-release->
Dec 10 02:24:58 machine-config-daemon[2049448]: Error: error creating container storage: the container name "ostree-container-pivot" is already in use by "a8d1e4db7e1388df9fc46c63cd8336498d0dd2574>
Dec 10 02:24:58 ip-10-0-158-47 machine-config-daemon[2049448]: error: exit status 125
Dec 10 02:24:58 systemd[1]: machine-config-daemon-host.service: Main process exited, code=exited, status=1/FAILURE
Dec 10 02:24:58 systemd[1]: machine-config-daemon-host.service: Failed with result 'exit-code'.
Dec 10 02:24:58 systemd[1]: Failed to start Machine Config Daemon Initial.
Dec 10 02:24:58 systemd[1]: machine-config-daemon-host.service: Consumed 942ms CPU time

Version-Release number of selected component (if applicable):
4.2.9->4.2.10

How reproducible:
20%

Steps to Reproduce:
1. Set up a 4.2.9 cluster
2. Upgrade to 4.2.10
3. Watch cluster operator status

Actual results:
machine-config remains at 4.2.9

Expected results:
It should upgrade to 4.2.10.

Additional info:
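The journal excerpt in the description shows `podman create` failing because the name `ostree-container-pivot` is already taken by an existing container. When triaging similar reports, that signature can be picked out of journal lines with a small script. This is a rough sketch only: the helper name `find_name_conflict` and the regular expression are assumptions written against the single error line quoted in this bug, not part of any OpenShift or podman tooling.

```python
import re
from typing import Optional, Tuple

# Matches the podman error text quoted in the description. The container ID
# in the journal is truncated (it ends with '>'), so only the leading hex
# digits are captured.
NAME_CONFLICT = re.compile(
    r'the container name "(?P<name>[^"]+)" is already in use by "(?P<cid>[0-9a-f]+)'
)


def find_name_conflict(line: str) -> Optional[Tuple[str, str]]:
    """Return (container name, container ID prefix) if the line reports a name conflict."""
    match = NAME_CONFLICT.search(line)
    if match is None:
        return None
    return match.group("name"), match.group("cid")


# Journal lines copied from the description.
conflict_line = (
    "Dec 10 02:24:58 machine-config-daemon[2049448]: Error: error creating "
    'container storage: the container name "ostree-container-pivot" is already '
    'in use by "a8d1e4db7e1388df9fc46c63cd8336498d0dd2574>'
)
other_line = (
    "Dec 10 02:24:58 systemd[1]: machine-config-daemon-host.service: "
    "Failed with result 'exit-code'."
)

print(find_name_conflict(conflict_line))
print(find_name_conflict(other_line))
```

A match on this line suggests the pivot is being blocked by a container left over from an earlier attempt rather than by the image pull itself, which appears to succeed in the excerpt.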