Bug 1842906

Summary: upgrade from 4.4.6 > 4.5 fails: unexpected on-disk state validating against rendered-master

| Field | Value | Field | Value |
|---|---|---|---|
| Product | OpenShift Container Platform | Reporter | MinLi <minmli> |
| Component | Machine Config Operator | Assignee | Antonio Murdaca <amurdaca> |
| Status | CLOSED ERRATA | QA Contact | Johnny Liu <jialiu> |
| Severity | high | Docs Contact | |
| Priority | high | CC | agarcial, amurdaca, aos-bugs, bleanhar, cglombek, jerzhang, jiajliu, jialiu, jokerman, kgarriso, lmohanty, skumari, smilner, walters, wking, zhsun |
| Version | 4.5 | Target Milestone | --- |
| Target Release | 4.6.0 | Hardware | x86_64 |
| OS | Linux | Whiteboard | |
| Fixed In Version | | Doc Type | If docs needed, set a value |
| Doc Text | | Story Points | --- |
| Clone Of | | Environment | |
| Last Closed | 2020-10-27 16:03:47 UTC | Type | Bug |
| Regression | --- | Mount Type | --- |
| Documentation | --- | CRM | |
| Verified Versions | | Category | --- |
| oVirt Team | --- | RHEL 7.3 requirements from Atomic Host | |
| Cloudforms Team | --- | Target Upstream Version | |
| Embargoed | | Bug Depends On | |
| Bug Blocks | 1846690, 1867886 | Attachments | must-gather log (attachment 1694682) |
Description

MinLi 2020-06-02 11:24:31 UTC

Can you please provide a must-gather from this cluster?

Created attachment 1694682 [details]
must-gather log

@Antonio Murdaca the compressed must-gather archive is more than 24 MB, which is over the attachment size limit. I also found that some of the audit files are corrupted, because while collecting the must-gather I hit this rsync error:

```
rsync: connection unexpectedly closed (118621465 bytes received so far) [receiver]
rsync error: error in rsync protocol data stream (code 12) at io.c(226) [receiver=3.1.3]
```

Here is the layout of the must-gather archive — which parts are necessary? I can upload them separately.

```
$ tree must-gather.local.3693029977532110604/ -L 3
must-gather.local.3693029977532110604/
└── quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-9734e6d304b88b243df5a1720570bf63ab684b2317755e0ed00de0aa32b05ce8
    ├── audit_logs
    │   ├── kube-apiserver
    │   ├── kube-apiserver.audit_logs_listing
    │   ├── openshift-apiserver
    │   └── openshift-apiserver.audit_logs_listing
    ├── cluster-scoped-resources
    │   ├── admissionregistration.k8s.io
    │   ├── apiextensions.k8s.io
    │   ├── apiregistration.k8s.io
    │   ├── certificates.k8s.io
    │   ├── config.openshift.io
    │   ├── core
    │   ├── imageregistry.operator.openshift.io
    │   ├── machineconfiguration.openshift.io
    │   ├── network.openshift.io
    │   ├── oauth.openshift.io
    │   ├── operator.openshift.io
    │   ├── rbac.authorization.k8s.io
    │   ├── samples.operator.openshift.io
    │   ├── security.openshift.io
    │   └── storage.k8s.io
    ├── host_service_logs
    │   └── masters
    ├── namespaces
    │   ├── default
    │   ├── kube-system
    │   ├── openshift-apiserver
    │   ├── openshift-apiserver-operator
    │   ├── openshift-authentication
    │   ├── openshift-authentication-operator
    │   ├── openshift-cloud-credential-operator
    │   ├── openshift-cluster-machine-approver
    │   ├── openshift-cluster-node-tuning-operator
    │   ├── openshift-cluster-samples-operator
    │   ├── openshift-cluster-storage-operator
    │   ├── openshift-cluster-version
    │   ├── openshift-config
    │   ├── openshift-config-managed
    │   ├── openshift-config-operator
    │   ├── openshift-console
    │   ├── openshift-console-operator
    │   ├── openshift-controller-manager
    │   ├── openshift-controller-manager-operator
    │   └── openshift-dns-operator
    └── version
```

Is this 100% reproducible?

The mismatch is here:

```
212 2020-06-02T08:24:21.03138522Z I0602 08:24:21.031337 2094 daemon.go:992] Validating against pending config rendered-worker-5e2fcf39779f6e113f77e6b1f71a05de
213 2020-06-02T08:24:21.321396781Z E0602 08:24:21.321340 2094 daemon.go:1378] could not stat file: "/etc/crio/crio.conf", error: lstat /etc/crio/crio.conf: no such file or directory
```

Looks like crio.conf went away. I know the Node team changed CRI-O to drop-in configs, but I never saw this error. Moving to them for further debugging.
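To see what is actually on a node's disk at this point, one option is an `oc debug` session against the node; a minimal sketch, assuming cluster access and using `compute-0` from the logs above:

```
# Open a debug pod on the node and inspect the CRI-O config location on the host.
# With the move to drop-in configs, /etc/crio/crio.conf.d may exist instead of a
# monolithic /etc/crio/crio.conf; a missing path simply makes ls report an error.
oc debug node/compute-0 -- chroot /host ls -l /etc/crio /etc/crio/crio.conf.d
```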
Moving back to MCO:

- updated from rendered-worker-5e2fcf39779f6e113f77e6b1f71a05de to rendered-worker-c2d5283deeb9c08da7d5c2be0309eccc
- update completed just fine according to the logs
- reboot
- pending config is rendered-worker-5e2fcf39779f6e113f77e6b1f71a05de instead of being rendered-worker-c2d5283deeb9c08da7d5c2be0309eccc
- something went wrong... maybe previous to this update

```
245 2020-06-02T08:28:12.965660887Z I0602 08:28:12.965620 2094 daemon.go:768] Current config: rendered-worker-5e2fcf39779f6e113f77e6b1f71a05de
246 2020-06-02T08:28:12.965660887Z I0602 08:28:12.965638 2094 daemon.go:769] Desired config: rendered-worker-c2d5283deeb9c08da7d5c2be0309eccc
247 2020-06-02T08:28:12.965660887Z I0602 08:28:12.965644 2094 daemon.go:783] Pending config: rendered-worker-5e2fcf39779f6e113f77e6b1f71a05de
248 2020-06-02T08:28:13.004251894Z I0602 08:28:13.004206 2094 daemon.go:992] Validating against pending config rendered-worker-5e2fcf39779f6e113f77e6b1f71a05de
249 2020-06-02T08:28:13.244763039Z E0602 08:28:13.244193 2094 daemon.go:1378] could not stat file: "/etc/crio/crio.conf", error: lstat /etc/crio/crio.conf: no such file or directory
250 2020-06-02T08:28:13.244763039Z E0602 08:28:13.244216 2094 writer.go:135] Marking Degraded due to: unexpected on-disk state validating against rendered-worker-5e2fcf39779f6e113f77e6b1f71a05de
```

Can you tell me the reproduction rate of this issue?

There seem to be a bunch of other issues with this cluster:

```
170 2020-06-02T08:20:53.021428876Z I0602 08:20:52.919028 428428 rpm-ostree.go:246] Pivoting to: 45.81.202005291504-0 (0c71982828e2c7d335a0dd5e988d80b817ce4193d8b4ef97f8d9f216ecd5e116)
171 2020-06-02T08:20:53.021428876Z client(id:cli dbus:1.1092 unit:machine-config-daemon-host.service uid:0) added; new total=1
172 2020-06-02T08:20:53.4321233Z Initiated txn UpdateDeployment for client(id:cli dbus:1.1092 unit:machine-config-daemon-host.service uid:0): /org/projectatomic/rpmostree1/rhcos
173 2020-06-02T08:21:00.931620484Z Created new deployment /ostree/deploy/rhcos/deploy/0c71982828e2c7d335a0dd5e988d80b817ce4193d8b4ef97f8d9f216ecd5e116.0
174 2020-06-02T08:21:00.931620484Z sanitycheck(/usr/bin/true) successful
175 2020-06-02T08:21:01.932081633Z Txn UpdateDeployment on /org/projectatomic/rpmostree1/rhcos successful
176 2020-06-02T08:21:01.932081633Z client(id:cli dbus:1.1092 unit:machine-config-daemon-host.service uid:0) vanished; remaining=0
177 2020-06-02T08:21:01.932081633Z In idle state; will auto-exit in 61 seconds
178 2020-06-02T08:21:02.104386609Z I0602 08:21:02.104338 415991 update.go:1393] initiating reboot: Node will reboot into config rendered-worker-c2d5283deeb9c08da7d5c2be0309eccc
179 2020-06-02T08:21:02.288441293Z I0602 08:21:02.287406 415991 daemon.go:590] Shutting down MachineConfigDaemon
180 2020-06-02T08:24:15.238555586Z I0602 08:24:15.237189 2094 start.go:74] Version: v4.5.0-202005301517-dirty (293f78b64d86f9f0491f6baa991e3f0c8fe1b046)
181 2020-06-02T08:24:15.24417601Z I0602 08:24:15.243873 2094 start.go:84] Calling chroot("/rootfs")
182 2020-06-02T08:24:15.244851195Z I0602 08:24:15.244553 2094 rpm-ostree.go:368] Running captured: rpm-ostree status --json
183 2020-06-02T08:24:15.990817588Z I0602 08:24:15.990762 2094 daemon.go:211] Booted osImageURL: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:5c8ad7572d9d6d2bd0f8673b01dbd9e01ba5500aad8fa4690a7c91d85dca437c (44.81.202005250830-0)
184 2020-06-02T08:24:16.001472968Z I0602 08:24:16.001431 2094 metrics.go:106] Registering Prometheus metrics
185 2020-06-02T08:24:16.001575818Z I0602 08:24:16.001564 2094 metrics.go:111] Starting metrics listener on 127.0.0.1:8797
186 2020-06-02T08:24:16.005029027Z I0602 08:24:16.004998 2094 update.go:1393] Starting to manage node: compute-0
187 2020-06-02T08:24:16.021796261Z I0602 08:24:16.021758 2094 rpm-ostree.go:368] Running captured: rpm-ostree status
188 2020-06-02T08:24:16.288078923Z I0602 08:24:16.288030 2094 daemon.go:815] State: idle
189 2020-06-02T08:24:16.288078923Z AutomaticUpdates: disabled
190 2020-06-02T08:24:16.288078923Z Deployments:
191 2020-06-02T08:24:16.288078923Z * pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:5c8ad7572d9d6d2bd0f8673b01dbd9e01ba5500aad8fa4690a7c91d85dca437c
192 2020-06-02T08:24:16.288078923Z CustomOrigin: Managed by machine-config-operator
193 2020-06-02T08:24:16.288078923Z Version: 44.81.202005250830-0 (2020-05-25T08:35:40Z)
194 2020-06-02T08:24:16.288078923Z
195 2020-06-02T08:24:16.288078923Z pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:fe4ff9322d3ed09dc58668179d99840397892b300960caa0a3eb8822e15ba519
196 2020-06-02T08:24:16.288078923Z CustomOrigin: Managed by machine-config-operator
197 2020-06-02T08:24:16.288078923Z Version: 45.81.202005291504-0 (2020-05-29T15:08:02Z)
```

The above shows logs for a successful pivot, but then upon reboot the old deployment is still in play? cc'ing Colin on this.

Also, the pending config in the previous comment is kind of inexplicable: looking at the code, the correct pending config must have been written and then read after reboot. I think what we need is an assessment of the reproduction rate of this with a newer payload.

The MCC logs look highly suspicious as well:

```
41 2020-06-02T14:18:58.28067928Z I0602 14:18:58.280554 1 node_controller.go:453] Pool worker: node localhost changed machineconfiguration.openshift.io/currentConfig = rendered-worker-5e2fcf39779f6e113f77e6b1f71a05de
42 2020-06-02T14:18:58.28067928Z I0602 14:18:58.280612 1 node_controller.go:453] Pool worker: node localhost changed machineconfiguration.openshift.io/desiredConfig = rendered-worker-5e2fcf39779f6e113f77e6b1f71a05de
43 2020-06-02T14:18:58.28067928Z I0602 14:18:58.280624 1 node_controller.go:453] Pool worker: node localhost changed machineconfiguration.openshift.io/state = Done
```

Also, I can see the master pool has FIPS.

There you go, your master pool was broken:

```
230391 2020-06-03T14:36:44.460501037Z I0603 14:36:44.460468 2504 daemon.go:992] Validating against pending config rendered-master-0764d3758dba677517722abcd8d01f75
230392 2020-06-03T14:36:44.854410706Z E0603 14:36:44.854370 2504 daemon.go:1391] content mismatch for file /etc/kubernetes/kubelet.conf:
```

and then a diff. I'm CC'ing Jerry, as I suspect this could be the kubelet config bug we already had. I've also opened https://github.com/openshift/must-gather/pull/157, as we need those logs going forward (they are missing from the current must-gather).

The kubeletconfig error you see is not the source of the issue here. Taking a look just at the masters (since only the masters have a kubeletconfig defined): there are 3 rendered master configs on the system, which we'll name as follows for the sake of discussion:

- old — rendered-master-0764d3758dba677517722abcd8d01f75
- current — rendered-master-fcd0e2ffa87f62eb0260c0bb8ee0286e
- new — rendered-master-78d80a78b21182d10da2f40d35158f3c

where old is the first boot of the system, current is the same as old plus the kubeletconfig being applied, and new is the 4.5 nightly being upgraded to. The first upgrade from old to current completed correctly for all 3 nodes.
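For reference, the three rendered configs named above can be fetched from the cluster and compared directly; a minimal sketch using standard `oc` commands (the rendered-config names are the ones from this bug):

```
# List the rendered master configs known to the cluster
oc get machineconfig | grep rendered-master

# Dump old/current/new and diff them; per the analysis above, old vs. current
# should differ only by the kubeletconfig change
oc get machineconfig rendered-master-0764d3758dba677517722abcd8d01f75 -o yaml > old.yaml
oc get machineconfig rendered-master-fcd0e2ffa87f62eb0260c0bb8ee0286e -o yaml > current.yaml
oc get machineconfig rendered-master-78d80a78b21182d10da2f40d35158f3c -o yaml > new.yaml
diff old.yaml current.yaml
```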
Upon attempting to upgrade from current to new, the upgrade failed. See: machine-config-daemon-jn9tv/machine-config-daemon/machine-config-daemon/logs/current.log

The first oddity in this log is that both the pre-shutdown and the new MCD logs are logged to the same file. This should not have been the case: upon rolling out the new deployment a completely new MCD should have been deployed, and yet if we look at the above log, you can see:

```
2020-06-02T08:20:41.312541614Z I0602 08:20:41.312468 132706 daemon.go:590] Shutting down MachineConfigDaemon
2020-06-02T08:25:17.577128832Z I0602 08:25:17.574252 2504 start.go:74] Version: v4.5.0-202005301517-dirty (293f78b64d86f9f0491f6baa991e3f0c8fe1b046)
```

The MCD pod originally running there restarted. I think this indicates something going wrong during the upgrade. A few lines later it says:

```
2020-06-02T08:25:19.226127395Z Deployments:
2020-06-02T08:25:19.226127395Z * pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:5c8ad7572d9d6d2bd0f8673b01dbd9e01ba5500aad8fa4690a7c91d85dca437c
2020-06-02T08:25:19.226127395Z CustomOrigin: Managed by machine-config-operator
2020-06-02T08:25:19.226127395Z Version: 44.81.202005250830-0 (2020-05-25T08:35:40Z)
2020-06-02T08:25:19.226127395Z
2020-06-02T08:25:19.226127395Z pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:fe4ff9322d3ed09dc58668179d99840397892b300960caa0a3eb8822e15ba519
2020-06-02T08:25:19.226127395Z CustomOrigin: Managed by machine-config-operator
2020-06-02T08:25:19.226127395Z Version: 45.81.202005291504-0 (2020-05-29T15:08:02Z)
```

That's still the old deployment. It seems we never pivoted, despite it earlier claiming:

```
2020-06-02T08:20:31.902507451Z I0602 08:20:31.710600 141725 rpm-ostree.go:246] Pivoting to: 45.81.202005291504-0 (0c71982828e2c7d335a0dd5e988d80b817ce4193d8b4ef97f8d9f216ecd5e116)
2020-06-02T08:20:31.902507451Z client(id:cli dbus:1.1151 unit:machine-config-daemon-host.service uid:0) added; new total=1
2020-06-02T08:20:31.902507451Z Initiated txn UpdateDeployment for client(id:cli dbus:1.1151 unit:machine-config-daemon-host.service uid:0): /org/projectatomic/rpmostree1/rhcos
2020-06-02T08:20:39.858658739Z Created new deployment /ostree/deploy/rhcos/deploy/0c71982828e2c7d335a0dd5e988d80b817ce4193d8b4ef97f8d9f216ecd5e116.0
2020-06-02T08:20:39.858658739Z sanitycheck(/usr/bin/true) successful
2020-06-02T08:20:40.652521666Z Txn UpdateDeployment on /org/projectatomic/rpmostree1/rhcos successful
2020-06-02T08:20:40.652521666Z client(id:cli dbus:1.1151 unit:machine-config-daemon-host.service uid:0) vanished; remaining=0
2020-06-02T08:20:40.652521666Z In idle state; will auto-exit in 61 seconds
2020-06-02T08:20:41.106344141Z I0602 08:20:41.106310 132706 update.go:1393] initiating reboot: Node will reboot into config rendered-master-78d80a78b21182d10da2f40d35158f3c
```

The logs don't otherwise indicate what went wrong here. Full journal logs may give us the reason. (An rpm-ostree failure, maybe?)
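One way to cross-check whether the pivot really happened is to compare the booted deployment on the node against the osImageURL of the desired rendered config; a sketch, assuming host access via `oc debug node/<name>` and `chroot /host`:

```
# On the node: the deployment marked '*' is what is actually booted
rpm-ostree status

# From a cluster client: the OS image the desired config expects
oc get machineconfig rendered-master-78d80a78b21182d10da2f40d35158f3c \
  -o jsonpath='{.spec.osImageURL}{"\n"}'
# If the booted deployment's pivot:// pullspec does not match this digest,
# the node never rebooted into the new deployment.
```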
Then, we start seeing the error, and:

```
2020-06-02T08:25:29.610143137Z I0602 08:25:29.609244 2504 daemon.go:768] Current config: rendered-master-fcd0e2ffa87f62eb0260c0bb8ee0286e
2020-06-02T08:25:29.610143137Z I0602 08:25:29.609261 2504 daemon.go:769] Desired config: rendered-master-78d80a78b21182d10da2f40d35158f3c
2020-06-02T08:25:29.610143137Z I0602 08:25:29.609268 2504 daemon.go:783] Pending config: rendered-master-0764d3758dba677517722abcd8d01f75
2020-06-02T08:25:29.772138517Z I0602 08:25:29.771640 2504 update.go:1393] Disk currentConfig rendered-master-0764d3758dba677517722abcd8d01f75 overrides node annotation rendered-master-fcd0e2ffa87f62eb0260c0bb8ee0286e
2020-06-02T08:25:29.798157031Z I0602 08:25:29.796863 2504 daemon.go:992] Validating against pending config rendered-master-0764d3758dba677517722abcd8d01f75
```

Well, of course that's wrong. We're at least on rendered-master-fcd0e2ffa87f62eb0260c0bb8ee0286e (current), which has the kubeletconfig changes. Yet the node is validating against old, which is why it is reporting the kubeletconfig error. So the node should be on current or new; based on the machine-os-content I think we're still on current, but the logs indicate we successfully upgraded. Again, really weird.

As to the 2 lines referencing old (rendered-master-0764d3758dba677517722abcd8d01f75), they mean that on the host:

1. /etc/machine-config-daemon/currentconfig has rendered-master-0764d3758dba677517722abcd8d01f75
2. journalctl -o json _UID=0 MESSAGE_ID=machine-config-daemon-pending-state has rendered-master-0764d3758dba677517722abcd8d01f75

Again, to debug this we probably need the full journal logs from the host; must-gather only captures crio/kubelet logs. If another cluster exhibits this behaviour, could you capture the full journalctl logs as well?

OK, and the worker pool also exhibits similar behaviour:

```
2020-06-02T08:24:23.374544879Z I0602 08:24:23.374476 2094 daemon.go:768] Current config: rendered-worker-5e2fcf39779f6e113f77e6b1f71a05de
2020-06-02T08:24:23.374544879Z I0602 08:24:23.374499 2094 daemon.go:769] Desired config: rendered-worker-c2d5283deeb9c08da7d5c2be0309eccc
2020-06-02T08:24:23.374544879Z I0602 08:24:23.374505 2094 daemon.go:783] Pending config: rendered-worker-5e2fcf39779f6e113f77e6b1f71a05de
2020-06-02T08:24:23.415838052Z I0602 08:24:23.415782 2094 daemon.go:992] Validating against pending config rendered-worker-5e2fcf39779f6e113f77e6b1f71a05de
```

Since there are only 2 configs here it's easier to see where the state is: post-upgrade the files were updated, but the osImageURL is still the old one. The node is now caught in a limbo state where it is neither rendered-worker-5e2fcf39779f6e113f77e6b1f71a05de (because the files were updated) but also not rendered-worker-c2d5283deeb9c08da7d5c2be0309eccc (because rpm-ostree did not pivot). These are the update logs:

```
2020-06-02T08:20:30.249070886Z I0602 08:20:30.249043 415991 update.go:1271] Updating OS to quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:fe4ff9322d3ed09dc58668179d99840397892b300960caa0a3eb8822e15ba519
2020-06-02T08:20:30.28864255Z In idle state; will auto-exit in 61 seconds
2020-06-02T08:20:30.28864255Z Started RPM-OSTree System Management Daemon.
2020-06-02T08:20:30.28864255Z client(id:cli dbus:1.1087 unit:crio-e729f67e2e91be7f8180c1ee34cd4bad02c9ca40244beab0d4089a1d41a4edae.scope uid:0) added; new total=1
2020-06-02T08:20:30.28864255Z client(id:cli dbus:1.1087 unit:crio-e729f67e2e91be7f8180c1ee34cd4bad02c9ca40244beab0d4089a1d41a4edae.scope uid:0) vanished; remaining=0
2020-06-02T08:20:30.28864255Z In idle state; will auto-exit in 62 seconds
2020-06-02T08:20:30.28864255Z client(id:cli dbus:1.1089 unit:crio-e729f67e2e91be7f8180c1ee34cd4bad02c9ca40244beab0d4089a1d41a4edae.scope uid:0) added; new total=1
2020-06-02T08:20:30.28864255Z client(id:cli dbus:1.1089 unit:crio-e729f67e2e91be7f8180c1ee34cd4bad02c9ca40244beab0d4089a1d41a4edae.scope uid:0) vanished; remaining=0
2020-06-02T08:20:30.28864255Z In idle state; will auto-exit in 64 seconds
2020-06-02T08:20:30.28864255Z In idle state; will auto-exit in 60 seconds
2020-06-02T08:20:30.28864255Z rpm-ostreed.service: Consumed 116ms CPU time
2020-06-02T08:20:30.541315415Z Starting Machine Config Daemon Initial...
2020-06-02T08:20:30.541315415Z I0602 08:20:30.333046 428428 rpm-ostree.go:366] Running captured: rpm-ostree status --json
2020-06-02T08:20:30.541315415Z Starting RPM-OSTree System Management Daemon...
2020-06-02T08:20:30.541315415Z Reading config file '/etc/rpm-ostreed.conf'
2020-06-02T08:20:30.541315415Z In idle state; will auto-exit in 60 seconds
2020-06-02T08:20:30.541315415Z Started RPM-OSTree System Management Daemon.
2020-06-02T08:20:30.541315415Z client(id:cli dbus:1.1090 unit:machine-config-daemon-host.service uid:0) added; new total=1
2020-06-02T08:20:30.89576432Z client(id:cli dbus:1.1090 unit:machine-config-daemon-host.service uid:0) vanished; remaining=0
2020-06-02T08:20:30.89576432Z In idle state; will auto-exit in 60 seconds
2020-06-02T08:20:30.89576432Z I0602 08:20:30.552666 428428 rpm-ostree.go:154] Previous pivot: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:5c8ad7572d9d6d2bd0f8673b01dbd9e01ba5500aad8fa4690a7c91d85dca437c
2020-06-02T08:20:30.89576432Z I0602 08:20:30.552785 428428 run.go:16] Running: podman pull -q --authfile /var/lib/kubelet/config.json quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:fe4ff9322d3ed09dc58668179d99840397892b300960caa0a3eb8822e15ba519
2020-06-02T08:20:30.89576432Z 2020-06-02 08:20:30.738310966 +0000 UTC m=+0.128692845 system refresh
2020-06-02T08:20:52.682154014Z 2020-06-02 08:20:52.432658853 +0000 UTC m=+21.823040743 image pull
2020-06-02T08:20:52.682154014Z b5be1b28281306372765ad38f41123a1875ffd2a35b485f8d0e8db7d315a5bfb
2020-06-02T08:20:52.682154014Z I0602 08:20:52.443028 428428 rpm-ostree.go:366] Running captured: podman inspect --type=image quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:fe4ff9322d3ed09dc58668179d99840397892b300960caa0a3eb8822e15ba519
2020-06-02T08:20:52.682154014Z I0602 08:20:52.594268 428428 rpm-ostree.go:366] Running captured: podman create --net=none --annotation=org.openshift.machineconfigoperator.pivot=true --name ostree-container-pivot-1a2ccf6d-1c36-424b-ab84-619a5943bfbf quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:fe4ff9322d3ed09dc58668179d99840397892b300960caa0a3eb8822e15ba519
2020-06-02T08:20:53.021428876Z I0602 08:20:52.763016 428428 rpm-ostree.go:366] Running captured: podman mount 7bf69529015d73d4f90e8e2e9ee0121db38a9c1d5b361c2292424251d7d4ad12
2020-06-02T08:20:53.021428876Z I0602 08:20:52.919028 428428 rpm-ostree.go:246] Pivoting to: 45.81.202005291504-0 (0c71982828e2c7d335a0dd5e988d80b817ce4193d8b4ef97f8d9f216ecd5e116)
2020-06-02T08:20:53.021428876Z client(id:cli dbus:1.1092 unit:machine-config-daemon-host.service uid:0) added; new total=1
2020-06-02T08:20:53.4321233Z Initiated txn UpdateDeployment for client(id:cli dbus:1.1092 unit:machine-config-daemon-host.service uid:0): /org/projectatomic/rpmostree1/rhcos
2020-06-02T08:21:00.931620484Z Created new deployment /ostree/deploy/rhcos/deploy/0c71982828e2c7d335a0dd5e988d80b817ce4193d8b4ef97f8d9f216ecd5e116.0
2020-06-02T08:21:00.931620484Z sanitycheck(/usr/bin/true) successful
2020-06-02T08:21:01.932081633Z Txn UpdateDeployment on /org/projectatomic/rpmostree1/rhcos successful
2020-06-02T08:21:01.932081633Z client(id:cli dbus:1.1092 unit:machine-config-daemon-host.service uid:0) vanished; remaining=0
2020-06-02T08:21:01.932081633Z In idle state; will auto-exit in 61 seconds
2020-06-02T08:21:02.104386609Z I0602 08:21:02.104338 415991 update.go:1393] initiating reboot: Node will reboot into config rendered-worker-c2d5283deeb9c08da7d5c2be0309eccc
```

Nothing seems to be going wrong offhand. The journal logs are probably the next place to look.

> Again to debug, we probably need the full journal logs from the host. must-gather only captures crio/kubelet logs.

https://github.com/openshift/must-gather/pull/158
https://github.com/openshift/machine-config-operator/pull/1790

@Antonio Murdaca I only ran this upgrade test once, and it hit this issue. Our team runs upgrade CI and manual upgrade tests every day, so we can keep an eye on the reproduction rate of this issue.

*** Bug 1843309 has been marked as a duplicate of this bug. ***

OK, we had a long discussion about this bug in a realtime chat. I was *pretty* sure I understood the bug, but in trying to write up the diagnosis I am less certain. But... one smoking gun to me is the combination of facts:

- Something is rebooting before machine-config-daemon-firstboot.service runs
- The pivot code has a bare `systemctl reboot` call that doesn't scream into the logs; it's just calling glog (and I wonder if we're missing log verbosity here)

Today we're writing both `/etc/pivot/image-pullspec` *and* `/etc/ignition-machine-config-encapsulated.json`, and we do this because the first one is needed for 4.1 bootimages. In this BZ I believe what's happening is that `machine-config-daemon-host.service` is being triggered by that before `machine-config-daemon-firstboot.service` runs. In that case it's possible for it to win and kill off `machine-config-daemon-firstboot.service` *before* it has a chance to unlink its `/etc/ignition-machine-config-encapsulated.json` stamp file.

Previous to https://github.com/openshift/machine-config-operator/pull/1762 this failure was masked, because on the next boot the service wouldn't run, since the BindsTo= was present. But now that we removed that, the service does run again. I think we need to clearly split off something like `machine-config-daemon-firstboot-v41` that is only triggered for 4.1 systems, leaving `machine-config-daemon-host.service` to only be triggered by another service or the MCD pod.
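The on-host state referenced above can be inspected directly; a minimal sketch, run as root on the node (the journalctl invocation is the one quoted in the comment above):

```
# 1. The MCD's on-disk record of the current config (a full MachineConfig in JSON)
cat /etc/machine-config-daemon/currentconfig | head -c 300; echo

# 2. The journal record of the pending config written before the reboot
journalctl -o json _UID=0 MESSAGE_ID=machine-config-daemon-pending-state | tail -n 1

# 3. The firstboot/pivot trigger files; if these still exist, firstboot never
#    finished and machine-config-daemon-host.service can be triggered again
ls -l /etc/pivot/image-pullspec /etc/ignition-machine-config-encapsulated.json
```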
Related previous discussion in https://github.com/openshift/machine-config-operator/pull/904/files/40d8225dae9f25b392e39fa2ce30c48d18638fa5#diff-4eb88ea6718e4dcceecd6dfe6e4437e1

It's hard to trace through this because `machine-config-daemon-host` is started by all 3 of:

- directly, via the presence of `/etc/pivot/image-pullspec`
- via `machine-config-daemon-firstboot`
- via the MCD pod

It does look like on firstboot, though, we're doing two concurrent pulls from different process IDs:

```
Jun 09 02:36:06 compute-0 machine-config-daemon[1887]: I0609 02:36:06.508497 1887 run.go:16] Running: podman pull -q --authfile /var/lib/kubelet/config.json quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:70795637fd2143de29d7486305da17120b7e2270cc5b887b3f21d921391>
Jun 09 02:36:06 compute-0 machine-config-daemon[1833]: I0609 02:36:06.508497 1887 run.go:16] Running: podman pull -q --authfile /var/lib/kubelet/config.json quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:70795637fd2143de29d7486305da17120b7e2270cc5b887b3f21d921391>
```

So that means the services are conflicting. But... this still leaves mysterious what's causing a reboot, because AFAICS we should have logged something there.

OK, I didn't test this, but I think https://github.com/openshift/machine-config-operator/pull/1804 will fix this.

No, argh — there are indeed two separate processes, but notice that the pull happens in the same microsecond, and pid 1887 appears in both log lines. So 1833 there is just `machine-config-daemon-firstboot.service` proxying the journal logs. https://github.com/openshift/machine-config-operator/pull/1802 would fix the confusion there, but... that's not a bug.

The cluster here is on vSphere, correct? Has this bug ever reproduced outside of vSphere?

Can you think of anything here that might be rebooting the node during early cluster bringup? The vmware agent for example is active... could there be e.g. some tooling that's trying to do live migration?

(In reply to Colin Walters from comment #25)
> The cluster here is on vSphere, correct? Has this bug ever reproduced outside of vSphere?

We hit it on UPI/baremetal too. Please refer to https://bugzilla.redhat.com/show_bug.cgi?id=1843309

Wait a second, where is this unit coming from?

```
# cat /etc/systemd/system/multi-user.target.wants/restart.service
[Unit]
ConditionFirstBoot=yes
[Service]
Type=idle
ExecStart=/sbin/reboot
[Install]
WantedBy=multi-user.target
#
```

It seems to be written by Ignition... There's the smoking gun for sure: that's what's rebooting the cluster in the middle of firstboot with zero logging!

Ahh wait, I am dimly recalling the installer was doing this at one point... A bit of digging in github.com/openshift/installer with `git log -G restart\.service` turns up https://github.com/openshift/installer/commit/e284327c8771ef84c9151f5db97f1f52db20ef0d which came from https://github.com/openshift/installer/pull/3497

So... there's not much comment in that commit for why it was removed, but it definitely needs to go! I'm reassigning to Installer based on this; feel free to close as a duplicate. Since GitHub says it's a large diff and won't expand it, here's the direct link to the removal: https://github.com/openshift/installer/commit/e284327c8771ef84c9151f5db97f1f52db20ef0d#diff-a40d954648d45299ae39941ce1a70a68L51

> We hit it on UPI/baremetal too. Please refer to https://bugzilla.redhat.com/show_bug.cgi?id=1843309

But is that *real* baremetal, or is it baremetal emulated with libvirt?
There was a similar issue with IPI baremetal which was *also* (forcibly) rebooting the nodes: https://bugzilla.redhat.com/show_bug.cgi?id=1840222

Can you give me a kubeconfig for an affected cluster?

(In reply to Colin Walters from comment #29)
> But is that *real* baremetal, or is it baremetal emulated with libvirt?
> There was a similar issue with IPI baremetal which was *also* (forcibly) rebooting the nodes: https://bugzilla.redhat.com/show_bug.cgi?id=1840222
> Can you give me a kubeconfig for an affected cluster?

It's not *real* baremetal; it is a VM on OpenStack. I don't have an affected cluster on hand right now; if needed, I can rebuild one for you to debug further.

> There's the smoking gun for sure: that's what's rebooting the cluster in the middle of firstboot with zero logging! Ahh wait, I am dimly recalling the installer was doing this at one point...
Do you mean the restart service is the root cause?

The reason the restart service exists is that in some installations the user needs to change the network configuration, e.g. a static IP or DOMAIN in the network configuration file. Once the changed network config file is laid down, a reboot is needed for it to take effect, so that the initial fresh install completes successfully. (A sketch of how such a unit gets injected via Ignition follows below.)

Do you mean that once a user has defined such a restart service, they can no longer upgrade their cluster?

And the restart unit has existed for a long time, since 4.1, but we never hit such a problem before, especially during 4.1 -> 4.2, 4.2 -> 4.3, and 4.3 -> 4.4. Why did this only happen on 4.4 -> 4.5?
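For context, this is roughly how an install flow injects such a unit — a hypothetical sketch of the Ignition snippet (spec v2, as used by 4.4-era installs; the unit body matches the one found on the affected node above):

```
{
  "ignition": { "version": "2.2.0" },
  "systemd": {
    "units": [
      {
        "name": "restart.service",
        "enabled": true,
        "contents": "[Unit]\nConditionFirstBoot=yes\n[Service]\nType=idle\nExecStart=/sbin/reboot\n[Install]\nWantedBy=multi-user.target\n"
      }
    ]
  }
}
```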
> Antonio and I currently suspect maybe https://github.com/openshift/machine-config-operator/pull/1762 regressed this behaviour as we removed a gate for running the firstboot service.
Going through QE's test history: we even ran an upgrade of a cluster (with the user-defined restart unit injected via Ignition) from 4.4.0-0.nightly-2020-05-21-042450 to 4.5.0-0.nightly-2020-05-21-010047, and it worked fine.
We hit this issue for the first time when upgrading a cluster (with the user-defined restart unit injected via Ignition) from 4.4 to 4.5.0-0.nightly-2020-06-01-081609.
Because the above nightly builds have already been pruned, I cannot confirm whether this is a regression caused by PR #1762, but from the behaviour it sounds like it. Why does this user-defined restart unit make the MCD run into the firstboot flow? It sounds like the MCO did not handle this particular systemd unit well.
I also gave it another shot: I removed the restart unit from the 4.4 Ignition files, installed a cluster with 4.4.0-0.nightly-2020-06-08-083627, triggered an upgrade to 4.5.0-0.nightly-2020-06-08-031520, and it succeeded.
Luckily, removing the restart unit from the 4.4 Ignition files does not affect fresh installation, but any older cluster that was installed with the restart unit will hit this problem 100% of the time when upgrading to 4.5. This becomes a compatibility issue.
Based on the above analysis, I suggest that the MCO should manage all kinds of systemd units (including this restart unit) and guard against running into such issues.
(In reply to Colin Walters from comment #25)
> The cluster here is on vSphere, correct? Has this bug ever reproduced outside of vSphere?
> Can you think of anything here that might be rebooting the node during early cluster bringup? The vmware agent for example is active... could there be e.g. some tooling that's trying to do live migration?

1. Yes, this cluster is on vSphere 6.5; I'm not sure whether it reproduces on other platforms.
2. There is one case of creating a kubeletconfig: it needs to override /etc/kubernetes/kubelet.conf and then reboot the node.

(In reply to MinLi from comment #33)
> 2. There is one case of creating a kubeletconfig: it needs to override /etc/kubernetes/kubelet.conf and then reboot the node.

I think we can ignore this to avoid confusion here. As I already explained in comment 31, the reboot comes from the restart systemd unit injected via Ignition in the 4.4 fresh install.

Adding back the needinfo flag for comments 31 and 32.

My understanding is that the reboot was done to pick up config changes for the networking interface (in OKD this was actually changed to only restart NetworkManager, without rebooting: https://github.com/openshift/installer/blob/fcos/upi/vsphere/machine/ignition.tf#L48-L69). Now in OCP it looks like this has been changed to configure a containerized HAProxy directly and restart it, in https://github.com/openshift/installer/commit/e284327c8771ef84c9151f5db97f1f52db20ef0d

The issue I see here is that this restart service unit — which is really a remnant from initial provisioning and shouldn't even exist anymore — in fact still exists on the affected machines, and with the recent removal of BindsTo=ignition-firstboot-complete.service from machine-config-daemon-firstboot.service it now runs again, even though we obviously don't want it to. I don't think this can be fixed in the Installer; I think we need the MCO to ensure the service unit is removed, or at least disabled, on affected platforms before attempting to upgrade to 4.5. There might be a way to remove the service unit entirely; for now I created PRs to disable the service.

PR to master: https://github.com/openshift/machine-config-operator/pull/1805
PR to release-4.5: https://github.com/openshift/machine-config-operator/pull/1806

> The issue I see here is that this restart service unit - which is really a remnant from initial provisioning and shouldn't even exist anymore

It exists in at least 4.4.0-0.nightly-2020-06-07-014539 (but not 4.5+, I think). But even though that PR was merged a while ago into release-4.4, it seems not to be in the release image yet; I need to chase that down.

> - in fact still exists on the affected machines, and with the recent removal of BindsTo=ignition-firstboot-complete.service from machine-config-daemon-firstboot.service it now runs again - even though we obviously don't want it to.

Wait... no, on this affected node, I only see it running on the firstboot:

```
[root@compute-0 /]# journalctl -u restart
-- Logs begin at Tue 2020-06-09 02:21:47 UTC, end at Wed 2020-06-10 12:26:26 UTC. --
Jun 09 02:36:04 compute-0 systemd[1]: Started restart.service.
Jun 09 02:36:09 compute-0 systemd[1]: Stopping restart.service...
Jun 09 02:36:09 compute-0 systemd[1]: Stopped restart.service.
Jun 09 02:36:09 compute-0 systemd[1]: restart.service: Consumed 8ms CPU time
[root@compute-0 /]#
```

as you'd expect, because the unit has ConditionFirstBoot=yes.
I do think it's extremely likely that the attempted fix for the IPI baremetal forced reboots in https://github.com/openshift/machine-config-operator/pull/1762 somehow made this vSphere rebooting worse. I will trace through that. We could almost certainly improve things in the MCO here; let's look at that. But bottom line: since the installer-generated restart.service is gone in release-4.4, I would like to see that shipped so we don't have to worry about this weird special case of vSphere rebooting.

> It's not *real* baremetal; it is a VM on OpenStack. I don't have an affected cluster on hand right now; if needed, I can rebuild one for you to debug further.

Yes, please do that. I also asked about this on https://bugzilla.redhat.com/show_bug.cgi?id=1840222#c16

(In reply to Colin Walters from comment #39)
> Yes, please do that. I also asked about this on https://bugzilla.redhat.com/show_bug.cgi?id=1840222#c16

If you want an env to reproduce this issue, QE can enable the 'restart' unit via Ignition files to recreate it. Do you still want such an env, especially since it sounds like we already have the root cause?

OK, right — the MCO problem is:

- 1st boot: 4.4 install, the machine boots the first time; restart.service kills machine-config-daemon-firstboot before it upgrades, or before it unlinks the file
- 2nd boot: we stumble on without having pivoted; the node joins the cluster, and the MCD starts a reboot into 44.81.202006062030-0, which we *should* have updated to before joining the cluster
- 3rd boot: upgraded to 4.5, and the new machine-config-daemon-firstboot without BindsTo= gets laid down in /etc
- 4th boot (current): in 4.5, -firstboot triggers again, trying to finally apply the original config

The messy problem here is that I think not having the BindsTo= is correct from the start... we need to remove it before rebooting. PR inbound.

> If you want an env to reproduce this issue, QE can enable the 'restart' unit via Ignition files to recreate it. Do you still want such an env, especially since it sounds like we already have the root cause?
If this is occurring *outside* of vSphere/VMware (that sub-thread was about baremetal), it sounds like a separate (though maybe related) bug. Is there a must-gather for it? Or a reproducer environment w/kubeconfig would be great.
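On an affected node, the four-boot sequence laid out above can be checked directly from the persistent journal; a minimal sketch, run as root on the host:

```
# One entry per boot; the timeline above predicts four boots on an affected node
journalctl --list-boots

# In the earliest boot, look for who initiated the reboot (the injected
# restart.service logs nothing of its own, which is exactly the problem)
journalctl -b -3 | grep -iE 'reboot|firstboot'
```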
OK, now I'm 97% confident https://github.com/openshift/machine-config-operator/pull/1807 will fix this!

(In reply to Colin Walters from comment #42)
> If this is occurring *outside* of vSphere/VMware (that sub-thread was about baremetal), it sounds like a separate (though maybe related) bug. Is there a must-gather for it? Or a reproducer environment w/kubeconfig would be great.

After checking QE's cluster install scripts: for both UPI on vSphere and UPI on baremetal we were injecting the 'restart' unit via Ignition in the same way, so I am almost 99% sure they are the same issue here. I am re-creating such a UPI-on-baremetal env so you can double-check.

(In reply to Colin Walters from comment #39)
> Yes, please do that. I also asked about this on https://bugzilla.redhat.com/show_bug.cgi?id=1840222#c16

In #1840222 we are using the libvirt driver in sushy-tools to control the VMs; the bug causing the unwanted reboot was in the code path specific to libvirt. Given that this is a VM on OpenStack, we wouldn't be hitting the same problem here.

One other thing I need to emphasize: I suspect that even before the MCO change here actively broke things, this uncoordinated reboot has been impacting cluster reliability at install time, particularly for control plane bringup. See also https://github.com/openshift/machine-config-operator/pull/926

To fix this *immediately*, without waiting for other changes in the installer or MCO, please change your restart.service to look like this:

```
[Unit]
ConditionFirstBoot=yes
After=machine-config-daemon-firstboot.service

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/bin/sh -c 'echo starting QE injected reboot unit && /sbin/reboot'

[Install]
WantedBy=multi-user.target
```

The most important part here is the `After=`, but the `echo` telling us *why* the reboot is happening is about equally important =)

> The most important part here is the `After=`, but the `echo` telling us *why* the reboot is happening is about equally important =)
Good suggestion.
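Once the amended unit is in place, its ordering can be sanity-checked on the node; a small sketch:

```
# Confirm the After= ordering and the echo made it onto the host
systemctl cat restart.service

# Ordering dependencies as systemd actually resolved them
systemctl show restart.service -p After
```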
Actually, since the node will reboot from machine-config-daemon-firstboot.service in all cases, why are you even injecting a restart.service at all? I don't think it should be necessary.

What fix is going to be verified in this bug? https://github.com/openshift/machine-config-operator/pull/1807 ? Something from the installer? If the MCO fix unblocks 4.5 upgrades, I'd recommend we address any non-urgent problems in 4.6 and consider backporting.

(In reply to Colin Walters from comment #49)
> Actually, since the node will reboot from machine-config-daemon-firstboot.service in all cases, why are you even injecting a restart.service at all? I don't think it should be necessary.

Just like I said in comment 44, I was injecting restart.service into a UPI-on-baremetal install to reproduce this issue, in case you suspected it was some other, separate issue.

(In reply to Brenton Leanhardt from comment #50)
> What fix is going to be verified in this bug? https://github.com/openshift/machine-config-operator/pull/1807 ? Something from the installer? If the MCO fix unblocks 4.5 upgrades I'd recommend we address any non urgent problems in 4.6 and consider backporting.

https://github.com/openshift/machine-config-operator/pull/1807 provides a remediation for this issue. My suggestion is that 1807 target this bug, and that the installer team review whether they should backport the removal of the offending unit.

Verified this bug with 4.6.0-0.nightly-2020-06-15-214351 (upgrade from 4.5.0-0.nightly-2020-06-11-183238), and PASS.

```
[root@preserve-jialiu-ansible ~]# oc get node
NAME                              STATUS   ROLES           AGE   VERSION
jialiu452-p9652-control-plane-0   Ready    master,worker   16h   v1.18.3+2164959
jialiu452-p9652-control-plane-1   Ready    master,worker   16h   v1.18.3+2164959
jialiu452-p9652-control-plane-2   Ready    master,worker   16h   v1.18.3+2164959

[root@preserve-jialiu-ansible ~]# oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.6.0-0.nightly-2020-06-15-214351   True        False         False      16h
cloud-credential                           4.6.0-0.nightly-2020-06-15-214351   True        False         False      16h
cluster-autoscaler                         4.6.0-0.nightly-2020-06-15-214351   True        False         False      16h
config-operator                            4.6.0-0.nightly-2020-06-15-214351   True        False         False      16h
console                                    4.6.0-0.nightly-2020-06-15-214351   True        False         False      4h2m
csi-snapshot-controller                    4.6.0-0.nightly-2020-06-15-214351   True        False         False      4h
dns                                        4.6.0-0.nightly-2020-06-15-214351   True        False         False      16h
etcd                                       4.6.0-0.nightly-2020-06-15-214351   True        False         False      16h
image-registry                             4.6.0-0.nightly-2020-06-15-214351   True        False         False      4h1m
ingress                                    4.6.0-0.nightly-2020-06-15-214351   True        False         False      4h23m
insights                                   4.6.0-0.nightly-2020-06-15-214351   True        False         False      16h
kube-apiserver                             4.6.0-0.nightly-2020-06-15-214351   True        False         False      16h
kube-controller-manager                    4.6.0-0.nightly-2020-06-15-214351   True        False         False      16h
kube-scheduler                             4.6.0-0.nightly-2020-06-15-214351   True        False         False      16h
kube-storage-version-migrator              4.6.0-0.nightly-2020-06-15-214351   True        False         False      16h
machine-api                                4.6.0-0.nightly-2020-06-15-214351   True        False         False      16h
machine-approver                           4.6.0-0.nightly-2020-06-15-214351   True        False         False      16h
machine-config                             4.6.0-0.nightly-2020-06-15-214351   True        False         False      4h27m
marketplace                                4.6.0-0.nightly-2020-06-15-214351   True        False         False      4h1m
monitoring                                 4.6.0-0.nightly-2020-06-15-214351   True        False         False      4h
network                                    4.6.0-0.nightly-2020-06-15-214351   True        False         False      16h
node-tuning                                4.6.0-0.nightly-2020-06-15-214351   True        False         False      4h24m
openshift-apiserver                        4.6.0-0.nightly-2020-06-15-214351   True        False         False      4h3m
openshift-controller-manager               4.6.0-0.nightly-2020-06-15-214351   True        False         False      4h23m
openshift-samples                          4.6.0-0.nightly-2020-06-15-214351   True        False         False      4h23m
operator-lifecycle-manager                 4.6.0-0.nightly-2020-06-15-214351   True        False         False      16h
operator-lifecycle-manager-catalog         4.6.0-0.nightly-2020-06-15-214351   True        False         False      16h
operator-lifecycle-manager-packageserver   4.6.0-0.nightly-2020-06-15-214351   True        False         False      4h1m
service-ca                                 4.6.0-0.nightly-2020-06-15-214351   True        False         False      16h
storage                                    4.6.0-0.nightly-2020-06-15-214351   True        False         False      4h24m

[root@jialiu452-p9652-control-plane-2 ~]# systemctl list-unit-files | grep restart
restart.service                            enabled

[root@jialiu452-p9652-control-plane-2 ~]# journalctl --system | grep -i reboot
Jun 15 14:45:34 host-10-0-98-190.openstacklocal ignition[1145]: INFO : files: op(3a): [started] processing unit "mcd-write-pivot-reboot.service"
Jun 15 14:45:34 host-10-0-98-190.openstacklocal ignition[1145]: INFO : files: op(3a): op(3b): [started] writing unit "mcd-write-pivot-reboot.service" at "etc/systemd/system/mcd-write-pivot-reboot.service"
Jun 15 14:45:34 host-10-0-98-190.openstacklocal ignition[1145]: INFO : files: op(3a): op(3b): [finished] writing unit "mcd-write-pivot-reboot.service" at "etc/systemd/system/mcd-write-pivot-reboot.service"
Jun 15 14:45:34 host-10-0-98-190.openstacklocal ignition[1145]: INFO : files: op(3a): [finished] processing unit "mcd-write-pivot-reboot.service"
Jun 15 14:45:34 host-10-0-98-190.openstacklocal ignition[1145]: INFO : files: op(3c): [started] enabling unit "mcd-write-pivot-reboot.service"
Jun 15 14:45:34 host-10-0-98-190.openstacklocal ignition[1145]: INFO : files: op(3c): [finished] enabling unit "mcd-write-pivot-reboot.service"
Jun 15 14:45:37 jialiu452-p9652-control-plane-2 restorecon[1514]: Relabeled /etc/systemd/system/mcd-write-pivot-reboot.service from system_u:object_r:unlabeled_t:s0 to system_u:object_r:power_unit_file_t:s0
Jun 15 14:45:38 jialiu452-p9652-control-plane-2 systemd[1]: Started mcd-write-pivot-reboot.service.
Jun 15 14:45:38 jialiu452-p9652-control-plane-2 systemd[1]: mcd-write-pivot-reboot.service: Consumed 6ms CPU time
Jun 15 14:45:43 jialiu452-p9652-control-plane-2 systemd[1]: Starting Reboot...
-- Reboot --
Jun 15 14:46:48 jialiu452-p9652-control-plane-2 machine-config-daemon[1345]: I0615 14:46:48.011423 1345 update.go:1346] initiating reboot: Completing firstboot provisioning to rendered-master-5c6dfc33b0a6115a3a3c25831d6c8009
Jun 15 14:46:48 jialiu452-p9652-control-plane-2 root[1640]: machine-config-daemon[1345]: initiating reboot: Completing firstboot provisioning to rendered-master-5c6dfc33b0a6115a3a3c25831d6c8009
Jun 15 14:46:48 jialiu452-p9652-control-plane-2 systemd-logind[1320]: System is rebooting.
Jun 15 14:46:48 jialiu452-p9652-control-plane-2 systemd[1]: machine-config-daemon-reboot.service: Consumed 31ms CPU time
Jun 15 14:46:52 jialiu452-p9652-control-plane-2 systemd[1]: Starting Reboot...
-- Reboot --
Jun 16 03:10:07 jialiu452-p9652-control-plane-2 root[1993844]: machine-config-daemon[1922473]: initiating reboot: Node will reboot into config rendered-master-fb575102395e8eb678ccebee9c1a25fc
Jun 16 03:10:07 jialiu452-p9652-control-plane-2 hyperkube[1479]: I0616 03:10:07.937041 1479 factory.go:177] Factory "crio" was unable to handle container "/system.slice/machine-config-daemon-reboot.service"
Jun 16 03:10:07 jialiu452-p9652-control-plane-2 hyperkube[1479]: I0616 03:10:07.937096 1479 factory.go:166] Error trying to work out if we can handle /system.slice/machine-config-daemon-reboot.service: /system.slice/machine-config-daemon-reboot.service not handled by systemd handler
Jun 16 03:10:07 jialiu452-p9652-control-plane-2 hyperkube[1479]: I0616 03:10:07.937106 1479 factory.go:177] Factory "systemd" was unable to handle container "/system.slice/machine-config-daemon-reboot.service"
Jun 16 03:10:07 jialiu452-p9652-control-plane-2 hyperkube[1479]: I0616 03:10:07.937115 1479 factory.go:173] Using factory "raw" for container "/system.slice/machine-config-daemon-reboot.service"
Jun 16 03:10:07 jialiu452-p9652-control-plane-2 hyperkube[1479]: I0616 03:10:07.937670 1479 manager.go:950] Added container: "/system.slice/machine-config-daemon-reboot.service" (aliases: [], namespace: "")
Jun 16 03:10:07 jialiu452-p9652-control-plane-2 hyperkube[1479]: I0616 03:10:07.937830 1479 handler.go:325] Added event &{/system.slice/machine-config-daemon-reboot.service 2020-06-16 03:10:07.933497501 +0000 UTC containerCreation {<nil>}}
Jun 16 03:10:07 jialiu452-p9652-control-plane-2 hyperkube[1479]: I0616 03:10:07.937919 1479 container.go:467] Start housekeeping for container "/system.slice/machine-config-daemon-reboot.service"
Jun 16 03:10:07 jialiu452-p9652-control-plane-2 systemd[1]: Started machine-config-daemon: Node will reboot into config rendered-master-fb575102395e8eb678ccebee9c1a25fc.
Jun 16 03:10:08 jialiu452-p9652-control-plane-2 systemd-logind[1326]: System is rebooting.
Jun 16 03:10:08 jialiu452-p9652-control-plane-2 systemd[1]: Stopped machine-config-daemon: Node will reboot into config rendered-master-fb575102395e8eb678ccebee9c1a25fc.
Jun 16 03:10:08 jialiu452-p9652-control-plane-2 systemd[1]: machine-config-daemon-reboot.service: Consumed 22ms CPU time
Jun 16 03:10:38 jialiu452-p9652-control-plane-2 systemd[1]: Starting Reboot...
-- Reboot --
Jun 16 03:11:07 jialiu452-p9652-control-plane-2 hyperkube[1374]: I0616 03:11:07.646852 1374 event.go:278] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"jialiu452-p9652-control-plane-2", UID:"jialiu452-p9652-control-plane-2", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'Rebooted' Node jialiu452-p9652-control-plane-2 has been rebooted, boot id: dab01b25-cc00-45eb-b021-79425eb72a27

[root@jialiu452-p9652-control-plane-2 ~]# systemctl cat restart
# /etc/systemd/system/restart.service
[Unit]
ConditionFirstBoot=yes
[Service]
Type=idle
ExecStart=/sbin/reboot
[Install]
WantedBy=multi-user.target
```

For UPI on vSphere, 4.5 already removed the restart service, so this bug is not applicable to a 4.5 -> 4.6 vSphere cluster upgrade. See https://bugzilla.redhat.com/show_bug.cgi?id=1846690#c13

We already have a PR for this.

Verified this bug again with 4.6.0-0.nightly-2020-06-22-222207, and PASS.
```
[root@preserve-jialiu-ansible ~]# oc get node
NAME                              STATUS   ROLES    AGE   VERSION
jialiu451-bnc8p-compute-0         Ready    worker   26m   v1.18.3+106f511
jialiu451-bnc8p-compute-1         Ready    worker   26m   v1.18.3+106f511
jialiu451-bnc8p-control-plane-0   Ready    master   35m   v1.18.3+106f511
jialiu451-bnc8p-control-plane-1   Ready    master   34m   v1.18.3+106f511
jialiu451-bnc8p-control-plane-2   Ready    master   35m   v1.18.3+106f511

[root@preserve-jialiu-ansible ~]# oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.5.0-0.nightly-2020-06-23-020504   True        False         False      16m
cloud-credential                           4.5.0-0.nightly-2020-06-23-020504   True        False         False      35m
cluster-autoscaler                         4.5.0-0.nightly-2020-06-23-020504   True        False         False      25m
config-operator                            4.5.0-0.nightly-2020-06-23-020504   True        False         False      25m
console                                    4.5.0-0.nightly-2020-06-23-020504   True        False         False      19m
csi-snapshot-controller                    4.5.0-0.nightly-2020-06-23-020504   True        False         False      23m
dns                                        4.5.0-0.nightly-2020-06-23-020504   True        False         False      31m
etcd                                       4.5.0-0.nightly-2020-06-23-020504   True        False         False      30m
image-registry                             4.5.0-0.nightly-2020-06-23-020504   True        False         False      23m
ingress                                    4.5.0-0.nightly-2020-06-23-020504   True        False         False      23m
insights                                   4.5.0-0.nightly-2020-06-23-020504   True        False         False      29m
kube-apiserver                             4.5.0-0.nightly-2020-06-23-020504   True        False         False      30m
kube-controller-manager                    4.5.0-0.nightly-2020-06-23-020504   True        False         False      30m
kube-scheduler                             4.5.0-0.nightly-2020-06-23-020504   True        False         False      27m
kube-storage-version-migrator              4.5.0-0.nightly-2020-06-23-020504   True        False         False      23m
machine-api                                4.5.0-0.nightly-2020-06-23-020504   True        False         False      29m
machine-approver                           4.5.0-0.nightly-2020-06-23-020504   True        False         False      30m
machine-config                             4.5.0-0.nightly-2020-06-23-020504   True        False         False      30m
marketplace                                4.5.0-0.nightly-2020-06-23-020504   True        False         False      28m
monitoring                                 4.5.0-0.nightly-2020-06-23-020504   True        False         False      21m
network                                    4.5.0-0.nightly-2020-06-23-020504   True        False         False      32m
node-tuning                                4.5.0-0.nightly-2020-06-23-020504   True        False         False      32m
openshift-apiserver                        4.5.0-0.nightly-2020-06-23-020504   True        False         False      27m
openshift-controller-manager               4.5.0-0.nightly-2020-06-23-020504   True        False         False      29m
openshift-samples                          4.5.0-0.nightly-2020-06-23-020504   True        False         False      25m
operator-lifecycle-manager                 4.5.0-0.nightly-2020-06-23-020504   True        False         False      31m
operator-lifecycle-manager-catalog         4.5.0-0.nightly-2020-06-23-020504   True        False         False      31m
operator-lifecycle-manager-packageserver   4.5.0-0.nightly-2020-06-23-020504   True        False         False      27m
service-ca                                 4.5.0-0.nightly-2020-06-23-020504   True        False         False      32m
storage                                    4.5.0-0.nightly-2020-06-23-020504   True        False         False      29m

[root@preserve-jialiu-ansible ~]# oc adm upgrade --to-image=registry.svc.ci.openshift.org/ocp/release:4.6.0-0.nightly-2020-06-22-222207 --force --allow-explicit-upgrade
warning: Using by-tag pull specs is dangerous, and while we still allow it in combination with --force for backward compatibility, it would be much safer to pass a by-digest pull spec instead
warning: The requested upgrade image is not one of the available updates. You have used --allow-explicit-upgrade to the update to preceed anyway
warning: --force overrides cluster verification of your supplied release image and waives any update precondition failures.
Updating to release image registry.svc.ci.openshift.org/ocp/release:4.6.0-0.nightly-2020-06-22-222207

[root@preserve-jialiu-ansible ~]# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.0-0.nightly-2020-06-23-020504   True        True          20m     Working towards 4.6.0-0.nightly-2020-06-22-222207: 84% complete, waiting on machine-config

[root@preserve-jialiu-ansible ~]# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2020-06-22-222207   True        False         5m26s   Cluster version is 4.6.0-0.nightly-2020-06-22-222207

[root@preserve-jialiu-ansible ~]# oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.6.0-0.nightly-2020-06-22-222207   True        False         False      57m
cloud-credential                           4.6.0-0.nightly-2020-06-22-222207   True        False         False      76m
cluster-autoscaler                         4.6.0-0.nightly-2020-06-22-222207   True        False         False      66m
config-operator                            4.6.0-0.nightly-2020-06-22-222207   True        False         False      66m
console                                    4.6.0-0.nightly-2020-06-22-222207   True        False         False      14m
csi-snapshot-controller                    4.6.0-0.nightly-2020-06-22-222207   True        False         False      18m
dns                                        4.6.0-0.nightly-2020-06-22-222207   True        False         False      72m
etcd                                       4.6.0-0.nightly-2020-06-22-222207   True        False         False      71m
image-registry                             4.6.0-0.nightly-2020-06-22-222207   True        False         False      15m
ingress                                    4.6.0-0.nightly-2020-06-22-222207   True        False         False      34m
insights                                   4.6.0-0.nightly-2020-06-22-222207   True        False         False      70m
kube-apiserver                             4.6.0-0.nightly-2020-06-22-222207   True        False         False      71m
kube-controller-manager                    4.6.0-0.nightly-2020-06-22-222207   True        False         False      71m
kube-scheduler                             4.6.0-0.nightly-2020-06-22-222207   True        False         False      68m
kube-storage-version-migrator              4.6.0-0.nightly-2020-06-22-222207   True        False         False      15m
machine-api                                4.6.0-0.nightly-2020-06-22-222207   True        False         False      70m
machine-approver                           4.6.0-0.nightly-2020-06-22-222207   True        False         False      71m
machine-config                             4.6.0-0.nightly-2020-06-22-222207   True        False         False      8m26s
marketplace                                4.6.0-0.nightly-2020-06-22-222207   True        False         False      10m
monitoring                                 4.6.0-0.nightly-2020-06-22-222207   True        False         False      10m
network                                    4.6.0-0.nightly-2020-06-22-222207   True        False         False      73m
node-tuning                                4.6.0-0.nightly-2020-06-22-222207   True        False         False      34m
openshift-apiserver                        4.6.0-0.nightly-2020-06-22-222207   True        False         False      68m
openshift-controller-manager               4.6.0-0.nightly-2020-06-22-222207   True        False         False      32m
openshift-samples                          4.6.0-0.nightly-2020-06-22-222207   True        False         False      34m
operator-lifecycle-manager                 4.6.0-0.nightly-2020-06-22-222207   True        False         False      72m
operator-lifecycle-manager-catalog         4.6.0-0.nightly-2020-06-22-222207   True        False         False      72m
operator-lifecycle-manager-packageserver   4.6.0-0.nightly-2020-06-22-222207   True        False         False      10m
service-ca                                 4.6.0-0.nightly-2020-06-22-222207   True        False         False      73m
storage                                    4.6.0-0.nightly-2020-06-22-222207   True        False         False      34m

[core@jialiu451-bnc8p-control-plane-0 ~]$ systemctl list-unit-files | grep restart
restart.service                            enabled

[core@jialiu451-bnc8p-control-plane-0 ~]$ systemctl cat restart.service
# /etc/systemd/system/restart.service
[Unit]
ConditionFirstBoot=yes
[Service]
Type=idle
ExecStart=/sbin/reboot
[Install]
WantedBy=multi-user.target

[core@jialiu451-bnc8p-control-plane-0 ~]$ journalctl --system | grep -i reboot
Jun 23 04:42:13 host-10-0-98-181.openstacklocal ignition[1144]: INFO : files: op(3a): [started] processing unit "mcd-write-pivot-reboot.service"
Jun 23 04:42:13 host-10-0-98-181.openstacklocal ignition[1144]: INFO : files: op(3a): op(3b): [started] writing unit "mcd-write-pivot-reboot.service" at "etc/systemd/system/mcd-write-pivot-reboot.service"
Jun 23 04:42:13 host-10-0-98-181.openstacklocal ignition[1144]: INFO : files: op(3a): op(3b): [finished] writing unit "mcd-write-pivot-reboot.service" at "etc/systemd/system/mcd-write-pivot-reboot.service"
Jun 23 04:42:13 host-10-0-98-181.openstacklocal ignition[1144]: INFO : files: op(3a): [finished] processing unit "mcd-write-pivot-reboot.service"
Jun 23 04:42:13 host-10-0-98-181.openstacklocal ignition[1144]: INFO : files: op(3c): [started] enabling unit "mcd-write-pivot-reboot.service"
Jun 23 04:42:13 host-10-0-98-181.openstacklocal ignition[1144]: INFO : files: op(3c): [finished] enabling unit "mcd-write-pivot-reboot.service"
Jun 23 04:42:17 jialiu451-bnc8p-control-plane-0 restorecon[1498]: Relabeled /etc/systemd/system/mcd-write-pivot-reboot.service from system_u:object_r:unlabeled_t:s0 to system_u:object_r:power_unit_file_t:s0
Jun 23 04:42:17 jialiu451-bnc8p-control-plane-0 systemd[1]: Started mcd-write-pivot-reboot.service.
Jun 23 04:42:17 jialiu451-bnc8p-control-plane-0 systemd[1]: mcd-write-pivot-reboot.service: Consumed 8ms CPU time
Jun 23 04:42:23 jialiu451-bnc8p-control-plane-0 systemd[1]: Starting Reboot...
-- Reboot --
Jun 23 04:43:30 jialiu451-bnc8p-control-plane-0 machine-config-daemon[1332]: I0623 04:43:30.113571 1332 update.go:1404] initiating reboot: Completing firstboot provisioning to rendered-master-6b21e41272a59673f9f3b69d2b458bdd
Jun 23 04:43:30 jialiu451-bnc8p-control-plane-0 root[1595]: machine-config-daemon[1332]: initiating reboot: Completing firstboot provisioning to rendered-master-6b21e41272a59673f9f3b69d2b458bdd
Jun 23 04:43:30 jialiu451-bnc8p-control-plane-0 systemd-logind[1311]: System is rebooting.
Jun 23 04:43:30 jialiu451-bnc8p-control-plane-0 systemd[1]: machine-config-daemon-reboot.service: Consumed 22ms CPU time
Jun 23 04:43:34 jialiu451-bnc8p-control-plane-0 systemd[1]: Starting Reboot...
-- Reboot --
Jun 23 05:42:31 jialiu451-bnc8p-control-plane-0 root[149658]: machine-config-daemon[137032]: initiating reboot: Node will reboot into config rendered-master-7c3f8c9520bbd711fc62d5e9f56b36d6
Jun 23 05:42:31 jialiu451-bnc8p-control-plane-0 hyperkube[1476]: I0623 05:42:31.247991 1476 factory.go:166] Error trying to work out if we can handle /system.slice/machine-config-daemon-reboot.service: /system.slice/machine-config-daemon-reboot.service not handled by systemd handler
Jun 23 05:42:31 jialiu451-bnc8p-control-plane-0 hyperkube[1476]: I0623 05:42:31.248020 1476 factory.go:177] Factory "systemd" was unable to handle container "/system.slice/machine-config-daemon-reboot.service"
Jun 23 05:42:31 jialiu451-bnc8p-control-plane-0 hyperkube[1476]: I0623 05:42:31.248032 1476 factory.go:177] Factory "crio" was unable to handle container "/system.slice/machine-config-daemon-reboot.service"
Jun 23 05:42:31 jialiu451-bnc8p-control-plane-0 hyperkube[1476]: I0623 05:42:31.248040 1476 factory.go:173] Using factory "raw" for container "/system.slice/machine-config-daemon-reboot.service"
Jun 23 05:42:31 jialiu451-bnc8p-control-plane-0 hyperkube[1476]: I0623 05:42:31.248494 1476 manager.go:950] Added container: "/system.slice/machine-config-daemon-reboot.service" (aliases: [], namespace: "")
Jun 23 05:42:31 jialiu451-bnc8p-control-plane-0 hyperkube[1476]: I0623 05:42:31.248723 1476 handler.go:325] Added event &{/system.slice/machine-config-daemon-reboot.service 2020-06-23 05:42:31.247231787 +0000 UTC containerCreation {<nil>}}
Jun 23 05:42:31 jialiu451-bnc8p-control-plane-0 hyperkube[1476]: I0623 05:42:31.248761 1476 container.go:467] Start housekeeping for container "/system.slice/machine-config-daemon-reboot.service"
Jun 23 05:42:31 jialiu451-bnc8p-control-plane-0 systemd[1]: Started machine-config-daemon: Node will reboot into config rendered-master-7c3f8c9520bbd711fc62d5e9f56b36d6.
Jun 23 05:42:31 jialiu451-bnc8p-control-plane-0 systemd-logind[1322]: System is rebooting.
Jun 23 05:42:31 jialiu451-bnc8p-control-plane-0 systemd[1]: Stopped machine-config-daemon: Node will reboot into config rendered-master-7c3f8c9520bbd711fc62d5e9f56b36d6.
Jun 23 05:42:31 jialiu451-bnc8p-control-plane-0 systemd[1]: machine-config-daemon-reboot.service: Consumed 20ms CPU time
Jun 23 05:43:01 jialiu451-bnc8p-control-plane-0 systemd[1]: Starting Reboot...
-- Reboot --
Jun 23 05:43:33 jialiu451-bnc8p-control-plane-0 hyperkube[1375]: I0623 05:43:33.355733 1375 event.go:278] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"jialiu451-bnc8p-control-plane-0", UID:"jialiu451-bnc8p-control-plane-0", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'Rebooted' Node jialiu451-bnc8p-control-plane-0 has been rebooted, boot id: 2b73374a-4636-4a66-b107-743c29b6cef9
```

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

Removing UpgradeBlocker from this older bug, to remove it from the suspect queue described in [1]. If you feel this bug still needs to be a suspect, please add the keyword again.

[1]: https://github.com/openshift/enhancements/pull/475