Bug 1842906 - upgrade from 4.4.6>4.5 fails: unexpected on-disk state validating against rendered-master
Summary: upgrade from 4.4.6>4.5 fails: unexpected on-disk state validating against ren...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Machine Config Operator
Version: 4.5
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.6.0
Assignee: Antonio Murdaca
QA Contact: Johnny Liu
URL:
Whiteboard:
Duplicates: 1843309
Depends On:
Blocks: 1846690 1867886
 
Reported: 2020-06-02 11:24 UTC by MinLi
Modified: 2021-04-05 17:36 UTC
CC List: 16 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-10-27 16:03:47 UTC
Target Upstream Version:
Embargoed:


Attachments
must-gather log (2.24 MB, application/gzip), 2020-06-03 02:43 UTC, MinLi


Links
System ID Private Priority Status Summary Last Updated
Github openshift machine-config-operator pull 1807 0 None closed Bug 1842906: daemon: Remove encapsulated config when joining the cluster 2021-02-16 18:03:40 UTC
Github openshift machine-config-operator pull 1844 0 None closed Bug 1842906: pkg/daemon: run 4.4 upgrade hack on MCD cluster boot 2021-02-02 07:46:57 UTC
Red Hat Product Errata RHBA-2020:4196 0 None None None 2020-10-27 16:04:20 UTC

Description MinLi 2020-06-02 11:24:31 UTC
Description of problem:
Upgrade from 4.4.6 -> 4.5 failed with a MachineConfigPool controller version mismatch: "unexpected on-disk state validating against rendered-master-0764d3758dba677517722abcd8d01f75"


Version-Release number of selected component (if applicable):
4.5.0-0.nightly-2020-06-02-044312

How reproducible:
always 

Steps to Reproduce:
1. Upgrade a cluster from 4.4.6 -> 4.5

Actual results:
1. Upgrade fails; the machine-config operator is degraded.

Expected results:
1. Upgrade succeeds.


Additional info:
$ oc get co machine-config -o yaml 
apiVersion: config.openshift.io/v1
kind: ClusterOperator
metadata:
  creationTimestamp: "2020-06-02T06:18:49Z"
  generation: 1
  name: machine-config
  resourceVersion: "122451"
  selfLink: /apis/config.openshift.io/v1/clusteroperators/machine-config
  uid: fb50b08a-139b-41cf-b815-54ee36af4cb1
spec: {}
status:
  conditions:
  - lastTransitionTime: "2020-06-02T08:31:11Z"
    message: Cluster not available for 4.5.0-0.nightly-2020-06-02-044312
    status: "False"
    type: Available
  - lastTransitionTime: "2020-06-02T08:14:48Z"
    message: Working towards 4.5.0-0.nightly-2020-06-02-044312
    status: "True"
    type: Progressing
  - lastTransitionTime: "2020-06-02T08:31:11Z"
    message: 'Unable to apply 4.5.0-0.nightly-2020-06-02-044312: timed out waiting
      for the condition during syncRequiredMachineConfigPools: pool master has not
      progressed to latest configuration: controller version mismatch for rendered-master-fcd0e2ffa87f62eb0260c0bb8ee0286e
      expected 293f78b64d86f9f0491f6baa991e3f0c8fe1b046 has 8af4f709c4ba9c0afff3408ecc99c8fce61dd314,
      retrying'
    reason: RequiredPoolsFailed
    status: "True"
    type: Degraded
  - lastTransitionTime: "2020-06-02T06:25:43Z"
    reason: AsExpected
    status: "True"
    type: Upgradeable


$ oc get mcp 
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-fcd0e2ffa87f62eb0260c0bb8ee0286e   False     True       True       3              0                   0                     1                      4h49m
worker   rendered-worker-5e2fcf39779f6e113f77e6b1f71a05de   False     True       True       5              0                   0                     1        

$ oc get mcp master  -o yaml 
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  creationTimestamp: "2020-06-02T06:18:54Z"
  generation: 4
  labels:
    custom-kubelet: small-pods
    machineconfiguration.openshift.io/mco-built-in: ""
    operator.machineconfiguration.openshift.io/required-for-upgrade: ""
  name: master
  resourceVersion: "87494"
  selfLink: /apis/machineconfiguration.openshift.io/v1/machineconfigpools/master
  uid: 9ae480f0-c7f9-4a73-8ac1-cfbe4943ea9e
spec:
  configuration:
    name: rendered-master-78d80a78b21182d10da2f40d35158f3c
    source:
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 00-master
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 01-master-container-runtime
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 01-master-kubelet
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 99-master-9ae480f0-c7f9-4a73-8ac1-cfbe4943ea9e-kubelet
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 99-master-9ae480f0-c7f9-4a73-8ac1-cfbe4943ea9e-registries
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 99-master-fips
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 99-master-ssh
  machineConfigSelector:
    matchLabels:
      machineconfiguration.openshift.io/role: master
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/master: ""
  paused: false
status:
  conditions:
  - lastTransitionTime: "2020-06-02T06:19:39Z"
    message: ""
    reason: ""
    status: "False"
    type: RenderDegraded
  - lastTransitionTime: "2020-06-02T08:19:00Z"
    message: ""
    reason: ""
    status: "False"
    type: Updated
  - lastTransitionTime: "2020-06-02T08:19:00Z"
    message: All nodes are updating to rendered-master-78d80a78b21182d10da2f40d35158f3c
    reason: ""
    status: "True"
    type: Updating
  - lastTransitionTime: "2020-06-02T08:25:32Z"
    message: 'Node control-plane-0 is reporting: "unexpected on-disk state validating
      against rendered-master-0764d3758dba677517722abcd8d01f75"'
    reason: 1 nodes are reporting degraded status on sync
    status: "True"
    type: NodeDegraded
  - lastTransitionTime: "2020-06-02T08:25:32Z"
    message: ""
    reason: ""
    status: "True"
    type: Degraded

$ oc get mcp worker -o yaml 
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  creationTimestamp: "2020-06-02T06:18:54Z"
  generation: 3
  labels:
    machineconfiguration.openshift.io/mco-built-in: ""
  name: worker
  resourceVersion: "89576"
  selfLink: /apis/machineconfiguration.openshift.io/v1/machineconfigpools/worker
  uid: 76a7b917-e9f9-4e58-8643-8bc211c97cba
spec:
  configuration:
    name: rendered-worker-c2d5283deeb9c08da7d5c2be0309eccc
    source:
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 00-worker
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 01-worker-container-runtime
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 01-worker-kubelet
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 99-worker-76a7b917-e9f9-4e58-8643-8bc211c97cba-registries
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 99-worker-fips
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 99-worker-ssh
  machineConfigSelector:
    matchLabels:
      machineconfiguration.openshift.io/role: worker
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/worker: ""
  paused: false
status:
  conditions:
  - lastTransitionTime: "2020-06-02T06:19:39Z"
    message: ""
    reason: ""
    status: "False"
    type: RenderDegraded
  - lastTransitionTime: "2020-06-02T08:19:00Z"
    message: ""
    reason: ""
    status: "False"
    type: Updated
  - lastTransitionTime: "2020-06-02T08:19:00Z"
    message: All nodes are updating to rendered-worker-c2d5283deeb9c08da7d5c2be0309eccc
    reason: ""
    status: "True"
    type: Updating
  - lastTransitionTime: "2020-06-02T08:24:22Z"
    message: 'Node compute-0 is reporting: "unexpected on-disk state validating against
      rendered-worker-5e2fcf39779f6e113f77e6b1f71a05de"'
    reason: 1 nodes are reporting degraded status on sync
    status: "True"
    type: NodeDegraded
  - lastTransitionTime: "2020-06-02T08:24:22Z"
    message: ""
    reason: ""
    status: "True"
    type: Degraded

Comment 1 Kirsten Garrison 2020-06-02 22:15:25 UTC
Can you please provide a must gather from this cluster?

Comment 2 MinLi 2020-06-03 02:43:21 UTC
Created attachment 1694682 [details]
must-gather log

Comment 4 MinLi 2020-06-04 03:19:51 UTC
@Antonio Murdaca

The compressed must-gather archive is more than 24 MB, which is over the attachment size limit. I also found that some audit files are corrupted, because I hit this rsync error while collecting the must-gather:
rsync: connection unexpectedly closed (118621465 bytes received so far) [receiver]
rsync error: error in rsync protocol data stream (code 12) at io.c(226) [receiver=3.1.3]

The must-gather archive tree is listed below; which parts are necessary? I can upload them separately.

$ tree must-gather.local.3693029977532110604/ -L 3
must-gather.local.3693029977532110604/
└── quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-9734e6d304b88b243df5a1720570bf63ab684b2317755e0ed00de0aa32b05ce8
    ├── audit_logs
    │   ├── kube-apiserver
    │   ├── kube-apiserver.audit_logs_listing
    │   ├── openshift-apiserver
    │   └── openshift-apiserver.audit_logs_listing
    ├── cluster-scoped-resources
    │   ├── admissionregistration.k8s.io
    │   ├── apiextensions.k8s.io
    │   ├── apiregistration.k8s.io
    │   ├── certificates.k8s.io
    │   ├── config.openshift.io
    │   ├── core
    │   ├── imageregistry.operator.openshift.io
    │   ├── machineconfiguration.openshift.io
    │   ├── network.openshift.io
    │   ├── oauth.openshift.io
    │   ├── operator.openshift.io
    │   ├── rbac.authorization.k8s.io
    │   ├── samples.operator.openshift.io
    │   ├── security.openshift.io
    │   └── storage.k8s.io
    ├── host_service_logs
    │   └── masters
    ├── namespaces
    │   ├── default
    │   ├── kube-system
    │   ├── openshift-apiserver
    │   ├── openshift-apiserver-operator
    │   ├── openshift-authentication
    │   ├── openshift-authentication-operator
    │   ├── openshift-cloud-credential-operator
    │   ├── openshift-cluster-machine-approver
    │   ├── openshift-cluster-node-tuning-operator
    │   ├── openshift-cluster-samples-operator
    │   ├── openshift-cluster-storage-operator
    │   ├── openshift-cluster-version
    │   ├── openshift-config
    │   ├── openshift-config-managed
    │   ├── openshift-config-operator
    │   ├── openshift-console
    │   ├── openshift-console-operator
    │   ├── openshift-controller-manager
    │   ├── openshift-controller-manager-operator
    │   └── openshift-dns-operator
    └── version

Comment 6 Antonio Murdaca 2020-06-04 09:07:57 UTC
Is this 100% reproducible? The mismatch is here:

  212 2020-06-02T08:24:21.03138522Z I0602 08:24:21.031337    2094 daemon.go:992] Validating against pending config rendered-worker-5e2fcf39779f6e113f77e6b1f71a05de
  213 2020-06-02T08:24:21.321396781Z E0602 08:24:21.321340    2094 daemon.go:1378] could not stat file: "/etc/crio/crio.conf", error: lstat /etc/crio/crio.conf: no such file or directory


Looks like crio.conf went away. I know the Node team changed to drop-in configs for crio, but I have never seen this error. Moving to them for further debugging.
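
For anyone debugging a node in this state, a quick on-host spot check would look roughly like this (a sketch; it assumes a root shell on the node, e.g. via `oc debug node/<node>`, and that the node uses crio's standard drop-in directory, which is an assumption here):

```
# Does the monolithic config the MCD failed to stat still exist?
ls -l /etc/crio/crio.conf
# Or has the node moved to drop-in config snippets only?
ls -l /etc/crio/crio.conf.d/
```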

Comment 7 Antonio Murdaca 2020-06-04 09:15:52 UTC
Moving back to MCO:

- updated from rendered-worker-5e2fcf39779f6e113f77e6b1f71a05de to rendered-worker-c2d5283deeb9c08da7d5c2be0309eccc
- update completed just fine from logs
- reboot
- pending config is rendered-worker-5e2fcf39779f6e113f77e6b1f71a05de instead of being rendered-worker-c2d5283deeb9c08da7d5c2be0309eccc
- something went wrong... maybe prior to this update

```
  245 2020-06-02T08:28:12.965660887Z I0602 08:28:12.965620    2094 daemon.go:768] Current config: rendered-worker-5e2fcf39779f6e113f77e6b1f71a05de
  246 2020-06-02T08:28:12.965660887Z I0602 08:28:12.965638    2094 daemon.go:769] Desired config: rendered-worker-c2d5283deeb9c08da7d5c2be0309eccc
  247 2020-06-02T08:28:12.965660887Z I0602 08:28:12.965644    2094 daemon.go:783] Pending config: rendered-worker-5e2fcf39779f6e113f77e6b1f71a05de
  248 2020-06-02T08:28:13.004251894Z I0602 08:28:13.004206    2094 daemon.go:992] Validating against pending config rendered-worker-5e2fcf39779f6e113f77e6b1f71a05de
  249 2020-06-02T08:28:13.244763039Z E0602 08:28:13.244193    2094 daemon.go:1378] could not stat file: "/etc/crio/crio.conf", error: lstat /etc/crio/crio.conf: no such file or directory
  250 2020-06-02T08:28:13.244763039Z E0602 08:28:13.244216    2094 writer.go:135] Marking Degraded due to: unexpected on-disk state validating against rendered-worker-5e2fcf39779f6e113f77e6b1f71a05de
```

Can you tell me the reproduction rate of this issue?

Comment 8 Antonio Murdaca 2020-06-04 09:21:22 UTC
There seems to be a bunch of other issues with this cluster:
```
2020-06-02T08:20:53.021428876Z I0602 08:20:52.919028  428428 rpm-ostree.go:246] Pivoting to: 45.81.202005291504-0 (0c71982828e2c7d335a0dd5e988d80b817ce4193d8b4ef97f8d9f216ecd5e116)
  171 2020-06-02T08:20:53.021428876Z client(id:cli dbus:1.1092 unit:machine-config-daemon-host.service uid:0) added; new total=1
  172 2020-06-02T08:20:53.4321233Z Initiated txn UpdateDeployment for client(id:cli dbus:1.1092 unit:machine-config-daemon-host.service uid:0): /org/projectatomic/rpmostree1/rhcos
  173 2020-06-02T08:21:00.931620484Z Created new deployment /ostree/deploy/rhcos/deploy/0c71982828e2c7d335a0dd5e988d80b817ce4193d8b4ef97f8d9f216ecd5e116.0
  174 2020-06-02T08:21:00.931620484Z sanitycheck(/usr/bin/true) successful            
  175 2020-06-02T08:21:01.932081633Z Txn UpdateDeployment on /org/projectatomic/rpmostree1/rhcos successful
  176 2020-06-02T08:21:01.932081633Z client(id:cli dbus:1.1092 unit:machine-config-daemon-host.service uid:0) vanished; remaining=0
  177 2020-06-02T08:21:01.932081633Z In idle state; will auto-exit in 61 seconds      
  178 2020-06-02T08:21:02.104386609Z I0602 08:21:02.104338  415991 update.go:1393] initiating reboot: Node will reboot into config rendered-worker-c2d5283deeb9c08da7d5c2be0309eccc
  179 2020-06-02T08:21:02.288441293Z I0602 08:21:02.287406  415991 daemon.go:590] Shutting down MachineConfigDaemon
  180 2020-06-02T08:24:15.238555586Z I0602 08:24:15.237189    2094 start.go:74] Version: v4.5.0-202005301517-dirty (293f78b64d86f9f0491f6baa991e3f0c8fe1b046)
  181 2020-06-02T08:24:15.24417601Z I0602 08:24:15.243873    2094 start.go:84] Calling chroot("/rootfs")
  182 2020-06-02T08:24:15.244851195Z I0602 08:24:15.244553    2094 rpm-ostree.go:368] Running captured: rpm-ostree status --json
  183 2020-06-02T08:24:15.990817588Z I0602 08:24:15.990762    2094 daemon.go:211] Booted osImageURL: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:5c8ad7572d9d6d2bd0f8673b01dbd9e01ba5500aad8fa4690a7c91d85dca437c (44.81.202005250830-0)
  184 2020-06-02T08:24:16.001472968Z I0602 08:24:16.001431    2094 metrics.go:106] Registering Prometheus metrics
  185 2020-06-02T08:24:16.001575818Z I0602 08:24:16.001564    2094 metrics.go:111] Starting metrics listener on 127.0.0.1:8797
  186 2020-06-02T08:24:16.005029027Z I0602 08:24:16.004998    2094 update.go:1393] Starting to manage node: compute-0
  187 2020-06-02T08:24:16.021796261Z I0602 08:24:16.021758    2094 rpm-ostree.go:368] Running captured: rpm-ostree status
  188 2020-06-02T08:24:16.288078923Z I0602 08:24:16.288030    2094 daemon.go:815] State: idle
  189 2020-06-02T08:24:16.288078923Z AutomaticUpdates: disabled                       
  190 2020-06-02T08:24:16.288078923Z Deployments:                                     
  191 2020-06-02T08:24:16.288078923Z * pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:5c8ad7572d9d6d2bd0f8673b01dbd9e01ba5500aad8fa4690a7c91d85dca437c
  192 2020-06-02T08:24:16.288078923Z               CustomOrigin: Managed by machine-config-operator
  193 2020-06-02T08:24:16.288078923Z                    Version: 44.81.202005250830-0 (2020-05-25T08:35:40Z)
  194 2020-06-02T08:24:16.288078923Z                                                  
  195 2020-06-02T08:24:16.288078923Z   pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:fe4ff9322d3ed09dc58668179d99840397892b300960caa0a3eb8822e15ba519
  196 2020-06-02T08:24:16.288078923Z               CustomOrigin: Managed by machine-config-operator
  197 2020-06-02T08:24:16.288078923Z                    Version: 45.81.202005291504-0 (2020-05-29T15:08:02Z)
```

The above shows logs for a successful pivot but then upon reboot, the old deployment is still at play? cc'ing Colin on this

Also, the pending config in the previous comment is hard to explain: looking at the code, the correct pending config should have been written before the reboot and read back after it.
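
A quick on-node sanity check for this (a sketch, not taken from the must-gather) would be to compare the booted deployment against the staged one and look at the rpm-ostree daemon's own journal:

```
# The deployment marked "*" is the one the node actually booted; compare its
# version with the image the MCD said it was pivoting to (45.81.202005291504-0).
rpm-ostree status
# Look for a failed finalization or rollback around the reboot.
journalctl -u rpm-ostreed --no-pager | tail -n 50
```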

Comment 9 Antonio Murdaca 2020-06-04 09:22:28 UTC
I think what we need is an assessment of the reproduction rate of this with a newer payload

Comment 10 Antonio Murdaca 2020-06-04 09:25:18 UTC
The MCC logs look highly suspicious as well:

 41 2020-06-02T14:18:58.28067928Z I0602 14:18:58.280554       1 node_controller.go:453] Pool worker: node localhost changed machineconfiguration.openshift.io/currentConfig = rendered-worker-5e2fcf39779f6e113f77e6b1f71a05de
 42 2020-06-02T14:18:58.28067928Z I0602 14:18:58.280612       1 node_controller.go:453] Pool worker: node localhost changed machineconfiguration.openshift.io/desiredConfig = rendered-worker-5e2fcf39779f6e113f77e6b1f71a05de
 43 2020-06-02T14:18:58.28067928Z I0602 14:18:58.280624       1 node_controller.go:453] Pool worker: node localhost changed machineconfiguration.openshift.io/state = Done


Also, I can see the master pool has FIPS

Comment 11 Antonio Murdaca 2020-06-04 09:27:56 UTC
There you go, your master pool was broken:

230391 2020-06-03T14:36:44.460501037Z I0603 14:36:44.460468    2504 daemon.go:992] Validating against pending config rendered-master-0764d3758dba677517722abcd8d01f75
230392 2020-06-03T14:36:44.854410706Z E0603 14:36:44.854370    2504 daemon.go:1391] content mismatch for file /etc/kubernetes/kubelet.conf: 

and then a diff, I'm CC'ing jerry as I suspect this could be the kubelet config bug we already had.

Comment 12 Antonio Murdaca 2020-06-04 09:34:48 UTC
I've opened https://github.com/openshift/must-gather/pull/157 as well as we need those going forward (missing from the current must-gather)

Comment 13 Yu Qi Zhang 2020-06-05 20:15:44 UTC
The kubeletconfig error you see is not the source of the issue here. Taking a look just at the masters (since only the masters have a kubeletconfig defined):

There are 3 rendered master configs on the system; we'll name them as follows for the sake of discussion:

old - rendered-master-0764d3758dba677517722abcd8d01f75
current - rendered-master-fcd0e2ffa87f62eb0260c0bb8ee0286e
new - rendered-master-78d80a78b21182d10da2f40d35158f3c

where old is the config from the first boot of the system, current is the same as old plus the kubeletconfig applied, and new is the 4.5 nightly being upgraded to.

The first upgrade from old to current completed correctly for all 3 nodes. Upon attempting to upgrade from current to new, the upgrade failed. See: machine-config-daemon-jn9tv/machine-config-daemon/machine-config-daemon/logs/current.log

The first oddity in this log is that both the pre-shutdown and the new MCD logs are written to the same file. This should not have been the case: upon applying the new deployment, a completely new MCD should have been deployed, and yet if we look at the above log, you can see:

2020-06-02T08:20:41.312541614Z I0602 08:20:41.312468  132706 daemon.go:590] Shutting down MachineConfigDaemon
2020-06-02T08:25:17.577128832Z I0602 08:25:17.574252    2504 start.go:74] Version: v4.5.0-202005301517-dirty (293f78b64d86f9f0491f6baa991e3f0c8fe1b046)

The MCD pod originally running there restarted. I think this indicates something going wrong during the upgrade. A few lines later it says:

2020-06-02T08:25:19.226127395Z Deployments:
2020-06-02T08:25:19.226127395Z * pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:5c8ad7572d9d6d2bd0f8673b01dbd9e01ba5500aad8fa4690a7c91d85dca437c
2020-06-02T08:25:19.226127395Z               CustomOrigin: Managed by machine-config-operator
2020-06-02T08:25:19.226127395Z                    Version: 44.81.202005250830-0 (2020-05-25T08:35:40Z)
2020-06-02T08:25:19.226127395Z
2020-06-02T08:25:19.226127395Z   pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:fe4ff9322d3ed09dc58668179d99840397892b300960caa0a3eb8822e15ba519
2020-06-02T08:25:19.226127395Z               CustomOrigin: Managed by machine-config-operator
2020-06-02T08:25:19.226127395Z                    Version: 45.81.202005291504-0 (2020-05-29T15:08:02Z)


That's still the old deployment. It seems we never pivoted, despite it earlier claiming:

2020-06-02T08:20:31.902507451Z I0602 08:20:31.710600  141725 rpm-ostree.go:246] Pivoting to: 45.81.202005291504-0 (0c71982828e2c7d335a0dd5e988d80b817ce4193d8b4ef97f8d9f216ecd5e116)
2020-06-02T08:20:31.902507451Z client(id:cli dbus:1.1151 unit:machine-config-daemon-host.service uid:0) added; new total=1
2020-06-02T08:20:31.902507451Z Initiated txn UpdateDeployment for client(id:cli dbus:1.1151 unit:machine-config-daemon-host.service uid:0): /org/projectatomic/rpmostree1/rhcos
2020-06-02T08:20:39.858658739Z Created new deployment /ostree/deploy/rhcos/deploy/0c71982828e2c7d335a0dd5e988d80b817ce4193d8b4ef97f8d9f216ecd5e116.0
2020-06-02T08:20:39.858658739Z sanitycheck(/usr/bin/true) successful
2020-06-02T08:20:40.652521666Z Txn UpdateDeployment on /org/projectatomic/rpmostree1/rhcos successful
2020-06-02T08:20:40.652521666Z client(id:cli dbus:1.1151 unit:machine-config-daemon-host.service uid:0) vanished; remaining=0
2020-06-02T08:20:40.652521666Z In idle state; will auto-exit in 61 seconds
2020-06-02T08:20:41.106344141Z I0602 08:20:41.106310  132706 update.go:1393] initiating reboot: Node will reboot into config rendered-master-78d80a78b21182d10da2f40d35158f3c

The logs don't otherwise indicate what went wrong here. Full journal logs may give us the reason. (rpm-ostree failure maybe?)

Then, we start seeing the error, and:

2020-06-02T08:25:29.610143137Z I0602 08:25:29.609244    2504 daemon.go:768] Current config: rendered-master-fcd0e2ffa87f62eb0260c0bb8ee0286e
2020-06-02T08:25:29.610143137Z I0602 08:25:29.609261    2504 daemon.go:769] Desired config: rendered-master-78d80a78b21182d10da2f40d35158f3c
2020-06-02T08:25:29.610143137Z I0602 08:25:29.609268    2504 daemon.go:783] Pending config: rendered-master-0764d3758dba677517722abcd8d01f75
2020-06-02T08:25:29.772138517Z I0602 08:25:29.771640    2504 update.go:1393] Disk currentConfig rendered-master-0764d3758dba677517722abcd8d01f75 overrides node annotation rendered-master-fcd0e2ffa87f62eb0260c0bb8ee0286e
2020-06-02T08:25:29.798157031Z I0602 08:25:29.796863    2504 daemon.go:992] Validating against pending config rendered-master-0764d3758dba677517722abcd8d01f75

Well, of course that's wrong. We're at least on rendered-master-fcd0e2ffa87f62eb0260c0bb8ee0286e (current), which has the kubeletconfig changes, yet the node is validating against old, which is why it's reporting the kubeletconfig error. So the node should be on current or new; based on the machine-os-content I think we're still on current, but the logs indicate we successfully upgraded. Again, really weird.

As to the 2 lines referencing old (rendered-master-0764d3758dba677517722abcd8d01f75), it means that on the host,
1. /etc/machine-config-daemon/currentconfig has rendered-master-0764d3758dba677517722abcd8d01f75
2. journalctl -o json _UID=0 MESSAGE_ID=machine-config-daemon-pending-state has rendered-master-0764d3758dba677517722abcd8d01f75

Again to debug, we probably need the full journal logs from the host. must-gather only captures crio/kubelet logs. If another cluster exhibits this behaviour, could you capture the full journalctl logs as well?
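
For reference, the host-side checks described in points 1 and 2 above, plus the full journal capture, would look roughly like this (a sketch; the MESSAGE_ID is written here exactly as quoted in point 2):

```
# 1. The on-disk "current config" bookkeeping the MCD validates against:
cat /etc/machine-config-daemon/currentconfig
# 2. The pending-config journal entry:
journalctl -o json _UID=0 MESSAGE_ID=machine-config-daemon-pending-state | tail -n 1
# 3. The full journal, which must-gather does not collect:
journalctl --no-pager > /tmp/host-journal.log
```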

Comment 14 Yu Qi Zhang 2020-06-05 20:25:51 UTC
Ok, and the worker pool also exhibits similar behaviour:

2020-06-02T08:24:23.374544879Z I0602 08:24:23.374476    2094 daemon.go:768] Current config: rendered-worker-5e2fcf39779f6e113f77e6b1f71a05de
2020-06-02T08:24:23.374544879Z I0602 08:24:23.374499    2094 daemon.go:769] Desired config: rendered-worker-c2d5283deeb9c08da7d5c2be0309eccc
2020-06-02T08:24:23.374544879Z I0602 08:24:23.374505    2094 daemon.go:783] Pending config: rendered-worker-5e2fcf39779f6e113f77e6b1f71a05de
2020-06-02T08:24:23.415838052Z I0602 08:24:23.415782    2094 daemon.go:992] Validating against pending config rendered-worker-5e2fcf39779f6e113f77e6b1f71a05de

Since there are only 2 configs here, it's easier to see what the state is: post-upgrade the files were updated, but the osImageURL is still the old one. The node is now caught in a limbo state where it is neither rendered-worker-5e2fcf39779f6e113f77e6b1f71a05de (because the files were updated) nor rendered-worker-c2d5283deeb9c08da7d5c2be0309eccc (because rpm-ostree did not pivot).

These are the update logs:

2020-06-02T08:20:30.249070886Z I0602 08:20:30.249043  415991 update.go:1271] Updating OS to quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:fe4ff9322d3ed09dc58668179d99840397892b300960caa0a3eb8822e15ba519
2020-06-02T08:20:30.28864255Z In idle state; will auto-exit in 61 seconds
2020-06-02T08:20:30.28864255Z Started RPM-OSTree System Management Daemon.
2020-06-02T08:20:30.28864255Z client(id:cli dbus:1.1087 unit:crio-e729f67e2e91be7f8180c1ee34cd4bad02c9ca40244beab0d4089a1d41a4edae.scope uid:0) added; new total=1
2020-06-02T08:20:30.28864255Z client(id:cli dbus:1.1087 unit:crio-e729f67e2e91be7f8180c1ee34cd4bad02c9ca40244beab0d4089a1d41a4edae.scope uid:0) vanished; remaining=0
2020-06-02T08:20:30.28864255Z In idle state; will auto-exit in 62 seconds
2020-06-02T08:20:30.28864255Z client(id:cli dbus:1.1089 unit:crio-e729f67e2e91be7f8180c1ee34cd4bad02c9ca40244beab0d4089a1d41a4edae.scope uid:0) added; new total=1
2020-06-02T08:20:30.28864255Z client(id:cli dbus:1.1089 unit:crio-e729f67e2e91be7f8180c1ee34cd4bad02c9ca40244beab0d4089a1d41a4edae.scope uid:0) vanished; remaining=0
2020-06-02T08:20:30.28864255Z In idle state; will auto-exit in 64 seconds
2020-06-02T08:20:30.28864255Z In idle state; will auto-exit in 60 seconds
2020-06-02T08:20:30.28864255Z rpm-ostreed.service: Consumed 116ms CPU time
2020-06-02T08:20:30.541315415Z Starting Machine Config Daemon Initial...
2020-06-02T08:20:30.541315415Z I0602 08:20:30.333046  428428 rpm-ostree.go:366] Running captured: rpm-ostree status --json
2020-06-02T08:20:30.541315415Z Starting RPM-OSTree System Management Daemon...
2020-06-02T08:20:30.541315415Z Reading config file '/etc/rpm-ostreed.conf'
2020-06-02T08:20:30.541315415Z In idle state; will auto-exit in 60 seconds
2020-06-02T08:20:30.541315415Z Started RPM-OSTree System Management Daemon.
2020-06-02T08:20:30.541315415Z client(id:cli dbus:1.1090 unit:machine-config-daemon-host.service uid:0) added; new total=1
2020-06-02T08:20:30.89576432Z client(id:cli dbus:1.1090 unit:machine-config-daemon-host.service uid:0) vanished; remaining=0
2020-06-02T08:20:30.89576432Z In idle state; will auto-exit in 60 seconds
2020-06-02T08:20:30.89576432Z I0602 08:20:30.552666  428428 rpm-ostree.go:154] Previous pivot: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:5c8ad7572d9d6d2bd0f8673b01dbd9e01ba5500aad8fa4690a7c91d85dca437c
2020-06-02T08:20:30.89576432Z I0602 08:20:30.552785  428428 run.go:16] Running: podman pull -q --authfile /var/lib/kubelet/config.json quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:fe4ff9322d3ed09dc58668179d99840397892b300960caa0a3eb8822e15ba519
2020-06-02T08:20:30.89576432Z 2020-06-02 08:20:30.738310966 +0000 UTC m=+0.128692845 system refresh
2020-06-02T08:20:52.682154014Z 2020-06-02 08:20:52.432658853 +0000 UTC m=+21.823040743 image pull
2020-06-02T08:20:52.682154014Z b5be1b28281306372765ad38f41123a1875ffd2a35b485f8d0e8db7d315a5bfb
2020-06-02T08:20:52.682154014Z I0602 08:20:52.443028  428428 rpm-ostree.go:366] Running captured: podman inspect --type=image quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:fe4ff9322d3ed09dc58668179d99840397892b300960caa0a3eb8822e15ba519
2020-06-02T08:20:52.682154014Z I0602 08:20:52.594268  428428 rpm-ostree.go:366] Running captured: podman create --net=none --annotation=org.openshift.machineconfigoperator.pivot=true --name ostree-container-pivot-1a2ccf6d-1c36-424b-ab84-619a5943bfbf quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:fe4ff9322d3ed09dc58668179d99840397892b300960caa0a3eb8822e15ba519
2020-06-02T08:20:53.021428876Z I0602 08:20:52.763016  428428 rpm-ostree.go:366] Running captured: podman mount 7bf69529015d73d4f90e8e2e9ee0121db38a9c1d5b361c2292424251d7d4ad12
2020-06-02T08:20:53.021428876Z I0602 08:20:52.919028  428428 rpm-ostree.go:246] Pivoting to: 45.81.202005291504-0 (0c71982828e2c7d335a0dd5e988d80b817ce4193d8b4ef97f8d9f216ecd5e116)
2020-06-02T08:20:53.021428876Z client(id:cli dbus:1.1092 unit:machine-config-daemon-host.service uid:0) added; new total=1
2020-06-02T08:20:53.4321233Z Initiated txn UpdateDeployment for client(id:cli dbus:1.1092 unit:machine-config-daemon-host.service uid:0): /org/projectatomic/rpmostree1/rhcos
2020-06-02T08:21:00.931620484Z Created new deployment /ostree/deploy/rhcos/deploy/0c71982828e2c7d335a0dd5e988d80b817ce4193d8b4ef97f8d9f216ecd5e116.0
2020-06-02T08:21:00.931620484Z sanitycheck(/usr/bin/true) successful
2020-06-02T08:21:01.932081633Z Txn UpdateDeployment on /org/projectatomic/rpmostree1/rhcos successful
2020-06-02T08:21:01.932081633Z client(id:cli dbus:1.1092 unit:machine-config-daemon-host.service uid:0) vanished; remaining=0
2020-06-02T08:21:01.932081633Z In idle state; will auto-exit in 61 seconds
2020-06-02T08:21:02.104386609Z I0602 08:21:02.104338  415991 update.go:1393] initiating reboot: Node will reboot into config rendered-worker-c2d5283deeb9c08da7d5c2be0309eccc

Nothing seems to be going wrong offhand. The journal logs are probably the next place to look.

Comment 15 Colin Walters 2020-06-05 21:26:01 UTC
>Again to debug, we probably need the full journal logs from the host. must-gather only captures crio/kubelet logs. 

https://github.com/openshift/must-gather/pull/158
https://github.com/openshift/machine-config-operator/pull/1790

Comment 16 MinLi 2020-06-08 03:55:23 UTC
@Antonio Murdaca

I only ran this upgrade test once, and it hit this issue.
Our team runs upgrade CI and manual upgrade tests every day, so we can focus on reproducing this issue.

Comment 18 Urvashi Mohnani 2020-06-09 13:57:01 UTC
*** Bug 1843309 has been marked as a duplicate of this bug. ***

Comment 20 Colin Walters 2020-06-09 18:22:36 UTC
OK we had a long discussion about this bug in a realtime chat.

I was *pretty* sure I understood the bug but in trying to write up the diagnosis I am less certain.  But...

One smoking gun to me is the combination of facts:

- Something is rebooting before machine-config-daemon-firstboot.service runs
- The pivot code has a bare `systemctl reboot` call that doesn't scream into the logs; it's just calling glog (and I wonder if we're missing log verbosity here)

Today we're writing both `/etc/pivot/image-pullspec` *and* `/etc/ignition-machine-config-encapsulated.json`, and we do this because the first one is needed for 4.1 bootimages.

In this BZ I believe what's happening is that `machine-config-daemon-host.service` is being triggered by that pullspec file before `machine-config-daemon-firstboot.service` runs.

In that case it's possible for it to win and kill off `machine-config-daemon-firstboot.service` *before* it has a chance to unlink its `/etc/ignition-machine-config-encapsulated.json` stamp file.

Prior to https://github.com/openshift/machine-config-operator/pull/1762 this failure was masked, because on the next boot the service wouldn't run since the BindsTo= was present.

But now that we removed that, the service does run again.
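
A hedged way to check whether a node hit this race after it has already joined the cluster (a sketch; the paths are the two files named above):

```
# If the race occurred, the firstboot stamp may never have been unlinked:
ls -l /etc/pivot/image-pullspec                        # triggers machine-config-daemon-host.service
ls -l /etc/ignition-machine-config-encapsulated.json   # stamp firstboot should remove before rebooting
# Which of the two units actually ran, and in what order?
journalctl -u machine-config-daemon-host.service -u machine-config-daemon-firstboot.service --no-pager
```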

Comment 21 Colin Walters 2020-06-09 18:25:26 UTC
I think we need to clearly split off something like `machine-config-daemon-firstboot-v41` that is only triggered for 4.1 systems, leaving `machine-config-daemon-host.service` to only be triggered by another service or the MCD pod.

Comment 22 Colin Walters 2020-06-09 19:33:03 UTC
Related previous discussion in https://github.com/openshift/machine-config-operator/pull/904/files/40d8225dae9f25b392e39fa2ce30c48d18638fa5#diff-4eb88ea6718e4dcceecd6dfe6e4437e1 

It's hard to trace through this because `machine-config-daemon-host` is started in all 3 of these cases:

- directly via presence of `/etc/pivot/image-pullspec`
- via `machine-config-daemon-firstboot`
- via the MCD pod

It does look like on firstboot though we're doing two concurrent pulls from different process IDs:

Jun 09 02:36:06 compute-0 machine-config-daemon[1887]: I0609 02:36:06.508497    1887 run.go:16] Running: podman pull -q --authfile /var/lib/kubelet/config.json quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:70795637fd2143de29d7486305da17120b7e2270cc5b887b3f21d921391>
Jun 09 02:36:06 compute-0 machine-config-daemon[1833]: I0609 02:36:06.508497    1887 run.go:16] Running: podman pull -q --authfile /var/lib/kubelet/config.json quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:70795637fd2143de29d7486305da17120b7e2270cc5b887b3f21d921391>

So that means the services are conflicting. But... it's still a mystery to me what's causing a reboot, because AFAICS we should have logged something there.
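
One way to tell whether those are really two independent pulls or one process being relayed into the journal twice (a sketch; 1887 and 1833 are the PIDs from the two lines above):

```
# Entries recorded under _PID=1833 whose message text still references pid 1887
# would indicate log relaying rather than a second concurrent pull.
journalctl _PID=1887 --no-pager | head -n 20
journalctl _PID=1833 --no-pager | head -n 20
```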

Comment 23 Colin Walters 2020-06-09 22:20:53 UTC
OK I didn't test this but I think https://github.com/openshift/machine-config-operator/pull/1804 will fix this.

Comment 24 Colin Walters 2020-06-10 00:14:08 UTC
No, argh: there are indeed two separate processes, but notice that the pull happens in the same microsecond and that pid 1887 appears in both lines.
So 1833 there is just the `machine-config-daemon-firstboot.service` proxying the journal logs.
https://github.com/openshift/machine-config-operator/pull/1802 would fix the confusion there, but... that's not a bug.

Comment 25 Colin Walters 2020-06-10 02:01:53 UTC
The cluster here is on vSphere, correct?  Has this bug ever reproduced outside of vSphere?

Can you think of anything here that might be rebooting the node during early cluster bringup?  The vmware agent for example is active...could there be e.g. some tooling that's trying to do live migration?

Comment 26 liujia 2020-06-10 02:04:50 UTC
(In reply to Colin Walters from comment #25)
> The cluster here is on vSphere, correct?  Has this bug ever reproduced
> outside of vSphere?
> 
We hit it on UPI/baremetal too. Please refer to https://bugzilla.redhat.com/show_bug.cgi?id=1843309

Comment 27 Colin Walters 2020-06-10 02:13:00 UTC
Wait a second, where is this unit coming from?

# cat /etc/systemd/system/multi-user.target.wants/restart.service 
[Unit]
ConditionFirstBoot=yes
[Service]
Type=idle
ExecStart=/sbin/reboot
[Install]
WantedBy=multi-user.target
#

It seems to be written by Ignition...

There's the smoking gun for sure, that's what's rebooting the cluster in the middle of firstboot with zero logging!  Ahh wait, I am dimly recalling the installer was doing this at one point...

A bit of digging in github.com/openshift/installer with `git log -G restart\.service` turns up:

https://github.com/openshift/installer/commit/e284327c8771ef84c9151f5db97f1f52db20ef0d
which came from https://github.com/openshift/installer/pull/3497

So...there's not much comment in that commit for why it was removed, but it definitely needs to go!

I'm reassigning to Installer based on this, feel free to close as a duplicate.

Comment 28 Colin Walters 2020-06-10 02:14:30 UTC
Since github says it's a large diff and won't expand, here's the direct link to the removal: https://github.com/openshift/installer/commit/e284327c8771ef84c9151f5db97f1f52db20ef0d#diff-a40d954648d45299ae39941ce1a70a68L51

Comment 29 Colin Walters 2020-06-10 02:16:51 UTC
> We hit it on upi/barematal too. Please refer to https://bugzilla.redhat.com/show_bug.cgi?id=1843309

But is that *real* baremetal, or is it baremetal emulated with libvirt?  There was a similar issue with IPI baremetal which was *also* (forcibly) rebooting the nodes:
https://bugzilla.redhat.com/show_bug.cgi?id=1840222

Can you give me a kubeconfig for an affected cluster?

Comment 30 liujia 2020-06-10 02:35:49 UTC
(In reply to Colin Walters from comment #29)
> > We hit it on upi/barematal too. Please refer to https://bugzilla.redhat.com/show_bug.cgi?id=1843309
> 
> But is that *real* baremetal, or is it baremetal emulated with libvirt? 
> There was a similar issue with IPI baremetal which was *also* (forcibly)
> rebooting the nodes:
> https://bugzilla.redhat.com/show_bug.cgi?id=1840222
> 
> Can you give me a kubeconfig for an affected cluster?

It's not *real* baremetal; it is a VM on OpenStack. I don't have an affected cluster on hand right now; if needed, I can rebuild one for you to debug further.

Comment 31 Johnny Liu 2020-06-10 03:23:13 UTC
> There's the smoking gun for sure, that's what's rebooting the cluster in the middle of firstboot with zero logging!  Ahh wait, I am dimly recalling the installer was doing this at one point...

Do you mean the restart service is the root cause?

The reason the restart service exists is that in some installations the user needs to change the network configuration, such as a static IP or DOMAIN, in the network configuration file. Once the changed network config file is laid down, a reboot is needed to make it take effect, so that the initial fresh install completes successfully.

Do you mean that once a user has defined such a restart service, they can no longer upgrade their cluster?
Also, the restart unit has existed for a long time, since 4.1, but we never hit such a problem before, particularly during 4.1 -> 4.2, 4.2 -> 4.3, or 4.3 -> 4.4. Why did this only happen on 4.4 -> 4.5?

Comment 32 Johnny Liu 2020-06-10 09:59:11 UTC
> Antonio and I currently suspect maybe https://github.com/openshift/machine-config-operator/pull/1762 regressed this behaviour as we removed a gate for running the firstboot service.

Going through QE's test history, we have even upgraded a cluster (with the user-defined restart unit via Ignition) from 4.4.0-0.nightly-2020-05-21-042450 to 4.5.0-0.nightly-2020-05-21-010047, and it worked well.
We hit this issue for the first time when upgrading a cluster (with the user-defined restart unit via Ignition) from 4.4 to 4.5.0-0.nightly-2020-06-01-081609.

Because the above nightly builds were already pruned, I can not confirm whether this is a regression caused by PR #1762, but from the behaviour it sounds like yes. Why does this user-defined restart systemd unit make the MCD go down the firstboot path again? It sounds like the MCO did not manage this special systemd unit well.


I also gave it another shot: I removed the restart unit from the 4.4 Ignition files, installed a cluster with 4.4.0-0.nightly-2020-06-08-083627, and triggered an upgrade to 4.5.0-0.nightly-2020-06-08-031520; it succeeded.

Luckily, removing the restart unit from the 4.4 Ignition files does not affect a fresh installation, but any older cluster that already has the restart unit will hit this problem 100% of the time when upgrading to 4.5. This sounds like a compatibility issue.

Based on the above analysis, I suggest the MCO should manage all kinds of systemd units (including this restart unit) and guard against running into such issues.

Comment 33 MinLi 2020-06-10 10:08:52 UTC
(In reply to Colin Walters from comment #25)
> The cluster here is on vSphere, correct?  Has this bug ever reproduced
> outside of vSphere?
> 
> Can you think of anything here that might be rebooting the node during early
> cluster bringup?  The vmware agent for example is active...could there be
> e.g. some tooling that's trying to do live migration?

1. Yes, this cluster is on vSphere 6.5; I'm not sure whether it reproduces on other platforms.

2. There is one test case that creates a kubeletconfig; it needs to override /etc/kubernetes/kubelet.conf and then reboot the node.

Comment 34 Johnny Liu 2020-06-10 10:35:50 UTC
(In reply to MinLi from comment #33)
> 2 there is one case of creating a kubeletconfig, it need to override
> /etc/kubernetes/kubelet.conf and then reboot the node.

I think we can ignore this to avoid confusion here. As I already explained in comment 31, the reboot comes from the restart systemd unit injected via Ignition during the 4.4 fresh install.

Comment 35 Johnny Liu 2020-06-10 10:39:53 UTC
Adding back the needinfo flag for comments 31 and 32

Comment 36 Christian Glombek 2020-06-10 11:18:49 UTC
My understanding of this is that the reboot was done to pick up config changes for the networking interface (in OKD this was actually changed to only restart NM, without rebooting:
https://github.com/openshift/installer/blob/fcos/upi/vsphere/machine/ignition.tf#L48-L69)

Now in OCP it looks like this has been changed to configure a containerized HAProxy directly and restart it in https://github.com/openshift/installer/commit/e284327c8771ef84c9151f5db97f1f52db20ef0d

The issue I see here is that this restart service unit - which is really a remnant from initial provisioning and shouldn't even exist anymore - in fact still exists on the affected machines,
and with the recent removal of BindsTo=ignition-firstboot-complete.service from machine-config-daemon-firstboot.service it now runs again - even though we obviously don't want it to.

I don't think this can be fixed in Installer - I think we need MCO to ensure the service unit is removed or at least disabled on affected platforms before attempting to upgrade to 4.5.

Comment 37 Christian Glombek 2020-06-10 11:35:44 UTC
There might be a way to remove the service unit entirely; for now I created PRs to disable the service.

PR to master: https://github.com/openshift/machine-config-operator/pull/1805
PR to release-4.5: https://github.com/openshift/machine-config-operator/pull/1806

Comment 38 Colin Walters 2020-06-10 12:32:09 UTC
> The issue I see here is that this restart service unit - which is really a remnant from initial provisioning and shouldn't even exist anymore

It exists in at least 4.4.0-0.nightly-2020-06-07-014539 (but not 4.5+ I think).

But even though that PR was merged a while ago into release-4.4 it seems not to be in the release image yet, need to chase that down.

> - in fact still exists on the affected machines, and with the recent removal of BindsTo=ignition-firstboot-complete.service from machine-config-daemon-firstboot.service it now runs again - even though we obviously don't want it to.

Wait...no, on this affected node, I only see it running on the firstboot:

[root@compute-0 /]# journalctl -u restart
-- Logs begin at Tue 2020-06-09 02:21:47 UTC, end at Wed 2020-06-10 12:26:26 UTC. --
Jun 09 02:36:04 compute-0 systemd[1]: Started restart.service.
Jun 09 02:36:09 compute-0 systemd[1]: Stopping restart.service...
Jun 09 02:36:09 compute-0 systemd[1]: Stopped restart.service.
Jun 09 02:36:09 compute-0 systemd[1]: restart.service: Consumed 8ms CPU time
[root@compute-0 /]# 

as you'd expect because the unit has ConditionFirstBoot=yes.

I do think it's extremely likely that the attempted fix for the IPI baremetal forced reboots in
https://github.com/openshift/machine-config-operator/pull/1762
somehow made this vSphere rebooting worse.  I will trace through that.

We could almost certainly improve things in the MCO here, let's look at that.

But bottom line: since the installer-generated restart.service is gone in release-4.4 I would like to see that shipped and not have to worry about this weird special case of vSphere rebooting.

Comment 39 Colin Walters 2020-06-10 12:36:03 UTC
> It's not a *real* baremetal, and it is the vm on openstack. I have not an affected cluster on hand now, if needed, i can rebuild one for u debug further.

Yes, please do that.  I also asked about this on https://bugzilla.redhat.com/show_bug.cgi?id=1840222#c16

Comment 40 Johnny Liu 2020-06-10 12:56:37 UTC
(In reply to Colin Walters from comment #39)
> > It's not a *real* baremetal, and it is the vm on openstack. I have not an affected cluster on hand now, if needed, i can rebuild one for u debug further.
> 
> Yes, please do that.  I also asked about this on
> https://bugzilla.redhat.com/show_bug.cgi?id=1840222#c16

If you want an env to reproduce this issue, QE will enable the 'restart' unit via Ignition files to recreate it. Do you still want such an env, especially since it sounds like we already have the root cause?

Comment 41 Colin Walters 2020-06-10 12:59:56 UTC
OK right the MCO problem is:

- 1st boot: 4.4 install, machine boots the first time
- restart.service kills machine-config-daemon-firstboot before it upgrades, or before it unlinks the file
- 2nd boot: We stumble on without having pivoted, node joins cluster
  MCD starts a reboot into 44.81.202006062030-0 which we *should* have updated to before joining the cluster
- 3rd boot: Upgraded to 4.5, and the new machine-config-daemon-firstboot without BindsTo= gets laid down in /etc
- 4th boot (current): In 4.5, and -firstboot triggers again, trying to finally apply the original config

The messy problem here is that I think not having the BindsTo= is correct from the start... we need to remove the encapsulated config before rebooting.  PR inbound.
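
To line that sequence up against an affected node's history, something like the following should show each boot and what initiated the reboot (a sketch; the "initiating reboot" string is the MCD log line quoted earlier in this bug):

```
# One entry per boot; the sequence above implies four of them.
journalctl --list-boots
# For each recent boot, show who asked for the reboot.
for b in -3 -2 -1 0; do
  echo "=== boot $b ==="
  journalctl -b "$b" --no-pager | grep -E "initiating reboot|Starting Reboot|restart.service" | tail -n 5
done
```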

Comment 42 Colin Walters 2020-06-10 13:03:36 UTC
> If you want a env to reproduce this issue, QE will enable 'restart' unit via ignition files to rereate it, do you still want such a env, especially sound like we already get the root cause?

If this is occurring *outside* of vSphere/VMware (that sub-thread was about baremetal), it sounds like a separate (though maybe related) bug.  Is there a must-gather for it?  Or a reproducer environment w/kubeconfig would be great.

Comment 43 Colin Walters 2020-06-10 13:03:53 UTC
OK now I'm 97% confident https://github.com/openshift/machine-config-operator/pull/1807 will fix this!

Comment 44 Johnny Liu 2020-06-10 13:18:31 UTC
(In reply to Colin Walters from comment #42)
> > If you want a env to reproduce this issue, QE will enable 'restart' unit via ignition files to rereate it, do you still want such a env, especially sound like we already get the root cause?
> 
> If this is occuring *outside* of vSphere/vmware (that sub-thread was about
> baremetal) it sounds like a separate (though maybe related) bug.  Is there a
> must-gather for it?  Or a reproducer environment w/kubeconfig would be great.

After checking QE's cluster install scripts, I can confirm that for both UPI on vSphere and UPI on baremetal we were injecting the 'restart' unit via Ignition in the same way. So I am almost 99% sure they are the same issue here.

I am re-creating such a UPI-on-baremetal env for you to double-confirm.

Comment 45 Derek Higgins 2020-06-10 13:35:15 UTC
(In reply to Colin Walters from comment #39)
> > It's not a *real* baremetal, and it is the vm on openstack. I have not an affected cluster on hand now, if needed, i can rebuild one for u debug further.
> 
> Yes, please do that.  I also asked about this on
> https://bugzilla.redhat.com/show_bug.cgi?id=1840222#c16

In bug 1840222 we are using the libvirt driver in sushy-tools to control the VMs, and the bug causing the unwanted reboot was in the code path specific to libvirt.
Given that this is a VM on OpenStack, we wouldn't be hitting the same problem here.

Comment 46 Colin Walters 2020-06-10 14:11:01 UTC
One other thing I need to emphasize: I suspect even before the MCO change here actively broke things, this uncoordinated reboot has been impacting the cluster reliability at install time, particularly for the control plane bringup.
See also https://github.com/openshift/machine-config-operator/pull/926

To fix this *immediately* without waiting for other changes in the installer or MCO, please change your restart.service to look like this:

```
[Unit]
ConditionFirstBoot=yes
After=machine-config-daemon-firstboot.service

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/bin/sh -c 'echo starting QE injected reboot unit && /sbin/reboot'

[Install]
WantedBy=multi-user.target
```

The most important part here is the `After=`, but the `echo` telling us *why* the reboot is happening is almost equally important =)

Comment 47 Johnny Liu 2020-06-10 15:10:40 UTC
> Where the most important part here is the `After=`, but the `echo` telling us *why* the reboot is happening is about as equally important =)
Good suggestion.

Comment 49 Colin Walters 2020-06-10 18:14:49 UTC
Actually, since the node will reboot from machine-config-daemon-firstboot.service in all cases, why are you even injecting a restart.service at all?  I don't think it should be necessary.

Comment 50 Brenton Leanhardt 2020-06-10 19:41:58 UTC
What fix is going to be verified in this bug?  https://github.com/openshift/machine-config-operator/pull/1807 ? Something from the installer?  If the MCO fix unblocks 4.5 upgrades I'd recommend we address any non urgent problems in 4.6 and consider backporting.

Comment 51 Johnny Liu 2020-06-11 01:51:25 UTC
(In reply to Colin Walters from comment #49)
> Actually, since the node will reboot from
> machine-config-daemon-firstboot.service in all cases, why are you even
> injecting a restart.service at all?  I don't think it should be necessary.

As I said in comment 44, I was injecting restart.service into a UPI-on-baremetal install to reproduce this issue, in case you suspect it is some other, separate issue.

Comment 52 Steve Milner 2020-06-11 13:29:52 UTC
(In reply to Brenton Leanhardt from comment #50)
> What fix is going to be verified in this bug? 
> https://github.com/openshift/machine-config-operator/pull/1807 ? Something
> from the installer?  If the MCO fix unblocks 4.5 upgrades I'd recommend we
> address any non urgent problems in 4.6 and consider backporting.

https://github.com/openshift/machine-config-operator/pull/1807 provides a remediation to this issue. My suggestion is that 1807 target this bug and the installer team review if they should backport the removal of the offending unit.

Comment 55 Johnny Liu 2020-06-16 07:13:57 UTC
Verified this bug with 4.6.0-0.nightly-2020-06-15-214351 (upgrade from 4.5.0-0.nightly-2020-06-11-183238), and PASS.

[root@preserve-jialiu-ansible ~]# oc get node
NAME                              STATUS   ROLES           AGE   VERSION
jialiu452-p9652-control-plane-0   Ready    master,worker   16h   v1.18.3+2164959
jialiu452-p9652-control-plane-1   Ready    master,worker   16h   v1.18.3+2164959
jialiu452-p9652-control-plane-2   Ready    master,worker   16h   v1.18.3+2164959

[root@preserve-jialiu-ansible ~]# oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.6.0-0.nightly-2020-06-15-214351   True        False         False      16h
cloud-credential                           4.6.0-0.nightly-2020-06-15-214351   True        False         False      16h
cluster-autoscaler                         4.6.0-0.nightly-2020-06-15-214351   True        False         False      16h
config-operator                            4.6.0-0.nightly-2020-06-15-214351   True        False         False      16h
console                                    4.6.0-0.nightly-2020-06-15-214351   True        False         False      4h2m
csi-snapshot-controller                    4.6.0-0.nightly-2020-06-15-214351   True        False         False      4h
dns                                        4.6.0-0.nightly-2020-06-15-214351   True        False         False      16h
etcd                                       4.6.0-0.nightly-2020-06-15-214351   True        False         False      16h
image-registry                             4.6.0-0.nightly-2020-06-15-214351   True        False         False      4h1m
ingress                                    4.6.0-0.nightly-2020-06-15-214351   True        False         False      4h23m
insights                                   4.6.0-0.nightly-2020-06-15-214351   True        False         False      16h
kube-apiserver                             4.6.0-0.nightly-2020-06-15-214351   True        False         False      16h
kube-controller-manager                    4.6.0-0.nightly-2020-06-15-214351   True        False         False      16h
kube-scheduler                             4.6.0-0.nightly-2020-06-15-214351   True        False         False      16h
kube-storage-version-migrator              4.6.0-0.nightly-2020-06-15-214351   True        False         False      16h
machine-api                                4.6.0-0.nightly-2020-06-15-214351   True        False         False      16h
machine-approver                           4.6.0-0.nightly-2020-06-15-214351   True        False         False      16h
machine-config                             4.6.0-0.nightly-2020-06-15-214351   True        False         False      4h27m
marketplace                                4.6.0-0.nightly-2020-06-15-214351   True        False         False      4h1m
monitoring                                 4.6.0-0.nightly-2020-06-15-214351   True        False         False      4h
network                                    4.6.0-0.nightly-2020-06-15-214351   True        False         False      16h
node-tuning                                4.6.0-0.nightly-2020-06-15-214351   True        False         False      4h24m
openshift-apiserver                        4.6.0-0.nightly-2020-06-15-214351   True        False         False      4h3m
openshift-controller-manager               4.6.0-0.nightly-2020-06-15-214351   True        False         False      4h23m
openshift-samples                          4.6.0-0.nightly-2020-06-15-214351   True        False         False      4h23m
operator-lifecycle-manager                 4.6.0-0.nightly-2020-06-15-214351   True        False         False      16h
operator-lifecycle-manager-catalog         4.6.0-0.nightly-2020-06-15-214351   True        False         False      16h
operator-lifecycle-manager-packageserver   4.6.0-0.nightly-2020-06-15-214351   True        False         False      4h1m
service-ca                                 4.6.0-0.nightly-2020-06-15-214351   True        False         False      16h
storage                                    4.6.0-0.nightly-2020-06-15-214351   True        False         False      4h24m


[root@jialiu452-p9652-control-plane-2 ~]# systemctl list-unit-files|grep restart
restart.service                                                        enabled  
[root@jialiu452-p9652-control-plane-2 ~]# journalctl --system|grep -i reboot
Jun 15 14:45:34 host-10-0-98-190.openstacklocal ignition[1145]: INFO     : files: op(3a): [started]  processing unit "mcd-write-pivot-reboot.service"
Jun 15 14:45:34 host-10-0-98-190.openstacklocal ignition[1145]: INFO     : files: op(3a): op(3b): [started]  writing unit "mcd-write-pivot-reboot.service" at "etc/systemd/system/mcd-write-pivot-reboot.service"
Jun 15 14:45:34 host-10-0-98-190.openstacklocal ignition[1145]: INFO     : files: op(3a): op(3b): [finished] writing unit "mcd-write-pivot-reboot.service" at "etc/systemd/system/mcd-write-pivot-reboot.service"
Jun 15 14:45:34 host-10-0-98-190.openstacklocal ignition[1145]: INFO     : files: op(3a): [finished] processing unit "mcd-write-pivot-reboot.service"
Jun 15 14:45:34 host-10-0-98-190.openstacklocal ignition[1145]: INFO     : files: op(3c): [started]  enabling unit "mcd-write-pivot-reboot.service"
Jun 15 14:45:34 host-10-0-98-190.openstacklocal ignition[1145]: INFO     : files: op(3c): [finished] enabling unit "mcd-write-pivot-reboot.service"
Jun 15 14:45:37 jialiu452-p9652-control-plane-2 restorecon[1514]: Relabeled /etc/systemd/system/mcd-write-pivot-reboot.service from system_u:object_r:unlabeled_t:s0 to system_u:object_r:power_unit_file_t:s0
Jun 15 14:45:38 jialiu452-p9652-control-plane-2 systemd[1]: Started mcd-write-pivot-reboot.service.
Jun 15 14:45:38 jialiu452-p9652-control-plane-2 systemd[1]: mcd-write-pivot-reboot.service: Consumed 6ms CPU time
Jun 15 14:45:43 jialiu452-p9652-control-plane-2 systemd[1]: Starting Reboot...
-- Reboot --
Jun 15 14:46:48 jialiu452-p9652-control-plane-2 machine-config-daemon[1345]: I0615 14:46:48.011423    1345 update.go:1346] initiating reboot: Completing firstboot provisioning to rendered-master-5c6dfc33b0a6115a3a3c25831d6c8009
Jun 15 14:46:48 jialiu452-p9652-control-plane-2 root[1640]: machine-config-daemon[1345]: initiating reboot: Completing firstboot provisioning to rendered-master-5c6dfc33b0a6115a3a3c25831d6c8009
Jun 15 14:46:48 jialiu452-p9652-control-plane-2 systemd-logind[1320]: System is rebooting.
Jun 15 14:46:48 jialiu452-p9652-control-plane-2 systemd[1]: machine-config-daemon-reboot.service: Consumed 31ms CPU time
Jun 15 14:46:52 jialiu452-p9652-control-plane-2 systemd[1]: Starting Reboot...
-- Reboot --
Jun 16 03:10:07 jialiu452-p9652-control-plane-2 root[1993844]: machine-config-daemon[1922473]: initiating reboot: Node will reboot into config rendered-master-fb575102395e8eb678ccebee9c1a25fc
Jun 16 03:10:07 jialiu452-p9652-control-plane-2 hyperkube[1479]: I0616 03:10:07.937041    1479 factory.go:177] Factory "crio" was unable to handle container "/system.slice/machine-config-daemon-reboot.service"
Jun 16 03:10:07 jialiu452-p9652-control-plane-2 hyperkube[1479]: I0616 03:10:07.937096    1479 factory.go:166] Error trying to work out if we can handle /system.slice/machine-config-daemon-reboot.service: /system.slice/machine-config-daemon-reboot.service not handled by systemd handler
Jun 16 03:10:07 jialiu452-p9652-control-plane-2 hyperkube[1479]: I0616 03:10:07.937106    1479 factory.go:177] Factory "systemd" was unable to handle container "/system.slice/machine-config-daemon-reboot.service"
Jun 16 03:10:07 jialiu452-p9652-control-plane-2 hyperkube[1479]: I0616 03:10:07.937115    1479 factory.go:173] Using factory "raw" for container "/system.slice/machine-config-daemon-reboot.service"
Jun 16 03:10:07 jialiu452-p9652-control-plane-2 hyperkube[1479]: I0616 03:10:07.937670    1479 manager.go:950] Added container: "/system.slice/machine-config-daemon-reboot.service" (aliases: [], namespace: "")
Jun 16 03:10:07 jialiu452-p9652-control-plane-2 hyperkube[1479]: I0616 03:10:07.937830    1479 handler.go:325] Added event &{/system.slice/machine-config-daemon-reboot.service 2020-06-16 03:10:07.933497501 +0000 UTC containerCreation {<nil>}}
Jun 16 03:10:07 jialiu452-p9652-control-plane-2 hyperkube[1479]: I0616 03:10:07.937919    1479 container.go:467] Start housekeeping for container "/system.slice/machine-config-daemon-reboot.service"
Jun 16 03:10:07 jialiu452-p9652-control-plane-2 systemd[1]: Started machine-config-daemon: Node will reboot into config rendered-master-fb575102395e8eb678ccebee9c1a25fc.
Jun 16 03:10:08 jialiu452-p9652-control-plane-2 systemd-logind[1326]: System is rebooting.
Jun 16 03:10:08 jialiu452-p9652-control-plane-2 systemd[1]: Stopped machine-config-daemon: Node will reboot into config rendered-master-fb575102395e8eb678ccebee9c1a25fc.
Jun 16 03:10:08 jialiu452-p9652-control-plane-2 systemd[1]: machine-config-daemon-reboot.service: Consumed 22ms CPU time
Jun 16 03:10:38 jialiu452-p9652-control-plane-2 systemd[1]: Starting Reboot...
-- Reboot --
Jun 16 03:11:07 jialiu452-p9652-control-plane-2 hyperkube[1374]: I0616 03:11:07.646852    1374 event.go:278] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"jialiu452-p9652-control-plane-2", UID:"jialiu452-p9652-control-plane-2", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'Rebooted' Node jialiu452-p9652-control-plane-2 has been rebooted, boot id: dab01b25-cc00-45eb-b021-79425eb72a27
[root@jialiu452-p9652-control-plane-2 ~]# systemctl cat restart
# /etc/systemd/system/restart.service
[Unit]
ConditionFirstBoot=yes
[Service]
Type=idle
ExecStart=/sbin/reboot
[Install]
WantedBy=multi-user.target

Comment 56 Johnny Liu 2020-06-16 09:06:33 UTC
For UPI on vSphere, the restart service was already removed in 4.5, so this bug does not apply to 4.5 -> 4.6 vSphere cluster upgrades.
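As an aside, the same check can be run without opening a shell on the node; a minimal sketch, assuming cluster-admin access and substituting a real node name for <node-name>:

$ oc debug node/<node-name> -- chroot /host systemctl list-unit-files restart.service   # shows whether the legacy unit is still present/enabled
$ oc debug node/<node-name> -- chroot /host systemctl cat restart.service               # dumps the unit contents, as was done above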

Comment 57 Antonio Murdaca 2020-06-18 13:17:59 UTC
See https://bugzilla.redhat.com/show_bug.cgi?id=1846690#c13

We already have a PR for this again.

Comment 61 Johnny Liu 2020-06-23 06:02:03 UTC
Verified this bug with 4.6.0-0.nightly-2020-06-22-222207, and it PASSED.

[root@preserve-jialiu-ansible ~]# oc get node
NAME                              STATUS   ROLES    AGE   VERSION
jialiu451-bnc8p-compute-0         Ready    worker   26m   v1.18.3+106f511
jialiu451-bnc8p-compute-1         Ready    worker   26m   v1.18.3+106f511
jialiu451-bnc8p-control-plane-0   Ready    master   35m   v1.18.3+106f511
jialiu451-bnc8p-control-plane-1   Ready    master   34m   v1.18.3+106f511
jialiu451-bnc8p-control-plane-2   Ready    master   35m   v1.18.3+106f511

[root@preserve-jialiu-ansible ~]# oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.5.0-0.nightly-2020-06-23-020504   True        False         False      16m
cloud-credential                           4.5.0-0.nightly-2020-06-23-020504   True        False         False      35m
cluster-autoscaler                         4.5.0-0.nightly-2020-06-23-020504   True        False         False      25m
config-operator                            4.5.0-0.nightly-2020-06-23-020504   True        False         False      25m
console                                    4.5.0-0.nightly-2020-06-23-020504   True        False         False      19m
csi-snapshot-controller                    4.5.0-0.nightly-2020-06-23-020504   True        False         False      23m
dns                                        4.5.0-0.nightly-2020-06-23-020504   True        False         False      31m
etcd                                       4.5.0-0.nightly-2020-06-23-020504   True        False         False      30m
image-registry                             4.5.0-0.nightly-2020-06-23-020504   True        False         False      23m
ingress                                    4.5.0-0.nightly-2020-06-23-020504   True        False         False      23m
insights                                   4.5.0-0.nightly-2020-06-23-020504   True        False         False      29m
kube-apiserver                             4.5.0-0.nightly-2020-06-23-020504   True        False         False      30m
kube-controller-manager                    4.5.0-0.nightly-2020-06-23-020504   True        False         False      30m
kube-scheduler                             4.5.0-0.nightly-2020-06-23-020504   True        False         False      27m
kube-storage-version-migrator              4.5.0-0.nightly-2020-06-23-020504   True        False         False      23m
machine-api                                4.5.0-0.nightly-2020-06-23-020504   True        False         False      29m
machine-approver                           4.5.0-0.nightly-2020-06-23-020504   True        False         False      30m
machine-config                             4.5.0-0.nightly-2020-06-23-020504   True        False         False      30m
marketplace                                4.5.0-0.nightly-2020-06-23-020504   True        False         False      28m
monitoring                                 4.5.0-0.nightly-2020-06-23-020504   True        False         False      21m
network                                    4.5.0-0.nightly-2020-06-23-020504   True        False         False      32m
node-tuning                                4.5.0-0.nightly-2020-06-23-020504   True        False         False      32m
openshift-apiserver                        4.5.0-0.nightly-2020-06-23-020504   True        False         False      27m
openshift-controller-manager               4.5.0-0.nightly-2020-06-23-020504   True        False         False      29m
openshift-samples                          4.5.0-0.nightly-2020-06-23-020504   True        False         False      25m
operator-lifecycle-manager                 4.5.0-0.nightly-2020-06-23-020504   True        False         False      31m
operator-lifecycle-manager-catalog         4.5.0-0.nightly-2020-06-23-020504   True        False         False      31m
operator-lifecycle-manager-packageserver   4.5.0-0.nightly-2020-06-23-020504   True        False         False      27m
service-ca                                 4.5.0-0.nightly-2020-06-23-020504   True        False         False      32m
storage                                    4.5.0-0.nightly-2020-06-23-020504   True        False         False      29m

[root@preserve-jialiu-ansible ~]# oc adm upgrade --to-image=registry.svc.ci.openshift.org/ocp/release:4.6.0-0.nightly-2020-06-22-222207 --force --allow-explicit-upgrade
warning: Using by-tag pull specs is dangerous, and while we still allow it in combination with --force for backward compatibility, it would be much safer to pass a by-digest pull spec instead
warning: The requested upgrade image is not one of the available updates.  You have used --allow-explicit-upgrade to the update to preceed anyway
warning: --force overrides cluster verification of your supplied release image and waives any update precondition failures.
Updating to release image registry.svc.ci.openshift.org/ocp/release:4.6.0-0.nightly-2020-06-22-222207
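
Side note: while the upgrade is running, the machine-config rollout (the part that used to get stuck in this bug) can be watched with the standard commands; a minimal sketch, nothing specific to this cluster:

$ oc get clusterversion          # overall upgrade progress
$ oc get mcp                     # pools should end up UPDATED=True with DEGRADEDMACHINECOUNT=0
$ oc get co machine-config       # operator should settle at Available=True / Degraded=False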

[root@preserve-jialiu-ansible ~]# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.0-0.nightly-2020-06-23-020504   True        True          20m     Working towards 4.6.0-0.nightly-2020-06-22-222207: 84% complete, waiting on machine-config

[root@preserve-jialiu-ansible ~]# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2020-06-22-222207   True        False         5m26s   Cluster version is 4.6.0-0.nightly-2020-06-22-222207

[root@preserve-jialiu-ansible ~]# oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.6.0-0.nightly-2020-06-22-222207   True        False         False      57m
cloud-credential                           4.6.0-0.nightly-2020-06-22-222207   True        False         False      76m
cluster-autoscaler                         4.6.0-0.nightly-2020-06-22-222207   True        False         False      66m
config-operator                            4.6.0-0.nightly-2020-06-22-222207   True        False         False      66m
console                                    4.6.0-0.nightly-2020-06-22-222207   True        False         False      14m
csi-snapshot-controller                    4.6.0-0.nightly-2020-06-22-222207   True        False         False      18m
dns                                        4.6.0-0.nightly-2020-06-22-222207   True        False         False      72m
etcd                                       4.6.0-0.nightly-2020-06-22-222207   True        False         False      71m
image-registry                             4.6.0-0.nightly-2020-06-22-222207   True        False         False      15m
ingress                                    4.6.0-0.nightly-2020-06-22-222207   True        False         False      34m
insights                                   4.6.0-0.nightly-2020-06-22-222207   True        False         False      70m
kube-apiserver                             4.6.0-0.nightly-2020-06-22-222207   True        False         False      71m
kube-controller-manager                    4.6.0-0.nightly-2020-06-22-222207   True        False         False      71m
kube-scheduler                             4.6.0-0.nightly-2020-06-22-222207   True        False         False      68m
kube-storage-version-migrator              4.6.0-0.nightly-2020-06-22-222207   True        False         False      15m
machine-api                                4.6.0-0.nightly-2020-06-22-222207   True        False         False      70m
machine-approver                           4.6.0-0.nightly-2020-06-22-222207   True        False         False      71m
machine-config                             4.6.0-0.nightly-2020-06-22-222207   True        False         False      8m26s
marketplace                                4.6.0-0.nightly-2020-06-22-222207   True        False         False      10m
monitoring                                 4.6.0-0.nightly-2020-06-22-222207   True        False         False      10m
network                                    4.6.0-0.nightly-2020-06-22-222207   True        False         False      73m
node-tuning                                4.6.0-0.nightly-2020-06-22-222207   True        False         False      34m
openshift-apiserver                        4.6.0-0.nightly-2020-06-22-222207   True        False         False      68m
openshift-controller-manager               4.6.0-0.nightly-2020-06-22-222207   True        False         False      32m
openshift-samples                          4.6.0-0.nightly-2020-06-22-222207   True        False         False      34m
operator-lifecycle-manager                 4.6.0-0.nightly-2020-06-22-222207   True        False         False      72m
operator-lifecycle-manager-catalog         4.6.0-0.nightly-2020-06-22-222207   True        False         False      72m
operator-lifecycle-manager-packageserver   4.6.0-0.nightly-2020-06-22-222207   True        False         False      10m
service-ca                                 4.6.0-0.nightly-2020-06-22-222207   True        False         False      73m
storage                                    4.6.0-0.nightly-2020-06-22-222207   True        False         False      34m

[core@jialiu451-bnc8p-control-plane-0 ~]$ systemctl list-unit-files|grep restart
restart.service                                                        enabled  
[core@jialiu451-bnc8p-control-plane-0 ~]$ systemctl cat restart.service
# /etc/systemd/system/restart.service
[Unit]
ConditionFirstBoot=yes
[Service]
Type=idle
ExecStart=/sbin/reboot
[Install]
WantedBy=multi-user.target
[core@jialiu451-bnc8p-control-plane-0 ~]$ journalctl --system|grep -i reboot
Jun 23 04:42:13 host-10-0-98-181.openstacklocal ignition[1144]: INFO     : files: op(3a): [started]  processing unit "mcd-write-pivot-reboot.service"
Jun 23 04:42:13 host-10-0-98-181.openstacklocal ignition[1144]: INFO     : files: op(3a): op(3b): [started]  writing unit "mcd-write-pivot-reboot.service" at "etc/systemd/system/mcd-write-pivot-reboot.service"
Jun 23 04:42:13 host-10-0-98-181.openstacklocal ignition[1144]: INFO     : files: op(3a): op(3b): [finished] writing unit "mcd-write-pivot-reboot.service" at "etc/systemd/system/mcd-write-pivot-reboot.service"
Jun 23 04:42:13 host-10-0-98-181.openstacklocal ignition[1144]: INFO     : files: op(3a): [finished] processing unit "mcd-write-pivot-reboot.service"
Jun 23 04:42:13 host-10-0-98-181.openstacklocal ignition[1144]: INFO     : files: op(3c): [started]  enabling unit "mcd-write-pivot-reboot.service"
Jun 23 04:42:13 host-10-0-98-181.openstacklocal ignition[1144]: INFO     : files: op(3c): [finished] enabling unit "mcd-write-pivot-reboot.service"
Jun 23 04:42:17 jialiu451-bnc8p-control-plane-0 restorecon[1498]: Relabeled /etc/systemd/system/mcd-write-pivot-reboot.service from system_u:object_r:unlabeled_t:s0 to system_u:object_r:power_unit_file_t:s0
Jun 23 04:42:17 jialiu451-bnc8p-control-plane-0 systemd[1]: Started mcd-write-pivot-reboot.service.
Jun 23 04:42:17 jialiu451-bnc8p-control-plane-0 systemd[1]: mcd-write-pivot-reboot.service: Consumed 8ms CPU time
Jun 23 04:42:23 jialiu451-bnc8p-control-plane-0 systemd[1]: Starting Reboot...
-- Reboot --
Jun 23 04:43:30 jialiu451-bnc8p-control-plane-0 machine-config-daemon[1332]: I0623 04:43:30.113571    1332 update.go:1404] initiating reboot: Completing firstboot provisioning to rendered-master-6b21e41272a59673f9f3b69d2b458bdd
Jun 23 04:43:30 jialiu451-bnc8p-control-plane-0 root[1595]: machine-config-daemon[1332]: initiating reboot: Completing firstboot provisioning to rendered-master-6b21e41272a59673f9f3b69d2b458bdd
Jun 23 04:43:30 jialiu451-bnc8p-control-plane-0 systemd-logind[1311]: System is rebooting.
Jun 23 04:43:30 jialiu451-bnc8p-control-plane-0 systemd[1]: machine-config-daemon-reboot.service: Consumed 22ms CPU time
Jun 23 04:43:34 jialiu451-bnc8p-control-plane-0 systemd[1]: Starting Reboot...
-- Reboot --
Jun 23 05:42:31 jialiu451-bnc8p-control-plane-0 root[149658]: machine-config-daemon[137032]: initiating reboot: Node will reboot into config rendered-master-7c3f8c9520bbd711fc62d5e9f56b36d6
Jun 23 05:42:31 jialiu451-bnc8p-control-plane-0 hyperkube[1476]: I0623 05:42:31.247991    1476 factory.go:166] Error trying to work out if we can handle /system.slice/machine-config-daemon-reboot.service: /system.slice/machine-config-daemon-reboot.service not handled by systemd handler
Jun 23 05:42:31 jialiu451-bnc8p-control-plane-0 hyperkube[1476]: I0623 05:42:31.248020    1476 factory.go:177] Factory "systemd" was unable to handle container "/system.slice/machine-config-daemon-reboot.service"
Jun 23 05:42:31 jialiu451-bnc8p-control-plane-0 hyperkube[1476]: I0623 05:42:31.248032    1476 factory.go:177] Factory "crio" was unable to handle container "/system.slice/machine-config-daemon-reboot.service"
Jun 23 05:42:31 jialiu451-bnc8p-control-plane-0 hyperkube[1476]: I0623 05:42:31.248040    1476 factory.go:173] Using factory "raw" for container "/system.slice/machine-config-daemon-reboot.service"
Jun 23 05:42:31 jialiu451-bnc8p-control-plane-0 hyperkube[1476]: I0623 05:42:31.248494    1476 manager.go:950] Added container: "/system.slice/machine-config-daemon-reboot.service" (aliases: [], namespace: "")
Jun 23 05:42:31 jialiu451-bnc8p-control-plane-0 hyperkube[1476]: I0623 05:42:31.248723    1476 handler.go:325] Added event &{/system.slice/machine-config-daemon-reboot.service 2020-06-23 05:42:31.247231787 +0000 UTC containerCreation {<nil>}}
Jun 23 05:42:31 jialiu451-bnc8p-control-plane-0 hyperkube[1476]: I0623 05:42:31.248761    1476 container.go:467] Start housekeeping for container "/system.slice/machine-config-daemon-reboot.service"
Jun 23 05:42:31 jialiu451-bnc8p-control-plane-0 systemd[1]: Started machine-config-daemon: Node will reboot into config rendered-master-7c3f8c9520bbd711fc62d5e9f56b36d6.
Jun 23 05:42:31 jialiu451-bnc8p-control-plane-0 systemd-logind[1322]: System is rebooting.
Jun 23 05:42:31 jialiu451-bnc8p-control-plane-0 systemd[1]: Stopped machine-config-daemon: Node will reboot into config rendered-master-7c3f8c9520bbd711fc62d5e9f56b36d6.
Jun 23 05:42:31 jialiu451-bnc8p-control-plane-0 systemd[1]: machine-config-daemon-reboot.service: Consumed 20ms CPU time
Jun 23 05:43:01 jialiu451-bnc8p-control-plane-0 systemd[1]: Starting Reboot...
-- Reboot --
Jun 23 05:43:33 jialiu451-bnc8p-control-plane-0 hyperkube[1375]: I0623 05:43:33.355733    1375 event.go:278] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"jialiu451-bnc8p-control-plane-0", UID:"jialiu451-bnc8p-control-plane-0", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'Rebooted' Node jialiu451-bnc8p-control-plane-0 has been rebooted, boot id: 2b73374a-4636-4a66-b107-743c29b6cef9
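
As an extra sanity check after the node comes back from the final reboot, the MCO annotations on the node should agree with the pool's rendered config; a minimal sketch, reusing the node name from the output above:

$ oc describe node jialiu451-bnc8p-control-plane-0 | grep machineconfiguration.openshift.io
# currentConfig and desiredConfig should both point at the new rendered-master config, with state: Done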

Comment 63 errata-xmlrpc 2020-10-27 16:03:47 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

Comment 64 W. Trevor King 2021-04-05 17:36:37 UTC
Removing UpgradeBlocker from this older bug to remove it from the suspect queue described in [1]. If you feel this bug still needs to be a suspect, please add the keyword again.

[1]: https://github.com/openshift/enhancements/pull/475
