Bug 1927041

Summary: The 1st update takes 100+ minutes on a new cluster
Product: OpenShift Container Platform Reporter: Etienne Simard <esimard>
Component: Machine Config Operator Assignee: John Kyros <jkyros>
Status: CLOSED DUPLICATE QA Contact: Michael Nguyen <mnguyen>
Severity: medium Docs Contact:
Priority: medium    
Version: 4.7 CC: behoward, hongkliu, jerzhang, jiajliu, jialiu, jima, kgarriso, nstielau, sdodson, vrutkovs, wking, yanyang, yunjiang
Target Milestone: --- Keywords: Reopened
Target Release: 4.8.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-06-24 20:10:25 UTC Type: Bug

Description Etienne Simard 2021-02-09 22:11:52 UTC
Description of problem:

Upgrading from one 4.7 nightly build to another, the machine-config ClusterOperator gets stuck in Progressing for a long time.

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-2021-02-06-084550   True        True          82m     Working towards 4.7.0-0.nightly-2021-02-09-024347: 560 of 668 done (83% complete), waiting on machine-config

$ oc get co
machine-config                             4.7.0-0.nightly-2021-02-06-084550   False       True          True       49m


$ oc get pods -n openshift-machine-config-operator
NAME                                         READY   STATUS        RESTARTS   AGE
machine-config-controller-645c959b9d-t49mf   1/1     Running       0          107m
machine-config-daemon-5qg75                  2/2     Running       0          43m
machine-config-daemon-hzdpq                  2/2     Running       0          2m44s
machine-config-daemon-jl2sd                  2/2     Running       0          13m
machine-config-daemon-mbq45                  2/2     Running       0          33m
machine-config-daemon-pxq69                  2/2     Terminating   0          130m
machine-config-daemon-wd45p                  2/2     Running       0          23m
machine-config-operator-86b8b55b59-k9x9k     1/1     Running       0          55m
machine-config-server-mxhrb                  1/1     Running       0          129m
machine-config-server-w965f                  1/1     Running       0          129m
machine-config-server-xt4zj                  1/1     Running       0          129m

$ oc logs -f machine-config-daemon-pxq69 -n openshift-machine-config-operator -c machine-config-daemon
I0209 20:20:49.092275    7026 update.go:1483] Disabled systemd units [nodeip-configuration.service openshift-azure-routes.service]
I0209 20:20:49.092306    7026 update.go:1290] Deleting stale data
I0209 20:20:49.110048    7026 update.go:1733] Writing SSHKeys at "/home/core/.ssh/authorized_keys"
I0209 20:20:49.141008    7026 rpm-ostree.go:261] Running captured: systemctl reload crio
I0209 20:20:49.307019    7026 update.go:1902] crio config reloaded successfully! Desired config rendered-master-063aab0d801ffe845914cbeb4455b928 has been applied, skipping reboot
I0209 20:20:49.310397    7026 daemon.go:816] Current config: rendered-master-af7caa8698441735a80b920d02c89d87
I0209 20:20:49.310419    7026 daemon.go:817] Desired config: rendered-master-063aab0d801ffe845914cbeb4455b928
I0209 20:20:49.331980    7026 daemon.go:1165] Completing pending config rendered-master-063aab0d801ffe845914cbeb4455b928
I0209 20:20:49.346109    7026 update.go:1902] completed update for config rendered-master-063aab0d801ffe845914cbeb4455b928
I0209 20:20:49.349282    7026 daemon.go:1181] In desired config rendered-master-063aab0d801ffe845914cbeb4455b928
I0209 22:05:47.321130    7026 daemon.go:600] Got SIGTERM, but actively updating




Version-Release number of selected component (if applicable):

4.7.0-0.nightly-2021-02-06-084550 to 4.7.0-0.nightly-2021-02-09-024347, and earlier nightlies


How reproducible: reproduced each time I tried it


Steps to Reproduce:
1. Install a 4.7 cluster (I installed mine on Azure)
2. Upgrade the 4.7 cluster to a newer nightly build (an example command is sketched below)
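
An example of the upgrade command for step 2 (the release pullspec here is illustrative; --allow-explicit-upgrade and --force are needed when upgrading to an unsigned nightly by pullspec):

$ oc adm upgrade --to-image=registry.ci.openshift.org/ocp/release:4.7.0-0.nightly-2021-02-09-024347 \
    --allow-explicit-upgrade --force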

Actual results:

The cluster gets stuck with the machine-config ClusterOperator Progressing and Degraded for a long time.

Expected results:

A shorter upgrade time, as with previous versions of OpenShift.

Additional info:

Could be related to https://github.com/openshift/machine-config-operator/blob/master/pkg/daemon/daemon.go#L596-L601

Comment 2 Kirsten Garrison 2021-02-10 20:15:49 UTC
*** Bug 1926731 has been marked as a duplicate of this bug. ***

Comment 3 Kirsten Garrison 2021-02-10 20:16:39 UTC
I'm going to test this with more recent nightlies, thanks for the report!

Comment 4 Kirsten Garrison 2021-02-11 00:15:02 UTC
In my initial test using newer nightlies I don't see a slowdown:

Tested 4.7.0-0.nightly-2021-02-09-192846 to 4.7.0-0.nightly-2021-02-09-224509:

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-launch-gcp/1359622962772185088

I'm trying some more runs, so we can see if this holds.

Comment 5 W. Trevor King 2021-02-11 00:47:48 UTC
Seen in an orthogonal 4.7.0-fc.3 -> 4.7.0-rc.1 update, with a similar, slow MCD rollout going on:

  $ grep -rB1 SIGTERM namespaces/openshift-machine-config-operator/pods
  namespaces/openshift-machine-config-operator/pods/machine-config-daemon-9z7cr/machine-config-daemon/machine-config-daemon/logs/current.log-2021-02-06T01:00:36.836110038Z I0206 01:00:36.836071    2099 daemon.go:1181] In desired config rendered-worker-f0fe9deaeb32ccea72bf9f752d286d33
  namespaces/openshift-machine-config-operator/pods/machine-config-daemon-9z7cr/machine-config-daemon/machine-config-daemon/logs/current.log:2021-02-10T21:17:01.278010347Z I0210 21:17:01.275519    2099 daemon.go:600] Got SIGTERM, but actively updating

If the MCO was actually actively updating, I'd have expected it to log something about that update instead of that silence from 1am on the 6th until the SIGTERM on the 10th.  So my current theory is that the outgoing MCD is deadlocking, and a SIGQUIT stack trace would confirm or disprove it.  You probably don't actually need an update for this.  Auditing any running cluster for an MCD whose final log line is hours or more old is probably sufficient.  Once you find an MCD that is not actively logging, tail its logs, and send it a SIGQUIT (like we did in a separate CVO bug in [1]).

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1873900#c5
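
A minimal sketch of that audit (standard oc commands; the pod/node names and the pgrep pattern are placeholders/assumptions):

  # Print the timestamp of the last log line for each MCD pod; a stale one is a suspect.
  $ for p in $(oc -n openshift-machine-config-operator get pods -o name | grep daemon); do
      echo "$p: $(oc -n openshift-machine-config-operator logs "$p" -c machine-config-daemon --tail=1 --timestamps | cut -d' ' -f1)"
    done
  # Keep tailing the suspect pod so the goroutine dump is captured:
  $ oc -n openshift-machine-config-operator logs -f <suspect-mcd-pod> -c machine-config-daemon
  # From a debug shell on that pod's node, send SIGQUIT to the daemon process
  # (verify the pgrep pattern with ps first; it is an assumption about the command line):
  $ oc debug node/<node> -- chroot /host sh -c 'kill -QUIT $(pgrep -f machine-config-daemon)'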

Comment 6 Kirsten Garrison 2021-02-11 01:29:35 UTC
More successful upgrades using new nightlies (not slow):

4.7.0-0.nightly-2021-02-09-192846 -> 4.7.0-0.nightly-2021-02-09-224509 succeeded: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-launch-gcp/1359656566436925440

4.7.0-0.nightly-2021-02-09-192846 -> 4.7.0-0.nightly-2021-02-09-224509: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-launch-gcp/1359656542374203392

Comment 7 Kirsten Garrison 2021-02-11 04:39:19 UTC
Three more tests with new nightlies (GCP & AWS) that don't seem slow:

4.7.0-0.nightly-2021-02-09-192846 -> 4.7.0-0.nightly-2021-02-09-224509 succeeded: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-launch-gcp/1359681866017280000

4.7.0-0.nightly-2021-02-09-192846 -> 4.7.0-0.nightly-2021-02-09-224509 (aws) succeeded: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-launch-aws/1359681838896910336

4.7.0-0.nightly-2021-02-09-192846 -> 4.7.0-0.nightly-2021-02-09-224509 (aws) succeeded: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-launch-aws/1359684936130367488

Comment 8 Etienne Simard 2021-02-12 01:36:32 UTC
I confirm that my last upgrade on Azure, from 4.7.0-0.nightly-2021-02-08-052658 to 4.7.0-0.nightly-2021-02-11-133115, took about 60 minutes and I did not see that issue there.

Comment 9 Nick Stielau 2021-02-15 18:14:14 UTC
Setting to blocker- for 4.7 GA.  I'd be interested to hear more about what duration testing we have today, and if we need to invest more there to catch slow upgrades.

Comment 10 Ben Howard 2021-02-24 16:04:22 UTC
I think that this was fixed with https://github.com/openshift/machine-config-operator/pull/2407 which dealt with "7026 daemon.go:600] Got SIGTERM, but actively updating".

Comment 11 W. Trevor King 2021-02-24 23:53:14 UTC
Setting Target Release to 4.8.0, because [1] landed after 4.7 had forked off master (so if this does end up fixing the issue, we'd need a backport to take it to 4.7.z).

[1]: https://github.com/openshift/machine-config-operator/pull/2395

Comment 13 Michael Nguyen 2021-03-09 16:10:30 UTC
Upgraded from 4.8.0-0.nightly-2021-03-08-133419 to 4.8.0-0.nightly-2021-03-08-184701 in about 45 minutes.

Comment 14 W. Trevor King 2021-03-16 23:41:02 UTC
If this ends up getting backported, you'll probably want to bring back the fix for bug 1933772 as well.  With that one still POST, it might be easier to close it as a dup and hang that follow-up PR on this bug?  Or just manually work up the backport PR with both commits.  Whatever's easiest.

Comment 15 Yu Qi Zhang 2021-04-09 17:05:02 UTC
Moving back to NEW, as the PR was reverted due to other bugs.

Comment 16 liujia 2021-04-23 04:26:16 UTC
Also hit the issue during an upgrade from 4.7.0-0.nightly-2021-04-21-211002 to 4.8.0-0.nightly-2021-04-22-061234. MCD pods take a long time to be restarted, which keeps the MCO degraded for a long time too.
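
For reference, a rough way to time the MCD rollout while the upgrade is running (standard oc commands; nothing here is specific to this bug):

  # Watch the daemonset rollout; it normally finishes within a few minutes.
  $ oc -n openshift-machine-config-operator rollout status ds/machine-config-daemon
  # Watch the pods directly; the symptom here is an old MCD pod stuck in Terminating.
  $ oc -n openshift-machine-config-operator get pods -w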

Comment 17 Johnny Liu 2021-05-12 09:46:23 UTC
Seems like I also hit a similar issue.

Fresh install of a disconnected cluster using 4.7.10-x86_64, then upgrade it to 4.8.0-fc.3-x86_64; the machine-config operator takes a long time to get upgraded.

[root@preserve-jialiu-ansible ~]# oc get co
NAME                                       VERSION      AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.8.0-fc.3   True        False         False      56m
baremetal                                  4.8.0-fc.3   True        False         False      3h4m
cloud-credential                           4.8.0-fc.3   True        False         False      3h4m
cluster-autoscaler                         4.8.0-fc.3   True        False         False      3h3m
config-operator                            4.8.0-fc.3   True        False         False      3h4m
console                                    4.8.0-fc.3   True        False         False      59m
csi-snapshot-controller                    4.8.0-fc.3   True        False         False      56m
dns                                        4.8.0-fc.3   True        False         False      51m
etcd                                       4.8.0-fc.3   True        False         False      3h3m
image-registry                             4.8.0-fc.3   True        False         False      174m
ingress                                    4.8.0-fc.3   True        False         False      174m
insights                                   4.8.0-fc.3   True        False         False      178m
kube-apiserver                             4.8.0-fc.3   True        False         False      3h2m
kube-controller-manager                    4.8.0-fc.3   True        False         False      3h1m
kube-scheduler                             4.8.0-fc.3   True        False         False      3h1m
kube-storage-version-migrator              4.8.0-fc.3   True        False         False      105m
machine-api                                4.8.0-fc.3   True        False         False      176m
machine-approver                           4.8.0-fc.3   True        False         False      3h4m
machine-config                             4.7.10       False       True          True       51m
marketplace                                4.8.0-fc.3   True        False         False      104m
monitoring                                 4.8.0-fc.3   True        False         False      58m
network                                    4.8.0-fc.3   True        False         False      3h4m
node-tuning                                4.8.0-fc.3   True        False         False      60m
openshift-apiserver                        4.8.0-fc.3   True        False         False      179m
openshift-controller-manager               4.8.0-fc.3   True        False         False      179m
openshift-samples                          4.8.0-fc.3   True        False         False      60m
operator-lifecycle-manager                 4.8.0-fc.3   True        False         False      3h3m
operator-lifecycle-manager-catalog         4.8.0-fc.3   True        False         False      3h4m
operator-lifecycle-manager-packageserver   4.8.0-fc.3   True        False         False      60m
service-ca                                 4.8.0-fc.3   True        False         False      3h4m
storage                                    4.8.0-fc.3   True        False         False      102m


[root@preserve-jialiu-ansible ~]# oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-9de9298fea2e327d14a7bcc561681671   True      False      False      3              3                   3                     0                      3h5m
worker   rendered-worker-2b996b24091707eec2bb707c2c68e308   True      False      False      3              3                   3                     0                      3h5m

[root@preserve-jialiu-ansible ~]# oc describe co machine-config
Name:         machine-config
Namespace:    
Labels:       <none>
Annotations:  exclude.release.openshift.io/internal-openshift-hosted: true
              include.release.openshift.io/self-managed-high-availability: true
              include.release.openshift.io/single-node-developer: true
API Version:  config.openshift.io/v1
Kind:         ClusterOperator
Metadata:
  Creation Timestamp:  2021-05-11T07:20:23Z
  Generation:          1
  Managed Fields:
    API Version:  config.openshift.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .:
          f:exclude.release.openshift.io/internal-openshift-hosted:
          f:include.release.openshift.io/self-managed-high-availability:
          f:include.release.openshift.io/single-node-developer:
      f:spec:
      f:status:
        .:
        f:relatedObjects:
        f:versions:
    Manager:      cluster-version-operator
    Operation:    Update
    Time:         2021-05-11T07:20:23Z
    API Version:  config.openshift.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:status:
        f:conditions:
        f:extension:
          .:
          f:master:
          f:worker:
        f:relatedObjects:
        f:versions:
    Manager:         machine-config-operator
    Operation:       Update
    Time:            2021-05-11T08:51:28Z
  Resource Version:  93891
  UID:               125de3cf-43bb-44a1-8cde-5957a48b70e8
Spec:
Status:
  Conditions:
    Last Transition Time:  2021-05-11T09:43:01Z
    Message:               Working towards 4.8.0-fc.3
    Status:                True
    Type:                  Progressing
    Last Transition Time:  2021-05-11T07:28:16Z
    Reason:                AsExpected
    Status:                True
    Type:                  Upgradeable
    Last Transition Time:  2021-05-11T09:53:02Z
    Message:               Unable to apply 4.8.0-fc.3: timed out waiting for the condition during waitForDaemonsetRollout: Daemonset machine-config-daemon is not ready. status: (desired: 6, updated: 4, ready: 6, unavailable: 0)
    Reason:                MachineConfigDaemonFailed
    Status:                True
    Type:                  Degraded
    Last Transition Time:  2021-05-11T09:41:03Z
    Message:               Cluster not available for 4.8.0-fc.3
    Status:                False
    Type:                  Available
  Extension:
    Master:  all 3 nodes are at latest configuration rendered-master-9de9298fea2e327d14a7bcc561681671
    Worker:  all 3 nodes are at latest configuration rendered-worker-2b996b24091707eec2bb707c2c68e308
  Related Objects:
    Group:     
    Name:      openshift-machine-config-operator
    Resource:  namespaces
    Group:     machineconfiguration.openshift.io
    Name:      
    Resource:  machineconfigpools
    Group:     machineconfiguration.openshift.io
    Name:      
    Resource:  controllerconfigs
    Group:     machineconfiguration.openshift.io
    Name:      
    Resource:  kubeletconfigs
    Group:     machineconfiguration.openshift.io
    Name:      
    Resource:  containerruntimeconfigs
    Group:     machineconfiguration.openshift.io
    Name:      
    Resource:  machineconfigs
    Group:     
    Name:      
    Resource:  nodes
    Group:     
    Name:      openshift-kni-infra
    Resource:  namespaces
    Group:     
    Name:      openshift-openstack-infra
    Resource:  namespaces
    Group:     
    Name:      openshift-ovirt-infra
    Resource:  namespaces
    Group:     
    Name:      openshift-vsphere-infra
    Resource:  namespaces
  Versions:
    Name:     operator
    Version:  4.7.10
Events:       <none>
[root@preserve-jialiu-ansible ~]# oc describe node|grep machineconfig
                    machineconfiguration.openshift.io/currentConfig: rendered-master-9de9298fea2e327d14a7bcc561681671
                    machineconfiguration.openshift.io/desiredConfig: rendered-master-9de9298fea2e327d14a7bcc561681671
                    machineconfiguration.openshift.io/reason: 
                    machineconfiguration.openshift.io/state: Done
                    machineconfiguration.openshift.io/currentConfig: rendered-worker-2b996b24091707eec2bb707c2c68e308
                    machineconfiguration.openshift.io/desiredConfig: rendered-worker-2b996b24091707eec2bb707c2c68e308
                    machineconfiguration.openshift.io/reason: 
                    machineconfiguration.openshift.io/state: Done
                    machineconfiguration.openshift.io/currentConfig: rendered-master-9de9298fea2e327d14a7bcc561681671
                    machineconfiguration.openshift.io/desiredConfig: rendered-master-9de9298fea2e327d14a7bcc561681671
                    machineconfiguration.openshift.io/reason: 
                    machineconfiguration.openshift.io/state: Done
                    machineconfiguration.openshift.io/currentConfig: rendered-worker-2b996b24091707eec2bb707c2c68e308
                    machineconfiguration.openshift.io/desiredConfig: rendered-worker-2b996b24091707eec2bb707c2c68e308
                    machineconfiguration.openshift.io/reason: 
                    machineconfiguration.openshift.io/state: Done
                    machineconfiguration.openshift.io/currentConfig: rendered-master-9de9298fea2e327d14a7bcc561681671
                    machineconfiguration.openshift.io/desiredConfig: rendered-master-9de9298fea2e327d14a7bcc561681671
                    machineconfiguration.openshift.io/reason: 
                    machineconfiguration.openshift.io/state: Done
                    machineconfiguration.openshift.io/currentConfig: rendered-worker-2b996b24091707eec2bb707c2c68e308
                    machineconfiguration.openshift.io/desiredConfig: rendered-worker-2b996b24091707eec2bb707c2c68e308
                    machineconfiguration.openshift.io/reason: 
                    machineconfiguration.openshift.io/state: Done


[root@preserve-jialiu-ansible ~]# oc get po -n openshift-machine-config-operator
NAME                                        READY   STATUS        RESTARTS   AGE
machine-config-controller-97944c7d4-x4znz   1/1     Running       0          107m
machine-config-daemon-6ffkf                 2/2     Running       0          21m
machine-config-daemon-72pjx                 2/2     Running       0          58s
machine-config-daemon-7gq75                 2/2     Terminating   0          3h6m
machine-config-daemon-dzljj                 2/2     Running       0          11m
machine-config-daemon-mfhps                 2/2     Running       0          31m
machine-config-daemon-wlhsw                 2/2     Running       0          41m
machine-config-operator-5c6c8476f7-8zn65    1/1     Running       0          53m
machine-config-server-hn4ms                 1/1     Running       0          3h6m
machine-config-server-jj2js                 1/1     Running       0          3h6m
machine-config-server-stbdh                 1/1     Running       0          3h6m

The cluster was already terminated, so I did not capture logs.

Comment 19 John Kyros 2021-05-27 18:22:37 UTC
I am struggling to reproduce this anywhere. I've tried multiple paths:

Nightly to nightly ( 4.7.0-0.nightly-2021-05-20-112118  -> 4.7.0-0.nightly-2021-05-26-013541, others ) 
4.7.10-x86_64  -> 4.8.0-fc.3-x86_64
4.7.10-x86_64 -> 4.7.11-x86_64 -> 4.8.0-fc.3-x86_64

In all cases the MCO appeared to take between 18 and 23 minutes to finish once it started progressing. Max degradation of the MCO I saw was 1m30s.
Longest cluster upgrade I experienced was 1h2m. 

There is either another variable or I'm not sufficiently unlucky. I also can't just blindly put [1] back in as a fix, because it will bring back [2].

[1] https://github.com/openshift/machine-config-operator/pull/2395
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1933772 

I'm closing this, but please reopen this bug (or open a new bug) if this issue continues to occur.

Comment 20 Johnny Liu 2021-06-17 07:59:12 UTC
I hit this issue again when I was running my new round of RC upgrade testing, from 4.7.17 to 4.8.0-rc.0.


[root@preserve-jialiu-ansible ~]# oc get co
NAME                                       VERSION      AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.8.0-rc.0   True        False         False      73m
baremetal                                  4.8.0-rc.0   True        False         False      3h37m
cloud-credential                           4.8.0-rc.0   True        False         False      3h27m
cluster-autoscaler                         4.8.0-rc.0   True        False         False      3h36m
config-operator                            4.8.0-rc.0   True        False         False      3h37m
console                                    4.8.0-rc.0   True        False         False      80m
csi-snapshot-controller                    4.8.0-rc.0   True        False         False      132m
dns                                        4.8.0-rc.0   True        False         False      69m
etcd                                       4.8.0-rc.0   True        False         False      3h36m
image-registry                             4.8.0-rc.0   True        False         False      132m
ingress                                    4.8.0-rc.0   True        False         False      80m
insights                                   4.8.0-rc.0   True        False         False      3h30m
kube-apiserver                             4.8.0-rc.0   True        False         False      3h35m
kube-controller-manager                    4.8.0-rc.0   True        False         False      3h35m
kube-scheduler                             4.8.0-rc.0   True        False         False      3h34m
kube-storage-version-migrator              4.8.0-rc.0   True        False         False      130m
machine-api                                4.8.0-rc.0   True        False         False      3h31m
machine-approver                           4.8.0-rc.0   True        False         False      3h36m
machine-config                             4.7.17       False       True          False      53m
marketplace                                4.8.0-rc.0   True        False         False      129m
monitoring                                 4.8.0-rc.0   True        False         False      77m
network                                    4.8.0-rc.0   True        False         False      3h37m
node-tuning                                4.8.0-rc.0   True        False         False      80m
openshift-apiserver                        4.8.0-rc.0   True        False         False      3h32m
openshift-controller-manager               4.8.0-rc.0   True        False         False      80m
openshift-samples                          4.8.0-rc.0   True        False         False      80m
operator-lifecycle-manager                 4.8.0-rc.0   True        False         False      3h36m
operator-lifecycle-manager-catalog         4.8.0-rc.0   True        False         False      3h36m
operator-lifecycle-manager-packageserver   4.8.0-rc.0   True        False         False      3h32m
service-ca                                 4.8.0-rc.0   True        False         False      3h37m
storage                                    4.8.0-rc.0   True        True          False      128m


[root@preserve-jialiu-ansible ~]# oc get mc
NAME                                               GENERATEDBYCONTROLLER                      IGNITIONVERSION   AGE
00-master                                          3f6db243f2d6651720daa658e8edc830f47dd184   3.2.0             3h38m
00-worker                                          3f6db243f2d6651720daa658e8edc830f47dd184   3.2.0             3h38m
01-master-container-runtime                        3f6db243f2d6651720daa658e8edc830f47dd184   3.2.0             3h38m
01-master-kubelet                                  3f6db243f2d6651720daa658e8edc830f47dd184   3.2.0             3h38m
01-worker-container-runtime                        3f6db243f2d6651720daa658e8edc830f47dd184   3.2.0             3h38m
01-worker-kubelet                                  3f6db243f2d6651720daa658e8edc830f47dd184   3.2.0             3h38m
99-master-generated-registries                     3f6db243f2d6651720daa658e8edc830f47dd184   3.2.0             3h38m
99-master-ssh                                                                                 3.2.0             3h45m
99-worker-generated-registries                     3f6db243f2d6651720daa658e8edc830f47dd184   3.2.0             3h38m
99-worker-ssh                                                                                 3.2.0             3h45m
rendered-master-5f25de2f5af9fbb541efcfe99862504c   e980e4ebb10f252c0a9e044a8872df2c3404ff77   3.2.0             133m
rendered-master-afc3746b1ce25828269e1c5279923558   e980e4ebb10f252c0a9e044a8872df2c3404ff77   3.2.0             3h38m
rendered-master-cf235cb731b806dd20614cdab2b0e803   3f6db243f2d6651720daa658e8edc830f47dd184   3.2.0             90s
rendered-worker-165caca6c9377c8459587c2d2f76a422   3f6db243f2d6651720daa658e8edc830f47dd184   3.2.0             90s
rendered-worker-82bace951257778c36015aecebe64026   e980e4ebb10f252c0a9e044a8872df2c3404ff77   3.2.0             133m
rendered-worker-8dc1a3a13e9c7f25b42e0a9cbb9717bc   e980e4ebb10f252c0a9e044a8872df2c3404ff77   3.2.0             3h38m

[root@preserve-jialiu-ansible ~]# oc -n openshift-machine-config-operator get ds
NAME                    DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                     AGE
machine-config-daemon   6         6         6       6            6           kubernetes.io/os=linux            3h40m
machine-config-server   3         3         3       3            3           node-role.kubernetes.io/master=   3h39m
[root@preserve-jialiu-ansible ~]# oc get mcp 
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-5f25de2f5af9fbb541efcfe99862504c   False     True       False      3              0                   0                     0                      3h40m
worker   rendered-worker-82bace951257778c36015aecebe64026   False     True       False      3              0                   0                     0                      3h40m

[root@preserve-jialiu-ansible ~]# oc get po -n openshift-machine-config-operator
NAME                                         READY   STATUS        RESTARTS   AGE
machine-config-controller-5b6ff6cd68-8njjl   1/1     Running       0          119m
machine-config-daemon-b6drh                  2/2     Running       0          23m
machine-config-daemon-gfgbn                  2/2     Running       0          33m
machine-config-daemon-mzjvv                  2/2     Running       0          43m
machine-config-daemon-qnwkn                  2/2     Running       0          2m50s
machine-config-daemon-w9md5                  2/2     Running       0          13m
machine-config-daemon-zd4gf                  2/2     Terminating   0          3h28m
machine-config-operator-5dc8d8796b-ggkks     1/1     Running       0          55m
machine-config-server-7fd26                  1/1     Running       0          3h27m
machine-config-server-979ck                  1/1     Running       0          3h27m
machine-config-server-jjfwx                  1/1     Running       0          3h27m

[root@preserve-jialiu-ansible ~]# oc -n openshift-machine-config-operator logs machine-config-daemon-zd4gf machine-config-daemon
<--snip-->
I0617 05:43:28.615103    6608 update.go:1737] Writing SSHKeys at "/home/core/.ssh/authorized_keys"
I0617 05:43:28.647695    6608 rpm-ostree.go:261] Running captured: systemctl reload crio
I0617 05:43:28.674130    6608 update.go:1906] crio config reloaded successfully! Desired config rendered-master-5f25de2f5af9fbb541efcfe99862504c has been applied, skipping reboot
I0617 05:43:28.678557    6608 daemon.go:802] Current config: rendered-master-afc3746b1ce25828269e1c5279923558
I0617 05:43:28.678619    6608 daemon.go:803] Desired config: rendered-master-5f25de2f5af9fbb541efcfe99862504c
I0617 05:43:28.695788    6608 daemon.go:1151] Completing pending config rendered-master-5f25de2f5af9fbb541efcfe99862504c
I0617 05:43:28.695809    6608 update.go:1906] completed update for config rendered-master-5f25de2f5af9fbb541efcfe99862504c
I0617 05:43:28.699028    6608 daemon.go:1167] In desired config rendered-master-5f25de2f5af9fbb541efcfe99862504c
I0617 07:39:56.076509    6608 daemon.go:586] Got SIGTERM, but actively updating


[root@preserve-jialiu-ansible ~]# for i in $(oc get po -o name -n openshift-machine-config-operator|grep daemon); do oc logs $i -c machine-config-daemon -n openshift-machine-config-operator|head -n1; done | sort
I0617 04:14:57.050274    6608 start.go:108] Version: v4.7.0-202106112128.p0.git.e980e4e-dirty (e980e4ebb10f252c0a9e044a8872df2c3404ff77)
I0617 06:59:04.618149  176503 start.go:108] Version: v4.8.0-202106122144.p0.git.3f6db24.assembly.stream-dirty (3f6db243f2d6651720daa658e8edc830f47dd184)
I0617 07:09:15.894178  128094 start.go:108] Version: v4.8.0-202106122144.p0.git.3f6db24.assembly.stream-dirty (3f6db243f2d6651720daa658e8edc830f47dd184)
I0617 07:19:30.349355  194589 start.go:108] Version: v4.8.0-202106122144.p0.git.3f6db24.assembly.stream-dirty (3f6db243f2d6651720daa658e8edc830f47dd184)
I0617 07:29:39.340320  132307 start.go:108] Version: v4.8.0-202106122144.p0.git.3f6db24.assembly.stream-dirty (3f6db243f2d6651720daa658e8edc830f47dd184)
I0617 07:39:55.484559  167503 start.go:108] Version: v4.8.0-202106122144.p0.git.3f6db24.assembly.stream-dirty (3f6db243f2d6651720daa658e8edc830f47dd184)

Please check the timestamps; it is really similar behaviour to https://bugzilla.redhat.com/show_bug.cgi?id=1926731 (which was closed as a duplicate of this bug).

Comment 22 John Kyros 2021-06-22 22:29:07 UTC
So the original report on this one and your new one look like they both got stuck where SIGTERM protection is active after a rebootless update of SSH/cri-o configuration, which looks like an instance of [1]. 

I0617 05:43:28.615103    6608 update.go:1737] Writing SSHKeys at "/home/core/.ssh/authorized_keys"
I0617 05:43:28.647695    6608 rpm-ostree.go:261] Running captured: systemctl reload crio
I0617 05:43:28.674130    6608 update.go:1906] crio config reloaded successfully! Desired config rendered-master-5f25de2f5af9fbb541efcfe99862504c has been applied, skipping reboot

The MCO's grace period is 10 minutes, so every time something tries to kill/restart it while it is protecting itself, it will wait the whole 10 minutes for SIGKILL, which definitely matches the explicit "10 mins * number of nodes" timing from the duplicate bug you mentioned [2].
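
A quick way to confirm that grace period on a live cluster (this is just the standard Kubernetes terminationGracePeriodSeconds field on the MCD daemonset; 600 seconds corresponds to the 10 minutes above):

  $ oc -n openshift-machine-config-operator get ds machine-config-daemon \
      -o jsonpath='{.spec.template.spec.terminationGracePeriodSeconds}{"\n"}'
  600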



The reason I think I'm struggling to reproduce this one is that any cluster I build, or that CI builds, doesn't seem to do the same rebootless config update before or during the upgrade.

Are you literally just installing a cluster, letting it settle, and then immediately upgrading it, or does your process do other things that would cause machine config updates? 

If it does, that's 100% okay, I just want to account for the rebootless update so I know we don't have a combination of two bugs -- something like "yes, we get stuck on the SIGTERM protection, but that only happens because something intermittently triggers a weird rebootless config update when it otherwise shouldn't".



[1] https://bugzilla.redhat.com/show_bug.cgi?id=1965992
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1926731

Comment 23 Johnny Liu 2021-06-23 04:22:56 UTC
> Are you literally just installing a cluster, letting it settle, and then immediately upgrading it, or does your process do other things that would cause machine config updates? 

Here are more details about the upgrade process.
1. Set up a cluster in a restricted network.
2. Mirror the target payload release image to the mirror registry.
3. Per step 2's output, create an ImageContentSourcePolicy.
4. Wait for some minutes to confirm the new machineconfig is applied to all nodes and the cluster stays healthy.
5. Start the upgrade process to the target release image.

Then this issue happened.

Though the ImageContentSourcePolicy might involve some update of the cri-o configuration, which causes machine config updates, I already waited for it to complete before triggering the upgrade (see the check below).
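
For step 4, an explicit way to do that wait (standard oc commands; the 30m timeout is arbitrary):

  # Block until both pools report the Updated condition after the ICSP-triggered rollout:
  $ oc wait mcp --all --for=condition=Updated --timeout=30m
  # Both pools should then show UPDATED=True and UPDATING=False:
  $ oc get mcp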

Comment 24 John Kyros 2021-06-24 20:10:25 UTC
Thanks for the details, this makes more sense now. 

Yes, changing the ImageContentSourcePolicy does cause a rebootless cri-o config update, and (because of [1]) that config update erroneously gets you stuck in the SIGTERM protection forever.

I've tested this on a cluster and I can reproduce this bug reliably if I update ImageContentSourcePolicy before I do the upgrade.

1.) Apply an ImageContentSourcePolicy: 

cat << EOF > fake_mirror.yaml 
apiVersion: operator.openshift.io/v1alpha1
kind: ImageContentSourcePolicy
metadata:
  name: metering-mirror
spec:
  repositoryDigestMirrors:
    - mirrors:
        - registry.redhat.io/openshift4/ose-metering-ansible-operator
      source: registry.redhat.io/openshift4/ose-metering-ansible-operator
EOF
oc apply -f fake_mirror.yaml

2.) Upgrade the cluster ( I did 4.7.16 --> 4.8.0-rc.0 )

3.) Watch the MCO portion of the upgrade take at least 10 minutes per node 
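
To see the stall from step 3 while it is happening (the pod name is illustrative):

  $ oc get co machine-config -w
  # The old MCD pod sits in Terminating for the full 10-minute grace period on each node;
  # its last log line is the "Got SIGTERM, but actively updating" message shown earlier in this bug.
  $ oc -n openshift-machine-config-operator get pods -w
  $ oc -n openshift-machine-config-operator logs <terminating-mcd-pod> -c machine-config-daemon --tail=3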


Performing the same test with an image containing the fix from [1], the timing of the MCO upgrade is normal. 



I'm going to close this bug as a duplicate of [1] because we understand the root cause here. Thank you, Johnny, for re-opening this and for the details; I appreciate it.


[1] https://bugzilla.redhat.com/show_bug.cgi?id=1965992

*** This bug has been marked as a duplicate of bug 1965992 ***