Bug 1927041
Summary: | The 1st update takes over 100 minutes on a new cluster | |
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Etienne Simard <esimard> |
Component: | Machine Config Operator | Assignee: | John Kyros <jkyros> |
Status: | CLOSED DUPLICATE | QA Contact: | Michael Nguyen <mnguyen> |
Severity: | medium | Docs Contact: | |
Priority: | medium | ||
Version: | 4.7 | CC: | behoward, hongkliu, jerzhang, jiajliu, jialiu, jima, kgarriso, nstielau, sdodson, vrutkovs, wking, yanyang, yunjiang |
Target Milestone: | --- | Keywords: | Reopened |
Target Release: | 4.8.0 | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2021-06-24 20:10:25 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | | |
Description
Etienne Simard
2021-02-09 22:11:52 UTC
*** Bug 1926731 has been marked as a duplicate of this bug. ***

I'm going to test this with more recent nightlies, thanks for the report!

In my initial test using newer nightlies I don't see a slowdown. Tested 4.7.0-0.nightly-2021-02-09-192846 to 4.7.0-0.nightly-2021-02-09-224509: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-launch-gcp/1359622962772185088 I'm trying some more runs, so we can see if this holds.

Seen in an orthogonal 4.7.0-fc.3 -> 4.7.0-rc.1 update, with a similar, slow MCD rollout going on:

$ grep -rB1 SIGTERM namespaces/openshift-machine-config-operator/pods
namespaces/openshift-machine-config-operator/pods/machine-config-daemon-9z7cr/machine-config-daemon/machine-config-daemon/logs/current.log-2021-02-06T01:00:36.836110038Z I0206 01:00:36.836071 2099 daemon.go:1181] In desired config rendered-worker-f0fe9deaeb32ccea72bf9f752d286d33
namespaces/openshift-machine-config-operator/pods/machine-config-daemon-9z7cr/machine-config-daemon/machine-config-daemon/logs/current.log:2021-02-10T21:17:01.278010347Z I0210 21:17:01.275519 2099 daemon.go:600] Got SIGTERM, but actively updating

If the MCO was actually actively updating, I'd have expected it to log something about that update instead of that silence from 1am on the 6th until the SIGTERM on the 10th. So my current theory is that the outgoing MCD is deadlocking, and a SIGQUIT stack trace would confirm or disprove that. You probably don't actually need an update for this; auditing any running cluster for an MCD whose final log line is hours or more old is probably sufficient. Once you find an MCD that is not actively logging, tail its logs, and send it SIGQUIT (like we did in a separate CVO bug in [1]).
[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1873900#c5

More successful upgrades using new nightlies (not slow):

4.7.0-0.nightly-2021-02-09-192846 -> 4.7.0-0.nightly-2021-02-09-224509 succeeded: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-launch-gcp/1359656566436925440
4.7.0-0.nightly-2021-02-09-192846 -> 4.7.0-0.nightly-2021-02-09-224509: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-launch-gcp/1359656542374203392

Three more tests with new nightlies (GCP and AWS) that don't seem slow:

4.7.0-0.nightly-2021-02-09-192846 -> 4.7.0-0.nightly-2021-02-09-224509 succeeded: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-launch-gcp/1359681866017280000
4.7.0-0.nightly-2021-02-09-192846 -> 4.7.0-0.nightly-2021-02-09-224509 (AWS) succeeded: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-launch-aws/1359681838896910336
4.7.0-0.nightly-2021-02-09-192846 -> 4.7.0-0.nightly-2021-02-09-224509 (AWS) succeeded: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-launch-aws/1359684936130367488

I confirm that my last upgrade on Azure from 4.7.0-0.nightly-2021-02-08-052658 to 4.7.0-0.nightly-2021-02-11-133115 took about 60 minutes and I did not see that issue there.

Setting to blocker- for 4.7 GA. I'd be interested to hear more about what duration testing we have today, and whether we need to invest more there to catch slow upgrades.

I think that this was fixed with https://github.com/openshift/machine-config-operator/pull/2407, which dealt with "7026 daemon.go:600] Got SIGTERM, but actively updating".

Setting Target Release to 4.8.0, because [1] landed after 4.7 had forked off master (so if this does end up fixing the issue, we'd need a backport to take it to 4.7.z).
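The audit suggested above (find an MCD whose last log line is hours old, then send it SIGQUIT so the Go runtime dumps goroutine stacks) can be sketched roughly as follows. This is a hedged sketch, not a tool from this bug: the one-hour threshold and the choice to signal PID 1 inside the container are illustrative assumptions.

```shell
#!/bin/sh
# Sketch: flag machine-config-daemon pods whose newest log line is older than
# a threshold, then SIGQUIT them so the Go runtime dumps goroutine stacks.
# The 3600s threshold is an arbitrary illustration, not an MCO constant.
ns=openshift-machine-config-operator
threshold=3600

# is_stale LAST_EPOCH NOW_EPOCH -> exit 0 when the gap exceeds the threshold
is_stale() {
  [ $(( $2 - $1 )) -gt "$threshold" ]
}

if command -v oc >/dev/null 2>&1; then
  for pod in $(oc -n "$ns" get pods -o name | grep machine-config-daemon); do
    # --timestamps prefixes each log line with an RFC3339 stamp we can parse
    last=$(oc -n "$ns" logs "$pod" -c machine-config-daemon --timestamps --tail=1 | cut -d' ' -f1)
    if is_stale "$(date -d "$last" +%s)" "$(date +%s)"; then
      echo "$pod has been silent for over ${threshold}s; requesting a stack dump"
      oc -n "$ns" exec "$pod" -c machine-config-daemon -- kill -QUIT 1
    fi
  done
fi
```

The staleness check is deliberately separated into `is_stale` so the threshold logic can be tuned (or tested) independently of the cluster access.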
[1]: https://github.com/openshift/machine-config-operator/pull/2395

Upgraded from 4.8.0-0.nightly-2021-03-08-133419 to 4.8.0-0.nightly-2021-03-08-184701 in about 45 minutes.

If this ends up getting backported, you'll probably want to bring back the fix for bug 1933772 as well. With that one still POST, it might be easier to close it as a dup and hang that follow-up PR on this bug? Or just manually work up the backport PR with both commits. Whatever's easiest.

Moving back to NEW as the PR was reverted due to other bugs.

Also hit the issue during an upgrade from 4.7.0-0.nightly-2021-04-21-211002 to 4.8.0-0.nightly-2021-04-22-061234. MCD pods take a long time to be restarted, which keeps the MCO degraded for a long time too.

Seems like I also hit a similar issue. Fresh install of a disconnected cluster using 4.7.10-x86_64, then upgrade it to 4.8.0-fc.3-x86_64; the machine-config operator takes a long time to get upgraded.

[root@preserve-jialiu-ansible ~]# oc get co
NAME                                       VERSION      AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.8.0-fc.3   True        False         False      56m
baremetal                                  4.8.0-fc.3   True        False         False      3h4m
cloud-credential                           4.8.0-fc.3   True        False         False      3h4m
cluster-autoscaler                         4.8.0-fc.3   True        False         False      3h3m
config-operator                            4.8.0-fc.3   True        False         False      3h4m
console                                    4.8.0-fc.3   True        False         False      59m
csi-snapshot-controller                    4.8.0-fc.3   True        False         False      56m
dns                                        4.8.0-fc.3   True        False         False      51m
etcd                                       4.8.0-fc.3   True        False         False      3h3m
image-registry                             4.8.0-fc.3   True        False         False      174m
ingress                                    4.8.0-fc.3   True        False         False      174m
insights                                   4.8.0-fc.3   True        False         False      178m
kube-apiserver                             4.8.0-fc.3   True        False         False      3h2m
kube-controller-manager                    4.8.0-fc.3   True        False         False      3h1m
kube-scheduler                             4.8.0-fc.3   True        False         False      3h1m
kube-storage-version-migrator              4.8.0-fc.3   True        False         False      105m
machine-api                                4.8.0-fc.3   True        False         False      176m
machine-approver                           4.8.0-fc.3   True        False         False      3h4m
machine-config                             4.7.10       False       True          True       51m
marketplace                                4.8.0-fc.3   True        False         False      104m
monitoring                                 4.8.0-fc.3   True        False         False      58m
network                                    4.8.0-fc.3   True        False         False      3h4m
node-tuning                                4.8.0-fc.3   True        False         False      60m
openshift-apiserver                        4.8.0-fc.3   True        False         False      179m
openshift-controller-manager               4.8.0-fc.3   True        False         False      179m
openshift-samples                          4.8.0-fc.3   True        False         False      60m
operator-lifecycle-manager                 4.8.0-fc.3   True        False         False      3h3m
operator-lifecycle-manager-catalog         4.8.0-fc.3   True        False         False      3h4m
operator-lifecycle-manager-packageserver   4.8.0-fc.3   True        False         False      60m
service-ca                                 4.8.0-fc.3   True        False         False      3h4m
storage                                    4.8.0-fc.3   True        False         False      102m

[root@preserve-jialiu-ansible ~]# oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-9de9298fea2e327d14a7bcc561681671   True      False      False      3              3                   3                     0                      3h5m
worker   rendered-worker-2b996b24091707eec2bb707c2c68e308   True      False      False      3              3                   3                     0                      3h5m

[root@preserve-jialiu-ansible ~]# oc describe co machine-config
Name:         machine-config
Namespace:
Labels:       <none>
Annotations:  exclude.release.openshift.io/internal-openshift-hosted: true
              include.release.openshift.io/self-managed-high-availability: true
              include.release.openshift.io/single-node-developer: true
API Version:  config.openshift.io/v1
Kind:         ClusterOperator
Metadata:
  Creation Timestamp:  2021-05-11T07:20:23Z
  Generation:          1
  Managed Fields:
    API Version:  config.openshift.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .:
          f:exclude.release.openshift.io/internal-openshift-hosted:
          f:include.release.openshift.io/self-managed-high-availability:
          f:include.release.openshift.io/single-node-developer:
      f:spec:
      f:status:
        .:
        f:relatedObjects:
        f:versions:
    Manager:      cluster-version-operator
    Operation:    Update
    Time:         2021-05-11T07:20:23Z
    API Version:  config.openshift.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:status:
        f:conditions:
        f:extension:
          .:
          f:master:
          f:worker:
        f:relatedObjects:
        f:versions:
    Manager:         machine-config-operator
    Operation:       Update
    Time:            2021-05-11T08:51:28Z
  Resource Version:  93891
  UID:               125de3cf-43bb-44a1-8cde-5957a48b70e8
Spec:
Status:
  Conditions:
    Last Transition Time:  2021-05-11T09:43:01Z
    Message:               Working towards 4.8.0-fc.3
    Status:                True
    Type:                  Progressing
    Last Transition Time:  2021-05-11T07:28:16Z
    Reason:                AsExpected
    Status:                True
    Type:                  Upgradeable
    Last Transition Time:  2021-05-11T09:53:02Z
    Message:               Unable to apply 4.8.0-fc.3: timed out waiting for the condition during waitForDaemonsetRollout: Daemonset machine-config-daemon is not ready. status: (desired: 6, updated: 4, ready: 6, unavailable: 0)
    Reason:                MachineConfigDaemonFailed
    Status:                True
    Type:                  Degraded
    Last Transition Time:  2021-05-11T09:41:03Z
    Message:               Cluster not available for 4.8.0-fc.3
    Status:                False
    Type:                  Available
  Extension:
    Master:  all 3 nodes are at latest configuration rendered-master-9de9298fea2e327d14a7bcc561681671
    Worker:  all 3 nodes are at latest configuration rendered-worker-2b996b24091707eec2bb707c2c68e308
  Related Objects:
    Group:
    Name:      openshift-machine-config-operator
    Resource:  namespaces
    Group:     machineconfiguration.openshift.io
    Name:
    Resource:  machineconfigpools
    Group:     machineconfiguration.openshift.io
    Name:
    Resource:  controllerconfigs
    Group:     machineconfiguration.openshift.io
    Name:
    Resource:  kubeletconfigs
    Group:     machineconfiguration.openshift.io
    Name:
    Resource:  containerruntimeconfigs
    Group:     machineconfiguration.openshift.io
    Name:
    Resource:  machineconfigs
    Group:
    Name:
    Resource:  nodes
    Group:
    Name:      openshift-kni-infra
    Resource:  namespaces
    Group:
    Name:      openshift-openstack-infra
    Resource:  namespaces
    Group:
    Name:      openshift-ovirt-infra
    Resource:  namespaces
    Group:
    Name:      openshift-vsphere-infra
    Resource:  namespaces
  Versions:
    Name:     operator
    Version:  4.7.10
Events:  <none>

[root@preserve-jialiu-ansible ~]# oc describe node | grep machineconfig
machineconfiguration.openshift.io/currentConfig: rendered-master-9de9298fea2e327d14a7bcc561681671
machineconfiguration.openshift.io/desiredConfig: rendered-master-9de9298fea2e327d14a7bcc561681671
machineconfiguration.openshift.io/reason:
machineconfiguration.openshift.io/state: Done
machineconfiguration.openshift.io/currentConfig: rendered-worker-2b996b24091707eec2bb707c2c68e308
machineconfiguration.openshift.io/desiredConfig: rendered-worker-2b996b24091707eec2bb707c2c68e308
machineconfiguration.openshift.io/reason:
machineconfiguration.openshift.io/state: Done
machineconfiguration.openshift.io/currentConfig: rendered-master-9de9298fea2e327d14a7bcc561681671
machineconfiguration.openshift.io/desiredConfig: rendered-master-9de9298fea2e327d14a7bcc561681671
machineconfiguration.openshift.io/reason:
machineconfiguration.openshift.io/state: Done
machineconfiguration.openshift.io/currentConfig: rendered-worker-2b996b24091707eec2bb707c2c68e308
machineconfiguration.openshift.io/desiredConfig: rendered-worker-2b996b24091707eec2bb707c2c68e308
machineconfiguration.openshift.io/reason:
machineconfiguration.openshift.io/state: Done
machineconfiguration.openshift.io/currentConfig: rendered-master-9de9298fea2e327d14a7bcc561681671
machineconfiguration.openshift.io/desiredConfig: rendered-master-9de9298fea2e327d14a7bcc561681671
machineconfiguration.openshift.io/reason:
machineconfiguration.openshift.io/state: Done
machineconfiguration.openshift.io/currentConfig: rendered-worker-2b996b24091707eec2bb707c2c68e308
machineconfiguration.openshift.io/desiredConfig: rendered-worker-2b996b24091707eec2bb707c2c68e308
machineconfiguration.openshift.io/reason:
machineconfiguration.openshift.io/state: Done

[root@preserve-jialiu-ansible ~]# oc get po -n openshift-machine-config-operator
NAME                                        READY   STATUS        RESTARTS   AGE
machine-config-controller-97944c7d4-x4znz   1/1     Running       0          107m
machine-config-daemon-6ffkf                 2/2     Running       0          21m
machine-config-daemon-72pjx                 2/2     Running       0          58s
machine-config-daemon-7gq75                 2/2     Terminating   0          3h6m
machine-config-daemon-dzljj                 2/2     Running       0          11m
machine-config-daemon-mfhps                 2/2     Running       0          31m
machine-config-daemon-wlhsw                 2/2     Running       0          41m
machine-config-operator-5c6c8476f7-8zn65    1/1     Running       0          53m
machine-config-server-hn4ms                 1/1     Running       0          3h6m
machine-config-server-jj2js                 1/1     Running       0          3h6m
machine-config-server-stbdh                 1/1     Running       0          3h6m

The cluster was already terminated; I did not capture logs.

I am struggling to reproduce this anywhere. I've tried multiple paths:

Nightly to nightly (4.7.0-0.nightly-2021-05-20-112118 -> 4.7.0-0.nightly-2021-05-26-013541, others)
4.7.10-x86_64 -> 4.8.0-fc.3-x86_64
4.7.10-x86_64 -> 4.7.11-x86_64 -> 4.8.0-fc.3-x86_64

In all cases the MCO appeared to take between 18 and 23 minutes to finish once it started progressing. The maximum MCO degradation I saw was 1m30s. The longest cluster upgrade I experienced was 1h2m. There is either another variable, or I'm not sufficiently unlucky. I also can't just blindly put [1] back in as the fix, because it will bring back [2].

[1] https://github.com/openshift/machine-config-operator/pull/2395
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1933772

I'm closing this, but please reopen this bug (or open a new bug) if this issue continues to occur.
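When chasing reports like this, a quick way to spot a stuck rollout is to look at how long the machine-config ClusterOperator conditions have been in their current state. A hedged sketch; the jsonpath layout is just one convenient form, and `elapsed_minutes` is a helper introduced here for illustration:

```shell
#!/bin/sh
# Sketch: print each machine-config ClusterOperator condition with its
# lastTransitionTime, so an hours-long Progressing/Degraded phase stands out.
if command -v oc >/dev/null 2>&1; then
  oc get clusteroperator machine-config \
    -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.lastTransitionTime}{"\n"}{end}'
fi

# elapsed_minutes TRANSITION_EPOCH NOW_EPOCH -> whole minutes since transition
elapsed_minutes() {
  echo $(( ($2 - $1) / 60 ))
}
```

Comparing the printed transition times against the "10mins*number of nodes" pattern from the duplicate bug makes the SIGTERM-protection theory easy to check.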
I hit this issue again when I was running my new round of RC upgrade testing from 4.7.17 to 4.8.0-rc.0.

[root@preserve-jialiu-ansible ~]# oc get co
NAME                                       VERSION      AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.8.0-rc.0   True        False         False      73m
baremetal                                  4.8.0-rc.0   True        False         False      3h37m
cloud-credential                           4.8.0-rc.0   True        False         False      3h27m
cluster-autoscaler                         4.8.0-rc.0   True        False         False      3h36m
config-operator                            4.8.0-rc.0   True        False         False      3h37m
console                                    4.8.0-rc.0   True        False         False      80m
csi-snapshot-controller                    4.8.0-rc.0   True        False         False      132m
dns                                        4.8.0-rc.0   True        False         False      69m
etcd                                       4.8.0-rc.0   True        False         False      3h36m
image-registry                             4.8.0-rc.0   True        False         False      132m
ingress                                    4.8.0-rc.0   True        False         False      80m
insights                                   4.8.0-rc.0   True        False         False      3h30m
kube-apiserver                             4.8.0-rc.0   True        False         False      3h35m
kube-controller-manager                    4.8.0-rc.0   True        False         False      3h35m
kube-scheduler                             4.8.0-rc.0   True        False         False      3h34m
kube-storage-version-migrator              4.8.0-rc.0   True        False         False      130m
machine-api                                4.8.0-rc.0   True        False         False      3h31m
machine-approver                           4.8.0-rc.0   True        False         False      3h36m
machine-config                             4.7.17       False       True          False      53m
marketplace                                4.8.0-rc.0   True        False         False      129m
monitoring                                 4.8.0-rc.0   True        False         False      77m
network                                    4.8.0-rc.0   True        False         False      3h37m
node-tuning                                4.8.0-rc.0   True        False         False      80m
openshift-apiserver                        4.8.0-rc.0   True        False         False      3h32m
openshift-controller-manager               4.8.0-rc.0   True        False         False      80m
openshift-samples                          4.8.0-rc.0   True        False         False      80m
operator-lifecycle-manager                 4.8.0-rc.0   True        False         False      3h36m
operator-lifecycle-manager-catalog         4.8.0-rc.0   True        False         False      3h36m
operator-lifecycle-manager-packageserver   4.8.0-rc.0   True        False         False      3h32m
service-ca                                 4.8.0-rc.0   True        False         False      3h37m
storage                                    4.8.0-rc.0   True        True          False      128m

[root@preserve-jialiu-ansible ~]# oc get mc
NAME                                               GENERATEDBYCONTROLLER                      IGNITIONVERSION   AGE
00-master                                          3f6db243f2d6651720daa658e8edc830f47dd184   3.2.0             3h38m
00-worker                                          3f6db243f2d6651720daa658e8edc830f47dd184   3.2.0             3h38m
01-master-container-runtime                        3f6db243f2d6651720daa658e8edc830f47dd184   3.2.0             3h38m
01-master-kubelet                                  3f6db243f2d6651720daa658e8edc830f47dd184   3.2.0             3h38m
01-worker-container-runtime                        3f6db243f2d6651720daa658e8edc830f47dd184   3.2.0             3h38m
01-worker-kubelet                                  3f6db243f2d6651720daa658e8edc830f47dd184   3.2.0             3h38m
99-master-generated-registries                     3f6db243f2d6651720daa658e8edc830f47dd184   3.2.0             3h38m
99-master-ssh                                                                                 3.2.0             3h45m
99-worker-generated-registries                     3f6db243f2d6651720daa658e8edc830f47dd184   3.2.0             3h38m
99-worker-ssh                                                                                 3.2.0             3h45m
rendered-master-5f25de2f5af9fbb541efcfe99862504c   e980e4ebb10f252c0a9e044a8872df2c3404ff77   3.2.0             133m
rendered-master-afc3746b1ce25828269e1c5279923558   e980e4ebb10f252c0a9e044a8872df2c3404ff77   3.2.0             3h38m
rendered-master-cf235cb731b806dd20614cdab2b0e803   3f6db243f2d6651720daa658e8edc830f47dd184   3.2.0             90s
rendered-worker-165caca6c9377c8459587c2d2f76a422   3f6db243f2d6651720daa658e8edc830f47dd184   3.2.0             90s
rendered-worker-82bace951257778c36015aecebe64026   e980e4ebb10f252c0a9e044a8872df2c3404ff77   3.2.0             133m
rendered-worker-8dc1a3a13e9c7f25b42e0a9cbb9717bc   e980e4ebb10f252c0a9e044a8872df2c3404ff77   3.2.0             3h38m

[root@preserve-jialiu-ansible ~]# oc -n openshift-machine-config-operator get ds
NAME                    DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                     AGE
machine-config-daemon   6         6         6       6            6           kubernetes.io/os=linux            3h40m
machine-config-server   3         3         3       3            3           node-role.kubernetes.io/master=   3h39m

[root@preserve-jialiu-ansible ~]# oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-5f25de2f5af9fbb541efcfe99862504c   False     True       False      3              0                   0                     0                      3h40m
worker   rendered-worker-82bace951257778c36015aecebe64026   False     True       False      3              0                   0                     0                      3h40m

[root@preserve-jialiu-ansible ~]# oc get po -n openshift-machine-config-operator
NAME                                         READY   STATUS        RESTARTS   AGE
machine-config-controller-5b6ff6cd68-8njjl   1/1     Running       0          119m
machine-config-daemon-b6drh                  2/2     Running       0          23m
machine-config-daemon-gfgbn                  2/2     Running       0          33m
machine-config-daemon-mzjvv                  2/2     Running       0          43m
machine-config-daemon-qnwkn                  2/2     Running       0          2m50s
machine-config-daemon-w9md5                  2/2     Running       0          13m
machine-config-daemon-zd4gf                  2/2     Terminating   0          3h28m
machine-config-operator-5dc8d8796b-ggkks     1/1     Running       0          55m
machine-config-server-7fd26                  1/1     Running       0          3h27m
machine-config-server-979ck                  1/1     Running       0          3h27m
machine-config-server-jjfwx                  1/1     Running       0          3h27m

[root@preserve-jialiu-ansible ~]# oc -n openshift-machine-config-operator logs machine-config-daemon-zd4gf machine-config-daemon
<--snip-->
I0617 05:43:28.615103    6608 update.go:1737] Writing SSHKeys at "/home/core/.ssh/authorized_keys"
I0617 05:43:28.647695    6608 rpm-ostree.go:261] Running captured: systemctl reload crio
I0617 05:43:28.674130    6608 update.go:1906] crio config reloaded successfully! Desired config rendered-master-5f25de2f5af9fbb541efcfe99862504c has been applied, skipping reboot
I0617 05:43:28.678557    6608 daemon.go:802] Current config: rendered-master-afc3746b1ce25828269e1c5279923558
I0617 05:43:28.678619    6608 daemon.go:803] Desired config: rendered-master-5f25de2f5af9fbb541efcfe99862504c
I0617 05:43:28.695788    6608 daemon.go:1151] Completing pending config rendered-master-5f25de2f5af9fbb541efcfe99862504c
I0617 05:43:28.695809    6608 update.go:1906] completed update for config rendered-master-5f25de2f5af9fbb541efcfe99862504c
I0617 05:43:28.699028    6608 daemon.go:1167] In desired config rendered-master-5f25de2f5af9fbb541efcfe99862504c
I0617 07:39:56.076509    6608 daemon.go:586] Got SIGTERM, but actively updating

[root@preserve-jialiu-ansible ~]# for i in $(oc get po -o name -n openshift-machine-config-operator|grep daemon); do oc logs $i -c machine-config-daemon -n openshift-machine-config-operator|head -n1; done | sort
I0617 04:14:57.050274    6608 start.go:108] Version: v4.7.0-202106112128.p0.git.e980e4e-dirty (e980e4ebb10f252c0a9e044a8872df2c3404ff77)
I0617 06:59:04.618149  176503 start.go:108] Version: v4.8.0-202106122144.p0.git.3f6db24.assembly.stream-dirty (3f6db243f2d6651720daa658e8edc830f47dd184)
I0617 07:09:15.894178  128094 start.go:108] Version: v4.8.0-202106122144.p0.git.3f6db24.assembly.stream-dirty (3f6db243f2d6651720daa658e8edc830f47dd184)
I0617 07:19:30.349355  194589 start.go:108] Version: v4.8.0-202106122144.p0.git.3f6db24.assembly.stream-dirty (3f6db243f2d6651720daa658e8edc830f47dd184)
I0617 07:29:39.340320  132307 start.go:108] Version: v4.8.0-202106122144.p0.git.3f6db24.assembly.stream-dirty (3f6db243f2d6651720daa658e8edc830f47dd184)
I0617 07:39:55.484559  167503 start.go:108] Version: v4.8.0-202106122144.p0.git.3f6db24.assembly.stream-dirty (3f6db243f2d6651720daa658e8edc830f47dd184)

Please check the timestamps; it is really similar behaviour to https://bugzilla.redhat.com/show_bug.cgi?id=1926731 (which was closed as a duplicate of this bug).

So the original report on this one and your new one look like they both got stuck where SIGTERM protection is active after a rebootless update of SSH/cri-o configuration, which looks like an instance of [1].

I0617 05:43:28.615103    6608 update.go:1737] Writing SSHKeys at "/home/core/.ssh/authorized_keys"
I0617 05:43:28.647695    6608 rpm-ostree.go:261] Running captured: systemctl reload crio
I0617 05:43:28.674130    6608 update.go:1906] crio config reloaded successfully! Desired config rendered-master-5f25de2f5af9fbb541efcfe99862504c has been applied, skipping reboot

The MCO's gracePeriod is 10 minutes, so every time something tries to kill/restart the MCO while it is protecting itself, it will wait the whole 10 minutes for SIGKILL, which definitely matches the explicit "10mins*number of nodes" timing from the duplicated bug you mentioned [2].

The reason I think I'm struggling to reproduce this one is that any cluster I build, or that CI builds, doesn't seem to do the same rebootless config update before or during the upgrade. Are you literally just installing a cluster, letting it settle, and then immediately upgrading it, or does your process do other things that would cause machine config updates? If it does, that's 100% okay; I just want to account for the rebootless update so I know we don't have a combination of two bugs, something like "yes, we get stuck on the SIGTERM protection, but that only happens because something intermittently triggers a weird rebootless config update when it otherwise shouldn't".

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1965992
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1926731

> Are you literally just installing a cluster, letting it settle, and then immediately upgrading it, or does your process do other things that would cause machine config updates?
Here are more details about the upgrade process.
1. Set up a cluster in a restricted network.
2. Mirror the target payload release image to the mirror registry.
3. Per step 2's output, create an ImageContentSourcePolicy.
4. Wait some minutes to confirm the new machineconfig is applied to all nodes and the cluster stays healthy.
5. Start the upgrade process to the target release image.
Then this issue happened.
Though the ImageContentSourcePolicy might involve an update of the cri-o configuration, which causes machine config updates, I had already waited for that to complete before triggering the upgrade.
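Step 4 in the list above ("wait some minutes") can be made deterministic by blocking on the MachineConfigPool conditions instead of sleeping. A hedged sketch; the 30m timeout is an arbitrary choice, and `wait_for_pools` is a name introduced here for illustration:

```shell
#!/bin/sh
# Sketch: block until every MachineConfigPool reports Updated=True and
# Updating=False, rather than waiting a guessed number of minutes.
wait_for_pools() {
  # Without cluster access there is nothing to wait on; treat as settled.
  command -v oc >/dev/null 2>&1 || return 0
  oc wait machineconfigpool --all --for=condition=Updated=True --timeout=30m &&
  oc wait machineconfigpool --all --for=condition=Updating=False --timeout=30m
}

if wait_for_pools; then
  echo "machine config pools settled; safe to start the upgrade"
fi
```

This only guarantees the pools have settled; as the rest of this bug shows, an MCD stuck in SIGTERM protection can still slow the subsequent upgrade even when both pools report Updated.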
Thanks for the details, this makes more sense now. Yes, changing an ImageContentSourcePolicy does cause a rebootless cri-o config update, and (because of [1]) that config update erroneously gets you stuck in the SIGTERM protection forever. I've tested this on a cluster and I can reproduce this bug reliably if I update an ImageContentSourcePolicy before I do the upgrade.

1.) Apply an ImageContentSourcePolicy:

cat << EOF > fake_mirror.yaml
apiVersion: operator.openshift.io/v1alpha1
kind: ImageContentSourcePolicy
metadata:
  name: metering-mirror
spec:
  repositoryDigestMirrors:
  - mirrors:
    - registry.redhat.io/openshift4/ose-metering-ansible-operator
    source: registry.redhat.io/openshift4/ose-metering-ansible-operator
EOF
oc apply -f fake_mirror.yaml

2.) Upgrade the cluster (I did 4.7.16 --> 4.8.0-rc.0)

3.) Watch the MCO portion of the upgrade take at least 10 minutes per node

Performing the same test with an image containing the fix from [1], the timing of the MCO upgrade is normal. I'm going to close this bug as a duplicate of [1] because we understand the root cause here.

Thank you Johnny for re-opening this and for the details, I appreciate it.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1965992

*** This bug has been marked as a duplicate of bug 1965992 ***