The upgrade from 4.6.15 to 4.7.0-0.ci is failing for the latest candidates, see:

https://openshift-release.apps.ci.l2s4.p1.openshiftapps.com/dashboards/overview
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1355065120719376384

It seems to come from this test failure:

> [sig-arch][Feature:ClusterUpgrade] Cluster should remain functional during upgrade [Disruptive] [Serial]
>
> fail [github.com/openshift/origin/test/e2e/upgrade/upgrade.go:143]: during upgrade to registry.ci.openshift.org/ocp/release:4.7.0-0.ci-2021-01-29-080604
> Unexpected error:
>     <*errors.errorString | 0xc0026ebfa0>: {
>         s: "Cluster did not complete upgrade: timed out waiting for the condition: Cluster operator openshift-apiserver is reporting a failure: APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable for apiserver.openshift-apiserver ()",
>     }
>     Cluster did not complete upgrade: timed out waiting for the condition: Cluster operator openshift-apiserver is reporting a failure: APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable for apiserver.openshift-apiserver ()
> occurred
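For anyone else triaging this, a quick way to confirm on a live cluster which operator is holding up the upgrade (generic oc invocations, not taken from the job artifacts):

    # Overall upgrade progress and the operator it is waiting on
    oc get clusterversion
    # Degraded/Available conditions for the failing operator
    oc get clusteroperator openshift-apiserver -o yaml
    # Which apiserver pods are unavailable and where they are scheduled
    oc get pods -n openshift-apiserver -o wide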
Let's track this in the existing bug.

*** This bug has been marked as a duplicate of bug 1921198 ***
Sorry, I spoke too soon on Slack on Friday; this is not the same as bug 1921198. In that bug the upgrade succeeds, but the e2e job complains because the apiserver was unreachable at one point (because of an OVS problem), whereas in this one the apiserver is actually failing and marking itself degraded.
The openshift-apiserver pod apiserver-d559dcf48-b7n2b in namespace openshift-apiserver shows:

    "status": {
        "conditions": [
            {
                "lastProbeTime": null,
                "lastTransitionTime": "2021-01-29T09:18:58Z",
                "message": "0/6 nodes are available: 2 node(s) didn't match Pod's node affinity, 2 node(s) didn't match pod affinity/anti-affinity, 2 node(s) didn't match pod anti-affinity rules, 2 node(s) were unschedulable.",
                "reason": "Unschedulable",
                "status": "False",
                "type": "PodScheduled"
            }
        ],
        "phase": "Pending",
        "qosClass": "Burstable"
    }

The node ip-10-0-146-224.ec2.internal has suspicious annotations:

    "machineconfiguration.openshift.io/reason": "unexpected on-disk state validating against rendered-master-94090a3396a605db226f3f20b4c8e931: content mismatch for file \"/etc/systemd/system/pivot.service.d/10-mco-default-env.conf\"",
    "machineconfiguration.openshift.io/state": "Degraded",

together with:

    "spec": {
        "providerID": "aws:///us-east-1b/i-0862a1b8ba235bbf9",
        "taints": [
            {
                "effect": "NoSchedule",
                "key": "node-role.kubernetes.io/master"
            },
            {
                "effect": "NoSchedule",
                "key": "node.kubernetes.io/unschedulable",
                "timeAdded": "2021-01-29T09:18:56Z"
            }
        ],
        "unschedulable": true
    },
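In case someone wants to re-run the check, something along these lines should show the same picture on a live cluster (the node name is the one from the job above; the annotation keys are the standard machineconfiguration.openshift.io ones):

    # Is the node cordoned, and which taints keep the apiserver pod from scheduling?
    oc get node ip-10-0-146-224.ec2.internal -o jsonpath='{.spec.unschedulable}{"\n"}{.spec.taints}{"\n"}'
    # MCO state/reason annotations for the degraded node
    oc get node ip-10-0-146-224.ec2.internal -o yaml | grep 'machineconfiguration.openshift.io/'
    # Whether the master pool itself reports Degraded
    oc get mcp master -o yaml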
https://prow.ci.openshift.org/?job=release-openshift-origin-installer-e2e-aws-upgrade shows better test history since this was first reported. Is this still an UpgradeBlocker?
This should have been fixed by the following MCO commits:

https://github.com/openshift/machine-config-operator/commit/44d95f3dd112a2a56935edb93a61953b83ab5f79
https://github.com/openshift/machine-config-operator/commit/3837e3812207d09a76c31bd1f1abb23e8fbfc1fd

*** This bug has been marked as a duplicate of bug 1920027 ***
Removing UpgradeBlocker from this older bug, to remove it from the suspect queue described in [1]. If you feel this bug still needs to be a suspect, please add the keyword again.

[1]: https://github.com/openshift/enhancements/pull/475