Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1725478

Summary: machine config operator is failing the e2e aws upgrade test
Product: OpenShift Container Platform
Reporter: Ben Parees <bparees>
Component: Machine Config Operator
Assignee: Antonio Murdaca <amurdaca>
Status: CLOSED CURRENTRELEASE
QA Contact: Micah Abbott <miabbott>
Severity: urgent
Priority: unspecified
Version: 4.2.0
CC: adam.kaplan, walters
Hardware: Unspecified
OS: Unspecified
Whiteboard: buildcop
Type: Bug
Last Closed: 2019-07-02 13:01:14 UTC

Description Ben Parees 2019-06-30 18:17:36 UTC
Test is consistently failing across PRs:
https://openshift-gce-devel.appspot.com/builds/origin-ci-test/pr-logs/directory/pull-ci-openshift-origin-master-e2e-aws-upgrade

Jun 30 17:59:24.982: INFO: cluster upgrade is Progressing: Unable to apply 0.0.1-2019-06-30-154123: the cluster operator machine-config has not yet successfully rolled out
Jun 30 17:59:24.982: INFO: cluster upgrade is Failing: Cluster operator machine-config is still updating


Jun 30 17:59:55.039: INFO: Cluster operators:
NAME                                     A F P VERSION MESSAGE
authentication                                 <new>   
cloud-credential                               <new>   6 of 6 credentials requests provisioned and reconciled.
cluster-autoscaler                             <new>   
console                                        <new>   
dns                                            <new>   Desired and available number of DNS DaemonSets are equal
image-registry                                 <new>   The registry is ready
ingress                                        <new>   desired and current number of IngressControllers are equal
kube-apiserver                                 <new>   Progressing: 3 nodes are at revision 7
kube-controller-manager                        <new>   Progressing: 3 nodes are at revision 5
kube-scheduler                                 <new>   Progressing: 3 nodes are at revision 6
machine-api                                    <new>   
machine-config                           F T T <old>   Working towards 0.0.1-2019-06-30-154123
marketplace                                    <new>   Successfully progressed to release version: 0.0.1-2019-06-30-154123
monitoring                                     <new>   
network                                        <new>   
node-tuning                                    <new>   Cluster version is "0.0.1-2019-06-30-154123"
openshift-apiserver                            <new>   
openshift-controller-manager                   <new>   
openshift-samples                              <new>   Samples installation successful at 0.0.1-2019-06-30-154123
operator-lifecycle-manager                     <new>   Deployed 0.10.1
operator-lifecycle-manager-catalog             <new>   Deployed 0.10.1
operator-lifecycle-manager-packageserver       <new>   Deployed version 0.10.1
service-ca                                     <new>   Progressing: All service-ca-operator deployments updated
service-catalog-apiserver                      <new>   
service-catalog-controller-manager             <new>   
storage                                        <new>   
support                                        <new>   Monitoring the cluster
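For build-cop triage of output like the table above, the quickest signal is the three single-letter status columns after the operator name (A F P, which this report's log format suggests stand for Available/Failing/Progressing; that reading, and the helper below, are assumptions for illustration, not part of the original report). A minimal sketch that scans such a table and flags unhealthy operators:

```python
def failing_operators(table: str):
    """Return (name, A, F, P) for operators whose Failing flag is T
    or Available flag is F in an A/F/P status table like the one in
    this report's e2e log output (column meanings assumed)."""
    bad = []
    for line in table.splitlines():
        tokens = line.split()
        # Rows with explicit status flags carry three single-letter T/F
        # columns right after the operator name; healthy rows omit them,
        # and the header row ("NAME A F P ...") fails the T/F check.
        if len(tokens) >= 4 and all(t in ("T", "F") for t in tokens[1:4]):
            available, failing, progressing = tokens[1:4]
            if failing == "T" or available == "F":
                bad.append((tokens[0], available, failing, progressing))
    return bad
```

Run against the table above, this would surface only the machine-config row (F T T), matching the "Cluster operator machine-config is still updating" failure in the upgrade log.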

Comment 2 Colin Walters 2019-06-30 19:34:52 UTC
I built a new MCD; the failure looks like it must be related to that.  The logs don't have much more useful than
`Jun 30 18:00:16 ip-10-0-159-241 systemd[1]: machine-config-daemon-host.service: Failed with result 'exit-code'.`
as far as I can see.  Debugging more.

Comment 3 Colin Walters 2019-06-30 21:01:24 UTC
https://github.com/openshift/machine-config-operator/pull/908
will help debug.

Comment 4 Ben Parees 2019-06-30 21:14:30 UTC
If there's no clear fix tonight, can you revert for now so Europe isn't blocked when they come online?

Comment 5 Colin Walters 2019-06-30 21:34:28 UTC
The PR has a likely fix now.  Will be verifying later tonight EST.

Comment 6 Colin Walters 2019-07-01 11:51:38 UTC
Ah and of course we need a new on-host MCD with that fix.

Building machine-config-daemon-4.2.0-2.rhaos4.2.git15edac1.el8 for rhaos-4.2-rhel-8-candidate
https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=22443971

Comment 7 Colin Walters 2019-07-01 13:22:59 UTC
*** Bug 1725799 has been marked as a duplicate of this bug. ***

Comment 8 Colin Walters 2019-07-01 13:24:10 UTC
Redoing that build since I forgot to upload the sources.  But all RHCOS builds
are also blocking on the ART pipeline being broken:
https://jenkins-rhcos-art.cloud.privileged.psi.redhat.com/job/rhcos-art-rhcos-4.2/44/console

Comment 9 Colin Walters 2019-07-01 17:20:39 UTC
Now blocking on https://gitlab.cee.redhat.com/openshift-art/rhcos-upshift/merge_requests/38

Comment 11 Colin Walters 2019-07-02 00:01:28 UTC
OK ART pipeline is running again, oscontainer was promoted https://openshift-gce-devel.appspot.com/builds/origin-ci-test/logs/release-promote-openshift-machine-os-content-e2e-aws-4.2

Comment 12 Colin Walters 2019-07-02 13:01:14 UTC
Back to green.