Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1725478

Summary: machine config operator is failing the e2e aws upgrade test
Product: OpenShift Container Platform
Reporter: Ben Parees <bparees>
Component: Machine Config Operator
Assignee: Antonio Murdaca <amurdaca>
Status: CLOSED CURRENTRELEASE
QA Contact: Micah Abbott <miabbott>
Severity: urgent
Priority: unspecified
Version: 4.2.0
CC: adam.kaplan, walters
Hardware: Unspecified
OS: Unspecified
Whiteboard: buildcop
Type: Bug
Last Closed: 2019-07-02 13:01:14 UTC

Description Ben Parees 2019-06-30 18:17:36 UTC
Test is consistently failing across PRs:
https://openshift-gce-devel.appspot.com/builds/origin-ci-test/pr-logs/directory/pull-ci-openshift-origin-master-e2e-aws-upgrade

Jun 30 17:59:24.982: INFO: cluster upgrade is Progressing: Unable to apply 0.0.1-2019-06-30-154123: the cluster operator machine-config has not yet successfully rolled out
Jun 30 17:59:24.982: INFO: cluster upgrade is Failing: Cluster operator machine-config is still updating


Jun 30 17:59:55.039: INFO: Cluster operators:
NAME                                     A F P VERSION MESSAGE
authentication                                 <new>   
cloud-credential                               <new>   6 of 6 credentials requests provisioned and reconciled.
cluster-autoscaler                             <new>   
console                                        <new>   
dns                                            <new>   Desired and available number of DNS DaemonSets are equal
image-registry                                 <new>   The registry is ready
ingress                                        <new>   desired and current number of IngressControllers are equal
kube-apiserver                                 <new>   Progressing: 3 nodes are at revision 7
kube-controller-manager                        <new>   Progressing: 3 nodes are at revision 5
kube-scheduler                                 <new>   Progressing: 3 nodes are at revision 6
machine-api                                    <new>   
machine-config                           F T T <old>   Working towards 0.0.1-2019-06-30-154123
marketplace                                    <new>   Successfully progressed to release version: 0.0.1-2019-06-30-154123
monitoring                                     <new>   
network                                        <new>   
node-tuning                                    <new>   Cluster version is "0.0.1-2019-06-30-154123"
openshift-apiserver                            <new>   
openshift-controller-manager                   <new>   
openshift-samples                              <new>   Samples installation successful at 0.0.1-2019-06-30-154123
operator-lifecycle-manager                     <new>   Deployed 0.10.1
operator-lifecycle-manager-catalog             <new>   Deployed 0.10.1
operator-lifecycle-manager-packageserver       <new>   Deployed version 0.10.1
service-ca                                     <new>   Progressing: All service-ca-operator deployments updated
service-catalog-apiserver                      <new>   
service-catalog-controller-manager             <new>   
storage                                        <new>   
support                                        <new>   Monitoring the cluster
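For build-cop triage of output like the table above, the quickest signal is the three single-letter status columns after the operator name (A F P, which this report's log format suggests stand for Available/Failing/Progressing; that reading, and the helper below, are assumptions for illustration, not part of the original report). A minimal sketch that scans such a table and flags unhealthy operators:

```python
def failing_operators(table: str):
    """Return (name, A, F, P) for operators whose Failing flag is T
    or Available flag is F in an A/F/P status table like the one in
    this report's e2e log output (column meanings assumed)."""
    bad = []
    for line in table.splitlines():
        tokens = line.split()
        # Rows with explicit status flags carry three single-letter T/F
        # columns right after the operator name; healthy rows omit them,
        # and the header row ("NAME A F P ...") fails the T/F check.
        if len(tokens) >= 4 and all(t in ("T", "F") for t in tokens[1:4]):
            available, failing, progressing = tokens[1:4]
            if failing == "T" or available == "F":
                bad.append((tokens[0], available, failing, progressing))
    return bad
```

Run against the table above, this would surface only the machine-config row (F T T), matching the "Cluster operator machine-config is still updating" failure in the upgrade log.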

Comment 2 Colin Walters 2019-06-30 19:34:52 UTC
I built a new MCD; the failure looks like it must be related to that.  The logs don't have much more useful than
`Jun 30 18:00:16 ip-10-0-159-241 systemd[1]: machine-config-daemon-host.service: Failed with result 'exit-code'.`
as far as I can see.  Debugging more.

Comment 3 Colin Walters 2019-06-30 21:01:24 UTC
https://github.com/openshift/machine-config-operator/pull/908
will help debug.

Comment 4 Ben Parees 2019-06-30 21:14:30 UTC
If there's no clear fix tonight, can you revert for now so Europe isn't blocked when they come online?

Comment 5 Colin Walters 2019-06-30 21:34:28 UTC
The PR has a likely fix now.  Will be verifying later tonight EST.

Comment 6 Colin Walters 2019-07-01 11:51:38 UTC
Ah and of course we need a new on-host MCD with that fix.

Building machine-config-daemon-4.2.0-2.rhaos4.2.git15edac1.el8 for rhaos-4.2-rhel-8-candidate
https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=22443971

Comment 7 Colin Walters 2019-07-01 13:22:59 UTC
*** Bug 1725799 has been marked as a duplicate of this bug. ***

Comment 8 Colin Walters 2019-07-01 13:24:10 UTC
Redoing that build since I forgot to upload the sources.  But all RHCOS builds
are also blocking on the ART pipeline being broken:
https://jenkins-rhcos-art.cloud.privileged.psi.redhat.com/job/rhcos-art-rhcos-4.2/44/console

Comment 9 Colin Walters 2019-07-01 17:20:39 UTC
Now blocking on https://gitlab.cee.redhat.com/openshift-art/rhcos-upshift/merge_requests/38

Comment 11 Colin Walters 2019-07-02 00:01:28 UTC
OK ART pipeline is running again, oscontainer was promoted https://openshift-gce-devel.appspot.com/builds/origin-ci-test/logs/release-promote-openshift-machine-os-content-e2e-aws-4.2

Comment 12 Colin Walters 2019-07-02 13:01:14 UTC
Back to green.