Bug 1707212

Summary: Upgrade failures due to MCO trying to roll out multiple master changes
Product: OpenShift Container Platform
Reporter: Ben Parees <bparees>
Component: Machine Config Operator
Assignee: Colin Walters <walters>
Status: CLOSED ERRATA
QA Contact: Micah Abbott <miabbott>
Severity: unspecified
Priority: unspecified
Version: 4.1.0
CC: amurdaca, eparis, imcleod, mifiedle, nstielau, rkrawitz, walters
Target Release: 4.1.0
Hardware: Unspecified
OS: Unspecified
Doc Type: No Doc Update
Type: Bug
Last Closed: 2019-06-04 10:48:34 UTC
Bug Depends On: 1706606    

Description Ben Parees 2019-05-07 04:20:25 UTC
Description of problem:
OCP upgrade failed while waiting for the etcd-quorum-guard deployment:

May 07 01:42:38.269 E clusterversion/version changed Failing to True: UpdatePayloadFailed: Could not update deployment "openshift-machine-config-operator/etcd-quorum-guard" (315 of 350)
May 07 01:48:15.472 E clusteroperator/machine-config changed Degraded to True: timed out waiting for the condition during syncRequiredMachineConfigPools: error pool master is not ready, retrying. Status: (pool degraded: true total: 3, ready 1, updated: 1, unavailable: 1): Failed to resync 0.0.1-2019-05-07-003443 because: timed out waiting for the condition during syncRequiredMachineConfigPools: error pool master is not ready, retrying. Status: (pool degraded: true total: 3, ready 1, updated: 1, unavailable: 1)



How reproducible:
test flake

Steps to Reproduce:
Seen in the e2e-aws-upgrade job.


Actual results:
upgrade timed out/failed

Expected results:
upgrade succeeds


Additional info:
It's not clear who owns the quorum guard; feel free to reassign to the MCO team if appropriate.

Comment 1 Ben Parees 2019-05-07 04:22:45 UTC
forgot to link the failing job:
https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_console-operator/226/pull-ci-openshift-console-operator-master-e2e-aws-upgrade/76

Looks like there may also have been networking issues and MCO node sync issues, which are more likely the root cause here.

Comment 2 Ben Parees 2019-05-07 04:24:26 UTC
Similar looking failure:
https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_cluster-version-operator/182/pull-ci-openshift-cluster-version-operator-master-e2e-aws-upgrade/56


Moving to MCO for now as a more likely target, but networking also looks suspect.

Comment 3 Antonio Murdaca 2019-05-07 09:12:30 UTC
This is likely https://bugzilla.redhat.com/show_bug.cgi?id=1706606, but this report mentions the quorum guard specifically; I've cc'ed Robert to take a look.

Comment 5 Eric Paris 2019-05-07 15:42:50 UTC
I say we leave this on the list until we solve https://bugzilla.redhat.com/show_bug.cgi?id=1706606, then re-assess. The MCO team should stay focused on that bug.

Comment 6 Robert Krawitz 2019-05-07 18:04:06 UTC
This indicates that two nodes were unhealthy simultaneously, which is exactly what the eqg (etcd-quorum-guard) is intended to prevent.

Looking at https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_cluster-version-operator/182/pull-ci-openshift-cluster-version-operator-master-e2e-aws-upgrade/56/artifacts/e2e-aws-upgrade/e2e.log, I see that one master is being drained, and later a second master is drained before the first one is healthy again.  The first node was marked schedulable immediately before the second one was drained, but `etcd` was not healthy on the first node; killing off the second node would have led to quorum failure.

-------------------------------------

    3723 May 07 01:18:19.496 I node/ip-10-0-138-140.ec2.internal Written pending config rendered-master-558df6f241eed55f48f51eae3925e390
    3724 May 07 01:18:19.500 W ns/openshift-etcd pod/etcd-member-ip-10-0-138-140.ec2.internal node/ip-10-0-138-140.ec2.internal invariant violation (bug): static pod should not transition Running->Pending with same UID
    3725 May 07 01:18:19.500 W ns/openshift-etcd pod/etcd-member-ip-10-0-138-140.ec2.internal node/ip-10-0-138-140.ec2.internal container=etcd-metrics container stopped being ready
    3726 May 07 01:18:19.500 W ns/openshift-etcd pod/etcd-member-ip-10-0-138-140.ec2.internal node/ip-10-0-138-140.ec2.internal container=etcd-member container stopped being ready
    3727 May 07 01:18:19.520 I node/ip-10-0-138-140.ec2.internal Draining node to update config.
    3728 May 07 01:18:19.531 I node/ip-10-0-144-80.ec2.internal Draining node to update config.
    3729 May 07 01:18:19.541 W ns/openshift-etcd pod/etcd-member-ip-10-0-138-140.ec2.internal node/ip-10-0-138-140.ec2.internal graceful deletion within 0s
    3730 May 07 01:18:19.711 W ns/openshift-etcd pod/etcd-member-ip-10-0-138-140.ec2.internal node/ip-10-0-138-140.ec2.internal deleted
    3731 May 07 01:18:19.803 I ns/openshift-etcd pod/etcd-member-ip-10-0-138-140.ec2.internal node/ip-10-0-138-140.ec2.internal created
...
    4698 May 07 01:21:39.870 I ns/openshift-machine-config-operator pod/etcd-quorum-guard-855976c44d-zxscn Started container guard
    4699 May 07 01:21:39.907 I node/ip-10-0-138-140.ec2.internal Node ip-10-0-138-140.ec2.internal status is now: NodeSchedulable
    4700 May 07 01:21:40.254 W ns/openshift-apiserver pod/apiserver-j99xp node/ip-10-0-138-140.ec2.internal container=openshift-apiserver container restarted
    4701 May 07 01:21:40.254 W ns/openshift-apiserver pod/apiserver-j99xp node/ip-10-0-138-140.ec2.internal container=fix-audit-permissions init container restarted
    4702 May 07 01:21:40.655 W ns/openshift-sdn pod/sdn-mndsk node/ip-10-0-138-140.ec2.internal container=sdn container restarted
    4703 May 07 01:21:40.678 W clusteroperator/network changed Progressing to False
    4704 May 07 01:21:41.052 W ns/openshift-cluster-node-tuning-operator pod/tuned-p8j7k node/ip-10-0-138-140.ec2.internal container=tuned container restarted
    4705 May 07 01:21:46.256 W ns/openshift-etcd pod/etcd-member-ip-10-0-138-140.ec2.internal node/ip-10-0-138-140.ec2.internal invariant violation (bug): static pod should not transition Running->Pending with same UID
    4706 May 07 01:21:46.256 E ns/openshift-etcd pod/etcd-member-ip-10-0-138-140.ec2.internal node/ip-10-0-138-140.ec2.internal container=etcd-member container exited with code 255 (Error): 1\n2019-05-07 01:18:26.642041 W | pkg/flags: unrecognized environment variable ETCD_SERVICE_PORT=2379\n2019-05-07 01:18:26.642047 W | pkg/flags: unrecognized environment variable ETCD_PORT_9979_TCP_PROTO=tcp\n2019-05-07 01:18:26.642052 W | pkg/flags: unrecognized environment variable ETCD_PORT_2379_TCP=tcp://172.30.175.201:2379\n2019-05-07 01:18:26.642059 W | pkg/flags: unrecognized environment variable ETCD_PORT=tcp://172.30.175.201:2379\n2019-05-07 01:18:26.642065 W | pkg/flags: unrecognized environment variable ETCD_SERVICE_HOST=172.30.175.201\n2019-05-07 01:18:26.642115 I | etcdmain: etcd Version: 3.3.10\n2019-05-07 01:18:26.642132 I | etcdmain: Git SHA: 46fe287\n2019-05-07 01:18:26.642139 I | etcdmain: Go Version: go1.10.8\n2019-05-07 01:18:26.642143 I | etcdmain: Go OS/Arch: linux/amd64\n2019-05-07 01:18:26.642149 I | etcdmain: setting maximum number of CPUs to 4, total number of available CPUs is 4\n2019-05-07 01:18:26.642234 N | etcdmain: the server is already initialized as member before, starting as etcd member...\n2019-05-07 01:18:26.642288 I | embed: peerTLS: cert = /etc/ssl/etcd/system:etcd-peer:etcd-0.ci-op-si6q4q2f-6d8c6.origin-ci-int-aws.dev.rhcloud.com.crt, key = /etc/ssl/etcd/system:etcd-peer:etcd-0.ci-op-si6q4q2f-6d8c6.origin-ci-int-aws.dev.rhcloud.com.key, ca = , trusted-ca = /etc/ssl/etcd/ca.crt, client-cert-auth = true, crl-file = \n2019-05-07 01:18:26.643246 I | embed: listening for peers on https://0.0.0.0:2380\n2019-05-07 01:18:26.643310 I | embed: listening for client requests on 0.0.0.0:2379\n2019-05-07 01:18:26.692068 I | etcdserver: name = etcd-member-ip-10-0-138-140.ec2.internal\n2019-05-07 01:18:26.692100 I | etcdserver: data dir = /var/lib/etcd\n2019-05-07 01:18:26.692109 I | etcdserver: member dir = /var/lib/etcd/member\n2019-05-07 01:18:26.692115 I | etcdserver: heartbeat = 100ms\n2019-05-07 01:18:26.692120 I | etcdserver: election = 1000ms\n2019-05-07 01:18:26.692126 I | etcdserver: snapshot count = 100000\n2019-05-07 01:18:26.692145 I | etcdserver: advertise client URLs = https://10.0.138.140:2379\n
    4707 May 07 01:21:46.256 W ns/openshift-etcd pod/etcd-member-ip-10-0-138-140.ec2.internal node/ip-10-0-138-140.ec2.internal container=etcd-member container restarted
    4708 May 07 01:21:46.256 W ns/openshift-etcd pod/etcd-member-ip-10-0-138-140.ec2.internal node/ip-10-0-138-140.ec2.internal container=etcd-metrics container stopped being ready
    4709 May 07 01:21:46.256 W ns/openshift-etcd pod/etcd-member-ip-10-0-138-140.ec2.internal node/ip-10-0-138-140.ec2.internal container=discovery init container restarted
...
    4714 May 07 01:21:48.228 I ns/openshift-etcd pod/etcd-member-ip-10-0-166-46.ec2.internal Stopping container etcd-metrics
    4715 May 07 01:21:48.234 I ns/openshift-etcd pod/etcd-member-ip-10-0-166-46.ec2.internal Stopping container etcd-member
    4716 May 07 01:21:48.307 W ns/openshift-etcd pod/etcd-member-ip-10-0-166-46.ec2.internal node/ip-10-0-166-46.ec2.internal invariant violation (bug): static pod should not transition Running->Pending with same UID
    4717 May 07 01:21:48.307 W ns/openshift-etcd pod/etcd-member-ip-10-0-166-46.ec2.internal node/ip-10-0-166-46.ec2.internal container=etcd-metrics container stopped being ready
    4718 May 07 01:21:48.307 W ns/openshift-etcd pod/etcd-member-ip-10-0-166-46.ec2.internal node/ip-10-0-166-46.ec2.internal container=etcd-member container stopped being ready
    4719 May 07 01:21:48.363 I node/ip-10-0-166-46.ec2.internal Written pending config rendered-master-558df6f241eed55f48f51eae3925e390
    4720 May 07 01:21:48.645 I node/ip-10-0-166-46.ec2.internal Draining node to update config.
    4721 May 07 01:21:48.650 E ns/openshift-kube-apiserver pod/kube-apiserver-ip-10-0-138-140.ec2.internal node/ip-10-0-138-140.ec2.internal container=kube-apiserver-8 container exited with code 255 (Error):    1 reflector.go:256] k8s.io/client-go/informers/factory.go:132: watch of *v1.PersistentVolume ended with: too old resource version: 19012 (30688)\nW0507 01:18:20.906806       1 cacher.go:125] Terminating all watchers from cacher *unstructured.Unstructured\nW0507 01:18:20.906929       1 cacher.go:125] Terminating all watchers from cacher *apiregistration.APIService\nW0507 01:18:20.940844       1 reflector.go:256] k8s.io/kube-aggregator/pkg/client/informers/internalversion/factory.go:117: watch of *apiregistration.APIService ended with: too old resource version: 24253 (30694)\nI0507 01:18:22.947284       1 controller.go:107] OpenAPI AggregationController: Processing item v1beta1.metrics.k8s.io\nE0507 01:18:22.950294       1 controller.go:114] loading OpenAPI spec for "v1beta1.metrics.k8s.io" failed with: OpenAPI spec does not exist\nI0507 01:18:22.950322       1 controller.go:127] OpenAPI AggregationController: action for item v1beta1.metrics.k8s.io: Rate Limited Requeue.\nI0507 01:18:23.486196       1 controller.go:107] OpenAPI AggregationController: Processing item v1.packages.operators.coreos.com\nI0507 01:18:23.489955       1 controller.go:107] OpenAPI AggregationController: Processing item v1.template.openshift.io\nI0507 01:18:23.496062       1 controller.go:107] OpenAPI AggregationController: Processing item v1.user.openshift.io\nI0507 01:18:23.498399       1 controller.go:107] OpenAPI AggregationController: Processing item v1.image.openshift.io\nI0507 01:18:26.705631       1 log.go:172] httputil: ReverseProxy read error during body copy: unexpected EOF\nI0507 01:18:26.705671       1 log.go:172] suppressing panic for copyResponse error in test; copy error: unexpected EOF\nE0507 01:18:26.708593       1 reflector.go:237] github.com/openshift/client-go/oauth/informers/externalversions/factory.go:101: Failed to watch *v1.OAuthClient: the server is currently unable to handle the request (get oauthclients.oauth.openshift.io)\nI0507 01:18:27.152057       1 controller.go:176] Shutting down kubernetes service endpoint reconciler\n
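
For illustration, the fix direction this analysis points at is to never start draining a second master while another master is still unavailable. Below is a rough sketch of that gating logic only; the types and names are made up for the example and this is not the actual MCO code:

	package main

	import "fmt"

	// node is a simplified stand-in for a master node's update state
	// (illustrative only; not the MCO's real types).
	type node struct {
		name    string
		ready   bool // kubelet ready and etcd member healthy again
		updated bool // already on the desired rendered config
	}

	// candidatesToDrain returns the nodes that may be drained now, capped so
	// the number of unavailable masters never exceeds maxUnavailable (1 for
	// the master pool).
	func candidatesToDrain(nodes []node, maxUnavailable int) []node {
		unavailable := 0
		for _, n := range nodes {
			if !n.ready {
				unavailable++
			}
		}
		budget := maxUnavailable - unavailable
		if budget <= 0 {
			// A master is still recovering (e.g. etcd not healthy yet):
			// do not start draining another one.
			return nil
		}
		var out []node
		for _, n := range nodes {
			if n.ready && !n.updated && len(out) < budget {
				out = append(out, n)
			}
		}
		return out
	}

	func main() {
		masters := []node{
			{name: "ip-10-0-138-140.ec2.internal", ready: false, updated: true}, // etcd still catching up
			{name: "ip-10-0-144-80.ec2.internal", ready: true, updated: false},
			{name: "ip-10-0-166-46.ec2.internal", ready: true, updated: false},
		}
		fmt.Println(candidatesToDrain(masters, 1)) // prints []; the second drain must wait
	}

In the scenario from the log above, the first master is updated but its etcd member is not healthy yet, so the budget is exhausted and no second drain should begin.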

Comment 7 Colin Walters 2019-05-08 02:41:04 UTC
https://github.com/openshift/machine-config-operator/pull/701
is working on this bug.

Comment 8 Ian McLeod 2019-05-08 20:45:11 UTC
Eric just asked in a video chat if this is actually an example of quorum guard in the MCD behaving as expected.

Antonio, thoughts?

Comment 9 Antonio Murdaca 2019-05-08 20:48:40 UTC
(In reply to Robert Krawitz from comment #6)
> This indicates that two nodes were unhealthy simultaneously, which is
> exactly what the eqg is intended to prevent.
> 

Ian, yeah, as Robert noted above, this is the exact scope of the eqg. The bug we have is that we're indeed trying to bring down 2 masters, and the eqg prevents that correctly.
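
As I understand the guard's design, the blocking comes from a PodDisruptionBudget over the guard pods: with one guard pod per master and at most one allowed to be unavailable, the eviction of a second guard pod (and therefore the second drain) is refused while the first master's guard is still unready. A minimal sketch of such a PDB follows; the object name, namespace, and selector label are assumptions for illustration, not necessarily the literal manifests the MCO ships:

	package main

	import (
		"fmt"

		policyv1 "k8s.io/api/policy/v1"
		metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
		"k8s.io/apimachinery/pkg/util/intstr"
		"sigs.k8s.io/yaml"
	)

	// quorumGuardPDB sketches the kind of PodDisruptionBudget that makes the
	// quorum guard block a second master drain: with three guard pods (one per
	// master) and maxUnavailable=1, evicting a guard pod from another master
	// is refused while one guard pod is already unready.
	func quorumGuardPDB() *policyv1.PodDisruptionBudget {
		maxUnavailable := intstr.FromInt(1)
		return &policyv1.PodDisruptionBudget{
			ObjectMeta: metav1.ObjectMeta{
				Name:      "etcd-quorum-guard",                  // assumed name
				Namespace: "openshift-machine-config-operator",  // assumed namespace
			},
			Spec: policyv1.PodDisruptionBudgetSpec{
				MaxUnavailable: &maxUnavailable,
				Selector: &metav1.LabelSelector{
					MatchLabels: map[string]string{"name": "etcd-quorum-guard"}, // assumed label
				},
			},
		}
	}

	func main() {
		out, _ := yaml.Marshal(quorumGuardPDB())
		fmt.Println(string(out))
	}

Because a not-ready pod already counts against the budget, `disruptionsAllowed` drops to 0 as soon as one master's guard is unhealthy, and the eviction API rejects draining the next master's guard pod.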

Comment 11 Micah Abbott 2019-05-16 19:22:35 UTC
CI results from the last 7 days show two instances of failed jobs with the message "syncRequiredMachineConfigPools: error pool master is not ready", but they occurred during the initial install of the cluster, not during an upgrade (as originally reported).

https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/757/pull-ci-openshift-machine-config-operator-master-e2e-aws/3671

https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/757/pull-ci-openshift-machine-config-operator-master-e2e-aws-op/2157

These are genuine failures that were reported against a pending PR (see: https://github.com/openshift/machine-config-operator/pull/757#discussion_r284467120), so they are not indicative of problems with the fix to this issue.



On a cluster installed from 4.1.0-0.nightly-2019-05-16-075717, I performed a set of steps similar to the e2e test (https://github.com/openshift/machine-config-operator/blob/master/test/e2e/mcd_test.go#L332); a programmatic sketch of the first step follows the status output below.

1.  create invalid MC + apply
2.  observe a node try to apply new MC
3.  check for annotation on node:

Annotations:        machine.openshift.io/machine: openshift-machine-api/miabbott-testing-rm4v2-worker-us-west-2a-fj58c                                                                                               
                    machineconfiguration.openshift.io/currentConfig: rendered-worker-e1f197ad57a898cf77ca1b9a10b8d969                                                                                                
                    machineconfiguration.openshift.io/desiredConfig: rendered-worker-f81e92cf86b63044b0df35d7b6341c37                                                                                                
                    machineconfiguration.openshift.io/reason:                                                                                                                                                        
                      can't reconcile config rendered-worker-e1f197ad57a898cf77ca1b9a10b8d969 with rendered-worker-f81e92cf86b63044b0df35d7b6341c37: invalid Ign...                                                  
                    machineconfiguration.openshift.io/state: Unreconcilable                                                                                                                

4.  observe MCP marked as degraded

$ oc get machineconfigpool/worker                                                                                
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED                                                                                                                            
worker   rendered-worker-e1f197ad57a898cf77ca1b9a10b8d969   False     True       True                    

5.  delete the invalid rendered MC and watch nodes/MCP recover

Status:                                                                                                                                                                                                             
  Conditions:                                                                                                                                                                                                       
    Last Transition Time:  2019-05-16T16:03:12Z                                                                                                                                                                     
    Message:                                                                                                                                                                                                        
    Reason:                                                                                          
    Status:                False                                                                     
    Type:                  RenderDegraded                                                                                                                                                                           
    Last Transition Time:  2019-05-16T19:10:39Z                                                                                                                                                                     
    Message:                                                                                                                                                                                                        
    Reason:                All nodes are updated with rendered-worker-e1f197ad57a898cf77ca1b9a10b8d969
    Status:                True                                                                                                                                                                                     
    Type:                  Updated                                                                                                                                                                                  
    Last Transition Time:  2019-05-16T19:10:39Z                                          
    Message:                                                                          
    Reason:                                                                                          
    Status:                False
    Type:                  Updating
    Last Transition Time:  2019-05-16T19:10:39Z
    Message:
    Reason:
    Status:                False
    Type:                  NodeDegraded
    Last Transition Time:  2019-05-16T19:10:39Z
    Message:
    Reason:
    Status:                False
    Type:                  Degraded
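
For anyone repeating this, here is a minimal sketch of the first step ("create invalid MC + apply") using the dynamic client; the MachineConfig name and the deliberately unsupported Ignition version are placeholders, and applying an equivalent YAML with `oc create -f` achieves the same thing:

	package main

	import (
		"context"
		"log"

		metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
		"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
		"k8s.io/apimachinery/pkg/runtime/schema"
		"k8s.io/client-go/dynamic"
		"k8s.io/client-go/tools/clientcmd"
	)

	func main() {
		// Load the usual kubeconfig (the same credentials `oc` uses).
		cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
		if err != nil {
			log.Fatal(err)
		}
		client, err := dynamic.NewForConfig(cfg)
		if err != nil {
			log.Fatal(err)
		}

		mcGVR := schema.GroupVersionResource{
			Group:    "machineconfiguration.openshift.io",
			Version:  "v1",
			Resource: "machineconfigs",
		}

		// A worker MachineConfig whose embedded Ignition config declares a
		// version the MCD cannot reconcile; name and version are placeholders.
		invalidMC := &unstructured.Unstructured{Object: map[string]interface{}{
			"apiVersion": "machineconfiguration.openshift.io/v1",
			"kind":       "MachineConfig",
			"metadata": map[string]interface{}{
				"name": "99-worker-invalid-ign-test",
				"labels": map[string]interface{}{
					"machineconfiguration.openshift.io/role": "worker",
				},
			},
			"spec": map[string]interface{}{
				"config": map[string]interface{}{
					"ignition": map[string]interface{}{"version": "99.99.0"},
				},
			},
		}}

		if _, err := client.Resource(mcGVR).Create(context.TODO(), invalidMC, metav1.CreateOptions{}); err != nil {
			log.Fatal(err)
		}
		// The worker pool should now report DEGRADED (step 4); deleting this
		// MachineConfig afterwards lets the pool recover (step 5).
	}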


Marking VERIFIED

Comment 13 errata-xmlrpc 2019-06-04 10:48:34 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758