Description of problem:

OCP upgrade failed waiting for etcd-quorum-guard:

May 07 01:42:38.269 E clusterversion/version changed Failing to True: UpdatePayloadFailed: Could not update deployment "openshift-machine-config-operator/etcd-quorum-guard" (315 of 350)

May 07 01:48:15.472 E clusteroperator/machine-config changed Degraded to True: timed out waiting for the condition during syncRequiredMachineConfigPools: error pool master is not ready, retrying. Status: (pool degraded: true total: 3, ready 1, updated: 1, unavailable: 1): Failed to resync 0.0.1-2019-05-07-003443 because: timed out waiting for the condition during syncRequiredMachineConfigPools: error pool master is not ready, retrying. Status: (pool degraded: true total: 3, ready 1, updated: 1, unavailable: 1)

How reproducible:
Test flake.

Steps to Reproduce:
Seen in the e2e-aws-upgrade job.

Actual results:
Upgrade timed out/failed.

Expected results:
Upgrade succeeds.

Additional info:
Not clear who owns the quorum guard; feel free to reassign to the MCO team if appropriate.
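For anyone triaging a similar flake while the cluster is still reachable, a rough sketch of what to look at (standard oc commands; nothing below was pulled from this job's artifacts, and the namespace/deployment names are the ones shipped in 4.1):

# Overall upgrade and operator state
$ oc get clusterversion
$ oc get clusteroperator machine-config

# Which master is stuck, and what the MCD thinks about it
$ oc get machineconfigpool master -o yaml
$ oc get nodes -l node-role.kubernetes.io/master=
$ oc describe node <stuck-master> | grep machineconfiguration.openshift.io

# The guard deployment the CVO could not update, and the etcd members themselves
$ oc -n openshift-machine-config-operator get deployment etcd-quorum-guard
$ oc -n openshift-etcd get pods -o wide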
Forgot to link the failing job: https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_console-operator/226/pull-ci-openshift-console-operator-master-e2e-aws-upgrade/76

Looks like there may also have been networking issues and MCO node sync issues, which are more likely the root cause here.
Similar looking failure: https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_cluster-version-operator/182/pull-ci-openshift-cluster-version-operator-master-e2e-aws-upgrade/56

Moving to MCO for now as a more likely target, but networking also looks suspect.
This is likely https://bugzilla.redhat.com/show_bug.cgi?id=1706606, but this one mentions the quorum guard specifically; I've cc'ed Robert to take a look.
This one has a crio mismatch https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_console-operator/226/pull-ci-openshift-console-operator-master-e2e-aws-upgrade/76/artifacts/e2e-aws-upgrade/must-gather/namespaces/openshift-machine-config-operator/pods/machine-config-daemon-7gvcc/machine-config-daemon/machine-config-daemon/logs/current.log
I say we leave this on the list until we solve https://bugzilla.redhat.com/show_bug.cgi?id=1706606, then re-assess. The MCO team should stay focused on that bug.
This indicates that two nodes were unhealthy simultaneously, which is exactly what the eqg is intended to prevent.

Looking at https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_cluster-version-operator/182/pull-ci-openshift-cluster-version-operator-master-e2e-aws-upgrade/56/artifacts/e2e-aws-upgrade/e2e.log, I see that one master is being drained, and later a second master is drained before the first one is healthy again. The first node was marked schedulable immediately before the second one was drained, but `etcd` was not healthy on the first node; killing off the second node would have led to quorum failure.

-------------------------------------

May 07 01:18:19.496 I node/ip-10-0-138-140.ec2.internal Written pending config rendered-master-558df6f241eed55f48f51eae3925e390
May 07 01:18:19.500 W ns/openshift-etcd pod/etcd-member-ip-10-0-138-140.ec2.internal node/ip-10-0-138-140.ec2.internal invariant violation (bug): static pod should not transition Running->Pending with same UID
May 07 01:18:19.500 W ns/openshift-etcd pod/etcd-member-ip-10-0-138-140.ec2.internal node/ip-10-0-138-140.ec2.internal container=etcd-metrics container stopped being ready
May 07 01:18:19.500 W ns/openshift-etcd pod/etcd-member-ip-10-0-138-140.ec2.internal node/ip-10-0-138-140.ec2.internal container=etcd-member container stopped being ready
May 07 01:18:19.520 I node/ip-10-0-138-140.ec2.internal Draining node to update config.
May 07 01:18:19.531 I node/ip-10-0-144-80.ec2.internal Draining node to update config.
May 07 01:18:19.541 W ns/openshift-etcd pod/etcd-member-ip-10-0-138-140.ec2.internal node/ip-10-0-138-140.ec2.internal graceful deletion within 0s
May 07 01:18:19.711 W ns/openshift-etcd pod/etcd-member-ip-10-0-138-140.ec2.internal node/ip-10-0-138-140.ec2.internal deleted
May 07 01:18:19.803 I ns/openshift-etcd pod/etcd-member-ip-10-0-138-140.ec2.internal node/ip-10-0-138-140.ec2.internal created

...

May 07 01:21:39.870 I ns/openshift-machine-config-operator pod/etcd-quorum-guard-855976c44d-zxscn Started container guard
May 07 01:21:39.907 I node/ip-10-0-138-140.ec2.internal Node ip-10-0-138-140.ec2.internal status is now: NodeSchedulable
May 07 01:21:40.254 W ns/openshift-apiserver pod/apiserver-j99xp node/ip-10-0-138-140.ec2.internal container=openshift-apiserver container restarted
May 07 01:21:40.254 W ns/openshift-apiserver pod/apiserver-j99xp node/ip-10-0-138-140.ec2.internal container=fix-audit-permissions init container restarted
May 07 01:21:40.655 W ns/openshift-sdn pod/sdn-mndsk node/ip-10-0-138-140.ec2.internal container=sdn container restarted
May 07 01:21:40.678 W clusteroperator/network changed Progressing to False
May 07 01:21:41.052 W ns/openshift-cluster-node-tuning-operator pod/tuned-p8j7k node/ip-10-0-138-140.ec2.internal container=tuned container restarted
May 07 01:21:46.256 W ns/openshift-etcd pod/etcd-member-ip-10-0-138-140.ec2.internal node/ip-10-0-138-140.ec2.internal invariant violation (bug): static pod should not transition Running->Pending with same UID
May 07 01:21:46.256 E ns/openshift-etcd pod/etcd-member-ip-10-0-138-140.ec2.internal node/ip-10-0-138-140.ec2.internal container=etcd-member container exited with code 255 (Error): 1\n2019-05-07 01:18:26.642041 W | pkg/flags: unrecognized environment variable ETCD_SERVICE_PORT=2379\n2019-05-07 01:18:26.642047 W | pkg/flags: unrecognized environment variable ETCD_PORT_9979_TCP_PROTO=tcp\n2019-05-07 01:18:26.642052 W | pkg/flags: unrecognized environment variable ETCD_PORT_2379_TCP=tcp://172.30.175.201:2379\n2019-05-07 01:18:26.642059 W | pkg/flags: unrecognized environment variable ETCD_PORT=tcp://172.30.175.201:2379\n2019-05-07 01:18:26.642065 W | pkg/flags: unrecognized environment variable ETCD_SERVICE_HOST=172.30.175.201\n2019-05-07 01:18:26.642115 I | etcdmain: etcd Version: 3.3.10\n2019-05-07 01:18:26.642132 I | etcdmain: Git SHA: 46fe287\n2019-05-07 01:18:26.642139 I | etcdmain: Go Version: go1.10.8\n2019-05-07 01:18:26.642143 I | etcdmain: Go OS/Arch: linux/amd64\n2019-05-07 01:18:26.642149 I | etcdmain: setting maximum number of CPUs to 4, total number of available CPUs is 4\n2019-05-07 01:18:26.642234 N | etcdmain: the server is already initialized as member before, starting as etcd member...\n2019-05-07 01:18:26.642288 I | embed: peerTLS: cert = /etc/ssl/etcd/system:etcd-peer:etcd-0.ci-op-si6q4q2f-6d8c6.origin-ci-int-aws.dev.rhcloud.com.crt, key = /etc/ssl/etcd/system:etcd-peer:etcd-0.ci-op-si6q4q2f-6d8c6.origin-ci-int-aws.dev.rhcloud.com.key, ca = , trusted-ca = /etc/ssl/etcd/ca.crt, client-cert-auth = true, crl-file = \n2019-05-07 01:18:26.643246 I | embed: listening for peers on https://0.0.0.0:2380\n2019-05-07 01:18:26.643310 I | embed: listening for client requests on 0.0.0.0:2379\n2019-05-07 01:18:26.692068 I | etcdserver: name = etcd-member-ip-10-0-138-140.ec2.internal\n2019-05-07 01:18:26.692100 I | etcdserver: data dir = /var/lib/etcd\n2019-05-07 01:18:26.692109 I | etcdserver: member dir = /var/lib/etcd/member\n2019-05-07 01:18:26.692115 I | etcdserver: heartbeat = 100ms\n2019-05-07 01:18:26.692120 I | etcdserver: election = 1000ms\n2019-05-07 01:18:26.692126 I | etcdserver: snapshot count = 100000\n2019-05-07 01:18:26.692145 I | etcdserver: advertise client URLs = https://10.0.138.140:2379\n
May 07 01:21:46.256 W ns/openshift-etcd pod/etcd-member-ip-10-0-138-140.ec2.internal node/ip-10-0-138-140.ec2.internal container=etcd-member container restarted
May 07 01:21:46.256 W ns/openshift-etcd pod/etcd-member-ip-10-0-138-140.ec2.internal node/ip-10-0-138-140.ec2.internal container=etcd-metrics container stopped being ready
May 07 01:21:46.256 W ns/openshift-etcd pod/etcd-member-ip-10-0-138-140.ec2.internal node/ip-10-0-138-140.ec2.internal container=discovery init container restarted

...

May 07 01:21:48.228 I ns/openshift-etcd pod/etcd-member-ip-10-0-166-46.ec2.internal Stopping container etcd-metrics
May 07 01:21:48.234 I ns/openshift-etcd pod/etcd-member-ip-10-0-166-46.ec2.internal Stopping container etcd-member
May 07 01:21:48.307 W ns/openshift-etcd pod/etcd-member-ip-10-0-166-46.ec2.internal node/ip-10-0-166-46.ec2.internal invariant violation (bug): static pod should not transition Running->Pending with same UID
May 07 01:21:48.307 W ns/openshift-etcd pod/etcd-member-ip-10-0-166-46.ec2.internal node/ip-10-0-166-46.ec2.internal container=etcd-metrics container stopped being ready
May 07 01:21:48.307 W ns/openshift-etcd pod/etcd-member-ip-10-0-166-46.ec2.internal node/ip-10-0-166-46.ec2.internal container=etcd-member container stopped being ready
May 07 01:21:48.363 I node/ip-10-0-166-46.ec2.internal Written pending config rendered-master-558df6f241eed55f48f51eae3925e390
May 07 01:21:48.645 I node/ip-10-0-166-46.ec2.internal Draining node to update config.
May 07 01:21:48.650 E ns/openshift-kube-apiserver pod/kube-apiserver-ip-10-0-138-140.ec2.internal node/ip-10-0-138-140.ec2.internal container=kube-apiserver-8 container exited with code 255 (Error): 1 reflector.go:256] k8s.io/client-go/informers/factory.go:132: watch of *v1.PersistentVolume ended with: too old resource version: 19012 (30688)\nW0507 01:18:20.906806 1 cacher.go:125] Terminating all watchers from cacher *unstructured.Unstructured\nW0507 01:18:20.906929 1 cacher.go:125] Terminating all watchers from cacher *apiregistration.APIService\nW0507 01:18:20.940844 1 reflector.go:256] k8s.io/kube-aggregator/pkg/client/informers/internalversion/factory.go:117: watch of *apiregistration.APIService ended with: too old resource version: 24253 (30694)\nI0507 01:18:22.947284 1 controller.go:107] OpenAPI AggregationController: Processing item v1beta1.metrics.k8s.io\nE0507 01:18:22.950294 1 controller.go:114] loading OpenAPI spec for "v1beta1.metrics.k8s.io" failed with: OpenAPI spec does not exist\nI0507 01:18:22.950322 1 controller.go:127] OpenAPI AggregationController: action for item v1beta1.metrics.k8s.io: Rate Limited Requeue.\nI0507 01:18:23.486196 1 controller.go:107] OpenAPI AggregationController: Processing item v1.packages.operators.coreos.com\nI0507 01:18:23.489955 1 controller.go:107] OpenAPI AggregationController: Processing item v1.template.openshift.io\nI0507 01:18:23.496062 1 controller.go:107] OpenAPI AggregationController: Processing item v1.user.openshift.io\nI0507 01:18:23.498399 1 controller.go:107] OpenAPI AggregationController: Processing item v1.image.openshift.io\nI0507 01:18:26.705631 1 log.go:172] httputil: ReverseProxy read error during body copy: unexpected EOF\nI0507 01:18:26.705671 1 log.go:172] suppressing panic for copyResponse error in test; copy error: unexpected EOF\nE0507 01:18:26.708593 1 reflector.go:237] github.com/openshift/client-go/oauth/informers/externalversions/factory.go:101: Failed to watch *v1.OAuthClient: the server is currently unable to handle the request (get oauthclients.oauth.openshift.io)\nI0507 01:18:27.152057 1 controller.go:176] Shutting down kubernetes service endpoint reconciler\n
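If it helps anyone confirming the same pattern on a live cluster, a quick sketch of what to watch during the master pool update. This assumes the quorum guard's PodDisruptionBudget lives next to the deployment in openshift-machine-config-operator and only tolerates one unavailable replica; that is an assumption here, not something taken from the job artifacts:

# Only one master should be cordoned/draining at any point in time
$ oc get nodes -l node-role.kubernetes.io/master=

# With one guard pod already unready, the PDB should report 0 allowed
# disruptions, which is what should block a second drain
$ oc -n openshift-machine-config-operator get deployment etcd-quorum-guard
$ oc -n openshift-machine-config-operator get poddisruptionbudget

# The etcd members on the other two masters should stay Running/ready
$ oc -n openshift-etcd get pods -o wide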
https://github.com/openshift/machine-config-operator/pull/701 is working on this bug.
Eric just asked in a video chat if this is actually an example of quorum guard in the MCD behaving as expected. Antonio, thoughts?
(In reply to Robert Krawitz from comment #6)
> This indicates that two nodes were unhealthy simultaneously, which is
> exactly what the eqg is intended to prevent.

Ian, yeah, as Robert noted above, this is exactly the scope of the eqg. The bug we have is that we're indeed trying to bring down 2 masters, and the eqg correctly prevents that.
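For future reference, a quick way to watch for that bad pattern (two masters unavailable at once) while an upgrade runs; the field names below come from the MachineConfigPool status and are not specific to this job:

# unavailableMachineCount on the master pool should never exceed 1
$ oc get machineconfigpool master -o jsonpath='{.status.machineCount} {.status.updatedMachineCount} {.status.readyMachineCount} {.status.unavailableMachineCount}{"\n"}'

# or just poll the summary columns during the upgrade
$ oc get machineconfigpool master -w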
CI results in the last 7d find two instances of failed jobs with the message "syncRequiredMachineConfigPools: error pool master is not ready", but they were during the initial install of the cluster, not during an upgrade (as originally reported).

https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/757/pull-ci-openshift-machine-config-operator-master-e2e-aws/3671
https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/757/pull-ci-openshift-machine-config-operator-master-e2e-aws-op/2157

These are genuine failures that were reported against a pending PR (see: https://github.com/openshift/machine-config-operator/pull/757#discussion_r284467120), so they are not indicative of problems with the fix to this issue.

On a cluster installed from 4.1.0-0.nightly-2019-05-16-075717, I performed a similar set of steps as the e2e test (https://github.com/openshift/machine-config-operator/blob/master/test/e2e/mcd_test.go#L332):

1. Create an invalid MC and apply it.

2. Observe a node try to apply the new MC.

3. Check for the annotation on the node:

Annotations:  machine.openshift.io/machine: openshift-machine-api/miabbott-testing-rm4v2-worker-us-west-2a-fj58c
              machineconfiguration.openshift.io/currentConfig: rendered-worker-e1f197ad57a898cf77ca1b9a10b8d969
              machineconfiguration.openshift.io/desiredConfig: rendered-worker-f81e92cf86b63044b0df35d7b6341c37
              machineconfiguration.openshift.io/reason: can't reconcile config rendered-worker-e1f197ad57a898cf77ca1b9a10b8d969 with rendered-worker-f81e92cf86b63044b0df35d7b6341c37: invalid Ign...
              machineconfiguration.openshift.io/state: Unreconcilable

4. Observe the MCP marked as degraded:

$ oc get machineconfigpool/worker
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED
worker   rendered-worker-e1f197ad57a898cf77ca1b9a10b8d969   False     True       True

5. Delete the invalid rendered MC and watch the nodes/MCP recover:

Status:
  Conditions:
    Last Transition Time:  2019-05-16T16:03:12Z
    Message:
    Reason:
    Status:                False
    Type:                  RenderDegraded
    Last Transition Time:  2019-05-16T19:10:39Z
    Message:
    Reason:                All nodes are updated with rendered-worker-e1f197ad57a898cf77ca1b9a10b8d969
    Status:                True
    Type:                  Updated
    Last Transition Time:  2019-05-16T19:10:39Z
    Message:
    Reason:
    Status:                False
    Type:                  Updating
    Last Transition Time:  2019-05-16T19:10:39Z
    Message:
    Reason:
    Status:                False
    Type:                  NodeDegraded
    Last Transition Time:  2019-05-16T19:10:39Z
    Message:
    Reason:
    Status:                False
    Type:                  Degraded

Marking VERIFIED
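In case anyone wants to repeat the verification above, roughly the commands involved. The MachineConfig below is only an illustration of an unreconcilable change (a storage/disks edit, which the MCD should refuse to reconcile), not the exact config used by the e2e test, so treat it as a sketch:

$ cat <<EOF | oc create -f -
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-worker-unreconcilable-test
  labels:
    machineconfiguration.openshift.io/role: worker
spec:
  config:
    ignition:
      version: 2.2.0
    storage:
      disks:
        - device: /dev/fake0
EOF

# Watch a worker go Unreconcilable, then the pool go Degraded
$ oc describe node <worker-node> | grep machineconfiguration.openshift.io
$ oc get machineconfigpool/worker

# Clean up (the verification above deleted the invalid rendered config;
# deleting the source MC should also let the pool re-render and recover)
$ oc delete machineconfig 99-worker-unreconcilable-test
$ oc get machineconfigpool/worker -w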
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:0758