Bug 1950471
| Summary: | e2e-metal-ipi-ovn-dualstack is failing with etcd unable to bootstrap | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Stephen Benjamin <stbenjam> |
| Component: | Networking | Assignee: | Federico Paolinelli <fpaoline> |
| Networking sub component: | ovn-kubernetes | QA Contact: | Anurag saxena <anusaxen> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | urgent | | |
| Priority: | urgent | CC: | achernet, aconstan, astoycos, augol, danw, keyoung, pablo.iranzo, pibanezr, rfreiman, saledort, sgreene, ydalal, ykashtan, yprokule, ysegev, zzhao |
| Version: | 4.8 | Keywords: | AutomationBlocker, TestBlocker |
| Target Milestone: | --- | | |
| Target Release: | 4.8.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-07-27 23:01:42 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1831748 | | |
cluster-etcd-operator claims it cannot connect to any etcds. Trying to connect to etcd on .20 gives "connection refused":
2021-04-17T19:03:50.542258872Z WARNING: 2021/04/17 19:03:50 grpc: addrConn.createTransport failed to connect to {https://192.168.111.20:2379 <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp 192.168.111.20:2379: connect: connection refused". Reconnecting...
which seems to be because etcd on .20 is failing to start:
2021-04-17T19:23:50.055477614Z #### attempt 9
2021-04-17T19:23:50.057339651Z member={name="master-2.ostest.test.metalkube.org", peerURLs=[https://192.168.111.22:2380}, clientURLs=[https://192.168.111.22:2379]
2021-04-17T19:23:50.057339651Z member={name="etcd-bootstrap", peerURLs=[https://192.168.111.45:2380}, clientURLs=[https://192.168.111.45:2379]
2021-04-17T19:23:50.057339651Z member "https://192.168.111.20:2380" not found in member list, check operator logs for possible scaling problems
2021-04-17T19:23:50.057339651Z #### sleeping...
2021-04-17T19:23:51.057782760Z timed out
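One way to cross-check this on a live cluster is to ask the bootstrap etcd directly which members it has registered. A minimal sketch, assuming the bootstrap client endpoint from these logs (192.168.111.45:2379), the etcdctl sidecar container that the OpenShift etcd static pod normally carries, and that the ETCDCTL_* environment variables in that container already point at the serving certificates (all assumptions; pod and container names may differ in your environment):

# List the members known to the bootstrap etcd; master-0's peer URL
# (https://192.168.111.20:2380) should appear here once it has been added.
oc -n openshift-etcd exec -c etcdctl etcd-master-2.ostest.test.metalkube.org -- \
  etcdctl --endpoints=https://192.168.111.45:2379 member list -w table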
Trying to connect to etcd on .21 also gives connection refused
2021-04-17T19:03:50.390584234Z WARNING: 2021/04/17 19:03:50 grpc: addrConn.createTransport failed to connect to {https://192.168.111.21:2379 <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp 192.168.111.21:2379: connect: connection refused". Reconnecting...
which makes sense since the must-gather does not even show there being any etcd on .21 (so why does cluster-etcd-operator think there is one?)
Trying to connect to etcd on .22 gives "operation was canceled"
2021-04-17T19:04:00.291207509Z WARNING: 2021/04/17 19:04:00 grpc: addrConn.createTransport failed to connect to {https://192.168.111.22:2379 <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp 192.168.111.22:2379: operation was canceled". Reconnecting...
while the etcd logs on .22 show:
2021-04-17T19:04:00.399636356Z {"level":"warn","ts":"2021-04-17T19:04:00.399Z","caller":"embed/config_logging.go:279","msg":"rejected connection","remote-addr":"192.168.111.22:41532","server-name":"","error":"EOF"}
So it seems to be intentionally refusing to talk to cluster-etcd-operator, not having a network problem.
The bootstrap etcd logs show lots of
{"level":"warn","ts":"2021-04-17T19:00:15.045Z","caller":"embed/config_logging.go:279","msg":"rejected connection","remote-addr":"192.168.111.22:45472","server-name":"","error":"remote error: tls: bad certificate"}
for a while, and then later
{"level":"info","ts":"2021-04-17T19:00:26.547Z","caller":"membership/cluster.go:392","msg":"added member","cluster-id":"388387a7db949890","local-member-id":"85b84316b387be60","added-peer-id":"3c1dba9ae16d7132","added-peer-peer-urls":["https://192.168.111.22:2380"]}
...
{"level":"info","ts":"2021-04-17T19:00:26.568Z","caller":"rafthttp/transport.go:327","msg":"added remote peer","local-member-id":"85b84316b387be60","remote-peer-id":"3c1dba9ae16d7132","remote-peer-urls":["https://192.168.111.22:2380"]}
...
{"level":"warn","ts":"2021-04-17T19:00:27.734Z","caller":"etcdserver/cluster_util.go:315","msg":"failed to reach the peer URL","address":"https://192.168.111.22:2380/version","remote-member-id":"3c1dba9ae16d7132","error":"Get https://192.168.111.22:2380/version: dial tcp 192.168.111.22:2380: connect: connection refused"}
but the last error mentioning 192.168.111.22 is at 19:00:31, which is exactly when the current etcd pod on that node started. So it looks like master-2 _has_ successfully added itself to the etcd cluster.
So there does not appear to be any network-level problem here; master-0's etcd cannot start because the bootstrap etcd does not believe it should be part of the cluster, master-1 is not currently running etcd, and master-2's etcd has started and is in communication with the bootstrap etcd, but is choosing to reject connections from cluster-etcd-operator.
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-ipi-ovn-dualstack/1384044504545234944

In testing, I found that with dualstack IPv6 OVN the pod network could not communicate with hostNetwork on the same node. So if etcd-operator was scheduled on control-plane-0, it could not communicate with etcd on control-plane-0, but at the same time the etcd-operator pod on control-plane-0 could communicate with control-plane-{1,2}. I tested this by rshing into the etcd-operator container and performing curl GET requests against the etcd operand health endpoint (curl -vvv https://$IP/health -k); success for this test was an invalid-cert error, failure was a hanging timeout (see the sketch below). I then verified by killing etcd-operator so it rescheduled to another node, and the exact same pattern was repeatable on that node's local member.

This affects scaling because the operator fails health checks against the etcd member on its local node, resulting in a degraded status. If the operator is degraded, we will not scale up.

# etcd-operator on master-2 (hostIP: 192.168.111.22) failing healthcheck to etcd on master-2.
> 2021-04-19T08:22:10.406451158Z I0419 08:22:10.406419 1 event.go:282] Event(v1.ObjectReference{Kind:"Namespace", Namespace:"openshift-etcd-operator", Name:"openshift-etcd-operator", UID:"", APIVersion:"v1", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'UnhealthyEtcdMember' unhealthy members: master-2.ostest.test.metalkube.org

# master-2 etcd reports a successful healthcheck from quorum-guard a second later.
> 2021-04-19T08:22:11.157518868Z 2021-04-19 08:22:11.157469 I | etcdserver/api/etcdhttp: /health OK (status code 200)

# etcd-operator log showing the failed health check. Note this was canceled from a timeout rather than refused; refused would suggest a possible TLS issue or an unstarted etcd.
> 2021-04-19T08:22:42.241879843Z WARNING: 2021/04/19 08:22:42 grpc: addrConn.createTransport failed to connect to {https://192.168.111.22:2379 <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp 192.168.111.22:2379: operation was canceled". Reconnecting...

This proves etcd is running fine on master-2, but the operator cannot communicate with it over hostNetwork. It is strange that quorum-guard, also scheduled on that node, does not appear to have a problem. So perhaps reviewing the routes between the two pods can help resolve the issue?

# NOTE: these are unauthenticated TCP probes from kubelet; we are working on fixing this, but the logs do not reflect an actual issue.
> 2021-04-19T08:22:19.953771624Z {"level":"warn","ts":"2021-04-19T08:22:19.953Z","caller":"embed/config_logging.go:279","msg":"rejected connection","remote-addr":"192.168.111.22:43888","server-name":"","error":"EOF"}

So summing up, it looks like on dual-stack nodes, a pod cannot make a connection to its IPv4 node IP.
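A minimal sketch of the reachability test described above, assuming the etcd client port 2379 behind the /health endpoint and the etcd-operator deployment in the openshift-etcd-operator namespace (both assumptions beyond the text); the node IPs are the ones from this job. A TLS or certificate error means the TCP path to that node works; a hang or timeout reproduces the pod-to-hostNetwork failure:

# Open a shell in the etcd-operator pod.
oc -n openshift-etcd-operator rsh deployment/etcd-operator

# Inside the container, probe each control-plane node's etcd client endpoint.
for ip in 192.168.111.20 192.168.111.21 192.168.111.22; do
  echo "== ${ip} =="
  curl -kvs --max-time 5 "https://${ip}:2379/health" || true
done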
Facing the same issue installing a metal IPI cluster in dualstack mode.
From the etcd operator logs:
WARNING: 2021/04/26 08:43:54 grpc: addrConn.createTransport failed to connect to {https://10.19.17.204:2379 <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp 10.19.17.204:2379: connect: connection refused". Reconnecting...
WARNING: 2021/04/26 08:43:54 grpc: addrConn.createTransport failed to connect to {https://10.19.17.162:2379 <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp 10.19.17.162:2379: connect: connection refused". Reconnecting...
Any objection to adding the tag (IPI) to the whiteboard so we can easily filter all bugs that affect IPI installation on virt/BM?

Verified this bug on build 4.8.0-0.nightly-2021-05-10-151546:
# oc get co
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE
authentication 4.8.0-0.nightly-2021-05-10-151546 True False False 49m
baremetal 4.8.0-0.nightly-2021-05-10-151546 True False False 95m
cloud-credential 4.8.0-0.nightly-2021-05-10-151546 True False False 113m
cluster-autoscaler 4.8.0-0.nightly-2021-05-10-151546 True False False 97m
config-operator 4.8.0-0.nightly-2021-05-10-151546 True False False 99m
console 4.8.0-0.nightly-2021-05-10-151546 True False False 57m
csi-snapshot-controller 4.8.0-0.nightly-2021-05-10-151546 True False False 87m
dns 4.8.0-0.nightly-2021-05-10-151546 True False False 89m
etcd 4.8.0-0.nightly-2021-05-10-151546 True False False 93m
image-registry 4.8.0-0.nightly-2021-05-10-151546 True False False 46m
ingress 4.8.0-0.nightly-2021-05-10-151546 True False False 64m
insights 4.8.0-0.nightly-2021-05-10-151546 True False False 86m
kube-apiserver 4.8.0-0.nightly-2021-05-10-151546 True False False 87m
kube-controller-manager 4.8.0-0.nightly-2021-05-10-151546 True False False 91m
kube-scheduler 4.8.0-0.nightly-2021-05-10-151546 True False False 88m
kube-storage-version-migrator 4.8.0-0.nightly-2021-05-10-151546 True False False 91m
machine-api 4.8.0-0.nightly-2021-05-10-151546 True False False 86m
machine-approver 4.8.0-0.nightly-2021-05-10-151546 True False False 95m
machine-config 4.8.0-0.nightly-2021-05-10-151546 True False False 87m
marketplace 4.8.0-0.nightly-2021-05-10-151546 True False False 97m
monitoring 4.8.0-0.nightly-2021-05-10-151546 True False False 63m
network 4.8.0-0.nightly-2021-05-10-151546 True False False 99m
node-tuning 4.8.0-0.nightly-2021-05-10-151546 True False False 90m
openshift-apiserver 4.8.0-0.nightly-2021-05-10-151546 True False False 86m
openshift-controller-manager 4.8.0-0.nightly-2021-05-10-151546 True False False 88m
openshift-samples 4.8.0-0.nightly-2021-05-10-151546 True False False 86m
operator-lifecycle-manager 4.8.0-0.nightly-2021-05-10-151546 True False False 98m
operator-lifecycle-manager-catalog 4.8.0-0.nightly-2021-05-10-151546 True False False 96m
operator-lifecycle-manager-packageserver 4.8.0-0.nightly-2021-05-10-151546 True False False 88m
service-ca 4.8.0-0.nightly-2021-05-10-151546 True False False 99m
storage 4.8.0-0.nightly-2021-05-10-151546 True False False 99m
# oc get network -o yaml
apiVersion: v1
items:
- apiVersion: config.openshift.io/v1
kind: Network
metadata:
creationTimestamp: "2021-05-11T00:53:05Z"
generation: 2
name: cluster
resourceVersion: "5074"
uid: 181ba4ca-8bb5-493f-affb-3985819768c4
spec:
clusterNetwork:
- cidr: 10.128.0.0/14
hostPrefix: 23
- cidr: fd01::/48
hostPrefix: 64
externalIP:
policy: {}
networkType: OVNKubernetes
serviceNetwork:
- 172.30.0.0/16
- fd02::/112
status:
clusterNetwork:
- cidr: 10.128.0.0/14
hostPrefix: 23
- cidr: fd01::/48
hostPrefix: 64
clusterNetworkMTU: 1400
networkType: OVNKubernetes
serviceNetwork:
- 172.30.0.0/16
- fd02::/112
kind: List
metadata:
resourceVersion: ""
selfLink: ""
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438
Description of problem:

Dualstack is fairly reliably failing with etcd unable to come up:

$ oc logs -n openshift-etcd etcd-master-2.ostest.test.metalkube.org -c etcd
#### attempt 9
member={name="etcd-bootstrap", peerURLs=[https://192.168.111.27:2380}, clientURLs=[https://192.168.111.27:2379]
member={name="master-1.ostest.test.metalkube.org", peerURLs=[https://192.168.111.21:2380}, clientURLs=[https://192.168.111.21:2379]
member={name="master-0.ostest.test.metalkube.org", peerURLs=[https://192.168.111.20:2380}, clientURLs=[https://192.168.111.20:2379]
member "https://192.168.111.22:2380" not found in member list, check operator logs for possible scaling problems

Jobs here: https://prow.ci.openshift.org/job-history/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-ipi-ovn-dualstack

Version-Release number of selected component (if applicable):
4.8.0-0.nightly-2021-04-16-110300

How reproducible:
Always or close to it

Steps to Reproduce:
1. Install a metal IPI environment in dualstack mode (IP_STACK=v4v6 with dev-scripts)

Actual results:
Installation fails with etcd errors

Additional info:
I also can provide a live reproducer environment.
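For step 1, a minimal sketch of the dev-scripts configuration that selects dual stack. IP_STACK=v4v6 comes from this report; the config file name (config_$USER.sh) and the other variables follow openshift-metal3/dev-scripts conventions and should be treated as assumptions:

# config_$USER.sh fragment for openshift-metal3/dev-scripts
export IP_STACK=v4v6              # deploy the cluster in dual-stack (IPv4 + IPv6) mode
export CI_TOKEN='<your-ci-token>' # placeholder; used by dev-scripts to pull release images
# then build the environment:
make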