Bug 1878215 - etcd operator degraded when deploying 4.6 baremetal IPv6
Summary: etcd operator degraded when deploying 4.6 baremetal IPv6
Keywords:
Status: CLOSED DUPLICATE of bug 1877833
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.6
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: ---
Assignee: Ben Bennett
QA Contact: Chad Crum
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-09-11 15:41 UTC by Chad Crum
Modified: 2020-09-14 14:33 UTC
CC List: 3 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-09-14 14:33:20 UTC
Target Upstream Version:
Embargoed:



Description Chad Crum 2020-09-11 15:41:10 UTC
Description of problem:
When deploying OCP 4.6 with IPv6 baremetal and provisioning networks, the deployment fails with the etcd operator in a degraded state.

(This appears similar to BZ 1877833, but with slightly different etcd log messages.)

Version-Release number of selected component (if applicable):
registry.svc.ci.openshift.org/ocp/release:4.6.0-0.nightly-2020-09-09-224210

How reproducible:
100%

Steps to Reproduce:
1. Deploy registry.svc.ci.openshift.org/ocp/release:4.6.0-0.nightly-2020-09-09-224210 in a baremetal environment with IPv6 baremetal and provisioning networks (see the sketch below)
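
A minimal sketch of the deploy flow, for reference (assuming the standard baremetal IPI installer extracted from the release image; the install-config.yaml contents and install directory are environment-specific):

    # Extract the baremetal IPI installer from the nightly release image
    oc adm release extract --command=openshift-baremetal-install \
        registry.svc.ci.openshift.org/ocp/release:4.6.0-0.nightly-2020-09-09-224210

    # install-config.yaml uses IPv6 CIDRs for the baremetal and provisioning networks
    ./openshift-baremetal-install create cluster --dir=<install-dir> --log-level=debug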

Actual results:
The etcd operator is stuck in a degraded state, with the etcd pod on one of the masters crash looping.
Container state and last termination message from the crash-looping etcd pod:
            State:       Waiting
              Reason:    CrashLoopBackOff
            Last State:  Terminated
              Reason:    Error
              Message:   [fd2e:6f44:5dd8::111]:2380}, clientURLs=[https://[fd2e:6f44:5dd8::111]:2379]
              member={name="etcd-bootstrap", peerURLs=[https://[fd2e:6f44:5dd8::115]:2380}, clientURLs=[https://[fd2e:6f44:5dd8::115]:2379]
              target=nil, err=<nil>
        #### sleeping...
        #### attempt 5
              member={name="master-0-2.ocp-edge-cluster-0.qe.lab.redhat.com", peerURLs=[https://[fd2e:6f44:5dd8::111]:2380}, clientURLs=[https://[fd2e:6f44:5dd8::111]:2379]
              member={name="etcd-bootstrap", peerURLs=[https://[fd2e:6f44:5dd8::115]:2380}, clientURLs=[https://[fd2e:6f44:5dd8::115]:2379]
              target=nil, err=<nil>
        #### sleeping...
        #### attempt 6
              member={name="master-0-2.ocp-edge-cluster-0.qe.lab.redhat.com", peerURLs=[https://[fd2e:6f44:5dd8::111]:2380}, clientURLs=[https://[fd2e:6f44:5dd8::111]:2379]
              member={name="etcd-bootstrap", peerURLs=[https://[fd2e:6f44:5dd8::115]:2380}, clientURLs=[https://[fd2e:6f44:5dd8::115]:2379]
              target=nil, err=<nil>
        #### sleeping...
        #### attempt 7
              member={name="master-0-2.ocp-edge-cluster-0.qe.lab.redhat.com", peerURLs=[https://[fd2e:6f44:5dd8::111]:2380}, clientURLs=[https://[fd2e:6f44:5dd8::111]:2379]
              member={name="etcd-bootstrap", peerURLs=[https://[fd2e:6f44:5dd8::115]:2380}, clientURLs=[https://[fd2e:6f44:5dd8::115]:2379]
              target=nil, err=<nil>
        #### sleeping...
        #### attempt 8
              member={name="master-0-2.ocp-edge-cluster-0.qe.lab.redhat.com", peerURLs=[https://[fd2e:6f44:5dd8::111]:2380}, clientURLs=[https://[fd2e:6f44:5dd8::111]:2379]
              member={name="etcd-bootstrap", peerURLs=[https://[fd2e:6f44:5dd8::115]:2380}, clientURLs=[https://[fd2e:6f44:5dd8::115]:2379]
              target=nil, err=<nil>
        #### sleeping...
        #### attempt 9
              member={name="master-0-2.ocp-edge-cluster-0.qe.lab.redhat.com", peerURLs=[https://[fd2e:6f44:5dd8::111]:2380}, clientURLs=[https://[fd2e:6f44:5dd8::111]:2379]
              member={name="etcd-bootstrap", peerURLs=[https://[fd2e:6f44:5dd8::115]:2380}, clientURLs=[https://[fd2e:6f44:5dd8::115]:2379]
              target=nil, err=<nil>
        #### sleeping...
        timed out
              Exit Code:    1
              Started:      Fri, 11 Sep 2020 18:23:33 +0300
              Finished:     Fri, 11 Sep 2020 18:23:43 +0300
            Ready:          False
            Restart Count:  18
            Requests:
              cpu:      300m
              memory:   600Mi
            Readiness:  exec [/bin/sh -ec lsof -n -i :2380 | grep LISTEN] delay=3s timeout=5s period=5s #success=1 #failure=3

Pod state in the openshift-etcd namespace:
    [root@sealusa12 ~]# oc get pods
    NAME                                                                READY   STATUS             RESTARTS   AGE
    etcd-master-0-1.ocp-edge-cluster-0.qe.lab.redhat.com                2/3     CrashLoopBackOff   19         78m
    etcd-master-0-2.ocp-edge-cluster-0.qe.lab.redhat.com                3/3     Running            1          78m
    installer-2-master-0-1.ocp-edge-cluster-0.qe.lab.redhat.com         0/1     Completed          0          78m
    installer-2-master-0-2.ocp-edge-cluster-0.qe.lab.redhat.com         0/1     Completed          0          79m
    revision-pruner-2-master-0-2.ocp-edge-cluster-0.qe.lab.redhat.com   0/1     Completed          0          78m
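
To gather the detail above, something along these lines can be used (a sketch; it assumes the crashing container is the one named "etcd", so substitute whichever container the describe output shows as failing):

    # Container states and last termination message for the crash-looping pod
    oc -n openshift-etcd describe pod etcd-master-0-1.ocp-edge-cluster-0.qe.lab.redhat.com

    # Logs from the previous (crashed) instance of that container
    oc -n openshift-etcd logs etcd-master-0-1.ocp-edge-cluster-0.qe.lab.redhat.com -c etcd --previous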


Expected results:
    The etcd operator reaches a functional state and the deployment completes successfully.

Additional info:
    This looks similar to https://bugzilla.redhat.com/show_bug.cgi?id=1877833, as an etcd master has a pod crashing, although the etcd pod log message noted in that BZ is different from the one I'm seeing. I could be looking in a different location, though.
    
    I should also note that other operators are having issues starting up (columns: AVAILABLE / PROGRESSING / DEGRADED / SINCE):
        authentication                             4.6.0-0.nightly-2020-09-09-224210   False       False         True       86m
        console                                    4.6.0-0.nightly-2020-09-09-224210   False       False         True       55m
        etcd                                       4.6.0-0.nightly-2020-09-09-224210   False       True          True       84m
        monitoring                                 4.6.0-0.nightly-2020-09-09-224210   False       True          True       29m
        openshift-apiserver                        4.6.0-0.nightly-2020-09-09-224210   False       False         False      11m
        operator-lifecycle-manager-packageserver   4.6.0-0.nightly-2020-09-09-224210   False       True          False      30m
    But my guess right now is that etcd (or some underlying OVN problem?), similar to BZ 1877833, may be the cause.
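
    To confirm what is driving the Degraded condition, the etcd clusteroperator and the operator's own logs can be checked (a sketch; it assumes the operator runs as the etcd-operator deployment in the openshift-etcd-operator namespace, as in a standard 4.6 install):

        # Degraded/Available condition messages reported by the etcd operator
        oc get clusteroperator etcd -o yaml

        # Logs from the etcd operator itself
        oc -n openshift-etcd-operator logs deployment/etcd-operator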

Comment 1 Ben Bennett 2020-09-14 14:33:20 UTC

*** This bug has been marked as a duplicate of bug 1877833 ***

