Bug 1873121 - RedHat CoreOS worker node is not listening to port 80 and 443
Summary: RedHat CoreOS worker node is not listening to port 80 and 443
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Documentation
Version: 4.5
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.7.0
Assignee: Eric Ponvelle
QA Contact: Hongan Li
Docs Contact: Vikram Goyal
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-08-27 12:41 UTC by barnali
Modified: 2021-04-21 03:45 UTC
CC: 9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-04-21 03:45:09 UTC
Target Upstream Version:
Embargoed:



Description barnali 2020-08-27 12:41:10 UTC
Description of problem:
I've installed a cluster with 3 control plane nodes and 3 worker nodes.
All nodes are in the Ready state, but the third worker node shows as down in HAProxy.
So I logged into that worker node and tried to connect to ports 80 and 443; the connection was refused for both. I then ran netstat and could not see port 80 or 443 listening.
[core@worker3 conf.d]$ sudo netstat -lptn
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0 127.0.0.1:10248         0.0.0.0:*               LISTEN      1503/kubelet
tcp        0      0 10.0.235.43:9100        0.0.0.0:*               LISTEN      3395/kube-rbac-prox
tcp        0      0 127.0.0.1:9100          0.0.0.0:*               LISTEN      2978/node_exporter
tcp        0      0 0.0.0.0:111             0.0.0.0:*               LISTEN      1/systemd
tcp        0      0 0.0.0.0:50451           0.0.0.0:*               LISTEN      1426/rpc.statd
tcp        0      0 127.0.0.1:4180          0.0.0.0:*               LISTEN      3339/oauth-proxy
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN      1403/sshd
tcp        0      0 127.0.0.1:8797          0.0.0.0:*               LISTEN      2222/machine-config
tcp6       0      0 :::9001                 :::*                    LISTEN      3339/oauth-proxy
tcp6       0      0 :::10250                :::*                    LISTEN      1503/kubelet
tcp6       0      0 :::9101                 :::*                    LISTEN      2356/openshift-sdn-
tcp6       0      0 :::111                  :::*                    LISTEN      1/systemd
tcp6       0      0 :::10256                :::*                    LISTEN      2356/openshift-sdn-
tcp6       0      0 :::22                   :::*                    LISTEN      1403/sshd
tcp6       0      0 :::49879                :::*                    LISTEN      1426/rpc.statd
tcp6       0      0 :::10010                :::*                    LISTEN      1462/crio
tcp6       0      0 :::9537                 :::*                    LISTEN      1462/crio



On the other worker nodes, the same command shows ports 80 and 443:
[core@worker2 conf.d]$ sudo netstat -lptn
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0 127.0.0.1:10248         0.0.0.0:*               LISTEN      1510/kubelet
tcp        0      0 0.0.0.0:42921           0.0.0.0:*               LISTEN      1430/rpc.statd
tcp        0      0 127.0.0.1:10443         0.0.0.0:*               LISTEN      3570955/haproxy
tcp        0      0 127.0.0.1:10444         0.0.0.0:*               LISTEN      3570955/haproxy
tcp        0      0 10.0.235.42:9100        0.0.0.0:*               LISTEN      3635/kube-rbac-prox
tcp        0      0 127.0.0.1:9100          0.0.0.0:*               LISTEN      3170/node_exporter
tcp        0      0 0.0.0.0:111             0.0.0.0:*               LISTEN      1/systemd
tcp        0      0 0.0.0.0:80              0.0.0.0:*               LISTEN      3570955/haproxy
tcp        0      0 127.0.0.1:4180          0.0.0.0:*               LISTEN      3570/oauth-proxy
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN      1406/sshd
tcp        0      0 0.0.0.0:443             0.0.0.0:*               LISTEN      3570955/haproxy
tcp        0      0 127.0.0.1:8797          0.0.0.0:*               LISTEN      2472/machine-config
tcp6       0      0 :::9001                 :::*                    LISTEN      3570/oauth-proxy
tcp6       0      0 :::60873                :::*                    LISTEN      1430/rpc.statd
tcp6       0      0 :::10250                :::*                    LISTEN      1510/kubelet
tcp6       0      0 :::9101                 :::*                    LISTEN      2469/openshift-sdn-
tcp6       0      0 :::111                  :::*                    LISTEN      1/systemd
tcp6       0      0 :::1936                 :::*                    LISTEN      5721/openshift-rout
tcp6       0      0 :::10256                :::*                    LISTEN      2469/openshift-sdn-
tcp6       0      0 :::22                   :::*                    LISTEN      1406/sshd
tcp6       0      0 :::10010                :::*                    LISTEN      1468/crio
tcp6       0      0 :::9537                 :::*                    LISTEN      1468/crio



Version-Release number of selected component (if applicable):
NAME="Red Hat Enterprise Linux CoreOS"
VERSION="45.82.202008101249-0"
VERSION_ID="4.5"
OPENSHIFT_VERSION="4.5"
RHEL_VERSION="8.2"
PRETTY_NAME="Red Hat Enterprise Linux CoreOS 45.82.202008101249-0 (Ootpa)"
ID="rhcos"
ID_LIKE="rhel fedora"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:8::coreos"
HOME_URL="https://www.redhat.com/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_BUGZILLA_PRODUCT="OpenShift Container Platform"
REDHAT_BUGZILLA_PRODUCT_VERSION="4.5"
REDHAT_SUPPORT_PRODUCT="OpenShift Container Platform"
REDHAT_SUPPORT_PRODUCT_VERSION="4.5"
OSTREE_VERSION='45.82.202008101249-0'


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:
Ports 80 and 443 are not showing up in the netstat output on worker node 3, and the node shows as down in HAProxy because connections to ports 80 and 443 fail.

Expected results:
Worker node 3 should be up in the load balancer and should accept connections on ports 80 and 443.

Additional info:

Comment 1 Micah Abbott 2020-08-27 15:04:59 UTC
This looks more like a problem with the HAProxy pod/container not operating correctly on the node than an OS-level issue.

Could you provide a must-gather for the problematic node?
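
For reference, a must-gather can typically be collected with something along these lines (exact flags may vary by release):

# Collect cluster diagnostic data into ./must-gather (destination directory is just an example)
oc adm must-gather --dest-dir=./must-gather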

Moving to Routing as they may be better suited to triage this kind of problem.

Comment 2 mfisher 2020-08-27 16:15:14 UTC
Target set to 4.7 while investigation is either ongoing or not yet started.  Will be considered for earlier release versions when diagnosed and resolved.

Comment 3 Daneyon Hansen 2020-08-28 22:32:18 UTC
Node ports will only be used for HAProxy if the endpoint publishing strategy [1] is set to "HostNetwork". To better assist you, please provide the following details:

1. Output of `oc get infrastructure/cluster -o yaml`
2. Details of the ingresscontroller resource used to run HAProxy, i.e. `oc get ingresscontroller/default -n openshift-ingress-operator -o yaml`
3. Ingress Operator logs.

[1] https://github.com/openshift/api/blob/master/operator/v1/types_ingress.go#L69-L86
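
For example, these details can be gathered roughly as follows (the log command assumes the operator deployment is named ingress-operator):

# Cluster infrastructure configuration
oc get infrastructure/cluster -o yaml
# The default ingresscontroller, including its endpointPublishingStrategy
oc get ingresscontroller/default -n openshift-ingress-operator -o yaml
# Ingress Operator logs
oc logs -n openshift-ingress-operator deployment/ingress-operator -c ingress-operator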

Comment 4 barnali 2020-09-01 07:33:36 UTC
1> oc get infrastructure/cluster
NAME      AGE
cluster   5d21h

The issue was that the ingresscontroller replica count defaulted to 2, so no ingress controller pod was created on the third worker node and ports 80 and 443 were unused. I scaled the default ingress controller with the command below, and the node started showing as healthy in HAProxy.
oc patch -n openshift-ingress-operator ingresscontroller/default --patch '{"spec":{"replicas": 3}}' --type=merge

I guess the replica count can also be set during OpenShift installation by specifying it in the cluster-ingress-02-config.yml file. It would be helpful if this were mentioned in the installation document (https://docs.openshift.com/container-platform/4.5/installing/installing_bare_metal/installing-bare-metal.html), similar to how the information below is provided:

<<<Modify the <installation_directory>/manifests/cluster-scheduler-02-config.yml Kubernetes manifest file to prevent Pods from being scheduled on the control plane machines>>>
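
As a rough sketch of what such an instruction could look like (the manifest file name below is an assumed example, not taken from the existing documentation), the replica count could be set before installation by adding an IngressController manifest to the installation directory:

# Assumed example: create this manifest before running the installer
cat <<EOF > <installation_directory>/manifests/cluster-ingress-default-ingresscontroller.yaml
apiVersion: operator.openshift.io/v1
kind: IngressController
metadata:
  name: default
  namespace: openshift-ingress-operator
spec:
  replicas: 3
EOF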

2> oc get ingresscontroller/default -n openshift-ingress-operator -o yaml
apiVersion: operator.openshift.io/v1
kind: IngressController
metadata:
  creationTimestamp: "2020-08-26T09:37:01Z"
  finalizers:
  - ingresscontroller.operator.openshift.io/finalizer-ingresscontroller
  generation: 2
  managedFields:
  - apiVersion: operator.openshift.io/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:spec:
        f:replicas: {}
    manager: oc
    operation: Update
    time: "2020-08-27T17:10:30Z"
  - apiVersion: operator.openshift.io/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:finalizers:
          .: {}
          v:"ingresscontroller.operator.openshift.io/finalizer-ingresscontroller                                                                             ": {}
      f:spec: {}
      f:status:
        .: {}
        f:availableReplicas: {}
        f:conditions: {}
        f:domain: {}
        f:endpointPublishingStrategy:
          .: {}
          f:type: {}
        f:observedGeneration: {}
        f:selector: {}
        f:tlsProfile:
          .: {}
          f:ciphers: {}
          f:minTLSVersion: {}
    manager: ingress-operator
    operation: Update
    time: "2020-08-27T17:10:50Z"
  name: default
  namespace: openshift-ingress-operator
  resourceVersion: "740546"
  selfLink: /apis/operator.openshift.io/v1/namespaces/openshift-ingress-operator/ingresscontrollers/default
  uid: 7779c895-9fc4-4f49-aa96-935e01367f71
spec:
  replicas: 3
status:
  availableReplicas: 3
  conditions:
  - lastTransitionTime: "2020-08-26T09:37:02Z"
    reason: Valid
    status: "True"
    type: Admitted
  - lastTransitionTime: "2020-08-27T13:37:30Z"
    status: "True"
    type: Available
  - lastTransitionTime: "2020-08-27T13:37:30Z"
    message: The deployment has Available status condition set to True
    reason: DeploymentAvailable
    status: "False"
    type: DeploymentDegraded
  - lastTransitionTime: "2020-08-26T09:37:05Z"
    message: The configured endpoint publishing strategy does not include a managed
      load balancer
    reason: EndpointPublishingStrategyExcludesManagedLoadBalancer
    status: "False"
    type: LoadBalancerManaged
  - lastTransitionTime: "2020-08-26T09:37:05Z"
    message: No DNS zones are defined in the cluster dns config.
    reason: NoDNSZones
    status: "False"
    type: DNSManaged
  - lastTransitionTime: "2020-08-27T13:37:30Z"
    status: "False"
    type: Degraded
  domain: apps.dmcan.ocppoc.cluster
  endpointPublishingStrategy:
    type: HostNetwork
  observedGeneration: 2
  selector: ingresscontroller.operator.openshift.io/deployment-ingresscontroller=default
  tlsProfile:
    ciphers:
    - TLS_AES_128_GCM_SHA256
    - TLS_AES_256_GCM_SHA384
    - TLS_CHACHA20_POLY1305_SHA256
    - ECDHE-ECDSA-AES128-GCM-SHA256
    - ECDHE-RSA-AES128-GCM-SHA256
    - ECDHE-ECDSA-AES256-GCM-SHA384
    - ECDHE-RSA-AES256-GCM-SHA384
    - ECDHE-ECDSA-CHACHA20-POLY1305
    - ECDHE-RSA-CHACHA20-POLY1305
    - DHE-RSA-AES128-GCM-SHA256
    - DHE-RSA-AES256-GCM-SHA384
    minTLSVersion: VersionTLS12

Comment 5 Daneyon Hansen 2020-09-01 16:53:10 UTC
@barnali,

Thanks for the feedback. I'm reassigning the bz to docs based on feedback in https://bugzilla.redhat.com/show_bug.cgi?id=1873121#c4.

Comment 7 Hongan Li 2021-04-16 03:42:58 UTC
Hi Eric,

Actually, the cluster-ingress-02-config.yml file only contains the ingress base domain setting; see below:

$ cat cluster-ingress-02-config.yml 
apiVersion: config.openshift.io/v1
kind: Ingress
metadata:
  creationTimestamp: null
  name: cluster
spec:
  domain: apps.example.openshift.com
status: {}

The replicas field is defined in the ingresscontroller CR, not in the Ingress.config.openshift.io resource shown above.

I believe the best practice for this operation is to scale it after the installation is complete; see below:

$ oc -n openshift-ingress-operator scale ingresscontroller/default --replicas=3

(Patching the ingresscontroller as in comment 4 is also fine.)
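
To confirm the result, the router pods can be checked with something like the command below; once replicas is set to 3, there should be one router-default pod on each worker node:

# Shows which node each router pod landed on
oc -n openshift-ingress get pods -o wide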

@barnali, what do you think? If this is acceptable, please feel free to close this. Thanks.

Comment 8 barnali 2021-04-21 03:45:09 UTC
This can be closed if the best practice is to scale it after the installation is completed.

