Description of problem:

I've installed a cluster with 3 control plane nodes and 3 worker nodes. All nodes are in Ready state, but the 3rd worker node shows as down in HAProxy. After logging into that worker node I tried to connect to ports 80 and 443; both connections are refused. Running netstat on the node shows neither port 80 nor port 443 listening.

[core@worker3 conf.d]$ sudo netstat -lptn
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0 127.0.0.1:10248         0.0.0.0:*               LISTEN      1503/kubelet
tcp        0      0 10.0.235.43:9100        0.0.0.0:*               LISTEN      3395/kube-rbac-prox
tcp        0      0 127.0.0.1:9100          0.0.0.0:*               LISTEN      2978/node_exporter
tcp        0      0 0.0.0.0:111             0.0.0.0:*               LISTEN      1/systemd
tcp        0      0 0.0.0.0:50451           0.0.0.0:*               LISTEN      1426/rpc.statd
tcp        0      0 127.0.0.1:4180          0.0.0.0:*               LISTEN      3339/oauth-proxy
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN      1403/sshd
tcp        0      0 127.0.0.1:8797          0.0.0.0:*               LISTEN      2222/machine-config
tcp6       0      0 :::9001                 :::*                    LISTEN      3339/oauth-proxy
tcp6       0      0 :::10250                :::*                    LISTEN      1503/kubelet
tcp6       0      0 :::9101                 :::*                    LISTEN      2356/openshift-sdn-
tcp6       0      0 :::111                  :::*                    LISTEN      1/systemd
tcp6       0      0 :::10256                :::*                    LISTEN      2356/openshift-sdn-
tcp6       0      0 :::22                   :::*                    LISTEN      1403/sshd
tcp6       0      0 :::49879                :::*                    LISTEN      1426/rpc.statd
tcp6       0      0 :::10010                :::*                    LISTEN      1462/crio
tcp6       0      0 :::9537                 :::*                    LISTEN      1462/crio

On the other worker nodes the same command does show ports 80 and 443 listening:

[core@worker2 conf.d]$ sudo netstat -lptn
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0 127.0.0.1:10248         0.0.0.0:*               LISTEN      1510/kubelet
tcp        0      0 0.0.0.0:42921           0.0.0.0:*               LISTEN      1430/rpc.statd
tcp        0      0 127.0.0.1:10443         0.0.0.0:*               LISTEN      3570955/haproxy
tcp        0      0 127.0.0.1:10444         0.0.0.0:*               LISTEN      3570955/haproxy
tcp        0      0 10.0.235.42:9100        0.0.0.0:*               LISTEN      3635/kube-rbac-prox
tcp        0      0 127.0.0.1:9100          0.0.0.0:*               LISTEN      3170/node_exporter
tcp        0      0 0.0.0.0:111             0.0.0.0:*               LISTEN      1/systemd
tcp        0      0 0.0.0.0:80              0.0.0.0:*               LISTEN      3570955/haproxy
tcp        0      0 127.0.0.1:4180          0.0.0.0:*               LISTEN      3570/oauth-proxy
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN      1406/sshd
tcp        0      0 0.0.0.0:443             0.0.0.0:*               LISTEN      3570955/haproxy
tcp        0      0 127.0.0.1:8797          0.0.0.0:*               LISTEN      2472/machine-config
tcp6       0      0 :::9001                 :::*                    LISTEN      3570/oauth-proxy
tcp6       0      0 :::60873                :::*                    LISTEN      1430/rpc.statd
tcp6       0      0 :::10250                :::*                    LISTEN      1510/kubelet
tcp6       0      0 :::9101                 :::*                    LISTEN      2469/openshift-sdn-
tcp6       0      0 :::111                  :::*                    LISTEN      1/systemd
tcp6       0      0 :::1936                 :::*                    LISTEN      5721/openshift-rout
tcp6       0      0 :::10256                :::*                    LISTEN      2469/openshift-sdn-
tcp6       0      0 :::22                   :::*                    LISTEN      1406/sshd
tcp6       0      0 :::10010                :::*                    LISTEN      1468/crio
tcp6       0      0 :::9537                 :::*                    LISTEN      1468/crio

Version-Release number of selected component (if applicable):

NAME="Red Hat Enterprise Linux CoreOS"
VERSION="45.82.202008101249-0"
VERSION_ID="4.5"
OPENSHIFT_VERSION="4.5"
RHEL_VERSION="8.2"
PRETTY_NAME="Red Hat Enterprise Linux CoreOS 45.82.202008101249-0 (Ootpa)"
ID="rhcos"
ID_LIKE="rhel fedora"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:8::coreos"
HOME_URL="https://www.redhat.com/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_BUGZILLA_PRODUCT="OpenShift Container Platform"
REDHAT_BUGZILLA_PRODUCT_VERSION="4.5"
REDHAT_SUPPORT_PRODUCT="OpenShift Container Platform"
REDHAT_SUPPORT_PRODUCT_VERSION="4.5"
OSTREE_VERSION='45.82.202008101249-0'

How reproducible:

Steps to Reproduce:
1.
2.
3.
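A check that was not part of the original report, but that narrows this down quickly: with a HostNetwork ingress setup, ports 80/443 on a worker are opened by the ingress router pod, so listing the pods scheduled on the affected node shows whether a router pod is present at all. The node name "worker3" below is taken from the shell prompt in the report and may need to be the full node name shown by `oc get nodes`; the commands themselves are standard `oc` usage, not output captured from this cluster.

# List every pod scheduled on the affected worker; if no router-default pod
# appears here, nothing will be listening on 80/443 on that node.
oc get pods --all-namespaces -o wide --field-selector spec.nodeName=worker3

# For comparison, show which nodes the ingress router pods actually landed on.
oc -n openshift-ingress get pods -o wide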
Actual results:

Ports 80 and 443 do not show up in the netstat output on worker node 3, and the node shows as down in HAProxy because connections to ports 80 and 443 fail.

Expected results:

Worker node 3 should be up in the load balancer and should accept connections on ports 80 and 443.

Additional info:
This looks more like a problem with the haproxy pod/container not operating correctly on the node than an OS-level issue. Could you provide a must-gather for the problematic node? Moving to Routing as they may be better suited to triage this kind of problem.
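For completeness, the standard way to collect that data (the destination directory name is just an example, not something specified in this bug):

# Gathers cluster state, operator logs and node data into a local directory
# that can be attached to the bug.
oc adm must-gather --dest-dir=./must-gather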
Target set to 4.7 while investigation is either ongoing or not yet started. Will be considered for earlier release versions when diagnosed and resolved.
Node ports will only be used for HAProxy if the endpoint publishing strategy [1] is set to "HostNetwork". To better assist you, please provide the following details:

1. Output of `oc get infrastructure/cluster -o`
2. Details of the ingresscontroller resource used to run HAProxy, i.e. `oc get ingresscontroller/default -n openshift-ingress-operator -o yaml`
3. Ingress Operator logs.

[1] https://github.com/openshift/api/blob/master/operator/v1/types_ingress.go#L69-L86
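Since the publishing strategy determines whether the node ports matter at all, a quick way to check just that field (the jsonpath expression is illustrative and not taken from this bug):

# Prints the endpoint publishing strategy type; "HostNetwork" means the router
# binds 80/443 directly on the nodes where its pods run.
oc -n openshift-ingress-operator get ingresscontroller/default \
  -o jsonpath='{.status.endpointPublishingStrategy.type}'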
1> oc get infrastructure/cluster

NAME      AGE
cluster   5d21h

The issue was that the ingresscontroller replica count defaulted to 2, so no ingress controller pod was created on the 3rd worker node and ports 80 and 443 were unused there. I scaled the default ingress controller with the command below and the node started showing healthy in HAProxy:

oc patch -n openshift-ingress-operator ingresscontroller/default --patch '{"spec":{"replicas": 3}}' --type=merge

I guess it can also be set during OpenShift installation by specifying the replica count in the cluster-ingress-02-config.yml file. It would be helpful if this were mentioned in the installation document (https://docs.openshift.com/container-platform/4.5/installing/installing_bare_metal/installing-bare-metal.html), similar to how the following information is provided:

<<<Modify the <installation_directory>/manifests/cluster-scheduler-02-config.yml Kubernetes manifest file to prevent Pods from being scheduled on the control plane machines>>>

2> oc get ingresscontroller/default -n openshift-ingress-operator -o yaml

apiVersion: operator.openshift.io/v1
kind: IngressController
metadata:
  creationTimestamp: "2020-08-26T09:37:01Z"
  finalizers:
  - ingresscontroller.operator.openshift.io/finalizer-ingresscontroller
  generation: 2
  managedFields:
  - apiVersion: operator.openshift.io/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:spec:
        f:replicas: {}
    manager: oc
    operation: Update
    time: "2020-08-27T17:10:30Z"
  - apiVersion: operator.openshift.io/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:finalizers:
          .: {}
          v:"ingresscontroller.operator.openshift.io/finalizer-ingresscontroller": {}
      f:spec: {}
      f:status:
        .: {}
        f:availableReplicas: {}
        f:conditions: {}
        f:domain: {}
        f:endpointPublishingStrategy:
          .: {}
          f:type: {}
        f:observedGeneration: {}
        f:selector: {}
        f:tlsProfile:
          .: {}
          f:ciphers: {}
          f:minTLSVersion: {}
    manager: ingress-operator
    operation: Update
    time: "2020-08-27T17:10:50Z"
  name: default
  namespace: openshift-ingress-operator
  resourceVersion: "740546"
  selfLink: /apis/operator.openshift.io/v1/namespaces/openshift-ingress-operator/ingresscontrollers/default
  uid: 7779c895-9fc4-4f49-aa96-935e01367f71
spec:
  replicas: 3
status:
  availableReplicas: 3
  conditions:
  - lastTransitionTime: "2020-08-26T09:37:02Z"
    reason: Valid
    status: "True"
    type: Admitted
  - lastTransitionTime: "2020-08-27T13:37:30Z"
    status: "True"
    type: Available
  - lastTransitionTime: "2020-08-27T13:37:30Z"
    message: The deployment has Available status condition set to True
    reason: DeploymentAvailable
    status: "False"
    type: DeploymentDegraded
  - lastTransitionTime: "2020-08-26T09:37:05Z"
    message: The configured endpoint publishing strategy does not include a managed load balancer
    reason: EndpointPublishingStrategyExcludesManagedLoadBalancer
    status: "False"
    type: LoadBalancerManaged
  - lastTransitionTime: "2020-08-26T09:37:05Z"
    message: No DNS zones are defined in the cluster dns config.
    reason: NoDNSZones
    status: "False"
    type: DNSManaged
  - lastTransitionTime: "2020-08-27T13:37:30Z"
    status: "False"
    type: Degraded
  domain: apps.dmcan.ocppoc.cluster
  endpointPublishingStrategy:
    type: HostNetwork
  observedGeneration: 2
  selector: ingresscontroller.operator.openshift.io/deployment-ingresscontroller=default
  tlsProfile:
    ciphers:
    - TLS_AES_128_GCM_SHA256
    - TLS_AES_256_GCM_SHA384
    - TLS_CHACHA20_POLY1305_SHA256
    - ECDHE-ECDSA-AES128-GCM-SHA256
    - ECDHE-RSA-AES128-GCM-SHA256
    - ECDHE-ECDSA-AES256-GCM-SHA384
    - ECDHE-RSA-AES256-GCM-SHA384
    - ECDHE-ECDSA-CHACHA20-POLY1305
    - ECDHE-RSA-CHACHA20-POLY1305
    - DHE-RSA-AES128-GCM-SHA256
    - DHE-RSA-AES256-GCM-SHA384
    minTLSVersion: VersionTLS12
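As a sanity check after the scale-out (these commands are illustrative and were not part of the original comment; router-default is the deployment created for the default ingresscontroller in the openshift-ingress namespace):

# The router deployment should now report 3/3 ready replicas.
oc -n openshift-ingress get deployment/router-default

# And on the previously failing worker, haproxy should now hold ports 80 and 443.
sudo netstat -lptn | grep -E ':(80|443) '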
@barnali, Thanks for the feedback. I'm reassigning the bz to docs based on feedback in https://bugzilla.redhat.com/show_bug.cgi?id=1873121#c4.
Hi Eric,

Actually the file cluster-ingress-02-config.yml only contains the ingress base domain setting, see below:

$ cat cluster-ingress-02-config.yml
apiVersion: config.openshift.io/v1
kind: Ingress
metadata:
  creationTimestamp: null
  name: cluster
spec:
  domain: apps.example.openshift.com
status: {}

The replicas field is defined in the ingresscontroller CR instead of the Ingress.config.openshift.io resource above. I believe the best practice for this operation is to scale it after the installation is complete, see below:

$ oc -n openshift-ingress-operator scale ingresscontroller/default --replicas=3

(patching the ingresscontroller as in #Comment 4 is also fine)

@barnali, what do you think? If that is acceptable, please feel free to close this. Thanks
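A quick way to confirm the scale took effect (the jsonpath expression is illustrative, not taken from this comment):

# Prints desired and available replicas; both should read 3 after scaling.
oc -n openshift-ingress-operator get ingresscontroller/default \
  -o jsonpath='{.spec.replicas} {.status.availableReplicas}{"\n"}'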
This can be closed if the best practice is to scale it after the installation is completed.