Bug 1961550 - HAproxy pod logs showing error "another server named 'pod:httpd-7c7ccfffdc-wdkvk:httpd:8080-tcp:10.128.x.x:8080' was already defined at line 326, please use distinct names"
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.8.0
Assignee: Stephen Greene
QA Contact: Arvind iyengar
URL:
Whiteboard:
Depends On:
Blocks: 1963243
 
Reported: 2021-05-18 09:00 UTC by Jobin A T
Modified: 2022-08-04 22:32 UTC
CC: 7 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: Remove selector from a service exposed via a route. Consequence: Duplicate endpointslices would be created for the service's pods, triggering HAProxy reload errors due to duplicate server entries. Fix: Filter out accidental duplicate server lines when writing out the HAProxy config file. Result: Deleting the selector from a service does not brick the router.
Clone Of:
Environment:
Last Closed: 2021-07-27 23:08:55 UTC
Target Upstream Version:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Github openshift router pull 285 0 None closed Bug 1961550: Add a condition to check if the Endpoints ID is duplicated 2021-05-21 17:27:30 UTC
Red Hat Product Errata RHSA-2021:2438 0 None None None 2021-07-27 23:09:17 UTC

Description Jobin A T 2021-05-18 09:00:33 UTC
Description of problem:

HAProxy does not generate its configuration correctly if the "selector" is removed from the service while the old Endpoints remain.


Version-Release number of selected component (if applicable):
Red Hat OpenShift Container Platform (RHOCP) 4.6

How reproducible:
100%

Steps to Reproduce:


1. Prepare the test pod and service as follows.
  $ oc new-project test
  $ oc new-app httpd -n test
  $ oc get pod -o wide -n test
  NAME                     READY   STATUS    RESTARTS   AGE   IP           
  httpd-7c7ccfffdc-wdkvk   1/1     Running   0          66s   10.128.2.8   <--- Pod IP

  $ oc describe svc httpd -n test
  Name:              httpd
  Namespace:         test
  :
  Selector:          deployment=httpd   <--- This selector matches the label on the pod above.
  Type:              ClusterIP
  IP:                172.30.178.250
  Port:              8080-tcp  8080/TCP
  TargetPort:        8080/TCP
  Endpoints:         10.128.2.8:8080    <--- The Endpoints match the pod IP.
  Port:              8443-tcp  8443/TCP
  TargetPort:        8443/TCP
  Endpoints:         10.128.2.8:8443    <--- The Endpoints match the pod IP.

2. Remove the "selector" field.
  $ oc replace -f - <<EOF
  apiVersion: v1
  kind: Service
  metadata:
    labels:
      app: httpd
      app.kubernetes.io/component: httpd
      app.kubernetes.io/instance: httpd
    name: httpd
    namespace: test
  spec:
    clusterIP: 172.30.178.250
    ports:
    - name: 8080-tcp
      port: 8080
      protocol: TCP
      targetPort: 8080
    - name: 8443-tcp
      port: 8443
      protocol: TCP
      targetPort: 8443
    sessionAffinity: None
    type: ClusterIP
  EOF

  $ oc describe svc httpd -n test
  Name:              httpd
  Namespace:         test
  :
  Selector:          <none>             <--- The "selector" was removed, so the Endpoints IPs no longer sync.
  Type:              ClusterIP
  IP:                172.30.178.250
  Port:              8080-tcp  8080/TCP
  TargetPort:        8080/TCP
  Endpoints:         10.128.2.8:8080    <--- The old Endpoints remain.
  Port:              8443-tcp  8443/TCP
  TargetPort:        8443/TCP
  Endpoints:         10.128.2.8:8443    <--- The old Endpoints remain.


3. Check the Endpoints after restarting the test pod.
   The service no longer syncs its Endpoints because the "selector" was removed; this is expected behavior.
  $ oc delete pod httpd-7c7ccfffdc-wdkvk -n test
  $ oc get pod -o wide -n test
  NAME                     READY   STATUS    RESTARTS   AGE   IP        
  httpd-7c7ccfffdc-hd2dj   1/1     Running   0          19s   10.128.2.9 <--- The pod IP changed after the restart.

  $ oc describe svc httpd -n test
  Name:              httpd
  Namespace:         test
  :
  Selector:          <none>             <--- The "selector" was removed, so the Endpoints IPs no longer sync.
  Type:              ClusterIP
  IP:                172.30.178.250
  Port:              8080-tcp  8080/TCP
  TargetPort:        8080/TCP
  Endpoints:         10.128.2.8:8080    <--- The old Endpoints remain.
  Port:              8443-tcp  8443/TCP
  TargetPort:        8443/TCP
  Endpoints:         10.128.2.8:8443    <--- The old Endpoints remain.

4. Expose the service; the issue then reproduces.

  $ oc expose svc httpd -n test
  $ oc logs -n openshift-ingress deploy/router-default
  :
  E0518 06:47:25.288227       1 limiter.go:165] error reloading router: exit status 1
  [ALERT] 137/064725 (221) : parsing [/var/lib/haproxy/conf/haproxy.config:327] : backend 'be_http:test:httpd', another server named 'pod:httpd-7c7ccfffdc-wdkvk:httpd:8080-tcp:10.128.2.8:8080' was already defined at line 326, please use distinct names.
  [ALERT] 137/064725 (221) : Fatal errors found in configuration.

Actual results:

  E0518 06:47:25.288227       1 limiter.go:165] error reloading router: exit status 1
  [ALERT] 137/064725 (221) : parsing [/var/lib/haproxy/conf/haproxy.config:327] : backend 'be_http:test:httpd', another server named 'pod:httpd-7c7ccfffdc-wdkvk:httpd:8080-tcp:10.128.2.8:8080' was already defined at line 326, please use distinct names.
  [ALERT] 137/064725 (221) : Fatal errors found in configuration.

Expected results:


Additional info:

Comment 1 Stephen Greene 2021-05-18 17:47:41 UTC
(In reply to Jobin A T from comment #0)
> HAProxy does not generate its configuration correctly if the "selector" is
> removed from the service while the old Endpoints remain.

As an aside: in OCP 4.6, the router observes EndpointSlice resources rather than Endpoints.

> Version-Release number of selected component (if applicable):
> Red Hat OpenShift Container Platform (RHOCP) 4.6

Was the customer not observing this issue on OCP 4.5? Which 4.6.z version is the customer using, specifically?

> Expected results:

What exactly is the outcome expected by the customer in this situation?
When you remove the label-selector from a service, you are essentially putting the service in an unmanaged state, so updating the relevant endpoint/endpointslice resources to avoid HAProxy backend collisions would be your responsibility (instead of the endpoint/endpointslice mirroring controller's responsibility).


It's not clear what we can do to resolve this issue, since removing the service selector is edging into unsupported territory. Does the customer have different expectations for how the router should behave in this situation?

Tagging need-info as we try to boil this BZ down to a solvable issue (if that's possible). Leaving in NEW state until then.

Comment 2 Daein Park 2021-05-19 06:26:31 UTC
Hi team,

EndpointSlices can be duplicated when the selector is removed; see the output in [1].
In other words, the same endpoint ID can appear more than once, so duplicate server records are added, as shown in the debug logs [2].
This causes the router pods to fail on the next reload, as follows.

  [ALERT] 138/033357 (18) : parsing [/var/lib/haproxy/conf/haproxy.config:327] : backend 'be_http:test:httpd', another server named 'pod:httpd-7c7ccfffdc-fc294:httpd:8080-tcp:10.128.2.8:8080' was already defined at line 326, please use distinct names.
  [ALERT] 138/033357 (18) : Fatal errors found in configuration.
  I0519 03:34:30.934144       1 template.go:690] router "msg"="Shutdown requested, waiting 45s for new connections to cease"

As of the latest k8s (1.21; I'm not sure whether the upstream backport will be applied to OCP v4), the endpointslice controller fix [0] may suppress this issue.

  [0] Updating EndpointSlice controllers to avoid duplicate creations
    - https://github.com/kubernetes/kubernetes/pull/100103

I think that, regardless of this issue, for stable operation of the router it would be better to check whether an endpoint ID is duplicated before adding server records to "haproxy.config".
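The duplicate check proposed here can also be illustrated outside the router: HAProxy rejects a backend when the same server name appears twice, and such names can be detected before a reload is attempted. A minimal shell sketch (the sample config path and contents below are hypothetical, modeled on the error in this report):

```shell
# Build a tiny sample backend with an accidental duplicate server line
# (illustrative data, not copied from a real cluster).
cat > /tmp/haproxy-sample.cfg <<'EOF'
backend be_http:test:httpd
  server pod:httpd-7c7ccfffdc-wdkvk:httpd:8080-tcp:10.128.2.8:8080 10.128.2.8:8080 check
  server pod:httpd-7c7ccfffdc-wdkvk:httpd:8080-tcp:10.128.2.8:8080 10.128.2.8:8080 check
EOF

# Print any server name that occurs more than once -- exactly the entries
# HAProxy rejects with "please use distinct names".
awk '$1 == "server" { print $2 }' /tmp/haproxy-sample.cfg | sort | uniq -d
```

Here this prints the duplicated pod server name once; an empty result would mean the config has no colliding server names.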


[1] As soon as the "selector" field is removed, a duplicate endpointslice is generated.
    The two EndpointSlices have different origins:
    the original one is generated from the Service, while the other is generated from the Endpoints (for manual Endpoints management).
$ oc get endpointslice,endpoints -o wide
NAME                                         ADDRESSTYPE   PORTS       ENDPOINTS    AGE
endpointslice.discovery.k8s.io/httpd-5sg47   IPv4          8443,8080   10.128.2.8   19m  <--- OwnerRef is the Endpoints; generated after the "selector" was removed.
endpointslice.discovery.k8s.io/httpd-qr7hh   IPv4          8443,8080   10.128.2.8   17h  <--- OwnerRef is the Service.

NAME              ENDPOINTS                         AGE
endpoints/httpd   10.128.2.8:8080,10.128.2.8:8443   17h

[2] The duplicated endpointslices were both processed (see the "processing subset" logs), which causes the parsing error.
I0519 01:09:57.252744       7 plugin.go:178] template "msg"="processing endpoints"  "endpointCount"=2 "eventType"="MODIFIED" "name"="httpd" "namespace"="test"
I0519 01:09:57.252802       7 plugin.go:181] template "msg"="processing subset"  "index"=0 "subset"={"addresses":[{"ip":"10.128.2.8","targetRef":{"kind":"Pod","namespace":"test","name":"httpd-7c7ccfffdc-fc294","uid":"4f918dc5-d020-44c9-ba7c-6e87009f33f0","resourceVersion":"105002"}}],"ports":[{"name":"8080-tcp","port":8080,"protocol":"TCP"},{"name":"8443-tcp","port":8443,"protocol":"TCP"}]}
I0519 01:09:57.252826       7 plugin.go:181] template "msg"="processing subset"  "index"=1 "subset"={"addresses":[{"ip":"10.128.2.8","targetRef":{"kind":"Pod","namespace":"test","name":"httpd-7c7ccfffdc-fc294","uid":"4f918dc5-d020-44c9-ba7c-6e87009f33f0","resourceVersion":"105002"}}],"ports":[{"name":"8080-tcp","port":8080,"protocol":"TCP"},{"name":"8443-tcp","port":8443,"protocol":"TCP"}]}
I0519 01:09:57.252846       7 plugin.go:190] template "msg"="modifying endpoints"  "key"="test/httpd"
I0519 01:09:57.252923       7 router.go:445] template "msg"="writing the router config"  
I0519 01:09:57.252984       7 router.go:499] template "msg"="committing router certificate manager changes..."  
I0519 01:09:57.252998       7 router.go:504] template "msg"="router certificate manager config committed"  
I0519 01:09:57.257998       7 router.go:455] template "msg"="calling reload function"  "fn"=0
I0519 01:09:57.259462       7 router.go:459] template "msg"="reloading the router"  
E0519 01:09:57.277337       7 limiter.go:165] error reloading router: exit status 1
[ALERT] 138/010957 (80) : parsing [/var/lib/haproxy/conf/haproxy.config:327] : backend 'be_http:test:httpd', another server named 'pod:httpd-7c7ccfffdc-fc294:httpd:8080-tcp:10.128.2.8:8080' was already defined at line 326, please use distinct names.

Comment 3 Daein Park 2021-05-19 08:36:32 UTC
FYI, I've also opened PR here: https://github.com/openshift/router/pull/285

Comment 4 Stephen Greene 2021-05-19 13:52:16 UTC
(In reply to Daein Park from comment #3)
> FYI, I've also opened PR here: https://github.com/openshift/router/pull/285

Thanks for sharing a potential patch! I will check with my team and see if this is the approach we want to take.


In the meantime, could you include the full endpointslice yaml for the 2 endpointslices you mention in Comment 2 (this will help me better understand the issue)?
Can you safely delete the endpointslice that has the service as its owner ref? Is that at least a workaround here?
Again, deleting the service selector from a clusterIP service does not seem like a supported/typical flow, but I suppose we can better handle the outcome of doing so.


Also, note that https://github.com/kubernetes/kubernetes/pull/100103 should be included in OCP 4.8. If I get around to it, I will try to reproduce this issue on 4.6 and see if I can also reproduce this issue on 4.8 latest. It's not clear to me how that upstream patch would resolve the issue, but I agree that it could perhaps make a difference.

Comment 5 Daein Park 2021-05-19 15:23:25 UTC
@Stephen Greene Thank you for your prompt update.

> In the meantime, could you include the full endpointslice yaml for the 2 endpointslices you mention in Comment 2 (this will help me better understand the issue)?

Please refer [0].

> Can you safely delete the endpointslice that has the service as it's owner ref? Is that at least a workaround here?

Yes, you're right. The workaround is to remove the extra, unmanaged endpointslice.
In detail, which slice remains depends on the state of the "selector" field in the existing service.

  1. If the "selector" was removed from the service, then after deleting all endpointslices, only the slice with an Endpoints OwnerRef is regenerated automatically.
     Refer to [1] for my testing result.

  2. If the "selector" is set on the service again, then after deleting all endpointslices, only the slice with a Service OwnerRef is regenerated automatically.
     Refer to [2] for my testing result.
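The two slices can be told apart by their `endpointslice.kubernetes.io/managed-by` label: `endpointslice-controller.k8s.io` marks the Service-owned slice, while `endpointslicemirroring-controller.k8s.io` marks the one mirrored from the Endpoints object. A small shell sketch of that triage (the name/label pairs below are sample data modeled on the slices in [0]; on a live cluster they would come from `oc get endpointslices`):

```shell
# Sample "name managed-by" pairs, modeled on the two slices shown in [0].
cat > /tmp/slices.txt <<'EOF'
httpd-4hf9v endpointslice-controller.k8s.io
httpd-9bw7j endpointslicemirroring-controller.k8s.io
EOF

# The slice managed by the mirroring controller is the one mirrored from the
# Endpoints object -- the duplicate to remove when the selector is restored.
awk '$2 == "endpointslicemirroring-controller.k8s.io" { print $1 }' /tmp/slices.txt
```

With the sample data above, this prints only the mirrored slice's name (httpd-9bw7j).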

> Again, deleting the service selector from a clusterIP service does not seem like a supported/typical flow, but I suppose we can better handle the outcome of doing so.

I agree with you. It is unusual, and removing the "selector" means the user takes on managing their own endpoints.
However, this process of enabling self-managed endpoints can affect a running router, as in this issue, which is an unexpected result for users.
If possible, we should either harden this process in the router or provide documentation that helps users avoid this unexpected issue.

> Also, note that https://github.com/kubernetes/kubernetes/pull/100103 should be included in OCP 4.8. If I get around to it, I will try to reproduce this issue on 4.6 and see if I can also reproduce this issue on 4.8 latest. It's not clear to me how that upstream patch would resolve the issue, but I agree that it could perhaps make a difference.

As you mentioned, it may not be related to this issue; it appears to only address redundant endpointslice creation...
But I think it is worth testing, because that fix changes the current endpointslice creation behavior.


[0] endpointslices
$ oc get endpointslices
NAME          ADDRESSTYPE   PORTS       ENDPOINTS     AGE
httpd-4hf9v   IPv4          8443,8080   10.129.2.16   109m  
httpd-9bw7j   IPv4          8443,8080   10.129.2.16   6s     <--- Generated after removing "selector" field in the service.

$ oc get endpointslices -o yaml
apiVersion: v1
items:
- addressType: IPv4
  apiVersion: discovery.k8s.io/v1beta1
  endpoints:
  - addresses:
    - 10.129.2.16
    conditions:
      ready: true
    targetRef:
      kind: Pod
      name: httpd-7c7ccfffdc-mqps8
      namespace: test
      resourceVersion: "242425"
      uid: 671865b5-f7e9-41fd-911e-cf53b8fc6e2c
    topology:
      kubernetes.io/hostname: ip-10-0-186-33.ap-northeast-1.compute.internal
      topology.kubernetes.io/region: ap-northeast-1
      topology.kubernetes.io/zone: ap-northeast-1a
  kind: EndpointSlice
  metadata:
    annotations:
      endpoints.kubernetes.io/last-change-trigger-time: "2021-05-18T08:03:23Z"
    creationTimestamp: "2021-05-19T12:45:39Z"
    generateName: httpd-
    generation: 2
    labels:
      endpointslice.kubernetes.io/managed-by: endpointslice-controller.k8s.io
      kubernetes.io/service-name: httpd
      manager: kube-controller-manager
      operation: Update
      time: "2021-05-19T14:30:38Z"
    name: httpd-4hf9v
    namespace: test
    ownerReferences:
    - apiVersion: v1
      blockOwnerDeletion: true
      controller: true
      kind: Service
      name: httpd
      uid: 07f8582e-969b-4f52-bd74-f73a56a8bfae
    resourceVersion: "243485"
    selfLink: /apis/discovery.k8s.io/v1beta1/namespaces/test/endpointslices/httpd-4hf9v
    uid: f8b63b4a-fd61-452b-b3b4-0c5e75e6d84a
  ports:
  - name: 8443-tcp
    port: 8443
    protocol: TCP
  - name: 8080-tcp
    port: 8080
    protocol: TCP
- addressType: IPv4
  apiVersion: discovery.k8s.io/v1beta1
  endpoints:
  - addresses:
    - 10.129.2.16
    conditions:
      ready: true
    targetRef:
      kind: Pod
      name: httpd-7c7ccfffdc-mqps8
      namespace: test
      resourceVersion: "242425"
      uid: 671865b5-f7e9-41fd-911e-cf53b8fc6e2c
    topology:
      kubernetes.io/hostname: ip-10-0-186-33.ap-northeast-1.compute.internal
  kind: EndpointSlice
  metadata:
    creationTimestamp: "2021-05-19T14:34:34Z"
    generateName: httpd-
    generation: 1
    labels:
      app: httpd
      app.kubernetes.io/component: httpd
      app.kubernetes.io/instance: httpd
      endpointslice.kubernetes.io/managed-by: endpointslicemirroring-controller.k8s.io
      kubernetes.io/service-name: httpd
      manager: kube-controller-manager
      operation: Update
      time: "2021-05-19T14:34:34Z"
    name: httpd-9bw7j
    namespace: test
    ownerReferences:
    - apiVersion: v1
      blockOwnerDeletion: true
      controller: true
      kind: Endpoints
      name: httpd
      uid: 0291165b-95cf-403c-8f06-2e39c1ab81fb
    resourceVersion: "245803"
    selfLink: /apis/discovery.k8s.io/v1beta1/namespaces/test/endpointslices/httpd-9bw7j
    uid: 02c9837a-7568-4d91-9dd0-948c1cc4a07c
  ports:
  - name: 8443-tcp
    port: 8443
    protocol: TCP
  - name: 8080-tcp
    port: 8080
    protocol: TCP
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

[1] 
$ oc delete endpointslice httpd-4hf9v httpd-9bw7j
endpointslice.discovery.k8s.io "httpd-4hf9v" deleted
endpointslice.discovery.k8s.io "httpd-9bw7j" deleted

$ oc get endpointslice
NAME          ADDRESSTYPE   PORTS       ENDPOINTS     AGE
httpd-qw8k4   IPv4          8443,8080   10.129.2.16   2s
$ oc get endpointslice httpd-qw8k4 -o yaml
addressType: IPv4
apiVersion: discovery.k8s.io/v1beta1
endpoints:
- addresses:
  - 10.129.2.16
  conditions:
    ready: true
  targetRef:
    kind: Pod
    name: httpd-7c7ccfffdc-mqps8
    namespace: test
    resourceVersion: "242425"
    uid: 671865b5-f7e9-41fd-911e-cf53b8fc6e2c
  topology:
    kubernetes.io/hostname: ip-10-0-186-33.ap-northeast-1.compute.internal
kind: EndpointSlice
metadata:
  creationTimestamp: "2021-05-19T14:39:35Z"
  generateName: httpd-
  generation: 1
  labels:
    app: httpd
    app.kubernetes.io/component: httpd
    app.kubernetes.io/instance: httpd
    endpointslice.kubernetes.io/managed-by: endpointslicemirroring-controller.k8s.io
    kubernetes.io/service-name: httpd
    manager: kube-controller-manager
    operation: Update
    time: "2021-05-19T14:39:35Z"
  name: httpd-qw8k4
  namespace: test
  ownerReferences:
  - apiVersion: v1
    blockOwnerDeletion: true
    controller: true
    kind: Endpoints                            <--- Endpoints ownerRef
    name: httpd
    uid: 0291165b-95cf-403c-8f06-2e39c1ab81fb
  resourceVersion: "247199"
  selfLink: /apis/discovery.k8s.io/v1beta1/namespaces/test/endpointslices/httpd-qw8k4
  uid: 9f0ba243-2f3e-4333-b24f-cba8a5491dd5
ports:
- name: 8443-tcp
  port: 8443
  protocol: TCP
- name: 8080-tcp
  port: 8080
  protocol: TCP
  
$ oc get endpoints
NAME    ENDPOINTS                           AGE
httpd   10.129.2.16:8080,10.129.2.16:8443   30h

[2]
$ oc get endpointslice
NAME          ADDRESSTYPE   PORTS       ENDPOINTS     AGE
httpd-mk9gg   IPv4          8443,8080   10.129.2.16   9s
httpd-qw8k4   IPv4          8443,8080   10.129.2.16   10m
$ oc delete endpointslice httpd-mk9gg httpd-qw8k4
endpointslice.discovery.k8s.io "httpd-mk9gg" deleted
endpointslice.discovery.k8s.io "httpd-qw8k4" deleted

$ oc get endpointslice
NAME          ADDRESSTYPE   PORTS       ENDPOINTS     AGE
httpd-vlw5r   IPv4          8443,8080   10.129.2.16   2s

$ oc get endpointslice httpd-vlw5r -o yaml
addressType: IPv4
apiVersion: discovery.k8s.io/v1beta1
endpoints:
- addresses:
  - 10.129.2.16
  conditions:
    ready: true
  targetRef:
    kind: Pod
    name: httpd-7c7ccfffdc-mqps8
    namespace: test
    resourceVersion: "242425"
    uid: 671865b5-f7e9-41fd-911e-cf53b8fc6e2c
  topology:
    kubernetes.io/hostname: ip-10-0-186-33.ap-northeast-1.compute.internal
    topology.kubernetes.io/region: ap-northeast-1
    topology.kubernetes.io/zone: ap-northeast-1a
kind: EndpointSlice
metadata:
  creationTimestamp: "2021-05-19T14:50:38Z"
  generateName: httpd-
  generation: 1
  labels:
    endpointslice.kubernetes.io/managed-by: endpointslice-controller.k8s.io
    kubernetes.io/service-name: httpd
    manager: kube-controller-manager
    operation: Update
    time: "2021-05-19T14:50:38Z"
  name: httpd-vlw5r
  namespace: test
  ownerReferences:
  - apiVersion: v1
    blockOwnerDeletion: true
    controller: true
    kind: Service                                          <--- Service ownerRef
    name: httpd
    uid: 07f8582e-969b-4f52-bd74-f73a56a8bfae
  resourceVersion: "250314"
  selfLink: /apis/discovery.k8s.io/v1beta1/namespaces/test/endpointslices/httpd-vlw5r
  uid: e124efcc-a57f-439f-8d31-3ddf5ed4c6f0
ports:
- name: 8443-tcp
  port: 8443
  protocol: TCP
- name: 8080-tcp
  port: 8080
  protocol: TCP
  
$ oc get endpoints
NAME    ENDPOINTS                           AGE
httpd   10.129.2.16:8080,10.129.2.16:8443   30h

Comment 6 Stephen Greene 2021-05-19 19:18:29 UTC
Thanks for sharing your test cases in detail.

I was able to reproduce this issue on 4.8.0-0.ci-2021-05-19-081203 using the following trivial reproducer:


oc new-project test
oc create -f https://raw.githubusercontent.com/openshift/origin/master/examples/hello-openshift/hello-pod.json
oc expose pod/hello-openshift
oc expose service/hello-openshift
oc patch service hello-openshift --patch '{"spec":{"selector":null}}'

Then observe the reload failure in the router pod logs.


> I agree. It's unusual, and removing the "selector" means the user manages their own endpoints. However, this process can affect a running router, as in this issue, which is unexpected for users.
> If possible, we should either harden this process in the router or provide documentation to avoid this unexpected issue.

I agree. I will work with my team to decide the best path forward here and get back to you soon. Thank you for your patience as we work through higher priority issues.

Note that right now in OCP, the customer could take advantage of the following metric:

`template_router_reload_failure` should return `1` if any router pods are currently in a "wedged" state, meaning the router pod cannot reload into the newest HAProxy configuration.

This metric is leveraged by the `HAProxyReloadFail` alert, which should fire if the `template_router_reload_failure` metric is returning `1` for at least 5 minutes. 
So, if the customer is concerned about other edge cases causing issues like this in the future, they can simply monitor these metrics/alerts.
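For reference, the wedged-router condition can also be spotted directly in the console with a query along these lines (a sketch using the metric named above; the shipped `HAProxyReloadFail` alert rule may be expressed differently):

```promql
template_router_reload_failure == 1
```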

Thanks!

Comment 7 Daein Park 2021-05-19 23:55:11 UTC
> I was able to reproduce this issue on 4.8.0-0.ci-2021-05-19-081203 using the following trivial reproducer:

Thank you for your testing. Unfortunately, the upstream fix does not help with this issue.

> This metric is leveraged by the `HAProxyReloadFail` alert, which should fire if the `template_router_reload_failure` metric is returning `1` for at least 5 minutes. 
> So, if the customer is concerned about other edge cases causing issues like this in the future, than can simply monitor these metrics/alerts.

Great suggestion! It will be helpful for better management of the router configuration!

Thank you again for your information and work !

Comment 8 Stephen Greene 2021-05-20 16:59:00 UTC
I spoke with my team, and we have decided that it would be wise to merge something along the lines of https://github.com/openshift/router/pull/285.

When I have time, I will review the proposed patch and check whether it has any performance implications (or whether there is a better way to implement the fix in general).
Thanks!
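For what it's worth, the shape of such a fix is simple: when emitting a backend, skip any server line whose name has already been written. A minimal shell analogue of that filtering (the lines below are illustrative only; the actual change lives in the router's Go code and templates, see PR 285):

```shell
# Two identical server lines, as the router produced them from the duplicate
# endpointslices; keep only the first occurrence of each server name ($2).
printf '%s\n' \
  'server pod:httpd-7c7ccfffdc-fc294:httpd:8080-tcp:10.128.2.8:8080 10.128.2.8:8080' \
  'server pod:httpd-7c7ccfffdc-fc294:httpd:8080-tcp:10.128.2.8:8080 10.128.2.8:8080' |
awk '!seen[$2]++'
```

With the duplicate input above, only a single server line survives, so the written config parses cleanly.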

Comment 10 Arvind iyengar 2021-05-24 10:28:56 UTC
Verified in the "4.8.0-0.nightly-2021-05-21-233425" payload. With this release, there are no more router reload errors when the selector is removed from a service mapped to a route:
-----
oc get clusterversion                          
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-05-21-233425   True        False         4h16m   Cluster version is 4.8.0-0.nightly-2021-05-21-233425


oc get all                                        
NAME                  READY   STATUS    RESTARTS   AGE
pod/hello-openshift   1/1     Running   0          93m

NAME                      TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)    AGE
service/hello-openshift   ClusterIP   172.30.92.91   <none>        8080/TCP   93m

NAME                                       HOST/PORT                                                             PATH   SERVICES          PORT   TERMINATION   WILDCARD
route.route.openshift.io/hello-openshift   hello-openshift-test1.apps.aiyengar4824.qe.devcluster.openshift.com          hello-openshift   8080                 None


apiVersion: v1
kind: Service
metadata:
  creationTimestamp: "2021-05-24T08:53:50Z"
  labels:
    name: hello-openshift
  name: hello-openshift
  namespace: test1
  resourceVersion: "97341"
  uid: 0a4c84e9-af89-46a4-86bd-787a9aaeebb3
spec:
  clusterIP: 172.30.92.91
  clusterIPs:
  - 172.30.92.91
  ipFamilies:
  - IPv4
  ipFamilyPolicy: SingleStack
  ports:
  - port: 8080
    protocol: TCP
    targetPort: 8080
  sessionAffinity: None
  type: ClusterIP
status:
  loadBalancer: {}


oc -n openshift-ingress logs router-default-56b4fbb5ff-f4nrd --tail 50
I0524 05:47:04.382092       1 template.go:433] router "msg"="starting router"  "version"="majorFromGit: \nminorFromGit: \ncommitFromGit: c7b3985da3d1341fdac33f4d6bb6994fe29d32b7\nversionFromGit: 4.0.0-299-gc7b3985d\ngitTreeState: clean\nbuildDate: 2021-05-21T18:48:49Z\n"
I0524 05:47:04.383766       1 metrics.go:155] metrics "msg"="router health and metrics port listening on HTTP and HTTPS"  "address"="0.0.0.0:1936"
I0524 05:47:04.388599       1 router.go:191] template "msg"="creating a new template router"  "writeDir"="/var/lib/haproxy"
I0524 05:47:04.388679       1 router.go:270] template "msg"="router will coalesce reloads within an interval of each other"  "interval"="5s"
I0524 05:47:04.389386       1 router.go:332] template "msg"="watching for changes"  "path"="/etc/pki/tls/private"
I0524 05:47:04.389442       1 router.go:262] router "msg"="router is including routes in all namespaces"  
E0524 05:47:04.494692       1 haproxy.go:418] can't scrape HAProxy: dial unix /var/lib/haproxy/run/haproxy.sock: connect: no such file or directory
I0524 05:47:04.530597       1 router.go:579] template "msg"="router reloaded"  "output"=" - Proxy protocol on, checking http://localhost:80 ...\n - Health check ok : 0 retry attempt(s).\n"
I0524 05:47:09.521271       1 router.go:579] template "msg"="router reloaded"  "output"=" - Proxy protocol on, checking http://localhost:80 ...\n - Health check ok : 0 retry attempt(s).\n"
I0524 05:47:35.597216       1 router.go:579] template "msg"="router reloaded"  "output"=" - Proxy protocol on, checking http://localhost:80 ...\n - Health check ok : 0 retry attempt(s).\n"
I0524 05:47:40.569203       1 router.go:579] template "msg"="router reloaded"  "output"=" - Proxy protocol on, checking http://localhost:80 ...\n - Health check ok : 0 retry attempt(s).\n"
I0524 05:47:45.606777       1 router.go:579] template "msg"="router reloaded"  "output"=" - Proxy protocol on, checking http://localhost:80 ...\n - Health check ok : 0 retry attempt(s).\n"
I0524 05:54:19.025876       1 router.go:579] template "msg"="router reloaded"  "output"=" - Proxy protocol on, checking http://localhost:80 ...\n - Health check ok : 0 retry attempt(s).\n"
I0524 05:54:35.750379       1 router.go:579] template "msg"="router reloaded"  "output"=" - Proxy protocol on, checking http://localhost:80 ...\n - Health check ok : 0 retry attempt(s).\n"
I0524 05:55:09.203483       1 router.go:579] template "msg"="router reloaded"  "output"=" - Proxy protocol on, checking http://localhost:80 ...\n - Health check ok : 0 retry attempt(s).\n"
I0524 05:55:14.196201       1 router.go:579] template "msg"="router reloaded"  "output"=" - Proxy protocol on, checking http://localhost:80 ...\n - Health check ok : 0 retry attempt(s).\n"
I0524 06:07:52.951984       1 router.go:579] template "msg"="router reloaded"  "output"=" - Proxy protocol on, checking http://localhost:80 ...\n - Health check ok : 0 retry attempt(s).\n"
I0524 06:08:25.395947       1 router.go:579] template "msg"="router reloaded"  "output"=" - Proxy protocol on, checking http://localhost:80 ...\n - Health check ok : 0 retry attempt(s).\n"
I0524 06:08:30.377415       1 router.go:579] template "msg"="router reloaded"  "output"=" - Proxy protocol on, checking http://localhost:80 ...\n - Health check ok : 0 retry attempt(s).\n"
I0524 06:08:55.400129       1 router.go:579] template "msg"="router reloaded"  "output"=" - Proxy protocol on, checking http://localhost:80 ...\n - Health check ok : 0 retry attempt(s).\n"
------

Comment 14 Stephen Greene 2021-07-08 23:07:07 UTC
xref https://github.com/kubernetes/kubernetes/issues/103576

Comment 16 errata-xmlrpc 2021-07-27 23:08:55 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

Comment 17 Jie Wu 2021-09-14 04:35:15 UTC
The customer hit the same error in an OCP 4.7.13 environment.

The fix commit was added on 2021-05-22.
https://github.com/openshift/router/commit/8e5e70b4164d4fc2f2515d431892f2b1c803f0ed

I confirmed that openshift4/ose-haproxy-router 4.7.16 (buildDate: 2021-06-03T23:22:11Z) and later include the fix for this issue.

OpenShift Container Platform 4.7.16 container image list
https://access.redhat.com/solutions/6115681

# podman run -it --entrypoint=/usr/bin/openshift-router  openshift4/ose-haproxy-router:v4.7.0-202106032231.p0.git.5a0e656 version
openshift-router

majorFromGit: 
minorFromGit: 
commitFromGit: 5a0e6561b0480df9f32a8ef87a54a1dc4cf91b93
versionFromGit: 4.0.0-268-g5a0e6561
gitTreeState: clean
buildDate: 2021-06-03T23:22:11Z

