Bug 1944851

Summary: List of ingress routes not cleaned up when routers no longer exist - take 2
Product: OpenShift Container Platform Reporter: Andreas Karis <akaris>
Component: Networking    Assignee: Grant Spence <gspence>
Networking sub component: router QA Contact: Shudi Li <shudili>
Status: CLOSED ERRATA Docs Contact:
Severity: low    
Priority: low CC: alexander, amcdermo, aos-bugs, gspence, hongli, mjoseph, mmasters
Version: 4.7   
Target Milestone: ---   
Target Release: 4.11.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Cause: If the ingress controller that admitted a route is deleted, or sharding configuration is added that excludes the route, the route's status still reports it as admitted, which is incorrect.
Consequence: The stale status misleads users into thinking the route is still admitted when it is not.
Fix: The ingress operator now clears the route's status when the route is un-admitted.
Result: When an ingress controller is deleted, or is updated to shard away a route, the route's status is cleared.
Story Points: ---
Clone Of: Environment:
Last Closed: 2022-08-10 10:36:17 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Andreas Karis 2021-03-30 19:38:47 UTC
A customer created a second IngressController for a different domain but forgot to apply a route or namespace selector.
As a result, all routes in all projects were admitted by the second router.
They then deleted this controller, but the routes' status was never updated, and they are not able to edit the status field themselves.

This can easily be reproduced in a lab:
~~~
cat <<'EOF' | oc apply -f -
apiVersion: operator.openshift.io/v1
kind: IngressController
metadata:
  name: sharded
  namespace: openshift-ingress-operator
spec:
  domain: shard.ipi-cluster.example.com
  nodePlacement:
    nodeSelector:
      matchLabels:
        node-role.kubernetes.io/worker: ""
EOF
oc scale ingresscontroller -n openshift-ingress-operator --replicas 1 sharded
oc scale ingresscontroller -n openshift-ingress-operator --replicas 1 default
~~~
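
One way to check that both router deployments have come up is simply to list the router pods (illustrative; pod names vary per cluster):
~~~
oc -n openshift-ingress get pods
~~~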

Wait until both ingress routers come up, then delete the new sharded router:
~~~
oc delete -n openshift-ingress-operator ingresscontroller sharded
~~~

All routes still show two entries in their status field (note the "... 1 more" in the HOST/PORT column); the stale entry exists only in the status field:
~~~
[root@openshift-jumpserver-0 ~]# oc get routes -A
NAMESPACE                  NAME                HOST/PORT                                                                        PATH   SERVICES            PORT    TERMINATION            WILDCARD
openshift-authentication   oauth-openshift     oauth-openshift.apps.ipi-cluster.example.com ... 1 more                                 oauth-openshift     6443    passthrough/Redirect   None
openshift-console          console             console-openshift-console.apps.ipi-cluster.example.com ... 1 more                       console             https   reencrypt/Redirect     None
openshift-console          downloads           downloads-openshift-console.apps.ipi-cluster.example.com ... 1 more                     downloads           http    edge/Redirect          None
openshift-ingress-canary   canary              canary-openshift-ingress-canary.apps.ipi-cluster.example.com ... 1 more                 ingress-canary      8080    edge/Redirect          None
openshift-monitoring       alertmanager-main   alertmanager-main-openshift-monitoring.apps.ipi-cluster.example.com ... 1 more          alertmanager-main   web     reencrypt/Redirect     None
openshift-monitoring       grafana             grafana-openshift-monitoring.apps.ipi-cluster.example.com ... 1 more                    grafana             https   reencrypt/Redirect     None
openshift-monitoring       prometheus-k8s      prometheus-k8s-openshift-monitoring.apps.ipi-cluster.example.com ... 1 more             prometheus-k8s      web     reencrypt/Redirect     None
openshift-monitoring       thanos-querier      thanos-querier-openshift-monitoring.apps.ipi-cluster.example.com ... 1 more             thanos-querier      web     reencrypt/Redirect     None
~~~

~~~
[root@openshift-jumpserver-0 ~]# oc get route -o json -n openshift-authentication   oauth-openshift | jq '.status'
{
  "ingress": [
    {
      "conditions": [
        {
          "lastTransitionTime": "2021-03-17T12:48:44Z",
          "status": "True",
          "type": "Admitted"
        }
      ],
      "host": "oauth-openshift.apps.ipi-cluster.example.com",
      "routerCanonicalHostname": "apps.ipi-cluster.example.com",
      "routerName": "default",
      "wildcardPolicy": "None"
    },
    {
      "conditions": [
        {
          "lastTransitionTime": "2021-03-30T18:25:55Z",
          "status": "True",
          "type": "Admitted"
        }
      ],
      "host": "oauth-openshift.apps.ipi-cluster.example.com",
      "routerCanonicalHostname": "shard.ipi-cluster.example.com",
      "routerName": "sharded",
      "wildcardPolicy": "None"
    }
  ]
}
~~~
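
For a quick overview of which routes still carry a status entry from the now-deleted router, something along these lines works (illustrative; assumes jq is installed and the router name is "sharded"):
~~~
oc get routes -A -o json \
  | jq -r '.items[]
           | select([.status.ingress[]?.routerName] | index("sharded"))
           | .metadata.namespace + "/" + .metadata.name'
~~~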

There is an old bug for this, reported back in the OCP 3.x days: https://bugzilla.redhat.com/show_bug.cgi?id=1356819

The solution at the time was to clean up the routes manually, as documented here:
https://docs.openshift.com/container-platform/3.11/architecture/networking/routes.html#route-status-field

using this script:
https://github.com/openshift/origin/blob/release-3.11/images/router/clear-route-status.sh

That script, however, does not work in OCP 4.x:
~~~
[root@openshift-jumpserver-0 ~]# oc get routes -A
NAMESPACE                  NAME                HOST/PORT                                                                        PATH   SERVICES            PORT    TERMINATION            WILDCARD
openshift-authentication   oauth-openshift     oauth-openshift.apps.ipi-cluster.example.com ... 1 more                                 oauth-openshift     6443    passthrough/Redirect   None
openshift-console          console             console-openshift-console.apps.ipi-cluster.example.com ... 1 more                       console             https   reencrypt/Redirect     None
openshift-console          downloads           downloads-openshift-console.apps.ipi-cluster.example.com ... 1 more                     downloads           http    edge/Redirect          None
openshift-ingress-canary   canary              canary-openshift-ingress-canary.apps.ipi-cluster.example.com ... 1 more                 ingress-canary      8080    edge/Redirect          None
openshift-monitoring       alertmanager-main   alertmanager-main-openshift-monitoring.apps.ipi-cluster.example.com ... 1 more          alertmanager-main   web     reencrypt/Redirect     None
openshift-monitoring       grafana             grafana-openshift-monitoring.apps.ipi-cluster.example.com ... 1 more                    grafana             https   reencrypt/Redirect     None
openshift-monitoring       prometheus-k8s      prometheus-k8s-openshift-monitoring.apps.ipi-cluster.example.com ... 1 more             prometheus-k8s      web     reencrypt/Redirect     None
openshift-monitoring       thanos-querier      thanos-querier-openshift-monitoring.apps.ipi-cluster.example.com ... 1 more             thanos-querier      web     reencrypt/Redirect     None
[root@openshift-jumpserver-0 ~]# 
[root@openshift-jumpserver-0 ~]# 
[root@openshift-jumpserver-0 ~]# 
[root@openshift-jumpserver-0 ~]# bash  ./clear-route-status.sh openshift-authentication ALL
Error from server (NotFound): the server could not find the requested resource
~~~
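
The NotFound error is expected: the legacy /oapi/v1 endpoint that the 3.x script talks to no longer exists on OCP 4, where routes are served by the route.openshift.io/v1 API group instead. A quick illustration of this, assuming oc proxy is still running on localhost:8001:
~~~
curl -s -o /dev/null -w '%{http_code}\n' http://localhost:8001/oapi/v1/                     # expect 404 on OCP 4.x
curl -s -o /dev/null -w '%{http_code}\n' http://localhost:8001/apis/route.openshift.io/v1/  # expect 200
~~~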

Comment 1 Andreas Karis 2021-03-30 19:39:17 UTC
I adjusted the cleanup script for OCP 4 [1]. There was only a minor change to make [0].

Save the script from [1] into a file named clear-route-status-ocp4.sh.

Then, run:
~~~
namespaces=$(oc get routes -A | tail -n+2 | awk '{print $1}' | uniq); for n in $namespaces ; do bash ./clear-route-status-ocp4.sh $n ALL ; done
~~~

The output will be:
~~~
[root@openshift-jumpserver-0 ~]# namespaces=$(oc get routes -A | tail -n+2 | awk '{print $1}' | uniq); for n in $namespaces ; do bash ./clear-route-status-ocp4.sh $n ALL ; done
route status for route oauth-openshift in namespace openshift-authentication cleared
route status for route console in namespace openshift-console cleared
route status for route downloads in namespace openshift-console cleared
route status for route canary in namespace openshift-ingress-canary cleared
route status for route alertmanager-main in namespace openshift-monitoring cleared
route status for route grafana in namespace openshift-monitoring cleared
route status for route prometheus-k8s in namespace openshift-monitoring cleared
route status for route thanos-querier in namespace openshift-monitoring cleared
[root@openshift-jumpserver-0 ~]# oc get routes -A
NAMESPACE                  NAME                HOST/PORT                                                             PATH   SERVICES            PORT    TERMINATION            WILDCARD
openshift-authentication   oauth-openshift     oauth-openshift.apps.ipi-cluster.example.com                                 oauth-openshift     6443    passthrough/Redirect   None
openshift-console          console             console-openshift-console.apps.ipi-cluster.example.com                       console             https   reencrypt/Redirect     None
openshift-console          downloads           downloads-openshift-console.apps.ipi-cluster.example.com                     downloads           http    edge/Redirect          None
openshift-ingress-canary   canary              canary-openshift-ingress-canary.apps.ipi-cluster.example.com                 ingress-canary      8080    edge/Redirect          None
openshift-monitoring       alertmanager-main   alertmanager-main-openshift-monitoring.apps.ipi-cluster.example.com          alertmanager-main   web     reencrypt/Redirect     None
openshift-monitoring       grafana             grafana-openshift-monitoring.apps.ipi-cluster.example.com                    grafana             https   reencrypt/Redirect     None
openshift-monitoring       prometheus-k8s      prometheus-k8s-openshift-monitoring.apps.ipi-cluster.example.com             prometheus-k8s      web     reencrypt/Redirect     None
openshift-monitoring       thanos-querier      thanos-querier-openshift-monitoring.apps.ipi-cluster.example.com             thanos-querier      web     reencrypt/Redirect     None
~~~



---------------------------------------------------------------------------

[0] There is only a minor difference from the original script:
~~~
[root@openshift-jumpserver-0 ~]# diff -u clear-route-status*
--- clear-route-status-ocp4.sh	2021-03-30 19:22:33.944568703 +0000
+++ clear-route-status.sh	2021-03-30 18:46:49.407568703 +0000
@@ -11,14 +11,10 @@
 function clear_status() {
     local namespace="${1}"
     local route_name="${2}"
-    local my_json_blob; my_json_blob=$(oc get route -n ${namespace} ${route_name} -o json)
-    local modified_json; modified_json=$(echo "${my_json_blob}" | jq -c 'del(.status.ingress)')
-    curl -s -X PUT http://localhost:8001/apis/route.openshift.io/v1/namespaces/${namespace}/routes/"${route_name}"/status --data-binary "${modified_json}" -H "Content-Type: application/json" > /dev/null
-    if [ "$?" == "0" ] ; then
-        echo "route status for route ${route_name} in namespace ${namespace} cleared"
-    else
-        echo "error modifying route ${route_name} in namespace ${namespace}"
-    fi
+    local my_json_blob; my_json_blob=$(oc get --raw http://localhost:8001/oapi/v1/namespaces/${namespace}/routes/${route_name}/)
+    local modified_json; modified_json=$(echo "${my_json_blob}" | jq 'del(.status.ingress)')
+    curl -s -X PUT http://localhost:8001/oapi/v1/namespaces/"${namespace}"/routes/"${route_name}"/status --data-binary "${modified_json}" -H "Content-Type: application/json" > /dev/null
+    echo "route status for route "${route_name}" in namespace "${namespace}" cleared"
 }
 
 #sets up clearing a status set by a specific router
~~~

[1] Full script for OCP 4:
~~~
#!/bin/bash

set -o errexit
set -o pipefail
set -o nounset

# This allows for the clearing of route statuses; routers don't clear a route's status, so some entries may be stale.
# Upon deletion of the route's status, active routers will immediately update it with a valid status.

#clears status of all routers
function clear_status() {
    local namespace="${1}"
    local route_name="${2}"
    local my_json_blob; my_json_blob=$(oc get route -n ${namespace} ${route_name} -o json)
    local modified_json; modified_json=$(echo "${my_json_blob}" | jq -c 'del(.status.ingress)')
    curl -s -X PUT http://localhost:8001/apis/route.openshift.io/v1/namespaces/${namespace}/routes/"${route_name}"/status --data-binary "${modified_json}" -H "Content-Type: application/json" > /dev/null
    if [ "$?" == "0" ] ; then
        echo "route status for route ${route_name} in namespace ${namespace} cleared"
    else
        echo "error modifying route ${route_name} in namespace ${namespace}"
    fi
}

#sets up clearing a status set by a specific router
function clear_status_set_by() {
    local router_name="${1}"

    for namespace in $( oc get namespaces -o 'jsonpath={.items[*].metadata.name}' ); do
        local routes; routes=($(oc get routes -o jsonpath='{.items[*].metadata.name}' --namespace="${namespace}" 2>/dev/null))
        if [[ "${#routes[@]}" -ne 0  ]]; then
            for route in "${routes[@]}"; do
                clear_routers_status "${namespace}" "${route}" "${router_name}"
            done
        else
            echo "No routes found for namespace "${namespace}""
        fi
    done

}

# clears the status entries set by a specific router name
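# NOTE: only clear_status above was adjusted for OCP 4; this function (and the -r mode
# that calls it via clear_status_set_by) still uses the legacy /oapi/v1 endpoints,
# which no longer exist on OCP 4.x, so -r mode will not work there.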
function clear_routers_status() {
    local namespace="${1}"
    local route_name="${2}"
    local router_name="${3}"
    local my_json_blob; my_json_blob=$(oc get --raw http://localhost:8001/oapi/v1/namespaces/"${namespace}"/routes/"${route_name}"/) 
    local modified_json; modified_json=$(echo "${my_json_blob}" | jq '."status"."ingress"|=map(select(.routerName != "'${router_name}'"))')
    if [[ "${modified_json}" != "$(echo "${my_json_blob}" | jq '.')" ]]; then
        curl -s -X PUT http://localhost:8001/oapi/v1/namespaces/"${namespace}"/routes/"${route_name}"/status --data-binary "${modified_json}" -H "Content-Type: application/json" > /dev/null
        echo "route status for route "${route_name}" set by router "${router_name}" cleared"
    else
        echo "route "${route_name}" has no status set by "${router_name}""
    fi
}

function cleanup() {
    if [[ -n "${PROXY_PID:+unset_check}" ]]; then
        kill "${PROXY_PID}"
    fi
}
trap cleanup EXIT

USAGE="Usage:
To clear only the status set by a specific router on all routes in all namespaces
./clear-route-status.sh -r [router_name]

router_name is the name in the deployment config, not the name of the pod. If the router is running it will
immediately update any cleared status.

To clear the status field of a route or all routes in a given namespace
./clear-route-status.sh [namespace] [route-name | ALL]


Example Usage
--------------
To clear the status of all routes in all namespaces:
oc get namespaces | awk '{if (NR!=1) print \$1}' | xargs -n 1 -I %% ./clear-route-status.sh %% ALL

To clear the status of all routes in namespace default:
./clear-route-status.sh default ALL

To clear the status of route example in namespace default:
./clear-route-status.sh default example

NOTE: if a router that admits a route is running it will immediately update the cleared route status 
"

if [[ ${#} -ne 2 || "${@}" == *" help "* ]]; then
    printf "%s" "${USAGE}"
    exit
fi

if ! command -v jq >/dev/null 2>&1; then
    printf "%s\n%s\n" "Command line JSON processor 'jq' not found." "please install 'jq' version greater than 1.4 to use this script."
    exit 1
fi

if ! echo | jq '."status"."ingress"|=map(select(.routerName != "test"))' >/dev/null 2>&1; then
    printf "%s\n%s\n" "Command line JSON processor 'jq' version is incorrect." "Please install 'jq' version greater than 1.4 to use this script"
    exit 1
fi    

oc proxy > /dev/null &
PROXY_PID="${!}"

## attempt to access the proxy until it is online
until curl -s -X GET http://localhost:8001/oapi/v1/ >/dev/null; do
    sleep 1
done

if [[ "${1}" == "-r" ]]; then
    clear_status_set_by "${2}"
    exit
fi

namespace="${1}"
route_name="${2}"

if [[ "${route_name}" == "ALL" ]]; then
    routes=($(oc get routes -o jsonpath='{.items[*].metadata.name}' --namespace="${namespace}" 2>/dev/null))
    if [[ "${#routes[@]}" -ne 0 ]]; then
        for route in "${routes[@]}"; do
            clear_status "${namespace}" "${route}"
        done
    else
        echo "No routes found for namespace "${namespace}""
    fi
else
    clear_status "${namespace}" "${route_name}"
fi

~~~

Comment 2 Andreas Karis 2021-03-30 19:43:28 UTC
I really do not agree, though, with the conclusion from OCP 3:
https://bugzilla.redhat.com/show_bug.cgi?id=1356819#c3

The fact that a customer hits the same issue years later shows, IMO, that the cleanup script should be replaced with automation through the controller.

Back in the day, we argued that:
~~~
The route status is really meant to serve as a debugging indicator for why a route was allowed/not allowed in by a specific router. As routers go away, we don't really clear the status - look it as a more like an events/logs thing. 
(...)
~~~

Given that we now have the ingress operator, it should take care of updating a route's status whenever an owning ingresscontroller is deleted. That should be achievable by walking the status.ingress entries of every route and deleting any entry whose routerName matches the deleted ingresscontroller's name. At the very least, we should add something indicating that the router was deleted, and we should not show "1 more" in the overview.
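
Purely as an illustration of the filtering step the operator could perform (not the operator's actual code), here is the same operation expressed with oc and jq, where ROUTER_NAME stands for the name of the deleted ingresscontroller:
~~~
# Hypothetical sketch: print the route with status entries from the deleted router filtered out
# (this only prints the filtered JSON; it does not write anything back to the cluster).
ROUTER_NAME=sharded
oc get route -n openshift-authentication oauth-openshift -o json \
  | jq --arg r "$ROUTER_NAME" '.status.ingress |= map(select(.routerName != $r))'
~~~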

Thanks,

Andreas

Comment 3 Andrew McDermott 2021-04-01 16:20:21 UTC
Thanks for the script. We will review it, but we will also raise an RFE so that we can consider automating this in the operator.

Comment 4 Miciah Dashiel Butler Masters 2021-05-10 13:52:36 UTC
*** Bug 1958088 has been marked as a duplicate of this bug. ***

Comment 5 Alexander Niebuhr 2021-05-11 13:45:47 UTC
We have the same issue, and I strongly suggest that this be automated in the operator. I understand that it might not be strictly necessary to clear the status. The problem is that the UI frontend (incorrectly, in my view) uses the latest entry of the status array, so if the router with the stale status happens to be the latest of all router statuses, the UI shows stale route information.

In my bug https://bugzilla.redhat.com/show_bug.cgi?id=1958088, I suggested tracking this as two issues: one for the actual status clearing and one for the UI changes. Not sure if that changes anything, but the UI bug makes this an urgent priority for us internally, since we cannot be sure that UI users see the current and correct route information.

Comment 10 Shudi Li 2022-06-23 07:04:36 UTC
Verified it with 4.11.0-0.nightly-2022-06-22-190830

1.
% oc get clusterversion                 
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-06-22-190830   True        False         89m     Cluster version is 4.11.0-0.nightly-2022-06-22-190830
%

2.
% oc create -f ddds
ingresscontroller.operator.openshift.io/sharded created
%
% cat ddds
apiVersion: operator.openshift.io/v1
kind: IngressController
metadata:
  name: sharded
  namespace: openshift-ingress-operator
spec:
  domain: shard.shudi-411djp90.qe.gcp.devcluster.openshift.com
  nodePlacement:
    nodeSelector:
      matchLabels:
        node-role.kubernetes.io/worker: ""
%

3.
%oc -n openshift-ingress get pods     
NAME                              READY   STATUS    RESTARTS   AGE
router-default-58bfd965c6-fk2w6   1/1     Running   0          5h25m
router-default-58bfd965c6-t7tpx   1/1     Running   0          5h25m
router-sharded-ccb7565bc-fkggh    1/1     Running   0          27s
router-sharded-ccb7565bc-skc78    1/1     Running   0          27s
% oc get route -o json -n openshift-authentication   oauth-openshift | jq '.status'
{
  "ingress": [
    {
      "conditions": [
        {
          "lastTransitionTime": "2022-06-23T01:27:40Z",
          "status": "True",
          "type": "Admitted"
        }
      ],
      "host": "oauth-openshift.apps.shudi-411djp90.qe.gcp.devcluster.openshift.com",
      "routerCanonicalHostname": "router-default.apps.shudi-411djp90.qe.gcp.devcluster.openshift.com",
      "routerName": "default",
      "wildcardPolicy": "None"
    },
    {
      "conditions": [
        {
          "lastTransitionTime": "2022-06-23T06:47:12Z",
          "status": "True",
          "type": "Admitted"
        }
      ],
      "host": "oauth-openshift.apps.shudi-411djp90.qe.gcp.devcluster.openshift.com",
      "routerCanonicalHostname": "router-sharded.shard.shudi-411djp90.qe.gcp.devcluster.openshift.com",
      "routerName": "sharded",
      "wildcardPolicy": "None"
    }
  ]
}
% 

4.
% oc scale ingresscontroller -n openshift-ingress-operator --replicas 1 sharded
ingresscontroller.operator.openshift.io/sharded scaled
% oc scale ingresscontroller -n openshift-ingress-operator --replicas 1 default
ingresscontroller.operator.openshift.io/default scaled
%

5.
% oc -n openshift-ingress get pods
NAME                              READY   STATUS    RESTARTS   AGE
router-default-58bfd965c6-fk2w6   1/1     Running   0          5h27m
router-sharded-ccb7565bc-skc78    1/1     Running   0          2m37s
% 

6.
%oc delete -n openshift-ingress-operator ingresscontroller sharded
ingresscontroller.operator.openshift.io "sharded" deleted
% 

7. the "router-sharded" is removed from oauth-openshift route
 % oc get route -o json -n openshift-authentication   oauth-openshift | jq '.status'
{
  "ingress": [
    {
      "conditions": [
        {
          "lastTransitionTime": "2022-06-23T01:27:40Z",
          "status": "True",
          "type": "Admitted"
        }
      ],
      "host": "oauth-openshift.apps.shudi-411djp90.qe.gcp.devcluster.openshift.com",
      "routerCanonicalHostname": "router-default.apps.shudi-411djp90.qe.gcp.devcluster.openshift.com",
      "routerName": "default",
      "wildcardPolicy": "None"
    }
  ]
}
% 


8. the "router-sharded" is removed from other routes, too
% oc get route -o json -n openshift-console  console  | jq '.status'
{
  "ingress": [
    {
      "conditions": [
        {
          "lastTransitionTime": "2022-06-23T01:27:40Z",
          "status": "True",
          "type": "Admitted"
        }
      ],
      "host": "console-openshift-console.apps.shudi-411djp90.qe.gcp.devcluster.openshift.com",
      "routerCanonicalHostname": "router-default.apps.shudi-411djp90.qe.gcp.devcluster.openshift.com",
      "routerName": "default",
      "wildcardPolicy": "None"
    }
  ]
}
% 
% oc get route -o json -n openshift-ingress-canary   canary  | jq '.status'
{
  "ingress": [
    {
      "conditions": [
        {
          "lastTransitionTime": "2022-06-23T01:27:40Z",
          "status": "True",
          "type": "Admitted"
        }
      ],
      "host": "canary-openshift-ingress-canary.apps.shudi-411djp90.qe.gcp.devcluster.openshift.com",
      "routerCanonicalHostname": "router-default.apps.shudi-411djp90.qe.gcp.devcluster.openshift.com",
      "routerName": "default",
      "wildcardPolicy": "None"
    }
  ]
}
%
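
As an extra illustrative check after the deletion, the set of routerName values still present across all routes can be listed with jq; only "default" should remain:
~~~
oc get routes -A -o json | jq '[.items[].status.ingress[]?.routerName] | unique'
~~~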

Comment 12 Miciah Dashiel Butler Masters 2022-07-01 15:35:58 UTC
There is a follow-up fix in bug 2101878.

Comment 13 errata-xmlrpc 2022-08-10 10:36:17 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069