Description of problem:

After upgrading from 4.8 to 4.9 while a service is idled, the service can no longer be accessed. The stale idling load-balancer record does not get removed:

sh-4.4# ovn-nbctl list load-balancer | grep 172.30.107.13 -B 8
_uuid               : db7c2000-aee8-4903-9590-a23a13b73e39
external_ids        : {k8s-idling-lb-tcp=yes}
health_check        : []
ip_port_mappings    : {}
name                : ""
options             : {reject="false"}
protocol            : tcp
selection_fields    : []
vips                : {"172.30.107.13:27017"=""}

After deleting the record above manually, the service works again.

Version-Release number of selected component (if applicable):
upgrade from 4.8.17-x86_64 --> 4.9.5-x86_64

How reproducible:
always

Steps to Reproduce:
1. Set up a 4.8.17 cluster with the OVN network plugin.
2. Create a new project and a test pod/service:
   oc create -f https://raw.githubusercontent.com/openshift/verification-tests/master/testdata/networking/list_for_pods.json
3. Idle the service:
   oc idle test-service
4. Upgrade the cluster to 4.9.5.
5. Access the service.
6. Check the load balancers in the OVN NB database.

Actual results:
In step 6, both the stale idling load balancer and the new per-service load balancer own the VIP, and the service is unreachable:

sh-4.4# ovn-nbctl list load-balancer | grep 172.30.107.13 -B 8
_uuid               : db7c2000-aee8-4903-9590-a23a13b73e39
external_ids        : {k8s-idling-lb-tcp=yes}
health_check        : []
ip_port_mappings    : {}
name                : ""
options             : {reject="false"}
protocol            : tcp
selection_fields    : []
vips                : {"172.30.107.13:27017"=""}
--
_uuid               : ef3160c0-5192-4f1d-b459-e779d5faf043
external_ids        : {"k8s.ovn.org/kind"=Service, "k8s.ovn.org/owner"="idle-upgrade/test-service"}
health_check        : []
ip_port_mappings    : {}
name                : "Service_idle-upgrade/test-service_TCP_cluster"
options             : {event="false", reject="true", skip_snat="false"}
protocol            : tcp
selection_fields    : []
vips                : {"172.30.107.13:27017"="10.128.2.101:8080,10.131.0.39:8080"}

Expected results:
The stale idling load balancer is removed during the upgrade and the service becomes reachable again once it is unidled.

Additional info:
After deleting load balancer db7c2000-aee8-4903-9590-a23a13b73e39, the service works:

sh-4.4# ovn-nbctl lb-del db7c2000-aee8-4903-9590-a23a13b73e39
sh-4.4# ovn-nbctl lb-list | grep 172.30.107.13
ef3160c0-5192-4f1d-b459-e779d5faf043    Service_idle-upg    tcp    172.30.107.13:27017    10.128.2.101:8080,10.131.0.39:8080
sh-4.4# curl 172.30.107.13:27017
Hello OpenShift!
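For anyone reproducing this, a quick way to confirm which OVN load balancers claim the service VIP is to run ovn-nbctl from inside an ovnkube-master pod. This is only a rough sketch: the namespace/service names match the reproduction steps above, and the nbdb container name plus its use of the local ovn-nbctl socket are assumptions about the cluster layout, so adjust as needed.

# Assumed names: project "idle-upgrade", service "test-service", container "nbdb".
VIP=$(oc -n idle-upgrade get svc test-service -o jsonpath='{.spec.clusterIP}')
POD=$(oc -n openshift-ovn-kubernetes get pod -l app=ovnkube-master -o jsonpath='{.items[0].metadata.name}')
# List every OVN load balancer row that mentions the service VIP; with this bug
# both the stale idling LB and the per-service LB show up for the same VIP.
oc -n openshift-ovn-kubernetes exec "$POD" -c nbdb -- ovn-nbctl list load-balancer | grep -B 8 "$VIP"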
We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context and the UpgradeBlocker flag has been added to this bug. It will be removed if the assessment indicates that this should not block upgrade edges. The expectation is that the assignee answers these questions.

Who is impacted? If we have to block upgrade edges based on this issue, which edges would need blocking?
  example: Customers upgrading from 4.y.Z to 4.y+1.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
  example: All customers upgrading from 4.y.z to 4.y+1.z fail approximately 10% of the time

What is the impact? Is it serious enough to warrant blocking edges?
  example: Up to 2 minute disruption in edge routing
  example: Up to 90 seconds of API downtime
  example: etcd loses quorum and you have to restore from backup

How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
  example: Issue resolves itself after five minutes
  example: Admin uses oc to fix things
  example: Admin must SSH to hosts, restore from backups, or other non standard admin activities

Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?
  example: No, it's always been like this we just never noticed
  example: Yes, from 4.y.z to 4.y+1.z Or 4.y.z to 4.y.z+1
(In reply to Lalatendu Mohanty from comment #4)
> We're asking the following questions to evaluate whether or not this bug
> warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The
> ultimate goal is to avoid delivering an update which introduces new risk or
> reduces cluster functionality in any way. Sample answers are provided to
> give more context and the UpgradeBlocker flag has been added to this bug. It
> will be removed if the assessment indicates that this should not block
> upgrade edges. The expectation is that the assignee answers these questions.
>
> Who is impacted? If we have to block upgrade edges based on this issue,
> which edges would need blocking?

This bug affects customers who upgrade to 4.9 or newer while they have idled services.

> What is the impact? Is it serious enough to warrant blocking edges?

After the upgrade, the affected load balancers for the idled services will not forward any traffic, even after the services are no longer idled.

> How involved is remediation (even moderately serious impacts might be
> acceptable if they are easy to mitigate)?

An admin has to manually remove the idling load balancer(s) from the OVN NB database in order to get the affected services working again.

> Is this a regression (if all previous versions were also vulnerable,
> updating to the new, vulnerable version does not increase exposure)?
> example: No, it's always been like this we just never noticed
> example: Yes, from 4.y.z to 4.y+1.z Or 4.y.z to 4.y.z+1

Yes, this is a regression introduced in 4.9.0, when this change was merged:
Bug 1995330: Cherry-pick of per-service loadbalancers #666
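To gauge whether a given cluster is exposed before upgrading, it should be enough to look for services that are currently idled. A rough sketch, assuming the idling.alpha.openshift.io/idled-at annotation that oc idle places on the Endpoints objects (verify the annotation name on your cluster before relying on it):

# Print namespace/name for every Endpoints object carrying the idled-at annotation.
oc get endpoints --all-namespaces -o json \
  | jq -r '.items[]
           | select(.metadata.annotations["idling.alpha.openshift.io/idled-at"] != null)
           | "\(.metadata.namespace)/\(.metadata.name)"'

If this prints nothing, there are no idled services and the upgrade should not hit this bug.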
Bumping the priority of the bug as this is a blocker+ bug.
As per our current understanding, we are not sure how many customers are using the application idling feature, as we are not able to get that information from telemetry. Hence we are not blocking edges for this bug. However, if we find new information that this is impacting customers, we will revisit the idea of blocking edges.
> @Flavio Fernandes For the workaround of the issue can you add exact commands in the bug. So that when customers face this issue they can easily apply the workaround
> and ideally also "run this exact command to determine if you're vulnerable..."

In order to see if you're vulnerable and to manually fix the issue, the first step is to get access to the OVN NB database. This is one way of doing it:

NOTE: ** Do this _AFTER_ upgrading the cluster to 4.9 or later **

NODE=$(oc get node | grep master-0 | cut -d' ' -f1)
OVN_POD_MASTERAPP=$(oc -n openshift-ovn-kubernetes get pod \
    -l app=ovnkube-master,component=network \
    -o jsonpath='{range .items[?(@.spec.nodeName=="'${NODE}'")]}{.metadata.name}{end}')
oc exec -it $OVN_POD_MASTERAPP -n openshift-ovn-kubernetes -c ovnkube-master -- bash

# From inside that pod, you can grab credentials for accessing the OVN NB db. One way
# of doing that is by looking at the params of the nbctl process. For example:
#
# [root@ci-ln-fj1d7mb-72292-vsdl5-master-0 ~]# ps auxww | grep -- '--db '
# root 12 0.0 0.0 43632 7428 ? Ss 22:04 0:00 ovn-nbctl --pidfile=/var/run/ovn/ovn-nbctl.pid --detach -p /ovn-cert/tls.key -c /ovn-cert/tls.crt -C /ovn-ca/ca-bundle.crt --db ssl:10.0.0.3:9641,ssl:10.0.0.4:9641,ssl:10.0.0.5:9641 --log-file=/run/ovn/ovn-nbctl.log -vreconnect:file:info

# Then make an alias for ovn-nbctl using the -p, -c, -C and --db values, like so:

alias ovn-nbctl='ovn-nbctl -p /ovn-cert/tls.key -c /ovn-cert/tls.crt -C /ovn-ca/ca-bundle.crt --db ssl:10.0.0.3:9641,ssl:10.0.0.4:9641,ssl:10.0.0.5:9641'

# In order to see if you're vulnerable, list the OVN load balancers that have 'idling'
# as part of their external ids. For example:

[root@ci-ln-fj1d7mb-72292-vsdl5-master-0 ~]# ovn-nbctl --columns _uuid,external_ids,vips,options find load_balancer external_ids=k8s-idling-lb-tcp=yes
_uuid               : 0b50415f-b31f-4934-bd45-8d902fa80efc
external_ids        : {k8s-idling-lb-tcp=yes}
vips                : {"172.30.60.96:27017"=""}
options             : {reject="false"}

# Note: if there were idled services that are not tcp (i.e. udp, sctp), replace the
# protocol in the command accordingly.
# If that returns nothing, then there is nothing further to be done.

# In order to fix the issue, simply remove the load balancer rows found, as they are
# no longer used after 4.9. Here is how:

for LBUUID in $(ovn-nbctl --bare --columns _uuid find load_balancer external_ids=k8s-idling-lb-tcp=yes) ; do \
    echo $LBUUID ; ovn-nbctl lb-del $LBUUID ; done
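For convenience, the detection and cleanup above can be folded into one loop that covers all three protocols. This is only a sketch: it assumes the ovn-nbctl alias (certificates and --db endpoints) has already been set up inside the ovnkube-master pod as described above, and that the udp/sctp idling load balancers use external ids following the same k8s-idling-lb-<proto> pattern as the tcp one.

for PROTO in tcp udp sctp ; do
  for LBUUID in $(ovn-nbctl --bare --columns _uuid find load_balancer external_ids=k8s-idling-lb-${PROTO}=yes) ; do
    # Each row found here is a stale idling load balancer left over from 4.8; drop it.
    echo "removing stale idling LB ${LBUUID} (${PROTO})" ; ovn-nbctl lb-del $LBUUID
  done
done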
Verified by upgrading 4.8.20 -> a payload built from https://github.com/openshift/ovn-kubernetes/pull/837 with an idled service. The service unidled successfully after the upgrade.
Verified per comment 7
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.9.8 bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:4712