Bug 2018398 - [4.9z] OVN idle service cannot be accessed after upgrade from 4.8
Summary: [4.9z] OVN idle service cannot be accessed after upgrade from 4.8
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.9.z
Assignee: ffernand
QA Contact: Mike Fiedler
URL:
Whiteboard:
Depends On: 2023985
Blocks: 2041307
 
Reported: 2021-10-29 06:39 UTC by zhaozhanqi
Modified: 2022-01-17 03:22 UTC
CC List: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 2023985 2041307
Environment:
Last Closed: 2021-11-22 21:47:05 UTC
Target Upstream Version:
Embargoed:


Attachments: none


Links:
  Github openshift/ovn-kubernetes pull 837 (open): Bug 2018398: [4.9z] findLegacyLBs to also include idling LBs -- last updated 2021-11-18 00:00:14 UTC
  Github ovn-org/ovn-kubernetes pull 2638 (Merged): Bug 2018398: findLegacyLBs to also include idling LBs -- last updated 2021-11-17 14:41:12 UTC
  Red Hat Product Errata RHBA-2021:4712 -- last updated 2021-11-22 21:47:18 UTC

Description zhaozhanqi 2021-10-29 06:39:46 UTC
Description of problem:

When a cluster with an idled service is upgraded from 4.8 to 4.9, the service cannot be accessed.

It seems the following load-balancer record was not removed:

sh-4.4# ovn-nbctl list load-balancer | grep 172.30.107.13 -B 8
_uuid               : db7c2000-aee8-4903-9590-a23a13b73e39
external_ids        : {k8s-idling-lb-tcp=yes}
health_check        : []
ip_port_mappings    : {}
name                : ""
options             : {reject="false"}
protocol            : tcp
selection_fields    : []
vips                : {"172.30.107.13:27017"=""}

When I deleted the above record manually, the service worked again.

Version-Release number of selected component (if applicable):
upgrade from 4.8.17-x86_64 --> 4.9.5-x86_64

How reproducible:
always

Steps to Reproduce:
1. Set up a 4.8.17 cluster with the OVN network plugin
2. Create a new project and a test pod/service:

oc create -f https://raw.githubusercontent.com/openshift/verification-tests/master/testdata/networking/list_for_pods.json

3. Idle the service:

oc idle test-service

4. Upgrade the cluster to 4.9.5

5. Access the service (see the sketch after this list)

6. Check the load balancers in the OVN NB DB
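
A rough sketch of steps 5 and 6 (not the exact commands from the report): the VIP 172.30.107.13:27017 is the idled service's ClusterIP:port taken from the output below, and <any-pod-with-curl> is a placeholder for any running pod that has curl available.

# Step 5: try to reach the idled service from inside the cluster; on an
# affected cluster the request never wakes the backing pods.
oc exec <any-pod-with-curl> -- curl --max-time 10 http://172.30.107.13:27017

# Step 6: from an ovnkube-master pod (see comment 8 for one way to get
# ovn-nbctl access to the NB DB), dump the load balancers carrying the VIP.
ovn-nbctl list load-balancer | grep 172.30.107.13 -B 8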

Actual results:

step 6:

sh-4.4# ovn-nbctl list load-balancer | grep 172.30.107.13 -B 8
_uuid               : db7c2000-aee8-4903-9590-a23a13b73e39
external_ids        : {k8s-idling-lb-tcp=yes}
health_check        : []
ip_port_mappings    : {}
name                : ""
options             : {reject="false"}
protocol            : tcp
selection_fields    : []
vips                : {"172.30.107.13:27017"=""}
--
_uuid               : ef3160c0-5192-4f1d-b459-e779d5faf043
external_ids        : {"k8s.ovn.org/kind"=Service, "k8s.ovn.org/owner"="idle-upgrade/test-service"}
health_check        : []
ip_port_mappings    : {}
name                : "Service_idle-upgrade/test-service_TCP_cluster"
options             : {event="false", reject="true", skip_snat="false"}
protocol            : tcp
selection_fields    : []
vips                : {"172.30.107.13:27017"="10.128.2.101:8080,10.131.0.39:8080"}


Expected results:

The stale idling load balancer is removed during the upgrade, and the service can be accessed (and unidled) normally.

Additional info:

When I deleted db7c2000-aee8-4903-9590-a23a13b73e39, the service worked again:

sh-4.4# ovn-nbctl lb-del db7c2000-aee8-4903-9590-a23a13b73e39
sh-4.4# ovn-nbctl lb-list | grep 172.30.107.13
ef3160c0-5192-4f1d-b459-e779d5faf043    Service_idle-upg    tcp        172.30.107.13:27017     10.128.2.101:8080,10.131.0.39:8080
sh-4.4# curl 172.30.107.13:27017
Hello OpenShift!

Comment 4 Lalatendu Mohanty 2021-11-17 16:02:02 UTC
We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context and the UpgradeBlocker flag has been added to this bug. It will be removed if the assessment indicates that this should not block upgrade edges. The expectation is that the assignee answers these questions.

Who is impacted?  If we have to block upgrade edges based on this issue, which edges would need blocking?
  example: Customers upgrading from 4.y.Z to 4.y+1.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
  example: All customers upgrading from 4.y.z to 4.y+1.z fail approximately 10% of the time
What is the impact?  Is it serious enough to warrant blocking edges?
  example: Up to 2 minute disruption in edge routing
  example: Up to 90 seconds of API downtime
  example: etcd loses quorum and you have to restore from backup
How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
  example: Issue resolves itself after five minutes
  example: Admin uses oc to fix things
  example: Admin must SSH to hosts, restore from backups, or other non standard admin activities
Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?
  example: No, it’s always been like this we just never noticed
  example: Yes, from 4.y.z to 4.y+1.z Or 4.y.z to 4.y.z+1

Comment 5 ffernand 2021-11-17 19:20:55 UTC
(In reply to Lalatendu Mohanty from comment #4)
> We're asking the following questions to evaluate whether or not this bug
> warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The
> ultimate goal is to avoid delivering an update which introduces new risk or
> reduces cluster functionality in any way. Sample answers are provided to
> give more context and the UpgradeBlocker flag has been added to this bug. It
> will be removed if the assessment indicates that this should not block
> upgrade edges. The expectation is that the assignee answers these questions.
> 

> Who is impacted?  If we have to block upgrade edges based on this issue,
> which edges would need blocking?


This bug affects customers who upgrade to 4.9 or newer while there are
idled services.

> What is the impact?  Is it serious enough to warrant blocking edges?

After the upgrade, the affected load balancers for the idled services will not
forward any traffic, even after the service is no longer idled.


> How involved is remediation (even moderately serious impacts might be
> acceptable if they are easy to mitigate)?


The admin would have to manually remove the stale idling load balancer(s) from the OVN NB DB
in order to get the affected services working again (exact commands are in comment 8 below).


> Is this a regression (if all previous versions were also vulnerable,
> updating to the new, vulnerable version does not increase exposure)?
>   example: No, it’s always been like this we just never noticed
>   example: Yes, from 4.y.z to 4.y+1.z Or 4.y.z to 4.y.z+1

Yes, this is a regression introduced in 4.9.0, when this change was
merged: Bug 1995330: Cherry-pick of per-service loadbalancers #666

Comment 6 Lalatendu Mohanty 2021-11-17 20:58:51 UTC
Bumping the priority of the bug as this is a blocker+ bug.

Comment 7 Lalatendu Mohanty 2021-11-17 21:34:18 UTC
As per our current understanding, we are not sure how many customers are using the application idling feature, as we are not able to get that information from telemetry. Hence we are not blocking edges for this bug. However, if we find new information that this is impacting customers, we will revisit the idea of blocking edges.

Comment 8 ffernand 2021-11-17 23:37:58 UTC
> @Flavio Fernandes  For the workaround of the issue can you add exact commands in the bug. So that when customers face this issue they can easily apply the workaround
> and ideally also "run this exact command to determine if you're vulnerable..."


In order to see if you're vulnerable and to manually fix the issue, the first step is to get access to the OVN NB DB. This is one way of doing it:

NOTE: ** Do this _AFTER_ upgrading the cluster to 4.9 or later **



# Pick a master node (here, the one whose name contains master-0).
NODE=$(oc get node | grep master-0 | cut -d' ' -f1)

# Find the ovnkube-master pod running on that node.
OVN_POD_MASTERAPP=$(oc -n openshift-ovn-kubernetes get pod \
  -l app=ovnkube-master,component=network \
  -o jsonpath='{range .items[?(@.spec.nodeName=="'${NODE}'")]}{.metadata.name}{end}')

# Open a shell in the ovnkube-master container of that pod.
oc exec -it $OVN_POD_MASTERAPP -n openshift-ovn-kubernetes -c ovnkube-master -- bash

# From inside that pod, you can grab the credentials for accessing the OVN NB DB.
# One way of doing that is by looking at the parameters of the running ovn-nbctl process.
# example:
# [root@ci-ln-fj1d7mb-72292-vsdl5-master-0 ~]#  ps auxww | grep -- '--db '
root          12  0.0  0.0  43632  7428 ?        Ss   22:04   0:00 ovn-nbctl --pidfile=/var/run/ovn/ovn-nbctl.pid --detach -p /ovn-cert/tls.key -c /ovn-cert/tls.crt -C /ovn-ca/ca-bundle.crt --db ssl:10.0.0.3:9641,ssl:10.0.0.4:9641,ssl:10.0.0.5:9641 --log-file=/run/ovn/ovn-nbctl.log -vreconnect:file:info


# Then make an alias for ovn-nbctl using the same -p, -c, -C, and --db values, like so:

alias ovn-nbctl='ovn-nbctl -p /ovn-cert/tls.key -c /ovn-cert/tls.crt -C /ovn-ca/ca-bundle.crt --db ssl:10.0.0.3:9641,ssl:10.0.0.4:9641,ssl:10.0.0.5:9641'


# To see if you're vulnerable, list the OVN load balancers that have 'idling' as part of their external IDs. For example:

[root@ci-ln-fj1d7mb-72292-vsdl5-master-0 ~]# ovn-nbctl --columns _uuid,external_ids,vips,options find load_balancer external_ids=k8s-idling-lb-tcp=yes
_uuid               : 0b50415f-b31f-4934-bd45-8d902fa80efc
external_ids        : {k8s-idling-lb-tcp=yes}
vips                : {"172.30.60.96:27017"=""}
options             : {reject="false"}


# Note: if there were idled services that are not TCP (i.e., UDP or SCTP), replace the protocol in the command accordingly (see the combined sketch after the loop below).
# If that returns nothing, then there is nothing further to be done.

# To fix, remove the load balancer rows found, as they are no longer used in 4.9 and later. Here is how:

for LBUUID in $(ovn-nbctl --bare --columns _uuid find load_balancer external_ids=k8s-idling-lb-tcp=yes) ; do \
     echo $LBUUID ; ovn-nbctl lb-del $LBUUID  ; done
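
# For clusters that also had idled UDP or SCTP services, here is a combined sketch
# covering all three protocols. It assumes (not confirmed in this bug) that the
# external-ID keys for the other protocols follow the same pattern as the
# k8s-idling-lb-tcp key shown above; verify with 'ovn-nbctl list load_balancer' first.

for PROTO in tcp udp sctp ; do
  for LBUUID in $(ovn-nbctl --bare --columns _uuid find load_balancer external_ids=k8s-idling-lb-${PROTO}=yes) ; do
    echo "removing stale idling LB ${LBUUID} (${PROTO})"
    ovn-nbctl lb-del ${LBUUID}
  done
done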

Comment 9 Mike Fiedler 2021-11-18 00:48:40 UTC
Verified by upgrading 4.8.20 -> a payload built from https://github.com/openshift/ovn-kubernetes/pull/837 with an idled service. The service unidled successfully after the upgrade.

Comment 12 Mike Fiedler 2021-11-18 18:50:38 UTC
Verified per comment 7

Comment 15 errata-xmlrpc 2021-11-22 21:47:05 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.9.8 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:4712

