Created attachment 1699233 [details]
Upgrade log

Description of problem:
During the upgrade from OpenShift 4.3 to OpenShift 4.4, the upgrade e2e test fails.

Version-Release number of selected component (if applicable):
OpenShift 4.3 to OpenShift 4.4

How reproducible:
Consistently

Installation type: Libvirt IPI

Steps to Reproduce:
1. Install 4.3.23
2. Run openshift-tests run-upgrade all --to-image=<OCP4.4_image>

Actual results:
Test failed with the following error message:

Jun 4 06:55:05.926: INFO: Running AfterSuite actions on node 1
fail [k8s.io/kubernetes/test/e2e/framework/service/jig.go:566]: Jun 4 06:04:13.983: Timed out waiting for service "service-test" to have a load balancer
failed: (1h0m55s) 2020-06-04T10:55:05 "[Disruptive] Cluster upgrade should maintain a functioning cluster [Feature:ClusterUpgrade] [Suite:openshift] [Serial]"

Expected results:
The cluster should upgrade to 4.4 and the upgrade tests should pass.
As part of the upgrade scenario, the mentioned service is created by the following lines:
https://github.com/openshift/origin/blob/release-4.3/test/e2e/upgrade/service/service.go#L50:L64

```go
ginkgo.By("creating a TCP service " + serviceName + " with type=LoadBalancer in namespace " + ns.Name)
tcpService := jig.CreateTCPServiceOrFail(ns.Name, func(s *v1.Service) {
	s.Spec.Type = v1.ServiceTypeLoadBalancer
	if s.Annotations == nil {
		s.Annotations = make(map[string]string)
	}
	// We tune the LB checks to match the longest intervals available so that interactions between
	// upgrading components and the service are more obvious.
	// - AWS allows configuration, default is 70s (6 failed with 10s interval in 1.17) set to match GCP
	s.Annotations["service.beta.kubernetes.io/aws-load-balancer-healthcheck-interval"] = "8"
	s.Annotations["service.beta.kubernetes.io/aws-load-balancer-healthcheck-unhealthy-threshold"] = "3"
	s.Annotations["service.beta.kubernetes.io/aws-load-balancer-healthcheck-healthy-threshold"] = "2"
	// - Azure is hardcoded to 15s (2 failed with 5s interval in 1.17) and is sufficient
	// - GCP has a non-configurable interval of 32s (3 failed health checks with 8s interval in 1.17)
	// - thus pods need to stay up for > 32s, so pod shutdown period will be 45s
})
```

Looking at the code, it creates a LoadBalancer service with AWS-specific annotations, which may only work on AWS providers.
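Those `service.beta.kubernetes.io/aws-load-balancer-*` keys are only interpreted by the AWS cloud provider; on libvirt there is no controller that reads them, so they remain inert metadata. A minimal sketch (plain Go with a hypothetical `hasProviderAnnotations` helper, not code from the origin suite) of how provider-specific annotations could be detected:

```go
package main

import (
	"fmt"
	"strings"
)

// awsAnnotationPrefix is the common prefix of the AWS load-balancer
// health-check annotations used by the upgrade test.
const awsAnnotationPrefix = "service.beta.kubernetes.io/aws-load-balancer-"

// hasProviderAnnotations reports whether any annotation key carries the
// given cloud-provider prefix (hypothetical helper for illustration).
func hasProviderAnnotations(annotations map[string]string, prefix string) bool {
	for k := range annotations {
		if strings.HasPrefix(k, prefix) {
			return true
		}
	}
	return false
}

func main() {
	annotations := map[string]string{
		"service.beta.kubernetes.io/aws-load-balancer-healthcheck-interval":            "8",
		"service.beta.kubernetes.io/aws-load-balancer-healthcheck-unhealthy-threshold": "3",
		"service.beta.kubernetes.io/aws-load-balancer-healthcheck-healthy-threshold":   "2",
	}
	fmt.Println(hasProviderAnnotations(annotations, awsAnnotationPrefix))
}
```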
Mimicking the code, I tried creating the LoadBalancer service and observed the following behavior.

service.yaml:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-service
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-interval: "8"
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-unhealthy-threshold: "3"
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-healthy-threshold: "2"
spec:
  selector:
    app: MyApp
  ports:
  - protocol: TCP
    port: 80
  type: LoadBalancer
```

```
# oc create -f service.yaml
service/my-service created
# oc get svc
NAME         TYPE           CLUSTER-IP       EXTERNAL-IP                            PORT(S)        AGE
kubernetes   ClusterIP      172.30.0.1       <none>                                 443/TCP        13d
my-service   LoadBalancer   172.30.227.169   <pending>                              80:30616/TCP   66s
openshift    ExternalName   <none>           kubernetes.default.svc.cluster.local   <none>         13d
# oc get svc my-service -o=yaml
apiVersion: v1
kind: Service
metadata:
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-healthy-threshold: "2"
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-interval: "8"
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-unhealthy-threshold: "3"
  creationTimestamp: "2020-06-03T11:04:54Z"
  name: my-service
  namespace: default
  resourceVersion: "5990717"
  selfLink: /api/v1/namespaces/default/services/my-service
  uid: 859dd769-3c97-41fb-894c-0c948bea02b5
spec:
  clusterIP: 172.30.227.169
  externalTrafficPolicy: Cluster
  ports:
  - nodePort: 30616
    port: 80
    protocol: TCP
    targetPort: 80
  selector:
    app: MyApp
  sessionAffinity: None
  type: LoadBalancer
status:
  loadBalancer: {}
```

The test fails because it waits for an `svc.Status.LoadBalancer.Ingress` entry, but here the status stays empty since we are not running in an AWS environment (note `EXTERNAL-IP: <pending>` and `loadBalancer: {}` above). This shows that the current code will not work on non-AWS environments; further analysis and enhancement are required to run it on non-AWS providers.
Setting target release to current development version (4.6) for investigation. Where fixes (if any) are required/requested for prior versions, cloned BZs will be created when appropriate.
I'm adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.
Adding "UpcomingSprint" tag as the team does not have the bandwidth to work on this bug during this sprint.
Target reset to 4.7 while investigation is either ongoing or not yet started. Will be considered for earlier release versions when diagnosed and resolved.
@jpoulin FYI ^^ target reset to 4.7 in case it directly impacts "Libvirt: CI Job for Upgrade Testing on P". The fix can possibly be backported to 4.6 if required.
Hi lmcfadde, is this bug something that the IBM team could help investigate?
Please ignore my Comment 16 as the bug has been re-assigned.
The way ingress is handled on libvirt deployments doesn't make use of a load balancer. Instead, the round-robin DNS provided by dnsmasq when the cluster network is set up handles the ingress traffic both before and after the upgrade. We should probably just special-case this in the test suite.
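Special-casing could look like a platform gate around the load-balancer assertion: run the check only where the platform actually provisions load balancers, and fall back to plain connectivity testing on libvirt. A hedged sketch (the `PlatformType` constants and `supportsLoadBalancerService` helper are illustrative, not the names used in origin):

```go
package main

import "fmt"

// PlatformType mirrors the infrastructure platform field
// (hypothetical constants for illustration).
type PlatformType string

const (
	AWSPlatform     PlatformType = "AWS"
	LibvirtPlatform PlatformType = "Libvirt"
)

// supportsLoadBalancerService reports whether the platform provisions
// external load balancers for Service type=LoadBalancer. Libvirt relies on
// dnsmasq round-robin DNS instead, so the Ingress assertion should be
// skipped there.
func supportsLoadBalancerService(p PlatformType) bool {
	switch p {
	case LibvirtPlatform:
		return false
	default:
		return true
	}
}

func main() {
	for _, p := range []PlatformType{AWSPlatform, LibvirtPlatform} {
		if supportsLoadBalancerService(p) {
			fmt.Printf("%s: run load balancer check\n", p)
		} else {
			fmt.Printf("%s: skip load balancer check\n", p)
		}
	}
}
```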
Still waiting on the linked PR to be reviewed and merged.
Didn't observe the error in the upgrade e2e test; moving to VERIFIED.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5633