Created attachment 1699233 [details]
Upgrade log

Description of problem:
During the upgrade from OpenShift 4.3 to OpenShift 4.4, the upgrade e2e test fails.

Version-Release number of selected component (if applicable):
OpenShift 4.3 to OpenShift 4.4

How reproducible:
Consistently

Installation type: Libvirt IPI

Steps to Reproduce:
1. Install 4.3.23
2. Run openshift-tests run-upgrade all --to-image=<OCP4.4_image>

Actual results:
Test failed with the following error message:

Jun 4 06:55:05.926: INFO: Running AfterSuite actions on node 1
fail [k8s.io/kubernetes/test/e2e/framework/service/jig.go:566]: Jun 4 06:04:13.983: Timed out waiting for service "service-test" to have a load balancer
failed: (1h0m55s) 2020-06-04T10:55:05 "[Disruptive] Cluster upgrade should maintain a functioning cluster [Feature:ClusterUpgrade] [Suite:openshift] [Serial]"

Expected results:
The cluster should upgrade to 4.4 and the upgrade tests should pass.
As part of the upgrade scenario, the mentioned service is created by the following lines:
https://github.com/openshift/origin/blob/release-4.3/test/e2e/upgrade/service/service.go#L50:L64

```go
ginkgo.By("creating a TCP service " + serviceName + " with type=LoadBalancer in namespace " + ns.Name)
tcpService := jig.CreateTCPServiceOrFail(ns.Name, func(s *v1.Service) {
	s.Spec.Type = v1.ServiceTypeLoadBalancer
	if s.Annotations == nil {
		s.Annotations = make(map[string]string)
	}
	// We tune the LB checks to match the longest intervals available so that interactions between
	// upgrading components and the service are more obvious.
	// - AWS allows configuration, default is 70s (6 failed with 10s interval in 1.17) set to match GCP
	s.Annotations["service.beta.kubernetes.io/aws-load-balancer-healthcheck-interval"] = "8"
	s.Annotations["service.beta.kubernetes.io/aws-load-balancer-healthcheck-unhealthy-threshold"] = "3"
	s.Annotations["service.beta.kubernetes.io/aws-load-balancer-healthcheck-healthy-threshold"] = "2"
	// - Azure is hardcoded to 15s (2 failed with 5s interval in 1.17) and is sufficient
	// - GCP has a non-configurable interval of 32s (3 failed health checks with 8s interval in 1.17)
	// - thus pods need to stay up for > 32s, so pod shutdown period will be 45s
})
```

Looking at the code, it creates a LoadBalancer service with AWS-specific annotations, which may only work on AWS providers.
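Those `service.beta.kubernetes.io/aws-load-balancer-*` keys are only interpreted by the AWS cloud provider; on libvirt there is no controller that reads them, so they remain inert metadata. A minimal sketch (plain Go with a hypothetical `hasProviderAnnotations` helper, not code from the origin suite) of how provider-specific annotations could be detected:

```go
package main

import (
	"fmt"
	"strings"
)

// awsAnnotationPrefix is the common prefix of the AWS load-balancer
// health-check annotations used by the upgrade test.
const awsAnnotationPrefix = "service.beta.kubernetes.io/aws-load-balancer-"

// hasProviderAnnotations reports whether any annotation key carries the
// given cloud-provider prefix (hypothetical helper for illustration).
func hasProviderAnnotations(annotations map[string]string, prefix string) bool {
	for k := range annotations {
		if strings.HasPrefix(k, prefix) {
			return true
		}
	}
	return false
}

func main() {
	annotations := map[string]string{
		"service.beta.kubernetes.io/aws-load-balancer-healthcheck-interval":            "8",
		"service.beta.kubernetes.io/aws-load-balancer-healthcheck-unhealthy-threshold": "3",
		"service.beta.kubernetes.io/aws-load-balancer-healthcheck-healthy-threshold":   "2",
	}
	fmt.Println(hasProviderAnnotations(annotations, awsAnnotationPrefix))
}
```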
Mimicking the code, I tried creating the LoadBalancer service and observed the following behavior.

service.yaml:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-service
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-interval: "8"
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-unhealthy-threshold: "3"
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-healthy-threshold: "2"
spec:
  selector:
    app: MyApp
  ports:
  - protocol: TCP
    port: 80
  type: LoadBalancer
```

```
# oc create -f service.yaml
service/my-service created
# oc get svc
NAME         TYPE           CLUSTER-IP       EXTERNAL-IP                            PORT(S)        AGE
kubernetes   ClusterIP      172.30.0.1       <none>                                 443/TCP        13d
my-service   LoadBalancer   172.30.227.169   <pending>                              80:30616/TCP   66s
openshift    ExternalName   <none>           kubernetes.default.svc.cluster.local   <none>         13d
# oc get svc my-service -o=yaml
apiVersion: v1
kind: Service
metadata:
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-healthy-threshold: "2"
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-interval: "8"
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-unhealthy-threshold: "3"
  creationTimestamp: "2020-06-03T11:04:54Z"
  name: my-service
  namespace: default
  resourceVersion: "5990717"
  selfLink: /api/v1/namespaces/default/services/my-service
  uid: 859dd769-3c97-41fb-894c-0c948bea02b5
spec:
  clusterIP: 172.30.227.169
  externalTrafficPolicy: Cluster
  ports:
  - nodePort: 30616
    port: 80
    protocol: TCP
    targetPort: 80
  selector:
    app: MyApp
  sessionAffinity: None
  type: LoadBalancer
status:
  loadBalancer: {}
```

The test fails because it waits for an `svc.Status.LoadBalancer.Ingress` entry, but here the status stays empty since we are not running in an AWS environment (note `EXTERNAL-IP: <pending>` and `loadBalancer: {}` above). This shows that the current code will not work on non-AWS environments; further analysis and enhancement are required to run it on non-AWS providers.
Setting target release to current development version (4.6) for investigation. Where fixes (if any) are required/requested for prior versions, cloned BZs will be created when appropriate.
I'm adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.
Adding "UpcomingSprint" tag as the team does not have the bandwidth to work on this bug during this sprint.
Target reset to 4.7 while investigation is either ongoing or not yet started. Will be considered for earlier release versions when diagnosed and resolved.
@jpoulin FYI ^^ target reset to 4.7 in case it directly impacts "Libvirt: CI Job for Upgrade Testing on P". The fix can possibly be backported to 4.6 if required.
Hi lmcfadde, is this bug something that the IBM team could help investigate?
Please ignore my Comment 16 as the bug has been re-assigned.
The way ingress is handled on libvirt deployments doesn't make use of a load balancer. Instead, the round-robin DNS provided by dnsmasq when the cluster network is set up handles the ingress traffic both before and after the upgrade. We should probably just special-case this in the test suite.
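Special-casing could look like a platform gate around the load-balancer assertion: run the check only where the platform actually provisions load balancers, and fall back to plain connectivity testing on libvirt. A hedged sketch (the `PlatformType` constants and `supportsLoadBalancerService` helper are illustrative, not the names used in origin):

```go
package main

import "fmt"

// PlatformType mirrors the infrastructure platform field
// (hypothetical constants for illustration).
type PlatformType string

const (
	AWSPlatform     PlatformType = "AWS"
	LibvirtPlatform PlatformType = "Libvirt"
)

// supportsLoadBalancerService reports whether the platform provisions
// external load balancers for Service type=LoadBalancer. Libvirt relies on
// dnsmasq round-robin DNS instead, so the Ingress assertion should be
// skipped there.
func supportsLoadBalancerService(p PlatformType) bool {
	switch p {
	case LibvirtPlatform:
		return false
	default:
		return true
	}
}

func main() {
	for _, p := range []PlatformType{AWSPlatform, LibvirtPlatform} {
		if supportsLoadBalancerService(p) {
			fmt.Printf("%s: run load balancer check\n", p)
		} else {
			fmt.Printf("%s: skip load balancer check\n", p)
		}
	}
}
```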
Still waiting on the linked PR to be reviewed and merged.
Didn't observe the error in the upgrade e2e test; moving to VERIFIED.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5633