Bug 1921198 - API server is not reachable during upgrade to 4.7
Summary: API server is not reachable during upgrade to 4.7
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Dan Winship
QA Contact: zhaozhanqi
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-01-27 17:33 UTC by Hongkai Liu
Modified: 2023-09-15 00:59 UTC (History)
CC List: 9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-03-29 19:15:41 UTC
Target Upstream Version:
Embargoed:



Description Hongkai Liu 2021-01-27 17:33:31 UTC
Description of problem:

During the original upgrade from 4.6.9 to 4.7-fc.2, a failure in the in-cluster networking caused an outage for the API server: clients could not reach it.


2021/01/15 16:49:09 warning: failed to get pod e2e-e2e: Get "https://172.30.0.1:443/api/v1/namespaces/ci-op-hcbwcqn5/pods/e2e-e2e": dial tcp 172.30.0.1:443: connect: connection refused

The client here is a CI job running on the cluster.
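
For context, the failing request is the usual in-cluster GET against the kubernetes service IP (172.30.0.1:443 here). A minimal client-go sketch of what such a client does is below; the namespace and pod name are copied from the log line, everything else is illustrative, not the CI job's actual code:

  package main

  import (
  	"context"
  	"fmt"

  	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
  	"k8s.io/client-go/kubernetes"
  	"k8s.io/client-go/rest"
  )

  func main() {
  	// InClusterConfig points the client at the kubernetes.default service IP
  	// via KUBERNETES_SERVICE_HOST/PORT (172.30.0.1:443 on this cluster).
  	cfg, err := rest.InClusterConfig()
  	if err != nil {
  		panic(err)
  	}
  	client, err := kubernetes.NewForConfig(cfg)
  	if err != nil {
  		panic(err)
  	}
  	// A "connection refused" on this call means traffic from the pod never
  	// reached an apiserver backend behind the service IP.
  	pod, err := client.CoreV1().Pods("ci-op-hcbwcqn5").Get(context.TODO(), "e2e-e2e", metav1.GetOptions{})
  	if err != nil {
  		fmt.Printf("warning: failed to get pod e2e-e2e: %v\n", err)
  		return
  	}
  	fmt.Println("pod phase:", pod.Status.Phase)
  }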

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 2 Dan Winship 2021-01-28 15:51:16 UTC
(In reply to Hongkai Liu from comment #0)
> During the original upgrade to 4.7-fc.2 from 4.6.9, a failure in the
> in-cluster networking caused an outage for the API server as clients could
> not reach it
> 
> 
> 2021/01/15 16:49:09 warning: failed to get pod e2e-e2e: Get
> "https://172.30.0.1:443/api/v1/namespaces/ci-op-hcbwcqn5/pods/e2e-e2e": dial
> tcp 172.30.0.1:443: connect: connection refused

namespaces/openshift-sdn/pods/sdn-nvg45/sdn/sdn/logs/current.log (sdn log for one of the masters) shows:

  2021-01-15T16:48:57.396657220Z F0115 16:48:57.396618 3375499 healthcheck.go:82] SDN healthcheck detected OVS server change, restarting: timed out waiting for the condition

  ...

  2021-01-15T16:51:39.637215079Z I0115 16:51:39.637177    3034 node.go:152] Initializing SDN node "ip-10-0-159-123.ec2.internal" (10.0.159.123) of type "redhat/openshift-ovs-networkpolicy"

So it looks like OVS is getting restarted at some point during the upgrade, which of course is not supposed to happen any more on a running node, because it can cause network outages. :-/

Not immediately obvious from the must-gather who is restarting it...
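
For context, the fatal line comes from the SDN pod's OVS health check. The general shape is roughly the following; this is a hedged sketch, not the actual openshift-sdn implementation, and the socket path and timings are assumptions. It does show where the "timed out waiting for the condition" suffix comes from (the apimachinery wait helpers):

  package main

  import (
  	"net"
  	"time"

  	"k8s.io/apimachinery/pkg/util/wait"
  	"k8s.io/klog/v2"
  )

  // ovsdbSocket is an assumed path for the OVS database socket on the node.
  const ovsdbSocket = "/var/run/openvswitch/db.sock"

  // ovsReachable reports whether the OVS server is currently answering.
  func ovsReachable() (bool, error) {
  	conn, err := net.DialTimeout("unix", ovsdbSocket, 2*time.Second)
  	if err != nil {
  		return false, nil // not reachable yet; keep polling
  	}
  	conn.Close()
  	return true, nil
  }

  func main() {
  	// wait.PollImmediate returns "timed out waiting for the condition" when
  	// the deadline expires, matching the tail of the fatal log message above.
  	if err := wait.PollImmediate(time.Second, 30*time.Second, ovsReachable); err != nil {
  		klog.Fatalf("SDN healthcheck detected OVS server change, restarting: %v", err)
  	}
  }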

Comment 3 Vadim Rutkovsky 2021-01-29 13:36:22 UTC
*** Bug 1922187 has been marked as a duplicate of this bug. ***

Comment 5 Vadim Rutkovsky 2021-01-29 18:00:33 UTC
Possibly a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1920027

Comment 6 Dan Winship 2021-02-01 13:40:30 UTC
(In reply to Vadim Rutkovsky from comment #4)
> Same for upgrades to latest 4.7 ci release -
> https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-
> origin-installer-e2e-aws-upgrade/1354819404952506368

This was a mistake; Vadim asked about upgrade problems with apiserver and I said it was probably this bug, but that one is a totally different apiserver problem.

(In reply to Vadim Rutkovsky from comment #5)
> Possibly a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1920027

(may apply to 1922187 but that doesn't seem to be what's happening here)

Comment 13 Lalatendu Mohanty 2021-03-09 15:06:51 UTC
We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context and the UpgradeBlocker flag has been added to this bug. It will be removed if the assessment indicates that this should not block upgrade edges. The expectation is that the assignee answers these questions.

Who is impacted?  If we have to block upgrade edges based on this issue, which edges would need blocking?
  example: Customers upgrading from 4.y.Z to 4.y+1.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
  example: All customers upgrading from 4.y.z to 4.y+1.z fail approximately 10% of the time
What is the impact?  Is it serious enough to warrant blocking edges?
  example: Up to 2 minute disruption in edge routing
  example: Up to 90 seconds of API downtime
  example: etcd loses quorum and you have to restore from backup
How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
  example: Issue resolves itself after five minutes
  example: Admin uses oc to fix things
  example: Admin must SSH to hosts, restore from backups, or other non standard admin activities
Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?
  example: No, it’s always been like this; we just never noticed
  example: Yes, from 4.y.z to 4.y+1.z Or 4.y.z to 4.y.z+1

Comment 14 Dan Winship 2021-03-09 21:25:50 UTC
It is not clear what happened based on the information provided, and no further information is available from the original system. We were never able to reproduce the problem, and no one else has encountered a similar bug since then. So, closing since there is nothing useful we can do with this bug report.

(There were multiple bugs in the upgrade attempt that led to this bug being filed, so it's possible this bug was a side effect of one of the other bugs that _did_ get diagnosed and fixed...)

Comment 15 zhaozhanqi 2021-03-26 07:23:53 UTC
@danw 

We hit a similar issue again when upgrading from 4.5.36 to 4.6.23.

Found the error in the sdn pod YAML (must-gather.local.316999794353163198/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-5154227b30f1519bd01b37f1495b15640f5f82b1fefdaef19da66838b9858f3c/namespaces/openshift-sdn/pods/sdn-r5c2n/sdn-r5c2n.yaml):

        containerID: cri-o://4f6af7aa08af77badc2d794d0f58e90781e22275ccbb1c28e3ebd0c4df49898c
        exitCode: 255
        finishedAt: "2021-03-25T04:12:07Z"
        message: |
          tting endpoints for openshift-service-catalog-controller-manager/controller-manager:https to [10.128.0.52:6443 10.129.0.41:6443 10.130.0.25:6443]
          I0325 04:11:29.499541  101481 proxier.go:370] userspace proxy: processing 0 service events
          I0325 04:11:29.500688  101481 proxier.go:349] userspace syncProxyRules took 44.587513ms
          I0325 04:11:31.141109  101481 roundrobin.go:267] LoadBalancerRR: Setting endpoints for openshift-multus/multus-admission-controller:webhook to [10.128.0.80:6443 10.129.0.59:6443 10.130.0.50:6443]
          I0325 04:11:31.141147  101481 roundrobin.go:267] LoadBalancerRR: Setting endpoints for openshift-multus/multus-admission-controller:metrics to [10.128.0.80:8443 10.129.0.59:8443 10.130.0.50:8443]
          I0325 04:11:31.310239  101481 proxier.go:370] userspace proxy: processing 0 service events
          I0325 04:11:31.310621  101481 proxier.go:349] userspace syncProxyRules took 42.944661ms
          I0325 04:11:33.467974  101481 roundrobin.go:267] LoadBalancerRR: Setting endpoints for openshift-apiserver/api:https to [10.128.0.61:8443 10.129.0.49:8443 10.130.0.33:8443]
          I0325 04:11:33.640652  101481 proxier.go:370] userspace proxy: processing 0 service events
          I0325 04:11:33.641076  101481 proxier.go:349] userspace syncProxyRules took 42.173319ms
          I0325 04:11:42.326376  101481 roundrobin.go:295] LoadBalancerRR: Removing endpoints for openshift-controller-manager-operator/metrics:https
          I0325 04:11:42.522328  101481 proxier.go:370] userspace proxy: processing 0 service events
          I0325 04:11:42.522820  101481 proxier.go:349] userspace syncProxyRules took 44.187616ms
          I0325 04:11:43.365695  101481 roundrobin.go:267] LoadBalancerRR: Setting endpoints for openshift-controller-manager-operator/metrics:https to [10.130.0.38:8443]
          I0325 04:11:43.547184  101481 proxier.go:370] userspace proxy: processing 0 service events
          I0325 04:11:43.547618  101481 proxier.go:349] userspace syncProxyRules took 49.138075ms
          F0325 04:12:07.430665  101481 healthcheck.go:82] SDN healthcheck detected OVS server change, restarting: timed out waiting for the condition
        reason: Error
        startedAt: "2021-03-25T04:09:16Z"
    name: sdn
    ready: true
    restartCount: 1
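
The same terminated-container detail can also be read from a live cluster instead of the must-gather YAML. A minimal client-go sketch follows; the pod name is the one above, kubeconfig loading and error handling are illustrative:

  package main

  import (
  	"context"
  	"fmt"

  	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
  	"k8s.io/client-go/kubernetes"
  	"k8s.io/client-go/tools/clientcmd"
  )

  func main() {
  	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
  	if err != nil {
  		panic(err)
  	}
  	client, err := kubernetes.NewForConfig(cfg)
  	if err != nil {
  		panic(err)
  	}
  	pod, err := client.CoreV1().Pods("openshift-sdn").Get(context.TODO(), "sdn-r5c2n", metav1.GetOptions{})
  	if err != nil {
  		panic(err)
  	}
  	// Print the last termination state of the "sdn" container, which is where
  	// the fatal healthcheck message above is recorded.
  	for _, cs := range pod.Status.ContainerStatuses {
  		if cs.Name != "sdn" {
  			continue
  		}
  		if t := cs.LastTerminationState.Terminated; t != nil {
  			fmt.Printf("exitCode=%d reason=%s restarts=%d\nmessage:\n%s\n",
  				t.ExitCode, t.Reason, cs.RestartCount, t.Message)
  		}
  	}
  }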



and some pods on that node cannot reach "172.30.0.1:443":


tail router-default-769d869b98-4wfxf/router/router/logs/current.log
2021-03-25T05:35:06.525425113Z E0325 05:35:06.525344       1 reflector.go:127] github.com/openshift/router/pkg/router/template/service_lookup.go:33: Failed to watch *v1.Service: failed to list *v1.Service: Get "https://172.30.0.1:443/api/v1/services?limit=500&resourceVersion=0": dial tcp 172.30.0.1:443: connect: no route to host
2021-03-25T05:35:16.253180469Z E0325 05:35:16.253130       1 reflector.go:127] github.com/openshift/router/pkg/router/controller/factory/factory.go:125: Failed to watch *v1.Route: failed to list *v1.Route: Get "https://172.30.0.1:443/apis/route.openshift.io/v1/routes?limit=500&resourceVersion=0": dial tcp 172.30.0.1:443: connect: no route to host
2021-03-25T05:35:24.445609683Z E0325 05:35:24.445514       1 reflector.go:127] github.com/openshift/router/pkg/router/controller/factory/factory.go:125: Failed to watch *v1beta1.EndpointSlice: failed to list *v1beta1.EndpointSlice: Get "https://172.30.0.1:443/apis/discovery.k8s.io/v1beta1/endpointslices?limit=500&resourceVersion=0": dial tcp 172.30.0.1:443: connect: no route to host
2021-03-25T05:35:55.74119508Z E0325 05:35:55.741144       1 reflector.go:127] github.com/openshift/router/pkg/router/template/service_lookup.go:33: Failed to watch *v1.Service: failed to list *v1.Service: Get "https://172.30.0.1:443/api/v1/services?limit=500&resourceVersion=0": dial tcp 172.30.0.1:443: connect: no route to host
2021-03-25T05:36:11.421513782Z E0325 05:36:11.421455       1 reflector.go:127] github.com/openshift/router/pkg/router/controller/factory/factory.go:125: Failed to watch *v1.Route: failed to list *v1.Route: Get "https://172.30.0.1:443/apis/route.openshift.io/v1/routes?limit=500&resourceVersion=0": dial tcp 172.30.0.1:443: connect: no route to host
2021-03-25T05:36:20.510266896Z E0325 05:36:20.510193       1 reflector.go:127] github.com/openshift/router/pkg/router/controller/factory/factory.go:125: Failed to watch *v1beta1.EndpointSlice: failed to list *v1beta1.EndpointSlice: Get "https://172.30.0.1:443/apis/discovery.k8s.io/v1beta1/endpointslices?limit=500&resourceVersion=0": dial tcp 172.30.0.1:443: connect: no route to host
2021-03-25T05:36:34.717253572Z E0325 05:36:34.717188       1 reflector.go:127] github.com/openshift/router/pkg/router/template/service_lookup.go:33: Failed to watch *v1.Service: failed to list *v1.Service: Get "https://172.30.0.1:443/api/v1/services?limit=500&resourceVersion=0": dial tcp 172.30.0.1:443: connect: no route to host
2021-03-25T05:36:49.757256385Z E0325 05:36:49.757204       1 reflector.go:127] github.com/openshift/router/pkg/router/controller/factory/factory.go:125: Failed to watch *v1.Route: failed to list *v1.Route: Get "https://172.30.0.1:443/apis/route.openshift.io/v1/routes?limit=500&resourceVersion=0": dial tcp 172.30.0.1:443: connect: no route to host
2021-03-25T05:36:52.829191733Z E0325 05:36:52.829126       1 reflector.go:127] github.com/openshift/router/pkg/router/controller/factory/factory.go:125: Failed to watch *v1beta1.EndpointSlice: failed to list *v1beta1.EndpointSlice: Get "https://172.30.0.1:443/apis/discovery.k8s.io/v1beta1/endpointslices?limit=500&resourceVersion=0": dial tcp 172.30.0.1:443: connect: no route to host
2021-03-25T05:37:18.685155438Z E0325 05:37:18.685105       1 reflector.go:127] github.com/openshift/router/pkg/router/template/service_lookup.go:33: Failed to watch *v1.Service: failed to list *v1.Service: Get "https://172.30.0.1:443/api/v1/services?limit=500&resourceVersion=0": dial tcp 172.30.0.1:443: connect: no route to host



Here is the whole must-gather logs:
http://10.73.131.57:9000/openshift-must-gather/2021-03-25-05-43-18/must-gather.local.316999794353163198.tar.gz?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=openshift%2F20210325%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20210325T054335Z&X-Amz-Expires=604800&X-Amz-SignedHeaders=host&X-Amz-Signature=e2cb26d099b927476e56fbee86e1e8082ff5616f80418eb497b2cae3a2ee520b

Comment 16 Dan Winship 2021-03-29 19:15:41 UTC
Please file a new bug; original report was 4.6 -> 4.7, and in a particular context. There's not really anything to suggest that this is the same bug other than one very common error message.

Comment 17 W. Trevor King 2021-03-31 03:41:42 UTC
Removing UpgradeBlocker from this closed bug to take it out of our triage queue.  If this bug gets replaced by a new bug with more data, feel free to add the keyword to that new bug.

Comment 18 Red Hat Bugzilla 2023-09-15 00:59:19 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days

