Bug 1633662 - [Azure] cloud provider timeout while gathering node information
Summary: [Azure] cloud provider timeout while gathering node information
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 3.9.0
Hardware: Unspecified
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Target Release: 3.9.z
Assignee: Jan Chaloupka
QA Contact: Xiaoli Tian
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-09-27 13:35 UTC by Robert Bost
Modified: 2019-01-10 08:55 UTC
6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-01-10 08:55:23 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2019:0028 0 None None None 2019-01-10 08:55:28 UTC

Description Robert Bost 2018-09-27 13:35:05 UTC
Description of problem:
Customer is trying to upgrade from v3.9.40 to v3.9.41, but the upgrade halts due to a failure restarting the node.

Sep 20 14:48:11 ocpdazwe-mst-001 atomic-openshift-node[21444]: I0920 14:48:11.728975   21444 kubelet.go:1791] skipping pod synchronization - [Kubelet failed to get node info: failed to get external ID from cloud provider: instance not found]

Note that restarting the node service manually with systemctl works fine. This is likely because systemctl waits longer for the node information to become available than the Ansible playbook does.
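The longer-wait behavior could be approximated in the playbook's restart step with a retry loop. A minimal sketch, assuming a readiness check such as `systemctl is-active atomic-openshift-node`; the `retry` helper and the stand-in check below are hypothetical, not part of the actual playbook:

```shell
# Hypothetical retry helper: run a command up to $attempts times,
# sleeping $delay seconds between failures.
retry() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  return 1
}

# Stand-in for the real readiness check (e.g. systemctl is-active);
# this mock succeeds on the third call.
tries=0
check_node_ready() {
  tries=$((tries + 1))
  [ "$tries" -ge 3 ]
}

retry 5 0 check_node_ready && echo "node ready after $tries checks"
# prints: node ready after 3 checks
```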

Version-Release number of selected component (if applicable): v3.9.40, v3.9.41

How reproducible: Always in the customer's environment

Additional info:
A partial review of the logging while the issue occurs points to a possible caching problem, or another issue, inside the Azure cloud provider.

Comment 4 Takayoshi Tanaka 2018-10-22 04:25:15 UTC
I have a hypothesis. OCP 3.9.41, the first version showing this issue, introduced the following PR:

> Add cache for VirtualMachinesClient.Get in azure cloud provider
https://github.com/kubernetes/kubernetes/pull/57432

This is a Kubernetes PR, but I confirmed that OpenShift includes the corresponding change.
This PR could cause the following issues:
https://github.com/kubernetes/kubernetes/issues/57031
https://github.com/kubernetes/kubernetes/issues/56276

Since these issues cause API errors, I believe PR 57432 could be the cause of this BZ.

A fix has already been merged into upstream Kubernetes 1.9:

> Fix vm cache in concurrent case in azure_util.go
https://github.com/kubernetes/kubernetes/pull/57994

What do you think of this hypothesis?
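The concurrency bug class that PR 57994 addresses can be illustrated with a minimal sketch: a cache whose check-then-fetch path is made atomic with a mutex, so concurrent callers cannot race on the map and observe missing entries. The type and function names below are hypothetical, not the actual azure cloud provider code:

```go
package main

import (
	"fmt"
	"sync"
)

// vmCache is a hypothetical stand-in for the VM cache added in PR 57432.
// The mutex makes the check-then-fetch sequence atomic; without it,
// concurrent callers could race on the map and see incomplete state
// (the "instance not found" symptom described in the linked issues).
type vmCache struct {
	mu      sync.Mutex
	entries map[string]string
}

func newVMCache() *vmCache {
	return &vmCache{entries: make(map[string]string)}
}

// get returns the cached record for name, calling fetch on a miss.
func (c *vmCache) get(name string, fetch func(string) string) string {
	c.mu.Lock()
	defer c.mu.Unlock()
	if v, ok := c.entries[name]; ok {
		return v
	}
	v := fetch(name)
	c.entries[name] = v
	return v
}

// run hits the cache from ten goroutines and reports the cached value
// and how many times the backing fetch actually ran.
func run() (string, int) {
	cache := newVMCache()
	var fetchMu sync.Mutex
	fetches := 0
	fetch := func(name string) string {
		fetchMu.Lock()
		fetches++
		fetchMu.Unlock()
		return "vm-record-for-" + name
	}

	var wg sync.WaitGroup
	for i := 0; i < 10; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			cache.get("node-1", fetch)
		}()
	}
	wg.Wait()
	return cache.get("node-1", fetch), fetches
}

func main() {
	v, n := run()
	fmt.Println(v, "fetches:", n)
	// prints: vm-record-for-node-1 fetches: 1
}
```

Because the mutex serializes the miss path, only the first caller invokes fetch; the other nine goroutines hit the populated cache.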

Comment 6 Vladislav Walek 2018-10-24 14:58:46 UTC
Working versions:
- 3.9.30
- 3.9.33

Not working versions:
- 3.9.41
- 3.9.43

Comment 29 Jianwei Hou 2018-12-21 02:07:13 UTC
The issue is not reproducible on OCP v3.9.60; moving to VERIFIED.

Comment 31 errata-xmlrpc 2019-01-10 08:55:23 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0028

