
Bug 1633662

Summary: [Azure] cloud provider timeout while gathering node information
Product: OpenShift Container Platform
Reporter: Robert Bost <rbost>
Component: Cloud Compute
Assignee: Jan Chaloupka <jchaloup>
Status: CLOSED ERRATA
QA Contact: Xiaoli Tian <xtian>
Severity: high
Docs Contact:
Priority: high
Version: 3.9.0
CC: cshereme, jchaloup, jhou, rbost, tatanaka, vwalek
Target Milestone: ---
Target Release: 3.9.z
Hardware: Unspecified
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2019-01-10 08:55:23 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Robert Bost 2018-09-27 13:35:05 UTC
Description of problem:
The customer is trying to upgrade from v3.9.40 to v3.9.41, but the upgrade halts because the node service fails to restart.

Sep 20 14:48:11 ocpdazwe-mst-001 atomic-openshift-node[21444]: I0920 14:48:11.728975   21444 kubelet.go:1791] skipping pod synchronization - [Kubelet failed to get node info: failed to get external ID from cloud provider: instance not found]

Note that restarting the node service manually (systemctl) works fine. This is likely because systemctl waits longer for the node information than the Ansible playbook does.
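
To illustrate the timing hypothesis above, here is a minimal sketch of a poll-until-ready loop. It does not reproduce the Ansible playbook or the kubelet; wait_for_node_info, make_flaky_fetch and InstanceNotFound are hypothetical names, and the failure count is made up. It only shows how the same transient "instance not found" error can be fatal with a short wait and harmless with a longer one.

```python
import time


class InstanceNotFound(Exception):
    """Hypothetical stand-in for the 'instance not found' cloud provider error."""


def wait_for_node_info(fetch, timeout, interval=1.0,
                       clock=time.monotonic, sleep=time.sleep):
    """Call fetch() until it succeeds or the timeout would be exceeded."""
    deadline = clock() + timeout
    while True:
        try:
            return fetch()
        except InstanceNotFound:
            # Give up if another retry would run past the deadline.
            if clock() + interval > deadline:
                raise
            sleep(interval)


def make_flaky_fetch(failures):
    """Return a fetch() that fails `failures` times, then succeeds (assumed behavior)."""
    state = {"calls": 0}

    def fetch():
        state["calls"] += 1
        if state["calls"] <= failures:
            raise InstanceNotFound("instance not found")
        return {"external_id": "vm-0"}

    return fetch
```

With a lookup that fails three times, a 10-second wait at 1-second intervals returns the node information, while a 2-second wait re-raises the error. If the hypothesis holds, that would match a manual restart succeeding where the playbook gives up.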

Version-Release number of selected component (if applicable): v3.9.40, v3.9.41

How reproducible: Always in the customer's environment

Additional info:
A partial review of the logs captured while the issue occurs points to a possible caching problem, or some other problem, inside the Azure cloud provider.

Comment 4 Takayoshi Tanaka 2018-10-22 04:25:15 UTC
I have a hypothesis. OCP 3.9.41, the version that shows this issue, introduces the following PR.

> Add cache for VirtualMachinesClient.Get in azure cloud provider
https://github.com/kubernetes/kubernetes/pull/57432

This is a Kubernetes PR, but I confirmed that OpenShift includes the corresponding change.
This PR could cause the following issues:
https://github.com/kubernetes/kubernetes/issues/57031
https://github.com/kubernetes/kubernetes/issues/56276

Since these issues result in API errors, I think this BZ could be caused by PR 57432.

The fix has already been merged into upstream Kubernetes 1.9:

> Fix vm cache in concurrent case in azure_util.go (#57994)
https://github.com/kubernetes/kubernetes/pull/57994

What do you think of this hypothesis?
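
For readers unfamiliar with the class of problem the PR titles point at, here is a minimal sketch of a time-limited cache that stays consistent under concurrent lookups. It is not the upstream Go code from azure_util.go, and it is not a reconstruction of the actual bug; TimedCache, getter and the TTL value are hypothetical and chosen only for illustration.

```python
import threading
import time


class TimedCache:
    """Cache values for `ttl` seconds; one lock guards both lookup and refresh."""

    def __init__(self, ttl, getter, clock=time.monotonic):
        self._ttl = ttl
        self._getter = getter      # hypothetical backend lookup, e.g. a VM "Get" call
        self._clock = clock
        self._lock = threading.Lock()
        self._entries = {}         # key -> (value, expiry time)

    def get(self, key):
        # Holding the lock across the check and the refresh means concurrent
        # callers for the same key trigger a single backend call, and no
        # caller ever sees a half-updated entry.
        with self._lock:
            now = self._clock()
            entry = self._entries.get(key)
            if entry is not None and now < entry[1]:
                return entry[0]
            value = self._getter(key)
            self._entries[key] = (value, now + self._ttl)
            return value
```

A single lock is the simplest correct choice for a sketch, but it also serializes lookups for unrelated keys while a backend call is in flight. A per-key lock would avoid that at the cost of more bookkeeping.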

Comment 6 Vladislav Walek 2018-10-24 14:58:46 UTC
Working versions:
- 3.9.30
- 3.9.33

Not working versions:
- 3.9.41
- 3.9.43

Comment 29 Jianwei Hou 2018-12-21 02:07:13 UTC
The issue is not reproducible on OCP v3.9.60; moving to VERIFIED.

Comment 31 errata-xmlrpc 2019-01-10 08:55:23 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0028