Description of problem:
The customer is trying to upgrade from v3.9.40 to v3.9.41, but the upgrade halts because the node fails to restart.

Sep 20 14:48:11 ocpdazwe-mst-001 atomic-openshift-node[21444]: I0920 14:48:11.728975 21444 kubelet.go:1791] skipping pod synchronization - [Kubelet failed to get node info: failed to get external ID from cloud provider: instance not found]

Note that restarting the node service manually (systemctl) works fine. This is likely because systemctl waits longer for the node information than the Ansible playbook does.

Version-Release number of selected component (if applicable):
v3.9.40, v3.9.41

How reproducible:
Always in the customer environment

Additional info:
A partial review of the logs captured while the issue occurs points to a possible caching problem, or some other problem, inside the Azure cloud provider.
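To illustrate the timing explanation suggested in the description (a manual restart succeeds because it waits longer than the playbook does), here is a minimal sketch of a poll-with-deadline loop. It is not taken from the installer or from systemd; `wait_for_node`, `make_check`, and all durations are hypothetical and chosen only to show how the same readiness check can fail under a short deadline and pass under a longer one.

```python
import time


def wait_for_node(check, timeout_s, interval_s=0.01):
    """Poll check() until it returns True or timeout_s has elapsed."""
    deadline = time.monotonic() + timeout_s
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


def make_check(ready_after_s):
    """Hypothetical stand-in for 'the cloud provider can resolve the instance'.

    The returned callable starts reporting True once ready_after_s seconds
    have passed since it was created.
    """
    start = time.monotonic()
    return lambda: time.monotonic() - start >= ready_after_s


# Assumed numbers: the node needs 0.2 s before its info is available.
short_wait = wait_for_node(make_check(0.2), timeout_s=0.05)  # playbook-like
long_wait = wait_for_node(make_check(0.2), timeout_s=1.0)    # manual-restart-like
print(short_wait, long_wait)
```

If this explanation holds, the caller with the short deadline gives up while the node information is still unavailable, and the caller with the longer deadline sees the same service come up normally.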
I have one hypothesis. OCP 3.9.41, the version where this issue appears, introduces the following PR:

> Add cache for VirtualMachinesClient.Get in azure cloud provider
https://github.com/kubernetes/kubernetes/pull/57432

This is a Kubernetes PR, but I confirmed that OpenShift includes the corresponding change. This PR could cause the following issues:

https://github.com/kubernetes/kubernetes/issues/57031
https://github.com/kubernetes/kubernetes/issues/56276

Since these issues result in API errors, I think this BZ could be caused by PR 57432. A fix has already been merged into upstream Kubernetes 1.9:

> Fix vm cache in concurrent case in azure_util.go
https://github.com/kubernetes/kubernetes/pull/57994

What do you think of this hypothesis?
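For readers unfamiliar with this class of bug, here is a generic sketch of a lookup cache that is safe under concurrent access. It is not the upstream code from either PR, and I have not reproduced their logic here; `VMCache`, `fetch`, and the TTL value are hypothetical names and numbers used only for illustration. The point is that the check-then-populate step runs under one lock, so concurrent callers cannot observe a half-populated entry, and a failed lookup leaves nothing cached.

```python
import threading
import time


class VMCache:
    """Illustrative TTL cache; names and structure are assumptions."""

    def __init__(self, fetch, ttl_s=60.0):
        self._fetch = fetch          # callable: name -> VM info (may raise)
        self._ttl_s = ttl_s
        self._lock = threading.Lock()
        self._entries = {}           # name -> (value, expiry timestamp)

    def get(self, name):
        # Hold the lock across lookup and populate so that concurrent
        # callers for the same name trigger at most one fetch.
        with self._lock:
            entry = self._entries.get(name)
            if entry is not None and entry[1] > time.monotonic():
                return entry[0]
            value = self._fetch(name)  # an exception here caches nothing
            self._entries[name] = (value, time.monotonic() + self._ttl_s)
            return value
```

Holding a single lock during the fetch is the simplest correct design but serializes all lookups; a per-key lock would reduce contention at the cost of more bookkeeping.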
Working versions:
- 3.9.30
- 3.9.33

Not working versions:
- 3.9.41
- 3.9.43
The issue is not reproducible on OCP v3.9.60. Moving to VERIFIED.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:0028