Bug 1879524

Summary: Node goes into unschedulable after a MC is applied and node rebooted.
Product: OpenShift Container Platform Reporter: Jiří Mencák <jmencak>
Component: Networking Assignee: Ben Bennett <bbennett>
Networking sub component: openshift-sdn QA Contact: zhaozhanqi <zzhao>
Status: CLOSED DUPLICATE Docs Contact:
Severity: high    
Priority: unspecified CC: aos-bugs, jokerman
Version: 4.6   
Target Milestone: ---   
Target Release: 4.6.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-09-17 13:12:01 UTC Type: Bug

Description Jiří Mencák 2020-09-16 13:08:24 UTC
Description of problem:
This is a rare but very real bug that I noticed while testing new functionality that creates MachineConfigs. After an MC is applied, the node becomes unschedulable, the MCP never updates, and the MCD reports errors such as "failed to list *v1.MachineConfig: Get "https://172.30.0.1:443/apis/machineconfiguration.openshift.io/v1/machineconfigs?limit=500&resourceVersion=0": dial tcp 172.30.0.1:443: i/o timeout".

Version-Release number of selected component (if applicable):
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2020-09-16-000734   True        False         6h37m   Cluster version is 4.6.0-0.nightly-2020-09-16-000734


How reproducible:
Rare. I have noticed these issues in the past, but this is the first time I caught one before making further changes to the cluster.

Steps to Reproduce:
1. Create a MachineConfig that targets a custom MCP, which causes a node reboot.
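
The reproduction step can be sketched roughly as follows. This is only an illustration under assumptions, not the exact MachineConfig used in this report: the pool name `worker-custom`, the MachineConfig name `50-worker-custom-kargs`, the kernel argument, and the `APPLY`/`NODE` variables are all hypothetical. The script writes the manifests to a temporary directory and only talks to a cluster when `APPLY=1` is set and `oc` is available.

```shell
#!/usr/bin/env bash
# Sketch: custom MCP plus a MachineConfig that targets it.
# All resource names below are hypothetical placeholders.
workdir="$(mktemp -d)"

# Custom pool: picks up MCs labelled for "worker" or "worker-custom" and
# selects nodes carrying the node-role.kubernetes.io/worker-custom label.
cat > "$workdir/mcp.yaml" <<'EOF'
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: worker-custom
spec:
  machineConfigSelector:
    matchExpressions:
      - {key: machineconfiguration.openshift.io/role, operator: In, values: [worker, worker-custom]}
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/worker-custom: ""
EOF

# MachineConfig for the custom pool; the kernel argument is an arbitrary
# example (assumption) chosen only because applying it requires a reboot.
cat > "$workdir/mc.yaml" <<'EOF'
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 50-worker-custom-kargs
  labels:
    machineconfiguration.openshift.io/role: worker-custom
spec:
  config:
    ignition:
      version: 3.1.0
  kernelArguments:
    - example_arg=1
EOF

# Only touch a cluster when explicitly requested and oc is installed.
if [ "${APPLY:-0}" = "1" ] && command -v oc >/dev/null 2>&1; then
  # Move one worker into the custom pool, then create the pool and the MC.
  oc label node "${NODE:?set NODE to a worker node name}" node-role.kubernetes.io/worker-custom=
  oc apply -f "$workdir/mcp.yaml" -f "$workdir/mc.yaml"
  oc get mcp worker-custom
else
  echo "manifests written to $workdir"
fi
```

Example invocation, assuming the script is saved as `repro.sh`: `APPLY=1 NODE=<worker-node-name> bash repro.sh`, followed by watching `oc get mcp` and `oc get nodes` until the pool either updates or stalls.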

Actual results:
On rare occasions (one in ~20?), a node becomes unschedulable and the custom MCP never updates. The MCD reports:
failed to list *v1.MachineConfig: Get "https://172.30.0.1:443/apis/machineconfiguration.openshift.io/v1/machineconfigs?limit=500&resourceVersion=0": dial tcp 172.30.0.1:443: i/o timeout
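
A quick way to check whether the timeout is a plain TCP reachability problem is to probe the service IP directly. The helper below is a minimal sketch, not part of any OpenShift tooling: `check_tcp` is a hypothetical name, and it assumes `bash` and the coreutils `timeout` command are available wherever it is run (for example, in a shell on the affected node).

```shell
# check_tcp HOST PORT [SECONDS]
# Tries to open a TCP connection using bash's /dev/tcp pseudo-device and
# gives up after SECONDS (default 5). Returns 0 if the connection opened,
# 1 if it was refused or timed out.
check_tcp() {
  local host="$1" port="$2" secs="${3:-5}"
  if timeout "$secs" bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null; then
    echo "reachable: $host:$port"
  else
    echo "unreachable: $host:$port"
    return 1
  fi
}
```

With the address from the error message above, the call would be `check_tcp 172.30.0.1 443`. Comparing the result on the affected node with the result on a healthy node may help show whether the problem is specific to that node's networking.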

Expected results:
Node is schedulable, MCP is updated.

Additional info:
Other pods on the affected node have the same issue as the MCD pod. The must-gather below was taken after I had set the node back to schedulable and deleted the mcp pod.

http://file.rdu.redhat.com/jmencak/bugzilla/2020-09-19/must-gather.local.7417114535902373082.tar.xz
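
For reference, finding nodes left unschedulable and setting them back to schedulable can be sketched as below. This is an assumed workflow, not necessarily the exact commands used for this report: `list_cordoned` is a hypothetical helper that filters the default `oc get nodes` table, where cordoned nodes show `SchedulingDisabled` in the STATUS column.

```shell
# list_cordoned: reads `oc get nodes` output on stdin and prints the names
# of nodes whose STATUS column contains SchedulingDisabled.
list_cordoned() {
  awk 'NR > 1 && $2 ~ /SchedulingDisabled/ { print $1 }'
}

# Usage against a live cluster (not run here):
#   oc get nodes | list_cordoned
#   oc adm uncordon <node-name>
```

Note that uncordoning only clears the scheduling flag; it does not address whatever is preventing pods on the node from reaching the API service.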

Comment 4 Ben Bennett 2020-09-17 13:12:01 UTC

*** This bug has been marked as a duplicate of bug 1874696 ***