Bug 1727140
| Summary: | atomic-openshift-node.service keeps restarting every 3 minutes and all nodes flapping between Ready and NotReady | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | David Caldwell <dcaldwel> |
| Component: | Installer | Assignee: | Patrick Dillon <padillon> |
| Installer sub component: | openshift-ansible | QA Contact: | Weihua Meng <wmeng> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | urgent | ||
| Priority: | urgent | CC: | ahoness, aos-bugs, chris.liles, fan-wxa, gblomqui, hcisneir, jokerman, ksuzumur, ktadimar, llopezmo, mmccomas, msweiker, nigoyal, rh-container, rsandu, sgaikwad, wmeng |
| Version: | 3.11.0 | Keywords: | Regression |
| Target Milestone: | --- | ||
| Target Release: | 3.11.z | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: |
Cause: The sync pod runs a loop that evaluates whether a config file has changed; a regression was introduced that caused the evaluation to always report a change whenever a cluster uses the volume config.
Consequence: When the sync pod sees the config file has changed, it triggers a restart of the atomic-openshift-node service. The loop runs every 3 minutes, so the service restarts every 3 minutes.
Fix: The evaluation loop compares an old volume config against the current state; the regression removed the creation of the old volume config, and the fix was to reintroduce it.
Result: The service is restarted only when a change is detected in the volume config, not on every evaluation.
| Story Points: | --- | | |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-08-13 14:09:19 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
David Caldwell
2019-07-04 15:45:39 UTC
Exact version used: 3.11.117

Hello,
This problem also happens in our RHOCP3.11 environment.
=======================================================
[root@ip-172-31-112-117 ~]# oc describe nodes ip-172-31-112-117.us-east-2.compute.internal
....
Normal Starting 5m kubelet, ip-172-31-112-117.us-east-2.compute.internal Starting kubelet.
Normal NodeHasSufficientDisk 5m kubelet, ip-172-31-112-117.us-east-2.compute.internal Node ip-172-31-112-117.us-east-2.compute.internal status is now: NodeHasSufficientDisk
Normal NodeHasSufficientMemory 5m kubelet, ip-172-31-112-117.us-east-2.compute.internal Node ip-172-31-112-117.us-east-2.compute.internal status is now: NodeHasSufficientMemory
Normal NodeHasNoDiskPressure 5m kubelet, ip-172-31-112-117.us-east-2.compute.internal Node ip-172-31-112-117.us-east-2.compute.internal status is now: NodeHasNoDiskPressure
Normal NodeHasSufficientPID 5m kubelet, ip-172-31-112-117.us-east-2.compute.internal Node ip-172-31-112-117.us-east-2.compute.internal status is now: NodeHasSufficientPID
Normal NodeNotReady 5m kubelet, ip-172-31-112-117.us-east-2.compute.internal Node ip-172-31-112-117.us-east-2.compute.internal status is now: NodeNotReady
Normal NodeAllocatableEnforced 5m kubelet, ip-172-31-112-117.us-east-2.compute.internal Updated Node Allocatable limit across pods
Normal NodeReady 5m kubelet, ip-172-31-112-117.us-east-2.compute.internal Node ip-172-31-112-117.us-east-2.compute.internal status is now: NodeReady
[ocp311@ip-172-31-28-151 ansible]$ oc get nodes
NAME STATUS ROLES AGE VERSION
ip-172-31-112-117.us-east-2.compute.internal Ready infra,master 3d v1.11.0+d4cacc0
ip-172-31-64-199.us-east-2.compute.internal Ready infra,master 3d v1.11.0+d4cacc0
ip-172-31-64-225.us-east-2.compute.internal NotReady compute 3d v1.11.0+d4cacc0
ip-172-31-96-168.us-east-2.compute.internal Ready infra,master 3d v1.11.0+d4cacc0
[ocp311@ip-172-31-28-151 ansible]$ oc get nodes
NAME STATUS ROLES AGE VERSION
ip-172-31-112-117.us-east-2.compute.internal NotReady infra,master 3d v1.11.0+d4cacc0
ip-172-31-64-199.us-east-2.compute.internal NotReady infra,master 3d v1.11.0+d4cacc0
ip-172-31-64-225.us-east-2.compute.internal NotReady compute 3d v1.11.0+d4cacc0
ip-172-31-96-168.us-east-2.compute.internal NotReady infra,master 3d v1.11.0+d4cacc0
[ocp311@ip-172-31-28-151 ansible]$ oc get nodes
NAME STATUS ROLES AGE VERSION
ip-172-31-112-117.us-east-2.compute.internal Ready infra,master 3d v1.11.0+d4cacc0
ip-172-31-64-199.us-east-2.compute.internal Ready infra,master 3d v1.11.0+d4cacc0
ip-172-31-64-225.us-east-2.compute.internal Ready compute 3d v1.11.0+d4cacc0
ip-172-31-96-168.us-east-2.compute.internal Ready infra,master 3d v1.11.0+d4cacc0
According to our investigation, atomic-openshift-node is restarted by the node sync pod.
This problem is a regression introduced by the following change to openshift-ansible-roles:
- Tar node & volume config to do md5sum comparison in sync pod.
(pdd)
Because of this change, the sync pod now creates a tar file containing the node & volume config
and uses the tar file's md5sum to judge whether the node's configuration has changed.
But there is a problem: a tar file's md5sum should not be used for this judgment,
because even when the contents of the files are identical, the resulting tar archives can differ.
For example:
========================================
[root@ip-172-31-112-117 tmp]# touch aa bb
[root@ip-172-31-112-117 tmp]# cp aa cc
[root@ip-172-31-112-117 tmp]# cp bb dd
[root@ip-172-31-112-117 tmp]# diff aa cc
[root@ip-172-31-112-117 tmp]# diff bb dd
[root@ip-172-31-112-117 tmp]# tar -Pcf aabb.tar aa bb
[root@ip-172-31-112-117 tmp]# tar -Pcf ccdd.tar cc dd
[root@ip-172-31-112-117 tmp]# diff aabb.tar ccdd.tar
Binary files aabb.tar and ccdd.tar differ
[root@ip-172-31-112-117 tmp]# md5sum aabb.tar ccdd.tar
609fce8f20a2378af15d68360c2f5960 aabb.tar
505bcafc44a4cffe3a253872b95baa32 ccdd.tar
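The checksums differ because a tar archive stores per-file header metadata (file name, mtime, ownership) in addition to the file contents, so byte-identical files archived under different names or at different times produce different archives. A minimal, self-contained sketch (temporary directory, illustrative file names only) showing that checksumming the concatenated file contents, rather than the archive, is stable:

```shell
# Create two pairs of files with identical contents but different names.
dir=$(mktemp -d)
cd "$dir"
printf 'node config\n' > aa
printf 'volume config\n' > bb
cp aa cc
cp bb dd

# Archiving each pair and hashing the archives gives different sums,
# because the tar headers embed the differing file names.
tar -cf aabb.tar aa bb
tar -cf ccdd.tar cc dd
sum_tar1=$(md5sum aabb.tar | cut -d' ' -f1)
sum_tar2=$(md5sum ccdd.tar | cut -d' ' -f1)

# Hashing the concatenated contents directly gives identical sums.
sum_cat1=$(cat aa bb | md5sum | cut -d' ' -f1)
sum_cat2=$(cat cc dd | md5sum | cut -d' ' -f1)

[ "$sum_tar1" != "$sum_tar2" ] && echo "archive checksums differ"
[ "$sum_cat1" = "$sum_cat2" ] && echo "content checksums match"
```

This is also why the fix in PR#11779 reintroduces a saved copy of the old volume config to compare against, rather than relying on archive-level checksums alone.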
This PR fixes the regression: https://github.com/openshift/openshift-ansible/pull/11779

*** Bug 1728195 has been marked as a duplicate of this bug. ***

PR#11779 is approved but is blocked by https://github.com/openshift/release/pull/4533

Can be reproduced by the following steps:
1. Create the /var/lib/origin dir if it does not exist.
   # mkdir /var/lib/origin
2. Create a new partition and use the xfs file system.
3. Add the mount point to /etc/fstab with the grpquota option:
   /dev/mapper/rhel-lv01 /var/lib/origin xfs defaults,grpquota 0 2
4. # mount -a
5. Install OCP with openshift_node_local_quota_per_fsgroup=200Mi

Aug 01 08:30:55 preserve-wmeng311lq3oa123-me-1 systemd[1]: Stopped OpenShift Node.
Aug 01 08:33:57 preserve-wmeng311lq3oa123-me-1 systemd[1]: Stopped OpenShift Node.
Aug 01 08:36:58 preserve-wmeng311lq3oa123-me-1 systemd[1]: Stopped OpenShift Node.
Aug 01 08:40:00 preserve-wmeng311lq3oa123-me-1 systemd[1]: Stopped OpenShift Node.
Aug 01 08:43:02 preserve-wmeng311lq3oa123-me-1 systemd[1]: Stopped OpenShift Node.
Aug 01 08:46:04 preserve-wmeng311lq3oa123-me-1 systemd[1]: Stopped OpenShift Node.
Aug 01 08:49:05 preserve-wmeng311lq3oa123-me-1 systemd[1]: Stopped OpenShift Node.

Fixed.
openshift-ansible-3.11.135-1.git.0.b7ad55a.el7

# systemctl status atomic-openshift-node.service -l
● atomic-openshift-node.service - OpenShift Node
   Loaded: loaded (/etc/systemd/system/atomic-openshift-node.service; enabled; vendor preset: disabled)
   Active: active (running) since Fri 2019-08-02 08:20:31 EDT; 22min ago

Running for 22 minutes without restart.

# cat /etc/origin/node/volume-config.yaml
apiVersion: kubelet.config.openshift.io/v1
kind: VolumeConfig
localQuota:
  perFSGroup: 200Mi

Verified per comment 25.

*** Bug 1737310 has been marked as a duplicate of this bug. ***

Workaround instructions for using the fixed sync.yaml from https://github.com/openshift/openshift-ansible/pull/11779/files:
1. Download the fixed sync daemonset definition:
   https://raw.githubusercontent.com/sureshgaikwad/openshift-ansible/a5043cb12dea6cff3f9513dc1aaa5c9d13c94c56/roles/openshift_node_group/files/sync.yaml
2. Replace the daemonset:
   $ oc project openshift-node
   $ oc replace -f sync.yaml
   * The sync pods will restart (oc get pods) and atomic-openshift-node will restart one last time.
   * From then on, the node will always remain in the Ready state.
3. We recommend updating the cluster again once the next version, which includes these fixes, is released.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2352
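For reference, the reproduction steps described in the comments above can be collected into a single shell sequence. This is a sketch only: the device name (/dev/mapper/rhel-lv01) and quota value follow the example given in this bug, and the commands assume a root shell on a RHEL 7 host; adjust them for the actual environment before running, as they modify /etc/fstab and format a device.

```shell
# Reproduction sketch (illustrative device name; run as root).
# 1. Ensure the mount point exists.
mkdir -p /var/lib/origin

# 2. Create an XFS filesystem on the new partition.
mkfs.xfs /dev/mapper/rhel-lv01

# 3. Register the mount in /etc/fstab with the grpquota option.
echo '/dev/mapper/rhel-lv01 /var/lib/origin xfs defaults,grpquota 0 2' >> /etc/fstab

# 4. Mount everything listed in fstab.
mount -a

# 5. Install OCP 3.11 with the per-FSGroup local quota set in the
#    Ansible inventory: openshift_node_local_quota_per_fsgroup=200Mi
```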