Description of problem:

Node status keeps flapping between Ready and NotReady. The customer is using the VMware cloud provider, and the nodes have multiple network interfaces.

In the journal logs from an affected node we see the following every 10 seconds:

Jul 02 08:40:23 node.local atomic-openshift-node[21218]: I0702 08:40:23.527090   21218 cloud_request_manager.go:89] Requesting node addresses from cloud provider for node "node.local"
Jul 02 08:40:23 node.local atomic-openshift-node[21218]: I0702 08:40:23.527758   21218 cloud_request_manager.go:108] Node addresses from cloud provider for node "node.local" collected

this, twice every 5 seconds:

Jul 03 14:10:56 node.local atomic-openshift-node[15849]: echo "OVS seems to have crashed, exiting"

and this every 3 minutes:

Jul 03 14:12:22 node.local systemd[1]: atomic-openshift-node.service holdoff time over, scheduling restart.
Jul 03 14:12:22 node.local systemd[1]: Stopped OpenShift Node.
Jul 03 14:12:22 node.local systemd[1]: Starting OpenShift Node...

Also frequently seen in the logs:

Jul 03 14:10:00 node.local atomic-openshift-node[15849]: I0703 14:10:00.552568   15849 vsphere.go:538] Find local IP address 10.xxx.xxx.xxx and set type to
Jul 03 14:10:00 node.local atomic-openshift-node[15849]: I0703 14:10:00.552606   15849 vsphere.go:538] Find local IP address 172.xxx.xxx.xxx and set type to
Jul 03 14:10:00 node.local atomic-openshift-node[15849]: I0703 14:10:00.552707   15849 vsphere.go:538] Find local IP address 192.xxx.xxx.xxx and set type to

(Full log attached as private.)

Workaround: the nodes stabilised when 'nodeIP: xxx.xxx.xxx.xxx' was set in the node configmaps.

Version-Release number of selected component (if applicable):
3.11

How reproducible:
Not sure, but the nodes all have multiple network interfaces and VMware is the cloud provider. nodeIP was not present, and the behaviour stopped once nodeIP was added to the node configmaps.

Actual results:
All nodes (workers, infras and masters) flapped between Ready and NotReady.

Expected results:
No flapping when using VMware as the cloud provider.

Additional info:
The behaviour seems to be related to https://bugzilla.redhat.com/show_bug.cgi?id=1668802, but the errors are different, so we didn't want to confuse that BZ.
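For reference, a minimal sketch of how the nodeIP workaround can be applied on 3.11, assuming the default node-group configmap names in the openshift-node namespace (node-config-master, node-config-infra, node-config-compute; actual names vary per cluster, and the configmap name used below is only an example):

$ oc -n openshift-node get configmaps                      (list the node-group configmaps)
$ oc -n openshift-node edit configmap node-config-compute  (edit the affected group's config)

Inside the embedded node-config.yaml, add nodeIP at the top level:

    ...
    nodeIP: xxx.xxx.xxx.xxx    (the IP of the interface the node should report)
    ...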
Exact version used: 3.11.117
Hello,

This problem also happens in our RHOCP 3.11 environment.

=======================================================
[root@ip-172-31-112-117 ~]# oc describe nodes ip-172-31-112-117.us-east-2.compute.internal
....
  Normal  Starting                 5m  kubelet, ip-172-31-112-117.us-east-2.compute.internal  Starting kubelet.
  Normal  NodeHasSufficientDisk    5m  kubelet, ip-172-31-112-117.us-east-2.compute.internal  Node ip-172-31-112-117.us-east-2.compute.internal status is now: NodeHasSufficientDisk
  Normal  NodeHasSufficientMemory  5m  kubelet, ip-172-31-112-117.us-east-2.compute.internal  Node ip-172-31-112-117.us-east-2.compute.internal status is now: NodeHasSufficientMemory
  Normal  NodeHasNoDiskPressure    5m  kubelet, ip-172-31-112-117.us-east-2.compute.internal  Node ip-172-31-112-117.us-east-2.compute.internal status is now: NodeHasNoDiskPressure
  Normal  NodeHasSufficientPID     5m  kubelet, ip-172-31-112-117.us-east-2.compute.internal  Node ip-172-31-112-117.us-east-2.compute.internal status is now: NodeHasSufficientPID
  Normal  NodeNotReady             5m  kubelet, ip-172-31-112-117.us-east-2.compute.internal  Node ip-172-31-112-117.us-east-2.compute.internal status is now: NodeNotReady
  Normal  NodeAllocatableEnforced  5m  kubelet, ip-172-31-112-117.us-east-2.compute.internal  Updated Node Allocatable limit across pods
  Normal  NodeReady                5m  kubelet, ip-172-31-112-117.us-east-2.compute.internal  Node ip-172-31-112-117.us-east-2.compute.internal status is now: NodeReady

[ocp311@ip-172-31-28-151 ansible]$ oc get nodes
NAME                                           STATUS    ROLES          AGE  VERSION
ip-172-31-112-117.us-east-2.compute.internal   Ready     infra,master   3d   v1.11.0+d4cacc0
ip-172-31-64-199.us-east-2.compute.internal    Ready     infra,master   3d   v1.11.0+d4cacc0
ip-172-31-64-225.us-east-2.compute.internal    NotReady  compute        3d   v1.11.0+d4cacc0
ip-172-31-96-168.us-east-2.compute.internal    Ready     infra,master   3d   v1.11.0+d4cacc0

[ocp311@ip-172-31-28-151 ansible]$ oc get nodes
NAME                                           STATUS    ROLES          AGE  VERSION
ip-172-31-112-117.us-east-2.compute.internal   NotReady  infra,master   3d   v1.11.0+d4cacc0
ip-172-31-64-199.us-east-2.compute.internal    NotReady  infra,master   3d   v1.11.0+d4cacc0
ip-172-31-64-225.us-east-2.compute.internal    NotReady  compute        3d   v1.11.0+d4cacc0
ip-172-31-96-168.us-east-2.compute.internal    NotReady  infra,master   3d   v1.11.0+d4cacc0

[ocp311@ip-172-31-28-151 ansible]$ oc get nodes
NAME                                           STATUS    ROLES          AGE  VERSION
ip-172-31-112-117.us-east-2.compute.internal   Ready     infra,master   3d   v1.11.0+d4cacc0
ip-172-31-64-199.us-east-2.compute.internal    Ready     infra,master   3d   v1.11.0+d4cacc0
ip-172-31-64-225.us-east-2.compute.internal    Ready     compute        3d   v1.11.0+d4cacc0
ip-172-31-96-168.us-east-2.compute.internal    Ready     infra,master   3d   v1.11.0+d4cacc0
=======================================================

According to our investigation, atomic-openshift-node is being restarted by the node sync pod. This is a regression introduced by the following change to the openshift-ansible roles:

- Tar node & volume config to do md5sum comparison in sync pod. (pdd)

Because of this change, the sync pod now creates a tar file containing the node & volume config and uses the tar file's md5sum to judge whether the node's configuration has changed. But there is a problem: a tar archive's md5sum should not be used for this comparison, because even when the contents of the files are identical, the resulting tar archives can differ (the archive also embeds file names, timestamps, and other metadata).
For example:
========================================
[root@ip-172-31-112-117 tmp]# touch aa bb
[root@ip-172-31-112-117 tmp]# cp aa cc
[root@ip-172-31-112-117 tmp]# cp bb dd
[root@ip-172-31-112-117 tmp]# diff aa cc
[root@ip-172-31-112-117 tmp]# diff bb dd
[root@ip-172-31-112-117 tmp]# tar -Pcf aabb.tar aa bb
[root@ip-172-31-112-117 tmp]# tar -Pcf ccdd.tar cc dd
[root@ip-172-31-112-117 tmp]# diff aabb.tar ccdd.tar
Binary files aabb.tar and ccdd.tar differ
[root@ip-172-31-112-117 tmp]# md5sum aabb.tar ccdd.tar
609fce8f20a2378af15d68360c2f5960  aabb.tar
505bcafc44a4cffe3a253872b95baa32  ccdd.tar
========================================
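For comparison, one way to checksum only the file contents, with no tar metadata involved (a sketch of the general idea, not the actual change made in PR#11779):

# md5sum /etc/origin/node/node-config.yaml /etc/origin/node/volume-config.yaml | md5sum

The inner md5sum emits one line per file (hash plus file name), and the outer md5sum collapses that listing into a single checksum, so the result is stable across runs and changes only when a file's contents (or name) actually change.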
This PR fixes the regression: https://github.com/openshift/openshift-ansible/pull/11779
*** Bug 1728195 has been marked as a duplicate of this bug. ***
PR#11779 is approved but is blocked by https://github.com/openshift/release/pull/4533
Can be reproduced with the following steps:

1. Create the /var/lib/origin directory if it does not exist:
   # mkdir /var/lib/origin
2. Create a new partition and format it with the xfs file system (one way to do this is sketched below, after the log excerpt).
3. Add a mount point to /etc/fstab with the grpquota option:
   /dev/mapper/rhel-lv01  /var/lib/origin  xfs  defaults,grpquota  0 2
4. # mount -a
5. Install OCP with openshift_node_local_quota_per_fsgroup=200Mi

Aug 01 08:30:55 preserve-wmeng311lq3oa123-me-1 systemd[1]: Stopped OpenShift Node.
Aug 01 08:33:57 preserve-wmeng311lq3oa123-me-1 systemd[1]: Stopped OpenShift Node.
Aug 01 08:36:58 preserve-wmeng311lq3oa123-me-1 systemd[1]: Stopped OpenShift Node.
Aug 01 08:40:00 preserve-wmeng311lq3oa123-me-1 systemd[1]: Stopped OpenShift Node.
Aug 01 08:43:02 preserve-wmeng311lq3oa123-me-1 systemd[1]: Stopped OpenShift Node.
Aug 01 08:46:04 preserve-wmeng311lq3oa123-me-1 systemd[1]: Stopped OpenShift Node.
Aug 01 08:49:05 preserve-wmeng311lq3oa123-me-1 systemd[1]: Stopped OpenShift Node.
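For step 2, a minimal sketch assuming a volume group named 'rhel' with free space, matching the device name used in the fstab line above (adjust names and sizes to your environment):

# lvcreate -L 10G -n lv01 rhel       (create a logical volume for /var/lib/origin)
# mkfs.xfs /dev/mapper/rhel-lv01     (format it with xfs)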
Fixed.

openshift-ansible-3.11.135-1.git.0.b7ad55a.el7

# systemctl status atomic-openshift-node.service -l
● atomic-openshift-node.service - OpenShift Node
   Loaded: loaded (/etc/systemd/system/atomic-openshift-node.service; enabled; vendor preset: disabled)
   Active: active (running) since Fri 2019-08-02 08:20:31 EDT; 22min ago

Running for 22 minutes without a restart.

# cat /etc/origin/node/volume-config.yaml
apiVersion: kubelet.config.openshift.io/v1
kind: VolumeConfig
localQuota:
  perFSGroup: 200Mi
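Another quick way to confirm the restart loop is gone (a sketch; adjust the time window as needed):

# journalctl -u atomic-openshift-node.service --since "1 hour ago" | grep -c "Stopped OpenShift Node"

On a stable node this should report 0 once the final post-fix restart has completed; on an affected node it climbs roughly every 3 minutes.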
Verified per comment 25.
*** Bug 1737310 has been marked as a duplicate of this bug. ***
Workaround instructions for using the fixed sync.yaml from https://github.com/openshift/openshift-ansible/pull/11779/files:

1. Download the fixed sync daemonset definition:
   https://raw.githubusercontent.com/sureshgaikwad/openshift-ansible/a5043cb12dea6cff3f9513dc1aaa5c9d13c94c56/roles/openshift_node_group/files/sync.yaml

2. Replace the daemonset:
   $ oc project openshift-node
   $ oc replace -f sync.yaml

   * The sync pods will restart (oc get pods) and atomic-openshift-node will restart one last time.
   * From then on, the node will stay in the Ready state.

3. We recommend updating the cluster again when the next version, which will include these fixes, is released.
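To watch the rollout settle after replacing the daemonset (a sketch; interrupt with Ctrl-C once the statuses stop changing):

$ oc -n openshift-node get pods -w     (sync pods restart with the new definition)
$ oc get nodes -w                      (nodes should stop flapping after the final restart)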
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2352