Description of problem:
We found that sometimes the hostsubnet cannot be deleted after deleting the node. We may not have the exact steps to reproduce it reliably; after many attempts, the steps below may help.

Version-Release number of selected component (if applicable):
openshift v3.6.143
kubernetes v1.6.1+5115d708d7

How reproducible:
Sometimes

Steps to Reproduce:
1. Set up a multi-node env
[root@host-8-175-115 ~]# oc get node
NAME                                                STATUS    AGE       VERSION
host-8-174-41.host.centralci.eng.rdu2.redhat.com    Ready     37s       v1.6.1+5115d708d7
host-8-175-115.host.centralci.eng.rdu2.redhat.com   Ready     34s       v1.6.1+5115d708d7

2. Check the hostsubnets
[root@host-8-175-115 ~]# oc get hostsubnet
NAME                                                HOST                                                HOST IP          SUBNET
host-8-174-41.host.centralci.eng.rdu2.redhat.com    host-8-174-41.host.centralci.eng.rdu2.redhat.com    172.16.120.2     10.128.0.0/23
host-8-175-115.host.centralci.eng.rdu2.redhat.com   host-8-175-115.host.centralci.eng.rdu2.redhat.com   172.16.120.199   10.129.0.0/23

3. Enlarge the clusterNetworkCIDR
default:      clusterNetworkCIDR: 10.128.0.0/14
update it to: clusterNetworkCIDR: 10.128.0.0/10

4. Delete the node
[root@host-8-175-115 ~]# oc delete node host-8-174-41.host.centralci.eng.rdu2.redhat.com
node "host-8-174-41.host.centralci.eng.rdu2.redhat.com" deleted
[root@host-8-175-115 ~]# oc get node
NAME                                                STATUS    AGE       VERSION
host-8-175-115.host.centralci.eng.rdu2.redhat.com   Ready     3m        v1.6.1+5115d708d7

5. Check the hostsubnets again
[root@host-8-175-115 ~]# oc get hostsubnet
NAME                                                HOST                                                HOST IP          SUBNET
host-8-174-41.host.centralci.eng.rdu2.redhat.com    host-8-174-41.host.centralci.eng.rdu2.redhat.com    172.16.120.2     10.128.0.0/23
host-8-175-115.host.centralci.eng.rdu2.redhat.com   host-8-175-115.host.centralci.eng.rdu2.redhat.com   172.16.120.199   10.129.0.0/23

Actual results:
In the last step, the hostsubnet for the deleted node is still present; it is not cleaned up after the node is deleted.

Expected results:
The hostsubnet should be deleted when the node is deleted.

Additional info:
> 3. Enlarge the clusterNetworkCIDR
> default -> clusterNetworkCIDR: 10.128.0.0/14
> update it to -> clusterNetworkCIDR: 10.128.0.0/10
>
> 3. delete the node

I assume there's an implied "restart the master" between those steps? I couldn't reproduce the bug either way, though.
(In reply to Dan Winship from comment #3)
> > 3. Enlarge the clusterNetworkCIDR
> > 3. delete the node
>
> I assume there's an implied "restart the master" between those steps?

And if so, were you giving the master a chance to fully start up after restarting, or just deleting the node right away? If you deleted the Node very quickly after the master started, the SDN's Node watch might not have started yet, so it would never see a Deleted event for the Node, and therefore would not delete the HostSubnet. (This might also be why the bug is hard to reproduce.) In that case it should be possible to tell from the logs that that's what happened.

We probably ought to be better about keeping Nodes and HostSubnets in sync, although that's tricky now that we let the user create non-Node-based HostSubnets for routers too; we can't just assume that any HostSubnet that doesn't correspond to a Node is safe to delete. But we could have the master annotate the HostSubnets that it creates with the UID of the corresponding Node, and then at startup, if it finds a HostSubnet that allegedly corresponds to a Node, but that Node doesn't exist, it could delete it. (HostSubnets for routers would have no annotation and so would never be automatically deleted.)
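The annotation-based reconciliation proposed above could be sketched roughly as follows. This is not the code that was merged for this bug; the annotation key, the struct shape, and the plain in-memory sets standing in for the Kubernetes API are all assumptions for illustration only.

```go
package main

import "fmt"

// Hypothetical annotation key; the real key name (if any) is an assumption here.
const nodeUIDAnnotation = "pod.network.openshift.io/node-uid"

// HostSubnet models only the fields relevant to this reconciliation sketch.
type HostSubnet struct {
	Name        string
	Annotations map[string]string
}

// staleHostSubnets returns the names of HostSubnets that were created for a
// Node (i.e. carry the node-UID annotation) whose Node no longer exists.
// Router HostSubnets have no annotation and are never selected for deletion.
func staleHostSubnets(subnets []HostSubnet, liveNodeUIDs map[string]bool) []string {
	var stale []string
	for _, hs := range subnets {
		uid, ok := hs.Annotations[nodeUIDAnnotation]
		if !ok {
			continue // no annotation: a router HostSubnet, leave it alone
		}
		if !liveNodeUIDs[uid] {
			stale = append(stale, hs.Name)
		}
	}
	return stale
}

func main() {
	subnets := []HostSubnet{
		{Name: "node-a", Annotations: map[string]string{nodeUIDAnnotation: "uid-a"}},
		{Name: "node-b", Annotations: map[string]string{nodeUIDAnnotation: "uid-b"}}, // its Node was deleted
		{Name: "router-1", Annotations: map[string]string{}},                         // router: keep
	}
	live := map[string]bool{"uid-a": true}
	fmt.Println(staleHostSubnets(subnets, live)) // [node-b]
}
```

The point of keying on the Node's UID rather than its name is that a HostSubnet created by hand for a router never gets the annotation, so it can never be swept up by the automatic cleanup.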
We should definitely be doing this (keeping hostsubnets and nodes in sync). That's a big gap.
I can't reproduce the inability to delete the hostsubnets. I was able to get the system into the state where there was a hostsubnet for a deleted node but the cluster did not prevent me from deleting it. Can this still be reproduced? Should the system automatically delete the hostsubnet without a node?
Hi jtanenba,

Maybe you could use this way to reproduce the issue:

1. Set up a multi-node env
2. SSH into the master and run the command below:
[root@ip-172-18-6-194 ~]# systemctl restart atomic-openshift-master-api.service; systemctl restart atomic-openshift-master-controllers.service; oc delete node ip-172-18-6-202.ec2.internal
node "ip-172-18-6-202.ec2.internal" deleted
3. Check the node and hostsubnet
[root@ip-172-18-6-194 ~]# oc get node
NAME                           STATUS                     ROLES     AGE       VERSION
ip-172-18-6-194.ec2.internal   Ready,SchedulingDisabled   master    22m       v1.9.1+a0ce1bc657
[root@ip-172-18-6-194 ~]# oc get hostsubnet
NAME                           HOST                           HOST IP        SUBNET          EGRESS IPS
ip-172-18-6-194.ec2.internal   ip-172-18-6-194.ec2.internal   172.18.6.194   10.129.0.0/23   []
ip-172-18-6-202.ec2.internal   ip-172-18-6-202.ec2.internal   172.18.6.202   10.131.0.0/23   []

Then you can see the node is deleted, but the hostsubnet still exists.

I just used the latest OCP-3.9 env to reproduce it:
openshift v3.9.0-0.31.0
kubernetes v1.9.1+a0ce1bc657
Commit pushed to master at https://github.com/openshift/origin

https://github.com/openshift/origin/commit/4134eec37f56e5d8813d3d507f124191860dcff6
automatically remove hostsubnets with no nodes on startup

openshift deletes a node's hostsubnet automatically, but during startup there is a window before the watches start in which a node can be deleted, and we miss the event for the deletion of the hostsubnet. At the end of startup, look through the list of hostsubnets and make sure there is an accompanying node; if no node is found, delete it.

Bug 1470612
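The fix described in the commit message is a one-time sweep at the end of startup. A minimal sketch of that logic, not the merged code, might look like this: plain name sets stand in for the real node and hostsubnet listers, the function only returns the names it would delete, and the merged code's handling of non-Node HostSubnets (e.g. for routers) is omitted here.

```go
package main

import "fmt"

// startupSweep models the post-startup reconciliation from the commit: any
// hostsubnet whose name has no accompanying node is considered orphaned and
// would be deleted. String sets stand in for the real Kubernetes listers.
func startupSweep(hostSubnetNames []string, nodeNames map[string]bool) []string {
	var orphaned []string
	for _, name := range hostSubnetNames {
		if !nodeNames[name] {
			orphaned = append(orphaned, name)
		}
	}
	return orphaned
}

func main() {
	subnets := []string{"ip-172-18-6-194.ec2.internal", "ip-172-18-6-202.ec2.internal"}
	// ip-172-18-6-202 was deleted while the master's watches were not yet running
	nodes := map[string]bool{"ip-172-18-6-194.ec2.internal": true}
	fmt.Println(startupSweep(subnets, nodes)) // [ip-172-18-6-202.ec2.internal]
}
```

Because the sweep runs after the watches are established, any node deleted during the startup gap is caught here, and any node deleted later is caught by the normal watch-driven deletion.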
The issue has been fixed in build v3.9.1. The hostsubnet is now deleted as well when the node is deleted right after the master restarts.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:0636