Description of problem:
We found that sometimes the hostsubnet cannot be deleted after deleting the node. We may not have the exact steps to reproduce it reliably; after many attempts, the steps below may help.

Version-Release number of selected component (if applicable):
openshift v3.6.143
kubernetes v1.6.1+5115d708d7

How reproducible:
Sometimes

Steps to Reproduce:
1. Set up a multi-node env
[root@host-8-175-115 ~]# oc get node
NAME                                                STATUS    AGE       VERSION
host-8-174-41.host.centralci.eng.rdu2.redhat.com    Ready     37s       v1.6.1+5115d708d7
host-8-175-115.host.centralci.eng.rdu2.redhat.com   Ready     34s       v1.6.1+5115d708d7

2. Check the hostsubnets
[root@host-8-175-115 ~]# oc get hostsubnet
NAME                                                HOST                                                HOST IP          SUBNET
host-8-174-41.host.centralci.eng.rdu2.redhat.com    host-8-174-41.host.centralci.eng.rdu2.redhat.com    172.16.120.2     10.128.0.0/23
host-8-175-115.host.centralci.eng.rdu2.redhat.com   host-8-175-115.host.centralci.eng.rdu2.redhat.com   172.16.120.199   10.129.0.0/23

3. Enlarge the clusterNetworkCIDR
default:      clusterNetworkCIDR: 10.128.0.0/14
update it to: clusterNetworkCIDR: 10.128.0.0/10

4. Delete the node
[root@host-8-175-115 ~]# oc delete node host-8-174-41.host.centralci.eng.rdu2.redhat.com
node "host-8-174-41.host.centralci.eng.rdu2.redhat.com" deleted
[root@host-8-175-115 ~]# oc get node
NAME                                                STATUS    AGE       VERSION
host-8-175-115.host.centralci.eng.rdu2.redhat.com   Ready     3m        v1.6.1+5115d708d7

5. Check the hostsubnets again
[root@host-8-175-115 ~]# oc get hostsubnet
NAME                                                HOST                                                HOST IP          SUBNET
host-8-174-41.host.centralci.eng.rdu2.redhat.com    host-8-174-41.host.centralci.eng.rdu2.redhat.com    172.16.120.2     10.128.0.0/23
host-8-175-115.host.centralci.eng.rdu2.redhat.com   host-8-175-115.host.centralci.eng.rdu2.redhat.com   172.16.120.199   10.129.0.0/23

Actual results:
In the last step, the hostsubnet for the deleted node is still present; it is not cleaned up after the node is deleted.

Expected results:
The hostsubnet should be deleted when the node is deleted.

Additional info:
> 3. Enlarge the clusterNetworkCIDR
> default -> clusterNetworkCIDR: 10.128.0.0/14
> update it to -> clusterNetworkCIDR: 10.128.0.0/10
>
> 3. delete the node

I assume there's an implied "restart the master" between those steps? I couldn't reproduce the bug either way, though.
(In reply to Dan Winship from comment #3)
> > 3. Enlarge the clusterNetworkCIDR
> > 3. delete the node
>
> I assume there's an implied "restart the master" between those steps?

And if so, were you giving the master a chance to fully start up after restarting, or just deleting the node right away? If you deleted the Node very quickly after the master started, the SDN's Node watch might not have started yet, so it would never see a Deleted event for the Node, and therefore would not delete the HostSubnet. (This might also be why the bug is hard to reproduce.) In that case it should be possible to tell from the logs that that's what happened.

We probably ought to be better about keeping Nodes and HostSubnets in sync, although that's tricky now that we let the user create non-Node-based HostSubnets for routers too; we can't just assume that any HostSubnet that doesn't correspond to a Node is safe to delete. But we could have the master annotate the HostSubnets that it creates with the UID of the corresponding Node, and then at startup, if it finds a HostSubnet that allegedly corresponds to a Node, but that Node doesn't exist, it could delete it. (HostSubnets for routers would have no annotation and so would never be automatically deleted.)
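The annotation-based reconciliation proposed above could be sketched roughly as follows. This is not the code that was merged for this bug; the annotation key, the struct shape, and the plain in-memory sets standing in for the Kubernetes API are all assumptions for illustration only.

```go
package main

import "fmt"

// Hypothetical annotation key; the real key name (if any) is an assumption here.
const nodeUIDAnnotation = "pod.network.openshift.io/node-uid"

// HostSubnet models only the fields relevant to this reconciliation sketch.
type HostSubnet struct {
	Name        string
	Annotations map[string]string
}

// staleHostSubnets returns the names of HostSubnets that were created for a
// Node (i.e. carry the node-UID annotation) whose Node no longer exists.
// Router HostSubnets have no annotation and are never selected for deletion.
func staleHostSubnets(subnets []HostSubnet, liveNodeUIDs map[string]bool) []string {
	var stale []string
	for _, hs := range subnets {
		uid, ok := hs.Annotations[nodeUIDAnnotation]
		if !ok {
			continue // no annotation: a router HostSubnet, leave it alone
		}
		if !liveNodeUIDs[uid] {
			stale = append(stale, hs.Name)
		}
	}
	return stale
}

func main() {
	subnets := []HostSubnet{
		{Name: "node-a", Annotations: map[string]string{nodeUIDAnnotation: "uid-a"}},
		{Name: "node-b", Annotations: map[string]string{nodeUIDAnnotation: "uid-b"}}, // its Node was deleted
		{Name: "router-1", Annotations: map[string]string{}},                         // router: keep
	}
	live := map[string]bool{"uid-a": true}
	fmt.Println(staleHostSubnets(subnets, live)) // [node-b]
}
```

The point of keying on the Node's UID rather than its name is that a HostSubnet created by hand for a router never gets the annotation, so it can never be swept up by the automatic cleanup.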
We should definitely be doing this (keeping hostsubnets and nodes in sync). That's a big gap.
I can't reproduce the inability to delete the hostsubnets. I was able to get the system into the state where there was a hostsubnet for a deleted node but the cluster did not prevent me from deleting it. Can this still be reproduced? Should the system automatically delete the hostsubnet without a node?
Hi jtanenba,

Maybe you could use this way to reproduce the issue:

1. Set up a multi-node env
2. SSH into the master and run the command below:
[root@ip-172-18-6-194 ~]# systemctl restart atomic-openshift-master-api.service; systemctl restart atomic-openshift-master-controllers.service; oc delete node ip-172-18-6-202.ec2.internal
node "ip-172-18-6-202.ec2.internal" deleted
3. Check the node and hostsubnet
[root@ip-172-18-6-194 ~]# oc get node
NAME                           STATUS                     ROLES     AGE       VERSION
ip-172-18-6-194.ec2.internal   Ready,SchedulingDisabled   master    22m       v1.9.1+a0ce1bc657
[root@ip-172-18-6-194 ~]# oc get hostsubnet
NAME                           HOST                           HOST IP        SUBNET          EGRESS IPS
ip-172-18-6-194.ec2.internal   ip-172-18-6-194.ec2.internal   172.18.6.194   10.129.0.0/23   []
ip-172-18-6-202.ec2.internal   ip-172-18-6-202.ec2.internal   172.18.6.202   10.131.0.0/23   []

Then you can see the node is deleted, but the hostsubnet still exists.

I just used the latest OCP-3.9 env to reproduce it:
openshift v3.9.0-0.31.0
kubernetes v1.9.1+a0ce1bc657
Commit pushed to master at https://github.com/openshift/origin

https://github.com/openshift/origin/commit/4134eec37f56e5d8813d3d507f124191860dcff6
automatically remove hostsubnets with no nodes on startup

openshift deletes a node's hostsubnet automatically, but during startup there is a window before the watches start in which a node can be deleted, and we miss the event for the deletion of the hostsubnet. At the end of startup, look through the list of hostsubnets and make sure there is an accompanying node; if no node is found, delete it.

Bug 1470612
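The fix described in the commit message is a one-time sweep at the end of startup. A minimal sketch of that logic, not the merged code, might look like this: plain name sets stand in for the real node and hostsubnet listers, the function only returns the names it would delete, and the merged code's handling of non-Node HostSubnets (e.g. for routers) is omitted here.

```go
package main

import "fmt"

// startupSweep models the post-startup reconciliation from the commit: any
// hostsubnet whose name has no accompanying node is considered orphaned and
// would be deleted. String sets stand in for the real Kubernetes listers.
func startupSweep(hostSubnetNames []string, nodeNames map[string]bool) []string {
	var orphaned []string
	for _, name := range hostSubnetNames {
		if !nodeNames[name] {
			orphaned = append(orphaned, name)
		}
	}
	return orphaned
}

func main() {
	subnets := []string{"ip-172-18-6-194.ec2.internal", "ip-172-18-6-202.ec2.internal"}
	// ip-172-18-6-202 was deleted while the master's watches were not yet running
	nodes := map[string]bool{"ip-172-18-6-194.ec2.internal": true}
	fmt.Println(startupSweep(subnets, nodes)) // [ip-172-18-6-202.ec2.internal]
}
```

Because the sweep runs after the watches are established, any node deleted during the startup gap is caught here, and any node deleted later is caught by the normal watch-driven deletion.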
The issue has been fixed in build v3.9.1. The hostsubnet is now deleted as well when the node is deleted right after the master restarts.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:0636