Bug 1470612
| Summary: | Hostsubnet could not be deleted after delete node | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Yan Du <yadu> |
| Component: | Networking | Assignee: | Jacob Tanenbaum <jtanenba> |
| Status: | CLOSED ERRATA | QA Contact: | Meng Bo <bmeng> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 3.6.0 | CC: | aos-bugs, atragler, bbennett, ccoleman, rkhan, sukulkar, yadu |
| Target Milestone: | --- | | |
| Target Release: | 3.7.z | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2018-04-05 09:28:25 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |

Description
Yan Du
2017-07-13 10:07:43 UTC
> 3. Enlarge the clusterNetworkCIDR
> default -> clusterNetworkCIDR: 10.128.0.0/14
> update it to -> clusterNetworkCIDR: 10.128.0.0/10
>
> 3. delete the node
I assume there's an implied "restart the master" between those steps?
I couldn't reproduce the bug either way though
(In reply to Dan Winship from comment #3)
> > 3. Enlarge the clusterNetworkCIDR
> > 3. delete the node
>
> I assume there's an implied "restart the master" between those steps?

And if so, were you giving the master a chance to fully start up after restarting, or just deleting the node right away?

If you deleted the Node really really quickly after the master started, then the SDN's Node watch might not have started yet, and so it wouldn't see a Deleted event for the Node, and so it wouldn't delete the HostSubnet. (This might also be why it's hard to reproduce?) In that case it should be possible to tell that that's what happened from the logs.

We probably ought to be better about keeping Nodes and HostSubnets in sync, although that's tricky now that we let the user create non-Node-based HostSubnets for routers too, so we can't just assume that any HostSubnet that doesn't correspond to a Node is safe to delete. But we could have the master annotate the HostSubnets that it creates with the UID of the corresponding Node, and then at startup, if it finds a HostSubnet that allegedly corresponds to a Node, but that Node doesn't exist, then it could delete it. (HostSubnets for routers would have no annotation and so would never be automatically deleted.)

We should definitely be doing this (keeping the invariant of hostsubnet and node in sync). That's a big gap.

I can't reproduce the inability to delete the hostsubnets. I was able to get the system into the state where there was a hostsubnet for a deleted node but the cluster did not prevent me from deleting it. Can this still be reproduced? Should the system automatically delete the hostsubnet without a node?

Hi, jtanenba

Maybe you could use this way to reproduce the issue:

1. setup multi-node env

2. ssh into master and run below cmd:

    [root@ip-172-18-6-194 ~]# systemctl restart atomic-openshift-master-api.service ;systemctl restart atomic-openshift-master-controllers.service ;oc delete node ip-172-18-6-202.ec2.internal
    node "ip-172-18-6-202.ec2.internal" deleted

3. Check the node and hostsubnet

    [root@ip-172-18-6-194 ~]# oc get node
    NAME                           STATUS                     ROLES     AGE       VERSION
    ip-172-18-6-194.ec2.internal   Ready,SchedulingDisabled   master    22m       v1.9.1+a0ce1bc657

    [root@ip-172-18-6-194 ~]# oc get hostsubnet
    NAME                           HOST                           HOST IP        SUBNET          EGRESS IPS
    ip-172-18-6-194.ec2.internal   ip-172-18-6-194.ec2.internal   172.18.6.194   10.129.0.0/23   []
    ip-172-18-6-202.ec2.internal   ip-172-18-6-202.ec2.internal   172.18.6.202   10.131.0.0/23   []

Then you could see the node is deleted, but the hostsubnet still exists.

I just use latest OCP-3.9 env to reproduce it:

    openshift v3.9.0-0.31.0
    kubernetes v1.9.1+a0ce1bc657

Commit pushed to master at https://github.com/openshift/origin

https://github.com/openshift/origin/commit/4134eec37f56e5d8813d3d507f124191860dcff6
automatically remove hostsubnets with no nodes on startup

openshift deletes a nodes hostsubnet automatically but during startup there is a window before the watches start that a node can be deleted and we miss the event for the deletion of the hostsubnet. At the end of startup look through the list of hostsubnets and make sure there is an accompanying node, if no node is found delete it.

Bug 1470612

Issue has been fixed on build v3.9.1
The hostsubnet will be deleted too when the node is deleted right after the master restarted.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:0636
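
For readers following the discussion, the startup cleanup described in the commit message, combined with the node-UID annotation idea suggested earlier in the thread, might look roughly like the sketch below. It is not the openshift/origin code (which is written in Go), and it is not confirmed that the actual fix uses an annotation at all. The `Node` and `HostSubnet` classes, the `find_orphaned_hostsubnets` function, and the annotation key are all hypothetical names made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List

# Hypothetical annotation key. The key (if any) used by the real fix
# is not stated in this bug report.
NODE_UID_ANNOTATION = "example.invalid/node-uid"


@dataclass
class Node:
    name: str
    uid: str


@dataclass
class HostSubnet:
    name: str
    subnet: str
    annotations: Dict[str, str] = field(default_factory=dict)


def find_orphaned_hostsubnets(
    nodes: Iterable[Node], hostsubnets: Iterable[HostSubnet]
) -> List[HostSubnet]:
    """Return HostSubnets that claim a Node which no longer exists.

    HostSubnets without the annotation (for example, ones created by
    hand for routers) are never reported, so they are never deleted.
    """
    live_uids = {node.uid for node in nodes}
    orphans = []
    for hs in hostsubnets:
        uid = hs.annotations.get(NODE_UID_ANNOTATION)
        if uid is None:
            continue  # not created for a Node; leave it alone
        if uid not in live_uids:
            orphans.append(hs)  # its Node was deleted while we were not watching
    return orphans


if __name__ == "__main__":
    # State modelled on the reproduction in this bug: the node ending in
    # 202 was deleted during the master restart, its HostSubnet remains.
    nodes = [Node("ip-172-18-6-194.ec2.internal", "uid-194")]
    hostsubnets = [
        HostSubnet("ip-172-18-6-194.ec2.internal", "10.129.0.0/23",
                   {NODE_UID_ANNOTATION: "uid-194"}),
        HostSubnet("ip-172-18-6-202.ec2.internal", "10.131.0.0/23",
                   {NODE_UID_ANNOTATION: "uid-202"}),
    ]
    for hs in find_orphaned_hostsubnets(nodes, hostsubnets):
        print("would delete hostsubnet", hs.name)
```

Running such a pass once at the end of startup, after the watches are established, would catch deletions that happened in the window before the Node watch began, while the annotation check keeps user-created HostSubnets out of scope.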