Bug 1461865
Summary: | AWS getInstancesByNodeNames is broken for large clusters | |
---|---|---|---
Product: | OpenShift Container Platform | Reporter: | Hemant Kumar <hekumar>
Component: | Node | Assignee: | Hemant Kumar <hekumar>
Status: | CLOSED ERRATA | QA Contact: | Mike Fiedler <mifiedle>
Severity: | high | Docs Contact: |
Priority: | unspecified | |
Version: | 3.6.0 | CC: | aos-bugs, bchilds, dma, eparis, hekumar, jokerman, mifiedle, mmccomas, sdodson, smunilla
Target Milestone: | --- | |
Target Release: | --- | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | 1460388 | Environment: |
Last Closed: | 2017-08-10 05:28:09 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Bug Depends On: | 1460388 | |
Bug Blocks: | | |
Description
Hemant Kumar
2017-06-15 13:25:52 UTC
PR opened - https://github.com/openshift/origin/pull/14669

At what cluster size does this happen? 200 instances? QE doesn't have a cluster that large; could you help verify the bug? Thanks.

Yeah, this would be tricky to test directly on a cluster. I verified this myself by changing the pagination limit to a smaller value (the default is 150) and making sure it works as expected. I don't have a good answer, since the pagination limit can't be tuned via a command-line parameter and changing it requires a code change. Scott - do we have a dev or opstest cluster that is large enough for testing this?

Just to clarify - yes, the bug happens on clusters with 200+ nodes. But the key thing to test here is that we now fetch instance information in batches of 150, and those batches are aggregated into a single list and returned as if the caller had made a single call (i.e., the whole batching mechanism is opaque to the caller). Creating dummy nodes will not exercise the volume check that fails because of the 200 limit; those nodes must also have volumes attached to them in order to be considered for the getInstancesByNodeNames request. So once a cluster passes the 200-node limit, what breaks is this: if you terminate a node in the cluster, the volumes that were attached to it will not detach correctly.

But the bug also affects load balancers. Here is the upstream bug - https://github.com/kubernetes/kubernetes/issues/45050 . The upstream bug seems to suggest that once a cluster has 200+ nodes, an AWS-created load balancer doesn't seem to get any healthy nodes behind it.

@hekumar Can you take a look at comment 12?

From the storage perspective there are two ways to verify:
1. There should be errors logged in the controller logs.
2. For a 200+ node cluster, with each node running at least one pod with a volume, go to the AWS console and force-detach one of the volumes being used inside a pod. If fetching node information is working as expected, the detached volume will automatically be attached back. If not, it will remain detached.

Thanks.
Here is the test case for verification:

- 1 master + 1 infra + 200 nodes on AWS us-west-2b
- Run 1 pod on every node with a volume/volumeMount referencing an existing PVC
- Verify PVCs are bound to PVs
- Verify all volumes are in use by instances in the AWS EC2 console
- Force mount volumes from instances
- Verify volumes transition to Available and then are automatically re-attached to the instance
- Verify PVs and PVCs remain bound in OpenShift

Step above should be "Force Detach" volumes from instances.

Verified on 3.6.133. Executed the test in comment 15. After force detach, volumes are re-attached successfully:

Jul 6 20:53:46 ip-172-31-8-32 atomic-openshift-master: I0706 20:53:46.750530 39189 node_status_updater.go:136] Updating status for node "ip-172-31-27-235.us-west-2.compute.internal" succeeded. patchBytes: "{\"status\":{\"volumesAttached\":[{\"devicePath\":\"/dev/xvdcd\",\"name\":\"kubernetes.io/aws-ebs/aws://us-west-2b/vol-04bb7b19cddccb436\"},{\"devicePath\":\"/dev/xvdbn\",\"name\":\"kubernetes.io/aws-ebs/aws://us-west-2b/vol-0c838ab3eecc2d6c0\"}]}}" VolumesAttached: [{kubernetes.io/aws-ebs/aws://us-west-2b/vol-04bb7b19cddccb436 /dev/xvdcd} {kubernetes.io/aws-ebs/aws://us-west-2b/vol-0c838ab3eecc2d6c0 /dev/xvdbn}]

Jul 6 20:53:46 ip-172-31-8-32 atomic-openshift-master: I0706 20:53:46.755358 39189 node_status_updater.go:136] Updating status for node "ip-172-31-25-70.us-west-2.compute.internal" succeeded. patchBytes: "{\"status\":{\"volumesAttached\":[{\"devicePath\":\"/dev/xvdbj\",\"name\":\"kubernetes.io/aws-ebs/aws://us-west-2b/vol-0223d80541a74c9b7\"},{\"devicePath\":\"/dev/xvdcy\",\"name\":\"kubernetes.io/aws-ebs/aws://us-west-2b/vol-07a774bb0a889af1b\"}]}}" VolumesAttached: [{kubernetes.io/aws-ebs/aws://us-west-2b/vol-0223d80541a74c9b7 /dev/xvdbj} {kubernetes.io/aws-ebs/aws://us-west-2b/vol-07a774bb0a889af1b /dev/xvdcy}]

Jul 6 20:53:46 ip-172-31-8-32 atomic-openshift-master: I0706 20:53:46.760077 39189 node_status_updater.go:136] Updating status for node "ip-172-31-3-250.us-west-2.compute.internal" succeeded. patchBytes: "{\"status\":{\"volumesAttached\":[{\"devicePath\":\"/dev/xvdba\",\"name\":\"kubernetes.io/aws-ebs/aws://us-west-2b/vol-0182b1a36bda8511e\"},{\"devicePath\":\"/dev/xvdbz\",\"name\":\"kubernetes.io/aws-ebs/aws://us-west-2b/vol-0902b6ce264f8f2c5\"}]}}" VolumesAttached: [{kubernetes.io/aws-ebs/aws://us-west-2b/vol-0182b1a36bda8511e /dev/xvdba} {kubernetes.io/aws-ebs/aws://us-west-2b/vol-0902b6ce264f8f2c5 /dev/xvdbz}]

Jul 6 20:53:46 ip-172-31-8-32 atomic-openshift-master: I0706 20:53:46.772668 39189 node_status_updater.go:136] Updating status for node "ip-172-31-22-5.us-west-2.compute.internal" succeeded. patchBytes: "{\"status\":{\"volumesAttached\":[{\"devicePath\":\"/dev/xvdce\",\"name\":\"kubernetes.io/aws-ebs/aws://us-west-2b/vol-04cdac9a650e119a7\"},{\"devicePath\":\"/dev/xvdcx\",\"name\":\"kubernetes.io/aws-ebs/aws://us-west-2b/vol-043479f3dbb958922\"}]}}" VolumesAttached: [{kubernetes.io/aws-ebs/aws://us-west-2b/vol-04cdac9a650e119a7 /dev/xvdce} {kubernetes.io/aws-ebs/aws://us-west-2b/vol-043479f3dbb958922 /dev/xvdcx}]

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:1716
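The force-detach step in the verification test case can be driven from the command line instead of the AWS console. A rough sketch follows; the volume ID is a placeholder, and the commands assume configured AWS CLI credentials and an active `oc` login.

```shell
# Force-detach a volume that a pod is using (volume ID is a placeholder).
aws ec2 detach-volume --volume-id vol-04bb7b19cddccb436 --force

# Poll the volume state in EC2; it should go to "available" and then,
# once the attach/detach controller reconciles, back to "in-use".
aws ec2 describe-volumes --volume-ids vol-04bb7b19cddccb436 \
    --query 'Volumes[0].State' --output text

# Confirm the PV/PVC bindings survived in OpenShift.
oc get pvc
```

If the node-lookup batching is broken, the volume stays in "available" and the controller logs show errors instead of the `node_status_updater` success messages quoted above.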