Bug 1460388 - AWS getInstancesByNodeNames is broken for large clusters
Summary: AWS getInstancesByNodeNames is broken for large clusters
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 3.5.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 3.5.z
Assignee: Hemant Kumar
QA Contact: Mike Fiedler
URL:
Whiteboard:
Depends On:
Blocks: 1461865
 
Reported: 2017-06-09 21:58 UTC by Hemant Kumar
Modified: 2017-10-25 13:02 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: When describing multiple instances on AWS, each node is supplied as a filter value. This fails if the cluster is large enough, because AWS allows at most 200 filter values in a single request. Consequence: DescribeInstances calls fail, resulting in broken load balancer and storage functionality on AWS. Fix: DescribeInstances calls are now batched to stay under the filter limit. Result: DescribeInstances calls work for larger clusters too.
Clone Of:
Clones: 1461865
Environment:
Last Closed: 2017-10-25 13:02:19 UTC
Target Upstream Version:
Embargoed:


Attachments
Master and node syslogs (7.44 MB, application/x-gzip), 2017-07-11 20:19 UTC, Mike Fiedler


Links
System: Red Hat Product Errata
ID: RHBA-2017:3049
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: OpenShift Container Platform 3.6, 3.5, and 3.4 bug fix and enhancement update
Last Updated: 2017-10-25 15:57:15 UTC

Description Hemant Kumar 2017-06-09 21:58:32 UTC
From CloudTrail logs:

    "errorMessage": "The maximum number of filter values specified on a single call is 200",

This should affect all features (such as Load Balancers, Storage) which rely on this function to work.


I have opened an upstream ticket as well - 
https://github.com/kubernetes/kubernetes/issues/47271
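
For illustration only, here is a minimal Go sketch of the batching idea described in the Doc Text. It is not the code from the upstream or OpenShift fix. The names `maxFilterValues`, `describeFunc`, `chunkNames`, and `describeInBatches` are hypothetical, the batch size of 200 is taken from the CloudTrail error message above, and a fake function stands in for the real DescribeInstances call so the example runs without AWS credentials.

```go
package main

import "fmt"

// maxFilterValues mirrors the limit quoted in the CloudTrail error message.
// Assumption: a real fix may choose a smaller batch size to leave headroom.
const maxFilterValues = 200

// describeFunc is a hypothetical stand-in for one DescribeInstances call
// that filters on the given node names and returns the matching instances.
type describeFunc func(names []string) ([]string, error)

// chunkNames splits names into consecutive batches of at most size elements.
func chunkNames(names []string, size int) [][]string {
	if size <= 0 {
		return nil
	}
	var batches [][]string
	for start := 0; start < len(names); start += size {
		end := start + size
		if end > len(names) {
			end = len(names)
		}
		batches = append(batches, names[start:end])
	}
	return batches
}

// describeInBatches issues one describe call per batch and concatenates the
// results, so no single call carries more than maxFilterValues filter values.
func describeInBatches(names []string, describe describeFunc) ([]string, error) {
	var all []string
	for _, batch := range chunkNames(names, maxFilterValues) {
		found, err := describe(batch)
		if err != nil {
			return nil, err
		}
		all = append(all, found...)
	}
	return all, nil
}

func main() {
	// 202 nodes, matching the cluster size used for verification in this bug.
	names := make([]string, 202)
	for i := range names {
		names[i] = fmt.Sprintf("node-%d", i)
	}

	calls := 0
	fake := func(batch []string) ([]string, error) {
		calls++
		return batch, nil // pretend every node name resolves to one instance
	}

	instances, err := describeInBatches(names, fake)
	if err != nil {
		panic(err)
	}
	fmt.Println(len(instances), calls) // prints "202 2"
}
```

With 202 node names the sketch makes two calls (200 names, then 2) instead of one call that would exceed the limit.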

Comment 1 Hemant Kumar 2017-06-09 21:59:37 UTC
Assigning this to myself for now.

Comment 2 Hemant Kumar 2017-06-09 22:01:43 UTC
We will have to backport the fix to both 3.5 and 3.6 once available.

Comment 3 Hemant Kumar 2017-06-15 13:32:11 UTC
Link to PR - https://github.com/openshift/ose/pull/788

Comment 11 Mike Fiedler 2017-07-11 20:18:49 UTC
Tested on v3.5.5.31.3 and there seems to be an issue still.

I followed the same procedure used to verify on 3.6:  See https://bugzilla.redhat.com/show_bug.cgi?id=1461865#c15 and https://bugzilla.redhat.com/show_bug.cgi?id=1461865#c17 for details and results

1.  Installed 202 node cluster at v3.5.5.31.3
2.  Created 250 projects with deployment configurations that included Volume and VolumeMount for a PVC which was dynamically bound to an EBS PV
3.  In the node syslog (attached), you can see the volume successfully attached to the node and formatted
4. Verified the PV and PVC were Bound
5. In the AWS console, force detached the volume used in namespace svt-190 (/dev/xvdbq), vol id vol-0a70d2fb3b570c842
6. Waited for the volume to be reattached.

The volume stayed available in the AWS console and never reattached.   The PVC remained in Bound state

NAME      STATUS    VOLUME                                     CAPACITY   ACCESSMODES   AGE
pvc2      Bound     pvc-abfad388-666e-11e7-be60-02238eeb625a   1Gi        RWO           51m
root@ip-172-31-47-221: ~ # oc get pv pvc-abfad388-666e-11e7-be60-02238eeb625a
NAME                                       CAPACITY   ACCESSMODES   RECLAIMPOLICY   STATUS    CLAIM          REASON    AGE
pvc-abfad388-666e-11e7-be60-02238eeb625a   1Gi        RWO           Delete          Bound     svt-190/pvc2             51m
root@ip-172-31-47-221: ~ # 


There were some odd errors re:  PVCs in the master logs, but they were for different namespaces.  The entry below is all one message with repeated text.

Jul 11 15:30:15 ip-172-31-41-8 atomic-openshift-master: E0711 15:30:15.657255   21952 factory.go:583] Error scheduling svt-246 deploymentconfig2-1-kn6xx: [SchedulerPredicates failed due to PersistentVolumeClaim is not bound: "pvc2", which is unexpected., SchedulerPredicates failed due to PersistentVolumeClaim is not bound: "pvc2", which is unexpected., ...
(The same "SchedulerPredicates failed due to PersistentVolumeClaim is not bound" text repeats across several more syslog lines with the same timestamp; the pasted entry is cut off mid-message.)

Comment 12 Mike Fiedler 2017-07-11 20:19:24 UTC
Created attachment 1296452
Master and node syslogs

Comment 16 Hemant Kumar 2017-07-11 21:06:15 UTC
If the node is doing attach/detach, the controller does not perform verification of detached volumes. That verification is the mechanism that causes detached volumes to be automatically attached back.

In OpenShift 3.6, controller attach/detach is the default.

Comment 17 Mike Fiedler 2017-07-12 00:35:24 UTC
Applied the configuration from https://docs.openshift.org/1.5/install_config/persistent_storage/enabling_controller_attach_detach.html to the node and the scenario worked correctly.

Verified on 3.5.5.31.3

Comment 19 errata-xmlrpc 2017-10-25 13:02:19 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:3049

