Bug 1464426
| Field | Value | Field | Value |
|---|---|---|---|
| Summary: | App pods become orphaned and are not visible | | |
| Product: | OpenShift Container Platform | Reporter: | Shekhar Berry <shberry> |
| Component: | Node | Assignee: | Derek Carr <decarr> |
| Status: | CLOSED WONTFIX | QA Contact: | Xiaoli Tian <xtian> |
| Severity: | low | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 3.5.1 | CC: | aos-bugs, ekuric, gblomqui, hchiramm, jeder, jokerman, mmccomas, mpillai, psuriset, rsussman, rtalur, shberry, xtian |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-07-03 18:18:25 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description (Shekhar Berry, 2017-06-23 12:07:22 UTC)
Comment 1 (Derek Carr)

What is meant by an orphan pod in this context? Can you attach the output of the following for one of the orphaned pods?

    oc get pod <pod-name> -o yaml
    oc describe pod <pod-name>

Comment 3 (Derek Carr)

This looks like a networking issue where the node is unable to fetch the image pull secret for the pod's container image and is hitting network timeouts.

Is this a containerized install? Are all the pods experiencing the same problem on the same node or set of nodes? Is node-to-master networking showing any other problems, i.e. are pod status or node status updates showing errors?

Can you also confirm that the secret fio4/default-dockercfg-jphz6 actually exists?

Comment 4 (Derek Carr)

Reading through the original report, you note that the pod no longer exists. Who deleted this pod?

(In reply to Derek Carr from comment #1)

> can you attach the output of the following for one of the orphaned pods?

The pod doesn't exist in the cluster any more. I managed to get the name of one of the pods that vanished and searched for it in the logs attached to the original report.

(In reply to Derek Carr from comment #3)

> is this a containerized install?

Yes, this is a containerized install. Not all pods experienced the problem: only 600 of 1000 pods did, and not on any particular node; the affected pods were spread across all the nodes in the cluster. The pods that were still running were also distributed across all the nodes and were unaffected.
I don't have the setup now to confirm the existence of the secret, but I was able to create new app pods afterwards, and 400 pods were unaffected, so I believe the secret existed.

(In reply to Derek Carr from comment #4)

> who deleted this pod?

No one deleted those pods; 600 of 1000 running pods simply vanished.

Comment 8 (Derek Carr)

I am not sure I understand the core problem well enough to know how to proceed. Can you explain your scenario more crisply? For example, how exactly did you create 1000 pods? Did you create 1000 deployments, each with replicas=1? Did you create 1 deployment with replicas=1000?

The only way a pod would vanish (i.e. be deleted from the API server entirely) is if a backing controller (deployment, replication controller, etc.) deleted it. Can you see "vanished" pods if you do the following?

    $ oc get pods -a --all-namespaces

Do you have 600 pods pending scheduling or in some other state? Basically, I don't understand what symptom you are experiencing now. Do you have deployments that are not scaled back up to their targeted replica size? If any one of your application pods failed, the backing controller should have created a new one, unless you manually created 1000 pods without a controller, in which case the failed pods would simply have stayed failed.

In addition, is there a reason why you are trying to scale up to so many pods on so few nodes? What is your allocatable pods value per node? (Run `oc describe node` and look at Allocatable.)

(In reply to Derek Carr from comment #8)

> how exactly did you create 1000 pods?

I created the 1000 pods using scripts. Each pod issues a PVC, which is served a PV by the storage defined in the storage class.

> Did you create 1000 deployments each with replica=1?
> Did you create 1 deployment with replicas=1000?
I did 1000 deployments, each with replica=1.

> Can you see "vanished" pods if you do the following?
>
>     $ oc get pods -a --all-namespaces

No, I don't see the 600 "vanished" pods. I only see the 400 existing, running pods.

> Do you have 600 pods pending scheduling or in some other state?

As I said earlier, these were 1000 different application pods, none related to each other. The 1000 pods ran in the OCP cluster for 8 days without any issue; I was doing IO runs against them during those 8 days and everything worked fine. But 600 of the 1000 running pods vanished from the system the next day.

> is there a reason why you are trying to scale up to so many pods on so few
> nodes? What is your allocatable pods value per node?

I am part of the perf and scale team, and we are trying to do PV scalability testing by enabling brick multiplexing on storage (a new feature in RHGS 3.3). The allocatable pod value per node is 250.

Closing comment

No updates in 2 years on a low-severity bug for a version that is EOL. If this issue still persists in current versions of OCP, please open a new bug against those versions.
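The scenario the reporter describes (1000 separate deployments, each with replicas=1 and its own PVC, created by script) could be generated along these lines. This is a hedged sketch under stated assumptions, not the reporter's actual script: the names, image, API version, and storage class below are all illustrative placeholders.

```shell
#!/bin/sh
# Hedged sketch: emit one single-replica deployment plus a PVC per index,
# mirroring the "1000 deployments, each with replica=1, each issuing a PVC"
# setup from the report. All names/images/classes are placeholders.
emit_manifest() {
  i="$1"
  cat <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: fio-claim-$i
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 1Gi
  storageClassName: glusterfs-storage   # placeholder storage class
---
apiVersion: extensions/v1beta1          # Deployment API group of the OCP 3.x era
kind: Deployment
metadata:
  name: fio-app-$i
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: fio-app-$i
    spec:
      containers:
      - name: fio
        image: example/fio:latest       # placeholder image
        volumeMounts:
        - name: data
          mountPath: /data
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: fio-claim-$i
EOF
}

# A real run would loop i over 1..1000; here we just print one manifest.
emit_manifest 1
```

In a real run each manifest would be piped into `oc create -f -`. The per-node ceiling such a run pushes against is the `pods` entry under Allocatable in `oc describe node` output, which this report states was 250 per node.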