Bug 1464426

Summary: App PODs become orphaned and are not visible.
Product: OpenShift Container Platform
Reporter: Shekhar Berry <shberry>
Component: Node
Assignee: Derek Carr <decarr>
Status: CLOSED WONTFIX
QA Contact: Xiaoli Tian <xtian>
Severity: low
Priority: unspecified
Version: 3.5.1
CC: aos-bugs, ekuric, gblomqui, hchiramm, jeder, jokerman, mmccomas, mpillai, psuriset, rsussman, rtalur, shberry, xtian
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Last Closed: 2019-07-03 18:18:25 UTC
Type: Bug

Description Shekhar Berry 2017-06-23 12:07:22 UTC
Description of problem:
Hi,

I have an 8-node OCP cluster, with 3 of the nodes dedicated to CNS (non-schedulable).

I am evaluating the scalability of brick multiplexing in an OCP environment with persistent storage provided by CNS.

I scaled up to 1000 application pods, each with a Gluster volume mounted inside it. This configuration persisted for 8 days with all app pods up and running.

Today, on logging in, I see that 600 app pods have become orphaned and only 400 of them are running. No node or docker daemon restarted.
Only the Heketi pod restarted, in the last 8 hours.

If I log in to a Gluster pod, I still see 1000 volumes with status "Started".
There are still 1000 PVCs in Bound state in the system.
The 1000 endpoints and 1000 services are also still there.

Hence I do not suspect any issue at the Gluster layer.
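
For reference, the checks behind this conclusion were roughly the following (a sketch; the gluster pod name is a placeholder for this setup):

oc get pvc --all-namespaces | grep -c Bound          # expect 1000
oc get endpoints --all-namespaces | wc -l            # roughly 1000 plus system entries
oc get services --all-namespaces | wc -l             # roughly 1000 plus system entries
oc rsh <gluster-pod> gluster volume info | grep -c "Status: Started"   # expect 1000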

One message I saw for one of the missing pods is:

Jun 22 20:15:13 gprfs023.sbu.lab.eng.bos.redhat.com atomic-openshift-node[102749]: W0622 20:15:13.959682  102749 kubelet_pods.go:631] Unable to retrieve pull secret fio2/default-dockercfg-s7snt for fio2/fio-pod-drngj due to Get https://gprfs013.sbu.lab.eng.bos.redhat.com:8443/api/v1/namespaces/fio2/secrets/default-dockercfg-s7snt: dial tcp: lookup gprfs013.sbu.lab.eng.bos.redhat.com on 10.16.153.66:53: read udp 10.16.153.66:56650->10.16.153.66:53: i/o timeout.  The image pull may not succeed.


The pod fio-pod-drngj in the message above no longer exists. I got similar messages on 3 of the nodes for about 480 pods.
The log messages for all the nodes are attached to the bug.

Version-Release number of selected component (if applicable):

oc version
oc v3.5.5.20
kubernetes v1.5.2+43a9be4
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://gprfs013.sbu.lab.eng.bos.redhat.com:8443
openshift v3.5.5.20
kubernetes v1.5.2+43a9be4


docker version
Client:
 Version:         1.12.6
 API version:     1.24
 Package version: docker-1.12.6-31.1.git97ba2c0.el7.x86_64
 Go version:      go1.8
 Git commit:      97ba2c0/1.12.6
 Built:           Fri May 26 16:26:51 2017
 OS/Arch:         linux/amd64

Server:
 Version:         1.12.6
 API version:     1.24
 Package version: docker-1.12.6-31.1.git97ba2c0.el7.x86_64
 Go version:      go1.8
 Git commit:      97ba2c0/1.12.6
 Built:           Fri May 26 16:26:51 2017
 OS/Arch:         linux/amd64



How reproducible:

I got this once in my setup.


Actual results:


Expected results:


Additional info:


The logs can be found here:
http://perf1.perf.lab.eng.bos.redhat.com/pub/shberry/orphaned_pods/orphaned_pod_logs.txt

Comment 1 Derek Carr 2017-07-03 14:03:28 UTC
what is meant by an orphan pod in this context?

can you attach the output of the following for one of the orphaned pods?

oc get pod <pod-name> -o yaml
oc describe pod <pod-name>

Comment 2 Derek Carr 2017-07-03 14:09:21 UTC
this looks like a networking issue where the node is unable to fetch the image pull secret for the pod's container image and is hitting network timeouts.
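
A quick way to confirm whether these DNS timeouts are still reproducible from an affected node (a sketch; the hostname and resolver IP are taken from the log line in the description, and dig/curl are assumed to be available on the node):

dig @10.16.153.66 gprfs013.sbu.lab.eng.bos.redhat.com +time=2 +tries=1
curl -k https://gprfs013.sbu.lab.eng.bos.redhat.com:8443/healthz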

Comment 3 Derek Carr 2017-07-03 14:11:32 UTC
is this a containerized install?

are all the pods experiencing the same problem on the same node or set of nodes?

is node to master networking showing any other problems?  i.e. are pod status or node status updates showing errors?

can you confirm that the secret:
fio4/default-dockercfg-jphz6

actually exists as well?

Comment 4 Derek Carr 2017-07-03 14:19:37 UTC
reading through the original report, you note that the pod no longer exists.  who deleted this pod?

Comment 5 Shekhar Berry 2017-07-03 15:52:45 UTC
(In reply to Derek Carr from comment #1)
> what is meant by an orphan pod in this context?
> 
> can you attach the output of the following for one of the orphaned pods?
> 
> oc get pod <pod-name> -o yaml
> oc describe pod <pod-name>

The pod doesn't exist in the cluster any more. I managed to get the name of one of the pods that vanished and searched for it in the logs attached to the original report.
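
For reference, that search was essentially a grep over the node logs, roughly as follows (the unit and pod names are taken from the log line in the description):

journalctl -u atomic-openshift-node | grep fio-pod-drngj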

Comment 6 Shekhar Berry 2017-07-03 15:56:56 UTC
(In reply to Derek Carr from comment #3)
> is this a containerized install?
> 
> are all the pods experiencing the same problem on the same node or set of
> nodes?
> 
> is node to master networking showing any other problems?  i.e. are pod
> status or node status updates showing errors?
> 
> can you confirm that the secret:
> fio4/default-dockercfg-jphz6
> 
> actually exists as well?

Yes, this is a containerized install.

Not all pods experienced the problem; 600 of the 1000 did. It is not confined to any particular node; the affected pods were spread across all the nodes in the cluster.
The pods that were still running were also distributed across all the nodes and were unaffected.

I no longer have the setup to confirm the existence of the secret, but I could create new app pods afterwards, and 400 pods remained unaffected, so I believe the secret existed.

Comment 7 Shekhar Berry 2017-07-03 15:58:06 UTC
(In reply to Derek Carr from comment #4)
> reading through the original report, you note that the pod no longer exists.
> who deleted this pod?

No one deleted those pods; 600 of the 1000 running pods simply vanished.

Comment 8 Derek Carr 2017-07-03 19:00:18 UTC
I am not sure I understand the core problem well enough to know how to proceed. Can you explain your scenario more crisply?  For example, how exactly did you create the 1000 pods?

Did you create 1000 deployments each with replica=1?
Did you create 1 deployment with replicas=1000?

The only way a pod would vanish (i.e. be deleted from the API server entirely) is if a backing controller (deployment, replication controller, etc.) deleted it.

Can you see "vanished" pods if you do the following:

$ oc get pods -a --all-namespaces

Do you have 600 pods pending scheduling or in some other state?  Basically, I don't understand what symptom you are experiencing now.  Do you have deployments that are not scaled back up to their targeted replica size?  If any one of your application pods failed, the backing controller should have created a new one, unless you manually created 1000 pods without a controller, in which case any pods that failed would simply have stayed failed.
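
One way to narrow down which controller (if any) deleted the pods is to compare desired vs. current replica counts and scan recent events, roughly as follows (a sketch; depending on how the workloads were created, the backing objects may be deploymentconfigs, deployments, or replication controllers):

oc get dc,deploy,rc --all-namespaces                             # desired vs. current replicas
oc get events --all-namespaces | grep -iE 'kill|delete|evict'    # recent deletion-related events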

In addition, is there a reason why you are trying to scale up to so many pods on so few nodes?  What is your allocatable pods value per node?  (See oc describe node and look at Allocatable.)
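
To read the per-node allocatable pod count directly (a sketch; <node-name> is a placeholder):

oc describe node <node-name> | grep -A 6 Allocatable
oc get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.allocatable.pods}{"\n"}{end}'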

Comment 11 Shekhar Berry 2017-07-12 09:23:35 UTC
(In reply to Derek Carr from comment #8)
> I am not sure I understand the core problem to know how to proceed. Can you
> explain your scenario more crisply?  For example, how exactly did you create
> 1000 pods?

I created the 1000 pods using scripts. Each pod issues a PVC, and the claim is served a PV by the storage defined in the StorageClass.
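
For context, each pod/claim pair created by the scripts looked roughly like the following (a minimal sketch, not the exact manifests used; the names, image, and StorageClass are placeholders, OCP 3.5 / Kubernetes 1.5 used the beta annotation for dynamic provisioning, and in practice each pod was wrapped in a deployment with replicas=1, as noted below):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: fio-claim
  annotations:
    volume.beta.kubernetes.io/storage-class: <cns-storageclass>   # assumed StorageClass name
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 1Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: fio-pod
spec:
  containers:
  - name: fio
    image: <fio-image>               # placeholder image
    volumeMounts:
    - name: data
      mountPath: /mnt/data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: fio-claim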

> 
> Did you create 1000 deployments each with replica=1?
> Did you create 1 deployment with replicas=1000?

I created 1000 deployments, each with replicas=1.
> 
> The only way a pod would vanish (i.e. be deleted from the API server
> entirely) is if a backing controller (deployment, replication controller,
> etc.) deleted it.
> 
> Can you see "vanished" pods if you do the following:
> 
> $ oc get pods -a --all-namespaces

No, I don't see the 600 "vanished" pods. I only see the 400 existing and running pods.
> 
> Do you have 600 pods pending scheduling or in some other state?  Basically,
> I don't understand what is the symptom you are experiencing now?  Do you
> have deployments that are not scaled back up to their targeted replica size?
> If any one of your application pods failed, the backing controller should
> have created a new one unless you manually created 1000 pods without a
> controller.  In which case, the pods that would have failed would have just
> failed.

As I said earlier, there were 1000 different application pods, none related to each other. These 1000 pods existed in the OCP cluster for 8 days without any issue. I was doing some I/O runs on these pods during those 8 days and everything worked fine. But somehow 600 of the 1000 running pods vanished from the system the next day.

> 
> In addition, is there a reason why you are trying to scale up to so many
> pods on so few nodes?  What is your allocatable pods value per node?  (see
> oc describe node) and look at allocatable.

I am part of the perf and scale team, and we are doing PV scalability testing with brick multiplexing enabled on the storage (a new feature in RHGS 3.3). The allocatable pod value per node is 250, so with 5 schedulable nodes the 1000 pods were well within the cluster's allocatable capacity.

Comment 13 Greg Blomquist 2019-07-03 18:18:25 UTC
No updates in 2 years on a low severity bug for a version that's EOL.  If this issue still persists in current versions of OCP, please open a new bug against those versions.