Bug 1304266

Summary: Pod status keeps in pending status on dedicated env
Product: OpenShift Online
Reporter: Wang Haoran <haowang>
Component: Containers
Assignee: Jhon Honce <jhonce>
Status: CLOSED CURRENTRELEASE
QA Contact: DeShuai Ma <dma>
Severity: high
Docs Contact:
Priority: high
Version: 3.x
CC: agrimm, akostadi, aos-bugs, haowang, jhonce, jokerman, mmccomas, pruan, whearn, wzheng
Target Milestone: ---
Keywords: TestBlocker
Target Release: ---
Flags: jhonce: needinfo-
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-05-23 15:08:33 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1303130
Attachments: python build failed (flags: none)

Description Wang Haoran 2016-02-03 08:37:46 UTC
Description of problem:

The pod status becomes OutOfDisk on the dedicated env after running for some time.
Version-Release number of selected component (if applicable):


How reproducible:

Always
Steps to Reproduce:
1. Create a project
2. Create a pod
3. Check the pod status (see the command sketch below)
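For reference, a minimal shell sketch of these steps; the project name and sample application below are placeholders I am assuming, not taken from the original report:

$ oc login https://api.stage.openshift.com              # log in to the dedicated env
$ oc new-project pending-check                          # step 1: create a project (name is hypothetical)
$ oc new-app https://github.com/openshift/nodejs-ex     # step 2: creates build and deploy pods (example repo)
$ oc get pods -o wide                                   # step 3: check pod status and node placement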

Actual results:
[vagrant@ose ~]$ oc get pod -o wide
NAME                    READY     STATUS      RESTARTS   AGE       NODE
beego-example-1-build   0/1       OutOfDisk   0          24s       ip-172-31-5-177.ec2.internal

Expected results:
running

Additional info:

Comment 1 Wang Haoran 2016-02-04 02:34:45 UTC
The env cannot build and deploy now:
[vagrant@ose ~]$ oc get event
FIRSTSEEN   LASTSEEN   COUNT     NAME         KIND                    SUBOBJECT   REASON              SOURCE                           MESSAGE
17m         10m        11        database-1   ReplicationController               FailedCreate        {deployer }                      Error creating deployer pod for haowang2/database-1: Pod "database-1-deploy" is forbidden: service account haowang2/deployer was not found, retry after the service account is created
17m         12m        2         database-1   ReplicationController               FailedCreate        {deployer }                      Error creating deployer pod for haowang2/database-1: Internal error occurred: Get http://api.stage.openshift.com/api/v1/namespaces/haowang2: dial tcp 52.5.122.7:80: connection refused
17m         14m        2         database-1   ReplicationController               FailedCreate        {deployer }                      Error creating deployer pod for haowang2/database-1: Internal error occurred: Get http://api.stage.openshift.com/api/v1/namespaces/haowang2: dial tcp 52.72.220.72:80: connection refused
8m          17s        10        database-2   ReplicationController               FailedCreate        {deployer }                      Error creating deployer pod for haowang2/database-2: Pod "database-2-deploy" is forbidden: service account haowang2/deployer was not found, retry after the service account is created
17m         17m        1         database     DeploymentConfig                    DeploymentCreated   {deploymentconfig-controller }   Created new deployment "database-1" for version 1
8m          8m         1         database     DeploymentConfig                    DeploymentCreated   {deploymentconfig-controller }   Created new deployment "database-2" for version 2
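The FailedCreate events above point at a missing deployer service account in the haowang2 namespace. A quick way to confirm that (a hedged sketch, not from the original report) would be:

$ oc get serviceaccounts -n haowang2     # "deployer" should be listed next to "builder" and "default"
$ oc get events -n haowang2 -w           # watch whether the controller retries once the account shows up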

Comment 2 Wesley Hearn 2016-02-05 21:04:04 UTC
Regarding the initial report: the project does not exist and there are no other pods showing this.


Comment 1 is related to https://bugzilla.redhat.com/show_bug.cgi?id=1304586

Can you recreate the OutOfDisk error so I can actually look into it?
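If the OutOfDisk state does reappear, a hedged sketch of what I'd check first (node name taken from the output in the description):

$ oc describe node ip-172-31-5-177.ec2.internal    # the Conditions section shows whether OutOfDisk is set and since when
$ df -h /var/lib/docker                            # run on the node itself to check Docker storage headroom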

Comment 3 Wenjing Zheng 2016-02-06 04:01:04 UTC
There are no errors like those in comment #1 and no OutOfDisk error now; however, pods stay in Pending status on the nodes:
[wzheng@localhost ~]$ oc get builds
NAME                 TYPE      FROM      STATUS    STARTED   DURATION
php-sample-build-1   Source    Git       Pending             
[wzheng@localhost ~]$ oc get pods -o wide
NAME                       READY     STATUS      RESTARTS   AGE       NODE
database-1-deploy          1/1       Running     0          21m       ip-172-31-5-179.ec2.internal
database-1-kbnn8           1/1       Running     0          21m       ip-172-31-5-179.ec2.internal
database-1-posthook        0/1       Pending     0          21m       ip-172-31-5-180.ec2.internal
database-1-prehook         0/1       Completed   0          21m       ip-172-31-5-179.ec2.internal
php-sample-build-1-build   0/1       Pending     0          19m       ip-172-31-5-180.ec2.internal
[wzheng@localhost ~]$ oc get pods -n wzheng3 -o wide
NAME                       READY     STATUS      RESTARTS   AGE       NODE
php-sample-build-2-build   0/1       Pending     0          19m       ip-172-31-5-179.ec2.internal
[wzheng@localhost ~]$ oc get pods -o wide -n wzheng123
NAME                       READY     STATUS      RESTARTS   AGE       NODE
php-sample-build-3-build   0/1       Pending     0          21m       ip-172-31-5-177.ec2.internal
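To see why those build pods stay Pending, the pod events are usually the first stop; a hedged sketch using the pod and namespace names from the output above:

$ oc describe pod php-sample-build-1-build     # the Events section at the end shows scheduling or image-pull failures
$ oc get events -n wzheng3                     # same check from the namespace side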

Comment 5 Wesley Hearn 2016-02-08 15:32:26 UTC
This is related to the docker daemon hanging. It seems to be happening at a higher rate for us on 3.1.1.6.

Comment 6 Jhon Honce 2016-02-08 21:19:38 UTC
During the next hang, please attach strace to the docker process (-f -v -y -yy -s 4096) and log for approximately 5 minutes. Please attach the log here or forward it via email. Thanks.
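For reference, a hedged sketch of that capture on the affected node (the PID lookup and output file name are my assumptions; the flags are the ones requested above):

# run as root on the node with the hung daemon; stop with Ctrl-C after roughly 5 minutes
$ strace -f -v -y -yy -s 4096 -p $(pidof docker) -o /tmp/docker-hang.strace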

Comment 7 Aleksandar Kostadinov 2016-02-10 15:54:53 UTC
Created attachment 1122835 [details]
python build failed

It might be related to working in the web console; I've gotten
"the image cannot be retrieved" several times while trying to open:
https://console.stage.openshift.com/console/project/ctrnl/create/fromimage?imageName=python&imageTag=3.4&namespace=ctrnl

At some point it succeeded, but then the build failed with a failed-to-push-image error. See the attached console log.
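For a push failure like that, the build log normally carries the registry error; a hedged sketch (the build name is a guess, only the ctrnl namespace comes from the URL above):

$ oc logs build/python-1 -n ctrnl     # hypothetical build name; shows the push error from the builder pod
$ oc get imagestreams -n ctrnl        # confirm the target image stream exists in the project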

Comment 11 Peter Ruan 2016-02-12 19:18:10 UTC
I've tested it with my limited runs and have not seen the problem. I will need to run more tests and will mark it as VERIFIED if I still can't reproduce the problem.

Comment 12 Peter Ruan 2016-02-12 23:48:24 UTC
I've run more tests today on top of yesterday's and have not seen the issue again. Marking it as verified.