Bug 1575756
| Summary: | Stopping/Restarting active master controller instance makes the entire cluster unstable | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Vikas Laad <vlaad> |
| Component: | Master | Assignee: | Jordan Liggitt <jliggitt> |
| Status: | CLOSED ERRATA | QA Contact: | Vikas Laad <vlaad> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 3.10.0 | CC: | aos-bugs, hongkliu, jokerman, mifiedle, mmccomas, vlaad |
| Target Milestone: | --- | | |
| Target Release: | 3.10.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2018-07-30 19:14:38 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |
**Description** (Vikas Laad, 2018-05-07 19:52:38 UTC)
Created attachment 1432805 [details]
master api exited container logs

Created attachment 1432806 [details]
journalctl logs from ip-172-31-22-84.us-west-2.compute.internal

I was able to reproduce it; it happens when the active master instance is stopped/restarted.

Here is the state before stopping the master controller node (the active controller), in this case ip-172-31-7-3.us-west-2.compute.internal. Master pod start stats:

```
master-api-ip-172-31-22-84.us-west-2.compute.internal            1/1   Running   9   13m
master-api-ip-172-31-32-140.us-west-2.compute.internal           1/1   Running   8   19h
master-api-ip-172-31-7-3.us-west-2.compute.internal              1/1   Running   2   18h
master-controllers-ip-172-31-22-84.us-west-2.compute.internal    1/1   Running   2   12m
master-controllers-ip-172-31-32-140.us-west-2.compute.internal   1/1   Running   4   19h
master-controllers-ip-172-31-7-3.us-west-2.compute.internal      1/1   Running   2   18h
master-etcd-ip-172-31-22-84.us-west-2.compute.internal           1/1   Running   1   12m
master-etcd-ip-172-31-32-140.us-west-2.compute.internal          1/1   Running   1   19h
master-etcd-ip-172-31-7-3.us-west-2.compute.internal             1/1   Running   2   18h
```

After stopping that instance (ip-172-31-7-3.us-west-2.compute.internal) I see both of the other masters and the infra node in NotReady state; even trying to get logs from any of the master api or controllers pods fails with "Unable to connect to the server: unexpected EOF".

```
NAME                                          STATUS     ROLES           AGE   VERSION
ip-172-31-1-0.us-west-2.compute.internal      NotReady   compute,infra   20h   v1.10.0+b81c8f8
ip-172-31-16-217.us-west-2.compute.internal   Ready      compute         20h   v1.10.0+b81c8f8
ip-172-31-22-84.us-west-2.compute.internal    NotReady   master          41m   v1.10.0+b81c8f8
ip-172-31-32-140.us-west-2.compute.internal   NotReady   master          20h   v1.10.0+b81c8f8
ip-172-31-6-179.us-west-2.compute.internal    Ready      compute         19h   v1.10.0+b81c8f8
```

The registry and router also get re-created after the infra node becomes Ready. After around 15-16 minutes the nodes become Ready, but the master API and controllers pods keep going into CrashLoopBackOff and restart many times:

```
master-api-ip-172-31-22-84.us-west-2.compute.internal            1/1   CrashLoopBackOff   13   2m
master-api-ip-172-31-32-140.us-west-2.compute.internal           1/1   Running            14   20h
master-controllers-ip-172-31-22-84.us-west-2.compute.internal    1/1   Running            4    13s
master-controllers-ip-172-31-32-140.us-west-2.compute.internal   1/1   Running            5    20h
master-etcd-ip-172-31-22-84.us-west-2.compute.internal           1/1   Running            2    14s
master-etcd-ip-172-31-32-140.us-west-2.compute.internal          1/1   Running            1    20h
```

After almost 20-30 minutes the cluster is still unstable:

```
master-api-ip-172-31-22-84.us-west-2.compute.internal            0/1   CrashLoopBackOff   19   25m
master-api-ip-172-31-32-140.us-west-2.compute.internal           1/1   Running            17   21h
master-controllers-ip-172-31-22-84.us-west-2.compute.internal    1/1   Running            6    23m
master-controllers-ip-172-31-32-140.us-west-2.compute.internal   1/1   Running            6    21h
master-etcd-ip-172-31-22-84.us-west-2.compute.internal           1/1   Running            2    23m
master-etcd-ip-172-31-32-140.us-west-2.compute.internal          1/1   Running            1    21h
```

Created attachment 1433253 [details]
desc node when it's NotReady
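For anyone retracing the reproduction, the sketch below shows one way to identify the active controller and watch the fallout. It assumes a 3.10 static-pod control plane whose controllers leader-election lease is annotated on a `kube-system/openshift-master-controllers` object; the lock object's kind and name are assumptions here and may differ by release.

```
# Reproduction sketch. The lock object kind/name below are assumptions;
# verify them on your cluster (it may be an endpoints object instead).

# 1. Find which master currently holds the controllers leader lease.
oc get configmap openshift-master-controllers -n kube-system \
  -o 'jsonpath={.metadata.annotations.control-plane\.alpha\.kubernetes\.io/leader}'

# 2. Stop or restart that instance out of band (SSH or cloud console).

# 3. From another master, watch nodes and control-plane pods flap.
oc get nodes -w
oc get pods -n kube-system -w
```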
Created attachment 1433254 [details]
desc pod when api pod is in CrashLoopBackOff
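The attached describe/log output can be collected with standard commands. A short sketch, assuming the control-plane pods run in `kube-system` and the node service is `atomic-openshift-node` (true for the 3.10 cluster above, but worth confirming):

```
# Node and pod state (names are from this cluster; substitute your own).
oc describe node ip-172-31-22-84.us-west-2.compute.internal
oc describe pod -n kube-system \
    master-api-ip-172-31-22-84.us-west-2.compute.internal

# Logs of the previously crashed api container, plus the node journal.
oc logs --previous -n kube-system \
    master-api-ip-172-31-22-84.us-west-2.compute.internal
journalctl -u atomic-openshift-node --since "1 hour ago"
```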
Is the CRIO in the title indicating this is only seen in CRI-O installations?

No, I first saw this on a CRI-O runtime cluster, but I was able to reproduce it on Docker too. Removed CRIO from the title.

Is this still reproducible with https://github.com/openshift/origin/pull/19638 in?

Verified on the following version; no unnecessary restarts occurred even after stopping the active master multiple times:

openshift v3.10.0-0.50.0

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:1816
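For reference, the verification above can be repeated with a loop along these lines; the iteration count, settle time, and pass criterion are assumptions, not part of the original verification record:

```
# Verification sketch; the loop count and sleep are arbitrary.
oc version    # expect openshift v3.10.0-0.50.0 or newer

for run in 1 2 3; do
    # Stop and restart the active master instance out of band,
    # then give the cluster time to settle.
    sleep 600
    oc get nodes
    oc get pods -n kube-system | grep -E 'master-(api|controllers|etcd)'
done
# Pass: RESTARTS stays flat and no master pod sits in CrashLoopBackOff.
```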