Bug 1445084

Summary: [3.5] atomic-openshift-node service is filling /var with core files
Product: OpenShift Container Platform Reporter: Stefanie Forrester <dakini>
Component: Node Assignee: Solly Ross <sross>
Status: CLOSED ERRATA QA Contact: DeShuai Ma <dma>
Severity: high Docs Contact:
Priority: high    
Version: 3.5.1 CC: aos-bugs, dakini, eparis, jeder, jgoulding, jokerman, mifiedle, mmahut, mmccomas, vlaad, wmeng
Target Milestone: --- Keywords: OpsBlocker
Target Release: 3.5.z   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2017-06-15 18:37:51 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Stefanie Forrester 2017-04-24 21:46:49 UTC
Description of problem:

Core files present on all nodes indicate repeated crashes of the atomic-openshift-node service.

Version-Release number of selected component (if applicable):

atomic-openshift-node-3.5.5.9-1.git.0.d220e61.el7.x86_64

How reproducible:

The files are present on every node in clusters free-int and starter-us-east-2. They're not present on the other 2 Free clusters.

Steps to Reproduce:
1. Unknown. Just leave atomic-openshift-node running and wait.

Actual results:

Multiple core files exist for almost every day, starting on April 18th for free-int and April 19th for starter-us-east-2.


[root@free-int-node-compute-1f8a0 ~]# ls -lt /var/lib/origin|grep core
-rw-------. 1 root root 431726592 Apr 24 09:40 core.59905
-rw-------. 1 root root 380768256 Apr 24 04:30 core.6176
-rw-------. 1 root root 416768000 Apr 24 03:41 core.38163
-rw-------. 1 root root 439848960 Apr 24 00:01 core.9030
-rw-------. 1 root root 397611008 Apr 23 21:30 core.16519
-rw-------. 1 root root 377782272 Apr 23 19:31 core.84302
-rw-------. 1 root root 384069632 Apr 23 18:41 core.19196
-rw-------. 1 root root 394166272 Apr 23 17:40 core.87294
-rw-------. 1 root root 388587520 Apr 23 12:41 core.10631
-rw-------. 1 root root 408862720 Apr 23 11:31 core.102261
-rw-------. 1 root root 406425600 Apr 23 06:51 core.98184
-rw-------. 1 root root 224071680 Apr 23 00:31 core.97272
-rw-------. 1 root root 429555712 Apr 23 00:31 core.8024
-rw-------. 1 root root 405364736 Apr 22 17:20 core.120198
-rw-------. 1 root root 295821312 Apr 22 11:02 core.117401
-rw-------. 1 root root 350875648 Apr 22 11:00 core.102226
-rw-------. 1 root root 395087872 Apr 22 10:50 core.98895
-rw-------. 1 root root 378306560 Apr 22 06:51 core.16058
-rw-------. 1 root root 376401920 Apr 22 05:31 core.51186
-rw-------. 1 root root 400482304 Apr 22 04:10 core.42464
-rw-------. 1 root root 390930432 Apr 22 02:00 core.31292
-rw-------. 1 root root 383844352 Apr 21 23:51 core.120753
-rw-------. 1 root root 378437632 Apr 21 21:00 core.43566
-rw-------. 1 root root 389931008 Apr 21 19:50 core.95839
-rw-------. 1 root root 401235968 Apr 21 18:31 core.122675
-rw-------. 1 root root 382042112 Apr 21 15:01 core.42077
-rw-------. 1 root root 376569856 Apr 21 13:56 core.41917
-rw-------. 1 root root 342814720 Apr 21 12:01 core.72469
-rw-------. 1 root root 372961280 Apr 21 10:20 core.34846
-rw-------. 1 root root 331333632 Apr 21 07:31 core.20298
-rw-------. 1 root root 368013312 Apr 21 07:20 core.91107
-rw-------. 1 root root 367210496 Apr 21 04:10 core.85366
-rw-------. 1 root root 413872128 Apr 21 02:10 core.25268
-rw-------. 1 root root 360497152 Apr 21 01:11 core.848
-rw-------. 1 root root 352985088 Apr 21 00:50 core.106012


[root@free-int-node-compute-1f8a0 ~]# file /var/lib/origin/core.98895
/var/lib/origin/core.98895: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style, from '/usr/bin/openshift start node --config=/etc/origin/node/node-config.yaml --logl', real uid: 0, effective uid: 0, real gid: 0, effective gid: 0, execfn: '/usr/bin/openshift', platform: 'x86_64'


Expected results:

atomic-openshift-node service should run without crashing.

Additional info:

Comment 2 Stefanie Forrester 2017-04-24 21:59:05 UTC
This seems to be the error:

Apr 24 09:40:45 ip-172-31-49-44.ec2.internal atomic-openshift-node[59905]: fatal error: concurrent map read and map write
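
For context, this message comes from the Go runtime, which aborts the process when it detects unsynchronized concurrent access to a built-in map; it is a fatal runtime error, not an ordinary panic, so recover() cannot catch it. The sketch below is only a general illustration of how this class of error is commonly avoided, by guarding the map with a sync.RWMutex. It is not taken from the node code or from any fix for this bug, and SafeCounter, Inc, and Get are hypothetical names.

```go
package main

import (
	"fmt"
	"sync"
)

// SafeCounter is a hypothetical type used only for illustration.
// It wraps a plain map with an RWMutex so readers and writers never
// touch the map at the same time.
type SafeCounter struct {
	mu sync.RWMutex
	m  map[string]int
}

// NewSafeCounter returns a counter with an initialized map.
func NewSafeCounter() *SafeCounter {
	return &SafeCounter{m: make(map[string]int)}
}

// Inc takes the write lock before mutating the map.
func (c *SafeCounter) Inc(key string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.m[key]++
}

// Get takes the read lock; multiple readers may hold it at once.
func (c *SafeCounter) Get(key string) int {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return c.m[key]
}

func main() {
	c := NewSafeCounter()
	var wg sync.WaitGroup
	// Mix concurrent writers and readers. Without the lock, this access
	// pattern on a bare map is what the runtime reports as a fatal error.
	for i := 0; i < 100; i++ {
		wg.Add(2)
		go func() { defer wg.Done(); c.Inc("pods") }()
		go func() { defer wg.Done(); _ = c.Get("pods") }()
	}
	wg.Wait()
	fmt.Println(c.Get("pods"))
}
```

Running a program like this with "go run -race" is one way to check that no unsynchronized access remains.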


[root@free-int-node-compute-1f8a0 ~]# journalctl -lu atomic-openshift-node --no-pager | grep concurrent
Apr 23 17:40:37 ip-172-31-49-44.ec2.internal atomic-openshift-node[87294]: fatal error: concurrent map read and map write
Apr 23 18:40:59 ip-172-31-49-44.ec2.internal atomic-openshift-node[19196]: fatal error: concurrent map read and map write
Apr 23 19:31:13 ip-172-31-49-44.ec2.internal atomic-openshift-node[84302]: fatal error: concurrent map read and map write
Apr 23 21:30:45 ip-172-31-49-44.ec2.internal atomic-openshift-node[16519]: fatal error: concurrent map read and map write
Apr 24 00:01:16 ip-172-31-49-44.ec2.internal atomic-openshift-node[9030]: fatal error: concurrent map read and map write
Apr 24 03:41:27 ip-172-31-49-44.ec2.internal atomic-openshift-node[38163]: fatal error: concurrent map read and map write
Apr 24 04:30:12 ip-172-31-49-44.ec2.internal atomic-openshift-node[6176]: fatal error: concurrent map read and map write
Apr 24 09:40:45 ip-172-31-49-44.ec2.internal atomic-openshift-node[59905]: fatal error: concurrent map read and map write

These timestamps and PIDs correspond to the timestamps and PID suffixes of the core files:

-rw-------. 1 root root 431726592 Apr 24 09:40 core.59905
-rw-------. 1 root root 380768256 Apr 24 04:30 core.6176
-rw-------. 1 root root 416768000 Apr 24 03:41 core.38163
-rw-------. 1 root root 439848960 Apr 24 00:01 core.9030
-rw-------. 1 root root 397611008 Apr 23 21:30 core.16519
-rw-------. 1 root root 377782272 Apr 23 19:31 core.84302
-rw-------. 1 root root 384069632 Apr 23 18:41 core.19196
-rw-------. 1 root root 394166272 Apr 23 17:40 core.87294
-rw-------. 1 root root 388587520 Apr 23 12:41 core.10631
-rw-------. 1 root root 408862720 Apr 23 11:31 core.102261
-rw-------. 1 root root 406425600 Apr 23 06:51 core.98184
-rw-------. 1 root root 224071680 Apr 23 00:31 core.97272
-rw-------. 1 root root 429555712 Apr 23 00:31 core.8024
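
The matching above was done by eye. A minimal sketch of automating it follows, assuming the journal lines and core file names have already been collected as strings. The helper names crashPIDs and splitCores are hypothetical, and the sample data is a small subset copied from the output in this report.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// journalRe captures the PID from a journal line that reports the fatal
// map error for the atomic-openshift-node unit.
var journalRe = regexp.MustCompile(
	`atomic-openshift-node\[(\d+)\]: fatal error: concurrent map read and map write`)

// crashPIDs returns the set of PIDs that logged the fatal error.
func crashPIDs(journal string) map[string]bool {
	pids := make(map[string]bool)
	for _, line := range strings.Split(journal, "\n") {
		if m := journalRe.FindStringSubmatch(line); m != nil {
			pids[m[1]] = true
		}
	}
	return pids
}

// splitCores separates core file names (core.<pid>) into those whose PID
// appears in pids and those with no matching journal entry.
func splitCores(names []string, pids map[string]bool) (matched, unmatched []string) {
	for _, name := range names {
		pid := strings.TrimPrefix(name, "core.")
		if pids[pid] {
			matched = append(matched, name)
		} else {
			unmatched = append(unmatched, name)
		}
	}
	return matched, unmatched
}

func main() {
	// Sample lines copied from the journalctl output in this report.
	journal := `Apr 24 04:30:12 ip-172-31-49-44.ec2.internal atomic-openshift-node[6176]: fatal error: concurrent map read and map write
Apr 24 09:40:45 ip-172-31-49-44.ec2.internal atomic-openshift-node[59905]: fatal error: concurrent map read and map write`

	// Sample core file names copied from the ls output in this report.
	cores := []string{"core.59905", "core.6176", "core.10631"}

	matched, unmatched := splitCores(cores, crashPIDs(journal))
	fmt.Println("matched:", matched)
	fmt.Println("unmatched:", unmatched)
}
```

With this sample, core.10631 ends up in the unmatched list, because no line for that PID appears in the grep output shown above.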

Comment 3 Stefanie Forrester 2017-04-25 14:27:08 UTC
*** Bug 1445298 has been marked as a duplicate of this bug. ***

Comment 6 Solly Ross 2017-04-25 15:19:49 UTC
Fix in progress here: https://github.com/openshift/origin/pull/13847

Comment 8 Eric Paris 2017-05-01 19:45:10 UTC
https://github.com/openshift/ose/pull/724

Comment 10 DeShuai Ma 2017-06-05 07:35:50 UTC
Could you help verify this bug?

Comment 11 Stefanie Forrester 2017-06-05 14:28:54 UTC
I checked on our clusters running 3.5.5.19 and the problem appears to be fixed. There haven't been any new core files since April 26th.

Comment 13 errata-xmlrpc 2017-06-15 18:37:51 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:1425