Description of problem:
Core files present on all nodes indicate repeated crashes of the atomic-openshift-node service.

Version-Release number of selected component (if applicable):
atomic-openshift-node-3.5.5.9-1.git.0.d220e61.el7.x86_64

How reproducible:
The files are present on every node in clusters free-int and starter-us-east-2. They're not present on the other 2 Free clusters.

Steps to Reproduce:
1. Unknown. Just leave atomic-openshift-node running and wait.

Actual results:
Multiple core files exist for almost every day, starting on April 18th for free-int and April 19th for starter-us-east-2.

[root@free-int-node-compute-1f8a0 ~]# ls -lt /var/lib/origin|grep core
-rw-------. 1 root root 431726592 Apr 24 09:40 core.59905
-rw-------. 1 root root 380768256 Apr 24 04:30 core.6176
-rw-------. 1 root root 416768000 Apr 24 03:41 core.38163
-rw-------. 1 root root 439848960 Apr 24 00:01 core.9030
-rw-------. 1 root root 397611008 Apr 23 21:30 core.16519
-rw-------. 1 root root 377782272 Apr 23 19:31 core.84302
-rw-------. 1 root root 384069632 Apr 23 18:41 core.19196
-rw-------. 1 root root 394166272 Apr 23 17:40 core.87294
-rw-------. 1 root root 388587520 Apr 23 12:41 core.10631
-rw-------. 1 root root 408862720 Apr 23 11:31 core.102261
-rw-------. 1 root root 406425600 Apr 23 06:51 core.98184
-rw-------. 1 root root 224071680 Apr 23 00:31 core.97272
-rw-------. 1 root root 429555712 Apr 23 00:31 core.8024
-rw-------. 1 root root 405364736 Apr 22 17:20 core.120198
-rw-------. 1 root root 295821312 Apr 22 11:02 core.117401
-rw-------. 1 root root 350875648 Apr 22 11:00 core.102226
-rw-------. 1 root root 395087872 Apr 22 10:50 core.98895
-rw-------. 1 root root 378306560 Apr 22 06:51 core.16058
-rw-------. 1 root root 376401920 Apr 22 05:31 core.51186
-rw-------. 1 root root 400482304 Apr 22 04:10 core.42464
-rw-------. 1 root root 390930432 Apr 22 02:00 core.31292
-rw-------. 1 root root 383844352 Apr 21 23:51 core.120753
-rw-------. 1 root root 378437632 Apr 21 21:00 core.43566
-rw-------. 1 root root 389931008 Apr 21 19:50 core.95839
-rw-------. 1 root root 401235968 Apr 21 18:31 core.122675
-rw-------. 1 root root 382042112 Apr 21 15:01 core.42077
-rw-------. 1 root root 376569856 Apr 21 13:56 core.41917
-rw-------. 1 root root 342814720 Apr 21 12:01 core.72469
-rw-------. 1 root root 372961280 Apr 21 10:20 core.34846
-rw-------. 1 root root 331333632 Apr 21 07:31 core.20298
-rw-------. 1 root root 368013312 Apr 21 07:20 core.91107
-rw-------. 1 root root 367210496 Apr 21 04:10 core.85366
-rw-------. 1 root root 413872128 Apr 21 02:10 core.25268
-rw-------. 1 root root 360497152 Apr 21 01:11 core.848
-rw-------. 1 root root 352985088 Apr 21 00:50 core.106012

[root@free-int-node-compute-1f8a0 ~]# file /var/lib/origin/core.98895
/var/lib/origin/core.98895: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style, from '/usr/bin/openshift start node --config=/etc/origin/node/node-config.yaml --logl', real uid: 0, effective uid: 0, real gid: 0, effective gid: 0, execfn: '/usr/bin/openshift', platform: 'x86_64'

Expected results:
atomic-openshift-node service should run without crashing.

Additional info:
This seems to be the error:

Apr 24 09:40:45 ip-172-31-49-44.ec2.internal atomic-openshift-node[59905]: fatal error: concurrent map read and map write

[root@free-int-node-compute-1f8a0 ~]# journalctl -lu atomic-openshift-node --no-pager | grep concurrent
Apr 23 17:40:37 ip-172-31-49-44.ec2.internal atomic-openshift-node[87294]: fatal error: concurrent map read and map write
Apr 23 18:40:59 ip-172-31-49-44.ec2.internal atomic-openshift-node[19196]: fatal error: concurrent map read and map write
Apr 23 19:31:13 ip-172-31-49-44.ec2.internal atomic-openshift-node[84302]: fatal error: concurrent map read and map write
Apr 23 21:30:45 ip-172-31-49-44.ec2.internal atomic-openshift-node[16519]: fatal error: concurrent map read and map write
Apr 24 00:01:16 ip-172-31-49-44.ec2.internal atomic-openshift-node[9030]: fatal error: concurrent map read and map write
Apr 24 03:41:27 ip-172-31-49-44.ec2.internal atomic-openshift-node[38163]: fatal error: concurrent map read and map write
Apr 24 04:30:12 ip-172-31-49-44.ec2.internal atomic-openshift-node[6176]: fatal error: concurrent map read and map write
Apr 24 09:40:45 ip-172-31-49-44.ec2.internal atomic-openshift-node[59905]: fatal error: concurrent map read and map write

Which corresponds with the timestamps on the core files.

-rw-------. 1 root root 431726592 Apr 24 09:40 core.59905
-rw-------. 1 root root 380768256 Apr 24 04:30 core.6176
-rw-------. 1 root root 416768000 Apr 24 03:41 core.38163
-rw-------. 1 root root 439848960 Apr 24 00:01 core.9030
-rw-------. 1 root root 397611008 Apr 23 21:30 core.16519
-rw-------. 1 root root 377782272 Apr 23 19:31 core.84302
-rw-------. 1 root root 384069632 Apr 23 18:41 core.19196
-rw-------. 1 root root 394166272 Apr 23 17:40 core.87294
-rw-------. 1 root root 388587520 Apr 23 12:41 core.10631
-rw-------. 1 root root 408862720 Apr 23 11:31 core.102261
-rw-------. 1 root root 406425600 Apr 23 06:51 core.98184
-rw-------. 1 root root 224071680 Apr 23 00:31 core.97272
-rw-------. 1 root root 429555712 Apr 23 00:31 core.8024
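The matching of journal entries to core files can also be done mechanically. The sketch below is only an illustration: it assumes core files are named core.<pid> (as the listing above suggests), and the helper names crashedPIDs and unmatchedCores are made up for this example. It extracts the PID from each fatal-error journal line and reports any core file whose PID has no such line in the excerpt.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// fatalRe captures the PID from a journal line carrying the fatal error
// quoted in this report.
var fatalRe = regexp.MustCompile(
	`atomic-openshift-node\[(\d+)\]: fatal error: concurrent map read and map write`)

// crashedPIDs returns the set of PIDs that logged the fatal error.
// (Hypothetical helper, not part of any OpenShift tooling.)
func crashedPIDs(journal []string) map[string]bool {
	pids := make(map[string]bool)
	for _, line := range journal {
		if m := fatalRe.FindStringSubmatch(line); m != nil {
			pids[m[1]] = true
		}
	}
	return pids
}

// unmatchedCores returns the core file names whose PID suffix has no
// fatal-error line in the journal excerpt. Assumes names of the form
// core.<pid>.
func unmatchedCores(cores []string, pids map[string]bool) []string {
	var out []string
	for _, name := range cores {
		pid := strings.TrimPrefix(name, "core.")
		if !pids[pid] {
			out = append(out, name)
		}
	}
	return out
}

func main() {
	// Two journal lines and three core names copied from this report.
	journal := []string{
		"Apr 24 04:30:12 ip-172-31-49-44.ec2.internal atomic-openshift-node[6176]: fatal error: concurrent map read and map write",
		"Apr 24 09:40:45 ip-172-31-49-44.ec2.internal atomic-openshift-node[59905]: fatal error: concurrent map read and map write",
	}
	cores := []string{"core.59905", "core.6176", "core.10631"}
	fmt.Println(unmatchedCores(cores, crashedPIDs(journal)))
}
```

With this sample input, core.10631 is reported as unmatched, since the grep output above contains no journal line for that PID.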
*** Bug 1445298 has been marked as a duplicate of this bug. ***
Fix in progress here: https://github.com/openshift/origin/pull/13847
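For background on the error itself: the Go runtime aborts the process with "fatal error: concurrent map read and map write" when it detects one goroutine reading a plain map while another writes to it without synchronization. The sketch below shows one common way to avoid this in general, by guarding the map with a sync.RWMutex. It is not taken from the pull request above and does not claim to show how the fix was implemented; the statusCache type and its methods are hypothetical names used only for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// statusCache is a hypothetical example type, not a type from the
// OpenShift code base. It wraps a map so that every access goes
// through a lock.
type statusCache struct {
	mu sync.RWMutex
	m  map[string]string
}

func newStatusCache() *statusCache {
	return &statusCache{m: make(map[string]string)}
}

// Set takes the write lock, so no reader or other writer can touch
// the map while it is being modified.
func (c *statusCache) Set(key, value string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.m[key] = value
}

// Get takes the read lock; multiple readers may proceed together,
// but never at the same time as a writer.
func (c *statusCache) Get(key string) (string, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	v, ok := c.m[key]
	return v, ok
}

// Len reports the number of entries under the read lock.
func (c *statusCache) Len() int {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return len(c.m)
}

func main() {
	cache := newStatusCache()
	var wg sync.WaitGroup
	// Readers and writers run concurrently; with an unguarded map this
	// access pattern is what the runtime can abort on.
	for i := 0; i < 8; i++ {
		wg.Add(2)
		go func(id int) {
			defer wg.Done()
			for j := 0; j < 1000; j++ {
				cache.Set(fmt.Sprintf("pod-%d", id), "Running")
			}
		}(i)
		go func(id int) {
			defer wg.Done()
			for j := 0; j < 1000; j++ {
				cache.Get(fmt.Sprintf("pod-%d", id))
			}
		}(i)
	}
	wg.Wait()
	fmt.Println("entries:", cache.Len())
}
```

Running a program like this under the race detector (go run -race) is a reasonable way to check that no unsynchronized access remains.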
https://lists.freedesktop.org/archives/systemd-devel/2017-April/038755.html
https://github.com/openshift/ose/pull/724
Could you help verify this bug?
I checked on our clusters running 3.5.5.19 and the problem appears to be fixed. There haven't been any new core files since April 26th.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:1425