Bug 1473031

Summary: fatal error: concurrent map read and map write
Product: OpenShift Container Platform
Reporter: Eric Paris <eparis>
Component: Networking
Assignee: Ben Bennett <bbennett>
Networking sub component: router
QA Contact: zhaozhanqi <zzhao>
Status: CLOSED ERRATA
Docs Contact:
Severity: unspecified
Priority: unspecified
CC: aos-bugs, bbennett, bmeng, jliggitt, jrosenta, xtian
Version: 3.6.0
Target Milestone: ---
Target Release: 3.7.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
  Cause: Missing locking around a router data structure.
  Consequence: The router pod would (very occasionally) crash and restart.
  Fix: Add the appropriate locking.
  Result: The invalid data access does not crash the router.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-11-28 22:04:10 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
  Logs with backtrace (flags: none)

Description Eric Paris 2017-07-19 21:50:12 UTC
I found a router that had a 'restart'.

ose-haproxy-router:v3.6.126.1

 Looked at the logs for the last pod and found:

  - spec.tls.key: Invalid value: "redacted key data": unrecognized PEM block DSA PRIVATE KEY
E0718 14:14:42.201169       1 router_controller.go:311] invalid route configuration
fatal error: concurrent map read and map write

Comment 1 Eric Paris 2017-07-19 21:51:02 UTC
Created attachment 1301446 [details]
Logs with backtrace

Comment 3 Jordan Liggitt 2017-07-20 16:02:23 UTC
The state map is read outside of the lock on line 770:

	if existingConfig, exists := r.state[backendKey]; exists {
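
To illustrate the class of problem (this is a minimal sketch, not the router's actual code): the Go runtime detects unsynchronized map access on a best-effort basis and aborts the process with a fatal error such as the one in the description. The `configStore` type and its `set`/`get` methods below are hypothetical; only the `state` map name and the shape of the lookup come from the line quoted above. The point is that the read takes the same mutex as the writes.

```go
package main

import (
	"fmt"
	"sync"
)

// configStore is a hypothetical stand-in for the router's state holder.
type configStore struct {
	lock  sync.Mutex
	state map[string]string
}

// set writes an entry while holding the lock.
func (c *configStore) set(key, value string) {
	c.lock.Lock()
	defer c.lock.Unlock()
	c.state[key] = value
}

// get reads an entry while holding the same lock as set.
// Reading c.state without the lock while set runs in another
// goroutine is a data race and may abort the whole process.
func (c *configStore) get(key string) (string, bool) {
	c.lock.Lock()
	defer c.lock.Unlock()
	value, exists := c.state[key]
	return value, exists
}

func main() {
	store := &configStore{state: map[string]string{}}
	var wg sync.WaitGroup
	// Run readers and writers concurrently against the same keys.
	for i := 0; i < 8; i++ {
		wg.Add(2)
		go func(n int) {
			defer wg.Done()
			for j := 0; j < 1000; j++ {
				store.set(fmt.Sprintf("backend-%d", n), "config")
			}
		}(i)
		go func(n int) {
			defer wg.Done()
			for j := 0; j < 1000; j++ {
				store.get(fmt.Sprintf("backend-%d", n))
			}
		}(i)
	}
	wg.Wait()
	_, exists := store.get("backend-0")
	fmt.Println("backend-0 present:", exists)
}
```

Building or testing with Go's `-race` flag is a more reliable way to surface this kind of bug than waiting for the runtime's own check to fire.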

Comment 4 openshift-github-bot 2017-07-22 11:58:47 UTC
Commit pushed to master at https://github.com/openshift/origin

https://github.com/openshift/origin/commit/0b305fba3645f1313b54c30e7890a7a6cf4290f1
Moved locking to protect a read of a map in the router

The locking was not protecting a read, so a simultaneous write would
crash the router.  I added new internal functions that do the actual
work without taking the lock, then made the existing locking functions
acquire the lock and call the internal versions.  Then in the rename,
I moved the lock acquisition earlier and called the internal functions
directly.

In brief: restructured the code so we could lock properly.

Fixes bug 1473031 (https://bugzilla.redhat.com/show_bug.cgi?id=1473031)
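
The refactoring described in the commit message can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from the commit: `routeTable`, `Add`, `Remove`, `Get`, `Rename`, and the `*Internal` helpers are all hypothetical names. The exported methods take the lock and delegate to internal helpers that assume the lock is already held, and the rename takes the lock before its first read so the lookup, delete, and insert form one critical section.

```go
package main

import (
	"fmt"
	"sync"
)

// routeTable is a hypothetical type used only for illustration.
type routeTable struct {
	lock  sync.Mutex
	state map[string]string
}

// addInternal does the work of Add. The caller must hold t.lock.
func (t *routeTable) addInternal(key, cfg string) {
	t.state[key] = cfg
}

// removeInternal does the work of Remove. The caller must hold t.lock.
func (t *routeTable) removeInternal(key string) {
	delete(t.state, key)
}

// Add acquires the lock, then calls the internal helper.
func (t *routeTable) Add(key, cfg string) {
	t.lock.Lock()
	defer t.lock.Unlock()
	t.addInternal(key, cfg)
}

// Remove acquires the lock, then calls the internal helper.
func (t *routeTable) Remove(key string) {
	t.lock.Lock()
	defer t.lock.Unlock()
	t.removeInternal(key)
}

// Get reads an entry under the lock.
func (t *routeTable) Get(key string) (string, bool) {
	t.lock.Lock()
	defer t.lock.Unlock()
	cfg, exists := t.state[key]
	return cfg, exists
}

// Rename takes the lock before the existence check, so the map read
// is protected, and then uses the internal helpers directly.
func (t *routeTable) Rename(oldKey, newKey string) bool {
	t.lock.Lock()
	defer t.lock.Unlock()
	cfg, exists := t.state[oldKey]
	if !exists {
		return false
	}
	t.removeInternal(oldKey)
	t.addInternal(newKey, cfg)
	return true
}

func main() {
	table := &routeTable{state: map[string]string{}}
	table.Add("old-name", "config")
	renamed := table.Rename("old-name", "new-name")
	_, stillThere := table.Get("old-name")
	fmt.Println("renamed:", renamed, "old key still present:", stillThere)
}
```

The split is needed because `sync.Mutex` is not reentrant: if `Rename` called `Add` or `Remove` while already holding the lock, it would deadlock.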

Comment 6 zhaozhanqi 2017-09-27 09:45:49 UTC
Verified this bug on v3.7.0-0.127.0.

Created routes concurrently using the following script:
*****test.sh*******************
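#!/bin/bash
# Prefix for generated route names. Assumption: the original script relied on
# NAME_PREFIX being set in the environment; "route" is only a fallback.
NAME_PREFIX=${NAME_PREFIX:-route}

# Create a random number (0-9) of routes for one worker.
function _create_routes() {
    local name=$1
    echo "  - worker name: ${name} ... "
    sleep "0.0$((RANDOM%3))"

    for idx in $(seq $((RANDOM%10))); do
      local route_name="${NAME_PREFIX}-${name}-id-${idx}"
      oc expose service tc-500001 --name="${route_name}"
    done

}  #  End of function  _create_routes.


#
#  main():
#
ntimes=${1:-20}

# Start the workers in the background so route creation runs in parallel.
for i in $(seq "${ntimes}"); do
  _create_routes "worker-${i}" &
done

_create_routes "main"

# Wait for the background workers to finish before exiting.
wait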

*****************

The error above no longer appears in the haproxy pod logs.

So this bug is verified. Please correct me if these steps are not sufficient. Thanks.

Comment 9 errata-xmlrpc 2017-11-28 22:04:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:3188