Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1598158

Summary: Master Services Take Over Minute to Stop - SIGTERM Request Timeout
Product: OpenShift Container Platform Reporter: Jack Ottofaro <jack.ottofaro>
Component: MasterAssignee: Stefan Schimanski <sttts>
Status: CLOSED DEFERRED QA Contact: Xingxing Xia <xxia>
Severity: medium Docs Contact:
Priority: high    
Version: 3.9.0CC: aos-bugs, erich, jokerman, maszulik, mbliss, mfojtik, mmccomas, pdwyer, scuppett
Target Milestone: ---Flags: scuppett: needinfo-
Target Release: 3.9.z   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2019-11-20 19:08:18 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
strace capture none

Description Jack Ottofaro 2018-07-04 13:33:43 UTC
Created attachment 1456502 [details]
strace capture

Description of problem:

Customer is doing a restart on each service, atomic-openshift-master-api, followed by atomic-openshift-master-controllers, and has noticed that since upgrading from 3.6 to 3.9 it now takes about 1.5 minutes to restart each.

Looking at the sosreport I've determined that it's actually the service termination that's taking the time - after about 1.5 minutes sigterm times out and the service is killed. The service is then started with no issues and that takes a short number of seconds. Key sosreport lines are:

...
May 30 17:07:22 master01 systemd: Stopping Atomic OpenShift Master API...
...
May 30 17:08:52 master01 systemd: atomic-openshift-master-api.service stop-sigterm timed out. Killing.
...
May 30 17:08:52 master01 systemd: atomic-openshift-master-api.service: main process exited, code=killed, status=9/KILL
...
May 30 17:08:52 master01 systemd: Starting Atomic OpenShift Master API...
...
May 30 17:09:01 master01 systemd: Started Atomic OpenShift Master API.
...
May 30 17:15:28 master01 systemd: Stopping Atomic OpenShift Master Controllers...
...
May 30 17:16:58 master01 systemd: atomic-openshift-master-controllers.service stop-sigterm timed out. Killing.
May 30 17:16:58 master01 systemd: atomic-openshift-master-controllers.service: main process exited, code=killed, status=9/KILL
May 30 17:16:58 master01 systemd: Unit atomic-openshift-master-controllers.service entered failed state.
May 30 17:16:58 master01 systemd: atomic-openshift-master-controllers.service failed.
May 30 17:16:58 master01 systemd: Starting Atomic OpenShift Master Controllers...
...
May 30 17:16:59 master01 systemd: Started Atomic OpenShift Master Controllers.
...

I asked and received an strace (attached) when 'systemctl stop atomic-openshift-master-api.service' is executed. In it I can see the SIGTERM received but then it goes on for another ~90 seconds until its killed. 


How reproducible:

Not sure. Occureed after customer upgraded from 3.6 to 3.9.

Actual results:

OpenShift services, atomic-openshift-master-api and atomic-openshift-master-controllers, are taking about 1.5 minutes to stop instead of seconds.

Expected results:

On a system I have where the process gets stopped "normally" an strace from the time of the SIGTERM looks like this:

...
13811 13:27:28.374534 <... futex resumed> ) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
13810 13:27:28.374556 <... futex resumed> ) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
14725 13:27:28.374636 --- SIGTERM {si_signo=SIGTERM, si_code=SI_USER, si_pid=1, si_uid=0} ---
23818 13:27:28.374702 --- SIGCONT {si_signo=SIGCONT, si_code=SI_USER, si_pid=1, si_uid=0} ---
22215 13:27:28.374735 futex(0xc42dcfa110, FUTEX_WAIT, 0, NULL <unfinished ...>
13812 13:27:28.374771 futex(0xc420c74110, FUTEX_WAIT, 0, NULL <unfinished ...>
13809 13:27:28.374809 futex(0xc42002cb90, FUTEX_WAIT, 0, NULL <unfinished ...>
13808 13:27:28.374836 futex(0xc42002c810, FUTEX_WAIT, 0, NULL <unfinished ...>
13807 13:27:28.374862 restart_syscall(<... resuming interrupted futex ...> <unfinished ...>
13806 13:27:28.374889 futex(0xf587c70, FUTEX_WAIT, 0, NULL <unfinished ...>
61831 13:27:28.374930 epoll_wait(4,  <unfinished ...>
14725 13:27:28.375017 rt_sigaction(SIGTERM, {SIG_DFL, ~[], SA_RESTORER|SA_STACK|SA_RESTART|SA_SIGINFO, 0x7f4654c1b680},  <unfinished ...>
86536 13:27:28.375079 futex(0xc43c031610, FUTEX_WAIT, 0, NULL <unfinished ...>
23818 13:27:28.375106 futex(0xc42b8e2810, FUTEX_WAIT, 0, NULL <unfinished ...>
19094 13:27:28.375126 futex(0xc4282e2f10, FUTEX_WAIT, 0, NULL <unfinished ...>
13811 13:27:28.375147 futex(0xc42018c490, FUTEX_WAIT, 0, NULL <unfinished ...>
13810 13:27:28.375159 futex(0xf5ab4e0, FUTEX_WAIT, 0, NULL <unfinished ...>
14725 13:27:28.375176 <... rt_sigaction resumed> NULL, 8) = 0
14725 13:27:28.375201 rt_sigprocmask(SIG_UNBLOCK, [TERM], NULL, 8) = 0
14725 13:27:28.375280 gettid()          = 14725
14725 13:27:28.375342 tkill(14725, SIGTERM) = 0
14725 13:27:28.375396 --- SIGTERM {si_signo=SIGTERM, si_code=SI_TKILL, si_pid=13806, si_uid=0} ---
61831 13:27:28.389085 +++ killed by SIGTERM +++
14725 13:27:28.389127 +++ killed by SIGTERM +++
86536 13:27:28.389136 +++ killed by SIGTERM +++
23818 13:27:28.389143 +++ killed by SIGTERM +++
19094 13:27:28.389149 +++ killed by SIGTERM +++
13812 13:27:28.389155 +++ killed by SIGTERM +++
13811 13:27:28.389162 +++ killed by SIGTERM +++
13810 13:27:28.389168 +++ killed by SIGTERM +++
13809 13:27:28.389174 +++ killed by SIGTERM +++
13808 13:27:28.389180 +++ killed by SIGTERM +++
13807 13:27:28.389187 +++ killed by SIGTERM +++
22215 13:27:28.480580 +++ killed by SIGTERM +++
13806 13:27:28.480636 +++ killed by SIGTERM +++

Additional info:

Comment 1 Jack Ottofaro 2018-07-11 16:28:33 UTC
Customer recently recreated the same issue with a fresh install of 3.9 so apparently the issue is not related or caused by an upgrade.

I have asked if there is anything he can identify as unique to or different about his environment.

Comment 14 Stephen Cuppett 2019-11-20 19:08:18 UTC
OCP 3.6-3.10 is no longer on full support [1]. Marking CLOSED DEFERRED. If you have a customer case with a support exception or have reproduced on 3.11+, please reopen and include those details. When reopening, please set the Target Release to the appropriate version where needed.

[1]: https://access.redhat.com/support/policy/updates/openshift