Bug 1767138
Summary: | Excessive CPU usage in ProcessLauncher()'s wait loop | ||
---|---|---|---|
Product: | Red Hat OpenStack | Reporter: | John Eckersberg <jeckersb> |
Component: | python-oslo-service | Assignee: | John Eckersberg <jeckersb> |
Status: | CLOSED EOL | QA Contact: | pkomarov |
Severity: | medium | Docs Contact: | |
Priority: | medium | ||
Version: | 15.0 (Stein) | CC: | hberaud, lmiccini, mburns, michele, stchen |
Target Milestone: | --- | Keywords: | Triaged, ZStream |
Target Release: | --- | ||
Hardware: | x86_64 | ||
OS: | Linux | ||
Last Closed: | 2020-09-30 19:15:53 UTC | Type: | Bug |
Description
John Eckersberg
2019-10-30 18:41:21 UTC
Some baseline measurements on my devstack VM, very consistent:

```
[root@devstack ~]# perf stat -e syscalls:sys_enter_select -e syscalls:sys_enter_wait4 -a -I 1000
#           time    counts unit events
     1.000123502       612      syscalls:sys_enter_select
     1.000123502       491      syscalls:sys_enter_wait4
     2.000979655       617      syscalls:sys_enter_select
     2.000979655       492      syscalls:sys_enter_wait4
     3.001318574       615      syscalls:sys_enter_select
     3.001318574       492      syscalls:sys_enter_wait4
     4.002134374       610      syscalls:sys_enter_select
     4.002134374       491      syscalls:sys_enter_wait4
     5.002278994       616      syscalls:sys_enter_select
     5.002278994       493      syscalls:sys_enter_wait4
     6.002431351       611      syscalls:sys_enter_select
     6.002431351       488      syscalls:sys_enter_wait4
     7.003026036       615      syscalls:sys_enter_select
     7.003026036       493      syscalls:sys_enter_wait4
     8.003501667       613      syscalls:sys_enter_select
     8.003501667       490      syscalls:sys_enter_wait4
     9.004544555       622      syscalls:sys_enter_select
     9.004544555       500      syscalls:sys_enter_wait4
    10.004696272       608      syscalls:sys_enter_select
    10.004696272       485      syscalls:sys_enter_wait4
```

We can do better!

---

The main bit in cotyledon is here:

https://github.com/sileht/cotyledon/blob/master/cotyledon/_utils.py#L114-L203

with _wait_forever being the wait loop.

The main idea is that it sets up a pipe and uses set_wakeup_fd to trigger signal handling via that pipe. Then the wait loop can block forever on select() because the fdset includes the pipe, so signals will properly interrupt the loop. At that point you don't have to loop checking os.waitpid() for child exit; instead you'll catch SIGCHLD when the child terminates, and you can just call os.waitpid() at that time to do the final cleanup.

---

Thanks John for the details, I think I got it.

(In reply to John Eckersberg from comment #2)
> The main bit in cotyledon is here:
>
> https://github.com/sileht/cotyledon/blob/master/cotyledon/_utils.py#L114-L203
>
> with _wait_forever being the wait loop.
>
> The main idea is that it sets up a pipe and uses set_wakeup_fd to trigger
> signal handling via that pipe.

If I'm right, our ProcessLauncher could process signals by using set_wakeup_fd with our existing pipe, `self.writepipe` [1].

> Then the wait loop can block forever on select() because the fdset
> includes the pipe, so signals will properly interrupt the loop. At that
> point you don't have to loop checking os.waitpid() for child exit;
> instead you'll catch SIGCHLD when the child terminates, and you can just
> call os.waitpid() at that time to do the final cleanup.

We could adapt this code [2] to move from `os.waitpid` to `select.select` as cotyledon does; moving to it could also let us drop a few conditions and actions [3].

The major difference between cotyledon and oslo.service on this topic is that cotyledon splits the service management [4] from the signal handling [5], while oslo.service provides all of these features in a single class (`ProcessLauncher`). So it could be cleaner to do the same thing on the oslo side, by creating a SignalHandler class that `ProcessLauncher` inherits from.

The main goal would be to let `ProcessLauncher` manage worker adjustments as cotyledon does [7], by using a supervisor thread for the children [8] to handle changes in the number of workers needed. Indeed, oslo.service currently mixes worker adjustment and the wait loop, so splitting these parts could be valuable to isolate the mechanisms and keep things cleaner.

Thoughts?

[1] https://github.com/openstack/oslo.service/blob/046507db14ad48c6f085faa48f6e7d27eb6d45c1/oslo_service/service.py#L429
[2] https://github.com/openstack/oslo.service/blob/046507db14ad48c6f085faa48f6e7d27eb6d45c1/oslo_service/service.py#L611,L617
[3] https://github.com/openstack/oslo.service/blob/046507db14ad48c6f085faa48f6e7d27eb6d45c1/oslo_service/service.py#L648,L693
[4] https://cotyledon.readthedocs.io/en/latest/api.html#cotyledon.ServiceManager
[5] https://github.com/sileht/cotyledon/blob/master/cotyledon/_utils.py#L114-L203
[6] https://github.com/openstack/oslo.service/blob/046507db14ad48c6f085faa48f6e7d27eb6d45c1/oslo_service/service.py#L409
[7] https://github.com/sileht/cotyledon/blob/bcc34b6883f921e10689ad0cb4d734e4ac4cbfaf/cotyledon/_service_manager.py#L323
[8] https://github.com/sileht/cotyledon/blob/bcc34b6883f921e10689ad0cb4d734e4ac4cbfaf/cotyledon/_service_manager.py#L228

Side note: I never received notifications about this BZ, neither by email nor via the "My bugs" links in Bugzilla... so sorry for my late message on this topic.
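To make the proposal concrete, here is a minimal, self-contained sketch of the self-pipe pattern described above. This is illustrative only: the `SignalWaiter`, `wait_forever`, and `_reap_children` names are hypothetical, modeled on cotyledon's `_wait_forever`, and this is not actual oslo.service or cotyledon code.

```python
import os
import select
import signal


class SignalWaiter:
    """Sketch of a wait loop that blocks in select() instead of polling."""

    def __init__(self):
        # Self-pipe: CPython's C-level signal handler writes each delivered
        # signal's number into writepipe, which makes readpipe readable and
        # wakes up the select() below.
        self.readpipe, self.writepipe = os.pipe()
        os.set_blocking(self.writepipe, False)  # set_wakeup_fd needs this
        signal.set_wakeup_fd(self.writepipe)
        # A Python-level handler must be installed so the C handler runs
        # (and writes the wakeup byte) instead of the default disposition;
        # the handler itself can be a no-op because the pipe does the work.
        for signo in (signal.SIGCHLD, signal.SIGTERM, signal.SIGINT):
            signal.signal(signo, lambda signum, frame: None)

    def wait_forever(self):
        """Block with no timeout and return the first non-SIGCHLD signal."""
        while True:
            # No polling: the process sleeps here until a signal arrives.
            select.select([self.readpipe], [], [])
            # Each byte in the pipe is the number of one delivered signal.
            for signo in os.read(self.readpipe, 4096):
                if signo == signal.SIGCHLD:
                    self._reap_children()
                else:
                    return signo  # e.g. SIGTERM: let the caller shut down

    def _reap_children(self):
        # Runs only when SIGCHLD actually fired, instead of calling
        # waitpid() hundreds of times per second in a polling loop.
        while True:
            try:
                pid, status = os.waitpid(-1, os.WNOHANG)
            except ChildProcessError:
                return  # no children exist at all
            if pid == 0:
                return  # children exist but none have exited yet
            print("reaped child %d, status %d" % (pid, status))


if __name__ == "__main__":
    waiter = SignalWaiter()
    if os.fork() == 0:
        os._exit(0)  # child exits immediately; parent reaps it via SIGCHLD
    print("got signal %d" % waiter.wait_forever())  # Ctrl-C to exit
```

With this pattern the parent makes essentially zero syscalls while idle, versus the roughly 600 select() and 490 wait4() calls per second shown in the perf output above. Note that even if SIGCHLD is delivered before select() is entered, the wakeup byte is already sitting in the pipe, so the loop cannot miss it; that race-freedom is the point of the self-pipe trick.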
---

(In reply to Hervé Beraud from comment #3)
> ...
> while oslo.service provides all of these features in a single class
> (`ProcessLauncher`).
> ...

oslo.service provides all of these features in a single class (`ProcessLauncher`) [6]

[6] https://github.com/openstack/oslo.service/blob/046507db14ad48c6f085faa48f6e7d27eb6d45c1/oslo_service/service.py#L409

---

Closing EOL, OSP 15 has been retired as of Sept 19.