Bug 1656691
| Summary: | "compose-cli compose cancel <uuid>" results in a hung compose-api server | ||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Product: | Red Hat Enterprise Linux 8 | Reporter: | Ranjith Rajaram <rrajaram> | ||||||||||
| Component: | lorax | Assignee: | Brian Lane <bcl> | ||||||||||
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Release Test Team <release-test-team-automation> | ||||||||||
| Severity: | medium | Docs Contact: | Vladimír Slávik <vslavik> | ||||||||||
| Priority: | low | ||||||||||||
| Version: | 8.1 | CC: | atodorov, jwboyer, rrajaram, sbueno, vslavik | ||||||||||
| Target Milestone: | rc | Flags: | rule-engine:
mirror+
|
||||||||||
| Target Release: | 8.0 | ||||||||||||
| Hardware: | x86_64 | ||||||||||||
| OS: | Linux | ||||||||||||
| Whiteboard: | |||||||||||||
| Fixed In Version: | lorax-28.14.19-1 | Doc Type: | If docs needed, set a value | ||||||||||
| Doc Text: | Story Points: | --- | |||||||||||
| Clone Of: | |||||||||||||
| : | 1659129 (view as bug list) | Environment: | |||||||||||
| Last Closed: | 2019-06-14 02:03:31 UTC | Type: | Bug | ||||||||||
| Regression: | --- | Mount Type: | --- | ||||||||||
| Documentation: | --- | CRM: | |||||||||||
| Verified Versions: | Category: | --- | |||||||||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||||||||
| Embargoed: | |||||||||||||
| Bug Depends On: | |||||||||||||
| Bug Blocks: | 1659129, 1659131 | ||||||||||||
| Attachments: |
|
||||||||||||
|
Description
Ranjith Rajaram
2018-12-06 05:42:03 UTC
Did you try to cancel it multiple times, or just once? It seems like it removed the results directory while in the middle of trying to cancel. Also, please attach the logs from /var/log/lorax-composer/*log as individual text/plain attachments. Created attachment 1512363 [details]
composer.tar before the reboot
Created attachment 1512364 [details]
After the reboot, lorax is in defunct state
"compose-cli compose cancel <taskid>" was only performed once on the job. compose-cli command does not exit until you see those message mentioned in the additional section of comment #0 Also to mention this is the same server where I tried to reproduce the problem described in BZ #1656684. Technically the cancel was performed on another job. Attempt to cancel was only performed once on each job. After a reboot, lorax-composer is not able to recover. Build process are going to defunct state. We have a defunct process in the ps list root 1543 4.4 1.2 396308 100532 ? Ss 23:17 0:02 /usr/libexec/platform-python /usr/sbin/lorax-composer /var/lib/lorax/composer/blueprints/ root 2138 0.0 0.0 0 0 ? Z 23:17 0:00 [lorax-composer] <defunct> Attaching two tar archives 1. Attachment composer.tar This was before the reboot 2. Attachment composer1.tar Cleaned up the logs,rebooted and then the tar archive of /var/log/lorax-composer was collected. You should have minimal logs. I will try to reproduce the problem again. The problem is reproducible
After cleaning up /var/lib/lorax , tried to reproduce the problem again.
Attachment composer2.tar
In the logs, you would see two builds. 1fd87627-937a-4cfe-b36a-adb94c49e99d or qcow2 build was successful. Attempt to cancel the task was for UUID bb768709-a87b-468b-ade3-6a4a20abba691fd87627-
lorax-composer]# composer-cli compose cancel bb768709-a87b-468b-ade3-6a4a20abba69
^^ cancel command was hung and takes a while to exit.
after a while it exits with the following message
2018-12-07 00:10:28,953: UnixHTTPConnectionPool(host='localhost', port=None): Max retries exceeded with url: /api/v0/compose/cancel/bb768709-a87b-468b-ade3-6a4a20abba69 (Caused by ReadTimeoutError("UnixHTTPConnectionPool(host='localhost', port=None): Read timed out. (read timeout=60)",))
I tried to check compose status and the command was hung
[root@localhost lorax-composer]# composer-cli compose status
^^ it was hung for a while and in the meantime checking the ps output from another terminal, mksquashfs was running. So the build/compose job was still going on.
root 8476 97.5 6.4 810640 512720 ? Sl 00:10 4:27 mksquashfs /var/tmp/lmc-work-g8mzp6o0/runtime /var/tmp/lmc-work-g8mzp6o0/images/install.img -comp xz -Xbcj x86
After a while
# composer-cli compose status
2018-12-07 00:16:04,902: UnixHTTPConnectionPool(host='localhost', port=None): Max retries exceeded with url: /api/v0/compose/queue (Caused by ReadTimeoutError("UnixHTTPConnectionPool(host='localhost', port=None): Read timed out. (read timeout=60)",))
We noticed the following message in the log
Dec 07 00:16:29 localhost.localdomain lorax-composer[4183]: 2018-12-07 00:16:29,805: Failed to cancel the build of bb768709-a87b-468b-ade3-6a4a20abba69
Soon after it, compose-cli compose status was quick and the following output was shown
[root@localhost lorax-composer]# composer-cli compose status
1fd87627-937a-4cfe-b36a-adb94c49e99d FINISHED Thu Dec 6 23:51:05 2018 minimalnew 0.0.2 qcow2 1438711808
but mksquashfs was still running on the node as per the ps output and finally(only a snip)
Dec 07 00:19:38 localhost.localdomain lorax-composer[4183]: During handling of the above exception, another exception occurred:
Dec 07 00:19:38 localhost.localdomain lorax-composer[4183]: Traceback (most recent call last):
Dec 07 00:19:38 localhost.localdomain lorax-composer[4183]: File "/usr/lib64/python3.6/multiprocessing/process.py", line 258, in _bootstrap
Dec 07 00:19:38 localhost.localdomain lorax-composer[4183]: self.run()
Dec 07 00:19:38 localhost.localdomain lorax-composer[4183]: File "/usr/lib64/python3.6/multiprocessing/process.py", line 93, in run
Dec 07 00:19:38 localhost.localdomain lorax-composer[4183]: self._target(*self._args, **self._kwargs)
Dec 07 00:19:38 localhost.localdomain lorax-composer[4183]: File "/usr/lib/python3.6/site-packages/pylorax/api/queue.py", line 118, in monitor
Dec 07 00:19:38 localhost.localdomain lorax-composer[4183]: open(joinpaths(dst, "STATUS"), "w").write("FAILED\n")
Dec 07 00:19:38 localhost.localdomain lorax-composer[4183]: FileNotFoundError: [Errno 2] No such file or directory: '/var/lib/lorax/composer/queue/run/bb768709-a87b-468b-ade3-6a4a20abba69/STATUS'
We have a defunct process
root 4183 0.1 1.3 401404 105696 ? Ss Dec06 0:02 /usr/libexec/platform-python /usr/sbin/lorax-composer /var/lib/lorax/composer/blueprints/
root 4186 0.7 0.0 0 0 ? Z Dec06 0:16 [lorax-composer] <defunct>
Created attachment 1512379 [details]
Fresh attempt to reproduce the problem
Attaching the journal logs. Journal logs has more information as per my quick reviw. So attaching it Created attachment 1512380 [details]
journal logs
Thanks. Recovering from a crash/reboot is a known issue and #1647985 should fix that. I'll try to make sense of the logs, but what I was trying to figure out is if you issued 2 cancels to the same build. Either from 2 cli sessions or from cli and gui sessions. It looks like something removed the results directory before the build was done, and it's possible that 2 different requests may do this since I don't have a check for a previously waiting cancel (working on adding that now). Also, with the current code, cancel really only works before the build starts. I'm working on trying to cancel anaconda while it is running, but am not sure if that can be done without causing other problems. So the cancel will hang until the build is finished and then it will delete the results. *** Bug 1656684 has been marked as a duplicate of this bug. *** So I started a new compose (ami), cancelled it, then started another one(live-iso) afterwards. The cancelled compose was removed, no defunc lorax processes, the 2nd compose was able to start immediately and finish successfully. Console log below. # composer-cli compose status f661d930-b433-417a-a76f-d3c475004390 FINISHED Wed Feb 6 05:36:12 2019 example-http-server 0.0.1 ext4-filesystem 3602907136 # composer-cli compose start example-http-server ami Compose 4b03c1ad-3ff7-499f-93fe-9a3e3abc1988 added to the queue # composer-cli compose status 4b03c1ad-3ff7-499f-93fe-9a3e3abc1988 4b03c1ad-3ff7-499f-93fe-9a3e3abc1988 RUNNING Wed Feb 6 05:37:42 2019 example-http-server 0.0.1 ami f661d930-b433-417a-a76f-d3c475004390 FINISHED Wed Feb 6 05:36:12 2019 example-http-server 0.0.1 ext4-filesystem 3602907136 # composer-cli compose cancel 4b03c1ad-3ff7-499f-93fe-9a3e3abc1988 2019-02-06 05:40:55,770: 4b03c1ad-3ff7-499f-93fe-9a3e3abc1988 is not a valid build uuid # composer-cli compose status f661d930-b433-417a-a76f-d3c475004390 FINISHED Wed Feb 6 05:36:12 2019 example-http-server 0.0.1 ext4-filesystem 3602907136 # ps aux | grep lorax root 25887 0.9 0.0 630584 114172 pts/0 S+ 05:33 0:05 /usr/libexec/platform-python /usr/sbin/lorax-composer --no-system-repos /var/lib/lorax/composer/blueprints root 25889 1.8 0.0 435748 45124 pts/0 S+ 05:33 0:11 /usr/libexec/platform-python /usr/sbin/lorax-composer --no-system-repos /var/lib/lorax/composer/blueprints root 38767 0.0 0.0 9184 964 pts/1 S+ 05:43 0:00 grep --color=auto lorax # composer-cli compose start example-http-server live-iso Compose c4246d16-f443-403a-a4ff-4328de58cb5c added to the queue # composer-cli compose status c4246d16-f443-403a-a4ff-4328de58cb5c RUNNING Wed Feb 6 05:44:18 2019 example-http-server 0.0.1 live-iso f661d930-b433-417a-a76f-d3c475004390 FINISHED Wed Feb 6 05:36:12 2019 example-http-server 0.0.1 ext4-filesystem 3602907136 # composer-cli compose status f661d930-b433-417a-a76f-d3c475004390 FINISHED Wed Feb 6 05:36:12 2019 example-http-server 0.0.1 ext4-filesystem 3602907136 c4246d16-f443-403a-a4ff-4328de58cb5c FINISHED Wed Feb 6 05:49:24 2019 example-http-server 0.0.1 live-iso 494927872 |