RHEL Engineering is moving the tracking of its product development work on RHEL 6 through RHEL 9 to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "RHEL project" in Red Hat Jira and file new tickets here. Individual Bugzilla bugs in the statuses "NEW", "ASSIGNED", and "POST" are being migrated throughout September 2023. Bugs of Red Hat partners with an assigned Engineering Partner Manager (EPM) are migrated in late September as per pre-agreed dates. Bugs against components "kernel", "kernel-rt", and "kpatch" are only migrated if still in "NEW" or "ASSIGNED". If you cannot log in to RH Jira, please consult article #7032570. That failing, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry. The email creates a ServiceNow ticket with Red Hat. Individual Bugzilla bugs that are migrated will be moved to status "CLOSED", resolution "MIGRATED", and set with "MigratedToJIRA" in "Keywords". The link to the successor Jira issue will be found under "Links", have a little "two-footprint" icon next to it, and direct you to the "RHEL project" in Red Hat Jira (issue links are of type "https://issues.redhat.com/browse/RHEL-XXXX", where "X" is a digit). This same link will be available in a blue banner at the top of the page informing you that that bug has been migrated.
Bug 1656691 - "compose-cli compose cancel <uuid>" results in a hung compose-api server
Summary: "compose-cli compose cancel <uuid>" results in a hung compose-api server
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: lorax
Version: 8.1
Hardware: x86_64
OS: Linux
low
medium
Target Milestone: rc
: 8.0
Assignee: Brian Lane
QA Contact: Release Test Team
Vladimír Slávik
URL:
Whiteboard:
: 1656684 (view as bug list)
Depends On:
Blocks: 1659129 1659131
TreeView+ depends on / blocked
 
Reported: 2018-12-06 05:42 UTC by Ranjith Rajaram
Modified: 2019-06-14 02:03 UTC (History)
5 users (show)

Fixed In Version: lorax-28.14.19-1
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1659129 (view as bug list)
Environment:
Last Closed: 2019-06-14 02:03:31 UTC
Type: Bug
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
composer.tar before the reboot (9.45 MB, application/x-tar)
2018-12-07 04:26 UTC, Ranjith Rajaram
no flags Details
After the reboot, lorax is in defunct state (30.00 KB, application/x-tar)
2018-12-07 04:27 UTC, Ranjith Rajaram
no flags Details
Fresh attempt to reproduce the problem (180.00 KB, application/x-tar)
2018-12-07 05:23 UTC, Ranjith Rajaram
no flags Details
journal logs (208.73 KB, application/x-gzip)
2018-12-07 05:28 UTC, Ranjith Rajaram
no flags Details

Description Ranjith Rajaram 2018-12-06 05:42:03 UTC
Description of problem:

unlike BZ #1656684, canceling a job initiated from cockpit-composer gui using "composer-cli compose cancel" results in a hung compose api server.  Newly queued job builds do not start. It remains in waiting state.

# composer-cli compose status
a11cf33d-ddd0-4d2c-a794-f6fc74c4fd72 RUNNING  Wed Dec  5 23:59:12 2018 rhel8testcustomusername 0.0.4 live-iso   

Cancel task results in a timeout

# composer-cli compose cancel a11cf33d-ddd0-4d2c-a794-f6fc74c4fd72
2018-12-06 00:03:39,479: UnixHTTPConnectionPool(host='localhost', port=None): Max retries exceeded with url: /api/v0/compose/cancel/a11cf33d-ddd0-4d2c-a794-f6fc74c4fd72 (Caused by ReadTimeoutError("UnixHTTPConnectionPool(host='localhost', port=None): Read timed out. (read timeout=60)",))

We also see python tracebacks. They are pasted in the additional section

New jobs remain in waiting state

[root@localhost ~]# composer-cli compose status
334290eb-842b-402d-9e1c-6c76dcc0f8fe WAITING  Thu Dec  6 00:27:27 2018 podman          0.0.3 qcow2       

Job that was canceled continues to remain with status as "Pending state" in the cockpit gui

Reviewing the ps output, one of the lorax-composer process was seen in defunct state

root      1515  0.0  1.4 415060 118840 ?       Ss   Dec05   0:09 /usr/libexec/platform-python /usr/sbin/lorax-composer /var/lib/lorax/composer/blueprints/
root      2135  0.0  0.0      0     0 ?        Z    Dec05   0:25 [lorax-composer] <defunct>


Version-Release number of selected component (if applicable):
lorax-templates-rhel-8.0-19.el8.noarch
lorax-28.14.16-1.el8.x86_64
lorax-composer-28.14.16-1.el8.x86_64
lorax-lmc-novirt-28.14.16-1.el8.x86_64
lorax-templates-generic-28.14.16-1.el8.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Start a image build process from cockpit-gui
2. Cancel the job using composer-cli compose cancel <uuid>
3.

Actual results:

Image build process does not stop but results in an un-expected behavior. Newly queued image builds remain in waiting state forever

Expected results:

Image build process should stop

Additional info:

Dec 06 00:09:40 localhost.localdomain lorax-composer[1515]: 2018-12-06 00:09:40,266: Failed to cancel the build of a11cf33d-ddd0-4d2c-a794-f6fc74c4fd72
Dec 06 00:11:47 localhost.localdomain lorax-composer[1515]: chown: cannot access '/var/lib/lorax/composer/results/a11cf33d-ddd0-4d2c-a794-f6fc74c4fd72': No such file or directory
Dec 06 00:11:47 localhost.localdomain lorax-composer[1515]: 2018-12-06 00:11:47,810: traceback: Traceback (most recent call last):
Dec 06 00:11:47 localhost.localdomain lorax-composer[1515]:   File "/usr/lib/python3.6/site-packages/pylorax/api/queue.py", line 108, in monitor
Dec 06 00:11:47 localhost.localdomain lorax-composer[1515]:     make_compose(cfg, os.path.realpath(dst))
Dec 06 00:11:47 localhost.localdomain lorax-composer[1515]:   File "/usr/lib/python3.6/site-packages/pylorax/api/queue.py", line 213, in make_compose
Dec 06 00:11:47 localhost.localdomain lorax-composer[1515]:     run_creator(install_cfg, callback_func=cancel_build)
Dec 06 00:11:47 localhost.localdomain lorax-composer[1515]:   File "/usr/lib/python3.6/site-packages/pylorax/creator.py", line 662, in run_creator
Dec 06 00:11:47 localhost.localdomain lorax-composer[1515]:     with Mount(disk_img, opts="loop") as mount_dir:
Dec 06 00:11:47 localhost.localdomain lorax-composer[1515]:   File "/usr/lib/python3.6/site-packages/pylorax/imgutils.py", line 349, in __enter__
Dec 06 00:11:47 localhost.localdomain lorax-composer[1515]:     self.mnt = mount(self.dev, self.opts, self.mnt)
Dec 06 00:11:47 localhost.localdomain lorax-composer[1515]:   File "/usr/lib/python3.6/site-packages/pylorax/imgutils.py", line 232, in mount
Dec 06 00:11:47 localhost.localdomain lorax-composer[1515]:     runcmd(cmd)
Dec 06 00:11:47 localhost.localdomain lorax-composer[1515]:   File "/usr/lib/python3.6/site-packages/pylorax/executils.py", line 341, in runcmd
Dec 06 00:11:47 localhost.localdomain lorax-composer[1515]:     return execWithRedirect(cmd[0], cmd[1:], **kwargs)
Dec 06 00:11:47 localhost.localdomain lorax-composer[1515]:   File "/usr/lib/python3.6/site-packages/pylorax/executils.py", line 228, in execWithRedirect
Dec 06 00:11:47 localhost.localdomain lorax-composer[1515]:     env_add=env_add, reset_handlers=reset_handlers, reset_lang=reset_lang)[0]
Dec 06 00:11:47 localhost.localdomain lorax-composer[1515]:   File "/usr/lib/python3.6/site-packages/pylorax/executils.py", line 201, in _run_program
Dec 06 00:11:47 localhost.localdomain lorax-composer[1515]:     raise subprocess.CalledProcessError(proc.returncode, argv, output)
Dec 06 00:11:47 localhost.localdomain lorax-composer[1515]: subprocess.CalledProcessError: Command '['mount', '-o', 'loop', '/var/lib/lorax/composer/results/a11cf33d-ddd0-4d2c-a794-f6fc74c4fd72/compose/live.iso', '/var/tmp/lorax.imgutils.a9fv9w8k']' returned non-zero exit status 32.
Dec 06 00:11:47 localhost.localdomain lorax-composer[1515]: Process Process-1:
Dec 06 00:11:47 localhost.localdomain lorax-composer[1515]: Traceback (most recent call last):
Dec 06 00:11:47 localhost.localdomain lorax-composer[1515]:   File "/usr/lib/python3.6/site-packages/pylorax/api/queue.py", line 108, in monitor
Dec 06 00:11:47 localhost.localdomain lorax-composer[1515]:     make_compose(cfg, os.path.realpath(dst))
Dec 06 00:11:47 localhost.localdomain lorax-composer[1515]:   File "/usr/lib/python3.6/site-packages/pylorax/api/queue.py", line 213, in make_compose
Dec 06 00:11:47 localhost.localdomain lorax-composer[1515]:     run_creator(install_cfg, callback_func=cancel_build)
Dec 06 00:11:47 localhost.localdomain lorax-composer[1515]:   File "/usr/lib/python3.6/site-packages/pylorax/creator.py", line 662, in run_creator
Dec 06 00:11:47 localhost.localdomain lorax-composer[1515]:     with Mount(disk_img, opts="loop") as mount_dir:
Dec 06 00:11:47 localhost.localdomain lorax-composer[1515]:   File "/usr/lib/python3.6/site-packages/pylorax/imgutils.py", line 349, in __enter__
Dec 06 00:11:47 localhost.localdomain lorax-composer[1515]:     self.mnt = mount(self.dev, self.opts, self.mnt)
Dec 06 00:11:47 localhost.localdomain lorax-composer[1515]:   File "/usr/lib/python3.6/site-packages/pylorax/imgutils.py", line 232, in mount
Dec 06 00:11:47 localhost.localdomain lorax-composer[1515]:     runcmd(cmd)
Dec 06 00:11:47 localhost.localdomain lorax-composer[1515]:   File "/usr/lib/python3.6/site-packages/pylorax/executils.py", line 341, in runcmd
Dec 06 00:11:47 localhost.localdomain lorax-composer[1515]:     return execWithRedirect(cmd[0], cmd[1:], **kwargs)
Dec 06 00:11:47 localhost.localdomain lorax-composer[1515]:   File "/usr/lib/python3.6/site-packages/pylorax/executils.py", line 228, in execWithRedirect
Dec 06 00:11:47 localhost.localdomain lorax-composer[1515]:     env_add=env_add, reset_handlers=reset_handlers, reset_lang=reset_lang)[0]
Dec 06 00:11:47 localhost.localdomain lorax-composer[1515]:   File "/usr/lib/python3.6/site-packages/pylorax/executils.py", line 201, in _run_program
Dec 06 00:11:47 localhost.localdomain lorax-composer[1515]:     raise subprocess.CalledProcessError(proc.returncode, argv, output)
Dec 06 00:11:47 localhost.localdomain lorax-composer[1515]: subprocess.CalledProcessError: Command '['mount', '-o', 'loop', '/var/lib/lorax/composer/results/a11cf33d-ddd0-4d2c-a794-f6fc74c4fd72/compose/live.iso', '/var/tmp/lorax.imgutils.a9fv9w8k']' returned non-zero exit status 32.
Dec 06 00:11:47 localhost.localdomain lorax-composer[1515]: During handling of the above exception, another exception occurred:
Dec 06 00:11:47 localhost.localdomain lorax-composer[1515]: Traceback (most recent call last):
Dec 06 00:11:47 localhost.localdomain lorax-composer[1515]:   File "/usr/lib64/python3.6/multiprocessing/process.py", line 258, in _bootstrap
Dec 06 00:11:47 localhost.localdomain lorax-composer[1515]:     self.run()
Dec 06 00:11:47 localhost.localdomain lorax-composer[1515]:   File "/usr/lib64/python3.6/multiprocessing/process.py", line 93, in run
Dec 06 00:11:47 localhost.localdomain lorax-composer[1515]:     self._target(*self._args, **self._kwargs)
Dec 06 00:11:47 localhost.localdomain lorax-composer[1515]:   File "/usr/lib/python3.6/site-packages/pylorax/api/queue.py", line 118, in monitor
Dec 06 00:11:47 localhost.localdomain lorax-composer[1515]:     open(joinpaths(dst, "STATUS"), "w").write("FAILED\n")
Dec 06 00:11:47 localhost.localdomain lorax-composer[1515]: FileNotFoundError: [Errno 2] No such file or directory: '/var/lib/lorax/composer/queue/run/a11cf33d-ddd0-4d2c-a794-f6fc74c4fd72/STATUS'

Comment 1 Brian Lane 2018-12-06 21:37:05 UTC
Did you try to cancel it multiple times, or just once? It seems like it removed the results directory while in the middle of trying to cancel.

Also, please attach the logs from /var/log/lorax-composer/*log as individual text/plain attachments.

Comment 2 Ranjith Rajaram 2018-12-07 04:26:29 UTC
Created attachment 1512363 [details]
composer.tar before the reboot

Comment 3 Ranjith Rajaram 2018-12-07 04:27:23 UTC
Created attachment 1512364 [details]
After the reboot, lorax is in defunct state

Comment 4 Ranjith Rajaram 2018-12-07 04:32:14 UTC
"compose-cli compose cancel <taskid>" was only performed once on the job. compose-cli command does not exit until you see those message mentioned in the additional section of comment #0 

Also to mention this is the same server where I tried to reproduce the problem described in BZ #1656684. Technically the cancel was performed on another job. Attempt to cancel was only performed once on each job.

After a reboot, lorax-composer is not able to recover. Build process are going to defunct state.

We have a defunct process in the ps list

root      1543  4.4  1.2 396308 100532 ?       Ss   23:17   0:02 /usr/libexec/platform-python /usr/sbin/lorax-composer /var/lib/lorax/composer/blueprints/
root      2138  0.0  0.0      0     0 ?        Z    23:17   0:00 [lorax-composer] <defunct>

Attaching two tar archives

1. Attachment composer.tar

   This was before the reboot

2. Attachment composer1.tar

   Cleaned up the logs,rebooted and then the tar archive of /var/log/lorax-composer was collected. You should have minimal logs.

I will try to reproduce the problem again.

Comment 5 Ranjith Rajaram 2018-12-07 05:22:55 UTC
The problem is reproducible

After cleaning up /var/lib/lorax , tried to reproduce the problem again. 

Attachment composer2.tar

In the logs, you would see two builds. 1fd87627-937a-4cfe-b36a-adb94c49e99d or qcow2 build was successful. Attempt to cancel the task was for UUID bb768709-a87b-468b-ade3-6a4a20abba691fd87627-

lorax-composer]# composer-cli compose cancel bb768709-a87b-468b-ade3-6a4a20abba69

^^ cancel command was hung and takes a while to exit. 

after a while it exits with the following message

2018-12-07 00:10:28,953: UnixHTTPConnectionPool(host='localhost', port=None): Max retries exceeded with url: /api/v0/compose/cancel/bb768709-a87b-468b-ade3-6a4a20abba69 (Caused by ReadTimeoutError("UnixHTTPConnectionPool(host='localhost', port=None): Read timed out. (read timeout=60)",))

I tried to check compose status and the command was hung

[root@localhost lorax-composer]# composer-cli compose status

^^ it was hung for a while and in the meantime checking the ps output from another terminal, mksquashfs was running. So the build/compose job was still going on.

root      8476 97.5  6.4 810640 512720 ?       Sl   00:10   4:27 mksquashfs /var/tmp/lmc-work-g8mzp6o0/runtime /var/tmp/lmc-work-g8mzp6o0/images/install.img -comp xz -Xbcj x86

After a while

# composer-cli compose status
2018-12-07 00:16:04,902: UnixHTTPConnectionPool(host='localhost', port=None): Max retries exceeded with url: /api/v0/compose/queue (Caused by ReadTimeoutError("UnixHTTPConnectionPool(host='localhost', port=None): Read timed out. (read timeout=60)",))

We noticed the following message in the log

Dec 07 00:16:29 localhost.localdomain lorax-composer[4183]: 2018-12-07 00:16:29,805: Failed to cancel the build of bb768709-a87b-468b-ade3-6a4a20abba69

Soon after it, compose-cli compose status was quick and the following output was shown

[root@localhost lorax-composer]# composer-cli compose status
1fd87627-937a-4cfe-b36a-adb94c49e99d FINISHED Thu Dec  6 23:51:05 2018 minimalnew      0.0.2 qcow2            1438711808

but mksquashfs was still running on the node as per the ps output and finally(only a snip)

Dec 07 00:19:38 localhost.localdomain lorax-composer[4183]: During handling of the above exception, another exception occurred:
Dec 07 00:19:38 localhost.localdomain lorax-composer[4183]: Traceback (most recent call last):
Dec 07 00:19:38 localhost.localdomain lorax-composer[4183]:   File "/usr/lib64/python3.6/multiprocessing/process.py", line 258, in _bootstrap
Dec 07 00:19:38 localhost.localdomain lorax-composer[4183]:     self.run()
Dec 07 00:19:38 localhost.localdomain lorax-composer[4183]:   File "/usr/lib64/python3.6/multiprocessing/process.py", line 93, in run
Dec 07 00:19:38 localhost.localdomain lorax-composer[4183]:     self._target(*self._args, **self._kwargs)
Dec 07 00:19:38 localhost.localdomain lorax-composer[4183]:   File "/usr/lib/python3.6/site-packages/pylorax/api/queue.py", line 118, in monitor
Dec 07 00:19:38 localhost.localdomain lorax-composer[4183]:     open(joinpaths(dst, "STATUS"), "w").write("FAILED\n")
Dec 07 00:19:38 localhost.localdomain lorax-composer[4183]: FileNotFoundError: [Errno 2] No such file or directory: '/var/lib/lorax/composer/queue/run/bb768709-a87b-468b-ade3-6a4a20abba69/STATUS'

We have a defunct process

root      4183  0.1  1.3 401404 105696 ?       Ss   Dec06   0:02 /usr/libexec/platform-python /usr/sbin/lorax-composer /var/lib/lorax/composer/blueprints/
root      4186  0.7  0.0      0     0 ?        Z    Dec06   0:16 [lorax-composer] <defunct>

Comment 6 Ranjith Rajaram 2018-12-07 05:23:34 UTC
Created attachment 1512379 [details]
Fresh attempt to reproduce the problem

Comment 7 Ranjith Rajaram 2018-12-07 05:27:33 UTC
Attaching the journal logs. 

Journal logs has more information as per my quick reviw. So attaching it

Comment 8 Ranjith Rajaram 2018-12-07 05:28:09 UTC
Created attachment 1512380 [details]
journal logs

Comment 9 Brian Lane 2018-12-07 19:21:52 UTC
Thanks. Recovering from a crash/reboot is a known issue and #1647985 should fix that.

I'll try to make sense of the logs, but what I was trying to figure out is if you issued 2 cancels to the same build. Either from 2 cli sessions or from cli and gui sessions. It looks like something removed the results directory before the build was done, and it's possible that 2 different requests may do this since I don't have a check for a previously waiting cancel (working on adding that now).


Also, with the current code, cancel really only works before the build starts. I'm working on trying to cancel anaconda while it is running, but am not sure if that can be done without causing other problems. So the cancel will hang until the build is finished and then it will delete the results.

Comment 10 Brian Lane 2018-12-13 16:12:42 UTC
https://github.com/weldr/lorax/pull/547

Comment 11 Brian Lane 2018-12-14 17:49:49 UTC
*** Bug 1656684 has been marked as a duplicate of this bug. ***

Comment 16 Alexander Todorov 2019-02-06 10:49:53 UTC
So I started a new compose (ami), cancelled it, then started another one(live-iso) afterwards.  The cancelled compose was removed, no defunc lorax processes, the 2nd compose was able to start immediately and finish successfully. Console log below.


# composer-cli compose status
f661d930-b433-417a-a76f-d3c475004390 FINISHED Wed Feb  6 05:36:12 2019 example-http-server 0.0.1 ext4-filesystem  3602907136


# composer-cli compose start example-http-server ami
Compose 4b03c1ad-3ff7-499f-93fe-9a3e3abc1988 added to the queue


# composer-cli compose status 4b03c1ad-3ff7-499f-93fe-9a3e3abc1988
4b03c1ad-3ff7-499f-93fe-9a3e3abc1988 RUNNING  Wed Feb  6 05:37:42 2019 example-http-server 0.0.1 ami              
f661d930-b433-417a-a76f-d3c475004390 FINISHED Wed Feb  6 05:36:12 2019 example-http-server 0.0.1 ext4-filesystem  3602907136


# composer-cli compose cancel 4b03c1ad-3ff7-499f-93fe-9a3e3abc1988
2019-02-06 05:40:55,770: 4b03c1ad-3ff7-499f-93fe-9a3e3abc1988 is not a valid build uuid


# composer-cli compose status
f661d930-b433-417a-a76f-d3c475004390 FINISHED Wed Feb  6 05:36:12 2019 example-http-server 0.0.1 ext4-filesystem  3602907136


# ps aux | grep lorax
root     25887  0.9  0.0 630584 114172 pts/0   S+   05:33   0:05 /usr/libexec/platform-python /usr/sbin/lorax-composer --no-system-repos /var/lib/lorax/composer/blueprints
root     25889  1.8  0.0 435748 45124 pts/0    S+   05:33   0:11 /usr/libexec/platform-python /usr/sbin/lorax-composer --no-system-repos /var/lib/lorax/composer/blueprints
root     38767  0.0  0.0   9184   964 pts/1    S+   05:43   0:00 grep --color=auto lorax


# composer-cli compose start example-http-server live-iso
Compose c4246d16-f443-403a-a4ff-4328de58cb5c added to the queue


# composer-cli compose status
c4246d16-f443-403a-a4ff-4328de58cb5c RUNNING  Wed Feb  6 05:44:18 2019 example-http-server 0.0.1 live-iso         
f661d930-b433-417a-a76f-d3c475004390 FINISHED Wed Feb  6 05:36:12 2019 example-http-server 0.0.1 ext4-filesystem  3602907136


# composer-cli compose status
f661d930-b433-417a-a76f-d3c475004390 FINISHED Wed Feb  6 05:36:12 2019 example-http-server 0.0.1 ext4-filesystem  3602907136
c4246d16-f443-403a-a4ff-4328de58cb5c FINISHED Wed Feb  6 05:49:24 2019 example-http-server 0.0.1 live-iso         494927872


Note You need to log in before you can comment on or make changes to this bug.