Bug 1767097 - mock freezing system on FC31
Summary: mock freezing system on FC31
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Fedora
Classification: Fedora
Component: mock
Version: 31
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ---
Assignee: Miroslav Suchý
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On: 1754807
Blocks:
 
Reported: 2019-10-30 16:50 UTC by Sammy
Modified: 2020-01-14 19:01 UTC (History)
CC: 12 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-01-14 19:01:17 UTC
Type: Bug
Embargoed:


Attachments (Terms of Use)
Kernel stack trace (1.02 MB, image/jpeg)
2019-10-31 16:33 UTC, Markus Linnala

Description Sammy 2019-10-30 16:50:07 UTC
After upgrading from Fedora 30 to 31, the system comes to a complete freeze (cannot ssh into the system from outside) when using mock to build RPMs. This has consistently happened 4-5 times. Otherwise the system is stable, even under heavy loads. I had been using mock frequently on FC30 without any problems.

Comment 1 Markus Linnala 2019-10-30 21:05:59 UTC
I have mitigated the problem by adding this line:

config_opts['use_nspawn'] = False

into the mock config file; mine is at:

~/.config/mock.cfg



I have successfully built packages using this mitigation.
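For anyone applying the same mitigation from a shell, a minimal sketch (assuming the per-user config path above; the file is created if it does not exist):

```shell
# Append the nspawn opt-out to the per-user mock config.
mkdir -p ~/.config
echo "config_opts['use_nspawn'] = False" >> ~/.config/mock.cfg
```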



I think this problem is linked to systemd-nspawn. My machine reboots after a short lockup period when I run mock with nspawn enabled; 2/2 tries now.

Last lines, using 'journalctl -b-1':

Oct 30 22:28:01 workstation.lan userhelper[3853]: running '/usr/libexec/mock/mock -n -N -r fedora-31-x86_64 --buildsrpm --spec=prog.spec --sources=SOURCES --resultdir=/home/user/prog/SRPM' with root privileges on behalf of 'user'
Oct 30 22:28:02 workstation.lan sssd[nss][1284]: Enumeration requested but not enabled


And another set: 'journalctl -b-2'

Oct 30 22:09:09 workstation.lan systemd[1]: Started Container 55d13553a859454eaf3bb1d6cb24b992.
Oct 30 22:09:10 workstation.lan systemd[1]: machine-55d13553a859454eaf3bb1d6cb24b992.scope: Succeeded.
Oct 30 22:09:10 workstation.lan systemd-machined[49325]: Machine 55d13553a859454eaf3bb1d6cb24b992 terminated.


On Fedora 30 and earlier I did not have this kind of problem.

Comment 2 Markus Linnala 2019-10-30 21:18:37 UTC
systemd-container-243-4.gitef67743.fc31.x86_64
mock-1.4.20-1.fc31.noarch
Linux workstation.lan 5.3.7-301.fc31.x86_64 #1 SMP Mon Oct 21 19:18:58 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

Comment 3 Sammy 2019-10-30 22:37:24 UTC
I have exactly the same configuration as above!

Comment 4 Miroslav Suchý 2019-10-31 09:03:09 UTC

*** This bug has been marked as a duplicate of bug 1756972 ***

Comment 5 Markus Linnala 2019-10-31 13:33:10 UTC
As far as I can see, this is *not* a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1756972

There are zero SELinux messages, only a hard lock and an eventual reboot.

Last logs when using mock patched with:

https://github.com/rpm-software-management/mock/pull/371

Again a hard lockup and then a reboot. This did not happen when I tried to reproduce it in a virtual machine, only on hardware. At one point the machine is doing something, and the next it does nothing visible or audible.


Oct 31 15:06:48 workstation.lan userhelper[60546]: running '/usr/libexec/mock/mock -n -N -r fedora-31-x86_64 --buildsrpm --spec=prog.spec --sources=SOURCES>
Oct 31 15:06:50 workstation.lan systemd-machined[57994]: New machine 12518987042248d7bbc262e8c3dd100d.
Oct 31 15:06:50 workstation.lan systemd[1]: Started Container 12518987042248d7bbc262e8c3dd100d.
Oct 31 15:06:50 workstation.lan systemd[1]: machine-12518987042248d7bbc262e8c3dd100d.scope: Succeeded.
Oct 31 15:06:50 workstation.lan systemd-machined[57994]: Machine 12518987042248d7bbc262e8c3dd100d terminated.
Oct 31 15:06:50 workstation.lan systemd-machined[57994]: New machine b844c70711aa4a8c8f09559da952357c.
Oct 31 15:06:50 workstation.lan systemd[1]: Started Container b844c70711aa4a8c8f09559da952357c.
Oct 31 15:06:50 workstation.lan systemd[1]: machine-b844c70711aa4a8c8f09559da952357c.scope: Succeeded.
Oct 31 15:06:50 workstation.lan systemd-machined[57994]: Machine b844c70711aa4a8c8f09559da952357c terminated.
Oct 31 15:06:50 workstation.lan systemd-machined[57994]: New machine 93cb155e31204e1ea0096ee55ee1eb8a.
Oct 31 15:06:50 workstation.lan systemd[1]: Started Container 93cb155e31204e1ea0096ee55ee1eb8a.
Oct 31 15:06:51 workstation.lan systemd[1]: machine-93cb155e31204e1ea0096ee55ee1eb8a.scope: Succeeded.
Oct 31 15:06:51 workstation.lan systemd-machined[57994]: Machine 93cb155e31204e1ea0096ee55ee1eb8a terminated.
Oct 31 15:06:51 workstation.lan systemd-machined[57994]: New machine 5307c506c84c4017a66dda33e62c0196.
Oct 31 15:06:51 workstation.lan systemd[1]: Started Container 5307c506c84c4017a66dda33e62c0196.
Oct 31 15:06:51 workstation.lan systemd[1]: machine-5307c506c84c4017a66dda33e62c0196.scope: Succeeded.
Oct 31 15:06:51 workstation.lan systemd-machined[57994]: Machine 5307c506c84c4017a66dda33e62c0196 terminated.


% cat .config/mock.cfg 
config_opts['plugin_conf']['ccache_enable'] = True
config_opts['plugin_conf']['ccache_opts']['compress'] = 'on'

config_opts['plugin_conf']['compress_logs_enable'] = True
config_opts['plugin_conf']['compress_logs_opts'] = {}
config_opts['plugin_conf']['compress_logs_opts']['command'] = '/usr/bin/gzip -9'

config_opts['plugin_conf']['procenv_enable'] = True
config_opts['plugin_conf']['procenv_opts'] = {}

config_opts['plugin_conf']['sign_enable'] = True
config_opts['plugin_conf']['sign_opts'] = {}
config_opts['plugin_conf']['sign_opts']['cmd'] = 'rpmsign'
config_opts['plugin_conf']['sign_opts']['opts'] = '--addsign %(rpms)s'

config_opts['nosync'] = True
config_opts['nosync_force'] = True
config_opts['update_before_build'] = False

config_opts['use_bootstrap_container'] = True
config_opts['no_root_shells'] = True



I have the watchdog enabled, and I assume it is what eventually reboots the machine.

% cat /etc/watchdog.conf 
allocatable-memory = 1
interval = 10
max-load-1 = 0
max-load-15 = 100
max-load-5 = 0
max-temperature = 88
min-memory = 1
priority = 1
realtime = yes
temperature-sensor = /sys/devices/platform/coretemp.0/hwmon/hwmon1/temp2_input
temperature-sensor = /sys/devices/platform/coretemp.0/hwmon/hwmon1/temp4_input
watchdog-device = /dev/watchdog
watchdog-timeout = 60

Comment 6 Markus Linnala 2019-10-31 14:42:26 UTC
I've now run mock several times without problems using --unpriv, but without:

config_opts['use_nspawn'] = False

I do not know yet if it really mitigates the problem, but it seems to help. At least there is a major difference in how nspawn is used, around mockbuild/backend.py:382.

I wonder why branches have different order of common parameters.

Comment 7 Markus Linnala 2019-10-31 16:33:03 UTC
Created attachment 1631158 [details]
Kernel stack trace

Okay, I was wrong in comment 6.

I was able to get a kernel stack trace. Tainted because of Nvidia.

This is also a kernel bug, not just a problem in mock/systemd-nspawn.

I guess this issue happens with mock because it runs systemd-nspawn very actively, which in turn exercises cgroups, and there is an issue in the kernel's cgroup handling.


Start in another window: sudo dmesg -w

$ uname -r
5.3.7-301.fc31.x86_64


general protection fault: 0000 [#1] SMP PTI
CPU: 3 PID: 3622 Comm: kworker/3:1 Tainted: P
Hardware name: System manufacturer System Product Name/P7H55, BIOS 0901    11/12/2010
Workqueue: cgroup_destroy css_killed_work_fn
RIP: 0010:bfqg_and_blkg_put+0xe/0x60
...
Call Trace:
 bfq_bfqq_move+0x5a/0x160
 bfq_pd_offline+0xd3/0xf0
 blkg_destroy+0x72/0x200


(gdb) l *bfqg_and_blkg_put+0xe/0x60
0xffffffff814cc5e0 is in bfqg_and_blkg_put (block/bfq-cgroup.c:344).
339	
340		blkg_get(bfqg_to_blkg(bfqg));
341	}
342	
343	void bfqg_and_blkg_put(struct bfq_group *bfqg)
344	{
345		blkg_put(bfqg_to_blkg(bfqg));
346	
347		bfqg_put(bfqg);
348	}


(gdb) l *bfq_bfqq_move+0x5a/0x160
0xffffffff814cc790 is in bfq_bfqq_move (block/bfq-cgroup.c:623).
618	 * bfq_bic_update_cgroup on guaranteeing the consistency of blkg
619	 * objects).
620	 */
621	void bfq_bfqq_move(struct bfq_data *bfqd, struct bfq_queue *bfqq,
622			   struct bfq_group *bfqg)
623	{
624		struct bfq_entity *entity = &bfqq->entity;
625	
626		/* If bfqq is empty, then bfq_bfqq_expire also invokes
627		 * bfq_del_bfqq_busy, thereby removing bfqq and its entity

(gdb) l *bfq_pd_offline+0xd3/0xf0
0xffffffff814cc8f0 is in bfq_pd_offline (block/bfq-cgroup.c:838).
833	 *
834	 * blkio already grabs the queue_lock for us, so no need to use
835	 * RCU-based magic
836	 */
837	static void bfq_pd_offline(struct blkg_policy_data *pd)
838	{
839		struct bfq_service_tree *st;
840		struct bfq_group *bfqg = pd_to_bfqg(pd);
841		struct bfq_data *bfqd = bfqg->bfqd;
842		struct bfq_entity *entity = bfqg->my_entity;

(gdb) l *blkg_destroy+0x72/0x200
0xffffffff814b8f60 is in blkg_destroy (block/blk-cgroup.c:394).
389	
390		return blkg;
391	}
392	
393	static void blkg_destroy(struct blkcg_gq *blkg)
394	{
395		struct blkcg *blkcg = blkg->blkcg;
396		struct blkcg_gq *parent = blkg->parent;
397		int i;
398

Comment 8 Sammy 2019-10-31 16:59:06 UTC
Naturally, any time the kernel crashes during a system freeze it can be traced to the kernel, and it should not happen. I wonder why it was working on FC30 with similar kernels.

Comment 9 Dan Čermák 2019-11-11 12:35:46 UTC
I believe this should be reopened, as I was hit by this multiple times yesterday evening with mock-1.4.21-1.fc31.noarch (which should include the fix for https://bugzilla.redhat.com/show_bug.cgi?id=1756972). Similarly, disabling nspawn and using chroot "fixes" the issue for me too.

Comment 10 Pavel Raiskup 2019-11-11 13:46:15 UTC
I believe this is a "duplicate" of bug 1754807.

Comment 11 Laura Abbott 2019-11-11 15:35:28 UTC
There's at least one other bfq issue being debugged at the moment, https://bugzilla.redhat.com/show_bug.cgi?id=1767539, so it would be worth trying the patch there.

Comment 12 Zbigniew Jędrzejewski-Szmek 2019-11-13 07:34:54 UTC
> I wonder why it was working in fc30 with similar kernels.

If it is indeed a bfq issue, then the answer would be that we switched from the deadline scheduler to bfq in F31.

Comment 13 Vitaly 2019-11-16 10:50:34 UTC
The systemd.unified_cgroup_hierarchy=0 kernel command-line parameter can be used as a temporary workaround.
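A sketch of applying that parameter on Fedora, assuming grubby is available (this edits the bootloader entries, so double-check before rebooting):

```shell
# Add the legacy cgroup hierarchy parameter to all installed kernel entries.
sudo grubby --update-kernel=ALL --args="systemd.unified_cgroup_hierarchy=0"
# Reboot, then verify the parameter took effect:
cat /proc/cmdline
```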

