After upgrading from Fedora 30 to Fedora 31, the system freezes completely (cannot ssh into the system from outside) when using mock to build RPMs. It has happened consistently, 4-5 times. Otherwise the system is stable, even under heavy load. I used mock frequently on fc30 without any problems.
I have mitigated the problem by adding this line:
config_opts['use_nspawn'] = False
to the mock config file, mine at ~/.config/mock.cfg. I have successfully built packages using this mitigation. I think this problem is linked to systemd-nspawn.

My machine reboots after a short lockup period when I run mock with nspawn enabled. 2/2 tries now.

Last lines from 'journalctl -b-1':
Oct 30 22:28:01 workstation.lan userhelper[3853]: running '/usr/libexec/mock/mock -n -N -r fedora-31-x86_64 --buildsrpm --spec=prog.spec --sources=SOURCES --resultdir=/home/user/prog/SRPM' with root privileges on behalf of 'user'
Oct 30 22:28:02 workstation.lan sssd[nss][1284]: Enumeration requested but not enabled

And another set, from 'journalctl -b-2':
Oct 30 22:09:09 workstation.lan systemd[1]: Started Container 55d13553a859454eaf3bb1d6cb24b992.
Oct 30 22:09:10 workstation.lan systemd[1]: machine-55d13553a859454eaf3bb1d6cb24b992.scope: Succeeded.
Oct 30 22:09:10 workstation.lan systemd-machined[49325]: Machine 55d13553a859454eaf3bb1d6cb24b992 terminated.

On Fedora 30 and earlier I did not have this kind of problem.
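For anyone wanting to try the same mitigation, here is a minimal sketch of the config fragment. mock evaluates its config files as Python with config_opts already defined; the empty dictionary below is only a stand-in (an assumption) so the fragment can be run on its own. In a real ~/.config/mock.cfg only the assignment itself is needed.

```python
# Stand-in for the dictionary that mock provides when it evaluates the
# config file (assumption: only needed to run this fragment standalone).
config_opts = {}

# Fall back to a plain chroot instead of systemd-nspawn.
config_opts['use_nspawn'] = False

print(config_opts)
```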
systemd-container-243-4.gitef67743.fc31.x86_64
mock-1.4.20-1.fc31.noarch
Linux workstation.lan 5.3.7-301.fc31.x86_64 #1 SMP Mon Oct 21 19:18:58 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
I have exactly the same configuration as above!
*** This bug has been marked as a duplicate of bug 1756972 ***
As far as I can see, this is *not* a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1756972
There are zero SELinux messages. Only a hard lock and an eventual reboot.

Last logs when using mock patched with https://github.com/rpm-software-management/mock/pull/371
Again a hard lockup and then a reboot. This did not happen when I tested in a virtual machine, only on real hardware. One moment the machine is working, and the next it does nothing visible or audible.

Oct 31 15:06:48 workstation.lan userhelper[60546]: running '/usr/libexec/mock/mock -n -N -r fedora-31-x86_64 --buildsrpm --spec=prog.spec --sources=SOURCES>
Oct 31 15:06:50 workstation.lan systemd-machined[57994]: New machine 12518987042248d7bbc262e8c3dd100d.
Oct 31 15:06:50 workstation.lan systemd[1]: Started Container 12518987042248d7bbc262e8c3dd100d.
Oct 31 15:06:50 workstation.lan systemd[1]: machine-12518987042248d7bbc262e8c3dd100d.scope: Succeeded.
Oct 31 15:06:50 workstation.lan systemd-machined[57994]: Machine 12518987042248d7bbc262e8c3dd100d terminated.
Oct 31 15:06:50 workstation.lan systemd-machined[57994]: New machine b844c70711aa4a8c8f09559da952357c.
Oct 31 15:06:50 workstation.lan systemd[1]: Started Container b844c70711aa4a8c8f09559da952357c.
Oct 31 15:06:50 workstation.lan systemd[1]: machine-b844c70711aa4a8c8f09559da952357c.scope: Succeeded.
Oct 31 15:06:50 workstation.lan systemd-machined[57994]: Machine b844c70711aa4a8c8f09559da952357c terminated.
Oct 31 15:06:50 workstation.lan systemd-machined[57994]: New machine 93cb155e31204e1ea0096ee55ee1eb8a.
Oct 31 15:06:50 workstation.lan systemd[1]: Started Container 93cb155e31204e1ea0096ee55ee1eb8a.
Oct 31 15:06:51 workstation.lan systemd[1]: machine-93cb155e31204e1ea0096ee55ee1eb8a.scope: Succeeded.
Oct 31 15:06:51 workstation.lan systemd-machined[57994]: Machine 93cb155e31204e1ea0096ee55ee1eb8a terminated.
Oct 31 15:06:51 workstation.lan systemd-machined[57994]: New machine 5307c506c84c4017a66dda33e62c0196.
Oct 31 15:06:51 workstation.lan systemd[1]: Started Container 5307c506c84c4017a66dda33e62c0196.
Oct 31 15:06:51 workstation.lan systemd[1]: machine-5307c506c84c4017a66dda33e62c0196.scope: Succeeded.
Oct 31 15:06:51 workstation.lan systemd-machined[57994]: Machine 5307c506c84c4017a66dda33e62c0196 terminated.

% cat .config/mock.cfg
config_opts['plugin_conf']['ccache_enable'] = True
config_opts['plugin_conf']['ccache_opts']['compress'] = 'on'
config_opts['plugin_conf']['compress_logs_enable'] = True
config_opts['plugin_conf']['compress_logs_opts'] = {}
config_opts['plugin_conf']['compress_logs_opts']['command'] = '/usr/bin/gzip -9'
config_opts['plugin_conf']['procenv_enable'] = True
config_opts['plugin_conf']['procenv_opts'] = {}
config_opts['plugin_conf']['sign_enable'] = True
config_opts['plugin_conf']['sign_opts'] = {}
config_opts['plugin_conf']['sign_opts']['cmd'] = 'rpmsign'
config_opts['plugin_conf']['sign_opts']['opts'] = '--addsign %(rpms)s'
config_opts['nosync'] = True
config_opts['nosync_force'] = True
config_opts['update_before_build'] = False
config_opts['use_bootstrap_container'] = True
config_opts['no_root_shells'] = True

I have enabled the watchdog and I assume it is what eventually reboots the machine.

% cat /etc/watchdog.conf
allocatable-memory = 1
interval = 10
max-load-1 = 0
max-load-15 = 100
max-load-5 = 0
max-temperature = 88
min-memory = 1
priority = 1
realtime = yes
temperature-sensor = /sys/devices/platform/coretemp.0/hwmon/hwmon1/temp2_input
temperature-sensor = /sys/devices/platform/coretemp.0/hwmon/hwmon1/temp4_input
watchdog-device = /dev/watchdog
watchdog-timeout = 60
I have now run mock several times without problems using --unpriv, but without:
config_opts['use_nspawn'] = False
I do not know yet whether it really mitigates the problem, but it seems to help. At least there is a major difference in how nspawn is used, around mockbuild/backend.py:382. I wonder why the branches have a different order of common parameters.
Created attachment 1631158 [details]
Kernel stack trace

Okay, I was wrong in comment 6. I was able to get a kernel stack trace. The kernel is tainted because of Nvidia. So this is a kernel bug, not actually a problem in mock/systemd-nspawn. I guess this issue happens with mock because it runs systemd-nspawn very actively, which in turn uses cgroups, and there is an issue in the kernel cgroups code.

Start in another window: sudo dmesg -w

$ uname -r
5.3.7-301.fc31.x86_64

general protection fault: 0000 [#1] SMP PTI
CPU: 3 PID: 3622 Comm: kworker/3:1 Tainted: P
Hardware name: System manufacturer System Product Name/P7H55, BIOS 0901 11/12/2010
Workqueue: cgroup_destroy css_killed_work_fn
RIP: 0010:bfqg_and_blkg_put+0xe/0x60
...
Call Trace:
bfq_bfqq_move+0x5a/0x160
bfq_pd_offline+0xd3/0xf0
blkg_destroy+0x72/0x200

(gdb) l *bfqg_and_blkg_put+0xe/0x60
0xffffffff814cc5e0 is in bfqg_and_blkg_put (block/bfq-cgroup.c:344).
339
340         blkg_get(bfqg_to_blkg(bfqg));
341     }
342
343     void bfqg_and_blkg_put(struct bfq_group *bfqg)
344     {
345         blkg_put(bfqg_to_blkg(bfqg));
346
347         bfqg_put(bfqg);
348     }

(gdb) l *bfq_bfqq_move+0x5a/0x160
0xffffffff814cc790 is in bfq_bfqq_move (block/bfq-cgroup.c:623).
618      * bfq_bic_update_cgroup on guaranteeing the consistency of blkg
619      * objects).
620      */
621     void bfq_bfqq_move(struct bfq_data *bfqd, struct bfq_queue *bfqq,
622                        struct bfq_group *bfqg)
623     {
624         struct bfq_entity *entity = &bfqq->entity;
625
626         /* If bfqq is empty, then bfq_bfqq_expire also invokes
627          * bfq_del_bfqq_busy, thereby removing bfqq and its entity

(gdb) l *bfq_pd_offline+0xd3/0xf0
0xffffffff814cc8f0 is in bfq_pd_offline (block/bfq-cgroup.c:838).
833      *
834      * blkio already grabs the queue_lock for us, so no need to use
835      * RCU-based magic
836      */
837     static void bfq_pd_offline(struct blkg_policy_data *pd)
838     {
839         struct bfq_service_tree *st;
840         struct bfq_group *bfqg = pd_to_bfqg(pd);
841         struct bfq_data *bfqd = bfqg->bfqd;
842         struct bfq_entity *entity = bfqg->my_entity;

(gdb) l *blkg_destroy+0x72/0x200
0xffffffff814b8f60 is in blkg_destroy (block/blk-cgroup.c:394).
389
390         return blkg;
391     }
392
393     static void blkg_destroy(struct blkcg_gq *blkg)
394     {
395         struct blkcg *blkcg = blkg->blkcg;
396         struct blkcg_gq *parent = blkg->parent;
397         int i;
398
Naturally, any kernel crash that freezes the system can be traced to the kernel, and it should not happen. I wonder why it was working in fc30 with similar kernels.
I believe this should be reopened, as I was hit by this multiple times yesterday evening with mock-1.4.21-1.fc31.noarch (which should include the fix for https://bugzilla.redhat.com/show_bug.cgi?id=1756972). Similarly, disabling nspawn and using chroot "fixes" the issue for me too.
I believe this is a "duplicate" of 1754807.
There's at least one other bfq issue being debugged at the moment, https://bugzilla.redhat.com/show_bug.cgi?id=1767539, so it would be worth trying the patch there.
> I wonder why it was working in fc30 with similar kernels.
If it is indeed a bfq issue, then the answer would be that we switched from deadline to bfq in F31.
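To confirm which I/O scheduler a given machine is actually using, something like the following rough sketch may help. It reads /sys/block/*/queue/scheduler, where the kernel marks the active scheduler with square brackets. The function names are made up for this example, and devices that expose no bracketed entry are reported as None.

```python
import glob
import os
import re


def active_scheduler(text):
    """Return the scheduler name shown in [brackets], or None if absent."""
    match = re.search(r"\[([^\]]+)\]", text)
    return match.group(1) if match else None


def schedulers_by_device(pattern="/sys/block/*/queue/scheduler"):
    """Map each block device name to its active scheduler."""
    result = {}
    for path in glob.glob(pattern):
        # .../<device>/queue/scheduler -> <device>
        device = os.path.basename(os.path.dirname(os.path.dirname(path)))
        try:
            with open(path) as handle:
                result[device] = active_scheduler(handle.read())
        except OSError:
            # Device vanished or the file is unreadable; skip it.
            continue
    return result


if __name__ == "__main__":
    for device, scheduler in sorted(schedulers_by_device().items()):
        print(f"{device}: {scheduler}")
```

A device reporting bfq here would be consistent with the bfq frames in the stack trace above.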
systemd.unified_cgroup_hierarchy=0 kernel command-line parameter can be used as a temporary workaround.
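After rebooting with that parameter, a quick way to check that it actually reached the kernel command line could look like this. It is only a sketch: the function name is invented for this example, and it deliberately matches just the literal =0 form mentioned above.

```python
def uses_legacy_cgroups(cmdline):
    """True if the command line sets systemd.unified_cgroup_hierarchy=0."""
    for token in cmdline.split():
        key, _, value = token.partition("=")
        if key == "systemd.unified_cgroup_hierarchy":
            return value == "0"
    return False


if __name__ == "__main__":
    try:
        with open("/proc/cmdline") as handle:
            print(uses_legacy_cgroups(handle.read()))
    except OSError as error:
        print(f"could not read /proc/cmdline: {error}")
```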