717325 – Assertion 'j->installed' failed at src/manager.c:1202, function transaction_apply(). Aborting.

Bug 717325 - Assertion 'j->installed' failed at src/manager.c:1202, function transaction_apply(). Aborting.

Summary: Assertion 'j->installed' failed at src/manager.c:1202, function transaction_a...

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	systemd
Sub Component:
Version:	15
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Assignee:	systemd-maint
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2011-06-28 15:02 UTC by Tvrtko Ursulin
Modified:	2011-11-10 17:44 UTC (History)
CC List:	10 users (show)
Fixed In Version:	systemd-37-3.fc16
Clone Of:
Environment:
Last Closed:	2011-11-10 17:37:10 UTC
Type:	---
Embargoed:
Dependent Products:

Attachments	(Terms of Use)
Core dump. (248.60 KB, application/x-tar) 2011-06-28 15:03 UTC, Tvrtko Ursulin	no flags	Details
core.518.xz (697.80 KB, application/octet-stream) 2011-10-13 14:00 UTC, Tim Waugh	no flags	Details
View All

Description Tvrtko Ursulin 2011-06-28 15:02:00 UTC

Description of problem:
Jun 28 15:29:45 localhost systemd[1]: Job initialise-layout-daemon.service/start failed with result 'dependency'.
Jun 28 15:29:45 localhost kernel: [  682.454946] systemd[1]: Job start-players.service/start failed with result 'dependency'.
Jun 28 15:29:45 localhost kernel: [  682.454955] systemd[1]: Job layout-daemon.service/start failed with result 'dependency'.
Jun 28 15:29:45 localhost kernel: [  682.454963] systemd[1]: Assertion 'j->installed' failed at src/manager.c:1202, function transaction_apply(). Aborting.
Jun 28 15:29:45 localhost systemd[1]: Caught <ABRT>, dumped core as pid 2162.
Jun 28 15:29:45 localhost systemd[1]: Freezing execution.


Version-Release number of selected component (if applicable):
systemd-sysv-26-3.fc15.i686
systemd-units-26-3.fc15.i686
systemd-26-3.fc15.i686

How reproducible:
Haven't yet figured how.

Additional info:
Core dump attached.

Comment 1 Tvrtko Ursulin 2011-06-28 15:03:27 UTC

Created attachment 510301 [details]
Core dump.

Comment 2 Michal Schmidt 2011-07-03 00:15:40 UTC

Unfortunately I am unable to get anything useful from the core dump:
 (gdb) bt
 #0  0x00b2c424 in __kernel_vsyscall ()
 #1  0x0070539e in ?? ()
 #2  0xbf9b2bc8 in ?? ()
 Backtrace stopped: previous frame inner to this frame (corrupt stack?)

What are the services initialise-layout-daemon, start-players, layout-daemon? I wanted to test them myself, but I could not find anything.

> How reproducible:
> Haven't yet figured how.

Any news? Finding a way to reproduce it would be very useful.

Comment 3 Tvrtko Ursulin 2011-07-04 07:38:53 UTC

(In reply to comment #2)
> Unfortunately I am unable to get anything useful from the core dump:
>  (gdb) bt
>  #0  0x00b2c424 in __kernel_vsyscall ()
>  #1  0x0070539e in ?? ()
>  #2  0xbf9b2bc8 in ?? ()
>  Backtrace stopped: previous frame inner to this frame (corrupt stack?)
> 
> What are the services initialise-layout-daemon, start-players, layout-daemon? I
> wanted to test them myself, but I could not find anything.

They are our internal services, but they do work most of the time - apart this time when they caused systemd to crash.
 
> > How reproducible:
> > Haven't yet figured how.
> 
> Any news? Finding a way to reproduce it would be very useful.

Nothing yet unfortunately.

Comment 4 Tim Waugh 2011-08-28 19:42:52 UTC

I'm getting this with systemd-26-8.fc15.x86_64.  I was able to reproduce it by commenting out a line in /etc/fstab, which caused other lines (bind mounts) to fail.

Here's the backtrace:

#0  0x0000003fa340edbb in raise (sig=6)
    at ../nptl/sysdeps/unix/sysv/linux/pt-raise.c:42
#1  0x000000000040ab8c in crash (sig=6) at src/main.c:121
#2  <signal handler called>
#3  0x0000003fa30352d5 in __GI_raise (sig=6)
    at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#4  0x0000003fa3036beb in __GI_abort () at abort.c:93
#5  0x000000000045ea21 in log_assert (file=0x4795e2 "src/manager.c", 
    line=1205, func=0x47af30 "transaction_apply", format=<optimized out>)
    at src/log.c:496
#6  0x000000000040eb59 in transaction_apply (mode=<optimized out>, m=0x3640830)
    at src/manager.c:1205
#7  transaction_activate (e=0x0, mode=JOB_ISOLATE, m=0x3640830)
    at src/manager.c:1342
#8  manager_add_job (m=0x3640830, type=JOB_START, unit=0x37544e0, 
    mode=JOB_ISOLATE, override=<optimized out>, e=0x0, _ret=0x0)
    at src/manager.c:1692
#9  0x000000000046b848 in unit_trigger_on_failure (u=0x37475e0)
    at src/unit.c:1120
#10 0x000000000047218b in job_finish_and_invalidate (j=<optimized out>, 
    result=<optimized out>) at src/job.c:613
#11 0x00000000004723a4 in job_finish_and_invalidate (j=<optimized out>, 
    result=JOB_FAILED) at src/job.c:575
#12 0x000000000046bdd9 in unit_notify (u=0x373bdf0, os=UNIT_ACTIVATING, 
    ns=UNIT_FAILED, reload_success=true) at src/unit.c:1200
#13 0x0000000000427854 in mount_set_state (m=0x373bdf0, state=MOUNT_FAILED)
    at src/mount.c:690
#14 0x0000000000428b9c in mount_sigchld_event (u=0x373bdf0, pid=491, code=1, 
    status=32) at src/mount.c:1266
#15 0x000000000040bd62 in manager_dispatch_sigchld (m=0x3640830)
    at src/manager.c:2055
#16 0x0000000000411961 in manager_process_signal_fd (m=<optimized out>)
    at src/manager.c:2254
#17 process_event (ev=0x7fff992f9890, m=0x3640830) at src/manager.c:2279
#18 manager_loop (m=0x3640830) at src/manager.c:2413
#19 0x0000000000408c6a in main (argc=<optimized out>, argv=<optimized out>)
    at src/main.c:1280

(This is from /core.518, which I can keep around for a while -- xz-compressed size is 718kb.)

Here's my /etc/fstab from a run that failed:

/dev/mapper/vg_worm01-LogVol00             /      ext4    defaults        1 1
UUID=cc6e7172-7f4e-4edf-8f08-a067ce88202e  /boot  ext3    defaults        1 2
/dev/mapper/vg_worm00-LogVol00             /home  ext4    defaults        1 2
#/dev/mapper/luks-cd4fe06b-cad2-4a42-8925-9d309ca30c62 /mnt/backup ext3 defaults 1 2
UUID=66412111-7bc5-404f-a072-0f764d8007fa  swap   swap    defaults        0 0
tmpfs                   /dev/shm                tmpfs   defaults        0 0
devpts                  /dev/pts                devpts  gid=5,mode=620  0 0
sysfs                   /sys                    sysfs   defaults        0 0
proc                    /proc                   proc    defaults        0 0

/mnt/backup/mock	/var/lib/mock		none	bind		0 0
/mnt/backup/mock-cache	/var/cache/mock		none	bind		0 0

As you can see, with the "/mnt/backup" line commented out, the final two lines cannot succeed.

Comment 5 Michal Schmidt 2011-10-13 08:21:23 UTC

Tim,

(In reply to comment #4)
> Here's the backtrace:
> [...]

It shows the mount unit failed as expected, local-fs.target was considered failed too. local-fs.target has "OnFailure=emergency.target" and the crash happened while trying to isolate emergency.target.

I am still unable to reproduce it on my system though.
Can you reproduce it at will?
 
> (This is from /core.518, which I can keep around for a while -- xz-compressed
> size is 718kb.)

Do you still have it? Could you attach it here or send it to me directly?

Comment 6 Tim Waugh 2011-10-13 14:00:02 UTC

Created attachment 528034 [details]
core.518.xz

Comment 7 Michal Schmidt 2011-10-14 08:59:20 UTC

The assertion failed in transaction_apply() where the list of installed jobs is iterated in order to remove some of the jobs.

                /* When isolating first kill all installed jobs which
                 * aren't part of the new transaction */
                HASHMAP_FOREACH(j, m->jobs, i) {
                        assert(j->installed);

                        if (hashmap_get(m->transaction_jobs, j->unit))
                                continue;

                        job_finish_and_invalidate(j, JOB_CANCELED);
                }

Let's take a look at the job "j" for which the assertion failed:
(gdb) p j
$18 = (Job *) 0x396aed0
(gdb) p j->unit->meta.id
$19 = 0x393a5f0 "\001"

This looks like garbage. In fact the whole Unit structure pointed to by j->unit looks like nonsense.

I wrote a gdb script to dump the whole m->jobs hashmap:

  define dump_job_list
  set $iter = ($arg0)->iterate_list_head
  while ($iter != 0)
  	set $pjob = (Job*)$iter->value
  	printf "Job %p; unit id: %s\n", $pjob, $pjob->unit->meta.id
  	set $iter = $iter->iterate_next
  end
  end


All the jobs in m->jobs look perfectly alright, with readable names:

(gdb) dump_job_list m->jobs
[...]
Job 0x396a070; unit id: systemd-tmpfiles-setup.service
Job 0x396a710; unit id: systemd-random-seed-load.service
Job 0x396b070; unit id: cryptsetup@luks\x2d2a3bee45\x2d582c\x2d4045\x2d9042\x2d2e48d5febf50.service
Job 0x396b190; unit id: dev-disk-by\x2duuid-2a3bee45\x2d582c\x2d4045\x2d9042\x2d2e48d5febf50.device
Job 0x396b2f0; unit id: dev-mapper-luks\x2d2a3bee45\x2d582c\x2d4045\x2d9042\x2d2e48d5febf50.device
[...]

The listed job addresses are increasing monotonically. There is no job corresponding "j" (address 0x396aed0).

The m->jobs hashmap is consistent, so how did we get the address into "j"?
The answer is in the way HASHMAP_FOREACH operates. It is written to be safe against the removal of the current entry. The iterator "i" always points to the entry that will be retrieved next. This means that HASHMAP_FOREACH is not safe against the removal of the _next_ entry.
job_finish_and_invalidate(j, JOB_CANCELED) can remove not only the current job, but also recursively other jobs. One of them may be the one pointed to by the iterator "i". This must result in the crash.

Comment 8 Michal Schmidt 2011-10-17 09:32:01 UTC

Upstream fix:

http://cgit.freedesktop.org/systemd/commit/?id=563ba9ea6e60774086555998b957edf923e24b46

Comment 9 Fedora Update System 2011-10-19 10:49:34 UTC

systemd-37-2.fc16 has been submitted as an update for Fedora 16.
https://admin.fedoraproject.org/updates/FEDORA-2011-14239

Comment 10 Fedora Update System 2011-10-19 11:08:21 UTC

systemd-26-12.fc15 has been submitted as an update for Fedora 15.
https://admin.fedoraproject.org/updates/FEDORA-2011-14203

Comment 11 Fedora Update System 2011-10-20 02:22:25 UTC

Package systemd-37-2.fc16:
* should fix your issue,
* was pushed to the Fedora 16 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing systemd-37-2.fc16'
as soon as you are able to, then reboot.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-2011-14239
then log in and leave karma (feedback).

Comment 12 Fedora Admin XMLRPC Client 2011-10-20 16:30:18 UTC

This package has changed ownership in the Fedora Package Database.  Reassigning to the new owner of this component.

Comment 13 Fedora Update System 2011-11-10 17:37:10 UTC

systemd-26-13.fc15 has been pushed to the Fedora 15 stable repository.  If problems still persist, please make note of it in this bug report.

Comment 14 Fedora Update System 2011-11-10 17:44:31 UTC

systemd-37-3.fc16 has been pushed to the Fedora 16 stable repository.  If problems still persist, please make note of it in this bug report.

Note You need to log in before you can comment on or make changes to this bug.