Bug 67427

Summary: Crash on shutdown, Assertion 'a >= 0 && a < _JOB_TYPE_MAX_MERGING' failed
Product: systemd Reporter: Ivan Kozik <ivan>
Component: generalAssignee: systemd-bugs
Status: RESOLVED FIXED QA Contact: systemd-bugs
Severity: critical    
Priority: highest CC: bugs.freedesktop.org, keepitsimpleengineer, rmlipman, steven, zoomulator
Version: unspecified   
Hardware: x86-64 (AMD64)   
OS: Linux (All)   
Whiteboard:
i915 platform: i915 features:
Attachments: journals with Assertion 'a >= 0 && a < _JOB_TYPE_MAX_MERGING' failed
show types of jobs in job_type_lookup_merge
failed shutdown with systemd.log_level=debug
another crash
and another crash
gzipped core dump from systemd 217
debug build coredump
systemd 317 -O0 -g core dump
patch: don't try to merge a JOB_NOP
systemd 317 -g core dump (not -O0)

Description Ivan Kozik 2013-07-28 05:19:48 UTC
Created attachment 83113 [details]
journals with Assertion 'a >= 0 && a < _JOB_TYPE_MAX_MERGING' failed

When I shut down or reboot NixOS (unstable channel, systemd 203 or 206), the system never finishes shutting down.  After I forcefully reboot it, I see this in the journal:

Jul 28 06:55:05 nixos systemd[1]: Assertion 'a >= 0 && a < _JOB_TYPE_MAX_MERGING' failed at src/core/job.c:338, function job_type_lookup_merge(). Aborting.
Jul 28 06:55:05 nixos systemd[1]: Caught <ABRT>, dumped core as pid 2017.
Jul 28 06:55:05 nixos systemd[1]: Freezing execution.

Full journals are attached, including one where I tried systemd 206.

# uname -a
Linux nixos 3.4.54 #1 SMP Tue Jul 23 21:48:39 UTC 2013 x86_64 GNU/Linux

My NixOS install is running as a guest in VMWare Workstation 9.0.2 on a CentOS 6.4 host.  The guest has the default VMWare guest hardware.


NixOS builds systemd with these options:

https://github.com/NixOS/nixpkgs/blob/master/pkgs/os-specific/linux/systemd/default.nix

and these patches:

https://github.com/NixOS/nixpkgs/tree/master/pkgs/os-specific/linux/systemd

and uses these units:

https://github.com/NixOS/nixos/blob/master/modules/system/boot/systemd.nix

(I had to change these slightly for 206)
Comment 1 Ivan Kozik 2013-07-28 05:26:00 UTC
Note that sometimes it does shut down successfully (maybe 30% of the time, probably more likely when the system is idle).
Comment 2 Zbigniew Jedrzejewski-Szmek 2013-07-29 15:26:51 UTC
Can you try with the following patch, so that we know what values are actually there?
Comment 3 Zbigniew Jedrzejewski-Szmek 2013-07-29 15:28:24 UTC
Created attachment 83212 [details] [review]
show types of jobs in job_type_lookup_merge
Comment 4 Ivan Kozik 2013-07-29 16:19:03 UTC
With the patch:

Jul 29 18:13:27 nixos systemd[1]: Stopped ACPI Daemon.
Jul 29 18:13:27 nixos systemd[1]: Stopped Cron Daemon.
Jul 29 18:13:27 nixos systemd[1]: Stopped Getty on tty1.
Jul 29 18:13:27 nixos systemd[1]: systemd-logind.service: main process exited, code=exited, status=1/FAILURE
Jul 29 18:13:27 nixos systemd[1]: Stopped Login Service.
Jul 29 18:13:27 nixos systemd[1]: Unit systemd-logind.service entered failed state.
Jul 29 18:13:27 nixos systemd[1]: Stopped SSH Daemon.
Jul 29 18:13:27 nixos systemd[1]: Stopped D-Bus System Message Bus.
Jul 29 18:13:27 nixos systemd[1]: Stopped NTP Daemon.
Jul 29 18:13:27 nixos systemd[1]: Stopping User and Group Name Lookups.
Jul 29 18:13:27 nixos systemd[1]: Stopped target User and Group Name Lookups.
Jul 29 18:13:27 nixos systemd[1]: Stopping Name Service Cache Daemon...
Jul 29 18:13:27 nixos systemd[1]: Stopping system-getty.slice.
Jul 29 18:13:27 nixos systemd[1]: Removed slice system-getty.slice.
Jul 29 18:13:27 nixos systemd[1]: Started Store Sound Card State.
Jul 29 18:13:27 nixos systemd[1]: a=5 b=2
Jul 29 18:13:27 nixos systemd[1]: Assertion 'a >= 0 && a < _JOB_TYPE_MAX_MERGING' failed at src/core/job.c:341, function job_type_lookup_merge(). Aborting.
Jul 29 18:13:27 nixos systemd[1]: Caught <ABRT>, dumped core as pid 2070.
Jul 29 18:13:27 nixos systemd[1]: Freezing execution.

I had to reboot a few times to get this to reproduce on systemd 206; I finally got it when I had a few processes running: two `find /`, a `ping -f localhost`, and a `cat /dev/urandom | bzip2 > /dev/null`
Comment 5 Ivan Kozik 2013-07-29 16:29:29 UTC
Reproduced the bug again and I see `a=5 b=2` again.

(This time, I had started about a dozen `find / > /dev/null &` before rebooting.)
Comment 6 Zbigniew Jedrzejewski-Szmek 2013-07-29 16:54:57 UTC
(In reply to comment #5)
> Reproduced the bug again and I see `a=5 b=2` again.
So we're trying to merge JOB_NOP and JOB_STOP. Interesting.
Comment 7 Zbigniew Jedrzejewski-Szmek 2013-07-29 17:52:03 UTC
Hm, can you show the logs with systemd.log_level=debug on the kernel commandline?
Comment 8 Ivan Kozik 2013-07-29 18:28:19 UTC
Created attachment 83227 [details]
failed shutdown with systemd.log_level=debug

Crashed in (apparently) the same way, not sure why the Assertion itself isn't in the journal this time
Comment 9 Ivan Kozik 2013-07-29 18:43:38 UTC
Created attachment 83229 [details]
another crash
Comment 10 Ivan Kozik 2013-07-29 18:44:01 UTC
Created attachment 83230 [details]
and another crash
Comment 11 Ivan Kozik 2013-08-29 10:17:24 UTC
It looks like an altlinux user has also run into this:

http://forum.altlinux.org/index.php?topic=29850.0

(click the second [+] to see the log)
Comment 12 Ricardo M. Correia 2013-09-23 15:37:25 UTC
FWIW, here is the stack trace of the systemd crash on NixOS:

(gdb) bt
#0  0x00006cb659188b0b in raise () from /nix/store/zm4bhsm8lprkzvrjgqr0klfkvr21als4-glibc-2.17/lib/libpthread.so.0
#1  0x000000000040cbe1 in crash ()
#2  <signal handler called>
#3  0x00006cb658bf4fb5 in raise () from /nix/store/zm4bhsm8lprkzvrjgqr0klfkvr21als4-glibc-2.17/lib/libc.so.6
#4  0x00006cb658bf6436 in abort () from /nix/store/zm4bhsm8lprkzvrjgqr0klfkvr21als4-glibc-2.17/lib/libc.so.6
#5  0x0000000000460383 in log_assert ()
#6  0x0000000000460ad0 in log_assert_failed ()
#7  0x000000000047b5c6 in job_type_lookup_merge ()
#8  0x0000000000414672 in transaction_activate ()
#9  0x000000000040e684 in manager_add_job ()
#10 0x0000000000443f6d in bus_unit_queue_job ()
#11 0x000000000043fc1f in bus_manager_message_handler ()
#12 0x00006cb6593b3cb9 in _dbus_object_tree_dispatch_and_unlock () from /nix/store/mv8dsy1q29ph6hdk6c96f252ww89z27m-dbus-libs-1.6.14/lib/libdbus-1.so.3
#13 0x00006cb6593a5e45 in dbus_connection_dispatch () from /nix/store/mv8dsy1q29ph6hdk6c96f252ww89z27m-dbus-libs-1.6.14/lib/libdbus-1.so.3
#14 0x000000000043e82a in bus_dispatch ()
#15 0x0000000000411df8 in manager_loop ()
#16 0x000000000040a233 in main ()
Comment 13 Lennart Poettering 2014-05-24 09:56:09 UTC
Is this still a problem with current git?
Comment 14 Ricardo M. Correia 2014-05-27 18:23:36 UTC
(In reply to comment #13)
> Is this still a problem with current git?

I haven't seen it since NixOS upgraded systemd to version 212...
Comment 15 Lennart Poettering 2014-05-28 01:49:27 UTC
OK, closing then.
Comment 16 Hiram 2014-10-17 07:26:39 UTC
We are seeing this problem also on systemd 215 when restarting a service. 
This is on ARM (imx28)

But it is fairly hard to reproduce so I did not have the possibility to apply the patch or find a good reproduction path.

Oct 02 23:23:44 iris python[146]: Restarting service
Oct 02 23:23:44 iris systemd[1]: Assertion 'a >= 0 && a < _JOB_TYPE_MAX_MERGING' failed at src/core/job.c:314, function job_type_lookup_merge(). Aborting.
Oct 02 23:23:44 iris udhcpc[151]: deleting routers
Oct 02 23:23:44 iris systemd-coredump[2465]: Failed to create coredump file /var/lib/systemd/coredump/.core.systemd.0.4cd67b66fa164cb795ec91545bb3f8cf.2464.1412292224000000a2a530d4db025192: Read-only file system
Oct 02 23:23:44 iris systemd-coredump[2465]: Process 2464 (systemd) of user 0 dumped core.
Oct 02 23:23:44 iris systemd[1]: Caught <ABRT>, dumped core as pid 2464.
Oct 02 23:23:44 iris systemd[1]: Freezing execution.
Oct 02 23:23:45 iris udhcpc[151]: route: SIOCDELRT: No such process

I hope to provide extra information soon
Comment 17 Steven Noonan 2014-11-12 21:53:58 UTC
Created attachment 109372 [details]
gzipped core dump from systemd 217

I'm encountering this as well on systemd 217.

Nov 12 08:01:01 dispater crond[12331]: pam_unix(crond:session): session opened for user root by (uid=0)
Nov 12 08:01:01 dispater CROND[12332]: (root) CMD (run-parts /etc/cron.hourly)
Nov 12 08:01:01 dispater CROND[12331]: pam_unix(crond:session): session closed for user root
Nov 12 08:03:37 dispater dhcpcd[1075]: br0: fe80::2472:37ff:fea0:49d2 is unreachable, expiring it
Nov 12 08:05:03 dispater dhcpcd[1075]: br0: Router Advertisement from fe80::9844:a4ff:fe69:acd5
Nov 12 08:05:03 dispater dhcpcd[1075]: br0: adding address 2002:6898:6b85:beef:6a05:caff:fe0a:6ec2/64
Nov 12 08:05:03 dispater systemd[1]: Cannot add dependency job for unit cups.socket, ignoring: Unit cups.socket failed to load: No such file or directory.
Nov 12 08:05:03 dispater systemd[1]: Stopping Name Service Cache Daemon...
Nov 12 08:05:03 dispater systemd[1]: Assertion 'a >= 0 && a < _JOB_TYPE_MAX_MERGING' failed at src/core/job.c:314, function job_type_lookup_merge(). Aborting.
Nov 12 08:05:03 dispater systemd[1]: Caught <ABRT>, dumped core as pid 12419.
Nov 12 08:05:03 dispater systemd[1]: Freezing execution.
Nov 12 08:05:03 dispater systemd-coredump[12420]: Process 12419 (systemd) of user 0 dumped core.
Nov 12 08:05:08 dispater acpid[554]: client 608[0:0] has disconnected
Nov 12 08:05:08 dispater acpid[554]: client connected from 653[82:82]
Nov 12 08:05:08 dispater acpid[554]: 1 client rule loaded

After the above point I can't log in or do anything useful with the box except reboot it.

Backtrace via 'coredumpctl gdb':

(gdb) thread apply all bt full

Thread 1 (Thread 0x7f1fa9967780 (LWP 12419)):
#0  0x00007f1fa89260c9 in raise () from /usr/lib/libpthread.so.0
No symbol table info available.
#1  0x00007f1fa9a5b3c8 in crash.lto_priv.222 (sig=6) at src/core/main.c:168
        rl = {
          rlim_cur = 18446744073709551615, 
          rlim_max = 18446744073709551615
        }
        sa = {
          __sigaction_handler = {
            sa_handler = 0x0, 
            sa_sigaction = 0x0
          }, 
          sa_mask = {
            __val = {0 <repeats 16 times>}
          }, 
          sa_flags = 0, 
          sa_restorer = 0x0
        }
        __func__ = "crash"
        __PRETTY_FUNCTION__ = "crash"
#2  <signal handler called>
No symbol table info available.
#3  0x00007f1fa85a6967 in raise () from /usr/lib/libc.so.6
No symbol table info available.
#4  0x00007f1fa85a7d3a in abort () from /usr/lib/libc.so.6
No symbol table info available.
#5  0x00007f1fa9a90012 in log_assert_failed (text=text@entry=0x7f1fa9aa78b8 "a >= 0 && a < _JOB_TYPE_MAX_MERGING", file=file@entry=0x7f1fa9a9a212 "src/core/job.c", line=line@entry=314, func=func@entry=0x7f1fa9ab8370 <__PRETTY_FUNCTION__.15000> "job_type_lookup_merge") at src/shared/log.c:718
No locals.
#6  0x00007f1fa99e943f in job_type_lookup_merge (a=<optimized out>, b=<optimized out>) at src/core/job.c:314
        __PRETTY_FUNCTION__ = "job_type_lookup_merge"
#7  0x00007f1fa9a330a5 in job_type_is_superset () at src/core/job.h:198
No locals.
#8  transaction_is_destructive (e=<optimized out>, mode=<optimized out>, tr=<optimized out>) at src/core/transaction.c:516
        i = 0xffffffffffffffff
        j = 0x0
#9  transaction_activate (e=<optimized out>, mode=<optimized out>, m=<optimized out>, tr=<optimized out>) at src/core/transaction.c:722
        i = 0xffffffffffffffff
        j = 0x0
#10 manager_add_job (m=0x7f1fab518380, type=_JOB_TYPE_MAX_MERGING, unit=0x6, mode=_JOB_MODE_INVALID, override=false, e=0x7f1fa9b07180 <buffer>, _ret=0x7fff46a422b0) at src/core/manager.c:1224
        __PRETTY_FUNCTION__ = "manager_add_job"
        __func__ = "manager_add_job"
#11 0x00007f1fa9a1bbc5 in bus_unit_queue_job (bus=0x7f1fab5f9090, message=0x7f1fab66dd20, u=0x7f1fab5b5820, type=_JOB_TYPE_MAX_IN_TRANSACTION, mode=JOB_FAIL, reload_if_possible=128, error=0x7fff46a42440) at src/core/dbus-unit.c:777
        path = 0x6 <error: Cannot access memory at address 0x6>
        j = 0x1ced597e673294c1
        __PRETTY_FUNCTION__ = "bus_unit_queue_job"
#12 0x00007f1fa9a1bfe6 in bus_unit_method_start_generic (bus=0x7f1fab5f9090, message=0x7f1fab66dd20, u=0x7f1fab5b5820, job_type=_JOB_TYPE_MAX_IN_TRANSACTION, reload_if_possible=reload_if_possible@entry=false, error=0x7fff46a42440) at src/core/dbus-unit.c:383
        smode = 0x7f1fab6059f8 "fail"
        mode = JOB_FAIL
        r = 0
        __PRETTY_FUNCTION__ = "bus_unit_method_start_generic"
#13 0x00007f1fa9a756c6 in method_start_unit_generic (bus=0x7f1fab5f9090, message=0x7f1fab66dd20, m=0x7f1fab518380, job_type=_JOB_TYPE_MAX_IN_TRANSACTION, reload_if_possible=<optimized out>, error=0x7fff46a42440) at src/core/dbus-manager.c:478
        name = 0x7f1fab6059e4 "nscd.service"
        u = 0x7f1fab5b5820
        r = <optimized out>
        __PRETTY_FUNCTION__ = "method_start_unit_generic"
#14 0x00007f1fa9a6b3ee in method_callbacks_run (found_object=<optimized out>, require_fallback=<optimized out>, c=<optimized out>, m=<optimized out>, bus=<optimized out>) at src/libsystemd/sd-bus/bus-objects.c:400
        slot = 0x0
        u = 0x7f1fab518380
        error = {
          name = 0x0, 
          message = 0x0, 
          _need_free = 0
        }
        signature = 0xffffffffffffffff <error: Cannot access memory at address 0xffffffffffffffff>
#15 object_find_and_run.lto_priv.239 (bus=0x7f1fab5f9090, m=0x7f1fab66dd20, p=0x6 <error: Cannot access memory at address 0x6>, require_fallback=false, found_object=0x7fff46a425b0) at src/libsystemd/sd-bus/bus-objects.c:1224
        vtable_key = {
          path = 0x7f1fab605948 "/org/freedesktop/systemd1", 
          interface = 0x7f1fab605988 "org.freedesktop.systemd1.Manager", 
          member = 0x7f1fab605970 "TryRestartUnit", 
          parent = 0x1, 
          last_iteration = 0, 
          vtable = 0x7f1fa9a848e1 <bus_read_message.constprop.160+49>
        }
        __PRETTY_FUNCTION__ = "object_find_and_run"
#16 0x00007f1fa9a85ace in bus_process_object (m=<optimized out>, bus=0x7f1fab5f9090) at src/libsystemd/sd-bus/bus-objects.c:1340
        prefix = 0x7fff46a424e0 "\220\220_\253\037\177"
        r = <optimized out>
        pl = <optimized out>
        found_object = true
#17 process_message (m=<optimized out>, bus=0x7f1fab5f9090) at src/libsystemd/sd-bus/sd-bus.c:2409
        r = <optimized out>
#18 process_running (priority=0, ret=0x0, hint_priority=false, bus=0x7f1fab5f9090) at src/libsystemd/sd-bus/sd-bus.c:2451
        m = <optimized out>
        r = <optimized out>
#19 bus_process_internal.constprop.164 (bus=0x7f1fab5f9090, ret=0x0, priority=0, hint_priority=false) at src/libsystemd/sd-bus/sd-bus.c:2640
        _dont_destroy_bus = 0x7f1fab5f9090
#20 0x00007f1fa9a23b11 in sd_bus_process (ret=0x0, bus=<optimized out>) at src/libsystemd/sd-bus/sd-bus.c:2659
No locals.
#21 io_callback (s=<optimized out>, fd=<optimized out>, revents=<optimized out>, userdata=<optimized out>) at src/libsystemd/sd-bus/sd-bus.c:2918
        bus = <optimized out>
        __PRETTY_FUNCTION__ = "io_callback"
#22 0x00007f1fa9a4e680 in source_dispatch (s=0x7f1fab607ee0) at src/libsystemd/sd-event/sd-event.c:2103
        __PRETTY_FUNCTION__ = "source_dispatch"
        __func__ = "source_dispatch"
#23 0x00007f1fa9a53bde in sd_event_dispatch (e=0x7f1fab5187b0) at src/libsystemd/sd-event/sd-event.c:2460
        p = <optimized out>
        r = <optimized out>
        e = 0x7f1fab5187b0
        __PRETTY_FUNCTION__ = "sd_event_dispatch"
#24 0x00007f1fa9a66590 in sd_event_run (timeout=<optimized out>, e=<optimized out>) at src/libsystemd/sd-event/sd-event.c:2489
        r = 0
#25 manager_loop (m=0x7f1fab518380) at src/core/manager.c:2010
        wait_usec = 18446744073709551615
        rl = {
          interval = 1000000, 
          begin = 389870461922, 
          burst = 50000, 
          num = 23
        }
        __PRETTY_FUNCTION__ = "manager_loop"
        __func__ = "manager_loop"
#26 0x00007f1fa99e6859 in main (argc=1, argv=0x7fff46a43098) at src/core/main.c:1731
        m = 0x7f1fab518380
        r = <optimized out>
        retval = 1
        before_startup = <optimized out>
        after_startup = <optimized out>
        timespan = "\000\000\000\000\000\000\000\000`.\244F\377\177\000\000X.\244F\377\177\000\000W.\244F\377\177\000\000P\213z\251\037\177\000\000h.\244F\377\177\000\000\000\000\000\000\034\000\000\000`.\244F\377\177\000"
        fds = 0x0
        reexecute = false
        shutdown_verb = 0x0
        initrd_timestamp = <optimized out>
        userspace_timestamp = {
          realtime = 1415418435002172, 
          monotonic = 3175634
        }
        kernel_timestamp = {
          realtime = <optimized out>, 
          monotonic = 0
        }
        security_start_timestamp = {
          realtime = 1415418435003792, 
          monotonic = 3177255
        }
        security_finish_timestamp = {
          realtime = 1415418435004168, 
          monotonic = 3177631
        }
        systemd = "systemd"
        skip_setup = <optimized out>
        j = <optimized out>
        loaded_policy = false
        arm_reboot_watchdog = false
        queue_default_job = <optimized out>
        empty_etc = false
        switch_root_dir = 0x0
        switch_root_init = 0x0
        __func__ = "main"
        __PRETTY_FUNCTION__ = "main"
(gdb)

Attaching a core dump.
Comment 18 Steven Noonan 2014-11-12 21:54:08 UTC
Reopening
Comment 19 Steven Noonan 2014-11-13 05:28:03 UTC
Created attachment 109380 [details]
debug build coredump

Here's a debug build (-O0 -g) with the same crash. I've got a fairly solid repro now, just have to reboot my router box and when it comes back this system crashes.

Better stack trace follows:

(gdb) thread apply all bt full

Thread 1 (Thread 0x7f5b1bb527c0 (LWP 13326)):
#0  0x00007f5b1ab110c9 in raise () from /usr/lib/libpthread.so.0
No symbol table info available.
#1  0x00007f5b1bbc9a53 in crash (sig=6) at src/core/main.c:168
        rl = {
          rlim_cur = 18446744073709551615, 
          rlim_max = 18446744073709551615
        }
        sa = {
          __sigaction_handler = {
            sa_handler = 0x0, 
            sa_sigaction = 0x0
          }, 
          sa_mask = {
            __val = {0 <repeats 16 times>}
          }, 
          sa_flags = 0, 
          sa_restorer = 0x0
        }
        pid = 0
        __func__ = "crash"
        __PRETTY_FUNCTION__ = "crash"
#2  <signal handler called>
No symbol table info available.
#3  0x00007f5b1a791967 in raise () from /usr/lib/libc.so.6
No symbol table info available.
#4  0x00007f5b1a792d3a in abort () from /usr/lib/libc.so.6
No symbol table info available.
#5  0x00007f5b1bc499fe in log_assert_failed (text=0x7f5b1bd082b8 "a >= 0 && a < _JOB_TYPE_MAX_MERGING", file=0x7f5b1bd08023 "src/core/job.c", line=314, func=0x7f5b1bd08ae0 <__PRETTY_FUNCTION__.13893> "job_type_lookup_merge") at src/shared/log.c:718
No locals.
#6  0x00007f5b1bcaab15 in job_type_lookup_merge (a=_JOB_TYPE_MAX_MERGING, b=JOB_RESTART) at src/core/job.c:314
        __PRETTY_FUNCTION__ = "job_type_lookup_merge"
#7  0x00007f5b1bbda1c3 in job_type_is_superset (a=_JOB_TYPE_MAX_MERGING, b=JOB_RESTART) at src/core/job.h:198
No locals.
#8  0x00007f5b1bbdba67 in transaction_is_destructive (tr=0x7f5b1c484e50, mode=JOB_FAIL, e=0x7fff60c9dd60) at src/core/transaction.c:516
        i = 0xffffffffffffffff
        j = 0x7f5b1c548a90
        __PRETTY_FUNCTION__ = "transaction_is_destructive"
#9  0x00007f5b1bbdc31e in transaction_activate (tr=0x7f5b1c484e50, m=0x7f5b1c46a370, mode=JOB_FAIL, e=0x7fff60c9dd60) at src/core/transaction.c:722
        i = 0xffffffffffffffff
        j = 0x0
        r = 0
        generation = 3
        __PRETTY_FUNCTION__ = "transaction_activate"
        __func__ = "transaction_activate"
#10 0x00007f5b1bbd3587 in manager_add_job (m=0x7f5b1c46a370, type=_JOB_TYPE_MAX_MERGING, unit=0x7f5b1c51cf40, mode=JOB_FAIL, override=true, e=0x7fff60c9dd60, _ret=0x7fff60c9dbd0) at src/core/manager.c:1224
        r = 0
        tr = 0x7f5b1c484e50
        __PRETTY_FUNCTION__ = "manager_add_job"
        __func__ = "manager_add_job"
#11 0x00007f5b1bc19ca9 in bus_unit_queue_job (bus=0x7f5b1c47cfc0, message=0x7f5b1c549010, u=0x7f5b1c51cf40, type=_JOB_TYPE_MAX_IN_TRANSACTION, mode=JOB_FAIL, reload_if_possible=false, error=0x7fff60c9dd60) at src/core/dbus-unit.c:777
        path = 0x0
        j = 0x7f5b1c46a790
        r = 0
        __PRETTY_FUNCTION__ = "bus_unit_queue_job"
#12 0x00007f5b1bc18b54 in bus_unit_method_start_generic (bus=0x7f5b1c47cfc0, message=0x7f5b1c549010, u=0x7f5b1c51cf40, job_type=_JOB_TYPE_MAX_IN_TRANSACTION, reload_if_possible=false, error=0x7fff60c9dd60) at src/core/dbus-unit.c:383
        smode = 0x7f5b1c484f58 "fail"
        mode = JOB_FAIL
        r = 1
        __PRETTY_FUNCTION__ = "bus_unit_method_start_generic"
#13 0x00007f5b1bc12745 in method_start_unit_generic (bus=0x7f5b1c47cfc0, message=0x7f5b1c549010, m=0x7f5b1c46a370, job_type=_JOB_TYPE_MAX_IN_TRANSACTION, reload_if_possible=false, error=0x7fff60c9dd60) at src/core/dbus-manager.c:478
        name = 0x7f5b1c484f44 "nscd.service"
        u = 0x7f5b1c51cf40
        r = 1
        __PRETTY_FUNCTION__ = "method_start_unit_generic"
#14 0x00007f5b1bc12899 in method_try_restart_unit (bus=0x7f5b1c47cfc0, message=0x7f5b1c549010, userdata=0x7f5b1c46a370, error=0x7fff60c9dd60) at src/core/dbus-manager.c:498
No locals.
#15 0x00007f5b1bc81167 in method_callbacks_run (bus=0x7f5b1c47cfc0, m=0x7f5b1c549010, c=0x7f5b1c4854b0, require_fallback=false, found_object=0x7fff60c9de83) at src/libsystemd/sd-bus/bus-objects.c:400
        slot = 0x7f5b1c53c4b0
        error = {
          name = 0x0, 
          message = 0x0, 
          _need_free = 0
        }
        signature = 0x7f5b1c476690 "ss"
        u = 0x7f5b1c46a370
        r = 1
        __PRETTY_FUNCTION__ = "method_callbacks_run"
#16 0x00007f5b1bc83901 in object_find_and_run (bus=0x7f5b1c47cfc0, m=0x7f5b1c549010, p=0x7f5b1c484ea8 "/org/freedesktop/systemd1", require_fallback=false, found_object=0x7fff60c9de83) at src/libsystemd/sd-bus/bus-objects.c:1224
        n = 0x7f5b1c548e10
        vtable_key = {
          path = 0x7f5b1c484ea8 "/org/freedesktop/systemd1", 
          interface = 0x7f5b1c484ee8 "org.freedesktop.systemd1.Manager", 
          member = 0x7f5b1c484ed0 "TryRestartUnit", 
          parent = 0x8c777b69f5dd8c00, 
          last_iteration = 0, 
          vtable = 0x7f5b1bbc95d3 <_start>
        }
        v = 0x7f5b1c4854b0
        r = 0
        __PRETTY_FUNCTION__ = "object_find_and_run"
#17 0x00007f5b1bc83f33 in bus_process_object (bus=0x7f5b1c47cfc0, m=0x7f5b1c549010) at src/libsystemd/sd-bus/bus-objects.c:1340
        prefix = <error reading variable prefix (frame address is not available.)>
        r = 32603
        pl = 25
        found_object = true
        __PRETTY_FUNCTION__ = "bus_process_object"
#18 0x00007f5b1bc91239 in process_message (bus=0x7f5b1c47cfc0, m=0x7f5b1c549010) at src/libsystemd/sd-bus/sd-bus.c:2409
        r = 0
        __PRETTY_FUNCTION__ = "process_message"
        __func__ = "process_message"
#19 0x00007f5b1bc9140e in process_running (bus=0x7f5b1c47cfc0, hint_priority=false, priority=0, ret=0x0) at src/libsystemd/sd-bus/sd-bus.c:2451
        m = 0x7f5b1c549010
        r = 1
        __PRETTY_FUNCTION__ = "process_running"
        __func__ = "process_running"
#20 0x00007f5b1bc91c50 in bus_process_internal (bus=0x7f5b1c47cfc0, hint_priority=false, priority=0, ret=0x0) at src/libsystemd/sd-bus/sd-bus.c:2640
        _dont_destroy_bus = 0x7f5b1c47cfc0
        r = 32603
        __PRETTY_FUNCTION__ = "bus_process_internal"
#21 0x00007f5b1bc91d21 in sd_bus_process (bus=0x7f5b1c47cfc0, ret=0x0) at src/libsystemd/sd-bus/sd-bus.c:2659
No locals.
#22 0x00007f5b1bc924bd in io_callback (s=0x7f5b1c46b000, fd=19, revents=1, userdata=0x7f5b1c47cfc0) at src/libsystemd/sd-bus/sd-bus.c:2918
        bus = 0x7f5b1c47cfc0
        r = 32767
        __PRETTY_FUNCTION__ = "io_callback"
#23 0x00007f5b1bc987c4 in source_dispatch (s=0x7f5b1c46b000) at src/libsystemd/sd-event/sd-event.c:2103
        r = 0
        __PRETTY_FUNCTION__ = "source_dispatch"
        __func__ = "source_dispatch"
#24 0x00007f5b1bc99b44 in sd_event_dispatch (e=0x7f5b1c46ab10) at src/libsystemd/sd-event/sd-event.c:2460
        p = 0x7f5b1c46b000
        r = 32767
        __PRETTY_FUNCTION__ = "sd_event_dispatch"
#25 0x00007f5b1bc99cd5 in sd_event_run (e=0x7f5b1c46ab10, timeout=18446744073709551615) at src/libsystemd/sd-event/sd-event.c:2489
        r = 1
        __PRETTY_FUNCTION__ = "sd_event_run"
#26 0x00007f5b1bbd5ee9 in manager_loop (m=0x7f5b1c46a370) at src/core/manager.c:2010
        wait_usec = 18446744073709551615
        r = 1
        rl = {
          interval = 1000000, 
          begin = 37569410558, 
          burst = 50000, 
          num = 26
        }
        __PRETTY_FUNCTION__ = "manager_loop"
        __func__ = "manager_loop"
#27 0x00007f5b1bbceb93 in main (argc=4, argv=0x7fff60c9e468) at src/core/main.c:1731
        m = 0x7f5b1c46a370
        r = 0
        retval = 1
        before_startup = 37382632984
        after_startup = 0
        timespan = "\004\000\000\000\000\000\000\000\240\343\311`\377\177\000\000\330t\265\033[\177\000\000\000\000\000\000\000\000\000\000\002\000\000\000\300\237\000\000\r\206\315\033[\177\000\000\302", '\000' <repeats 14 times>
        fds = 0x0
        reexecute = false
        shutdown_verb = 0x0
        initrd_timestamp = {
          realtime = 0, 
          monotonic = 0
        }
        userspace_timestamp = {
          realtime = 1415856026408304, 
          monotonic = 37382632434
        }
        kernel_timestamp = {
          realtime = 1415818643775870, 
          monotonic = 0
        }
        security_start_timestamp = {
          realtime = 0, 
          monotonic = 0
        }
        security_finish_timestamp = {
          realtime = 0, 
          monotonic = 0
        }
        systemd = "systemd"
        skip_setup = true
        j = 32767
        loaded_policy = false
        arm_reboot_watchdog = false
        queue_default_job = false
        empty_etc = false
        switch_root_dir = 0x0
        switch_root_init = 0x0
        saved_rlimit_nofile = {
          rlim_cur = 1024, 
          rlim_max = 4096
        }
        __func__ = "main"
        __PRETTY_FUNCTION__ = "main"
Comment 20 Steven Noonan 2014-11-13 05:39:40 UTC
Created attachment 109381 [details]
systemd 317 -O0 -g core dump
Comment 21 Steven Noonan 2014-11-13 06:10:48 UTC
Created attachment 109383 [details] [review]
patch: don't try to merge a JOB_NOP

From my read of the code, here's what happens. The stack traces are slightly obfuscated by the fact we have duplicate values in the JobType enum, so mentally translate these when reading through it:

  _JOB_TYPE_MAX_IN_TRANSACTION is JOB_TRY_RESTART
  _JOB_TYPE_MAX_MERGING is JOB_NOP

The transaction arrives as a JOB_TRY_RESTART, but between the entry of manager_add_job and when it calls transaction_activate, the type changes to JOB_NOP. This appears to happen because of job_type_collapse, which notices the unit is already deactivating or inactive and modifies the passed JobType in place to be a JOB_NOP.

Deeper down the stack, transaction_is_destructive tries to use job_type_is_superset on the JOB_NOP, which will fail since it's not a mergeable job type. Skip the job_type_is_superset on a JOB_NOP and we should be good.

Possible patch attached. Not sure if it's the right approach, though. Still reading through the code to try to understand the transaction/merging semantics. :)
Comment 22 rmlipman 2014-11-27 18:46:01 UTC
Created attachment 110146 [details]
systemd 317 -g core dump (not -O0)

I get a similar crash, but instead of on shutdown, it happens either at random or on boot.

Trace for my situation follows:

(gdb) thread apply all bt full

Thread 1 (Thread 0x7f0a24a337c0 (LWP 593)):
#0  0x00007f0a23c070c9 in raise () from /usr/lib/libpthread.so.0
No symbol table info available.
#1  0x00007f0a24b1eeb8 in crash.lto_priv.222 (sig=6) at src/core/main.c:168
        rl = {rlim_cur = 18446744073709551615, rlim_max = 18446744073709551615}
        sa = {__sigaction_handler = {sa_handler = 0x0, sa_sigaction = 0x0}, 
          sa_mask = {__val = {0 <repeats 16 times>}}, sa_flags = 0, 
          sa_restorer = 0x0}
        __func__ = "crash"
        __PRETTY_FUNCTION__ = "crash"
#2  <signal handler called>
No symbol table info available.
#3  0x00007f0a23887967 in raise () from /usr/lib/libc.so.6
No symbol table info available.
#4  0x00007f0a23888d3a in abort () from /usr/lib/libc.so.6
No symbol table info available.
#5  0x00007f0a24b4dee2 in log_assert_failed (
    text=text@entry=0x7f0a24b65760 "a >= 0 && a < _JOB_TYPE_MAX_MERGING", 
    file=file@entry=0x7f0a24b580c0 "src/core/job.c", line=line@entry=314, 
    func=func@entry=0x7f0a24b76170 <__PRETTY_FUNCTION__.14671> "job_type_lookup_merge") at src/shared/log.c:718
No locals.
#6  0x00007f0a24aa756f in job_type_lookup_merge (a=<optimized out>, 
    b=<optimized out>) at src/core/job.c:314
        __PRETTY_FUNCTION__ = "job_type_lookup_merge"
#7  0x00007f0a24b21825 in job_type_is_superset () at src/core/job.h:198
No locals.
#8  transaction_is_destructive (e=<optimized out>, mode=<optimized out>, 
    tr=<optimized out>) at src/core/transaction.c:516
        i = 0xffffffffffffffff
        j = 0x0
#9  transaction_activate (e=<optimized out>, mode=<optimized out>, 
    m=<optimized out>, tr=<optimized out>) at src/core/transaction.c:722
        i = 0xffffffffffffffff
        j = 0x0
#10 manager_add_job (m=0x7f0a26420380, type=_JOB_TYPE_MAX_MERGING, unit=0x6, 
    mode=_JOB_MODE_INVALID, override=false, e=0x1, _ret=0x7fff32c31fb0)
    at src/core/manager.c:1224
        __PRETTY_FUNCTION__ = "manager_add_job"
        __func__ = "manager_add_job"
#11 0x00007f0a24acd015 in bus_unit_queue_job (bus=0x7f0a26434b90, 
    message=0x7f0a26504ae0, u=0x7f0a264cb770, 
    type=_JOB_TYPE_MAX_IN_TRANSACTION, mode=JOB_FAIL, reload_if_possible=true, 
    error=0x7fff32c32140) at src/core/dbus-unit.c:777
        path = 0x6 <error: Cannot access memory at address 0x6>
        j = 0x4b33a1ae32564764
        __PRETTY_FUNCTION__ = "bus_unit_queue_job"
#12 0x00007f0a24acd436 in bus_unit_method_start_generic (bus=0x7f0a26434b90, 
    message=0x7f0a26504ae0, u=0x7f0a264cb770, 
    job_type=_JOB_TYPE_MAX_IN_TRANSACTION, 
    reload_if_possible=reload_if_possible@entry=false, error=0x7fff32c32140)
    at src/core/dbus-unit.c:383
        smode = 0x7f0a265011d8 "fail"
        mode = JOB_FAIL
        r = 0
        __PRETTY_FUNCTION__ = "bus_unit_method_start_generic"
#13 0x00007f0a24b3df56 in method_start_unit_generic (bus=0x7f0a26434b90, 
    message=0x7f0a26504ae0, m=0x7f0a26420380, 
    job_type=_JOB_TYPE_MAX_IN_TRANSACTION, reload_if_possible=<optimized out>, 
    error=0x7fff32c32140) at src/core/dbus-manager.c:478
        name = 0x7f0a265011c4 "dnsmasq.service"
        u = 0x7f0a264cb770
        r = <optimized out>
        __PRETTY_FUNCTION__ = "method_start_unit_generic"
#14 0x00007f0a24b2df8e in method_callbacks_run (found_object=<optimized out>, 
    require_fallback=<optimized out>, c=<optimized out>, m=<optimized out>, 
    bus=<optimized out>) at src/libsystemd/sd-bus/bus-objects.c:400
        slot = 0x0
        u = 0x7f0a26420380
        error = {name = 0x0, message = 0x0, _need_free = 0}
        signature = 0xffffffffffffffff <error: Cannot access memory at address 0xffffffffffffffff>
#15 object_find_and_run.lto_priv.239 (bus=0x7f0a26434b90, m=0x7f0a26504ae0, 
    p=0x6 <error: Cannot access memory at address 0x6>, 
    require_fallback=false, found_object=0x7fff32c322b0)
    at src/libsystemd/sd-bus/bus-objects.c:1224
        vtable_key = {path = 0x7f0a26501128 "/org/freedesktop/systemd1", 
          interface = 0x7f0a26501168 "org.freedesktop.systemd1.Manager", 
          member = 0x7f0a26501150 "TryRestartUnit", parent = 0x1, 
          last_iteration = 0, 
          vtable = 0x7f0a24b42681 <bus_read_message.constprop.160+49>}
        __PRETTY_FUNCTION__ = "object_find_and_run"
#16 0x00007f0a24b4386b in bus_process_object (m=<optimized out>, 
    bus=0x7f0a26434b90) at src/libsystemd/sd-bus/bus-objects.c:1340
        prefix = 0x7fff32c321e0 "\220KC&\n\177"
        r = <optimized out>
        pl = <optimized out>
        found_object = true
#17 process_message (m=<optimized out>, bus=0x7f0a26434b90)
    at src/libsystemd/sd-bus/sd-bus.c:2409
        r = <optimized out>
#18 process_running (priority=0, ret=0x0, hint_priority=false, 
    bus=0x7f0a26434b90) at src/libsystemd/sd-bus/sd-bus.c:2451
        m = <optimized out>
        r = <optimized out>
#19 bus_process_internal.constprop.164 (bus=0x7f0a26434b90, ret=0x0, 
    priority=0, hint_priority=false) at src/libsystemd/sd-bus/sd-bus.c:2640
        _dont_destroy_bus = 0x7f0a26434b90
#20 0x00007f0a24ae1871 in sd_bus_process (ret=0x0, bus=<optimized out>)
    at src/libsystemd/sd-bus/sd-bus.c:2659
No locals.
#21 io_callback (s=<optimized out>, fd=<optimized out>, 
    revents=<optimized out>, userdata=<optimized out>)
    at src/libsystemd/sd-bus/sd-bus.c:2918
        bus = <optimized out>
        __PRETTY_FUNCTION__ = "io_callback"
#22 0x00007f0a24b0bdf0 in source_dispatch (s=0x7f0a264e1a00)
    at src/libsystemd/sd-event/sd-event.c:2103
        __PRETTY_FUNCTION__ = "source_dispatch"
        __func__ = "source_dispatch"
#23 0x00007f0a24b113fe in sd_event_dispatch (e=0x7f0a264207b0)
    at src/libsystemd/sd-event/sd-event.c:2460
        p = <optimized out>
        r = <optimized out>
        e = 0x7f0a264207b0
        __PRETTY_FUNCTION__ = "sd_event_dispatch"
#24 0x00007f0a24af8d20 in sd_event_run (timeout=<optimized out>, 
    e=<optimized out>) at src/libsystemd/sd-event/sd-event.c:2489
        r = 0
#25 manager_loop (m=0x7f0a26420380) at src/core/manager.c:2010
        wait_usec = 18446744073709551615
        rl = {interval = 1000000, begin = 16571505, burst = 50000, num = 135}
        __PRETTY_FUNCTION__ = "manager_loop"
        __func__ = "manager_loop"
#26 0x00007f0a24aa4845 in main (argc=1, argv=<optimized out>)
    at src/core/main.c:1731
        m = 0x7f0a26420380
        r = <optimized out>
        retval = 1
        before_startup = <optimized out>
        after_startup = <optimized out>
        timespan = "\310\001\000\000\000\000\000\000\310\001\000\000\000\000\000\000\310\001\000\000\000\000\000\000$\000\000\000\000\000\000\000$\000\000\000\000\000\000\000\004\000\000\000\000\000\000\000P\345td\004\000\000\000\210+\001\000\000\000\000"
        fds = 0x0
        reexecute = false
        shutdown_verb = 0x0
        initrd_timestamp = {realtime = 0, monotonic = <optimized out>}
        userspace_timestamp = {realtime = 1417111964157395, 
          monotonic = 1925963}
        kernel_timestamp = {realtime = <optimized out>, monotonic = 0}
        security_start_timestamp = {realtime = 1417111964159510, 
          monotonic = 1928079}
        security_finish_timestamp = {realtime = 1417111964160112, 
          monotonic = 1928680}
        systemd = "systemd"
        skip_setup = <optimized out>
        j = <optimized out>
        loaded_policy = false
        arm_reboot_watchdog = false
        queue_default_job = <optimized out>
        empty_etc = false
        switch_root_dir = 0x0
        switch_root_init = 0x0
        __func__ = "main"
        __PRETTY_FUNCTION__ = "main"
Comment 23 Steven Noonan 2014-11-27 19:57:52 UTC
Yes, I should have mentioned that it happens more or less randomly. In my case it wasn't at shutdown.
Comment 24 Lennart Poettering 2014-12-03 01:39:17 UTC
Any chance you could check if current git fixes your issue? transaction_is_destructive() has been change quite a bit, to fix a few bugs.
Comment 25 Lennart Poettering 2014-12-09 00:57:15 UTC
Pretty sure this is fixed now. Closing bug. Feel free to reopen if problem persists.

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.