Bug 98283

Summary: Frequent crashes of ModemManager with a SIGSEGV (dereferencing 0x20 address...)
Product: ModemManager Reporter: Benoît Donnette <benoit.donnette>
Component: generalAssignee: ModemManager bug user <modemmanager>
Status: RESOLVED FIXED QA Contact:
Severity: critical    
Priority: medium    
Version: unspecified   
Hardware: x86-64 (AMD64)   
OS: Linux (All)   
Whiteboard:
i915 platform: i915 features:
Attachments: Protects 2 more dereferences against NULL

Description Benoît Donnette 2016-10-17 09:13:40 UTC
On a device equipped with several modems, we could see frequent crashes of the ModemManager process. Frequent = more than every 30 minutes.

Investigating on the hardware side (could well be the modems resetting because they're hot), but investigating why the ModemManager crashed, either.

I've got kernel messages about the signal, but no backtrace yet.

kernel: [ 6878.385210] ModemManager[13410]: segfault at 20 ip 00007f247f3eff15 sp 00007fff429de460 error 6 in libqmi-glib.so.5.1.0[7f247f397000+1b2000]

Not enough elements yet to knwow whether the bug actually lies in ModemManager or in libqmi.
Comment 1 Aleksander Morgado 2016-10-17 09:58:48 UTC
Which libqmi/ModemManager versions?

I'd love to see a backtrace.
Comment 2 Benoît Donnette 2016-10-17 11:24:20 UTC
So do I, therefore I'm prepring it. However, it could just be an incoherence between libqmi-proxy and libqmi...

I'm checking this.
Comment 3 Aleksander Morgado 2016-10-17 14:46:05 UTC
(In reply to Benoît Donnette from comment #2)
> So do I, therefore I'm prepring it. However, it could just be an incoherence
> between libqmi-proxy and libqmi...
> 
> I'm checking this.

Does qmi-proxy also segfault? I've got some reports of qmi-proxy segfaults but couldn't reproduce them or get backtraces either
Comment 4 Benoît Donnette 2016-10-17 14:54:50 UTC
Yes, I've got either ModemManager or qmi-proxy crashes (possibly sometimes both). Both may crash with a SIGSEGV. I suspect this is due to using the Debian testing ModemManager and its dependencies, but the Debian 8.5 mainstream qmi-proxy. They ldd each to a different libqmi.
Comment 5 Benoît Donnette 2016-10-17 14:57:03 UTC
"Testing" ModemManager = 1.6.0 for us.

I am pushing the qmi-proxy from testing as well. I'll have a huge field testing after that if it does no harm, so I'll let you know if the bug disappears.

It's on a device with 2 to 4 Cat6 modems, using qmi.
Comment 6 Aleksander Morgado 2016-10-24 11:05:02 UTC
Any update?
Comment 7 Benoît Donnette 2016-10-25 14:35:23 UTC
Well, yes.

I used to have 2 different crashes : one was qmi-proxy trying to read at address 0, the other being ModemManager.

I used to beleive they were connected : they're not.

The qmi-proxy crash comes from a qmi-proxy cohabitating with a ModemManager/libqmi on heterogeneous versions of the libqmi (Debian packaging used not to prevent from this : you can have libqmi-proxy version 1.14.0 and libqmi-glib5 version 1.16.0). Upgrading libqmi-proxy to match libqmi-glib fixes.

I'm now setting up core dumps on the devices for the second bug : a ModemManager crash while writing at address 0x20. And upgrading libqmi-proxy doesn't fix.

I'll keep you informed, I just managed to have a coredump produced bu a systemd daemon, now deploying to get a test. A genuine coredump should arrive within 2 days.
Comment 8 Aleksander Morgado 2016-10-25 17:48:16 UTC
(In reply to Benoît Donnette from comment #7)
> Well, yes.
> 
> I used to have 2 different crashes : one was qmi-proxy trying to read at
> address 0, the other being ModemManager.

I have my own qmi-proxy backtraces now for a quite easy to reproduce crash; I should be providing an update in the next few days.
Comment 9 Benoît Donnette 2016-10-27 15:49:17 UTC
ModemManager crash :

It's actually in libqmi. In qmi-device.c.

We're dereferencing a value after getting the transaction to be cancelled. No test if pointer gets NULL.

I have a stacktrace (upload tomorrow).
Comment 10 Aleksander Morgado 2016-10-27 22:51:45 UTC
(In reply to Benoît Donnette from comment #9)
> ModemManager crash :
> 
> It's actually in libqmi. In qmi-device.c.
> 
> We're dereferencing a value after getting the transaction to be cancelled.
> No test if pointer gets NULL.
> 
> I have a stacktrace (upload tomorrow).

See https://cgit.freedesktop.org/libqmi/commit/?id=0148e81aa978a5d94ef2e9ddf8adfefa7ce2ef3f

I believe the commit may fix the issue you're seeing (transaction pointer is NULL on the cancellation handler because it isn't found in the internal tracking table).

Could you validate this by using libqmi git master in your setup? I also added some other segfault fixes and several memleak fixes as well.
Comment 11 Benoît Donnette 2016-10-28 08:30:30 UTC
Hi,

Building a package and teing today (this afternoon is quite possible). I'll have a feedback, maybe on Monday (I'm supposed to be off but I can access the devices), and worst case on Wednesday.
Comment 12 Benoît Donnette 2016-11-03 09:16:28 UTC
(In reply to Benoît Donnette from comment #11)
> Hi,
> 
> Building a package and teing today (this afternoon is quite possible). I'll
> have a feedback, maybe on Monday (I'm supposed to be off but I can access
> the devices), and worst case on Wednesday.

Well, I still have crashes today after the patch, getting a backtrace of it.
Comment 13 Benoît Donnette 2016-11-03 13:04:20 UTC
Looks that my issue is in the transaction_cancelled() function, as the traces suggest the failing access is a dereference of an offset 0x20 after NULL, with a write access. This means it's a problem on the cancellable_id field.

The stack trace showed a SIGSEGV in the transaction_cancelled function, that' the element I didn't put here. I'm adding a NULL test around the line :

tr->cancellable_id = 0;
Comment 14 Aleksander Morgado 2016-11-13 10:36:22 UTC
Any luck with the backtrace?
Comment 15 Benoît Donnette 2016-11-14 15:06:04 UTC
Well, sure !

As s
Comment 16 Benoît Donnette 2016-11-14 15:12:19 UTC
(In reply to Benoît Donnette from comment #15)
> Well, sure !
> 
> As s

Well sure, I've anayzed it, but not been able to field test modification yet.

As said in my previous message, the crash takes place in :

transaction_cancelled()

I basically added a test against NULL in the result from the 2 remaining calls to device_release_transaction()

Deployment on a sample of the fiend tomorrow.
Comment 17 Benoît Donnette 2016-11-17 15:55:43 UTC
Created attachment 128043 [details]
Protects 2 more dereferences against NULL

Tested successfully today. A usecase that triggered at least 3 crashes before has not shown a single crash.
Comment 18 Aleksander Morgado 2016-12-08 23:29:20 UTC
(In reply to Benoît Donnette from comment #17)
> Created attachment 128043 [details]
> Protects 2 more dereferences against NULL
> 
> Tested successfully today. A usecase that triggered at least 3 crashes
> before has not shown a single crash.

This patch solves the segfault, but I believe the proper fix is to make sure the timeout and cancellations get cleaned up whenever the transaction gets removed from the HT; if we don't make sure of that we may get bitten by the issue somewhere else.
Comment 19 Aleksander Morgado 2017-01-15 22:05:01 UTC
Which device(s) where you using while testing this? Do you have any debug log that could indicate why the transaction was being cancelled?
Comment 20 Aleksander Morgado 2017-01-15 22:53:09 UTC
Nevermind; your analysis was mostly right, we do get a NULL tr in the cancellation callback, but it isn't an error. This happens when the cancellable given to a QMI command is already cancelled when we create the transaction:

https://cgit.freedesktop.org/libqmi/commit/?id=e57f062e666b9a0686e38e3722664b4b2ac9658c

Fixed in libqmi git master, qmi-1-16, qmi-1-14 and qmi-1-12.
Also fixed the equivalent issue in libmbim git master, mbim-1-12 and mbim-1-10.

Thanks!

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.