105873 – [i915]Failed to release pages: bind_count=1, pages_pin_count=1, pin_display=0 on kernel 4.14.20

Status:	CLOSED FIXED

Alias:	None

Product:	DRI
Classification:	Unclassified
Component:	DRM/Intel
Version:	unspecified
Hardware:	Other Linux (All)

Importance:	low normal
Assignee:	Intel GFX Bugs mailing list
QA Contact:	Intel GFX Bugs mailing list

URL:
Whiteboard:	for media stack (open source)
Keywords:

Depends on:
Blocks:

Reported:	2018-04-04 01:54 UTC by Owen Zhang
Modified:	2018-10-02 06:55 UTC
CC List:	2 users

See Also:	105272
i915 platform:	SKL
i915 features:	GPU hang

Attachments
dmesg_and_i915_error_states_dump (26.72 KB, text/plain) 2018-04-04 01:54 UTC, Owen Zhang	no flags
backported fixed patch (2.77 KB, patch) 2018-04-19 01:40 UTC, Owen Zhang	no flags
for reproducer (1017 bytes, text/plain) 2018-04-19 01:41 UTC, Owen Zhang	no flags
test binary (770.84 KB, application/x-sharedlib) 2018-04-19 01:42 UTC, Owen Zhang	no flags
test script (51 bytes, text/plain) 2018-04-19 01:44 UTC, Owen Zhang	no flags

Description Owen Zhang 2018-04-04 01:54:46 UTC

Created attachment 138562 [details]
dmesg_and_i915_error_states_dump

#: dmesg dump: 
kern  :warn  : [30996.849134] Failed to release pages: bind_count=1, pages_pin_count=1, pin_display=0
kern  :warn  : [30996.849443] ------------[ cut here ]------------
kern  :warn  : [30996.849761] WARNING: CPU: 2 PID: 13057 at drivers/gpu/drm/i915/i915_gem_userptr.c:89 cancel_userptr+0xdc/0xe0 [i915]
kern  :warn  : [30996.850091] Modules linked in: cmac arc4 md4 nls_utf8 cifs ccm dns_resolver vfat fat intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc iTCO_wdt ipmi_si aesni_intel iTCO_vendor_support crypto_simd ipmi_devintf glue_helper ppdev cryptd i2c_i801 ipmi_msghandler pcspkr sg lpc_ich parport_pc parport shpchp acpi_cpufreq ip_tables ext4 mbcache jbd2 sr_mod cdrom sd_mod igb ahci ptp libahci i915(OE) drm_kms_helper(OE) syscopyarea pps_core sysfillrect libata sysimgblt fb_sys_fops dca i2c_algo_bit drm(OE) crc32c_intel i2c_core video
kern  :warn  : [30996.852075] CPU: 2 PID: 13057 Comm: kworker/u32:4 Tainted: G     U     OE   4.14.20-dcg20180302-gcc4.8.5 #1
kern  :warn  : [30996.852530] Hardware name: Intel Corporation S1200RP/S1200RP, BIOS S1200RP.86B.03.00.0x43.122320141434 12/23/2014
kern  :warn  : [30996.853018] Workqueue: i915-userptr-release cancel_userptr [i915]
kern  :warn  : [30996.853499] task: ffff942edbd60000 task.stack: ffffa673d183c000
kern  :warn  : [30996.854002] RIP: 0010:cancel_userptr+0xdc/0xe0 [i915]
kern  :warn  : [30996.854500] RSP: 0018:ffffa673d183fe60 EFLAGS: 00010246
kern  :warn  : [30996.855000] RAX: 0000000000000047 RBX: ffff942eda2bc200 RCX: 0000000000000000
kern  :warn  : [30996.855516] RDX: 0000000000000000 RSI: ffff942ee0896978 RDI: ffff942ee0896978
kern  :warn  : [30996.856038] RBP: ffff942eda2bc3b0 R08: 0000000000000000 R09: 000000000000043d
kern  :warn  : [30996.856570] R10: 0000000000000001 R11: 0000000000aaaaaa R12: 0000000000000000
kern  :warn  : [30996.857104] R13: ffff942eddd3a400 R14: 0000000000000000 R15: ffff942dd7b853d8
kern  :warn  : [30996.857643] FS:  0000000000000000(0000) GS:ffff942ee0880000(0000) knlGS:0000000000000000
kern  :warn  : [30996.858202] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kern  :warn  : [30996.858764] CR2: 00007f735000f000 CR3: 000000035700a005 CR4: 00000000003606e0
kern  :warn  : [30996.859338] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
kern  :warn  : [30996.859922] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
kern  :warn  : [30996.860501] Call Trace:
kern  :warn  : [30996.861083]  process_one_work+0x141/0x340
kern  :warn  : [30996.861672]  worker_thread+0x47/0x3e0
kern  :warn  : [30996.862259]  kthread+0xfc/0x130
kern  :warn  : [30996.862849]  ? rescuer_thread+0x380/0x380
kern  :warn  : [30996.863425]  ? kthread_park+0x60/0x60
kern  :warn  : [30996.863986]  ret_from_fork+0x35/0x40
kern  :warn  : [30996.864554] Code: 05 11 00 00 75 dc 8b 93 d0 01 00 00 8b 8b ac 01 00 00 48 c7 c7 c8 97 61 c0 8b b3 a4 01 00 00 c6 05 e7 04 11 00 01 e8 71 a6 b7 dc <0f> ff eb b3 0f 1f 44 00 00 41 57 41 56 41 55 41 54 55 53 48 89 
kern  :warn  : [30996.865776] ---[ end trace 478aed6f3aca9d7a ]---

System: CentOS 7.4
Kernel version: 4.14.20 (from kernel.org)

Comment 1 Owen Zhang 2018-04-19 01:40:24 UTC

Created attachment 138910 [details] [review]
backported fixed patch

I manually backported this patch to 4.14.xx.

Comment 2 Owen Zhang 2018-04-19 01:41:03 UTC

Created attachment 138911 [details]
for reproducer

Comment 3 Owen Zhang 2018-04-19 01:42:13 UTC

Created attachment 138916 [details]
test binary

Comment 4 Owen Zhang 2018-04-19 01:44:36 UTC

Created attachment 138917 [details]
test script

Comment 5 Owen Zhang 2018-04-19 01:48:13 UTC

I bisected the kernel.org tree to find the fix patch; it was merged in 4.16-rc1.

commit b050e685044221099ed88748bfb6853a53c3d479
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Wed Dec 6 12:49:13 2017 +0000

    drm/i915: Remove vma from object on destroy, not close

    Originally we translated from the object to the vma by walking
    obj->vma_list to find the matching vm (for user lookups). Now we process
    user lookups using the rbtree, and we only use obj->vma_list itself for
    maintaining state (e.g. ensuring that all vma are flushed or rebound).
    As such maintenance needs to go on beyond the user's awareness of the
    vma, defer removal of the vma from the obj->vma_list from i915_vma_close()
    to i915_vma_destroy()

    Fixes: 5888fc9eac3c ("drm/i915: Flush pending GTT writes before unbinding")
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=104155
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
    Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20171206124914.19960-1-chris@chris-wilson.co.uk
    Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>

Reproducer for this issue:
1) Build this stack: https://software.intel.com/en-us/articles/build-and-debug-open-source-media-stack
2) Run:
./repr.sh

I have attached the backported fix patch. Could you tell us what the next step is? We need this backported to the 4.14 LTS kernel. Thanks.

Comment 6 Joonas Lahtinen 2018-04-19 08:41:04 UTC

Can you please clarify if this is a regression, and the bisect points to the commit fixing the regression? Or if this is a bisect to find the exact commit from drm-tip which fixes the issue.

Comment 7 Owen Zhang 2018-04-19 08:48:19 UTC

This isn't a regression. We only bisected the kernel.org tree to find the fix patch, then looked up the same patch in drm-tip.

(In reply to Joonas Lahtinen from comment #6)
> Can you please clarify if this is a regression, and the bisect points to the
> commit fixing the regression? Or if this is a bisect to find the exact
> commit from drm-tip which fixes the issue.

Comment 8 Joonas Lahtinen 2018-04-19 12:34:48 UTC

I'm still unsure what has been bisected. What criteria was used for the bisect and which tree was used for it?

Comment 9 Joonas Lahtinen 2018-04-19 12:41:04 UTC

Seems like I made a typo in my comment #6 so please let me clarify my questions. 

1. Is this a) a bisect to the commit that introduces a regression, or b) a bisect to the commit that fixes a bug?

2. Which criteria were used for bisecting?

3. Which tree was used for bisecting?

Can you also please include the git bisect command history.

Comment 10 Owen Zhang 2018-04-20 01:41:28 UTC

1. b): we bisected to the commit that fixed this bug.
2. We bisected manually.
  4.16-rc1 can't reproduce the issue, but 4.15-rc1 can. So we listed all i915 patches between 4.15-rc1 and 4.16-rc1 and manually chose a patch near the middle (one that did not depend on the patches before it) as the bisect point, then built and installed that kernel and tested again. We repeated this step until we found the fix.
3. We used this tree: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/log/?h=linux-4.16.y

We didn't use the git bisect command directly, so we don't have any git bisect command history. Thanks.

(In reply to Joonas Lahtinen from comment #9)
> Seems like I made a typo in my comment #6 so please let me clarify my
> questions. 
> 
> 1. Is this a) a bisect to the commit that introduces a regression, or b) a
> bisect to the commit that fixes a bug?
> 
> 2. Which criteria were used for bisecting?
> 
> 3. Which tree was used for bisecting?
> 
> Can you also please include the git bisect command history.

Comment 11 Jani Saarinen 2018-04-27 11:59:27 UTC

Jani, any help here?

Comment 12 Jani Nikula 2018-04-27 12:06:56 UTC

(In reply to Jani Saarinen from comment #11)
> Jani, any help here?

No, not my area.

Comment 13 Jani Saarinen 2018-05-04 12:32:32 UTC

Chris, how do you see this? Is the culprit yours?

Comment 14 Jani Saarinen 2018-05-17 09:49:52 UTC

Reporter, it would be really nice to see this done by git bisect, thanks.

Comment 15 Owen Zhang 2018-05-29 06:19:08 UTC

(In reply to Jani Saarinen from comment #14)
> Reporter, it would be really nice to see this done by git bisect, thanks.

The following is our captured bisect log. Because "git bisect" is designed to find regressions, the labels are inverted here:
"good" means the issue can be reproduced.
"bad" means the issue can't be reproduced, i.e. it is fixed.

the log:

git bisect start
# good: [0d7e76beaa92de8837af7d49a36058ee6cf8fe78] Merge tag 'gvt-next-2017-12-05' of https://github.com/intel/gvt-linux into drm-intel-next-queued
git bisect good 0d7e76beaa92de8837af7d49a36058ee6cf8fe78
# bad: [7125397b82460d74ae0584bdcdc006deec5e895d] drm/i915: Track GGTT writes on the vma
git bisect bad 7125397b82460d74ae0584bdcdc006deec5e895d
# good: [a655aeb34f7cae135a31a8a643314061ca6737e3] drm/i915/uc: Don't fetch GuC firmware if no plan to use GuC
git bisect good a655aeb34f7cae135a31a8a643314061ca6737e3
# good: [121981fafe699d9f398a3c717912ef4eae6719b1] drm/i915/guc: Combine enable_guc_loading|submission modparams
git bisect good 121981fafe699d9f398a3c717912ef4eae6719b1
# bad: [010e3e68cd9cb65ea50c0af605e966cda333cb2a] drm/i915: Remove vma from object on destroy, not close
git bisect bad 010e3e68cd9cb65ea50c0af605e966cda333cb2a
# good: [0dfa1cee613e03cee295b8d1ed8130c84311b584] drm/i915/huc: Load HuC only if requested
git bisect good 0dfa1cee613e03cee295b8d1ed8130c84311b584
# first bad commit: [010e3e68cd9cb65ea50c0af605e966cda333cb2a] drm/i915: Remove vma from object on destroy, not close

Comment 16 Dmitry Rogozhkin 2018-06-01 17:44:06 UTC

Hi, are there any other concerns before starting the process of applying the fix to the 4.14 LTS kernel? There are bisect results, there is an attached patch...

Comment 17 Joonas Lahtinen 2018-06-04 09:04:12 UTC

Right, so this is not a regression, but trying to find which patches fix the issue in new kernels. So tagging this as enhancement, which it is.

The referred patch here itself has a Fixes: tag, so it should be applied only on top of the previously applied patch. That means if picking only this patch helps, then it is through incredible luck only. And if the pointed patch was in LTS, the Fixes: tag should have made it get picked up automatically.

Also, from the bisect log, we got:

"first bad commit: [010e3e68cd9cb65ea50c0af605e966cda333cb2a] drm/i915: Remove vma from object on destroy, not close"

That would mean that introducing this patch actually breaks the usecase, so I'm having trouble following why it is suggested as the fix? Please provide a proper bisect log where you actually build each suggested commit and run your testcase for each step, tagging it as good or bad depending on if your testcase passes or fails (if the testcase can't reproduce the issue 100% of the time, then do enough runs on EACH step to make sure you have good confidence).

Comment 18 Owen Zhang 2018-06-04 11:02:31 UTC

(In reply to Joonas Lahtinen from comment #17)
> Right, so this is not a regression, but trying to find which patches fix the
> issue in new kernels. So tagging this as enhancement, which it is.
> 
> The referred patch here itself has a Fixes: tag, so it should be applied
> only on top of the previously applied patch. That means if picking only this
> patch helps, then it is through incredible luck only. And if the pointed
> patch was in LTS, the Fixes: tag should have made it get picked up
> automatically.
> 
> Also, from the bisect log, we got:
> 
> "first bad commit: [010e3e68cd9cb65ea50c0af605e966cda333cb2a] drm/i915:
> Remove vma from object on destroy, not close"
> 
> That would mean that introducing this patch actually breaks the usecase, so
> I'm having trouble following why it is suggested as the fix? Please provide
> a proper bisect log where you actually build each suggested commit and run
> your testcase for each step, tagging it as good or bad depending on if your
> testcase passes or fails (if the testcase can't reproduce the issue 100%
> of the time, then do enough runs on EACH step to make sure you have good
> confidence).

hi,
I explained earlier why I marked the commits as "bad":
""good" means the issue can be reproduced.
"bad" means the issue can't be reproduced, i.e. it is fixed."

"git bisect" is designed to find regressions, so it expects the "bad" commit to be newer than the "good" one. In this situation the first "bad" commit is therefore the fix patch. Thanks.
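
Editorial note: the inverted good/bad labels discussed above can be avoided with git's custom bisect terms (`--term-old` / `--term-new`, available in git 2.7 and later). The sketch below is not the procedure used in this bug; it builds a throwaway repository in which a hypothetical bug is "fixed" from the fourth commit onward, and the file name `state`, the commit subjects, and the `grep` check are all stand-ins for a real kernel build and test run.

```shell
#!/bin/sh
# Sketch: locate the commit that FIXES a bug, using custom bisect terms
# instead of overloading "good" and "bad".
set -eu

repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "demo@example.com"
git config user.name "Demo"

# Five commits; the hypothetical bug is present in 1-3 and fixed in 4-5.
for i in 1 2 3 4 5; do
    if [ "$i" -ge 4 ]; then echo fixed > state; else echo broken > state; fi
    echo "$i" > counter
    git add state counter
    git commit -q -m "commit $i"
done

first=$(git rev-list --max-parents=0 HEAD)
last=$(git rev-parse HEAD)

# old = bug still reproduces, new = bug no longer reproduces.
git bisect start --term-old=broken --term-new=fixed >/dev/null
git bisect broken "$first" >/dev/null
git bisect fixed "$last" >/dev/null

# `git bisect run` maps exit 0 to the old term and exit 1 to the new term,
# so the test command must succeed while the bug is still present.
git bisect run sh -c 'grep -q broken state' >/dev/null

# Read the result before resetting; the bisect refs are removed by reset.
fix_subject=$(git log -1 --format=%s refs/bisect/fixed)
echo "first fixed commit: $fix_subject"
git bisect reset >/dev/null 2>&1
```

With this layout the script reports "commit 4" as the first fixed commit, and the resulting log reads "broken"/"fixed" rather than "good"/"bad", which avoids the ambiguity that came up in this thread.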

Comment 19 Dmitry Rogozhkin 2018-06-05 16:18:31 UTC

@Joonas: we are bug _reporters_ and we don't support the i915 driver. You support it. We are sorry if we did something wrong that violates some internal guidelines you follow inside the team or even the kernel's. From my perspective Owen did a lot of work: 1) the bisect was done, 2) a patch was suggested. If you don't like the "Fixes" in the patch or you don't like the bisect results, well, we don't know your internal team guidelines... This discussion is starting to lead nowhere. I don't think it is proper to request that bug reporters do the job for you.

Also, I completely disagree that this is an enhancement. Call things by their names. You have a memory leak in the i915 driver on the 4.14 LTS kernel. That's a normal bug and nothing else.

Comment 20 Joonas Lahtinen 2018-06-06 10:25:57 UTC

Right, after clearing up the confusion around the issue: this is simply the result of the LTS kernel picking a Cc: stable patch, but not picking the patch with the Fixes: tag for it (which is the attached patch). I've instructed the reporter to send a request for inclusion to the stable mailing list according to the LTS process.
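
Editorial note: the situation described above (the commit named in a Fixes: tag is present in a branch, but the follow-up fix is not) can be checked by searching the branch history. The sketch below is only an illustration in a throwaway repository: the single empty commit stands in for a stable backport of 5888fc9eac3c, and the "commit ... upstream." body line imitates the way stable backports usually quote the upstream commit id (real backports quote the full 40-character hash, abbreviated here). The `verdict` strings are hypothetical.

```shell
#!/bin/sh
# Sketch: detect "dependency present, follow-up fix missing" in a branch.
set -eu

repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "demo@example.com"
git config user.name "Demo"

# Stand-in for the stable backport of the commit named in the Fixes: tag.
git commit -q --allow-empty \
    -m "drm/i915: Flush pending GTT writes before unbinding" \
    -m "commit 5888fc9eac3c upstream."

# Search commit messages for the dependency and for the follow-up fix.
have_dep=$(git log --format=%s --grep='5888fc9eac3c')
have_fix=$(git log --format=%s --grep='Remove vma from object on destroy')

if [ -n "$have_dep" ] && [ -z "$have_fix" ]; then
    verdict="follow-up fix missing"
else
    verdict="no action from this check"
fi
echo "$verdict"
```

Run against a real stable branch (for example by passing a branch name to `git log`), the same two searches would show whether a backport request for the follow-up patch is needed.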

Comment 21 Dmitry Rogozhkin 2018-06-08 18:33:21 UTC

Was the request sent? Please provide the status here, since not everyone monitors the list... Thank you.

Comment 22 Owen Zhang 2018-06-09 03:42:56 UTC

Hi Dmitry, the request has been sent. I will update here when I receive any information. Thanks.

(In reply to Dmitry Rogozhkin from comment #21)
> Was the request sent? Please provide the status here, since not everyone
> monitors the list... Thank you.

Comment 23 Francesco Balestrieri 2018-07-02 06:41:36 UTC

I'm going to close this bug, let's follow the patch inclusion separately.

Comment 24 Jani Nikula 2018-10-01 08:57:12 UTC

(In reply to Owen Zhang from comment #22)
> Hi Dmitry, the request has been sent. I will update here when I receive any
> information. Thanks.

When doing backport requests, please Cc: the maintainers. At least I wasn't Cc'd.

Please provide a link to the backport request on the list archive.

Did you get a response?

Comment 25 Owen Zhang 2018-10-01 23:52:27 UTC

(In reply to Jani Nikula from comment #24)
> (In reply to Owen Zhang from comment #22)
> > Hi Dmitry, the request has been sent. I will update here when I receive any
> > information. Thanks.
> 
> When doing backport requests, please Cc: the maintainers. At least I wasn't
> Cc'd.
> 
> Please provide a link to the backport request on the list archive.
> 
> Did you get a response?

Hi Jani,
I sent this request to "stable@vger.kernel.org", but I haven't received any response.
I will add you and send it again. Sorry about this, and thanks a lot.

Comment 26 Jani Nikula 2018-10-02 06:55:08 UTC

I sent another stable backport request. It hasn't shown up on the list archives yet, but the message id is 871s98ly2g.fsf@intel.com

Resolving FIXED; this was our bug that was also fixed upstream.
