Created attachment 138562 [details]
dmesg_and_i915_error_states_dump

#: dmesg dump:
kern :warn : [30996.849134] Failed to release pages: bind_count=1, pages_pin_count=1, pin_display=0
kern :warn : [30996.849443] ------------[ cut here ]------------
kern :warn : [30996.849761] WARNING: CPU: 2 PID: 13057 at drivers/gpu/drm/i915/i915_gem_userptr.c:89 cancel_userptr+0xdc/0xe0 [i915]
kern :warn : [30996.850091] Modules linked in: cmac arc4 md4 nls_utf8 cifs ccm dns_resolver vfat fat intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc iTCO_wdt ipmi_si aesni_intel iTCO_vendor_support crypto_simd ipmi_devintf glue_helper ppdev cryptd i2c_i801 ipmi_msghandler pcspkr sg lpc_ich parport_pc parport shpchp acpi_cpufreq ip_tables ext4 mbcache jbd2 sr_mod cdrom sd_mod igb ahci ptp libahci i915(OE) drm_kms_helper(OE) syscopyarea pps_core sysfillrect libata sysimgblt fb_sys_fops dca i2c_algo_bit drm(OE) crc32c_intel i2c_core video
kern :warn : [30996.852075] CPU: 2 PID: 13057 Comm: kworker/u32:4 Tainted: G U OE 4.14.20-dcg20180302-gcc4.8.5 #1
kern :warn : [30996.852530] Hardware name: Intel Corporation S1200RP/S1200RP, BIOS S1200RP.86B.03.00.0x43.122320141434 12/23/2014
kern :warn : [30996.853018] Workqueue: i915-userptr-release cancel_userptr [i915]
kern :warn : [30996.853499] task: ffff942edbd60000 task.stack: ffffa673d183c000
kern :warn : [30996.854002] RIP: 0010:cancel_userptr+0xdc/0xe0 [i915]
kern :warn : [30996.854500] RSP: 0018:ffffa673d183fe60 EFLAGS: 00010246
kern :warn : [30996.855000] RAX: 0000000000000047 RBX: ffff942eda2bc200 RCX: 0000000000000000
kern :warn : [30996.855516] RDX: 0000000000000000 RSI: ffff942ee0896978 RDI: ffff942ee0896978
kern :warn : [30996.856038] RBP: ffff942eda2bc3b0 R08: 0000000000000000 R09: 000000000000043d
kern :warn : [30996.856570] R10: 0000000000000001 R11: 0000000000aaaaaa R12: 0000000000000000
kern :warn : [30996.857104] R13: ffff942eddd3a400 R14: 0000000000000000 R15: ffff942dd7b853d8
kern :warn : [30996.857643] FS: 0000000000000000(0000) GS:ffff942ee0880000(0000) knlGS:0000000000000000
kern :warn : [30996.858202] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kern :warn : [30996.858764] CR2: 00007f735000f000 CR3: 000000035700a005 CR4: 00000000003606e0
kern :warn : [30996.859338] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
kern :warn : [30996.859922] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
kern :warn : [30996.860501] Call Trace:
kern :warn : [30996.861083] process_one_work+0x141/0x340
kern :warn : [30996.861672] worker_thread+0x47/0x3e0
kern :warn : [30996.862259] kthread+0xfc/0x130
kern :warn : [30996.862849] ? rescuer_thread+0x380/0x380
kern :warn : [30996.863425] ? kthread_park+0x60/0x60
kern :warn : [30996.863986] ret_from_fork+0x35/0x40
kern :warn : [30996.864554] Code: 05 11 00 00 75 dc 8b 93 d0 01 00 00 8b 8b ac 01 00 00 48 c7 c7 c8 97 61 c0 8b b3 a4 01 00 00 c6 05 e7 04 11 00 01 e8 71 a6 b7 dc <0f> ff eb b3 0f 1f 44 00 00 41 57 41 56 41 55 41 54 55 53 48 89
kern :warn : [30996.865776] ---[ end trace 478aed6f3aca9d7a ]---

System: CentOS7.4
the kernel version: 4.14.20(from kernel org)
Created attachment 138910 [details] [review]
backported fixed patch

I manually backported this patch to 4.14.xx.
Created attachment 138911 [details] for reproducer
Created attachment 138916 [details] test binary
Created attachment 138917 [details] test script
I bisected to find the patch that fixes this issue in the kernel.org tree; the patch was merged in 4.16-rc1.

commit b050e685044221099ed88748bfb6853a53c3d479
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date: Wed Dec 6 12:49:13 2017 +0000

    drm/i915: Remove vma from object on destroy, not close

    Originally we translated from the object to the vma by walking obj->vma_list to find the matching vm (for user lookups). Now we process user lookups using the rbtree, and we only use obj->vma_list itself for maintaining state (e.g. ensuring that all vma are flushed or rebound). As such maintenance needs to go on beyond the user's awareness of the vma, defer removal of the vma from the obj->vma_list from i915_vma_close() to i915_vma_destroy()

    Fixes: 5888fc9eac3c ("drm/i915: Flush pending GTT writes before unbinding")
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=104155
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
    Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20171206124914.19960-1-chris@chris-wilson.co.uk
    Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>

Reproducer for this issue:
1) Build this stack: https://software.intel.com/en-us/articles/build-and-debug-open-source-media-stack
2) Run: ./repr.sh

I have attached the backported fix patch. Could you tell us what the next step is? We need this fix backported to the 4.14 LTS kernel. Thanks.
Can you please clarify if this is a regression, and the bisect points to the commit fixing the regression? Or if this is a bisect to find the exact commit from drm-tip which fixes the issue.
This isn't a regression. We only bisected to find the fix patch in the kernel.org tree, then looked up the same patch in drm-tip.

(In reply to Joonas Lahtinen from comment #6)
> Can you please clarify if this is a regression, and the bisect points to the
> commit fixing the regression? Or if this is a bisect to find the exact
> commit from drm-tip which fixes the issue.
I'm still unsure what has been bisected. What criteria was used for the bisect and which tree was used for it?
Seems like I made a typo in my comment #6 so please let me clarify my questions.

1. Is this a) a bisect to the commit that introduces a regression, or b) a bisect to the commit that fixes a bug?

2. Which criteria were used for bisecting?

3. Which tree was used for bisecting?

Can you also please include the git bisect command history.
1. b): we bisected to the one commit that fixes this bug.
2. We bisected manually. 4.16-rc1 does not reproduce the issue, but 4.15-rc1 does. So we listed all i915 patches from 4.15-rc1 to 4.16-rc1 and manually picked a patch near the middle (one that did not depend on the patches before it) as the bisect point, then built and installed that kernel and tested again. We then repeated this step...
3. We used this tree: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/log/?h=linux-4.16.y

We did not use the git bisect command directly, so we have no git bisect command history. Thanks.

(In reply to Joonas Lahtinen from comment #9)
> Seems like I made a typo in my comment #6 so please let me clarify my
> questions.
>
> 1. Is this a) a bisect to the commit that introduces a regression, or b) a
> bisect to the commit that fixes a bug?
>
> 2. Which criteria were used for bisecting?
>
> 3. Which tree was used for bisecting?
>
> Can you also please include the git bisect command history.
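The manual procedure described in this comment amounts to a binary search over an ordered list of patches. The following is only a minimal sketch of that idea, not the reporter's actual tooling: the names `first_fixed` and `reproduces`, and the commit labels in the demo, are hypothetical. It assumes the list is ordered oldest to newest, that the bug reproduces at the first entry, does not reproduce at the last entry, and that the outcome flips exactly once in between.

```python
from typing import Callable, Sequence


def first_fixed(commits: Sequence[str], reproduces: Callable[[str], bool]) -> str:
    """Return the first commit at which the bug no longer reproduces.

    Assumptions (hypothetical, for illustration only):
    - commits is ordered oldest to newest and has at least two entries,
    - the bug reproduces at commits[0] and not at commits[-1],
    - the outcome changes exactly once along the list.
    """
    lo, hi = 0, len(commits) - 1  # lo: known broken, hi: known fixed
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if reproduces(commits[mid]):
            lo = mid  # still broken, so the fix is later
        else:
            hi = mid  # already fixed, so the fix is here or earlier
    return commits[hi]


# Demo with made-up commit labels; in practice reproduces() would mean
# "build and install this kernel, then run the reproducer".
demo_commits = ["c0", "c1", "c2", "c3", "c4", "c5", "c6", "c7"]
demo_result = first_fixed(demo_commits, lambda c: demo_commits.index(c) < 5)
```

Each call to `reproduces` stands for one build, install and test cycle, so the number of kernels to build grows only logarithmically with the number of candidate patches.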
Jani, any help here?
(In reply to Jani Saarinen from comment #11)
> Jani, any help here?

No, not my area.
Chris, how do you see this, is the culprit yours?
Reporter, it would be really nice to see this done by git bisect, thanks
(In reply to Jani Saarinen from comment #14)
> Reporter, it would be really nice to see this done by git bisect, thanks

The following is our captured bisect log. Because the "git bisect" command is designed for finding regressions, we used its terms in the inverted sense:
"good" means the issue can be reproduced.
"bad" means the issue can't be reproduced, i.e. the issue is fixed.

The log:
git bisect start
# good: [0d7e76beaa92de8837af7d49a36058ee6cf8fe78] Merge tag 'gvt-next-2017-12-05' of https://github.com/intel/gvt-linux into drm-intel-next-queued
git bisect good 0d7e76beaa92de8837af7d49a36058ee6cf8fe78
# bad: [7125397b82460d74ae0584bdcdc006deec5e895d] drm/i915: Track GGTT writes on the vma
git bisect bad 7125397b82460d74ae0584bdcdc006deec5e895d
# good: [a655aeb34f7cae135a31a8a643314061ca6737e3] drm/i915/uc: Don't fetch GuC firmware if no plan to use GuC
git bisect good a655aeb34f7cae135a31a8a643314061ca6737e3
# good: [121981fafe699d9f398a3c717912ef4eae6719b1] drm/i915/guc: Combine enable_guc_loading|submission modparams
git bisect good 121981fafe699d9f398a3c717912ef4eae6719b1
# bad: [010e3e68cd9cb65ea50c0af605e966cda333cb2a] drm/i915: Remove vma from object on destroy, not close
git bisect bad 010e3e68cd9cb65ea50c0af605e966cda333cb2a
# good: [0dfa1cee613e03cee295b8d1ed8130c84311b584] drm/i915/huc: Load HuC only if requested
git bisect good 0dfa1cee613e03cee295b8d1ed8130c84311b584
# first bad commit: [010e3e68cd9cb65ea50c0af605e966cda333cb2a] drm/i915: Remove vma from object on destroy, not close
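Because the inverted meaning of "good" and "bad" is easy to misread, here is a small sketch of the mapping used in this log. It is an illustration only: the function names are hypothetical, and the marker string is simply the warning text from the dmesg dump in the first comment, assumed here to be a sufficient sign that the bug triggered.

```python
# Warning text taken from the dmesg dump in this report; treating it as the
# sole indicator of the bug is an assumption made for this sketch.
BUG_MARKER = "Failed to release pages"


def bug_reproduced(dmesg_text: str) -> bool:
    """Return True if the captured dmesg output contains the warning."""
    return BUG_MARKER in dmesg_text


def bisect_mark(dmesg_text: str) -> str:
    """Return the word to pass to `git bisect` when hunting for a fix.

    Inverted convention, as used in this log:
    "good" = the bug still reproduces (old state),
    "bad"  = the bug no longer reproduces (the fix is present).
    """
    return "good" if bug_reproduced(dmesg_text) else "bad"
```

With this convention the "first bad commit" reported by git is the first commit where the bug is gone. Newer git versions also accept the neutral terms `git bisect old` and `git bisect new`, which can avoid this kind of confusion when searching for a fix rather than a regression.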
Hi, are there any other concerns to start process to apply the fix for 4.14 LTS kernel? There are bisect results, there is an attached patch...
Right, so this is not a regression, but trying to find which patches fix the issue in new kernels. So tagging this as enhancement, which it is.

The referred patch here itself has a Fixes: tag, so it should be applied only on top of the previously applied patch. That means if picking only this patch helps, then it is through incredible luck only. And if the pointed patch was in LTS, the Fixes: tag should have made it get picked up automatically.

Also, from the bisect log, we got:

"first bad commit: [010e3e68cd9cb65ea50c0af605e966cda333cb2a] drm/i915: Remove vma from object on destroy, not close"

That would mean that introducing this patch actually breaks the usecase, so I'm having trouble following why it is suggested as the fix? Please provide a proper bisect log where you actually build each suggested commit and run your testcase for each step, tagging it as good or bad depending on if your testcase passes or fails (if the testcase can't reproduce the issue 100% of the time, then do enough runs on EACH step to make sure you have good confidence).
(In reply to Joonas Lahtinen from comment #17)
> Right, so this is not a regression, but trying to find which patches fix the
> issue in new kernels. So tagging this as enhancement, which it is.
>
> The referred patch here itself has a Fixes: tag, so it should be applied
> only on top of the previously applied patch. That means if picking only this
> patch helps, then it is through incredible luck only. And if the pointed
> patch was in LTS, the Fixes: tag should have made it get picked up
> automatically.
>
> Also, from the bisect log, we got:
>
> "first bad commit: [010e3e68cd9cb65ea50c0af605e966cda333cb2a] drm/i915:
> Remove vma from object on destroy, not close"
>
> That would mean that introducing this patch actually breaks the usecase, so
> I'm having trouble following why it is suggested as the fix? Please provide
> a proper bisect log where you actually build each suggested commit and run
> your testcase for each step, tagging it as good or bad depending on if your
> testcase passes or fails (if the testcase can't reproduce the issue 100%
> of the time, then do enough runs on EACH step to make sure you have good
> confidence).

Hi, I explained in my earlier comment why I marked commits as "bad":

"good" means the issue can be reproduced.
"bad" means the issue can't be reproduced, i.e. the issue is fixed.

Since "git bisect" is meant for finding regressions, it expects the older state to be "good" and the newer state to be "bad". So in this situation the first "bad" patch is the patch that fixes the issue. Thanks.
@Joonas: we are bug _reporters_ and we don't support the i915 driver. You support it. We are sorry if we did something wrong that violates some internal guidelines you follow inside the team, or even kernel guidelines. From my perspective Owen did a lot of work: 1) the bisect was done, 2) a patch was suggested. If you don't like "Fixes" in the patch or you don't like the bisect results, well, we don't know your internal team guidelines... This discussion is starting to lead nowhere. I don't think it is proper to ask bug reporters to do the job for you.

Also, I completely disagree that this is an enhancement. Call things by their names. You have a memory leak in the i915 driver on the 4.14 LTS kernel. That's a normal bug and nothing else.
Right, after clearing up the confusion around the issue: this issue is simply the result of the LTS kernel picking a Cc: stable patch, but not picking the patch carrying the Fixes: tag for it (which is the attached patch). I've instructed the reporter to send a request for inclusion to the stable mailing list according to the LTS process.
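To illustrate the situation described here, the sketch below extracts the Fixes: trailer from a commit message and checks whether the commit it names is already present in a tree, which would mean the fixing patch is wanted there too. This is a hedged illustration, not part of the stable process tooling: the function names are hypothetical, and `applied_upstream_ids` is assumed to be a collection of upstream commit ids already backported to the tree, which the caller would have to gather separately.

```python
import re
from typing import Iterable, List

# Matches trailers such as: Fixes: 5888fc9eac3c ("drm/i915: ...")
FIXES_RE = re.compile(r"^\s*Fixes:\s*([0-9a-f]{7,40})\b", re.MULTILINE)


def fixes_targets(commit_message: str) -> List[str]:
    """Return the (possibly abbreviated) commit ids named in Fixes: trailers."""
    return FIXES_RE.findall(commit_message)


def needs_backport(commit_message: str, applied_upstream_ids: Iterable[str]) -> bool:
    """Return True if a commit named in a Fixes: trailer is already applied.

    applied_upstream_ids is a hypothetical input: upstream commit ids that the
    tree already contains. Ids are compared by prefix because Fixes: trailers
    usually carry abbreviated hashes.
    """
    applied = [sha for sha in applied_upstream_ids if sha]
    return any(
        sha.startswith(target) or target.startswith(sha)
        for target in fixes_targets(commit_message)
        for sha in applied
    )


# Trailer lines taken from the commit message quoted earlier in this report.
demo_message = (
    "drm/i915: Remove vma from object on destroy, not close\n"
    "\n"
    'Fixes: 5888fc9eac3c ("drm/i915: Flush pending GTT writes before unbinding")\n'
    "Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>\n"
)
demo_needed = needs_backport(demo_message, ["5888fc9eac3c"])
```

In the case discussed in this report, the commit named in the Fixes: trailer had reached the LTS kernel while the patch fixing it had not, which is the condition `needs_backport` is meant to flag.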
Was the request sent? Please provide the status here since not everyone monitors the list... Thank you.
Hi Dmitry, the request has been sent out. I will update here when I receive any information. Thanks.

(In reply to Dmitry Rogozhkin from comment #21)
> Was the request sent? Please provide the status here since not everyone
> monitors the list... Thank you.
I'm going to close this bug, let's follow the patch inclusion separately.
(In reply to Owen Zhang from comment #22)
> Hi Dmitry, the request has been sent out. I will update here when I receive any
> information. Thanks.

When doing backport requests, please Cc: the maintainers. At least I wasn't Cc'd.

Please provide a link to the backport request on the list archive.

Did you get a response?
(In reply to Jani Nikula from comment #24)
> (In reply to Owen Zhang from comment #22)
> > Hi Dmitry, the request has been sent out. I will update here when I receive any
> > information. Thanks.
>
> When doing backport requests, please Cc: the maintainers. At least I wasn't
> Cc'd.
>
> Please provide a link to the backport request on the list archive.
>
> Did you get a response?

Hi Jani, I sent this request to "stable@vger.kernel.org", but I haven't received any response yet. I will add you on Cc and send it again. Sorry about this, and thanks a lot.
I sent another stable backport request. It hasn't shown up on the list archives yet, but the message id is 871s98ly2g.fsf@intel.com

Resolving FIXED; this was our bug that was also fixed upstream.