Summary: | kernel BUG at fs/ext4/inode.c:2721! | ||
---|---|---|---|
Product: | DRI | Reporter: | Robert Holmes <robeholmes> |
Component: | DRM/Intel | Assignee: | Intel GFX Bugs mailing list <intel-gfx-bugs> |
Status: | RESOLVED MOVED | QA Contact: | Intel GFX Bugs mailing list <intel-gfx-bugs> |
Severity: | major | ||
Priority: | high | CC: | alarrosa, intel-gfx-bugs, massimiliano.torromeo, robeholmes |
Version: | XOrg git | ||
Hardware: | x86-64 (AMD64) | ||
OS: | All | ||
Whiteboard: | Triaged, ReadyForDev | ||
i915 platform: | SKL | i915 features: | GEM/Other |
Description
Robert Holmes
2019-10-15 21:16:09 UTC
(In reply to Robert Holmes from comment #0)
> Now, the reason I am reporting this on this bug tracker is that I've
> bisected this issue (more precisely, the first WARNING right after boot) to
> the following commit:
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.3.y&id=0bd6cb6b58f7332c61cef2e4ae48db1ca9910b6b

Yikes. That shows that code was inherently more buggy than I thought, as it was causing us to drop writes to pages we didn't own (but thought we did).

The root cause of the warn and ext4 bug is the lack of lock_page around set_page_dirty in userptr_put_pages. We tried putting a lock there, but we recurse into userptr_put_pages from underneath locked pages...

There is a plan afoot to replace this interface with HMM in the hope that it makes the integration between the GPU and user pages much nicer and in the process resolve these mistakes.

Hi Chris,

Could you please add a quick workaround for this problem? Kernel 5.3.x is completely unusable on desktop because of that...

Acer laptop, Intel(R) Pentium(R) CPU N4200. Kernel 5.3.8. Just compiled and tried the 5.3.x kernel.
Got the following in dmesg, no hang (yet) but the following is concerning:

[ 50.138567] WARNING: CPU: 1 PID: 1330 at fs/ext4/inode.c:3941 ext4_set_page_dirty+0x3e/0x50
[ 50.138569] Modules linked in: rfcomm nf_conntrack_irc nf_conntrack_sip iptable_raw xt_CT nf_conntrack_netbios_ns nf_conntrack_broadcast ip6t_rt xt_state xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip6table_filter ip6_tables sunrpc lz4 lz4_compress bnep vfat fat snd_soc_skl snd_soc_hdac_hda snd_hda_ext_core snd_soc_skl_ipc snd_soc_sst_ipc snd_soc_sst_dsp snd_soc_acpi_intel_match snd_soc_acpi snd_soc_core iwlmvm snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic snd_compress ac97_bus ledtrig_audio snd_pcm_dmaengine snd_hda_intel snd_hda_codec intel_telemetry_pltdrv intel_punit_ipc intel_telemetry_core x86_pkg_temp_thermal intel_powerclamp coretemp snd_hwdep kvm_intel snd_hda_core btusb btrtl btbcm kvm btintel iwlwifi uvcvideo bluetooth snd_seq videobuf2_vmalloc videobuf2_memops snd_seq_device videobuf2_v4l2 snd_pcm videobuf2_common videodev joydev mei_hdcp intel_rapl_msr hid_multitouch acer_wmi wmi_bmof sparse_keymap irqbypass snd_timer intel_cstate mei_me snd
[ 50.138610] intel_rapl_perf mei mc processor_thermal_device intel_rapl_common wdat_wdt ecdh_generic pcspkr idma64 i2c_i801 intel_lpss_pci lpc_ich ecc int340x_thermal_zone intel_xhci_usb_role_switch soundcore bfq intel_lpss intel_soc_dts_iosf roles int3400_thermal wmi acpi_thermal_rel acer_wireless int3406_thermal dm_crypt i915 i2c_algo_bit cec rc_core drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm mmc_block rtsx_usb_sdmmc sdhci_pci cqhci sdhci mmc_core crct10dif_pclmul crc32_pclmul serio_raw ghash_clmulni_intel rtsx_usb video i2c_hid pinctrl_broxton pinctrl_intel fuse
[ 50.138638] CPU: 1 PID: 1330 Comm: kworker/u8:4 Not tainted 5.3.8-Seohyun #1
[ 50.138639] Hardware name: Acer Swift SF113-31/ASAHI_AP_S, BIOS V1.12 03/30/2018
[ 50.138700] Workqueue: i915 __i915_gem_free_work [i915]
[ 50.138704] RIP: 0010:ext4_set_page_dirty+0x3e/0x50
[ 50.138706] Code: 48 8b 00 a8 01 75 16 48 8b 57 08 48 8d 42 ff 83 e2 01 48 0f 44 c7 48 8b 00 a8 08 74 0d 48 8b 07 f6 c4 20 74 0f e9 92 e7 f7 ff <0f> 0b 48 8b 07 f6 c4 20 75 f1 0f 0b e9 81 e7 f7 ff 90 0f 1f 44 00
[ 50.138707] RSP: 0018:ffffc1e60137fd90 EFLAGS: 00010246
[ 50.138709] RAX: 0017ffe000002016 RBX: ffff9e337236a200 RCX: 0000000000000000
[ 50.138710] RDX: 0000000000000000 RSI: 0000000121400000 RDI: fffff3ecc498ea40
[ 50.138711] RBP: fffff3ecc498ea40 R08: 0000000121400000 R09: 0000000000000000
[ 50.138712] R10: 0000000000000001 R11: 0000000000000000 R12: 00000000001263a9
[ 50.138713] R13: ffff9e3322c11b00 R14: ffff9e33367f9ca0 R15: 0000000000000000
[ 50.138714] FS: 0000000000000000(0000) GS:ffff9e337ba80000(0000) knlGS:0000000000000000
[ 50.138715] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 50.138716] CR2: 000055fb74151f10 CR3: 000000013c60a000 CR4: 00000000003406e0
[ 50.138717] Call Trace:
[ 50.138767] i915_gem_userptr_put_pages+0x14b/0x1e0 [i915]
[ 50.138812] __i915_gem_object_put_pages+0x5b/0xa0 [i915]
[ 50.138854] __i915_gem_free_objects+0x124/0x230 [i915]
[ 50.138898] __i915_gem_free_work+0x64/0x90 [i915]
[ 50.138902] process_one_work+0x199/0x340
[ 50.138905] worker_thread+0x4e/0x3b0
[ 50.138907] kthread+0xfc/0x130
[ 50.138910] ? process_one_work+0x340/0x340
[ 50.138912] ? kthread_park+0x80/0x80
[ 50.138915] ret_from_fork+0x35/0x40
[ 50.138919] ---[ end trace ca5ea2ec07e00336 ]---

It looks like this is the same bug as originally reported by Holmes. Please let me know if/how I can provide additional useful information.

I'd like to report that I am seeing this message as well, I'm running Fedora 30 on a Dell Latitude 7490, current kernel: 5.3.8-200.fc30.x86_64. I thought using the i915 module parameter enable_guc=2 was the trigger, but I've removed that and at least on this new kernel, the message still came.

Hi, I'm also seeing this bug.
In my case it's happening in a desktop PC with an i7-6700K cpu and the i915 module (and no other special hardware) using openSUSE Tumbleweed with 5.3.0, 5.3.7 and 5.3.8 kernels. I'm currently running 5.2.14 without any issue since 5.3.x kernels are completely unusable. Once the BUG appears in dmesg, processes get stuck in D state and the system has to be rebooted, and in some cases, it has happened within an hour after booting. I reported this to https://bugzilla.opensuse.org/show_bug.cgi?id=1156537 where I've been putting some information on my system before I was pointed to this bug report.

There has been a patch submitted to -stable fixing (or at least working around) this issue: https://www.spinics.net/lists/stable/msg340095.html

All that is left, presumably, is for it to be actually pulled into a 5.3.x release.

(In reply to Robert Holmes from comment #6)
> There has been a patch submitted to -stable fixing (or at least working
> around) this issue: https://www.spinics.net/lists/stable/msg340095.html
>
> All that is left, presumably, is for it to be actually pulled into a 5.3.x
> release.

What is the impact of this issue apart from warning in the log?

(In reply to Lakshmi from comment #7)
> (In reply to Robert Holmes from comment #6)
> > There has been a patch submitted to -stable fixing (or at least working
> > around) this issue: https://www.spinics.net/lists/stable/msg340095.html
> >
> > All that is left, presumably, is for it to be actually pulled into a 5.3.x
> > release.
>
> What is the impact of this issue apart from warning in the log?

If this was just the WARNING, I probably wouldn't have even noticed it in the first place. But the WARNING is just a quick sign that the bug is present, before things end up on a BUG later on.
As I've reported initially, after the first (mostly harmless) WARNING, things run fine for a while -- maybe even a few days -- before another WARNING followed by a BUG happens:

Oct 07 01:33:15 laptop kernel: ------------[ cut here ]------------
Oct 07 01:33:15 laptop kernel: kernel BUG at fs/ext4/inode.c:2721!
Oct 07 01:33:15 laptop kernel: invalid opcode: 0000 [#1] SMP PTI
...

This BUG, as the previous commenter notes, causes processes to get stuck in D state and makes it impossible to start new processes. The system quickly becomes unusable, and requires a (hard) reboot.

Just for the record, I've been running a 5.3.11 kernel with the patch Robert Holmes mentioned in #c6 applied (with the diff context slightly modified to apply correctly) and so far (over 24 hours later), the system is still running fine and stable with no warning or bug in dmesg.

Kernel 5.3.14 has now been released with the fix, and 5.4.x already carries it. I have been testing the patch for about a week now, and haven't been able to trigger any WARNING/BUG so far. So this issue can probably be closed soon.

-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/intel/issues/509.
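
For anyone wanting to check whether a machine has hit either of the two signatures quoted in this thread (the early WARNING in ext4_set_page_dirty and the later kernel BUG in fs/ext4/inode.c), a log scan along the following lines may help. This is only a sketch: the function names `classify` and `scan` are made up for illustration, and the patterns are modelled on the messages pasted above. The source line numbers (3941 and 2721) are assumed to vary between kernel versions, so they are matched loosely rather than literally.

```python
import re

# Patterns modelled on the two messages quoted in this thread.
# Assumption: the line numbers after inode.c differ between kernel
# versions, so any number is accepted.
WARNING_RE = re.compile(r"WARNING: .* at fs/ext4/inode\.c:\d+ ext4_set_page_dirty")
BUG_RE = re.compile(r"kernel BUG at fs/ext4/inode\.c:\d+!")


def classify(line):
    """Return "bug", "warning", or None for a single log line."""
    if BUG_RE.search(line):
        return "bug"
    if WARNING_RE.search(line):
        return "warning"
    return None


def scan(lines):
    """Count how many lines match each signature."""
    counts = {"warning": 0, "bug": 0}
    for line in lines:
        kind = classify(line)
        if kind is not None:
            counts[kind] += 1
    return counts
```

The lines can come from any iterable of strings, for example the saved output of dmesg or journalctl -k. A non-zero "warning" count with a zero "bug" count would correspond to the state described above, where the system still runs but the BUG may follow later.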