Summary: | Screen corruption on Cayman with dpm enabled | ||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Product: | DRI | Reporter: | Martin Andersson <g02maran> | ||||||||||||||||||||||
Component: | DRM/Radeon | Assignee: | Default DRI bug account <dri-devel> | ||||||||||||||||||||||
Status: | RESOLVED FIXED | QA Contact: | |||||||||||||||||||||||
Severity: | normal | ||||||||||||||||||||||||
Priority: | medium | CC: | a.heider, tobi | ||||||||||||||||||||||
Version: | DRI git | ||||||||||||||||||||||||
Hardware: | x86-64 (AMD64) | ||||||||||||||||||||||||
OS: | Linux (All) | ||||||||||||||||||||||||
Whiteboard: | |||||||||||||||||||||||||
i915 platform: | i915 features: | ||||||||||||||||||||||||
Attachments: |
|
Description
Martin Andersson
2013-07-15 15:45:05 UTC
Created attachment 82449 [details]
screenshot of corruption
(In reply to comment #0) > > I did a bisect and found that this commit introduced the corruption: > > http://cgit.freedesktop.org/~agd5f/linux/commit/?h=drm-fixes-3. > 11&id=7ad8d0687bb5030c3328bc7229a3183ce179ab25 Prior to that commit, the driver didn't actually change clock levels. Does it work any better if you build radeon as a module rather than building it into the kernel? (In reply to comment #2) > (In reply to comment #0) > > > > I did a bisect and found that this commit introduced the corruption: > > > > http://cgit.freedesktop.org/~agd5f/linux/commit/?h=drm-fixes-3. > > 11&id=7ad8d0687bb5030c3328bc7229a3183ce179ab25 > > Prior to that commit, the driver didn't actually change clock levels. > > Does it work any better if you build radeon as a module rather than building > it into the kernel? Well I think I'm already building radeon as a module since I have: CONFIG_DRM=m CONFIG_DRM_RADEON=m or am I missing something? (In reply to comment #3) > > Well I think I'm already building radeon as a module since I have: > CONFIG_DRM=m > CONFIG_DRM_RADEON=m > > or am I missing something? That should be fine. I assumed you were building it in since you were messing with CONFIG_EXTRA_FIRMWARE. (In reply to comment #4) > (In reply to comment #3) > > > > Well I think I'm already building radeon as a module since I have: > > CONFIG_DRM=m > > CONFIG_DRM_RADEON=m > > > > or am I missing something? > > That should be fine. I assumed you were building it in since you were > messing with CONFIG_EXTRA_FIRMWARE. I have learnt something new today, CONFIG_EXTRA_FIRMWARE is not needed for modules. I tried RADEON_DPM_FORCED_LEVEL_LOW and RADEON_DPM_FORCED_LEVEL_HIGH, but they didn't work either. Then I commented out: ni_dpm_force_performance_level(rdev, RADEON_DPM_FORCED_LEVEL_AUTO); With that I can boot the system and dpm seems to work, because the fan is nice and quiet. Also cat /sys/kernel/debug/dri/64/radeon_pm_info says this: uvd vclk: 0 dclk: 0 power level 0 sclk: 25000 mclk: 15000 vddc: 900 vddci: 950 Then (with RADEON_DPM_FORCED_LEVEL_AUTO enabled again) I commented out: if (ni_send_msg_to_smc_with_parameter(rdev, PPSMC_MSG_SetForcedLevels, 0) != PPSMC_Result_OK) That resulted in screen corruption but it worked when I instead commented out: if (ni_send_msg_to_smc_with_parameter(rdev, PPSMC_MSG_SetEnabledLevels, 0) != PPSMC_Result_OK) (In reply to comment #5) > > Then I commented out: > ni_dpm_force_performance_level(rdev, RADEON_DPM_FORCED_LEVEL_AUTO); > That resulted in screen corruption but it worked when I instead commented > out: > if (ni_send_msg_to_smc_with_parameter(rdev, PPSMC_MSG_SetEnabledLevels, 0) > != PPSMC_Result_OK) Unfortunately, removing those calls disables dynamic reclocking so the GPU always stays in the lowest performance level. Created attachment 82501 [details] [review] add module parameter to disable aspm Try this patch which adds a new module parameter to disable aspm. Add radeon.aspm=0 to your kernel command line in grub to disable aspm support. (In reply to comment #7) > Created attachment 82501 [details] [review] [review] > add module parameter to disable aspm > > Try this patch which adds a new module parameter to disable aspm. Add > radeon.aspm=0 to your kernel command line in grub to disable aspm support. Unfortunately that didn't help, it was different though. First I saw the corruption as before, but after a second or two the screen turned blank, like it didn't get any signal. I caught a glimpse of some text that was printed just before the corruption, don't know if it is related or not, just throwing it out there. It's said "mei_me" <some text I forgot> <something about write failed> I have seen it before but then I didn't have time to see what it was. I just tried booting a second time and now experienced exactly the same behaviour as before the patch, that is the screen didn't get blank but continued to display the corrupted screen. Also looked at the dmesg from the patch and it was exactly the same lockup. Created attachment 82519 [details] [review] debugging output Can you attach a dmesg output with dpm enabled with this patch? Created attachment 82534 [details]
dmesg with mc reg dump
Take this information with a grain of salt, since I'm testing stuff without really knowing whats going on. I found it interesting that the dmesg with the mc dump didn't print anything from the first for loop in the patch, so I did some testing. I added a bunch of debug printks to radeon_atom_init_mc_reg_table to see what was happening. I found that num_entries is 13, but what was more interesting was that some printks was skipped and now I got a kernel oops (NULL pointer dereference). It looked like this while loop was executed once (without problems), then strange things happended: while (!(reg_block->asRegIndexBuf[i].ucPreRegDataLength & ACCESS_PLACEHOLDER)... So I added this printk before and after "i++;" in that while loop: printk(KERN_INFO "debug=%u", reg_block->asRegIndexBuf[i].ucPreRegDataLength); gcc didn't complain about the first printk but for the second it said this: drivers/gpu/drm/radeon/radeon_atombios.c:3741:60: warning: array subscript is above array bounds [-Warray-bounds] It might be a false positive but it matches with my debug printks, everything is fine until the second iteration in that while loop. After that the execution jumps further down in the function and crashes. *** Bug 66972 has been marked as a duplicate of this bug. *** *** Bug 66945 has been marked as a duplicate of this bug. *** Created attachment 82551 [details] [review] more debugging Something appears to be corrupting that structure. Try this patch on top of the previous one and post the dmesg with them applied. reg_table->last should not be 0. You should be seeing something like (slightly different per asic): [ 96.516683] 0 s1 = 0x0a2f pre_reg_data = 0x04 [ 96.516696] 1 s1 = 0x0a30 pre_reg_data = 0x00 [ 96.516710] 2 s1 = 0x0ad5 pre_reg_data = 0x04 [ 96.516723] 3 s1 = 0x0a28 pre_reg_data = 0x04 [ 96.516736] 4 s1 = 0x0a29 pre_reg_data = 0x04 [ 96.516746] 5 s1 = 0x0a2a pre_reg_data = 0x04 [ 96.516759] 6 s1 = 0x0a2b pre_reg_data = 0x04 [ 96.516772] 7 s1 = 0x0a2c pre_reg_data = 0x04 [ 96.516786] 8 s1 = 0x0a81 pre_reg_data = 0x04 [ 96.516798] 9 s1 = 0x0a8b pre_reg_data = 0x04 [ 96.516811] 10 s1 = 0x0a5f pre_reg_data = 0x04 for the first loop in the dumper. Created attachment 82552 [details]
dmesg with more debugging
Same here: [ 7.563515] module_index = 2 num_entries = 13 [ 7.563516] 0: s1 = 0x0a2f prd = 0x04 [ 7.567599] switching from power state: no more "s1 =" and not a single "last =" Created attachment 82559 [details] [review] weird fix This magically makes it work... No more stalls, box drops from 125 watt to 62. [ 7.923448] module_index = 2 num_entries = 13 [ 7.923449] 0: s1 = 0x0a2f prd = 0x04 [ 7.923450] 1: s1 = 0x0a30 prd = 0x00 [ 7.923450] 2: s1 = 0x0ad5 prd = 0x04 [ 7.923451] 3: s1 = 0x0a28 prd = 0x04 [ 7.923451] 4: s1 = 0x0a29 prd = 0x04 [ 7.923452] 5: s1 = 0x0a2a prd = 0x04 [ 7.923453] 6: s1 = 0x0a2b prd = 0x04 [ 7.923453] 7: s1 = 0x0a2c prd = 0x04 [ 7.923454] 8: s1 = 0x0a81 prd = 0x04 [ 7.923454] 9: s1 = 0x0a8b prd = 0x04 [ 7.923455] 10: s1 = 0x0a5f prd = 0x04 [ 7.923456] last = 11 [ 7.923457] last = 11 [ 7.923457] 0 s1 = 0x0a2f pre_reg_data = 0x04 [ 7.923458] 1 s1 = 0x0a30 pre_reg_data = 0x00 [ 7.923458] 2 s1 = 0x0ad5 pre_reg_data = 0x04 [ 7.923459] 3 s1 = 0x0a28 pre_reg_data = 0x04 [ 7.923460] 4 s1 = 0x0a29 pre_reg_data = 0x04 [ 7.923460] 5 s1 = 0x0a2a pre_reg_data = 0x04 [ 7.923461] 6 s1 = 0x0a2b pre_reg_data = 0x04 [ 7.923461] 7 s1 = 0x0a2c pre_reg_data = 0x04 [ 7.923462] 8 s1 = 0x0a81 pre_reg_data = 0x04 [ 7.923463] 9 s1 = 0x0a8b pre_reg_data = 0x04 [ 7.923463] 10 s1 = 0x0a5f pre_reg_data = 0x04 [ 7.923464] 0 mclk_max = 40000 [ 7.923465] 0 mc_data = 0x20335155 [ 7.923465] 1 mc_data = 0x20335155 [ 7.923466] 2 mc_data = 0x00000000 [ 7.923466] 3 mc_data = 0x0f129421 [ 7.923467] 4 mc_data = 0x080852c0 [ 7.923468] 5 mc_data = 0x01328d12 [ 7.923468] 6 mc_data = 0x00222000 [ 7.923469] 7 mc_data = 0x001ca511 [ 7.923469] 8 mc_data = 0x2014021b [ 7.923470] 9 mc_data = 0xa28086ba [ 7.923470] 10 mc_data = 0x00000000 [ 7.923471] 1 mclk_max = 80000 [ 7.923472] 0 mc_data = 0x20337177 [ 7.923472] 1 mc_data = 0x20337177 [ 7.923473] 2 mc_data = 0x00000000 [ 7.923473] 3 mc_data = 0x1f2528a5 [ 7.923474] 4 mc_data = 0x0c0952f0 [ 7.923475] 5 mc_data = 0x0275971f [ 7.923475] 6 mc_data = 0x00622200 [ 7.923476] 7 mc_data = 0x001ca511 [ 7.923476] 8 mc_data = 0x2014083c [ 7.923477] 9 mc_data = 0xa50086ba [ 7.923477] 10 mc_data = 0x00000000 [ 7.923478] 2 mclk_max = 90000 [ 7.923479] 0 mc_data = 0x20337177 [ 7.923479] 1 mc_data = 0x20337177 [ 7.923480] 2 mc_data = 0x00000000 [ 7.923480] 3 mc_data = 0x222630e7 [ 7.923481] 4 mc_data = 0x0d095300 [ 7.923481] 5 mc_data = 0x02c61921 [ 7.923482] 6 mc_data = 0x00822300 [ 7.923483] 7 mc_data = 0x001ca511 [ 7.923483] 8 mc_data = 0x20140944 [ 7.923484] 9 mc_data = 0xa58086ba [ 7.923484] 10 mc_data = 0x00000000 [ 7.923485] 3 mclk_max = 100000 [ 7.923485] 0 mc_data = 0x20337177 [ 7.923486] 1 mc_data = 0x20337177 [ 7.923487] 2 mc_data = 0x00000000 [ 7.923487] 3 mc_data = 0x27273929 [ 7.923488] 4 mc_data = 0x0e095310 [ 7.923488] 5 mc_data = 0x03171c24 [ 7.923489] 6 mc_data = 0x00822300 [ 7.923489] 7 mc_data = 0x001ca511 [ 7.923490] 8 mc_data = 0x20140a4c [ 7.923491] 9 mc_data = 0xa64086ba [ 7.923491] 10 mc_data = 0x00000000 [ 7.923492] 4 mclk_max = 125000 [ 7.923492] 0 mc_data = 0x20337177 [ 7.923493] 1 mc_data = 0x20337177 [ 7.923493] 2 mc_data = 0x00000000 [ 7.923494] 3 mc_data = 0x2f68416b [ 7.923495] 4 mc_data = 0x10095330 [ 7.923495] 5 mc_data = 0x03e8a12b [ 7.923496] 6 mc_data = 0x00c34500 [ 7.923496] 7 mc_data = 0x001ca511 [ 7.923497] 8 mc_data = 0x20140e5c [ 7.923497] 9 mc_data = 0xa780874a [ 7.923498] 10 mc_data = 0x00000000 [ 7.923499] 5 mclk_max = 137500 [ 7.923499] 0 mc_data = 0x20337177 [ 7.923500] 1 mc_data = 0x20337177 [ 7.923500] 2 mc_data = 0x00000000 [ 7.923501] 3 mc_data = 0x346949ad [ 7.923501] 4 mc_data = 0x10095330 [ 7.923502] 5 mc_data = 0x0449a32e [ 7.923503] 6 mc_data = 0x00c44600 [ 7.923503] 7 mc_data = 0x001ca511 [ 7.923504] 8 mc_data = 0x20140f5c [ 7.923504] 9 mc_data = 0xa840875a [ 7.923505] 10 mc_data = 0x00000000 [ 7.923505] 6 mclk_max = 150000 [ 7.923506] 0 mc_data = 0x20337177 [ 7.923507] 1 mc_data = 0x20337177 [ 7.923507] 2 mc_data = 0x00000000 [ 7.923508] 3 mc_data = 0x386a51ef [ 7.923508] 4 mc_data = 0x11095340 [ 7.923509] 5 mc_data = 0x04aa2531 [ 7.923509] 6 mc_data = 0x00e44700 [ 7.923510] 7 mc_data = 0x001ca511 [ 7.923511] 8 mc_data = 0x20940164 [ 7.923511] 9 mc_data = 0xa900875a [ 7.923512] 10 mc_data = 0x00000002 I wasn't very clear in my #12 comment, but what I was trying to say it is something fishy about "reg_block->asRegIndexBuf". It is defined in atombios.h as: ATOM_MEMORY_SETTING_DATA_BLOCK asRegDataBuf[1]; But it is accessed well beyond just the first element and it is after the first access that things start getting weird. Dammit copied the wrong line, but my point is still valid: ATOM_INIT_REG_INDEX_FORMAT asRegIndexBuf[1]; (In reply to comment #19) > I wasn't very clear in my #12 comment, but what I was trying to say it is > something fishy about "reg_block->asRegIndexBuf". It is defined in > atombios.h as: > ATOM_MEMORY_SETTING_DATA_BLOCK asRegDataBuf[1]; > It's a variably sized array and the size varies depending on the system. The size of that array is defined by usRegIndexTblSize. Hence the calculation to determine the number of elements: num_entries = (u8)((le16_to_cpu(reg_block->usRegIndexTblSize)) / sizeof(ATOM_INIT_REG_INDEX_FORMAT)) - 1; (In reply to comment #18) > Created attachment 82559 [details] [review] [review] > weird fix > Hmmm, looks like a compiler bug. what compiler are you using? (In reply to comment #21) > (In reply to comment #19) > > I wasn't very clear in my #12 comment, but what I was trying to say it is > > something fishy about "reg_block->asRegIndexBuf". It is defined in > > atombios.h as: > > ATOM_MEMORY_SETTING_DATA_BLOCK asRegDataBuf[1]; > > > > It's a variably sized array and the size varies depending on the system. > The size of that array is defined by usRegIndexTblSize. Hence the > calculation to determine the number of elements: > > num_entries = (u8)((le16_to_cpu(reg_block->usRegIndexTblSize)) / > sizeof(ATOM_INIT_REG_INDEX_FORMAT)) - 1; ok, wasn't aware of that. My compiler is: gcc (GCC) 4.8.1 (In reply to comment #22) > (In reply to comment #18) > > Created attachment 82559 [details] [review] [review] [review] > > weird fix > > > > Hmmm, looks like a compiler bug. what compiler are you using? gcc version 4.8.1 (Ubuntu/Linaro 4.8.1-6ubuntu1) But I think the compiler is right in this case. struct ATOM_INIT_REG_BLOCK doesn't represent what radeon_atom_init_mc_reg_table() is doing. The struct in the header reads: ... ATOM_INIT_REG_INDEX_FORMAT asRegIndexBuf[1]; ATOM_MEMORY_SETTING_DATA_BLOCK asRegDataBuf[1]; So reg_block->asRegIndexBuf[i] for i>0 is not defined, its within asRegDataBuf or even pass the size of the struct. It looks like this gcc is killing the while loop because i can never be anything else than zero. The patch works for me as well (In reply to comment #24) > (In reply to comment #22) > > (In reply to comment #18) > > > Created attachment 82559 [details] [review] [review] [review] [review] > > > weird fix > > > > > > > Hmmm, looks like a compiler bug. what compiler are you using? > > gcc version 4.8.1 (Ubuntu/Linaro 4.8.1-6ubuntu1) > > But I think the compiler is right in this case. struct ATOM_INIT_REG_BLOCK > doesn't represent what radeon_atom_init_mc_reg_table() is doing. > > The struct in the header reads: > ... > ATOM_INIT_REG_INDEX_FORMAT asRegIndexBuf[1]; > ATOM_MEMORY_SETTING_DATA_BLOCK asRegDataBuf[1]; > > So reg_block->asRegIndexBuf[i] for i>0 is not defined, its within > asRegDataBuf or even pass the size of the struct. > > It looks like this gcc is killing the while loop because i can never be > anything else than zero. It works on older version of gcc. I'm using 4.6.2 and 4.7.2 on several boxes and haven't run into any problems. Also, even if the loop gets skipped, the "last = " lines should still get printed. Created attachment 82560 [details] [review] fix I've attached a git version of Andre's patch. I think it should be fine. The logic is preserved. Not sure why the previous logic caused such bogus code in newer versions of gcc. (In reply to comment #26) > Also, even if the loop gets skipped, the "last = " lines should still get printed. Oh right, that makes it even more weird... Created attachment 82563 [details] [review] additional fix Try just this patch on top of attachment 82560 [details] [review]. Do not apply any of the debugging output patches. (In reply to comment #29) > Created attachment 82563 [details] [review] [review] > additional fix > > Try just this patch on top of attachment 82560 [details] [review] [review]. Do not > apply any of the debugging output patches. Works fine for me. (In reply to comment #29) > Created attachment 82563 [details] [review] [review] > additional fix > > Try just this patch on top of attachment 82560 [details] [review] [review]. Do not > apply any of the debugging output patches. Fixed my issues also. Thanks for your great work. (In reply to comment #31) > (In reply to comment #29) > > Created attachment 82563 [details] [review] [review] [review] > > additional fix > > > > Try just this patch on top of attachment 82560 [details] [review] [review] [review]. Do not > > apply any of the debugging output patches. > > Fixed my issues also. > Thanks for your great work. Yep, I applied the latest changes from the drm-fixes-3.11 branch (commit 444bddc) to the 3.11 kernel and that resolved the issue for me as well. I had previously tried just the initial GCC fix and that didn't work, so I guess the 'gcc harder' patch was the one that did it for me. Thanks! |
Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.