Bug 100712 - ring 0 stalled after bytes_moved_threshold reached - CAPVERDE/HD7770 - TAHITI/W9000
Summary: ring 0 stalled after bytes_moved_threshold reached - CAPVERDE/HD7770 - TAHITI/W9000
Status: RESOLVED MOVED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Radeon
Version: DRI git
Hardware: Other All
Importance: medium normal
Assignee: Default DRI bug account
QA Contact:
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2017-04-18 15:14 UTC by Julien Isorce
Modified: 2019-11-19 09:28 UTC
CC List: 1 user

See Also:
i915 platform:
i915 features:


Attachments
dmesg_HD7770_kernel_amd-staging-4.9_ring_stalled (9.04 KB, text/plain)
2017-04-18 15:15 UTC, Julien Isorce
no flags
dmesg_HD7770_kernel_amd-staging-4.9_ring_stalled (45.25 KB, text/plain)
2017-04-18 15:15 UTC, Julien Isorce
no flags
ddebug_dumps_HD7770_kernel_amd-staging-4.9_ring_stalled (44.33 KB, text/plain)
2017-04-18 15:16 UTC, Julien Isorce
no flags
dmesg_W9000_with_custom_fence_debug.log (58.59 KB, text/plain)
2017-04-20 15:15 UTC, Julien Isorce
no flags

Description Julien Isorce 2017-04-18 15:14:55 UTC
Kernel 4.9 from https://cgit.freedesktop.org/~agd5f/linux/log/?h=amd-staging-4.9 and latest Mesa (same result with the drm-next-4.12 branch).
Same result with kernel 4.8 and Mesa 12.0.6.

In the kernel, in radeon_object.c::radeon_bo_list_validate, once "bytes_moved > bytes_moved_threshold" is reached (this is the case for 850 BOs in the same list_for_each_entry loop), I can see that radeon_ib_schedule emits a fence that takes more than radeon.lockup_timeout to be signaled.

In radeon_fence_activity, I checked that "last_emitted" is the seq number of this last emitted fence, and last_seq is equal to last_emitted-1.

Then the next call to ttm_bo_wait blocks (15 * HZ > radeon.lockup_timeout) until the GPU lockup is detected, which leads to a GPU reset.

Also it seems the fence is signaled by swapper after more than 10 seconds, but by then it is too late. I had to reduce the "15" parameter above to 4 to see that.
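
To spell out the timing as I understand it (a trivial userspace model, nothing more; it assumes the default radeon.lockup_timeout of 10000 ms, which is what I am running with, and the 10500 ms value is only illustrative of the ">10 s" seen in the log):

/* The lockup detector fires before either the 15 * HZ TTM wait returns
 * or the late fence signal arrives, hence the reset. */
#include <stdio.h>

int main(void)
{
    int lockup_timeout_ms = 10000;  /* default radeon.lockup_timeout       */
    int ttm_wait_ms       = 15000;  /* the 15 * HZ wait mentioned above    */
    int fence_signal_ms   = 10500;  /* illustrative: signaled after >10 s  */

    printf("lockup declared after         %5d ms\n", lockup_timeout_ms);
    printf("fence actually signals after  %5d ms (too late)\n", fence_signal_ms);
    printf("TTM wait would only give up after %d ms\n", ttm_wait_ms);
    return 0;
}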

Is it normal that radeon_bo_list_validate still tries to move the BO if bytes_moved_threshold is reached? Indeed ttm_bo_validate is always called (it blits from VRAM to VRAM).
Is it also normal that ttm_bo_validate is called with the evict flag set to true once bytes_moved_threshold is reached?
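
To make explicit what I think the loop is doing, here is a tiny userspace model of the bytes_moved accounting (all names and the BO list are made up by me; only radeon_bo_list_validate, bytes_moved, bytes_moved_threshold and ttm_bo_validate are real kernel names, and the real function obviously does a lot more):

/* Toy model of the bytes_moved budget in radeon_bo_list_validate.
 * Everything below is hypothetical, see above. */
#include <stdio.h>
#include <stdint.h>

#define DOM_GTT  1u
#define DOM_VRAM 2u

struct toy_bo {
    uint32_t preferred_domain;  /* where userspace would like the BO */
    uint32_t current_domain;    /* where the BO currently lives      */
    uint64_t size;              /* bytes a move would transfer       */
};

int main(void)
{
    struct toy_bo bos[] = {
        { DOM_VRAM, DOM_GTT,   512 * 1024 },
        { DOM_VRAM, DOM_GTT,   768 * 1024 },
        { DOM_VRAM, DOM_GTT,   256 * 1024 },
        { DOM_VRAM, DOM_VRAM, 4096 * 1024 },
    };
    /* worst case seen in this bug: usage >= half of VRAM, so a 1 MiB budget */
    uint64_t bytes_moved_threshold = 1024 * 1024;
    uint64_t bytes_moved = 0;
    unsigned i;

    for (i = 0; i < sizeof(bos) / sizeof(bos[0]); i++) {
        uint32_t domain = bos[i].preferred_domain;

        /* once the budget is exhausted, a BO that is not yet in its
         * preferred domain is simply kept where it is instead of moved */
        if (!(domain & bos[i].current_domain) &&
            bytes_moved > bytes_moved_threshold)
            domain = bos[i].current_domain;

        /* the equivalent of ttm_bo_validate() still runs for every BO;
         * it only costs a copy if the placement actually changes */
        if (!(domain & bos[i].current_domain)) {
            bytes_moved += bos[i].size;
            printf("bo %u: moved, bytes_moved now %llu\n",
                   i, (unsigned long long)bytes_moved);
        } else {
            printf("bo %u: validated in place\n", i);
        }
    }
    return 0;
}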
Comment 1 Julien Isorce 2017-04-18 15:15:23 UTC
Created attachment 130902 [details]
dmesg_HD7770_kernel_amd-staging-4.9_ring_stalled
Comment 2 Julien Isorce 2017-04-18 15:15:39 UTC
Created attachment 130903 [details]
dmesg_HD7770_kernel_amd-staging-4.9_ring_stalled
Comment 3 Julien Isorce 2017-04-18 15:16:45 UTC
Created attachment 130904 [details]
ddebug_dumps_HD7770_kernel_amd-staging-4.9_ring_stalled
Comment 4 Michel Dänzer 2017-04-19 03:48:45 UTC
(In reply to Julien Isorce from comment #0)
> In the kernel, in radeon_object.c::radeon_bo_list_validate, once
> "bytes_moved > bytes_moved_threshold" is reached (this is the case for 850
> BOs in the same list_for_each_entry loop), I can see that radeon_ib_schedule
> emits a fence that takes more than radeon.lockup_timeout to be signaled.

radeon_ib_schedule is called for submitting the command stream from userspace, not for any BO moves directly, right?

How did you determine that this hang is directly related to bytes_moved / bytes_moved_threshold? Maybe it's only indirectly related, e.g. due to the threshold preventing a BO from being moved to VRAM despite userspace's preference.


> Also it seems the fence is signaled by swapper after more than 10 seconds,
> but by then it is too late. I had to reduce the "15" parameter above to 4
> to see that.

How does "swapper" (what is that exactly?) signal the fence?


> Is it normal that radeon_bo_list_validate still tries to move the bo if
> bytes_moved_threshold is reached ?

There are circumstances where a BO has to be moved even though the threshold is reached.


> Indeed ttm_bo_validate is always called

ttm_bo_validate must be called for every BO referenced by the command stream from userspace for correct lifetime management of its memory.


> (it blits from vram to vram).

It might be worth looking into why this happens, though. If domain == current_domain == RADEON_GEM_DOMAIN_VRAM, I wouldn't expect ttm_bo_validate to trigger a blit.
Comment 5 Julien Isorce 2017-04-19 12:03:32 UTC
(In reply to Michel Dänzer from comment #4)
> (In reply to Julien Isorce from comment #0)
> > In the kernel, in radeon_object.c::radeon_bo_list_validate, once
> > "bytes_moved > bytes_moved_threshold" is reached (this is the case for
> > 850 BOs in the same list_for_each_entry loop), I can see that
> > radeon_ib_schedule emits a fence that takes more than
> > radeon.lockup_timeout to be signaled.
> 
> radeon_ib_schedule is called for submitting the command stream from
> userspace, not for any BO moves directly, right?
> 
> How did you determine that this hang is directly related to bytes_moved /
> bytes_moved_threshold? Maybe it's only indirectly related, e.g. due to the
> threshold preventing a BO from being moved to VRAM despite userspace's
> preference.
> 

I added a trace, and the fence that is not signaled in time is always the one emitted by radeon_ib_schedule after the bytes_moved_threshold is reached.
But you are right, it could be only indirectly related.

Here is the sequence I have:

ioctl_radeon_cs
  radeon_bo_list_validate
    bytes_moved > bytes_moved_threshold(=1024*1024ull)
    800 BOs are not moved from GTT to VRAM because of that.
  radeon_cs_ib_vm_chunk
    radeon_ib_schedule(rdev, &parser->ib, NULL, true);
      radeon_fence_emit on ring 0
      r600_mmio_hdp_flush
/ioctl_radeon_cs

Then anything calling ttm_bo_wait will block for more than radeon.lockup_timeout because the above fence is not signaled in time.
Could it be that something is not flushed properly? (ref: https://patchwork.kernel.org/patch/5807141/ ? tlb_flush?)

Are you saying that some BOs are required to be moved from GTT to VRAM in order for this fence to be signaled?

As you can see above, it happens when vram_usage >= half_vram, so radeon_bo_get_threshold_for_moves returns 1024*1024, which explains why only 1 or 2 BOs can be moved from GTT to VRAM in that case and why all the others are forced to stay in GTT.
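
For reference, my reading of radeon_bo_get_threshold_for_moves boils down to roughly the following (a userspace model with my own function name and example values; the formula is paraphrased from radeon_object.c as I read it, so take it as an approximation):

/* Model of how the per-IB move budget shrinks as VRAM fills up
 * (paraphrased from radeon_bo_get_threshold_for_moves). */
#include <stdio.h>
#include <stdint.h>

static uint64_t threshold_for_moves(uint64_t real_vram_size, uint64_t vram_usage)
{
    uint64_t half_vram = real_vram_size / 2;
    /* free space within the first half of VRAM; 0 once usage >= half */
    uint64_t half_free = vram_usage >= half_vram ? 0 : half_vram - vram_usage;
    uint64_t threshold = half_free / 2;

    /* floor of 1 MiB, which is the value we end up with in this bug */
    return threshold > 1024 * 1024 ? threshold : 1024 * 1024;
}

int main(void)
{
    uint64_t vram = 2048ull << 20;   /* e.g. a 2 GB HD7770 */
    uint64_t usages[] = { 0, vram / 4, vram / 2, 3 * vram / 4 };
    unsigned i;

    for (i = 0; i < sizeof(usages) / sizeof(usages[0]); i++)
        printf("vram usage %4llu MiB -> move budget %6llu KiB per IB\n",
               (unsigned long long)(usages[i] >> 20),
               (unsigned long long)(threshold_for_moves(vram, usages[i]) >> 10));
    return 0;
}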

In the same run of radeon_bo_list_validate there are many calls to ttm_bo_validate with both domain and current_domain set to VRAM; this is the case for around 400 BOs. Maybe this causes a delay in this fence being signaled, given that VRAM usage is high too.

> 
> > Also it seems the fence is signaled by swapper after more than 10 seconds,
> > but by then it is too late. I had to reduce the "15" parameter above to 4
> > to see that.
> 
> How does "swapper" (what is that exactly?) signal the fence?

My wording was wrong, sorry. I should have said "the first entity noticing that the fence is signaled" by calling radeon_fence_activity. swapper is the name of process 0 (idle). I changed the drm logging to print the process name and id: (current->comm, current->pid).
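
Concretely, the kind of change I made looks like this (a from-memory sketch, not the exact diff; current->comm and current->pid are the standard task fields and DRM_INFO the stock drm log helper):

/* e.g. at the top of radeon_fence_activity(), to see who is polling */
DRM_INFO("radeon: fence activity checked by %s (pid %d)\n",
         current->comm, current->pid);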

> 
> It might be worth looking into why this happens, though. If domain ==
> current_domain == RADEON_GEM_DOMAIN_VRAM, I wouldn't expect ttm_bo_validate
> to trigger a blit.

I will check, though I think I just got confused by a previous trace.
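
Something like the following is what I plan to add to check it (untested sketch, meant to wrap the existing validate call inside the loop of radeon_bo_list_validate where bo, rdev and r are already in scope; the field and helper names are from the 4.9-era radeon/TTM code as I remember them, so they may need adjusting):

u32 old_mem_type = bo->tbo.mem.mem_type;
u64 moved_before = atomic64_read(&rdev->num_bytes_moved);

r = ttm_bo_validate(&bo->tbo, &bo->placement, true, false);

/* mem_type staying TTM_PL_VRAM while num_bytes_moved grew means the
 * validate really did copy the BO within VRAM */
if (old_mem_type == TTM_PL_VRAM &&
    bo->tbo.mem.mem_type == TTM_PL_VRAM &&
    atomic64_read(&rdev->num_bytes_moved) != moved_before)
    DRM_INFO("radeon: VRAM->VRAM copy of %lu bytes during validate\n",
             radeon_bo_size(bo));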
Comment 6 Julien Isorce 2017-04-20 15:15:00 UTC
Created attachment 130947 [details]
dmesg_W9000_with_custom_fence_debug.log
Comment 7 Julien Isorce 2017-04-20 15:23:22 UTC
I made 2 apitraces using Zach's test mentioned here https://bugs.freedesktop.org/show_bug.cgi?id=100465#c24 . This test is also good for reproducing this ring 0 stalled issue.

1: apitrace ideal for a VRAM size of 2048 MB (e.g. HD7770)
https://drive.google.com/file/d/0Bzat_iFKrgCWYzBlZFFLQjgyRU0/view?usp=sharing

2: apitrace ideal for a VRAM size of 6144 MB (e.g. W9000)
https://drive.google.com/file/d/0Bzat_iFKrgCWczgzM2FzaVFTUXc/view?usp=sharing

DISPLAY=:0 apitrace replay thrash.trace

I have also attached the log (dmesg_W9000_with_custom_fence_debug.log) I get with my dev branch here https://github.com/CapOM/linux/commits/amd-staging-4.9_add_debug_fences where I added traces to debug ring N stalled issues. It prints the backtrace from where it waits for the fence, and also the backtrace from where that fence was emitted.
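
The idea of the debug traces, roughly (a from-memory sketch, not the actual commits; it only shows a single saved trace slot to keep the example short, whereas in practice you want one per emitted fence):

#include <linux/stacktrace.h>

/* one saved backtrace slot, for illustration only */
static unsigned long emit_bt_entries[32];
static struct stack_trace emit_bt = {
    .max_entries = ARRAY_SIZE(emit_bt_entries),
    .entries     = emit_bt_entries,
};

/* at radeon_fence_emit() time: remember where the fence came from */
emit_bt.nr_entries = 0;
save_stack_trace(&emit_bt);

/* when the wait for that fence times out: show both sides */
dump_stack();                    /* where we are stuck waiting        */
print_stack_trace(&emit_bt, 2);  /* where the stuck fence was emitted */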

Also note that setting R600_DEBUG=nowc avoids this ring N stall (the fence that otherwise never gets signaled in time does get signaled).
Comment 8 Julien Isorce 2017-04-24 11:02:33 UTC
Hack submitted here https://patchwork.kernel.org/patch/9695945/. It contains some info in the commit message and in the replies. For those who want to try, it is easier to just set R600_DEBUG=nowc, as said in comment #7.
Comment 9 Martin Peres 2019-11-19 09:28:23 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further in the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/amd/issues/793.

