Discussion:
BSD ring selection
Tvrtko Ursulin
2016-11-24 08:07:36 UTC
Hi all,

I am curious about how the driver currently operates with regard to
ring selection and usage.

As far as I can gather from the code, the driver is happy for the kernel
to choose the ring (on configurations with more than one ring of course)
and seems to be able to run mostly independently of the selection. (I
said mostly because there are some batches which are explicitly sent to
BSD0 ring, based on the feature matrix.)
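For reference, the explicit selection goes through the execbuffer2 flags. A minimal sketch of building them, assuming the flag values as defined in i915_drm.h (copied here for illustration; this is not a complete submission path, and bsd_exec_flags is a made-up helper name):

```c
#include <stdint.h>

/* Values mirror include/uapi/drm/i915_drm.h. */
#define I915_EXEC_BSD          (2)   /* low bits: engine selector */
#define I915_EXEC_BSD_SHIFT    (13)
#define I915_EXEC_BSD_MASK     (3 << I915_EXEC_BSD_SHIFT)
#define I915_EXEC_BSD_DEFAULT  (0 << I915_EXEC_BSD_SHIFT) /* kernel picks */
#define I915_EXEC_BSD_RING1    (1 << I915_EXEC_BSD_SHIFT)
#define I915_EXEC_BSD_RING2    (2 << I915_EXEC_BSD_SHIFT)

/* Build execbuffer2 flags: ring == 0 leaves the choice to the kernel,
 * ring == 1/2 pins the batch to BSD0/BSD1 respectively. */
static uint64_t bsd_exec_flags(int ring)
{
    uint64_t flags = I915_EXEC_BSD;

    switch (ring) {
    case 1:  flags |= I915_EXEC_BSD_RING1;   break;
    case 2:  flags |= I915_EXEC_BSD_RING2;   break;
    default: flags |= I915_EXEC_BSD_DEFAULT; break;
    }
    return flags;
}
```

The returned value would go into drm_i915_gem_execbuffer2.flags before the DRM_IOCTL_I915_GEM_EXECBUFFER2 ioctl.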

Have I missed something, or is there really nothing else special the
driver does with respect to which ring it is running on?

I am looking into this in the context of the long standing desire to
auto-balance workloads better. For example
https://bugs.freedesktop.org/show_bug.cgi?id=97872 expresses the need to
balance per batch buffer as well.

This leads me to the second part of the question and that is the
hardware state. Does the driver currently depend on the hardware state?

Because if we were to implement per-batch-buffer load balancing in the
kernel, the driver would have to make sure that it doesn't depend on any
state left by the previous batch. Perhaps this is not a concern; I
really know nothing of how the BSD engines are used.

If it doesn't already, the change to support this could be quite simple
and I could prototype something as an RFC.

Regards,

Tvrtko
Chris Wilson
2016-11-24 08:19:16 UTC
Post by Tvrtko Ursulin
Hi all,
I am curious on the current operation of the driver with regards to
the ring selection and usage.
As far as I can gather from the code, the driver is happy for the
kernel to choose the ring (on configurations with more than one ring
of course) and seems to be able to run mostly independently of the
selection. (I said mostly because there are some batches which are
explicitly sent to BSD0 ring, based on the feature matrix.)
Have I missed something or there is really nothing else special the
driver does with respect to which ring it is running?
I am looking into this in the context of the long standing desire to
auto-balance workloads better. For example
https://bugs.freedesktop.org/show_bug.cgi?id=97872 expresses the
need to balance per batch buffer as well.
This leads me to the second part of the question and that is the
hardware state. Does the driver currently depend on the hardware state?
No. They cannot since they are using the default context whose ABI is
that there is *no* state carried over between batches.
Post by Tvrtko Ursulin
Because if we were to implement per batch buffer load balancing in
the kernel, the driver would have to make sure that it doesn't
depend on any state left by the previous batch. Perhaps this is not
a concern, I really know nothing of how the BSD engines are used.
Why do this in the kernel when userspace already has the tools to do it?
The only thing preventing them is the abysmal fake BSD selection that
originated from libva.
-Chris
--
Chris Wilson, Intel Open Source Technology Centre
Tvrtko Ursulin
2016-11-24 08:35:21 UTC
Post by Chris Wilson
Post by Tvrtko Ursulin
Hi all,
I am curious on the current operation of the driver with regards to
the ring selection and usage.
As far as I can gather from the code, the driver is happy for the
kernel to choose the ring (on configurations with more than one ring
of course) and seems to be able to run mostly independently of the
selection. (I said mostly because there are some batches which are
explicitly sent to BSD0 ring, based on the feature matrix.)
Have I missed something or there is really nothing else special the
driver does with respect to which ring it is running?
I am looking into this in the context of the long standing desire to
auto-balance workloads better. For example
https://bugs.freedesktop.org/show_bug.cgi?id=97872 expresses the
need to balance per batch buffer as well.
This leads me to the second part of the question and that is the
hardware state. Does the driver currently depend on the hardware state?
No. They cannot since they are using the default context whose ABI is
that there is *no* state carried over between batches.
Excellent!
Post by Chris Wilson
Post by Tvrtko Ursulin
Because if we were to implement per batch buffer load balancing in
the kernel, the driver would have to make sure that it doesn't
depend on any state left by the previous batch. Perhaps this is not
a concern, I really know nothing of how the BSD engines are used.
Why do this in the kernel when userspace already has the tools to do it?
The kernel would have an idea of the ring usage. Say one client only uses
BSD0, BSD1 is always idle, and then a second client comes in who wants to
round-robin per batch. It may be better to pin that one to BSD1 then, for
all batches that are satisfied by the BSD1 feature set. Or in other
words, how would userspace be able to figure out the optimum scheduling?
Post by Chris Wilson
The only thing preventing them is the abysmal fake BSD selection that
originated from libva.
Hm, what do you mean? Why couldn't it be used for round-robin, for
example, since you explained that the state doesn't matter? It does
enable explicit ring selection at any time.

Regards,

Tvrtko
Chris Wilson
2016-11-24 08:54:25 UTC
Post by Tvrtko Ursulin
Post by Chris Wilson
Post by Tvrtko Ursulin
Because if we were to implement per batch buffer load balancing in
the kernel, the driver would have to make sure that it doesn't
depend on any state left by the previous batch. Perhaps this is not
a concern, I really know nothing of how the BSD engines are used.
Why do this in the kernel when userspace already has the tools to do it?
The kernel would have an idea of the ring usage. Say one client only
uses BSD0, BSD1 is always idle, and then a second client comes in
who wants to round-robin per batch. It may be better to pin that one
to BSD1 then, for all batches that are satisfied by the BSD1 feature
set. Or in other words, how would userspace be able to figure out the
optimum scheduling?
There is a need for both sort-first and sort-last scheduling. Simply by
monitoring its own load balancing, the application can avoid overcommitting
to a saturated engine. That gets you very close to ideal without any
kernel overhead (all the monitoring of hw for the application's load
balancing can be done in userspace). Sort-last scheduling allows naive
batch submission (e.g. all clients using the same engine) to be spread
across the engines, but does not come for free. Sort-last scheduling is
definitely something that will be useful, more so if we can move
contexts between engines, i.e. have identical behaviour to the CPU
schedulers for load balancing, but I see a missed opportunity in that
userspace could already be doing better balancing.
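To make the sort-first idea concrete: a client can track its own in-flight batches per BSD ring and submit each new batch to the least-loaded one. A hypothetical sketch (struct balancer, pick_ring, submit and retire are made-up names, not any real API):

```c
#define NUM_BSD_RINGS 2

struct balancer {
    int inflight[NUM_BSD_RINGS]; /* batches submitted but not yet retired */
};

/* Choose the ring with the fewest outstanding batches (ties go to BSD0). */
static int pick_ring(const struct balancer *b)
{
    int best = 0;

    for (int i = 1; i < NUM_BSD_RINGS; i++)
        if (b->inflight[i] < b->inflight[best])
            best = i;
    return best;
}

/* Bookkeeping hooks: call around execbuffer submission and completion. */
static void submit(struct balancer *b, int ring) { b->inflight[ring]++; }
static void retire(struct balancer *b, int ring) { b->inflight[ring]--; }
```

This only sees the client's own load, which is exactly the limitation raised above: it cannot account for other clients saturating one of the rings.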
Post by Tvrtko Ursulin
Post by Chris Wilson
The only thing preventing them is the abysmal fake BSD selection that
originated from libva.
Hm, what do you mean? Why couldn't it be used for round-robin, for
example, since you explained that the state doesn't matter? It does
enable explicit ring selection at any time.
The original ABI was to allow userspace to select exactly which ring to
execute on. Now we have a wart whereby we need to consult a second flag
to see if the engine the user specified is the one to use - whereas the
flag should be an opt-in to pick an alternate equivalent engine, and for
all engines to be exposed in the ring selector.
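A simplified decode of the current flags, to illustrate the wart (values mirror i915_drm.h; decode_bsd_ring is an illustrative helper, not the kernel's actual code):

```c
#include <stdint.h>

/* Values mirror include/uapi/drm/i915_drm.h. */
#define I915_EXEC_RING_MASK    (0x3f) /* low bits: engine selector */
#define I915_EXEC_BSD          (2)
#define I915_EXEC_BSD_SHIFT    (13)
#define I915_EXEC_BSD_MASK     (3 << I915_EXEC_BSD_SHIFT)
#define I915_EXEC_BSD_RING1    (1 << I915_EXEC_BSD_SHIFT)
#define I915_EXEC_BSD_RING2    (2 << I915_EXEC_BSD_SHIFT)

/* Returns 1 or 2 for an explicitly requested BSD ring, 0 when the
 * kernel is left to pick one, -1 for a non-BSD batch. */
static int decode_bsd_ring(uint64_t flags)
{
    if ((flags & I915_EXEC_RING_MASK) != I915_EXEC_BSD)
        return -1; /* the low-bits selector alone decides */

    /* The wart: a second field must be consulted to learn which BSD
     * engine the selector actually meant. */
    switch (flags & I915_EXEC_BSD_MASK) {
    case I915_EXEC_BSD_RING1: return 1;
    case I915_EXEC_BSD_RING2: return 2;
    default:                  return 0;
    }
}
```

Under the opt-in scheme described above, each engine would instead have its own value in the low-bits selector, and a separate flag would merely permit the kernel to substitute an equivalent engine.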
-Chris
--
Chris Wilson, Intel Open Source Technology Centre
Sean V Kelley
2016-11-25 00:17:15 UTC
+Kimmo

Hi Tvrtko,

As I mentioned in email yesterday, I would be looking into this next quarter.

Why are you jumping into this now?

Please work through me on these requests.


Thanks,

Sean
Post by Tvrtko Ursulin
Hi all,
I am curious on the current operation of the driver with regards to the ring selection and usage.
As far as I can gather from the code, the driver is happy for the kernel to choose the ring (on configurations with more than one ring of course) and seems to be able to run mostly independently of the selection. (I said mostly because there are some batches which are explicitly sent to BSD0 ring, based on the feature matrix.)
Have I missed something or there is really nothing else special the driver does with respect to which ring it is running?
I am looking into this in the context of the long standing desire to auto-balance workloads better. For example https://bugs.freedesktop.org/show_bug.cgi?id=97872 expresses the need to balance per batch buffer as well.
This leads me to the second part of the question and that is the hardware state. Does the driver currently depend on the hardware state?
Because if we were to implement per batch buffer load balancing in the kernel, the driver would have to make sure that it doesn't depend on any state left by the previous batch. Perhaps this is not a concern, I really know nothing of how the BSD engines are used.
If perhaps it doesn't already, then the change to support this could be quite simple and I could perhaps prototype something as an RFC.
Regards,
Tvrtko
_______________________________________________
Libva mailing list
https://lists.freedesktop.org/mailman/listinfo/libva