panfrost: Implement command stream scoreboarding

This is a rather complex change, adding a lot of code but ideally
cleaning up quite a bit as we go.

Within a batch (single frame), there are multiple distinct Mali job
types: SET_VALUE, VERTEX, TILER, FRAGMENT for the few that we emit right
now (eventually more for compute and geometry shaders). Each hardware
job has a mali_job_descriptor_header, which contains three fields of
interest: job index, a dependencies list, and a next job pointer.

The next job pointer in each job is used to form a linked list of
submitted jobs. Easy enough.

The job index and dependencies list, however, are used to form a
dependency graph (a DAG, where each hardware job is a node and each
dependency is a directed edge). Internally, this sets up a scoreboarding
data structure for the hardware to dispatch jobs in parallel, enabling
(for example) vertex shaders from different draws to execute in parallel
while there are strict dependencies between tiling the geometry of a
draw and running that vertex shader.

For a while, we got by with an incredible series of total hacks,
manually coding indices, lists, and dependencies. That worked for a
moment, but combinatorial kaboom kicked in and it became an
unmaintainable mess of spaghetti code.

We can do better. This commit explicitly handles the scoreboarding by
providing high-level manipulation for jobs. Rather than a command like
"set dependency #2 to index 17", we can express quite naturally "add a
dependency from job T on job V". Instead of some open-coded logic to
copy a draw pointer into a delicate context array, we now have an
elegant exposed API to simple "queue a job of type XYZ".

The design is influenced by both our current requirements (standard ES2
draws and u_blitter) as well as the need for more complex scheduling in
the future. For instance, blits can be optimized to use only a tiler
job, without a vertex job first (since the screen-space vertices are
known ahead-of-time) -- causing tiler-only jobs. Likewise, when using
transform feedback with rasterizer discard enabled, vertex jobs are
created (to run vertex shaders) with no corresponding tiler job. Both of
these cases break the original model and could not be expressed with the
open-coded logic. More generally, this will make it easier to add
support for compute shaders, geometry shaders, and fused jobs (an
optimization available on Bifrost).

Incidentally, this moves quite a bit of state from the driver context to
the batch, which helps with Rohan's refactor to eventually permit
pipelining across framebuffers (one important outstanding optimization
for FBO-heavy workloads).

v2: Add comment explaining the meaning of "primary batch" as suggested
by Tomeu (trivial - not reviewed).

Signed-off-by: Alyssa Rosenzweig <alyssa.rosenzweig@collabora.com>
Reviewed-by: Tomeu Vizoso <tomeu.vizoso@collabora.com>
Reviewed-by: Rohan Garg <rohan.garg@collabora.com>
7 files changed