September 6th 06, 10:29 PM, posted to comp.sys.intel, alt.comp.hardware.amd.x86-64, alt.comp.periphs.videocards.ati, alt.comp.periphs.videocards.nvidia, comp.sys.ibm.pc.hardware.video
Subject: The Coming Combo Of The CPU And GPU, Ray Tracing Versus Rasterization, And Why Billions Of Dollars Is At Stake

Yousuf Khan wrote:
It's not going to take thousands of gigahertz to do it. The only reason
it takes that long on regular CPUs is that they aren't optimized for
it. A GPU can be designed that can do it in milliseconds, if there is
enough parallelism available. Ray-traced images using Lightwave have
nothing to do with these sorts of raytraces.


Raytracing is easy to parallelize, very easy in fact. The problem is
bandwidth. Whereas a triangle rasterizer needs only a relatively
compact amount of information, a ray tracer must have the FULL SCENE
accessible (random access!), and that data is hierarchical and needs
traversal just to find a hit. For a scanconverter/rasterizer, most of
the data is nested at most two levels deep (dependent texture reads
notwithstanding).

The processing (before DX10) happens at the vertex and fragment level.
A fragment reads from basically only a few places (I'm putting this in
OpenGL terms; the principle is very similar for D3D, for obvious
reasons):

- uniforms (both built-in and user defined)
- varyings (ditto)
- samplers (read: textures)

Those are the primary data sources. There are more: alpha blending,
for example, uses the render target for read/write access, and other
associated buffers such as the depth/stencil (ZS) also come to mind.
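
To make the shape of that concrete, here is a rough C++ sketch of a
fragment's data sources. Uniforms, Varyings, Sampler and shade_fragment
are made-up stand-ins for illustration, not a real GL or D3D API; the
point is just that a fragment reads from a small, fixed set of places:

struct Vec2 { float x, y; };
struct Vec4 { float r, g, b, a; };

struct Uniforms {              // constant for the whole draw call
    float time;
    Vec4  light_color;
};

struct Varyings {              // interpolated per fragment from the vertices
    Vec2 uv;
    Vec4 vertex_color;
};

struct Sampler {               // stands in for a texture unit
    // A real sampler would filter texels; this just fakes a value.
    Vec4 sample(Vec2 uv) const { return Vec4{ uv.x, uv.y, 0.0f, 1.0f }; }
};

// A "fragment program": everything it touches is one of the three
// sources above, plus the render target / depth-stencil it writes to.
Vec4 shade_fragment(const Uniforms& u, const Varyings& v, const Sampler& tex)
{
    Vec4 texel = tex.sample(v.uv);
    return Vec4{ texel.r * u.light_color.r,
                 texel.g * u.light_color.g,
                 texel.b * u.light_color.b,
                 texel.a * v.vertex_color.a };
}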

The biggest overhead comes from dependent texture reads. If the
coordinate is computed, the latency of the computation is easier to
hide; for a dependent texture read it is harder, because there is a
strict ordering: the actual result from the texture sampler unit must
arrive before it can be used as a texture coordinate, and that doesn't
pipeline very well. Enough of trivialities.
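
In terms of the made-up types from the sketch above, the difference
looks roughly like this; in the dependent case the second fetch cannot
even be issued until the first has returned:

// Coordinate computed with plain ALU work: the fetch can be issued as
// soon as the arithmetic is done, and its latency overlapped.
Vec4 computed_coordinate(const Sampler& tex, Vec2 uv)
{
    Vec2 scaled{ uv.x * 2.0f, uv.y * 2.0f };
    return tex.sample(scaled);
}

// Dependent read: the second fetch waits on the first, so the two
// latencies add up instead of overlapping.
Vec4 dependent_read(const Sampler& indirection, const Sampler& tex, Vec2 uv)
{
    Vec4 t = indirection.sample(uv);      // must complete first...
    return tex.sample(Vec2{ t.r, t.g });  // ...before this can start
}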

A raytracer, however, must for each ray see the *whole* database at
once and *traverse it*. Binary space partitioning is an often-used
technique to reduce the order of the complexity by a degree or two;
octrees and their close cousin, KD-trees, are also commonly employed,
and there are more. But they all still require a systematic traversal
of the tree before the remaining data is small enough for a
brute-force linear search.
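
To give a feel for that traversal, here is a minimal C++ sketch of a
KD-tree walk for a single ray. The KdNode/Ray/Triangle types are
hypothetical, and a real traverser would also clip the ray's t-range
and skip the far child; what matters is the pointer-chasing over a
structure spanning the whole scene, per ray, before any primitive test
happens:

#include <memory>
#include <vector>

struct Ray      { float origin[3]; float dir[3]; };
struct Triangle { float v0[3], v1[3], v2[3]; };
struct Hit      { float t; const Triangle* prim; };  // init t to a huge value

// Assumed helper: some ray/triangle test (e.g. the barycentric one
// sketched a couple of paragraphs further down).
bool ray_hits_triangle(const Ray& r, const Triangle& tri, float& t);

struct KdNode {
    int   axis;                       // 0/1/2 = split axis, -1 = leaf
    float split;                      // split plane position on that axis
    std::unique_ptr<KdNode> child[2];
    std::vector<Triangle>   prims;    // non-empty only in leaves
};

bool traverse(const KdNode& node, const Ray& ray, Hit& best)
{
    if (node.axis < 0) {              // leaf: brute-force a short list
        bool found = false;
        for (const Triangle& tri : node.prims) {
            float t;
            if (ray_hits_triangle(ray, tri, t) && t < best.t) {
                best.t    = t;
                best.prim = &tri;
                found     = true;
            }
        }
        return found;
    }
    // Interior node: visit the child containing the ray origin first.
    // (Always descending into both children, as here, is correct but
    // slower than a traverser that prunes the far side.)
    int near_side = ray.origin[node.axis] < node.split ? 0 : 1;
    bool hit = false;
    if (node.child[near_side])
        hit |= traverse(*node.child[near_side], ray, best);
    if (node.child[1 - near_side])
        hit |= traverse(*node.child[1 - near_side], ray, best);
    return hit;
}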

This involves a lot of maths; no problem there, that IS possible to
distribute. A design that would, I think, use the hardware better than
a straightforward C-to-VHDL-like translation would have units for the
computations that are common to raytracing: a ray-to-primitive
intersection unit, which could be chopped again into smaller pieces
like a barycentrics computation unit, a ray-to-plane solver, and so on
and on, so that each unit can be re-used as much as possible in the
different subtasks involved in this whole debacle.
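
As an example of what such an intersection unit would compute, here is
the well-known Moller-Trumbore ray/triangle test in plain C++, which is
mostly the barycentrics computation mentioned above (the little vector
type and function names are mine, only for illustration):

#include <cmath>

struct Vec3 { float x, y, z; };

static Vec3  sub(Vec3 a, Vec3 b)   { return { a.x-b.x, a.y-b.y, a.z-b.z }; }
static float dot(Vec3 a, Vec3 b)   { return a.x*b.x + a.y*b.y + a.z*b.z; }
static Vec3  cross(Vec3 a, Vec3 b) { return { a.y*b.z - a.z*b.y,
                                              a.z*b.x - a.x*b.z,
                                              a.x*b.y - a.y*b.x }; }

// Returns true (and the distance t plus barycentrics u, v) if the ray
// orig + t*dir crosses the triangle (v0, v1, v2).
bool ray_triangle(Vec3 orig, Vec3 dir, Vec3 v0, Vec3 v1, Vec3 v2,
                  float& t, float& u, float& v)
{
    const float eps = 1e-7f;
    Vec3  e1  = sub(v1, v0);
    Vec3  e2  = sub(v2, v0);
    Vec3  p   = cross(dir, e2);
    float det = dot(e1, p);
    if (std::fabs(det) < eps) return false;   // ray parallel to the plane
    float inv = 1.0f / det;
    Vec3  s   = sub(orig, v0);
    u = dot(s, p) * inv;
    if (u < 0.0f || u > 1.0f) return false;   // outside the triangle
    Vec3  q   = cross(s, e1);
    v = dot(dir, q) * inv;
    if (v < 0.0f || u + v > 1.0f) return false;
    t = dot(e2, q) * inv;
    return t > eps;                           // hit in front of the origin
}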

Then see what the load on each unit is, and see who's sleeping in
class.. the sleepers need more work, so add more of the units that are
under the heaviest load, in other words the ones that are the
bottleneck.

The main trick in this sort of "stuff" is to avoid waste where
possible; you don't want 200K gates sitting there doing nothing most of
the time. I think a design like this can be made practical, but the
problem is that it would involve a rather large amount of random
accessing into memory, random in ways that might be a challenge to
cache efficiently. (Compare texture sampler reads: caching reads from
textures has essentially been a solved problem for ages.. first tiling,
then compressing the tiles to reduce even the RAM-to-cache fetch
bandwidth, and what not.)
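
For the tiling part, the address arithmetic is roughly this; the tile
size and layout are arbitrary illustrations, not any particular chip's
scheme:

#include <cstddef>

const int TILE = 8;   // 8x8 texels per tile (arbitrary choice)

// Plain row-major address: vertical neighbours are a whole row apart.
std::size_t linear_index(int x, int y, int width)
{
    return static_cast<std::size_t>(y) * width + x;
}

// Tiled address: neighbours in both x and y usually land in the same
// tile, so a bilinear footprint touches far fewer cache lines.
// Assumes width is a multiple of TILE.
std::size_t tiled_index(int x, int y, int width)
{
    int tiles_per_row = width / TILE;
    int tile_x = x / TILE, in_x = x % TILE;
    int tile_y = y / TILE, in_y = y % TILE;
    std::size_t tile_base =
        (static_cast<std::size_t>(tile_y) * tiles_per_row + tile_x)
        * (TILE * TILE);
    return tile_base + in_y * TILE + in_x;
}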

I could easily be wrong, but my first impression is that caching and
memory bandwidth management would be the biggest technical issue here.
The second issue is that there is no infrastructure in place to make
money with this on a large scale, as there is for triangle
scanconverters. Even that took a while to build up momentum, and it was
the *obvious* way to go.

Wasn't there a raytracing chip a few years ago..? I don't recall ever
hearing what happened with that.