Intel details future Larrabee graphics chip

**John Larkin**

On 7 Aug 2008 07:47:13 GMT, (Nick Maclaren) wrote:

In article ,
John Larkin writes:
| On Tue, 5 Aug 2008 12:54:14 -0700, "Chris M. Thomasson"
| wrote:
| "John Larkin" wrote in message
| .. .
|
| This has got to affect OS design.
|
| They need to completely rethink their multi-threaded synchronization
| algorithms. I have a feeling that efficient distributed non-blocking
| algorithms, which are comfortable running under a very weak cache coherency
| model will be all the rage. Getting rid of atomic RMW or StoreLoad style
| memory barriers is the first step.
|
| Run one process per CPU. Run the OS kernel, and nothing else, on one
| CPU. Never context switch. Never swap. Never crash.

Been there - done that :-)

That is precisely how the early SMP systems worked, and it works
for dinky little SMP systems of 4-8 cores. But the kernel becomes
the bottleneck for many workloads even on those, and it doesn't
scale to large numbers of cores. So you HAVE to multi-thread the
kernel.

Why? All it has to do is grant run permissions and look at the big
picture. It certainly wouldn't do I/O or networking or file
management. If memory allocation becomes a burden, it can set up four
(or fourteen) memory-allocation cores and let them do the crunching.
Why multi-thread *anything* when hundreds or thousands of CPUs are
available?

Using multicore properly will require undoing about 60 years of
thinking, 60 years of believing that CPUs are expensive.

John

**Nick Maclaren**

In article ,
John Larkin writes:
|
| | Run one process per CPU. Run the OS kernel, and nothing else, on one
| | CPU. Never context switch. Never swap. Never crash.
|
| Been there - done that :-)
|
| That is precisely how the early SMP systems worked, and it works
| for dinky little SMP systems of 4-8 cores. But the kernel becomes
| the bottleneck for many workloads even on those, and it doesn't
| scale to large numbers of cores. So you HAVE to multi-thread the
| kernel.
|
| Why? All it has to do is grant run permissions and look at the big
| picture. It certainly wouldn't do I/O or networking or file
| management. If memory allocation becomes a burden, it can set up four
| (or fourteen) memory-allocation cores and let them do the crunching.
| Why multi-thread *anything* when hundreds or thousands of CPUs are
| available?

I don't have time to describe 40 years of experience to you, and
it is better written up in books, anyway. Microkernels of the sort
you mention were trendy a decade or two back (look up Mach), but
introduced too many bottlenecks.

In theory, the kernel doesn't have to do I/O or networking, but
have you ever used a system where they were outside it? I have.

The reason that exporting them to multiple CPUs doesn't solve the
scalability problems is that the interaction rate goes up more
than linearly with the number of CPUs. And the same problem
applies to memory management, if you are going to allow shared
memory - or even virtual shared memory, as in PGAS languages.

And so it goes. TANSTAAFL.

| Using multicore properly will require undoing about 60 years of
| thinking, 60 years of believing that CPUs are expensive.

Now, THAT is true.

Regards,
Nick Maclaren.

**Chris M. Thomasson**

"John Larkin" wrote in message
...
On 7 Aug 2008 07:47:13 GMT, (Nick Maclaren) wrote:

In article ,
John Larkin writes:
| On Tue, 5 Aug 2008 12:54:14 -0700, "Chris M. Thomasson"
| wrote:
| "John Larkin" wrote in message
| .. .
|
| This has got to affect OS design.
|
| They need to completely rethink their multi-threaded synchronization
| algorithms. I have a feeling that efficient distributed non-blocking
| algorithms, which are comfortable running under a very weak cache coherency
| model will be all the rage. Getting rid of atomic RMW or StoreLoad style
| memory barriers is the first step.
|
| Run one process per CPU. Run the OS kernel, and nothing else, on one
| CPU. Never context switch. Never swap. Never crash.

Been there - done that :-)

That is precisely how the early SMP systems worked, and it works
for dinky little SMP systems of 4-8 cores. But the kernel becomes
the bottleneck for many workloads even on those, and it doesn't
scale to large numbers of cores. So you HAVE to multi-thread the
kernel.

Why? All it has to do is grant run permissions and look at the big
picture. It certainly wouldn't do I/O or networking or file
management. If memory allocation becomes a burden, it can set up four
(or fourteen) memory-allocation cores and let them do the crunching.

FWIW, I have a memory allocation algorithm which can scale because it's based
on per-thread/core/node heaps:

http://groups.google.com/group/comp....c40d42a04ee855

AFAICT, there is absolutely no need for memory-allocation cores. Each thread
can have a private heap such that local allocations do not need any
synchronization. Also, thread local deallocations of memory do not need any
sync. Local meaning that Thread A allocates memory M which is subsequently
freed by Thread A. When a thread's memory pool is exhausted, it then tries to
allocate from the core local heap. If that fails, then it asks the system,
and perhaps virtual memory comes into play.
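
A minimal sketch of the first two tiers described above, assuming C11, a fixed-size bump-pointer pool per thread, and a single shared pool standing in for the core-local heap. The names (`tiered_alloc`, `thread_try_alloc`, `shared_try_alloc`) and the pool sizes are hypothetical and are not taken from the linked allocator. It only shows why the thread-local path needs no synchronization while the shared path needs an atomic RMW.

```c
#include <stdatomic.h>
#include <stddef.h>

/* All names and sizes below are hypothetical, chosen for illustration. */
#define ALIGNMENT         16u
#define THREAD_POOL_BYTES 4096u
#define SHARED_POOL_BYTES 65536u

/* Private to each thread: plain loads and stores are enough. */
static _Thread_local _Alignas(16) unsigned char thread_pool[THREAD_POOL_BYTES];
static _Thread_local size_t thread_used;

/* Shared by all threads: the offset is advanced with an atomic RMW. */
static _Alignas(16) unsigned char shared_pool[SHARED_POOL_BYTES];
static atomic_size_t shared_used;

static size_t round_up(size_t sz) {
    return (sz + ALIGNMENT - 1) & ~(size_t)(ALIGNMENT - 1);
}

/* Level 1: no atomics, no memory barriers. */
static void *thread_try_alloc(size_t sz) {
    if (sz > THREAD_POOL_BYTES) return NULL;  /* also avoids overflow in round_up */
    sz = round_up(sz);
    if (sz > THREAD_POOL_BYTES - thread_used) return NULL;
    void *p = thread_pool + thread_used;
    thread_used += sz;
    return p;
}

/* Level 2 stand-in: a compare-and-swap loop on the shared offset. */
static void *shared_try_alloc(size_t sz) {
    if (sz > SHARED_POOL_BYTES) return NULL;
    sz = round_up(sz);
    size_t old = atomic_load_explicit(&shared_used, memory_order_relaxed);
    do {
        if (sz > SHARED_POOL_BYTES - old) return NULL;
    } while (!atomic_compare_exchange_weak_explicit(
                 &shared_used, &old, old + sz,
                 memory_order_relaxed, memory_order_relaxed));
    return shared_pool + old;
}

/* Try the private pool first, then fall back to the shared one. */
void *tiered_alloc(size_t sz) {
    void *p = thread_try_alloc(sz);
    return p ? p : shared_try_alloc(sz);
}
```

Relaxed ordering is used on the shared offset because the sketch only hands out disjoint ranges and publishes no data through it; a real allocator with free lists would need stronger ordering.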

A scalable high-level memory allocation algorithm for a super-computer
could look something like:
_____________________________________________________________
#include <stddef.h>

/* The *_Try_Allocate functions and Report_Allocation_Failure are
   placeholders assumed to be declared elsewhere; each returns NULL
   when its level cannot satisfy the request. */
void* malloc(size_t sz) {
    void* mem;

    /* level 1 - thread local */
    if (! (mem = Per_Thread_Try_Allocate(sz))) {

        /* level 2 - core local */
        if (! (mem = Per_Core_Try_Allocate(sz))) {

            /* level 3 - physical chip local */
            if (! (mem = Per_Chip_Try_Allocate(sz))) {

                /* level 4 - node local */
                if (! (mem = Per_Node_Try_Allocate(sz))) {

                    /* level 5 - system-wide */
                    if (! (mem = System_Try_Allocate(sz))) {

                        /* level 6 - failure */
                        Report_Allocation_Failure(sz);
                        return NULL;
                    }
                }
            }
        }
    }

    return mem;
}
_____________________________________________________________

Level 1 does not need any atomic RMW OR membars at all.

Level 2 does not need membars, but needs atomic RMW.

Level 3 would need membars and atomic RMW.

Level 4 is the same as level 3.

Level 5 is the worst-case scenario, may need MPI...

Level 6 is total memory exhaustion! Ouch...

All local frees have same overhead while all remote frees need atomic RMW
and possibly membars.

This algorithm can scale to very large numbers of cores, chips and nodes.
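
The difference between local and remote frees can be sketched as follows. This is one common way to do it, assumed here for illustration and not necessarily what the linked allocator does: each heap keeps a private free list for its owner plus a lock-free stack that other threads push onto. The names `heap`, `heap_free_local`, `heap_free_remote` and `heap_reclaim` are hypothetical.

```c
#include <stdatomic.h>
#include <stddef.h>

/* A freed block reuses its own storage as a list link. */
struct block { struct block *next; };

struct heap {
    struct block *local_free;             /* touched only by the owning thread */
    _Atomic(struct block *) remote_free;  /* pushed to by any other thread */
};

/* Local free: owner only, so plain stores suffice (no RMW, no membar). */
void heap_free_local(struct heap *h, struct block *b) {
    b->next = h->local_free;
    h->local_free = b;
}

/* Remote free: a compare-and-swap push (atomic RMW) with release
   ordering, so the owner sees the block's contents once it sees it. */
void heap_free_remote(struct heap *h, struct block *b) {
    struct block *head =
        atomic_load_explicit(&h->remote_free, memory_order_relaxed);
    do {
        b->next = head;
    } while (!atomic_compare_exchange_weak_explicit(
                 &h->remote_free, &head, b,
                 memory_order_release, memory_order_relaxed));
}

/* Owner drains the remote stack with a single atomic exchange and
   moves the blocks to its private list. Returns the number reclaimed. */
size_t heap_reclaim(struct heap *h) {
    struct block *b =
        atomic_exchange_explicit(&h->remote_free, NULL, memory_order_acquire);
    size_t n = 0;
    while (b) {
        struct block *next = b->next;  /* read before the link is reused */
        heap_free_local(h, b);
        b = next;
        n++;
    }
    return n;
}
```

Because remote threads only push and the owner only takes the whole list at once, the sketch avoids the ABA problem that a lock-free pop of single nodes would have to deal with.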

Using multicore properly will require undoing about 60 years of
thinking, 60 years of believing that CPUs are expensive.

The bottleneck is the cache-coherency system. Luckily, there are years of
experience in dealing with weak cache schemes... Think RCU.
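
For readers who have not met RCU, a rough sketch of the read-copy-update pattern in C11 atomics. It assumes a single updater and leaves out grace-period detection entirely, so it is not a usable RCU implementation; the names `config`, `read_timeout` and `update_timeout` are made up for the example. The point is that the read side needs no atomic RMW and no StoreLoad barrier.

```c
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical read-mostly data protected by the pattern. */
struct config { int max_conn; int timeout_ms; };

static _Atomic(struct config *) current_config;

/* Reader: one acquire load, no lock, no atomic RMW. */
int read_timeout(void) {
    struct config *c =
        atomic_load_explicit(&current_config, memory_order_acquire);
    return c ? c->timeout_ms : -1;
}

/* Updater (single updater assumed): copy, modify the copy, publish it.
   The previous version is handed back in *retired. It must not be freed
   until every reader that might still hold it has finished; detecting
   that (the grace period) is the hard part and is not shown here.
   Returns 0 on success, -1 if allocation fails. */
int update_timeout(int timeout_ms, struct config **retired) {
    struct config *old =
        atomic_load_explicit(&current_config, memory_order_relaxed);
    struct config *fresh = malloc(sizeof *fresh);
    if (!fresh) return -1;
    if (old) {
        *fresh = *old;            /* copy */
    } else {
        fresh->max_conn = 0;
    }
    fresh->timeout_ms = timeout_ms;  /* update the private copy */
    atomic_store_explicit(&current_config, fresh, memory_order_release);
    *retired = old;
    return 0;
}
```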

Why multi-thread *anything* when hundreds or thousands of CPUs are
available?

You don't think there is any need for communication between cores on a chip?

**Chris M. Thomasson**

"Chris M. Thomasson" wrote in message
...
"John Larkin" wrote in
message ...
[...]
Using multicore properly will require undoing about 60 years of
thinking, 60 years of believing that CPUs are expensive.

The bottleneck is the cache-coherency system.

I meant to say:

/One/ bottleneck is the cache-coherency system.

Luckily, there are years of experience in dealing with weak cache
schemes... Think RCU.

**Nick Maclaren**

In article ,
"Chris M. Thomasson" writes:
|
| FWIW, I have a memory allocation algorithm which can scale because its based
| on per-thread/core/node heaps:
|
| AFAICT, there is absolutely no need for memory-allocation cores. Each thread
| can have a private heap such that local allocations do not need any
| synchronization.

Provided that you can live with the constraints of that approach.
Most applications can, but not all.

Regards,
Nick Maclaren.

**Dirk Bruere at NeoPax**

NV55 wrote:
On Aug 5, 5:26 am, Dirk Bruere at NeoPax
wrote:
Skybuck Flying wrote:
As the number of cores goes up, do the watt requirements go up too?
Will we need a zillion watts of power soon?
Bye,
Skybuck.
Since the ATI Radeon™ HD 4800 series has 800 cores you work it out.

--
Dirk

Each of the 800 "cores" in the ATI RV770 (Radeon 4800 series), which are
simple stream processors, is not comparable to the 16, 24, 32 or 48
cores that will be in Larrabee. Just like they're not comparable to
the 240 "cores" in the Nvidia GeForce GTX 280. Though I'm not saying
you didn't realize that, just for those that might not have.

True, but they seem to be positioning Larrabee in the same tech segment
as video cards. Which makes sense since a SIMD system is the easiest to
program. If they want N general purpose cores doing general purpose
computing the whole thing will bog down somewhere between 16 and 32. A
lot of the R&D theory was done 30+ years ago.

Maybe they will try something radical, like an ancient data flow
architecture, but I doubt it.

--
Dirk

http://www.transcendence.me.uk/ - Transcendence UK
http://www.theconsensus.org/ - A UK political party
http://www.onetribe.me.uk/wordpress/?cat=5 - Our podcasts on weird stuff

**Robert Myers**

On Aug 7, 4:57 pm, Dirk Bruere at NeoPax
wrote:

Each of the 800 "cores" in the ATI RV770 (Radeon 4800 series), which are
simple stream processors, is not comparable to the 16, 24, 32 or 48
cores that will be in Larrabee. Just like they're not comparable to
the 240 "cores" in the Nvidia GeForce GTX 280. Though I'm not saying
you didn't realize that, just for those that might not have.

True, but they seem to be positioning Larrabee in the same tech segment
as video cards. Which makes sense since a SIMD system is the easiest to
program. If they want N general purpose cores doing general purpose
computing the whole thing will bog down somewhere between 16 and 32. A
lot of the R&D theory was done 30+ years ago.

Maybe they will try something radical, like an ancient data flow
architecture, but I doubt it.

"General purpose" GPU's are not really general purpose, but they
aren't doing graphics, either.

Robert.

**Bernd Paysan**

Nick Maclaren wrote:
In theory, the kernel doesn't have to do I/O or networking, but
have you ever used a system where they were outside it? I have.

Actually, doing I/O or networking in a "main" CPU is a waste of resources. Any
sane architecture (CDC 6600, mainframes) has a bunch of multi-threaded IO
processors, which you program so that the main CPU has little effort to
deal with IO.

This works well even when you do virtualization. The main CPU sends a
pointer to an IO processor program ("high-level abstraction", not the
device driver details) to the IO processor, which in turn runs the device
driver to get the data in or out. In a VM, the VM monitor has to
sanity-check the command, maybe rewrites it ("don't write to track 3 of
disk 5, write it to the 16 sectors starting at sector 8819834 in disk 1,
which is where the virtual volume of this VM sits").
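
The sanity check and rewrite step can be sketched like this. The structures and the function `vmm_rewrite` are hypothetical, and the sketch assumes sector addressing throughout, with a guest's virtual volume mapped to one contiguous range on a physical disk.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical high-level IO command as the guest would issue it. */
struct io_cmd {
    uint32_t disk;
    uint64_t first_sector;
    uint32_t sector_count;
};

/* Hypothetical mapping of one virtual volume onto a physical disk. */
struct volume_map {
    uint32_t virt_disk;   /* disk number the guest believes it has */
    uint32_t phys_disk;   /* disk that really holds the volume */
    uint64_t phys_base;   /* first physical sector of the volume */
    uint64_t sectors;     /* size of the volume in sectors */
};

/* Validate the guest's command against its volume and, if it is in
   bounds, rewrite it in place to target the physical location.
   Returns false and leaves the command untouched if it is rejected. */
bool vmm_rewrite(const struct volume_map *m, struct io_cmd *cmd) {
    if (cmd->disk != m->virt_disk) return false;
    if (cmd->sector_count == 0) return false;
    if (cmd->first_sector >= m->sectors) return false;
    if (cmd->sector_count > m->sectors - cmd->first_sector) return false;

    cmd->disk = m->phys_disk;
    cmd->first_sector += m->phys_base;
    return true;
}
```

With a map of `{5, 1, 8819786, 1000}` and an assumed 16 sectors per track, a guest write to track 3 of disk 5 (virtual sectors 48 to 63) comes out as 16 sectors starting at sector 8819834 on disk 1, and only the rewritten command reaches the IO processor.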

The fact that in PCs the main CPU is doing IO (even down to the level of
writing to individual IO ports) is a consequence of saving CPUs - no money
for an IO processor, the 8088 can do that itself just fine. Why we'll soon
have 32 x86 cores, but still no IO processor is beyond what I can
understand.

Basically all IO in a modern PC is sending fixed- or variable-sized packets
over some sort of network - via SATA/SCSI, via USB, Firewire, or Ethernet,
etc.

--
Bernd Paysan
"If you want it done right, you have to do it yourself"
http://www.jwdt.com/~paysan/

**Jan Panteltje**

On a sunny day (Fri, 08 Aug 2008 13:02:15 +0200) it happened Bernd Paysan
wrote in :

Nick Maclaren wrote:
In theory, the kernel doesn't have to do I/O or networking, but
have you ever used a system where they were outside it? I have.

Actually, doing I/O or networking in a "main" CPU is a waste of resources. Any
sane architecture (CDC 6600, mainframes) has a bunch of multi-threaded IO
processors, which you program so that the main CPU has little effort to
deal with IO.

This works well even when you do virtualization. The main CPU sends a
pointer to an IO processor program ("high-level abstraction", not the
device driver details) to the IO processor, which in turn runs the device
driver to get the data in or out. In a VM, the VM monitor has to
sanity-check the command, maybe rewrites it ("don't write to track 3 of
disk 5, write it to the 16 sectors starting at sector 8819834 in disk 1,
which is where the virtual volume of this VM sits").

The fact that in PCs the main CPU is doing IO (even down to the level of
writing to individual IO ports) is a consequence of saving CPUs - no money
for an IO processor, the 8088 can do that itself just fine. Why we'll soon
have 32 x86 cores, but still no IO processor is beyond what I can
understand.

Basically all IO in a modern PC is sending fixed- or variable-sized packets
over some sort of network - via SATA/SCSI, via USB, Firewire, or Ethernet,
etc.

Do not forget that since the days of the 8088, with CPUs running at maybe about 13 MHz,
we now run at 3.4 GHz: 3400 / 13 = about 261 times faster,
and faster still because of better architectures.
This leaves plenty of time for a CPU to do normal IO.
And in fact the IO has always been hardware supported.
For example, although you can poll a serial port bit by bit, there is a hardware shift register,
and a hardware FIFO too.
Although you can construct sectors for a floppy in software bit by bit, there is a floppy controller
with write pre-compensation etc., all in hardware.
Although you could do graphics in software, there is a graphics card with hardware acceleration.
The first two are included in the chip set, maybe the graphics too.
The same goes for Ethernet: it is a dedicated chip, or included in the chip set,
taking the place of your 'IO processor'.
Same thing for hard disks, and those may even have on-board encryption; all you
have to do is specify a sector number and send the sector data.

So there is no real need for a separate IO processor; in fact you will likely find a processor
in all that dedicated hardware, or maybe an FPGA.

**John Larkin**

On Fri, 08 Aug 2008 11:30:04 GMT, Jan Panteltje
wrote:

On a sunny day (Fri, 08 Aug 2008 13:02:15 +0200) it happened Bernd Paysan
wrote in :

Nick Maclaren wrote:
In theory, the kernel doesn't have to do I/O or networking, but
have you ever used a system where they were outside it? I have.

Actually, doing I/O or networking in a "main" CPU is a waste of resources. Any
sane architecture (CDC 6600, mainframes) has a bunch of multi-threaded IO
processors, which you program so that the main CPU has little effort to
deal with IO.

This works well even when you do virtualization. The main CPU sends a
pointer to an IO processor program ("high-level abstraction", not the
device driver details) to the IO processor, which in turn runs the device
driver to get the data in or out. In a VM, the VM monitor has to
sanity-check the command, maybe rewrites it ("don't write to track 3 of
disk 5, write it to the 16 sectors starting at sector 8819834 in disk 1,
which is where the virtual volume of this VM sits").

The fact that in PCs the main CPU is doing IO (even down to the level of
writing to individual IO ports) is a consequence of saving CPUs - no money
for an IO processor, the 8088 can do that itself just fine. Why we'll soon
have 32 x86 cores, but still no IO processor is beyond what I can
understand.

Basically all IO in a modern PC is sending fixed- or variable-sized packets
over some sort of network - via SATA/SCSI, via USB, Firewire, or Ethernet,
etc.

Do not forget that since the days of the 8088, with CPUs running at maybe about 13 MHz,
we now run at 3.4 GHz: 3400 / 13 = about 261 times faster,
and faster still because of better architectures.
This leaves plenty of time for a CPU to do normal IO.
And in fact the IO has always been hardware supported.
For example, although you can poll a serial port bit by bit, there is a hardware shift register,
and a hardware FIFO too.
Although you can construct sectors for a floppy in software bit by bit, there is a floppy controller
with write pre-compensation etc., all in hardware.
Although you could do graphics in software, there is a graphics card with hardware acceleration.
The first two are included in the chip set, maybe the graphics too.
The same goes for Ethernet: it is a dedicated chip, or included in the chip set,
taking the place of your 'IO processor'.
Same thing for hard disks, and those may even have on-board encryption; all you
have to do is specify a sector number and send the sector data.

So there is no real need for a separate IO processor; in fact you will likely find a processor
in all that dedicated hardware, or maybe an FPGA.

That's the IBM "channel controller" concept: add complex, specialized
DMA-based I/O controllers to take the load off the CPU. But if you
have hundreds of CPUs, the strategy changes.

John
