#1
Memory Transactions per Second.
"Robert Redelmeier" wrote in message ...

In alt.lang.asm Skybuck Flying wrote in part:

" Why don't you write a program that simply reads a lot of memory locations (say 100 billion) and use the wall-clock timer to determine how much time it takes, and from there the real bandwidth. You just have to make the program access a similar data set (i.e. random locations) as the one you want to use in real life.

That's pretty hard to do when I don't have the system, don't you think? I want to know this information so I can decide which system to buy. And there are a lot of systems out there!

" Sure, but a friendly store would let you run test pgms. Some time ago, I wrote (apologies for the C):

LOL. Next time I will ask the webstore if they want to run a program like that. Of course there is a problem with your reasoning: if it's a new system, I might not yet know how to program it.

" /* lat10m.c - Measure latency of 10 million fresh memory reads

Only 10 million? I think my AMD X2 3800+ dual core can easily do 180 million per core per second. So your test is already flawed. Doesn't that scare you a bit? I'd rather have the manufacturer figure it out... if they can't figure it out, then who can?

Also, another potential problem is running out of memory. Suppose the processor is 1 terahertz and the memory is only 10 gigabytes; then one can't simply generate a "unique" table and pointer-chase-walk it. There is a chance that some of the memory addresses will be cached, and the more transactions happen, the more chance they get cached. However, as long as the theoretical maximum number of memory transactions is less than the amount of memory, this is for now not yet a problem, but it could become one later on. A small problem perhaps, but nonetheless.

A simple solution would be an option in the BIOS/processor to turn caching off, for proper memory benchmarking. (My Pentium III can actually do it; the AMD X2 3800+ cannot, as far as I know?!)

Perhaps I should run a memtest on my Pentium III one day with caching off... just to see what happens! LOL =D

Bye,
  Skybuck.
#2
In alt.lang.asm Skybuck Flying wrote in part:

" LOL. Next time I will ask the webstore if they want to run a program like that.

Why not? All they need to do under Linux is make the systems visible via ssh and give you a guest login.

" Of course there is a problem with your reasoning: if it's a new system, I might not yet know how to program it.

It'd better have a C compiler somewhere. Then you could run my code.

" /* lat10m.c - Measure latency of 10 million fresh memory reads

" Only 10 million? I think my AMD X2 3800+ dual core can easily do 180 million per core per second.

Maybe in-order. The quad Athlon/DDR3 I tested could only do 12 million pseudo-random. I could try it multicore.

" So your test is already flawed. Doesn't that scare you a bit?

Not in the least. I know it's flawed. Going to 100M iterations is an easy change.

" I'd rather have the manufacturer figure it out... if they can't figure it out, then who can?

No-one -- memory performance will depend on the DRAM used (timings), which the CPU mfr doesn't control.

" Also another potential problem is running out of memory. Suppose the processor is 1 terahertz and the memory is only 10 gigabytes; then one can't simply generate a "unique" table and pointer-chase-walk it. There is a chance that some of the memory addresses will be cached, and the more transactions happen, the more chance they get cached.

Not really -- if I have 10 GB RAM and 10 MB cache, the chance a random location will be in cache is 0.001.

-- Robert
#3
"Robert Redelmeier" wrote in message ...

LOL. Next time I will ask the webstore if they want to run a program like that.

" Why not? All they need to do under Linux is make the systems visible via ssh and give you a guest login.

Way too complex, way too dangerous to connect systems like that. And they don't trust it and don't have the time for it. And the systems I am interested in are probably in pieces.

Of course there is a problem with your reasoning: if it's a new system, I might not yet know how to program it.

" It'd better have a C compiler somewhere. Then you could run my code.

Unlikely, since I am interested in CUDA and other parallel systems, which require special programming and compilers.

" /* lat10m.c - Measure latency of 10 million fresh memory reads

Only 10 million? I think my AMD X2 3800+ dual core can easily do 180 million per core per second.

" Maybe in-order. The quad Athlon/DDR3 I tested could only do 12 million pseudo-random. I could try it multicore.

Another big flaw in your test program: your benchmark is probably limited by the random number generator. It makes sense... go view some benchmarks about integer performance.

So your test is already flawed. Doesn't that scare you a bit?

" Not in the least. I know it's flawed. Going to 100M iterations is an easy change.

Unlikely, since it's flaw after flaw... it needs a new approach. It's gonna take you a while to do it properly.

I'd rather have the manufacturer figure it out... if they can't figure it out, then who can?

" No-one -- memory performance will depend on the DRAM used (timings), which the CPU mfr doesn't control.

Timings are a start. My formula for now is:

MemoryTransactionPerSecond = MemoryClockInHertz / BytesPerTransaction

^ Seems pretty good to me, and pretty simple too!

Also another potential problem is running out of memory. Suppose the processor is 1 terahertz and the memory is only 10 gigabytes; then one can't simply generate a "unique" table and pointer-chase-walk it. There is a chance that some of the memory addresses will be cached, and the more transactions happen, the more chance they get cached.

" Not really -- if I have 10 GB RAM and 10 MB cache, the chance a random location will be in cache is 0.001.

Not quite. First of all, the table requires 64-bit memory addresses; using anything less wouldn't make much sense, since that would slow it down even further. Therefore 10 GB must be divided by 8 bytes, which gives 1.342.177.280 possible memory locations. The cache itself is probably even worse and stores 64 bytes per cache line or so, which means: 10 MB / 64 = 163.840 memory locations.

It's highly likely that all cache lines will be overwritten each time and that the chance of actually hitting anything is near zero. This is called cache thrashing. So in a way I dismissed my own worries about the cache; nonetheless it's an interesting calculation, so let's go on.

The real chance of hitting the cache is: 163.840 / 1.342.177.280 = 0,0001220703125

An order of magnitude worse than you thought; another big flaw in your reasoning! =D

One could wonder why caches are added at all if their chance of being hit is 0,0001220703125 (about 0,012%) in this scenario... which would make them absolutely worthless. At least the data cache would be worthless... the instruction cache would be nice... though perhaps the data cache is also still nice for static pointers to tables or so... pointers which never change... like base addresses.

However, the story does not end yet... because you have another flaw in your reasoning... another missing factor in the equation. The chance must be multiplied by the number of tries, which depends on the actual speed of the memory system. I mentioned a 1 terahertz processor, so let's keep it realistic and divide that by 100 for a realistic memory system.

That's 10 GHz for the memory system, which means 10.000.000.000 * 0,0001220703125 = 1.220.703 cache hits per second!

So it turns out the cache is not useless after all, and does play a big role... what a surprising turn of the story, isn't it! So again, another big flaw in your reasoning.

Let's calculate how many additional hertz that cache gives (at, say, 100 cycles saved per hit): 1.220.703 * 100 = 122.070.300

That's roughly an additional 122 MHz, just because of the cache. That's not so bad... it could of course be a lot better if the memory seeks were within the cache's range, but that's just the point: they are not.

Let's see what percentage that is of the entire memory system: (122 MHz / 10 GHz) * 100% = 1,22%

That's still a bit too small for my taste! =D

I hope you're not a chip designer or anything, cause your tests suck big time! =D

Bye,
  Skybuck =D