"TLB parity error in virtual array; TLB error 'instruction"?

**Ant[_3_]** · March 8th 10, 04:45 PM posted to comp.sys.ibm.pc.hardware.chips

Hello.

Lately, I have been getting rare, random kernel panics on my old
Debian/Linux box (tried both kernel versions 2.6.30 and 2.6.32). I
couldn't figure out what it was until I discovered mcelog a couple of days
ago, and it revealed some interesting, scary data in my dmesg/messages and
syslog:

# cat /var/log/messages
...
Mar 7 08:25:24 MyLinuxBox kernel: [ 3299.988026] Machine check events logged
Mar 7 08:25:24 MyLinuxBox mcelog: HARDWARE ERROR. This is *NOT* a software problem!
Mar 7 08:25:24 MyLinuxBox mcelog: Please contact your hardware vendor
Mar 7 08:25:24 MyLinuxBox mcelog: MCE 0
Mar 7 08:25:24 MyLinuxBox mcelog: CPU 1 1 instruction cache
Mar 7 08:25:24 MyLinuxBox mcelog: ADDR c11b6ff0
Mar 7 08:25:24 MyLinuxBox mcelog: TIME 1267979124 Sun Mar 7 08:25:24 2010
Mar 7 08:25:24 MyLinuxBox mcelog: TLB parity error in virtual array
Mar 7 08:25:24 MyLinuxBox mcelog: TLB error 'instruction transaction, level 1'
Mar 7 08:25:24 MyLinuxBox mcelog: STATUS 9400000000010011 MCGSTATUS 0
Mar 7 08:25:24 MyLinuxBox mcelog: MCGCAP 105 APICID 1 SOCKETID 0
Mar 7 08:25:24 MyLinuxBox mcelog: CPUID Vendor AMD Family 15 Model 43

I am not familiar with hardware, so I assume this is very bad, but what
part(s) is/are bad? Is my old Athlon 64 X2 CPU dying/damaged? I have had
it and its motherboard since 12/24/2006, so it is not that old yet. I
have the full details on my secondary machine at
http://alpha.zimage.com/~ant/antfarm.../computers.txt ...

Although, this might be related to the PSU's death back in early
December 2009. My friend and I believe it also took out my EVGA GeForce
8800 GT video card and damaged a 512 MB stick of RAM (we tested all 3 GB
together and then each stick separately with memtest86+ v4.00 to narrow
it down). http://alpha.zimage.com/~ant/antfarm/about/toys.html has a log
of the details of my systems. I ran memtest86+ again a couple of weeks
ago and this morning for 5-6 hours, and got no errors after five full
tests (all passed). I also do not overclock.

Thank you in advance.

--
"Above ground I shall be food for kites; below I shall be food for
mole-crickets and ants. Why rob one to feed the other?" --Juang-zu (4th
Century B.C.)
/\___/\
/ /\ /\ \ Phil./Ant @ http://antfarm.ma.cx (Personal Web Site)
| |o o| | Ant's Quality Foraged Links: http://aqfl.net
\ _ / Nuke ANT from e-mail address: NT
( ) or
Ant is currently not listening to any songs on his home computer.

**Yousuf Khan** · March 9th 10, 03:01 AM posted to comp.sys.ibm.pc.hardware.chips

Ant wrote:
> Hello.
>
> Lately, I have been getting rare, random kernel panics on my old
> Debian/Linux box (tried both kernel versions 2.6.30 and 2.6.32). I
> couldn't figure out what it was until I discovered mcelog a couple of
> days ago, and it revealed some interesting, scary data in my
> dmesg/messages and syslog:
>
> # cat /var/log/messages
> ...
> Mar 7 08:25:24 MyLinuxBox kernel: [ 3299.988026] Machine check events logged
> Mar 7 08:25:24 MyLinuxBox mcelog: HARDWARE ERROR. This is *NOT* a software problem!
> Mar 7 08:25:24 MyLinuxBox mcelog: Please contact your hardware vendor
> Mar 7 08:25:24 MyLinuxBox mcelog: MCE 0
> Mar 7 08:25:24 MyLinuxBox mcelog: CPU 1 1 instruction cache
> Mar 7 08:25:24 MyLinuxBox mcelog: ADDR c11b6ff0
> Mar 7 08:25:24 MyLinuxBox mcelog: TIME 1267979124 Sun Mar 7 08:25:24 2010
> Mar 7 08:25:24 MyLinuxBox mcelog: TLB parity error in virtual array
> Mar 7 08:25:24 MyLinuxBox mcelog: TLB error 'instruction transaction, level 1'
> Mar 7 08:25:24 MyLinuxBox mcelog: STATUS 9400000000010011 MCGSTATUS 0
> Mar 7 08:25:24 MyLinuxBox mcelog: MCGCAP 105 APICID 1 SOCKETID 0
> Mar 7 08:25:24 MyLinuxBox mcelog: CPUID Vendor AMD Family 15 Model 43
>
> I am not familiar with hardware, so I assume this is very bad, but what
> part(s) is/are bad? Is my old Athlon 64 X2 CPU dying/damaged? I have had
> it and its motherboard since 12/24/2006, so it is not that old yet. I
> have the full details on my secondary machine at
> http://alpha.zimage.com/~ant/antfarm.../computers.txt ...

Yeah, TLB stands for Translation Lookaside Buffer; it's the part of
the processor that caches the translations from virtual memory pages to
physical ones. I'm not sure what they are referring to when they talk
about "virtual array", unless it has something to do with OS
virtualization. In any case, if your TLB is damaged, then various
programs will fail if their memory pages get tracked by the bad TLB entry.
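
For anyone who wants to see what the raw numbers say, the STATUS value from the log can be picked apart by hand. Below is a minimal Python sketch, assuming the general x86 machine-check `MCi_STATUS` layout (flag bits 63 down to 57, a 16-bit error code in the low bits, and the compound TLB code pattern `0000 0000 0001 TTLL`). The function name `decode_status` and the returned field names are made up for illustration, and the model-specific bits are left undecoded, so check the processor's own manual before reading too much into it.

```python
# Flag bits of an x86 machine-check MCi_STATUS register.
# Assumption: the general architectural layout; model-specific bits are not decoded.
FLAG_BITS = {
    "VAL": 63,    # the register holds a valid error record
    "OVER": 62,   # an earlier error record was overwritten
    "UC": 61,     # the error was not corrected by hardware
    "EN": 60,     # reporting of this error was enabled
    "MISCV": 59,  # the MISC register holds extra information
    "ADDRV": 58,  # the ADDR register holds a valid address
    "PCC": 57,    # processor context may be corrupt
}

# Fields of the compound TLB error code 0000 0000 0001 TTLL.
TLB_TRANSACTION = {0: "instruction", 1: "data", 2: "generic"}
TLB_LEVEL = {0: "level 0", 1: "level 1", 2: "level 2", 3: "generic level"}


def decode_status(status_hex):
    """Split a hex MCi_STATUS string into flags and error-code fields."""
    status = int(status_hex, 16)
    code = status & 0xFFFF
    decoded = {
        "flags": [name for name, bit in FLAG_BITS.items() if (status >> bit) & 1],
        "error_code": code,                         # bits 15-0
        "model_specific": (status >> 16) & 0xFFFF,  # bits 31-16, left undecoded
        "tlb": None,
    }
    if (code & 0xFFF0) == 0x0010:  # error code matches the TLB pattern
        transaction = TLB_TRANSACTION.get((code >> 2) & 0x3, "reserved")
        level = TLB_LEVEL[code & 0x3]
        decoded["tlb"] = (transaction, level)
    return decoded
```

On the value from the log, `decode_status("9400000000010011")` reports the flags VAL, EN and ADDRV and a TLB code of instruction / level 1, which lines up with the mcelog text. Under the layout assumed here, the UC bit is clear in that value.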

> Although, this might be related to the PSU's death back in early
> December 2009. My friend and I believe it also took out my EVGA GeForce
> 8800 GT video card and damaged a 512 MB stick of RAM (we tested all 3 GB
> together and then each stick separately with memtest86+ v4.00 to narrow
> it down). http://alpha.zimage.com/~ant/antfarm/about/toys.html has a log
> of the details of my systems. I ran memtest86+ again a couple of weeks
> ago and this morning for 5-6 hours, and got no errors after five full
> tests (all passed). I also do not overclock.
>
> Thank you in advance.

If that PSU failure took out so much other hardware, then it's likely it
took out your processor too, and it took longer for it to finally fail.
CPU chips tend to be more robust than memory chips and GPU chips, with a
lot more redundancy, so they may show signs of failure much later.

Memtest86+ won't find faults inside the CPU; it only tests for faults in
the RAM.

Yousuf Khan

**Robert Redelmeier** · March 9th 10, 04:49 AM posted to comp.sys.ibm.pc.hardware.chips

Ant wrote in part:
> part(s) is/are bad? Is my old Athlon 64 X2 CPU dying/damaged? [snip]
>
> Although, this might be related to the PSU's death back in early
> December 2009. My friend and I believe it also took out my EVGA
> GeForce 8800 GT video card and damaged a 512 MB stick of RAM (we
> tested all 3 GB together and then each stick separately with
> memtest86+ v4.00 to narrow it down).

As Yousuf has mentioned, any PSU failure serious enough to
damage RAM could easily damage the CPU, especially an AMD one with
the RAM controller and busses inside the CPU.

> http://alpha.zimage.com/~ant/antfarm/about/toys.html has a log of
> the details of my systems. I ran memtest86+ again a couple of
> weeks ago and this morning for 5-6 hours, and got no errors
> after five full tests (all passed). I also do not overclock.

memtest86 is a good program, but it is more extensive than intensive.
It tests all memory, but not especially hard. If you want to
diagnose further, you could try running a few dozen copies of my
`burnMMX P`. It is a bit old and not quite as high bandwidth as
possible on newer processors. If there is no error, they should
stay running indefinitely. Watch for terminations and/or dmesg.
Running them under `nice -19` should increase TLB transitions.

-- Robert author `cpuburn` http://pages.sbcglobal.net/redelm

**Ant[_3_]** · March 9th 10, 06:37 AM posted to comp.sys.ibm.pc.hardware.chips

On 3/8/2010 7:01 PM PT, Yousuf Khan typed:

> If that PSU failure took out so much other hardware, then it's likely it
> took out your processor too, and it took longer for it to finally fail.
> CPU chips tend to be more robust than memory chips and GPU chips, with a
> lot more redundancy, so they may show signs of failure much later.

Ah, that could be it. So far, a 512 MB stick of RAM and the video card
went bust with the PSU. Too bad my friend and I did not see any physical
evidence of busted caps, discoloration, etc.

> Memtest86+ won't find faults inside the CPU; it only tests for faults in
> the RAM.

What's a good way to test the CPU? I tried sys_basher, unraring 10 GB of
data, memtest86+ v4.00 (you said it is only for RAM), etc. None of them
caused kernel panics. The crashes seem to happen during idle time. I do
not use AMD's Cool'n'Quiet or PowerNow-K8.
--
"To conquer the world, we must be as meticulous and calculating as a
colony of ants on the march." --Julius Caesar

**Ant[_3_]** · March 9th 10, 06:43 AM posted to comp.sys.ibm.pc.hardware.chips

On 3/8/2010 8:49 PM PT, Robert Redelmeier typed:

> As Yousuf has mentioned, any PSU failure serious enough to
> damage RAM could easily damage the CPU, especially an AMD one with
> the RAM controller and busses inside the CPU.

Damn. Do Intel CPUs do better with this?

> memtest86 is a good program, but it is more extensive than intensive.
> It tests all memory, but not especially hard. If you want to
> diagnose further, you could try running a few dozen copies of my
> `burnMMX P`. It is a bit old and not quite as high bandwidth as
> possible on newer processors. If there is no error, they should
> stay running indefinitely. Watch for terminations and/or dmesg.
> Running them under `nice -19` should increase TLB transitions.
>
> -- Robert author `cpuburn` http://pages.sbcglobal.net/redelm

Thanks. I will try it when I don't need to use the box. You should
update your program to support the newer processors.

--
"Is this stuff any good for ants?" "No, it kills them." --unknown

**Ant[_3_]** · March 9th 10, 08:18 AM posted to comp.sys.ibm.pc.hardware.chips

On 3/8/2010 8:49 PM PT, Robert Redelmeier typed:

> If you want to diagnose further, you could try running a few dozen
> copies of my `burnMMX P`. It is a bit old and not quite as high
> bandwidth as possible on newer processors. If there is no error,
> they should stay running indefinitely. Watch for terminations
> and/or dmesg.
>
> -- Robert author `cpuburn` http://pages.sbcglobal.net/redelm

I ran two "burnMMX P" processes (without nice -19), since I have a
dual-core 939 CPU, and saw no crashes or errors after 45.25 minutes.
Before I aborted both, I checked the temperatures, which look OK for a
75°F room:

$ sensors -f
acpitz-virtual-0
Adapter: Virtual device
temp1: +71.2°F (crit = +206.2°F)

k8temp-pci-00c3
Adapter: PCI adapter
Core0 Temp: +129.2°F
Core1 Temp: +104.0°F
--
"It is said that the lonely eagle flies to the mountain peaks while the
lowly ant crawls the ground, but cannot the soul of the ant soar as high
as the eagle?" --unknown

**Ant[_3_]** · March 9th 10, 08:52 AM posted to comp.sys.ibm.pc.hardware.chips

On 3/8/2010 8:49 PM PT, Robert Redelmeier typed:

> `burnMMX P`. It is a bit old and not quite as high bandwidth as
> possible on newer processors. If there is no error, they should
> stay running indefinitely. Watch for terminations and/or dmesg.
> Running them under `nice -19` should increase TLB transitions.

I hope I am doing this correctly for nice -19. I ran one "nice -19
./burnMMX p" command for over 25.5 minutes with no problems or errors.

$ sensors -f
acpitz-virtual-0
Adapter: Virtual device
temp1: +71.2°F (crit = +206.2°F)

k8temp-pci-00c3
Adapter: PCI adapter
Core0 Temp: +122.0°F
Core1 Temp: +98.6°F

Or was I supposed to run two of them?
--
"I like ants, in chocolate. Crunch, hummmm." --unknown

**Robert Redelmeier** · March 9th 10, 02:01 PM posted to comp.sys.ibm.pc.hardware.chips

Ant wrote in part:
> On 3/8/2010 8:49 PM PT, Robert Redelmeier typed:
>
> I ran two "burnMMX P" processes (without nice -19), since I
> have a dual-core 939 CPU, and saw no crashes or errors after
> 45.25 minutes. Before I aborted both, I checked the
> temperatures, which look OK for a 75°F room:
>
> $ sensors -f
> acpitz-virtual-0
> Adapter: Virtual device
> temp1: +71.2°F (crit = +206.2°F)
>
> k8temp-pci-00c3
> Adapter: PCI adapter
> Core0 Temp: +129.2°F
> Core1 Temp: +104.0°F

You must have good cooling. `burnMMX P` only exercises 64 MB
of RAM. You should run about _forty_ (40) of them. This also
eats up more TLB entries and causes more task switching with
nice -19.

It won't run any hotter (maybe slightly cooler with all the
task switching) but will exercise more of the TLB. At the
very least, run 3 or 5 copies.

-- Robert author `cpuburn` http://pages.sbcglobal.net/redelm

**Robert Redelmeier** · March 9th 10, 02:13 PM posted to comp.sys.ibm.pc.hardware.chips

Ant wrote in part:
> On 3/8/2010 8:49 PM PT, Robert Redelmeier typed:
>
>> As Yousuf has mentioned, any PSU failure serious enough to
>> damage RAM could easily damage the CPU, especially an AMD one with
>> the RAM controller and busses inside the CPU.
>
> Damn. Do Intel CPUs do better with this?

Not really. With an Intel CPU, the same spike would fry the
northbridge along with the RAM. It depends on whether you prefer to fry
your mobo or your CPU. The CPU is easier to replace, but also often
costs more.

> Thanks. I will try it when I don't need to use the box. You
> should update your program to support the newer processors.

Probably. But I have neither the time nor the need. I'm still running
the same box I had in 1999.

-- Robert author `cpuburn` http://pages.sbcglobal.net/redelm

**Ant[_3_]** · March 9th 10, 02:44 PM posted to comp.sys.ibm.pc.hardware.chips

On 3/9/2010 6:01 AM PT, Robert Redelmeier typed:

> Ant wrote in part:
>> On 3/8/2010 8:49 PM PT, Robert Redelmeier typed:
>
>> I ran two "burnMMX P" processes (without nice -19), since I
>> have a dual-core 939 CPU, and saw no crashes or errors after
>> 45.25 minutes. Before I aborted both, I checked the
>> temperatures, which look OK for a 75°F room:
>>
>> $ sensors -f
>> acpitz-virtual-0
>> Adapter: Virtual device
>> temp1: +71.2°F (crit = +206.2°F)
>>
>> k8temp-pci-00c3
>> Adapter: PCI adapter
>> Core0 Temp: +129.2°F
>> Core1 Temp: +104.0°F
>
> You must have good cooling. `burnMMX P` only exercises 64 MB
> of RAM. You should run about _forty_ (40) of them. This also
> eats up more TLB entries and causes more task switching with
> nice -19.

40 times?! Is there an easy way to run it in one command or something? I
had to run two of them manually earlier.

> It won't run any hotter (maybe slightly cooler with all the
> task switching) but will exercise more of the TLB. At the
> very least, run 3 or 5 copies.

OK, I am fine with 40, but I don't want to manually copy and paste 40 times.
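
A small launcher would save the copy and paste. What follows is a rough Python sketch, not something that ships with cpuburn: `launch_copies` and `terminated` are hypothetical helper names, and it assumes `nice` is on the PATH and that `./burnMMX` sits in the current directory.

```python
import subprocess


def launch_copies(command, copies, niceness=19):
    """Start `copies` instances of `command`, each run under nice.

    `nice -n 19` is the POSIX spelling of the older `nice -19` form.
    """
    prefix = ["nice", "-n", str(niceness)]
    return [subprocess.Popen(prefix + list(command)) for _ in range(copies)]


def terminated(procs):
    """Return (index, exit_code) for every copy that has already exited."""
    finished = []
    for index, proc in enumerate(procs):
        exit_code = proc.poll()  # None while the copy is still running
        if exit_code is not None:
            finished.append((index, exit_code))
    return finished
```

Starting the run would then be `procs = launch_copies(["./burnMMX", "P"], 40)`, followed by a call to `terminated(procs)` every so often. Since the copies are supposed to keep running indefinitely when nothing is wrong, anything that shows up in that list deserves a look, along with dmesg. To stop the run, call `proc.terminate()` on each entry in `procs`.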

