#1
Recommended hard drive temperature
I've been reading this document, which is an analysis of Google's hard disc failure rates:

Failure Trends in a Large Disk Drive Population:
http://research.google.com/archive/disk_failures.pdf

It states that "contrary to previously reported results, we found very little correlation between failure rates and either elevated temperature or activity levels."

Figure 4 "shows that failures do not increase when the average temperature increases. In fact, there is a clear trend showing that lower temperatures are associated with higher failure rates. Only at very high temperatures is there a slight reversal of this trend."

"Figure 5 looks at the average temperatures for different age groups. The distributions are in sync with Figure 4 showing a mostly flat failure rate at mid-range temperatures and a modest increase at the low end of the temperature distribution. What stands out are the 3 and 4-year old drives, where the trend for higher failures with higher temperature is much more constant and also more pronounced."

"Overall our experiments can confirm previously reported temperature effects only for the high end of our temperature range and especially for older drives. In the lower and middle temperature ranges, higher temperatures are not associated with higher failure rates."

Figure 5 suggests that Google's optimum temperature for hard drives is between 35C and 40C.

Elsewhere I found this old IBM article:
http://web.archive.org/web/200005192.../drivetemp.htm

It states that "figure 2 shows the dramatic effect that temperature has on the overall reliability of a hard disk drive. Derivations [sic] from a nominal operating temperature (assumed to be maintained over the life of a drive) can result in a derivation [sic] from the nominal failure rate. As the temperature exceeds the recommended level, the failure rate increases two to three percent for every one degree rise above it. For example, a hard disk drive running for an extended period of time at five degrees above the recommended temperature can experience an increase in failure rate of 10 to 15 percent. Likewise, operating a drive below the recommended temperature can extend drive life."

This last statement is a bit ambiguous. If a hard drive is more reliable at a temperature below that which is recommended, then why not recommend a lower temperature in the first place? Then again, maybe the author's intended meaning was "recommended maximum temperature".

- Franc Zabkar
--
Please remove one 'i' from my address when replying by email.
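The arithmetic in the IBM quote can be checked with a minimal sketch. It assumes the increase is simply linear in the number of degrees above the recommended temperature, which is what the article's own example (five degrees giving 10 to 15 percent) implies. The function name and the choice to return zero at or below the recommended temperature are assumptions made for illustration; the article gives no figure for the benefit of running cooler.

```python
def failure_rate_increase(degrees_above, pct_per_degree):
    """Estimated percentage increase in failure rate.

    Assumes a linear rule: pct_per_degree percent for every degree
    above the recommended temperature. At or below the recommended
    temperature this sketch returns 0.0, because the IBM text gives
    no number for that case.
    """
    if degrees_above <= 0:
        return 0.0
    return degrees_above * pct_per_degree


# The IBM example: five degrees above recommended, at 2% and 3% per degree.
low = failure_rate_increase(5, 2.0)
high = failure_rate_increase(5, 3.0)
print(f"5 degrees above recommended: +{low:.0f}% to +{high:.0f}%")
```

Note that this only reproduces the IBM rule of thumb; it says nothing about which temperature counts as "recommended", which is exactly the ambiguity raised above.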
#2
Previously Franc Zabkar wrote:
> I've been reading this document which is an analysis of Google's hard disc failure rates:
[...]

If you can keep your HDDs below around 40C or so, then you will run them under data-center conditions. These conditions are what the Google study is about. An example from my personal experience is with Maxtor disks. They had direct outside airflow and stayed at 30C under load and at 22C when idle. No failures in 3 years for about 50 disks. These were the same Maxtors known to die fast when run hot (e.g. at 50-60C).

Conditions in a typical PC are different. The HDDs are often not directly cooled with outside air and can get hot under load. If you have temperature spikes in the 50C range or higher, temperature is a major factor in HDD death. How major exactly is currently unknown or only known to the manufacturers.

Most drives have a 55C stated maximum temperature. The Maxtors I mention above had a statement in their product manual that up to 60C the drive failure rate would not increase, despite a 55C maximum temperature. There is reason to believe that statement was over-optimistic or a plain lie. So don't expect the HDD manufacturers to tell you about high-temperature life expectancy.

Bottom line, the Google study shows that if you can get the drives consistently down to below 40C, temperature does not matter a lot. So the recommendation would be to have your drives (under load, on a hot day) below 40C at all times. Note that this also applies to external enclosures.

Arno
#3
On 16 Apr 2008 12:20:06 GMT, Arno Wagner put finger to keyboard and composed:
> Bottom line, the Google study shows that if you can get the drives consistently down to below 40C, temperature does not matter a lot. So the recommendation would be to have your drives (under load, on a hot day) below 40C at all times. Note that this also applies to external enclosures.
>
> Arno

AFAICS, the Google study conclusively shows that failure rates also increase when temperatures drop below 35C. In fact lower temps appear to be more dangerous than slightly higher temps, except when the drive is getting old, in which case higher temps start to become significant.

- Franc Zabkar
--
Please remove one 'i' from my address when replying by email.
#4
Previously Franc Zabkar wrote:
[...]
> AFAICS, the Google study conclusively shows that failure rates also increase when temperatures drop below 35C. In fact lower temps appear to be more dangerous than slightly higher temps, except when the drive is getting old, in which case higher temps start to become significant.

Don't read too much into it. AFAIR they did not separate by manufacturer, model and manufacturing date. It is quite possible that the drives running at lower temperatures were actually from a batch that had a lower life expectancy from the start and stayed at lower temperatures because of different cooling characteristics, i.e. there may well be a systematic error in the measurements.

Arno
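The kind of systematic error described above can be illustrated with a small sketch. All counts below are hypothetical, invented purely for illustration, and are not taken from the Google paper. Batch "A" is assumed to be a weak batch that mostly runs cool, and batch "B" a robust batch that mostly runs warm. Within each batch temperature has no effect at all, yet the pooled numbers make cool drives look worse.

```python
# Hypothetical population, NOT data from the paper.
# (batch, temperature band) -> (number of drives, failures in one year)
population = {
    ("A", "cool"): (1000, 80),
    ("A", "warm"): (100, 8),
    ("B", "cool"): (100, 2),
    ("B", "warm"): (1000, 20),
}


def afr(drives, failures):
    """Annualized failure rate in percent for one year of observation."""
    return 100.0 * failures / drives


def pooled_afr(band):
    """AFR for one temperature band with all batches mixed together."""
    drives = 0
    failures = 0
    for (batch, b), (d, f) in population.items():
        if b == band:
            drives += d
            failures += f
    return afr(drives, failures)


for (batch, band), (d, f) in population.items():
    print(f"batch {batch}, {band}: {afr(d, f):.1f}% AFR")
for band in ("cool", "warm"):
    print(f"pooled, {band}: {pooled_afr(band):.2f}% AFR")
```

In this made-up population each batch fails at the same rate whether cool or warm, but the pooled cool band shows roughly three times the failure rate of the pooled warm band, simply because the weak batch dominates the cool band. Whether anything like this happened in the Google data cannot be determined from the paper, since it gives no per-model breakdown.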
#5
On 16 Apr 2008 22:10:18 GMT, Arno Wagner put finger to keyboard and composed:
[...]
> Don't read too much into it. AFAIR they did not separate by manufacturer, model and manufacturing date. It is quite possible that the drives running at lower temperatures were actually from a batch that had less life expectancy from the start and stay at lower temperatures because of different cooling characteristics, i.e. there may well be a systematic error in the measurements.
>
> Arno

The way I read it, the reliability-versus-temperature result was found to be consistent across all models and manufacturers.

==================================================================
Failure rates are known to be highly correlated with drive models, manufacturers and vintages. Our results do not contradict this fact. For example, Figure 2 [Annualized failure rates broken down by age groups] changes significantly when we normalize failure rates per each drive model. Most age-related results are impacted by drive vintages. However, in this paper, we do not show a breakdown of drives per manufacturer, model, or vintage due to the proprietary nature of these data. Interestingly, this does not change our conclusions. In contrast to age-related results, we note that all results shown in the rest of the paper are not affected significantly by the population mix.
==================================================================
The data in this study are collected from a large number of disk drives, deployed in several types of systems across all of Google’s services. More than one hundred thousand disk drives were used for all the results presented here. The disks are a combination of serial and parallel ATA consumer-grade hard disk drives, ranging in speed from 5400 to 7200 rpm, and in size from 80 to 400 GB. All units in this study were put into production in or after 2001. The population contains several models from many of the largest disk drive manufacturers and from at least nine different models.
==================================================================

- Franc Zabkar
--
Please remove one 'i' from my address when replying by email.
#6
Previously Franc Zabkar wrote:
[...]
> The way I read it, the reliability-versus-temperature result was found to be consistent across all models and manufacturers.

Indeed. But did they have all models and all manufacturers at all temperatures?

[...]
> In contrast to age-related results, we note that all results shown in the rest of the paper are not affected significantly by the population mix.
[...]

Hmm, I have to look at the paper again. This smells rather strongly of a methodological error.

Ok, I have it now. I think you refer to figure 5: "AFR for average drive temperature". This one seems to indicate slightly higher failure rates for the 15...30C window than for the others in drives younger than 3 years. If you consult figure 4, you see that temperature extremes are rare.

Then there is one thing: Partially defective drives work slower or not at all. This may result in lower drive temperatures (spin down, refusal to execute access) or higher drive temperatures (lots and lots of retries, heat from bearings). This can significantly skew the results. The basic result could be that failing drives run hotter or colder than others. I am also missing more breakdowns into different temperature profiles (e.g. mainly constant, strong variation, etc.), as it is, e.g., possible that the problem in the low temp section is due to cycling temperatures.

I am not saying the results are wrong, but they are suspicious and with the data given are _very_ difficult to even understand properly. It does not seem any statistics expert was consulted by the writers and the temperature results are by far the weakest in the paper. I also miss a proof or at least conclusive argument that the remaining observations are temperature independent, both for absolute value and different change profiles.

The paper is still very valuable. Figures 7-10 give solid results, and need no further details. Scanning your disks every 2 weeks or so and monitoring reallocation counts is a very good idea (and something I have been doing for several years now). The folks at Google likely also found that the SMART status alone is typically over-optimistic.

As to many failures not being predicted by SMART data, my results are different. It is possible that the drive selection here again skewed the picture compared to modern drives. Personally I have had 100% prediction by SMART attributes (not SMART status though) in an admittedly small population of about 50 drives over three years and with mostly Maxtors that are known to fail gradually.

Arno
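Monitoring reallocation counts as suggested above could be sketched roughly as follows. This assumes the attribute table comes from `smartctl -A /dev/sdX` (smartmontools) in its usual ten-column layout; the sample text, the function names and the 40C limit are assumptions for illustration, and attribute names and raw-value formats vary between drive vendors.

```python
import re

# Hypothetical sample in the layout printed by `smartctl -A`.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       8
194 Temperature_Celsius     0x0022   043   050   000    Old_age   Always       -       43 (0 20 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
"""


def parse_attributes(text):
    """Map attribute name -> leading integer of the RAW_VALUE column."""
    attrs = {}
    for line in text.splitlines():
        parts = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(parts) < 10 or not parts[0].isdigit():
            continue
        match = re.match(r"\d+", parts[9])
        if match:
            attrs[parts[1]] = int(match.group())
    return attrs


def find_warnings(attrs, last_reallocated=0, max_temp_c=40):
    """Return human-readable warnings for growing reallocations or heat."""
    warnings = []
    reallocated = attrs.get("Reallocated_Sector_Ct", 0)
    if reallocated > last_reallocated:
        warnings.append(
            f"reallocated sectors grew from {last_reallocated} to {reallocated}"
        )
    temp = attrs.get("Temperature_Celsius")
    if temp is not None and temp > max_temp_c:
        warnings.append(f"temperature {temp}C is above {max_temp_c}C")
    return warnings


attrs = parse_attributes(SAMPLE)
for message in find_warnings(attrs, last_reallocated=3):
    print(message)
```

The point of comparing against the previous reading (`last_reallocated`, stored however you like between runs) is that a growing count matters more than the absolute number, which matches the distinction made above between SMART attributes and the overall SMART status.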
#7
On 17 Apr 2008 13:22:52 GMT, Arno Wagner put finger to keyboard and composed:
[...]
> Then there is one thing: Partially defective drives work slower or not at all. This may result in lower drive temperatures (spin down, refusal to execute access) and higher drive temperatures (lots and lots of retries, heat from bearings). This can significantly skew the results.

I would expect that Google would identify a partially defective drive (assuming it was detected by SMART) and eventually take it out of service. Certainly, if the drive does not work at all, then by definition it must be totally, not partially, defective. Having said that, the article doesn't really give a satisfactory definition of failure other than to say that it is the reason that a drive is replaced.

shrug

As for spin problems, the article states ...

"Spin Retries. Counts the number of retries when the drive is attempting to spin up. We did not register a single count within our entire population."

> The basic results could be that failing drives run hotter or colder than others. I am also missing more break-downs into different temperature profiles (e.g. mainly constant, strong variation, etc.) as it is, e.g., possible that the problem in the low temp section is due to cycling temperatures.

The article states ...

"As is common in server-class deployments, the disks were powered on, spinning, and generally in service for essentially all of their recorded life. They were deployed in rack-mounted servers and housed in professionally managed datacenter facilities."

I think that would discount your temperature cycling hypothesis.

[...]
> As to many failures not being predicted by SMART data, my results are different. It is possible that the drive selection here again skewed the picture compared to modern drives. Personally I have had 100% prediction by SMART attributes (not SMART status though) in an admittedly small population of about 50 drives over three years and with mostly Maxtors that are known to fail gradually.
>
> Arno

With respect, I prefer to accept Google's experience.

"It is difficult to add temperature to this analysis since despite it being reported as part of SMART there are no crisp thresholds that directly indicate errors. However, if we arbitrarily assume that spending more than 50% of the observed time above 40C is an indication of possible problem, and add those drives to the set of predictable failures, we still are left with about 36% of all drives with no failure signals at all."

I notice also that Google have an interesting observation regarding seek errors.

"When examining our population, we find that seek errors are widespread within drives of one manufacturer only, while others are more conservative in showing this kind of errors. For this one manufacturer, the trend in seek errors is not clear, changing from one vintage to another. For other manufacturers, there is no correlation between failure rates and seek errors."

I wonder if the abovementioned manufacturer is Seagate. IME, when Seagate drives report a "seek error rate", they are actually reporting a seek count.

- Franc Zabkar
--
Please remove one 'i' from my address when replying by email.
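The paper's arbitrary temperature criterion quoted above (more than 50% of the observed time above 40C) is easy to apply to your own logged readings. The sketch below assumes the readings were taken at regular intervals, so that the fraction of samples approximates the fraction of time; the function names and the sample readings are hypothetical.

```python
def fraction_above(samples_c, threshold_c=40.0):
    """Fraction of temperature samples strictly above threshold_c.

    Assumes evenly spaced samples, so this approximates the fraction
    of observed time spent above the threshold.
    """
    if not samples_c:
        raise ValueError("no temperature samples")
    hot = sum(1 for t in samples_c if t > threshold_c)
    return hot / len(samples_c)


def flagged_as_hot(samples_c, threshold_c=40.0, min_fraction=0.5):
    """True if more than min_fraction of the samples exceed threshold_c."""
    return fraction_above(samples_c, threshold_c) > min_fraction


# Hypothetical hourly readings from two drives.
warm_drive = [38, 41, 42, 39, 43, 44]
cool_drive = [35, 36, 41, 37]

print(flagged_as_hot(warm_drive))  # 4 of 6 samples above 40C
print(flagged_as_hot(cool_drive))  # 1 of 4 samples above 40C
```

As the paper itself says, the 40C and 50% figures are arbitrary, so a flag from this check is at most a prompt to look at the drive's cooling, not a failure prediction.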
#8
Previously Franc Zabkar wrote:
[...]
> I would expect that Google would identify a partially defective drive (assuming it was detected by SMART) and eventually take it out of service. Certainly, if the drive does not work at all, then by definition it must be totally, not partially, defective. Having said that, the article doesn't really give a satisfactory definition of failure other than to say that it is the reason that a drive is replaced.

Problem is also that the failure time (according to the article) was the replacement time. I have heard the chief Google technology guy speak about this and he stated something like "every few months defectives are repaired". There can be a long time between failure and replacement.

> As for spin problems, the article states ...
> "Spin Retries. Counts the number of retries when the drive is attempting to spin up. We did not register a single count within our entire population."

That may just mean that no drive managed to get spun up at all after the first try failed. Or the attribute is unused.

> The article states ...
> "As is common in server-class deployments, the disks were powered on, spinning, and generally in service for essentially all of their recorded life. They were deployed in rack-mounted servers and housed in professionally managed datacenter facilities."
> I think that would discount your temperature cycling hypothesis.

Not at all. The very fact that disks managed to get to high temperatures means that temperature cycles are possible.

[...]
> With respect, I prefer to accept Google's experience.
> "It is difficult to add temperature to this analysis since despite it being reported as part of SMART there are no crisp thresholds that directly indicate errors. However, if we arbitrarily assume that spending more than 50% of the observed time above 40C is an indication of possible problem, and add those drives to the set of predictable failures, we still are left with about 36% of all drives with no failure signals at all."

This does not counter my argument. It just states that at least 36% of failures are not temperature related. And it is, as noted, quite arbitrary. The authors are speculating here about whether temperature above 40C is the killer when observed more than 50% of the time. It is not in their environment. This does not surprise me at all.

Also note that there is no "Google's experience" in the paper. This is "observations in a specific environment by three people with Google" and certainly the observations are not well documented with regard to temperature. On the other hand, an air-conditioned data center and only two years of observation are not enough to answer that question conclusively.

> I notice also that Google have an interesting observation regarding seek errors.
> "When examining our population, we find that seek errors are widespread within drives of one manufacturer only, while others are more conservative in showing this kind of errors. For this one manufacturer, the trend in seek errors is not clear, changing from one vintage to another. For other manufacturers, there is no correlation between failure rates and seek errors."
> I wonder if the abovementioned manufacturer is Seagate. IME, when Seagate drives report a "seek error rate", they are actually reporting a seek count.

Quite frankly this shows that the authors do not have a lot of experience with SMART data. Seek errors are due to modern drives starting reading before the heads have settled. This usually works, but when it does not work it becomes a seek error. Some manufacturers list these in the SMART data, others do not. The number seen does not mean much, which is well known to people who work a lot with SMART data.

Arno
#9
Franc Zabkar wrote:
[awful big snip]
> I notice also that Google have an interesting observation regarding seek errors.
> "When examining our population, we find that seek errors are widespread within drives of one manufacturer only, while others are more conservative in showing this kind of errors. For this one manufacturer, the trend in seek errors is not clear, changing from one vintage to another. For other manufacturers, there is no correlation between failure rates and seek errors."
> I wonder if the abovementioned manufacturer is Seagate. IME, when Seagate drives report a "seek error rate", they are actually reporting a seek count.

What else did you think 'rate' meant?

> - Franc Zabkar
#10
Arno Wagner wrote:
[awful big snip]
> Quite frankly this shows that the authors have not a lot of experience with SMART data. Seek errors are due to modern drives starting reading before the heads have settled.

Babblebot, clueless as always. A seek error is a failure to find the addressed track. The drive has a full rev. to determine that it is on the correct track. It won't start to read user data until it has determined that it is on the right track and in the right rotational position. Also, there is no such time that the drive is *not* reading as it is reading the servo data all the time. If the drive determines that it is on the correct track then obviously the heads have settled.

> This usually works, but when it does not work it becomes a seek error.

Nope, it becomes a read error.

> Some manufacturers list these in the SMART data, others do not.

A seek error is a seek error, and that's that.

> The number seen does not mean much, which is well known to people that work a lot with SMART data.

Right, so obviously this should not be mentioned as an observation. Babblebot, S.M.A.R.T. as ever.

> Arno