Softpanorama


Hard Drive Failures


Who is General Failure and why is he reading my disk?

Usenet SIG

If history repeats itself, and the unexpected always happens,
 how incapable must Man be of learning from experience.

Bernard Shaw

"Experience keeps a dear school, but fools will learn in no other"

Benjamin Franklin

Do not expect a hard drive, especially a laptop hard drive, to last forever (that is, for the whole useful life of the laptop, say five years). For laptops, anyone using a drive that is more than three years old is taking chances.

If your data are worth more than, say, $100, you are better off proactively replacing the drive with a new one every three years or so. See Slashdot Google Releases Paper on Disk Reliability. See also SUNRISE drive statistics: the article is in Russian, but the drive-related diagrams are self-explanatory.

If your data are valuable and you keep them on a laptop and travel a lot, it makes sense to proactively replace the hard drive every three years. That actually makes laptop leases more attractive than they look otherwise :-)

All drive manufacturers have good years and bad years. Still, you need to monitor drive statistics and take appropriate measures at the first SMART error report. As the Google paper concludes:

Our results confirm the findings of previous smaller population studies that suggest that some of the SMART parameters are well-correlated with higher failure probabilities. We find, for example, that after their first scan error, drives are 39 times more likely to fail within 60 days than drives with no such errors. First errors in reallocations, offline reallocations, and probational counts are also strongly correlated to higher failure probabilities. Despite those strong correlations, we find that failure prediction models based on SMART parameters alone are likely to be severely limited in their prediction accuracy, given that a large fraction of our failed drives have shown no SMART error signals whatsoever. This result suggests that SMART models are more useful in predicting trends for large aggregate populations than for individual components. It also suggests that powerful predictive models need to make use of signals beyond those provided by SMART.
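Watching for the first error of this kind can be sketched in a few lines of Python. This is only an illustration, not a monitoring tool: it parses the attribute table that `smartctl -A` (from smartmontools) prints and reports watched counters whose raw value is above zero. The sample report is made up, and the mapping of the paper's signals (reallocations, offline reallocations, probational counts) to these particular attribute names is an assumption; drives from different vendors report different attributes.

```python
import re

# Attribute names as printed by `smartctl -A`. Which of them correspond to the
# signals named in the Google paper is an assumption, and not every drive
# reports all of them.
WATCHED = {
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
}

# Hypothetical sample of `smartctl -A /dev/sda` output, trimmed to a few rows.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       8
  9 Power_On_Hours          0x0032   071   071   000    Old_age   Always       -       25611
194 Temperature_Celsius     0x0022   038   052   000    Old_age   Always       -       38
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
"""

# One table row: ID, name, flag, value, worst, thresh, type, updated,
# when_failed, then the raw value (only its leading integer is captured).
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]+\s+\d+\s+\d+\s+\d+"
    r"\s+\S+\s+\S+\s+\S+\s+(\d+)"
)


def nonzero_watched(report):
    """Return {attribute_name: raw_value} for watched attributes above zero."""
    hits = {}
    for line in report.splitlines():
        match = ROW.match(line)
        if not match:
            continue  # header line or anything that is not a table row
        name, raw = match.group(2), int(match.group(3))
        if name in WATCHED and raw > 0:
            hits[name] = raw
    return hits


if __name__ == "__main__":
    print(nonzero_watched(SAMPLE))
```

On a real system the report text would come from running `smartctl -A` against the device (usually as root) instead of the embedded sample. An empty result does not mean the drive is healthy: as the quote above says, a large fraction of failed drives showed no SMART error signals at all.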

Of course much depends on the manufacturer:

Re:Doesn't really tell us anything useful

(Score:2)

by toddestan (632714) on Sunday February 18, @05:04PM (#18062352)

Come on, who is the worst drive manufacturer, my money's on Maxtor - based on at least six drives dying within 6 months.

Could be, though I have had good luck with Maxtor. Could very well be Western Digital too.

Has IBM improved since merging with Hitachi, or have they just renamed the Deathstar?

The Hitachi drives seem to have average reliability. Really, there was just a couple of bad years there for IBM with the Deathstars. The ones before and after those drives don't seem any worse than average.

Does anyone even use Samsung drives? Whatever happened to Fujitsu and Conner - they really were bad, i.e. sometimes didn't even work when new!

My experience with Samsung drives is that they are quiet, low heat, and reliable. I recommend them. Conner got bought out by Quantum, which got bought by Seagate (IIRC). Fujitsu is still around, they make 2.5" drives mostly. Reliability seems average as 2.5" drives go.

Are Western Digital the best for SATA and Seagate the best for IDE, as is my opinion (got about a dozen of these and only one failure)

I have not had any SATA failures (yet). Seagate IDE drives do seem reliable, Western Digitals are terrible.

All Google told us is that temperature doesn't make a difference, and power-cycling may but they can't really tell as they don't do it often!

Actually, Google tells us that very low and very high temperatures are bad. The temperatures that most drives seem to operate at in most computers is the best, going outside of that is trouble.

I have a friend who has a theory that BitTorrent is really bad for drives, as its constant read/write of little bits.

I'm going to guess that a lot of Google's usage patterns are a lot like bittorrent (as in lots of small, random accesses and writes as opposed to large continuous reads and writes). Google's data seems to show us that a lot of usage like this is able to weed out the early failures quickly, but after that it doesn't matter until the drive gets old.

 


Old News ;-)

[Mar 02, 2007] Disk drive failures 13 times what vendors say, study says

March 02, 2007 (Computerworld) Customers are replacing disk drives at rates far higher than those suggested by the estimated mean time between failure (MTBF) supplied by drive vendors, according to a study of about 100,000 drives conducted by Carnegie Mellon University.

The study, presented last month at the 5th USENIX Conference on File and Storage Technologies in San Jose, also shows no evidence that Fibre Channel (FC) drives are any more reliable than less expensive but slower performing Serial ATA (SATA) drives.

That surprising comparison of FC and SATA reliability could speed the trend away from FC to SATA drives for applications such as near-line storage and backup, where storage capacity and cost are more important than sheer performance, analysts said.

At the same conference, another study of more than 100,000 drives in data centers run by Google Inc. indicated that temperature seems to have little effect on drive reliability, even as vendors and customers struggle to keep temperature down in their tightly packed data centers. Together, the results show how little information customers have to predict the reliability of disk drives in actual operating conditions and how to choose among various drive types.

Real world vs. data sheets

The Carnegie Mellon study examined large production systems, including high-performance computing sites and Internet services sites running SCSI, FC and SATA drives. The data sheets for those drives listed MTBF between 1 million to 1.5 million hours, which the study said should mean annual failure rates "of at most 0.88%." However, the study showed typical annual replacement rates of between 2% and 4%, "and up to 13% observed on some systems."
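The "at most 0.88%" figure follows from simple arithmetic: a drive that runs all year accumulates 8,760 power-on hours, and dividing that by the MTBF gives the approximate annual failure rate. A minimal Python sketch of the calculation follows; the function names are made up for illustration, and the formula assumes continuous operation and a constant failure rate.

```python
HOURS_PER_YEAR = 24 * 365  # 8760, assuming the drive is powered on all year


def afr_from_mtbf(mtbf_hours):
    """Approximate annual failure rate (as a fraction) implied by an MTBF."""
    return HOURS_PER_YEAR / mtbf_hours


def ratio_to_datasheet(observed_rate, mtbf_hours):
    """How many times higher an observed yearly rate is than the datasheet one."""
    return observed_rate / afr_from_mtbf(mtbf_hours)


if __name__ == "__main__":
    # Datasheet MTBF values quoted in the article.
    for mtbf in (1_000_000, 1_500_000):
        print(f"MTBF {mtbf:>9,} h -> AFR {afr_from_mtbf(mtbf):.2%}")
    # Replacement rates quoted in the article, compared with the 1M-hour figure.
    for observed in (0.02, 0.04, 0.13):
        ratio = ratio_to_datasheet(observed, 1_000_000)
        print(f"observed {observed:.0%} is {ratio:.1f}x the datasheet rate")
```

Keep in mind that the study counted replacements, not confirmed failures, so these ratios compare two somewhat different things.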

Garth Gibson, associate professor of computer science at Carnegie Mellon and co-author of the study, was careful to point out that the study didn't necessarily track actual drive failures, but cases in which a customer decided a drive had failed and needed replacement. He also said he has no vendor-specific failure information, and that his goal is not "choosing the best and the worst vendors" but to help them to improve drive design and testing.

He echoed storage vendors and analysts in pointing out that as many as half of the drives returned to vendors actually work fine and may have failed for any reason, such as a harsh environment at the customer site and intensive, random read/write operations that cause premature wear to the mechanical components in the drive.

Several drive vendors declined to be interviewed. "The conditions that surround true drive failures are complicated and require a detailed failure analysis to determine what the failure mechanisms were," said a spokesperson for Seagate Technology in Scotts Valley, Calif., in an e-mail. "It is important to not only understand the kind of drive being used, but the system or environment in which it was placed and its workload."

"Regarding various reliability rate questions, it's difficult to provide generalities," said a spokesperson for Hitachi Global Storage Technologies in San Jose, in an e-mail. "We work with each of our customers on an individual basis within their specific environments, and the resulting data is confidential."

Ashish Nadkarni, a principal consultant at GlassHouse Technologies Inc., a storage services provider in Framingham, Mass., said he isn't surprised by the comparatively high replacement rates because of the difference between the "clean room" environment in which vendors test and the heat, dust, noise or vibrations in an actual data center.

He also said he has seen overall drive quality falling over time as the result of price competition in the industry. He urged customers to begin tracking disk drive records "and to make a big noise with the vendor" to force them to review their testing processes.

FC vs. SATA

While a general reputation for increased reliability (as well as higher performance) is one of the reasons FC drives cost as much as four times more per gigabyte than SATA, "We had no evidence that SATA drives are less reliable than the SCSI or Fibre Channel drives," said Gibson. "I am not suggesting the drive vendors misrepresented anything," he said, adding that other variables such as workloads or environmental conditions might account for the similar reliability finding.

Analyst Brian Garrett at the Enterprise Storage Group in Milford, Mass., said he's not surprised because "the things that can go wrong with a drive are mechanical -- moving parts, motors, spindles, read-write heads," and these components are usually the same whether they are used in a SCSI or SATA drive. The electronic circuits around the drive and the physical interface are different, but are much less prone to failure.

Vendors do perform higher levels of testing on FC than on SATA drives, he said, but according to the study that extra testing hasn't produced "a measurable difference" in reliability.

Such findings might spur some customers to, for example, buy more SATA drives to provide more backup or more parity drives in a RAID configuration to get the same level of data protection for a lower price. However, Garrett cautioned, SATA continues to be best suited for applications such as backup and archiving of fixed content (such as e-mail or medical imaging) that must be stored for long periods of time but accessed quickly when it is needed. FC will remain the "gold standard" for online applications such as transaction processing, he predicts.

Don't sweat the heat?

The Google study examined replacement rates of more than 100,000 serial and parallel ATA drives deployed in Google's own data centers. Similar to the CMU methodology, a drive was considered to have failed if it was replaced as part of a repair procedure (rather than as being upgraded to a larger drive).

Perhaps the most surprising finding was no strong correlation between higher operating temperatures and higher failure rates. "That doesn't mean there isn't one," said Luiz Barroso, an engineer at Google and co-author of the paper, but it does suggest "that temperature is only one of many factors affecting the disk lifetime."

Garrett said that rapid changes in temperature -- such as when a malfunctioning air conditioner is fixed after a hot weekend and rapidly cools the data center -- can also cause drive failures.

The Google study also found that no single parameter, or combination of parameters, produced by the SMART (Self-Monitoring Analysis and Reporting Technology) built into disk drives is actually a good predictor of drive failure.
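A signal can be strongly correlated with failure and still be a poor predictor for an individual drive. The Python sketch below illustrates this with invented numbers: only the 39-fold relative risk comes from the Google paper quoted earlier on this page, while the population size, the number of drives showing the signal, and the base failure rate are hypothetical values chosen for the example.

```python
def signal_quality(n_drives, n_flagged, base_rate, relative_risk):
    """Return (precision, recall) of "flagged drive will fail" as a prediction.

    base_rate is the failure probability of a drive WITHOUT the signal;
    flagged drives fail with probability base_rate * relative_risk.
    """
    flagged_failures = n_flagged * base_rate * relative_risk
    unflagged_failures = (n_drives - n_flagged) * base_rate
    precision = flagged_failures / n_flagged
    recall = flagged_failures / (flagged_failures + unflagged_failures)
    return precision, recall


if __name__ == "__main__":
    # Hypothetical fleet: 100,000 drives, 2,000 of them with a first scan
    # error, and a 0.5% chance of failure for a drive without the signal.
    precision, recall = signal_quality(
        n_drives=100_000, n_flagged=2_000, base_rate=0.005, relative_risk=39
    )
    print(f"flagged drives that fail:       {precision:.1%}")
    print(f"failures that were flagged:     {recall:.1%}")
```

With these made-up inputs, most flagged drives do not fail in the window, and more than half of all failures come from drives that showed no signal, simply because unflagged drives vastly outnumber flagged ones.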

The bottom line

For customers running anything smaller than the massive data centers operated by Google or a university data center, though, the results might make little difference in their day-to-day operations. For many customers, the price of replacement drives is built into their maintenance contracts, so their expected service life only becomes an issue when the equipment goes off warranty and the customer must decide whether to "try to eke out another year or two" before the drive fails, said Garrett.

The studies won't change how Tom Dugan, director of technical services at Recovery Networks, a Philadelphia-based business continuity services provider, protects his data. "If they told me it was 100,000 hours, I'd still protect it the same way. If they told me it was 5 million hours I'd still protect it the same way. I have to assume every drive could fail."

Slashdot Google Releases Paper on Disk Reliability

Re:So

(Score:2, Informative)

by mightyQuin (1021045) on Sunday February 18, @12:37AM (#18057486)

From my experience, Western Digitals are (relatively) reliable. They unfortunately do not have the same power connector orientation as any other consumer drive on the planet, so if you want to use IDE RAID you have to get the type that either (1) fits any consumer ide drive or (2) fits a Western Digital Drive. (grr)

Had some good experiences with Maxtor. A couple of years ago (OK - maybe 6 or 8) we had batches of super reliable Maxtors - 10GB.

Some Samsungs are good, some are evil - the SP0411N was a particularly reliable model - the SP0802N sucked - out of a batch of 20, 15 of them died within a year: all reallocated sector errors beyond the threshold.

Seagates are a mixed bag too - been having a nice experience with the SATA models 160GB and 120GB - can't remember their model #'s off the top of my head. - The older Seagates, though, I spent a fair amount of time replacing.

IBM DeskStar's, as far as I know, have been quite good - for some reason didn't use too many.

Re:So

(Score:2, Informative)

by nevesis (970522) on Sunday February 18, @01:31AM (#18057738)

Interesting.. but I disagree with your analysis.

The DeskStars were nicknamed DeathStars due to their high failure rate.

Maxtor has a terrible reputation in the channel.

Seagate has a fantastic reputation in the channel.

And as far as the WD power connectors.. I have 4 Western Digitals, a Samsung, a Maxtor, and a Seagate on my desk right now.. and they all have the same layout (left to right: 40 pin, jumpers, molex).

Re:So

(Score:2)

by asuffield (111848) <asuffield@suffields.me.uk> on Sunday February 18, @06:54AM (#18058874)

Some newer 20GB on up, there was a downright scandal about extremely high failure rates on certain lines. It sounds like 1 plant producing them was turning out duds with a near 100% failure rate. IBM sold off the storage division to Hitachi, who now sells Hitachi Deskstars. I can only assume they closed the bad plant, or made sure the clean room was actually clean 8-).

This is how some guys I used to know in the storage division told the story - hearsay, but probably a reasonable approximation to what happened:

At the time, IBM had two disk fabrication plants. Certain lines of deskstars were being migrated to a new kind of platter technology (glass composition? something like that), which necessitated completely rebuilding the production lines.

One of those rebuilds was screwed. All the disks it produced were DOA, but not quite DOA enough to get the problem caught by their standard QA procedures. In the end they had to tear the whole thing down and rebuild it again.

In the end, about half the drives shipped in the affected product lines were defective. Because of how stock allocation from the two plants works, if the store you got your drive from gave you a defective one, most likely every single other drive in their storeroom was from the same plant and therefore also defective, so taking it back there for a warranty replacement was a joke. The deskstars got a bad reputation more from this than from anything else. IBM knew what was going on, but could do little to stop it, because until they got that plant rebuilt they just didn't *have* any replacement drives to hand out. A classic example of how a failure in the QA process can leave a company completely screwed.

By the time Hitachi bought the storage division, the bad production line was long gone and the QA procedures fixed.

I don't know why they didn't just throw in the towel and issue a product recall. Must have been a management decision. There's a lawsuit pending that might find out.


I read the abstract and the conclusion

(Score:2)

by mshurpik (198339) on Sunday February 18, @03:36AM (#18058336)

Their conclusion (and a glance at their results) indicates that drives fail because of product defects. However, home-use parameters such as brown power (low voltage on the line) are probably not taken into account in their server environment.

It's interesting, and I tend to trust their results, but these conclusions may not be relevant to single-drive situations. That is, if two customers purchase 1 drive each, and neither drive is defective, then this study doesn't explain why one drive would fail before the other. It also doesn't take into account the 1-year warranty foisted on the majority of PC-system purchasers these days.

by maestroX (1061960) on Sunday February 18, @04:35PM (#18062144)

Thank you.

Most drives will fail when the PSU delivers unstable output (ref: http://www.dansdata.com/), though some drives are less sensitive to power fluctuation.

It's pretty difficult to determine which drives are ok, since the manufacturers update these things every month.

I would like to hear the *CLUNK* sounds of failing drives at google though ;-)

... at the same conference, Bianca Schroeder presented a paper (cmu.edu) on disk reliability that developed sophisticated statistical models for disk failures, building on earlier work by Qin Xin (ucsc.edu) and a dozen papers by John Elerath...

C'mon, slashdot. There were about twenty other papers presented at FAST this year. Let's not focus only on the one with Google authors...

Re:Temperature conclusion

(Score:4, Interesting)

by gnu-sucks (561404) on Sunday February 18, @12:54AM (#18057568)
(http://lfnet.net/ | Last Journal: Wednesday February 02, @05:36AM)

My guess is this graph on temperature distribution is more or less a graph of temperature sensor accuracy. I can't imagine that drives at 50C had the lowest failure rate.

While this would require a more laboratory-like environment, a dozen drives of each type and manufacture could have been sampled at known temperatures, and a data curve could have been established to calibrate the temperature sensors.

There are lots of studies out there where drives were intentionally heated, and higher degrees of failure were indeed reported (this is mentioned in the google report too). So the correlation is probably still valid, just not well-proven.

Re:Temperature conclusion

(Score:1)

by Bazer (760541) on Sunday February 18, @09:50AM (#18059474)
(http://hoax.e-utp.net/)

The correlation between temperature and data loss is pretty well proven and it's called the superparamagnetic effect.

It happens when a grain of ferromagnetic material on the platter loses its magnetization due to temperature under the Curie temperature of said material. The probability of it happening is directly proportional to temperature (how close it is to the Curie point) and inversely proportional to the size of the grain and the (magnetic) hardness of the material.

Hitachi's "vertical bit" technology allowed the use of magnetically harder materials for smaller grains to achieve greater data density.

I guess this paper shows this isn't the main cause of drive failures because it is well proven and understood.
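The dependence described in the comment above is usually written as the Néel-Arrhenius relation for a single magnetic grain. It is added here as a reference formula and is not part of the original comment. The dependence is exponential rather than strictly proportional, but it points in the same direction: a higher temperature shortens the average time before a grain spontaneously flips, while a larger grain or a magnetically harder material lengthens it.

```latex
% Neel-Arrhenius relation (standard textbook form, single-domain grain):
%   \tau_N : mean time between spontaneous flips of the grain's magnetization
%   \tau_0 : attempt time, a material-dependent constant
%   K      : magnetic anisotropy energy density (the "hardness" of the material)
%   V      : grain volume
%   k_B    : Boltzmann constant
%   T      : absolute temperature
\tau_N = \tau_0 \exp\!\left( \frac{K V}{k_B T} \right)
```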

Recommended Links


Sites

SUNRISE drive statistics

The article is in Russian, but the drive-related diagrams are self-explanatory. According to their data, Toshiba is the worst among 2.5" drives and Seagate is the best.

http://www.dansdata.com/ Most drives will fail when the PSU delivers unstable output though some drives are less sensitive to power fluctuation.

5th USENIX Conference on File and Storage Technologies – Failure Trends in a Large Disk Drive Population

For people who want the bottom line and not a 13-page paper: check out Google's Disk Failure Experience (storagemojo.com).

It is estimated that over 90% of all new information produced in the world is being stored on magnetic media, most of it on hard disk drives. Despite their importance, there is relatively little published work on the failure patterns of disk drives, and the key factors that affect their lifetime. Most available data are either based on extrapolation from accelerated aging experiments or from relatively modest sized field studies. Moreover, larger population studies rarely have the infrastructure in place to collect health signals from components in operation, which is critical information for detailed failure analysis.

We present data collected from detailed observations of a large disk drive population in a production Internet services deployment. The population observed is many times larger than that of previous studies. In addition to presenting failure statistics, we analyze the correlation between failures and several parameters generally believed to impact longevity.

Our analysis identifies several parameters from the drive's self monitoring facility (SMART) that correlate highly with failures. Despite this high correlation, we conclude that models based on SMART parameters alone are unlikely to be useful for predicting individual drive failures. Surprisingly, we found that temperature and activity levels were much less correlated with drive failures than previously reported.

Slashdot Google Releases Paper on Disk Reliability

SMART Monitoring Tools

SourceForge.net S.M.A.R.T. Monitoring Tools

A Hard drive - disk diagnostic software to monitor hard disk - drive activity or health – Stellar Smart




Copyright © 1996-2018 by Dr. Nikolai Bezroukov. www.softpanorama.org was initially created as a service to the (now defunct) UN Sustainable Development Networking Programme (SDNP) in the author free time and without any remuneration. This document is an industrial compilation designed and created exclusively for educational use and is distributed under the Softpanorama Content License. Original materials copyright belong to respective owners. Quotes are made for educational purposes only in compliance with the fair use doctrine.


Last modified: April 18, 2018