
Seagate Barracuda 7200.11 Troubles


#1026
ViniciusFerrao

    Newbie

  • Member
  • 46 posts
I don't want to be banned, but I would like to shout really bad words at dlethe. :realmad:

BTW: My drive was fixed by ME.



#1027
mikesw

    Advanced Member

  • Member
  • 365 posts
And don't forget that medical offices and the military use hard disks, to name a few.

Sending the drive back for an RMA is not realistic. Small doctors' offices don't always back up YOUR medical
records. "Sorry, we lost your medical records" wouldn't be acceptable.

The military does use PCs in a war zone, although they are ruggedized for abuse and for the environment they are in.
However, no contractor, sysadmin, or user can design a ruggedized system around the failure rates Seagate had
with this firmware bug. Commanding officer: "Give me the coordinates of the enemy on the map." "I can't, the computer
won't recognize the disk drive that was working before lunch!!!!" I'd have to do backups every minute, and even then
there is no guarantee, and I'd also have to have duplicate computer equipment to move the backup to, just in case. The officer:
"I need it now!" "Sorry, it'll take a few hours to restore."

So Seagate and hardware/software manufacturers, how many lives were lost because of your defective products?
:ph34r: :blink:

Edited by mikesw, 26 January 2009 - 01:50 PM.


#1028
Gradius2

    IT Consultant

  • Member
  • 240 posts
  • OS:Windows 7 x64

In the old days (back in 2000) I used to hack firmware for Pioneer burners (DVR-Axx family), as you can see here:
http://gradius.rpc1.org

Those old days remind me of "conversions" through firmware patches (i.e. Liteon SOHW-812S
to Liteon 832S). That makes me wonder if it's possible to convert/flash a ST3500320AS to
a ST3500320NS (Enterprise) using firmware SN06C (or a ST31000340AS to a ST31000340NS). <_<


It's perfectly possible to do such a thing (ASM makes the impossible possible); however, IF the hardware isn't the same (and I hope it isn't) then you're just putting a new name on your HDD.

Since they call those drives "Enterprise" they must be built from better components and parts, otherwise the company would be playing a "dirty underground game", asking more for the very same thing except for the name (label and firmware).
"Two things are infinite: The Universe and Human stupidity; and I'm not sure about the Universe." Albert Einstein

#1029
jaclaz

    The Finder

  • Developer
  • 14,042 posts
  • OS:none specified

So Seagate and hardware/software manufacturers, how many lives were lost because of your defective products?
:ph34r: :blink:


On the bright side, how many "Autovelox", "Speed Cameras" or similar speed-checking devices of the latest generation went berserk, lowering the chances of you getting fined? :unsure: :)

Another way ;):
http://gizmodo.com/5...ng-police-crazy

:P

jaclaz

#1030
Gradius2

    IT Consultant

  • Member
  • 240 posts
  • OS:Windows 7 x64

So Seagate and hardware/software manufacturers, how many lives were lost because of your defective products?
:ph34r: :blink:


On the bright side, how many "Autovelox", "Speed Cameras" or similar speed-checking devices of the latest generation went berserk, lowering the chances of you getting fined? :unsure: :)

Another way ;):
http://gizmodo.com/5...ng-police-crazy

:P

jaclaz


LOL! Very funny, nice find!
"Two things are infinite: The Universe and Human stupidity; and I'm not sure about the Universe." Albert Einstein

#1031
enrolb
  • Member
  • 6 posts
I can't believe Seagate... They closed my case on a BIOS boot issue with a ST31000340AS drive after a week, with one
email notice stating the following. What I wanted to know was who and where I should send my drive to get the data recovered, as
I had heard Seagate was offering the service. So I needed prices and guarantees.

This is pathetic!! I have spent hours waiting on the phone as well...



>>>>>

Thank you for contacting Seagate Support.


A firmware issue has been identified that affects a small number of Seagate Barracuda 7200.11 hard drive models which may result in data becoming inaccessible after a power-off/on operation. The affected products are Barracuda 7200.11, Barracuda ES.2 SATA, and DiamondMax 22.

Based on the low risk as determined by an analysis of actual field return data, Seagate believes that the affected drives can be used as is.

However, as part of our commitment to customer satisfaction, Seagate is offering a free firmware upgrade.


Please follow this link

(http://seagate.custk...sp?DocId=207931)

to enter the Knowledge Base article(s) detailing the steps to update your drive.


In the unlikely event your drive is affected and you cannot access your data, the data still resides on the drive and there is no data loss associated with this issue. If your drive is no longer accessible, contact us directly for further assistance at http://www.seagate.c...out/contact_us/.


NOTE: If you have contacted Seagate Support regarding a separate issue or about another product, please visit http://www.seagate.c...out/contact_us/ to submit an email.

Thank you.

<<<<<<<<<<<<<<<<

#1032
PrOdiGy1

    Newbie

  • Member
  • 28 posts
All you will get from Seagate support via email is a blanket email (if any response at all). I thought this was old news. The only way to get any sort of interactive feedback is via telephone support (and, from my experience, they usually will not be much use either until you ask to talk to a supervisor).

#1033
sieve-x

    Newbie

  • Member
  • 12 posts
I agree that not every Seagate drive (even if it's a 7200.11) with a failure at BIOS
detection can be linked to the 'boot of death' issue, and an attempt to repair the wrong
problem may result in data loss or even worse damage, but this thread is specific:
it has many users with frozen drives that match the basic requirements (model,
firmware version and serial number) and some are willing to take the risk...

It seems there has been some tension in the air because of the terminal procedure
posted here, but RTOS access has been available for many years on the net (almost
since the current firmware evolved from Conner) and it's no big deal, since the risks
were clearly stated and only some will take their chances (a few may end up frying
a PCB, or losing data if something goes wrong, e.g. a bad connection). Most will prefer
sending their drives in for the free recovery/repair option now being offered by Seagate.

I don't think full knowledge of the 'boot-of-death' details would provide a blueprint for
virus writers, as that could already have been done for years with the flash code, and most of today's
malware favors information and networks instead of the old destructive payload.

Whether it was a defective test machine (read: someone who had just lost his job :whistle: ) or a firmware
bug does not matter for end-users, who were caught by surprise. As for the percentage of
affected drives, that's something only Seagate or an external audit may know for sure...

The media coverage is much more damaging than any obscure info or firmware bugs, and Seagate took too
much time to act. The result was overloaded staff, angry customers with downtime, their serial
number tool failed, the firmware correction messed up the internal validation process, and I'm sure that
some customers who paid a premium fee for third-party recovery services feel betrayed.

Seagate general support (chat, toll-free, RMA process, etc.) is far from perfect (i.e. canned
responses) but it's good when compared to other manufacturers, the 5-year warranty for
desktop items (non-enterprise) WAS a plus (for new desktop drives after Jan/03/2009 it
has been changed to 3 years), and in some cases they will replace a failed drive with
a better/bigger refurbished one to avoid losing a customer.

I hope they make the firmware open-source so it can be improved and the bug-tracking process becomes
more flexible/reliable. SSDs will catch on in the next few years: no moving parts and lower cost to
manufacture (but not failure-free). First in mobiles (where frequent standby/parking cycles
are a big problem regardless of drive brand) and later in mainstream desktop/enterprise.

Cheers David. SanTools SMARTMon-UX is a great tool, but some people here (many had their drives
affected) got p***ed off when you minimized the problem and teased everyone by saying that you know
the failure root cause details but are not going to tell because it's a dark secret under NDA... ;)

The obvious question comes to mind .. how do you know your disk suffers from boot-of-death, and not something like a circuit board failure or massive head crash?


Edited by sieve-x, 03 February 2009 - 08:26 PM.


#1034
sieve-x

    Newbie

  • Member
  • 12 posts
Finally, here are the failure root cause "secret" details (no NDAs were hurt in the process :D).

Customer update :

Seagate has isolated a potential firmware issue in certain products, including some Barracuda 7200.11 hard drives and related drive families based on their product platform*, manufactured through December 2008. In some circumstances, the data on the hard drives may become inaccessible to the user when the host system is powered on. Retail products potentially affected include the Seagate FreeAgent® Desk and Maxtor OneTouch® 4 storage
solutions.

As part of our commitment to customer satisfaction, we are offering a free firmware upgrade to those with affected products. To determine whether your product is affected, please visit the Seagate Support web site at http://seagate.custk....p?DocId=207931.

Support is also available through Seagate's call center: 1-800-SEAGATE (1-800-732-4283)

Customers can expedite assistance by sending an email to Seagate (discsupport*seagate.com). Please include the following disk drive information: model number, serial number and current firmware revision. We will respond, promptly, to your email request with appropriate instructions.

For a list of international telephone numbers to Seagate Support and alternative methods of contact, please access
http://www.seagate.c...out/contact_us/

*There is no safety issue with these products.

Description

An issue exists that may cause some Seagate hard drives to become inoperable immediately after a power-on operation. Once this condition has occurred, the drive cannot be restored to normal operation without intervention from Seagate. Data on the drive will be unaffected and can be
accessed once normal drive operation has been restored. This is caused by a firmware issue coupled with a specific manufacturing test process.

Root Cause

This condition was introduced by a firmware issue that sets the drive event log to an invalid location causing the drive to become inaccessible.

The firmware issue is that the end boundary of the event log circular buffer (320) was set incorrectly. During Event Log initialization, the boundary condition that defines the end of the Event Log is off by one.
During power up, if the Event Log counter is at entry 320, or a multiple of (320 + x*256), and if a particular data pattern (dependent on the type of tester used during the drive manufacturing test process) had been present
in the reserved-area system tracks when the drive's reserved-area file system was created during manufacturing, firmware will increment the Event Log pointer past the end of the event log data structure. This error is detected and results in an "Assert Failure", which causes the drive to hang as a failsafe measure. When the drive enters failsafe, further updates to the counter become impossible and the condition will remain through subsequent power cycles. The problem only arises if a power cycle initialization occurs when the Event Log is at 320 or some multiple of 256 thereafter. Once a drive is in this state, there is no path to resolve/recover existing failed drives without Seagate technical intervention.

For a drive to be susceptible to this issue, it must have both the firmware that contains the issue and have been tested through the specific manufacturing process.

Corrective Action

Seagate has implemented a containment action to ensure that all manufacturing test processes write the same "benign" fill pattern. This change is a permanent part of the test process. All drives with a date of
manufacture January 12, 2009 and later are not affected by this issue as they have been through the corrected test process.

Recommendation

Seagate strongly recommends customers proactively update all affected drives to the latest firmware. If you have experienced a problem, or have an affected drive exhibiting this behavior, please contact your appropriate
Seagate representative. If you are unable to access your data due to this issue, Seagate will provide free data recovery services. Seagate will work with you to expedite a remedy to minimize any disruption to you or your business.

FREQUENTLY ASKED QUESTIONS (FAQ)

Q: What Seagate drives are affected by this "drive hang after power cycle" issue?
A: The following product types may be affected by this problem:
Barracuda 7200.11, Barracuda ES.2 (SATA), DiamondMax 22, FreeAgent Desk, Maxtor OneTouch 4, Pipeline HD, Pipeline HD Pro, SV35.3, and SV35.4. While only some percentage of the drives will be susceptible to this issue, Seagate recommends that all drives in these families be updated to the latest firmware!

Q: What should I do if I think I have a Seagate drive affected by this issue?
A: Since only some drives have this problem, there is a high likelihood your drive is working and will continue to work perfectly. However, Seagate recommends that all drives in the affected families be updated to the latest firmware as soon as possible. Seagate realizes this recommendation may present challenges for some customers, particularly those with large distributed installed bases. Seagate will work with customers to correct this problem, but requests customers take the following initial actions depending on what type of customer they are. For individual end-users, please contact Seagate Technical Support via web, phone or email.

http://seagate.custk....p?DocId=207931 or 1-800-SEAGATE (1 800 732-4283), or discsupport*seagate.com. If emailing, please include the following disk drive information: model number, serial number and current firmware revision.

Q. If my drives are always on, could I see this issue?
A. No, this can only occur after a power cycle, however Seagate still recommends that you upgrade your firmware due to unforeseen power events such as power loss.

Q: How will Seagate help me if I lost data on this drive?
A. There is no data loss in this situation. The data still resides on the drive and is inaccessible to the end user. If you are unable to access your data due to this issue, Seagate will provide free data recovery services. Seagate will work with you to expedite a remedy to minimize any disruption to you or your business.

Q. Does this affect all drives manufactured through January 2008?
A. No, this only affects products that were manufactured through a specific test process in combination with a specific firmware issue.

Q. Why has it taken so long for Seagate to find this issue on Barracuda ES.2 and SV35?
A. In typical nearline and surveillance operating environments, drives are not power cycled and so are not as likely to experience this issue.

Q. Does this affect the Barracuda ES.2 SAS drive?
A. No, the SATA and SAS drives have different firmware.

Q. How will my RAID-set be affected?
A. If the error occurs, the drive will drop offline after a power cycle. The RAID will go into the defined host-specific recovery actions, which will result in the RAID operating in a degraded mode or initiating a rebuild if a hot spare is available. If you are unsure how your host will respond to a dropped drive and have not yet experienced this issue, avoid unnecessary power cycles and refer to the manufacturer or support for the appropriate instructions.

Q. Is there a way to upgrade the firmware to my drives if they are in a large RAID-set, or do I need to take the solution offline?
A. The ability to upgrade firmware in a RAID array is system dependent. Refer to your system manufacturer for upgrade instructions.

Q. How can I tell which Barracuda ES/SV35 drives are affected?
A. 1) Check the "Drive model #" against the list of affected models below, or
2) check the PN of the drive against the PN list below, or
3) call Seagate Technology support services at 1-800-SEAGATE (1 800 732-4283), or discsupport*seagate.com

If it is a SV35 SATA drive and it is affected, new firmware will be available 1/23/09


Edited by sieve-x, 03 February 2009 - 08:23 PM.


#1035
jaclaz

    The Finder

  • Developer
  • 14,042 posts
  • OS:none specified

Q: What Seagate drives are affected by this "drive hang after power cycle" issue?
A: The following product types may be affected by this problem:
Barracuda 7200.11, Barracuda ES.2 (SATA), DiamondMax 22, FreeAgent Desk, Maxtor OneTouch 4, Pipeline HD, Pipeline HD Pro, SV35.3, and SV35.4. While only some percentage of the drives will be susceptible to this issue, Seagate recommends that all drives in these families be updated to the latest firmware!


Now an English grammar question (provided that the answer was written by a native English-speaking executive, possibly with some Law background besides technical and mathematical knowledge).

How much is "only some percentage"?

Let's see: suppose the affected drives were in total a few hundred, say 800, i.e. 7 or 8 times the number of serials published on MSFN.

Let's also assume that the affected drive models are 1/3 of last year's production.

According to dlethe:

Use some common sense here, factor in how many 'cudas that Seagate ships in a year, and tell me how many millions of disk drives SHOULD be failing if this is a firmware bug that affects all disks running this particular firmware. Seagate is on a 5-year run rate to ship 1,000,000,000 disk drives ANNUALLY by 2014.

Seagate is heading for 1,000,000,000 disk drives annually; I guess that estimating 2008 production at 1/3 of that number would be prudent.

So let's try the math:
1/3 * 1,000,000,000 = 333,333,333 drives produced in 2008 :unsure:
1/3 * 333,333,333 = 111,111,111 drives of the said-to-be-affected models in 2008 :unsure:
Let's round this number down to 100,000,000.

Now, let's say that 0.2% is the minimum that can be called "some percentage" (if it were less than this, anyone in his right mind would have used a wording like "a fraction of a percent" or "less than a single percentage point" or something "diminutive" like that).

Now:
0.002 * 100,000,000 = 200,000

OK, the figures above may be exaggerated/wrong, so let's introduce a 10x "safety factor":
200,000 / 10 = 20,000

Anything between 20,000 and 200,000 appears "reasonable".

Even taking the lowest of the "range", 20,000 is several times bigger than the initially assumed 800. :w00t:
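For reference, the same back-of-the-envelope arithmetic as a tiny Python sketch; every input below is an assumption taken from the posts above, not a Seagate figure:

# A minimal sketch of the estimate above; all inputs are assumptions
# from the thread, not Seagate figures.
produced_2008 = 1_000_000_000 // 3     # assume 2008 output is 1/3 of the quoted run rate
affected_models = produced_2008 // 3   # assume affected models are 1/3 of 2008 output
affected_models = 100_000_000          # rounded down, as in the post
some_percentage = 0.002                # assumed floor for "some percentage" (0.2%)

upper_bound = int(affected_models * some_percentage)   # 200,000
lower_bound = upper_bound // 10                        # 10x speculative "safety factor" -> 20,000
print(f"estimated affected drives: {lower_bound:,} to {upper_bound:,}")
# -> estimated affected drives: 20,000 to 200,000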

Is it one of those days where my understanding of English is failing AND my math skills lack any kind of precision? :ph34r: :blink:

jaclaz

Edited by jaclaz, 27 January 2009 - 05:09 AM.


#1036
AlexLilic
  • Member
  • 4 posts

The problem only arises if a power cycle initialization occurs when the Event Log is at 320 or some multiple of 256 thereafter.


I don't know what the Event Log that is referred to above stores, or at what rate it can be expected to increment in a "normal" scenario - but I guess we have 2 end points:
  • Best case: It turns out that this Event Log is normally empty, and only used to log uncommon physical errors.
  • Worst case: It turns out that this Event Log is commonly used, and the increment rate is closely coupled to the user's power cycle behavior. For example the drive logs on average 1 error entry per day, and the user happens to power cycle each evening. In this scenario failure rate is 100% !!
Anyone have any thoughts on this log file behavior? (I have of course ignored the part about only drives manufactured using a "specific test process" being affected, which limits the entire discussion to an undisclosed number of units.)

P.S.
I also logged a case with Seagate Technical Support on the web regarding my 7200.11 SD15 no-BIOS-detect, and the incident was closed 2 weeks later WITHOUT any email or any update. Needless to say I am p***ed off. I called them just now, but their tech-support recorded message says that they are "closed due to extreme weather conditions".

I also tried pressing the automated routing choice for Seagate Recovery Services, but I ended up at another company, who explained to me politely that SDS have left that building but have not updated their voice system.

I can't think of a more frustrating experience that I have had with a vendor.

Edited by AlexLilic, 27 January 2009 - 07:40 AM.


#1037
mikesw

    Advanced Member

  • Member
  • 365 posts

The problem only arises if a power cycle initialization occurs when the Event Log is at 320 or some multiple of 256 thereafter.


I don't know what the Event Log that is referred to above stores, or at what rate it can be expected to increment in a "normal" scenario - but I guess we have 2 end points:
  • Best case: It turns out that this Event Log is normally empty, and only used to log uncommon physical errors.
  • Worst case: It turns out that this Event Log is commonly used, and the increment rate is closely coupled to the user's power cycle behavior. For example the drive logs on average 1 error entry per day, and the user happens to power cycle each evening. In this scenario failure rate is 100% !!

I can't think of a more frustrating experience that I have had with a vendor.


Well, if the length of the internal disk drive log is 320 and we are recording one event per day, and I bought the drive
on Jan 1, 2009, then after 320 days it will fail on the 321st day. That puts it around the 3rd week of November 2009. Hmm,
now I see a correlation here with the problems people had in FY 2008, when things started dying in October through early December...
:thumbup

Edited by mikesw, 27 January 2009 - 08:24 AM.


#1038
DerSnoezie

    Member

  • Member
  • 151 posts

Anything between 20,000 and 200,000 appears "reasonable".

Even taking the lowest of the "range", 20,000 is several times bigger than the initially assumed 800. :w00t:

Is it one of those days where my understanding of English is failing AND my math skills lack any kind of precision? :ph34r: :blink:

jaclaz


The quite reasonable numbers you mentioned above certainly do not look like a "purple squirrel" to me. Thanks jaclaz!

Edited by DerSnoezie, 27 January 2009 - 09:09 AM.


#1039
Gradius2

    IT Consultant

  • Member
  • 240 posts
  • OS:Windows 7 x64
"We will respond, promptly, to your email request with appropriate instructions."

Big laugh at that!

I never got an answer from them, except that "automated" one.

Now:
0.002 * 100,000,000 = 200,000

OK, the figures above may be exaggerated/wrong, so let's introduce a 10x "safety factor":
200,000 / 10 = 20,000

Anything between 20,000 and 200,000 appears "reasonable".

Even taking the lowest of the "range", 20,000 is several times bigger than the initially assumed 800. :w00t:

Is it one of those days where my understanding of English is failing AND my math skills lack any kind of precision? :ph34r: :blink:

jaclaz


For me it was 50%: 2 out of 4 failed on the same day.

They make ~10 million HDDs per month, and the problem ran from around June until January 11. In other words, 7 months!

10 million x 7 = 70 million HDDs; let's say half of them are 7200.11, so we have 35 million; if we keep 50% defective (another half) = 17.5 million.

Now, let's estimate just 10% of those 17.5 million are unlucky and will have this problem, so we have 1.75 million HDDs; in other words, almost 2 million worldwide, I might say.

20,000 is too optimistic in my opinion. This problem just wasn't bigger because that "320 thing" depends on "luck".
"Two things are infinite: The Universe and Human stupidity; and I'm not sure about the Universe." Albert Einstein

#1040
DerSnoezie

    Member

  • Member
  • 151 posts

Now, let's estimate just 10% of those 17.5 million are unlucky and will have this problem, so we have 1.75 million HDDs; in other words, almost 2 million worldwide, I might say.


LOL, that's actually a pretty fat squirrel :lol: But I guess we'll never receive any feedback on the true numbers.

#1041
mikesw

    Advanced Member

  • Member
  • 365 posts
"Today Western Digital is announcing their WD20EADS drive, otherwise known as the WD Caviar Green 2.0TB. With 32MB of onboard cache and special power management algorithms that balance spindle speed and transfer rates, the WD Caviar Green 2TB not only breaks the 2 terabyte barrier but also offers an extremely low-power profile in its standard 3.5" SATA footprint. Early testing shows it keeps pace with similar capacity drives from Seagate and Samsung."

http://hothardware.c...-Drive-Preview/

MSRP for the new ginormous Caviar is set at $299. You can catch the official press release from WD. Stay tuned for the full HH monty with WD's new big-bad Caviar, coming soon. http://wdc.com/en/co...3-F872D0E6C335}

spec sheet: http://wdc.com/en/pr...asp?DriveID=576

Warranty policy in various countries for WDC drives (2TB not listed yet) http://support.wdc.c...licy.asp#policy

buy.com has it for $272.00 http://www.pricegrab...20EADS/st=query

:thumbup

Edited by mikesw, 27 January 2009 - 12:21 PM.


#1042
DrDisk
  • Member
  • 3 posts
Well, I guess Mr. SanTools, AKA Storagesecrets, was just trying to do some PR for Seagate, and at the same time gave his website and company a black eye.

Looks like Seagate's entire product line except the SAS drives was affected by at least this one bug, but how many others suffer from other bugs?

1.5TB: the stutter issue and the log issue
1TB/500GB and others: the LBA 0 bug

It's too much to keep count of. Sure, there are some people fanning the flames, but almost EVERY SINGLE forum has people complaining, and I'm pretty sure it isn't the SAME people in all groups. You know you have a problem with your hard drives when the Underwater Basket Weaving forums start to have posts talking about the failures.

I hope Seagate paid Dlethe well for his PR spin. Odd that it was just 1 day before the WD announcement. Coincidence?

#1043
anonymous
  • Member
  • 4 posts
Regarding "320", here's an exchange with Maxtorman from the Slashdot forum:

Maxtorman's explanation (which was apparently correct):

I'll answer your questions to the best of my ability, and as honestly as I can! I'm no statistician, but the 'drive becoming inaccessible at boot-up' is pretty much a very slim chance - but when you have 10 million drives in the field, it does happen. The conditions have to be just right - you have to reboot just after the drive writes the 320th log file to the firmware space of the drive. This is a log file that's written only occasionally, usually when there are bad sectors, missed writes, etc... might happen every few days on a computer in a non-RAID home use situation... and if that log file is written even one time after the magic #320, it rolls over the oldest file kept on the drive and there's no issue. It'll only stop responding IF the drive is powered up with log file #320 being the latest one written... a perfect storm situation. IF this is the case, then Seagate is trying to put in place a procedure where you can simply ship them the drive, they hook it up to a serial controller, and it gets re-flashed with the fixed firmware. That's all it takes to restore the drive to operation! As for buying new drives, that's up to you. None of the CC firmware drives were affected - only the SD firmware drives. I'd wait until later in the week, maybe next week, until they have a known working and properly proven firmware update. If you were to have flashed the drives with the 'bad' firmware - it would disable any read/write functions to the drive, but the drive would still be accessible in BIOS and there's a very good chance that flashing it back to a previous SD firmware (or up to the yet-to-be-released proven firmware) would make it all better. Oh, and RAID0 scares me by its very nature... not an 'if' but a 'when' the RAID 0 craps out and all data is lost - but I'm a bit jaded from too much tech support! :)

My question:

Maxtorman, is the log file written after each power-up (or POR) and before each shut down? It seems to me the #320 is being reached by many users in about 100 days... can that really be from only occasional events like bad sectors and missed writes? See this time histogram:

http://www.msfn.org/...o...st&p=826575

Maxtorman's response:

The log, if my information is correct, is written each time a SMART check is done. This will always happen on drive init, but can also happen at regularly scheduled events during normal usage, as the drive has to go through various maintenance functions to keep it calibrated and working properly.
_______________________

Dlethe said, "The problem is a purple squirrel (sorry about the yankee slang -- it means incredibly rare)."

Well, not if you turn your computer off every night, with or without a SMART check, at least in my opinion.

#1044
Gradius2

    IT Consultant

  • Member
  • 240 posts
  • OS:Windows 7 x64

Now, let's estimate just 10% of those 17.5 million are unlucky and will have this problem, so we have 1.75 million HDDs; in other words, almost 2 million worldwide, I might say.


LOL, that's actually a pretty fat squirrel :lol: But i guess we'll never receive any feedback on the true numbers.


I would estimate at least 1/3 of them will not bother at all and will just return the drives, getting them replaced by another brand at the first opportunity.

Those will never post a thing or do any research about the problem.

Today Western Digital is announcing their WD20EADS drive, otherwise known as the WD Caviar Green 2.0TB. With 32MB of onboard cache and special power management algorithms that balance spindle speed and transfer rates, the WD Caviar Green 2TB not only breaks the 2 terabyte barrier but also offers an extremely low-power profile in its standard 3.5" SATA footprint. Early testing shows it keeps pace with similar capacity drives from Seagate and Samsung."


Real capacity is 1.81TB (formatted and ready for use).

ATM they are very hard to find.
"Two things are infinite: The Universe and Human stupidity; and I'm not sure about the Universe." Albert Einstein

#1045
sieve-x

    Newbie

  • Member
  • 12 posts
Let's look at the root cause description again in a slightly clearer way... :huh:

An affected drive model and firmware will trigger an assert failure (e.g. not detected
at BIOS) on the next power-up initialization, due to the event log pointer getting past
the end of the event log data structure (reserved-area track data corruption), if the
drive contains a particular data pattern (from the factory test mess) and if the
Event Log counter is at entry 320, or a multiple of (320 + x*256).
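A minimal sketch of that trigger condition in code; the 320 and 256 values come from Seagate's root-cause text, while the function name and everything else is just illustrative:

# Illustrative only: which Event Log counter values sit on the bad boundary.
def hits_bad_boundary(event_log_counter: int) -> bool:
    """True if a power-up with this Event Log counter value would trip the
    off-by-one boundary described above: 320, or 320 + x*256 (x = 0, 1, 2, ...).
    On real drives this only matters if the factory test pattern is also present."""
    return event_log_counter >= 320 and (event_log_counter - 320) % 256 == 0

# The dangerous counter values below 2048:
print([n for n in range(2048) if hits_bad_boundary(n)])
# -> [320, 576, 832, 1088, 1344, 1600, 1856]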


My question:
Maxtorman, is the log file written after each power-up (or POR) and before each shut down? It seems to me the #320 is being reached by many users in about 100 days... can that really be from only occasional events like bad sectors and missed writes? See this time histogram:

http://www.msfn.org/...o...st&p=826575

Maxtorman's response:

The log, if my information is correct, is written each time a SMART check is done. This will always happen on drive init, but can also happen at regularly scheduled events during normal usage, as the drive has to go through various maintenance functions to keep it calibrated and working properly.
_______________________


The event log counter could be incremented every once in a while, for example if S.M.A.R.T. automatic
off-line data collection (e.g. every 4h) is enabled (it is by default and may include a list of
the last few errors like the example below), temperature history, seek error rate and others.

smartctl -l error /dev/sda (data below is an example)

SMART Error Log Version: 1
ATA Error Count: 9 (device log contains only the most recent five errors)
	CR = Command Register [HEX]
	FR = Features Register [HEX]
	SC = Sector Count Register [HEX]
	SN = Sector Number Register [HEX]
	CL = Cylinder Low Register [HEX]
	CH = Cylinder High Register [HEX]
	DH = Device/Head Register [HEX]
	DC = Device Command Register [HEX]
	ER = Error register [HEX]
	ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 9 occurred at disk power-on lifetime: 6877 hours (286 days + 13 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  10 51 00 ff ff ff 0f

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ea 00 00 ff ff ff af 00	  02:00:24.339  FLUSH CACHE EXIT
  35 00 10 ff ff ff ef 00	  02:00:24.137  WRITE DMA EXT
  35 00 08 ff ff ff ef 00	  02:00:24.136  WRITE DMA EXT
  ca 00 10 77 f7 fc ec 00	  02:00:24.133  WRITE DMA
  25 00 08 ff ff ff ef 00	  02:00:24.132  READ DMA EXT

Error 8 occurred at disk power-on lifetime: 4023 hours (167 days + 15 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 71 03 80 01 32 e0  Device Fault; Error: ABRT

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  a1 00 00 00 00 00 a0 02   2d+04:33:54.009  IDENTIFY PACKET DEVICE
  ec 00 00 00 00 00 a0 02   2d+04:33:54.001  IDENTIFY DEVICE
  00 00 00 00 00 00 00 ff   2d+04:33:53.532  NOP [Abort queued commands]
  a1 00 00 00 00 00 a0 02   2d+04:33:47.457  IDENTIFY PACKET DEVICE
  ec 00 00 00 00 00 a0 02   2d+04:33:47.445  IDENTIFY DEVICE

... list goes on until error 5

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
	1		0		0  Not_testing
	2		0		0  Not_testing
	3		0		0  Not_testing
	4		0		0  Not_testing
	5		0		0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.


This means that theoretically disabling the S.M.A.R.T. automatic off-line self-test and attribute
auto-save (something like: smartctl -s on -o off -S off /dev/sdX) and also at the system BIOS
(before powering up the drive again), or even disabling the whole S.M.A.R.T. feature set, could be
a workaround (crippling S.M.A.R.T. would not be a permanent solution because it helps
to detect/log drive errors) until the drive firmware is updated.
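For completeness, a small Python sketch of that (unverified) idea as a script; it only wraps the smartctl switches already quoted above and is no substitute for the firmware update:

# Sketch of the (unverified) workaround idea above: keep S.M.A.R.T. enabled but
# turn off automatic offline data collection (-o off) and attribute autosave (-S off)
# until the drive firmware can be updated. Run as root; not a fix, just a stopgap.
import subprocess
import sys

def quiet_smart(device: str) -> None:
    subprocess.run(["smartctl", "-s", "on", "-o", "off", "-S", "off", device],
                   check=True)

if __name__ == "__main__":
    if len(sys.argv) != 2:
        sys.exit("usage: quiet_smart.py /dev/sdX")
    quiet_smart(sys.argv[1])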

smartctl -l directory /dev/sda

Log Directory Supported (this one is from an affected model)

SMART Log Directory Logging Version 1 [multi-sector log support]
Log at address 0x00 has 001 sectors [Log Directory]
Log at address 0x01 has 001 sectors [Summary SMART error log]
Log at address 0x02 has 005 sectors [Comprehensive SMART error log]
Log at address 0x03 has 005 sectors [Extended Comprehensive SMART error log]
Log at address 0x06 has 001 sectors [SMART self-test log]
Log at address 0x07 has 001 sectors [Extended self-test log]
Log at address 0x09 has 001 sectors [Selective self-test log]
Log at address 0x10 has 001 sectors [Reserved log]
Log at address 0x11 has 001 sectors [Reserved log]
Log at address 0x21 has 001 sectors [Write stream error log]
Log at address 0x22 has 001 sectors [Read stream error log]
Log at address 0x80 has 016 sectors [Host vendor specific log]
Log at address 0x81 has 016 sectors [Host vendor specific log]
Log at address 0x82 has 016 sectors [Host vendor specific log]
Log at address 0x83 has 016 sectors [Host vendor specific log]
Log at address 0x84 has 016 sectors [Host vendor specific log]
Log at address 0x85 has 016 sectors [Host vendor specific log]
Log at address 0x86 has 016 sectors [Host vendor specific log]
Log at address 0x87 has 016 sectors [Host vendor specific log]
Log at address 0x88 has 016 sectors [Host vendor specific log]
Log at address 0x89 has 016 sectors [Host vendor specific log]
Log at address 0x8a has 016 sectors [Host vendor specific log]
Log at address 0x8b has 016 sectors [Host vendor specific log]
Log at address 0x8c has 016 sectors [Host vendor specific log]
Log at address 0x8d has 016 sectors [Host vendor specific log]
Log at address 0x8e has 016 sectors [Host vendor specific log]
Log at address 0x8f has 016 sectors [Host vendor specific log]
Log at address 0x90 has 016 sectors [Host vendor specific log]
Log at address 0x91 has 016 sectors [Host vendor specific log]
Log at address 0x92 has 016 sectors [Host vendor specific log]
Log at address 0x93 has 016 sectors [Host vendor specific log]
Log at address 0x94 has 016 sectors [Host vendor specific log]
Log at address 0x95 has 016 sectors [Host vendor specific log]
Log at address 0x96 has 016 sectors [Host vendor specific log]
Log at address 0x97 has 016 sectors [Host vendor specific log]
Log at address 0x98 has 016 sectors [Host vendor specific log]
Log at address 0x99 has 016 sectors [Host vendor specific log]
Log at address 0x9a has 016 sectors [Host vendor specific log]
Log at address 0x9b has 016 sectors [Host vendor specific log]
Log at address 0x9c has 016 sectors [Host vendor specific log]
Log at address 0x9d has 016 sectors [Host vendor specific log]
Log at address 0x9e has 016 sectors [Host vendor specific log]
Log at address 0x9f has 016 sectors [Host vendor specific log]
Log at address 0xa1 has 020 sectors [Device vendor specific log]
Log at address 0xa8 has 020 sectors [Device vendor specific log]
Log at address 0xa9 has 001 sectors [Device vendor specific log]
Log at address 0xe0 has 001 sectors [Reserved log]
Log at address 0xe1 has 001 sectors [Reserved log]


It may also (theoretically) be possible to check whether the 'specific data pattern' is present in the system
area IF it can be read from SMART log pages (using the standard ATA interface/specification),
so this could be used to create a simple (multi-platform) tool for verifying whether a particular
drive is effectively affected by the issue, and maybe even used as a workaround IF
the wrong pattern data or event counter can be changed (i.e. read/write).
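As a first step toward such a tool, here is a rough sketch that just enumerates the vendor-specific log pages from the smartctl -l directory output shown above; the parsing is keyed to that exact output format and may need adjusting for other smartctl versions:

# Rough sketch: list the vendor-specific SMART log pages reported by
# "smartctl -l directory", as a starting point for a checker tool.
# The regex matches the output format pasted above; other smartctl
# versions may print the directory differently.
import re
import subprocess

LOG_LINE = re.compile(r"Log at address (0x[0-9a-f]+) has (\d+) sectors \[(.+)\]")

def vendor_logs(device: str):
    out = subprocess.run(["smartctl", "-l", "directory", device],
                         capture_output=True, text=True, check=True).stdout
    for addr, sectors, name in LOG_LINE.findall(out):
        if "vendor specific" in name.lower():
            yield addr, int(sectors), name

if __name__ == "__main__":
    for addr, sectors, name in vendor_logs("/dev/sda"):
        print(f"{addr}: {sectors} sectors ({name})")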

Edited by sieve-x, 29 January 2009 - 12:42 AM.


#1046
jaclaz

    The Finder

  • Developer
  • 14,042 posts
  • OS:none specified

20,000 is too optimistic in my opinion. This problem just wasn't bigger because that "320 thing" depends on "luck".


Yep :), the point I was trying to make was that if you can go in a few "logical" steps from 100-150 reports here on MSFN to a bare minimum of 20,000, rounding everything down and using largely speculative "safety" factors, we can say, rightfully and without fear of being proved wrong by actual figures (when and if they come to light), that the phenomenon is HUGE.

Which does not mean it's a matter of millions (though it might be :unsure:), but it is enough to allow me to say that the well-known title is incorrect:

Seagate boot-of-death analysis - nothing but overhyped FUD

as the issue does not appear that overhyped (read: not at all ;)) and it's definitely not FUD.

Using Dirk Gently's I-CHING calculator:
http://www.thateden.co.uk/dirk/
anything resulting above 4 becomes "A Suffusion of Yellow", on my personal calculator anything above 20,000 results in "lots" or "too many to count".
I don't care if they represent "only" "some percentage of the drives". :P

Besides, while dlethe advises the use of common sense:

Use some common sense here, factor in how many 'cudas that Seagate ships in a year, and tell me how many millions of disk drives SHOULD be failing if this is a firmware bug that affects all disks running this particular firmware. Seagate is on a 5-year run rate to ship 1,000,000,000 disk drives ANNUALLY by 2014. If the drive problem was as big as you say it is, then they would have caught it in QC. The problem is a purple squirrel (sorry about the yankee slang -- it means incredibly rare).


in his article:
http://storagesecret...-overhyped-fud/
he seems to be lacking the same.

As long as we are "talking adjectives", everyone is free to have their own stance and definitions, but when it comes to probabilities and calculating them, checking the math twice would be advised.

Compare the "cryptic" explanation of the "magic number":

So here is what happened. For whatever reason, some of Seagate’s test equipment didn’t zero out the test pattern once the test suite completed, and these disks were shipped. When disks that have this test pattern pre-loaded into the reserved area, and put into service, they are subjected to certain errors, warnings, or I/O activity [remember, I'm not going to tell you what the specific trigger is ..., but the information is available to people who need to know] that results in a counter reaching a certain value. (This is NOT a threshold, but an exact value. I.e., if the magic number was 12345, then 12346 and higher would NOT trigger the bricking logic. Only 12345 triggers it. ). Furthermore, this value is stored in a circular buffer, so it can go up and down over the life of the disk. In order for the disk to brick, the disk must be spun down at the EXACT MOMENT this field is set to this magic number. (The magic number is not an 8-bit value, either). So either on power-down, or power-up, the firmware saw the bit pattern, and the magic number in the circular buffer, and likely did what it was programmed to do … perform a type of lockdown test that is supposed to happen in the safety of the manufacturing/test lab, where it can be unlocked and appropriate action taken by test engineers.

So, let’s say you have a disk with the naughty firmware, that was tested on the wrong test systems at the wrong time. Let’s say that the magic number is a 16-bit number. Then even if you had one of the disks that are at risk, then the odds are > 65,000:1 that you will power the disk off when the counter is currently set to this magic number. If the magic number is stored in a 32-bit field, then buy lottery tickets, because you have a higher probability of winning the lottery then you do that the disk will spin down with the register set to the right value. (BTW, the magic number is not something simple like number of cumulative hours.)


With the one reported by Seagate:

The firmware issue is that the end boundary of the event log circular buffer (320) was set incorrectly. During Event Log initialization, the boundary condition that defines the end of the Event Log is off by one.
During power up, if the Event Log counter is at entry 320, or a multiple of (320 + x*256), and if a particular data pattern (dependent on the type of tester used during the drive manufacturing test process) had been present
in the reserved-area system tracks when the drive's reserved-area file system was created during manufacturing, firmware will increment the Event Log pointer past the end of the event log data structure. This error is detected and results in an "Assert Failure", which causes the drive to hang as a failsafe measure. When the drive enters failsafe, further updates to the counter become impossible and the condition will remain through subsequent power cycles. The problem only arises if a power cycle initialization occurs when the Event Log is at 320 or some multiple of 256 thereafter. Once a drive is in this state, there is no path to resolve/recover existing failed drives without Seagate technical intervention.


Since I guess that this latter info was available to dlethe in his "under NDA" documentation, let's see how many x's we have in a 16-bit number :ph34r: :
We have 65,536 values, from 0 to 65,535.
In this range, the maximum x can be found by solving:
320 + x*256 = 65,535
Thus:
x*256 = 65,535 - 320
x = (65,535 - 320)/256
x = 254.746 => 254 (plus the 0 value, i.e. the "plain" 320 case) => 255 possible values for x

This would place the odds at 65,536:255, i.e. roughly 257:1, instead of the proposed "> 65,000:1" :w00t:

Which would mean that the initial calculation grossly underestimated the probability.
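The same count can be brute-forced instead of solved algebraically; a quick sketch assuming, as above, a 16-bit counter:

# Brute-force check of the count above, assuming a 16-bit counter (0..65535).
trigger_values = [n for n in range(65536)
                  if n >= 320 and (n - 320) % 256 == 0]
print(len(trigger_values))             # 255
print(65536 / len(trigger_values))     # ~257 -> roughly 257:1, not "> 65,000:1"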

Again, it is possible that today is not my "lucky" day with math.....:whistle:

jaclaz

Edited by jaclaz, 28 January 2009 - 04:59 AM.


#1047
icefloe01

    Junior

  • Member
  • 61 posts
jaclaz is teh kewlerest!

#1048
Oliver.HH
  • Member
  • 7 posts
Another attempt to estimate the probability of a drive failing...

Given the "root cause" document posted here by sieve-x, this is what we know:
  • A drive is affected by the bug if it contains the defective firmware and has been tested on certain test stations.
  • An affected drive will fail if turned off after exactly 320 internal events were logged initially or any multiple of 256 thereafter.
We don't have the details on how often exactly the event log is written to. Someone mentioned that it's written to when the drive initializes on power-up (though I don't remember the source). If that's true, we would have one event per power cycle plus an unknown and possibly varying number in between.

Given that, the probability of an affected drive being alive after one power cycle is 255/256. After two power cycles it's 255/256 * 255/256. After three power cycles it's (255/256)^3. And so on. While the isolated probability of the drive failing on a single power-up is just 0.4%, the numbers go up when you calculate the probability of a drive failing over time.

Let's assume, a desktop drive is power cycled once a day. The probability of an affected drive failing then is:
0.4% for 1 day
11.1% over 30 days
29.7% over 90 days
76.0% over 365 days

Obviously, I'm ignoring the fact that initially a higher number of events (320) must be logged to trigger the failure. Anyway, this would not change the numbers substantially, and the initial number might be even lower than 256 depending on the number of events logged during the manufacturing process. I'm also ignoring the number of events written while the drive is powered on, as it should not affect the overall probability.
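A tiny sketch of that calculation, assuming (as above) one power cycle per day and an independent 1-in-256 chance per power-off of sitting on a trigger value:

# Cumulative failure probability, assuming one power cycle per day and an
# independent 1/256 chance per power-off of sitting on a trigger value
# (the 320-entry initial offset is ignored, as in the post).
def p_fail(days: int, per_cycle: float = 1 / 256) -> float:
    return 1 - (1 - per_cycle) ** days

for days in (1, 30, 90, 365):
    print(f"{days:>3} days: {p_fail(days):6.1%}")
# ->   1 days:   0.4%
#      30 days:  11.1%
#      90 days:  29.7%
#     365 days:  76.0%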

Edited by Oliver.HH, 28 January 2009 - 08:59 AM.


#1049
jaclaz

    The Finder

  • Developer
  • 14,042 posts
  • OS:none specified

Obviously, I'm ignoring the fact that initially a higher number of events (320) must be logged to trigger the failure. Anyway, this would not change the numbers substantially, and the initial number might be even lower than 256 depending on the number of events logged during the manufacturing process. I'm also ignoring the number of events written while the drive is powered on, as it should not affect the overall probability.


Yep :), and we don't even have a clear idea of WHICH events are logged and HOW MANY such events take place in an "average powered-on hour".

If, as has been hinted/reported somewhere in the threads, a S.M.A.R.T. query raises an event that is actually logged, we soon fall into the paradox that the more you check your hardware status, the more prone it is to fail..... :w00t:

Additionally, supposing that certain commands create multiple entries (or "sets" of entries), it is debatable whether "320" is more or less likely to be reached.

I mean, how probable is it, with a "random" number of arbitrary "sets" (say resulting in 1, 2, 3 or 4 log entries), to reach exactly 320 rather than miss it, like in:

317+4
318+3
319+2


I don't think we can find an accurate answer :unsure:, but we can say that we are definitely NOT in an Infinite Improbability Drive (pardon me the pun ;)):
http://en.wikipedia....obability_Drive

two to the power of two hundred and sixty-seven thousand seven hundred and nine to one against.

.....

It sounded quite a sensible voice, but it just said, "Two to the power of one hundred thousand to one against and falling," and that was all.

Ford skidded down a beam of light and span round trying to find a source for the voice but could see nothing he could seriously believe in.

"What was that voice?" shouted Arthur.

"I don't know," yelled Ford, "I don't know. It sounded like a measurement of probability."

"Probability? What do you mean?"

"Probability. You know, like two to one, three to one, five to four against. It said two to the power of one hundred thousand to one against. That's pretty improbable you know."

.....


The voice continued.

"Please do not be alarmed," it said, "by anything you see or hear around you. You are bound to feel some initial ill effects as you have been rescued from certain death at an improbability level of two to the power of two hundred and seventy-six thousand to one against — possibly much higher. We are now cruising at a level of two to the power of twenty-five thousand to one against and falling, and we will be restoring normality just as soon as we are sure what is normal anyway. Thank you. Two to the power of twenty thousand to one against and falling."


but rather near, VERY near normality (1:1)....

:thumbup

jaclaz

Edited by jaclaz, 28 January 2009 - 08:34 AM.


#1050
Oliver.HH
  • Member
  • 7 posts

Yep :), and we don't even have a clear idea on WHICH events are logged and HOW MANY such events take place in an "average powered on hour".

True, but we don't have to know. The probability of a drive failing is the same as long as at least one event is logged per power cycle.

If, as it has been hinted/reported somewhere on the threads, a S.M.A.R.T. query raises an event that is actually logged, we will soon fall in the paradox that the more you check your hardware status the more prone it is to fail.....:w00t:

No, the chance of a drive failing due to this condition is zero unless it is powered off.

All that matters is that the event counter changes at all from power-on to power-off. It does not matter whether it increases by 1, by 50, or by any other value, as long as such values are equally probable.
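That claim can be spot-checked with a small simulation: whether each power cycle adds exactly one log entry or a random handful, the long-run fraction of power-offs that land exactly on a trigger value stays close to 1/256. The increment distributions below are made up purely for illustration:

# Monte Carlo spot-check: fraction of power-offs at which the event counter
# sits exactly on a trigger value (320, 576, 832, ...). The per-cycle
# increment distributions are illustrative only.
import random

def on_trigger(counter: int) -> bool:
    return counter >= 320 and (counter - 320) % 256 == 0

def hit_rate(increments, cycles: int = 1_000_000) -> float:
    counter, hits = 0, 0
    for _ in range(cycles):
        counter += random.choice(increments)   # events logged during this power cycle
        hits += on_trigger(counter)            # counter value seen at the next power-off
    return hits / cycles

print(hit_rate([1]))           # exactly one entry per cycle -> ~0.0039 (1/256)
print(hit_rate([1, 2, 3, 4]))  # a random handful per cycle  -> ~0.0039 as well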



