Gradius2, on Jan 27 2009, 05:11 PM, said:
20,000 is too optimist in my opinion. This problem just wasn't bigger, because that "320 thing" is based by "luck".
The point I was trying to make was that if you can go, in a few "logical" steps, from the 100÷150 reports here on MSFN to a bare minimum of 20,000, rounding everything down and using largely speculative "safety" factors, we can say, rightfully and without fearing to be proved wrong by actual figures (when and if they come to light), that the phenomenon is HUGE.
Which does not mean it's a matter of millions (though it might be), but it is enough to allow me to say that the known title:
Seagate boot-of-death analysis - nothing but overhyped FUD
is incorrect, as the issue does not appear that much overhyped (read: not at all) and it's definitely not FUD.
On Dirk Gently's I-CHING calculator, anything above 4 becomes "A Suffusion of Yellow"; on my personal calculator, anything above 20,000 results in "lots" or "too many to count".
I don't care if they represent "only" "some percentage of the drives"
Meanwhile dlethe, who advises the use of common sense:
dlethe, on Jan 26 2009, 04:38 PM, said:
Use some common sense here, factor in how many 'cudas that Seagate ships in a year, and tell me how many millions of disk drives SHOULD be failing if this is a firmware bug that affects all disks running this particular firmware. Seagate is on a 5-year run rate to ship 1,000,000,000 disk drives ANNUALLY by 2014. If the drive problem was as big as you say it is, then they would have caught it in QC. The problem is a purple squirrel (sorry about the yankee slang -- it means incredibly rare).
seems, in his article, to be lacking the same.
As long as we are "talking adjectives", everyone is free to have their own stance and definitions, but when it comes to probabilities and calculating them, double-checking the math would be advisable.
Compare the "cryptic" explanation of the "magic number":
So here is what happened. For whatever reason, some of Seagate’s test equipment didn’t zero out the test pattern once the test suite completed, and these disks were shipped. When disks that have this test pattern pre-loaded into the reserved area, and put into service, they are subjected to certain errors, warnings, or I/O activity [remember, I'm not going to tell you what the specific trigger is ..., but the information is available to people who need to know] that results in a counter reaching a certain value. (This is NOT a threshold, but an exact value. I.e., if the magic number was 12345, then 12346 and higher would NOT trigger the bricking logic. Only 12345 triggers it. ). Furthermore, this value is stored in a circular buffer, so it can go up and down over the life of the disk. In order for the disk to brick, the disk must be spun down at the EXACT MOMENT this field is set to this magic number. (The magic number is not an 8-bit value, either). So either on power-down, or power-up, the firmware saw the bit pattern, and the magic number in the circular buffer, and likely did what it was programmed to do … perform a type of lockdown test that is supposed to happen in the safety of the manufacturing/test lab, where it can be unlocked and appropriate action taken by test engineers.
So, let’s say you have a disk with the naughty firmware, that was tested on the wrong test systems at the wrong time. Let’s say that the magic number is a 16-bit number. Then even if you had one of the disks that are at risk, then the odds are > 65,000:1 that you will power the disk off when the counter is currently set to this magic number. If the magic number is stored in a 32-bit field, then buy lottery tickets, because you have a higher probability of winning the lottery then you do that the disk will spin down with the register set to the right value. (BTW, the magic number is not something simple like number of cumulative hours.)
with the one reported by Seagate:
The firmware issue is that the end boundary of the event log circular buffer (320) was set incorrectly. During Event Log initialization, the boundary condition that defines the end of the Event Log is off by one.
During power up, if the Event Log counter is at entry 320, or a multiple of (320 + x*256), and if a particular data pattern (dependent on the type of tester used during the drive manufacturing test process) had been present
in the reserved-area system tracks when the drive's reserved-area file system was created during manufacturing, firmware will increment the Event Log pointer past the end of the event log data structure. This error is detected and results in an "Assert Failure", which causes the drive to hang as a failsafe measure. When the drive enters failsafe, further updates to the counter become impossible and the condition will remain through subsequent power cycles. The problem only arises if a power cycle initialization occurs when the Event Log is at 320 or some multiple of 256 thereafter. Once a drive is in this state, there is no path to resolve/recover existing failed drives without Seagate technical intervention.
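Seagate's description boils down to a simple condition on the Event Log counter at power-up. As a minimal sketch of that condition (the function name and standalone form are mine for illustration, not Seagate's actual firmware code), assuming the counter is dangerous at 320 and at every 256 entries thereafter:

```python
EVENT_LOG_END = 320  # the mis-set end boundary of the event log circular buffer

def is_brick_trigger(counter: int) -> bool:
    """True if a power cycle with the Event Log counter at this value
    would push the log pointer past the end of the structure,
    raising the "Assert Failure" that hangs the drive."""
    return counter >= EVENT_LOG_END and (counter - EVENT_LOG_END) % 256 == 0

# The first few dangerous values:
print([n for n in range(1024) if is_brick_trigger(n)])  # [320, 576, 832]
```

Note that this is an exact-match condition, not a threshold: a counter of 321 or 575 is harmless, which is what makes the failure intermittent rather than universal.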
Since I guess this latter info was available to dlethe in his "under NDA" documentation, let's see how many x's we have in a 16-bit number.
We have 65,536 possible values, from 0 to 65,535.
In this range, the maximum x can be found by solving:

320 + x*256 <= 65,535 => x <= 254.7461 => x = 254

which, plus the x = 0 value (i.e. the "plain" 320 case), gives 255 possible values for x.
This would place the odds at 65,536:255, i.e. roughly 257:1, instead of the proposed "> 65,000:1".
Which would mean that the risk in the initial calculation was grossly underestimated.
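The arithmetic above can also be checked by brute force; here is a quick sketch counting, over the full 16-bit range, the values of the form 320 + x*256:

```python
SIZE = 2 ** 16  # 65,536 possible counter values, 0..65,535

# All counter values where a power cycle would trip the off-by-one bug
dangerous = [n for n in range(SIZE) if n >= 320 and (n - 320) % 256 == 0]

print(len(dangerous))                 # 255 values (x = 0 .. 254)
print(round(SIZE / len(dangerous)))  # 257 -> roughly 257:1, not "> 65,000:1"
```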
Again, it is possible that today is not my "lucky" day with math.....
This post has been edited by jaclaz: 28 January 2009 - 04:59 AM