StorefrontBacktalk » Blog Archive » Down For 8 Days: American Eagle’s Site Disaster

Down For 8 Days: American Eagle’s Site Disaster

Written by Frank Hayes and Evan Schuman
July 29th, 2010

In one of the longest site outages ever for a multi-billion-dollar retailer, Tuesday (July 27) saw the apparent end of more than a week of Web problems and days of an outright crashed site for Pittsburgh-based clothing chain American Eagle Outfitters, which outsources much of its Web operations to IBM. The site crashed last Monday (July 19) and stayed dark until Friday (July 23), when it limped along with various parts not functioning until Tuesday afternoon (July 27).

The site’s problems, though, shed light on an interesting strategy. During the many days of complete Web site death, the $2.7 billion apparel chain’s mobile site stayed up, although it apparently could not process purchases. Officials at American Eagle Outfitters, IBM and Usablenet (which handles the chain’s mobile site) wouldn’t comment on the mobile site’s functionality during the crash.


New Details About The Crash Causes: Oracle Backup The Culprit, Along With Big Blue

But this raises the question: Should retailers look to their mobile sites as emergency backups for their Web sites? Should pages indicating that a site is down automatically include a link to the site’s mobile version?

Mobile sites, of course, work just as well on desktop machines as they do on phones. American Eagle Outfitters, which has the admirably short URL of ae.com, also exists as a mobile site.
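
For what it's worth, the "link to the mobile site from the outage page" idea could be sketched roughly as below. This is only an illustration: the function name `outage_page`, the `mobile_is_up` flag and the URL `https://m.example.com/` are all hypothetical, and nothing here reflects how American Eagle, IBM or Usablenet actually build their pages.

```python
from html import escape


def outage_page(site_name: str, mobile_url: str, mobile_is_up: bool) -> str:
    """Build a bare-bones "site down" page (hypothetical helper).

    The mobile link is included only when the caller reports that the
    mobile site is currently reachable.
    """
    lines = [f"<h1>{escape(site_name)} is temporarily unavailable</h1>"]
    if mobile_is_up:
        # escape() also quotes the URL so it is safe inside the href attribute
        lines.append(
            f'<p>Our mobile site is still open: '
            f'<a href="{escape(mobile_url)}">{escape(mobile_url)}</a></p>'
        )
    else:
        lines.append("<p>Please check back soon.</p>")
    return "\n".join(lines)


# Example with a made-up store and URL
print(outage_page("Example Store", "https://m.example.com/", mobile_is_up=True))
```

A link like this only helps shoppers if the mobile site can actually complete an order on its own, which apparently was not the case here.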

Before we dive into that mobile-as-site-backup issue, let’s look at exactly what happened with American Eagle’s site. None of the players involved would get specific as to what was wrong with the site, other than to say that there was no upgrade going on at the time and that the site experienced “a hardware issue.”

A server failure almost certainly would not have caused this problem; redundant servers would likely have kicked in while the defective machine was replaced with a new server and a backup was restored. That process would have taken a few hours, not almost eight days.

This delay suggests some sort of storage problem. Say the storage array begins to fail. OK, no problem, we’ll just find the bad drive and replace it. Whoops, looks like something has corrupted multiple drives. (That could happen if power gets flaky inside the array.) Now we have a catastrophic failure of the storage array. No problem, we’ll just fix the hardware and restore.

Whoops, new problem: Turns out this problem has been going on for a while. The last set of backups is corrupted. So is the set of backups before that. Sorting through to reconstruct good data is going to take time.
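
To make the "sorting through backups" step concrete, here is a minimal sketch of walking backward through backup sets until one checks out. It assumes a checksum was recorded when each backup was written; the tuple layout and the name `latest_good_backup` are invented for illustration, and nobody involved has said this is how the real recovery worked.

```python
import hashlib
from typing import List, Optional, Tuple

# Hypothetical backup record: (label, contents, checksum recorded at backup time)
Backup = Tuple[str, bytes, str]


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of data as a hex string."""
    return hashlib.sha256(data).hexdigest()


def latest_good_backup(backups: List[Backup]) -> Optional[str]:
    """Return the label of the newest backup whose contents still match
    the recorded checksum, or None if every set fails the check.

    Assumes the list is ordered oldest to newest.
    """
    for label, contents, expected in reversed(backups):
        if sha256_hex(contents) == expected:
            return label
    return None


backups = [
    ("jul-16", b"orders v1", sha256_hex(b"orders v1")),
    ("jul-17", b"orders v2 (damaged)", sha256_hex(b"orders v2")),
    ("jul-18", b"garbage", sha256_hex(b"orders v3")),
]
print(latest_good_backup(backups))
```

A check like this only catches damage that happened after the checksum was taken. If the array was already writing bad data when the backup ran, the checksum would match corrupted contents, which is why a test restore is the stronger verification.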

Alternatively: All recent backup sets are toast. Maybe nobody was verifying that the data was actually being written. However, all the transactions are being logged. No problem, then: All it takes is a lot of time and special expertise to essentially rerun all the recent transactions (since the last good backup) into an empty database, merge the new stuff with the old stuff and then load it all back into the replacement hardware.
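
The replay idea can be sketched in a few lines, under heavy assumptions: the "database" is a plain dictionary, each logged transaction is a hypothetical `(sequence_number, key, value)` tuple, and a value of `None` means the record was deleted. A real recovery would use the database vendor's own tooling, so treat this as an illustration of the principle only.

```python
from typing import Dict, Iterable, Optional, Tuple

# Hypothetical log entry: (sequence number, key, new value or None for a delete)
LogRecord = Tuple[int, str, Optional[str]]


def replay_log(
    snapshot: Dict[str, str],
    snapshot_seq: int,
    log: Iterable[LogRecord],
) -> Dict[str, str]:
    """Rebuild current state from the last good backup plus the log.

    snapshot     -- contents of the last good backup
    snapshot_seq -- sequence number of the last transaction in that backup
    log          -- logged transactions, in any order
    """
    restored = dict(snapshot)  # work on a copy, leave the backup untouched
    for seq, key, value in sorted(log, key=lambda record: record[0]):
        if seq <= snapshot_seq:
            continue  # already reflected in the backup
        if value is None:
            restored.pop(key, None)
        else:
            restored[key] = value
    return restored


snapshot = {"order:1": "placed"}
log = [
    (3, "order:2", "placed"),
    (1, "order:1", "placed"),
    (2, "order:1", "shipped"),
    (4, "order:2", None),
]
print(replay_log(snapshot, snapshot_seq=1, log=log))
```

Even in this toy version, the work grows with every transaction logged since the last good backup, which hints at why the real thing could eat up days.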

By the way, it seems that American Eagle was recently searching for a “Manager – Business Continuity & Disaster Recovery”. The job was still an active posting on May 25 but has since been filled. Not a moment too soon, eh? (Thanks, Google cache!)



One Comment

  1. Gareth Evans Says:

    Contingency planning is fraught with all sorts of pitfalls. The suggestion about running your mobile site on “mirrored versions of the key databases” sounds great, apart from the fact that in AE’s case the gradual corruption of the main site’s databases due to the array problem would also be “mirrored” onto the mobile site.
    You could handle bandwidth issues by locating in the same datacentre and sharing the main site’s bandwidth. But that leaves both sites vulnerable to either a bandwidth outage or a datacentre failure (say, the power supply fails).
    It reminds me of the phrase currently very popular with politicians (certainly over here in the UK): “it’s a problem of unintended consequences”.

