NCSA
Supercomputing center gets ready to power up 256 more Brocade ports
November 30, 2002
Think your data infrastructure is getting out of hand? Try telling that to Michelle Butler, who oversees a storage infrastructure that grows one to two Tbytes per day.
Butler is technical program manager of the Storage Enabling Technologies Group at the National Center for Supercomputing Applications (NCSA) at the University of Illinois in Champaign-Urbana, Ill. NCSA, which is probably best known as the place where the graphical Web browser was invented, provides high-performance computing systems for a wide variety of science and engineering programs, everything from earthquake modeling to biochemical research.
NCSA's storage group, which comprises seven full-time staffers including Butler, is in charge of managing a 500-Tbyte (and growing) mass storage system; backup and recovery of that data; all of the SAN production and research; and research into parallel and clustered file systems. NCSA's mass storage system has between 2,000 and 3,000 active users at any given time.
Yet even with such huge storage capacity requirements, NCSA had resisted implementing a SAN until March 2002. Butler says that until recently she felt SAN technology wasn't reliable enough. "SAN just wasn't as safe as it needed to be for our production environment," she says. "We really needed high-availability switches to make sure they didn't go down."
The final decision to go to a SAN architecture was sparked by the fact that the NCSA's Windows NT, Unix, and mass storage groups were each getting ready to purchase vast amounts of new disk storage at the same time. Click! The light bulb flicked on. "With the SAN, we wanted to bring a large amount of disk in here so that multiple systems could access it," Butler says. After some lab testing, Butler felt assured that Fibre Channel infrastructure was reliable enough to run NCSA's storage.
In the first phase of the SAN rollout, NCSA deployed 60 Tbytes of DataDirect Networks Inc. storage connected to eight 16-port Brocade Communications Systems Inc. (Nasdaq: BRCD) SilkWorm 3800 switches and one 64-port SilkWorm 12000. The SAN connects more than 200 host servers via QLogic Corp. (Nasdaq: QLGC) host bus adapters.
Butler says NCSA selected Brocade because the company was willing to engage NCSA as a development partner rather than as an ordinary customer. "We are trying to break the 12000 so they can build a better switch," Butler says.
Plus, she says, she got "really great pricing." [Ed. note: Hmmm... so we'd guess this particular deal hasn't added much to Brocade's bottom line.] NCSA also wanted to use Brocade's Fabric Access API to pull data into its proprietary management system to monitor the health of each switch down to the port level.
NCSA's storage group tested the 12000's processor-failover capability by throwing enough corrupt data at the switch to trigger a failover. Butler says that feature worked as advertised. However, the group did encounter an issue in trying to upgrade the 12000's firmware to fix a date-related bug in the switch. Since the switch doesn't support nondisruptive code activation, it must be taken offline while the update occurs. "I would say it's a drawback that you have to bring the whole switch down to do a code load," she says. "Even though it's just for 10 minutes, it brings the whole center down." Brocade has promised to deliver hot code-load activation for the 12000 early next year.
But that shortcoming hasn't stopped NCSA from buying three more 12000s (with a fourth on the way), which it will use in server clusters for TeraGrid, a large computing network sponsored by the National Science Foundation (NSF) that will be distributed among five research facilities: NCSA, Argonne National Laboratory, the California Institute of Technology, the Pittsburgh Supercomputing Center, and the San Diego Supercomputer Center.
In January, NCSA will receive the first machines that will be part of TeraGrid: 256 IBM Corp. (NYSE: IBM) Linux servers based on Itanium, Intel Corp.'s (Nasdaq: INTC) 64-bit microprocessor. That will be followed by 700 servers in June. NCSA's TeraGrid cluster will include 230 Tbytes of spinning disk initially, running in IBM FAStT700 arrays. Butler says NCSA may add another 200 Tbytes later in the year.
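For readers keeping score, the numbers above can be tallied in a few lines. This is only a back-of-the-envelope sketch: the switch counts and capacities come from the figures quoted in this article, but the assumption that each of the four new 12000s is configured with 64 ports (like the first one) is ours, not NCSA's, and the variable and function names are made up for illustration.

```python
# Back-of-the-envelope tally of the figures quoted in the article.
# ASSUMPTION: every new SilkWorm 12000 is configured with 64 ports,
# like the one deployed in phase one. The article does not say so.

PORTS_PER_3800 = 16
PORTS_PER_12000 = 64  # assumed for the new units as well


def total_ports(switch_count: int, ports_per_switch: int) -> int:
    """Return the aggregate port count for a group of identical switches."""
    return switch_count * ports_per_switch


# Phase one: eight SilkWorm 3800s plus one SilkWorm 12000.
phase_one_ports = total_ports(8, PORTS_PER_3800) + total_ports(1, PORTS_PER_12000)

# TeraGrid: three more 12000s bought, a fourth on the way.
new_12000_ports = total_ports(3 + 1, PORTS_PER_12000)

# TeraGrid disk: 230 Tbytes initially, possibly 200 Tbytes more later.
teragrid_initial_tb = 230
teragrid_possible_total_tb = teragrid_initial_tb + 200

print(f"Phase-one ports:       {phase_one_ports}")
print(f"New 12000 ports:       {new_12000_ports}")
print(f"TeraGrid disk (max):   {teragrid_possible_total_tb} Tbytes")
```

Under that 64-port assumption, the four additional 12000s account for the 256 ports in the headline.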
However, Butler says, the TeraGrid SAN is not yet ready for prime time, primarily because of the immaturity of Linux. "In a Linux environment, it's hard to build a bulletproof SAN," she says. "Right now the Linux OS can't fail over to an alternate path to their system disk. I don't have support from the file system, so if the Linux systems go down they're dead in the water."
Older Unix operating systems, such as those from Sun Microsystems Inc. (Nasdaq: SUNW) and Hewlett-Packard Co. (NYSE: HPQ), already include multipath I/O. Butler says Red Hat Inc. (Nasdaq: RHAT) and other vendors are building enterprise features into Linux.
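To make the idea of multipath I/O concrete, here is a minimal conceptual sketch of what path failover means: a host has more than one route to the same disk, and when one route fails the I/O is retried on another. It is a toy model, not how any particular operating system or vendor driver implements it; the `Path` and `MultipathDevice` classes and the path names `hba0` and `hba1` are hypothetical.

```python
from dataclasses import dataclass
from typing import List


class PathFailed(Exception):
    """Raised when an I/O request cannot be completed over a path."""


@dataclass
class Path:
    """One route from a host to a disk (hypothetical model of an HBA/switch route)."""
    name: str
    healthy: bool = True

    def read(self, block: int) -> str:
        if not self.healthy:
            raise PathFailed(self.name)
        return f"block {block} via {self.name}"


@dataclass
class MultipathDevice:
    """A disk reachable over several paths; reads fail over in listed order."""
    paths: List[Path]

    def read(self, block: int) -> str:
        for path in self.paths:
            try:
                return path.read(block)
            except PathFailed:
                continue  # try the next path instead of failing the I/O
        raise PathFailed("all paths down")


if __name__ == "__main__":
    primary, alternate = Path("hba0"), Path("hba1")
    disk = MultipathDevice([primary, alternate])
    print(disk.read(7))      # served over the primary path
    primary.healthy = False  # simulate a failed switch or cable
    print(disk.read(7))      # the same read now succeeds over the alternate
```

Without the retry loop, the first failed path would surface as an I/O error to the host, which is the "dead in the water" situation Butler describes for a system disk.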
"Linux is new," Butler says. "This is part of its evolution, and we're pushing the technology as fast as it can go."
Todd Spangler, US Editor, Byte and Switch
http://www.byteandswitch.com