A Database Fix for the File System
Most everyone agrees that migrating unstructured files into a structured, object-oriented database is the best way to track and manage the explosion of enterprise data. We give you the lowdown.
November 12, 2004
However, improved headroom doesn't alter the fact that file systems are self-destructive. Every time you save a file, you overwrite the last valid copy of the data. This goes back to the roots of file system design in the 1960s and '70s, when software engineers opted to minimize the costs of expensive resources like storage rather than add journaling or versioning techniques to their software to protect earlier file versions.
In addition, most file systems today don't automatically provide detailed descriptions of the data. And the stored metadata (data about data) doesn't say much about the contents or usage of a file, which makes ILM (information lifecycle management) and automatic provisioning impossible. Users name their files, and applications such as Microsoft Office let users add content descriptions, which are saved with the files. But it's up to the user to complete the information page when each file is saved. Few actually bother.
Without rigorous file-naming methodologies or consistent application-level file descriptions, recent regulations such as Sarbanes-Oxley and HIPAA (Health Insurance Portability and Accountability Act) are causing big headaches. It's tough to identify which files must be retained in special repositories for regulatory compliance if the files don't include descriptive information. Just try retrieving the correct files quickly, or segregating files that require special protection, when you're under the pressures of an SEC investigation.
[Figure: Typical Inode Structure]
ILM won't mean much, either, if you don't have the file information necessary to create logical data classes. Crafting intelligent data-migration policies requires a granular understanding of file content, access requirements, platform cost and capability, and other considerations. Without effective file naming, you just can't cherry-pick files for storage on appropriate platforms.
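If that granular information were available, a data-migration policy could be expressed in a few explicit rules. The Python sketch below is purely illustrative: the data classes, age thresholds and tier names are assumptions made for the example, not features of any shipping ILM product.

```python
from datetime import datetime, timedelta

def pick_tier(data_class: str, last_accessed: datetime) -> str:
    """Choose a storage tier from hypothetical data classes and access age."""
    age = datetime.now() - last_accessed
    if data_class == "regulated":       # e.g. files subject to SOX or HIPAA retention
        return "compliance-archive"
    if age < timedelta(days=30):
        return "primary-disk"           # hot data stays on fast storage
    if age < timedelta(days=365):
        return "nearline"               # warm data moves to cheaper disk
    return "tape-archive"               # cold data goes to tape

print(pick_tier("general", datetime.now() - timedelta(days=400)))  # -> tape-archive
```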
Even if ILM isn't in your plans and you're immune from regulatory pressures, the lack of detailed file information can still create problems in your day-to-day business. It's difficult to locate the files your users need if you have nondescript file names and inadequate directory or folder hierarchies.
This problem is exacerbated as organizations grow, and data-sharing among distributed users and applications becomes more pronounced. The larger the organization, the more files get lost in the shuffle.
The idea of moving the file system to a database structure has been around for a while. Oracle, IBM and Microsoft were championing such a transition back in the mid-1990s, but it didn't take off. If anyone can make it happen, it's Microsoft: The company's current Yukon database project is expected to produce a platform for hosting a database-centric file system, and a database file system is part of the initial Longhorn server OS.
Microsoft made significant changes to its file system in Windows 2000 and the XP client operating environment--NTFS improved on the security and recoverability features of earlier file systems and expanded file sizes and the number of files per volume. But the big changes will come with Longhorn's database-centric approach. It's unclear when the new technology will arrive, given delays and Microsoft execs' announcement that the WinFS database won't be part of next year's server OS release. Even so, the days of NTFS and other conventional file systems are numbered.

Most improvements so far, such as expanded file size, have focused on the interface between file names and the physical distribution of bits on a disk. Windows, Unix and Linux OS developers have concentrated on enlarging the address spaces in the file system so it can scale with physical disk capacity. That way, you can have a directory with subdirectories that span the most capacious volume available.
File-size support is on the rise, too. Microsoft's NTFS allows up to 2 TB, while some Unix and Linux systems allow more than 16 TB. These huge file sizes let you store and retrieve large databases, multimedia files and animation-laden slide decks, for example, together with smaller, more common files.
For all these improvements, however, not much has been done to improve metadata structures or presentation-side capabilities. File metadata currently includes a user ID and group ID that associate a file with its creator, plus the file-creation date and the dates last modified and last accessed. The metadata also includes file-size and block pointer information that correlates a file name with specific locations where bits are stored to disk.
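To see just how thin this metadata is, consider a small Python sketch that prints everything a conventional file system records about a file (the file name here is hypothetical):

```python
import os
import time

path = "quarterly_report.xls"   # hypothetical file, for illustration only
info = os.stat(path)

# Everything a conventional file system knows about this file:
print("owner uid:   ", info.st_uid)                 # user ID of the file's owner
print("group gid:   ", info.st_gid)                 # group ID
print("size (bytes):", info.st_size)                # file size
print("modified:    ", time.ctime(info.st_mtime))   # date last modified
print("accessed:    ", time.ctime(info.st_atime))   # date last accessed
print("created/meta:", time.ctime(info.st_ctime))   # creation or metadata-change time
```

Nothing in that output says what the file contains, which application produced it, or how long it must be retained.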
[Figure: A New File System Paradigm]
This data about data is good enough for basic file identification, retrieval and security, but inadequate for true data management, which entails detailed indexing, search and retrieval, data migration, and backup and restore. Some organizations get around this by layering additional software, such as content- or document-management systems, over the file system. In the end, this just creates more manual steps and generates more management and administrative costs. NuView's StorageX, for example, is a more integrated and universal software stack, but still not the nirvana of a database-centric approach.

Enter the DB
In theory, a database-centric file system enhances files with more metadata capabilities. Then you can store a description of the application that produced the file, plus the file's security and retention requirements, details of its contents and how often it's accessed.
You can use this information to index files, which lets you more quickly find, manage and migrate them. A database also provides file versioning and journaling to protect against accidental deletion and overwriting, so your engineer won't have to go back to the drawing board with his inadvertently deleted CAD file.
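Here is a minimal sketch of what such a store could look like, using SQLite purely for illustration; the table layout and column names are assumptions made for the example, not details of WinFS or any other product. Descriptive metadata lives alongside the file, and each save adds a version row instead of overwriting the last one:

```python
import sqlite3

con = sqlite3.connect(":memory:")   # in-memory database, for illustration
con.executescript("""
CREATE TABLE files (
    file_id     INTEGER PRIMARY KEY,
    name        TEXT,
    author      TEXT,
    application TEXT,    -- application that produced the file
    retention   TEXT,    -- e.g. 'sox-7yr', 'none'
    description TEXT
);
CREATE TABLE versions (
    file_id    INTEGER REFERENCES files(file_id),
    version_no INTEGER,
    saved_at   TEXT,
    content    BLOB      -- the file bits themselves
);
""")

# Saving a file adds a version row; earlier versions remain intact.
con.execute("INSERT INTO files VALUES (1, 'bridge.dwg', 'engineer', 'CAD', 'sox-7yr', 'girder drawing')")
con.execute("INSERT INTO versions VALUES (1, 1, '2004-11-01', ?)", (b'first draft',))
con.execute("INSERT INTO versions VALUES (1, 2, '2004-11-10', ?)", (b'revised draft',))

# An accidental overwrite or deletion is recoverable from the latest version row.
row = con.execute("SELECT content FROM versions WHERE file_id = 1 ORDER BY version_no DESC").fetchone()
print(row[0])   # -> b'revised draft'
```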
One key to object-oriented file storage is hosting all of the file objects on a clustered database engine (see "A New File System Paradigm," below). Server clusters are a must for this approach, given the growth in enterprise data and the different sizes of binary objects (files) that must fit into database cells. Clustering lets you extend the size of the index as the number of object files grows.
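A generic way to spread file objects across cluster nodes is to hash each object's identifier to a node. The sketch below illustrates the idea only; the node names are made up, and it says nothing about how SQL Server or WinFS actually partitions data:

```python
import hashlib

# Hypothetical cluster nodes, named only for illustration.
NODES = ["db-node-1", "db-node-2", "db-node-3", "db-node-4"]

def node_for(object_id: str) -> str:
    """Map a file-object ID to a cluster node by hashing it."""
    digest = hashlib.md5(object_id.encode()).hexdigest()
    return NODES[int(digest, 16) % len(NODES)]

print(node_for("files/bridge.dwg"))     # the same ID always lands on the same node
print(node_for("files/q3-report.xls"))
```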
Database clustering was widely viewed by critics as the Achilles' heel of Microsoft's Longhorn WinFS effort, given the clustering limitations of its current SQL Server product. This is one of the limitations Microsoft is trying to address with Yukon.

Even with a new platform on the horizon, replacing the conventional file service is still no small task. For one thing, the new model will require users and applications to change the way they read and write data. Users will have to unlearn the file-cabinet metaphor and directory tree structure in favor of indexed queries and reports in database-centric file systems.
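In practice, that shift means a lookup stops being a walk through folders and becomes a query. A self-contained sketch, again using SQLite with assumed column names, shows the difference in flavor:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE files (name TEXT, author TEXT, application TEXT, description TEXT)")
con.execute("INSERT INTO files VALUES ('bridge.dwg', 'engineer', 'CAD', 'girder drawing for span 4')")

# Instead of remembering a folder path, the user (or application) asks a question:
hits = con.execute(
    "SELECT name FROM files WHERE application = 'CAD' AND description LIKE '%girder%'"
).fetchall()
print(hits)   # -> [('bridge.dwg',)]
```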
And application compatibility with the new architecture isn't a given: No one knows whether Microsoft's upcoming file system will initially work only with its own Office application suite, Exchange Mail and related applications.
Another major hurdle is getting all the data that's currently stored in files into a new database-centric system, which entails manually labeling existing files with the detailed metadata that would be included automatically with new files. This could be a huge undertaking, especially if you have petabytes of older files in storage. So even though a database-centric file system may be the key to true data management, don't expect to replace your conventional file system overnight.
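A migration effort might begin with a crawler like the hedged sketch below, which walks a directory tree and records whatever metadata the old file system can supply. The database name and directory path are hypothetical, and the descriptive fields left empty are precisely the ones someone would have to fill in by hand:

```python
import os
import sqlite3

con = sqlite3.connect("file_index.db")   # hypothetical target database
con.execute("""CREATE TABLE IF NOT EXISTS files
               (path TEXT, size INTEGER, modified REAL,
                description TEXT, retention TEXT)""")

def ingest(root: str) -> None:
    """Walk a directory tree and load each file's known metadata."""
    for dirpath, _dirs, names in os.walk(root):
        for name in names:
            full = os.path.join(dirpath, name)
            info = os.stat(full)
            # description and retention start out NULL: the old file system
            # simply doesn't know them, so a person or tool must add them later.
            con.execute("INSERT INTO files VALUES (?, ?, ?, NULL, NULL)",
                        (full, info.st_size, info.st_mtime))
    con.commit()

ingest("/legacy/fileshare")   # hypothetical path
```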
Jon William Toigo is CEO of storage consultancy Toigo Partners International, founder and chairman of the Data Management Institute, and author of 15 books. Write to him at [email protected].
File system data is often called "unstructured data" to differentiate it from the structured data in a database. But file systems are actually organized, and they provide an interface between applications, end users and physical disk storage.

The typical Unix file system structure is shown in "The 'I' in Inode." The inode contains the file's metadata presented to the end user or application, and it facilitates the reading and writing of data to the disk itself. The complex combination of metadata and direct and indirect block pointers--extents, in Microsoft parlance--found in inodes reflects an era in which storage (both memory and disk) was expensive and coding elegance was prized over convenience.
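The block-pointer arithmetic that inodes embody can be shown with a toy calculation. The block size and pointer counts below are round numbers chosen for readability, not the values of any particular Unix file system:

```python
BLOCK_SIZE = 4096                       # bytes per block (illustrative)
DIRECT_POINTERS = 12                    # direct block pointers in the inode
POINTERS_PER_BLOCK = BLOCK_SIZE // 4    # a single-indirect block of 4-byte pointers

def locate(offset: int) -> str:
    """Report whether a byte offset is reached via a direct or indirect pointer."""
    block = offset // BLOCK_SIZE
    if block < DIRECT_POINTERS:
        return f"direct pointer #{block}"
    block -= DIRECT_POINTERS
    if block < POINTERS_PER_BLOCK:
        return f"single-indirect block, slot #{block}"
    return "double- or triple-indirect pointers"

print(locate(8_000))        # small-file offset -> direct pointer
print(locate(2_000_000))    # larger offset -> single-indirect block
```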
Still, contemporary file systems present some painful limitations that could be better addressed by an object-oriented database approach to storing data. These limitations include:
Data isolation: Data is accessed via inode stovepipes, which makes coordinating, indexing and managing data a daunting task.
Data duplication: Sharing data among heterogeneous file systems requires duplicate copies, which wastes space and creates versioning and consistency problems.
Data incompatibility: There is no guarantee that a file can be opened by a given application, and different semantics used by various OS file systems can limit your access to just one instance of the file.
Some of these limitations, however, have more to do with how users deploy file-system capabilities than with the file systems themselves. If users named their files more carefully, and if file directories or folders were more logically defined and administered, it would be easier to identify files that require disaster-recovery backup or regulatory-compliance handling.
An object-oriented file system is a better option (see "A New File System Paradigm"). Databases provide greater controls over who has access to what, and database controls facilitate concurrent server and client access, even when different operating systems are involved. Data is stored in a uniform fashion that all OSs can understand.
Other benefits of the database approach are:
Enhanced data naming and metadata management, thanks to descriptive header information that can be used to establish data classes, enable different data views and improve segregation techniques.
Access-frequency and access-type (read versus write) tracking to fine-tune data placement on storage and to identify suitable candidates for archiving; see the sketch after this list.
More efficient indexing and retrieval for regulatory compliance and disaster recovery.
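As a simple illustration of that tracking, an access log kept in the database makes cold files easy to find. The schema and dates below are assumptions made for the example:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE access_log (file_id INTEGER, op TEXT, at TEXT)")
con.executemany("INSERT INTO access_log VALUES (?, ?, ?)", [
    (1, "read",  "2004-10-01"),
    (1, "read",  "2004-11-01"),
    (2, "write", "2004-03-15"),   # untouched for months
])

# Files with no activity since mid-year are candidates for archiving.
cold = con.execute("""
    SELECT file_id, MAX(at) AS last_access
    FROM access_log
    GROUP BY file_id
    HAVING MAX(at) < '2004-07-01'
""").fetchall()
print(cold)   # -> [(2, '2004-03-15')]
```

Queries like this, which a conventional file system's sparse metadata can't support directly, are what make policy-driven archiving and compliance retrieval practical.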