'Only The Blocks That Have Changed' And Other Platitudes

Howard Marks

March 24, 2011


Many of the technologies we've come to rely on in the storage world nowadays use one form or another of changed block tracking. Snapshots, replication (especially the point-in-time kind), automated tiering and data deduplication all work by identifying changed, or different, blocks and treating them in some special way. The problem is that while parts may be parts, blocks are most definitely not blocks.

Part of the problem is that when storage guys hear the term "block" they immediately think of 512-byte SCSI blocks and assume that copying, moving and storing only the blocks that have changed is an efficient process. Unfortunately for us, when storage systems replicate data or take snapshots, the blocks they move around are more like file system allocation units than SCSI blocks, and are usually a lot bigger than 512 bytes. As a result, users frequently find they need more snapshot space and WAN bandwidth than they had planned for in order to use the cool features of their storage systems.

The problem stems partly from the fact that storage folks use the term block for everything, in much the way network guys informally use "packet" even though they have more exact terms like frame, datagram and segment. While some of us will use chunk to represent these larger units of data, just about every presentation I see includes the magic phrase "only the blocks that have changed".

The size of the chunk a storage system uses as its allocation unit varies widely, and it can have a significant impact on how efficiently the chunk-based function you're looking at will run. If you're running a SQL Server database application that does a lot of random database updates, as many do, each record you update in the database will write an updated 8KByte SQL Server page to the disk.

If your storage system uses 4KByte chunks, like NetApp's WAFL, each 8KByte SQL Server page update will cause the system to store two chunks. If your system uses 16MByte chunks, as some do, then one 8KByte database update will take up 16MBytes of snapshot space, consume 16MBytes of WAN bandwidth to replicate and use 16MBytes of expensive flash memory when migrated to tier 0.

Since storage systems on the market use chunks from 4KBytes to a whopping 1GByte, how much snapshot space, or flash, you'll need to satisfy your requirements can vary greatly. When you hear the magic words "only the blocks that have changed," be sure to ask how big those blocks really are.
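To put numbers on that, here is a minimal Python sketch of the arithmetic. The function names are my own, and the model is an assumption: it treats a write as an aligned byte range and charges the system one full chunk for every chunk the write touches. It does not describe how any particular array does its bookkeeping.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB


def chunks_touched(offset, length, chunk_size):
    """Count the chunk_size-aligned chunks that a write overlaps."""
    if length <= 0:
        return 0
    first = offset // chunk_size
    last = (offset + length - 1) // chunk_size
    return last - first + 1


def changed_bytes(offset, length, chunk_size):
    """Bytes a chunk-granular tracker would mark as changed (assumed model)."""
    return chunks_touched(offset, length, chunk_size) * chunk_size


# One 8KByte SQL Server page update, written at an aligned offset.
page = 8 * KIB
for label, chunk in [("4KByte", 4 * KIB), ("16MByte", 16 * MIB), ("1GByte", GIB)]:
    consumed = changed_bytes(0, page, chunk)
    print(f"{label} chunks: {consumed} bytes changed, "
          f"{consumed // page}x the data actually written")
```

Under this model the 4KByte system tracks exactly the 8KBytes written, while the 16MByte and 1GByte systems each charge a full chunk for the same update.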

As if all that weren't enough to drive you out of your noodle, we next come to the applications that create more disk chunk updates than they have to. Almost any application where you explicitly save versions of your file, from Microsoft Word and PowerPoint to Photoshop and Final Cut, creates a temporary file where the program stores your edits during a session. When you save, the program deletes the original file and renames the temporary file to the original name. So, if you edit a 500MByte video, your disk system sees 500MBytes of changed blocks, even if you altered only a few seconds of footage, and again you'll use more snapshot space and replication bandwidth. Deduplication can help with this problem, especially if the deduplication engine is context-aware enough to recognize it's a new version of the same file.
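The save-by-rename effect, and why deduplication can soften it, can be illustrated with a small Python sketch. Everything here is hypothetical: the 4KByte chunk size, the helper names and the toy 64-chunk "file" are stand-ins, and the dedup model is plain fixed-size chunk hashing, not any vendor's engine. It also assumes the edit overwrites bytes in place without shifting the rest of the file. An edit that inserts or removes data would defeat fixed-size chunking and call for something smarter.

```python
import hashlib

CHUNK = 4 * 1024  # assumed chunk size for this illustration


def split_chunks(data, chunk_size=CHUNK):
    """Cut a byte string into fixed-size chunks."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def chunks_new_to_dedup(old, new, chunk_size=CHUNK):
    """Count chunks of the new file whose hash is not already stored."""
    stored = {hashlib.sha256(c).digest() for c in split_chunks(old, chunk_size)}
    return sum(
        1 for c in split_chunks(new, chunk_size)
        if hashlib.sha256(c).digest() not in stored
    )


def make_file(n_chunks, chunk_size=CHUNK):
    """Build deterministic test data in which every chunk is distinct."""
    return b"".join(
        hashlib.sha256(str(i).encode()).digest() * (chunk_size // 32)
        for i in range(n_chunks)
    )


old = make_file(64)

# A small same-length edit inside chunk 10, then "save": the application
# writes a complete new copy and renames it over the original.
edited = bytearray(old)
edited[10 * CHUNK:10 * CHUNK + 5] = b"EDIT!"
new = bytes(edited)

# The whole file lands on fresh blocks, so every chunk counts as changed.
chunks_rewritten = len(split_chunks(new))
# A chunk-hashing dedup engine only has to store the one altered chunk.
chunks_unique = chunks_new_to_dedup(old, new)

print(chunks_rewritten, chunks_unique)
```

Changed block tracking sees all 64 chunks as new, while the hash comparison finds only one chunk it has not stored before.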

In short, whenever you hear the words "only the blocks that have changed," remember to ask what size blocks the speaker is talking about. Granularity matters--really it does.

About the Author

Howard Marks

Network Computing Blogger

Howard Marks is founder and chief scientist at Deepstorage LLC, a storage consultancy and independent test lab based in Santa Fe, N.M. and concentrating on storage and data center networking. In more than 25 years of consulting, Marks has designed and implemented storage systems, networks, management systems and Internet strategies at organizations including American Express, J.P. Morgan, Borden Foods, U.S. Tobacco, BBDO Worldwide, Foxwoods Resort Casino and the State University of New York at Purchase. The testing at DeepStorage Labs is informed by that real world experience.

He has been a frequent contributor to Network Computing and InformationWeek since 1999 and a speaker at industry conferences including Comnet, PC Expo, Interop and Microsoft's TechEd since 1990. He is the author of Networking Windows and co-author of Windows NT Unleashed (Sams).

He is co-host, with Ray Lucchesi, of the monthly Greybeards on Storage podcast, where the voices of experience discuss the latest issues in the storage world with industry leaders. You can find the podcast at: http://www.deepstorage.net/NEW/GBoS
