Tuesday, March 11, 2008

All of Your Data in One Basket

By Roderick Flores

I once worked with someone whose programs wrote all of their output to a single file. Once those programs were moved onto the grid, they would routinely create files hundreds of gigabytes in size. Nobody considered this a problem: the space was available, and the SAN not only supported files that large but performed surprisingly well given what was being asked of it. Yet while a single output file simplifies the code and the data management, there are a number of reasons why it is a bad practice.

  • You don’t always need all of the output data at once, and with one huge file, moving just the piece you want from the grid to your desktop for testing is not even an option.
  • The computation time needed to recreate a huge file is significant: if it is lost or corrupted, you must rerun the entire job rather than only the tasks that produced the missing piece.
  • There is no easy way to use multiple threads or processes to write and read the data (see the checksum sketch after this list for what parallel access buys you).
  • Moving one enormous file across the network takes far longer than moving smaller files, which can be transferred selectively or in parallel.
  • A file that is locked for writing can be modified by only one process at a time, so one large file blocks far more concurrent updates than several smaller files would.
  • Backing the file up is considerably harder. You cannot simply burn it to a DVD, so it has to go to disk or tape, and restoring it can take a significant amount of time.
  • Your file will be severely fragmented across the physical drives, which increases seek times.
  • You can no longer use memory-mapped files: a file of that size will not fit in a 32-bit address space, and mapping it is impractical even where larger mappings are possible (see the mmap sketch after this list).
  • Performing a checksum on a single large file takes forever, and unlike per-chunk checksums it cannot be parallelized (again, see the sketch below).
  • Finally, if you had properly distributed the job across the Grid, you should not have such large files!!!
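
For contrast, here is roughly what the alternative looks like: each grid task writes its own chunk file instead of appending to a shared monolith. This is only a minimal sketch in Python; the results/chunk-<id>.dat layout is an invention for illustration, and I assume an array-job scheduler such as Sun Grid Engine, which hands each task its index in the SGE_TASK_ID environment variable.

    import os

    def write_chunk(records, out_dir="results"):
        # Each task writes only its own piece of the output.
        # SGE_TASK_ID is set by Sun Grid Engine for array jobs; any
        # unique per-task identifier would do just as well.
        task_id = os.environ.get("SGE_TASK_ID", "0")
        try:
            os.makedirs(out_dir)
        except OSError:
            pass  # directory already exists, e.g. created by another task
        path = os.path.join(out_dir, "chunk-%s.dat" % task_id)
        with open(path, "wb") as f:
            for rec in records:
                f.write(rec)
        return path

Each chunk can now be copied, verified, backed up, or regenerated on its own, which is exactly what the list above calls for.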

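Checksums illustrate the payoff nicely. Verifying one multi-hundred-gigabyte file is a single serial pass, but per-chunk digests can be computed in parallel and re-verified selectively. Here is a sketch using only the standard library (hashlib and multiprocessing); the glob pattern matches the chunk layout assumed above.

    import glob
    import hashlib
    from multiprocessing import Pool

    def md5_of(path):
        # Stream the file through the digest so memory use stays flat.
        h = hashlib.md5()
        with open(path, "rb") as f:
            for block in iter(lambda: f.read(1 << 20), b""):
                h.update(block)
        return path, h.hexdigest()

    if __name__ == "__main__":
        pool = Pool()  # defaults to one worker per CPU
        for path, digest in pool.map(md5_of, sorted(glob.glob("results/chunk-*.dat"))):
            print("%s  %s" % (digest, path))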

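Reasonably sized chunks also bring memory-mapped I/O back into play: a chunk fits comfortably in a process’s address space, where a multi-hundred-gigabyte file would not (certainly not on the 32-bit nodes common on grids). A minimal sketch with the standard mmap module; the fixed record size is a made-up parameter.

    import mmap

    RECORD_SIZE = 4096  # hypothetical fixed-length record

    def read_record(path, n):
        # Map the chunk read-only and pull out the n-th record
        # without reading the rest of the file.
        with open(path, "rb") as f:
            mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
            try:
                offset = n * RECORD_SIZE
                return mm[offset:offset + RECORD_SIZE]
            finally:
                mm.close()
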
Why would anybody do such a thing?  All your data are belong to us?
