Disk efficiency when dealing with tons of small files

By Andrius Miasnikovas

March 12, 2010

When dealing with source code from projects of various sizes, the disk quickly fills up with literally tens or even hundreds of thousands of files, especially when the project is under source control and has metadata files for each of the project’s files. While you may still have a lot of disk space left on the drive, I found that this kind of setup is highly inefficient when jumping from file to file in an IDE or doing a full-text search. While thinking of a quick fix, one thing came to mind: virtual disk images. So I grabbed my TrueCrypt and created a few volumes for the bigger projects. I know that TrueCrypt is not actually meant for that, but while creating the volumes I realized that in my case this was an advantage: not only am I creating a single file on the physical disk that is easier to maintain, I’m also encrypting all the project data and thus gaining an additional security layer.

I’m not going to get into the details of actually creating a volume file, since that is well documented in the TrueCrypt Beginner’s Tutorial, though I will advise one thing. Projects under source control contain a whole lot of tiny metadata files which hold less than 1KB of information, but the most common file system cluster size is 4KB. In practice this means that if you have a file on your disk which contains only the string “hello, world”, it will still occupy 4KB of disk space (assuming that your disk’s cluster size is 4KB). A cluster is the smallest amount of disk space that can be allocated for file storage. You can read more about cluster sizes here and here. When creating a virtual disk image, TrueCrypt lets you choose the cluster size before formatting the volume. In my case, selecting the smallest available cluster size, 512 bytes, is the most efficient choice and in some cases saved nearly 20MB. I know it’s not much, but I just like things efficient :)
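To get a rough feel for how much space the cluster size costs you, here is a minimal Python sketch. It assumes a simplified model in which every file occupies a whole number of clusters; real file systems may behave differently (some store very small files inside their own metadata, for example). The function names `allocated_size` and `slack_for_tree` are my own, made up for illustration.

```python
import os


def allocated_size(file_size, cluster_size):
    """Bytes occupied on disk if space is handed out in whole clusters.

    Simplified model (an assumption): ignores file system specifics
    such as tiny files being stored inside metadata records.
    """
    clusters = -(-file_size // cluster_size)  # ceiling division
    return clusters * cluster_size


def slack_for_tree(root, cluster_size):
    """Total wasted bytes (allocated minus actual) for all files under root."""
    wasted = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                size = os.path.getsize(path)
            except OSError:
                continue  # skip files that vanished or cannot be read
            wasted += allocated_size(size, cluster_size) - size
    return wasted
```

Calling `slack_for_tree` on a project directory once with 4096 and once with 512 gives an estimate of how much the smaller cluster size would save under this model.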

Another interesting tool which lets you manage small files more efficiently is called Hamana. It’s a “graphic viewer powered by DirectX” and it’s great for viewing comics, manga or just your personal photo collection archive without having to extract the files. The first thing I’ll mention is that this is Japanese software, and so is the whole web page, but one can find the download links pretty easily. You could of course just use Google Translate on the page if you want to read what it’s all about. The key features for me are that it’s really fast and that it lets you view the images inside an archive such as a .zip or .rar file. According to the page it also handles .arj, .cab, .lzh and others, but I don’t actually use those. I’ll admit that the page and the software haven’t been updated for quite a while, but it works great, so no complaints on my side. The first thing you’ll want to do when you run this program is go to the settings menu and configure the controls and maybe other things. The good news is that the menu is in English, but the viewer gives you so much control that it might be a little overwhelming. You can literally program a macro that checks which mode the viewer is currently in and assign it to a key or mouse click. It does take some getting used to, but once you set it up the way you like, it’s really a nice piece of software.
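If you’re curious what “viewing images without extracting” looks like in code, here is a small Python sketch using the standard `zipfile` module. To be clear, this is not how Hamana works internally (I don’t know that); it only illustrates the general idea of reading archive members straight into memory. It covers .zip only, and the helper names and the list of image extensions are my own assumptions.

```python
import zipfile

# Assumed set of extensions to treat as images; adjust to taste.
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif", ".bmp")


def list_images(archive):
    """Return the sorted names of image files inside a .zip archive.

    `archive` may be a path or an open binary file-like object.
    """
    with zipfile.ZipFile(archive) as zf:
        return sorted(
            name for name in zf.namelist()
            if name.lower().endswith(IMAGE_EXTENSIONS)
        )


def read_image_bytes(archive, member):
    """Read one archive member into memory without extracting it to disk."""
    with zipfile.ZipFile(archive) as zf:
        return zf.read(member)
```

The bytes returned by `read_image_bytes` could then be handed to whatever image library you like for decoding and display.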

If you’re using other tools to achieve similar goals or want to share your experience with managing small or big files, leave a comment. I’m always interested in finding new ways to optimize my system.