File Systems Unraveled

Most of you have probably heard the terms FAT, FAT32, FAT16, NTFS and partition table thrown around quite a bit. They seem like mystical entities, and you never could figure out what exactly they all meant. This article will change that. By the time I am done, you will know precisely what each of these terms means.

What is a File System?

First, let's understand what a file system is. A file system can be thought of as the way your computer manages the files stored on your hard drive. Your computer has thousands upon thousands of files. If there were no organized way of managing them, your system would be unbearably slow, if it worked at all. This is easy to appreciate if you consider how much stuff is piled up in your office, and how much time is wasted finding things buried under a ton of paper. Now take that mess and multiply it by a thousand. That is what your computer would be going through if an efficient file system didn't exist. And just as different people organize their offices differently, there are many file systems out there with varying features. However, there are several key functions that no file system should be without:

- Efficiently use the space available on your hard drive to store the necessary data.
- Catalog all the files on your hard drive so that retrieval is fast and reliable.
- Provide methods for performing basic file operations, such as delete, rename, copy, and move.
- Provide some kind of data structure that allows a computer to boot off the file system.

There are, of course, file systems that go beyond these basic requirements by providing additional functionality, such as compression, encryption, passwords/permissions, filestreams, etc. Later on in this article, I will discuss some of these extra features in relation to Windows NT's NTFS.

FAT In Detail

Note: This section is more technical in nature than the rest of the article. Feel free to skip it if you'd like. But be warned that you'll miss some interesting tidbits about FAT that you probably never knew.

So what is FAT, and how do file systems work? The answer is quite simple, in fact. The space on your hard drive, at its most basic level, is divided into units called sectors. Each sector is 512 bytes. So if your hard drive had 10 kilobytes of total disk space, it would be divided into 20 sectors. But the file system doesn't deal with the hard drive on a sector-by-sector basis. Instead, it groups a bunch of sectors together into a cluster, and deals with the cluster. These clusters are also called allocation units by DOS.

Another way of thinking about this is to suppose that each sector on your hard disk is a person carrying a bag, and that you can store 512 bytes of information in each bag. Instead of numbering each person 1, 2, 3 and so on, the file system first takes several people, puts them into a group, and calls that group 1. So if you had 400 people, and the file system decided to put 4 people in each group, you'd have 100 groups. In other words, on a drive with 400 sectors (roughly 200K of space) and an allocation size of 4 sectors (2K), there would be 100 clusters. When the file system needs to access a particular sector, it first finds the cluster that sector belongs to, and then within that cluster it picks out the sector by its index.
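To make the arithmetic concrete, here is a tiny sketch in Python of the same bookkeeping. The 400 sectors and 4 sectors per cluster are just the numbers from the illustration above, not anything a real driver would hardcode:

    SECTOR_SIZE = 512          # bytes per sector
    SECTORS_PER_CLUSTER = 4    # the "allocation unit" size the file system picked
    TOTAL_SECTORS = 400        # about 200K of raw space

    total_clusters = TOTAL_SECTORS // SECTORS_PER_CLUSTER   # 100 clusters

    def locate(sector_number):
        """Translate an absolute sector number into (cluster, index within cluster)."""
        cluster = sector_number // SECTORS_PER_CLUSTER
        index_in_cluster = sector_number % SECTORS_PER_CLUSTER
        return cluster, index_in_cluster

    print(total_clusters)   # 100
    print(locate(37))       # sector 37 lives in cluster 9, at position 1 within it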
Returning to the people-and-bags analogy: to find a person, say Jon, I would find Jon's group number first, and then go to his group and look for him. All three of the file systems (FAT16, FAT32 and NTFS) work like this.

So what is the difference between FAT16 and FAT32? The major difference lies in how much space each file system can handle and how efficiently it does so. The efficiency problem arises because each cluster on a hard disk can only store one file! That means each group can only be made to handle one item. To illustrate my point, consider the following situation. The file system decides to divide all the people into groups of 8 (we'll get into how this number is chosen later). Each of these 8 people has a bag that can store stuff. Now the file system hands the first group a huge box of pencils and says "store this." The eight people start to put the pencils in their bags, and after one fills up, they move on to the next. The box of pencils fills 7 bags. The file system tries to hand the group another small thing to put into the eighth bag, which is empty. But the group says "sorry, we can only handle one thing. You gave us one already." The file system says "fine, but you are wasting 12.5% of your space (1/8 = 0.125)." The group tells the file system "sorry, we can't help it." The file system moves on. Now the file system gives the next group of 8 only a single pencil to store. The group stores it and refuses to take anything else. The file system informs the group that they are wasting almost 100% of their storage space. But there is nothing they can do.

These stories may seem silly, but they get the point across: as the size of the clusters increases, the amount of space you waste increases. It is true that if you could make all your files exactly the same size as your cluster, you'd have 0% waste. But that is not possible. Most typical files are not very big, and if the cluster size gets huge, the waste can be quite alarming. So now the question becomes: how does my computer figure out the size of each cluster? The answer is simple: take the size of your hard drive and divide it by the number of clusters the file system can handle. In other words:

Cluster Size = Disk Space / Number of Clusters Possible

And since cluster size is directly proportional to wasted space (as the cluster size increases, the wasted space also increases), what we want is a file system that can handle a large number of clusters. This is where FAT16 and FAT32 differ. FAT32 can handle a lot more groups than FAT16 can. But why is that? The simple explanation is that FAT32 can count a lot higher than FAT16. As I said above, each cluster is numbered by the file system. FAT16 uses 16-bit numbers to count the clusters; that is, binary numbers of 16 digits. The consequence is that the highest FAT16 can count to is 2^16 - 1 (yes, it is in fact 2^16 - 1, because there are 2^16 values between 0 and 2^16 - 1, and zero also has to count), or 65535. So there can only be 65535 clusters on a FAT16 disk. What that means for you is that as your hard drive gets huge, your cluster count stays the same, so your cluster size increases. But don't think for a minute that you can just increase the size of each cluster indefinitely. That can't happen. The reason is that every sector inside a group also has to be numbered, and each sector's index number is written inside a single byte. A byte is 8 bits.
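Before we get to why clusters can't simply keep growing, here is a quick back-of-the-envelope sketch of how badly that wasted ("slack") space adds up as clusters get bigger. The three file sizes below are made up purely for illustration:

    def slack(file_size, cluster_size):
        """Bytes wasted when file_size gets rounded up to whole clusters."""
        clusters = -(-file_size // cluster_size)        # ceiling division
        return clusters * cluster_size - file_size

    sizes = [300, 5_000, 70_000]             # three hypothetical files, in bytes
    for cluster_size in (2_048, 32_768):     # 2K clusters vs. 32K clusters
        total = sum(slack(s, cluster_size) for s in sizes)
        print(f"{cluster_size // 1024}K clusters waste {total} bytes on these files")

With 2K clusters these three files waste a few kilobytes between them; with 32K clusters the very same files waste well over 80 kilobytes, even though the data itself hasn't changed.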
Back to that one-byte sector index. Since the index lives in a byte, it has to be less than 2^8 (at most 255, to be exact). And because sector counts per cluster are kept to powers of 2, the biggest usable value is 2^7, or 128 sectors per cluster. So now let's do a little bit of math:

- You have a maximum of 65535 clusters.
- You have a maximum of 128 sectors per cluster.
- You have 512 bytes per sector.

That means your maximum FAT16 size is 65535 * 128 * 512, or roughly 4 GB.

Wait a second, that's not right! I thought the limit was 2GB? And I thought each cluster in FAT16 could only be 32K, not 64K! And you would be right. The problem is that 128 sectors * 512 bytes per sector is 65536 bytes, which is one more than a 16-bit number can handle. So again we drop down, to 64 sectors per cluster, which yields 32K per cluster. And 32K per cluster * 65535 clusters is roughly 2GB.

FAT32 solves this problem by removing the 65535-clusters-per-disk limitation. FAT32 uses 32-bit numbers, that is, binary numbers with 32 digits, which allows it to count much higher. And since it can handle a bigger number of clusters, its cluster size is much smaller than FAT16's on bigger disks. In fact, FAT32's maximum disk size is 2 terabytes. To get this number, you take the total number of sectors addressable (and I do mean sectors), which would be 2^32 - 1, and multiply that by 512 bytes per sector. That's a whopping 2048 gigabytes, or 2 terabytes.

At this point, some of you may be scratching your heads trying to figure out the inconsistencies in my explanation. The first item to address is that even though the file system accesses the sectors by a cluster count first, that still doesn't alleviate the need to number the sectors individually. Even in FAT16, the sectors are numbered. And that leads to the second concern some of you may have: since FAT16 uses 16-bit numbers, doesn't that mean there can be only 2^16 - 1 sectors? Wouldn't that translate into 32 megs? Yes, you are right. But unknown to most is the fact that since DOS 4.0, the underlying sector numbering has been a 32-bit value! The limit placed on the disk size was purely due to the 16-bit numbering of the clusters, and the limit on the numbering of the sectors within each cluster, as discussed above.

OK, so we know what sectors and clusters are. But how does that get translated into files? That is where the File Allocation Table comes in. The FAT is a huge database that contains records of where each file is on the disk. In fact, it would not be too much of a stretch to think of the FAT as a table with several columns, each recording something about the files on the drive. Each record inside the FAT takes up 32 bytes of space. In other words, if I had 100 files on the computer, it would take the system roughly 3200 bytes to record all of that information in the FAT. Just for fun, let's take a look at what is stored in these 32 bytes:

Byte Range   Info Stored
1 to 8       Filename
9 to 11      Extension
12           Attributes (e.g. read-only, archive, hidden)
13 to 22     Reserved bytes for later features
23 to 24     Time written
25 to 26     Starting cluster
29 to 32     File size

Interesting list, isn't it? Some of the entries are self-explanatory, but two of them are rather interesting. The first is the Starting Cluster field. Some of you may have been wondering how the system translates cluster and sector indices into filenames and such. The answer is that for each file, there is a field in the FAT that indicates the first cluster of the file.
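To show how those byte ranges might be picked apart, here is a small Python sketch. It follows the table above (shifted to 0-based offsets, and skipping the two bytes the table doesn't describe); the sample entry, its README.TXT name and its field values are entirely made up, and real FAT implementations differ in small details, so treat this as an illustration of the layout rather than a byte-exact specification:

    import struct

    def parse_entry(entry: bytes):
        """Pull the fields listed in the table out of one 32-byte directory entry."""
        assert len(entry) == 32
        name       = entry[0:8].decode("ascii", "replace").rstrip()   # bytes 1 to 8
        extension  = entry[8:11].decode("ascii", "replace").rstrip()  # bytes 9 to 11
        attributes = entry[11]                                        # byte 12
        # bytes 13 to 22 are the reserved area mentioned in the text
        time_written,  = struct.unpack_from("<H", entry, 22)          # bytes 23 to 24
        start_cluster, = struct.unpack_from("<H", entry, 24)          # bytes 25 to 26
        file_size,     = struct.unpack_from("<I", entry, 28)          # bytes 29 to 32
        return name, extension, attributes, time_written, start_cluster, file_size

    # A fabricated entry for "README.TXT": archive attribute, 1234 bytes, cluster 5.
    fake = (b"README  " + b"TXT" + bytes([0x20]) + bytes(10) +
            struct.pack("<H", 0) + struct.pack("<H", 5) + bytes(2) +
            struct.pack("<I", 1234))
    print(parse_entry(fake))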
The system would read that FAT entry, find the starting cluster, and read the file. Now the question is: how does the system know when to stop reading? And even before that, how does the system know where to read next after this cluster? The answer is that the file allocation table keeps, for every cluster, an entry pointing to the next cluster that contains data from the same file. So the computer reads the current cluster and checks whether any other cluster follows it. If one does, it skips to that cluster, reads it, and checks for the next one. This process repeats until it reaches a cluster marked as the end of the chain. The CS majors reading this will recognize this as a linked list implementation.

The other interesting feature of this table is that each directory entry (record in the FAT) uses 4 bytes to store the size of the file. This may not seem like much at first, but it actually tells you the maximum size possible for any single file. The fact that we use 4 bytes to store a file size tells us that the largest number that can be represented is 32 bits (recall that there are 8 bits per byte). So what is the largest 32-bit number? That would be 2^32 - 1. So a file can have a maximum of 2^32 - 1 bytes, or 4 gigabytes. This calculation is obviously done under the assumption that we are using FAT32.

The last two fields I'd like to take a look at are the filename field and the reserved bytes. The interesting thing about the filename field is that DOS uses it to perform undelete. When you erase a file in DOS, you aren't actually erasing the file. All you are doing is changing the first letter of the filename field into a special character. As far as the file system is concerned, the file isn't there, and the next time data is written to those clusters, the old file really is gone. The way DOS performs an undelete is simply to change that first letter back to something else. That is why when you used undelete in DOS, it always asked for the first letter of the filename before it could restore the file. Mystery solved.

Now let me make a quick mention of the reserved bytes. The reserved bytes didn't do much in FAT16, but they became rather useful in FAT32 and in NTFS. Since FAT32's cluster numbering uses 32-bit numbers instead of the 16-bit numbers used in FAT16, the system needed two extra bytes to accommodate the added digits, and those two bytes were taken out of the reserved area. And in NTFS, compression attributes and some security information were also written into the reserved area.

Before I move on, I'd like to point out a few of the other differences between FAT16 and FAT32. In FAT32, the root directory is unlimited in size. What this means is that you can have as many files and directories in C:\ as you'd like. In the days of FAT16, the root directory had a fixed maximum of 512 directory entries. That means that if you had normal filenames of 8 letters plus a 3-letter extension, you could have at most 512 directories and files in the root. That may seem like more than you'd ever need to put in the root directory, and it probably is, if you had only 8.3 filenames. But in Win95, the system supports long filenames, and the trick is that Win95 combines multiple directory entries to do it. So consider a file named "My English Paper". That is 16 characters long, so it takes at least 2 directory entries. Actually, it takes 3: two for the long filename, and another one for the short 8.3 filename kept for compatibility with DOS and Win3.1.
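Out of curiosity, here is that entry count worked out in code. The 13-characters-per-extra-entry figure is my own addition (it is the usual Win95 VFAT packing, not something stated above), on top of the one entry always kept for the 8.3 alias:

    import math

    def entries_needed(long_name: str) -> int:
        """Directory entries a long filename consumes under the VFAT scheme."""
        lfn_entries = math.ceil(len(long_name) / 13)   # entries holding the long name
        return lfn_entries + 1                         # plus the short 8.3 entry

    print(entries_needed("My English Paper"))   # 16 characters -> 2 + 1 = 3 entries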
As you can see, long filenames can quickly deplete directory entries.

Another nice feature is that FAT32 has better FAT redundancy. Both FAT32 and FAT16 store two copies of the file allocation table on disk, but traditionally the file system only read from one of them. In FAT32, the system can choose to read from either one, which provides a better failsafe against freak accidents involving corrupt file tables.

It is apparent that FAT32 is a superior file system to FAT16. Unfortunately, FAT32 is not supported by every operating system. The original version of Windows 95 couldn't read FAT32; it wasn't until version B (OSR2) that Win95 gained that ability. And all versions of WinNT before 5.0 (renamed Windows 2000, or Win2K for short) could not read FAT32 drives either.

New Technology File System

Now that I've covered FAT16 and FAT32 in excruciating detail, let's turn our attention to NTFS, the proprietary file system for WinNT. While FAT32 was a decent system, it lacked some of the more advanced features that many businesses need to run a network. Chief among them are file-level security, encryption, event logging, error recovery and compression. NTFS (5.0) provides all of these features in a nicely optimized package.

Permissions:

The feature that NT is probably best known for is its file-level security. With NTFS permissions, you can control which users have what kind of access to which files. This is a stark contrast to the "security" in Windows 9x, where the system policy editor affords the only measure of protection. Once a knowledgeable user gets past the policy protections, which are only skin deep (or interface deep), every file on the system is his for the taking. In Windows NT, even if you get past the interface lockouts, you'll still have a hell of a time accessing other people's files, because they are locked at the file level.

Before I discuss how to set file permissions, we need to take a step back and look quickly at how permissions in general work on Windows NT. Windows NT's security model is an entire topic unto itself, so I will not cover it in detail here. However, a general overview will prove beneficial. With Windows NT, you can assign security at two different levels - on a per-user basis, and on a group basis. So, if there is a user called Jane who belongs to the Marketing user group, you can affect Jane's access permissions either by assigning permissions to her account, or to her group. So what happens if Jane's group has Modify access to a document, but Jane is only assigned Read access? Surprisingly enough, the least restrictive of the two permission sets takes precedence - in this case, the Modify access. The one glaring exception is the No Access permission. If a No Access permission is assigned at any level, the user has no access, regardless of any other permissions assigned. So if the Marketing group is assigned No Access, Jane would have no access even if her account is assigned Full Control. (A small sketch just after the compression section below works through this rule in code.)

So there you have it, NT file-level security at a glance. There is much more to it, but as this is an intro article, a more in-depth exploration of NT file security seems more appropriate for a separate article. With that said, let's take a look at some of the other features of NTFS.

Compression:

Another useful feature is compression. It works transparently (like DriveSpace), and can be assigned to individual files (unlike DriveSpace). To turn on compression for a file, right-click on it and choose Properties. In the Properties dialog, you can check the Compressed attribute.
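Stepping back to permissions for a moment, here is the sketch promised above. It is a toy model in plain Python, not any real Windows security API: the least restrictive of the user's own and group permissions wins, except that No Access anywhere shuts the door completely.

    # Toy model of the NT permission-combination rule described earlier.
    RANK = {"No Access": 0, "Read": 1, "Modify": 2, "Full Control": 3}

    def effective_access(user_perm, group_perms):
        granted = [p for p in [user_perm, *group_perms] if p is not None]
        if not granted or "No Access" in granted:
            return "No Access"                  # No Access overrides everything
        return max(granted, key=RANK.get)       # otherwise least restrictive wins

    # Jane: Read on her account, Modify through the Marketing group -> Modify
    print(effective_access("Read", ["Modify"]))
    # Marketing group set to No Access -> No Access, even with Full Control
    print(effective_access("Full Control", ["No Access"]))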
Compression can be applied to an entire directory in the same way.

Encryption:

Even more useful is the encrypting file system (EFS) included in NTFS 5.0. With EFS, you can actually encrypt a file, rather than just protect it via permissions. This is a long overdue feature, since there are other operating systems on the market that will read files on an NTFS volume while bypassing NT security. BeOS is one example, and one I have used. Various flavors of Linux might also provide similar functionality, though I have yet to personally encounter one. However, if a file is encrypted, such dangers are drastically mitigated.

NT5's EFS is a system-level service, which means it runs even when all users are logged off. This also prevents hackers from easily disabling it, as is the case with user-mode encryption programs. Moreover, the encryption works transparently with respect to the user. What that means is that if a user has permission to decrypt the file, then when the file is accessed it will be decrypted seamlessly, without any user intervention. On the other hand, if the user does not have appropriate access, an "Access Denied" error will pop up.

In principle, EFS works on a public/private key system (via CryptoAPI, if you are interested). When a file is encrypted, a file encryption key (FEK) is automatically generated. That randomly generated FEK is used to encrypt the file (or folder). The FEK is then, itself, encrypted using the user's public key, and a list of encrypted FEKs is stored as part of the file. When the user tries to access the file, the system attempts to decrypt the FEK with the user's private key. If it succeeds, the decrypted FEK is then used to decrypt the actual file. However, if a file is copied to a non-NTFS partition, a plaintext version of the file is created.

To activate encryption, simply right-click on a folder, choose Properties from the popup, and check the Encrypt checkbox. By default, Windows Explorer will only allow folders to be encrypted (which is the recommended method). However, the CIPHER command can be used to encrypt on a per-file basis:

To encrypt: CIPHER /e myfile.txt
To decrypt: CIPHER /d myfile.txt

The other nice thing about EFS is that it offers a data recovery mechanism. A data recovery agent is automatically configured; in a Windows 2000 domain, that defaults to the domain admin. The assigned recovery agent can then decrypt any file that is under his scope. It is important to note that when recovery occurs, only the file's FEK is revealed, NOT the user's private key. This prevents the recovery agent from accessing files that are not under his scope. As always, the domain admin has the power to delegate recovery rights to other user groups so as to provide both flexibility and redundancy.

File Auditing:

However, just protecting a file against possible intruders is not enough. There must be a way for an admin to know whether a file hack has been attempted. This is where file auditing (event logging, if you will) comes in handy. With NTFS, you can keep track of who has tried to access what file, and whether they succeeded. To enable file auditing, use the following steps:

1 - First, make sure that file access auditing is turned on via User Manager.
2 - Go into the Security tab of any file you wish to audit and click the Audit button.
3 - Add the users whom you wish to audit for the given file, and click OK.
4 - Now select the events you wish to audit.
Click OK.
5 - To view the audited events, go into Event Viewer and look at the security log.

Data Recovery:

But what good is protecting your data if it simply gets corrupted when the system crashes? Here too, NTFS has a solution. NTFS has superior data recovery capabilities compared to FAT and FAT32. Each I/O operation that modifies a file on an NTFS volume is viewed by the file system as a transaction and can be managed as an atomic unit. When a user updates a file, the Log File Service tracks all redo and undo information for the transaction. If every step of the I/O process succeeds, the changes are committed to disk. Otherwise, NT uses the undo log to roll back the activity and restore the disk to the state it was in before the changes were made.

When Windows NT crashes, NTFS performs three passes upon reboot. First, it performs an analysis pass, in which it determines exactly which clusters must now be updated, per the information in the log file. Then it performs the redo pass, in which it carries out all transaction steps logged since the last checkpoint. Lastly, it performs the undo pass, in which it backs out all incomplete transactions. Together, these steps ensure that data corruption is kept to a minimum.

Yet Another Cool (But Scary) Feature:

At this point, you probably think NTFS is pretty cool. But there is one other feature in NTFS that is documented, yet not very well publicized (for obvious reasons, as you will see). What I am referring to is filestreams (Unix users will be familiar with this feature). To illustrate the concept of filestreams, let's first picture any file (whether it be a document, an exe or a jpeg) as a garden hose. When you access the data in the file, that data flows through the file in a continuous stream, like water flowing through a garden hose. In a typical file, there is only a single data stream, the default stream. All data written to and read from the file goes through that stream. When Explorer (or the command interpreter) reports the size of the file, it is reporting the data stored in that stream.

In FAT and FAT32, this fact was of little concern, since any file could only be given a single stream (the default). However, this all changes in NTFS, which allows any given file to have multiple data streams. This is akin to a garden hose that has within it multiple smaller hoses, each with its own stream of water flowing. In fact, each stream can contain a different type of data. One data stream could be a text document, while another could contain WAV data, another executable code, and yet another jpeg data. You can almost think of a file with multiple data streams as a special kind of folder with multiple files stored within it. To illustrate my point, let's create a text file with multiple filestreams:

1 - Go to Windows NT's command interpreter (type cmd at the Run prompt).
2 - Switch to a partition that is NTFS.
3 - Type the following:
    echo This is what you'll see >> stream.txt [Press Enter]
    echo This is what you won't see >> stream.txt:hiddenStream [Press Enter]
4 - Now open the file in Notepad.

What you'll see is the text "This is what you'll see." The other string of text, "This is what you won't see", is in the file, but it is stored in a separate file stream called hiddenStream. And since most programs do not read data from any stream other than the default stream, that data is hidden from the user. To view the contents of the hidden stream, do the following:

1 - Go to the NT command interpreter.
2 - Type the following:
    more < stream.txt:hiddenStream
3 - And voila! There is your hidden stream.

At this point, you should be getting chills, because filestreams bring up some very disturbing possibilities for writing viruses and such. A virus writer could conceivably write the executable code for his virus into a hidden stream of a text file! That way, normal virus scanners would not find the harmful code. To activate the virus, the malicious programmer need only write a catalyst program that performs a seemingly innocuous read from a text file. The worst part of all this is that hidden streams are difficult to detect, because data written into a hidden stream is NOT counted as part of the file's size. So you could have a text file that contains 20 bytes of text and 2 megs of executable code, and it would show up as 20 bytes. Even worse, any user can create files with hidden streams, even your guest account users (assuming they can write to a directory).

Thankfully, the situation is not hopeless. For one, hidden file streams can be detected through the Windows APIs. Secondly, all hidden streams are lost when the file is copied to a non-NTFS partition. So conceivably, antivirus firms can write scanners that scan for hidden streams. To the best of my knowledge, there haven't been any serious viruses written to take advantage of this particular feature of NTFS. For now, you can rest easy knowing that the end isn't quite here yet. But definitely keep filestreams in mind, for if there is a security weakness, somebody will find it sometime.

Conclusion:

There you have it - the three most common file systems in a nutshell. I hope this article has been at least mildly entertaining for some of you. As always, any comments, suggestions or corrections can be sent to xinli1@uiuc.edu. Until next time, happy computing folks.

http://www.pcnineoneone.com