[OSC] Day7: File System

File System


File System 1

Construction of a hard disk


Low level format

Disks are organized in
  • Cylinders: a collection of tracks in the same relative position to the spindle
  • Tracks: a concentric circle on a single platter side
  • Sectors: segments of a track (usually 512 B or 4 KB in size)
Sectors usually have an equal number of bytes in them, consisting of a preamble, data, and an error correcting code

cylinder skew

Disks usually have a cylinder skew: an offset is added to sector 0 on adjacent tracks to account for the seek time when the head moves to the next track
Consecutive disk sectors may be interleaved to account for transfer time

Disk Access Time

Access time = seek time + rotational delay + transfer time
  • Seek time = time needed to move the arm to the cylinder (dominant)
  • Rotational latency = time before the sector appears under the head (on average half the rotation time)
  • Transfer time = time to transfer the data
The estimated seek time (i.e., to move the arm from one track to another) is approximated by:
Ts = n * m + s
In which Ts denotes the estimated seek time, n the number of tracks to be crossed, m the crossing time per track, and s any additional startup delay
Let us assume a disk that rotates at 3600 rpm
3600 rpm means 3600 rotations per minute, so one rotation takes 60000/3600 ≈ 16.7 ms

The average rotational latency is then 16.7/2 ≈ 8.3 ms
Let b denote the number of bytes to be transferred and N the number of bytes per track; given the rotation speed in rpm, the transfer time Tt follows from:
  • N bytes take 1 revolution, i.e. 60000/rpm ms (≈16.7 ms at 3600 rpm)
  • b contiguous bytes take b/N revolutions
  • so Tt = (b/N) × (60000/rpm) ms
Example:
Read a file of 256 sectors with Ts = 20 ms and 32 sectors/track:
  • Suppose the file is stored as compactly as possible (contiguously), i.e. all sectors on 8 consecutive tracks of 32 sectors each (sequential storage)
  • The first track takes 20 + 8.3 + 16.7 = 45 ms
  • The remaining tracks need no seek, hence each takes 8.3 + 16.7 = 25 ms
  • The total time is 45 + 7 × 25 = 220 ms = 0.22 s
If the sectors are accessed at random rather than sequentially, we get:
time per sector = 20 + 8.3 + (16.7/32) = 28.8 ms
Total time = 256 × 28.8 ms ≈ 7.37 s
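These figures are easy to verify programmatically; below is a minimal sketch using the numbers assumed above (Ts = 20 ms, 3600 rpm, 32 sectors/track, a 256-sector file).

```python
# Verify the sequential vs. random access example above.
SEEK_MS = 20.0
ROTATION_MS = 60_000 / 3600            # one revolution at 3600 rpm: ~16.7 ms
AVG_ROT_DELAY_MS = ROTATION_MS / 2     # on average half a rotation: ~8.3 ms
SECTORS_PER_TRACK = 32
FILE_SECTORS = 256

# Sequential: one seek, then each track costs half a rotation (on average)
# plus a full rotation to read all 32 sectors.
tracks = FILE_SECTORS // SECTORS_PER_TRACK
total_ms = (SEEK_MS + AVG_ROT_DELAY_MS + ROTATION_MS) \
         + (tracks - 1) * (AVG_ROT_DELAY_MS + ROTATION_MS)
print(f"sequential: {total_ms:.0f} ms")        # ~220 ms

# Random: every sector pays a seek, the average rotational delay, and the
# time to pass over a single sector (1/32 of a rotation).
per_sector = SEEK_MS + AVG_ROT_DELAY_MS + ROTATION_MS / SECTORS_PER_TRACK
print(f"random: {FILE_SECTORS * per_sector / 1000:.2f} s")   # ~7.4 s
```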
It is important to position the sectors carefully and avoid disk fragmentation

Disk scheduling

Disk scheduling algorithms determine the order in which disk events are processed

First come First served

Process the requests in the order that they arrive
Consider the following sequence of disk requests (cylinder locations): 11 1 36 16 34 9 12
In the order of arrival (FCFS) the total length is: |11-1|+|1-36|+|36-16|+|16-34|+|34-9|+|9-12|=111

Shortest seek time first

Shortest seek time first (SSTF) selects the request closest to the current head position to reduce head movement
For the same queue (11 1 36 16 34 9 12), SSTF (cf. shortest job first) saves roughly 50%: |11-12|+|12-9|+|9-16|+|16-1|+|1-34|+|34-36|=61
Shortest seek time first could result in starvation:
  • The arm stays in the middle of the disk in case of heavy load, edge cylinders are poorly served, the strategy is unfair
  • Continuously arriving requests for the same location could starve other regions

SCAN

“Lift algorithm, SCAN” : keep moving in the same direction until end is reached (start upwards):
  • It continues in the current direction, servicing all pending requests as it passes over them
  • When it gets to the last cylinder, it reverses direction and services all the pending requests (until it reaches the first cylinder)
(Dis-)advantages include:
  • The upper limit on the “waiting time” is 2 × the number of cylinders, i.e. no starvation occurs
  • The middle cylinders are favoured if the disk is heavily used (max. wait time is N tracks, 2N for the cylinders on the edge)
  • “Lift algorithm, SCAN” (for 11 1 36 16 34 9 12): |11-12|+|12-16|+|16-34|+|34-36|+|36-9|+|9-1|=60

C-SCAN

Once the outer/inner side of the disk is reached, the requests at the other end of the disk have been waiting longest
SCAN can be improved by using a circular scan approach: C-SCAN
  • The disk arm moves in one direction servicing requests
  • When it gets to the last cylinder of the disk, it reverses direction but it does not service requests on the return journey
  • Once it gets back to the first cylinder it reverses direction, and again services requests
  • It is fairer and equalises response times across a disk
The C-SCAN algorithm (for 11 1 36 16 34 9 12): |11-12|+|12-16|+|16-34|+|34-36|+|36-1|+|1-9|=68
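The four strategies can be compared directly in code. A minimal sketch, assuming the head starts at cylinder 11 and, as in the worked figures above, reverses at the outermost pending request rather than at the physical edge of the disk (strictly, LOOK-style behaviour):

```python
# Compare the disk-scheduling strategies on the example queue above.
REQUESTS = [11, 1, 36, 16, 34, 9, 12]

def fcfs(reqs, head):
    total = 0
    for r in reqs:
        total, head = total + abs(head - r), r
    return total

def sstf(reqs, head):
    pending, total = list(reqs), 0
    while pending:
        r = min(pending, key=lambda c: abs(c - head))  # closest request first
        total, head = total + abs(head - r), r
        pending.remove(r)
    return total

def scan(reqs, head):
    up = sorted(r for r in reqs if r >= head)          # serviced going up
    down = sorted((r for r in reqs if r < head), reverse=True)
    return fcfs(up + down, head)                       # reverse once at the top

def cscan(reqs, head):
    up = sorted(r for r in reqs if r >= head)
    low = sorted(r for r in reqs if r < head)          # serviced after the jump
    return fcfs(up + low, head)                        # the jump 36 -> 1 counts

for name, fn in [("FCFS", fcfs), ("SSTF", sstf), ("SCAN", scan), ("C-SCAN", cscan)]:
    print(name, fn(REQUESTS, 11))                      # 111, 61, 60, 68
```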

other SCAN variations

LOOK-SCAN moves only as far as the cylinder containing the first/last request (as opposed to the first/last cylinder on the disk, as with SCAN)
N-step-SCAN services only N requests per sweep.
LOOK-SCAN and SSTF are reasonable choices in practice
Optimal algorithms are difficult to achieve if requests arrive over time!

Driver Caching

For most current drives, the time required to seek to a new cylinder exceeds the rotational delay (remember pre-paging in this context)
It makes sense, therefore, to read more sectors than are actually required:
  • Read sectors during the rotational delay (i.e. sectors that happen to pass under the head anyway)
  • Modern controllers read multiple sectors when asked for the data from a single sector: track-at-a-time caching
Solid State Drives (SSDs) have no moving parts and store data using electrical circuits.
They don’t have Tseek or rotational delay!

File System 2

Different views

A user view that defines a file system in terms of the abstractions that the operating system provides
An implementation view that defines the file system in terms of its low level implementation

User View

Important aspects of the user view include:
  • The file abstraction which hides away implementation details to the user (similar to processes and memory)
  • File naming policies (abstracting storage details) and “user file attributes” (e.g. size, owner, protection, dates). There are also system attributes for files (e.g. non-human-readable file descriptors (similar to a PID), archive flag, temporary flag, etc.)
  • Directory structures and organisation
  • System calls to interact with the file system
  • The user view defines what the file system looks like to regular users (and programmers) and relates to abstractions
File Types:
Regular files and Directories
Unix also has character and block special files:
  • Character special files are used to model serial I/O devices, e.g. keyboards, printers
  • Block special files are used to model e.g. hard drives

File Control Block

(Figure: a file control block, holding file attributes such as permissions, dates, owner, size, and the location of the file's data blocks.)

System calls

File control blocks (FCBs) are kernel data structures, i.e. they are protected and only accessible in kernel mode!
Allowing user applications to access them directly could compromise their integrity
System calls enable a user application to ask the operating system to carry out an action on its behalf (in kernel mode)
There are two different categories of system calls:
  • File manipulation: open(), close(), read(), write(), …
  • Directory manipulation: create(), delete(), readdir(), rename(), link(), unlink(), list(), update()
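As an illustration of the first category, Python's os module exposes thin wrappers around these system calls (the file names below are just examples):

```python
import os

# File manipulation: open(), write(), read(), close().
fd = os.open("example.txt", os.O_CREAT | os.O_WRONLY, 0o644)  # -> descriptor
os.write(fd, b"hello, file system\n")    # kernel acts on the FCB behind fd
os.close(fd)

fd = os.open("example.txt", os.O_RDONLY)
data = os.read(fd, 4096)                 # read up to 4096 bytes
os.close(fd)

# Directory manipulation: readdir(), rename(), unlink().
print(os.listdir("."))                   # readdir()
os.rename("example.txt", "renamed.txt")  # rename()
os.unlink("renamed.txt")                 # unlink() removes the directory entry
```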

File system structures

overview

Different directory structures have been used over the years
  • Single level: all files in the same directory (reborn in consumer electronics)
  • Two or multiple level directories (hierarchical): tree structures (an absolute path name runs from the root of the file system to the file; a relative path name starts from the current working directory)
  • Directed acyclic graph (DAG): allows files to be shared (i.e. links to files or sub-directories), but cycles are forbidden
  • Generic graph structure in which links and cycles can exist
The use of DAGs and generic graph structures results in significant complications in the implementation, e.g. when searching the file system:
  • Cycles can result in infinite loops
  • Sub-trees can be traversed multiple times
  • Files can have multiple absolute path names
  • Deleting files becomes a lot more complicated (links may no longer point to a file, inaccessible cycles may exist)
A garbage collection scheme may be required to remove files that are no longer accessible from the file-system tree (i.e. that are only part of a cycle)

Directories(possible implementations)

Directories contain a list of human readable file names that are mapped onto unique identifiers and disk locations
Retrieving a file comes down to searching a directory file as fast as possible:
  • A simple random order of directory entries is inefficient (search time is linear in the number of entries)
  • Indexes or hash tables can be used
They can store all file related attributes (e.g. file name, disk address – Windows) or they can contain a pointer to the data structure that contains the details of the file (Unix)
Directories are special files that group files together and of which the structure is defined by the file system
A bit is set to indicate that they are directories!

Implementation View

Low level formatting writes sectors to the disk; high level formatting imposes a file system on top of this (using blocks that can cover multiple sectors)

Partitions

Disks are usually divided into multiple partitions, and an independent file system may exist on each partition
The Master Boot Record (MBR) is located at the start of the entire drive:
  • Used to boot the computer (the BIOS reads and executes the MBR)
  • Contains the partition table at its end
  • One partition is listed as active, containing a boot block to load the operating system

The layout of a partition differs depending on the file system; a UNIX partition typically contains:
  • The partition boot block: contains code to boot the operating system; every partition has a boot block, even if it does not contain an OS
  • The super block: contains the partition's details, e.g. partition size, number of blocks, i-node table size
  • Free space management: e.g. a bitmap or linked list that indicates the free blocks
  • I-nodes: an array of data structures, one per file, telling all about the files
  • The root directory: the top of the file-system tree
  • Data: files and directories

Free space management

Two methods are commonly used to keep track of free disk space: bitmaps and linked lists
Bitmaps:
Bitmaps represent each block by a single bit in a map
  • the size of the bitmap grows with the size of the disk but is constant for a given disk
  • Bitmaps generally take less space than linked lists
Linked List:
A Linked List of disk blocks (also known as Grouping)
We use free blocks to hold the numbers of the free blocks (hence, they are no longer free). E.g. with 1 KB blocks and 32-bit disk block numbers, each block holds 255 free block numbers (plus one pointer to the next block).
Blocks are linked together, i.e., multiple blocks list the free blocks
The size of the list grows with the size of the disk and shrinks with the size of the blocks
Linked lists can be modified by keeping track of the number of consecutive free blocks for each entry (known as Counting)
Bitmaps vs Linked Lists
Keeping the bitmap in main memory is feasible only for small disks:
e.g. with a block size of 2^12 bytes and a 1 GB (2^30 bytes) disk, the bitmap needs 2^30/2^12 = 2^18 bits = 32 KB
With a linked list there is no waste of disk space, and only one block of pointers needs to be kept in memory (a new block is loaded when needed)
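The trade-off is easy to quantify; a small sketch of the arithmetic above, assuming 4-byte block numbers:

```python
# Bitmap: one bit per block, regardless of how many blocks are free.
def bitmap_bytes(disk_bytes, block_bytes):
    return (disk_bytes // block_bytes) // 8

# Free list: each block holds (block_bytes/ptr_bytes - 1) free-block numbers,
# since one slot is the pointer to the next block of the list.
def free_list_blocks(free_blocks, block_bytes, ptr_bytes=4):
    per_block = block_bytes // ptr_bytes - 1
    return -(-free_blocks // per_block)        # ceiling division

print(bitmap_bytes(2**30, 2**12))              # 32768 bytes = 32 KB for 1 GB
print(free_list_blocks(100_000, 1024))         # 393 blocks for 100k free blocks
```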

File tables

Apart from the free space management tables, a number of key data structures are stored in memory:
  • An in-memory mount table
  • An in-memory directory cache of recently accessed directory information
  • A system-wide open file table, containing a copy of the FCB for every currently open file in the system, including its location on disk, file size, and an ‘open count’ (the number of processes that use the file)
  • A per-process open file table, containing a pointer to the system open file table
The open(file name) system call first searches the system-wide open file table to see if the file is already in use by another process.
If it is, a per-process open file table entry is created pointing to the existing system-wide open file table.
If the file is not already open, the directory structure is searched for the given file name. Once the file is found, the FCB is copied into a system-wide open file table in memory. This table not only stores the FCB but also tracks the number of processes that have the file open.
When reading a file, the per-process open file table is searched first, and its pointer is followed to the system-wide open file table (and from there to the file's data).
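A minimal sketch of this two-level structure; the dictionaries and field names below are illustrative, not an actual kernel's data structures:

```python
# Illustrative model of the system-wide and per-process open file tables.
system_open_files = {}    # file name -> {"fcb": ..., "open_count": int}

def load_fcb(name):
    """Stand-in for searching the directory structure and copying the FCB."""
    return {"name": name, "size": 0, "blocks": []}

def sys_open(proc_table, name):
    entry = system_open_files.get(name)
    if entry is None:                        # file not open by any process yet
        entry = {"fcb": load_fcb(name), "open_count": 0}
        system_open_files[name] = entry
    entry["open_count"] += 1                 # another process now uses the file
    fd = len(proc_table)
    proc_table[fd] = entry                   # per-process entry -> system entry
    return fd

def sys_close(proc_table, fd):
    entry = proc_table.pop(fd)
    entry["open_count"] -= 1
    if entry["open_count"] == 0:             # last user: drop the cached FCB
        del system_open_files[entry["fcb"]["name"]]

proc_a, proc_b = {}, {}
fd_a = sys_open(proc_a, "mbox")
fd_b = sys_open(proc_b, "mbox")              # reuses the system-wide entry
print(system_open_files["mbox"]["open_count"])   # 2
```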

File System 3

The layout of the file system and the file allocation method used by the operating system heavily influences the seek movements
Contiguous files result in short head movements
While disk scheduling could be implemented in the controller, carrying it out in the OS means that the OS can prioritise/order requests as appropriate

File access (sequential vs random access)

(Figure: sequential vs. random file access patterns.)

Contiguous Allocation

Contiguous file systems are similar to dynamic partitioning in memory allocation:
  • Each file is stored in a single group of adjacent blocks on the hard disk
  • e.g. 1KB blocks, 100KB file, we need 100 contiguous blocks
Allocation of free space can be done using first fit, best fit, next fit, etc
Advantages:
  • Simple to implement: only the location of the first block and the length of the file must be stored (in the directory entry)
  • Optimal read/write performance: blocks are co-located/clustered in nearby/adjacent sectors, hence seek time is minimised
Disadvantages:
  • The exact size of a file is not always known beforehand: what if the file size exceeds the initially allocated disk space?
  • Allocation algorithms are needed to decide which free blocks to allocate to a given file (first fit, best fit, etc.; see the sketch after this list)
  • Deleting a file results in external fragmentation: de-fragmentation must be carried out regularly (and is slower than for memory)
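A minimal sketch of first-fit allocation over a free list; representing holes as (start, length) pairs is an assumption for illustration:

```python
# First-fit allocation of n contiguous blocks from a list of (start, length)
# holes; returns the starting block, or None if no hole is large enough.
def first_fit(holes, n):
    for i, (start, length) in enumerate(holes):
        if length >= n:
            if length == n:
                holes.pop(i)                        # hole consumed exactly
            else:
                holes[i] = (start + n, length - n)  # shrink the hole
            return start
    return None                                     # external fragmentation bites

holes = [(0, 40), (100, 300), (500, 80)]
print(first_fit(holes, 100))   # 100: the first hole of >= 100 blocks
print(holes)                   # [(0, 40), (200, 200), (500, 80)]
```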
Contiguous allocation is still in use: CD-ROMS/DVDs
External fragmentation is less of an issue here since they are write once only

Linked list

To avoid external fragmentation, files are stored in separate blocks (similar to paging) that are linked to one another
Only the address of the first block has to be stored to locate a file
Each block contains a pointer to the next block, which takes up space within the block
Advantages:
  • Easy to maintain: only the first block (address) has to be kept in the directory entry
  • File sizes can grow dynamically (i.e. the file size does not have to be known beforehand): new blocks/sectors can be added to the end of the file
  • Similar to paging for memory, every possible block/sector of disk space can be used: i.e. there is no external fragmentation
  • Sequential access is straightforward, although more seek operations/disk accesses may be required
Disadvantages:
  • Random access is very slow: to retrieve a block in the middle of the file, one has to walk through the list from the start
  • There is internal fragmentation: on average, half of the last block is left unused
  • Blocks may be scattered over the disk, resulting in random disk access, which is very slow
  • Space is lost within the blocks due to the pointer: the data in a block is no longer a power of 2!
  • Diminished reliability: if one block is corrupt/lost, access to the rest of the file is lost

File Allocation Tables(FAT)

Store the linked-list pointers in a separate index table, called a File Allocation Table (FAT), in memory!

Advantages:
  • The block size remains a power of 2, i.e. no space is lost due to the pointer
  • The index table can be kept in memory, allowing fast non-sequential/random access
Disadvantages:
  • The size of the file allocation table grows with the number of blocks, and hence with the size of the disk
  • For a 200 GB disk with a 1 KB block size, 200 million entries are required; assuming each entry occupies 4 bytes, the table requires 800 MB of main memory
Thus, FAT does not scale to large disks
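A minimal sketch of a FAT kept in memory as a plain array: finding the n-th block of a file walks only the table, never the disk:

```python
# FAT as an in-memory array: fat[b] is the block after b, or EOF (-1).
EOF = -1
fat = [EOF] * 16
# A file stored (non-contiguously) in disk blocks 4 -> 7 -> 2 -> 10:
fat[4], fat[7], fat[2], fat[10] = 7, 2, 10, EOF

def nth_block(fat, first_block, n):
    """Find the disk block holding the n-th block of the file (0-based).
    Only the in-memory table is walked; no disk reads are needed."""
    block = first_block
    for _ in range(n):
        block = fat[block]
        if block == EOF:
            raise IndexError("file is shorter than n blocks")
    return block

print(nth_block(fat, 4, 2))   # 2: the third file block lives in disk block 2
```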

I-Nodes

Each file has a small data structure (on disk) called an i-node (index-node) that contains its attributes and block pointers.
  • In contrast to FAT, an i-node is only loaded when the file is open (it is then stored in the system-wide open file table)
  • If every i-node occupies n bytes and at most k files can be open at any point in time, at most n*k bytes of main memory are required
I-nodes are composed of direct block pointers (usually 10), indirect block pointers, or a combination thereof (e.g. similar to multi-level page tables)

Assuming a block size of 1 KB and 32-bit disk addresses:
  • With only direct block pointers, the maximum file size is 1 KB × the number of direct blocks (e.g. 10 pointers give 10 KB in total)
  • A single indirect block can store up to 256 block pointers (32 bits = 4 bytes per pointer; a block is 1 KB = 2^10 bytes, so 2^10/4 = 256 pointers). With 10 direct blocks + 1 single indirect block, files can reach 266 blocks = 266 KB
  • A double indirect block points to a block of 256 pointers, each of which points to a single indirect block. With a double indirect block, files can grow to 266 KB + 256 × 256 KB = 65802 KB (~64 MB)
  • For files larger than ~64 MB, a triple indirect block is needed
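These maximum file sizes follow directly from the pointer arithmetic; a small sketch (with the same assumptions: 1 KB blocks, 4-byte addresses, 10 direct pointers):

```python
# Maximum file size with direct and up to `levels` levels of indirect blocks.
def max_file_kb(block_kb=1, ptr_bytes=4, direct=10, levels=1):
    ptrs = block_kb * 1024 // ptr_bytes        # pointers per block: 256 here
    blocks = direct + sum(ptrs ** i for i in range(1, levels + 1))
    return blocks * block_kb

print(max_file_kb(levels=0))   # 10       (KB, direct pointers only)
print(max_file_kb(levels=1))   # 266      (+ single indirect)
print(max_file_kb(levels=2))   # 65802    (+ double indirect, ~64 MB)
print(max_file_kb(levels=3))   # 16843018 (+ triple indirect, ~16 GB)
```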

Directories implementation with i-nodes

In UNIX, all information about a file (type, size, date, owner and block pointers) is stored in its i-node.
Directory tables are therefore very simple data structures, composed of a file name and a pointer to the i-node.
Note that directories are no more than a special kind of file, so they have their own i-node.

I-node lookups

Opening a file requires the disk blocks to be located:
  • Absolute file names are located relative to the root directory
  • Relative file names are located based on the current working directory
E.g. to locate /usr/gdm/mbox:
1. Locate the root directory of the file system: its i-node sits at a fixed location on the disk (the directory file itself can sit anywhere)
2. Locate the directory entries specified in the path:
  • Locate the i-node number of the first component (directory) of the path
  • Use the i-node number to index the i-node table and retrieve the directory file
  • Look up the remaining path components by repeating the two steps above
  • Hard links: two (or more) directory entries, e.g. in directories B and C, reference the same i-node (the i-node's link reference counter is then set to 2)
  • Symbolic links: the owner maintains the reference to the i-node in e.g. directory C; the referencer maintains a small file (with its own i-node) that contains the location and name of the shared file
Hard links are the fastest way of linking files!
Disadvantages of hard links:
  • Assume the owner of the file deletes it: if the i-node is also deleted, any hard link will, in the best case, point to an invalid i-node
  • If the i-node gets deleted and “recycled” to point to another file, the hard links will point to the wrong file!
The only solution is to delete the file but leave the i-node intact as long as the reference count is larger than 0 (the original owner of the file still gets “charged” for the space)
Disadvantages of symbolic links:
  • They result in an extra file lookup (once the link file has been found, the original file needs to be found as well)
  • They require an extra i-node for the link file
Advantages of symbolic links:
  • There are no problems with deleting the original file: the file simply does not exist any more
  • They can cross machine boundaries, i.e. the linked file can be located on a different machine
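On a Unix-like system, both kinds of link can be created through thin wrappers around the link() and symlink() system calls (the file names are illustrative):

```python
import os

with open("original.txt", "w") as f:
    f.write("shared data\n")

os.link("original.txt", "hard.txt")      # hard link: same i-node, count -> 2
os.symlink("original.txt", "soft.txt")   # symbolic link: its own small file

print(os.stat("hard.txt").st_ino == os.stat("original.txt").st_ino)  # True
print(os.readlink("soft.txt"))           # the stored path: original.txt

os.unlink("original.txt")                # i-node survives: link count still > 0
print(open("hard.txt").read())           # still readable via the hard link
# soft.txt now dangles: opening it raises FileNotFoundError.
```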

File System 4

Log structured file system

Consider the creation of a new file on a UNIX system:
1. Allocate, initialise and write the i-node for the file (i-nodes are usually located at the start of the disk)
2. Update and write the directory entry for the file(directories are tables/files that map names onto i-node in UNIX)
3. Write the data to the disk
The corresponding blocks are not necessarily in adjacent locations
Also in linked lists/FAT file systems blocks can be distributed all over the disk
Due to seek and rotational delays, hard disks are slow compared to other components in a computer(e.g. CPU, main memory)
A log structured file system aims to improve the speed of a file system on a traditional hard disk by minimising head movements and rotational delays, using the entire disk as one great big log
A log is a data structure that is written only at the end
  • Log structured file systems buffer write operations (i-nodes, data, etc.) in memory, enabling “larger volumes” to be written in one go
  • Once the buffer is full it is “flushed” to the disk and written as one contiguous segment at the end of “a log”
  • I-nodes and data are all written to the same segment
  • Finding i-nodes (traditionally located at the start of the partition) becomes more difficult
  • An i-node map is maintained in memory to quickly find the address of i-nodes on the disk
How do we deal with deleted files?
A cleaner thread runs in the background, scanning the log circularly and compacting it; the hard drive is treated as a circular buffer
Deleted files are removed; blocks of files that are still in use are read in and rewritten at the end of the log, after which their old segments are marked as free
Advantages and disadvantages:
It greatly increases disk performance on writes, file creates, deletes
Writes are more robust as they are done as a single operation (Multiple small writes are more likely to expose the file system to serious inconsistency).
However, it has not been widely used because it is highly incompatible with existing file systems.
In addition, the cleaner thread takes additional CPU time

Journaling file systems

Deleting a file consists of the following actions:
1. Remove the file’s directory entry
2. Add the file’s i-node to the pool of free i-nodes
3. Add the file’s disk blocks to the free list
Where can it go wrong? E.g.:
  • The directory entry has been removed and a crash occurs => the i-node and disk blocks become inaccessible
  • The directory entry and i-node have been released and a crash occurs => the disk blocks become inaccessible
Changing the order of these events does not necessarily resolve the issues
Journaling file systems aim at increasing the resilience of file systems against crashes by recording each update to the file system as a transaction
The key idea behind a journaling file system is to log all events(transactions) before they take place
  • Write the actions that should be undertaken to a log file
  • Carry them out
  • Remove/commit the entries once completed
If a crash happens in the middle of an action (e.g., deleting a file), the entry in the log file will remain present after the crash
The log can be examined after the crash and used to restore the consistency of the file system
NTFS and EXT3-4 are examples of journaling file systems
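A minimal write-ahead sketch of the idea; the JSON log format and function names are hypothetical (a real journal records block-level operations, not file names):

```python
import json, os

LOG = "fs.journal"   # hypothetical journal file

def journaled_delete(victim):
    # 1. Log the intended actions *before* carrying any of them out.
    with open(LOG, "w") as log:
        json.dump({"op": "delete", "target": victim,
                   "steps": ["remove_dirent", "free_inode", "free_blocks"]}, log)
        log.flush()
        os.fsync(log.fileno())     # the transaction must reach the disk first
    os.unlink(victim)              # 2. carry the actions out
    os.unlink(LOG)                 # 3. commit: remove the entry once completed

def recover():
    # After a crash, a surviving log entry identifies the in-flight
    # transaction, so its steps can be redone (or rolled back).
    if os.path.exists(LOG):
        with open(LOG) as log:
            print("replaying:", json.load(log))

open("victim.txt", "w").close()
journaled_delete("victim.txt")
recover()                          # prints nothing: the delete committed
```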

Virtual File Systems

Multiple file systems usually coexist on the same computer (e.g., NTFS and ISO9660 for a CD-ROM, NFS)
These file systems can be seamlessly integrated by the operating system (e.g. Unix / Linux)
This is usually achieved by using virtual file systems (VFS)
VFS relies on standard object oriented principles (or manual implementations thereof), e.g. polymorphism
In a similar way, Unix and Linux unify different file systems, presenting them as a single hierarchy and hiding away/abstracting the implementation-specific details from the user
The VFS presents a unified interface to the “outside”
File system specific code is dealt with in an implementation layer that is clearly separated from the interface
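A sketch of the polymorphism involved; the interface below is illustrative and far smaller than a real VFS interface (cf. Linux's file_operations):

```python
from abc import ABC, abstractmethod

class FileSystem(ABC):
    """The unified interface the VFS presents to the rest of the system."""
    @abstractmethod
    def read(self, path: str, n: int) -> bytes: ...
    @abstractmethod
    def write(self, path: str, data: bytes) -> int: ...

class Ext2(FileSystem):        # implementation layer: ext2-specific code
    def read(self, path, n): return b"ext2 data"
    def write(self, path, data): return len(data)

class Iso9660(FileSystem):     # read-only CD-ROM file system
    def read(self, path, n): return b"iso data"
    def write(self, path, data): raise PermissionError("read-only file system")

# The mount table maps mount points onto concrete implementations;
# callers only ever see the FileSystem interface.
mounts = {"/": Ext2(), "/cdrom": Iso9660()}

def vfs_read(path, n):
    mount_point = max((m for m in mounts if path.startswith(m)), key=len)
    return mounts[mount_point].read(path, n)

print(vfs_read("/cdrom/readme.txt", 100))   # dispatches to Iso9660.read
```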

File System 5

File System Recovery(consistency)

When you use USB drives, data can become inconsistent. (Journaling heavily reduces the probability of inconsistencies in a file system: in case of a crash, the log records which operations had not completed. However, inconsistencies are still possible.)
System utilities are available to restore file systems, e.g.:
  • UNIX: fsck
  • Windows: scandisk
There are two main consistency checks: block consistency and directory consistency

Checking block consistency

Block consistency checks whether blocks are assigned/used the correct way
Block consistency is checked by building two tables:
  • Table one counts how often each block is present in a file (based on the i-nodes)
  • Table two counts how often each block is present in the free list
In a consistent file system, each block has a count of 1 in exactly one of the two tables
Typically, this is a very slow process, taking even hours (and running with the partition unmounted)

There are four possible situations:
  • (a) The file system is consistent
  • (b) A missing block (it appears in neither table) => just add it to the free list
  • (c) A block appears twice in the free list (can only happen when free space management uses a linked list) => rebuild the free list
  • (d) A duplicate data block: a block is present in two or more files:
    Removing one file adds the block to the free list while it is still in use by the other file
    Removing both files results in a double entry in the free list
    Solution: copy the contents to a fresh free block (the file is still likely to be damaged)
The FSCK algorithm:
1. Iterate through all the i-nodes
  • retrieve the blocks
  • increment the counters
2. Iterate through the free list
  • increment the counters for free blocks
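A minimal sketch of the two-table check; the i-node block lists and the free list are assumed here to be plain Python lists:

```python
# Block-consistency check: every block must appear exactly once, either in
# some file (per its i-node) or in the free list.
def check_blocks(n_blocks, inode_block_lists, free_list):
    in_use = [0] * n_blocks
    free = [0] * n_blocks
    for blocks in inode_block_lists:      # pass 1: walk every i-node
        for b in blocks:
            in_use[b] += 1
    for b in free_list:                   # pass 2: walk the free list
        free[b] += 1
    for b in range(n_blocks):
        if in_use[b] + free[b] == 0:
            print(b, "missing block -> add to free list")            # case (b)
        elif free[b] > 1:
            print(b, "duplicate in free list -> rebuild free list")  # case (c)
        elif in_use[b] > 1:
            print(b, "block in several files -> copy to fresh block")  # (d)
        elif in_use[b] == 1 and free[b] == 1:
            print(b, "both in use and free -> remove from free list")

check_blocks(8, [[0, 2], [2, 3]], [4, 4, 6])
# blocks 1, 5, 7 are missing; block 2 is in two files; block 4 is doubly free
```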

Checking directory consistency

Are the i-node counts correct? Where can it go wrong?
  • The i-node counter is higher than the number of directories containing the file:
    Removing the file reduces the i-node counter by 1; since the counter remains larger than 0, the i-node and its disk space will never be released for future use (this merely wastes space, it is not serious)
  • The i-node counter is lower than the number of directories containing the file:
    Removing the file will eventually set the i-node counter to 0 while the file is still referenced
    The file/i-node will then be released, even though the file is still in use

Restoring i-node consistency

    - Recurse through the directory hierarchy
  - Increment file specific counters
  - I.e. each file is associated with one counter
- One file may appear in multiple directories
  - Compare the file counters and i-node counters
  - Correct if necessary

History of Linux file system

Minix file system: the maximum file size was 64MB and file names were limited to 14 characters
The “extended file system” (extfs): file names were 255 characters and the maximum file size was 2 GB
The “ext2” file system: larger files, larger file names, better performance
The “ext3-4” file system: journaling etc.
The extended 2 file system (ext2) is one of the most popular file systems in Linux
Its main goals were to:
  • Improve the performance of the MINIX and extfs file systems, distributing directories evenly over the disk
  • Allow longer file names and larger files, improving the directory implementation

Standard UNIX file system vs. Extended 2 file system

(Figure: the layout of a standard UNIX file system vs. an ext2 block group.)
  • The superblock contains file system information (e.g. the number of i-nodes, disk blocks)
  • The group descriptor contains bitmap locations, the number of free blocks, i-nodes and directories
  • The data block bitmap and i-node bitmap, used to keep track of free disk blocks and i-nodes (standard UNIX uses lists)
  • A table of i-nodes containing file and disk block information
  • Data blocks containing file and directory blocks
An ext2 partition is split into several block groups to:
  • reduce fragmentation by storing i-nodes and files, and parent directories and files in the same block group if possible
  • Reduce seek times and improve performance
  • All block groups have the same size and are stored sequentially (which allows direct indexing)

Directory entries

Every directory entry contains the following fixed-length fields:
  • i-node number
  • Entry size in bytes
  • Type field, i.e. file, directory, special file, etc.
  • File name length in bytes
followed by the (variable-length) file name itself.
Directories are searched linearly (i.e. they are unsorted) and a cache is maintained for recently accessed items.
  • File names can be up to 255 characters long
  • File lookups are similar to the UNIX file system: 12 block addresses are contained in the i-node, and single, double or triple indirect blocks are used
  • With a block size of 1 KB, the scheme can handle file sizes up to 16 GB
  • With a block size of 8 KB, it can support file sizes up to 64 TB

EXT3 file system

When making changes to an Ext2 file system, files are ideally written immediately to prevent inconsistency:
This generates significant head movement
The Ext2 file system is more suitable for flash disks, as it has no journal
Ext3 builds upon the Ext2 file system by adding:
  • Tree-based structures for directory files to facilitate indexing (HTrees)
  • Journaling capabilities
