[OSC] Day7: File System
File System 1
Construction of a hard disk
Low level format
Disks are organized in:
- Cylinders: a collection of tracks in the same relative position to the spindle
- Tracks: a concentric circle on a single platter side
- Sectors: segments of a track (usually 512 B or 4 KB in size)
cylinder skew
Disks usually have a cylinder skew: an offset is added to sector 0 in adjacent tracks to account for the seek time. Consecutive disk sectors may be interleaved to account for transfer time.
Disk Access Time
Access time = seek time + rotational delay + transfer time
- Seek time = time needed to move the arm to the cylinder (dominant)
- Rotational latency = time before the sector appears under the head (on average half the rotation time)
- Transfer time = time to transfer the data
Ts = n * m + s
where Ts denotes the estimated seek time, n the number of tracks to be crossed, m the crossing time per track, and s any additional startup delay
Let us assume a disk that rotates at 3600 rpm, i.e., 3600 rotations per minute.
One rotation then takes 60000/3600 ≈ 16.7 ms.
The average rotational latency is half of that: 16.7/2 ≈ 8.3 ms
Let b denote the number of bytes to be transferred, N the number of bytes per track, and rpm the rotation speed in rotations per minute. The transfer time Tt is given by:
- N bytes take 1 revolution, and one revolution takes 60000/rpm ms (at 3600 rpm: 16.7 ms)
- b contiguous bytes take b/N revolutions
- so Tt = (b/N) × (60000/rpm) ms
Example: read a file of 256 sectors with Ts = 20 ms and 32 sectors/track
- Suppose the file is stored as compactly as possible, i.e., all sectors on 8 consecutive tracks of 32 sectors each (sequential storage)
- The first track takes 20 + 8.3 + 16.7 = 45 ms
- The remaining tracks need no seek time, hence each takes 8.3 + 16.7 = 25 ms
- The total time is 45 + 7 × 25 = 220 ms = 0.22 s
In case the sectors are accessed not sequentially but at random, we get:
time per sector = 20 + 8.3 + (16.7/32) ≈ 28.8 ms
Total time = 256 × 28.8 ≈ 7.37 s
It is important to position the sectors carefully and avoid disk fragmentation
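To make the arithmetic concrete, here is a minimal Python sketch of the example above (assuming the same parameters: 3600 rpm, 32 sectors/track and a 20 ms average seek); all times are in milliseconds:

```python
SEEK = 20.0                      # average seek time (ms)
ROTATION = 60_000 / 3600         # one full rotation: ~16.7 ms
ROT_DELAY = ROTATION / 2         # average rotational latency: ~8.3 ms
SECTORS_PER_TRACK = 32

def sequential_read_ms(sectors):
    tracks = sectors // SECTORS_PER_TRACK
    first = SEEK + ROT_DELAY + ROTATION           # seek + latency + one track
    rest = (tracks - 1) * (ROT_DELAY + ROTATION)  # adjacent tracks: no seek
    return first + rest

def random_read_ms(sectors):
    per_sector = SEEK + ROT_DELAY + ROTATION / SECTORS_PER_TRACK
    return sectors * per_sector

print(sequential_read_ms(256))  # 220 ms = 0.22 s
print(random_read_ms(256))      # ~7387 ms, i.e. ~7.4 s (~28.8 ms per sector)
```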
Disk scheduling
Disk scheduling algorithms determine the order in which disk events are processed.
First come first served
Process the requests in the order that they arrive. Consider the following sequence of disk requests (cylinder locations): 11 1 36 16 34 9 12
In the order of arrival (FCFS) the total length is: |11-1|+|1-36|+|36-16|+|16-34|+|34-9|+|9-12|=111
Shortest seek time first
Shortest seek time first selects the request that is closest to the current head position to reduce head movement. In the order “shortest seek time first, SSTF” (cf. shortest job first) we gain approx. 50% (for 11 1 36 16 34 9 12): |11-12|+|12-9|+|9-16|+|16-1|+|1-34|+|34-36| = 61
Shortest seek time first could result in starvation:
- The arm stays in the middle of the disk in case of heavy load, edge cylinders are poorly served, the strategy is unfair
- Continuously arriving requests for the same location could starve other regions
SCAN
“Lift algorithm, SCAN”: keep moving in the same direction until the end is reached (start upwards):
- It continues in the current direction, servicing all pending requests as it passes over them
- When it gets to the last cylinder, it reverses direction and services all the pending requests (until it reaches the first cylinder)
- The upper limit on the “waiting time” is 2 × the number of cylinders, i.e., no starvation occurs
- The middle cylinders are favoured if the disk is heavily used (max. wait time is N tracks, 2N for the cylinders on the edge)
- “Lift algorithm, SCAN” (for 11 1 36 16 34 9 12): |11-12|+|12-16|+|16-34|+|34-36|+|36-9|+|9-1|=60
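A small Python sketch comparing the three strategies on the example sequence (head starting at cylinder 11). Note that the lift algorithm is implemented here exactly as computed above, i.e., it reverses at the last pending request (LOOK-style) rather than at the physical edge of the disk:

```python
def fcfs(head, requests):
    # First come first served: process in arrival order.
    total = 0
    for r in requests:
        total, head = total + abs(head - r), r
    return total

def sstf(head, requests):
    # Shortest seek time first: always pick the closest pending request.
    pending, total = list(requests), 0
    while pending:
        nxt = min(pending, key=lambda r: abs(head - r))
        total, head = total + abs(head - nxt), nxt
        pending.remove(nxt)
    return total

def scan(head, requests):
    # Lift algorithm, moving upwards first (reversal at the highest request).
    up = sorted(r for r in requests if r >= head)
    down = sorted((r for r in requests if r < head), reverse=True)
    total = 0
    for r in up + down:
        total, head = total + abs(head - r), r
    return total

reqs = [1, 36, 16, 34, 9, 12]
print(fcfs(11, reqs), sstf(11, reqs), scan(11, reqs))  # 111 61 60
```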
C-SCAN
Once the outer/inner side of the disk is reached, the requests at the other end of the disk have been waiting longest. SCAN can be improved by using a circular scan approach => C-SCAN
- The disk arm moves in one direction servicing requests
- When it gets to the last cylinder of the disk, it reverses direction but it does not service requests on the return journey
- Once it gets back to the first cylinder it reverses direction, and again services requests
- It is fairer and equalises response times across a disk
other SCAN variations
LOOK-SCAN moves to the cylinder containing the first/last request (as opposed to the first/last cylinder on the disk with SCAN).
N-step-SCAN only services N requests per sweep.
LOOK-SCAN and SSTF are reasonable choices for the algorithms
Optimal algorithms are difficult to achieve if requests arrive over time!
Driver Caching
For most current drives, the time required to seek to a new cylinder is longer than the rotational time (remember pre-paging in this context). It therefore makes sense to read more sectors than actually required:
- Read sectors during the rotational delay (i.e., sectors that happen to pass under the head anyway)
- Modern controllers read multiple sectors when asked for the data of a single sector: track-at-a-time caching
Requests served from this cache incur no seek time or rotational delay!
File System 2
Different views
A user view that defines a file system in terms of the abstractions that the operating system provides, and an implementation view that defines the file system in terms of its low-level implementation.
User View
Important aspects of the user view include:
- The file abstraction, which hides implementation details from the user (similar to processes and memory)
- File naming policies (abstracting storage details) and “user file attributes” (e.g. size, owner, protection, dates). There are also system attributes for files (e.g. non-human-readable file descriptors (similar to a PID), archive flag, temporary flag, etc.)
- Directory structures and organisation
- System calls to interact with the file system
- The user view defines what the file system looks like to regular users (and programmers) and relates to abstractions
Regular files and Directories
Unix also has character and block special files:
- Character special files are used to model serial I/O devices, e.g. keyboards, printers
- Block special files are used to model e.g. hard drives
File Control Block
System calls
File control blocks (FCBs) are kernel data structures, i.e., they are protected and only accessible in kernel mode! Allowing user applications to access them directly could compromise their integrity.
System calls enable a user application to ask the operating system to carry out an action on its behalf (in kernel mode)
There are two different categories of system calls:
- File manipulation: open(), close(), read(), write(), …
- Directory manipulation: create(), delete(), readdir(), rename(), link(), unlink(), list(), update()
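As an illustration, the snippet below issues some of these system calls through Python's os module, which is a thin wrapper around the underlying kernel interface (the file name is arbitrary):

```python
import os

fd = os.open("example.txt", os.O_CREAT | os.O_WRONLY, 0o644)  # open()
os.write(fd, b"hello, file system\n")                         # write()
os.close(fd)                                                  # close()

fd = os.open("example.txt", os.O_RDONLY)
data = os.read(fd, 100)                                       # read()
os.close(fd)

os.rename("example.txt", "renamed.txt")   # directory manipulation
os.unlink("renamed.txt")                  # unlink() removes the entry
```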
File system structures
overview
Different directory structures have been used over the years:
- Single level: all files in the same directory (reborn in consumer electronics)
- Two or multiple levels (hierarchical): tree structures (absolute path names run from the root of the file system to the file; relative path names use the current working directory as the starting point)
- Directed acyclic graph (DAG): allows files to be shared (i.e., links to files or sub-directories), but cycles are forbidden
- Generic graph structure, in which links and cycles can exist
When searching the file system:
- Cycles can result in infinite loops
- Sub-trees can be traversed multiple times
Deleting files becomes a lot more complicated (i.e., links may no longer point to a file, inaccessible cycles may exist)
A garbage collection scheme may be required to remove files that are no longer accessible from the file system tree (that are part of a cycle only)
Directories (possible implementations)
Directories contain a list of human-readable file names that are mapped onto unique identifiers and disk locations. Retrieving a file comes down to searching a directory file as fast as possible:
- A simple random order of directory entries might be insufficient (search time is linear as a function of the number of entries)
- Indexes or hash tables can be used
Directories are special files that group files together and of which the structure is defined by the file system
A bit is set to indicate that they are directories!
Implementation View
Low-level formatting writes sectors to the disk; high-level formatting imposes a file system on top of this (using blocks that can cover multiple sectors).
Partitions
Disks are usually divided into multiple partitions, and an independent file system may exist on each partition. The Master Boot Record (MBR) is located at the start of the entire drive:
- Used to boot the computer (the BIOS reads and executes the MBR)
- Contains the partition table at its end
- One partition is listed as active; it contains a boot block to load the operating system
The layout of a partition differs depending on the file system; a UNIX partition contains:
- The partition boot block: contains code to boot the operating system; every partition has a boot block, even if it does not contain an OS
- The super block: contains the partition’s details, e.g. partition size, number of blocks, i-node table size
- Free space management: e.g. a bitmap or linked list that indicates the free blocks
- I-nodes: an array of data structures, one per file, telling all about the file
- Root directory: the top of the file-system tree
- Data: files and directories
Free space management
Two methods are commonly used to keep track of free disk space: bitmaps and linked lists.
Bitmaps:
Bitmaps represent each block by a single bit in a map
- the size of the bitmap grows with the size of the disk but is constant for a given disk
- Bitmaps generally take less space than linked lists
A Linked List of disk blocks (also known as Grouping)
We use free blocks to hold the numbers of the free blocks (hence, those blocks are no longer free). E.g. with a 1 KB block size and 32-bit disk block numbers, each block can hold 255 free block numbers (one entry is used as the pointer to the next block).
Blocks are linked together, i.e., multiple blocks list the free blocks
The size of the list grows with the size of the disk and shrinks as the block size increases
Linked lists can be modified by keeping track of the number of consecutive free blocks for each entry (known as Counting)
Bitmaps vs Linked Lists
Keeping the bitmap in main memory is possible only for small disks, not for big ones:
e.g. with a block size of 2^12 bytes and a 1 GB disk (2^30 bytes), the bitmap needs 2^30/2^12 = 2^18 bits = 32 KB
With a linked list there is no waste of disk space, and we only need to keep one block of pointers in memory (loading a new block when needed)
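The bitmap calculation above, plus a toy first-fit bitmap allocator, as a minimal Python sketch (the allocator is illustrative only):

```python
disk_size = 2**30           # 1 GB disk
block_size = 2**12          # 4 KB blocks
n_blocks = disk_size // block_size
print(n_blocks, "bits =", n_blocks // 8 // 1024, "KB")  # 2**18 bits = 32 KB

bitmap = [0] * n_blocks     # 0 = free, 1 = allocated

def allocate():
    i = bitmap.index(0)     # first free block (linear scan; raises if full)
    bitmap[i] = 1
    return i

def free(block):
    bitmap[block] = 0
```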
File tables
Apart from the free space management tables, a number of key data structures are stored in memory:
- An in-memory mount table
- An in-memory directory cache of recently accessed directory information
- A system-wide open file table, containing a copy of the FCB for every currently open file in the system, including location on disk, file size, and ‘open count’(number of processes that use the file)
- A per-process open file table, containing a pointer to the system open file table
When a process opens a file, the system-wide open file table is first checked to see whether the file is already open. If it is, a per-process open file table entry is created pointing to the existing system-wide entry.
If the file is not already open, the directory structure is searched for the given file name. Once the file is found, the FCB is copied into the system-wide open file table in memory. This table not only stores the FCB but also tracks the number of processes that have the file open.
When reading a file, the process searches its per-process open file table and follows the pointer to the system-wide open file table.
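A toy Python model of the two-level table structure just described (names and fields are illustrative, not those of a real kernel):

```python
system_open_files = {}   # file name -> {"fcb": ..., "open_count": ...}

def do_open(per_process_table, name):
    entry = system_open_files.get(name)
    if entry is None:                      # not open anywhere yet:
        entry = {"fcb": "FCB of " + name, "open_count": 0}
        system_open_files[name] = entry    # copy the FCB into the table
    entry["open_count"] += 1               # one more process uses the file
    per_process_table.append(entry)        # per-process entry points at it
    return len(per_process_table) - 1      # index acts as a file descriptor

my_files = []
fd = do_open(my_files, "mbox")
print(my_files[fd]["open_count"])          # 1
```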
File System 3
The layout of the file system and the file allocation method used by the operating system heavily influence the seek movements: contiguous files will result in many short head movements. While disk scheduling could be implemented in the controller, carrying it out in the OS means that the OS can prioritise/order requests as appropriate.
File access (sequential vs random access)
Contiguous Allocation
Contiguous file systems are similar to dynamic partitioning in memory allocation:
- Each file is stored in a single group of adjacent blocks on the hard disk
- e.g. with 1 KB blocks, a 100 KB file needs 100 contiguous blocks
Advantages:
- Simple to implement: only the location of the first block and the length of the file must be stored (in the directory entry)
- Optimal read/write performance: blocks are co-located/clustered in nearby/adjacent sectors, hence the seek time is minimised
Disadvantages:
- The exact size of a file (cf. a process) is not always known beforehand: what if the file size exceeds the initially allocated disk space?
- Allocation algorithms are needed to decide which free blocks to allocate to a given file (first fit, best fit, etc.)
- Deleting a file results in external fragmentation: de-fragmentation must be carried out regularly (and is slower than for memory)
Contiguous allocation is still in use, e.g., for CD-ROMs/DVDs.
External fragmentation is less of an issue here since these are write-once media
Linked list
To avoid external fragmentation, files are stored in separate blocks (similar to paging) that are linked to one another. Only the address of the first block has to be stored to locate a file.
Each block contains a data pointer to the next block (which takes up space)
Advantages:
- Easy to maintain: only the first block(address) has to be maintained in the directory entry
- File sizes can grow dynamically(i.e. file size does not have to be known beforehand): new blocks/sectors can be added to the end of the file
- Similar to paging for memory, every possible block/sector of disk space can be used: i.e. there is no external fragmentation
- Sequential access is straightforward, although more seek operations/disk access may be required
Disadvantages:
- Random access is very slow: to retrieve a block in the middle of the file, one has to walk the list from the start
- There is internal fragmentation; on average, half of the last block is left unused
- It may result in random disk access, which is very slow
- Space is lost within the blocks due to the pointer, and the data in a block is no longer a power of 2
- Diminished reliability: if one block is corrupt/lost, access to the rest of the file is lost
File Allocation Tables (FAT)
Store the linked-list pointers in a separate index table, called a File Allocation Table (FAT), in memory!
Advantages:
- The block size remains a power of 2, i.e., no space is lost due to the pointer
- The index table can be kept in memory, allowing fast non-sequential/random access
Disadvantages:
- The size of the file allocation table grows with the number of blocks, and hence the size of the disk
- For a 200 GB disk with a 1 KB block size, 200 million entries are required; assuming each entry occupies 4 bytes, the table requires 800 MB of main memory
Thus, FAT does not scale well to large disks.
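A toy FAT in Python: each entry holds the number of the block that follows it in the file, with -1 marking end-of-file. Because the whole table sits in memory, following a chain requires no disk access (the block numbers below are made up):

```python
fat = {4: 7, 7: 2, 2: 10, 10: -1}   # a file occupying blocks 4 -> 7 -> 2 -> 10

def file_blocks(first_block):
    blocks, b = [], first_block
    while b != -1:                  # walk the chain until end-of-file
        blocks.append(b)
        b = fat[b]
    return blocks

print(file_blocks(4))               # [4, 7, 2, 10]
```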
I-Nodes
Each file has a small data structure (on disk) called an i-node (index-node) that contains its attributes and block pointers.
- In contrast to FAT, an i-node is only loaded when the file is open (stored in the system-wide open file table)
- If every i-node consists of n bytes, and at most k files can be open at any point in time, at most n × k bytes of main memory are required
Assume a block size of 1 KB and 32-bit disk addresses:
- With only direct block pointers, the maximum file size is 1 KB × the number of direct blocks (e.g. 10 pointers gives a total of 10 KB)
- A single indirect block can store up to 256 block pointers (32 bits = 4 bytes per pointer; a block is 1 KB = 2^10 bytes, thus 2^10/4 = 256 pointers). With 10 direct blocks + 1 indirect block, files can grow to 266 blocks => 266 KB
- A double indirect block points to a block of 256 block pointers, each of which points to a single indirect block. Therefore, with a double indirect block files can grow to 266 KB + 256 × 256 KB = 266 KB + 65536 KB = 65802 KB (~64 MB)
- For files larger than ~64 MB, a triple indirect block is needed.
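The arithmetic above in a few lines of Python (assuming 1 KB blocks, 4-byte block addresses and 10 direct pointers):

```python
block = 1024                 # block size in bytes
ptrs = block // 4            # 256 pointers per indirect block
direct = 10

single = direct + ptrs               # 266 blocks  -> 266 KB
double = single + ptrs**2            # + 65536 KB  -> 65802 KB (~64 MB)
triple = double + ptrs**3            # + 2**24 KB  -> ~16 GB

for blocks in (direct, single, double, triple):
    print(blocks, "blocks =", blocks * block // 1024, "KB")
```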
Directories implementation with i-nodes
In UNIX, all information about a file (type, size, date, owner and block pointers) is stored in its i-node. Therefore, directory tables are very simple data structures composed of a file name and a pointer to the i-node.
Note that directories are no more than a special kind of file, so they have their own i-node.
I-node lookups
Opening a file requires the disk blocks to be located. Absolute file names are located relative to the root directory
Relative file names are located based on the current working directory
E.g. try to locate /usr/gdm/mbox:
Locate the root directory of the file system:
- Its i-node sits at a fixed location on the disk (the directory file itself can sit anywhere)
Locate the directory entries specified in the path (see the sketch below):
- Locate the i-node number of the first component (directory) of the path
- Use the i-node number to index the i-node table and retrieve the directory file
- Look up the remaining path components by repeating the two steps above
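A minimal Python sketch of that lookup loop. Directory “files” are modelled as name-to-i-node-number maps, and read_inode/read_dir stand in for the real disk accesses; the i-node numbers are made up:

```python
ROOT_INODE = 2   # the root i-node sits at a fixed, well-known location

def lookup(path, read_inode, read_dir):
    inode_no = ROOT_INODE
    for component in path.strip("/").split("/"):
        inode = read_inode(inode_no)   # index the i-node table
        entries = read_dir(inode)      # read the directory file it describes
        inode_no = entries[component]  # i-node number of the next component
    return inode_no

# Toy on-disk state for /usr/gdm/mbox:
inodes = {2: "root-dir", 5: "usr-dir", 9: "gdm-dir"}
dirs = {"root-dir": {"usr": 5}, "usr-dir": {"gdm": 9}, "gdm-dir": {"mbox": 14}}
print(lookup("/usr/gdm/mbox", inodes.get, dirs.get))  # 14
```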
Sharing files between directories (Hard and Soft Links)
- Hard Links: maintain two (or multiple) references to the same i-node, e.g. in directories B and C (the i-node link reference counter will be set to 2)
- Symbolic Links: the owner maintains a reference to the i-node in e.g. directory C;
the referencer maintains a small file (that has its own i-node) that contains the location and name of the shared file
Hard Links
Hard links are the fastest way of linking files!
Disadvantages of hard links:
- Assume that the owner of the file deletes it: if the i-node is also deleted, any hard link will, in the best case, point to an invalid i-node
- If the i-node gets deleted and “recycled” to point to an other file, the hard links will point to the wrong file!
The only solution is to delete the file, and leave the i-node intact if the “reference count” is larger than 0 (the original owner of the file still gets “charged” for the space)
Soft Links
Disadvantages of soft links:
- They result in an extra file lookup (once the link file has been found, the original file needs to be found as well)
- They require an extra i-node for the link file
Advantages of soft links:
- There are no problems with deleting the original file: the file simply does not exist any more
- They can cross the boundaries of machines, i.e., the linked file can be located on a different machine
File System 4
Log structured file system
Consider the creation of a new file on a UNIX system:
1. Allocate, initialise and write the i-node for the file (i-nodes are usually located at the start of the disk)
2. Update and write the directory entry for the file (directories are tables/files that map names onto i-nodes in UNIX)
3. Write the data to the disk
The corresponding blocks are not necessarily in adjacent locations
Also in linked lists/FAT file systems blocks can be distributed all over the disk
Due to seek and rotational delays, hard disks are slow compared to other components in a computer(e.g. CPU, main memory)
A log structured file system aims to improve the speed of a file system on a traditional hard disk by minimising head movements and rotational delays, using the entire disk as one great big log
A log is a data structure that is written only at the end
- Log structured file systems buffer read and write operations (i-nodes, data, etc.) in memory, enabling us to write “larger volumes” in one go
- Once the buffer is full it is “flushed” to the disk and written as one contiguous segment at the end of “a log”
- I-nodes and data are all written to the same segment
- Finding i-nodes (traditionally located at the start of the partition) becomes more difficult
- An i-node map is maintained in memory to quickly find the address of i-nodes on the disk
How do we deal with deleted files?
Advantages and disadvantages:
A cleaner thread runs in the background, scanning the log circularly and compacting it:
- The hard drive is treated as a circular buffer
- Deleted files are removed; data still in use is rewritten at the end of the log, and the cleaned segments are marked as free
It greatly increases disk performance on writes, file creation and deletion.
Writes are also more robust, as they are done as a single operation (multiple small writes are more likely to expose the file system to serious inconsistency).
However, it has not been widely used because it is highly incompatible with existing file systems.
In addition, the cleaner thread takes additional CPU time
Journaling file systems
Deleting a file consists of the following actions:
1. Remove the file’s directory entry
2. Add the file’s i-node to the pool of free i-nodes
3. Add the file’s disk blocks to the free list
Where can it go wrong? E.g.:
The directory entry has been deleted and a crash occurs => the i-node and disk blocks become inaccessible
The directory entry and i-node have been released and a crash occurs => the disk blocks become inaccessible
Changing the order of the events does not necessarily resolve the issues
Journaling file systems aim at increasing the resilience of file systems against crashes by recording each update to the file system as a transaction
The key idea behind a journaling file system is to log all events(transactions) before they take place
- Write the actions that should be undertaken to a log file
- Carry them out
- Remove/commit the entries once completed
The log can be examined after the crash and used to restore the consistency of the file system
NTFS and EXT3-4 are examples of journaling file systems
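A minimal write-ahead-logging sketch of this idea in Python, using the file-deletion example above (all structures are in-memory stand-ins): the transaction is logged first, carried out, and only then committed, so after a crash any entry still in the journal identifies a possibly half-done transaction.

```python
journal = []   # the on-disk log, modelled as a list

def delete_file(name, directory, free_inodes, free_blocks):
    inode, blocks = directory[name]
    journal.append(("delete", name, inode, blocks))  # 0. log the transaction
    del directory[name]                              # 1. remove directory entry
    free_inodes.append(inode)                        # 2. free the i-node
    free_blocks.extend(blocks)                       # 3. free the disk blocks
    journal.pop()                                    # 4. commit: entry removed

directory = {"notes.txt": (7, [40, 41, 42])}         # name -> (i-node, blocks)
delete_file("notes.txt", directory, [], [])
print(journal)  # [] -- the transaction completed and was committed
```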
Virtual File Systems
Multiple file systems usually coexist on the same computer (e.g., NTFS and ISO9660 for a CD-ROM, NFS). These file systems can be seamlessly integrated by the operating system (e.g. Unix/Linux).
This is usually achieved by using virtual file systems (VFS)
VFS relies on standard object-oriented principles (or manual implementations thereof), e.g. polymorphism
In a similar way, Unix and Linux unify different file systems, presenting them as a single hierarchy that hides away / abstracts the implementation-specific details from the user
The VFS presents a unified interface to the “outside”
File system specific code is dealt with in an implementation layer that is clearly separated from the interface
File System 5
File System Recovery (consistency)
When you use USB drives, data might become inconsistent. (Journaling heavily reduces the probability of inconsistencies in a file system: in case of a crash, the log records which operations were not carried out. However, inconsistencies are still possible.) System utilities are available to restore file systems, e.g.:
- UNIX: FSCK
- Windows: Scandisk
Checking block consistency
Block consistency checks whether blocks are assigned/used the correct way. It is checked by building two tables:
- Table one counts how often a block is present in a file(based on the i-nodes)
- Table two counts how often a block is present in the free list
Typically, this is a very slow process, taking even hours (and running with the partition unmounted)
Four situations can arise:
- (a) The file system is consistent
- (b) There is a missing block (it occurs in neither table) => just add it to the free list
- (c) A block is counted twice in the free list (only possible when free space management uses a linked list) => rebuild the free list
- (d) There is a duplicate data block, i.e., a block is present in two or more files:
  - Removing one file adds the block to the free list while it is still in use by the other
  - Removing both files results in a double entry in the free list
  - Solution: copy the content to a new free block (the file is still likely to be damaged)
FSCK Algorithm:
- Iterate through all the i-nodes:
  - retrieve the blocks
  - increment the in-use counter of each block
- Iterate through the free list:
  - increment the free counter of each block
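A sketch of the two counting passes and the consistency check in Python (the on-disk structures are modelled as plain dictionaries and lists):

```python
def check_blocks(inode_blocks, free_list, n_blocks):
    in_use = [0] * n_blocks
    free = [0] * n_blocks
    for blocks in inode_blocks.values():   # pass 1: walk all i-nodes
        for b in blocks:
            in_use[b] += 1
    for b in free_list:                    # pass 2: walk the free list
        free[b] += 1
    for b in range(n_blocks):
        if in_use[b] + free[b] == 0:       # (b) missing block
            print(b, "missing: add to the free list")
        elif free[b] > 1:                  # (c) doubly listed as free
            print(b, "duplicated in free list: rebuild it")
        elif in_use[b] > 1:                # (d) block shared by several files
            print(b, "in multiple files: copy to a fresh block")

# Toy disk: block 2 is in two files, block 5 appears twice in the free list.
check_blocks({1: [0, 2], 2: [2, 3]}, [1, 4, 5, 5, 6, 7], 8)
```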
Checking directory consistency
Are the i-node counts correct?
Where can it go wrong?
- The i-node counter is higher than the number of directories containing the file:
  - Removing the file reduces the i-node counter by 1
  - Since the counter remains larger than zero, the i-node/disk space will never be released for future use (this merely wastes space and is not serious)
- The i-node counter is less than the number of directories containing the file:
  - Removing the file will eventually set the i-node counter to 0 while the file is still referenced
  - The file/i-node will be released even though the file is still in use
Restoring i-node consistency
- Recurse through the directory hierarchy
- Increment file-specific counters:
  - i.e., each file is associated with one counter
  - one file may appear in multiple directories
- Compare the file counters with the i-node counters
- Correct if necessary
History of Linux file system
Minix file system: the maximum file size was 64 MB and file names were limited to 14 characters
The “extended file system” (extfs): file names could be 255 characters and the maximum file size was 2 GB
The “ext2” file system: larger files, larger file names, better performance
The “ext3-4” file system: journaling etc.
The extended 2 file system (ext2) is one of the most popular file systems in Linux
The main goals:
- Improve the performance of the MINIX and extfs file systems, distributing directories evenly over the disk
- Allow longer file names and larger file sizes, improving the directory implementation
Standard UNIX file system vs. Extended 2 file system
- The superblock contains file system information (e.g. the number of i-nodes, disk blocks)
- The group descriptor contains bitmap locations, the number of free blocks, i-nodes and directories
- The data block bitmap and i-node bitmap, used to keep track of free disk blocks and i-nodes(UNIX uses lists)
- A table of i-nodes containing file and disk block information
- Data blocks containing file and directory blocks
- Reduce fragmentation by storing i-nodes and files, and parent directories and files, in the same block group where possible
- This reduces seek times and improves performance
- All block groups have the same size and are stored sequentially (which allows direct indexing)
Directory entries
Every directory entry contains the following fixed-length fields:
- i-node number
- Entry size in bytes
- Type field, i.e. file, directory, special file, etc.
- File name length in bytes
And then, the file name itself
Directories are searched linearly (i.e., they are unsorted) and a cache is maintained for recently accessed items
- File names up to 255 characters
- File lookups are similar to the Unix file system
12 block addresses are contained in the i-node;
single, double or triple indirect blocks are used for larger files
With a block size of 1 KB, the scheme can handle file sizes up to ~16 GB
With a block size of 8 KB, it can support file sizes up to ~64 TB
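The two figures can be checked with a couple of lines of Python (4-byte block addresses, 12 direct pointers):

```python
def max_file_size(block_size):
    ptrs = block_size // 4          # block addresses per indirect block
    blocks = 12 + ptrs + ptrs**2 + ptrs**3
    return blocks * block_size      # bytes

print(max_file_size(1024) / 2**30)  # ~16 GB with 1 KB blocks
print(max_file_size(8192) / 2**40)  # ~64 TB with 8 KB blocks
```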
EXT3 file system
When making changes to an Ext2 file system, files are ideally written to disk immediately to prevent inconsistency: this generates significant head movement.
The Ext2 file system is more suitable for flash disks (no journal)
Ext3 builds upon the Ext2 file system by adding:
- Tree based structures for directory files to facilitate indexing(HTrees)
- Journaling capabilities