SLIDE 10.2 - Storage and File Structure
Show slide
* Concepts and Terminology chapter
* Hard disks and RAID
* So far we've focused on the logical view of the database model
* Chap 10-13 focus on the physical and computational aspects of the model
  Characteristics of underlying storage media
  Trade-offs between different storage media
  Data structures and file organization
  What kinds of data fit best in them?

SLIDE 10.3 - Classification of Physical Storage Media
Show slide
* For each of the different storage media, these are the attributes we will be looking at to compare them

SLIDE 10.4 - Physical Storage Media
Show slide
READ 1
* DB systems don't manage any cache memory
* However, they will take cache effects into account for query-processing algorithms
READ 2
* RAM
  Fast - on the order of 1,000 million accesses per second
  Expensive - you can only afford so much of it
  ALL data and code is fetched into and executed from main memory
  Belongs to the operating system
  Won't use it to store your DB

SLIDE 10.5 - Physical Storage Media
Show slide
READ 1
* Non-volatile
  Doesn't need power to retain data
* Cheaper than main memory
* Speed
  Reads - same as main memory
  Writes - microseconds, a thousand times slower
* Device failure
  Wears out after a certain number of write/erase cycles
* Other uses
  Starting to be used as secondary main memory
    Faster than disk
    Cheaper than main memory
  Solid-state hard drives

SLIDE 10.6 - Physical Storage Media
Show slide

SLIDE 10.7 - Physical Storage Media
Show slide
* DVD - Digital Video Disk is now turning into Digital Versatile Disk since it can store any kind of data
* CD-ROM and DVD-ROM - preloaded, read-only
* CD-R and DVD-R - record once
  WORM disks - Write Once Read Many
* CD-RW, DVD+/-RW, and DVD-RAM - can be written to multiple times
* Non-volatile
* Cheap
* Slower than magnetic disk

SLIDE 10.8 - Physical Storage Media
Show slide
* Sequential access
* Large storage capacity
  40-300 GB per tape
  Petabytes stored in silos
* Non-volatile
* Very cheap
* Very slow

SLIDE 10.9 - Storage Hierarchy
Show slide
* Organized by speed and cost
  Trade-offs
* What happens when a given storage system is both faster and less expensive than the one above it?
  Paper tape, punch cards
* Other issue is volatility
  Main memory and up is strictly volatile
  Data must be written to non-volatile storage
    Even in the midst of processing

SLIDE 10.10 - Storage Hierarchy
Show slide

SLIDE 10.11 - Magnetic Hard Disk Mechanism
Show slide
KEEP THIS SLIDE UP - BUT TALK THROUGH THE NEXT ONE

SLIDE 10.12 - Magnetic Hard Disk
Show slide
ANATOMY OF A HARD DRIVE
* Platter
  Flat circular medium covered with magnetic material, made from rigid metal or glass
  Information is recorded on both surfaces
  Usually 1 to 5 platters per drive
* Surface of a platter is divided into circular tracks
  50K-100K tracks per platter on typical hard disks
* Each track is divided into sectors
  A sector is the smallest unit of data that can be read or written
  Sector size typically 512 bytes
  Typical sectors per track: 500 to 1000 (on inner tracks) to 1000 to 2000 (on outer tracks)
  Higher-capacity drives have more sectors per track and more tracks on each platter
* Spindle
  Platters are all mounted on the same spindle and all turn together at the same time
* Drive motor spins the platters at a constant velocity
  Usually 60, 90, or 120 revs/second
  There are some up to 250 revs/sec
* Read-write head
  Positioned very close to the platter surface (almost touching it)
  Reads or writes magnetically encoded information
  Writing is done by reversing the polarity of the magnetic material
  One read-write head for each side of each platter
  The head typically floats or flies only microns from the disk surface
    The spinning disk creates a small breeze
    The head assembly is shaped so that the breeze keeps the head floating just above the disk surface
* Disk arm
  Read-write heads are all mounted on a common arm
* Head-disk assemblies
  Heads and arm together are called the head-disk assembly
  Heads on all the platters move together
    When the head on one platter is on the 9th track, the heads on all the platters are on the 9th track
* Cylinder i consists of the ith track of all the platters
* To read/write a sector
  Disk arm swings to position the head on the right track
  Platter spins continually; data is read/written as the sector passes under the head

SLIDE 10.12 - Magnetic Disks
Show slide

SLIDE 10.13 - Magnetic Disks
Show slide
* Remapping of bad sectors
  If the controller detects a bad sector, it can logically map that sector to a different physical location allocated from a pool of extra sectors on the disk
  The remapped location is saved on the disk or in some non-volatile memory

SLIDE 10.14 - Disk Subsystem
Show slide
READ 1
POINT 2
* Disks are connected to a computer system through a high-speed interconnection
* Portable disk systems connect through USB or FireWire

SLIDE 10.15 - Disk Subsystem
Show slide
READ 1
READ 2
* Computer and the disk subsystem continue to use the same SATA, SCSI, etc. protocols to talk with each other even though they are separated by a network
READ 3
* Network File System - Unix/Linux
  Developed to allow machines to mount a disk partition on a remote machine as if it were a local disk
* CIFS - Common Internet File System
  Remote file access used by millions of computers at a time
  Users with different platforms and computers can share files without having to install new software
  CIFS runs over TCP/IP

SLIDE 10.16 - Performance Measures of Disks
Show slide
* Access time
  To read/write data on a given sector, the arm must first move so that it is positioned over the correct track, and then it must wait for that sector to appear under it as the disk rotates
* Seek time
  Average seek time - average of seek times, measured over a sequence of uniformly distributed random requests
* Rotational latency
  5400 rpm is 90 revs/sec
* Access time is the sum of the seek time and the rotational latency
* Once the first sector of the data to be accessed has come under the head, data transfer begins
* Data transfer rate
READ

SLIDE 10.17 - Performance Measures of Disks
Show slide
* Mean Time To Failure
  Measures reliability of the disk
  Amount of time that, on average, we can expect the system to run continuously without any failure
READ
  500k to 1200k hours is 57 to 136 years
  Most disks have an expected life span of 5 years and have significantly higher rates of failure once they become more than a few years old

SLIDE 10.18 - Optimization of Disk-Block Access
Show slide
* Requests for disk I/O are generated by the file system and by the virtual memory manager of the OS
* Each request specifies the address on the disk to be referenced - the BLOCK number
READ 1
* Blocks are sometimes called pages, although the term page may have a different meaning in other contexts
* A set of requests for blocks may follow a sequential pattern
  Successive blocks on the same or adjacent tracks
  Or may be random
* Sequential - you only have to seek once
* Random - each block needs its own seek
  Transfer rate is significantly lower
* Buffering
  By application, OS, or DB
* Read-ahead
  Read several blocks at a time, even if they have not been asked for
  Does not help with random disk-block requests
* Scheduling
  Collecting several requests and taking the shortest route through all of them
READ 2
* Elevator algorithm
  Arm is initially moving from the innermost track to the outermost
  For each track for which there is an access request, the arm stops at that track and services those requests
  Goes on to the next nearest track, moving outward
  Until there are no more requests in the outward direction
  Turns around and does the same thing in the opposite direction

SLIDE 10.19 - Optimization of Disk-Block Access
Show slide
* File systems in Unix and Windows allocate multiple blocks per file (called an extent) when the file is created
  Sequential-access files only need one seek per extent
  When the file grows, it needs new extents
READ 1

SLIDE 10.20 - Optimization of Disk-Block Access
Show slide
* DB updates cannot stay in main memory in case of power failure
  MUST be written to non-volatile storage immediately
READ 1
READ 2
* Data still has to be written to its actual location on disk, but that can be done later and more efficiently
READ 3

SLIDE 10.21 - Flash Storage
Show slide
POINT 1
* NOR
  Allows random access to individual words of memory
  Read time comparable to main memory
* NAND
  Cheaper than NOR - requires one less contact per pair of cells
  Higher storage capacity
  Requires all data to be read a page at a time
READ 2, 2a, 2b, 2c
* Multiple flash memories in parallel can give you up to 200 MB/sec
READ 2d
* Writing is more complicated
  Takes microseconds
  Once written, a page cannot be directly overwritten
  Must be erased first (1-2 ms)
* Remapping ...
  Blocks with multiple deleted pages are periodically erased, after copying non-deleted pages to a different block
  Keeps track of how many times each block has been erased
  Assigns data to blocks based on wear

SLIDE 10.22 - RAID
Show slide

SLIDE 10.23 - Improvement of Reliability via Redundancy
Show slide
READ
* Mirrored-disk systems are available today with a mean time to data loss of 55-110 years

SLIDE 10.24 - Improvement in Performance via Parallelism
Show slide
* Parallel access to multiple disks
READ
* Bit-level striping
  Each disk participates in every access (read/write)
  Number of accesses per second is the same as on a single disk, but each access reads 8 times as much data
READ
* Block-level striping
* GOALS OF PARALLELISM
  1 Load-balance multiple small accesses so that throughput increases
  2 Parallelize large accesses so that the response time of large accesses is reduced

SLIDE 10.25 - RAID Levels
Show slide
* What do you get by PARALLELISM in a disk system?
  1 Spread individual small reads out over multiple disks
    Load balancing of small accesses
  2 Multiply data transfer rates of large accesses by loading data in parallel from multiple disks at the same time
* What is disk striping?
  Spread a data file across multiple disks
  Better performance through parallelism
  We talked about 2 ways of striping
    Bit - i-th bit is on the i-th drive
    Block - array is treated as one LARGE set of LOGICAL blocks; i-th block is on the i-th drive
  n times more likely to fail
* What do you do to avoid data loss?
  1. Keep extra copies of data - redundancy
     More expensive
  2. Be able to FIX bad data
* What method of redundancy did we talk about last lecture?
  Mirroring
    Write the same data to 2 different drives
    Reads may proceed in parallel off the 2 drives
    Expensive
* RAID - Redundant Arrays of Independent Disks
  Lower cost
  Better reliability
  Better performance
READ 1
* In the figures,
  C indicates a redundant copy
  P indicates parity error-correcting bits
READ 2 - RAID 0
* What kind of applications would that be?
READ 3 - RAID 1
* Mirroring - NO striping
* Notice the copies of data disks in the figure
* Appears as ONE very large, very reliable disk
* RAID level 1+0 or RAID level 10
  Mirroring with striping
  Some vendors use the term differently

SLIDE 10.26 - RAID Levels 2 & 3
Show slide
READ 1
* What is parity?
  Most memory systems use it
  Parity bit - stores whether the number of bits in the byte that are set to 1 is even (parity bit = 0) or odd (parity bit = 1)
  If any one of the bits in the byte gets damaged - or the parity bit itself gets damaged - you KNOW there is an error
* ECC - Error-Correcting Codes
  To reconstruct the data you need at least 2 extra bits
* RAID 2 - see figure
  Disks labeled P store ECC bits
  If one of the disks fails, the remaining bits of the byte and the associated ECC bits can be used to reconstruct the data
  Improves over RAID 1 by removing 1 redundant disk
READ 2
* RAID 3
  Disk controllers can detect whether a sector has been read correctly, so a SINGLE parity bit can be used for error correction as well as detection
* Improves on RAID 2 by removing 1 more disk
  RAID 2 is not used any more
* Improves on RAID 1
  Uses only 1 parity disk for several regular disks instead of 1 mirrored disk for every regular disk
  N-way bit striping
  N times faster reading and writing
* Drawbacks
  Lower number of I/O operations per second
  EVERY disk has to participate in EVERY I/O request

SLIDE 10.27 - RAID Levels 3 & 4
Show slide
READ 1
READ 2

SLIDE 10.28 - RAID Level 4
Show slide
READ 1a
READ 1b
* Overall I/O throughput is higher
* Large reads have very high transfer rates
* Large writes also have very high transfer rates - data and parity may be written in parallel
READ 1c
* Small independent writes cannot be performed in parallel
  A single write takes 4 disk accesses
  The old value of the data block AND the old value of the parity block are needed to compute the new parity
    1 Read old value of data block
    2 Read old value of parity block
    3 Write the new data
    4 Write the new parity
READ 1d

SLIDE 10.29 - RAID Level 5
Show slide
READ 1
* All disks can participate in read requests
  In RAID 4, the parity disk can't participate
  Level 5 increases the total number of requests that can be met in a given amount of time

SLIDE 10.30 - RAID Levels 5 & 6
Show slide
READ 1
READ 2
* Uses error-correcting codes instead of parity
  Uses 2 bits instead of one
* Guards against multiple disk failures
  Can tolerate up to 2 simultaneous disk failures
* More expensive

SLIDE 10.31 - Choice of RAID Level
Show slide
READ 1 - Factors
* Rebuild performance
  Time to rebuild can be significant
  Varies with the RAID level used
  RAID 1 rebuild just involves a data copy from another disk
  Other levels need to access all the other disks in the array to rebuild the data of a failed disk
* What level of CONTINUOUS data availability do you require?
  How long can you afford to stay down?
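The 4-access small write follows from the XOR parity rule: the new parity is the old parity XORed with the old and new data values. A minimal sketch (hypothetical 4-byte block values; `parity` is an illustrative helper, not any real controller API):

```python
from functools import reduce

def parity(blocks):
    """Byte-wise XOR across the data blocks of one stripe."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

# Hypothetical 4-byte blocks on three data disks of one stripe
d0 = b"\x0f\x00\xff\x10"
d1 = b"\xf0\x01\x00\x22"
d2 = b"\x33\x02\x0f\x30"
p = parity([d0, d1, d2])

# Small write to d1: read old data and old parity, then write
# new data and new parity -- the 4 disk accesses described above.
new_d1 = b"\xaa\xbb\xcc\xdd"
new_p = bytes(old_p ^ old ^ new for old_p, old, new in zip(p, d1, new_d1))
assert new_p == parity([d0, new_d1, d2])   # no need to read d0 or d2

# Recovery: a failed disk's block is the XOR of the surviving blocks and parity
assert parity([d0, d2, new_p]) == new_d1
```

The same XOR identity is why RAID 5 can rebuild any single failed disk from the rest of the stripe.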
READ 2 - RAID 0
* High-performance apps
READ 3 - RAID 2 & 4
READ 4 - RAID 3
READ 5 - RAID 6

SLIDE 10.32 - Choice of RAID Level
Show slide
READ SLIDE

SLIDE 10.33 - Hardware Issues
Show slide
READ 1
* Hardware and software are logically equivalent
  Anything you can do with hardware (digitally) you can do in software, and vice versa
READ 2

SLIDE 10.34 - Hardware Issues
Show slide
READ 1 - Latent failures
* Sectors can go bad after a successful write
  Sector may become unreadable - bit rot
  May not be detected at the time it goes bad
  Data can only be recovered if the other disks in the array still have the ECC data
READ 2 - Data scrubbing
* When disks are idle, every sector of the disk is read
  If a bad sector is found, the data may be recoverable from the other disks
READ 3 - Hot swapping
* What happens to your computer center if you have to power down an entire disk array?

SLIDE 10.35 - Optical Disks
Show slide
* What is primary storage?
  RAM & cache
  Volatile memory
* What is secondary storage?
  Flash memory, magnetic disk
  Fast, non-volatile memory
* Tertiary storage
  CD, DVD, tape
  Slow, non-volatile memory
READ 1
* Cheap to mass-produce
* Data transfer rates of 3-6 MB/s
READ 2
* Replaced CDs with larger data capacities
* Data transfer rates of 8-20 MB/s
READ 3
* Can be stored offsite
* Used for data that is not allowed to be overwritten
  Financial data and audit trails

SLIDE 10.36 - Magnetic Tapes
Show slide
READ ALL
* 3480
  Reads/writes 18-36 tracks simultaneously
  210 MB - 2.4 GB max capacity
* 3590
  Reads/writes 128-384 tracks simultaneously
  30-60 GB max capacity
* Speed, reliability, durability, and low media cost

SLIDE 10.37 - File Organization, Record Organization and Storage Access
Show slide

SLIDE 10.38 - File Organization
Show slide
READ 1
* Files are mapped onto disk blocks
* Assumes an underlying file system courtesy of the OS
* How do you represent logical data models in terms of files?
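The block/record arithmetic behind that mapping can be sketched in a few lines. Assumed sizes are hypothetical (a 4 KB block and a 53-byte fixed-length record); the point is only that records are not split across blocks, so each block holds a whole number of records:

```python
# Hypothetical sizes: 4 KB blocks, 53-byte fixed-length records
BLOCK_SIZE = 4096
RECORD_SIZE = 53

# Records are stored whole within a block; leftover bytes are wasted
records_per_block = BLOCK_SIZE // RECORD_SIZE   # blocking factor
wasted_per_block = BLOCK_SIZE % RECORD_SIZE

def blocks_needed(num_records: int) -> int:
    """Blocks required to store num_records fixed-length records."""
    return -(-num_records // records_per_block)  # ceiling division

print(records_per_block)    # 77 records fit in each 4 KB block
print(blocks_needed(1000))  # 13 blocks for a 1000-record relation
```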
* Files are logically partitioned into fixed-length storage units called blocks
  Units of BOTH storage allocation and data transfer
  Usually 4-8 KB
  DB will allow you to set this value when the engine is installed
* Blocks may contain several records
  Which records is determined by the form of physical data organization being used
* Assume that NO record size is larger than the block size
* Each record is contained ENTIRELY in a single block
* Usually each relation in a DB has a different record size than other relations
READ 2

SLIDE 10.39 - Fixed-Length Records
Show slide
READ 1
* Show instructor record in NoteWhen
  type instructor = record {
      id varchar(5);
      name varchar(20);
      dept_name varchar(20);
      salary numeric(8,2);
  }
* Either do not allow variable-length fields OR just allocate the max space for each field
READ 2

SLIDE 10.40 - Deleting record 3 and compacting
Show slide
* Move all the records up

SLIDE 10.41 - Deleting record 3 and moving last record
Show slide
* Move the LAST record into where the deleted record used to be

SLIDE 10.42 - Deleting record 3 and moving last record
Show slide
* It is wasteful to move records to occupy the freed space
  Requires additional block accesses
* Let the space be reused when the next insert is executed
  Need to mark the unused spaces
  Linked list called the FREE LIST
  Need an area at the beginning of the file to start the list
    File header
* This works because the records all have the same length
  What happens when this is NOT true?
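The free-list scheme above can be sketched in a few lines of Python. This is a toy in-memory model, not any real engine's layout: slots stand in for fixed-length record positions, the `free_head` field plays the role of the file header, and each free slot stores the index of the next free slot:

```python
class FixedLengthFile:
    """Toy model of a fixed-length record file with a free list."""

    def __init__(self):
        self.slots = []        # one fixed-length record per slot
        self.free_head = None  # "file header": first free slot, if any

    def insert(self, record):
        if self.free_head is None:
            self.slots.append(record)      # no free slot: grow the file
            return len(self.slots) - 1
        slot = self.free_head
        self.free_head = self.slots[slot]  # pop the head of the free list
        self.slots[slot] = record
        return slot

    def delete(self, slot):
        self.slots[slot] = self.free_head  # push slot onto the free list
        self.free_head = slot

f = FixedLengthFile()
for name in ["Srinivasan", "Wu", "Mozart", "Einstein"]:
    f.insert(name)
f.delete(2)               # the freed slot joins the free list
print(f.insert("Gold"))   # reuses slot 2 instead of growing the file
```

No records are moved on deletion; the cost is only following one pointer on the next insert.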
SLIDE 10.43 - Variable-Length Records
Show slide
READ 1
* Two problems arise
  1 How to represent a single record in such a way that individual attributes can be extracted easily
  2 How to store variable-length records within a block such that records can be extracted easily
READ 2
READ 3
* Fixed-length attributes are stored first, allocating as many bytes as needed
* Variable-length fields are stored as a fixed-length pair of (offset, length) data
  Actual data is stored AFTER the fixed-length portion of the record
* Initial part of the record stores a fixed size of info about each attribute, whether it is fixed or variable
  type instructor = record {
      id varchar(5);
      name varchar(20);
      dept_name varchar(20);
      salary numeric(8,2);
  }
* Show instructor record in NoteWhen
  Show varchar (offset, length) pairs - 2 bytes each
  Salary takes 8 bytes
  Each varchar string takes as many bytes as there are characters in the string
READ 4
* Null bitmap
  Indicates which attributes have a null value
  If salary were null, bit four in the bitmap would be on, and the values in bytes 12-19 would be ignored
  More attributes would require more bytes in the bitmap

SLIDE 10.44 - Variable-Length Records: Slotted Page Structure
Show slide
READ 1
* Records are allocated contiguously in the block, starting from the END of the block
* When a record is inserted
  Space is allocated at the end of the free-space area
  An entry containing its size and location is added to the header
READ 2
* Deletions
  Space occupied is freed
  Entry is set to "deleted" (e.g., size = -1)
  Records in the block before it are moved so that all the free space is in the free-space area
  End-of-free-space pointer is updated
* Updates
  Same process, since record size might grow or shrink as a result of the update
* Since this is all done on the same block, it all takes place in primary memory and is written back to disk in one access
  Very fast
READ 3
* Addressing is indirect
  Pointers to a record point at its entry in the header; from there you get the offset to the actual data
* What about BLOBs and CLOBs?
  Data is larger than a block size
  Pointer to the data is stored in the record
  Actual data is stored in a different file(s)

SLIDE 10.45 - Organization of Records in Files
Show slide
* We've seen how records are represented in a file structure
  A relation is a set of records
  How do you organize them into a file?
READ 1 - Heap
READ 2 - Sequential
READ 3 - Hashing
READ 4 - Multi-table clustering
* Used for tables that are joined frequently

SLIDE 10.46 - Sequential File Organization
Show slide
READ 1
READ 2
* Search key does not need to be a PK or even a superkey
* Records are chained together by pointers
  Pointer in each record points to the next record in search-key order
* Attempt to store records physically in search-key order for performance
* Figure - ID is the search key; records are stored in that order
* Allows records to be read in sorted order

SLIDE 10.47 - Sequential File Organization
Show slide
READ 1 - Deletion
* Same as what we did before
READ 2 - Insertion
* Locate the record in the file that comes BEFORE the record to be inserted
* If there is a free record slot in the same block, insert into that block
  If not, insert into an overflow block
  Update pointers just the same
READ 3 - Reorganize
* Works well if there are few overflow blocks
* Search-key order vs physical order of records may break down
* Reorganize
  Rewrite the entire relation
  Costly
  Must be done during times when system load is low
* Frequency of reorganization depends on the frequency of insertions in non-search-key order

SLIDE 10.48 - Multitable Clustering File Organization
Show slide
* Many large-scale databases do not use the underlying file system from the OS
  Allocate one LARGE system file
  Stores ALL relations in ONE file
* Most databases store records of only one relation in a given block
  Simplifies data management
* Sometimes it is more efficient to store records of multiple relations in the same block
* Consider a NATURAL JOIN on the instructor and department relations
  Each row in instructor must be matched with the corresponding row in department
  Each match must read 2 different blocks
  Both blocks must be transferred into main memory
* If the rows for instructors are stored near the rows for the department they are associated with, you only need to process one block most of the time
* Figure
  First row is a department record
  Next rows are all the instructors in that department

SLIDE 10.49 - Multitable Clustering File Organization
Show slide
READ ALL
* Clustering can produce significant performance gains in query processing

SLIDE 10.50 - Data Dictionary Storage
Show slide
* A relational database system needs to maintain data about the relations
* Called the Data Dictionary

SLIDE 10.52 - Relational Representation of System Metadata
Show slide
* This metadata constitutes, in effect, a miniature database
* May be stored in specialized data structures
* Most likely stored in relations
  Leverages the power of the database engine
* To find out where a relation resides in order to query it, the engine goes to the relation_metadata tables
* Where does the engine go to find out where the metadata is?
  Wherever the designers of the engine told it to look

SLIDE 10.53 - Storage Access
Show slide
READ ALL

SLIDE 10.54 - Buffer Manager
Show slide
READ ALL
* This is just like the virtual memory manager of the OS
  Size of the DB might be larger than the addressable size of the hardware address space
  Memory addresses are not sufficient to address all disk blocks
* DB may use more sophisticated buffer-management schemes
* Buffer replacement strategies
* Pinned blocks
  Restrict the times when a block may be written back to disk, to protect against data loss
  Blocks are pinned until it is OK to write them to disk
* Forced output of blocks
* Flushing in case of crash

SLIDE 10.55 - Buffer-Replacement Policies
Show slide

SLIDE 10.56 - Buffer-Replacement Policies
Show slide
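The interaction between replacement and pinning can be sketched with a toy LRU buffer pool. This is a minimal illustration, not any real engine's buffer manager: the class name, 3-frame capacity, and block contents are all hypothetical, and a real manager would also write dirty victims back to disk before reuse:

```python
from collections import OrderedDict

class BufferManager:
    """Toy LRU buffer pool; pinned blocks are never chosen as victims."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.pool = OrderedDict()   # block_id -> contents, oldest first
        self.pinned = set()

    def read_block(self, block_id):
        if block_id in self.pool:              # buffer hit
            self.pool.move_to_end(block_id)    # mark most recently used
            return self.pool[block_id]
        if len(self.pool) >= self.capacity:    # evict the LRU unpinned block
            victim = next(b for b in self.pool if b not in self.pinned)
            self.pool.pop(victim)              # (write back here if dirty)
        contents = f"<contents of block {block_id}>"  # stand-in for a disk read
        self.pool[block_id] = contents
        return contents

    def pin(self, block_id):
        self.pinned.add(block_id)

    def unpin(self, block_id):
        self.pinned.discard(block_id)

bm = BufferManager(capacity=3)
for b in [1, 2, 3]:
    bm.read_block(b)
bm.pin(1)                 # block 1 may not be written back or evicted
bm.read_block(4)          # evicts block 2, the least recently used unpinned block
print(list(bm.pool))      # [1, 3, 4]
```

Pinning is what lets the recovery subsystem hold a block in memory until its log records reach disk, while LRU decides among the remaining candidates.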