SLIDE 10.2 - Storage and File Structure
Show slide
* Concepts and Terminology chapter
* Hard disks and RAID
* So far we've focused on the logical view of the database model
* Chap 10-13 focus on the physical and computational aspects of the model
  Characteristics of underlying storage media
  Trade-offs between different storage media
  Data structures and file organization
  What kinds of data fit best in them?

SLIDE 10.3 - Classification of Physical Storage Media
Show slide
* For each of the different storage media, these are the attributes we will be looking at to compare them

SLIDE 10.4 - Physical Storage Media
Show slide
READ 1
* DB systems don't manage any cache memory
* However, they will take cache effects into account for query-processing algorithms
READ 2
* RAM
  Fast - on the order of 1,000 million accesses per second
  Expensive - you can only afford so much of it
  ALL data and code is fetched into and executed from main memory
  Belongs to the operating system
  Won't use it to store your DB

SLIDE 10.5 - Physical Storage Media
Show slide
READ 1
* Non-volatile
  Doesn't need power to retain data
* Cheaper than main memory
* Speed
  Reads - same as main memory
  Writes - microseconds, a thousand times slower
* Device failure
  Wears out after a certain number of write/erase cycles
* Other uses
  Starting to be used as secondary main memory
    Faster than disk
    Cheaper than main memory
  Solid-state hard drives

SLIDE 10.6 - Physical Storage Media
Show slide

SLIDE 10.7 - Physical Storage Media
Show slide
* DVD - Digital Video Disk is now turning into Digital Versatile Disk since it can store any kind of data
* CD-ROM and DVD-ROM - preloaded, read-only
* CD-R and DVD-R - record once
  WORM disks - Write Once Read Many
* CD-RW, DVD+/-RW, and DVD-RAM - can be written to multiple times
* Non-volatile
* Cheap
* Slower than magnetic disk

SLIDE 10.8 - Physical Storage Media
Show slide
* Sequential access
* Large storage capacity
  40-300 GB per tape
  Petabytes stored in silos
* Non-volatile
* Very cheap
* Very slow

SLIDE 10.9 - Storage Hierarchy
Show slide
* Organized by speed and cost
  Trade-offs
* What happens when a given storage system is both faster and less expensive than the one above it?
  Paper tape, punch cards
* Other issue is volatility
  Main memory and up is strictly volatile
  Data must be written to non-volatile storage
    Even in the midst of processing

SLIDE 10.10 - Storage Hierarchy
Show slide

SLIDE 10.11 - Magnetic Hard Disk Mechanism
Show slide
KEEP THIS SLIDE UP - BUT TALK THROUGH THE NEXT ONE

SLIDE 10.12 - Magnetic Hard Disk
Show slide
ANATOMY OF A HARD DRIVE
* Platter
  Flat circular medium covered with magnetic material, made from rigid metal or glass
  Information is recorded on both surfaces
  Usually 1 to 5 platters per drive
* Surface of a platter is divided into circular tracks
  50K-100K tracks per platter on typical hard disks
* Each track is divided into sectors
  A sector is the smallest unit of data that can be read or written
  Sector size typically 512 bytes
  Typical sectors per track: 500 to 1000 (on inner tracks) to 1000 to 2000 (on outer tracks)
  Higher-capacity drives have more sectors per track and more tracks on each platter
* Spindle
  Platters are all mounted on the same spindle and all turn together at the same time
* Drive motor spins the platters at a constant velocity
  Usually 60, 90, or 120 revs/second
  There are some up to 250 revs/sec
* Read-write head
  Positioned very close to the platter surface (almost touching it)
  Reads or writes magnetically encoded information
  Writing is done by reversing the polarity of the magnetic material
  One read-write head for each side of each platter
  The head typically floats or flies only microns from the disk surface
    The spinning disk creates a small breeze
    The head assembly is shaped so that the breeze keeps the head floating just above the disk surface
* Disk arm
  Read-write heads are all mounted on a common arm
* Head-disk assemblies
  Heads and arm together are called the head-disk assembly
  Heads on all the platters move together
    When the head on one platter is on the 9th track, the heads on all the platters are on the 9th track
* Cylinder i consists of the ith track of all the platters
* To read/write a sector
  Disk arm swings to position the head on the right track
  Platter spins continually; data is read/written as the sector passes under the head

SLIDE 10.12 - Magnetic Disks
Show slide

SLIDE 10.13 - Magnetic Disks
Show slide
* Remapping of bad sectors
  If the controller detects a bad sector, it can logically map that sector to a different physical location allocated from a pool of extra sectors on the disk
  The remapped location is saved on the disk or in some non-volatile memory

SLIDE 10.14 - Disk Subsystem
Show slide
READ 1
POINT 2
* Disks are connected to a computer system through a high-speed interconnection
* Portable disk systems connect through USB or FireWire

SLIDE 10.15 - Disk Subsystem
Show slide
READ 1
READ 2
* Computer and the disk subsystem continue to use the same SATA, SCSI, etc. protocols to talk with each other even though they are separated by a network
READ 3
* Network File System - Unix/Linux
  Developed to allow machines to mount a disk partition on a remote machine as if it were a local disk
* CIFS - Common Internet File System
  Remote file access used by millions of computers at a time
  Users with different platforms and computers can share files without having to install new software
  CIFS runs over TCP/IP

SLIDE 10.16 - Performance Measures of Disks
Show slide
* Access time
  To read/write data on a given sector, the arm must first move so that it is positioned over the correct track, and then it must wait for that sector to appear under it as the disk rotates
* Seek time
  Average seek time - average of seek times, measured over a sequence of uniformly distributed random requests
* Rotational latency
  5400 rpm is 90 revs/sec
* Access time is the sum of the seek time and the rotational latency
* Once the first sector of the data to be accessed has come under the head, data transfer begins
* Data transfer rate
READ

SLIDE 10.17 - Performance Measures of Disks
Show slide
* Mean Time To Failure
  Measures reliability of the disk
  Amount of time that, on average, we can expect the system to run continuously without any failure
READ
  500k to 1200k hours is 57 to 136 years
  Most disks have an expected life span of 5 years and have significantly higher rates of failure once they become more than a few years old

SLIDE 10.18 - Optimization of Disk-Block Access
Show slide
* Requests for disk I/O are generated by the file system and by the virtual memory manager of the OS
* Each request specifies the address on the disk to be referenced - the BLOCK number
READ 1
* Blocks are sometimes called pages, although the term page may have a different meaning in other contexts
* A set of requests for blocks may follow a sequential pattern
  Successive blocks on the same or adjacent tracks
  Or may be random
* Sequential - you only have to seek once
* Random - each block needs its own seek
  Transfer rate is significantly lower
* Buffering
  By application, OS, or DB
* Read-ahead
  Read several blocks at a time, even if they have not been asked for
  Does not help with random disk-block requests
* Scheduling
  Collecting several requests and taking the shortest route through all of them
READ 2
* Elevator algorithm
  Arm is initially moving from the innermost track to the outermost
  For each track for which there is an access request, the arm stops at that track and services those requests
  Goes on to the next nearest track, moving outward
  Until there are no more requests in the outward direction
  Turns around and does the same thing in the opposite direction

SLIDE 10.19 - Optimization of Disk-Block Access
Show slide
* File systems in Unix and Windows allocate multiple blocks per file (called an extent) when the file is created
  Sequential-access files only need one seek per extent
  When the file grows, it needs new extents
READ 1

SLIDE 10.20 - Optimization of Disk-Block Access
Show slide
* DB updates cannot stay in main memory in case of power failure
  MUST be written to non-volatile storage immediately
READ 1
READ 2
* Data still has to be written to its actual location on disk, but that can be done later and more efficiently
READ 3

SLIDE 10.21 - Flash Storage
Show slide
POINT 1
* NOR
  Allows random access to individual words of memory
  Read time comparable to main memory
* NAND
  Cheaper than NOR - requires one less contact per pair of cells
  Higher storage capacity
  Requires all data to be read a page at a time
READ 2, 2a, 2b, 2c
* Multiple flash memories in parallel can give you up to 200 MB/sec
READ 2d
* Writing is more complicated
  Takes microseconds
  Once written, a page cannot be directly overwritten
  Must be erased first (1-2 ms)
* Remapping ...
  Blocks with multiple deleted pages are periodically erased, after copying non-deleted pages to a different block
  Keeps track of how many times each block has been erased
  Assigns data to blocks based on wear

SLIDE 10.22 - RAID
Show slide

SLIDE 10.23 - Improvement of Reliability via Redundancy
Show slide
READ
* Mirrored-disk systems are available today with a mean time to data loss of 55-110 years

SLIDE 10.24 - Improvement in Performance via Parallelism
Show slide
* Parallel access to multiple disks
READ
* Bit-level striping
  Each disk participates in every access (read/write)
  Number of accesses per second is the same as on a single disk, but each access reads 8 times as much data
READ
* Block-level striping
* GOALS OF PARALLELISM
  1 Load-balance multiple small accesses so that throughput increases
  2 Parallelize large accesses so that the response time of large accesses is reduced

SLIDE 10.25 - RAID Levels
Show slide
* What do you get by PARALLELISM in a disk system?
  1 Spread individual small reads out over multiple disks
    Load balancing of small accesses
  2 Multiply data transfer rates of large accesses by loading data in parallel from multiple disks at the same time
* What is disk striping?
  Spread a data file across multiple disks
  Better performance through parallelism
  We talked about 2 ways of striping
    Bit - i-th bit is on the i-th drive
    Block - array is treated as one LARGE set of LOGICAL blocks; i-th block is on the i-th drive
  n times more likely to fail
* What do you do to avoid data loss?
  1. Keep extra copies of data - redundancy
     More expensive
  2. Be able to FIX bad data
* What method of redundancy did we talk about last lecture?
  Mirroring
    Write the same data to 2 different drives
    Reads may proceed in parallel off the 2 drives
    Expensive
* RAID - Redundant Arrays of Independent Disks
  Lower cost
  Better reliability
  Better performance
READ 1
* In the figures,
  C indicates a redundant copy
  P indicates parity error-correcting bits
READ 2 - RAID 0
* What kind of applications would that be?
READ 3 - RAID 1
* Mirroring - NO striping
* Notice the copies of data disks in the figure
* Appears as ONE very large, very reliable disk
* RAID level 1+0 or RAID level 10
  Mirroring with striping
  Some vendors use the term differently

SLIDE 10.26 - RAID Levels 2 & 3
Show slide
READ 1
* What is parity?
  Most memory systems use it
  Parity bit - stores whether the number of bits in the byte that are set to 1 is even (parity bit = 0) or odd (parity bit = 1)
  If any one of the bits in the byte gets damaged - or the parity bit itself gets damaged - you KNOW there is an error
* ECC - Error-Correcting Codes
  To reconstruct the data you need at least 2 extra bits
* RAID 2 - see figure
  Disks labeled P store ECC bits
  If one of the disks fails, the remaining bits of the byte and the associated ECC bits can be used to reconstruct the data
  Improves over RAID 1 by removing 1 redundant disk
READ 2
* RAID 3
  Disk controllers can detect whether a sector has been read correctly, so a SINGLE parity bit can be used for error correction as well as detection
* Improves on RAID 2 by removing 1 more disk
  RAID 2 is not used any more
* Improves on RAID 1
  Uses only 1 parity disk for several regular disks instead of 1 mirrored disk for every regular disk
  N-way bit striping
  N times faster reading and writing
* Drawbacks
  Lower number of I/O operations per second
  EVERY disk has to participate in EVERY I/O request

SLIDE 10.27 - RAID Levels 3 & 4
Show slide
READ 1
READ 2

SLIDE 10.28 - RAID Level 4
Show slide
READ 1a
READ 1b
* Overall I/O throughput is higher
* Large reads have very high transfer rates
* Large writes also have very high transfer rates - data and parity may be written in parallel
READ 1c
* Small independent writes cannot be performed in parallel
  A single write takes 4 disk accesses
  The old value of the data block AND the old value of the parity block are needed to compute the new parity
    1 Read old value of data block
    2 Read old value of parity block
    3 Write the new data
    4 Write the new parity
READ 1d

SLIDE 10.29 - RAID Level 5
Show slide
READ 1
* All disks can participate in read requests
  In RAID 4, the parity disk can't participate
  Level 5 increases the total number of requests that can be met in a given amount of time

SLIDE 10.30 - RAID Levels 5 & 6
Show slide
READ 1
READ 2
* Uses error-correcting codes instead of parity
  Uses 2 bits instead of one
* Guards against multiple disk failures
  Can tolerate up to 2 simultaneous disk failures
* More expensive

SLIDE 10.31 - Choice of RAID Level
Show slide
READ 1 - Factors
* Rebuild performance
  Time to rebuild can be significant
  Varies with the RAID level used
  RAID 1 rebuild just involves a data copy from another disk
  Other levels need to access all the other disks in the array to rebuild the data of a failed disk
* What level of CONTINUOUS data availability do you require?
  How long can you afford to stay down?
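The 4-access small write follows from the XOR parity rule: the new parity is the old parity XORed with the old and new data values. A minimal sketch (hypothetical 4-byte block values; `parity` is an illustrative helper, not any real controller API):

```python
from functools import reduce

def parity(blocks):
    """Byte-wise XOR across the data blocks of one stripe."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

# Hypothetical 4-byte blocks on three data disks of one stripe
d0 = b"\x0f\x00\xff\x10"
d1 = b"\xf0\x01\x00\x22"
d2 = b"\x33\x02\x0f\x30"
p = parity([d0, d1, d2])

# Small write to d1: read old data and old parity, then write
# new data and new parity -- the 4 disk accesses described above.
new_d1 = b"\xaa\xbb\xcc\xdd"
new_p = bytes(old_p ^ old ^ new for old_p, old, new in zip(p, d1, new_d1))
assert new_p == parity([d0, new_d1, d2])   # no need to read d0 or d2

# Recovery: a failed disk's block is the XOR of the surviving blocks and parity
assert parity([d0, d2, new_p]) == new_d1
```

The same XOR identity is why RAID 5 can rebuild any single failed disk from the rest of the stripe.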
READ 2 - RAID 0
* High-performance apps
READ 3 - RAID 2 & 4
READ 4 - RAID 3
READ 5 - RAID 6

SLIDE 10.32 - Choice of RAID Level
Show slide
READ SLIDE

SLIDE 10.33 - Hardware Issues
Show slide
READ 1
* Hardware and software are logically equivalent
  Anything you can do with hardware (digitally) you can do in software, and vice versa
READ 2

SLIDE 10.34 - Hardware Issues
Show slide
READ 1 - Latent failures
* Sectors can go bad after a successful write
  Sector may become unreadable - bit rot
  May not be detected at the time it goes bad
  Data can only be recovered if the other disks in the array still have the ECC data
READ 2 - Data scrubbing
* When disks are idle, every sector of the disk is read
  If a bad sector is found, the data may be recoverable from the other disks
READ 3 - Hot swapping
* What happens to your computer center if you have to power down an entire disk array?

SLIDE 10.35 - Optical Disks
Show slide
* What is primary storage?
  RAM & cache
  Volatile memory
* What is secondary storage?
  Flash memory, magnetic disk
  Fast, non-volatile memory
* Tertiary storage
  CD, DVD, tape
  Slow, non-volatile memory
READ 1
* Cheap to mass-produce
* Data transfer rates of 3-6 MB/s
READ 2
* Replaced CDs with larger data capacities
* Data transfer rates of 8-20 MB/s
READ 3
* Can be stored offsite
* Used for data that is not allowed to be overwritten
  Financial data and audit trails

SLIDE 10.36 - Magnetic Tapes
Show slide
READ ALL
* 3480
  Reads/writes 18-36 tracks simultaneously
  210 MB - 2.4 GB max capacity
* 3590
  Reads/writes 128-384 tracks simultaneously
  30-60 GB max capacity
* Speed, reliability, durability, and low media cost

SLIDE 10.37 - File Organization, Record Organization and Storage Access
Show slide

SLIDE 10.38 - File Organization
Show slide
READ 1
* Files are mapped onto disk blocks
* Assumes an underlying file system courtesy of the OS
* How do you represent logical data models in terms of files?
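The block/record arithmetic behind that mapping can be sketched in a few lines. Assumed sizes are hypothetical (a 4 KB block and a 53-byte fixed-length record); the point is only that records are not split across blocks, so each block holds a whole number of records:

```python
# Hypothetical sizes: 4 KB blocks, 53-byte fixed-length records
BLOCK_SIZE = 4096
RECORD_SIZE = 53

# Records are stored whole within a block; leftover bytes are wasted
records_per_block = BLOCK_SIZE // RECORD_SIZE   # blocking factor
wasted_per_block = BLOCK_SIZE % RECORD_SIZE

def blocks_needed(num_records: int) -> int:
    """Blocks required to store num_records fixed-length records."""
    return -(-num_records // records_per_block)  # ceiling division

print(records_per_block)    # 77 records fit in each 4 KB block
print(blocks_needed(1000))  # 13 blocks for a 1000-record relation
```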
* Files are logically partitioned into fixed-length storage units called blocks
  Units of BOTH storage allocation and data transfer
  Usually 4-8 KB
  DB will allow you to set this value when the engine is installed
* Blocks may contain several records
  Which records is determined by the form of physical data organization being used
* Assume that NO record size is larger than the block size
* Each record is contained ENTIRELY in a single block
* Usually each relation in a DB has a different record size than other relations
READ 2

SLIDE 10.39 - Fixed-Length Records
Show slide
READ 1
* Show instructor record in NoteWhen
  type instructor = record {
      id varchar(5);
      name varchar(20);
      dept_name varchar(20);
      salary numeric(8,2);
  }
* Either do not allow variable-length fields OR just allocate the max space for each field
READ 2

SLIDE 10.40 - Deleting record 3 and compacting
Show slide
* Move all the records up

SLIDE 10.41 - Deleting record 3 and moving last record
Show slide
* Move the LAST record into where the deleted record used to be

SLIDE 10.42 - Deleting record 3 and moving last record
Show slide
* It is wasteful to move records to occupy the freed space
  Requires additional block accesses
* Let the space be reused when the next insert is executed
  Need to mark the unused spaces
  Linked list called the FREE LIST
  Need an area at the beginning of the file to start the list
    File header
* This works because the records all have the same length
  What happens when this is NOT true?
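The free-list scheme above can be sketched in a few lines of Python. This is a toy in-memory model, not any real engine's layout: slots stand in for fixed-length record positions, the `free_head` field plays the role of the file header, and each free slot stores the index of the next free slot:

```python
class FixedLengthFile:
    """Toy model of a fixed-length record file with a free list."""

    def __init__(self):
        self.slots = []        # one fixed-length record per slot
        self.free_head = None  # "file header": first free slot, if any

    def insert(self, record):
        if self.free_head is None:
            self.slots.append(record)      # no free slot: grow the file
            return len(self.slots) - 1
        slot = self.free_head
        self.free_head = self.slots[slot]  # pop the head of the free list
        self.slots[slot] = record
        return slot

    def delete(self, slot):
        self.slots[slot] = self.free_head  # push slot onto the free list
        self.free_head = slot

f = FixedLengthFile()
for name in ["Srinivasan", "Wu", "Mozart", "Einstein"]:
    f.insert(name)
f.delete(2)               # the freed slot joins the free list
print(f.insert("Gold"))   # reuses slot 2 instead of growing the file
```

No records are moved on deletion; the cost is only following one pointer on the next insert.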
SLIDE 10.43 - Variable-Length Records
Show slide
READ 1
* Two problems arise
  1 How to represent a single record in such a way that individual attributes can be extracted easily
  2 How to store variable-length records within a block such that records can be extracted easily
READ 2
READ 3
* Fixed-length attributes are stored first, allocating as many bytes as needed
* Variable-length fields are stored as a fixed-length pair of (offset, length) data
  Actual data is stored AFTER the fixed-length portion of the record
* Initial part of the record stores a fixed size of info about each attribute, whether it is fixed or variable
  type instructor = record {
      id varchar(5);
      name varchar(20);
      dept_name varchar(20);
      salary numeric(8,2);
  }
* Show instructor record in NoteWhen
  Show varchar (offset, length) pairs - 2 bytes each
  Salary takes 8 bytes
  Each varchar string takes as many bytes as there are characters in the string
READ 4
* Null bitmap
  Indicates which attributes have a null value
  If salary were null, bit four in the bitmap would be on, and the values in bytes 12-19 would be ignored
  More attributes would require more bytes in the bitmap

SLIDE 10.44 - Variable-Length Records: Slotted Page Structure
Show slide
READ 1
* Records are allocated contiguously in the block, starting from the END of the block
* When a record is inserted
  Space is allocated at the end of the free-space area
  An entry containing its size and location is added to the header
READ 2
* Deletions
  Space occupied is freed
  Entry is set to "deleted" (e.g., size = -1)
  Records in the block before it are moved so that all the free space is in the free-space area
  End-of-free-space pointer is updated
* Updates
  Same process, since record size might grow or shrink as a result of the update
* Since this is all done on the same block, it all takes place in primary memory and is written back to disk in one access
  Very fast
READ 3
* Addressing is indirect
  Pointers to a record point at its entry in the header; from there you get the offset to the actual data
* What about BLOBs and CLOBs?
  Data is larger than a block size
  Pointer to the data is stored in the record
  Actual data is stored in a different file(s)

SLIDE 10.45 - Organization of Records in Files
Show slide
* We've seen how records are represented in a file structure
  A relation is a set of records
  How do you organize them into a file?
READ 1 - Heap
READ 2 - Sequential
READ 3 - Hashing
READ 4 - Multi-table clustering
* Used for tables that are joined frequently

SLIDE 10.46 - Sequential File Organization
Show slide
READ 1
READ 2
* Search key does not need to be a PK or even a superkey
* Records are chained together by pointers
  Pointer in each record points to the next record in search-key order
* Attempt to store records physically in search-key order for performance
* Figure - ID is the search key; records are stored in that order
* Allows records to be read in sorted order

SLIDE 10.47 - Sequential File Organization
Show slide
READ 1 - Deletion
* Same as what we did before
READ 2 - Insertion
* Locate the record in the file that comes BEFORE the record to be inserted
* If there is a free record slot in the same block, insert into that block
  If not, insert into an overflow block
  Update pointers just the same
READ 3 - Reorganize
* Works well if there are few overflow blocks
* Search-key order vs physical order of records may break down
* Reorganize
  Rewrite the entire relation
  Costly
  Must be done during times when system load is low
* Frequency of reorganization depends on the frequency of insertions in non-search-key order

SLIDE 10.48 - Multitable Clustering File Organization
Show slide
* Many large-scale databases do not use the underlying file system from the OS
  Allocate one LARGE system file
  Stores ALL relations in ONE file
* Most databases store records of only one relation in a given block
  Simplifies data management
* Sometimes it is more efficient to store records of multiple relations in the same block
* Consider a NATURAL JOIN on the instructor and department relations
  Each row in instructor must be matched with the corresponding row in department
  Each match must read 2 different blocks
  Both blocks must be transferred into main memory
* If the rows for instructors are stored near the rows for the department they are associated with, you only need to process one block most of the time
* Figure
  First row is a department record
  Next rows are all the instructors in that department

SLIDE 10.49 - Multitable Clustering File Organization
Show slide
READ ALL
* Clustering can produce significant performance gains in query processing

SLIDE 10.50 - Data Dictionary Storage
Show slide
* A relational database system needs to maintain data about the relations
* Called the Data Dictionary

SLIDE 10.52 - Relational Representation of System Metadata
Show slide
* This metadata constitutes, in effect, a miniature database
* May be stored in specialized data structures
* Most likely stored in relations
  Leverages the power of the database engine
* To find out where a relation resides in order to query it, the engine goes to the relation_metadata tables
* Where does the engine go to find out where the metadata is?
  Wherever the designers of the engine told it to look

SLIDE 10.53 - Storage Access
Show slide
READ ALL

SLIDE 10.54 - Buffer Manager
Show slide
READ ALL
* This is just like the virtual memory manager of the OS
  Size of the DB might be larger than the addressable size of the hardware address space
  Memory addresses are not sufficient to address all disk blocks
* DB may use more sophisticated buffer-management schemes
* Buffer replacement strategies
* Pinned blocks
  Restrict the times when a block may be written back to disk, to protect against data loss
  Blocks are pinned until it is OK to write them to disk
* Forced output of blocks
* Flushing in case of crash

SLIDE 10.55 - Buffer-Replacement Policies
Show slide

SLIDE 10.56 - Buffer-Replacement Policies
Show slide
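The interaction between replacement and pinning can be sketched with a toy LRU buffer pool. This is a minimal illustration, not any real engine's buffer manager: the class name, 3-frame capacity, and block contents are all hypothetical, and a real manager would also write dirty victims back to disk before reuse:

```python
from collections import OrderedDict

class BufferManager:
    """Toy LRU buffer pool; pinned blocks are never chosen as victims."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.pool = OrderedDict()   # block_id -> contents, oldest first
        self.pinned = set()

    def read_block(self, block_id):
        if block_id in self.pool:              # buffer hit
            self.pool.move_to_end(block_id)    # mark most recently used
            return self.pool[block_id]
        if len(self.pool) >= self.capacity:    # evict the LRU unpinned block
            victim = next(b for b in self.pool if b not in self.pinned)
            self.pool.pop(victim)              # (write back here if dirty)
        contents = f"<contents of block {block_id}>"  # stand-in for a disk read
        self.pool[block_id] = contents
        return contents

    def pin(self, block_id):
        self.pinned.add(block_id)

    def unpin(self, block_id):
        self.pinned.discard(block_id)

bm = BufferManager(capacity=3)
for b in [1, 2, 3]:
    bm.read_block(b)
bm.pin(1)                 # block 1 may not be written back or evicted
bm.read_block(4)          # evicts block 2, the least recently used unpinned block
print(list(bm.pool))      # [1, 3, 4]
```

Pinning is what lets the recovery subsystem hold a block in memory until its log records reach disk, while LRU decides among the remaining candidates.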