IS721 Semi-Structured Data Management Weekly Tasks

Intro

Each weekly section below gives you all the materials you need for that week. The material is divided into 3 parts: readings, homework projects (hw1-15 due by the Sunday noted, but hw1-2 are ungraded), and question sets (qs1-7 due every few weeks on Sunday as noted in the schedule). Do the readings and references before class time. Ask any questions on the Blackboard forum.

Index: week1 / week2 / week3 / week4 / week5 / week6 / week7 / week8 / TBA

Week1 - Nosql Introduction

Readings

Slides
Nosql Distilled Chapters 1-3 (by Pramod J. Sadalage and Martin Fowler)
Martin Fowler book site - Video: Introduction to nosql

Non-graded hw1

This project sets up the required environment for the course on your own computer. This first project is not graded or turned in. Use these steps to set up your computer:

Since we all end up with a unix-type operating system (OS), everyone needs to know a few unix commands and relevant protocols. If you are unfamiliar with them, then review here. This page also shows you how to use your gl account, which is your student UMBC account on a linux machine which we will also use in this course. The instructions below are for your own computer.
You need to install virtualization software that will run linux for you on your computer. I will give instructions below for VirtualBox, but you are free to use any that you already have installed or are familiar with such as VMware Player. Go to https://www.virtualbox.org/ and download the appropriate installer for your platform and install it.
You now need to install Ubuntu linux on that virtualization software. We will use a pre-configured virtual machine image for Ubuntu. Go to http://virtualboxes.org/ and download the latest image for Ubuntu 12.04 (at the bottom of the list). Decompress that image directory and put it somewhere where you want it in your home directory. Note that the pre-configured username is "ubuntu" and the password is "reverse".
Follow the instructions from the site for how to bring up that image in VirtualBox. Name your VM with your umbc username as in: ubuntu12.04-username. (Note that alternatively, you could do an install from the .iso file from the ubuntu site and do a regular desktop install which is more flexible but a little more involved.)
Start your machine and login to test everything. Get rid of the 3 Office suite applications in the left dock by right-clicking them and choose "unlock" to create more room there. Open Dash (the top icon in the dock) and drag the terminal application to the dock. Then drag the Text Editor (gedit) also. You can re-click on Dash to close it. Click on Terminal in left toolbar to open a command terminal. Use the pwd command and verify you are in your home directory. For some reason this image has an Italian keyboard layout. Click on settings in the left dock and change the keyboard layout to English(US) or verify that it already is that. You can remove the Italian layout.
The .bashrc file in your home directory controls your command-line environment. It is beyond the scope of this class to look at it carefully, but as a simple example, we will change the prompt. Open .bashrc with the nano (or pico) editor and uncomment the line that forces a color prompt. Then type: source .bashrc. You should see a color prompt. You can see how it works, if interested, here.
We will now install some software using apt, which is a package installer for Ubuntu. This makes it very easy to install software. You have to use the sudo command for many installs, as this temporarily gives you root privileges. Because we used an image, there is already a sudo-enabled user with username=ubuntu and password=reverse. So whenever a password is requested, use that one.
Install the Guest Additions that will allow you to copy and paste from your OS to Ubuntu. Follow the installation instructions from the site, but note the following changes. Skip steps 3-4 since they take so long and this is not a production server. You do not have to do step 8 since the new version pops up a dialog to do this. Restart your machine from the upper right hand shutdown menu item. Verify that you can copy and paste from your OS to the Ubuntu virtual machine terminal. You can also define shared folders between the OSs now.
At the command prompt type: sudo apt-get install gedit-plugins and type y for yes to install the software. It is an add-on plugin for the gedit text editor. Now type: wget http://grigio.org/files/darkmate.xml. Use the ls command and verify the xml file is there. Open gedit from left toolbar and go to edit->preferences->fonts&colors and add the darkmate.xml. Verify that the file browser panel is checked from edit->preferences->plugins. Then be sure to check View->Side Panel in the gedit menu so that the directory structure of the file system is displayed on the left side of gedit.
Install curl by typing: sudo apt-get install curl. We will need this later. You can now explore and experiment with your new OS using both the command-line and the GUI.

Question Set - None

None are due this week but you should start work on the ones due next week (see next week for the questions). All question sets should be answered in a plain text file that is nicely formatted for reading. Submit the text file by attaching it to the relevant assignment on Blackboard by the due date. The answer to each question should not exceed about half a page of text, but should answer the question precisely and fully. Points are deducted for vague, incomplete, or incorrect answers.

Week2 - MySQL

Readings

Slides
Nosql Distilled Chapters 4-6
Consistent hashing and Consistent hashing again

Non-graded hw2

This week, we will install the LAMP web-development environment. It includes the Apache2 web server, PHP5 scripting, and the MySQL relational database. This second project is not graded or turned in.

We can install it as a batch using the package manager. See the install instructions here. This installs all the servers and then changes the permissions for convenience (be sure to log out and log in after changing). You will use your ubuntu user name. Note the important caret (^) at the end of the command. Be sure to also do the php5-curl and mod rewrite installs, but there is no need for the graphics one. Now go to http://localhost and verify that the "It Works!" page comes up.
We will also need the PECL extension module for PHP5 that makes it easy to add functionality to PHP. See the install instructions here. Accept the defaults for the install questions. After the install, you need to modify the php.ini file and restart Apache.
Do this MySQL/PHP tutorial to become familiar with the environment. Do it up to but not including the section 'Writing Images'. You do not have to do the first steps, since we have already installed MySQL above. Start at '$ mysql -u root -p' and use the password that you specified during the install. A list of MySQL commands is summarized here.

Question Set qs1 - Due this Sunday

All question sets should be answered in a single plain text file that is nicely formatted for reading. Always put your answers in the context of a comparison with an RDBMS. Also, give specific examples with each answer.

What are the major issues related to ORM for nosql databases?
What are the major issues with schema for nosql?
What are the major issues with querying for nosql?
What are the major issues with transactions for nosql?
What are the major issues with consistency for nosql?
What are the major issues with concurrency for nosql?
What are the major issues with durability for nosql?
What are the major issues with sharding for nosql?
How does the business model determine approaches to trade-offs with nosql?
What are the major drivers for the decision to use nosql?

Week3 - Key/value - Memcached

Readings

References

Memcached main site
Turbocharge Your Website With Memcached (similar to reading#2)
Graphic stand-alone administration for memcached

Project hw3 - Due this Sunday

The best way to see how a key/value store is used with a relational database is to do the exercises below based on reading #1.

Install memcached with: sudo aptitude install memcached.
Install memcached command-line tools with: sudo aptitude install libmemcached-tools.
Do this exercise using the included instructions: hw3#3. Put the url for your text transcript in project3.txt.
Install memcached library for PHP with: sudo pecl install memcache. Accept default and add the line to php.ini as instructed. Then restart apache2: sudo /etc/init.d/apache2 restart.
Install the lynx command-line web-browser with: sudo apt-get install lynx-cur.
Set up everything with PHP and MySQL as in reading#2 using your root user. Do the simple timing at the end that shows the timing result for the first and (faster) second hits. Add the text of this result to a file project3.txt.
Do the same timing result for 5 additional queries using a library database. The .sql file to create the table and entries is here. A reminder of how to use the command-line for mysql is here. Add these results to the project3.txt file.
Create a TOC on gl that includes (1) project3.txt and (2) a screenshot of mc.php running in firefox showing one of queries from #6 above. Enter the URL for that TOC as your project3 assignment.
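
The pattern that the timing exercise above measures (check the cache first, fall back to the database on a miss, then store the result for next time) can be sketched roughly as follows. This is a Python illustration rather than the PHP from the reading. The `cache` dictionary stands in for a memcached server, and `slow_query` is a hypothetical placeholder for a MySQL query; neither name comes from the reading. Hashing the SQL text with MD5 to build the key is one common convention and is assumed here.

```python
import hashlib
import time

cache = {}  # stand-in for a memcached server (assumption: no real server is contacted)


def slow_query(sql):
    """Hypothetical placeholder for a MySQL query; sleeps to mimic database latency."""
    time.sleep(0.05)
    return [("row for", sql)]


def cached_query(sql):
    """Cache-aside lookup: try the cache, fall back to the database on a miss."""
    key = hashlib.md5(sql.encode("utf-8")).hexdigest()  # fixed-length key built from the SQL text
    if key in cache:
        return cache[key], True       # cache hit
    rows = slow_query(sql)
    cache[key] = rows                 # store the result for later requests
    return rows, False                # cache miss


def timed(sql):
    """Run one lookup and report (rows, hit flag, elapsed seconds)."""
    start = time.perf_counter()
    rows, hit = cached_query(sql)
    return rows, hit, time.perf_counter() - start


if __name__ == "__main__":
    for attempt in (1, 2):
        _, hit, seconds = timed("SELECT * FROM books")
        print(f"request {attempt}: cache_hit={hit} seconds={seconds:.4f}")
```

The second request should report a hit and a much smaller elapsed time, which is the same effect you record in project3.txt.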

Question Set - None

None are due this week but you should start work on the ones due next week (see next week for the questions).

Week4 - Key/value - Redis

Readings

References

Project hw4 - Due this Sunday

We will install Redis and explore how to use it this week.

First, do the on-line interactive tutorial under the Try it link. This uses an on-line server.
The easiest way to install is to use the package manager: sudo apt-get install redis-server. This installs a slightly older version, which is fine for us, moves everything to the correct place, and starts the server. If you want the most recent version, you have to install from source as explained here, but you would have to move everything and start the server yourself. In either case, see 'Check if Redis is working' there, which we use below.

Test that the server is working. The built-in client is redis-cli. Here is how I checked:

ubuntu@ubuntu-VirtualBox:~$ redis-cli ping
PONG
ubuntu@ubuntu-VirtualBox:~$ redis-cli
redis 127.0.0.1:6379> set mykey kip
OK
redis 127.0.0.1:6379> get mykey
"kip"
redis 127.0.0.1:6379> exit
ubuntu@ubuntu-VirtualBox:~$

Do the 15 minutes introduction to Redis data types link. Create a text file transcript of your commands issued on localhost for submission using redis-cli.

Implement the following modified example for Redisent with PHP on localhost with your Redis server using your VM. You need to add 5 more commands with output. Then fix the program so that it does not duplicate every time you reload the page in 2 ways: using a conditional and using redis sets. Save one screenshot of it working and verify that the keys are there using redis-cli. You can look up the commands at the redis docs page. If you need to review anything about php see here.

require_once 'src/redisent/Redis.php';
$redis = new redisent\Redis('redis://localhost');

$redis->set('awesome2', 'absolutely');
echo "Is Redisent awesome? ", $redis->get('awesome2'), "\n";
$redis->rpush('particles2', 'proton2');
$redis->rpush('particles2', 'electron2');
$redis->rpush('particles2', 'neutron2');
$particles = $redis->lrange('particles2', 0, -1);
$particle_count = $redis->llen('particles2');
echo "<p>The {$particle_count} particles that make up atoms are:</p>";
echo "<ul>";
foreach ($particles as $particle) {
  echo "<li>{$particle}</li>";
}
echo "</ul>";

You can see that it works with:

redis 127.0.0.1:6379> keys *2
1) "awesome2"
2) "particles2"
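
Why the list grows on every reload, and the two kinds of fix the task asks for, can be sketched with a tiny in-memory stand-in. This is a Python illustration, not Redisent or PHP. `FakeRedis` is a hypothetical class that only mimics the behavior of a few Redis commands (EXISTS, RPUSH, LRANGE, SADD, SMEMBERS), and the key name `particles_set2` is made up for the example.

```python
class FakeRedis:
    """Hypothetical in-memory stand-in that mimics a few Redis commands."""

    def __init__(self):
        self.data = {}

    def exists(self, key):
        return key in self.data

    def rpush(self, key, value):
        self.data.setdefault(key, []).append(value)    # lists keep duplicates
        return len(self.data[key])

    def lrange(self, key, start, stop):
        items = self.data.get(key, [])
        stop = len(items) if stop == -1 else stop + 1  # the Redis stop index is inclusive
        return items[start:stop]

    def sadd(self, key, value):
        members = self.data.setdefault(key, set())
        added = value not in members
        members.add(value)                             # sets store each member once
        return int(added)

    def smembers(self, key):
        return set(self.data.get(key, set()))


PARTICLES = ["proton2", "electron2", "neutron2"]


def page_naive(r):
    """Mimics the original page: every load appends the same three items again."""
    for p in PARTICLES:
        r.rpush("particles2", p)


def page_conditional(r):
    """Fix 1: only fill the list when the key does not exist yet."""
    if not r.exists("particles2"):
        for p in PARTICLES:
            r.rpush("particles2", p)


def page_with_set(r):
    """Fix 2: use a set, which ignores members it already holds."""
    for p in PARTICLES:
        r.sadd("particles_set2", p)


if __name__ == "__main__":
    r = FakeRedis()
    page_naive(r)
    page_naive(r)  # simulate a reload
    print(len(r.lrange("particles2", 0, -1)), "items after two naive loads")
```

Note the trade-off between the two fixes: the list keeps insertion order, while a set does not guarantee any order.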

Implement a Twitter clone using the code at Retwis-w-Predis. Submit a screenshot of the timeline.php url on localhost for your Twitter clone. You must use it to submit at least 8 entries to the redis server that show up. Verify the keys using redis-cli (as above) and save in a text file.
The submission should be an HTML Table of Contents (TOC) with six links:
- to a text file with your transcript for #4
- to a screenshot of #5 running on localhost
- to the 2 source code files for #5
- to the screenshot of timeline.php on localhost for #6
- the text file for #6
Post the single url to gl for your TOC on blackboard for hw4.

Question Set 2 - Due this Sunday

All question sets should be answered in a single plain text file that is nicely formatted for reading.

What are the major differences in function and history between memcached and redis?
Why does Facebook still use memcached and MySQL when more modern databases are available?
How and why is replication used for key/value dbs?
When and why is denormalization good and bad?
What if you want to query the same value by different keys in redis?
What does the PHP md5 command do for memcached?
What is a common partitioning technique for key/value dbs? How does this work?
Create your own example of an application that uses redis that would require a decision of how to do data modeling. Explain the 2 choices and why one is better. Do not use the same example from the lecture or any reading.
What are the important differences between Voldemort and Redis?
How exactly does Retwis implement followers in the code (using what redis feature)?

Week5 - Document - Couchdb1

Readings

Nosql Distilled Chapter 9
CouchDB The Definitive Guide - chapters 1-6
Why CouchDB? - Good video overview of Couchdb. One can watch half this week and half next week.
Nosql Database Technology: A Survey and Comparison of Systems - this is a very good video survey of the nosql database systems covered in this entire course, including couchdb and a later development of that called couchbase, which combines a key-value cache front-end with a back-end couchdb document database.

References

Slides
Couchdb Site
CouchDB versus Couchbase: What are the differences, and what happened to Membase?
Iris Couch - Create an account there before class so we can do an exercise in class.
Futon: Temp Views / Perm Views / Perm View Query

Project hw5 - Due this Sunday

Install Couchdb. You can either use the package manager or install from source - both instructions are in the link. Note that one can also use a free account on a couchdb cloud provider (but you also need to install locally) - see Iris Couch.
Do the introductory tutorial at the link that shows you how to use both the REST-based command-line and the Futon web-based tool for couchdb. Save a screenshot of the result of each of the steps 4-9 on a single web page. Note that in steps 1-3 (not turned in): 1-install couchdb which you have already done, 2-you create the database mycouchshop in Futon, and 3-you create an admin user.
Now do this more complicated introductory exercise using Futon and curl (Text-only version). Save the text transcript of the curl session and the requested 5 screenshots on a single web page.
Submit the url for a TOC on gl with the 2 links above for hw5.

Question Set - None

None are due this week but you should start work on the ones due next week (see next week for the questions).

Week6 - Document - Couchdb2

Readings

Nosql Distilled Chapter 7
MapReduce: Simplified Data Processing on Large Clusters
CouchDB The Definitive Guide - chapters 15-19

References

Slides
PHP-on-Couch - Data Access Library to access a CouchDB server with PHP.

Project hw6 - Due this Sunday

Now do this replication exercise using Futon and curl. Save the text transcript of the curl session and the requested 1 screenshot on a single web page (TOC#1).
Design and create your own simple database on couchdb for student information that includes name, address, multiple phone numbers, course transcript, and several others that you come up with. Add at least 10 documents and 5 views. Make sure you model the document so that the views can drill down to specific information such as all students from a particular state or who received a certain grade. Save a command-line transcript on localhost of requesting all records and then for each view (TOC#2).
Setup your local couchdb for this student information database project to replicate once (not continuously) to the couchdb on kip.umbc.edu. Note that the username/password is student/is721. You must namespace your databasename on this remote server as: username-dbname. For example, I might use canfield-studinfo. Also note that since kip.umbc.edu is behind the firewall, you must first start the vpn from vpn.umbc.edu. Note the url for your replicated database (TOC#3). Also show a screen shot of your couchdb screen on localhost that does the replication (TOC#4).
We will do a brief exercise to see how to use couchdb with a programming language library - in this case PHP. Use PHP-on-Couch - Data Access Library to access a CouchDB server with PHP. Implement the 2 examples from the 'examples' directory. Place the text source of the result for each file on gl. Then create a new php program that accesses your books database and has this resulting source. Note that one must view source to see the output of print_r formatted nicely. Place the source code and text source result in a file and put it on gl. So you have 4 files for the TOC links (TOC#5-8).
Submit the url for a TOC on gl with the 8 links above for hw6.
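
For the student information database above, the idea of modeling documents so that views can drill down can be sketched as follows. This is a Python illustration only: real CouchDB views are JavaScript map functions stored in design documents. The sample documents, field names, and the helper `run_view` are all made up for the example and are not a required schema.

```python
# Hypothetical sample documents; the field names are assumptions, not a required schema.
students = [
    {"_id": "s1", "name": "Ann Lee",
     "address": {"city": "Baltimore", "state": "MD"},
     "phones": ["410-555-0101", "410-555-0102"],
     "transcript": [{"course": "IS721", "grade": "A"},
                    {"course": "IS620", "grade": "B"}]},
    {"_id": "s2", "name": "Bo Diaz",
     "address": {"city": "Arlington", "state": "VA"},
     "phones": ["703-555-0103"],
     "transcript": [{"course": "IS721", "grade": "B"}]},
    {"_id": "s3", "name": "Cy Park",
     "address": {"city": "Columbia", "state": "MD"},
     "phones": ["443-555-0104"],
     "transcript": [{"course": "IS721", "grade": "A"}]},
]


def map_by_state(doc):
    """Emit one (key, value) row per student, keyed by state."""
    yield doc["address"]["state"], doc["name"]


def map_by_grade(doc):
    """Emit one row per transcript entry, keyed by grade."""
    for entry in doc["transcript"]:
        yield entry["grade"], {"name": doc["name"], "course": entry["course"]}


def run_view(map_fn, docs, key=None):
    """Apply the map function to every document, sort rows by key, optionally filter by key."""
    rows = [(k, v) for doc in docs for k, v in map_fn(doc)]
    rows.sort(key=lambda row: row[0])
    if key is not None:
        rows = [row for row in rows if row[0] == key]
    return rows


if __name__ == "__main__":
    print(run_view(map_by_state, students, key="MD"))
    print(run_view(map_by_grade, students, key="A"))
```

Because the state is its own field and each transcript entry carries its grade, a single key lookup is enough to answer both drill-down questions. Storing the address as one free-text string would make the first view much harder to write.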

Question Set 3 - Due this Sunday

All question sets should be answered in a single plain text file that is nicely formatted for reading.

Why are dynamic user populations a driver for nosql databases today?
Why is scaling of 3-tier architectures important for nosql databases today?
How does sharding work for RDBMS? What are the issues with it?
How did the architecture of Farmville on-line game change over time? Why the change?
How is Membase different than Memcached?
How is Couchdb different than Couchbase?
How does eventual consistency work for web apps? Be sure and give specific examples.
What are the advantages and disadvantages of the document-style nosql databases?
How does a cookie-based ad-fill application work using nosql technology?
How are workers used in mapreduce? How does cloud deployment relate to this?
What is MVCC? How does it work in couchdb?

Week7 - Document - Mongodb1

Readings

References

Note on mongo client

=> ./bin/mongo staff.mongohq.com:10089/unicorn -u canfield -pxxx
MongoDB shell version: 2.4.5
connecting to: staff.mongohq.com:10089/unicorn
> db.nettuts.find()
{ "_id" : ObjectId("5066104df85ba10002000102"), "dob" : "01/01/30", "first" : "john", "gender" : "m", "hair_colour" : "bald", "last" : "wayne", "nationality" : "american", "occupation" : [  "actor" ] }
{ "_id" : ObjectId("50659d62ec5855af36c8440e"), "dob" : "22/11/1958", "first" : "Jamie Lee", "gender" : "f", "hair_colour" : "brown", "last" : "curtis", "nationality" : "american", "occupation" : [  "actor" ] }
{ "_id" : ObjectId("50659d61ec5855af36c8440c"), "dob" : "03/06/1925", "first" : "arnold", "gender" : "m", "hair_colour" : "brown", "last" : "schwarzenegger", "nationality" : "american", "occupation" : [  "actor",  "politician" ] }
{ "_id" : ObjectId("5065c64e803b1a0002000001"), "first" : "kip", "gender" : "m", "last" : "canfield", "occupation" : [  "professor" ] }
> exit
bye
=>

Project hw7 - Due this Sunday

Do the introductory tutorial: Getting Started with MongoDB - Part 1 on localhost. Save a transcript.
Do the tasks in chapters 1-4 of the mongodb book on localhost. Save a transcript.
Create a php file on localhost using PHP and HTML based on the MongoDB and PHP Tutorial. The php file should give this output. Submit the screenshot with different values and the source code. One will output html rather than use print_r. For example: echo "<p>" . $document["title"] . "</p>";
Sign up for a MongoHQ free account. Create your nettuts database/collection on MongoHQ. Create a read-only user for me. Note the full client url to this: mongo xyz.mongohq.com:12345/tutorial -u user -ppassword. Where tutorial is the database name and the user is the one you created for me.
The Guardian uses MongoDB to store the data available in its REST API. See Tier1 access that does not require an API key. Save 5 queries and results in 5 text files. 3 should be json and 2 should be xml. Use all the parameters from endpoints 1-2 somewhere in the 5 queries. Do not copy queries exactly from the documentation. Create a TOC that has 5 links.
Submit the url for a top-level TOC on gl with the 6 links above for hw7.

Question Set - None

None are due this week but you should start work on the ones due next week (see next week for the questions).

Week8 - Document - Mongodb2

Readings

References

Choosing a Good Shard Key
Server-side vs. Client-side control - read just the Course Introduction.
v2.4 adds hashed shard keys - note that this has changed since last year!

Project hw8 - Due this Sunday

Do the part2 tutorial: Getting Started with MongoDB - Part 2 on localhost. Save a transcript.
Redo the Student Information project from the CouchDB section for MongoDB on MongoHQ. Then create a CRUD application on gl for the remote database that uses the MongoHQ REST API. Create a TOC. Give the url to the index.html for the project on gl and links for me to peruse all source code (save a duplicate of all .php files as .txt for me). See the sample. Generally, you will need about 10 pages, depending on how you design it:
- An entry index.html that calls some .php pages to query (retrieve) the database (returns json) and display for either students or courses for the following (5 pages):
  - all student records
  - all course records (remember that this is a separate collection)
  - students by gender
  - students taking a specific course with an associated link to see course information
  - students by year (freshman etc.)
  - students by state and/or city
- A form to create a new record for a course only. (2 pages)
- A form to update the course records only. (1 page)
- A form to delete a record for a course or a student. (2 pages)
Submit the url for a top-level TOC on gl with the 2 links above for hw8.

Question Set 4 - Due this Sunday

All question sets should be answered in a single plain text file that is nicely formatted for reading.

How is a collection different than a relational table?
How are embedded documents similar to joins? How different?
Why is Mongodb good for a logging use case?
What are compound keys? When specifically would you use one for Mongodb? Give a real-world example using PHP.
What issues related to adaptability were important to the Guardian decision?
Describe the evolution of the Guardian architecture in stages with special attention to the model at each stage.
How is database query different for Mongodb vs. Couchdb vs. relational? Include discussion of update queries.
What are important schema upgrade/change issues?
What are important replication issues for Mongodb? Include sharding, consistency, flexibility, others, etc.
Describe the new Guardian identity model with Mongodb. How does it fit into the current complete architecture and API? What is the future plan for the architecture?
Summarize and discuss problems and controversies for Mongodb.

Week9 - XML - Existdb1

Readings

References

See: XQuery/TEI Document Timeline (class demo)
eXist-db Open Source Native XML Database (note especially the documentation here, including the screencasts)
www.28msec.com (XQuery on Mongodb)
SAXON The XSLT and XQuery Processor (we use saxon9he.jar for hw)
Additional XQuery tutorials/references
- XQuery.co.uk
- XQuery.me
XQuery Examples (used in hw)
Choosing eXist-db as the XML database for the reBiND project

Notes

XSLT implementations are generally optimized for transforming entire documents; they load the entire input document into memory. XQuery is optimized for selecting fragments of data, for example, from a database. It is designed to be scalable and to take advantage of database features such as indexes for optimization.
XQuery has a more compact, non-XML syntax, which is sometimes easier to read and write (and embed in program code) than the XML syntax of XSLT.
XQuery is designed to select from a collection of documents as opposed to a single document. FLWORs make it easy to join information across (and within) documents. Also, XSLT 2.0 stylesheets can operate on multiple documents, but XSLT processors are not particularly optimized for this less common use case.
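
To make the FLWOR idea in the notes above concrete, here is a rough Python analogue of a for / where / order by / return expression, written with the standard xml.etree.ElementTree module. It is only a sketch to show the shape of the clauses, not a substitute for XQuery. The inline document is a small made-up sample modeled on a bookstore, and `titles_over` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Small made-up sample modeled on a bookstore document (titles and prices are assumptions).
BOOKS_XML = """
<bookstore>
  <book category="COOKING"><title>Everyday Italian</title><price>30.00</price></book>
  <book category="CHILDREN"><title>Harry Potter</title><price>29.99</price></book>
  <book category="WEB"><title>XQuery Kick Start</title><price>49.99</price></book>
  <book category="WEB"><title>Learning XML</title><price>39.95</price></book>
</bookstore>
"""


def titles_over(root, min_price):
    """Rough analogue of:
    for $x in /bookstore/book where $x/price > min_price order by $x/title return $x/title
    """
    selected = [book for book in root.findall("book")            # for
                if float(book.findtext("price")) > min_price]    # where
    selected.sort(key=lambda book: book.findtext("title"))       # order by
    return [book.findtext("title") for book in selected]         # return (text only)


if __name__ == "__main__":
    bookstore = ET.fromstring(BOOKS_XML)
    for title in titles_over(bookstore, 30):
        print(title)
```

The Python version has to load and walk the whole document, which is the kind of work an XQuery engine backed by a database can avoid by using indexes.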

Project hw9 - Due this Sunday

Install Exist-db on localhost using the jar file and read the Main Documentation (note). Use version 1.4.3 (not the current 2.x). Also download and install the Saxon processor. You need to download the zip file for Java for Saxon-HE 9.4. Also read the XQuery documentation under Using XQuery - Running XQuery from the Command Line.
Read and study the XQuery Basic and Advanced sections.
Implement all the XQuery examples for the W3Schools bookstore example on localhost. First modify the given books.xml file to have 5 additional books for a total of 9 books. Create a transcript of all these examples working and put them in a text file. Use the url to this text file as the first item in your TOC on gl for Existdb1. This is how your transcript should look. See this note on the data function. Here are the 8 examples repeated from the W3Schools reading:
1. doc("books.xml")/bookstore/book/title
2. doc("books.xml")/bookstore/book[price<30]
3. doc("books.xml")/bookstore/book[price>30]/title
4. for $x in doc("books.xml")/bookstore/book where $x/price>30 return $x/title
5. for $x in doc("books.xml")/bookstore/book where $x/price>30 order by $x/title return $x/title
6. <ul> { for $x in doc("books.xml")/bookstore/book/title order by $x return <li>{$x}</li> } </ul>
7. <ul> { for $x in doc("books.xml")/bookstore/book/title order by $x return <li>{data($x)}</li> } </ul>
8. for $x in doc("books.xml")/bookstore/book return if ($x/@category="CHILDREN")
  then <child>{data($x/title)}</child> else <adult>{data($x/title)}</adult>
Write an xquery that transforms the plant_catalog.xml to look like this. Pay special attention to the w3schools advanced section for guidance. Then redo examples 5-8 above using that transformed plant catalog with appropriate changes to the XQuery. Save a transcript of the xquery transform and the 4 examples working on the command-line. Be sure to cat each file you are calling before you call it so I can see the source code - there is an example in the sample transcript.
Do the examples in reference#6 chapter 1 examples 5-12. Save a transcript from the command-line.
Use this remote xml file: http://www.dbis.informatik.uni-goettingen.de/Mondial/mondial.xml. It is a very large file. Mondial is a database with global demographic and geographic information in xml format. Get familiar with the file and its tags.

Study this XQuery, figure out how it works, and get it working in your gl account. Create a link to the output xml file as the third TOC item for Existdb1. Note that text() is an XPath function that returns just the text content part of the selected node and the '@' sign selects attributes - see the XPath reference at W3Schools.

      <result>
          { for $country in doc ("http://www.dbis.informatik.uni-goettingen.de/Mondial/mondial.xml")/mondial/country
            return
             <item>
               <country> {$country/name/text()} </country>
              { for $ethnicgroup in $country/ethnicgroups
                return
                  <ethnicgroup>  {$ethnicgroup/@percentage}  {$ethnicgroup/text()}</ethnicgroup>
               } 
              </item>
          }
     </result>

It will return a large number of records like:

  <item>
    <country>Swaziland</country>
    <ethnicgroup percentage="3">European</ethnicgroup>
    <ethnicgroup percentage="97">African</ethnicgroup>
  </item>
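
If the roles of text() and '@' in the XQuery above are unclear, the following Python sketch walks the same steps with xml.etree.ElementTree: an outer loop over countries, an inner loop over ethnic groups, copying the text content and the percentage attribute into new elements. It is an illustration only. The inline fragment is a made-up sample shaped like the records above, not the real Mondial file, and `build_result` is a hypothetical name.

```python
import xml.etree.ElementTree as ET

# Made-up fragment shaped like the records the query reads (not the real Mondial file).
SAMPLE = """
<mondial>
  <country>
    <name>Swaziland</name>
    <ethnicgroups percentage="3">European</ethnicgroups>
    <ethnicgroups percentage="97">African</ethnicgroups>
  </country>
</mondial>
"""


def build_result(mondial):
    """Build <result> with one <item> per country, mirroring the nested for loops."""
    result = ET.Element("result")
    for country in mondial.findall("country"):               # outer: for $country
        item = ET.SubElement(result, "item")
        ET.SubElement(item, "country").text = country.findtext("name")  # name/text()
        for group in country.findall("ethnicgroups"):        # inner: for $ethnicgroup
            out = ET.SubElement(item, "ethnicgroup",
                                percentage=group.get("percentage"))     # @percentage
            out.text = group.text                                        # text()
    return result


if __name__ == "__main__":
    tree = build_result(ET.fromstring(SAMPLE))
    print(ET.tostring(tree, encoding="unicode"))
```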

Modify the above XQuery to produce a large number of records in the following format. Create a link to the result file and to the source XQuery for the TOC.
```
  <item>
    <country car_code="SD">Swaziland</country>
    <borders country="RSA"/>
    <borders country="MOC"/>
  </item>
```
Now put the XML files and XQuery example 11 from step 5 into your Exist database on localhost. Save a screen shot of it running in the browser from Exist.
Submit the url for your TOC with 7 items for Existdb1 on blackboard.

Question Set - None

None are due this week but you should start work on the ones due next week (see next week for the questions).

Week10 - XML - Existdb2

Readings

References

XSLTForms (allows browsers with no native XForms support to use them)
XForms Tutorial
XRX Tutorial (this has the Dictionary Editor application)
Balisage: The Markup Conference
Clinical Document Architecture
The java client for exist
The install for Ubuntu

Project hw10 - Due this Sunday

Make sure you have version 1.4.3 installed for Exist-db.
Create an XRX application for very simple medical records in Exist-db on localhost. The short medical records should be in soap note format. You will create an xml format that follows this organization. Each patient will be in a separate document.
Base your application on the Dictionary Editor Sample Application in the XRX Tutorial. You should get that working first. It is actually based on the 'item-manager' application from the subversion repository. We will demo an example project in class.
Your application will include 15 soap note record documents, a CRUD component, and 5 xquery reports that summarize across documents.
Create a TOC-style entry for your application on localhost that lists patients and allows browsing them and the reports. Submit a single url for a separate TOC on gl for Existdb2 that allows me to browse screenshots for the entry page, 3 of the patients, and all 5 reports. Also, allow me to browse all the XQuery source code for the reports as text from that TOC.

Question Set 5 - Due this Sunday

All question sets should be answered in a single plain text file that is nicely formatted for reading.

What are the issues with data vs. document style xml documents for persistence in a database? How does middleware fit in?
How is locking variously implemented in xml native databases? What are the issues for performance?
How are the nosql and xml database toolchains at odds? How compatible?
Why is XRX claimed to have simpler application development? Give a particular example not from the reading.
Describe the IHE from the reading. Give special attention to XProc and transaction issues.
How can json be used with XQuery? Why is this a good idea for json? Is it applicable to xml?
Why is xml technology good for specific application areas but falling into less use in the general web application area?
Does Saxon use a DOM or a SAX parser? Which would be best for processing a long doc like in Existdb1ex6? Why? What about a transform of Existdb1ex7 that would allow a transform to html that inserted links to resolve the actual border country names (not just codes)?
Describe the typical modern xml support in the database postgresql. When is it good to use these extensions and when is a BLOB just as good (give examples)?
What is the CCD health document standard? Give a brief example of the xml for 2 diagnostic lab results.

Week11 - Graphdb

Readings

References

Project hw11 - Due this Sunday

You need to do this project on localhost.
Install a servlet container - Apache Tomcat6 or 7, docs, and admin using apt-get.
Deploy the Exist-db XML database (use version 1.4.3) as a WAR file using the Tomcat manager.
Deploy the Triple Store - Sesame as a WAR file using the Tomcat manager.
Create a repository called 'your_username-mondial' on Sesame. Load the Mondial Europe database and meta-data into that repository.
Do the sample SPARQL query from Mondial in Sesame to verify (screenshot required - TOC1).
Browse the Mondial data for countries to see the fields that are there. Then browse Dbpedia to find 5 different fields appropriate to countries that are not in Mondial.
Use CONSTRUCT queries to get that data out of Dbpedia and load the triples into your Mondial database on Sesame. Create a single text file that contains all 5 queries - TOC2.
Do a SPARQL SELECT query in Sesame to verify with at least 2 of the new fields and 2 of the old ones (screenshot required - TOC3).
Create 5 XQuery files (.xq) that issue queries to your Sesame database and return HTML pages. Create a single text file that contains all 5 queries - TOC4. Save the HTML return from each query in an .html file - TOC5-9.
Create a screenshot of your .xq files residing in the Exist-db administrator web interface - TOC10.
Create a TOC on gl with 10 links to all these files and submit that single url as your hw11.
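
As a starting point for the CONSTRUCT step above, here is a minimal sketch of how one such query could be built as text and turned into a request URL. The class dbo:Country and the example property (dbo:currency) are assumptions about DBpedia's vocabulary that you should confirm by browsing DBpedia; the "format" request parameter is also an assumption about the public endpoint. The function names are hypothetical. Note that the resulting triples use DBpedia's country URIs, so relating them to the Mondial resources is still up to you.

```python
from urllib.parse import urlencode

# Public DBpedia SPARQL endpoint (assumed reachable from your machine).
DBPEDIA_ENDPOINT = "http://dbpedia.org/sparql"


def construct_query(property_uri, limit=500):
    """Build a CONSTRUCT query that copies one property for every country.

    property_uri is whichever DBpedia property you picked while browsing;
    dbo:Country is assumed to be the class DBpedia uses for countries.
    """
    return (
        "PREFIX dbo: <http://dbpedia.org/ontology/>\n"
        "CONSTRUCT { ?country <%s> ?value }\n"
        "WHERE { ?country a dbo:Country ; <%s> ?value }\n"
        "LIMIT %d" % (property_uri, property_uri, limit)
    )


def request_url(query, result_format="text/turtle"):
    """URL-encode the query; the 'format' parameter is an assumption."""
    return DBPEDIA_ENDPOINT + "?" + urlencode(
        {"query": query, "format": result_format}
    )


if __name__ == "__main__":
    # Hypothetical example property; replace with one of your 5 fields.
    q = construct_query("http://dbpedia.org/ontology/currency")
    print(q)
    print(request_url(q))
```

The printed query text is what goes into your TOC2 file; the URL is only a convenience for fetching the triples before loading them into Sesame.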

Question Set - None

None are due this week.

Week12 - Cassandra1

Readings

Bigtable: A Distributed Storage System for Structured Data (Google - read Sections 1-4 and 11 - cached pdf)
Amazon's Dynamo - (read Sections 1-5 - cached pdf)
Cassandra - A Decentralized Structured Storage System (cached pdf)
Cassandra Essentials Series (about 6min each video)
- Overview of Apache Cassandra
- Managing Objects and Data in Cassandra
C* Summit 2013: Jonathan Ellis Keynote
Cassandra 2.0 Better, Faster, Stronger

References

Cassandra1 - Slides - Cassandra1 - Video Lecture (has old, deprecated parts from earlier versions - we will show the differences in class)
Apache Cassandra 2.0 Documentation
CQL3 for Cassandra experts - shows evolution of cql

Project hw12 - Due this Sunday

Install cassandra from a binary tarball, not the apt-get package. The dsc.tar.gz archive has the current version of 2.0.2. You do not need to install the opscenter. Start cassandra with sudo bin/cassandra -f to put it in the foreground and open another terminal window to continue work.
Continue on to the Two minute overview and do the exercise in the Querying Cassandra section. Save the command-line transcript as TOC1.
Continue on to the cassandra 2.0 documentation. Read all the brief sections.
Now go to the CQL for Cassandra 2.0 and do the tasks in CQL data modeling/Data modeling example/Example of a music service. Save the command-line transcript as TOC2. Add 10 more songs with 3 tags each and 4 playlists. Do 8 queries of your database showing all important aspects of the database.
Submit your TOC with 2 links as project hw12.

Question Set - None

None are due this week but you should start work on the ones due next week (see next week for the questions).

Week13 - Cassandra2

Readings

Cassandra Essentials Series (about 6min each video)
- Understanding Partitioning and Replication in Cassandra
- Understanding Data Consistency in Cassandra
Keeping Movies Running Amid Thunderstorms!

References

Cassandra2 - Slides - Cassandra2 - Video Lecture (has old, deprecated parts from earlier versions - we will show the differences in class)
The evolution of schema in Cassandra
Cassandra By Example: Data Modelling with CQL3 - slides - github

Project hw13 - see syllabus for due date

Design and create a time series database (note questions at bottom) using the cql3 command-line. Use a python program to generate 100 values for it. Use names different from those in the sample for everything in the database, choosing names that describe your own database. Do 5 different queries with cql3 to show time slices. Your complete command-line cql3 transcript of this task is the TOC (you can truncate the 100 row output). At the top of the TOC, write a brief paragraph describing your database.
Submit your TOC as project hw13.
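
For the value generator, a minimal sketch along these lines may help. It only prints CQL INSERT statements that you could paste into the cql3 command-line. The table name (demo.readings), the column names (series_id, event_time, value), and the value range are all hypothetical placeholders, not the names from the sample, and you should replace them with names that fit your own database.

```python
import random
from datetime import datetime, timedelta


def generate_readings(n=100, start=datetime(2013, 11, 1),
                      step_seconds=60, seed=721):
    """Return n (timestamp, value) pairs spaced step_seconds apart."""
    rng = random.Random(seed)  # fixed seed so the output is repeatable
    return [
        (start + timedelta(seconds=i * step_seconds),
         round(rng.uniform(15.0, 30.0), 2))
        for i in range(n)
    ]


def to_insert(table, series_id, when, value):
    """Format one reading as a CQL INSERT (table/columns are placeholders)."""
    stamp = when.strftime("%Y-%m-%d %H:%M:%S+0000")
    return (
        "INSERT INTO {0} (series_id, event_time, value) "
        "VALUES ('{1}', '{2}', {3});".format(table, series_id, stamp, value)
    )


if __name__ == "__main__":
    for when, value in generate_readings():
        print(to_insert("demo.readings", "sensor-1", when, value))
```

Keeping the timestamps evenly spaced makes it easier to check that a time-slice query returns the rows you expect.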

Question Set F - see syllabus for due date

All question sets should be answered in a single plain text file that is nicely formatted for reading.

How does dynamo handle consistency and availability? What are the issues that must be resolved?
Why and how does dynamo modify regular consistent hashing? How does this affect replication?
How does dynamo handle shopping cart inconsistency?
Describe the BigTable data model. How does it handle persistence and availability?
Why did Facebook develop cassandra? What features of older systems did it combine?
What is a good approach to data modeling for cassandra? Why is this different from relational methods?
Discuss the evolution of query slices in cassandra. How exactly do they relate to the underlying database and the query language? Include discussion of 'with compact storage.'
Why did Netflix change its architecture? Describe the rationale for each change.
How does Netflix handle rapidly changing demand for services? Give a scenario.
What is the 'thundering herd'? How is it handled?
Describe the strategy, operation, and issues with replication and partitioning in cassandra.

Week14 - Hadoop1

Readings

The following videos are from BigData University which is an IBM project. You need to watch and take notes on each video in the 4 lessons. It looks like a lot of videos but each is just a few minutes long. So the whole thing takes less than an hour. Each lesson link shows you a directory browse of the individual videos in the lesson with an associated text transcript in PDF format.

References

Hadoop1 - Slides
Hadoop1 - Video Lecture
BigData University
Understanding Big Data - The BigData University textbook in PDF format. Chapter 4 covers essentially the same information as our 4 lessons above.
Introduction to Amazon Web Services (video 13 min). This is a good video for more concrete information on how a Cloud provider is set up and charges for service. Hadoop processing is typically done by Cloud providers.

Project hw14 - Due this Sunday

Install the job from: Running Hadoop On Ubuntu Linux (Single-Node Cluster) on localhost. The lecture slides give notes and the 7 TOC deliverables on this project. Include your actual umbc username in all directories and filenames you create - e.g. canfield-gutenberg and canfield-pg20417.txt.
Submit your TOC as project hw14.

Question Set - None

None are due this week but you should start work on the ones due next week (see next week for the questions).

Week15 - Hadoop2

Readings

The following videos are from BigData University which discuss Text Analytics.
- Applications for Text Analytics
HBase - excerpt from: Hadoop: The Definitive Guide, Third Edition by Tom White, O’Reilly, 2012.
HBase vs Cassandra: why we moved

References

Project hw15 - Due this Sunday

Run the job from: Writing An Hadoop MapReduce Program In Python on your hadoop cluster. The lecture slides give notes and 4 specific TOC deliverables on this project.
Run the job from: Hadoop Streaming Made Simple using Joins and Keys with Python on your hadoop cluster. The lecture slides give notes and 2 specific TOC deliverables on this project.
Use the python code from #2 to process the data at State & County QuickFacts on your hadoop cluster. You must combine two data sets to produce 2 reports - one to report the count of all the counties in a state and one to report the population for a state by summing county populations. Each report from the reducer will have 2 columns: state code|sum. The lecture slides give notes and 7 specific TOC deliverables on this project.
Submit your overall TOC as project hw15. Include your actual umbc username in all directories and filenames you create for all exercises - e.g. canfield-fad-output and canfield-fad.csv. The file/directory name part can be whatever you want.
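
The reducer side of the two reports can be sketched as follows. This is only an illustration under stated assumptions, not the code from the linked tutorials: it assumes the mapper emits one tab-separated "state code, number" pair per county (the number 1 for the county count report, or the county population for the population report) and that the lines arrive sorted by state code, as Hadoop Streaming delivers them to a reducer. The function names are hypothetical.

```python
from itertools import groupby


def parse(lines):
    """Yield (state_code, number) from tab-separated lines, skipping blanks."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        key, _, value = line.partition("\t")
        yield key, int(value)


def reduce_sums(lines):
    """Sum the numbers for each run of identical state codes.

    Relies on the input already being sorted by state code.
    """
    for state, group in groupby(parse(lines), key=lambda kv: kv[0]):
        yield "{0}|{1}".format(state, sum(v for _, v in group))


if __name__ == "__main__":
    # Made-up sample input; in a streaming job, pass sys.stdin instead.
    sample = ["MD\t10\n", "MD\t5\n", "VA\t7\n"]
    for row in reduce_sums(sample):
        print(row)
```

The same reducer serves both reports because counting counties is just summing a 1 for each county record.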

Question Set G - Due this Sunday

All question sets should be answered in a single plain text file that is nicely formatted for reading. Note that the slides have more information on the query tasks.

How does hadoop rack awareness affect streaming vs. random access?
What is heartbeat and how and why is it used in a hadoop cluster?
Compare using pig, hive, jaql, java, and python as query languages for hadoop. What are the differences and advantages of each?
Do the pig command-line task in the part2 video. Submit the url to the transcript as the answer for this question.
Do the hive command-line task in the part3 video. Submit the url to the transcript as the answer for this question.
Do the jaql command-line task in the part4 video. Submit the url to the transcript as the answer for this question.
Describe the steps for a scenario where hadoop is used for text analytics like log analysis. Be specific about the steps.
What is hadoop streaming? How is it relevant to our python examples?
What is HBase and what is the relationship to hdfs? Describe when it would be used.
Compare cassandra and HBase. What are the differences, similarities, and what is the best use of each?
What is an embedded table? What is it used for in hadoop-related technology?