IS721 Semi-Structured Data Management Weekly Tasks
Intro
Each weekly section below gives you all the materials you need
for that week. Each set is divided into 3 parts: readings, homework
projects (hw1-15 due by the Sunday noted, but hw1-2 are ungraded), and
question sets (qs1-7 due every few weeks on Sunday as noted in the
schedule). Do the readings and references before class time.
Ask any questions on the blackboard forum.
Index: week1 / week2 / week3 / week4 / week5 / week6 / week7 / week8 / TBA
Readings
Non-graded hw1
This project sets up the required environment for the course on your own computer. This first project is not graded or turned in. Use these steps to set up your computer:
-
Since we all end up with a unix-type operating system (OS),
everyone needs to know a few unix commands and relevant protocols. If
you are unfamiliar with them, then review here. This page also shows you how to use your gl
account, which is your student UMBC account on a linux machine which we
will also use in this course. The instructions below are for your own
computer.
-
You need to install virtualization software that will run linux
for you on your computer. I will give instructions below for VirtualBox,
but you are free to use any that you already have installed or are
familiar with such as VMware Player. Go to https://www.virtualbox.org/ and download the appropriate installer for your platform and install it.
-
You now need to install Ubuntu linux on that virtualization
software. We will use a pre-configured virtual machine image for
Ubuntu. Go to http://virtualboxes.org/
and download the latest image for Ubuntu 12.04 (at the bottom of the
list). Decompress that image directory and put it somewhere where you
want it in your home directory. Note that the pre-configured username
is "ubuntu" and the password is "reverse".
-
Follow the instructions from the site for how to bring up that image in VirtualBox. Name your VM with your umbc username as in: ubuntu12.04-username. (Note that alternatively, you could do an install from the .iso file from the ubuntu site and do a regular desktop install which is more flexible but a little more involved.)
-
Start your machine and log in to test everything. Get rid of the
3 Office suite applications in the left dock by right-clicking them and
choosing "unlock" to create more room there. Open Dash (the top icon in
the dock) and drag the terminal application to the dock. Then drag the
Text Editor (gedit) also. You can re-click on Dash to close it. Click
on Terminal in the left toolbar to open a command terminal. Use the pwd
command and verify you are in your home directory. For some reason this
image has an Italian keyboard layout. Click on settings in the left
dock and change the keyboard layout to English(US) or verify that it
already is that. You can remove the Italian layout.
-
The .bashrc file in your home directory controls your
command-line environment. It is beyond the scope of this class to look at
it carefully, but as a simple example, we will change the prompt. Open
.bashrc with the nano (or pico) editor and uncomment the line to force
the prompt to be color. Then type: source .bashrc. You should see a color prompt. You can see how it works, if interested, here.
-
We will now install some software using apt which is a package installer for Ubuntu. This makes it very easy to install software. You have to use the sudo
command for many installs as this temporarily makes you the root user.
Because we used an image, there is already a sudo user with username=ubuntu and password=reverse. So whenever a password is requested, use that one.
-
Install the guest additions that will allow you to copy and paste from your OS to Ubuntu. Follow the installation instructions from the site,
but note the following changes. Skip steps 3-4 since they take so long
and this is not a production server. You do not have to do step 8
since the new version pops up a dialog to do this. Restart your
machine from the upper right hand shutdown menu item. Verify that you
can copy and paste from your OS to the Ubuntu virtual machine terminal.
You can also define shared folders between the OSs now.
-
At the command prompt type: sudo apt-get install gedit-plugins and type y for yes to install the software. It is an add-on plugin for the gedit text editor. Now type: wget http://grigio.org/files/darkmate.xml. Use the ls
command and verify the xml file is there. Open gedit from the left toolbar
and go to edit->preferences->fonts&colors and add the
darkmate.xml. Verify that the file browser panel is checked from
edit->preferences->plugins. Then be sure to check View->Side
Panel in the gedit menu so that the directory structure of the file
system is displayed on the left side of gedit.
- Install curl by typing: sudo apt-get install curl. We will need this later.
You can now explore and experiment with your new OS using both the command-line and the GUI.
Question Set - None
None are due this week but you should start work on the ones due next week (see next week for the questions). All question sets should be answered in a plain text file
that is nicely formatted for reading. You submit the text file by
attaching it to the relevant assignment on blackboard by the due date.
The answer to each question should not exceed about a half page of text,
but should answer the question precisely and fully. Points are deducted
for vague, incomplete, or incorrect answers.
Week2 - MySQL
Readings
- Slides
- Nosql Distilled Chapters 4-6
- Consistent hashing and
Consistent hashing again
Non-graded hw2
This week, we will install the LAMP web-development environment.
It includes the Apache2 web server, PHP5 scripting, and the MySQL
relational database. This second project is not graded or turned in.
-
We can install it as a batch using the package manager. See the install instructions here.
This installs all the servers and then changes the permissions for
convenience (be sure to logout and login after changing). You will use
your ubuntu user name. Note the important caret (^) at the end of
the command. Be sure to also do the php5-curl and mod rewrite
installs, but there is no need for the graphics one. Now go to
http://localhost and verify that the "It Works!" page comes up.
-
We will also need the PECL extension module for PHP5 that makes it easy to add functionality to PHP. See the install instructions here. Accept the defaults for the install questions. After the install, you need to modify the php.ini file and restart Apache.
-
Do this MySQL/PHP tutorial
to become familiar with the environment. Do it up to but not including
the section 'Writing Images'. You do not have to do the first steps,
since we have already installed MySQL above. Start at '$ mysql -u root
-p' and use the password that you specified during the install. A list
of MySQL commands is summarized here.
Question Set qs1 - Due this Sunday
All question sets should be answered in a single plain text file
that is nicely formatted for reading. Always put your answers in the
context of a comparison with rdbms. Also, give specific examples with
each answer.
-
What are the major issues related to ORM for nosql databases?
-
What are the major issues with schema for nosql?
-
What are the major issues with querying for nosql?
-
What are the major issues with transactions for nosql?
-
What are the major issues with consistency for nosql?
-
What are the major issues with concurrency for nosql?
-
What are the major issues with durability for nosql?
-
What are the major issues with sharding for nosql?
-
How does the business model determine approaches to trade-offs with nosql?
-
What are the major drivers for the decision to use nosql?
Week3 - Key/value - Memcached
Readings
- Slides
- Memcached And PHP, (cached)
-
Normalization and Denormalization (pdf)
-
Distributed Caching with Memcached
- Scale at Facebook (video)
References
- Memcached main site
- Turbocharge Your Website With Memcached (similar to reading#2)
- Graphic stand-alone administration for memcached
Project hw3 - Due this Sunday
The best way to see how a key/value store is used with a
relational database is to do the exercises below based on reading #1.
-
Install memcached with: sudo aptitude install memcached.
-
Install memcached command-line tools with: sudo aptitude install libmemcached-tools.
-
Do this exercise using the included instructions: hw3#3. Put the url for your text transcript in project3.txt.
-
Install memcached library for PHP with: sudo pecl install memcache. Accept default and add the line to php.ini as instructed. Then restart apache2: sudo /etc/init.d/apache2 restart.
-
Install the lynx command-line web-browser with: sudo apt-get install lynx-cur.
-
Set up everything with PHP and MySQL as in reading#2 using your
root user. Do the simple timing at the end that shows the timing result
for the first and (faster) second hits. Add the text of this result to a
file project3.txt.
-
Do the same timing result for 5 additional queries using a library database. The .sql file to create the table and entries is here. A reminder of how to use the command-line for mysql is here. Add these results to the project3.txt file.
-
Create a TOC on gl that includes (1) project3.txt and (2) a
screenshot of mc.php running in firefox showing one of the queries from #6
above. Enter the URL for that TOC as your project3 assignment.
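The first-hit versus second-hit timing in this project follows the cache-aside pattern: look in the key/value store first, fall back to the relational database on a miss, then store the result for next time. The sketch below is in Python rather than the PHP used in the readings, and it stands in for both servers: a plain dict plays the role of memcached, and slow_query (a hypothetical function with an artificial delay) plays the role of MySQL. Hashing the SQL text with MD5 to form the cache key is one common convention and is assumed here.

```python
import hashlib
import time

cache = {}  # stand-in for a memcached server (assumption: no real server is used)


def slow_query(sql):
    """Hypothetical stand-in for a MySQL query; the sleep imitates database latency."""
    time.sleep(0.05)
    return [("row for", sql)]


def cached_query(sql):
    """Cache-aside lookup: try the cache first, query the database only on a miss."""
    key = hashlib.md5(sql.encode("utf-8")).hexdigest()  # fixed-length key from the SQL text
    if key in cache:
        return cache[key], True   # cache hit
    rows = slow_query(sql)
    cache[key] = rows             # populate the cache for the next request
    return rows, False            # cache miss


def timed(sql):
    """Return (rows, hit, elapsed_seconds) for one lookup."""
    start = time.perf_counter()
    rows, hit = cached_query(sql)
    return rows, hit, time.perf_counter() - start


if __name__ == "__main__":
    sql = "SELECT * FROM books WHERE author = 'Smith'"  # hypothetical query
    for attempt in (1, 2):
        rows, hit, elapsed = timed(sql)
        print(f"hit {attempt}: from_cache={hit} elapsed={elapsed:.4f}s")
```

The second lookup skips slow_query entirely, which is why the second hit in your own timing should be the faster one.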
Question Set - None
None are due this week but you should start work on the ones due next week (see next week for the questions).
Week4 - Key/value - Redis
Readings
- Slides
- Nosql Distilled Chapter 8
- The Little Redis Book
-
Normalization and Denormalization Part2
References
- Redis Site
- 11 Common Web Use Cases Solved In Redis
- Data Modeling In Redis
Project hw4 - Due this Sunday
We will install Redis and explore how to use it this week.
- First, do the on-line interactive tutorial under the Try it link. This uses an on-line server.
-
The easiest way to install is to use the package manager: sudo apt-get install redis-server.
This installs a slightly older version which is fine for us, moves
everything to the correct place, and starts the server. If you want the
most recent version, you have to install from source as explained here,
but you would have to move everything and start the server yourself.
See 'Check if Redis is working' there in any case which we use below.
-
Test that the server is working. The built-in client is redis-cli. Here is how I checked:
ubuntu@ubuntu-VirtualBox:~$ redis-cli ping
PONG
ubuntu@ubuntu-VirtualBox:~$ redis-cli
redis 127.0.0.1:6379> set mykey kip
OK
redis 127.0.0.1:6379> get mykey
"kip"
redis 127.0.0.1:6379> exit
ubuntu@ubuntu-VirtualBox:~$
- Do the 15 minutes introduction to Redis data types link. Create a text file transcript of your commands issued on localhost for submission using redis-cli.
- Implement the following modified example for Redisent with PHP on localhost with your Redis server using your VM. You need to add 5 more commands with output. Then fix the program so that it does not duplicate
every time you reload the page in 2 ways: using a conditional and using
redis sets. Save one screenshot of it working and verify that the keys
are there using redis-cli. You can look up the commands at the redis
docs page. If you need to review anything about php see here.
require_once 'src/redisent/Redis.php';
$redis = new redisent\Redis('redis://localhost');
$redis->set('awesome2', 'absolutely');
echo "Is Redisent awesome? ", $redis->get('awesome2'), "\n";
$redis->rpush('particles2', 'proton2');
$redis->rpush('particles2', 'electron2');
$redis->rpush('particles2', 'neutron2');
$particles = $redis->lrange('particles2', 0, -1);
$particle_count = $redis->llen('particles2');
echo "<p>The {$particle_count} particles that make up atoms are:</p>";
echo "<ul>";
foreach ($particles as $particle) {
echo "<li>{$particle}</li>";
}
echo "</ul>";
You can see that it works with:
redis 127.0.0.1:6379> keys *2
1) "awesome2"
2) "particles2"
- Implement a Twitter clone using the code at Retwis-w-Predis. Submit a screenshot of the timeline.php url on localhost
for your Twitter clone. You must use it to submit at least 8 entries to
the redis server that show up. Verify the keys using redis-cli (as
above) and save in a text file.
- The submission should be an HTML Table of Contents (TOC) with six links:
- to a text file with your transcript for #4
- to a screenshot of #5 running on localhost
- to the 2 source code files for #5
- to the screenshot of timeline.php on localhost for #6
- the text file for #6
Post the single url to gl for your TOC on blackboard for hw4.
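The duplication in the Redisent exercise comes from RPUSH, which appends to a list on every page load. The sketch below imitates that behaviour in Python with in-memory stand-ins rather than a Redis server: a Python list models a Redis list, and a Python set models a Redis set, where adding a member that is already present changes nothing (as with SADD). The page_load_* helpers are hypothetical names used only for illustration; the conditional variant checks for the key before writing, in the spirit of EXISTS.

```python
PARTICLES = ["proton2", "electron2", "neutron2"]


def page_load_list(store):
    """Imitates RPUSH on every reload: the list grows each time."""
    store.setdefault("particles2", [])
    for p in PARTICLES:
        store["particles2"].append(p)


def page_load_conditional(store):
    """Guarded variant: only write when the key is missing (like checking EXISTS first)."""
    if "particles2" not in store:
        store["particles2"] = list(PARTICLES)


def page_load_set(store):
    """Set variant: adding an existing member is a no-op (like SADD)."""
    store.setdefault("particles2", set())
    for p in PARTICLES:
        store["particles2"].add(p)


def simulate(page_load, reloads=3):
    """Run the same page load several times against a fresh in-memory store."""
    store = {}
    for _ in range(reloads):
        page_load(store)
    return len(store["particles2"])


if __name__ == "__main__":
    print("list:", simulate(page_load_list))
    print("conditional:", simulate(page_load_conditional))
    print("set:", simulate(page_load_set))
```

A set does not preserve insertion order, so the displayed order of the particles may differ from the list version.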
Question Set 2 - Due this Sunday
All question sets should be answered in a single plain text file that is nicely formatted for reading.
- What are the major differences in function and history between memcached and redis?
- Why does facebook still use memcached and MySQL when more modern databases are available?
- How and why is replication used for key/value dbs?
- When and why is denormalization good and bad?
- What if you want to query the same value by different keys in redis?
- What does the PHP md5 command do for memcached?
- What is a common partitioning technique for key/value dbs? How does this work?
- Create your own example of an application that uses
redis that would require a decision of how to do data modeling. Explain
the 2 choices and why one is better. Do not use the same example from
the lecture or any reading.
- What are the important differences between Voldemort and Redis?
- How exactly does Retwis implement followers in the code (using what redis feature)?
Week5 - Document - Couchdb1
Readings
- Nosql Distilled Chapter 9
- CouchDB The Definitive Guide - chapters 1-6
- Why CouchDB? - Good video overview of Couchdb. One can watch half this week and half next week.
-
Nosql Database Technology: A Survey and Comparison of Systems
- this is a very good video survey of nosql databases systems covered
in this entire course, including couchdb and a later development of that
called couchbase which combines a key-value cache front-end with a
back-end couchdb document database.
References
- Slides
- Couchdb Site
- CouchDB versus Couchbase: What are the differences, and what happened to Membase?
- Iris Couch - Create an account there before class so we can do an exercise in class.
- Futon: Temp Views / Perm Views / Perm View Query
Project hw5 - Due this Sunday
-
Install Couchdb.
You can either use the package manager or install from source - both
instructions are in the link. Note that one can also use a free account
on a couchdb cloud provider (but you also need to install locally) -
see Iris Couch.
-
Do the introductory tutorial
at the link that shows you how to use both the REST-based command-line
and the Futon web-based tool for couchdb. Save a screenshot of the
result of each of the steps 4-9 on a single web page. Note that in
steps 1-3 (not turned in): 1-install couchdb which you have already
done, 2-you create the database mycouchshop in Futon, and 3-you create an admin user.
-
Now do this more complicated introductory exercise using Futon and curl (Text-only version). Save the text transcript of the curl session and the requested 5 screenshots on a single web page.
-
Submit the url for a TOC on gl with the 2 links above for hw5.
Question Set - None
None are due this week but you should start work on the ones due next week (see next week for the questions).
Week6 - Document - Couchdb2
Readings
- Nosql Distilled Chapter 7
- MapReduce: Simplified Data Processing on Large Clusters
-
CouchDB The Definitive Guide - chapters 15-19
References
- Slides
- PHP-on-Couch - Data Access Library to access a CouchDB server with PHP.
Project hw6 - Due this Sunday
-
Now do this replication exercise using Futon and curl. Save the text transcript of the curl session and the requested 1 screenshot on a single web page (TOC#1).
-
Design and create your own simple database on couchdb for
student information that includes name, address, multiple phone numbers,
course transcript, and several others that you come up with. Add at
least 10 documents and 5 views. Make sure you model the document so
that the views can drill down to specific information such as all
students from a particular state or who received a certain grade. Save a
command-line transcript on localhost of requesting all records and then
for each view (TOC#2).
-
Set up your local couchdb for this student information database
project to replicate once (not continuously) to the couchdb on kip.umbc.edu. Note that the username/password is student/is721. You must namespace your database name on this remote server as: username-dbname. For example, I might use canfield-studinfo. Also note that since kip.umbc.edu
is behind the firewall, you must first start the vpn from vpn.umbc.edu.
Note the url for your replicated database (TOC#3). Also show a screen shot of your couchdb screen on localhost that does the replication (TOC#4).
-
We will do a brief exercise to see how to use couchdb with a programming language library - in this case PHP. Use PHP-on-Couch
- Data Access Library to access a CouchDB server with PHP. Implement
the 2 examples from the 'examples' directory. Place the text source of
the result for each file on gl. Then create a new php program that
accesses your books database and has this resulting source. Note that one must view source to see the output of print_r
formatted nicely. Place the source code and text source result in a
file and put it on gl. So you have 4 files for the TOC links (TOC#5-8).
-
Submit the url for a TOC on gl with the 8 links above for hw6.
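A CouchDB view is defined by a map function that emits key/value pairs for each document; querying the view by key then returns the matching rows in key order. The sketch below imitates that idea in plain Python so the document-modelling decision for the student database is easier to see. The documents, field names, and the build_view and query helpers are all hypothetical; real CouchDB views are written in JavaScript inside a design document.

```python
students = [  # hypothetical documents; field names are assumptions
    {"_id": "s1", "name": "Ann", "address": {"state": "MD"},
     "transcript": [{"course": "IS721", "grade": "A"}]},
    {"_id": "s2", "name": "Bob", "address": {"state": "VA"},
     "transcript": [{"course": "IS721", "grade": "B"},
                    {"course": "IS651", "grade": "A"}]},
    {"_id": "s3", "name": "Cho", "address": {"state": "MD"},
     "transcript": [{"course": "IS651", "grade": "B"}]},
]


def map_by_state(doc):
    """Map function: one row per student, keyed by state."""
    yield doc["address"]["state"], doc["name"]


def map_by_grade(doc):
    """Map function: one row per transcript entry, keyed by grade."""
    for entry in doc["transcript"]:
        yield entry["grade"], (doc["name"], entry["course"])


def build_view(docs, map_fn):
    """Apply the map function to every document and sort the rows by key."""
    rows = [(key, value) for doc in docs for key, value in map_fn(doc)]
    return sorted(rows, key=lambda row: row[0])


def query(view, key):
    """Return the values of all rows whose key matches."""
    return [value for k, value in view if k == key]


if __name__ == "__main__":
    print(query(build_view(students, map_by_state), "MD"))
    print(query(build_view(students, map_by_grade), "A"))
```

Storing the state as its own field, rather than inside a free-text address string, is what lets the first view key on it; the same reasoning applies to keeping grades as structured transcript entries.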
Question Set 3 - Due this Sunday
All question sets should be answered in a single plain text file that is nicely formatted for reading.
- Why are dynamic user populations a driver for nosql databases today?
- Why is scaling of 3-tier architectures important for nosql databases today?
- How does sharding work for RDBMS? What are the issues with it?
- How did the architecture of Farmville on-line game change over time? Why the change?
- How is Membase different than Memcached?
- How is Couchdb different than Couchbase?
- How does eventual consistency work for web apps? Be sure and give specific examples.
- What are the advantages and disadvantages of the document-style nosql databases?
- How does a cookie-based ad-fill application work using nosql technology?
- How are workers used in mapreduce? How does cloud deployment relate to this?
- What is MVCC? How does it work in couchdb?
Week7 - Document - Mongodb1
Readings
- Slides
-
The Little MongoDB Book - Chapters 1-4
-
Why I Chose MongoDB for guardian.co.uk
/ (Slides)
-
Use Cases
References
- Mongodb Site / 10gen Site
- Mongodb Intro Tutorial at the Mongo Site
-
MongoDB, CouchDB, MySQL Compare Grid
- MongoHQ Hosting
- MongoDB and PHP Tutorial
Note on mongo client
=> ./bin/mongo staff.mongohq.com:10089/unicorn -u canfield -pxxx
MongoDB shell version: 2.4.5
connecting to: staff.mongohq.com:10089/unicorn
> db.nettuts.find()
{ "_id" : ObjectId("5066104df85ba10002000102"), "dob" : "01/01/30", "first" : "john", "gender" : "m", "hair_colour" : "bald", "last" : "wayne", "nationality" : "american", "occupation" : [ "actor" ] }
{ "_id" : ObjectId("50659d62ec5855af36c8440e"), "dob" : "22/11/1958", "first" : "Jamie Lee", "gender" : "f", "hair_colour" : "brown", "last" : "curtis", "nationality" : "american", "occupation" : [ "actor" ] }
{ "_id" : ObjectId("50659d61ec5855af36c8440c"), "dob" : "03/06/1925", "first" : "arnold", "gender" : "m", "hair_colour" : "brown", "last" : "schwarzenegger", "nationality" : "american", "occupation" : [ "actor", "politician" ] }
{ "_id" : ObjectId("5065c64e803b1a0002000001"), "first" : "kip", "gender" : "m", "last" : "canfield", "occupation" : [ "professor" ] }
> exit
bye
=>
Project hw7 - Due this Sunday
-
Do the introductory tutorial: Getting Started with MongoDB - Part 1 on localhost. Save a transcript.
- Do the tasks in chapters 1-4 of the mongodb book on localhost. Save a transcript.
- Create a php file on localhost using PHP and HTML based on the MongoDB and PHP Tutorial. The php file should give this output. Submit the screenshot with different values and the source code. One will output html rather than use print_r. For example: echo "<p>" . $document["title"] . "</p>";
- Sign up for a MongoHQ
free account. Create your nettuts database/collection on MongoHQ.
Create a read-only user for me. Note the full client url to this: mongo
xyz.mongohq.com:12345/tutorial -u user -ppassword. Where tutorial is the database name and the user is the one you created for me.
- The Guardian uses MongoDB to store the data available in its REST API. See Tier1 access that does not require an API key. Save 5 queries and results
in 5 text files. 3 should be json and 2 should be xml. Use all the
parameters from endpoints 1-2 somewhere in the 5 queries. Do not copy
queries exactly from the documentation. Create a TOC that has 5 links.
-
Submit the url for a top-level TOC on gl with the 6 links above for hw7.
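For the PHP page in this project, each document returned from the collection is a set of field/value pairs that you render as HTML instead of dumping with print_r. The sketch below shows the same idea in Python without a MongoDB server: the documents are copied from the nettuts transcript above (with the _id and some other fields omitted), and the matches and render helpers are hypothetical. Matching a single value against an array field mirrors how a MongoDB equality query treats arrays.

```python
from html import escape

docs = [  # sample documents taken from the nettuts transcript, trimmed
    {"first": "john", "last": "wayne", "occupation": ["actor"]},
    {"first": "arnold", "last": "schwarzenegger", "occupation": ["actor", "politician"]},
    {"first": "kip", "last": "canfield", "occupation": ["professor"]},
]


def matches(doc, criteria):
    """Simplified equality match: a list field matches if it contains the value."""
    for field, wanted in criteria.items():
        actual = doc.get(field)
        if isinstance(actual, list):
            if wanted not in actual:
                return False
        elif actual != wanted:
            return False
    return True


def render(doc):
    """Build one HTML paragraph per document, escaping the field values."""
    name = escape(f"{doc['first']} {doc['last']}")
    jobs = escape(", ".join(doc["occupation"]))
    return f"<p>{name} ({jobs})</p>"


if __name__ == "__main__":
    for doc in docs:
        if matches(doc, {"occupation": "actor"}):
            print(render(doc))
```

Escaping the values before placing them in markup keeps stray characters such as < or & in the data from breaking the page.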
Question Set - None
None are due this week but you should start work on the ones due next week (see next week for the questions).
Week8 - Document - Mongodb2
Readings
- Slides
-
The Little MongoDB Book - Chapters 5-7
-
O'Reilly Webcast: MongoDB and PHP /
(Slides)
- Sharding Introduction
- Controversy:
References
- Choosing a Good Shard Key
- Server-side vs. Client-side control - read just the Course Introduction.
- v2.4 adds hashed shard keys - note that this has changed since last year!
Project hw8 - Due this Sunday
-
Do the part2 tutorial: Getting Started with MongoDB - Part 2 on localhost. Save a transcript.
-
Redo the Student Information project from the CouchDB section
for MongoDB on MongoHQ. Then create a CRUD application on gl for the
remote database that uses the MongoHQ REST API. Create a TOC. Give the
url to the index.html for the project on gl and links for me to peruse all source code (save a duplicate of all .php files as .txt for me). See the sample. Generally, you will need about 10 pages, depending on how you design it:
- An entry index.html that calls some .php pages to query (retrieve) the database (returns json) and display for either students or courses for the following (5 pages):
- all student records
- all course records (remember that this is a separate collection)
- students by gender
- students taking a specific course with an associated link to see course information
- students by year (freshman etc.)
- students by state and/or city
- A form to create a new record for a course only. (2 pages)
- A form to update the course records only. (1 page)
- A form to delete a record for a course or a student. (2 pages)
-
Submit the url for a top-level TOC on gl with the 2 links above for hw8.
Question Set 4 - Due this Sunday
All question sets should be answered in a single plain text file that is nicely formatted for reading.
- How is a collection different than a relational table?
- How are embedded documents similar to joins? How different?
- Why is Mongodb good for a logging use case?
- What are compound keys? When specifically would you use one for Mongodb? Give a real-world example using PHP.
- What issues related to adaptability were important to the Guardian decision?
- Describe the evolution of the Guardian architecture in stages with special attention to the model at each stage.
- How is database query different for Mongodb vs. Couchdb vs. relational? Include discussion of update queries.
- What are important schema upgrade/change issues?
- What are important replication issues for Mongodb? Include sharding, consistency, flexibility, others, etc.
- Describe the new Guardian identity model with Mongodb. How
does it fit into the current complete architecture and API? What is the
future plan for the architecture?
- Summarize and discuss problems and controversies for Mongodb.
Week9 - XML - Existdb1
Readings
- General Slides
- XQuery Slides
- XML and Databases
- nosql and the Future of XML
- XQuery Tutorial
References
- See: XQuery/TEI Document Timeline (class demo)
- eXist-db Open Source Native XML Database (note especially the documentation here, including the screencasts)
- www.28msec.com (XQuery on Mongodb)
- SAXON The XSLT and XQuery Processor (we use saxon9he.jar for hw)
- Additional XQuery tutorials/references
- XQuery Examples (used in hw)
- Choosing eXist-db as the XML database for the reBiND project
Notes
- XSLT implementations are generally optimized for transforming
entire documents; they load the entire input document into memory.
XQuery is optimized for selecting fragments of data, for example, from a
database. It is designed to be scalable and to take advantage of
database features such as indexes for optimization.
- XQuery has a more compact, non-XML syntax, which is sometimes
easier to read and write (and embed in program code) than the XML syntax
of XSLT.
- XQuery is designed to select from a collection of documents as
opposed to a single document. FLWORs make it easy to join information
across (and within) documents. Also, XSLT 2.0 stylesheets can operate on
multiple documents, but XSLT processors are not particularly optimized
for this less common use case.
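To make the FLWOR idea concrete, the sketch below mirrors a for / where / order by / return expression in Python using the standard-library ElementTree module. It is an analogy, not XQuery: the inline bookstore document is an abbreviated sample in the style of the W3Schools bookstore file, and titles_over is a made-up helper name.

```python
import xml.etree.ElementTree as ET

BOOKS_XML = """
<bookstore>
  <book category="WEB"><title>XQuery Kick Start</title><price>49.99</price></book>
  <book category="CHILDREN"><title>Harry Potter</title><price>29.99</price></book>
  <book category="WEB"><title>Learning XML</title><price>39.95</price></book>
</bookstore>
"""


def titles_over(root, min_price):
    """Rough equivalent of:
       for $x in /bookstore/book
       where $x/price > min_price
       order by $x/title
       return $x/title
    """
    selected = [
        book.findtext("title")                         # return
        for book in root.findall("book")               # for
        if float(book.findtext("price")) > min_price   # where
    ]
    return sorted(selected)                            # order by


if __name__ == "__main__":
    root = ET.fromstring(BOOKS_XML)
    print(titles_over(root, 30))
```

Each clause of the FLWOR maps onto one part of the comprehension, which is a convenient way to read the longer queries in the project below.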
Project hw9 - Due this Sunday
-
Install Exist-db on localhost using the jar file and read the Main Documentation (note). Use version 1.4.3 (not the current 2.x). Also download and install the Saxon processor. You need to download the zip file for Java for Saxon-HE 9.4. Also read the XQuery documentation under Using XQuery - Running XQuery from the Command Line.
-
Read and study the XQuery Basic and Advanced sections.
-
Implement all the XQuery examples for the W3Schools bookstore
example on localhost. First modify the given books.xml file to have 5 additional books for a total of 9 books. Create a transcript of all these examples working and put them in a text file. Use the url to this text file as the first item in your TOC on gl for Existdb1. This is how your transcript should look. See this note on the data function. Here are the 8 examples repeated from the W3Schools reading:
- doc("books.xml")/bookstore/book/title
- doc("books.xml")/bookstore/book[price<30]
- doc("books.xml")/bookstore/book[price>30]/title
- for $x in doc("books.xml")/bookstore/book
where $x/price>30
return $x/title
- for $x in doc("books.xml")/bookstore/book
where $x/price>30
order by $x/title
return $x/title
- <ul>
{
for $x in doc("books.xml")/bookstore/book/title
order by $x
return <li>{$x}</li>
}
</ul>
- <ul>
{
for $x in doc("books.xml")/bookstore/book/title
order by $x
return <li>{data($x)}</li>
}
</ul>
- for $x in doc("books.xml")/bookstore/book
return if ($x/@category="CHILDREN")
then <child>{data($x/title)}</child>
else <adult>{data($x/title)}</adult>
- Write an xquery that transforms the plant_catalog.xml to look like this.
Pay special attention to the w3schools advanced section for guidance.
Then redo examples e-h above using that transformed plant catalog with
appropriate changes to the XQuery. Save a transcript of the xquery transform and the 4 examples working on the command-line. Be sure and cat each file you are calling before you call it so I can see the source code - there is an example in the sample transcript.
- Do the examples in reference#6 chapter 1 examples 5-12. Save a transcript from the command-line.
-
Use this remote xml file: http://www.dbis.informatik.uni-goettingen.de/Mondial/mondial.xml. It is a very large file. Mondial is a database with global demographic and geographic information in xml format. Get familiar with the file and its tags.
-
Study this XQuery, figure out how it works, and get it working in your gl account. Create a link to the output xml file as the third TOC item for Existdb1. Note that text()
is an XPath function that returns just the text content part of the
selected node and the '@' sign selects attributes - see the XPath
reference at W3Schools.
<result>
{ for $country in doc ("http://www.dbis.informatik.uni-goettingen.de/Mondial/mondial.xml")/mondial/country
return
<item>
<country> {$country/name/text()} </country>
{ for $ethnicgroup in $country/ethnicgroups
return
<ethnicgroup> {$ethnicgroup/@percentage} {$ethnicgroup/text()}</ethnicgroup>
}
</item>
}
</result>
It will return a large number of records like:
<item>
<country>Swaziland</country>
<ethnicgroup percentage="3">European</ethnicgroup>
<ethnicgroup percentage="97">African</ethnicgroup>
</item>
-
Modify the above XQuery to produce a large number of records in the following format. Create a link to the result file and to the source XQuery for the TOC.
<item>
<country car_code="SD">Swaziland</country>
<borders country="RSA"/>
<borders country="MOC"/>
</item>
- Now put the XML files and XQuery example 11 from step 5 into your Exist database on localhost. Save a screen shot of it running in the browser from Exist.
-
Submit the url for your TOC with 7 items for Existdb1 on blackboard.
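The nested FLWOR in the Mondial ethnic-group query walks every country and, for each one, every ethnic group, building a new item element as it goes. The sketch below reproduces that shape in Python with ElementTree on a tiny inline sample instead of the full remote file. The sample's element layout is inferred from the query and the expected output shown in the project, so treat it as an assumption about the real file; build_result is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<mondial>
  <country car_code="SD">
    <name>Swaziland</name>
    <ethnicgroups percentage="3">European</ethnicgroups>
    <ethnicgroups percentage="97">African</ethnicgroups>
  </country>
</mondial>
"""


def build_result(mondial):
    """Build <result><item>...</item></result> with one item per country."""
    result = ET.Element("result")
    for country in mondial.findall("country"):            # outer for
        item = ET.SubElement(result, "item")
        ET.SubElement(item, "country").text = country.findtext("name")
        for group in country.findall("ethnicgroups"):     # inner for
            out = ET.SubElement(item, "ethnicgroup")
            out.set("percentage", group.get("percentage"))  # copy the attribute
            out.text = group.text                            # text content only
    return result


if __name__ == "__main__":
    result = build_result(ET.fromstring(SAMPLE))
    print(ET.tostring(result, encoding="unicode"))
```

Working against a small sample like this first makes it easier to check the output structure before running a query against the very large remote file.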
Question Set - None
None are due this week but you should start work on the ones due next week (see next week for the questions).
Week10 - XML - Existdb2
Readings
- Slides
- XRX: Simple, Elegant, Disruptive (cached)
- Web Services: JSON vs. XML
- Programming Application Logic for RESTful Services Using XML Technologies
- Adapting the XML stack for a world of XML, HTML, JSON and JavaScript
References
- XSLTForms (allows browsers with no native XForms support to use them)
- XForms Tutorial
- XRX Tutorial (this has the Dictionary Editor application)
- Balisage: The Markup Conference
- Clinical Document Architecture
- The java client for exist
- The install for Ubuntu
Project hw10 - Due this Sunday
- Make sure you have version 1.4.3 installed for Exist-db.
- Create an XRX application for very simple medical records in Exist-db on localhost. The short medical records should be in soap note format. You will create an xml format that follows this organization. Each patient will be in a separate document.
- Base your application on the Dictionary Editor
Sample Application in the XRX Tutorial. You should get that working
first. It is actually based on the 'item-manager' application from the subversion repository. We will demo an example project in class.
- Your application will include 15 soap note record documents,
a CRUD component, and 5 xquery reports that summarize across documents.
- Create a TOC-style entry for your application on localhost
that lists patients and allows browsing them and the reports. Submit a
single url for a separate TOC on gl for Existdb2 that allows me to browse screenshots for the entry page, 3 of the patients, and all 5 reports. Also, allow me to browse all the XQuery source code for the reports as text from that TOC.
Question Set 5 - Due this Sunday
All question sets should be answered in a single plain text file that is nicely formatted for reading.
- What are the issues with data vs. document style xml documents for persistence in a database? How does middleware fit in?
- How is locking variously implemented in xml native databases? What are the issues for performance?
- How are the nosql and xml database toolchains at odds? How compatible?
- Why is XRX claimed to have simpler application development? Give a particular example not from the reading.
- Describe the IHE from the reading. Give special attention to XProc and transaction issues.
- How can json be used with XQuery? Why is this a good idea for json? Is it applicable to xml?
- Why is xml technology good for specific application areas but falling into less use in the general web application area?
- Does Saxon use a DOM or a SAX parser? Which would be best for
processing a long doc like in Existdb1ex6? Why? What about a
transform of Existdb1ex7 that would allow a transform to html that
inserted links to resolve the actual border country names (not just
codes)?
- Describe the typical modern xml support in the postgresql database. When is it good to use these extensions and when is a BLOB just as good (give examples)?
- What is the CCD health document standard? Give a brief example of the xml for 2 diagnostic lab results.
Week11 - Graphdb
Readings
- The Semantic Web - review from is651
- Graph Databases
- How RDF Databases Differ from Other nosql Solutions
- SPARQL By Example - A Tutorial (pdf)
- XQuery/SPARQL Tutorial and DBpedia with SPARQL
References
- Graphdb - Slides
- Graphdb - Video Lecture
- OpenRDF.org - the home of Sesame
- Neo4j: nosql for the Enterprise
- SPARQL Cheat Sheet
- SPARQL Construct, Sesame Tutorial Exercise
- XQuery Example, XQuery Example Results
- Dbpedia URI Lookup Services
- Language Tags
- URL Encoding, URL Encoding Example
- Mondial Site
Project hw11 - Due this Sunday
- You need to do this project on localhost.
- Install a servlet container - Apache Tomcat6 or 7, docs, and admin using
apt-get.
- Deploy the Exist-db XML database (use version 1.4.3) as a WAR file using the Tomcat manager.
- Deploy the Triple Store - Sesame as a WAR file using the Tomcat manager.
- Create a repository called 'your_username-mondial' on Sesame. Load the Mondial Europe database and meta-data into that repository.
- Run the sample SPARQL query from Mondial in Sesame to verify (screenshot
required - TOC1).
- Browse the Mondial data for countries to see the fields that are there. Then
browse Dbpedia to find 5 different fields appropriate to countries that are
not in Mondial.
- Use CONSTRUCT queries to get that data out of Dbpedia and load the triples into
your Mondial database on Sesame. Create a single text file that contains all 5
queries - TOC2.
- Do a SPARQL SELECT query in Sesame to verify with at least 2 of the new fields
and 2 of the old ones (screenshot required - TOC3).
- Create 5 XQuery files (.xq) that issue queries to your Sesame database and
return HTML pages. Create a single text file that contains all 5 queries - TOC4.
Save the HTML return from each query in an .html file - TOC5-9.
- Create a screenshot of your .xq files residing in the Exist-db administrator web
interface - TOC10.
- Create a TOC on gl with 10 links to all these files and submit that single url as your
hw11.
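The CONSTRUCT step above can be sketched in Python as follows. This is only a minimal sketch under stated assumptions, not the required solution: it builds and URL-encodes a request but makes no network call. The endpoint address https://dbpedia.org/sparql, the choice of dbo:populationTotal as the example field, the placeholder target predicate, and the helper names build_construct_query and build_construct_url are all assumptions made for illustration.

```python
from urllib.parse import urlencode, parse_qs, urlsplit

# Assumed public DBpedia SPARQL endpoint; change it if you use another one.
DBPEDIA_ENDPOINT = "https://dbpedia.org/sparql"

# Hypothetical predicate for the new field in your own repository.
TARGET_PREDICATE = "http://example.org/mondial-extra#populationTotal"


def build_construct_query(limit=10):
    """Return a SPARQL CONSTRUCT query that copies one DBpedia field."""
    return (
        "PREFIX dbo: <http://dbpedia.org/ontology/>\n"
        "CONSTRUCT { ?country <" + TARGET_PREDICATE + "> ?value }\n"
        "WHERE {\n"
        "  ?country a dbo:Country ;\n"
        "           dbo:populationTotal ?value .\n"
        "}\n"
        "LIMIT " + str(int(limit))
    )


def build_construct_url(query):
    """URL-encode the query as the standard 'query' request parameter."""
    return DBPEDIA_ENDPOINT + "?" + urlencode({"query": query})


if __name__ == "__main__":
    url = build_construct_url(build_construct_query())
    print(url)
    # Decode the URL again to confirm the query survived the encoding.
    print(parse_qs(urlsplit(url).query)["query"][0])
```

The encoded form is what the URL Encoding reference describes: spaces, angle brackets, and the '#' and '?' characters in the query are escaped so that the whole query travels as one parameter value.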
Question Set - None
None are due this week.
Week12 - Cassandra1
Readings
- Bigtable: A Distributed Storage System for Structured Data (Google - read Sections 1-4 and 11 - cached pdf)
- Amazon's Dynamo - (read Sections 1-5 - cached pdf)
- Cassandra - A Decentralized Structured Storage System (cached pdf)
- Cassandra Essentials Series (about 6min each video)
- Overview of Apache Cassandra
- Managing Objects and Data in Cassandra
- C* Summit 2013: Jonathan Ellis Keynote
- Cassandra 2.0 Better, Faster, Stronger
References
- Cassandra1 - Slides
- Cassandra1 - Video Lecture (has old, deprecated parts from earlier versions - we will show the differences in class)
- Apache Cassandra 2.0 Documentation
- CQL3 for Cassandra experts - shows evolution of cql
Project hw12 - Due this Sunday
- Install cassandra from a binary tarball, not the apt-get package. The dsc.tar.gz archive has the current version of 2.0.2. You do not need to install the opscenter. Start cassandra with sudo bin/cassandra -f to put it in the foreground and open another terminal window to continue work.
- Continue on to the Two minute overview and do the exercise in the Querying Cassandra section. Save the command-line transcript as TOC1.
- Continue on to the cassandra 2.0 documentation. Read all the brief sections.
- Now go to the CQL for Cassandra 2.0 and do the tasks in CQL data modeling/Data modeling example/Example of a music service.
Save the command-line transcript as TOC2. Add 10 more songs with 3
tags each and 4 playlists. Do 8 queries of your database showing all
important aspects of the database.
- Submit your TOC with 2 links as project hw12.
Question Set - None
None are due this week but you should start work on the ones due next week (see next week for the questions).
Week13 - Cassandra2
Readings
- Cassandra Essentials Series (about 6min each video)
- Understanding Partitioning and Replication in Cassandra
- Understanding Data Consistency in Cassandra
- Keeping Movies Running Amid Thunderstorms!
References
- Cassandra2 - Slides
- Cassandra2 - Video Lecture (has old, deprecated parts from earlier versions - we will show the differences in class)
- The evolution of schema in Cassandra
- Cassandra By Example: Data Modelling with CQL3 - slides - github
Project hw13 - see syllabus for due date
- Design and create a time series database
(note questions at bottom) using the cql3 command-line. Use a python
program to generate 100 values for it. You need to use different names
than in the sample for everything in the database that are more
descriptive of your own database. Do 5 different queries with cql3 to
show time slices. Your complete command-line cql3 transcript of this
task is the TOC (you can truncate the 100 row output). At the top of
the TOC, write a brief paragraph describing your database.
- Submit your TOC as project hw13.
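The python generator mentioned above might look something like the sketch below. It is one possible approach under stated assumptions, not the sample from class: the keyspace, table, and column names are hypothetical (you must pick your own), and it assumes a table whose partition key is a text sensor id and whose clustering column is a timestamp. It only prints CQL INSERT statements; it does not connect to cassandra.

```python
import random
from datetime import datetime, timedelta

# Hypothetical names; the assignment asks you to choose your own.
KEYSPACE = "greenhouse"
TABLE = "soil_moisture_by_sensor"


def generate_inserts(sensor_id, start, count=100, step_seconds=60, seed=721):
    """Return `count` CQL INSERT statements, one reading per time step."""
    rng = random.Random(seed)  # fixed seed so reruns give the same data
    statements = []
    for i in range(count):
        ts = start + timedelta(seconds=i * step_seconds)
        value = round(rng.uniform(20.0, 40.0), 2)
        statements.append(
            "INSERT INTO {ks}.{tbl} (sensor_id, reading_time, moisture) "
            "VALUES ('{sid}', '{ts}', {val});".format(
                ks=KEYSPACE, tbl=TABLE, sid=sensor_id,
                ts=ts.strftime("%Y-%m-%d %H:%M:%S"), val=value,
            )
        )
    return statements


if __name__ == "__main__":
    for line in generate_inserts("sensor-01", datetime(2013, 11, 1, 0, 0, 0)):
        print(line)
```

The printed statements can be pasted into the cql3 command line. Because the readings are evenly spaced in time, a time slice is then a range restriction on the clustering column for one sensor.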
Question Set F - see syllabus for due date
All question sets should be answered in a single plain text file that is nicely formatted for reading.
- How does dynamo handle consistency and availability? What are the issues that must be resolved?
- Why and how does dynamo modify regular consistent hashing? How does this affect replication?
- How does dynamo handle shopping cart inconsistency?
- Describe the BigTable data model. How does it handle persistence and availability?
- Why did Facebook develop cassandra? What features of older systems did it combine?
- What is a good approach to data modeling for cassandra? Why is this different from relational methods?
- Discuss the evolution of query slices in cassandra. How
exactly do they relate to the underlying database and the query
language? Include discussion of 'with compact storage.'
- Why did Netflix change its architecture? Describe the rationale for each change.
- How does Netflix handle rapidly changing demand for services? Give a scenario.
- What is the 'thundering herd'? How is it handled?
- Describe the strategy, operation, and issues with replication and partitioning in cassandra.
Week14 - Hadoop1
Readings
- The following videos are from BigData University which
is an IBM project. You need to watch and take notes on each video in
the 4 lessons. It looks like a lot of videos but each is just a few
minutes long. So the whole thing takes less than an hour. Each lesson
link shows you a directory browse of the individual videos in the lesson
with an associated text transcript in PDF format.
References
- Hadoop1 - Slides
- Hadoop1 - Video Lecture
- BigData University
- Understanding Big Data - The BigData University textbook in PDF format. Chapter 4 basically covers the same information as our 4 lessons above.
- Introduction to Amazon Web Services (video 13 min).
This is a good video to show you more concrete information on how a
Cloud provider is set up and charges for service. Hadoop processing is
typically done by Cloud providers.
Project hw14 - Due this Sunday
- Install the job from: Running Hadoop On Ubuntu Linux (Single-Node Cluster) on localhost. The lecture slides give notes and the 7 TOC deliverables on this project. Include your actual umbc username in all directories and filenames you create - e.g. canfield-gutenberg and canfield-pg20417.txt.
- Submit your TOC as project hw14.
Question Set - None
None are due this week but you should start work on the ones due next week (see next week for the questions).
Week15 - Hadoop2
Readings
- The following videos are from BigData University which discuss Text Analytics.
- HBase - excerpt from: Hadoop: The Definitive Guide, Third Edition by Tom White, O’Reilly, 2012.
- HBase vs Cassandra: why we moved
References
- Hadoop2 - Slides
- Hadoop2 - Video Lecture
- A Guide to Python Frameworks for Hadoop
- Running Hadoop On Ubuntu Linux (Multi-Node Cluster)
Project hw15 - Due this Sunday
- Run the job from: Writing An Hadoop MapReduce Program In Python on your hadoop cluster. The lecture slides give notes and 4 specific TOC deliverables on this project.
- Run the job from: Hadoop Streaming Made Simple using Joins and Keys with Python on your hadoop cluster. The lecture slides give notes and 2 specific TOC deliverables on this project.
- Use the python code from #2 to process the data at State & County QuickFacts on your hadoop cluster. One must combine
two data sets to produce 2 reports - one to report the count of all the
counties in a state and one to report the population for a state by
summing county populations. Each report from the reducer will have
2-columns: state code|sum. The lecture slides give notes and 7 specific
TOC deliverables on this project.
- Submit your overall TOC as project hw15. Include your actual umbc username in all
directories and filenames you create for all exercises - e.g.
canfield-fad-output and canfield-fad.csv. The file/directory name part
can be whatever you want.
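The summing reducer for the two reports above can be sketched roughly as follows. This is a hedged illustration, not the code from the tutorial: it assumes the mapper emits one tab-separated "state code, number" line per county (the number 1 for the county count report, the county population for the population report) and that hadoop streaming delivers those lines to the reducer sorted by key. The function names parse and reduce_sums are made up for this sketch.

```python
import sys
from itertools import groupby


def parse(lines, sep="\t"):
    """Yield (state_code, number) pairs, skipping malformed lines."""
    for line in lines:
        parts = line.rstrip("\n").split(sep)
        if len(parts) != 2:
            continue
        try:
            yield parts[0], int(parts[1])
        except ValueError:
            continue  # non-numeric value, e.g. a header row


def reduce_sums(lines, sep="\t"):
    """Sum the values for each key; input must already be sorted by key."""
    for state, group in groupby(parse(lines, sep), key=lambda kv: kv[0]):
        yield state, sum(value for _, value in group)


if __name__ == "__main__":
    # Hadoop streaming feeds the sorted mapper output on standard input.
    for state, total in reduce_sums(sys.stdin):
        print("%s\t%d" % (state, total))
```

The same reducer serves both reports because only the mapper output differs. Grouping consecutive lines with groupby works only on sorted input, which is why the sort between the map and reduce phases matters.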
Question Set G - Due this Sunday
All question sets should be answered in a single plain text file that is nicely formatted for reading. Note that the slides have more information on the query tasks.
- How does hadoop rack awareness affect streaming vs. random access?
- What is heartbeat and how and why is it used in a hadoop cluster?
- Compare using pig, hive, jaql, java, and python as query languages for hadoop. What are the differences and advantages of each?
- Do the pig command-line task in the part2 video. Submit the url to the transcript as the answer for this question.
- Do the hive command-line task in the part3 video. Submit the url to the transcript as the answer for this question.
- Do the jaql command-line task in the part4 video. Submit the url to the transcript as the answer for this question.
- Describe the steps for a scenario where hadoop is used for text analytics like log analysis. Be specific about the steps.
- What is hadoop streaming? How is it relevant to our python examples?
- What is HBase and what is its relationship to hdfs? Describe when it would be used.
- Compare cassandra and HBase. What are the differences, similarities, and what is the best use of each?
- What is an embedded table? What is it used for in hadoop-related technology?