Spring 2016 - Projects On Big Data Software (I590 - Geoffrey C. Fox)¶
- Course Page: http://datascience.scholargrid.org
- FutureSystems Project Page: https://portal.futuresystems.org/projects/491
Technology Section¶
Units in Technology Section - Spring 2016¶
Overview¶
The Projects on Big Data Software course presents lessons in two sections: Theory and Technology. The units in the Technology section are listed on this page. For the Theory units, syllabus, and discussions, please use the course site scholargrid.org.
Schedule for Units in Technology Section
Topic | Due |
---|---|
Gaining Access to FutureSystems and Core Technologies | 01/25 |
Gaining Access to FutureSystems and Core Technologies¶
In this unit, you will learn how to gain access to the FutureSystems resources. It covers portal account creation, joining the class project, SSH key generation, and login node access. Additional lessons have been prepared for beginners to cover the basics of the Linux operating system and collaboration tools, i.e. GitHub, Google Hangouts, and Remote Desktop. Please watch the video lessons and read through the web content.
Topic | Video | Text |
---|---|---|
Overview and Introduction | 16 mins | 10 mins |
| 4 mins | 15 mins |
GitHub | 18 mins | 30 mins |
Topic | Video | Text |
---|---|---|
ssh-keygen | 4 mins | 10 mins |
Account Creation | 12 mins | 10 mins |
Remote Login | 6 mins | 10 mins |
Putty for Windows | 11 mins | 10 mins |
Topic | Video | Text |
---|---|---|
Overview and Introduction | 4 mins | 5 mins |
Shell Scripting | 15 mins | 30 mins |
| 5 mins | 30 mins |
| 27 mins | 1 hour |
| 3 mins | 10 mins |
| 3 mins | 20 mins |
Modules | 3 mins | 10 mins |
Note
Find an editor that you will use for your programming. For advanced Python programming we recommend PyCharm, but you can use others, e.g. Enthought Canopy, on your local computer. A common way to work is to edit Python code locally, push the code to GitHub, and check it out on your VM or your login node on india.futuresystems.org. This is how many of us work.
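The edit-locally / push / pull loop described above can be sketched end to end. This is a self-contained illustration only: a local bare repository stands in for github.com, a second clone stands in for your VM or login node, and all paths and file names are invented for the example.

```shell
# Simulate the edit-locally -> push -> pull-on-the-VM loop.
# A local bare repository stands in for github.com so this runs anywhere.
set -e
WORK=$(mktemp -d)
git init --bare --quiet "$WORK/hub.git"            # stand-in for your GitHub repo

git clone --quiet "$WORK/hub.git" "$WORK/laptop"   # your local machine
cd "$WORK/laptop"
git config user.email "you@example.com"
git config user.name "You"
echo 'print("hello")' > hello.py                   # edit Python locally
git add hello.py
git commit --quiet -m "edit locally"
git push --quiet origin HEAD                       # push the code to "GitHub"

git clone --quiet "$WORK/hub.git" "$WORK/vm"       # on the VM / login node
cat "$WORK/vm/hello.py"                            # the code is now there
```

In real use, the bare repository is replaced by your github.com remote, and the second clone happens over SSH on your VM or on india.futuresystems.org.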
Length of the lessons in this Unit¶
- Total of video lessons: 2 hours
- Total of study materials: 4 hours and 30 minutes
Assignment HW¶
Topic | Description |
---|---|
Start with Account, Github and Python | 9 tasks |
HW2: Get Ready for FutureSystems¶
Guidelines¶
- Assignments must be completed individually.
- Discussion is allowed (e.g. via Slack), but the submission must be made by yourself. Acknowledge your helpers'/collaborators' names in your submission if you discussed the work or received help from anyone.
- Use an individual github repository. A repository in FutureSystems will be given later.
Tasks¶
Complete the following tasks. Place all answers in a file named HW2-$USERID.txt and submit it via IU Canvas. (Replace $USERID with your email name at IU e.g. HW2-albert.txt if your email address is albert@indiana.edu)
Example view of your submission:
1. albert
2. ...
3. ...
9. http://...
FutureSystems Access¶
Sign up at portal.futuresystems.org if you do not already have an account. Provide your portal ID in your submission.
Join the class project and provide the project number in your submission.
Generate a new SSH key and register it on portal.futuresystems.org. Provide your key fingerprint in your submission.
SSH into india.futuresystems.org with your registered key. Run the following command and attach its output (plain text) to your submission. (Most SSH clients let you copy and paste from the screen with the mouse.)
finger $USER
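The key-generation step above can be sketched as follows. This writes into a temporary directory with an empty passphrase purely so the example is repeatable; for the actual assignment, keep the default $HOME/.ssh/id_rsa location and set a passphrase. The comment string your_portal_id is a placeholder.

```shell
# Generate an RSA key pair and print the fingerprint that you would
# register on portal.futuresystems.org. Temporary directory and empty
# passphrase are for illustration only -- use $HOME/.ssh and a real
# passphrase for the assignment.
KEYDIR=$(mktemp -d)
ssh-keygen -q -t rsa -b 2048 -N '' -f "$KEYDIR/id_rsa" -C "your_portal_id"
ssh-keygen -l -f "$KEYDIR/id_rsa.pub"   # fingerprint to paste into the portal
```

Once the public key is registered on the portal, you can log in with `ssh your_portal_id@india.futuresystems.org`.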
GitHub¶
Sign up at github.com and register your SSH key. Provide your github.com user name in your submission.
Create an ‘I590-Projects-BigData-Software’ repository on your account and create an ‘hw2’ branch. Provide a clone URL of the ‘hw2’ branch in your submission.
This is a question for you to answer with the appropriate git commands satisfying the following description:
Albert has some Python code files that he was developing on his local machine, but he wanted to use github.com to track changes and share his work with others. He had already created a new repository named ‘BigData’ on his GitHub account, so he made a copy of the repository on his machine; there was nothing in the repo yet. He first added a ‘README.rst’ file to describe his repository. To make sure his description looked okay, he pushed his update to GitHub and opened a web page to check. When he viewed his repository on github.com in a web browser, he found that the contact info was missing, so he added it to the README.rst file online using the browser, and his description then showed the new contact info. He returned to his local repository and updated it, because he wanted to sync the changes he had made on github.com. His next task was adding new_feature.py and bug.py in separate branches, not in master, because he considered these two files works in progress with different purposes. He simply created ‘next’ and ‘error’ branches in his current repository and added the two files accordingly. All of his work is applied to github.com.
List the git commands that Albert used in the work above in your submission.
Python¶
- Write a Python program called fizzbuzz.py that accepts an integer n from the command line. Pass this integer to a function called fizzbuzz. The fizzbuzz function should then iterate from 1 to n. If the ith number is a multiple of two, print “fizz”; if a multiple of three, print “buzz”; if a multiple of both, print “fizzbuzz”; otherwise print the value.
- Create a ‘hw2’ branch on your github repository ‘I590-Projects-BigData-Software’ and add fizzbuzz.py to the branch. Provide a clone URL for the branch in your submission.
HW3: OpenStack Exercise¶
Guidelines¶
- Assignments must be completed individually.
- Discussion is allowed (e.g. via Slack), but the submission must be made by yourself. Acknowledge your helpers'/collaborators' names in your submission if you discussed the work or received help from anyone.
- Use an individual github repository. A repository in FutureSystems will be given later.
Create IU GitHub Account¶
- Simply login to https://github.iu.edu with your IU username and password (the same IU credentials you use on other IU sites, e.g. one.iu.edu)
OpenStack Command Line Tool nova¶
OpenStack Kilo is ready to use (as of 02/04/2016) on FutureSystems, and you will have a virtual instance (server) running via the OpenStack command line tool nova once you complete all the tasks in this assignment. The tasks you need to complete are:
- SSH into india.futuresystems.org and enable the nova command
- Register an SSH key on OpenStack:
  - rsa type
  - with the default key file names:
    - public: $HOME/.ssh/id_rsa.pub
    - private: $HOME/.ssh/id_rsa
  - passphrase enabled
- Start a single instance:
  - on the fg491 project
  - with an m1.small flavor
  - an Ubuntu-15.10-64 image
  - the key registered above
  - and the vm name hw3-$OS_USERNAME
- Assign a floating IP address
- Install required software on the virtual instance:
  - virtualenv
  - pip
  - ansible
Warning
Do not terminate your instance, even if you completed and submitted hw3.
Test Program¶
We provide an hw3.py test file in your repository; check out the hw3 branch. Run it on india.futuresystems.org once you have completed all the tasks above. All available tests should succeed without errors. First, clone your private repository from IU GitHub. See details in the IU GitHub Guidelines.
You will use virtualenv to prepare packages.
Run:
bash setup.sh
source $HOME/bdossp_sp16/bin/activate
Now, you can run the test program:
python hw3.py
If you have completed everything, you should see:
...........
----------------------------------------------------------------------
Ran 11 tests in 1.646s
OK
After running hw3.py, you will find an hw3-results.txt file in your current directory. Add this file to your IU GitHub repository.
FAQ¶
Q. Where should I run the test program hw3.py?
A. On india.futuresystems.org, not on your VM instance.
Q. bash setup.sh produces "command not found" errors.
A. Make sure you can use the nova command to start a new VM, as you did in the hw3 tasks. Otherwise, the test program can’t verify what you accomplished.
Q. The hw3.py test program failed due to a missing Python package named lib.
A. Run hw3.py in the main directory of the hw3 branch. hw3.py does not work by itself; its helper functions are required.
Submission via IU GitHub (github.iu.edu)¶
From now on, you will use IU GitHub to submit assignments in a private repository. See the IU GitHub Guidelines.
- Clone your private repository from the course organization. Your IU username is the name of your repository.
- Create a hw3 branch:
git branch hw3
git checkout hw3
- Run a pull command to fetch and merge the template repository:
git pull git@github.iu.edu:bdossp-sp16/assignments.git hw3
- Sync with remote:
git push -u origin hw3
- Add hw3-results.txt to your repository:
git add hw3-results.txt
- Commit the merge with the template:
git commit -am "initial merge with the template"
- Sync your changes:
git push -u origin hw3
Challenging Tasks (Optional)¶
The following tasks are optional but strongly recommended. They are related to Python packages and APIs (application program interfaces), and extend the OpenStack nova work to give you more experience.
‘Hello Big Data’ Flask Web Framework¶
Find the flask sub-directory in the challange directory of your assignment repository. We provide a hello.py Python file that you can run in your VM, but there are a few requirements that we request:
* Use virtualenv named 'bdossp-sp16' in your home directory
* Open a web port to the Flask application to allow access from outside
Note
The two terms VM and virtual instance are interchangeable in this context.
- What command(s) do you run to create and enable the virtualenv?
- python hello.py may not work if you run it with only the standard Python libraries. What command(s) do you run to resolve the issue? (Hint: Flask is not a standard Python package.)
- If you run the application successfully, you can see the ‘Hello Big Data’ message in your web browser on web port 15000. However, it is not accessible from outside, e.g. http://IP_ADDRESS:15000, because there is no rule for the port in the OpenStack Security Group. (We assume there is no firewall here.) What nova command(s) do you need to create/add a security group for the port? A flask rule is provided in the fg491 project. What nova command(s) do you need to see the current rule(s) in the security group and to apply them to your VM?
Write your solutions in a text file named flask-sol.txt after completing the tasks above. Add this file to the flask sub-directory.
Useful links¶
- Python lesson: http://bdossp-spring2016.readthedocs.org/en/latest/lesson/linux/python.html
- OpenStack Beginners: http://bdossp-spring2016.readthedocs.org/en/latest/lesson/iaas/openstack.html
- OpenStack QuickGuide: http://bdossp-spring2016.readthedocs.org/en/latest/lesson/quickstart_openstack.html
- OpenStack Operations Guide: http://docs.openstack.org/openstack-ops/content/user_facing_operations.html
HW5: Ansible Exercise¶
Note
Replace mongodb.yml with site.yml in hw5.sh if running hw5.sh failed because of the incorrect filename. If you recently pulled the hw5 branch into your private repository, you don’t need this fix.
Guidelines¶
- Assignments must be completed individually.
- Discussion is allowed (e.g. via Slack), but the submission must be made by yourself. Acknowledge your helpers'/collaborators' names in your submission if you discussed the work or received help from anyone.
- Use an individual github repository. A repository in FutureSystems will be given later.
Use hw5 branch¶
- Login to https://github.iu.edu with your IU username and password (the same IU credentials you use on other IU sites, e.g. one.iu.edu)
- Check out the hw5 branch
MongoDB Ansible Role¶
Writing a MongoDB playbook is taught in the Ansible lessons. In this assignment you write a MongoDB Ansible role. Submit the inventory, the main playbook, the command script, and your role including its sub-directories.
Requirements¶
The following files should be included in your submission:
- inventory file
- site.yml, the main playbook file
- mongodb directory (the Ansible role for mongodb)
- hw5-cmd.script file
Preparation¶
- Login to india.futuresystems.org
- Use the same bdossp-sp16 virtualenv used in hw3
- Install ansible into the bdossp-sp16 virtualenv via the Python package manager
- Change directory to your IU GitHub repository where you work on hw5
- Create a new hw5 branch by:
git checkout -b hw5
- Pull the hw5 template files by:
git pull git@github.iu.edu:bdossp-sp16/assignments.git hw5
- Sync to remote by:
git push origin hw5
- Start working on hw5
HW5 Tasks¶
You need to write an Ansible role to install mongodb on your VM instance hw3-$USER. An Ansible playbook for MongoDB installation is given in the Ansible lessons. You may start from there, but this time you need to install MongoDB on Ubuntu 15.10. Systemd is the main init system in Ubuntu 15.10, and you need to place a service file using Ansible modules. Certain conditions must be met in your submission; see the requirements below:
- Create a new Ansible role where mongodb is the role name
- Describe tasks in the tasks directory:
  - Add the MongoDB public GPG key from hkp://keyserver.ubuntu.com:80
  - Use EA312927 as the MongoDB public GPG key ID when Ubuntu package management imports the key (apt-key)
  - Install mongodb-org 3.2 Community Edition for Ubuntu Trusty 14.04 LTS by adding a MongoDB repository from:
    deb http://repo.mongodb.org/apt/ubuntu trusty/mongodb-org/3.2 multiverse
- Define these as Ansible variables in the defaults directory; at least the four variable names below should be used:
  - mongodb_keyserver (to store hkp://...)
  - mongodb_gpgkey_id (to store EA312...)
  - mongodb_repository_list (to store deb http://...)
  - monogodb_package_name (use ‘mongodb’)
  - (more variables can be defined)
- Two handlers:
  - one for starting mongodb
  - one for restarting mongodb
- Place a service file where:
  - the destination is /lib/systemd/system/mongodb.service
  - the owner/group of the destination file is root
  - the mode of the file is 0644
  - mongodb is reloaded after this file is added to the remote
  - (you can find the mongodb.service.j2 template file in your hw5 branch)
- Write a main playbook, in the site.yml file, to include your new role
- Run hw5.sh to record your outputs in the hw5-cmd.script file
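As a starting point, the required pieces map onto the conventional Ansible role layout. This sketch only creates the skeleton in a scratch directory; the directory and file names follow the standard Ansible roles convention, and the comments note what each file would hold for this assignment (the contents themselves are your work):

```shell
# Skeleton of an Ansible role named "mongodb". Ansible picks up each
# main.yml automatically when the role is listed in a playbook.
cd "$(mktemp -d)"
mkdir -p mongodb/tasks mongodb/handlers mongodb/defaults mongodb/templates
touch mongodb/tasks/main.yml        # apt_key / apt_repository / apt tasks
touch mongodb/handlers/main.yml     # start mongodb, restart mongodb handlers
touch mongodb/defaults/main.yml     # mongodb_keyserver, mongodb_gpgkey_id, ...
touch mongodb/templates/mongodb.service.j2
find mongodb -type f | sort         # show the resulting layout
```

With this layout in place, listing the role under roles: in site.yml is enough for Ansible to load the tasks, handlers, and defaults.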
Grading Guidelines¶
- Existence of required files/directories (15%)
  - inventory
  - site.yml
  - (role) directory including subdirectories
  - hw5-cmd.script
- Proper use of Ansible Variables (15%)
- Proper use of Ansible Tasks (15%)
- Proper use of Ansible Templates (15%)
- Proper use of Ansible Handlers (15%)
- General understanding of Ansible Roles (20%)
- Successful Execution (5%)
FAQ¶
Q. How do I avoid typing my SSH passphrase while the current session is alive?
A. Use ssh-agent like this:
eval `ssh-agent`
ssh-add
Q. Where should I run Ansible playbooks or roles?
A. On india.futuresystems.org, not on your VM instance.
Q. I see the mongodb.service.j2 template file but don’t know exactly what to do.
A. Once you have installed the mongodb server, you may need to register it as a service. In Ubuntu 15.10, systemd is the main init system, and you need to place a service file to register it. Explore the Ansible template module, which is useful for placing a file rendered with variables. See the documentation here: http://docs.ansible.com/ansible/template_module.html
Q. Permission denied on git pull git@github.iu.edu:bdossp-sp16/assignments.git hw5.
A. Try https or register your SSH key at IU GitHub. Using the https URL looks like:
git pull https://github.iu.edu/bdossp-sp16/assignments.git hw5
Submission via IU GitHub (github.iu.edu)¶
Use IU GitHub to submit assignments in a private repository. See the IU GitHub Guidelines.
- Clone your private repository from the course organization. Your IU username is the name of your repository.
- Create a hw5 branch:
git branch hw5
git checkout hw5
- Run a pull command to fetch and merge the template repository:
git pull git@github.iu.edu:bdossp-sp16/assignments.git hw5
- Sync with remote:
git push -u origin hw5
- Add files and directories to your repository:
git add inventory
git add mongodb
git add site.yml
git add hw5-cmd.script
- Commit:
git commit -am "submission hw5"
- Sync your changes:
git push -u origin hw5
Challenging Tasks (Optional)¶
The following tasks are optional but strongly recommended. They ask you to write mongodb roles for RedHat-based operating systems as well, using Ansible conditionals and different modules where necessary.
MongoDB Roles for RedHat¶
You have completed writing mongodb roles for Ubuntu 15.10, a Debian-based operating system, only. In this challenge task you are required to extend your mongodb roles to RedHat-based operating systems as well. Ansible conditionals are recommended for selecting the correct tasks/files on different operating systems.
Find the mongodb-redhat directory in the challange sub-directory. Add your extended mongodb role in that directory.
Possible Project idea (Running Ansible on Windows)¶
Develop Ansible playbooks and roles for Windows machines using PowerShell and the winrm Python package instead of SSH. There are multiple possible approaches:
- Develop a PowerShell script that starts a VirtualBox VM and runs Debian Ansible in it, using a local key; see also the installation instructions of Cloudmesh, which let you set up SSH on a Windows machine.
- Develop a Docker-based Ansible container. However, this is not as straightforward, as the key management needs to be done right.
You can find more information here: Windows Support
Useful links¶
- Ansible Basic: http://bdossp-spring2016.readthedocs.org/en/latest/lesson/ansible.html
- Ansible Playbook: http://bdossp-spring2016.readthedocs.org/en/latest/lesson/ansible_playbook.html
- Ansible Role: http://bdossp-spring2016.readthedocs.org/en/latest/lesson/ansible_roles.html
- Ansible Best Practices: https://docs.ansible.com/ansible/playbooks_best_practices.html
- Ansible official documentation: http://docs.ansible.com/ansible/index.html
Project Guidelines¶
News¶
- NIST Fingerprint Example (03/09/2016)
- HBase is now supported (03/02/2016)
- Examples are under development (03/02/2016)
- Projects, datasets, and technologies from the past are available (02/26/2016)
Important Dates¶
- Project Proposal: March 18th
- Oral Presentation: Week 12 - April 1st, 2nd (Tentative)
- Progress Checkup: Week 14 - April 15th, 16th (Tentative)
- Final Submission: April 29th
Note
Those who can’t make the presentation due to a time conflict should schedule a meeting with the Course Team.
Submission¶
- IU GitHub: https://github.iu.edu/bdossp-sp16
Team Coordination¶
Teams of up to 3 members are recommended, but individual projects are allowed.
Project Expectation (Grade)¶
The final project counts for 60% of the semester grade; assignments count for the remaining 40%.
- 60% Final project
- 10% Proposal
- 10% Presentation
- 30% Source code
- 10% Report
Project Style¶
- Basic
- Bonus
You do not need a strong background or programming skills with HPC or Hadoop to complete a final project. We have noticed, however, that some students have difficulties learning Linux systems, shells, or scripts and improving programming skills with parallelization in general. You have two options, Basic and Bonus, to start your project, depending on your capability in these areas.
Basic Project¶
A basic project starts from existing projects and extends their scope with minimal effort on code development. For example, take existing Hadoop benchmark tools and run them on Hadoop clusters with different system configurations to compare. Try increasing data nodes or master nodes, or add ZooKeeper with different settings, and measure the differences. Comparing performance across software versions, settings, or configurations tells you where the focal points are for optimizing or improving Hadoop throughput. Choose a basic project if you are not competent with programming languages, e.g. Java or Python. Note that starting from existing projects doesn’t mean you can simply search for, download, and execute popular projects from the internet. You need to address new findings and include the original sources of the projects you referenced in your final project and reports.
- Minimal code writing
- Start from existing projects
Bonus Project¶
If you are working on a bonus project, you are required to write code/scripts to implement your idea in the final project. Installation and configuration should be done with Ansible playbooks. For example, take the NIST facial recognition software and run it on Hadoop clusters, changing the serial calculation to execute in parallel. Writing map and reduce functions may be necessary in Java, Python, or Scala. Write Ansible playbooks to install and configure your software packages within a few commands. If data analytics is an area you are interested in, you may try to develop new techniques to improve performance or implement parallel algorithms for complex face detection. Developing parallel programs would be involved in most cases. There are other possibilities as well. For instance, take hadoop-ansible-stacks, which consists of the basic components of Hadoop, and append new software tools by writing new playbooks in roles and addons. You could add Hive or update Spark to the latest release using parameters or definitions in YAML. If you focus on managing systems and software deployments, think about how to manage traffic by adding/removing nodes or how to apply new patches to particular nodes. Bonus points are given for exceptional project results.
- Ansible is required
- Extensive code and scripts writing are welcome
- Using GitHub Issues is mandatory to communicate with AIs for your projects
- Bonus points
Project Choice¶
- Deployment
- Benchmark (Performance Test)
- Parallelization
- Analytics
- Created Own (upon approval)
Deployment¶
A deployment project focuses on automated software deployment across multiple nodes using automation and configuration-management tools such as Ansible, Chef, Puppet, Salt, or Juju. For example, you can work on deploying Hadoop clusters on 10 medium virtual instances, sharded MongoDB clusters, or filesystems, e.g. NFS or Gluster. Ansible is recommended and supported in the class.
Examples:
- Deploying Hadoop clusters
- Deploying cluster managers (e.g. Mesos)
Benchmark¶
A benchmark project focuses on testing a system’s performance by applying stress at different spots. Filesystems, CPUs, or memory can be tested and measured for hardware benchmarks; APIs, messaging queues, load balancers, or any applications can be tested and measured when software is the focus. HiBench, Big Data Benchmark, or built-in tools, e.g. TeraSort, are available for Hadoop benchmarking.
Examples:
- Hibench
- Storm Benchmark
- Big Data Benchmark for Big Bench
Parallelization¶
A parallelization project focuses on building efficient parallel software stacks, including MPI and Hadoop clusters. For example, you may find writing map and reduce functions relatively easy, e.g. WordCount, but applying them in practice with large datasets isn’t that simple. Think about how to load your dataset into Hadoop filesystems or databases and run your jobs in a distributed fashion.
Examples:
- Pig
- Spark
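To see the map/shuffle/reduce shape of WordCount in miniature, it can be expressed as a plain Unix pipeline (the input text here is invented for illustration):

```shell
# WordCount as a pipeline: tr plays the mapper (one word per line),
# sort plays the shuffle (grouping identical keys), and uniq -c plays
# the reducer (counting each group).
printf 'big data big software\ndata data\n' | tr -s ' ' '\n' | sort | uniq -c
```

This prints counts of 2 for big, 3 for data, and 1 for software. A Hadoop Streaming job has exactly this structure, with the mapper and reducer stages distributed across nodes and the shuffle handled by the framework.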
Analytics¶
An analytics project focuses on developing algorithms for different problems based on the datasets and topics you chose for your project. In this project you will be required to develop algorithms for improving parallelism or performance, rather than developing, for example, a new algorithm for face recognition.
Examples:
- Faunus Graph analytics
- Ibis
Created Own¶
You can develop your own project idea and make it a class project upon approval. Describe your thoughts, tools, and topics, and make a clear statement of the problems you identified in your project proposal.
Project Requirement¶
- Installation/configuration by Ansible playbooks or relevant tools (Ansible Roles)
- Reproducibility: runnable on a Linux distribution
- Sample dataset: up to 480GB per team
- At most 12 m1.medium VM instances are given to each team
- Software stacks similar to Software Layers
Project Copyright¶
Your project deliverables may be introduced in future classes or shared with others online after the end of the semester.
Project Proposal¶
Please submit your project proposal by the due date. The proposal.rst RST file is provided in the project template repository. Fork this repository and write your proposal in the file under the ‘docs’ directory. Find the RST Quick Reference and Online RST Editor here. A project proposal is typically 1-2 pages long, and its description section should contain:
- the nature of the project and its context
- the technologies used
- any proprietary issues
- specific aims you intend to complete
- and a list of intended deliverables (artifacts produced)
Oral Presentation¶
You are required to demonstrate your project during the presentation week. A clear statement of the problems is necessary, along with the schedule, plan, roles of team members, and resources to use.
- A student will use Adobe Connect to give a presentation which will be recorded.
- 3-5 minutes per team.
- The presentation can be substituted with a written report upon approval; a 1-2 page progress report needs to be included.
Presentation Guideline¶
- Demonstrate the following criteria:
- team members (roles)
- problem definition
- list of technologies
- list of development tools, languages
- list of dataset and its availability
- schedule
- resources to use
All presentations will be recorded.
Progress Checkup¶
The following activities will be evaluated:
- Code development in a project repository
- Participation of team members
- Software installation
- Datasets preparation
List of Possible Projects¶
Note
We are currently working on this and any software and/or details are subject to change without notice. This is reference only.
- Big Data Analytics Stack
- Deployment project using Ansible Playbooks (Ansible Roles)
Layer | Supported | In Progress | Optional |
---|---|---|---|
Scheduling Layer | YARN | Mesos | |
Database Layer | HBase | MongoDB, MySQL | CouchDB, PostgreSQL, Memcached, Redis |
Analytics Layer | Java | MLlib, Python | BLAS, LAPACK, Mahout, MLbase, R |
Data Processing Layer | Hadoop MapReduce, Spark, Pig | Storm, Flink | Tez, Hama, Hive |
You may consider working on the Big Data Analytics Stack using Ansible playbooks. The default configuration of the stack is YARN + HDFS + Java + Hadoop MapReduce, Spark, and Pig. You can develop a new addon for one of the optional software packages and attach it to your stack. Find more details here: big-data-stack, Ansible Roles.
Projects from Software Deployments¶
Projects related to the hadoop stack consist of either extending the functionality or using the current features. This repository is intended to define a simple, easily deployable, customizable, data analytics stack built on hadoop. Currently, deployment is done to a virtual cluster running on OpenStack Kilo on FutureSystems.
Title | Category | Data Sets | Technologies |
---|---|---|---|
big-data-stack | Software Deployments | n/a | Ansible |
Projects Derived from Benchmarking Sets¶
There are many benchmark sets such as BigDataBench, HiBench, Graph 500, BigBench, LinkBench, MineBench, BG Benchmark, Berkeley Big Data Benchmark, TPCx-HS, and CloudSuite. See http://dsc.soic.indiana.edu/publications/OgreFacetsv9.pdf
Title | Category | Data Sets | Technologies |
---|---|---|---|
Amazon Movie Reviews | Batch Data Analytics | 8 million reviews | |
Google web graph | Batch Data Analytics | Webgraph from Google, 2002 | |
Facebook Social Network | Batch Data Analytics | Facebook data | |
Genome sequence data | Batch Data Analytics | .cfa sample data (unstructured text file) | Work Queue (master/worker framework) |
Wang, Lei, et al. “Bigdatabench: A big data benchmark suite from internet services.” High Performance Computer Architecture (HPCA), 2014 IEEE 20th International Symposium on. IEEE, 2014. link
Title | Category | Data Sets | Technologies |
---|---|---|---|
Storm Benchmark | Batch Data Analytics | https://github.com/intel-hadoop/storm-benchmark | Storm |
Big Data Benchmark for Big Bench | Batch Data Analytics | https://github.com/intel-hadoop/Big-Data-Benchmark-for-Big-Bench | Hadoop, Hive, Mahout |
Title | Category | Data Sets | Technologies |
---|---|---|---|
| Batch Data Analytics | https://github.com/intel-hadoop/HiBench | Hadoop |
| Batch Data Analytics | https://github.com/intel-hadoop/HiBench | Mahout |
| Batch Data Analytics | https://github.com/intel-hadoop/HiBench | Mahout |
| Batch Data Analytics | https://github.com/intel-hadoop/HiBench | Hive |
Title | Category | Data Sets | Technologies |
---|---|---|---|
Graph 500 | Batch Data Analytics | main site | MPI |
BigBench | Batch Data Analytics | main site | |
LinkBench | Batch Data Analytics | main repo | |
BG Benchmark | Batch Data Analytics | main site | |
Berkeley Big Data Benchmark | Data Systems | main site | |
TPCx-HS | Data Systems | main site | Hadoop |
CloudSuite | Batch Data Analytics | main site | MapReduce |
MineBench | Batch Data Analytics | main site, Data Generator | |
Projects From NIST¶
Title | Category | Data Sets | Technologies |
---|---|---|---|
Fingerprint Matching | Batch Data Analytics | | |
Human and Face Detection from Video (simulated streaming data) | Streaming Data Analytics | OpenCV, INRIA Person Dataset | |
Live Twitter Analysis | Streaming Data Analytics | Live Twitter feed | |
Big data Analytics for Healthcare Data/Health informatics | Batch Data Analytics | Medicare Part-B in 2014 | |
Spatial Big data/Spatial Statistics/Geographic Information Systems | Batch Data Analytics | Uber Ride Sharing GPS Data | |
Data Warehousing and Data mining | Batch Data Analytics | 2010 Census Data Products: United States | |
- Reference URL for these projects: http://bigdatawg.nist.gov/_uploadfiles/M0399_v2_8471652990.doc
Title | Data set | Software | Category |
---|---|---|---|
NIST Fingerprint (a subset of): NFIQ PCASYS MINDTCT BOZORTH3 NFSEG SIVV | NIST Special Database 27A [4GB] | NIST Biometric Image Software (NBIS) v5.0 [userguide] | Batch Data Analytics |
Hadoop Benchmark (each) - TeraSort Suite | Teragen | hadoop-examples.jar | Batch Data Analytics |
Hadoop Benchmark (each) - DFSIO (HDFS Performance) | | hadoop-mapreduce-client-jobclient | Batch Data Analytics |
Hadoop Benchmark (each) - NNBench (NameNode Perf.) | | hadoop-mapreduce-client-jobclient | Batch Data Analytics |
Hadoop Benchmark (each) - MRBench (MapReduce Perf.) | | src/test/org/apache/hadoop/mapred/MRBench.java | Batch Data Analytics |
Projects from Other Sources¶
Title | Category | Data Sets | Technologies |
---|---|---|---|
MapReduce Implementation for Longest Common Substring Problem | Batch Data Analytics | Escherichia coli K-12 | |
MapReduce Implementation for GFF Parsing | Batch Data Analytics | | |
- Examples from the previous class
List of Datasets¶
Note
We are currently working on this and any software and/or details are subject to change without notice. This is reference only.
- Examples from the previous class
Note
There is no direct support for datasets.
Note
Large datasets should be reported to the Course Team. They will be prepared and made downloadable via /share/project2/FG491 on india.futuresystems.org
List of Technologies¶
Note
We are currently working on this and any software and/or details are subject to change without notice. This is reference only.
- Examples from the previous class
Note
There is no direct support for analytics software.
Details on Software Submission¶
Code submission should be made on GitHub and include a README file.
- Source code on Github: https://github.iu.edu/bdossp-sp16/sw-project-template
The README includes:
- Test instructions
- List of data sources
- List of technologies used
Details on Final Report¶
The final report concludes your team’s work and describes your findings and results. The following sections should be included:
- Description of your project
- Problem statement
- Purpose and objectives
- Results
- Findings
- Implementation
- References
  - original source of code snippets
  - original source of datasets
The final report should satisfy the following guidelines:
- 4 - 6 pages
- Times Roman, 12 point, 1.1 line spacing, in Microsoft Word
- Figures can be included
- Proper citations must be included
- Material may be taken from other sources, but it must amount to no more than 25% of the report and must be cited
- The level should be similar to a publishable paper or technical report
Details on Grading Criteria¶
- Proposal
- Clear statement
- Quality and Breadth
- Interest
- Code
- Reproducibility
- Executable (weighted most heavily)
- Installation instructions
- Configuration instructions
- Datasets
- Acknowledgements
- Gee whiz factor
- Report
- Related Work
- Completeness
- Level of insight
FAQ¶
Q. Is use of FutureSystems required?
A. No, it is not required, but your project must be deployable and runnable on FutureSystems Kilo, and you should provide detailed instructions on how to do so. Ideally, running ansible-playbook site.yml should be all that is needed to deploy, after booting and editing the inventory.txt file.
Q. I need more time to complete code development; may I have an extension?
A. Extensions are approved upon request. Send an extension request to the course email with the subject [Project Extension] and an expected completion date.
Q. Our team wants to change the topic or scope of our project after the project proposal or presentation; is that allowed?
A. The topic should stay close to what you proposed earlier. Please contact Dr. Fox or the Course Email if you change the topic or scope of your project significantly. Also inform us if you change team members. These changes are approved upon request.
Q. Is a report or survey type of final project allowed?
A. No, only software projects are allowed in this class.
Q. I found an existing project similar to the one I proposed; should I keep working on mine?
A. Consult with the Course Team to differentiate your project in detail. You may be asked to focus on a specific area in order to avoid similarity.
Q. I can't make the oral presentation because of a business trip (or a conference).
A. Schedule a meeting in Week 11 or Week 13 with the Course Team.
Q. What does it mean that there is no direct support for datasets and analytics software?
A. We will provide support for accessing datasets under 500 GB.
Questions & Support¶
- Course Email: bdosspcoursehelp@googlegroups.com
- Google Hangout (voice & screen share): upon request
Useful Links¶
- Scheduler
- Python Tools
- Visualization
Office Hours¶
General Discussions (Adobe Connect)
- TBD
Support Office Hours (Google hangout)
Office hours are conducted by appointment only during the times given below. Please send an email to bdosspcoursehelp@googlegroups.com.
- Mon-Fri 9-5pm
Adobe Connect¶
Discussions will be done using Adobe Connect. Make sure you are prepared to join a meeting by the following links:
- Adobe Connect Setup Instructions: https://ittraining.iu.edu/connectsetup/
- Installing Adobe Connect Plug-In: https://kb.iu.edu/d/bdoc
- Testing Connectivity: https://connect.iu.edu/common/help/en/support/meeting_test.htm
Note
All times are in US Eastern Standard Time (EST).
Additionally, you may encounter technical issues when attempting to join a meeting; please see the links above for guidance.
FAQ¶
How to ask a question¶
When submitting a question, please:
copy/paste directly from the terminal into the email message. Do not send text files, zip files, or other attachments, as they may not be opened. Screen grabs may be acceptable if you annotate the relevant parts using an image editor, or if they are integral to showing the problem.
use the mailing list (bdosspcoursehelp@googlegroups.com) to direct questions as all support staff are subscribed. Direct message via Slack is discouraged.
describe the specific step you are trying to accomplish, include the actions you took and the results. Ensure that you can reproduce your problem by executing only the steps present in your email. Also include:
- your FutureSystems username
- the output of nova show MY_VM (replacing MY_VM appropriately)
For example:
I cannot ssh into my VM due to permission denied errors:

    $ ssh user@machine
    Permission denied (publickey,hostbased).

I tried following the steps in the FAQ as defined here: https://...

<SHOW THE COMMANDS AND THEIR OUTPUT>

Relevant information is:

Username: albert

    $ nova show MY_VM
    +--------------------------------------+----------------------------------------------------------+
    | Property | Value |
    +--------------------------------------+----------------------------------------------------------+
    | accessIPv4 | |
    | accessIPv6 | |
    | config_drive | |
    | created | 2016-03-29T19:35:41Z |
    | fg491-net network | 10.0.5.22, 149.165.159.241 |
    | flavor | m1.large (4) |
    | hostId | 683eed6c03fcc23879620b6042eaaa22149d915bfcd4ec9e5feab7c5 |
    | id | 60dc4420-e5c0-4897-9a58-2cd48d6521b0 |
    | image | CentOS7 (dc5c041f-7881-441e-af91-e9620efde901) |
    | key_name | india |
    | metadata | {} |
    | name | $USER-myvmname |
    | os-extended-volumes:volumes_attached | [] |
    | progress | 0 |
    | security_groups | default |
    | status | ACTIVE |
    | tenant_id | 74e411d6d99e4497901d4c4e2b159f41 |
    | updated | 2016-03-29T19:35:56Z |
    | user_id | 090ea72e85c94c49ad8cc8133627fa1a |
    +--------------------------------------+----------------------------------------------------------+
Why can’t I ssh into my VM?¶
Make sure to boot on the internal network first. Once the node is up, attach the floating IP.
$ NET_ID=$(nova net-list | grep $OS_PROJECT_NAME | cut -d' ' -f2)
$ nova boot --nic net-id=$NET_ID # ... etc
$ nova floating-ip-list | grep ' - ' # find an available floating ip
$ # create a floating ip with nova floating-ip-create if there are no floating ips available
$ nova floating-ip-associate # ... etc
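The `grep ' - '` step above filters for floating IPs that are not yet assigned to an instance. A self-contained sketch of that selection against saved sample output (the table format is an assumption based on that era's nova CLI):

```shell
# Sample `nova floating-ip-list` output saved to a file; the "-" in the
# second column marks an address not attached to any instance.
cat > /tmp/fips.txt <<'EOF'
| 149.165.159.241 | 60dc4420-e5c0-4897-9a58-2cd48d6521b0 | fg491-net |
| 149.165.159.242 | - | fg491-net |
EOF
# pick the first unassigned address
free_ip=$(grep ' - ' /tmp/fips.txt | head -n 1 | awk -F'|' '{print $2}' | tr -d ' ')
echo "$free_ip"
```

If no line matches, `free_ip` is empty, which is the cue to run `nova floating-ip-create` first.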
If your VM is on the 10.1.x.y subnet, it is accessible:
- from outside the subnet with a floating ip only
- from inside the 10.1.x.y subnet
Check the IP of your VM(s) with:
$ nova show $USER-myvmname | grep network
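To pull out just the floating (public) address, i.e. the second one on the network line, something like the following works; the sample output below is an assumption based on the FAQ example:

```shell
# Saved `nova show` output; the network line lists the internal IP first
# and the floating IP second.
cat > /tmp/nova-show.txt <<'EOF'
| fg491-net network | 10.0.5.22, 149.165.159.241 |
EOF
floating_ip=$(grep network /tmp/nova-show.txt | awk -F'|' '{print $3}' | awk -F',' '{print $2}' | tr -d ' ')
echo "$floating_ip"
```

The resulting address is the one to use for ping and ssh from outside the subnet.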
Make sure you can ping your VM:
$ ping -c 5 $IP
Make sure the machine has finished booting by checking that the ssh daemon is listening on port 22:
$ nc -zv $IP 22
Connection to 149.165.158.1 22 port [tcp/ssh] succeeded!
Make sure you are trying to log in with the correct username. For Ubuntu VMs the username is ubuntu; for CentOS use centos. For example, to log into an Ubuntu VM:
$ ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null ubuntu@$IP
If you are still unable to ssh, try a hard reboot a few times:
$ nova reboot --hard $USER-myvmname
Check that you have an ssh key registered with OpenStack using nova keypair-list and make note of the fingerprint:
$ nova keypair-list
+----------------+-------------------------------------------------+
| Name | Fingerprint |
+----------------+-------------------------------------------------+
| india | 41:29:20:a2:51:25:5d:66:71:02:15:b6:cd:e2:36:06 |
+----------------+-------------------------------------------------+
Check that the correct key name was passed to nova boot when starting the VM by using nova show:
$ nova show $USER-myvmname
+--------------------------------------+----------------------+
| Property | Value |
+--------------------------------------+----------------------+
# ...
| key_name | india |
# ...
+--------------------------------------+----------------------+
Ensure that the fingerprint matches:
$ ssh-keygen -lf ~/.ssh/id_rsa
2048 41:29:20:a2:51:25:5d:66:71:02:15:b6:cd:e2:36:06 ~/.ssh/id_rsa.pub
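As a sanity check of the matching logic itself, you can generate a throwaway key pair and confirm that both halves report the same fingerprint, exactly as your local key should match the `nova keypair-list` entry (the temp path and key size here are arbitrary):

```shell
# Generate a disposable RSA key with no passphrase in a temp directory.
tmp=$(mktemp -d)
ssh-keygen -q -t rsa -b 2048 -N '' -f "$tmp/id_rsa"
# The fingerprint is the second column of `ssh-keygen -l` output.
fp_pub=$(ssh-keygen -lf "$tmp/id_rsa.pub" | awk '{print $2}')
fp_prv=$(ssh-keygen -lf "$tmp/id_rsa" | awk '{print $2}')
[ "$fp_pub" = "$fp_prv" ] && echo "fingerprints match"
```

The same `awk '{print $2}'` extraction can be compared directly against the Fingerprint column reported by OpenStack.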
Make sure that the key was injected into the VM during startup by grabbing the console log and searching for your fingerprint. Make sure to wait a few minutes after nova boot to allow the node to start up:
$ nova console-log $USER-myvmname | grep -A 2 -B 4 '41:29:20:a2:51:25:5d:66:71:02:15:b6:cd:e2:36:06'
ci-info: ++++++Authorized keys from /home/centos/.ssh/authorized_keys for user centos+++++++
ci-info: +---------+-------------------------------------------------+---------+---------+
ci-info: | Keytype | Fingerprint (md5) | Options | Comment |
ci-info: +---------+-------------------------------------------------+---------+---------+
ci-info: | ssh-rsa | 41:29:20:a2:51:25:5d:66:71:02:15:b6:cd:e2:36:06 | - | |
ci-info: +---------+-------------------------------------------------+---------+---------+
If, after going through these steps, you are still unable to access the VM, delete the VM and try again two or three times, waiting a few minutes between each attempt. OpenStack is a collection of many distributed systems, and the nature of distributed systems is that they can be prone to random failure.
If you are still unable to log in, please contact us and indicate that you have gone through these steps, and show the output of the above commands.
Why can’t I modify my ~/.ssh/authorized_keys file?¶
You cannot manually manage your authorized_keys file on india for security reasons. If you need to change your ssh key, do so via the SSH keys tab on your Web Portal Account.
Why does my MongoDB deployment fail?¶
In this case, mongodb is installed successfully, but the service cannot be started. Solving this is the goal of the assignment, which demonstrates an important aspect of many development processes: namely, the effects of changing infrastructure.
To put this in context: Ubuntu for many years (through the 14.04 LTS release) used the Upstart init daemon. As of 15.04, this was switched to systemd. However, the mongodb installation expects to use Upstart to run the service, and therefore fails.
There are many solutions to this type of problem:
- add the systemd service file by hand
- roll back the OS from Ubuntu 15.04 to 14.04
- use a different repository which includes the systemd service file
For the purposes of this homework, the first option is taken, and the service file is provided in the repository. As the homework instructions say, place the provided service file in the appropriate location.
If, after deploying the service file, you are still unable to start the mongodb service, please include the contents of /lib/systemd/system/mongodb.service in your email.
One common issue is the user the mongodb service runs as: you should make sure that the username in the service file matches the user account created for mongodb.
- Check the username in the service file by looking at the User value.
- Check the username on the system with grep -i mongo /etc/passwd.
If these two values do not match, adjust your ansible deployment.
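The two checks above can be scripted; here is a sketch using a sample service fragment (in practice you would read the real /lib/systemd/system/mongodb.service, and the user name below is just an example):

```shell
# A stand-in for the real service file, written to a temp path.
cat > /tmp/mongodb.service <<'EOF'
[Service]
User=mongodb
ExecStart=/usr/bin/mongod --config /etc/mongodb.conf
EOF
# Extract the User= value from the [Service] section.
svc_user=$(sed -n 's/^User=//p' /tmp/mongodb.service)
echo "service runs as: $svc_user"
# Compare against the accounts actually present on the system.
grep -q "^$svc_user:" /etc/passwd || echo "no matching account for $svc_user"
```

If the second command reports no matching account, adjust either the service file or the user created by your ansible deployment so the two agree.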
Security Groups¶
As projects are shared and everyone can modify the security groups, it is best to create security groups prefixed with your username (e.g. $USER-default) and add your rules to that.
Naming VMs¶
All VMs should be prefixed with your username. This will allow everyone to identify the VMs that belong to you.
Naming the key on the Cloud¶
It is best to name the key on the cloud with your <portalname> so it is not confused with others. It is also good practice to append -key, so your key name would be <portalname>-key.
Accessing Root¶
The default login user (ubuntu on India, cc on Chameleon) has sudo privileges.
Beware of Denial-of-Service attacking your own machine¶
We have seen students loop an ssh command during boot, issuing a new attempt as soon as the previous one failed. They did this so many times, and not just with one VM but with many, that they flooded the network. Multiply this by many users and you can see how this alone can create a denial-of-service attack on cloud services. Instead, put a sleep between such ssh attempts while checking whether your VM is really up. Wait at least 30 seconds between attempts; at times a VM can take as much as 10 minutes to come up, depending on usage.
You can do this in bash and zsh using until:
$ until nc -zv $IP 22; do sleep 30s; done
This sleeps for 30 seconds each iteration until the ssh service is detected to be available on port 22.
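A bounded variant of the same idea: stop probing after a fixed number of attempts, so a dead VM does not keep generating traffic indefinitely. The helper name and the limits are illustrative, not part of the assignment:

```shell
# Probe port 22 with a 1-second connect timeout, sleeping between attempts,
# and give up after $max tries instead of looping forever.
wait_for_ssh() {
  host=$1; max=${2:-20}; delay=${3:-30}; tries=0
  until nc -z -w 1 "$host" 22 2>/dev/null; do
    tries=$((tries + 1))
    if [ "$tries" -ge "$max" ]; then
      echo "gave up on $host after $tries tries"
      return 1
    fi
    sleep "$delay"
  done
  echo "ssh is up on $host"
}
```

Typical use is `wait_for_ssh $IP`, which keeps the 30-second spacing; pass smaller values only when experimenting.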
Using Chameleon Cloud¶
You can find documentation on how to migrate from India (Futuresystems) OpenStack to Chameleon cloud here: https://github.com/futuresystems/class-admin-tools/blob/master/chameleon/big-data-stack.org
Make sure you follow these instructions.
Regarding some common questions about switching to and using Chameleon, here are some tips if you are having trouble:
General¶
- there is so far no evidence that chameleon is experiencing the same load problems that india is
- make sure you don’t source anything under ~/.cloudmesh/
- make sure you do source the CH-817724-openrc.sh file
- make sure you enter your password correctly
- make sure that running nova list works as a sanity check (if not, repeat the previous two steps)

As there is no confirmation or denial that your password was entered correctly, you should test using nova list to ensure authentication is possible. Make sure you have not sourced anything under ~/.cloudmesh/clouds/..., as this will corrupt the environment nova uses to authenticate to chameleon.
Differences between Chameleon and India:¶
- the username to log into the VM is different: use cc instead of ubuntu
- you cannot log into the internal IP address (192.X.Y.Z); you must associate a floating IP address first
- the ubuntu image is called CC-Ubuntu14.04 instead of Ubuntu-14.04-64
BDS on Chameleon¶
If you are using the Big Data Stack, you need to make the following changes as well to tell BDS to use the Chameleon-specific environment instead of india:
SSH problems with Chameleon¶
If you experience trouble ssh-ing into a Chameleon instance, make sure that the fingerprint of the key injected into the instance (get it with nova console-log $VM_NAME) matches the one you are using (the default is ~/.ssh/id_rsa; use ssh-keygen -lf $PATH_TO_KEY to see it).
Keystone problems¶
The authentication mechanisms for FutureSystems and Chameleon clouds are incompatible.
This means that if you source the openrc file for FutureSystems (usually in your ~/.cloudmesh/clouds/india directory) and then your Chameleon Cloud openrc file, you may get something like this:
$ nova list
ERROR (DiscoveryFailure): Cannot use v2 authentication with domain scope
The following has also been reported
$ nova list
No handlers could be found for logger "keystoneclient.auth.identity.generic.base"
... terminating nova client
The solution is to open a new terminal session with a fresh environment and source only the openrc file for the cloud you need to use.
For the adventurous, you may alternatively clear your OpenStack environment variables in the current shell:
for var in $(env | grep -E '^OS_' | cut -d= -f1); do unset "$var"; done
The above should work in sh-like shells (e.g. sh, bash, zsh). Note that the piped form ... | while read var; do unset $var; done runs the loop in a subshell in most shells, so the variables would remain set in your session.
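A quick way to convince yourself the cleanup actually took effect in the current shell (the variable name below is a made-up example; note the loop clears every OS_ variable, which is the point):

```shell
# Plant a dummy OS_ variable, run the cleanup loop, and confirm it is gone.
export OS_DUMMY_TEST=1
for var in $(env | grep -E '^OS_' | cut -d= -f1); do unset "$var"; done
if env | grep -q '^OS_DUMMY_TEST='; then echo "still set"; else echo "cleared"; fi
# prints "cleared"
```

After this, source only the openrc file for the cloud you intend to use.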