Spring 2016 - Projects On Big Data Software (I590 - Geoffrey C. Fox)


Technology Section

Units in Technology Section - Spring 2016

Overview

The Projects on Big Data Software course presents lessons in two sections: Theory and Technology. The Technology units are listed on this page. For the Theory units, syllabus, and discussions, please use the course site scholargrid.org.

Schedule for Units in Technology Section

Schedule Section on Technologies
Topic Due
Gaining Access to FutureSystems and Core Technologies 01/25


Gaining Access to FutureSystems and Core Technologies

In this unit, you will learn how to gain access to the FutureSystems resources. It covers portal account creation, class project participation, SSH key generation, and login node access. Other lessons have been prepared to help beginners understand the basics of the Linux operating system and the collaboration tools, i.e. GitHub, Google Hangouts, and Remote Desktop. Please watch the video lessons and read through the web content.

Collaboration Tools
Topic Video Text
Overview and Introduction 16 mins 10 mins
Google
  • Google+, Hangout, Remote Desktop
4 mins 15 mins
GitHub 18 mins 30 mins
System Access to FutureSystems
Topic Video Text
ssh-keygen 4 mins 10 mins
Account Creation 12 mins 10 mins
Remote Login 6 mins 10 mins
Putty for Windows 11 mins 10 mins
Linux Basics
Topic Video Text
Overview and Introduction 4 mins 5 mins
Shell Scripting 15 mins 30 mins
Editors
  • Emacs, vi, and nano
5 mins 30 mins
Python
  • virtualenv, Pypi
27 mins 1 hour
Package Managers
  • yum, apt-get, and brew
3 mins 10 mins
Advanced SSH
  • SSH Config and Tunnel
3 mins 20 mins
Modules 3 mins 10 mins

Note

Find an editor that you will use for your programming. For advanced Python programming we recommend PyCharm, but you can use others, e.g. Enthought Canopy, on your local computer. One way to work is to edit Python locally, push the code to GitHub, and check it out on your VM or on your login node on india.futuresystems.org. This is how many of us work.
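
A minimal sketch of that round trip, assuming your code already lives in a GitHub repository (repository and branch names are illustrative):

    # on your local computer, after editing
    git commit -am "update my script"
    git push origin master

    # on your VM or on your login node
    git pull origin master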

Length of the lessons in this Unit
  • Total of video lessons: 2 hours
  • Total of study materials: 4 hours and 30 minutes
Assignment HW
Get Ready for FutureSystems and Warm-Up
Topic Description
Start with Account, Github and Python 9 tasks

HW2: Get Ready for FutureSystems

Guidelines

  • Assignments must be completed individually.
  • Discussion is allowed (e.g. via Slack), but you must make the submission yourself. Acknowledge your helpers'/collaborators' names in the submission if you discussed the work or got help from anyone.
  • Use an individual GitHub repository. A repository in FutureSystems will be given later.

Tasks

Complete the following tasks. Place all answers in a file named HW2-$USERID.txt and submit it via IU Canvas. (Replace $USERID with your IU email name, e.g. HW2-albert.txt if your email address is albert@indiana.edu.)

Example view of your submission:

1. albert
2. ...
3. ...
9. http://...
FutureSystems Access
  1. Sign up at portal.futuresystems.org if you do not already have an account. Provide your portal ID in your submission.

  2. Join the class project and provide the project number in your submission.

  3. Generate a new SSH key and register it on portal.futuresystems.org. Provide your key fingerprint in your submission (a command sketch follows this list).

  4. SSH into india.futuresystems.org with your registered key. Run the following command and attach the output (plain text) in your submission. (Most SSH clients offer copy and paste from the screen with the mouse):

    finger $USER
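
For task 3, a minimal sketch of generating an RSA key and reading its fingerprint (the comment string is illustrative; register the public key via the portal's SSH keys tab):

    ssh-keygen -t rsa -C "albert@indiana.edu"   # accept the default path and set a passphrase
    ssh-keygen -lf ~/.ssh/id_rsa.pub            # prints the fingerprint for your submission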

GitHub
  1. Sign up at github.com and register your SSH key. Provide your github.com user name in your submission.

  2. Create an ‘I590-Projects-BigData-Software’ repository on your account and create a ‘hw2’ branch. Provide a clone URL of the ‘hw2’ branch in your submission.

  3. Answer this question by listing the git commands that satisfy the following description:

    Albert has some Python code files that he was developing on his local machine, but he wanted to use github.com to track changes and share his work with others. He had already created a new repository named ‘BigData’ on his github account, so he made a copy of the repository on his machine; there was nothing in the repo yet. He first added a ‘README.rst’ file to describe his repository. To make sure his description looked okay, he pushed his update to github and opened a webpage to check. When he accessed his repository on github.com via a web browser, he found that the contact info was missing, so he added it to the README.rst file online using the web browser, and his description then showed the new contact info. He returned to his local repository and updated it, because he wanted to sync the changes he had made on github.com. His next task was adding new_feature.py and bug.py in separate branches, not in master, because he considered these two files still in progress, with different purposes. He simply created ‘next’ and ‘error’ branches in his current repository and added the two files accordingly. All of his work was applied to github.com.

List the git commands that Albert used in the work above in your submission.

Python
  1. Write a Python program called fizzbuzz.py that accepts an integer n from the command line. Pass this integer to a function called fizzbuzz. The fizzbuzz function should then iterate from 1 to n. If the ith number is a multiple of two, print “fizz”; if a multiple of three, print “buzz”; if a multiple of both, print “fizzbuzz”; otherwise print the value.
  2. Create a ‘hw2’ branch on your GitHub repository ‘I590-Projects-BigData-Software’ and add fizzbuzz.py to the branch. Provide a clone URL for the branch in your submission (a command sketch follows).
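
A minimal sketch of the branch workflow for task 2, assuming a local clone of your repository already exists:

    cd I590-Projects-BigData-Software
    git checkout -b hw2                      # create and switch to the hw2 branch
    git add fizzbuzz.py
    git commit -m "add fizzbuzz.py for hw2"
    git push -u origin hw2                   # publish the branch and submit its clone URL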

HW3: OpenStack Exercise

Guidelines

  • Assignments must be completed individually.
  • Discussion is allowed (e.g. via Slack), but you must make the submission yourself. Acknowledge your helpers'/collaborators' names in the submission if you discussed the work or got help from anyone.
  • Use an individual GitHub repository. A repository in FutureSystems will be given later.

Create IU GitHub Account

  • Simply log in to https://github.iu.edu with your IU username and password (the same IU credentials you use on other IU sites, e.g. one.iu.edu)

OpenStack Command Line Tool nova

OpenStack Kilo is ready to use (as of 02/04/2016) on FutureSystems, and you will have a virtual instance (server) managed with the OpenStack command line tool nova once you complete all the tasks in this assignment. The tasks you need to complete are listed below (a command sketch follows the list):

  • SSH into india.futuresystems.org and
    • enable the nova command
  • Register an SSH key on OpenStack
    • rsa type

    • with the default key file names
      • public: $HOME/.ssh/id_rsa.pub
      • private: $HOME/.ssh/id_rsa
    • passphrase enabled

  • Start a single instance:
    • on the fg491 project
    • with an m1.small flavor,
    • an Ubuntu-15.10-64 image,
    • the registered key above,
    • and the VM name hw3-$OS_USERNAME
    • Assign a floating IP address
  • Install the required software on the virtual instance:
    • virtualenv
    • pip
    • ansible
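
A minimal sketch of the nova commands involved, assuming the illustrative key name $OS_USERNAME-key; the flavor, image, and VM names come from the task list above (see nova help <subcommand> for details):

    nova keypair-add --pub-key $HOME/.ssh/id_rsa.pub $OS_USERNAME-key   # register the public key
    nova boot --flavor m1.small --image Ubuntu-15.10-64 \
        --key-name $OS_USERNAME-key hw3-$OS_USERNAME                    # start the instance
    nova floating-ip-create                                             # allocate a floating IP if none is free
    nova floating-ip-associate hw3-$OS_USERNAME $FLOATING_IP            # attach it to the instance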

Warning

Do not terminate your instance, even if you completed and submitted hw3.

Test Program

We provide an hw3.py test file in your repository; check out the hw3 branch. Run it on india.futuresystems.org once you have completed all the tasks above. All available tests should succeed without errors. First, clone your private repository from IU GitHub; see the details in the IU GitHub Guidelines. You will use virtualenv to prepare the packages.

Run:

bash setup.sh
source $HOME/bdossp_sp16/bin/activate

Now, you can run the test program:

python hw3.py

If you completed everything, you should see:

...........
----------------------------------------------------------------------
Ran 11 tests in 1.646s

OK

After running the hw3.py program, find the hw3-results.txt file in your current directory. Add this file to your IU GitHub repository.

FAQ

Q. Where should I run the test program hw3.py? A. On india.futuresystems.org, not on your VM instance.

Q. bash setup.sh produces command not found errors. A. Make sure you can use the nova command to start a new VM as you did in the hw3 tasks. Otherwise, the test program cannot verify what you accomplished.

Q. The hw3.py test program failed due to a missing Python package named lib. A. Run hw3.py in the main directory of the hw3 branch. hw3.py does not work by itself; helper functions are required.

Submission via IU GitHub (github.iu.edu)

From now on, you will use IU GitHub to submit assignments in a private repository. See the IU GitHub Guidelines.

  1. Clone your private repository from the course organization. Your IU username is the name of your repository.

  2. Create a hw3 branch:

    git branch hw3
    git checkout hw3

  3. Run a pull command to fetch and merge with the template repository:

    git pull git@github.iu.edu:bdossp-sp16/assignments.git hw3

  4. Sync with the remote:

    git push -u origin hw3

  5. Add hw3-results.txt to your repository:

    git add hw3-results.txt

  6. Commit the merge with the template:

    git commit -am "initial merge with the template"

  7. Sync your changes:

    git push -u origin hw3

Challenging Tasks (Optional)

The following tasks are optional but strongly recommended. They relate to Python packages and APIs (application program interfaces), and extend the OpenStack nova exercise to give you more experience.

‘Hello Big Data’ Flask Web Framework

Find the flask sub-directory in the challange directory in your assignment repository. We provide a hello.py Python file that you can run in your VM, but there are a few requirements:

  • Use a virtualenv named 'bdossp-sp16' in your home directory
  • Open a web port for the Flask application to allow access from outside

Note

The two terms, VM and virtual instance, are interchangeable in this context.

  1. What command(s) do you run to create and enable the virtualenv?
  2. python hello.py may not work if you run it with only the standard Python libraries. What command(s) do you run to resolve the issue? (Hint: Flask is not a standard Python package.)
  3. If you ran the application successfully, you can see the ‘Hello Big Data’ message in your web browser on web port 15000. However, it is not accessible from outside, e.g. http://IP_ADDRESS:15000, because there is no rule for the port in the OpenStack security group. (We assume there is no firewall here.) What nova command(s) do you need to create/add a security group for the port?
  4. A flask rule is provided in the fg491 project. What nova command(s) do you need to see the current rule(s) in the security group and to apply it to your VM?

Write your solutions in a text file named flask-sol.txt after completing the tasks above. Add this file to the flask sub-directory.

Example view of your submission:

1. albert
2. ...
3. ...
9. http://...

HW5: Ansible Exercise

Note

Replace mongodb.yml with site.yml in hw5.sh if hw5.sh fails due to the incorrect filename. If you recently pulled the hw5 branch into your private repository, you do not need this fix.

Guidelines

  • Assignments must be completed individually.
  • Discussion is allowed (e.g. via Slack), but you must make the submission yourself. Acknowledge your helpers'/collaborators' names in the submission if you discussed the work or got help from anyone.
  • Use an individual GitHub repository. A repository in FutureSystems will be given later.

Use hw5 branch

  • Log in to https://github.iu.edu with your IU username and password (the same IU credentials you use on other IU sites, e.g. one.iu.edu)
  • Check out the hw5 branch

MongoDB Ansible Role

Writing a MongoDB playbook is taught in the Ansible lessons; in this assignment you write a MongoDB Ansible role. Submit the inventory, the main playbook, the command script, and your role including its sub-directories.

Requirements

The following files should be included in your submission:

  • the inventory file
  • site.yml, the main playbook file
  • the mongodb directory (the Ansible role for mongodb)
  • the hw5-cmd.script file

Preparation

  • Log in to india.futuresystems.org

  • Use the same bdossp-sp16 virtualenv used in hw3

  • Install ansible into the bdossp-sp16 virtualenv via the Python package manager (a sketch follows this list)

  • Change directory to your IU GitHub repository where you work on hw5

  • Create a new branch hw5 by:

    git checkout -b hw5
    
  • Pull hw5 template files by:

    git pull git@github.iu.edu:bdossp-sp16/assignments.git hw5
    
  • Sync to remote by:

    git push origin hw5
    
  • Start working on hw5
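
A minimal sketch of the virtualenv preparation, assuming the hw3 virtualenv lives at $HOME/bdossp_sp16 as created by hw3's setup.sh (adjust the path if yours differs):

    source $HOME/bdossp_sp16/bin/activate   # enable the virtualenv from hw3
    pip install ansible                     # installs into the active virtualenv
    ansible --version                       # sanity check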

HW5 Tasks

You need to write an Ansible role that installs MongoDB on your VM instance hw3-$USER. An Ansible playbook for MongoDB installation is given in the Ansible lessons; you may start from there, but this time you need to install MongoDB on Ubuntu 15.10. Systemd is the main init system in Ubuntu 15.10, and you need to place a service file using Ansible modules. Certain conditions should be met in your submission; see the requirements below (a skeleton sketch follows the list):

  • Create a new Ansible role where:
    • mongodb is the role name
  • Describe the tasks in the tasks directory:
    • Add the MongoDB public GPG key from:
      • hkp://keyserver.ubuntu.com:80
      • Use EA312927 as the MongoDB public GPG key ID when the Ubuntu package manager imports the key (apt-key)
    • Install mongodb-org 3.2 Community Edition for Ubuntu Trusty 14.04 LTS by adding a MongoDB repository from:

      • deb http://repo.mongodb.org/apt/ubuntu trusty/mongodb-org/3.2 multiverse
  • Define these as Ansible variables in the defaults directory; at least the four variable names below should be used:

    • mongodb_keyserver (to store hkp://...)
    • mongodb_gpgkey_id (to store EA312...)
    • mongodb_repository_list (to store deb http://...)
    • mongodb_package_name (use ‘mongodb’)
    • (more vars can be defined)
  • Write two handlers:
    • one for starting mongodb
    • one for restarting mongodb
  • Place a service file where:
    • the destination is /lib/systemd/system/mongodb.service
    • the owner/group of the destination file is root
    • the mode of the file is 0644
    • mongodb is reloaded after adding this file to the remote
    • You can find the mongodb.service.j2 template file in your hw5 branch
  • Write a main playbook:
    • that includes your new role
    • in the site.yml file
  • Run hw5.sh to record your outputs in the hw5-cmd.script file
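
A minimal sketch of the role skeleton, assuming the conventional Ansible layout with a main.yml in each subdirectory (file names beyond those required above are illustrative):

    mkdir -p mongodb/{tasks,handlers,defaults,templates}
    touch mongodb/tasks/main.yml mongodb/handlers/main.yml mongodb/defaults/main.yml
    cp mongodb.service.j2 mongodb/templates/   # template file provided in the hw5 branch
    touch site.yml inventory                   # main playbook and inventory at the top level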

Grading Guidelines
  • Existence of required files/directories (15%)
    • inventory
    • site.yml
    • (role) directory including subdirectories
    • hw5-cmd.script
  • Proper use of Ansible Variables (15%)

  • Proper use of Ansible Tasks (15%)

  • Proper use of Ansible Templates (15%)

  • Proper use of Ansible Handlers (15%)

  • General understanding of Ansible Roles (20%)

  • Successful Execution (5%)

FAQ
Q. How do I avoid typing my SSH passphrase while the current session is alive?

A. Use ssh-agent like this:

    eval `ssh-agent`
    ssh-add
    
Q. Where should I run Ansible playbooks or roles?

A. On india.futuresystems.org, not on your VM instance.

Q. I see the mongodb.service.j2 template file but don’t know exactly what to do.

A. Once you have installed the mongodb server, you need to register it as a service. In Ubuntu 15.10, systemd is the main init system, and you need to place a service file to register it. Explore the Ansible template module, which is useful for placing a file with variables. See the documentation here: http://docs.ansible.com/ansible/template_module.html
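
To get a quick feel for the template module's parameters, you can also run it ad hoc from the command line; a sketch matching the hw5 requirements above, with an illustrative inventory path:

    ansible all -i inventory --become -m template \
      -a "src=mongodb.service.j2 dest=/lib/systemd/system/mongodb.service owner=root group=root mode=0644"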
Q. Permission denied on git pull git@github.iu.edu:bdossp-sp16/assignments.git hw5

A. Try https, or register your ssh key at IU GitHub. Using the https URL looks like:

    git pull https://github.iu.edu/bdossp-sp16/assignments.git hw5

Submission via IU GitHub (github.iu.edu)

Use IU GitHub to submit assignments in a private repository. See the IU GitHub Guidelines.

  1. Clone your private repository from the course organization. Your IU username is the name of your repository.

  2. Create a hw5 branch:

    git branch hw5
    git checkout hw5

  3. Run a pull command to fetch and merge with the template repository:

    git pull git@github.iu.edu:bdossp-sp16/assignments.git hw5

  4. Sync with the remote:

    git push -u origin hw5

  5. Add the files and directories to your repository:

    git add inventory
    git add mongodb
    git add site.yml
    git add hw5-cmd.script

  6. Commit:

    git commit -am "submission hw5"

  7. Sync your changes:

    git push -u origin hw5

Challenging Tasks (Optional)

The following tasks are optional but strongly recommended. The goal is to write the mongodb role for RedHat-based operating systems as well, using Ansible conditionals and, if necessary, different modules.

MongoDB Roles for RedHat

You have completed a mongodb role for Ubuntu 15.10, a Debian-based operating system, only. In this challenge task, you are required to extend your mongodb role to RedHat-based operating systems as well. Ansible conditionals are recommended for selecting the correct tasks/files on different operating systems.

Find the mongodb-redhat directory in the challange sub-directory. Add your extended mongodb role to that directory.

Possible Project idea (Running Ansible on Windows)

Develop Ansible playbooks and roles for Windows machines using PowerShell and the winrm Python package instead of SSH. There are multiple possible approaches, for example:

  • develop a PowerShell script that starts a VirtualBox VM and runs the Debian ansible in it, using a local key; see the installation instructions of Cloudmesh, which also show how to set up ssh on a Windows machine.
  • develop a Docker-based ansible container; however, this is not as straightforward, as the key management needs to be done right.

You can find more information here: Windows Support

Project Guidelines

News

  • NIST Fingerprint Example (03/09/2016)
  • HBase is now supported (03/02/2016)
  • Examples are under development (03/02/2016)
  • Projects, datasets, and technologies from the past are available (02/26/2016)

Important Dates

  • Project Proposal: March 18th
  • Oral Presentation: Week 12 - April 1st, 2nd (Tentative)
  • Progress Checkup: Week 14 - April 15th, 16th (Tentative)
  • Final Submission: April 29th

Note

If you cannot make the presentation due to a time conflict, schedule a meeting with the Course Team.

Team Coordination

Teams of up to 3 members are recommended, but individual projects are allowed.

Project Expectation (Grade)

The final project counts for 60% of the semester grade; the remaining 40% comes from assignments.

  • 60% Final project
    • 10% Proposal
    • 10% Presentation
    • 30% Source code
    • 10% Report

Project Style

  • Basic
  • Bonus

You do not need a strong background or programming skills in HPC or Hadoop to complete a final project. We have noticed, however, that some students have difficulties learning Linux systems, shells, and scripts and improving their programming skills with parallelization in general. You have two options, Basic and Bonus, for starting your project, based on your capability in these areas.

Basic Project

A basic project starts from existing projects and extends their scope with minimal code development. For example, take existing Hadoop benchmark tools and run them on Hadoop clusters with different system configurations to compare. Try increasing data nodes or master nodes, or add ZooKeeper with different settings, and measure the differences. Comparing performance across software versions, settings, or configurations tells you where the focal points are for optimizing or improving Hadoop throughput. Choose a basic project if you are not competent with programming languages, e.g. Java or Python. Note that starting from existing projects does not mean you can simply search for and download popular projects from the internet and execute them. You need to address new findings and include the original source of the projects you referenced in your final project and reports.

  • Minimal code writing
  • Start from existing projects
Bonus Project

If you are working on a bonus project, you are required to write code/scripts to implement your idea in the final project. Installation and configuration should be done by Ansible playbooks. For example, take the NIST facial recognition software and run it on Hadoop clusters, changing the serial calculation to execute in parallel. Writing map and reduce functions may be necessary in Java, Python, or Scala. Write Ansible playbooks to install and configure your software packages within a few commands. If data analytics is the area you are interested in, you may try to develop new techniques to improve performance or implement parallel algorithms for complex face detection. Developing parallel programs would be involved in most cases. There are other possibilities as well. For instance, take hadoop-ansible-stacks, which consists of the basic components of Hadoop, and append new software tools by writing new playbooks in roles and addons. You could add Hive or update Spark to the latest release using parameters or definitions in YAML. If you focus on managing systems and software deployments, think about how to manage traffic by adding/removing nodes or how to apply new patches to particular nodes. Bonus points are given for exceptional project results.

  • Ansible is required
  • Extensive code and scripts writing are welcome
  • Using GitHub Issues is mandatory to communicate with AIs for your projects
  • Bonus points

Project Choice

  • Deployment
  • Benchmark (Performance Test)
  • Parallelization
  • Analytics
  • Created Own (upon approval)
Deployment

A deployment project focuses on automated software deployments on multiple nodes using automation tools/configuration management such as Ansible, Chef, Puppet, Salt, or Juju. For example, you can work on deploying Hadoop clusters with 10 medium virtual instances, sharded MongoDB clusters, or filesystems, e.g. NFS or Gluster. Ansible is recommended and supported in the class.

Examples:

  • Deploying Hadoop clusters
  • Deploying cluster managers (e.g. Mesos)
Benchmark

A benchmark project focuses on testing a system's performance by putting stress on different spots. For a hardware benchmark, filesystems, CPUs, or memory can be tested and measured; with a software focus, APIs, messaging queues, load balancers, or any applications can be tested and measured. HiBench, the Big Data Benchmark, and built-in tools, e.g. TeraSort, are available for Hadoop benchmarking.

Examples:

  • Hibench
  • Storm Benchmark
  • Big Data Benchmark for Big Bench
Parallelization

A parallelization project focuses on building efficient parallel software stacks, including MPI and Hadoop clusters. For example, you may find writing map and reduce functions relatively easy, e.g. WordCount, but applying them in practice to large datasets is not that simple. Think about how to load your dataset into Hadoop filesystems or databases and run your jobs in a distributed fashion.

Examples:

  • Pig
  • Spark
Analytics

An analytics project focuses on developing algorithms for different problems based on the datasets and topics chosen for your project. In this kind of project you will be required to develop algorithms for improving parallelism or performance, rather than, for example, a new algorithm for face recognition.

Examples:

  • Faunus Graph analytics
  • Ibis
Created Own

You can develop your own project idea and make it a class project upon approval. Describe your thoughts, tools, and topics, and make a clear statement of the problems you identified in your project proposal.

Project Requirement

  • Installation/configuration by Ansible playbooks or relevant tools (Ansible roles)
  • Reproducibility - runnable on a Linux distribution
  • Sample dataset - up to 480GB per team
  • At most 12 m1.medium VM instances are given to each team
  • Software stacks similar to Software Layers

Project Proposal

Please submit your project proposal by the due date. The proposal.rst RST file is provided in the project template repository. Fork this repository and write your proposal in that file under the ‘docs’ directory. Find the RST Quick Reference and Online RST Editor here. A project proposal is typically 1-2 pages long and should contain, in the description section:

  • the nature of the project and its context
  • the technologies used
  • any proprietary issues
  • specific aims you intend to complete
  • and a list of intended deliverables (artifacts produced)

Oral Presentation

You are required to demonstrate your project during the presentation week. A clear statement of the problems is necessary, along with the schedule, plan, roles of team members, and resources to use.

  • Students will use Adobe Connect to give their presentations, which will be recorded.
  • 3-5 minutes per team.
  • The presentation can be substituted with written reports upon approval; 1-2 page progress report(s) must be included.
Presentation Guideline
  • Demonstrate the following criteria:
    • team members (roles)
    • problem definition
    • list of technologies
    • list of development tools, languages
    • list of dataset and its availability
    • schedule
    • resources to use
  • All presentations will be recorded.

Progress Checkup

The following activities will be evaluated:

  • Code development in a project repository
  • Participation of team members
  • Software installation
  • Datasets preparation

List of Possible Projects

Note

We are currently working on this; any software and/or details are subject to change without notice. This is for reference only.

  • Big Data Analytics Stack
Software Layers
Layer | Supported | In Progress | Optional
Scheduling Layer | YARN | | Mesos
Database Layer | HBase | MongoDB, MySQL | CouchDB, PostgreSQL, Memcached, Redis
Analytics Layer | Java | MLlib, Python | BLAS, LAPACK, Mahout, MLbase, R
Data Processing Layer | Hadoop MapReduce, Spark, Pig | Storm, Flink | Tez, Hama, Hive

You may consider working on the Big Data Analytics Stack using Ansible playbooks. The default configuration of the stack is YARN + HDFS + Java + Hadoop MapReduce, Spark, and Pig. You can develop a new addon for one of the optional software packages and attach it to your stack. Find more details here: big-data-stack, Ansible Roles.

Projects from Software Deployments

Projects related to the Hadoop stack consist of either extending its functionality or using the current features. This repository is intended to define a simple, easily deployable, customizable data analytics stack built on Hadoop. Currently, deployment is done to a virtual cluster running on OpenStack Kilo on FutureSystems.

big-data-stack
Title Category Data Sets Technologies
big-data-stack Software Deployments n/a Ansible
Projects Derived from Benchmarking Sets

There are many benchmark sets such as BigDataBench, HiBench, Graph 500, BigBench, LinkBench, MineBench, BG Benchmark, Berkeley Big Data Benchmark, TPCx-HS, and CloudSuite. See http://dsc.soic.indiana.edu/publications/OgreFacetsv9.pdf

BigDataBench, ICT, Chinese Academy of Sciences
Title Category Data Sets Technologies
Amazon Movie Reviews Batch Data Analytics 8 million reviews
  • Hadoop
  • Spark
  • MPI
Google web graph Batch Data Analytics Webgraph from Google, 2002
  • Hadoop
  • Spark
  • MPI
Facebook Social Network Batch Data Analytics Facebook data
  • Hadoop
  • Spark
  • MPI
Genome sequence data Batch Data Analytics .cfa sample data (unstructured text file) Work Queue (master/worker framework)

Wang, Lei, et al. “Bigdatabench: A big data benchmark suite from internet services.” High Performance Computer Architecture (HPCA), 2014 IEEE 20th International Symposium on. IEEE, 2014. link

Storm, Hadoop, Hive, Mahout from Intel and Yahoo
Title Category Data Sets Technologies
Storm Benchmark Batch Data Analytics https://github.com/intel-hadoop/storm-benchmark Storm
Big Data Benchmark for Big Bench Batch Data Analytics https://github.com/intel-hadoop/Big-Data-Benchmark-for-Big-Bench Hadoop, Hive, Mahout
HiBench
Title Category Data Sets Technologies
Micro Benchmarks
  • Sort
  • WordCount
  • TeraSort
  • EnhancedDFSIO
Batch Data Analytics https://github.com/intel-hadoop/HiBench Hadoop
Web Search
  • Nutch Indexing
  • Page Rank
Batch Data Analytics https://github.com/intel-hadoop/HiBench Mahout
Machine Learning
  • Bayesian Classification
  • K-means Clustering
Batch Data Analytics https://github.com/intel-hadoop/HiBench Mahout
OLAP Analytical Query
  • Hive Join
  • Hive Aggregation
Batch Data Analytics https://github.com/intel-hadoop/HiBench Hive
Other Benchmarking Sets
Title Category Data Sets Technologies
Graph 500 Batch Data Analytics main site MPI
BigBench Batch Data Analytics main site
  • MapReduce
  • Hadoop
LinkBench Batch Data Analytics main repo
  • Java
  • MySQL
BG Benchmark Batch Data Analytics main site
  • MongoDB
  • HBase
  • VoltDB
Berkeley Big Data Benchmark Data Systems main site
  • Redshift
  • Hive
  • SparkSQL
  • Impala
  • Stinger/Tez
TPCx-HS Data Systems main site Hadoop
CloudSuite Batch Data Analytics main site MapReduce
MineBench Batch Data Analytics main site, Data Generator  
Projects From NIST
Possible Projects From NIST (http://bigdatawg.nist.gov/_uploadfiles/M0399_v2_8471652990.doc)
Title Category Data Sets Technologies
Fingerprint Matching Batch Data Analytics
  • NIST Special Database 27a (Free)
  • NIST Special Database 14, 29, 30 (non-Free)
  • Apache Hadoop
  • Spark
  • HBase
Human and Face Detection from Video (simulated streaming data) Streaming Data Analytics OpenCV, INRIA Person Dataset
  • Apache Hadoop
  • Spark
  • OpenCV
  • Mahout
  • MLlib
Live Twitter Analysis Streaming Data Analytics Live Twitter feed
  • Apache Storm
  • HBase
  • Twitter’s Search and Streaming APIs,
  • D3.js
  • Tableau
Big data Analytics for Healthcare Data/Health informatics Batch Data Analytics Medicare Part-B in 2014
  • Apache Hadoop
  • Spark
  • HBase
  • Mahout
  • Lucene/Solr
  • MLlib
Spatial Big data/Spatial Statistics/Geographic Information Systems Batch Data Analytics Uber Ride Sharing GPS Data
  • Apache Hadoop
  • Spark
  • GIS-tools
  • Mahout
  • MLlib
Data Warehousing and Data mining Batch Data Analytics 2010 Census Data Products: United States
  • Apache Hadoop
  • Spark
  • HBase
  • MongoDB
  • Hive
  • Pig
  • Mahout
  • Lucene/Solr
  • MLlib
2015 Fall Suggested Projects
Title | Data set | Software | Category
NIST Fingerprint (a subset of: NFIQ, PCASYS, MINDTCT, BOZORTH3, NFSEG, SIVV) | NIST Special Database 27A [4GB] | NIST Biometric Image Software (NBIS) v5.0 [userguide] | Batch Data Analytics
Hadoop Benchmark (each) - TeraSort Suite | Teragen | hadoop-examples.jar | Batch Data Analytics
Hadoop Benchmark (each) - DFSIO (HDFS Performance) | | hadoop-mapreduce-client-jobclient | Batch Data Analytics
Hadoop Benchmark (each) - NNBench (NameNode Perf.) | | hadoop-mapreduce-client-jobclient | Batch Data Analytics
Hadoop Benchmark (each) - MRBench (MapReduce Perf.) | | src/test/org/apache/hadoop/mapred/MRBench.java | Batch Data Analytics
Projects from Other Sources
Projects From Other Sources
Title Category Data Sets Technologies
MapReduce Implementation for Longest Common Substring Problem Batch Data Analytics Escherichia coli K-12
  • Python
  • Amazon
  • MapReduce
MapReduce Implementation for GFF Parsing Batch Data Analytics  
  • Python
  • Disco
  • Amazon EC2
  • MapReduce

List of Datasets

Note

We are currently working on this; any software and/or details are subject to change without notice. This is for reference only.

Note

There is no direct support for datasets.

Note

Inform the Course Team about large datasets. These will be prepared and made downloadable via /share/project2/FG491 on india.futuresystems.org.

List of Technologies

Note

We are currently working on this; any software and/or details are subject to change without notice. This is for reference only.

Note

There is no direct support for analytics software.

Details on Software Submission

Code submissions should be made on GitHub and include a README file.

The README includes:

  • Test instructions
  • List of data sources
  • List of technologies used
Details on Final Report

The final report concludes your team's work and describes your findings and results. The following sections should be included:

  • Description of your project

  • Problem statement

  • Purpose and objectives

  • Results

  • Findings

  • Implementation

  • References
    • original source of code snippets
    • original source of datasets

The final report should satisfy the following guidelines:

  • 4 - 6 pages
  • Times Roman 12 point, 1.1 line spacing, in Microsoft Word
  • Figures can be included
  • Proper citations must be included
  • Material may be taken from other sources, but it must amount to at most 25% of the report and must be cited
  • The level should be similar to a publishable paper or technical report
Details on Grading Criteria
  • Proposal
    • Clear statement
    • Quality and Breadth
    • Interest
  • Code
    • Reproducibility
    • Executable (Most weighted)
    • Instruction of Installation
    • Instruction of Configuration
    • Datasets
    • Acknowledgements
    • Gee whiz factor
  • Report
    • Related Work
    • Completeness
    • Level of insight

FAQ

Q. Is use of FutureSystems required?

A. No, it is not required, but the project must be deployable and runnable on FutureSystems Kilo, and you should provide detailed instructions on how to do so. Ideally, running ansible-playbook site.yml should be all that is needed to deploy, after booting and editing the inventory.txt file.

Q. I need more time to complete code development; may I have an extension?

A. An extension may be approved upon request. Send an extension request email to the course email with the title [Project Extension] and an expected completion date.

Q. Our team wants to change a topic or scope of a project after project proposal or presentation, is it allowed?

A. The topic should be close to what you proposed earlier. Please contact Dr. Fox or the course email if you change the topic or scope of your project significantly. Also inform us if you change team members. These changes may be approved upon request.

Q. Is a report- or survey-type final project allowed?

A. No, only software projects are allowed in this class.

Q. I found there is a similar project that I proposed, should I keep working on my project?

A. Consult with the Course Team to differentiate your project in detail. You may be asked to focus on a specific area in order to avoid similarity.

Q. I can’t make the oral presentation because I have a business trip (or a conference).

A. Schedule a meeting in Week 11 or Week 13 with the Course Team.

Q. What does it mean that there is no direct support for datasets and analytics software?

A. We will provide support for accessing a dataset under 500 GB.

Questions & Support

Office Hours

General Discussions (Adobe Connect)

  • TBD

Support Office Hours (Google hangout)

Office hours are conducted by appointment only during the times given below. Please send an email to bdosspcoursehelp@googlegroups.com

  • Mon-Fri 9am-5pm

Contact

Adobe Connect

Discussions will be held using Adobe Connect. Make sure you are prepared to join a meeting by following these links:

  1. Adobe Connect Setup Instructions: https://ittraining.iu.edu/connectsetup/
  2. Installing Adobe Connect Plug-In: https://kb.iu.edu/d/bdoc
  3. Testing Connectivity: https://connect.iu.edu/common/help/en/support/meeting_test.htm

Important

You can join a meeting by this link.

Note

All times are in US Eastern Standard Time (EST).

Additionally, there may be technical issues when attempting to join a meeting; please see the following for guidance.

FAQ

How to ask a question

When submitting a question, please:

  1. copy/paste directly from the terminal into the email message. Do not send text files, zipfiles, or other attachments, as they may not be opened. Screen grabs may be acceptable if you annotate the relevant parts using an image editor, or if they are integral to showing the problem.

  2. use the mailing list (bdosspcoursehelp@googlegroups.com) to direct questions as all support staff are subscribed. Direct message via Slack is discouraged.

  3. describe the specific step you are trying to accomplish, include the actions you took and the results. Ensure that you can reproduce your problem by executing only the steps present in your email. Also include:

    • your futuresystems username
    • the output of nova show MY_VM (replacing MY_VM appropriately)

    For example:

    I cannot ssh into my VM due to permission denied errors:
    
    $ ssh user@machine
    Permission denied (publickey,hostbased).
    
    I tried following the steps in the FAQ as defined here: https://...
    <SHOW THE COMMANDS AND THEIR OUTPUT>
    
    
    Relevant information is:
    
    Username: albert
    $ nova show MY_VM
    +--------------------------------------+----------------------------------------------------------+
    | Property                             | Value                                                    |
    +--------------------------------------+----------------------------------------------------------+
    | accessIPv4                           |                                                          |
    | accessIPv6                           |                                                          |
    | config_drive                         |                                                          |
    | created                              | 2016-03-29T19:35:41Z                                     |
    | fg491-net network                    | 10.0.5.22, 149.165.159.241                               |
    | flavor                               | m1.large (4)                                             |
    | hostId                               | 683eed6c03fcc23879620b6042eaaa22149d915bfcd4ec9e5feab7c5 |
    | id                                   | 60dc4420-e5c0-4897-9a58-2cd48d6521b0                     |
    | image                                | CentOS7 (dc5c041f-7881-441e-af91-e9620efde901)           |
    | key_name                             | india                                                    |
    | metadata                             | {}                                                       |
    | name                                 | $USER-myvmname                                           |
    | os-extended-volumes:volumes_attached | []                                                       |
    | progress                             | 0                                                        |
    | security_groups                      | default                                                  |
    | status                               | ACTIVE                                                   |
    | tenant_id                            | 74e411d6d99e4497901d4c4e2b159f41                         |
    | updated                              | 2016-03-29T19:35:56Z                                     |
    | user_id                              | 090ea72e85c94c49ad8cc8133627fa1a                         |
    +--------------------------------------+----------------------------------------------------------+
    

Why can’t I ssh into my VM?

  1. Make sure to boot on the internal network first. Once the node is up, attach the floating IP.

    $ NET_ID=$(nova net-list | grep $OS_PROJECT_NAME | cut -d' ' -f2)
    $ nova boot --nic net-id=$NET_ID      # ... etc
    $ nova floating-ip-list | grep ' - '  # find an available floating ip.
    $ # create a floating ip with nova floating-ip-create if there are no floating ips available
    $ nova floating-ip-associate          # ... etc
    
  2. If your VM is on the 10.1.x.y subnet, it is accessible:

    • from outside the subnet with a floating ip only
    • from inside the 10.1.x.y subnet

    Check ip of your VM(s) with

    $ nova show $USER-myvmname | grep network
    
  3. Make sure you can ping your VM:

    $ ping -c 5 $IP
    
  4. Make sure that the machine has finished booting by checking that the ssh daemon is listening on port 22:

    $ nc -zv $IP 22
    Connection to 149.165.158.1 22 port [tcp/ssh] succeeded!
    
  5. Make sure you are trying to log in with the correct username. For Ubuntu VMs, the username is ubuntu; for CentOS use centos.

    For example, to log into an Ubuntu VM:

    $ ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null ubuntu@$IP
    
  6. If you are still unable to ssh, try a hard reboot a few times:

    $ nova reboot --hard $USER-myvmname
    
  7. Check that you have an ssh key registered with openstack using nova keypair-list and make note of the fingerprint:

    $ nova keypair-list
    +----------------+-------------------------------------------------+
    | Name           | Fingerprint                                     |
    +----------------+-------------------------------------------------+
    | india          | 41:29:20:a2:51:25:5d:66:71:02:15:b6:cd:e2:36:06 |
    +----------------+-------------------------------------------------+
    
  8. Check that the correct key name was passed to nova boot when starting the VM by using nova show:

    $ nova show $USER-myvmname
    +--------------------------------------+----------------------+
    | Property                             | Value                |
    +--------------------------------------+----------------------+
    # ...
    | key_name                             | india                |
    # ...
    +--------------------------------------+----------------------+
    
  9. Ensure that the fingerprint matches:

    $ ssh-keygen -lf ~/.ssh/id_rsa
    2048 41:29:20:a2:51:25:5d:66:71:02:15:b6:cd:e2:36:06 ~/.ssh/id_rsa.pub
    
  10. Make sure that the key was injected into the VM during startup by grabbing the console log and searching for your fingerprint. Make sure to wait a few minutes after nova boot to allow the node to start up:

$ nova console-log $USER-myvmname | grep -A 2 -B 4 '41:29:20:a2:51:25:5d:66:71:02:15:b6:cd:e2:36:06'
ci-info: ++++++Authorized keys from /home/centos/.ssh/authorized_keys for user centos+++++++
ci-info: +---------+-------------------------------------------------+---------+-----------+
ci-info: | Keytype |                Fingerprint (md5)                | Options |  Comment  |
ci-info: +---------+-------------------------------------------------+---------+-----------+
ci-info: | ssh-rsa | 41:29:20:a2:51:25:5d:66:71:02:15:b6:cd:e2:36:06 |    -    |           |
ci-info: +---------+-------------------------------------------------+---------+-----------+

If, after going through these steps, you are still unable to access the VM, delete the VM and try again two or three times, waiting a few minutes between each attempt. OpenStack is a collection of many distributed systems, and the nature of distributed systems is that they can be prone to random failure.

If you are still unable to log in, please contact us and indicate that you have gone through these steps, and show the output of the above commands.

Why can’t I modify my ~/.ssh/authorized_keys file?

You cannot manually manage your authorized_keys file on india for security reasons. If you need to change your ssh key, do so via the SSH keys tab on your Web Portal Account.

Why does my MongoDB deployment fail?

In this case: mongodb is installed successfully, but the service cannot be started. Solving this is the goal of the assignment, which demonstrates an important aspect of many development processes: namely, the effects of changing infrastructure.

To put this in context: Ubuntu for many years (through the 14.04 LTS release) used the Upstart init daemon. As of 15.04, this switched to systemd. However, the mongodb installation expects to use Upstart to run the service, which therefore fails.

There are many solutions to this type of problem:

  1. add the system service file by hand
  2. rollback the OS from Ubuntu 15.04 to 14.04
  3. use a different repository which includes the systemd service file

For the purposes of this homework, the first option is taken, and the service file is provided in the repository. As the hw instructions say, place the provided service file in the appropriate location.

If, after deploying the service file you are still unable to start the mongodb service, please include the contents of /lib/systemd/system/mongodb.service in your email.

One common issue is the user the mongodb service runs as: make sure that the username in the service file matches the user account created for mongodb.

  • Check the username in the service file by looking at the User value.
  • Check the username on the system with grep -i mongo /etc/passwd

If these two values do not match, adjust your ansible deployment.
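
If the two values match and the service still will not start, standard systemd tooling on the VM can show what is going wrong. A short debugging sketch (the unit name follows the hw5 requirements):

    $ systemctl status mongodb           # current state and the user it runs as
    $ journalctl -u mongodb --no-pager   # log output from the last start attempt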

Security Groups

As projects are shared and everyone can modify the security groups, it is best to create security groups prefixed with your username, e.g. $USER-default, and add your rules to that (a command sketch follows).
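
A sketch using the nova secgroup commands (the group name and the ssh rule are illustrative):

    $ nova secgroup-create $USER-default "personal default rules"
    $ nova secgroup-add-rule $USER-default tcp 22 22 0.0.0.0/0   # allow ssh
    $ nova boot --security-groups $USER-default ...              # attach the group at boot time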

Naming VMs

All VMs should be prefixed with your username. This will allow everyone to identify the VMs that belong to you.

Naming the key on the Cloud

It is best to name the key on the cloud with your <portalname> in order not to confuse it with others'. It is also good practice to optionally append -key, so your key name would be <portalname>-key.

Accessing Root

The default login user (ubuntu on India, cc on Chameleon) has sudo privileges.

Beware of Denial-of-Service attacking your own machine

We have seen students loop an ssh command during boot, issuing a new attempt as soon as the previous one failed. They did this so many times, and not just with one but with many, many VMs, that they flooded the network. Multiply this by X users and you can see that this process alone can create a denial-of-service attack on cloud services. So you must put a sleep between such ssh attempts when checking whether your VM is really up. Wait at least 30 seconds; at times it can take as much as 10 minutes, depending on usage.

You can do this in bash and zsh using until:

$ until nc -zv $IP 22; do sleep 30s; done

This sleeps for 30 seconds each iteration until the ssh service is detected on port 22.

Using Chameleon Cloud

You can find documentation on how to migrate from India (Futuresystems) OpenStack to Chameleon cloud here: https://github.com/futuresystems/class-admin-tools/blob/master/chameleon/big-data-stack.org

Make sure you follow these instructions.

Regarding some common questions about switching to and using Chameleon, here are some tips if you are having trouble:

General
  1. there is so far no evidence that chameleon is experiencing the same load problems that india is

  2. make sure you don’t source anything under ~/.cloudmesh/

  3. make sure you do source the CH-817724-openrc.sh file

  4. make sure you enter your password in correctly

  5. make sure that running nova list works as a sanity check (if not, try steps 2, 3 again)

    As there is no confirmation of whether your password was entered correctly or incorrectly, you should test using nova list to ensure authentication is possible. Make sure you have not sourced anything under ~/.cloudmesh/clouds/... as this will corrupt the environment nova uses to authenticate to Chameleon.

Differences between Chameleon and India:
  1. the username to log into the VM is different: use cc instead of ubuntu
  2. you cannot log into the internal IP address (192.X.Y.Z). You must associate a floating ip address first
  3. the ubuntu image is called CC-Ubuntu14.04 instead of Ubuntu-14.04-64.
BDS on Chameleon

If you are using the Big Data Stack, you need to make the following changes as well to tell BDS to use the Chameleon-specific environment instead of india:

  1. update .cluster.py to use the CC-Ubuntu14.04 image (as per here)
  2. set “create_floating_ip” to “True” in .cluster.py (as per here)
  3. set the user to cc in ansible.cfg (as per here)
SSH problems with Chameleon

If you experience trouble ssh-ing into a Chameleon instance, make sure that the fingerprint of the key injected into the instance (get it with nova console-log $VM_NAME) matches the one you are using (the default is ~/.ssh/id_rsa; use ssh-keygen -lf $PATH_TO_KEY to see it).

Keystone problems

The authentication mechanisms for the FutureSystems and Chameleon clouds are incompatible. This means that if you source the openrc file for FutureSystems (usually in your ~/.cloudmesh/clouds/india directory) and then your Chameleon Cloud openrc file, you may get something like this:

$ nova list
ERROR (DiscoveryFailure): Cannot use v2 authentication with domain scope

The following has also been reported:

$ nova list
No handlers could be found for logger "keystoneclient.auth.identity.generic.base"
... terminating nova client

The solution is to open a new terminal session with a fresh environment and source only the openrc file for the cloud you need to use.

For the adventurous, you may alternatively clear your OpenStack environment variables using the following:

env | egrep '^OS_' | cut -d= -f1 | while read var; do unset $var; done

The above should work in sh-like shells (e.g. sh, bash, zsh).