Welcome to CI learning-challenge’s documentation!¶
Contents:
Introduction¶
Resources¶
- Git - Repository: https://bitbucket.org/paheld/learning-challenge
- Documentation: http://learning-challenge.readthedocs.org
Requirements¶
- django
- Matplotlib
- SciPy
- NumPy
- PyMySQL
- six
- pydot
- docutils
Algorithms¶
Decision Trees - (dtree)¶
(wikipedia) A decision tree is a decision support tool that uses a tree-like graph or model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility. It is one way to display an algorithm. Decision trees are commonly used in operations research, specifically in decision analysis, to help identify a strategy most likely to reach a goal.
See also: [QR1986] and [QR1987]
[QR1986] | Quinlan, J. R. Induction of decision trees. Machine Learning, 1986, vol. 1, no. 1, pp. 81-106. |
[QR1987] | Quinlan, J. R. Simplifying decision trees. International Journal of Man-Machine Studies, 1987, vol. 27, no. 3, pp. 221-234. |
Parameters¶
splitter - Which strategy should be used to choose the next split attribute?¶
`best` always chooses the attribute that maximizes the score obtained via the split quality criterion; `random` simply chooses one at random.
- name: Splitting strategy
- default:
- values: `best`, `random`
- type: list
criterion - Measure of split quality¶
`gini` uses Gini’s impurity measure, `entropy` uses the Information Gain.
- name: Split criterion
- default:
- values: `gini`, `entropy`
- type: list
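The two criteria can be illustrated with a minimal sketch. It uses the textbook definitions of Gini impurity, entropy and information gain; the function names are hypothetical and are not taken from the project's code.

```python
import math
from collections import Counter


def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class frequencies."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy (in bits) of the class distribution."""
    n = len(labels)
    return -sum((count / n) * math.log2(count / n)
                for count in Counter(labels).values())


def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted


mixed = ["a", "a", "b", "b"]
print(gini_impurity(mixed))
print(entropy(mixed))
print(information_gain(mixed, [["a", "a"], ["b", "b"]]))
```

A split that separates the classes perfectly leaves pure children, so both measures drop to zero for each child and the information gain equals the entropy of the parent.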
max_depth - Maximum depth of a tree.¶
If 0, leaves are expanded until they are completely pure or all attributes have been used. Otherwise, tree growth stops once the set number of attributes has been used.
- min: 0
- default:
- type: int
- name: Maximum tree depth
k-Nearest Neighbors - (kNN)¶
[WPknn] In pattern recognition, the k-Nearest Neighbors algorithm (or k-NN for short) is a non-parametric method used for classification and regression. In both cases, the input consists of the k closest training examples in the feature space. k-NN is a type of instance-based learning, or lazy learning, where the function is only approximated locally and all computation is deferred until classification. The k-NN algorithm is among the simplest of all machine learning algorithms.
[WPknn] | http://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm |
Parameters¶
distance - Used distance measure¶
Based on this metric the `k` nearest neighbors are chosen.
- name: Distance measure
- default:
- values:
  - `chebyshev`: The Chebyshev (maximum norm) distance measure.
  - `euclidean`: The common Euclidean distance measure.
  - `sqeuclidean`: The squared Euclidean distance measure.
  - `hamming`: The Hamming distance measure.
  - `weuclidean`: The weighted Euclidean distance measure. Parameter `weights` (string): comma-separated list of weights for each dimension.
  - `minkowski`: The Minkowski distance with parameter `p`. To turn this into Euclidean distance, choose p=2. Parameter `p` (float, min: 0.001): exponent of the Minkowski distance.
  - `seuclidean`: The standardized Euclidean distance measure.
  - `manhattan`: The Manhattan (or cityblock) distance measure.
  - `wminkowski`: The Minkowski distance with parameter `p`. To turn this into Euclidean distance, choose p=2. Parameter `p` (float, min: 0.001): exponent of the Minkowski distance. Parameter `weights` (string): comma-separated list of weights for each dimension. The distance is calculated as $d(x, y) = \left(\sum_{i=0}^{n} (x_i - y_i)^p\right)^{1/p}$.
- type: dict
k - Number of neighbors¶
The higher this value, the more neighboring points can `vote` for the class label of the target instance. Values that are too low may be sensitive to noise; values that are too high may be sensitive to unbalanced class distributions.
- min: 1
- default: 3
- type: int
- name: k
weights - Method used for instance weighting.¶
If `uniform` is used, each of the nearest neighbors has equal weight in the prediction of the new class label, while `distance` uses the inverses of their respective distances as weights.
- name: Weighting method
- default:
- values: `uniform`, `distance`
- type: list
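How the two weighting methods change the vote can be shown with a small sketch. It assumes the `k` nearest neighbors are already available as `(distance, label)` pairs; the function name `knn_vote` and the handling of a zero distance are assumptions for illustration, not the project's implementation.

```python
from collections import defaultdict


def knn_vote(neighbors, weights="uniform"):
    """Return the winning label among (distance, label) pairs.

    "uniform": every neighbor counts as one vote.
    "distance": every neighbor counts as 1 / distance.
    """
    votes = defaultdict(float)
    for distance, label in neighbors:
        if weights == "uniform":
            votes[label] += 1.0
        elif weights == "distance":
            if distance == 0:
                # Assumption: an exact match decides the vote immediately.
                return label
            votes[label] += 1.0 / distance
        else:
            raise ValueError("unknown weighting method: %r" % weights)
    return max(votes, key=votes.get)


neighbors = [(0.5, "a"), (2.0, "b"), (2.0, "b")]
print(knn_vote(neighbors, "uniform"))   # two votes for "b" against one for "a"
print(knn_vote(neighbors, "distance"))  # weight 2.0 for "a" against 1.0 for "b"
```

With `k` = 3 the same neighborhood gives different labels: the uniform vote follows the majority, while the distance-weighted vote follows the single very close neighbor.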
Distance-Measures¶
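chebyshev - (chebyshev)¶
The distance is calculated as $d(x, y) = \max_i |x_i - y_i|$.
euclidean - (euclidean)¶
The distance is calculated as $d(x, y) = \sqrt{\sum_{i=0}^{n} (x_i - y_i)^2}$.
hamming - (hamming)¶
The distance is calculated as the number of components which differ over the number of components.
manhattan - (manhattan)¶
The distance is calculated as $d(x, y) = \sum_{i=0}^{n} |x_i - y_i|$.
minkowski - (minkowski)¶
The distance is calculated as $d(x, y) = \left(\sum_{i=0}^{n} |x_i - y_i|^p\right)^{1/p}$.
seuclidean - (seuclidean)¶
The distance is calculated as $d(x, y) = \sqrt{\sum_{i=0}^{n} \frac{(x_i - y_i)^2}{s_i^2}}$, where $s_i$ is the standard deviation of the $i$-th component in the data set.
sqeuclidean - (sqeuclidean)¶
The distance is calculated as $d(x, y) = \sum_{i=0}^{n} (x_i - y_i)^2$.
weuclidean - (weuclidean)¶
The distance is calculated as $d(x, y) = \sqrt{\sum_{i=0}^{n} w_i (x_i - y_i)^2}$.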
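Most of the measures listed above can be written in a few lines of plain Python. The following sketch follows the usual definitions of these distances; it is meant as an illustration only, and the project itself may delegate these calculations to a library such as SciPy.

```python
import math


def chebyshev(x, y):
    """Largest absolute difference over all components."""
    return max(abs(a - b) for a, b in zip(x, y))


def sqeuclidean(x, y):
    """Sum of squared component differences."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def euclidean(x, y):
    """Square root of the squared Euclidean distance."""
    return math.sqrt(sqeuclidean(x, y))


def weuclidean(x, y, weights):
    """Euclidean distance with one weight per dimension."""
    return math.sqrt(sum(w * (a - b) ** 2 for w, a, b in zip(weights, x, y)))


def manhattan(x, y):
    """Sum of absolute component differences."""
    return sum(abs(a - b) for a, b in zip(x, y))


def minkowski(x, y, p):
    """Generalization with exponent p (p=1: Manhattan, p=2: Euclidean)."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def hamming(x, y):
    """Fraction of components that differ."""
    return sum(a != b for a, b in zip(x, y)) / len(x)


x, y = [0, 0], [3, 4]
print(chebyshev(x, y), euclidean(x, y), manhattan(x, y), minkowski(x, y, 2))
```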
Configuration¶
This is an example configuration. Please copy this file to config.py and adapt the settings to your local environment.
- config_example.BACKEND_FILE_CACHE_DIR = '/tmp'¶
Directory for temporary files, such as dataset files.
- config_example.BACKEND_PASSWORD = u'foobar'¶
Password for the backend user
- config_example.BACKEND_PIDFILE = u'/tmp/learning-challenge-worker.pid'¶
PID file of the worker process
- config_example.BACKEND_PING_TIME = 15¶
Average delay between two pings.
- config_example.BACKEND_RUNFILE = u'/tmp/learning-challenge-worker.running'¶
File that indicates that the workers are running. It is also created if the worker process is not a daemon.
- config_example.BACKEND_USERNAME = u'backend'¶
The username for backend access to the server. Needs admin rights.
- config_example.BACKEND_WEKA_PATH = u''¶
Path to Weka’s jar files.
- config_example.DJANGO_DB_DATABASE = u'learning-challenge'¶
MySQL-Database name
- config_example.DJANGO_DB_HOSTNAME = u'localhost'¶
MySQL Server name
- config_example.DJANGO_DB_PASSWORD = u'bar'¶
Password for MySQL-Server, needed by Django Server Instance only.
- config_example.DJANGO_DB_USERNAME = u'foo'¶
Username for MySQL-Server, needed by Django Server Instance only.
- config_example.DJANGO_DEBUG = True¶
Run Django in debug mode?
- config_example.DJANGO_MEDIA_ROOT = u'./MEDIA/'¶
Directory for job Images.
- config_example.DJANGO_STATIC_ROOT = u'./STATIC/'¶
Directory for static files.
- config_example.TORNADO_PIDFILE = u'/tmp/learning-challenge-server.pid'¶
PID file of the server process
- config_example.TORNADO_PORT = 8888¶
Port of the tornado server
- config_example.VISUALIZATION_DIMENSION_DEFAULT = 4¶
How many dimensions are plotted as default
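A local config.py could then look roughly like the following sketch. It only shows a few of the settings listed above; the password and the debug value are placeholders chosen for illustration, and any setting that is left out here still has to be copied from the example file.

```python
# config.py: local copy of the example configuration (placeholder values)

# Backend access to the server; this user needs admin rights.
BACKEND_USERNAME = u'backend'
BACKEND_PASSWORD = u'change-me'  # placeholder, choose your own

# MySQL connection used by the Django server instance.
DJANGO_DB_HOSTNAME = u'localhost'
DJANGO_DB_DATABASE = u'learning-challenge'
DJANGO_DB_USERNAME = u'foo'
DJANGO_DB_PASSWORD = u'bar'

# Debug mode is usually switched off outside of development.
DJANGO_DEBUG = False

# Port of the tornado server.
TORNADO_PORT = 8888
```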
Web-Frontend¶
Models¶
This is a collection of all required models.
- class webapp.models.Dataset(*args, **kwargs)[source]¶
Represents a dataset object. It consists of multiple DatasetPoints.
- class_labels = None¶
class_id to label map in JSON format.
- default_dimensions = None¶
Default scatter dimensions.
- default_visualization = None¶
Default visualization type.
- dimensions = None¶
Number of dimensions, where -1 means variable.
- name = None¶
Name of the Dataset
- nr_classes = None¶
Number of different classes
- nr_test_points = None¶
Number of test points.
- nr_training_points = None¶
Number of training points.
- nr_validation_points = None¶
Number of validation points.
- class webapp.models.DatasetPoint(*args, **kwargs)[source]¶
A single Point of a dataset.
- class_id = None¶
The class as integer.
- coords = None¶
The coordinates as JSON array
- dataset¶
The related dataset.
- point_type = None¶
Which part of the dataset this point belongs to (training, test, validation).
- class webapp.models.DatasetResultPoints(*args, **kwargs)[source]¶
Best result with points for a given dataset in a specific round.
- dataset¶
The selected Dataset
- job¶
The related job
- points = None¶
The earned Points
- round¶
The round
- score = None¶
The Validation Score
- user¶
User of this result
- class webapp.models.Group(*args, **kwargs)[source]¶
The Group class represents a group of people.
There could be multiple groups. A group could be a lecture, an event or any other user group.
- free_to_enter = None¶
Is it possible for new users to enter this group?
- name = None¶
The name of the group. Must be unique
- users¶
Set of all users currently in this group.
- class webapp.models.Job(*args, **kwargs)[source]¶
Jobs are the units created by users to work in a specific round.
- algorithm = None¶
selected algorithm
- created = None¶
Create time of the job.
- dataset¶
Related Dataset
- extra_scores_test = None¶
A dictionary with further scores (just for information) from the test set in JSON notation
- extra_scores_validation = None¶
A dictionary with further scores (just for information) from the validation set in JSON notation.
- fetch_params_ts = None¶
Timestamp when a worker fetches the params.
- finished_ts = None¶
Timestamp when the calculations are finished.
- is_selected = None¶
Is this job marked as a selected result?
- labels_test = None¶
A list of labels for the test set in JSON notation
- labels_training = None¶
A list of labels for the training set in JSON notation
- labels_validation = None¶
A list of labels for the validation set in JSON notation
- message = None¶
A message from the training set. It could contain more information about the training/test.
- modified = None¶
last modified
- params = None¶
Algorithms parameters as JSON dict
- round¶
Related Round
- score_test = None¶
The score of the test set. Should be in the interval [0,1]
- score_validation = None¶
The score of the validation set. Should be in the interval [0,1]
- success = None¶
True if the learning process was successful
- user¶
Owner of the Job
- worker¶
Worker which handles this job.
- class webapp.models.JobPicture(*args, **kwargs)[source]¶
Pictures to illustrate job details
- filename = None¶
Filename of the image
- job¶
Related Job
- class webapp.models.Round(*args, **kwargs)[source]¶
A Round is a single event for a group.
Each group can define multiple rounds.
- algorithms = None¶
Selected algorithms as JSON list.
- datasets¶
All active datasets for this round.
- end_time = None¶
The end of the round. If the end is in the past, this round is closed. If the end is not set, the round will be open.
- get_free_jobs(user, dataset_id)[source]¶
Return the number of open jobs.
Admin users always get the full number of jobs. Failed jobs are ignored.
Parameters: user : user object
The requesting user
dataset_id : int
The PK of the selected dataset.
- get_free_submits(user, dataset_id)[source]¶
Return the number of open submits.
Admin users always get the full number of submits.
Parameters: user : user object
The requesting user
dataset_id : int
The PK of the selected dataset.
- group¶
Group assigned to this round
- limit_jobs = None¶
Number of allowed jobs.
- limit_submit = None¶
Number of submissions.
- name = None¶
Name representation of this round
- start_time = None¶
The start of the round. If the start time is in the future or not set, this round is not active.
- class webapp.models.RoundResultPoints(*args, **kwargs)[source]¶
Round results.
- points = None¶
Sum of points of all datasets.
- round¶
related round
- stars = None¶
Stars / Big Points for this round
- user¶
related user
- class webapp.models.Worker(*args, **kwargs)[source]¶
A worker is a backend host that performs the calculations.
- hostname = None¶
Hostname of the worker
- jobs_done = None¶
Number of finished jobs.
- last_heartbeat = None¶
Is used to check if worker is up
- last_job = None¶
Last Job Timestamp, will be used to select worker
Views¶
This is a collection of all required views for the web frontend.
- webapp.views.add_group(request, *args, **kwargs)[source]¶
Creates a new group.
The group parameters are read from the request.REQUEST field, so this view can be used with both POST and GET.
Parameters: group_name : str
The name of the new group
public : str
Default is public. The group is private for the values (‘0’, ‘no’, ‘private’, ‘false’).
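The interpretation of the `public` parameter can be sketched as follows. The helper name `is_public` is hypothetical, and treating the values case-insensitively is an assumption; the view itself may compare them differently.

```python
# Values that mark a group as private, as listed for the `public` parameter.
PRIVATE_VALUES = ("0", "no", "private", "false")


def is_public(value=None):
    """Return False only if the request value asks for a private group."""
    if value is None:
        # Parameter missing from the request: the default is public.
        return True
    # Assumption: surrounding whitespace and letter case are ignored.
    return value.strip().lower() not in PRIVATE_VALUES


print(is_public())           # no parameter given
print(is_public("private"))  # one of the listed private values
print(is_public("yes"))      # any other value
```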
Python - API¶
#TODO: add content descriptions
- api.normalize_url(host, path)[source]¶
Appends the path to the URL to get a full URI. Multiple “/” between host and path are normalized to host/path.
Parameters: host : string
Hostname, optionally with port extension and protocol
path : string
Path of the uri
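The described behavior can be reproduced with a one-line sketch. The name `normalize_url_sketch` is chosen to make clear that this is an illustration of the documented behavior and not the project's actual implementation.

```python
def normalize_url_sketch(host, path):
    """Join host and path with exactly one "/" between them."""
    # Drop trailing slashes of the host and leading slashes of the path,
    # then put a single separator in between.
    return host.rstrip("/") + "/" + path.lstrip("/")


print(normalize_url_sketch("http://localhost:8000/", "/api/ping"))
print(normalize_url_sketch("http://localhost:8000", "api/ping"))
```

Both calls give the same URI, so callers do not need to care whether the host ends with a slash or the path starts with one.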
- class api.API(username=u'backend', password=u'foobar', host=u'http://localhost:8000/')[source]¶
- add_group(name, public=True)[source]¶
Creates a new group
Parameters: name : string
Name of the new group
public : boolean
Is the group joinable? Default: True
Notes
Requires admin rights.
- add_job_picture(job_id, filename)[source]¶
Adds a picture to the job.
Parameters: job_id : int
The ID of the job.
filename : string
Path to picture
- get_dataset_points(dataset_id)[source]¶
Retrieves all points from a given dataset.
Parameters: dataset_id : int
The ID of the dataset
Returns: List of point dicts.
- get_job_params(job_id)[source]¶
Loads the job parameters. This is only used by workers and needs admin rights.
Parameters: job_id : int
The ID of the selected job.
Returns: Dictionary:
- algorithm
- dataset
- params
- join_group(group_name)[source]¶
The current user will be added to the selected group.
Parameters: group_name : string
Name of the selected group
- list_groups()[source]¶
List all accessible groups
Returns: Dictionary that maps each group name to the parameters of that group.
- load(path, raw=False, **kwargs)[source]¶
Load the required resource from the host.
If a login was successful, the session cookie information is attached to the request. The CSRF token is also added if data is set. All values from data are transmitted via a POST request.
Parameters: path : string
path to the resource
raw : boolean
if True no JSON decoding is done
**kwargs : dict
Additional parameters which will be sent as a POST request.
- login(username, password)[source]¶
Send login request. This is normally done by constructor.
Parameters: username : string
The username of the user
password : string
The password of the user
Returns: True, if login was successful
HTML response otherwise
- ping()[source]¶
Ping the server and try to retrieve a new job to be processed.
Returns: Job ID if a new job is available or None if not.
- update_job_details(job_id, score_test, score_validation, labels_training, labels_test, labels_validation, extra_scores_test=None, extra_scores_validation=None, message=None, success=True)[source]¶
Posts the job results to the server.
Parameters: job_id : integer
The ID of the job
score_test : float
The score of the test set. Should be in the interval [0,1]
score_validation : float
The score of the validation set. Should be in the interval [0,1]
labels_training : list
A list of labels for the training set
labels_test : list
A list of labels for the test set
labels_validation : list
A list of labels for the validation set
extra_scores_test : dict (optional)
A dictionary with further scores (just for information) from the test set.
extra_scores_validation : dict (optional)
A dictionary with further scores (just for information) from the validation set.
message : string (optional)
A message from the training set. It could contain more information about the training/test.
pictures : list of filenames (optional)
A list of filenames with additional visualizations of the test set.
success : boolean (default: True)
True if the learning process was successful
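Taken together, `ping`, `get_job_params` and `update_job_details` suggest a simple worker cycle. The sketch below is an assumption about how these calls could be combined; it is not the project's worker code. To stay runnable without a server it uses a hypothetical `FakeAPI` stand-in that mimics the documented method signatures, and `run_algorithm` and `constant_result` are hypothetical helpers.

```python
def process_next_job(api, run_algorithm):
    """Fetch one job via ping(), run it and post the result.

    Returns the job ID that was handled, or None if no job was available.
    """
    job_id = api.ping()
    if job_id is None:
        return None
    job = api.get_job_params(job_id)
    try:
        result = run_algorithm(job["algorithm"], job["dataset"], job["params"])
        api.update_job_details(
            job_id,
            result["score_test"], result["score_validation"],
            result["labels_training"], result["labels_test"],
            result["labels_validation"],
            success=True)
    except Exception as exc:
        # Report the failure so that the job is not left open.
        api.update_job_details(job_id, 0.0, 0.0, [], [], [],
                               message=str(exc), success=False)
    return job_id


class FakeAPI:
    """Hypothetical in-memory stand-in for api.API, for illustration only."""

    def __init__(self, job_ids):
        self.pending = list(job_ids)
        self.results = {}

    def ping(self):
        return self.pending.pop(0) if self.pending else None

    def get_job_params(self, job_id):
        return {"algorithm": "kNN", "dataset": 1, "params": {"k": 3}}

    def update_job_details(self, job_id, score_test, score_validation,
                           labels_training, labels_test, labels_validation,
                           extra_scores_test=None, extra_scores_validation=None,
                           message=None, success=True):
        self.results[job_id] = {"score_test": score_test,
                                "success": success, "message": message}


def constant_result(algorithm, dataset, params):
    """Hypothetical algorithm runner that returns fixed scores and labels."""
    return {"score_test": 0.9, "score_validation": 0.8,
            "labels_training": [0, 1], "labels_test": [1],
            "labels_validation": [0]}


api = FakeAPI([7])
print(process_next_job(api, constant_result))  # handles job 7
print(process_next_job(api, constant_result))  # no job left
```

Reporting failures with `success=False` and a message keeps the error visible on the server side instead of silently dropping the job.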