Python 101

“For the things we have to learn before we can do them, we learn by doing them.”

—Aristotle (around 350 BCE)

Welcome to the Python tutorial of the group “Micromechanical Materials Modelling” of the “Institute of Mechanics and Fluid Dynamics”. The goal of this tutorial is to introduce the basic tools used in our everyday scientific worklife by actually performing such tasks. We will start with a brief introduction to the Unix Shell in which we list the most common commands and what they do.

Contents

Installing Python

How to install Python depends a lot on your operating system.

macOS

For macOS two popular choices for package managers exist: Homebrew and MacPorts. Both are solid so the choice is up to you, I currently prefer Homebrew.

Homebrew

  1. Install the command line tools via

    $ xcode-select --install
    
  2. Agree to the Xcode license by running

    $ sudo xcodebuild -license
    
  3. Install Homebrew.

  4. Install the required formulas by running

    $ brew install git python3 hdf5 freetype libpng
    

MacPorts

  1. Install the current version of Xcode on your Mac via the Mac App Store.

  2. Open your terminal and install the Xcode Command Line Utilities and run

    $ xcode-select --install
    
  3. Agree to the Xcode license by running

    $ sudo xcodebuild -license
    
  4. Install MacPorts for your version of macOS.

  5. Install the required ports via

    $ sudo port install freetype libpng git hdf5 python35
    

Fedora 25

Install the required packages via

$ sudo dnf install python3-tkinter hdfview

Ubuntu 16.04 LTS

Install the required packages via

$ sudo apt-get install cmake gfortran git libfreetype6-dev libatlas-dev liblapack-dev libhdf5-dev python3-dev python3-venv python3-pip python3-tk

Ubuntu 16.10

Install the required packages via

sudo apt-get install python3-venv python3-tk hdfview

Setting up Python 101

First we will set up a virtual environment using the venv module of Python.

$ python3 -m venv ~/.virtualenvs/python101

The virtual environment can be activated via

$ source ~/.virtualenvs/python101/bin/activate

If everything went fine you should see that your shell prompt now starts with (python101). This means that everything you do with regards to Python now uses this virtual environment.

To make sure that the Python packages that are required for installing other Python packages are up to date execute the following command:

$ pip install --upgrade pip setuptools

Now we will clone the Python 101 repository using git.

$ git clone https://github.com/ZeeD26/python101.git

We enter the newly created directory

$ cd python101

and install Python 101 along with its dependencies

$ python -m pip install -e .

Attention

The . after python -m pip install -e is important! python -m pip install -e is not sufficient.

The right tool for the job

“Give me six hours to chop down a tree and I will spend the first four sharpening the axe.”

—Abraham Lincoln

Programming may be considered a craft (although you can find some debate on that). If you need to solve a scientific problem that may, e.g., only be solvable numerically, and there is no software at hand to solve this problem, you have to write code that does. And as you expect a “regular” craftsman to know the right tool for his job, you may also want to know your tools for your job as a scientific programmer.

In the following an overview of the basic tools for scientific programmers is given, starting with the Unix shell as a mean to interact with the computer while not relying on a GUI and enabling reproducibility, followed by an introduction to text editors suitable for programming, and finishing with a brief introduction to version control systems.

The Unix shell

“The Linux philosophy is ‘Laugh in the face of danger’. Oops. Wrong One. ‘Do it yourself’. Yes, that’s it.”

—Linus Torvalds

We could try to think of a neat way of defining what a Unix shell is, but thankfully Wikipedia comes to the rescue:

A Unix shell is a command-line interpreter or shell that provides a traditional Unix-like command line user interface. Users direct the operation of the computer by entering commands as text for a command line interpreter to execute, or by creating text scripts of one or more such commands.

To put it in simpler terms, instead of pointing and clicking on colorful icons to perform actions like opening a directory or a program, you type commands in a terminal. The most commonly used commands are outlined in the following sections.

Note

If you do not have any prior experience with command line interfaces (CLIs), working with one might seem daunting at first. This is perfectly normal. It takes time to get used to the Unix shell. The key to mastering the Unix shell is to use it. A lot.

pwd

This command is used to show the path to the current working directory. It is usually used to see where you currently are.

$ pwd

ls

This command is used to show the contents of the directory you specify as argument.

$ ls <directory>

If you do not provide anything besides ls it prints the contents of the current working directory.

$ ls

A single . in place of <directory> also results in the contents of the current working directory to be printed, as . is an explicit way of specifying the current working directory.

For a more verbose version you may use the -l options:

$ ls -l

Note

If you have some spaces in some of your file names you have to enclose the whole filename like this: "<filename>"

cd

This command is used to change the directory to the one specified as argument. If no argument is provided you are changing to your home directory.

$ cd <directory>

To get to the parent directory use

$ cd ..

Note

In a lot of cases bash—the shell you are using—provides rather intelligent autocompletion. To use this start typing the name of a file or directory and hit the tab button. If there is a unique completion option it is completed automatically. Else hitting tab another time will give you a list of options that starts with whatever you typed until then.

cat

This command is used to print the contents of the files specified as arguments.

$ cat <file1> <file2> ... <fileN>

cp

This command is used to copy files. For example

$ cp <file1> <file2>

copies <file1> to <file2>. If you want to copy a lot of files to another directory use

$ cp <file1> <file2> ... <fileN> <directory1>/

Copying a whole directory requires you to use the -r option:

$ cp -r <directory1> <directory2>

mv

This command is used to move files. For example

$ mv <file1> <file2>

essentially renames <file1> to <file2>. To move several files into a directory use

$ mv <file1> <file2> ... <fileN> <directory1>/

As opposed to cp the mv command can move whole directories without using the -r option:

$ mv <directory1> <directory2>

touch

This command is used to create an empty file. Using

$ touch <file1>

hence results in an empty file with the name <file>.

Note

If you want to copy something from the Terminal you can not do this via the key combination Ctrl + C as this is reserved for cancelling the running program. Instead use Ctrl + Shift + C. For pasting you also have to use Ctrl + Shift + V.

mkdir

This command is used to create a directory. Using

$ mkdir <directory>

thus creates a directory with the name <directory>.

rm

This command is used to delete files and directories. Hence

$ rm <file>

deletes <file>.

Warning

If you delete files or directories on a modern, graphical operating system the files and directories usually do not get deleted immediately, but are copied to an intermediate directory that is usually called trash bin. This could be considered a safety measure against accidentally deleting important files. This “safety net” does not exist for the rm command. Whatever you delete via rm is permanently deleted.

grep

If you want to see whether some text is contained within a file you can use grep:

$ grep <pattern> <file>

with <pattern> being the text you are looking for.

find

Find files in a directory hierarchy. This program is rather extensive and can perform complex search operations. In its simplest form it may be used like this:

$ find <path> -name "<pattern>"

It is important to put <pattern> into quotation marks to make sure the shell is not expanding some special characters. Special characters can, e.g., be the wildcard character *, which matches everything. Hence, the command

$ find . -name "*.py"

finds all Python files below the current directory.

Summary

pwd
Print the path to the directory you are currently in.
ls <directory>
List the contents of directory specified by <directory>. If you do not specify a directory it defaults to your current directory.
cd <directory>
Change the directory to <directory>. If you do not specify a directory you go to your home directory. If you want to go back to your last directory you can use cd -.
cat <file1> <file2> ... <fileN>
Read the files specified and print their content to the terminal.
cp <file1> <file2>
Copy the first argument to the second argument. If you want to copy a directory you have to use it with the -r option: cp -r <directory1> <directory2>.
mv <file1/directory1> <file2/directory2>
Move the first argument to the second argument. It is basically like renaming the first argument.
touch <file>
Create an empty file at <file>.
mkdir <directory>
Create a directory at <directory>.
rm <file1> <file2> ... <fileN>
Delete the files specified. If you want to delete a directory and its contents you have to use it with the -r option: rm -r <file1>.
grep <pattern> <file>
Search for <pattern> in <file>.
find <path> -name "<pattern>"
Find all files in <path> and below whose name is matching <pattern>.

Exercises

  1. Create an empty file called my_first_file.txt
  2. Open the file with your text editor and fill it with something other than asdf. Save and close afterwards.
  3. Print the content of the file to the terminal.
  4. Make a new directory named my_first_directory
  5. Copy the file my_first_file.txt into this directory.
  6. Remove the old file.
  7. Print the content of the file my_first_file.txt in the directory my_first_directory to the terminal.
  8. Print your current working directory.
  9. Enter the directory my_first_directory.
  10. Print your current working directory.
  11. Enter the parent directory.
  12. List the content of your current working directory.
  13. Delete the directory my_first_directory.
  14. List the content of your current working directory.

Text editors

The features a good text editor should have:

  • Unicode support
    A good text editor should support Unicode. This is the de facto standard for cross-platform compatibility of text files. With Unicode support you can be sure that the text file you edited on your Unix machine can be worked with properly on, e.g., a Windows machine.
  • syntax highlighting
    In our case, the main objective of using a text editor is to write code. Choosing a text editor that supports syntax highlighting for the programming language we use helps us to comprehend the written code easier by applying different colors and/or weights to the text depending on whether the piece of text is, e.g., a string, a type declaration, or a keyword.
  • snippets
    Every programming language has certain structures for building blocks like, e.g., if-conditionals or for-loops. Typing them out each and every time is a waste of time. Good text editors allow the use of so-called snippets.
  • hackability
    Each programmer has a different taste with regards to how they want to accomplish certain actions in text editors. Different shortcuts might be required, or a custom layout of the editor. A good text editor allows you to customize it to your needs, not require you to adapt to its way (at least to a certain amount).
  • package manager
    To be truly hackable a good text editor supports third-party packages that extend or alter the functionality of the text editor.

To gain more insight about the performance of different text editors take a look at this editor performance comparison.

In the following a brief introduction to the text editors I recommend is given.

Atom

Atom is a text editor made by GitHub.

Advantages

  • support for different encodings
  • highly hackable
  • open-source
  • easy to learn

Disadvantages

  • slower than the other presented editors
  • less memory efficient than the other presented editors
  • may not be installed on all machines you have to work with

Sublime Text 3

Sublime Text is a text editor made by a small team of developers. Proper support for packages is supplied by Package Control.

Advantages

  • fast
  • relatively memory efficient
  • support for different encodings
  • highly hackable
  • easy to learn

Disadvantages

  • closed-source
  • may not be installed on all machines you have to work with
  • if used for free a pop-up every ten times you save a file
  • a license costs $70 at the time of writing

Visual Studio Code

Visual Studio Code is a text editor made by Microsoft.

Advantages

  • support for different encodings
  • highly hackable
  • open-source
  • easy to learn

Disadvantages

  • may not be installed on all machines you have to work with

Vim

Vim is a text editor made by the community lead by Bram Moolenaar. Support for packages is built-in as of vim 8, but due to backward compatibility the most popular package manager is vim-plug.

Advantages

  • open-source
  • fast
  • memory efficient
  • may very well be installed on all Unix machines you have to work with

Disadvantages

  • takes time to get used to
  • hard to master

Emacs

Emacs is a text editor made by the community lead by the Free Software Foundation.

Advantages

  • open-source
  • fast
  • memory efficient
  • may be installed on all Unix machines you have to work with

Disadvantages

  • takes time to get used to
  • hard to master

Version control systems

Even if you have no prior experience with programming, you may already be familiar with some kind of version control, whether on purpose or not. Over the course of your studies it is very likely you have had to write some sort of papers, maybe even while collaborating with others. Does the following comic strip resonate with you?

_images/phd101212s.gif

“Piled Higher and Deeper” by Jorge Cham (www.phdcomics.com)

This gives you a way of keeping track of changes, as you can compare two versions of the same document. Depending on the software you use you either have to compare the changes “manually” by skimming the whole document, or you can get a view highlighting the differences (see, for example, subsec_diff).

Another way of having versions of a file usually comes with cloud file storage services like Dropbox or Google Drive. Those services keep a version history of the synced files available for restoring files you want to revert to older versions or which you deleted accidentally.

But why would we want to keep track of changes to our code? Would the latest working version not suffice at all times?

If we keep track of changes to our code we do not have to be afraid to simply tinker with our code. Maybe we get an awesome idea for a new feature or how to speed up your code. As we hack away we realize that the idea is not really that awesome and we wish to go back to our previous working state. If we are then in the unlucky state that the “undo” action of our text editor is unable to revert all the changes we made, we have to fall back to another solution, e.g., the aforementioned naive version control schemes. If we worked on a copy of the last working version “reverting” is simple—we just delete the fresh copy. Reverting our file back using the version history of a cloud file storage service requires may require us to go to their website and check quite a few versions of this file, as we have not had any way of controlling which state of the file represents a meaningful increment. This leads us to the first feature we would like our version control system to have:

Note

We want to have control over what changes in our document/code form a logical unit that represents a meaningful increment.

More complicated documents or code libraries may be split up into several files to emphasize the structure while separating logical units like chapters or classes, respectively. When using the copy-and-rename scheme we would copy and rename the directory containing all the files. The drawback of this is the redundancy, as not all the files may change within the same increment. When using the version history of cloud file storage services, on the other hand, we do not run into the problem of data redundancy, but we are unable to track which versions of the files belong together.

Note

We want to be able to not only track the changes of a file, but the cohesive changes across multiple files, i.e., we want to track changes of a project instead of files.

These two ways of keeping track of file versions come with a variety of advantages and disadvantages that we are going to address in the following while we introduce git as the right tool for the job with regards to version control systems.

During our coding sessions there will be times when we wish we could take some of the changes to a working script back and start from the last working checkpoint. Or we find a bug that was introduced somewhere down the road. Or we accidentally delete an important file. In the past we may have used the “undo” command in our text editors or the history tracking of cloud file hosting services like Dropbox. But this is not sufficient for proper source code management. Let us work out the advantages of version control systems as another tool in our toolbox over file versioning by, e.g., appending version numbers to the filename—d

Best practices

What not to put into version control

  • config files with sensitive information, e.g., passwords, private keys
  • editor backups
  • generated files, e.g., executables, HTML documentation
  • OS specific files, e.g., Thumbs.db, .DS_Store

SVN

Git

One of the most popular version control systems at the time of writing this tutorial is git. It was created by Linus Torvalds for development of the Linux kernel, which is one of the most widely distributed software repositories.

Let us introduce the most important commands by creating a new project: a collection of recipes.

Initializing a git repository—git init

First of all we create a new directory that represents the root of our project via

mkdir recipes

and subsequently we enter it

cd recipes

At this point we can use the command

git init

to initialize the recipes directory as a new git repository. We should see the following output in your console

Initialized empty Git repository in <path_before_recipes>/recipes/.git/

If we now view the contents of the recipes directory via

ls -A

we should see a new directory called .git, whose content we can inspect using

ls .git

resulting in the output

branches  config  description  HEAD  hooks  info  objects  refs

This directory is where git stores all the information about all versions of the files associated with the project. At this point we should neither modify nor delete any files in this directory, as it may lead to corruption of our repository.

Knowing what’s what—git status

Ignoring files

Useful gitignore templates

The official github group supplies a nice list of global gitignore templates as well as project specific gitignore templates.

Basics

“The canonical, ‘Python is a great first language’, elicited, ‘Python is a great last language!’”

—Noah Spurrier

Now we are going to introduce the basics of the Python programming language. We start with the infamous Hello, World! program and the basic syntax of a Python script.

Attention

Make sure you have your virtual environment activated! If you do not have (python101) in front of your command line prompt you need to activate it using

$ source ~/.virtualenvs/python101/bin/activate

Execute the following command to start the interactive Python interpreter:

$ python

You should see a couple of lines printed in the terminal, with the first line stating, among the current date and time, the version of Python you use. The very last line should start with >>> which indicates the Python prompt. You can write Python commands there and execute them.

Attention

In the previous chapter The Unix shell the basic commands for the Unix shell were introduced. Notice how every command was preceeded by a $ character. In this tutorial code blocks that start with a $ sign are meant to be executed in the Unix shell. If the code block is preceeded by >>> it means that it should either be executed in the Python interpreter or be used in a Python script.

Python as an interactive calculator

To get your feet wet with Python you can use the Python interpreter as a calculator. You have the usual mathematical operators at your disposal, like

+:addition,
-:subtraction,
*:multiplication,
/:division,
**:exponent,
//:integer division, and
%:modulus.

If you are not familiar with one of them just give it a try in the Python interpreter—do not only use integers, but also floating point numbers. You can also use brackets as you would use them in mathematical expressions. Try whether Python uses the proper mathematical rules with regards to the order of execution of the operators.

Note

At one point you will enter an “invalid” expression like 1 + * 2. Python will then raise a SyntaxError to tell you that whatever you typed is not valid Python syntax. In many cases Python will also give you additional information about the error. There are many more errors you can encounter, and it is perfectly normal to have errors. The only difference between a seasoned programmer and a beginner is the time it takes to fix those errors. The more errors and mistakes you made the better you know how to solve them.

To leave the Python interpreter you either execute

>>> quit()  

or you press Ctrl + D. Besides the interactive Python interpreter you can also write scripts with Python. Scripts are files that can be executed from the command line interface. They contain Python expressions that get executed once you call the scripts. A script can be simple and merely rename files or it can be complex and run a complete simulation of a car crash. You decide how simple they are.

Your first Python script

We will start with the infamous Hello, World! Program. Open a new terminal, activate your virtual environment, and create a new file named hello_world.py via

$ touch hello_world.py

Open it with your favorite text editor, e.g., Atom or SublimeText. In the former case you would open the file via

$ atom hello_world.py

Now type (not copy!) the following into the file

hello_world.py
print('Hello, World!')

Save the file, switch to your command line interface, and execute

$ python hello_world.py

If you did everything correctly you should see the phrase Hello, World! popping up in your command line interface. If you see something like

  File "hello_world.py", line 1
    print('Hello, World!)
                        ^
SyntaxError: EOL while scanning string literal

or

  File "hello_world.py", line 2

                         ^
SyntaxError: unexpected EOF while parsing

it means that you have either forgotten the closing ' or ), respectively. As you can see Python tries its best to describe the error to you so that it can be fixed quickly.

If everything went fine: Congratulations! You wrote your first Python script!

The print function

The function you used in your first Python script, the print() function, has a rather simple goal: Take whatever you have in there and display it in the command line interface. In the Python interpreter (the command line starting with >>>) the result of an expression was displayed automatically. Try creating a new file math_expressions.py and enter several mathematical expressions like you did earlier. Save the file, switch to your terminal and execute the file via

python math_expressions.py

You should not see a single thing happening. That is because you never told Python what to actually do with those expressions. So what it does is evaluate them and nothing more. Now wrap the mathematical expressions in the print() function, for example like this:

print((3 + 4)*6)

If you execute the script again you should see the expected output.

Integers, floats and strings

In the previous examples you worked with integers, floating-point numbers, and with strings. -4, 0, and 2 are all integers. 1.2, 1.0 and -2e2 (which is the scientific notation for -200.0) are floating-point numbers. Finally, 'Hello, World!' is a string. These categories are called data types. Every value in Python is of a certain data type.

The meaning of operators may depend on the data types of the values surrounding it. Take, e.g., the addition operator +:

>>> 1 + 2
3
>>> 1.2 + 3.4
4.6
>>> 'My first sentence.' + 'My second sentence.'
'My first sentence.My second sentence.'
>>> 'My ' + 3 + 'rd sentence.'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: Can't convert 'int' object to str implicitly

In the last case the addition operator has no idea how to combine the integer 1 with the strings. What you can do to solve this is to convert the integer to a string using str():

>>> 'My ' + str(3) + 'rd sentence.'
'My 3rd sentence.'

If you want to convert something to a string you use str(), to convert to an integer you use int(), for floating-point numbers you use float().

>>> '1.2' + '3.4'
'1.23.4'
>>> float('1.2') + float('3.4')
4.6
>>> int('1.4')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: invalid literal for int() with base 10: '1.4'
>>> str(1e2)
'100.0'

Play around with those three functions to see what can be converted and what can not. Try the different operators, e.g., try to multiply a string with an integer, etc.

Variables

Like in mathematics you can also use variables to store values. A variable has a name by which it is called and a value. There are three rules that a variable name must comply:

  1. It must be exactly one word.
  2. It must comprise only letters, numbers, and the underscore character.
  3. It must not begin with a number.

Other than that anything goes.

Warning

Although anything else is a viable variable name, you should take special care not to use names of built-in objects like, e.g., int. If you name a variable after some function or class it is not usable anymore in the subsequent code.

To assign a value to a variable you use the equal sign = with the variable name on the left and the value on the right:

>>> my_first_variable = 21
>>> 2*my_first_variable
42
>>> my_second_variable = 3
>>> my_first_variable/my_second_variable
7.0
>>> my_third_variable = my_first_variable
>>> print(my_third_variable)
21

Here is a slightly more complex example:

students = 35
tutors = 2
classrooms = 1
pizza_orders = 20

students_per_tutor = students / tutors
persons = students + tutors
persons_per_classroom = persons / classrooms
hungry_persons = persons - pizza_orders

print('There are', students, 'students and', tutors, 'tutors.')
print('That makes', persons, 'persons in', classrooms, 'class room(s).')
print(hungry_persons, 'have to stay hungry...')

The advantages of using variables are two-fold:

  • If the amount of students, tutors, classrooms or pizza_orders changes you only have to update one line instead of many. This is less error-prone and faster.
  • You give the values some meaning which should be represented in the variable name. You could in principle read “the students per tutor is the amount of students divided by the amount of tutors.” This makes your code easily comprehensible and you need fewer comments. But you still should write them when they make sense!

And here is what the output should look like:

There are 35 students and 2 tutors.
That makes 37 persons in 1 class room(s).
17 have to stay hungry...

Notice how we used , to separate strings and variable names in print(), but everything was composed in a nice way? The reason for this is that print() can take an arbitrary amount of arguments. Just chain them using , and you are good to go. How this works is part of the section Functions.

User input

In some cases you may want to ask the user of your script to provide some additional information, like the path to a file or parameters for a simulation. For this the input() can be used.

print('What is your name?')
name = input()
print('Nice to meet you,', name)

Note

The value returned by input() is always a string. So when you are asking for numbers you have to convert them.

print('What is your age in years?')
age = int(input())
print('In 5 years you will be', age+5, 'years old.')

Imports

Sometimes the features that Python offers by default are not enough. What if you want to use the \(\sin(x)\) function? For more specialized topics Python offers modules or packages, either ones that already ship with every Python installation or packages from external parties. The packages that Python ships with are called the standard library. External packages may be, e.g., NumPy and SciPy for scientific computing with Python, or Matplotlib for plotting.

You activate this additional functionality by importing these packages:

>>> import math

Now we have access to all functions available in the math module.

>>> math.pi
3.141592653589793
>>> math.sin(0.5*math.pi)
1.0

Take your time and browse the documentation of the math module, try some of the provided functions like math.ceil(), math.exp(), etc.

Summary

  • You can use the interactive Python interpreter to execute small commands.
  • You can execute scripts that hold several commands using Python.
  • You can display results of computations or strings using the print() function
  • You can use str(), int(), float() to convert from one data type to another—if it is somehow possible.
  • You can store values in variables to access them at a later point in your script.
  • You can import modules or packages to extend Pythons builtin functionality using the import statement.

Exercises

  1. Write a script that asks the user for the radius of a circle and subsequently shows the circumference and the area of the circle in the terminal.

Comments

Once your programs grow longer and the logic gets more and more complicated it is a very good idea to place meaningful comments in your code. This makes it easier for the future you and your collaborators or supervisors to actually understand why your implementation is as it is.

In the very beginning your comments may be used to outline what the code block following the comment is actually doing. Once you get more familiar with Python the content of the comment should change from what the code is doing to why it is doing it.

To actually write a comment prepend it with the # character. So if you write

# In the following example you should only see
# ``Now you see me.``, and
# ``Are you still seing me?`` in the terminal.

print('Now you see me.')
# print("Now you don't.")
print('Are you still seeing me?')  # Inline comments are nice as well!

into a file and execute it you should only see

Now you see me.
Are you still seeing me?

Summary

  • You can use comments by prepending #
  • Comments are useful to explain why you did something the way you did it

Exercises

  1. If you encounter some more complicated code-blocks in the future comment them. Admittedly this is not directly an exercise, but a sincere request.
  2. Really, comment your code for others and your future self. You will forget why you coded something the way you did. Still not really an exercise.

Containers

At one point you may want to keep objects like strings or numbers in another object to show that they are strongly related. These objects that contain other objects are called container. Python offers a fair amount of useful containers out of the box, some of which are introduced in the following.

Lists

If you need a flexible container that can take in about anything (even itself) and you need to keep an order to the contained objects you should use a list. An empty list is generated by either

>>> list()
[]

or

>>> []
[]

Or you generate a non-empty list

>>> list([1, 2, 3])
[1, 2, 3]

or

>>> [1, 2, 3]
[1, 2, 3]

Accessing values for the list is done like this:

>>> x = [1, 2, 3]
>>> print(x[1])
2

Note

As you can see in the example above specifying x[1] did not give you the first element of the list, but the second. This is due to Python starting indexing at 0!

With this you can also change values in a list

>>> x = [1, 2, 3]
>>> x[1] = 4
>>> print(x)
[1, 4, 3]

Negative indices can be used to retrieve element from the end:

>>> x = [1, 2, 3, 4, 5]
>>> print(x[-1])
5
>>> print(x[-2])
4

You can also create slices of the list using the following notation:

list[start:end:step]

Its usage is best made clear by some examples:

>>> x = ['zero', 'one', 'two', 'three', 'four']
>>> print(x[2:])
['two', 'three', 'four']
>>> print(x[:3])
['zero', 'one', 'two']
>>> print(x[1:3])
['one', 'two']
>>> print(x[::2])
['zero', 'two', 'four']
>>> print(x[::-1])
['four', 'three', 'two', 'one', 'zero']
>>> print(x[-2:])
['three', 'four']

You can also append values to a list

>>> x = [1, 2, 3]
>>> print(x)
[1, 2, 3]
>>> x.append(4)
>>> print(x)
[1, 2, 3, 4]

or insert values at specified positions

>>> x = [1, 2, 3]
>>> print(x)
[1, 2, 3]
>>> x.insert(0, -1)
>>> print(x)
[-1, 1, 2, 3]
>>> x.insert(1, 0)
>>> print(x)
[-1, 0, 1, 2, 3]

A list can hold anything, even other lists

>>> [1, 'two', ['three', 4]]
[1, 'two', ['three', 4]]

Tuples

If you are sure that you do not want to modify the data container after its generation and want to keep everything in the order you specified the tuple container is the container of choice. They may be initialized by

>>> tuple([1, 2, 3])
(1, 2, 3)
>>> (1, 2, 3)
(1, 2, 3)

Accessing single elements is done like in lists:

>>> x = (1, 2, 3)
>>> print(x[1])
2

And when you try to modify them an error is raised:

>>> x = (1, 2, 3)
>>> x[1] = 4
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'tuple' object does not support item assignment

Sets

If you do not want duplicates in your container and you do not care about their order you should use a set. They are initialized via

>>> set()  # This one is empty
set()
>>> set([1, 2, 2, 3, 3, 'three'])  
{1, 2, 3, 'three'}
>>> {1, 2, 2, 3, 3, 'three'}  
{1, 'three', 2, 3}

and elements can be added like this:

>>> x = {1, 2, 3}
>>> print(x)  
{1, 2, 3}
>>> x.add('four')
>>> print(x)  
{1, 2, 'four', 3}

Sets furthermore support a lot of the operators you are familiar with from the set in mathematics. For further information see the Python documentation.

Dictionaries

If you do not want to keep the order of things but would rather like to use keys to access the objects in the container the dict has got you covered. Dicts are initialized via

>>> dict()
{}
>>> {}
{}
>>> x = {'One': 1, 'two': 2, 'THREE': ['A', 'list', 'of', 'strings']}  
{'THREE': ['A', 'list', 'of', 'strings'], 'One': 1, 'two': 2}

and their values may be accessed like this:

>>> x = {'One': 1, 'two': 2, 'THREE': ['A', 'list', 'of', 'strings']}
>>> x['One']
1
>>> x['THREE']
['A', 'list', 'of', 'strings']

As you can see there are no restrictions regarding the type of the value that each key is pointing to. Keys on the other hand have to be hashable. In practice the keys are usually ints or strings. If you want to retrieve something that does not exist an error is raised:

>>> x = {'One': 1, 'two': 2, 'THREE': ['A', 'list', 'of', 'strings']}
>>> x['four']
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 'four'

A way of providing a fallback in case a key does not exist in the dictionary is the dict.get() method:

>>> x = {'One': 1, 'two': 2, 'THREE': ['A', 'list', 'of', 'strings']}
>>> x.get('two', 'fallback')
2
>>> x.get('four', 'fallback')
'fallback'

Common operations

Length

Sometimes you are interested in the amount of objects that a container holds. This information can be accessed using the len() function:

>>> x = [1, 2, 3]
>>> len(x)
3
>>> x = (1, 2)
>>> len(x)
2
>>> x = set([1, 2, 2, 3, 3, 'three'])
>>> len(x)
4
>>> x = {'One': 1, 'two': 2, 'THREE': ['A', 'list', 'of', 'strings']}
>>> len(x)
3

Membership check

>>> x = [1, 2, 3]
>>> 2 in x
True

Note

While the behavior is clear for lists, tuples and sets, the in checks for the keys of a dictionary.

>>> x = {
...    'one': 'two',
...    3: 4
... }
>>> 'one' in x
True
>>> 'two' in x
False
>>> 3 not in x
False
>>> 4 not in x
True

Due to the nature of the implementation member checks are especially fast in sets and dictionaries

Summary

  • use lists if the order is important and you may need to modify the container.
  • use tuples if the order is important and the container is fixed.
  • use sets if order is not important and you want to ensure uniqueness inside the container.
  • use dictionaries if you want to store key-value pairs.

Flow control

“Would you tell me, please, which way I ought to go from here?”
“That depends a good deal on where you want to get to,” said the Cat.
“I don’t much care where—” said Alice.
“Then it doesn’t matter which way you go,” said the Cat.
“—so long as I get somewhere,” Alice added as an explanation.
“Oh, you’re sure to do that,” said the Cat, “if you only walk long enough.”

Alice’s Adventures in Wonderland

Up until now the code war rather… underwhelming. You only executed a sequence of commands each and every time. But with the introduction of the control flow of a program a whole new world of programs reveals itself. It is up to your experience and creativity to use the control flow tools to structure your code in a way that complex tasks can be handled.

Boolean expressions

Knowing boolean expressions is essential for the control flow of a program. Python handles this rather intuitively. The most important boolean operations are

==:equal to
!=:unequal to
<:less than
<=:less than or equal to
>:greater than
>=:greater than or equal to
>>> 1 == 1
True
>>> 1 + 1 != 2
False
>>> 5*2 >= 11
False

Warning

Beware of float comparisons, as the way computers represent floating-point numbers internally might lead to some seemingly weird behavior.

>>> 0.1 + 0.2 == 0.3
False

This is certainly unexpected behavior. To deal with this shortcoming version 3.5 of Python introduced the isclose function in its math module.

>>> import math
>>> math.isclose(0.1 + 0.2, 0.3)  
True

If you are working with an older version there are still convenience solutions. The numpy package that is introduced later on in this tutorial also provides the means for float comparisons.

In many cases a simple boolean expression is not sufficient to correctly specify the control flow of a program. Sometimes a code block should only be executed when a condition is not met, when several conditions are met all at once or when at least one of several conditions is met. For this Python hast the keywords not, and, and or, respectively. Their behavior is outlined in the following tables.

Truth table for not
A not A
True False
False True
Truth table for and
A B A and B
True True True
True False False
False True False
False False False
Truth table for or
A B A or B
True True True
True False True
False True True
False False False

Combining boolean expressions using these keywords is an essential skill for programming.

>>> i = 3
>>> i < 4 and i >= 0
True
>>> i < 4 and i > 3
False
>>> i < 4 and not i > 3
True
>>> i >= 0 or i > 3
True

You can furthermore use brackets to specify the order of the evaluation of the subexpressions like you would in equations.

Note

In Python False is not the only thing that is “false” in Python. False, None, numeric zero of all types, and empty strings and containers are all interpreted as false.

The most important construct using boolean expressions is introduced in the following.

The if statement

A construct every programming language provides is the if statement. The basic structure is as follows:

if <expression>:
    <code block>

The result is that the code block is only executed when expression is True.

Note

Test

>>> x = 0
>>> y = 1
>>> if x > 0:
...     print('Larger than 0')
...
>>> if y > 0:
...     print('Larger than 0')
...
Larger than 0

The expression can also include a negation using Boolean operations:

>>> x = 0
>>> y = 1
>>> if not x > 0:
...     print('Not larger than 0')
...
Not larger than 0
>>> if not y > 0:
...     print('Not larger than 0')
...

If you want to cover both cases you can also use the else keyword:

>>> x = 1
>>> if x < 0:
...     print('Negative')
... else:
...     print('Positive')
...
Positive

But as you can see this does not cover all the cases. What if x is 0? For this we have to use elif:

>>> x = 0
>>> if x < 0:
...     print('Negative')
... elif x == 0:
...     print('Zero')
... else:
...     print('Positive')
...
Zero

And you can add as many elifs as you want.

The while loop

Sometimes it is necessary to perform a routine until a certain condition is met. This is achieved using a while loop.

>>> x = 0
>>> while x < 5:
...     print(x)
...     x += 1
...
0
1
2
3
4

Assume you want to exit the while loop when a certain condition is met. This is possible with the break statement.

>>> x = 0
>>> while x < 5:
...     if x == 3:
...         break
...     print(x)
...     x += 1
...
0
1
2

Attention

Although while loops are a common building stone in every programming language I advise you to avoid using them whenever possible. It happens quite easily that the criterion for exiting the loop is never reached and your program gets stuck performing the same task more often than you intended. In many cases a while loop can be substituted with a for loop.

The for loop

In a lot of cases you just want to work on all the elements of a container one at a time. This is easily achieved with for loops.

>>> x = [1, 2, 3]
>>> for i in x:
...     print(i)
...
1
2
3

Here i takes on the value of the elements of x one after the other. This allows you to work with i inside of this for loop. After all elements have been visited you automatically exit the loop. A more sophisticated example might be to store the squared values of another list in a new list.

>>> x = [1, 2, 3]
>>> x_squared = []
>>> for value in x:
...     x_squared.append(value**2)
...
>>> print(x_squared)
[1, 4, 9]

range

A shortcut to loop over integers is given as the range() function.

>>> for i in range(3):
...     print(i)
...
0
1
2
>>> for i in range(3, 6):
...     print(i)
...
3
4
5
>>> for i in range(3, 12, 3):
...     print(i)
...
3
6
9

enumerate

Sometimes you also want to track where you currently are in your iteration. For example you want to know what the current state of your program is, but printing the value you are operating on each single time is kind of too much. Then you could use enumerate() like this:

>>> results = []
>>> for i, value in enumerate(range(100,900)):
...     if i % 200 == 0:
...         print('Current step:', i, '-- Value:', value)
...     results.append(i**2 % 19)
...
Current step: 0 -- Value: 100
Current step: 200 -- Value: 300
Current step: 400 -- Value: 500
Current step: 600 -- Value: 700

As you can see we now have comma-separated variables i and value. i get the current index we are in whereas value holds the actual object of the container.

zip

Another common task is that you have to loop over several lists at the same time. Use the :func:zip function for this:

>>> fruits = ['banana', 'orange', 'cherry']
>>> colors = ['yellow', 'orange', 'red']
>>> for fruit, color in zip(fruits, colors):
...     print('The color of', fruit, 'is', color)
...
The color of banana is yellow
The color of orange is orange
The color of cherry is red

Note

zip() stops the for loop as soon as one list is empty:

>>> fruits = ['banana', 'orange', 'cherry', 'apple', 'lemon']
>>> colors = ['yellow', 'orange', 'red']
>>> for fruit, color in zip(fruits, colors):
...     print('The color of', fruit, 'is', color)
...
The color of banana is yellow
The color of orange is orange
The color of cherry is red

Errors

As Python is a dynamic language it can never be garantueed that the input of your functions is what you want it to be. Take as an example a function whose purpose is the computation of the sum of the digits in a number. What if by accident someone passes a string as argument to this function? In some cases it is hence a good idea to check the input of a function for its sanity. If the input does not hold this test you may raise and error like this:

>>> raise ValueError('The input was wrong')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: The input was wrong

Summary

  • if statements are a way to add a branch into your code.
  • while loops should be avoided whenever possible.
  • for loops are used to work on items of a container.
  • range() is used to ad-hoc generate containers holding integers.
  • enumerate() is used to keep track of the current index in a for loop.
  • zip() is used to loop over several containers at once.
  • raise is used to raise an error when something is fishy.

Functions

Introduction

Some tasks in your code may be performed a lot of times. Say you want to say “Hello” not only to the world but to a lot of people. You would have to write

print('Hello, {name}!')

an awful lot of times. It would be way shorter if we were able to write

greet('{name}')

and it would just print Hello, {name}!. OK, it is not that shorter… But imagine you want to find the roots of some functions. There is way more code involved and having a nice little shortcut to execute all that code without writing it over and over again would be nice, wouldn’t it? This is in the spirit of the DRY principle! As a nice side effect you only have to change the routine in one place of your code instead of all of them if you find a bug or want to use some more sophisticated routine.

But even if you do only use a code snippet once in your program it may be more declarative to give it a short name and hide the implementation somewhere else. This also helps you to divide and conquer larger problems!

The nice thing is that you already know a function: print()! You used it like this:

print('Hello, World!')

Let’s take a closer look at the parts that make up a function.

Components of a function

Name

Most functions have names [1]. The name of print() is – never would have guessed – print. Printing the function without calling it reveals that it is a built-in function:

>>> print(print)
<built-in function print>

There are a lot more Built-in Functions, among them, for example, float().

Positional arguments

float() is a function that lets you convert a number or a string that represents a number to a floating point number. Its interface is defined as

float(x)

where x is its argument. To be more precise it is its positional argument. By calling float() and passing an integer as argument the result is the corresponding float:

>>> float(1)
1.0
>>> float(-2)
-2.0

You can also pass strings that represent numbers to float():

>>> float('1')
1.0
>>> float('-2')
-2.0
>>> float('1.500')
1.5
>>> float('1e-2')
0.01
>>> float('+1E6')
1000000.0

A similar function also exists for integer. It is the function int().

Keyword arguments

int() is a function that lets you convert a number or a string that represents a number to an integer. Its interface is defined as

int(x, base=10)

where x is its positional argument and base is its keyword argument. While positional arguments must be passed to a function keyword arguments are optional and provide default values. The default value of base is 10.

When passing floats to int() everything after the decimal point is dropped:

>>> int(1.0)
1
>>> int(-2.0)
-2
>>> int(1.3)
1
>>> int(1.8)
1
>>> int(-1.3)
-1
>>> int(-1.8)
-1

When passing strings to int() the base keyword argument is used to indicate how the number should be interpreted in terms of which base it is given in.

>>> int('1')
1
>>> int('-2')
-2
>>> int('101010')
101010
>>> int('101010', base=2)
42

You do not have to write out the keyword arguments, they will be interpreted in the same order as they are in the interface:

>>> int('101010', 2)
42

But you can not convert strings that represent floating point numbers to integers like this:

>>> int('1.0')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: invalid literal for int() with base 10: '1.0'

For this you would have to chain int() and float():

>>> int(float('1.0'))
1

Defining a function

A function is defined by starting with the def keyword. Subsequently you provide the name of the function and its arguments in paranthesis. You can then use it like the functions you already used.

>>> def greet(name):
...     print('Hello', name)
...
>>> greet('World')
Hello World
>>> greet('you')
Hello you
>>> def add_reciprocal(a, b):
...     return 1/a + 1/b
>>> add_reciprocal(4, 8)
0.375

To provide keyword arguments you let the argument have a default value by assigning a value to it:

>>> def name_and_favorite_food(name, favorite_food='pizza'):
...     return name + "'s favorite food is " + favorite_food + '.'
...
>>> name_and_favorite_food('Dominik')
"Dominik's favorite food is pizza."
>>> name_and_favorite_food('Stefan', 'kimchi')
"Stefan's favorite food is kimchi."

If you execute them in the interactive Python interpreter the value returned by a function is automatically displayed if you do not assign it to a variable. It is important to know that the return keyword is used to make the value accessible to the outer scope, that is outside of the function. See the following example, where we first do not have a return statement:

>>> def is_positive_without_return(x):
...     if x >= 0:
...         print(True)
...     else:
...         print(False)
...
>>> is_positive_without_return(1)
True
>>> is_positive_without_return(-1)
False
>>> a = is_positive_without_return(1)
True
>>> print(a)
None

As you can see the variable a has no value at all. If you want a to receive the result of the function you have to return it, not print it:

>>> def is_positive_without_return(x):
...     if x >= 0:
...         return True
...     else:
...         return False
...
>>> is_positive_without_return(1)
True
>>> is_positive_without_return(-1)
False
>>> a = is_positive_without_return(1)
>>> print(a)
True

Functions as function arguments

You are able to pass almost anything to a function—even functions itself! This is especially useful if you want to do data fitting or root finding of mathematical functions.

def format_heading(text):
    return '\n' + text + '\n' + '='*len(text) + '\n'

def prettify(sections, header_formatter):
    # Assume that `sections` is a list of dictionaries with the keys
    # ``heading`` for the heading text and ``content`` for the content
    # of the section.
    # Assume that the `header_formatter` is a function that takes a string
    # as argument and formats it in a way that is befitting for a heading.
    text = ''
    for section in sections:
        heading = section['heading']
        content = section['content']
        text += format_heading(heading) + content + '\n'
    # Before returning it we make shure that all surrounding whitespace is
    # gone.
    return text.strip()

secs = [
    {
        'heading': 'Introduction',
        'content': 'In this section we introduce some smart method to teach Python.'
    },
    {
        'heading': 'Results',
        'content': 'More than 42 % of participants in this study learned Python.'
    }
]

pretty_text = prettify(secs, header_formatter=format_heading)
print(pretty_text)

So as you can see the argument header_formatter is just treated as if it was a function. If you run this script the output will be:

Introduction
============
In this section we introduce some smart method to teach Python.

Results
=======
More than 42 % of participants in this study learned Python.

Summary

  • Functions can make your life easier by streamlining repeated tasks or giving a name to complex programming logic.
  • Functions may have two different kinds of arguments, positional arguments that must be given to the function and keyword arguments which provide default values.
  • Functions are defined using the def keyword.
  • Functions may return values by using the return keyword.

Exercises

Heaviside step function

The Heaviside step function can be defined as

\[\begin{split}H(x) = \begin{cases} 0, &x < 0, \\ 1, &x \ge 0. \end{cases}\end{split}\]

Write a function that implements the Heaviside step function following the given definition:

heaviside_step_function(n)

Return the value of the Heaviside step function of n.

Parameters:n (float) – The number of which the Heaviside step function should be computed.
Returns:The value of the Heaviside step function of n.
Return type:int

Start by downloading the exercise template and editing this file. You can run tests via

$ python heaviside_step_function.py test

to check whether you got a correct solution. You can also take a look at one possible solution.

Sum of multiples of 3 and 5

Note

The following problem was inspired by Problem 1 at Project Euler.

If you list all the natural numbers below 10 that are multiples of 3 or 5, you get 3, 5, 6, and 9. The sum of these multiples is 23. Define a function according to the following definition:

sum_of_multiples_of_3_and_5(n)

Return the sum of all multiples of 3 and 5 below n.

Parameters:n (int) – The number up to which the sum of multiples of 3 and 5 is computed.
Returns:The sum of all multiples of 3 and 5 up to n.
Return type:int

Start by downloading the exercise template and editing this file. You can run tests via

$ python sum_of_multiples_of_3_and_5.py test

to check whether you got a correct solution. You can also take a look at one possible solution.

Even Fibonacci numbers

Note

The following problem was inspired by Problem 2 at Project Euler.

Each new term in the Fibonacci sequence is generated by adding the previous two terms. By starting with 0 and 1, the first ten terms will be

0, 1, 1, 2, 3, 5, 8, 13, 21, 34.

Define a function according to the following definition:

even_fibonacci_numbers(n)

Return the sum of all even Fibonacci numbers below n.

Parameters:n (int) – The number up to which the sum of Fibonacci numbers is computed.
Returns:The sum of all Fibonacci numbers up to n.
Return type:int

Start by downloading the exercise template and editing this file. You can run tests via

$ python even_fibonacci_numbers.py test

to check whether you got a correct solution. You can also take a look at one possible solution.

Prime factorization

The prime factors of 360 are

\[360 = 2 \cdot 2 \cdot 2 \cdot 3 \cdot 3 \cdot 5 = 2^3 \cdot 3^2 \cdot 5.\]

Write a function that returns a dictionary whose first keys correspond to the prime factor and the value to the multiplicity of this prime factor. For example, given 360 the function should return:

{
    2: 3,
    3: 2,
    5: 1
}
prime_factorization(n)

Return the prime factorization of n.

Parameters:n (int) – The number for which the prime factorization should be computed.
Returns:List of tuples containing the prime factors and multiplicities of n.
Return type:dict[int, int]

Start by downloading the exercise template and editing this file. You can run tests via

$ python prime_factorization.py test

to check whether you got a correct solution. You can also take a look at one possible solution.

Newton’s method (1D)

Newton’s method is a rather popular iterative root finding algorithm. Starting at an initial guess \(x_0\) it tries to find better and better approximations of the root of a function \(f(x)\). For this it uses the first derivative \(f'(x)\) of the function. The process

\[x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}\]

is repeated until a value \(f(x_n)\) is reached that is within a predefined tolerance to zero. For further information see the Wikipedia page.

The way it is supposed to work is as follows:

>>> def f(x):
...     return x**2 - 2
...
>>> def df_dx(x):
...     return 2*x
...
>>> newtons_method_1d(f, df_dx, x0=4, tol=1e-8)  
1.4142135623730951

Hence, implement a function following the given definition:

newtons_method_1d(f, df_dx, x0, tol)

Return the root of f within tol by using Newton’s method.

Parameters:
  • f (callable[[float], float]) – The function of which the root should be found.
  • df_dx (callable[[float], float]) – The derivative of f.
  • x_0 (float) – The initial guess for the algorithm.
  • tol (float) – The tolerance of the method.
Returns:

The root of f within a tolerance of tol.

Return type:

float

Start by downloading the exercise template and editing this file. You can run tests via

$ python newtons_method_1d.py test

to check whether you got a correct solution. You can also take a look at one possible solution.

Footnotes

[1]Lambdas are the exception to the rule as they define anonymous functions.

Strings

You have worked with them for a while now: strings. In principle strings are containers of characters in a certain order. As such they are somewhat similar to lists <list>`, and also support, e.g., indexing.

>>> my_first_string = 'Hello, World!'
>>> my_first_string[7:]
'World!'
>>> my_first_string[::-1]
'!dlroW ,olleH'

They offer a large amount of ways to work with text, of which a few selected ones are outlined in the following.

lower

Returns the lowercase version of the string.

>>> 'Hello, World!'.lower()
'hello, world!'

upper

Returns the uppercase version of the string.

>>> 'Hello, World!'.upper()
'HELLO, WORLD!'

split

Working with a whole string at once is needlessly complicated in many cases. Hence the Python standard library offers a way to take a string apart: str.split()

>>> some_string = 'My first string'
>>> some_string.split()
['My', 'first', 'string']

By default it splits to an arbitrary amount of whitespace

>>> some_string = 'My     second     string'
>>> some_string.split()
['My', 'second', 'string']

Note

In Python whitespace is everything you can not directly see like spaces, tabs (\t in strings), and newlines (\n in strings).

but you can also specify another string which is used to split the other string apart:

>>> some_string = '0.1, 0.2, 0.3, 0.4, 0.5'
>>> some_string.split(',')
['0.1', ' 0.2', ' 0.3', ' 0.4', ' 0.5']

The string that is used to split with is consumed in the process. So to get rid of the additional whitespace you might get the idea to add it to the string used for the splitting

>>> some_string = '0.1, 0.2, 0.3, 0.4, 0.5'
>>> some_string.split(', ')
['0.1', '0.2', '0.3', '0.4', '0.5']

Eventually this might lead to problems if the string format is not strictly kept:

>>> some_string = '0.1,   0.2,0.3, 0.4,  0.5'
>>> some_string.split(', ')
['0.1', '  0.2,0.3', '0.4', ' 0.5']

But Python’s got you covered…

strip

To get rid of unwanted whitespace around a string you can use the str.strip() method:

>>> some_string = '\n\n    So much surrounding whitespace\n\n'
>>> print(some_string)


    So much surrounding whitespace


>>> some_string.strip()
'So much surrounding whitespace'

You can also get rid of something else than whitespace by specifying the characters as an argument:

>>> more_string = '...Python...'
>>> more_string.strip('.')
'Python'

Note

The characters you specify as an argument are not seen as a string but as a collection of characters you want to strip of the string:

>>> incantation = 'abracadabra'
>>> incantation.strip('bra')
'cad'

Summary

Exercises

Split and strip

From time to time you may want to split a string by a delimiting character like ,, but the whitespace is all over the place.

>>> some_string = '0.1,   0.2,0.3, 0.4,  0.5'
>>> some_string.split(',')
['0.1', '   0.2', '0.3', ' 0.4', '  0.5']
>>> some_string.split(', ')
['0.1', '  0.2,0.3', '0.4', ' 0.5']

So a pure splitting operation does not really do the trick. Write a function that improves this behavior for this case, so that you can use it as follows:

>>> strip_and_split(some_string, ',')  
['0.1', '0.2', '0.3', '0.4', '0.5']
split_and_strip(string, delimiter)

Return a list of stripped strings after splitting string by delimiter.

Parameters:
  • string (str) – The string to split and strip.
  • delimiter (str) – The string to split by.
Returns:

The list of strings that are stripped after splitting by delimiter.

Return type:

list[str]

Start by downloading the exercise template and editing this file. You can run tests via

$ python split_and_strip.py test

to check whether you got a correct solution. You can also take a look at one possible solution.

Is palindrome string

A palindrome is a word, phrase, number or sequence of characters which reads the same backward as forward. Examples for words are radar, level, and racecar. Write a function that checks whether a strings is a palindrome or not.

is_palindrome_string(string)

Return whether string is a palindrome, ignoring the case.

Parameters:string (str) – The string to check.
Returns:A boolean representing whether string is a palindrome or not.
Return type:bool

Start by downloading the exercise template and editing this file. You can run tests via

$ python is_palindrome_string.py test

to check whether you got a correct solution. You can also take a look at one possible solution.

File IO

Usually the result of your program is important not only now but in the future as well. For this it is a good practice to store the results in file so that you can work with them later on if necessary. This is called file input and output, or in short: file IO.

Reading from a file

The best way to open a file in Python is by using the with statement. This automatically opens the file, keeps a lock on it so that no one is able to modify it while you are working with it, and closes it afterwards. Interacting with the file is then done by working with the variable that you specify after as. This is called a file object. To get the complete content of a file use its read() method like this:

>>> with open('csv_file.txt', 'r') as f:
...     file_content = f.read()
...
>>> print(file_content)
1, 2, 3
4, 5, 6
7, 8, 9

where the first parameter is the path to the file and the second parameter is the so-called file mode. r is for read-only.

Note

In some tutorials and also production code you may find something along the lines of this to interact with a file:

>>> f = open('csv_file.txt', 'r')
>>> file_content = f.read()
>>> f.close()
>>> print(file_content)
1, 2, 3
4, 5, 6
7, 8, 9

This is kind of the old style of working with files when the with statement did not exist yet. It has the huge downside that you have to take care about closing the file. And if an error is raised between open and close it is not closed at all. The with statement takes care of this for you even in the case of an exception and is subsequently the preferred way.

You can also iterate over f as if it is some form of container:

>>> with open('csv_file.txt', 'r') as f:
...     for i, line in enumerate(f):
...         print('Line', i, '--', line)
...
Line 0 -- 1, 2, 3

Line 1 -- 4, 5, 6

Line 2 -- 7, 8, 9

As you can see you get some additional white lines. The reason for this is that each line still contains its newline character \n in the end. To this one print() adds an additional one by default so you end up with an empty line. To circumvent this use the strip() method of the line string:

>>> with open('csv_file.txt', 'r') as f:
...     for i, line in enumerate(f):
...         stripped_line = line.strip()
...         print('Line', i, '--', stripped_line)
...
Line 0 -- 1, 2, 3
Line 1 -- 4, 5, 6
Line 2 -- 7, 8, 9

Writing to a file

To write to a file you have to open it first, this time with w as file opening mode, to indicate that you want to write to the file. Then you can use the write() method of the file object to write to the file:

>>> with open('my_first_file.txt', 'w') as f:
...     f.write('This is smart.')
...     f.write('This is even smarter.')
...
14
21

Now the content of your file would be

This is smart.This is even smarter.

Which is not nicely formatted. So you have to take care that you add the newline character \n and spaces accordingly:

>>> with open('my_first_file.txt', 'w') as f:
...     f.write('This is smart.\n')
...     f.write('This is even smarter.\n')
...
15
22

Subsequently the content of your file would be

This is smart.
This is even smarter.

Summary

  • You can open a file using the open() function in a with statement.
  • To merely read from a file open it with the filemode r and use the read() method of the file object.
  • To write from a file open it with the filemode w and use the write method of the file object.

Classes

Especially for larger code bases the rather mathematical way of programming may become slightly convoluted. A paradigm that tries to alleviate this is object-oriented programming. Let us just dive right in with the definition of a rectangle class. Each rectangle has a length and a width, from which one can compute its area and perimeter. Here is the definition:

class Rectangle:
    """A rectangle that is described by its length and width."""

    def __init__(self, length, width):
        self.length = length
        self.width = width

    def area(self):
        """Return the area of the rectangle."""
        return self.length*self.width

    def perimeter(self):
        """Return the perimeter of the rectangle."""
        return 2*(self.length + self.width)

Let us go over this step by step. The definition of a class starts with the keyword class followed by the name of the class. Within the class you can define methods. They are like functions that are attached to the class. What is slightly peculiar is, that all take self as their first argument— we will come to that in a second. The first method is __init__.

def __init__(self, length, width):
    self.length = length
    self.width = width

As you use classes you instantiate objects of that class. During the instantiation you can provide some initial arguments to the class to customize the resulting object using the special __init__ method. In this example a rectangle object can be initialized with values for its length \(l\) and for its width \(w\). These values are then stored in the instance as attributes. The reference to the instance is the self argument, that each method has as its first argument. So by attaching information to self, each distinct object has its own state.

Now we can exploit this in other methods. The area \(A\) of a rectangle is defined as

\[A = l \cdot b\]

In the code this is described by the following definition:

def area(self):
    """Return the area of the rectangle."""
    return self.length*self.width

Where a method is defined, that takes the length of the rectangle self.length, multiplies it with self.width and then returns it. If we did not use self.length but just length or used width instead of self.width we would not have accessed the values we stored in this rectangle during its initialization, but some global values instead. So to make sure that we only use the attributes of our specific rectangle we access them from self.

The method used to get the perimeter \(P\), which, for a rectangle, is defined as

\[P = 2 (l + b)\]

follows the same theme:

def perimeter(self):
    """Return the perimeter of the rectangle."""
    return 2*(self.length + self.width)

In the following the usage of this class is shown.

first_rectangle = Rectangle(length=2, width=3)
print('Information about the first rectangle')
print('Length:', first_rectangle.length)
print('Width:', first_rectangle.width)
print('Area:', first_rectangle.area())
print('Perimeter:', first_rectangle.perimeter())

First an object of the class Rectangle is instantiated with the length argument set to 2 and the width argument set to 3. The name of our first Rectangle object is first_rectangle. Subsequently the attributes length and width of the object first_rectangle can be accessed via first_rectangle.length and first_rectangle.width, respectively. One could say that what ever has been self in the class definition now is replaced by the name of the object. In the case of the methods the self argument is implicitly supplied by calling the method from the object. So it is sufficient to use first_rectangle.area(), and not first_rectangle.area(self) or first_rectangle.area(first_rectangle)— both of which would be wrong. The output of the above code is

Information about the first rectangle
Length: 2
Width: 3
Area: 6
Perimeter: 10

If another rectangle is instantiated with different values the information changes accordingly. So in the case of a second_rectangle

second_rectangle = Rectangle(length=5, width=7)
print('Information about the second rectangle')
print('Length:', second_rectangle.length)
print('Width:', second_rectangle.width)
print('Area:', second_rectangle.area())
print('Perimeter:', second_rectangle.perimeter())

the output would be

Information about the second rectangle
Length: 5
Width: 7
Area: 35
Perimeter: 24

Inheritance

A square is a special case of a rectangle, i.e., a rectangle with equal sides. As the computation of the geometrical properties remains the same, one option of initializing a square could be

square = Rectangle(length=2, width=2)

But maybe you want to be more explicit when initializing squares. This is where inheritance kicks in. Let us take a look at the definition of a Square class that inherits from the previously defined Rectangle class:

class Square(Rectangle):
    """A square that is described by its side length."""

    def __init__(self, side_length):
        super().__init__(length=side_length, width=side_length)

As opposed to the Rectangle class the name of the Square class is followed by parenthesis containing Rectangle. This tells Python that the Rectangle class is the superclass of Square, i.e., Square inherits from Rectangle. To inherit means that, if not otherwise defined, Square has the exact same method definitions as its superclass Rectangle.

But as the __init__ method of Rectangle takes the arguments length and width, which is not required for the definition of a square, we can simplify it. Now it takes only one argument side_length. If we stored it as we did in the Rectangle class, i.e., as

def __init__(self, side_length):
    self.side_length = side_length

The methods that Square inherits from Rectangle would fail to be callable, as they access the attributes length and width, which would not be defined if we took this definition of the __init__ method. Instead we could do this:

def __init__(self, side_length):
    self.length = side_length
    self.width = side_length

So now the definition looks awfully similar to the one of the Rectangle class. A bit too similar maybe, and we do not want to repeat ourselves. Another side-effect is that, should the __init__ method of the Rectangle implement some more code, it would have to be copied to Square as well. As this is error-prone there is a way to leverage the method of a superclass within the child class, and this is done using the super() function. If used within a method definition followed by calling a method it will resolve to the first parent class that implements a method with this name and call it for the current object. So by implementing it via

def __init__(self, side_length):
    super().__init__(length=side_length, width=side_length)

we tell Python to call the __init__ method of the superclass of Square and pass the side_length for the length and width.

Using the class can now be done like this:

first_square = Square(side_length=2)
print('Information about the first square')
print('Length:', first_square.length)
print('Width:', first_square.width)
print('Area:', first_square.area())
print('Perimeter:', first_square.perimeter())

The respective output:

Information about the first square
Length: 2
Width: 2
Area: 4
Perimeter: 8

Type checking

Checking whether an object is of a certain type, or is a child of a certain type, is done by using the isinstance() function. The first argument is the object whose type should be checked, the second argument is the class for which to check.

def what_is_it(object):
    if isinstance(object, Rectangle):
        print('It is a rectangle.')
    if isinstance(object, Square):
        print('It is a square.')

If we use this simple function on our geometrical objects you can see how it works.

what_is_it(first_rectangle)

As first_rectangle is a Rectangle, but not a Square, the output is:

It is a rectangle.

Now for the first_square:

what_is_it(first_square)

As Square is a specialization of Rectangle you can also see that it identifies as such in addition to being a Square.

It is a rectangle.
It is a square.

Special methods

Some behavior of Python classes is implemented in terms of so-called Special method names. An example of how they may be used can be seen in the following:

import math

class FreeVector:
    """A vector that is not bound by an initial or terminal point."""

    def __init__(self, vector):
        self.vector = tuple(vector)

    @property
    def magnitude(self):
        return math.sqrt(math.fsum(v**2 for v in self.vector))

    @property
    def direction(self):
        magnitude = self.magnitude
        return tuple(v/magnitude for v in self.vector)

    def __repr__(self):
        return '{self.__class__.__name__}(vector={self.vector!r})'.format(
            self=self)

    def __str__(self):
        return str(self.vector)

    def __eq__(self, other):
        if (isinstance(other, FreeVector) and
                all(math.isclose(a, b) for a, b in zip(
                    other.vector, self.vector))):
            return True
        else:
            return False

    def __neq__(self, other):
        return not self.__eq__(self, other)

    def __add__(self, other):
        if not isinstance(other, FreeVector):
            return NotImplemented
        return tuple(a + b for a, b in zip(self.vector, other.vector))

    def __sub__(self, other):
        if not isinstance(other, FreeVector):
            return NotImplemented
        return tuple(a - b for a, b in zip(self.vector, other.vector))

The usage may be as follows:

>>> a = FreeVector((1, 2, 3))
>>> a
FreeVector(vector=(1, 2, 3))
>>> str(a)
'(1, 2, 3)'
>>> b = FreeVector((1, 2, 3))
>>> c = FreeVector((4, 5, 6))
>>> a == b
True
>>> a == c
False
>>> a + c
(5, 7, 9)
>>> c - a
(3, 3, 3)

Exercises

  • Copy the definition of the Rectangle class and extend it by adding a method aspect_ratio which returns the ratio of its length to its width.
  • Define a Circle class with the radius \(r\) as defining attribute. Implement the area and perimeter class accordingly.
  • Read and work on the book “Building Skills in Object-Oriented Design” to understand the process of object-oriented design.

NumPy

NumPy is a third-party package for Python that provides utilities for numerical programming with Python. It is commonly imported via

import numpy as np

which maps the NumPy package to the name np. So every time you see some code with the np prefix you may assume that NumPy is imported even though it might not be stated explicitly.

Array

The NumPy array is the underlying mechanism that makes NumPy so convenient and efficient.

Creating arrays

A NumPy array is easily initialized via

>>> np.array([0, 1, 2])  # 1D array of integers
array([0, 1, 2])
>>> np.array([0.0, 1.0, 2.0])  # 1D array of floats
array([ 0.,  1.,  2.])
>>> np.array([[0, 1], [2, 3]])  # 2D array of integers
array([[0, 1],
       [2, 3]])

So by nesting lists of numbers you are able to construct multi-dimensional arrays. But there are also other methods to initialize arrays:

  • numpy.zeros():

    >>> np.zeros(3)
    array([ 0.,  0.,  0.])
    >>> np.zeros([2, 3])
    array([[ 0.,  0.,  0.],
           [ 0.,  0.,  0.]])
    
  • numpy.zeros():

    >>> np.ones(3)
    array([ 1.,  1.,  1.])
    >>> 3 * np.ones(3)
    array([ 3.,  3.,  3.])
    >>> np.ones([2, 3])
    array([[ 1.,  1.,  1.],
           [ 1.,  1.,  1.]])
    
  • numpy.eye():

    >>> np.eye(2)
    array([[ 1.,  0.],
           [ 0.,  1.]])
    >>> -2 * np.eye(2)
    array([[-2., -0.],
           [-0., -2.]])
    
  • numpy.arange(): This one should feel rather familiar to the range of regular Python. But where the latter is only able to deal with integers, the implementation of NumPy can also work with floats.

    >>> np.arange(5)
    array([0, 1, 2, 3, 4])
    >>> np.arange(3, 7)
    array([3, 4, 5, 6])
    >>> np.arange(2, 4, 0.5)
    array([ 2. ,  2.5,  3. ,  3.5])
    
  • numpy.linspace(): If you need evenly spaced samples this way of initializing them is preferred to numpy.arange() due to the latter introducing slight roundoff errors which is caused by the implementation.

    >>> np.linspace(0, 4, 5)
    array([ 0.,  1.,  2.,  3.,  4.])
    >>> np.linspace(0.3, 0.7, 5)
    array([ 0.3,  0.4,  0.5,  0.6,  0.7])
    >>> np.linspace(0.3, 0.7, 4, endpoint=False)
    array([ 0.3,  0.4,  0.5,  0.6])
    

Array attributes

The arrays also provide some information about themselves which can be accessed by its attributes.

Number of dimensions

The ndim attribute of an array is the amount of dimensions of the array.

>>> x = np.array([0.0, 1.0, 2.0])
>>> x.ndim
1
>>> x = np.array([[0, 1, 2], [3, 4, 5]])
>>> x.ndim
2

Shape

The shape attribute of an array is a tuple representing the number of elements along each dimension.

>>> x = np.array([0.0, 1.0, 2.0])
>>> x.shape
(3,)
>>> x = np.array([[0, 1, 2], [3, 4, 5]])
>>> x.shape
(2, 3)
>>> x.shape[0]
2
>>> x.shape[1]
3

Size

The size attribute of an array is the amount of elements of the array.

>>> x = np.array([0.0, 1.0, 2.0])
>>> x.size
3
>>> x = np.array([[0, 1, 2], [3, 4, 5]])
>>> x.size
6

So essentially it is the product sum of the shape.

Accessing data

Similarly to lists and tuples data is accessed via referring to the indices:

>>> x = np.array([[0, 1, 2],
...               [3, 4, 5]])
>>> x[0, 0]
0
>>> x[0, 1]
1
>>> x[1, 0]
3
>>> x[1, 2]
5

Slicing

Slicing refers to extracting partial data from arrays. This is very efficient as nothing is copied in memory.

>>> x = np.array([[0, 1, 2, 3, 4],
...               [5, 6, 7, 8, 9],
...               [10, 11, 12, 13, 14],
...               [15, 16, 17, 18, 19],
...               [20, 21, 22, 23, 24]])
>>> x[0, :]
array([0, 1, 2, 3, 4])
>>> x[1, :]
array([5, 6, 7, 8, 9])
>>> x[:, 1]
array([ 1,  6, 11, 16, 21])
>>> x[:3, :3]
array([[ 0,  1,  2],
       [ 5,  6,  7],
       [10, 11, 12]])
>>> x[2:, 2:]
array([[12, 13, 14],
       [17, 18, 19],
       [22, 23, 24]])

Note

Slices of an array share the memory of the original array. Hence all the changes you do to a slice are also represented in the original array:

>>> x = np.array([[0, 1, 2, 3, 4],
...               [5, 6, 7, 8, 9],
...               [10, 11, 12, 13, 14],
...               [15, 16, 17, 18, 19],
...               [20, 21, 22, 23, 24]])
>>> print(x)
[[ 0  1  2  3  4]
 [ 5  6  7  8  9]
 [10 11 12 13 14]
 [15 16 17 18 19]
 [20 21 22 23 24]]
>>> x_slice = x[1:-1, 1:-1]
>>> print(x_slice)
[[ 6  7  8]
 [11 12 13]
 [16 17 18]]
>>> x_slice[1, 1] = 888
>>> print(x_slice)
[[  6   7   8]
 [ 11 888  13]
 [ 16  17  18]]
>>> print(x)
[[  0   1   2   3   4]
 [  5   6   7   8   9]
 [ 10  11 888  13  14]
 [ 15  16  17  18  19]
 [ 20  21  22  23  24]]

Math

The goal of NumPy is to provide functions and classes that make numerical computations easier. For an extensive overview see the section Linear algebra (numpy.linalg). of the NumPy documentation. Some of them require the part linalg after numpy. In combination with

import numpy as np

the functions and classes may be used via

np.linalg.<function>

The functions that require this import are indicated by this usage.

Componentwise math operations

NumPy arrays support the typical math operations introduced in Basics and they are executed componentwise. Here are a few examples:

>>> x = np.array([1, 2, 3, 4, 5])
>>> y = np.array([6, 5, 8, 9, 10])
>>> z = 2.0
>>> x + z
array([ 3.,  4.,  5.,  6.,  7.])
>>> x + y
array([ 7,  7, 11, 13, 15])
>>> y % z
array([ 0.,  1.,  0.,  1.,  0.])
>>> y / x
array([ 6.        ,  2.5       ,  2.66666667,  2.25      ,  2.        ])

Dot

If you need a scalar product for 1D arrays or matrix multiplication for 2D arrays the function numpy.dot() is the function of choice:

>>> x = np.array([1, 2, 3, 4, 5])
>>> y = np.array([6, 5, 8, 9, 10])
>>> np.dot(x, y)
126
>>> x.dot(y)  # alternative syntax
126
>>> x = np.array([[1, 2, 3],
...               [4, 5, 6],
...               [7, 8, 9]])
>>> y = np.array([[10, 11, 12],
...               [13, 14, 15],
...               [16, 17, 18]])
>>> z = np.array([[1/19, 1/20, 1/21],
...               [1/22, 1/23, 1/24],
...               [1/25, 1/26, 1/27]])
>>> np.dot(x, np.dot(y, z))  # this is a bit convoluted
array([[ 12.35196172,  11.80535117,  11.30555556],
       [ 29.63712919,  28.32591973,  27.12698413],
       [ 46.92229665,  44.84648829,  42.9484127 ]])
>>> x.dot(y).dot(z)  # better
array([[ 12.35196172,  11.80535117,  11.30555556],
       [ 29.63712919,  28.32591973,  27.12698413],
       [ 46.92229665,  44.84648829,  42.9484127 ]])

Eigenvalues

The function numpy.linalg.eig() may be used to compute the eigenvalues and eigenvectors of a square array:

>>> x = np.array([[1, 2, 3],
...               [4, 5, 6],
...               [7, 8, 9]])
>>> np.linalg.eig(x)
(array([  1.61168440e+01,  -1.11684397e+00,  -1.30367773e-15]), array([[-0.23197069, -0.78583024,  0.40824829],
       [-0.52532209, -0.08675134, -0.81649658],
       [-0.8186735 ,  0.61232756,  0.40824829]]))

The output is a bit messy, but ss you can read in the documentation of numpy.linalg.eig() the function actually returns two things: an array holding the eigenvalues and an array holding the eigenvectors as columns:

>>> eigenvalues, eigenvectors = np.linalg.eig(x)
>>> for i in range(eigenvalues.size):
...     print(eigenvalues[i], eigenvectors[:, i])
...
16.1168439698 [-0.23197069 -0.52532209 -0.8186735 ]
-1.11684396981 [-0.78583024 -0.08675134  0.61232756]
-1.30367772647e-15 [ 0.40824829 -0.81649658  0.40824829]

Norm

Computing the norm of a vector is a fairly common task, so it seems obvious that NumPy also provides a function for this: numpy.linalg.norm().

>>> x = np.array([3, 4])
>>> np.linalg.norm(x)
5.0

Besided the regularly used \(\ell^2\)-norm this function also offers other norms. For further information use the documentation of numpy.linalg.norm().

Determinant

Knowing the determinant of an array may tell you whether it is singular or, in the case of a \(3x3\) array, tell you the volume of the rhomboid that is spanned by the vectors composing the array.

>>> x = np.array([[1, 2, 3],
...               [4, 5, 4],
...               [3, 2, 1]])
>>> np.linalg.det(x)
-7.9999999999999982

This should be \(-8\) so it is an interesting way to see that the determinant computed by NumPy is computed numerically and not analytically.

Exercises

Gaussian Elimination (Eye Variant)

Solving systems of linear equations is one of the basic tasks in numerical mathematics—hence it is also one of the basic tasks in computational materials science. A system of linear equations like

\[\begin{split} 2x + y - z &= 8 & \quad L_1 \\ -2x - y + 2z &= -11 & \quad L_2 \\ -2x + y + 2z &= -3 & \quad L_3\end{split}\]

may also be described via

\[\vec{A} \vec{x} = \vec{b}\]

where

\[\begin{split}A = \begin{bmatrix} 2 & 1 & -1 \\ -3 & -1 & 2 \\ -2 & 1 & 2 \end{bmatrix} \quad x = \begin{bmatrix} x \\ y \\ z \end{bmatrix} \quad b = \begin{bmatrix} 8 \\ -11 \\ -3 \end{bmatrix}\end{split}\]

The solution algorithm makes use of the augmented matrix form

\[\begin{split}\begin{bmatrix} 2 & 1 & -1 & | & 8 \\ -3 & -1 & 2 & | & -11 \\ -2 & 1 & 2 & | & -3 \end{bmatrix}\end{split}\]

The procedure is then to first get this augmented matrix form into triangle form and subsequently form the identity matrix on the left hand side. The right hand side then is the solution. For further information see Gaussian elimination

Start with the following template, complete it, and test it:

def gaussian_elimination_eye(A, b):

Matplotlib

“The purpose of visualization is insight, not pictures.”

—Ben Shneiderman

Visualization is one of the most important means for the interpretation of the results of scientific computations. The package that is mostly used for this task is Matplotlib.

The matplotlib package is typically imported via

import matplotlib.pyplot as plt

And subsequently all plotting commands are called using the prefix plt.

Rather than exercises that want you to get some results you are shown simple examples for the most common types of plots.

Note

Producing scientific plots that manage to support your chain of arguments in a presentation or publication can not be overestimated. To a certain degree it could be considered an art, but there are some best practices, e.g., for picking colormaps.

Line plots

The most common plot you see in scientific publications may well be a so-called line plot. The command for line plots is plot() and using it like this

import matplotlib.pyplot as plt
import numpy as np

# Generate data for the plot
x = np.linspace(0, 1, 21)
y0 = np.sin(2*np.pi*x)
y1 = np.cos(2*np.pi*x)

# Generate the plot
fig, ax = plt.subplots()
ax.plot(x, y0)
ax.plot(x, y1)
plt.show(fig)

results in a line plot with both the sine and cosine:

_images/line_plot.png

Scatter plots

When your data does not have a sequence but is still characterized by data points a so-called scatter plot is the plot of choice. The command is scatter() and when used like this

import matplotlib.pyplot as plt
import numpy as np

# Generate data for the plot
r = np.random.RandomState(42)
x = r.random_sample(10)
y0 = r.random_sample(10)
y1 = r.random_sample(10)

# Generate the plot
fig, ax = plt.subplots()
ax.scatter(x, y0)
ax.scatter(x, y1)
plt.show(fig)

it results in a plot showing the different data point groups:

_images/scatter_plot.png

Pcolormesh plots

pcolormesh()

import matplotlib.pyplot as plt
import numpy as np
from scipy.ndimage.filters import gaussian_filter

# Generate data for the plot
x = np.linspace(0, 1, 51)
y = np.linspace(0, 1, 51)
r = np.random.RandomState(42)
z = gaussian_filter(r.random_sample([50, 50]), sigma=5, mode='wrap')
z -= np.min(z)
z /= np.max(z)

# Generate the plot
fig, ax = plt.subplots()
cmap = ax.pcolormesh(x, y, z)
fig.colorbar(cmap)
plt.show(fig)
_images/pcolormesh_plot.png

Contour plots

contour()

import matplotlib.pyplot as plt
import numpy as np
from scipy.ndimage.filters import gaussian_filter

# Generate data for the plot
x = np.linspace(0, 1, 50)
y = np.linspace(0, 1, 50)
r = np.random.RandomState(42)
z = gaussian_filter(r.random_sample([50, 50]), sigma=5, mode='wrap')
z -= np.min(z)
z /= np.max(z)

# Generate the plot
fig, ax = plt.subplots()
cmap = ax.contour(x, y, z)
fig.colorbar(cmap)
plt.show(fig)
_images/contour_plot.png

Contourf plots

contourf()

import matplotlib.pyplot as plt
import numpy as np
from scipy.ndimage.filters import gaussian_filter

# Generate data for the plot
x = np.linspace(0, 1, 50)
y = np.linspace(0, 1, 50)
r = np.random.RandomState(42)
z = gaussian_filter(r.random_sample([50, 50]), sigma=5, mode='wrap')
z -= np.min(z)
z /= np.max(z)

# Generate the plot
fig, ax = plt.subplots()
cmap = plt.contourf(x, y, z)
fig.colorbar(cmap)
plt.show(fig)
_images/contourf_plot.png

SciPy

SciPy is a collection of modules that offer high-level interfaces to algorithms that are common in the scientific community, like linear algebra, optimization, integration, interpolation, etc. Covering all the possibilities in a tutorial is almost impossible, so to see what SciPy can do refer to its documentation.

Fitting curves

The routine used for fitting curves is part of the scipy.optimize module and is called scipy.optimize.curve_fit(). So first said module has to be imported.

>>> import scipy.optimize

The function that you want to fit to your data has to be defined with the x values as first argument and all parameters as subsequent arguments.

>>> def parabola(x, a, b, c):
...     return a*x**2 + b*x + c
...

For test purposes the data is generated using said function with known parameters.

>>> params = [-0.1, 0.5, 1.2]
>>> x = np.linspace(-5, 5, 31)
>>> y = parabola(x, params[0], params[1], params[2])
>>> plt.plot(x, y, label='analytical')  
[<matplotlib.lines.Line2D object at ...>]
>>> plt.legend(loc='lower right')  
<matplotlib.legend.Legend object at ...>
>>> plt.show()
_images/fitting_curves-3.png

Then some small deviations are introduced as fitting a function to data generated by itself is kind of boring.

>>> r = np.random.RandomState(42)
>>> y_with_errors = y + r.uniform(-1, 1, y.size)
>>> plt.plot(x, y_with_errors, label='sample')  
[<matplotlib.lines.Line2D object at ...>]
>>> plt.legend(loc='lower right')  
<matplotlib.legend.Legend object at ...>
>>> plt.show()
_images/fitting_curves-4.png

Now the fitting routine can be called.

>>> fit_params, pcov = scipy.optimize.curve_fit(parabola, x, y_with_errors)

It returns two results, the parameters that resulted from the fit as well as the covariance matrix which may be used to compute some form of quality scale for the fit. The actual data for the fit may be compared to the real parameters:

>>> for param, fit_param in zip(params, fit_params):
...     print(param, fit_param)
...
-0.1 -0.0906682795944
0.5 0.472361903203
1.2 1.00514576223
>>> y_fit = parabola(x, *fit_params)
>>> plt.plot(x, y_fit, label='fit')  
[<matplotlib.lines.Line2D object at ...>]
>>> plt.legend(loc='lower right')  
<matplotlib.legend.Legend object at ...>
>>> plt.show()
_images/fitting_curves-6.png

As you can see the fit is rather off for the third parameter c. A look at the covariance matrix also shows this:

>>> print(pcov)
[[  1.64209005e-04   1.75357845e-12  -1.45963560e-03]
 [  1.75357845e-12   1.16405938e-03  -1.73642112e-11]
 [ -1.45963560e-03  -1.73642112e-11   2.33217333e-02]]

But to get a more meaningful scale for the quality of the fit the documentation recommends the following:

>>> print(np.sqrt(np.diag(pcov)))
[ 0.01281441  0.03411831  0.15271455]

So the fit for a is quite well whereas the quality c is the worst of all the parameters.

Note

This routine works by iteratively varying the parameters and checking whether the fit got better or worse. To help the routine find the best fit it is hence a good idea to give it a good starting point. this can be done using the p0 argument of curve_fit(). In some cases this is even necessary.

Exercises

Fitting a frequency measurement

Use the data provided here and compute the amplitude \(A\), the frequency \(f\), and the phase offset \(\varphi\) of the measurement.

Recommended steps

  1. Read the data from the file.

  2. Plot the data to get a feeling for it.

  3. Think of a function that may be well suited for your data, and define it. In this case

    \[f(t) = A \sin (2 \pi f \cdot t - \varphi) + b\]
  4. Try to fit it to the data. Are there any problems? How could you solve them?

  5. Find a good initial guess by taking the data into account. It may be useful to smoothen the data first to reduce the noise. The scipy.signal module provides a variety of filters to do this. Use the smoothed data to get estimates for

    • the amplitude \(A\)
    • the frequency \(f\)
    • the phase offset \(\varphi\)
    • the amplitude offset \(b\)
  6. Use the initial guess to fit the function to the raw data again.

  7. Store the results in a file.

File formats

Ideally, a file format is

  • self-documented, i.e., it offers metadata that allows the user to infer what data is contained within the file.
  • easily readable by a multitude of programming languages, as not everyone we collaborate with shares our affection to Python.
  • dense, i.e., it should not take up a lot of space.
  • fault tolerant, i.e., in case of a faulty switched bit on the hard-drive the data should be recoverable.
  • flexible in the shape of the data that is to be stored.

In the following a selection of file formats is introduced.

Comma-separated values

Filename extension:
 .csv

We may know comma-separated values (CSV) file mostly by working with people who use spreadsheet software like Microsoft Excel. As the native file format of such software might be cumbersome to read, we kindly ask for a CSV export of those spreadsheets.

Typically a CSV file is a plain text format which contains one record per line with several delimiter separated fields. The sequence of the fields is the same for every records.

Note that CSV files are not standardized, and as such a lot of different flavors exist. The delimiter, for example, does not necessary have to be a comma, it could also be a simple space. If the delimiter is part of a field value, for example, in a string, an escape character is required, which depends on the flavor of the CSV implementation. Comments are also treated differently in each implementation, with some allowing comments in the header via #, and others not supporting them at all

Hence, CSV files may be used for very simple files, but may proof unsuitable for complex data.

Let us take a look at a CSV file that is the result of an atomistic simulation which is stored in a file:

with open('md_result.csv', 'r') as f:
    print(f.read())
# Atom types:
# - He: Helium
# - Ar: Argon
# Potentials:
# - He: Lennard-Jones potential with epsilon/k_B = 10.22 K, sigma = 256 pm
# - Ar: Lennard-Jones potential with epsilon/k_B = 120 K, sigma = 341 pm
# Simulation box size: 100 µm x 200 µm x 300 µm
# Periodic boundary conditions in all directions
# Step: 0
# Time: 0 s
# All quantities are given in SI units
atom_id type x-position y-position z-position x-velocity y-velocity z-velocity
0 He 5.7222e-07 4.8811e-09 2.0415e-07 -2.9245e+01 1.0045e+02 1.2828e+02
1 He 9.7710e-07 3.6371e-07 4.7311e-07 -1.9926e+02 2.3275e+02 -5.3438e+02
2 Ar 6.4989e-07 6.7873e-07 9.5000e-07 -1.5592e+00 -3.7876e+02 8.4091e+01
3 Ar 5.9024e-08 3.7138e-07 7.3455e-08 3.4282e+02 1.5682e+02 -3.8991e+01
4 He 7.6746e-07 8.3017e-08 4.8520e-07 -3.0450e+01 -3.7975e+02 -3.3632e+02
5 Ar 1.7226e-07 4.6023e-07 4.7356e-08 -3.1151e+02 -4.2939e+02 -6.9474e+02
6 Ar 9.6394e-07 7.2845e-07 8.8623e-07 -8.2636e+01 4.5098e+01 -1.0626e+01
7 He 5.4450e-07 4.6373e-07 6.2270e-07 1.5889e+02 2.5858e+02 -1.5150e+02
8 He 7.9322e-07 9.4700e-07 3.5194e-08 -1.9703e+02 1.5674e+02 -1.8520e+02
9 Ar 2.7797e-07 1.6487e-07 8.2403e-07 -3.8650e+01 -6.9632e+02 2.1642e+02
10 He 1.1842e-07 6.3244e-07 5.0958e-07 -1.4963e+02 4.2288e+02 -7.6309e+01
11 Ar 2.0359e-07 8.3369e-07 9.6348e-07 4.8457e+02 -2.6741e+02 -3.5254e+02
12 He 5.1019e-07 2.2470e-07 2.3846e-08 -2.3192e+02 -9.9510e+01 3.2770e+01
13 Ar 3.5383e-07 8.4581e-07 7.2340e-07 -3.0395e+02 4.7316e+01 2.2253e+02
14 He 3.8515e-07 2.8940e-07 5.6028e-07 2.3308e+02 2.5418e+02 4.2983e+02
15 He 1.5842e-07 9.8225e-07 5.7859e-07 1.9963e+02 2.0311e+02 -4.2560e+02
16 He 3.6831e-07 7.6520e-07 2.9884e-07 6.6341e+01 2.2232e+02 -9.7653e+01
17 He 2.8696e-07 1.5129e-07 6.4060e-07 9.0358e+01 -6.7459e+01 -6.4782e+01
18 He 1.0325e-07 9.9012e-07 3.4381e-07 7.1108e+01 1.1060e+01 1.5912e+01
19 Ar 4.3929e-07 7.5363e-07 9.9974e-07 2.3919e+02 1.7383e+02 3.3529e+02

Note that the first row that is not a comment holds the field names. This will be important for later. Using the csv from the Python standard library we can read it in nicely:

import csv

number_of_rows_to_skip = 12
with open('md_result.csv', 'r', newline='') as f:
    # skip the first rows
    for _ in range(number_of_rows_to_skip):
        next(f)

    csv_reader = csv.reader(f, delimiter=' ')
    for row in csv_reader:
        print(row)

Which then results in the following output:

['0', 'He', '5.7222e-07', '4.8811e-09', '2.0415e-07', '-2.9245e+01', '1.0045e+02', '1.2828e+02']
['1', 'He', '9.7710e-07', '3.6371e-07', '4.7311e-07', '-1.9926e+02', '2.3275e+02', '-5.3438e+02']
['2', 'Ar', '6.4989e-07', '6.7873e-07', '9.5000e-07', '-1.5592e+00', '-3.7876e+02', '8.4091e+01']
['3', 'Ar', '5.9024e-08', '3.7138e-07', '7.3455e-08', '3.4282e+02', '1.5682e+02', '-3.8991e+01']
['4', 'He', '7.6746e-07', '8.3017e-08', '4.8520e-07', '-3.0450e+01', '-3.7975e+02', '-3.3632e+02']
['5', 'Ar', '1.7226e-07', '4.6023e-07', '4.7356e-08', '-3.1151e+02', '-4.2939e+02', '-6.9474e+02']
['6', 'Ar', '9.6394e-07', '7.2845e-07', '8.8623e-07', '-8.2636e+01', '4.5098e+01', '-1.0626e+01']
['7', 'He', '5.4450e-07', '4.6373e-07', '6.2270e-07', '1.5889e+02', '2.5858e+02', '-1.5150e+02']
['8', 'He', '7.9322e-07', '9.4700e-07', '3.5194e-08', '-1.9703e+02', '1.5674e+02', '-1.8520e+02']
['9', 'Ar', '2.7797e-07', '1.6487e-07', '8.2403e-07', '-3.8650e+01', '-6.9632e+02', '2.1642e+02']
['10', 'He', '1.1842e-07', '6.3244e-07', '5.0958e-07', '-1.4963e+02', '4.2288e+02', '-7.6309e+01']
['11', 'Ar', '2.0359e-07', '8.3369e-07', '9.6348e-07', '4.8457e+02', '-2.6741e+02', '-3.5254e+02']
['12', 'He', '5.1019e-07', '2.2470e-07', '2.3846e-08', '-2.3192e+02', '-9.9510e+01', '3.2770e+01']
['13', 'Ar', '3.5383e-07', '8.4581e-07', '7.2340e-07', '-3.0395e+02', '4.7316e+01', '2.2253e+02']
['14', 'He', '3.8515e-07', '2.8940e-07', '5.6028e-07', '2.3308e+02', '2.5418e+02', '4.2983e+02']
['15', 'He', '1.5842e-07', '9.8225e-07', '5.7859e-07', '1.9963e+02', '2.0311e+02', '-4.2560e+02']
['16', 'He', '3.6831e-07', '7.6520e-07', '2.9884e-07', '6.6341e+01', '2.2232e+02', '-9.7653e+01']
['17', 'He', '2.8696e-07', '1.5129e-07', '6.4060e-07', '9.0358e+01', '-6.7459e+01', '-6.4782e+01']
['18', 'He', '1.0325e-07', '9.9012e-07', '3.4381e-07', '7.1108e+01', '1.1060e+01', '1.5912e+01']
['19', 'Ar', '4.3929e-07', '7.5363e-07', '9.9974e-07', '2.3919e+02', '1.7383e+02', '3.3529e+02']

But as you can see all the numbers are read in as strings. This is due to CSV files not preserving the type information. A quick hack might be the following:

import csv

number_of_rows_to_skip = 12
possible_types = (int, float, str)

with open('md_result.csv', 'r', newline='') as f:
    # skip the first rows
    for _ in range(number_of_rows_to_skip):
        next(f)

    csv_reader = csv.reader(f, delimiter=' ')
    for row in csv_reader:
        for i, entry in enumerate(row):
            for possible_type in possible_types:
                try:
                    entry = possible_type(entry)
                except ValueError:
                    continue
                except:
                    raise
                else:
                    row[i] = entry
                    break
        print(row)

Here we define an order of types to check for, in this example we first check whether the entry can be cast to an integer, then to a float, and then to a string. If a casting operation succeeds, we set the entry of the row to the new value and exit the loop that checks for the types. Now the output is closer to what we would like.

[0, 'He', 5.7222e-07, 4.8811e-09, 2.0415e-07, -29.245, 100.45, 128.28]
[1, 'He', 9.771e-07, 3.6371e-07, 4.7311e-07, -199.26, 232.75, -534.38]
[2, 'Ar', 6.4989e-07, 6.7873e-07, 9.5e-07, -1.5592, -378.76, 84.091]
[3, 'Ar', 5.9024e-08, 3.7138e-07, 7.3455e-08, 342.82, 156.82, -38.991]
[4, 'He', 7.6746e-07, 8.3017e-08, 4.852e-07, -30.45, -379.75, -336.32]
[5, 'Ar', 1.7226e-07, 4.6023e-07, 4.7356e-08, -311.51, -429.39, -694.74]
[6, 'Ar', 9.6394e-07, 7.2845e-07, 8.8623e-07, -82.636, 45.098, -10.626]
[7, 'He', 5.445e-07, 4.6373e-07, 6.227e-07, 158.89, 258.58, -151.5]
[8, 'He', 7.9322e-07, 9.47e-07, 3.5194e-08, -197.03, 156.74, -185.2]
[9, 'Ar', 2.7797e-07, 1.6487e-07, 8.2403e-07, -38.65, -696.32, 216.42]
[10, 'He', 1.1842e-07, 6.3244e-07, 5.0958e-07, -149.63, 422.88, -76.309]
[11, 'Ar', 2.0359e-07, 8.3369e-07, 9.6348e-07, 484.57, -267.41, -352.54]
[12, 'He', 5.1019e-07, 2.247e-07, 2.3846e-08, -231.92, -99.51, 32.77]
[13, 'Ar', 3.5383e-07, 8.4581e-07, 7.234e-07, -303.95, 47.316, 222.53]
[14, 'He', 3.8515e-07, 2.894e-07, 5.6028e-07, 233.08, 254.18, 429.83]
[15, 'He', 1.5842e-07, 9.8225e-07, 5.7859e-07, 199.63, 203.11, -425.6]
[16, 'He', 3.6831e-07, 7.652e-07, 2.9884e-07, 66.341, 222.32, -97.653]
[17, 'He', 2.8696e-07, 1.5129e-07, 6.406e-07, 90.358, -67.459, -64.782]
[18, 'He', 1.0325e-07, 9.9012e-07, 3.4381e-07, 71.108, 11.06, 15.912]
[19, 'Ar', 4.3929e-07, 7.5363e-07, 9.9974e-07, 239.19, 173.83, 335.29]

But programming with this still requires you to know exactly which field number corresponds to which entry. And maybe your format may differ from file to file, so that your hardcoded indices lead to wrong results. It would be better if we could somehow access the fields by names, e.g., row['id'] to get the id of the record. This is where csv.DictReader comes in.

>>> import csv
>>> number_of_rows_to_skip = 11
>>> with open('md_result.csv', 'r', newline='') as f:
...     # skip the first rows
...     for _ in range(number_of_rows_to_skip):
...         next(f)
...
...     csv_reader = csv.DictReader(f, delimiter=' ')
...     for row in csv_reader:
...         print(row)
...
OrderedDict([('atom_id', '0'), ('type', 'He'), ('x-position', '5.7222e-07'), ('y-position', '4.8811e-09'), ('z-position', '2.0415e-07'), ('x-velocity', '-2.9245e+01'), ('y-velocity', '1.0045e+02'), ('z-velocity', '1.2828e+02')])
OrderedDict([('atom_id', '1'), ('type', 'He'), ('x-position', '9.7710e-07'), ('y-position', '3.6371e-07'), ('z-position', '4.7311e-07'), ('x-velocity', '-1.9926e+02'), ('y-velocity', '2.3275e+02'), ('z-velocity', '-5.3438e+02')])
OrderedDict([('atom_id', '2'), ('type', 'Ar'), ('x-position', '6.4989e-07'), ('y-position', '6.7873e-07'), ('z-position', '9.5000e-07'), ('x-velocity', '-1.5592e+00'), ('y-velocity', '-3.7876e+02'), ('z-velocity', '8.4091e+01')])
OrderedDict([('atom_id', '3'), ('type', 'Ar'), ('x-position', '5.9024e-08'), ('y-position', '3.7138e-07'), ('z-position', '7.3455e-08'), ('x-velocity', '3.4282e+02'), ('y-velocity', '1.5682e+02'), ('z-velocity', '-3.8991e+01')])
OrderedDict([('atom_id', '4'), ('type', 'He'), ('x-position', '7.6746e-07'), ('y-position', '8.3017e-08'), ('z-position', '4.8520e-07'), ('x-velocity', '-3.0450e+01'), ('y-velocity', '-3.7975e+02'), ('z-velocity', '-3.3632e+02')])
OrderedDict([('atom_id', '5'), ('type', 'Ar'), ('x-position', '1.7226e-07'), ('y-position', '4.6023e-07'), ('z-position', '4.7356e-08'), ('x-velocity', '-3.1151e+02'), ('y-velocity', '-4.2939e+02'), ('z-velocity', '-6.9474e+02')])
OrderedDict([('atom_id', '6'), ('type', 'Ar'), ('x-position', '9.6394e-07'), ('y-position', '7.2845e-07'), ('z-position', '8.8623e-07'), ('x-velocity', '-8.2636e+01'), ('y-velocity', '4.5098e+01'), ('z-velocity', '-1.0626e+01')])
OrderedDict([('atom_id', '7'), ('type', 'He'), ('x-position', '5.4450e-07'), ('y-position', '4.6373e-07'), ('z-position', '6.2270e-07'), ('x-velocity', '1.5889e+02'), ('y-velocity', '2.5858e+02'), ('z-velocity', '-1.5150e+02')])
OrderedDict([('atom_id', '8'), ('type', 'He'), ('x-position', '7.9322e-07'), ('y-position', '9.4700e-07'), ('z-position', '3.5194e-08'), ('x-velocity', '-1.9703e+02'), ('y-velocity', '1.5674e+02'), ('z-velocity', '-1.8520e+02')])
OrderedDict([('atom_id', '9'), ('type', 'Ar'), ('x-position', '2.7797e-07'), ('y-position', '1.6487e-07'), ('z-position', '8.2403e-07'), ('x-velocity', '-3.8650e+01'), ('y-velocity', '-6.9632e+02'), ('z-velocity', '2.1642e+02')])
OrderedDict([('atom_id', '10'), ('type', 'He'), ('x-position', '1.1842e-07'), ('y-position', '6.3244e-07'), ('z-position', '5.0958e-07'), ('x-velocity', '-1.4963e+02'), ('y-velocity', '4.2288e+02'), ('z-velocity', '-7.6309e+01')])
OrderedDict([('atom_id', '11'), ('type', 'Ar'), ('x-position', '2.0359e-07'), ('y-position', '8.3369e-07'), ('z-position', '9.6348e-07'), ('x-velocity', '4.8457e+02'), ('y-velocity', '-2.6741e+02'), ('z-velocity', '-3.5254e+02')])
OrderedDict([('atom_id', '12'), ('type', 'He'), ('x-position', '5.1019e-07'), ('y-position', '2.2470e-07'), ('z-position', '2.3846e-08'), ('x-velocity', '-2.3192e+02'), ('y-velocity', '-9.9510e+01'), ('z-velocity', '3.2770e+01')])
OrderedDict([('atom_id', '13'), ('type', 'Ar'), ('x-position', '3.5383e-07'), ('y-position', '8.4581e-07'), ('z-position', '7.2340e-07'), ('x-velocity', '-3.0395e+02'), ('y-velocity', '4.7316e+01'), ('z-velocity', '2.2253e+02')])
OrderedDict([('atom_id', '14'), ('type', 'He'), ('x-position', '3.8515e-07'), ('y-position', '2.8940e-07'), ('z-position', '5.6028e-07'), ('x-velocity', '2.3308e+02'), ('y-velocity', '2.5418e+02'), ('z-velocity', '4.2983e+02')])
OrderedDict([('atom_id', '15'), ('type', 'He'), ('x-position', '1.5842e-07'), ('y-position', '9.8225e-07'), ('z-position', '5.7859e-07'), ('x-velocity', '1.9963e+02'), ('y-velocity', '2.0311e+02'), ('z-velocity', '-4.2560e+02')])
OrderedDict([('atom_id', '16'), ('type', 'He'), ('x-position', '3.6831e-07'), ('y-position', '7.6520e-07'), ('z-position', '2.9884e-07'), ('x-velocity', '6.6341e+01'), ('y-velocity', '2.2232e+02'), ('z-velocity', '-9.7653e+01')])
OrderedDict([('atom_id', '17'), ('type', 'He'), ('x-position', '2.8696e-07'), ('y-position', '1.5129e-07'), ('z-position', '6.4060e-07'), ('x-velocity', '9.0358e+01'), ('y-velocity', '-6.7459e+01'), ('z-velocity', '-6.4782e+01')])
OrderedDict([('atom_id', '18'), ('type', 'He'), ('x-position', '1.0325e-07'), ('y-position', '9.9012e-07'), ('z-position', '3.4381e-07'), ('x-velocity', '7.1108e+01'), ('y-velocity', '1.1060e+01'), ('z-velocity', '1.5912e+01')])
OrderedDict([('atom_id', '19'), ('type', 'Ar'), ('x-position', '4.3929e-07'), ('y-position', '7.5363e-07'), ('z-position', '9.9974e-07'), ('x-velocity', '2.3919e+02'), ('y-velocity', '1.7383e+02'), ('z-velocity', '3.3529e+02')])

Note

If you are not using at least Python 3.6 the DictReader returns regular dict instead of its ordered variant, OrderedDict.

Now that the fields are in an OrderedDict, the routine to cast the field entries is slightly different:

>>> number_of_rows_to_skip = 11
>>> with open('md_result.csv', 'r', newline='') as f:
...     # skip the first rows
...     for _ in range(number_of_rows_to_skip):
...         next(f)
...
...     csv_reader = csv.DictReader(f, delimiter=' ')
...     for row in csv_reader:
...         for key, entry in row.items():
...             for possible_type in possible_types:
...                 try:
...                     entry = possible_type(entry)
...                 except ValueError:
...                     continue
...                 except:
...                     raise
...                 else:
...                     row[key] = entry
...                     break
...         print(row)
...
OrderedDict([('atom_id', 0), ('type', 'He'), ('x-position', 5.7222e-07), ('y-position', 4.8811e-09), ('z-position', 2.0415e-07), ('x-velocity', -29.245), ('y-velocity', 100.45), ('z-velocity', 128.28)])
OrderedDict([('atom_id', 1), ('type', 'He'), ('x-position', 9.771e-07), ('y-position', 3.6371e-07), ('z-position', 4.7311e-07), ('x-velocity', -199.26), ('y-velocity', 232.75), ('z-velocity', -534.38)])
OrderedDict([('atom_id', 2), ('type', 'Ar'), ('x-position', 6.4989e-07), ('y-position', 6.7873e-07), ('z-position', 9.5e-07), ('x-velocity', -1.5592), ('y-velocity', -378.76), ('z-velocity', 84.091)])
OrderedDict([('atom_id', 3), ('type', 'Ar'), ('x-position', 5.9024e-08), ('y-position', 3.7138e-07), ('z-position', 7.3455e-08), ('x-velocity', 342.82), ('y-velocity', 156.82), ('z-velocity', -38.991)])
OrderedDict([('atom_id', 4), ('type', 'He'), ('x-position', 7.6746e-07), ('y-position', 8.3017e-08), ('z-position', 4.852e-07), ('x-velocity', -30.45), ('y-velocity', -379.75), ('z-velocity', -336.32)])
OrderedDict([('atom_id', 5), ('type', 'Ar'), ('x-position', 1.7226e-07), ('y-position', 4.6023e-07), ('z-position', 4.7356e-08), ('x-velocity', -311.51), ('y-velocity', -429.39), ('z-velocity', -694.74)])
OrderedDict([('atom_id', 6), ('type', 'Ar'), ('x-position', 9.6394e-07), ('y-position', 7.2845e-07), ('z-position', 8.8623e-07), ('x-velocity', -82.636), ('y-velocity', 45.098), ('z-velocity', -10.626)])
OrderedDict([('atom_id', 7), ('type', 'He'), ('x-position', 5.445e-07), ('y-position', 4.6373e-07), ('z-position', 6.227e-07), ('x-velocity', 158.89), ('y-velocity', 258.58), ('z-velocity', -151.5)])
OrderedDict([('atom_id', 8), ('type', 'He'), ('x-position', 7.9322e-07), ('y-position', 9.47e-07), ('z-position', 3.5194e-08), ('x-velocity', -197.03), ('y-velocity', 156.74), ('z-velocity', -185.2)])
OrderedDict([('atom_id', 9), ('type', 'Ar'), ('x-position', 2.7797e-07), ('y-position', 1.6487e-07), ('z-position', 8.2403e-07), ('x-velocity', -38.65), ('y-velocity', -696.32), ('z-velocity', 216.42)])
OrderedDict([('atom_id', 10), ('type', 'He'), ('x-position', 1.1842e-07), ('y-position', 6.3244e-07), ('z-position', 5.0958e-07), ('x-velocity', -149.63), ('y-velocity', 422.88), ('z-velocity', -76.309)])
OrderedDict([('atom_id', 11), ('type', 'Ar'), ('x-position', 2.0359e-07), ('y-position', 8.3369e-07), ('z-position', 9.6348e-07), ('x-velocity', 484.57), ('y-velocity', -267.41), ('z-velocity', -352.54)])
OrderedDict([('atom_id', 12), ('type', 'He'), ('x-position', 5.1019e-07), ('y-position', 2.247e-07), ('z-position', 2.3846e-08), ('x-velocity', -231.92), ('y-velocity', -99.51), ('z-velocity', 32.77)])
OrderedDict([('atom_id', 13), ('type', 'Ar'), ('x-position', 3.5383e-07), ('y-position', 8.4581e-07), ('z-position', 7.234e-07), ('x-velocity', -303.95), ('y-velocity', 47.316), ('z-velocity', 222.53)])
OrderedDict([('atom_id', 14), ('type', 'He'), ('x-position', 3.8515e-07), ('y-position', 2.894e-07), ('z-position', 5.6028e-07), ('x-velocity', 233.08), ('y-velocity', 254.18), ('z-velocity', 429.83)])
OrderedDict([('atom_id', 15), ('type', 'He'), ('x-position', 1.5842e-07), ('y-position', 9.8225e-07), ('z-position', 5.7859e-07), ('x-velocity', 199.63), ('y-velocity', 203.11), ('z-velocity', -425.6)])
OrderedDict([('atom_id', 16), ('type', 'He'), ('x-position', 3.6831e-07), ('y-position', 7.652e-07), ('z-position', 2.9884e-07), ('x-velocity', 66.341), ('y-velocity', 222.32), ('z-velocity', -97.653)])
OrderedDict([('atom_id', 17), ('type', 'He'), ('x-position', 2.8696e-07), ('y-position', 1.5129e-07), ('z-position', 6.406e-07), ('x-velocity', 90.358), ('y-velocity', -67.459), ('z-velocity', -64.782)])
OrderedDict([('atom_id', 18), ('type', 'He'), ('x-position', 1.0325e-07), ('y-position', 9.9012e-07), ('z-position', 3.4381e-07), ('x-velocity', 71.108), ('y-velocity', 11.06), ('z-velocity', 15.912)])
OrderedDict([('atom_id', 19), ('type', 'Ar'), ('x-position', 4.3929e-07), ('y-position', 7.5363e-07), ('z-position', 9.9974e-07), ('x-velocity', 239.19), ('y-velocity', 173.83), ('z-velocity', 335.29)])

As long as the field names are consistent across files you can write code that needs less maintenance.

Another way of reading CSV files is by using the loadtxt() function of NumPy. By specifying the data type as you would for Structured arrays the type conversion is done for you, while retaining the dictionary like behavior. You can also specify a comment character that should be ignored and the amount of rows to skip:

csv_dtype = [
    ('atom_id', np.int32),
    ('type', np.string_, 2),
    ('position', np.float64, 3),
    ('velocity', np.float64, 3)
]
with open('md_result.csv', 'r') as f:
    md_data = np.loadtxt(f, dtype=csv_dtype, skiprows=12)
print(md_data)
[ ( 0, b'He', [  5.72220000e-07,   4.88110000e-09,   2.04150000e-07], [ -29.245 ,  100.45  ,  128.28  ])
 ( 1, b'He', [  9.77100000e-07,   3.63710000e-07,   4.73110000e-07], [-199.26  ,  232.75  , -534.38  ])
 ( 2, b'Ar', [  6.49890000e-07,   6.78730000e-07,   9.50000000e-07], [  -1.5592, -378.76  ,   84.091 ])
 ( 3, b'Ar', [  5.90240000e-08,   3.71380000e-07,   7.34550000e-08], [ 342.82  ,  156.82  ,  -38.991 ])
 ( 4, b'He', [  7.67460000e-07,   8.30170000e-08,   4.85200000e-07], [ -30.45  , -379.75  , -336.32  ])
 ( 5, b'Ar', [  1.72260000e-07,   4.60230000e-07,   4.73560000e-08], [-311.51  , -429.39  , -694.74  ])
 ( 6, b'Ar', [  9.63940000e-07,   7.28450000e-07,   8.86230000e-07], [ -82.636 ,   45.098 ,  -10.626 ])
 ( 7, b'He', [  5.44500000e-07,   4.63730000e-07,   6.22700000e-07], [ 158.89  ,  258.58  , -151.5   ])
 ( 8, b'He', [  7.93220000e-07,   9.47000000e-07,   3.51940000e-08], [-197.03  ,  156.74  , -185.2   ])
 ( 9, b'Ar', [  2.77970000e-07,   1.64870000e-07,   8.24030000e-07], [ -38.65  , -696.32  ,  216.42  ])
 (10, b'He', [  1.18420000e-07,   6.32440000e-07,   5.09580000e-07], [-149.63  ,  422.88  ,  -76.309 ])
 (11, b'Ar', [  2.03590000e-07,   8.33690000e-07,   9.63480000e-07], [ 484.57  , -267.41  , -352.54  ])
 (12, b'He', [  5.10190000e-07,   2.24700000e-07,   2.38460000e-08], [-231.92  ,  -99.51  ,   32.77  ])
 (13, b'Ar', [  3.53830000e-07,   8.45810000e-07,   7.23400000e-07], [-303.95  ,   47.316 ,  222.53  ])
 (14, b'He', [  3.85150000e-07,   2.89400000e-07,   5.60280000e-07], [ 233.08  ,  254.18  ,  429.83  ])
 (15, b'He', [  1.58420000e-07,   9.82250000e-07,   5.78590000e-07], [ 199.63  ,  203.11  , -425.6   ])
 (16, b'He', [  3.68310000e-07,   7.65200000e-07,   2.98840000e-07], [  66.341 ,  222.32  ,  -97.653 ])
 (17, b'He', [  2.86960000e-07,   1.51290000e-07,   6.40600000e-07], [  90.358 ,  -67.459 ,  -64.782 ])
 (18, b'He', [  1.03250000e-07,   9.90120000e-07,   3.43810000e-07], [  71.108 ,   11.06  ,   15.912 ])
 (19, b'Ar', [  4.39290000e-07,   7.53630000e-07,   9.99740000e-07], [ 239.19  ,  173.83  ,  335.29  ])]

So this makes it really convenient to work with, e.g., the velocity may easily be computed like this:

print(np.linalg.norm(md_data['velocity'], axis=1))

With the output

[ 165.53317168  615.98627785  387.98565049  378.99652094  508.18147093
  874.11550713   94.73885146  339.20775124  312.54965765  730.20064455
  455.01614782  656.20167982  254.48575481  379.66301803  551.07856763
  512.09488281  251.72091508  130.04611632   73.70117372  447.04374406]

JavaScript Object Notation

Filename extension:
 .json

The JavaScript Object Notation (JSON) format was derived from JavaScript. It is, however, a language-independent format with many programming languages offering support for simple generation and parsing of JSON files, among them Python.

>>> with open('md_result.json', 'r') as f:
...     print(f.read())
...
{
  "potentials": {
    "Ar": {
      "type": "Lennard-Jones",
      "parameters": {
        "sigma": {
          "unit": "pm",
          "magnitude": 341
        },
        "epsilon/k_B": {
          "unit": "K",
          "magnitude": 120
        }
      }
    },
    "He": {
      "type": "Lennard-Jones",
      "parameters": {
        "sigma": {
          "unit": "pm",
          "magnitude": "256"
        },
        "epsilon/k_B": {
          "unit": "K",
          "magnitude": "10.22"
        }
      }
    }
  },
  "atoms": [
    {
      "type": "He",
      "velocity": [
        -29.245,
        100.45,
        128.28
      ],
      "atom_id": 0,
      "position": [
        5.7222e-07,
        4.8811e-09,
        2.0415e-07
      ]
    },
    {
      "type": "He",
      "velocity": [
        -199.26,
        232.75,
        -534.38
      ],
      "atom_id": 1,
      "position": [
        9.771e-07,
        3.6371e-07,
        4.7311e-07
      ]
    },
    {
      "type": "Ar",
      "velocity": [
        -1.5592,
        -378.76,
        84.091
      ],
      "atom_id": 2,
      "position": [
        6.4989e-07,
        6.7873e-07,
        9.5e-07
      ]
    },
    {
      "type": "Ar",
      "velocity": [
        342.82,
        156.82,
        -38.991
      ],
      "atom_id": 3,
      "position": [
        5.9024e-08,
        3.7138e-07,
        7.3455e-08
      ]
    },
    {
      "type": "He",
      "velocity": [
        -30.45,
        -379.75,
        -336.32
      ],
      "atom_id": 4,
      "position": [
        7.6746e-07,
        8.3017e-08,
        4.852e-07
      ]
    },
    {
      "type": "Ar",
      "velocity": [
        -311.51,
        -429.39,
        -694.74
      ],
      "atom_id": 5,
      "position": [
        1.7226e-07,
        4.6023e-07,
        4.7356e-08
      ]
    },
    {
      "type": "Ar",
      "velocity": [
        -82.636,
        45.098,
        -10.626
      ],
      "atom_id": 6,
      "position": [
        9.6394e-07,
        7.2845e-07,
        8.8623e-07
      ]
    },
    {
      "type": "He",
      "velocity": [
        158.89,
        258.58,
        -151.5
      ],
      "atom_id": 7,
      "position": [
        5.445e-07,
        4.6373e-07,
        6.227e-07
      ]
    },
    {
      "type": "He",
      "velocity": [
        -197.03,
        156.74,
        -185.2
      ],
      "atom_id": 8,
      "position": [
        7.9322e-07,
        9.47e-07,
        3.5194e-08
      ]
    },
    {
      "type": "Ar",
      "velocity": [
        -38.65,
        -696.32,
        216.42
      ],
      "atom_id": 9,
      "position": [
        2.7797e-07,
        1.6487e-07,
        8.2403e-07
      ]
    },
    {
      "type": "He",
      "velocity": [
        -149.63,
        422.88,
        -76.309
      ],
      "atom_id": 10,
      "position": [
        1.1842e-07,
        6.3244e-07,
        5.0958e-07
      ]
    },
    {
      "type": "Ar",
      "velocity": [
        484.57,
        -267.41,
        -352.54
      ],
      "atom_id": 11,
      "position": [
        2.0359e-07,
        8.3369e-07,
        9.6348e-07
      ]
    },
    {
      "type": "He",
      "velocity": [
        -231.92,
        -99.51,
        32.77
      ],
      "atom_id": 12,
      "position": [
        5.1019e-07,
        2.247e-07,
        2.3846e-08
      ]
    },
    {
      "type": "Ar",
      "velocity": [
        -303.95,
        47.316,
        222.53
      ],
      "atom_id": 13,
      "position": [
        3.5383e-07,
        8.4581e-07,
        7.234e-07
      ]
    },
    {
      "type": "He",
      "velocity": [
        233.08,
        254.18,
        429.83
      ],
      "atom_id": 14,
      "position": [
        3.8515e-07,
        2.894e-07,
        5.6028e-07
      ]
    },
    {
      "type": "He",
      "velocity": [
        199.63,
        203.11,
        -425.6
      ],
      "atom_id": 15,
      "position": [
        1.5842e-07,
        9.8225e-07,
        5.7859e-07
      ]
    },
    {
      "type": "He",
      "velocity": [
        66.341,
        222.32,
        -97.653
      ],
      "atom_id": 16,
      "position": [
        3.6831e-07,
        7.652e-07,
        2.9884e-07
      ]
    },
    {
      "type": "He",
      "velocity": [
        90.358,
        -67.459,
        -64.782
      ],
      "atom_id": 17,
      "position": [
        2.8696e-07,
        1.5129e-07,
        6.406e-07
      ]
    },
    {
      "type": "He",
      "velocity": [
        71.108,
        11.06,
        15.912
      ],
      "atom_id": 18,
      "position": [
        1.0325e-07,
        9.9012e-07,
        3.4381e-07
      ]
    },
    {
      "type": "Ar",
      "velocity": [
        239.19,
        173.83,
        335.29
      ],
      "atom_id": 19,
      "position": [
        4.3929e-07,
        7.5363e-07,
        9.9974e-07
      ]
    }
  ],
  "step": 0,
  "boundary_conditions": [
    "periodic",
    "periodic",
    "periodic"
  ],
  "system_size": [
    {
      "unit": "µm",
      "magnitude": 100
    },
    {
      "unit": "µm",
      "magnitude": 200
    },
    {
      "unit": "µm",
      "magnitude": 300
    }
  ],
  "time": {
    "unit": "s",
    "magnitude": 0
  },
  "atom_types": {
    "Ar": "Argon",
    "He": "Helium"
  }
}

Which can easily be read with the json module of the Python standard library:

>>> import json
>>> with open('md_result.json', 'r') as f:
...     md_data = json.load(f)
...
>>> print(md_data)
{'potentials': {'He': {'parameters': {'sigma': {'magnitude': '256', 'unit': 'pm'}, 'epsilon/k_B': {'magnitude': '10.22', 'unit': 'K'}}, 'type': 'Lennard-Jones'}, 'Ar': {'parameters': {'sigma': {'magnitude': 341, 'unit': 'pm'}, 'epsilon/k_B': {'magnitude': 120, 'unit': 'K'}}, 'type': 'Lennard-Jones'}}, 'time': {'magnitude': 0, 'unit': 's'}, 'boundary_conditions': ['periodic', 'periodic', 'periodic'], 'atom_types': {'He': 'Helium', 'Ar': 'Argon'}, 'system_size': [{'magnitude': 100, 'unit': 'µm'}, {'magnitude': 200, 'unit': 'µm'}, {'magnitude': 300, 'unit': 'µm'}], 'atoms': [{'position': [5.7222e-07, 4.8811e-09, 2.0415e-07], 'atom_id': 0, 'type': 'He', 'velocity': [-29.245, 100.45, 128.28]}, {'position': [9.771e-07, 3.6371e-07, 4.7311e-07], 'atom_id': 1, 'type': 'He', 'velocity': [-199.26, 232.75, -534.38]}, {'position': [6.4989e-07, 6.7873e-07, 9.5e-07], 'atom_id': 2, 'type': 'Ar', 'velocity': [-1.5592, -378.76, 84.091]}, {'position': [5.9024e-08, 3.7138e-07, 7.3455e-08], 'atom_id': 3, 'type': 'Ar', 'velocity': [342.82, 156.82, -38.991]}, {'position': [7.6746e-07, 8.3017e-08, 4.852e-07], 'atom_id': 4, 'type': 'He', 'velocity': [-30.45, -379.75, -336.32]}, {'position': [1.7226e-07, 4.6023e-07, 4.7356e-08], 'atom_id': 5, 'type': 'Ar', 'velocity': [-311.51, -429.39, -694.74]}, {'position': [9.6394e-07, 7.2845e-07, 8.8623e-07], 'atom_id': 6, 'type': 'Ar', 'velocity': [-82.636, 45.098, -10.626]}, {'position': [5.445e-07, 4.6373e-07, 6.227e-07], 'atom_id': 7, 'type': 'He', 'velocity': [158.89, 258.58, -151.5]}, {'position': [7.9322e-07, 9.47e-07, 3.5194e-08], 'atom_id': 8, 'type': 'He', 'velocity': [-197.03, 156.74, -185.2]}, {'position': [2.7797e-07, 1.6487e-07, 8.2403e-07], 'atom_id': 9, 'type': 'Ar', 'velocity': [-38.65, -696.32, 216.42]}, {'position': [1.1842e-07, 6.3244e-07, 5.0958e-07], 'atom_id': 10, 'type': 'He', 'velocity': [-149.63, 422.88, -76.309]}, {'position': [2.0359e-07, 8.3369e-07, 9.6348e-07], 'atom_id': 11, 'type': 'Ar', 'velocity': [484.57, -267.41, -352.54]}, {'position': [5.1019e-07, 2.247e-07, 2.3846e-08], 'atom_id': 12, 'type': 'He', 'velocity': [-231.92, -99.51, 32.77]}, {'position': [3.5383e-07, 8.4581e-07, 7.234e-07], 'atom_id': 13, 'type': 'Ar', 'velocity': [-303.95, 47.316, 222.53]}, {'position': [3.8515e-07, 2.894e-07, 5.6028e-07], 'atom_id': 14, 'type': 'He', 'velocity': [233.08, 254.18, 429.83]}, {'position': [1.5842e-07, 9.8225e-07, 5.7859e-07], 'atom_id': 15, 'type': 'He', 'velocity': [199.63, 203.11, -425.6]}, {'position': [3.6831e-07, 7.652e-07, 2.9884e-07], 'atom_id': 16, 'type': 'He', 'velocity': [66.341, 222.32, -97.653]}, {'position': [2.8696e-07, 1.5129e-07, 6.406e-07], 'atom_id': 17, 'type': 'He', 'velocity': [90.358, -67.459, -64.782]}, {'position': [1.0325e-07, 9.9012e-07, 3.4381e-07], 'atom_id': 18, 'type': 'He', 'velocity': [71.108, 11.06, 15.912]}, {'position': [4.3929e-07, 7.5363e-07, 9.9974e-07], 'atom_id': 19, 'type': 'Ar', 'velocity': [239.19, 173.83, 335.29]}], 'step': 0}

So parsing a JSON file results in a composition of dictionaries, lists, strings, integers, and floats.

Writing JSON files is equally simple:

>>> with open('my_md_result.json', 'w') as f:
...     json.dump(md_data, f, indent=2)
...

with the indent keyword resulting in the file not written in a very condensed manner, but with newlines and an indentation of 2. Note that the following Python types can be converted out-of-the-box:

Python JSON
dict object
list, tuple array
str string
int, float, int- & float-derived Enums number
True true
False false
None null

Extensible Markup Language

Filename extension:
 .xml

The Extensible Markup Language (XML) format was developed by the World Wide Web Consortium and is supported by a lot of programming languages. Python offers support for XML via the xml package.

The short introduction of the xml.etree.ElementTree over at Python is well written and recommended as a starting point.

Hierarchical Data Format 5

filename extension:
 .h5, .hdf5

The file structure of HDF5 includes two major types of objects:

  • Datasets
    Multidimensional arrays of a homogeneous type.
  • Groups
    Container structures which can hold datasets or other groups.

In that way we end up with a data format that somewhat resembles a filesystem.

Python support for HDF5 is due to the h5py package, which can be installed via

pip install h5py

and be used after importing it via

import h5py

File objects

To open an HDF5 file the recommended way is to use

with h5py.File('my_file.h5', 'r') as f:
    # interact with the file object

where the first argument is the path to the HDF5 file, and the second one is the mode in which to open the file. Available modes are

r:readonly, file must exist
r+:read/write, file must exist
w:create file, truncate if exists
w- or x:create file, fail if exists
a:read/write if exists, create otherwise

with the default filemode being a.

Danger

Similarly to open() there is another way of opening an HDF5 file:

f = h5py.File('my_file.h5', 'r')
# interact with the file object
f.close()

This has the disadvantage that you have to take care of closing the file yourself. As the binary nature of HDF5 files essentially means that one corrupted bit may lead to loss of all the data in the file. We therefore strongly recommend the previously mentioned way with the context manager.

Every file object is also an HDF5 group object representing the root group of the file.

Groups

The hierarchical organization of the HDF5 file format is achieved by groups. Similarly to directories on your regular filesystem they help you with the organization of your data within an HDF5 file. As mentioned before the file object returned by the File initialization. Creating a new group is done by calling the create_group() method of an HDF5 Group object:

with h5py.File('my_file.h5', 'w') as f:
    print('Name of group:', f.name)
    nested_group = f.create_group('nested group')
    print('Name of nested group:', nested_group.name)
    recursively_created_group = f.create_group('/this/is/deep')
    print(
        'Name of recursively created group:',
        recursively_created_group.name)

Which results in the following output:

Name of group: /
Name of nested group: /nested group
Name of recursively created group: /this/is/deep

Working with files may also be compared to working with dict objects, as they offer the indexing syntax to traverse groups, support iteration and the keys(), values() and items() syntax:

with h5py.File('my_file.h5', 'w') as f:
    f.create_group('/favorite/group/one')
    f.create_group('/favorite/group/two')
    f.create_group('/favorite/group/three')
    group = f['/favorite/group']
    for subgroup_name in sorted(group):
        print(subgroup_name)
    subgroup_one = group['one']
    print(subgroup_one.name)
    for subgroup_name, subgroup in sorted(group.items()):
        print(subgroup_name, subgroup.name)
one
three
two
/favorite/group/one
one /favorite/group/one
three /favorite/group/three
two /favorite/group/two

Datasets

Creating a dataset is done via calling the create_dataset method on an HDF5 group. Although we can create empty datasets in HDF5 and fill them with data later on, we much more commonly just want to store data that has been worked on in a NumPy array. This can be done like this:

my_array = np.array(5*5*3*3*3).reshape(5, 5, 3, 3, 3)

with h5py.File('my_file.h5', 'w') as f:
    group = f.create_group('my_group')
    dset = group.create_dataset('my_dataset', data=my_array)

with h5py.File('my_file.h5', 'r') as f:
    retrieved_array = np.array(f['my_group/my_dataset'])

print(np.allclose(my_array, retrieved_array))

Which results in the ouptut True. Handling the datatypes is done automatically both for dumping and loading the data—both for regular and structured arrays.

As the data we store tends to get quite large we can leverage the compression options HDF5 offers. H5Py makes keeps this simple and enables compression when a compression keyword is supplied, followed by a number in the range of 1 to 9 indicating the cpu-time/compression trade-off, from “least compression” to “densest compression.”

my_array = np.array(5*5*3*3*3).reshape(5, 5, 3, 3, 3)

with h5py.File('my_file.h5', 'w') as f:
    group = f.create_group('my_group')
    dset = group.create_dataset('my_dataset', data=my_array, compression=9)

Attributes

Each group and dataset may have attributes. Attributes are metadata that can be assigned in a dictionary-like fashion using the attributes attribute of the group or dataset object.

md_dtype = [
    ('atom_id', np.int32),
    ('type', np.string_, 2),
    ('position', np.float64, 3),
    ('velocity', np.float64, 3)
]
md_data = np.array(
    [
        (0, 'He', (5.7222e-07, 4.8811e-09, 2.0415e-07), (-29.245, 100.45, 128.28)),
        (1, 'He', (9.7710e-07, 3.6371e-07, 4.7311e-07), (-199.26, 232.75, -534.38)),
        (2, 'Ar', (6.4989e-07, 6.7873e-07, 9.5000e-07), (-1.5592, -378.76, 84.091)),
        (3, 'Ar', (5.9024e-08, 3.7138e-07, 7.3455e-08), (342.82, 156.82, -38.991)),
        (4, 'He', (7.6746e-07, 8.3017e-08, 4.8520e-07), (-30.45, -379.75, -336.32)),
        (5, 'Ar', (1.7226e-07, 4.6023e-07, 4.7356e-08), (-311.51, -429.39, -694.74)),
        (6, 'Ar', (9.6394e-07, 7.2845e-07, 8.8623e-07), (-82.636, 45.098, -10.626)),
        (7, 'He', (5.4450e-07, 4.6373e-07, 6.2270e-07), (158.89, 258.58, -151.5)),
        (8, 'He', (7.9322e-07, 9.4700e-07, 3.5194e-08), (-197.03, 156.74, -185.2)),
        (9, 'Ar', (2.7797e-07, 1.6487e-07, 8.2403e-07), (-38.65, -696.32, 216.42)),
        (10, 'He', (1.1842e-07, 6.3244e-07, 5.0958e-07), (-149.63, 422.88, -76.309)),
        (11, 'Ar', (2.0359e-07, 8.3369e-07, 9.6348e-07), (484.57, -267.41, -352.54)),
        (12, 'He', (5.1019e-07, 2.2470e-07, 2.3846e-08), (-231.92, -99.51, 32.77)),
        (13, 'Ar', (3.5383e-07, 8.4581e-07, 7.2340e-07), (-303.95, 47.316, 222.53)),
        (14, 'He', (3.8515e-07, 2.8940e-07, 5.6028e-07), (233.08, 254.18, 429.83)),
        (15, 'He', (1.5842e-07, 9.8225e-07, 5.7859e-07), (199.63, 203.11, -425.6)),
        (16, 'He', (3.6831e-07, 7.6520e-07, 2.9884e-07), (66.341, 222.32, -97.653)),
        (17, 'He', (2.8696e-07, 1.5129e-07, 6.4060e-07), (90.358, -67.459, -64.782)),
        (18, 'He', (1.0325e-07, 9.9012e-07, 3.4381e-07), (71.108, 11.06, 15.912)),
        (19, 'Ar', (4.3929e-07, 7.5363e-07, 9.9974e-07), (239.19, 173.83, 335.29))
    ],
    dtype=md_dtype)
with h5py.File('md_results.h5') as f:
    # Generic information regarding the file/the simulation
    f.attrs['units'] = 'All quantities are in SI units.'
    f.attrs['atom-types'] = json.dumps(
        {
            'He': 'Helium',
            'Ar': 'Argon'
        })
    f.attrs['He potential'] = json.dumps(
        {
            'type': 'Lennard-Jones',
            'parameters': {
                'epsilon/k_B': 10.22,
                'sigma': 256e-12
            }
        })
    f.attrs['Ar potential'] = json.dumps(
        {
            'type': 'Lennard-Jones',
            'parameters': {
                'epsilon/k_B': 120,
                'sigma': 341e-12
            }
        })
    f.attrs['system size'] = np.array([100e-6, 200e-6, 300e-6])
    f.attrs['boundary conditions'] = json.dumps(
        ['periodic', 'periodic', 'periodic'])

    # Information specific to this dataset
    dset = f.create_dataset('0000', data=md_data, compression=9)
    dset.attrs['step'] = 0
    dset.attrs['time'] = 0

HDFView

The HDF Group offers a tool for brwosing and editing HDF5 files: HDFView. Often it can be installed via the package manager of our operating system, and subsequently be executed via