Aalto Scientific Computing

This site contains documentation about scientific and data-intensive computing at Aalto University and beyond. It is targeted towards Aalto researchers, but has some useful information for everyone.

Welcome, researchers!

Welcome to Aalto, researchers! Aalto has excellent resources for you, but it can be quite hard to learn about them all. These pages provide an overview of IT services for researchers (focused on computation and data-intensive work, including experimental work).

See also

  • These aren’t generic IT instructions - ITS has an introduction for staff somewhere (but apparently not online).

  • IT Services for Research is the comprehensive list of researcher-oriented IT services available (this page, by comparison, is a starting tutorial).

  • What file storage to use? - a good general summary, not focused on scientific computing.

Aalto services

Understanding all the Aalto services can be quite confusing. Here are some of the key players:

  • Department IT: Only a few departments (mainly in SCI) have their own IT staff. Others have people such as laboratory managers who may be able to provide some useful advice. Known links: CS, NBE, PHYS, Math.

  • Science-IT: Overlaps with SCI department IT groups. They run the Triton cluster and support scientific computing. Their services may be used throughout the entire university, but support is organized from the departments which fund them. The core Science-IT departments are CS, NBE, and PHYS. Science-IT runs a daily SciComp garage, where we provide hands-on support for anything related to scientific computing. This site (scicomp.aalto.fi) is the main home; read more about us on the about page.

    • Aalto Research Software Engineers provide specialized services in computation, data, and software. If you ever think “I can’t do X because we don’t have the skills” or “I wish we could be more efficient”, realize you aren’t alone and open a request with us. Our projects last days to months, longer than typical support staff’s projects.

  • Aalto IT Services (ITS): Provides central IT infrastructure. They have an “IT Services for Research” group, but it is less specialized than Science-IT. ITS is the first place to contact for non-specialized services, or for people outside SCI. Their infrastructure is used in all schools including SCI, and is the base on which everyone builds. Their instructions are on aalto.fi, most importantly the already-mentioned IT Services for Research page. Contact via servicedesk.

  • Aalto Research Services: Administrative-type support. Provides support for grantwriting, innovation and commercialization, sponsored projects, legal services for research, and research infrastructures. (In 2019 a separate “innovation services” split from the previous “research and innovation services”).

  • CSC is the Finnish academic computing center (and more). They provide a lot of basic infrastructure you use without knowing it, as well as computing and data services to researchers (all for free). research.csc.fi

The major sources of information are everywhere:

  • aalto.fi is the normal homepage, but the joke is that it’s hard to find anything and hard to use. This site is “not designed to have a logical structure and instead, you are expected to search for information” (actual quote). Some pages show more information if you log in, with no indication of which ones. In general, unless you know what you are looking for, don’t expect to find anything here without extensive work.

  • wiki.aalto.fi is the Aalto wiki space. Anyone can make a space here, and many departments’ internal sites are here. Searching can randomly find useful information, but it is not a primary information source anymore. Most sites aren’t publicly searchable.

  • scicomp.aalto.fi is where you are now. It has a lot of information related to scientific computing and data. We try not to duplicate what is on aalto.fi, but sometimes we elaborate or make things more findable. This might be the best place to find information on specialized research and scientific computing - as opposed to the general “staff computing” you find elsewhere.

Computers, devices, and end-user systems

Aalto provides computers to its employees. Whether it is an Aalto-wide managed system or a standalone one depends on your department’s policies. If it’s standalone, you are on your own. If managed, login is through your Aalto account. You can get a laptop or desktop, running Linux, Mac, or Windows.

Desktops are connected directly to the wired networks and are typically preferred by researchers doing serious data or computation work. Linux desktops have fast and automatic access to all of the university data storage systems, including Triton and department storage. They also have a wide variety of scientific software already available (somewhat similar to Triton). We have some limited instructions and pointers to the main instructions for Mac and Windows computers.
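
For example, on an Aalto Linux desktop the shared filesystems typically appear under /m (a minimal sketch; the paths below are only examples and depend on your department, see the data storage pages):

# department and Triton storage are mounted under /m on Aalto Linux
# (example paths; substitute your own department and project)
ls /m/cs/scratch/
ls /m/nbe/work/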

Managed laptops are usable in and out of the Aalto networks.

On both managed desktops and laptops you can become the “primary user”, which allows you to install software found in the official repositories. Additionally, in some cases a Workstation Administrator (wa) account can be granted, which is close to a normal root/Administrator account with some limitations. Primary-user status is widely granted and recommended by Aalto ITS for all users, while wa accounts are regulated by department policies or Aalto ITS.

Computing

With a valid Aalto account, you have two primary options: workstations and Triton. The Aalto workstations have basic scientific software installed.

Most demanding computing at Aalto is performed on Triton, the Aalto high-performance computing cluster. It is a fairly standard medium-sized cluster, and its main advantage is close integration into the Aalto environment: it shares Aalto accounts, its data storage (5PB) is also available on workstations, and it has local support. If you need dedicated resources, you can purchase them and have them managed by the Science-IT team as part of Triton, so that you get dedicated resources and can easily scale to the full power of Triton. Triton is part of the Finnish Grid and Cloud Infrastructure, and is the largest publicly known computing cluster in Finland after the CSC clusters. Triton provides a web-based interface via JupyterHub and Open OnDemand. To get started with Triton, request access, check the tutorials sequence (or the quickstart guide if you know the basics), and you’ll learn all you need.
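
As a sketch of the very first step once your account is active (the hostname is the commonly used one; the Triton tutorials are the authoritative reference):

# connect from an Aalto network or the Aalto VPN
ssh your_aalto_username@triton.aalto.fi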

CSC (the Finnish IT Center for Science) is a government-owned organization which provides a lot of services, most notably huge HPC clusters, data, and IT infrastructure services to the academic sector. All of their services are free to the academic community (paid directly by the state of Finland). They also coordinate the Finnish Grid and Cloud Infrastructure. They have the largest known clusters in Finland.

Data

Data management isn’t just storage: if data is just put somewhere, you end up with a massive mess and the data won’t be usable even 5 years later. Funders now require “data management plans”, so data management is not just a hot topic, it’s an important one. We have a whole section on data (not maintained much anymore), and there are also higher-level guides from Aalto. If you just want to get something done, start with our Aalto-specific guideline for Science-IT data storage (used in CS, NBE, PHYS) - if you follow our plan, you will be doing better than most people. If you have specific questions, there is an official service email address you can use (see the Aalto pages), or you can ask the Science-IT team.

Aalto has many data storage options, most free. In general, you should put your data in some centralized location shared with your group: if you keep it only on your own systems, the data dies when you leave. We manage data by projects: a group of people with shared access and a leader. Groups provide flexibility, sharing, and long-term management (so that you don’t lose or forget about data every time someone leaves). You should request as many projects as you need depending on how fine-grained you need access control, and each can have its own members and quota. You can read a general guide from Aalto (going beyond scientific computing) about the storage locations available and storage service policy.

Triton has 5PB of non-backed-up data storage on the high-performance Lustre filesystem. This is used for large, active computation purposes. The Triton nodes have very high bandwidth to it, and it is fast and parallel. It is mounted by default at the Science-IT departments, and can be mounted by default in other departments too.

Aalto provides the “work” and “teamwork” centralized filesystems which are large, backed up, snapshotted, and shared: everything you may want. Within the Science-IT departments, Science-IT and department IT manage them and provide access. For other schools/departments, both are provided by Aalto ITS, but you will have to figure out your school’s policies yourself. It’s possible to hook this storage into whatever else you need over the network. (In general, “work” is organized by the Aalto hierarchy, while “teamwork” is flatter. If you consider yourself mainly Aalto staff who fits in the hierarchy, work is probably better. If you consider yourself a researcher who collaborates with whoever is relevant, teamwork is better.) See the Teamwork instructions.

CSC provides both high-performance Lustre filesystems (like Triton) and archive systems. CSC research portal.

In our data management section, we provide many more links to long-term data repositories, archival, and so on. The fairdata.fi project is state-supported and has a lot more information on data. They also provide some data storage focused on safety and longer-term storage (like IDA), though these are not heavily used at Aalto because we provide such good services locally.

Aalto provides, with Aalto accounts, Google Drive (unlimited, also Team Drives), Dropbox (unlimited), and Microsoft OneDrive (5TB). Be aware that once you leave Aalto, this data will disappear!

Software

Triton and Aalto Linux workstations come with a lot of scientific software installed, available via the Lmod module system. Triton generally has more. If you need something, it can be worth asking us first to install it for everyone.

If you are the primary user of a workstation, you can install Ubuntu packages yourself (and if you aren’t, you should ask to be marked as primary user). If you use Triton or are in a Science-IT department, it can be worth asking Science-IT about software you need - we are experts in this and working to simplify the mess that scientific software is. Windows workstations can have things automatically installed, check the windows page.

Triton and Aalto workstations have the central software stack available; on laptops you are currently on your own, except for some standard software.

On Triton and Linux workstations, type module spider $name to search for available software. We are working to unify the software stack on Triton and Aalto workstations so that they provide the same software.
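
For example (the package name here is only an illustration):

# search for a package in the module system
module spider anaconda
# load it into the current shell
module load anaconda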

ITS has a software and licenses (FI) page, and also a full list of licenses (broken link, missing on new page). There is also https://download.aalto.fi/.

CSC also has a lot of software. Some is on CSC computers, some is exported to Triton.

Starting a project

Each time you start a project, it’s worth putting a few minutes into planning so that you create a good base (and don’t end up with chaos in a few years). We don’t mean some grant, we mean a line of work with a common theme, data, etc.

  • Think about how you’ll manage data. It’s always easy to just start working, but it can be worth getting all project members on the same page about where data will be stored and what you want to happen to it in the end. Even a very short written plan will help a lot to get newcomers started. The “practical DMP” section here can help a lot - try filling out that A4 page to consider the big sections.

  • Request a data group (see above) if you don’t already have a shared storage location. This will keep all of your data together, in the same place. As people join, you can easily give them access. When people leave, their work isn’t lost.

    • If you already have a data group that is suitable (similar members), you can use that. But there’s no limit to the number of projects, so think about if it’s better to keep things apart earlier.

    • Mail your department IT support and request a group. Give the info requested at the bottom of the data outline page.

    • In the same message, request the different data storage locations, e.g. scratch, project, archive. Quotas can always be increased later.

  • If you need specialized support in computing, data, or software, request a consultation with Aalto Research Software Engineers.

Training

Of course you want to get straight to research. However, we come from a wide range of backgrounds, and we’ve noticed that missing basic skills (the computer as a tool) can be a research bottleneck. We have constructed a multi-level training plan, Hands-on Scientific Computing, so that you can find the right courses for your needs. We have extensive internal training about practical matters not covered in academic courses. These courses are selected by researchers for researchers, so we make sure that everything is relevant to you.

Check our upcoming training page for a list of upcoming courses. If you do anything computational or code-based at all, you should consider the twice-yearly CodeRefinery workshops (announced on our page). If you have a Triton account or do high-performance computing or intensive computing or data-related tasks, you should come to the Summer (3 days) or Winter (1 day) kickstart, which teaches you the basics of Triton and HPC usage (we say it is “required” if you have a Triton account).

Other notes

Remember to keep the IT Services for Research and What file storage to use? pages close at hand.

Research is usually collaborative, but sometimes you can feel isolated - either because you are lost in a crowd, or far away from your colleagues. Academic courses don’t teach you everything you need to be good at scientific computing - put some effort into working together with, learning from, and teaching your colleagues and you will get much further.

There are some good cheatsheets which our team maintains. They are somewhat specialized, but useful in the right places.

It can be hard to find your way around Aalto, and the official campus maps and directions are known for being confusing. Try UsefulAaltoMap instead.

Welcome, students!

See also

Primary information is at Aalto’s IT Services for Students page, which focuses on basic services. This page focuses on students in computing and data-intensive programs.

Welcome to Aalto! We are glad you are interested in scientific computing and data. scicomp.aalto.fi may be useful to you, but it is somewhat targeted at research usage. However, it can still serve as a good introduction to resources for scientific and data-intensive computing at Aalto if you are a student. This page is devoted to resources which are available to students.

If you are involved in a research group or doing research for a professor/group leader, you are a researcher! You should acquaint yourself with all the information on this site, starting with Welcome, researchers!, and use whatever you need.

General IT instructions can be found at https://www.aalto.fi/en/it-help. There used to be some on into.aalto.fi, but these are gone now. There also used to be a 2-page PDF introduction for students, but it also seems to be gone from online. IT Services for Students is now the best introduction.

Accounts

In general, your Aalto account is identical to the one researchers have — the only difference is that you don’t have a departmental affiliation.

Getting help

As a student, the ITS servicedesks are the first place to go for help. The site https://www.aalto.fi/en/it-help is the new central site for IT instructions.

This site, https://scicomp.aalto.fi, is intended for research scientific computing support, but has a few pages useful to you.

Computation

As a student, you have access to various light computational resources which are suitable for most courses that need extra power:

  • Paniikki computer lab: Linux workstations, GPUs, software via modules

  • Other computer labs: workstations, different OSs

  • Shell servers: via ssh, software via modules, overcrowded. Brute and Force are for computation, others not.

  • JupyterHub: basic software, in web browser

  • Remote desktop: Windows and Linux

  • Own computers: software at https://download.aalto.fi

The Jupyter service at https://jupyter.cs.aalto.fi is available to everyone with an Aalto account. It provides at least basic Python and R software; we try to keep it up to date with the things people need most for courses that use programming or data.

The shell servers brute and force are for light computing and are generally available to students. You may find them useful, but they can often be overloaded. See Light computing shell servers, and learn how to launch a Jupyter notebook there.
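
A minimal sketch of starting a notebook there (the hostname, port, and exact commands are assumptions; see the linked instructions for the supported workflow):

# connect to a shell server with your Aalto account
ssh yourusername@brute.aalto.fi
# load a Python environment and start a notebook without a browser
module load anaconda
jupyter notebook --no-browser --port=8899
# on your own machine, forward the port and open http://localhost:8899
ssh -L 8899:localhost:8899 yourusername@brute.aalto.fi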

For GPU computing, the Paniikki Linux computer lab (map) has GPUs in all workstations. Software is available via module spider $name to search and module load $name to load (and the module anaconda has Python, tensorflow, etc.). Read the Paniikki cheatsheet here. The instructions for Aalto workstations sort of apply there as well. The software on these machines is managed by the Aalto-IT team. This is the place if you need to play with GPUs, deep learning, etc, and helps you transition to serious computing on large clusters.
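
For example, on a Paniikki machine (module and package names are illustrative; the exact TensorFlow API depends on the installed version):

# search for and load the scientific Python stack
module spider anaconda
module load anaconda
# quick check that TensorFlow sees a GPU
python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"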

A new (2018) remote desktop service is available at https://vdi.aalto.fi (instructions). This provides Windows and Linux desktops and is designed to replace the need for computer classrooms with special software installed. You can access it via a web browser or the VMware Horizon client. More VDI Windows workstations are also available at http://mfavdi.aalto.fi/ .

Triton is for research purposes, and students can’t get access unless they are affiliated with a research project or (in very rare cases) a course makes special arrangements.

Data storage

Aalto home directories have a 100GB quota, and this is suitable for small use. Note that files here are lost once you leave Aalto, so make sure you back up.

The What file storage to use? page describes basic services which may be useful for data storage. Of the cloud services, note that everyone at Aalto can get an unlimited Google Drive account through the Aalto Google Apps service: instructions. Your Aalto Google account will expire once you are no longer affiliated, so your files there will become inaccessible.

Software

ITS has a software and licenses (FI) page, and also a full list of licenses. There is also http://download.aalto.fi/. Various scientific software can be found for your own use via the Aalto software portals.

The Lmod (module) system provides more software on brute/force and in Paniikki. For example, to access a bunch of scientific Python software, you can do module load anaconda. The researcher-focused instructions are here, but like many things on this site you may have to adapt to the student systems.

Common software:

  • Python: module load anaconda on Linux

  • TensorFlow etc. packages: same as Python, in Paniikki

Other notes

It can be hard to find your way around Aalto, and the official campus maps and directions are known for being confusing. Try UsefulAaltoMap instead.

Do you have suggestions for this page? Please open an issue on GitHub (make sure you have a good title that mentions the audience is students, so we can put the information in the right place). Better yet, send us a pull request yourself.

News

22/03/2024 We have a new exciting course coming! Best practices in HPC! More info below

Upcoming Courses

April-May/2024 Tuesdays Tools and Techniques for High Performance Computing (TTT4HPC). More info and registrations at this page. A new course on HPC practices! Four self-contained episodes. Pick the one which you need the most or join us for all of them!

Join our daily zoom garage for any scientific computing related issue (not just Triton!) or to just chat and feel part of the community. Follow us on Mastodon.


News archive

24/01/2024 Triton users’ group meeting. This is a special event for all the users of our cluster: come and hear what’s new with Triton. This is also a good moment to express your wishes for future developments.

16-18/01/2024 Linux Shell Scripting course. More info and registrations at this page. Please also visit our training webpages to check other upcoming courses or request a re-run of past hands-on courses.

22/11/2022 From 22nd till 25th of November we will be running our popular *Python for Scientific Computing* course again.

21/10/2022 Upgrade of the Triton login node: After running out of memory several times on our login node, we upgraded the memory from 128 GB to 256 GB. This is hopefully sufficient for most compilation and development work happening on the node. Any computation- or memory-intensive job should still be run on the compute nodes, but this upgrade gives us a more robust system.

12/09/2022 September 2022, it is another academic year! We have CodeRefinery starting in one week: if you write code for research, then this is the workshop for you! Come and learn about git, jupyter, conda, reproducibility and much more. Click here for CodeRefinery Fall 2022 registration and info page.

9/06/2022 Join us today on Twitch.tv at 12:00 EEST for our Intro to Scientific Computing and HPC. The course is open to anyone with an internet connection. If you want to do the hands-on exercises with us, you need access to an HPC cluster. If you are at Aalto please apply for access to the triton cluster, otherwise check what is available at your institution. You can also watch without doing the practical parts, but we recommend registering anyway so you will be able to ask questions on HackMD.

17/01/2022 Join us for our next Twitch.tv courses dedicated to the basics of scientific computing and HPC: 2/Feb/2022 Intro to Scientific Computing and 3-4/Feb/2022 Intro to High Performance Computing. The course is open to anyone with an internet connection. For day 2+3 you need access to an HPC cluster. If you are at Aalto please apply for access to the triton cluster, otherwise check what is available at your institution. You can also watch without doing the practical parts, but we recommend registering anyway so you will be able to ask questions on HackMD.

8/09/2021 Research Software Hour Twitch show is back at a different time. Join us today at 15:00 to talk about “Computers for research 101: The essential course that everyone skipped”.

9/8/2021 We are back from the summer break. Our zoom garage schedule is back to normal (every day at 13:00).

7-9/06/2021 New Triton user? Join our course on how to use Triton and HPC https://scicomp.aalto.fi/training/scip/summer-kickstart/

10/05/2021 CodeRefinery online workshop starts today. Tune in for git intro part 1. If you did not register, you can watch via Twitch: https://www.twitch.tv/coderefinery

01/04/2021 April fools’ … NOT: no jokes but instead a reminder that we have new courses starting in April “Hands on Data Anonymization” and “Software Design for Scientific Computing”. More info and registration links at https://scicomp.aalto.fi/training/

19/03/2021 Linux Shell Scripting starts next week! There is still time to register at: https://scicomp.aalto.fi/training/scip/shell-scripting/

15/02/2021 We have a new login node and new software versions on Triton for: abinit, anaconda, cuda, julia, and quantum espresso. Read more at our issue tracker. We recommend following the issue tracker for live updates from us and from our users too!

14/01/2021 Save the date: 29 January 2021: Crash course on Data Science workflows at Aalto + Linux terminal basics in preparation for 1-2 February 2021: Triton Winter Kickstart. The registration link can be found within the course pages. The Kickstart course is highly recommended for new Triton HPC users.

10/12/2020 We are updating and consolidating our tutorials and guidelines on https://scicomp.aalto.fi website. There might be temporary broken links, please let us know if you spot anything that does not look as it should. Please note that the next Research Software Hour on https://twitch.tv/RSHour will be on Thursday 17/12 at 21:30 Helsinki time. A special episode about Advent of Code 2020.

02/12/2020 This week Research Software Hour on https://twitch.tv/RSHour will happen during the day, straight from the https://nordic-rse.org/ meeting! 13:30 Helsinki time: All you wanted to know about the Rust programming language! Past episodes at Research Software Hour .

26/11/2020 Today at 21:30 Helsinki time, join us for another live episode of Research Software Hour on https://twitch.tv/RSHour Tonight: code debugging! Past episodes at Research Software Hour .

19/11/2020 Our course on Matlab Basics finishes today. Videos from the course will be uploaded to the Aalto Scientific Computing YouTube channel. See the course webpage for more info.

10/11/2020 Our course on Matlab Basics starts today. See the course webpage for more info.

29/10/2020 Today at 21:30 Helsinki time, join us for another live episode of Research Software Hour on https://twitch.tv/RSHour Tonight: git-annex to version control your data and HPC cluster etiquette.

26/10/2020 Tomorrow day 4 of our online CodeRefinery workshop. Materials are available here https://coderefinery.github.io/2020-10-20-online and if you did not register, you can watch it live at https://www.twitch.tv/coderefinery.

21/10/2020 Today day 2 of our online CodeRefinery workshop. Materials are available here https://coderefinery.github.io/2020-10-20-online and if you did not register, you can watch it live at https://www.twitch.tv/coderefinery.

20/10/2020 Today day 1 of our online CodeRefinery workshop. Come and learn about version control, jupyter, documentation. Materials are available here https://coderefinery.github.io/2020-10-20-online and if you did not register, you can watch it live at https://www.twitch.tv/coderefinery.

19/10/2020 Today “Triton users group meeting”, come and hear about the future of Triton/ScienceIT/Aalto Scientific Computing, exciting news on new services, new hardware (GPUs!), and anything related to Aalto Scientific Computing.

16/10/2020 Today the fourth and last part of our course on Data analysis workflows with R and Python. You can watch it on the CodeRefinery Twitch channel.

14/10/2020 Today our course on Data analysis workflows with R and Python continues. You can watch it on CodeRefinery Twitch channel. Please note that the last part of the course is on Friday 16/10/2020.

13/10/2020 Tomorrow our course on Data analysis workflows with R and Python continues. You can watch it on CodeRefinery Twitch channel.

06/10/2020 Today is Tuesday, however Research Software Hour has now moved from Tuesdays to Thursdays. Tune in on Twitch on Thursday October 15 at 21:30 (Helsinki time) to watch live the next episode.

05/10/2020 Today starts our Data analysis workflows with R and Python. You can watch it on CodeRefinery Twitch channel.

29/09/2020 - Join us tonight (21:30 Helsinki time), for Research Software Hour, a one hour interactive discussion with Radovan Bast and Richard Darst. Tonight how to organise research software projects and other tips to keep track of notes, tools, etc.

28/09/2020 – Friendly reminder that you can still register for our Data analysis workflows with R and Python. Link to registration is here. Also save the date: Mon 19/10/2020 at 14:00 “Triton users group meeting”, come and hear about the future of Triton/ScienceIT/Aalto Scientific Computing, exciting news on new services, new hardware (GPUs!), and anything related to Aalto Scientific Computing. More details coming soon.

25/09/2020 – Friendly reminder that you can still register for our Data analysis workflows with R and Python. Link to registration is here.

24/09/2020 – Join our informal chat about research software on zoom at 10:00: RSE activities in Finland. Today is also the SciComp garage day focused on HPC/Triton issues: daily garage.

23/09/2020 – Last day of our course on “Python for Scientific Computing” covering packaging and binder. It can also be watched live on CodeRefinery Twitch if you did not have time to register.

22/09/2020 – Join us tonight (21:30 Helsinki time), for Research Software Hour, a one hour interactive discussion with Radovan Bast and Richard Darst. Tonight we cover command line arguments and running things in parallel. You can watch RSH past episodes on YouTube to get an idea of the topics covered.

21/09/2020 – This week is the last week of our course on “Python for Scientific Computing” You can re-watch the lessons on CodeRefinery Twitch channel

14/09/2020 – Our course on “Python for Scientific Computing” has started today. It can also be watched live on CodeRefinery Twitch if you did not have time to register.

08/09/2020 – “Research Software Hour” will start on 22/09/2020. RSH is an interactive, streaming web show all about scientific computing and research software. You can watch past episodes at the RSH video archive on YouTube.

xx/09/2020 – We started a small News section to keep users up to date and avoid missing important things coming up. Check our trainings coming in October and November. Join our daily garage if you have issues to discuss related to computing or data management.

The Aalto environment

Aalto provides a wide variety of support for scientific computing. For a summary, see the IT Services for Research page. For information about data storage at Aalto, see the section on data management below.

Aalto tools

For more services provided at the Aalto level, see the IT Services for Research page.

Aalto account

Extension to Aalto account and email

Aalto account expiration is bound to staff or student status. The account closes one week after the affiliation to Aalto University ends. Expiration is managed completely by Aalto IT Services, and department IT staff are not able to extend Aalto accounts.

If an account extension is needed, it can be arranged with a visitor contract. The contract requires host information, so contact your supervisor, who (if they accept your request) contacts HR with the needed details to prepare the official visitor contract.

Aalto Linux

See also

https://linux.aalto.fi/ provides official information on Aalto Linux for all of Aalto. This page is somewhat focused on the Science-IT departments, but is useful for everyone.

Aalto Linux is provided to all departments in Aalto. Department IT co-maintains this, and in some departments provides more support (specifically, CS, NBE, PHYS and Math at least). It contains a lot of software and features to support scientific computing and data. Both laptop and desktop setups are available.

This page is mainly about the Linux flavor in CS/PHYS/NBE and partly Math, co-managed by these departments and Science-IT. Most of it is relevant to all Aalto, though.

Basics
  • Aalto home directory. In the Aalto Ubuntu workstations, your home directory will be your Aalto home directory. That is, the same home directory that you have in Aalto Windows machines and the Aalto Linux machines, including shell servers (kosh, taltta, lyta, brute, force).

  • Most installations have Ubuntu 16.04 or 18.04; 20.04 is coming soon.

  • A pretty good guide is available at https://linux.aalto.fi .

  • Login is with Aalto credentials. Anyone can log in to any computer. Since login is tied to your Aalto account, login is tied to your contract status. Please contact HR if you need to access systems after you leave the university or your account stops working due to contract expiration.

  • All systems are effectively identical, except for local Ubuntu packages installed. Thus, switching machines is a low-cost operation.

  • Systems are centrally managed using puppet. Any sort of configuration group can be set up, for example to apply custom configuration to one group’s computers.

  • Large scientific computing resources are provided by the Science-IT project. The compute cluster there is named Triton. Science-IT is a school of science collaboration, and its administrators are embedded in NBE, PHYS, CS IT.

  • Workstations are on a dedicated network VLAN. The network port must be configured before the computer can be turned on, and you can’t just assume that you can move your computer anywhere else. You can request other network ports to be enabled for personal computers; just ask.

  • Installation is fully automated via netboot. Once configuration is set up, you can reboot and PXE boot to get a fresh install. There is almost no local data (except the filesystem for tmp data on the hard disks which is not used for anything by default, /l/ below), so reinstalling is a low-cost operation. The same should be true for upgrading, once the new OS is ready you reboot and netinstall. Installation takes less than two hours.

  • Default user interface. The new default user interface for Aalto Linux is Unity. If you want to switch to the previous default interface (Gnome), before logging in please select “Gnome Flashback (Metacity)” by clicking the round ubuntu logo close to the “Login” input field.

  • Personal web pages. What you put under ~/public_html will be visible at https://users.aalto.fi/~username. See Filesystem details.
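
A minimal sketch of setting up a personal page (assuming the default permissions allow the web server to read your public_html):

# create a personal page visible at https://users.aalto.fi/~username
mkdir -p ~/public_html
echo "Hello from Aalto" > ~/public_html/index.html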

When requesting a new computer:

  • Contact your department IT

  • Let us know who the primary user will be, so that we can set this properly.

When you are done with a computer:

  • Ensure that data is cleaned up. Usually, disks will be wiped, but if this is important then you must explicitly confirm before you leave. There may be data if you use the workstation local disks (not the default). There is also a local cache ($XDG_CACHE_HOME), which stores things such as web browser cache. Unix permissions protect all data, even if the primary user changes, but it is better safe than sorry. Contact IT if you want wipes.

Laptops
  • You can get laptops with Linux on them.

  • Each user should log in the first time while connected to the Aalto network. This will cache the authentication information, then you can use it wherever you want.

  • Home directories can be synced with the Aalto home directories. This is done using unison. TODO: not documented, what about this?

  • If you travel, make sure that your primary user is set correctly before you go. The system configuration can’t be updated remotely.

  • Otherwise, the environment is like the workstations. You don’t have access to the module system, though.

  • If the keychain password no longer works: see FAQ at the bottom.

Workstations

Most material on this page defaults to the workstation instructions.

Primary User

The workstations have a concept of the “primary user”. This user can install software from the existing software repositories and ssh remotely to the desktops.

  • Primary users are implemented as a group named $hostname-primaryuser. You can check the primary users of a computer with getent group $hostname-primaryuser, or check your own membership with groups (see the sketch after this list).

  • If you have a laptop setup, make sure you have the PrimaryUser set! This can’t be set remotely.

  • Make sure to let us know about primary users when you get a new computer set up or change computers. You don’t have to, but it makes it convenient for you.

  • It is not currently possible to have group-based primary users (a group of users all have primary user capabilities across a whole set of computers, which would be useful in flexible office spaces). TODO: are we working on this? (however, one user can have primary user access across multiple computers, and hosts can have multiple primary users, but this does not scale well)
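
A minimal sketch of checking primary-user status on the machine you are logged into (the hostname may need to be the short hostname):

# list the primary users of this machine
getent group $(hostname)-primaryuser
# list your own groups; look for <hostname>-primaryuser
groups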

Data

See the general storage page for the full story (this is mainly oriented towards Linux). All of the common shared directories are available on department Linux by default.

We recommend that most data is stored in shared group directories, to provide access control and sharing. See the Aalto data page.

You can use the program unison or unison-gtk to synchronise files.

Full disk encryption (Laptops)

All new (Ubuntu 16.04 and 18.04) laptops come with full disk encryption by default (instructions). This is a big deal and quite secure, if you use a good password.

When the computer is first turned on, you will be asked for a disk encryption password. Enter something secure and remember it - you have only one chance. Should you want to change this password, take the computer to an Aalto ITS service desk. They can also add more passwords for alternative users for shared computers. Aalto ITS also has a backup master key. (If you have local root access, you can do this with cryptsetup, but if you mess up there’s nothing we can do).

Desktop workstations do not have full disk encryption, because data is not stored directly on them.

Software
Already available
  • Python: module load anaconda (or anaconda2 for Python 2) (desktops)

  • Matlab: automatically installed on desktops, Ubuntu package on laptops.

Ubuntu packages

If you have PrimaryUser privileges, you can install Ubuntu packages in one of the following ways:

  • By going to the Ubuntu Software Center (Applications -> System Tools -> Administration -> Ubuntu Software Centre). Note: some software doesn’t appear here! Use the next option.

  • aptdcon --install $ubuntu_package_name (search for stuff using apt search)

  • By requesting IT to make a package available across all computers as part of the standard environment. Help us to create a good standard operating environment!
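
For example, as primary user (the package name is only an illustration):

# search for a package
apt search inkscape
# install it without sudo via aptdcon
aptdcon --install inkscape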

The module system

The command module provides a way to manage various installed versions of software across many computers. This is the way that we install custom software and newer versions of software, if it is not available in Ubuntu. Note that these are shell functions that alter environment variables, so this needs to be repeated in each new shell (or automated in login).

Note: The modules are only available on Aalto desktop machines, not on laptops.

  • See the Triton module docs for details.

  • module load triton-modules will make most Triton software available on Aalto workstations (otherwise, most is hidden).

  • module avail to list all available packages.

  • module spider $name to search for a particular name.

  • module load $name to load a module. This adjusts environment variables to bring various directories into PATH, LD_LIBRARY_PATH, etc.

  • We will try to keep important modules synced across the workstations and Triton, but let us know if something is missing.

Useful modules:

  • anaconda and anaconda2 will always be kept up to date with the latest Python Anaconda distribution, and we’ll try to keep this in sync across Aalto Linux and Triton.

  • triton-modules: a metamodule that makes other Triton software available.
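
A typical workstation session might then look like this (a sketch using the module names above):

# make the Triton software stack visible, then load what you need
module load triton-modules
module avail
module load anaconda
python --version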

Admin rights

Most times you don’t need to be an admin on workstations. Our Linux systems are centrally managed with non-standard improvements and features, and 90% of cases can be handled using existing tools:

Do you want to:

  • Install Ubuntu packages: Use aptdcon --install $package_name as primary user.

  • This website tells me to run sudo apt-get to install something. Don’t, use the instructions above.

  • This website gives me some random instructions involving sudo to install their program. These are not always a good idea to run, especially since our computers are networked, centrally managed, and these instructions don’t always work. Sometimes, these things can be installed as a normal user with simple modifications. Sometimes their instructions will break our systems. In this case, try to install as normal user and then send a support request first. If none of these work and you have studied enough to understand the risk, you can ask us. Make sure you give details of what you want to do.

  • I need to change network or some other settings. Desktops are bound to a certain network and settings can’t be changed, users can’t be managed, etc.

  • It’s a laptop: then yes, there are slightly more cases you need this, but see above first.

  • I do low-level driver, network protocol, or related systems development. Then this is a good reason for root, ask us.

If you do have root and something goes wrong, our help is limited to reinstalling (wiping all data - note that most data is stored on network drives anyway).

If you do need root admin rights, you will have to fill out a form and get a new wa account, then Aalto has to approve. Contact your department IT to get the process started.

Remote access to your workstation

If you are primary user, you can ssh to your own workstation from certain Aalto servers, including at least taltta. See the remote access page.

More powerful computers

There are different options for powerful computing.

First, we have desktop Linux workstations that are more powerful than normal, including a medium-power GPU card. If you want one of these, just ask. You can buy an even more powerful workstation if you need one, but…

Beyond that, we recommend the use of Triton rather than constructing your own servers which will only be used part-time. You can either use Triton as-is for free, or pay for dedicated hardware for your group. Having your own hardware as part of Triton means that you can use all of Triton, and even CSC if you need it, with little extra work. You could have your own login node, or resources as part of the queues.

Triton is Aalto’s high-performance computing cluster. It is not a part of the department Linux, but is heavily used by researchers. You should see the main documentation at the Triton user guide, but for convenience some is reproduced here:

  • Triton is CentOS (compatible with the Finnish Grid and Cloud Infrastructure), while CS workstations are Ubuntu. So, they are not identical environments, but we are trying to minimize the differences.

    • Since it is part of FGCI, it is easy to scale to more power if needed.

  • We will try to have similar software installed in workstation and Triton module systems.

  • The paths /m/$dept/ are designed to be standard across computers

  • The project and archive filesystems are not available on all Triton nodes. This is because they are NFS shares, and if someone starts a massively parallel job accessing data from here, it will kill performance for everyone. Since history shows this will eventually happen, we have not yet mounted them across all nodes.

    • These are mounted on the login nodes, certain interactive nodes, and dedicated group nodes.

    • TODO: make this actually happen.

  • Triton was renewed in 2016 and late 2018.

  • All info is in the Triton user guide.

Common problems
Network shares are not accessible

If network shares do not work, there are usually two things to try:

  • Permission-denied problems are usually solved by obtaining a new Kerberos ticket with the command kinit.

  • If a share is not visible when listing directories, try to cd into that directory from a terminal. Shares are mounted automatically when they are accessed, and might not be visible before you try to change into the directory.
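
For example (the share path is only an illustration; use your own group's directory):

# renew your Kerberos ticket if you see "permission denied"
kinit
# shares are mounted on access, so cd straight into the full path
cd /m/nbe/project/yourproject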

Graphical User Interface on Aalto CS Linux desktop is sluggish, unstable or does not start
    1. Check your disk quota from a terminal with the command quota. If you are not able to log in to the GUI, you can switch to a text console with the CTRL+ALT+F1 key combination and log in from there. The GUI login can be reached again with CTRL+ALT+F7.

    2. If you are running low on quota (the blocks count is close to the quota), you should clean up some files and then reboot the workstation to try the GUI login again.

      • You can find out what is consuming quota from a terminal with the command: bash -c 'cd && du -sch .[!.]* * | sort -h'

Enter password to unlock your login keyring

You should change your Aalto password on your main Aalto workstation. If you change the password through e.g. https://password.aalto.fi, then your workstation’s password manager (keyring) does not know the new password and asks you to input the old Aalto password.

If you remember your old password, try this:

  1. Start application Passwords and Keys (“seahorse”)

  2. Click the “Login” folder under “Passwords” with right mouse button and select “Change password”

  3. Type in your old password to the opening dialog

  4. Input your current Aalto password to the “new password” dialog

  5. Reboot the workstation / laptop

If changing password didn’t help, then try this:

  • Instead of selecting “Change password” from the right-click menu, select “Delete” and reboot the workstation. When logging in, the keyring application should use your login key automatically.

In Linux some process is stuck and freezes the whole session

You can kill a certain (own) process via text console.

How do I use eJournals, Netmot and other Aalto library services from home?

There is a web login option at the Aalto Library. After logging in, all library-provided services are available. There are links for journals (Nelli) and Netmot. Alternatively, use the VPN, which should already be configured.

Rsync complains about Quota, even though there is plenty left.

The usual reason is that the default rsync -av tries to preserve the group, so the target ends up with the wrong group. Try using rsync -rlptDxvz --chmod=Dg+s <source> <target>. This sets the group correctly on /scratch/ etc., and quota should then be fine.

Quota exceeded or unable to write files to project / work / scratch / archive

Most likely this is due to wrong Linux filesystem permissions. Quota is set per group (e.g. braindata), and by default files go to the default group (domain users). If this happens under some project, scratch, etc. directory, it will complain about “Disk quota exceeded”.

In general this is fixed by admins setting the directory permissions so that everything works automatically. But sometimes this breaks down. Some programs are often responsible for this (rsync and tar, for instance).

There are a few easy ways to fix this:

  • In terminal, run the command find . -type d -exec chmod g+rwxs {} \; under your project directory. After this all should be working normally again.

  • If it’s on scratch or work, see the Triton quotas page

  • Contact NBE-IT and we will reset the directory permissions for the given directory

I cannot start Firefox

There are two reasons for this.

1. Your network home disk is full

# Go to your user dir
cd ~/..
# Check disk usage
du -sh *

The sum should be less than the max quota which is 100GB (as of 2020). If your disk is full then delete something or move it to a local directory, /l/.

2. Something went wrong with your browser profile

If you get an error like “The application did not identify itself”, the following might solve the issue.

Open a terminal and run:

firefox -P -no-remote

This will launch Firefox and ask you to choose a profile. Note that when you delete a profile you delete its passwords, bookmarks, etc. So it’s better to create a new profile, migrate your bookmarks, and delete the old one.

Aalto Mac

This page describes the Aalto centrally-managed Mac computers, where login is via Aalto accounts. If you have a standalone laptop (one which does not use your Aalto account), some of this may be relevant, but for the most part you are on your own and you will access your data and Aalto resources via Remote Access.

More instructions: https://inside.aalto.fi/display/ITServices/Mac

Basics

In the Aalto installations, login is via Aalto account only.

  • When you get a computer, ask to be made primary user (this should be default, but it’s always good to confirm). This will allow you to manage the computer and install software.

  • The first time you login, you must be on an Aalto network (wired or aalto wifi) so that the laptop can communicate with Aalto servers and get your login information. After this point, you don’t need to be on the Aalto network anymore.

  • Login is via your Aalto account. The password stays synced when you connect from an Aalto network.

Full disk encryption

This must be enabled per-user, using FileVault. You should always do this; there is no downside. On Aalto-managed laptops, install “Enable FileVault disk encryption” (it’s a custom Aalto thing that does it for you). To do this manually: Settings → Privacy → enable FileVault.

Data

You can mount Aalto filesystems by using SMB. Go to Finder → File or Go (depending on OS) → Connect to Server → enter the smb:// URL from the data storage pages.

For generic ways of accessing data remotely, see Remote Access. For Aalto data storage locations see Filesystem details, and for the big picture of where and how to store data see Science-IT department data principles.

The program AaltoFileSync is pre-installed and can be used to synchronize files. But you basically have to set it up yourself.

Software
.dmg files

If you are the primary user, in the Software Center you can install the program “Get temporary admin rights”. This will allow you to become an administrator for 30 minutes at a time. Then, you can install .dmg files yourself. This is the recommended way of installing .dmg files.

Aalto software

There is an application called “Managed software center” pre-installed (or “Managed software update” in older versions). You can use this to install a wide variety of ready-packaged software. (ITS instructions).

Homebrew

Homebrew is a handy package manager on Macs. On Aalto Macs, you have to install Homebrew in your home directory. Once you install brew, you can easily install whatever you may need.

First install Xcode through Managed Software Centre (either search Xcode, or navigate through Categories -> Productivity -> Xcode).

# Go to wherever you want to have your Brew and run this
mkdir Homebrew && curl -L https://github.com/Homebrew/brew/tarball/master | tar xz -C Homebrew --strip 1

# This is a MUST!!!
echo "export PATH=\$PATH:$(pwd)/Homebrew/bin" >> ~/.zprofile

# Reload the profile
source ~/.zprofile

# Check if brew is correctly installed.
which brew    # /Users/username/Homebrew/bin/brew

Older versions of macOS (pre-Mojave) use bash as the default shell, so you need to set up the environment differently:

echo "export PATH=\$PATH:$(pwd)/Homebrew/bin" >> ~/.bash_profile

# Reload the profile
source ~/.bash_profile
Admin rights

The “Get temporary admin rights” program described under .dmg file installation above lets you get some admin rights - but not full sudo access.

You don’t need full admin rights to install brew.

If you need sudo rights, you need a workstation admin (wa) account. Contact your department admin for details.

CS Mac backup service

The CS department provides a full clone-backup service for Aalto-installation Mac computers. Aalto-installation means the OS is installed from the Aalto repository.

We use Apple Time Machine. Backup is wireless, encrypted, automatic, periodic, and can be used even outside the campus using the Aalto VPN. It is a “clone” because we can restore your environment in its entirety. You can think of it as a snapshot backup (though it isn’t). We provide twice the space of your SSD; if your Mac has 250GB of space, you get 500GB of backup space. If you would like to enroll in the program, please pay a visit to our office, T-talo A243.

Encryption

We provide two options for encryption:

  1. You set your own encryption key and only you know it. The key is neither recoverable nor resettable. You lose it, you lose your backup.

  2. We set it on your behalf and only we know it.

Restore

With Time Machine you have two options for restore.

  1. Partial

    • You can restore file-by-file. Watch the video.

  2. Complete restore

    • In case your Mac is broken, you can restore completely on a new Mac. For this, you must visit us.

Trouble-shooting
Can’t find the backup destination

This happens because either 1) you changed your Aalto password, or 2) the server is down. Debug in the following manner:

# Is the server alive?
ping timemachine.cs.aalto.fi

# If alive, probably it's your keychain.
# Watch the video below.

# If dead, something's wrong with the server.
# Please contact CS-IT.
Corrupted backup

This is an unfortunate situation with an unknown cause. We take a snapshot of your backup. Please contact CS-IT.

Common problems
Insane CPU rampage by UserEventAgent

It is a mysterious bug which Apple hasn’t solved yet. We can reinstall your system for you.

Aalto Windows

This page describes the Aalto centrally-managed Windows computers, where login is via Aalto accounts. If you have a standalone laptop (login not using Aalto account), some of this may be relevant, but for the most part you will access your data and Aalto resources via Remote Access.

More instructions: https://inside.aalto.fi/display/ITServices/Windows

Basics

In the Aalto installations, login is via Aalto account only.

  • You must be on the Aalto network the first time you connect.

Full disk encryption

Aalto Windows laptops come with this by default, tied to your login password. To verify encryption, find “BitLocker” from the start menu and check that it is on.

Note that on standalone installations, you can enable encryption by searching for “TrueCrypt” in programs - it is already included.

Data

This section details built-in ways of accessing data storage locations. For generic ways of accessing remotely, see Remote Access. For Aalto data storage locations, see Filesystem details and Science-IT department data principles.

Your home directory is automatically synced to some degree.

You can store local data at C:\LocalUserData\User-data\<yourusername>. Note that this is not backed up or supported. For data you want to exist in a few years, use a network drive. It can be worth making a working copy here, since it can be faster.

Software
Aalto software

There is a Windows software self-service portal which can be used to install some software automatically.

Installing other software

To install most other software, you need to apply for a workstation admin (wa) account. Contact your department IT to get the process started.

Common problems

CodeRefinery

The NeIC-sponsored CodeRefinery project is hosting a workshop in Otaniemi (previously we had one in Otaniemi from 12-14 December). We highly recommend this workshop. (Note: it is full and registration is closed.)

If you have an Aalto centrally-managed laptop, this page gives hints on software installation. You have to use these instructions along with the CodeRefinery instructions.

Note

These are only for the Aalto centrally managed laptops. They are not needed if you have your own computer that you administer yourself, or if you have an Aalto standalone computer that you administer yourself.

Warning

You should request primary user rights early, or else it won’t be ready on time and you will have trouble installing things. For Windows computers, request a wa (workstation admin) account.

Linux

You need to be the primary user in order to install your own packages. Ask your IT support to make you the primary user if you aren’t already. You can check with the groups command (see if you are in COMPUTERNAME-primaryuser).

Install the required packages this way. If you are primary user, you will be asked to enter your own password:

pkcon install bash git git-gui gitk git-cola meld gfortran gcc g++ build-essential snakemake sphinx-doc python3-pytest python3-pep8

For Python, we strongly recommend using Anaconda to get the latest versions of software and to have things set up mostly automatically.

You should install Anaconda to your home directory as normal (this is the best way to get the latest versions of the Python packages). If your default shell is zsh (this is the Aalto default, unless you changed it yourself), then Anaconda won’t automatically be put into the path. Either copy the relevant lines from .bashrc to .zshrc (you may have to create this file), or just start bash before starting the Anaconda programs.
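
One way to handle the zsh case (a sketch, assuming a reasonably recent conda; adapt to your installation):

# run once from a shell where conda already works (e.g. bash)
conda init zsh
# or simply start bash before launching the Anaconda programs
bash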

Jupyter: use via Anaconda.

PyCharm: the “snap package” installer requires root, which most people don’t have. Instead, download the standalone community file (.tar.gz), unpack it, and then just run it using ./pycharm.../bin/pycharm.sh. The custom script in /usr/local/bin won’t work since you aren’t root, but you can make an alias in .bashrc or .zshrc: alias pycharm=... (path here).
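For example, a sketch of such an alias (the path is a placeholder for wherever you unpacked PyCharm):

# In ~/.bashrc or ~/.zshrc; adjust the path to your unpacked PyCharm directory
alias pycharm="$HOME/pycharm-community/bin/pycharm.sh"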

Docker: you can’t easily do this on the Aalto laptops, but it is optional.

Mac

You also need to be primary user to install software.

If you are the primary user, in the Software Center you can install “Get temporary admin rights”. This will allow you to become an administrator for 30 minutes at a time. Then, you can install .dmg files yourself (Use this for git, meld, cmake, docker).

Anaconda: you should be able to do “Install for me only”.

Xcode can be installed via the Software Center.

Jupyter: use it via Anaconda, no need to install.

Windows

You should request a workstation-admin account (“wa account”), then you can install everything. Note: these instructions are not extensively tested.

Git and bash can be installed according to the instructions.

Visual diff tools: Needs wa-account.

Mingw: Not working; the failure seems to be caused by the download failing.

Cmake: Needs wa-account.

Docker: untested, likely requires wa-account.

CS Linux

CS Linux is an OS used for computers not supported by Aalto Linux. It is maintained by CS Department IT and is currently only available to researchers in the CS department. The OS is intended for setups for which the Aalto Linux setup is not flexible enough (mainly custom-built setups). The Aalto Linux setup is recommended if it serves your needs.

Currently only desktop setups are available.

Basics
  • Home directory. CS Linux computers have a local home directory (instead of the Aalto home directory found in Aalto Linux).

  • Aalto credentials are used for login. Anyone in the CS department is able to login to any computer on-site. However, ssh login has to be enabled manually by CS-IT.

  • The systems are centrally managed with the help of Puppet.

  • CS Linux computers operate on a dedicated VLAN (different from Aalto Linux). The ethernet port used must be configured before using the computer. The login will not work if the computer is connected to the wrong VLAN. Changes to port configurations can be requested from CS-IT.

  • The default user interface for CS Linux is GNOME. If your computer doesn’t have a graphical interface, but you would like it to have one, please contact CS-IT and it can be configured remotely with the help of Puppet.

Requesting a new CS Linux computer

  • Contact CS-IT.

  • Let CS-IT know who will be using the computer and if they need SSH and sudo access. The primary user receives sudo rights by default.

  • Let CS-IT know if you would like a graphical interface to be installed.

When you are done with a computer

  • Let CS-IT know that you are leaving and bring the computer to the CS-IT office or arrange for someone from the IT team to pick it up. CS-IT will perform a secure erase on the hard drive(s). This is important as most of the data is stored locally.

Software
Ubuntu packages

If you are the primary user, you have sudo rights. You can then use apt to install packages.
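For example (the package name is just an illustration):

sudo apt update
sudo apt install htop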

The module system

The command module provides a way to manage various installed versions of software across many computers. See here for a detailed description on the module system.
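Typical usage looks like this (the module name is an example; see what is actually available with module avail):

module avail             # list software available via modules
module load anaconda     # load a module into your environment
module list              # show currently loaded modules
module unload anaconda   # remove it again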

Data

Everything is stored locally, meaning that there are no backups. Anyone with physical access to the computer is able to access the data stored on it.

You are able to mount the Aalto home directory as well as the teamwork directories (requires sudo rights). This can be done by “connect to server” in the file browser for easy graphical access, or via the command line to choose the mounting location.

Samba share addresses:

  • smb://home.org.aalto.fi/$USER

  • smb://tw-cs.org.aalto.fi/project/$projectname/ - replace $projectname.

  • smb://tw-cs.org.aalto.fi/archive/$archivename/ - replace $archivename.

Mounting an smb share using terminal

sudo mount -t cifs -o username=$USER,cruid=$USER,uid=$(id -u $USER),gid=$(id -g $USER),sec=krb5 //tw-cs.org.aalto.fi/project/ ~/mnt

Note

Notice that Samba mounts don’t include information about file and directory permissions. This means that all files and directories will have the default permissions. This also applies to anything that you create.

User accounts

User accounts on CS Linux are managed via the central configuration management. If you want to grant access to the system for other users, please contact CS-IT. Creating local users manually may cause unexpected issues.

Admin rights

The primary user of the computer receives sudo rights by default.

Sudo rights can also be requested for other users (requires approval from primary user). These requests can be sent to CS-IT.

CS Linux computers are centrally managed, and that central management must not be broken. If sudo rights have been used to change managed settings, our support is mostly limited to reinstalling the computer.

Remote Access

This page describes remote access solutions. Most of them are provided by Aalto, but there are also instructions here for accessing your workstations. See Aalto Inside for more details.

Linux shell servers
  • Department servers have project, archive, scratch, etc mounted, so are good to use for research purposes.

    • CS: magi.cs.aalto.fi: Department staff server (no heavy computing, access to workstations and has file systems mounted, use the kinit command first if project directories are not accessible)

    • NBE: amor.org.aalto.fi, same as above.

    • Math: elliptic.aalto.fi, illposed.aalto.fi, same as above (but no project, archive and scratch directories)

  • Aalto servers

    • kosh.aalto.fi, lyta.aalto.fi: Aalto, for general login use (no heavy computing)

    • brute.aalto.fi, force.aalto.fi: Aalto, for “light computing” (expect them to be overloaded and not that useful). If you are trying to use these for research, you really want to be using Triton instead.

    • viila.aalto.fi: Staff server (access to workstations and has filesystems mounted, but you need to kinit to access them). It is somewhat outdated and works a bit differently from the others.

  • Your home directory is shared on all Aalto shell servers, and that means .ssh/authorized_keys as well.

  • You can use any of these to mount things remotely via sshfs (see the sketch after this list). This is easy on Linux, harder but possible on other OSs. You are on your own here. You still need kinit at the same time.

    • The CS filesystems project and archive and Triton filesystems scratch and work are mounted on magi (and viila.aalto.fi) (see storage).

For any of these, if you can’t access something, run kinit!
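A minimal sshfs sketch (the hostname and remote path are examples; adjust them to your case):

mkdir -p ~/mnt/remote
sshfs USERNAME@magi.cs.aalto.fi:/path/to/directory ~/mnt/remote
# run kinit on the server first if the directory needs Kerberos authentication
fusermount -u ~/mnt/remote    # unmount when done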

VPN

To access certain things, you need to be able to connect to the Aalto networks. VPN is one way of doing that. This is easy and automatically set up on Aalto computers.

Main Aalto instructions. Below is some quick reference info.

  • Generic: OpenConnect/Cisco AnyConnect protocols. vpn.aalto.fi, vpn1.aalto.fi or vpn2.aalto.fi

  • Aalto Linux: Status bar → Network → VPN Connections → Aalto TLS VPN.

  • Aalto mac: Dock → Launchpad → Cisco AnyConnect Secure Mobility Client

  • Aalto windows: Start → Search → AnyConnect

  • Personal Linux laptops: Use OpenConnect. Configuration on Ubuntu: Networks → Add Connection → Cisco AnyConnect compatible VPN → vpn.aalto.fi. Then connect and use Aalto username/password. Or from the command line: openconnect https://vpn.aalto.fi

  • Personal mac: use Cisco AnyConnect VPN Client

  • Personal Windows: use Cisco AnyConnect VPN Client

SSH SOCKS proxy

If you need to access the Aalto networks but can’t send all of your traffic through them, you can use SSH’s built-in SOCKS proxy. Only use this on computers that only you control, since the proxy itself doesn’t have authentication.

Connect to an Aalto server using SSH with the -D option:

$ ssh -D 8080 USERNAME@kosh.aalto.fi

Configure your web browser or other applications to use a SOCKS5 proxy on localhost:8080 for connections. Remember to revert when done or else you can’t connect to anything once the SSH tunnel stops (“proxy refusing connections”).

The web browser extension FoxyProxy Standard (available on many web browsers despite the name) may be useful here, because you can direct only the domains you want through the proxy.

  • Go to the FoxyProxy options

  • Configure a proxy with some title (“Aalto 8080” for example), Proxy type SOCKS5, Proxy IP 127.0.0.1 (localhost), port 8080 (or whatever you used in the ssh command), no username or password.

  • Save and edit patterns

  • Add a new pattern (“New White”) and use a pattern you would like, for example *.aalto.fi, and make sure it’s enabled.

  • Save

Now, in this browser, when you try to access anything at *.aalto.fi, it will go through the SOCKS proxy and appear to come from the computer to which you connected. By digging around in options or using the extension button, you can direct everything through a proxy and so on.

This can also be used for SSH itself, at least on Linux (install the program netcat-openbsd; the proxy port must match the one you gave to -D above):

ssh -o 'ProxyCommand=nc -X 5 -x 127.0.0.1:8080 %h %p' HOSTNAME

Remote mounting of network filesystems

Accessing your Linux workstation / Triton remotely
  • Remote access to desktop workstations is available via the university staff shell servers viila.aalto.fi or department-specific servers magi.cs.aalto.fi (CS), amor.org.aalto.fi (NBE), elliptic.aalto.fi/illposed.aalto.fi (Math).

  • You need to be the PrimaryUser of the desktop in order to ssh to it.

  • Remote access to Triton is available from any Aalto shell server: viila, kosh.aalto.fi, etc.

  • When connecting from outside Aalto, you have to use both SSH keys and a password, or use the VPN.

  • See SSH for generic SSH instructions.

  • SSHing directly to computers using openssh ProxyJump:

    • Put this in your .ssh/config file under the proper Host line: ProxyJump viila.aalto.fi (or for older SSH clients, ProxyCommand ssh viila.aalto.fi -W %h:%p). See the example config after this list.

    • Note that unless your local username matches your Aalto username, or unless you have defined the username for viila.org.aalto.fi elsewhere in the SSH config, you will have to use the format aaltousername@viila.org.aalto.fi instead.
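For example, a sketch of a .ssh/config entry (the workstation name and username are placeholders; adjust to your own):

Host mydesktop
    HostName mydesktop.cs.aalto.fi
    User myaaltousername
    ProxyJump myaaltousername@viila.aalto.fi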

Remote desktop

Aalto has remote desktops available at https://vdi.aalto.fi and http://mfavdi.aalto.fi/. This works from any network.

There are both Windows and Linux desktops available. They are arranged as virtual machines with the normal desktop installations, so have access to all the important filesystems and all /m/{dept}/....

Aalto Gitlab

https://version.aalto.fi is a Gitlab installation for the Aalto community. Gitlab is a git server and hosting facility (an open source Github, basically).

Note

Git in general

Git seems to have become the most popular and supported version control system, even if it does have some rough corners. See the general git page on this site for pointers.

Aalto Gitlab service

Aalto has a self-hosted Gitlab installation at https://version.aalto.fi, which has replaced most department-specific Gitlabs. With Aalto Gitlab, you can:

  • Have unlimited private repositories

  • Have whatever groups you need

  • Get local support

The Aalto instructions can be found here, and general gitlab help here.

All support is provided by Aalto ITS. Since all data is stored within Aalto and is managed by Aalto, this is suitable for materials up to the “confidential” level.

Extra instructions for Aalto Gitlab

Always log in with HAKA wherever you see the button. To use your Aalto account otherwise, use username@aalto.fi and your Aalto password (for example, with https pushing and pulling). But you really should configure ssh keys for pushing and pulling.
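A sketch of setting up SSH-based pushing and pulling (GROUP/REPO is a placeholder; the address format is the standard GitLab one, so check your project’s Clone button for the exact URL):

ssh-keygen -t ed25519                          # create a key pair if you don't have one
# then add the contents of ~/.ssh/id_ed25519.pub to your profile on version.aalto.fi
git clone git@version.aalto.fi:GROUP/REPO.git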

For outside/public sharing read-only, you can make repositories public.

If you need to share with an outside collaborator, this is supported. These outside partners can access repositories shared with them, but cannot make new ones. They will get a special gitlab username/password, and should use that with the normal gitlab login boxes. To request a collaborator account, their Aalto sponsor should go here to the request form (employees only, more info). (You can always set a repository as public, so anyone can clone. Another hackish method is to add ssh deploy keys (read-only or read-write) for outside collaborators, but this isn’t recommended for serious cases.)

For public projects where you want to build a community, you can also consider Github. There’s nothing wrong with having both sites for your group, just make sure people know about both. Gitlab can have public projects, and Github can also have group organizations.

NOTE! If your work contract type changes (e.g. staff → visitor, student → employee, different department), Aalto Version blocks your access as a “security” measure. Please contact the Aalto ITS Servicedesk <servicedesk@aalto.fi> to be unblocked. This is annoying, but can’t be fixed yet.

The service doesn’t have quotas right now, but has limited resources and we expect everyone to use disk space responsibly. If you use too much space, you will be contacted. Just do your best to use the service well, and the admins will work with you to get your work done.

CodeRefinery Gitlab and Gitlab CI service

CodeRefinery is a publically funded project (by Nordforsk / Nordic e-Infrastructure Collaboration) which provides teaching and a GitLab platform for Nordic researchers. This is functionally the same as the Aalto Gitlab and may be more useful if you have cross-university collaboration, but requires more activation to set up.

They also have a Gitlab CI (continuous integration) service which can be used for automated building and testing. This is also free for Nordic researchers, and can be used even with Aalto Gitlab. Check their repository site info, if info isn’t there yet, then mail their support asking about it.

Recommendations

version.aalto.fi is a great resource for research groups. Research groups should create a “Gitlab group” and give all their members access to it. This way, code and important data will outlast a single person’s time at Aalto. Add everyone as a member to this group so that everyone can easily find the code.

Think about the long term. Will you need access to this code in 5 years, and if so what will you do?

  • If you are a research group, put your code in a Gitlab group. The users can constantly switch, but the code will stay with the group.

  • If you are an individual, plan on needing a different location once you leave Aalto. If your code can become group code, include it in the group repository so at least someone will keep it at Aalto.

  • Zenodo is a long-term data archive. When you publish projects, consider archiving your code there. (It has integration with Github, which you might prefer to use if you are actually making your code open.) Your code is then citeable with a DOI.

  • In all cases, if multiple people are working on something, think about licenses at the beginning. If you don’t, you may be blocked from using your own work.

FAQ
  • What password should I use? It is best to use HAKA to log in to gitlab, in which case you don’t need a separate gitlab password. To push, it is best to use ssh keys.

  • My account is blocked! That’s not a question, but Gitlab blocks users when your Aalto unit changes. This is unfortunately part of gitlab and hasn’t been worked around yet. Mail servicedesk@aalto.fi with your username and request “my version.aalto.fi username XXX be unblocked (because my aalto unit changed)” and they should do it.

  • What happens when I leave, can I still access my stuff? Aalto can only support its community, so your projects should be owned by a group with which you can continue collaborating after you leave (note that this is a major reason for group-based access control!). Email servicedesk for information on how to become an external collaborator.

  • When are accounts/data deleted? In 2017, the deletion policy was findable in the privacy policy: 6 months after the Aalto account is closed, 24 months after the last login, or 12 months after the last login of an external collaborator. Now, that link is dead and only points to the general IT Services privacy notice.

  • Are there continuous integration (CI) services available? Not from Aalto (though you can run your own runners), but the CodeRefinery project has free CI services to Nordics, see their site and the description above.

Harbor: Container registry for images and artifacts

Aalto University provides an instance of the popular Harbor registry for storing and managing images and other artifacts. The service can be found at https://harbor.cs.aalto.fi.

Web login

Currently only Aalto users can log in to the service. When you visit https://harbor.cs.aalto.fi you can choose between the OIDC provider and the local DB. Choose the OIDC provider. It will take you to the Microsoft sign-in page for Aalto University.

Projects
  • New projects can only be created by CS-IT (guru at cs dot aalto.fi).

  • Each project has project administrators, who manage it, and members.

  • Each new member must be added to a project individually. Adding existing Aalto unix groups isn’t currently possible without a special request and extra work (due to a limitation of the Aalto Azure directory). If a group would be very helpful to your work, ask.

  • Trivy vulnerability scanner by Aqua Security is available for all projects. You can see security vulnerabilities on each image page.

Docker access

Never use your Aalto password from the docker command line - push is via a token.

Before accessing the registry for the first time, you must install docker-credential-helpers and configure Docker to use your local credential store.

To install docker-credential-helpers on Aalto Linux run:

pkcon install golang-docker-credential-helpers

Then add following to ~/.docker/config.json:

{
  "credsStore": "secretservice"
}

Now, when you log in to the registry using docker, the token is stored in your credential store.

Logging in to Harbor using docker doesn’t happen with your Aalto password; instead you need to get a CLI secret from the Harbor web app. You can find your secret by clicking your email address in the top right corner and selecting “User Profile” from the dropdown. The last item in the user profile dialog is the CLI secret, which you can copy by clicking the icon next to the field. You can also generate a new secret or upload your own.

Now run:

docker login https://harbor.cs.aalto.fi

For the username, enter the username shown in the user profile dialog, and for the password use the CLI secret from the same dialog.

Tag the image first before pushing (images must be prefixed with harbor.cs.aalto.fi):

docker tag <source_image>[:<tag>] harbor.cs.aalto.fi/<project>/<repository>[:<tag>]

To push an image to a project use:

docker push harbor.cs.aalto.fi/<project>/<repository>[:<tag>]

You can find the project-specific tag and push commands on the repositories page of the project. Similarly, the pull commands for individual artifacts can be found on the artifacts page of their repository.

Robot accounts

Harbor supports robot accounts for projects. They can be created from the robot accounts page of a project. Each robot account can have a different set of permissions, and each should have only the minimal permissions needed for its use case. After creating a robot account, Harbor generates a secret for it. This secret is used to log in to the account in the same way as with normal accounts. If you forget the secret, you can refresh it to a new one later.
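For example, logging in with a robot account might look like this (a sketch; the account name and secret are placeholders, and the exact name format is shown when you create the account):

echo '<robot-secret>' | docker login -u 'robot$myproject+myrobot' --password-stdin harbor.cs.aalto.fi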

Security

Aalto’s Harbor is officially security-rated for public data. Still, if you set project permissions correctly, content should only be available to those with permissions (unless the project is set to public).

JupyterHub (jupyter.cs)

Note

This page is about the JupyterHub for light use and teaching, https://jupyter.cs.aalto.fi. The Triton JupyterHub for research is documented at JupyterHub on Triton.

NBGrader in JupyterLab is the default (2023 Autumn)

JupyterLab interface is now available and is the default option for new course servers. Doing Assignments in JupyterLab tells more about using it. You can access nbgrader from the JupyterLab menu.

https://jupyter.cs.aalto.fi is a JupyterHub installation for teaching and light usage. Anyone at Aalto may use this for generic light computing needs; teachers may create courses with assignments using nbgrader. Jupyter has a rich ecosystem of tools for modern computing.

Basic usage

Log in with any valid Aalto account. Our environment may be used for light computing and programming by anyone.

Your persistent storage has a quota of 1GB. Your data belongs to you, may be accessed from outside, and currently is planned to last no more than one year from last login. You are limited to several CPUs and 1GB memory.

Your notebook server is stopped after 60 minutes of idle time, or 8 hours max time. Please close the Jupyter tab if you are not using it, or else it may still appear as active.

There are some general-use computing environments. You will begin with Jupyter in the /notebooks directory, which is your persistent storage. Your server is completely re-created each time it restarts: everything in your home directory is re-created, and only /notebooks is preserved. (Certain files like .gitconfig are preserved by linking into /notebooks/.home/....)

You begin with a computing server with the usual scipy stack installed, plus a lot of other software used in courses here.

You may access your data as a network drive by SMB mounting it on your own computer - see Accessing JupyterHub (jupyter.cs) data. This allows you total control over your data.

JupyterHub has no GPUs, but you can check out the instructions for using the Paniikki GPUs with the JupyterHub data. These instructions are still under development.

Each notebook server is basically a Linux container primarily running a Jupyter notebook server. You may create Jupyter notebooks to interact with code. To access a Linux bash shell, create a new terminal - this is a great place to learn something new.

Accessing JupyterHub (jupyter.cs) data

Unlike many JupyterHub deployments, your data is yours and you have many different ways to access it. Thus, we don’t just have jupyter.cs, but a whole constellation of ways to access it and do your work, depending on what suits you best for each part.

Your data (and as an instructor, your course’s data) can be accessed many ways:

  • On jupyter.cs.

  • Via network drive on your own computer as local files.

  • On Aalto shell servers (such as kosh.aalto.fi).

  • On other department/university workstations.

On Paniikki and Aalto computers

On Paniikki, and the Aalto servers kosh.aalto.fi, lyta.aalto.fi, brute.aalto.fi, and force.aalto.fi (and possibly more), the JupyterHub is available automatically. You can, for example, use the Paniikki GPUs.

Data is available under the path /m/jhnas/jupyter. The Linux-server paths are also available on the hub, in case you want file paths that work in both places.

Name                     | Path on hub                  | Path on Linux servers
-------------------------|------------------------------|--------------------------------------------
personal notebooks       | /notebooks                   | /m/jhnas/jupyter/u/$nn/$username/
course data              | /coursedata                  | /m/jhnas/jupyter/course/$course_slug/data/
course instructor files  | /course                      | /m/jhnas/jupyter/course/$course_slug/files/
shared data              | /m/jhnas/jupyter/shareddata/ | /m/jhnas/jupyter/shareddata/

Variable seen above | Meaning
--------------------|-----------------------------------------------------------------------------
$username           | Your Aalto username
$nn                 | The two digits you see in echo $HOME (the last two digits of your Aalto uid; see the id command)
$course_slug        | The short name of your course

You can change directly to your notebook directory by using cd /m/jhnas/jupyter/${HOME%/unix}.

You can link it to your home directory so that it’s easily available. In a terminal, run /m/jhnas/jupyter/u/makedir.sh and you will automatically get a link from ~/jupyter in your home directory to your user data.

Permission denied? Run kinit in the shell - this authenticates yourself to the Aalto server and is required for secure access. If you log in with ssh keys, you may need to do this.

Remote access via network drive
Basic info

Name                     | Network drive path
-------------------------|----------------------------------------------------
personal notebooks       | smb://jhnas.org.aalto.fi/$username/
course data              | smb://jhnas.org.aalto.fi/course/$course_slug/data/
course instructor files  | smb://jhnas.org.aalto.fi/course/$course_slug/files/
shared data              | smb://jhnas.org.aalto.fi/shareddata/

You can do an SMB mount, which makes the data available as a network drive. You will have the same copy of the data as on the hub - actually, the same data, so edits take effect immediately in both places, just like your home directory. You must be on an Aalto network, which for students practically means you must be connected to the Aalto VPN (see the VPN instructions) or use an Aalto computer. The “aalto” wifi network does not work unless you have an Aalto computer.

  • Linux: use “Connect to Server” from the file browser. The path is smb://jhnas.org.aalto.fi/$username. You may need to use AALTO\username as your username. If there is a separate “domain” option, use AALTO for the domain and just your username for the username.

  • Mac: same path as Linux above, “Connect to Server”. Use AALTO\your_username as the username.

  • Windows: \\jhnas.org.aalto.fi\$username, and use username AALTO\your_username. Windows sometimes caches the username/password for a long time, so if it does not work try rebooting.

You can also access course data and shared data by using jhnas.org.aalto.fi/course/ or jhnas.org.aalto.fi/shareddata/.

See also

Mounting network drives in Windows has the same instructions, but for Aalto home directories. Anything there should apply here, too.

Using GPUs

One problem with our JupyterHub so far is that we don’t have GPUs available. But, because our data is available to other computers, you can use the Paniikki: Computer Lab For Students GPUs (quite good ones) to get all the power you need. To do this, you just need to access the Jupyter data on these classroom computers.

Terminal: First, start a terminal. You can navigate to your data following the instructions above: cd /m/jhnas/jupyter/${HOME%/unix}. From there, navigate to the right directories and do what is needed.

File browser: Navigate to the path /m/jhnas/jupyter/u/$nn/$username, where $nn is the two numbers you see when you do echo $HOME in a terminal. To open a terminal from a location, right click and select “Open in Terminal”.

Now that you have the terminal and the data, you can do whatever you want with it. Presumably, you will start Jupyter here - but first you want to make the right software available. If your course tells you how to do that using an Anaconda environment, go ahead and do it. (Please don’t install large amounts of software like Anaconda in the Jupyter data directories - they are for notebooks and small-to-medium data.)

Using the built-in anaconda, you can load the Python modules with module load anaconda and start Jupyter with jupyter notebook:

(Screenshot: starting Jupyter with the anaconda module.)

Note that now you need to module load anaconda, not anaconda3 as the image shows.
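That is, in a terminal on the classroom computer:

module load anaconda
jupyter notebook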

Terms of use

This service must be used according to the general IT usage policy of Aalto University (including no unlawful purposes). It should only be used for academic purposes (note that self-exploration and programming for your own interests counts as an academic purpose, but commercial use is not allowed). For more information, see the Aalto policies. Heavy non-interactive computational use is not allowed (basically, don’t script stuff to run in the background when you are not around; if you are using the service in person, it is OK). For research computing, see the Triton cluster.

Courses and assignments

New: nbgrader in JupyterLab instructions: Doing Assignments in JupyterLab

Some courses may use the nbgrader system to give and grade assignments. These courses have special entries in the list. If you are a student in such a course, you will have a special environment for that course. Your instructor may customize the environment, or it may be one of our generic environments.

If your course is using Jupyter with nbgrader, there are some built-in features for dealing with assignments. Under the Assignment list tab, you can see the assignments for your course (only the course you selected when starting your notebook server). You can fetch assignments to work on them - they are then copied to your personal /notebooks directory. You can edit the assignments there - fill out the solutions and validate them. Once you are done, you can submit them from the same assignment list. We have a short tutorial that walks through this process using the new JupyterLab interface.

A course may give you access to a /coursedata folder with any course-specific data.

By default, everyone may access every course’s environment and fetch their assignments. We don’t stop you from submitting assignments to courses you are not enrolled in - but please don’t submit assignments unless you are registered, because the instructors must then deal with it. Some courses may restrict who can launch their notebook servers: if you cannot see or launch the notebook server for a course you are registered for, please contact your instructor.

Note that the /notebooks folder is shared across all of your courses/servers, but the assignment list is specific to the course you have started for your current session. Thus, you should pay attention to what you launch. Remember to clean up your data sometimes.

Instructors
JupyterHub (jupyter.cs) for instructors

See also

Main article with general usage instructions: Jupyterhub for Teaching. For research purposes, see Triton JupyterHub.

Jupyter is an open-source web-based system for interactive computing in “notebooks”, highly known for its features and ease of use. Nbgrader (“notebook grader”) is a Jupyter extension to support automatic grading via notebooks. The primary advantage (and drawback) is its simplicity: there is very little difference between the notebook format for research work and automatic grading. This lowers the barrier to creating assignments and means that the interface students (and you) learn is directly applicable to (research) projects that may come later.

Nbgrader documentation is at https://nbgrader.readthedocs.io/, and is necessary reading to understand how to use it. For a quickstart in the notebook format, see the highlights page. However, the Noteable service documentation (https://noteable.edina.ac.uk/documentation/) is generally much better, and most of it is applicable to here as well. The information included in these is not duplicated here, and is required in order to use jupyter.cs.

Below, you mostly find documentation specific to jupyter.cs and important notes you do not find other places.

jupyter.cs news
Spring 2024
  • We had a user’s group meeting. You can find the slides here, including commentary.

  • The SciComp garage now has a focus day for jupyter.cs on Wednesdays.

Autumn 2023
  • JupyterLab is now available and is the default for new course servers. If you’d like to continue using Jupyter Notebook for your courses, let us know when requesting a new course. JupyterLab now supports everything nbgrader needs, though the user interface is slightly different. You can send Doing Assignments in JupyterLab to your students for instructions.

Summer/Autumn 2020
  • You can now make a direct link that will spawn a notebook server, for example for a course with a slug of testcourse: https://jupyter.cs.aalto.fi/hub/spawn?profile=testcourse. If the user is already running a server, it will not switch to the new course. Expect some subtle confusion with this. Full info in FAQ and hints.

Basics

The JupyterHub installation provides a way to offer a notebook-based computational environment to students. It is best to not think of this service as a way to do assignments, but as a general light computing environment that is designed to be easy enough to be used for courses. Thus, students should feel empowered to do their own computing, and this should feel like a stepping stone to using their own systems set up for scientific computing. Students’ own data is persistent as they go through courses, and they need to learn how to manage it themselves. Jupyter works best for project/report-type workflows rather than lesson/exercise workflows, but of course it can do that too. In particular, there is no real possibility for real-time grading and so on.

Optionally, you may use nbgrader (notebook grader) to make assignments, release them to students, collect them, autograde them, manually grade, and then export a csv/database of grades. From that point, it is up to you to manage everything. There is currently no integration with any other system, except that Aalto accounts are used to log in.

What does this mean? Jupyter is not a learning management system (even when coupled with nbgrader), it’s “a way to make computational narratives”. This means that this is not a point and click solution to running courses, but a base to build computations on. In order to build a course, you need to be prepared to do your own scripting and connections using the terminal.

You may find the Noteable documentation (serves as a nbgrader user guide) and book Teaching and Learning with Jupyter (broad, less useful) helpful.

Currently we support Python the most, but there are other language kernels available for Jupyter. For research purposes, see the Triton Jupyter page.

Limits
  • This is not a captive environment: students may always trivially remove their files and data, and may share notebooks across different courses. See above for the link to isolate-environment with instructions for fixing this.

  • We don’t have unlimited computational resources, but in practice we have quite a lot. Try to avoid all students doing all the work right before a deadline and you should be fine, even with hundreds of students.

  • There is no integration to any other learning management systems, such as the CS department A+ (yet). The only unique identifier of students is the Aalto username. nbgrader can get you a csv file with these usernames, what happens after that point is up to you.

  • There is currently no plagiarism detection support. For now, you will have to handle this yourself.

System environment

The following describes the environment in which each Jupyter notebook server exists. This is a normal Linux environment, and you are encouraged to use the shell console to interact with it. In fact, you will need to use the console to do various things, and you will probably need to do some scripting.

Why is everything not a push-button solution? Everyone has such unique needs, and we need to solve all of them. We can only accomplish our goals if people are able to - and do - do their own scripting.

Linux container

Each time you launch your server, you get a personal Linux container. Everything (except the data) gets reset each time it stops. From the user perspective, it looks like a normal Linux system. Unlike some setups, we let students see and browse the whole Linux system. (Other systems try to hide it, but in reality they can’t stop students from accessing it.)

Data
  • /notebooks/ is your per-user area. It’s what you see by default, and is shared among all your courses.

  • /course/ is the course directory (a nbgrader concept). It is available only to instructors. You need to read the nbgrader instructions to understand how this works.

  • /coursedata/ is an optional shared course data directory. Instructors can put files here so that students can access them without having to copy data over and over. Instructors can write here, students can only read. Remember to make it readable to all students: chmod -R a+rX /coursedata.

  • /srv/nbgrader/exchange is the exchange directory, a nbgrader concept but you generally don’t have to worry about it yourself.

Data is available from outside JupyterHub: it is hosted on an Aalto-wide server provided by Aalto. Thus, you can access it on your laptops, on Aalto public shell servers, and more. A fast summary is below, but see Accessing JupyterHub (jupyter.cs) data for the main info.

  • From your own laptop: The SMB server jhnas.org.aalto.fi path /vol/jupyter/{course,$username}.

    • Linux: “Connect to server” from the file browser, URL smb://jhnas.org.aalto.fi/vol/jupyter

    • Mac: same as Linux

    • Windows: \\jhnas.org.aalto.fi\vol\jupyter.

  • Data is available on public Aalto shell servers such as kosh and lyta, at /m/jhnas/jupyter/.

Software

For Python, software is distributed through conda. You can install your own packages using pip or conda, but everything is reset when you restart the server. This is sort of by design: a person can’t permanently break their own environment (restarting gets you to a good state), but you have your own flexibility.
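For example, from a terminal on jupyter.cs (the package name is an illustration; whatever you install is gone after the server restarts):

pip install somepackage      # or: conda install somepackage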

You should ask us to install common software which you or your students need, instead of installing it yourself each time. But feel free to install it yourself to get your work done until then.

Jupyter

Both JupyterLab and the classic notebook interface are installed, along with a lot of extensions. If you need more extensions, let us know. New course servers now default to JupyterLab, which supports nbgrader (see the Autumn 2023 news above); if you prefer the classic notebook interface for your course, let us know.

Requesting a course

Note

The JupyterLab interface is now available and is the default option for new course servers. If you’d still like to use the Jupyter Notebook interface for your course, let us know.

To get started with a course, please read the below list and describe your needs from the relevant items, and contact guru@cs.aalto.fi. Don’t worry too much about understanding or answering everything perfectly, just let us know what you want to accomplish and we will guide you to what you need.

Course or not?

If all you need is a Python environment to do assignments and projects, you don’t need to request anything special - students can just use the generic servers for their independent computational needs. Students can upload and download any files they need. You could add data to the “shareddata” location, which is available to any user.

You would want a course environment if you want to (distribute assignments to students via the interface) and/or (collect assignments via the interface).

Request template

To make things faster and more complete, copy and paste the template below into your email to us (guru@cs.aalto.fi), edit all of the fields (if anything is unclear, don’t worry: send it and a human will figure it out), and send it to us with any other comments. The format is YAML, by the way (but we can handle the syntax details).

name: CS-E0000 Course Name (Year)
uid: (leave blank, we fill in)
gid: (leave blank, we fill in)

# supervisor = faculty in charge of course
# contacts = primary TAs which should also get emails from us.
# manager = (optional) has rights to add other TAs via
#           domesti.cs.aalto.fi (supervisor is always a manager)
# Please separately tell us who the initial TAs are!  Managers can
# add more later via domesti.cs.aalto.fi.
supervisor: teacher.in.charge@aalto.fi
contact: [teacher.in.charge@aalto.fi, head.ta@aalto.fi]
#manager: [can_add.tas@aalto.fi]

# if true, create a separate data directory
datadir: false

# Important dates.  But not too important, we can always adjust later.
# So far, you need to email us to make it public when you are ready!
public_date:  2020-09-08      # becomes visible to students before course
private_date: 2021-01-31      # hidden from students after course
archive_date: 2021-09-01      # becomes hidden from instructors
delete_date:  2021-09-01      # after this, we ask if it can be deleted

# For the course dates itself (just for our reference, not too important)
start_date: 2020-10-01
end_date: 2020-12-15
course_times: EXAMPLE, fill in: Exercise sessions Tuesday afternoons, Deadlines Fridays at 18

# The dates above actually aren't used.  These control visibility:
private: true
archive: false

# Internal use, ignore this.  The date is the version of software
# you get (your course won't get surprise updates to software after
# that date).
image: [standard, 2020-01-05]

Course environment options

When requesting a course, please read the following and tell us your requirements in the course request email to guru@cs.aalto.fi (using the template above). If you are using the hub without a specific course item in the selection list, please let us know at least items 3a, 6, 7, and 8 below. You don’t need to duplicate information already in the YAML above.

Required metadata is:

  1. Course slug

Permanent identifier of the course, of the form nameYEAR (for example mlbp2018).

  2. Course display name

What students see in the interface.

  3. Contact

Who to ask about day-to-day matters, could be multiple. Aalto emails or usernames.

3a. Who should be added to the “announcement” issue and gets announcements about updates during the periods.

  4. Supervisor

Long-term staff who can answer questions about old data even if the course TAs move on. Might be the same as the contact. This is the “primary owner” of all data according to the Science-IT data policy.

  5. Instructors

Who will have access to the instructor data? Instructors will be added to an Aalto unix group named jupyter-$courseslug to provide access control. To request new instructors, you can do this yourself (see the relevant FAQ), or email CS-IT and ask that people be added/removed from your group jupyter-$courseslug.

  6. Number of students

Just to keep track of expected load and so on.

  7. Course schedule

Sessions when all students will be using it (e.g. lectures, tutorials), and deadlines when you expect many students to be working. These will be added to our hub calendar, to avoid doing maintenance at critical moments. Please do whatever you can to de-peak loads, but in reality we can probably handle whatever you throw at us. Very late night deadlines are usually not good, since we often do maintenance then (and they are bad for students…).

  8. Expected load

What kind of assignments? Lots of CPU, memory intensive? Knowing how people use the resources helps us make things work well.

  9. Course time frame

Which periods does the course run in? Note: these dates aren’t automatically used yet, you may still have to mail us to make the course public or private.

9a. Public date - course automatically becomes public on this date (until then, students can’t see it).

9b. Hide date - course automatically goes back to private mode on this date (it’s fine and recommended to give a long buffer here).

9c. Archive date - course goes into “archive” mode after this time, and gets hidden from instructors too.

9d. Delete date - data removed. Not automatic; contacts will get an email to confirm (we aren’t crazy).

A course environment consists of (comment on any specifics here):

  1. A course directory /course, available only to instructors. This comes by default, with a quota of a few gigabytes (combined with coursedata). Note: instructors should manage assignments and so on using git or some other version control system, because the course directory lasts only one year, and is renewed for the next year.

  2. Software (optional, recommended to use the default and add what you need) A list of required software, or a docker container containing the Jupyter stack and additional software. By default, we have an image based on the scipy stack and all the latest software that anyone else has requested, as long as it is mutually compatible. You can request additional software, and this is shared among all courses. If you need something special, you may be asked to take our image and extend it yourself. Large version updates to the image are done twice a year during holidays.

    1. (optional) A sample python file or notebook to test that the environment works for your course (which will be made public and open source). We also use automated testing on our software images, so that we can be sure that our server images still work when they are updated. If you send us a file, either .py or .ipynb, we will add it to our automatic tests. The minimum is something like importing the packages you need; a more advanced test would exercise the libraries a little bit - do a minimal, quick calculation.

  3. Computational resources (optional, not recommended) A list of computational resources per image. Default is currently 2GB and 4 processors (oversubscribed). Note that because this is a container, only the memory of the actual Python processes are needed, not the rest of the OS, and memory tends to be quite small.

  4. Shared data directories. If you have nontrivial data which needs distributing, consider one of these shared directories, which saves it from being copied over and over. The notebook directory itself can only support files of up to 2MB, to prevent possible problems. If the number of students times the amount of data is more than a few hundred MB, strongly consider one of the data directories. Read more about this below.

    1. You can use the “shareddata” directory /mnt/jupyter/shareddata. shareddata is available in all notebooks on jupyter.cs.aalto.fi (even outside of your course) and also (eventually) other Aalto servers. This data should be considered public (and have a valid license), even though for now it’s only accessible to Aalto accounts.

    2. /coursedata is only available within your course’s environment (as chosen from the list). coursedata is also assumed to be public to everyone at Aalto, though you have more control over it.

    3. If you use either of these, you can embed the paths directly in your notebooks. This is easy for hub use, but makes it harder to copy the notebooks out of the hub to use on your own computers. This is something we are working on.

Also tell us if you want to join the jupyterhub-courses group to share knowledge about making notebooks for teaching.

Course data

See also

One of the best features of jupyter.cs is powerful data access. See Accessing JupyterHub (jupyter.cs) data

If your course uses data, request a coursedata or shareddata directory as mentioned above. You need to add the data there yourself, either through the Jupyter interface or SMB mounting of data.

If you use coursedata, just start the course environment and instructors should have permissions to put files in there. Please try to keep things organized!

If you use shareddata, ask for permission to put data there - we need to make the directory for you. When asking, tell us the (computer-readable, short) name of the dataset. In the shareddata directory, you will find a README file with some more instructions. All datasets should have a minimal README (copy the template) which makes them minimally usable for others.

In both cases, you need to chmod -R a+rX the data directory whenever new files or directories are added so that the data becomes readable to students.

Note: after you are added to the relevant group to access the data, it may take up to 12 hours for your account information to be updated so that the data can be accessed via remote mounting.

Don’t include large amounts of data in the assignment directories - at least four, if not more, copies of the data will be made for every student.

Data from other courses

Sometimes, when you are in course A’s environment, you want to access the data from course B. For example, A is the next year’s edition of the course B, and it could be useful to check the old files.

You can access the files for every course which you are an instructor of at the path /m/jhnas/jupyter/course/. The files/ sub-directory is the entire course directory for that course, the same as /course/ in each course image. You can also access the course data directory at data/ there.

All old courses (for which you are listed as an instructor) are available, but if a course is in the “archived” state, you can’t modify the files.

Nbgrader basics

Note

We have prepared a tutorial for students on how to fetch/submit assignments: Doing Assignments in JupyterLab. Feel free to share it with them/link to it in MyCourses pages.

“nbgrader is a tool that facilitates creating and grading assignments in the Jupyter notebook. It allows instructors to easily create notebook-based assignments that include both coding exercises and written free-responses. nbgrader then also provides a streamlined interface for quickly grading completed assignments.” - nbgrader upstream documentation

Currently you should read the upstream nbgrader documentation, which we don’t repeat here. You might also find the Noteable service’s nbgrader documentation useful. We have made some custom Aalto modifications (also submitted upstream), described under “Aalto specifics” below.

How to use nbgrader

Read the nbgrader docs! We can’t explain everything again here.

The course directory is /course/. Within this are source/, release/, submitted/, autograded/, and feedback/.

Things which don’t (necessarily) work in nbgrader
  • Autograde: if you click the thing, it will work, but it is the same as running all your students’ code on your own computer with no security whatsoever. A slightly clever student is able to see other students’ work (a privacy breach) or alter their grades.

  • Feedback: While it appears to work, it is designed to operate by hashing the contents of the notebook. Thus, if you have to edit the notebook to make it execute, the hash will be different and the built-in feedback distribution will not work.

  • Furthermore, don’t expect hidden tests to stay hidden, grading to happen actually automatically, things to be fully automatic, and so on. Do expect a computing environment optimized for learning.

These are just intrinsic to how nbgrader works. We’d hope to fix these sometime, but it will require a more coordinated development effort.

Aalto specifics
  • Instructors can share responsibilities, multiple instructors can use the exchange to release/collect files, autograde, etc. Note that with this power comes responsibility - try hard to keep things organized.

  • We can have the assignments in /notebooks while providing whole-filesystem access (so that students can also access /coursedata).

  • We’ve added some extra security and sharing measures (most of these are contributed straight to nbgrader).

  • Join the shared course repository to share knowledge with others

To use nbgrader:

  • Request a course as above.

  • Read the nbgrader user instructions.

  • You can use the Formgrader tab at the top to manage the whole nbgrader process (this automatically appears for instructors). This is the easiest way, because it will automatically set up the course directory, create assignment directories, etc. But, you can use the nbgrader command line, too. It is especially useful for autograding.

  • It’s good to know how we arrange the course directory anyway, especially if you want to manage things yourself without Formgrader. The “course directory” (nbgrader term) is /course. The original assignments go in /course/source. The other directories are /course/{nbgrader_step} and, for the most part, are automatically managed.

  • New assignments should be in /course/source. Also don’t use + in the assignment filename (nbgrader #928).

  • Manage your assignments with git. See below for some hints about how to do this.

  • If you ever get permission denied errors, let us know. nbgrader does not support multiple instructors editing the same files that well, but we have tried to patch it in order to do this. We may still have missed some things here.

Version control of course assignments

See also

Shared jupyterhub-courses version.aalto.fi Gitlab organization to share notebooks and knowledge about running JupyterHub courses.

git is a version control system which lets you track file versions, examine history, and share. We assume you have basic knowledge of git, and here we give practical tips for using git to manage a course’s files. Our vision is that you use git to manage the normal course files (the nbgrader source files), not the student submissions. Thus, to set up the next year’s course, you just clone the existing git repository to the new /course directory. You back up the entire old course directory to preserve the old students’ work. Of course, there are other options, too.

Create a new git repository in your /course/ directory and do some basic setup:

cd /course/
git init
git config core.sharedRepository group

You should make a .gitignore file excluding some common things (TODO: maybe more is needed):

gradebook.db
release/
submitted/
autograded/
feedback/
.nbgrader.log
.ipynb_checkpoints

The git repository is in /course, but the main subdirectory of interest is the source/ directory, which has the original files, along with whatever other course notes/management files you may have which are under /course. Everything else is auto-generated.

Autograding

Warning

nbgrader autograde is not secure, because arbitrary student code is run with instructor permissions. Read more from the instructor page.

Testing a course

Often, people ask “how can I test the assignments if I use nbgrader”? There are different options.

Test as an instructor

The instructor functions don’t overlap with the student functions: you don’t need some special way to test the student experience.

As an instructor, you can release assignments, then go to the student view, fetch, do, submit, etc. This is the same experience as students would get, and really is the full experience (there is not much else to test). You and your TAs can test this way - and of course you can add others just for the purpose of testing it this way.

Of course, you can add TAs just for the purpose of testing it like this, and this is recommended (as long as nothing secret is in the course directory at the time you are doing these tests - remember to remove them later). You can do this yourself using the group management service we send you (domesti.cs).

An instructor also has an option in the server list to spawn as a student. This hides the /course directory and makes the environment identical to that of a student (but it shouldn’t matter much).

Send assignments to testers yourself

Before all this fancy Jupyter interface, nbgrader was very simple: assignments were sent around manually. For example, instructors would post assignments on the course website, people would submit via the course site, and the submissions would be downloaded and unpacked into the right places in the course directory. This is still probably the best way to test things out.

Steps:

  • To send an assignment to someone: download the generated release version from /course/release/$assignment_id/$name.ipynb .

  • Send (e.g. email) to someone. They send it back to you when done. They can do the assignment on their own computer, or upload to jupyter.cs to do it (the “general use” server works fine).

  • To receive the assignment, put it back in the course dir as /course/submitted/$STUDENT_NAME/$assignment_id/$name.ipynb. $STUDENT_NAME is invented by you, but the others should match.

That is all: now you can autograde and all, completely normally. This is all that the web interface does anyway.
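A sketch of putting a returned notebook into place (the student name, assignment id, and notebook name are placeholders; the notebook filename should match the one in the assignment):

mkdir -p /course/submitted/test_student_1/assignment1/
cp ~/assignment1.ipynb /course/submitted/test_student_1/assignment1/assignment1.ipynb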

When you are done testing, you can delete these $STUDENT_NAME directories. There is also some command to delete them from the database if you want, or more likely you might remove the whole gradebook.db to make sure you start fresh.

The shell access (and other data access, see System environment) makes it easy to manage these files, copy them in and out, and so on.

Add student testers while in private mode

While your course is still in private mode, you can add dedicated student testers. This might be useful before the course becomes public.

  • While this works, we don’t recommend it unless you really need a lot of testers. It is manual work to set up, and manual work to remove. And likely we are going to forget to clean it up later.

  • Just like above, you may need to clean up these test students.

  • Send us a list of Aalto emails or usernames to add.

Request another course

In principle, you could request a whole other jupyter.cs course, just for testing, and we could add private students there. But this would be a lot of work for us (and some for you, when you need to transfer files over - but if you use git that part won’t be that bad).

In general, we don’t do this - one of the above options should work for you. Even if you do this, you likely have to combine with some of the above tasks (requesting us to add students while in private mode).

Nbgrader on your own computer

You can always install nbgrader yourself, on your own computer, to test out how it works. This probably isn’t for everyone, but it is an effective way to try things out.
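
A minimal sketch for trying it locally (this installs upstream nbgrader; see the hints further below for installing the same version that we run):

$ pip install nbgrader
$ nbgrader quickstart mytestcourse   # creates an example course directory to play with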

nbgrader hints

These are practical hints on using nbgrader for grades and assignments. You should also see the separate autograding hints page if you use that.

General

To export grades, nbgrader export is your central point. It will generate a CSV file (using a custom MyCourses exporter), which you can download, check, and upload to MyCourses. You can add the option --MyCoursesExportPlugin.scale_to_100=False to not scale points to 100.
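
For example, from a terminal in your course environment:

# Export grades as a CSV for MyCourses, without scaling points to 100
$ nbgrader export --MyCoursesExportPlugin.scale_to_100=False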

If students submit assignments/you use autograding

See also

Autograding

  • In each notebook (or at least the assignment zero), in the top, have a STUDENT_NUMBER = xxx which they have to fill in. Asking each student to include the student number in a notebook ensures that you can later write a script to capture it.
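
A minimal sketch of such a cell (the value shown is an example; each student replaces it with their own number):

# Fill in your own student number here
STUDENT_NUMBER = 123456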

Testing releasing assignments, without students seeing

Sometimes instructors want to release and collect assignments as a test, while the course is running. To understand why the solution is simpler than “make a new course”, we need to understand what “release” and “collect” do: they just move files around. So, you can just move the files to a different place instead of the exchange that all students see. The nbgrader docs don’t do a good job of explaining this, but behind the scenes it’s quite simple, and that simplicity means it’s easy to control if you know what you are doing…

You can equally move your test files around via a test, instructor-only exchange for your own testing. (Actually, this isn’t even needed: you can just copy the files directly, test, and put them back in the submitted/ directory. But some people want more.) So, from the Jupyter terminal, we have made these extra aliases:

# Release to test exchange (as instructor):
nbgrader-instructor-exchange release_assignment  $assignment_id
# Fetch from test exchange (as instructor, pretending to be a student):
nbgrader-instructor-exchange fetch_assignment  $assignment_id
# Submit to test exchange (as instructor, pretending to be a student):
nbgrader-instructor-exchange submit $assignment_id
# Collect to test exchange (as instructor):
nbgrader-instructor-exchange collect $assignment_id

This copies files to and from /course/test-instructor-exchange/, which you can examine and fully control. If you are doing this, you probably need that control anyway. These terms match the normal nbgrader terminology.

There’s no easy way to make a switch between “live exchange” and “instructor exchange” in the web interface, but because of the power of the command line, we can easily do it anyway.

(use type -a nbgrader-instructor-exchange to see just what it does.)

Known problems
  • The built-in feedback functionality doesn’t work if you modify the submitted notebooks (for example, to make them run). This is an upstream nbgrader limitation. Contact us and we can run a script that will release the feedback to your students.

Course data

If you use the /coursedata directory and want the notebook to be usable outside of JupyterHub too, try this pattern:

import os
if 'AALTO_JUPYTERHUB' in os.environ:
    DATA = '/coursedata'
else:
    DATA = 'put_path_here'

# when loading data, always os.path.join(DATA, 'the_file.py')

This way, the file can be easily modified to load data from somewhere else. Of course, many variations are possible.

Converting usernames to emails

JupyterHub has no access to emails or student numbers. If you do need to link usernames to email addresses, you can do the following. (Note: the format USERNAME@aalto.fi works for MyCourses upload, so this process is usually not needed these days.)

  • ssh to kosh.aalto.fi

  • cd to wherever you have exported a csv file with your grades (for example your course directory, cd /m/jhnas/jupyter/course/$course_slug/files/).

  • Run /m/jhnas/jupyter/software/bin/username-to-email.py exported_grades.csv - this will add an email column right after the username column. If the username column is not column zero (counting from zero), use the -c $N option to tell it which (zero-indexed) column the usernames are in.

  • Save the output somewhere, for example you could redirect it using > to a new filename. A full example:

    /m/jhnas/jupyter/software/bin/username-to-email.py mycourses_export.csv > mycourses_usernames.csv
    

This script is also available on github.

Our scripts and resources

Some scripts at https://github.com/AaltoSciComp/jupyter-wiki .

We plan to revise all of our instructor info soon, which may be useful to you later.

Autograding

Autograding is sometimes seen as the “holy grail” of using Jupyter for teaching. But you need to appreciate the scale of the task at hand and how to do it well.

Autograding

Warning

Running nbgrader autograde is not secure, because arbitrary student code is run with instructor permissions, including access to all instructor files and all other student data. We have designed our own system to make it secure, but we must run it for you. Contact us to use it. If you autograde yourself, you are making a choice to risk privacy of all students (probably violating Finnish law) and the integrity of your grades. This is a long-standing design flaw of nbgrader which we have fixed as best we can.

The secure autograder has to be run manually, by us. Fetch your assignments and contact us in good time.

How deep do you go?
  1. Normal Jupyter notebooks, no automation. You might use our JupyterHub to distribute assignments and as a way for students to avoid running their own software, but that’s all.

  2. Use nbgrader facilities to generate a student version of assignments, but handle grading yourself (“manually using nbgrader” or via some other system).

  3. Full autograding.

You may think “autograding will save me effort”. It may, but it creates a whole lot of effort in another way: making your assignment robust to autograding. As someone once said: plan for one day to write an assignment, one week to make it autogradeable, then weeks to make it robust. It doesn’t help that most reference material you can find is about basic programming, not about advanced data science projects.

If you use autograding, you have to test your notebooks with many students of different levels. Plan on weeks for this.

What is autograding?

nbgrader is not a fancy thing - it just copies files around. Autograding is only running the whole notebook from top to bottom and looking for errors. If there are errors, subtract points. There is not some major platform running in the background that does things actually automatically. This is also the primary benefit: a simple system allows your notebooks to be more portable and reusable, and match more closely to real work.

Autograding at Aalto
  1. Design your notebook well

  2. Collect your notebooks using the nbgrader interface. Don’t click any “autograde” buttons (unless you check the notebook yourself first).

  3. Send an email to guru specifying your course and assignment and asking for autograding. We will run the actually-secure autograding on our server as soon as we can, and send you a report on what worked or didn’t. Everything gets automatically updated in your environment.

  4. Proceed as normal, for example…:

  5. If autograding didn’t work for some people, you can check them, modify if needed, and re-run the autograding yourself (since you just checked it).

Designing notebooks for autograding

(please contribute or comment on these ideas)

Check out the upstream autograding hints, which include: hints on writing good test cases, checking if a certain function has been used, checking how certain functions were called, grading plots, and more. But when reading these, note how the examples use simple code - your cases will probably be more complex.

Understand the whole loop of transferring files from you, to student versions, to students, and back. Understand what the loop is not as well. Understand that there isn’t actual automatic autograding.

Have an assignment zero with no content and worth zero (or one) points, which students have to submit just to show they know how the system works (for example, they don’t forget to push “submit”). Maybe it just has some trivial math or programming exercises. This reduces the cognitive load when doing the real assignments.

Design your notebook with a mindset of unit testing. Note that this isn’t the way that notebooks are usually used, though. Functions and testable functions are good. But note that if you put everything in functions, you lose some of the main benefits of notebooks (interactivity made possible by having things in the top-level scope)! Such is life.

Have sufficient tests that are visible to the students, so that they can tell if their answers are reasonable. For example, student-visible tests might check for the shape of arrays, hidden tests check for the actual values. This also ensures that they are approaching it the way you expect.
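
For example, a sketch of a test cell using nbgrader’s hidden-test markers (the variable result and the expected shape and values here are made up):

import numpy as np

# Visible sanity check: students see this and can verify their result's shape
assert result.shape == (100, 3), "result should have 100 rows and 3 columns"

### BEGIN HIDDEN TESTS
# Hidden check of the actual values, seen only during autograding
assert np.allclose(result.sum(), 42.0, atol=1e-6)
### END HIDDEN TESTS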

Similarly, some instructors have found that you must have plenty of structure so that students only have to fill in well-defined chunks, with instructor code before and after. This ensures that students do “the right thing”, but also means that students lose the experience of the “big picture”: loading, preprocessing, and finalization - important skills for the future. Instead, they learn to fill in blanks and no more, no less. So, in this way autograding is a trade-off: more grade-able, less realistic about the full cycle of work.

Within your tests, use variable names that won’t have a conflict (for example, a random suffix like testval_randomstring36456165 instead of testval). This reduces the chance of one of your tests conflicting/overwriting something that the students have added.

Expect students to do everything wrong, and fail in weird ways. Your tests need to be robust.

Consider if your assignment is more open-ended, or there is one specific way to solve it. If it’s more open-ended, consider if it is even realistic to make it autogradeable.

nbgrader relies on metadata in order to do the autograding. In order for this to work, the cell metadata needs to be intact. Normally, you can’t even see it for a cell, but it can be affected if: a) cells are copied and pasted to another notebook file (metadata lost, autograding fails), or b) cells are split (metadata duplicated, and nbgrader then halts). You should ask students to copy the whole notebook file around when needed. You should also ask the students to generally avoid doing anything weird with the notebook files.

The environment variable NBGRADER_VALIDATING can be used to tell if the code is being run in the autograding context.

A notebook shouldn’t do extensive external operations when autograding, such as downloading data. For that matter, it should try to minimize these when running on JupyterHub, too (a course with 1000 students doesn’t need every student to download data separately - that’s a recipe to get us blocked). Request a /coursedata/ directory and you can put any type of data there for students to use. You can use this kind of conditional to handle these cases:

# Setup for if on Aalto JupyterHub or if we are autograding
import os

if 'AALTO_JUPYTERHUB' in os.environ or 'NBGRADER_VALIDATING' in os.environ:
    data_home = '/coursedata/scikit_learn_data/'
    # Make sure that it doesn't try to write new data here,
    # students won't be able to
else:
    data_home = None        # use the default on a personal computer

Warnings to give to students
  • Don’t copy and paste cells within a notebook. This will mess up the tracking metadata and prevent autograding from working.

  • Be cautious about things such as copying the whole notebook to Colab to work on it. This has sometimes resulted in removing all notebook metadata, making autograding impossible.

FAQ
  • This error message:

    [ERROR] One or more notebooks in the assignment use an old version
        of the nbgrader metadata format. Please **back up your class files
        directory** and then update the metadata using:
    
        nbgrader update .
    
    • There are various ways this can happen: perhaps the most common is a student duplicating a cell. There is no solution other than manually fixing the notebook, or grading it yourself. (The error message is confusing and doesn’t make sense: a wide variety of internal problems can cause this same error.)

Public copy of assignments

One disadvantage of a powerful system is that we have to limit access to authorized users. But you shouldn’t let this limit access to your course: there is nothing special about our system, and if you allow others to see your assignments, they can run them themselves. For example, the service https://mybinder.org allows anyone to run arbitrary notebooks from git repositories.

This is also important because your course environment will go away after a few months - do you want students to be able to refer to it later? If so, do the below.

  • change to the release/ directory and git init. Create a new repo here.

  • Manually git add the necessary assignment files after they are generated from the source directory. Why do we need a new repo? Because you can’t have the instructor solutions/answers made public.

  • Update files (git commit -a or some such) occasionally when new versions come out.

  • Add a requirements.txt file listing the different packages you need installed for a student to use the notebooks. See the MyBinder instructions for different ways to do this, but a normal Python requirements.txt file is easiest for most cases. On each line, put in a name of a package from the Python Package Index. There are other formats for R, conda, etc, see the page.

  • Then, push this release/ repo to a public repository (check mybinder for supported locations). Make sure you don’t ever accidentally push the course repository!

  • Then, go to https://mybinder.org/ and use the UI to create a URL for the resources. You can paste this button into your course materials, so that it’s a one-click process to run your assignments.

  • Note that mybinder has a limit of 100 simultaneous users for a repository, to prevent too much use by a single organization’s projects. This shouldn’t be the first place you direct students for day-to-day work.

  • If you have a /coursedata directory, you will have to provide these files some other way. You could put them in the assignment directory and the release/ git repository, but then you’ll need the notebooks to be able to load the data from two places: /coursedata or the current directory. I’d recommend checking whether /coursedata exists, setting a DATADIR variable accordingly, and then accessing all data files via os.path.join(DATADIR, 'filename.dat') (see the sketch after this list). This has the added advantage that it’s easy to swap out DATADIR later, too.
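
A sketch of that pattern as it would appear in a notebook cell (the data file name is an example):

import os

# Use /coursedata on JupyterHub, otherwise fall back to the current directory
if os.path.exists('/coursedata'):
    DATADIR = '/coursedata'
else:
    DATADIR = '.'

# Always build paths relative to DATADIR
data_path = os.path.join(DATADIR, 'filename.dat')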

FAQ and hints
Shared course repository

There’s a lot to figure out and everyone has to learn by doing. Why not learn from each other? We have a shared jupyterhub-courses repository on version.aalto.fi with a repository for each course. You can browse and learn from how other courses make notebooks, thus saving you time. It also makes it easier for us to help you.

  • Decide who should be added to the jupyterhub-courses Gitlab organization (usually those who have long-term contracts with Aalto). You can add whoever you want to your own course’s repository itself, but the organization side should be kept to a smaller group, so that TAs don’t get access to courses which they might themselves participate in.

  • Setup git for your course. This is something that you might have already done, but here are some general tips for nbgrader specifically.

  • After you have gotten access to the organization, you can create a course repository in version.aalto.fi and then set it up as a new remote for your git repository: git remote add new_remote_name {address}. (Github help)

  • Now you can push to this new remote! For example, if your new remote were named “gitlab”, then git push gitlab master would push to version.aalto.fi. Now you should be ready to go!

Instructions/hints
  • Request a course when you are sure you will use it. You can use the general use containers for writing notebooks before that point.

  • Don’t forget about the flexible ways of accessing your course data.

  • The course directory is stored according to the Science-IT data policy. In short, all data is stored in group directories (for these purposes, the course is a group). The instructor in charge is the owner of the group: this does not mean they own all files, but they are responsible for granting access and answering questions about what to do with the data in the long term. There can be a deputy who can also grant access.

  • To add more instructors/TAs, go to domesti.cs.aalto.fi and you can do it yourself. You must be connected to an Aalto network. See the Aalto VPN guide for help with connecting to an Aalto network from outside.

  • Store your course data in a git repository (or some other version control system) and push it to version.aalto.fi or some such system. git and relevant tools are all installed in the images.

  • You know that you are linked as an instructor to a course if, when you spawn that course’s environment, you get the /course directory.

  • You can now make a direct link that will spawn a notebook server, for example for a course with a slug of testcourse: https://jupyter.cs.aalto.fi/hub/spawn?profile=testcourse. If the user is already running a server, it will not switch to the new course. Expect some subtle confusion with this and plan for it.

  • We have a test course which you can use as a sandbox for testing nbgrader and courses. No data here is private, even after deletion, and data is not guaranteed to be persistent. Use it only for testing. Use the general use notebook for writing and sharing your files (using git).

  • The course environments are not captive: students can install whatever they want. Even if we try to stop them, they can use the general use images (which may get more software at any time) or download and re-upload the notebook files. Either way, autograding is done in the instructor’s environment, so if you want to limit the software that students can use, this must be done at the autograding stage or via other hacks.

    • 1) If you want to check that students have not used some particular Python module, have a hidden test that checks they haven’t used it, like: 'tensorflow' not in sys.modules.

    • 2) Autograde in an environment which does not have these extra packages. Really, #2 is the only true solution. See https://github.com/AaltoSciComp/isolate-namespace for information on doing this.

    • In all cases, it is good practice to pre-import all modules the students are expected to be able to use and tell students that other modules should not be imported.

  • Students should use you, not us, as the first point of contact for problems in the system. Please announce this to students. Forward relevant problems to us.

  • You can access your course files via SMB mounting at the URL smb://jhnas.org.aalto.fi/course/$courseslug/files/ and the course data using smb://jhnas.org.aalto.fi/course/$courseslug/data/ (with Windows, use \\ instead of / and don’t include smb://). This can be very nice for managing files. It may mess up group-writeability permissions. It will take up to half a day to be able to access the course files after you request your course.

  • You are the data controller of any assignments which students submit. We do not access these assignments on your behalf, and a submission of an assignment is an agreement between you and the student.

  • You should always do random checks of a fair fraction of notebooks, to avoid unexpected problems.

  • You can tell what image you have using echo $JUPYTER_IMAGE_SPEC.

  • A notebook can tell if it is in the hub environment if the AALTO_JUPYTERHUB environment variable is set.

  • A notebook can tell if it is being autograded by checking if NBGRADER_VALIDATING is set.

  • You can install an identical version of nbgrader as we have using:

    pip install git+https://github.com/AaltoSciComp/nbgrader@live
    

    This may be useful if you get metadata mismatch errors between your system and ours. There used to be more differences, these days the differences are minimal because most of our important changes have been accepted upstream.

  • You can get an environment.yml file of currently installed packages using:

    conda env export -n base --no-builds
    

    But note this is everything installed: you should remove everything from this file except what your assignments actually depend on, since being less strict increases the chances that it’s reproducible. nbgrader should be removed (it pins to an unreleased development version which isn’t available), and perhaps the prefix line should be removed too. For the actual versions installed, see the base and standard dockerfiles in the singleuser-image repo.

FAQ
  • Something with nbgrader is giving an error in the web browser. Try running the equivalent command from the command line. That will usually give you more debugging information, and may tell you what is going wrong.

  • I see Server not running … Would you like to restart it? This particular error also happens if there are temporary network problems (even a few seconds and it comes back). It doesn’t necessarily mean that your server isn’t running, but the page has no way to recover on its own. I always tell people: if you see this message, refresh the page. If the server is still running, it recovers. If it’s actually not running, it will give you the option to restart it again. If there are still network problems, you’ll see an error message saying that.

  • Gurobi: Gurobi has license issues, and it’s not clear if it can even be distributed by us. So far, we only support open software.

    But, courses have used gurobi before. They had students install it themselves, in the anaconda environment, and somehow told it what the Aalto license server was. For example, using the magic of “!” shell commands embedded in notebooks, it was something like this, which would automatically install gurobi for students and set the license file information:

    !conda install -c gurobi gurobi
    !echo [license_file_information] > ~/.[license_file_path]
    
  • I have done a test release/fetch/autograde of an assignment, and I want to re-generate it. It says I can’t since there are already grades. You also need to remove it from the database with the following command. Note that if students have already fetched, they will need to re-fetch it so don’t do this if it’s already in the hands of the students - you will only create chaos (see the point below).

    $ nbgrader db assignment remove ASSIGNMENT-ID
    
  • I have already released an assignment, and now I need to update it and release it again. Some students have already fetched it. This works easily if students haven’t fetched it yet, if they have it requires some manual work from them.

    What you need to do: (make sure the old version is git-committed), edit the source/ directory version, un-release the assignment, generate it again, release the assignment again. You might need to force it to fetch the assignment again, if it has already been fetched. (verify, TODO: let me know how you do this)

    On the student side: After an assignment is fetched, it won’t present the option to fetch it again (that would lose their work). Instead, they need to move the fetched version to somewhere else, then re-fetch. You can send the following instructions to your students:

    I have updated an assignment, and you will need to re-fetch it. Your work won’t be lost, but you will need to merge it into the new version.

    • First, make sure you save everything and close the notebooks.

    • Open a terminal in Jupyter

    • Run the following commands to change to the course assignment directory and move the assignment to a new place (-old suffix on the directory name):

      $ cd /notebooks/COURSE/
      $ mv ASSIGNMENT_ID ASSIGNMENT_ID-old
      
    • In the assignment list, it should now offer you to re-fetch the assignment.

    • You can now open both the new and the old versions (but to open the old version, you need to navigate to /notebooks/COURSE/ASSIGNMENT_ID-old yourself to see it).

    • If you have already submitted the assignment, submit again. The old submission is still there, but our collection should get the new one.

Contact

CS-IT. (Students: always contact your course instructors first.)

  • Chat via scicomp chat, https://scicomp.zulip.cs.aalto.fi, stream #jupyter.cs for quick questions (don’t send personal data here, it is public).

  • Issues needing action (new courses, autograding, software installation, etc) via the CS IT email alias guru @ cs dot aalto.fi

  • Realtime support at the daily garage (Triton, SciComp, RSE, and CS) every day at 13:00; Wednesdays are the focus days, but some help might be possible on other days (good for screensharing to show a problem - you can prepare us by mentioning your issue in the chat first). You can coordinate by chat to be sure.

More info

See the separate instructors guide. This service may be used either as general light computing for your students, or with nbgrader to release and collect assignments.

Privacy notice

Summary: This system is managed by Aalto CS-IT. We do not store separate accounts or user data beyond a minimal database of usernames and technical logs of notebooks which are periodically removed (this is separate from your data). The actual data (your data, course data) is controlled by you and the course instructor respectively. We do not access data except when necessary for the operation of the system, and then we may see file metadata (stat FILENAME) such as permissions, size, timestamp, and filename. Your personal data may be deleted once it has been inactive for one year, and at the latest once your Aalto home directory is removed (after your Aalto account expires). Course data is controlled by course instructors.

See the separate privacy policy document for more details.

FAQ and bugs
  • I started the wrong environment and can’t get back to the course selection list. In JupyterLab, use the menu bar, “Hub->Control Panel”. On the classic notebooks, use the “Control panel” button on the top right. (Emergency backup: you can always change the URL path to /hub/home).

  • Is JupyterLab available? Yes, and it’s nice. There are two general use instances that are actually the same, the only difference is one starts JupyterLab by default and one starts classic notebooks by default.

  • Can I login with a shell? Run a new terminal within the notebook interface.

  • Can I request more software be installed? Yes, let us know and we will include it if it is easy. We aim to have featureful environments by default, but won’t go so far as to install large specialist software. It should be in standard repositories (conda or pip for Python stuff).

  • Can I do stuff with my class’s assignments and not have it submitted? You have your personal storage space /notebooks/, which you can use for whatever you want. You can always make a copy of the assignment files there and play around with them as much as you want - even after the course is over, of course.

  • Are there other programming languages available? Currently there is Python, R, and Julia. More could be added if there is a good Jupyter kernel for it.

  • What can I use this for? Intended uses include anything related to courses, own exploration of programming, own data analysis, and so on (see Terms of Use above). Long-term background processing isn’t good (but it’s OK to leave small stuff running, close the tab, and come back).

  • When using nbgrader, how do I know what assignments I have already submitted? Currently you can’t beyond what is shown there.

  • Can I know right away what my score is after I submit an assignment with nbgrader? nbgrader is not currently designed for this.

  • Are there backups of data? Data storage is provided by the Aalto Teamwork system. There are snapshots available in .snapshot in every directory (you have to ls this directory in a shell using its full name for it to appear the first time). This service is not designed for long term data storage, and you should back up anything important because it will be lost after about one year or when your Aalto account expires. You should use git as your primary backup mechanism, obviously.

  • Is git installed? Yes, and you should use it. Currently you have to configure your username and email each time you use it, because this isn’t persistent (because home directories are not persistent). Git will guide you through doing this. In the future, your Aalto directory name/email will be automatically set. As a workaround, run git config without the --global option in each repository.

  • I don’t see “Assignment list”. You have probably launched the general use server instead of a course server. Stop your server and go spawn the notebook server of your course.

  • I’m getting an error code. Here are the ones we know about:

    • 504 Gateway error: The hub isn’t running in background. This may be hub just restarting or us doing maintenance. If it persists for more than 30 minutes, let someone know.

  • Stan/pystan/Rstan don’t work. Stan needs to do a memory-intensive compilation when your program is run. We can’t increase our memory limits too much, but we have a workaround: you need to tell your program to use the clang compiler instead of the gcc compiler by setting the environment variables CC=clang and CXX=clang++. For R notebooks, this should be done for you. For RStudio, we don’t know. For Python, put the following in your notebook:

    import os
    os.environ['CC'] = "clang"
    os.environ['CXX'] = "clang++"
    

    We should set this as the default, but want to be sure there are no problems first.

  • RStudio doesn’t appear. It seems that it doesn’t work from the Edge browser. We don’t know why, but try another browser.

  • I’ve exceeded my quota. You should reduce the space you use; the quota is 1GB. If this isn’t enough and you actually need more for your classes, tell your instructor to contact us. To find large directories and files: open a terminal and run du -h /notebooks/ | sort -h. Then clean up that stuff somehow, for example with rm -r. Note that .home/.local/share/jupyter/nbgrader_cache will continue to grow and eventually needs to be cleaned up - after the respective course is done.

  • I don’t see the assignments for my course. There are different profiles you can start, and you can’t tell which profile you have started. Go back to the hub control panel and restart your server. To be more precise, click the “Control Panel” in the upper-right corner, then click “Stop my Server”, wait a little bit, then click “Start My Server” and choose the profile for your course.

More info

Students, your first point of contact for course-related or Jupyter matters and bugs with JupyterHub should be your instructors, not us. They will answer questions and send the relevant ones to us. But, if you can actively help with other things, feel free to comment via the Github repositories below.

The preferred way to send feedback and development requests is via Github issues and pull requests. However, we’re not saying everything has to go through Github, so you can also send tickets to CS-IT.

Students and others who have difficulty in usage outside of a course can contact CS-IT via the guru alias.

Jupyter notebooks are not an end-all solution: for an entertaining look at some problems, see “I don’t like notebooks” by Joel Grus or less humorous pitfalls of Jupyter notebooks. Most of these aren’t actually specific to notebooks and JupyterLab makes some of the problems better, but thinking hard about the downfalls of notebooks makes your work better no matter what you do.

Our source is open and on Github.

Local LLM web APIs

As a pilot service, Aalto RSE runs a service providing some common open-source LLMs (llama2, mistral, etc.) via the web. This can be used for lightweight purposes via programming, but shouldn’t replace batch usage (use LLMs) or interactive chatting (use Aalto GPT).

Access

Currently this is not available publicly, but if you ask, we can provide development access. Chat with us in the #llms stream on Chat. That’s also the best way to contact the developers (other contact methods are in Help).

The API doesn’t have its own detailed documentation (ask us), but it should be OpenAI-compatible (for chat models), so many existing libraries work automatically.
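
For example, a minimal sketch using the openai Python client (the base URL, model name, and token below are placeholders; ask us for the real values):

from openai import OpenAI

# Placeholder endpoint and credentials - ask us for the real ones
client = OpenAI(base_url="https://llm.example.aalto.fi/v1", api_key="YOUR_TOKEN")

response = client.chat.completions.create(
    model="mistral",   # available model names depend on what is currently deployed
    messages=[{"role": "user", "content": "Say hello in Finnish."}],
)
print(response.choices[0].message.content)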

Intended use and resource availability

This is ideal if you need to run things through LLMs that run only on Aalto servers, and without many requests per second (this isn’t for batch use). This could, for example, be an alternative to running your own LLM server for basic testing or small question answering. It’s also good if you need to test various open-source LLMs before beginning batch work. It’s perfectly suited for intermittent daily use.

Right now, each model has limited resources (some running on CPUs and some on GPUs). They can serve a request every few seconds, but the resources could easily be overloaded. We intend to add resources as needed, depending on use. For any serious use, please contact us so that we can plan for the future. Don’t assume any stability or performance right now.

Technical implementation

Models run on local hardware on the Aalto University premises. Kubernetes is used to manage computing power, so in principle there is plenty of opportunity for scaling, but this is not turned on until a need is established. CPU resources are significant, but there are limited GPU resources (but that can change, depending on demand).

Standalone Matlab

General matlab hints: http://math.aalto.fi/opetus/Mattie/MattieO/matlab.html

Installation and license activation on staff-owned computers

The Matlab academic license permits installation on home computers for university personnel. Triton MDCS workers are available to anyone with a Triton account, which means the workers can be utilized from personal laptops as well.

Download image

Log into http://download.aalto.fi/ with your Aalto account. Look for the link Software for employees’ home computers, which will take you to the Matlab download links. Download the UNIX version for Linux and OSX, or the separate image for Windows.

The ISO image can be burned on a DVD or mounted on a virtual DVD drive.

  • Windows: Use MagicDisk or Virtual CloneDrive OR burn the image on DVD. Double click on setup.exe icon.

  • Linux:

    # sudo mkdir /mnt/loop
    # sudo mount -o loop Download/Matlab_R2010b_UNIX.iso /mnt/loop
    # sudo /mnt/loop/install.sh
    
  • Mac OS X: Double click on InstallForMacOSX.app icon.

Installation steps

Select the installer options as shown in the screenshots.

Mathworks account is required to continue with the installation.

  • Enter your account information in the installer to log in. If the password has been lost, click on the Forgot your password? option to receive your password by email. OR

  • Register to Mathworks with the installer.

    1. Click on I need to create an account.

    2. Enter your name and email address. To be recognized as an Aalto academic user, the email address must end in one of the aalto.fi, tkk.fi, hut.fi, hse.fi, hkkk.fi or uiah.fi domains.

    3. The installer will ask for an activation key, which is shown here in the last screenshot.

You may leave out unnecessary toolboxes and change the installation location. Remember however, that the Parallel Computing Toolbox is necessary to run any Matlab batch jobs on Triton.

Install Triton-MDCS integration scripts

Continue MDCS setup from Matlab Distributed Computing Server.

Stand-alone license activation on Aalto Linux laptops

To install Matlab and activate a stand-alone license on your Aalto Linux computer:

  • Install Matlab using the command: pkcon install matlab

  • Run /opt/matlab2022a/bin/activate_matlab.sh (replacing 2022a by whatever version you are using)

  • Select “Activate automatically using the Internet” and press Next.

  • Select the license saying “individual” and press Next.

  • Enter your Aalto user name as the login name and press Next.

  • Press Confirm.

Without a stand-alone license, you can only run Matlab if you have an internet connection to the Aalto network, or any internet connection plus an Aalto VPN connection. With the stand-alone license, you can run Matlab even without an internet connection.

FAQ
Matlab freezes with Out of Memory errors

Q: Matlab freezes and I get errors like this. What should I do?:

Exception in thread "Explorer NavigationContext request queue" java.lang.OutOfMemoryError: GC overhead limit exceeded
      at com.mathworks.matlab.api.explorer.FileLocation.<init>(FileLocation.java:89)
      at com.mathworks.matlab.api.explorer.FileLocation.getParent(FileLocation.java:126)
     ... ... ...

A1: Add more memory in Home -> Preferences -> General -> Java Heap memory

A2: Can you free up memory in your code sooner using the clear command? https://se.mathworks.com/help/matlab/ref/clear.html

GPU acceleration?

Q: Is there functional GPU acceleration? Does the acceleration even work?

A1: Run this code to query the GPU device and display it:

>> g = gpuDevice;
>> g

A2: Just query some feature:

>> fprintf('%s\n', g.ComputeCapability)

A3: Show multiple devices if found:

>> for ii = 1:gpuDeviceCount
       g = gpuDevice(ii);
       fprintf(1, 'Device %i has ComputeCapability %s \n', ...
               g.Index, g.ComputeCapability)
   end

Open Source at Aalto

Note

This policy was developed at the Department of Computer Science, in conjunction with experts from Research and Innovation services (both the legal and commercialization sides) with the intention of serving the wider community.

After more research, we have learned that this policy is, in fact, de facto applicable to all of Aalto; it is just extremely unclear that open source is actually allowed. Thus, this policy can be seen as best practices for all of Aalto. However, everyone (including CS) has more rights: one does not have to use this policy. You don’t have to use an open source license. IP ownership may be in more limited hands, so that you need fewer agreements to release.

However, we strongly encourage you to use this policy anyway. If you use this, you know that you are safe and have all permissions to make open source, regardless of your particular funding situation. It also ensures that you make proper open source software, for maximum benefit and open science impact.

References at bottom.

Researchers make at least three primary outputs: publications, software, and data. This policy aims to make openly releasing all types of work as straightforward as the traditional academic publishing process.

This document describes the procedure for Aalto employees releasing the output of their work openly (open source software, data, and publications). Aalto University encourages openness. This policy covers only cases where work can clearly be released openly with no bureaucracy needed. It does not cover complex cases, such as commercial software, work related to inventions, complex partnership agreements, etc. The policy is voluntary, and provides a right to release openly, but does not require it or preclude any other university process. (Thus it’s more of a guideline than a policy.) It is only relevant when the creator has an employment relationship with Aalto. If they don’t (e.g. students), they own their own work unless there is some other agreement in place (e.g. their own funding contract, grant, etc). Still, they can use this same process with no extra bureaucracy needed.

We realize that this policy does not cover all cases. We aim to cover the 99% case, and existing processes are used for complicated cases. Aalto Innovation Services provides advice on both commercialization and open source release.

This policy is for public licensing only (one to many). You must go through Research and Innovation Services for anything involving a multi-party agreement.

Why release?

The more people who see and build on our work, the more impact we can have. If this isn’t enough, you get more citations and attention. While we can’t require anything, we strongly encourage that all work is either made open source or taken through the commercialization process. If you don’t know what to do, don’t worry: they are not mutually exclusive. Proper open-source licensing can protect your ability to commercialize later. Talk to Innovation Services. They like open source, too, and will help you find the right balance. In any case, if work matches the criteria in this policy, it probably has limited commercial potential: what is more important is your own knowledge and skills that went into it.

You want to add a proper open source license to your work, rather than just putting code on some webpage. Without a license, others cannot build on your code, making your impact limited. No one will build on your work, and eventually it rots and gets lost.

You always want to go through this process as soon as possible at the beginning of a project: if you don’t, it becomes much harder to track everyone down.

You shouldn’t release as open source (yet) if your work is intentionally commercial or contains patentable inventions. In these cases, contact Innovation Services. In the second case (patentable inventions), according to Finnish legislation you are actually required to report the invention to Innovation Services.

Traps and acting early

Intellectual property rights don’t give you the right to do anything - they give you the right to block others from doing something. Thus, it is very important that you don’t end up in a situation where others can block you, and that means thinking early.

Decide on a license as soon as possible. Once it goes into the repository, future contributors implicitly agree to it. Otherwise, you are stuck trying to find all past contributors and get their agreement.

Another common trap is grants that are not open-source friendly. Not many outright ban open source, but some require permission from all partners, and if there are many partners this becomes close to impossible. Ask in advance; in the worst case, it might be that you simply can’t write software during the time you are paid by these projects!

Step-by-step guide for release under this policy
  1. Do these steps at the beginning of your project, not at the end!

  2. Check if the work is covered under the “conditions for limited commercial potential” in the policy.

  3. Choose a proper license to match your needs. See below for information. It must be open source, and you cannot transfer any type of exclusive license away - Aalto keeps full rights to future use.

  4. Get the consent of all authors and their supervisors and/or funders. There are no particular requirements for this, the only need is proving it later in case a question ever arises. You should also make sure that your particular funding source/collaboration agreements don’t have any further requirements on you. (For example, some grant agreements may say no GPL-type licenses without consent of all partners.) Your advisor (and Research and Innovation Services) can help you with this.

    If you are funded by Aalto basic funding, you by default have permission. Same goes for other big public funding agencies (Academy, EU… but the grant can always override this).

    If you are in services, follow your source of funding. At the very worst, whoever is responsible for your funding can decide, but it may be someone lower too.

  5. You are responsible for making sure that you have the right to release your code. For example, that there are no other agreements, rights (intellectual property and privacy), legal restrictions, or anything else restricting a release. Also, any other included software must have compatible licenses.

  6. Put a copyright license in the source repository. In the best case, each individual source file should list copyright and authors, but in practice if you don’t do this it’s not too much of a problem. Make sure that the license disclaims any warranty (almost all licenses will do this). After this, contributors implicitly consent to the license. If you have an important case, ask explicitly too. The important thing is that you have more evidence than the amount of scrutiny you might get (low in typical projects, will be higher if your project becomes more important).

  7. This policy is seen as Aalto transferring the rights to release to you, not Aalto releasing itself (just the same as with publications). Release in your own name, but you can (and should) list your affiliation.

  8. Make your code public if/when you want. No particular requirements here, but see below for best practices.

Any borderline or questionable cases should be handled by the existing innovation disclosure process.

In addition to the above requirements, the following are best practices:

  1. You can’t require that people cite you, but you can ask nicely. Make it easy to do this! Include the proper citations directly in the README. Make your code itself also citeable by publishing it somewhere (Github, Zenodo, …).

  2. Put on a good hosting location and encourage contributions. For example, Github is the most popular these days, but there are plenty of others. Welcome contributions and bug reports, and build on them. Make yourself the hub of expertise of your knowledge and methods.

Choosing a license

Under this policy, any Creative Commons, Open Source Initiative, or Free Software Foundation approved open source license is usable. However, you should not try to be creative; use the most common license that serves your needs.

Top-level recommendations:

  1. Use this nice site: https://choosealicense.com/. It contains everything you need to know, including what is here. If you need something more specific you can have a look at http://oss-watch.ac.uk/apps/licdiff/.

  2. MIT for software which should be basically public domain, Apache 2.0 for larger almost-public domain things (the Apache license protects more against patent trolling). Anyone can use this for any purpose, including putting it in their own proprietary, non-open products.

  3. GNU General Public License (GPL) (“v2 or any later version”) for software which you may try to commercialize in the future. This license says that others can not make it closed-source without your consent. Others can use it for commercial purposes, but all derivative work must also be made open source - so you keep an advantage.

For special cases:

  1. Lesser GNU General Public License (LGPL, GPL with classpath exception) type licenses. Suitable where the GPL would be appropriate but the software is a library. It can be embedded within other proprietary products, but the code itself must stay open.

  2. The Affero GPL/LGPL. These get around the “webservice loophole”: if your code is available via a webservice, the code running it must stay open.

  3. CC-BY for other non-software output.

Discussion:

  • Most public domain → MIT / Apache 2 > CC-BY > LGPL > GPL > AGPL → Most protection against proprietary use

  • If you think you might want to commercialize in the future: ask innovation services and they’ll help you release as open source now and preserve commercialization possibilities for the future.

The policy
Open Source Policy
Covered work
  1. Software

  2. Publications and other writing (Note that especially in this case, it is common to sign away full rights. This is a case where you do more than this policy says.)

  3. Data

Conditions for limited commercial potential

This policy supports the release of work with limited commercial potential. Work with commercial potential should be assessed via Aalto’s innovation process.

  1. If work’s entire novelty is equally contained in academic publications, there is usually little commercial value. Examples: code implementing algorithms, data handling scripts.

  2. Similarly, work which only is a byproduct of academic publications or other work probably has limited commercial value, unless some other factor overrides. For example: analysis codes, blog posts, datasets, other communications.

  3. Small products with limited independent value. If the time required to reproduce the work is small (one week or less), there is likely not commercial value. For example: sysadmin scripts, analysis codes, etc. Think about the time for someone else to reproduce the work given what you are publishing, not the time it took for you to create it.

  4. Should a work be contributing to an existing open project, there is probably little commercial value. For example: contribution to existing open-source software, Wikipedia edits, etc.

  5. NOT INCLUDED: Should work contain patentable elements or have commercial potential, this policy does not apply and it should be evaluated according to the Aalto innovation process. Patentable discoveries are anything which is a truly new, non-obvious, useful invention. In case of doubt, always contact Innovation Services! Indicators for this category: actually novel, non-obvious, useful, and actually an invention. Algorithms and math usually do not count, but expressions of these can.

  6. NOT INCLUDED: Software designed for mass-market consumption or business-to-business use should be evaluated according to the Aalto innovation process. Indicators for this category: large amount of effort, software being a primary output.

Ownership of intellectual property rights at Aalto
  1. This policy covers work of employees whose contracts assign copyright and other intellectual property rights of their work to Aalto. However, the Aalto rules for ownership of IP are extremely complicated, so see the last point.

  2. Your rights are assigned to Aalto if you are funded by external funding, or if there are other Aalto agreements regarding your work.

  3. If neither of the points in (2) apply to you AND your work is independent (self-decided and self-directed), then according to Finnish law you own all rights to your own work. You may release it how you please, and the rest of this policy does NOT apply (but we recommend reading it anyway for valuable advice). Aalto Innovation Services can serve you anyway.

  4. Rather than figure out the ownership of each piece of work, this policy is written to apply to all work, so that you do not need to worry about this.

Release criteria and process
  1. This policy applies to copyright only, not other forms of intellectual property. Should a work contain other intellectual property (which would not be published academically), this policy does not apply. In particular, this policy does not cover any work which contains patentable inventions.

  2. The employee and supervisor must consider commercial potential. The guidelines in the “conditions for limited commercial potential” may guide you. Should there be commercial potential, go through the existing innovation disclosure processes. In particular, any work which may cover patentable inventions must be reported first.

  3. If all conditions are satisfied, you, in consultation with your PI, supervisor, or project leader (whichever is applicable) and any funder/client requirements, may choose to release the work. Should the supervisor or PI have a conflict of interest or possible conflict of interest, their supervisor should also be consulted.

  4. Depending on funding sources, you may have more restrictions on licensing and releasing as open source. Project proposals and grant agreements may contain provisions relevant to releasing work openly. When making project proposals, consider these topics already. When in doubt, contact the relevant staff.

  5. To be covered under this policy, work must be licensed under an open / open source / free software license. In case of doubt, Creative Commons, Open Source Initiative, and Free Software Foundation approved open source licenses are considered acceptable. See below for some license recommendations.

  6. All warranty must be disclaimed. The easiest way of doing this is by choosing an appropriate license. Practically all of them disclaim warranty.

  7. All authors must consent to the release terms.

  8. The employee should not transfer an exclusive license or ownership to a third party. Aalto maintains the right to use the work internally or commercially, and to re-license it should circumstances change.

  9. Employees should acknowledge their Aalto affiliation, if this is possible and within the community norms.

  10. This right should not be considered Aalto officially releasing any work, but allowing the creators to release it in their own name. Thus, Aalto does not assume liability or responsibility for work released in this way. Copyright owner/releaser should be listed as the actual authors.

  11. Employees are responsible for ensuring that they have the right to license their work as open source, for example ensuring that all included software and data is compatible with this license and that they have permission of all authors. Also the release must be allowed by any relevant project agreements. Should you have any doubts or concern, contact Innovation Services.

To apply this to your work, first receive any necessary permissions. In writing, by email, is sufficient. Apply the license in your name, but list Aalto University as an affiliation somewhere that makes sense. Do not claim any special Aalto approval for your work.

For clarity, raw official text is separate from the guidance on this page. Current approvals: Department of Computer Science (2017-03-17).

How to run a good open-source software project

One of the largest benefits to open source is having a community of people contributing back to you. To do this, you need to have a good environment. Open development, good style and a basic contribution guide, and encouragement is the base of this. Eventually, this section may contain some more pointers to how to create this type of community. (TODO)

References

Overleaf

Aalto provides a professional site license to the whole community. For more information, see https://www.overleaf.com/edu/aalto.

In order to link yourself to Aalto, you must register for and have an ORCID [wikipedia]. An ORCID (“Open Researcher and Contributor ID”) is a permanent ID which is used for linking researchers to their work; for example, some journals require linking to an ORCID. ORCID can be accessed directly with your Aalto account.

TODO: determine exact procedure and update here

Aalto rates overleaf as for “public” data. This doesn’t mean that Overleaf makes your data public, but just that Aalto can’t promise security. In reality, you decide if Overleaf is secure enough. If there is some legal requirement for security, you probably shouldn’t use Overleaf. If there is a collaborator requirement for security, then you must make your own choice if Overleaf is suitable.

Paniikki: Computer Lab For Students

Paniikki is a cutting-edge computer lab in the Computer Science department. It is located in the T-building, room C106 (right under lecture hall T1). This documentation is a Paniikki cheatsheet.

< The blue box at the entrance is Paniikki >

For more services directed at students, see Welcome, students!.

The name

Paniikki means “panic” in English, which is a fascinating name, as people in Paniikki are often in panic. I don’t know which comes first, the space or the emotion. Anyway, people experience both simultaneously.

Access
Physical

You can access Paniikki in the T-building, room C106. It is right by the building’s main entrance (you can see it through the windows).

Remote

You can ssh via the normal Aalto shell servers kosh and lyta. Going through them, you can then ssh to one of the Paniikki computers. Be warned, there is no guarantee that you get an empty one… if it seems loaded (use htop to check), try a different one.

You can find the hostnames of the Paniikki computers on aalto.fi.
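
For example (the Paniikki hostname below is a placeholder; check the list on aalto.fi for real names):

$ ssh username@kosh.aalto.fi
$ ssh some-paniikki-host     # placeholder hostname from the aalto.fi list
$ htop                       # check the load before starting heavy work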

Hardware

CPU properties

  Model                 Intel(R) Xeon(R) CPU E5-1650 v4 @ 3.60GHz
  Architecture          x86_64
  CPU(s)                12
  Thread(s) per core    2
  Max MHz               4000.0000
  Virtualization        VT-x
  L1d cache             32K
  L1i cache             32K
  L2 cache              256K
  L3 cache              15360K

GPU properties

  Model                      NVIDIA Quadro P5000
  Core                       GP104GL (Pascal-based)
  Core clock                 1607 MHz
  Memory clock               1251 MHz
  Memory size                16384 MiB
  Memory type                256-bit GDDR5X
  Memory bandwidth           320 GB/s
  CUDA cores                 2560
  CUDA compute capability    6.1
  OpenGL                     4.5
  OpenCL                     1.2
  Nearest GeForce model      GeForce GTX 1080

Memory properties

  RAM    32 GiB

Software

First things first: you don’t have sudo rights on Aalto classroom machines, and you can’t get them, because the machines are shared. We try to provide a good base of software that covers most people’s needs; if you need more, ask via servicedesk@aalto.fi.

  Python via Anaconda: module load anaconda
  Python (system): available by default
  Tensorflow: in the Python environments, e.g. the anaconda module above

Modules

In short, module is a software environment management tool. With module you can manage multiple versions of software easily. Here are some sample commands:

  module load NAME          load a module
  module avail              list all modules
  module spider PATTERN     search modules
  module spider NAME/ver    show prerequisite modules for this one
  module list               list currently loaded modules
  module show NAME          details on a module
  module help NAME          details on a module
  module unload NAME        unload a module
  module save ALIAS         save the current module collection under this alias (saved in ~/.lmod.d/)
  module savelist           list all saved collections
  module describe ALIAS     details on a collection
  module restore ALIAS      load a saved module collection (faster than loading individually)
  module purge              unload all loaded modules (faster than unloading individually)

There are some modules set up specifically for different courses: if you just load the environment with “module load”, you will have everything you need.

Read the details on the Module environment page.
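
For example, to save a set of modules you often load as a collection and restore it later (a sketch; the module names are just examples):

$ module load anaconda matlab/2017b
$ module save mytools        # stored under ~/.lmod.d/
# ... later, in a new shell ...
$ module restore mytools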

Example 1

Assume we are in Paniikki and want to do our homework for CS-E4820 Machine Learning: Advanced probabilistic methods. In this course students use Tensorflow and Edward.

# Check available modules
$ module load courses/    # Tab to auto-complete

# Finally you will complete this
$ module load courses/CS-E4820-advanced-probabilistic-methods.lua

# Check the module you loaded
$ module list

Currently Loaded Modules:
        1) courses/CS-E4820-advanced-probabilistic-methods

# Check the packages
$ conda list    # You will see Tensorflow etc.

# Launch Jupyter
$ jupyter notebook

# Do your homework

# You are done and want to un-load all the modules?
$ module purge
Example 2: General Python software

Need Python and general software? The anaconda modules have Python, a bunch of useful scientific and data packages, and machine learning libraries.

# Latest Python 3
$ module load anaconda

# Old Python 2
$ module load anaconda2
Example 3: List all software

You can check all other modules as well:

$ module avail

< Available modules in Paniikki as of 2018 March 8th >

You want to use Matlab?

$ module load matlab/2017b
$ matlab
Questions?

If you have any questions, please contact servicedesk@aalto.fi and clearly mention the Paniikki classroom in the message.

Python on Aalto Linux

The scientific Python ecosystem is also available on Aalto Linux workstations (desktops), including the anaconda (Python 3) and anaconda2 (Python 2) modules providing the Anaconda Python distribution. For a more in-depth description, see the generic Python page under the scientific computing docs.

On Aalto Linux laptops, these instructions don’t work. Instead, we recommend installing Anaconda or Miniconda yourself and managing packages via environments. You can also install Python packages through the system package manager, but that can cause problems with installing your own libraries if not managed carefully.
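
For example, a minimal sketch of a user-level Miniconda install (the installer URL and flags may change; check the Miniconda download page for current instructions):

$ wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
$ bash Miniconda3-latest-Linux-x86_64.sh -b -p ~/miniconda3
$ ~/miniconda3/bin/conda init bash    # set up your shell, then open a new terminal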

Anaconda on Aalto Linux

You can mostly use Python like normal - see Python.

To create your own anaconda environments, first load the Anaconda module:

$ module load anaconda

then you get the conda command. If you get an error such as:

NotWritableError: The current user does not have write permissions to a required path.
path: /m/work/modules/automatic/anaconda/envs/aalto-ubuntu1804-generic/software/anaconda/2020-04-tf2/1b2b24f2/pkgs/cache/18414ddb.json

Try the following to solve it (this prevents conda from trying to store its downloaded files in the shared directory):

$ conda config --prepend pkgs_dirs ~/.conda/pkgs
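
After that, creating and activating your own environment works as usual, for example (the environment name and packages are just examples):

$ conda create -n myenv python=3.6 numpy scipy
$ source activate myenv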
The “neuroimaging” environment

On the Aalto Linux workstations and Triton, there is a conda environment which contains an extensive collection of Python packages for the analysis of neuroimaging data, such as fMRI, EEG and MEG.

To use it on Aalto Ubuntu workstations and VDI:

$ ml purge
$ ml anaconda3
$ source activate neuroimaging

To use it on Triton:

$ ml purge
$ ml neuroimaging

To see the full list of packages that are installed in the environment, use:

$ conda list

Some highlights include:

  • Basic scientific stack

    • numpy

    • scipy

    • matplotlib

    • pandas

    • statsmodels

  • fMRI:

    • nibabel

    • nilearn

    • nitime

    • pysurfer

  • EEG/MEG:

    • mne

    • pysurfer

  • Machine learning:

    • scikit-learn

    • tensorflow

    • pytorch

  • R:

    • rpy2 (bridge between Python and R)

    • tidyverse

Finally, if you get binaries from the wrong environment (check with which BINARYNAME) you may need to update the mappings with:

$ rehash
MNE Analyze

Note: this has been tested only on NBE workstations. If you wish to run mne_analyze from your workstation, follow this procedure. Open a new terminal and make sure you have the bash shell (echo $SHELL; if you do not have it, just type bash), and then:

$ module load mne
$ source /work/modules/Ubuntu/14.04/amd64/common/mne/MNE-2.7.4-3434-Linux-x86_64/bin/mne_setup_sh
$ export SUBJECTS_DIR=PATHTOSUBJECTFOLDER
$ export SUBJECT=SUBJECTID
$ mne_analyze

Please note that the path in the “source” command might change with more up-to-date versions of the tool. “PATHTOSUBJECTFOLDER” and “SUBJECTID” are specific to the data you have. Please refer to the MNE documentation for more help on these.

Mayavi

If you experience problems with the 3D visualizations that use Mayavi (for example MNE-Python’s brain plots), you can try forcing the graphics backend to Qt5:

  • For the Spyder IDE, set Tools -> Preferences -> Ipython console -> Graphics -> Backend: Qt5

  • For the ipython consoles, append c.InteractiveShellApp.matplotlib = 'qt5' to the ipython_config.py and ipython_kernel_config.py configuration files. By default, these can be found in ~/.ipython/profile_default/ (see the sketch after this list).

  • In Jupyter notebooks, execute the magic command %matplotlib qt5 at the beginning of your notebook.
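
For the ipython configuration files, a minimal sketch of the second option (assuming the default profile location):

$ echo "c.InteractiveShellApp.matplotlib = 'qt5'" >> ~/.ipython/profile_default/ipython_config.py
$ echo "c.InteractiveShellApp.matplotlib = 'qt5'" >> ~/.ipython/profile_default/ipython_kernel_config.py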

Installation of additional packages

The “neuroimaging” environment aims to provide everything you need for the analysis of neuroimaging data. If you feel a package is missing that may be useful for others as well, contact Marijn van Vliet. To quickly install a package in your home folder, use pip install <package-name> --user.

Remote Jupyter Notebook on shell servers

See also

We now have a General use student/teaching JupyterHub installation which may serve your uses more simply.

Here we describe how you can utilise Aalto computing resources for Jupyter Notebook remotely. The guide is currently targeted at UNIX users.

Aalto provides two “light computing” servers: brute.org.aalto.fi, force.org.aalto.fi. We demonstrate how to launch a Jupyter Notebook on brute and access it on your laptop.

< System activity on Brute >

ssh username@brute.org.aalto.fi

# Create your Kerberos ticket
kinit

# Create a session. I use tmux
tmux

# Load Anaconda
module load anaconda

# Create your env
conda create -n env-name python=3.6 jupyter

# Activate your python environment
source activate env-name

# Launch jupyter notebook in headless mode and a random port number
jupyter notebook --no-browser --port=12520

Note

You might get messages like The port 12520 is already in use, trying another port while starting the notebook server. In that case, take note of the port the server is actually running on, e.g.:

[I 15:42:14.187 NotebookApp] The Jupyter Notebook is running at:
[I 15:42:14.187 NotebookApp] http://localhost:12470/?token=kjsahd21n9...

and replace “12520” below with the correct port number, 12470 in this case.

Now back to your laptop

# Forward the port
ssh -L 12520:localhost:12520 -N -f -l username brute.org.aalto.fi

Now launch your browser and go to http://localhost:12520 with your token.

Zulip

See also

Instructors, see the relocated instructor page at Zulip for instructors.

Aalto Scicomp Zulip - researcher and staff discussion

If you are a researcher looking for the ASC chat for help and support, see the chat help section or log in directly at https://scicomp.zulip.cs.aalto.fi .

Zulip is an open-source chat platform, which CS hosts at Aalto. It is used as a chat platform for some courses, and allows better student and chat privacy.

The primary distinguishing feature of Zulip is topics, which allows one to make order out of a huge number of messages. By using topics, you can narrow to a certain thread of conversation while not losing sight of the overall flow of messages.

Zulip for instructors
Introduction

Zulip is an online discussion tool with LaTeX support. It has been used by some Aalto teachers as an external service on individual courses. For spring and summer 2021, Zulip was provided by Aalto CS as a pilot solution for all School of Science departments’ course needs. For autumn 2021 and spring 2022, the pilot at SCI continues and is widened on a small scale to other schools as well. The pilot refers to a) a fixed-term project with clear lifecycle needs, as in courses which start and end at certain times and after which the Zulip instance can be deleted; b) a transitional period between the current state and possible production use or a change to other solutions; and c) a basic solution without all the fancy features or user interface. During the pilot, users are expected to provide feedback, which will affect decision-making about future solutions and the development of usability.

CS-IT hosts the Zulip chat instances for you. These chat instances are hosted at <chat-name>.zulip.aalto.fi (older instances at <chat-name>.zulip.cs.aalto.fi). Login to the chats is available with Aalto accounts. Email registration for external users is also possible via invitations. After logging in for the first time with an Aalto account, if no matching Zulip account is found, you are prompted to “Register” and create one. Once the Zulip account has been created, it should be linked to your Aalto credentials.

Internal or confidential matters should not be discussed on the platform.

Get started / request Zulip

Note

Chat realms can be requested using the form at https://zulip.aalto.fi/requests/.

Note

If you encounter issues, report them to CS-IT or on #zulip-support at scicomp.zulip.cs.aalto.fi

You can also give/discuss feedback, complaints or suggestions on #zulip-feedback at scicomp.zulip.cs.aalto.fi

Note

You can test out Zulip at testrealm.zulip.cs.aalto.fi. Use the Aalto login. This chat is for testing only.

After you have received the chat instance

Within a few days of requesting an instance, you should receive details of your chat instance by email. After this you:

  • Can login to the chat instance <chat-instance>.zulip.cs.aalto.fi with your Aalto account

  • Should already have the owner role assigned.

  • Can configure the chat instance from (cog wheel in the top-right corner) -> Manage organization

    • Please carefully read the Configuration section before making changes

  • Can appoint more admins/owners (e.g. TAs)

    1. Ask them to login first

    2. Change their role from Manage organization -> Users

Configuring your organization

Below are listed the most important settings found under Manage organization in Zulip. There is no easy way for us to enforce these, so it is your responsibility as organization owner or admin to make sure they are set correctly. Make sure any owners/admins you appoint are aware of these as well.

Note

Settings that are not mentioned here you can configure to your liking. However, you should still exercise care, since you are responsible for the service and the safety of your users’ data. If you would like advice, please ask us.

Organization settings / Video chat provider

  • Set to None

  • The default provider (Jitsi) has not been evaluated or approved by Aalto

  • Integration with Aalto Zoom may come later on

Organization permissions / Invitation settings

Do not set both “Organization permissions → Invitations = not required” and “Authentication methods → Email = enabled” at the same time.

You can allow signup by Aalto account or by any email address. You can allow anyone to sign up, or make it invitation-only. But you cannot set “anyone with an Aalto account may sign up without an invitation, but by email you must be invited” (a Zulip limitation). So we have to work around this, otherwise bots and random people might join your chat. If the chat needs to include external users, make it invite-only.

The exact questions and answers:

  • Are invitations required for joining in the organization?

    • If you are only allowing Aalto Login (see ‘Authentication methods’): Can be set to No,… (But still, anyone with Aalto account can join)

    • If you are allowing external users/email registration (see ‘Authentication methods’ below): Set to Yes, only admins can send invitations. (You can invite people via their Aalto email address for Aalto login)

Organization permissions / Who can access user email addresses

  • Set this to Admins only or Nobody

Organization permissions / Who can add bots

  • Set to Admins only

  • Consult CS-IT before deploying any bots

Authentication methods

  • AzureAD

    • This is Aalto Login and should be enabled

  • Email

    • This allows users to register using an email address

    • We cannot allow random people or bots to register freely

    • If you enable this, make the chat invitation only as described in ‘Invitation settings’ above, for the reason described there.

Users

  • You can manage users here.

  • Please be careful with who you assign admins/owners. These roles should be only given to course staff.

  • The “moderator” role can have extra permissions assigned, such as managing streams and renaming topics. This can be good for course staff/TAs.

Other settings, up to you

  • You can allow messages to be edited for a longer time using Settings → Organization Settings. It is often useful to set this to a longer period.

Practical hints

There is a fine line between a discussion platform and chat, normal chat and topic-based chat, and chaos and order. Here, we give suggestions for you, based on what other teachers have learned.

  • Topics (basically, like a subject for a message thread) are the key feature of Zulip. They are explained more below, but basically they keep things organized. If you don’t want to use them or they don’t match your flow, you won’t like the model.

  • Read the guidelines for students to see the importance of topics and the three ways to use Zulip, and how we typically manage the flood of information in practice.

  • Give these guidelines to your students (copy and paste from the student page).

  • Consider why you want a course chat.

    • Do you want a way to chat and ask questions/discuss in a lower-threshold platform than forum posts? Then this could be good.

    • Do you want a Q&A forum or support center? Then this may work, but would MyCourses be a better forum?

    • Do you want a place for students groups to be able to chat among small groups?

    • Do you mainly want announcements? Then maybe simply use MyCourses?

  • Create your channels (“streams”) before your students join, and make the important ones default streams (this is done under “Manage organization”), so that everyone will be subscribed (since people will always forget to join streams).

    • If you do create a new default stream later, use the “clone subscribers” option to clone from another default stream, so that everyone will be subscribed.

    • Some common streams you might want are #general, #announcements, #questions. Some people have one stream per homework, exam, theme, and/or task.

    • The main point of streams is to be able to independently filter, mute, and subscribe to notifications. For example, it might be useful to view all questions about one homework in order, or request email notifications from the #announcements stream.

  • You can create user groups (teams) with a certain name. The group can be @-mentioned together, or added to a stream.

  • Moderators (and others) can organize other people’s messages by topic. Edit the message to do this, including other people’s. Hotkey is e.

  • If you want a Q&A forum, make a stream called #questions, or smaller streams for specific topics, and direct students there.

    • You can click the check mark by a topic to mark it as resolved.

    • Remind students to make a new topic for each new question. This enables good follow-up via “Recent topics”.

    • If students don’t make a new topic (or a topic goes off-track), edit the message and change the topic (change topic for “this message and all later messages”). Then, you keep questions organized, findable, and trackable.

    • If you don’t want to be answering questions in private messages (who does?… it leads to duplicate work), make a clear policy of either reposting the questions publicly yourself (without identification), or directing the students to repost in the public stream themselves.

  • If you want to limit what students can do, you can consider disabling:

    • Adding streams, adding others to streams (if you want people to only ask and not make their own groups).

    • Disable private messages (if you really don’t want personal requests for help).

    • Adding bots, adding custom emojis.

    • Seeing email addresses. Changing their name.

  • On the other hand, you might want to “allow message editing” to a much longer period and allow message deleting. For Q&A these are quite useful to have.

  • You can use the /poll [TITLE] command to make lightweight non-anonymous polls. For anonymous polls, someone has used a bot called Errbot, but we don’t currently know much about that.

FAQ
  • Is there an easier way than subscribing students manually when streams are created? Yes, you should never be doing that manually. See above for cloning membership of a stream from another.

  • Isn’t it too much work to have to give a topic to every message? Well, you don’t have to when replying. And this is sort of a natural trade-off needed to keep things organized and searchable: you have to think before you send. Most people consider this a worthy trade-off. Note that you can change the topic of messages after the fact, just talk and organize later as needed.

Extra requested features

(see also the student page)

  • Anonymous polls (a pull request exists with this feature)

  • Anonymous discussion

  • More fine-grained permissions for TAs. DONE: moderator role now exists.

  • Support for bots and other advanced features (more like permission to recommend them, bot support works very well already).

  • Pinned topics (pull request exists, high-priority issue, #19483).

  • Long-term invitations (upcoming, high-priority issue, #20337)

Basics
Streams and Topics

In Zulip, discussions are organized into streams, which are further divided into topics.

Views

The left sidebar lets you narrow down the messages that are displayed; you can select:

  • All messages, to efficiently see everything that is being posted.

  • Recent topics, to see which topics have new information.

  • Different streams and topics, to narrow down to a specific stream or topic.

Recent topics is good to manage a flood of information (see what’s new, click on relevant stuff, ignore all the rest). All messages is better when you are caught up and want to make sure you don’t miss anything. Viewing single topics and streams is good for catching up on something you don’t remember.

Of course, everyone has their own ways and workflows, so you should experiment to find what works best and which views are useful for you.

Message Pane

In the middle of your screen, you have the Message Pane, where the messages are shown.

< Message Pane. This is the basic view of messages. You can click on various places to narrow your view to one conversation or reply. >

Selecting visible topics

Not all streams are visible in the sidebar by default.

Click the gear icon above the channel list in order to see all available streams and select which ones you want to participate in. It is good to occasionally look at this menu in case new streams are added.

< Recent topics, another view of recent activity that shows activity per-topic. >

Hints on using Zulip efficiently
How to ask a question

Seems obvious, doesn’t it? You can get the best and fastest answers by helping to keep things organized. These recommendations are mainly for Q&A-forum type chats.

  • First, search history to see if it has already been asked.

    • If so, click on the topic name. You will narrow your view to see that entire conversation.

  • If your question isn’t answered yet, but is a follow up to an existing topic, click on a message in that topic. Then, when you ask, it will go to that same topic as a follow-up, and anyone else can narrow to see the whole history.

    < Replying to an existing topic. >

    • Unlike other chats, your message will not get lost, and people will both see that it is new and can see the history of that thread.

    • Your course can say what the threshold for “new topic” is. Maybe they would have one topic per question pre-created or something clever like that.

  • If you don’t find anything relevant to follow up on, make a new topic.

    < Making a new topic. >

    • Select the stream you want to post to (whatever fits best).

    • Click “New topic”.

    • Enter the topic name down below: a few words, like an email subject. For example, week 1 question 3, integrals of complex functions, exam preparation.

    • Enter your message and send.

Others (or you…) can split or join topics if they want by going to “edit message”, so there is no risk of doing something wrong. Don’t worry, just ask!

By being organized, you get the speed of chat combined with the organization of not missing anything.

Other hints
  • You can format your messages using Zulip markdown.

  • Are you annoyed by having to enter a topic every time you send a message? Remember, when replying you don’t need to. Otherwise, it’s a trade-off: a little thought before you send keeps things organized and searchable, and most users find that worth it. But don’t worry too much: if you happen to get things wrong, others can re-organize topics afterwards.

  • “Mute a stream” (or topic) is useful when you want to stay subscribed but not be notified of messages by default. You can still find it if you click through the sidebar.

  • Since Zulip 8.0, you can mute/default/follow (receive notifications) per-topic, for every topic (instead of only muting a topic). This is very powerful. Note that your Notification Settings control when a topic is automatically followed; you might want to adjust that default.

  • You can also request notifications for everything in a certain stream. This could be good for announcement streams, or your particular projects.

  • The desktop and mobile apps can support multiple organizations. At least on mobile apps, switching is kind of annoying.

Apps

There are reasonable applications for most desktop and mobile operating systems. These don’t send your data to any other services.

The mobile applications work, but may not be the best for following a large number of courses simultaneously. We can’t currently make improvements in them.

Open issues

We are aware of the following open issues:

  • It is annoying to have one chat instance per course (but it seems to be standard in chats these days).

  • There are no mobile push notifications (it would cost too much to use the main Zulip servers, and we haven’t decided to build our own apps yet. info).

  • Likewise with built-in video calls (via https://meet.jit.si or Zoom).

  • Various user interface things. But Zulip is open-source, so feel free to contribute to the project…

Cheatsheets: CS, Data.

Data management

In this section, you can find some information and instructions on data management.

Data

Data connects most research together. It’s easy to make in the short term, but in the long term it can become so chaotic that it loses its value. Can you access your research group’s data from 5 years ago and use it?

Data storage in Aalto

Data in Science-IT departments

Requesting Science-IT department storage space

Existing data groups and responsible contacts:

Requesting to be added to a group

Note

CS department: New! Group owners/managers can add members to their groups self-service. Go to https://domesti.cs.aalto.fi from Aalto networks, over VPN, or remote desktop at https://vdi.aalto.fi, and it should be obvious.

Send an email to the responsible contact (see above) and CC the group owner or responsible person, and include this information:

  • Group name that you request to join

  • copy and paste this statement, or something similar: “I am aware that all data stored here is managed by the group’s owner and have read the data management policies.”

  • Ask the group owner to reply with confirmation.

  • Do you need access to scratch or work? If so, you need a Triton account and you can request it now. If you don’t, you’ll get “input/output error” and be very confused.

  • Example:

    Hi, I (account=omes1) would like to join the group myprof. I am aware that all data stored here is managed by the group’s owner and have read the data management policies. $professor_name, please reply confirming my addition.

Requesting a new group

Send an email to the responsible contact (see above) with the following information. Group owners should be long-term (e.g. professor level) staff.

  • Requested group name (you can check the name from the lists below)

  • Owner of data (prof or long-term staff member)

  • Other responsible people who can authorize adding new members to the group (they can reply and say “yes” when someone asks to join the group).

  • Who is responsible for data should you become unavailable (default: supervisor who is probably head of department).

  • Initial members

  • Expiration time (default=max 2 years, extendable. max 5 years archive). We will ping you for management/renewal then.

  • Which filesystems and what quota (project, archive, scratch). See the storage page.

  • Basic description of purpose of group.

  • Is there any confidential or personal data (see above for disclaimer).

  • Any other notes that CS-IT should enforce, for example check NDA before giving access.

  • Example:

    I would like to request a new group coolproject. I am the owner, but my postdoc Tiina Tekkari can also approve adding members. (Should I become unavailable, my colleague Anna Algorithmi (also a professor here) can provide advice on what to do with the data)

    We would like 20GB on the project filesystem.

    This is for our day to day work in algorithms development, we don’t expect anything too confidential.

Science-IT department data principles

Note

Need a place to store your data? This is the place to look. First, we expect you to read and understand this information, at least in general. Then, see Requesting Science-IT department storage space.

This page is about how to handle data - not the raw storage part, which you can find at data storage. Aalto has high-level information on research data management, too.

What is data management?

Data management is much more than just storage. It concerns everything from data collection, to data rights, to end-of-life (archival, opening, etc). This may seem far-removed from research practicalities, but funding agencies are beginning to require advanced planning. Luckily, there are plenty of resources at Aalto (especially in SCI), and it’s just a matter of connecting the dots.

Oh, and data management is also important because without data management, data becomes disorganized, you lose track, and as people come and go, you lose knowledge of what you have. Don’t let this happen to you or your group!

Another good starting point is the Aalto research data management pages. These pages can also help with preparing a data management plan.

Data management is an important part of modern science! We are here to help. These pages both describe the resources available at Aalto (via Science-IT), and provide pointers to issues that may be relevant to your research.

Data storage at Aalto SCI (principles and policies)

Note

This especially applies to CS, NBE, and PHYS (the core Science-IT departments). The same is true for everyone using Triton storage. These policies are a good idea for everyone at Aalto, and are slowly being developed at the university level.

Most data should be stored in a group (project) directory, so that multiple people can access it and there is a plan for what happens after you leave. Ask your supervisor/colleagues what your group’s existing groups are and where the data is stored. Work data should always be stored in a project directory, not personal home directories. See below for how to create or join a group. Home directory data cannot be accessed by IT staff, according to law and policy: data there dies when you leave.

All data in group directories is considered accessible to all members (see below).

All data stored should be Aalto or research related. Should there be questions, ask. Finnish law and Aalto policies must be followed (in that order), including by IT staff. Should there be agreements with third-parties regarding data rights, those will also be followed by IT staff, but these must be planned in advance.

All data must have an owner and a lifespan. We work with large amounts of data from many different people, and data without clear ownership becomes a problem. (“Ownership” refers to decision-making responsibility, not IPR ownership.) Also, there must be a clear successor for when people leave or become unavailable. By default, this is the supervisor.

Personal workstations are considered stateless and, unless there is special agreement, could be reinstalled at any time and are not backed up. This should not concern day to day operations, since by default all data is stored on network filesystems.

We will, in principle, make space for whatever data is needed. However, it is required that it be managed well. If you can answer what the data contains, why it’s stored, how the space is used, and why it’s needed, it’s probably well managed for these purposes.

Read the full Science-IT data management policy here.

Information on all physical locations and how to use them is on the storage page.

Groups

Everywhere on this page, “group” refers to a file access group (such as a unix group), not an organizational (research) group. They will often be the same, but there can be many more access groups made for more fine-grained data access.

Data is stored in group directories. A group may represent a real research group, a specific project, or specific access-controlled data. These are easy to make, and they should be extensively used to keep data organized. If you need either finer-grained or wider data access, request that more groups be made.

Please note that, by design, all project data is accessible to every member of the group. This means that, when needed, IT can fix all permissions so that all group members can read all data. For access control finer-grained than these project groups, please have a separate group created. Data in a group is considered “owned and managed” by the group owner on file. The owner may grant access to others and change permissions as needed. Unless otherwise agreed, any group member may also request permissions to be corrected so that everyone in the group has access.

  • Access control is provided by unix groups (managed in the Aalto active directory). There can be one group per group leader, project, or dataset that needs isolation. You should use many groups; they make overall management easier. A group can be a sub-group of another.

  • Each group can get its own quota and filesystem directories (project, archive, scratch, etc). Quota is per-filesystem. Tell us the requested quota when you set up a project.

    • A typical setup would be: one unix group for a research group, with more groups for specific projects when that is helpful. If there are fixed multi-year projects, they can also get a group.

  • Groups are managed by IT staff. To request a group, mail us with the necessary information (see bottom of page).

  • Each group has an owner, quota on filesystems, and some other metadata (see below).

  • Group membership is per-account, not tied to employment contracts or HR group membership. If you want someone to lose access to a group you manage, they have to be explicitly removed by the same method they were added (asking someone or self-service, see bottom of page).

  • To have a group created and storage space allocated, see below.

  • To get added to a group, see instructions below.

  • To see your groups: use the groups command or groups $username

  • To see all members of a group: getent group $groupname

Common data management considerations
Organizing data

This may seem kind of obvious, but you want to keep data organized. Data is always growing in volume and variety, so if you don’t organize it as it is being made, you have no chance of doing it later. Organize by:

  • Project

  • To be backed up vs can be recreated

  • Original vs processed.

  • Confidential or not confidential

  • To be archived long-term vs to be deleted

Of course, make different directories to sort things. But the group system described above is also one of the pillars of good data organization: sort things by group and storage location based on how the data needs to be handled.

Backups

Backups are extremely important, not just for hardware failure, but consider user error (delete the wrong file), device lost or stolen, etc. Not all locations are backed up. It is your responsibility to make sure that data gets stored in a place with sufficient backups. Note that personal workstations and mobile devices (laptops) are not backed up.

Openness

Aalto strongly encourages sharing data openly or under controlled access, with a goal of 50% of data shared by 2020 (see the Aalto RDM pages). In short, Aalto says that you “must” make strategic decisions about openness for the best benefits (which in practice probably means you can do what you would like). Regardless, being open is usually a good idea when you can: it builds impact for your work and benefits society more.

Zenodo (https://zenodo.org/) is an excellent platform for sharing data, getting your data cited (it provides a DOI), and controlling what you share with different policies (https://about.zenodo.org/policies/). For larger data, there are other resources, such as IDA/AVAA provided by CSC (see below).

There are lists of data repositories: r3data, and Nature Scientific Data’s list.

Datasets can and should also be listed on ACRIS, just like papers - this allows you to get credit for them in the university’s academic reporting.

Data management plans

Many funders now require data management plans when submitting grants. (Aside from this, it’s useful to think practically about how you’ll deal with your data.)

Please see:

Summary of data locations

Below is a summary of the core Science-IT data storage locations.

project

  Purpose: Research time storage for data that requires backup. Good for e.g. code, articles, other important data. Generally for a small amount of data per project.
  Where available: Workstations, Triton login node.
  Backup: Weekly backup to tape (to recover from major failure) + snapshots (recover accidentally deleted files). Snapshots go back hourly for the last 26 working hours (8-20), daily for the last 14 days, and weekly for the last 10 weeks.
  Group management: yes

archive

  Purpose: Data with a longer life than project. Practically the same, but better to sort things out early. Also longer snapshots, and guaranteed to get backed up to tape.
  Where available: Workstations, Triton login node. /m/$dept/archive/$group.
  Backup: Same as above.
  Group management: yes

scratch (group based) / work (per-user)

  Purpose: Large research data that doesn’t need backup. Temporary working storage. Very fast access on Triton.
  Where available: /m/$dept/scratch/$groupname, /m/$dept/work/$username.
  Backup: no
  Group management: scratch: yes, work: no

See data storage for full info.

Requesting data storage space

See Requesting Science-IT department storage space.

Filesystem details

This page gives details of available data storage spaces, with an emphasis on scientific computing access on Linux.

Other operating systems: Windows and OSX workstations do not currently have any of these paths mounted. In the future, project and archive may be automatically mounted. You can always remote-mount via sshfs or SMB. See the remote access page for Linux, Mac, and Windows instructions for home, project, and archive. In OSX, there is a shortcut in the launcher for mounting home. On Windows workstations, home is the Z: drive. On your own computers, you may need to use AALTO\username as your username for any of the SMB mounts.

Laptops: Laptops have their own filesystems, including home directories. These are not backed up automatically. Other directories can be mounted as described on the remote access page.
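
As a sketch of the sshfs option (assuming you have access to taltta.aalto.fi and to the group directory; the department and group names here are placeholders):

$ mkdir -p ~/mnt/project
$ sshfs username@taltta.aalto.fi:/m/cs/project/mygroup ~/mnt/project
# ... work on the files over the mount ...
$ fusermount -u ~/mnt/project    # unmount when done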

Summary table

This table lists all available options in Science-IT departments, including those not managed by the departments. In general, project is for most research data that requires good backups. For big data, use scratch. Request separate projects when needed to keep things organized.

home

  Path (Linux): /u/…/$username/unix/
  Triton: no
  Quota: 100 GiB
  Backups: yes, $HOME/../.snapshot/
  Notes: Used for personal and non-research files.

project

  Path (Linux): /m/$dept/project/$project/
  Triton: some
  Quota: per-project, up to 100s of GiB
  Backups: yes, hourly/daily/weekly (.snapshot)

archive

  Path (Linux): /m/$dept/archive/$project/
  Triton: some
  Quota: per-project, up to 100s of GiB
  Backups: yes, hourly/daily/weekly + off-site tape backups (.snapshot)

scratch

  Path (Linux): /m/$dept/scratch/$project/
  Triton: yes
  Quota: per-project, 2 PiB available
  Backups: RAID6, but no backups.
  Notes: Don’t even think about leaving irreplaceable files here! Need Triton account.

work (Triton)

  Path (Linux): /m/$dept/work/$username/
  Triton: yes
  Quota: 200 GB default
  Backups: RAID6, but no backups.
  Notes: Same as scratch. Need Triton account.

local

  Path (Linux): /l/$username/
  Triton: yes
  Quota: usually a few 100s of GiB available
  Backups: no, and destroyed if the computer is reinstalled.
  Notes: Directory needs to be created and permissions should be made reasonable (quite likely ‘chmod 700 /l/$USER’; by default it has read access for everyone!). Space usage: `du -sh /l/`. Not shared among computers.

tmpfs

  Path (Linux): /run/user/$uid/
  Triton: yes
  Quota: local memory
  Backups: no
  Notes: Not shared.

webhome

  Path (Linux): $HOME/public_html/ (/m/webhome/…)
  Triton: no
  Quota: 5 GiB
  Notes: https://users.aalto.fi/~USER/

custom solutions

  Contact us for special needs, like sensitive data, etc.
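
For the local row above, a minimal sketch of setting up your local directory with sane permissions:

$ mkdir /l/$USER
$ chmod 700 /l/$USER      # by default the directory would be readable by everyone
$ du -sh /l/*             # check how the local space is being used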

General notes
  • The table above lists the available filesystems; the Details section below expands on each.

  • The path /m/$dept/ is designed to be a standard location for mounts. In particular, this is shared with Triton.

  • The server magi is magi.cs.aalto.fi and is for the CS department. The home directory is mounted there without Kerberos protection, but directories under /m/ need an active Kerberos ticket (which can be acquired with the ‘kinit’ command). taltta is taltta.aalto.fi and is for all Aalto staff. Both use normal Aalto credentials.

  • Common problem: The Triton scratch/work directories are automounted. If you don’t see a directory, enter its full name and it will appear; after you have accessed it by its full name once, tab completion works too.

  • Common problem: These filesystems are protected with Kerberos, which means that you must be authenticated with Kerberos tickets to access them. This normally happens automatically, but tickets expire after some time. If you are using systems remotely (e.g. the shell servers) or have jobs running in the background, this may become a problem. To solve it, run kinit to refresh your tickets (see the example below).
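
For example:

$ klist     # list your current Kerberos tickets and their expiry times
$ kinit     # enter your Aalto password to get a fresh ticket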

Details
  • home: your home directory

    • Shared with the Aalto environment, for example regular Aalto workstations, Aalto shell servers, etc.

    • Should not be used for research work, personal files only. Files are lost once you leave the university.

      • Instead, use project for research files, so they are accessible to others after you leave.

    • Quota 100 GiB.

    • Backups recoverable by $HOME/../.snapshot/ (on linux workstations at least).

    • SMB mounting: smb://home.org.aalto.fi/

  • project: main place for shared, backed-up project files

    • /m/$dept/project/$project/

    • Research time storage for data that requires backup. Good for e.g. code, articles, other important data. Generally for small amount (10s-100s GiB) of data per project.

    • This is the normal place for day to day working files which need backing up.

    • Multi user, per-group.

    • Quotas: from 10s to 100s of GiB

    • Quotas are not designed to hold extremely large research data (TiBs). Ideal case would be 10s of GiB, and then bulk intermediate files on scratch.

    • Weekly backup to tape (to recover from major failure) + snapshots (recover accidentally deleted files). Snapshots go back:

      • hourly last 26 working hours (8-20)

      • daily last 14 days

      • weekly last 10 weeks

      • Can be recovered using .snapshot/ within project directories

    • Accessible on magi/taltta at the same path.

    • SMB mounting: smb://tw-cs.org.aalto.fi/project/$group/

  • archive:

    • /m/$dept/archive/$project/

    • For data that should be kept accessible for 1-5 years after the project has ended. Alternatively, a good place to store a copy of large original data (a backup).

    • This is practically the same as project, but retains snapshots for longer so that data is ensured to be written to tape backups.

    • This is a disk system, so does have reasonable performance. (Actually, same system as project, but separation makes for easier management).

    • Quotas: 10s to 1000s of GiB

    • Backups: same as project.

    • Accessible on magi/taltta at the same path.

    • SMB mounting: smb://tw-cs.org.aalto.fi/archive/$group/

  • scratch: large file storage and work, not backed up (Triton).

    • /m/$dept/scratch/$group/

    • Research time storage for data that does not require backup. Good for temporary files and large data sets where the backup of original copy is somewhere else (e.g. archive).

    • This is for massive, high performance file storage. Large reads are extremely fast (1+ GB/s).

    • This is a lustre file system as part of triton (which is in Keilaniemi).

    • Quotas: 10s to 100s of TiB. The university has 2 PB available total.

    • In order to use this, you must have a triton account. If you don’t, you get “input/output error” which is extremely confusing.

    • On workstations, this is mounted via NFS (and accessing it transfers data from Keilaniemi on each access), so it is not fast on workstations, just large file storage. For high performance operations, work on triton and use the workstation mount for convenience when visualizing.

    • This is RAID6, so it is pretty well protected against single disk failures, but not backed up at all. It is possible that all data could be lost. Don’t even think about leaving irreplaceable files here. CSC actually had a problem in 2016 that resulted in data loss. It is an extremely rare (once in decades) event, but it can happen. (Still, it’s better than your laptop or a drive on your desk. Human error is the greatest risk here.)

    • Accessible on magi/taltta at the same path.

    • SMB mounting: smb://data.triton.aalto.fi/scratch/$dept/$dir/. (Username may need to be AALTO\yourusername.)

  • Triton work: personal large file storage and work (Triton)

    • /m/$dept/work/$username/

    • This is the equivalent of scratch, but per-person. Data is lost once you leave.

    • Accessible on magi/taltta at the same path.

    • SMB mounting: smb://data.triton.aalto.fi/work/$username. (Username may need to be AALTO\yourusername.)

    • Deleted six months after your account expires.

    • Not to be confused with Aalto work (see below).

  • local: local disks for high performance

    • You can use local disks for day to day work. These are not redundant or backed up at all. Also, if your computer is reinstalled, all data is lost.

    • Performance is much higher than any of the other network filesystems, especially for small reads. Scratch+Triton is still faster for large reads.

    • If you use this, make sure you set UNIX permissions to restrict the data properly. Ask if you are not sure.

    • If you store sensitive data here, you are responsible for physical security of your machine (as in no one taking a hard drive). Unix permissions should protect most other cases.

    • When you are done with the computer, you are also responsible for secure management/wiping/cleanup of this data.

    • See the note about disk wiping under Aalto Linux (under “when you are done with your computer”). IT should do this, but if it’s important you must mention it, too.

  • tmpfs: in-memory filesystem

    • This is a filesystem that stores all data in memory. It is extremely high performance, but extremely temporary (lost on each reboot). Also shares RAM with your processes, so don’t use too much and clean up when done.

    • TODO: are these available everywhere?

  • webhome: web space for users.aalto.fi

    • This is the space for users.aalto.fi; it can be accessed from the public_html link in your home directory.

    • This is not a real research filesystem, but convenient to note here.

    • Quota (2020) is 5 GiB. (/m/webhome/webhome/)

    • https://users.aalto.fi/~USER/

  • triton home: triton’s home directories

    • Not part of departments, but documented here for convenience

    • The home directory on Triton.

    • Backed up daily.

    • Not available on workstations.

    • Quota: 1 GB

    • Deleted six months after your account expires.

  • Aalto work: Aalto’s general storage space

    • /work/$deptcode on Aalto workstations and servers.

    • Not often used within Science-IT departments: we use project and archive above, which are managed by us and practically equivalent. You could request space from here, but expect less personalized service.

    • Aalto home directories are actually here now.

    • You may request storage space from here: email the Aalto servicedesk and request space on work. The procedures are not very well established.

    • Data is snapshotted and backed up offsite for disaster recovery.

    • Search https://it.aalto.fi for “work.org.aalto.fi” for the latest instructions.

    • SMB mounting via smb://work.org.aalto.fi

  • Aalto teamwork: Aalto’s general storage space

    • Not used directly within Science-IT departments: we have our own direct interfaces to this, and project and archive directories are actually here.

    • For information on getting teamwork space (outside of Science-IT departments), contact servicedesk.

    • Teamwork is unique in that it is arbitrarily extensible, and you may buy the space from the vendor directly. Thus, you can use external grant money to buy storage space here.

    • SMB mounting via smb://teamwork.org.aalto.fi

Confidential data handling

Confidential data is data which has some legal reason to be protected.

Confidential or sensitive data

Note

The following description is written for the CS department, but applies almost equally to NBE and PHYS. It is being expanded and generalized to other departments as well. Regardless of your department, these are good steps to follow for any confidential data at Aalto.

Note

This meets the requirements for “Confidential” data, which covers most use cases. If you have extreme requirements, you will need something more (but be careful about making custom solutions).

Aalto has some guidelines for classification of confidential information, but they tend to deal with documents as opposed to practical guidelines for research data. If you have data which needs special attention, you should put it in a separate group and tell us when creating the group.

The following paragraph is a “summary for proposals”, which can be used when the CS data security needs to be documented. This is for the CS department, but a similar one can be created for other departments. A longer description is also available.

Aalto CS provides secure data storage for confidential data. This data is stored centrally in protected datacenters and is managed by dedicated staff. All access is through individual Aalto accounts, and all data is stored in group-specific directories with per-person access control. Access rights via groups is managed by IT, but data access is only provided upon request of the data owner. All data is made available only through secure, encrypted, and password-protected systems: it is impossible for any person to get data access without a currently active user account, password, and group access rights. Backups are made and also kept confidential. All data is securely deleted at the end of life. CS-IT provides training and consulting for confidential data management.

If you have confidential data at CS, follow these steps. CS-IT takes responsibility that data managed this way is secure, and it is your responsibility to follow CS-IT’s rules. Otherwise you are on your own:

  • Request a new data folder in the project from CS-IT. Notify them that it will hold confidential data and any special considerations or requirements. Consider how fine-grained you would like the group: you can use an existing group, but consider how many people will have access.

  • Store data only in this directory on the network drive. It can be accessed from CS computers, see data storage.

  • To access data from laptops (Aalto or your own), use network drive mounting, not copying. Also watch out for temporary files: don’t store intermediate work or let your programs save temporary files on your own computer.

  • Don’t transfer the data to external media (USB drives, external hard drives, etc) or your own laptops or computers. Access over the network.

  • All data access should go through Aalto accounts. Don’t send data to others or create other access methods. Aalto accounts provide central auditing and access control.

  • Realize that you are responsible for the day to day management of data and using best practices. You are also responsible for ensuring that people who have access to the data follow this policy.

  • In principle, one can store data on laptops or external devices with full disk encryption. However, in this case we do not take responsibility unless you ask us first: you must ask us about this. In general it’s best to adapt to the network drive workflow. (Laptop full disk encryption is a good idea anyway.)

We can assist in creating more secure data systems, as can Aalto IT security. It’s probably more efficient to contact us first.

Personal data (research data about others, not about you)

“Personal data” is any data concerning an identifiable person. Personal data is very highly regulated (mainly by the Personal Data Act, soon by the General Data Protection Regulation). Aalto has a document that describes what is needed to process personal data for research, which is basically a research-oriented summary of the Personal Data Act. Depending on the type of project, approval from the Research Ethics Committee may be needed (either for publication, or for human interaction. The second one would not usually cover pure data analysis of existing data). Personal data handling procedures are currently not very well defined at Aalto, so you will need to use your judgment.

However, most research does not need data to be personally identifiable, and thus research is made much simpler. Thus, you want to try to always make sure that data is not identifiable, even to yourself using any technique (anonymization). The legal requirement is “reasonable likelihood of identification”, which can include technical and confidentiality measures, but in the end is still rather subjective. Always anonymize before data arrives at Aalto, if possible. Let us know when you have personal data, so we can make a note of it in the data project.

However, should you need to use personal data, the process is not excessively involved beyond what you might expect (informed consent, ethics, but then a notification of personal data file). Contact us for initial help in navigating the issues and RIS for full advice.

Boilerplate descriptions of security

For grants, etc. you can find a description under Boilerplate text for grant proposals

Long-term archival

Long-term archival is important to make sure that you have the ability to access your group’s own data in the long term. Aalto resources are not currently intended for long-term archival. There are other resources available for this, such as:

Leaving Aalto

Unfortunately, everyone leaves Aalto sometime. Have you considered what will happen to your data? Do you want to be remembered? This section currently is written from the perspective of a researcher, not a professor-level staff member, but if you are a group leader you need to make sure your data will stay available! Science-IT (and most of these resources) are focused on research needs, not archiving a person’s personal research data (if we archive it for a person who has left, it’s not accessible anyway! Our philosophy is that it should be part of a group as described above.). In general, we can archive data as part of a professor’s group data (managed in the group directories the normal ways), but not for individuals.

  • Remember that your home directories get removed when your account expires (we think in only two weeks!).

  • Data in the group directories won’t be automatically deleted. But you should clean up all your junk and leave only what is needed for future people. Remember, if you don’t take care of it, it becomes extremely hard for anyone else to. The owner of the group (professor) is responsible for deciding what to do with the data, so make sure to discuss it with them and make it easy for them to do the right thing!

  • Make sure that the data is documented well. If it’s undocumented, it’s unusable anyway.

  • Can your data be released openly? If you can release something as open data on a reputable archive site like Zenodo, you can ensure that you will always have access to it. (The best way to back up is to let the whole internet do it for you.)

  • For lightweight archival (~5 years past last use, not too big), the archive filesystem is suitable. The data must be in a group directory (probably your professor’s). Make sure that you discuss the plans with them, since they will have to manage it.

  • IDA (see above) could be used for archival of any data, but you will have to maintain a CSC account (TODO: can this work, and how?). Also, these projects have to be owned by a senior-level staff person, so you have to transfer it to a group anyway.

  • Finland aims to have a long-term archival service by 2017 (PAS), but this is probably not intended for own data, only well-curated data. Anyway, if you need something that long and it isn’t confidential, consider opening it.

Data organization

How should data be stored? On the simplest level, this asks “on what physical disks”, but this page is concerned about something more high-level: how you organize data on those disks.

Data organization is very important: if you don’t do it early, you end up with an epic mess which you will never have time to clean up. If you organize data well, then everything afterwards becomes much easier: you can archive what you need, others can find what they need, and you can open what you need easily.

Everything here applies equally if you are working alone or if you are part of a team.

Organize your projects into directories
Names

As simple as it seems, choosing a good name for each distinct workspace is an important first step. This serves as an identifier to you and others, and by having a name you are able to refer to, find, and organize your data now and in the future.

A name should be unique among all of your work over your whole career, and also unique among all of your colleagues’ work (and any major public projects, too). Don’t reuse the same names for related things. For example, let’s say I have a project called xproject. If I track the code separately from the data, I’d have a different directory called xproject-data, and the main project refers to the data directory instead of copying the data.

How many named workspaces should you have for each project? It depends on how large they are and how diverse the types of data are. If the data is small and not very demanding, it doesn’t matter much. If you have large data vs small other files, it may be good to separate out the data. If you have some data/code/files which will be reused in different projects, it makes sense to split them. If you have confidential data that can’t be shared, it’s good to separate them from the rest of the data.

Names should be usable as directory names and identifiers. Try to stick to letters, numbers, -, and _ - no spaces, punctuation, or symbols. Then the name is usable on repositories and other services, too.

Good names include MobilityAnalysis, transit, transit-hsl, and lgm-paper. Bad names are too general given their purpose or what else you might do.

Each directory’s contents moves together as a unit, as much as possible.

Organizing these directories

You should have a flat organization in as few places as possible. For example, on your laptop you may have ~/project for the stuff you mainly work on and ~/git for other minor version-controlled things. On your workstations or servers, you may also have /scratch/work/$username which is your personal space that is not backed up, /m/cs/project/$groupname/$username/ which is backed up, /local which is temporary space on your own computer, and so on. The server-based locations can be easily shared among multiple people.

Your structure should be as flat as possible, without many layers in each directory. Thus, to find a given project, you only need to look inside each of the main locations above, not inside every other project. This allows you to get the gist of your data for future archival or clean-up. When two directories need to refer to each other, you have them directly refer to each other where they are, for example use ../xproject-data from inside the xproject directory. (You can have subdirectories inside the projects).

Different types of projects go in different places. For example, xproject can be on the backed up location because it’s your daily work, while xproject-data is on some non-backed up place because you can always recover the data.
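For example, a flat layout might look like this (a sketch only; the directory and project names are placeholders):

~/project/xproject/                  # main working copy, in a backed-up location
~/project/xproject-data/             # large data, referred to as ../xproject-data from xproject
/scratch/work/$username/xproject/    # non-backed-up copy for heavy runs on the cluster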

Synchronizing

If you work on different systems, each directory of the same name should have roughly the same contents - as if you could synchronize it with version control.

For small stuff, you might synchronize with version control. You may use some other program, like Dropbox or the like. Or in the case of data which has a master copy somewhere else, you just download what you need.
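For example, a minimal synchronization sketch (assuming git for the code and rsync for a data directory; the names are placeholders and USERNAME is your Aalto username):

# code: version control keeps the laptop and server copies in sync
git -C ~/project/xproject pull && git -C ~/project/xproject push
# data: one-way copy of the master data to the cluster work directory
rsync -av ~/project/xproject-data/ USERNAME@triton.aalto.fi:/scratch/work/USERNAME/xproject-data/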

Organize files within directories
Traditional organization

This is the traditional organization within a single person’s project. The key concept is separation of code, original data, scratch data, and final outputs. Each is handled properly.

  • PROJECT/code/ - backed up and tracked in a version control system.

  • PROJECT/original/ - original and irreplaceable data. Backed up at the same time it is placed here.

  • PROJECT/scratch/ - bulk data, can be regenerated from code+original

  • PROJECT/doc/ - final outputs, which should be kept for a very long term.

  • PROJECT/doc/paper1/ - different papers/reports, if not stored in a different project directory.

  • PROJECT/doc/paper2/

  • PROJECT/doc/opendata/

When the project is over, code/ and doc/ can be backed up permanently (original/ is already backed up) and the scratch directory can be kept for a reasonable time before it is removed (or put into cold storage).

The most important thing is that code is kept separate from the data. This means no copying files over and over into minor variations. Code should be adjustable for different purposes (and you can always get the old versions from version control). Code is run from the code directory; there is no need to copy it to each folder individually.
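For example, the skeleton above could be created like this (a sketch; PROJECT is a placeholder name):

mkdir -p PROJECT/{code,original,scratch,doc/paper1}
cd PROJECT/code
git init        # only code/ is tracked in version control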

Multi-user

The system above can be trivially adapted to suit a project with multiple users:

  • PROJECT/USER1/.... - each user directory has their own code/, scratch/, and doc/ directories. Code is synced via the version control system. People use the original data straight from the shared folder in the project.

  • PROJECT/USER2/....

  • PROJECT/original/ - this is the original data.

  • PROJECT/scratch/ - shared intermediate files, if they are stable enough to be shared.

For convenience, each user can create a symbolic link to the original/ data directory from their own directory.
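For example (a sketch with the placeholder names used above):

cd PROJECT/USER1
ln -s ../original original    # convenient shortcut to the shared original data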

Master project

In this layout, you have one long-term master directory for a whole research group, with many different users and research themes within it. As time goes on, once users leave, their directories can be cleaned up and removed. The same can happen for the themes.

  • PROJECT/USER1/SUBPROJECT1/...

  • PROJECT/USER1/SUBPROJECT2/...

  • PROJECT/USER2/SUBPROJECT1/...

  • PROJECT/original/

  • PROJECT/THEME/USER1/...

  • PROJECT/THEME/USER2/...

  • PROJECT/archive/

Common variants
  • Simulations with different parameters: all parameters are stored in the code directory, within version control. The code knows what parameters to use when making a new run. This makes it easy to see the entire history of your simulations.

  • Downloading data: this can be put into either original or scratch, depending on how much you trust the original source to stay available.

  • Multiple sub-projects: this can be

  • Multiple types of code: separate long-term code from scratch research code. You can separate parameters from code. And so on…

Projects

In Aalto, data is organized into project groups. Each project has members who can access the data, and different shared storage spaces (project, archive, scratch (see below)). You can apply for these whenever you need.

What should a project contain? How much should go into the same project?

  • One project that lasts forever per research group: This is traditional. A professor will get a project allocated, and then people put data in here. There may be subdirectories for each researcher or topic, and some shared folders for common data. The problem here is that the size will grow without bound. Who will ever clean up all the old stuff? These have a way of growing forever so that the data becomes no longer manageable, but they are convenient because it keeps the organization flat.

    • If data size is small and growing slower than storage, this works for long-term.

    • It can also work if particular temporary files are managed well and eventually removed.

  • One project for each distinct theme: A research group may become interested in some topic (for example, a distinct funded project), and they get storage space just for this. The project goes on and is eventually closed.

    • You can be more fine-grained in access, if data is confidential

    • You can ensure that the data stays together

    • You can ensure that data end-of-life happens properly. This is especially useful for showing you are managing data properly as part of grant applications.

    • You can have a master group as a member of the specific project. This allows a flat organization, where all of your members can access all data in different projects.

Science-IT data policy

Note

This was originally developed at CS, but applies to all departments managed by the Science-IT team.

In Aalto, large amounts of data with a variety of requirements are processed daily. This describes the responsibilities of IT support and users with respect to data management.

Everyone should know the summary items below. The full policy is for reference in case of doubts (items in bold are things which are not completely obvious).

This policy is designed to avoid the most common problems by advance planning for the majority case. Science-IT is eager to provide a higher level of service for those who need it, but users must discuss with staff. This policy is jointly implemented by department IT and Science-IT.

Summary for users
  • Do not store research data in home directories: it is not accessible should something happen to you or when you leave, and home directories are automatically deleted.

  • Project directories are accessible to ALL members; files not intended for access by all members should be stored in a separate project.

  • Workstations and mobile devices are NOT backed up. Directories with backups are noted. It is your responsibility to make sure that you store data in backed-up places. Don’t consider only disk failure, but also user error, loss of device, etc.

  • Data stored in project directories is managed by the (professor, supervisor) who owns the directory, and they can make decisions regarding access now and in the future. Any special considerations should be discussed with them.

  • Data is not archived or saved for individual users. Data which must be saved should be in a shared project directory with an owner who is still at Aalto. Triton’s individual user data is permanently deleted 6 months after the expiration date of the user account (Aalto home directories may be deleted even sooner).

  • There is no default named security level - of course we keep all data secure, but should you be dealing with legally confidential files, you must ask us.

Summary for data directory owners (professors or long-term staff)
  • Data in the shared directories is controlled by you, and you make decisions about it.

  • All data within a project is accessible by all members of that project. Make more projects if more granularity is needed.

  • Data must have an expiration time, and this is extended as needed. Improperly managed data is not stored indefinitely. If data is long-term archived, it must still have an administrative owner at Aalto who can make decisions about it.

  • There must be a succession plan for the data, should the data owner leave or become unavailable to answer questions. By default this is the supervisor or department head. They will make decisions about access, management, and end-of-life.

  • We will try to handle whatever data you may need us to. The only prerequisite is that it is managed well. We can’t really define “managed well”, but at least it means you know what it contains and where the space is going.

Detailed policy

This is the detailed policy. The important summary for users and owners is above, but the full details are written below for the avoidance of doubt.

Scope
  1. This policy concerns all data stored in the main provided locations or managed by Science-IT staff (including its core departments).

Responsibilities
  1. In data processing and rules we follow Finnish legislation and Aalto university policies in this order.

  2. If there are agreements with a third-party organization for data access, those rules are honored next. Regarding this type of data, we must be consulted prior to storing the data.

  3. Users are expected to follow all Aalto and CS policies, as well as good security practices.

  4. IT is expected to provide a good service, data security, and instruction on best practices.

Storage
  1. All data must have an owner and a given lifespan. Data cannot be stored indefinitely, but of course the lifespan is routinely extended when needed. There are other long-term archival services.

  2. Work-related data should always be stored outside the user’s HOME directory. HOME is meant only for private and non-work-related files. (IT staff are not allowed to retrieve lost research files from a user’s home directory.)

  3. The other centrally available folders (i.e. Project, Archive, Scratch) are meant for work-related information only.

  4. Desktop computers are considered stateless. They can be re-installed at any point by IT if necessary. Data stored on local workstations is always considered temporary and is not backed up. IT support will still try to inform users of changes.

  5. Backed-up data locations are listed. It is the user’s responsibility to ensure that data is stored in backed-up locations as needed. Mobile devices (laptops) and personal workstations are not backed up.

Ownership, access rights, and end-of-life
  1. Access rights in this policy refer only to file system rights. Other rights (e.g. IPR) to the stored information are not part of this policy.

  2. There must be a clear owner and chain of responsibility (successor) for all data (who owns it and can make decisions and who to ask when they leave or become unavailable).

  3. For group directories (e.g. project, archive, scratch), the file system permissions (the possibility to read, write, copy, modify and delete) of these files belong to the group. There is no more granular access, for example single files with more restrictive permissions. Permissions will be fixed by IT on request from group members.

  4. The group owner-on-file can make any decisions related to data access, management, or end-of-life.

  5. Should the data owner of a group directory become unavailable or unable to answer questions about access, management, or end-of-life, the successor they named may make decisions regarding the data, including end-of-life. This defaults to their supervisor (e.g. head of department), but should be discussed when the data storage is set up.

  6. Triton data stored on folders that are not group directories (e.g. the content of /scratch/work/USERNAME or /home/USERNAME) will be permanently deleted after 6 months from the user’s account expiration. Please remember to back up your data if you know that your account is expiring soon. (Note that Aalto home directory data may be removed even earlier)

  7. Should researchers need a more complex access scheme, this must be discussed with IT support.

Security/Confidentiality
  1. Unless there is a notification, there is no particular guaranteed service level regarding confidential data. However, all services are expected to be as secure as possible and are designed to support confidential data.

  2. Should a specific security level be needed, that must be agreed separately.

  3. Data stored in the provided storage locations is not encrypted at rest.

  4. Confidentiality is enforced by file system permissions; access changes will always be confirmed with the data owner.

  5. All storage media (hard drives, etc.) should be securely wiped, to the extent technically feasible, at end of life. This is normally handled by IT; if special handling is required, it must be arranged by the end users.

  6. All remote data access should use strong encryption.

  7. Users must notify IT support or their supervisor about any security issues or misuse of data.

  8. Security of laptops, mobile devices and personal devices is not currently guaranteed by IT support. Confidential data should use centralized IT-provided services only.

  9. Users and data owners must take primary responsibility for data security, since technical security is only one part of the process.

Communication
  1. Details about centrally provided folders and best practices are available in online documentation.

  2. Changes to policy will be coordinated by department management. All changes will at least be announced to data owners, but individual approvals are not needed unless a service level drops.

Data on Triton

Triton is a computer cluster that provides large and fast data storage connected to significant computing power, but it is not backed up.

Data management

This section covers administrative and organizational matters about data.

Other

Summary table

O = good, x = bad

Large

Fast

Confidential

Frequent backups

Long-term archival

Code

OO

OO

Original data

O

O

OO?

OO

OO

Intermediate files

OO

OO

OO?

Final results/open data

OO

Large

Fast

Confidential

Backups

Long-term archival

Shareable

Triton

scratch

OO

OO

O

x

x

O

work

OO

OO

O

x

x

Triton home

x

O

OO

Local disks

O

OO

O

ramfs

OOO

OO

Depts

/m/…/project

O

O

OO

OO

O

/m/…/archive

O

O

OO

OO

O

O

Aalto

Aalto home

OO

OO

Aalto work

O

O

OO

OO

O

Aalto teamwork

O

O

OO

OO

O

Aalto laptops

x

x

X

Aalto webspace

OO

version.aalto.fi

OO

OO

O

OO

ACRIS

O

O

Eduuni

Aalto Wiki

Finland

Funet filesender

O

OO

CSC cPouta

O

O

O

CSC Ida

OOO

x

OO

O

O

FSD

OO

O

OO

O

Public

github

x

OO

Zenodo

OO

OO

Google drive

x

O

OneDrive

Own computers

x

x

x

Emails

x

x

x

EUDAT B2SHARE

O

O

O

Cheatsheets: Data, A4 Data management plan.

Triton

Triton is the Aalto high-performance computing cluster. It is your go-to resource for anything that exceeds your desktop computer’s capacity. To get started, you could check out the tutorials (going through all the principles) or the quickstart guide (if you pretty much know the basics).

Triton cluster

Triton is the Aalto high-performance computing cluster. It serves all researchers of Aalto, but is currently coordinated from within the School of Science. Access is free for researchers (see Triton accounts, students not doing research should check out our intro for students). It is similar to the CSC clusters, though CSC clusters are larger and Triton is easier to use because it is more integrated into the Aalto environment.

Overview

Triton accounts

You need to request Triton access separately; however, the account information (username, password, shell, etc.) is shared with the Aalto account, so there is not actually a separate account. Triton access is available to any researcher at Aalto for free. Resources are funded by departments, and distributed by a fairshare algorithm: members of departments and schools which provide direct funding have a greater share.

Please use the account request form (“Triton: New user account”) to request the account. (For future help, you should probably use our issue tracker: see the Getting Triton help page.)

A few prerequisites:

Accounts are for (see details):

  • Researchers (as in, affiliated with a research PI in any way). Please tell us who your supervisor is in your account request.

    • If you are a student doing a course project which is essentially a research project (you are basically joining the research group), you may use Triton for that project. You should be clear about this in your request, put your research supervisor (not course instructor) as supervisor, and we’ll verify.

  • Students coming to one of our Scientific Computing in Practice courses which uses Triton. You will be specifically told if this is the case.

  • Other students not doing research needing computational facilities should check out our introduction for students. This includes most student projects as part of courses, unless you are effectively joining a research group to do a project.

You know that you have Triton access if you are in the triton-users group at Aalto: groups shows this on Aalto Linux machines.
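For example:

groups | grep triton-users    # prints your group list if triton-users is among your groups, nothing otherwise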

Your department/unit

When you get an account, you get added to a unit’s group, which is “billed” for your usage. If you change Aalto units, this may need to be updated. Check sshare -U or sshare and if it’s wrong, let us know (the unit is first on the line). (These are currently by department, so changes are not that frequent.)

Password change and other issues

Since your Triton account is a regular Aalto account, for any password change, shell change etc use Aalto services. You can always do these on the server kosh.aalto.fi (at least).

If you are in doubt, for any account-related issue your primary point of contact is your local support team member via the support email address. Do not post such issues on the tracker.

Account deactivation / remove from mailing list

Your account lasts as long as your Aalto account does, and the triton-users mailing list is directly tied to the Triton account: when the account is deactivated, you are also unsubscribed from the mailing list (they go together; you can’t just be removed from the mailing list).

If you want to deactivate your account, send an email to the scicomp email address (scicomp -at- aalto.fi). You can save time by saying something like the following in your message (otherwise we will reply to confirm; if you have any special requests or need help, ask us): “I realize that I will lose access to Triton, I have made plans for any important data, and I realize that any home and work directory data will eventually be deleted”.

Before you leave, please clean up your home/work/scratch directory data. Consider who should have your data after you are done: does your group still need access to it? You won’t have access to the files after your account is deactivated. Note that scratch/work directory data is unrecoverable after deletion, which will happen eventually. If data is stored in a group directory (/scratch/$dept/$groupname), it won’t be deleted and will stay managed by the group owner.
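A rough cleanup sketch before leaving (the group path and project name below are only examples; agree on the destination with your group first):

du -sh $HOME $WRKDIR                                          # see how much data you have
rsync -av $WRKDIR/myproject/ /scratch/cs/mygroup/myproject/   # hand useful data over to a group directory
rm -r $WRKDIR/old-intermediate-results/                       # remove what nobody will need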

Terms of use/privacy policy

See the Usage policies and legal page.

Triton quick reference

This page collects all the important reference information.

Quick reference guide for the Triton cluster at Aalto University, but also useful for many other Slurm clusters. See also this printable Triton cheatsheet, as well as other cheatsheets.

Connecting

See also: Connecting to Triton.

Method

Description

From where?

ssh from Aalto networks

Standard way of connecting via command line. Hostname is triton.aalto.fi. More SSH info.

>Linux/Mac/Win from command line: ssh USERNAME@triton.aalto.fi

>Windows: same, see Connecting via ssh for detailed options.

VPN and Aalto networks (which is VPN, most wired, internal servers, eduroam, aalto only if using an Aalto-managed laptop, but not aalto open). Simplest SSH option if you can use VPN.

ssh (from rest of Internet)

Use Aalto VPN and row above.

If needed: same as above, but must set up SSH key and then ssh -J USERNAME@kosh.aalto.fi USERNAME@triton.aalto.fi.

Whole Internet, if you first set up SSH key AND also use passwords (since 2023)

VDI

“Virtual desktop interface”, https://vdi.aalto.fi, from there you have to ssh to Triton (previous rows) and can run graphical programs via SSH. More info.

Whole Internet

Jupyter

https://jupyter.triton.aalto.fi provides the Jupyter interface directly on Triton (including command line). Get a terminal with “New → Other → Terminal”. More info.

Whole Internet

Open OnDemand

https://ood.triton.aalto.fi, Web-based interface to the cluster. Includes shell access and data transfer. “Triton Shell Access” for the terminal. More info.

VPN and Aalto networks

VSCode

Web-based available via OpenOnDemand (row above).

Desktop-based “Remote SSH” allows running on Triton (which is OK, but don’t use it for large computation). More info.

Same as Open OnDemand or SSH above

Modules

See also: Software modules.

Command

Description

module load NAME

load module

module avail

list all modules

module spider PATTERN

search modules

module spider NAME/ver

show prerequisite modules to this one

module list

list currently loaded modules

module show NAME

details on a module

module help NAME

details on a module

module unload NAME

unload a module

module save ALIAS

save module collection to this alias (saved in ~/.lmod.d/)

module savelist

list all saved collections

module describe ALIAS

details on a collection

module restore ALIAS

load saved module collection (faster than loading individually)

module purge

unload all loaded modules (faster than unloading individually)

Common software

See also: Applications.

  • Python: module load anaconda for the Anaconda distribution of Python 3, including a lot of useful packages. More info.

  • R: module load r for a basic R package. More info.

  • Matlab: module load matlab for the latest Matlab version. More info.

  • Julia: module load julia for the latest Julia version. More info.
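For example, loading one of these and checking what you got (a minimal sketch):

module load anaconda
python3 --version    # confirm which Python is now first in your path
module list          # show everything currently loaded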

Storage

See also: Data storage

Name

Path

Quota

Backup

Locality

Purpose

Home

$HOME or /home/USERNAME/

hard quota 10GB

Nightly

all nodes

Small user specific files, no calculation data.

Work

$WRKDIR or /scratch/work/USERNAME/

200GB and 1 million files

x

all nodes

Personal working space for every user. Calculation data etc. Quota can be increased on request.

Scratch

/scratch/DEPT/PROJECT/

on request

x

all nodes

Department/group specific project directories.

Local temp

/tmp/

limited by disk size

x

single-node

Primary (and usually fastest) place for single-node calculation data. Removed once user’s jobs are finished on the node.

Local persistent

/l/

varies

x

dedicated group servers only

Local disk persistent storage. On servers purchased for a specific group. Not backed up.

ramfs (login nodes only)

$XDG_RUNTIME_DIR

limited by memory

x

single-node

Ramfs on the login node only, in-memory filesystem
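To see how much space you are currently using in these locations, standard tools work (a sketch; quotas themselves are reported by the cluster’s own tooling):

du -sh $HOME $WRKDIR    # summarize usage of your home and work directories
df -h /scratch          # overall usage of the scratch filesystem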

Remote data access

See also: Remote access to data.

Method

Description

rsync transfers

Transfer back and forth via command line. Set up ssh first.

rsync triton.aalto.fi:/path/to/file.txt file.txt

rsync file.txt triton.aalto.fi:/path/to/file.txt

SFTP transfers

Operates over SSH. sftp://triton.aalto.fi in file browsers (Linux at least), FileZilla (to triton.aalto.fi).

SMB mounting

Mount (make remote viewable locally) to your own computer.

Linux: File browser, smb://data.triton.aalto.fi/scratch/

MacOS: File browser, same URL as Linux

Windows: \\data.triton.aalto.fi\scratch\

Partitions

Partition

Max job size

Mem/core (GB)

Tot mem (GB)

Cores/node

Limits

Use

<default>

If you leave it off, all possible partitions will be used (based on time/mem)

debug

2 nodes

2.66 - 12

32-256

12,20,24

15 min

testing and debugging, short interactive work. 1 node of each arch.

batch

16 nodes

2.66 - 12

32-510

12, 20,24,40,128

5d

primary partition, all serial & parallel jobs

short

8 nodes

4 - 12

48-256

12, 20,24

4h

short serial & parallel jobs, +96 dedicated CPU cores

hugemem

1 node

43

1024

24

3d

huge memory jobs, 1 node only

gpu

1 node, 2-8GPUs

2 - 10

24-128

12

5d

Long gpu jobs

gpushort

4 nodes, 2-8 GPUs

2 - 10

24-128

12

4h

Short GPU jobs

interactive

2 nodes

5

128

24

1d

for sinteractive command, longer interactive work

Use slurm partitions to see more details.

Job submission

See also: Serial Jobs, Array jobs: embarrassingly parallel execution, Parallel computing: different methods explained.

Command

Description

sbatch

submit a job to queue (see standard options below)

srun

Within a running job script/environment: Run code using the allocated resources (see options below)

srun

On frontend: submit to queue, wait until done, show output. (see options below)

sinteractive

Submit job, wait, provide shell on node for interactive playing (X forwarding works, default partition interactive). Exit shell when done. (see options below)

srun --pty bash

(advanced) Another way to run interactive jobs, no X forwarding but simpler. Exit shell when done.

scancel JOBID

Cancel a job in queue

salloc

(advanced) Allocate resources from frontend node. Use srun to run using those resources, exit to close shell when done (see options below)

scontrol

View/modify job and slurm configuration

Command

Option

Description

sbatch/srun/etc

-t, --time=HH:MM:SS

time limit

-t, --time=DD-HH

time limit, days-hours

-p, --partition=PARTITION

job partition. Usually leave off and things are auto-detected.

--mem-per-cpu=N

request n MB of memory per core

--mem=N

request n MB memory per node

-c, --cpus-per-task=N

Allocate N CPUs for each task. For multithreaded jobs. (Compare --ntasks: -c means the number of cores for each process started.)

-N, --nodes=N-M

allocate minimum of n, maximum of m nodes.

-n, --ntasks=N

allocate resources for and start n tasks (one task=one process started, it is up to you to make them communicate. However the main script runs only on first node, the sub-processes run with “srun” are run this many times.)

-J, --job-name=NAME

short job name

-o OUTPUTFILE

print output into OUTPUTFILE

-e ERRORFILE

print errors into ERRORFILE

--exclusive

allocate exclusive access to nodes. For large parallel jobs.

--constraint=FEATURE

request feature (see slurm features for the current list of configured features, or Arch under the hardware list). Multiple with --constraint="hsw|skl".

--array=0-5,7,10-15

Run job multiple times, use variable $SLURM_ARRAY_TASK_ID to adjust parameters.

--gres=gpu

request a GPU, or --gres=gpu:n for multiple

--gres=spindle

request nodes that have disks, spindle:n, for a certain number of RAID0 disks

--mail-type=TYPE

notify of events: BEGIN, END, FAIL, REQUEUE (not on Triton), or ALL. Must be used together with --mail-user=.

--mail-user=YOUR@EMAIL

whom to send the email to

srun

-N N_NODES hostname

Print allocated nodes (from within script)

Command

Description

slurm q ; slurm qq

Status of your queued jobs (long/short)

slurm partitions

Overview of partitions (A/I/O/T=active,idle,other,total)

slurm cpus PARTITION

list free CPUs in a partition

slurm history [1day,2hour,…]

Show status of recent jobs

seff JOBID

Show percent of mem/CPU used in job. See Monitoring.

sacct -o comment -p -j JOBID

Show GPU efficiency

slurm j JOBID

Job details (only while running)

slurm s ; slurm ss PARTITION

Show status of all jobs

sacct

Full history information (advanced, needs args)

Full slurm command help:

$ slurm

Show or watch job queue:
 slurm [watch] queue     show own jobs
 slurm [watch] q   show user's jobs
 slurm [watch] quick     show quick overview of own jobs
 slurm [watch] shorter   sort and compact entire queue by job size
 slurm [watch] short     sort and compact entire queue by priority
 slurm [watch] full      show everything
 slurm [w] [q|qq|ss|s|f] shorthands for above!
 slurm qos               show job service classes
 slurm top [queue|all]   show summary of active users
Show detailed information about jobs:
 slurm prio [all|short]  show priority components
 slurm j|job      show everything else
 slurm steps      show memory usage of running srun job steps
Show usage and fair-share values from accounting database:
 slurm h|history   show jobs finished since, e.g. "1day" (default)
 slurm shares
Show nodes and resources in the cluster:
 slurm p|partitions      all partitions
 slurm n|nodes           all cluster nodes
 slurm c|cpus            total cpu cores in use
 slurm cpus   cores available to partition, allocated and free
 slurm cpus jobs         cores/memory reserved by running jobs
 slurm cpus queue        cores/memory required by pending jobs
 slurm features          List features and GRES

Examples:
 slurm q
 slurm watch shorter
 slurm cpus batch
 slurm history 3hours

Other advanced commands (many require lots of parameters to be useful):

Command

Description

squeue

Full info on queues

sinfo

Advanced info on partitions

slurm nodes

List all nodes

Slurm examples

See also: Serial Jobs, Array jobs: embarrassingly parallel execution.

Simple batch script, submit with sbatch the_script.sh:

#!/bin/bash -l
#SBATCH --time=01:00:00
#SBATCH --mem-per-cpu=1G

module load anaconda
python my_script.py

Simple batch script with array (can also submit with sbatch --array=1-10 the_script.sh):

#!/bin/bash -l
#SBATCH --array=1-10

python my_script.py --seed=$SLURM_ARRAY_TASK_ID
Toolchains

Toolchain

Compiler version

MPI version

BLAS version

ScaLAPACK version

FFTW version

CUDA version

GOOLF Toolchains:

goolf/triton-2016a

GCC/4.9.3

OpenMPI/1.10.2

OpenBLAS/0.2.15

ScaLAPACK/2.0.2

FFTW/3.3.4

goolf/triton-2016b

GCC/5.4.0

OpenMPI/1.10.3

OpenBLAS/0.2.18

ScaLAPACK/2.0.2

FFTW/3.3.4

goolfc/triton-2016a

GCC/4.9.3

OpenMPI/1.10.2

OpenBLAS/0.2.15

ScaLAPACK/2.0.2

FFTW/3.3.4

7.5.18

goolfc/triton-2017a

GCC/5.4.0

OpenMPI/2.0.1

OpenBLAS/0.2.19

ScaLAPACK/2.0.2

FFTW/3.3.4

8.0.61

GMPOLF Toolchains:

gmpolf/triton-2016a

GCC/4.9.3

MPICH/3.0.4

OpenBLAS/0.2.15

ScaLAPACK/2.0.2

FFTW/3.3.4

gmpolfc/triton-2016a

GCC/4.9.3

MPICH/3.0.4

OpenBLAS/0.2.15

ScaLAPACK/2.0.2

FFTW/3.3.4

7.5.18

GMVOLF Toolchains:

gmvolf/triton-2016a

GCC/4.9.3

MVAPICH2/2.0.1

OpenBLAS/0.2.15

ScaLAPACK/2.0.2

FFTW/3.3.4

gmvolfc/triton-2016a

GCC/4.9.3

MVAPICH2/2.0.1

OpenBLAS/0.2.15

ScaLAPACK/2.0.2

FFTW/3.3.4

7.5.18

IOOLF Toolchains:

ioolf/triton-2016a

icc/2015.3.187

OpenMPI/1.10.2

OpenBLAS/0.2.15

ScaLAPACK/2.0.2

FFTW/3.3.4

IOMKL Toolchains:

iomkl/triton-2016a

icc/2015.3.187

OpenMPI/1.10.2

imkl/11.3.1.150

imkl/11.3.1.150

imkl/11.3.1.150

iomkl/triton-2016b

icc/2015.3.187

OpenMPI/1.10.3

imkl/11.3.1.150

imkl/11.3.1.150

imkl/11.3.1.150

iompi/triton-2017a

icc/2017.1.132

OpenMPI/2.0.1

imkl/2017.1.132

imkl/2017.1.132

imkl/2017.1.132

Hardware

See also: Cluster technical overview.

Node name

Number of nodes

Node type

Year

Arch (--constraint)

CPU type

Memory Configuration

Infiniband

GPUs

Disks

pe[1-48,65-81]

65

Dell PowerEdge C4130

2016

hsw avx avx2

2x12 core Xeon E5 2680 v3 2.50GHz

128GB DDR4-2133

FDR

900GB HDD

pe[49-64,82]

17

Dell PowerEdge C4130

2016

hsw avx avx2

2x12 core Xeon E5 2680 v3 2.50GHz

256GB DDR4-2133

FDR

900GB HDD

pe[83-91]

8

Dell PowerEdge C4130

2017

bdw avx avx2

2x14 core Xeon E5 2680 v4 2.40GHz

128GB DDR4-2400

FDR

900GB HDD

c[639-647,649-653,655-656,658]

17

ProLiant XL230a Gen9

2017

hsw avx avx2

2x12 core Xeon E5 2690 v3 2.60GHz

128GB DDR4-2666

FDR

450G HDD

skl[1-48]

48

Dell PowerEdge C6420

2019

skl avx avx2 avx512

2x20 core Xeon Gold 6148 2.40GHz

192GB DDR4-2667

EDR

No disk

csl[1-48]

48

Dell PowerEdge C6420

2020

csl avx avx2 avx512

2x20 core Xeon Gold 6248 2.50GHz

192GB DDR4-2667

EDR

No disk

milan[1-32]

32

Dell PowerEdge C6525

2023

milan avx avx2

2x64 core AMD EPYC 7713 @2.0 GHz

512GB DDR4-3200

HDR-100

No disk

fn3

1

Dell PowerEdge R940

2020

avx avx2 avx512

4x20 core Xeon Gold 6148 2.40GHz

2TB DDR4-2666

EDR

No disk

gpu[1-10]

10

Dell PowerEdge C4140

2020

skl avx avx2 avx512 volta

2x8 core Intel Xeon Gold 6134 @ 3.2GHz

384GB DDR4-2667

EDR

4x V100 32GB

1.5 TB SSD

gpu[11-17,38-44]

14

Dell PowerEdge XE8545

2021, 2023

milan avx avx2 ampere a100

2x24 core AMD EPYC 7413 @ 2.65GHz

503GB DDR4-3200

EDR

4x A100 80GB

440 GB SSD

gpu[20-22]

3

Dell PowerEdge C4130

2016

hsw avx avx2 kepler

2x6 core Xeon E5 2620 v3 2.50GHz

128GB DDR4-2133

EDR

4x2 GPU K80

440 GB SSD

gpu[23-27]

5

Dell PowerEdge C4130

2017

hsw avx avx2 pascal

2x12 core Xeon E5-2680 v3 @ 2.5GHz

256GB DDR4-2400

EDR

4x P100

720 GB SSD

gpu[28-37]

10

Dell PowerEdge C4140

2019

skl avx avx2 avx512 volta

2x8 core Intel Xeon Gold 6134 @ 3.2GHz

384GB DDR4-2667

EDR

4x V100 32GB

1.5 TB SSD

dgx[1-2]

2

Nvidia DGX-1

2018

bdw avx avx2 volta

2x20 core Xeon E5-2698 v4 @ 2.2GHz

512GB DDR4-2133

EDR

8x V100 16GB

7 TB SSD

dgx[3-7]

5

Nvidia DGX-1

2018

bdw avx avx2 volta

2x20 core Xeon E5-2698 v4 @ 2.2GHz

512GB DDR4-2133

EDR

8x V100 32GB

7 TB SSD

gpuamd1

1

Dell PowerEdge R7525

2021

rome avx avx2 mi100

2x8 core AMD EPYC 7262 @3.2GHz

250GB DDR4-3200

EDR

3x MI100

32GB SSD

Node type

CPU count

48GB Xeon Westmere (2012)

1404

24GB Xeon Westmere + 2x GPU (2012)

120

96GB Xeon Westmere (2012)

288

1TB Xeon Westmere (2012)

48

256GB Xeon Ivy Bridge (2014)

480

64GB Xeon Ivy Bridge (2014)

480

128GB Xeon Haswell (2016)

1224

256GB Xeon Haswell (2016)

360

128GB Xeon Haswell + 4x GPU (2016)

36

GPUs

See also: GPU computing.

Card

Slurm feature name (--constraint=)

Slurm gres name (--gres=gpu:NAME:n)

total amount

nodes

architecture

compute threads per GPU

memory per card

CUDA compute capability

Tesla K80*

kepler

teslak80

12

gpu[20-22]

Kepler

2x2496

2x12GB

3.7

Tesla P100

pascal

teslap100

20

gpu[23-27]

Pascal

3854

16GB

6.0

Tesla V100

volta

v100

40

gpu[1-10]

Volta

5120

32GB

7.0

Tesla V100

volta

v100

40

gpu[28-37]

Volta

5120

32GB

7.0

Tesla V100

volta

v100

16

dgx[1-7]

Volta

5120

16GB

7.0

Tesla A100

ampere

a100

56

gpu[11-17,38-44]

Ampere

7936

80GB

8.0

AMD MI100 (testing)

mi100

Use -p gpu-amd only, no --gres

gpuamd[1]

Conda

See also: Python Environments with Conda

Command

Description

module load miniconda

Load module that provides miniconda on Triton - recommended for using conda for your own environments.

First time setup

See link for six commands to run once per user account on Triton (to avoid filling up all space on your home directory).

name: conda-example
channels:
  - conda-forge
dependencies:
  - numpy
  - pandas

Minimal environment.yml example. By defining our requirements in one place, our environment becomes reproducible and we can solve problems by re-creating it. “Dependencies” lists packages that will be installed.

Environment management:

conda env create --file environment.yml

Create environment from yaml file. Use -n NAME to set or override the name from the .yml file. Environments with -n are stored in conda config --show envs_dirs.

source activate NAME

Activate environment of name NAME. Note we use this and not conda init/conda activate to avoid changing Python for your whole account.

source deactivate

Deactivate conda from this session.

conda env list

List all environments.

conda env remove -n NAME

Remove the environment of that name.

Package management:

Inside the activate environment

conda list

List packages in currently active environment.

conda install --freeze-installed --channel CHANNEL PACKAGE_NAME

Install packages in an environment with minimal changes to what is already installed. Usually you would want to add them to environment.yml if they are dependencies. Better: add to environment.yml and then see the next line.

conda env update --file environment.yml

Update an environment based on environment.yml

conda env export

Export an environment.yml that describes the current environment. Add --no-builds to make it more portable across operating systems. Add --from-history to list only what you have explicitly requested in the past.

conda search [--channel conda-forge] NAME

Search for a package. List includes name, version, build version (often including linked libraries like Python/CUDA), and channel.

Other:

mamba ...

Use mamba instead of conda for faster operations. mamba is a drop-in replacement. It should be installed in the environment.

conda clean -a

Clean up cached files to free up space (not environments or packages in them).

CONDA_OVERRIDE_CUDA="11.2" conda ..

Used when making a CUDA environment on the login node (choose the right CUDA version for you). Used with ... env create or ... install to indicate that CUDA will be available when the program runs.

Channel conda-forge

Package selection tensorflow=*=*cuda*

Package selection for tensorflow. The first * can be replaced with the Tensorflow version specification

Channels pytorch and conda-forge

Package selection pytorch=*=*cuda*

Package selection for pytorch. The first * can be replaced with the pytorch version specification.

CUDA

In channel conda-forge, automatically selected based on software you need. For manual compilation, package cudatoolkit in conda-forge.
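Putting the commands above together, a typical first-environment workflow might look like this (a sketch, using the environment.yml example from earlier; conda-example is its name):

module load miniconda
conda env create --file environment.yml    # or mamba env create, if mamba is available
source activate conda-example
python -c "import numpy, pandas"           # quick check that the packages import
source deactivate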

Command line

See also: Linux shell crash course.

General notes

The command line has many small programs that when connected, allow you to do many things. Only a little bit of this is shown here.

Programs are generally silent if everything worked, and only print an error if something goes wrong.

ls [DIR]

List current directory (or DIR if given).

pwd

Print current directory.

cd DIR

Change directory. .. is parent directory, / is root, / is also chaining directories, e.g. dir1/dir2 or ../../

nano FILE

Edit a file (there are many other editors, but nano is common, nice, and simple).

mkdir DIR-NAME

Make a new directory.

cat FILE

Print entire contents of file to standard output (the terminal).

less FILE

Less is a “pager”, and lets you scroll through a file (up/down/pageup/pagedown). q to quit, / to search.

mv SOURCE DEST

Move (=rename) a file. mv SOURCE1 SOURCE2 DEST-DIRECTORY/ moves multiple files to a directory.

cp SOURCE DEST

Copy a file. The DEST-DIRECTORY/ syntax of mv works as well.

rm FILE ...

Remove a file. Note, from the command line there is no recovery, so always pause and check before running this command! The -i option will make it confirm before removing each file. Add -r to remove whole directories recursively.

head [FILE]

Print the first 10 lines (or N lines with -n N) of a file. Can take input from standard input instead of FILE. tail is similar, but prints the end of the file.

tail [FILE]

See above.

grep PATTERN [FILE]

Print lines matching a pattern in a file, suitable as a primitive find feature, or quickly searching for output. Can also use standard input instead of FILE.

du [-ash] [DIR]

Print disk usage of a directory. Default is KiB, rounded up to block sizes (1 or 4 KiB), -h means “human readable” (MB, GB, etc), -s means “only of DIR, not all subdirectories also”. -a means “all files, not only directories”. A common pattern is du -h DIR | sort -h to print all directories and their sizes, sorted by size.

stat

Show detailed information on a file’s properties.

find [DIR]

find can do almost anything, but that means it’s really hard to use it well. Let’s be practical: with only a directory argument, it prints all files and directories recursively, which might be useful itself. Many of us do find DIR | grep NAME to grep for the name we want (even though this isn’t the “right way”, there are find options which do this same thing more efficiently).

| (pipe): COMMAND1 | COMMAND2

The output of COMMAND1 is sent to the input of COMMAND2. Useful for combining simple commands together into complex operations - a core part of the unix philosophy.

> (output redirection): COMMAND > FILE

Write standard output of COMMAND to FILE. Any existing content is lost.

>> (appending output redirection): COMMAND >> FILE

Like above, but doesn’t lose content: it appends.

< (input redirection): COMMAND < FILE

Opposite of >, input to COMMAND comes from FILE.

type COMMAND or which COMMAND

Show exactly what will be run, for a given command (e.g. type python3).

man COMMAND-NAME

Browse on-line help for a command. q will exit, / will search (it uses less as its pager by default).

-h and --help

Common command line options to print help on a command. But, it has to be implemented by each command.

Triton quickstart guide

This is a quickstart guide to the Triton cluster. Each individual guide will link to additional resources with more extensive information.

Connecting to Triton

Most of the information on this page is also available on other tutorial pages. This page is essentially a condensed version of those, which only gives you a recipe for quickly setting up your machine and the most important details. For more in-depth information, please have a look at the linked pages for each section.

There are three suggested ways to connect to Triton, as detailed in the table below, with more info found at the connecting tutorial.

Method

Description

From where?

ssh from Aalto networks

Standard way of connecting via command line. Hostname is triton.aalto.fi. More SSH info.

>Linux/Mac/Win from command line: ssh USERNAME@triton.aalto.fi

>Windows: same, see Connecting via ssh for detailed options.

VPN and Aalto networks (which is VPN, most wired, internal servers, eduroam, aalto only if using an Aalto-managed laptop, but not aalto open). Simplest SSH option if you can use VPN.

ssh (from rest of Internet)

Use Aalto VPN and row above.

If needed: same as above, but must set up SSH key and then ssh -J USERNAME@kosh.aalto.fi USERNAME@triton.aalto.fi.

Whole Internet, if you first set up SSH key AND also use passwords (since 2023)

VDI

“Virtual desktop interface”, https://vdi.aalto.fi, from there you have to ssh to Triton (previous rows) and can run graphical programs via SSH. More info.

Whole Internet

Jupyter

https://jupyter.triton.aalto.fi provides the Jupyter interface directly on Triton (including command line). Get a terminal with “New → Other → Terminal”. More info.

Whole Internet

Open OnDemand

https://ood.triton.aalto.fi, Web-based interface to the cluster. Includes shell access and data transfer. “Triton Shell Access” for the terminal. More info.

VPN and Aalto networks

VSCode

Web-based available via OpenOnDemand (row above).

Desktop-based “Remote SSH” allows running on Triton (which is OK, but don’t use it for large computation). More info.

Same as Open OnDemand or SSH above

Get an account

First, you need to get an account.

Connecting via ssh

Prerequisites

This section assumes that you have a basic understanding of the Linux shell, you know what an ssh key is, that you have an ssh public/private key pair stored in the default location, and that you have some basic understanding of the ssh config. If you lack any of these, have a look at the following pages:

Setting up ssh for passwordless access

The following guide shows you how to set up the ssh system to allow you to connect to Triton from either outside of the Aalto network or from within using an ssh key instead of your password. In the following guide USERNAME refers to your Aalto user name and ~/.ssh refers to your ssh config folder. (On Windows, you can use GIT-bash, which will allow you to use linux style abbreviations. The actual folder is normally located under C:\Users\currentuser\.ssh, where currentuser is the name of the user). First, create the file config in the ~/.ssh folder with the following content, or add the following lines to it if it already exists. Instead of kosh you can also use any other remote access server (see Remote Access)

Host triton
    User USERNAME
    Hostname triton.aalto.fi

Host kosh
    User USERNAME
    Hostname kosh.aalto.fi


Host triton_via_kosh
    User USERNAME
    Hostname triton
    ProxyJump kosh

Next, you have to add your public key to the authorized keys of both kosh and Triton. For this purpose you have to connect to the respective servers and add your public key to the authorized_keys file in the server’s .ssh/ folder.

# Connect and log in to kosh
ssh kosh
# Open the authorized_keys file and copy your public key.
nano .ssh/authorized_keys
# Copy your public key into this file
# to save the file press ctrl + x and then confirm with y
# afterwards exit from kosh
exit

Now you do the same for Triton by using our defined proxy jump over kosh.

# Connect and log in to Triton via the kosh proxy jump
ssh triton_via_kosh
# Open the authorized_keys file and copy your public key.
nano .ssh/authorized_keys
# Copy your public key into this file
# to save the file press ctrl + x and then confirm with y
# afterwards exit from Triton
exit

Now, to connect to Triton you can simply type:

ssh triton
# Or, if you are not on the aalto network:
ssh triton_via_kosh
Installing and running an X Server on Windows

This tutorial explains how to install an X server on Windows. We will use VcXsrv, a free X server, for this purpose.

Steps:

  • Download the installer from here

  • Run the installer.

    • Select Full under Installation Options and click Next

    • Select a target folder

To Run the Server:

  • Open the XLaunch program (most likely on your desktop)

  • Select Multiple Windows and click Next

  • Select Start no client and click Next

  • On the Extra settings window, click Next

  • On the Finish configuration page click Finish

You have now started your X Server.

Set up your console

In Git bash or the Windows command line (cmd) terminal, before you connect to an ssh server, you have to set the display that will be used. Under normal circumstances, VcXsrv will start the X server as display 0.0. If for some reason the remote graphical user interface does not start later on, you can check the actual display by right-clicking on the tray icon of the X server and selecting Show log. Search for DISPLAY in the log file, and you will find something like:

DISPLAY=127.0.0.1:0.0

In your terminal enter:

set DISPLAY=127.0.0.1:0.0

Now you are set up to connect to the server of your choice via:

ssh -Y your.target.host

Note that on Windows you will likely need the -Y flag for X server connections, since -X does not normally seem to work.

Data on Triton

This section gives best practices for data usage, access, and transfer to and from Triton.

Prerequisites

For data transfer, we assume that you have set up your system according to the instructions in the quick guide

Locations and quotas

Name

Path

Quota

Backup

Locality

Purpose

Home

$HOME or /home/USERNAME/

hard quota 10GB

Nightly

all nodes

Small user specific files, no calculation data.

Work

$WRKDIR or /scratch/work/USERNAME/

200GB and 1 million files

x

all nodes

Personal working space for every user. Calculation data etc. Quota can be increased on request.

Scratch

/scratch/DEPT/PROJECT/

on request

x

all nodes

Department/group specific project directories.

Local temp

/tmp/

limited by disk size

x

single-node

Primary (and usually fastest) place for single-node calculation data. Removed once user’s jobs are finished on the node.

Local persistent

/l/

varies

x

dedicated group servers only

Local disk persistent storage. On servers purchased for a specific group. Not backed up.

ramfs (login nodes only)

$XDG_RUNTIME_DIR

limited by memory

x

single-node

Ramfs on the login node only, in-memory filesystem

Access to data and data transfer

Prerequisites

On Windows systems, this guide assumes that you use GIT-bash, and have rsync installed according to this guide

Download data to Triton

To download a dataset directly to Triton, if it is available somewhere online at a URL, you can use wget:

wget https://url.to.som/file/on/a/server

If the data requires a login you can use:

wget --user username --ask-password https://url.to.som/file/on/a/server

Downloading directly to Triton allows you to avoid the unnecessary network traffic and time required to first download the data to your machine and then transfer it over to Triton.

If you need to download a larger (>10GB) dataset to Triton from the internet please verify that the download actually succeeded properly. This can be done by comparing the md5 checksum (or others using e.g. sha256sum and so on), commonly provided by hosts along with the downloadable data. The resulting checksum has to be identical to the one listed online. If it is not, your data is most likely corrupted and should not be used. After downloading simply run:

md5sum downloadedFileName

For very large datasets (>100GB) you should check whether they are already on Triton. The folder for these kinds of datasets is located at /scratch/shareddata/dldata/; if your dataset is not there, please contact the admins to have it added. This avoids the same dataset being downloaded multiple times.

Copy data to and from Triton

The folders available on Triton are listed above. To copy small amounts of data to and from Triton from outside the Aalto network, you can either use scp or on linux/mac mount the file-system using sftp (e.g. sftp://triton_via_kosh).

From inside the Aalto network (or VPN), you can also mount the Triton file system via smb (More details can be found here):

  • scratch: smb://data.triton.aalto.fi/scratch/.

  • work: smb://data.triton.aalto.fi/work/$username/.

For larger files, or folders with multiple files and if the data is already on your machine, we suggest using rsync (For more details on rsync have a look here):

# Copy PATHTOLOCALFOLDER to your Triton home folder
rsync -avzc -e "ssh" PATHTOLOCALFOLDER triton_via_kosh:/home/USERNAME/
# Copy PATHTOTRITONFOLDER from your Triton home folder to LOCALFOLDER
rsync -avzc -e "ssh" triton_via_kosh:/home/USERNAME/PATHTOTRITONFOLDER LOCALFOLDER triton_via_kosh:/home/USERNAME/
Best practices with data

I/O can be a limiting factor when using the cluster. Probably the most important factor limiting I/O speed on Triton is file size: the smaller the files, the more inefficient their transfer. When you run a job on Triton and need to access many small files, we recommend first packing them into a large tarball:

# To tar, and compress a folder use the following command
tar -zcvf mytarball.tar.gz folder
# To only bundle the data (e.g. if you want to avoid overhead by decompressing)
# a folder use the following command
tar -cvf mytarball.tar folder

Then copy the tarball over to the node where your code is executed and extract it there, within the slurm script or your code.

# copy it over
cp mytarball.tar /tmp
# and extract it locally
tar -xf /tmp/mytarball.tar

If each input file is only used once, it’s more efficient to load the tarball directly from the network drive. If it fits into memory, load it into memory; if not, try to use a sequentially-reading input method and have the required files in the tarball in the required order. For more information on storage and data usage on Triton, see the data storage documentation.
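For example, individual members can be listed and streamed straight from a tarball on the network drive (a sketch; the archive and file names are placeholders):

tar -tf $WRKDIR/mytarball.tar | head                                # list the first members
tar -xOf $WRKDIR/mytarball.tar data/input1.csv > /tmp/input1.csv    # stream one member out without unpacking everything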

Submitting jobs on Triton

Prerequisites

Optimally, before submitting a job: do enough tests and have a rough idea of how long your job takes, how much memory it needs and how many CPU(s)/GPU(s) it needs. Required Reading:

Required Setup:

Types of jobs:

Triton uses the Slurm scheduling system to allocate resources, like computer nodes, memory on the nodes, GPUs etc, to the submitted jobs. For more details on Slurm, have a look here. In this quickstart guide, we will only introduce the most important parameters, and skip over a lot of details. There are multiple different types of jobs available on Triton. Here we focus on the most commonly used ones.

  • Interactive jobs (commonly to test things or run graphical platforms with cluster resources)

  • Batch jobs (normal jobs submitted to the cluster without direct user input)

To run an interactive job, connect to Triton and simply run

sinteractive

from the command line. You will then be connected to a free node, and can run your interactive session. More details can be found in the tutorial for interactive jobs. If you have a specific command that you want to run you can also use:

srun your_command

The most common job to run is a batch job, i.e. you submit a script that runs your code on the cluster. To run this kind of job, you need a small script where you set parameters for the job and submit it to the cluster. Using a script to set the parameters has the advantage that it is easier to modify and reuse than passing the parameters on the command line. A basic script (e.g. in the file BatchScript.slurm) for a slurm batch job could look as follows:

#!/bin/bash
#SBATCH --time=04:00:00
#SBATCH --mem=2G
#SBATCH --output=ScriptOutput.log

module load anaconda
srun python /path/to/script.py

To run this script use the command sbatch BatchScript.slurm.

So, let us go through this script:

#SBATCH --time=04:00:00 asks for a 4 hour time slot, after which the job will be stopped.
#SBATCH --mem=2G asks for 2Gb of memory for your job.
#SBATCH --output=ScriptOutput.log sets the terminal output of the job to the specified file.
module load anaconda tells the node you run on to load the anaconda module.
srun python /path/to/script.py tells the cluster to run the command python /path/to/script.py

Most programming languages and tools have their own modules that need to be loaded before they can be run. You can get a list of available modules by running module spider. If you need a specific version of a module, you can check the available versions by running module spider MODULENAME (e.g. module spider r for R). To load a specific version you have to specify this version during the load command (e.g. module load matlab/r2018b for the 2018b release of MATLAB). For further details please have a look at the instructions for the specific application
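For example:

module spider matlab         # list all available MATLAB versions
module load matlab/r2018b    # load the specific version mentioned above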

There are plenty more parameters that you can set for the slurm scheduler (a detailed list can be found here), but we are not going to discuss them in detail here, since they are likely not necessary for your first job.

Creating a graphical job on triton

Prerequisites

Before submitting a job: optimally, through tests, have a rough idea of how long your job takes, how much memory it needs and how many CPU(s)/GPU(s) it needs.

Required Reading:

Required Setup:

First off, in general, using graphical user interfaces to programming languages (e.g. graphical Matlab, or RStudio) on the cluster is not recommended, since for most such work there is no real advantage to submitting a job to the cluster.

However, there are instances where you might need a large amount of resources, e.g. to visualize large data, which is indeed an intended use. There are two things you need to do to run a graphical program on the cluster:

  • Start X-forwarding (ssh -X host or ssh -Y host)

  • request an interactive job on the cluster (sinteractive)

Once you are on a node, you can load and run your program.
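A minimal sketch of the whole sequence (the resource options and the program are only examples):

ssh -X triton.aalto.fi                   # use -Y instead of -X on Windows, see above
sinteractive --time=02:00:00 --mem=4G    # get an interactive session on a compute node
module load matlab
matlab &                                 # the graphical window opens via X forwarding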

As for using various programming languages to run on Triton, one can see the following examples:

Getting Triton help

There are many ways to get help, and you should try them all. If you are just looking for the most important link, it is our issue tracker.

Whatever you do, these guidelines for making good support requests are very useful.

See also

Are you just looking for a Triton account? See Triton accounts.

Give enough information

We get many requests for help which are too vague to give a useful response. So, when sending us a question, always answer these questions and you’ll get the fastest useful response:

  • Has it ever worked? (If so, what has changed?)

  • What are you trying to accomplish? (Your ultimate goal, not current technical obstacle.)

  • What did you do? (Be specific enough to be reproducible - copy and paste exact commands you run, scripts, inputs, output messages, etc.)

  • What do you need? Do you need a complete solution, pointers to get started, or should we say if it will take too long and we recommend you think of other solutions first?

If you don’t know something, it’s OK, just do your best and we’ll help from there! You can also chat with us to brainstorm about issues in general. A much more detailed guide is available from Sigma2 documentation.

The Triton docs

In case you got to this page directly, you are now on the Triton and Science-IT (CS, NBE, PHYS at least) documentation site. See the main page for the index.

Your colleagues

Science is a collaborative process, even if it doesn’t seem so. Academic courses don’t teach you everything you need to know, so it’s worth trying to work together and learn from each other - your group is the expert in its work, after all.

Daily garage

Come by one of the online Scientific computing garages any day at 13:00. It’s the best place to get problems solved fast - chat with us and see.

Issue tracker

We keep track of cluster issues at https://version.aalto.fi/gitlab/AaltoScienceIT/triton/issues. Feel free to post your issue there. Either admins or other users can reply — and you should feel free to reply and help others, too. The system is accessible from anywhere in the world, but you need to login with HAKA (using the button). All newly created issues are reported to admins by email.

This is the primary support channel and is meant for general issues: general help, troubleshooting, problems with code, new software requests, and problems that may affect several users.

Note

If you get a message that you are blocked from version.aalto.fi, send an email to servicedesk. It’s not your fault: the system automatically blocks people when their organizational unit changes. Yes, this is bad, but it’s not in our control…

If you have an Aalto visitor account, login with HAKA won’t work - use your email address and Aalto password.

Email ticketing system

For private issues you can also contact us via our email alias (on our wiki pages, login required). This is primarily intended for specific issues such as requesting new accounts, quotas, etc. Please avoid sending personal mails directly to admins, because it is best for all admins to be aware of issues, people may be absent, and personal emails are likely to be lost.

Most general issues should be reported to the issue tracker instead, not by email. Email is primarily for accounts related queries.

Research Software Engineers

Sometimes, a problem goes beyond “Triton support” and becomes “Research support”. Our Research Software Engineers are perfect for these kinds of problems: they can program with you, set up your workflow, or even handle all the technical problems for you.

Users’ mailing list

All cluster users are on the triton-users mailing list (automagically kept in sync with those who have Triton access). It is mainly for announcements and open discussions; for problem solving, please use the issue tracker.

If you do not receive list emails, check with your local Triton admin that you are on the list. Otherwise you will miss all the announcements, including critical ones about maintenance breaks.

Triton support team

Most of us are members of your department’s support teams, so we can answer questions about balancing use of Triton and your department’s computers. We also like it when people drop by and talk with us, so that we can better plan our services. In general, don’t mail us directly - use either the issue tracker above or the support email address; you can address your request to a specific person.

Dept | Name | Location
CS/NBE | Mikko Hakala | T-building A243 / Otakaari 3, F354
CS | Simo Tuomisto | T-building A243
PHYS | Simppa Äkäslompolo | Otakaari 1, Y415a
PHYS | Ivan Degtyarenko | Otakaari 1, Y415a
CS/SCI | Richard Darst | T-building A243
NBE | Enrico Glerean | Otakaari 3, F354

Science-IT trainings

We have regular training in topics relevant to HPC and scientific computing. In particular, each January and June we have a “kickstart” course which teaches you everything you need to know to do HPC work. Each Triton user should come to one of these. For the schedule, see our training page.

Getting a detailed bug report with triton-record-environment

We have a script named triton-record-environment which will record key environment variables, input, and output. This greatly helps in debugging.

To use it to run a single command that gives an error:

triton-record-environment YOUR_COMMAND
Saving output to record-environment.out.txt
...

Then, just check the output of record-environment.out.txt (it shouldn’t have any confidential information, but make sure) and send it to us/attach it to the bug report.

If you use Python, add the -p option; MATLAB should use -m, and graphical programs -x (these options have to go before the command you execute).
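For example, to record a run of a (hypothetical) Python script:

triton-record-environment -p python my_script.py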

Frequently asked questions
Job status and submission
Why can I only run a limited number of jobs at once, even when the cluster is idle?

Accounts are limited in how much they can run at a time, in order to prevent a single user or a few users from hogging the entire cluster with long-running jobs when it happens to be idle (e.g. after a service break). The limit caps the maximum total remaining runtime of all the jobs of a user. So the way to run more jobs concurrently is to run shorter and/or smaller (fewer CPUs, less memory) jobs. For an in-depth explanation see http://tech.ryancox.net/2014/04/scheduler-limit-remaining-cputime-per.html and for a graphical simulator you can play around with: https://rc.byu.edu/simulation/grpcpurunmins.php . You can see the exact limits of your account with

sacctmgr -s show user $USER format=user,account,grptresrunmins%70
Why is my job requeued and held after a node failure?

Slurm is configured so that if a job fails due to some outside reason (e.g. the node where it’s running fails, rather than the job itself crashing due to a bug), the job is requeued in a held state. If you’re sure that everything is OK again, you can release the job for scheduling with scontrol release JOBID. If you don’t want this behavior (i.e. you’d prefer that such failed jobs simply disappear), you can prevent the requeuing with

#SBATCH --no-requeue
Why does my pending job show the reason BadConstraints, even though it eventually runs?

This happens when a job is submitted to multiple partitions (this is the default: it tries partitions of all node types) and the constraints cannot be satisfied in some of them. Slurm then shows the BadConstraints reason for the whole job, even though it will eventually run. (If the constraints are bad in all partitions, submission will usually fail right away with something like sbatch: error: Batch job submission failed: Requested node configuration is not available.)

You don’t need to do anything, but if you want a clean status, you can get rid of this message by limiting the job to partitions that actually satisfy the constraints. For example, if you request 96 CPUs, you can limit the job to the Milan nodes with -p batch-milan, since those are the only nodes with more than 40 CPUs. This example is valid as of 2023; if you are reading this later you need to figure out what the current state is (or ask us).

How can I find out how much run time my job has left?

You can find out the remaining time of any running job with (replace JOBID with the job’s ID):

squeue -h -j JOBID -o %L

Inside a job script or sinteractive session you can use the environment variable SLURM_JOB_ID to refer to the current job ID.
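For example, inside a job script this prints the remaining time of the current job:

squeue -h -j $SLURM_JOB_ID -o %L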

When exactly will Slurm kill my job?

Slurm kills jobs based on the partition’s TimeLimit plus the OverTimeLimit parameter. The latter is 60 minutes in our case. If, for instance, the partition time limit is 4 hours, Slurm will allow the job to run for 4 hours plus 1 hour, thus no longer than 5 hours. OverTimeLimit may change, so don’t rely on it: the partition’s (aka queue’s) TimeLimit is what you should take into account when submitting a job. You can check the time limits per partition with the slurm p command.

To set an exact time after which your job should be killed anyway, set the --time parameter when submitting the job. When the time limit is reached, each task in each job step is sent SIGTERM followed by SIGKILL. If you run a parallel job, set --time with srun as well. See man srun and man sbatch for details.

#SBATCH --time=1:00:00
...

srun --time=1:00:00 ...
Why do I get “Requested node configuration is not available”?

You have requested a combination of Slurm options which no nodes can satisfy (for example, asking for a GPU with --gres=gpu but a partition without GPUs). Figure out what the problem is and adjust your Slurm options.

Why does my job stay pending because nodes are down, drained, or reserved?

This error usually occurs when a requested node is down, drained, or reserved, which can happen if the cluster is undergoing some work - and it is more likely if there are very few default nodes for Slurm to choose from. If this error occurs, the shell will usually hang after the job has been submitted while the job waits for an allocation. To find which nodes are available, use sinfo; the STATE column shows the state of the nodes in each partition.

To fix this, either wait for the node to become available or choose a different partition with the --partition= option, using one of the partitions from sinfo which has free and available (idle) nodes.

Accounts and Access to Triton
How do I access my Triton files from my own workstation?

Remote mounting

The scratch filesystem can be mounted from inside the Aalto networks using smb://data.triton.aalto.fi/scratch/. For example, from Nautilus (the file manager) on Ubuntu, use “File” -> “Connect to server”. Outside the Aalto networks, use the Aalto VPN. If it is not an Aalto computer, you may need to use AALTO\username as the username, and your Aalto password.

Or you can use sshfs, a filesystem client based on SSH. Most Linux workstations have it installed by default; if not, install it or ask your local IT support to do it for you. To set up an SSHFS mount from your local workstation, create a local directory and mount the remote directory with sshfs:

$ mkdir /LOCALDIR/triton
$ sshfs user1@triton.aalto.fi:/triton/PATH/TO/DIR /LOCALDIR/triton

Replace user1 with your real username and /LOCALDIR with a real directory on your local drive. After a successful mount, use your /LOCALDIR/triton directory as if it were local. To unmount it, run fusermount -u /LOCALDIR/triton.

PHYS users example, assuming that Triton and PHYS accounts are the same:

$ mkdir /localwrk/$USER/triton
$ sshfs triton.aalto.fi:/triton/tfy/work/$USER  /localwrk/$USER/triton
$ cd /localwrk/$USER/triton
... (do what you need, and then unmount when there is no need any more)
$ fusermount -u /localwrk/$USER/triton

Easy access with Nautilus

The SSHFS method described above works from any console. On Linux desktops with a GUI such as GNOME or Unity (i.e. all Ubuntu users), you can also use Nautilus, the default file manager, to mount a remote SSH directory. Click File -> Connect to Server, choose SSH, enter triton.aalto.fi as the server and the directory /triton/PATH/TO/DIR you’d like to mount, and type your username. Leave the password field empty if you use an SSH key. As soon as Nautilus establishes the connection, it will appear on the left-hand side below the Network header, and you can access it as if it were a local directory. To keep it as a bookmark, click on the mount point and press Ctrl+D; it will appear below the Bookmarks header in the same menu.

Copying files

If your workstation has no NFS mounts from Triton (CS and NBE have them; consult your local admins for the exact paths), you can always use SSH. For example, copy your files from Triton to a local directory on your workstation, like:

$ sftp user1@triton.aalto.fi:/triton/path/to/dir/* .
How do I connect from my own computer to a server running on a compute node?

Let’s say you have some server (e.g. debugging server, notebook server, …) running on a node. As usual, you can do this with ssh using port forwarding. It is the same principle as in several of the above questions.

For example, you want to connect from your own computer to port AAAA on node nnnNNN. You run this command:

ssh -L BBBB:nnnNNN:AAAA username@triton.aalto.fi

Then, when you connect to port BBBB on your own computer (localhost), the connection gets forwarded straight to port AAAA on node nnnNNN. Thus only one ssh connection gets us to any node. It is possible for BBBB to be the same as AAAA, and this works with any type of connection. The node has to be listening on any interface, not just the local interface. To connect to localhost:AAAA on a node, you need to repeat the above step twice to forward from workstation->login and login->node, with the second nnnNNN being localhost.
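A minimal sketch of that two-hop case (AAAA, BBBB, and nnnNNN are placeholders, as above):

# on your own computer: forward local port BBBB to the login node
ssh -L BBBB:localhost:BBBB username@triton.aalto.fi
# then, on the login node: forward that port to localhost:AAAA on the compute node
ssh -L BBBB:localhost:AAAA nnnNNN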

Why won’t graphical (X11) programs open over SSH?

In order for graphical programs on Linux to work, the file ~/.Xauthority has to be written. If your home directory quota (check with quota) is exceeded, this file can’t be written and graphical programs can’t open. If your quota is exceeded, clean up some files, close your connections, and log in again. You can find where most of your space goes with du -h $HOME | sort -hr | less.

This is often the case if you get X11 connection rejected because of wrong authentication.

Storage, file transfer and quota
Why do I get “Disk quota exceeded” errors?

Main article: Triton Quotas

Everyone should have a group quota, but no user quota. All files need to be in a proper group (either a shared group with a quota, or your “user private group”). First of all, use the quota command to make sure that neither the disk space nor the number-of-files limit is exceeded. Also, make sure that you use $WRKDIR for data and not $HOME. If you actually need more quota, ask us.

Solution: set your main directory and all your subdirectories to the right group, and make sure all directories have the group s-bit (SETGID bit, see man chmod) set. This means “any files created within this directory get the directory’s group”. Since your default group is “domain users”, which has no quota, you get an immediate quota-exceeded error by default if the s-bit is not set.

# Fix everything
#  (only for $WRKDIR or group directories, still in testing):
/share/apps/bin/quotafix -sg --fix /path/to/dir/

# Manual fixing:
# Fix the setgid bit:
lfs find $WRKDIR -type d --print0 | xargs -0 chmod g+s
# Fix group:
lfs find /path/to/dir  ! --group $GROUP -print0 | xargs -0 chgrp $GROUP

Why this happens: your $WRKDIR directory is owned by you and by your personal group, which has the same name and GID as your UID. Quota is set per group, not per user; that is how it has been implemented since 2011, when we took Lustre into use. Since spring 2015, Triton uses Aalto AD for authentication, which sets everyone’s default group to ‘domain users’. If you copy anything into a $WRKDIR subdirectory that has no +s bit, you copy as a ‘domain users’ member and the file system refuses because that group has no quota. If the g+s bit is set, all directories/files you copy or create get the directory’s group ownership instead of the default ‘domain users’ group. There can be very confusing interactions between this and user/shared directories.

Why does copying to Triton with rsync -a or cp -p fail with quota errors?

This is related to the issue above: commands like rsync -a … or cp -p … try to preserve the original group ownership attribute, which will not work. Try this instead:

## mainly, avoid -g (as well as -a), which preserves group attributes
$ rsync -urlptDxv --chmod=Dg+s somefile triton.aalto.fi:/path/to/work/directory

## avoid '-p' with cp; if you want to keep timestamps, mode etc., use '--preserve='
$ cp -r --preserve=mode,timestamps  somefile /path/to/mounted/triton/work/directory
Why do I suddenly get “Permission denied” on my network directories?

Most likely your Kerberos ticket has expired. If you log in with a password or run kinit, you get a new ticket. See the page on data storage and remote data for more information.

How do I copy files between Triton and my home computer when I’m outside Aalto?

This is an extension of the previous question. Suppose you are outside Aalto and have neither direct access to Triton nor access to the NFS-mounted directories on your department servers, and you want to copy your Triton files to your home workstation. This can be done by setting up an SSH tunnel via your department SSH server. The steps: set up a tunnel to your local department server, then from your department server to Triton, and then run whatever rsync/sftp/ssh command you want from your client through that tunnel. The tunnel has to stay up during the whole session.

client: ssh -L9509:localhost:9509 department.ssh.server
department server: ssh -L9509:localhost:22 triton.aalto.fi
client: sftp -P 9509 localhost:/triton/own/dir/* /local/dir

Note that port 9509 is used only as an example; you can use any other available port. Alternatively, if you have a Linux or Mac OS X machine, you can set up a “proxy command” so you don’t have to do the steps above manually every time. On your home machine/laptop, put these lines in the file ~/.ssh/config:

Host triton
    ProxyCommand /usr/bin/ssh DEPARTMENTUSERNAME@department.ssh.server "/usr/bin/nc -w 10 triton.aalto.fi 22"
    User TRITONUSERNAME

This creates a host alias “triton” that is proxied via the department server. So you can copy a file from your home machine/laptop to triton with a command like:

rsync filename triton:remote_filename
Why do I get errors when writing files to my home directory?

Most probably you have exceeded your quota; check it with the quota command.

quota is a wrapper at /usr/local/bin/quota on the front end which merges output from the classic quota utility (which supports NFS) and Lustre’s lfs quota. The NFS $HOME directory is limited to 10GB for everyone and is intended mainly for initialization files. The grace period is 7 days and the “hard” quota is 11GB, which means you may exceed your 10GB quota by 1GB and then have 7 days to go back below 10GB. However, no one can exceed the 11GB limit.

Note: Lustre mounted under /triton is the right place for your simulation files. It is fast and has large quotas.

Are my files backed up?

Short answer: yes for $HOME directory and no for $WRKDIR.

$HOME is slow NFS with a small quota, mounted over Ethernet. It is intended mainly for user initialization files and plain configs. We make regular backups of $HOME.
$WRKDIR (aka /triton) is fast Lustre with a large quota, mounted over InfiniBand. No backups are made of /triton; the DDN storage system itself is a secure and safe place for your data, but you can always lose data by deleting it by mistake. Every user must take care of their own work files. We provide as much disk space as each user needs, and the amount of data grows rapidly; that is why you should manage your important data yourself. Consider backing up your valuable data on DVDs/USB drives or other resources outside Triton.
Command line interface
Can I change my login shell to bash?

Yes. Change the shell on your Aalto account and re-login to Triton for the newly changed shell to take effect. To change it, log in to kosh.aalto.fi, run kinit first, then run chsh and type /bin/bash when asked for the new shell. To find out your current shell, run echo $SHELL.
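A minimal example, run on kosh.aalto.fi (chsh -s sets the new shell non-interactively):

kinit
chsh -s /bin/bash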

For the record: your default shell is not set by Triton environment but by your Aalto account.

Why do I get warnings about the locale (LC_*) when I log in or run programs?

This happens because your computer sends its “locale” information (language, number format, etc.) to the other computer (Triton), but Triton doesn’t know the locale used on your computer. You can unset/adjust all the LC_* and/or LOCALE environment variables, or try setting the following in the Triton section of your .ssh/config (see SSH for info on how this works; you need more than you see here):

Host triton
    SetEnv LC_ALL=C

env | grep LC_ and env | grep LANG might give you hints about exactly what environment variables are being sent from your computer (and thus you should override in the ssh config file).

Modules and environment settings
My batch job fails with missing shared libraries even though I load the right module. Why?

You have included module load module/name, but the job still fails due to missing shared libraries, or it cannot find some binary, etc. That is a known zsh-related issue. In your sbatch script, use the -l option (aka --login), which forces bash to read all the initialization files at /etc/profile.

#!/bin/bash -l
...

Alternatively, you can change your shell from zsh to bash to avoid these hacks; see the question above.

The default git is very old. How do I get a newer version?

Indeed, the default git that comes with the Triton OS (CentOS) is quite old (v1.8.x). To get a more modern git, run module load git (version 2.28.0 at the time of writing).

Coding and Compiling
Why does my program fail with an error about libcuda.so.1?

You are trying to run a GPU program (using CUDA) on a node without a GPU (and thus without libcuda.so.1). Remember to specify that you need GPUs.
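For example, in your batch script (a single GPU here is just an example; see the GPU documentation for the full options):

#SBATCH --gres=gpu:1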

What compilers are available on Triton?

Currently there are two different sets of compilers: (i) GNU compilers, native for Linux, installed by default, (ii) Intel compilers plus MKL, a commercial suite, often the fastest compiler on Xeons.

FGI provides all FGI sites with 7 Intel licenses, thus only 7 users can compile/link with Intel at once.

Why does my program fail with “error while loading shared libraries”?

That means your program can’t find libraries that were used at link/compile time. You can always check the shared library dependencies:

$ ldd YOUR_PROGRAM # print the list of libraries required by the program

If some library is marked as not found, you should first (i) find the exact path to that library (assuming it is installed), and then (ii) explicitly add it to your $LD_LIBRARY_PATH environment variable.

For instance, if your code was previously compiled against libmpi.so.0 but on SL6.2 it reports an error like error while loading shared libraries: libmpi.so.0, try to locate the library:

$ locate libmpi.so.0
/usr/lib64/compat-openmpi/lib/libmpi.so.0
/usr/lib64/compat-openmpi/lib/libmpi.so.0.0.2

and then add it to your $LD_LIBRARY_PATH:

export LD_LIBRARY_PATH=/usr/lib64/compat-openmpi/lib:$LD_LIBRARY_PATH # export the lib in the BASH environment

or, as in the case of libmpi.so.0, where we have a ready module config, just run

module load compat-openmpi-x86_64

In case your code is missing some specific libs not installed on Triton (say you got a binary compiled somewhere else), you have a few choices: (i) get a statically linked program, or (ii) find/download the missing libs (for instance from the developers’ site). For the second, copy the libs to your $WRKDIR and add the paths to $LD_LIBRARY_PATH, in the same manner as described above.

See also:

ldconfig -p # print the list of system-wide available shared libraries
How do I debug other shared library problems (e.g. “version GLIBC_2.29 not found”)?

Background: Compiled code has dynamic libraries. When a program runs, it needs to load that code. The code embeds the name of the library like libc.so.6 and then when it runs, it uses built-in paths (/etc/ld.so.conf) and the LD_LIBRARY_PATH environment variable. It takes the first thing it finds and loads it.

In all of these cases, you are working on the fine line between the operating system, software we have installed, and software you have installed. Have a very low threshold for asking for help by coming to our daily garage with your problem: we might have a much easier solution, much faster than you can figure it out.

Problem 1: Library not found: In this case, something expects a certain library, but it can’t be found. Possible solutions could include:

  • Loading a module that provides the library (did you have a module loaded when you compiled the code? Are you installing a Python/R extension that needs a library from outside?)

  • Setting the LD_LIBRARY_PATH variable to point to the library. If you have self-compiled things, this might be appropriate, but it might also be a sign that something else is wrong.

Problem 2: library version not found (such as GLIBC_2.29 not found): This usually means that it’s finding a library, loading it, but the version is too old. This especially happens on clusters, where the operating system can’t change that often.

  • If it’s about GLIBCXX_version, you can module load gcc of a proper version, or if you are in a conda environment, install the gcc package to bring in a newer version of the library.

  • If it’s about GLIBC, then it’s about the base C library libc, and that is very hard to solve, since it is intrinsically connected to the operating system. Likely, the program was compiled on an operating system too new for the cluster, and you should think about re-compiling it on the cluster or putting it in a container.

  • Setting LD_LIBRARY_PATH might help to direct to a proper version. Again, this probably indicates some other problem.

Problem 3: you think you have the newer library loaded by a module or something, but it still gives a version error: This has sometimes happened with programs that use extensions. The base program uses an older version of the library, but an extension needs a newer version. Since the base program has already loaded the older version, even specifying the new version via LD_LIBRARY_PATH doesn’t help much.

  • Solution: this is tricky, since the program should be using the newer version if it’s on LD_LIBRARY_PATH already. Maybe it’s hard-coded to use a particular older version? In that case, you may need a newer version of the base program itself (an example of this was an R extension that expected a newer GLIBCXX_version: the answer was to build Triton’s R module with a newer gcc compiler version). If you hit this case, you should be asking us to take a look.

Should I link my program statically or dynamically?

You can use both, though for shared libs all your linked libraries must either be in your $WRKDIR, in /shared/apps, or installed by default on all the compute nodes, like the vast majority of GCC and other default Linux libraries.

How can I tell what kind of executable a file is?

Use file utility:

# file /usr/bin/gcc
/usr/bin/gcc: ELF 64-bit LSB executable, AMD x86-64, version 1 (SYSV),
for GNU/Linux 2.4.0, dynamically linked (uses shared libs), not stripped

It displays the type of an executable or object file.

Other issues
Can I print files from Triton on a local printer?

We don’t have local department printers configured anywhere on Triton, but you can use SSH to send a file or command output to a remote printer. Run these from your local workstation, inserting the target printer name:

... printing text file
$ ssh user@triton.aalto.fi "cat file.txt" | enscript -P printer_name
... printing a PostScript file
$ ssh user@triton.aalto.fi "cat file.ps" | lp -d printer_name -
... printing a man page
$ ssh user@triton.aalto.fi "man -t sbatch" | lp -d printer_name -
Why am I on the triton-users mailing list, and can I unsubscribe?

Having a user account on Triton also means being on the triton-users at aalto.fi mailing list. That is where the support team sends all Triton-related announcements. All Triton users MUST be subscribed to the list. It is automatically kept up to date these days, but just in case you are not yet there, please send an email to your local team member and ask them to add your email.

How to unsubscribe? You will be removed from the mailing list as soon as your Triton account is deleted from the system. Otherwise there is no way to unsubscribe, since we must be able to notify you about urgent things that affect data integrity or other issues.

What do node names like cn01 or gpu[001-008] mean?

All the hardware delivered by the vendor has been labeled with a short name; in particular, every compute node has a label like Cn01 or GPU001. We use this notation to name the compute nodes, i.e. cn01 is just the hostname for Cn01, gpu001 for GPU001, etc. Shorthands like cn[01-224] mean all the hostnames in the range cn01, cn02, cn03 .. cn224; the same goes for gpu[001-008], tb[003-008], fn[01-02]. Similar notation can be used with Slurm commands, like:

$ scontrol show node cn[01-12]
X11 forwarding for graphical programs stopped working. Why?

Check your .bashrc and other startup files. Some modules bring in so many dependencies that it can interfere with standard operating system functions: in this case, SSH setting up X11 forwarding for graphical applications.

Cluster technical overview
Shared resource

Triton is a joint installation of a number of Aalto School of Science faculties within the Science-IT project, which was founded in 2009 to facilitate HPC infrastructure in the whole School of Science. It is now available to all Aalto researchers.

As of 2016, Triton is part of FGCI - the Finnish Grid and Cloud Infrastructure (successor of the Finnish Grid Infrastructure). Through the national grid and cloud infrastructure, Triton is also part of the European Grid Infrastructure.

Hardware

Node name | Number of nodes | Node type | Year | Arch (--constraint) | CPU type | Memory configuration | Infiniband | GPUs | Disks
pe[1-48,65-81] | 65 | Dell PowerEdge C4130 | 2016 | hsw avx avx2 | 2x12 core Xeon E5 2680 v3 2.50GHz | 128GB DDR4-2133 | FDR | - | 900GB HDD
pe[49-64,82] | 17 | Dell PowerEdge C4130 | 2016 | hsw avx avx2 | 2x12 core Xeon E5 2680 v3 2.50GHz | 256GB DDR4-2133 | FDR | - | 900GB HDD
pe[83-91] | 8 | Dell PowerEdge C4130 | 2017 | bdw avx avx2 | 2x14 core Xeon E5 2680 v4 2.40GHz | 128GB DDR4-2400 | FDR | - | 900GB HDD
c[639-647,649-653,655-656,658] | 17 | ProLiant XL230a Gen9 | 2017 | hsw avx avx2 | 2x12 core Xeon E5 2690 v3 2.60GHz | 128GB DDR4-2666 | FDR | - | 450G HDD
skl[1-48] | 48 | Dell PowerEdge C6420 | 2019 | skl avx avx2 avx512 | 2x20 core Xeon Gold 6148 2.40GHz | 192GB DDR4-2667 | EDR | - | No disk
csl[1-48] | 48 | Dell PowerEdge C6420 | 2020 | csl avx avx2 avx512 | 2x20 core Xeon Gold 6248 2.50GHz | 192GB DDR4-2667 | EDR | - | No disk
milan[1-32] | 32 | Dell PowerEdge C6525 | 2023 | milan avx avx2 | 2x64 core AMD EPYC 7713 @2.0 GHz | 512GB DDR4-3200 | HDR-100 | - | No disk
fn3 | 1 | Dell PowerEdge R940 | 2020 | avx avx2 avx512 | 4x20 core Xeon Gold 6148 2.40GHz | 2TB DDR4-2666 | EDR | - | No disk
gpu[1-10] | 10 | Dell PowerEdge C4140 | 2020 | skl avx avx2 avx512 volta | 2x8 core Intel Xeon Gold 6134 @ 3.2GHz | 384GB DDR4-2667 | EDR | 4x V100 32GB | 1.5 TB SSD
gpu[11-17,38-44] | 14 | Dell PowerEdge XE8545 | 2021, 2023 | milan avx avx2 ampere a100 | 2x24 core AMD EPYC 7413 @ 2.65GHz | 503GB DDR4-3200 | EDR | 4x A100 80GB | 440 GB SSD
gpu[20-22] | 3 | Dell PowerEdge C4130 | 2016 | hsw avx avx2 kepler | 2x6 core Xeon E5 2620 v3 2.50GHz | 128GB DDR4-2133 | EDR | 4x2 GPU K80 | 440 GB SSD
gpu[23-27] | 5 | Dell PowerEdge C4130 | 2017 | hsw avx avx2 pascal | 2x12 core Xeon E5-2680 v3 @ 2.5GHz | 256GB DDR4-2400 | EDR | 4x P100 | 720 GB SSD
gpu[28-37] | 10 | Dell PowerEdge C4140 | 2019 | skl avx avx2 avx512 volta | 2x8 core Intel Xeon Gold 6134 @ 3.2GHz | 384GB DDR4-2667 | EDR | 4x V100 32GB | 1.5 TB SSD
dgx[1-2] | 2 | Nvidia DGX-1 | 2018 | bdw avx avx2 volta | 2x20 core Xeon E5-2698 v4 @ 2.2GHz | 512GB DDR4-2133 | EDR | 8x V100 16GB | 7 TB SSD
dgx[3-7] | 5 | Nvidia DGX-1 | 2018 | bdw avx avx2 volta | 2x20 core Xeon E5-2698 v4 @ 2.2GHz | 512GB DDR4-2133 | EDR | 8x V100 32GB | 7 TB SSD
gpuamd1 | 1 | Dell PowerEdge R7525 | 2021 | rome avx avx2 mi100 | 2x8 core AMD EPYC 7262 @3.2GHz | 250GB DDR4-3200 | EDR | 3x MI100 | 32GB SSD

All Triton compute nodes are identical with respect to software and access to the common file system. Each node has its own unique hostname and IP address.

Networking

The cluster has two internal networks: InfiniBand for MPI and the Lustre filesystem, and Gigabit Ethernet for everything else, such as the NFS /home directories and ssh.

The internal networks are inaccessible from outside. Only the login node triton.aalto.fi has an extra Ethernet connection to the outside.

The high-performance InfiniBand network generally has a fat-tree configuration. Triton has several InfiniBand segments (often called islands), distinguished by CPU architecture. The nodes within an island are connected with different blocking ratios like 2:1, 4:1 or 8:1 (i.e. in the 4:1 case, for every 4 downlinks there is 1 uplink to the spine switches). The islands are ivb[1-45] with 540 cores, pe[3-91] with 2152 cores (keep in mind that pe[83-91] have 28 cores per node), four c[xxx-xxx] segments with 600 cores each, and skl[1-48] and csl[1-48] with 1920 cores each [CHECKME]. Uplinks from the islands are mainly used for Lustre communication. Running MPI jobs is possible within an entire island or a segment of it, but not across the whole cluster.

Disk arrays

All compute nodes and the front end are connected to a DDN SFA12k storage system: large disk arrays with the Lustre filesystem on top, cross-mounted under the /scratch directory. The system provides about 1.8 PB of disk space to end users.

Software

The cluster is running open source software infrastructure: CentOS 7, with SLURM as the scheduler and batch system.

Acknowledging Triton
Acknowledgement line

Triton and Science-IT get funding from the departments and Aalto, so it is critical that we show them the results of our work. Thus, please note that if you use the cluster for research that is published or presented in a talk or poster, you must acknowledge the Science-IT project of the School of Science, which funds Triton and affiliated resources. By published work we mean everything: articles, doctoral theses, diplomas, reports, and other relevant publications. Use of Triton can be anything: CPUs, GPUs, or the storage system (note that the storage system is the “scratch” system, which is cross-mounted to several different departments - you can use Triton without logging into it).

An appropriate acknowledgement line might be one of:

We acknowledge the computational resources provided by the Aalto Science-IT project.

or

The calculations presented above were performed using computer resources within the Aalto University School of Science “Science-IT” project.

You can decide which one fits better in your text/slides. Rephrasing is also fine; the main point is to reference Science-IT and Aalto. (Note that this does not exist in various funding databases; it is an Aalto-internal project.)

Reporting

This applies for:

  • Triton cluster usage (including data storage)

  • The Research Software Engineer service

  • SciComp garage support (if you think it’s significant enough).

We can’t automatically track all publications, so we need all users to verify that their publications are linked to Science-IT in ACRIS (the Aalto research information system). It takes about 30 seconds if you aren’t in ACRIS right now, or 5 seconds if you are already there. All publications are required to be in ACRIS anyway, so this is a fast process.

You can see the already-reported publications here: https://research.aalto.fi/en/equipment/scienceit(27991559-92d9-4b3b-95ee-77147899d043)/publications.html

Instructions:

  1. Log in to ACRIS: https://acris.aalto.fi

  2. Find your publication: Research Output (left sidebar) -> Click on your publication

    • If your publication is not already there, then see your department’s ACRIS instructions, or the general help below.

  3. Link it to Science-IT: scroll down to “Relations” -> “Facilities/Equipment” -> Search “Science-IT” and select it. (This is on the main page, not the separate “Relations” page.)

  4. Click Save at the bottom of the window.

  5. Repeat for all publications (and datasets, etc.)

Location of Facilities/Equipment link

You are done! Your publication should appear in our lists and support our continued funding.

More help:

Should you have problems, first contact your department’s ACRIS help (academic coordinators). If a publication or academic output somehow can’t be linked, let us know and we will make sure that we include it in our own lists.

Other promotional pictures for Science-IT’s use

We collect pictures about the work done by our community, which are used for various other presentations or funding proposals. If you have some good pictures of research which can be shared publicly, please send them to us.

  • Please state the requested credit (author) and citation for us to use.

  • Please clarify the license. CC-BY 4.0 is the minimum, but CC-0 is even better.

  • Optional: some text description about the research and/or use of resources.

  • Don’t worry about making things look perfect. Most things aren’t.

  • Send to scicomp@aalto.fi

Tutorials

Videos

Videos of this topic may be available from one of our kickstart course playlists: 2023, 2022 Summer, 2022 February, 2021 Summer, 2021 February.

These tutorials are designed to be read in order by every Triton user when they get their accounts (except maybe the last ones). In order to use Triton well, you should also know the Basics (A) and Linux (C) levels of the Hands-on SciComp roadmap as a prerequisite.

Cluster ecosystem explained
About these tutorials

Videos

Videos of this topic may be available from one of our kickstart course playlists: 2023, 2022 Summer, 2022 February, 2021 Summer, 2021 February.

Welcome to the Aalto Scientific Computing High-performance computing (HPC) tutorials. These tutorials will get you started with the Triton cluster.

Despite the HPC in the name, most of these tutorials are not about the high-performance part: instead, we get you started using and submitting jobs to the cluster. These days, many people use a cluster for simple jobs: getting more stuff done at once, not a few very big tasks. Doing the big tasks is a more specialized topic, which these tutorials will introduce you to; you will then be able to use other software for that. Programming your own HPC software is out of our scope.

Not at Aalto?
Tutorials required cluster setup

This page describes the HPC/Slurm setup needed to follow along in our HPC (=cluster computing) kickstart courses. The target audience of this page is HPC system staff who would like to direct their users to follow along with this course. What is on this page is not actual “requirements” but “if you don’t match this, you will have to tell your users”. Perhaps it could be added to your quick reference.

This course is designed to be a gentle introduction to scaling up from a local workflow to running on a cluster. It is not especially focused on the high performance part but instead the basics and running existing things on a cluster. And just to make it clear: our main lesson isn’t just following our tutorials, but teaching someone how to figure things out on other clusters, too.

Our philosophy for clusters is:

  • Make everything possible automatic (for example, partition selection, Slurm options). A user should only need to specify what is needed - at least for tutorials.

  • Standardization is good: don’t break existing standard Slurm things; it should be possible to learn “base Slurm” and use it across clusters (even if it’s not the optimal form).

General

These tutorials/our course will be quite easy to follow for users of a cluster which has:

  • Slurm as the batch system

  • You can get a shell (likely via ssh)

  • git installed without needing to load a module

  • Python 2 or 3 (any version) installed as python without needing to load a module.

Quick reference

If you run your own cluster, create a quick reference such as Triton quick reference so that others following tutorials such as ours can quickly translate to your own cluster’s specifics. (Our hope is that all the possible local configuration is on there, so that you can translate it to your site, and that is sufficient to run).

Connecting

Connection should be possible via ssh. You probably want a cheatsheet and installation help before our course.

Slurm

Slurm is the workload manager.

Partitions are automatically detected in most cases. We have a job_submit.lua script that detects these cases, so that for most practical purposes --partition never needs to be specified:

  • Partition is automatically detected based on run time (except for special ones such as debug, interactive, etc).

  • GPU partition is automatically detected based on --gres.

There are no other mandatory Slurm arguments such as account or cluster selection.

seff is installed.

We use this slurm_tool wrapper, but we don’t require it (but it might be useful for your users anyway, perhaps this is an opportunity for joint maintenance): https://github.com/jabl/slurm_tool

Applications

You use Lmod and it works across all nodes without further setup.

Git: Git is used to clone our examples (and should have network access).

Python: We assume Python is available (version 2 or 3 - we make our examples run on both) without loading a module. Many of our basic examples use this to run simple programs to demonstrate various cluster features without getting deeper into software.

Data storage

We expect this to be different for everyone. We expect most clusters have at least a home directory (small) and a work space (large and high-performance).

$HOME is the home directory, small and backed up, not for big research, mounted on all nodes.

$WRKDIR is an environment variable that points to a per-user scratch directory (large, not backed up, suitable for fast I/O across all nodes)

We also strongly recommend group-based storage spaces for better data management.

These tutorials use Aalto’s cluster as an example, but they are designed to be useful to a wide audience: most clusters operate on the same principles, with some local configuration or practices you need to learn. This course/these tutorials, along with a quick reference similar to ours, will be a great start to your career. (People running a cluster can check out our hint sheet to see what differences you may need to explain.)

We will point out things that may be different, but you need to consult your own reference to see how to do it:

  • The way you connect to the cluster, including remote access methods.

  • Exact names of batch partitions.

  • The slurm utility probably isn’t installed, and seff may not be there.

  • Module names for software.

  • You probably don’t have our Singularity container stuff installed.

  • Parallel and GPU stuff is probably different.

What’s next?

Introduce yourself to the cluster resources at Aalto.

About clusters and your work

Videos

Videos of this topic may be available from one of our kickstart course playlists: 2023, 2022 Summer, 2022 February, 2021 Summer, 2021 February.

This is the first tutorial. The next is Connecting to Triton.

Science-IT is an Aalto infrastructure for scientific computing. Its roots were a collaboration between the Information and Computer Science department (now part of CS), the Biomedical Engineering and Computational Science department (now NBE), and the Applied Physics department. It still serves all of Aalto and is organized from the School of Science.

You are now at the first step of the Triton tutorial.

What is a cluster?

A high-performance computing cluster is basically a lot of computers, not that different from your own. While the hardware is not that much more powerful than a typical “power workstation”, it’s special in that there is so much of it and you can use it together. We’ll learn more about how it’s used together later on.

Schematic of cluster.  At the left we see our laptop, the internet (cloud), and a network drive. To the right of that we see the login node, by which all connections go, data storage, and then all of the different compute nodes (CPU and GPU).

The schematic of our sample cluster. We’ll go through this piece by piece.

The things labeled “CPU Node” and “GPU Node” aren’t quite accurate in real life: each picture better depicts a whole rack of nodes. But we show it like this so that we can pretend that one row is a CPU later on, to illustrate a point.

About Triton

Triton is a mid-sized heterogeneous computational Linux cluster. This means that we are not at a massive scale (though we are, after CSC, the largest publicly known cluster in Finland). We are heterogeneous: we continually add new hardware and incrementally upgrade. We are designed for scientific computing and data analysis. We use Linux as the operating system (like most supercomputers). We are a cluster: many connected nodes with a scheduling system to divide work between them. The network and some storage are shared; CPUs, memory, and other storage are not.

A real Ship of Theseus

In the Ship of Theseus thought experiment, every piece of a ship is incrementally replaced. Is it the same ship or not?

Triton is a literal Ship of Theseus. Over the ~10 years it has existed, every part has been upgraded and replaced, except possibly some random cables and other small parts. Yet, it is still Triton. Most clusters are recycled after a certain lifetime and replaced with a new one.

On an international scale of universities, the power of Triton is relatively high and it has a very diverse range of uses, though CSC has much more. Using this power requires more effort than using your own computer - you will need to get/be comfortable in the shell, using shell scripting, managing software, managing data, and so on. Triton is a good system to use for learning.

Getting help

See also

Main article: Getting Triton help

First off, realize it is hard to do everything alone - with the diversity of types of computational research and researchers, it’s not even true that everyone should know everything. If you would like to focus on your science and have someone else focus on the computational part, see our Research Software Engineer service. It’s also available for expert consultations.

There are many ways to get help. Most daily questions should go to our issue tracker (direct link), which is hosted on Aalto Gitlab (login with the HAKA button). This is especially important because many people end up asking the same questions, and in order to scale everyone needs to work together.

We have daily “SciComp garage” sessions where we provide help in person. Similarly, we have chat that can be used to ask quick questions.

Also, always search this scicomp docs site and old issues in the issue tracker.

Please, don’t send us personal email, because it won’t be tracked and might go to the wrong person or someone without time right now. Personal email is also very likely to get lost. For email contact, we have a service email address, but this should only be used for account matters. If it affects others (software, usage problems, etc), use the issue tracker, otherwise we will point you there.

Quick reference

Open the Triton quick reference - you don’t need to know what is on it (that is what these tutorials cover), but having it open now and during your work will help you a lot.

What’s next?

The next tutorial is Cluster general background knowledge.

Cluster general background knowledge

Videos

Videos of this topic may be available from one of our kickstart course playlists: 2023, 2022 Summer, 2022 February, 2021 Summer, 2021 February.

The following topics are required background knowledge for productive use of a remote computer cluster, and not covered in the following sequence of tutorials. You should at least browse these to confirm that you know the basics here.

Building your skills

See also

Main article: Training

As time goes on, computers are getting easier and easier to use. However, research is not a consumer product, and the fact is that you need more knowledge to use Triton than most people learn in academic courses.

We have created a modular training plan, which divides useful knowledge into levels. In order to use Triton well, you need to be somewhat proficient at Linux usage (C level). In order to do parallel work, you need to be good at the D-level and also somewhat proficient at the HPC level (E-level). This tutorial and user guide covers the D-level, but it is up to you to reach the C-level first.

See our training program and plan for suggested material for self-study and lessons. We offer routine training, see our Scientific Computing in Practice lecture series page for info.

You can’t learn everything you need all at once. Instead, continually learn and know when to ask for help.

What’s next?

The next tutorial is about connecting to the cluster.

Connecting to Triton

Videos

Videos of this topic may be available from one of our kickstart course playlists: 2023, 2022 Summer, 2022 February, 2021 Summer, 2021 February.

The traditional way of interacting with a cluster is via the command line in a shell in a terminal, and Secure Shell (ssh) is the most common way of doing that. To learn more command line basics, see our shell crash course.

Abstract

  • When connecting to a cluster, our goal is to get a command-line terminal that provides a base for the rest of our work.

  • The standard way of connecting is via ssh, but Open OnDemand and Jupyter provide graphical environments that are useful for interactive work.

  • SSH host name is triton.aalto.fi, use VPN if not on an Aalto network.

The main connection methods, what they are, and from where they work:

  • ssh from Aalto networks: The standard way of connecting via the command line. Hostname is triton.aalto.fi. More SSH info. Linux/Mac/Windows from the command line: ssh USERNAME@triton.aalto.fi (Windows: same, see Connecting via ssh for detailed options). From: VPN and Aalto networks (that is, VPN, most wired connections, internal servers, eduroam, and aalto only if using an Aalto-managed laptop, but not aalto open). The simplest SSH option if you can use the VPN.

  • ssh (from the rest of the Internet): Use the Aalto VPN and the option above. If needed: same as above, but you must set up an SSH key and then ssh -J USERNAME@kosh.aalto.fi USERNAME@triton.aalto.fi. From: the whole Internet, if you first set up an SSH key AND also use passwords (since 2023).

  • VDI: “Virtual desktop interface”, https://vdi.aalto.fi; from there you have to ssh to Triton (previous options) and can run graphical programs via SSH. More info. From: the whole Internet.

  • Jupyter: https://jupyter.triton.aalto.fi provides the Jupyter interface directly on Triton (including the command line). Get a terminal with “New → Other → Terminal”. More info. From: the whole Internet.

  • Open OnDemand: https://ood.triton.aalto.fi, a web-based interface to the cluster. Includes shell access and data transfer; use “Triton Shell Access” for the terminal. More info. From: VPN and Aalto networks.

  • VSCode: Web-based VSCode is available via Open OnDemand (above). Desktop-based “Remote SSH” allows running on Triton (which is OK, but don’t use it for large computation). More info. From: same as Open OnDemand or SSH above.

Kickstart course preparation

Are you here for a SciComp KickStart course? You just need to make sure you have an account and then be able to get to a terminal (as seen in the picture below) by any of the means here, and you don’t need to worry about anything else. Everything else, we do tomorrow.

Local differences

The way you connect will be different in every site, but you should be able to get a terminal somehow.

Schematic of cluster with current discussion points highlighted; see caption or rest of lesson.

We are working to get access to the login node. This is the gateway to all the rest of the cluster.

Getting an account

Triton uses Aalto accounts, but your account must be activated first.

The terminal

This is what you want by the end of this page: the command line terminal. Take the first option that works, or the one that’s comfortable for you. However, it’s good to get ssh working someday, since it is very useful. Later, Using the cluster from a shell will explain more about how to actually use it.

Image of terminal with two commands ran: ``whoami`` and ``hostname``

Image of a terminal - this is what you want after this page. You’ll see more about what this means in Using the cluster from a shell. Don’t worry about what the commands mean; you can probably figure it out.

Connecting via ssh

ssh is one of the most fundamental programs for remote connections: by using it well, you can control almost anything from anywhere. It is not only used for connecting to the cluster, but also for data transfer. It’s worth making yourself comfortable with its use.

All Linux distributions come with an ssh client, so you don’t need to do anything. To use graphical applications, use the standard -X option; nothing extra is needed:

$ ssh triton.aalto.fi
## OR, if your username is different:
$ ssh USERNAME@triton.aalto.fi

If you are not in the Aalto networks, use the Aalto VPN.

If you are not on an Aalto network, there are extra steps. We recommend you use the Aalto VPN rather than any other workarounds. (Aalto networks are VPN, Eduroam, wired workstations, internal servers, and aalto network only if using an Aalto-managed computer.)

When connecting, you can verify the ssh key fingerprints which will ensure security.

See the advanced ssh information to learn how to log in without a password, automatically save your username and more. It really will save you time.

Aalto: Change your shell to bash

Only needed if your shell isn’t already bash. If echo $SHELL reports /bin/bash, then you are already using bash.

The thing you are interacting with when you type is the shell - the layer around the operating system. bash is the most common shell, but the Aalto default shell used to be zsh (which is more powerful in some ways, but harder to teach with). Depending on when you joined Aalto, your default might already be bash. We recommend that you check and change your shell to bash.

You can determine if your shell is bash by running echo $SHELL. Does it say /bin/bash?

If not, ssh to kosh.aalto.fi and run chsh -s /bin/bash. It may take 15 minutes to update, and you will need to log in again.

Connecting via Open onDemand

See also

Open OnDemand

OOD (Open onDemand) is a web-based user interface to Triton, including shell access, and data transfer, and a number of other applications that utilize graphical user interfaces. Read more from its guide. The Triton shell access app will get you the terminal that you need for basic work and the rest of these tutorials.

It is only available from Aalto networks and VPN. Go to https://ood.triton.aalto.fi and login with your Aalto account.

Connecting via JupyterHub

Jupyter is a web-based way of doing computing. But what some people forget is that it has a full-featured terminal and console included.

Go to https://jupyter.triton.aalto.fi (not .cs.aalto.fi) and log in. Select “Slurm 5 day, 2G” and start.

To start a terminal, click File→New→Terminal - this is the shell you need. If you need to edit text files, you can also do that through JupyterLab (note: change to the right directory before creating a new file!).

Warning: the JupyterHub shell runs on a compute node, not a login node. Some software is missing, so some things don’t work. Try ssh triton.aalto.fi from the Jupyter shell to connect to the login node. To learn more about JupyterLab, you need to read up elsewhere; there are plenty of tutorials.

Connecting via the Virtual Desktop Interface

If you go to https://vdi.aalto.fi, you can access a cloud-based Aalto Linux workstation. HTML access works from everywhere, or download the “VMWare Horizon Client” for a better connection. Start a Ubuntu desktop (you get Aalto Ubuntu). From there, you have to use the normal Linux ssh instructions to connect to Triton (via the Terminal application) using the instructions you see above: ssh triton.aalto.fi.

VSCode

You can use a web-based VSCode through Open OnDemand. Desktop VSCode can also connect to Triton via SSH. Read more

Exercises

If you are in the kickstart course, Connecting-1 is required for the rest of the course.

Connecting-1: Connect to Triton

Connect to Triton, and get a terminal by one of the options above. Type the command hostname to verify that you are on Triton. Run whoami to verify your username.

Connecting-2: (optional) Test a few command line programs

Check the uptime and load of the login node: uptime and htop (q to quit - if htop is not available, then top works almost as well). What else can you learn about the node? (You’ll learn more about these in Using the cluster from a shell, this is just a preview to fill some time.)

Connecting-3: (optional, Aalto only) Check your default shell

Check what your default shell is: run echo $SHELL. If it doesn’t say /bin/bash, go ahead and change your shell to bash if it’s not yet (see the expandable box above).

This $SHELL syntax is an environment variable and a pattern you will see in the future.

See also
What’s next?

The next tutorial is about using the terminal.

Using the cluster from a shell

Videos

Videos of this topic may be available from one of our kickstart course playlists: 2023, 2022 Summer, 2022 February, 2021 Summer, 2021 February.

A shell is the command-line terminal interface which is the most common method of accessing remote computers. If you are using a cluster, you aren’t just computing things. You are programming the computer to do things for you over and over again. The shell is the only option to make this work, so you have to learn a little bit.

Schematic of cluster with current discussion points highlighted; see caption or rest of lesson.

We are still only on the login node. If you stop here, you aren’t actually using the cluster - just the entry point. If you run too much code here, you’ll get a polite message asking to use the rest of the cluster.

Terminology
  • A terminal is the physical or software device that sends lines and shows the output.

  • A shell is the interface that reads in lines, does something with the operating system, and sends it back out.

  • The command line interface refers to the general concept of these lines in, lines out.

All these terms are usually used somewhat interchangeably.

Why command line interfaces?

The shell is the most powerful interface to computers: you can script other programs to do things automatically. It’s much easier to script things with text, than by clicking buttons. It’s also very easy to add the command line interfaces to programs to make them scriptable. Shells, such as bash or zsh, are basically programming languages designed to connect programs together.

Image of a terminal with two commands run: whoami and hostname - this is what does it all.

In the image above, we see a pretty typical example. The prompt is darstr1@login3:~$ and gives a bit of info about what computer you are running on. The command whoami tells who you are (darstr1) and hostname tells what computer you are on (login3.triton.aalto.fi).

You can also give options and arguments to programs, like this:

$ python pi.py --seed=50

The parts are like this:

  • python is the program that is run.

  • pi.py and --seed=50 are arguments. They tell the program what to do, and the program can interpret them however it wants. For Python, your-program.py is the Python file and that Python file itself knows how to handle --seed=50.

These arguments let you control the program without modifying the source code (or clicking buttons with your mouse!). This lets us, for example, make a shell script that runs with many different --seed values automatically (this is a hint about our future!).
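
For example, a minimal sketch of such a loop typed directly in the shell (the seed values here are just examples; the > is the shell’s continuation prompt, not something you type):

$ for seed in 13 19759 42; do
>     python pi.py --seed=$seed
> done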

You will learn all sorts of commands as you progress in your career. The command line quick reference gives the most important ones.

Files and directories

On your phone and other “app”-like things, data just exists - you don’t really know where. Now, you are the programmer doing scientific computing, so you have to make more meaningful decisions about data arrangement. This means knowing about files (a chunk of data) and directories (hierarchical storage units, also known as folders). On a cluster, you can’t throw everything into the same place. You need to sort stuff and keep it organized. File names are an essential part of automating things. Thus, you need knowledge of the storage hierarchy.

Everything on a Unix (Linux) system is organized in a hierarchy. There aren’t “drives” like a “C-drive”; different storage systems can be made available anywhere in the hierarchy:

  • / is the root of the filesystem

  • /home/ is a directory (“home directories”)

  • /home/darstr1/ is the home directory of the user darstr1

  • /home/darstr1/git/ is the directory darstr1 uses to store general git repositories.

  • … etc

  • $HOME is an environment variable shortcut for your home directory.

  • ~ is another shortcut for your home directory.

  • On Triton, /scratch/ is the basic place for storing research data. Also on Triton, $WRKDIR is a shortcut for your personal space in scratch (this is an environment variable).

On a graphical computer, you open a window to view files, but this is disconnected from how you run programs. In a shell, they are intrinsically connected and that is good.

The most common commands related to directories:

  • pwd shows the directory you are in.

  • cd NAME changes to a directory. All future commands are relative to the directory you change to. This is the (current) working directory

  • ls [NAME] lists the contents of a directory. [NAME] is an optional directory name - by default, it lists the working directory.

  • mkdir NAME makes a new directory

  • rm -r NAME removes a directory (or file) recursively - that and everything in it! There is no backup, be careful.
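
For example, a short session using these commands might look like this (the directory name testdir is just an illustration):

$ pwd
/home/darstr1
$ mkdir testdir
$ cd testdir
$ pwd
/home/darstr1/testdir
$ cd ..
$ rm -r testdir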

Exercises, directories

You have to be connected to the cluster and have a terminal to do these exercises.

Shell-1: Explore directories

If you are not at Aalto, try to do similar things but adjusted to your cluster’s data storage.

  • Print your current directory with pwd

  • List the contents with ls

  • List the contents of /scratch/, then the contents of another directory within it, and so on.

  • List your work directory $WRKDIR.

  • Change to your work directory. List it again, with a plain ls (no full path needed).

  • List your home directory from your work directory (you need to give it a path)

  • Log out and in again. List your current directory. Note how it returns to your home directory - each time you log in, you need to navigate to where you need to be.

Shell-2: Understand power of working directory

  • ls /scratch/cs/

  • Change directory to /scratch

  • Now list /scratch/cs, but don’t re-type /scratch.

Copy your code to the cluster

Usually, you would start by copying some existing code and data into the cluster (you can also develop the code straight on the cluster). Let’s talk about the code first. You would ideally have code in a git repository - this version control system (VCS) tracks files, synchronizes versions, and most importantly lets you copy them to the cluster easily.

You’d make a git repository on your own computer where you work. You would sync this with some online service (such as Github (github.com) or Aalto Gitlab (version.aalto.fi)), and then copy it to the cluster. Changes can go the other way, too. (You can also go straight from computer→cluster, but that’s beyond the scope of this tutorial.) Git itself is outside the scope of this tutorial, but you should see CodeRefinery’s git-intro course, and really all of CodeRefinery’s courses. This isn’t covered any further here.

We are going to pretend we are researchers working on a sample project, named hpc-examples. We’ll pretend this is our research code and keep using this example repository for the rest of the tutorials. You can look at all the files in the repository here: https://github.com/AaltoSciComp/hpc-examples/ .

Let’s clone the HPC-examples repository so that we can work on it. First, we make sure we are in our home directory (we always want to make sure we know where we are! The home directory is the default place, though):

$ cd $HOME

Then we clone our git repository:

$ git clone https://github.com/AaltoSciComp/hpc-examples/

We can change into the directory:

$ cd hpc-examples

Now we have our code in a place that can be used.

Warning

Storing your analysis codes in your home directory usually isn’t recommended, since it’s not large or high performance enough. You will learn more about where to store your work in Data storage.

Shell-3: clone the hpc-examples repository

Do the steps above. List the directory and verify it matches what you see in the Github web interface.

Is your home directory the right place to store this?

Shell-4: log out and re-navigate to the hpc-examples repository

Log out and log in again. Navigate to the hpc-examples repository. Resuming work is an important but often forgotten part of work.

Running a basic program

But how would you actually run things? Usually, you would:

  • Decide where to store your code

  • Copy your code to the cluster (like we did above with the hpc-examples repository)

  • Each time you connect, change directory to the place with the code and run from there.

In our case, after changing to the hpc-examples directory, let’s run the program pi.py using Python (this will be our common example for a while):

$ cd hpc-examples
$ python3 slurm/pi.py 10000

The argument “10000” is the number of iterations of the circle in square method of calculating π.

Danger

This is running your program on the login node! Since this takes only a second, it’s OK enough for now (so that we only have to teach one thing at a time). You will learn how to run programs properly starting in Slurm: the queuing system.

Shell-5: try calculating pi

Try doing what is above and running pi.py several times with different numbers of iterations. Try passing the --seed command line option with the values 13 and 19759.

From this point on, you need to manage your working directory. You need to be in the hpc-examples directory when appropriate, or somehow give a proper path to the program to be run.

Shell-6: Try the --help option

Many programs have a --help option which gives a reminder of the options of the program. (Note that this has to be explicitly programmed - it’s a convention, not magic.) Try giving this option to pi.py and see what happens.

Copying and manipulating files

More info: Linux shell crash course

  • cp OLD NEW makes a copy of OLD in NEW

  • mv OLD NEW renames a file OLD to NEW

  • rm NAME removes a file (with no warning or backup)

A file consists of its contents and metadata. The metadata is information like user, group, timestamps, permissions. To view metadata, use ls -l or stat.
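
For example, ls -l lists the metadata next to each file name (the output below is only illustrative; your sizes, dates, and group will differ):

$ ls -l slurm/pi.py
-rw-r--r-- 1 darstr1 darstr1 4711 Jun  5 10:00 slurm/pi.py

The columns are permissions, number of links, user, group, size in bytes, modification time, and name.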

Shell-7: (optional) Make a copy of pi.py

Make a copy of the pi.py program we have been using. Call it pi-new.py

Editing and viewing files

You will often need to edit files (in other words, change their contents). You could do this on your computer and copy them over every time, but that’s really slow. You can, and should, do basic edits directly on the cluster itself.

  • nano is an editor which allows you to edit files directly from the shell. This is a simple console editor which always gets the job done. Use Control-x (control and x at the same time), then y when requested and enter, to save and exit.

  • less is a pager (file viewer) which lets you view files without editing them. (q to quit, / to search, n / N to repeat the search forwards / backwards, < for beginning of file, > for end of file)

  • cat dumps the contents of a file straight to the screen - sometimes useful when looking at small things.
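
For example, to create and then view a small file (the file name is just an example):

$ nano notes.txt      # edit, then Control-x, y, Enter to save
$ cat notes.txt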

Shell-9: Create a new file and show its contents

Create a new file poem.txt. Write some poem in it. View the contents of the file.

Shell-10: (optional, advanced) Edit pi-new.py

Remember the pi-new.py file you made? Add some nonsense edits to it and try to run it. See if it fails.

Exercises

Shell-11: (advanced, to fill time) shell crash course

Browse the Linux shell crash course and see what you do and don’t know from there.

See also

This is only a short intro.

What’s next?

The next step is looking at the applications available on the cluster.

Applications

Videos

Videos of this topic may be available from one of our kickstart course playlists: 2023, 2022 Summer, 2022 February, 2021 Summer, 2021 February.

In this tutorial, we talk about the overall process of finding, building, and compiling software. These days, installing and managing scientific software takes more and more time, so we need to talk about it specifically.

Clusters, being shared systems, have more complicated software requirements. In this tutorial, you will learn how to use existing software. Be aware that installing your own is possible (and people do it all the time), but does require some attention to details. Either way, you will need to know the basics of software on Linux.

Abstract

  • There are many ways to get software on Triton:

    • Usually, you can’t install software the normal operating system way.

    • The cluster admins install many things for you, and they are loadable with Modules.

    • Sometimes, you need to install some stuff on top of that (your own libraries or environments)

    • You can actually install your own applications, but you need to modify instructions to work in your own directories.

  • Singularity containers allow you to run other hard-to-install software.

  • Ask us to install software if you need it. Ask for help if you try it yourself and it seems to be taking too long.

See also

Main article: Applications: General info

Local differences

Almost every site will use modules. The exact module names, and anything beyond that, will be different. Containers are becoming more common, but they are less standardized.

There are four main ways of getting software:

  • It’s installed through the operating system some relatively “normal” way.

  • Someone (a cluster admin) has already installed it for you. Or you ask someone to install what you need.

  • Someone has installed the base of what you need. You do some extra.

  • Install it yourself in a storage place you have access to. (Maybe you share it with others?)

Installed through operating system

People sometimes expect the cluster to work just like your laptop: install something through a package manager or app store. This doesn’t really work when you have hundreds of users on the same system: if you upgrade X, how many other people’s work suddenly breaks?

Thus, this isn’t really done, except for very basic, standalone applications. When it is done, this stuff isn’t upgraded and is often old: instead, we install through modules (the next point) so that people can choose the version they want.

One unfortunate side-effect is that almost all software installation instructions you find online don’t work on the cluster. Often, the software can still be installed, but people don’t think to mention how in the documentation. This often requires some thought to figure out: if you can’t figure it out, ask for help!

Cluster admin has installed it for you

The good thing about the cluster is that a few people can install software and make it usable by a lot of people. This can save you a lot of time. Your friendly admins can install things through the Software modules (an upcoming lesson), so that you can module load it with very little work. You can even choose your exact version, so that it keeps working the same even if someone else needs a newer version.

Some clusters are very active in this, some expect the users to do more. Some things are so obscure, or so dependent on local needs, that it only makes sense to help people install it themselves. To look for what is available:

If you need something installed, contact us. The issue tracker is usually the best way to do this.

Some of the most common stuff that is available:

  • Python: module load anaconda for the Anaconda distribution of Python 3, including a lot of useful packages. More info.

  • R: module load r for a basic R package. More info.

  • Matlab: module load matlab for the latest Matlab version. More info.

  • Julia: module load julia for the latest Julia version. More info.

Important

This is Aalto-specific. Some of these will work if you module load fgci-common at other Finnish sites (but not CSC). This is introduced in the next lesson.

Already installed, you add extra modules you need

Even if a cluster admin installs some software, often you need to extend it a bit. One classic example is Python: we provide Python installations, but you need your own packages on top. So, you can use our base Python installation to create your own environments - self-contained setups where you can install whatever you need. Different languages have different ways of doing this (for example, conda or virtual environments for Python, or user package libraries for R).

Environments have the advantage that you can work on multiple projects at once, each with its own dependencies, and move between computers more easily.
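
As a rough sketch for Python (assuming the anaconda module provides the conda command - check the Python application page for the exact recommended workflow at your site; the environment location and packages here are hypothetical):

$ module load anaconda
$ conda create --prefix $WRKDIR/my-env python=3.11 numpy   # create an environment under your work directory
$ source activate $WRKDIR/my-env

The idea is the same for other languages: a base installation from the module system, plus your own packages in a directory you control.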

Install it yourself

Sometimes, you need to install software yourself - which you can do if you can tell it to install just into your home directory. Usually, the software’s instructions don’t talk about how to do this (and might not even mention things like the environments in the previous point).

One common way of doing this is containers (for example, Docker or Apptainer/Singularity). These basically allow you to put an entire operating system in one file, so that your software works everywhere. Very nice when software is difficult to install or needs to be moved from computer to computer, but can take some work to set up. See Singularity Containers for the information we have so far.

We can’t go into this more right now - ask us for help if needed. If you make a “we need X installed” request, we’ll tell you how to do it if self-installation is the easiest way.

What you should do
Exercises

These are more for thinking than anything.

Applications-1: Check your needs

Find the Applications page link above, the issue tracker, etc., and check whether we already have your software installed. See if we have what you need, using any of the strategies on that list.

(optional) Applications-2: Your group’s needs

Discuss among your group what software you need, if it’s available, and how you might get it. Can they tell you how to get started?

What’s next?

The next tutorial covers software modules in more detail.

Software modules

Videos

Videos of this topic may be available from one of our kickstart course playlists: 2023, 2022 Summer, 2022 February, 2021 Summer, 2021 February.

There are hundreds of people using every cluster. They all have different software needs, including conflicting versions required for different projects! How do we handle this without making a mess, or one person breaking the cluster for everyone?

This is actually a very hard problem, but one that is solved within certain limits. Software installation and management takes up a huge amount of our time, but we try to make it easy for our users. Still, it can end up taking a lot of your effort as well.

Abstract

  • We use the standard Lmod module system, which makes more software available by adjusting environment variables like PATH.

  • module spider NAME searches for NAME.

  • module load NAME loads the module of that name. Sometimes, it can’t until you module load something else (read the module spider message carefully).

  • See the Triton quick reference for a module command cheatsheet.

Local differences

Almost every site uses modules, and most use the same Lmod system we use here. But, the exact module names you can load will be different.

Introduction to modules

The answer is the standard “module” system, Lmod. It allows us to have an unlimited number of different software packages installed, and the user can select what they want. Modules cover everything from compilers (and their required development files) to libraries and complete programs. If you need a program installed, we will put it in the module system.

In a system the size of Triton, it just isn’t possible to install all software by default for every user.

A module lets you adjust what software is available, and makes it easy to switch between different versions.

As an example, let’s inspect the anaconda module with module show anaconda:

$ module show anaconda
----------------------------------------------------------------------------
  /share/apps/anaconda-ci/fgci-centos7-anaconda/modules/anaconda/2023-01.lua:
----------------------------------------------------------------------------
whatis("Name : anaconda")
whatis("Version : 2023-01")
help([[This is an automatically created Anaconda installation.]])
prepend_path("PATH","/share/apps/anaconda-ci/fgci-centos7-anaconda/software/anaconda/2023-01/2eea7963/bin")
setenv("CONDA_PREFIX","/share/apps/anaconda-ci/fgci-centos7-anaconda/software/anaconda/2023-01/2eea7963")
setenv("GUROBI_HOME","/share/apps/anaconda-ci/fgci-centos7-anaconda/software/anaconda/2023-01/2eea7963")
setenv("GRB_LICENSE_FILE","/share/apps/manual_installations/gurobi/license/gurobi.lic")

The command shows some meta-info (name of the module, its version, etc.) When you load this module, it adjusts various environment paths (as you see there), so that when you type python it runs the program from /share/apps/anaconda-ci/fgci-centos7-anaconda/software/anaconda/2023-01/2eea7963/bin. This is almost magic: we can have many versions of any software installed, and everyone can pick what they want, with no conflicts.

Loading modules

Let’s dive right into an example and load a module.

Local differences

If you are not at Aalto, you need to figure out what modules exist for you. The basic principles probably work on almost any cluster.

Let’s assume you’ve written a Python script that is only compatible with Python version 3.7.0 or higher. You open a shell to find out where our Python is and what version it is. The type command shows which executable will actually run for a given name - very useful when testing modules (if this doesn’t work, use which):

$ type python3
python3 is /usr/bin/python3
$ python3 -V
Python 3.6.8

But you need a newer version of Python. To this end, you can load the anaconda module using the module load anaconda command, which has a more up-to-date Python with lots of libraries already included:

$ module load anaconda
$ type python3
python3 is /share/apps/anaconda-ci/fgci-centos7-anaconda/software/anaconda/2023-01/2eea7963/bin/python3
$ python3 -V
Python 3.10.8

As you see, you now have a newer version of Python, in a different directory.

You can see a list of all the loaded modules in the current shell using the module list command:

$ module list
Currently Loaded Modules:
  1) anaconda/2023-01

Note

The module load and module list commands can be abbreviated as ml

Let’s use the module purge command to unload all the loaded modules:

$ module purge

Or explicitly unload the anaconda module by using the module unload anaconda command:

$ module unload anaconda

You can load any number of modules in your open shell, your scripts, etc. You could load modules in your ~/.bash_profile, but then they will always be loaded automatically - this regularly causes hard-to-explain bugs!

Module versions

What’s the difference between module load anaconda and module load anaconda/2023-01?

The first anaconda loads the version that Lmod assumes to be the latest one - which might change someday! Suddenly, things don’t work anymore and you have to fix them.

The second, anaconda/2023-01, loads that exact version, which won’t change. Once you want stability (possibly from day one!), it’s usually a good idea to load a specific version, so that your environment will stay the same until you are done.

Hierarchical modules

Hierarchical modules means that you have to load one module before you can load another. This is usually a compiler:

For example, let’s load a newer version of R:
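
Trying to load it directly fails. A sketch of what this looks like (the exact Lmod wording and version numbers may differ on your system):

$ module load r/4.2.2
Lmod has detected the following error:  These module(s) or
extension(s) exist but cannot be loaded as requested: "r/4.2.2"
   Try: "module spider r/4.2.2" to see how to load the module(s).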

Lmod says that the modules exist but can’t be loaded, but gives a hint for what to do next. Let’s do that:
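
A sketch of what module spider reports (abbreviated; the exact output depends on the installed modules):

$ module spider r/4.2.2
...
    You will need to load all module(s) on any one of the lines below
    before the "r/4.2.2" module is available to load.

      gcc/11.3.0
...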

So now we can load it (we can do it in one line):

$ module load gcc/11.3.0 r/4.2.2
$ R --version
R version 4.2.2 (2022-10-31) -- "Innocent and Trusting"
What’s going on under the hood here?

In Linux systems, different environment variables like $PATH and $LD_LIBRARY_PATH help figure out how to run programs. Modules just cleverly manipulate these so that you can find the software you need, even if there are multiple versions available. You can see these variables with the echo command, e.g. echo $PATH.

When you load a module in a shell, the module command changes the current shell’s environment variables, and the environment variables are passed on to all the child processes.

You can explore more with module show NAME.

Making a module collection

There is a basic dependency/conflict system to handle module dependencies. Each time you load a module, it resolves all the dependencies. This can result in long loading times, or be annoying to do each time you log in to the system. However, there is a solution: module save COLLECTION_NAME and module restore COLLECTION_NAME.

Let’s see how to do this in an example.

Let’s say that for compiling / running your program you need:

  • a compiler

  • CMake

  • MPI libraries

  • FFTW libraries

  • BLAS libraries

You could run this each time you want to compile/run your code:

$ module load gcc/9.2.0 cmake/3.15.3 openmpi/3.1.4 fftw/3.3.8-openmpi openblas/0.3.7
$ module list           # 15 modules

Let’s say this environment works for you. Now you can save it with module save MY-ENV-NAME. Then module purge to unload everything. Now, do module restore MY-ENV-NAME:

$ module save my-env
$ module purge
$ module restore my-env
$ module list           # same 15 modules

Generally, it is a good idea to save your modules as a collection to have your desired modules all set up each time you want to re-compile/re-build.

On subsequent compiles/builds, you simply module restore my-env, and this way you can be sure you have the same environment as before.

Note

You may occasionally need to rebuild your collections in case we re-organize things (it will prompt you to rebuild your collection and you simply save it again).

Full reference

  • module load NAME: load a module

  • module avail: list all modules

  • module spider PATTERN: search modules

  • module spider NAME/ver: show prerequisite modules of this one

  • module list: list currently loaded modules

  • module show NAME: details on a module

  • module help NAME: details on a module

  • module unload NAME: unload a module

  • module save ALIAS: save the current module collection to this alias (saved in ~/.lmod.d/)

  • module savelist: list all saved collections

  • module describe ALIAS: details on a collection

  • module restore ALIAS: load a saved module collection (faster than loading individually)

  • module purge: unload all loaded modules (faster than unloading individually)

Final notes

If you have loaded modules when you build/install software, remember to load the same modules when you run the software (also in Slurm jobs). You’ll learn about running jobs later, but the module load should usually be put into the job script.
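
As a minimal sketch of what that looks like inside a job script (batch scripts are covered in the later Serial jobs tutorial; the resource values and module here are only examples):

#!/bin/bash
#SBATCH --time=00:10:00
#SBATCH --mem=500M

# Load the same modules you used when developing/building
module load anaconda

python3 slurm/pi.py 10000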

The modules used to compile and run a program become part of its environment and must always be loaded.

We use the Lmod system and Lmod works by changing environment variables. Thus, they must be sourced by a shell and are only transferred to child processes. Anything that clears the environment clears the loaded modules too. Module loading is done by special functions (not scripts) that are shell-specific and set environment variables.

Triton modules are also available on Aalto Linux: use module load triton-modules to make them available.

Some modules are provided by Aalto Science-IT, and on some clusters they could be provided by others, too. You could even make your own user modules.

Exercises

Before each exercise, run module purge to clear all modules.

If you aren’t at Aalto, many of these things won’t work - you’ll have to check your local documentation for what the equivalents are.

Modules-1: Basics

Run module avail and check what you see. Find a software package that has many different versions available. Load the oldest version.

Modules-2: Modules and PATH

PATH is an environment variable that lists the directories searched for programs to run. See its current value using echo $PATH.

type is a command line tool (a shell builtin, so your shell may not support it, but bash and zsh do) which tells you the full path of what will be run for a given command name - basically, it looks up the command in PATH.

  • Run echo $PATH and type python.

  • module load anaconda

  • Re-run echo $PATH and type python. How does it change?

Modules-3: Complex module and PATH

Check the value of $PATH. Then, load the module py-gpaw. List what it loaded. Check the value of PATH again. Why is there so much stuff? Can you find a module command that explains it?

Modules-4: Hierarchical modules

How can you load the module quantum-espresso/7.1? Trying to load it directly gives:

$ ml load quantum-espresso/7.1
Lmod has detected the following error:  These module(s) or
extension(s) exist but cannot be loaded as requested: "quantum-espresso/7.1"
   Try: "module spider quantum-espresso/7.1" to see how to load the module(s).

Modules-5: Modules and dependencies

Load a module with many dependencies, such as r-ggplot2 and save it as a collection. Purge your modules, and restore the collection.

See also
What’s next?

The next tutorial covers data storage.

Data storage

Videos

Videos of this topic may be available from one of our kickstart course playlists: 2023, 2022 Summer, 2022 February, 2021 Summer, 2021 February.

These days, computing is as much (or more) about data as about the actual computing power. And data is about more than the number of petabytes: it is easy to leave it unorganized, or stored in such a way that it slows down the computation.

In this tutorial, we go over places to store data on Triton and how to choose between them. The next tutorial tells how to access it remotely.

Abstract

  • See the Triton quick reference

  • There are many places to store files, because they make different trade-offs between speed, size, and backups.

  • We recommend scratch / $WRKDIR (below) for most cases.

  • We are a standard Linux cluster with these options:

    • $HOME = /home/$USER: 10GB, backed up, not made larger

    • Scratch is large but not backed up:

      • $WRKDIR = /scratch/work/$USER: Personal work directory

      • /scratch/DEPARTMENT/NAME/: Group-based shared directories (recommended for most work, group leaders can request them)

    • /tmp: temporary directory, per-user mounted in jobs and automatically cleaned up.

    • /l/: local persistent storage on some group servers

    • $XDG_RUNTIME_DIR: ramfs on login node

  • See Remote access to data for how to transfer and access the data from other computers.

Schematic of cluster with current discussion points highlighted; see caption or rest of lesson.

We are now looking at the data storage of a cluster.

Basics

Triton has various ways to store data. Each has a purpose, and when you are dealing with large data sets or intensive I/O, efficiency becomes important.

Roughly, we have small home directories (only for configuration files), large Lustre (scratch and work: large, primary calculation data), and special places for scratch during computations (local disks). At Aalto, there are also Aalto home, project, and archive directories which, unlike Triton, are backed up - but they don’t scale to the size of Triton.

Filesystem performance can be measured by both IOPS (input-output operations per second) and stream I/O speed. /usr/bin/time -v can give you some hints here. You can see the profiling page for more information.
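
For example, wrapping a run in /usr/bin/time -v (a sketch; the interesting lines in the output are the elapsed time and the “File system inputs/outputs” counters):

$ /usr/bin/time -v python3 slurm/pi.py 10000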

Think about I/O before you start! - General notes

When people think of computer speed, they usually think of CPU speed. But this is missing an important factor: How fast can data get to the CPU? In many cases, input/output (IO) is the true bottleneck and must be considered just as much as processor speed. In fact, modern computers and especially GPUs are so fast that it becomes very easy for a few GPUs with bad data access patterns to bring the cluster down for everyone.

The solution is similar to how you have to consider memory: there are different types of filesystems with different trade-offs between speed, size, and backups, and you have to use the right one for the right job. Often, you have to use several in tandem: for example, store original data on archive, put your working copy on scratch, and maybe even make a per-calculation copy on local disks. Check out wikipedia:Memory Hierarchy and wikipedia:List of interface bit rates.

The following factors are useful to consider:

  • How much I/O are you doing in the first place? Do you continually re-read the same data?

  • What’s the pattern of your I/O and which filesystem is best for it? If you read all at once, scratch is fine. But if there are many small files or random access, local disks may help.

  • Do you write log files/checkpoints more often than is needed?

  • Some programs use local disk as swap-space. Only turn it on if you know it is reasonable.

There’s a checklist in the storage details page.

Avoid many small files! Use a few big ones instead. (we have a dedicated page on the matter)

Available data storage options

Each storage location has a different size, speed, type of backup, and availability. You need to balance between these. Most routine work should go into scratch (group directories) or work (personal). Small configuration files and similar can go into your home directory.

  • Home: $HOME or /home/USERNAME/ - hard quota 10GB, nightly backup, available on all nodes. Small user-specific files, no calculation data.

  • Work: $WRKDIR or /scratch/work/USERNAME/ - 200GB and 1 million files, no backup, available on all nodes. Personal working space for every user: calculation data etc. Quota can be increased on request.

  • Scratch: /scratch/DEPT/PROJECT/ - quota on request, no backup, available on all nodes. Department/group-specific project directories.

  • Local temp: /tmp/ - limited by disk size, no backup, single node only. Primary (and usually fastest) place for single-node calculation data. Removed once the user’s jobs are finished on the node.

  • Local persistent: /l/ - size varies, no backup, dedicated group servers only. Local disk persistent storage on servers purchased for a specific group.

  • ramfs (login nodes only): $XDG_RUNTIME_DIR - limited by memory, no backup, single node only. In-memory filesystem on the login node.

Home directories

The place you start when you log in. The home directory should be used for init files, small config files, etc. It is, however, not suitable for storing calculation data. Home directories are backed up daily. You usually want to use scratch instead.

scratch and work: Lustre

Scratch is the big, high-performance, 2PB Triton storage. It is the primary place for calculations, data analysis, etc. It is not backed up; it is reliable against hardware failures (RAID6, redundant servers) but not safe against human error. It is shared on all nodes, and has very fast access. It is divided into two parts, scratch (by groups) and work (per-user). In general, always change to $WRKDIR or a group scratch directory when you first log in and start doing work. (Note: home and work may be deleted six months after your account expires: use a group-based space instead.)

Lustre separates metadata and contents onto separate object and metadata servers. This allows fast access to large files, but induces a larger overhead than normal filesystems. See our small files page for more information.

See Storage: Lustre (scratch)

Local disks

Local disks are on each node separately. They are used for the fastest I/O with single-node jobs and are cleaned up after the job is finished. Since 2019, things have gotten a bit more complicated, given that our newest (skl) nodes don’t have local disks. If you want to ensure you have local storage, submit your job with --gres=spindle.

See the Compute node local drives page for further details and script examples.

ramfs - fast and highly temporary storage

On login nodes only, $XDG_RUNTIME_DIR is a ramfs, which means that it looks like files but is stored only in memory. Because of this, it is extremely fast, but has no persistence whatsoever. Use it if you have to make small temporary files that don’t need to last long. Note that this is no different from just holding the data in memory; if you can hold it in memory directly, that’s better.

Other Aalto data storage locations

Aalto has other non-Triton data storage locations available. See Filesystem details and Science-IT department data principles for more info.

Quotas

All directories under /scratch (as well as /home) have quotas. Two quotas are set per-filesystem: disk space and file number. Quotas exist not because we need to limit space, but because we need to make people think before using large amounts of space. Ask us if you need more.

Disk quota and current usage are printed with the command quota. ‘space’ is the disk space limit and ‘files’ the limit on the total number of files. There is a separate quota for each group of which the user is a member.

$ quota
User quotas for darstr1
     Filesystem   space   quota   limit   grace   files   quota   limit   grace
/home              484M    977M   1075M           10264       0       0
/scratch          3237G    200G    210G       -    158M      1M      1M       -

Group quotas
Filesystem   group                  space   quota   limit   grace   files   quota   limit   grace
/scratch     domain users            132G     10M     10M       -    310M    5000    5000       -
/scratch     some-group              534G    524G    524G       -    7534   1000M   1000M       -
/scratch     other-group              16T     20T     20T       -   1088M      5M      5M       -

If you get a quota error, see the quotas page for a solution.

Remote access

The next tutorial, Remote access to data, covers accessing the data from your own computer.

Exercises

Most of these exercises will be specific to your local site. Use this time to review your local guides to see how they are adapted to your site.

Data storage locations:

Storage-1: Review data storage locations

(Optional) Look at the list of data storage locations above. Also look at the Filesystem details. Which do you think are suitable for your work? Do you need to share with others?

Storage-2: Your group’s data storage locations

Ask your group what they use and if you can use that, too.

Misc:

Storage-3: Common errors

What do all of the following have in common?

  1. A job is submitted but fails with no output or messages.

  2. I can’t start a Jupyter server on jupyter.triton.

  3. Some files are randomly empty. Or the file had content, I tried to save it again, and now it’s empty!

  4. I can’t log in.

  5. I can log in with ssh, but ssh -X doesn’t work for graphical programs.

  6. I get an error message about corruption, such as InvalidArchiveError("Error with archive ... You probably need to delete and re-download or re-create this file.

  7. I can’t install my own Python/R/etc libraries.

About filesystem performance:

strace is a command which tracks system calls, basically the number of times the operating system has to do something. It can be used as a rudimentary way to see how much I/O load there is.

Storage-4: strace and I/O operations

Use strace -c to compare the number of system calls of ls and ls -l on a directory with many files. On Triton, you can use the directory /scratch/scip/lustre_2017/many-files/ as a place with many files in it. How many system calls per file were there for each option?
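
One possible starting point (a sketch; strace -c prints its summary to stderr, so redirecting stdout keeps the file listing out of the way):

$ strace -c ls /scratch/scip/lustre_2017/many-files/ > /dev/null
$ strace -c ls -l /scratch/scip/lustre_2017/many-files/ > /dev/null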

Storage-5: strace and time

Using strace -c, compare the times of find and lfs find on the directory mentioned above. Why is it different?

(advanced) Storage-6: Benchmarking

(this exercise requires slurm knowledge from future tutorials and also other slurm knowledge).

Clone the https://github.com/AaltoSciComp/hpc-examples/ git repository to your personal work directory. Change to the io directory. Create a temporary directory and…

  1. Run create_iodata.sh to make some data files in data/

  2. Compare the IO operations of find and lfs find on this directory.

  3. Use the iotest.sh script to do some basic analysis. How long does it take? Submit it as a Slurm batch job.

  4. Modify the iotest.sh script to copy the data/ directory to local storage, do the operations, then remove the data. Compare to previous strategy.

  5. Use tar to compress the data while it is on lustre. Unpack this tar archive to local storage, do the operations, then remove. Compare to previous strategies.

What’s next?

The next tutorial is about remote data access.

Remote access to data

Videos

Videos of this topic may be available from one of our kickstart course playlists: 2023, 2022 Summer, 2022 February, 2021 Summer, 2021 February.

The cluster is just one part of your research: most people are constantly transferring data back and forth. Unfortunately, this can be a frustrating experience if you haven’t got everything running smoothly. In this tutorial, we’ll explain some of the main methods. See the main storage tutorial first.

Abstract

  • Data is also available from other places in Aalto, such as desktop workstations in some departments, shell servers, and https://vdi.aalto.fi.

  • Data can be transferred via ssh (the standard rsync and sftp tools)

  • Data can be mounted remotely using ssh (sshfs, from anywhere with ssh access) or SMB mounting on your own computer (within Aalto networks; Linux/Mac: smb://data.triton.aalto.fi/PATH, Windows: \\data.triton.aalto.fi\PATH with backslashes; PATH could be work/USERNAME or scratch/DEPT/GROUPNAME)

Schematic of cluster with current discussion points highlighted; see caption or rest of lesson.

Mounting data: on your machine, you have a view of the data directly on the cluster: there is only one copy.

Schematic of cluster with current discussion points highlighted; see caption or rest of lesson.

Copying data: there become two copies that you have to manage.

History and background

Historically, ssh transfers have been the most common (which includes rsync (recommended these days), scp, sftp, and various other graphical programs that use these protocols) - and this is still the most robust and reliable method. There are other modern methods, but they require other things.

There are two main styles of remote data access:

  • Transferring data makes a new copy on the other computer. This is generally efficient for large data.

  • Remote mounting makes a view of the data on the other computer: when you access/modify the data on the other computer, it transparently accesses/modifies in the original place without making a copy. This is very convenient, but generally slow.

    • We have this already set up for you from many computers at Aalto.

Data availability throughout Aalto

Data is the basis of almost everything we do, and accessing it seamlessly throughout Aalto is a great benefit. Various other Aalto systems have the data available. However, this varies per department: each department can manage its data as it likes. So, we can’t make general promises about what is available where.

Linux shell server mounts require a valid Kerberos ticket (usually generated when you log in). On long sessions these might expire, and you have to renew them with kinit to keep going. If you get a permission denied, try kinit.

Virtual desktop interface

VDI, vdi.aalto.fi, is a Linux workstation accessible via your web browser, and useful for a lot of work. It is not Triton, but has scratch mounted at /m/triton/scratch/. Your work folder can be accessed at /m/triton/scratch/work/USERNAME. For SCI departments the standard paths you have on your workstations also work: /m/{cs,nbe}/{scratch,work}/.

Shell servers

Departments have various shell servers, see below. There isn’t a generally available shell server anymore.

NBE

On workstations, work directories are available at /m/nbe/work and group scratch directories at /m/nbe/scratch/PROJECT/; the same paths are available on the shell server amor.org.aalto.fi.

PHYS

Directories available on demand through SSHFS. See the Data transferring page at PHYS wiki.

CS

On workstations, work directories are available at /m/cs/work/, and group scratch directories at /m/cs/scratch/PROJECT/. The department shell server is magi.cs.aalto.fi and has these available.

Remote mounting

There are many ways to access Triton data remotely. These days, we recommend figuring out how to mount the data remotely, so that it appears as local data but is accessed over the network. This saves copying data back and forth and is better for data security, but it is slower and less reliable than local data.

Schematic of cluster with current discussion points highlighted; see caption or rest of lesson.

Mounting data: on your machine, you have a view of the data directly on the cluster: there is only one copy.

Remote mounting using SMB

By far, remote mounting of files is the easiest method to transfer files. If you are not on the Aalto networks (wired, eduroam, or aalto with Aalto-managed laptop), connect to the Aalto VPN first. Note that this is automatically done on some department workstations (see above) - if not, request it!

The scratch filesystem can be remote mounted using SMB inside secure Aalto networks at the URLs

  • scratch: \\data.triton.aalto.fi\scratch\

  • work: \\data.triton.aalto.fi\work\%username%\

To mount these folders on Windows: Windows Explorer → This PC → Map network drive → select a free letter and enter the path above.

From Aalto managed computers, you can use lgw01.triton.aalto.fi instead of data.triton.aalto.fi and it might auto-login.

Depending on your OS, you may need to use either your username directly or AALTO\username.

Warning

In the future, you will only be able to do this from Aalto managed computers. This remote mounting will really help your work, so we recommend you request an Aalto managed computer (citing this section) to make your work as smooth as possible (or use vdi.aalto.fi, see above).

Remote mounting using sshfs

sshfs is a neat program that lets you mount remote filesystems via ssh alone. It is well-supported in Linux, and somewhat on other operating systems. Its true advantage is that you can mount any remote ssh server - it doesn’t have to be set up specially for SMB or any other type of mounting. On Ubuntu and other Linuxes, you can mount by “File → Connect to server” and using sftp://triton.aalto.fi/scratch/work/USERNAME. This also works from any shell server with data (see the previous section).

The commands below do the same from the command line, making triton_work on your local computer show all files in /scratch/work/USERNAME. This can be done with other folders, too:

$ mkdir triton_work
$ sshfs USERNAME@triton.aalto.fi:/scratch/work/USERNAME triton_work
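
When you are done, you can unmount it again (on Linux):

$ fusermount -u triton_work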

Note that ssh binds together many ways of accessing Triton (and other servers), with a similar syntax and options. Learning to use it well is a great investment in your future. Learn more about ssh on the ssh page - if you set up a ssh config file, it will work here, too!

For Aalto Linux workstation users: it is recommended that you mount /scratch/ under the local disk /l/. You should be able to create a subfolder under /l/ and point sshfs to that subfolder, as in the example above.

Transferring data

This section describes ways to copy data back and forth between Triton and your own computers. This may be more annoying for day-to-day work, but is better for transferring large data.

Schematic of cluster with current discussion points highlighted; see caption or rest of lesson.

Copying data: there become two copies that you have to manage.

Version control

Don’t forget that you can use version control (git, etc.) for your code and other small files. This way, you transfer to/from Triton via a version control server (Aalto Gitlab, Github, etc.). Often, one would develop locally (committing often, of course), pull on Triton, do whatever minor development is needed directly on Triton to make it work there, then push back to the server.

Mount and copy

You can also mount the data as a network drive (see the previous section) and copy files that way.

Using rsync

Prerequisites

To install rsync on Windows, please refer to this guide.

Rsync is good for large files since it can restart interrupted transfers. rsync actually uses the ssh protocol, so you can rsync from anywhere you can ssh from. rsync is installed by default on Linux and Mac terminals. On Windows machines we recommend using Git Bash.

While there are better places on the internet to read about rsync, it is good to try it out by synchronising a local folder to your scratch space on Triton. Sometimes the issue with copying files is related to group permissions. This command takes care of permissions and makes sure that all your local files are identical (= same MD5 fingerprint) to your remote files:

$ rsync -avzc -e "ssh" --chmod=g+s,g+rw --group=GROUPNAME PATHTOLOCALFOLDER USERNAME@triton.aalto.fi:/scratch/DEPT/PROJECTNAME/REMOTEFOLDER/

Replace the bits in CAPS with your own case. Briefly, -a tries to preserve all attributes of the file, -v increases verbosity to see what rsync is doing, -z uses compression, -c skips files that have identical MD5 checksum, -e specifies to use ssh (not necessary but needed for the commands coming after), --chmod sets the group permissions to shared (as common practice on scratch project folders), and --group sets the groupname to the group you belong to (note that GROUPNAME == PROJECTNAME on our scratch filesystem).

If you want to just check whether your local files are different from the remote ones, you can run rsync in “dry run” mode, so that you only see what the command would do without actually doing anything:

$ rsync --dry-run -avzc ...

Sometimes you want to copy only certain files. E.g. go through all folders, but consider only files ending with .py:

$ rsync -avzc --include '*/' --include '*.py' --exclude '*' ...

Sometimes you want to copy only files under a certain size (e.g. 100MB):

$ rsync -avzc --max-size=100m ...

Rsync does NOT delete files by default, i.e. if you delete a file from the local folder, the remote file will not be deleted automatically, unless you specify the --delete option.

Please note that when working with files containing code or simple text, git is a better option for synchronising your local folder with the remote one: not only will it keep the two folders in sync, you also gain version control, so you can revert to previous versions of your code or txt/csv files.

Using sftp

The SFTP protocol uses ssh to transfer files. On Linux and Mac, the sftp command line program is the most fundamental way to do this, and it is available everywhere.
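
A minimal sketch of an interactive session (the file names are just examples):

$ sftp USERNAME@triton.aalto.fi
sftp> cd /scratch/work/USERNAME
sftp> put local-file.txt
sftp> get remote-file.txt
sftp> quit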

A more user-friendly way of doing this (with a nice GUI) is the Filezilla program. Make sure you are using Aalto VPN, then you can put triton.aalto.fi as SFTP server with port 22.

With all modern operating systems it is also possible to just open your file manager (e.g. Nautilus on Linux) and put this address in the address bar:

sftp://triton.aalto.fi

If you are connecting from outside Aalto and cannot use the VPN, you can connect instead to department machines like kosh.aalto.fi or amor.org.aalto.fi (for NBE). The port is 22. Note: if you do not see your shared folder, you need to manually specify the full path (i.e. the folder is there, just not yet visible).

Exercises

RemoteData-1: Mounting your work directory

Mount your work directory by SMB (or sshfs) and transfer a file to Triton. Note that for SMB, you must be connected to the Aalto VPN (from outside campus), or be on eduroam or the aalto network with an Aalto laptop (from campus).

(advanced) RemoteData-2: rsync

If you have a Linux or Mac computer, or have installed rsync on Windows, study the rsync manual page and try to transfer a file.

What’s next?

The next tutorial is about how the cluster queuing system Slurm works.

Running calculations
Slurm: the queuing system

Videos

Videos of this topic may be available from one of our kickstart course playlists: 2023, 2022 Summer, 2022 February, 2021 Summer, 2021 February.

What is a cluster?

Triton is a large system that combines many different individual computer nodes. Hundreds of people are using Triton simultaneously. Thus resources (CPU time, memory, etc.) need to be shared among everyone.

This resource sharing is done by software called a job scheduler or workload manager, and Triton’s workload manager is Slurm (which is also the dominant one in the world these days). Triton users submit jobs which are then scheduled and allocated resources by the workload manager.

Schematic of cluster with current discussion points highlighted; see caption or rest of lesson.

Slurm allows you to control all of the computing power from the login node.

An analogy: the HPC Diner

You’re eating out at the HPC Diner. What happens when you arrive?

Scheduling resources

  • A host greets you and takes your party size and estimated dining time. You are given a number and asked to wait a bit.

  • The host looks at who is currently waiting and makes a plan.

    • If you are two people, you might squeeze in soon.

    • If you are a lot of people, the host will try to slowly free up enough tables to join together so that you can all eat at once.

    • If you are a really large party, you might need an advance reservation (or have to wait a really long time).

  • Groups are called when it is their turn.

  • Resources (tables) are used as efficiently as possible

Cooking in the background

  • You don’t use your time to cook yourself.

  • You make an order. It goes to the back and gets cooked (possibly a lot at once!), and you can do something else.

  • Your food comes out when ready and you can check the results.

  • Asynchronous execution allows more efficient dining.

Thanks to HPC Carpentry / Sabry Razick for the idea.

The basic process
  • You have your program mostly working

  • You decide what resources you want

  • You ask Slurm to give you those resources

    • You might say “run this and let me know when done”; this is covered later in Serial Jobs.

    • You might want those resources to play around yourself. This is covered next in Interactive jobs.

  • If you are doing the first one, you come back later and check the output files.

The resources Slurm manages

Slurm comes with a multitude of parameters which you can specify to ensure you will be allocated enough memory, CPU cores, time, etc.

3D drawing of a box, with the three dimensions labeled "CPUs", "Memory", and "Time"

Imagine resource requests as boxes of a requested number of CPUs, memory, time, and any other resources requested. The smaller the box, the more likely you can get scheduled soon.

The basic resources are:

  • Time: While not exactly a resource, you need to specify the expected run time of each job for scheduling purposes. If you go over by too much, your job will be killed. This is --time, for example --time=DAYS-HH:MM:SS.

  • Memory: Memory is needed for data in jobs. If you run out of processors, your job is slow, but if you run out of memory, then everything dies. This is --mem or --mem-per-cpu.

  • CPUs (also known as “processors” or “(processor) cores”): Processor cores. This resource lets you do things in parallel the classic way, by adding processors. Depending on how the parallelism works, there are different ways to request the CPUs - see Parallel computing: different methods explained. This is --cpus-per-task and --ntasks, but you must read that page before using these!

  • GPUs: Graphical Processing Units are modern, highly parallel compute units. We will discuss requesting them in GPU computing.

  • If you did even larger work on larger clusters, input/output bandwidth and licenses are also possible resources.

The more resources you request, the lower your priority will be in the future. So be careful what you request!
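
As a preview of how these options look together (the values are only illustrative; the next tutorials show how to actually run jobs):

$ srun --time=00:10:00 --mem=500M --cpus-per-task=1 python3 slurm/pi.py 10000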

See also

As always, the Triton quick reference lists all the options you need.

Other submission parameters

We won’t go into them, but there are other parameters that tell Slurm what to do. For example, you could request to only run on the latest CPU architecture. You could say you want a node all to yourself. And so on.

How many resources to request?

This is one of the most fundamental questions:

  • You want to request enough resources, so that your code actually runs.

  • You don’t want to request too much, since it is wasteful and lowers your priority in the future.

Basically, people usually start by guessing: request more than you think you need at the start, for testing. Check what you have actually used (Triton: slurm history), and adjust the requests to match.

The general rule of thumb is to request the least possible, so that your stuff can run faster. That is because the less you request, the faster you are likely to be allocated resources. If you request something slightly less than a node size (note that we have different size nodes) or partition limit, you are more likely to fit into a spare spot.

For example, we have many nodes with 12 cores, and some with 20 or 24. If you request 24 cores, you have very limited options. However, you are more likely to be allocated a node if you request 10 cores. The same applies to memory: most common cutoffs are 48, 64, 128, 256GB. It’s best to use smaller values when submitting interactive jobs, and more for batch scripts.

Partitions

A Slurm partition is a set of computing nodes dedicated to a specific purpose. Examples include partitions assigned to debugging (the “debug” partition), batch processing (the “batch” partition), GPUs (the “gpu” partition), etc.

On Triton, you don’t need to worry about partitions most of the time - they are automatically set. You might need to set a partition in several cases, though:

  • --partition debug gives you some nodes reserved for quick testing.

  • --partition interactive gives you some settings optimized for interactive work (where things aren’t running constantly).

On other clusters, you might need to set a partition other times.

Command sinfo -s lists a summary of the available partitions. You can see the purpose and use of our partitions in the quick reference.

Exercises

Slurm-1: Info commands

Check out some of these commands: sinfo, sinfo -N, squeue, and squeue -a. These give you some information about Slurm’s state.

What’s next?

We move on to running interactive jobs.

Interactive jobs

Videos

Videos of this topic may be available from one of our kickstart course playlists: 2023, 2022 Summer, 2022 February, 2021 Summer, 2021 February.

Abstract

  • Interactive jobs allow you to quickly test code (before scaling up) or to get more resources for manual analysis.

  • To run a single command interactively

    • srun [SLURM OPTIONS] COMMAND ... - prefix any COMMAND with srun to run it through Slurm

  • To get an interactive shell

    • srun [SLURM OPTIONS] --pty bash (general Slurm)

    • sinteractive (Triton specific)

  • The exact commands often vary among clusters; check your cluster’s docs.

Schematic of cluster with current discussion points highlighted; see caption or rest of lesson.

Interactive jobs let you control a small amount of resources for development work.

Why interactive jobs?

There are two ways you can submit your jobs to the Slurm queue system: either interactively using srun or by submitting a script using sbatch. This tutorial walks you through running your jobs interactively, and the next tutorial covers serial (batch) jobs.

Some people say “the cluster is for batch computing”, but really it is to help you get your work done. Interactive jobs let you:

  • Run a single job in the Slurm job allocation to test parameters and make sure it works (which is easier than constantly modifying batch scripts).

  • Get a large amount of resources for some manual data analysis.

Interactive jobs

Let’s say you want to run the following command:

$ python3 slurm/pi.py 10000

You can submit this program to Triton using srun. All input/output still goes to your terminal (but note that graphical applications don’t work this way - see below):

$ srun --mem=100M --time=0:10:00 python3 slurm/pi.py 10000
srun: job 52204499 queued and waiting for resources

Here, we are asking for 100 Megabytes of memory (--mem=100M) for a duration of ten minutes (--time=0:10:00) (See the quick reference or below for more options). While your job - with jobid 52204499 - is waiting to be allocated resources, your shell blocks.

You can open a new shell (ssh again) on the cluster and run the command squeue -u $USER or slurm q to see all the jobs you currently have waiting in queue:

$ slurm q
JOBID              PARTITION NAME                  TIME       START_TIME    STATE NODELIST(REASON)
52204499           short-ivb python3               0:00              N/A  PENDING (None)

You can see information such as the state, which partition the requested nodes reside in, etc.

Once resources are allocated to your job, srun confirms the allocation and your program’s output appears in your terminal:

srun: job 52204499 has been allocated resources
{"pi_estimate": 3.126, "iterations": 10000, "successes": 7815}

To show that it’s running on a different computer, you can run srun hostname (in this case, it runs on csl42):

$ hostname
login3.triton.aalto.fi
$ srun hostname
srun: job 19039411 queued and waiting for resources
srun: job 19039411 has been allocated resources
csl42.int.triton.aalto.fi

Disadvantages

Interactive jobs are useful for debugging purposes, to test your setup and configurations before you put your tasks in a batch script for later execution.

The major disadvantages include:

  • It blocks your shell until it finishes

  • If your connection to the cluster gets interrupted, you lose the job and its output.

Keep in mind that you shouldn’t open 20 shells to run 20 srun jobs at once. Please have a look at the next tutorial about serial jobs.

Interactive shell

What if you want an actual shell to do things interactively? Put more precisely, you want access to a node in the cluster through an interactive bash shell, with enough resources available to run commands such as Python and do some basic work. For this, you just need srun’s --pty option coupled with the shell you want:

$ srun -p interactive --time=2:00:00 --mem=600M --pty bash

The command prompt will appear when the job starts, and you will have a bash shell running on one of the computation nodes with 600 megabytes of memory, for a duration of 2 hours, where you can run your programs. The option -p interactive requests a node in the interactive partition (a group of nodes) which is dedicated to interactive usage (more on this later).

Warning

Remember to exit the shell when you are done! If you don’t, the shell keeps running and counts towards your usage. This wastes resources and effectively means your priority will degrade in the future.

Interactive shell with graphics

sinteractive is very similar to srun, but more clever, and thus allows you to do X forwarding. It starts a screen session on the node, then sshes there and connects to the shell:

$ sinteractive --time=1:00:00 --mem=1000M

Warning

Just like with srun --pty bash, remember to exit the shell. Since there is a separate screen session running, just closing the terminal isn’t enough. Exit all shells in the screen session on the node (C-d or exit) or cancel the job.

Use remote desktop if off campus

If you are off-campus, you might want to use https://vdi.aalto.fi as a virtual desktop to connect to Triton to run graphical programs: ssh from there to Triton with ssh -XY. Graphical programs run very slowly when sent across the general Internet.

Checking your jobs

When your jobs enter the queue, you need to be able to get information on how much time, memory, etc. your jobs are using in order to know what requirements to ask for. We’ll see this later in Monitoring job progress and job efficiency.

The command slurm history (or sacct --long | less) gives you information such as the actual memory used by your recent jobs, total CPU time, etc. You will learn more about these commands later on.

As shown in a previous example, the command slurm queue (or squeue -u $USER) will tell you the currently running processes, which is a good way to make sure you have stopped everything.

Setting resource parameters

Remember to set the resources you need carefully; otherwise you are wasting resources and lowering your priority. We went over this in Slurm: the queuing system.

Exercises

The scripts you need for the following exercises can be found in our hpc-examples, which we discussed in Using the cluster from a shell. You can clone the repository by running git clone https://github.com/AaltoSciComp/hpc-examples.git. Doing this creates a local copy of the repository in your current working directory. This repository will be used for most of the tutorial exercises.

Interactive-1: Basic Slurm options

The program hpc-examples/slurm/memory-use.py uses up a lot of memory to do nothing. Let’s play with it. It’s run as follows: python hpc-examples/slurm/memory-use.py 50M, where the last argument is however much memory you want to eat. You can use --help to see the options of the program.

  1. Try running the program with 50M.

  2. Run the program with 50M and srun --mem=500M.

  3. Increase the amount of memory the Python process tries to use (not the amount of memory Slurm allocates). How much memory can you use before the job fails?

  4. Look at the job history using slurm history - can you see how much memory it actually used? Note that Slurm only measures memory every 60 seconds or so. To make the program last long enough for the memory use to be measured, give the --sleep option to the Python process, like this: python hpc-examples/slurm/memory-use.py 50M --sleep=60, so that the memory stays allocated long enough to be measured.

Interactive-2: Time scaling

The program hpc-examples/slurm/pi.py calculates pi using a simple stochastic algorithm. The program takes one positional argument: the number of trials.

The time program allows you to time any program, e.g. you can time python x.py to print the amount of time it takes.

  1. Run the program, timing it with time, a few times, increasing the number of trials, until it takes about 10 seconds: time python hpc-examples/slurm/pi.py 500, then 5000, then 50000, and so on.

  2. Add srun in front (srun python ...). Use the seff JOBID command to see how much time the program took to run. (If you’d like to use the time command, you can run srun --mem=MEM --time=TIME time python hpc-examples/slurm/pi.py ITERS)

  3. Look at the job history using slurm history - can you see how much time each process used? What’s the relation between TotalCPUTime and WallTime?

Interactive-3: Info commands

Run squeue -a to see what is running, and then run slurm job JOBID (or scontrol show job JOBID) on some running job - does anything look interesting?

Interactive-4: Showing node information

Run scontrol show node csl1 What is this? (csl1 is the name of a node on Triton - if you are not on Triton, look at the sinfo -N command and try one of those names).

Interactive-5: Why not script srun

Some people are clever and use shell scripting to run srun many times in a loop (using & to background it so that they all run at the same time). Can you list some advantages and disadvantages to this?

What’s next?

In the next tutorial on serial batch jobs, you will learn how to put the above-mentioned commands in a script, namely a batch script (a.k.a. submission script), that allows for a multitude of jobs to run unattended.

Serial Jobs

Videos

Videos of this topic may be available from one of our kickstart course playlists: 2023, 2022 Summer, 2022 February, 2021 Summer, 2021 February.

Abstract

  • Batch scripts let you run work non-interactively, which is important for scaling. You create a batch script, which runs in the background. You come back later and see the results.

  • Example batch script, submit with sbatch the_script.sh:

    #!/bin/bash -l
    #SBATCH --time=01:00:00
    #SBATCH --mem=4G
    
    # Run your code here
    python my_script.py
    
  • See the quick reference for complete list of options.

Schematic of cluster with current discussion points highlighted; see caption or rest of lesson.

This tutorial covers the basics of serial jobs. With what you are learning so far, you can control a small amount of power of the cluster.

Prerequisites
Why batch scripts?

You learned, in Slurm: the queuing system, how all Triton users must do their computation by submitting jobs to the Slurm batch system to ensure efficient resource sharing. This lets you run many things at once without having to watch each one separately - the true power of the cluster.

A batch script is simply a shell script (remember Using the cluster from a shell?), where you put your resource requests and job steps.

Your first job script

A job script is simply a shell script (Bash), so the first line in the script should be the shebang directive (#!) followed by the full path to the shell interpreter, which is Bash in our case. What then follows are the resource requests, and then the job steps.

Let’s take a look at the following script

#!/bin/bash
#SBATCH --time=00:05:00
#SBATCH --mem=100M
#SBATCH --output=pi.out

echo "Hello $USER! You are on node $HOSTNAME.  The time is $(date)."

# For the next line to work, you need to be in the
# hpc-examples directory.
srun python3 slurm/pi.py 10000

Let’s name it run-pi.sh (create a file using your editor of choice, e.g. nano; write the script above and save it)

In a bash script, the symbol # starts a comment, but Slurm interprets lines starting with #SBATCH as parameters that determine the resource requests. Here, we have requested a time limit of 5 minutes, along with 100 MB of RAM.

Resource requests are followed by job steps, which are the actual tasks to be done. Each srun within a Slurm script is a job step, and appears as a separate row in your history - which is useful for monitoring.

Having written the script, you need to submit the job to Slurm through the sbatch command. Since the script runs slurm/pi.py with a relative path, you need to be in the hpc-examples directory from our sample project:

$ cd hpc-examples       # wherever you have hpc-examples
$ sbatch run-pi.sh
Submitted batch job 52428672

Warning

You must use sbatch, not bash, to submit the job, since it is Slurm that understands the #SBATCH directives, not Bash.

When the job enters the queue successfully, the response that the job has been submitted is printed in your terminal, along with the jobid assigned to the job.

You can check the status of your jobs using slurm q/slurm queue (or squeue -u $USER):

$ slurm q
JOBID              PARTITION NAME                  TIME       START_TIME    STATE NODELIST(REASON)
52428672           debug     run-pi.sh             0:00              N/A  PENDING (None)

Once the job is completed successfully, the state changes to COMPLETED and the output is saved to pi.out in the current directory. You can also use wildcards like %u for your username and %j for the jobid in the output file name. See the documentation of sbatch for a full list of available wildcards.
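
For example, an output file name using these wildcards might look like this (the exact pattern is up to you):

#SBATCH --output=pi_%u_%j.out     # e.g. pi_yourusername_52428672.out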

Setting resource parameters

The resources were discussed in Slurm: the queuing system, and barely need to be mentioned again here - the point is they are the same. For example, you might use --mem=5G or --time=5:00:00. Always keep the reference page close for looking these up.

Checking your jobs

Once you submit your job, it goes into the queue. The two most useful commands to see the status of your jobs are slurm q/slurm queue and slurm h/slurm history (or squeue -u $USER and sacct -u $USER).

More information is in the monitoring tutorial.

Cancelling a job

You can cancel jobs with scancel JOBID. To find the job ID, use the monitoring commands.
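
For example, using the job from the earlier example (find the job ID with the monitoring commands first):

$ slurm q                # shows JOBID 52428672 in the queue
$ scancel 52428672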

Full reference

The reference page contains it all, or expand it below.

Slurm quick ref

sbatch
    submit a job to queue (see standard options below)

srun
    within a running job script/environment: run code using the allocated resources (see options below)

srun
    on the frontend: submit to queue, wait until done, show output (see options below)

sinteractive
    submit job, wait, provide shell on node for interactive playing (X forwarding works, default partition interactive). Exit shell when done. (see options below)

srun --pty bash
    (advanced) another way to run interactive jobs, no X forwarding but simpler. Exit shell when done.

scancel JOBID
    cancel a job in queue

salloc
    (advanced) allocate resources from the frontend node. Use srun to run using those resources, exit to close shell when done (see options below)

scontrol
    view/modify job and slurm configuration

Options for sbatch/srun/etc.:

-t, --time=HH:MM:SS
    time limit

-t, --time=DD-HH
    time limit, days-hours

-p, --partition=PARTITION
    job partition. Usually leave off and things are auto-detected.

--mem-per-cpu=N
    request N MB of memory per core

--mem=N
    request N MB of memory per node

-c, --cpus-per-task=N
    allocate N CPUs for each task. For multithreaded jobs. (Compare --ntasks: -c sets the number of cores for each process started.)

-N, --nodes=N-M
    allocate a minimum of N, maximum of M nodes

-n, --ntasks=N
    allocate resources for and start N tasks (one task = one process started; it is up to you to make them communicate. However, the main script runs only on the first node; the sub-processes run with srun are run this many times.)

-J, --job-name=NAME
    short job name

-o OUTPUTFILE
    print output into the file OUTPUTFILE

-e ERRORFILE
    print errors into the file ERRORFILE

--exclusive
    allocate exclusive access to nodes. For large parallel jobs.

--constraint=FEATURE
    request a feature (see slurm features for the current list of configured features, or Arch under the hardware list). Multiple with --constraint="hsw|skl".

--array=0-5,7,10-15
    run the job multiple times, use variable $SLURM_ARRAY_TASK_ID to adjust parameters

--gres=gpu
    request a GPU, or --gres=gpu:n for multiple

--gres=spindle
    request nodes that have disks, spindle:n for a certain number of RAID0 disks

--mail-type=TYPE
    notify of events: BEGIN, END, FAIL, REQUEUE (not on Triton), or ALL. Must be used together with --mail-user=.

--mail-user=YOUR@EMAIL
    whom to send the email to

Option for srun:

-N N_NODES hostname
    print allocated nodes (from within a script)

Exercises

The scripts you need for the following exercises can be found in our hpc-examples, which we discussed in Using the cluster from a shell. You can clone the repository by running git clone https://github.com/AaltoSciComp/hpc-examples.git. Doing this creates a local copy of the repository in your current working directory. This repository will be used for most of the tutorial exercises.

Serial-1: Basic batch job

Submit a batch job that just runs hostname and pi.py. Remember to give pi.py some number of iterations as an argument.

  1. Set time to 1 hour and 15 minutes, memory to 500MB.

  2. Change the job’s name and output file.

  3. Check the output. Does the printed hostname match the one given by slurm history/sacct -u $USER?

Serial-2: Submitting and cancelling a job

Create a batch script which does nothing (or some pointless operation for a while), for example sleep 300 (this shell command does nothing for 300 seconds). Check the queue to see when it starts running. Then, cancel the job. What output is produced?

Serial-3: Modifying a Slurm script while it’s running

Modifying scripts while a job has been submitted is a bad practice.

Add sleep 120 into the Slurm script that runs pi.py. Submit the script and while it is running, open the Slurm script with an editor of your choice and add the following line near the end of the script.

echo 'Modified'

Use slurm q to check when the job finishes and check the output. What can you interpret from this?

Remove the created line after you have finished.

Serial-4: Modify script while it is running

Modifying scripts while a job has been submitted is a bad practice.

Add sleep 180 into the Slurm script that runs pi.py. Submit the script and while it is running, open the pi.py with an editor of your choice and add the following line near the start of the script.

raise Exception()

Use slurm q to check when the job finishes and check the output. What can you interpret from this?

Remove the created line after you have finished. You can also use git checkout -- pi.py (remember to give a proper relative path, depending on your current working directory!)

Serial-5: Checking output

You can look at the output of files as your program is running. Let’s demonstrate.

Create a slurm script that runs the following program. This is a shell script which, every 10 seconds (for 30 iterations), prints the date:

for i in $(seq 30); do
  date
  sleep 10
done
  1. Submit the job to the queue.

  2. Log out from Triton. Log back in and use slurm queue/squeue -u $USER to check the job status.

  3. Use cat NAME_OF_OUTPUTFILE to check the output periodically. You can use tail -f NAME_OF_OUTPUTFILE to view the progress in real time as new lines are added (Control-C to cancel).

  4. Cancel the job once you’re finished.

Serial-6: Constrain to a certain CPU architecture

Modify the script from exercise #1 to run on only one type of CPU using the --constraint option. Hint: check Triton quick reference

Serial-7: Why you use sbatch, not bash.

(Advanced) What happens if you submit a batch script with bash instead of sbatch? Does it appear to run? Does it use all the Slurm options?

(advanced) Serial-8: Interpreters other than bash

(Advanced) Create a batch script that runs in another language using a different #! line. Does it run? What are some of the advantages and problems here?

(advanced) Serial-9: Job environment variables.

Either make an sbatch script that runs the command env | sort, or use srun env | sort. The env utility prints all environment variables, and sort sorts them (| connects the output of env to the input of sort).

This will show all of the environment variables that are set in the job. Note the ones that start with SLURM_. Notice how they reflect the job parameters. You can use these in your jobs if needed (for example, a job that will adapt to the number of allocated CPUs).
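
As a sketch of how these variables can be used (my_threaded_program is a placeholder, and OMP_NUM_THREADS only matters if your program uses OpenMP-style threading):

#!/bin/bash
#SBATCH --time=00:10:00
#SBATCH --mem=500M
#SBATCH --cpus-per-task=4

# Tell a threaded program how many CPUs the job was actually allocated.
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
srun ./my_threaded_program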

What’s next?

There are various tools one can use to do job monitoring.

Monitoring job progress and job efficiency

Videos

Videos of this topic may be available from one of our kickstart course playlists: 2023, 2022 Summer, 2022 February, 2021 Summer, 2021 February.

Abstract

  • You must always monitor jobs to make sure they are using all the resources you request.

  • Test scaling: double the resources; if the job doesn’t run almost twice as fast, it’s not worth it.

  • seff JOBID shows the efficiency and performance of a single job

  • slurm queue shows waiting and running jobs (this is a custom command)

  • slurm history shows completed jobs (also custom command)

  • GPU efficiency: A job’s comment field shows GPU performance info (custom setup at Aalto), sacct -j JOBID -o comment -p shows this.

Introduction

When running jobs, one usually wants to do monitoring at various different stages:

  • Firstly, when a job is submitted, one wants to monitor its position in the queue and its expected start time.

  • Secondly, when the job is running, one wants to monitor the job’s state and how the simulation is performing.

  • Thirdly, once the job has finished, one wants to monitor the job’s performance and resource usage.

There are various tools available for each of these steps.

See also

Please ensure you have read Interactive jobs and Serial Jobs before you proceed with this tutorial.

Monitoring during queueing

The command slurm q/slurm queue (or squeue -u $USER) can be used to monitor the status of your jobs in the queue. An example output is given below:

$ slurm q
JOBID              PARTITION NAME                  TIME       START_TIME    STATE NODELIST(REASON)
60984785           interacti _interactive          0:29 2021-06-06T20:41  RUNNING pe6
60984796           batch-csl hostname              0:00              N/A  PENDING (Priority)

Here the output columns are as follows:

  • JOBID shows the id number that Slurm has assigned for your job.

  • PARTITION shows the partition(s) that the job has been assigned to.

  • NAME shows the name of the submission script / job step / command.

  • TIME shows the amount of time the job has run so far.

  • START_TIME shows the start time of the job. If the job isn’t running yet, Slurm will try to estimate when it will start.

  • STATE shows the state of the job. Usually it is RUNNING or PENDING.

  • NODELIST(REASON) shows the names of the nodes where the program is running. If the job isn’t running, Slurm gives a reason why instead.

When submitting a job, one often wants to see whether it starts successfully. This can be made easier by running slurm w q/slurm watch queue (or watch -n 15 squeue -u $USER). This opens a watcher that prints the output of slurm queue every 15 seconds. The watcher can be closed with <CTRL> + C. Do remember to close the watcher when you’re not watching the output interactively.

To see all of the information that Slurm sees, one can use the command scontrol show -d jobid JOBID.

The slurm queue command is a wrapper built around the squeue command. One can also use squeue directly to get more information on the job’s status. See squeue’s documentation for more information.

There are other slurm subcommands that you can use to monitor the cluster status, job history, etc. A list of examples is given below:

Slurm status info reference

slurm q ; slurm qq
    Status of your queued jobs (long/short)

slurm partitions
    Overview of partitions (A/I/O/T = active/idle/other/total)

slurm cpus PARTITION
    List free CPUs in a partition

slurm history [1day,2hour,…]
    Show status of recent jobs

seff JOBID
    Show percent of mem/CPU used in job. See Monitoring.

sacct -o comment -p -j JOBID
    Show GPU efficiency

slurm j JOBID
    Job details (only while running)

slurm s ; slurm ss PARTITION
    Show status of all jobs

sacct
    Full history information (advanced, needs args)

Full slurm command help:

$ slurm

Show or watch job queue:
 slurm [watch] queue     show own jobs
 slurm [watch] q   show user's jobs
 slurm [watch] quick     show quick overview of own jobs
 slurm [watch] shorter   sort and compact entire queue by job size
 slurm [watch] short     sort and compact entire queue by priority
 slurm [watch] full      show everything
 slurm [w] [q|qq|ss|s|f] shorthands for above!
 slurm qos               show job service classes
 slurm top [queue|all]   show summary of active users
Show detailed information about jobs:
 slurm prio [all|short]  show priority components
 slurm j|job      show everything else
 slurm steps      show memory usage of running srun job steps
Show usage and fair-share values from accounting database:
 slurm h|history   show jobs finished since, e.g. "1day" (default)
 slurm shares
Show nodes and resources in the cluster:
 slurm p|partitions      all partitions
 slurm n|nodes           all cluster nodes
 slurm c|cpus            total cpu cores in use
 slurm cpus   cores available to partition, allocated and free
 slurm cpus jobs         cores/memory reserved by running jobs
 slurm cpus queue        cores/memory required by pending jobs
 slurm features          List features and GRES

Examples:
 slurm q
 slurm watch shorter
 slurm cpus batch
 slurm history 3hours

Other advanced commands (many require lots of parameters to be useful):

squeue
    Full info on queues

sinfo
    Advanced info on partitions

slurm nodes
    List all nodes

Monitoring a job while it is running

As the most common way of using HPC resources is to run non-interactive jobs, it is usually a good idea to make certain that the program you run produces output that can be used to monitor its progress.

The typical way of monitoring progress is to add print statements that write to standard output. This output is then redirected to the Slurm output file (-o FILE, by default slurm-JOBID.out), where it can be read by the user. This file is updated while the job is running, but with some delay (every few KB written) because of buffering.

It is important to differentiate between different types of output:

  • Monitoring output is usually print statements that describe what the program is doing (e.g. “Loading data”, “Running iteration 31”), what the state of the simulation is (e.g. “Total energy is 4.232 MeV”, “Loss is 0.432”), and timing information (e.g. “Iteration 31 took 182s”). This output can then be used to see whether the program works, whether the simulation converges, and how long different calculations take.

  • Debugging output is similar to monitoring output, but it is usually more verbose and writes out the internal state of the program (e.g. values of variables). This is usually required during the development stage of a program, but once the program works and longer simulations are needed, printing debugging output is not recommended.

  • Checkpoint output can be used to resume the simulation from its current state after unexpected situations such as bugs, network problems or hardware failures. Checkpoints should be written in a binary format, as this keeps the accuracy of the floating-point numbers intact. In big simulations checkpoints can be large, so they should not be taken too frequently. In iterative processes, e.g. Markov chains, taking checkpoints can be very quick and can be done more often. In smaller applications it is usually good to take a checkpoint when the program starts a different phase of the simulation (e.g. plotting after simulation). This minimizes loss of simulation time due to programming bugs.

  • Simulation output is what the program writes out when the simulation is done. When doing long simulations it is important to consider which parameters you want to output. One should include all parameters that might be needed, so that the simulations do not need to be run again. For time-series output this is even more important, as e.g. averages and statistical moments cannot necessarily be recalculated after the simulation has ended. It is usually a good idea to save a checkpoint at the end as well.

When creating monitoring output it is usually best to write it in a human-readable format and human-readable quantities. This makes it easy to see the state of the program.

Checking job history after completion

The command slurm h/slurm history can be used to check the history of your jobs. Example output is given below:

$ slurm h
JobID         JobName              Start            ReqMem  MaxRSS TotalCPUTime    WallTime Tasks CPU Ns Exit State Nodes
60984785      _interactive         06-06 20:41:31    500Mc       -    00:01.739    00:07:36  none   1 1   0:0 CANC  pe6
  └─ batch    *                    06-06 20:41:31    500Mc      6M    00:01.737    00:07:36     1   1 1   0:0 COMP  pe6
  └─ extern   *                    06-06 20:41:31    500Mc      1M    00:00.001    00:07:36     1   1 1   0:0 COMP  pe6
60984796      hostname             06-06 20:49:36    500Mc       -    00:00.016    00:00:00  none  10 10  0:0 CANC  csl[3-6,9,14,17-18,20,23]
  └─ extern   *                    06-06 20:49:36    500Mc      1M    00:00.016    00:00:01    10  10 10  0:0 COMP  csl[3-6,9,14,17-18,20,23]

Here the output columns are as follows:

  • JobID shows the id number that Slurm has assigned for your job.

  • JobName shows the name of the submission script / job step / command.

  • Start shows the start time of the job.

  • ReqMem shows the amount of memory requested by the job. The format is an amount in megabytes or gigabytes followed by c or n for memory per core or memory per node, respectively.

  • MaxRSS shows the maximum memory usage of the job as calculated by Slurm. This is measured in set intervals.

  • TotalCPUTime shows the total CPU time used by the job. It shows the amount of seconds the CPUs were at full utilization. For single CPU jobs, this should be close to the WallTime. For jobs that use multiple CPUs, this should be close to the number of CPUs reserved times WallTime.

  • WallTime shows the wall-clock runtime of the job.

  • Tasks shows the number of MPI tasks reserved for the job.

  • CPU shows the number of CPUs reserved for the job.

  • Ns shows the number of nodes reserved for the job.

  • Exit State shows the exit code of the command. Successful run of the program should return 0 as the exit code.

  • Nodes shows the names of the nodes where the program ran.

The slurm history command is a wrapper built around the sacct command. One can also use sacct directly to get more information on the job’s status. See sacct’s documentation for more information.

For example, the command sacct --format=jobid,elapsed,ncpus,ntasks,state,MaxRss --jobs=JOBID will show the information indicated in the --format option (jobid, elapsed time, number of reserved CPUs, etc.). You can specify any field of interest to be shown using --format.

Checking CPU and RAM efficiency after completion

You can use seff JOBID to see what percent of available CPUs and RAM was utilized. Example output is given below:

$ seff 60985042
Job ID: 60985042
Cluster: triton
User/Group: tuomiss1/tuomiss1
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 2
CPU Utilized: 00:00:29
CPU Efficiency: 90.62% of 00:00:32 core-walltime
Job Wall-clock time: 00:00:16
Memory Utilized: 1.59 MB
Memory Efficiency: 0.08% of 2.00 GB

If your processor usage is far below 100%, your code may not be working correctly. If your memory usage is far below 100% or above 100%, you might have a problem with your RAM requirements. You should set the RAM limit to be a bit above the RAM that you have utilized.

You can also monitor individual job steps by calling seff with the syntax seff JOBID.JOBSTEP.

Important

When making job reservations it is important to distinguish between requirements for the whole job (such as --mem) and requirements for each individual task/cpu (such as --mem-per-cpu). E.g. requesting --mem-per-cpu=2G with --ntasks=2 and --cpus-per-task=4 will create a total memory reservation of (2 tasks)*(4 cpus / task)*(2GB / cpu)=16GB.
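
Written out as Slurm directives, the reservation in that example would look like this (a sketch of the directives only, not a recommendation for any particular program):

#SBATCH --ntasks=2
#SBATCH --cpus-per-task=4
#SBATCH --mem-per-cpu=2G
# Total memory reserved: (2 tasks) * (4 CPUs/task) * (2 GB/CPU) = 16 GB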

Monitoring a job’s GPU utilization

See also

GPU computing. We will talk about how to request GPUs later, but it’s kept here for clarity.

When running a GPU job, you should check that the GPU is being fully utilized.

When your job has started, you can ssh to the node and run nvidia-smi. The GPU utilization it reports should be close to 100%.

Once the job has finished, you can use slurm history to obtain the jobID and run:

$ sacct -j JOBID -o comment -p
{"gpu_util": 99.0, "gpu_mem_max": 1279.0, "gpu_power": 204.26, "ncpu": 1, "ngpu": 1}|

This also shows the GPU utilization.

If the GPU utilization of your job is low, you should check whether its CPU utilization is close to 100% with seff JOBID. Having a high CPU utilization and a low GPU utilization can indicate that the CPUs are trying to keep the GPU occupied with calculations, but the workload is too much for the CPUs and thus GPUs are not constantly working.

Increasing the number of CPUs you request can help, especially in tasks that involve data loading or preprocessing, but your program must know how to utilize the CPUs.

However, you shouldn’t request too many CPUs: there wouldn’t be enough CPUs left for other users of the GPUs, and the GPUs would go to waste (all of our nodes have 4-12 CPUs for each GPU).
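
As a rough sketch of the kind of request this leads to (the numbers are placeholders; GPU options are covered in GPU computing):

#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=4      # a few CPUs for data loading/preprocessing
#SBATCH --mem=8G
#SBATCH --time=04:00:00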

Exercises

The scripts you need for the following exercises can be found in our hpc-examples, which we discussed in Using the cluster from a shell. You can clone the repository by running git clone https://github.com/AaltoSciComp/hpc-examples.git. Doing this creates a local copy of the repository in your current working directory. This repository will be used for most of the tutorial exercises.

Monitoring-1: Adding more verbosity into your scripts

echo is a shell command which prints something - the equivalent of “print debugging”.

date is a shell command that prints the current date and time. It is useful for getting timestamps.

Modify one of the scripts from Serial Jobs with a lot of echo MY LINE OF TEXT commands to be able to verify what it’s doing. Check the output.

Now change the script and add the date command below the echo commands. Run the script and check the output. What do you see?

Now change the script, remove the echos, and add “set -x” below the #SBATCH-comments. Run the script again. What do you see?

Monitoring-2: Basic monitoring example

Using our standard pi.py example,

  1. Create a slurm script that runs the algorithm with 100000000 (\(10^8\)) iterations. Submit it to the queue and use slurm queue, slurm history and seff to monitor the job’s performance.

  2. Add multiple job steps (separate srun lines), each of which runs pi.py with an increasing number of iterations (from 100 up to 10000000 (\(10^7\))). How does this appear in slurm history?

Monitoring-3: Using seff

Continuing from the example above,

  1. Use seff to check performance of individual job steps. Can you explain why the CPU utilization numbers change between steps?

This is really one of the most important take-aways from this lesson.

Monitoring-4: Multiple processors

The script pi.py has been written so that it can be run using multiple processors. Run the script with multiple processors and \(10^8\) iterations with:

$ srun --cpus-per-task=2 python pi.py --nprocs=2 100000000

After you have run the script, do the following:

  1. Use slurm history to check the TotalCPUTime and WallTime. Compare them to the timings for the single CPU run with \(10^8\) iterations.

  2. Use seff to check CPU performance of the job.

Monitoring-5: No output

You submit a job, and it should be writing some stuff to the output. But nothing is appearing in the output file. What’s wrong?

What’s next?

Next tutorial is about different ways of doing parallel computing.

Parallel computing: different methods explained

Videos

Videos of this topic may be available from one of our kickstart course playlists: 2023, 2022 Summer, 2022 February, 2021 Summer, 2021 February.

Parallel computing is what HPC is really all about: processing things on more than one processor at once. By now, you should have read all of the previous tutorials.

Abstract

  • You need to figure out what parallelization paradigm your program uses, otherwise you won’t know which options to use.

  • You must always monitor jobs to make sure they are using all the resources you request (seff JOBID).

  • If you aren’t fully sure of how to scale up, contact us Research Software Engineers early.

Schematic of cluster with current discussion points highlighted; see caption or rest of lesson.

Parallel computing lets you use more of the cluster at once: many processors, whole nodes, or even several nodes at the same time.

Parallel programming models

Parallel programming is used to create programs that can execute instructions on multiple processors at the same time. Most of our users who run their programs in parallel utilize existing parallel execution features that are present in their programs, and thus do not need to learn how to create parallel programs. But even when running existing programs in parallel, it is important to understand the different models of parallel execution.

The main models of parallel programming are:

  • Embarrassingly parallel: the problem can be split into completely independent jobs that can be executed separately, with no communication between the individual jobs.

    More often than not, scientific problems involve running a single program again and again with different datasets or parameters. Slurm has a structure called job array, which enables users to easily submit a large amount of such jobs.

    Any program can be run in an embarrassingly parallel way as long as the problem at hand can be split into multiple independent jobs.

    Each job in an array is identical to every other job, but each independent job gets its own unique ID.

    Workloads that utilize this model should request the resources that a single job needs, plus the number of jobs that the whole array should have.

    See: array jobs.

    Representation of array jobs on our cluster schematic.

    The array job runs independently across the cluster.

  • Shared memory (or multithreaded/multiprocess) parallel programs run multiple processes / threads on the same machine. As the name suggests, all of the computer’s memory has to be accessible to all of the processes / threads.

    Thus programs that utilize this model should request one node, one task and multiple CPUs.

    Example applications that utilize this model: Matlab (internally & parallel pool), R (internally & parallel-library), Python (numpy internally & threading/multiprocessing-modules), OpenMP applications, BLAS libraries, FFTW libraries, typical multithreaded/multiprocess parallel desktop programs.

    See: shared-memory parallelism.

    Representation of shared memory jobs on our cluster schematic.

    The shared memory job runs across one node - since that’s what shares memory.

  • MPI parallelism utilizes MPI (Message Passing Interface) libraries for communication between MPI tasks. These MPI tasks work in a collective fashion and each task executes its part of the same program.

    Communication between MPI tasks passes through the high-speed interconnects between different compute nodes, and this allows programs to utilize thousands of CPU cores.

    Almost all large-scale scientific programs utilize MPI. MPI programs are usually quite complex and written for a specific use case as the nature of the collective operations depends on the problem at hand.

    Programs that utilize this model should request single/multiple nodes with multiple tasks each. You should not request multiple CPUs per task.

    Example applications that utilize this model: CP2K, GPAW, LAMMPS, OpenFoam. See: MPI parallelism.

    Representation of MPI jobs in our cluster schematic.

    The MPI job can communicate across nodes.

  • Parallel execution on GPUs is not parallel in the traditional sense where multiple CPUs run different processes. Instead, GPU parallelism leverages GPGPUs (general-purpose graphics processing units) that have thousands of compute cores inside them. For suitable problems, GPUs can be substantially faster than CPUs.

    Programs that utilize GPUs are written so that some parts of the program execute on the CPU and others on the GPU. The part that runs on the CPU usually does things like reading input and writing output, while the GPU part focuses on numerical calculations. Often multiple CPUs are needed per GPU for things such as data preprocessing, just to keep the GPU occupied.

    A typical CPU program cannot utilize GPUs unless it has been designed to use them. Additionally programs that utilize GPUs cannot utilize multiple GPUs unless they have been designed for it.

    Programs that utilize GPUs should request a single node, a single task, (optionally) multiple CPUs and a GPU.

    See: GPU computing.


Does my code parallelize?

Normal serial code can’t just be run in parallel without modifications. As a user, it is your responsibility to understand which parallel model, if any, your code implements.

When deciding whether using parallel programming is worth the effort, one should be mindful of Amdahl’s law and Gustafson’s law. All programs have some parts that can only be executed serially, and thus the speedup that one can get from parallel execution depends on how much of the program’s execution can be done in parallel.
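
To make Amdahl’s law concrete: if a fraction \(p\) of the program’s work can be parallelized, the best possible speedup on \(N\) processors is \(S(N) = 1 / ((1 - p) + p/N)\). For example, a program that is 90% parallel (\(p = 0.9\)) can never run more than \(1 / (1 - 0.9) = 10\) times faster, no matter how many CPUs you give it.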


Thus, if your program runs mainly in serial and has only a small parallel part, running it in parallel might not be worth it. Sometimes, data parallelism with e.g. array jobs is a much more fruitful approach.

Another important note regarding parallelism is that applications only scale well up to some upper limit, which depends on the application’s implementation, the size and type of problem you solve, and some other factors. The best practice is to benchmark your code on different numbers of CPU cores before you start actual production runs.

If you want to run a program in parallel, you have to know something about it - does it use shared memory or MPI? A program doesn’t magically get faster when you ask for more processors if it isn’t designed to use them.

Combining different parallel execution models

Different parallel execution models can be combined if your program supports them. Below a few common situations are listed:

Embarrassingly parallel everything

As running programs in an embarrassingly parallel fashion is not a feature of the program, but a feature of the workflow itself, any program can be run in an embarrassingly parallel fashion if needed.

One can run shared-memory parallel, MPI parallel and GPU parallel jobs in array jobs as well. Each individual job will get its own resources.

Hybrid parallelism

When MPI and shared memory parallelism are done by the same application it is usually called hybrid parallelization. Programs that utilize this model can require both multiple tasks and multiple CPUs per task.

For example, CP2K compiled for the psmp target has hybrid parallelization enabled, while the popt target has only MPI parallelization enabled. The best ratio between MPI tasks and CPUs per task depends on the program and needs to be measured.

Shared memory parallelism and GPUs

GPUs usually execute their part of the program very quickly. This, combined with the fact that there are typically many more CPUs in a GPU machine than there are GPUs, creates a situation where it is advantageous to use multiple CPUs to minimize the time needed by the CPU part of the calculation.

Deep learning frameworks such as Tensorflow and PyTorch also use CPUs for data preprocessing while the GPU is doing training.

Multi-node parallelism without MPI

Some programs can run with multiple nodes in parallel, but they do not use MPI for communication between nodes. Resources for these programs are reserved in a similar fashion to the MPI programs, but the program launch is usually done by scripts that run different instructions on different machines. The setup depends on the program and can be complex.

See also
  • The Research Software Engineers can help in all aspects of parallel computing - we’d recommend anyone getting to this point set up a consultation to make sure your work is as efficient as it can be.

What’s next?

The next tutorial is about array jobs.

Array jobs: embarrassingly parallel execution

Videos

Videos of this topic may be available from one of our kickstart course playlists: 2023, 2022 Summer, 2022 February, 2021 Summer, 2021 February.

Abstract

  • Array jobs allow you to submit one job script that runs many times with the same Slurm parameters.

  • Submit with the --array= Slurm argument, give array indexes like --array=1-10,12-15.

  • The $SLURM_ARRAY_TASK_ID environment variable tells a job which array index it is.

  • There are different templates to use below, which you can adapt to your task.

  • If you aren’t fully sure of how to scale up, contact us Research Software Engineers early.

More often than not, scientific problems involve running a single program again and again with different datasets or parameters.

When there is no dependency or communication among the individual program runs, they can be run in parallel as separate Slurm jobs. This kind of parallelism is called embarrassingly parallel.

Slurm has a structure called job array, which enables users to easily submit and run several instances of the same Slurm script independently in the queue.

Schematic of cluster with current discussion points highlighted; see caption or rest of lesson.

Array jobs let you control a large amount of the cluster. In Parallel computing: different methods explained, we will see another way.

Introduction

Array jobs allow you to parallelize your computations. They are used when you need to run the same job many times with only slight changes among the jobs. For example, you need to run 1000 jobs each with a different seed value for the random number generator. Or perhaps you need to apply the same computation to a collection of data sets. These can be done by submitting a single array job.

A Slurm job array is a collection of jobs that are to be executed with identical parameters. This means that there is one single batch script that is to be run as many times as indicated by the --array directive, e.g.:

#SBATCH --array=0-4

creates an array of 5 jobs (tasks) with index values 0, 1, 2, 3, 4.

The array tasks are copies of the submitted batch script that are automatically submitted to Slurm. Slurm provides a unique environment variable, SLURM_ARRAY_TASK_ID, to each task, which can be used for handling input/output files for each task.


--array via the command line

You can also pass the --array option as a command-line argument to sbatch. This can be great for controlling things without editing the script file.

Important

When running an array job you’re basically running many identical copies of a single job. Thus it is increasingly important to know how your code behaves with respect to the file system:

  • Does it use libraries/environment stored in the work directory?

  • How much input data does it need?

  • How much output data does the job create?

For example, running an array job with hundreds of workers that uses a Python environment stored in the work disk can inadvertently cause a lot of filesystem load as there will be hundreds of thousands of file calls.

If you’re unsure how your job will behave, ask us Research Software Engineers for help.

Your first array job

Let’s see a job array in action. Let’s create a file called array_example.sh and write it as follows.

#!/bin/bash
#SBATCH --time=00:15:00
#SBATCH --mem=200M
#SBATCH --output=array_example_%A_%a.out
#SBATCH --array=0-15

# You may put the commands below:

# Job step
srun echo "I am array task number" $SLURM_ARRAY_TASK_ID

Submitting the job script to Slurm with sbatch array_example.sh, you will get the message:

Submitted batch job 60997836

The job id in the message is that of the primary array job. This is common for all of the jobs in the array. In addition, each individual job is given an array task id.

Since we’re now submitting multiple jobs simultaneously, each job needs an individual output file, or the outputs will overwrite each other. By default, Slurm writes the outputs to files named slurm-${SLURM_ARRAY_JOB_ID}_${SLURM_ARRAY_TASK_ID}.out. This can be overridden using the --output=FILENAME parameter, where you can use the wildcard %A for the job id and %a for the array task id.

Once the jobs are completed, the output files will be created in your working directory (the wildcard %u can additionally be used to include your user name in the file names):

$ ls
array_example_60997836_0.out   array_example_60997836_12.out  array_example_60997836_15.out  array_example_60997836_3.out  array_example_60997836_6.out  array_example_60997836_9.out
array_example_60997836_10.out  array_example_60997836_13.out  array_example_60997836_1.out   array_example_60997836_4.out  array_example_60997836_7.out  array_example.sh
array_example_60997836_11.out  array_example_60997836_14.out  array_example_60997836_2.out   array_example_60997836_5.out  array_example_60997836_8.out

You can cat one of the files to see the output of each task:

$ cat array_example_60997836_11.out
I am array task number 11

Important

The array indices do not need to be sequential. For example, if after running an array job you find out that tasks 2 and 5 failed, you can relaunch just those jobs with --array=2,5.

You can even simply pass the --array option as a command-line argument to sbatch.
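
For example, rerunning only the failed tasks 2 and 5 of the earlier example could look like this (the script itself stays unchanged):

$ sbatch --array=2,5 array_example.sh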

More examples

The following examples give you an idea on how to use job arrays for different use cases and how to utilize the $SLURM_ARRAY_TASK_ID environment variable. In general,

  • You need some map of numbers to configuration. This might be files on the filesystem, a hardcoded mapping in your code, or some configuration file.

  • You generally want the mapping to not get lost. Be careful about running some jobs, changing the mapping, and running more: you might end up with a mess!

Reading input files

In many cases, you would like to process several data files. That is, pass different input files to your code to be processed. This can be achieved by using $SLURM_ARRAY_TASK_ID environment variable.

In the example below, the array job gives the program different input files, based on the value of the $SLURM_ARRAY_TASK_ID:

#!/bin/bash
#SBATCH --time=01:00:00
#SBATCH --mem=1G
#SBATCH --array=0-29

# Each array task runs the same program, but with a different input file.
srun ./my_application -input input_data_${SLURM_ARRAY_TASK_ID}
Hardcoding arguments in the batch script

One way to pass arguments to your code is by hardcoding them in the batch script you want to submit to Slurm.

Assume you would like to run the pi estimation code for 5 different seed values, each for 2.5 million iterations. You could assign a seed value to each task in your job array and save each output to a file. Having calculated all the estimates, you could take the average of all the pi values to arrive at a more accurate estimate. An example of such a batch script, pi_array_hardcoded.sh, is as follows.

#!/bin/bash
#SBATCH --time=01:00:00
#SBATCH --mem=500M
#SBATCH --job-name=pi-array-hardcoded
#SBATCH --output=pi-array-hardcoded_%a.out
#SBATCH --array=0-4

case $SLURM_ARRAY_TASK_ID in
   0)  SEED=123 ;;
   1)  SEED=38  ;;
   2)  SEED=22  ;;
   3)  SEED=60  ;;
   4)  SEED=432 ;;
esac

srun python slurm/pi.py 2500000 --seed=$SEED > pi_$SEED.json

Save the script and submit it to Slurm:

$ sbatch pi_array_hardcoded.sh
Submitted batch job 60997871

Once finished, 5 Slurm output files and 5 application output files will be created in your current directory, each containing the pi estimate, the total number of iterations, and the total number of successes:

$ cat pi_22.json
{"successes": 1963163, "pi_estimate": 3.1410608, "iterations": 2500000}
Reading parameters from one file

Another way to pass arguments to your code is to save them to a file and have your script read the arguments from it.

Drawing on the previous example, let’s assume you now want to run pi.py with different numbers of iterations. You can create a file, say iterations.txt, and write all the values to it, e.g.:

$ cat iterations.txt
100
1000
50000
1000000

You can modify the previous script to have it read iterations.txt one line at a time and pass each value on to pi.py. Here, sed is used to get each line. Alternatively you can use any other command-line utility, e.g. awk. Do not worry if you don’t know how sed works - a Google search and man sed always help. Also note that the line numbers start at 1, not 0.

The script pi_array_parameter.sh looks like this:

#!/bin/bash
#SBATCH --time=01:00:00
#SBATCH --mem=500M
#SBATCH --job-name=pi-array-parameter
#SBATCH --output=pi-array-parameter_%a.out
#SBATCH --array=1-4

n=$SLURM_ARRAY_TASK_ID
iteration=`sed -n "${n} p" iterations.txt`      # Get n-th line (1-indexed) of the file
srun python slurm/pi.py ${iteration} > pi_iter_${iteration}.json

You can additionally do this procedure in a more complex way, e.g. read in multiple arguments from a csv file, etc.

(Advanced) Two-dimensional array scanning

What if you wanted an array job that scanned a 2D array of points? Well, you can map 1D to 2D via the following pseudo-code: x = TASK_ID // N (floor division) and y = TASK_ID % N (modulo operation). Then map these numbers into your grid. This can be done in bash, but at this point you’d want to start thinking about passing the SLURM_ARRAY_TASK_ID variable into your code itself for this processing.
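
A minimal bash sketch of this mapping, assuming a 10x10 grid and a hypothetical my_scan.py program that takes two index arguments; adapt the last line to your own code:

#!/bin/bash
#SBATCH --array=0-99

N=10                                 # width of the grid
x=$(( SLURM_ARRAY_TASK_ID / N ))     # floor division: first index, 0..9
y=$(( SLURM_ARRAY_TASK_ID % N ))     # modulo: second index, 0..9

# Map the two indices to your actual parameter values inside your code.
srun python my_scan.py --x=$x --y=$y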

(Advanced) Grouping runs together in bigger chunks

If you have lots of short jobs (a few minutes each), using array jobs may cause too much scheduling overhead, and you will create a huge number of output files. In these cases you might want to combine multiple program runs into a single array job.

Important

A good target time for the array jobs would be approximately 30 minutes, so please try to combine your tasks so that each job would at least take this long.

An easy workaround is to create a for-loop in your Slurm script. For example, if you want to run the pi script with 50 different seed values, you could run them in chunks of 10 and run a total of 5 array jobs. This reduces the number of array jobs we need by a factor of 10!

This method demands more knowledge of shell scripting, but the end result is a fairly simple Slurm script, pi_array_grouped.sh, that does what we need.

#!/bin/bash
#SBATCH --time=01:00:00
#SBATCH --mem=500M
#SBATCH --job-name=pi-array-grouped
#SBATCH --output=pi-array-grouped_%a.out
#SBATCH --array=1-4

# Lets create a new folder for our output files
mkdir -p json_files

CHUNKSIZE=10
n=$SLURM_ARRAY_TASK_ID
indexes=`seq $((n*CHUNKSIZE)) $(((n + 1)*CHUNKSIZE - 1))`

for i in $indexes
do
   srun python slurm/pi.py 1500000 --seed=$i > json_files/pi_$i.json
done
Exercises

The scripts you need for the following exercises can be found in our hpc-examples, which we discussed in Using the cluster from a shell. You can clone the repository by running git clone https://github.com/AaltoSciComp/hpc-examples.git. Doing this creates a local copy of the repository in your current working directory. This repository will be used for most of the tutorial exercises.

Array-1: Array jobs and different random seeds

Create a job array that uses the slurm/pi.py to calculate a combination of different iterations and seed values and save them all to different files. Keep the standard output (#SBATCH --output=FILE) separate from the standard error (#SBATCH --error=FILE).

Array-2: Combine the outputs of the previous exercise.

You will find a pi-aggregate.py program in hpc-examples. Run it and give all the output files as arguments. It will combine all the statistics and give a more accurate value of \(\pi\).

Array-3: Reflect on array jobs in your work

Think about your typical work. How could you split your stuff into trivial pieces that can be run with array jobs? When can you make individual jobs smaller, but run more of them as array jobs?

(Advanced) Array-4: Array jobs with advanced index selector

Make a job array which runs every other index, e.g. the array can be indexed as 1, 3, 5… (the sbatch manual page can be of help)

Array-5: Array job with varying memory requirements.

Make an array job that runs slurm/memory-use.py with five different values of memory (50M, 100M, 500M, 1000M, 5000M) using one of the techniques above - this is the memory that the memory-use script allocates, not what is requested from Slurm. Request 250M of memory for the array job. See if some of the jobs fail.

Is this a proper use of array jobs?

What’s next?

The next tutorial is about shared memory parallelism.

Shared memory parallelism: multithreading & multiprocessing

Videos

Videos of this topic may be available from one of our kickstart course playlists: 2023, 2022 Summer, 2022 February, 2021 Summer, 2021 February.

Abstract

  • Verify that your program can utilize multiple CPUs.

  • Use --cpus-per-task=C to reserve C CPUs for your job.

  • If you use srun to launch your program in your sbatch-script and want your program to utilize all of the allocated CPUs, run export SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK in your script before calling srun.

  • You must always monitor jobs to make sure they are using all the resources you request (seff JOBID).

  • If you aren’t fully sure of how to scale up, contact us, the Research Software Engineers, early.

Schematic of cluster with current discussion points highlighted; see caption or rest of lesson.

Shared-memory parallelism and multiprocessing lets you scale to the size of one node. For purposes of illustration, the picture isn’t true to life: we call the whole stack one node, but in reality each node is one of the rows.

What is shared memory parallelism?

In shared memory parallelism a program will launch multiple processes or threads so that it can leverage multiple CPUs available in the machine.

Slurm reservations for both methods behave similarly. This document will talk about processes, but everything mentioned applies to threads as well. See the section on the difference between processes and threads for more information on how processes and threads differ.

Communication between processes happens via shared memory. This means that all processes need to run on the same machine.

_images/parallel-shared.svg

Depending on the program, you might have multiple processes (Matlab parallel pool, R parallel library, Python multiprocessing) or multiple threads (OpenMP threads of the BLAS libraries that R/numpy use).

Running a typical multiprocess program
Reserving resources for shared memory programs

Reserving resources for shared memory programs is easy: you only need to specify how many CPUs you want via the --cpus-per-task=C flag.

For most programs, using --mem=M is the correct way to reserve memory, but in some cases the amount of memory needed scales with the number of processors. This might happen, for example, if each process opens a different dataset or runs its own simulation that needs extra memory allocations. In these cases you can use --mem-per-cpu=M to specify a memory allocation that scales with the number of CPUs. We recommend starting with --mem=M if you do not know how your program scales.
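
To illustrate the difference between the two options, here is a minimal sketch of the relevant reservation lines (the numbers are made up):

# Fixed total memory for the job, shared by all CPUs:
#SBATCH --cpus-per-task=4
#SBATCH --mem=2G            # 2GB in total, no matter how many CPUs

# Alternatively, memory that scales with the number of CPUs:
#SBATCH --cpus-per-task=4
#SBATCH --mem-per-cpu=500M  # 4 * 500M = 2GB in total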

Running an example shared memory parallel program

The scripts you need for the following exercises can be found in our hpc-examples, which we discussed in Using the cluster from a shell. You can clone the repository by running git clone https://github.com/AaltoSciComp/hpc-examples.git. Doing this creates a local copy of the repository in your current working directory. This repository will be used for most of the tutorial exercises.

For this example, let’s consider pi.py in the slurm-folder. It estimates pi with Monte Carlo methods and can utilize multiple processes for calculating the trials.

You can call the script with python pi.py --nprocs=C N, where N is the number of iterations to be done by the algorithm and C is the number of processors to be used for the parallel calculation.

Let’s run the program with two processes using srun:

$ srun --cpus-per-task=2 --time=00:10:00 --mem=1G python pi.py --nprocs=2 1000000

It is vitally important to notice that the program needs to be told the number of processes it should use. The program does not obtain this information from the queue system automatically. If the program does not know how many CPUs it has, it might try to use too many or too few. For more information, see the section on CPU over- and undersubscription.

When using a Slurm script, giving the number of CPUs to the program becomes easier:

#!/bin/bash
#SBATCH --time=00:10:00
#SBATCH --mem=1G
#SBATCH --output=pi.out
#SBATCH --cpus-per-task=2

export SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK

srun python pi.py --nprocs=$SLURM_CPUS_PER_TASK 1000000

Let’s call this script pi-sharedmemory.sh. You can submit it with:

$ sbatch pi-sharedmemory.sh

The environment variable $SLURM_CPUS_PER_TASK is set when the job runs, based on the number of CPUs requested with --cpus-per-task. For more tricks on how to set the number of processors, see the section on using it effectively.

If you use srun to launch your program in your sbatch-script and want your program to utilize all of the allocated CPUs, run export SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK in your script before calling srun. For more information see section on lack of parallelization when using srun.

Special cases and common pitfalls
Monitoring CPU utilization for parallel programs

You can use seff JOBID to see what percent of available CPUs and RAM was utilized. Example output is given below:

$ seff 60985042
Job ID: 60985042
Cluster: triton
User/Group: tuomiss1/tuomiss1
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 2
CPU Utilized: 00:00:29
CPU Efficiency: 90.62% of 00:00:32 core-walltime
Job Wall-clock time: 00:00:16
Memory Utilized: 1.59 MB
Memory Efficiency: 0.08% of 2.00 GB

If your processor usage is far below 100%, your code may not be working correctly. If your memory usage is far below 100% or above 100%, you might have a problem with your RAM requirements. You should set the RAM limit to be a bit above the RAM that you have utilized.

You can also monitor individual job steps by calling seff with the syntax seff JOBID.JOBSTEP.

Important

When making job reservations it is important to distinguish between requirements for the whole job (such as --mem) and requirements for each individual task/cpu (such as --mem-per-cpu). E.g. requesting --mem-per-cpu=2G with --ntasks=2 and --cpus-per-task=4 will create a total memory reservation of (2 tasks)*(4 cpus / task)*(2GB / cpu)=16GB.

Multithreaded vs. multiprocess and double-booking

Processes are individual program executions, while threads are smaller units of execution within a process. Processes have their own memory allocations and can work independently from the main process. Threads, on the other hand, run within the main process and share its memory. Processes are slower to launch, but due to their independent nature they won’t block each other’s execution as easily as threads can.

Some programs can utilize both multithread and multiprocess parallelism. For example, R has parallel-library for running multiple processes while BLAS libraries that R uses can utilize multiple threads.

When running a program that can parallelise both through processes and through threads, you should check that only one method of parallelisation is in effect.

Using both can result in double-booked parallelism where you launch \(N\) processes and each process launches \(N\) threads, which results in \(N^2\) threads. This will usually tank the performance of the code as the CPUs are overbooked.

Threading is often done in a lower-level library that has been implemented with OpenMP. If you encounter bad performance or see a huge number of threads appearing when you use parallel processes, try setting export OMP_NUM_THREADS=1 in your Slurm script.
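
For example, a minimal sketch of a Slurm script that disables OpenMP threading while running the process-parallel pi.py example from above:

#!/bin/bash
#SBATCH --time=00:10:00
#SBATCH --mem=1G
#SBATCH --cpus-per-task=4

# Give all allocated CPUs to srun...
export SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK
# ...and stop OpenMP-based libraries from launching extra threads on top of the processes
export OMP_NUM_THREADS=1

srun python pi.py --nprocs=$SLURM_CPUS_PER_TASK 1000000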

Over- and undersubscription of CPUs

The number of threads/processes you launch should match the number of requested processors. If you create a lower number, you will not utilize all CPUs. If you launch a larger number, you will oversubscribe the CPUs and the code will run slower as different threads/processes will have to swap in/out of the CPUs and compete for the same resources.

Using threads and processes at the same time can also result in double-booking.

Using $SLURM_CPUS_PER_TASK is the best way of letting your program know how many CPUs it should use. See section on using it effectively for more information.

Using SLURM_CPUS_PER_TASK effectively

The environment variable $SLURM_CPUS_PER_TASK can be utilized in multiple ways in your scripts. Below are a few examples:

  • Setting a number of workers when $SLURM_CPUS_PER_TASK is not set:

    $SLURM_CPUS_PER_TASK is only set when --cpus-per-task has been specified. If you want to run the same code on your own machine and on the cluster, it might be useful to set a variable like export NCORES=${SLURM_CPUS_PER_TASK:-4} and use that in your scripts (see the sketch after this list).

    Here $NCORES is set to the number specified by $SLURM_CPUS_PER_TASK if it has been set. Otherwise, it will be set to 4 via Bash’s syntax for setting default values for unset variables.

  • In Python you can use the following for obtaining the environment variable:

    import os
    
    ncpus=int(os.environ.get("SLURM_CPUS_PER_TASK", 1))
    

    For more information on parallelisation in Python see our Python documentation.

  • In R you can use the following for obtaining the environment variable:

    ncpus <- as.integer(Sys.getenv("SLURM_CPUS_PER_TASK", unset=1))
    

    For more information on parallelisation in R see our R documentation.

  • In Matlab you can use the following for obtaining the environment variable:

    ncpus=str2num(getenv("SLURM_CPUS_PER_TASK"))
    

    For more information on parallelisation in Matlab see our Matlab documentation.
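
As mentioned in the first bullet above, here is a minimal sketch of the fallback pattern (the default of 4 cores and the pi.py call are just examples):

#!/bin/bash
# Use the Slurm allocation if it is set, otherwise fall back to 4 cores
export NCORES=${SLURM_CPUS_PER_TASK:-4}

python pi.py --nprocs=$NCORES 1000000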

Lack of parallelisation when using srun

Since Slurm version 22.05, job steps run with srun will not automatically inherit the --cpus-per-task-value that is requested by sbatch. This was done to make it easier to start multiple job steps with different CPU allocations within one job.

If you want to give all CPUs to srun you can either call srun in the script with srun --cpus-per-task=$SLURM_CPUS_PER_TASK or set:

export SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK

For more information see documentation pages for srun and sbatch.

Asking for multiple tasks when code does not use MPI

Normally you should not use --ntasks=n when you want to run shared memory codes. The number of tasks is only relevant to MPI codes, and by specifying it you might launch multiple copies of your program that all compete for the reserved CPUs.

Only hybrid parallelization codes should have both --ntasks=n and --cpus-per-task=C set to be greater than one.

Exercises

The scripts you need for the following exercises can be found in our hpc-examples, which we discussed in Using the cluster from a shell. You can clone the repository by running git clone https://github.com/AaltoSciComp/hpc-examples.git. Doing this creates a local copy of the repository in your current working directory. This repository will be used for most of the tutorial exercises.

Shared memory parallelism 1: Test the example’s scaling

Run the example with a bigger number of trials (100000000 or \(10^{8}\)) and with 1, 2 and 4 CPUs. Check the running time and CPU utilization for each run.

Shared memory parallelism 2: Test scaling for a program that has a serial part

pi.py can be called with an argument --serial=0.1 to run a fraction of the trials in a serial fashion (here, 10%).

Run the example with a bigger number of trials (100000000 or \(10^{8}\)), 4 CPUs and a varying serial fraction (0.1, 0.5, 0.8). Check the running time and CPU utilization for each run.

Shared memory parallelism 3: More parallel \(\neq\) fastest solution

pi.py can be called with an argument --optimized to run an optimized version of the code that utilizes NumPy for vectorized calculations.

Run the example with a bigger number of trials (100000000 or \(10^{8}\)) and with 4 CPUs. Now run the optimized example with the same amount of trials and with 1 CPU. Check the CPU utilization and running time for each run.

Shared memory parallelism 4: Your program

Think of your program. Do you think it can use shared-memory parallelism?

If you do not know, you can check the program’s documentation for words such as:

  • nprocs

  • nworkers

  • num_workers

  • njobs

  • OpenMP

These usually point towards some method of shared-memory parallel execution.

What’s next?

The next tutorial is about MPI parallelism.


MPI parallelism: multi-task programs

Videos

Videos of this topic may be available from one of our kickstart course playlists: 2023, 2022 Summer, 2022 February, 2021 Summer, 2021 February.

Abstract

  • Verify that your program can use MPI.

  • Compile to link with our MPI libraries. Remember to load the same modules in your Slurm script.

  • Use --nodes=1 and --ntasks=n to reserve \(n\) tasks for your job.

  • Start your application via srun if using module-installed MPI, or via mpirun if you have your own installation of MPI.

  • For spreading tasks evenly across nodes, use --nodes=N and --ntasks-per-node=n for getting \(N \cdot n\) tasks.

  • You must always monitor jobs to make sure they are using all the resources you request (seff JOBID).

  • If you aren’t fully sure of how to scale up, contact us, the Research Software Engineers, early.

Schematic of cluster with current discussion points highlighted; see caption or rest of lesson.

MPI parallelism lets you scale to many nodes on the cluster, at the cost of extra programming work.

What is MPI parallelism?

MPI, or Message Passing Interface, is a standard for communication between the many tasks that collectively run a program in parallel. Programs using MPI can scale up to thousands of nodes.

Programs using MPI need to be written so that they utilize the MPI communication. Thus typical programs that are not written around MPI cannot use MPI without major modifications.

_images/parallel-mpi.svg

MPI programs typically work in the following way:

  1. The same program is started as multiple separate tasks.

  2. All tasks join a communication layer with MPI.

  3. Each task gets its own rank (basically an ID number).

  4. Based on their rank, tasks execute their part of the code and communicate with the other tasks. Rank 0 is usually the “main program” that prints output for monitoring.

  5. After the program finishes, the communication layer is stopped.

When using module-installed MPI, the MPI tasks automatically get information on their ranks from Slurm via a library called PMIx. If some other MPI version is used, it might not connect with Slurm correctly.

Running a typical MPI program
Compiling a MPI program

For compiling/running an MPI job one has to pick one of the MPI library suites. There are various MPI libraries that all implement the MPI standard. We recommend that you use our OpenMPI installation (openmpi/4.1.5). For information on other installed versions, see the MPI applications page.

Some libraries/programs might already require a certain MPI version. If so, use that version, or ask the administrators to create a version of the library with a dependency on the MPI version you require.

Warning

Different versions of MPI are not compatible with each other. Code built with a certain version of MPI will run correctly only with that version. Thus if you compile code with a certain version, you will need to load the same version of the library when you run the code.

Also, the MPI libraries are usually linked against Slurm and the network drivers. Thus, when Slurm or driver versions are updated, some older versions of MPI might break. If you’re still using such versions, let us know. If you’re just starting a new project, it is recommended to use our recommended MPI libraries.

Reserving resources for MPI programs

For basic use of MPI programs, you will need the --nodes=1 and --ntasks=n options to specify the number of MPI workers. The --nodes=1 option is recommended so that your job runs on a single machine for maximum communication efficiency. You can also run without it, but this can result in worse performance.

In many cases you might require more tasks than one node has CPUs. When this is the case, it is recommended to split the number of workers evenly among the nodes. To do this, one can use --nodes=N and --ntasks-per-node=n. This would give you \(N \cdot n\) tasks in total.

Each task will get a default of 1 CPU. See section on hybrid parallelisation for information on whether you can give each task more than one CPU.
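
As an illustration, the relevant reservation lines for spreading tasks evenly across nodes might look like this (the numbers are made up):

#SBATCH --nodes=2            # N = 2 nodes
#SBATCH --ntasks-per-node=20 # n = 20 tasks per node, i.e. 40 MPI tasks in total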

Running an example MPI program

The scripts you need for the following exercises can be found in our hpc-examples, which we discussed in Using the cluster from a shell. You can clone the repository by running git clone https://github.com/AaltoSciComp/hpc-examples.git. Doing this creates a local copy of the repository in your current working directory. This repository will be used for most of the tutorial exercises.

For this example, let’s consider pi-mpi.c-example in the slurm-folder. It estimates pi with Monte Carlo methods and can utilize multiple MPI tasks for calculating the trials.

First off, we need to compile the program with a suitable OpenMPI version. Let’s use the recommended version openmpi/4.1.5:

$ module load openmpi/4.1.5

$ mpicc -o pi-mpi pi-mpi.c

The program can now be run with srun ./pi-mpi N, where N is the number of iterations to be done by the algorithm.

Let’s ask for resources and run the program with two processes using srun:

$ srun --nodes=1 --ntasks=2 --time=00:10:00 --mem=500M ./pi-mpi 1000000

This worked because we had the correct modules already loaded. With a Slurm script, setting the requirements and loading the correct modules becomes easier:

#!/bin/bash
#SBATCH --time=00:10:00
#SBATCH --mem=500M
#SBATCH --output=pi.out
#SBATCH --nodes=1
#SBATCH --ntasks=2

module load openmpi/4.1.5

srun ./pi-mpi 1000000

Let’s call this script pi-mpi.sh. You can submit it with:

$ sbatch pi-mpi.sh
Special cases and common pitfalls
MPI workers do not see each other

When using our installations of MPI, the MPI tasks automatically get information on their ranks from Slurm via a library called PMIx. If the MPI used is some other version, it might not connect with Slurm correctly.

If you have your own installation of MPI you might try setting export SLURM_MPI_TYPE=pmix_v2 in your job before calling srun. This will tell Slurm to use PMIx for connecting with the MPI installation.

Setting a constraint for a specific CPU architecture

The number of CPUs/tasks one can specify for a single parallel job usually depends on the underlying algorithm. In many codes, such as many finite-difference codes, the workers are set in a grid-like structure. The user of such codes can then choose the dimensions of the simulation grid, i.e. how many workers there are in the x-, y-, and z-dimensions.

For best performance one should reserve half or full nodes when possible. In heterogeneous clusters this can be a bit more complicated, as different CPUs can have different numbers of cores.

In Triton CPU partitions there are machines with 24, 28 and 40 CPUs. See the list of available nodes for more information.

However, one can make the reservations easier by specifying a CPU architecture with --constraint=ARCHITECTURE. This tells Slurm to look for nodes that satisfy a specific feature. To list available features, one can use slurm features.

For example, one could limit the code to the Haswell-architecture with the following script:

#!/bin/bash
#SBATCH --time=00:10:00      # takes 5 minutes all together
#SBATCH --mem-per-cpu=200M   # 200MB per process
#SBATCH --nodes=1            # 1 node
#SBATCH --ntasks-per-node=24 # 24 processes as that is the number in the machine
#SBATCH --constraint=hsw     # set constraint for processor architecture

module load openmpi/4.1.5  # NOTE: should be the same as you used to compile the code
srun ./pi-mpi 1000000
Monitoring performance

You can use seff JOBID to see what percent of available CPUs and RAM was utilized. Example output is given below:

$ seff 60985042
Job ID: 60985042
Cluster: triton
User/Group: tuomiss1/tuomiss1
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 2
CPU Utilized: 00:00:29
CPU Efficiency: 90.62% of 00:00:32 core-walltime
Job Wall-clock time: 00:00:16
Memory Utilized: 1.59 MB
Memory Efficiency: 0.08% of 2.00 GB

If your processor usage is far below 100%, your code may not be working correctly. If your memory usage is far below 100% or above 100%, you might have a problem with your RAM requirements. You should set the RAM limit to be a bit above the RAM that you have utilized.

You can also monitor individual job steps by calling seff with the syntax seff JOBID.JOBSTEP.

Important

When making job reservations it is important to distinguish between requirements for the whole job (such as --mem) and requirements for each individual task/cpu (such as --mem-per-cpu). E.g. requesting --mem-per-cpu=2G with --ntasks=2 and --cpus-per-task=4 will create a total memory reservation of (2 tasks)*(4 cpus / task)*(2GB / cpu)=16GB.

Hybrid parallelization aka. giving more than one CPU to each MPI task

When MPI and shared memory parallelism are done by the same application it is usually called hybrid parallelization. Programs that utilize this model can require both multiple tasks and multiple CPUs per task.

For example, CP2K compiled for the psmp target has hybrid parallelization enabled, while the popt target has only MPI parallelization. The best ratio between MPI tasks and CPUs per task depends on the program and needs to be measured.

Remember that the number of CPUs in a machine is hardware dependent. The total number of CPUs per node when you request --ntasks-per-node=n and --cpus-per-task=C is \(n \cdot C\). This number needs to be equal to or less than the total number of CPUs in the machine.
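
As a sketch, a hybrid reservation could look like the following (the 4 x 10 split and my_hybrid_program are hypothetical; the product must fit within one node):

#!/bin/bash
#SBATCH --time=00:10:00
#SBATCH --mem-per-cpu=500M
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4   # 4 MPI tasks
#SBATCH --cpus-per-task=10    # 10 CPUs per task: 4 * 10 = 40 CPUs in total

module load openmpi/4.1.5

# Give the allocated CPUs to srun and to OpenMP-style threading (a common hybrid pattern)
export SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

srun ./my_hybrid_program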

Exercises

The scripts you need for the following exercises can be found in our hpc-examples, which we discussed in Using the cluster from a shell. You can clone the repository by running git clone https://github.com/AaltoSciComp/hpc-examples.git. Doing this creates a local copy of the repository in your current working directory. This repository will be used for most of the tutorial exercises.

MPI parallelism 1: Explore and understand basic Slurm options

Run srun --cpus-per-task=4 hostname, srun --ntasks=4 hostname, and srun --nodes=4 hostname. What’s the difference and why?

MPI parallelism 2: Test the example with various options

Compile the example pi-mpi.c. Now try running it with a bigger number of trials (2000000000 or \(2 \cdot 10^{9}\)) and with the following Slurm options:

  1. --ntasks=4 without specifying --nodes=1.

  2. --ntasks-per-node=4

  3. --nodes=2 and --ntasks-per-node=2.

Check the CPU efficiency and running time. Do you see any difference in the output?

MPI parallelism 3: Your program

Think of your program. Do you think it can use MPI parallelism?

If you do not know, you can check the program’s documentation for words such as:

  • MPI

  • message-passing interface

  • mpirun

  • mpiexec

These usually point towards some method of MPI parallel execution.

What’s next?

The next tutorial is about GPU parallelism.

GPU computing

Videos

Videos of this topic may be available from one of our kickstart course playlists: 2023, 2022 Summer, 2022 February, 2021 Summer, 2021 February.

Abstract

  • Request a GPU with the Slurm option --gres=gpu:1 or --gpus=1 (some clusters need -p gpu or similar).

  • Select a certain type of GPU with e.g. --constraint='volta' (see the quick reference for names).

  • Monitor GPU performance with sacct -j JOBID -o comment -p.

  • For development, run jobs of 4 hours or less, and they can run quickly in the gpushort queue.

  • If you aren’t fully sure of how to scale up, contact us, the Research Software Engineers, early.

Schematic of cluster with current discussion points highlighted; see caption or rest of lesson.

GPU nodes allow specialized types of work to be done massively in parallel.

What are GPUs and how do they parallelise calculations?

GPUs, short for graphics processing units, are massively parallel processors that are optimized to perform numerical calculations in parallel. Due to this specialisation, GPUs can be substantially faster than CPUs when solving suitable problems.

GPUs are especially handy when dealing with matrices and vectors. This has allowed GPUs to become an indispensable tool in many research fields such as deep learning, where most of the calculations involve matrices.

The programs we normally write in common programming languages, e.g. C++, are executed by the CPU. To run part of such a program on a GPU, the program must do the following:

  1. Specify a piece of code called a kernel, which contains the GPU part of the program and is compiled for the specific GPU architecture in use.

  2. Transfer the data needed by the program from the RAM to GPU VRAM.

  3. Execute the kernel on the GPU.

  4. Transfer the results from GPU VRAM to RAM.

_images/parallel-gpu.svg

To help with this procedure special APIs (application programming interfaces) have been created. An example of such an API is CUDA toolkit, which is the native programming interface for NVIDIA GPUs.

On Triton, we have a large number of NVIDIA GPU cards from different generations and a single machine with AMD GPU cards. Triton GPUs are not the typical desktop GPUs, but specialized research-grade server GPUs with large memory, high bandwidth and specialized instructions. For scientific purposes, they generally outperform the best desktop GPUs.

See also

Please ensure you have read Interactive jobs and Serial Jobs before you proceed with this tutorial.

Running a typical GPU program
Reserving resources for GPU programs

Slurm keeps track of the GPU resources as generic resources (GRES) or trackable resources (TRES). They are basically limited resources that you can request in addition to normal resources such as CPUs and RAM.

To request GPUs on Slurm, use the --gres=gpu:1 or --gpus=1 flag.

You can also use syntax --gres=gpu:GPU_TYPE:1, where GPU_TYPE is a name chosen by the admins for the GPU. For example, --gres=gpu:v100:1 would give you a V100 card. See section on reserving specific GPU architectures for more information.

You can request more than one GPU with --gres=gpu:G, where G is the number of the requested GPUs.
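
For example, a two-GPU interactive run could look like this (my_multi_gpu_program is a hypothetical program that can actually use two GPUs; see the note below):

srun --time=00:10:00 --mem=500M --gres=gpu:2 ./my_multi_gpu_program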

Some GPUs are placed in a quick debugging queue. See section on reserving quick debugging resources for more information.

Note

Most GPU programs cannot utilize more than one GPU at a time. Before trying to reserve multiple GPUs you should verify that your code can utilize them.

Running an example program that utilizes GPU

The scripts you need for the following exercises can be found in our hpc-examples, which we discussed in Using the cluster from a shell. You can clone the repository by running git clone https://github.com/AaltoSciComp/hpc-examples.git. Doing this creates a local copy of the repository in your current working directory. This repository will be used for most of the tutorial exercises.

For this example, let’s consider pi-gpu.cu in the slurm-folder. It estimates pi with Monte Carlo methods and can utilize a GPU for calculating the trials.

This example is written in C++ and CUDA, so it needs to be compiled before it can be run.

To compile CUDA-based code for GPUs, let’s load a cuda-module and a newer compiler:

module load gcc/8.4.0 cuda

Now we should have a compiler and a CUDA toolkit loaded. After this we can compile the code with:

nvcc -arch=sm_60 -gencode=arch=compute_60,code=sm_60 -gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_80,code=sm_80 -o pi-gpu pi-gpu.cu

This monstrosity of a command is written like this because we want our code to be able to run on multiple different GPU architectures. For more information, see the section on setting compilation flags for GPU architectures.

Now we can run the program using srun:

srun --time=00:10:00 --mem=500M --gres=gpu:1 ./pi-gpu 1000000

This worked because we had the correct modules already loaded. With a Slurm script, setting the requirements and loading the correct modules becomes easier:

#!/bin/bash
#SBATCH --time=00:10:00
#SBATCH --mem=500M
#SBATCH --output=pi-gpu.out
#SBATCH --gres=gpu:1

module load gcc/8.4.0 cuda
./pi-gpu 1000000

Note

If you encounter problems with CUDA libraries, see the section on missing CUDA libraries.

Special cases and common pitfalls
Monitoring efficient use of GPUs

When running a GPU job, you should check that the GPU is being fully utilized.

When your job has started, you can ssh to the node and run nvidia-smi. The GPU utilization it reports should be close to 100%.

Once the job has finished, you can use slurm history to obtain the jobID and run:

$ sacct -j JOBID -o comment -p
{"gpu_util": 99.0, "gpu_mem_max": 1279.0, "gpu_power": 204.26, "ncpu": 1, "ngpu": 1}|

This also shows the GPU utilization.

If the GPU utilization of your job is low, you should check whether its CPU utilization is close to 100% with seff JOBID. Having a high CPU utilization and a low GPU utilization can indicate that the CPUs are trying to keep the GPU occupied with calculations, but the workload is too much for the CPUs and thus GPUs are not constantly working.

Increasing the number of CPUs you request can help, especially in tasks that involve data loading or preprocessing, but your program must know how to utilize the CPUs.

However, you shouldn’t request too many CPUs either: there wouldn’t be enough CPUs left for everyone to use the GPUs, and the GPUs would go to waste (all of our nodes have 4-12 CPUs for each GPU).

Reserving specific GPU types

You can restrict yourself to a certain type of GPU card by using the --constraint option. For example, to restrict the submission to Pascal generation GPUs only, you can use --constraint='pascal'.

For choosing between multiple generations, you can use the |-character between generations. For example, to restrict the submission to the Volta or Ampere generations, you can use --constraint='volta|ampere'. Remember to use the quotes, since | is the shell pipe.

To see what GPU resources are available, run slurm features or sinfo -o '%50N %18F %26f %30G'.

An alternative way is to use the syntax --gres=gpu:GPU_TYPE:1, where GPU_TYPE is a name chosen by the admins for the GPU. For example, --gres=gpu:v100:1 would give you a V100 card.

Reserving resources from the short job queue for quick debugging

There is a gpushort partition with a time limit of 4 hours that often has space (like with other partitions, this is automatically selected for short jobs). As of early 2022, it has four Tesla P100 cards in it (view with slurm partitions | grep gpushort). If you are doing testing and development and these GPUs meet your needs, you may be able to test much faster here. Use -p gpushort for this.

CUDA libraries not found

If you ever get libcuda.so.1: cannot open shared object file: No such file or directory, this means you are attempting to use a CUDA program on a node without a GPU. This especially happens if you try to test a GPU code on the login node.

Another problem might occur when a program tries to use pre-compiled kernels, but the corresponding CUDA toolkit is not available. This might happen if you have used a cuda-module to compile the code and it is not loaded when you try to run the code.

If you’re using Python, see the section on CUDA libraries and Python.

CUDA libraries and Python deep learning frameworks

When using a Python deep learning framework such as TensorFlow or PyTorch, you usually need to create a conda environment that contains both the framework and the CUDA toolkit that the framework needs.

We recommend that you either use our centrally installed module that contains both frameworks (more info here) or install your own environment using the instructions presented here. These instructions make certain that the installed framework has a corresponding CUDA toolkit available. See the application list for more details on specific frameworks.

Please note that pre-installed software either has CUDA already present or it loads the needed modules. Thus you do not need to explicitly load CUDA from the module system when loading these.

Setting CUDA architecture flags when compiling GPU codes

Many GPU codes come with precompiled kernels, but in some cases you might need to compile your own kernels. When this is the case you’ll want to give the compiler flags that make it possible to run the code on multiple different GPU architectures.

For GPUs in Triton these flags are:

-arch=sm_60 -gencode=arch=compute_60,code=sm_60 -gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_80,code=sm_80

Here the architecture numbers (compute_XX/sm_XX) 60, 70 and 80 correspond to the P100, V100 and A100 GPU cards, respectively.

For more information, you can check this excellent article or CUDA documentation on the subject.

Keeping GPUs occupied when doing deep learning

Many problems such as deep learning training are data-hungry. If you are loading large amounts of data you should make certain that the data loading is done in an efficient manner or the GPU will not be fully utilized.

All deep learning frameworks have their own guides on how to optimize the data loading, but they all are some variation of:

  1. Store your data in multiple big files.

  2. Create code that loads data from these big files.

  3. Run optional pre-processing functions on the data.

  4. Create a batch of data out of individual data samples.

Steps 2 and 3 are usually parallelized across multiple CPUs. Using pipelines such as these can dramatically speed up the training procedure.

If your data consists of individual files that are not too big, it is a good idea to pack the data into one file, which is then copied to the node’s ramdisk /dev/shm or temporary disk /tmp.
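
A minimal sketch of this staging pattern in a Slurm script (the dataset.tar archive, its location under $WRKDIR and train.py are hypothetical):

# Copy the packed dataset to the node-local ramdisk and unpack it there
cp $WRKDIR/data/dataset.tar /dev/shm/
tar xf /dev/shm/dataset.tar -C /dev/shm/

# Point your training code at the local copy
python train.py --data-dir=/dev/shm/dataset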

Avoiding small files is in general a good rule to follow. Please refer to the small files page for more detailed information.

If your data is too big to fit on the disk, we recommend that you contact us for efficient data handling models.

For more information on suggested data loading procedures for different frameworks, see Tensorflow’s and PyTorch’s guides on efficient data loading.

Profiling GPU usage with nvprof

When using NVIDIA’s GPUs you can try to use a profiling tool called nvprof to monitor what took most of the GPU’s time during the code’s execution.

Sample output might look something like this:

==30251== NVPROF is profiling process 30251, command: ./pi-gpu 1000000000
==30251== Profiling application: ./pi-gpu 1000000000
==30251== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   84.82%  11.442ms         1  11.442ms  11.442ms  11.442ms  throw_dart(curandStateXORWOW*, int*, unsigned long*)
                   14.70%  1.9833ms         1  1.9833ms  1.9833ms  1.9833ms  setup_rng(curandStateXORWOW*, unsigned long)
                    0.30%  40.704us         1  40.704us  40.704us  40.704us  [CUDA memcpy DtoH]
                    0.17%  23.328us         1  23.328us  23.328us  23.328us  [CUDA memcpy HtoD]
      API calls:   89.52%  122.81ms         3  40.936ms  3.6360us  122.70ms  cudaMalloc
                   10.05%  13.794ms         2  6.8969ms  68.246us  13.726ms  cudaMemcpy
                    0.20%  269.55us         3  89.851us  11.283us  130.45us  cudaFree
                    0.14%  196.08us       101  1.9410us     122ns  83.854us  cuDeviceGetAttribute
                    0.04%  57.228us         2  28.614us  6.3760us  50.852us  cudaLaunchKernel
                    0.02%  32.426us         1  32.426us  32.426us  32.426us  cuDeviceGetName
                    0.01%  13.677us         1  13.677us  13.677us  13.677us  cuDeviceGetPCIBusId
                    0.01%  10.998us         1  10.998us  10.998us  10.998us  cudaGetDevice
                    0.00%  2.3540us         1  2.3540us  2.3540us  2.3540us  cudaGetDeviceCount
                    0.00%  1.2690us         3     423ns     207ns     850ns  cuDeviceGetCount
                    0.00%     663ns         2     331ns     170ns     493ns  cuDeviceGet
                    0.00%     656ns         1     656ns     656ns     656ns  cuDeviceTotalMem
                    0.00%     396ns         1     396ns     396ns     396ns  cuModuleGetLoadingMode
                    0.00%     234ns         1     234ns     234ns     234ns  cuDeviceGetUuid

This output shows that most of the computing time was spent in the throw_dart kernel. It is important to note that in this example the memory allocation (cudaMalloc) and memory copying (cudaMemcpy) took more time than the actual computation. Memory operations are time consuming, and thus the best codes try to minimize them.

To see a chronological order of the different GPU operations, one can also run nvprof --print-gpu-trace. The output will look something like this:

==31050== NVPROF is profiling process 31050, command: ./pi-gpu 1000000000
==31050== Profiling application: ./pi-gpu 1000000000
==31050== Profiling result:
   Start  Duration            Grid Size      Block Size     Regs*    SSMem*    DSMem*      Size  Throughput  SrcMemType  DstMemType           Device   Context    Stream  Name
182.84ms  23.136us                    -               -         -         -         -  256.00KB  10.552GB/s    Pageable      Device  Tesla P100-PCIE         1         7  [CUDA memcpy HtoD]
182.89ms  1.9769ms            (512 1 1)       (128 1 1)        31        0B        0B         -           -           -           -  Tesla P100-PCIE         1         7  setup_rng(curandStateXORWOW*, unsigned long) [118]
184.87ms  11.450ms            (512 1 1)       (128 1 1)        19        0B        0B         -           -           -           -  Tesla P100-PCIE         1         7  throw_dart(curandStateXORWOW*, int*, unsigned long*) [119]
196.33ms  40.704us                    -               -         -         -         -  512.00KB  11.996GB/s      Device    Pageable  Tesla P100-PCIE         1         7  [CUDA memcpy DtoH]

Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
SrcMemType: The type of source memory accessed by memory operation/copy
DstMemType: The type of destination memory accessed by memory operation/copy

Here we see that the sample code did a memory copy to the device, ran kernel setup_rng, ran kernel throw_dart and did a memory copy back to the host memory.

For more information on nvprof, see NVIDIA’s documentation on it.

Available GPUs and architectures

Card                | Slurm feature (--constraint=) | Slurm gres (--gres=gpu:NAME:n) | Total | Nodes            | Architecture | Compute threads per GPU | Memory per card | CUDA compute capability
Tesla K80*          | kepler                        | teslak80                       | 12    | gpu[20-22]       | Kepler       | 2x2496                  | 2x12GB          | 3.7
Tesla P100          | pascal                        | teslap100                      | 20    | gpu[23-27]       | Pascal       | 3854                    | 16GB            | 6.0
Tesla V100          | volta                         | v100                           | 40    | gpu[1-10]        | Volta        | 5120                    | 32GB            | 7.0
Tesla V100          | volta                         | v100                           | 40    | gpu[28-37]       | Volta        | 5120                    | 32GB            | 7.0
Tesla V100          | volta                         | v100                           | 16    | dgx[1-7]         | Volta        | 5120                    | 16GB            | 7.0
Tesla A100          | ampere                        | a100                           | 56    | gpu[11-17,38-44] | Ampere       | 7936                    | 80GB            | 8.0
AMD MI100 (testing) | mi100                         | use -p gpu-amd only, no --gres | -     | gpuamd[1]        | -            | -                       | -               | -

Exercises

The scripts you need for the following exercises can be found in our hpc-examples, which we discussed in Using the cluster from a shell. You can clone the repository by running git clone https://github.com/AaltoSciComp/hpc-examples.git. Doing this creates a local copy of the repository in your current working directory. This repository will be used for most of the tutorial exercises.

GPU 1: Test nvidia-smi

Run nvidia-smi on a GPU node with srun. Use slurm history to check which GPU node you ended up on.

GPU 2: Running the example

Run the example given above with a larger number of trials (10000000000 or \(10^{10}\)).

Try using sbatch and Slurm script as well.

GPU 3: Run the script and do basic profiling with nvprof

nvprof is part of NVIDIA’s profiling tools and it can be used to monitor which parts of the GPU code use up most time.

Run the program as before, but add nvprof before it.

Try running the program with chronological trace mode (nvprof --print-gpu-trace) as well.

GPU 4: Your program

Think of your program. Do you think it can utilize GPUs?

If you do not know, you can check the program’s documentation for words such as:

  • GPU

  • CUDA

  • ROCm

  • OpenMP offloading

  • OpenACC

  • OpenCL

What’s next?

You have now seen the basics - but applying these in practice is still a difficult challenge! There is plenty to figure out while combining your own software, the Linux environment, and Slurm.

Your time is the most valuable thing you have. If you aren’t fully sure of how to use the tools, it is much better to ask than to struggle forever. Contact us, the Research Software Engineers, early - for example in our daily garage - and we can help you get set up well. Then, you can continue learning while your projects progress.

Job dependencies
Introduction

Job dependencies are a way to specify dependencies between jobs. The most common use is to launch a job only after a previous job has completed successfully. But other kinds of dependencies are also possible.

Basic example

Dependencies are specified with the --dependency=DEPENDENCY_LIST option. E.g. --dependency=afterok:123:124 means that the job can only start after jobs 123 and 124 have both completed successfully.

Automating job dependencies

A common problem with job dependencies is that you want job B to start only after job A finishes successfully. However, you cannot know the job ID of job A before it has been submitted. One solution is to capture the job ID of job A when submitting it, store it in a shell variable, and use the stored value when submitting job B. For example:

$ idA=$(sbatch jobA.sh | awk '{print $4}')
$ sbatch --dependency=afterok:${idA} jobB.sh
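
Another option is sbatch's --parsable flag, which prints only the job ID and makes it simpler to capture. A sketch of a three-job chain (jobA.sh, jobB.sh and jobC.sh are hypothetical scripts):

$ idA=$(sbatch --parsable jobA.sh)
$ idB=$(sbatch --parsable --dependency=afterok:${idA} jobB.sh)
$ sbatch --dependency=afterok:${idB} jobC.sh
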
Exercises

Dependencies-1: read the docs

Look at man sbatch and investigate the --dependency parameter.

Dependencies-2: Chain of jobs

Create a chain of jobs A -> B -> C each depending on the successful completion of the previous job. In each job run e.g. sleep 60 to give you time to investigate the status of the queue.

Dependencies-3: First job fails

Continuing from the previous exercise: what happens if at the end of the job A script you put exit 1? What does it mean?

Applications

See our general information and the full list below:

Applications: General info

See also

Intro tutorial: Applications (this is assumed knowledge for all software instructions)

When you need software, check the following for instructions (roughly in this order):

  • This page.

  • Search the SciComp site using the search function.

  • Check module spider and module avail to see if something is available but undocumented.

  • The issue tracker for other people who have asked - some instructions only live there.

If you have difficulty, it’s usually a good idea to search the issue tracker anyway, in order to learn from the experience of others.

Modules

See Software modules. Modules are the standard way of loading software.

Singularity

See Singularity Containers. Singularity provides software containers, which give you an operating system within an operating system. Software instructions will tell you if you need to use Singularity.

Software installation and policy

We want to support all software, but unfortunately time is limited. In the chart below, we have these categories (which don’t really mean anything, but in the future should help us be more transparent about what we are able to support):

  • A: Full support and documentation, should always work

  • B: We install and provide best-effort documentation, but may be out of date.

  • C: Basic info, no guarantees

If you know some application which is missing from this list but is widely in use (someone other than you is using it), it would make sense to install it to the /share/apps/ directory and create a module file. Send your request to the tracker. We want to support as much software as possible, but unfortunately we don’t have the resources to do everything centrally.

Software is generally easy to install if it is in Spack (check the package list page), a scientific software management and building system. If it has easy-to-install Ubuntu packages, it will be easy to do via Singularity.

Software documentation pages

Name   | Support level
Python | A

FHI-aims

FHI-aims (Fritz Haber Institute ab initio molecular simulations package) is an electronic structure theory code package for computational molecular and materials science. FHI-aims performs density functional theory and many-body perturbation calculations at the all-electron, full-potential level.

FHI-aims is licensed software with voluntary payment for an academic license. While the license grants access to the FHI-aims source code, each holder of a license can use the pre-built binaries available on Triton. To this end, contact Ville Havu at the PHYS department after obtaining the license.

On Triton the most recent version of FHI-aims is available via the modules FHI-aims/latest-intel-2020.0, which is compiled using Intel Parallel Studio, and FHI-aims/latest-OpenMPI-intel-2020.0-scalapack, which is compiled without any Intel parallel libraries since in rare cases they can result in spurious segfaults. The binaries are available in /share/apps/easybuild/software/FHI-aims/<module name>/bin as aims.YYMMDD.scalapack.mpi.x, where YYMMDD indicates the version stamp.

Notes:

  • module spider fhi will show various versions available.

  • The clean Intel version is fastest, but the OpenMPI module is more stable (info as of 2021-07).

  • FHI-aims is compiled without any Intel parallel libraries since in rare cases, like really big systems, they can result in spurious segfaults.

  • Search the Triton issue tracker for some more debugging about this.

Running FHI-aims on Triton

To run FHI-aims on Triton, the following example batch script can be used:

#!/bin/bash -l
#SBATCH --time=01:00:00
#SBATCH --constraint=avx     # FHI-aims build requires at least the AVX instruction set
#SBATCH --mem-per-cpu=2000M
#SBATCH --nodes=1
#SBATCH --ntasks=24

ulimit -s unlimited
export OMP_NUM_THREADS=1
module load FHI-aims/latest-intel-2020.0
srun aims.YYMMDD.scalapack.mpi.x
Armadillo
Support level: C

Armadillo (http://arma.sourceforge.net/) is a C++ linear algebra library that is needed to support some other software stacks. For the best performance, using MKL as the backend is advised.

The challenge is that the default installer does not find MKL in a non-standard location.

  1. module load mkl

  2. Edit “./build_aux/cmake/Modules/ARMA_FindMKL.cmake” and add MKL path to “PATHS”

  3. Edit “./build_aux/cmake/Modules/ARMA_FindMKL.cmake” and replace mkl_intel_thread with mkl_sequential (we do not want threaded libs on the cluster)

  4. Edit “include/armadillo_bits/config.hpp” and enable ARMA_64BIT_WORD

  5. cmake . && make

  6. make install DESTDIR=/share/apps/armadillo/<version>

Boost
Support level: C

Page last updated: 2014

Boost is a numerical library needed by some other packages. There is an rpm package of it in the default SL/RHEL repositories. In case the repository version is too old, a custom compilation is required.

To set it up, see the manual and follow the few simple steps to bootstrap and compile/install:

https://www.boost.org/doc/libs/1_56_0/more/getting_started/unix-variants.html

COMSOL Multiphysics

Hint

We are continuing the COMSOL focus days in our daily zoom garage in Spring 2024: someone from COMSOL (the company) plans to join our zoom garage at 13:00 on the following Tuesdays: 2024-01-23, 2024-02-27, 2024-03-26, 2024-04-23, 2024-05-28.

Hint

Join the other COMSOL users in our Zulip Chat: Stream “#triton”, topic “Comsol user group”.

To check which versions of Comsol are available, run:

module spider comsol

COMSOL on Triton is best run in batch mode, i.e. without the graphical user interface. Prepare your models on your workstation and bring the ready-to-run models to Triton. For detailed tutorials from COMSOL, see for example the COMSOL Knowledge Base articles Running COMSOL® in parallel on clusters and Running parametric sweeps, batch sweeps, and cluster sweeps from the command line. However, various settings must be edited in the graphical user interface.

Best practices of using COMSOL Graphical User Interface in Triton
  1. Connect to triton

    • Use Open OnDemand for the best experience for interactive work on triton.

    1. Connect to https://ood.triton.aalto.fi with your browser, log in. (It currently takes a while, please be patient.) Choose “My Interactive Sessions” from top bar, and then “Triton Desktop” from bottom. Launch your session, and once resources become available in triton, the session will be started on one of the interactive session nodes of triton. You can connect to a desktop in your browser with the “Launch Triton Desktop” button.

    2. Once you have connected, you can open a terminal (in XFCE the black rectangle in the bottom of the screen).

  • You can alternatively open a linux session in https://vdi.aalto.fi.

    1. Open a terminal, and connect with ssh to triton login node

    ssh -Y triton.aalto.fi
    

    However, if you use this terminal to start COMSOL, it will be running on the login node, which is a shared resource, and you should be careful not to use too much memory or CPUs.

  2. Start comsol

    1. First make sure you have graphical connection (should print something like “:1.0”)

      echo $DISPLAY
      
    2. then load the comsol module (version of your choice)

      module load comsol/6.1
      
    3. and finally start comsol

      comsol
      
Prerequisites for running COMSOL on Triton

There is a largish but limited pool of floating COMSOL licenses at Aalto University, so please be careful not to launch large numbers of COMSOL processes that each consume a separate license.

  • COMSOL uses a lot of temporary file storage, which by default goes to $HOME. Fix this with something like the following:

    $ rm -rf ~/.comsol/
    $ mkdir /scratch/work/$USER/comsol_recoveries/
    $ ln -sT /scratch/work/$USER/comsol_recoveries/ ~/.comsol
    
  • You may need to enable access to the whole filesystem in File|Options –> Preferences –> Security: File system access:All files

    Figure showing the comsol security preferences dialog box: File system access: All files is highlighted.
  • Enable the "Study -> Batch and Cluster" as well as the "Study -> Solver and Job Configurations" nodes in the "Show More Options" dialog box, which you can open by right-clicking the study in the Model Builder tree.

The cluster settings can be saved in the COMSOL settings, not in the model file. The correct settings are entered in File|Options –> Preferences –> Multicore and Cluster Computing. It is enough to choose Scheduler type: "SLURM".

Figure showing the cluster preferences dialog box: Scheduler type: Slurm is highlighted.

You can test by loading the "cluster_setup_validation" model from the Application Libraries. The model comes with a documentation PDF file, which you can open in the Application Libraries dialogue after selecting the model.

COMSOL requires MPICH2 compatible MPI libraries:

$ module purge
$ module load comsol/6.1 intel-parallel-studio/cluster.2020.0-intelmpi
A dictionary of COMSOL HPC lexicon

The knowledge base article Running COMSOL® in parallel on clusters explains the following terms that COMSOL uses:

COMSOL HPC lexicon

COMSOL | Slurm & MPI | Meaning
node   | task        | A process, software concept
host   | node        | A single computer
core   | cpu         | A single CPU core

However, COMSOL does not seem to be using the terms in a 100% consistent way. E.g. sometimes in the SLURM context COMSOL may use node in the SLURM meaning.

An example run in a single node

Use the parameters -clustersimple and -launcher slurm. Here is a sample batch-job:

#!/bin/bash

# Ask for e.g. 20 compute cores
#SBATCH --time=10:00:00
#SBATCH --mem-per-cpu=2G
#SBATCH --cpus-per-task=20

cd $WRKDIR/my_comsol_directory
module load Java
module load comsol/6.1
module load intel-parallel-studio/cluster.2020.0-intelmpi

# Details of your input and output files
INPUTFILE=input_model.mph
OUTPUTFILE=output_model.mph

comsol batch -clustersimple -launcher slurm -inputfile $INPUTFILE -outputfile $OUTPUTFILE -tmpdir $TMPDIR
Cluster sweep

If you have a parameter scan to perform, you can use the Cluster sweep node. The whole sweep only needs one license even if comsol launches multiple instances of itself.

First set up the cluster preferences, as described above.

Start by loading the correct modules in triton (COMSOL requires MPICH2 compatible MPI libraries). Then open the graphical user interface to comsol on the login node and open your model.

$ module purge
$ module load comsol/6.1 intel-parallel-studio/cluster.2020.0-intelmpi
$ comsol

Add a "Cluster Sweep" node to your study and a "Cluster Computing" node into your "Job Configurations" (you may need to first enable them in the "Show More Options" dialog). Check the various options. You can try solving a small test case from the graphical user interface. You should see COMSOL submitting jobs to the SLURM queue. You can download an example file.

For a larger run, COMSOL can then submit the jobs with comsol but without the GUI:

$ comsol batch -inputfile your_ready_to_run_model.mph -outputfile output_file.mph -study std1 -mode desktop

See also how to run a parametric sweep from command line?

Since the sweep may take some time to finish, please consider using tmux or screen to keep your session open.

Cluster computing controlled from your Windows workstation

The following example shows a working set of settings to use triton as a remote computation cluster for COMSOL.

Prerequisites:

  • Store your ssh keys in Pageant so that you can connect to Triton with PuTTY without entering the password.

  • Save / install putty executables locally, e.g. in Z:\putty:

    • plink.exe

    • pscp.exe

    • putty.exe

    Figure showing the comsol settings for Cluster Computing.

    In this configuration, sjjamsa is replaced with your username.

    Figure showing the comsol settings for Cluster Computing within the Job Configurations.
Deep learning software

This page has information on how to run deep learning frameworks on Triton GPUs.

Theano
Installation

The recommended way of installing theano is with an anaconda environment.

Detectron

Detectron uses Singularity containers, so you should refer to that page first for general information.

The Detectron image is based on a Dockerfile from Detectron’s repository. In this image Detectron has been installed to /detectron.

Usage

This example shows how you can launch Detectron on a GPU node. To run the example given in the Detectron repository, one can use the following Slurm script:

#!/bin/bash
#SBATCH --time=00:30:00
#SBATCH --mem=8G
#SBATCH --gres=gpu:teslap100:1
#SBATCH -o detectron.out

module load singularity-detectron

mkdir -p $WRKDIR/detectron/outputs

singularity_wrapper exec python2 /detectron/tools/infer_simple.py \
    --cfg /detectron/configs/12_2017_baselines/e2e_mask_rcnn_R-101-FPN_2x.yaml \
    --output-dir $WRKDIR/detectron/outputs \
    --image-ext jpg \
    --wts https://s3-us-west-2.amazonaws.com/detectron/35861858/12_2017_baselines/e2e_mask_rcnn_R-101-FPN_2x.yaml.02_32_51.SgT4y1cO/output/train/coco_2014_train:coco_2014_valminusminival/generalized_rcnn/model_final.pkl \
    /detectron/demo

Now the example can be run on a GPU node with:

sbatch detectron.sh

In typical usage, one does not want to download the models for each run. To use stored models, one needs to:

  1. Copy detectron sample configurations from the image to your own configuration folder:

    module load singularity-detectron
    mkdir -p $WRKDIR/detectron/
    singularity_wrapper exec cp -r /detectron/configs $WRKDIR/detectron/configs
    cd $WRKDIR/detectron
    
  2. Create data directory and download example models there:

    mkdir -p data/ImageNetPretrained/MSRA
    mkdir -p data/coco_2014_train:coco_2014_valminusminival/generalized_rcnn
    wget -O data/ImageNetPretrained/MSRA/R-101.pkl \
        https://s3-us-west-2.amazonaws.com/detectron/ImageNetPretrained/MSRA/R-101.pkl
    wget -O data/coco_2014_train:coco_2014_valminusminival/generalized_rcnn/model_final.pkl \
        https://s3-us-west-2.amazonaws.com/detectron/35861858/12_2017_baselines/e2e_mask_rcnn_R-101-FPN_2x.yaml.02_32_51.SgT4y1cO/output/train/coco_2014_train:coco_2014_valminusminival/generalized_rcnn/model_final.pkl
    
  3. Edit the WEIGHTS parameter in the configuration file 12_2017_baselines/e2e_mask_rcnn_R-101-FPN_2x.yaml so that it points to the downloaded weights:

    33c33
    <   WEIGHTS: $WRKDIR/detectron/data/ImageNetPretrained/MSRA/R-101.pkl
    ---
    >   WEIGHTS: https://s3-us-west-2.amazonaws.com/detectron/ImageNetPretrained/MSRA/R-101.pkl
    
  4. Edit the Slurm script to point to the downloaded weights and models:

#!/bin/bash
#SBATCH --time=00:30:00
#SBATCH --mem=8G
#SBATCH --gres=gpu:teslap100:1
#SBATCH -o detectron.out

module load singularity-detectron

mkdir -p $WRKDIR/detectron/outputs

singularity_wrapper exec python2 /detectron/tools/infer_simple.py \
    --cfg $WRKDIR/detectron/configs/12_2017_baselines/e2e_mask_rcnn_R-101-FPN_2x.yaml \
    --output-dir $WRKDIR/detectron/outputs \
    --image-ext jpg \
    --wts $WRKDIR/detectron/data/coco_2014_train:coco_2014_valminusminival/generalized_rcnn/model_final.pkl \
    /detectron/demo
  5. Submit the job:

    sbatch detectron.sh
    
Fenics

This uses Singularity containers, so you should refer to that page first for general information.

The Fenics images are based on these images.

Usage

This example shows how you can run a Fenics example. First, copy the examples from the image to a suitable folder:

mkdir -p $WRKDIR/fenics
cd $WRKDIR/fenics
module load singularity-fenics
singularity_wrapper exec cp -r /usr/local/share/dolfin/demo demo

The examples try to use interactive windows to plot the results. This is not available in the batch queue, so one needs to specify an alternative matplotlib backend. This patch file fixes the example demo_poisson.py. Download it into $WRKDIR/fenics and run

patch -d demo -p1 < fenics_matplotlib.patch

to fix the example. After this one can run the example with the following Slurm script:

#!/bin/bash
#SBATCH --time=00:15:00
#SBATCH --mem=1G
#SBATCH -o fenics_out.out

module purge
module load singularity-fenics

cd demo/documented/poisson/python/

srun singularity_wrapper run demo_poisson.py

To submit the script one only needs to run:

sbatch fenics.sh

The resulting image can be checked with, e.g.:

eog demo/documented/poisson/python/poisson.png
FMRIprep

December 2022: Note that the previous module we had installed (fmriprep 20.2.0) has been FLAGGED by the developers. Please specify a different version, for example with module load singularity-fmriprep/22.1.0.

module load singularity-fmriprep

fmriprep is installed as a Singularity container. By default it will always run the latest installed version. If you need a version that is not currently installed on Triton, please open an issue at https://version.aalto.fi/gitlab/AaltoScienceIT/triton/issues

Here is an example of running fmriprep for one subject in an interactive session, without FreeSurfer recon-all, using ICA-AROMA, and with co-registration to the 2 mm isotropic MNI template space (MNI152NLin6Asym, FSL bounding box of 91x109x91 voxels). The raw data in BIDS format are in the path <path-to-bids>. Create a folder for the derivatives that is different from the BIDS folder (<path-to-your-derivatives-folder>). Also create a temporary folder under your scratch/work folder for storing temporary files (<path-to-your-scratch-temporary-folder>), for example /scratch/work/USERNAME/tmp/. The content of this folder is removed after fmriprep has finished.

# Example running in an interactive session, this can be at maximum 24 hours
# You might want to use a tool such as "screen" or "tmux"
ssh triton.aalto.fi
# start screen or tmux
sinteractive --time=24:00:00 --mem=20G # you might need more or less memory or time depending on the size
module load singularity-fmriprep
singularity_wrapper exec fmriprep <path-to-bids> <path-to-your-derivatives-folder> -w <path-to-your-scratch-temporary-folder-for-this-participant> participant --participant-label 01 --output-spaces MNI152NLin6Asym:res-2 --use-aroma --fs-no-reconall --fs-license-file /scratch/shareddata/set1/freesurfer/license.txt

If you want to parallelize things, you can write a script that cycles through the subject labels and queues an sbatch job for each subject (it can be an array job or a series of serial jobs). It is important to tune your memory and time requirements before processing many subjects at once, and to create a dedicated temporary scratch folder for each subject.
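A minimal sketch of such an array job (the array range, subject labels, and resource requests are assumptions; the fmriprep options are copied from the interactive example above):

#!/bin/bash
#SBATCH --time=24:00:00
#SBATCH --mem=20G
#SBATCH --array=1-20
#SBATCH --output=fmriprep_%a.out

# zero-padded subject label derived from the array index, e.g. 01, 02, ...
SUBJECT=$(printf "%02d" $SLURM_ARRAY_TASK_ID)
# dedicated temporary scratch folder for this subject
TMPDIR_SUBJ=/scratch/work/$USER/tmp/sub-$SUBJECT
mkdir -p "$TMPDIR_SUBJ"

module load singularity-fmriprep
singularity_wrapper exec fmriprep <path-to-bids> <path-to-your-derivatives-folder> \
    -w "$TMPDIR_SUBJ" participant --participant-label $SUBJECT \
    --output-spaces MNI152NLin6Asym:res-2 --use-aroma --fs-no-reconall \
    --fs-license-file /scratch/shareddata/set1/freesurfer/license.txt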

POST-processing

Fmriprep does only minimal preprocessing. There is no smoothing, no temporal filtering, and in general you need to regress out the estimated confounds. They can be regressed out before further analysis (e.g. functional connectivity, intersubject correlation), or they can be included as part of a general linear model (it is always best to have them as close as possible to the model if this is what you are doing). If you plan to regress out the confounds without a general linear model, the simplest way is to decide which columns of the “confounds.tsv” matrix you want to use as confounds and use Nilearn’s clean_img: https://nilearn.github.io/dev/modules/generated/nilearn.image.clean_img.html

There are also dedicated post-processing tools. These are not installed in the Singularity image, so you need to experiment with them on your own.

Freesurfer
module load freesurfer

Follow the instructions to source the init script specific to your shell.
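For bash, this usually amounts to something like the following (the exact command is printed by the module; the path below follows the standard FreeSurfer convention and is an assumption about this installation):

module load freesurfer
source $FREESURFER_HOME/SetUpFreeSurfer.sh   # FREESURFER_HOME is assumed to be set by the module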

FSL
module load fsl

Follow the instructions to source the init script specific to your shell.
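For bash, this usually looks like the following (the path follows the standard FSL convention and is an assumption about this installation):

module load fsl
source $FSLDIR/etc/fslconf/fsl.sh   # FSLDIR is assumed to be set by the module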

GCC

GNU Compiler Collection (GCC) is one of the most popular compilers for compiling C, C++ and Fortran programs.

On Triton we have various GCC versions installed, but only some of them are actively supported.

Basic usage
Hello world in C

Let’s consider the following Hello world-program (hello.c) written in C.

#include <stdio.h>
int main()
{
   printf("Hello world.\n");
   return 0;
}

After downloading it to a folder, we can compile it with GCC.

First, let’s load up a GCC module:

module load gcc/8.4.0

Secondly, let’s compile the code:

gcc -o hello hello.c

Now we can run the program:

./hello

This outputs the expected Hello world-string.

Available installations

The system compiler is installed only on the login node. Other versions of GCC are installed as modules.

GCC version    Module name
4.8.5          none (on login node only)
8.4.0          gcc/8.4.0
9.3.0          gcc/9.3.0
11.2.0         gcc/11.2.0

If you need a different version of GCC, please send a request through the issue tracker.

Old installations

These installations will work, but they are not actively updated.

GCC version                    Module name
6.5.0                          gcc/6.5.0
9.2.0                          gcc/9.2.0
9.2.0 (with CUDA offloading)   gcc/9.2.0-cuda-nvptx

Other old installations are not recommended.

GPAW

There is a GPAW version installed as GPAW/1.0.0-goolf-triton-2016a-Python-2.7.11. It has been compiled with GCC, OpenBLAS and OpenMPI and it uses Python/2.7.11-goolf-triton-2016a as its base Python. You can load it with:

$ module load GPAW/1.0.0-goolf-triton-2016a-Python-2.7.11

You can create a virtual environment against the Python environment with:

$ export VENV=/path/to/env
$ virtualenv --system-site-packages $VENV
$ cd $VENV
$ source bin/activate
# test installation
$ python -c 'import gpaw; print gpaw'

GPAW site: https://wiki.fysik.dtu.dk/gpaw/

Gurobi Optimizer

Gurobi Optimizer is a commercial optimizing library.

License

Aalto University has a site-wide floating license for Gurobi.

Important notes

As of the writing of this guide, Aalto only has a valid license for Gurobi 9.X and older. Therefore Gurobi 10 cannot be run on Triton unless you bring your own license.

Gurobi with Python

Package names

Unfortunately, the Python Gurobi packages installed via pip and via conda come with two distinct package names: gurobi for the Anaconda package and gurobipy for the pip package. Normally, we install the gurobi package in the anaconda environment, but some anaconda modules have the gurobipy package instead. So you might need to select the correct package.
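A quick way to check which package name your loaded anaconda module actually provides (a sketch; one of the two imports is expected to fail):

module load anaconda
python -c "import gurobipy" 2>/dev/null && echo "gurobipy is available"
python -c "import gurobi"   2>/dev/null && echo "gurobi is available"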

License Files for older Anaconda modules

Older anaconda modules on Triton might not have the GRB_LICENSE_FILE environment variable set properly, so you might need to point to it manually. To do so, you need to create a gurobi.lic file in your home folder. The file should contain the following single line:

TOKENSERVER=lic-gurobi.aalto.fi

You can create this license file with the following command on the login node:

echo "TOKENSERVER=lic-gurobi.aalto.fi" > ~/gurobi.lic

The license is an Educational Institution Site License:

Free Academic License Requirements, Gurobi Academic Licenses: Can only be used by faculty, students, or staff of a recognized degree-granting academic institution. Can be used for: Research or educational purposes. Consulting projects with industry – provided that approval from Gurobi has been granted.

After setting the license, one can run, for example:

module load anaconda
python

And then run the following script

import gurobipy as gp
# Depending on your anaconda version you
# might need gurobi instead of gurobipy

# Create a new model
m = gp.Model()

# Create variables
x = m.addVar(vtype='B', name="x")
y = m.addVar(vtype='B', name="y")
z = m.addVar(vtype='B', name="z")

# Set objective function
m.setObjective(x + y + 2 * z, gp.GRB.MAXIMIZE)

# Add constraints
m.addConstr(x + 2 * y + 3 * z <= 4)
m.addConstr(x + y >= 1)

# Solve it!
m.optimize()

print(f"Optimal objective value: {m.objVal}")
print(f"Solution values: x={x.X}, y={y.X}, z={z.X}")
Gurobi with Julia

For Julia there exists a package called Gurobi.jl that provides an interface to Gurobi. This package needs Gurobi C libraries so that it can run. The easiest way of obtaining these libraries is to load the anaconda-module and use the same libraries that the Python API uses.

To install Gurobi.jl, one can use the following commands:

module load gurobi/9.5.2
module load julia
julia

After this, in the julia-shell, install Gurobi.jl with:

using Pkg
Pkg.add("Gurobi")
Pkg.build("Gurobi")

# Test installation
using Gurobi
Gurobi.Optimizer()

Before using the package, note the recommendations on Gurobi.jl’s GitHub page regarding the use of JuMP.jl and the reuse of environments.

Gurobi with any other language supported by gurobi

For other languages supported by Gurobi (like MATLAB, R, or C/C++), use

module load gurobi/9.5.2

to load Gurobi version 9.5.2 and then follow the instructions on the Gurobi web page. All environment variables necessary for Gurobi are already set, so you don’t need any further configuration.
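As a quick sanity check you can inspect what the module sets (the exact variable names depend on the module; GRB_LICENSE_FILE is the one discussed above):

module load gurobi/9.5.2
module show gurobi/9.5.2    # lists the environment variables set by the module
echo $GRB_LICENSE_FILE      # should point at the Aalto license configuration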

Intel Compilers

Intel provides their own compiler suite, which is popular in HPC settings. This suite contains compilers for C (icc), C++ (icpc) and Fortran (ifort).

Previously this suite required a paid license, but nowadays Intel provides it for free as part of their oneAPI program. This change has affected many module names.

On Triton we have various versions of the Intel compiler suite installed, but only some of them are actively supported.

Basic usage
Choosing a GCC for Intel compilers

The Intel compilers use many tools from the GCC suite, and thus it is recommended to load a gcc module alongside them:

module load gcc/8.4.0 intel-oneapi-compilers/2021.4.0

See GCC page for more information on available GCC compilers.

Hello world in C

Let’s consider the following Hello world-program (hello.c) written in C.

#include <stdio.h>
int main()
{
   printf("Hello world.\n");
   return 0;
}

After downloading it to a folder, we can compile it with Intel C compiler (icc).

First, let’s load up Intel compilers and a GCC module that icc will use in the background:

module load gcc/8.4.0 intel-oneapi-compilers/2021.4.0

Now let’s compile the code:

icc -o hello hello.c

Now we can run the program:

./hello

This outputs the expected Hello world-string.

Current installations

There are various Intel compiler versions installed as modules.

Intel compiler version   Module
2021.2.0                 intel-oneapi-compilers/2021.2.0
2021.3.0                 intel-oneapi-compilers/2021.3.0
2021.4.0                 intel-oneapi-compilers/2021.4.0

If you need a different version of these compilers, please send a request through the issue tracker.

Old installations

These installations will work, but they are not actively updated.

Intel compiler version   Module
2019.3 with Intel MPI    intel-parallel-studio/cluster.2019.3-intelmpi
2019.3                   intel-parallel-studio/cluster.2019.3
2020.0 with Intel MPI    intel-parallel-studio/cluster.2020.0-intelmpi
2020.0                   intel-parallel-studio/cluster.2020.0

Other old installations are not recommended.

Julia

The Julia programming language is a high-level, high-performance dynamic programming language for technical computing, in the same space as e.g. MATLAB, Scientific Python, or R. For more details, see their web page.

Interactive usage

Julia is available in the module system. By default the latest stable release is loaded:

module load julia
julia
Batch usage

Running Julia scripts as batch jobs is also possible. An example batch script is provided below:

#!/bin/bash
#SBATCH --time=00:01:00
#SBATCH --mem=1G

export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}
module load julia
srun julia juliascript.jl
Number of threads to use

By default, Julia uses up to 16 threads for linear algebra (BLAS) computations. In most cases, this number will be larger than the number of CPUs reserved for the job. Thus, when running Julia jobs it is a good idea to set the number of parallelization threads equal to the number of CPUs reserved for the job with --cpus-per-task. Otherwise, the performance of your program might be poor. This can be done by adding the following line to your Slurm script:

export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}

Alternatively, you can call BLAS.set_num_threads() (from the LinearAlgebra standard library) within Julia.

JupyterHub on Triton

Note

Quick link

Triton’s JupyterHub is available at https://jupyter.triton.aalto.fi.

Note

For new users

Are you new to Triton and want to access JupyterHub? Triton is a high-performance computing cluster, and JupyterHub is just one of our services - one of the easiest ways to get started. You still need a Triton account. This site has many instructions, but you should read at least:

If you want to use Triton more, you should finish the entire tutorials section.

[Video: Triton JupyterHub Demo]

Jupyter notebooks are a form of interactive, web-based computing: instead of either scripts or interactive shells, notebooks allow you to see a whole script plus its output and to experiment interactively and visually. They are good for developing and testing things, but once things work and you need to scale up, it is best to put your code into proper programs (more info). You must do this if you are going to do large parallel computing.

Triton’s JupyterHub is available at https://jupyter.triton.aalto.fi. You can also try notebooks online at try.jupyter.org (a temporary notebook with no saving).

You can always run notebooks yourself on your own (or remote) computers, but on Triton we have some facilities already set up to make it easier.

How Jupyter notebooks work
  • Start a notebook

  • Enter some code into a cell.

  • Run it with the buttons, or with Control-Enter or Shift-Enter.

  • Edit/create new cells, run again. Repeat indefinitely.

  • You have a visual history of what you have run, with code and results nicely interspersed. With certain languages such as Python, you can have plots and other things embedded, so that it becomes a complete reproducible story.

JupyterLab is the next iteration of this and has many more features, making it closer to an IDE or RStudio.

Notebooks are without a doubt a great tool. However, they are only one tool, and you need to know their limitations. See our other page on limitations of notebooks.

JupyterHub

Note

JupyterHub on Triton is still under development, and features will be added as they are needed or requested. Please use the Triton issue tracker.

The easiest way of using Jupyter is through JupyterHub - it is a multi-user jupyter server which takes a web-based login and spawns your own single-user server. This is available on Triton.

Connecting and starting

Currently JupyterHub is available only within Aalto networks, or from the rest of the internet after an initial Aalto login: https://jupyter.triton.aalto.fi.

Once you log in, you must start your single-user server. There are several options that trade off run time against available memory. Your server runs in the Slurm queue, so the first start-up takes a few seconds, but after that it will stay running even if you log out. The resources you request are managed by Slurm: if you go over the memory limit, your server will be killed without warning or notification (but you can see it in the output log, ~/jupyterhub_slurmspawner_*.log). The Jupyter server nodes are oversubscribed, which means that we can allocate more memory and CPU than is actually available. We monitor the nodes to try to ensure that there are enough resources available, so do report problems to us. Please request the minimum amount of memory you think you need - you can always restart with more memory. You can go over your memory request a little bit before you run into problems.

When you use Jupyter via this interface, the slurm billing weights are lower, so that the rest of your Triton priority does not decrease by as much.

Usage

Once you get to your single-user server, Jupyter is running as your own user on Triton. You begin in a convenience directory which has links to home, scratch, etc. You cannot create files in this directory (it is read-only), but you can navigate to the other folders to create your notebooks. You have access to all the Triton filesystems (except project/archive) and all normal software.

We have some basic extensions installed:

  • Jupyterlab (to use it, change /tree in the URL to /lab). Jupyterlab will eventually be made the default.

  • modules integration

  • jupyter_contrib_nbextensions - check out the variable inspector

  • diff and merge tools (currently not working)

The log files for your single-user servers can be found at ~/jupyterhub_slurmspawner_*.log. When a new server starts, log files older than one week are automatically cleaned up.

For reasons of web security, you can’t install your own extensions (but you can install your own kernels). Send your requests to us instead.

Problems? Requests?

This service is currently in beta and under active development. If you notice problems or would like any more extensions or features, let us know. If this is useful to you, please let us know your use case, too. In the current development stage, the threshold for feedback should be very low.

Currently, the service level is best effort. The service may go down at any time and/or notebooks may be killed whenever there is a shortage of resources or need of maintenance. However, notebooks auto-save and do survive service restarts, and we will try to avoid killing things unnecessarily.

Software and kernels

A Jupyter Kernel is the runtime which actually executes the code in the notebook (and it is separate from JupyterHub/Jupyter itself). We have various kernels automatically installed (these instructions should apply to both JupyterHub and sjupyter):

  • Python (2 and 3, via some recent anaconda modules, plus a few extra Python modules)

  • Matlab (latest module)

  • Bash kernel

  • R (a default R environment you can get with module load r-triton). “R (safe)” is similar but tries to block some local user configuration which sometimes breaks things; see the FAQ for more hints.

  • We do not yet have a kernel management policy. Kernels may be added or removed over time. We would like to keep them synced with the most common Triton modules, but it will take some time to get this automatic. Send requests and problem reports.

Since these are the normal Triton modules, you can submit installation requests for software in these so that it is automatically available.

Installing kernels from virtualenvs or Anaconda environments

If you want to use Jupyter with your own packages, you can do that. First, make a conda environment / virtual environment on Triton and install the software you need in it (see Anaconda and conda environments or Python: virtualenv). This environment can be used for other things, such as your own development outside of Jupyter.

You have to have the package ipykernel installed in the environment: Add it to your requirements/environment, or activate the environment and do pip install ipykernel.

Then, you need to make the environment visible inside of Jupyter. For conda environments, you can do:

$ module load jupyterhub/live
$ envkernel conda --user --name INTERNAL_NAME --display-name="My conda" /path/to/conda_env

Or for Python virtualenvs:

$ module load jupyterhub/live
$ envkernel virtualenv --user --name INTERNAL_NAME --display-name="My virtualenv" /path/to/virtualenv
Installing a different R module as a kernel

Load your R modules, install the R kernel normally (under some NAME), and then use envkernel as a wrapper to re-write the kernel (reading NAME and rewriting to the same NAME) so that it loads the modules you need:

## Load jupyterhub/live, and R 3.6.1 with IRkernel.
$ module load r-irkernel/1.1-python3
$ module load jupyterhub/live

## Use Rscript to install jupyter kernel
$ Rscript -e "library(IRkernel); IRkernel::installspec(name='NAME', displayname='R 3.6.1')"

## Use envkernel to re-write, loading the R modules.
$ envkernel conda  --user --kernel-template=NAME --name=NAME $CONDA_PREFIX
Installing a different R version as a kernel

There are two ways to install a different R version kernel for Jupyter. One relies on you building your own conda environment: the disadvantage is that you will need to create the kernel yourself, the advantage is that you can add additional packages. The other option is to use the existing R installations on Triton.

You will need to create your own conda environment with all packages that are necessary to deploy the environment as a kernel:

## Load miniconda before creating your environment - this provides mamba, which is used to create your environment
$ module load miniconda

Create your conda environment, selecting a NAME for the environment:

## This will use the latest R version on conda-forge. If you need a specific version you can specify it
## as r-essentials=X.X.X, where X.X.X is your required R version number
$ mamba create -n NAME -c conda-forge r-essentials r-irkernel
## If Mamba doesn't work you can also replace it with conda, but usually mamba is a lot faster

The next steps are the same as building a kernel above, except that you activate the environment instead of loading the r-irkernel module (since that module depends on the R version). The displayname is what will be displayed in Jupyter.

## Use Rscript to install jupyter kernel, you need the environment for this.
## You need the Python `jupyter` command so R can know the right place to
## install the kernel (provided by jupyterhub/live)
$ module load jupyterhub/live
$ source activate NAME
$ Rscript -e "library(IRkernel); IRkernel::installspec(name='NAME', displayname='YOUR R Version')"
$ conda deactivate

## For R versions before 4, you need to install the kernel. After version 4 IRkernel automatically installs it.
$ envkernel lmod --user --kernel-template=NAME --name=NAME

Note

Installing R packages for jupyter

Installing R packages from within Jupyter can be problematic, as installation may require interactivity, which Jupyter does not readily support. To install packages, therefore, go directly to Triton: load the environment or R module you use and install the packages interactively. After that is done, restart your Jupyter session and reload your kernel; all packages that you installed should then be available.
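A minimal sketch of this workflow for the default R module (the module and package names below are just examples):

# on Triton, in a normal shell (not in Jupyter)
module load r-triton          # or: source activate NAME for a conda-based R kernel
R
# ...then inside R, e.g.:  install.packages("data.table")
# quit R, then restart your Jupyter session and reload the kernel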

Install your own kernels from other Python modules

This works if the module provides the command python and ipykernel is installed. This has to be done once in any Triton shell:

$ module load jupyterhub/live
$ envkernel lmod --user --name INTERNAL_NAME --display-name="Python from my module" MODULE_NAME
$ module purge
Install your own kernels from Singularity image

First, find the .simg file name. If you are using this from one of the Triton modules, you can use module show MODULE_NAME and look for SING_IMAGE in the output.

Then, install a kernel for your own user using envkernel. This has to be done once in any Triton shell:

$ module load jupyterhub/live
$ envkernel singularity --user --name KERNEL_NAME --display-name="Singularity my kernel" SIMG_IMAGE
$ module purge

As with the above, the image has to provide a python command and have ipykernel installed (assuming you want to use Python, other kernels have different requirements).

Julia

Julia currently doesn’t seem to play nicely with global installations (so we can’t install the kernel for you; if anyone knows otherwise, let us know). Roughly, these steps should work to install the kernel yourself:

$ module load julia
$ module load jupyterhub/live
$ julia
julia> using Pkg
julia> Pkg.add("IJulia")

If this doesn’t work, it may think it is already installed. Force it with this:

julia> using IJulia
julia> installkernel("julia")
Install your own non-Python kernels
  • First, module load jupyterhub/live. This loads the anaconda environment which contains all the server code and configuration. (This step may not be needed for all kernels)

  • Follow the instructions you find for your kernel. You may need to specify --user or some such to have it install in your user directory.

  • You can check your own kernels in ~/.local/share/jupyter/kernels/.

If your kernel involves loading a module, you can either a) load the modules within the notebook server (“softwares” tab in the menu), or b) update your kernel.json to include the required environment variables (see kernelspec). (We need to do some work to figure out just how this works). Check /share/apps/jupyterhub/live/miniconda/share/jupyter/kernels/ir/kernel.json for an example of a kernel that loads a module first.

From Jupyter notebooks to running on the queue

While Jupyter is great for running code interactively, it becomes a problem if you need to run multiple parameter sets through a notebook, or if you need a specific resource which is not available for Jupyter. The latter might be because the resource is scarce enough that an open Jupyter session which has finished one part and is waiting for the user to start the next is idly blocking the resource. At this point you will likely want to move your code to pure Python and run it via the queue.

Here are the steps necessary to do so:

  1. Log into Triton via ssh ( Tutorials can be found here and here ).

  2. In the resulting terminal session, load the jupyterhub module to have jupyter available ( module load jupyterhub )

  3. Navigate to the folder where your jupyter notebooks are located. You can see the path by moving your mouse over the files tab on jupyterlab.

  4. Convert the notebook(s) you want to run on the cluster ( jupyter nbconvert yourScriptName.ipynb --to python).

    • If you need to run your code for multiple different parameters, modify the Python code to allow input parameter parsing (e.g. using argparse or docopt). You should include both input and output arguments, as you will want to save files to different result folders or give them indicative filenames. There are two main reasons for this approach: A) it makes your code more maintainable, since you don’t need to modify the code when changing parameters, and B) you are less likely to use the wrong version of your code (and thus get the wrong results).

  5. (Optional) Set up a conda environment. This is mainly necessary if your job requires multiple conda- or pip-installable packages which are not part of the normal anaconda module. First try module load anaconda. You cannot install into the anaconda environment provided by the anaconda module, and you should NOT use pip install --user as it will bite you later (and can cause difficult-to-debug problems). If you need to set up your own environment, follow this guide.

  6. Set up a Slurm batch script in a file, e.g. simple_python_gpu.sh. You can do this either with nano simple_python_gpu.sh (to save the file press Ctrl+x, type y to save the file, and press Enter to accept the file name), or you can mount the Triton file system and use your favorite editor (for guides on how to mount the file system, have a look here and here). Depending on your OS, it might be difficult to mount home, and it is in any case best practice to use /scratch/work/USERNAME for your code. Here is an example:

    #!/bin/bash
    #SBATCH --cpus-per-task 1       # The number of CPUs your code can use, if in doubt, use 1 for CPU only code or 6 if you run on GPUs (since code running on GPUs commonly allows parallelization of data provision to the GPU)
    #SBATCH --mem 10G               # The amount of memory you expect your code to need. Format is 10G for 10 Gigabyte, 500M for 500 Megabyte etc
    #SBATCH --time=01:00:00         # Time in HH:MM:SS or DD-HH for your job. The maximum is 120 hours, i.e. 5 days.
    #SBATCH --gres=gpu:1            # Additional specific resources can be requested via gres. Mainly used for requesting GPUs; the format is gres=ResourceType:Number
    module load anaconda            # or module load miniconda if you use your own environment.
    source activate yourEnvironment # if you use your own environment
    python yourScriptName.py ARG    
    

    This is a minimalistic example. If you have parameter sets that you want to use, have a look at array jobs (here).

  7. Submit your batch script to the queue: sbatch simple_python_gpu.sh. This call will print a message like “Submitted batch job <jobid>”. You can use e.g. slurm q to see your current jobs and their status in the queue, or monitor your jobs as described here.

Git integration

You can enable git integration on Triton by using the following lines from inside a git repository. (This is normal nbdime, but uses the centrally installed one so that you don’t have to load a particular conda environment first. The sed command fixes relative paths to absolute paths, so that you use the tools no matter what modules you have loaded):

$ /share/apps/jupyterhub/live/miniconda/bin/nbdime config-git --enable
$ sed --in-place -r 's@(= )[ a-z/-]*(git-nb)@\1/share/apps/jupyterhub/live/miniconda/bin/\2@' .git/config
FAQ/common problems
  • JupyterHub won’t spawn my server: “Error: HTTP 500: Internal Server Error (Spawner failed to start [status=1])”. Is your home directory quota exceeded? If that’s not it, check the ~/jupyterhub_slurmspawner_* logs and then contact us.

  • My server has died mysteriously. This may happen if resource usage exceeds the limits - Slurm will kill your notebook. You can check the ~/jupyterhub_slurmspawner_* log files to be sure.

  • My server seems inaccessible / I can’t get to the control panel to restart my server. Especially with JupyterLab. In JupyterLab, use File→Hub Control Panel. If you can’t get there, you can change the URL to /hub/home.

  • My R kernel keeps dying. Some people have global R configuration, in .bashrc or .Renviron or similar, which even affects the R kernel here. Things we have seen: pre-loading modules in .bashrc which conflict with the kernel R module; changing RLIBS in .Renviron. You can either (temporarily or permanently) remove these changes, or you could install your own R kernel. If you install your own, it is up to you to maintain it (and remember that you installed it).

  • “Spawner pending” when you try to start - this is hopefully fixed in issue #1534/#1533 in JupyterHub. Current recommendation: wait a bit and return to JupyterHub home page and see if the server has started. Don’t click the button twice!

See also

Our configuration is available on Github. Theoretically, all the pieces are here but it is not yet documented well and not yet generalizable. The Ansible role is a good start but the jupyterhub config and setup is hackish.

Keras
pagelastupdated: 2020-02-20

Keras is a neural network library which runs on tensorflow (among other things).

Basic usage

Keras is available in the anaconda module and some other anaconda modules. Run module spider anaconda to list available modules.

You probably want to learn how to run in the GPU queues. The other information in the tensorflow page also applies, especially the --constraint options to restrict to the GPUs that have new enough features.

Example
srun --gres=gpu:1 --pty bash
module load anaconda
python3
>>> import keras
Using TensorFlow backend.
>>> keras.__version__
'2.2.4'
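For non-interactive runs you can wrap the same thing in a Slurm script. A minimal sketch (the script name my_keras_script.py and the constraint value are placeholders/assumptions; see the tensorflow and GPU pages for the actual feature names):

#!/bin/bash
#SBATCH --time=01:00:00
#SBATCH --mem=8G
#SBATCH --gres=gpu:1
#SBATCH --constraint='pascal|volta'   # assumed GPU feature names, check the GPU documentation
#SBATCH --output=keras.out

module load anaconda
srun python3 my_keras_script.py       # placeholder script name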
LAMMPS
pagelastupdated: 2023-02-08

LAMMPS is a classical molecular dynamics simulation code with a focus on materials modeling.

Building a basic version of LAMMPS

LAMMPS is typically built based on the specific needs of the simulation. When building LAMMPS one can enable and disable various different packages that enable commands when LAMMPS is run.

LAMMPS has an extensive guide on how to build LAMMPS. The recommended way of building LAMMPS is with CMake.

Below are instructions on how to do a basic build of LAMMPS with OpenMP and MPI parallelizations enabled.

One can obtain the LAMMPS source code either from the LAMMPS download page or from the LAMMPS source repository. Here we’ll be using the version 23Jun2022.

# Obtain source code and go to the code folder
wget https://download.lammps.org/tars/lammps-23Jun2022.tar.gz
tar xf lammps-23Jun2022.tar.gz
cd lammps-23Jun2022

# Create a build folder and go to it
mkdir build
cd build

# Activate CMake and OpenMPI modules needed by LAMMPS
module load cmake gcc/11.3.0 openmpi/4.1.5

# Configure LAMMPS packages and set install folder
cmake ../cmake -D BUILD_MPI=yes -D BUILD_OMP=yes -D CMAKE_INSTALL_PREFIX=../../lammps-mpi-23Jun2022

# Build LAMMPS
make -j 2

# Install LAMMPS
make install

# Go back to starting folder
cd ../..

# Add installed LAMMPS to the executable search path
export PATH=$PATH:$PWD/lammps-mpi-23Jun2022/bin

Now we can verify that we have a working LAMMPS installation with the following command:

echo "info configuration" | srun lmp

The output should look something like this:

srun: job 11839786 queued and waiting for resources
srun: job 11839786 has been allocated resources
LAMMPS (23 Jun 2022 - Update 2)
OMP_NUM_THREADS environment is not set. Defaulting to 1 thread. (src/comm.cpp:98)
  using 1 OpenMP thread(s) per MPI task

Info-Info-Info-Info-Info-Info-Info-Info-Info-Info-Info
Printed on Thu Jan 19 17:20:21 2023

LAMMPS version: 23 Jun 2022 / 20220623

OS information: Linux "CentOS Linux 7 (Core)" 3.10.0-1160.71.1.el7.x86_64 x86_64

sizeof(smallint): 32-bit
sizeof(imageint): 32-bit
sizeof(tagint):   32-bit
sizeof(bigint):   64-bit

Compiler: GNU C++ 8.4.0 with OpenMP 4.5
C++ standard: C++11

Active compile time flags:

-DLAMMPS_GZIP
-DLAMMPS_PNG
-DLAMMPS_SMALLBIG

Available compression formats:

Extension: .gz     Command: gzip
Extension: .bz2    Command: bzip2
Extension: .xz     Command: xz
Extension: .lzma   Command: xz
Extension: .lz4    Command: lz4


Installed packages:



Info-Info-Info-Info-Info-Info-Info-Info-Info-Info-Info

Total wall time: 0:00:00
Building a version of LAMMPS with most packages

Many packages in LAMMPS need other external libraries such as BLAS and FFTW libraries. These extra libraries can be given to LAMMPS via flags mentioned in this documentation, but in most cases loading the appropriate modules from the module system is enough for CMake to find the libraries.

To include extra packages in the build one can either use flags mentioned in this documentation or one can use developer maintained CMake presets for installing a collection of packages.

Below is an example that installs LAMMPS with the “most packages” collection enabled:

# Obtain source code and go to the code folder
wget https://download.lammps.org/tars/lammps-23Jun2022.tar.gz
tar xf lammps-23Jun2022.tar.gz
cd lammps-23Jun2022

# Create a build folder and go to it
mkdir build
cd build

# Activate CMake and OpenMPI modules needed by LAMMPS
module load cmake gcc/11.3.0 openmpi/4.1.5 fftw/3.3.10 openblas/0.3.23 eigen/3.4.0 ffmpeg/6.0  voropp/0.4.6 zstd/1.5.5

# Configure LAMMPS packages and set install folder
cmake ../cmake -C ../cmake/presets/most.cmake -D BUILD_MPI=yes -D BUILD_OMP=yes -D CMAKE_INSTALL_PREFIX=../../lammps-mpi-most-23Jun2022

# Build LAMMPS
make -j 2

# Install LAMMPS
make install

# Go back to starting folder
cd ../..

# Add installed LAMMPS to the executable search path
export PATH=$PATH:$PWD/lammps-mpi-most-23Jun2022/bin

Now we can verify that we have a working LAMMPS installation with the following command:

echo "info configuration" | srun lmp

The output should look something like this:

srun: job 13235690 queued and waiting for resources
srun: job 13235690 has been allocated resources
LAMMPS (23 Jun 2022 - Update 2)
OMP_NUM_THREADS environment is not set. Defaulting to 1 thread. (src/comm.cpp:98)
  using 1 OpenMP thread(s) per MPI task

Info-Info-Info-Info-Info-Info-Info-Info-Info-Info-Info
Printed on Tue Feb 07 11:41:05 2023

LAMMPS version: 23 Jun 2022 / 20220623

OS information: Linux "CentOS Linux 7 (Core)" 3.10.0-1160.71.1.el7.x86_64 x86_64

sizeof(smallint): 32-bit
sizeof(imageint): 32-bit
sizeof(tagint):   32-bit
sizeof(bigint):   64-bit

Compiler: GNU C++ 8.4.0 with OpenMP 4.5
C++ standard: C++11

Active compile time flags:

-DLAMMPS_GZIP
-DLAMMPS_PNG
-DLAMMPS_FFMPEG
-DLAMMPS_SMALLBIG

Available compression formats:

Extension: .gz     Command: gzip
Extension: .bz2    Command: bzip2
Extension: .xz     Command: xz
Extension: .lzma   Command: xz
Extension: .lz4    Command: lz4


Installed packages:

ASPHERE BOCS BODY BPM BROWNIAN CG-DNA CG-SDK CLASS2 COLLOID COLVARS COMPRESS
CORESHELL DIELECTRIC DIFFRACTION DIPOLE DPD-BASIC DPD-MESO DPD-REACT
DPD-SMOOTH DRUDE EFF ELECTRODE EXTRA-COMPUTE EXTRA-DUMP EXTRA-FIX
EXTRA-MOLECULE EXTRA-PAIR FEP GRANULAR INTERLAYER KSPACE MACHDYN MANYBODY MC
MEAM MISC ML-IAP ML-SNAP MOFFF MOLECULE OPENMP OPT ORIENT PERI PHONON PLUGIN
POEMS QEQ REACTION REAXFF REPLICA RIGID SHOCK SPH SPIN SRD TALLY UEF VORONOI
YAFF

Info-Info-Info-Info-Info-Info-Info-Info-Info-Info-Info

Total wall time: 0:00:00
Examples
LAMMPS indent-example

Let’s run a simple example from LAMMPS examples. This specific model represents a spherical indenter into a 2D solid.

First, we need to get the example:

# Obtain source code and go to the code folder
wget https://download.lammps.org/tars/lammps-23Jun2022.tar.gz
tar xf lammps-23Jun2022.tar.gz
cd lammps-23Jun2022/examples/indent/

After this we can launch LAMMPS with a slurm script like this:

#!/bin/bash
#SBATCH --time=01:00:00
#SBATCH --mem=2G
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --output=lammps_indent.out


# Load modules used for building the LAMMPS binary
module load cmake gcc/11.3.0 openmpi/4.1.5 fftw/3.3.10 openblas/0.3.23 eigen/3.4.0 ffmpeg/6.0  voropp/0.4.6 zstd/1.5.5

# Set path to LAMMPS executable
export PATH=$PATH:$PWD/../../../lammps-mpi-most-23Jun2022/bin

# Run simulation
srun lmp < in.indent
LLMs

Large language models (LLMs) are AI models that can understand and generate text, primarily using transformer architectures. This page is about running them on a local HPC cluster. This requires extensive programming experience and knowledge of using the cluster (Tutorials), but allows maximum computational power for the least cost. Aalto RSE maintains these models and can help with using them, even for people who aren’t computational experts.

Because the models are typically very large and there are many people interested in them, we provide our users with pre-downloaded model weights and this page has instructions on how to load these weights for inference purposes or for retraining and fine-tuning the models.

Pre-downloaded model weights
Raw model weights

We have downloaded the following raw model weights (PyTorch model checkpoints):

Llama 2 raw weights (PyTorch checkpoints):

  • module load model-llama2/raw-data - raw weights of Llama 2.

  • module load model-llama2/7b - raw weights of the 7B parameter version of Llama 2.

  • module load model-llama2/7b-chat - raw weights of the 7B parameter chat-optimized version of Llama 2.

  • module load model-llama2/13b - raw weights of the 13B parameter version of Llama 2.

  • module load model-llama2/13b-chat - raw weights of the 13B parameter chat-optimized version of Llama 2.

  • module load model-llama2/70b - raw weights of the 70B parameter version of Llama 2.

  • module load model-llama2/70b-chat - raw weights of the 70B parameter chat-optimized version of Llama 2.

CodeLlama raw weights (PyTorch checkpoints):

  • module load model-codellama/raw-data - raw weights of CodeLlama.

  • module load model-codellama/7b - raw weights of the 7B parameter version of CodeLlama.

  • module load model-codellama/7b-python - raw weights of the 7B parameter version of CodeLlama, specifically designed for Python.

  • module load model-codellama/7b-instruct - raw weights of the 7B parameter version of CodeLlama, designed for instruction following.

  • module load model-codellama/13b - raw weights of the 13B parameter version of CodeLlama.

  • module load model-codellama/13b-python - raw weights of the 13B parameter version of CodeLlama, specifically designed for Python.

  • module load model-codellama/13b-instruct - raw weights of the 13B parameter version of CodeLlama, designed for instruction following.

  • module load model-codellama/34b - raw weights of the 34B parameter version of CodeLlama.

  • module load model-codellama/34b-python - raw weights of the 34B parameter version of CodeLlama, specifically designed for Python.

  • module load model-codellama/34b-instruct - raw weights of the 34B parameter version of CodeLlama, designed for instruction following.

Each module will set the following environment variables:

  • MODEL_ROOT - Folder where model weights are stored, i.e., PyTorch model checkpoint directory.

  • TOKENIZER_PATH - File path to the tokenizer.model.

Here is an example Slurm script using the raw weights to do batch inference. For details on setting up the environment, example prompts, and Python code, please check out this repo.

#!/bin/bash
#SBATCH --time=00:25:00
#SBATCH --cpus-per-task=4
#SBATCH --mem=20GB
#SBATCH --gres=gpu:1
#SBATCH --output=llama2inference-gpu.%J.out
#SBATCH --error=llama2inference-gpu.%J.err

# get the model weights
module load model-llama2/7b
echo $MODEL_ROOT
# Expect output: /scratch/shareddata/dldata/llama-2/llama-2-7b
echo $TOKENIZER_PATH
# Expect output: /scratch/shareddata/dldata/llama-2/tokenizer.model

# activate your conda environment
module load miniconda
source activate llama2env

# run batch inference
torchrun --nproc_per_node 1 batch_inference.py \
  --prompts prompts.json \
  --ckpt_dir $MODEL_ROOT \
  --tokenizer_path $TOKENIZER_PATH \
  --max_seq_len 512 --max_batch_size 16
Model weight conversions

Usually, models produced in research are stored as weights from PyTorch or other frameworks. For inference, we also provide models that are already converted to different formats.

Huggingface Models

The following Huggingface models are stored on Triton. The full list of all available models is located at /scratch/shareddata/dldata/huggingface-hub-cache/models.txt. Please contact us if you need any other models.

Model type                     Huggingface model identifier
Text Generation                mistralai/Mistral-7B-v0.1
Text Generation                mistralai/Mistral-7B-Instruct-v0.1
Text Generation                tiiuae/falcon-7b
Text Generation                tiiuae/falcon-7b-instruct
Text Generation                tiiuae/falcon-40b
Text Generation                tiiuae/falcon-40b-instruct
Text Generation                google/gemma-2b-it
Text Generation                google/gemma-7b
Text Generation                google/gemma-7b-it
Text Generation                LumiOpen/Poro-34B
Text Generation                meta-llama/Llama-2-7b-hf
Text Generation                meta-llama/Llama-2-13b-hf
Text Generation                meta-llama/Llama-2-70b-hf
Text Generation                codellama/CodeLlama-7b-hf
Text Generation                codellama/CodeLlama-13b-hf
Text Generation                codellama/CodeLlama-34b-hf
Translation                    Helsinki-NLP/opus-mt-en-fi
Translation                    Helsinki-NLP/opus-mt-fi-en
Translation                    t5-base
Fill Mask                      bert-base-uncased
Fill Mask                      bert-base-cased
Fill Mask                      distilbert-base-uncased
Text to Speech                 microsoft/speecht5_hifigan
Text to Speech                 facebook/hf-seamless-m4t-large
Automatic Speech Recognition   openai/whisper-large-v3
Token Classification           dslim/bert-base-NER-uncased

All Huggingface models can be loaded with module load model-huggingface/all. Here is a Python script using a Huggingface model.

## Force transformer to load model(s) from local hub instead of download and load model(s) from remote hub. NOTE: this must be run before importing transformers.
import os
os.environ['TRANSFORMERS_OFFLINE'] = '1'

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")

prompt = "How many stars in the space?"

model_inputs = tokenizer([prompt], return_tensors="pt")
input_length = model_inputs.input_ids.shape[1]

generated_ids = model.generate(**model_inputs, max_new_tokens=20)
print(tokenizer.batch_decode(generated_ids[:, input_length:], skip_special_tokens=True)[0])
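To run the snippet above non-interactively, a batch job might look roughly like this (the environment name, script name, and resource requests are assumptions; adapt them to your setup):

#!/bin/bash
#SBATCH --time=00:30:00
#SBATCH --mem=40G
#SBATCH --gres=gpu:1
#SBATCH --output=hf_inference.out

module load model-huggingface/all    # makes the pre-downloaded models available
module load miniconda
source activate my-transformers-env  # assumed environment with transformers + torch installed

srun python hf_inference.py          # the Python snippet above saved as hf_inference.py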
llama.cpp and GGUF

llama.cpp is a popular framework for running inference on LLM models with CPUs or GPUs. llama.cpp uses a format called GGUF as its storage format.

We have llama.cpp conversions of all Llama 2 and CodeLlama models with multiple quantization levels.

NOTE: Before loading the following modules, one must first load a module for the raw model weights. For example, run module load model-codellama/34b first, and then run module load codellama.cpp/q8_0-2023-12-04 to get the 8-bit integer version of CodeLlama weights in a .gguf file.

Llama 2 conversions (load a Llama 2 raw-weights module first):

  • f16-2023-08-28: module load model-llama.cpp/f16-2023-12-04 - half precision version of the Llama 2 weights, converted with llama.cpp on 4 Dec 2023.

  • q4_0-2023-08-28: module load model-llama.cpp/q4_0-2023-12-04 - 4-bit integer version of the Llama 2 weights, converted with llama.cpp on 4 Dec 2023.

  • q4_1-2023-08-28: module load model-llama.cpp/q4_1-2023-12-04 - 4-bit integer version of the Llama 2 weights, converted with llama.cpp on 4 Dec 2023.

  • q8_0-2023-08-28: module load model-llama.cpp/q8_0-2023-12-04 - 8-bit integer version of the Llama 2 weights, converted with llama.cpp on 4 Dec 2023.

CodeLlama conversions (load a CodeLlama raw-weights module first):

  • f16-2023-08-28: module load codellama.cpp/f16-2023-12-04 - half precision version of the CodeLlama weights, converted with llama.cpp on 4 Dec 2023.

  • q4_0-2023-08-28: module load codellama.cpp/q4_0-2023-12-04 - 4-bit integer version of the CodeLlama weights, converted with llama.cpp on 4 Dec 2023.

  • q8_0-2023-08-28: module load codellama.cpp/q8_0-2023-12-04 - 8-bit integer version of the CodeLlama weights, converted with llama.cpp on 4 Dec 2023.

Each module will set the following environment variables:

  • MODEL_ROOT - Folder where model weights are stored.

  • MODEL_WEIGHTS - Path to the model weights in GGUF file format.

This Python code snippet is part of a ‘Chat with Your PDF Documents’ example, utilizing LangChain and leveraging model weights stored in a .gguf file. For details on setting up the environment and the full Python code, please check out this repo.

import os
from langchain.llms import LlamaCpp

model_path = os.environ.get('MODEL_WEIGHTS')
llm = LlamaCpp(model_path=model_path, verbose=False)
More examples
Starting a local API

With the pre-downloaded model weights, you can also create an API endpoint locally. For detailed examples, you can check out this repo.

Using Mathematica on Triton
Load Mathematica

Mathematica is loaded through a module:

module load mathematica

See available versions with module avail mathematica.

You can test by running in text-based mode:

$ wolfram
With graphical user interface

To launch the graphical user interface (GUI), log in to triton.aalto.fi with -X, i.e. with X11 forwarding enabled.

ssh -X triton.aalto.fi

If you need to run computationally-intensive things with the GUI, use sinteractive to get an interactive shell on a node:

sinteractive --mem=10G --time=1:00

Either way, you start the GUI with mathematica:

$ mathematica &
Running batch scripts

Create a script file, say script.m. You can run this script and store the outputs in output.txt using:

math -noprompt -run '<<script.m' > output.txt

To put this in a batch script, simply look at the serial jobs tutorial. Here is one such example:

#!/bin/bash
#SBATCH --mem=5G
#SBATCH --time=2:00

module load mathematica
math -noprompt -run '<<script.m'
Common problems

Activation: If you need to activate Mathematica when you first run it, we recommend that you launch it in GUI mode first, choose “Other ways to activate”, then “Connect to a network license server”, and paste lic-mathematica.aalto.fi. It should then be automatically activated; if not, file an issue and link this page.

See also

Various other references also apply here once you load the module and adapt them to Slurm:

Admin notes

When installing new versions, put !lic-mathematica.aalto.fi into Configuration/Licensing/mathpass in the base directory.

Matlab

This page explains how to run Matlab jobs on Triton and introduces important details about Matlab on Triton. (Note: we used to have the Matlab Distributed Computing Server (MDCS), but because of low use we no longer have a license. You can still run in parallel on one node, with up to 40 cores.)

Important notes

Matlab writes session data, compiled code and additional toolboxes to ~/.matlab. This can quickly fill up your $HOME quota. To fix this, we recommend that you replace the folder with a symlink that points to a directory under your work directory.

rsync -lrt ~/.matlab/ $WRKDIR/matlab-config/ && rm -r ~/.matlab
ln -sT $WRKDIR/matlab-config ~/.matlab
quotafix -gs --fix $WRKDIR/matlab-config

If you run parallel code in Matlab, keep in mind that Matlab uses your home folder as storage for the worker files, so if you run multiple jobs you have to keep the worker folders separate. To address this, set the worker location (the JobStorageLocation field of the parallel cluster) to a location unique to the job:

% Initialize the parallel pool
c=parcluster();

% Create a temporary folder for the workers working on this job,
% in order not to conflict with other jobs.
t=tempname();
mkdir(t);

% set the worker storage location of the cluster
c.JobStorageLocation=t;

In addition, the number of parallel workers needs to be explicitly provided when initializing the parallel pool:

% get the number of workers based on the available CPUS from SLURM
num_workers = str2double(getenv('SLURM_CPUS_PER_TASK'));

% start the parallel pool
parpool(c,num_workers);

Here we provide a small script that does all those steps for you.

Interactive usage

Interactive usage is currently available via the sinteractive tool. Do not use the cluster front-end for this; connect to a node with sinteractive instead. The login node is only meant for submitting jobs and compiling. To run an interactive session with a user interface, run the following commands from a terminal.

ssh -X user@triton.aalto.fi
sinteractive
module load matlab
matlab &
Simple serial script

Running a simple Matlab job is easy through the slurm queue. A sample slurm script is provided below:

#!/bin/bash -l
#SBATCH --time=00:05:00
#SBATCH --mem=100M
#SBATCH -o serial_Matlab.out
module load matlab
n=3
m=2
srun matlab -nojvm -nosplash -r "serial_Matlab($n,$m) ; exit(0)"

The above script can then be saved as a file (e.g. matlab_test.sh) and the job can be submitted with sbatch matlab_test.sh. The actual calculation is done in serial_Matlab.m-file:

function C = serial_Matlab(n,m)
        try
                A=0:(n*m-1);
                A=reshape(A,[2,3]).'

                B=2:(n*m+1);
                B=reshape(B,[2,3]).'

                C=0.5*ones(n,n)
                C=A*(B.') + 2.0*C
        catch error
                disp(getReport(error))
                exit(1)
        end
end

Remember to always include exit in the Matlab command so that the program quits once the function serial_Matlab has finished. Using a try-catch statement will allow your job to finish in case of any error within the program. If you don’t do this, Matlab will drop into interactive mode and do nothing while your job wastes time.

NOTE: Starting from version r2019a the launch options -r ...; exit(0) can be easily replaced with the -batch option which automatically exits matlab at the end of the command that is passed (see here for details). So the last command from the slurm script above for Matlab r2019a will look like:

srun matlab -nojvm -nosplash -batch "serial_Matlab($n,$m);"
Running Matlab Array jobs

The most common way to utilize Matlab is to write a single .M-file that can be used to run tasks as a non-interactive batch job. These jobs are then submitted as independent tasks, and when the heavy part is done, the results are collected for analysis. For these kinds of jobs, Slurm array jobs are the best choice; for more information, see Array jobs in the Triton user guide.

Here is an example of testing multiple mutation rates for a genetic algorithm. First, the matlab code.

% set the mutation rate
mutationRate = str2double(getenv('SLURM_ARRAY_TASK_ID'))/100;
opts = optimoptions('ga','MutationFcn', {@mutationuniform, mutationRate});

% Set population size and end criteria
opts.PopulationSize = 100;
opts.MaxStallGenerations = 50;
opts.MaxGenerations = 200000;

%set the range for all genes
opts.InitialPopulationRange = [-20;20];

% define number of variables (genes)
numberOfVariables = 6;

[x,Fval,exitFlag,Output] = ga(@fitness,numberOfVariables,[],[],[], ...
    [],[],[],[],opts);

output = [4,-2,3.5,5,-11,-4.7] * x'

save(['MutationJob' getenv('SLURM_ARRAY_TASK_ID') '.mat'], 'output');

exit(0)

function fit = fitness(x)
    output = [4,-2,3.5,5,-11,-4.7] * x';
    fit = abs(output - 44);
end

We run this code (saved e.g. as serial.m, matching the -r serial call below) with the following Slurm script using sbatch:

#!/bin/bash 
#SBATCH --time=00:30:00
#SBATCH --array=1-100
#SBATCH --mem=500M
#SBATCH --output=r_array_%a.out

module load matlab

srun matlab -nodisplay -r serial

Collecting the results

Finally, a wrapper script reads in the .mat files and plots the resulting values:

function collectResults(maxMutationRate) 
   X=1:maxMutationRate
   Y=zeros(maxMutationRate,1);
   for index=1:maxMutationRate
      % read the output from the jobs
      filename = strcat( 'MutationJob', int2str( index ) );
      load( filename );

      Y(index)=output; 
   end 
   plot(X,Y,'b+:')
end
Seeding the random number generator

Note that by default MATLAB always initializes the random number generator with a constant value. Thus if you launch several Matlab instances, e.g. to calculate distinct ensembles, you need to seed the random number generator so that it is distinct for each instance. To do this, call the rng() function, passing it the value of SLURM_ARRAY_TASK_ID (e.g. rng(str2double(getenv('SLURM_ARRAY_TASK_ID')))).

Parallel Matlab with Matlab’s internal parallelization

Matlab has internal parallelization that can be activated by requesting more than one CPU per task in the Slurm script and using matlab_multithread to start the interpreter.

#!/bin/bash
#SBATCH --time=00:15:00
#SBATCH --mem=500M
#SBATCH --cpus-per-task=4
#SBATCH --output=ParallelOut

module load matlab

srun matlab_multithread -nodisplay -r parallel_fun

An example function is provided in this script

Parallel Matlab with parpool

Often one uses Matlab’s parallel pool for parallelization. When using parpool one needs to specify the number of workers; this number should match the number of CPUs requested. parpool uses the JVM, so when launching the interpreter one needs to use -nodisplay instead of -nojvm. Example:

Slurm script:

#!/bin/bash 
#SBATCH --time=00:15:00
#SBATCH --mem=500M
#SBATCH --cpus-per-task=4
#SBATCH --output=matlab_parallel.out

module load matlab

srun matlab_multithread -nodisplay -r parallel

An example script (saved here as parallel.m, matching the -r parallel flag above) is provided below:

initParPool()
% Create matrices to invert
mat = rand(1000,1000,6);

parfor i=1:size(mat,3)
    invMats(:,:,i) = inv(mat(:,:,i))
end
% And now, we proceed to build the averages of each set of inverted matrices
% each time leaving out one.

parfor i=1:size(invMats,3)
    usedelements = true(size(invMats,3),1)
    usedelements(i) = false
    res(:,:,i) = inv(mean(invMats(:,:,usedelements),3));
end
% end the program
exit(0)
Parallel matlab in exclusive mode
#!/bin/bash -l
#SBATCH --time=00:15:00
#SBATCH --exclusive
#SBATCH -o parallel_Matlab3.out

export OMP_NUM_THREADS=$(nproc)

module load matlab/r2017b
matlab_multithread -nosplash -r "parallel_Matlab3($OMP_NUM_THREADS) ; exit(0)"

parallel_Matlab3.m:

function parallel_Matlab3(n)
        % Try-catch expression that quits the Matlab session if your code crashes
        try
                % Initialize the parallel pool
                c=parcluster();
                % Ensure that workers don't overlap with other jobs on the cluster
                t=tempname()
                mkdir(t)
                c.JobStorageLocation=t;
                parpool(c,n);
                % The actual program calls from matlab's example.
                % The path for r2017b
                addpath(strcat(matlabroot, '/examples/distcomp/main'));
                % The path for r2016b
                % addpath(strcat(matlabroot, '/examples/distcomp'));
                pctdemo_aux_parforbench(10000,100,n);
        catch error
                getReport(error)
                disp('Error occurred');
                exit(0)
        end
end
FAQ / troubleshooting

If things randomly don’t work, you can try removing or moving either the ~/.matlab directory or the ~/.matlab/Rxxxxy directory to see if the problem is caused by your configuration.
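
For example (a sketch; Matlab recreates the directory with default settings on the next start):

# move the configuration aside instead of deleting it outright
mv ~/.matlab ~/.matlab.bak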

Random error messages about things not loading and/or something (Matlab Live Editor maybe) doesn’t work: ls *.m, do you have any unexpected files like pathdef.m in there? Remove them.

Also, check your home quota. Often .matlab gets large and fills up your home directory. Check the answer at the very top of the page, under “Matlab Configuration”.

MLPack
pagelastupdated: 2014
supportlevel: C

https://www.mlpack.org/

  1. module load cmake; module load armadillo/4.3-mkl; module load mkl

  2. mkdir build && cd build

  3. cmake -D ARMADILLO_LIBRARY=$ARMADILLO_LIBRARY -D ARMADILLO_INCLUDE_DIR=$ARMADILLO_INCLUDE ../

  4. make

  5. bin/mlpack_test

  6. make install CMAKE_INSTALL_PREFIX=/share/apps/mlpack/1.0.8

For newer boost library also load boost module and tell cmake where to find boost

module load boost
...
cmake -D BOOST_ROOT=$BOOST_ROOT -D ARMADILLO_LIBRARY=$ARMADILLO_LIBRARY -D ARMADILLO_INCLUDE_DIR=$ARMADILLO_INCLUDE ../
..
Notes
  • 1.0.10 installation failed when installing docs to /usr/local (install prefix defined as /share/apps/mlpack/1.0.10). The solution was to manually tune the install prefix in cmake_install.cmake.

MNE
pagelastupdated: 2018
maintainer:

module load mne

Follow the instructions to source the init script specific to your shell. In the directory:

$MNE_ROOT/..

you can find the release notes, the manual, and some sample data.

We do not recommend using the MNE command line tools; a more modern solution is to use the MNE-Python suite.

MPI

Message Passing Interface (MPI) is used in high-performance computing (HPC) clusters to facilitate big parallel jobs that utilize multiple compute nodes.

MPI and Slurm

For a tutorial on how to do Slurm reservations for MPI jobs, check out the MPI section of the parallel computing-tutorial.

Installed MPI versions

There are multiple MPI versions installed on the cluster, but due to updates to the underlying network and operating system some older ones might not be functional. Therefore we highly recommend using the tested versions listed below.

Each MPI version will use some underlying compiler by default. Please check here for information on how to change the underlying compiler.

MPI provider   MPI version   GCC compiler   Module name     Extra notes
────────────────────────────────────────────────────────────────────────────────────────
OpenMPI        4.1.5         gcc/11.3.0     openmpi/4.1.5
OpenMPI        4.0.5         gcc/8.4.0      openmpi/4.0.5   Known issues; not recommended for new compilations

Some libraries/programs might already require a certain MPI version. If so, use that version, or ask the administrators to create a version of the library that depends on the MPI version you require.

Warning

Different versions of MPI are not compatible with each other. Each version of MPI will create code that will run correctly with only that version of MPI. Thus if you create code with a certain version, you will need to load the same version of the library when you are running the code.

Also, the MPI libraries are usually linked against Slurm and the network drivers. Thus, when Slurm or driver versions are updated, some older versions of MPI might break. If you’re still using such versions, let us know. If you’re just starting a new project, use the recommended MPI libraries above.
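
If you are unsure which MPI library an existing binary was built against, one quick check (a sketch, not from the original instructions; hello_mpi is the binary compiled in the Usage section below) is:

module load openmpi/4.1.5
ldd ./hello_mpi | grep -i mpi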

Usage
Compiling and running an MPI Hello world-program

The following example uses example codes stored in the hpc-examples-repository. You can get the repository with the following command:

git clone https://github.com/AaltoSciComp/hpc-examples/

Loading module:

module load gcc/11.3.0      # GCC
module load openmpi/4.1.5  # OpenMPI

Compiling the code:

C code is compiled with mpicc:

cd hpc-examples/hello_mpi/
mpicc    -O2 -g hello_mpi.c -o hello_mpi

For testing one might be interested in running the program with srun:

srun --time=00:05:00 --mem-per-cpu=200M --ntasks=4 ./hello_mpi

For actual jobs this is obviously not recommended as any problem with the login node can crash the whole MPI job. Thus we’ll want to run the program with a slurm script:

#!/bin/bash
#SBATCH --time=00:05:00      # takes 5 minutes all together
#SBATCH --mem-per-cpu=200M   # 200MB per process
#SBATCH --ntasks=4           # 4 processes

module load openmpi/4.1.5  # NOTE: should be the same as you used to compile the code
srun ./hello_mpi

Important

It is important to use srun when you launch your program. This allows for the MPI libraries to obtain task placement information (nodes, number of tasks per node etc.) from the slurm queue.

Overwriting default compiler of an MPI installation

Typically one should use the compiler that the MPI installation has been compiled with. Thus if you encounter a situation where you would like to use a different compiler, it might be best to ask the administrators to install a different version of MPI with a different compiler.

However, sometimes one can try to overwrite the default compiler. This is obviously faster than installing a new MPI version. If you encounter problems after switching the compiler, you should not use it.

Changing the compiler when using OpenMPI

The procedure for changing compilers with OpenMPI is documented in OpenMPI’s FAQ. Environment variables such as OMPI_MPICC and OMPI_MPIFC can be set to overwrite the default compiler. See the article for a full list of environment variables.

For example, one could use the Intel compiler to compile the Hello world!-example by setting the OMPI_MPICC- and OMPI_MPIFC-environment variables.

The Intel C compiler is icc:

module load gcc/11.3.0
module load openmpi/4.1.5
module load intel-oneapi-compilers/2021.4.0

export OMPI_MPICC=icc  # Overwrite the C compiler

mpicc    -O2 -g hello_mpi.c -o hello_mpi
NVIDIA’s singularity containers
supportlevel: A
pagelastupdated: 2020-05-15
maintainer:

NVIDIA provides many different docker images containing scientific software through their NGC repository. This software is available for free for NVIDIA’s GPUs and one can register for free to get access to the images.

You can use these images as a starting point for your own GPU images, but do be mindful of NVIDIA’s terms and conditions. If you want to store your own images that are based on NGC images, either use NGC itself or our own Docker registry that is documented on the singularity containers page.

We have converted some of these images with minimal changes into singularity images that are available in Triton.

Currently updated images are:

  • nvidia-tensorflow: Contains tensorflow. Due to major changes that happened between Tensorflow v1 and v2, image versions have either tf1 or tf2 to designate the major version of Tensorflow.

  • nvidia-pytorch: Contains PyTorch.

There are various other images available that can be installed very quickly if required.

Running simple Tensorflow/Keras model with NVIDIA’s containers

Let’s run the MNIST example from Tensorflow’s tutorials:

model = tf.keras.models.Sequential([
  tf.keras.layers.Flatten(input_shape=(28, 28)),
  tf.keras.layers.Dense(512, activation=tf.nn.relu),
  tf.keras.layers.Dropout(0.2),
  tf.keras.layers.Dense(10, activation=tf.nn.softmax)
])

The full code for the example is in tensorflow_mnist.py. One can run this example with srun:

wget https://raw.githubusercontent.com/AaltoSciComp/scicomp-docs/master/triton/examples/tensorflow/tensorflow_mnist.py
module load nvidia-tensorflow/20.02-tf1-py3
srun --time=00:15:00 --gres=gpu:1 singularity_wrapper exec python tensorflow_mnist.py

or with sbatch by submitting tensorflow_singularity_mnist.sh:

#!/bin/bash
#SBATCH --gres=gpu:1
#SBATCH --time=00:15:00

module load nvidia-tensorflow/20.02-tf1-py3

singularity_wrapper exec python tensorflow_mnist.py

Do note that by default Keras downloads datasets to $HOME/.keras/datasets.

Running simple PyTorch model with NVIDIA’s containers

Let’s run the MNIST example from PyTorch’s tutorials:

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 20, 5, 1)
        self.conv2 = nn.Conv2d(20, 50, 5, 1)
        self.fc1 = nn.Linear(4*4*50, 500)
        self.fc2 = nn.Linear(500, 10)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = F.max_pool2d(x, 2, 2)
        x = F.relu(self.conv2(x))
        x = F.max_pool2d(x, 2, 2)
        x = x.view(-1, 4*4*50)
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return F.log_softmax(x, dim=1)

The full code for the example is in pytorch_mnist.py. One can run this example with srun:

wget https://raw.githubusercontent.com/AaltoSciComp/scicomp-docs/master/triton/examples/pytorch/pytorch_mnist.py
module load nvidia-pytorch/20.02-py3
srun --time=00:15:00 --gres=gpu:1 singularity_wrapper exec python pytorch_mnist.py

or with sbatch by submitting pytorch_singularity_mnist.sh:

#!/bin/bash
#SBATCH --gres=gpu:1
#SBATCH --time=00:15:00

module load nvidia-pytorch/20.02-py3

singularity_wrapper exec python pytorch_mnist.py

The Python script will download the MNIST dataset to the data folder.

Octave
From Octave’s web page:

GNU Octave is a high-level language, primarily intended for numerical computations. It provides a convenient command line interface for solving linear and nonlinear problems numerically, and for performing other numerical experiments using a language that is mostly compatible with Matlab. It may also be used as a batch-oriented language.

Octave has extensive tools for solving common numerical linear algebra problems, finding the roots of nonlinear equations, integrating ordinary functions, manipulating polynomials, and integrating ordinary differential and differential-algebraic equations. It is easily extensible and customizable via user-defined functions written in Octave’s own language, or using dynamically loaded modules written in C++, C, Fortran, or other languages.

Getting started

Simply load the latest version of Octave.

module load octave
octave

It is best to pick a version of octave and stick with it. Do module spider octave and use the whole name:

module load octave/4.4.1-qt-python2

To run octave with the GUI, run it with:

octave --force-gui
Installing packages

Before installing packages you should create a file ~/.octaverc with the following content:

package_dir = ['/scratch/work/',getenv('USER'),'/octave'];
eval (["pkg prefix ",package_dir, ";"]);
setenv("CXX","g++ -std=gnu++11")
setenv("DL_LD","g++ -std=gnu++11")
setenv("LD_CXX","g++ -std=gnu++11")
setenv("CC","gcc")
setenv("F77","gfortran")

This sets up /scratch/work/$USER/octave to be your Octave package directory and sets gcc to be your compiler. By setting Octave package directory to your work directory you won’t run into any quota issues.

After this you should load gcc- and texinfo-modules. This gives you an up-to-date compiler and tools that Octave uses for its documentation:

module load gcc
module load texinfo

Now you can install packages in octave with e.g.:

pkg install -forge -local io

After this you can unload the gcc- and texinfo-modules:

module unload gcc
module unload texinfo
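
After installation, a package still needs to be loaded in each Octave session before use. A minimal sketch (assuming the io package installed above):

pkg load io     % load the installed package
pkg list        % check which packages are installed and loaded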
OpenFoam

OpenFOAM is a popular open source CFD software package. There are two main forks of the same software available: one from openfoam.com and one from openfoam.org.

Various versions of both are installed on Triton.

OpenFOAM installations

Below is a list of installed OpenFOAM versions:

OpenFOAM provider   Version   Module name
─────────────────────────────────────────────────────────
openfoam.com        v1906     openfoam/1906-openmpi-metis
openfoam.org        9         openfoam-org/9-openmpi-metis
openfoam.org        8         openfoam-org/8-openmpi-metis
openfoam.org        7         openfoam-org/7-openmpi-metis

Running OpenFOAM

OpenFOAM installations are built using OpenMPI and thus one should reserve the resources following the MPI instructions.

When running the MPI enabled programs, one should launch them with srun. This enables SLURM to allocate the tasks correctly.

Some programs included in the OpenFOAM installation (such as blockMesh and decomposePar) do simulation initialization in a serial fashion and should be called without using srun.

Examples
Running damBreak example

One popular simple example is an example of a dam breaking in two dimensions. For more information on the example, see this article.

First, we need to take our own copy of the example:

module load openfoam-org/9-openmpi-metis
cp -r $FOAM_TUTORIALS/multiphase/interFoam/laminar/damBreak/damBreak .

Second, we need to write a Slurm script run_dambreak.sh:

#!/bin/bash -l
#SBATCH --time=00:05:00
#SBATCH --mem=4G
#SBATCH --ntasks=4
#SBATCH --output=damBreak.out

set -e

module load openfoam-org/9-openmpi-metis

cd damBreak

blockMesh
decomposePar

srun interFoam -parallel

After this we can submit the Slurm script to the queue with sbatch run_dambreak.sh. The program will run in the queue and we will get results in damBreak.out and in the simulation folder.

Do note that some programs (blockMesh, decomposePar) do not require multiple MPI tasks. Thus these are run without srun. By contrast, the program call that does the main simulation (interFoam -parallel) uses multiple MPI tasks and thus is called via srun.

OpenPose

This uses Singularity containers, so you should refer to that page first for general information.

OpenPose has been compiled against OpenBlas, Caffe, CUDA and cuDNN. The image is based on the nvidia/cuda:10.1-cudnn7-devel-ubuntu18.04 docker image.

Dockerfile for this image is available here.

Within the container OpenPose is installed under /opt/openpose. Due to the way the libraries are organized, singularity_wrapper changes the working directory to /opt/openpose.

Running OpenPose example

One can run this example by downloading the example script and submitting it with sbatch:

wget https://raw.githubusercontent.com/AaltoSciComp/scicomp-docs/master/triton/examples/openpose/openpose.sh
module load singularity-openpose
sbatch openpose.sh

Example sbatch script is shown below.

#!/bin/bash
#SBATCH --time=00:10:00
#SBATCH --mem=8G
#SBATCH --gres=gpu:1

module load singularity-openpose/v1.5.1

# Print out usage flags
singularity_wrapper exec openpose --help

# Run example
singularity_wrapper exec openpose --video /opt/openpose/examples/media/video.avi --display 0 --write_video $(pwd)/openpose.avi
ORCA

ORCA is a scientific software package that provides cutting-edge methods in the fields of density functional theory and correlated wave-function based methods.

Basic Usage

You can do a simple run with ORCA with the following script.

#!/bin/bash
#SBATCH --time=00:15:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --mem=2G
#SBATCH --output=orca_example.out

module load orca/4.2.1-openmpi

rm -f water*

cat > water.inp << EOF
!HF
!DEF2-SVP
!PAL4
* xyz 0 1
O   0.0000   0.0000   0.0626
H  -0.7920   0.0000  -0.4973
H   0.7920   0.0000  -0.4973
*
EOF

# Parallel runs need the full path to orca executable
# Do not use srun as orca will call mpi independently: https://www.orcasoftware.de/tutorials_orca/first_steps/trouble_install.html#using-openmpi
$(command -v orca) water.inp

This script performs a parallel run of ORCA to simulate the behavior of a water molecule. The input file for this simulation is called water.inp, which is written by the cat command above.

To run this script, download it and submit it to the queue using sbatch:

$ wget https://raw.githubusercontent.com/AaltoSciComp/scicomp-docs/master/triton/examples/orca/orca_example.sh
$ sbatch orca_example.sh
How to launch ORCA when using MPI parallelism

When doing parallel runs you should always launch ORCA with

$(command -v orca) input_file.inp

in your Slurm scripts. This is because ORCA will need the executable to be launched with the full path of the executable and it will launch the MPI tasks independently. For more information, see this documentation page.

Setting the number of MPI tasks

The example given above asked for 4 MPI tasks by setting #SBATCH --ntasks-per-node=4 in the Slurm batch script and then told ORCA to use 4 tasks by setting !PAL4 in the input file.

When asking for more than 8 tasks you need to use %PAL NPROCS 16 END to set the number of tasks in the ORCA input (here, the line would specify 16 tasks).
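
For example, a sketch of an input header asking for 16 tasks (the method keywords are just those from the example above):

!HF
!DEF2-SVP
%PAL NPROCS 16 END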

For more information please refer to ORCA’s documentation page on parallel calculations.

Paraview
As a module

A serial version is available on login2. You will need to use the “forward connection” strategy with ssh port forwarding. For example, run ssh -L BBBB:nnnNNN:AAAA username@triton, where BBBB is the local port you connect to on your own machine, nnnNNN is the node name, and AAAA is the port on that node. See this FAQ question.
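
As a concrete sketch with illustrative values (NODENAME is a placeholder for the compute node; 11111 is ParaView’s default server port):

# forward local port 11111 to port 11111 on compute node NODENAME
ssh -L 11111:NODENAME:11111 username@triton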

See issue #13: https://version.aalto.fi/gitlab/AaltoScienceIT/triton/issues/13 for some user experiences. (Note: the author of this entry is not a paraview expert, suggestions welcome.)

As a container

You can also use paraview via Singularity containers, so you should refer to that page first for general information. It is part of the OpenFoam container.

Python

Python is a widely used programming language, and all the basic packages are installed on every node. Yet, Python develops quite fast, and the system-provided packages are often incomplete or outdated.

Python distributions

  • “I don’t really care, I just want recent stuff and to not worry”:
    Anaconda (module load anaconda).

  • Simple programs with common packages, not switching between Pythons often:
    Anaconda (module load anaconda); install your own packages with pip install --user.

  • Your own conda environment:
    Miniconda (module load miniconda); manage packages with conda inside a conda environment.

  • Your own virtual environment:
    the virtualenv module (module load py-virtualenv); manage packages with virtualenv + pip + setuptools.

The main version of modern Python is 3. Support for old Python 2 ended at the end of 2019. There are also different distributions: The “regular” CPython, Anaconda (a package containing CPython + a lot of other scientific software all bundled together), PyPy (a just-in-time compiler, which can be much faster for some use cases). Triton supports all of these.

  • For general scientific/data science use, we suggest that you use Anaconda. It comes with the most common scientific software included, and is reasonably optimized.

  • There are many other “regular” CPython versions in the module system. These are compiled and optimized for Triton, and are highly recommended. The default system Python is old and won’t be updated.

Make sure your environments are reproducible - you can recreate them from scratch. History shows you will probably have to do this eventually, and it also ensures that others can always use your code. We recommend a minimal requirements.txt (pip) or environment.yml (conda), hand-created with the minimal dependencies in there.
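
As a sketch, such a hand-written requirements.txt can be as small as (the package names and version pins here are only illustrative):

numpy
pandas>=1.4
matplotlib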

Quickstart

Use module load anaconda to get our Python installation.

If you have simple needs, use pip install --user to install packages. For complex needs, use anaconda + conda environments to isolate your projects.

Install your own packages easily

Warning

pip install --user can result in incompatibilities

If you do this, then the module will be shared among all your projects. It is quite likely that eventually, you will get some incompatibilities between the Python you are using and the modules installed. In that case, you are on your own (simple recommendation is to remove all modules from ~/.local/lib/pythonN.N and reinstall). If you get incompatible module errors, our first recommendation will be to remove everything installed this way and use conda/virtual environments instead. It’s not a bad idea to do this when you switch to environments anyway.

If you encounter problems, remove all your user packages:

$ rm -r ~/.local/lib/python*.*/

and reinstall everything after loading the environment you want.

Installing your own packages with pip install won’t work, since it tries to install globally for all users. Instead, you should do this (add --user) to install the package in your home directory (~/.local/lib/pythonN.N/):

$ pip install --user $package_name

This is quick and effective, but it is best used for leaf packages without many dependencies, and only if you don’t switch Python modules often.

Note

Example of dangers of pip install --user

Someone did pip install --user tensorflow. Some time later, they noticed that they couldn’t use Tensorflow + GPUs. We couldn’t reproduce the problem, but in the end found they had this local install that was hiding any Tensorflow in any module (forcing a CPU version on them).

Note: pip installs from the Python Package Index.

Anaconda and conda environments

Anaconda is a Python distribution by Continuum Analytics (open source, of course). It is nothing fancy, they just take a lot of useful scientific packages and their dependencies and put them all together, make sure they work, and do some optimization. They also include most of the most common computing and data science packages and non-Python compiled software and libraries. It is also all open source, and is packaged nicely so that it can easily be installed on any major OS.

To load anaconda, use the module system (you can also load specific versions):

$ module load anaconda     # python3
$ module load anaconda2    # python2

Note

Before 2020, Python3 was via the anaconda3 module (note the 3 on the end). That’s still there, but in 2020 we completely revised our Anaconda installation system, and dropped active maintenance of Python 2. All updates are in anaconda only in the future.

Conda environments

See also

Watch a Research Software Hour episode on conda for an introduction + demo.

If you encounter a situation where you need to create your own environment, we recommend that you use conda environments. When you create your own environment the packages from the base environment (default environment installed by us) will not be used, but you can choose which packages you want to install.

We nowadays recommend that you use the miniconda-module for installing these environments. Miniconda is basically a minimal Anaconda installation that can be used to create your own environments.

By default conda tries to install packages into your home folder, which can result in running out of quota. To fix this, you should run the following commands once:

$ module load miniconda

$ mkdir $WRKDIR/.conda_pkgs
$ mkdir $WRKDIR/.conda_envs

$ conda config --append pkgs_dirs ~/.conda/pkgs
$ conda config --append envs_dirs ~/.conda/envs
$ conda config --prepend pkgs_dirs $WRKDIR/.conda_pkgs
$ conda config --prepend envs_dirs $WRKDIR/.conda_envs

virtualenv does not work with Anaconda, use conda instead.

  • Load the miniconda module. You should look up the version and load the same version each time you source the environment:

    ## Load miniconda first.  This must always be done before activating the env!
    $ module load miniconda
    
  • Create an environment. This needs to be done once:

    ## create environment with the packages you require
    $ conda create -n ENV_NAME python pip ipython tensorflow-gpu pandas ...
    
  • Activate the environment. This needs to be done every time you load the environment:

    ## This must be run in each shell to set up the environment variables properly.
    ## make sure module is loaded first.
    $ source activate ENV_NAME
    
  • Once the environment is activated, installing more packages can be done either using conda install or pip install:

    ## Install more packages, either conda or pip
    $ conda search PACKAGE_NAME
    $ conda install PACKAGE_NAME
    $ pip install PACKAGE_NAME
    
  • Leaving the environment when done (optional):

    ## Deactivate the environment
    $ source deactivate
    
  • To activate an environment from a Slurm script:

    #!/bin/bash
    #SBATCH --time=00:05:00
    #SBATCH --cpus-per-task=1
    #SBATCH --mem=1G
    
    source activate ENV_NAME
    
    srun echo "This step is ran inside the activated conda environment!"
    
    source deactivate
    
  • Worst case, you have incompatibility problems. Remove everything, including the stuff installed with pip install --user. If you’ve mixed your personal stuff in with this, then you will have to separate it out:

    ## Remove anything installed with pip install --user.
    $ rm -r ~/.local/lib/python*.*/
    

A few notes about conda environments:

  • Once you use a conda environment, everything goes into it. Don’t mix versions with, for example, local packages in your home dir and pip install --user. Things installed (even previously) with pip install --user will be visible in the conda environment and can make your life hard! Eventually you’ll get dependency problems.

  • The same often goes for other Python-based modules: we have set up many modules that use anaconda as a backend, so mixing them with your own environments only works if you know what you are doing.

conda init, conda activate, and source activate

We don’t recommend doing conda init like many sources recommend: this will permanently affect your .bashrc file and make hard-to-debug problems later. The main points of conda init are to a) automatically activate an environment (not good on a cluster: make it explicit so it can be more easily debugged) and b) make conda a shell function (not command) so that conda activate will work (source activate works as well in all cases, no confusion if others don’t.)

  • If you activate one environment from another, for example after loading an anaconda module, do source activate ENV_NAME like shown above (conda installation in the environment not needed).

  • If you make your own standalone conda environments, install the conda package in them, then…

  • Activate a standalone environment with conda installed in it by source PATH/TO/ENV_DIRECTORY/bin/activate (which incidentally activates just that one session for conda). A minimal sketch is shown after this list.
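
A minimal sketch of the last two points (the environment path is illustrative, assuming the environment was created under $WRKDIR/.conda_envs):

## Activate a standalone environment that has the conda package installed in it
$ source $WRKDIR/.conda_envs/myenv/bin/activate

## conda now refers to the environment's own conda
$ conda install PACKAGE_NAME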

Python: virtualenv

Virtualenv is the default Python way of making environments, but it does not work with Anaconda. We generally recommend using anaconda, since it includes a lot more stuff by default, but virtualenv works on other systems easily so it’s good to know about.

## Load module python
$ module load py-virtualenv

## Create environment
$ virtualenv DIR

## activate it (in each shell that uses it)
$ source DIR/bin/activate

## install more things (e.g. ipython, etc.)
$ pip install PACKAGE_NAME

## deactivate the virtualenv
$ deactivate
Anaconda/virtualenvironments in Jupyter

If you make a conda environment / virtual environment, you can use it from Triton’s JupyterHub (or your own Jupyter). See Installing kernels from virtualenvs or Anaconda environments.

IPython Parallel

ipyparallel is a tool for running embarrassingly parallel code using Python. The basic idea is that you have a controller and engines. You have a client process which is actually running your own code.

Preliminary notes: ipyparallel is installed in the anaconda{2,3}/latest modules.

Let’s say that you are doing some basic interactive work:

  • Controller: this can run on the frontend node, or you can put it in a script. To start: ipcontroller --ip="*"

  • Engines: srun -N4 ipengine: This runs four engines in Slurm interactively. You don’t need to interact with this once it is running, but remember to stop the process once you are done, because it is using resources. You can start/stop this as needed.

  • Start your Python process and use things like normal:

    import os
    import ipyparallel
    client = ipyparallel.Client()
    result = client[:].apply_async(os.getpid)
    pid_map = result.get_dict()
    print(pid_map)
    

This method lets you turn on/off the engines as needed. This isn’t the most advanced way to use ipyparallel, but works for interactive use.

See also: IPython parallel for a version which goes in a slurm script.

Background: pip vs python vs anaconda vs conda vs virtualenv

Virtual environments are self-contained Python environments with all of their own modules, separate from the system packages. They are great for research where you need to be agile and install whatever versions and packages you need. We highly recommend virtual environments or conda environments (below).

  • Anaconda: use conda, see below

  • Normal Python: virtualenv + pip install, see below

You often need to install your own packages. Python has its own package manager system that can do this for you. There are three important related concepts:

  • pip: the Python package installer. Installs Python packages globally, in a user’s directory (--user), or anywhere. Installs from the Python Package Index.

  • virtualenv: Creates a directory with its own self-contained packages, manageable by the user themselves. When the virtualenv is activated, all the operating-system global packages are no longer used. Instead, you install only the packages you want. This is important if you need to install specific versions of software, and it also provides isolation from the rest of the system (so that your work can be uninterrupted). It also allows different projects to have different versions of things installed. virtualenv isn’t magic, it could almost be seen as just manipulating PYTHONPATH, PATH, and the like. Docs: https://docs.python-guide.org/dev/virtualenvs/

  • conda: Sort of a combination of package manager and virtual environment. However, it only installs packages into environments, and is not limited to Python packages. It can also install other libraries (C, Fortran, etc.) into the environment. This is extremely useful for scientific computing, and the reason it was created. Docs for envs: https://conda.io/projects/conda/en/latest/user-guide/concepts/environments.html.

So, to install packages, there is pip and conda. To make virtual environments, there is venv and conda.

Advanced users can see this rosetta stone for reference.

On Triton we have added some packages on top of the Anaconda installation, so cloning the entire Anaconda environment into a local conda environment will not work (it is not a good idea in the first place, but some users try this every now and then).

Examples
Running Python with OpenMP parallelization

Various Python packages such as Numpy, Scipy and pandas can utilize OpenMP to run on multiple CPUs. As an example, let’s run the python script python_openmp.py that calculates multiplicative inverse of five symmetric matrices of size 2000x2000.

from time import time

import numpy as np

nrounds = 5

t_start = time()

for i in range(nrounds):
    a = np.random.random([2000,2000])
    a = a + a.T
    b = np.linalg.pinv(a)

t_delta = time() - t_start

print('Seconds taken to invert %d symmetric 2000x2000 matrices: %f' % (nrounds, t_delta))

The full code for the example is in HPC examples-repository. One can run this example with srun:

wget https://raw.githubusercontent.com/AaltoSciComp/hpc-examples/master/python/python_openmp/python_openmp.py
module load anaconda/2022-01
export OMP_PROC_BIND=true
srun --cpus-per-task=2 --mem=2G --time=00:15:00 python python_openmp.py

or with sbatch by submitting python_openmp.sh:

#!/bin/bash -l
#SBATCH --time=00:10:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=2
#SBATCH --mem-per-cpu=1G
#SBATCH -o python_openmp.out

module load anaconda/2022-01

export OMP_PROC_BIND=true

echo 'Running on: '$HOSTNAME

srun python python_openmp.py

Important

Python has a global interpreter lock (GIL), which forces some operations to be executed on only one thread; while these operations are occurring, other threads will be idle. These kinds of operations include reading files and doing print statements. Thus one should be extra careful with multithreaded code, as it is easy to create seemingly parallel code that does not actually utilize multiple CPUs.

There are ways to minimize effects of GIL on your Python code and if you’re creating your own multithreaded code, we recommend that you take this into account.
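
One common workaround (a sketch, not from the original page) is to use processes instead of threads for CPU-bound work, so that each worker process has its own interpreter and its own GIL:

from multiprocessing import Pool
import os

import numpy as np

def invert_random(_):
    # CPU-bound work: pseudo-invert a random symmetric matrix
    a = np.random.random((500, 500))
    return float(np.linalg.pinv(a + a.T).max())

if __name__ == '__main__':
    # one worker process per CPU reserved by Slurm (fall back to 2 outside Slurm)
    ncpus = int(os.environ.get('SLURM_CPUS_PER_TASK', 2))
    with Pool(processes=ncpus) as pool:
        print(pool.map(invert_random, range(4)))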

Running MPI parallelized Python with mpi4py

MPI-parallelized Python requires a valid MPI installation that supports our Slurm scheduler. Thus anaconda is not the best option. We have installed MPI-supporting Python versions in different toolchains.

Using mpi4py is quite easy. Example is provided below.

Python MPI4py

A simple script hello_mpi4py.py that utilizes mpi4py (do not name the file mpi4py.py, or Python would import the script itself instead of the mpi4py package):

#!/usr/bin/env python
"""
Parallel Hello World
"""
from mpi4py import MPI
import sys
size = MPI.COMM_WORLD.Get_size()
rank = MPI.COMM_WORLD.Get_rank()
name = MPI.Get_processor_name()
sys.stdout.write(
    "Hello, World! I am process %d of %d on %s.\n"
    % (rank, size, name))

Example sbatch script mpi4py.sh for running hello_mpi4py.py through sbatch:

#!/bin/bash
#SBATCH --time=00:10:00
#SBATCH --ntasks=4

module load Python/2.7.11-goolf-triton-2016b
mpiexec -n $SLURM_NTASKS python hello_mpi4py.py
Python Environments with Conda

Conda is a popular package manager that is especially popular in data science and machine learning communities.

It is commonly used to handle complex requirements of Python and R packages.

Quick usage guide
First time setup

You can get conda by loading the miniconda-module:

$ module load miniconda

By default Conda stores installed packages and environments in your home directory. However, as your home directory has a lower quota, it is a good idea to tell conda to install packages and environments into your work directory:

$ mkdir $WRKDIR/.conda_pkgs
$ mkdir $WRKDIR/.conda_envs

$ conda config --append pkgs_dirs ~/.conda/pkgs
$ conda config --append envs_dirs ~/.conda/envs
$ conda config --prepend pkgs_dirs $WRKDIR/.conda_pkgs
$ conda config --prepend envs_dirs $WRKDIR/.conda_envs

Now you’re all set up to create your first environment.

Creating a simple environment with conda

One can install environments from the command line itself, but a better idea is to write an environment.yml-file that describes the environment.

Below we have a simple environment.yml:

name: conda-example
channels:
  - conda-forge
dependencies:
  - numpy
  - pandas

Now we can use the conda-command to create the environment:

$ module load miniconda
$ conda env create --file environment.yml

Once the environment is installed, you can activate it with:

$ source activate conda-example

conda init, conda activate, and source activate

We don’t recommend doing conda init like many sources recommend: this will permanently affect your .bashrc file and make hard-to-debug problems later. The main points of conda init are to a) automatically activate an environment (not good on a cluster: make it explicit so it can be more easily debugged) and b) make conda a shell function (not command) so that conda activate will work (source activate works as well in all cases, no confusion if others don’t.)

  • If you activate one environment from another, for example after loading an anaconda module, do source activate ENV_NAME like shown above (conda installation in the environment not needed).

  • If you make your own standalone conda environments, install the conda package in them, then…

  • Activate a standalone environment with conda installed in it by source PATH/TO/ENV_DIRECTORY/bin/activate (which incidentally activates just that one session for conda).

Resetting conda

Sometimes it is necessary to reset your Conda configuration. Here are instructions on how to wipe all of your conda settings and existing environments. To be able to do so, first activate conda; on Triton, do this by loading the miniconda module:

$ module load miniconda

First, check where conda stores your environments:

$ conda config --show envs_dirs
$ conda config --show pkgs_dirs

Delete the directories that are listed and start with /home/USERNAME (this could e.g. be /home/<username>/.conda/envs) and /scratch/ (e.g. /scratch/work/USERNAME/conda_envs). You would delete these with rm -r DIRNAME, but be careful you use the right paths because there is no going back. This will clean up all packages and environments you have installed.

Next, clean up your .bashrc, .zshrc, .kshrc and .cshrc (whichever ones exist for you). Open these files in an editor (e.g. nano .bashrc), search for the line # >>> conda initialize >>>, and delete everything between it and the line # <<< conda initialize <<< (including both marker lines). These lines automatically initialize conda upon login, which can cause a lot of trouble on a cluster.

Finally delete the file .condarc from your home folder ( rm ~/.condarc) to reset your conda configuration. After this close the current connection to triton and reconnect in a new session.
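
A sketch of the whole cleanup, assuming the default locations used in these instructions (always double-check against the paths printed by conda config --show before removing anything):

# remove environments and package caches (verify the paths first!)
rm -r ~/.conda/envs ~/.conda/pkgs
rm -r $WRKDIR/.conda_envs $WRKDIR/.conda_pkgs
# remove the conda configuration file
rm ~/.condarc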

Now you should have a system that doesn’t have any remains of conda, so you can now follow the initial steps as detailed here.

Understanding the environment file

Conda environment files are written using YAML syntax. In an environment file one usually defines the following:

  • name: Name of the desired environment.

  • channels: Which channels to use for packages.

  • dependencies: Which conda and pip packages to install.

Choosing conda channels

When an environment file is used to create an environment, conda looks up the list of channels (in descending priority) and it will try to find the needed packages.

Some of the most popular channels are:

  • conda-forge: An open-source channel with over 18k packages. Highly recommended for new environments. Most packages in anaconda-modules come from here.

  • defaults: A channel maintained by Anaconda Inc.. Free for non-commercial use. Default for anaconda distribution.

  • r: A channel of R packages maintained by Anaconda Inc.. Free for non-commercial use.

  • bioconda: A community maintained channel of bioinformatics packages.

  • pytorch: Official channel for PyTorch, a popular machine learning framework.

One can have multiple channels defined like in the following example:

name: pytorch-env
channels:
  - nvidia
  - pytorch
  - conda-forge
dependencies:
  - pytorch
  - pytorch-cuda=12.1
  - torchvision
  - torchaudio
Setting package dependencies

Packages in environment.yml can have version constraints and version wildcards. One can also specify pip packages to install after conda-packages have been installed.

For example, the following dependency-env.yml would install numpy with a version greater than or equal to 1.10 using conda, and scipy via pip:

name: dependency-env
channels:
  - conda-forge
dependencies:
  - numpy>=1.10.*
  - pip
  - pip:
    - scipy
Listing packages in an environment

To list packages installed in an environment, one can use:

$ conda list
Removing an environment

To remove an environment, one can use:

$ conda env remove --name environment_name

Do remember to deactivate the environment before trying to remove it.

Cleaning up conda cache

Conda uses a cache for downloaded and installed packages. This cache can get large or it can be corrupted by failed downloads.

In these situations one can use conda clean to clean up the cache.

  • conda clean -i cleans up the index cache that conda uses to find the packages.

  • conda clean -t cleans up downloaded package installers.

  • conda clean -p cleans up unused packages.

  • conda clean -a cleans up all of the above.

Installing new packages into an environment

Installing new packages into an existing environment can be done with conda install-command. The following command would install matplotlib from conda-forge into an environment.

$ conda install --freeze-installed --channel conda-forge matplotlib

Installing packages into an existing environment can be risky: conda uses channels given from the command line when it determines which channels it should use for the new packages.

This can cause a situation where installing a new package results in the removal and reinstallation of multiple packages. Adding the --freeze-installed flag keeps already installed packages untouched, and by explicitly giving the channels to use, one can make certain that the new packages come from the same source.

It is usually a better option to create a new environment with the new package set as an additional dependency in the environment.yml. This keeps the environment reproducible.

If you intend to install packages into an existing environment, adding default channels for the environment can also make installing packages easier.

Setting default channels for an environment

It is a good idea to store channels used when creating the environment into a configuration file that is stored within the environment. This makes it easier to install any missing packages.

For example, one could add conda-forge into the list of default channels with:

$ conda config --env --add channels conda-forge

We can check the contents of the configuration file with:

$ cat $CONDA_PREFIX/.condarc
Doing everything faster with mamba

mamba is a drop-in replacement for conda that does environment building and solving much faster than conda.

To use it, you either need to install mamba-package from conda-forge-channel or use the miniconda-module.

If you have mamba, you can just switch from using conda-command to using mamba and it should work in the same way, but faster.

For example, one could create an environment with:

$ mamba env create --file environment.yml
Motivation for using conda
When should you use conda?

If you need basic Python packages, you can use pre-installed anaconda-modules. See the Python-page for more information.

You should use conda when you need to create your own custom environment.

Why use conda? What are its advantages?

Quite often Python packages are installed with Pip from the Python Package Index (PyPI). These packages contain Python code and in many cases some compiled code as well.

However, there are three problems pip cannot solve without additional tools:

  1. How do you install multiple separate suites of packages for different use cases?

  2. How do you handle packages that depend on some external libraries?

  3. How do you make sure that all of the packages are compatible with each other?

Conda tries to solve these problems in the following ways:

  1. Conda creates environments where packages are installed. Each environment can be activated separately.

  2. Conda installs library dependencies to the environment with the Python packages.

  3. Conda uses a solver engine to figure out whether packages are compatible with each other.

Conda also caches installed packages so doing copies of similar environments does not use additional space.

One can also use the environment files to make the installation procedure more reproducible.

Creating an environment with CUDA toolkit

NVIDIA’s CUDA-toolkit is needed for working with NVIDIA’s GPUs. Many Python frameworks that work on GPUs need to have a supported CUDA toolkit installed.

Conda is often used to provide the CUDA toolkit and additional libraries such as cuDNN. However, one should choose the version of the CUDA toolkit based on what the software requires.

If the package is installed from a conda channel such as conda-forge, conda will automatically retrieve the correct version of the CUDA toolkit.

In other cases one can use an environment file like this cuda-env.yml:

name: cuda-env
channels:
  - conda-forge
dependencies:
  - cudatoolkit

Hint

During installation conda will try to detect the maximum CUDA version that the installed graphics cards support, and it will install non-CUDA-enabled versions by default if no cards are found (as is the case on the login node, where environments are normally built). This can usually be overcome by explicitly requesting the CUDA-enabled packages. It might however happen that the environment creation process aborts with a message similar to:

nothing provides __cuda needed by tensorflow-2.9.1-cuda112py310he87a039_0

In this instance it might be necessary to override the CUDA settings used by conda/mamba. To do this, prefix your environment creation command with CONDA_OVERRIDE_CUDA=CUDAVERSION, where CUDAVERSION is the CUDA toolkit version you intend to use as in:

CONDA_OVERRIDE_CUDA="11.2" mamba env create -f cuda-env.yml

This will allow conda to assume that the respective CUDA libraries will be present at a later point and so it will skip those requirements during installation.

For more information, see this helpful post in Conda-Forge’s documentation.

Creating an environment with GPU enabled Tensorflow

To create an environment with GPU enabled Tensorflow you can use an environment file like this tensorflow-env.yml:

name: tensorflow-env
channels:
  - conda-forge
dependencies:
  - tensorflow=*=*cuda*

Here we install the latest tensorflow from conda-forge-channel with an additional requirement that the build version of the tensorflow-package must contain a reference to a CUDA toolkit. For a specific version replace the =*=*cuda* with e.g. =2.8.1=*cuda* for version 2.8.1.

If you encounter errors related to CUDA while creating the environment, do note this hint on overriding CUDA during installation.

Creating an environment with GPU enabled PyTorch

To create an environment with GPU enabled PyTorch you can use an environment file like this pytorch-env.yml:

name: pytorch-env
channels:
  - nvidia
  - pytorch
  - conda-forge
dependencies:
  - pytorch
  - pytorch-cuda=12.1
  - torchvision
  - torchaudio

Here we install the latest pytorch version from the pytorch-channel together with the pytorch-cuda-metapackage, which makes certain that a CUDA-enabled build of PyTorch (here, for CUDA 12.1) is installed. Additional packages required by pytorch are installed from the conda-forge-channel.

If you encounter errors related to CUDA while creating the environment, do note this hint on overriding CUDA during installation.

Installing numpy with Intel MKL enabled BLAS

NumPy and other mathematical libraries utilize a BLAS (Basic Linear Algebra Subprograms) implementation for speeding up many operations. Intel provides their own fast BLAS implementation in Intel MKL (Math Kernel Library). When using Intel CPUs, this library can give a significant performance boost to mathematical calculations.

One can install this library as the default BLAS by specifying blas * mkl as a requirement in the dependencies like in this mkl-env.yml:

name: mkl-env
channels:
  - conda-forge
dependencies:
  - blas * mkl
  - numpy
Advanced usage
Finding available packages

Because conda tries to make certain that all packages in an environment are compatible with each other, there are usually tens of different versions of a single package.

One can search for a package from a channel with the following command:

$ mamba search --channel conda-forge tensorflow

This will return a long list of packages where each line looks something like this:

tensorflow                     2.8.1 cuda112py39h01bd6f0_0  conda-forge

Here we have:

  • The package name (tensorflow).

  • Version of the package (2.8.1).

  • Package build version. This version often contains information on:

    • Python version needed by the package (py39 or Python 3.9).

    • Other libraries used by the package (cuda112 or CUDA 11.2).

  • Channel where the package comes from (conda-forge).

Checking package dependencies

One can check package dependencies by adding the --info-flag to the search command. This can give a lot of output, so it is a good idea to limit the search to one specific package:

$ mamba search --info --channel conda-forge tensorflow=2.8.1=cuda112py39h01bd6f0_0

The output looks something like this:

tensorflow 2.8.1 cuda112py39h01bd6f0_0
--------------------------------------
file name   : tensorflow-2.8.1-cuda112py39h01bd6f0_0.tar.bz2
name        : tensorflow
version     : 2.8.1
build       : cuda112py39h01bd6f0_0
build number: 0
size        : 26 KB
license     : Apache-2.0
subdir      : linux-64
url         : https://conda.anaconda.org/conda-forge/linux-64/tensorflow-2.8.1-cuda112py39h01bd6f0_0.tar.bz2
md5         : 35716504c8ce6f685ae66a1d9b084fc7
timestamp   : 2022-05-21 09:09:53 UTC
dependencies:
  - __cuda
  - python >=3.9,<3.10.0a0
  - python_abi 3.9.* *_cp39
  - tensorflow-base 2.8.1 cuda112py39he716a45_0
  - tensorflow-estimator 2.8.1 cuda112py39hd320b7a_0

Packages with underscores are meta-packages that should not be added to conda environment specifications. They will be solved by conda automatically.

Here we can see more info on the package, including its dependencies.

When using mamba, one can also use mamba repoquery depends to see the dependencies:

$ mamba repoquery depends --channel conda-forge tensorflow=2.8.1=cuda112py39h01bd6f0_0

Output looks something like this:

 Name                     Version Build                 Channel
─────────────────────────────────────────────────────────────────────────────
 tensorflow               2.8.1   cuda112py39h01bd6f0_0 conda-forge/linux-64
 __cuda >>> NOT FOUND <<<
 python                   3.9.9   h62f1059_0_cpython    conda-forge/linux-64
 python_abi               3.9     2_cp39                conda-forge/linux-64
 tensorflow-base          2.8.1   cuda112py39he716a45_0 conda-forge/linux-64
 tensorflow-estimator     2.8.1   cuda112py39hd320b7a_0 conda-forge/linux-64

One can also print the full dependency list with mamba repoquery depends --tree. This will produce a really long output.

$ mamba repoquery depends --tree --channel conda-forge tensorflow=2.8.1=cuda112py39h01bd6f0_0
Fixing conflicts between packages

Usually the first step in fixing conflicts between packages is to write a new environment file and list all required packages in the file as dependencies. A fresh solve of the environment often results in a working environment.

Sometimes there is a case where a single package does not have support for a specific version of Python or specific version of CUDA toolkit. In these cases it is usually beneficial to give more flexibility to the solver by limiting the number of specified versions.

One can also use the search commands provided by mamba to see what dependencies individual packages have.

PyTorch
pagelastupdated: 2022-08-08

PyTorch is a commonly used Python package for deep learning.

Basic usage

First, check the tutorials up to and including GPU computing.

If you plan on using NVIDIA’s containers to run your model, please check the page about NVIDIA’s singularity containers.

The basic way to use PyTorch is via the Python in the anaconda module. Don’t load any additional CUDA modules; anaconda includes everything needed.

Building your own environment with PyTorch

If you need a PyTorch version different from the one supplied with anaconda, we recommend installing your own conda environment as detailed here.

Creating an environment with GPU enabled PyTorch

To create an environment with GPU enabled PyTorch you can use an environment file like this pytorch-env.yml:

name: pytorch-env
channels:
  - nvidia
  - pytorch
  - conda-forge
dependencies:
  - pytorch
  - pytorch-cuda=12.1
  - torchvision
  - torchaudio

Here we install the latest pytorch version from the pytorch-channel together with the pytorch-cuda-metapackage, which makes certain that a CUDA-enabled build of PyTorch (here, for CUDA 12.1) is installed. Additional packages required by pytorch are installed from the conda-forge-channel.

Hint

During installation conda will try to detect the maximum CUDA version that the installed graphics cards support, and it will install non-CUDA-enabled versions by default if no cards are found (as is the case on the login node, where environments are normally built). This can usually be overcome by explicitly requesting the CUDA-enabled packages. It might however happen that the environment creation process aborts with a message similar to:

nothing provides __cuda needed by tensorflow-2.9.1-cuda112py310he87a039_0

In this instance it might be necessary to override the CUDA settings used by conda/mamba. To do this, prefix your environment creation command with CONDA_OVERRIDE_CUDA=CUDAVERSION, where CUDAVERSION is the CUDA toolkit version you intend to use as in:

CONDA_OVERRIDE_CUDA="11.2" mamba env create -f cuda-env.yml

This will allow conda to assume that the respective CUDA libraries will be present at a later point and so it will skip those requirements during installation.

For more information, see this helpful post in Conda-Forge’s documentation.

Examples:
Simple PyTorch model

Let’s run the MNIST example from PyTorch’s tutorials:

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 20, 5, 1)
        self.conv2 = nn.Conv2d(20, 50, 5, 1)
        self.fc1 = nn.Linear(4*4*50, 500)
        self.fc2 = nn.Linear(500, 10)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = F.max_pool2d(x, 2, 2)
        x = F.relu(self.conv2(x))
        x = F.max_pool2d(x, 2, 2)
        x = x.view(-1, 4*4*50)
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return F.log_softmax(x, dim=1)

The full code for the example is in pytorch_mnist.py. One can run this example with srun:

$ wget https://raw.githubusercontent.com/AaltoSciComp/scicomp-docs/master/triton/examples/pytorch/pytorch_mnist.py
$ module load anaconda
$ srun --time=00:15:00 --gres=gpu:1 python pytorch_mnist.py

or with sbatch by submitting pytorch_mnist.sh:

#!/bin/bash
#SBATCH --gres=gpu:1
#SBATCH --time=00:15:00

module load anaconda

python pytorch_mnist.py

The Python script will download the MNIST dataset to the data folder.

Running simple PyTorch model with NVIDIA’s containers

Let’s run the MNIST example from PyTorch’s tutorials:

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 20, 5, 1)
        self.conv2 = nn.Conv2d(20, 50, 5, 1)
        self.fc1 = nn.Linear(4*4*50, 500)
        self.fc2 = nn.Linear(500, 10)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = F.max_pool2d(x, 2, 2)
        x = F.relu(self.conv2(x))
        x = F.max_pool2d(x, 2, 2)
        x = x.view(-1, 4*4*50)
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return F.log_softmax(x, dim=1)

The full code for the example is in pytorch_mnist.py. One can run this example with srun:

wget https://raw.githubusercontent.com/AaltoSciComp/scicomp-docs/master/triton/examples/pytorch/pytorch_mnist.py
module load nvidia-pytorch/20.02-py3
srun --time=00:15:00 --gres=gpu:1 singularity_wrapper exec python pytorch_mnist.py

or with sbatch by submitting pytorch_singularity_mnist.sh:

#!/bin/bash
#SBATCH --gres=gpu:1
#SBATCH --time=00:15:00

module load nvidia-pytorch/20.02-py3

singularity_wrapper exec python pytorch_mnist.py

The Python script will download the MNIST dataset to the data folder.

R

R is a language and environment for statistical computing and graphics with a wide user base. There exist several packages that can easily be imported into R.

Getting started

Simply load the latest R.

module load r
R

As any packages you install against R are specific to the version you installed them with, it is best to pick a version of R and stick with it. You can do this by checking the R version with module spider r and using the whole name when loading the module:

module load r/3.6.1-python3

If you want to detect the number of cores, you should use the proper Slurm environment variables (defaulting to all cores):

library(parallel)
as.integer(Sys.getenv('SLURM_CPUS_PER_TASK', parallel::detectCores()))
Installing packages

There are two ways to install packages.

  1. You can usually install packages yourself, which allows you to keep up to date and reinstall as needed. Good instructions can be found here, for example:

    R
    > install.packages('L1pack')
    

    This should guide you to selecting a download mirror and offer you the option to install in your home directory.

    If you have a lot of packages, you can run out of home quota. In this case you should move your package directory to your work directory and replace the ~/R directory with a symlink that points to $WRKDIR/R.

    An example of doing this:

    mv ~/R $WRKDIR/R
    ln -s $WRKDIR/R ~/R
    

    More info on R library paths can be found here. Looking at R startup can also be informative.

  2. You can also submit a request to the Triton issue tracker and mention which R version you are using.

Simple R serial job
Serial R example

r_serial.sh:

#!/bin/bash -l
#SBATCH --time=00:05:00
#SBATCH --ntasks=1
#SBATCH --mem=100M
#SBATCH --output=r_serial.out

module load r
n=3
m=2
srun Rscript --vanilla r_serial.R $n $m

r_serial.R:

args = commandArgs(trailingOnly=TRUE)

n<-as.numeric(args[1])
m<-as.numeric(args[2])

print(n)
print(m)

A<-t(matrix(0:5,ncol=n,nrow=m))
print(A)
B<-t(matrix(2:7,ncol=n,nrow=m))
print(B)
C<-matrix(0.5,ncol=n,nrow=n)
print(C)

C<-A %*% t(B) + 2*C
print(C)
Simple R job using OpenMP for parallelization
R OpenMP Example

r_openmp.sh:

#!/bin/bash
#SBATCH --time=00:15:00
#SBATCH --cpus-per-task=4
#SBATCH --mem=2G
#SBATCH --output=r_openmp.out

module load r
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
time srun Rscript --default-packages=methods,utils,stats R-benchmark-25.R

The benchmark script is available here (more information about it is available on this page).

Simple R parallel job using ‘parallel’-package
Parallel R example

r_parallel.sh:

#!/bin/bash
#SBATCH --time=00:20:00
#SBATCH --cpus-per-task=4
#SBATCH --mem=2G
#SBATCH --output=r_parallel.out

# Set the number of OpenMP-threads to 1,
# as we're using parallel for parallelization
export OMP_NUM_THREADS=1

# Load the version of R you want to use
module load r

# Run your R script
srun Rscript r_parallel.R

r_parallel.R:

library(pracma)
library(parallel)
invertRandom <- function(index) {
    A<-matrix(runif(2000*2000),ncol=2000,nrow=2000);
    A<-A + t(A);
    B<-pinv(A);
    return(max(B %*% A));
}
ptm<-proc.time()
mclapply(1:16,invertRandom, mc.cores=Sys.getenv('SLURM_CPUS_PER_TASK'))
proc.time()-ptm

When constrained to opt-architecture, run times for different core numbers were

ncores         1        2        4        8
runtime (s)    380.757  182.185  125.526  84.230

RStan
supportlevel:

B

pagelastupdated:

2018-07-26


RStan is the R interface to Stan. Stan is a platform for statistical modeling and Bayesian inference.

Basic installation

RStan is installed as an R package and there is nothing too special about it.

First, load the R module you need to use. There are different options, using different compilers. Do not use an iomkl R version: RStan compiles models on the compute nodes every time you run, which requires the Intel compilers, and those are not available there. If you load a goolf (GCC-based) R version, it will work (you could also work around this by pre-compiling models, if you wanted):

$ module spider R
...
R/3.4.1-goolf-triton-2017a
R/3.4.1-iomkl-triton-2017a

$ module load R/3.4.1-goolf-triton-2017a

If you change R versions (from intel to gcc) or get errors about loading libraries, you may have installed incompatible libraries. Removing your ~/R directory and reinstalling all of your libraries is a good first place to start.
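A minimal sketch of doing that from a clean slate (the backup directory name is just our own choice; the R module name is the goolf one shown above):

# Move the old package library aside instead of deleting it outright
mv ~/R ~/R.old

# Load the R version you want to stick with, then reinstall your packages
# inside R, e.g. install.packages('rstan')
module load R/3.4.1-goolf-triton-2017a
R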

Notes

You should detect the number of cores with:

as.integer(Sys.getenv('SLURM_JOB_CPUS_PER_NODE', parallel::detectCores()))
Common Rstan problems
  • Models must be compiled on the machine that is running them, be it Triton or another workstation. The compiled model files aren't necessarily portable, since they depend on the libraries available when built. One symptom of this problem is error messages that talk about loading libraries and GLIBC_2.23 or some such.

  • In order to compile models, you must have the compiler available on the compute nodes. Thus, the Intel compilers (iomkl) won't work. They also won't work if the Intel compiler license servers are down. Using the GNU compiler toolchains is more reliable.

RStudio
supportlevel:

C

pagelastupdated:

2014

RStudio (https://www.rstudio.com/) is an IDE for R.

module load R/3.1.1-openblas boost/1.56 cmake/2.8.12.2 gcc/4.9.1 PrgEnv-gnu/0.1 qt/4.8.6

mkdir build && cd build
cmake .. -DRSTUDIO_TARGET=Desktop -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=/share/apps/rstudio/0.98/ -DBOOST_ROOT=$BOOST_ROOT
Siesta & Transiesta

These are Makefiles copy-pasted from the old Rocks installation and should be used as a starting point. If you have a fully working version for SL6.2, please send us a copy.

See old wiki: https://wiki.aalto.fi/display/Triton/Applications

Rename siesta-3.0.arch.make.xxx => siesta-3.0-b/Obj/arch.make

Your own notebooks on Triton via sjupyter

Note

Now that Triton Jupyterhub exists, this method of running Jupyter is not so important. It is only needed if you need more resources than JupyterHub can provide.

We provide a command sjupyter which automates launching your own notebooks in the Slurm queue. To use it, module load sjupyter. This gives you more flexibility in choosing your nodes and resources than JupyterHub, but it will also affect your and your department's Triton priority more, because you are blocking others from using these resources.

Set up the proxy

When running Jupyter on another system, the biggest problem is always making the connection securely. To do this here, we use a browser extension and an SSH proxy.

  • Install the proxy extension

    • Install the extension FoxyProxy Standard (Firefox or Chrome). Some versions do not work properly: the 5.x series for Firefox may not work, but older and newer versions do.

  • Create a new proxy rule with the pattern *int.triton.aalto.fi* (or jupyter.triton.aalto.fi if you want to connect to that using the proxy).

    • Proxy type: SOCKS5, Proxy URL: localhost, port 8123.

    • DNS through the proxy: on.

  • SSH to Triton and use the -D 8123 option. This starts a proxy on your computer on port 8123. It has to be running whenever you connect to the notebook.

    • If you are in Aalto networks: ssh -D 8123 USERNAME@triton.aalto.fi.

    • If you are not in Aalto networks, you need to do an extra hop through another Aalto server: ssh -D 8123 -J USERNAME@kosh.aalto.fi USERNAME@triton.aalto.fi.

Now, when you go to any address matching *.int.triton.aalto.fi*, you will automatically connect to the right place on Triton. You can use Jupyter like normal. But if the ssh connection goes down, then you can’t connect and will get errors, so be aware (especially with jupyter.triton.aalto.fi which you might expect to always work).

Starting sjupyter

We have the custom-built command sjupyter for starting Jupyter on Triton.

First, you must load the sjupyter module:

module load sjupyter

To run in the Triton queue (using more resources), just use sjupyter. This will start a notebook on the interactive Slurm queue. All the normal rules apply: timelimits, memory limits, etc. If you want to request more resources, use the normal Slurm options such as -t, --mem, etc. Notebooks can only last as long as your job lasts, and you will need to restart them. Be efficient with resource usage: if you request a lot of resources and leave the notebook idle, no one else can use them. Thus, try to use the (default) interactive partition, which handles this automatically.
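For example, a request for a two-hour notebook with four CPUs and 8 GB of memory might look like the following sketch (the values are placeholders; request only what you will actually use):

module load sjupyter
sjupyter -t 02:00:00 --cpus-per-task=4 --mem=8G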

To run on the login node, run sjupyter --local. This is good for small testing and so on, which doesn’t use too much CPU or memory.

speech2text: easy speech transcription

speech2text is a wrapper we have made around the Whisper tool to make it easier to run for a wide audience. Fundamentally, it’s a wrapper around the command line tool + set of instructions for transferring data in a way that (hopefully) can’t go too wrong.

You can read the instructions here: https://aaltorse.github.io/speech2text/

If you use speech2text, you are using Triton, and any outputs (papers, theses, conference publications) should acknowledge Triton and be linked to the "Science-IT" infrastructure in ACRIS once published. You might get an email each year reminding you to do this.

Spyder

Spyder is the Scientific PYthon Development EnviRonment: https://www.spyder-ide.org/

On Triton there are two modules that provide Spyder:

  • The basic anaconda module: module load anaconda

  • The neuroimaging environment module: module load neuroimaging

By loading either module you will get access to Spyder.

Using Spyder on Triton

To use Spyder on Triton, you will need an X server on your local machine (in order to display the Spyder GUI), e.g. VcXsrv. You will further need to connect to Triton with X forwarding: ssh -X triton.aalto.fi

Finally, load the module you want to use Spyder from (see above) and run spyder

Use a different environment for Spyder

If you want to use Python packages which are not part of the module you use Spyder from, it is strongly suggested to create a virtual environment (e.g. a conda environment). Set up the environment with all the packages you want to use. After that, the following steps will make Spyder use the environment:

  1. Activate your environment

  2. Run python -c "import sys; print(sys.executable)" to get the path to the Python interpreter in your environment

  3. Deactivate the environment

  4. Start Spyder

  5. In Spyder, navigate to “Tools -> Preferences” and select “Python interpreter”. Under “Use the following Python interpreter”, enter the path from step 2

That will make Spyder use the created Python environment. A command-line sketch of steps 1-3 is shown below.
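As a rough sketch of steps 1-3, assuming a conda environment named myenv (the name is a placeholder; depending on your conda setup you may need conda activate/conda deactivate instead):

module load anaconda
source activate myenv                            # 1. activate your environment
python -c "import sys; print(sys.executable)"    # 2. note the printed interpreter path for Spyder
source deactivate                                # 3. deactivate the environment again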

Tensorflow
pagelastupdated:

2022-01-09

Tensorflow is a commonly used Python package for deep learning.

Basic usage

First, check the tutorials up to and including GPU computing.

Installing via conda

Have a look here for details on how to install conda environments.

Creating an environment with GPU enabled Tensorflow

To create an environment with GPU enabled Tensorflow you can use an environment file like this tensorflow-env.yml:

name: tensorflow-env
channels:
  - conda-forge
dependencies:
  - tensorflow=*=*cuda*

Here we install the latest tensorflow from the conda-forge channel with the additional requirement that the build version of the tensorflow package must contain a reference to a CUDA toolkit. For a specific version, replace =*=*cuda* with e.g. =2.8.1=*cuda* for version 2.8.1.
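Assuming the file above is saved as tensorflow-env.yml, the environment is then created in the usual way (see the conda environments page linked above for how conda/mamba are made available on Triton):

mamba env create -f tensorflow-env.yml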

Hint

During installation, conda will try to detect the maximum CUDA version that the installed graphics cards support, and if none are found (as is the case on the login node, where environments are normally built) it will install non-CUDA-enabled versions by default. This can usually be overcome by stating explicitly that the packages should be the CUDA-enabled ones. It might, however, happen that the environment creation process aborts with a message similar to:

nothing provides __cuda needed by tensorflow-2.9.1-cuda112py310he87a039_0

In this instance it might be necessary to override the CUDA settings used by conda/mamba. To do this, prefix your environment creation command with CONDA_OVERRIDE_CUDA=CUDAVERSION, where CUDAVERSION is the CUDA toolkit version you intend to use as in:

CONDA_OVERRIDE_CUDA="11.2" mamba env create -f cuda-env.yml

This will allow conda to assume that the respective CUDA libraries will be present at a later point and so it will skip those requirements during installation.

For more information, see this helpful post in Conda-Forge’s documentation.

Examples:
Simple Tensorflow/Keras model

Let’s run the MNIST example from Tensorflow’s tutorials:

model = tf.keras.models.Sequential([
  tf.keras.layers.Flatten(input_shape=(28, 28)),
  tf.keras.layers.Dense(512, activation=tf.nn.relu),
  tf.keras.layers.Dropout(0.2),
  tf.keras.layers.Dense(10, activation=tf.nn.softmax)
])

The full code for the example is in tensorflow_mnist.py. One can run this example with srun:

$ wget https://raw.githubusercontent.com/AaltoSciComp/scicomp-docs/master/triton/examples/tensorflow/tensorflow_mnist.py
$ module load anaconda
$ srun --time=00:15:00 --gres=gpu:1 python tensorflow_mnist.py

or with sbatch by submitting tensorflow_mnist.sh:

#!/bin/bash
#SBATCH --gres=gpu:1
#SBATCH --time=00:15:00

module load anaconda

python tensorflow_mnist.py

Do note that by default Keras downloads datasets to $HOME/.keras/datasets.

Running simple Tensorflow/Keras model with NVIDIA’s containers

Let’s run the MNIST example from Tensorflow’s tutorials:

model = tf.keras.models.Sequential([
  tf.keras.layers.Flatten(input_shape=(28, 28)),
  tf.keras.layers.Dense(512, activation=tf.nn.relu),
  tf.keras.layers.Dropout(0.2),
  tf.keras.layers.Dense(10, activation=tf.nn.softmax)
])

The full code for the example is in tensorflow_mnist.py. One can run this example with srun:

wget https://raw.githubusercontent.com/AaltoSciComp/scicomp-docs/master/triton/examples/tensorflow/tensorflow_mnist.py
module load nvidia-tensorflow/20.02-tf1-py3
srun --time=00:15:00 --gres=gpu:1 singularity_wrapper exec python tensorflow_mnist.py

or with sbatch by submitting tensorflow_singularity_mnist.sh:

#!/bin/bash
#SBATCH --gres=gpu:1
#SBATCH --time=00:15:00

module load nvidia-tensorflow/20.02-tf1-py3

singularity_wrapper exec python tensorflow_mnist.py

Do note that by default Keras downloads datasets to $HOME/.keras/datasets.

Theano

If you’re using the Theano library, you need to tell Theano to store compiled code on the local disk of the compute node. Create a file ~/.theanorc with the contents

[global]
base_compiledir=/tmp/%(user)s/theano

Also make sure that in your batch job script you create this directory before you launch theano. E.g.

mkdir -p /tmp/${USER}/theano

The problem is that by default the base_compiledir is in your home directory (~/.theano/), and then if you first happen to run a job on a newer processor, a later job that happens to run on an older processor will crash with an “Illegal instruction” error.
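Putting these pieces together, a minimal batch script could look like the following sketch (the resource values and the script name my_theano_script.py are placeholders, and the anaconda module is just one possible way to get a Python with Theano):

#!/bin/bash
#SBATCH --time=01:00:00
#SBATCH --mem=2G

module load anaconda

# Create the local compile directory before launching Theano
mkdir -p /tmp/${USER}/theano

python my_theano_script.py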

VASP

VASP (Vienna Ab initio Simulation Package) is a computer program for atomic-scale materials modelling, e.g. electronic-structure calculations and quantum-mechanical molecular dynamics, from first principles.

VASP is licensed software, requiring the licensee to keep the VASP team updated with a list of user names. Thus, in order to use VASP, arrange with the “vaspmaster” for your group to be put on the VASP licensed-user list. Afterwards, contact your local Triton admin, who will take care of the IT gymnastics, and CC the vaspmaster so that they are aware of who gets added to the list.

For the PHYS department, the vaspmaster is Ivan Tervanto.

For each VASP version, there are three binaries compiled. All of them are MPI and OpenMP enabled.

  • vasp_std: The “standard” vasp, compiled with NGZhalf

  • vasp_gam: Gamma point only. Faster if you use only a single k-point.

  • vasp_ncl: For non-collinear spin calculations

VASP 6.4.1

The binaries are compiled with the GNU compilers, MKL (incl. ScaLAPACK) and the OpenMPI libraries; the modules used are gcc/11.2.0 intel-oneapi-mkl/2021.4.0 openmpi/4.0.5. Example batch script:

#!/bin/bash -l
#SBATCH --nodes=1
#SBATCH --ntasks=8
#SBATCH --time=06:00:00
#SBATCH --mem-per-cpu=1500M
module load vasp/6.4.1
srun vasp_std
Potentials

Potentials are stored at /share/apps/vasp/pot.

Old VASP versions (obsolete, for reference only!)

These old versions are unlikely to work as they use old MPI and IB libraries that have stopped working due to upgrades over the years.

VASP 5.4.4

The binaries are compiled with the Intel compiler suite and the MKL library; the toolchain module used is intel-parallel-studio/cluster.2020.0-intelmpi. Example batch script:

#!/bin/bash -l
#SBATCH --nodes=1
#SBATCH --ntasks=8
#SBATCH --time=06:00:00
#SBATCH --mem-per-cpu=1500M
module load vasp/5.4.4
export I_MPI_PMI_LIBRARY=/usr/lib64/libpmi.so
srun vasp_std
VASP 5.4.1

Currently the binaries are compiled with GFortran instead of Intel Fortran (the Intel Fortran binaries crashed; the cause is not yet known). Example batch script:

#!/bin/bash -l
#SBATCH --nodes=1
#SBATCH --ntasks=8
#SBATCH --time=06:00:00
#SBATCH --mem-per-cpu=1500M
module load vasp/5.4.1-gmvolf-triton-2016a
srun vasp_std

For each VASP version, there are two binaries compiled with slightly different options:

vasp.mpi.NGZhalf
vasp.mpi

Both are MPI versions. The first one is what you should normally use; it is compiled with the NGZhalf option which reduces charge density in the Z direction, leading to less memory usage and faster computation. The second version is needed for non-collinear spin calculations. The binaries can be found in the directory /share/apps/vasp/$VERSION/ . For those of you who need to compile your own version of VASP, the makefiles used for these builds can be used as a starting point, and are found in the directory /share/apps/vasp/makefiles .

VASP 5.3.5

The binaries are optimized for the Xeon Ivy Bridge nodes, although they will also work fine on the older Xeon Westmere and Opteron nodes. Note that for the moment only the NGZhalf version has been built; if you need the non-NGZhalf version for non-collinear spin calculations, please contact Triton support. Example job script below:

#!/bin/bash -l
#SBATCH --nodes=1
#SBATCH --ntasks=12
#SBATCH --time=06:00:00
#SBATCH --mem-per-cpu=2500M

module load vasp/5.3.5

srun vasp.mpi.NGZhalf

The relative time to run the vasptest v2 test suite on 12 cores (a full node on the Xeon Westmere and Opteron nodes, and 12/20 cores on a Xeon Ivy Bridge node) is 1.0/2.0/2.8 for Xeon IB/Xeon Westmere/Opteron. So the Xeon Ivy Bridge nodes are quite a lot faster per core than the older nodes (with the caveat that the timings may vary depending on other jobs that may have been running on the Xeon IB node during the benchmark).

VASP 5.3.3

The binaries are optimized for the Xeon nodes, although they also work on the Opteron nodes. Some simple benchmarks suggest that the Opteron nodes are a factor of 1.5 slower than the Xeon nodes, although it is recommended to write the batch script such that Opteron nodes can also be used, as the Opteron queue is often shorter. An example script below:

#!/bin/bash -l
#SBATCH --nodes=1
#SBATCH --ntasks=12
#SBATCH --time=06:00:00
#SBATCH --mem-per-cpu=2500M

module load vasp/5.3.3

srun vasp.mpi.NGZhalf
VASP 5.3.2 and older

The binaries are optimized for the Intel Xeon architecture nodes, and are not expected to work on the Opteron nodes. An example job script is below (Note that it is different from the script for version 5.3.3 and newer above!):

#!/bin/bash -l
#SBATCH --nodes=1
#SBATCH --ntasks=12
#SBATCH --time=1-00:00:00
#SBATCH --mem-per-cpu=3500M

module load vasp/5.3.2

srun vasp.mpi.NGZhalf
Potentials

PAW potentials for VASP can be found in the directory /share/apps/vasp/pot. The recommended potentials are the ones in the Apr2012.52 subdirectory. For reference, an older set of potentials dating back to 2003 can be found in the “2003” subdirectory.

Validation

The vasp.mpi.NGZhalf builds have been verified to pass all the tests in the vasptest suite.

Other

Old makefiles

Here are a number of Makefiles copy-pasted from the old Rocks installation. They can be useful in general, though they may require adaptation to the new installation. Please send us a fully working copy if you have one.

See old wiki: https://wiki.aalto.fi/display/Triton/Applications

Rename vasp.x.y.makefile => vasp.x.y/makefile

VisIT

This uses Singularity containers, so you should refer to that page first for general information.

VisIT has been compiled using the build_visit script from the VisIT page on an Ubuntu image. It has a minimal amount of other software installed.

Parallelization is done against Triton’s OpenMPI, so using this container with other OpenMPI modules is discouraged.

Within the container VisIT is installed under /opt/visit/. PATH is automatically extended with the respective paths, so all program calls are available directly.

Usage

This example shows how you can launch visit on the login node for small visualizations or launch it in multiprocess state on a reserved node. Firstly, let’s load the module:

module use /share/apps2/singularity/modules
module load Visit

Now you can run VisIT with:

singularity_wrapper exec visit

If you want to run VisIT with multiple CPUs, you should reserve a node with sinteractive:

sinteractive --time=00:30:00 --ntasks=2 --nodes=1-1
singularity_wrapper exec visit -np 2

Do note the flag --nodes=1-1, which ensures that all of VisIT's processes end up on the same node. Currently VisIT encounters problems when spanning multiple nodes.

VSCode on Triton

VSCode is a text editor and integrated development environment. It is very popular these days, partly due to its good usability.

Installation

If you are using it on Triton, it's available as a web app through Open OnDemand, see below.

It can also be installed on your own computer, which might be good to do anyway. If you do this, make sure you turn off telemetry if you don't want Microsoft to get reports of your activity. Search for "telemetry" in the settings to check and disable it (note that this doesn't fully turn it off).

VSCodium is an open-source build of VSCode (similar to how Chromium relates to Google Chrome) that disables all telemetry by default and removes the non-open-source bits. It is essentially the same thing, but due to Microsoft licenses it can't use the same extension registry as VSCode. It does have a stand-alone extension registry of its own, though.

Security and extensions

As always when using user-contributed extensions, be cautious of what extensions you install. A malicious extension can access and/or delete all of the data available via your account.

VSCode through Open OnDemand

See also

Open OnDemand

VSCode is available through Open OnDemand, and with this you can select whatever resources you want (memory, CPU, etc) and run directly in the Slurm queue. This means you can directly perform calculations in that VSCode session and it runs properly (not on the login node).

This is useful for getting things done quickly, but running in a web browser can be limited in some cases (interface, lifetime, etc.).

VSCode remote SSH

“Remote SSH” is a nice way to work on a remote computer and provides both editing and shell access, but everything will run directly on the login node on Triton. This is OK for editing, but not for main computations (see the section above or below). To repeat: don’t use this for running big computations.

Screenshot saying "SSH: triton".

If you see this in the lower left corner (or whatever the name of your cluster SSH config is), you are connected to the login node (and should not do big calculations). It’s possible the exact look may be different for others.

You can see connection instructions (including screenshots) at the Sigma2 instructions.

VSCode can use a regular OpenSSH configuration file, so you may as well set that up once and it can be used for everything - see SSH for the full story. The basics of SSH to Triton are in Connecting via ssh. An SSH key allows you to connect without entering a password every time.

VSCode remote SSH host directly to interactive job

Sometimes you want more resources than the login node. This section presents a way to have VSCode directly connect to a job resource allocation on Triton - so you can do larger calculations / use more memory / etc. without interfering with others. Note that for real production calculations, you should use Serial Jobs, and not run stuff through your editor, since everything gets lost when your connection dies.

This section contains original research and may not fully work, and may only work on Linux/Mac right now (but Windows might work too since it uses OpenSSH).

In your ~/.ssh/config, add this block to define a server triton-vscode. For more information on .ssh/config, including what these lines mean and what else you might need in here, see SSH:

Host triton-vscode
    ProxyCommand ssh triton /share/apps/ssh-node-proxycommand --partition=interactive --time=1:00:00
    StrictHostKeyChecking no
    UserKnownHostsFile /dev/null
    User USERNAME

# You also need a triton alias here:

Host triton
    HostName triton.aalto.fi
    # ... any other thing you need for connecting to triton.
    User USERNAME

Now, with VSCode’s Remote SSH, you can select the triton-vscode remote. It will ssh to Triton, request a job, and then directly connect to the job. Configure the job requirements in the ProxyCommand line (see Job submission - you can have multiple Host sections for different types of requirements).

Possible issues which may affect usage:

  • If the ssh connection dies, the background job will be terminated. You will lose your state and not be able to save.

  • If the job dies due to time or memory exceeded, the same as above will happen: your job will die and there is no time to save.

  • If you srun from within the job, then it gets messed up because the environment variable SLURM_JOB_ID is set from the interactive job that got started. It’s hard for us to unset this, so if you are using the terminal to srun or sbatch, you should unset SLURM_JOB_ID first. (Note there are many other variables set by Slurm. Make sure that they don’t interfere with jobs you may run from this vscode session).

  • If you request a GPU node or other large resources, they are reserved the whole time even if you aren't using them. Consider this before reserving large resources (unless you close the job soon), or you might get an email from us asking if we can help you improve your resource usage.

Whisper

This uses Singularity containers, so you should refer to that page first for general information.

There are two variants of Whisper available. The “standard” Whisper uses whisper-ctranslate2, which is a CLI for faster-whisper, a reimplementation of OpenAI’s Whisper using Ctranslate2. Original repository for this project can be found here.

The second variant is whisper-diarization, which is a fork of faster-whisper with support for speaker detection (diarization). Original repository for this project can be found here.

Of these two, whisper-diarization runs noticeably slower and has less versatile options. Using the base Whisper is recommended if speaker detection is not necessary.

Usage (Whisper)

This example shows you a sample script to run Whisper.

$ module load whisper
$ srun --mem=4G singularity_wrapper run your_audio_file.wav --model_directory $medium_en --local_files_only True --language en

The option --model_directory $medium_en tells Whisper to use a local model, in this case the model medium.en, with the path to the model given through the environment variable $medium_en. For a list of all local models, you can run echo $model_names as long as the module is loaded. (These models are pre-downloaded by us and the variables are defined when the module is loaded.) You can also give it a path to your own model if you have one. The other important option here is --local_files_only True; this stops Whisper from checking whether there are newer versions of the model online. The option --language LANG is not necessary, but Whisper's language detection is sometimes unreliable.

If you are transcribing a language other than English, use a general model, e.g. $medium. If your source audio is in English, using the English-specific models usually gives a performance gain.

For full list of options, run:

$ singularity_wrapper run --help

Notes on general Slurm resources:

  • For memory, requesting roughly 4G for the medium model or smaller, and 8G for the large model, should be sufficient.

  • When running on CPU, requesting additional CPUs should give a performance increase up to 8 CPUs. Whisper doesn't scale properly beyond 8 CPUs, and will actually run slower in most cases.

Running on GPU

singularity_wrapper takes care of making GPUs available to the container, so all you need to do to run Whisper on a GPU is use the previous command and add the additional flag --device cuda. Without it, Whisper will only run on the CPU even if a GPU is available. Remember to also request a GPU in the Slurm job.
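For example, adapting the earlier command for one GPU might look like this (a sketch; --gres=gpu:1 requests a single GPU of any type, and the memory value is the same placeholder as before):

$ module load whisper
$ srun --mem=4G --gres=gpu:1 singularity_wrapper run your_audio_file.wav --model_directory $medium_en --local_files_only True --language en --device cuda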

Usage (Whisper-diarization)

This example shows you a sample script to run whisper-diarization.

$ module load whisper-diarization
$ srun --mem=6G singularity_wrapper run -a your_audio_file.wav --whisper-model $medium_en

The option --whisper-model $medium_en tells Whisper which model to use, in this case medium.en. If you use the environment variables that come with the module to specify the model, Whisper will run using a local model; otherwise it will download the model to your home directory. For a list of all local models, run echo $model_names with whisper-diarization loaded.

Note that the syntax is unfortunately somewhat different compared to plain Whisper. You need to specify the audio file with the argument -a audio_file.wav, and similarly the syntax to specify the model is different.

For full list of options, run:

$ singularity_wrapper run --help

Notes on general Slurm resources:

  • Whisper-diarization requires slightly more memory than plain Whisper. Requesting roughly 6G for the medium model or smaller, and 12G for the large model, should be sufficient.

  • When running on CPU, requesting additional CPUs should give a performance increase up to 8 CPUs. Whisper doesn't scale properly beyond 8 CPUs, and will actually run slower in most cases.

Running on GPU

Compared to plain Whisper, running whisper-diarization on a GPU takes a little more work. singularity_wrapper still takes care of making GPUs available to the container, and you still specify that you want to use the GPU with the flag --device cuda.

Unfortunately whisper-diarization requires multiple models when using a GPU, and there isn't a practical way to use local models for this. For this reason, you should create a symlink from Whisper's cache folder in your home directory to your work directory. This way you avoid filling your home directory's quota (a GPU run example is shown after the setup below).

To do this, run the following commands:

$ mkdir -p ~/.cache/huggingface/ ~/.cache/torch/NeMo temp_cache/huggingface/ temp_cache/NeMo/ $WRKDIR/whisper_cache/huggingface $WRKDIR/whisper_cache/NeMo
$ mv ~/.cache/huggingface/* temp_cache/huggingface/
$ mv ~/.cache/torch/NeMo/* temp_cache/NeMo/
$ rmdir ~/.cache/huggingface/ ~/.cache/torch/NeMo
$ ln -s $WRKDIR/whisper_cache/huggingface ~/.cache/
$ ln -s $WRKDIR/whisper_cache/NeMo ~/.cache/torch/
$ mv temp_cache/huggingface/* ~/.cache/huggingface/
$ mv temp_cache/NeMo/* ~/.cache/torch/NeMo
$ rmdir temp_cache/huggingface temp_cache/NeMo temp_cache

This bunch of commands first creates the cache folders if they don't exist and moves any existing files to a temporary directory. Next it creates symlinks to your work directory in place of the original cache directories, and finally moves all the previous files back. This way all downloaded files live in your work directory instead of eating your home quota.
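After this setup, a GPU run is the earlier diarization command with a GPU requested from Slurm and --device cuda added, roughly like this (a sketch; the memory value follows the note above):

$ module load whisper-diarization
$ srun --mem=12G --gres=gpu:1 singularity_wrapper run -a your_audio_file.wav --whisper-model $medium_en --device cuda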

Converting audio files

Whisper should automatically convert your audio file to the correct format when you run it. In case this does not work, you can convert it on Triton using ffmpeg with the following commands:

$ module load ffmpeg
$ ffmpeg -i input_file.audio output.wav

If you want to extract audio from a video, you can instead do:

$ module load ffmpeg
$ ffmpeg -i input_file.video -map 0:a output.wav

Examples

Master-Worker Example

The following example shows how to manage a host list using the python-hostlist package and run different tasks for the master task and the worker tasks.

This kind of structure might be needed if one wants to create e.g. a Spark cluster or use some other program that follows the master-worker paradigm but does not use MPI.

It is important to make sure that in case of job cancellation all programs started by the scripts are killed gracefully. In the case of Spark or other programs that initialize a cluster using SSH and then fork a process, these forked processes must be killed after the job allocation has ended.

hostlist-test.sh:

#!/bin/bash
#SBATCH --time=00:10:00
#SBATCH --nodes=3
#SBATCH --ntasks=5
#SBATCH -o hostlist-test.out

# An example of a clean_up-routine if the master has to take e.g. ssh connection to start program on workers
function clean_up {
    echo "Got SIGTERM, will clean up my workers and exit."
    exit
}
trap clean_up SIGHUP SIGINT SIGTERM

# Actual script that defines what each worker will do
srun bash run.sh

run.sh:

#!/bin/bash

# Get a list of hosts using python-hostlist
nodes=`hostlist --expand $SLURM_NODELIST|xargs`

# Determine current worker name
me=$(hostname)

# Determine master process (first node, id 0)
master=$(echo $nodes | cut -f 1 -d ' ')

# SLURM_LOCALID contains task id for the local node
localid=$SLURM_LOCALID

if [[ "$me" == "$master" && "$localid" -eq 0 ]]
then
   # Run these if the process is the master task
   echo "I'm the master with number "$localid" in node "${me}". My subordinates are "$nodes
else
   # Run these if the process is a worker
   echo "I'm a worker number "$localid" in node "${me}
fi

Example output:

I'm a worker number 1 in node opt469
I'm a worker number 2 in node opt469
I'm the master with number 0 in node opt469. My subordinates are opt469 opt470 opt471
I'm a worker number 0 in node opt471
I'm a worker number 0 in node opt470
Python OpenMP example

parallel_Python.sh:

#!/bin/bash
#SBATCH --time=00:10:00
#SBATCH --cpus-per-task=4
#SBATCH --mem=2G
#SBATCH -o parallel_Python.out

module load anaconda/2022-01

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
srun -c $SLURM_CPUS_PER_TASK python parallel_Python.py

parallel_Python.py:

import numpy as np
a = np.random.random([2000,2000])
a = a + a.T
b = np.linalg.pinv(a)
print(np.amax(np.dot(a,b)))
Serial R example

r_serial.sh:

#!/bin/bash -l
#SBATCH --time=00:05:00
#SBATCH --ntasks=1
#SBATCH --mem=100M
#SBATCH --output=r_serial.out

module load r
n=3
m=2
srun Rscript --vanilla r_serial.R $n $m

r_serial.R:

args = commandArgs(trailingOnly=TRUE)

n<-as.numeric(args[1])
m<-as.numeric(args[2])

print(n)
print(m)

A<-t(matrix(0:5,ncol=n,nrow=m))
print(A)
B<-t(matrix(2:7,ncol=n,nrow=m))
print(B)
C<-matrix(0.5,ncol=n,nrow=n)
print(C)

C<-A %*% t(B) + 2*C
print(C)
Parallel R example

r_parallel.sh:

#!/bin/bash
#SBATCH --time=00:20:00
#SBATCH --cpus-per-task=4
#SBATCH --mem=2G
#SBATCH --output=r_parallel.out

# Set the number of OpenMP-threads to 1,
# as we're using parallel for parallelization
export OMP_NUM_THREADS=1

# Load the version of R you want to use
module load r

# Run your R script
srun Rscript r_parallel.R

r_parallel.R:

library(pracma)
library(parallel)
invertRandom <- function(index) {
    A<-matrix(runif(2000*2000),ncol=2000,nrow=2000);
    A<-A + t(A);
    B<-pinv(A);
    return(max(B %*% A));
}
ptm<-proc.time()
mclapply(1:16,invertRandom, mc.cores=Sys.getenv('SLURM_CPUS_PER_TASK'))
proc.time()-ptm

When constrained to opt-architecture, run times for different core numbers were

ncores         1        2        4        8
runtime (s)    380.757  182.185  125.526  84.230

R OpenMP Example

r_openmp.sh:

#!/bin/bash
#SBATCH --time=00:15:00
#SBATCH --cpus-per-task=4
#SBATCH --mem=2G
#SBATCH --output=r_openmp.out

module load r
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
time srun Rscript --default-packages=methods,utils,stats R-benchmark-25.R

The benchmark script is available here (more information about it is available on this page).

Python
IPython parallel

An example batch script that uses IPython parallel (ipyparallel) within Slurm. See also the interactive hints on the Python page.

ipyparallel uses global state in your home directory, so you can only run _one_ of these at a time! You can add the --profile= option to name different setups (you could use $SLURM_JOB_ID), but then you will get a growing number of unneeded profile directories at ~/.ipython/profile_*, so this isn't recommended (a sketch of it is shown after the example script below, if you really need it). Basically, ipyparallel is designed more for one-at-a-time interactive use than for batch scripting (unless you do more work…).

ipyparallel.sh is an example Slurm script that sets up ipyparallel. It assumes that most work is done in the engines. It has inline Python; replace this with python your_script_name.py.

#!/bin/bash
#SBATCH --nodes=4

module load anaconda
set -x

ipcontroller --ip="*" &
sleep 5
# Run the engines in slurm job steps (makes four of them, since we use
# the --nodes=4 slurm option)...
srun ipengine --location=$(hostname -f) &

sleep 5
# The actual Python script isn't run in a job step.  This assumes that
# most work happens in the engines
python3 <<EOF
import os
import ipyparallel
client = ipyparallel.Client()
result = client[:].apply_async(os.getpid)
pid_map = result.get_dict()
print(pid_map)
EOF
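If you did decide to use per-job profiles despite the caveat above, a rough sketch would be to create a throwaway profile named after the job and pass it to the controller, the engines and the client (remember to clean up ~/.ipython/profile_job_* afterwards):

# Create a throwaway profile named after this job
ipython profile create job_${SLURM_JOB_ID}

ipcontroller --ip="*" --profile=job_${SLURM_JOB_ID} &
sleep 5
srun ipengine --location=$(hostname -f) --profile=job_${SLURM_JOB_ID} &
sleep 5

# In your Python, pass the same profile name:
#   client = ipyparallel.Client(profile='job_' + os.environ['SLURM_JOB_ID'])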
Python MPI4py

A simple script mpi4py.py that utilizes mpi4py.

#!/usr/bin/env python
"""
Parallel Hello World
"""
from mpi4py import MPI
import sys
size = MPI.COMM_WORLD.Get_size()
rank = MPI.COMM_WORLD.Get_rank()
name = MPI.Get_processor_name()
sys.stdout.write(
    "Hello, World! I am process %d of %d on %s.\n"
    % (rank, size, name))

Running mpi4py.py using only srun:

module load Python/2.7.11-goolf-triton-2016b
srun --time=00:10:00 --ntasks=4 python mpi4py.py

Example sbatch script mpi4py.sh when running mpi4py.py through sbatch:

#!/bin/bash
#SBATCH --time=00:10:00
#SBATCH --ntasks=4

module load Python/2.7.11-goolf-triton-2016b
mpiexec -n $SLURM_NTASKS python mpi4py.py
Running Python with OpenMP parallelization

Various Python packages such as NumPy, SciPy and pandas can utilize OpenMP to run on multiple CPUs. As an example, let's run the Python script python_openmp.py, which calculates the pseudoinverse of five symmetric matrices of size 2000x2000.

import numpy as np
from time import time

nrounds = 5

t_start = time()

for i in range(nrounds):
    a = np.random.random([2000,2000])
    a = a + a.T
    b = np.linalg.pinv(a)

t_delta = time() - t_start

print('Seconds taken to invert %d symmetric 2000x2000 matrices: %f' % (nrounds, t_delta))

The full code for the example is in the HPC examples repository. One can run this example with srun:

wget https://raw.githubusercontent.com/AaltoSciComp/hpc-examples/master/python/python_openmp/python_openmp.py
module load anaconda/2022-01
export OMP_PROC_BIND=true
srun --cpus-per-task=2 --mem=2G --time=00:15:00 python python_openmp.py

or with sbatch by submitting python_openmp.sh:

#!/bin/bash -l
#SBATCH --time=00:10:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=2
#SBATCH --mem-per-cpu=1G
#SBATCH -o python_openmp.out

module load anaconda/2022-01

export OMP_PROC_BIND=true

echo 'Running on: '$HOSTNAME

srun python python_openmp.py

Important

Python has a global interpreter lock (GIL), which forces some operations to be executed on only one thread; while these operations are occurring, the other threads are idle. These kinds of operations include reading files and doing print statements. Thus one should be extra careful with multithreaded code, as it is easy to create seemingly parallel code that does not actually utilize multiple CPUs.

There are ways to minimize the effects of the GIL on your Python code, and if you're creating your own multithreaded code, we recommend that you take this into account.

Detailed instructions

Debugging

Note

Also see Profiling.

Debugging is one of the most fundamental things you can do while developing software: debuggers allow you to see inside running programs, which is a basic requirement when developing any software. A debugger is usually one of the first tools created for any reasonable programming language.

Serial code debugging

GDB is the usual GNU debugger.

Note: the latest version of gcc/gfortran available through the module system requires the -gdwarf-2 option along with -g to work with the default gdb command. Otherwise the default version 4.4 should work normally with just -g.
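A minimal sketch of the workflow (the program name is a placeholder):

# Compile with debug symbols; add -gdwarf-2 for the newer gcc from the module system, as noted above
gcc -g -gdwarf-2 -O0 -o my_program my_program.c

# Run the program under the debugger
gdb ./my_program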

Valgrind is another tool that helps you to debug and profile your serial code on Triton.

MPI debugging & profiling
GDB with the MPI code

Compile your MPI app with -g, run GDB for every single MPI rank with:

salloc -p play --nodes 1 --ntasks 4 srun xterm -e gdb mpi_app

You should get 4 xterm windows to follow; from now on you have full control of your MPI app with the serial debugger.

PADB

A Parallel Debugging Tool. Works on top of Slurm; supports OpenMPI and MPICH only (as of June 2015), that is, MVAPICH2 is not supported. It does not require code re-compilation: just run your MPI code normally, and then launch padb separately to analyze the code's behavior.

Usage summary (for full list and explanations please consult http://padb.pittman.org.uk/):

# assume you have your openmpi module loaded already
module load padb
padb --create-secret-file    # for the very first time only

# Show all your current active jobs in the SLURM queue
padb -show-jobs

# Target a specific jobid, and report its process state
padb JOBID --proc-summary
# or, for all running jobs
padb --all --proc-summary

# Target a specific jobid, and report its MPI message queue, stack traceback, etc.
padb --full-report=JOBID

# Target a specific jobid, and report its stack trace for a given MPI process (rank)
padb JOBID --stack-trace --tree --rank RANK

# Target a specific jobid, and report its stack trace including information about parameters and local variables for a given MPI process (rank)
padb JOBID --stack-trace --tree --rank RANK -Ostack-shows-locals=1 -Ostack-shows-params=1

# Target a specific jobid, and report its MPI message queues
padb JOBID --mpi-queue

# Target a specific jobid, and report its MPI process progress (queries in a loop over and over again)
padb JOBID --mpi-watch --watch -Owatch-clears-screen=no
Storage: local drives

Local disks on computing nodes are the preferred place for doing your IO. The general idea is to use network storage as a backend and the local disk for the actual data processing.

  • At the beginning of the job, cd to /tmp and make a unique directory for your run

  • Copy the needed input from WRKDIR there

  • Run your calculation normally, forwarding all output to /tmp

  • At the end, copy the relevant output back to WRKDIR for analysis and further usage

Pros

  • You get better and steadier IO performance. WRKDIR is shared among all users, making per-user performance actually rather poor.

  • You save WRKDIR performance for those who cannot use local disks.

  • You get much better performance when using many small files (Lustre works poorly here).

  • Saves your quota if your code generates lots of data but in the end you need only part of it.

  • In general, it is an excellent choice for single-node runs (that is, all of the job's tasks run on the same node).

Cons

  • Not feasible for huge files (>100GB). Use WRKDIR instead.

  • Small learning curve (must copy files before and after the job).

  • Not feasible for cross-node IO (MPI jobs). Use WRKDIR instead.

How to use local drives on compute nodes

NOT for long-term data: it is cleaned every time your job finishes.

You have to use --gres=spindle to ensure that you get a hard disk (note from January 2019: except on GPU nodes).
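In a batch script this is just one extra directive (and the same option should work on the srun/sinteractive command line too), for example:

#SBATCH --gres=spindle    # ask for a node with a local hard disk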

/tmp is a bind-mounted, user-specific directory. The directory is per-user (not per-job): if you have two jobs running on the same node, they see the same /tmp.

Interactively

How to use /tmp when you login interactively

$ sinteractive --time=1:00:00              # request a node for one hour
(node)$ mkdir /tmp/$SLURM_JOB_ID       # create a unique directory, here we use the job ID
(node)$ cd /tmp/$SLURM_JOB_ID
... do what you wanted ...
(node)$ cp your_files $WRKDIR/my/valuable/data  # copy what you need
(node)$ cd; rm -rf /tmp/$SLURM_JOB_ID  # clean up after yourself
(node)$ exit
In batch script

A batch job example that prevents data loss in case the program gets terminated (either because of scancel or due to the time limit).

#!/bin/bash

#SBATCH --time=12:00:00
#SBATCH --mem-per-cpu=2500M                                   # time and memory requirements

mkdir /tmp/$SLURM_JOB_ID                                      # get a directory where you will send all output from your program
cd /tmp/$SLURM_JOB_ID

## set the trap: when killed or exits abnormally you get the
## output copied to $WRKDIR/$SLURM_JOB_ID anyway
trap "mkdir $WRKDIR/$SLURM_JOB_ID; mv -f /tmp/$SLURM_JOB_ID $WRKDIR/$SLURM_JOB_ID; exit" TERM EXIT

## run the program and redirect all IO to a local drive
## assuming that you have your program and input at $WRKDIR
srun $WRKDIR/my_program $WRKDIR/input > output

mv /tmp/$SLURM_JOB_ID/output $WRKDIR/SOMEDIR                   # move your output fully or partially
Batch script for thousands input/output files

If your job requires a large number of files as input/output, using the tar utility can greatly reduce the load on the $WRKDIR filesystem.

Using methods like this is recommended if you’re working with thousands of files.

Working with tar balls is done in the following fashion:

  1. Determine if your input data can be collected into analysis-sized chunks that can be (if possible) re-used

  2. Make a tar ball out of the input data (tar cf <tar filename>.tar <input files>)

  3. At the beginning of the job, copy the tar ball into /tmp and untar it there (tar xf <tar filename>.tar)

  4. Do the analysis here, on the local disk

  5. If the output is a large number of files, tar them and copy them out; otherwise write the output directly to $WRKDIR

A sample code is below:

#!/bin/bash

#SBATCH --time=12:00:00
#SBATCH --mem-per-cpu=2000M                       # time and memory requirements
mkdir /tmp/$SLURM_JOB_ID                          # get a directory where you will put your data
cp $WRKDIR/input.tar /tmp/$SLURM_JOB_ID           # copy tarred input files
cd /tmp/$SLURM_JOB_ID

trap "rm -rf /tmp/$SLURM_JOB_ID; exit" TERM EXIT  # set the trap: when killed or exits abnormally you clean up your stuff

tar xf input.tar                                  # untar the files
srun  input/*                                     # do the analysis, or whatever else
tar cf output.tar output/*                        # tar output
mv output.tar $WRKDIR/SOMEDIR                     # copy results back
Storage: Lustre (scratch)

Lustre is a scalable, high-performance file system created for HPC. It allows MPI-IO, but mainly it provides large storage capacity and high sequential throughput for cluster applications. Currently the total capacity is 2 PB. The basic idea in Lustre is to spread the data in each file over multiple storage servers. With large (larger than 1 GB) files Lustre will significantly boost performance.

Working with small files

As Lustre is meant for large files, the performance with small (smaller than 10MB) files will not be optimal. If possible, try to avoid working with multiple small files.

Note: Triton has a default stripe of 1 already, so it is by default optimized for small files (but it’s still not that great). If you use large files, see below.

If small files are needed (e.g. source code), you can tell Lustre not to spread the data over all the nodes. This will help performance.

To see the striping for any given file or directory, you can use the following command to check the status:

lfs getstripe -d /scratch/path/to/dir

You cannot change the striping of an existing file, but you can change the striping of new files created in a directory, and then copy the file to a new name in that directory:

lfs setstripe -c 1 /scratch/path/to/dir
cp somefile /scratch/path/to/dir/newfile
Working with lots of small files

Large datasets which consist mostly of small (<1MB) files can be slow to process because of the network overhead associated with individual files. If this is your case, please consult the compute node local drives page; see the tar example over there or find some other way to pack your files together into one.

Working with large files

By default Lustre on Triton is configured to stripe a single file over a single OST. This provides the best performance for small files, serial programs, parallel programs where only one process is doing I/O, and parallel programs using a file-per-process file I/O pattern. However, when working with large files (>> 10 GB), particularly if they are accessed in parallel from multiple processes in a parallel application, it can be advantageous to stripe over several OST’s. In this case the easiest way is to create a directory for the large file(s), and set the striping parameters for any files subsequently created in that directory:

cd $WRKDIR
mkdir large_file
lfs setstripe -c 4 large_file

The above creates a directory large_file and specifies that files created inside that directory will be striped over 4 OSTs. For really large files (hundreds of GBs) accessed in parallel from very large MPI runs, set the stripe count to "-1", which tells the system to stripe over all the available OSTs.

To reset back to the default settings, run

lfs setstripe -d path/to/directory
Lustre: common recommendations
  • Minimize use of ls -l and ls --color when possible

Several excellent recommendations are at

They are fully applicable to our case.

Be aware that, being a high-performance filesystem, Lustre still has its own bottlenecks, and even improper usage by a single user can get the whole system stuck. See the recommendations at the link above for how to avoid those potential situations. Common Lustre troublemakers are ls -lR, creating many small files, rm -rf, small random I/O, and heavy bulk I/O.

For advanced users, these slides can be interesting: https://www.eofs.eu/fileadmin/lad2012/06_Daniel_Kobras_S_C_Lustre_FS_Bottleneck.pdf

Open OnDemand

Warning

Triton OOD is under development and is available as a preview/for feedback. It may or may not work at any given time as we work on it. It is probably best to use our chat to give quick feedback.

Open OnDemand is a web-based interface to computer clusters. It provides a low-threshold way to do easy work and shell access to do more. It complements, not replaces, the traditional ssh access: just like with Jupyter, it may help you get started, but most people will eventually move towards shell access (even if that shell is via Open OnDemand).

Connecting

Address: http://ood.triton.aalto.fi . Log in with the usual Aalto login. Connections only from Aalto networks or VPN. A pre-existing Triton account is needed.

How to use

The first view is a dashboard that provides an interface to a number of applications:

  • Shell: Top bar → Clusters → Triton shell access. Or via the file manager.

  • Files: Top bar → Files → choose your directory. You can upload and download files this way.

  • Other applications via the main page or Top bar → Interactive Apps → (choose).

Applications

Once logged in, there are ways to start separate applications, for example Jupyter. These run as separate, independent processes.

We have these applications available and supported:

  • Jupyter

  • RStudio

  • Matlab

  • Spyder

  • Code Server

Choose partition ‘interactive’ and a correct account.

Current issues
  • Apps will be adjusted.

Profiling

Note

Also see Debugging.

You have code, you want it to run fast. This is what Triton is for. But how do you know if your code is running as fast as it can? We are scientists, and if things aren’t quantified we can’t do science on them. Programming can often seem like a black box: modern computers are extremely complicated, and people can’t predict what is actually making code fast or slow anymore. Thus, you need to profile your code: get detailed performance measurements. These measurements let you know how to make it run faster.

There are many tools for profiling, and it really is one of the fundamental skills for any programming language. You should learn how to do a quick profile just to make sure things are OK, even if you aren't trying to optimize anything: you might find a quick win even if you didn't write the code yourself (for example, discovering that 90% of your time is spent on input/output).

This page is under development, but so far serves as an introduction. We hope to expand it with specific Triton examples.

Summary: profiling on Linux

First off, look at your language-specific profiling tools.

CPU profiling

This can give you a list of where all your processor time is going, either per-function or per-line. Generally, most of your time is in a very small region of your code, and you need to know what this is in order to improve just that part.

See the C and Python profiling example above.

GNU gprof

gprof is a profiler based on instrumenting your code (build with -pg). It has relatively high overhead, but gives exact information e.g. for the number of times a function is called.
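A rough sketch of the gprof workflow (the program name is a placeholder):

gcc -pg -O2 -o my_program my_program.c    # build with profiling instrumentation
./my_program                              # running it writes gmon.out in the current directory
gprof ./my_program gmon.out | less        # view the flat profile and call graph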

Perf

perf is a sampling profiler, which periodically samples events originating e.g. from the CPU performance monitoring unit (PMU). This generates a statistical profile, but the advantage is that the overhead is very low (single digit %), and one can get timings at the level of individual asm instructions. For a simple example, consider a (naive) matrix multiplication program:

Compile the program (-g provides debug symbols which will be useful later on, at no performance cost):

$ gfortran -Wall -g -O3 mymatmul.f90

Run the program via the profiler to generate profile data:

$ perf record ./a.out

Now we can look at the profile:

$ perf report
# Samples: 1251
#
# Overhead Command Shared Object Symbol
# ........ .............. ............................. ......
#
85.45% a.out ./a.out [.] MAIN__
4.24% a.out /usr/lib/libgfortran.so.3.0.0 [.] _gfortran_arandom_r4
3.12% a.out /usr/lib/libgfortran.so.3.0.0 [.] kiss_random_kernel

So 85% of the runtime is spent in the main program (symbol MAIN__), and most of the rest is in the random number generator, which the program calls in order to generate the input matrices.

Now, lets take a closer look at the main program:

$ perf annotate MAIN__
------------------------------------------------
Percent | Source code & Disassembly of a.out
------------------------------------------------
:
:
:
: Disassembly of section .text:
:
: 00000000004008b0 <MAIN__>:

: c = 0.
:
: do j = 1, n
: do k = 1, n
: do i = 1, n
: c(i,j) = c(i,j) + a(i,k) * b(k,j)
30.12 : 400a40: 0f 28 04 01 movaps (%rcx,%rax,1),%xmm0
4.92 : 400a44: 0f 59 c1 mulps %xmm1,%xmm0
12.36 : 400a47: 0f 58 04 02 addps (%rdx,%rax,1),%xmm0
40.73 : 400a4b: 0f 29 04 02 movaps %xmm0,(%rdx,%rax,1)
9.65 : 400a4f: 48 83 c0 10 add $0x10,%rax

Unsurprisingly, the inner loop kernel takes up practically all the time.

For more information on using perf, see the perf tutorial at

https://perf.wiki.kernel.org/index.php/Tutorial

Input/output profiling

This will tell you how much time is spent reading and writing data, where, and what type of patterns it has (big reads, random access, etc). Note that you can see the time information when CPU profiling: if input/output functions take a lot of time, you need to improve IO.

/usr/bin/time -v prints some useful info about IO operations and statistics.

Lowest level: use strace to print the time taken in every system call that accesses files. This is not that great:

#  Use strace to print the total bytes
strace -e trace=desc $command |& egrep 'write' | awk --field-separator='='  '{ x+=$NF } END { print x }'
strace -e trace=desc $command |& egrep 'read' | awk --field-separator='='  '{ x+=$NF } END { print x }'

# Number of calls only
strace -e trace=file -c  $command
Memory profiling

Less common, but it can tell you something about what memory is being used.

If you are making your own algorithms, memory profiling becomes more important because you need to be sure that you are using the memory hierarchy efficiently. There are tools for this.
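One concrete option, as a sketch, is Valgrind's massif heap profiler (Valgrind was already mentioned above; the program name is a placeholder):

valgrind --tool=massif ./my_program    # writes a massif.out.<pid> file
ms_print massif.out.<pid>              # pretty-print the heap profile (use the actual pid in the filename)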

MPI and parallel profiling
mpiP

mpiP: Lightweight, Scalable MPI Profiling http://mpip.sourceforge.net/. Collects statistical information about MPI functions. mpiP is a link-time library; that means it can be linked to the object file, though it is recommended that you recompile the code with -g. Debugging information is used to automatically decode the program counters to a source code filename and line number. mpiP will work without -g, but mileage may vary.

Usage example:

# assume you have you MPI flavor module loaded
module load mpip/3.4.1

# link or compile your code from scratch with -g
mpif90 -g -o my_app my_app.f90 -lmpiP -lm -lbfd -liberty -lunwind
# or
mpif90 -o my_app my_app.o -lmpiP -lm -lbfd -liberty -lunwind

# run the code normally (either interactively with salloc or as usual with sbatch)
salloc -p play --ntasks=4 srun mpi_app

If everything works, you will see the mpiP header preceding your program's stdout, and a text report file will be generated in your work directory. The file is small, so no worries about quota. Please consult the link above for an explanation of the file contents. At runtime, one can set the MPIP environment variable to change the profiler's behavior. Example:

export MPIP="-t 10.0 -k 2"
Scalasca

Available through module load scalasca

How big is my program?

Abstract

  • You can use your workstation / laptop as a base measuring stick: If the code runs on your machine, as a first guess you can reserve the same amount of CPUs & RAM as your machine has.

  • Similarly for running time: if you have run it on your machine, you should reserve similar time in the cluster.

  • The natural unit of program size on Triton is 1 CPU & 4 GB of RAM. If your program needs a lot of RAM but does not utilize the CPUs, you should try to optimize it.

  • If your program does the same thing more than once, you can estimate that the total run time is \(T \approx n_{\textrm{steps}} \cdot t_{\textrm{step}}\), where \(t_{\textrm{step}}\) is the time taken by each step.

  • Likewise, if your program runs multiple parameters, the total time needed is \(T_{\textrm{total}} \approx n_{\textrm{parameters}} \cdot T_{\textrm{single}}\), where \(T_{\textrm{single}}\) is time needed to run the program with some parameters.

  • You can also run a smaller version of the problem and try to estimate how the program will scale when you make the problem bigger.

  • You should always monitor jobs to find out how much of the requested resources they actually used (seff JOBID).

  • If you aren’t fully sure of how to scale up, contact us, the Research Software Engineers, early.

Why should you care?

There are many reasons why you should care about this question.


A cluster environment is shared among multiple users and thus all users will get their own share of the cluster resources.

The queue system will calculate your fair share of the resources based on the resource requirements you have specified.

This means that if you request more than you need, you will waste resources and you will get less resources in the near future.


If, for example, you have a program that takes a day to run a single computation and you have thousands of computations you need to do, you can estimate that you can save a lot of time optimizing the program before starting the computations.

Likewise you can find out that it is not worth the effort to optimize something you will only run once.

You can also find that something is unfeasible with the method you have chosen before you’ve invested a lot of time in implementing it.


If, for example, you have a program that you assume should finish in an hour, but it does not finish in an hour, you can infer that either the assumption was incorrect or that the program did not behave as it should.

This can often happen when a program is transferred from a desktop environment to the cluster and the program is not aware of this change.

Quite often a program really is running slower than it should because something is holding it back. Recognizing that is important.

How do you measure program size?

The program size can be measured in a couple of ways:

  • How many CPUs does my program use?

  • How much RAM does my program use?

  • How long does it take to run my program?

  • How many times do I need to run a program?

These questions can seem complicated to answer. Monitoring and profiling is one way of getting concrete numbers, but there are a couple of tricks you can use to get an estimate.

How to estimate CPU and RAM usage?
Simple measuring stick: Your own computer

If you know nothing of your program, you can still probably answer this question:

Does the program run on my own computer?

This can give you a good baseline on how big your program is. In general, you can use the following estimates to approximate your computer:

  • A typical laptop is about 4 CPUs and 16GB of RAM.

  • A typical workstation desktop is about 8 CPUs and 32GB of RAM.

  • A typical compute node starts from about 32 CPUs and 128GB of RAM, but they can range up to 128 CPUs and 512GB of RAM.

So if, for example, the program runs on your laptop, you’ll know that it should work with a request of 4 CPUs and 16GB.

In general, you can say that: server \(\approx 4 \: \cdot\) desktop \(\approx 8 \: \cdot\) laptop or more. This will give you a good initial measuring scale.

Getting a better CPU / RAM estimate: check your task manager

A simple way of getting a better estimate is to check your computer’s task manager when you are running the program.

  • In Windows you can open Task manager from the start menu or by pressing CTRL + ALT + DEL.

  • In macOS you can use Finder to launch Activity Monitor, or press CMD + ALT + ESC.

  • In Linux you can use System Monitor to see your processes.

When you’re running a program these tools will easily tell you how many CPUs the processes are using and how much memory they are using. CPU usage is a percentage of total CPU capacity. So if your machine has 4 CPUs and you see a usage of 25%, that means your program is using 1 CPU. Similarly, the memory usage is reported as a percentage of the total available memory.

In a cluster environment you can use seff JOBID for seeing how much of the reserved CPU and RAM your program used. For more information, see the monitoring documentation.

Natural unit of scale: 1 CPU = 4GB of RAM

From the previous section we can make an interesting observation: in HPC clusters, there is usually around 1 CPU for each 4 GB of RAM.


This is not a universal law, but a coincidence that has held for a couple of years due to economic reasons: these numbers usually give the best “bang for the buck”.

In other HPC clusters the ratio might be different, but it is important to know this ratio, as it is the ratio that the Slurm queue system uses when it determines the size of a job. It is very easy to calculate: just divide the available RAM by the number of CPUs (for example, 512 GB / 128 CPUs = 4 GB per CPU).

When determining how big your job is, it is useful to round up to the nearest such slot.


If your program requires a lot of RAM, but does not utilize multiple CPUs, it is usually a good idea to check whether the RAM usage can be lowered or whether you can utilize multiple CPUs via shared memory parallelism. Otherwise you’re getting billed for resources that you’re not actively using, which lowers your queue priority.

How to estimate execution time?
Simple measuring stick: Your own computer

If you have run the problem on your computer, you’ll want to use that as a measuring stick. A good first assumption is that, given the same resources, the program should run in about the same time on a compute node.

Programs that do iterations

Usually, a program does the same thing more than once. For example:

  • Physics simulation codes will usually integrate equations in discrete time steps.

  • Markov chains do the same calculation for each node in the chain.

  • Deep learning training does training over multiple epochs and the epochs themselves consist of multiple training steps.

  • Running the same program with different inputs.

If this is the case, it is usually enough to measure the time taken by a few iterations and extrapolate the total runtime from that.

If the time taken by each step is \(t_{\textrm{step}}\), then the total runtime \(T\) is approximately \(T \approx n_{\textrm{steps}} \cdot t_{\textrm{step}}\).
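
For example, a quick way to get \(t_{\textrm{step}}\) is to time a short run and scale up (the program name and its --steps option are made up for illustration):

# time a short run of 10 steps on your own machine or an interactive node
time ./my_simulation --steps 10
# if those 10 steps took 30 s, then t_step is about 3 s, so a 10000-step run needs
# roughly T = 10000 * 3 s = 30000 s, about 8.3 h (request e.g. 10 h to leave a margin)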


Do note that if you’re planning on running the same calculation multiple times with different parameters and/or datasets, you can estimate the total time needed as \(T_{\textrm{total}} \approx n_{\textrm{parameters}} \cdot T_{\textrm{single}}\). In these cases array jobs can often be used to split the calculation into multiple jobs.

Programs that run a single calculation

For programs that run a single calculation you can estimate the runtime by solving smaller problems. By running a smaller problem on your own computer and then estimating how much bigger the bigger problem is, you can usually get a good estimate on how much time it takes to solve the bigger problem.

For example, let’s consider a situation where you need to calculate various matrix operations such as multiplications, inversions etc. Now if the smaller problem uses a matrix of size \(n^{2}\) and the bigger problem uses a matrix of size \(m^{2}\), you can calculate that the ratio of the bigger problem to the initial problem is \(r = (m / n)^{2}\).


So if solving the smaller problem takes time \(T_{\textrm{small}}\), then you could estimate that the time taken by the bigger problem is at least \(T_{\textrm{large}} \approx r \cdot T_{\textrm{small}} = (m / n)^{2} \cdot T_{\textrm{small}}\).
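
For example, if the \(1000 \times 1000\) case takes 10 seconds on your laptop, the \(4000 \times 4000\) case should take at least \((4000 / 1000)^{2} \cdot 10 = 160\) seconds, and in practice probably more.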

This estimate is most likely a bad estimate (most linear algebra algorithms do not scale with \(O(n^{2})\) complexity), but it is a better estimate than no estimate at all.

It is especially important to notice if your problem scales as \(O(n!)\). These kinds of problems can quickly become very time consuming. Problems that involve permutations such as the travelling salesman problem are famous for their complexity.


If you’re interested in the topic, a good introduction is this excellent blog series on Big O notation.

Quotas

Triton has quotas which limit both space usage and the number of files. The quota for your home directory is 10GB, for $WRKDIR it is 200GB by default, and for project directories it depends on the request (as of 2021). These quotas exist to avoid usage exploding without anyone noticing. If you ever need more space, just ask. We’ll either give you more or find a solution for you.

There is an inode (number of files) quota of 1 million, because scratch does not handle large numbers of small files well. If you have too many small files, see the page on small files.

Useful commands

  • quota - print your quota and usage

  • du -h $HOME | sort -h: print all directories and subdirectories in your home directory, sorted by size. This lets you find out where space is being used. $HOME can be replaced with any other directory (or left off for the current directory). Use du -a to list all files, not only directories.

    • du -h --max-depth=1 $HOME | sort -h: Similar, but only list down to --max-depth levels.

    • du --inodes --max-depth=1 $HOME | sort -n: Similar, but list the number of files in the directories.

  • rm removes a single file, rm -r removes a whole directory tree. Warning: on scratch and Linux in general (unless backed up), there is no recovery from this!! Think twice before you push enter. If you have any questions, come to a garage and get help.

  • conda clean cleans up downloaded conda files (but not environments).

Lustre (scratch/work) quotas

Note

Before 2021-09-15, quotas worked differently, and used group IDs rather than project IDs. There were many things that could go wrong and give you “disk quota exceeded” even though there appeared to be enough space.

There are both quotas for users and projects (/m/$dept/scratch/$project). We use project IDs for this (see detailed link in See Also), and our convention is that project IDs are the same as numeric group IDs. The quota command shows the correct quotas (by project) by default, so there is nothing special you should need to do.

If you want to look deeper, check the project ID with lfs project -d {path} and quotas with lfs quota -hp {project_id}.
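
For example, a quick sketch (the path and the project ID 123456 below are only illustrative):

lfs project -d /scratch/work/$USER      # shows the project ID of the directory
lfs quota -hp 123456 /scratch           # then check the quota for that project ID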

Unlike in the previous situation, there should be far fewer possible quota problems.

Home directory quotas

Home directories have a quota, and unlike scratch, space for home is much more limited. We generally don’t increase home directory quotas, but we can help you move stuff to scratch for the cases that fill up your home directory (e.g. installing Python or R packages, which go to home by default).

Project/archive (“Aalto Teamwork”)

The project/archive directories use a completely different system from scratch (though quotas work similarly), even if they are visible on Triton. Quotas for these are managed through your department or IT Services.

Singularity Containers

A container is basically an operating system within a file: by including all the operating system support files, software inside of it can run (almost) anywhere. This is great for things like clusters, where the operating system has to be managed very conservatively yet users have all sorts of bleeding-edge needs.

The downside is that it’s another thing to understand and manage. Luckily, most of the time containers for the software already exist, and using them is not much harder than other shell scripting.

What are containers?

As stated above, the basic idea is that software is packaged into a container which basically contains the entire operating system. This is done via an image definition file (Dockerfile, Singularity definition file .def), which is itself interesting because it contains a script that builds the whole image automatically - which makes it reproducible and shareable. The image itself is the data which contains the operating system and software.

During runtime, the root file system / is used from inside the image and other file systems (/scratch, /home, etc.) can be brought into the container through bind mounts. Effectively, the programs in the container are run in an environment mostly defined by the container image, but the programs can read and write specific files in Triton - all the data you need to operate on. Typically, e.g. the home directory comes from Triton.

This sounds complicated, but in practice it is not too hard once you see an example and can copy the commands to run. For images managed by Triton admins themselves, this is easy due to the singularity_wrapper tool we have written for Triton. You can also run singularity on Triton without the wrapper, but you may need to e.g. bind /scratch yourself to access your data.

The hardest part of using containers is keeping track of files inside vs outside: You specify a command that gets run inside the container image. It mostly accesses files inside the image, but it can access files outside if you bind-mount them in. If you ever get confused, use singularity shell (see below) to enter the container and see what is going on.

About Singularity

Docker is the most commonly talked about container runtime, but most clusters use Singularity. The following table should make the reasons clear:

  • Docker is designed for infrastructure deployment; Singularity is designed for scientific computing.

  • Docker runs as an operating system service; Singularity is a user application.

  • In practice, Docker gives root access to the whole system; Singularity does not give or need extra permissions to the system.

  • Docker images are stored in layers in hidden operating system locations, opaquely managed through some commands; a Singularity image is one .sif file which you manage using normal commands.

Docker is still a standard image format, and there are ways to convert images between the formats. In practice, if you can use Docker, you can also use Singularity by converting your image (commands on this page) and running it by copying other commands on this page.

Singularity with Triton’s pre-created modules

Some of the Triton modules automatically activate a Singularity image. On Triton, you just need to load the proper module. This will set some environment variables and enable the use of singularity_wrapper (to see how it works, check module show MODULE_NAME).

While the image itself is read-only, remember that /home, /m, /scratch and /l etc. are not. If you edit/remove files in these locations within the image, that will happen outside the image as well.

singularity_wrapper is written so that when you load a module written for a singularity image, all the important options are already handled for you. It has three basic commands:

  1. singularity_wrapper shell [SHELL] - Gives user a shell within the image (specify [SHELL] to say which shell you want).

  2. singularity_wrapper exec CMD - Executes a program within the image.

  3. singularity_wrapper run PARAMETERS - Runs the singularity image. What this means depends on the image in question - each image will define a “run command” which does something. If you don’t know what this is, use the first two instead.
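
For example, a minimal session might look like this (MODULE_NAME and the command are placeholders for whatever image-backed module and program you actually use):

$ module load MODULE_NAME
$ singularity_wrapper exec python3 --version   # run one command inside the image
$ singularity_wrapper shell                    # or open an interactive shell inside it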

Under the hood, singularity_wrapper does this:

  1. Choosing appropriate image based on module version

  2. Binding of basic paths (-B /l:/l, /m:/m, /scratch:/scratch)

  3. Loading of system libraries within images (if needed) (e.g. -B /lib64/nvidia:/opt/nvidia)

  4. Setting working directory within image (if needed)

Singularity commands

This section describes using Singularity directly, with you managing the image file and running it.

Convert a Docker image to a Singularity image

If you have a Docker image, it has to be on a registry somewhere (since they don’t exist as standalone files). You can pull to convert it to a .sif file (remember to change to a scratch folder with plenty of space first):

$ cd $WRKDIR
$ singularity build IMAGE_OUTPUT.sif docker://GROUP/IMAGE_NAME:VERSION

If you are running on your own computer with Docker and Singularity both installed, you can use a local image like this (and then you need to copy it to the cluster):

$ singularity build IMAGE_OUTPUT.sif docker-daemon://LOCAL_IMAGE_NAME:VERSION

This will store the Docker layers in $HOME/.singularity/cache/, which can result in running out of quota in your home folder. In a situation like this, you can then clean the cache with:

singularity cache clean

You can also use another folder for your singularity cache by setting the SINGULARITY_CACHEDIR variable. For example, you can set it to a subfolder of your $WRKDIR with:

export SINGULARITY_CACHEDIR=$WRKDIR/singularity_cache
mkdir $SINGULARITY_CACHEDIR
Create your own image

See the Singularity docs on this. You create a Singularity definition file NAME.def, and then:

$ singularity build IMAGE_OUTPUT.sif NAME.def
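
For reference, a minimal definition file might look something like this (a sketch only; the base image and installed packages are arbitrary examples):

Bootstrap: docker
From: ubuntu:22.04

%post
    apt-get update && apt-get install -y python3

%runscript
    exec python3 "$@"
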
Running containers

These are the “raw” singularity commands. If you use these, you have to configure the images and bind mounts yourself (which is done automatically by singularity_wrapper). If you module show NAME on a singularity module, you will get hints about what happens.

  • singularity shell IMAGE_FILE.sif will start a shell inside of the image. This is great for understanding what the image does.

  • singularity exec IMAGE_FILE.sif COMMAND will run COMMAND inside of the image. This is how you would script it for batch jobs, etc.

  • singularity run IMAGE_FILE.sif is a lot like exec, but will run some pre-configured command (defined as part of the image definition). This might be useful when using a pre-made image. If you make an image executable, you can do this by running the image directly: ./IMAGE_FILE.sif [COMMAND]

  • The extra arguments --bind=/m,/l,/scratch will make the important Triton data filesystems available inside of the container ($HOME is bound by default). You may want to add $PWD for your current working directory. See the combined example after this list.

  • --nv provides GPU access (though sometimes more is needed).
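
Putting these together, a typical invocation might look like this (the image and script names are placeholders):

$ singularity exec --bind=/m,/l,/scratch --nv IMAGE_FILE.sif python3 my_script.py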

Examples

Writable container image that can be updated

Sometimes, it is too much work to completely define an image before building it: it is more convenient to update it incrementally, just like your own computer. You can make a writable image directory using singularity build --sandbox and then make permanent changes to it by running with singularity [run|exec|shell] --writable (see the sketch after the list below). You could, for example, pull an Ubuntu image and then slowly install things in it.

But note these disadvantages:

  • The image isn’t reproducible: you don’t have the definition file to make it, so if it gets messed up you can’t go back. Being able to delete and reproduce is very useful.

  • There isn’t an efficient, single-file image: instead, there are tens of thousands of files in a directory. You get the problems of many small files. If you run this many times, use singularity build SINGLE_FILE.sif WRITEABLE_DIRECTORY_IMAGE/ to convert it to a single file.
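
A sketch of that workflow (image names are placeholders; installing system packages inside the sandbox may additionally require --fakeroot or building on a machine where you have root):

$ singularity build --sandbox my_image_dir/ docker://ubuntu:22.04
$ singularity shell --writable my_image_dir/      # changes made here persist in the sandbox
$ singularity build my_image.sif my_image_dir/    # convert to a single .sif once things are stable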

Small files

Millions of small files are a huge problem on any filesystem. You may think /scratch, being a fast filesystem, doesn’t have this problem, but it’s actually worse here. Lustre (scratch) works like an object store, and stores file data separately from metadata. This means that each file access requires multiple different network requests, and making a lot of files brings your research (and managing the cluster) to a halt. What counts as a lot? Your default quota is 1e6 files. 1e4 files for a project is not a lot. 1e6 for a single project is.

You may have been directed here because you have a lot of files. In that case, welcome to the world of big data, even if your total size isn’t that much! (it’s not just size, but difficulty of handling using normal tools) Please read this and see what you can learn, and ask us if you need help.

This page is mostly done, but specific examples could be expanded.

The problem with small files

You know Lustre is high performance and fast. But, there is a relatively high overhead for accessing each file. Below, you can see some sample transfer rates, and you can see that total performance drops drastically when files get small. (These numbers were for the pre-2016 Lustre system, it’s better now but the same principle applies.) This isn’t just a problem when you are trying to read files, it’s also a problem when managing, moving, migrating, etc.

File size    Net transfer rate, many files of this size
---------    ------------------------------------------
10GB         1100 MB/s
100MB        990 MB/s
1MB          90 MB/s
10KB         0.9 MB/s
512B         0.04 MB/s

Why do people make millions of small files?

We understand the reasons people make lots of files: it’s convenient. Here are some of the common problems (and alternative solutions) people may be trying to solve with lots of files.

  • Flat files are a universal format. If you have everything in its own file, then any other program can look at any piece of data individually. It’s convenient. This is a fast way to get started and use things.

  • Compatibility with other programs. Same as above.

  • Ability to use standard unix shell tools. Maybe your whole preprocessing pipeline is putting each piece of data in its own file and running different standard programs on it. It’s the Unix way, after all.

  • Using the filesystem as your index. Let’s say you have a program that reads/writes data which is selected by different keys. It needs to locate the data for each key separately. It’s convenient to put each of these in its own file: the filename takes the role of a database index, and you simply open the file with the name of the key you need. But the filesystem is not a good index.

    • Once you get too many files, a database is the right tool for the job. There are databases which operate as single files, so it’s actually very easy.

  • Concurrency: you use filesystem as the concurrency layer. You submit a bunch of jobs, each job writes data to its own file. Thus, you don’t have to worry about problems with appending to the same file/database synchronization/locking/etc. This is actually a very common reason.

    • This is a big one. The filesystem is the most reliable way to join the output of different jobs (for example an array job), and it’s hard to find a better strategy. It’s reasonable to keep doing this, and combine job outputs in a second stage to reduce the number of files.

  • Safety/security: the filesystem isolates different files from each other, so if you modify one, there’s less chance of corrupting any other ones. This goes right along with the reason above.

  • You only access a few files at a time in your day to day work, so you never realize there’s a problem. However, when we try to manage data (migrate, move, etc), then a problem comes up.

  • Realize that forking processes has similar overhead. Small reads are also non-ideal, but less bad(?).

Strategies
  • Realize that you will have to change your workflow. You can’t do everything with grep, sort, wc, etc. anymore. Congratulations, you have big data.

  • Consider the right strategy for your program: a serious program should provide options for this.

    • For example, I’ve seen some machine learning frameworks which provide an option to compress all the input data into a single file that is optimized for reading. This is precisely designed for this type of case. You could read all the files individually, but it’ll be slower. So in this case, one should first read the documentation and see there’s a solution. One would take all the original files and make the processed input files. Then, take the original training data, package it together in one compressed archive for long-term storage. If you need to look at individual input files, you can always decompress one by one.

  • Split - combine - analyze

    • Continue like you have been doing: each (array?) job makes different output files. Then, after running, combine the outputs into one file/database. Clean up/archive the intermediate files. Use this combined DB/file to analyze the data in the long term. This is perhaps the easiest way to adapt your workflow.

  • HDF5: especially for numerical data, this is a good format for combining your results. It is like a filesystem within a file, you can still name your data based on different keys for individual access.

  • Unpack to local disk, pack to scratch when done.

    • Main article: Compute node local drives,

    • This strategy can be combined with many of the other strategies below

    • This strategy is especially good when your data is write-once-read-many. You package all of your original data into one convenient archive, and unpack it to the local disk when you need it. You delete it when you are done.

  • Use a proper database suitable for your domain (sqlite): Storing lots of small data where anything can be quickly findable and you can do computation efficiently is exactly what databases do. It can be difficult to have a general purpose database work for you, but there are a wide variety of special-purposes databases these days. Could one of them be suitable for storing the results of your computation for analysis?

    • Note that if you are really doing high-performance random IO, putting a database on scratch is not a good idea, and you need to think more.

    • Consider combining this with local disk: You can copy your pre-created database file to local disk and do all the random access you need. Delete when done. You can do modification/changes directly on scratch if you want.

  • key-value stores: A string key stores arbitrary data.

    • This is a more general database, basically. It stores arbitrary data for a certain key.

  • Read all data to memory.

    • A strategy for using many files. Combine all data into one file, read them all into memory, then do the random access in memory.

  • Compress them down when done.

    • It’s pretty obvious: when you are done with files, compress all of them into one archive (see the tar sketch after this list). You have the archive and can always unpack it when needed. You should do this at least when you are done with a project: if everyone did this, the biggest problems would be solved.

  • Make sure you have proper backups for large files, mutating files introduces risks!

    • If you do go using these strategies, make sure you don’t accidentally lose something you need. Have backups (even if it’s on scratch: backup your database files)

  • If you do have to keep many small files, see the Lustre performance tuning link at the end of this page.

  • If you have other programs that can only operate on separate files

    • This is a tough situation, investigate what you can do combining the strategies above. At least you can pack up when done, and possibly copying to local disk while you are accessing is a good idea.

  • MPI-I/O: if you are writing your own MPI programs, this can parallelize output
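
As an example of the “compress them down when done” strategy (directory and file names are placeholders):

# pack a finished results directory into a single archive and remove the originals
tar czf results_project1.tar.gz results_project1/ && rm -r results_project1/
# later, list or extract only what you need
tar tzf results_project1.tar.gz | less
tar xzf results_project1.tar.gz results_project1/run_042/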

Specific example: HDF5 for numerical data, or some database

HDF5 is essentially a database for numerical data. You open a HDF5 file and access different data by path - the path is like a filename. There are libraries for accessing this data from all relevant programming languages.

If you have some other data that is structured, there are other databases that will work. For example, sqlite is a single-file, serverless database for relational data, and there are similar tools for time series or graphs.
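
As a small illustration of the single-file database idea (the file, table, and key names are made up), sqlite can even be driven straight from the shell:

# store results in one database file instead of thousands of small files
sqlite3 results.db "CREATE TABLE IF NOT EXISTS results (key TEXT PRIMARY KEY, value TEXT);"
sqlite3 results.db "INSERT OR REPLACE INTO results VALUES ('run_042', '0.731');"
# query a single result back by its key
sqlite3 results.db "SELECT value FROM results WHERE key='run_042';"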

Specific example: Unpacking to local disk

You can see examples at compute node local drives.

Specific example: Key-value stores

Let’s say you have written all your own code and want an alternative to files. Instead, use a key-value database. You open one file, and store your file contents under different keys. When you need the data out, you request it by that key again. The keys take the place of filenames. Anytime you would open files, you just access from these key-value stores. You also have ways of dumping and restoring the data if you need to analyze it from different programs.

Performance tuning for small files

See here: Data storage on the Lustre file system

Triton ssh key fingerprints

ssh key fingerprints allow you to verify the server you are connecting to. The usual security model is that once you connect once, you save the key and can always be sure you are connecting to the same server from then on. To be smarter, you can actually verify the keys the first time you connect - thus, they are provided below.

You can verify SSH key fingerprints with a command like:

ssh-keygen -l -E sha256 -f <(ssh-keyscan triton.aalto.fi 2>/dev/null)

Here are the SSH key fingerprints for Triton:

256 SHA256:04Wt813WFsYjZ7KiAyo3u6RiGBelq1R19oJd2GXIAho no comment (ECDSA)
256 SHA256:1Mj2Gpf6iinwni/Yf9g/b/wToaUaOU87szzzCtibj6I no comment (ED25519)
2048 SHA256:glizQJUQKoGcN2aTtp9JtXuJjJtnrKxRD8yImE06RJQ no comment (RSA)

and the same but with md5 hashes:

256 MD5:ac:61:86:86:e1:11:29:f5:46:23:d8:25:00:8a:7b:f0 no comment (ECDSA)
256 MD5:1d:e7:c9:f6:92:a1:c0:65:10:97:d7:72:7d:4c:82:5a no comment (ED25519)
2048 MD5:a4:73:89:ae:8c:a5:ea:2a:04:76:cc:0b:6a:f7:e6:9a no comment (RSA)

Or these host keys can be copied and pasted directly into your .ssh/known_hosts file:

triton.aalto.fi ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDk8MvTSB2gYZf9Y969vhMczdGSO+rNGZQhZLUGMkMduq4q+b/LpHCn/yH1JN8NWeIDt8NELdnl+/0hmk/zk7IHxtnPvNbZuAYO1T1Hh7Kk72zQFOESHqmbYcPH5SDf12XfNYJ6cQIqHRaF4QT483+f9fvUlp7E+MKQlr3+NreKm4AHdTcHjqW75r1Mh/z0q9Qoqdgn3gDCzmN6+Y0aGyf4wICMJlKUBQP0muqSfYWX43StaPh+hoOQFYOiK1jOVEBY/HFXOuDzgCCG2b9qWhTrA3svcSKK4E6X76sXOR+8FTbC7u9xnLgm+903+zsGfsEQY2eNXfR7YChNxz4y5ASf
triton.aalto.fi ecdsa-sha2-nistp256 AAAAE2VjZHNhLXNoYTItbmlzdHAyNTYAAAAIbmlzdHAyNTYAAABBBAZvw6Bgs+cPGFjwqMABAGC+cG2bBYR69+Hc5ChxQhwNwCW1zCg6w/pAerbr+A6IzJDx8uN03bcTZj+xzLH2kLE=
triton.aalto.fi ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIDumqy+fbEwTOtyVlPGqzS/k4i/hJ8L+kUDf6MpWO1OI

There is also a page for ssh host keys for the Aalto shell servers kosh, lyta, brute, force

Storage

See also

These pages have details that go beyond this page:

This page gives an overview of more advanced storage topics. You should read the storage and remote data tutorials first.

Checklist

Do any of these apply to you? If so, consider your situation and ask us for help!

If you have been sent this checklist because your jobs may be doing a lot of IO, don’t worry. It’s not necessarily a problem, but please go through this checklist and let us know what applies to you so we can give some recommendations.

  • Many small files being accessed in jobs (a hundred or more).

  • Files with extremely random access, in particular databases or database-like things (hdf5).

  • Files being read over and over again. Alternatives: copy to local disks, or read once and store in memory.

  • Number of files growing, for example all your runs have separate input files, output files, Slurm output files, and you have many runs.

  • Constantly logging to certain files, writing to files from many parallel jobs at the same time.

  • Reading from single files from many parallel jobs or threads at the same time.

  • Is all your IO concentrated at one point, or spread out over the whole job?

( and if we’ve asked you specifically about your jobs, could you also describe what kind of job it is, the type of disk read and write happens, and in what kinds of pattern? Many small files, a few large ones, reading same files over and over, etc. How’s it spread out across jobs? )

If you think your IO may have bad patterns or even you just want to talk to make sure, let one of the Triton staff know or submit an issue in the issue tracker.

Checking your jobs’ IO usage

You can check the total disk read and write of your past jobs using:

# All your recent jobs:
sacct -o jobid%10,user%8,jobname%10,NodeList,MaxDiskRead,MaxDiskWrite -u $USER
# A single jobid
sacct -o jobid%10,user%8,jobname%10,NodeList,MaxDiskRead,MaxDiskWrite -j $jobid

These statistics are calculated on the whole node and thus include IO caused by other jobs on the same server while your job is running.

More advanced tools are being tested and will be available once they are finished.

Loading data for machine learning

As we’ve said before, modern GPUs are super data-hungry when used for machine learning. If you try to open many files to feed it the data, “you’re going to have a bad time”. Luckily, different packages have different solutions to the problem.

In general, at least try to combine all of your input data into some sort of single file that can be read in sequence.

Try to do the least amount of work possible in the core training loops: any CPU usage, print, logging, preprocessing, postprocessing, etc. reduces the amount of time the GPU is working unless you do it properly (Amdahl’s law).

(more coming later)

Remote workflows at Aalto

Note

The more specific remote access instructions for scicomp are at Remote Access (a recent email had duplicate links to this page). This page explains the options, including other systems.

How can you work from home? For that matter, how can you work on more than your desktop/laptop while at work? There are many options which trade off between graphical interfaces and more power. Read more for details.

You have most likely created your own workflow to analyse data at Aalto and most likely you are using a dedicated desktop workstation in Otaniemi. However, with increased mobility of working conditions and recent global events that recommend tele-working, you might be asking yourself: “how do I stop using my workstation at the dept, and get analysis/figures/papers done from home?”.

Data analysis workflows from remote might not be familiar to everyone. We list here a few possible cases; this page will expand according to the needs and requests of the users.

What’s your style?

If you need the most power or flexibility, use Triton for your data storage and computation. To get started, you can use Jupyter (4) and VDI (3) which are good for developing and prototyping. Then to scale up, you can use the Triton options: 6, 7, 8 which have access to the same data. (Triton account required for 4-8).

If you need simple applications with a graphical interface, try 3 (VDI).

If you use your own laptop/desktop (1, 2), then it’s good for getting started but you have to copy your data and code back and forth once you need to scale up.

Computing options at Aalto University

< Overview of all the computing options at Aalto University >

Summary table for remote data analysis workflows
  • Good for data security: 3, 4, 5, 6, 7

  • Good for prototyping, working on the go, doing tests, interactive work: 1, 2, 3, 4, 5

  • Shares Triton data (e.g. scratch folders): 3, 4, 5, 6, 7

  • Easy to scale up, shares software, data, etc: 4, 5, 6, 7

  • Largest resources available 7 (medium: 6)

  1. Own laptop/desktop computer
     Pros: Can work from anywhere. Does not require an internet connection. You are in control.
     Cons: Not good for personal or confidential data. Computing resources might not be enough. Accessing large data stored at Aalto remotely might be problematic - you will end up having to copy a lot. You have to manage software yourself.
     Recommendation: Excellent for prototyping, working on the go, doing tests, interactive work (e.g. making figures). Don’t use it with large data or confidential / personal data.
     Triton data: N

  2. Aalto laptop
     Pros: Same as above, plus the same tools available to an Aalto employee.
     Cons: Same as above.
     Recommendation: Same as above.
     Triton data: N

  3. Remote virtual machine (https://vdi.aalto.fi)
     Pros: Computing happens remotely. Data access happens remotely, so it is more secure.
     Cons: Computing resources are limited.
     Recommendation: Excellent for prototyping, working on the go, doing tests, interactive work (e.g. making figures). More secure access to data.
     Triton data: Y

  4. Aalto JupyterHub (https://jupyter.triton.aalto.fi)
     Pros: Cloud based - resume work from anywhere. Includes command line (#6) and batch (#7) easily. Same data as seen on Triton (/scratch/dept/ and /work/ folders).
     Cons: Jupyter can become a mess if you aren’t careful. You need to plan to scale up with #7 eventually, once your needs increase.
     Recommendation: Excellent for prototyping, working on the go, doing tests, interactive work (e.g. making figures). Secure access to data. Use if you know you need to switch to batch jobs eventually (#7).
     Triton data: Y

  5. Interactive graphical session on Triton HPC (ssh -X)
     Pros: Graphical programs.
     Cons: Lost once your internet connection dies, needs a fast internet connection.
     Recommendation: If you need specific graphical applications which are only on Triton.
     Triton data: Y

  6. Interactive command line session on Triton HPC (ssh + sinteractive)
     Pros: Works from anywhere. Can get lots of resources for a short time.
     Cons: Limited time limits, must be used manually.
     Recommendation: A general workhorse once you get comfortable with the shell - many people work here + #7.
     Triton data: Y

  7. Non-interactive batch HPC computing on Triton (ssh + sbatch)
     Pros: Largest resources, bulk computing.
     Cons: Need to script your computation.
     Recommendation: When you have the largest computational needs.
     Triton data: Y

  8. Non-interactive batch HPC computing on CSC (ssh + sbatch)
     Pros: Similar to #7 but at CSC.
     Cons: Similar to #7.
     Recommendation: Similar to #7.
     Triton data: N

1. Own laptop/desktop computer

Description: Here you are the administrator. You might be working from a cafe with your own laptop, or from home with a desktop. You should be able to install any tool you need. As an Aalto employee you get access to many nice commercial tools for your private computers. Visit: https://download.aalto.fi/index-en.html and https://aalto.onthehub.com/ for some options.

Pros: Computing freedom! You can work anywhere, you can work when there is no internet connection, you do not share the computing resources with other users so you can fully use the power of your computer.

Cons: If you work with personal or confidential data, the chances of a data breach increase significantly, especially if you work from public spaces. Even if you encrypt your hard disks (see https://www.aalto.fi/en/cyber-security-hub-under-construction/aalto-it-securitys-top-10-tips-for-daily-activities) and even if you are careful, you might forget to lock your computer or somebody behind you might see which password you type. Furthermore, personal computers have limited resources when it comes to RAM/CPUs/GPUs. When you need to scale up your analysis, you want to move it to an HPC cluster, rather than leaving scripts running for days. Finally, although you can connect your Aalto folders to your laptop (see Remote Access and Remote access to data), when the data size is too big, it is very inefficient to analyse large datasets over the internet.

Recommendation: Own computer is excellent for prototyping data analysis scripts, working on the go, doing tests or new developments. You shouldn’t use this option if you are working with personal data or with other confidential data. You shouldn’t use this option if your computational needs are much bigger.

2. Aalto laptop

Description: As an Aalto employee, you are usually provided with a desktop workstation or with an Aalto laptop. With an Aalto laptop you can apply for administrator rights (link to the form), and basically everything you have read for option 1 above is valid also in this case. See “Aalto {Linux|Mac|Windows}” on scicomp’s Aalto section at https://scicomp.aalto.fi/aalto/.

Pros/Cons/Recommendation: see option 1 above. But, when on Aalto networks, you have easier access to Aalto data storage systems.

3. Remote virtual machine with VDI

Description: You might be working with very large datasets or with confidential/personal data, so that you cannot or do not want to copy the data to your local computer. Sometimes you use many computers, but would like to connect to “the same computer” from remote where a longer analysis script might be crunching numbers. Aalto has a solution called VDI https://vdi.aalto.fi (description at aalto.fi) where you can get access to a dedicated virtual machine from remote within the web browser. Once logged in, you can pick if you prefer Aalto Linux or Aalto Windows, and then you see the same interface that you would see if you logged in from an Aalto dedicated workstation. To access Triton data from the Linux one, use the path /m/{dept}/scratch/ (just like Aalto desktops).

Pros: The computing processes are not going to run on your local computer, computing happens on remote which means that you can close your internet connection, have a break, and resume the work where you left it. There is no need to copy the data locally as all data stays on remote and is accessed as if it was a desktop computer from the campus.

Cons: VDI machines have a limited computing power (2 CPUs, 8GB of RAM). So they are great for small prototyping, but for a large scale computation you might want to consider Aalto Triton HPC cluster. The VDI session is not kept alive forever. If you close the connection you can still resume the same session within 24h, after that you are automatically logged out to free resources for others. If you have a script that needs more than 24h, you might want to consider Aalto Triton HPC.

Recommendation: VDI is excellent when you need a graphical interactive session and access to large data or to personal/confidential data without the risks of a data breach. Use VDI for small analyses or interactive development; we do not recommend it when the execution time of your scripts starts to exceed a 7-hour working day.

4. Aalto Jupyterhub

Description: Jupyter notebooks are a way of interactive, web-based computing: instead of either scripts or interactive shells, the notebooks allow you to see a whole script + output and experiment interactively and visually. They are good for developing and testing things, but once things work and you need to scale up, it is best to put your code into proper programs. Triton’s JupyterHub is available at https://jupyter.triton.aalto.fi . Read more about it at: https://scicomp.aalto.fi/triton/apps/jupyter.html. Triton account required.

Pros: JupyterHub has similar advantages to #3 (VDI), although data and code are accessed through the JupyterHub interface. In addition, things can stay running in the cloud. Although it can be used with R or Matlab, Python users will most likely find this to be a very familiar and comfortable prototyping environment. Similar to the VDI case, you can resume your workflow (there are sessions of different lengths). You can also access Triton shell and batch (#6, #7) in the Jupyter interface, and it’s easy to scale up and use them all together.

Cons: You are limited to the Jupyter interface (but you can upload/download data, and integrate with many other things). Jupyter can become a mess if you aren’t careful. Computationally, an instance will always have limited CPUs and memory. Once you need more CPU/RAM, look into options #6 and #7 - they work seamlessly with the same data, software, etc.

Recommendation: Good for exploration and prototyping, access to large dataset, access to confidential/personal data. For more computational needs, be ready to switch to batch jobs (#7) once you are done prototyping.

5. Interactive graphical session on Triton HPC

Description: Sometimes what you can achieve with your own laptop or with VDI is not enough when it comes to computing resources. However, your workflow does not yet allow you to go fully automatic as you still need to manually interact with the analysis process (e.g. point-click analysis interfaces, doing development work, making figures, etc). An option is to connect to triton.aalto.fi with a graphical interface. This is usually done with ssh -X triton.aalto.fi. For example you can do it from a terminal within a VDI Linux session. Once connected to the triton log-in node, you can then request a dedicated interactive node with command sinteractive, and you can also specify the amount of CPU or RAM you need (link to sinteractive help page). Triton account required.

Pros: This is similar to the VDI case above (#3) without the computing limitation imposed by VDI.

Cons: If you connect to triton.aalto.fi from your own desktop/laptop, your internet connection might be limiting the speed of the graphical session, making it very difficult to use graphical IDEs or other tools. In that case, move to VDI, which optimises how the images are transferred over the internet. Sinteractive sessions cannot last for more than 24 hours; if you need to run scripts that have high computational requirements AND a long execution time, the solution for you is to go fully non-interactive using Triton HPC with Slurm (case #7).

Recommendation: This might be one of the best scenarios for working from remote with an interactive graphical session. Although you cannot keep the session open for more than 24 hours, you can still work on your scripts/code/figures interactively without any limitation and without any risks of data breaches.

6. Interactive command line only session on Triton HPC/dept workstation

Description: sometimes you do not really need a graphical interface, because you are interactively running scripts that do not produce or need graphical output. This is the same case as sinteractive above, but without the limitation of the 24h session. The best workflow is to: 1) connect to Triton with ssh triton.aalto.fi 2) start a screen/tmux session that can be detached / reattached in case you lose the internet connection or in case you need to leave the interactive script running for days 3) request a dedicated interactive terminal with the command srun -p interactive --time=HH:MM:SS --mem=nnG --pty bash (see other examples at https://scicomp.aalto.fi/triton/tut/interactive.html or https://scicomp.aalto.fi/triton/usage/gpu.html for interactive GPU) 4) get all your numbers crunched and remember to close the session once you are done. Please note that, if you have a dedicated Linux workstation at a department at Aalto, you can also connect to your workstation and use it as a remote computing node fully dedicated to you. The resources are limited to your workstation, but you won’t have the time constraint or the need to queue for resources if Triton’s queue is overcrowded. Triton account required.
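
For example, the workflow above might look like this in practice (the session name, time, memory, and script name are just examples):

ssh triton.aalto.fi
screen -S analysis                                        # or tmux; survives lost connections
srun -p interactive --time=04:00:00 --mem=8G --pty bash
./run_my_analysis.sh                                      # your long-running interactive work
# detach with Ctrl-a d and reattach later with: screen -r analysis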

Pros: when you do not need a graphical interface and when you need to run something interactively for days, this is the best option: high computing resources, secure access to data, persistent interactive session.

Cons: when you request an interactive command line session you are basically submitting a slurm job. As with all jobs, you might need to wait in the queue according to the amount of resources you have requested. Furthermore, jobs cannot last more than 5 days. In general, if you have an analysis script that needs more than 5 days to operate, you might want to identify if it can be parallelized or split into sub-parts with checkpoints.

Recommendation: this is the best option when you need long-lasting computing power and large data/confidential data access with interactive input from the user. This is useful once you have your analysis pipeline/code fully developed so that you can just run the scripts in command line mode. Post processing/figure making can then happen interactively once your analysis is over.

7. Non-interactive batch computing on Triton HPC

Description: this is the case when no interactive input is needed to process your data. This is extremely useful when you are going to run the same analysis code hundreds of times. Please check the more detailed descriptions at https://scicomp.aalto.fi/triton/index.html and, if you haven’t, go through the tutorials at https://scicomp.aalto.fi/triton/index.html#tutorials. Triton account required.
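
As a minimal sketch (the resource values and file names are placeholders; see the tutorials above for real examples), a batch job is just a small script submitted with sbatch:

#!/bin/bash
#SBATCH --time=02:00:00
#SBATCH --mem=4G
#SBATCH --cpus-per-task=1
#SBATCH --output=analysis_%j.out

srun ./my_analysis input_001.dat

You would save this as e.g. my_job.sh and submit it with sbatch my_job.sh.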

Pros: when it comes to large scale data analysis, this is the most efficient way to do it. Having a fully non-interactive workflow also makes your analysis reproducible as it does not require any human input which can sometimes be the source of errors or other irreproducible/undocumented steps.

Cons: as this is a non-interactive workflow, it is not recommended for generating figures or for graphical tools that do not allow “batch” mode operations.

Recommendation: this is the best option when you need long-lasting parallel computing power and large data/confidential data access. This is also recommended from reproducibility/replicability perspective since, by fully removing human input, the workflow can be made fully replicable.

8. Non-interactive batch HPC computing at CSC

Description: this case is similar to #7. You can read/learn more about this option at https://research.csc.fi/guides

Pro/Cons/Recommendation: see #7.

See also

Cheatsheets: Triton

Aalto Research Software Engineers

Skills to do science are different than skills to write good research code. The Aalto Research Software Engineers (AaltoRSE) provide support and mentoring to those using computing and data so that everyone can do the best possible work.

Research Software Engineers

The Aalto Research Software Engineers (RSEs) provide specialist support regarding software, computing, and data. As research becomes more digital and computer-dependent, the prerequisite knowledge grows larger and larger, and we exist to help you fill that gap.

For anything related to custom software development, computational research, data management, workflow automation, scaling-up, deployment of public previews, collaborative work, reproducible research, optimization, high-performance computing, and more, we can:

  • Do it for you: You need some custom technical software/solution. We do it for you, you get straight to your work.

  • Do it with you: We co-work with your group, teaching while we go along.

  • Make it reusable: You already have something, but it doesn’t work for others.

  • Plan your ambitions: Figure out how far you can reach in your next project or grant.

Instead of, or in addition to, hiring your own intern, postdoc, etc. to struggle with certain issues, you can ask for our help. We consist of experienced researchers who have broad experience with scientific computing (programming, computing, data) from our academic work, and thus can seamlessly collaborate on research projects. We can also do consultation and training. You will have more impact since your work is more reusable, open, and higher quality. We can work on existing projects or you can write us directly into your grant applications.

Service availability: Garage support is available to researchers at Aalto. We serve projects from all Aalto schools thanks to IT Services grants, but our main funding currently comes from the School of Science. For more information, see Unit information.

Contact

For a quick chat to get started with any kind of project or request any type of support, come to our daily garage, every workday online at 13:00. Or contact us by email at rse-group at aalto.fi, or fill out our request form. See requesting RSE for more.

About our services

For researchers and research groups

You program or analyze data in your daily work, and you know something is missing: your code and data is less organized, less efficient, less managed than others, and it’s affecting the quality of your work. Or maybe you don’t know how to start your project, or publish it. You’re too busy with the science to have time to focus on the computing.

To find out more or make a request, contact us.

Case study: preparation for publication

A group is about to publish a paper about a method, but their code is a bit messy. Without easy-to-use, (relatively) high-quality code, they know their impact will be minimal. They invest in a few days of RSE work in order to help adopt best practices and release their method as open source.

Case study: external grant

A PI has gotten a large external grant, and as part of that they need some software development expertise. The time frame is four months, but they can’t hire a top-quality person on an academic salary for that short time. They contact the Aalto RSE group (either before the grant, or while it is running) and use our speciality for four days per week.

Case study: improve workflow

A group of researchers all work on similar things, but independently, since their backgrounds have been in science, not software development. They invite the RSEs for a quick consultation to help them get set up with version control and show a more modular way to structure their code, so they can start some real collaborations, not just talking. This is the first step to more impact (and open science) from their work.

Case study: sustainability of finished projects

A project has ended and the main person who managed the code/analysis pipeline has left to continue their career somewhere else. You wish to replicate and extend the previous work, but your only starting point is a folder with hundreds of files and no clear instructions/documentation. Aalto RSEs can help you reuse and recycle the previous code, document it, and extend it so that it can be sustainably reused in future projects.

Case study: outreach and impact

ChatGPT wasn’t in the news just because it was good - it’s because it had an excellent interface for the public to test it. Developing and running these services requires a different set of skills than research, and Aalto RSEs can help to make and deploy these services.

What we do

Our RSEs may provide both mentorship and programming-as-a-service for your projects. Are you tired of research being held back by slow programming? We can help.

You can request our help for a few hours, to consult and co-work on a project. Our goal is primarily teaching and mentoring, to help you help yourselves in the long run. We’ll point you in the right direction and show you where to look next.

You can also request longer-term work as a service. This can be producing or modifying some software for you, or whatever you may need. If it’s short, it’s covered under basic funding, and if it is longer it is expected to be paid from your grants. (Need someone for a few months for your grant? We can do that.)

Note

Master’s and Bachelor’s students

The RSE service is intended for researchers, but students can be researchers if they are involved in a research project. To get started on anything longer than a short consultation, we would need to meet with your supervisor.

Short-term examples

Format could be personal work, lecture, or group seminar followed by a hands-on session, for example.

  • Setting up a project in version control with all the features. This also includes version control of data.

  • Preparing code or data for release and publication

  • FAIR data (findable, accessible, interoperable, reusable) - consultation and help.

  • Creating or automating a workflow, especially those processing data or running simulations

  • Optimizing some code - both for speed and/or adaptability

  • Efficiently storing data for intensive analysis. Data replication and management.

  • Making existing software more modular and reusable

  • Help properly using, for example, machine learning library pipelines, instead of hacking things together yourself

  • Setting up automatic software testing

  • Transforming projects from individual to collaborative - either within a group, or open source.

  • Generalized “code clean-up” which takes a project from development to stabilized

More involved examples

These would combine co-working, mentoring, and independent work. We go to you and work with you.

  • Developing or maintaining specific software, services, demos, or implementations.

  • Software development as a service

  • Software support that lasts beyond the time frame of a single student’s attention

  • Adding features to existing software

  • Contributing to some other open source software you need for your research

Free basic service

In order to help everyone and avoid microtransactions, departments/schools/etc. can sponsor a basic service, which provides a few hours or days of close support to improve how you work (especially for the short-term examples above).

One of our trained RSEs will work with you for a short period to begin or improve your project. The goal is not to do it for you, but to show you by example so that you can do it yourself later.

How to contact us and request help

To request a service, see the request area.

Requests are prioritized according to:

  • Short-term “urgent help” for as many projects as possible

  • Priority for projects and units which provide funding

  • Strategic benefit

  • Long-term impact to research (for example, improved skills)

  • Diversity and balance

For units such as departments, schools, or flagship projects

Our service is funded by departments and schools, and members of these units can receive our services free of charge for a short period of time (in accordance with the shares of funding). In addition to the basic service, researchers and group leaders can request long-term support which they pay for themselves.

By joining the Research Software Engineering service, you provide the highest-quality computational tools to your researchers, enabling the best possible research and attracting the best possible candidates. You fund a certain amount of time, and actual cost decreases when groups pay for long-term service themselves. For both short and long-term projects, our surveys indicate a significant efficiency: (researcher time saved) ≥ 5 × (time we spend).

Case study: Systematic improvements

Your department has a lot of people doing little bits of programming everywhere, but everyone is doing things alone. What if they could work together better? By joining the RSE program as a unit, your staff can get rapid help to understand tools to make their programming/data work better. After a few years, you notice a dramatic cultural shift: there is more collaboration and higher-quality work. Perhaps you already see a change in your KPIs.

Benefits

Benefits to schools/departments:

  • Increase the quality and efficiency of your research by providing the best possible tools and support.

  • Provide hands-on technical research services to your community at a higher level than basic IT (see Scicomp garage).

  • More societal impact, for example ChatGPT-type preview interfaces.

  • Help with data management, open science, FAIR data - be more competitive for funding and help get value out of your unit’s data.

  • You will be able to set priorities for your funding: for example, whether to focus on a certain strategy, a wide variety of projects, high-impact projects, etc.

Benefits to groups:

  • Receive staff/on-call software development expertise within your group, without having to make a separate hire, and at less than a full-time equivalent. We don’t disappear right after your project.

  • Instead of just one person, you have the resources of our whole team available to you.

  • Your researchers focus on their science while improving their computational skills by co-working with us.

How to join

The RSE program is a part of Aalto Science-IT (Aalto Scientific Computing), so it is integrated with our computing and data management infrastructure and training programs. You don’t just get a service, but a whole community of experts. We can seamlessly work with existing technical services within your department for even more knowledge transfer - if it matches their mission, your existing technical services can even join us directly.

In practice, joining us means that you contribute a certain amount of funding, which allows us to hire more staff (combined with the other departments), to provide a certain amount of time to research groups in your unit. This is easy with basic funding, but we can also use Halli to work with project funding.

If you would like to join, contact Richard Darst or rse-group at aalto.fi.

Current sponsoring units

See Unit information

Project portfolio

This page lists examples of projects which we have done. As of early 2024, our internal project numbers are in the 200s.

Summary table
[Summary table: a 3×3 grid with "programming", "workflows integration", and "data" across the top, and "help you do it", "do it with you", and "do it for you" along the side.]

Example range of projects we do. We sometimes do things outside of this table, too.

Software publishing (M)

A CS doctoral researcher’s paper had code released along with it - with seven PDF pages of installation instructions, five pages of pre-processing instructions, and fifteen pages on how to run it. This code was effectively un-reusable, meaning that the potential for impact was much lower than otherwise.

Aalto RSE helped to transform this analysis into a standard R package that could be installed using standard tools and run using Snakemake, a workflow automation tool. Other researchers - including future researchers in the same group - could reuse the tool for their science. Time spent: 3 weeks. Benefit: one paper’s results become reusable (both internally and externally).

cor:ona data collection platform (L)

Cor:ona (Comparison of rhythms: old vs. new activity) was a research study of personal habits during the transition between remote work and post-remote work. For the study to succeed, a platform integrating survey and smart device data had to be created within a one-month time frame.

Aalto RSE worked with the researcher to do a complete and quick ethical review and build the platform. Unlike a hired software developer, our staff already knows the research methods and can work much faster - and stays around providing years of support with the post-processing whenever it is needed. [Source code on Github]. Time spent: ~1 month. Benefits: one study and multiple papers that could not otherwise exist.

Periodic table of quantum force fields (S)

A researcher wanted to create a website that could find quantum mechanical force fields (≈models). The researcher dropped by our daily scientific computing garage for advice, and we discussed options - by working with us, the path could be greatly simplified to a static site. We found a suitable open-source starting point, adjusted it to work for the needed purpose, and provided it to them for future work by the next day. The researcher has been able to carry on with the project independently. Time spent: 0.5 day, time saved: 4 days + simpler implementation.

Finnish Center for Artificial Intelligence (dedicated)

The Finnish Center for AI (FCAI) aims for its research to have an impact in the world, and to do that, its software must be reusable. FCAI identified this as a bottleneck, and thus provides 5 years of funding for Aalto RSE to hire a research software engineer dedicated to FCAI projects. This person works alongside all the other RSEs in the team, so FCAI has far more resources than a single hire could provide.

Business Finland project (M)

A research group had received Business Finland funding to develop an idea into a product, but was still working within Aalto. They needed software development expertise to start off quickly. They were large enough to need a dedicated developer, but our initial work allowed them to start sooner and laid good groundwork for the developer they hired later.

Debugging and Parallelisation (S)

A researcher had a huge dataset to run an analysis on. Sequential analysis would have been infeasible, so they wanted to run it in parallel. They tried to implement it themselves but got stuck, so they came to the garage, where an RSE helped them modify their code so that much of the work could be parallelized and the analysis completed. The resulting work was published in Fuel.

Introduction to Julia course (M)

Julia is a relatively new programming language that has found many users in certain fields. A professor taught an undergraduate course using Julia, but there were no suitable introductory resources to prepare students for it. Aalto RSE found an open-source course prepared by CSC (Finland’s national scientific computing center), improved it to cover the things needed by the undergraduate course, and successfully taught it on demand. All course material is open source, so that others may also use it. Time spent: ~1 month. Benefit: course given twice, undergraduate course made better, open material produced, internal Julia expertise built.

Releasing an open-source Github-based book (S)

A researcher had prepared the start of an open-source book and needed help and advice in releasing it as an open project. Aalto RSE helped with the technical setup to host the book on Github, the basics of Git usage, and creating a continuous integration system that would rebuild the book on every change. This allowed the book to both be fully open-source and to accept contributions from others. Aalto RSE also used its connections to Research Services to discuss the intellectual property aspects and how it might affect the possibility for future publication. Time spent: <1 day. Benefit: Open book and community project.

Releasing a microscope control code (S)

A researcher had created a code in Python to control a physical measurement device. This code could be useful to others, but had to be packaged and released. Aalto RSE helped to clean and release the code. Time spent: 1 day. Time saved: 1 month.

“Programming parallel supercomputers” course (M)

The “Programming parallel supercomputers” course, as the name says, gives students a first experience with HPC work. It can be difficult to find teaching assistants capable of giving the exercises a deep-enough check - in addition to confirming they follow best practices on the cluster. There is also a secondary effect of making sure students see best practices in research software (development, documentation, etc.), which can often be left behind in academic courses. Aalto RSE plays an important role in this course by bridging the technology with the teaching.

Aalto Gitlab improvements (M)

Aalto University’s Gitlab needed some scripting for management tasks. While not exactly in our scope, we were the logical team to take a look (as opposed to hiring outside consultants, especially since we could better fit in with an incremental development schedule and longer-term support). We talked with the system owners, refined the tasks, understood GitLab documentation, created the necessary scripts and improvements, handed them off to the sysadmins for production, and helped to understand tasks which should be done at another level. Time spent: 1 week. Benefit: improved service for Aalto University, significant cost savings. This type of project would be available for other internal service teams, assuming availability.

How to get started

Contact us as mentioned above, or read here for more details.

Requesting RSE support

You can contact us regardless of how small your issue is - or even if you would like to know if we could help your project. At least, we can point you in the right direction.

Quick Consultations

We recommend you come to our daily garage sessions for a short chat. There are no reservations, and this is the online equivalent of dropping by our office to say hi.

Contact

We recommend you come to our daily garage sessions (see above) rather than send email (or come by after you send the email). We almost always have more questions and want to chat, so responding by email is slow.

Our email is rse-group@aalto.fi (the Triton email address scicomp@aalto.fi also gets to us). You can also use the structured request form (“Research Software Engineer request”). This guides you through some of our initial questions, but goes to the same place as email and everything is read by a human anyway.

Next steps

See Starting a project with us for more info.

Starting a project with us

This page mostly describes how long-term scheduled projects work; these are funded by the research groups themselves. Long-term projects are scheduled as a fraction of full-time-equivalent (FTE) over weeks or months.

For short-term code review tasks, come to any of our garage sessions and we will immediately take a look.

Types of service
  • Long-term service deals with jobs that last months, and is scheduled in terms of FTE percentage over months. This is often paid directly as salary from a grant, as a researcher would be.

  • Medium-term service deals with jobs scheduled in days. This is mostly funded by basic funding from the member units.

  • Short-term usually consists of support at one of our garages or a few hours of work. This is generally free (paid by unit basic funding).

Beginning

To actually make a request for support, see Requesting RSE support.

Initial meeting

First, you can expect a quick initial meeting between the researchers and RSEs. Depending on the size and complexity of the project, there may be several meetings to find the right RSE and ensure that we can do a good job helping you.

  • What scientific background knowledge is needed? How long does it take to get started?

  • What type of contribution will the RSE make (see next section)? For purposes of scientific integrity, consider if this reaches the point of scientific authorship (see bottom).

  • Researchers: provide access to code, documentation, and relevant scientific background in advance, so that they can be browsed. The more we know in advance, the better we can estimate the time required and how to best help you.

  • How do you manage your data? To map things out, consider this one-page data management plan table.

  • Final outputs, location, publication.

  • Time frame and schedule flexibility.

What we can accomplish

It is very important to consider what the practical outcome of the project will be, because different researchers have very different needs. Together, we will think about these questions:

  • What is the object of focus?

    • Software

    • Data

    • Workflows

  • What is accomplished?

    • Create a brand-new product based on scientific specification. Is this done in an agile way (continuous feedback) or is the exact result known?

    • Improve some existing workflow or software, possibly drastically.

    • Improve some other project, primarily maintained by someone else.

    • Prepare a project for publication, release, or being used by more people.

  • Future plan

    • Primarily teach via example, so that the researcher can fully continue developing the project themselves.

    • Provide a finished product, which won’t need updates later

    • Provide a product that will be continually maintained by specialists (RSEs or similar - possibly us).

Scheduling and planning

RSEs will be assigned based on discussion between the researchers, RSEs, and Aalto Scientific Computing (the RSE group). Your agreement is with the RSE group, so your RSEs may change if there is a need (even though we’ll try to avoid this).

We will work with you to give a good view of how long something will take and any risks (as in, what if it turns out to not be possible?). We can’t promise specific results in a specific time (no one can), but we do try to give the best estimates we can. This planning includes any buffer and backup plans.

It may take some time to fit your project into our schedule (this also depends on the urgency, of course). We realize that your schedule is also uncertain, but we hope that you can find time to work with us once we start, since otherwise we may move on and requeue your project.

If we schedule a project but lose contact with you (no responses to our messages), we’ll assume you are busy with other things and may re-add the project to the queue, and we’ll need to find a new time in the schedule. Please let us know if you don’t have time; we understand how busy research can be.

A project doesn’t have to be done “all at once” but can be interleaved with your own work schedule.

Costs and time tracking

We track the time we spend and record it to your project.

Getting started
Version control

One can hardly do development work without using a good version control system. Our first step will be to help you start using a version control system if you are not yet using one, or to ensure you are using it optimally if you are. If you don’t have a preference, we’ll recommend git and GitHub / Aalto Gitlab.

Research background

If some understanding of the scientific background wasn’t important, you might be hiring a software developer instead. Expect us to take some time to understand the science.

Understanding existing code

Also expect that, if there is any existing code, it will take some time for a new person to understand it. There is also likely to be a period of refactoring to improve the existing code, during which it may seem like not much is getting done. This is a necessary investment in the future.

During the project

Our RSE will most likely want to work with you in your physical location (well, after corona-time) a lot of the time. It would be good to arrange a desk area as close as possible to the existing researchers. “Mobile-space” but close is better than fixed but farther away.

Our goal isn’t just to provide a service, but to teach your group how to work better yourselves after the project.

Software quality and testing

Software which is untested can hardly be considered scientific. We will work with you to set up an automatic testing framework and other good practices so that you can ensure software quality even after the project. This also ensures faster and more accurate development in the future. We’ll teach you how to maintain this going forward. All of this is done in proportion to the complexity of the project and the need.
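
As a minimal sketch of what automatic testing can look like in practice: the module and function below (mystats, mean) are hypothetical placeholders standing in for your own analysis code, and pytest is one common choice of framework, not the only one.

    # test_stats.py - run with `pytest`.
    # mystats and mean() are hypothetical placeholders for your own code.
    import pytest
    from mystats import mean

    def test_mean_simple():
        assert mean([1.0, 2.0, 3.0]) == pytest.approx(2.0)

    def test_mean_empty_input_raises():
        with pytest.raises(ValueError):
            mean([])

Even a handful of tests like these catch regressions early and document how the code is supposed to behave.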

We also pay particular attention to the maintenance burden of software: you’ll be using software much longer than you write it. We aim for simple, reliable strategies rather than the fanciest things right now.

After the project

We don’t want to drop support right after the project (that’s why you work with us, not an external software developer). Still, we have finite resources and can’t fund work on one project from another, so we can’t do everything for everyone. You can expect us to keep passively supporting you during the “daily garage” time as best we can.

If your department or unit provides basic funding (see the implementation plan), then long-term service is included, and this has no fixed limits. However, this is shared among everyone in your unit, and focused on strategic support that helps many people.

Tracking scientific benefits

We need to record the benefits of this service:

  • Researcher time saved

  • Computer time saved

  • Number of papers supported

  • Software released or contributed to

  • Open science outcomes (e.g. open software, data management)

  • New work made possible (e.g. grant or project wouldn’t have been possible)

  • Qualitative experience: increased satisfaction, educational outcomes, etc.

Releasing the software

A key goal of our support is releasing the software for broader use in the community (open science). Ideally, this will be a continual process (continue releasing as development goes forward), but we can prepare you for a first release later on, too.

We recognize the need to maintain a competitive advantage for your own work, but at the same time, if your work is not reproducible, it’s not science. We’ll work with you to find the right balance; a common strategy is that some core is open, while the actual analysis scripts which make use of that core are released with your articles.

Academic credit

Our RSEs do creative scientific work on your projects, which (depending on scope) can rise to the level of scientific authorship. This should be discussed early in the project.

  • The software-based scientific creativity can be different from what is published in your articles: in this case, it can make sense to release the software separately.

  • This is not to say that RSEs who work on a project should always be authors, but it should be considered at the start. See TENK guidelines on research integrity (authorship section).

  • A contribution that is significant enough to constitute scientific novelty, and for which the programmer must take responsibility for the outcome of the work, usually rises to the level of co-authorship.

  • It is OK to consider the code authorship as a separate output from the scientific ideas, and the RSE can help properly publish the code so that it is citeable separately from the paper.

Acknowledging us

You can acknowledge us as “Aalto Research Software Engineering service” or “Aalto RSE”. In papers/presentations, please acknowledge us if we significantly contribute to your work.

When talking with or presenting to your colleagues, please do talk about our services and their benefits. Our link is https://scicomp.aalto.fi/rse/ . Word of mouth is the best way to secure our funding so that we can continue to serve you.

See also
  • UCL RSE group processes: That page heavily inspired this page. Broadly, most of what you read there also applies to us.

For group leaders

You, or someone in your group, has requested Research Software Engineer services for one of your group’s projects. This service provides specialist support for software, data, and open science so that you can focus on the science that is interesting to you. You probably have some questions about what this is, and this page will answer those practical questions. For researchers using our services, also see Starting a project with us.

How it is funded

There are two funding strategies:

  • Short term (a few weeks or less) is funded by your department, if you are in one of the sponsoring units. You don’t need to do anything special.

  • Longer term (a month or more) is funded from your own projects. See the information for grant applicants; it is also relevant if you already have funding you want to use.

    You can use our services for both a specific project, or generally have us around on retainer to support all of your projects (for example, 20% time for a year). If you are applying for a new grant, see For grant applicants.

Access to data and tools

Our goal is not to come in, wave our hands, and leave you with something unusable. Instead, we want to come in and set you up to work yourself in the future. Thus, (if it’s necessary) we’ll want the same access to your group’s data/workspace/tools as you have.

This access is removed after the project is finished. We will try to remember this, but sometimes projects drag on with no clear ending (or you want long-term consultation), so you should also pay attention to it. As a matter of principle (and policy), we access the data the same way a normal researcher would.

NDAs, intellectual property, etc.

The RSE staff are Aalto employees and are automatically bound to confidentiality, and have signed the same extra confidentiality agreement that Aalto IT system administrators have, and are similarly vetted.

Using our services doesn’t affect your intellectual property rights any more than another employee working on the project would. This is service-for-pay, so you get all rights. However, our RSEs expect to be acknowledged according to good scientific practice (see Starting a project with us).

For grant applicants

Warning

Grant applicants, if you are planning to use Aalto Research Software Engineers service, feel free to drop by SciComp garage for a chat, contact us at rse-group at aalto.fi, or fill out our request form.

This page is currently (2024-01) our best understanding of what is possible. However, we are still exploring what works and doesn’t, so contact us early so we can work out bugs together. Please send corrections to us.

If you’ve decided you would like to use the research software engineer services in your project for a long period, you might want to write it directly into your grant proposals. If written correctly, this can increase your competitiveness: your research will be better because you can use RSEs for porting/optimizing/scaling of codes, automation, data management, and open science, while concentrating the main project resources on the actual research question. We can do those listed things much faster than most researchers.

If you don’t know if you need our services, or need a consultation of what is even possible computationally, let us know, too!

Summary

  • We can serve as a specialist to complement the researchers in your project, which will make your grant more competitive.

  • Plan on “Staff Scientist” salary level for at least a few months, scheduled when convenient for you (this can be as low as 10% time spread out over a longer period).

  • We are written in as normal staff, since we are. Don’t mention subcontracting or purchasing or things that imply that (this can make funders ask questions).

  • Contact us for more exact costs and our availability.

Funding options

Short-term services, less than one month per research group, are funded by various departments and schools and free to the users (part of the “research environment” services). Longer term service should be funded by projects - either an external grant or basic funds. There are two ways to write this into a project proposal:

  1. As a research salary, just like other salaries on your project. This has fewer limits, but is less flexible because we need to go through HR and financial planning.

  2. As an internal charging/purchased service, like usage of different infrastructures. This is flexible, but not compatible with some funders. It should work well with internal, basic funding.

Don’t mention subcontracting or purchasing in your grant text unless it really has to be organized that way. Make us appear as normal employees, since we are.

(1) Funding RSE salary

In this option, your grant directly pays the salary of an RSE from our team. To a funder, this appears the same as hiring a researcher, so is compatible with many types of grants. Some considerations:

  • This only works internally in Aalto.

  • Contact us for salary levels (it is roughly staff scientist) and availability.

  • Tell your controller this salary level and duration. Your controller will compute the necessary overhead and tell you if it is possible. (You should tell your financial staff/HR/etc. that the salary will be used for someone in the School of Science (SCI-common) and ensure that this is fine.)

  • Finance/HR will set up the Halli system so that we can bill our working hours directly to your project, based on actual time we work.

  • Realistically, we can spend up to about 80-90% time in a month on a single project (but you must make sure we have time first!).

  • We bill only the actual time relevant to your project, so while the costs are higher, in the end we are much more efficient than typical researchers who have many other tasks going on.

(2) Purchasing RSE services

Contact us with your needs, and we can give you an estimated price and time required. We can provide the services distributed over a time period that is relevant to you.

  • Warning: many funders (for example, the Academy of Finland or EU) don’t like for this to be used in their grants. If you do include this in a grant, carefully consult with grant/financial services to make sure this is possible.

  • In theory, we can serve groups outside of Aalto, but overheads are quite large. We are working on an RSE network within Finland so that you can efficiently get RSE services no matter where you are.

General grant considerations

You can find general boilerplate text to include in your proposals at Boilerplate text for grant proposals, and you can read below for how to build it in even more.

Data Management / Open Science are big deals among some funders right now, and research software engineers are perfect for helping with these things because they are experts in the associated technical challenges. The RSE service can help in the societal impact sections: your outputs will be more ready to be reused by society. You could, for example, promise to deliver more types of outputs that aren’t exactly novel science but help society to use your results (e.g. databases, interactive visualisations, etc.).

Make sure you mention the general Science-IT infrastructure in the “research environment” section, i.e., the basic service provided by Aalto. You can copy something from the boilerplate text (first link in this section).

Specific funders
Academy of Finland

The following applies to most general research grants, based on the general terms and conditions. Funding may be used to cover costs related to the research plan or action plan. The research site must fund basic project facilities - which is the case at Aalto for basic RSE services.

Interesting terms from the Academy: it urges research data and methods to be freely available. 6.2.2: “Research data and material produced with Academy funding in research projects and research infrastructure projects must be made freely available as soon as possible after the research results have been published.” We are experts in exactly this for computational and data sciences.

  • As a RSE salary:

    • Contact us for the salary level which you should budget and our availability. Your controller will help you write this into the budget.

    • “Salaries, fees and indirect employee costs” may be included in Academy projects. These may go to research software engineers, who to the Academy appear equivalent to “normal researchers”. The RSEs are researchers.

    • Write in a Research Software Engineer as a salary for a set number of months. You may specify a name as N.N., or contact us for a name to include. We do not promise any one person, but we will work with you as much as possible. Contact us for costs per person and we will put you in touch with our controllers. You can also contact us to discuss how much effort you may need.

    • Note that “We recommend that they be hired for a period of employment no shorter than the funding period, unless a shorter contract is necessary for special reasons dictated by the implementation of the research plan or action plan (or equivalent). Short-term research, studies or other assignments may also be carried out in the form of outsourced services.” So, consider this in justifying the research plan.

    • Don’t call this subcontracting or purchasing. It’s normal internal salary.

  • As a service purchase:

    • Warning

      Our latest information indicates that internal billing (this service purchase) is not really possible for Academy grants. You must use “As a RSE salary” above.

    • Please contact us for general costs, and how many person-months you can get for a given price (it is roughly on “Staff Scientist” level). Since estimating the amount of effort needed is difficult, contact us and we can help you prepare with the help of our controllers.

    • The research site should provide “basic project facilities”, which Aalto does. Justify the extra purchase as beyond the basics.

    • Maximum amount: We recommend you include no more than XXXXX as a service purchase. Please see LINK (login required) for our prices, when paid via external funding.

    • Justification for funding (include in proposal): “Technical specialist work to ensure scientific and societal impact outputs follow best practices in software development and research data management practices, so that they can be of greatest possible benefit to society.”

    • Flexibility: we could flexibly invoice as needed for your project. You don’t have to decide the time period in advance (only follow your submitted budget), and different RSEs can work on different parts of the problem, so you always have the best person for the job.

European Commission grants

Internal billing is (for practical purposes) not possible for EC grants. Use the “RSE salary” method (and don’t call it subcontracting or purchasing - we are normal salary).

RSE financial practicalities

Let’s say you know what we do and have funding and would like to send it our way. This page says what to do. You can read what we know about different funders, but it’s probably better to ask your controller directly if you have the funding already.

Instructions for group leaders

Please send a message such as this one to your controllers (we will tell you the relevant salary):

I am wondering what types of funding I have available to cover salary at [NNNN]€/month (Aalto internal, SCI) - do I have enough funding for [1 month / 5 days / 4 months at 25% / etc.] at this level? I would like to hire one of the Aalto Research Software Engineers for a short amount of time for a project. You can read more here:

If the answer is positive and you want to start the process, reply and include Richard Darst, the Research Software Engineer, and the RSE controller that we indicate:

The person is [USERNAME/EMAIL]. They can be added to Halli and they will directly allocate their salary to this project based on time worked, or costs can be paid by internal charging - please let us know which. Please also let us know any requirements (maximum amount of time, valid months, etc.). Richard Darst (cc:ed) can answer any more questions about this.

Checklist:

  • Project discussed with researcher (and research software engineer, if relevant)

  • Initial request sent to your controller to confirm funds are available

  • Request for getting started sent to your controller, RSE controller, and…

  • Details relayed back to research software engineer

Instructions for junior researchers

Below is an example message to send to your group leader, if you need some inspiration:

Dear GROUP LEADER, as you know I have been going to the SciComp garage to get help from the Aalto Scientific Computing people. We are at the point that they would like to help more, and our group has already reached the limit of the free “research software engineer” service, which goes beyond the typical cluster support. Do you think we have a little bit of funding which would allow us to hire their services for a short period?

With a little bit of funding, we can make our work much better and faster, and I [won’t have to worry about [topic] / will learn about [topic] much faster].

You can read more about the service here: https://scicomp.aalto.fi/rse/

Checklist:

  • Project discussed with research software engineer

  • Request sent to supervisor

Instructions for department controller

Our Research Software Engineer (RSE) will work on a project for the PI according to the PI’s requests and be paid by their project. The RSE will track the time they spend and record the actual time used for the project in Halli, or salaries can be handled by internal charging.

Checklist:

  • Name staff (RSE) and duration of funding received.

  • Confirm funding conditions.

  • If using Halli:

    • RSE added to Halli by department controller (add permission for staff to record hours to the correct project).

    • Project number and any additional constraints (maximum hours, funding deadline, etc.) sent to the RSE, RSE lead (Richard Darst), and PI.

  • If using internal charging:

    • Arranged between department controller and RSE controller (SCI)

    • We’ll track all our time internally.

  • If an EU project or any special constraints on how we keep our records, let us know. EU projects use Halli and are described on the project administration page, heading “special projects”.

  • Tell us how to update this page to be more useful to others.

About research software engineers

RSE community

Do you like coding, research, and the academic environment, but want slightly more emphasis and community around the software side? Join the Aalto RSE community. You can join whatever your current position is; you don’t need to be hired as a research software engineer. There are no requirements, just networking and development. This is also a “Triton powerusers group”.

RSEs have been an essential part of science for ages, but are hardly ever recognised. We have many here at Aalto. Aalto SciComp is trying to make a community of these people. By taking part, you can:

  • Network with others in similar situations and discover career opportunities.

  • Share knowledge among ourselves (maybe have relevant training, too).

  • Take part in developing our services - basically, be a voice of the users.

  • More directly help the community by, for example, directly updating this site, helping to develop services, or teaching with us.

To join the community, see the general SciComp community page. You may want to join the Aalto RSE community mailing list, which is a general-purpose list which anyone may post to, including possibly internal job advertisements or other random discussion. Also, you should take part in the Nordic-RSE Finland chats - there is a strong Aalto presence there, and we use that as our Aalto chat time, too.

For RSE candidates and community

See also

We occasionally hire people. To get notified (of this and other similar jobs):

  • From time to time, job advertisements are posted on the Aalto University job portal. If you are considering Aalto, CSC also quite often has jobs open.

  • This blog post describes what you might want to know for applying for jobs with us.

  • If you are looking for jobs inside and outside of Aalto, consider following the Society-RSE job vacancies form.

  • If you are inside of Aalto, join the RSE community mailing list (mailing list). This will get you announcements of our jobs, events, and other research groups looking to hire an RSE skillset.

  • If you are in the Nordics/Baltics/etc., consider joining Nordic-RSE or CodeRefinery and participating in their events. We are active in these organizations and this is a good way to learn how they work.

This page guides people into the interesting world of research software engineering: providing a view to what this career is like, and what you could do if you want to develop your skills. This isn’t what you have to know to start off. It’s a map of ideas for both before and after, not a list of requirements.

If (some of) the following apply to you, you are a good candidate:

  • I like the academic environment, but don’t want to focus just on making publications.

  • I am reasonably good at some programming concepts, and am eager to learn more. I know one language well, can shell script, and am generally familiar with Linux.

  • I am interested in moving to a scientist-developer kind of role in a company, but need more experience before I can make the transition.

Components of RSE skills
  1. Research practices: Research is its own special thing, with special ways of working (this includes data management and open science). Research experience helps you connect to our target audience and know what works and doesn’t.

  2. Programming and software development: Programming and general development and project management practices are important - but we must keep in mind the relatively small-scale nature of our projects. Basics are useful, enterprise-grade usually isn’t.

  3. Open-source/open-project knowledge: We emphasize making research results reusable, and open source practices are a key way to do that.

A person coming from a research background will probably be good at (1) but likely need to improve more in (2). Someone coming from an industry background will probably be good at (2) but need to improve in (1). (3) is very person-dependent.

Let’s not forget a final component:

  4. Mentoring and teaching: As in every job, social skills are the most important aspect, since you are working closely with a wide variety of researchers.

Research practices

To get experience with this, there is a fairly clear academic career path which can provide good RSE education, especially if you look beyond producing as many papers as possible. To broaden your skills, try:

  • Try to get involved in a wide variety of computing, data, and software related research.

  • Publish datasets and software (properly) along with your papers - either separately or in a software/data paper.

  • Try to work on more collaborative projects (sharing code/data), rather than focusing on your own work.

  • Manage your data well (remember, it’s not just about the software).

  • Use different types of computing environments for your work, especially cluster environments (see our HPC cluster lessons).

Software development

Technical skills are an important part of what we do: computing, data, and software. Many people take basic programming courses, but there are many important practices beyond that: version control, other tools, methods (Scrum, agile, etc.), deployment strategies, and so on.

Don’t let “software” trick you into under-valuing other forms of skills: data managers, computational specialists, etc are all important, too.

To develop these skills, try:

  • Get at least minimally comfortable with the command line.

  • Use version control (at the right level for your project). Can you make your project a bit more professional and level up your version control?

  • Add a command line interface to a code (see the sketch after this list).

  • Make a modular, reusable code.

  • Add automated tests, continuous integration.

  • Play with a new language or tool for some small project - do you have experience in both high and low level languages?

  • Automate your workflow to make it reproducible.

  • Use the best data storage methods possible.

  • Make a merge request / pull request to a project you want to contribute to.

  • CodeRefinery workshops cover most of what you need.

  • Look at the Zen of Scientific Computing for other ways to advance some projects up those levels.
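
For the command-line-interface item above, here is a minimal sketch of the usual pattern with Python’s argparse module; the analyze() function and the option names are hypothetical placeholders for your own code.

    # cli.py - minimal command line interface sketch using argparse.
    # analyze() and the options are hypothetical placeholders.
    import argparse

    def analyze(path: str, threshold: float) -> int:
        """Placeholder: count lines longer than `threshold` characters."""
        with open(path) as f:
            return sum(1 for line in f if len(line) > threshold)

    def main() -> None:
        parser = argparse.ArgumentParser(description="Toy analysis CLI")
        parser.add_argument("input", help="input text file")
        parser.add_argument("--threshold", type=float, default=80,
                            help="line length threshold (characters)")
        args = parser.parse_args()
        print(analyze(args.input, args.threshold))

    if __name__ == "__main__":
        main()

A script like this can be run as, for example, "python cli.py results.txt --threshold 100", which is already much easier to document and reuse than editing constants inside the file.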

Open source / open project knowledge

One of our most important goals is to make research reusable and more open. For computational research, the practices of open-source projects are our main toolbox, since they are often shareable and reusable by design. Don’t limit your vision to just software projects, for example Wikipedia and OpenStreetMap are open projects focused on data curation.

To develop these skills, try:

  • On Github, subscribe to a project of interest to you and see how it is run. (Try to find one that is large enough to use best practices and active communication, but not so large that there is a flood of messages.) Or, subscribe to some of the project’s mailing lists.

  • Report issues and try to help debug a project of interest to you.

  • Make a contribution to a project of interest to you.

  • Package and release one of your projects…

  • … and see if you can get others to use it.

  • Help others use one of your tools.

Mentoring and teaching

The job of a RSE, at least in our vision, is as much mentoring and teaching others as it is doing things. To improve this, you could, for example, help others at the SciComp garage or teach with us (see the community section above).

Role at Aalto

At least at Aalto, you will:

  • Provide software development and consulting as a service, depending on demand from research groups.

  • Provide one-on-one research support from a software, programming, Linux, data, and infrastructure perspective: short-term projects helping researchers with specific tasks, so that the researchers gain competence to work independently.

  • As needed and desired, teach and provide other research support.

  • A typical cycle involves evaluating potential projects, meeting, formulating a work plan, co-working to develop a solution, teaching and mentoring for skill development, and follow-up.

All of this is done as part of a team, to round out skills and support continuous internal knowledge-sharing.

You may also be interested in these presentations on the topic of “what we do”:

Training resources

These resources may be interesting to support your career as an RSE:

Skillset

Below, we have a large list of the types of technologies which are valued by our researchers and useful to our RSEs. No one person is expected to know everything, but we will hire a variety of people to cover many of the things you see here.

Most important: do you want to learn things from this list? Can you do so mostly independently, but with the help of a great team?

Checklists

RSE project done
Discuss with the researchers
  • Explicitly confirm with customers that we are ending our focus on this project and won’t do more until we hear from them again.

  • Confirm it is publicly released, licensed, everything is done (or discuss what else might need to be done).

  • Make sure outputs are reported into ACRIS. This is important because it makes our work visible.

    • Software: Add Content → Research output → Artistic and non-textual output → Software.

    • Data: https://www.aalto.fi/en/services/research-data-and-acris (Add Content → Dataset)

    • For each entry, under “Facilities/Equipment”, add “Science IT”. This links it as an output of Aalto RSE.

    • Anyone can do this and add other relevant authors. The metadata entry can be made private or public, and the actual software/data is usually hosted elsewhere (and can be public or not).

  • Discuss what to do if there are issues in the future - garage, issue tracker, training courses.

  • Discuss what else may (or may not) need doing in the future.

Internal (RSE group) tasks
  • Issue tracker:

    • /summary should contain a several-sentence summary focused on the benefit to the RSE service (this is used for final reports, etc.).

    • Confirm other metadata is correct

      • /contact, /supervisor contain people who may get emails about the project later (and shouldn’t contain people who would be surprised by automated survey emails).

      • /timesaved

      • Outputs /projects, /publications, /software, /datasets, /outputs

  • Get an interesting picture or screenshot for use in future material.

    • Not needed if there are overriding confidentiality considerations. The picture should never include personal data or data coming from a research subject (unless it’s already open).

    • Add to triton:/scratch/scicomp/aaltoscicomp-marketing.git (pictures/rse/).

    • Include a readme with citation, confirmation of what usage permissions there are, and a one-sentence general description suitable for presentations.

    • Examples: screenshot of a website, screenshot of code that looks interesting, screenshot of a repository page, picture of a hardware device used, etc.

  • Add it to the next meeting agenda. We will collaboratively do an analysis to find lessons learned:

    • Facts about the project

    • Arrange facts into the big picture and timeline

    • Draw conclusions: what went well and did not go well? What were the causes of the good and bad things?

    • Lessons learned: what to do differently in the future.

Python project checklist

This checklist covers the major considerations when creating a high-quality, maintainable, reusable Python codebase. It is designed to be used together with an RSE who can guide you through it (it is in a draft stage, and doesn’t yet have links explaining each item). Not everything is expected for every project, but a sufficiently advanced project will have most of these things.

  • Citeability and credit, authorship discussion

  • License

  • Version control

    • In use locally

    • In use on some platform (Github/Gitlab/etc)

    • Regular commits

    • Discuss issue tracker

    • Make one example pull request

  • Modular design

    • Standard project layout

    • Importable modules

    • Command line or other standard interface

    • (relates to packaging below)

  • Tests

    • Recommendation: pytest

    • Simple system tests on basic examples

    • More fine-grained integration or unit tests

    • CI setup

    • Test coverage

  • Documentation

    • Forms / levels

      • README file: good enough?

      • Project webpage

      • Sphinx project

      • Read The Docs

    • To include

      • About

      • Installation

      • Tutorials

      • How to / simple examples to copy

      • Reference

  • Release

    • Module structure

    • pyproject.toml or setup.py

    • requirements.txt or environment.yml

    • PyPI release

    • conda-forge

    • Zenodo

Other pages on this site: Package your software well, The Zen of Scientific Computing

Internal documents

We believe in openness, so we make our procedures open. They are subject to improvement at any time. Also see the FCCI Tech seminar series for how our broader team works internally.

Message templates

These are templates for different messages we might send. As you might expect, they are probably not suitable for using directly (even by us), but it’s better to record them than lose them, and better to be open than not.

Announcements
Contacting researchers
None

Did you know of the Aalto Research Software Engineer service (https://scicomp.aalto.fi/rse/)? It provides specialized support in software development and computational science. Could any of your infrastructure users benefit from this service?

The point of this service is to make sure that anyone can succeed in their research, regardless of their computational background. For example, we can provide software development, advice and support for those programming themselves, data management support, help with packaging and publishing software, and so on. There are so many things that a person needs to know these days that no one can be expected to know everything.

We started in 2020 in the School of Science, and now have funding to support people from any school.

If you have any ideas, feel free to point your users to our service, https://scicomp.aalto.fi/rse/ . Or, we can arrange a discussion session to talk about ways to more closely work together, since I am sure there are ways that joining forces is best.

Project status (waiting)
None

We (Aalto RSE) still have an open issue in our queue about your project DESCRIPTION.

It’s still in our queue, and hopefully someone can get to it in WHEN. I’m wondering about status from your side - Is this still important to you? Have you figured out something else already, so that it’s not needed? Anything we should know about our scheduling and planning? Should we increase/reduce the priority? Would some smaller amount of help let you get going?

For short term stuff and consultations about the project, you can always try dropping by our garage, even before we actively start working: https://scicomp.aalto.fi/help/garage/

Follow-ups
None
/contact
/supervisor

Department/group:


Basic description:
- .


Current team:
- .


Each team does:
- .


Tech tools:
- .


Scientific tools/domain knowledge:
- .


Schedule
- Time estimate:
- Any deadlines?:
- Expected time, likelihood of going over:
- What happens if it goes over time?  Backup plans?:


Links to existing docs:
- ...



/summary

/estimate
Feedback
None

Hi,

Some time ago, we helped you with ________________ as part of our Research Software Engineer service. Now that some time has passed, we would like to know if you had any feedback on our support. This is very important to us to ensure the continuation of this service, so please take a minute or two to quickly answer! A few numbers in reply to this message is sufficient.

First off, we wonder how much time (mental effort) do you think our work has saved you? (We know this can be hard to estimate, but any kind of rough prediction of “I avoided spending X days/hours to plan, implement, or debug what we would have done otherwise”.)

Then, what about research outputs: how many have we contributed to? Articles/papers, datasets, software projects released, projects supported in general, etc.

Do you have any other comments on our service?

Project administration

Note

This page is still a working document; discuss anything that looks like it should be improved.

Unfortunately (fortunately, since it means our work has value?), we need to track where our time goes in order to justify the benefits of what we do. There are two main uses of the data:

  1. General reporting: being able to say how our time is distributed among departments and projects. This doesn’t have to be perfectly accurate (and since we have so many small projects, it would be a big waste of time to try to be perfect) - but it should be roughly proportional to actual time spent. This is tracked in Gitlab.

  2. Financial reporting and project payments. This needs to be accurate, but only for the few projects which have special funding. The master data is in financial systems, but Gitlab can sometimes be used to make this reporting a bit easier.

Typical project flow
  • Someone will contact us somehow. We try to get them to the garage or some other talk as soon as possible.

  • Initial discussion. If it seems this should be a tracked project, then make the issue.

  • Be aware that it takes some time to get up to speed with a project. This should be considered when making the initial estimate, during the first consultation. When recording time spent, include the time it takes to get up to speed and learn whatever else is needed for the project.

Finance time tracking

For projects with their own funding (external or internal), you should get instructions about how to record your time. For many projects, this means marking it in Halli. All other projects (funded by the department’s/school’s basic funding) are marked in Halli to the standard RSE salaries project (ask for it).

Types of projects
Special projects

Examples: EU-funded projects

Special projects are their own distinct entity and are not mixed with other work of our team. They receive dedicated days for their work, and are not given attention on other days. Because these get exclusive days, the master data of these projects is in Halli, and because Halli can be used for records later, they are not recorded in Gitlab. (Note: “special” does not mean better, it’s usually more productive to be available for researchers whenever they need us).

Special projects get one Gitlab issue to track the overall contact, but it isn’t updated on a day-to-day basis.

Daily procedures: At the end of every day, record the working time in Halli. As much as possible, these project days should not be mixed with other work, but internal team meetings, etc. are allowed if necessary. In Halli, record each day’s worktime (scaled to the standard 7.25h/day) in proportion to the time spent on the special project (allocated to that project)/internal work (allocated to RSE-salaries).
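
As a worked example of that scaling (a minimal sketch; the hours are invented):

    # Scale a day's recorded hours to the standard 7.25 h working day.
    # The hours below are illustrative only.
    standard_day = 7.25
    hours = {"special project": 3.0, "RSE-salaries (internal)": 4.0}

    total = sum(hours.values())
    scaled = {k: round(v / total * standard_day, 2) for k, v in hours.items()}
    print(scaled)   # {'special project': 3.11, 'RSE-salaries (internal)': 4.14}

That is, a 7-hour day split 3 h / 4 h is recorded as roughly 3.11 h on the special project and 4.14 h on RSE-salaries.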

Normal funded projects

For projects that provide their own funding but aren’t special, GitLab is used to track the time we spend on them. The main purpose of Gitlab is to record the department distribution of all of our basic funding, for which Halli can’t hold all the needed information. Other funded projects which can be intermixed with our normal work can also fit into this category.

Daily procedures: A Gitlab issue is created for every project and used for each day’s work, with funding source Funding::Project. Time is recorded in Gitlab and may be mixed with other projects as the customer sees appropriate. Halli is marked to the respective project and is correct at least by month.

Internal charging projects

“Internal charging” projects are funded, but are paid in one sum for a certain amount of work, and there is no place to mark hours in Halli. These are mostly certain types of basic funding. Gitlab is used to track time spent on these projects.

Daily procedures: Like above for Gitlab, with funding source Funding::Project. Halli is marked to the standard RSE-salaries project.

Basic funding projects

These projects are paid by our basic funding, provided by our sponsoring units. This also includes all of our internal work, meetings, development, and teaching.

Daily procedures: Same as above. Gitlab funding marked as Funding::Unit

Gitlab day-to-day procedure

See the rse-timetracking repository for info on how to use Gitlab. But the actual data is in rse-projects, a separate private repository.

Project prioritization

This page describes the types of projects we have and the general principles of how we prioritize them. It doesn’t say exactly how each project is prioritized (there are too many).

Types of projects
  • Size G (for “garage”): the smallest projects, handled within the daily garage. A few hours and not scheduled; they are handled as people come to the garage. Entered in the garage diary, but not the rse-projects tracker.

  • Size S (“small”): <= 1-2 days.

  • Size M (“medium”): <= 1 month.

  • Size L (“large”): > 1 month. These are generally paid by the projects themselves.

RSE staff that are fully funded from a certain project are outside the system described on this page, and work on the projects as decided by their funders.

Prioritization

G projects usually get priority, but that is because they are drop-in and not scheduled. Whoever is available will help (usually the same person but who knows). We help for a reasonable amount of time (depending on need and busyness) for each drop-in session. A session ends with the problem solved, a request to come back the next day, or an upgrade to an S-level project.

S projects are often used as fillers during downtime in other projects. We often have a general priority list, but the actual start time can be a bit uncertain.

M projects are sort of in the middle. They are scheduled when possible, but since they aren’t paid by the research groups the work might be more intermittent.

L projects, being paid by a particular group, usually get priority. However, there is often downtime during these, which is used for other projects.

Some research groups provide “retainer” funding: long-term funding without a specific L-size project. Their funding is used for whatever S and M projects come up, and those S and M projects get a much higher priority (of course, depending on the urgency of the project itself).

There are two main steps in our prioritization:

  • General discussions during the weekly team meetings.

  • Each RSE’s evaluation of each project, based on their knowledge of the work, the time they have, and what the benefit will be.

Per-project prioritization factors:

  • Self-evaluation of usefulness and importance by the researchers

  • Benefit to open science and broader strategic impact

  • Long-term impact to research (for example, improved skills or use of tools)

  • Priority for units which provide funding

  • Diversity and balance, including diversity goals.

Implementation (2020 plan)

About this page

This is our tentative implementation plan, as of August 2020. It is always subject to revision, but is a somewhat controlled document.

About

Research Software Engineers provide specialized scientific computing and data management support to researchers, beyond what is currently offered by Science-IT. Their funding is guaranteed by departments/schools/other units, but after the ramp-up phase most funding is expected to come from the research projects themselves.

Services include, for example, software development, scaling up or optimizing computations, taking new technologies into use, and in general promoting best practices in new and existing research using computational methods.

Funding types and sources

Funding has three types:

  • Ramp-up/Guarantee (R/G): Ramp-up funding to do initial hires, until project funding takes over

    • Ramp-up: department/schools/other units allocate a certain amount of money to do hires.

    • Units which provide Ramp-up/guarantee get first priority for their projects.

    • Replaced with project funding (below), if there are no projects then used for basic services (below).

  • Project (P): External or group money, allocated by a PI for a specific task in their group.

  • Basic (B): Allocated from units for short-term basic service for all of its members.

    • Allows short, strategic assistance without microtransactions

    • Science-IT work is a type of basic work, but may be requested by the Science-IT team instead of the researchers themselves. (For example, Science-IT has a long list of inefficient hardware use and inefficient software practices which can keep RSEs occupied for a long time. RSEs can also work on Triton/scientific computing technical development projects, which helps RSEs gain competence for the rest of their tasks.)

Time allocation principles
  • We track time spent per unit. Fairshare algorithm: the unit with the largest “deficit” in time gets priority for upcoming projects.

  • Units which provide ramp-up/guarantee funding get priority for their projects.

  • Project funding replaces ramp-up/guarantee funding.

  • Time paid from basic funding is allocated to tasks within the unit with the greatest strategic benefit, for example helping an entire group to use better tools or fixing extreme waste of resources.

  • When a group provides project funding, they can decide the tasks the RSE will do.

Ramp-up plan

This is a rough estimate of the type of demand we expect:

Distribution of work         2020 H2    2021    2022    2023    Long-term
FTE                          2          2–3     3–4     3–5     4+
Project work                 20%        50%     60%     70%     70%
Basic work for units         50%        40%     30%     20%     20%
Basic work for Science-IT    30%        10%     10%     10%     10%

  • Our initial survey reached only Triton users and had 40 responses; of these, 60% said “quick consultation”, 60% said “short term, 1–2 days”, and 40% said “medium term, weeks to months”.

  • Actual ramp-up depends on funding cycles, research timing, and human psychology.

Start-up funding (already guaranteed)

(section removed; to be placed elsewhere)

Funding practicalities

Principle: the daily rate is roughly equal to “senior postdoc/staff scientist” salary + overheads.

Principle: When working for a research project, the RSEs record those working hours in Halli to that project. The corresponding portion of the salary is then automatically charged to the project. Remaining hours are recorded to the Dean’s unit RSE project, and once a year we split these costs and send them to each department. [Updated 2020-11-05]

(details to be filled in by Finance)

Measurement and KPIs
  • Number of projects and researchers who have been given support

  • Number of researcher-days saved, as estimated by our customers.

  • Fraction of project funding vs total program funding

Communication
  • Units which fund us will be informed of our activities at least every 6 months.

  • “As open as possible, as closed as necessary”. All RSE program data, documents, and statistics will be public, excluding actual project funding and information from the customers.

Risks and ramp-down
  • Primary risk: making permanent hires, yet not being able to sustain the program long-term.

    • Mitigation: we will only hire RSEs who can be absorbed into Science-IT naturally should the need for this service fade away.

  • Risk: difficulty in reaching researchers and explaining what we do

    • Mitigation: Science-IT has a long list of researchers who are using research services inefficiently: they can be contacted directly to inform about this service. Helping them and producing best practice examples for the future can keep several people busy for years.

  • Risk: Researchers see need, but group leaders unwilling to pay

    • This is indeed a risk, but there is precedent from other countries that there are enough people willing to pay. There will likely be a slow start, but as time goes on, expenses incurred by this service can be written directly into the budget of funding applications.

In our ramp-down strategy, we absorb the RSEs into Science-IT, CS-IT as part of its development efforts, or into other existing teams.

Job descriptions

Warning

This page is still in draft form and being discussed and developed. See the note on the parent page.

These are job descriptions for RSE positions. They are not yet formal HR job descriptions and won’t be directly used as such, but they provide a vision of our career ladder.

A RSE is a researcher whose advancement of science is not defined by the number of papers, but by the quality of software and contributions to open science.

RSE 1

A RSE1 is just starting their career and is being introduced both to software tools and the research process. This RSE would get mentoring much like a new doctoral student does, but instead of aiming at publications, they would aim at quality, released software.

Qualifications: Master’s degree, with a thesis combining computation and research, or software development with some research qualifications, but little real-world research experience.

Pay/job level: roughly like a master’s-level employee or PhD student. Advancement: would be expected within 1–2 years.

RSE 2

Able to competently work on their own projects using tools they know while learning new tools effortlessly. They are learning to find the right tool for the job and to connect the technical task (software- and data-related) to its impact on society, Aalto, and individual grants.

This is roughly equivalent to a postdoctoral researcher, a transition time between academic skills of a doctorate and whatever may come next. In particular, this can serve as a bridge between a (somewhat more theoretically focused) doctorate degree and a job in industry, and CV and skills development is in line with this.

Qualifications: Doctorate or extended work experience. Pay/job level: similar to postdoc.

Advancement: expected to advance within 2-3 years. This person is still in training (much like a postdoc) and is probably deciding which way to take their career.

RSE 3

Like above, but is additionally able to independently negotiate with research groups to plan a project, including deciding tools and expected results. In particular, a RSE 3 should be able to explain the value of good software practices to the researchers and plan/advocate for good open science and research data management practices across various fields.

Pay/job level: like staff scientist, always permanent.

Advancement: A person is a competent, independent scientist/engineer at this point, and advancing is not needed for everyone. Of course, lifelong learning always continues. To be honest, advancing in the academic system is difficult, and many people will make a horizontal move to another place.

Beyond

At this point, you are not exactly developing RSE skills but leadership skills. This is, of course, adjusted to each person individually, but two possible levels include:

  • RSE leader responsible for a department, school, or research area.

  • RSE group leader responsible for university-wide leadership.

Other internal/parallel advancement

Other career development is not a part of the Aalto RSE program (yet?), and to be honest it’s hard to see an internal advancement in the current academic system (by the time you get to the top of our team, you are already at the top). Still, there are many ways people can continue their career development depending on their career goals, for example:

  • Tech lead of larger RSE projects (few projects require this)

  • Study and develop new technologies for production (perhaps a parallel move to an IT team)

  • Management, either of RSE group or other services

  • Applying for grants, leading projects, etc. as a staff scientist might do (this would be outside the RSE service team)

  • Mentoring or supervising students or other researchers

At Aalto, these aspects are not yet developed, and some of them would be horizontal moves outside the RSE team (or collaboration with someone outside the team). At some point, people have to take their careers in the direction they want and begin combining various unique skills.

Commercial developers

We don’t plan on competing with commercial developers, but the differences compared to a RSE3 are that:

  • A software developer can do what is asked, but may not work with the researcher to figure out what they actually need. A software developer will probably work in a more requirements-and-product-based way, rather than developing a tool over time in an agile, research-driven way.

  • A software developer may produce a product that is not sustainable in an academic setting: one that requires too much focus and specialized knowledge to be improved in an academic environment.

  • A software developer may use more modern and industrial-scale tools.

  • A software developer from outside would come in and leave, a RSE in this group would provide longer-term support (but this is more a property of the group, not the person).

Unit information

This page describes the Aalto units which are supporting the RSE program and what their priorities are.

The service is currently (early 2023) mainly funded by the School of Science, with a grant from IT Services to allow use through all of Aalto.

See For units such as departments, schools, or flagship projects if you would like to join the RSE service as a department or school.

SCI

Supporting the whole community.

CS

Supporting the whole community.

NBE

Supporting the whole community.

PHYS

Supporting the whole community.

FCAI

FCAI sponsors several research software engineers, who both do general work and targeted work. In effect, FCAI projects get a higher priority and management sends some strategic projects to us for intense support.

Rest of Aalto

IT Services provides a grant to support research in all of Aalto.

If your project has its own funding, we can support it. And the SciComp garage support is always available.

Advisory board

Warning

This page is a draft.

This page describes the advisory board of the Aalto RSE program and hosts the results of its meetings. Out of principle, all material is open on this page (though specific items may be retracted).

Purpose of the advisory board

The advisory board provides advice to the strategy (and when relevant, day to day implementation) of the Aalto RSE program and its relation to research, scientific computing, and teaching at Aalto.

Current advisory board

Currently, the advisory board is the Science-IT board.

Meetings (Section not in use)

Topics for the next meeting and results from previous meetings are located here, newest first.

Next meeting
  • Purpose of advisory board and its roles. How often to meet?

  • What are your priorities?

  • What is the threshold for your department to “pay” for service.

  • How can we find customers?

  • How much do we focus on cost recovery, and how much on basic work?

  • What are our KPIs? See Measurement and KPIs and Tracking scientific benefits.

    • Cost recovery from projects

    • N ongoing projects and N completed projects

    • N publications supported.

    • N open outputs produced (non-publication: datasets, software, etc.)

    • Survey (of PIs) of benefits after.

    • Estimated time saved.

2022 Aalto RSE report

Summary

  • The Research Software Engineering service allows researchers to take on more ambitious computational projects, and for existing projects to be much faster and higher quality.

  • About 100 projects in a bit less than two years.

  • Perhaps 5× return on time spent (time we spent vs time researchers saved).

  • There is no shortage of RSE projects, we could get more if we did more outreach (which we don’t focus on, since we would then be over-capacity).

  • We haven’t been able to receive much project funding, due to financial transaction difficulties (grant rules make this difficult, and we complete most projects so quickly that the transactions would be too small).

  • We have gotten other long-term support: FCAI has supported a dedicated RSE, IT Services has provided funding to extend beyond SCI.

  • Our proposal is that groups receive ~1 month of free service, paid with department/school funding. After that, they should find their own funding.

    • Most projects take well under one month, though, so we still focus on basic funding.

    • However, there is a steady stream of longer-term project proposals which offer funding.

  • We ask for

    • Continued basic funding at the current levels.

    • Help in finding the researchers and projects who can most benefit from this service. Can our results be better reported as Open Science/societal impact stories? Should a small amount of RSE time be written into every grant?

    • Recommendations for other schools/departments to join us.

Current status of Aalto RSE
History of Aalto RSE
  • December 2018: Idea (“Computational support scholar” postdoc-type position)

  • December 2019: initial funding from SCI

  • October 2020: first hires (permanent)

  • Now: three permanent full time staff, continuous stream of projects.

Current staff and jobs

We are part of Science-IT:

  • Three full-time RSEs (the only ones funded by this service)

  • One staff leading the RSE group + working as a RSE

  • Two other staff with RSE funding from specific projects

  • Three other staff focused on infrastructure support, but “we are all RSEs anyway”

Types of projects
  • No shortage of projects, also not yet over our capacity.

    • We don’t advertise too much, since that would take us over our limit.

    • Thus, there is definitely capacity to expand.

Projects fit into two main categories:

  • RSE projects take days to months, are recorded in our issue tracker, and include long-term support.

    • The “classic work” of an RSE.

  • Garage help are small questions that come up in the “daily garage” and answered immediately.

    • Garage is our daily support method, answering small questions.

    • Garage transforms research from “trial and error” to “professional quality”.

Project stats
  • 101 researcher projects in ~ 1.75 years.

  • Overall “researcher time saved” is generally 5× “RSE time spent” (self-reports from customers)

[Figure: 2022-projects-time-by-department.png]

Time spent by department, 2020-2022. Not all time is recorded. Figure includes only full-time RSE time spent. Note that we have other funding sources that allow us to work outside of SCI, and that Science-IT receives general funding to serve the whole university.

[Figure: 2022-projects-time-needed-by-task.png]

Time estimate (past and present) by type of tasks of all recorded projects, including future projects, leads, and canceled ideas, 2020-2022. Projects have multiple tasks/benefits and all time is included in all tasks in this figure. Values should only be used as a relative comparison.

Garage stats
  • From October 2020 – August 2022, about 500 visitors logged (about half of visits are recorded).

  • An average visit involves 30–60 minutes of support.

  • Most visits involve answering tech/software questions to help research, and teaching people to be self-sufficient.

  • We also estimate overall “researcher time saved” is 5× “RSE time spent” (self-reports from customers)

[Figure: 2022-garage-customers-departments.png]

Garage customers by department, 2020-2022, small sample of data. Note that old garage data is extremely sparse, so this is more of a current estimate.

[Figure: 2022-garage-customers.png]

Garage customers by title, small sample of data. We only recently began collecting this data, so it is very incomplete but roughly representative. We only support researchers and staff, students not doing research are directed to student resources.

Current and future funding
Financial transaction difficulty
  • Original plan: try to get most funding from grants.

  • Finance (for very good reasons) doesn’t want to do small transactions - minimum 1 month. Thus, we haven’t been able to accept much project funding.

    • Academy/EU rules don’t allow easy internal invoicing, so salaries must be paid directly from the grant. This adds more overhead.

  • We need high-level leadership support on this topic.

Our current project funding policy
  • Each research group gets ~1 month of free RSE time sponsored by basic funding.

  • After that, a group is expected to provide their own funding for future RSE projects.

  • However, we finish most projects in less than a month.

Future funding plan
  • We should maintain at least ~2 FTE of basic funding for the near future for our current number of customers (≈ SCI).

    • Any increases would be used well, though.

  • Future hires could be made when project funding is enough to justify costs (SCI funding as buffer between project periods)

  • A fair number of projects (~10-20) have written months of work into submitted grants, funded us, or offered funding.

  • More basic funding from other departments?

    • IT Services has provided pilot funding (3 months) to expand to other schools, and it has been a success.

Future plans
Planned long-term funding
  • The Finnish Center for AI has committed 4–5 years of full-time RSE funding; this was used to hire a third RSE.

  • We are currently (September) planning to get more IT Services funding to secure the service beyond SCI. We will need to carefully check how this affects our staffing levels.

  • These types of strategic investments seem practical and scalable.

Wanted: Better outreach and impact
  • There is no shortage of projects, and advertising more will surely fill us up.

  • But, we can still increase the impact of the projects we select. Can you help point the most important projects to us?

  • Especially societal impact (public use of data and algorithms) could give us many more projects.

Expansion to other schools
  • We expect this service to expand to other schools and universities in the future (bringing their own funding).

  • This will allow a broader knowledge base from which any individual project can draw.

  • Please recommend to other leaders to join us in the RSE concept.

See also

  • The CSC optimization service is essentially a RSE service, targeted at CSC/LUMI resources (but in theory it can do more). They are particularly good at low-level programming tasks.

  • The Nvidia AI Tech Center provides free RSE services for research projects for Finnish Center for AI members (includes Aalto).

Scientific computing

In this section, you find general (not Aalto specific) scientific computing resources. We encourage other sites to use and contribute to this information.

Scientific computing tips

Encryption for researchers

This page describes the basics of encryption to an audience of researchers. It describes how encryption may be useful (and when it is not needed) in a professional research environment, in order to secure data. It doesn’t describe encryption for personal use (there are plenty of other guides for that), and it doesn’t go into very deep details about cryptography. Cryptography is the type of thing where there are a huge number of subtle details, and everyone has their own opinion. This guide is designed to provide an understanding for basic use, not cover every single subtle point.

Status: this page is usable but not a complete guide. It will be extended as needed.

Summary

Modern cryptography is quite well developed and available in many places. However, the human side is very difficult. Encrypting something, but not keeping the key or password secure, has no benefit. To use encryption, you need to decide what your goals are (who should have access, who you want to keep the data safe from) and then plan accordingly. The security of cryptography is decided more by how you manage the keys and process than by the deepest technical details.

Key management

The point of encryption is to trade a hard problem (keeping a lot of data secure) for a more limited problem (keeping a single key or password secure). These keys can be managed separately from the data. This immediately takes us to the importance of key management. Let’s say you can’t send data over email unless it is encrypted. If you encrypt it and send the password in the same email as the encrypted data, you have technically satisfied the requirement while adding no real security at all. A better strategy would be to give the password to someone when you meet them in person, send it by another channel (e.g. SMS, but then it is only as secure as SMS+email), or even better use asymmetric encryption (see below).

Deciding how you will manage keys is the hardest part of using encryption. For some fun, next time you hear someone talk about using encryption, see if they mention how they keep the keys secure. Usually they don’t, and you have no way of knowing if they actually are doing it securely.

Symmetric vs asymmetric encryption

There are two main types of cryptography. They can both be considered equally secure, but have different ways of managing keys.

Symmetric encryption uses the same password/key for encrypting and decrypting. It is good because it is simple: there is only one key or password you need to know, and it is easy to think “one dataset = one password”. However, everyone needs to know the same password, and it can’t easily be changed. Since the same password has to be everywhere, this can be a bit insecure depending on the use, and it can be complicated to keep that key/password secure (if there are many people, or if it needs to be automated).

Asymmetric encryption has different keys for encrypting and decrypting. So, you use a “public key” to do an encryption (which requires no password - everyone in the world can know this key and your data is still secure). You have a separate private key (+password) which allows only you to decrypt it. This separation of encryption and decryption was a major mathematical breakthrough. Then, anyone who needed to receive data securely would have their own public/private key, and all the public keys are, well, public. When you want to send data to someone, you just encrypt it using their public key, and there is no need to manage sharing a password. This allows you to: encrypt so that multiple people can read it, encrypt automatically without password, and encrypt to someone not involved in the initial process.

With asymmetric encryption, there are some more things to consider. How do you make sure that you have the right public key?

Encryption programs

This lists some common programs, but this should not be taken to mean that using these programs makes your data safe. Security depends on how you use the program, and security will only decrease over time as new analysis is done. It is usually best to choose well-supported open source programs where possible. More detailed instructions will be provided as needed.

7zip

7zip is a file archiver (like zip). It can symmetrically encrypt files with a passphrase.
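For reference, a minimal sketch of symmetric encryption with the 7z command-line tool (the archive and folder names are just examples; options may vary between 7zip versions):

7z a -p -mhe=on archive.7z myfolder/    # "a" adds to an archive; -p prompts for a passphrase,
                                        # -mhe=on also encrypts file names (7z format only)
7z x archive.7z                         # extract; prompts for the passphrase

As with any symmetric encryption, security then rests entirely on how the passphrase is shared and stored.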

PGP

PGP is a set of encryption standards (and also a program). It has a full suite of encryption tools, and is quite stable and well-supported. You often hear about PGP in the context of email encryption, but it can be used for many things.

On Linux systems, it is normally found as the program gpg (Gnu Privacy Guard). This guide uses gpg.

Full disk encryption

Programs can encrypt the entire hard disk of your computer. This means that any data on it is safe, should your computer be lost. There are programs to do this for every operating system, and Aalto laptops now come encrypted by default.

Using symmetric encryption with gpg

Encryption:

gpg --symmetric input.file

Decryption:

gpg input.file.gpg

This will ask you for a password. If you do not want it to, you can use --passphrase-fd to pass the passphrase automatically. Normally, keeping a password in a file is considered quite insecure! Make sure that the permissions are restrictive. Anyone that can read this file once will be able to read your data forever. The file could be backed up and spread all over the place - is that what you want? IT admins will technically be able to see the passphrase (though they do not look). Is this all within the scope of your requirements?

cat pass.txt | gpg --passphrase-fd 0 --symmetric input.file
Using asymmetric encryption with gpg

When using asymmetric (public key) encryption, you have to generate two keys: public and private (they are made at the same time). The private key must be kept private, and has a passphrase on it too. This provides an added level of security on top of the file permissions.

There are plenty of guides on this available. Some examples:

You can encrypt a single file to multiple keys. This means that the owner of any of the private keys can decrypt the file. This can be useful for backups and disaster recovery.
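A rough sketch of the asymmetric workflow with gpg (the recipient addresses are placeholders; see a full guide for proper key management):

gpg --full-generate-key                        # one-time: create your own public/private key pair
gpg --encrypt --recipient alice@example.org \
    --recipient backup@example.org input.file  # encrypt to one or more public keys
gpg --decrypt input.file.gpg > input.file      # decrypt with your private key (asks for its passphrase)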

General warnings
  • Strong encryption is serious business. It is designed so that no one can read the data should the keys or passwords be lost. If you mess this up and lose the key/password, your data is gone forever. You must have backups (and those backups must also be secure), …

  • If you keep passwords in files, or otherwise send them insecurely, then the technical security of your data is only as strong as that of the key/password.

  • The strength of your encryption also depends on the strength of your password (this is the reason it is often called a “passphrase” - a phrase is more secure than a standard password). Choose it carefully.

Advanced / to do
  • How much security is enough?

  • Set cipher to AES (pre 16.04)

Git

Git is a version control system. This page collects various Git tutorials and resources

Version control systems track changes to files. This means that as you are working on your projects (code, LaTeX, notes, etc.), you can track their history: you can see past versions and collaborate better (see the short example below). Using version control at least for code should probably be one of the minimum standards of computational research.

Git is a distributed version control system designed to handle everything from small to very large projects with speed and efficiency. Git is easy to learn and has a tiny footprint with lightning fast performance. It outclasses SCM tools like Subversion, CVS, Perforce, and ClearCase with features like cheap local branching, convenient staging areas, and multiple workflows. — Git website
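Before diving into the tutorials, a minimal first session looks roughly like this (the file name is just an example):

$ git init                       # turn the current directory into a repository
$ git add analysis.py            # stage a new or modified file
$ git commit -m "Add analysis"   # record a snapshot with a message
$ git log --oneline              # view the history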

Note

  • This page is git in general, not Aalto specific.

  • aalto/git contains advice on the Aalto Gitlab, a repository for the Aalto community integrated to Aalto systems.

Basic git tutorials
More references

Gitlab-specific information:

Other hosting services

Realistically, use version.aalto.fi for most projects related to Aalto research, and Github if you want to make something open-source with a wider community (but you can also make open repos in Aalto Gitlab, just harder for random people to contribute). For non-work private repos, you have to make your own choice.

  • Github is a proprietary commercial service, but extremely popular. No free private repositories or groups (but you can pay).

  • Bitbucket is also somewhat popular, limit of free 5 private repositories (but you can pay for more).

  • Gitlab.com is a commercial service but makes the open-source Gitlab edition. Gitlab.com offers unlimited private repositories.

  • source.coderefinery.org is another Gitlab hosted by the Coderefinery project, a pan-Nordic academic group. It might be useful if you have a very distributed project, but realistically for Aalto projects, use Aalto gitlab.

Git-annex for data management

See also

Video intro to git-annex, from Research Software Hour.

DataLad is a researcher/research data management focused frontend to git-annex. This page is a relatively technical introduction to what goes on inside of git-annex, so the DataLad handbook might be a better place to start, and then consult this page for another view / more detailed information.

git-annex is an extension to git which allows you to manage large files with git, without checking their contents into git. This may seem contradictory, but it basically creates a key-value store for large files, whose metadata is stored in git and whose contents are distributed using other management commands.

This page describes only a very limited set of features of git-annex and how to use them. In particular, it tries to break git-annex into three “simple” types of tasks. Git-annex can do almost anything related to data management, but that is also its weakness (it doesn’t “do one thing and do it simply”). By breaking the possibilities down, we can hopefully make it manageable. The three layers are:

  • Level 1: Track metadata in git and lock file contents local-only: Even on a single computer, one can rigorously track data files to record who produced the data, the history, and the hash of the content, even without recording the contents into git. On top of this, files can be very safely locked to prevent accidental modification of primary copies of the data. (commands such as git annex add)

  • Level 2: Transfer and synchronize file content between repositories: Once the metadata is tracked and the git repository is shared, you might want to move the content between repositories. You can easily do this with git annex get and git annex copy [--to|--from]. You can put any file anywhere and the metadata is always synced.

  • Level 3: Manage synchronization across many repositories: Once you have more than two (or even more than one) repository, keeping track of locations of all files is hard. Git-annex solves this as well: you can define what content should be in each location and data is automatically distributed. So, for example, you can insist that all data is always stored in your object storage, all active data is also on the cluster, and user environments have whatever is requested. Git-annex is very focused on never losing data; it can ensure that one locked copy is always present in some repository. (commands such as git annex wanted, git annex numcopies, git annex sync --content)

The biggest problems are that it can do everything, which makes documentation quite dense, and the documentation can be hard to navigate.

Background

You probably know what git is - it tracks versions of files, and the full history of every file is kept. When something is recorded in git-annex, the raw data goes to a separate storage area, and only links to it and the metadata are distributed using regular git. So, all clones know about all files, but don’t necessarily have all data. Using git annex get, one can get the raw data from another repo and make it available locally.

For example, this is a ls -l of a real git repository which has a small-file.txt and a large-file.dat. You see that the small file is just there, but the large file is a symlink to .git/annex/objects/XX/YY/...:

$ ls -l
lrwxrwxrwx 1 darstr1 darstr1 200 Feb  4 11:08 large-file.dat -> .git/annex/objects/X4/xZ/SHA256E-s10485760--4c95ccee15c93531c1aa0527ad73bf1ed558f511306d848f34cb13017513ed34.dat/SHA256E-s10485760--4c95ccee15c93531c1aa0527ad73bf1ed558f511306d848f34cb13017513ed34.dat
-rw-rw-r-- 1 darstr1 darstr1  21 Feb  4 11:06 small-file.txt

If the repository has the file, the symlink target exists. If the repository doesn’t have the file, it’s a dangling symlink. git add works like normal, git annex add makes the symlink.

Now let’s run git annex list here. We see there are two repositories, here and demo2. large-file.dat is in both, as you can see by the Xs. (“web” and “bittorrent” are advanced features, not used unless you request them… but they give you an idea of what you can do):

here
|demo2
||web
|||bittorrent
||||
XX__ large-file.dat

The basic commands to distribute data are git annex get, git annex drop, git annex sync, and so on. The basic principles of git-annex are data integrity and security: it will try very hard to prevent you from using git/git-annex commands to lose the only copy of any data.

Basic setup

After you have a git repository, you run git annex init to set up the git-annex metadata. This is run once in each repository in the git-annex network:

$ git init
$ git annex init 'triton cluster'   # give a name to the current repo
Level 1: locally locking and tracking data

You can add small files like normal using git (full content in git), and large files with git annex add, which replaces the file with a symlink to its locked content:

$ git add small-file.txt
$ git annex add large-file.dat
$ git commit           # metadata: commit message, author, etc.

Now, your content is safe: it is a symlink to somewhere in .git/annex/objects and it is almost impossible for you to accidentally lose the data. If you do want to modify a file, first run git annex unlock, and then commit it again when done. The original content is saved until you clean it up (unless you configure otherwise). The largefiles setting determines the behavior of git add: you can set which files should always be committed to the annex (instead of git).
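One way (among several) to configure this is git annex config, which stores the setting in the repository itself; the size threshold below is only an example, and in recent git-annex versions plain git add then respects it:

$ git annex config --set annex.largefiles 'largerthan=100kb'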

At this point, git push|pull will only move metadata around (the commit message and the link to .git/annex/objects/AA/BB/HHHHHHHH, where HHHHHHHH is a unique hash of the file contents). This is what is stored in the primary git history itself.

Structured metadata (arbitrary key/value pairs) can be assigned to any files with git annex metadata (and can be automatically generated when files are first added, such as the date of addition). Files can be filtered and transferred based on this metadata. Structured metadata helps us manage data much better once we get to level 3.
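A small sketch of attaching and using metadata (the field name and value are arbitrary examples):

$ git annex metadata --set experiment=pilot1 data/input1.dat   # attach a key/value pair
$ git annex metadata data/input1.dat                           # show a file's metadata
$ git annex find --metadata experiment=pilot1                  # list files matching that metadata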

So now, with little work, we have a normal git repository that provides a history (metadata) to other data files, keeps them safe, and can be used like a normal repository.

Relevant commands:

Level 2: moving data

Data in one place isn’t enough, so let’s do more. Just like git remotes, git-annex remotes allow moving data around in a decentralized manner.

Regular git remotes are set up with git annex init on the remote side. Special remotes are created with git annex initremote. Every remote has a unique name and UUID to manage data locations.
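For example, to link up an ordinary git remote that has itself been initialized with git annex init (the URL and paths are hypothetical):

$ git remote add cluster ssh://user@cluster.example.org/path/to/repo
$ git annex sync cluster                        # exchange git-annex metadata with the new remote
$ git annex get --from=cluster data/input1.dat  # fetch content specifically from that remote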

Once the remotes are set up, you can move data around:

$ git annex get data/input1.dat                # get data from any available source
$ git annex copy --to=archive data/input2.dat

You can remove data from a repo, but git-annex will actively connect to other remotes to verify that other copies of the file exist before dropping it:

$ git annex drop data/scratch1.txt

These commands move data around in .git/annex/objects/ and update tracking information on the special git-annex branch so that git-annex knows which remotes have which files - very important to avoid a giant mess!

Special remotes can be created like this:

$ git annex initremote NAME type=S3 encryption=shared host=a3s.fi

And enabled in other git repositories to make more links within the repository network:

$ git annex enableremote NAME

Note that special remotes are client-side encrypted unless you set encryption=none, and also chunked to deal with huge files even on remotes which do not support them.

Relevant commands:

Level 3: synchronizing data

Moving data is great, but when data becomes Big, manually managing it doesn’t work. Git-annex really shines here. The most basic command is sync --content, which will automatically commit anything new (to git or the annex depending on the largefiles rules) and distribute all data everywhere reachable (including regular git-tracked files). Without --content, it syncs only metadata and regular commits:

$ git annex sync --content

But, all data everywhere doesn’t scale to complex situations: we need to somehow define what goes where, and this should be done declaratively. One of the most basic declarations is the minimum number of copies allowed, numcopies. Git-annex won’t let you drop a file from a repository without being very sure that this many copies exist in other repositories. This setting is synced through the entire repository network:

$ git annex numcopies N

The next level is preferred content, which specifies what files a given repository wants. git annex sync --content will use these expressions to determine what to send where:

$ git annex wanted . 'include=*.mp3 and (not largerthan=100mb) and exclude=old/*'
$ git annex wanted archive 'anything'
$ git annex wanted cluster 'present or copies=1'

Repository groups and standard groups allow you to more easily define rules (the standard groups list lets you see the power of these expressions). Various built-in background processes can automatically watch for new files and run git annex sync --content automatically for you, which can make your data management a fully automatic process. Repository transfer costs can allow git-annex to fetch data from a nearby source, rather than a further one. Client-side encryption can allow you to use any available storage with confidence.
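For example, a sketch of assigning repositories to standard groups (the repository name follows the earlier examples; check the git-annex documentation for what each standard group means):

$ git annex group archive backup      # put the "archive" repo into the standard "backup" group
$ git annex wanted archive standard   # use that group's standard preferred-content expression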

Relevant commands:

See also
  • Video intro to git-annex, from Research Software Hour.

  • DataLad is a data-management focused interface for git-annex. This might be a better place to start. DataLad also handles submodules (useful for very large numbers of files) and running workflows and saving the metadata.

  • git LFS These two git extensions are often compared. git LFS is created by GitHub, and operates on a centralized model: there is one server, all data goes there. This introduces a single point of failure, requires a special server capable of holding all data, and loses distributed features. git-annex is a true distributed system, and thus better for large scale data management.

  • dvc: The level 1/2 use case is practically copied from git-annex. It seems to have a lot less flexibility on high-level data management, client-side encryption. The main point of dvc seems to be track commands that have been run and their inputs/output to make those commands reproducible, which is completely different from git-annex. Most importantly (to the author of this page) it has default-on analytics sent to remote servers, which makes its ethics questionable.

Hybrid events

This page is our recommendations/ideas for hybrid events (in-person plus online components). It may be out of place on scicomp.aalto.fi, but it’s the best place we have to put them right now. Unlike other recommendations, this page is not just for teaching but applies to any type of event.

Why hybrid?
  • Why do you want hybrid, as opposed to online or in-person? If you can’t clarify the purpose to yourself, it may be hard to put on a successful event.

    • In-person gives better chances to talk in small groups and among your friends, both during and after the event. (Is your in-person event disadvantaging introverts or less well connected people?)

    • Online allows anyone to participate with a lower threshold. If you do it right, you could allow anyone in the world to take part.

As a side note, for massive events, participants can get a full experience by having their own group chat to discuss the topics, separate from the event chats.

General considerations
  • Plan and test early, don’t assume things work unless you experience it yourself.

  • The first time (or few times), have a separate “director” who can manage the online part and tech, so the hosts focus on hosting.

  • Related to the above (possibly the same person), have someone to help interface with the audience and relay questions from them to you, answer basic questions, etc. This person should be able to interrupt you immediately for pressing questions. For the largest events, have two: one person answering questions directly, one selecting and queuing questions for the speakers.

  • Audio is the most important part and will most often go wrong. Make sure you use microphones well, don’t count on wide-area room mics, do an audio check days before and immediately before, ask the audience if it is good, and make sure they tell you immediately if problems develop or things get worse.

  • Consider activities during breaks for the people online. Yes, you need to go slowly to give people a chance to get their coffee, but can you also do something during the breaks? Are there some ways to facilitate online↔in-person networking during breaks?

  • The meeting begins well before the scheduled time for random discussion, and ends well after the scheduled time for post-meeting discussion. Don’t end the online discussion right after the meeting (this is an important lesson even for online-only meetings!).

  • For the reasons above, you need more staff than for a single-format event. For each task such as registration or entertaining people during breaks, you will need someone to do the same thing for the online people, and usually it is better if you have someone focusing on each audience (while working together to bring the two together).

  • What about after the event? If you have streamed it, you could also record it. Can you do this while maintaining the privacy of all participants, so that the information is not lost and is reusable later? What follow-up communication and so on can you do? Start thinking of this early.

Feedback and interaction

One of the biggest advantages of online events is the combination of multiple communication channels, so that it is not just extroverts asking questions.

  • Have a clear way to get feedback (like Presemo). Make it very explicit how this works. Have some icebreaker polls/questions.

  • Require in-person audience to ask questions via the feedback tool, not via voice. Distributing microphones is a lot of work and will often be forgotten, and also voice questions bias towards extroverts, and you will be able to better order your answers. Text questions also allow other people to answer and give help at the same time. If a question becomes a discussion, you could distribute microphones.

  • When feedback and questions are done well, they can be published along with the talk (make sure you announce this in advance). Especially the “document-based” method below is very good for this, since it can be fixed up after the course.

  • Make sure that the current presenter can always see the questions. A good recommendation is a separate computer with a large font next to your presentation computer.

  • To encourage people to use this, it is best to also screenshare/project it, so that the audience can see that it is in active use. This takes some screen space, but can be well worth it if it increases interaction.

  • If the text communication tool is the same as the rest of what the event uses, and has good threading support, then you get even more synergies.

There are different types of feedback tools:

  • Chat is simple, but linear and thus questions can easily get lost, and answers are hard to connect to questions. The advantage is it is usually built-in to meeting software.

  • Feedback tools like Presemo (https://presemo.aalto.fi) allow basic questions, voting, and replies.

  • Documents (google docs, HackMD, etc) allow free-form text. The general idea is people write a free-form question or comment at the bottom of the document, and bullet points are used to give answers or replies. This requires some getting used to and has risk of trolling in extremely large events, but when this works, it works well. See the CodeRefinery HackMD mechanics for an example and advice.

Tech: Zoom

Zoom, and other meeting software, have many of the features that can be used for an easy, self-service hybrid event. We assume you know how to use Zoom (or equivalent) by yourself for an online meeting, and here we describe the changes for hybrid events.

The advantage of using normal meeting software is that you don’t need to learn a new tool and it is perfectly reasonable to do everything self-service.

  • Classrooms set up for hybrid work have camera inputs hooked up to the room cameras. There is a separate control panel for switching and rotating the cameras. Play around with the controls to learn how they work. Select the right input.

  • Zoom can equally share the screen like normal.

  • If you present from your own computer, you can run zoom on your computer to share screen, and use the room computer to share the camera view + sound. You can tell any other presenters to do the same.

  • Consider how you screenshare if it should be a two-way meeting (online audience should be visible to local audience):

    • Zoom in “Dual monitor mode” (find under general settings) actually produces two windows, one with the {current speaker or screenshare} and one with the gallery. If you have two monitors in the room, this makes a great experience: the entire gallery is visible and if someone uses zoom “raise hand”, it is apparent to everyone.

    • If you do the above, the current speaker can present from their desk via screenshare. This may be easier than transferring to the presentation computer.

    • Remember to share the collaborative notes, agenda, and/or chat by default, so that people are motivated to use that instead of speaking over each other.

  • Remember the benefits of being online. Providing slides and material in advance allows online (and in-person) people to use multiple channels at the same time, if it suits them.

Zoom audio in a classroom

As described above, audio is one of the most important considerations. In principle it is easy, but there are many details to consider.

  • The first is your goals: we have three categories, (presenter), (in-person audience), (online audience). Which of them should hear each other?

  • The main thing is to prevent audio feedback. To solve this, it is important to have one machine as the audio master in the room (it has both the microphone and speakers connected to it). This also prevents the presenter from having their audio go back into the room via the online meeting.

  • Presenter → online can be done with microphones connected to a computer, for example the classroom computer connected to the microphones or a bluetooth microphone.

  • In-person audience → online, in practice, needs to be done by passing around microphones. A wide-area microphone might work, or might not.

  • Online → in-person is a bit more interesting. You can connect the audio computer to the speakers in the room (or external speakers). You will need to position the speakers to avoid feedback into the microphones as much as possible, and adjust all the different volumes.

  • To adjust for the different sound levels of the different groups, you might need someone to continually monitor and adjust the volumes of the various microphones separately.

Overall, you could say that voice communications is the main point of in-person meetings. But it is also the hardest to scale to a large audience. Consider if you can get text feedback and interaction working well, and then perhaps you could skip audio - and perhaps the entire effort of a hybrid event?

Tech: dedicated A/V setup

We have put on an event with a dedicated A/V setup, with external microphones, etc. In the end, it also used Zoom to broadcast to the world, so was quite similar to the above. Perhaps this recommendation is obsolete and one should just use the above as a starting point?

TODO: more info

Tech: live streaming

For the largest events, meeting software doesn’t work: you have to manage all the participants, and any one participant can disrupt the event for everyone else. The “live streaming” model is much better in this case: it is a one-to-many broadcast, not a many-to-many meeting. Live streaming is popular these days, and thus you can find many user-friendly but powerful tools.

For now, see CodeRefinery manuals on the MOOC strategy for a detailed description.

See also

Aalto University links:

Pitfalls of Jupyter Notebooks

Jupyter Notebooks are a great tool for research, data science type things, and teaching. But they are not perfect - they support exploration, but not other parts of the coding phase such as modularity and scaling. This page lists some common limitations and pitfalls and what you can do to avoid them.

Do use notebooks if you like, but keep in mind their limitations and how to avoid them, so you can get the best of both worlds.

None of the limitations on this page are specific to notebooks - in fact we’ve seen most of them in scripts long before notebooks were popular.

Modularity

We all agree that code modularity is important - but Jupyter encourages you to put most code directly into cells so that you can best use interactive tools. But to make code the most modular, you want lots of functions, classes, etc. Put another way, the most modular code has nothing except function/class/variable/import definitions touching the left margin - but in Jupyter, almost everything touches the left margin.

Solutions:

  • Slowly work towards functions/classes/etc where appropriate, but realize it’s not as easy to inspect their insides as non-function code.

  • Be aware of the transition to modules - do it when you need to. See the next point.

  • Try to plan so it’s not too painful to make the conversion when the time comes.

Transitioning to modules

You may start coding in notebooks, but once your project gets larger, you will need to start using your code more places. Do you copy and paste? At this point, you will want to split your core code into regular Python modules, import them into your notebooks, and use the notebooks as an interface to them - so that modules are somewhat standard working code and notebooks are the exploration and interactive layer. But when does that happen? It is difficult to make that transition unless you really try hard, because it’s easier to just keep on going.

Solutions:

  • Remember that you will probably need to form a proper module eventually. Plan for it and do it quickly once you need to.

  • Make sure your notebooks aren’t disconnected from your own Python code in modules/packages.

  • You can set modules to automatically reload with %load_ext autoreload, %autoreload 1, and then %aimport module_name. Then your edits to the Python source code are immediately used without restarting and your work is not slowed down much. See more at the IPython docs on autoreload (note: this is Python kernel specific).

  • importnb to import notebooks as modules - but maybe if you get to this, you need to rethink your goal.

Difficulty to test

For the same modularity reasons outlined above, it’s hard to test notebooks using traditional unit testing (if you can’t import notebooks into other modules, you can’t do much). Testing is important to ensure the accuracy of code.

One mitigation is to include mini-tests / assertions liberally, and split code into modules when necessary - maybe you only create a proper testing system once you transition to modules.

Solutions:

  • Various extensions to pytest that work with notebooks

    • nbval, pytest-notebook: run the notebook and check that actual outputs match the outputs stored in the ipynb (see the example after this list).

    • pytest-ipynb: cells are unit tests

    • This list isn’t complete or a recommendation

  • But just like with modularity above, a notebook designed to be easily testable isn’t designed for interactive work.

  • Transition to modules instead of testing in the notebook.
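As a concrete example of the nbval option listed above (the notebook name is a placeholder): it re-runs the notebook and compares the freshly computed outputs to the ones saved in the file:

$ pip install nbval
$ pytest --nbval analysis.ipynb        # strict: check all cell outputs against the saved ones
$ pytest --nbval-lax analysis.ipynb    # lax: only check cells explicitly marked for output checking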

Version control

Notebooks can’t be version controlled well, since they are stored as JSON with embedded outputs. Of course, they can be version controlled (and should be), and there are a variety of good solutions, so this shouldn’t stop you.

Solutions:

  • Don’t let this stop you. Do version control your notebooks (and don’t forget to commit often!), even if you don’t use any of the other strategies.

  • nbdime - diffing and merging, VCS integration (see the example after this list)

  • Jupyter lab / notebook git integration work well.

  • Notebooks in other plain-text formats: Rmarkdown, Jupytext (pair notebooks with plain text versions).

  • Remember, blobs in version control are still better than nothing.
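As a concrete example of the nbdime tools mentioned above (file names are placeholders):

$ pip install nbdime
$ nbdiff old.ipynb new.ipynb       # content-aware diff in the terminal
$ nbdiff-web old.ipynb new.ipynb   # richer, rendered diff in the browser
$ nbdime config-git --enable       # let git use nbdime when diffing/merging notebooks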

Hidden state is opposed to reproducibility

This is a bit of an obscure one: people always say that notebooks are good for reproducibility. But they also allow you to run cells in different orders, delete cells after they have run, change code after you run it, and so on. And this is the whole point of notebooks. So it’s very easy to get into a state where you have variables defined which aren’t produced by your current code and you don’t remember how you got them. Since old output is saved, you might not realize this until it’s too late.

Solutions:

  • Use “Restart and run all” liberally. Unless you do, you can’t be sure that your code will reproduce your output.

  • But wait… part of the point of notebooks is that you can keep data in memory instead of recalculating it each time you run. “Restart and run all” defeats the purpose of that, so… balance it out.

  • Design for modularity and clean interfaces, even within a notebook. Don’t make a mess.

Notebooks aren’t named by default

This is really small, but notebooks aren’t named by default. If you don’t name them well, you will end up with a big mess. Also somewhat related, notebooks tend to drift in purpose: they start as one thing, then end up with a lot of random stuff in them. How do you find what you need? Obviously this isn’t specific to notebooks, but the interactive, modularity-second nature makes the problem more visible.

Solutions:

  • Remember to name notebooks well, immediately after making them.

  • Keep in mind when they start to feature-drift too much, or have too many unrelated things in them. Take some time to sort your code logically once that happens.

Difficult to integrate into other execution systems

A notebook is designed for interactive use - though you can run one from the command line with various commands. But there’s no good command line interface to pass arguments, input and output, and so on. So you write one notebook, but can’t easily turn it into a flexible script to be used many times.

Solutions:

  • Modularize your code and notebooks. Use notebooks to explore, scripts to run in bulk.

  • Create command line interfaces to your libraries, use that instead of notebooks.

  • There are many different tools to parameterize and execute notebooks, if you think you can keep stuff organized:
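One widely available example is plain nbconvert, which ships with Jupyter; it executes a notebook non-interactively but does not itself handle parameterization (notebook names are placeholders):

$ jupyter nbconvert --to notebook --execute analysis.ipynb --output analysis-run.ipynb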

Jupyter disconnected from other computing

This is also a philosophical one: some Jupyter systems are designed to insulate the user from the complexities of the operating system. When someone needs to go beyond Jupyter to other forms of computing (such as ssh on cluster), are they prepared?

Solutions:

  • This is more of a mindset than anything else.

  • System designers should not go through extra efforts to hide the underlying operating system, nor separate the Jupyter systems from other systems.

  • Include non-Jupyter training, some intro to the shell, etc. in the Jupyter user training.

Summary

The notebooks can be great for starting projects and interactive exploration. However, as a project gets more advanced, you will eventually find that the linear nature of notebooks is a limitation because code cannot really be reused. It is possible to define functions/classes within the notebook, but you lose the power of inspection (they are just seen as single blocks) and can’t share code across notebooks (and copy and paste is bad). This doesn’t mean to not use notebooks: but do keep this in mind, and once your methods are mature enough (you are using the same code in multiple places), try to move the core functions and classes out into a separate library, and import this into the day-to-day exploration notebooks. For more about problems with notebooks and how to avoid them, see this fun talk “I don’t like notebooks” by Joel Grus. These strategies are not specific to notebooks, and will make your science better.
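As a sketch of what “moving the core functions out” can look like (continuing the hypothetical analysis_utils module from the command-line sketch earlier):

# analysis_utils.py - the mature, reusable functions live here
import pandas as pd

def load_data(path):
    """Read one results file into a DataFrame."""
    return pd.read_csv(path)

def compute_stats(df, column):
    """Return simple summary statistics for one column."""
    return {'mean': df[column].mean(), 'std': df[column].std()}

# In a notebook cell, you then only import and explore:
#     from analysis_utils import load_data, compute_stats
#     df = load_data('results.csv')
#     compute_stats(df, 'reaction_time')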

In a cluster environment, notebooks are inefficient for big calculations because you must reserve your resources in advance, but most of the time the notebooks are not using all their resources. Instead, use notebooks for exploration and light calculation. When you need to scale up and run on the cluster, separate the calculation from the exploration. Best is to create actual programs (start, run, end, non-interactive) and submit those to the queue. Use notebooks to explore and process the output. A general rule of thumb is “if you would be upset that your notebook restarted, it’s time to split out the calculation”.

Notebooks are hard to version control, so you should look at the Jupyter diff and merge tools. Just because notebooks are interactive doesn’t mean version control is any less important! The “split core functions into a library” advice is related: at the very least, that library should be in version control.

Don’t open the same notebook more than once at the same time - you will get conflicts.


nbscript: run notebooks as scripts

Warning

This page and nbscript are under active development.

Notebooks as scripts?

Jupyter is good for interactive work and exploration, but eventually you need more resources than an interactive session can provide. nbscript is a tool (written by us) that lets you run Jupyter notebooks just like you would Python files. (nbscript main site)

See also

Other tools: There are other tools that run notebooks non-interactively, but (in my opinion) they treat command-line execution as an afterthought. There is a long-standing standard for running scripts on UNIX-like systems, and if you don’t use that, you are staying locked in to Jupyter stuff: the two worlds should be connected seamlessly. Links to more tools here.

Once you start running notebooks as scripts, you really need to think about how modular your whole workflow is. Mainly, think about dividing your work into separate preprocessing (“easy”), analysis (“takes lots of time and memory”), and visualization/post processing (“easy”) stages. Only the analysis phase needs to be run non-interactively at first (to take advantage of more resources or parallelize), but other parts can still be done interactively through Jupyter. You also need to design the analysis part so that it can run on a small amount of data for development and debugging, and the whole data for the actual processing. You can read more general advice at Jupyter notebook pitfalls.

Concrete examples include:

  • Run your notebook efficiently on a separate machine with GPUs.

  • Run your code in parallel with many more processors

  • Run your code as a Slurm batch job or array job, specifying exactly the resources you need.

nbscript basics

The idea is nbscript input.ipynb has exactly the same kind of interface you expect from bash input.sh or python input.py: command line arguments (including input files), printing to standard output. Since notebooks don’t normally have any of these concepts and you probably still want to run the notebook through the Jupyter interface, there is a delicate balance.

Basic usage from command line. To access these command line arguments, see the next section:

$ nbscript input.ipynb [argument1] [argument2]

If you want to save the output automatically, and not have it printed to standard output:

$ nbscript --save input.ipynb               # saves to input.out.ipynb
$ nbscript --save --timestamp input.ipynb   # saves to input.out.TIMESTAMP.ipynb

If you want to submit to a cluster using Slurm, you can do that with snotebook. These all run automatically with --save --timestamp to save the output:

$ snotebook --mem=5G --time=1-12:00 input.ipynb
Setting up your notebook

You need to carefully design your notebook if you want it to be usable both as a script and through Jupyter. This section gives some common patterns you may want to use.

Detect if your notebook is running via nbscript, or not:

import nbscript
if nbscript.argv is not None:
    # We *are* running through nbscript
    pass

Get the command line arguments through nbscript. This is None if you are not running through nbscript:

import nbscript
nbscript.argv

You can use argparse like normal to parse arguments when non-interactive (take argv from above):

import argparse
import nbscript

parser = argparse.ArgumentParser()
parser.add_argument('input', help='Input file')
# assuming nbscript.argv[0] is the notebook name, like sys.argv[0]
args = parser.parse_args(args=nbscript.argv[1:])

Save some variables to a file when running through nbscript, so that you can load the results later:

if nbscript.argv is not None:
    import pickle
    state = dict(results=some_array,
                 other_results=other_array,
                 )
    with open('variables.pickle', 'wb') as f:
        pickle.dump(state, f, pickle.HIGHEST_PROTOCOL)
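Later, in an interactive Jupyter session, you can load those saved results back (matching the filename and keys from the example above):

import pickle
with open('variables.pickle', 'rb') as f:
    state = pickle.load(f)
some_array = state['results']
other_array = state['other_results']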

Don’t run the main analysis when interactive:

if nbscript.argv is not None:
    # Only do the heavy analysis when running via nbscript, not in the Jupyter interface
    pass
Running with Slurm

Running as a script is great, but you also need a way to submit it to your cluster. nbscript comes with the command snotebook to make it easy to submit to Slurm clusters. It’s designed to work just like sbatch, but it directly submits notebook files without needing a wrapper script.

snotebook is just like nbscript, but submits to slurm (via sbatch) using any Slurm options:

$ snotebook --mem=5G --time=1-12:00 input.ipynb
$ snotebook --mem=5G --time=1-12:00 input.ipynb argument1.csv

By default, this automatically saves to input.out.TIMESTAMP.ipynb, but can be configured.

You can put normal #SBATCH comments in the notebook file, just like you would when submitting with sbatch. But it will only detect them in the very first cell that has any of these arguments, so don’t split them over multiple cells. Example:

#SBATCH --mem=5G
#SBATCH --time=1-12:00

Just like with sbatch, you can combine command line options and in-notebook options.


New group leaders: what to know about computational research

As a new group leader, how can you make the most of your future group’s computational work, so that it becomes an investment rather than a liability? This is currently focused on software.

If you are actively writing research software yourself, perhaps directly check out The Zen of Scientific computing instead of this for the more practical side.

About you
  • Are you planning a research group which partly uses computing?

  • Is computing not your main thing (not what you want to focus on / not what you studied)?

  • Do you want your new hires to use best practices, even if you can’t mentor them yourself?

  • Do you want your research to be reproducible and open?

Why plan in advance?
  • Your group’s work is valuable.

  • Over time, your work’s value can grow…

  • … or it can be lost every 5 years as your group changes.

What usually goes wrong?

At a group level, these often happen to semi-computational groups:

  • Every researcher starts a project over from scratch

  • Researchers leave, previous work becomes unusable (your group completely changes every ~5 years!)

  • If you don’t work at it, your group’s software and data gets more and more disorganized, until it becomes unusable. It limits what you can do in the future.

At an individual level:

  • Time wasted with bugs

  • Time wasted when one can’t repeat analysis for reviews

  • A desire to hide or not share code because it’s “messy”, which keeps the above cycle going - and means less Open Science.

Step 1: Define how you work together

This is kind of meta, but: do you want to be a group of people connected by supervisor, or a team that works together?

  • Is co-working limited to coffee chats and presentations at group meetings?

    • Do these presentations comment only on the final results?

    • Or do you discuss and praise good practices for getting those results?

    • Are some meetings spent on skill development?

  • Or on the other end, are you co-developing the same project?

  • Are you a team, or a bunch of independent contractors?

Suggestions

  • Don’t be only results oriented in your group activities. Make sure you value the process with both your time and mental energy.

Planning vs writing a plan

Plans are useless, planning is indispensable - Dwight Eisenhower

  • Different grants request that you make a data management plan, and I’ve seen ideas of software management plans for the future.

  • If you are making a plan just for a grant, I think that’s the wrong idea. You want everything you do to go beyond single projects.

Suggestions

  • Make a “practical plan” for important aspects, in your group’s documentation area: “here is where you find our data”, “here is where we share code”, etc. Keep it lightweight but useful.

  • Designate it as part of onboarding.

  • Update it as needed.

Group documentation, “group wiki”

A single place for reference on group practices helps with onboarding and keeping things consistent and usable over time.

  • A group wiki is a good place to start.

  • Minimum documentation about how you want things done - or how they are actually being done.

  • But not so strict that you can’t make progress in the future.

  • Index of important software, data, and other resources

    • But the description of the software/data should be with them, not in the group docs.

  • Can you make everything open? For example, your group website could contain this reference information, so it also serves as an advertisement.

Suggestions

  • If in doubt, make a group wiki

  • Use it to keep your group’s internal operating information organized - however makes sense for you.

  • When you hear of someone doing something new, ask: “did you update this in our wiki?”

Skill development

Many people learn basic programming. Far fewer people learn best practices beyond programming.

But there is also informal learning, mentoring:

  • You learn more from co-working than courses.

  • You need good, active mentoring (not weekly status checks, but real co-working)

  • Desks next to each other where you can see each other’s screens

  • Pair programming

  • But, as an academic supervisor, you probably don’t have time to mentor. How do you get mentoring?

    • Set up group to work together

    • Time and motivation for self-learning

    • Encourage an internal specialist who can mentor for you (“Research software engineer”).

Suggestions

  • Everyone in your group attends a CodeRefinery workshop

  • At least one group member is developed into a computational specialist and supports others.

Why talk so much about teaching and mentoring, rather than practices?
  • Unlike many topics, we can’t rely on academic courses to prepare your group members.

  • In all my experience, good software and data practices come from sharing good internal practices.

  • I know supervisors can’t do everything, but hopefully they can promote what they need internally.

Software in research
  • Software allows you to do far more than you could alone, and can transform research.

  • … but can also be one of the most complex tasks you do.

  • What kind do you use?

    • You can and will use software developed by others

    • Many groups develop their own internally.

    • If you make something good, you may want to release it so that others can use it - and cite you.

Software: tools

We give a lightning overview. Come to CodeRefinery for the full story.

Version control
  • Tracks changes

    • solves: Everything just broke but I don’t know what I changed.

    • solves: I’m getting different results than when we submitted the paper.

  • Allows collaboration

    • solves: “can you send me the latest version of the code”

    • solves: “we’re using two different versions, too bad”

  • Creates a single source of truth for the code

    • Not different versions scattered around on everyone’s computers

  • Most common these days: git

Suggestions

  • Everyone must learn the basics of a version control system (CodeRefinery week 1 does this).

  • Find a source of advanced support (your specialist group member or some other university service)

Github, Gitlab, etc.
  • Version control platforms

  • Online hosting platforms for git (others available)

  • Very useful to keep stuff organized

  • Makes a lot of stuff below possible.

  • Individual projects and organizations with members - for your group.

Suggestions

  • Make one public Github/Gitlab organization for your group

  • Make one internal Gitlab organization hosted at your university.

  • Strongly discourage personal repositories for common code.

Issue tracking
  • Version control platforms provide issue trackers

  • Important bugs, improvements, etc. can be closely tracked.

Suggestions

  • Use issues for your most important common projects

Change proposals (aka “pull requests”)
  • Feature of version control platforms like Github or Gitlab

  • People should work together, but maybe not everyone should be able to modify everything, right?

  • Contributors (your group or outside) can contribute without risk of messing things up.

  • For this to work you need to actually review, improve, and accept them

Suggestions

  • Decide which projects are important enough for a more formal change process.

  • Use pull requests for these projects which should not be broken.

Testing
  • How do you know your code is correct? Try running it, right?

  • But what happens if you change it later?

  • Software testing is a concept of writing tests, which can automatically verify functionality.

  • You write tests, and then anytime you make a change later, the tests verify it still works.

Suggestions

  • Each moderately important project has some test data and can automatically run something

  • More important projects: add in as many tests as practical

Documentation
  • Documentation enables reusability.

  • Minimum is Readme files in each repository.

  • Big projects can have dedicated documentation.

Suggestions

  • Every project gets a README file. As supervisor, read these README files and confirm what they contain.

  • Dedicated, in-repository documentation for large projects (for example Sphinx)

Licensing
  • Reuse gets you citations

  • Reuse requires a license - without one, reuse will be minimal.

  • You will often need to check your local policies on making something open source.

Suggestions

  • Decide (with stakeholders) on a license as early as possible - use only open-source licenses unless there is a special reason. You don’t have to actually open it right away.

  • Try to focus on using similarly licensed things.

Publication and release
  • If you invest in your software, you probably want to share it

    • “If we release a paper on some method, and we don’t include easy to use software to run it, our impact will be tiny compared to what it could be.” - CS Professor

  • Good starting point: make the repository open on Github/Gitlab

  • Can also be archived on Zenodo (or other places) to make it citeable.

  • Do all work expecting that it might be made open someday. Separate public and secret information into different repositories.

Suggestions

  • Public on GitHub/GitLab as soon as possible

  • Next level is releases on package indexes

  • You can make software papers later (when relevant)

Working together on code

Group discussion: What can go wrong when people work together?

Other computational topics

… not exactly software, but still relevant to this discussion.

Data storage
  • Discourage single-user storage spaces (laptop, home directories)

  • Use common shared spaces instead

  • Network drives

    • Usually used via a remote system

    • Some can be locally mounted on your own laptop for ease of use

    • Not the best for people who want to work on their own computer, but works. Data can be synced.

Aalto Scientific Computing strategy:

  • All mass storage provided in shared group directories.

  • Request as many as you want - each one has its own access control.

  • Access and data can be passed on as the group evolves.

Suggestions

  • Have a plan: people should know where central storage is, and at least one copy of the data must be there.

  • Request central network drive storage if possible.

  • Ask your group members: “Where is your data? Is the data documented?”

Data storage locations at Aalto University
  • Own devices

    • Danger, no backups! Personal devices are considered insecure.

  • Aalto home directories

  • Aalto network drives

    • Large, secure, backed-up. Request from your department or from Aalto IT Services.

    • 10-100 GB range is easy.

  • Triton HPC Cluster

    • Very large, fast, direct cluster access, but not backed up.

    • 10s-100s of TB.

  • CSC data storage resources

  • Public data repositories

    • For open data

Computing

There is a range of computing options, from (easy to use, small) to (harder to use, large):

  • Own devices

  • Remote servers

  • Remote computer clusters

    • Aalto

    • CSC

Support

It’s dangerous to go alone. Take us!

  • There were many things above.

  • Hopefully you got some ideas, but I don’t think that anyone can do this alone (I learned everything by working with others)

  • Rely on support and mentoring.

Some possibilities, if you are at Aalto, include the daily SciComp garage and Research Software Engineer (RSE) consultations.

Suggestions

  • Ensure your group members come to garage if they have questions you can’t answer.

  • Come to a RSE consultation and chat at least once when getting your group started.

Summary: dos and don’ts

You are not allowed to

  • Not use version control

  • Not push to online repository

  • Have critical data or material only on your own computer.

  • Make something so chaotic that you can’t organize it later

  • Go alone

… but you don’t have to

  • Start every piece of code perfectly

  • Do everything perfectly

  • … as long as you can improve it later, if needed.

  • Know everything yourself.

Checklist
  • Set up group reference information (for example, wiki).

  • Work with your supporters to create a basic outline of a plan.

  • Set up Github organization for group code

  • Set up Gitlab organization for internal work (university Gitlab)

  • Create your internal data/software management plan.

  • (Think what code/data will be most reused, put it in one place, and make it reusable.)

  • Send group members to CodeRefinery as they join.

See also
  • The Zen of Scientific computing - different levels of different aspects you can slowly improve. Emphasizes that you don’t have to be perfect when you first start.

Package your software well

This page gives hints on packaging your research software well, so that it can be installed by others.

As HPC cluster administrators, we spend a lot of time trying to install very difficult software. Many users want to use a tool released by someone, but it turns out not to be easy to install. Don’t let that happen to your code - keep the following things in mind, even at the beginning of your work, if you want your code to be reused (and cited).

This page is specifically about packaging and distribution, and doesn’t repeat standard programming practices for scientists.

Watch a humorous, somewhat related talk “How to make package managers cry”.

Application or library
  • Application: Runs alone, does not need to be combined with other software. Note that if your application is expected to be installed in an environment that is shared with other software, it is more like a library. Note that this is how most scientific software is installed!

  • Library: Runs embedded and connected with other software that is not under your control. You can’t expect everything else to use the exact versions of software that you need.

The dependency related topics below mostly apply to libraries - but as the note says, in practice they affect many applications, too.

Use the proper tools

Each language has some way(s) to distribute its code “properly”. Learn them and use them. Don’t invent your own way of doing things.

Use the simplest, most boring, reliable, and mainstream system there is (that suits your needs).

Minimize dependencies

Build off of what others make, don’t re-invent everything yourself. But at the same time, see if you can avoid random unnecessary dependencies, especially ones that are not packaged and maintained well. They will make your life, and others’ lives, worse.

Don’t pin dependencies

Don’t pin exact versions of dependencies in a released library. Imagine if you want to install several different libraries that pin slightly different versions of their dependencies. They can’t be installed together, and the dependency solver may take a long time trying before it gives up.

But you do often want to pin dependencies for your environments, for example, the exact collection of software you are using to make your paper. This keeps your results reproducible, but it is a different concept than releasing your software package.

Don’t pin dependencies strictly when someone may indirectly use your software in combination with arbitrary other packages. You should have some particular reason for each pin you have, not just “something may break in the future”. If the chances of something breaking in the future are really that high, you should wonder whether to recommend your software to others at all until that can be taken care of (for example, by building on a more stable base).
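As a concrete Python illustration, the released package metadata can use loose, justified version ranges, while the environment for a specific paper is pinned somewhere else. A minimal setuptools-style sketch (the package name and versions are made up):

# setup.py of a released library: flexible, justified constraints only
from setuptools import setup, find_packages

setup(
    name='myanalysis',
    version='0.3.0',
    packages=find_packages(),
    install_requires=[
        'numpy>=1.20',   # we use a feature added in 1.20; any later release is fine
        'scipy',         # no known constraint, so no pin at all
    ],
)

The environment used for a specific paper is then pinned separately (for example in a requirements.txt or conda environment file with exact versions), not in the library’s own metadata.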

You’ll notice that a lot of these topics deal with dependencies. Dependency hell is a real thing, and you should carefully think about them.

Be flexible on dependencies

Following up from above, be as flexible about dependencies as possible. Don’t expect the newest just because it’s the newest.

If you have to be strict on dependencies because the other software is changing behavior all the time, perhaps it’s not a good choice to build on. Maybe there’s no other choice, but that also means that you need to realize that your package isn’t as reusable as you might hope.

Try to be robust in dependencies

Follow the robustness principle to the extent possible: “Be conservative in what you do, be liberal in what you accept from others”. Try to be as resistant as possible to dependencies changing, while providing a stable interface for other things. Of course, this is hard, and you need a useful balance. For “resistance to dependencies changing”, I interpret this as being careful about what interfaces I use, and seeing if I can avoid using things I consider likely to change in the future.

Of course, robustness applies to other aspects, too.

Have tests

Have at least some basic automated tests to ensure that your code works in conjunction with all the dependencies. Perhaps also have a minimal example in the README file that someone can use to verify that they installed properly (could be the same as the tests). The tests don’t have to be fancy, even something that runs the code in a full expected use case will let you detect major problems early. This way, when someone is installing the software for someone else, they can check if they did it correctly.
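For instance, a minimal test of this kind might look like the following (pytest is assumed; the package mypackage and function analyze are hypothetical):

# tests/test_basic.py - a smoke test covering one small, complete use case
from mypackage import analyze

def test_analyze_small_input():
    result = analyze([1.0, 2.0, 3.0])
    assert result['mean'] == 2.0             # known answer for known input
    assert set(result) == {'mean', 'std'}    # expected output structure

Anyone who installs the package can then run pytest and immediately see whether the installation works.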

Don’t expect the latest OS

Don’t design only for the latest and greatest operating system: then, many people who can’t upgrade right away won’t be able to use it easily. Or, they’ll have to go through extra effort to install newer runtimes on their older operating system.

For example, I usually try to make my software compatible with the latest stable operating systems from one year ago, and the latest Python packages from two years ago. This has really reduced my stress in moving my code around, even if it does mean I have to wait to use some new features.

Test on different dependency versions/OSs/etc

This starts to get a little bit harder, but it’s good to test with diverse operating systems or versions of your key dependencies. This probably isn’t worth it in the very early phases, but it is easier once you start using continuous integration / automated testing. Look into these once you get advanced enough.

Most clusters have different and older operating systems than you’d use on your desktop computer.

A container does not replace good packaging

“I only support using the Docker container” does not replace good packaging as described above. At the very least, it assumes that everyone can use Docker/singularity/container system of the year on the systems they need to run on. Second, what happens if they need to combine with other software?

A container is a good way to make computing easier and to move it around, but do good packaging first, and use that packaging to install into the container.

Other

There is plenty more you should do, but it’s not specific to the topic of this page. For example,

  • Have versions and releases

  • Use a package repository suitable to your language and tool.

  • Have good documentation

  • Have a changelog

  • etc…


Python

Note: For Triton-specific instructions, see the Triton Python page. For Aalto Linux workstation specific instructions, see the Aalto Python page.

Python is a widely used high-level programming language that is popular in many branches of science.

Python distributions

Which Python to use, and how to install your own packages, by use case:

  • Simple programs with common packages, not switching between Pythons often: Anaconda 2/3, with pip install --user

  • Most use cases, but sometimes different versions of modules needed: Anaconda 2/3, with a conda environment + conda

  • Special advanced cases: Python from the module system, with virtualenv + pip install

There are two main versions of Python: 2 and 3. There are also different distributions: the “regular” CPython that is usually provided with the operating system, Anaconda (a package containing CPython + a lot of other scientific software all bundled together), and PyPy (a just-in-time compiler, which can be much faster for some use cases).

  • For general scientific/data science use, we suggest that you use Anaconda. It comes with the most common scientific software included, and is reasonably optimized.

  • PyPy is still mainly for advanced use (it can be faster under certain cases, but does not work everywhere). It is available in a module.

Installing your own packages with “pip install” won’t work unless you have administrator access, since it tries to install globally for all users. Instead, you have these options:

  • pip install --user: install a package in your home directory (~/.local/lib/pythonN.N/). This is quick and effective, but if you start using multiple versions of Python, you will start having problems and the only recommendation will be to delete all modules and reinstall.

  • Virtual environments: these are self-contained Python environments, each with its own modules, separate from any other. Thus, you can install any combination of modules you want; this is the most recommended option.

    • Anaconda: use conda, see below

    • Normal Python: virtualenv + pip install, see below

Installing own packages: Virtualenv, conda, and pip

You often need to install your own packages. Python has its own package manager system that can do this for you. There are three important related concepts:

  • pip: the Python package installer. Installs Python packages globally, in a user’s directory (--user), or anywhere. Installs from the Python Package Index.

  • virtualenv: Creates a directory that has its own self-contained set of packages, manageable by the user themselves. When the virtualenv is activated, the operating-system global packages are no longer used; instead, you install only the packages you want. This is important if you need to install specific versions of software, and it also provides isolation from the rest of the system (so that your work can be uninterrupted). It also allows different projects to have different versions of things installed. virtualenv isn’t magic; it could almost be seen as just manipulating PYTHONPATH, PATH, and the like. Docs: https://docs.python-guide.org/dev/virtualenvs/

  • conda: Sort of a combination of package manager and virtual environment. However, it only installs packages into environments, and it is not limited to Python packages. It can also install other libraries (C, Fortran, etc.) into the environment. This is extremely useful for scientific computing, and the reason it was created. Docs for envs: https://conda.io/projects/conda/en/latest/user-guide/concepts/environments.html.

So, to install packages, there is pip and conda. To make virtual environments, there is venv and conda.

Advanced users can see this rosetta stone for reference.

Anaconda

Anaconda is a Python distribution by Continuum Analytics. It is nothing fancy, they just take a lot of useful scientific packages and put them all together, make sure they work, and do some sort of optimization. They also include all of the libraries needed. It is also all open source, and is packaged nicely so that it can easily be installed on any major OS. Thus, for basic use, it is a good base to start with. virtualenv does not work with Anaconda, use conda instead.

Conda environments

See also

Watch a Research Software Hour episode on conda for an introduction + demo.

A conda environment lets you install all your own packages. For instructions how to create, activate and deactivate conda environments see http://conda.pydata.org/docs/using/envs.html .

A few notes about conda environments:

  • Once you use a conda environment, everything goes into it. Don’t mix versions with, for example, local packages in your home dir. Eventually you’ll get dependency problems.

  • Often the same goes for other Python-based modules: we have set up many modules that use Anaconda as a backend. Mixing these with your own packages might work if you know what you are doing.

  • The commands below will fail:

    • conda create -n foo pip # tries to use the global dir, use the --user flag instead

    • conda create --prefix $WRKDIR/foo --clone root # will fail as our anaconda module has additional packages (e.g. via pip) installed.

Basic pip usage

pip install by itself won’t work, because it tries to install globally. Instead, use this:

pip install --user PACKAGE_NAME

Warning! If you do this, then the module will be shared among all your projects. It is quite likely that eventually, you will get some incompatibilities between the Python you are using and the modules installed. In that case, you are on your own (simple recommendation is to remove all modules from ~/.local/lib/pythonN.N and reinstall). If you get incompatible module errors, our first recommendation will be to remove everything installed this way and not do it anymore.

Python: virtualenv

Virtualenv is the default-Python way of making environments, but it does not work with Anaconda.

# Create environment
virtualenv DIR

# activate it (in each shell that uses it)
source DIR/bin/activate

# install more things (e.g. ipython, etc.)
pip install PACKAGE_NAME

# deactivate the virtualenv
deactivate

Linux shell crash course

Note

This is a kickstart for the Linux shell, to teach the minimum amount needed for any scientific computing course. For more, see the linux shell course or the references below.

This is basic B-level: no prerequisites.

Watch this in video format

There is a companion video on YouTube, if you would also like that format (and a slightly longer one with more detail).

If you are reading this, you probably need to do some sort of scientific computing involving the Linux shell, or command line interface. You may wonder why we are still using a command line today, but the answer is somewhat simple: once you are doing scientific computing, you eventually need to script and automate something. The shell is the only method that gives you the power to do anything you may want.

These days, you don’t need to know as much about the shell as you used to, but you do need to know a few important commands because the command line works when nothing else does - and you can’t do scripting without it.

What’s a shell?

It’s the old-fashioned looking thing where you type commands with a keyboard and get output to the screen. It seems boring, but the real power is that you can script (program) commands to run automatically - which is the point of scientific computing.

You type a command, which may include arguments. Output gets shown to the screen. Spaces separate commands and arguments. Example: cp -i file1.txt file2.txt. cp is the command, -i is an option, and file1.txt and file2.txt are arguments. The meaning of each option and argument is completely determined by the program itself.

There are some conventions for options. For example, --help or -h usually prints some help.

Files are represented by filenames, like file.txt. Directories are separated by /, for example mydir/file.txt is file.txt inside of mydir.

Exercise: Start a shell. On Linux or Mac, the “terminal” application does this.

Editing and viewing files

nano is an editor which allows you to edit files directly from the shell. This is a simple console editor which always gets the job done. Use Control-x (control and x at the same time), then y when requested and enter, to save and exit.

less is a pager (file viewer) which lets you view files without editing them. (q to quit, / to search, n / N to repeat the search forwards and backwards, < for the beginning of the file, > for the end of the file)

Listing and moving files

ls lists the current directory. ls -l shows more information, and ls -a shows hidden files. The options can be combined, ls -la or ls -l -a. This pattern of options is standard for most commands.

mv will move or rename files. For example, mv file.old file.new.

cp will make a copy of a file, with the exact same syntax as mv: cp file.old file.copy.

rm will remove a file: rm file.txt. To remove a directory, use rm -r. Note that rm does not have backups and does not ask for confirmation!

mkdir makes a directory: mkdir dirname.

Current directory

Unlike with a graphical file browser, there is a concept of current working directory: each shell is in a current directory. If you ls, it lists files in your current directory. If a program tries to open a file, it opens it relative to that directory.

cd dirname will change working directories for your current shell. Normally, you will cd to a working directory, and use relative paths from there. / alone refers to the root directory, the parent of all files and directories.

cd .. will change to the parent directory (dir containing this dir). By the same token, ../.. the parent of the parent, and so on.

Exercise: Change to some directory and then another. What do (cd -) and (cd with no arguments) do? Try each a few times in a row.

Online manuals for any command

man is an on-line manual, type man ls to get help on the ls command. The same works for almost any program. In general you look for what you need, not read everything. The program that views the manual pages is (by default) less which was described above: Use q to quit or / to search (n and N to search again forward and backwards).

--help or -h is a standard argument that prints a short help directly: for example cp --help.

Manual pages can be long, some are easy to read, some are impossible. tldr.sh is a project that collects simplified usage examples, see the tldr.sh interactive web viewer.

Exercise: briefly look at the manual pages and --help output for the commands we have learned thus far. How can you make rm ask before removing a file?

History and tab completion

Annoyed at typing so much? We’ve got two ways to make work faster.

First, each shell keeps its (shell) history. By pushing the up arrow key, you can access previous lines. Never type similar things twice, go up in history and find the previous line, modify it, then push enter to re-run.

Shells also have tab completion. Type the first few letters of any command or filename and push tab once or twice… it will either complete it or show you the options. This is so important that it’s used often, and many command arguments can also be completed.

Exercise: Play around with tab completion. Type pytho and push TAB. (erase that then start over) Then type p and push TAB twice. (erase that and start over) Then ls, space, and the first few letters of a filename, then push TAB.

Variables

There are two kinds of variables in the shell: environment variables and shell variables. You don’t need to worry about the difference now. The $NAME or ${NAME} syntax is used to access the value of a variable.

For example, the environment variable HOME holds your home directory, for me /home/rkdarst. The command echo prints whatever its arguments are, so echo $HOME prints my home directory. (Note that the variable is a property of the shell, not of the echo command - this is sometimes important).

To set a variable, use NAME=value. export NAME=value sets it as an environment variable which means that other processes you start (from this shell) can use it.

The $VARIABLE syntax is also often used for examples: in this case, it isn’t an environment variable, but just something you need to substitute yourself when running a command.

Quick reference

Cheatsheet

General notes

The command line has many small programs that when connected, allow you to do many things. Only a little bit of this is shown here.

Programs are generally silent if everything worked, and only print an error if something goes wrong.

ls [DIR]

List current directory (or DIR if given).

pwd

Print current directory.

cd DIR

Change directory. .. is the parent directory, / is the root, and / also chains directories, e.g. dir1/dir2 or ../../

nano FILE

Edit a file (there are many other editors, but nano is common, nice, and simple).

mkdir DIR-NAME

Make a new directory.

cat FILE

Print entire contents of file to standard output (the terminal).

less FILE

Less is a “pager”, and lets you scroll through a file (up/down/pageup/pagedown). q to quit, / to search.

mv SOURCE DEST

Move (=rename) a file. mv SOURCE1 SOURCE2 DEST-DIRECTORY/ moves multiple files to a directory.

cp SOURCE DEST

Copy a file. The DEST-DIRECTORY/ syntax of mv works as well.

rm FILE ...

Remove a file. Note, from the command line there is no recovery, so always pause and check before running this command! The -i option will make it confirm before removing each file. Add -r to remove whole directories recursively.

head [FILE]

Print the first 10 lines (or N lines with -n N) of a file. Can take input from standard input instead of FILE. tail is similar but for the end of the file.

tail [FILE]

See above.

grep PATTERN [FILE]

Print lines matching a pattern in a file, suitable as a primitive find feature, or quickly searching for output. Can also use standard input instead of FILE.

du [-ash] [DIR]

Print disk usage of a directory. Default is KiB, rounded up to block sizes (1 or 4 KiB), -h means “human readable” (MB, GB, etc), -s means “only of DIR, not all subdirectories also”. -a means “all files, not only directories”. A common pattern is du -h DIR | sort -h to print all directories and their sizes, sorted by size.

stat

Show detailed information on a file’s properties.

find [DIR]

find can do almost anything, but that means it’s really hard to use it well. Let’s be practical: with only a directory argument, it prints all files and directories recursively, which might be useful itself. Many of us do find DIR | grep NAME to grep for the name we want (even though this isn’t the “right way”, there are find options which do this same thing more efficiently).

| (pipe): COMMAND1 | COMMAND2

The output of COMMAND1 is sent to the input of COMMAND2. Useful for combining simple commands together into complex operations - a core part of the unix philosophy.

> (output redirection): COMMAND > FILE

Write standard output of COMMAND to FILE. Any existing content is lost.

>> (appending output redirection): COMMAND >> FILE

Like above, but doesn’t lose content: it appends.

< (input redirection): COMMAND < FILE

Opposite of >, input to COMMAND comes from FILE.

type COMMAND or which COMMAND

Show exactly what will be run, for a given command (e.g. type python3).

man COMMAND-NAME

Browse on-line help for a command. q will exit, / will search (it uses less as its pager by default).

-h and --help

Common command line options to print help on a command. But, it has to be implemented by each command.

See also

Explore manual pages

For some fun, look at the manual pages for cat, head, tail, grep.

Linux shell course (advanced)

Read the Linux shell course and understand what “pipes” and “piping” are.

SSH

Secure Shell (SSH) is the standard program for connecting to remote servers and transferring data. It is very secure and well-supported, so it’s worth learning to use it properly. This page both gives a bit of a crash course (top) and more details (bottom) for all common connection methods.

Setup

Check the sections below for your operating system to see which connection method you want to use.

PowerShell is built in to Windows 10 and includes OpenSSH (the same as on Linux). Start the “Windows PowerShell” program. Then, follow the “Command line” instructions on most of this page if there isn’t a separate PowerShell tab. If you want to set up SSH keys there are a few differences but overall it is the same procedure.

This should work by default on recent Windows 10.

This guide uses Aalto University’s HPC cluster as an example, but it should be applicable to other remote servers at Aalto and to many servers outside Aalto as well.

Basic use: connect to a server

The standard login command with the command line is:

$ ssh USER@triton.aalto.fi

where USER is your username (Aalto: standard Aalto login, not email address) and triton.aalto.fi is the address of the server you wish to connect to - replace these for your situation.

First time login: check host key

When connecting to a new computer, you will be prompted to affirm that you wish to connect to this server for the first time. This lets you make sure you are connecting to the right computer (which is important if you type a password!). You’ll get a message such as:

The authenticity of host 'triton.aalto.fi (130.233.229.116)' can't be established.
ECDSA key fingerprint is SHA256:04Wt813WFsYjZ7KiAyo3u6RiGBelq1R19oJd2GXIAho.
Are you sure you want to continue connecting (yes/no)?

If possible, compare the key fingerprint you get to the one for the machine which you can find online (Triton cluster: Triton ssh key fingerprints, Aalto servers), and if they do not match, please contact the server administrator immediately. If they do match, type yes and press enter. You will receive a notice:

Warning: Permanently added 'triton.aalto.fi,130.233.229.116' (ECDSA) to the list of known hosts.

The public key that identifies Triton will be stored in the file ~/.ssh/known_hosts and you shouldn’t get this prompt again. You will also be asked to input your Aalto password before you are fully logged in. You want to say “yes, save the key for the future” - it’s more secure and you can always change it later if needed.

Checking known servers

You will not receive an authenticity prompt upon first login if the server’s public key can be found in a list of known hosts. To check whether a server, for example kosh.aalto.fi, is known:

$ ssh-keygen -F kosh.aalto.fi

Your computer might come with some keys pre-loaded for your university’s computers, for example:

$ ssh-keygen -f /etc/ssh/ssh_known_hosts -F kosh.aalto.fi
SSH keys: better than just passwords

By default, you will need to type your password each time you wish to ssh into Triton, which can be tiresome, particularly if you regularly have multiple sessions open simultaneously. A more secure (and faster) way to authenticate yourself is to use an SSH key pair (this is public-key cryptography). The private key should be encrypted with a strong password; xkcd has good and amusing recommendations on the subject of passwords. This authentication method allows you to log into multiple ssh sessions while only needing to enter your password once, saving you time and keystrokes.

Generate an SSH key

While there are many options for the key generation program ssh-keygen, here are the main ones.

  • -t -> the cryptosystem used to make the unique key-pair and encrypt it.

  • -f -> filename of key

  • -C -> comment on what the key is for

Here are our recommended input options for key generation:

$ ssh-keygen -t ed25519

This works on Linux, MacOS, Windows

Accept the default name of the key file by pushing enter with no extra text (it will be automatically used later). Then, you will be prompted to enter a password. PLEASE use a strong, unique password. Upon confirming the password, you will be presented with the key fingerprint as both a SHA256 hex string and a randomart image. Your new key pair should be found in the hidden ~/.ssh directory (a directory called .ssh in your user’s home directory).

Key type ed25519 makes a private key named ~/.ssh/id_ed25519 and a public key named ~/.ssh/id_ed25519.pub. The private key stays only on your computer. The public key goes to other computers. Other key types were common in the past, and you may need to change the filenames in some of the later commands (for example ~/.ssh/id_rsa.pub).

Copy public key to server

In order to use your key-pair to login to a server (for example: the Triton cluster), you first need to securely copy the desired public key to the machine with ssh-copy-id. The script will also add the key to the ~/.ssh/authorized_keys file on the server. You will be prompted to enter your Aalto password to initiate the secure copy of the file to Triton.

$ ssh-copy-id -i ~/.ssh/id_ed25519.pub USER@triton.aalto.fi

Connecting from outside of the Aalto network

Sometimes, you can’t connect directly to the computer you need to, since there is a jump host acting as some sort of firewall. You need to connect to that computer first. This is described below in the ProxyJump section, but we give a workaround here first.

All this is easier if you set up a config file with ProxyJump (-J) first, and copy keys one at a time (as described below). Once this is done, you can copy your key to kosh first, then to triton_via_kosh, for example.

Aalto University: If you can connect by VPN, or to Eduroam, then you can directly access the Triton cluster and copy your key like above.

First copy the key to the jump host (like kosh.aalto.fi), then copy to your final destination (like triton.aalto.fi):

$ ssh-copy-id -i ~/.ssh/id_ed25519.pub USER@kosh.aalto.fi
$ ssh-copy-id -i ~/.ssh/id_ed25519.pub -o ProxyJump=USER@kosh.aalto.fi USER@triton.aalto.fi
Login with SSH key

If the key is in one of the standard filenames, it should work directly.

SSH key agent

To avoid having to type the decryption password each time, the private key needs to be added to the ssh-agent with the ssh-add command.

On Windows, you will need administrative permissions to be able to start an ssh-agent on your machine that can store and handle keys:

  1. Open Services from the start menu

  2. Scroll down to OpenSSH Authentication Agent > double click

  3. Change the Startup type to Automatic (Delayed Start), or anything that is not Disabled, then Apply, and also start the service manually if it is not yet running.

  4. ssh-add to add the default key (to add a certain key, use ssh-add ~/.ssh/id_ed25519, for example)

Once the key is added, you can ssh as normal but will be connected immediately, without any further prompts for passwords.

ProxyJump

Often, you can’t connect directly to your target computer: you need to go through some other firewall host. This is often done with two separate ssh commands, but can be done with only one with the -J (ProxyJump) option:

$ ssh -J FIREWALL.aalto.fi triton.aalto.fi

Both of these can take more options, for example if you need to specify your username you might need to do it twice:

$ ssh -J USER@FIREWALL.aalto.fi USER@triton.aalto.fi

Read more details at https://www.redhat.com/sysadmin/ssh-proxy-bastion-proxyjump, including putting this in your configuration file (or see below).

(Windows with PuTTY: Connection > Proxy > Proxy type = “SSH to proxy and use port forward”, then enter the firewall host as “Proxy hostname” and port 22.)

Multiplexing

Connections can be even faster: you can re-use existing connections to start new connections, so that future ssh commands to the same host are almost instant. This multiplexes across the same connection, and is controlled by ControlMaster, ControlPath, and ControlPersist. With a proper SSH key setup, the gain is minimal, but it can be useful sometimes. It is not recommended to use this unless you really want it, since there are some gotchas:

  • Connections hanging (e.g. unstable network, changing network) will cause all multiplexed connections to hang.

  • All multiplexed connections need to stop before the master process (first SSH connection) will stop. So if you try to exit the first SSH but child processes are using it, it will appear to hang - this may not be obvious.

  • If you are using with ProxyJump, there are two possible SSH processes which can hang and cause things to go wrong.

  • Only use this on your own computers that you control, for security reasons.

This works with OpenSSH. If you want to use this, add ControlMaster auto and ControlPath /tmp/.ssh-USER-mux-ssh-%r@%h:%p (replacing USER with your username) to your ssh config file (see below), and test well. You might want ServerAliveInterval 30 to kill stuff soon if the network goes down. We don’t give a full example to prevent unintended problems. If you notice weird things happening with your ssh, point your helpers to this section.

Config file: don’t type so many options

Remembering the full settings list for the server you are working on each time you log in can be tedious. An ssh config file allows you to store your preferred settings and map them to much simpler login commands. To create a new user-restricted config file:

$ touch ~/.ssh/config && chmod 600 ~/.ssh/config

Open the created file to edit it as indicated below.

For a new configuration, you need to specify in the config at minimum the following:

  • Host: the name of the settings list

  • User: your login name when connecting to the server (if different from the username on your computer)

  • Hostname: the address of the server

So for the simple Triton example, it would be:

# Configuration file for simplifying SSH logins
#
# HPC slurm cluster
Host triton
    User LOGIN_NAME
    Hostname triton.aalto.fi

and you can use only this command to log in from now on:

$ ssh triton

Any additional server configs can follow the first one and must start with declaring the configuration Host:

# general login server
Host kosh
    User LOGIN_NAME
    Hostname kosh.aalto.fi
# light-computing server
Host brute
    User LOGIN_NAME
    Hostname brute.aalto.fi

There are optional ssh settings that may be useful for your work, such as:

# Turn on X11 forwarding for Xterm graphics access
ForwardX11 yes
# Connect through another server (eg Kosh) if not connected directly to Aalto network
ProxyJump USER@kosh.aalto.fi
Full sample config file

The following code is placed in the config file created above (i.e. ~/.ssh/config on Mac/Linux or %USERPROFILE%/.ssh/config on Windows):

# general login server
Host kosh
    User LOGIN_NAME
    Hostname kosh.aalto.fi

# Triton, via kosh
Host triton_via_kosh
    User LOGIN_NAME
    Hostname triton.aalto.fi
    ProxyJump kosh

Now, you can just run commands such as:

$ ssh triton_via_kosh
$ rsync triton_via_kosh:/m/cs/scratch/some_file .
## And this works in any other tool that uses ssh.

directly, by using the triton_via_kosh alias. Note that this rule uses the name kosh, which is defined in the first part of the file.


The Zen of Scientific computing

Have you ever felt like all your work was built as a house of cards, ready to crash down at any time?

Have you ever felt that you are far too inefficient to survive?

No, you’re not alone. Yes, there is a better way.

Production code vs research code

Yes, many things about software development may not apply to you:

  • Production code:

    • you sort of know what the target is

    • code is the main result

    • must be maintainable for the future

  • Research code:

    • you don’t know what the target is

    • code is secondary

But research code often becomes important in the future, so not all can be an unmaintainable mess…

Research code pyramid

I know that not all research code will be perfect.

But if you don’t build on a good base, you will end up with misery.

(Figures: the zen-of-scicomp pyramid, tower, and single block.)
Yes, you can’t do everything perfectly

Not everything you do will be perfect. But it has to be good enough to:

  • be correct

  • be changed without too much difficulty

  • be run again once reviews come in

  • ideally, not wasted once you do something new

Even as a scientist, you need to know the levels of maturity so that you can do the right thing for your situation.

It takes skill and practice to do this right. But it is part of being a scientist.

This talk’s outline:

  • Describe different factors that influence code quality

  • Describe what the maturity levels are and when you might need them

What aspects can you improve?

Below are many different aspects of scientific computing which you can improve.

Some are good for everyone. Some you may not need yet. Different levels of maturity are presented for each topic, so that you can think about what is right for you.

Version control

Version control allows you to track changes and progress.

For example, you can figure out what you just broke or when you introduced a bug. You can always go back to other versions.

Version control is essential to any type of collaboration.

  • L0: no version control

  • L1: local repo, just commit for yourself

  • L2: shared repo, multiple collaborators push directly

  • L3: shared repo, pull-request workflow

Resources:

Modular code

Modularity is one of the basic prerequisites to be able to understand, maintain, and reuse things - and also hard to get right at the beginning.

Don’t worry too much, but always think about how to make things reusable.

  • L0: bunch of copy-and-paste scripts

  • L1: important code broken out into functions

  • L2: separation between well-maintained libraries and daily working scripts.

Resources:

Organized workspaces

You will need to store many files. Are they organized, so that you can find them later, or will you get lost in your own mess?

  • L0: no particular organization system

  • L1: different types of data separated (original data/code/scratch/outputs)

  • L2: projects cleanly separated, named, and with a purpose

Resources:

  • I don’t know of good sources for this.

  • But you can find different recommendations for organizational systems

Workflow/pipeline automation

When you are doing serious work, you can’t afford to just manage stuff by hand. Task automation allows you to do more faster.

Something such as make can automatically detect changed input files and code and automatically generate the outputs.

  • L0: bunch of scripts you have to run and check output of by hand.

  • L1: hand-written management scripts, each output can be traced to its particular input and code.

  • L2: make or other workflow management tool to automate things.

  • L3: Full automation from original data to final figures and data

Resources:

Reproducibility of environment

Is someone else able to (know and) install the libraries needed to run your code? Will a change in another package break your code?

Scientific software is notoriously bad at managing its dependencies.

  • L0: no documentation

  • L1: state the dependencies somewhere, tested to ensure they work

  • L2: pin exact versions used to generate your results

  • L3: containerized workflow or equivalent
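
For a Python project, for example, pinning the exact versions (L2) can be as light as this sketch (the virtual environment path is just an example):

    # Record the exact versions that produced your results
    python -m venv .venv && source .venv/bin/activate
    pip install numpy pandas
    pip freeze > requirements.txt

    # Future you (or anyone else) recreates the same environment with
    pip install -r requirements.txt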

Resources:

Documentation

If you don’t say what you do, there’s no way to understand it. You won’t be able to understand it later, either.

At minimum, there should be some README files that explain the big picture. There are fancier systems, too.

  • L0: nothing except scattered code comments

  • L1: script-level comments and docstrings explaining overall logic

  • L2: simple README files explaining big picture and main points (example)

  • L3: dedicated documentation including tutorials, reference, etc.
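
Even the L2 level can start very small; a sketch (project name, contents, and contact address are made up):

    # A README does not need to be fancy to be useful:
    # what is this, how do I run it, who do I ask?
    echo "# myproject"                                  >  README.md
    echo "Scripts for the (hypothetical) XYZ analysis." >> README.md
    echo "Run ./run_all.sh to reproduce all results."   >> README.md
    echo "Contact: you@example.org"                     >> README.md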

Resources:

Testing

You have to test your code at least once when you first run it. How do you know you don’t break something later?

Testing gives you a way to ensure things always work (and are correct) in the future by letting you run every test automatically.

There’s nothing more liberating than knowing “tests still pass, I didn’t break anything”. It’s extremely useful for debugging, too.

  • L0: ad-hoc and manually

  • L1: defensive programming (assertions), possibly some test data and scripts

  • L2: structured, comprehensive unit/integration/system tests (e.g. pytest)

  • L3: continuous integration testing on all commits (e.g. Github Actions)

If code is easy to test, it is usually easy to reuse, too - the work of making code testable also pushes it towards being reusable.
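
A minimal sketch of L1-style defensive programming in a shell script (file names hypothetical); an L2 test suite is then just one command away:

    #!/bin/bash
    # L1: defensive programming - fail loudly instead of producing nonsense
    set -euo pipefail                          # abort on errors, unset variables, pipe failures

    input="${1:?usage: $0 INPUTFILE}"          # require an argument
    [ -s "$input" ] || { echo "ERROR: '$input' is empty or missing" >&2; exit 1; }

    # L2 and up: a structured test suite is then a single command, e.g.
    #   python -m pytest tests/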

Resources:

Licensing

You presumably want people to use your work so they will cite you. If you don’t have a license, they can’t legally use it (or they might anyway and not tell anyone).

Equally, you want to use other people’s work. You need to check their licenses.

  • L0: no license given / copy and paste from other sources

  • L1: license file in repo / careful to not copy incompatible code

  • L2: license tracked per-file and all contributors known.

Resources:

Distribution

Code can be easy to reuse, but not easy to get. Luckily there are good systems for sharing code.

  • L0: code not distributed

  • L1: code provided only if someone asks

  • L2: code on a website

  • L3: version control system repo is public

  • L4: packaged, tagged, and versioned releases
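
Tagged, versioned releases (L4) cost almost nothing once the repository is public; for example (the version number is made up):

    # Mark the exact state of the code used for a paper or release
    git tag -a v1.0.0 -m "Version used for the submitted manuscript"
    git push origin v1.0.0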

Resources:

Reuse

Are you aware of what others have already figured out through their great effort?

Choosing the right thing to build on is not always easy, but you must do it:

  • L0: reinvent everything yourself

  • L1: use some existing tools and libraries

  • L2: deep study of existing solutions and tools, reuse them when appropriate

Resources:

  • I don’t know where to refer you to right now.

Collaboration

Is science like monks working in their cells, or a community effort?

These skills move so fast that learning peer-to-peer is one of the best ways to do it.

There’s a whole other art of applying these skills which isn’t taught in classes.

If you don’t work together, you will fall behind.

  • L0: you work alone and re-invent everything

  • L1: you occasionally talk about results or problems

  • L2: collaborative package development

  • L3: code reviews, pair programming, etc.

  • L4: community project welcoming other contributors

Resources:

The future

Science with computers can be extremely enjoyable… or miserable.

We are here to help you. You are here to help others.

Will we?

Practical git PRs for small teams

This is the prototype of a mini-course about using git for pull requests (PRs) within small teams that are mostly decentralized, perhaps don’t have test environments everywhere, and thus standard review and CI practices don’t directly apply. The audience is expected to be pretty good with git already, but wondering how PRs apply to them.

The goal isn’t to convince you to use PR-based workflows no matter the cost, but instead to get you thinking about how the tech can make your social processes better.

Status: Alpha-quality, this is more a start of a discussion than a lesson. Editor: rkdarst

Learning objectives
  • Why use pull requests?

  • What are the typical procedures of using PRs?

  • How do we adapt our team to use them?

  • How does this improve our work?

Why pull requests?
pull request = change proposal

You have some work which should be reviewed before deploying.

  • Someone is expected to give useful feedback

  • Maybe a quick idea: easier to draft & discuss than to talk about it abstractly

pull request = review request

You’ve made the change already, or you are already the expert so don’t expect it to really be debated.

  • You edited it in deployment, or it is already live

  • Or you are the expert, and others don’t usually give suggestions

  • Still, someone might have some comments to improve your integration with other services.

pull request = change announcement
  • You don’t expect others to ever make suggestions

  • But you think others should know what you are doing, to distribute knowledge

  • If no one comments, you might merge this yourself in a few hours or days.

pull request = CI check
  • You want the automated tests/ continuous integration (CI) to run to verify the change works.

  • If it works, you might merge yourself even without others knowing.

  • A bit safer than CI after the push to master.

Benefits of PRs
  • Multiple sets of eyes

    • Everything should be seen by multiple people to remove single point of failure problems.

    • Share knowledge about how our services work.

    • Encourages writing a natural-language description of what you are doing - clarify purpose to yourself and others

  • Suggestion or draft

    • Unsure if good idea, make a draft to get feedback

    • Discuss and iterate via issue. No pressure to make it perfect the first time, so writing is faster

  • CI

    • Run automated tests before merging

    • Requires a test environment

    • Very important for fast and high-quality development.

  • Discussion

    • Structured place for conversation about changes

    • Refer to and automatically close issues

How do you make a pull request
  • Technically, a pull request is:

    • A git branch

    • Github/Gitlab representation of wanting to merge that head branch into some base branch (probably the default branch).

    • Discussion, commenting, and access control around that

    • So, there’s nothing really magic beyond the git branch.

  • We don’t really need to repeat existing docs: you can read how to on Github, Gitlab, etc. yourself.

  • A PR starts with a branch pushed to the remote.

  • Then, the platform registers a pull request which means “I want to merge this branch into master”. (Yes, a bit misnamed) Go to the repo page and you see a button, or a link to make one is printed when you push.

  • git-pr makes it easy - fewest possible keystrokes, no web browser needed, and I use the commit message also as the PR message to save even more time.
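
Without git-pr, the plain-git version of this looks roughly like the following sketch (branch name and commit message are made up):

    # Work on a branch instead of the default branch
    git switch -c fix-broken-link
    # ...edit files...
    git commit -am "Fix broken link in the install instructions"

    # Push the branch; the server prints a link for opening the PR/MR
    git push -u origin fix-broken-link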

Pull request description
  • These days, I (rkdarst) tend to write my initial PR message into my commit, then git-pr will use that when I push. This also stores the description permanently in the git history.

  • There is also the concept of “pull request templates” within Github/Gitlab. (They can keep changes organized, provide checklists, and keep things moving. But after fast small PRs via git-pr I really don’t like this being required for small changes where I can write the important aspects myself.)

  • What should go in a description:

    • Why are changes being made?

    • What are the changes?

    • Risks, benefits, etc…

    • Is it done or a work in progress? Need help?

    • What should be reviewed?

CI checks
  • CI pipelines can run on the pull request and will report failures. On Github, success is a green check. Can be shared with checks of direct pushes.

  • Even if there aren’t tests, syntax checks and similar could be useful.
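
For example, checks as cheap as these can run on every PR (the file names are only examples):

    # Cheap sanity checks that need no dedicated test environment
    bash -n deploy.sh                      # does the shell script even parse?
    python -m py_compile tools/sync.py    # does the Python file even parse?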

Semantics around PRs

How do you actually review and handle a PR once it comes in? What’s the social process?

Actions you can take

Actions you can do from the web (Github):

  • merge: accept it

  • comment: add a message

  • approve/request changes: “review” you can do from “file list” view

  • line comments (*): from diff view, you can select ranges of lines and comment there

  • suggestions (*): from diff, you can select ranges of lines then click “suggest” button to make a suggestion. This can easily be applied from web.

  • commit suggestion (*): from diff view, you can accept the suggestion and it makes a commit out of it.

  • (*) items can be done in batch from file view, to avoid one email for every action.

  • draft pull request can’t be merged yet. There is a Github flag for this, or sometimes people prefix with WIP:.

  • assign a reviewer: request people to do the review, instead of waiting for someone to decide themselves.

  • close: Reject the change and mark the PR as closed.

My usual procedure
  • If it’s good as-is, just click “merge”

    • If it’s a new contributor I usually try to say some positive words, but in long-term efficient mode, I don’t see a need to.

  • Otherwise, comment in more detail. Line-based comments are really useful here. Commenting can be line-based, or an overall “accept”, “request changes”, or “comment” on the PR as a whole (see above)

  • If you aren’t sure if you are supposed to merge it (yet), but it looks good, just “approve” it.

    • This can be a sign to the original author that it looks sane to you, and they merge when they are ready.

  • If someone marks my PR “approve” but doesn’t merge it themselves, I will merge it myself as soon as I am ready.

  • If someone else requested changes, I’ve done the changes (if I agree), and I think there’s not much more to discuss, I will just merge it myself without another round of review.

  • You can both make suggestions and approve (usually with some words saying there is no need to accept the suggestions if they don’t make sense).

How do humans use PRs?
Who should merge them?
  • What happens when the person making the PR is the only one (or main one) who can give it a useful review?

    • Then, perhaps your team needs some redundancy…

  • You can assign reviewers, if you want to suggest who should take a look.

  • Discuss as part of your team for each project. This leads to a social discussion of “how do we collaborate in practice?”

When do you merge a pull request?
  • How much review do you need to give, if you aren’t the expert?

  • My proposal:

    • If you aren’t the author, and can evaluate it, merge it ASAP

    • If you aren’t an expert, but no one else has merged it after a few days, merge it yourself. Or if you are the original author and need it.

    • If no one else has merged it after a week, anyone can do it (mainly relevant to external contributors).

  • I don’t feel bad making a PR if I expect I will be the one to merge it a few days later: at least I gave people a chance to take part.

How do you keep up to date with PRs?
How can our team adapt to PRs?
Traditional software project or utility
  • PRs make a lot of sense

Deployments: There is no testing environment!

Yes, there should be a test environment, but let’s be real: many things start off too small to have that. What do we do about it?

  • “If the change has already been made, it’s not really a change proposal”

  • PRs don’t work too well here, but when you think about it, it would be nice to be able to test before deploying!

    • Maybe this gives us encouragement to use more PRs

  • Make a PR anyway even though it’s already in production, as a second-eyes formality.

All of our projects are independent
  • Is this good for knowledge transfer?

What advantages would we see with more PRs?
Other

These things can make our work a bit smoother, and are something we can discuss.

git-pr
  • I got annoyed at needing too many keystrokes, and having to go to a web browser to create the pull requests

  • I created git-pr to make this as fast as possible, and it really does feel much smoother now

  • Works equally for Github and Gitlab, at least.

Shared git aliases
  • How can we deploy some shared aliases to all hosts we manage, to make git more enjoyable to use?

Blocking authorless commits
  • To block authorless commits, run this to set a pre-commit hook:

    echo 'git var GIT_AUTHOR_IDENT | grep root && echo "Can not commit as root!  Use --author" && exit 1 || exit 0' >> .git/hooks/pre-commit ; chmod a+x .git/hooks/pre-commit
    
  • Can this be made automatic in all of our repos?
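
One possible approach (an idea to discuss, not an established setup): point git at a shared hooks directory so every repository of a user picks up the same pre-commit hook. The directory name here is made up, and it assumes the hook above has been saved as a file named pre-commit:

    # Put the hook in one shared directory...
    mkdir -p ~/.git-hooks
    cp pre-commit ~/.git-hooks/ && chmod a+x ~/.git-hooks/pre-commit
    # ...and tell git to use that directory for every repository of this user
    git config --global core.hooksPath ~/.git-hooks

Note that core.hooksPath replaces the per-repository .git/hooks directory, so any existing local hooks would need to move there too.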

Cheatsheets: git the way you need it, Gitlab (produced by Gitlab, with Aalto link)

Training

We have various recommended training courses for researchers who deal with computation and data. These courses are selected by researchers, for researchers and grouped by level of skill needed.

Training

Scientific computing and data science require special, practical skills in programming and computer use. However, these aren’t often learned in academic courses. This page is your portal for getting these skills. The focus is practical, hands-on courses for scientists, not theoretical academic courses.

Scientific Computing in Practice

SCIP is a lecture series at Aalto University which covers hands-on, practical scientific computing related topics. Lectures are open for the entire Aalto community as well as our partners at FGCI consortium.

Examples of topics covered at different lectures: HPC crash course, Triton kickstarts, Linux Shell, Parallel programming models: MPI and OpenMP, GPU computing, Python for scientists, Data analysis with R and/or Python, Matlab and many others.

If you are interested in a re-run of our past courses or if you want to suggest a new course, please take this survey.

August 2023 / Linux Shell Basics

Part of Scientific Computing in Practice lecture series at Aalto University.

Audience: Scientists, researchers, and others looking for an extensive intro into Linux shell / terminal. Primary audience is academics in Finland, outsiders are welcome to register and are accepted if there is space (there always is space).

About the course: The Linux shell lets you work efficiently on remote computers and automate bigger projects - whether you are managing a lot of data or running programs on a computer cluster. Without it, you are often stuck when you need to move beyond basic tools like Jupyter notebooks. This course covers the Bash shell, but the principles apply to other shells such as zsh.

This course will cover the basics so that you’ll know what the shell is, are comfortable using it for your own projects, and are able to attend other courses with the shell as a prerequisite. We’ll get familiar with the command line, files and directories, and other things you often find in shell environments. We will unleash the power of that blinking cursor in the terminal window. Windows/Mac/Linux users are warmly welcome - regardless of what you use on your desktop, you’ll need this when using more powerful remote computers.

We will start with the basics, like files and processes, and go up to command-line magic like redirections and pipes. This should be enough to get started with the Linux terminal.

There is an advanced part of this course given later in the spring which will go through scripting and automation in more detail (part 2 in the material).

Lecturer: Ivan Tervanto, D. Sc., Science IT / Department of Applied Physics, Aalto University

Time, date, place: the course consists of three hands-on sessions (3h each), via Zoom. On-site option is possible if there is enough interest from course participants.

On-site: (not given this year)

Zoom: link to be posted to the registered participants list

  • Tue 29.8 12:00-15:00

  • Wed 30.8 12:00-15:00

  • Thu 31.8 12:00-15:00

Course material: will be mostly based on the first part of aaltoscicomp.github.io/linux-shell.

Cost: Free!

Registration: Please register here

Credits/certificate: It is not possible to obtain certificates or credits for this course.

Required setup: During the tutorials we’ll use a terminal with a BASH shell, which means you need either a Linux/Mac computer, a Windows PC with Git BASH, VDI access to a Linux machine, or an SSH client for connecting to a Linux server. If you are at Aalto University you can run ssh USERNAME@kosh.aalto.fi to connect to a native Linux shell. Other servers are listed here. If you are at the University of Helsinki, see the list of available SSH Linux servers at this link. We will cover ssh connections at the beginning of the first day.

Additional course info at: scip -at- aalto.fi

What’s next?: After this course, check out CodeRefinery, 19-21 and 26-28 September 2023. CodeRefinery is the next step in scientific programming, not teaching programming itself but the tools to do it comfortably and without wasting time on problems.

Nov 7th - Nov 10th 2023 / Python for Scientific Computing

News and Important info

This is a medium-advanced course in Python tools such as NumPy, SciPy, Matplotlib, and Pandas. It is suitable for people who know basic Python and want to know some internals and important libraries for science - basically, how a typical scientist actually uses Python. Read the learner personas to see if the course is right for you. Prerequisites include basic programming in Python.

Part of Scientific Computing in Practice lecture series at Aalto University, in partnership with CodeRefinery.

Partners

This course is hosted by Aalto Scientific Computing (Aalto University, Finland) and CodeRefinery. Our livestream, registration, materials, and published videos are free for all in the spirit of open science and education, but certain partners provide extra benefits for their own audience.

Staff and partner organizations:

  • Radovan Bast (CodeRefinery, The Arctic University of Norway) (instructor, helper)

  • Richard Darst (ASC, Aalto University) (instructor, instructor coordinator, director)

  • Enrico Glerean (ASC, Aalto University) (instructor, registration coordinator, communication, helper)

  • Johan Hellsvik (PDC, NAISS, KTH) (instructor, helper)

  • Diana Iusan (UPPMAX, NAISS, Uppsala University) (instructor, helper)

  • Thomas Pfau (ASC, Aalto University) (instructor, helper)

  • Jarno Rantaharju (ASC, Aalto University) (instructor, helper)

  • Teemu Ruokolainen (ASC, Aalto University) (instructor, helper)

  • Sabry Razick (University of Oslo) (instructor, helper)

  • Simo Tuomisto (ASC, Aalto University) (instructor, helper)

…and many contributors to the learning materials on Github.

Practical information

This is an online course streamed via Twitch (the CodeRefinery channel) so that anyone may follow along without registration. You do not need a Twitch account. There is a collaborative notes link which is used for asking questions during the course. The actual material is here.

While the stream is available even without providing personal data, if you register you may get collaborative notes access for asking questions and will support our funding by contributing to our attendance statistics.

Credits

It is possible to obtain a certificate from the course with a little extra work. The certificate is equivalent to 1 ECTS and your study supervisor will be able to register it as a credit in your university study credit system. Please make sure that your supervisor/study program accepts it.

Learners with a valid Aalto student number will automatically get the credit registered in Aalto systems.

To obtain a certificate/credit, we expect you to have registered for the course by 10/11/2023, to follow the 4 sessions, and to provide us with at least the following 5 documents via email (1 text document, 4 or more Python scripts/notebooks). Please remember to add your name and surname to all submitted files. If you are a student at Aalto University, please also add your student number.

  • 1 text document (PDF or txt or anything for text): For each of the 4 days, write a short paragraph (learning diary) to highlight your personal reflections about what you have found useful, which topic inspired you to go deeper, and more in general what you liked and what could be improved.

  • 4 (or more) .py scripts/notebooks: For each of the 4 days take one code example from the course materials and make sure you can run it locally as a “.py” script or as a jupyter notebook. Modify it a bit according to what inspires you: adding more comments, testing the code with different inputs, expanding it with something related to your field of research. There is no right or wrong way of doing this, but please submit a python script/notebook that we are eventually able to run and test on our local computers.

These 5 (or more) documents should be sent before 31/December/2023 23:59CET to scip@aalto.fi. If the evaluation criteria are met for each of the 5 (or more) documents, you will receive a certificate by mid-January 2024. Please note that we do not track course attendance; if you miss a session, recordings will be available on Twitch immediately after the streaming ends.

NEW! Credit fast track: if you submit your homework by 17/November/2023 23:59CET, you get the credit/certificate before 30/Nov. If you submit after the 17/Nov deadline, your credit/certificate will be processed in January (see previous paragraph).

Additional course info at: scip -at- aalto.fi

Schedule

The course consists of four online hands-on sessions, 3h each. All times are EET (convert 9:50 to your timezone). The schedule is tentative; we may run earlier or later, so join early if attending a single lesson.

Warning

Timezones! Times on this page are in the Europe/Helsinki timezone. In Central Europe, the course starts at 8:50! (convert 9:50 Helsinki to your timezone)

Preparation

Prerequisites include basic programming in Python.

Software installation:

Mental preparation: Online workshops can be a productive format, but it takes some effort to get ready. Browse these resources:

Community standards

This is a large course, and we will have many diverse groups attending it. There will be people attending at all different levels, from “just learned Python” to “been using Python for a while and want to see some tips and tricks”. Everyone will choose their own path; some people will be more hands-on, others more “watching”. Everyone is both a teacher and a learner. Even our instructors are always learning things and make mistakes (and this is part of the point!). Please learn from our mistakes, too!

This course consists of lectures, hands-on exercises, and demos. It is designed to have a range of basic to advanced topics: there should be something for everyone.

The main point of this course is the exercises. If you are with a group, we hope people will work together and help each other. We expect everyone to help each other as best they can, with respect for different levels of knowledge - at the same time, be aware of your own limitations. No one is better than anyone else; we just have different existing skills and backgrounds.

If there is anything wrong, tell us - HackMD is best. If you need to contact us privately, you can message the host on Zoom, reach the instructors via the CodeRefinery chat, or email CodeRefinery support. This could be as simple as “speak louder / text on screen is unreadable” or someone is creating a harmful learning environment.

Code of Conduct

We are committed to creating a friendly and respectful place for learning, teaching, and contributing. You can read our Code of Conduct here. If you need to report any violation of the code of conduct, you can email the organisers at scip _at_ aalto.fi, alternatively you can also use this web form.

Material
Contact
See also
January 2024 / Linux Shell Scripting

Part of Scientific Computing in Practice lecture series at Aalto University.

Audience: Anyone with intermediate or advanced level in Linux shell.

About the course: You might have already used Linux shell commands interactively, but how do you go from interactive terminal use to non-interactive workflows with scripts? This course is oriented towards those who want to start using BASH programming fully and use the terminal efficiently.

We expect that course participants are familiar with the shell basics (experience with BASH, ZSH, etc). We briefly touch on Part 1 of the Linux Shell tutorial, then continue to Part 2. We do expect that participants know how to create a directory and can edit a file from the Linux shell command line. We will be scripting a lot; there will be lots of demos and real practicing.

Lecturer: Ivan Degtyarenko, D. Sc., Science IT / Department of Applied Physics, Aalto University

Place: Online and in-person at Room U135a (U7) Otaniemi (in-person only if there are enough participants). Please register to receive the streaming link and other information for the in-person sessions.

Time, date (all times EET):

Date         Time
Tue 16.01    12:00-15:00
Wed 17.01    12:00-15:00
Thu 18.01    12:00-15:00

Course material: will be mostly based on the second part of the Linux shell tutorial. Videos are archived at this playlist

Registration: You can register at this link

Credits and certificates: We do not provide credits or certificates for this course.

Setup instructions: For the online course we expect you to have the Zoom client installed on your local workstation/laptop, and to have access to a Linux-like shell terminal. You can check BASH installation instructions for various operating systems at this link. If needed, participants can be provided with access to the Triton HPC cluster for running examples.

Additional course info at: scip -at- aalto.fi

Tuesday Tools & Techniques for High Performance Computing

Do you use supercomputers in your research work? Are you curious about making your computing faster and more efficient? Join us for TTT4HPC: four self-contained episodes on best practices in High Performance Computing. This is a great chance to enhance your computational skills. What you will learn is also used a lot outside academia whenever large scale computations are needed.

The course happens online. Mornings (2h): lectures via Twitch. Afternoons (1.5h): hands-on exercises on Zoom with our HPC experts.

Below you will find the list of episodes and how to register. Episodes are self-contained; you can join only the episodes that are useful for your research.

Episode 1 - 16/04/2024 - HPC Resources: RAM, CPUs/GPUs, I/O

Content: focus on HPC computational resources, starting with understanding and managing memory, CPUs, and GPUs, monitoring computational processes and I/O, utilizing local disks and ramdisks, and extending into benchmarking and selecting job parameters.

Instructors: Jarno Rantaharju, Radovan Bast, Diana Iusan

Learning materials: Managing resources on HPC

Registration: Please register at this link

Schedule for the day in EEST (Helsinki, Oslo+1) timezone
  • 09:50-10:00 Streaming starts with icebreakers https://www.twitch.tv/coderefinery

  • 10:00-12:00 Episode 1 - HPC Resources
    • Job scheduling and Slurm basics

    • How to choose the number of cores by timing a series of runs

    • Measuring and choosing the right amount of memory

    • I/O Best Practices

  • 12:00-13:00 Lunch (on your own)

  • 13:00-14:30 Hands-on exercises on zoom (register to receive link)

How to attend: You can watch the streaming at https://www.twitch.tv/coderefinery, but you need to register to get access to the shared document for questions and answers, and the zoom room for the afternoon session.

Episode 2 - 23/04/2024 - Day-to-day working on clusters

Content: focus on software development on HPC, syncing data, interactive work with HPC, vscode

Learning materials: coming soon

Registration: Please register at this link

Schedule for the day in EEST (Helsinki, Oslo+1) timezone
  • 09:50-10:00 Streaming starts with icebreakers https://www.twitch.tv/coderefinery

  • 10:00-12:00 Episode 2 - Day-to-day working on clusters
    • Syncing data and code

    • Developing and interacting with HPC

    • Using VScode with HPC clusters

  • 12:00-13:00 Lunch (on your own)

  • 13:00-14:30 Hands-on exercises on zoom (register to receive link)

Episode 3 - 07/05/2024 - Containers on clusters

Content: focus on containers with Apptainer/Singularity, how to build containers for HPC, how to work with the filesystem, other practical examples with containers

Learning materials: coming soon

Registration: Please register at this link

Schedule for the day in EEST (Helsinki, Oslo+1) timezone
  • 09:50-10:00 Streaming starts with icebreakers https://www.twitch.tv/coderefinery

  • 10:00-12:00 Episode 3 - Containers on clusters
    • Intro to containers on HPC

    • Using Apptainer/Singularity in practice

    • Advanced cases for containers in HPC

  • 12:00-13:00 Lunch (on your own)

  • 13:00-14:30 Hands-on exercises on zoom (register to receive link)

Episode 4 - 14/05/2024 - Parallelization and workflows

Content: focus on parallelization with HPC, efficient parameter sweeps, workflow automation, hyperscaling pitfalls

Learning materials: coming soon

Registration: Please register at this link

Schedule for the day in EEST (Helsinki, Oslo+1) timezone
  • 09:50-10:00 Streaming starts with icebreakers https://www.twitch.tv/coderefinery

  • 10:00-12:00 Episode 4 - Parallelization and workflows
    • Parallelization with HPC

    • Workflow automation

    • Hyperscaling pitfalls

  • 12:00-13:00 Lunch (on your own)

  • 13:00-14:30 Hands-on exercises on zoom (register to receive link)

Prerequisites

You won’t be able to engage with the exercises and examples of the course if you don’t have access to an HPC cluster. Employees of higher-education institutions can usually request access to HPC resources; if you are unsure, please get in touch with your local support. Being familiar with the basic tools used with HPC and remote computing is fundamental for this course: familiarize yourself with the Linux command line, and make sure you know the basic concepts and rules of HPC systems. You can watch our past training “Introduction to HPC (aka kickstart)”.

Credits

It is possible to receive 1 ECTS. Here is what is required:

  • Be affiliated with a research organisation. Your submission must come from an email address of a research organisation.

  • Attend all four Zoom exercise sessions. During the Zoom session, send a Zoom chat message to Enrico Glerean to mark your presence. You can miss at most one session; please arrange an extra task with Enrico Glerean to compensate for the absence.

  • Submit a tar or zip file with four folders, one folder for each of the four episodes. Inside each folder include the scripts, code, and commands that you wrote and ran during the exercise sessions. Please make sure that all submitted files have clear comments that explain each of the steps in relation to the exercises and what was done in the Zoom session. Provide the output of each of the scripts or commands that you have run (a copy-paste from the terminal into a txt file is enough). If the output is very long, it is OK to copy just what is left visible in the terminal.

  • Submit a learning diary for each episode: a short text that highlights i) what went well with the episode, ii) what could be improved, iii) how you will use what you have learned.

From your organisation’s email address, email all these files to scip _at_ aalto.fi by the last day of May 2024. Learners at Aalto University: please include your student number to get the credit registered automatically. Learners from other universities: you might want to check with your study coordinator if you can convert the certificate from this course into 1 ECTS. If they have questions, you can tell them to get in touch with Enrico Glerean

Questions
  • Q: Can I get a certificate even though I am not affiliated with a University or other research organisation?

  • A: Unfortunately we provide credits only for students or researchers affiliated with research organisations.

  • Q: I received a calendar invitation only for one of the episodes, but I marked that I want to register for all episodes, how can I get a calendar invitation?

  • A: We do not have a clever system for sending multiple calendar invitations at once. If you find calendar invitations useful, you need to register manually to each of the four episodes.

  • Q: The materials are not yet ready, when will they be ready?

  • A: This is the first run ever for this course, so we are still tweaking learning materials until the last minutes before the course. Your feedback is highly appreciated to turn this pilot into a course that we can run again in the future. Consider contributing to the learning materials by joining the CodeRefinery Zulip chat.

Contributors and Acknowledgments

Course coordinator: Enrico Glerean.

Episodes coordinators: Richard Darst, Samantha Wittke, Simo Tuomisto, Enrico Glerean, Thomas Pfau

Contributors to learning materials: Richard Darst, Samantha Wittke, Simo Tuomisto, Enrico Glerean, Thomas Pfau, Radovan Bast, Diana Iusan, Dhanya Pushpadas, Hossein Firooz, Jarno Rantaharju, Maiken Pedersen.

Communication partners: CSC, University of Tromsø, University of Bergen, Uppsala University, University of Oslo.

See also / more info

Chat with us in the CodeRefinery chat or Aalto SciComp chat. Or private contact via Enrico Glerean, scip -a-t- aalto.fi.

June 2024 / Intro to Scientific Computing / HPC Summer Kickstart

Quick links

  • This page is generated based on the 2023 version. The information and schedule will still be updated - expect significant schedule changes.

  • Registration is not yet open.

Kickstart is a three × half-day course for researchers to get started with high-performance computing (HPC) clusters. The first day serves as a guide to the skills you need in your career: a map of the types of resources that are available, so that you can be prepared when you need more in the future. This part is especially suitable for new researchers or students trying to understand the computational/data analysis options available to them. It won’t go into anything too deep, but will provide you with a good background for your next steps: you will know what resources are available and know the next steps to use them.

The second and third days take you from being a new user to being able to run your code on a computer cluster at a larger scale than you could before. This part is good for any researcher who thinks they may need to scale up to larger resources in the next six months, in any field - which describes many new researchers in our departments. Even if you don’t use computing clusters, you will be better prepared to understand how computing works on other systems. If you are a student, this is an investment in your skills. By the end of the course you will have the hints, ready solutions, and copy/paste examples to find, run, and monitor your applications, and manage your data.

If you are at Aalto University: the course is obligatory for all new Triton users and recommended to all interested in the field.

This course is part of Scientific Computing in Practice lecture series at Aalto University, supported by many others outside Aalto, and offered to others as part of CodeRefinery.

Practical information

This is a livestream course with distributed in-person exercise and support. Everyone may attend the livestream at https://twitch.tv/coderefinery, no registration needed, and this is the primary way to watch all sessions. Aalto has an in-person exercise and support session (location TBA), as do some other partners, and a collaborative document is used for a continuous Q&A session.

Time, date: 4 – 6 June 2024 (Tue–Thu). 11:50-16:00 EEST

Place: Online via public livestream, Zoom exercise sessions for partners, and probably in-person discussion/practice rooms at some campus.

Registration: Please register at this link: TODO . It’s OK to attend only individual sessions.

Cost: Livestream is free to everyone. Aalto in-person is free of charge for FGCI consortium members including Aalto employees and students.

Additional course info at: scip@aalto.fi

Other universities

If you are not at Aalto University, you can follow along with the course and will learn many things anyway. The course is designed to be useful to people outside of Aalto, but some of the examples won’t directly work on your cluster (most will, and we will give hints about adapting the rest). How to register if you are not at Aalto:

  • Regardless of where you are from, you may use the primary registration form to get emails about the course. You don’t get anything else.

  • Participants from the University of Helsinki can learn how to connect to their Kale/Turso cluster by following their own instructions.

  • Participants from University of Oulu: please follow instructions on how to access the Carpo2 computing cluster.

  • Tampere: this course is recommended for all new Narvi users and also all interested in HPC. Most things should work with simply replacing triton -> narvi. Some differences in configuration are listed in Narvi differences

  • [no active support] CSC (Finland): Participants with CSC user account can try examples also in CSC supercomputers, see the overview of CSC supercomputers for details on connecting, etc.

If you want to get your site listed here and/or help out, contact us via the CodeRefinery chat (#kickstart-aalto stream). We have docs for other sites’ staff to know what might be different between our course and your cluster.

Schedule

All times are EEST (Europe/Helsinki time)!

The daily schedule will be adjusted based on the audience’s questions. There will be frequent breaks and continuous question time; this is the mass equivalent of an informal help session to get you started with the computing resources.

Subject to change

Schedule may still have minor updates, please check back for the latest.

  • Day #1 (Tue 4.jun): Basics and background

    • 11:50–12:00: Joining time/icebreaker

    • 12:00–12:10 Introduction, about the course Richard Darst and other staff Materials: Summer Kickstart intro

    • 12:10–12:25: From data storage to your science Enrico Glerean and Simo Tuomisto

      • Data is how most computational work starts, whether it is collected externally, generated by simulation code, or produced some other way. And these days you can work on data even remotely, and these workflows aren’t obvious. We discuss how data storage choices lead to computational workflows. Materials: SciComp Intro

    • 12:25–12:50: What is parallel computing? An analogy with cooking Enrico Glerean and Thomas Pfau

      • In workshops such as this, you will hear lots about parallel computing and how you need it, but you rarely get an understandable introduction to how the different kinds relate and which are right for you. Here, we give an understandable metaphor: preparing large meals. Slides

    • 13:00–13:25: How big is my calculation? Measuring your needs. Simo Tuomisto and Thomas Pfau

      • People often wonder how many resources their job needs, either on their own computer or on the cluster. When should you move to a cluster? How many resources to request? We’ll go over how we think about these problems. Materials: How big is my program?

    • 13:25–13:50: Behind the scenes: the humans of scientific computing Richard Darst and Teemu Ruokolainen

      • Who are we that teach this course and provide SciComp support? What makes it such a fascinating career? Learn about what goes on behind the scenes and how you could join us.

    • 14:00–14:45: Connecting to a HPC cluster Thomas Pfau and Jarno Rantaharju

      • Required if you are attending the Triton/HPC tutorials the following days, otherwise the day is done.

      • 14:00–14:20?: Livestream introduction to connecting

      • 14:??–15:00: Individual help time in Zoom (links sent to registered participants)

      • Break until 15:00 once you get connected.

      • Material: Connecting to Triton

    • 15:00–15:25: Using the cluster from the shell (files and directories) Richard Darst and Teemu Ruokolainen

      • Once we connect, what can we do? We’ll get a tour of the shell, files and directories, and how we copy basic data to the cluster. Material: Using the cluster from a shell.

    • 15:25–15:50: What can you do with a computational cluster? (Jarno Rantaharju and Richard Darst)

      • See several real examples of how people use the cluster (what you can do at the end of the course): 1) Large-scale computing with array jobs, 2) Large-scale parallel computing. Demo.

    • Preparation for day 2:

      • Remember to read/watch the “shell crash course” (see “Preparation” below) if you are not yet confident with the command line. This will be useful for tomorrow.

  • Day #2 (Wed 5.jun): Basic use of a cluster (Richard Darst, Simo Tuomisto)

  • Day #3 (Thu 6.jun): Advanced cluster use (Simo Tuomisto, Richard Darst)

Preparation

We strongly recommend you are familiar with the Linux command line. Browsing the following material is sufficient:

How to attend: Online workshops can be a productive format, but it takes some effort to get ready. Browse these resources:

Technical prerequisites

Software installation

  • SSH client to connect to the cluster (+ be able to connect, see next point)

  • Zoom (if attending breakout rooms)

Cluster account and connection verification:

  • Access to your computer cluster.

  • Then, connect and get it working

    • Aalto (and possibly useful to others): try to connect to Triton to be ready. Come to the Wednesday session for help connecting (required).

Next steps / follow-up courses

Keep the Triton quick reference close (or equivalent for your cluster), or print this cheatsheet if that’s your thing.

Each year the first day has varying topics presented. We don’t repeat these every year, but we strongly recommend that you watch some of these videos yourself as preparation.

Very strongly recommended:

Other useful material in previous versions of this course:

While not an official part of this course, we suggest these videos (co-produced by our staff) as a follow-up perspective:

Community standards

We hope to make a good learning environment for everyone, and expect everyone to do their part for this. If there is anything we can do to support that, let us know.

If there is anything wrong, tell us right away - if you need to contact us privately, you can message the host on Zoom or contact us outside the course. This could be as simple as “speak louder / text on screen is unreadable / go slower” or as complex as “someone is distracting our group by discussing too advanced things”.

Material

See the schedule

Course archive

Currently active (upcoming) courses have been moved to the training index. Below is a list of past courses.

This course list used to be at the science-it.aalto.fi/scip page, but that page has been deleted. This series has existed since 2016.

2020
2021
2022
2023
Announcement maillist

Events and other Aalto Scientific Computing (Science-IT) announcements are distributed over several lists, such as the Triton-users and department mailing lists. In addition, we run the scicomp-announcements@list.aalto.fi mailing list, which covers everyone else who wants to stay tuned and receive Science-IT news.

The moderated list is free to subscribe to / unsubscribe from at any time, and accepts all email addresses, including non-Aalto ones.

Future courses: Autumn 2023 courses - Linux Shell, CodeRefinery, Python for Scientific Computing, … and more! We are always adding interesting courses. Please check this page once in a while. If you are interested in a re-run of our past courses or if you want to suggest a new course, please take this survey.

Anyone can sign up for announcements at the SCIP announcement mailinglist.

Our most important courses

These are the most important courses we recommend to new users:

These are other quite important courses we have developed:

Other interesting courses

Data management, Reproducibility, open science

Other relevant courses by Aalto Open Science team will be listed at: https://www.aalto.fi/en/services/training-in-research-data-management-and-open-science

Other courses on scientific computing and data management

Please check https://mycourses.aalto.fi/ for other courses at Aalto and https://www.csc.fi/en/training for training courses and events at CSC.

MOOC on scientific computing:

Skills map

There is a lot to learn, and it all depends on each other. How do you get started?

Our training map Hands-on Scientific Computing sorts the skills you need by level and category, providing you a strategy to get started.

Level dependencies

In order to do basic scientific computing, C (Linux and shell) is needed. To use a computer cluster, D (Clusters and HPC) is useful. E (scientific coding) is useful if you are writing your own software.

Help

Don’t go alone - we are here! There are all kinds of “folk knowledge” about using the tools of scientific computing efficiently, and we would like to help you learn it. In particular, our community is welcome to come to our SciComp garage even for small random chats about your work, but there are plenty of other ways to ask for help, too.

Help

There are many ways to get help with your scientific computing and data needs - in fact, so many that it can be hard to know which one to use. This page lists how to ask for help, for different kinds of needs.

Video

Wondering whether, or how, to ask for help? Video: When and how to ask for help (slides)

I don’t know my exact question, or even if I should have a question:

  • SciComp garage to discuss, or …

  • SciComp chat brainstorming

Well-defined task and end goal:

  • Search scicomp.aalto.fi or the Issue tracker for answers, then …

  • SciComp chat question (small questions), or …

  • SciComp issue tracker post (big questions), and/or/then, if needed …

Significant or open-ended problem solving:

  • Open an issue at the issue tracker so we can keep track, and possibly …

  • Drop by SciComp garage to discuss details, or …

  • We’ll create a Research Software Engineer project on the topic (you could also start here)

Issues with your own Triton account:

  • scicomp@aalto.fi email (account issues only, not general questions), then if urgent …

  • SciComp Garage, then if needed …

  • SciComp chat (e.g. “is Triton down for others?”)

General needs at Aalto University, not related to SciComp:

  • servicedesk@aalto.fi for IT issues, or …

  • researchdata@aalto.fi for research data related topics, or …

  • SciComp Garage co-working

Don’t forget that you can and should discuss among your research group, too!

Formulate your question

We get many requests for help which are too vague to give a useful response, so we delay while we try to find something better than “please explain more”, which slows everything down. So, when sending us a question, always try to clarify these points to get the fastest solution:

  • Has it ever worked? (If so, what has changed?)

  • What are you trying to accomplish? (Your ultimate goal, not current technical obstacle.)

  • What did you do? (Be specific enough to be reproducible - copy and paste exact commands you run, exact output messages, scripts, inputs, etc.)

  • What do you need? Do you need a complete solution, pointers to get started, or should we say if it will take too long and we recommend you think of other solutions first?

If you don’t know something, it’s OK, just explain the best you can and we’ll go from there! You can also chat with us to brainstorm about issues in general, which helps to figure out these questions. A much more detailed guide is available from Sigma2 documentation.

We don’t need a long story in the first message - we’ll ask for more later. Try to cover these points, and we are happy to get your message.

Aalto Scientific Computing

Aalto Scientific Computing (Science-IT) is focused on all aspects of computing and data, and mostly consists of PhD-level researchers, so we can understand what you are doing, too. Our main focus areas are high-performance computing (Triton), research software (RSEs), data, and training.

  • Problems with Triton, using Triton

  • Help with software on Triton

  • Data advice, FAIR data, confidential data, data organization

  • Suggestions on tools and workflows to use

  • General research software and research tools

  • Advice on other Aalto services

  • Advice on using CSC services

  • Triton Accounts (by email)

  • Increasing quotas, requesting group storage space (by email)

Scicomp garage

Planned disruptions

  • There are no current planned disruptions in the daily garage.

If you need more help than the issue trackers, this is the place to be. It’s not just Triton, but all aspects of scientific computing.

Come if you want to:

  • Solve problems

  • Discuss and figure out what your problem really is

  • Brainstorm the best strategy for your problems

  • Work with someone on your issues in real time

  • Network with others who are doing similar work and learn something new

What kind of issues can we help with:

  • Code and Software:

    • Issues with your code or software tools you use (e.g. debugging, setting up software, linking libraries)

    • Code parallelization

    • Code versioning, git, testing

  • Data Management:

    • Data management plans, data sharing

    • Handling of sensitive data and general legal and ethical (to some extent) questions about research data

    • Workflows for big datasets

    • Data versioning

  • Triton cluster:

    • Slurm job submissions

    • Cluster usage

    • Script setup

    • Module management / Library loading

  • General:

    • Basic methodological or statistical issues

Notes:

  • All garages are designed for researchers and staff working in Aalto (or those who have a need to contact us).

  • You don’t have to have a specific question, you can come by just to chat, listen, or figure out if you should have a question.

  • You can also chat with us any other time (no promises on reply time, though).

Triton, SciComp, RSE, and CS

You can meet us online via Zoom, every workday at 13:00. Imagine this like walking into our office to ask for help. Even if you are not sure whether we can help you, come and chat with us anyway and we can figure it out.

  • This doesn’t replace email or the Triton issue tracker for clearly-defined tasks. Garage is good for discussion, brainstorming, and deciding the best path. If in doubt, come to garage and we will help you decide. Many people make an issue, then come to garage to discuss.

  • Try to arrive between 13:00 - 13:15. We may leave early if there is no one around. Please don’t arrive early since we have other meetings then.

  • We have some special days (see list below) to ask about specific topics, but in reality we can answer any question any day.

  • Join on Zoom via https://aalto.zoom.us/j/61322268370 .

NBE/PHYS

PHYS, NBE, and ITS (Aalto IT Services) staff are part of the Garage sessions every Monday and Wednesday. Regular reminders are sent to the department personnel lists.

Special days

Some days are special, and have extra staff about certain topics. But you can always visit on any day and ask any question, and we can usually give a good answer (especially about Triton, HPC, computing, software, and data).

  • Mondays also have NBE/PHYS IT present.

  • Tuesdays We are continuing the COMSOL Multiphysics focus days in Spring 2024: someone from COMSOL (the company) plans to join our zoom garage at 13:00 on the following Tuesdays: 2024-01-23, 2024-02-27, 2024-03-26, 2024-04-23, 2024-05-28.

  • Wednesdays also have NBE/PHYS IT present. We also have more staff to help jupyter.cs instructors/TAs.

  • Thursdays

  • Fridays also have CS IT present (at the beginning).

Others

Aalto IT services runs something similar for some other schools and departments.

In person

In-person garages haven’t been held since early 2020 for the obvious reason. The online garage above is more frequent and you are more likely to meet the very best person for your topic.

Past events

Scicomp Garage has existed since Spring 2017. It has been online since March 2020, and daily since summer 2020.

SciComp community

Let’s face it: we learn more from each other than from classes. There is a major problem with inequality in the computational sciences, and a large part of it is related to how we learn these tools. Join the Aalto Scientific Computing community to help yourself and others be the best scientists you can be. You can:

  • Network with others within a supportive mentoring community.

  • Share knowledge among ourselves, avoid wasting time on things where someone knows the answer.

  • Take part in developing our services - basically, be a voice of the users.

SciComp Garage and issues

Currently, most of our interaction happens in the daily SciComp Garage, which is a daily meeting where we help others (and learn ourselves). If you hang out there, you will learn a lot.

If you subscribe to the Triton issue tracker, you will see a lot of questions and answers, and thus learn a lot.

Aalto community chats

We have weekly chats for the Aalto scientific computing poweruser/RSEs as a way to network with the community and Aalto staff. Currently, these are done at 10:00 on Thursdays as part of the Nordic-RSE Finland chats. Anyone is welcome to join and discuss Aalto-related topics.

Mailing lists
Chat
User groups

Often, there are specialized software packages or problem domains which need more advanced documentation than the generic HPC talks. Often, the SciComp staff aren’t experts in that particular domain, so we can’t provide immediate help without knowing more. For this, we have user groups: we meet with groups of users to discuss problems and create solutions/documentation about them.

Existing user groups

To be formed.

If you would like to create a user group, let us know. The hardest part is finding the users, so if you form the group of people and schedule a time, it is very easy for us to come. To be clear, if you bring people together and want to organize the group, we are very happy and will take part and make it “official”.

User group meetings

A user group meets periodically, and does various things. At the meeting are some SciComp staff as well as interested users who want to make a larger change than just solving their own problems.

  • See examples of the software or problem in practice.

  • Discuss the best solution of problems

  • Collaboratively create documentation on the problem (which can be put straight at scicomp.aalto.fi, for example in Applications: General info). We can create video demos, examples, and more.

  • Discuss how the infrastructure needs to be adapted to the actual use cases.

  • Provide a network for informal support within research groups.

Preparing for a user group
  • We will create a Triton issue about it and use that for communication. Subscribe (= turn on notifications or comment) to the issue to get emails about it.

  • Please submit some examples to the issue tracker, for example either things which already work (discuss + document) or things that don’t yet (we will work together to improve + document). This will form the main part of the meeting. We need examples!

Group meetings

This page applies to these departments so far: CS, NBE, PHYS (if others want to join, let us know).

We would like to meet with each research group once a year. This isn’t to advertise stuff to you, but to hear what you all need but can’t get, so that we can help you with that. A group meeting consists of your group plus other technical services staff (Science-IT, CS-IT, etc.) which are relevant for your group’s work. Hopefully, we can immediately solve some of your major problems. Your group will come away better able to use the best possible services, and we will come away knowing what to focus on in the next year.

Practical matters

Ideally, someone (Science-IT, CS-IT, etc.) contacts your group leader to arrange a time. On the other hand, contact your most local (department) support anytime to arrange a group meeting - we are always happy for an eager audience. Your local support will request all the other relevant parties to be there.

The group meeting would happen whenever is most convenient for you - for example, during your regular group meetings. Please propose the best times for you. One hour is sufficient.

You don’t need any particular preparation. If you do anything, think about what computational/data/software tools you use and what problems you have - you could have one or a few people tell about the typical workflows of the group.

Who we are, what we do

We are technical services (in particular those focused on computing). See the rest of scicomp.aalto.fi for the types of things we support. Welcome, researchers! describes our most important services for you (+ the most important ones by others at Aalto).

At Aalto, you also have these other major service units which are relevant to you (this meeting isn’t mainly about them, but we have inside knowledge of IT Services, so we can help there):

  • IT Services (ITS): General mass-consumption IT services for all Aalto.

  • Research services: applying for grants, administrating, legal, etc.

  • Learning services: teaching

  • Communication services

  • Finance

  • HR

Topics
  • Reminder of services available at Aalto and your department (short)

  • News: Latest changes or improvements (short)

  • Stories from the field: how do you do your work?

  • Feedback: How do you do your work now? What works well? What doesn’t work well? What do you need in the future? Tell us all your complaints, because we can’t work on the right things without them. (long)

News / topical items, 2022
Discussion starters
_images/hierarchy-of-researchers-needs.png

The types of research service needs you may have, sorted into different levels of concern. Source

  • Data

    • Where you store and share data

    • Data-driven research: need more support?

    • Department (project, archive), Triton (scratch), cloud, any other needs?

    • Management: collection, storage, transfer, archive, sharing.

    • What do you usually use?

    • Sensitive data: support and storage locations

  • Computing

    • Cloud vs shared workstations vs personal workstations vs laptops

    • Desktops, laptops

    • Scientific computing

    • GPUs

    • Containers for difficult-to-run software (Docker, Singularity, etc.)

    • Virtual machines

    • CSC (supercomputers, cloud, data, collaboration between universities in Finland)

  • Usability and accessibility (user interfaces)

  • Teaching

    • Learning Services

    • Online solutions on cloud platforms (local solutions, VMs, Azure)

    • jupyter.cs

    • A+

    • Chat: Zulip, Teams, Slack, …

  • Software

    • Installation problems

    • Reusing old software

  • Support

    • Support channels

    • Daily SciComp garage - every workday, 13:00, online.

    • Chat

    • Software development: (tools, best practices, collaboration)

    • RSE service

    • How to more closely support teaching/research

  • General services

    • WWW servers

    • CSC services

    • Email

    • Printing

    • Technical procurement

  • Open Science / Open Data / Open Access

See also
Website

Search this website for help. For that matter, also search the internet as usual. This is usually a good place to start, but often you need to move on to the next steps.

Triton Issue tracker

The Triton issue tracker is where all Triton issues should go. Log in and search the tracker for related issues; you may find the solution already there.

If your issue is about or related to Triton, this is where it should go.

Garage

Daily SciComp Garage sessions, where you can informally chat. This is especially useful when your question is not yet fully defined, or you think that demonstrating the problem for immediate feedback is useful.

Chat

Chat can be a great way to quickly talk with others, share tips, and quickly get feedback on whether a problem is large or small, something to get help with or figure out yourself, etc. For longer solutions, we will direct you to the issue trackers, but it rarely hurts to have a real-time discussion. (For real-time video chat with screen sharing, come to the garage above.)

The SciComp Zulip chat, scicomp.zulip.cs.aalto.fi, is where we most often hang out. You can ask Triton questions in #triton, general questions in #general, research software engineering questions in #rse, etc. The main point of Zulip is topics, which allow you to name threads and easily follow old information. (You can also use Zulip in your courses.)

You can also chat with us on Aalto Microsoft Teams. The invite code is e50tyij. Our staff also hang out on other department chats.

Research Software Engineer service

Sometimes, a problem goes beyond “Triton support” and becomes “scientific computing support”. Our Research Software Engineers are perfect for these kinds of problems: they can program with you, set up your workflow, or even handle all the technical problems for you. Contact via the other contact methods on this page, especially via the garage.

Email
  • scicomp at aalto.fi. Use this only for things related to your account (requesting a Triton account), quota, etc. - most other things go to the tracker above.

  • rse-group at aalto.fi: Research software engineering service requests. (It’s often better to drop by the SciComp garage, since we usually need to discuss more.)

Department IT

CS, NBE, and PHYS have their own IT groups (among others, but those are the Science-IT departments with the most support). They handle local matters and can reliably direct you to the right resources. Department IT handles:

  • Computers, laptops, personal devices

  • Department data storage spaces

  • Other department-managed tools and services

Reach them by department-specific email addresses.

NBE and PHYS IT use the same email issue tracker (esupport) as Aalto IT, so issues can be exchanged no matter which address you send an issue to. CS uses a different one, so you have to think a bit more before sending something.

Community

In addition to formal support, there are informal activities, too:

  • The daily SciComp Garage is designed to provide one-on-one help, but we invite anyone to come, hang out in the main room, and network with us. This is for basic help and brainstorming.

  • Subscribe to notifications from the Triton issue tracker even if you don’t post there. You will learn a lot.

  • Sign up for the Research software engineers and powerusers mailing list and learn about more events that interest you. This isn’t the place to ask for basic help, but if you hang out here you will learn a lot.

Other groups at Aalto

servicedesk, Aalto IT

servicedesk at aalto.fi is the general IT Services service desk. They can handle things related to accounts, devices, and so on. They have a wide range of responsibilities, but don’t always know about local resources that may be more appropriate for your needs. There is an “IT Services for Research” group which focuses on research needs.

For students (who aren’t also researchers), this is always your first point of contact - in addition to your teacher.

servicedesk handles:

  • Aalto accounts, passwords (including Triton passwords)

  • University-wide data storage (work, teamwork, home directories)

  • All university-wide common IT infrastructure: wifi, network, devices, websites, learning platforms, etc.

  • Anything department-level, when you are not in a department with local IT staff.

Reach them by:

Research services

Aalto Research Services function more as project administrative services than as close research support. They provide important help for:

  • Data management plans for funding applications, Open Science, and other policy-level data questions (contact researchdata@aalto.fi)

  • Legal or ethical advice, making contracts and NDAs.

  • Library services

  • Applying for funding and administering it.

In many cases, you can chat with Aalto Scientific Computing and we can give some initial practical advice and direct you to the right Research Services resources.

Reach research services by:

  • Contacting service email addresses at the link above

  • Contacting school representatives findable at the link above

  • researchdata@aalto.fi for data-related things

About us

Aalto Scientific Computing isn’t an HPC center - we provide HPC services, but our goal is to support scientific computing no matter what resources you need. Computing is hard, and we know that support is even more important than the infrastructure. If you are a unit at Aalto University, you can join us. [Mastodon, Twitter]

About

Computational research is one of the focus areas in Aalto University, and Aalto Scientific Computing makes that possible.

The Science-IT project was founded in 2009 (with roots going back much further) and has since expanded from high-performance computing services to a complete package: we provide computation, data management, software, and training. Our partnerships with departments and central IT services allow a streamlined experience from personal devices to the largest clusters.

To reflect our expanded services, greater mission, and partners, we have rebranded as Aalto Scientific Computing.

Many Centres of Excellence and departments at Aalto University are using our resources with great success. There are currently over 1000 user accounts from all six different schools and at least 14 different departments using our resources. Science-IT is administered from the School of Science with additional university-level funding - our HPC services are available to all Aalto University, free of charge.

Boilerplate text for grant proposals

Below are various texts which describe Aalto Science-IT, Aalto ITS, and CSC resources, suitable for inclusion in grant applications and the like. There are various types suitable for different purposes.

If you create your own texts and would like to share them, send them to us.

Warning

These texts are starting points, not something that should be included as-is. The texts need to be adapted and tailored to fit your particular proposal - if you need help with proposal writing you can contact the Grant Writer or Research Liaison Officer of your School for advice (contact information is available here).

Focus on Triton

Computing and modelling are strategic areas of Aalto University. To support research in these areas, the university is committed to providing proper hardware resources and support personnel on a long-term basis. Currently, Aalto Science-IT provides a system with about 10,000 computing cores. The system also contains 150 NVIDIA cards for GPU computing and over 5 PB of fast storage capacity suitable for big data needs. All parts are connected with a fast InfiniBand network to support parallel computing and fast data access. To keep the resources competitive, Aalto Science-IT upgrades the system annually based on the needs of researchers.

All resources are integrated with the national resources, allowing easy migration to even larger resources when necessary. These include, for example, university-dedicated OpenStack-based cloud resources and access to thousands of servers via the national computing grid. Furthermore, Aalto Science-IT provides a large amount of preconfigured software and hands-on support to make usage as effective as possible for researchers. On the personnel side, Science-IT has six permanent Ph.D.-level staff members who keep the system running and provide teaching and consultation for researchers.

Acknowledging Triton in publications

Remember you need to acknowledge Aalto Science-IT in your papers if you use Triton and its scratch filesystem. See the acknowledging Triton page for instructions on how to do that and some boilerplate text.

Focus on data

Computing and data are strategic areas in Aalto University.

The university provides data management and computing solutions throughout the data lifecycle. The university provides free storage to researchers of essentially unlimited size, provided that the data is managed well. Data storage includes 5 PB of high-performance, non-backed-up Lustre filesystem space connected directly to the Triton computing cluster for efficient and secure analysis, and 1 PB of reliable, backed-up storage space for longer-term storage. Expert staff, both technical and administrative, provide advice and hands-on support in data storage, computation, FAIR principles, and data management planning.

Data management is designed with a focus on security. Recommended storage locations are centrally located for security. Computing nodes and Lustre data storage servers are physically located at CSC, Keilaranta 14, Espoo. The server room is certified at security level 3 (VAHTI-3), i.e. only authorized personnel with clearance are given access to it, and there is continuous camera surveillance. All data is access-controlled by passwords and individual-level authorization, and firewalled to university networks.

Aalto ITS data storage is directly integrated into Aalto’s sustainable computing environment. Storage is double-redundant and includes the possibility to roll back to previous points in time, with disaster recovery management. In addition to confidential data processing, there are multiple encrypted and/or audited storage environments for sensitive data processing. For IoT, Aalto ITS utilizes public cloud computing providers for case-specific construction of services. Aalto has IT infrastructure personnel, who can help researchers with building the relevant solution for the use case.

Focus on sensitive data

Aalto University provides secure solutions for data management and computing throughout the data lifecycle. The university has an Information Security Management System (ISMS) in place, adapted from the ISO 27001 standard. These processes govern how all our IT systems are acquired, developed, implemented, operated, and maintained. Based on the information classification, we use only selected systems that comply with high security requirements and have been approved for use with sensitive data.

We use encryption technologies to safeguard sensitive data in transit and ensure secure collaboration. Our secure network storage is encrypted at rest, includes the possibility to roll back to previous points in time, and supports encrypted backups for disaster recovery.

We operate a dedicated secure computing environment, SECDATA, to enable research with the most sensitive data. The environment has been audited to comply with the Act on the Secondary Use of Health and Social Data and Findata requirements. Each research project gets a separate virtual desktop environment with customized amounts of memory, disk space, and computing power, with the possibility to use GPUs for computational tasks. To safeguard data, transfers are limited and done only through a specific audited process, and the environment is disconnected from the public internet.

Our technical, administrative, and legal experts provide advice and hands-on support for handling sensitive data. The Aalto Research Software Engineer (RSE) team and Data Agents help with essential privacy techniques such as minimization, pseudonymization, and anonymization. Aalto’s Data Protection Officer provides guidance and oversight on the processing of data and ensuring privacy.

Confidential data (shorter, for CS)

Aalto CS provides secure data storage for confidential data. This data is stored centrally in protected datacenters and is managed by dedicated staff. All access is through individual Aalto accounts, and all data is stored in group-specific directories with per-person access control. Access rights via groups are managed by IT, but data access is only granted upon request of the data owner. All data is made available only through secure, encrypted, and password-protected systems: it is impossible for anyone to access the data without a currently active user account, password, and group access rights. Backups are made and also kept confidential. All data is securely deleted at the end of its life. CS-IT provides training and consulting for confidential data management.

Focus on connectivity

Aalto researchers can use the Low Power Wide Area Network (LoRaWAN), a data network for Internet of Things (IoT) devices with nationwide coverage, free of charge. Using this network, a device can send a small amount of data with minimal power, which makes batteries last a long time. LoRaWAN is suitable for static and mobile sensors operated by batteries. Aalto IT Services provides support and configures the network together with the user. In Finland, public mobile networks also support NB-IoT (Narrowband IoT) technology.

The Aalto campus area has a dedicated research environment for 5G connectivity that can be used for developing and testing 5G technology and applications. On campus, connectivity is ensured via a 100 Gbit/s fault-tolerant internet connection, 1–10 Gbit/s connections to workstations and servers, and extensive wireless coverage. Secure connectivity outside the Aalto campus is also possible via various technologies, e.g. VPN.

Research environment: research software engineers

The Aalto Research Software Engineer (RSE) team provides specialized advice and services in research software, data, and computing, so that any researcher can accomplish the best science without being held back by technological problems. Typical tasks include implementing a method better or faster than could otherwise be done, or ensuring that results are as open and reusable as possible so that the full impact of the work can be realized. RSE staff are professional researchers with years of experience in computational sciences, and they work seamlessly with the rest of the Science-IT team. For the School of Science, basic services are included as part of overheads; longer-term services can be funded from specific research projects.

Research software engineering services

(this text must be tuned to your grant, replace the parts in CAPITAL LETTERS)

This grant will make use of the Aalto Research Software Engineer program to hire high-quality TOPIC specialists. This program provides PhD-level personnel to work on THINGS, which allows the other staff on this project to focus on YYY. Research software engineers do not need to be independently recruited, and are available for consultation also before and after the project. This service is provided by Aalto Scientific Computing, which also provides high-performance computing resources for your project. The Research Software Engineering service is integrated into computing services as a consistent package.

(for basic service, for now only SCI) The service is available as a basic consulting service for free.

(for paid services) This project receives dedicated service from the Research Software Engineering group, funded as researcher salary from this grant. During this period, one of the Aalto research software engineers joins this project as a researcher, equal to all other project employees.

Other computing and IT solutions

Please note that the boilerplate texts for the computing solutions listed below are not about the Aalto Triton HPC cluster. Please familiarize yourself with the Aalto cloud computing services and CSC services before you include them in your grant application. Please also refer to their terms of service and pricing if you need to mention these in your application.

Focus on cloud computing

Aalto University has agreements with major public cloud service providers (e.g. Microsoft Azure, Google Cloud Platform, and Amazon Web Services), and the platforms have been integrated into the Aalto digital environment in a secure and well-governed manner. The platforms provide scalable, collaborative, and integrated computing tooling with software for rapid iteration on data using, for example, machine learning, or access to ready-made AI APIs for [YOUR TOPIC / IMAGE DETECTION / TEXT ANALYSES].

Aalto has private and secure network connectivity between on-premises environment and the cloud platforms, and access is managed through a central identity management system. Expert staff provide solution consultation and hands-on support for end-user needs.

Focus on CSC

Aalto researchers have access to services from the Finnish IT Center for Science (CSC), a government-owned center which provides internationally high-quality ICT expert services. These services include multiple use-case-specific components, such as containers, databases, HPC, and machine-learning utilities, for storing and processing data. The CSC and Aalto services are connected through the high-speed Funet network (Finnish University and Research Network). CSC coordinates the Finnish Grid and Cloud Infrastructure and hosts the largest computing clusters in Finland.

CSC’s data center in Kajaani, Finland houses the pan-European pre-exascale supercomputer LUMI. This is one of the most eco-efficient data centers in the world: LUMI runs on 100% hydropower, and its waste heat will produce 20 percent of the district heat of the area and reduce the city’s annual carbon footprint by 12,400 tons. Further info at https://www.lumi-supercomputer.eu/sustainable-future/.

Focus on IT solution for remote and hybrid work

Aalto University provides IT solutions for remote and hybrid working. Secure digital workspaces for remote working are created through virtual and remote desktop infra and cloud tools, as well as online support and secure use of one’s own devices and applications. Aalto campus has specially designed (class)rooms with integrated and automated audiovisual technologies in support of hybrid meetings and teaching.

See also

Usage model and joining

Aalto Scientific Computing operates with a community stakeholder model and is administered by the School of Science. Schools, departments, and other units join and contribute resources to get a fair-share of the output. There are two different components to join:

For everyone

Aalto Scientific Computing already gets university-level support, so our computing resources are usable by anyone doing research at Aalto (with a limited share). By joining further, a unit gets something even more valuable: time. Our support for using our infrastructure is concentrated on member departments which provide joint staff with us or support the RSE program, in addition to a greater share of resources.

Staff network

There is no Aalto Scientific Computing, just people who want to make computing better.

You might be a department IT staff member, a lab engineer, a skilled postdoc, or a doctoral candidate who helps other researchers with their technical/computational challenges. Why not join forces and become part of our network of specialists? There is no “Aalto Scientific Computing” on paper, only different teams that work together to help researchers better than they could alone. We invite interested staff to join our community, help sessions, infrastructure development, etc. This program is just being developed (as of 2020), but it roughly includes:

  • Participation in admin meetings to help us develop infrastructure (e.g. Triton) in the best way for your users

  • Teaching: for example, ensuring our classes are suitable for your audience, teaching your own classes with our help via CodeRefinery, or directly helping us teach.

  • Co-maintenance of infrastructure (for example, your unit’s special software) on Triton and in our automated software deployment systems.

  • Learn how to solve your users’ problems more efficiently.

  • Networking and continual professional development

  • This is not just for IT support or administrative support, but high-quality research support that connects all aspects of modern work.

This does not replace local support; it makes it more powerful.

Todo

How to take part.

Triton: computing and data storage resources

Triton is the Aalto computing cluster, for computationally and data-intensive research. Users from members of the community are allocated resources using a fair-share algorithm that guarantees a level of resources at least proportional to the stake, without the need for individual users to engage in separate application processes and billing.

Each participating department/unit funds a fraction of costs and is given an agreed share of resources. These discussions are carried out with the board of the Science-IT project. Based on this agreed share, units cover the running expenses of the project. There is also direct Aalto funding, which allows the entire Aalto community to access a share of Triton for free.

However, computing is not just hardware: support and training is just as critical. To provide support, each unit that is a full member of Science-IT is required to nominate a local support contact as their first contact point. Our staff tries to provide scientific computing support to units without a support contact on a best-effort basis (currently, that effort is good), but we must assume a basic level of knowledge and attendance at our training courses.

Interested parties may open discussion with Science-IT at any time. Using our standing procurement contracts, parties may order hardware to be integrated into our cluster with dedicated or priority access (or standalone usage), allowing you to take advantage of our extensive software stack and management expertise, with varying levels of dedicated access: a share of total compute time, partitions with priority access, private interactive nodes, and so on. Please contact us for details.

Scientific software: research software engineers

The Research Software Engineer program provides specialists in software and data, who can be contracted out to projects to provide close support. The goal is not just to perform a service, but to teach by hands-on mentoring.

For projects, the principle is that the project pays for help lasting more than a few hours or days. This can seamlessly come from project money as a researcher salary.

Units (departments, schools) can also join to get a basic service - their members can receive short-term support without any billing needed. Their members will also receive priority for the project services.

For more information, see the RSE for units page.

Contact

Let Mikko Hakala know about Science-IT related joining, Richard Darst know about the RSE program or SciComp community, or contact us at our scicomp at aalto.fi email address.

What we do

We don’t just provide computing hardware, but a complete package of infrastructure, training, and hands-on support. All of these three activities feed back into each other to improve the whole ecosystem.

_images/scicomp-3-components.png

We provide many types of services:

_images/what-we-do.png

Our components, partners, and collaborators

Aalto Scientific Computing serves as a hub of computational science at Aalto. We guide researchers to the right service, regardless of who is providing it.

Science-IT serves as the coordinator, and runs the Triton cluster, the physical hub of large scale computational and data-intensive research at Aalto. As such, we maintain many active collaborations which allow us to guide researchers to the right resource, regardless of who provides it.

Science-IT
Science-IT (Aalto HPC)

Science-IT is the formal name of the project which provides the Triton computational cluster. It is funded by Aalto University, departments and schools, and the Academy of Finland. Perhaps a better description would be Aalto HPC (high-performance computing).

Science-IT is the “legal representation” of Aalto Scientific Computing within Aalto.

Computational research is one of the focus areas in Aalto University. The Science-IT project was founded in 2009 to facilitate the computational infrastructure needed in top-tier scientific research. Many Centres of Excellence and departments at Aalto University are using our resources with great success. Science-IT is administered from the School of Science, and direct Aalto-level funding enables use of our resources from all of Aalto University, free of charge.

Our services

In Science-IT, we concentrate on mid-range computing and the special resources needed by researchers in the School of Science. With local resources, we can provide high-quality support and even research-project-level customization. Because our resources are integrated into the Aalto IT environment, with regular local training in scientific computing practice for entry-level users, our resources enjoy an ease of access and a lower barrier to entry than, for example, CSC HPC resources. We are also a basic research infrastructure, enabling the integration of separately purchased resources into our cluster and storage environments, with dedicated access for the purchaser.

Membership

Departments and schools can join the Science-IT project and receive a share of our resources and dedicated staff support. Please contact Mikko Hakala for details.

Science-IT Management

Science-IT is managed by the board: prof. Harri Lähdesmäki (head), prof. Adam Foster, prof. Mikko Kurimo, prof. Petteri Kaski.

Operational team: Mikko Hakala, D.Sc. (Tech); Ivan Degtyarenko, D.Sc. (Tech); Richard Darst, Ph.D.; Simo Tuomisto, M.Sc.; Enrico Glerean, Ph.D.

For additional information or to get involved, please contact one of the board members above (firstname.lastname@aalto.fi).

Science-IT is the organizational manifestation of Aalto Scientific Computing.

Science-IT concentrates on mid-range computing and the special resources needed by researchers in the School of Science. With local resources, we can provide high-quality support and even research-project-level customization. Because our resources are integrated into the Aalto IT environment, with regular local training in scientific computing practice for entry-level users, our resources enjoy an ease of access and a lower barrier to entry than, for example, CSC HPC resources. We are also a basic research infrastructure, enabling the integration of separately purchased resources into our cluster and storage environments, with dedicated access for the purchaser.

Our team is mainly known for providing the Triton cluster, a mid-range HPC cluster with ~10000 CPUs, 5 PB of storage capacity, an InfiniBand network, and ~150 NVIDIA GPUs for deep learning and artificial intelligence research. We provide a Jupyter Notebook-based interface to enable light computing with less initial knowledge required, making our services easily accessible to everyone. Our team also works with the CS, NBE, and PHYS departments to provide data storage and a seamless computational research experience. We maintain http://scicomp.aalto.fi, the central hub for scientific computing instructions, and run a continuous training program, Scientific Computing in Practice.

Computer Science, Physics, and Neuroscience and Biomedical Engineering

These departments are members of Science-IT, and their local IT staff provide a great deal of scientific computing support; in fact, the entire Science-IT team above comes from these departments. These departments’ resources are seamlessly integrated with Aalto’s HPC resources.

Computer Science IT

Computer Science IT provides advanced computing, data, and IT services to the Department of Computer Science. Ten years ago, we focused on daily infrastructure and devices. We still do that, but we now serve a far broader mission including teaching and services, data management, specialised research tools, and cloud services.

Our services

We:

… but most basic IT tools are handled by Aalto IT Services, not us. We build on their work and make sure research and teaching go as quickly as possible.

(also note, we don’t primarily serve CS undergraduate students)

Work for CS-IT

We are always looking for students interested in IT, programming, and system administration. We also are a good place for civil service. The most important prerequisites are a good understanding of Linux and a never-ending desire to learn more. Buzzwords you are likely to become familiar with/useful skills to have:

  • Kubernetes, docker, and virtual machines

  • Web service development

  • Puppet (and Ansible)

  • Data and storage systems

  • Computer hardware, building high-performance workstations

Contact

You can always drop by room A243 if we are there (not during covid-19, please) or join the daily online garage, or contact us by the email address findable on our internal wiki.

See our members on the About Aalto Scientific Computing page.

Partners

We are a leading member of the Finnish Grid and Cloud Infrastructure (FGCI), a university consortium to support mid-range computing in universities. FGCI, via Academy of Finland research infrastructure grants, funds a large portion of our work. Thus, we maintain ties to most other universities in Finland as well as CSC, the national academic computing center. Through the FGCI, we provide grid computing access across all of Finland and Europe.

Our team overlaps with the Departments of Computer Science, Neuroscience and Biomedical Engineering, and Applied Physics. The IT groups in these departments provide advanced Triton support.

We maintain close collaboration with Aalto University IT Services (ITS). We are not a part of ITS, but work closely with them as the computational arm of IT Services. ITS provides the base which we repackage and build on for many of our services.

Our team maintains ties to Aalto Research and Innovation Services to guide data and research policy. Triton is an Aalto-level research infrastructure. Our staff is involved in research policy making, including ethics, data security, and data management. Our team contains several Aalto Data Agents.

We partner with CodeRefinery, a Nordic consortium to assist in training of scientists, to provide training and support computational competence.

Who we are

This list includes the people supporting Scientific Computing at Aalto University who consider themselves part of ASC. If you want to be added here, let us know. We welcome all contributors. There is no Aalto Scientific Computing, just people who want to make computing better.

  • Richard Darst (Science-IT, CS-IT, Data Agents, Aalto RSE): data science, Triton, teaching, usability

  • Ivan Tervanto (Science-IT, PHYS-IT): Triton, HPC hardware, HPC OS, teaching

  • Enrico Glerean (Science-IT, NBE, Data Agents, Aalto ethics committee): Triton, ethics and personal data, data

  • Simppa Äkäslompolo (Science-IT, PHYS-IT): Triton, HPC OS, parallel software

  • Jarno Rantaharju (Science-IT, Aalto RSE): software development, HPC software and optimization, profiling

  • Thomas Pfau (Science-IT, Aalto RSE): software development, Matlab, linear/mixed integer programming, constraint-based metabolic modelling

  • Simo Tuomisto (Science-IT, CS-IT): software development, HPC software design and optimization, GPU computing

  • Mikko Hakala (Science-IT, CS-IT, NBE-IT): Triton, data storage systems, HPC administration

Scientific outputs

Most of the computationally-intensive research outputs from our member departments use our resources. In addition, at least the CS and NBE departments use our data storage for most big data projects. You may view our research results on research.aalto.fi (Science-IT infrastructure section).

Current research areas

Our users come from countless research areas:

  • Method development

  • Computational materials research

  • Network research

  • Neuroscience

  • Data mining

  • Deep learning and artificial intelligence

  • Big data analysis

FCCI Tech Seminar series

We have an occasional seminar series, open to all, on how we run our group, FCCI Tech. Our archive may be interesting to other scientific computing teams and research software engineers.

FCCI Tech (fka Behind Triton)

This is a series of talks about scientific computing support and HPC infrastructure administration in practice. It started as an internal kickstart for new members of our staff, but the scope has expanded and now others interested in research infrastructure are invited, though our orientation is still primarily on our own team. Typical attendees are computational research engineers, scientific computing support staff, or HPC cluster/SciComp admins.

In the future, this may turn into a more general “research engineering” seminar series, once we are done with internal explanations. Guest speakers are welcome. The name stands for “Finnish Computing Competence Infrastructure Tech”.

We share what our practices are, what we have learned, and informally discuss.

Practicalities

Time: The next speaker announces the time/date of the seminar the week before and sends an invitation with the Zoom link. Usually Fridays at 10:00 EET.

Duration: as desired; roughly ~60-minute time slots, which should be plenty of time for questions and discussion.

Location: Zoom, ask for an invitation but it is usually the garage link.

Recordings: You can view a playlist of some videos on youtube (and a few more are available to our team internally).

It is not a right but a privilege to participate. Free.

Past and currently planned
User support

As infrastructure providers, we are often thrust into a user support role (as well as a teaching role). We should look at this as a good thing: support of top-level science requires an intimate connection to the tools to do that science. I see that as part of our plan.

This talk is about Aalto Scientific Computing’s user support. It is designed as much to explain our philosophy of user support as it is to talk about specific tools. It takes a critical view of some existing common practices, as discussed in CodeRefinery/NordicHPC channels.

Broad contents:

  • What does “user support” even mean?

  • AaltoSciComp’s lines of user support

  • Strategic risks and considerations

_images/scicomp-3-components.png

The three roles of Aalto Scientific Computing are all interdependent on one another.

About us
  • We are Aalto Scientific Computing - Science-IT (HPC) - Department IT (CS, NBE, PHYS) - Close collaborations with Aalto ITS, CSC, FCCI

  • Our collaboration used to be called the “Finnish Grid and Cloud Infrastructure” and will now be called the “Finnish Computing Competence Infrastructure”, so user support is clearly more important than ever.

  • We are proud of our user support, but it is a multi-faceted approach which requires the right mindset.

Role of user support in scientific computing
User support has a bad reputation
  • Customers often think it is really bad (the support staff hate me!)

  • Support staff often hate doing it (the customers don’t know anything!)

  • Our term “issue” or “ticket” implies it’s a discrete task that you want to end as soon as possible.

Why?

  • Technology is hard

  • Users usually don’t give enough information to solve the issue.

  • … Users don’t even know how to give enough information.

  • We often pick up slack when something isn’t otherwise taught

  • We are disconnected from the user community

  • User support may be some forced extra thing on top of our “real” job.

Types of support
  • How do we even answer questions people may have? Some issues are system bugs that are our action items, but when the user themself needs help we can make some hierarchy of support strategies:

    1. “read the manual: <link>”

    2. tell them what to do

    3. give them a live demo

    4. pair program working example, you lead

    5. do the task for them, no need to teach

  • Lower numbers are faster to answer and are traditional support. Higher numbers are much more time-consuming, and approach mentoring or Research Software Engineering services.

Why is support hard?
  • “Crisis of computing”: most users’ skills are much less than needed.

  • User interfaces are usually bad

  • Lots of hidden internal state

XY problem
  • People ask for what they think they need (X)

  • They are given X

  • X isn’t even a good way of doing what they actually want (Y), but we spend a huge amount of time doing X, when the right way Z→Y (a different solution Z that actually achieves Y) is much simpler.

  • XY problem (wikipedia): people don’t ask for the end goal, but some intermediate step.

  • XY solution (my term): Support person wants to answer X because it requires less investigation and you can close the ticket and move on, even though they get the feeling it’s not a good idea.

Be motivating
  • “How to help someone use a computer” by Phil Agre: https://www.librarian.net/stax/4965/how-to-help-someone-use-a-computer-by-phil-agre/

  • Hanlon’s razor: “never attribute to malice that which is adequately explained by stupidity”

  • In our case, this is never attribute to malice or stupidity that which is adequately explained by having never been told something obvious

  • Avoid expressing unhappiness, displeasure, a condescending attitude, expectation that they should have known better, “damage”, etc.

  • Resist the temptation to blame the user. If they actually can do something that harms others, it’s the system’s fault. If they don’t know something, the UI is bad or society’s preparation is not enough. Etc.

SciComp's user support tools
Our general guidelines
  • “help page”, scicomp.aalto.fi/help

    • Describes what to do in general, key points to mention when making a request.

    • It links to a longer “how to ask for help”

    • Both can be a bit patronizing to link to during an issue, so we have to be careful.

Docs
  • https://scicomp.aalto.fi (this site)

  • Open-source (CC-BY), public

  • Built with Sphinx

  • Findable by general web search. This is a big deal - don’t hide your docs!

  • Managed by git on Github

  • There will be another talk on specific Sphinx information later.

Gitlab issue tracker
  • We use Aalto Gitlab (version.aalto.fi) as issue tracker

    • University single-sign on

    • “Internal” permissions (anyone who can log in)

    • Common interface, reasonably powerful labelling, searching, etc.

  • When is an issue closed? As soon as possible, or when you are sure they are happy?

    • We are too much “when we are sure they are happy”, which often is “never”

    • Closing too soon discourages asking for help.

    • Is issue the right term here, or is conversation the right term?

Email tracker
  • Email is a bad medium, advanced issues should be public so that users can learn from each other and we don’t have to type the same thing over and over.

  • Low threshold to direct to the issue tracker instead of email.

    • Most users know this and we get few emails

  • Aalto IT services uses Efecte, CS uses its own RT (much nicer).

  • Three groups: scicomp, scip (teaching), rse-group (RSE services).

Daily Garage
  • Scicomp garage

  • Online “office hours” via Zoom

  • Every day, 13-14. If no one comes, it’s admin chat time.

  • Amazingly good for keeping a community going.

Chat
  • Chat

  • Is chat a good idea or does it get out of hand? Remains to be seen

  • Current philosophy: we need to build community. Chat is not for tracking issues, but for chatting and for determining whether something should become an issue or not.

  • Uses Aalto-hosted Zulipchat. Believe us, just don’t use Slack.

Office drop-in
  • Not done in pandemic time, obviously

  • Mostly replaced by “daily garage” which is better anyway

  • Our offices are spread around the departments we serve, and we accept drop-ins anytime we are there.

  • This keeps us closely connected to the community.

Personal networks
  • Most of us came from the departments we serve now

  • Our existing networks are a good way of contacting us

Teaching
  • Training

  • You can’t just answer questions as they come in; you need to teach proactively.

  • Our teaching is open and free.

  • Low threshold to direct to existing material rather than answering new question. Close support ↔ teaching connection.

  • CodeRefinery is a Nordic teaching collaboration.

Private email
  • I (rkdarst) really discourage this and always direct people to one of the tracked means.

  • My phrasing “If you send it to me personally, I am almost certain to eventually forget to reply, and I may not be the person who can best answer you anyway.” Then I usually try to give some sort of an attempt at an answer, since I have to give the appearance that I really care.

Strategic vision of support
Support ↔ teaching ↔ RSE
  • Support: one-to-one answering questions

  • Teaching: one-to-many improving skills

  • Research Software Engineering: one-to-few “I will do it for you” or “Let me get you started”

Strategic risks
  • The middle layer of science always gets cut first: when funding goes down, support gets cut and researchers are left more alone.

  • Our load increases, and our funding doesn’t

    • We become unhappy, support level goes down

    • Emphasis increases on speed of closing tickets

Strategic benefits of good support

These can be used to argue for good funding of our teams:

  • Diversity

    • Without good support, the “rich get richer” effect contributes to the increasing homogeneity of computational science.

    • Previous talk by Richard Darst:

      • Summary: Computational science has a crisis of demographics. We are on the front lines of this battle, and it’s up to us to address it.

      • Slides

      • Video

  • Open science

    • Without good user skills, people can’t make their computational work reproducible or shareable.

    • We need to claim our place in this problem, rather than let it go to administrative Open Science staff.

Exercise: problematic situations
  1. Someone emails you privately about something they have clearly not even tried yet.

  2. A new researcher is trying to use Triton to do some machine learning. They are trying to use Python+Jupyter, but have minimal experience managing a Python environment.

Conclusions

Open questions

  • What do you think?

  • Do we have too many lines of support?

See also
Credits
  • Author/editor: Richard Darst

  • Thanks to Radovan Bast, Anne Fouilloux, and others in the CodeRefinery NordicHPC channel for good discussions.

Technical documentation with Sphinx

This talk explains how one can use Sphinx for technical documentation, in particular for this very site, scicomp.aalto.fi. The focus is an overview of how to contribute to this site (or similar ones), but it will also provide a strong basis for creating such a site yourself.

See also

About this site for a quick guide for editing this site.

Basics
scicomp.aalto.fi
  • Home of Aalto Scientific Computing’s documentation

  • Before 2017, was Triton’s documentation using Confluence (wiki software)

  • Now has information on many different topics about scientific computing.

  • Rather highly ranked in search engines.

  • Converted from wiki.aalto.fi (Triton) using _meta/confluence2html.py and then pandoc to convert HTML→ReST.

  • CC-BY license agreed at that time

Properties of good documentation
  • Organized, easy to use

  • Versioned

  • Anyone can contribute

  • Shareable, reuseable, licensed

  • No lock-in, can migrate later

  • Plain text, so 50 years of text-processing tools (grep, sed, etc.) all work.

  • Not standalone, can integrate with other materials (e.g. literalinclude).

  • git? (naturally comes out of the above)

The basic documentation stack
  • Git repository

  • Hosted on Github

  • Documentation written in ReStructured Text or MyST-Markdown

  • Built with Sphinx

    • With various extensions

  • Hosted on ReadTheDocs

  • GitHub actions validate basic syntax

Demo: making a change

I want to add the Journal of Open Source Software (JOSS) review checklist (https://joss.readthedocs.io/en/latest/review_checklist.html) to the RSE checklists section (https://scicomp.aalto.fi/rse/#checklists).

Through this, we will see:

  • Git repository layout

  • ReStructured Text format

  • Sphinx table of contents directives (toctree)

  • Creating a pull request with git-pr

  • Reviewing the pull request

  • Merging

  • See the rendered version.
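
Roughly, the command-line side of such a change looks like the following (a minimal sketch using plain git, assuming you already have a clone that builds - see “Building the site” below; the talk itself uses the git-pr helper, and the file names here are hypothetical examples):

  git switch -c joss-checklist            # work on a branch
  nano rse/joss-review-checklist.rst      # write the new page (hypothetical file name)
  nano rse/index.rst                      # add the new page to a toctree
  make html                               # build locally and check the result
  git add rse/
  git commit -m "RSE: add JOSS review checklist"
  git push -u origin joss-checklist       # then open the pull request on GitHub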

Building the site
  • Git repo: https://github.com/AaltoSciComp/scicomp-docs/

  • It has a requirements.txt like a normal Python project.

    • Until recently, was buildable with stock Debian/Ubuntu packages. Now it may require custom extensions.

  • conf.py contains all configuration

  • index.rst is the root of all docs.

  • Makefile builds it

    • make html to make it

    • make clean html to rebuild

    • make clean check to build and check for any errors

    • sphinx-autobuild . _build/html/ may be useful - start a web server that automatically reloads on changes.

  • View results in _build/
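
Putting the pieces above together, a local build is roughly the following (a minimal sketch; installing into a Python virtual environment is a good idea but not required):

  git clone https://github.com/AaltoSciComp/scicomp-docs.git
  cd scicomp-docs
  pip install -r requirements.txt    # Sphinx plus the extensions we use
  make html                          # output ends up in _build/html/
  sphinx-autobuild . _build/html/    # optional: live-reloading preview server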

Editing on the web
  • The Github web interface is suitable for making simple changes.

  • You can either directly commit or open a PR.

  • Can we use this more?

Sphinx toctree (table of contents tree)
  • The toctree directive is the fundamental building block of the site.

  • It organizes documents into a tree, and that tree is used to make the sidebar. This directive can be put into any page.

  • Example:

    .. toctree::
       :maxdepth: 2
    
       aalto/*
       data/index
       README
    
  • Example: Follow it from index.rst → aalto/index.rst → aalto/jupyterhub.rst → aalto/jupyterhub-instructors/index.rst → various subpages.

  • It makes sense, but for complicated cases I often resort to trial and error.

Arrangement of the site
  • scicomp.aalto.fi started from the Triton wiki

  • It then grew top-level sections for Aalto, Triton, Data, Training, RSE, etc.

  • It is about time that we rethink how it is organized.

  • rkdarst is currently the one with the overall picture in mind - for consultations about big changes.

Other details
Sphinx
  • Sphinx is a full-fledged extendable documentation generator

  • We use many extensions such as sphinx_gitstamp, sphinx-{copybutton,tabs,togglebutton}, sphinx_rtd_theme.

  • Custom Javascript and CSS in _static.

  • Very useful to know for other projects in general

  • CodeRefinery documentation lesson on Sphinx.

ReStructured Text syntax

Most surprising ReST points:

  • Double backticks for literals:

    Run ``nano`` to begin
    

    (configurable)

  • Links are scoped:

    :doc:`/triton/index`
    :ref:`tutorials`
    

    (configurable)

  • Two underscores at the end of external links:

    The main `Aalto website <https://aalto.fi/>`__
    
Github Action checks
ReadTheDocs
  • https://readthedocs.org provides a management interface for the docs

  • There is a joint aalto-scicomp account to manage it

  • Demo if time, but pretty much self-explanatory

  • Occasionally a build fails for no reason and rkdarst needs to go wipe and rebuild, or fix dependency versions.

Little-known features
We could use Markdown or Jupyter
  • Via MyST-parser or MyST-nb for Jupyter.

  • They all work together in the same site.

  • ReST is really nicer for this than shoving directives into Commonmark.

Compatible with many other projects
  • Standard documentation system for many projects

  • Used in recent CodeRefinery lessons, for example

Minipres
Redirect to HTTPS
  • ReadTheDocs doesn’t natively do this for external domains

  • Done via Javascript

  • Can anyone improve?

Other output formats
  • Sphinx can output to PDF, single-page HTML, epub, manual pages, and more.

  • Can anyone think of a use for this?

Substitution extension
sphinx-gitstamp
Open questions
Pull requests or not?
  • When should we use pull requests? When should we push directly?

  • In practice both are fine, up to you to decide what you want

  • rkdarst believes that, if you aren’t sure, push directly and ask for review.

Sharing with other sites
  • We had this long-term plan to build scicomp.aalto.fi so that other sites could share our HPC tutorials and customize them to their sites.

  • sphinx_ext_substitution (written by rkdarst) could make this easier

  • This has not yet been done, and by now scicomp-docs is so complex that I’m not sure if it is a reasonable thing to do.

Others at Aalto can use scicomp.aalto.fi
  • Should we encourage others to join our project here?

Testable docs
  • Our dream would be to make examples in a testable form, where one can automatically run them all and find errors.

  • For example, this python-openmp example includes everything needed to submit and run the file.

  • Can this be automatically tested? A bit too complex for the typical doctest.

Integrated HPC-examples
Don’t use ReadTheDocs anymore?
  • Github Actions + GitHub Pages or other hosting sites would work instead of ReadTheDocs now.

How can we keep things up to date?
  • Requires continuous work, like any docs.

  • What should the threshold be for removing old material?

  • We now have a last updated time at the top.

  • We clearly need to think about this more.

Visitor stats
  • ReadTheDocs provides limited stats based on web server logs.

  • rkdarst is against detailed web tracking.

  • Can we find a way to get both?

  • 2022 update: we have Plausible analytics which is sufficiently anonymous.

Building a community
  • How can we get more people to contribute?

Online work and support

See also

Our garage description page for users: Scicomp garage. (This is an internal description page)

Since 2020, Aalto Scientific Computing has worked online. Since we are a distributed team supporting users in many locations, this has improved our work in numerous ways. Online work gives us:

  • A way to interact closely with users regardless of our physical distribution.

  • Higher-quality, continuous interaction (including better onboarding).

  • Better work-life balance and adaptability to different lifestyles.

Since 2017, we have had weekly office hours called “garage”. Since 2020, they have been online and revolutionized our support. The garage gives us:

  • A standard way to help users interactively, without the burden of scheduling meetings.

  • A “social time” in the middle of the workday to chat with each other.

  • By combining the above two, we can chat about useful things relevant to our work, handle many internal meetings that would otherwise have to be scheduled, share knowledge better, and in general have the spontaneous interaction that everyone claims is missing from remote work.

rkdarst’s principles of remote work

The online garage helps with at least two of rkdarst’s principles of remote work, and part of a third:

  • No private messages (allow others to know what you are doing)

  • Don’t schedule meetings (use standard meeting locations and talk spontaneously)

  • Work in public (make it possible for others to join you)

How it works: “garage” support session
  • We have an announced time: 13:00, every workday.

  • Users and staff join the meeting during the scheduled time

  • We don’t promise any service level - some days, it could be that users arrive but there are no staff. But this has become so integral to our work that it never happens.

  • Users ask their question and we do initial triage

  • We help either in the main room or breakout rooms. (for example, main room for a question where there is one user and the topic is a good discussion point for everyone on the team)

  • At least one staff member helps. Usually, we try to have two helping: one who knows the topic and one who is learning it. (This is very useful on-the-job training.)

  • Staff don’t have to be always-on. It is usual for many staff to be working on their other work, passively listening in case something interesting comes up or someone says their name.

  • Staff can also easily be called to the meeting using chat.

  • Very low threshold for screensharing.

  • “Remote control” is very useful, and a middle ground between telling people commands to type (extremely slow and demotivating when someone has no idea what to do) and taking over their computer (demotivating/hiding information in another way) since the user can easily and actively see what is going on.

  • Often in other issues, when the actual problem is unclear, we will say “Let’s talk in garage” rather than try to debug by asynchronous chat. Since garage is so frequent, this feels good.

  • You can read our support flowchart on the Help page.

Technical setup
  • There is one recurring Zoom meeting

  • Meeting schedule = “recurring at no fixed time” option.

  • Everyone on our team is a co-host (must be in same Zoom organization)

  • The first co-host to join becomes the meeting host.

  • Any co-host can open breakout rooms or assign customers to breakout rooms. But for the most part we tell users where to go and they go themselves.

  • Normally, the first person to need breakout rooms opens an excess number (such as 10), selects “allow participants to choose”, clicks “open”, and takes no further management action.

  • Some people may initially use chat to ask their question (the dispatcher can also send these initial questions by chat). This is especially good as a second conversation while one problem is being discussed.

  • Zoom trolls have never been a problem, even though the link is public. One hypothesis is that by not listing specific dates on the webpage, it is not a findable target by someone looking for “where to troll now?”.

Typical procedures
  • Usually one person is the effective “dispatcher”: they make sure that everyone is greeted, take a basic description of each problem. They make sure that people are handled, call in the best supporter, etc. (after a team gets enough experience, this role becomes implicit).

How it works: internal meetings
  • The garage room is actually our only meeting room for all normal team meetings

    • For example, we have our weekly team meeting right before the garage one day of the week.

    • This keeps meetings on schedule and provides a day when we can be sure most people are at garage, for the hardest questions

    • We would enable the waiting room towards the end of these meetings (but normally we don’t use the waiting room).

  • Even other meetings, such as two people discussing something, happen in this room.

    • Worst case: meetings overlap or run into each other. But this is actually good: doesn’t everyone complain that you don’t spontaneously meet people online? We split into breakout rooms and manage. Sometimes we have even made important connections this way.

  • This isn’t just for users - other staff teams can come talk to us during this time. Basically, it replaces a lot of the overhead with any meeting with us.

  • Online-default meetings are great for people’s work-life balance, especially those with families.

  • Chat or other asynchronous text-based communication is a requirement for inclusive meetings. It allows anyone to contribute ideas without waiting for a pause, and more than makes up for any online awkwardness. (The “meeting agenda” below can also serve this purpose).

  • Meetings are managed with a Google Docs agenda.

    • Each week, a new heading is made, and it collects topics for the next meeting. There is no running through a list of ongoing projects and hearing “still going on”; every agenda item has been actively placed by someone over the last week who actively needs thoughts and a decision from the rest of the team.

    • Someone screenshares the agenda. Instead of needing to find a pause to talk, people can write information and thoughts directly into the agenda, so meetings scale better. People can also write information in advance of the meeting, to focus the meeting on discussion rather than sharing information.

    • Everyone should have the agenda open themselves so they can see, scroll, and contribute - a meeting is no longer just voice talking!

      • The meeting agenda can also serve as chat - if someone wants to say something but can’t find a time to use voice, they write it there directly as a point.

      • If you want, you can expect everyone to write down their most important points and summaries directly in the agenda themselves (instead of delegating that to a designated note-taker). This is more fair: it lets everyone write their notes in their own words, emphasize their most important points (unimportant points don’t get written), and gives others time to talk.

    • It is only one running document (not a new one each week). New weeks are added to the top (since top loads first). Attendees can easily scroll down to refer to past weeks.

    • This strategy has revolutionized our meetings. Other meetings have much more of a “this meeting should have been an email” feeling after this. (In no small part because the “this should have been an email” parts get written and read by everyone, with only a short mention if that’s all it needs).

How it works: general common space
  • If two people are text-chatting and need to talk in person, there is zero overhead. One simply asks “Zoom now?”, the other confirms, and they know exactly where to go. Or the answer might be “Garage tomorrow?”

  • This space is also used for random coffee breaks, etc., which are usually announced spontaneously.

  • In theory, especially when we are onboarding people, this can be a generic hangout space during downtime. You might meet someone there and chat and learn something.

  • In short, the meeting is the “commons” of “caves and commons”.

Problems with in-person office hours / garage
  • People have to bring their own laptop. When someone works on a powerful desktop, they can’t bring it along.

  • No screen-sharing. People are crowded around one computer looking at it.

    • You can’t type on their computer without taking it away from them. With screen sharing and “remote control”, at least they can clearly see what is happening and feel in control.

    • Really hard to have multiple supporters with one customer.

    • Online, from your main workspace, you hopefully have multiple screens: one can show the screenshare while the other holds your own debugging/testing work.

  • For individual-person office hours, or even an open office policy, someone may come by and the best person to answer may not be there, may be in another building, etc.

    • Even if they are there, one-on-one support doesn’t give the “on-the-job training” to other team members.

  • “Open door policy” makes for constant distractions.

  • In-person garage tends to be limited to once a week, since everyone has to go there. Staff leave their main workspace, so can’t work as efficiently. Online, it is completely reasonable to be working on other work while muted/video off and passively listening in case something useful comes up.

Open questions
  • What is the largest size team for which this works? What happens when we go over that?

  • What’s the best frequency? We really think that every day works best for support within a team.

  • Mixing different teams in general: how different can teams be and still share the same garage/standard meeting room?

  • If multiple teams have separate garages, should they be at the same time or different? Combined? (does it get too big?)

    • Is it even possible for one person to have multiple garages they need to keep in their mind - or is it a “one-per-person” kind of thing?

  • How many garages can someone attend (as staff) before it becomes “too much”?

  • Is there a better tech than Zoom? In 2022, it works much better than early 2021, and at least people can join via browser.

  • When people start working in-office again, how does this continue? (People have started, and Garage seems to be a permanent culture shift. But it helps that our offices are spread across different locations.)

Proposal:

  • Flip it around: don’t look at it as “how to scale garage to more staff”. Scale communities to the size that can be supported by a garage, then make more communities as needed, each with their own support infrastructure.

  • So garages contain 5-15 supporters, and the communities perhaps several hundred people. The communities can overlap or be virtual inside of organizational units.

  • The support staff within the garages network between communities on the support/tool side, so that they are aware of the broader environment and can direct the members to other garages as needed.

The future
  • Coordinated garages across different teams? At the same time or different?

  • Some sort of cross-organization garage sessions. But, is something only once a week good enough to support continuous work? Does it work as a starting point, then you direct the user to your own specific daily garage?

Recommendations for how to implement your garage

(I’m not sure what to say here that isn’t already said or implied above. Any ideas?)

See also
  • Our help page

  • List of garages

  • Why the name?

    • I think it came from another Aalto team that held a “travel garage”. Unsure where they got the name from or if there is a better name.

How to actually respond to user support requests?

I’ve been asked before, “how do you actually respond to customer support requests?”. There are some obvious answers (be polite, try to answer, etc), but are there any specific references for research computing / scientific computing support staff? This page collects my ideas after having done it formally and informally for years.

This page is specifically about making responses respectfully and with compassion for the requestors. It’s not designed to be a big-picture how-to of user support - there are plenty of other resources about that.

Unsorted notes:

  • one person takes the lead in communication

  • Start by talking with people about the big picture

    • their position

    • past work

    • what they expect to get out of the support

  • many questions are actually about:

    • the environment setup

Why care about how you respond?

An example:

When interviewing people once, we started our interviews with to-the-point factual information and questions. Our tone of voice was “bureaucratic”, to say the least. Our interviewees responded in kind: with little enthusiasm, and we could wonder whether they even wanted the job.

We realized something had to change. Our next interviewees were greeted with enthusiasm and excitement about the job. The interviewees responded likewise, and we could more easily see how someone could perform.

Why is this important? Basically, the tone we set for our users shapes how they will respond. Is computing a chore they hate? Is it something that’s fascinating, even if not their main goal? Do they see working with us as the highlight of their day or as a last resort? We need to set the right tone with our interactions. This is true in all of:

  • Our answers

  • Our requests for follow-up information

  • Outreach about our services

See also: Observer-expectancy effect and Clever Hans.

Levels of competence

Customers have all levels of existing competence and needs. The more of this you understand, the better you can assist - and it is needed to frame any response.

  • Understand the level that the requestor is at and the level they need to be at. (this is usually not apparent at first)

  • An answer far below their level is demeaning.

  • An answer far above their level is demotivating.

  • It can be hard to know the level to answer, so multiple levels of answer are useful: one general paragraph, then one more detailed paragraph properly connected. This also helps people advance up their level of confidence, but needs more writing.

  • Aalto SciComp’s Bloom’s taxonomy of scientific computing skills may help to guide your thoughts in evaluating this.

  • Discuss: Is it better to assume at too low a level or too high? How can we find the right level to answer at?

XY problem

XY problem: someone asks about their attempted solution (Y) and not their root problem (X). If a supporter focuses on the Y and not the X, the answers can be very inefficient.

Example: “How do I turn on the stove?” vs. “I am trying to make tea, how do I turn on the stove?”, which allows the answer to point out that the asker is trying to use an electric kettle on the stove.

  • Don’t assume that what someone asks for is what they really need - you need to read between the lines.

  • This isn’t their fault, maybe they don’t know what they need.

  • Possible mitigations:

    • When replying, state your assumptions in your response so that they can correct you if they notice something wrong (if this is relevant).

    • Also consider stating several other possibilities briefly, and when they would be relevant. For example: “Do XXX to install the software. But do you know that you can also load it via the YYY module?”

General guidelines
  • Think about what the underlying need is (X, not the Y)

  • Be verbose (or at least not short).

    • If your answer is “no”, it feels better to say it with many words, rather than few.

    • Verbosity is a sign of engagement, which makes the customer feel respected whether or not the verbosity is useful to them.

    • Be especially cautious about answers that are just a link to the documentation - unless they are specifically asking for that. Even then, try putting it in context.

  • Service gesture: something more than people expect (beyond the minimum that they asked). (example: try harder to find someone who can answer, point them to that person.)

Know your audience
  • The more you know about the person’s actual work, the faster and better you can answer questions.

  • This is a more direct lesson for the people managing support, but can you do anything about it yourself, too?

Consider at what level someone needs support
  • Do they need single answers to a question?

  • Are they very lost and need to work with someone to implement it?

    • If you answer small questions piece by piece, this becomes inefficient hill-climbing.

    • Direct them to an RSE service for more support?

  • Do they need a tutorial, reference, theoretical explanation, or how-to (the four types of documentation)? These are all very different types of answers or links.

Accept that you can’t do everything
  • Make this decision explicit, not implicit.

  • An implicit decision here means it is made based on internal biases.

  • Better to discuss among the team to make sure it is consistent.

  • Document what you do know and learn while working, even if you don’t have the full answer yet.

    • Yes, this can be a rather hard thing to do: we don’t want to give a partial or possibly wrong answer.

    • On the other hand, being silent for days or weeks until you have the proper answer really doesn’t help anyone. With the rate of research, they have probably even gone on to something else!

    • Consider if you should keep the requestor in the loop (generally yes, probably good, but qualify if something is still in progress and may not work).

    • This also helps any future staff who may pick up after you. So, even if you don’t document to the requestor, document internally.

  • Try to avoid long silences before any replies, for example if you don’t even know who can answer. This can be especially hard without a front desk or if you think “just a bit more and we’ll know something”.

Giving bad news

Sometimes you have to say “no”

  • Again, be more verbose rather than less

  • Acknowledge the X and the Y of the initial request, so that they know the request really isn’t possible (rather than thinking that you didn’t understand).

  • State why it’s not possible, in as many or as few words as appropriate.

  • Can you turn this into an X-Y answer - find what they really need, that you (or someone) can do?

If you don’t know the answer

Our audience does all kinds of advanced work, so often we don’t know the answer - or don’t know it right away.

  • Ask to see what they actually do, all error messages, etc. Ask them to share their screen. This can help you to see some problems, and makes most problems easy.

  • Request the basic information so you can “work on it yourself for a bit to save time”; this gives you enough time to study solutions.

  • Related to the above, take the time to make things reproducible. This is needed for you to begin working, but seeing the basic steps will also help you understand the background.

Dealing with mis-directed issues
  • It can be frustrating when someone asks in the wrong place

  • Be nicer than just saying “no”: since you have presumably already understood what the issue is, you can actually give useful pointers to where to ask next. This itself may be a useful answer to them.

  • Can you give keywords or a copy-paste text that explains the actual problem, which they can send to the other support channel you are directing them to? This:

    • Saves the other staff time (they don’t have to do the X-Y analysis themselves)

    • Saves the customer time in thinking about what to say

    • Makes the customer feel valued and validated

Communication strategies
  • Communicate with respect. Informal is probably OK, but know your audience.

  • Sarcasm is usually bad (but we should already know it’s bad online). Even if you think the person reading now will get it, what about all the people in the future who might read and rely on the same answer?

In-person or synchronous support
  • See the How to help someone use a computer for many ideas that are relevant to in-person support (and more).

  • When you learn something, do you want to create an issue about it so that the knowledge can be used later?

  • Try to avoid simply taking over their computer and doing something. On the other hand, dictating something key-by-key can be equally frustrating. Try to let the user do as much as possible and clearly explain why you do some things yourself.

    • Consider saying: “I don’t know, so it’s hard for me to tell you what to do. But I can try to figure it out while you watch - is that good?”

    • Online support allows screen-sharing and remote control, which allows you to type but the other person to still feel like they are an important part of the process since they can see everything.

Ticketing system support
  • Is your ticket system public (e.g. GitLab internal to the organization, but not private to your team) or private (requestors only see their own tickets)? You should answer respectfully either way, but this does matter: the more people who can see it, the more careful you should be, but also the more long-term benefit your answers have.

  • Document your intermediate progress at least as comments in the tickets, even if it’s not appropriate to send it to the user. (See above about silence.)

  • You want separate issues in separate tickets. Often, users will ask multiple things at once. You’ll have to figure out what to do about it, but you should probably clearly say “more emails is better, don’t worry about sending us three emails at the same time if they are different things”.

    • Can you separate the issues yourself, instead of replying “please send this again”?

Private email support
  • Do you forward it to a ticket system? Information in private email always gets lost.

  • If you reply with only “please re-send this”, that can sound like you don’t want the issue in the first place. What do you do?

Plan for problem situations

Exercises:

How do you answer things such as the following? Write draft responses:

  • Not enough information

  • Possibly

  • Mis-directed

  • Something requestor should be able to do themselves?

Examples

(examples to be inserted here)

See also

Events are listed below in chronological order, but in the left sidebar they are roughly sorted by usefulness to a broad audience (including events which have been drafted but not presented).

  • Triton hardware, Ivan Degtyarenko, Wed 3.3 2021, 10:00

    • Triton hardware wise: machine room, different archs, IPMI, hardware troubleshooting

    • [Material includes sensitive data, can be provided on request]

  • Triton networking, Ivan Degtyarenko, Fri 12.3 2021, 10:15-11:15

    • Networking: IB and Ethernet setup, IB islands, troubleshooting

    • Internal video (material includes sensitive data, provided on request)

  • Ansible for FCCI, Mikko Hakala, Mon 22.3 2021, 14-15

    • Ansible, provisioning with OpenHPC, standalone servers

    • Internal video

  • User support in Aalto Scientific Computing, Richard Darst, Mon 29.3 2021, 14-15

    • User support made easy: different support level by Science IT, docs, issue tracker, garage, etc

    • Presentation

    • Video

  • Triton software stack, Simo Tuomisto, Fri 9.4 2021, 10:15-11:15

    • Triton / FCCI software stack: Spack, building software, …

    • Video

  • Jupyter at Aalto, Richard Darst, Fri 30.4 2021, 10:15

  • Anaconda on Triton: automatic build system, Simo Tuomisto, Fri 7.5 2021, 10:15

    • Anaconda setup on Triton

    • Video

  • Diversity in computational sciences vs university services

    • This wasn’t originally given in FCCI Tech but is relevant to the people reading this page.

    • Presentation

    • Video

  • Sphinx documentation, Richard Darst, Fri 14.5 2021, 10:15

    • Open and accessible documentation using Sphinx, RST/MyST, and Readthedocs: the story behind scicomp.aalto.fi.

    • Presentation

    • Video

  • ClusterStor, Andreas Muller (HPE), Tue 18.5 2021, 12:00

    • Storage systems: ClusterStor hardware and software behind Triton’s new /scratch. Maintenance, troubleshooting.

  • RSE service status update, Jarno Rantaharju, Marijn van Vliet, and Richard Darst, Fri 28.5 2021, 10:15

  • How we did Summer Kickstart 2021, Richard Darst + Reading + Video

  • Introduction to a Kubernetes deployment, Richard Darst, Fri 8.10 2021, 10:15

    • What is kubernetes and when is it useful?

    • Different types of Kubernetes objects and how you learn about them

    • Walk through how you would deploy a service into kubernetes - live demo

    • Q&A

    • Reading

    • Video.

  • jupyter.cs, Richard Darst, Fri 19.11 2021, 10:00

  • Triton authentication, Mikko Hakala, Fri 26.11 2021, 10:15

    • Internal video

  • NetApp at Aalto: department admins guide, Pekka Alaruikka / Mika Kontiala, Fri 3.12 2021, 10:15

    • NetApp setup at Aalto

    • What department admins may and may not do with TeamWork

    • Practicalities: volumes, exports, qtrees, quotas, settings, permissions etc

    • (if time left) about backups on the TeamWork, troubleshooting, getting help, etc

  • High Performance Clusters at NVIDIA, Janne Blomqvist, Fri 10.12 2021, 10:15

    • NVIDIA cluster setup overview

    • Best practices of the HPC cluster maintenance

    • What we are doing wrong at FCCI compared to NVIDIA

  • The future of teaching: CodeRefinery teaching strategy, Richard Darst, Fri 17.12 2021, 10:00

    • The role of teaching in CodeRefinery and Aalto Scientific Computing

    • Tools and strategies we use to successfully teach online: HackMD, streaming, helpers, teams, co-teaching, and more.

    • Future outlook and goals

    • Reading

    • Video

    • Demo of our online teaching strategies

  • Open OnDemand experience by Esko Järnfors et al. (CSC), Fri 17.12 2021, 12:00

    • NOTE: the second talk on the same Fri 17.12

  • Simple Kubernetes deployment by Richard Darst, Fri 3 Nov 2023

    • If you have a containerized service, how can you easily deploy it using Kubernetes?

    • Notes, Video

  • Demo: Publishing a Python Package by Jarno Rantaharju, Fri Jan 26th 2024

  • Demonstration of open source software publishing. I will take a part of an existing Python package and spin it off as a small stand-alone package. We will discuss what is needed for a software publication and recommended practices.

  • Notes

Proposed/requested future topics
  • SLURM setup, Simppa Äkäslompolo

  • Cluster monitoring, Simo/Mikko

  • Online courses and CodeRefinery, Richard Darst

  • Online work and support, Richard Darst

  • Respectfully and efficiently handling user support requests, Richard Darst

  • Science-IT data management: policies and procedures

  • Science-IT data management: storage systems and tech setup

  • History and structure of FCCI

  • Security

Send pull requests to this section to add more requests, or to the previous section to schedule a talk.

Other

Sustainability and Environment statement for the Triton HPC cluster

Building a sustainable future is the most important goal of our community, and saving energy is one of the most significant actions that we can take to improve sustainability. Fast and large computational resources have high energy requirements, whether it is our Aalto High Performance Computing (HPC) cluster Triton or the workstation at your desk. At Aalto Scientific Computing / Science IT we take these things seriously, and we believe that transparency in the energy consumption of our shared computational resources benefits both the users of our cluster and the general public.

In this statement, we first summarize the action points that we are implementing to improve Triton HPC energy efficiency and list what you can do as a Triton user to reduce the environmental impact of your computations. Then we describe the energy consumption of the computational nodes that form our HPC cluster, and we explain the energy-saving strategies implemented to optimize energy use when nodes are idle and when the national energy demand, as reported by Fingrid, requires everyone to be more careful with energy consumption.

Important

What Aalto Scientific Computing / Science IT is doing to reduce energy consumption

Here are the main action points on what Aalto Scientific Computing / Science IT is doing to reduce energy consumption for the Triton HPC cluster.

  1. Support for researchers to optimize their calculations: Our daily SciComp garage session and Research Software Engineering service provide ongoing support for making all computational and data-intensive work as efficient as possible.

  2. Switching off nodes when the national energy demand is high (coming during the winter): Though the Triton HPC cluster will not be affected by Fingrid power cuts, we will reduce the number of active computational resources during periods of high demand according to Fingrid announcements. Triton is too small to directly participate in Fingrid’s relevant demand response program.

  3. Acquiring newer hardware with better energy efficiency (ongoing): More energy efficient nodes are being acquired and they are already replacing older hardware.

  4. Moving to a new datacenter with better power usage effectiveness (2024): A new colocation facility with better PUE has been chosen and we have started the work needed to switch to the new location.

What Triton users can do to reduce their energy consumption

The most important thing you can do is to make your computations as efficient as practical. Second, centrally-hosted compute infrastructure is generally much more efficient than standalone computing solutions. For any of the matters below, we offer extensive, immediate support in our daily garage (every day at 13:00). For significant cases, our Research Software Engineers can directly work with you to improve your workflow with minimal trouble to you.

  1. Make your computations efficient. Make sure that a) your own code is as efficient as reasonable, and b) it fully uses the reserved HPC resources.

    1. Not all code deserves to be fully optimized, but the more resources you use, the more you should think about optimizing.

    2. When you work with HPC resources, your starting point for looking at the efficiency of your computations is seff <JOBID> (see the example after this list). For additional support, come to the daily garage mentioned above.

  2. Save energy by using the centralized infrastructure: HPC computations are more efficient than your workstation - even before considering your workstation’s idle time during development. Being a shared resource, you only use what you need.

  3. Do you always need a GPU? While many computational tools offer faster computing times by using GPUs, one should consider how much is gained by using a GPU versus CPUs. Roughly, the most expensive GPUs need 5 times more energy than a full 40-core CPU node (assuming the computations run at 100% efficiency), so if your computation on GPUs is not at least 5 times faster than running it on CPUs, you should consider avoiding the GPU and accepting a 2-to-4 times slower computation.
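
For example, a minimal sketch of checking one job with seff (123456 is a placeholder job ID; the exact output depends on the Slurm version installed, but it includes CPU and memory efficiency summaries):

$ seff 123456      # look at the "CPU Efficiency" and "Memory Efficiency" lines

If either efficiency is consistently low across your jobs, request fewer resources or bring the job to the garage for help.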

Controlled power cuts: the Triton HPC cluster will not be affected

As communicated by Fingrid here, there is a chance of national power cuts during the upcoming winter. The Triton cluster is colocated in a CSC machine room, which also hosts other nationally important infrastructure and is not expected to be affected by the power cuts. In the case of unexpected outages, there is a backup generator. As for the connectivity between the internet and Triton, Aalto IT Services has ensured that the physical switches providing remote access will also not be affected by power cuts.

Even though Triton should not be affected by power cuts, we will react to the national electricity supply and reduce the power consumed during these periods.

Energy consumption of Triton

For the first half of 2022, Triton’s average power was 214 kW (long-run average). This includes all compute nodes, GPUs, data storage, network, and other administrative servers. It does not include cooling.

  • A typical CPU node consumes around 450W when active and 60W when idling (Dell PowerEdge C6420, 40 CPU cores).

  • The newest GPU nodes use 2200W at peak use and average 1200W (Dell PowerEdge XE8545, 48 CPU cores and 4 NVIDIA A100 cards).

In general, Triton has a relatively high usage factor (on average above 90% in the year 2022), so there is minimal waste from idling. While our current machine room does not recover waste heat for district heating, our new machine room will be able to do so. Furthermore, we are constantly updating our hardware with new nodes with more efficient energy consumption. You can check further details about Triton’s hardware at this page.

For comparison, the minimum power to participate in Fingrid’s demand-response frequency restoration reserve market is 1MW.

Energy efficiency of the CSC colocation

The energy efficiency of colocation facilities is described by the Power Usage Effectiveness (PUE) ratio, determined by dividing the total amount of power entering a data center by the power used to run the IT equipment within it. In an ideal world PUE should be as close to 1 as possible; the most efficient datacenters in the world report a PUE of 1.02 (reference), and the average datacenter has a PUE of 1.57 (average from a survey in 2021).

Triton is physically located at CSC colocation facilities with other servers supporting all researchers in Finland (e.g. the FUNET network). Our current colocation has a PUE of 1.3. This is not the state of the art, although it is better than the average datacenter around the world. Energy efficiency will be a very important criterion in the upcoming move to a new facility. The current tentative plan is to move Triton’s hardware to a new colocation facility during 2023 and be ready for 2024.
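
As a rough illustration of what this means in practice, combining figures from this page (Triton’s 214 kW average IT load and the current PUE of 1.3):

  PUE = total facility power / IT equipment power
  1.3 ≈ 278 kW / 214 kW, i.e. roughly 64 kW of overhead for cooling and power distribution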

Impact of Triton hardware purchases

Unlike many clusters, Triton is not rebuilt from scratch every few years. Triton is continually upgraded, and old hardware is only discarded after it is actually obsolete (which usually happens due to excessive energy consumption relative to newer hardware). This allows us to adjust the e-waste/power-consumption tradeoff dynamically, depending on the circumstances. We try to minimize the entire lifecycle impact of our cluster. Yes, Triton is a metaphorical Ship of Theseus.

Web accessibility

This website is partially conformant with the Web Content Accessibility Guidelines (WCAG) level AA.

This is the accessibility statement for the scicomp.aalto.fi website. The accessibility requirements are based on the Act on the Provision of Digital Services (306/2019).

But as we know from other Aalto web sites, web accessibility doesn’t mean it’s actually useful for any particular purpose. We strive to make this site actually usable by everyone, and we welcome any contributions to help us with that.

Accessibility status of the website

The Web Content Accessibility Guidelines (WCAG) defines requirements for designers and developers to improve accessibility for people with disabilities. Based on self-assessment with Web Accessibility Evaluation Tool, this website is partially conformant with WCAG 2.1 level AA on computers, tablets, and smartphones. Partially conformant means that some parts of the content do not fully conform to the accessibility standard.

Inaccessible content

Below is a description of known limitations, and potential solutions. Please contact us if you observe an issue not listed below.

Known limitations for scicomp.aalto.fi website:

  • Inclusion of PDF documents that might have accessibility issues.

Please follow this issue to track updates and improvements to the accessibility of scicomp.aalto.fi.

Technical specifications

Accessibility of scicomp.aalto.fi website relies on the following technologies to work with the particular combination of web browser and any assistive technologies or plugins installed on your computer:

  • HTML

  • CSS

  • WAI-ARIA

These technologies are relied upon for conformance with the accessibility standards used.

Next steps for improving the accessibility

Please follow this issue to track updates and improvements to the accessibility of scicomp.aalto.fi.

Accessibility feedback

We welcome your feedback on the accessibility of scicomp.aalto.fi website. Please let us know if you encounter accessibility barriers on scicomp.aalto.fi website:

Supervisory authority

If you encounter any problems with accessibility on the website, first send your feedback to us. We will respond to your feedback within 14 days.

If you are not satisfied with the response you have received from us, or if our response does not arrive within 14 days, you may file a complaint with the Regional State Administrative Agency for Southern Finland. https://www.saavutettavuusvaatimukset.fi/oikeutesi/ilmoita-ongelmasta-saavutettavuudessa/

Contact details of the supervisory authority
Regional State Administrative Agency for Southern Finland
Accessibility monitoring unit
Phone: +358 (0)9 47001
Release and update information

This accessibility statement was last updated on 26 October 2020.

This website was launched on 15 June 2017.

This accessibility statement is based on a similar statement from Fairdata.fi.

About this site

These docs originally came from the Triton User Guide, but now serve as a general Aalto scientific computing guide. The intention is a good central resource for researchers, kept up to date by the whole community. Many parts are useful to the broader world, too. We encourage the community and world to contribute whenever they see a need.

Sphinx is a static site generator - you can build the site on your own computer and browse the HTML. It’s automatically built and hosted by ReadTheDocs, but you don’t need to mess with that part. Github will validate basic syntax in pull requests.

See also

Technical documentation with Sphinx for an overview about how and why it’s set up like this.

Contributing

We welcome contributions via normal Github open source practices: send us a pull request.

This documentation is Open Source (CC-BY 4.0), and we welcome contributions from the community. The project is run on Github in the repository AaltoSciComp/scicomp-docs.

To contribute, you can always use the normal Github contribution mechanisms: make pull requests, issues, or comments. If you are at Aalto, you can also get direct write access: make a Github issue, then contact us in person or by email for us to confirm.

The worst contribution is one that isn’t made. Don’t worry about making things perfect: send your improvement and someone can improve the syntax/writing/etc as needed. This is also true for formatting errors - if you can’t do ReStructuredText perfectly, just do your best (and pretend it’s markdown, because all the basics are similar).

When you submit a change, there is continuous testing that will notify you of errors, so that you can make changes with confidence: “wiki rules: deploy and iterate” rather than “perfect before merge”.

By contributing, you agree that your content may be used under the site licenses (CC-BY 4.0, or CC0 for examples).

Requirements and building

Set up the environment first (an example follows, but do as you’d like). The basic requirements are sphinx and sphinx_rtd_theme, which are also packaged in Ubuntu (python-sphinx and python-sphinx-rtd-theme):

$ python3 -m venv venv
$ source venv/bin/activate
$ pip install -r requirements.txt

Then you can build it locally to test:

$ make html
$ sphinx-autobuild . _build/html/     # starts web server that automatically updates
$ make clean check                    # Full rebuild and warn of important errors

HTML output is in _build/html/index.html.

Editing

In short: find an example page and copy it. To add sections, add a new page in a subfolder. In order to appear in the sidebar, it has to be linked from a toctree directive: check nearby index.rst pages and add it there.
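
For example, a minimal sketch of listing a hypothetical new page newpage.rst in the nearest index.rst toctree:

    .. toctree::
       :maxdepth: 1

       newpage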

Recommended pages for copying:

Most common missed quirks
  • Double backquote for literal text, not single. (Why? Single backquotes can be assigned other purposes, like :doc: links, :ref: links, or in other projects :func: and so on. We stay generic so we are compatible with other projects that make a different choice.):

    Run ``ssh -X triton.aalto.fi`` to ...
    
  • Raw HTML links have two underscores. (Why? A single underscore means other, fancier things. Most links are internal reference/docs links):

    The `OpenSSH project <https://www.openssh.com/>`__ does...
    
  • Internal links have structure: they can be :doc:, :ref:, etc. If you link to something this way, Sphinx knows where it is, validates it at build time, and you can give just the link and it takes the title from the target.
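
    For example (the page path and label name here are hypothetical placeholders):

    See :doc:`/path/to/page` to link a whole page, or :ref:`my-label` to link
    a section marked with ``.. _my-label:`` above its heading.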

  • You can set default highlighting for literal blocks, so you don’t have to do .. code-block:: LANGUAGE all the time:

    .. highlight:: console
    

    This sets the default for all literal blocks, but you can still use .. code-block:: for other cases (or change the default partway through).

  • For command line, use the console highlighting language instead of bash or others. console will highlight the $ and make it not selectable so it won’t be copied.
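
    For example (reusing the command shown earlier):

    .. code-block:: console

       $ ssh -X triton.aalto.fi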

  • This isn’t relevant to scicomp-docs, but intersphinx lets you link directly to function/etc definitions in other Sphinx docs, by function name. (This is why rigid structure is nice). Python for SciComp heavily uses this for great effect.
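
    A minimal sketch of the usage side, assuming the project’s conf.py maps the python key to the official Python documentation in intersphinx_mapping:

    Writing :py:func:`functools.partial` in the text renders as a link straight
    into the Python documentation.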

ReStructured text

ReStructured Text is similar to markdown for the basics, but it has a more strictly defined syntax and more higher-level structure. This allows more semantic markup, more power to compile into different formats (since there isn’t embedded HTML), and advanced things like indexing, permanent references, etc.

Restructured text quick reference and home.

Note: Literal inline text uses `` instead of a single ` (the latter works but gives a warning).

A very quick guide is below.

Inline syntax

Inline code/monospace, emphasis, strong emphasis

``Inline code/monospace``, *emphasis*, **strong emphasis**
Literal blocks, code highlighting

Literal blocks (= code blocks) use :: and are indented:

Literal block
Literal block
::

  Literal block
  Literal blocks

Block quotes can also start with paragraph ending in double colon, like this:

Block quote
Block quotes can also start with paragraph ending in double colon,
like this::

    Block quote

If you define a highlight language, it will be used as the default highlight language for every block:

.. highlight:: python

Use python for Python code. Use console for console commands, and include the $ before the commands. The $ won’t be selectable, so copy-and-paste works well.

Admonitions: notes, warnings, etc.

Notes, warnings, etc.

Note

This is a note.

Warning

This is a warning.

Admonition directives have titles.

This has misc text.

.. note::

  This is a note.

.. warning::

  This is a warning.

.. admonition:: Admonition directives have titles.

   This has misc text.

.. admonition:: Dropdown can be clicked to expand.
   :class: dropdown

   When it's not important for everyone to see.  ``:class: dropdown``
   sets a CSS class which gets interpreted in the HTML.
Indexing

Indexing isn’t currently used.

.. index:: commit; amend

.. index::
   commit
   commit; message
   pair: commit; amend

:index:`commit`

:index:`loop variables <pair: commit; amend>`

Aalto Scientific Computing (ASC) maintains these pages with the help of the Aalto community. This site is open source: all content is licensed under CC-BY 4.0 and all examples under CC0 (public domain). Additionally, this is an open project and we strongly encourage anyone to contribute. For information, see the About this site and the Github links at the top of every page. Either make Github issues, pull requests, or ask for direct commit access. Be bold: the biggest problem is missing information, and mistakes can always be fixed.