Virtual core for data science¶
Warning
virtual-core isn’t maintained anymore!
Virtual Core is a turn-key solution to deploy a complete set of data-science core services that can be used as a base infrastructure for big data analysis. It was developed to support bioinformatics data analysis at the University of Montana.
The solution is based on Docker and it currently includes:
- An LDAP container, along with a web interface (phpldapadmin)
- Zabbix-based monitoring
- A PostgreSQL database server
- An extensible software container, currently including Python and R tools for data science (based on Anaconda)
- A user container, where users can log in and run all data-science applications
- A file server, capable of routing data from other file servers (e.g. communicating with a Samba server while exposing an NFS interface)
- Cluster software (currently only SLURM)
- A SLURM compute container that can be deployed across a cluster (via Docker swarm)
- An exploratory analysis server, which includes a JupyterHub
- A plain web server
Most communications are SSL-secured, and the system can work as an ad-hoc SSL certificate authority.
The containers can be split across a cluster with Docker Swarm.
A wizard is included to get a single-machine configuration up and running.
The system can be extended with flavors (e.g. natural language processing, finance, customer analysis).
We currently have a flavor for bioinformatics with a Galaxy container, extra analysis software (based on Bioconda), and a file router that can be configured to receive data from Illumina sequencers.
Warning
Virtual Core is currently in production at the University of Montana, but installing it elsewhere is still quite hard. We are working hard to produce documentation, but it is still under heavy development (as you can see here). Some functionality is still being finalized. That being said, you are most welcome to try this (pre-alpha quality) software. If you need any help, do not hesitate to contact me.
Contents¶
Installation¶
Here you can find a set of containers to help create a data-science core architecture, for example:
- LDAP server
- PostgreSQL server
- File router (NFS and Samba)
- Galaxy server
- Zabbix server
- Software server (i.e. a lot of pre-installed software)
- Interactive Compute server (a place for users to log in)
- Exploratory Analysis server (JupyterHub with JupyterLab)
- SLURM grid configuration
There is a focus on bioinformatics, but the infrastructure can be used for other applications.
Base images¶
We use Alpine Linux for simple servers (very small footprint) and Ubuntu for larger images. Some containers may be derived from Debian images instead.
Dependencies¶
- Python 2.7 (for ansible) and 3.5+
- PyYAML
- Docker
- Ansible
- docker-py (python-docker on ubuntu)
Todo
Check Python version for ansible (conda…)
If you use the setup wizard (strongly recommended for a first install), you will also need:
- Flask
- openssl and pyOpenSSL (if you need to generate keys)
(explain with conda)
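Pending that explanation, one possible way to assemble the dependencies with conda is sketched below. The environment name and package list are illustrative assumptions, not the project's official requirements:

```shell
# Hypothetical conda-based setup; environment name and package versions
# are illustrative, not the project's official requirements.
ENV=virtual-core
if command -v conda >/dev/null 2>&1; then
    conda create -y -n "$ENV" python=3.6 pyyaml flask pyopenssl || true
    # docker-py and ansible are often easier to install via pip
    conda run -n "$ENV" pip install ansible docker-py || true
fi
```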
Installation¶
Virtual Core uses Docker swarm mode. You can check the Swarm mode documentation, though we provide basic instructions:
docker swarm init
Virtual Core requires a local Docker registry. Again, you can find documentation on the Docker site, but you probably only need to do this:
docker run -d -p 5000:5000 --name registry registry:2
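As a sketch of how the local registry is then used, you can retag one of your builds and push it so that swarm nodes can pull it. The image name below is hypothetical; substitute one of your own:

```shell
# Hypothetical image name; substitute one of your own builds.
REGISTRY=127.0.0.1:5000
IMAGE="$REGISTRY/virtual-core-software:latest"
if command -v docker >/dev/null 2>&1; then
    # retag for the local registry (ignore failure if the image does not exist)
    docker tag virtual-core-software "$IMAGE" 2>/dev/null || true
    docker push "$IMAGE" 2>/dev/null || true    # push so swarm nodes can pull it
    curl -s "http://$REGISTRY/v2/_catalog" || true  # list registry repositories
fi
```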
The setup described here installs all your servers on the local machine. If you have a big-iron machine, this might be exactly what you want. If you have a cluster, this is still a reasonable starting point, though you will have some work to do, especially on the security front.
- Make sure you have a Docker installation with swarm mode running.
- Use the wizard to configure the most complicated settings:

  ./run_wizard.sh

  Typically this will be available on http://127.0.0.1:7000/
- Create an overlay virtual network called virtual_core:

  docker network create --driver overlay --attachable --subnet 172.18.0.0/24 virtual_core

  Make sure this network exists every time you start the system. Also make sure the subnet is OK for you.
- Create a directory that will store all your Docker volumes. This might need to be very big. Let's call this your base directory.
- Copy the sample hosts file (you will want to edit this in the future):

  cp etc/hosts.sample etc/hosts

- Make sure all variables on ansible/ are correctly defined, especially on the common role.
- Create the directory structure:

  python src/create_directory_structure.py <base_directory>

- Run the Ansible playbook:

  cd _instance/ansible; ansible-playbook --ask-pass -i ../../etc/hosts main.yml
Notes on containers¶
Support containers¶
These are available on Docker Hub. There are two different flavors: Alpine- and Ubuntu-based.
All Virtual Core containers are based on either alpine-supervisor-sshd or tiagoantao/ubuntu-supervisor-sshd. These have the following features:
- Processes are controlled by supervisord
- A Zabbix monitor is included
- An SSH daemon is installed; an authorized_key has to be supplied and is used for root access
- The entry point is /sbin/server_run.sh. A file /sbin/prepare_magic.sh can be installed and will be called before supervisor (e.g. to use passed environment variables to set things up)
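For example, a minimal prepare_magic.sh might look like the sketch below. The variable names LDAP_URI and CONF are assumptions for illustration, not names the real containers use:

```shell
#!/bin/sh
# Hypothetical /sbin/prepare_magic.sh: called before supervisord starts,
# so environment variables passed via `docker run -e` can be turned into
# configuration files. LDAP_URI and CONF are illustrative names only.
: "${LDAP_URI:=ldaps://ldap:636}"   # default when no variable is passed
: "${CONF:=/tmp/service.conf}"      # a real container would write under /etc
printf 'ldap_uri = %s\n' "$LDAP_URI" > "$CONF"
```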
Todo
Discuss shared volumes and the need of consistent mount points
Todo
Talk about docker swarm
LDAP service¶
Both containers are based on Alpine Linux. This will probably change in the future as Alpine Linux is not a good solution for most other services.
ldap¶
Attention
The container exposes only the SSL port, but a non-SSL port is available inside the Docker network.
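To check the SSL port from another container on the network, something like the following should work. The hostname and base DN are placeholders for your own configuration:

```shell
HOST=ldap                   # placeholder: your LDAP container's hostname
BASE="dc=example,dc=org"    # placeholder: your base DN
if command -v ldapsearch >/dev/null 2>&1; then
    # -x: simple (anonymous) bind over the SSL port
    # (ignore failure when no server is reachable)
    ldapsearch -H "ldaps://$HOST:636" -x -b "$BASE" '(objectClass=posixAccount)' || true
fi
```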
PostgreSQL service and container¶
The PostgreSQL database is configured by default to authenticate via LDAP, but some databases are trusted to specific hosts, normally those of other containers. For example, the zabbix database is controlled by the user zabbix, which is trusted to log in from the host zabbix. Note a few things:
- This is a default behavior that you can change during the configuration stage
- All databases needed by other containers are created along with access rights, even for containers that are not going to be installed. Note that the database schemas should be created by the respective containers.
- Make sure you are happy with the access rules, especially if you spread containers across multiple machines. In that case you will have some tweaking to do.
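As an illustration of the kind of rules this produces, a pg_hba.conf might contain entries like the following. The subnet, base DN and host names are placeholders, not the generated defaults:

```
# Regular users authenticate against LDAP (simple bind; values are placeholders)
host  all     all     172.18.0.0/24  ldap ldapserver=ldap ldaptls=1 ldapprefix="uid=" ldapsuffix=",ou=users,dc=example,dc=org"
# The zabbix database trusts the zabbix user connecting from the zabbix host
host  zabbix  zabbix  zabbix         trust
```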
Software container¶
Compute nodes¶
We provide support for clusters via SLURM. You can be in one of three situations:
- You already have a cluster infrastructure not based on SLURM. We do not support this directly, but if your cluster is based on free software we would be most happy to try to support it, so please do contact us!
- You already have a SLURM based configuration. In this case you will have to make a few changes to the configuration. The wizard will help you here.
- You have nothing installed. We will take care of setting up a complete SLURM solution.
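For the last case, the setup boils down to a slurm.conf shared by the head node and the compute containers. We have not reproduced the generated file here; the fragment below only sketches the general shape of such a configuration, with placeholder node names and sizes:

```
ClusterName=virtual_core
ControlMachine=slurm_head                     # placeholder head-node hostname
NodeName=compute[1-4] CPUs=8 State=UNKNOWN    # placeholder compute containers
PartitionName=main Nodes=compute[1-4] Default=YES MaxTime=INFINITE State=UP
```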
User server¶
Based on the software container. It is also a SLURM head node (if desired).
The file router container¶
The file router container serves two purposes:
- Exporting NFS volumes
- Interacting with Windows machines
Exporting NFS volumes¶
To export NFS volumes, configure the NFS exports file appropriately (see an example in copy/etc/exports.sample).
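For reference, /etc/exports entries typically look like the following. The paths and subnet are placeholders; see the sample file for the real defaults:

```
# Export volumes read-write to the virtual_core subnet (placeholder values)
/data  172.18.0.0/24(rw,sync,no_subtree_check)
/home  172.18.0.0/24(rw,sync,no_root_squash)
```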
Interacting with Windows machines¶
There are a few use cases here:
- Interacting with services that are on Windows (For example, Illumina sequencers)
- Exporting volumes to users
Warning
Integration of LDAP and Windows authentication is nothing short of a mess. There are plenty of alternatives on how to do it, but unless you really have to, you might want to consider avoiding it. There is plenty of documentation on the web, and we will not cover it here.
If you want users to be able to mount your volumes as Samba shares and have integrated LDAP authentication, we offer an ad-hoc script /usr/bin/change_password.py that syncs passwords between Samba and LDAP. It works this way:
- The user already has an account on LDAP
- You create an account for the user on Samba using pdbedit -a ldap_uid, then run smbpasswd ldap_uid; the password will be boot
- The user logs in, as soon as possible, on the file_router and runs change_password.py boot to sync the passwords
- From then on, the user can log in on the file_router to change the password using change_password.py (indeed, password changes can only happen on the file_router, or the passwords will be out of sync)
Yes, this is ugly, but LDAP/Samba/Windows AD integration is ugly.
Note that for ad-hoc users (for example, in the bioinformatics case, interacting with an Illumina sequencer) you can just maintain a separate account on Samba alone, without LDAP sync.
Customizing your system¶
_instance directory
LDAP authentication based on the virtual core¶
Ubuntu¶
apt-get install libpam-ldap ldap-utils
ldap-utils is recommended, not required.
On the configuration you will need to supply your LDAP URL (with https) and your base DN. You probably will want to allow users to change their passwords (with impact on the LDAP server). The server is not authenticated.
This will change the PAM configuration on /etc/pam.d. It will also change /etc/ldap.conf.
If you are using your own certificate authority, you will need to add its certificate by changing the TLS_CACERT parameter in /etc/ldap/ldap.conf. Be careful with auto-reconfiguration.
Finally, do
pam-auth-update
and you should be done.
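Put together, the relevant bits of configuration look something like the fragment below. The server, base DN and CA path are placeholders for your own values:

```
# /etc/ldap.conf (values are placeholders)
uri ldaps://ldap.example.org:636
base dc=example,dc=org
pam_password exop

# /etc/ldap/ldap.conf (only needed with your own certificate authority)
TLS_CACERT /usr/local/share/ca-certificates/my-ca.crt
```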
https://wiki.debian.org/LDAP/PAM
(uid limitation - pam_filter)
CentOS 5¶
We do this manually.
Make sure openldap-clients and nss_ldap are installed
Copy your CA certificate to /etc/openldap/cacerts
Make sure /etc/ldap.conf has (among other things):
URI ldaps://PATH_TO_YOUR_LDAP_SERVER
BASE your_base
pam_password exop
ssl on
port 636
tls_cacertfile /etc/openldap/cacerts/cacert.pem
Make sure /etc/openldap/ldap.conf has (among other things):
URI ldaps://PATH_TO_YOUR_LDAP_SERVER
BASE your_base
TLS_CACERT /etc/openldap/cacerts/cacert.pem
Edit /etc/nsswitch.conf to include ldap (on passwd, group and shadow)
Edit at least /etc/pam.d/system-auth and add in appropriate places:
auth     sufficient  pam_ldap.so use_first_pass
account  [default=bad success=ok user_unknown=ignore]  pam_ldap.so
password sufficient  pam_ldap.so use_authtok
session  optional    pam_ldap.so
Restart nscd
CentOS 6¶
Install openldap-clients, sssd, pam_ldap and nss-pam-ldapd
Make sure sssd is running
On /etc/nsswitch.conf use sss instead of ldap
Here is an example of a sssd.conf file:
[sssd]
domains = LDAP
services = nss
config_file_version = 2

[nss]
filter_groups = root
filter_users = root

[domain/LDAP]
enumerate = true
cache_credentials = TRUE
id_provider = ldap
auth_provider = ldap
ldap_schema = rfc2307
chpass_provider = ldap
ldap_uri = YOUR_SERVER
ldap_search_base = YOUR_BASE
Copy your CA certificate to /etc/openldap/cacerts
Make sure /etc/pam_ldap.conf has (among other things):
URI ldaps://PATH_TO_YOUR_LDAP_SERVER
BASE your_base
pam_password exop
ssl on
port 636
tls_cacertfile /etc/openldap/cacerts/cacert.pem
Make sure /etc/openldap/ldap.conf has (among other things):
URI ldaps://PATH_TO_YOUR_LDAP_SERVER
BASE your_base
TLS_CACERT /etc/openldap/cacerts/cacert.pem
Edit at least /etc/pam.d/system-auth and add in appropriate places:
auth     sufficient  pam_ldap.so use_first_pass
account  [default=bad success=ok user_unknown=ignore]  pam_ldap.so
password sufficient  pam_ldap.so use_authtok
session  optional    pam_ldap.so
CentOS 7¶
needs review
Follow the instructions for CentOS 5, with these caveats:
- On CentOS 7, install nss_ldap and nss-pam-ldapd
- Configure the LDAP server in /etc/nslcd.conf (instead of /etc/ldap.conf)
- Make sure nslcd is started
- Make sure /etc/pam.d/system-auth is the only file of interest (e.g. password-auth)
- authconfig --enableldap --enableldapauth --ldapserver=ldap://ldap.YOUR-DOMAIN:389/ --ldapbasedn="BASE-DN" --enablecache --disablefingerprint --kickstart