Introduction¶
What’s BioThings?¶
We use “BioThings” to refer to objects of any biomedical entity-type represented in the biological knowledge space, such as genes, genetic variants, drugs, chemicals, diseases, etc.
BioThings SDK¶
SDK stands for “Software Development Kit”. BioThings SDK provides a Python-based toolkit to build high-performance data APIs (or web services) from a single data source or multiple data sources. It has a particular focus on building data APIs for biomedical-related entities, a.k.a. “BioThings”, though it’s not necessarily limited to the biomedical scope. For any given “BioThings” type, BioThings SDK helps developers aggregate annotations from multiple data sources and expose them as a clean, high-performance web API.
The BioThings SDK can be roughly divided into two main components: the data hub (or just “hub”) component and the web component. The hub component allows developers to automate the process of monitoring, parsing and uploading data sources to an Elasticsearch backend. From there, the web component, built on the high-concurrency Tornado Web Server, allows you to easily set up a live high-performance API. The API endpoints expose simple-to-use yet powerful query features backed by Elasticsearch’s full-text query capabilities and query language.
BioThings API¶
We also use “BioThings API” (or BioThings APIs) to refer to an API (or a collection of APIs) built with BioThings SDK. For example, both our popular MyGene.Info and MyVariant.Info APIs are built and maintained using this BioThings SDK.
BioThings Studio¶
BioThings Studio is a pre-built, pre-configured environment used to build and administer BioThings APIs. At its core is the Hub, a backend service responsible for keeping data up to date, producing data releases and updating API frontends.
Installation¶
You can install the latest stable BioThings SDK release with pip from PyPI, like:
pip install biothings
You can install the latest development version of BioThings SDK directly from our github repository like:
pip install git+https://github.com/biothings/biothings.api.git#egg=biothings
Alternatively, you can download the source code, or clone the BioThings SDK repository and run:
python setup.py install
Quick Start¶
We recommend following this tutorial to develop your first BioThings API in our pre-configured BioThings Studio development environment.
BioThings Studio¶
BioThings Studio is a pre-configured environment used to build and administer BioThings API. At its core is the Hub, a backend service responsible for maintaining data up-to-date, producing data releases, and updating API frontends.
A. Tutorial¶
This tutorial will guide you through BioThings Studio by showing, in a first part, how to convert a simple flat file to a fully operational BioThings API. In a second part, this API will be enriched with more data.
Note
You may also want to read the developer’s guide for more detailed information.
Note
The following tutorial is only valid for BioThings Studio release 0.2b. Check all available releases for more.
1. What you’ll learn¶
Through this guide, you’ll learn:
how to obtain a Docker image to run your favorite API
how to run that image inside a Docker container and how to access the BioThings Studio application
how to integrate a new data source by defining a data plugin
how to define a build configuration and create data releases
how to create a simple, fully operational BioThings API serving the integrated data
how to use multiple datasources and understand how data merging is done
2. Prerequisites¶
Using BioThings Studio requires a Docker server up and running, and some basic knowledge of how to run and use containers. Images have been tested on Docker >=17. On AWS, you can use our public AMI biothings_demo_docker (ami-44865e3c in the Oregon region) with Docker pre-configured and ready for studio deployment. The instance type depends on the size of the data you want to integrate and on parser performance. For this tutorial, we recommend an instance type with at least 4 GiB of RAM, such as t2.medium. The AMI comes with an extra 30 GiB EBS volume, which is more than enough for the scope of this tutorial.
Alternatively, you can install your own Docker server (on recent Ubuntu systems, sudo apt-get install docker.io is usually enough). You may need to point the Docker images directory to a specific hard drive to get enough space, using the -g option:
# /mnt/docker points to a hard drive with enough disk space
sudo echo 'DOCKER_OPTS="-g /mnt/docker"' >> /etc/default/docker
# restart to make this change active
sudo service docker restart
3. Installation¶
BioThings Studio is available as a Docker image that you can pull from our BioThings Docker Hub repository:
$ docker pull biothings/biothings-studio:0.2b
A BioThings Studio instance exposes several services on different ports:
8080: BioThings Studio web application port
7022: BioThings Hub SSH port
7080: BioThings Hub REST API port
7081: BioThings Hub REST API port, read-only access
9200: ElasticSearch port
27017: MongoDB port
8000: BioThings API, once created; it can be any non-privileged (>1024) port
9000: Cerebro, a webapp used to easily interact with ElasticSearch clusters
60080: Code-Server, a webapp used to directly edit code in the container
We will map and expose those ports to the host server using option -p
so we can access BioThings services without
having to enter the container:
$ docker run --rm --name studio -p 8080:8080 -p 7022:7022 -p 7080:7080 -p 7081:7081 -p 9200:9200 \
-p 27017:27017 -p 8000:8000 -p 9000:9000 -p 60080:60080 -d biothings/biothings-studio:0.2b
Note
we need to add the release number after the image name: biothings-studio:0.2b. Should you use another release (including unstable releases, tagged as master), you would need to adjust this parameter accordingly.
Note
BioThings Studio and the Hub are not designed to be publicly accessible, so those ports should not be exposed. Instead, SSH tunneling can be used to safely access the services from outside.
Ex: ssh -L 7080:localhost:7080 -L 8080:localhost:8080 -L 7022:localhost:7022 -L 9000:localhost:9000 user@mydockerserver
will expose the Hub REST API, the web application,
the Hub SSH, and Cerebro app ports to your computer, so you can access the webapp using http://localhost:8080, the Hub REST API using http://localhost:7080,
http://localhost:9000 for Cerebro, and directly type ssh -p 7022 biothings@localhost
to access Hub’s internals via the console.
See https://www.howtogeek.com/168145/how-to-use-ssh-tunneling for more details.
We can follow the starting sequence using the docker logs command:
$ docker logs -f studio
Waiting for mongo
tcp 0 0 127.0.0.1:27017 0.0.0.0:* LISTEN -
* Starting Elasticsearch Server
...
Waiting for cerebro
...
now run webapp
not interactive
Please refer to Filesystem overview and Services check for more details about Studio’s internals.
By default, the studio will auto-update its source code to the latest version available and install all required dependencies. This behavior can be skipped by adding no-update at the end of the docker run ... command line.
We can now access BioThings Studio using the dedicated web application (see webapp overview).
4. Getting started with a data plugin¶
In this section we’ll dive into more detail about using BioThings Studio and the Hub. We will integrate a simple flat file as a new datasource within the Hub, declare a build configuration using that datasource, create a build from that configuration, then a data release, and finally instantiate a new API service and use it to query our data.
The whole source code is available at https://github.com/sirloon/pharmgkb, each branch pointing to a specific step in this tutorial.
4.1. Input data¶
For this tutorial, we will use several input files provided by PharmGKB, freely available in their download section, under “Annotation data”:
annotations.zip: contains a file var_drug_ann.tsv about variant-gene-drug annotations. We’ll use this file for the first part of this tutorial.
drugLabels.zip: contains a file drugLabels.byGene.tsv describing, per gene, which drugs have an impact on it.
occurrences.zip: contains a file occurrences.tsv listing the literature per entity type (we’ll focus on the gene type only)
The last two files will be used in the second part of this tutorial when we’ll add more datasources to our API.
4.2. Parser¶
In order to ingest this data and make it available as an API, we first need to write a parser. The data consists of pretty simple, tab-separated files, and we’ll make parsing even simpler by using the pandas python library. The first version of this parser is available in branch pharmgkb_v1 at https://github.com/sirloon/pharmgkb/blob/pharmgkb_v1/parser.py. After some boilerplate code at the beginning for dependencies and initialization, the main logic is the following:
def load_annotations(data_folder):
    # `dat` holds the records read from the TSV file (see below)
    results = {}
    for rec in dat:
        if not rec["Gene"] or pandas.isna(rec["Gene"]):
            logging.warning("No gene information for annotation ID '%s'", rec["Annotation ID"])
            continue
        _id = re.match(".* \((.*?)\)", rec["Gene"]).groups()[0]
        # We'll remove spaces in keys to make queries easier. Also, lowercase is preferred
        # for a BioThings API. We'll use a helper function `dict_convert()` from BioThings SDK
        process_key = lambda k: k.replace(" ", "_").lower()
        rec = dict_convert(rec, keyfn=process_key)
        results.setdefault(_id, []).append(rec)
    for _id, docs in results.items():
        doc = {"_id": _id, "annotations": docs}
        yield doc
Our parsing function is named load_annotations; it could be named anything else, but it has to take a folder path data_folder containing the downloaded data. This path is automatically set by the Hub and points to the latest version available. More on this later.
infile = os.path.join(data_folder,"var_drug_ann.tsv")
assert os.path.exists(infile)
It is the responsibility of the parser to select, within that folder, the file(s) of interest. Here we need data from a file named var_drug_ann.tsv. Following the motto “don’t assume it, prove it”, we make sure that file exists.
dat = pandas.read_csv(infile,sep="\t",squeeze=True,quoting=csv.QUOTE_NONE).to_dict(orient='records')
results = {}
for rec in dat:
...
We then open and read the TSV file using the pandas.read_csv() function. At this point, a record rec looks like the following:
{'Alleles': 'A',
'Annotation ID': 608431768,
'Chemical': 'warfarin (PA451906)',
'Chromosome': 'chr1',
'Gene': 'EPHX1 (PA27829)',
'Notes': nan,
'PMID': 19794411,
'Phenotype Category': 'dosage',
'Sentence': 'Allele A is associated with decreased dose of warfarin.',
'Significance': 'yes',
'StudyParameters': '608431770',
'Variant': 'rs1131873'}
Keys are mixed-case; for a BioThings API, we like to have them lowercase. More importantly, we want to remove the spaces in those keys, as querying the API would otherwise be hard. We’ll use a special helper function from BioThings SDK to process these.
process_key = lambda k: k.replace(" ","_").lower()
rec = dict_convert(rec,keyfn=process_key)
Finally, because there could be more than one record per gene (ie. more than one annotation per gene), we need to store those records as a list, in a dictionary indexed by gene ID. The final documents are assembled in the last loop.
...
results.setdefault(_id,[]).append(rec)
for _id,docs in results.items():
doc = {"_id": _id, "annotations" : docs}
yield doc
Note
The _id key is mandatory and represents a unique identifier for this document. It must be a string. The _id key is used when data from multiple datasources are merged together; that process is done according to its value (all documents sharing the same _id across datasources are merged together). Due to indexing limitations, the _id value should be kept under 512 characters.
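For illustration, a defensive parser might normalize and check the _id before yielding a document. The helper below is hypothetical (it is not part of BioThings SDK); it just makes these constraints explicit:

def make_id(raw_id):
    """Hypothetical helper: normalize a raw identifier into a valid _id."""
    _id = str(raw_id).strip()                 # _id must be a string
    assert _id, "empty _id"
    assert len(_id) <= 512, "_id too long for indexing: %s" % _id
    return _id

# e.g. inside the parser: doc = {"_id": make_id(_id), "annotations": docs}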
Note
In this specific example, we read the whole content of the input file in memory, then store annotations per gene. The data itself is small enough to allow this, but memory usage always needs to be considered carefully when writing a parser.
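If the input file were too big to fit in memory, one option would be to read it in chunks instead of loading everything at once. Below is a rough sketch, not part of the tutorial code; note that grouping records per gene would then have to be done incrementally across chunks:

import csv
import pandas

def iter_records(infile, chunksize=10000):
    """Yield parsed records chunk by chunk to keep memory usage bounded."""
    reader = pandas.read_csv(infile, sep="\t", quoting=csv.QUOTE_NONE, chunksize=chunksize)
    for chunk in reader:
        for rec in chunk.to_dict(orient="records"):
            yield rec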
4.3. Data plugin¶
The parser is ready; it’s now time to glue everything together and build our API. We can easily create a new datasource and integrate data using BioThings Studio, by declaring a data plugin. Such a plugin is defined by:
a folder containing a manifest.json file, where the parser and the input file location are declared
all necessary files supporting the declarations in the manifest, such as a python file containing the parsing function for instance.
This folder must be located in the plugins directory (by default /data/biothings_studio/plugins), where the Hub monitors changes and reloads itself accordingly to register data plugins. Another way to declare such a plugin is to register a github repository that contains everything useful for the datasource. This is what we’ll do in the following section.
Note
Whether the plugin comes from a github repository or is found directly in the plugins directory doesn’t really matter. In the end, the code will be found in that same plugins directory, whether it comes from a git clone command while registering the github URL or from folder(s) and file(s) manually created in that location. However, when developing a plugin, it’s easier to work directly on local files first, so we don’t have to regularly update the plugin code (git pull) from the webapp to fetch the latest code. That said, since the plugin is already defined on github in our case, we’ll use the github repo registration method.
The corresponding data plugin repository can be found at https://github.com/sirloon/pharmgkb/tree/pharmgkb_v1. The manifest file looks like this:
{
"version": "0.2",
"requires" : ["pandas"],
"dumper" : {
"data_url" : ["https://s3.pgkb.org/data/annotations.zip",
"https://s3.pgkb.org/data/drugLabels.zip",
"https://s3.pgkb.org/data/occurrences.zip"],
"uncompress" : true
},
"uploader" : {
"parser" : "parser:load_annotations",
"on_duplicates" : "error"
}
}
version specifies the manifest version (it’s not the version of the datasource itself) and tells the Hub what to expect from the manifest.
the parser uses the pandas library, so we declare that dependency in the requires section.
the dumper section declares where the input files are, using the data_url key. In the end we’ll use 3 different files, so a list of URLs is specified there. A single string is also allowed if only one file (ie. one URL) is required. Since the input files are ZIP archives, we first need to uncompress them, using uncompress : true.
the uploader section tells the Hub how to upload JSON documents to MongoDB. parser has a special format, module_name:function_name. Here, the parsing function is named load_annotations and can be found in the parser.py module. on_duplicates : error tells the Hub to raise an error if we have documents with the same _id (it would mean we have a bug in our parser).
For more information about the other fields, please refer to the plugin specification.
Let’s register that data plugin using the Studio. First, copy the repository URL:

Now go to the Studio web application at http://localhost:8080, click on the sources tab, then the menu icon: this will open a side bar on the left. Click on New data plugin; you will be asked to enter the github URL. Click “OK” to register the data plugin.

Interpreting the manifest coming with the plugin, BioThings Hub has automatically created for us:
a dumper using the HTTP protocol, pointing to the remote files on the PharmGKB server. When downloading (or dumping) the data source, the dumper will automatically check whether the remote files are more recent than the ones we may have locally, and decide whether a new version should be downloaded.
and an uploader to which it “attached” the parsing function. This uploader will fetch JSON documents from the parser and store them in MongoDB.
At this point, the Hub has detected a change in the datasource code, as the new data plugin source code has been pulled from github locally inside the container. In order to take this new plugin into account, the Hub needs to restart to load the code. The webapp should detect that reload and should ask whether we want to reconnect, which we’ll do!

The Hub shows an error though:

Indeed, we fetched the source code from branch master, which doesn’t contain any manifest file. We need to switch to another branch (this tutorial is organized using branches, and it’s also a perfect opportunity to learn how to use a specific branch/commit with BioThings Studio…)
Let’s click on the pharmgkb link. In the textbox on the right, enter pharmgkb_v1, then click on Update.

BioThings Studio will fetch the corresponding branch (we could also have specified a commit hash, for instance), source code changes will be detected, and the Hub will restart. The new code version is now visible in the plugin tab.

If we go back to the sources list, PharmGKB now appears fully functional, with different actions available:

one action triggers the dumper and (if necessary) downloads the remote data
another triggers the uploader (note it’s automatically triggered if a new version of the data is available)
Let’s open the datasource by clicking on its title to see more information. The Dumper and Uploader tabs are rather empty since none of these steps have been launched yet. Without further waiting, let’s trigger a dump to integrate this new datasource: either go to the Dump tab and trigger it from there, or go back to the sources list and trigger it from the bottom of the datasource entry.
The dumper is triggered, and after a few seconds, the uploader is automatically triggered as well. Commands can be listed by clicking at the top of the page. So far we’ve run 3 commands, to register the plugin, dump the data and upload the JSON documents to MongoDB. All succeeded.

We also have new notifications as shown by the red number on the right. Let’s have a quick look:

Going back to the source’s details, we can see the Dumper tab has been populated. We now know the release number, the data folder, when the last download happened, how long it took to download the file, etc…

Same for the Uploader tab: we now have 979 documents uploaded to MongoDB.

4.4. Inspection and mapping¶
Now that we have integrated a new datasource, we can move forward. Ultimately, data will be sent to ElasticSearch, an indexing engine. In order to do so, we need to tell ElasticSearch how the data is structured and which fields should be indexed (and which should not). This step consists of creating a “mapping”, describing the data in ElasticSearch terminology. This can be a tedious process as we would need to dig into some tough technical details and manually write this mapping. Fortunately, we can ask BioThings Studio to inspect the data and suggest a mapping for it.
In order to do so, click on the Mapping tab, then launch an inspection.
We can inspect the data for different purposes:
Mode
type: inspection will report any type found in the collection, giving detailed information about the structure of documents coming from the parser. Note results aren’t available from the webapp, only in MongoDB.
stats: same as type but gives numbers (counts) for each structure and type found. As previously, results aren’t available in the webapp yet.
mapping: inspects the data types and suggests an ElasticSearch mapping. It will report any error or type incompatible with ES.
Here we’ll stick to the mapping mode to generate that mapping. There are other options controlling which data gets inspected:
Limit: limit the number of inspected documents.
Sample: randomize the documents to inspect (1.0 = consider all documents, 0.0 = skip all documents, 0.5 = consider every other document)
The last two options can be used to reduce the inspection time of huge data collections, or when you’re absolutely sure the same structure is returned for every document output by the parser.

Since the collection is very small, inspection is fast. But… it seems like we have a problem

More than one type was found for a field named notes. Indeed, if we scroll down on the pre-mapping structure, we can see the culprit:

This result means documents sometimes have a notes key equal to NaN, and sometimes equal to a string (a splittable string, meaning there are spaces in it). This is a problem for ElasticSearch because it wouldn’t index the data properly, and furthermore ElasticSearch doesn’t allow NaN values anyway. So we need to fix the parser. The fixed version is available in branch pharmgkb_v2 (go back to the Plugin tab, enter that branch name and update the code). The fix consists in removing a key/value pair from a record whenever the value is equal to NaN:
rec = dict_sweep(rec,vals=[np.nan])
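Condensed, the record-cleaning part of the fixed parser looks roughly like this (assuming, as in the plugin code, that dict_convert and dict_sweep are imported from biothings.utils.dataload):

import numpy as np
from biothings.utils.dataload import dict_convert, dict_sweep

def clean_record(rec):
    """Lowercase keys, replace spaces, and drop NaN values (sketch of the v2 fix)."""
    process_key = lambda k: k.replace(" ", "_").lower()
    rec = dict_convert(rec, keyfn=process_key)
    # remove any key whose value is NaN, since ElasticSearch rejects NaN values
    rec = dict_sweep(rec, vals=[np.nan])
    return rec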
Once fixed, we need to re-upload the data and inspect it again. This time, no error: our mapping is valid:

For each highlighted field, we can decide whether we want the field to be searchable or not, and whether the field should be searched by default when querying the API. We can also change the type for that field, or even switch to “advanced mode” and specify your own set of indexing rules. Let’s click on “gene” field and make it searched by default. Let’s also do the same for field “variant”.

Indeed, by checking the “Search by default” checkbox, we will be able to search for gene symbol “ABL1”, for instance, with /query?q=ABL1 instead of /query?q=annotations.gene:ABL1. Same for the “variant” field, where we can specify an rsid.
After this modification, you should see a change indicator at the top of the mapping; let’s save our changes. Also, before moving forward, we want to make sure the mapping is valid, so let’s validate it on the “localhub” environment. You should see this success message:

Note
“Validate on localhub” means the Hub will send the mapping to ElasticSearch, creating a temporary, empty index to make sure the mapping syntax and content are valid. The index is immediately deleted after validation (whether successful or not). Also, “localhub” is the default name of an environment; without further manual configuration, this is the only development environment available in the Studio, pointing to the embedded ElasticSearch server.
Everything looks fine; the last step is to “commit” the mapping, meaning we’re ok to use it as the official, registered mapping that will actually be used by ElasticSearch. Indeed, the left side of the page shows the inspected mapping: we can re-launch the inspection as many times as we want without impacting the active/registered mapping (this is useful when the data structure changes). Commit the mapping, click “OK”, and you should now see the final, registered mapping on the right:

4.5. Build¶
Once we have integrated data and a valid ElasticSearch mapping, we can move forward by creating a build configuration. A build configuration tells the Hub which datasources should be merged together, and how. From the builds section, create a new build configuration.

enter a name for this configuration. We’re going to have only one configuration created through this tutorial so it doesn’t matter, let’s make it “default”
the document type represents the kind of documents stored in the merged collection. It gives its name to the annotate API endpoint (eg. /gene). This source is about gene annotations, so “gene” it is…
open the dropdown list and select the sources you want to be part of the merge. We only have one, “pharmgkb”
in root sources, we can declare which sources are allowed to create new documents in the merged collection; documents from the other sources are then only merged if a corresponding document already exists in the merged collection. It’s useful if data from a specific source relates to data from another source (it only makes sense to merge that relating data if the data itself is present). If root sources are declared, the Hub will first merge them, then the others. In our case, we can leave it empty (no root sources specified, all sources can create documents in the merged collection)
selecting a builder is optional, but for the sake of this tutorial, we’ll choose LinkDataBuilder. This special builder will fetch documents directly from our datasource pharmgkb when indexing documents, instead of duplicating documents into another collection (called the target or merged collection). We can do this (and save time and disk space) because we only have one datasource here.
the other fields are for advanced usage and are out of scope for this tutorial
Click “OK” and open the menu again; you should see the new configuration available in the list.

Click on it and create a new build.

You can give a specific name for that build, or let the Hub generate one for you. Click “OK”; after a few seconds, you should see the new build displayed on the page.

Open it by clicking on its name. You can explore the tabs for more information about it (sources involved, build times, etc…). The “Release” tab is the one we’re going to use next.
4.6. Data release¶
If not there yet, open the newly created build and go to the “Release” tab. This is the place where we can create new data releases. Let’s create one.

Since we only have one build available, we can’t generate an incremental release, so we’ll have to select full this time. Click “OK” to launch the process.
Note
Should there be a new build available (coming from the same configuration), and should there be data differences, we could generate an incremental release. In this case, the Hub would compute a diff between the previous and new builds and generate diff files (using the JSON diff format). Incremental releases are usually smaller than full releases and usually take less time to deploy (applying diff data), unless the diff content is too big (there’s a threshold between using an incremental and a full release, depending on the hardware and the data, because applying a diff requires you to first fetch the document from ElasticSearch, patch it, and then save it back).
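Conceptually, synchronizing an incremental release boils down to the fetch/patch/save loop described above. The sketch below is an illustration only, not the Hub’s actual syncer code, and it uses the standard jsonpatch library as a stand-in for the Hub’s own diff format:

from elasticsearch import Elasticsearch
import jsonpatch  # stand-in for the Hub's internal JSON-diff format

es = Elasticsearch("http://localhost:9200")

def sync_one(index, doc_id, patch_ops):
    """Fetch a document, apply the diff operations, and save it back (illustration only)."""
    doc = es.get(index=index, id=doc_id)["_source"]
    patched = jsonpatch.apply_patch(doc, patch_ops)
    es.index(index=index, id=doc_id, body=patched)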
The Hub will directly index the data on its locally installed ElasticSearch server (the localhub environment). After a few seconds, a new full release is created.

We can easily access the ElasticSearch server using the Cerebro application, which comes pre-configured with the studio. Let’s access it through http://localhost:9000/#/connect (assuming ports 9200 and 9000 have been properly mapped, as mentioned earlier). Cerebro provides an easy way to manage ElasticSearch and check/query indices.
Click on the pre-configured server named BioThings Studio.

Clicking on an index gives access to different information, such as the mapping, which also contains metadata (sources involved in the build, releases, counts, etc…)

4.7. API creation¶
At this stage, a new index containing our data has been created in ElasticSearch; it is now time for the final step: creating the API. From the API section, create a new API. We’ll name it pharmgkb and have it running on port 8000.
Note
Spaces are not allowed in API names

Once the form is validated, a new API is listed.

To turn on this API instance, just start it; you should then see a label in the top right corner indicating the API is running and can be accessed:

Note
When running, queries such as /metadata and /query?q=* are provided as examples. They contain a hostname set by Docker though (the Docker instance’s hostname), which probably means nothing outside of Docker’s context. In order to use the API you may need to replace this hostname with the one actually used to access the Docker instance.
4.8. Tests¶
Assuming the API is accessible through http://localhost:8000, we can easily query it with curl, for instance. The /metadata endpoint gives information about the datasources and build date:
$ curl localhost:8000/metadata
{
"biothing_type": "gene",
"build_date": "2020-01-16T18:36:13.450254",
"build_version": "20200116",
"src": {
"pharmgkb": {
"stats": {
"pharmgkb": 979
},
"version": "2020-01-05"
}
},
"stats": {
"total": 979
}
}
Let’s query the data using a gene name (results truncated):
$ curl localhost:8000/query?q=ABL1
{
"max_score": 7.544187,
"took": 70,
"total": 1,
"hits": [
{
"_id": "PA24413",
"_score": 7.544187,
"annotations": [
{
"alleles": "T",
"annotation_id": 1447814556,
"chemical": "homoharringtonine (PA166114929)",
"chromosome": "chr9",
"gene": "ABL1 (PA24413)",
"notes": "Patient received received omacetaxine, treatment had been stopped after two cycles because of clinical intolerance, but a major molecular response and total disappearance of the T315I clone was obtained. Treatment with dasatinib was then started and after 34-month follow-up the patient is still in major molecular response.",
"phenotype_category": "efficacy",
"pmid": 25950190,
"sentence": "Allele T is associated with response to homoharringtonine in people with Leukemia, Myelogenous, Chronic, BCR-ABL Positive as compared to allele C.",
"significance": "no",
"studyparameters": "1447814558",
"variant": "rs121913459"
},
{
"alleles": "T",
"annotation_id": 1447814549,
"chemical": "nilotinib (PA165958345)",
"chromosome": "chr9",
"gene": "ABL1 (PA24413)",
"phenotype_category": "efficacy",
"pmid": 25950190,
"sentence": "Allele T is associated with resistance to nilotinib in people with Leukemia, Myelogenous, Chronic, BCR-ABL Positive as compared to allele C.",
"significance": "no",
"studyparameters": "1447814555",
"variant": "rs121913459"
}
]
}
]
}
Note
We don’t have to specify annotations.gene
, the field in which the value “ABL1” should be searched, because we explicitely asked ElasticSearch
to search that field by default (see fieldbydefault)
Finally, we can fetch a gene by its PharmGKB ID:
$ curl "localhost:8000/gene/PA134964409"
{
"_id": "PA134964409",
"_version": 1,
"annotations": [
{
"alleles": "AG + GG",
"annotation_id": 1448631680,
"chemical": "etanercept (PA449515)",
"chromosome": "chr1",
"gene": "GBP6 (PA134964409)",
"phenotype_category": "efficacy",
"pmid": 28470127,
"sentence": "Genotypes AG + GG is associated with increased response to etanercept in people with Psoriasis as compared to genotype AA.",
"significance": "yes",
"studyparameters": "1448631688",
"variant": "rs928655"
}
]
}
4.9. Conclusions¶
We’ve been able to easily convert a remote flat file to a fully operational BioThings API:
by defining a data plugin, we told the BioThings Hub where the remote data was and what the parser function was
BioThings Hub then generated a dumper to download data locally on the server
It also generated an uploader to run the parser and store resulting JSON documents
We defined a build configuration to include the newly integrated datasource and then triggered a new build
Data was indexed internally on local ElasticSearch by creating a full release
Then we created a BioThings API instance pointing to that new index
The next step is to enrich that existing API with more datasources.
4.10. Multiple sources data plugin¶
In the previous part, we generated an API from a single flat file. This API serves data about gene annotations, but we need more: as mentioned earlier in Input data, we also downloaded drug label and publication information. By integrating those as-yet-unused files, we’ll be able to enrich our API even more; that’s the goal of this part.
In our case, we have one dumper responsible for downloading three different files, and we now need three different uploaders in order to process these files. With the data plugin above (section 4.3), only one file is parsed. In order to proceed further, we need to specify multiple uploaders in the manifest.json file; the full example can be found in branch pharmgkb_v5, available at https://github.com/remoteeng00/pharmgkb/tree/pharmgkb_v5.
Note
You can learn more about data plugins in section B.4. Data plugin architecture and specifications
5. Regular data source¶
5.1. Data plugin limitations¶
The data plugin architecture provided by BioThings Studio allows you to quickly integrate a new datasource, describing where the data is located and how it should be parsed. It provides a simple and generic way to do so, but also comes with some limitations. Indeed, in many advanced use cases, you need to use a custom data builder instead of the LinkDataBuilder used in section 4.5, but you cannot define a custom builder in a data plugin.
Luckily, BioThings Studio provides an easy way to export the python code that has been generated during data plugin registration. Indeed, code is generated from the manifest file, compiled and injected into the Hub’s memory. Exporting the code consists in writing down that dynamically generated code. After a successful export, a new folder lives in hub/dataload/sources and contains the exported python files: that is a regular data source (or a regular dumper/uploader based data source). Following the steps below, you will learn how to deal with a regular data source.
5.2. Code export¶
Note
You MUST update the pharmgkb data plugin above to version pharmgkb_v2.
Let’s go back to our datasource’s Plugin tab. Clicking on the export action brings up the following form:

We have different options regarding the parts we can export:
Dumper: exports the code responsible for downloading the data, according to the URLs defined in the manifest.
Uploader: exports the code responsible for data integration, using our parser code.
Mapping: any mapping generated from inspection and registered (committed) can also be exported. It’ll be part of the uploader.
We’ll export all these parts; let’s validate the form. Export results are displayed (though only briefly, as the Hub will detect the changes in the code and will want to restart).

We can see the relative paths where the code was exported. A message about ACTIVE_DATASOURCES is also displayed, explaining how to activate our newly exported datasource. That said, BioThings Studio by default monitors specific locations for code changes, including where code is exported, so we don’t need to manually activate it. That’s also the reason why the Hub has been restarted.
Once reconnected, if we go back to the sources list, we’ll see an error!

Our original data plugin can’t be registered (ie. activated) because another datasource with the same name is already registered: that’s our new exported datasource! When the Hub starts, it first loads datasources which have been manually coded (or exported), and then data plugins. Both our plugin and the exported code are active, but the Hub can’t know which one to use.
Let’s delete the plugin and confirm the deletion.
The Hub will restart again (reload the page if it doesn’t) and this time, our datasource is active. If we click on pharmgkb, we’ll see the same details as before, except that the Plugin tab has disappeared. So far, our exported code runs, and we’re in the exact same state as before; the Hub even kept our previously dumped/uploaded data.
Let’s explore the source code that has been generated throughout this process. Let’s enter our docker container and become user biothings (from which everything runs):
$ docker exec -ti studio /bin/bash
$ sudo su - biothings
The paths provided as export results (hub/dataload/sources/*) are relative to the folder the Hub was started from, named biothings_studio. Let’s move there:
$ cd biothings_studio/hub/dataload/sources/
$ ls -la
total 0
-rw-rw-r-- 1 biothings biothings 0 Jan 15 23:41 __init__.py
drwxrwxr-x 2 biothings biothings 45 Jan 15 23:41 __pycache__
drwxr-xr-x 1 biothings biothings 75 Jan 15 23:41 ..
drwxr-xr-x 1 biothings biothings 76 Jan 22 19:32 .
drwxrwxr-x 3 biothings biothings 154 Jan 22 19:32 pharmgkb
A pharmgkb folder can be found there and contains the exported code:
$ cd pharmgkb
$ ls -la
total 32
drwxrwxr-x 3 biothings biothings 154 Jan 22 19:32 .
drwxr-xr-x 1 biothings biothings 76 Jan 22 19:32 ..
-rw-rw-r-- 1 biothings biothings 11357 Jan 22 19:32 LICENSE
-rw-rw-r-- 1 biothings biothings 225 Jan 22 19:32 README
-rw-rw-r-- 1 biothings biothings 70 Jan 22 19:32 __init__.py
drwxrwxr-x 2 biothings biothings 142 Jan 22 19:45 __pycache__
-rw-rw-r-- 1 biothings biothings 868 Jan 22 19:32 dump.py
-rw-rw-r-- 1 biothings biothings 1190 Jan 22 19:32 parser.py
-rw-rw-r-- 1 biothings biothings 2334 Jan 22 19:32 upload.py
Some files were copied from the data plugin repository (LICENSE, README and parser.py); the others are the exported ones: dump.py for the dumper, upload.py for the uploader and the mappings, and __init__.py so the Hub can find these components upon start. We’ll go into further detail later, especially when we add more uploaders.
For convenience, the exported code can be found in branch pharmgkb_v3, available at https://github.com/sirloon/pharmgkb/tree/pharmgkb_v3. One easy way to follow this tutorial without having to type too much is to replace the pharmgkb folder with a clone of the Git repository. The checked-out code is exactly the same as the code after export.
$ cd ~/biothings_studio/hub/dataload/sources/
$ rm -fr pharmgkb
$ git clone https://github.com/sirloon/pharmgkb.git
$ cd pharmgkb
$ git checkout pharmgkb_v3
5.3. More uploaders¶
Now that we have exported the code, we can start the modifications. The final code can be found in branch pharmgkb_v4 at https://github.com/sirloon/pharmgkb/tree/pharmgkb_v4.
Note
We can directly point to that branch using git checkout pharmgkb_v4 within the datasource folder previously explored.
First we’ll write two more parsers, one for each additional file. Within parser.py:
load_annotations, at the beginning, is the first parser we wrote, no changes required
load_druglabels is responsible for parsing the file named drugLabels.byGene.tsv
load_occurrences parses the file occurrences.tsv
Writing parsers is not the main purpose of this tutorial, which focuses more on how to use BioThings Studio, so we won’t go into further details.
Next is defining the new uploaders. In upload.py, we currently have one uploader definition, which looks like this:
class PharmgkbUploader(biothings.hub.dataload.uploader.BaseSourceUploader):

    name = "pharmgkb"
    __metadata__ = {"src_meta": {}}
    idconverter = None
    storage_class = biothings.hub.dataload.storage.BasicStorage
    ...
The important piece of information here is name, which gives the name of the uploader we define. Currently the uploader is named pharmgkb; that’s how this name is displayed in the “Upload” tab of the datasource. We know we need three uploaders in the end, so we need to adjust the names. In order to do so, we’ll define a main source, pharmgkb, and three different “sub” sources: annotations, druglabels and occurrences. For clarity, we’ll put these uploaders in three different files.
As a result, we now have:
the file upload_annotations.py, originally coming from the code export. The class definition is:
class AnnotationsUploader(biothings.hub.dataload.uploader.BaseSourceUploader):

    main_source = "pharmgkb"
    name = "annotations"
Note
We renamed the class itself, and pharmgkb is now set as the main_source field. This name matches the dumper name as well, which is how the Hub knows how dumpers and uploaders relate to each other. Finally, the sub-source named annotations is set as the name field.
we do the same for upload_druglabels.py:
from .parser import load_druglabels

class DrugLabelsUploader(biothings.hub.dataload.uploader.BaseSourceUploader):

    main_source = "pharmgkb"
    name = "druglabels"
    storage_class = biothings.hub.dataload.storage.BasicStorage

    def load_data(self, data_folder):
        self.logger.info("Load data from directory: '%s'" % data_folder)
        return load_druglabels(data_folder)

    @classmethod
    def get_mapping(klass):
        return {}
Note
In addition to adjusting the names, we need to import our dedicated parser, load_druglabels
. Following what the Hub did during code export, we “connect” that parser to this
uploader in method load_data
. Finally, each uploader needs to implement class method get_mapping
, currently an empty dictionary, that is, no mapping at all. We’ll fix this soon.
finally, upload_occurrences.py deals with the occurrences uploader. The code is similar to the previous one:
from .parser import load_occurrences

class OccurrencesUploader(biothings.hub.dataload.uploader.BaseSourceUploader):

    main_source = "pharmgkb"
    name = "occurrences"
    storage_class = biothings.hub.dataload.storage.BasicStorage

    def load_data(self, data_folder):
        self.logger.info("Load data from directory: '%s'" % data_folder)
        return load_occurrences(data_folder)

    @classmethod
    def get_mapping(klass):
        return {}
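For illustration, once a mapping has been generated and registered (next section), a non-empty get_mapping for druglabels could look roughly like the following. The field names below match the drug_labels output shown later in this tutorial, but treat the exact types as an example rather than the mapping the Hub will generate:

import biothings.hub.dataload.uploader

class DrugLabelsUploader(biothings.hub.dataload.uploader.BaseSourceUploader):

    # ... main_source, name, storage_class and load_data as defined above ...

    @classmethod
    def get_mapping(klass):
        # example only: index the drug label id as an exact-match keyword,
        # and the label name as a full-text searchable field
        return {
            "drug_labels": {
                "properties": {
                    "id": {"type": "keyword"},
                    "name": {"type": "text"},
                }
            }
        }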
The last step to activate those components is to expose them through __init__.py:
from .dump import PharmgkbDumper
from .upload_annotations import AnnotationsUploader
from .upload_druglabels import DrugLabelsUploader
from .upload_occurrences import OccurrencesUploader
Upon restart, the “Upload” tab now looks like this:

We still have an uploader named pharmgkb, but that component has been deleted! The Hub indeed kept information within its internal database, but also detected that the actual uploader class doesn’t exist anymore (see the message No uploader found, datasource may be broken). In that specific case, an option to delete that internal information is provided; let’s click on the closing button on that tab to remove it.
If we look at the other uploader tabs, we don’t see much information; that’s because they haven’t been launched yet. For each of them, let’s click on the “Upload” button.
Note
Another way to trigger all uploaders at once is to go back to the datasources list and trigger the upload action for that datasource in particular.
After a while, all uploaders have run and the data is populated, as shown in the different tabs.
5.4. More data inspection¶
Data is ready; it’s now time to inspect it for the new uploaders. Indeed, if we check the “Mapping” tab, we still have the old mapping from the original pharmgkb uploader (we can remove that “dead” mapping by clicking on the closing button of the tab), but nothing for the druglabels and occurrences uploaders.
Looking back at the uploaders’ code, the get_mapping class method was defined such that it returns an empty mapping. That’s the reason why we don’t have anything shown here. Let’s fix that by launching a new inspection. After a few seconds, mappings are generated; we can review them, then validate and register those mappings, for each tab.
5.5. Modifying build configuration¶
All data is now ready, as well as the mappings; it’s time to move forward and build the merged data. We now have three different sources of documents, and we need to merge them together. The Hub will do so according to the _id field: if two documents from different sources share the same _id, they are merged together (think about dictionary merge).
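To picture what that merge amounts to, here is a tiny, simplified illustration (the Hub’s actual merger handles conflicts, storage and much more):

def merge_docs(doc_a, doc_b):
    """Simplified dictionary-style merge of two documents sharing the same _id."""
    assert doc_a["_id"] == doc_b["_id"]
    merged = dict(doc_a)
    for key, value in doc_b.items():
        if key != "_id":
            merged[key] = value
    return merged

# merge_docs({"_id": "PA24413", "annotations": [...]},
#            {"_id": "PA24413", "drug_labels": [...]})
# => {"_id": "PA24413", "annotations": [...], "drug_labels": [...]}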
In order to proceed further, we need to update our build configuration, as there’s currently only one datasource involved in the merge. From the build configurations menu, we can edit the existing configuration.

There are several parameters we need to adjust:
first, since the original pharmgkb uploader doesn’t exist anymore, that datasource isn’t listed anymore
on the other hand, we now have our three new datasources, and we need to select all of them
our main data comes from annotations, and we want to enrich it with druglabels and literature occurrences, but only if data first exists in annotations. Behind this requirement is the notion of root documents: when selecting annotations as a source for root documents, we tell the Hub to first merge that data, then merge the other sources only if a document from annotations with the same _id exists. If not, documents are silently ignored.
finally, we were previously using a LinkDataBuilder because we only had one datasource (data wasn’t copied, but referred, or linked, to the original datasource collection). We now have three datasources involved in the merge, so we can’t use that builder anymore and need to switch to the default DataBuilder. If not, the Hub will complain and deactivate the build configuration until it’s fixed.
The new configuration is summarized in the following picture:

Upon validation, the build configuration is ready to be used.
5.6. Incremental release¶
The configuration reflects our changes and is up-to-date; let’s create a new build. Open the build configuration if not already open, then click “Create a new build”.

After a few seconds, we have a new build listed. Clicking on “Logs” will show how the Hub created it. We can see it first merged annotations in the “merge-root” step (for root documents), then the druglabels and occurrences sources. The remaining steps (diff, release note) were automatically triggered by the Hub. Let’s explore these further.

If we open the build and click on the “Releases” tab, we have a diff release, or incremental release, as mentioned in the “Logs”. Because a previous release existed for that build configuration (the one we did in part one), the Hub tries to compute a release comparing the two together, identifying new, deleted and updated documents. The result is a diff release, based on the JSON diff format.

In our case, one diff file has been generated; its size is 2 MiB, and it contains information to update 971 documents. This is expected since we enriched our existing data. The Hub also mentions that the mapping has changed, and these changes will be propagated to the index as we “apply” that diff release.
Note
Because we added new datasources without modifying the existing mapping from the first annotations source, the differences between the previous and new mappings correspond to “add” json-diff operations. This means we strictly only add more information to the existing mapping. If we had removed or modified existing mapping fields, the Hub would have reported an error and aborted the generation of that diff release, to prevent an error during the update of the ElasticSearch index, or to avoid data inconsistency.
The other document that has been automatically generated is a release note.

If we click on “View”, we can see the results: the Hub compared previous data versions and counts, deleted and added datasources and fields, etc… In other words, a “change log” summarizing what happened between the previous and new releases. These release notes are informative, but can also be published when deploying data releases (see part 3).

Let’s apply that diff release. We can select which index to update from a dropdown list. We only have one index, the one we created earlier in part 1. That said, the Hub will do its best to filter out any incompatible indices, such as those not coming from the same build configuration, or not having the same document type.

Once confirmed, the synchronization process begins: diff files are applied to the index, just as if we were “patching” the data. We can track the command execution from the command list, and also from the notification popups when it’s done.


Our index, currently served by the API we defined in part 1, has been updated using a diff, or incremental, release. It’s time to have a look at the data.
5.7. Testing final API¶
Because we directly applied a diff (patched our data) on the ElasticSearch index, we don’t need to re-create an API: querying the API just transparently reflects that “live” update.
Time to try our newly enriched API. We’ll use curl again; here are a few query examples:
$ curl localhost:8000/metadata
{
"biothing_type": "gene",
"build_date": "2020-01-24T00:14:28.112289",
"build_version": "20200124",
"src": {
"pharmgkb": {
"stats": {
"annotations": 979,
"druglabels": 122,
"occurrences": 5503
},
"version": "2020-01-05"
}
},
"stats": {
"total": 979
}
}
The metadata has changed, as expected. If we compare this result with the previous one, we now have three different sources: annotations, druglabels and occurrences, reflecting our new uploaders. For each of them, we have the total number of documents involved during the merge. Interestingly, the total number of documents is in our case 979 but, for instance, occurrences shows 5503 documents. Remember, we set annotations as the root document source, meaning documents from the other sources are merged only if they match (based on the _id field) an existing document in this root document source. In other words, with this specific build configuration, we can’t have more documents in the final API than the number of documents in the root document sources.
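The effect of the root documents setting can be sketched like this (simplified, not the Hub’s actual code): documents from druglabels or occurrences whose _id is not present in annotations are simply dropped, which is why the final total stays at 979:

def merge_with_root(root_docs, other_docs):
    """Merge other sources into root documents; ignore anything without a matching _id."""
    merged = {doc["_id"]: dict(doc) for doc in root_docs}
    for doc in other_docs:
        if doc["_id"] not in merged:
            continue                      # silently ignored: no root document to attach to
        for key, value in doc.items():
            if key != "_id":
                merged[doc["_id"]][key] = value
    return list(merged.values())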
Let’s query by symbol name, just as before:
$ curl localhost:8000/query?q=ABL1
{
"max_score": 7.544187,
"took": 2,
"total": 1,
"hits": [
{
"_id": "PA24413",
"_score": 7.544187,
"annotations": [
{
"alleles": "T",
"annotation_id": 1447814556,
"chemical": "homoharringtonine (PA166114929)",
"chromosome": "chr9",
"gene": "ABL1 (PA24413)",
"notes": "Patient received received omacetaxine, treatment had been stopped after two cycles because of clinical intolerance, but a major molecular response and total disappearance of theT315I clone was obtained. Treatment with dasatinib was then started and after 34-month follow-up the patient is still in major molecular response.",
"phenotype_category": "efficacy",
"pmid": 25950190,
"sentence": "Allele T is associated with response to homoharringtonine in people with Leukemia, Myelogenous, Chronic, BCR-ABL Positive as compared to allele C.",
"significance": "no",
"studyparameters": "1447814558",
"variant": "rs121913459"
},
...
],
"drug_labels": [
{
"id": "PA166117941",
"name": "Annotation of EMA Label for bosutinib and ABL1,BCR"
},
{
"id": "PA166104914",
"name": "Annotation of EMA Label for dasatinib and ABL1,BCR"
},
{
"id": "PA166104926",
"name": "Annotation of EMA Label for imatinib and ABL1,BCR,FIP1L1,KIT,PDGFRA,PDGFRB"
},
...
],
"occurrences": [
{
"object_id": "PA24413",
"object_name": "ABL1",
"object_type": "Gene",
"source_id": "PMID:18385728",
"source_name": "The cancer biomarker problem.",
"source_type": "Literature"
},
{
"object_id": "PA24413",
"object_name": "ABL1",
"object_type": "Gene",
"source_id": "PMC443563",
"source_name": "Two different point mutations in ABL gene ATP-binding domain conferring Primary Imatinib resistance in a Chronic Myeloid Leukemia (CML) patient: A case report.",
"source_type": "Literature"
},
...
]
}
]
}
We now have much more information associated (much has been removed for clarity), including the keys drug_labels and occurrences coming from the two new uploaders.
5.8. Conclusions¶
Moving on from an API based on a single datasource, previously defined as a data plugin, we’ve been able to export that data plugin’s code. This code was used as a base to extend our API; specifically:
we implemented two more parsers, and their counter-part uploaders.
we updated the build configuration to add these new datasources
we applied an incremental release to the existing index, so the existing API transparently serves the enriched data.
So far the APIs are running from within BioThings Studio, and the data still isn’t exposed to the public. The next step is to publish this data and make the API available to everyone.
Note
BioThings Studio is a backend service, aimed to be used internally to prepare, test and release APIs. It is not intended to face the public internet; in other words, it’s not recommended to expose any ports, including API ports, to the public-facing internet.
6. API cloud deployments and hosting¶
This part is still under development… Stay tuned and join Biothings Google Groups (https://groups.google.com/forum/#!forum/biothings) for more.
7. Troubleshooting¶
We test and make sure, as much as we can, that the BioThings Studio image is up-to-date and running properly. But things can still go wrong…
A good starting point for investigating an issue is to look at the logs from BioThings Studio. Make sure it’s connected (green power button on the top right), then click the “Logs” button, on the bottom right. You will see logs in real-time (if not connected, it will complain about a disconnected websocket). As you click and perform actions throughout the web application, you will see log messages in that window, and potentially errors not displayed (or displayed with fewer details) in the application.

The “Terminal” (click on the bottom left button) gives access to the commands you can manually type from the web application. Basically, any action performed by clicking in the application is converted into a command call. You can even see what commands were launched and which ones are running. This terminal also gives access to more commands and advanced options that may be useful to troubleshoot an issue. Typing help(), or passing a command name such as help(dump), will print documentation on available commands and how to use them.
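For example, a quick session in that terminal might look like the following; command names and signatures can vary between Hub versions, so use help() to check what is actually available in yours:

help()                    # list all available commands
help(dump)                # detailed help for the dump command
dump("pharmgkb")          # trigger the dumper for the pharmgkb datasource (verify the signature with help(dump))
upload("pharmgkb")        # likewise for the uploader(s)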

On a lower level, make sure all services are running in the docker container. Enter the container with docker exec -ti studio /bin/bash and type netstat -tnlp; you should see services running on their ports (see usual running services). If services on ports 7080 and 7022 aren’t running, it means the Hub has not started. If you just started the instance, wait a little longer, as services may take a while before they’re fully started and ready.
If after ~1 minute you still don’t see the Hub running, log in as user biothings and check the starting sequence.
Note
Hub is running in a tmux session, under user biothings.
# sudo su - biothings
$ tmux a # recall tmux session
$ python bin/hub.py
DEBUG:asyncio:Using selector: EpollSelector
INFO:root:Hub DB backend: {'uri': 'mongodb://localhost:27017', 'module': 'biothings.utils.mongo'}
INFO:root:Hub database: biothings_src
DEBUG:hub:Last launched command ID: 14
INFO:root:Found sources: []
INFO:hub:Loading data plugin 'https://github.com/sirloon/mvcgi.git' (type: github)
DEBUG:hub:Creating new GithubAssistant instance
DEBUG:hub:Loading manifest: {'dumper': {'data_url': 'https://www.cancergenomeinterpreter.org/data/cgi_biomarkers_latest.zip',
'uncompress': True},
'uploader': {'ignore_duplicates': False, 'parser': 'parser:load_data'},
'version': '0.1'}
INFO:indexmanager:{}
INFO:indexmanager:{'test': {'max_retries': 10, 'retry_on_timeout': True, 'es_host': 'localhost:9200', 'timeout': 300}}
DEBUG:hub:for managers [<SourceManager [0 registered]: []>, <AssistantManager [1 registered]: ['github']>]
INFO:root:route: ['GET'] /job_manager => <class 'biothings.hub.api.job_manager_handler'>
INFO:root:route: ['GET'] /command/([\w\.]+)? => <class 'biothings.hub.api.command_handler'>
...
INFO:root:route: ['GET'] /api/list => <class 'biothings.hub.api.api/list_handler'>
INFO:root:route: ['PUT'] /restart => <class 'biothings.hub.api.restart_handler'>
INFO:root:route: ['GET'] /status => <class 'biothings.hub.api.status_handler'>
DEBUG:tornado.general:sockjs.tornado will use json module
INFO:hub:Monitoring source code in, ['/home/biothings/biothings_studio/hub/dataload/sources', '/home/biothings/biothings_studio/plugins']:
['/home/biothings/biothings_studio/hub/dataload/sources',
'/home/biothings/biothings_studio/plugins']
You should see something like the above. If not, you should see the actual error, and depending on the error, you may be able to fix it (not enough disk space, etc…). BioThings Hub can be started again using python bin/hub.py from within the application directory (in our case, /home/biothings/biothings_studio).
Note
Press Control-B then D to detach the tmux session and let the Hub run in the background.
By default, logs are available in /data/biothings_studio/logs/.
Finally, you can report issues and request for help, by joining Biothings Google Groups (https://groups.google.com/forum/#!forum/biothings).
B. Developer’s guide¶
This section provides both an overview and detailed information about BioThings Studio, and is specifically aimed at developers who would like to know more about its internals.
A complementary tutorial is also available, explaining how to set up and use BioThings Studio, step-by-step, by building an API from a flat file.
1. What is BioThings Studio¶
BioThings Studio is a pre-configured, ready-to-use application. At its core is BioThings Hub, the backend service behind all BioThings APIs.
1.1. BioThings Hub: the backend service¶
The Hub is responsible for keeping data up-to-date and creating data releases for the BioThings frontend.
The process of integrating data and creating releases involves different steps, as shown in the following diagram:

data is first downloaded locally using dumpers
parsers then convert the data into JSON documents, which are stored in a MongoDB database using uploaders
when using multiple sources, data can be combined together using mergers
data releases are then created either by indexing data to an ElasticSearch cluster with indexers, or by computing the differences between the current release and previous one, using differs, and applying these differences using syncers
The final index, along with the Tornado application, represents the frontend that is actually queried by the different available clients; it is out of this document’s scope.
1.2. BioThings Studio¶
The architecture and the different software components involved in this system can be quite intimidating. To help, the whole service is packaged as a pre-configured application, BioThings Studio. A docker image is available in the Docker Hub registry, and can be pulled using:
$ docker pull biothings/biothings-studio:0.2a

A BioThings Studio instance exposes several services on different ports:
8080: BioThings Studio web application port
7022: BioThings Hub SSH port
7080: BioThings Hub REST API port
7081: BioThings Hub read-only REST API port
9200: ElasticSearch port
27017: MongoDB port
8000: BioThings API, once created; it can be any non-privileged (>1024) port
9000: Cerebro, a webapp used to easily interact with ElasticSearch clusters
60080: Code-Server, a webapp used to directly edit code in the container
BioThings Hub and the whole backend service can be accessed in different ways, through some of these services:
a web application allows interaction with the most used elements of the service (port 8080)
a console, accessible through SSH, gives access to more commands, for advanced usage (port 7022)
a REST API and a websocket (port 7080) can be used to interact with the Hub, query the different objects inside it, and get real-time notifications when processes are running. This interface is a good choice for third-party integration.
1.3. Who should use BioThings Studio?¶
BioThings Studio can be used in different scenarios:
you want to contribute to an existing BioThings API by integrating a new data source
you want to run your own BioThings API but don’t want to install all the dependencies and learn how to configure all the sub-systems
1.4. Filesystem overview¶
Several locations on the filesystem are important when it comes to changing the default configuration or troubleshooting the application:
The Hub (backend service) runs under the biothings user; the running code is located in /home/biothings/biothings_studio. It heavily relies on the BioThings SDK, located in /home/biothings/biothings.api.
Several scripts/helpers can be found in /home/biothings/bin:
run_studio is used to run the Hub in a tmux session. If a session is already running, it will first kill it and create a new one. We don't have to run this manually when the studio first starts, it is part of the starting sequence.
update_studio is used to fetch the latest code for BioThings Studio
update_biothings, same as above but for BioThings SDK
/data contains several important folders:
mongodb folder, where the MongoDB server stores its data
elasticsearch folder, where ElasticSearch stores its data
biothings_studio folder, containing different sub-folders used by the Hub:
datasources contains data downloaded by the different dumpers; it contains sub-folders named according to the datasource's name. Inside each datasource folder, the different releases can be found, one per folder.
dataupload is where data is stored when uploaded to the Hub (see the dedicated section below for more)
logs contains all log files produced by the Hub
plugins is where data plugins can be found (one sub-folder per plugin name)
Note
The instance will store MongoDB data in /data/mongodb, ElasticSearch data in /data/elasticsearch/,
and downloaded data and logs in /data/biothings_studio. Those locations could require extra disk space;
if necessary, the Docker option -v can be used to mount a directory from the host inside the container.
Please refer to the Docker documentation. It's also important to give enough permissions so the different services
(MongoDB, ElasticSearch, Nginx, BioThings Hub, …) can actually write data on the Docker host.
1.5. Configuration files¶
BioThings Hub expects some configuration variables to be defined in order to work properly. In most BioThings Studio applications, a config_hub.py file defines those parameters, either by providing default values, or by setting them to a ConfigurationError exception. In the latter case, it means no default can be used and the user/developer has to define the value. A final config.py file must be defined; it usually imports all parameters from config_hub.py (from config_hub import *). That config.py has to be defined before the Hub can run.
Note
This process is only required when implementing or initializing a Hub from scratch. All BioThings Studio applications come with that file defined, and the Hub is ready to be used.
It’s also possible to override parameters directly from the webapp/UI. In that case, the new parameter values are stored in the internal Hub database. Upon start, the Hub will check that database and supersede any values defined directly in the python configuration files. This process is handled by the class biothings.ConfigurationManager.
Finally, a special (simple) dialect can be used while defining configuration parameters, using special markup within comments above the declaration. This makes it possible to:
provide documentation for parameters
put parameters under different categories
mark a parameter as read-only
set a parameter as “invisible” (not exposed)
This process is used to expose the Hub configuration through the UI, automatically providing documentation in the webapp without having to duplicate code, parameters and documentation.
For more information, see the class biothings.ConfigurationParser, as well as existing configuration files in the different studios.
1.6. Services check¶
Let’s enter the container to check everything is running fine. Services may take a while (up to a minute) before they are fully started. If some services are missing, the troubleshooting section may help.
$ docker exec -ti studio /bin/bash
root@301e6a6419b9:/tmp# netstat -tnlp
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 0.0.0.0:7080 0.0.0.0:* LISTEN -
tcp 0 0 0.0.0.0:9000 0.0.0.0:* LISTEN -
tcp 0 0 127.0.0.1:27017 0.0.0.0:* LISTEN -
tcp 0 0 0.0.0.0:7022 0.0.0.0:* LISTEN -
tcp 0 0 0.0.0.0:9200 0.0.0.0:* LISTEN -
tcp 0 0 0.0.0.0:8080 0.0.0.0:* LISTEN 166/nginx: master p
tcp 0 0 0.0.0.0:9300 0.0.0.0:* LISTEN -
tcp 0 0 0.0.0.0:22 0.0.0.0:* LISTEN 416/sshd
tcp6 0 0 :::7080 :::* LISTEN -
tcp6 0 0 :::7022 :::* LISTEN -
tcp6 0 0 :::22 :::* LISTEN 416/sshd
Specifically, BioThings Studio services’ ports are: 7080, 7022 and 8080.
2. Overview of BioThings Studio web application¶
BioThings Studio web application can simply be accessed using any browser pointing to port 8080. The home page shows a summary of current data and recent updates. For now, it’s pretty quiet since we didn’t integrate any data yet.

Let’s have a quick overview of the different elements accessible through the webapp. On the top left is the connection widget. By default, the BioThings Studio webapp connects to the Hub API on port 7080, the one running within Docker. But the webapp is a static web page, so you can access any other Hub API by configuring a new connection:

Enter the Hub API URL, http://<host>:<port> (you can omit http://, the webapp will use that scheme by default):

The new connection is now listed and can be accessed quickly later simply by selecting it. Note the connection can be deleted with the “trash” icon, but cannot be edited.

Following are several tabs giving access to the main steps involved in building a BioThings API. We’ll get into those in more detail while creating our new API. On the right, we have different information about jobs and resources:

Running commands are shown in this popup, as well as commands that have run before, when switching to “Show all”¶

When jobs use parallelization, processes will show information about what is running and how many resources each process takes. Notice we only have 1 process available: we’re running a t2.medium instance which only has 2 CPUs, and the Hub has automatically assigned half of them.¶

BioThings Hub also uses threads for parallelization; their activity is shown here. The number of queued jobs, waiting for a free process or thread, is shown as well, together with the total amount of memory the Hub is currently using¶

In this popup are shown all notifications coming from the Hub, in real-time, allowing to follow all jobs and activity.¶

The first circle shows the page loading activity. Gray means nothing is active, flashing blue means the webapp is loading information from the Hub, and red means an error occurred (the error should be found either in the notifications or by opening the logs from the bottom right corner).¶
The next button with a cog icon gives access to the configuration and is described in the next section.

Finally, a logo shows the websocket connection status (green power button means “connected”, red plug means “not connected”).¶
3. Configuration¶
By clicking on the cog icon in the bar on the right, the Hub configuration can be accessed. The configuration parameters, documentation and sections are defined in the python configuration files (see Configuration files). Specifically, if a parameter is hidden, redacted and/or read-only, it’s because of how it was defined in the python configuration files.

All parameters must be entered in a JSON format. Ex: double quotes for strings, square brackets to define lists, etc. A changed parameter can be saved using the “Save” button, available for each parameter. The “Reset” button can be used to switch it back to the original default value that was defined in the configuration files.
Ex: Update Hub’s name

First enter the new name for the parameter HUB_NAME. Because the value has changed, the “Save” button is available.

Upon validation, a green check mark is shown, and because the value is not the default one, the “Reset” button is now available. Clicking on it will switch the parameter’s value back to its original default one.

Note each time a parameter is changed, Hub needs to be restarted, as shown on the top.

4. Data plugin architecture and specifications¶
BioThings Studio allows to easily define and register datasources using data plugins. As of BioThings Studio 0.2b, there are two different types of data plugin.
4.1. Manifest plugins¶
A manifest plugin is a folder containing:
a manifest.json file
other python files supporting the declaration in the manifest.
The plugin name, that is, the folder name containing the manifest file, gives the name to the resulting datasource.
A manifest file is defined like this:
{
"version": "0.2",
"__metadata__" : { # optional
"url" : "<datasource website/url>",
"license_url" : "<url>",
"licence" : "<license name>",
"author" : {
"name" : "<author name>",
"url" : "<link to github's author for instance>"
}
},
"requires" : ["lib==1.3","anotherlib"],
"dumper" : {
"data_url" : "<url>" # (or list of url: ["<url1>", "<url1>"]),
"uncompress" : true|false, # optional, default to false
"release" : "<path.to.module>:<function_name>" # optional
"schedule" : "0 12 * * *" # optional
},
"uploader" : { # optional
"parser" : "<path.to.module>:<function_name>",
"on_duplicates" : "ignore|error|merge" # optional, default to "error"
}
}
or with multiple uploaders:
{
"version": "0.2",
"__metadata__" : { # optional
"url" : "<datasource website/url>",
"license_url" : "<url>",
"licence" : "<license name>",
"author" : {
"name" : "<author name>",
"url" : "<link to github's author for instance>"
}
},
"requires" : ["lib==1.3","anotherlib"],
"dumper" : {
"data_url" : "<url>" # (or list of url: ["<url1>", "<url1>"]),
"uncompress" : true|false, # optional, default to false
"release" : "<path.to.module>:<function_name>" # optional
"schedule" : "0 12 * * *" # optional
},
"uploaders" : [{ # optional
"parser" : "<path.to.module>:<function_name_1>",
"on_duplicates" : "ignore|error|merge" # optional, default to "error"
},{
"parser" : "<path.to.module>:<function_name_2>",
"on_duplicates" : "ignore|error|merge" # optional, default to "error"
},{
"parser" : "<path.to.module>:<function_name_3>",
"on_duplicates" : "ignore|error|merge" # optional, default to "error"
}
]
}
Note
it’s possible to only have a dumper section, without any uploader specified. In that case, the data plugin will only download data and won’t provide any way to parse and upload data.
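For instance, a minimal dump-only manifest could look like this (the URL is hypothetical):
{
    "version": "0.2",
    "dumper" : {
        "data_url" : "https://example.org/annotations.tsv.gz",
        "uncompress" : true
    }
}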
a version defines the specification version the manifest is using. Currently, version 0.2 should be used. This is not the version of the datasource itself.
an optional (but highly recommended) __metadata__ key provides information about the datasource itself, such as a website, a link to its license, and the license name. This information, when provided, is displayed in the /metadata endpoint of the resulting API.
a requires section, optional, describes dependencies that should be installed for the plugin to work. This uses pip behind the scenes, and each element of that list is passed to the pip install command line. If one dependency installation fails, the plugin is invalidated. Alternately, a single string can be passed, instead of a list.
a dumper section specifies how to download the actual data:
data_url specifies where to download the data from. It can be a URL (string) or a list of URLs (list of strings). Currently supported protocols are http(s) and ftp. URLs must point to individual files (no wildcards), and only one protocol is allowed within a list of URLs (no mix of http and ftp URLs). All files are downloaded in a data folder, determined by config.DATA_ARCHIVE_ROOT/<plugin_name>/<release>
uncompress: once data is downloaded, this flag, if set to true, will uncompress all supported archives found in the data folder. Currently supported formats are *.zip, *.gz and *.tar.gz (includes an untar step)
schedule will trigger the scheduling of the dumper, so it automatically checks for new data on a regular basis. The format is the same as crontab's, with the addition of an optional sixth parameter for scheduling by the second. Ex: * * * * * */10 will trigger the dumper every 10 seconds (unless there's a specific use case, this is not recommended). For scheduling jobs, the Hub relies on aiocron.
release optionally specifies how to determine the release number/name of the datasource. By default, if not present, the release will be set using:
the Last-Modified header for an HTTP-based URL. Format: YYYY-MM-DD
the ETag header for an HTTP-based URL if Last-Modified isn't present in the headers. Format: the actual etag hash.
the MDTM ftp command if the URL is FTP-based.
If a list of URLs is specified in data_url, the last URL is the one used to determine the release. If none of those are available or satisfactory, a release section can be specified, and should point to a python module and a function name following the format module:function_name. Within this module, the function has the following signature and should return the release, as a string. set_release is a reserved name and must not be used.
An example about release can be found at https://github.com/remoteeng00/FIRE.git. In the master branch, the manifest file does not contain the release field, so the dump of the data source fails ("failed" status). When you check out version "v2" (https://github.com/remoteeng00/FIRE/tree/v2), you can dump the data source.
def function_name(self):
# code
return "..."
self refers to the actual dumper instance of either biothings.hub.dataload.dumper.HTTPDumper or biothings.hub.dataload.dumper.FTPDumper, depending on the protocol. All properties and methods from the instance are available, specifically:
self.client, the actual underlying client used to download files, which is either a requests.Session or a ftplib.FTP instance, and should be preferred over initializing a new connection/client.
self.SRC_URLS, containing the list of URLs (if only one URL was specified in data_url, this will be a list of one element), which is commonly used to inspect and possibly determine the release.
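As an illustration, here is a minimal sketch of such a release callable for an HTTP-based datasource; the function name and the date format are arbitrary, and it simply reuses the dumper attributes described above:
def custom_release(self):
    # derive the release from the Last-Modified header of the last source URL,
    # reusing the dumper's underlying requests.Session (self.client)
    from email.utils import parsedate_to_datetime
    url = self.SRC_URLS[-1]                       # the last URL determines the release
    resp = self.client.head(url)
    last_modified = resp.headers["Last-Modified"]
    return parsedate_to_datetime(last_modified).strftime("%Y-%m-%d")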
an uploader section specifies how to parse and store (upload):
parser key defines a module and a function name within that module. Format: module:function_name. The function has the following signature and returns a list of dictionaries (or yields dictionaries) containing at least an _id key representing a unique identifier (string) for this document:
def function_name(data_folder):
# code
yield {"_id":"..."}
data_folder is the folder containing the previously downloaded (dumped) data; it is automatically set to the latest release available. Note the function doesn't take a filename as input, it should select the file(s) to parse.
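To make this more concrete, here is a minimal parser sketch following the signature above; the file name and field names are hypothetical:
def load_annotations(data_folder):
    # pick the file(s) to parse from the dump folder, then yield one document
    # per row, each with the mandatory "_id" key
    import csv
    import os
    infile = os.path.join(data_folder, "annotations.tsv")   # assumed file name
    with open(infile) as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            yield {
                "_id": row["id"],                            # unique identifier (string)
                "annotation": {
                    "name": row["name"],
                    "score": float(row["score"]),
                },
            }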
on_duplicates defines the strategy to use when duplicated records are found (according to the _id key):
error (default) will raise an exception if duplicates are found
ignore will skip any duplicates, only the first one found will be stored
merge will merge the existing document with the duplicated one. Refer to the biothings.hub.dataload.storage.MergerStorage class for more.
parallelizer points to a module:function_name that can be used when the uploader can be parallelized. If multiple input files exist, using the exact same parser, the uploader can be parallelized using that option. The parser should take an input file as parameter, not a path to a folder. The parallelizer function should return a list of tuples, where each tuple corresponds to the list of input parameters for the parser. jobs is a reserved name and must not be used.
mapping points to a module:classmethod_name that can be used to specify a custom ElasticSearch mapping. The class method must return a python dictionary with a valid mapping. get_mapping is a reserved name and must not be used. There's no need to add the @classmethod decorator, the Hub will take care of it. The first and only argument is a class. Ex:
def custom_mapping(cls):
return {
"root_field": {
"properties": {
"subfield": {
"type": "text",
}
}
}
}
If you want to use multiple uploaders in your data plugin, you will need to use the uploaders section, which is a list of uploader definitions as described above.
Please see https://github.com/remoteeng00/pharmgkb/tree/pharmgkb_v5 for an example of a multiple-uploader definition.
Note
Please do not use both uploaders and uploader in your manifest file.
Note
Please see https://github.com/sirloon/mvcgi for a simple plugin definition. https://github.com/sirloon/gwascatalog shows how to use the release key; https://github.com/remoteeng00/FIRE demonstrates parallelization in the uploader section.
4.2. Advanced plugins¶
This type of plugin is more advanced in the sense that it's plain python code. Advanced plugins typically come from a code export of a manifest plugin, but live in a slightly different location: following the A.5.2. Code export section, exported python code is placed in the hub/dataload/sources/* folder, whereas advanced plugins are placed in the same folder as manifest plugins, at config.DATA_PLUGIN_FOLDER.
The resulting python code defines dumpers and uploaders as python classes, inheriting from BioThings SDK components. These plugins can also be written from scratch; they're "advanced" because they require more knowledge about the BioThings SDK.
In the root folder (local folder or remote git repository), an __init__.py is expected, and should contain imports for one dumper, and one or more uploaders.
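For instance, such an __init__.py could look like the following sketch (module and class names are hypothetical):
# __init__.py at the root of an advanced data plugin
from .dump import MySourceDumper        # exactly one dumper class
from .upload import MySourceUploader    # one or more uploader classes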
An example of advanced data plugin can be found at https://github.com/sirloon/mvcgi_advanced.git. It comes from “mvcgi” manifest plugin, where code was exported.
5. Hooks and custom commands¶
While it’s possible to define custom commands for the Hub console by deriving class biothings.hub.HubServer
, there’s also an easy way to enrich existing commands using hooks.
A hook is a python file located in HOOKS_FOLDER (defaulting to ./hooks/). When the Hub starts, it inspects this folder and "injects" the hook's namespace into its console. Everything available from within the hook file becomes available in the console. On the other hand, a hook can use any command available in the Hub console.
Hooks provide an easy way to "program" the Hub, based on existing commands. The following example defines a new command which will archive any builds older than X days. The code can be found at https://github.com/sirloon/auto_archive_hook.git. The file auto_archive.py should be copied into the ./hooks/ folder. Upon restart, a new command named auto_archive is part of the Hub. It's also scheduled automatically using the schedule(...) command at the end of the hook.
The auto_archive function uses several existing Hub commands:
lsmerge: when given a build config name, returns a list of all existing build names.
archive: will delete underlying data but keep existing metadata for a given build name
bm.build_info: bm isn't a command, but a shortcut for the build_manager instance. From this instance, we can call the build_info method which, given a build name, returns information about it, including the build_date field we're interested in.
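As an illustration only, the core of such a hook could look like the sketch below; the actual implementation lives in the repository above, the names lsmerge, archive and bm are provided by the Hub console namespace at runtime, and the exact schedule() syntax may differ:
# hooks/auto_archive.py -- sketch, not the actual implementation
import datetime

def auto_archive(build_config_name, days=30):
    # archive any build older than `days` days for the given build configuration;
    # lsmerge, archive and bm are injected by the Hub console namespace (not imported here)
    cutoff = datetime.datetime.now() - datetime.timedelta(days=days)
    for build_name in lsmerge(build_config_name):     # list existing build names
        info = bm.build_info(build_name)              # bm: shortcut for the build_manager instance
        if info["build_date"] < cutoff:               # assuming build_date is a datetime
            archive(build_name)                       # keep metadata, delete underlying data

# the hook can then schedule itself at the end, e.g. once a day:
# schedule("0 3 * * *", auto_archive, "my_build_config")   # hypothetical call, check schedule() help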
Note
Hub console is actually a python interpreter. When connecting to the Hub using SSH, the connection “lands” into that interpreter. That’s why it’s possible to inject python code into the console.
Note
Be careful: user-defined hooks can conflict with existing commands and may break the Hub. Ex: if a hook defines a command "dump", it will replace, and potentially break, the existing one!
BioThings Standalone¶
This step-by-step guide shows how to use BioThings standalone instances. Standalone instances are based on Docker containers and provide a fully pre-configured, ready-to-use BioThings API that can easily be maintained and kept up-to-date. The idea is for any user to be able to run their own APIs locally and fulfill different needs:
keep all API requests private and local to your own server
enrich existing and publicly available data found on our APIs with some private data
run the API on your own architecture to perform heavy queries that would sometimes be throttled out from online services
Quick Links¶
If you already know how to run a BioThings standalone instance, you can download the latest available Docker images from the following tables.
Warning
Some data sources, managed and served by standalone instances (production, demo and old), have restricted licenses. Therefore these standalone instances must not be used for anything other than non-profit purposes. By clicking on the following links, you agree to and understand these restrictions. If you're a for-profit company and would like to run a standalone instance, please contact us.
Note
Images don’t contain data but are ready to download data and keep it up-to-date by running simple commands through the hub.
List of standalone instances¶
Each standalone instance comes in three flavors: Production, Demo and Old.
Production and old data require at least 30GiB disk space.
Production and old data require at least 2TiB disk space.
Production and old data require at least 150GiB disk space (some images: coming soon).
Prerequisites¶
Using standalone instances requires a Docker server up and running, and some basic knowledge about running and using containers. Images have been tested on Docker >=17. On the AWS cloud, you can use our public AMI biothings_demo_docker (ami-44865e3c in the Oregon region) with Docker pre-configured and ready for standalone demo instance deployment. We recommend using an instance type with at least 8GiB RAM, such as t2.large. The AMI comes with an extra 30GiB EBS volume, which should be enough to deploy any demo instance.
Alternately, you can install your own Docker server (on recent Ubuntu systems, sudo apt-get install docker.io is usually enough). You may need to point the Docker images directory to a specific hard drive to get enough space, using the -g option:
# /mnt/docker points to a hard drive with enough disk space
sudo echo 'DOCKER_OPTS="-g /mnt/docker"' >> /etc/default/docker
# restart to make this change active
sudo service docker restart
Demo instances use very little disk space, as only a small subset of data is available. For instance, the myvariant demo only requires ~10GiB to run with demo data up-to-date, including the whole Linux system and all other dependencies. Demo instances provide a quick and easy way to set up a running API, without having to deal with some advanced system configurations.
For deployment with production or old data, you may need a large amount of disk space. Refer to the Quick Links section for more information. Bigger instance types will also be required, or even a full cluster architecture deployment. We'll soon provide guidelines and deployment scripts for this purpose.
What you’ll learn¶
Through this guide, you’ll learn:
how to obtain a Docker image to run your favorite API
how to run that image inside a Docker container and how to access the web API
how to connect to the hub, a service running inside the container, used to interact with the API systems
how to use that hub, with specific commands, in order to perform updates and keep data up-to-date
Data found in standalone instances¶
All BioThings APIs (mygene.info, myvariant.info, …) provide data releases in different flavors:
Production data, the actual data found on live APIs we, the BioThings team at SuLab, are running and keeping up-to-date on a regular basis. Please contact us if you’re interested in obtaining this type of data.
Demo data, a small subset of production data, publicly available
Old production data, an at least one year old production dataset (full), publicly available
The following guide applies to demo data only, though the process would be very similar for other types of data flavors.
Downloading and running a standalone instance¶
Standalone instances are available as Docker images. For the purpose of this guide, we'll set up an instance running the mygene API, containing demo data. Links to standalone demo Docker images can be found in the Quick Links section at the beginning of this guide. Use one of these links, or this direct link to mygene's demo instance, and download the Docker image file using your favorite browser or wget:
$ wget http://biothings-containers.s3-website-us-west-2.amazonaws.com/demo_mygene/demo_mygene.docker
You must have a running Docker server in order to use that image. Typing docker ps should return all running containers, or at least an empty list as in the following example. Depending on the system and configuration, you may have to add sudo in front of this command to access the Docker server.
$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
Once downloaded, the image can be loaded into the server:
$ docker image load < demo_mygene.docker
$ docker image list
REPOSITORY TAG IMAGE ID CREATED SIZE
demo_mygene latest 15d6395e780c 6 weeks ago 1.78GB
The image is now loaded; its size is ~1.78GiB and it contains no data (yet). A Docker container can now be instantiated from that image, to create a BioThings standalone instance, ready to be used.
A standalone instance is a pre-configured system containing several parts. BioThings hub is the system used to interact with BioThings backend and perform operations such as downloading data and create/update ElasticSearch indices. Those indices are used by the actual BioThings web API system to serve data to end-users. The hub can be accessed through a standard SSH connection or through REST API calls. In this guide, we’ll use the SSH server.
A BioThings instance exposes several services on different ports:
80: BioThings web API port
7022: BioThings hub SSH port
7080: BioThings hub REST API port
9200: ElasticSearch port
We will map and expose those ports to the host server using the option -p so we can access BioThings services without having to enter the container (e.g. the hub SSH port here will be accessible using port 19022).
$ docker run --name demo_mygene -p 19080:80 -p 19200:9200 -p 19022:7022 -p 19090:7080 -d demo_mygene
Note
The instance will store ElasticSearch data in the /var/lib/elasticsearch/ directory, and downloaded data and logs in the /data/ directory. Those two locations could require extra disk space; if needed, the Docker option -v can be used to mount a directory from the host inside the container. Please refer to the Docker documentation.
Let’s enter the container to check everything is running fine. Services may take a while (up to a minute) before they are fully started. If some services are missing, the troubleshooting section may help.
$ docker exec -ti demo_mygene /bin/bash
root@a6a6812e2969:/tmp# netstat -tnlp
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 0.0.0.0:7080 0.0.0.0:* LISTEN -
tcp 0 0 0.0.0.0:7022 0.0.0.0:* LISTEN -
tcp 0 0 0.0.0.0:80 0.0.0.0:* LISTEN 25/nginx
tcp 0 0 127.0.0.1:8881 0.0.0.0:* LISTEN -
tcp 0 0 127.0.0.1:8882 0.0.0.0:* LISTEN -
tcp 0 0 127.0.0.1:8883 0.0.0.0:* LISTEN -
tcp 0 0 127.0.0.1:8884 0.0.0.0:* LISTEN -
tcp 0 0 127.0.0.1:8885 0.0.0.0:* LISTEN -
tcp 0 0 127.0.0.1:8886 0.0.0.0:* LISTEN -
tcp 0 0 127.0.0.1:8887 0.0.0.0:* LISTEN -
tcp 0 0 127.0.0.1:8888 0.0.0.0:* LISTEN -
tcp6 0 0 :::7080 :::* LISTEN -
tcp6 0 0 :::7022 :::* LISTEN -
tcp6 0 0 :::9200 :::* LISTEN -
tcp6 0 0 :::9300 :::* LISTEN -
We can see the different BioThings services' ports: 7080, 7022 and 80. All 888x ports correspond to Tornado instances running behind Nginx on port 80; they shouldn't be accessed directly. Ports 9200 and 9300 are standard ElasticSearch ports (port 9200 can be used to perform queries directly on ES, if needed).
At this point, the standalone instance is up and running. No data has been downloaded yet, let’s see how to populate the BioThings API using the hub.
Updating data using Biothings hub¶
If the standalone instance has been freshly started, there's no data to be queried by the API. If we make an API call, such as fetching metadata, we'll get an error:
# from Docker host
$ curl -v http://localhost:19080/metadata
* Trying 127.0.0.1...
* Connected to localhost (127.0.0.1) port 19080 (#0)
> GET /metadata HTTP/1.1
> Host: localhost:19080
> User-Agent: curl/7.47.0
> Accept: */*
>
< HTTP/1.1 500 Internal Server Error
< Date: Tue, 28 Nov 2017 18:19:23 GMT
< Content-Type: text/html; charset=UTF-8
< Content-Length: 93
< Connection: keep-alive
< Server: TornadoServer/4.5.2
<
* Connection #0 to host localhost left intact
This 500 error reflects a missing index (ElasticSearch index, the backend used by BioThings web API). We can have a look at existing indices in ElasticSearch:
# from Docker host
$ curl http://localhost:19200/_cat/indices
yellow open hubdb 5 1 0 0 795b 795b
There’s only one index, hubdb, which is an internal index used by the hub. No index containing actual biological data…
BioThings hub is a service running inside the instance; it can be accessed through an SSH connection, or using REST API calls. For the purpose of this guide, we'll use SSH. Let's connect to the hub (type yes to accept the key on first connection):
# from Docker host
$ ssh guest@localhost -p 19022
The authenticity of host '[localhost]:19022 ([127.0.0.1]:19022)' can't be established.
RSA key fingerprint is SHA256:j63IEgXc3yJqgv0F4wa35aGliH5YQux84xxABew5AS0.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '[localhost]:19022' (RSA) to the list of known hosts.
Welcome to Auto-hub, guest!
hub>
We’re now connected to the hub, inside a python shell where the application is actually running. Let’s see what commands are available:
Warning
the hub console, though accessed through SSH, is not a Linux shell (such as bash); it's a python interpreter shell.
hub> help()
Available commands:
versions
check
info
download
apply
step_update
update
help
Type: 'help(command)' for more
versions() will display all available data build versions we can download to populate the API
check() will return whether a more recent version is available online
info() will display the current local API version, and information about the latest version available online
download() will download the data compatible with the current local version (but without populating the ElasticSearch index)
apply() will use local data previously downloaded to populate the index
step_update() will bring the data release to the next one (one step in versions), compatible with the current local version
update() will bring data to the latest version available online (using a combination of download and apply calls)
Note
update() is the fastest, easiest and preferred way to update the API. download, apply and step_update are available when it's necessary to bring the API data to a specific version (not the latest one); they are considered more advanced and won't be covered in this guide.
Note
Because the hub console is actually a python interpreter, we call the commands using parenthesis, just like functions or methods. We can also pass arguments when necessary, just like standard python (remember: it is python…)
Note
After each command is typed, we need to press “enter” to get either its status (still running) or the result
Let’s explore some more.
hub> info()
[2] RUN {0.0s} info()
hub>
[2] OK info(): finished
>>> Current local version: 'None'
>>> Release note for remote version 'latest':
Build version: '20171126'
=========================
Previous build version: '20171119'
Generated on: 2017-11-26 at 03:11:51
+---------------------------+---------------+-------------+-----------------+---------------+
| Updated datasource | prev. release | new release | prev. # of docs | new # of docs |
+---------------------------+---------------+-------------+-----------------+---------------+
| entrez.entrez_gene | 20171118 | 20171125 | 10,003 | 10,003 |
| entrez.entrez_refseq | 20171118 | 20171125 | 10,003 | 10,003 |
| entrez.entrez_unigene | 20171118 | 20171125 | 10,003 | 10,003 |
| entrez.entrez_go | 20171118 | 20171125 | 10,003 | 10,003 |
| entrez.entrez_genomic_pos | 20171118 | 20171125 | 10,003 | 10,003 |
| entrez.entrez_retired | 20171118 | 20171125 | 10,003 | 10,003 |
| entrez.entrez_accession | 20171118 | 20171125 | 10,003 | 10,003 |
| generif | 20171118 | 20171125 | 10,003 | 10,003 |
| uniprot | 20171025 | 20171122 | 10,003 | 10,003 |
+---------------------------+---------------+-------------+-----------------+---------------+
Overall, 9,917 documents in this release
0 document(s) added, 0 document(s) deleted, 130 document(s) updated
We can see here we don’t have any local data release (Current local version: 'None'
), whereas the latest online (at that time) is from
November 26th 2017. We can also see the release note with the different changes involved in the release (whether it’s a new version, or the number
of documents that changed).
hub> versions()
[1] RUN {0.0s} versions()
hub>
[1] OK versions(): finished
version=20171003 date=2017-10-05T09:47:59.413191 type=full
version=20171009 date=2017-10-09T14:47:10.800140 type=full
version=20171009.20171015 date=2017-10-19T11:44:47.961731 type=incremental
version=20171015.20171022 date=2017-10-25T13:33:16.154788 type=incremental
version=20171022.20171029 date=2017-11-14T10:34:39.445168 type=incremental
version=20171029.20171105 date=2017-11-06T10:55:08.829598 type=incremental
version=20171105.20171112 date=2017-11-14T10:35:04.832871 type=incremental
version=20171112.20171119 date=2017-11-20T07:44:47.399302 type=incremental
version=20171119.20171126 date=2017-11-27T10:38:03.593699 type=incremental
Data comes in two distinct types:
full: this is a full data release, corresponding to an ElasticSearch snapshot, containing all the data
incremental: this is a differential/incremental release, produced by computing the differences between two consecutive versions. The diff data is then used to patch an existing, compatible data release to bring it to the next version.
So, in order to obtain the latest version, the hub will first find a compatible version. Since it's currently empty (no data), it will use the full release from 20171009, and then apply incremental updates sequentially (20171009.20171015, then 20171015.20171022, then 20171022.20171029, etc., up to 20171119.20171126).
Let’s update the API:
hub> update()
[3] RUN {0.0s} update()
hub>
[3] RUN {1.3s} update()
hub>
[3] RUN {2.07s} update()
After a while, the API is up-to-date, and we can run the command info() again (it can also be used to track update progress):
hub> info()
[4] RUN {0.0s} info()
hub>
[4] OK info(): finished
>>> Current local version: '20171126'
>>> Release note for remote version 'latest':
Build version: '20171126'
=========================
Previous build version: '20171119'
Generated on: 2017-11-26 at 03:11:51
+---------------------------+---------------+-------------+-----------------+---------------+
| Updated datasource | prev. release | new release | prev. # of docs | new # of docs |
+---------------------------+---------------+-------------+-----------------+---------------+
| entrez.entrez_gene | 20171118 | 20171125 | 10,003 | 10,003 |
| entrez.entrez_refseq | 20171118 | 20171125 | 10,003 | 10,003 |
| entrez.entrez_unigene | 20171118 | 20171125 | 10,003 | 10,003 |
| entrez.entrez_go | 20171118 | 20171125 | 10,003 | 10,003 |
| entrez.entrez_genomic_pos | 20171118 | 20171125 | 10,003 | 10,003 |
| entrez.entrez_retired | 20171118 | 20171125 | 10,003 | 10,003 |
| entrez.entrez_accession | 20171118 | 20171125 | 10,003 | 10,003 |
| generif | 20171118 | 20171125 | 10,003 | 10,003 |
| uniprot | 20171025 | 20171122 | 10,003 | 10,003 |
+---------------------------+---------------+-------------+-----------------+---------------+
Overall, 9,917 documents in this release
0 document(s) added, 0 document(s) deleted, 130 document(s) updated
The local version is 20171126, the remote is 20171126, so we're up-to-date. We can also use check():
hub> check()
[5] RUN {0.0s} check()
hub>
[5] OK check(): finished
Nothing to dump
Nothing to dump means there's no available remote version that can be downloaded. It would otherwise return a version number, meaning we would be able to update the API again using the command update().
Press Control-D to exit from the hub console.
Querying ElasticSearch, we can see a new index, named biothings_current, has been created and populated:
$ curl http://localhost:19200/_cat/indices
green open biothings_current 1 0 14903 0 10.3mb 10.3mb
yellow open hubdb 5 1 2 0 11.8kb 11.8kb
We now have a populated API we can query:
# from Docker host
# get metadata (note the build_version field)
$ curl http://localhost:19080/metadata
{
"app_revision": "672d55f2deab4c7c0e9b7249d22ccca58340a884",
"available_fields": "http://mygene.info/metadata/fields",
"build_date": "2017-11-26T02:58:49.156184",
"build_version": "20171126",
"genome_assembly": {
"rat": "rn4",
"nematode": "ce10",
"fruitfly": "dm3",
"pig": "susScr2",
"mouse": "mm10",
"zebrafish": "zv9",
"frog": "xenTro3",
"human": "hg38"
},
# annotation endpoint
$ curl http://localhost:19080/v3/gene/1017?fields=alias,ec
{
"_id": "1017",
"_score": 9.268311,
"alias": [
"CDKN2",
"p33(CDK2)"
],
"ec": "2.7.11.22",
"name": "cyclin dependent kinase 2"
}
# query endpoint
$ curl http://localhost:19080/v3/query?q=cdk2
{
"max_score": 310.69254,
"took": 37,
"total": 10,
"hits": [
{
"_id": "1017",
"_score": 310.69254,
"entrezgene": 1017,
"name": "cyclin dependent kinase 2",
"symbol": "CDK2",
"taxid": 9606
},
{
"_id": "12566",
"_score": 260.58084,
"entrezgene": 12566,
"name": "cyclin-dependent kinase 2",
"symbol": "Cdk2",
"taxid": 10090
},
...
BioThings API with multiple indices¶
Some APIs use more than one ElasticSearch index to run. For instance, myvariant.info uses one index for the hg19 assembly, and one index for the hg38 assembly. With such APIs, the available commands contain a suffix showing which index (thus, which data release) they relate to. Here's the output of help() from myvariant's standalone instance:
hub> help()
Available commands:
versions_hg19
check_hg19
info_hg19
download_hg19
apply_hg19
step_update_hg19
update_hg19
versions_hg38
check_hg38
info_hg38
download_hg38
apply_hg38
step_update_hg38
update_hg38
help
For instance, the update() command is now available as update_hg19() and update_hg38(), depending on the assembly.
Troubleshooting¶
We test and make sure, as much as we can, that standalone images are up-to-date and that the hub runs properly for each data release. But things can still go wrong…
First make sure all services are running. Enter the container and type netstat -tnlp; you should see services running on the expected ports (see the usual running services above). If services on ports 7080 or 7022 aren't running, it means the hub has not started. If you just started the instance, wait a little longer, as services may take a while before they're fully started and ready.
If after ~1 min you still don't see the hub running, log in as user biothings and check the starting sequence.
Note
Hub is running in a tmux session, under user biothings
# sudo su - biothings
$ tmux a # recall tmux session
python -m biothings.bin.autohub
(pyenv) biothings@a6a6812e2969:~/mygene.info/src$ python -m biothings.bin.autohub
INFO:root:Hub DB backend: {'module': 'biothings.utils.es', 'host': 'localhost:9200'}
INFO:root:Hub database: hubdb
DEBUG:asyncio:Using selector: EpollSelector
start
You should see something like this above. If not, you should see the actual error and, depending on the error, you may be able to fix it (not enough disk space, etc.). The hub can be started again using python -m biothings.bin.autohub from within the application directory (in our case, /home/biothings/mygene.info/src/).
Note
Press Control-B then D to detach the tmux session and let the hub run in the background.
Logs are available in /data/mygene.info/logs/. You can have a look at:
dump_*.log files for logs about data download
upload_*.log files for logs about index update in general (full/incremental)
sync_*.log files for logs about incremental update only
and hub_*.log files for general logs about the hub process
Finally, you can report issues and request help by joining the BioThings Google Group (https://groups.google.com/forum/#!forum/biothings).
BioThings Hub¶
Note
This tutorial uses an old/deprecated version of BioThings SDK. It will be updated very soon.
In this tutorial, we will build the whole process, or "hub", which produces the data for the Taxonomy BioThings API, accessible at t.biothings.io. This API serves information about species, lineage, etc. The "hub" is used to download data, keep it up-to-date, process it and merge it. At the end of this process, an ElasticSearch index is created containing all the data of interest, ready to be served as an API using the BioThings SDK web component (covered in another tutorial). The Taxonomy BioThings API code is available at https://github.com/biothings/biothings.species.
Prerequisites¶
BioThings SDK uses MongoDB as the "staging" storage backend for JSON objects before they are sent to ElasticSearch for indexing. You must have a working MongoDB instance you can connect to. We'll also perform some basic commands.
You also have to install the latest stable BioThings SDK release, with pip from PyPI:
pip install biothings
You can install the latest development version of BioThings SDK directly from our github repository like:
pip install git+https://github.com/biothings/biothings.api.git#egg=biothings
Alternatively, you can download the source code, or clone the BioThings SDK repository and run:
python setup.py install
You may want to use virtualenv to isolate your installation.
Finally, BioThings SDK is written in python, so you must know some basics.
Configuration file¶
Before starting to implement our hub, we first need to define a configuration file. The default_config.py file (https://github.com/biothings/biothings.api/blob/master/biothings/hub/default_config.py) contains all the required and optional configuration variables; some have to be defined in your own application, others can be overridden as needed (see config_hub.py, https://github.com/biothings/biothings.species/blob/master/src/config_hub.py, for an example).
Typically we will have to define the following:
MongoDB connection parameters, the DATA_SRC_* and DATA_TARGET_* parameters. They define connections to two different databases: one will contain individual collections for each datasource (SRC) and the other will contain merged collections (TARGET).
HUB_DB_BACKEND defines a database connection for hub purposes (application-specific data, like source status, etc.). The default backend type is MongoDB. We will need to provide a valid mongodb:// URI. Other backend types are available, like sqlite3 and ElasticSearch, but since we'll use MongoDB to store and process our data, we'll stick to the default.
DATA_ARCHIVE_ROOT contains the path of the root folder that will contain all the downloaded and processed data. Other parameters should be self-explanatory and probably don't need to be changed.
LOG_FOLDER contains the log files produced by the hub
Create a config.py and add from config_common import *, then define all the required variables above. config.py will look something like this:
from config_common import *
DATA_SRC_SERVER = "myhost"
DATA_SRC_PORT = 27017
DATA_SRC_DATABASE = "tutorial_src"
DATA_SRC_SERVER_USERNAME = None
DATA_SRC_SERVER_PASSWORD = None
DATA_TARGET_SERVER = "myhost"
DATA_TARGET_PORT = 27017
DATA_TARGET_DATABASE = "tutorial"
DATA_TARGET_SERVER_USERNAME = None
DATA_TARGET_SERVER_PASSWORD = None
HUB_DB_BACKEND = {
"module" : "biothings.utils.mongo",
"uri" : "mongodb://myhost:27017",
}
DATA_ARCHIVE_ROOT = "/tmp/tutorial"
LOG_FOLDER = "/tmp/tutorial/logs"
Note: Log folder must be created manually
hub.py¶
This script represents the main hub executable. Each hub should define it; this is where the different hub commands are defined and where tasks actually run. It's also from this script that an SSH server will run, so we can log into the hub and access those registered commands.
Throughout this tutorial, we will enrich that script. For now, we're just going to define a JobManager and the SSH server, and make sure everything is running fine.
import asyncio, asyncssh, sys
import concurrent.futures
from functools import partial
import config, biothings
biothings.config_for_app(config)
from biothings.utils.manager import JobManager
loop = asyncio.get_event_loop()
process_queue = concurrent.futures.ProcessPoolExecutor(max_workers=2)
thread_queue = concurrent.futures.ThreadPoolExecutor()
loop.set_default_executor(process_queue)
jmanager = JobManager(loop,
process_queue, thread_queue,
max_memory_usage=None,
)
jmanager is our JobManager; it's going to be used everywhere in the hub, each time a parallelized job is created. The species hub is a small one, so there's no need for many process workers: two should be fine.
Next, let’s define some basic commands for our new hub:
from biothings.utils.hub import schedule, top, pending, done
COMMANDS = {
"sch" : partial(schedule,loop),
"top" : partial(top,process_queue,thread_queue),
"pending" : pending,
"done" : done,
}
These commands are then registered in the SSH server, which is linked to a python interpreter. Commands will be part of the interpreter’s namespace and be available from a SSH connection.
passwords = {
'guest': '', # guest account with no password
}
from biothings.utils.hub import start_server
server = start_server(loop, "Taxonomy hub",passwords=passwords,port=7022,commands=COMMANDS)
try:
loop.run_until_complete(server)
except (OSError, asyncssh.Error) as exc:
sys.exit('Error starting server: ' + str(exc))
loop.run_forever()
Let's try to run that script! On the first run, it will complain about a missing SSH key:
AssertionError: Missing key 'bin/ssh_host_key' (use: 'ssh-keygen -f bin/ssh_host_key' to generate it
Let's generate it, following the instruction. Now we can run it again and try to connect:
$ ssh guest@localhost -p 7022
The authenticity of host '[localhost]:7022 ([127.0.0.1]:7022)' can't be established.
RSA key fingerprint is SHA256:USgdr9nlFVryr475+kQWlLyPxwzIUREcnOCyctU1y1Q.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '[localhost]:7022' (RSA) to the list of known hosts.
Welcome to Taxonomy hub, guest!
hub>
Let’s try a command:
hub> top()
0 running job(s)
0 pending job(s), type 'top(pending)' for more
Nothing fancy here, we don’t have much in our hub yet, but everything is running fine.
Dumpers¶
BioThings species API gathers data from different datasources. We will need to define different dumpers to make this data available locally for further processing.
Taxonomy dumper¶
This dumper will download taxonomy data from NCBI FTP server. There’s one file to download, available at this location: ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz.
When defining a dumper, we’ll need to choose a base class to derive our dumper class from.
There are different base dumper classes available in BioThings SDK, depending on the protocol
we want to use to download data. In this case, we’ll derive our class from biothings.hub.dataload.dumper.FTPDumper
.
In addition to defining some specific class attributes, we will need to implement a method called create_todump_list()
.
This method fills self.to_dump
list, which is later going to be used to download data.
One element in that list is a dictionary with the following structure:
{"remote": "<path to file on remote server", "local": "<local path to file>"}
Remote information are relative to the working directory specified as class attribute. Local information is an absolute path, containing filename used to save data.
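For instance, for the taxdump file we're about to download, such an element could look like this (the local path depends on DATA_ARCHIVE_ROOT and the current timestamp):
{"remote": "taxdump.tar.gz", "local": "/tmp/tutorial/taxonomy/20170125/taxdump.tar.gz"}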
Let’s start coding. We’ll save that python module in dataload/sources/taxonomy/dumper.py.
import biothings, config
biothings.config_for_app(config)
Those lines are used to configure BioThings SDK according to our own configuration information.
import os
from config import DATA_ARCHIVE_ROOT
from biothings.hub.dataload.dumper import FTPDumper
We then import os (used to build folder paths), a configuration constant, and the FTPDumper base class.
class TaxonomyDumper(FTPDumper):
SRC_NAME = "taxonomy"
SRC_ROOT_FOLDER = os.path.join(DATA_ARCHIVE_ROOT, SRC_NAME)
FTP_HOST = 'ftp.ncbi.nih.gov'
CWD_DIR = '/pub/taxonomy'
SUFFIX_ATTR = "timestamp"
SCHEDULE = "0 9 * * *"
SRC_NAME will be used as the registered name for this datasource (more on this later).
SRC_ROOT_FOLDER is the folder path for this resource, without any version information (the dumper will create different sub-folders for each version).
FTP_HOST and CWD_DIR give the information needed to connect to the remote FTP server and move to the appropriate remote directory (FTP_USER and FTP_PASSWD constants can also be used for authentication).
SUFFIX_ATTR defines the attribute that's going to be used to create a folder for each downloaded version. It's basically either "release" or "timestamp", depending on whether the resource we're trying to dump has an actual version. Here, for the taxdump file, there's no version, so we're going to use "timestamp". This attribute is automatically set to the current date, so folders will look like this: …/taxonomy/20170120, …/taxonomy/20170121, etc.
Finally, SCHEDULE, if defined, will allow that dumper to run regularly within the hub. This is a cron-like notation (see the aiocron documentation for more).
We now need to tell the dumper what to download, that is, create that self.to_dump list:
def create_todump_list(self, force=False):
file_to_dump = "taxdump.tar.gz"
new_localfile = os.path.join(self.new_data_folder,file_to_dump)
try:
current_localfile = os.path.join(self.current_data_folder, file_to_dump)
except TypeError:
# current data folder doesn't even exist
current_localfile = new_localfile
if force or not os.path.exists(current_localfile) or self.remote_is_better(file_to_dump, current_localfile):
# register new release (will be stored in backend)
self.to_dump.append({"remote": file_to_dump, "local":new_localfile})
That method tries to get the latest downloaded file and then compares it with the remote file using self.remote_is_better(file_to_dump, current_localfile), which compares the dates and returns True if the remote is more recent. A dict is then created with the required elements and appended to the self.to_dump list.
When the dump is running, each element from that self.to_dump list will be submitted to a job and be downloaded in parallel.
Let’s try our new dumper. We need to update hub.py
script to add a DumperManager and then register this dumper:
In hub.py:
import dataload
import biothings.hub.dataload.dumper as dumper
dmanager = dumper.DumperManager(job_manager=jmanager)
dmanager.register_sources(dataload.__sources__)
dmanager.schedule_all()
Let’s also register new commands in the hub:
COMMANDS = {
# dump commands
"dm" : dmanager,
"dump" : dmanager.dump_src,
...
dm will be a shortcut for the dumper manager object, and dump will actually call the manager's dump_src() method.
The manager auto-registers dumpers from a list defined in the dataload package. Let's define that list:
__sources__ = [
"dataload.sources.taxonomy",
]
That's it, it's just a string pointing to our taxonomy package. We'll expose our dumper class in that package so the manager can inspect it and find our dumper (note: we could give the full path to our dumper module, dataload.sources.taxonomy.dumper, but we'll add uploaders later, so it's better to have one single line per resource).
In dataload/sources/taxonomy/__init__.py
from .dumper import TaxonomyDumper
Let's run the hub again. We can see in the logs that our dumper has been found:
Found a class based on BaseDumper: '<class 'dataload.sources.taxonomy.dumper.TaxonomyDumper'>'
Also, manager has found scheduling information and created a task for this:
Scheduling task functools.partial(<bound method DumperManager.create_and_dump of <DumperManager [1 registered]: ['taxonomy']>>, <class 'dataload.sources.taxonomy.dumper.TaxonomyDumper'>, job_manager=<biothings.utils.manager.JobManager object at 0x7f88fc5346d8>, force=False): 0 9 * * *
We can double-check this by connecting to the hub, and type some commands:
Welcome to Taxonomy hub, guest!
hub> dm
<DumperManager [1 registered]: ['taxonomy']>
When printing the manager, we can check our taxonomy resource has been registered properly.
hub> sch()
DumperManager.create_and_dump(<class 'dataload.sources.taxonomy.dumper.TaxonomyDumper'>,) [0 9 * * * ] {run in 00h:39m:09s}
The dumper is going to run in 39 minutes! We can trigger a manual dump too:
hub> dump("taxonomy")
[1] RUN {0.0s} dump("taxonomy")
OK, dumper is running, we can follow task status from the console. At some point, task will be done:
hub>
[1] OK dump("taxonomy"): finished, [None]
It ran successfully (OK), and nothing was returned by the task ([None]). Logs show some more details:
DEBUG:taxonomy.hub:Creating new TaxonomyDumper instance
INFO:taxonomy_dump:1 file(s) to download
DEBUG:taxonomy_dump:Downloading 'taxdump.tar.gz'
INFO:taxonomy_dump:taxonomy successfully downloaded
INFO:taxonomy_dump:success
Alright, now if we try to run the dumper again, nothing should be downloaded since we got the latest file available. Let’s try that, here are the logs:
DEBUG:taxonomy.hub:Creating new TaxonomyDumper instance
DEBUG:taxonomy_dump:'taxdump.tar.gz' is up-to-date, no need to download
INFO:taxonomy_dump:Nothing to dump
So far so good! The actual file, depending on the configuration settings, is located in ./data/taxonomy/20170125/taxdump.tar.gz. We can notice the timestamp used to create the folder. Let's also have a look at the internal database to see the resource status. Connect to MongoDB:
> use hub_config
switched to db hub_config
> db.src_dump.find()
{
"_id" : "taxonomy",
"release" : "20170125",
"data_folder" : "./data/taxonomy/20170125",
"pending_to_upload" : true,
"download" : {
"logfile" : "./data/taxonomy/taxonomy_20170125_dump.log",
"time" : "4.52s",
"status" : "success",
"started_at" : ISODate("2017-01-25T08:32:28.448Z")
}
}
>
We have some information about the download process, how long it took to download files, etc. We have the path to the data_folder containing the latest version, the release number (here, it's a timestamp), and a flag named pending_to_upload. That will be used later to automatically trigger an upload after a dumper has run.
So the actual file is currently compressed, and we need to uncompress it before going further. We can add a post-dump step to our dumper. There are two options here, by overriding one of these methods:
def post_download(self, remotefile, localfile): triggered for each downloaded file
def post_dump(self): triggered once all files have been downloaded
We could use either, but there’s a utility function available in BioThings SDK that uncompress everything in a directory, let’s use it in a global post-dump step:
from biothings.utils.common import untargzall
...
def post_dump(self):
untargzall(self.new_data_folder)
self.new_data_folder is the path to the folder freshly created by the dumper (in our case, ./data/taxonomy/20170125).
Let's try this in the console (restart the hub to apply those changes). Because the file is up-to-date, the dumper will not run. We need to force it:
hub> dump("taxonomy",force=True)
Or, instead of downloading the file again, we can directly trigger the post-dump step:
hub> dump("taxonomy",steps="post")
There are 2 steps available in a dumper:
dump : will actually download files
post : will post-process downloaded files (post_dump)
By default, both run sequentially.
After typing either of these commands, logs will show some information about the uncompressing step:
DEBUG:taxonomy.hub:Creating new TaxonomyDumper instance
INFO:taxonomy_dump:success
INFO:root:untargz '/opt/slelong/Documents/Projects/biothings.species/src/data/taxonomy/20170125/taxdump.tar.gz'
The folder now contains all uncompressed files, ready to be processed by an uploader.
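Before moving on, here is a sketch of the per-file alternative mentioned above, overriding post_download instead of post_dump; it uses only the standard library and assumes the downloaded file is a .tar.gz archive:
import os
import tarfile

def post_download(self, remotefile, localfile):
    # triggered once per downloaded file: uncompress it in place if it's a tar.gz archive
    if localfile.endswith(".tar.gz"):
        with tarfile.open(localfile) as tar:
            tar.extractall(os.path.dirname(localfile))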
UniProt species dumper¶
Following the guidelines from the previous taxonomy dumper, we're now implementing a new dumper used to download the species list. There's just one file to be downloaded from ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/docs/speclist.txt. Same as before, the dumper will inherit from the FTPDumper base class. The file is not compressed, so apart from that, this dumper will look the same.
Code is available on github for further details: ee674c55bad849b43c8514fcc6b7139423c70074 for the whole commit changes, and dataload/sources/uniprot/dumper.py for the actual dumper.
Gene information dumper¶
The last dumper we have to implement will download some gene information from NCBI (ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene_info.gz). It’s very similar to the first one (we could even have merged them together).
Code is available on github: d3b3486f71e865235efd673d2f371b53eaa0bc5b for whole changes and dataload/sources/geneinfo/dumper.py for the dumper.
Additional base dumper classes¶
The previous examples utilized the FTPDumper base dumper class. The list of available base dumper classes include:
FTPDumper
- Downloads content from an FTP source
LastModifiedFTPDumper
- A wrapper over FTPDumper: one URL gives one FTPDumper instance. SRC_URLS contains a list of URLs pointing to files to download; FTP's MDTM command is used to check whether files should be downloaded. The release is generated from the last file's MDTM in SRC_URLS, and formatted according to RELEASE_FORMAT.
HTTPDumper
- Dumper using the HTTP protocol and the "requests" library
LastModifiedHTTPDumper
- Similar to LastModifiedFTPDumper, but for HTTP
WgetDumper
- Fills the self.to_dump list with dict("remote":remote_path,"local":local_path) elements. This is the todo list for the dumper
FilesystemDumper
- Works locally and copies (or moves) files to the datasource folder
DummyDumper
- Does nothing
ManualDumper
- Assists the user in dumping a resource. This will usually expect the files to be downloaded first (sometimes there's no easy way to automate this process). Once downloaded, a call to dump() will make sure everything is fine in terms of files and metadata
GoogleDriveDumper
- Dumps files from Google Drive
GitDumper
- Gets data from a git repository. The repo is stored in SRC_ROOT_FOLDER (without versioning) and then versions/releases are fetched in SRC_ROOT_FOLDER/<release>
Additional details on the available base dumper classes can be found at: https://github.com/biothings/biothings.api/blob/master/biothings/hub/dataload/dumper.py
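For instance, if a source exposed its files over plain HTTP, a hedged sketch of a dumper based on LastModifiedHTTPDumper could look like this (the source name and URL below are hypothetical):

import biothings.hub.dataload.dumper as dumper


class MySourceDumper(dumper.LastModifiedHTTPDumper):

    SRC_NAME = "mysource"                 # hypothetical source name
    SRC_ROOT_FOLDER = "./data/mysource"
    # one URL per file to download; the Last-Modified header decides
    # whether a new release should be dumped
    SRC_URLS = ["https://example.org/data/mysource.tsv"]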
Uploaders¶
Now that we have local data available, we can process it. We're going to create 3 different uploaders, one for each datasource. Each uploader will load data into MongoDB, into individual/single collections, which will then be used in the last merging step.
Before going further, we’ll first create an UploaderManager instance and register some of its commands in the hub:
import biothings.hub.dataload.uploader as uploader
# will check every 10 seconds for sources to upload
umanager = uploader.UploaderManager(poll_schedule = '* * * * * */10', job_manager=jmanager)
umanager.register_sources(dataload.__sources__)
umanager.poll()
COMMANDS = {
...
# upload commands
"um" : umanager,
"upload" : umanager.upload_src,
...
Running the hub, we'll see log statements like these:
INFO:taxonomy.hub:Found 2 resources to upload (['species', 'geneinfo'])
INFO:taxonomy.hub:Launch upload for 'species'
ERROR:taxonomy.hub:Resource 'species' needs upload but is not registered in manager
INFO:taxonomy.hub:Launch upload for 'geneinfo'
ERROR:taxonomy.hub:Resource 'geneinfo' needs upload but is not registered in manager
...
Indeed, datasources have been dumped, and a pending_to_upload flag has been set to True in src_dump. UploadManager polls this src_dump internal collection, looking for that flag. If set, it automatically runs the corresponding uploader(s). Since we haven't implemented any uploaders yet, the manager complains… Let's fix that.
Taxonomy uploader¶
The taxonomy files we downloaded need to be parsed and stored into a MongoDB collection. We won't go into too much detail regarding the actual parsing: there are two parsers, one for nodes.dmp and another for names.dmp. They yield dictionaries as the result of this parsing step. We just need to "connect" those parsers to uploaders.
Following the same approach as for dumpers, we're going to implement our first uploaders by inheriting one of the base classes available in BioThings SDK.
We have two files to parse, and data will be stored in two different MongoDB collections, so we're going to have two uploaders. Each inherits from biothings.hub.dataload.uploader.BaseSourceUploader, and the load_data method has to be implemented; this is where we "connect" the parsers.
Beside this method, another important point relates to the storage engine. load_data will, through the parser, yield documents (dictionaries). This data is processed internally by the base uploader class (BaseSourceUploader) using a storage engine. BaseSourceUploader uses biothings.hub.dataload.storage.BasicStorage as its engine. This storage inserts data in a MongoDB collection using bulk operations for better performance.
There are other storages available, depending on how data should be inserted (eg. IgnoreDuplicatedStorage will ignore any duplicated data error). While choosing a base uploader class, we need to consider which storage class it's actually using behind the scenes (an alternative is to use BaseSourceUploader and set the class attribute storage_class, such as in this uploader: biothings/dataload/uploader.py#L447).
The first uploader will take care of nodes.dmp parsing and storage.
import os

import biothings.hub.dataload.uploader as uploader

from .parser import parse_refseq_names, parse_refseq_nodes

class TaxonomyNodesUploader(uploader.BaseSourceUploader):

    main_source = "taxonomy"
    name = "nodes"

    def load_data(self, data_folder):
        nodes_file = os.path.join(data_folder, "nodes.dmp")
        self.logger.info("Load data from file '%s'" % nodes_file)
        return parse_refseq_nodes(open(nodes_file))
TaxonomyNodesUploader derives from BaseSourceUploader.
name gives the name of the collection used to store the data. If main_source is not defined, it must match SRC_NAME in the dumper's attributes.
main_source is optional and allows to define main sources and sub-sources. Since we have 2 parsers here, we're going to have 2 collections created. For this one, we want the collection named "nodes". But this parser relates to the taxonomy datasource, so we define a main source called taxonomy, which matches SRC_NAME in the dumper's attributes.
load_data() has data_folder as a parameter. It will be set accordingly, to the path of the last version dumped. That method also gets data from the parsing function parse_refseq_nodes; this is where we "connect" the parser. We just need to return the parser's result so the storage can actually store the data.
The uploader for the other parser, names.dmp, is almost the same:
class TaxonomyNamesUploader(uploader.BaseSourceUploader):

    main_source = "taxonomy"
    name = "names"

    def load_data(self, data_folder):
        names_file = os.path.join(data_folder, "names.dmp")
        self.logger.info("Load data from file '%s'" % names_file)
        return parse_refseq_names(open(names_file))
We then need to "expose" those uploaders in the taxonomy package, in dataload/sources/taxonomy/__init__.py:
from .uploader import TaxonomyNodesUploader, TaxonomyNamesUploader
Now let's try to run the hub again. We should see that the uploader manager has automatically triggered some uploads:
INFO:taxonomy.hub:Launch upload for 'taxonomy'
...
...
INFO:taxonomy.names_upload:Uploading 'names' (collection: names)
INFO:taxonomy.nodes_upload:Uploading 'nodes' (collection: nodes)
INFO:taxonomy.nodes_upload:Load data from file './data/taxonomy/20170125/nodes.dmp'
INFO:taxonomy.names_upload:Load data from file './data/taxonomy/20170125/names.dmp'
INFO:root:Uploading to the DB...
INFO:root:Uploading to the DB...
While it's running, we can check which jobs are active, using the top() command:
hub> top()
PID | SOURCE | CATEGORY | STEP | DESCRIPTION | MEM | CPU | STARTED_AT | DURATION
5795 | taxonomy.nodes | uploader | update_data | | 49.7MiB | 0.0% | 2017/01/25 14:58:40|15.49s
5796 | taxonomy.names | uploader | update_data | | 54.6MiB | 0.0% | 2017/01/25 14:58:40|15.49s
2 running job(s)
0 pending job(s), type 'top(pending)' for more
16 finished job(s), type 'top(done)' for more
We can see two uploaders running at the same time, one for each file. top(done) can also display jobs that are done, and finally top(pending) gives an overview of jobs that will be launched when a worker becomes available (this happens when more jobs are created than the number of available workers over time).
In the src_dump collection, we can see some more information about the resource and its upload processes. Two jobs were created, and we have information about the duration, log files, etc…
> db.src_dump.find({_id:"taxonomy"})
{
"_id" : "taxonomy",
"download" : {
"started_at" : ISODate("2017-01-25T13:09:26.423Z"),
"status" : "success",
"time" : "3.31s",
"logfile" : "./data/taxonomy/taxonomy_20170125_dump.log"
},
"data_folder" : "./data/taxonomy/20170125",
"release" : "20170125",
"upload" : {
"status" : "success",
"jobs" : {
"names" : {
"started_at" : ISODate("2017-01-25T14:58:40.034Z"),
"pid" : 5784,
"logfile" : "./data/taxonomy/taxonomy.names_20170125_upload.log",
"step" : "names",
"temp_collection" : "names_temp_eJUdh1te",
"status" : "success",
"time" : "26.61s",
"count" : 1552809,
"time_in_s" : 27
},
"nodes" : {
"started_at" : ISODate("2017-01-25T14:58:40.043Z"),
"pid" : 5784,
"logfile" : "./data/taxonomy/taxonomy.nodes_20170125_upload.log",
"step" : "nodes",
"temp_collection" : "nodes_temp_T5VnzRQC",
"status" : "success",
"time" : "22.4s",
"time_in_s" : 22,
"count" : 1552809
}
}
}
}
In the end, two collections were created, containing parsed data:
> db.names.count()
1552809
> db.nodes.count()
1552809
> db.names.find().limit(2)
{
"_id" : "1",
"taxid" : 1,
"other_names" : [
"all"
],
"scientific_name" : "root"
}
{
"_id" : "2",
"other_names" : [
"bacteria",
"not bacteria haeckel 1894"
],
"genbank_common_name" : "eubacteria",
"in-part" : [
"monera",
"procaryotae",
"prokaryota",
"prokaryotae",
"prokaryote",
"prokaryotes"
],
"taxid" : 2,
"scientific_name" : "bacteria"
}
> db.nodes.find().limit(2)
{ "_id" : "1", "rank" : "no rank", "parent_taxid" : 1, "taxid" : 1 }
{
"_id" : "2",
"rank" : "superkingdom",
"parent_taxid" : 131567,
"taxid" : 2
}
UniProt species uploader¶
Following the same guidelines, we're going to create another uploader, for the species file.
import os

import biothings.hub.dataload.uploader as uploader

from .parser import parse_uniprot_speclist

class UniprotSpeciesUploader(uploader.BaseSourceUploader):

    name = "uniprot_species"

    def load_data(self, data_folder):
        nodes_file = os.path.join(data_folder, "speclist.txt")
        self.logger.info("Load data from file '%s'" % nodes_file)
        return parse_uniprot_speclist(open(nodes_file))
In that case, we need only one uploader, so we just define “name” (no need to define main_source here).
We need to expose that uploader from the package, in dataload/sources/uniprot/__init__.py:
from .uploader import UniprotSpeciesUploader
Let's run this through the hub. We can use the "upload" command there (though the manager should trigger the upload by itself):
hub> upload("uniprot_species")
[1] RUN {0.0s} upload("uniprot_species")
Similar to dumpers, there are different steps we can individually call for an uploader:
data: will take care of storing data
post: calls the post_update() method once data has been inserted. Useful to post-process data or create an index, for instance
master: will register the source in the src_master collection, which is used during the merge step. The uploader method get_mapping() can optionally return an ElasticSearch mapping; it will be stored in src_master during that step. We'll see more about this later.
clean: will clean temporary collections and other leftovers…
Within the hub, we can specify these steps manually (they’re all executed by default).
hub> upload("uniprot_species",steps="clean")
Or using a list:
hub> upload("uniprot_species",steps=["data","clean"])
Gene information uploader¶
Let's move forward and implement the last uploader. The goal of this uploader is to identify whether, for a given taxonomy ID, there are existing/known genes. The file contains information about genes, and the first column is the taxid. We want to know all the taxonomy IDs present in the file, and in the merged documents, we want to add a key such as {'has_gene' : True/False}.
Obviously, we're going to have a lot of duplicates, because for one taxid we can have many genes present in the file. We have two options here: 1) remove duplicates before inserting data into the database, or 2) let the database handle the duplicates (rejecting them). Though we could process the data in memory (the processed data is rather small in the end), for demo purposes we'll go for the second option.
import os

import biothings.hub.dataload.uploader as uploader
import biothings.hub.dataload.storage as storage

from .parser import parse_geneinfo_taxid

class GeneInfoUploader(uploader.BaseSourceUploader):

    storage_class = storage.IgnoreDuplicatedStorage

    name = "geneinfo"

    def load_data(self, data_folder):
        gene_file = os.path.join(data_folder, "gene_info")
        self.logger.info("Load data from file '%s'" % gene_file)
        return parse_geneinfo_taxid(open(gene_file))
storage_class: this is the most important setting in this case; we want to use a storage that will ignore any duplicated records.
parse_geneinfo_taxid: the parsing function, yielding documents as {"_id" : "taxid"}.
The rest is close to what we already encountered. Code is available on github in dataload/sources/geneinfo/uploader.py
When running the uploader, logs show statements like these:
INFO:taxonomy.hub:Found 1 resources to upload (['geneinfo'])
INFO:taxonomy.hub:Launch upload for 'geneinfo'
INFO:taxonomy.hub:Building task: functools.partial(<bound method UploaderManager.create_and_load of <UploaderManager [3 registered]: ['geneinfo', 'species', 'taxonomy']>>, <class 'dataload.sources.gen
einfo.uploader.GeneInfoUploader'>, job_manager=<biothings.utils.manager.JobManager object at 0x7fbf5f8c69b0>)
INFO:geneinfo_upload:Uploading 'geneinfo' (collection: geneinfo)
INFO:geneinfo_upload:Load data from file './data/geneinfo/20170125/gene_info'
INFO:root:Uploading to the DB...
INFO:root:Inserted 62 records, ignoring 9938 [0.3s]
INFO:root:Inserted 15 records, ignoring 9985 [0.28s]
INFO:root:Inserted 0 records, ignoring 10000 [0.23s]
INFO:root:Inserted 31 records, ignoring 9969 [0.25s]
INFO:root:Inserted 16 records, ignoring 9984 [0.26s]
INFO:root:Inserted 4 records, ignoring 9996 [0.21s]
INFO:root:Inserted 4 records, ignoring 9996 [0.25s]
INFO:root:Inserted 1 records, ignoring 9999 [0.25s]
INFO:root:Inserted 26 records, ignoring 9974 [0.23s]
INFO:root:Inserted 61 records, ignoring 9939 [0.26s]
INFO:root:Inserted 77 records, ignoring 9923 [0.24s]
While processing data in batches, some records are inserted, while others (duplicates) are ignored and discarded. The file is quite big, so the process can take a while…
Note: should we want to implement the first option, the parsing function would build a dictionary indexed by taxid, reading the whole file and extracting the taxid from each line. The whole dict would then be returned and processed by the storage engine.
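To illustrate that first option, here is a possible sketch of such a parsing function (the function name is hypothetical, and it assumes taxid is the first tab-separated column, with a header line starting with "#"); it deduplicates taxids in memory before handing documents to the default storage:

def parse_geneinfo_taxid_unique(input_file):
    """Yield one {"_id": taxid} document per distinct taxid found in gene_info."""
    seen = set()
    for line in input_file:
        if line.startswith("#"):
            # skip the header line
            continue
        taxid = line.split("\t", 1)[0]
        if taxid not in seen:
            seen.add(taxid)
            yield {"_id": taxid}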
So far, we've defined dumpers and uploaders, and made them work together through managers defined in the hub. We're now ready to move to the last step: merging data.
Mergers¶
Merging will be the last step in our hub definition. So far we have data about species, taxonomy, and whether a taxonomy ID has known genes in NCBI. In the end, we want a collection where documents look like this:
{
_id: "9606",
authority: ["homo sapiens linnaeus, 1758"],
common_name: "man",
genbank_common_name: "human",
has_gene: true,
lineage: [9606,9605,207598,9604,314295,9526,...],
other_names: ["humans"],
parent_taxid: 9605,
rank: "species",
scientific_name: "homo sapiens",
taxid: 9606,
uniprot_name: "homo sapiens"
}
_id: the taxid, the ID used in all of our individual collections, so this key will be used to collect documents and merge them together (it's actually a requirement: documents are merged using _id as the common key).
authority, common_name, genbank_common_name, other_names, scientific_name and taxid come from taxonomy.names collection.
uniprot_name comes from species collection.
has_gene is a flag set to true, because taxid 9606 has been found in collection geneinfo.
parent_taxid and rank come from taxonomy.nodes collection.
(there can be other fields available, but basically the idea here is to merge all our individual collections…)
finally, lineage… it’s a little tricky as we need to query nodes in order to compute that field from _id and parent_taxid.
A first step would be to merge the names, nodes and species collections together. The other keys need some post-merge processing; they will be handled in a second part.
Let’s first define a BuilderManager in the hub.
import biothings.hub.databuild.builder as builder
bmanager = builder.BuilderManager(poll_schedule='* * * * * */10', job_manager=jmanager)
bmanager.configure()
bmanager.poll()
COMMANDS = {
...
# building/merging
"bm" : bmanager,
"merge" : bmanager.merge,
...
Merging configuration¶
BuilderManager uses a builder class for merging. While there are many different dumper and uploader classes, there's only one merge class (for now). The merging process is defined in a configuration collection named src_build. Usually, we have as many configurations as merged collections; in our case, we'll just define one configuration.
When running the hub with a builder manager registered, the manager will automatically create this src_build collection and add a configuration placeholder.
> db.src_build.find()
{
"_id" : "placeholder",
"name" : "placeholder",
"sources" : [ ],
"root" : [ ]
}
We’re going to use that template to create our own configuration:
_id and name are the name of the configuration (they can be different but really, _id is the one used here). We'll set these as: {"_id":"mytaxonomy", "name":"mytaxonomy"}.
sources is a list of collection names used for the merge. An element in this list can also be a regular expression matching collection names. If we had data spread across different collections, like one collection per chromosome, we could use a regex such as data_chr.*. We'll set this as: {"sources":["names", "species", "nodes", "geneinfo"]}.
root defines root datasources, that is, datasources that can initiate document creation. Sometimes, we want data to be merged only if a document previously exists in the merged collection. If root sources are defined, they are merged first, then the remaining ones in sources are merged with the existing documents. If root doesn't exist (or the list is empty), all sources can initiate document creation. root can be a list of collection names, or a negation (not a mix of both). So, for instance, if we want all datasources to be root, except source10, we can specify: "root" : ["!source10"]. Finally, all root sources must be declared in sources (root is a subset of sources). This is interesting in our case because we have taxonomy information coming from both NCBI and UniProt, but we want to make sure a document built from UniProt alone doesn't exist (that's because we need the parent_taxid field, which only exists in NCBI data, so we give priority to those sources first). So root sources are going to be names and nodes, but because we're lazy typists, we're going to set this to: {"root" : ["!species"]}
The resulting document should look like the one below. Let's save it in src_build (and also remove the placeholder, which is not useful anymore):
> conf
{
"_id" : "mytaxonomy",
"name" : "mytaxonomy",
"sources" : [
"names",
"uniprot_species",
"nodes",
"geneinfo"
],
"root" : ["!uniprot_species"]
}
> db.src_build.save(conf)
> db.src_build.remove({_id:"placeholder"})
Note: geneinfo contains only IDs; we could ignore it while merging, but we'll need it to be declared as a source when we create the index later.
Restarting the hub, we can then check that the configuration has been properly registered in the manager, ready to be used. We can list the sources specified in the configuration.
hub> bm
<BuilderManager [1 registered]: ['mytaxonomy']>
hub> bm.list_sources("mytaxonomy")
['names', 'species', 'nodes']
OK, let’s try to merge !
hub> merge("mytaxonomy")
[1] RUN {0.0s} merge("mytaxonomy")
Looking at the logs, we can see the builder will first merge the root sources:
INFO:mytaxonomy_build:Merging into target collection 'mytaxonomy_20170127_pn1ygtqp'
...
INFO:mytaxonomy_build:Sources to be merged: ['names', 'nodes', 'species', 'geneinfo']
INFO:mytaxonomy_build:Root sources: ['names', 'nodes', 'geneinfo']
INFO:mytaxonomy_build:Other sources: ['species']
INFO:mytaxonomy_build:Merging root document sources: ['names', 'nodes', 'geneinfo']
Then, once the root sources are processed, the species collection is merged on top of existing documents:
INFO:mytaxonomy_build:Merging other resources: ['species']
DEBUG:mytaxonomy_build:Documents from source 'species' will be stored only if a previous document exists with same _id
After a while, the task is done and merge has returned information about the amount of data that has been merged: 1552809 records from the names, nodes and geneinfo collections, and 25394 from species. Note: these figures show the number of records fetched from the collections, not necessarily the number actually merged. For instance, merged data from species may be less since it's not a root datasource.
hub>
[1] OK merge("mytaxonomy"): finished, [{'total_species': 25394, 'total_nodes': 1552809, 'total_names': 1552809}]
The builder creates multiple merger jobs per collection. The merged collection name is, by default, generated from the build name (mytaxonomy), and also contains a timestamp and some random chars. We can specify the merged collection name from the hub. By default, all sources defined in the configuration are merged, and we can also select one or more specific sources to merge:
hub> merge("mytaxonomy",sources="uniprot_species",target_name="test_merge")
Note: the sources parameter can also be a list of strings.
If we go back to src_build, we can find information about the different merges (or builds) we ran:
> db.src_build.find({_id:"mytaxonomy"},{build:1})
{
"_id" : "mytaxonomy",
"build" : [
…
{
"src_versions" : {
"geneinfo" : "20170125",
"taxonomy" : "20170125",
"uniprot_species" : "20170125"
},
"time_in_s" : 386,
"logfile" : "./data/logs/mytaxonomy_20170127_build.log",
"pid" : 57702,
"target_backend" : "mongo",
"time" : "6m26.29s",
"step_started_at" : ISODate("2017-01-27T11:36:47.401Z"),
"stats" : {
"total_uniprot_species" : 25394,
"total_nodes" : 1552809,
"total_names" : 1552809
},
"started_at" : ISODate("2017-01-27T11:30:21.114Z"),
"status" : "success",
"target_name" : "mytaxonomy_20170127_pn1ygtqp",
"step" : "post-merge",
"sources" : [
"uniprot_species"
]
}
We can see the auto-generated merged collection is mytaxonomy_20170127_pn1ygtqp. Let's have a look at the content (remember, the collection is in the target database, not in src):
> use tutorial
switched to db tutorial
> db.mytaxonomy_20170127_pn1ygtqp.count()
1552809
> db.mytaxonomy_20170127_pn1ygtqp.find({_id:9606})
{
"_id" : 9606,
"rank" : "species",
"parent_taxid" : 9605,
"taxid" : 9606,
"common_name" : "man",
"other_names" : [
"humans"
],
"scientific_name" : "homo sapiens",
"authority" : [
"homo sapiens linnaeus, 1758"
],
"genbank_common_name" : "human",
"uniprot_name" : "homo sapiens"
}
The collections have been properly merged. We now have to deal with the rest of the data.
Mappers¶
The next bit of data we need to merge is geneinfo. As a reminder, this collection only contains taxonomy IDs (as the _id key) which have known NCBI genes. We'll create a mapper containing this information. A mapper basically acts as an object that can pre-process documents while they are merged.
Let’s define that mapper in databuild/mapper.py
import biothings, config
biothings.config_for_app(config)

from biothings.utils.common import loadobj
import biothings.utils.mongo as mongo
import biothings.hub.databuild.mapper as mapper
# just to get the collection name
from dataload.sources.geneinfo.uploader import GeneInfoUploader

class HasGeneMapper(mapper.BaseMapper):

    def __init__(self, *args, **kwargs):
        super(HasGeneMapper, self).__init__(*args, **kwargs)
        self.cache = None

    def load(self):
        if self.cache is None:
            # this is a list containing all taxonomy _ids
            col = mongo.get_src_db()[GeneInfoUploader.name]
            self.cache = [d["_id"] for d in col.find({}, {"_id": 1})]

    def process(self, docs):
        for doc in docs:
            if doc["_id"] in self.cache:
                doc["has_gene"] = True
            else:
                doc["has_gene"] = False
            yield doc
We derive our mapper from biothings.hub.databuild.mapper.BaseMapper, which expects load and process methods to be defined. load is automatically called when the mapper is used by the builder, and process contains the main logic, iterating over documents and optionally enriching them (it can also be used to filter documents, by not yielding them). The implementation is pretty straightforward: we get and cache the data from the geneinfo collection (the whole collection is very small, less than 20,000 IDs, so it fits nicely and efficiently in memory). If a document has its _id found in the cache, we enrich it.
Once defined, we register that mapper into the builder. In bin/hub.py, we modify the way we define the builder manager:
from functools import partial

import biothings.hub.databuild.builder as builder
from databuild.mapper import HasGeneMapper

hasgene = HasGeneMapper(name="has_gene")
pbuilder = partial(builder.DataBuilder, mappers=[hasgene])
bmanager = builder.BuilderManager(
        poll_schedule='* * * * * */10',
        job_manager=jmanager,
        builder_class=pbuilder)
bmanager.configure()
bmanager.poll()
First we instantiate a mapper object and give it a name (more on this later). While creating the manager, we need to pass a builder class. The problem here is that we also have to give our mapper to that class when it's instantiated. We're using partial (from functools), which allows us to partially define the class instantiation. In the end, the builder_class parameter is expected to be a callable, which is the case with partial.
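As a quick, standalone illustration of how partial works (the build function below is a toy stand-in, not part of BioThings SDK):

from functools import partial

def build(build_name, mappers=None):
    # toy stand-in for the DataBuilder constructor
    return (build_name, mappers)

pbuilder = partial(build, mappers=["has_gene"])
# the manager can now call pbuilder("mytaxonomy") without knowing about mappers;
# the pre-bound keyword argument is passed along automatically
assert pbuilder("mytaxonomy") == ("mytaxonomy", ["has_gene"])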
Let's check whether our mapper works (restart the hub). Inside the hub, we're going to manually get a builder instance. Remember, through the SSH connection we can access the python interpreter's namespace, which is very handy when it comes to developing and debugging, as we can directly access and "play" with objects and their states.
First we get a builder instance from the manager:
hub> builder = bm["mytaxonomy"]
hub> builder
<biothings.hub.databuild.builder.DataBuilder object at 0x7f278aecf400>
Let’s check the mappers and get ours:
hub> builder.mappers
{None: <biothings.hub.databuild.mapper.TransparentMapper object at 0x7f278aecf4e0>, 'has_gene': <databuild.mapper.HasGeneMapper object at 0x7f27ac6c0a90>}
We have our has_gene mapper (that's the name we gave it). We also have a TransparentMapper. This mapper is automatically added and is used as the default mapper for any document (there has to be one…).
hub> hasgene = builder.mappers["has_gene"]
hub> len(hasgene.cache)
Error: TypeError("object of type 'NoneType' has no len()",)
Oops, the cache isn't loaded yet; we have to do it manually here (but it's done automatically during normal execution).
hub> hasgene.load()
hub> len(hasgene.cache)
19201
OK, it's ready. Let's now talk more about the mapper's name. A mapper can be applied to different sources, and we have to define which sources' data should go through that mapper. In our case, we want the names and species collections' data to go through it. In order to do that, we have to instruct the uploader with a special attribute. Let's modify the dataload.sources.species.uploader.UniprotSpeciesUploader class:
class UniprotSpeciesUploader(uploader.BaseSourceUploader):

    name = "uniprot_species"
    __metadata__ = {"mapper" : 'has_gene'}
The __metadata__ dictionary is going to be used to create a master document. That document is stored in the src_master collection (we talked about it earlier).
Let’s add this metadata to dataload.sources.taxonomy.uploader.TaxonomyNamesUploader
class TaxonomyNamesUploader(uploader.BaseSourceUploader):

    main_source = "taxonomy"
    name = "names"
    __metadata__ = {"mapper" : 'has_gene'}
Before using the builder, we need to refresh the master documents so these metadata are stored in src_master. We could trigger a new upload, or directly tell the hub to only process the master step:
hub> upload("uniprot_species",steps="master")
[1] RUN {0.0s} upload("uniprot_species",steps="master")
hub> upload("taxonomy.names",steps="master")
[1] OK upload("uniprot_species",steps="master"): finished, [None]
[2] RUN {0.0s} upload("taxonomy.names",steps="master")
(You'll notice that for taxonomy, we only trigger the upload for the sub-source names, using the "dot-notation" corresponding to "main_source.name".) Let's now have a look at those master documents:
> db.src_master.find({_id:{$in:["uniprot_species","names"]}})
{
"_id" : "names",
"name" : "names",
"timestamp" : ISODate("2017-01-26T16:21:32.546Z"),
"mapper" : "has_gene",
"mapping" : {
}
}
{
"_id" : "uniprot_species",
"name" : "uniprot_species",
"timestamp" : ISODate("2017-01-26T16:21:19.414Z"),
"mapper" : "has_gene",
"mapping" : {
}
}
We have our mapper key stored. We can now trigger a new merge (we specify the target collection name):
hub> merge("mytaxonomy",target_name="mytaxonomy_test")
[3] RUN {0.0s} merge("mytaxonomy",target_name="mytaxonomy_test")
In the logs, we can see our mapper has been detected and is used:
INFO:mytaxonomy_build:Creating merger job #1/16, to process 'names' 100000/1552809 (6.4%)
INFO:mytaxonomy_build:Found mapper '<databuild.mapper.HasGeneMapper object at 0x7f47ef3bbac8>' for source 'names'
INFO:mytaxonomy_build:Creating merger job #1/1, to process 'species' 25394/25394 (100.0%)
INFO:mytaxonomy_build:Found mapper '<databuild.mapper.HasGeneMapper object at 0x7f47ef3bbac8>' for source 'species'
Once done, we can query the merged collection to check the data:
> use tutorial
switched to db tutorial
> db.mytaxonomy_test.find({_id:9606})
{
"_id" : "9606",
"has_gene" : true,
"taxid" : 9606,
"uniprot_name" : "homo sapiens",
"other_names" : [
"humans"
],
"scientific_name" : "homo sapiens",
"authority" : [
"homo sapiens linnaeus, 1758"
],
"genbank_common_name" : "human",
"common_name" : "man"
}
OK, there's a has_gene flag that's been set. So far so good!
Post-merge process¶
We need to add lineage and parent taxid information for each of these documents. We'll implement that last part as a post-merge step, iterating over each of them. In order to do so, we need to define our own builder class and override the proper method there. Let's define it in databuild/builder.py.
import biothings.hub.databuild.builder as builder
import config

class TaxonomyDataBuilder(builder.DataBuilder):

    def post_merge(self, source_names, batch_size, job_manager):
        pass
The method we have to implement is post_merge, as seen above. We also need to change hub.py to use that builder class:
from databuild.builder import TaxonomyDataBuilder
pbuilder = partial(TaxonomyDataBuilder,mappers=[hasgene])
For now, we just added a class level in the hierarchy, so everything runs the same as before. Let's have a closer look at that post-merge process. For each document, we want to build the lineage. The information is stored in the nodes collection. For instance, taxid 9606 (homo sapiens) has a parent_taxid 9605 (homo), which has a parent_taxid 207598 (homininae), etc… In the end, the homo sapiens lineage is:
9606, 9605, 207598, 9604, 314295, 9526, 314293, 376913, 9443, 314146, 1437010, 9347, 32525, 40674, 32524, 32523, 1338369,
8287, 117571, 117570, 7776, 7742, 89593, 7711, 33511, 33213, 6072, 33208, 33154, 2759, 131567 and 1
We could recursively query the nodes collection until we reach the top of the tree, but that would be a lot of queries.
We just need taxid and parent_taxid information to build the lineage, so maybe it's possible to build a dictionary that can fit in memory. nodes has 1552809 records. A dictionary would use 2 * 1552809 * sizeof(integer) + index overhead. That's probably a few megabytes, let's assume that's ok… (note: using the pympler lib, we can actually see that the dictionary size is closer to 200MB…)
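For the record, a quick way to check that kind of assumption with pympler is to measure the dictionary once built; here is a tiny, self-contained sketch (the dictionary below is just a stand-in for the real taxid/parent_taxid cache):

from pympler import asizeof

cache = {9606: 9605, 9605: 207598}   # stand-in for the full taxid -> parent_taxid dict
print("%.1f MiB" % (asizeof.asizeof(cache) / (1024 ** 2)))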
We're going to use another mapper here, but no sources will use it; we'll just instantiate it from the post_merge method. In databuild/mapper.py, let's add another class:
from dataload.sources.taxonomy.uploader import TaxonomyNodesUploader

class LineageMapper(mapper.BaseMapper):

    def __init__(self, *args, **kwargs):
        super(LineageMapper, self).__init__(*args, **kwargs)
        self.cache = None

    def load(self):
        if self.cache is None:
            col = mongo.get_src_db()[TaxonomyNodesUploader.name]
            self.cache = {}
            [self.cache.setdefault(d["_id"], d["parent_taxid"]) for d in col.find({}, {"parent_taxid": 1})]

    def get_lineage(self, doc):
        if doc['taxid'] == doc['parent_taxid']:  # take care of node #1
            # we reached the top of the taxonomy tree
            doc['lineage'] = [doc['taxid']]
            return doc
        # initiate lineage with information we have in the current doc
        lineage = [doc['taxid'], doc['parent_taxid']]
        while lineage[-1] != 1:
            parent = self.cache[lineage[-1]]
            lineage.append(parent)
        doc['lineage'] = lineage
        return doc

    def process(self, docs):
        for doc in docs:
            doc = self.get_lineage(doc)
            yield doc
Let's use that mapper in TaxonomyDataBuilder's post_merge method. The signature is the same as the merge() method (what's actually called from the hub), but we just need the batch_size argument: we're going to grab documents from the merged collection in batches, process them and update them in batches as well. This is going to be much faster than dealing with one document at a time. To do so, we'll use the doc_feeder utility function:
from biothings.utils.mongo import doc_feeder, get_target_db
from biothings.hub.databuild.builder import DataBuilder
from biothings.hub.dataload.storage import UpsertStorage
from databuild.mapper import LineageMapper
import config
import logging

class TaxonomyDataBuilder(DataBuilder):

    def post_merge(self, source_names, batch_size, job_manager):
        # get the lineage mapper
        mapper = LineageMapper(name="lineage")
        # load the cache manually (it's not loaded automatically
        # as it's not part of an upload process)
        mapper.load()
        # create a storage to save docs back to the merged collection
        db = get_target_db()
        col_name = self.target_backend.target_collection.name
        storage = UpsertStorage(db, col_name)

        for docs in doc_feeder(self.target_backend.target_collection, step=batch_size, inbatch=True):
            docs = mapper.process(docs)
            storage.process(docs, batch_size)
Since we're using the mapper manually, we need to load the cache ourselves.
db and col_name are used to create our storage engine. The builder has an attribute called target_backend (a biothings.hub.dataload.backend.TargetDocMongoBackend object) which can be used to reach the collection we want to work with.
doc_feeder iterates over the whole collection, fetching documents in batches. inbatch=True tells the function to return data as a list (the default is a dict indexed by _id).
Those documents are processed by our mapper, setting the lineage information, and are then stored using our UpsertStorage object.
Note: post_merge actually runs within a thread, so any calls here won't block the execution (ie. won't block the asyncio event loop).
Let's run this on our merged collection. We don't want to merge everything again, so we specify the step we're interested in and the actual merged collection (target_name):
hub> merge("mytaxonomy",steps="post",target_name="mytaxonomy_test")
[1] RUN {0.0s} merge("mytaxonomy",steps="post",target_name="mytaxonomy_test")
After a while, the process is done. We can check our updated data:
> use tutorial
switched to db tutorial
> db.mytaxonomy_test.find({_id:9606})
{
"_id" : 9606,
"taxid" : 9606,
"common_name" : "man",
"other_names" : [
"humans"
],
"uniprot_name" : "homo sapiens",
"rank" : "species",
"lineage" : [9606,9605,207598,9604,...,131567,1],
"genbank_common_name" : "human",
"scientific_name" : "homo sapiens",
"has_gene" : true,
"parent_taxid" : 9605,
"authority" : [
"homo sapiens linnaeus, 1758"
]
}
OK, we have the new lineage information (truncated here for readability). The merged collection is ready to be used. It can be used, for instance, to create and send documents to an ElasticSearch database; this is what actually occurs when creating a BioThings web-service API. That step will be covered in another tutorial.
Indexers¶
Coming soon!
Full updated and maintained code for this hub is available here: https://github.com/biothings/biothings.species
Also, the taxonomy BioThings API can be queried at this URL: http://t.biothings.io
BioThings Web¶
In this tutorial we will start a Biothings API and learn to customize it, overriding the default behaviors and adding new features, using increasingly more advanced techniques step by step. In the end, you will be able to make your own Biothings API, run other production APIs, like Mygene.info, and additionally, customize and add more features to those projects.
Attention
Before starting the tutorial, you should have the biothings package installed, and have an Elasticsearch running with only one index populated with this dataset using this mapping. You may also need a JSON Formatter browser extension for the best experience following this tutorial. (For Chrome)
1. Starting an API server¶
First, assuming your Elasticsearch service is running on the default port 9200,
we can run a Biothings API with all default settings to explore the data,
simply by creating a config.py
under your project folder. After creating
the file, run python -m biothings.web
to start the API server. You should
be able to see the following console output:
[I 211130 22:21:57 launcher:28] Biothings API 0.10.0
[I 211130 22:21:57 configs:86] <module 'config' from 'C:\\Users\\Jerry\\code\\biothings.tutorial\\config.py'>
[INFO biothings.web.connections:31] <Elasticsearch([{'host': 'localhost', 'port': 9200}])>
[INFO biothings.web.connections:31] <AsyncElasticsearch([{'host': 'localhost', 'port': 9200}])>
[INFO biothings.web.applications:137] API Handlers:
[('/', <class 'biothings.web.handlers.services.FrontPageHandler'>, {}),
('/status', <class 'biothings.web.handlers.services.StatusHandler'>, {}),
('/metadata/fields/?', <class 'biothings.web.handlers.query.MetadataFieldHandler'>, {}),
('/metadata/?', <class 'biothings.web.handlers.query.MetadataSourceHandler'>, {}),
('/v1/spec/?', <class 'biothings.web.handlers.services.APISpecificationHandler'>, {}),
('/v1/doc(?:/([^/]+))?/?', <class 'biothings.web.handlers.query.BiothingHandler'>, {'biothing_type': 'doc'}),
('/v1/metadata/fields/?', <class 'biothings.web.handlers.query.MetadataFieldHandler'>, {}),
('/v1/metadata/?', <class 'biothings.web.handlers.query.MetadataSourceHandler'>, {}),
('/v1/query/?', <class 'biothings.web.handlers.query.QueryHandler'>, {})]
[INFO biothings.web.launcher:99] Server is running on "0.0.0.0:8000"...
[INFO biothings.web.connections:25] Elasticsearch Package Version: 7.13.4
[INFO biothings.web.connections:27] Elasticsearch DSL Package Version: 7.3.0
[INFO biothings.web.connections:51] localhost:9200: docker-cluster 7.9.3
Note the console log shows the API version, the config file it uses, its database connections, HTTP routes, service port, important python dependency package versions, as well as the database cluster details.
Note
The cluster detail appears as the last line, sometimes with a delay,
because it is scheduled asynchronously at start time, but executed later
after the main program has launched. The default implementation of our
application is asynchronous and non-blocking
based on asyncio
and tornado.ioloop interface.
The specific logic in this case is implemented in the biothings.web.connections
module.
Of all the information provided, note that it says the server is running on port 8000; this is the default port we use when we start a Biothings API. It means you can access the API by opening http://localhost:8000/ in your browser in most cases.
Note
If this port is occupied, you can pass the "port" parameter during startup to change it, for example, running python -m biothings.web --port=9000.
The links in the tutorial assume the service is running on the default port 8000. If you are running the service on a different port, you need to modify the URLs provided in the tutorial before opening them in the browser.
Now open the browser and access localhost:8000. We should be able to see the biothings welcome page, showing the public routes in regex format, reading like:
/
/status
/metadata/fields/?
/metadata/?
/v1/spec/?
/v1/doc(?:/([^/]+))?/?
/v1/metadata/fields/?
/v1/metadata/?
/v1/query/?
2. Exploring an API endpoint¶
The last route on the welcome page shows the URL pattern of the query API. Let’s use this pattern to access the query endpoint. Accessing http://localhost:8000/v1/query/ returns a JSON document containing 10 results from our elasticsearch index.
Let's explore some Biothings API features here, adding a query parameter "fields" to limit the fields returned by the API, and another parameter "size" to limit the number of returned documents. If you used the dataset mentioned at the start of the tutorial, accessing http://localhost:8000/v1/query?fields=symbol,alias,name&size=1 should return a document like this:
{
"took": 15,
"total": 1030,
"max_score": 1,
"hits": [
{
"_id": "1017",
"_score": 1,
"alias": [
"CDKN2",
"p33(CDK2)"
],
"name": "cyclin dependent kinase 2",
"symbol": "CDK2"
}
]
}
The most commonly used parameter is the "q" parameter. Try http://localhost:8000/v1/query?q=cdk2 and see that all the returned results contain "cdk2", the value specified for the "q" parameter.
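If you prefer to query the endpoint programmatically rather than in a browser, a small sketch using the requests library (assuming the server is running locally on the default port 8000) looks like this:

import requests

resp = requests.get(
    "http://localhost:8000/v1/query",
    params={"q": "cdk2", "fields": "symbol,name,taxid", "size": 3},
)
# print a short summary of each hit returned by the query endpoint
for hit in resp.json()["hits"]:
    print(hit["_id"], hit.get("symbol"), hit.get("name"))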
Note
For a list of the supported parameters, visit the Biothings API Specifications. The documentation for our most popular service, https://mygene.info/, also covers a lot of features available in all biothings applications. Read more on Gene Query Service and Gene Annotation Service.
3. Customizing an API through the config file¶
In the previous step, we explored documents by searching their content. Is there a way to access individual documents directly by their "_id" or other id fields? The annotation endpoint does exactly that.
By default, this endpoint is accessible through a URL pattern like this: /<ver>/doc/<_id>, where "ver" refers to the API version. In our case, if we want to access a document with an id of "1017", one of the docs shown in the previous example, we can try: http://localhost:8000/v1/doc/1017
Note
To configure an API version other than "v1" for your program, to add a prefix to all API patterns (like /api/<ver>/…), or to remove these patterns entirely, make changes in the config file to the settings prefixed with "APP", as those control the web application behavior. A web application is basically a collection of routes and settings that can be understood by a web server. See the biothings.web.settings.default source code to look at the current configuration, and refer to biothings.web.applications to see how the settings are turned into routes in different web frameworks.
In this dataset, we know the document type can be best described as “gene”s.
We can enable a widely-used feature, document type URL templating, by providing
more information to the biothings app in the config.py
file. Write the
following lines to the config file:
ES_HOST = "localhost:9200" # optional
ES_INDICES = {"gene": "<your index name>"}

ANNOTATION_DEFAULT_SCOPES = ["_id", "symbol"]
Note
The ES_HOST setting is a common parameter that you see in the config file. Although it is not making a difference here, you can configure the value of this setting to ask biothings.web to connect to a different Elasticsearch server, maybe hosted remotely on the cloud. The ANNOTATION_DEFAULT_SCOPES setting specifies the document fields we consider as the id fields. By default, only the "_id" field in the document, a must-have field in Elasticsearch, is considered the biothings id field. We additionally added the "symbol" field, to allow the user to use it to find documents in this demo API.
Restart your program and, if you pay close attention to the console log, you'll see the annotation route is now prefixed with /v1/gene. Now try the following URLs:
http://localhost:8000/v1/gene/1017
http://localhost:8000/v1/gene/CDK2
Either URL takes you straight to the document previously mentioned. Note that using the symbol field "CDK2" may yield multiple documents, because multiple documents may share the same key-value pair. This also means "symbol" may not be a good choice for the key field we want to support in the URL.
These two endpoints, annotation and query, are the pillars of the Biothings API. You can additionally customize these endpoints to work better with your data.
For example, if you think the default result returned by the query endpoint is too verbose and you want to include only limited information unless the user specifically asks for more, you can set a default value for the "fields" parameter used in the previous example. Open config.py and add:
from biothings.web.settings.default import QUERY_KWARGS
QUERY_KWARGS['*']['_source']['default'] = ['name', 'symbol', 'taxid', 'entrezgene']
Restart your program after changing the config file and visit http://localhost:8000/v1/query to see the effect of specifying default fields to return, like this:
{
"took": 9,
"total": 100,
"max_score": 1,
"hits": [
{
"_id": "1017",
"_score": 1,
"entrezgene": "1017",
"name": "cyclin dependent kinase 2",
"symbol": "CDK2",
"taxid": 9606
},
{
"_id": "12566",
"_score": 1,
"entrezgene": "12566",
"name": "cyclin-dependent kinase 2",
"symbol": "Cdk2",
"taxid": 10090
},
{
"_id": "362817",
"_score": 1,
"entrezgene": "362817",
"name": "cyclin dependent kinase 2",
"symbol": "Cdk2",
"taxid": 10116
},
...
]
}
4. Customizing an API through pipeline stages¶
In the previous example, the numbers in the "entrezgene" field are typed as strings. Let's modify the internal logic, called the query pipeline, to convert these values to integers, just to show what we can do with customization.
Note
The pipeline is one of the biothings.web.services
. It defines the intermediate
steps or stages we take to execute a query. See biothings.web.query
to learn
more about the individual stages.
Add to config.py:
ES_RESULT_TRANSFORM = "pipeline.MyFormatter"
And create a file pipeline.py
to include:
from biothings.web.query import ESResultFormatter


class MyFormatter(ESResultFormatter):

    def transform_hit(self, path, doc, options):

        if path == '' and 'entrezgene' in doc:  # root level
            try:
                doc['entrezgene'] = int(doc['entrezgene'])
            except:
                ...
Commit your changes and restart the webserver process. Run some queries and you should be able to see the “entrezgene” field now showing as integers:
{
"_id": "1017",
"_score": 1,
"entrezgene": 1017, # instead of the quoted "1017" (str)
"name": "cyclin dependent kinase 2",
"symbol": "CDK2",
"taxid": 9606
}
In this example, we made changes to the query transformation stage, controlled by the biothings.web.query.formatter.ESResultFormatter class; this is one of the three stages that define the query pipeline. The two stages coming before it are represented by biothings.web.query.engine.AsyncESQueryBackend and biothings.web.query.builder.ESQueryBuilder.
Let's try to modify the query builder stage to add another feature. We'll incorporate domain knowledge here to deliver more user-friendly search results, scoring the documents with a few rules to increase result relevancy. Additionally add to the pipeline.py file:
from biothings.web.query import ESQueryBuilder
from elasticsearch_dsl import Search


class MyQueryBuilder(ESQueryBuilder):

    def apply_extras(self, search, options):
        search = Search().query(
            "function_score",
            query=search.query,
            functions=[
                {"filter": {"term": {"name": "pseudogene"}}, "weight": "0.5"},  # downgrade
                {"filter": {"term": {"taxid": 9606}}, "weight": "1.55"},
                {"filter": {"term": {"taxid": 10090}}, "weight": "1.3"},
                {"filter": {"term": {"taxid": 10116}}, "weight": "1.1"},
            ], score_mode="first")
        return super().apply_extras(search, options)
Make sure our application can pick up the change by adding this line to config.py:
ES_QUERY_BUILDER = "pipeline.MyQueryBuilder"
Note
We wrapped our original query logic in an Elasticsearch compound query, the function score query. For more on writing python-friendly Elasticsearch queries, see the Elasticsearch DSL package, one of the dependencies used in biothings.web.
Save the file and restart the webserver process. Search for something and, if you compare with the application before, you may notice that some result rankings have changed. It is not easy to spot this change if you are not familiar with the data; visit http://localhost:8000/v1/query?q=kinase&rawquery instead and see that our code was indeed making a difference and got passed to elasticsearch, affecting the query result ranking. Note that "rawquery" is a feature in our program to intercept the raw query sent to elasticsearch for debugging.
5. Customizing an API through pipeline services¶
Taking it one step further, we can add more procedures or stages to the pipeline by overriding the pipeline class. Add to the config file:
ES_QUERY_PIPELINE = "pipeline.MyQueryPipeline"
and add the following code to pipeline.py:
from biothings.web.query import AsyncESQueryPipeline


class MyQueryPipeline(AsyncESQueryPipeline):

    async def fetch(self, id, **options):
        if id == "tutorial":
            res = {"_welcome": "to the world of biothings.api"}
            res.update(await super().fetch("1017", **options))
            return res
        res = await super().fetch(id, **options)
        return res
Now we have made ourselves a tutorial page to show what annotation results can look like. By visiting http://localhost:8000/v1/gene/tutorial, you can see what http://localhost:8000/v1/gene/1017 would typically give you, plus the additional welcome message:
{
"_welcome": "to the world of biothings.api",
"_id": "1017",
"_version": 1,
"entrezgene": 1017,
"name": "cyclin dependent kinase 2",
"symbol": "CDK2",
"taxid": 9606
}
Note
In this example, we modified the query pipeline’s “fetch” method, the one used in the annotation endpoint, to include some additional logic before executing what we would typically do. The call to the “super” function executes the typical query building, executing and formatting stages.
6. Customizing an API through the web app¶
The examples above demonstrated the customizations you can make on top of our pre-defined APIs. For the most demanding tasks, you can additionally add your own API routes to the web app.
Modify the config file as a usual first step. Declare a new route by adding:
from biothings.web.settings.default import APP_LIST
APP_LIST = [
*APP_LIST, # keep the original ones
(r"/{ver}/echo/(.+)", "handlers.EchoHandler"),
]
Let's make an echo handler that simply echoes what the user puts in the URL. Create a handlers.py and add:
from biothings.web.handlers import BaseAPIHandler


class EchoHandler(BaseAPIHandler):

    def get(self, text):
        self.write({
            "status": "ok",
            "result": text
        })
Now we have added a completely new feature, not based on any of the existing biothings offerings, which can be as simple or as complex as you need. Visiting http://localhost:8000/v1/echo/hello would give you:
{
"status": "ok",
"result": "hello"
}
in which case, the “hello” in “result” field is the input we give the application in the URL.
7. Customizing an API through the app launcher¶
Another convenient place to customize the API is in a launching module, typically called index.py, where you pass parameters to the starting function provided as biothings.web.launcher.main(). Create an index.py in your project folder:
from biothings.web.launcher import main
from tornado.web import RedirectHandler

if __name__ == '__main__':
    main([
        (r"/v2/query(.*)", RedirectHandler, {"url": "/v1/query{0}"})
    ], {
        "static_path": "static"
    })
Create another folder called "static" and add a file of random content named "file.txt" under it. In this step, we added a redirection for a later-to-launch v2 query API, which we temporarily set to redirect to the v1 API, and we passed a static file configuration that asks tornado, the default webserver we use, to serve files under the static folder we specified. The static folder is named "static" and contains only one file in this example.
Note
For more on configuring route redirections and other application features in tornado, see RedirectHandler and Application configuration.
After making the changes, visiting http://localhost:8000/v2/query/?q=cdk2 would direct you back to http://localhost:8000/v1/query/?q=cdk2, and by visiting http://localhost:8000/static/file.txt you should see the random content you previously created. Note that in this step, you should run the python launcher module directly, by calling something like python index.py, instead of running the first command we introduced. Running the launcher directly is also how we start most of our user-facing products that require complex configurations, like http://mygene.info/. Its code is publicly available at https://github.com/biothings/mygene.info under the BioThings Organization.
The End¶
Finishing this tutorial, you have completed the most common steps to customize biothings.api. The customization starts from passing a different parameter at launch time and evolves to modifying the app code at different levels. I hope you feel confident running a biothings API now, and please check out the documentation page for more details on customizing APIs.
DataTransform Module¶
A key problem when merging data from multiple data sources is finding a common identifier. To ameliorate this problem, we have written a DataTransform module to convert identifiers from one type to another. Frequently, this conversion process has multiple steps, where an identifier is converted to one or more intermediates before having its final value. To describe these steps, the user defines a graph where each node represents an identifier type and each edge represents a conversion. The module processes documents using the network to convert their identifiers to their final form.
A graph is a mathematical model describing how different things are connected. In our model, the module connects different identifier types together, and each connection is an identifier conversion or lookup process. For example, a simple graph could describe how pubchem identifiers can be converted to drugbank identifiers using MyChem.info, as sketched below.
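As a minimal sketch of that example (the drugbank.id field name on MyChem.info is an assumption here; check the API's /metadata/fields for the exact field), such a graph could be declared as:

import networkx as nx

from biothings.hub.datatransform import MyChemInfoEdge

graph = nx.DiGraph()
graph.add_node('pubchem')
graph.add_node('drugbank')
# look up the input pubchem CID in MyChem.info and read the drugbank id back
graph.add_edge('pubchem', 'drugbank',
               object=MyChemInfoEdge('pubchem.cid', 'drugbank.id'))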
Graph Definition¶
The following graph facilitates conversion from inchi to inchikey using pubchem as an intermediate:
from biothings.hub.datatransform import MongoDBEdge
import networkx as nx
graph_mychem = nx.DiGraph()
###############################################################################
# DataTransform Nodes and Edges
###############################################################################
graph_mychem.add_node('inchi')
graph_mychem.add_node('pubchem')
graph_mychem.add_node('inchikey')
graph_mychem.add_edge('inchi', 'pubchem',
object=MongoDBEdge('pubchem', 'pubchem.inchi', 'pubchem.cid'))
graph_mychem.add_edge('pubchem', 'inchikey',
object=MongoDBEdge('pubchem', 'pubchem.cid', 'pubchem.inchi_key'))
To set up a graph, one must define nodes and edges. There should be a node for each type of identifier and an edge which describes how to convert from one identifier to another. Node names can be arbitrary; the user is allowed to choose what an identifier should be called. Edge classes, however, must be defined precisely for conversion to be successful.
Edge Classes¶
The following edge classes are supported by the DataTransform module. One of these edge classes must be selected when defining an edge connecting two nodes in a graph.
MongoDBEdge¶
- class biothings.hub.datatransform.MongoDBEdge(collection_name, lookup, field, weight=1, label=None, check_index=True)¶
The MongoDBEdge uses data within a MongoDB collection to convert one identifier to another. The input identifier is used to search a collection. The output identifier values are read out of that collection:
- Parameters:
collection_name (str) – The name of the MongoDB collection.
lookup (str) – The field that will match the input identifier in the collection.
field (str) – The output identifier field that will be read out of matching documents.
weight (int) – Weights are used to prefer one path over another. The path with the lowest weight is preferred. The default weight is 1.
The example above uses the MongoDBEdge class to convert from inchi to inchikey.
MyChemInfoEdge¶
- class biothings.hub.datatransform.MyChemInfoEdge(lookup, field, weight=1, label=None, url=None)¶
The MyChemInfoEdge uses the MyChem.info API to convert identifiers.
- Parameters:
lookup (str) – The field in the API to search with the input identifier.
field (str) – The field in the API to convert to.
weight (int) – Weights are used to prefer one path over another. The path with the lowest weight is preferred. The default weight is 1.
This example graph uses the MyChemInfoEdge class to convert from pubchem to inchikey. The pubchem.cid and pubchem.inchi_key fields are returned by MyChem.info and are listed by /metadata/fields.
from biothings.hub.datatransform import MyChemInfoEdge
import networkx as nx
graph_mychem = nx.DiGraph()
###############################################################################
# DataTransform Nodes and Edges
###############################################################################
graph_mychem.add_node('pubchem')
graph_mychem.add_node('inchikey')
graph_mychem.add_edge('pubchem', 'inchikey',
                      object=MyChemInfoEdge('pubchem.cid', 'pubchem.inchi_key'))
MyGeneInfoEdge¶
- class biothings.hub.datatransform.MyGeneInfoEdge(lookup, field, weight=1, label=None, url=None)[source]¶
The MyGeneInfoEdge uses the MyGene.info API to convert identifiers.
- Parameters:
lookup (str) – The field in the API to search with the input identifier.
field (str) – The field in the API to convert to.
weight (int) – Weights are used to prefer one path over another. The path with the lowest weight is preferred. The default weight is 1.
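Although the MyChem.info graph above does not use it, a MyGeneInfoEdge is defined the same way as a MyChemInfoEdge. The sketch below is illustrative only; it assumes the MyGene.info fields 'ensembl.gene' and 'entrezgene' (both listed by MyGene.info's /metadata/fields) as the lookup and output fields for converting an Ensembl gene id to an Entrez gene id:
from biothings.hub.datatransform import MyGeneInfoEdge
import networkx as nx

graph_mygene = nx.DiGraph()
graph_mygene.add_node('ensembl')
graph_mygene.add_node('entrezgene')
# Convert an Ensembl gene id to an Entrez gene id via MyGene.info
graph_mygene.add_edge('ensembl', 'entrezgene',
                      object=MyGeneInfoEdge('ensembl.gene', 'entrezgene'))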
RegExEdge¶
- class biothings.hub.datatransform.RegExEdge(from_regex, to_regex, weight=1, label=None)[source]¶
The RegExEdge allows an identifier to be transformed using a regular expression. POSIX regular expressions are supported.
- Parameters:
from_regex (str) – The first parameter of the regular expression substitution.
to_regex (str) – The second parameter of the regular expression substitution.
weight (int) – Weights are used to prefer one path over another. The path with the lowest weight is preferred. The default weight is 1.
This example graph uses the RegExEdge class to convert from pubchem to a shorter form. The CID: prefix is removed by the regular expression substitution:
from biothings.hub.datatransform import RegExEdge
import networkx as nx
graph = nx.DiGraph()
###############################################################################
# DataTransform Nodes and Edges
###############################################################################
graph.add_node('pubchem')
graph.add_node('pubchem-short')
graph.add_edge('pubchem', 'pubchem-short',
               object=RegExEdge('CID:', ''))
Example Usage¶
A complex graph developed for use with MyChem.info is shown here. This file includes a definition of the MyChemKeyLookup class which is used to call the module on the data source. In general, the graph and class should be supplied to the user by the BioThings.api maintainers.
To call the DataTransform module on the Biothings Uploader, the following definition is used:
keylookup = MyChemKeyLookup(
        [('inchi', 'pharmgkb.inchi'),
         ('pubchem', 'pharmgkb.xrefs.pubchem.cid'),
         ('drugbank', 'pharmgkb.xrefs.drugbank'),
         ('chebi', 'pharmgkb.xrefs.chebi')])

def load_data(self, data_folder):
    input_file = os.path.join(data_folder, "drugs.tsv")
    return self.keylookup(load_data)(input_file)
The parameter passed to MyChemKeyLookup is a list of input types. The first element of each input type is a node name, which must match a node in the graph. The second element is the field, in dotstring notation, describing where the identifier is read from in a document.
The following report was generated when running the DataTransform module on PharmGKB. Reports have a section for document conversion and a section describing conversion along each edge. The document section shows which inputs were used to produce which outputs. The edge section is useful for debugging graphs, ensuring that the different conversion edges are working properly.
{
    'doc_report': {
        "('inchi', 'pharmgkb.inchi')-->inchikey": 1637,
        "('pubchem', 'pharmgkb.xrefs.pubchem.cid')-->inchikey": 46,
        "('drugbank', 'pharmgkb.xrefs.drugbank')-->inchikey": 41,
        "('drugbank', 'pharmgkb.xrefs.drugbank')-->drugbank": 25,
    },
    'edge_report': {
        'inchi-->chembl': 1109,
        'inchi-->drugbank': 319,
        'inchi-->pubchem': 209,
        'chembl-->inchikey': 1109,
        'drugbank-->inchikey': 360,
        'pubchem-->inchikey': 255,
        'drugbank-->drugbank': 25,
    },
}
As an example, the number of identifiers converted from inchi to inchikey is 1637. However, these conversions are done via intermediates. One of these intermediates is chembl, and the number of identifiers converted from inchi to chembl is 1109. Some identifiers are converted directly from pubchem and drugbank. The inchi field is used to look up several intermediates (chembl, drugbank, and pubchem), and eventually most of these intermediates are converted to inchikey.
Advanced Usage - DataTransform MDB¶
The DataTransformMDB module was written as a decorator class which is intended to be applied to the load_data function of a BioThings Uploader. This class can be sub-classed to simplify its application within a BioThings service.
- class biothings.hub.datatransform.DataTransformMDB(graph, *args, **kwargs)[source]¶
Convert document identifiers from one type to another.
The DataTransformNetworkX module was written as a decorator class which should be applied to the load_data function of a BioThings Uploader. The load_data function yields documents, which are then post-processed by the decorator to perform the 'id' key conversion.
- Parameters:
graph – nx.DiGraph (networkx 2.1) configuration graph
input_types – A list of input types of the form (identifier, field), where identifier matches a node in the graph and field is an optional dotstring field describing where the identifier should be read from (the default is '_id').
output_types (list(str)) – A priority list of identifiers to convert to. These identifiers should match nodes in the graph.
id_priority_list (list(str)) – A priority list of identifiers used to sort input and output types.
skip_on_failure (bool) – If True, documents where identifier conversion fails will be skipped in the final document list.
skip_w_regex (bool) – Do not perform conversion if the identifier matches the regular expression provided to this argument. By default, this option is disabled.
skip_on_success (bool) – If True, documents where identifier conversion succeeds will be skipped in the final document list.
idstruct_class (class) – Override an internal data structure used by this module (advanced usage)
copy_from_doc (bool) – If True, an identifier is copied from the input source document regardless of whether it matches an edge (advanced usage).
An example of how to apply this class is shown below:
keylookup = DataTransformMDB(graph, input_types, output_types,
                             skip_on_failure=False, skip_w_regex=None,
                             idstruct_class=IDStruct, copy_from_doc=False)

def load_data(self, data_folder):
    input_file = os.path.join(data_folder, "drugs.tsv")
    return self.keylookup(load_data)(input_file)
It is possible to extend the DataTransformEdge type to define custom edges. This could be useful, for example, if the user wanted to define a computation that transforms one identifier into another; an inchikey, for instance, can be computed directly by hashing the inchi identifier.
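As a rough illustration only, such a custom edge might look like the sketch below. The edge_lookup hook and the (original id, current id) pair handling are assumptions modeled on the built-in edges rather than a documented contract, and compute_inchikey is a hypothetical helper (for example backed by rdkit) that is not part of the SDK; check the DataTransformEdge base class in biothings.hub.datatransform for the actual interface before implementing one.
from biothings.hub.datatransform import DataTransformEdge

def compute_inchikey(inchi):
    # Hypothetical helper: in practice this would call a chemistry toolkit;
    # returning None means "no conversion possible".
    return None

class InchiHashEdge(DataTransformEdge):
    """Hypothetical edge computing an inchikey directly from an inchi string."""

    def edge_lookup(self, keylookup_obj, id_strct, debug=False):
        # Assumed interface: id_strct iterates over (original id, current id)
        # pairs; yield the converted pairs for the next step in the graph.
        for orig_id, inchi in id_strct:
            inchikey = compute_inchikey(inchi)
            if inchikey:
                yield orig_id, inchikey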
Document Maintainers¶
Greg Taylor (@gregtaylor)
Chunlei Wu (@chunleiwu)
biothings.web¶
Generate a customized BioThings API given a supported database.
biothings.web.launcher: Launch web applications in different environments.
biothings.web.applications: HTTP web application over data services below.
biothings.web.services & query: Data services built on top of connections.
biothings.web.connections: Elasticsearch, MongoDB and SQL database access.
Layers¶
biothings.web.launcher¶
Biothings API Launcher
In this module, we have three framework-specific launchers and a command-line utility to provide both programmatic and command-line access to start Biothings APIs.
- biothings.web.launcher.BiothingsAPILauncher¶
alias of
TornadoAPILauncher
- class biothings.web.launcher.FastAPILauncher(config=None)[source]¶
Bases:
BiothingsAPIBaseLauncher
- class biothings.web.launcher.FlaskAPILauncher(config=None)[source]¶
Bases:
BiothingsAPIBaseLauncher
- class biothings.web.launcher.TornadoAPILauncher(config=None)[source]¶
Bases:
BiothingsAPIBaseLauncher
- static use_curl()[source]¶
Use curl implementation for tornado http clients. More on https://www.tornadoweb.org/en/stable/httpclient.html
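For programmatic use, a Studio-style index.py commonly starts the default Tornado-based API through the command-line utility mentioned above. The sketch below is illustrative only; it assumes a config.py web settings module is importable from the working directory, and the exact entry point exposed by biothings.web.launcher may differ between SDK releases.
# index.py -- a minimal sketch, assuming config.py sits in the same directory
from biothings.web.launcher import main

if __name__ == "__main__":
    main()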
biothings.web.applications¶
Biothings Web Applications define the routes and handlers that a supported web framework consumes, based on a config file (typically named config.py) enhanced by biothings.web.settings.configs. The currently supported web frameworks are Tornado, Flask, and FastAPI. The biothings.web.launcher can start the compatible HTTP servers based on their interfaces, and the web applications delegate routes defined in the config file to handlers, typically in biothings.web.handlers.
Web Framework | Interface | Handlers
---|---|---
Tornado | Tornado | biothings.web.handlers.*
Flask | WSGI | biothings.web.handlers._flask
FastAPI | ASGI | biothings.web.handlers._fastapi
- biothings.web.applications.BiothingsAPI¶
alias of
TornadoBiothingsAPI
biothings.web.services¶
biothings.web.services.query¶
A Programmatic Query API supporting Biothings Query Syntax.
From an architecture perspective, biothings.web.query is one of the data services built on top of the biothings.web.connections layer; however, due to the complexity of the module, it is promoted one level in the package organization to simplify the overall folder structure. Its features are available in the biothings.web.services.query namespace via import.
biothings.web.services.health¶
biothings.web.services.metadata¶
- class biothings.web.services.metadata.BiothingHubMeta(**metadata)[source]¶
Bases:
BiothingMetaProp
- class biothings.web.services.metadata.BiothingLicenses(licenses)[source]¶
Bases:
BiothingMetaProp
- class biothings.web.services.metadata.BiothingMappings(properties)[source]¶
Bases:
BiothingMetaProp
- class biothings.web.services.metadata.BiothingsESMetadata(indices, client)[source]¶
Bases:
BiothingsMetadata
- property types¶
- class biothings.web.services.metadata.BiothingsMongoMetadata(collections, client)[source]¶
Bases:
BiothingsMetadata
- property types¶
biothings.web.services.namespace¶
biothings.web.connections¶
- biothings.web.connections.get_es_client(hosts=None, async_=False, **settings)[source]¶
Enhanced ES client initialization.
- Additionally supports these parameters:
async_: use AsyncElasticsearch instead of Elasticsearch.
aws: set up request signing and provide reasonable ES settings to access AWS OpenSearch, assuming HTTPS by default.
sniff: provide reasonable default settings to enable client-side load balancing to an ES cluster. This parameter itself is not an ES parameter.
Additionally, you can reuse connections initialized with the same parameters by getting them from the connection pools each time. Here's the connection pool interface signature:
- class biothings.web.connections._ClientPool(client_factory, async_factory, callback=None)[source]¶
Bases:
object
The module has already initialized a connection pool for each supported database. Access these pools directly instead of creating your own.
- biothings.web.connections.es = <biothings.web.connections._ClientPool object>¶
- biothings.web.connections.sql = <biothings.web.connections._ClientPool object>¶
- biothings.web.connections.mongo = <biothings.web.connections._ClientPool object>¶
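For instance, a minimal sketch of obtaining a synchronous client from this layer, assuming a local Elasticsearch instance at the illustrative address below; repeated calls with the same parameters are served from the pools described above:
from biothings.web.connections import get_es_client

# The host value is illustrative; extra **settings are passed through to the client.
client = get_es_client(hosts="http://localhost:9200")
print(client.info())  # basic cluster info, confirming the connection works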
Components¶
biothings.web.analytics¶
biothings.web.analytics.channels¶
- class biothings.web.analytics.channels.GA4Channel(measurement_id, api_secret, uid_version=1)[source]¶
Bases:
Channel
biothings.web.analytics.events¶
- class biothings.web.analytics.events.Message(dict=None, /, **kwargs)[source]¶
Bases:
Event
Logical document that can be sent through services. Processable fields: title, body, url, url_text, image, image_altext. Optionally define default field values below.
- DEFAULTS = {'image_altext': '<IMAGE>', 'title': 'Notification Message', 'url_text': 'View Details'}¶
- to_ADF()[source]¶
Generate ADF for Atlassian Jira payload. Overwrite this to build differently. https://developer.atlassian.com/cloud/jira/platform/apis/document/playground/
- to_email_payload(sendfrom, sendto)[source]¶
Build a MIMEMultipart message that can be sent as an email. https://docs.aws.amazon.com/ses/latest/DeveloperGuide/examples-send-using-smtp.html
- to_jira_payload(profile)[source]¶
Combine the notification message with the project profile to generate a Jira issue tracking ticket request payload.
- to_slack_payload()[source]¶
Generate slack webhook notification payload. https://api.slack.com/messaging/composing/layouts
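As a minimal sketch of how these events may be used, the example below builds a Message with the processable fields listed above and renders a Slack payload; the field values and URL are illustrative, and actual delivery through a channel/notifier is not shown:
from biothings.web.analytics.events import Message

msg = Message({
    "title": "Build finished",
    "body": "The latest data release is ready.",
    "url": "https://example.org/releases/latest",  # illustrative URL
    "url_text": "View release",
})
payload = msg.to_slack_payload()  # dict ready to be posted to a Slack webhook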
biothings.web.analytics.notifiers¶
biothings.web.handlers¶
biothings.web.handlers.base¶
Biothings Web Handlers
biothings.web.handlers.BaseHandler supports:
- access to the biothings namespace
- monitoring exceptions with Sentry
biothings.web.handlers.BaseAPIHandler additionally supports:
- JSON and YAML payloads in the request body
- request argument standardization
- multi-type output (json, yaml, html, msgpack)
- standardized error responses (exception -> error template)
- analytics and usage tracking (Google Analytics and AWS)
- default common HTTP headers (CORS and Cache Control)
- class biothings.web.handlers.base.BaseAPIHandler(application: Application, request: HTTPServerRequest, **kwargs: Any)[source]¶
Bases:
BaseHandler
,AnalyticsMixin
- cache = None¶
- cache_control_template = 'max-age={cache}, public'¶
- format = 'json'¶
- get_template_path()[source]¶
Override to customize template path for each handler.
By default, we use the template_path application setting. Return None to load templates relative to the calling file.
- kwargs = {'*': {'format': {'default': 'json', 'enum': ('json', 'yaml', 'html', 'msgpack'), 'type': <class 'str'>}}}¶
- name = '__base__'¶
- prepare()[source]¶
Called at the beginning of a request before get/post/etc.
Override this method to perform common initialization regardless of the request method.
Asynchronous support: Use async def or decorate this method with .gen.coroutine to make it asynchronous. If this method returns an Awaitable, execution will not proceed until the Awaitable is done. New in version 3.1: Asynchronous support.
- set_default_headers()[source]¶
Override this to set HTTP headers at the beginning of the request.
For example, this is the place to set a custom Server header. Note that setting such headers in the normal flow of request processing may not do what you want, since headers may be reset during error handling.
- write(chunk)[source]¶
Writes the given chunk to the output buffer.
To write the output to the network, use the flush() method below.
If the given chunk is a dictionary, we write it as JSON and set the Content-Type of the response to be application/json. (If you want to send JSON as a different Content-Type, call set_header after calling write().) Note that lists are not converted to JSON because of a potential cross-site security vulnerability. All JSON output should be wrapped in a dictionary. More details at http://haacked.com/archive/2009/06/25/json-hijacking.aspx/ and https://github.com/facebook/tornado/issues/1009
- write_error(status_code, **kwargs)[source]¶
from tornado.web import Finish, HTTPError
raise HTTPError(404)
raise HTTPError(404, reason="document not found")
raise HTTPError(404, None, {"id": "-1"}, reason="document not found")
-> {
    "code": 404,
    "success": False,
    "error": "document not found",
    "id": "-1"
}
biothings.web.handlers.query¶
Elasticsearch Handlers
biothings.web.handlers.BaseESRequestHandler supports (all features above and):
- access to the biothing_type attribute
- access to ES query pipeline stages
- pretty printing of elasticsearch exceptions
- the common control option out_format
Subclasses:
- biothings.web.handlers.MetadataSourceHandler
- biothings.web.handlers.MetadataFieldHandler
- myvariant.web.beacon.BeaconHandler
biothings.web.handlers.ESRequestHandler supports (all features above and):
- common control options (raw, rawquery)
- common transform options (dotfield, always_list, ...)
- query pipeline customization hooks
- single queries through GET
- multiple queries through POST
Subclasses:
- biothings.web.handlers.BiothingHandler
- biothings.web.handlers.QueryHandler
- class biothings.web.handlers.query.BaseQueryHandler(application: Application, request: HTTPServerRequest, **kwargs: Any)[source]¶
Bases:
BaseAPIHandler
- prepare()[source]¶
Called at the beginning of a request before get/post/etc.
Override this method to perform common initialization regardless of the request method.
Asynchronous support: Use async def or decorate this method with .gen.coroutine to make it asynchronous. If this method returns an Awaitable, execution will not proceed until the Awaitable is done. New in version 3.1: Asynchronous support.
- write(chunk)[source]¶
Writes the given chunk to the output buffer.
To write the output to the network, use the flush() method below.
If the given chunk is a dictionary, we write it as JSON and set the Content-Type of the response to be application/json. (If you want to send JSON as a different Content-Type, call set_header after calling write().) Note that lists are not converted to JSON because of a potential cross-site security vulnerability. All JSON output should be wrapped in a dictionary. More details at http://haacked.com/archive/2009/06/25/json-hijacking.aspx/ and https://github.com/facebook/tornado/issues/1009
- class biothings.web.handlers.query.BiothingHandler(application: Application, request: HTTPServerRequest, **kwargs: Any)[source]¶
Bases:
BaseQueryHandler
Biothings Annotation Endpoint
URL pattern examples:
/{pre}/{ver}/{typ}/? /{pre}/{ver}/{typ}/([^/]+)/?
queries a term against a pre-determined field that represents the id of a document, like _id and dbsnp.rsid
GET -> {…} or [{…}, …] POST -> [{…}, …]
- name = 'annotation'¶
- class biothings.web.handlers.query.MetadataFieldHandler(application: Application, request: HTTPServerRequest, **kwargs: Any)[source]¶
Bases:
BaseQueryHandler
GET /metadata/fields
- kwargs = {'*': {'format': {'default': 'json', 'enum': ('json', 'yaml', 'html', 'msgpack'), 'type': <class 'str'>}}, 'GET': {'prefix': {'default': None, 'type': <class 'str'>}, 'raw': {'default': False, 'type': <class 'bool'>}, 'search': {'default': None, 'type': <class 'str'>}}}¶
- name = 'fields'¶
- class biothings.web.handlers.query.MetadataSourceHandler(application: Application, request: HTTPServerRequest, **kwargs: Any)[source]¶
Bases:
BaseQueryHandler
GET /metadata
- kwargs = {'*': {'format': {'default': 'json', 'enum': ('json', 'yaml', 'html', 'msgpack'), 'type': <class 'str'>}}, 'GET': {'dev': {'default': False, 'type': <class 'bool'>}, 'raw': {'default': False, 'type': <class 'bool'>}}}¶
- name = 'metadata'¶
- class biothings.web.handlers.query.QueryHandler(application: Application, request: HTTPServerRequest, **kwargs: Any)[source]¶
Bases:
BaseQueryHandler
Biothings Query Endpoint
URL pattern examples:
/{pre}/{ver}/{typ}/query/? /{pre}/{ver}//query/?
GET -> {…} POST -> [{…}, …]
- name = 'query'¶
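To make these URL patterns concrete, a deployed BioThings API can be queried with any HTTP client. The sketch below uses MyGene.info (version prefix v3) purely as an example; adjust the host, version prefix, query terms, and scopes to your own API:
import requests

# Single query through GET
resp = requests.get("https://mygene.info/v3/query", params={"q": "symbol:cdk2"})
print(resp.json()["hits"][0]["_id"])

# Multiple queries through POST
resp = requests.post("https://mygene.info/v3/query",
                     data={"q": "1017,1018", "scopes": "entrezgene"})
print([hit.get("symbol") for hit in resp.json()])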
biothings.web.handlers.services¶
- class biothings.web.handlers.services.APISpecificationHandler(application: Application, request: HTTPServerRequest, **kwargs: Any)[source]¶
Bases:
BaseAPIHandler
- class biothings.web.handlers.services.FrontPageHandler(application: Application, request: HTTPServerRequest, **kwargs: Any)[source]¶
Bases:
BaseHandler
biothings.web.options¶
biothings.web.options.manager¶
Request Argument Standardization
- class biothings.web.options.manager.Converter(**kwargs)[source]¶
Bases:
object
A generic HTTP request argument processing unit. Only perform one level of validation at this moment. The strict switch controls the type conversion rules.
- class biothings.web.options.manager.Existentialist(defdict)[source]¶
Bases:
object
Describes the requirement of the existence of an argument. {
    "default": <object>,
    "required": <bool>,
}
- class biothings.web.options.manager.FormArgCvter(**kwargs)[source]¶
Bases:
Converter
Dedicated argument converter for HTTP body arguments. Additionally supports JSON serialization format as values. Corresponds to arguments received in tornado from
RequestHandler.get_body_argument
- class biothings.web.options.manager.JsonArgCvter(**kwargs)[source]¶
Bases:
Converter
Dedicated argument converter for JSON HTTP bodies. Here it is used for dict JSON objects, with their first-level keys considered as parameters and their values considered as arguments to process.
May correspond to this tornado implementation: https://www.tornadoweb.org/en/stable/web.html#input
- class biothings.web.options.manager.Locator(defdict)[source]¶
Bases:
object
Describes the location of an argument in ReqArgs. {
    "keyword": <str>,
    "path": <int or str>,
    "alias": <str or [<str>, ...]>
}
- class biothings.web.options.manager.Option(*args, **kwargs)[source]¶
Bases:
UserDict
A parameter for end applications to consume. Find the value of it in the desired location.
For example: {
    "keyword": "q",
    "location": ("query", "form", "json"),
    "default": "__all__",
    "type": "str"
}
- exception biothings.web.options.manager.OptionError(reason=None, **kwargs)[source]¶
Bases:
ValueError
- class biothings.web.options.manager.OptionSet(*args, **kwargs)[source]¶
Bases:
UserDict
A collection of options that a specific endpoint consumes. Divided into groups and by the request methods.
For example: {
    "*": {"raw": {...}, "size": {...}, "dotfield": {...}},
    "GET": {"q": {...}, "from": {...}, "sort": {...}},
    "POST": {"q": {...}, "scopes": {...}}
}
- class biothings.web.options.manager.OptionsManager(dict=None, /, **kwargs)[source]¶
Bases:
UserDict
A collection of OptionSet(s) that makes up an application. Provide an interface to setup and serialize.
Example: {
    "annotation": {"*": {...}, "GET": {...}, "POST": {...}},
    "query": {"*": {...}, "GET": {...}, "POST": {...}},
    "metadata": {"GET": {...}, "POST": {...}}
}
- class biothings.web.options.manager.PathArgCvter(**kwargs)[source]¶
Bases:
Converter
Dedicated argument converter for path arguments. Correspond to arguments received in tornado for
RequestHandler.path_args RequestHandler.path_kwargs
- class biothings.web.options.manager.QueryArgCvter(**kwargs)[source]¶
Bases:
Converter
Dedicated argument converter for url query arguments. Correspond to arguments received in tornado from
RequestHandler.get_query_argument
biothings.web.options.openapi¶
- class biothings.web.options.openapi.OpenAPIContactContext(parent)[source]¶
Bases:
_ChildContext
- ATTRIBUTE_FIELDS = {'email': 'email', 'name': 'name', 'url': 'url'}¶
- EXTENSION = True¶
- email(v)¶
Set email field
- name(v)¶
Set name field
- subclasses: MutableMapping[str, Type[_ChildContext]] = {'OpenAPIContactContext': <class 'biothings.web.options.openapi.OpenAPIContactContext'>, 'OpenAPIContext': <class 'biothings.web.options.openapi.OpenAPIContext'>, 'OpenAPIExternalDocsContext': <class 'biothings.web.options.openapi.OpenAPIExternalDocsContext'>, 'OpenAPIInfoContext': <class 'biothings.web.options.openapi.OpenAPIInfoContext'>, 'OpenAPILicenseContext': <class 'biothings.web.options.openapi.OpenAPILicenseContext'>, 'OpenAPIOperation': <class 'biothings.web.options.openapi.OpenAPIOperation'>, 'OpenAPIParameterContext': <class 'biothings.web.options.openapi.OpenAPIParameterContext'>, 'OpenAPIPathItemContext': <class 'biothings.web.options.openapi.OpenAPIPathItemContext'>, '_BaseContext': <class 'biothings.web.options.openapi._BaseContext'>, '_ChildContext': <class 'biothings.web.options.openapi._ChildContext'>, '_HasDescription': <class 'biothings.web.options.openapi._HasDescription'>, '_HasExternalDocs': <class 'biothings.web.options.openapi._HasExternalDocs'>, '_HasParameters': <class 'biothings.web.options.openapi._HasParameters'>, '_HasSummary': <class 'biothings.web.options.openapi._HasSummary'>, '_HasTags': <class 'biothings.web.options.openapi._HasTags'>}¶
- url(v)¶
Set url field
- class biothings.web.options.openapi.OpenAPIContext[source]¶
Bases:
_HasExternalDocs
- CHILD_CONTEXTS = {'info': ('OpenAPIInfoContext', 'info')}¶
- EXTENSION = True¶
- info(**kwargs)¶
Set info Create OpenAPIInfoContext and set info
- subclasses: MutableMapping[str, Type[_ChildContext]] = {'OpenAPIContactContext': <class 'biothings.web.options.openapi.OpenAPIContactContext'>, 'OpenAPIContext': <class 'biothings.web.options.openapi.OpenAPIContext'>, 'OpenAPIExternalDocsContext': <class 'biothings.web.options.openapi.OpenAPIExternalDocsContext'>, 'OpenAPIInfoContext': <class 'biothings.web.options.openapi.OpenAPIInfoContext'>, 'OpenAPILicenseContext': <class 'biothings.web.options.openapi.OpenAPILicenseContext'>, 'OpenAPIOperation': <class 'biothings.web.options.openapi.OpenAPIOperation'>, 'OpenAPIParameterContext': <class 'biothings.web.options.openapi.OpenAPIParameterContext'>, 'OpenAPIPathItemContext': <class 'biothings.web.options.openapi.OpenAPIPathItemContext'>, '_BaseContext': <class 'biothings.web.options.openapi._BaseContext'>, '_ChildContext': <class 'biothings.web.options.openapi._ChildContext'>, '_HasDescription': <class 'biothings.web.options.openapi._HasDescription'>, '_HasExternalDocs': <class 'biothings.web.options.openapi._HasExternalDocs'>, '_HasParameters': <class 'biothings.web.options.openapi._HasParameters'>, '_HasSummary': <class 'biothings.web.options.openapi._HasSummary'>, '_HasTags': <class 'biothings.web.options.openapi._HasTags'>}¶
- biothings.web.options.openapi.OpenAPIDocumentBuilder¶
alias of
OpenAPIContext
- class biothings.web.options.openapi.OpenAPIExternalDocsContext(parent)[source]¶
Bases:
_ChildContext
,_HasDescription
- ATTRIBUTE_FIELDS = {'url': 'url'}¶
- EXTENSION = True¶
- subclasses: MutableMapping[str, Type[_ChildContext]] = {'OpenAPIContactContext': <class 'biothings.web.options.openapi.OpenAPIContactContext'>, 'OpenAPIContext': <class 'biothings.web.options.openapi.OpenAPIContext'>, 'OpenAPIExternalDocsContext': <class 'biothings.web.options.openapi.OpenAPIExternalDocsContext'>, 'OpenAPIInfoContext': <class 'biothings.web.options.openapi.OpenAPIInfoContext'>, 'OpenAPILicenseContext': <class 'biothings.web.options.openapi.OpenAPILicenseContext'>, 'OpenAPIOperation': <class 'biothings.web.options.openapi.OpenAPIOperation'>, 'OpenAPIParameterContext': <class 'biothings.web.options.openapi.OpenAPIParameterContext'>, 'OpenAPIPathItemContext': <class 'biothings.web.options.openapi.OpenAPIPathItemContext'>, '_BaseContext': <class 'biothings.web.options.openapi._BaseContext'>, '_ChildContext': <class 'biothings.web.options.openapi._ChildContext'>, '_HasDescription': <class 'biothings.web.options.openapi._HasDescription'>, '_HasExternalDocs': <class 'biothings.web.options.openapi._HasExternalDocs'>, '_HasParameters': <class 'biothings.web.options.openapi._HasParameters'>, '_HasSummary': <class 'biothings.web.options.openapi._HasSummary'>, '_HasTags': <class 'biothings.web.options.openapi._HasTags'>}¶
- url(v)¶
Set url field
- class biothings.web.options.openapi.OpenAPIInfoContext(parent)[source]¶
Bases:
_ChildContext
,_HasDescription
- ATTRIBUTE_FIELDS = {'terms_of_service': 'termsOfService', 'title': 'title', 'version': 'version'}¶
- CHILD_CONTEXTS = {'contact': ('OpenAPIContactContext', 'contact'), 'license': ('OpenAPILicenseContext', 'license')}¶
- EXTENSION = True¶
- contact(**kwargs)¶
Set contact Create OpenAPIContactContext and set contact
- license(**kwargs)¶
Set license Create OpenAPILicenseContext and set license
- subclasses: MutableMapping[str, Type[_ChildContext]] = {'OpenAPIContactContext': <class 'biothings.web.options.openapi.OpenAPIContactContext'>, 'OpenAPIContext': <class 'biothings.web.options.openapi.OpenAPIContext'>, 'OpenAPIExternalDocsContext': <class 'biothings.web.options.openapi.OpenAPIExternalDocsContext'>, 'OpenAPIInfoContext': <class 'biothings.web.options.openapi.OpenAPIInfoContext'>, 'OpenAPILicenseContext': <class 'biothings.web.options.openapi.OpenAPILicenseContext'>, 'OpenAPIOperation': <class 'biothings.web.options.openapi.OpenAPIOperation'>, 'OpenAPIParameterContext': <class 'biothings.web.options.openapi.OpenAPIParameterContext'>, 'OpenAPIPathItemContext': <class 'biothings.web.options.openapi.OpenAPIPathItemContext'>, '_BaseContext': <class 'biothings.web.options.openapi._BaseContext'>, '_ChildContext': <class 'biothings.web.options.openapi._ChildContext'>, '_HasDescription': <class 'biothings.web.options.openapi._HasDescription'>, '_HasExternalDocs': <class 'biothings.web.options.openapi._HasExternalDocs'>, '_HasParameters': <class 'biothings.web.options.openapi._HasParameters'>, '_HasSummary': <class 'biothings.web.options.openapi._HasSummary'>, '_HasTags': <class 'biothings.web.options.openapi._HasTags'>}¶
- terms_of_service(v)¶
Set termsOfService field
- title(v)¶
Set title field
- version(v)¶
Set version field
- class biothings.web.options.openapi.OpenAPILicenseContext(parent)[source]¶
Bases:
_ChildContext
- ATTRIBUTE_FIELDS = {'name': 'name', 'url': 'url'}¶
- EXTENSION = True¶
- name(v)¶
Set name field
- subclasses: MutableMapping[str, Type[_ChildContext]] = {'OpenAPIContactContext': <class 'biothings.web.options.openapi.OpenAPIContactContext'>, 'OpenAPIContext': <class 'biothings.web.options.openapi.OpenAPIContext'>, 'OpenAPIExternalDocsContext': <class 'biothings.web.options.openapi.OpenAPIExternalDocsContext'>, 'OpenAPIInfoContext': <class 'biothings.web.options.openapi.OpenAPIInfoContext'>, 'OpenAPILicenseContext': <class 'biothings.web.options.openapi.OpenAPILicenseContext'>, 'OpenAPIOperation': <class 'biothings.web.options.openapi.OpenAPIOperation'>, 'OpenAPIParameterContext': <class 'biothings.web.options.openapi.OpenAPIParameterContext'>, 'OpenAPIPathItemContext': <class 'biothings.web.options.openapi.OpenAPIPathItemContext'>, '_BaseContext': <class 'biothings.web.options.openapi._BaseContext'>, '_ChildContext': <class 'biothings.web.options.openapi._ChildContext'>, '_HasDescription': <class 'biothings.web.options.openapi._HasDescription'>, '_HasExternalDocs': <class 'biothings.web.options.openapi._HasExternalDocs'>, '_HasParameters': <class 'biothings.web.options.openapi._HasParameters'>, '_HasSummary': <class 'biothings.web.options.openapi._HasSummary'>, '_HasTags': <class 'biothings.web.options.openapi._HasTags'>}¶
- url(v)¶
Set url field
- class biothings.web.options.openapi.OpenAPIOperation(parent)[source]¶
Bases:
_ChildContext
,_HasSummary
,_HasExternalDocs
,_HasTags
,_HasDescription
,_HasParameters
- ATTRIBUTE_FIELDS = {'operation_id': 'operationId'}¶
- EXTENSION = True¶
- operation_id(v)¶
Set operationId field
- subclasses: MutableMapping[str, Type[_ChildContext]] = {'OpenAPIContactContext': <class 'biothings.web.options.openapi.OpenAPIContactContext'>, 'OpenAPIContext': <class 'biothings.web.options.openapi.OpenAPIContext'>, 'OpenAPIExternalDocsContext': <class 'biothings.web.options.openapi.OpenAPIExternalDocsContext'>, 'OpenAPIInfoContext': <class 'biothings.web.options.openapi.OpenAPIInfoContext'>, 'OpenAPILicenseContext': <class 'biothings.web.options.openapi.OpenAPILicenseContext'>, 'OpenAPIOperation': <class 'biothings.web.options.openapi.OpenAPIOperation'>, 'OpenAPIParameterContext': <class 'biothings.web.options.openapi.OpenAPIParameterContext'>, 'OpenAPIPathItemContext': <class 'biothings.web.options.openapi.OpenAPIPathItemContext'>, '_BaseContext': <class 'biothings.web.options.openapi._BaseContext'>, '_ChildContext': <class 'biothings.web.options.openapi._ChildContext'>, '_HasDescription': <class 'biothings.web.options.openapi._HasDescription'>, '_HasExternalDocs': <class 'biothings.web.options.openapi._HasExternalDocs'>, '_HasParameters': <class 'biothings.web.options.openapi._HasParameters'>, '_HasSummary': <class 'biothings.web.options.openapi._HasSummary'>, '_HasTags': <class 'biothings.web.options.openapi._HasTags'>}¶
- class biothings.web.options.openapi.OpenAPIParameterContext(parent, name: str, in_: str, required: bool)[source]¶
Bases:
_ChildContext
,_HasDescription
- ATTRIBUTE_FIELDS = {'allow_empty': 'allowEmptyValue', 'allow_reserved': 'allowReserved', 'deprecated': 'deprecated', 'explode': 'explode', 'schema': 'schema', 'style': 'style'}¶
- EXTENSION = True¶
- allow_empty(v)¶
Set allowEmptyValue field
- allow_reserved(v)¶
Set allowReserved field
- deprecated(v)¶
Set deprecated field
- explode(v)¶
Set explode field
- schema(v)¶
Set schema field
- style(v)¶
Set style field
- subclasses: MutableMapping[str, Type[_ChildContext]] = {'OpenAPIContactContext': <class 'biothings.web.options.openapi.OpenAPIContactContext'>, 'OpenAPIContext': <class 'biothings.web.options.openapi.OpenAPIContext'>, 'OpenAPIExternalDocsContext': <class 'biothings.web.options.openapi.OpenAPIExternalDocsContext'>, 'OpenAPIInfoContext': <class 'biothings.web.options.openapi.OpenAPIInfoContext'>, 'OpenAPILicenseContext': <class 'biothings.web.options.openapi.OpenAPILicenseContext'>, 'OpenAPIOperation': <class 'biothings.web.options.openapi.OpenAPIOperation'>, 'OpenAPIParameterContext': <class 'biothings.web.options.openapi.OpenAPIParameterContext'>, 'OpenAPIPathItemContext': <class 'biothings.web.options.openapi.OpenAPIPathItemContext'>, '_BaseContext': <class 'biothings.web.options.openapi._BaseContext'>, '_ChildContext': <class 'biothings.web.options.openapi._ChildContext'>, '_HasDescription': <class 'biothings.web.options.openapi._HasDescription'>, '_HasExternalDocs': <class 'biothings.web.options.openapi._HasExternalDocs'>, '_HasParameters': <class 'biothings.web.options.openapi._HasParameters'>, '_HasSummary': <class 'biothings.web.options.openapi._HasSummary'>, '_HasTags': <class 'biothings.web.options.openapi._HasTags'>}¶
- class biothings.web.options.openapi.OpenAPIPathItemContext(parent)[source]¶
Bases:
_ChildContext
,_HasSummary
,_HasDescription
,_HasParameters
- CHILD_CONTEXTS = {'delete': ('OpenAPIOperation', 'delete'), 'get': ('OpenAPIOperation', 'get'), 'head': ('OpenAPIOperation', 'head'), 'options': ('OpenAPIOperation', 'options'), 'patch': ('OpenAPIOperation', 'patch'), 'post': ('OpenAPIOperation', 'post'), 'put': ('OpenAPIOperation', 'put'), 'trace': ('OpenAPIOperation', 'trace')}¶
- EXTENSION = True¶
- delete(**kwargs)¶
Set delete Create OpenAPIOperation and set delete
- get(**kwargs)¶
Set get Create OpenAPIOperation and set get
- head(**kwargs)¶
Set head Create OpenAPIOperation and set head
- http_method = 'trace'¶
- options(**kwargs)¶
Set options Create OpenAPIOperation and set options
- patch(**kwargs)¶
Set patch Create OpenAPIOperation and set patch
- post(**kwargs)¶
Set post Create OpenAPIOperation and set post
- put(**kwargs)¶
Set put Create OpenAPIOperation and set put
- subclasses: MutableMapping[str, Type[_ChildContext]] = {'OpenAPIContactContext': <class 'biothings.web.options.openapi.OpenAPIContactContext'>, 'OpenAPIContext': <class 'biothings.web.options.openapi.OpenAPIContext'>, 'OpenAPIExternalDocsContext': <class 'biothings.web.options.openapi.OpenAPIExternalDocsContext'>, 'OpenAPIInfoContext': <class 'biothings.web.options.openapi.OpenAPIInfoContext'>, 'OpenAPILicenseContext': <class 'biothings.web.options.openapi.OpenAPILicenseContext'>, 'OpenAPIOperation': <class 'biothings.web.options.openapi.OpenAPIOperation'>, 'OpenAPIParameterContext': <class 'biothings.web.options.openapi.OpenAPIParameterContext'>, 'OpenAPIPathItemContext': <class 'biothings.web.options.openapi.OpenAPIPathItemContext'>, '_BaseContext': <class 'biothings.web.options.openapi._BaseContext'>, '_ChildContext': <class 'biothings.web.options.openapi._ChildContext'>, '_HasDescription': <class 'biothings.web.options.openapi._HasDescription'>, '_HasExternalDocs': <class 'biothings.web.options.openapi._HasExternalDocs'>, '_HasParameters': <class 'biothings.web.options.openapi._HasParameters'>, '_HasSummary': <class 'biothings.web.options.openapi._HasSummary'>, '_HasTags': <class 'biothings.web.options.openapi._HasTags'>}¶
- trace(**kwargs)¶
Set trace Create OpenAPIOperation and set trace
biothings.web.query¶
biothings.web.query.builder¶
Biothings Query Builder
Translate the biothings query language into that of the database. The interface contains a query term (q) and query options.
Depending on the underlying database choice, the data type of the query term and query options vary. At a minimum, a query builder should support:
- q: str, a query term. When not provided, always perform a match-all query; when provided as an empty string, always match none.
- options: dotdict, optional query options:
  - scopes: list[str], the fields in which to look for the query term. The meaning of scopes being an empty list, None, or not provided is controlled by specific class implementations or left undefined.
  - _source: list[str], fields to return in the result.
  - size: int, maximum number of hits to return.
  - from_: int, starting index of results to return.
  - sort: str, customized sort keys for the result list.
  - aggs: str, customized aggregation string.
  - post_filter: str, when provided, the search hits are filtered after the aggregations are calculated.
  - facet_size: int, maximum number of agg results.
- class biothings.web.query.builder.ESQueryBuilder(user_query=None, scopes_regexs=(), scopes_default=('_id',), allow_random_query=True, allow_nested_query=False, metadata=None)[source]¶
Bases:
object
Build an Elasticsearch query with elasticsearch-dsl.
- apply_extras(search, options)[source]¶
Process non-query options and customize their behaviors. Customized aggregation syntax string is translated here.
- build(q=None, **options)[source]¶
Build a query according to q and options. This is the public method called by API handlers.
- Regarding scopes:
  scopes: [str] nonempty, match query.
  scopes: NoneType or [], no scope, so query string query.
- Additionally supports these options:
  explain: include ES scoring information.
  userquery: customized function to interpret q.
- Additional keywords are passed through as ES keywords, for example: 'explain', 'version', ...
- Multi-search is supported when q is a list; all queries are built individually and then sent in one request.
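As a minimal sketch of the builder in isolation (query terms and scopes are illustrative; under the default constructor settings shown above, build() returns an elasticsearch-dsl Search object):
from biothings.web.query.builder import ESQueryBuilder

builder = ESQueryBuilder()

# No scopes -> a query string query over all fields
search = builder.build("cdk2")

# Non-empty scopes -> a match query against the listed fields
search = builder.build("cdk2", scopes=["symbol"], size=5)
print(search.to_dict())  # the raw Elasticsearch query body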
- class biothings.web.query.builder.Group(term, scopes)¶
Bases:
tuple
Create new instance of Group(term, scopes)
- scopes¶
Alias for field number 1
- term¶
Alias for field number 0
- class biothings.web.query.builder.QStringParser(default_scopes=('_id',), patterns=(('(?P<scope>\\w+):(?P<term>[^:]+)', ()),), gpnames=('term', 'scope'))[source]¶
Bases:
object
biothings.web.query.engine¶
Search Execution Engine
Take the output of the query builder and feed it to the corresponding database engine. This stage typically resolves the db destination from a biothing_type and applies presentation and/or networking parameters.
Example:
>>> from biothings.web.query import ESQueryBackend
>>> from elasticsearch import Elasticsearch
>>> from elasticsearch_dsl import Search
>>> backend = ESQueryBackend(Elasticsearch())
>>> backend.execute(Search().query("match", _id="1017"))
>>> _["hits"]["hits"][0]["_source"].keys()
dict_keys(['taxid', 'symbol', 'name', ... ])
- class biothings.web.query.engine.AsyncESQueryBackend(client, indices=None, scroll_time='1m', scroll_size=1000, multisearch_concurrency=5, total_hits_as_int=True)[source]¶
Bases:
ESQueryBackend
Execute an Elasticsearch query
- async execute(query, **options)[source]¶
Execute the corresponding query. Must return an awaitable. May override to add more. Handle uncaught exceptions.
- Options:
fetch_all: also return a scroll_id for this query (default: false)
biothing_type: which type's corresponding indices to query (default in config.py)
- exception biothings.web.query.engine.EndScrollInterrupt[source]¶
Bases:
ResultInterrupt
- exception biothings.web.query.engine.RawResultInterrupt(data)[source]¶
Bases:
ResultInterrupt
biothings.web.query.formatter¶
Search Result Formatter
Transform the raw query result into consumption-friendly structures by possibly removing from, adding to, and/or flattening the raw response from the database engine for one or more individual queries.
- class biothings.web.query.formatter.Doc(dict=None, /, **kwargs)[source]¶
Bases:
FormatterDict
- {
    "_id": ...,
    "_score": ...,
    ...
}
- class biothings.web.query.formatter.ESResultFormatter(licenses=None, license_transform=None, field_notes=None, excluded_keys=())[source]¶
Bases:
ResultFormatter
Class to transform the results of the Elasticsearch query generated earlier in the pipeline. This contains the functions to extract the final document from the Elasticsearch query result, as well as the code to flatten a document, etc.
- transform(response, **options)[source]¶
Transform the query response to a user-friendly structure. Mainly deconstruct the elasticsearch response structure and hand over to transform_doc to apply the options below.
- Options:
# generic transformations for dictionaries
dotfield: flatten a dictionary using dotfield notation
_sorted: sort keys alphabetically in ascending order
always_list: ensure the fields specified are lists or wrapped in a list
allow_null: ensure the fields specified are present in the result; the fields may be provided as type None or [].
# additional multisearch result transformations
template: base dict for every result, for example: {"success": true}
templates: a different base for every result, replaces the setting above
template_hit: a dict to update every positive hit result, default: {"found": true}
template_miss: a dict to update every query with no hit, default: {"found": false}
# document format and content management
biothing_type: result document type to apply customized transformation, for example adding a license field based on the document type's metadata.
one: return the individual document if there is only one hit; ignore this setting if there are multiple hits; return None if there is no hit. This option is not effective when aggregation results are also returned in the same query.
native: bool, if the returned result is in python primitive types.
version: bool, if the _version field is kept.
score: bool, if the _score field is kept.
with_total: bool, if True, the response will include max_total documents and a message telling how many query terms return more than the max_size of hits. The default is False. An example when with_total is True:
    {
        'max_total': 100,
        'msg': '12 query terms return > 1000 hits, using from=1000 to retrieve the remaining hits',
        'hits': [...]
    }
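For instance, a minimal sketch of applying these options; here es_response is a placeholder for the raw Elasticsearch response dict produced by the engine step described above, so this illustrates the call shape rather than a runnable end-to-end example:
from biothings.web.query.formatter import ESResultFormatter

formatter = ESResultFormatter()
# es_response: the raw response dict returned by the query engine (placeholder here)
result = formatter.transform(es_response, dotfield=True)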
- transform_aggs(res)[source]¶
Transform the aggregations field and make it more presentable. For example, these are the fields of a two level nested aggregations:
aggregations.<term>.doc_count_error_upper_bound
aggregations.<term>.sum_other_doc_count
aggregations.<term>.buckets.key
aggregations.<term>.buckets.key_as_string
aggregations.<term>.buckets.doc_count
aggregations.<term>.buckets.<nested_term>.* (recursive)
After the transformation, we’ll have:
facets.<term>._type
facets.<term>.total
facets.<term>.missing
facets.<term>.other
facets.<term>.terms.count
facets.<term>.terms.term
facets.<term>.terms.<nested_term>.* (recursive)
Note the first level key change doesn’t happen here.
- transform_hit(path, doc, options)[source]¶
Transform an individual search hit result. By default add licenses for the configured fields.
If a source has a license URL in its metadata, add a "_license" key to the corresponding fields. Dot-field representation and field aliases are supported.
If we have the following settings in web_config.py:
    LICENSE_TRANSFORM = {
        "exac_nontcga": "exac",
        "snpeff.ann": "snpeff",
    }
Then GET /v1/variant/chr6:g.38906659G>A should look like:
    {
        "exac": {
            "_license": "http://bit.ly/2H9c4hg",
            "af": 0.00002471
        },
        "exac_nontcga": {
            "_license": "http://bit.ly/2H9c4hg",    <---
            "af": 0.00001883
        },
        ...
    }
And GET /v1/variant/chr14:g.35731936G>C could look like:
    {
        "snpeff": {
            "_license": "http://bit.ly/2suyRKt",
            "ann": [
                {"_license": "http://bit.ly/2suyRKt",    <---
                 "effect": "intron_variant",
                 "feature_id": "NM_014672.3", ...},
                {"_license": "http://bit.ly/2suyRKt",    <---
                 "effect": "intron_variant",
                 "feature_id": "NM_001256678.1", ...},
                ...
            ]
        },
        ...
    }
The arrow-marked fields would not exist without the setting lines.
- transform_mapping(mapping, prefix=None, search=None)[source]¶
Transform Elasticsearch mapping definition to user-friendly field definitions metadata results.
- trasform_jmespath(path: str, doc, options) None [source]¶
Transform any target field in doc using jmespath query syntax. The jmespath query parameter value should have the pattern of “<target_list_fieldname>|<jmespath_query_expression>” <target_list_fieldname> can be any sub-field of the input doc using dot notation, e.g. “aaa.bbb”.
If empty or “.”, it will be the root field.
The flexible jmespath syntax allows filtering/transforming any nested objects in the input doc on the fly. The output of the jmespath transformation will then be used to replace the original target field value. Examples:
- filtering an array sub-field
jmespath=tags|[?name==`Metadata`]    # filter tags array by name field
jmespath=aaa.bbb|[?(sub_a==`val_a`||sub_a==`val_aa`)%26%26sub_b==`val_b`]    # use %26%26 for &&
- static traverse(obj, leaf_node=False)¶
Output path-dictionary pairs. For example, input:
    {
        'exac_nontcga': {'af': 0.00001883},
        'gnomad_exome': {'af': {'af': 0.0000119429, 'af_afr': 0.000123077}},
        'snpeff': {'ann': [{'effect': 'intron_variant', 'feature_id': 'NM_014672.3'},
                           {'effect': 'intron_variant', 'feature_id': 'NM_001256678.1'}]}
    }
will be translated to a generator:
    (
        ("exac_nontcga", {"af": 0.00001883}),
        ("gnomad_exome.af", {"af": 0.0000119429, "af_afr": 0.000123077}),
        ("gnomad_exome", {"af": {"af": 0.0000119429, "af_afr": 0.000123077}}),
        ("snpeff.ann", {"effect": "intron_variant", "feature_id": "NM_014672.3"}),
        ("snpeff.ann", {"effect": "intron_variant", "feature_id": "NM_001256678.1"}),
        ("snpeff.ann", [{ ... }, { ... }]),
        ("snpeff", {"ann": [{ ... }, { ... }]}),
        ('', {'exac_nontcga': {...}, 'gnomad_exome': {...}, 'snpeff': {...}})
    )
or when traversing leaf nodes:
    (
        ('exac_nontcga.af', 0.00001883),
        ('gnomad_exome.af.af', 0.0000119429),
        ('gnomad_exome.af.af_afr', 0.000123077),
        ('snpeff.ann.effect', 'intron_variant'),
        ('snpeff.ann.feature_id', 'NM_014672.3'),
        ('snpeff.ann.effect', 'intron_variant'),
        ('snpeff.ann.feature_id', 'NM_001256678.1')
    )
- class biothings.web.query.formatter.Hits(dict=None, /, **kwargs)[source]¶
Bases:
FormatterDict
- {
    "total": ...,
    "hits": [
        { ... },
        { ... },
        ...
    ]
}
- class biothings.web.query.formatter.MongoResultFormatter[source]¶
Bases:
ResultFormatter
- class biothings.web.query.formatter.SQLResultFormatter[source]¶
Bases:
ResultFormatter
biothings.web.query.pipeline¶
- class biothings.web.query.pipeline.AsyncESQueryPipeline(builder, backend, formatter, **settings)[source]¶
Bases:
QueryPipeline
- class biothings.web.query.pipeline.ESQueryPipeline(builder=None, backend=None, formatter=None, *args, **kwargs)[source]¶
Bases:
QueryPipeline
- class biothings.web.query.pipeline.MongoQueryPipeline(builder, backend, formatter, **settings)[source]¶
Bases:
QueryPipeline
- class biothings.web.query.pipeline.QueryPipeline(builder, backend, formatter, **settings)[source]¶
Bases:
object
- exception biothings.web.query.pipeline.QueryPipelineException(code: int = 500, summary: str = '', details: object = None)[source]¶
Bases:
Exception
- code: int = 500¶
- details: object = None¶
- summary: str = ''¶
- exception biothings.web.query.pipeline.QueryPipelineInterrupt(data)[source]¶
Bases:
QueryPipelineException
- class biothings.web.query.pipeline.SQLQueryPipeline(builder, backend, formatter, **settings)[source]¶
Bases:
QueryPipeline
biothings.web.settings¶
biothings.web.settings.configs¶
- class biothings.web.settings.configs.ConfigModule(config=None, parent=None, validators=(), **kwargs)[source]¶
Bases:
object
A wrapper for the settings that configure the web API.
Environment variables can override settings of the same names.
Default values are defined in biothings.web.settings.default.
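As an illustration only, a small config.py might override a few of those defaults. The setting names below follow biothings.web.settings.default (ES_HOST, ES_INDICES, API_VERSION), but confirm the exact names and expected formats against that module for your SDK version:
# config.py -- a minimal sketch of a web settings module
ES_HOST = "http://localhost:9200"           # Elasticsearch backend to serve from
ES_INDICES = {"gene": "mygene_allspecies"}  # biothing_type -> index name (illustrative)
API_VERSION = "v1"                          # URL version prefix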
biothings.web.settings.default¶
Biothings Web Settings Default
biothings.web.settings.validators¶
biothings.web.templates¶
The "templates" folder stores HTML templates for native biothings web handlers and is structured as a dummy module to facilitate location resolution. Typically, biothings.web.templates.__path__[0] returns a string representing the location of this folder on the file system. This folder is intended to be used internally by the SDK developer.
biothings.tests¶
biothings.tests.hub¶
biothings.tests.web¶
Biothings Test Helpers
There are two types of test classes that provide utilities to three types of test cases, developed in the standalone apps.
- The two types of test classes are:
BiothingsWebTest, which targets a running web server.
BiothingsWebAppTest, which targets a web server config file.
To further illustrate: any biothings web application typically conforms to the following layered architecture:
Layer 3: A web server that implements the behaviors defined below.
Layer 2: A config file that defines how to serve data from ES.
Layer 1: An Elasticsearch server with data.
To explain the differences between the two types of test classes in the context of the layered design described above:
BiothingsWebTest targets an existing Layer 3 endpoint.
BiothingsWebAppTest targets layer 2 and runs its own layer 3.
Note that no utility is provided to talk directly to layer 1.
The above describes the Python structures provided as programming utilities. On the other hand, there are three types of use cases, or testing objectives:
- L3 Data test, which is aimed to test the data integrity of an API.
It subclasses BiothingsWebTest and ensures all layers are working. The data has to reside in Elasticsearch already.
- L3 Feature test, which is aimed to test the API implementation.
It makes sure the settings in the config file are reflected. These tests work on production data and require constant updates to keep the test cases in sync with the actual data. These test cases subclass BiothingsWebTest as well and also require existing production data in Elasticsearch.
- L2 Feature test, doing basically the same things as above but using a small set of data that it ingests into Elasticsearch. This is a lightweight test for development and automated testing on each new commit. It comes with the data it will ingest into ES and does not require any existing data setup to run.
To illustrate the differences in a chart:
Objectives | Class | Test Target | ES Has Data | Automated Testing Trigger
---|---|---|---|---
L3 Data Test | BiothingsWebTest | A Running API | Yes | Data Release
L3 Feature Test | BiothingsWebTest | A Running API | Yes | Data Release & New Commit
L2 Feature Test | BiothingsWebAppTest | A config module | No* | New Commit
* For L2 Feature Tests, data is defined in the test cases and is automatically ingested into Elasticsearch at the start of testing and deleted after testing finishes. The other two types of testing require existing production data on the corresponding ES servers.
In development, it is certainly possible for a particular test case to fall under multiple test types, then the developer can use proper inheritance structures to avoid repeating the specific test case.
In terms of naming conventions, sometimes the L3 tests are grouped together and called remote tests, as they mostly target remote servers. And the L2 tests are called local tests, as they start a local server.
L3 Envs:
TEST_SCHEME TEST_PREFIX TEST_HOST TEST_CONF
L2 Envs:
TEST_KEEPDATA < Config Module Override >
- biothings.tests.web.BiothingsDataTest¶
alias of
BiothingsWebTest
- biothings.tests.web.BiothingsTestCase¶
alias of
BiothingsWebAppTest
- class biothings.tests.web.BiothingsWebAppTest(methodName: str = 'runTest')[source]¶
Bases:
BiothingsWebTestBase
,AsyncHTTPTestCase
Starts the tornado application to run tests locally. Need a config.py under the test class folder.
- TEST_DATA_DIR_NAME: str | None = None¶
- property config¶
- get_app()[source]¶
Should be overridden by subclasses to return a tornado.web.Application or other .HTTPServer callback.
- get_new_ioloop()[source]¶
Returns the .IOLoop to use for this test.
By default, a new .IOLoop is created for each test. Subclasses may override this method to return .IOLoop.current() if it is not appropriate to use a new .IOLoop in each test (for example, if there are global singletons using the default .IOLoop) or if a per-test event loop is being provided by another system (such as pytest-asyncio).
- class biothings.tests.web.BiothingsWebTest[source]¶
Bases:
BiothingsWebTestBase
- class biothings.tests.web.BiothingsWebTestBase[source]¶
Bases:
object
- get_url(path)[source]¶
Try best effort to get a full url to make a request. Return an absolute url when class var ‘host’ is defined. If not, return a path relative to the host root.
- host = ''¶
- prefix = 'v1'¶
- query(method='GET', endpoint='query', hits=True, data=None, json=None, **kwargs)[source]¶
Make a query and assert positive hits by default. Assert zero hit when hits is set to False.
- request(path, method='GET', expect=200, **kwargs)[source]¶
Use requests library to make an HTTP request. Ensure path is translated to an absolute path. Conveniently check if status code is as expected.
- scheme = 'http'¶
- static value_in_result(value, result: dict | list, key: str, case_insensitive: bool = False) bool [source]¶
Check if value is in result at specific key
Elasticsearch does not care whether a field has one or more values (arrays), so a search may return multiple values in one field. You were expecting a result of type T but now you have a List[T], which is bad. In testing, usually any one element in the list being equal to the value you're looking for is enough; you don't really care which. This helper function checks if the value is at a key, regardless of the details of nesting, so you can just do this:
assert self.value_in_result(value, result, ‘where.it.should.be’)
Caveats: case_insensitive only calls .lower() and does not take locale or Unicode handling into account.
- Parameters:
value – value to look for
result – dict or list of input, most likely from the APIs
key – dot delimited key notation
case_insensitive – for str comparisons, invoke .lower() first
- Returns:
boolean indicating whether the value is found at the key
- Raises:
TypeError – when case_insensitive set to true on unsupported types
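As an illustration of an L3 data test, the hypothetical case below targets a running MyGene.info instance; the host, prefix, document id, and expected field values are examples only, and it assumes request() returns the underlying requests response object:
from biothings.tests.web import BiothingsWebTest

class TestMyGeneData(BiothingsWebTest):
    host = "mygene.info"  # illustrative remote API
    prefix = "v3"

    def test_annotation_symbol(self):
        # request() resolves the path against host/prefix and checks the status code
        res = self.request("gene/1017").json()
        assert self.value_in_result("CDK2", res, "symbol")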
biothings.utils¶
biothings.utils.aws¶
- biothings.utils.aws.create_bucket(name, region=None, aws_key=None, aws_secret=None, acl=None, ignore_already_exists=False)[source]¶
Create an S3 bucket "name" in the optional "region". If aws_key and aws_secret are set, the S3 client will use these; otherwise it will use the default system-wide settings. "acl" defines permissions on the bucket: "private" (default), "public-read", "public-read-write" and "authenticated-read".
- biothings.utils.aws.download_s3_file(s3key, localfile=None, aws_key=None, aws_secret=None, s3_bucket=None, overwrite=False)[source]¶
- biothings.utils.aws.get_s3_file(s3key, localfile=None, return_what=False, aws_key=None, aws_secret=None, s3_bucket=None)[source]¶
- biothings.utils.aws.get_s3_file_contents(s3key, aws_key=None, aws_secret=None, s3_bucket=None) bytes [source]¶
- biothings.utils.aws.get_s3_folder(s3folder, basedir=None, aws_key=None, aws_secret=None, s3_bucket=None)[source]¶
- biothings.utils.aws.get_s3_static_website_url(s3key, aws_key=None, aws_secret=None, s3_bucket=None)[source]¶
- biothings.utils.aws.send_s3_big_file(localfile, s3key, overwrite=False, acl=None, aws_key=None, aws_secret=None, s3_bucket=None, storage_class=None)[source]¶
Multipart upload for files bigger than 5GiB.
- biothings.utils.aws.send_s3_file(localfile, s3key, overwrite=False, permissions=None, metadata=None, content=None, content_type=None, aws_key=None, aws_secret=None, s3_bucket=None, redirect=None)[source]¶
Save a local file to an S3 bucket with the given key. The bucket is set via S3_BUCKET. It also saves the local file's last-modified time in the S3 file's metadata.
- Parameters:
redirect (str) – if not None, set the redirect property of the object so it produces a 301 when accessed
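A hedged usage sketch; the local path, key, and bucket names are illustrative, and credentials fall back to the system-wide AWS configuration when aws_key/aws_secret are not passed:
from biothings.utils.aws import send_s3_file

send_s3_file("./releases/mygene_20240101.json.gz",  # local file to upload
             "releases/mygene_20240101.json.gz",    # destination S3 key
             s3_bucket="my-biothings-releases",     # illustrative bucket name
             overwrite=True)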
biothings.utils.backend¶
Backend access class.
- class biothings.utils.backend.DocBackendBase[source]¶
Bases:
object
- finalize()[source]¶
if needed, for example for bulk updates, perform flush at the end of updating. Final optimization or compacting can be done here as well.
- name = 'Undefined'¶
- property target_name¶
- property version¶
- class biothings.utils.backend.DocBackendOptions(cls, es_index=None, es_host=None, es_doc_type=None, mongo_target_db=None, mongo_target_collection=None)[source]¶
Bases:
object
- class biothings.utils.backend.DocESBackend(esidxer=None)[source]¶
Bases:
DocBackendBase
esidxer is an instance of utils.es.ESIndexer class.
- classmethod create_from_options(options)[source]¶
Function that recreates itself from a DocBackendOptions class. Probably a needless rewrite of __init__…
- finalize()[source]¶
if needed, for example for bulk updates, perform flush at the end of updating. Final optimization or compacting can be done here as well.
- mget_from_ids(ids, step=100000, only_source=True, asiter=True, **kwargs)[source]¶
ids is an id list. always return a generator
- name = 'es'¶
- query(query=None, verbose=False, step=10000, scroll='10m', only_source=True, **kwargs)[source]¶
Function that takes a query and returns an iterator to query results.
- property target_alias¶
- property target_esidxer¶
- property target_name¶
- property version¶
- class biothings.utils.backend.DocMemoryBackend(target_name=None)[source]¶
Bases:
DocBackendBase
target_dict is None or a dict.
- name = 'memory'¶
- property target_name¶
- class biothings.utils.backend.DocMongoBackend(target_db, target_collection=None)[source]¶
Bases:
DocBackendBase
target_collection is a pymongo collection object.
- count_from_ids(ids, step=100000)[source]¶
Return the count of docs matching the input ids. Normally it does not need to query in batches, but MongoDB has a BSON size limit of 16MB, so too many ids will raise a pymongo.errors.DocumentTooLarge error.
- mget_from_ids(ids, asiter=False)[source]¶
ids is an id list. The returned doc list should be in the same order as the input ids. Non-existing ids are ignored.
- name = 'mongo'¶
- property target_db¶
- property target_name¶
- update(docs, upsert=False)[source]¶
if id does not exist in the target_collection, the update will be ignored except if upsert is True
- update_diff(diff, extra=None)[source]¶
update a doc based on the diff returned from diff.diff_doc “extra” can be passed (as a dictionary) to add common fields to the updated doc, e.g. a timestamp.
- property version¶
- biothings.utils.backend.DocMongoDBBackend¶
alias of
DocMongoBackend
biothings.utils.common¶
This module contains util functions that may be shared by both BioThings data-hub and web components. In general, do not include utils depending on any third-party modules.
- class biothings.utils.common.BiothingsJSONEncoder(*, skipkeys=False, ensure_ascii=True, check_circular=True, allow_nan=True, sort_keys=False, indent=None, separators=None, default=None)[source]¶
Bases:
JSONEncoder
A class to dump Python Datetime object. json.dumps(data, cls=DateTimeJSONEncoder, indent=indent)
Constructor for JSONEncoder, with sensible defaults.
If skipkeys is false, then it is a TypeError to attempt encoding of keys that are not str, int, float or None. If skipkeys is True, such items are simply skipped.
If ensure_ascii is true, the output is guaranteed to be str objects with all incoming non-ASCII characters escaped. If ensure_ascii is false, the output can contain non-ASCII characters.
If check_circular is true, then lists, dicts, and custom encoded objects will be checked for circular references during encoding to prevent an infinite recursion (which would cause an RecursionError). Otherwise, no such check takes place.
If allow_nan is true, then NaN, Infinity, and -Infinity will be encoded as such. This behavior is not JSON specification compliant, but is consistent with most JavaScript based encoders and decoders. Otherwise, it will be a ValueError to encode such floats.
If sort_keys is true, then the output of dictionaries will be sorted by key; this is useful for regression tests to ensure that JSON serializations can be compared on a day-to-day basis.
If indent is a non-negative integer, then JSON array elements and object members will be pretty-printed with that indent level. An indent level of 0 will only insert newlines. None is the most compact representation.
If specified, separators should be an (item_separator, key_separator) tuple. The default is (', ', ': ') if indent is None and (',', ': ') otherwise. To get the most compact JSON representation, you should specify (',', ':') to eliminate whitespace.
If specified, default is a function that gets called for objects that can't otherwise be serialized. It should return a JSON encodable version of the object or raise a TypeError.
- default(o)[source]¶
Implement this method in a subclass such that it returns a serializable object for o, or calls the base implementation (to raise a TypeError).
For example, to support arbitrary iterators, you could implement default like this:
def default(self, o):
    try:
        iterable = iter(o)
    except TypeError:
        pass
    else:
        return list(iterable)
    # Let the base class default method raise the TypeError
    return JSONEncoder.default(self, o)
- class biothings.utils.common.DummyConfig(name, doc=None)[source]¶
Bases:
module
This class allows “import config” or “from biothings import config” to work without actually creating a config.py file:
import sys
from biothings.utils.common import DummyConfig

sys.modules["config"] = DummyConfig('config')
sys.modules["biothings.config"] = DummyConfig('config')
- class biothings.utils.common.LogPrint(log_f, log=1, timestamp=0)[source]¶
Bases:
object
If this class is set to sys.stdout, it will output both log_f and __stdout__. log_f is a file handler.
- biothings.utils.common.SubStr(input_string, start_string='', end_string='', include=0)[source]¶
Return the substring between start_string and end_string. If start_string is '', cut the string from the beginning of input_string. If end_string is '', cut the string to the end of input_string. If either start_string or end_string cannot be found in input_string, return ''. The end_pos is the first position of end_string after start_string; if there are multiple occurrences, cut at the first position. include=0 (default): do not include start/end_string; include=1: include start/end_string.
- biothings.utils.common.addsuffix(filename, suffix, noext=False)[source]¶
Add a suffix in front of ".extension", keeping the same extension. If noext is True, remove the extension from the filename.
- async biothings.utils.common.aiogunzipall(folder, pattern, job_manager, pinfo)[source]¶
Gunzip all files in folder matching pattern. job_manager is used for parallelisation, and pinfo is a pre-filled dict used by job_manager to report jobs in the hub (see bt.utils.manager.JobManager)
- biothings.utils.common.anyfile(infile, mode='r')[source]¶
Return a file handler with support for gzip/zip compressed files. If infile is a two-value tuple, the first one is the compressed file and the second one is the actual filename inside the compressed file, e.g. ('a.zip', 'aa.txt').
- biothings.utils.common.ask(prompt, options='YN')[source]¶
Prompt Yes or No, and return the upper case 'Y' or 'N'.
- biothings.utils.common.dump(obj, filename, protocol=4, compress='gzip')[source]¶
Save a compressed object to disk. Protocol version 4 is the default for py3.8, supported since py3.4.
- biothings.utils.common.dump2gridfs(obj, filename, db, protocol=2)[source]¶
Save a compressed (support gzip only) object to MongoDB gridfs.
- biothings.utils.common.file_newer(source, target)[source]¶
return True if source file is newer than target file.
- biothings.utils.common.filter_dict(d, keys)[source]¶
Remove keys from dict "d". "keys" is a list of strings; dotfield notation can be used to express nested keys. If a key to remove doesn't exist, it is silently ignored.
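A minimal sketch of filter_dict usage (the document and field names are hypothetical; whether the input is modified in place or a new dict is returned is not asserted here):
from biothings.utils.common import filter_dict

doc = {"_id": "1017", "symbol": "CDK2",
       "refseq": {"rna": "NM_001798", "protein": "NP_001789"}}
# dotfield notation reaches nested keys; a missing key ("taxid") is silently ignored
slim = filter_dict(doc, ["refseq.protein", "taxid"])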
- biothings.utils.common.find_classes_subclassing(mods, baseclass)[source]¶
Given a module or a list of modules, inspect and find all classes which are a subclass of the given baseclass, inside those modules
- biothings.utils.common.find_doc(k, keys)[source]¶
Used by jsonld insertion in www.api.es._insert_jsonld
- biothings.utils.common.find_value_in_doc(dotfield, value, doc)[source]¶
Explore a mixed dictionary using dotfield notation and search for a value. Values are stringified before searching. Wildcard searching is supported. The comparison is case-sensitive. Example:
X = {"a": {"b": "1"}, "x": [{"y": "3", "z": "4"}, "5"]}
Y = ["a", {"b": "1"}, {"x": [{"y": "3", "z": "4"}, "5"]}]
Z = ["a", {"b": "1"}, {"x": [{"y": "34567", "z": "4"}, "5"]}]
assert find_value_in_doc("a.b", "1", X)
assert find_value_in_doc("x.y", "3", X)
assert find_value_in_doc("x.y", "3*7", Z)
assert find_value_in_doc("x.y", "34567", Z)
assert find_value_in_doc("x", "5", Y)
assert find_value_in_doc("a.b", "c", X) is False
assert find_value_in_doc("a", "c", X) is False
- biothings.utils.common.get_compressed_outfile(filename, compress='gzip')[source]¶
Get an output file handler with the given compress method. Currently supports gzip/bz2/lzma; lzma is only available in py3.
- biothings.utils.common.get_dotfield_value(dotfield, d)[source]¶
Explore dictionary d using dotfield notation and return value. Example:
d = {"a": {"b": 1}}
get_dotfield_value("a.b", d) => 1
- biothings.utils.common.get_loop()[source]¶
Since Python 3.10, a Deprecation warning is emitted if there is no running event loop. In future Python releases, a RuntimeError will be raised instead.
Ref: https://docs.python.org/3/library/asyncio-eventloop.html#asyncio.get_event_loop
- biothings.utils.common.is_str(s)[source]¶
return True or False if input is a string or not. python3 compatible.
- biothings.utils.common.iter_n(iterable, n, with_cnt=False)[source]¶
Iterate an iterator by chunks of n. If with_cnt is True, return (chunk, cnt) each time. Ref: http://stackoverflow.com/questions/8991506/iterate-an-iterator-by-chunks-of-n-in-python
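A minimal sketch of chunked iteration with iter_n (the exact container type of each chunk is not asserted here):
from biothings.utils.common import iter_n

docs = ({"_id": i} for i in range(2500))

# process documents in chunks of 1000
for chunk in iter_n(docs, 1000):
    pass  # e.g. bulk-index the chunk

# with_cnt=True also yields a running count alongside each chunk
for chunk, cnt in iter_n(range(2500), 1000, with_cnt=True):
    pass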
- biothings.utils.common.json_encode(obj)[source]¶
Tornado-aimed json encoder, it does the same job as tornado.escape.json_encode but also deals with datetime encoding
- biothings.utils.common.json_serial(obj)[source]¶
JSON serializer for objects not serializable by default json code
- biothings.utils.common.list2dict(a_list, keyitem, alwayslist=False)[source]¶
Return a dictionary with specified keyitem as key, others as values. keyitem can be an index or a sequence of indexes. For example:
li = [['A','a',1], ['B','a',2], ['A','b',3]]
list2dict(li, 0) ---> {'A':[('a',1),('b',3)], 'B':('a',2)}
If alwayslist is True, values are always a list even there is only one item in it.
list2dict(li, 0, True)---> {'A':[('a',1),('b',3)], 'B':[('a',2),]}
- biothings.utils.common.loadobj(filename, mode='file')[source]¶
Loads a compressed object from disk file (or file-like handler) or MongoDB gridfs file (mode=’gridfs’)
obj = loadobj('data.pyobj')
obj = loadobj(('data.pyobj', mongo_db), mode='gridfs')
- biothings.utils.common.merge(x, dx)[source]¶
Merge dictionary dx (Δx) into dictionary x. If __REPLACE__ key is present in any level z in dx, z in x is replaced, instead of merged, with z in dx.
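A hedged sketch of the __REPLACE__ behavior described above (whether the marker key itself is stripped from the result is not asserted here):
from biothings.utils.common import merge

x = {"source": {"a": 1, "b": 2}}

# regular deep merge: x["source"] keeps "a" and gets "b" updated
merge(x, {"source": {"b": 3}})

# with __REPLACE__, x["source"] is replaced wholesale instead of merged
merge(x, {"source": {"__REPLACE__": True, "c": 4}})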
- biothings.utils.common.newer(t0, t1, fmt='%Y%m%d')[source]¶
t0 and t1 are timestamp strings matching the fmt pattern. Return True if t1 is newer than t0.
- biothings.utils.common.open_anyfile(infile, mode='r')[source]¶
A context manager that can be used in a "with" statement. Accepts a file handle or anything accepted by the anyfile function.
with open_anyfile('test.txt') as in_f:
    do_something()
- biothings.utils.common.open_compressed_file(filename)[source]¶
Get a read-only file handler for a compressed file. Currently supports gzip/bz2/lzma; lzma is only available in py3.
- biothings.utils.common.rmdashfr(top)[source]¶
Recursively delete dirs and files from “top” directory, then delete “top” dir
- biothings.utils.common.run_once()[source]¶
should_run_task_1 = run_once()
print(should_run_task_1()) -> True
print(should_run_task_1()) -> False
print(should_run_task_1()) -> False
print(should_run_task_1()) -> False

should_run_task_2 = run_once()
print(should_run_task_2('2a')) -> True
print(should_run_task_2('2b')) -> True
print(should_run_task_2('2a')) -> False
print(should_run_task_2('2b')) -> False
…
- biothings.utils.common.safewfile(filename, prompt=True, default='C', mode='w')[source]¶
Return a file handle in 'w' mode; use an alternative name if a file with the same name exists. If prompt is true, ask whether to overwrite, append, or change the name; otherwise, change to an available name automatically.
- biothings.utils.common.sanitize_tarfile(tar_object, directory)[source]¶
Prevent user-assisted remote attackers from overwriting arbitrary files via a ".." (dot dot) sequence in filenames in a TAR archive, an issue related to CVE-2007-4559.
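A hedged sketch of validating an archive before extraction (paths are hypothetical; the exact failure mode for unsafe members, exception vs. filtering, is not asserted here):
import tarfile
from biothings.utils.common import sanitize_tarfile

with tarfile.open("/tmp/release.tar.gz", "r:gz") as tar:
    # reject members using ".." path traversal before extracting
    sanitize_tarfile(tar, "/tmp/release")
    tar.extractall("/tmp/release")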
- biothings.utils.common.split_ids(q)[source]¶
Split the input query string into a list of ids. Any whitespace or any of "|,+" is treated as a separator, but a quoted phrase (either single or double quoted) is preserved. For more detailed rules see: http://docs.python.org/2/library/shlex.html#parsing-rules
e.g.:
>>> split_ids('CDK2 CDK3')
['CDK2', 'CDK3']
>>> split_ids('"CDK2 CDK3" CDK4')
['CDK2 CDK3', 'CDK4']
- class biothings.utils.common.splitstr[source]¶
Bases:
str
Type representing strings with space in it
- biothings.utils.common.timesofar(t0, clock=0, t1=None)[source]¶
Return a string (e.g. '3m3.42s') for the elapsed real time/CPU time since the given t0 (obtained from t0=time.time() for real time, or t0=time.clock() for CPU time).
- biothings.utils.common.traverse(obj, leaf_node=False)[source]¶
Output path-dictionary pairs. For example, input:
{
    'exac_nontcga': {'af': 0.00001883},
    'gnomad_exome': {'af': {'af': 0.0000119429, 'af_afr': 0.000123077}},
    'snpeff': {'ann': [{'effect': 'intron_variant', 'feature_id': 'NM_014672.3'},
                       {'effect': 'intron_variant', 'feature_id': 'NM_001256678.1'}]}
}
will be translated to a generator:
(
    ("exac_nontcga", {"af": 0.00001883}),
    ("gnomad_exome.af", {"af": 0.0000119429, "af_afr": 0.000123077}),
    ("gnomad_exome", {"af": {"af": 0.0000119429, "af_afr": 0.000123077}}),
    ("snpeff.ann", {"effect": "intron_variant", "feature_id": "NM_014672.3"}),
    ("snpeff.ann", {"effect": "intron_variant", "feature_id": "NM_001256678.1"}),
    ("snpeff.ann", [{ ... }, { ... }]),
    ("snpeff", {"ann": [{ ... }, { ... }]}),
    ('', {'exac_nontcga': {...}, 'gnomad_exome': {...}, 'snpeff': {...}})
)
or when traversing leaf nodes:
(
    ('exac_nontcga.af', 0.00001883),
    ('gnomad_exome.af.af', 0.0000119429),
    ('gnomad_exome.af.af_afr', 0.000123077),
    ('snpeff.ann.effect', 'intron_variant'),
    ('snpeff.ann.feature_id', 'NM_014672.3'),
    ('snpeff.ann.effect', 'intron_variant'),
    ('snpeff.ann.feature_id', 'NM_001256678.1')
)
- biothings.utils.common.uncompressall(folder)[source]¶
Try to uncompress any known archive files in folder
- biothings.utils.common.untargzall(folder, pattern='*.tar.gz')[source]¶
Gunzip and untar all *.tar.gz files in "folder".
biothings.utils.configuration¶
- class biothings.utils.configuration.ConfigAttrMeta(confmod: biothings.utils.configuration.MetaField = <factory>, section: biothings.utils.configuration.Text = <factory>, description: biothings.utils.configuration.Paragraph = <factory>, readonly: biothings.utils.configuration.Flag = <factory>, hidden: biothings.utils.configuration.Flag = <factory>, invisible: biothings.utils.configuration.Flag = <factory>)[source]¶
Bases:
object
- class biothings.utils.configuration.ConfigLine(seq)[source]¶
Bases:
UserString
- PATTERNS = (('hidden', re.compile('^#\\s?-\\s*hide\\s*-\\s?#\\s*$'), <function ConfigLine.<lambda>>), ('invisible', re.compile('^#\\s?-\\s*invisible\\s*-\\s?#\\s*$'), <function ConfigLine.<lambda>>), ('readonly', re.compile('^#\\s?-\\s*readonly\\s*-\\s?#\\s*$'), <function ConfigLine.<lambda>>), ('section', re.compile('^#\\s?\\*\\s*(.*)\\s*\\*\\s?#\\s*$'), <function ConfigLine.<lambda>>), ('description', re.compile('.*\\s*#\\s+(.*)$'), <function ConfigLine.<lambda>>))¶
- class biothings.utils.configuration.ConfigurationValue(code)[source]¶
Bases:
object
Type used to wrap a default value when it's code that needs to be interpreted later. The code is passed to eval() in the context of the whole "config" dict (so, for instance, paths declared earlier in the configuration file can be used in the code passed to eval()). The code will also be executed through exec() if eval() raises a syntax error; this happens when the code contains statements, not just an expression. In that case, a variable should be created in these statements (named the same as the original config variable) so the proper value can be retrieved by ConfigurationManager.
- class biothings.utils.configuration.ConfigurationWrapper(default_config, conf)[source]¶
Bases:
object
Wraps and manages configuration access and edit. A singleton instance is available throughout all hub apps using biothings.config or biothings.hub.config after calling import biothings.hub. In addition to providing config value access, either from config files or database, config manager can supersede attributes of a class with values coming from the database, allowing dynamic configuration of hub’s elements.
When constructing a ConfigurationWrapper instance, variables will be defined with default values coming from default_config, then they can be overridden by conf's values, or new variables will be added if not defined in default_conf. Only metadata coming from default_config will be used.
- property modified¶
- property readonly¶
- class biothings.utils.configuration.Flag(value=None)[source]¶
Bases:
MetaField
- default¶
alias of
bool
- class biothings.utils.configuration.MetaField(value=None)[source]¶
Bases:
object
- default¶
alias of
None
- property value¶
biothings.utils.dataload¶
Utility functions for parsing flatfiles, mapping to JSON, cleaning.
- biothings.utils.dataload.alwayslist(value)[source]¶
If input value is not a list/tuple type, return it as a single value list.
- biothings.utils.dataload.boolean_convert(d, convert_keys=None, level=0)[source]¶
Convert values specified by convert_keys in document d to boolean. Dotfield notation can be used to specify inner keys.
Note that None values are converted to False in Python. Use dict_sweep() before calling this function if such False values are not expected. See https://github.com/biothings/biothings.api/issues/274 for details.
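A minimal sketch combining dict_sweep() and boolean_convert() as suggested above (field names are hypothetical; whether these helpers return the cleaned document or modify it in place, and the exact set of string values recognized as booleans, are assumptions):
from biothings.utils.dataload import dict_sweep, boolean_convert

doc = {"clinical": {"reviewed": "true", "source": "NA"}, "flags": {"somatic": 0}}
doc = dict_sweep(doc)  # drop "NA"-like values first so they don't become False
doc = boolean_convert(doc, convert_keys=["clinical.reviewed", "flags.somatic"])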
- biothings.utils.dataload.dict_apply(d, key, value, sort=True)[source]¶
add value to d[key], append it if key exists
>>> d = {'a': 1}
>>> dict_apply(d, 'a', 2)
{'a': [1, 2]}
>>> dict_apply(d, 'a', 3)
{'a': [1, 2, 3]}
>>> dict_apply(d, 'b', 2)
{'a': 1, 'b': 2}
- biothings.utils.dataload.dict_attrmerge(dict_li, removedup=True, sort=True, special_fns=None)[source]¶
dict_attrmerge([{'a': 1, 'b': [2,3]}, {'a': [1,2], 'b': [3,5], 'c': 4}])
should return:
{'a': [1,2], 'b': [2,3,5], 'c': 4}
special_fns is a dictionary of {attr: merge_fn} used for some special attr, which need special merge_fn e.g., {‘uniprot’: _merge_uniprot}
- biothings.utils.dataload.dict_convert(_dict, keyfn=None, valuefn=None)[source]¶
Return a new dict with each key converted by keyfn (if not None), and each value converted by valuefn (if not None).
- biothings.utils.dataload.dict_sweep(d, vals=None, remove_invalid_list=False)[source]¶
Remove keys whose values are “.”, “-”, “”, “NA”, “none”, ” “; and remove empty dictionaries
- Parameters:
d (dict) – a dictionary
vals (str or list) – a string or list of strings to sweep, or None to use the default values
remove_invalid_list (boolean) –
when True, will remove a key for which the list has only one value, and that value is part of "vals". Ex:
test_dict = {'gene': [None, None], 'site': ["Intron", None], 'snp_build' : 136}
with remove_invalid_list == False:
{'gene': [None], 'site': ['Intron'], 'snp_build': 136}
with remove_invalid_list == True:
{'site': ['Intron'], 'snp_build': 136}
- biothings.utils.dataload.dict_to_list(gene_d)[source]¶
return a list of genedoc from genedoc dictionary and make sure the “_id” field exists.
- biothings.utils.dataload.dict_traverse(d, func, traverse_list=False)[source]¶
Recursively traverse dictionary d, calling func(k,v) for each key/value found. func must return a tuple(new_key,new_value)
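A minimal sketch of dict_traverse(), e.g. to lower-case every key (whether the dict is modified in place or returned is not asserted here):
from biothings.utils.dataload import dict_traverse

doc = {"Symbol": "CDK2", "RefSeq": {"RNA": "NM_001798"}}
# func receives (key, value) and must return a (new_key, new_value) tuple
dict_traverse(doc, lambda k, v: (k.lower(), v))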
- biothings.utils.dataload.dict_walk(dictionary, key_func)[source]¶
Recursively apply key_func to dict’s keys
- biothings.utils.dataload.dupline_seperator(dupline, dup_sep, dup_idx=None, strip=False)[source]¶
for a line like this:
a b1,b2 c1,c2
return a generator of this list (breaking out of the duplicates in each field):
[(a,b1,c1), (a,b2,c1), (a,b1,c2), (a,b2,c2)]
Example:
dupline_seperator(dupline=['a', 'b1,b2', 'c1,c2'], dup_idx=[1,2], dup_sep=',')
if dup_idx is None, try to split on every field. if strip is True, also strip out extra spaces.
- biothings.utils.dataload.file_merge(infiles, outfile=None, header=1, verbose=1)[source]¶
Merge a list of input files with the same format. If header is n, then the top n lines will be discarded, starting from the 2nd file in the list.
- biothings.utils.dataload.float_convert(d, include_keys=None, exclude_keys=None)[source]¶
Convert elements in a document to floats.
By default, traverse all keys. If include_keys is specified, only convert the keys listed in include_keys (e.g. a.b, a.b.c). If exclude_keys is specified, exclude the keys listed in exclude_keys from conversion.
- Parameters:
d – a dictionary to traverse keys on
include_keys – only convert these keys (optional)
exclude_keys – do not convert these keys (optional)
- Returns:
generate key, value pairs
- biothings.utils.dataload.int_convert(d, include_keys=None, exclude_keys=None)[source]¶
Convert elements in a document to integers.
By default, traverse all keys. If include_keys is specified, only convert the keys listed in include_keys (e.g. a.b, a.b.c). If exclude_keys is specified, exclude the keys listed in exclude_keys from conversion.
- Parameters:
d – a dictionary to traverse keys on
include_keys – only convert these keys (optional)
exclude_keys – do not convert these keys (optional)
- Returns:
generate key, value pairs
- biothings.utils.dataload.list2dict(a_list, keyitem, alwayslist=False)[source]¶
Return a dictionary with specified keyitem as key, others as values. keyitem can be an index or a sequence of indexes. For example:
li = [['A','a',1], ['B','a',2], ['A','b',3]]
list2dict(li, 0) ---> {'A':[('a',1),('b',3)], 'B':('a',2)}
If alwayslist is True, values are always a list even there is only one item in it:
list2dict(li,0,True)---> {'A':[('a',1),('b',3)], 'B':[('a',2),]}
- biothings.utils.dataload.list_itemcnt(a_list)[source]¶
Return number of occurrence for each item in the list.
- biothings.utils.dataload.list_split(d, sep)[source]¶
Split fields by sep into comma separated lists, strip.
- biothings.utils.dataload.listitems(a_list, *idx)[source]¶
Return multiple items from list by given indexes.
- biothings.utils.dataload.listsort(a_list, by, reverse=False, cmp=None, key=None)[source]¶
Given that a_list is a list of sublists (or tuples), return a new list sorted by the i-th item of each sublist, where i is given by the "by" argument.
- biothings.utils.dataload.merge_dict(dict_li, attr_li, missingvalue=None)[source]¶
Merging multiple dictionaries into a new one. Example:
In [136]: d1 = {'id1': 100, 'id2': 200}
In [137]: d2 = {'id1': 'aaa', 'id2': 'bbb', 'id3': 'ccc'}
In [138]: merge_dict([d1,d2], ['number', 'string'])
Out[138]:
{'id1': {'number': 100, 'string': 'aaa'},
 'id2': {'number': 200, 'string': 'bbb'},
 'id3': {'string': 'ccc'}}
In [139]: merge_dict([d1,d2], ['number', 'string'], missingvalue='NA')
Out[139]:
{'id1': {'number': 100, 'string': 'aaa'},
 'id2': {'number': 200, 'string': 'bbb'},
 'id3': {'number': 'NA', 'string': 'ccc'}}
- biothings.utils.dataload.merge_duplicate_rows(rows, db)[source]¶
@param rows: rows to be grouped by
@param db: database name, string
- biothings.utils.dataload.merge_root_keys(doc1, doc2, exclude=None)[source]¶
Ex:
d1 = {"_id": 1, "a": "a", "b": {"k": "b"}}
d2 = {"_id": 1, "a": "A", "b": {"k": "B"}, "c": 123}
Both documents have the same _id, and 2 root keys, "a" and "b". Using this storage, the resulting document will be:
{"_id": 1, "a": ["A", "a"], "b": [{"k": "B"}, {"k": "b"}], "c": 123}
- biothings.utils.dataload.merge_struct(v1, v2, aslistofdict=None, include=None, exclude=None)[source]¶
Merge two structures, v1 and v2, into one.
- Parameters:
aslistofdict – a string indicating the key name that should be treated as a list of dict
include – when given a list of strings, only merge these keys (optional)
exclude – when given a list of strings, exclude these keys from merging (optional)
- biothings.utils.dataload.normalized_value(value, sort=True)[source]¶
Return a "normalized" value:
1. if a list, remove duplicates and sort it
2. if a list with one item, convert it to that single item only
3. if a list, remove empty values
4. otherwise, return the value as is
- biothings.utils.dataload.rec_handler(infile, block_end='\n', skip=0, include_block_end=False, as_list=False)[source]¶
A generator to return one record (block of text) at a time from the infile. Records are separated by one or more empty lines by default. skip can be used to skip the top n lines. If include_block_end is True, the line matching block_end will also be returned. If as_list is True, return a list of lines in one record.
- biothings.utils.dataload.safe_type(f, val)[source]¶
Convert an input string to int/float/… using the passed function. If the conversion fails, None is returned. If the value is of a type other than string, the original value is returned.
- biothings.utils.dataload.tab2dict_iter(datafile, cols, key, alwayslist=False, **kwargs)[source]¶
- Parameters:
cols (array of int) – an array of indices (of a list) indicating which element(s) are kept in bulk
key (int) – an index (of a list) indicating which element is treated as a bulk key
Iterate datafile by row, subset each row (as a list of strings) by cols. Adjacent rows sharing the same value at the key index are put into one bulk. Each bulk is then transformed to a dict with the value at the key index as the dict key.
E.g. given the following datafile, cols=[0,1,2], and key=1, two bulks are generated:
     key
a1   b1   c1
a2   b1   c2    # bulk_1 => {b1: [(a1, c1), (a2, c2), (a3, c3)]}
a3   b1   c3
---------------
a4   b2   c4
a5   b2   c5    # bulk_2 => {b2: [(a4, c4), (a5, c5), (a6, c6)]}
a6   b2   c6
- biothings.utils.dataload.tabfile_feeder(datafile, header=1, sep='\t', includefn=None, coerce_unicode=True, assert_column_no=None)[source]¶
a generator for each row in the file.
- biothings.utils.dataload.to_boolean(val, true_str=None, false_str=None)[source]¶
Normalize str value to boolean value
- biothings.utils.dataload.traverse_keys(d, include_keys=None, exclude_keys=None)[source]¶
Return all key, value pairs for a document.
By default, traverse all keys. If include_keys is specified, only traverse the keys listed in include_keys (e.g. a.b, a.b.c). If exclude_keys is specified, exclude the keys listed in exclude_keys.
if a key in include_keys/exclude_keys is not found in d, it's skipped quietly.
- Parameters:
d – a dictionary to traverse keys on
include_keys – only traverse these keys (optional)
exclude_keys – do not traverse these keys (optional)
- Returns:
generate key, value pairs
- biothings.utils.dataload.unlist_incexcl(d, include_keys=None, exclude_keys=None)[source]¶
Unlist elements in a document.
If there is 1 value in the list, set the element to that value. Otherwise, leave the list unchanged.
By default, traverse all keys. If include_keys is specified, only unlist the keys listed in include_keys (e.g. a.b, a.b.c). If exclude_keys is specified, exclude the keys listed in exclude_keys.
- Parameters:
d – a dictionary to unlist
include_keys – only unlist these keys (optional)
exclude_keys – do not unlist these keys (optional)
- Returns:
generate key, value pairs
- biothings.utils.dataload.update_dict_recur(d, u)[source]¶
Update dict d with dict u’s values, recursively (so existing values in d but not in u are kept even if nested)
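A minimal sketch of update_dict_recur() (the documents are hypothetical; whether the result is returned or applied in place is assumed here):
from biothings.utils.dataload import update_dict_recur

d = {"gene": {"symbol": "CDK2", "taxid": 9606}}
u = {"gene": {"symbol": "CDK3"}}
d = update_dict_recur(d, u)
# nested values present in d but absent from u (here "taxid") are kept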
- biothings.utils.dataload.updated_dict(_dict, attrs)[source]¶
Same as dict.update, but return the updated dictionary.
- biothings.utils.dataload.value_convert(_dict, fn, traverse_list=True)[source]¶
For each value in _dict, apply fn and then update _dict with the returned value. If traverse_list is True and a value is a list, apply fn to each item of the list.
- biothings.utils.dataload.value_convert_incexcl(d, fn, include_keys=None, exclude_keys=None)[source]¶
Convert elements in a document using a function fn.
By default, traverse all keys. If include_keys is specified, only convert the keys listed in include_keys (e.g. a.b, a.b.c). If exclude_keys is specified, exclude the keys listed in exclude_keys from conversion.
- Parameters:
d – a dictionary to traverse keys on
fn – function to convert elements with
include_keys – only convert these keys (optional)
exclude_keys – do not convert these keys (optional)
- Returns:
generate key, value pairs
biothings.utils.diff¶
Utils to compare two lists of gene documents; requires a BioThings Hub setup.
- biothings.utils.diff.diff_collections(b1, b2, use_parallel=True, step=10000)[source]¶
b1, b2 are one of supported backend class in databuild.backend. e.g.:
b1 = DocMongoDBBackend(c1)
b2 = DocMongoDBBackend(c2)
- biothings.utils.diff.diff_collections_batches(b1, b2, result_dir, step=10000)[source]¶
b2 is new collection, b1 is old collection
biothings.utils.doc_traversal¶
Some utility functions that do document traversal
- biothings.utils.doc_traversal.breadth_first_recursive_traversal(doc, path=None)[source]¶
doesn’t exactly implement breadth first ordering it seems, not sure why…
biothings.utils.docs¶
- biothings.utils.docs.flatten_doc(doc, outfield_sep='.', sort=True)[source]¶
This function will flatten an elasticsearch document (really any json object). outfield_sep is the separator between the fields in the returned object. sort specifies whether the output object should be sorted alphabetically before returning (otherwise output will remain in traversal order).
biothings.utils.dotfield¶
- biothings.utils.dotfield.compose_dot_fields_by_fields(doc, fields)[source]¶
Reverse function of parse_dot_fields
biothings.utils.dotstring¶
- biothings.utils.dotstring.key_value(dictionary, key)[source]¶
- Return a generator for all values in a dictionary specified by a dotstring (key).
If the key is not found in the dictionary, None is returned.
- Parameters:
dictionary – a dictionary to return values from
key – key that specifies a value in the dictionary
- Returns:
generator for values that match the given key
- biothings.utils.dotstring.last_element(d, key_list)[source]¶
Return the last element and key for a document d given a dotstring.
A document d is passed with a list of keys key_list. A generator is then returned for all elements that match all keys. Note that there may be a 1-to-many relationship between keys and elements due to lists in the document.
- Parameters:
d – document d to return elements from
key_list – list of keys that specify elements in the document d
- Returns:
generator for elements that match all keys
- biothings.utils.dotstring.list_length(d, field)[source]¶
Return the length of a list specified by field.
If field represents a list in the document, then return its length. Otherwise return 0.
- Parameters:
d – a dictionary
field – the dotstring field specifying a list
- biothings.utils.dotstring.remove_key(dictionary, key)[source]¶
Remove the field specified by the dotstring key.
- Parameters:
dictionary – a dictionary to remove the value from
key – key that specifies an element in the dictionary
- Returns:
dictionary after changes have been made
- biothings.utils.dotstring.set_key_value(dictionary, key, value)[source]¶
- Set all values in the dictionary matching a dotstring key to a specified value.
If the key is not found in the dictionary, it is quietly skipped.
- Parameters:
dictionary – a dictionary to set values in
key – key that specifies an element in the dictionary
- Returns:
dictionary after changes have been made
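A hedged sketch of the dotstring helpers above (the document and values are hypothetical; exact return shapes are assumptions):
from biothings.utils.dotstring import key_value, set_key_value, remove_key

doc = {"gene": {"symbol": "CDK2", "aliases": ["p33"]}}
list(key_value(doc, "gene.symbol"))        # expected: ["CDK2"]
set_key_value(doc, "gene.symbol", "CDK3")  # every matching element is updated
remove_key(doc, "gene.aliases")            # field addressed by the dotstring is removed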
biothings.utils.es¶
- class biothings.utils.es.Database[source]¶
Bases:
IDatabase
- CONFIG = None¶
- property address¶
Returns sufficient information so a connection to a database can be created. Information can be a dictionary, object, etc… and depends on the actual backend
- class biothings.utils.es.ESIndex(client, index_name)[source]¶
Bases:
object
An Elasticsearch Index Wrapping A Client. Counterpart for pymongo.collection.Collection
- property doc_type¶
- class biothings.utils.es.ESIndexer(index, doc_type='_doc', es_host='localhost:9200', step=500, step_size=10, number_of_shards=1, number_of_replicas=0, check_index=True, **kwargs)[source]¶
Bases:
object
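A hedged sketch of basic ESIndexer usage (the index name and documents are hypothetical; a locally running Elasticsearch is assumed):
from biothings.utils.es import ESIndexer

idxr = ESIndexer("mygene_test", es_host="localhost:9200")
idxr.index({"symbol": "CDK2", "taxid": 9606}, id="1017")  # add one doc (updates it if the id exists)
idxr.update("1017", {"name": "cyclin dependent kinase 2"}, upsert=True)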
- check_index()[source]¶
Check if index is an alias, and update self._index to point to actual index
- TODO: the overall design of ESIndexer is not great. If we are exposing ES implementation details (such as the abilities to create and delete indices, create and update aliases, etc.) to the user of this class, then this method doesn't seem that out of place.
- clean_field(field, dryrun=True, step=5000)[source]¶
Remove a top-level field from the ES index; if the field is the only field of the doc, remove the doc as well. step is the size of the bulk update on ES. Try first with dryrun turned on, and then perform the actual updates with dryrun off.
- find_biggest_doc(fields_li, min=5, return_doc=False)[source]¶
return the doc with the max number of fields from fields_li.
- get_alias(index: str | None = None, alias_name: str | None = None) List[str] [source]¶
Get indices with alias associated with given index name or alias name
- Parameters:
index – name of index
alias_name – name of alias
- Returns:
Mapping of index names with their aliases
- get_docs(**kwargs)[source]¶
Return matching docs for a given iterable of ids; if not found, return None. A generator over the matched docs is returned. If only_source is False, the entire document is returned; otherwise only the source is returned.
- get_indice_names_by_settings(index: str | None = None, sort_by_creation_date=False, reverse=False) List[str] [source]¶
Get list of indices names associated with given index name, using indices’ settings
- Parameters:
index – name of index
sort_by_creation_date – sort the result by indice’s creation_date
reverse – control the direction of the sorting
- Returns:
list of index names (str)
- get_settings(index: str | None = None) Mapping[str, Mapping] [source]¶
Get indices with settings associated with given index name
- Parameters:
index – name of index
- Returns:
Mapping of index names with their settings
- index(doc, id=None, action='index')[source]¶
add a doc to the index. If id is not None, the existing doc will be updated.
- reindex(src_index, is_remote=False, **kwargs)[source]¶
In order to reindex from a remote host:
- the src es_host must be set to an IP which the current ES host can connect to. This means that if the two indices are located on the same host, es_host can be set to localhost; if they are on different hosts, an IP must be used instead. If the src host uses SSL, https must be included in es_host, e.g. https://192.168.1.10:9200
- the src host must be whitelisted in the current ES host.
Ref: https://www.elastic.co/guide/en/elasticsearch/reference/7.17/reindex-upgrade-remote.html
- sanitize_settings(settings)[source]¶
- Clean up the settings dictionary to remove static fields that cannot be updated, like "uuid", "provided_name", "creation_date", "version".
settings will be updated in-place and returned as well.
- update(id, extra_doc, upsert=True)[source]¶
Update an existing doc with extra_doc. Set upsert=True to insert new docs.
- update_alias(alias_name: str, index: str | None = None)[source]¶
Create or update an ES alias pointing to an index
Creates or updates an alias in Elasticsearch, associated with the given index name or the underlying index of the ESIndexer instance.
When the alias name does not exist, it will be created. If an existing alias already exists, it will be updated to only associate with the index.
When the alias name already exists, an exception will be raised, UNLESS the alias name is the same as index name that the ESIndexer is initialized with. In this case, the existing index with the name collision will be deleted, and the alias will be created in its place. This feature is intended for seamless migration from an index to an alias associated with an index for zero-downtime installs.
- Parameters:
alias_name – name of the alias
index – name of the index to associate with alias. If None, the index of the ESIndexer instance is used.
- Raises:
- update_docs(partial_docs, upsert=True, step=None, **kwargs)[source]¶
Update a list of partial_docs in bulk. Set upsert=True to insert new docs.
- biothings.utils.es.generate_es_mapping(inspect_doc, init=True, level=0)[source]¶
Generate an ES mapping according to “inspect_doc”, which is produced by biothings.utils.inspect module
biothings.utils.exclude_ids¶
- class biothings.utils.exclude_ids.ExcludeFieldsById(exclusion_ids, field_lst, min_list_size=1000)[source]¶
Bases:
object
This class provides a framework to exclude fields for certain identifiers. Up to three arguments are passed to this class, an identifier list, a list of fields to remove, and minimum list size. The identifier list is a list of document identifiers to act on. The list of fields are fields that will be removed; they are specified using a dotstring notation. The minimum list size is the minimum number of elements that should be in a list in order for it to be removed. The ‘drugbank’, ‘chebi’, and ‘ndc’ data sources were manually tested with this class.
Fields to truncate are specified by field_lst. The dot-notation is accepted.
biothings.utils.hub¶
- class biothings.utils.hub.BaseHubReloader(paths, reload_func, wait=5.0)[source]¶
Bases:
object
Monitor sources’ code and reload hub accordingly to update running code
Monitor given paths for directory deletion/creation and for file deletion/creation. Poll for events every ‘wait’ seconds.
- class biothings.utils.hub.CompositeCommand(cmd)[source]¶
Bases:
str
Defines a composite hub command, that is, a new command made of other commands. Useful to define shortcuts when typing commands in the hub console.
- class biothings.utils.hub.HubShell(**kwargs: Any)[source]¶
Bases:
InteractiveShell
Create a configurable given a config config.
- Parameters:
config (Config) – If this is empty, default values are used. If config is a
Config
instance, it will be used to configure the instance.parent (Configurable instance, optional) – The parent Configurable instance of this object.
Notes
Subclasses of Configurable must call the __init__() method of Configurable before doing anything else and using super():
class MyConfigurable(Configurable):
    def __init__(self, config=None):
        super(MyConfigurable, self).__init__(config=config)
        # Then any other code you need to finish initialization.
This ensures that instances will be configured properly.
- cmd = None¶
- cmd_cnt = None¶
- launch(pfunc)[source]¶
Helper to run a command and register it. pfunc is a partial taking no argument. The command name is generated from the partial's func and arguments.
- launched_commands = {}¶
- pending_outputs = {}¶
- register_command(cmd, result, force=False)[source]¶
Register a command ‘cmd’ inside the shell (so we can keep track of it). ‘result’ is the original value that was returned when cmd was submitted. Depending on the type, returns a cmd number (ie. result was an asyncio task and we need to wait before getting the result) or directly the result of ‘cmd’ execution, returning, in that case, the output.
- class biothings.utils.hub.TornadoAutoReloadHubReloader(paths, reload_func, wait=5)[source]¶
Bases:
BaseHubReloader
Reloader based on tornado.autoreload module
Monitor given paths for directory deletion/creation and for file deletion/creation. Poll for events every ‘wait’ seconds.
- add_watch(paths)[source]¶
This method recursively adds the input paths and their children to tornado autoreload for watching. If any file changes, tornado will call our hook to reload the hub.
Each path will be forced to become an absolute path. If a path matches the excluding patterns, it will be ignored. Only files are added for watching; directories are passed to another add_watch call.
- biothings.utils.hub.publish_data_version(s3_bucket, s3_folder, version_info, update_latest=True, aws_key=None, aws_secret=None)[source]¶
- Update remote files:
- versions.json: add version_info to the JSON list, or replace it if arg version_info is a list
- latest.json: update the redirect so it points to the latest version url
"versions" is a dict such as:
{
    "build_version": "...",      # version name for this release/build
    "require_version": "...",    # version required for incremental update
    "target_version": "...",     # version reached once update is applied
    "type": "incremental|full",  # release type
    "release_date": "...",       # ISO 8601 timestamp, release date/time
    "url": "http...."            # url pointing to release metadata
}
- biothings.utils.hub.template_out(field, confdict)[source]¶
Return field as a templated-out field, substituting "%(…)s" parts with values from confdict. Fields can follow dotfield notation. Fields like "$(…)" are replaced with a timestamp following the specified format (see time.strftime). Example:
confdict = {"a": "one"}
field = "%(a)s_two_three_$(%Y%m)"
=> "one_two_three_201908"  # assuming we're in August 2019
biothings.utils.hub_db¶
The hub_db module is a place-holder for internal hub database functions. The Hub DB contains information about sources, configuration variables, etc… It's for internal usage. When biothings.config_for_app() is called, this module will be "filled" with the actual implementations from the specified backend (specified in config.py, or defaulting to MongoDB).
The Hub DB can be implemented over different backends; it was originally done using MongoDB, so the dialect is very much inspired by pymongo. Any hub db backend implementation must implement the functions and classes below. See biothings.utils.mongo and biothings.utils.sqlite3 for some examples.
- class biothings.utils.hub_db.ChangeWatcher[source]¶
Bases:
object
- col_entity = {'cmd': 'command', 'hub_config': 'config', 'src_build': 'build', 'src_build_config': 'build_config', 'src_dump': 'source', 'src_master': 'master'}¶
- do_publish = False¶
- event_queue = <Queue at 0x7f38547c2a40 maxsize=0>¶
- listeners = {}¶
- class biothings.utils.hub_db.Collection(colname, db)[source]¶
Bases:
object
Defines a minimal subset of MongoDB collection behavior. Note: Collection instances must be pickleable (if not, __getstate__ can be implemented to deal with those attributes for instance)
Init args can differ depending on the backend requirements. colname is the only one required.
- find(*args, **kwargs)[source]¶
Return an iterable of documents matching the criteria defined in *args[0] (which will be a dict). The query dialect is a minimal one, inspired by MongoDB. The dict can contain the name of a key and the value being searched for. Ex: {"field1": "value1"} will return all documents where field1 == "value1". Nested keys (field1.subfield1) aren't supported (no need to implement). Only exact matches are required.
If no query is passed, or if query is an empty dict, return all documents.
- find_one(*args, **kwargs)[source]¶
Return one document from the collection. *args will contain a dict with the query parameters. See also find()
- property name¶
Return the collection/table name
- replace_one(query, doc)[source]¶
Replace a document matching ‘query’ (or the first found one) with passed doc
- save(doc)[source]¶
Shortcut to update_one() or insert_one(). Save the document by either inserting it if it doesn't exist, or updating the existing one.
- update_one(query, what, upsert=False)[source]¶
Update one document (or the first matching the query). See find() for the query parameter. "what" tells how to update the document; $set/$unset/$push operators must be implemented (refer to MongoDB documentation for more). Nested key operations aren't necessary.
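Putting this minimal dialect together, a hedged sketch of typical hub_db collection usage (the document fields shown are hypothetical):
from biothings.utils.hub_db import get_src_dump

col = get_src_dump()                     # backend-dependent Collection instance
doc = col.find_one({"_id": "mysource"})  # exact-match query dialect
for d in col.find({"status": "success"}):
    print(d["_id"])
col.update_one({"_id": "mysource"}, {"$set": {"status": "pending"}})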
- class biothings.utils.hub_db.IDatabase[source]¶
Bases:
object
This class declares an interface and partially implements some of it, mimicking mongokit.Connection class. It’s used to keep used document model. Any internal backend should implement (derives) this interface
- property address¶
Returns sufficient information so a connection to a database can be created. Information can be a dictionary, object, etc… and depends on the actual backend
- biothings.utils.hub_db.backup(folder='.', archive=None)[source]¶
Dump the whole hub_db database into the given folder. "archive" can be passed to specify the target filename; otherwise, it's randomly generated.
Note
this doesn’t backup source/merge data, just the internal data used by the hub
- biothings.utils.hub_db.get_api()¶
- biothings.utils.hub_db.get_cmd()¶
- biothings.utils.hub_db.get_data_plugin()¶
- biothings.utils.hub_db.get_event()¶
- biothings.utils.hub_db.get_hub_config()¶
- biothings.utils.hub_db.get_src_build()¶
- biothings.utils.hub_db.get_src_build_config()¶
- biothings.utils.hub_db.get_src_dump()¶
- biothings.utils.hub_db.get_src_master()¶
biothings.utils.info¶
biothings.utils.inspect¶
This module contains util functions that may be shared by both BioThings data-hub and web components. In general, do not include utils depending on any third-party modules. Note: unittests available in biothings.tests.hub
- class biothings.utils.inspect.BaseMode[source]¶
Bases:
object
- key = None¶
- report(struct, drep, orig_struct=None)[source]¶
Given a data structure "struct" being inspected, report (fill) the "drep" dictionary with useful values for this mode, under the drep[self.key] key. Sometimes "struct" has already been converted to its analytical value at this point (inspect may count the number of dicts and would then pass struct as "1" instead of the whole dict, where the number of keys could then be reported); "orig_struct" in that case contains the original structure that was to be reported, whatever the pre-conversion step did.
- template = {}¶
- class biothings.utils.inspect.DeepStatsMode[source]¶
Bases:
StatsMode
- key = '_stats'¶
- merge(target_stats, tomerge_stats)[source]¶
Merge two different maps together (from tomerge into target)
- report(val, drep, orig_struct=None)[source]¶
Given a data structure "struct" being inspected, report (fill) the "drep" dictionary with useful values for this mode, under the drep[self.key] key. Sometimes "struct" has already been converted to its analytical value at this point (inspect may count the number of dicts and would then pass struct as "1" instead of the whole dict, where the number of keys could then be reported); "orig_struct" in that case contains the original structure that was to be reported, whatever the pre-conversion step did.
- template = {'_stats': {'__vals': [], '_count': 0, '_max': -inf, '_min': inf}}¶
- class biothings.utils.inspect.FieldInspectValidation(warnings: set() = <factory>, types: set = <factory>, has_multiple_types: bool = False)[source]¶
Bases:
object
- has_multiple_types: bool = False¶
- types: set¶
- warnings: set()¶
- class biothings.utils.inspect.FieldInspection(field_name: str, field_type: str, stats: dict = None, warnings: list = <factory>)[source]¶
Bases:
object
- field_name: str¶
- field_type: str¶
- stats: dict = None¶
- warnings: list¶
- class biothings.utils.inspect.IdentifiersMode[source]¶
Bases:
RegexMode
- ids = None¶
- key = '_ident'¶
- matchers = None¶
- class biothings.utils.inspect.InspectionValidation[source]¶
Bases:
object
This class provides a mechanism to validate and flag any field which:
- contains whitespace
- contains upper-cased letters or special characters (lower-case is recommended; in some cases upper-case field names are acceptable, so we raise a warning and let the user confirm it's necessary)
- has more than one type detected by the type inspection (but a mixed single value and an array of values of the same type are acceptable, and the case of mixed integer and float is acceptable too)
Usage:
result = InspectionValidation.validate(data)
Adding more rules:
- add a new code and message to the Warning Enum
- add a new staticmethod to validate the new rule, named in the format validate_{warning_code}
- add the new rule to this docstring
- INVALID_CHARACTERS_PATTERN = '[^a-zA-Z0-9_.]'¶
- NUMERIC_FIELDS = ['int', 'float']¶
- SPACE_PATTERN = ' '¶
- class Warning(value)[source]¶
Bases:
Enum
An enumeration.
- W001 = 'field name contains whitespace.'¶
- W002 = 'field name contains uppercase.'¶
- W003 = 'field name contains special character. Only alphanumeric, dot, or underscore are valid.'¶
- W004 = 'field name has more than one type.'¶
- static validate(data: List[FieldInspection]) Dict[str, FieldInspectValidation] [source]¶
- static validate_W001(field_inspection: FieldInspection, field_validation: FieldInspectValidation) bool [source]¶
- static validate_W002(field_inspection: FieldInspection, field_validation: FieldInspectValidation) bool [source]¶
- static validate_W003(field_inspection: FieldInspection, field_validation: FieldInspectValidation) bool [source]¶
- static validate_W004(field_inspection: FieldInspection, field_validation: FieldInspectValidation) bool [source]¶
- class biothings.utils.inspect.RegexMode[source]¶
Bases:
BaseMode
- matchers = []¶
- report(val, drep, orig_struct=None)[source]¶
Given a data structure "struct" being inspected, report (fill) the "drep" dictionary with useful values for this mode, under the drep[self.key] key. Sometimes "struct" has already been converted to its analytical value at this point (inspect may count the number of dicts and would then pass struct as "1" instead of the whole dict, where the number of keys could then be reported); "orig_struct" in that case contains the original structure that was to be reported, whatever the pre-conversion step did.
- class biothings.utils.inspect.StatsMode[source]¶
Bases:
BaseMode
- key = '_stats'¶
- merge(target_stats, tomerge_stats)[source]¶
Merge two different maps together (from tomerge into target)
- report(struct, drep, orig_struct=None)[source]¶
Given a data structure "struct" being inspected, report (fill) the "drep" dictionary with useful values for this mode, under the drep[self.key] key. Sometimes "struct" has already been converted to its analytical value at this point (inspect may count the number of dicts and would then pass struct as "1" instead of the whole dict, where the number of keys could then be reported); "orig_struct" in that case contains the original structure that was to be reported, whatever the pre-conversion step did.
- template = {'_stats': {'_count': 0, '_max': -inf, '_min': inf, '_none': 0}}¶
- biothings.utils.inspect.flatten_inspection_data(data: Dict[str, Any], current_deep: int = 0, parent_name: str | None = None, parent_type: str | None = None) List[FieldInspection] [source]¶
This function will convert multi-depth nested inspection data into a flattened list. Nested keys are appended to the parent key and separated with a dot.
- biothings.utils.inspect.get_converters(modes, logger=<module 'logging'>)[source]¶
- biothings.utils.inspect.inspect(struct, key=None, mapt=None, mode='type', level=0, logger=<module 'logging'>)[source]¶
Explore struct and report types contained in it.
- Parameters:
struct – is the data structure to explore
mapt – if not None, will complete that type map with passed struct. This is useful when iterating over a dataset of similar data, trying to find a good type summary contained in that dataset.
level – is for internal purposes, mostly debugging
mode – see inspect_docs() documentation
- biothings.utils.inspect.inspect_docs(docs, mode='type', clean=True, merge=False, logger=<module 'logging'>, pre_mapping=False, limit=None, sample=None, metadata=True, auto_convert=True)[source]¶
Inspect docs and return a summary of their structure:
- Parameters:
mode –
possible values are:
- "type": (default) explore documents and report the strict data structure
- "mapping": same as "type" but also perform tests on data to guess the best mapping (eg. check if a string is splittable, etc…). Implies merge=True
- "stats": explore documents and compute basic stats (count, min, max, sum)
- "deepstats": same as "stats" but record values and also compute mean, stdev, median (memory intensive…)
- "jsonschema": same as "type" but returns a json-schema formatted result
mode can also be a list of modes, eg. ["type","mapping"]. There's little overhead computing multiple types as most time is spent on actually getting the data.
clean – don't delete recorded values or temporary results
merge – merge scalar into list when both exist (eg. {"val":..} and [{"val":…}])
limit – can limit the inspection to the x first docs (None = no limit, inspects all)
sample – in combination with limit, randomly extract a sample of ‘limit’ docs (so not necessarily the x first ones defined by limit). If random.random() is greater than sample, doc is inspected, otherwise it’s skipped
metadata – compute metadata on the result
auto_convert – run converters automatically (converters are used to convert one mode’s output to another mode’s output, eg. type to jsonschema)
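A minimal sketch of inspect_docs() on a small list of documents (the exact layout of the returned summary is not asserted here):
from biothings.utils.inspect import inspect_docs

docs = [
    {"_id": "1017", "symbol": "CDK2", "taxid": 9606},
    {"_id": "1018", "symbol": "CDK3", "taxid": [9606, 10090]},
]
res = inspect_docs(docs, mode=["type", "mapping"])
# one entry per requested mode is expected in the result (plus metadata)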
- biothings.utils.inspect.merge_field_inspections_validations(field_inspections: List[FieldInspection], field_validations: Dict[str, FieldInspectValidation])[source]¶
Adding any warnings from field_validations to field_inspections with corresponding field name
- biothings.utils.inspect.run_converters(_map, converters, logger=<module 'logging'>)[source]¶
- biothings.utils.inspect.simplify_inspection_data(field_inspections: List[FieldInspection]) List[Dict[str, Any]] [source]¶
biothings.utils.jsondiff¶
The MIT License (MIT)
Copyright (c) 2014 Ilya Volkov
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
biothings.utils.jsonpatch¶
Apply JSON-Patches (RFC 6902)
- class biothings.utils.jsonpatch.AddOperation(operation)[source]¶
Bases:
PatchOperation
Adds an object property or an array element.
- class biothings.utils.jsonpatch.CopyOperation(operation)[source]¶
Bases:
PatchOperation
Copies an object property or an array element to a new location
- exception biothings.utils.jsonpatch.InvalidJsonPatch[source]¶
Bases:
JsonPatchException
Raised if an invalid JSON Patch is created
- class biothings.utils.jsonpatch.JsonPatch(patch)[source]¶
Bases:
object
A JSON Patch is a list of Patch Operations.
>>> patch = JsonPatch([
...     {'op': 'add', 'path': '/foo', 'value': 'bar'},
...     {'op': 'add', 'path': '/baz', 'value': [1, 2, 3]},
...     {'op': 'remove', 'path': '/baz/1'},
...     {'op': 'test', 'path': '/baz', 'value': [1, 3]},
...     {'op': 'replace', 'path': '/baz/0', 'value': 42},
...     {'op': 'remove', 'path': '/baz/1'},
... ])
>>> doc = {}
>>> result = patch.apply(doc)
>>> expected = {'foo': 'bar', 'baz': [42]}
>>> result == expected
True
JsonPatch object is iterable, so you could easily access to each patch statement in loop:
>>> lpatch = list(patch)
>>> expected = {'op': 'add', 'path': '/foo', 'value': 'bar'}
>>> lpatch[0] == expected
True
>>> lpatch == patch.patch
True
Also, a JsonPatch can be converted directly to bool if it contains any operation statements:
>>> bool(patch)
True
>>> bool(JsonPatch([]))
False
This behavior is very handy with make_patch() to write more readable code:
>>> old = {'foo': 'bar', 'numbers': [1, 3, 4, 8]}
>>> new = {'baz': 'qux', 'numbers': [1, 4, 7]}
>>> patch = make_patch(old, new)
>>> if patch:
...     # document have changed, do something useful
...     patch.apply(old)
{...}
- apply(orig_obj, in_place=False, ignore_conflicts=False, verify=False)[source]¶
Applies the patch to given object.
- Parameters:
obj (dict) – Document object.
in_place (bool) – Tweaks the way the patch is applied: directly to the specified obj or to its copy.
- Returns:
Modified obj.
- classmethod from_diff(src, dst)[source]¶
Creates JsonPatch instance based on comparing of two document objects. Json patch would be created for src argument against dst one.
- Parameters:
src (dict) – Data source document object.
dst (dict) – Data source document object.
- Returns:
JsonPatch
instance.
>>> src = {'foo': 'bar', 'numbers': [1, 3, 4, 8]}
>>> dst = {'baz': 'qux', 'numbers': [1, 4, 7]}
>>> patch = JsonPatch.from_diff(src, dst)
>>> new = patch.apply(src)
>>> new == dst
True
- exception biothings.utils.jsonpatch.JsonPatchConflict[source]¶
Bases:
JsonPatchException
Raised if a patch could not be applied due to a conflict situation such as: an attempt to add an object key that already exists; an attempt to operate on a nonexistent object key; an attempt to insert a value into an array at a position beyond its size; etc.
- exception biothings.utils.jsonpatch.JsonPatchException[source]¶
Bases:
Exception
Base Json Patch exception
- exception biothings.utils.jsonpatch.JsonPatchTestFailed[source]¶
Bases:
JsonPatchException
,AssertionError
A Test operation failed
- class biothings.utils.jsonpatch.MoveOperation(operation)[source]¶
Bases:
PatchOperation
Moves an object property or an array element to new location.
- class biothings.utils.jsonpatch.PatchOperation(operation)[source]¶
Bases:
object
A single operation inside a JSON Patch.
- class biothings.utils.jsonpatch.RemoveOperation(operation)[source]¶
Bases:
PatchOperation
Removes an object property or an array element.
- class biothings.utils.jsonpatch.ReplaceOperation(operation)[source]¶
Bases:
PatchOperation
Replaces an object property or an array element by new value.
- class biothings.utils.jsonpatch.TestOperation(operation)[source]¶
Bases:
PatchOperation
Tests the value at the specified location.
- biothings.utils.jsonpatch.apply_patch(doc, patch, in_place=False, ignore_conflicts=False, verify=False)[source]¶
Apply a list of patches to the specified JSON document.
- Parameters:
doc (dict) – Document object.
patch (list or str) – JSON patch as list of dicts or raw JSON-encoded string.
in_place (bool) – When True, the patch will modify the target document. By default the patch is applied to a copy of the document.
ignore_conflicts (bool) – Ignore JsonPatchConflict errors.
verify (bool) – Works with ignore_conflicts=True; if errors occur and verify is True (recommended), make sure the resulting object is the same as the original one. ignore_conflicts and verify are used to run patches multiple times and get rid of errors when operations can’t be performed more than once because the object has already been patched. This forces in_place to False so the comparison can occur.
- Returns:
Patched document object.
- Return type:
dict
>>> doc = {'foo': 'bar'}
>>> patch = [{'op': 'add', 'path': '/baz', 'value': 'qux'}]
>>> other = apply_patch(doc, patch)
>>> doc is not other
True
>>> other == {'foo': 'bar', 'baz': 'qux'}
True
>>> patch = [{'op': 'add', 'path': '/baz', 'value': 'qux'}]
>>> apply_patch(doc, patch, in_place=True) == {'foo': 'bar', 'baz': 'qux'}
True
>>> doc == other
True
- biothings.utils.jsonpatch.get_loadjson()[source]¶
adds the object_pairs_hook parameter to json.load when possible
The “object_pairs_hook” parameter is used to handle duplicate keys when loading a JSON object. This parameter does not exist in Python 2.6. This method returns an unmodified json.load for Python 2.6 and a partial function with object_pairs_hook set to multidict for Python versions that support the parameter.
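As a rough illustration of the idea (not the hub’s actual implementation), a json.load wrapper with an object_pairs_hook might look like the sketch below; collect_pairs is a hypothetical hook standing in for the multidict used by the hub:

import functools
import json

def collect_pairs(pairs):
    # hypothetical hook: keep duplicate keys by grouping their values in a list
    grouped = {}
    for key, value in pairs:
        grouped.setdefault(key, []).append(value)
    # unwrap keys seen only once, keep a list for real duplicates
    return {key: values[0] if len(values) == 1 else values for key, values in grouped.items()}

# json.load equivalent that routes every decoded JSON object through collect_pairs
load_json_keep_duplicates = functools.partial(json.load, object_pairs_hook=collect_pairs)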
- biothings.utils.jsonpatch.make_patch(src, dst)[source]¶
Generates a patch by comparing two document objects. It is actually a proxy to the
JsonPatch.from_diff()
method.- Parameters:
src (dict) – Data source document object.
dst (dict) – Data source document object.
>>> src = {'foo': 'bar', 'numbers': [1, 3, 4, 8]}
>>> dst = {'baz': 'qux', 'numbers': [1, 4, 7]}
>>> patch = make_patch(src, dst)
>>> new = patch.apply(src)
>>> new == dst
True
biothings.utils.jsonschema¶
biothings.utils.loggers¶
- class biothings.utils.loggers.Colors(value)[source]¶
Bases:
Enum
An enumeration.
- CRITICAL = '#7b0099'¶
- DEBUG = '#a1a1a1'¶
- ERROR = 'danger'¶
- INFO = 'good'¶
- NOTSET = '#d6d2d2'¶
- WARNING = 'warning'¶
- class biothings.utils.loggers.EventRecorder(*args, **kwargs)[source]¶
Bases:
StreamHandler
Initialize the handler.
If stream is not specified, sys.stderr is used.
- emit(record)[source]¶
Emit a record.
If a formatter is specified, it is used to format the record. The record is then written to the stream with a trailing newline. If exception information is present, it is formatted using traceback.print_exception and appended to the stream. If the stream has an ‘encoding’ attribute, it is used to determine how to do the output to the stream.
- class biothings.utils.loggers.Range(start: int | float = 0, end: int | float = inf)[source]¶
Bases:
object
- end: int | float = inf¶
- start: int | float = 0¶
- class biothings.utils.loggers.Record(range, value)[source]¶
Bases:
NamedTuple
Create new instance of Record(range, value)
- value: Enum¶
Alias for field number 1
- class biothings.utils.loggers.ShellLogger(*args, **kwargs)[source]¶
Bases:
Logger
Custom “levels” for input going to the shell and output coming from it (just for naming)
Initialize the logger with a name and an optional level.
- INPUT = 1001¶
- OUTPUT = 1000¶
- class biothings.utils.loggers.SlackHandler(webhook, mentions)[source]¶
Bases:
StreamHandler
Initialize the handler.
If stream is not specified, sys.stderr is used.
- emit(record)[source]¶
Emit a record.
If a formatter is specified, it is used to format the record. The record is then written to the stream with a trailing newline. If exception information is present, it is formatted using traceback.print_exception and appended to the stream. If the stream has an ‘encoding’ attribute, it is used to determine how to do the output to the stream.
- class biothings.utils.loggers.Squares(value)[source]¶
Bases:
Enum
An enumeration.
- CRITICAL = ':large_purple_square:'¶
- DEBUG = ':white_large_square:'¶
- ERROR = ':large_red_square:'¶
- INFO = ':large_blue_square:'¶
- NOTSET = ''¶
- WARNING = ':large_orange_square:'¶
- class biothings.utils.loggers.WSLogHandler(listener)[source]¶
Bases:
StreamHandler
when listener is a bt.hub.api.handlers.ws.LogListener instance, log statements are propagated through existing websocket
Initialize the handler.
If stream is not specified, sys.stderr is used.
- emit(record)[source]¶
Emit a record.
If a formatter is specified, it is used to format the record. The record is then written to the stream with a trailing newline. If exception information is present, it is formatted using traceback.print_exception and appended to the stream. If the stream has an ‘encoding’ attribute, it is used to determine how to do the output to the stream.
- class biothings.utils.loggers.WSShellHandler(listener)[source]¶
Bases:
WSLogHandler
when listener is a bt.hub.api.handlers.ws.LogListener instance, log statements are propagated through existing websocket
Initialize the handler.
If stream is not specified, sys.stderr is used.
- biothings.utils.loggers.configurate_file_handler(logger, logfile, formater=None, force=False)[source]¶
- biothings.utils.loggers.create_logger(log_folder, logger_name, level=10)[source]¶
Create and return a file logger if log_folder is provided. If log_folder is None, no file handler will be created.
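For context, a plain-logging stand-in for what create_logger() is documented to do might look like the sketch below; the file name, formatter, and folder handling are assumptions, and the actual return value of create_logger() is not reproduced here:

import logging
import os

def make_file_logger(log_folder, logger_name, level=logging.DEBUG):
    # rough stand-in: a named logger writing to <log_folder>/<logger_name>.log
    logger = logging.getLogger(logger_name)
    logger.setLevel(level)
    if log_folder:
        os.makedirs(log_folder, exist_ok=True)
        handler = logging.FileHandler(os.path.join(log_folder, f"{logger_name}.log"))
        handler.setFormatter(logging.Formatter("%(asctime)s [%(levelname)s] %(name)s: %(message)s"))
        logger.addHandler(handler)
    return logger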
biothings.utils.manager¶
- class biothings.utils.manager.BaseManager(job_manager, poll_schedule=None)[source]¶
Bases:
object
- clean_stale_status()[source]¶
During startup, search for actions in progress which would have been interrupted and change their state to “canceled”. Ex: some downloading processes could have been interrupted; at startup, their “downloading” status should be changed to “canceled” to reflect the actual state of these datasources. This must be overridden in subclasses.
- class biothings.utils.manager.BaseSourceManager(job_manager, datasource_path='dataload.sources', *args, **kwargs)[source]¶
Bases:
BaseManager
Base class to provide source management: discovery, registration. The actual launch of tasks must be defined in subclasses.
- SOURCE_CLASS = None¶
- filter_class(klass)[source]¶
Gives the subclass an opportunity to check a given class and decide whether to keep it in the discovery process. Returning None means “skip it”.
- find_classes(src_module, fail_on_notfound=True)[source]¶
Given a python module, return a list of classes in this module matching SOURCE_CLASS (i.e. classes that inherit from it)
- register_classes(klasses)[source]¶
Register each class in the self.register dict. The key will be used to retrieve the source class, create an instance and run methods from it. It must be implemented in subclasses as each manager may need to access its sources differently, based on different keys.
- class biothings.utils.manager.BaseStatusRegisterer[source]¶
Bases:
object
- property collection¶
Return the collection object used to fetch the document in which status is stored
- load_doc(key_name, stage)[source]¶
Find a document using key_name and stage, stage being a key within the document matching a specific process name. Ex: with {"_id": "123", "snapshot": "abc"},
load_doc("abc", "snapshot")
will return that document. Note key_name is first used to find the doc by its _id. Ex: with another doc {"_id": "abc", "snapshot": "somethingelse"},
load_doc("abc", "snapshot")
will return the doc with _id="abc", not "123".
- class biothings.utils.manager.CLIJobManager(loop=None)[source]¶
Bases:
object
This is the minimal JobManager used in CLI mode to run async jobs, with methods compatible with JobManager. It won’t use a dedicated ProcessPool or ThreadPool, and will just run async jobs directly in the asyncio loop (which runs jobs in threads by default).
- class biothings.utils.manager.JobManager(loop, process_queue=None, thread_queue=None, max_memory_usage=None, num_workers=None, num_threads=None, auto_recycle=True)[source]¶
Bases:
object
- COLUMNS = ['pid', 'source', 'category', 'step', 'description', 'mem', 'cpu', 'started_at', 'duration']¶
- DATALINE = '{pid:<10}|{source:<35}|{category:<10}|{step:<20}|{description:<30}|{mem:<10}|{cpu:<6}|{started_at:<20}|{duration:<10}'¶
- HEADER = {'category': 'CATEGORY', 'cpu': 'CPU', 'description': 'DESCRIPTION', 'duration': 'DURATION', 'mem': 'MEM', 'pid': 'PID', 'source': 'SOURCE', 'started_at': 'STARTED_AT', 'step': 'STEP'}¶
- HEADERLINE = '{pid:^10}|{source:^35}|{category:^10}|{step:^20}|{description:^30}|{mem:^10}|{cpu:^6}|{started_at:^20}|{duration:^10}'¶
- property hub_memory¶
- property hub_process¶
- property pchildren¶
- recycle_process_queue()[source]¶
Replace the current process queue with a new one. When processes are used over and over again, memory tends to grow as the Python interpreter keeps some data (…). Calling this method performs a clean shutdown of the current queue, waits for running processes to terminate, then discards the current queue and replaces it with a new one.
- schedule(crontab, func, *args, **kwargs)[source]¶
Helper to create a cron job from a callable “func”. *args and **kwargs are passed to func. “crontab” follows aiocron notation.
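A hedged usage sketch, assuming “job_manager” is an already-initialized JobManager bound to a running asyncio loop and “purge_old_logs” is a hypothetical maintenance callable:

def purge_old_logs(days=7):
    # hypothetical maintenance task, for illustration only
    print(f"purging hub logs older than {days} days")

# Run every night at 02:00 (crontab notation, as supported by aiocron);
# extra positional and keyword arguments are forwarded to the callable.
# job_manager.schedule("0 2 * * *", purge_old_logs, days=14)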
biothings.utils.mongo¶
- class biothings.utils.mongo.Collection(*args, **kwargs)[source]¶
Bases:
HandleAutoReconnectMixin
,Collection
Get / create a Mongo collection.
Raises
TypeError
if name is not an instance ofstr
. RaisesInvalidName
if name is not a valid collection name. Any additional keyword arguments will be used as options passed to the create command. Seecreate_collection()
for valid options.If create is
True
, collation is specified, or any additional keyword arguments are present, acreate
command will be sent, usingsession
if specified. Otherwise, acreate
command will not be sent and the collection will be created implicitly on first use. The optionalsession
argument is only used for thecreate
command, it is not associated with the collection afterward.- Parameters:
database: the database to get a collection from
name: the name of the collection to get
create (optional): if
True
, force collection creation even without options being setcodec_options (optional): An instance of
CodecOptions
. IfNone
(the default) database.codec_options is used.read_preference (optional): The read preference to use. If
None
(the default) database.read_preference is used.write_concern (optional): An instance of
WriteConcern
. IfNone
(the default) database.write_concern is used.read_concern (optional): An instance of
ReadConcern
. IfNone
(the default) database.read_concern is used.collation (optional): An instance of
Collation
. If a collation is provided, it will be passed to the create collection command.session (optional): a
ClientSession
that is used with the create collection command**kwargs (optional): additional keyword arguments will be passed as options for the create collection command
Changed in version 4.2: Added the
clusteredIndex
andencryptedFields
parameters.Changed in version 4.0: Removed the reindex, map_reduce, inline_map_reduce, parallel_scan, initialize_unordered_bulk_op, initialize_ordered_bulk_op, group, count, insert, save, update, remove, find_and_modify, and ensure_index methods. See the pymongo4-migration-guide.
Changed in version 3.6: Added
session
parameter.Changed in version 3.4: Support the collation option.
Changed in version 3.2: Added the read_concern option.
Changed in version 3.0: Added the codec_options, read_preference, and write_concern options. Removed the uuid_subtype attribute.
Collection
no longer returns an instance ofCollection
for attribute names with leading underscores. You must use dict-style lookups instead::collection[‘__my_collection__’]
Not:
collection.__my_collection__
See also
The MongoDB documentation on collections.
- class biothings.utils.mongo.Database(*args, **kwargs)[source]¶
Bases:
HandleAutoReconnectMixin
,Database
Get a database by client and name.
Raises
TypeError
if name is not an instance ofstr
. RaisesInvalidName
if name is not a valid database name.- Parameters:
client: A
MongoClient
instance.name: The database name.
codec_options (optional): An instance of
CodecOptions
. IfNone
(the default) client.codec_options is used.read_preference (optional): The read preference to use. If
None
(the default) client.read_preference is used.write_concern (optional): An instance of
WriteConcern
. IfNone
(the default) client.write_concern is used.read_concern (optional): An instance of
ReadConcern
. IfNone
(the default) client.read_concern is used.
See also
The MongoDB documentation on databases.
Changed in version 4.0: Removed the eval, system_js, error, last_status, previous_error, reset_error_history, authenticate, logout, collection_names, current_op, add_user, remove_user, profiling_level, set_profiling_level, and profiling_info methods. See the pymongo4-migration-guide.
Changed in version 3.2: Added the read_concern option.
Changed in version 3.0: Added the codec_options, read_preference, and write_concern options.
Database
no longer returns an instance ofCollection
for attribute names with leading underscores. You must use dict-style lookups instead::db[‘__my_collection__’]
Not:
db.__my_collection__
- class biothings.utils.mongo.DatabaseClient(*args, **kwargs)[source]¶
Bases:
HandleAutoReconnectMixin
,MongoClient
,IDatabase
Client for a MongoDB instance, a replica set, or a set of mongoses.
Warning
Starting in PyMongo 4.0,
directConnection
now has a default value of False instead of None. For more details, see the relevant section of the PyMongo 4.x migration guide: pymongo4-migration-direct-connection.The client object is thread-safe and has connection-pooling built in. If an operation fails because of a network error,
ConnectionFailure
is raised and the client reconnects in the background. Application code should handle this exception (recognizing that the operation failed) and then continue to execute.The host parameter can be a full mongodb URI, in addition to a simple hostname. It can also be a list of hostnames but no more than one URI. Any port specified in the host string(s) will override the port parameter. For username and passwords reserved characters like ‘:’, ‘/’, ‘+’ and ‘@’ must be percent encoded following RFC 2396:
from urllib.parse import quote_plus
uri = "mongodb://%s:%s@%s" % (
    quote_plus(user), quote_plus(password), host)
client = MongoClient(uri)
Unix domain sockets are also supported. The socket path must be percent encoded in the URI:
uri = "mongodb://%s:%s@%s" % ( quote_plus(user), quote_plus(password), quote_plus(socket_path)) client = MongoClient(uri)
But not when passed as a simple hostname:
client = MongoClient('/tmp/mongodb-27017.sock')
Starting with version 3.6, PyMongo supports mongodb+srv:// URIs. The URI must include one, and only one, hostname. The hostname will be resolved to one or more DNS SRV records which will be used as the seed list for connecting to the MongoDB deployment. When using SRV URIs, the authSource and replicaSet configuration options can be specified using TXT records. See the Initial DNS Seedlist Discovery spec for more details. Note that the use of SRV URIs implicitly enables TLS support. Pass tls=false in the URI to override.
Note
MongoClient creation will block waiting for answers from DNS when mongodb+srv:// URIs are used.
Note
Starting with version 3.0 the
MongoClient
constructor no longer blocks while connecting to the server or servers, and it no longer raisesConnectionFailure
if they are unavailable, norConfigurationError
if the user’s credentials are wrong. Instead, the constructor returns immediately and launches the connection process on background threads. You can check if the server is available like this:
from pymongo.errors import ConnectionFailure
client = MongoClient()
try:
    # The ping command is cheap and does not require auth.
    client.admin.command('ping')
except ConnectionFailure:
    print("Server not available")
Warning
When using PyMongo in a multiprocessing context, please read multiprocessing first.
Note
Many of the following options can be passed using a MongoDB URI or keyword parameters. If the same option is passed in a URI and as a keyword parameter the keyword parameter takes precedence.
- Parameters:
host (optional): hostname or IP address or Unix domain socket path of a single mongod or mongos instance to connect to, or a mongodb URI, or a list of hostnames (but no more than one mongodb URI). If host is an IPv6 literal it must be enclosed in ‘[’ and ‘]’ characters following the RFC2732 URL syntax (e.g. ‘[::1]’ for localhost). Multihomed and round robin DNS addresses are not supported.
port (optional): port number on which to connect
document_class (optional): default class to use for documents returned from queries on this client
tz_aware (optional): if
True
,datetime
instances returned as values in a document by thisMongoClient
will be timezone aware (otherwise they will be naive)connect (optional): if
True
(the default), immediately begin connecting to MongoDB in the background. Otherwise connect on the first operation.type_registry (optional): instance of
TypeRegistry
to enable encoding and decoding of custom types.datetime_conversion: Specifies how UTC datetimes should be decoded within BSON. Valid options include ‘datetime_ms’ to return as a DatetimeMS, ‘datetime’ to return as a datetime.datetime and raising a ValueError for out-of-range values, ‘datetime_auto’ to return DatetimeMS objects when the underlying datetime is out-of-range and ‘datetime_clamp’ to clamp to the minimum and maximum possible datetimes. Defaults to ‘datetime’. See handling-out-of-range-datetimes for details.
Other optional parameters can be passed as keyword arguments:- directConnection (optional): if
True
, forces this client to connect directly to the specified MongoDB host as a standalone. If
false
, the client connects to the entire replica set of which the given MongoDB host(s) is a part. If this isTrue
and a mongodb+srv:// URI or a URI containing multiple seeds is provided, an exception will be raised.
- directConnection (optional): if
maxPoolSize (optional): The maximum allowable number of concurrent connections to each connected server. Requests to a server will block if there are maxPoolSize outstanding connections to the requested server. Defaults to 100. Can be either 0 or None, in which case there is no limit on the number of concurrent connections.
minPoolSize (optional): The minimum required number of concurrent connections that the pool will maintain to each connected server. Default is 0.
maxIdleTimeMS (optional): The maximum number of milliseconds that a connection can remain idle in the pool before being removed and replaced. Defaults to None (no limit).
maxConnecting (optional): The maximum number of connections that each pool can establish concurrently. Defaults to 2.
timeoutMS: (integer or None) Controls how long (in milliseconds) the driver will wait when executing an operation (including retry attempts) before raising a timeout error.
0
orNone
means no timeout.socketTimeoutMS: (integer or None) Controls how long (in milliseconds) the driver will wait for a response after sending an ordinary (non-monitoring) database operation before concluding that a network error has occurred.
0
orNone
means no timeout. Defaults toNone
(no timeout).connectTimeoutMS: (integer or None) Controls how long (in milliseconds) the driver will wait during server monitoring when connecting a new socket to a server before concluding the server is unavailable.
0
orNone
means no timeout. Defaults to20000
(20 seconds).server_selector: (callable or None) Optional, user-provided function that augments server selection rules. The function should accept as an argument a list of
ServerDescription
objects and return a list of server descriptions that should be considered suitable for the desired operation.serverSelectionTimeoutMS: (integer) Controls how long (in milliseconds) the driver will wait to find an available, appropriate server to carry out a database operation; while it is waiting, multiple server monitoring operations may be carried out, each controlled by connectTimeoutMS. Defaults to
30000
(30 seconds).waitQueueTimeoutMS: (integer or None) How long (in milliseconds) a thread will wait for a socket from the pool if the pool has no free sockets. Defaults to
None
(no timeout).heartbeatFrequencyMS: (optional) The number of milliseconds between periodic server checks, or None to accept the default frequency of 10 seconds.
appname: (string or None) The name of the application that created this MongoClient instance. The server will log this value upon establishing each connection. It is also recorded in the slow query log and profile collections.
driver: (pair or None) A driver implemented on top of PyMongo can pass a
DriverInfo
to add its name, version, and platform to the message printed in the server log when establishing a connection.event_listeners: a list or tuple of event listeners. See
monitoring
for details.retryWrites: (boolean) Whether supported write operations executed within this MongoClient will be retried once after a network error. Defaults to
True
. The supported write operations are:bulk_write()
, as long asUpdateMany
orDeleteMany
are not included.delete_one()
insert_one()
insert_many()
replace_one()
update_one()
find_one_and_delete()
find_one_and_replace()
find_one_and_update()
Unsupported write operations include, but are not limited to,
aggregate()
using the$out
pipeline operator and any operation with an unacknowledged write concern (e.g. {w: 0})). See https://github.com/mongodb/specifications/blob/master/source/retryable-writes/retryable-writes.rstretryReads: (boolean) Whether supported read operations executed within this MongoClient will be retried once after a network error. Defaults to
True
. The supported read operations are:find()
,find_one()
,aggregate()
without$out
,distinct()
,count()
,estimated_document_count()
,count_documents()
,pymongo.collection.Collection.watch()
,list_indexes()
,pymongo.database.Database.watch()
,list_collections()
,pymongo.mongo_client.MongoClient.watch()
, andlist_databases()
.Unsupported read operations include, but are not limited to
command()
and any getMore operation on a cursor.Enabling retryable reads makes applications more resilient to transient errors such as network failures, database upgrades, and replica set failovers. For an exact definition of which errors trigger a retry, see the retryable reads specification.
compressors: Comma separated list of compressors for wire protocol compression. The list is used to negotiate a compressor with the server. Currently supported options are “snappy”, “zlib” and “zstd”. Support for snappy requires the python-snappy package. zlib support requires the Python standard library zlib module. zstd requires the zstandard package. By default no compression is used. Compression support must also be enabled on the server. MongoDB 3.6+ supports snappy and zlib compression. MongoDB 4.2+ adds support for zstd.
zlibCompressionLevel: (int) The zlib compression level to use when zlib is used as the wire protocol compressor. Supported values are -1 through 9. -1 tells the zlib library to use its default compression level (usually 6). 0 means no compression. 1 is best speed. 9 is best compression. Defaults to -1.
uuidRepresentation: The BSON representation to use when encoding from and decoding to instances of
UUID
. Valid values are the strings: “standard”, “pythonLegacy”, “javaLegacy”, “csharpLegacy”, and “unspecified” (the default). New applications should consider setting this to “standard” for cross language compatibility. See handling-uuid-data-example for details.unicode_decode_error_handler: The error handler to apply when a Unicode-related error occurs during BSON decoding that would otherwise raise
UnicodeDecodeError
. Valid options include ‘strict’, ‘replace’, ‘backslashreplace’, ‘surrogateescape’, and ‘ignore’. Defaults to ‘strict’.srvServiceName: (string) The SRV service name to use for “mongodb+srv://” URIs. Defaults to “mongodb”. Use it like so:
MongoClient("mongodb+srv://example.com/?srvServiceName=customname")
srvMaxHosts: (int) limits the number of mongos-like hosts a client will connect to. More specifically, when a “mongodb+srv://” connection string resolves to more than srvMaxHosts number of hosts, the client will randomly choose an srvMaxHosts sized subset of hosts.
Write Concern options:(Only set if passed. No default values.)w: (integer or string) If this is a replica set, write operations will block until they have been replicated to the specified number or tagged set of servers. w=<int> always includes the replica set primary (e.g. w=3 means write to the primary and wait until replicated to two secondaries). Passing w=0 disables write acknowledgement and all other write concern options.
wTimeoutMS: (integer) Used in conjunction with w. Specify a value in milliseconds to control how long to wait for write propagation to complete. If replication does not complete in the given timeframe, a timeout exception is raised. Passing wTimeoutMS=0 will cause write operations to wait indefinitely.
journal: If
True
block until write operations have been committed to the journal. Cannot be used in combination with fsync. Write operations will fail with an exception if this option is used when the server is running without journaling.fsync: If
True
and the server is running without journaling, blocks until the server has synced all data files to disk. If the server is running with journaling, this acts the same as the j option, blocking until write operations have been committed to the journal. Cannot be used in combination with j.
Replica set keyword arguments for connecting with a replica set - either directly or via a mongos:replicaSet: (string or None) The name of the replica set to connect to. The driver will verify that all servers it connects to match this name. Implies that the hosts specified are a seed list and the driver should attempt to find all members of the set. Defaults to
None
.
Read Preference:readPreference: The replica set read preference for this client. One of
primary
,primaryPreferred
,secondary
,secondaryPreferred
, ornearest
. Defaults toprimary
.readPreferenceTags: Specifies a tag set as a comma-separated list of colon-separated key-value pairs. For example
dc:ny,rack:1
. Defaults toNone
.maxStalenessSeconds: (integer) The maximum estimated length of time a replica set secondary can fall behind the primary in replication before it will no longer be selected for operations. Defaults to
-1
, meaning no maximum. If maxStalenessSeconds is set, it must be a positive integer greater than or equal to 90 seconds.
See also
/examples/server_selection
Authentication:username: A string.
password: A string.
Although username and password must be percent-escaped in a MongoDB URI, they must not be percent-escaped when passed as parameters. In this example, both the space and slash special characters are passed as-is:
MongoClient(username="user name", password="pass/word")
authSource: The database to authenticate on. Defaults to the database specified in the URI, if provided, or to “admin”.
authMechanism: See
MECHANISMS
for options. If no mechanism is specified, PyMongo automatically uses SCRAM-SHA-1 when connected to MongoDB 3.6 and negotiates the mechanism to use (SCRAM-SHA-1 or SCRAM-SHA-256) when connected to MongoDB 4.0+.
authMechanismProperties: Used to specify authentication mechanism specific options. To specify the service name for GSSAPI authentication pass authMechanismProperties=’SERVICE_NAME:<service name>’. To specify the session token for MONGODB-AWS authentication pass
authMechanismProperties='AWS_SESSION_TOKEN:<session token>'
.
See also
/examples/authentication
TLS/SSL configuration:tls: (boolean) If
True
, create the connection to the server using transport layer security. Defaults toFalse
.tlsInsecure: (boolean) Specify whether TLS constraints should be relaxed as much as possible. Setting
tlsInsecure=True
impliestlsAllowInvalidCertificates=True
andtlsAllowInvalidHostnames=True
. Defaults toFalse
. Think very carefully before setting this toTrue
as it dramatically reduces the security of TLS.tlsAllowInvalidCertificates: (boolean) If
True
, continues the TLS handshake regardless of the outcome of the certificate verification process. If this isFalse
, and a value is not provided fortlsCAFile
, PyMongo will attempt to load system provided CA certificates. If the python version in use does not support loading system CA certificates then thetlsCAFile
parameter must point to a file of CA certificates.tlsAllowInvalidCertificates=False
impliestls=True
. Defaults toFalse
. Think very carefully before setting this toTrue
as that could make your application vulnerable to on-path attackers.tlsAllowInvalidHostnames: (boolean) If
True
, disables TLS hostname verification.tlsAllowInvalidHostnames=False
impliestls=True
. Defaults toFalse
. Think very carefully before setting this toTrue
as that could make your application vulnerable to on-path attackers.tlsCAFile: A file containing a single or a bundle of “certification authority” certificates, which are used to validate certificates passed from the other end of the connection. Implies
tls=True
. Defaults toNone
.tlsCertificateKeyFile: A file containing the client certificate and private key. Implies
tls=True
. Defaults toNone
.tlsCRLFile: A file containing a PEM or DER formatted certificate revocation list. Implies
tls=True
. Defaults toNone
.tlsCertificateKeyFilePassword: The password or passphrase for decrypting the private key in
tlsCertificateKeyFile
. Only necessary if the private key is encrypted. Defaults toNone
.tlsDisableOCSPEndpointCheck: (boolean) If
True
, disables certificate revocation status checking via the OCSP responder specified on the server certificate.tlsDisableOCSPEndpointCheck=False
impliestls=True
. Defaults toFalse
.ssl: (boolean) Alias for
tls
.
Read Concern options:(If not set explicitly, this will use the server default)readConcernLevel: (string) The read concern level specifies the level of isolation for read operations. For example, a read operation using a read concern level of
majority
will only return data that has been written to a majority of nodes. If the level is left unspecified, the server default will be used.
Client side encryption options:(If not set explicitly, client side encryption will not be enabled.)auto_encryption_opts: A
AutoEncryptionOpts
which configures this client to automatically encrypt collection commands and automatically decrypt results. See automatic-client-side-encryption for an example. If aMongoClient
is configured withauto_encryption_opts
and a non-NonemaxPoolSize
, a separate internalMongoClient
is created if any of the following are true:A
key_vault_client
is not passed toAutoEncryptionOpts
bypass_auto_encryption=False
is passed toAutoEncryptionOpts
Stable API options:(If not set explicitly, Stable API will not be enabled.)server_api: A
ServerApi
which configures this client to use Stable API. See versioned-api-ref for details.
See also
The MongoDB documentation on connections.
Changed in version 4.2: Added the
timeoutMS
keyword argument.Changed in version 4.0:
Removed the fsync, unlock, is_locked, database_names, and close_cursor methods. See the pymongo4-migration-guide.
Removed the
waitQueueMultiple
andsocketKeepAlive
keyword arguments.The default for uuidRepresentation was changed from
pythonLegacy
tounspecified
.Added the
srvServiceName
,maxConnecting
, andsrvMaxHosts
URI and keyword arguments.
Changed in version 3.12: Added the
server_api
keyword argument. The following keyword arguments were deprecated:ssl_certfile
andssl_keyfile
were deprecated in favor oftlsCertificateKeyFile
.
Changed in version 3.11: Added the following keyword arguments and URI options:
tlsDisableOCSPEndpointCheck
directConnection
Changed in version 3.9: Added the
retryReads
keyword argument and URI option. Added thetlsInsecure
keyword argument and URI option. The following keyword arguments and URI options were deprecated:wTimeout
was deprecated in favor ofwTimeoutMS
.j
was deprecated in favor ofjournal
.ssl_cert_reqs
was deprecated in favor oftlsAllowInvalidCertificates
.ssl_match_hostname
was deprecated in favor oftlsAllowInvalidHostnames
.ssl_ca_certs
was deprecated in favor oftlsCAFile
.ssl_certfile
was deprecated in favor oftlsCertificateKeyFile
.ssl_crlfile
was deprecated in favor oftlsCRLFile
.ssl_pem_passphrase
was deprecated in favor oftlsCertificateKeyFilePassword
.
Changed in version 3.9:
retryWrites
now defaults toTrue
.Changed in version 3.8: Added the
server_selector
keyword argument. Added thetype_registry
keyword argument.Changed in version 3.7: Added the
driver
keyword argument.Changed in version 3.6: Added support for mongodb+srv:// URIs. Added the
retryWrites
keyword argument and URI option.Changed in version 3.5: Add
username
andpassword
options. Document theauthSource
,authMechanism
, andauthMechanismProperties
options. Deprecated thesocketKeepAlive
keyword argument and URI option.socketKeepAlive
now defaults toTrue
.Changed in version 3.0:
MongoClient
is now the one and only client class for a standalone server, mongos, or replica set. It includes the functionality that had been split intoMongoReplicaSetClient
: it can connect to a replica set, discover all its members, and monitor the set for stepdowns, elections, and reconfigs.The
MongoClient
constructor no longer blocks while connecting to the server or servers, and it no longer raisesConnectionFailure
if they are unavailable, norConfigurationError
if the user’s credentials are wrong. Instead, the constructor returns immediately and launches the connection process on background threads.Therefore the
alive
method is removed since it no longer provides meaningful information; even if the client is disconnected, it may discover a server in time to fulfill the next operation.In PyMongo 2.x,
MongoClient
accepted a list of standalone MongoDB servers and used the first it could connect to:MongoClient(['host1.com:27017', 'host2.com:27017'])
A list of multiple standalones is no longer supported; if multiple servers are listed they must be members of the same replica set, or mongoses in the same sharded cluster.
The behavior for a list of mongoses is changed from “high availability” to “load balancing”. Before, the client connected to the lowest-latency mongos in the list, and used it until a network error prompted it to re-evaluate all mongoses’ latencies and reconnect to one of them. In PyMongo 3, the client monitors its network latency to all the mongoses continuously, and distributes operations evenly among those with the lowest latency. See mongos-load-balancing for more information.
The
connect
option is added.The
start_request
,in_request
, andend_request
methods are removed, as well as theauto_start_request
option.The
copy_database
method is removed, see the copy_database examples for alternatives.The
MongoClient.disconnect()
method is removed; it was a synonym forclose()
.MongoClient
no longer returns an instance ofDatabase
for attribute names with leading underscores. You must use dict-style lookups instead:client['__my_database__']
Not:
client.__my_database__
- class biothings.utils.mongo.HandleAutoReconnectMixin(*args, **kwargs)[source]¶
Bases:
object
This mixin decorates any non-hidden method with the handle_autoreconnect decorator
- exception biothings.utils.mongo.MaxRetryAutoReconnectException(message: str = '', errors: Mapping[str, Any] | Sequence | None = None)[source]¶
Bases:
AutoReconnect
Raised when we reach the maximum number of retries to connect to the Mongo server
- biothings.utils.mongo.check_document_size(doc)[source]¶
Return True if doc isn’t too large for mongo DB
- biothings.utils.mongo.doc_feeder(collection, step=1000, s=None, e=None, inbatch=False, query=None, batch_callback=None, fields=None, logger=<module 'logging' from '/home/docs/.asdf/installs/python/3.10.13/lib/python3.10/logging/__init__.py'>, session_refresh_interval=5)[source]¶
An iterator returning docs in a collection, with batch query.
Additional filter query can be passed via query, e.g., doc_feeder(collection, query={‘taxid’: {‘$in’: [9606, 10090, 10116]}}) batch_callback is a callback function as fn(index, t), called after every batch. fields is an optional parameter to restrict the fields to return.
session_refresh_interval is 5 minutes by default. The refreshSessions command is called every 5 minutes to keep the session alive; otherwise the session and all cursors attached to it (explicitly or implicitly) will time out after idling for 30 minutes, even if no_cursor_timeout is set to True on a cursor. See https://www.mongodb.com/docs/manual/reference/command/refreshSessions/ and https://www.mongodb.com/docs/manual/reference/method/cursor.noCursorTimeout/#session-idle-timeout-overrides-nocursortimeout
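A hedged usage sketch, assuming “collection” is a pymongo collection obtained from the hub’s source database:

from biothings.utils.mongo import doc_feeder

def count_human_docs(collection):
    # stream the collection in batches of 5,000 and count human (taxid 9606) documents
    total = 0
    for doc in doc_feeder(collection, step=5000, query={"taxid": 9606}):
        total += 1
    return total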
- biothings.utils.mongo.get_previous_collection(new_id)[source]¶
Given ‘new_id’, an _id from src_build, as the “new” collection, automatically select an “old” collection. By default, src_build’s documents will be sorted according to their name (_id) and old collection is the one just before new_id.
Note: because more than one build config can be used, the actual build config name is first determined from the new_id collection name, then the find().sort() is done on collections containing that build config name.
- biothings.utils.mongo.get_source_fullname(col_name)[source]¶
Assuming col_name is a collection created from an upload process, find the main source & sub_source associated.
- biothings.utils.mongo.handle_autoreconnect(cls_instance, func)[source]¶
After upgrading the pymongo package from 3.12 to 4.x, the “AutoReconnect: connection pool paused” problem appears quite often. It is not clear that the problem comes from our codebase; it may be a pymongo issue.
This function is an attempt to handle the AutoReconnect exception without modifying our codebase. When the exception is raised, we just wait for some time, then retry. If the error still happens after MAX_RETRY, it must be a connection-related problem; we should stop retrying and raise the error.
Ref: https://github.com/newgene/biothings.api/pull/40#issuecomment-1185334545
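The retry pattern described above can be sketched as follows; this is a simplified illustration, not the hub’s actual decorator, and MAX_RETRY and the delay are assumptions:

import time

from pymongo.errors import AutoReconnect

MAX_RETRY = 5            # assumed retry limit, for illustration only
RETRY_DELAY_SECONDS = 1  # assumed pause between attempts

def retry_on_autoreconnect(func):
    # retry a pymongo call a few times when AutoReconnect is raised
    def wrapper(*args, **kwargs):
        for _ in range(MAX_RETRY):
            try:
                return func(*args, **kwargs)
            except AutoReconnect:
                time.sleep(RETRY_DELAY_SECONDS)
        # still failing after MAX_RETRY attempts: likely a real connection issue,
        # so let the exception propagate this time
        return func(*args, **kwargs)
    return wrapper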
- biothings.utils.mongo.id_feeder(col, batch_size=1000, build_cache=True, logger=<module 'logging' from '/home/docs/.asdf/installs/python/3.10.13/lib/python3.10/logging/__init__.py'>, force_use=False, force_build=False, validate_only=False)[source]¶
Return an iterator for all _ids in collection “col”.
Search for a valid cache file if available, if not, return a doc_feeder for that collection. Valid cache is a cache file that is newer than the collection.
“db” can be “target” or “src”. “build_cache” True will build a cache file as _ids are fetched, if no cache file was found. “force_use” True will use any existing cache file and won’t check whether it’s valid or not. “force_build” True will build a new cache even if the current one exists and is valid. “validate_only” will directly return [] if the cache is valid (a convenient way to check whether the cache is valid).
biothings.utils.parallel¶
biothings.utils.parallel_mp¶
- class biothings.utils.parallel_mp.ParallelResult(agg_function, agg_function_init)[source]¶
Bases:
object
- biothings.utils.parallel_mp.run_parallel_on_ids_dir(fun, ids_dir, backend_options=None, agg_function=<function agg_by_append>, agg_function_init=[], outpath=None, num_workers=2, mget_chunk_size=10000, ignore_None=True, error_path=None, **query_kwargs)[source]¶
This function will run function fun on chunks defined by the files in ids_dir.
All parameters are fed to run_parallel_on_iterable, except:
- Params ids_dir:
Directory containing only files with ids, one per line. The number of files defines the number of chunks.
- biothings.utils.parallel_mp.run_parallel_on_ids_file(fun, ids_file, backend_options=None, agg_function=<function agg_by_append>, agg_function_init=[], chunk_size=1000000, num_workers=2, outpath=None, mget_chunk_size=10000, ignore_None=True, error_path=None, **query_kwargs)[source]¶
Implementation of run_parallel_on_iterable, where iterable comes from the lines of a file.
All parameters are fed to run_on_ids_iterable, except:
- Parameters:
ids_file – Path to file with ids, one per line.
- biothings.utils.parallel_mp.run_parallel_on_iterable(fun, iterable, backend_options=None, agg_function=<function agg_by_append>, agg_function_init=None, chunk_size=1000000, num_workers=2, outpath=None, mget_chunk_size=10000, ignore_None=True, error_path=None, **query_kwargs)[source]¶
This function will run a user function on all documents in a backend database in parallel using multiprocessing.Pool. The overview of the process looks like this:
Chunk (into chunks of size “chunk_size”) items in iterable, and run the following script on each chunk using a multiprocessing.Pool object with “num_workers” processes:
- For each document in the list of ids in this chunk (documents retrieved in chunks of “mget_chunk_size”):
Run function “fun” with parameters (doc, chunk_num, f <file handle only passed if “outpath” is not None>), and aggregate the result with the current results using function “agg_function”.
- Parameters:
fun – The function to run on all documents. If outpath is NOT specified, fun must accept two parameters: (doc, chunk_num), where doc is the backend document, and chunk_num is essentially a unique process id. If outpath IS specified, an additional open file handle (correctly tagged with the current chunk’s chunk_num) will also be passed to fun, and thus it must accept three parameters: (doc, chunk_num, f)
iterable – Iterable of ids.
backend_options – An instance of biothings.utils.backend.DocBackendOptions. This contains the options necessary to instantiate the correct backend class (ES, mongo, etc).
agg_function – This function aggregates the return value of each run of function fun. It should take 2 parameters: (prev, curr), where prev is the previous aggregated result, and curr is the output of the current function run. It should return some value that represents the aggregation of the previous aggregated results with the output of the current function.
agg_function_init – Initialization value for the aggregated result.
chunk_size – Length of the ids list sent to each chunk.
num_workers – Number of processes that consume chunks in parallel. https://docs.python.org/2/library/multiprocessing.html#multiprocessing.pool.multiprocessing.Pool
outpath – Base path for output files. Because function fun can be run many times in parallel, each chunk is sequentially numbered, and the output file name for any chunk is outpath_{chunk_num}, e.g., if outpath is out, all output files will be of the form: /path/to/cwd/out_1, /path/to/cwd/out_2, etc.
error_path – Base path for error files. If included, exceptions inside each chunk thread will be printed to these files.
mget_chunk_size – The size of each mget chunk inside each chunk thread. In each thread, the ids list is consumed by passing chunks to a mget_by_ids function. This parameter controls the size of each mget.
ignore_None – If set, then falsy values will not be aggregated (0, [], None, etc) in the aggregation step. Default True.
All other parameters are fed to the backend query.
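A hedged usage sketch: count how many documents the backend returns for a list of ids. “backend_opts” is assumed to be a properly constructed biothings.utils.backend.DocBackendOptions instance, and “list_of_ids” a list of document ids:

from biothings.utils.parallel_mp import run_parallel_on_iterable

def count_doc(doc, chunk_num):
    # called once per retrieved document; chunk_num identifies the worker chunk
    return 1

def add(prev, curr):
    # aggregate per-document results into a running total
    return prev + curr

# total = run_parallel_on_iterable(count_doc, list_of_ids,
#                                  backend_options=backend_opts,
#                                  agg_function=add, agg_function_init=0,
#                                  num_workers=4)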
- biothings.utils.parallel_mp.run_parallel_on_query(fun, backend_options=None, query=None, agg_function=<function agg_by_append>, agg_function_init=[], chunk_size=1000000, num_workers=2, outpath=None, mget_chunk_size=10000, ignore_None=True, error_path=None, full_doc=False, **query_kwargs)[source]¶
Implementation of run_parallel_on_ids_iterable, where the ids iterable comes from the result of a query on the specified backend.
All parameters are fed to run_parallel_on_ids_iterable, except:
- Parameters:
query – ids come from results of this query run on backend, default: “match_all”
full_doc – If True, a list of documents is passed to each subprocess, rather than ids that are looked up later. Should be faster? Unknown how this works with very large query sets…
biothings.utils.parsers¶
- biothings.utils.parsers.docker_source_info_parser(url)[source]¶
- Parameters:
url – file URL including a docker connection string, in the format: docker://CONNECTION_NAME?image=DOCKER_IMAGE&tag=TAG&dump_command=”python run.py”&path=/path/to/file. The CONNECTION_NAME must be defined in the biothings Hub config. Example: docker://CONNECTION_NAME?image=docker_image&tag=docker_tag&dump_command=”python run.py”&path=/path/to/file
- Returns:
- biothings.utils.parsers.json_array_parser(patterns: Iterable[str] | None = None) Callable[[str], Generator[dict, None, None]] [source]¶
Create JSON Array Parser given filename patterns
For use with manifest.json based plugins. The data comes as a JSON array containing multiple documents.
- Parameters:
patterns – glob-compatible patterns for filenames, e.g. *.json, data.json
- Returns:
parser_func
- biothings.utils.parsers.ndjson_parser(patterns: Iterable[str] | None = None) Callable[[str], Generator[dict, None, None]] [source]¶
Create NDJSON Parser given filename patterns
For use with manifest.json based plugins. Caveat: Only handles valid NDJSON (no extra newlines, UTF8, etc.)
- Parameters:
patterns – glob-compatible patterns for filenames, e.g. *.ndjson, data.ndjson
- Returns:
- Generator that takes in a data_folder and returns documents from
NDJSON files that matches the filename patterns
- Return type:
parser_func
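A hedged usage sketch for a manifest.json-based plugin; the *.ndjson pattern and the data_folder layout are assumptions:

from biothings.utils.parsers import ndjson_parser

# build the parser once; it accepts a data_folder and yields one dict per NDJSON line
parse_ndjson = ndjson_parser(patterns=["*.ndjson"])

def load_data(data_folder):
    yield from parse_ndjson(data_folder)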
biothings.utils.redis¶
- class biothings.utils.redis.RedisClient(connection_params)[source]¶
Bases:
object
- client = None¶
- get_db(db_name=None)[source]¶
Return a redis client instance from a database name or database number (if db_name is an integer)
- initialize(deep=False)[source]¶
Careful: this may delete data. Prepare the Redis instance to work with the biothings hub:
- database 0: this db is used to store a mapping between database index and database name (so a database can be accessed by name). This method will flush this db and prepare it.
- any other databases will be flushed if deep is True, making the redis server fully dedicated to the hub.
- property mapdb¶
biothings.utils.serializer¶
- biothings.utils.serializer.json_dumps(data, indent=False, sort_keys=False)¶
- biothings.utils.serializer.json_loads(json_str: bytes | str) Any ¶
Load a JSON string or bytes using orjson
- biothings.utils.serializer.load_json(json_str: bytes | str) Any [source]¶
Load a JSON string or bytes using orjson
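A quick round-trip sketch using the orjson-backed helpers (whether json_dumps returns str or bytes is not asserted here, since load_json accepts both):

from biothings.utils.serializer import json_dumps, load_json

doc = {"_id": "rs58991260", "taxid": 9606}
raw = json_dumps(doc)          # JSON produced by orjson
assert load_json(raw) == doc   # round-trips back to the same dict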
biothings.utils.shelve¶
biothings.utils.sqlite3¶
- class biothings.utils.sqlite3.Collection(colname, db)[source]¶
Bases:
object
- property database¶
- findv2(*args, **kwargs)[source]¶
This is a new version of find() that uses the JSON features of sqlite3; it will replace find() in the future
- property name¶
- class biothings.utils.sqlite3.Database(db_folder, name=None)[source]¶
Bases:
IDatabase
- CONFIG = <ConfigurationWrapper over <module 'config' from '/home/docs/checkouts/readthedocs.org/user_builds/biothingsapi/checkouts/stable/biothings/hub/default_config.py'>>¶
- property address¶
Returns sufficient information so a connection to a database can be created. Information can be a dictionary, object, etc… and depends on the actual backend
biothings.utils.version¶
Functions to return versions of things.
- biothings.utils.version.check_new_version(folder, max_commits=10)[source]¶
Given a folder pointing to a Git repo, return a dict containing info about remote commits not yet applied to the repo, or an empty dict if nothing new.
- biothings.utils.version.get_python_version()[source]¶
Get a list of python packages installed and their versions.
- biothings.utils.version.get_repository_information(app_dir=None)[source]¶
Get the repository information for the local repository, if it exists.
- biothings.utils.version.get_source_code_info(src_file)[source]¶
Given a path to source code, try to find information about the repository, revision, URL pointing to that file, etc… Return None if nothing can be determined. Tricky cases:
src_file could refer to another repo within the current repo (namely a remote data plugin, cloned within the api’s plugins folder)
src_file could point to a folder, for instance when a data plugin is analyzed. This is because we can’t point to an uploader file since it’s dynamically generated
biothings.hub¶
- class biothings.hub.HubSSHServer[source]¶
Bases:
SSHServer
- PASSWORDS = {}¶
- SHELL = None¶
- begin_auth(username)[source]¶
Authentication has been requested by the client
This method will be called when authentication is attempted for the specified user. Applications should use this method to prepare whatever state they need to complete the authentication, such as loading in the set of authorized keys for that user. If no authentication is required for this user, this method should return False to cause the authentication to immediately succeed. Otherwise, it should return True to indicate that authentication should proceed.
If blocking operations need to be performed to prepare the state needed to complete the authentication, this method may be defined as a coroutine.
- Parameters:
username (str) – The name of the user being authenticated
- Returns:
A bool indicating whether authentication is required
- connection_lost(exc)[source]¶
Called when a connection is lost or closed
This method is called when a connection is closed. If the connection is shut down cleanly, exc will be None. Otherwise, it will be an exception explaining the reason for the disconnect.
- connection_made(connection)[source]¶
Called when a connection is made
This method is called when a new TCP connection is accepted. The conn parameter should be stored if needed for later use.
- Parameters:
conn (
SSHServerConnection
) – The connection which was successfully opened
- password_auth_supported()[source]¶
Return whether or not password authentication is supported
This method should return True if password authentication is supported. Applications wishing to support it must have this method return True and implement
validate_password()
to return whether or not the password provided by the client is valid for the user being authenticated.By default, this method returns False indicating that password authentication is not supported.
- Returns:
A bool indicating if password authentication is supported or not
- session_requested()[source]¶
Handle an incoming session request
This method is called when a session open request is received from the client, indicating it wishes to open a channel to be used for running a shell, executing a command, or connecting to a subsystem. If the application wishes to accept the session, it must override this method to return either an
SSHServerSession
object to use to process the data received on the channel or a tuple consisting of anSSHServerChannel
object created withcreate_server_channel
and anSSHServerSession
, if the application wishes to pass non-default arguments when creating the channel.If blocking operations need to be performed before the session can be created, a coroutine which returns an
SSHServerSession
object can be returned instead of the session iself. This can be either returned directly or as a part of a tuple with anSSHServerChannel
object.To reject this request, this method should return False to send back a “Session refused” response or raise a
ChannelOpenError
exception with the reason for the failure.The details of what type of session the client wants to start will be delivered to methods on the
SSHServerSession
object which is returned, along with other information such as environment variables, terminal type, size, and modes.By default, all session requests are rejected.
- Returns:
One of the following:
An
SSHServerSession
object or a coroutine which returns anSSHServerSession
A tuple consisting of an
SSHServerChannel
and the aboveA callable or coroutine handler function which takes AsyncSSH stream objects for stdin, stdout, and stderr as arguments
A tuple consisting of an
SSHServerChannel
and the aboveFalse to refuse the request
- Raises:
ChannelOpenError
if the session shouldn’t be accepted
- validate_password(username, password)[source]¶
Return whether password is valid for this user
This method should return True if the specified password is a valid password for the user being authenticated. It must be overridden by applications wishing to support password authentication.
If the password provided is valid but expired, this method may raise
PasswordChangeRequired
to request that the client provide a new password before authentication is allowed to complete. In this case, the application must overridechange_password()
to handle the password change request.This method may be called multiple times with different passwords provided by the client. Applications may wish to limit the number of attempts which are allowed. This can be done by having
password_auth_supported()
begin returning False after the maximum number of attempts is exceeded.If blocking operations need to be performed to determine the validity of the password, this method may be defined as a coroutine.
By default, this method returns False for all passwords.
- Parameters:
username (str) – The user being authenticated
password (str) – The password sent by the client
- Returns:
A bool indicating if the specified password is valid for the user being authenticated
- Raises:
PasswordChangeRequired
if the password provided is expired and needs to be changed
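As a minimal sketch of the contract described above (not the hub’s actual authentication logic), a subclass could check the supplied password against the class-level PASSWORDS mapping; the plain-text credentials below are purely for illustration:

from biothings.hub import HubSSHServer

class DemoSSHServer(HubSSHServer):
    PASSWORDS = {"guest": "guest"}  # hypothetical credentials, illustration only

    def password_auth_supported(self):
        return True

    def validate_password(self, username, password):
        return self.PASSWORDS.get(username) == password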
- class biothings.hub.HubSSHServerSession(name, shell)[source]¶
Bases:
SSHServerSession
- break_received(msec)[source]¶
The client has sent a break
This method is called when the client requests that the server perform a break operation on the terminal. If the break is performed, this method should return True. Otherwise, it should return False.
By default, this method returns False indicating that no break was performed.
- Parameters:
msec (int) – The duration of the break in milliseconds
- Returns:
A bool to indicate if the break operation was performed or not
- connection_made(chan)[source]¶
Called when a channel is opened successfully
This method is called when a channel is opened successfully. The channel parameter should be stored if needed for later use.
- Parameters:
chan (
SSHServerChannel
) – The channel which was successfully opened.
- data_received(data, datatype)[source]¶
Called when data is received on the channel
This method is called when data is received on the channel. If an encoding was specified when the channel was created, the data will be delivered as a string after decoding with the requested encoding. Otherwise, the data will be delivered as bytes.
- Parameters:
data (str or bytes) – The data received on the channel
datatype – The extended data type of the data, from extended data types
- eof_received()[source]¶
Called when EOF is received on the channel
This method is called when an end-of-file indication is received on the channel, after which no more data will be received. If this method returns True, the channel remains half open and data may still be sent. Otherwise, the channel is automatically closed after this method returns. This is the default behavior for classes derived directly from
SSHSession
, but not when using the higher-level streams API. Because input is buffered in that case, streaming sessions enable half-open channels to allow applications to respond to input read after an end-of-file indication is received.
- exec_requested(command)[source]¶
The client has requested to execute a command
This method should be implemented by the application to perform whatever processing is required when a client makes a request to execute a command. It should return True to accept the request, or False to reject it.
If the application returns True, the
session_started()
method will be called once the channel is fully open. No output should be sent until this method is called.By default this method returns False to reject all requests.
- Parameters:
command (str) – The command the client has requested to execute
- Returns:
A bool indicating if the exec request was allowed or not
- session_started()[source]¶
Called when the session is started
This method is called when a session has started up. For client and server sessions, this will be called once a shell, exec, or subsystem request has been successfully completed. For TCP and UNIX domain socket sessions, it will be called immediately after the connection is opened.
- shell_requested()[source]¶
The client has requested a shell
This method should be implemented by the application to perform whatever processing is required when a client makes a request to open an interactive shell. It should return True to accept the request, or False to reject it.
If the application returns True, the session_started() method will be called once the channel is fully open. No output should be sent until this method is called.
By default this method returns False to reject all requests.
- Returns:
A bool indicating if the shell request was allowed or not
- soft_eof_received()[source]¶
The client has sent a soft EOF
This method is called by the line editor when the client sends a soft EOF (Ctrl-D on an empty input line).
By default, soft EOF will trigger an EOF to an outstanding read call but still allow additional input to be received from the client after that.
- class biothings.hub.HubServer(source_list, features=None, name='BioThings Hub', managers_custom_args=None, api_config=None, reloader_config=None, dataupload_config=None, websocket_config=None, autohub_config=None)[source]¶
Bases:
object
Helper to set up and instantiate the common managers usually used in a hub (e.g. dumper manager, uploader manager, etc…). "source_list" is either:
a list of strings corresponding to paths to datasource modules
a package containing sub-folders with datasource modules
Specific managers can be retrieved by adjusting the "features" parameter, where each feature corresponds to one or more managers. The parameter defaults to all available features. Managers are configured/initialized in the same order as the list, so if a manager (e.g. job_manager) is required by all others, it must be the first in the list. "managers_custom_args" is an optional dict used to pass specific arguments while initializing managers: for instance, passing a custom upload "poll_schedule" will set the poll schedule to check uploads every 5 minutes (instead of the default 10 seconds). "reloader_config", "dataupload_config", "autohub_config" and "websocket_config" can be used to customize the reloader, data upload, autohub and websocket features. If None, the default config is used. If explicitly False, the feature is deactivated.
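A minimal instantiation sketch, assuming a hypothetical datasources package "hub.dataload.sources" and an illustrative crontab-style schedule:

from biothings.hub import HubServer

server = HubServer(
    ["hub.dataload.sources"],                      # source_list: package/paths of datasource modules
    features=["config", "job", "dump", "upload"],  # subset of DEFAULT_FEATURES
    managers_custom_args={"upload": {"poll_schedule": "*/5 * * * *"}},  # illustrative: poll every 5 min
    dataupload_config=False,                       # explicitly deactivate the data upload feature
)
server.start()   # assumed typical entry point, as used in generated hub scripts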
- DEFAULT_API_CONFIG = {}¶
- DEFAULT_AUTOHUB_CONFIG = {'es_host': None, 'indexer_factory': None, 'validator_class': None, 'version_urls': []}¶
- DEFAULT_DATAUPLOAD_CONFIG = {'upload_root': '.biothings_hub/archive/dataupload'}¶
- DEFAULT_FEATURES = ['config', 'job', 'dump', 'upload', 'dataplugin', 'source', 'build', 'auto_archive', 'diff', 'index', 'snapshot', 'auto_snapshot_cleaner', 'release', 'inspect', 'sync', 'api', 'terminal', 'reloader', 'dataupload', 'ws', 'readonly', 'upgrade', 'autohub', 'hooks']¶
- DEFAULT_MANAGERS_ARGS = {'upload': {'poll_schedule': '* * * * * */10'}}¶
- DEFAULT_RELOADER_CONFIG = {'folders': None, 'managers': ['source_manager', 'assistant_manager'], 'reload_func': None}¶
- DEFAULT_WEBSOCKET_CONFIG = {}¶
- add_api_endpoint(endpoint_name, command_name, method, **kwargs)[source]¶
Add an API endpoint to expose command named “command_name” using HTTP method “method”. **kwargs are used to specify more arguments for EndpointDefinition
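For instance, assuming a hub command named "source_info" has already been defined (both names below are hypothetical):

server.add_api_endpoint("source", "source_info", "GET")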
- configure_extra_commands()[source]¶
Same as configure_commands() but commands are not exposed publicly in the shell (they are shortcuts or commands for API endpoints, supporting commands, etc…)
- configure_hooks_feature()[source]¶
Ingest user-defined commands into the hub namespace, giving access to all pre-defined commands (commands, extra_commands). This method prepares the hooks, but the ingestion is done later, once all commands are defined.
- configure_readonly_api_endpoints()[source]¶
Assuming read-write API endpoints have previously been defined (self.api_endpoints set), extract commands and their endpoint definitions only when the method is GET. That is, for any API definition honoring REST principles for HTTP verbs, generate endpoints only for read-only actions.
- configure_readonly_feature()[source]¶
Define then expose read-only Hub API endpoints so Hub can be accessed without any risk of modifying data
- configure_upgrade_feature()[source]¶
Allows a Hub to check for new versions (new commits to apply on running branch) and apply them on current code base
- quick_index(datasource_name, doc_type, indexer_env, subsource=None, index_name=None, **kwargs)[source]¶
Intended for datasource developers to quickly create an index to test their datasources. Automatically creates a temporary build config and build collection, then calls the index method with the temporary build collection's name.
- async biothings.hub.start_ssh_server(loop, name, passwords, keys=['bin/ssh_host_key'], shell=None, host='', port=8022)[source]¶
- biothings.hub.status(managers)[source]¶
Return a global hub status (number of sources, documents, etc…) according to the available managers.
Modules¶
biothings.hub.api¶
biothings.hub.api.managers¶
- class biothings.hub.api.manager.APIManager(log_folder=None, *args, **kwargs)[source]¶
Bases:
BaseManager
biothings.hub.api.handlers.base¶
- class biothings.hub.api.handlers.base.BaseHandler(application: Application, request: HTTPServerRequest, **kwargs: Any)[source]¶
Bases:
DefaultHandler
- class biothings.hub.api.handlers.base.DefaultHandler(application: Application, request: HTTPServerRequest, **kwargs: Any)[source]¶
Bases:
RequestHandler
- set_default_headers()[source]¶
Override this to set HTTP headers at the beginning of the request.
For example, this is the place to set a custom Server header. Note that setting such headers in the normal flow of request processing may not do what you want, since headers may be reset during error handling.
- write(result)[source]¶
Writes the given chunk to the output buffer.
To write the output to the network, use the flush() method below.
If the given chunk is a dictionary, we write it as JSON and set the Content-Type of the response to be application/json. (If you want to send JSON as a different Content-Type, call set_header after calling write().)
Note that lists are not converted to JSON because of a potential cross-site security vulnerability. All JSON output should be wrapped in a dictionary. More details at http://haacked.com/archive/2009/06/25/json-hijacking.aspx/ and https://github.com/facebook/tornado/issues/1009
- write_error(status_code, **kwargs)[source]¶
Override to implement custom error pages.
write_error may call write, render, set_header, etc. to produce output as usual.
If this error was caused by an uncaught exception (including HTTPError), an exc_info triple will be available as kwargs["exc_info"]. Note that this exception may not be the "current" exception for purposes of methods like sys.exc_info() or traceback.format_exc.
- class biothings.hub.api.handlers.base.GenericHandler(application: Application, request: HTTPServerRequest, **kwargs: Any)[source]¶
Bases:
DefaultHandler
biothings.hub.api.handlers.log¶
- class biothings.hub.api.handlers.log.HubLogDirHandler(application: Application, request: HTTPServerRequest, **kwargs: Any)[source]¶
Bases:
DefaultCORSHeaderMixin
,RequestHandler
- class biothings.hub.api.handlers.log.HubLogFileHandler(application: Application, request: HTTPServerRequest, **kwargs: Any)[source]¶
Bases:
DefaultCORSHeaderMixin
,StaticFileHandler
biothings.hub.api.handlers.shell¶
biothings.hub.api.handlers.upload¶
- class biothings.hub.api.handlers.upload.UploadHandler(application: Application, request: HTTPServerRequest, **kwargs: Any)[source]¶
Bases:
GenericHandler
- data_received(chunk)[source]¶
Implement this method to handle streamed request data.
Requires the .stream_request_body decorator.
May be a coroutine for flow control.
- prepare()[source]¶
Called at the beginning of a request before get/post/etc.
Override this method to perform common initialization regardless of the request method.
Asynchronous support: Use async def or decorate this method with .gen.coroutine to make it asynchronous. If this method returns an Awaitable, execution will not proceed until the Awaitable is done.
New in version 3.1: Asynchronous support.
biothings.hub.api.handlers.ws¶
- class biothings.hub.api.handlers.ws.HubDBListener[source]¶
Bases:
ChangeListener
Get events from Hub DB and propagate them through the websocket instance
- class biothings.hub.api.handlers.ws.LogListener(*args, **kwargs)[source]¶
Bases:
ChangeListener
- class biothings.hub.api.handlers.ws.ShellListener(*args, **kwargs)[source]¶
Bases:
LogListener
- class biothings.hub.api.handlers.ws.WebSocketConnection(session, listeners)[source]¶
Bases:
SockJSConnection
Listen to Hub DB through a listener object, and publish events to any client connected
SockJSConnection.__init__() takes only a session as argument, and there's no way to pass custom settings. In order to use that class, we need to use partial to partially init the instance with 'listeners' and let the rest use the 'session' parameter:
pconn = partial(WebSocketConnection, listeners=listeners)
ws_router = sockjs.tornado.SockJSRouter(pconn, "/path")
- clients = {}¶
- on_open(info)[source]¶
Default on_open() handler.
Override when you need to do some initialization or request validation. If you return False, connection will be rejected.
You can also throw Tornado HTTPError to close connection.
- request: ConnectionInfo object which contains the caller IP address, query string parameters and cookies associated with this request (if any).
biothings.hub.autoupdate¶
biothings.hub.autoupdate.dumper¶
- class biothings.hub.autoupdate.dumper.BiothingsDumper(*args, **kwargs)[source]¶
Bases:
HTTPDumper
This dumper is used to keep a BioThings API up-to-date. BioThings data is available either as an ElasticSearch snapshot (for full updates) or as a collection of diff files (for incremental updates). It will either download incremental updates and apply diffs, or trigger an ElasticSearch restore if the latest version is a full update. This dumper can also be configured with precedence rules: when both a full and an incremental update are available, rules can be set so that full is preferably used over incremental (size can also be considered when selecting the preferred way).
- AUTO_UPLOAD = False¶
- AWS_ACCESS_KEY_ID = None¶
- AWS_SECRET_ACCESS_KEY = None¶
- SRC_NAME = None¶
- SRC_ROOT_FOLDER = None¶
- TARGET_BACKEND = None¶
- VERSION_URL = None¶
- property base_url¶
- choose_best_version(versions)[source]¶
Out of all compatible versions, choose the best: 1. choose incremental vs. full according to preferences 2. version must be the highest (most up-to-date)
- compare_remote_local(remote_version, local_version, orig_remote_version, orig_local_version)[source]¶
- create_todump_list(force=False, version='latest', url=None)[source]¶
Fill the self.to_dump list with dict("remote": remote_path, "local": local_path) elements. This is the todo list for the dumper. It's a good place to check whether files need to be downloaded. If 'force' is True though, all files will be considered for download.
- download(remoteurl, localfile, headers=None)[source]¶
Download 'remotefile' to the local location defined by 'localfile'. Return relevant information about remotefile (depends on the actual client).
- find_update_path(version, backend_version=None)[source]¶
Explore available versions and find the path to update the hub up to "version", starting from the given backend_version (typically the current version found in the ES index). If backend_version is None (typically no index yet), a complete path will be returned, from the last compatible "full" release up to the latest "diff" update. Returned is a list of dicts, where each dict is a build metadata element containing information about each update (see versions.json); the order of the list describes the order in which the updates should be performed.
- async get_target_backend()[source]¶
Example:
[{
    "host": "es6.mygene.info:9200",
    "index": "mygene_allspecies_20200823_ufkwdv79",
    "index_alias": "mygene_allspecies",
    "version": "20200906",
    "count": 38729977
}]
- async info(version='latest')[source]¶
Display version information (release note, etc…) for the given version:
{
    "info": …,
    "release_note": …
}
- post_dump(*args, **kwargs)[source]¶
Placeholder to add a custom process once the whole resource has been dumped. Optional.
- prepare_client()[source]¶
Depending on presence of credentials, inject authentication in client.get()
- remote_is_better(remotefile, localfile)[source]¶
Determine if remote is better
Override if necessary.
- property target_backend¶
- async versions()[source]¶
Display all available versions. Example:
[{
    "build_version": "20171003",
    "url": "https://biothings-releases.s3.amazonaws.com:443/mygene.info/20171003.json",
    "release_date": "2017-10-06T11:58:39.749357",
    "require_version": None,
    "target_version": "20171003",
    "type": "full"
}, …]
biothings.hub.autoupdate.uploader¶
- class biothings.hub.autoupdate.uploader.BiothingsUploader(*args, **kwargs)[source]¶
Bases:
BaseSourceUploader
db_conn_info is a database connection info tuple (host,port) to fetch/store information about the datasource’s state.
- AUTO_PURGE_INDEX = False¶
- SYNCER_FUNC = None¶
- TARGET_BACKEND = None¶
- get_snapshot_repository_config(build_meta)[source]¶
Return (name,config) tuple from build_meta, where name is the repo name, and config is the repo config
- async load(*args, **kwargs)[source]¶
Main resource load process; reads data from doc_c in chunks of size batch_size. steps defines the different processes used to load the resource:
"data": will store actual data into single collections
"post": will perform post data load operations
"master": will register the master document in src_master
- name = None¶
- property syncer_func¶
- property target_backend¶
biothings.hub.databuild¶
biothings.hub.databuild.backend¶
Backend for storing merged genedoc after building. Support MongoDB, ES, CouchDB
- class biothings.hub.databuild.backend.LinkTargetDocMongoBackend(*args, **kwargs)[source]¶
Bases:
TargetDocBackend
This backend type acts as a dummy target backend; the data is actually stored in the source database. It means only one datasource can be linked to that target backend; as a consequence, when this backend is used in a merge, there's no actual data merge. This is useful when "merging"/indexing only one datasource, where the merge step is just a duplication of the datasource data.
- name = 'link'¶
- property target_collection¶
- class biothings.hub.databuild.backend.ShardedTargetDocMongoBackend(*args, **kwargs)[source]¶
Bases:
TargetDocMongoBackend
target_collection is a pymongo collection object.
- class biothings.hub.databuild.backend.SourceDocBackendBase(build_config, build, master, dump, sources)[source]¶
Bases:
DocBackendBase
- class biothings.hub.databuild.backend.SourceDocMongoBackend(build_config, build, master, dump, sources)[source]¶
Bases:
SourceDocBackendBase
- get_src_metadata()[source]¶
Return source versions which have been previously accessed with this backend object, or all source versions if none were accessed. Accessing means going through __getitem__ (the usual way), which allows keeping track of sources of interest automatically, thus returning versions only for those.
- class biothings.hub.databuild.backend.TargetDocBackend(*args, **kwargs)[source]¶
Bases:
DocBackendBase
- set_target_name(target_name, build_name=None)[source]¶
Create/prepare a target backend, either strictly named "target_name" or with a name derived from "build_name" (for temporary backends).
- property target_name¶
- class biothings.hub.databuild.backend.TargetDocMongoBackend(*args, **kwargs)[source]¶
Bases:
TargetDocBackend
,DocMongoBackend
target_collection is a pymongo collection object.
- biothings.hub.databuild.backend.create_backend(db_col_names, name_only=False, follow_ref=False, **kwargs)[source]¶
Guess what's inside 'db_col_names' and return the corresponding backend. It could be:
a string (will first check for an src_build doc to find a backend_url field; if nothing there, will look up a mongo collection in the target database)
or a tuple ("target|src", "col_name")
or a ("mongodb://user:pass@host", "db", "col_name") URI
or a ("es_host:port", "index_name", "doc_type") tuple
If name_only is True, just return the name uniquely identifying the collection or index URI connection.
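The accepted forms above translate to calls like the following (collection, database and host names are placeholders):

from biothings.hub.databuild.backend import create_backend

b1 = create_backend("mynews_20210101_abcdef")                        # collection name (or src_build backend_url)
b2 = create_backend(("src", "clinvar"))                              # ("target|src", col_name)
b3 = create_backend(("mongodb://user:pass@host", "db", "col_name"))  # full MongoDB URI
b4 = create_backend(("es_host:9200", "index_name", "doc_type"))      # Elasticsearch index
name = create_backend(("src", "clinvar"), name_only=True)            # just the unique identifying name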
- biothings.hub.databuild.backend.generate_folder(root_folder, old_db_col_names, new_db_col_names)[source]¶
- biothings.hub.databuild.backend.merge_src_build_metadata(build_docs)[source]¶
Merge metadata from src_build documents. A list of docs should be passed; the order is important: the first element has the least precedence, the last the most. It means that, when needed, some values from documents on the "left" of the list may be overridden by ones on the right, e.g. the build_version field. Ideally, build docs shouldn't have any sources in common, to prevent any unexpected conflicts…
biothings.hub.databuild.buildconfig¶
A build config contains autobuild configs and other information.
TODO: not all features are already supported in the code
For example:
{
    "_id": "mynews",
    "name": "mynews",
    "doc_type": "news",
    "sources": ["mynews"],
    "root": ["mynews"],
    "builder_class": "biothings.hub.databuild.builder.DataBuilder",
    "autobuild": { … },
    "autopublish": { … },
    "build_version": "%Y%m%d%H%M"
}
Autobuild: build, diff/snapshot
Autopublish: release note, publish
Autorelease: release
- class biothings.hub.databuild.buildconfig.AutoBuildConfig(confdict)[source]¶
Bases:
object
Parse automation configurations for each step following 'build'.
Example:
{
    "autobuild": {
        "schedule": "0 8 * * 7",   // Make a build every 08:00 on Sunday.
        "type": "diff",            // Auto create a "diff" w/ previous version.
                                   // The other option is "snapshot".
        "env": "local"             // ES env to create an index and snapshot,
                                   // not required when type above is diff.
                                   // Setting the env also implies auto snapshot.
                                   // It could be in addition to auto diff.
                                   // Also accepts (indexer_env, snapshot_env).
    },
    "autopublish": {
        "type": "snapshot",        // Auto publish new snapshots for new builds.
                                   // The type field can also be 'diff'.
        "env": "prod",             // The release environment to publish snapshot,
                                   // or the release environment to publish diff.
                                   // This field is required for either type.
        "note": True               // If we should publish with a release note.
                                   // TODO not implemented yet
    },
    "autorelease": {
        "schedule": "0 0 * * 1",   // Make a release every Monday at midnight
                                   // (if there's a new published version).
        "type": "full"             // Only auto install full releases.
                                   // The release type can also be 'incremental'.
    }
}
The terms below are used interchangeably.
- BUILD_TYPES = ('diff', 'snapshot')¶
- RELEASE_TO_BUILD = {'full': 'snapshot', 'incremental': 'diff'}¶
- RELEASE_TYPES = ('incremental', 'full')¶
biothings.hub.databuild.builder¶
- class biothings.hub.databuild.builder.BuilderManager(source_backend_factory=None, target_backend_factory=None, builder_class=None, poll_schedule=None, *args, **kwargs)[source]¶
Bases:
BaseManager
BuilderManager deals with the different builders used to merge datasources. It is connected to src_build() via sync(), where it grabs build information and registers builder classes, ready to be instantiated when triggering builds. source_backend_factory can be an optional factory function (like a partial) that the builder can call without any argument to generate a SourceBackend. Same for target_backend_factory for the TargetBackend. builder_class, if given, will be used as the actual Builder class used for the merge and will be passed the same arguments as the base DataBuilder. It can also be a list of classes, in which case the first one is used as the default; this is for cases where multiple builders need to be defined.
- build_info(id=None, conf_name=None, fields=None, only_archived=False, status=None)[source]¶
Return build information given a build _id, or all builds if _id is None. "fields" can be passed to select which fields to return or not (mongo notation for projections); if None, return everything except:
"mapping" (too long)
- If id is None, more fields are filtered:
"sources" and some of "build_config"
only_archived=True will return archived merges only. status: will return only successful or failed builds; can be "success" or "failed".
- clean_stale_status()[source]¶
During startup, search for actions in progress which would have been interrupted, and change their state to "canceled". Ex: some downloading processes could have been interrupted; at startup, the "downloading" status should be changed to "canceled" so as to reflect the actual state of these datasources. This must be overridden in subclasses.
- clean_temp_collections(build_name, date=None, prefix='')[source]¶
Delete all target collections created from the builder named "build_name" at the given date (or any date if none given – careful…). Date is a string (YYYYMMDD or regex). A common collection name prefix can also be specified if needed.
- create_build_configuration(name, doc_type, sources, roots=None, builder_class=None, params=None, archived=False)[source]¶
- find_builder_classes()[source]¶
- Find all available builder classes:
classes passed during manager init (build_class) (that includes the default builder)
all classes subclassing DataBuilder in:
biothings.hub.databuilder.*
hub.databuilder.* (app-specific)
- get_builder_class(build_config_name)[source]¶
The builder class can be specified in different ways (in order): 1. within the build_config document (so, per configuration), 2. or defined in the builder manager (so, per manager), 3. or default to DataBuilder.
- list_sources(build_name)[source]¶
List all registered sources used to trigger a build named ‘build_name’
- merge(build_name, sources=None, target_name=None, **kwargs)[source]¶
Trigger a merge for the build named 'build_name'. An optional list of sources can be passed (a single one or a list). target_name is the target collection name used to store the merged data. If None, each call will generate a unique target_name.
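A usage sketch, assuming a configured BuilderManager instance and a build configuration named "mynews" (both hypothetical):

bmanager.merge("mynews")                                 # merge all sources defined in the build config
bmanager.merge("mynews", sources="mynews",
               target_name="mynews_20210101_test")       # single source, explicit target collection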
- poll()[source]¶
Check "whatsnew()" to identify builds which could be automatically built, if {"autobuild": {…}} is part of the build configuration. "autobuild" contains a dict with "schedule" (aiocron/crontab format), so each build configuration can have a different polling schedule.
- resolve_builder_class(klass)[source]¶
Resolve a class/partial definition to (obj, "type", "mod.class"), where names (class name, module, docstring, etc…) can be accessed directly whether it's a standard class or not.
- property source_backend¶
- property target_backend¶
- class biothings.hub.databuild.builder.DataBuilder(build_name, source_backend, target_backend, log_folder, doc_root_key='root', mappers=None, default_mapper_class=<class 'biothings.hub.databuild.mapper.TransparentMapper'>, sources=None, target_name=None, **kwargs)[source]¶
Bases:
object
Generic data builder.
- property build_config¶
- document_cleaner(src_name, *args, **kwargs)[source]¶
Return a function taking a document as argument, cleaning the doc as needed, and returning that doc. If no function is needed, return None. Note: the returned function must be pickleable; be careful with lambdas and closures.
- get_build_version()[source]¶
Generate an arbitrary major build version. The default uses a timestamp (YYMMDD). The '.' char isn't allowed in a build version as it's reserved for minor versions.
- get_custom_metadata(sources, job_manager)[source]¶
If more metadata is required, this method can be overridden and should return a dict. The existing metadata dict will be updated with that one before storage.
- get_pinfo()[source]¶
Return dict containing information about the current process (used to report in the hub)
- get_predicates()[source]¶
Return a list of predicates (functions returning true/false, as in math logic) which instructs/dictates if job manager should start a job (process/thread)
- get_stats(sources, job_manager)[source]¶
Return a dictionary of metadata for this build. It's usually app-specific and this method may be overridden as needed. By default though, the total number of documents in the merged collection is stored (key "total").
The returned dictionary will be merged with any existing metadata in the src_build collection. This behavior can be changed by setting a special key within the metadata dict: {"__REPLACE__": True} will… replace existing metadata with the one returned here.
"job_manager" is passed in case parallelization is needed. Be aware that this method is already running in a dedicated thread; in order to use job_manager, the following code must be used at the very beginning of its implementation: asyncio.set_event_loop(job_manager.loop)
- keep_archive = 10¶
- property logger¶
- merge(sources=None, target_name=None, force=False, ids=None, steps=('merge', 'post', 'metadata'), job_manager=None, *args, **kwargs)[source]¶
Merge given sources into a collection named target_name. If the sources argument is omitted, all sources defined for this merger will be merged together, according to what is defined in src_build_config. If target_name is not defined, a unique name will be generated.
- Optional parameters:
force=True will bypass any safety check
ids: list of _ids to merge, specifically. If None, all documents are merged.
- steps:
merge: the actual merge step; create merged documents and store them
post: once merged, run an optional post-merge process
metadata: generate and store metadata (depends on the merger; usually specifies the amount of merged data, source versions, etc…)
- merge_order(other_sources)[source]¶
Optionally, this method can be overridden to customize the order in which sources should be merged. The default is sorted by name.
- async merge_sources(source_names, steps=('merge', 'post'), batch_size=100000, ids=None, job_manager=None)[source]¶
Merge resources from the given source_names, or from the build config. Identify root document sources from the list so they are processed first. ids can be a list of particular documents to be merged.
- register_status(status, transient=False, init=False, **extra)[source]¶
Register the current build status. A build status is a record in src_build. The key used in this dict is the target_name. Then, any operation acting on this target_name is registered in a "jobs" list.
- resolve_sources(sources)[source]¶
Sources can be strings that may contain regex chars. It's useful when you have plenty of sub-collections prefixed with a source name. For instance, given a source named "blah" stored in as many collections as there are chromosomes, instead of passing each name as "blah_1", "blah_2", etc., "blah_.*" can be specified in build_config. This method resolves potential regexed source names into real, existing collection names.
- property source_backend¶
- property target_backend¶
- class biothings.hub.databuild.builder.LinkDataBuilder(build_name, source_backend, target_backend, *args, **kwargs)[source]¶
Bases:
DataBuilder
LinkDataBuilder creates a link to the original datasource to be merged, without actually copying the data (the merged collection remains empty). This builder is only valid when a single datasource is declared in the list of sources to be merged (thus no real merge), and is useful to prevent data duplication between the datasource itself and the resulting merged collection.
- biothings.hub.databuild.builder.fix_batch_duplicates(docs, fail_if_struct_is_different=False)[source]¶
Remove duplicates from docs based on _id. If the _id is the same but the structure is different (not real "duplicates", but different documents with the same _ids), merge the docs all together (dict.update), or raise an error if fail_if_struct_is_different.
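A sketch of the behavior described above; the documents and the exact return shape are illustrative:

from biothings.hub.databuild.builder import fix_batch_duplicates

docs = [
    {"_id": "1", "name": "aspirin"},
    {"_id": "1", "name": "aspirin"},        # duplicate of the previous doc: dropped
    {"_id": "2", "pubchem": 2244},
    {"_id": "2", "drugbank": "DB00945"},    # same _id, different structure: merged via dict.update
]
deduped = fix_batch_duplicates(docs)
# expected content: {"_id": "1", "name": "aspirin"} and {"_id": "2", "pubchem": 2244, "drugbank": "DB00945"}
# with fail_if_struct_is_different=True, the second pair would raise an error instead of being merged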
biothings.hub.databuild.differ¶
- class biothings.hub.databuild.differ.BaseDiffer(diff_func, job_manager, log_folder)[source]¶
Bases:
object
- diff(old_db_col_names, new_db_col_names, batch_size=100000, steps=('content', 'mapping', 'reduce', 'post'), mode=None, exclude=None)[source]¶
wrapper over diff_cols() coroutine, return a task
- async diff_cols(old_db_col_names, new_db_col_names, batch_size, steps, mode=None, exclude=None)[source]¶
Compare new with old collections and produce diff files. Root keys can be excluded from the comparison with the "exclude" parameter.
- *_db_col_names can be:
a collection name (as a string), assuming it is in the target database
a tuple with 2 elements, the first one being either "source" or "target" to respectively specify the src or target database, and the second element being the collection name
a tuple with 3 elements (URI, db, collection), looking like ("mongodb://user:pass@host", "dbname", "collection"), allowing to specify any connection on any server
- steps:
'content' will perform a diff on actual content
'mapping' will perform a diff on ES mappings (if a target collection is involved)
'reduce' will merge diff files, trying to avoid having many small files
'post' is a hook to do stuff once everything is merged (override method post_diff_cols)
- mode: 'purge' will remove any existing files for this comparison, while 'resume' will happily ignore existing data and do whatever is requested (like running steps="post" on an existing diff folder…)
- diff_type = None¶
- get_pinfo()[source]¶
Return dict containing information about the current process (used to report in the hub)
- class biothings.hub.databuild.differ.ColdHotDiffer(diff_func, job_manager, log_folder)[source]¶
Bases:
BaseDiffer
- async diff_cols(old_db_col_names, new_db_col_names, *args, **kwargs)[source]¶
Compare new with old collections and produce diff files. Root keys can be excluded from the comparison with the "exclude" parameter.
- *_db_col_names can be:
a collection name (as a string), assuming it is in the target database
a tuple with 2 elements, the first one being either "source" or "target" to respectively specify the src or target database, and the second element being the collection name
a tuple with 3 elements (URI, db, collection), looking like ("mongodb://user:pass@host", "dbname", "collection"), allowing to specify any connection on any server
- steps:
'content' will perform a diff on actual content
'mapping' will perform a diff on ES mappings (if a target collection is involved)
'reduce' will merge diff files, trying to avoid having many small files
'post' is a hook to do stuff once everything is merged (override method post_diff_cols)
- mode: 'purge' will remove any existing files for this comparison, while 'resume' will happily ignore existing data and do whatever is requested (like running steps="post" on an existing diff folder…)
- class biothings.hub.databuild.differ.ColdHotJsonDiffer(diff_func=<function diff_docs_jsonpatch>, *args, **kwargs)[source]¶
Bases:
ColdHotJsonDifferBase
,JsonDiffer
- diff_type = 'coldhot-jsondiff'¶
- class biothings.hub.databuild.differ.ColdHotJsonDifferBase(diff_func, job_manager, log_folder)[source]¶
Bases:
ColdHotDiffer
- post_diff_cols(old_db_col_names, new_db_col_names, batch_size, steps, mode=None, exclude=None)[source]¶
Post-process the diff files by adjusting some jsondiff operations. Here's the process. For updated documents, some operations might be illegal in the context of cold/hot merged collections.
Case #1: "remove" op in an update
- from a cold/premerge collection, we have this doc:
coldd = {"_id": 1, "A": "123", "B": "456", "C": True}
- from the previous hot merge we have this doc:
prevd = {"_id": 1, "D": "789", "C": True, "E": "abc"}
- At that point, the final document, fully merged and indexed, is:
finald = {"_id": 1, "A": "123", "B": "456", "C": True, "D": "789", "E": "abc"}
We can notice field "C" is common to coldd and prevd.
- from the new hot merge, we have:
newd = {"_id": 1, "E": "abc"}  # C and D don't exist anymore
- Diffing prevd vs. newd will give the jsondiff operations:
[{'op': 'remove', 'path': '/C'}, {'op': 'remove', 'path': '/D'}]
The problem here is that 'C' is removed while it was already in the cold merge; it should stay because it came with some resource involved in the premerge (dependent keys, e.g. in myvariant, the "observed" key comes with certain sources) => the jsondiff operation on "C" must be discarded.
- Note: if an operation involves a root key (not '/a/c' for instance) and that key is found in the premerge, then the operation is removed. (Note we only consider root keys; if the deletion occurs deeper in the document, it's just a legal operation updating inner content.)
For deleted documents, the same kind of logic applies.
Case #2: "delete"
- from a cold/premerge collection, we have this doc:
coldd = {"_id": 1, "A": "123", "B": "456", "C": True}
- from the previous hot merge we have this doc:
prevd = {"_id": 1, "D": "789", "C": True}
- fully merged doc:
finald = {"_id": 1, "A": "123", "B": "456", "C": True, "D": "789"}
- from the new hot merge, we have:
newd = {}  # document doesn't exist anymore
Diffing prevd vs. newd will mark the document with _id == 1 to be deleted. The problem is we have data for _id=1 in the premerge collection; if we delete the whole document we'd lose too much information. => the deletion must be converted into specific "remove" jsondiff operations, for the root keys found in prevd and not in coldd
(in that case: [{'op': 'remove', 'path': '/D'}], and not "C", as C is in the premerge).
- class biothings.hub.databuild.differ.ColdHotSelfContainedJsonDiffer(diff_func=<function diff_docs_jsonpatch>, *args, **kwargs)[source]¶
Bases:
ColdHotJsonDifferBase
,SelfContainedJsonDiffer
- diff_type = 'coldhot-jsondiff-selfcontained'¶
- class biothings.hub.databuild.differ.DiffReportRendererBase(max_reported_ids=None, max_randomly_picked=None, detailed=False)[source]¶
Bases:
object
- class biothings.hub.databuild.differ.DiffReportTxt(max_reported_ids=None, max_randomly_picked=None, detailed=False)[source]¶
Bases:
DiffReportRendererBase
- class biothings.hub.databuild.differ.DifferManager(poll_schedule=None, *args, **kwargs)[source]¶
Bases:
BaseManager
DifferManager deals with the different differ objects used to create and analyze diff between datasources.
- build_diff_report(diff_folder, detailed=True, max_reported_ids=None)[source]¶
Analyze diff files in diff_folder and give a summary of changes. max_reported_ids is the number of IDs contained in the report for each part. detailed will trigger a deeper analysis, which takes more time.
- clean_stale_status()[source]¶
During startup, search for actions in progress which would have been interrupted, and change their state to "canceled". Ex: some downloading processes could have been interrupted; at startup, the "downloading" status should be changed to "canceled" so as to reflect the actual state of these datasources. This must be overridden in subclasses.
- configure(partial_differs=(<class 'biothings.hub.databuild.differ.JsonDiffer'>, <class 'biothings.hub.databuild.differ.SelfContainedJsonDiffer'>))[source]¶
- diff(diff_type, old, new, batch_size=100000, steps=('content', 'mapping', 'reduce', 'post'), mode=None, exclude=('_timestamp',))[source]¶
Run a diff to compare old vs. new collections, using the differ algorithm diff_type. Results are stored in a diff folder. Steps can be passed to choose what to do:
count: will count root keys in the new collection and store them as statistics
content: will diff the content between old and new; the results (diff files) format depends on diff_type
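A usage sketch, assuming a configured DifferManager and two build collections (names are placeholders); "jsondiff" matches the diff_type of JsonDiffer below:

dmanager.diff(
    "jsondiff",                    # diff_type
    "mynews_20201201_abcdef",      # old collection
    "mynews_20210101_ghijkl",      # new collection
    steps=("content", "mapping", "reduce", "post"),
    exclude=("_timestamp",),
)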
- diff_report(old_db_col_names, new_db_col_names, report_filename='report.txt', format='txt', detailed=True, max_reported_ids=None, max_randomly_picked=None, mode=None)[source]¶
- get_pinfo()[source]¶
Return dict containing information about the current process (used to report in the hub)
- class biothings.hub.databuild.differ.JsonDiffer(diff_func=<function diff_docs_jsonpatch>, *args, **kwargs)[source]¶
Bases:
BaseDiffer
- diff_type = 'jsondiff'¶
- class biothings.hub.databuild.differ.SelfContainedJsonDiffer(diff_func=<function diff_docs_jsonpatch>, *args, **kwargs)[source]¶
Bases:
JsonDiffer
- diff_type = 'jsondiff-selfcontained'¶
- biothings.hub.databuild.differ.diff_worker_new_vs_old(id_list_new, old_db_col_names, new_db_col_names, batch_num, diff_folder, diff_func, exclude=None, selfcontained=False)[source]¶
biothings.hub.databuild.mapper¶
- class biothings.hub.databuild.mapper.BaseMapper(name=None, *args, **kwargs)[source]¶
Bases:
object
Basic mapper used to convert documents. If the mapper's name matches the source's metadata mapper, the mapper.convert(docs) call will be used to process/convert the passed documents.
- class biothings.hub.databuild.mapper.IDBaseMapper(name=None, convert_func=None, *args, **kwargs)[source]¶
Bases:
BaseMapper
Provide mapping between different sources
'name' may match a "mapper" metadata field (see uploaders). If None, the mapper will be applied to any document from a resource without a "mapper" argument.
- process(docs, key_to_convert='_id', transparent=True)[source]¶
Process the 'key_to_convert' document key using the mapping. If transparent and no match, the original key will be used (so there's no change). Else, if no match, the document will be discarded (default). Warning: the key to be translated must not be None (it's considered a non-match).
- class biothings.hub.databuild.mapper.TransparentMapper(name=None, *args, **kwargs)[source]¶
Bases:
BaseMapper
biothings.hub.databuild.prebuilder¶
- class biothings.hub.databuild.prebuilder.BasePreCompiledDataProvider(name)[source]¶
Bases:
object
‘name’ is a way to identify this provider (usually linked to a database name behind the scene)
- class biothings.hub.databuild.prebuilder.MongoDBPreCompiledDataProvider(db_name, name, connection_params)[source]¶
Bases:
BasePreCompiledDataProvider
‘name’ is a way to identify this provider (usually linked to a database name behind the scene)
- class biothings.hub.databuild.prebuilder.RedisPreCompiledDataProvider(name, connection_params)[source]¶
Bases:
BasePreCompiledDataProvider
‘name’ is a way to identify this provider (usually linked to a database name behind the scene)
biothings.hub.databuild.syncer¶
- class biothings.hub.databuild.syncer.BaseSyncer(job_manager, log_folder)[source]¶
Bases:
object
- diff_type = None¶
- post_sync_cols(diff_folder, batch_size, mode, force, target_backend, steps)[source]¶
Post-sync hook, can be implemented in sub-class
- sync(diff_folder=None, batch_size=10000, mode=None, target_backend=None, steps=('mapping', 'content', 'meta', 'post'), debug=False)[source]¶
wrapper over sync_cols() coroutine, return a task
- async sync_cols(diff_folder, batch_size=10000, mode=None, force=False, target_backend=None, steps=('mapping', 'content', 'meta', 'post'), debug=False)[source]¶
Sync a collection with diff files located in diff_folder. This folder contains a metadata.json file which describes the different collections involved: "old" is the collection/index to be synced, "new" is the collection that should be obtained once all diff files are applied (not used, just informative). If target_backend is specified (bt.databuild.backend.create_backend() notation), then it will replace "old" (that is, the one being synced).
- target_backend_type = None¶
- class biothings.hub.databuild.syncer.ESColdHotJsonDiffSelfContainedSyncer(job_manager, log_folder)[source]¶
Bases:
BaseSyncer
- diff_type = 'coldhot-jsondiff-selfcontained'¶
- target_backend_type = 'es'¶
- class biothings.hub.databuild.syncer.ESColdHotJsonDiffSyncer(job_manager, log_folder)[source]¶
Bases:
BaseSyncer
- diff_type = 'coldhot-jsondiff'¶
- target_backend_type = 'es'¶
- class biothings.hub.databuild.syncer.ESJsonDiffSelfContainedSyncer(job_manager, log_folder)[source]¶
Bases:
BaseSyncer
- diff_type = 'jsondiff-selfcontained'¶
- target_backend_type = 'es'¶
- class biothings.hub.databuild.syncer.ESJsonDiffSyncer(job_manager, log_folder)[source]¶
Bases:
BaseSyncer
- diff_type = 'jsondiff'¶
- target_backend_type = 'es'¶
- class biothings.hub.databuild.syncer.MongoJsonDiffSelfContainedSyncer(job_manager, log_folder)[source]¶
Bases:
BaseSyncer
- diff_type = 'jsondiff-selfcontained'¶
- target_backend_type = 'mongo'¶
- class biothings.hub.databuild.syncer.MongoJsonDiffSyncer(job_manager, log_folder)[source]¶
Bases:
BaseSyncer
- diff_type = 'jsondiff'¶
- target_backend_type = 'mongo'¶
- class biothings.hub.databuild.syncer.SyncerManager(*args, **kwargs)[source]¶
Bases:
BaseManager
SyncerManager deals with the different syncer objects used to synchronize different collections or indices using diff files
- clean_stale_status()[source]¶
During startup, search for actions in progress which would have been interrupted, and change their state to "canceled". Ex: some downloading processes could have been interrupted; at startup, the "downloading" status should be changed to "canceled" so as to reflect the actual state of these datasources. This must be overridden in subclasses.
- class biothings.hub.databuild.syncer.ThrottledESColdHotJsonDiffSelfContainedSyncer(max_sync_workers, *args, **kwargs)[source]¶
Bases:
ThrottlerSyncer
,ESColdHotJsonDiffSelfContainedSyncer
- class biothings.hub.databuild.syncer.ThrottledESColdHotJsonDiffSyncer(max_sync_workers, *args, **kwargs)[source]¶
- class biothings.hub.databuild.syncer.ThrottledESJsonDiffSelfContainedSyncer(max_sync_workers, *args, **kwargs)[source]¶
- class biothings.hub.databuild.syncer.ThrottledESJsonDiffSyncer(max_sync_workers, *args, **kwargs)[source]¶
Bases:
ThrottlerSyncer
,ESJsonDiffSyncer
- class biothings.hub.databuild.syncer.ThrottlerSyncer(max_sync_workers, *args, **kwargs)[source]¶
Bases:
BaseSyncer
- biothings.hub.databuild.syncer.sync_es_coldhot_jsondiff_worker(diff_file, es_config, new_db_col_names, batch_size, cnt, force=False, selfcontained=False, metadata=None, debug=False)[source]¶
- biothings.hub.databuild.syncer.sync_es_for_update(diff_file, indexer, diffupdates, batch_size, res, debug)[source]¶
biothings.hub.dataexport¶
biothings.hub.dataexport.ids¶
- biothings.hub.dataexport.ids.export_ids(col_name)[source]¶
Export all _ids from the collection named col_name. If col_name refers to a build where a cold_collection is defined, it will also extract those _ids and sort/uniq them to obtain the full list of _ids of the actual merged (cold+hot) collection. The output file is stored in DATA_EXPORT_FOLDER/ids, defaulting to <DATA_ARCHIVE_ROOT>/export/ids. The output filename is returned at the end, if successful.
biothings.hub.dataindex¶
biothings.hub.dataindex.idcache¶
biothings.hub.dataindex.indexer_cleanup¶
biothings.hub.dataindex.indexer_payload¶
- class biothings.hub.dataindex.indexer_payload.IndexMappings(dict=None, /, **kwargs)[source]¶
Bases:
_IndexPayload
biothings.hub.dataindex.indexer¶
- class biothings.hub.dataindex.indexer.ColdHotIndexer(build_doc, indexer_env, index_name)[source]¶
Bases:
object
MongoDB to Elasticsearch 2-pass Indexer:
(
    1st pass: <MongoDB Cold Collection>,  # static data
    2nd pass: <MongoDB Hot Collection>    # changing data
) =>
    <Elasticsearch Index>
- class biothings.hub.dataindex.indexer.DynamicIndexerFactory(urls, es_host, suffix='_current')[source]¶
Bases:
object
In the context of autohub/standalone instances, create indexers with parameters taken from the versions.json URL. A list of URLs is provided so the factory knows how to create these indexers for each URL. There's no way to "guess" an ES host from a URL, so this parameter must be specified as well, common to all URLs. The "suffix" param is added at the end of index names.
- class biothings.hub.dataindex.indexer.IndexManager(*args, **kwargs)[source]¶
Bases:
BaseManager
An example of a config dict for this module:
{
    "indexer_select": {
        None: "hub.dataindex.indexer.DrugIndexer",  # default
        "build_config.cold_collection": "mv.ColdHotVariantIndexer",
    },
    "env": {
        "prod": {
            "host": "localhost:9200",
            "indexer": {
                "args": {
                    "timeout": 300,
                    "retry_on_timeout": True,
                    "max_retries": 10,
                },
                "bulk": {
                    "chunk_size": 50,
                    "raise_on_exception": False
                },
                "concurrency": 3
            },
            "index": [
                # for information only, only used in index_info
                {"index": "mydrugs_current", "doc_type": "drug"},
                {"index": "mygene_current", "doc_type": "gene"}
            ],
        },
        "dev": { … }
    }
}
- clean_stale_status()[source]¶
During startup, search for actions in progress which would have been interrupted, and change their state to "canceled". Ex: some downloading processes could have been interrupted; at startup, the "downloading" status should be changed to "canceled" so as to reflect the actual state of these datasources. This must be overridden in subclasses.
- cleanup(env=None, keep=3, dryrun=True, **filters)[source]¶
Delete old indices except for the most recent ones.
Examples
>>> index_cleanup()
>>> index_cleanup("production")
>>> index_cleanup("local", build_config="demo")
>>> index_cleanup("local", keep=0)
>>> index_cleanup(_id="<elasticsearch_index>")
- get_indexes_by_name(index_name=None, limit=10)[source]¶
Accept an index_name and return a list of indexes gathered from all elasticsearch environments.
If index_name is blank, all indexes will be returned. limit can be used to specify how many indexes should be returned.
The list of indexes will look like this:
[
    {
        "index_name": "…",
        "build_version": "…",
        "count": 1000,
        "creation_date": 1653468868933,
        "environment": {
            "name": "env name",
            "host": "localhost:9200",
        }
    },
]
- get_pinfo()[source]¶
Return dict containing information about the current process (used to report in the hub)
- index(indexer_env, build_name, index_name=None, ids=None, **kwargs)[source]¶
Trigger an index creation to index the collection build_name and create an index named index_name (or build_name if None). An optional list of IDs can be passed to index specific documents.
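For example, assuming a configured IndexManager and an indexer environment named "prod" (all names are placeholders):

imanager.index("prod", "mynews_20210101_ghijkl")             # index name defaults to the build name
imanager.index("prod", "mynews_20210101_ghijkl",
               index_name="mynews_test", ids=["doc1", "doc2"])  # only index two specific documents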
- update_metadata(indexer_env, index_name, build_name=None, _meta=None)[source]¶
- Update the _meta field of the index mappings, based on:
the _meta value provided, including {}.
the _meta value of the build_name in src_build.
the _meta value of the build with the same name as the index.
Examples
update_metadata("local", "mynews_201228_vsdevjd")
update_metadata("local", "mynews_201228_vsdevjd", _meta={})
update_metadata("local", "mynews_201228_vsdevjd", _meta={"author":"b"})
update_metadata("local", "mynews_201228_current", "mynews_201228_vsdevjd")
- class biothings.hub.dataindex.indexer.Indexer(build_doc, indexer_env, index_name)[source]¶
Bases:
object
MongoDB -> Elasticsearch Indexer.
- async index(job_manager, **kwargs)[source]¶
Build an Elasticsearch index (self.es_index_name) with data from MongoDB collection (self.mongo_collection_name).
“ids” can be passed to selectively index documents.
- “mode” can have the following values:
‘purge’: will delete an index if it exists.
‘resume’: will use an existing index and add missing documents.
‘merge’: will merge data to an existing index.
‘index’ (default): will create a new index.
- class biothings.hub.dataindex.indexer.IndexerCumulativeResult(dict=None, /, **kwargs)[source]¶
Bases:
_IndexerResult
- class biothings.hub.dataindex.indexer.IndexerStepResult(dict=None, /, **kwargs)[source]¶
Bases:
_IndexerResult
- class biothings.hub.dataindex.indexer.MainIndexStep(indexer)[source]¶
Bases:
Step
- method: property(abc.abstractmethod(lambda _: ...)) = 'do_index'¶
- name: property(abc.abstractmethod(lambda _: ...)) = 'index'¶
- state¶
alias of
MainIndexJSR
- class biothings.hub.dataindex.indexer.PostIndexStep(indexer)[source]¶
Bases:
Step
- method: property(abc.abstractmethod(lambda _: ...)) = 'post_index'¶
- name: property(abc.abstractmethod(lambda _: ...)) = 'post'¶
- state¶
alias of
PostIndexJSR
- class biothings.hub.dataindex.indexer.PreIndexStep(indexer)[source]¶
Bases:
Step
- method: property(abc.abstractmethod(lambda _: ...)) = 'pre_index'¶
- name: property(abc.abstractmethod(lambda _: ...)) = 'pre'¶
- state¶
alias of
PreIndexJSR
- class biothings.hub.dataindex.indexer.ProcessInfo(indexer, concurrency)[source]¶
Bases:
object
- class biothings.hub.dataindex.indexer.Step(indexer)[source]¶
Bases:
ABC
- catelog = {'index': <class 'biothings.hub.dataindex.indexer.MainIndexStep'>, 'post': <class 'biothings.hub.dataindex.indexer.PostIndexStep'>, 'pre': <class 'biothings.hub.dataindex.indexer.PreIndexStep'>}¶
- method: <property object at 0x7f3850a06020>¶
- name: <property object at 0x7f3850a05f80>¶
- state: <property object at 0x7f3850a05fd0>¶
biothings.hub.dataindex.indexer_registrar¶
- class biothings.hub.dataindex.indexer_registrar.IndexJobStateRegistrar(collection, build_name, index_name, **context)[source]¶
Bases:
object
- class biothings.hub.dataindex.indexer_registrar.MainIndexJSR(collection, build_name, index_name, **context)[source]¶
Bases:
IndexJobStateRegistrar
- class biothings.hub.dataindex.indexer_registrar.PostIndexJSR(collection, build_name, index_name, **context)[source]¶
Bases:
IndexJobStateRegistrar
- class biothings.hub.dataindex.indexer_registrar.PreIndexJSR(collection, build_name, index_name, **context)[source]¶
Bases:
IndexJobStateRegistrar
biothings.hub.dataindex.indexer_schedule¶
biothings.hub.dataindex.indexer_task¶
- class biothings.hub.dataindex.indexer_task.ESIndex(client, index_name, **bulk_index_args)[source]¶
Bases:
ESIndex
- mexists(ids)[source]¶
Return a list of tuples like:
[
    (_id_0, True),
    (_id_1, False),
    (_id_2, True),
    …
]
- class biothings.hub.dataindex.indexer_task.IndexingTask(es, mongo, ids, mode=None, logger=None, name='task')[source]¶
Bases:
object
Index one batch of documents from MongoDB to Elasticsearch. The documents to index are specified by their ids.
- class biothings.hub.dataindex.indexer_task.Mode(value)[source]¶
Bases:
Enum
An enumeration.
- INDEX = 'index'¶
- MERGE = 'merge'¶
- PURGE = 'purge'¶
- RESUME = 'resume'¶
biothings.hub.dataindex.snapshooter¶
- class biothings.hub.dataindex.snapshooter.Bucket(client, bucket, region=None)[source]¶
Bases:
object
- class biothings.hub.dataindex.snapshooter.CloudStorage(type: str, access_key: str, secret_key: str, region: str = 'us-west-2')[source]¶
Bases:
object
- access_key: str¶
- region: str = 'us-west-2'¶
- secret_key: str¶
- type: str¶
- class biothings.hub.dataindex.snapshooter.CumulativeResult(dict=None, /, **kwargs)[source]¶
Bases:
_SnapshotResult
- class biothings.hub.dataindex.snapshooter.ProcessInfo(env)[source]¶
Bases:
object
JobManager Process Info. Reported in Biothings Studio.
- class biothings.hub.dataindex.snapshooter.RepositoryConfig(dict=None, /, **kwargs)[source]¶
Bases:
UserDict
- {
    "type": "s3",
    "name": "s3-$(Y)",
    "settings": {
        "bucket": "<SNAPSHOT_BUCKET_NAME>",
        "base_path": "mynews.info/$(Y)",  # per year
    }
}
- property bucket¶
- format(doc=None)[source]¶
Template special values in this config.
For example:
{
    "bucket": "backup-$(Y)",
    "base_path": "snapshots/%(_meta.build_version)s"
}
where the "_meta.build_version" value is taken from doc in dot field notation, and the current year replaces "$(Y)".
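A sketch of this templating, with an illustrative build document:

from biothings.hub.dataindex.snapshooter import RepositoryConfig

conf = RepositoryConfig({
    "type": "s3",
    "name": "s3-$(Y)",
    "settings": {
        "bucket": "backup-$(Y)",
        "base_path": "snapshots/%(_meta.build_version)s",
    },
})
# "$(Y)" is replaced by the current year, "%(_meta.build_version)s" by the doc's dotted field
rendered = conf.format(doc={"_meta": {"build_version": "20210101"}})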
- property region¶
- property repo¶
- class biothings.hub.dataindex.snapshooter.SnapshotEnv(job_manager, cloud, repository, indexer, **kwargs)[source]¶
Bases:
object
- class biothings.hub.dataindex.snapshooter.SnapshotManager(index_manager, *args, **kwargs)[source]¶
Bases:
BaseManager
Hub ES Snapshot Management
Config example:
# env.<name>:
{
    "cloud": {
        "type": "aws",  # default, only one supported
        "access_key": <------------------>,
        "secret_key": <------------------>,
        "region": "us-west-2"
    },
    "repository": {
        "name": "s3-$(Y)",
        "type": "s3",
        "settings": {
            "bucket": "<SNAPSHOT_BUCKET_NAME>",
            "base_path": "mygene.info/$(Y)",  # year
        },
        "acl": "private",
    },
    "indexer": {
        "name": "local",
        "args": {
            "timeout": 100,
            "max_retries": 5
        }
    },
    "monitor_delay": 15,
}
- clean_stale_status()[source]¶
During startup, search for actions in progress which would have been interrupted, and change their state to "canceled". Ex: some downloading processes could have been interrupted; at startup, the "downloading" status should be changed to "canceled" so as to reflect the actual state of these datasources. This must be overridden in subclasses.
- cleanup(env=None, keep=3, group_by='build_config', dryrun=True, **filters)[source]¶
Delete past snapshots and keep only the most recent ones.
Examples
>>> snapshot_cleanup()
>>> snapshot_cleanup("s3_outbreak")
>>> snapshot_cleanup("s3_outbreak", keep=0)
- poll(state, func)[source]¶
Search for sources in collection 'col' with a pending flag list containing 'state', and call 'func' for each document found (with the doc as the only param).
- snapshot(snapshot_env, index, snapshot=None, recreate_repo=False)[source]¶
Create a snapshot named "snapshot" (or, by default, with the same name as the index) from "index", according to the environment definition (repository, etc…) "env".
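For example, assuming a configured SnapshotManager with a snapshot environment named "s3_prod" (names are placeholders):

smanager.snapshot("s3_prod", "mynews_20210101_ghijkl")     # snapshot named after the index
smanager.snapshot("s3_prod", "mynews_20210101_ghijkl",
                  snapshot="mynews_release_20210101", recreate_repo=True)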
- snapshot_a_build(build_doc)[source]¶
Create a snapshot based on the autobuild settings in the build config. If the build config associated with this build has:
{
    "autobuild": {
        "type": "snapshot",  // implied when env is set. env must be set.
        "env": "local"       // which es env to make the snapshot.
    }
}
attempt to make a snapshot for this build on the specified es env "local".
biothings.hub.dataindex.snapshot_cleanup¶
biothings.hub.dataindex.snapshot_registrar¶
- class biothings.hub.dataindex.snapshot_registrar.MainSnapshotState(col, _id)[source]¶
Bases:
_TaskState
- func = '_snapshot'¶
- name = 'snapshot'¶
- regx = True¶
- step = 'snapshot'¶
- class biothings.hub.dataindex.snapshot_registrar.PostSnapshotState(col, _id)[source]¶
Bases:
_TaskState
- func = 'post_snapshot'¶
- name = 'post'¶
- regx = True¶
- step = 'post-snapshot'¶
biothings.hub.dataindex.snapshot_repo¶
biothings.hub.dataindex.snapshot_task¶
biothings.hub.datainspect¶
biothings.hub.datainspect.inspector¶
- class biothings.hub.datainspect.inspector.InspectorManager(upload_manager, build_manager, *args, **kwargs)[source]¶
Bases:
BaseManager
- clean_stale_status()[source]¶
During startup, search for actions in progress which would have been interrupted, and change their state to "canceled". Ex: some downloading processes could have been interrupted; at startup, the "downloading" status should be changed to "canceled" so as to reflect the actual state of these datasources. This must be overridden in subclasses.
- inspect(data_provider, mode='type', batch_size=10000, limit=None, sample=None, **kwargs)[source]¶
Inspect the given data provider:
a backend definition (see bt.hub.databuild.create_backend for supported formats), e.g. "merged_collection" or ("src", "clinvar")
or a callable yielding documents
Mode:
"type": will inspect and report the type map found in the data (internal/non-standard format)
"mapping": will inspect and return a map compatible with later ElasticSearch mapping generation (see bt.utils.es.generate_es_mapping)
"stats": will inspect and report types + different counts found in the data, giving a detailed overview of the volumetry of each field and sub-field
"jsonschema": same as "type" but the result is formatted following the json-schema standard
limit: when set to an integer, will inspect only that many documents.
sample: combined with limit; for each document, if random.random() <= sample (float), the document is inspected. This option allows inspecting only a sample of the data.
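Usage sketch, with data providers written in the forms listed above (the manager instance and source names are placeholders):

inspector.inspect(("src", "clinvar"), mode="type")        # inspect a source collection
inspector.inspect("merged_collection", mode="mapping",
                  limit=10000, sample=0.1)                # at most 10000 docs, each kept with ~10% probability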
biothings.hub.dataload¶
biothings.hub.dataload.dumper¶
- class biothings.hub.dataload.dumper.APIDumper(src_name=None, src_root_folder=None, log_folder=None, archive=None)[source]¶
Bases:
BaseDumper
Dump data from APIs
This will run API calls in a clean process and write its results in one or more NDJSON documents.
Populate the static methods get_document and get_release in your subclass, along with other necessary bits common to all dumpers.
For details on specific parts, read the docstring for individual methods.
An example subclass implementation can be found in the unii data source for MyGene.info.
- property client¶
- create_todump_list(force=False, **kwargs)[source]¶
This gets called by method dump, to populate self.to_dump
- download(remotefile, localfile)[source]¶
Runs helper function in new process to download data
This is run in a new process by the do_dump coroutine of the parent class. Then this will spawn another process that actually does all the work. This method is mostly for setting up the environment, setting up the process pool executor to correctly use spawn, and using concurrent.futures to simply run tasks in the new process and periodically check the status of the task.
Explanation: because this is actually running inside a process that forked from a threaded process, the internal state is more or less corrupt/broken, see man 2 fork for details. More discussions are in Slack from some time in 2021 on why it has to be forked and why it is broken.
Caveats: the existing job manager will not know how much memory the actual worker process is using.
- static get_document() Generator[Tuple[str, Any], None, None] [source]¶
Get document from API source
Populate this method to yield documents to be stored on disk. Every time you want to save something to disk, do this:
>>> yield 'name_of_file.ndjson', {'stuff': 'you want saved'}
While the type definition says Any is accepted, it has to be JSON serializable, so basically Python dictionaries/lists with strings and numbers as the most basic elements.
Later on in your uploader, you can treat the files as NDJSON documents, i.e. one JSON document per line.
It is recommended that you only do the minimal necessary processing in this step.
A default HTTP client is not provided so you get the flexibility of choosing your most favorite tool.
This MUST be a static method or it cannot be properly serialized to run in a separate process.
This method is expected to be blocking (synchronous). However, be sure to properly SET TIMEOUTS. You open resources here in this function, so you have to deal with properly checking/closing them. If the invoker forcefully stopped this method, it would leave a mess behind, therefore we do not do that.
You can set a 5 second timeout using the popular requests package by doing something like this:
>>> import requests
>>> r = requests.get('https://example.org', timeout=5.0)
You can catch the exception or set up retries. If you cannot handle the situation, just raise exceptions or don't catch them. APIDumper will handle it properly: documents are only saved when the entire method completes successfully.
- static get_release() str [source]¶
Get the string for the release information.
This is run in the main process and thread so it must return quickly. This method must be populated in your subclass.
- Returns:
string representing the release.
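As a rough illustration, a minimal APIDumper subclass might look like the sketch below. The source name, folder, URLs and field names are hypothetical; only the pieces described above (SRC_NAME/SRC_ROOT_FOLDER plus the static get_document and get_release methods) are shown.
import requests

from biothings.hub.dataload.dumper import APIDumper


class MySourceAPIDumper(APIDumper):
    SRC_NAME = "mysource"                 # hypothetical source name
    SRC_ROOT_FOLDER = "/data/mysource"    # hypothetical; usually derived from the hub config

    @staticmethod
    def get_document():
        # Yield (filename, JSON-serializable document) tuples; each document is
        # written as one line of the given NDJSON file.
        r = requests.get("https://example.org/api/records", timeout=5.0)
        r.raise_for_status()
        for record in r.json():
            yield "records.ndjson", record

    @staticmethod
    def get_release():
        # Runs in the main process/thread, so it must return quickly.
        r = requests.get("https://example.org/api/version", timeout=5.0)
        r.raise_for_status()
        return r.text.strip()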
- class biothings.hub.dataload.dumper.BaseDumper(src_name=None, src_root_folder=None, log_folder=None, archive=None)[source]¶
Bases:
object
- ARCHIVE = True¶
- AUTO_UPLOAD = True¶
- MAX_PARALLEL_DUMP = None¶
- SCHEDULE = None¶
- SLEEP_BETWEEN_DOWNLOAD = 0.0¶
- SRC_NAME = None¶
- SRC_ROOT_FOLDER = None¶
- SUFFIX_ATTR = 'release'¶
- property client¶
- create_todump_list(force=False, **kwargs)[source]¶
Fill the self.to_dump list with dict("remote": remote_path, "local": local_path) elements. This is the to-do list for the dumper. It's a good place to check whether a file actually needs to be downloaded. If 'force' is True though, all files will be considered for download.
- property current_data_folder¶
- property current_release¶
- download(remotefile, localfile)[source]¶
Download "remotefile" to the local location defined by "localfile". Return relevant information about remotefile (depends on the actual client).
- async dump(steps=None, force=False, job_manager=None, check_only=False, **kwargs)[source]¶
Dump (i.e. download) the resource as needed. This should be called after instance creation. The 'force' argument will force the dump, and is passed on to the create_todump_list() method.
- get_pinfo()[source]¶
Return dict containing information about the current process (used to report in the hub)
- get_predicates()[source]¶
Return a list of predicates (functions returning true/false, as in mathematical logic) which instruct/dictate whether the job manager should start a job (process/thread).
- property logger¶
- mark_success(dry_run=True)[source]¶
Mark the datasource as successfully dumped. This is useful in case the datasource is unstable and needs to be downloaded manually.
- property new_data_folder¶
Generate a new data folder path using src_root_folder and the specified suffix attribute. Also sync the current (aka previous) data folder previously registered in the database. This method typically has to be called in create_todump_list() when the dumper actually knows some information about the resource, like the actual release.
- post_download(remotefile, localfile)[source]¶
Placeholder to add a custom process once a file is downloaded. This is a good place to check file’s integrity. Optional
- post_dump(*args, **kwargs)[source]¶
Placeholder to add a custom process once the whole resource has been dumped. Optional.
- post_dump_delete_files()[source]¶
Delete files after dump
Invoke this method in post_dump to synchronously delete the list of paths stored in self.to_delete, in order.
Non-recursive. If directories need to be removed, build the list such that files residing in the directory are removed first and then the directory. (Hint: see os.walk(dir, topdown=False))
- release_client()[source]¶
Do whatever necessary (like closing network connection) to “release” the client
- remote_is_better(remotefile, localfile)[source]¶
Compared to the local file, check whether the remote file is worth downloading (e.g. it is bigger or newer).
- property src_doc¶
- property src_dump¶
- to_delete: List[str | bytes | PathLike]¶
Populate with a list of relative paths of files to delete.
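To make the contract above concrete, here is a minimal sketch of a custom dumper filling self.to_dump in create_todump_list(). The source name, URL, file name and release value are hypothetical, and HTTPDumper (documented below) is used so that download() is already provided:
import os

from biothings.hub.dataload.dumper import HTTPDumper


class MyFlatFileDumper(HTTPDumper):
    SRC_NAME = "myflatfile"                 # hypothetical
    SRC_ROOT_FOLDER = "/data/myflatfile"    # hypothetical; usually derived from the hub config

    def create_todump_list(self, force=False, **kwargs):
        # Knowing the release here lets new_data_folder build a versioned folder path
        self.release = "2024-01-01"         # hypothetical; usually derived from the remote source
        remote = "https://example.org/data/annotations.tsv"
        local = os.path.join(self.new_data_folder, "annotations.tsv")
        if force or not os.path.exists(local):
            self.to_dump.append({"remote": remote, "local": local})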
- class biothings.hub.dataload.dumper.DockerContainerDumper(*args, **kwargs)[source]¶
Bases:
BaseDumper
Start a docker container (typically running on a different server) to prepare the data file on the remote container, and then download this file to the local data source folder. This dumper will do the following steps: - Boot up a container from the provided parameters: image, tag, container_name. - The container entrypoint will be overridden by this long-running command: "tail -f /dev/null" - When container_name and image are provided together, the dumper will try to run the container_name.
If the container with container_name does not exist, the dumper will start a new container from the image param, and set its name to container_name.
- Run the dump_command inside this container. This command MUST block the dumper until the data file is completely prepared.
This guarantees that the remote file is ready for downloading.
Run the get_version_cmd inside this container, if it is provided. Set this command's output as self.release.
Download the remote file via Docker API, extract the downloaded .tar file.
- When the downloading is complete:
if keep_container=false: Remove the above container after.
if keep_container=true: leave this container running.
If there is any exception while dumping data, the remote container won't be removed; this helps in addressing the problem.
- These are supported connection types from the Hub server to the remote Docker host server:
ssh: Prerequisite: the SSH Key-Based Authentication is configured
unix: Local connection
http: Use an insecure HTTP connection over a TCP socket
- https: Use a secured HTTPS connection using TLS. Prerequisite:
The Docker API on the remote server MUST BE secured with TLS
A TLS key pair is generated on the Hub server and placed inside the same data plugin folder or the data source folder
All info about the Docker client connection MUST BE defined in the config.py file, under the DOCKER_CONFIG key (example below). An optional DOCKER_HOST can be used to override the docker connection in any docker dumper, regardless of the value of the src_url. For example, you can set DOCKER_HOST="localhost" for local testing:
DOCKER_CONFIG = {
    "CONNECTION_NAME_1": {
        "tls_cert_path": "/path/to/cert.pem",
        "tls_key_path": "/path/to/key.pem",
        "client_url": "https://remote-docker-host:port"
    },
    "CONNECTION_NAME_2": {
        "client_url": "ssh://user@remote-docker-host"
    },
    "localhost": {
        "client_url": "unix://var/run/docker.sock"
    }
}
DOCKER_HOST = "localhost"
- The data_url should match the following format:
docker://CONNECTION_NAME?image=DOCKER_IMAGE&tag=TAG&path=/path/to/remote_file&dump_command="this is custom command"&container_name=CONTAINER_NAME&keep_container=true&get_version_cmd="cmd" # NOQA
- Supported params:
image: (Optional) the Docker image name
tag: (Optional) the image tag
container_name: (Optional) If this param is provided, the image param will be discarded when the dumper runs.
path: (Required) path to the remote file inside the Docker container.
dump_command: (Required) This command will be run inside the Docker container in order to create the remote file.
- keep_container: (Optional) accepted values: true/false, default: false.
If keep_container=true, the remote container will be persisted.
If keep_container=false, the remote container will be removed in the end of dump step.
- get_version_cmd: (Optional) The custom command for checking the release version of the local and remote file. Note that:
This command must be runnable in both the local Hub (for checking the local file) and the remote container (for checking the remote file).
- "{}" MUST exist in the command; it will be replaced by the data file path when the dumper runs,
ex: get_version_cmd="md5sum {} | awk '{ print $1 }'" will be run as: md5sum /path/to/remote_file | awk '{ print $1 }' (and likewise for /path/to/local_file)
- Ex:
docker://CONNECTION_NAME?image=IMAGE_NAME&tag=IMAGE_TAG&path=/path/to/remote_file(inside the container)&dump_command="run something with output is written to -O /path/to/remote_file (inside the container)" # NOQA
docker://CONNECTION_NAME?container_name=CONTAINER_NAME&path=/path/to/remote_file(inside the container)&dump_command="run something with output is written to -O /path/to/remote_file (inside the container)"&keep_container=true&get_version_cmd="md5sum {} | awk '{ print $1 }'" # NOQA
docker://localhost?image=dockstore_dumper&path=/data/dockstore_crawled/data.ndjson&dump_command="/home/biothings/run-dockstore.sh"&keep_container=1
docker://localhost?image=dockstore_dumper&tag=latest&path=/data/dockstore_crawled/data.ndjson&dump_command="/home/biothings/run-dockstore.sh"&keep_container=True # NOQA
docker://localhost?image=praqma/network-multitool&tag=latest&path=/tmp/annotations.zip&dump_command="/usr/bin/wget https://s3.pgkb.org/data/annotations.zip -O /tmp/annotations.zip"&keep_container=false&get_version_cmd="md5sum {} | awk '{ print $1 }'" # NOQA
docker://localhost?container_name=<YOUR CONTAINER NAME>&path=/tmp/annotations.zip&dump_command="/usr/bin/wget https://s3.pgkb.org/data/annotations.zip -O /tmp/annotations.zip"&keep_container=true&get_version_cmd="md5sum {} | awk '{ print $1 }'" # NOQA
Container metadata: - All the above params in the data_url can be pre-configured in the Dockerfile by adding LABELs. This config will be used as the fallback for the data_url params:
The dumper will look for those params in both the data_url and the container metadata. If a param does not exist in the data_url, the dumper will use its value from the container metadata (if it exists there).
- For example, in the Dockerfile:
LABEL "path"="/tmp/annotations.zip"
LABEL "dump_command"="/usr/bin/wget https://s3.pgkb.org/data/annotations.zip -O /tmp/annotations.zip"
LABEL keep_container="true"
LABEL desc=test
LABEL container_name=mydocker
- CONTAINER_NAME = None¶
- DATA_PATH = None¶
- DOCKER_CLIENT_URL = None¶
- DOCKER_IMAGE = None¶
- DUMP_COMMAND = None¶
- GET_VERSION_CMD = None¶
- KEEP_CONTAINER = False¶
- MAX_PARALLEL_DUMP = 1¶
- ORIGINAL_CONTAINER_STATUS = None¶
- TIMEOUT = 300¶
- async create_todump_list(force=False, job_manager=None, **kwargs)[source]¶
Create the list of files to dump, called in dump method. This method will execute dump_command to generate the remote file in docker container, so we define this method as async to make it non-blocking.
- delete_or_restore_container()[source]¶
Delete the container if it’s created by the dumper, or restore it to its original status if it’s pre-existing.
- download(remote_file, local_file)[source]¶
Download "remotefile" to the local location defined by "localfile". Return relevant information about remotefile (depends on the actual client).
- generate_remote_file()[source]¶
Execute dump_command to generate the remote file, called in create_todump_list method
- get_remote_file()[source]¶
Return the remote file path within the container. In most cases, dump_command should either generate this file or check that it's ready if another automated pipeline generates it.
- get_remote_lastmodified(remote_file)[source]¶
get the last modified time of the remote file within the container using stat command
- post_dump(*args, **kwargs)[source]¶
Delete container or restore the container status if necessary, called in the dump method after the dump is done (during the “post” step)
- prepare_dumper_params()[source]¶
Read all docker dumper parameters from either the data plugin manifest or the Docker image or container metadata. Of course, at least one of docker_image or container_name parameters must be defined in the data plugin manifest first. If the parameter is not defined in the data plugin manifest, we will try to read it from the Docker image metadata.
- prepare_local_folders(localfile)[source]¶
prepare the local folder for the localfile, called in download method
- prepare_remote_container()[source]¶
prepare the remote container and set self.container, called in create_todump_list method
- remote_is_better(remote_file, local_file)[source]¶
Compared to the local file, check whether the remote file is worth downloading (e.g. it is bigger or newer).
- set_release()[source]¶
Call get_version_cmd to get the release; called in the get_todump_list method. If get_version_cmd is not defined, use a timestamp as the release.
This is currently a blocking method, assuming get_version_cmd is a quick command. But if necessary, we can make it async in the future.
- property source_config¶
- class biothings.hub.dataload.dumper.DummyDumper(*args, **kwargs)[source]¶
Bases:
BaseDumper
DummyDumper will do nothing… (useful for datasources that can’t be downloaded anymore but still need to be integrated, ie. fill src_dump, etc…)
- class biothings.hub.dataload.dumper.DumperManager(job_manager, datasource_path='dataload.sources', *args, **kwargs)[source]¶
Bases:
BaseSourceManager
- SOURCE_CLASS¶
alias of
BaseDumper
- call(src, method_name, *args, **kwargs)[source]¶
Create a dumper for datasource "src" and call method "method_name" on it, with the given arguments. Used to create arbitrary calls on a dumper. "method_name" within the dumper definition must be a coroutine.
- clean_stale_status()[source]¶
During startup, search for actions in progress which would have been interrupted and change their state to "canceled". For example, some downloading processes could have been interrupted; at startup, their "downloading" status should be changed to "canceled" to reflect the actual state of these datasources. This must be overridden in subclasses.
- get_schedule(dumper_name)[source]¶
Return the corresponding schedule for dumper_name. Example result:
{
    "cron": "0 9 * * *",
    "strdelta": "15h:20m:33s",
}
- class biothings.hub.dataload.dumper.FTPDumper(src_name=None, src_root_folder=None, log_folder=None, archive=None)[source]¶
Bases:
BaseDumper
- BLOCK_SIZE: int | None = None¶
- CWD_DIR = ''¶
- FTP_HOST = ''¶
- FTP_PASSWD = ''¶
- FTP_TIMEOUT = 600.0¶
- FTP_USER = ''¶
- download(remotefile, localfile)[source]¶
Download "remotefile" to the local location defined by "localfile". Return relevant information about remotefile (depends on the actual client).
- class biothings.hub.dataload.dumper.FilesystemDumper(src_name=None, src_root_folder=None, log_folder=None, archive=None)[source]¶
Bases:
BaseDumper
This dumper works locally and copies (or moves) files to the datasource folder.
- FS_OP = 'cp'¶
- download(remotefile, localfile)[source]¶
Download "remotefile" to the local location defined by "localfile". Return relevant information about remotefile (depends on the actual client).
- class biothings.hub.dataload.dumper.GitDumper(src_name=None, src_root_folder=None, log_folder=None, archive=None)[source]¶
Bases:
BaseDumper
Git dumper gets data from a git repo. Repo is stored in SRC_ROOT_FOLDER (without versioning) and then versions/releases are fetched in SRC_ROOT_FOLDER/<release>
- DEFAULT_BRANCH = None¶
- GIT_REPO_URL = None¶
- download(remotefile, localfile)[source]¶
Download "remotefile" to the local location defined by "localfile". Return relevant information about remotefile (depends on the actual client).
- async dump(release='HEAD', force=False, job_manager=None, **kwargs)[source]¶
Dump (i.e. download) the resource as needed. This should be called after instance creation. The 'force' argument will force the dump, and is passed on to the create_todump_list() method.
- property new_data_folder¶
Generate a new data folder path using src_root_folder and the specified suffix attribute. Also sync the current (aka previous) data folder previously registered in the database. This method typically has to be called in create_todump_list() when the dumper actually knows some information about the resource, like the actual release.
- class biothings.hub.dataload.dumper.GoogleDriveDumper(src_name=None, src_root_folder=None, log_folder=None, archive=None)[source]¶
Bases:
HTTPDumper
- download(remoteurl, localfile)[source]¶
- remoteurl is a google drive link containing a document ID, such as:
https://drive.google.com/open?id=<1234567890ABCDEF>
https://drive.google.com/file/d/<1234567890ABCDEF>/view
It can also be just a document ID
- class biothings.hub.dataload.dumper.HTTPDumper(src_name=None, src_root_folder=None, log_folder=None, archive=None)[source]¶
Bases:
BaseDumper
Dumper using HTTP protocol and “requests” library
- IGNORE_HTTP_CODE = []¶
- RESOLVE_FILENAME = False¶
- VERIFY_CERT = True¶
- download(remoteurl, localfile, headers={})[source]¶
Download "remotefile" to the local location defined by "localfile". Return relevant information about remotefile (depends on the actual client).
- class biothings.hub.dataload.dumper.LastModifiedBaseDumper(src_name=None, src_root_folder=None, log_folder=None, archive=None)[source]¶
Bases:
BaseDumper
Use SRC_URLS as a list of URLs to download, and implement create_todump_list() according to that list. Should be used together with a dumper talking the actual underlying protocol.
- SRC_URLS = []¶
- class biothings.hub.dataload.dumper.LastModifiedFTPDumper(src_name=None, src_root_folder=None, log_folder=None, archive=None)[source]¶
Bases:
LastModifiedBaseDumper
SRC_URLS contains a list of URLs pointing to files to download; the FTP MDTM command is used to check whether the files should be downloaded. The release is generated from the last file's MDTM in SRC_URLS, and formatted according to RELEASE_FORMAT. See also LastModifiedHTTPDumper, which works the same way but for the HTTP protocol. Note: this dumper is a wrapper over FTPDumper; one URL will give one FTPDumper instance.
- RELEASE_FORMAT = '%Y-%m-%d'¶
- download(urlremotefile, localfile, headers={})[source]¶
Download "remotefile" to the local location defined by "localfile". Return relevant information about remotefile (depends on the actual client).
- release_client()[source]¶
Do whatever necessary (like closing network connection) to “release” the client
- class biothings.hub.dataload.dumper.LastModifiedHTTPDumper(src_name=None, src_root_folder=None, log_folder=None, archive=None)[source]¶
Bases:
HTTPDumper
LastModifiedBaseDumper
Given a list of URLs, check the Last-Modified header to see whether the file should be downloaded. Sub-classes should only have to declare SRC_URLS. Optionally, another header field can be used instead of Last-Modified, but the date format must follow RFC 2616. If that header doesn't exist, the data will always be downloaded (bypass). The release is generated from the last file's Last-Modified in SRC_URLS, and formatted according to RELEASE_FORMAT.
- ETAG = 'ETag'¶
- LAST_MODIFIED = 'Last-Modified'¶
- RELEASE_FORMAT = '%Y-%m-%d'¶
- RESOLVE_FILENAME = True¶
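For sources that are just a fixed list of downloadable URLs, a declarative dumper can be as small as the sketch below (the source name, folder and URLs are hypothetical):
from biothings.hub.dataload.dumper import LastModifiedHTTPDumper


class MyUrlListDumper(LastModifiedHTTPDumper):
    SRC_NAME = "myurls"                  # hypothetical
    SRC_ROOT_FOLDER = "/data/myurls"     # hypothetical; usually derived from the hub config
    SRC_URLS = [
        "https://example.org/data/file1.tsv.gz",
        "https://example.org/data/file2.tsv.gz",
    ]
    SCHEDULE = "0 9 * * *"               # optionally re-check for new data every day at 9:00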
- class biothings.hub.dataload.dumper.ManualDumper(*args, **kwargs)[source]¶
Bases:
BaseDumper
This dumper will assist the user in dumping a resource. It will usually expect the files to be downloaded first (sometimes there's no easy way to automate this process). Once downloaded, a call to dump() will make sure everything is fine in terms of files and metadata.
- async dump(path, release=None, force=False, job_manager=None, **kwargs)[source]¶
Dump (i.e. download) the resource as needed. This should be called after instance creation. The 'force' argument will force the dump, and is passed on to the create_todump_list() method.
- property new_data_folder¶
Generate a new data folder path using src_root_folder and the specified suffix attribute. Also sync the current (aka previous) data folder previously registered in the database. This method typically has to be called in create_todump_list() when the dumper actually knows some information about the resource, like the actual release.
- class biothings.hub.dataload.dumper.WgetDumper(src_name=None, src_root_folder=None, log_folder=None, archive=None)[source]¶
Bases:
BaseDumper
- create_todump_list(force=False, **kwargs)[source]¶
Fill the self.to_dump list with dict("remote": remote_path, "local": local_path) elements. This is the to-do list for the dumper. It's a good place to check whether a file actually needs to be downloaded. If 'force' is True though, all files will be considered for download.
- download(remoteurl, localfile)[source]¶
Download "remotefile" to the local location defined by "localfile". Return relevant information about remotefile (depends on the actual client).
biothings.hub.dataload.source¶
- class biothings.hub.dataload.source.SourceManager(source_list, dump_manager, upload_manager, data_plugin_manager)[source]¶
Bases:
BaseSourceManager
Helper class to get information about a datasource, whether it has a dumper and/or uploaders associated.
- reset(name, key='upload', subkey=None)[source]¶
Reset, i.e. delete, internal data (src_dump document) for the given source name, key, and subkey. This method is useful to clean outdated information in the Hub's internal database.
- Ex: key=upload, name=mysource, subkey=mysubsource, will delete the entry in the corresponding
src_dump doc (_id=mysource), under key "upload", for the sub-source named "mysubsource"
"key" can be either 'download', 'upload' or 'inspect'. Because there's no such notion of subkey for dumpers (i.e. 'download'), subkey is optional.
biothings.hub.dataload.storage¶
biothings.hub.dataload.sync¶
Deprecated. This module is not used any more.
biothings.hub.dataload.uploader¶
- class biothings.hub.dataload.uploader.BaseSourceUploader(db_conn_info, collection_name=None, log_folder=None, *args, **kwargs)[source]¶
Bases:
object
Default datasource uploader. Database storage can be done in batches or line by line. Duplicated records aren't allowed.
db_conn_info is a database connection info tuple (host,port) to fetch/store information about the datasource’s state.
- config = <ConfigurationWrapper over <module 'config' from '/home/docs/checkouts/readthedocs.org/user_builds/biothingsapi/checkouts/stable/biothings/hub/default_config.py'>>¶
- classmethod create(db_conn_info, *args, **kwargs)[source]¶
Factory-like method: just return an instance of this uploader (used by SourceManager). May be overridden in a sub-class to generate more than one instance per class, like a true factory. This is useful when a resource is split into different collections but the data structure doesn't change (it's really just data split across multiple collections, usually for parallelization purposes). Instead of having an actual class for each split collection, the factory will generate them on-the-fly.
- property fullname¶
- get_pinfo()[source]¶
Return dict containing information about the current process (used to report in the hub)
- get_predicates()[source]¶
Return a list of predicates (functions returning true/false, as in mathematical logic) which instruct/dictate whether the job manager should start a job (process/thread).
- keep_archive = 10¶
- async load(steps=('data', 'post', 'master', 'clean'), force=False, batch_size=10000, job_manager=None, **kwargs)[source]¶
Main resource load process. Reads data from doc_c using chunks sized by batch_size. steps defines the different processes used to load the resource:
- "data": will store actual data into single collections
- "post": will perform post data load operations
- "master": will register the master document in src_master
- load_data(data_path)[source]¶
Parse data from data_path and return a structure ready to be inserted in the database. In general, data_path is a folder path, but in parallel mode (using the parallelizer option), data_path is a file path.
:param data_path: can be a folder path or a file path
:return: structure ready to be inserted in the database
- main_source = None¶
- make_temp_collection()[source]¶
Create a temp collection for dataloading, e.g., entrez_geneinfo_INEMO.
- name = None¶
- post_update_data(steps, force, batch_size, job_manager, **kwargs)[source]¶
Override as needed to perform operations after data has been uploaded
- prepare_src_dump()[source]¶
Sync collection information (src_doc) with the src_dump collection, and return the src_dump collection.
- regex_name = None¶
- register_status(status, subkey='upload', **extra)[source]¶
Register step status, ie. status for a sub-resource
- storage_class¶
alias of
BasicStorage
- switch_collection()[source]¶
After a successful load, rename temp_collection to the regular collection name, renaming the existing collection to a temp name for archiving purposes.
- unprepare()[source]¶
Reset anything that's not picklable (so self can be pickled); return what's been reset as a dict, so self can be restored once pickling is done.
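As an illustration, a minimal uploader sketch could look like the following. The source name, file name, fields and mapping are hypothetical; load_data() simply yields documents parsed from the data folder produced by a matching dumper:
import csv
import os

from biothings.hub.dataload.uploader import BaseSourceUploader


class MyFlatFileUploader(BaseSourceUploader):
    name = "myflatfile"    # hypothetical; typically matches the dumper's SRC_NAME

    def load_data(self, data_path):
        # data_path is the data folder produced by the dumper
        infile = os.path.join(data_path, "annotations.tsv")
        with open(infile) as fh:
            for row in csv.DictReader(fh, delimiter="\t"):
                # each document typically carries an "_id" key
                yield {"_id": row["id"], "annotation": row["annotation"]}

    @classmethod
    def get_mapping(cls):
        # ElasticSearch mapping for the fields produced above
        return {"annotation": {"type": "text"}}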
- class biothings.hub.dataload.uploader.DummySourceUploader(db_conn_info, collection_name=None, log_folder=None, *args, **kwargs)[source]¶
Bases:
BaseSourceUploader
Dummy uploader: won't upload any data, assuming data is already there, but makes sure every other bit of information is there for the overall process (useful when online data isn't available anymore).
db_conn_info is a database connection info tuple (host,port) to fetch/store information about the datasource’s state.
- class biothings.hub.dataload.uploader.IgnoreDuplicatedSourceUploader(db_conn_info, collection_name=None, log_folder=None, *args, **kwargs)[source]¶
Bases:
BaseSourceUploader
Same as the default uploader, but will store records and ignore any duplicated-record errors (use with caution…). Storage is done in batches using unordered bulk operations.
db_conn_info is a database connection info tuple (host,port) to fetch/store information about the datasource’s state.
- storage_class¶
alias of
IgnoreDuplicatedStorage
- class biothings.hub.dataload.uploader.MergerSourceUploader(db_conn_info, collection_name=None, log_folder=None, *args, **kwargs)[source]¶
Bases:
BaseSourceUploader
db_conn_info is a database connection info tuple (host,port) to fetch/store information about the datasource’s state.
- storage_class¶
alias of
MergerStorage
- class biothings.hub.dataload.uploader.NoBatchIgnoreDuplicatedSourceUploader(db_conn_info, collection_name=None, log_folder=None, *args, **kwargs)[source]¶
Bases:
BaseSourceUploader
Same as the default uploader, but will store records and ignore any duplicated-record errors (use with caution…). Storage is done line by line (slow, not using batches) but preserves the order of the data in the input file.
db_conn_info is a database connection info tuple (host,port) to fetch/store information about the datasource’s state.
- storage_class¶
alias of
NoBatchIgnoreDuplicatedStorage
- class biothings.hub.dataload.uploader.NoDataSourceUploader(db_conn_info, collection_name=None, log_folder=None, *args, **kwargs)[source]¶
Bases:
BaseSourceUploader
This uploader won't upload any data and won't even assume there's actual data (different from DummySourceUploader on this point). It's useful, for instance, when a mapping needs to be stored (get_mapping()) but the data doesn't come from an actual upload (i.e. it is generated).
db_conn_info is a database connection info tuple (host,port) to fetch/store information about the datasource’s state.
- storage_class¶
alias of
NoStorage
- class biothings.hub.dataload.uploader.ParallelizedSourceUploader(db_conn_info, collection_name=None, log_folder=None, *args, **kwargs)[source]¶
Bases:
BaseSourceUploader
db_conn_info is a database connection info tuple (host,port) to fetch/store information about the datasource’s state.
- class biothings.hub.dataload.uploader.UploaderManager(poll_schedule=None, *args, **kwargs)[source]¶
Bases:
BaseSourceManager
After registering datasources, manager will orchestrate source uploading.
- SOURCE_CLASS¶
alias of
BaseSourceUploader
- clean_stale_status()[source]¶
During startup, search for actions in progress which would have been interrupted and change their state to "canceled". For example, some downloading processes could have been interrupted; at startup, their "downloading" status should be changed to "canceled" to reflect the actual state of these datasources. This must be overridden in subclasses.
- filter_class(klass)[source]¶
Gives opportunity for subclass to check given class and decide to keep it or not in the discovery process. Returning None means “skip it”.
- poll(state, func)[source]¶
Search for sources in collection 'col' with a pending flag list containing 'state' and call 'func' for each document found (with the doc as the only param).
- register_classes(klasses)[source]¶
Register each class in the self.register dict. The key will be used to retrieve the source class, create an instance and run methods from it. It must be implemented in subclasses as each manager may need to access its sources differently, based on different keys.
biothings.hub.dataload.validator¶
biothings.hub.dataplugin¶
biothings.hub.dataplugin.assistant¶
- class biothings.hub.dataplugin.assistant.AdvancedPluginLoader(plugin_name)[source]¶
Bases:
BasePluginLoader
- loader_type = 'advanced'¶
- class biothings.hub.dataplugin.assistant.AssistantManager(data_plugin_manager, dumper_manager, uploader_manager, keylookup=None, default_export_folder='hub/dataload/sources', *args, **kwargs)[source]¶
Bases:
BaseSourceManager
- configure(klasses=[<class 'biothings.hub.dataplugin.assistant.GithubAssistant'>, <class 'biothings.hub.dataplugin.assistant.LocalAssistant'>])[source]¶
- export(plugin_name, folder=None, what=['dumper', 'uploader', 'mapping'], purge=False)[source]¶
Export generated code for a given plugin name, in the given folder (or use DEFAULT_EXPORT_FOLDER if None). Exported information can be:
- dumper: dumper class generated from the manifest
- uploader: uploader class generated from the manifest
- mapping: mapping generated from inspection or from the manifest
If "purge" is true, any existing folder/code will be deleted first; otherwise an error is raised if some folders/files already exist.
- load(autodiscover=True)[source]¶
Load plugins registered in internal Hub database and generate/register dumpers & uploaders accordingly. If autodiscover is True, also search DATA_PLUGIN_FOLDER for existing plugin directories not registered yet in the database, and register them automatically.
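For instance, from the hub shell, where an AssistantManager instance is exposed as am (see the Commands section below), registering plugins and exporting the generated code could look like this (the plugin name is hypothetical):
# register/refresh plugins found in the internal database and in DATA_PLUGIN_FOLDER
am.load(autodiscover=True)
# export the dumper, uploader and mapping generated for plugin "myplugin",
# deleting any previously exported code first
am.export("myplugin", what=["dumper", "uploader", "mapping"], purge=True)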
- class biothings.hub.dataplugin.assistant.AssistedDumper[source]¶
Bases:
object
- DATA_PLUGIN_FOLDER = None¶
- class biothings.hub.dataplugin.assistant.AssistedUploader[source]¶
Bases:
object
- DATA_PLUGIN_FOLDER = None¶
- class biothings.hub.dataplugin.assistant.BaseAssistant(url)[source]¶
Bases:
object
- data_plugin_manager = None¶
- dumper_manager = None¶
- handle()[source]¶
Access self.url and do whatever is necessary to bring the code to life within the hub… (hint: that may involve creating a dumper on-the-fly and registering that dumper with a manager…)
- keylookup = None¶
- property loader¶
Return loader object able to interpret plugin’s folder content
- loaders = {'advanced': <class 'biothings.hub.dataplugin.assistant.AdvancedPluginLoader'>, 'manifest': <class 'biothings.hub.dataplugin.assistant.ManifestBasedPluginLoader'>}¶
- property plugin_name¶
Return plugin name, parsed from self.url and set self._src_folder as path to folder containing dataplugin source code
- plugin_type = None¶
- uploader_manager = None¶
- class biothings.hub.dataplugin.assistant.BasePluginLoader(plugin_name)[source]¶
Bases:
object
- loader_type = None¶
- class biothings.hub.dataplugin.assistant.GithubAssistant(url)[source]¶
Bases:
BaseAssistant
- handle()[source]¶
Access self.url and do whatever is necessary to bring the code to life within the hub… (hint: that may involve creating a dumper on-the-fly and registering that dumper with a manager…)
- property plugin_name¶
Return plugin name, parsed from self.url and set self._src_folder as path to folder containing dataplugin source code
- plugin_type = 'github'¶
- class biothings.hub.dataplugin.assistant.LocalAssistant(url)[source]¶
Bases:
BaseAssistant
- handle()[source]¶
Access self.url and do whatever is necessary to bring the code to life within the hub… (hint: that may involve creating a dumper on-the-fly and registering that dumper with a manager…)
- property plugin_name¶
Return plugin name, parsed from self.url and set self._src_folder as path to folder containing dataplugin source code
- plugin_type = 'local'¶
- class biothings.hub.dataplugin.assistant.ManifestBasedPluginLoader(plugin_name)[source]¶
Bases:
BasePluginLoader
- dumper_registry = {'docker': <class 'biothings.hub.dataload.dumper.DockerContainerDumper'>, 'ftp': <class 'biothings.hub.dataload.dumper.LastModifiedFTPDumper'>, 'http': <class 'biothings.hub.dataload.dumper.LastModifiedHTTPDumper'>, 'https': <class 'biothings.hub.dataload.dumper.LastModifiedHTTPDumper'>}¶
- get_code_for_mod_name(mod_name)[source]¶
Returns string literal and name of function, given a path
- Parameters:
mod_name – string with module name and function name, separated by colon
- Returns:
A tuple containing:
- the indented string literal for the specified function
- the name of the function
- Return type:
Tuple[str, str]
- loader_type = 'manifest'¶
biothings.hub.dataplugin.manager¶
- class biothings.hub.dataplugin.manager.DataPluginManager(job_manager, datasource_path='dataload.sources', *args, **kwargs)[source]¶
Bases:
DumperManager
- class biothings.hub.dataplugin.manager.GitDataPlugin(src_name=None, src_root_folder=None, log_folder=None, archive=None)[source]¶
Bases:
GitDumper
- class biothings.hub.dataplugin.manager.ManualDataPlugin(*args, **kwargs)[source]¶
Bases:
ManualDumper
biothings.hub.datarelease¶
biothings.hub.datarelease.publisher¶
- class biothings.hub.datarelease.publisher.BasePublisher(envconf, log_folder, es_backups_folder, *args, **kwargs)[source]¶
Bases:
BaseManager
BaseStatusRegisterer
- property category¶
- clean_stale_status()[source]¶
During startup, search for actions in progress which would have been interrupted and change their state to "canceled". For example, some downloading processes could have been interrupted; at startup, their "downloading" status should be changed to "canceled" to reflect the actual state of these datasources. This must be overridden in subclasses.
- property collection¶
Return collection object used to fetch doc in which we store status
- get_pinfo()[source]¶
Return dict containing information about the current process (used to report in the hub)
- get_pre_post_previous_result(build_doc, key_value)[source]¶
In order to start a pre- or post-publish pipeline, an initial "previous result", fed along the pipeline to each next step, has to be defined; it depends on the type of publisher.
- publish_release_notes(release_folder, build_version, s3_release_folder, s3_release_bucket, aws_key, aws_secret, prefix='release_')[source]¶
- class biothings.hub.datarelease.publisher.DiffPublisher(diff_manager, *args, **kwargs)[source]¶
Bases:
BasePublisher
- get_pre_post_previous_result(build_doc, key_value)[source]¶
In order to start a pre- or post-publish pipeline, an initial "previous result", fed along the pipeline to each next step, has to be defined; it depends on the type of publisher.
- post_publish(build_name, repo_conf, build_doc)[source]¶
Post-publish hook, running steps declared in config, but also whatever would be defined in a sub-class
- pre_publish(previous_build_name, repo_conf, build_doc)[source]¶
Pre-publish hook, running steps declared in config, but also whatever would be defined in a sub-class
- publish(build_name, previous_build=None, steps=('pre', 'reset', 'upload', 'meta', 'post'))[source]¶
Publish diff files and metadata about the diff files, release note, etc… on s3. Using build_name, a src_build document is fetched, and a diff release is searched. If more than one diff release is found, “previous_build” must be specified to pick the correct one. - steps:
pre/post: optional steps processed as first and last steps.
reset: highly recommended, reset synced flag in diff files so they won’t get skipped when used…
upload: upload diff_folder content to S3
meta: publish/register the version as available for auto-updating hubs
- class biothings.hub.datarelease.publisher.ReleaseManager(diff_manager, snapshot_manager, poll_schedule=None, *args, **kwargs)[source]¶
Bases:
BaseManager
BaseStatusRegisterer
- DEFAULT_DIFF_PUBLISHER_CLASS¶
alias of
DiffPublisher
- DEFAULT_SNAPSHOT_PUBLISHER_CLASS¶
alias of
SnapshotPublisher
- build_release_note(old_colname, new_colname, note=None) ReleaseNoteSource [source]¶
Build a release note containing most significant changes between build names “old_colname” and “new_colname”. An optional end note can be added to bring more specific information about the release.
Return a dictionary containing significant changes.
- clean_stale_status()[source]¶
During startup, search for actions in progress which would have been interrupted and change their state to "canceled". For example, some downloading processes could have been interrupted; at startup, their "downloading" status should be changed to "canceled" to reflect the actual state of these datasources. This must be overridden in subclasses.
- property collection¶
Return collection object used to fetch doc in which we store status
- configure(release_confdict)[source]¶
Configure manager with release “confdict”. See config_hub.py in API for the format.
- create_release_note(old, new, filename=None, note=None, format='txt')[source]¶
Generate release note files, in TXT and JSON format, containing significant changes summary between target collections old and new. Output files are stored in a diff folder using generate_folder(old,new).
‘filename’ can optionally be specified, though it’s not recommended as the publishing pipeline, using these files, expects a filenaming convention.
‘note’ is an optional free text that can be added to the release note, at the end.
txt ‘format’ is the only one supported for now.
- get_pinfo()[source]¶
Return dict containing information about the current process (used to report in the hub)
- poll(state, func)[source]¶
Search for sources in collection 'col' with a pending flag list containing 'state' and call 'func' for each document found (with the doc as the only param).
- publish_diff(publisher_env, build_name, previous_build=None, steps=('pre', 'reset', 'upload', 'meta', 'post'))[source]¶
- publish_snapshot(publisher_env, snapshot, build_name=None, previous_build=None, steps=('pre', 'meta', 'post'))[source]¶
- reset_synced(old, new)[source]¶
Reset sync flags for diff files produced between the "old" and "new" builds. Once a diff has been applied, diff files are flagged as synced so a subsequent diff won't be applied twice (for optimization reasons, not to avoid data corruption, since diff files can be safely applied multiple times). If there's any need to apply the diff another time, the diff files need to be reset.
- class biothings.hub.datarelease.publisher.SnapshotPublisher(snapshot_manager, *args, **kwargs)[source]¶
Bases:
BasePublisher
- get_pre_post_previous_result(build_doc, key_value)[source]¶
In order to start a pre- or post-publish pipeline, an initial "previous result", fed along the pipeline to each next step, has to be defined; it depends on the type of publisher.
- post_publish(snapshot_name, repo_conf, build_doc)[source]¶
Post-publish hook, running steps declared in config, but also whatever would be defined in a sub-class
- pre_publish(snapshot_name, repo_conf, build_doc)[source]¶
Pre-publish hook, running steps declared in config, but also whatever would be defined in a sub-class
- publish(snapshot, build_name=None, previous_build=None, steps=('pre', 'meta', 'post'))[source]¶
Publish snapshot metadata to S3. If the snapshot repository is of type "s3", data isn't actually uploaded/published since it's already there on s3. If type "fs", some "pre" steps can be added to the RELEASE_CONFIG parameter to archive and upload it to s3. Metadata about the snapshot, release note, etc… is then uploaded to the correct buckets as defined in the config, and "post" steps can be run afterward.
Though snapshots don’t need any previous version to be applied on, a release note with significant changes between current snapshot and a previous version could have been generated. By default, snapshot name is used to pick one single build document and from the document, get the release note information.
biothings.hub.datarelease.releasenote¶
- class biothings.hub.datarelease.releasenote.ReleaseNoteSource(old_src_build_reader: ReleaseNoteSrcBuildReader, new_src_build_reader: ReleaseNoteSrcBuildReader, diff_stats_from_metadata_file: dict, addon_note: str)[source]¶
Bases:
object
- class biothings.hub.datarelease.releasenote.ReleaseNoteSrcBuildReader(src_build_doc: dict)[source]¶
Bases:
object
- attach_cold_src_build_reader(other: ReleaseNoteSrcBuildReader)[source]¶
Attach a cold src_build reader.
It’s required that self is a hot src_builder reader and other is cold.
- property build_id: str¶
- property build_stats: dict¶
- property build_version: str¶
- property cold_collection_name: str¶
- property datasource_mapping: dict¶
- property datasource_stats: dict¶
- property datasource_versions: dict¶
- class biothings.hub.datarelease.releasenote.ReleaseNoteSrcBuildReaderAdapter(src_build_reader: ReleaseNoteSrcBuildReader)[source]¶
Bases:
object
- property build_stats¶
- property datasource_info¶
- class biothings.hub.datarelease.releasenote.ReleaseNoteTxt(source: ReleaseNoteSource)[source]¶
Bases:
object
biothings.hub.datatransform¶
biothings.hub.datatransform.ciidstruct¶
CIIDStruct - case insensitive id matching data structure
- class biothings.hub.datatransform.ciidstruct.CIIDStruct(field=None, doc_lst=None)[source]¶
Bases:
IDStruct
CIIDStruct - id structure for use with the DataTransform classes. The basic idea is to provide a structure that provides a list of (original_id, current_id) pairs.
This is a case-insensitive version of IDStruct.
Initialize the structure.
:param field: field for documents to use as an initial id (optional)
:param doc_lst: list of documents to use when building an initial list (optional)
biothings.hub.api.datatransform.datatransform_api¶
DataTransformAPI - classes around API-based key lookup.
- class biothings.hub.datatransform.datatransform_api.BiothingsAPIEdge(lookup, fields, weight=1, label=None, url=None)[source]¶
Bases:
DataTransformEdge
APIEdge - IDLookupEdge object for API calls
Initialize the class.
:param label: A label can be used for debugging purposes.
- property client¶
property getter for client
- client_name = None¶
- class biothings.hub.datatransform.datatransform_api.DataTransformAPI(input_types, output_types, *args, **kwargs)[source]¶
Bases:
DataTransform
Perform key lookup or key conversion from one key type to another using an API endpoint as a data source.
This class uses BioThings APIs to convert from one key type to another. Base classes are used with the decorator syntax shown below:
@IDLookupMyChemInfo(input_types, output_types)
def load_document(doc_lst):
    for d in doc_lst:
        yield d
Lookup fields are configured in the ‘lookup_fields’ object, examples of which can be found in ‘IDLookupMyGeneInfo’ and ‘IDLookupMyChemInfo’.
- Required Options:
- input_types
‘type’
(‘type’, ‘nested_source_field’)
[(‘type1’, ‘nested.source_field1’), (‘type2’, ‘nested.source_field2’), …]
- output_types:
‘type’
[‘type1’, ‘type2’]
Additional Options: see DataTransform class
Initialize the IDLookupAPI object.
- batch_size = 10¶
- default_source = '_id'¶
- key_lookup_batch(batchiter)[source]¶
Look up all keys for ids given in the batch iterator (1 block).
:param batchiter: 1 block of records to look up keys for
:return:
- lookup_fields = {}¶
- class biothings.hub.datatransform.datatransform_api.DataTransformMyChemInfo(input_types, output_types=None, skip_on_failure=False, skip_w_regex=None)[source]¶
Bases:
DataTransformAPI
Single key lookup for MyChemInfo
Initialize the class by setting up the client object.
- lookup_fields = {'chebi': 'chebi.chebi_id', 'chembl': 'chembl.molecule_chembl_id', 'drugbank': 'drugbank.drugbank_id', 'drugname': ['drugbank.name', 'unii.preferred_term', 'chebi.chebi_name', 'chembl.pref_name'], 'inchi': ['drugbank.inchi', 'chembl.inchi', 'pubchem.inchi'], 'inchikey': ['drugbank.inchi_key', 'chembl.inchi_key', 'pubchem.inchi_key'], 'pubchem': 'pubchem.cid', 'rxnorm': ['unii.rxcui'], 'unii': 'unii.unii'}¶
- output_types = ['inchikey', 'unii', 'rxnorm', 'drugbank', 'chebi', 'chembl', 'pubchem', 'drugname']¶
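Following the decorator syntax described above, a hedged sketch of applying this class inside an uploader (the uploader, its source name and the hard-coded document are hypothetical) could be:
from biothings.hub.dataload.uploader import BaseSourceUploader
from biothings.hub.datatransform.datatransform_api import DataTransformMyChemInfo


class MyDrugUploader(BaseSourceUploader):
    name = "mydrugs"    # hypothetical source name

    # convert documents keyed by DrugBank IDs into InChIKey-based identifiers
    @DataTransformMyChemInfo([("drugbank", "drugbank.drugbank_id")], ["inchikey"])
    def load_data(self, data_path):
        # a real uploader would parse data_path; one hard-coded doc is used for illustration
        yield {"_id": "DB00002", "drugbank": {"drugbank_id": "DB00002"}}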
- class biothings.hub.datatransform.datatransform_api.DataTransformMyGeneInfo(input_types, output_types=None, skip_on_failure=False, skip_w_regex=None)[source]¶
Bases:
DataTransformAPI
deprecated
Initialize the class by setting up the client object.
- lookup_fields = {'ensembl': 'ensembl.gene', 'entrezgene': 'entrezgene', 'symbol': 'symbol', 'uniprot': 'uniprot.Swiss-Prot'}¶
- class biothings.hub.datatransform.datatransform_api.MyChemInfoEdge(lookup, field, weight=1, label=None, url=None)[source]¶
Bases:
BiothingsAPIEdge
The MyChemInfoEdge uses the MyChem.info API to convert identifiers.
- Parameters:
lookup (str) – The field in the API to search with the input identifier.
field (str) – The field in the API to convert to.
weight (int) – Weights are used to prefer one path over another. The path with the lowest weight is preferred. The default weight is 1.
- client_name = 'drug'¶
- class biothings.hub.datatransform.datatransform_api.MyGeneInfoEdge(lookup, field, weight=1, label=None, url=None)[source]¶
Bases:
BiothingsAPIEdge
The MyGeneInfoEdge uses the MyGene.info API to convert identifiers.
- Parameters:
lookup (str) – The field in the API to search with the input identifier.
field (str) – The field in the API to convert to.
weight (int) – Weights are used to prefer one path over another. The path with the lowest weight is preferred. The default weight is 1.
- client_name = 'gene'¶
biothings.hub.datatransform.datatransform_mdb¶
DataTransform MDB module - class for performing key lookup using conversions described in a networkx graph.
- class biothings.hub.datatransform.datatransform_mdb.CIMongoDBEdge(collection_name, lookup, field, weight=1, label=None)[source]¶
Bases:
MongoDBEdge
Case-insensitive MongoDBEdge
- Parameters:
collection_name (str) – The name of the MongoDB collection.
lookup (str) – The field that will match the input identifier in the collection.
field (str) – The output identifier field that will be read out of matching documents.
weight (int) – Weights are used to prefer one path over another. The path with the lowest weight is preferred. The default weight is 1.
- class biothings.hub.datatransform.datatransform_mdb.DataTransformMDB(graph, *args, **kwargs)[source]¶
Bases:
DataTransform
Convert document identifiers from one type to another.
The DataTransformNetworkX module was written as a decorator class which should be applied to the load_data function of a BioThings uploader. The load_data function yields documents, which are then post-processed by the decorator (its __call__ method), and the 'id' key conversion is performed.
- Parameters:
graph – nx.DiGraph (networkx 2.1) configuration graph
input_types – A list of input types of the form (identifier, field) where identifier matches a node and field is an optional dotstring field for where the identifier should be read from (the default is '_id').
output_types (list(str)) – A priority list of identifiers to convert to. These identifiers should match nodes in the graph.
id_priority_list (list(str)) – A priority list of identifiers to sort input and output types by.
skip_on_failure (bool) – If True, documents where identifier conversion fails will be skipped in the final document list.
skip_w_regex (bool) – Do not perform conversion if the identifier matches the regular expression provided to this argument. By default, this option is disabled.
skip_on_success (bool) – If True, documents where identifier conversion succeeds will be skipped in the final document list.
idstruct_class (class) – Override an internal data structure used by the this module (advanced usage)
copy_from_doc (bool) – If true then an identifier is copied from the input source document regardless of whether it matches an edge or not. (advanced usage)
- batch_size = 1000¶
- default_source = '_id'¶
- class biothings.hub.datatransform.datatransform_mdb.MongoDBEdge(collection_name, lookup, field, weight=1, label=None, check_index=True)[source]¶
Bases:
DataTransformEdge
The MongoDBEdge uses data within a MongoDB collection to convert one identifier to another. The input identifier is used to search a collection. The output identifier values are read out of that collection:
- Parameters:
collection_name (str) – The name of the MongoDB collection.
lookup (str) – The field that will match the input identifier in the collection.
field (str) – The output identifier field that will be read out of matching documents.
weight (int) – Weights are used to prefer one path over another. The path with the lowest weight is preferred. The default weight is 1.
- property collection¶
getter for the collection member variable
- collection_find(id_lst, lookup, field)[source]¶
Abstract out (as one line) the call to collection.find
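Putting DataTransformMDB and MongoDBEdge together, here is a hedged sketch of a small conversion graph used as an uploader decorator. The collection and field names are hypothetical, and attaching edges through the object edge attribute follows common BioThings data source code rather than anything stated above:
import networkx as nx

from biothings.hub.dataload.uploader import BaseSourceUploader
from biothings.hub.datatransform.datatransform_mdb import DataTransformMDB, MongoDBEdge

graph = nx.DiGraph()
graph.add_nodes_from(["pubchem", "inchikey"])
# use a (hypothetical) "pubchem" collection to convert a PubChem CID into an InChIKey
graph.add_edge(
    "pubchem", "inchikey",
    object=MongoDBEdge("pubchem", "pubchem.cid", "pubchem.inchi_key"),
)


class MyChemicalUploader(BaseSourceUploader):
    name = "mychemicals"    # hypothetical source name

    @DataTransformMDB(graph,
                      input_types=[("pubchem", "pubchem.cid")],
                      output_types=["inchikey"],
                      skip_on_failure=True)
    def load_data(self, data_path):
        # a real uploader would parse data_path; one hard-coded doc is used for illustration
        yield {"_id": "2244", "pubchem": {"cid": "2244"}}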
biothings.hub.datatransform.datatransform¶
DataTransform Module - IDStruct - DataTransform (superclass)
- class biothings.hub.datatransform.datatransform.DataTransform(input_types, output_types, id_priority_list=None, skip_on_failure=False, skip_w_regex=None, skip_on_success=False, idstruct_class=<class 'biothings.hub.datatransform.datatransform.IDStruct'>, copy_from_doc=False, debug=False)[source]¶
Bases:
object
DataTransform class. This class is the public interface for the DataTransform module. Much of the core logic is in the subclass.
Initialize the keylookup object and precompute paths from the start key to all target keys.
The decorator is intended to be applied to the load_data function of an uploader. The load_data function yields documents, which are then post-processed by the decorator (its __call__ method), and the 'id' key conversion is performed.
- Parameters:
G – nx.DiGraph (networkx 2.1) configuration graph
collections – list of mongodb collection names
input_type – key type to start key lookup from
output_types – list of all output types to convert to
id_priority_list (list(str)) – A priority list of identifiers to sort input and output types by.
id_struct_class – IDStruct class used to manage/fetch IDs from docs
copy_from_doc – if the transform fails using the graph, try to get the value from the document itself when output_type == input_type. No check is performed, it's a straight copy. If checks are needed (e.g. checking that an ID referenced in the doc actually exists in another collection), nodes with self-loops can be used, so ID resolution will be forced to go through these loops to ensure the data exists.
- DEFAULT_WEIGHT = 1¶
- batch_size = 1000¶
- debug = False¶
- default_source = '_id'¶
- property id_priority_list¶
Property method for getting id_priority_list
- key_lookup_batch(batchiter)[source]¶
Core method for looking up all keys in the batch (iterator).
:param batchiter:
:return:
- lookup_one(doc)[source]¶
KeyLookup on document. This method is called as a function call instead of a decorator on a document iterator.
- class biothings.hub.datatransform.datatransform.DataTransformEdge(label=None)[source]¶
Bases:
object
DataTransformEdge. This class contains information needed to transform one key to another.
Initialize the class.
:param label: A label can be used for debugging purposes.
- edge_lookup(keylookup_obj, id_strct, debug=False)[source]¶
Virtual method for edge lookup. Each edge class is responsible for its own lookup procedures given a keylookup_obj and an id_strct.
:param keylookup_obj:
:param id_strct: list of tuples (orig_id, current_id)
:return:
- property logger¶
getter for the logger property
- class biothings.hub.datatransform.datatransform.IDStruct(field=None, doc_lst=None)[source]¶
Bases:
object
IDStruct - id structure for use with the DataTransform classes. The basic idea is to provide a structure that provides a list of (original_id, current_id) pairs.
Initialize the structure.
:param field: field for documents to use as an initial id (optional)
:param doc_lst: list of documents to use when building an initial list (optional)
- property id_lst¶
Build up a list of current ids
- class biothings.hub.datatransform.datatransform.RegExEdge(from_regex, to_regex, weight=1, label=None)[source]¶
Bases:
DataTransformEdge
The RegExEdge allows an identifier to be transformed using a regular expression. POSIX regular expressions are supported.
- Parameters:
from_regex (str) – The first parameter of the regular expression substitution.
to_regex (str) – The second parameter of the regular expression substitution.
weight (int) – Weights are used to prefer one path over another. The path with the lowest weight is preferred. The default weight is 1.
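For example, a hedged sketch of an edge stripping a "CHEBI:" prefix from identifiers (the prefix and node names are illustrative):
from biothings.hub.datatransform.datatransform import RegExEdge

# transform identifiers like "CHEBI:57966" into "57966"
strip_chebi_prefix = RegExEdge(r"^CHEBI:", "")
# such an edge is then attached to a conversion graph, e.g.:
# graph.add_edge("chebi", "chebi_short", object=strip_chebi_prefix)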
biothings.hub.datatransform.histogram¶
DataTransform Histogram class - track keylookup statistics
biothings.hub.standalone¶
This standalone module was originally located in the "biothings/standalone" repo. It's used for Standalone/Autohub instances.
- class biothings.hub.standalone.AutoHubFeature(managers, version_urls, indexer_factory=None, validator_class=None, *args, **kwargs)[source]¶
Bases:
object
version_urls is a list of URLs pointing to versions.json file. The name of the data release is taken from the URL (http://…s3.amazon.com/<the_name>/versions.json) unless specified as a dict: {“name” : “custom_name”, “url” : “http://…”}
If indexer_factory is passed, it'll be used to create the indexer used to dump/check versions currently installed on ES, restore snapshots, index, etc… An indexer_factory is typically used to generate indexers dynamically (ES host, index name, etc…) according to URLs, for instance. See the standalone.hub.DynamicIndexerFactory class for an example. It is typically used when lots of data releases are being managed by the Hub (so there's no need to manually update the STANDALONE_CONFIG parameter).
If indexer_factory is None, a config param named STANDALONE_CONFIG is used, format is the following:
- {
    "_default": {"es_host": "...", "index": "...", "doc_type": "..."},
    "the_name": {"es_host": "...", "index": "...", "doc_type": "..."}
}
When a data release name (from the URL) matches an entry, it's used to configure which ES backend to target; otherwise the default one is used.
If validator_class is passed, it’ll be used to provide validation methods for installing step. If validator_class is None, the AutoHubValidator will be used as fallback.
- DEFAULT_DUMPER_CLASS¶
alias of
BiothingsDumper
- DEFAULT_UPLOADER_CLASS¶
alias of
BiothingsUploader
- DEFAULT_VALIDATOR_CLASS¶
alias of
AutoHubValidator
- configure()[source]¶
Either configure autohub from a static definition (STANDALONE_CONFIG), where different hard-coded index names can be managed on different ES servers, or use an indexer factory, where index names are taken from version_urls but only one ES host is used.
- install(src_name, version='latest', dry=False, force=False, use_no_downtime_method=True)[source]¶
Update hub’s data up to the given version (default is latest available), using full and incremental updates to get up to that given version (if possible).
- list_biothings()[source]¶
Example: [{‘name’: ‘mygene.info’, ‘url’: ‘https://biothings-releases.s3-us-west-2.amazonaws.com/mygene.info/versions.json’}]
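A hedged usage sketch, assuming an AutoHubFeature instance named autohub configured with the mygene.info release URL shown in the example above:
# check what a full/incremental update would do, without actually installing anything
autohub.install("mygene.info", version="latest", dry=True)
# then perform the actual installation
autohub.install("mygene.info", version="latest")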
- class biothings.hub.standalone.AutoHubServer(source_list, features=None, name='BioThings Hub', managers_custom_args=None, api_config=None, reloader_config=None, dataupload_config=None, websocket_config=None, autohub_config=None)[source]¶
Bases:
HubServer
Helper to setup and instantiate common managers usually used in a hub (eg. dumper manager, uploader manager, etc…) “source_list” is either:
a list of string corresponding to paths to datasources modules
a package containing sub-folders with datasources modules
Specific managers can be retrieved by adjusting the "features" parameter, where each feature corresponds to one or more managers. The parameter defaults to all available features. Managers are configured/initialized in the same order as the list, so if a manager (e.g. job_manager) is required by all others, it must be first in the list. "managers_custom_args" is an optional dict used to pass specific arguments while initializing managers; for instance, passing a custom poll schedule for the upload manager
will set the poll schedule to check uploads every 5 min (instead of the default 10s). "reloader_config", "dataupload_config", "autohub_config" and "websocket_config" can be used to customize the reloader, dataupload and websocket features. If None, the default config is used. If explicitly False, the feature is deactivated.
- DEFAULT_FEATURES = ['job', 'autohub', 'terminal', 'config', 'ws']¶
Commands¶
biothings.hub.commands¶
This document will show you all available commands that can be used when you access the Hub shell, and their usages.
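For instance, a short interactive session in the hub shell might chain a few of the commands listed below (the source name and backup folder are hypothetical):
apply("mysource")                  # trigger an upload for the registered resource "mysource"
builds()                           # list existing merged builds and their metadata
backup(folder="/tmp/hub_backup")   # dump the hub's internal database (not source/merge data)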
- am¶
This is an instance of type: <class 'biothings.hub.dataplugin.assistant.AssistantManager'>
- api¶
This is an instance of type: <class 'biothings.hub.api.manager.APIManager'>
- apply(src, *args, **kwargs)¶
method upload_src in module biothings.hub.dataload.uploader
- upload_src(src, *args, **kwargs) method of biothings.hub.dataload.uploader.UploaderManager instance
Trigger upload for registered resource named ‘src’. Other args are passed to uploader’s load() method
- archive(merge_name)¶
method archive_merge in module biothings.hub.databuild.builder
- archive_merge(merge_name) method of biothings.hub.databuild.builder.BuilderManager instance
Delete merged collections and associated metadata
- backend(src, method_name, *args, **kwargs)¶
method call in module biothings.hub.dataload.dumper
- call(src, method_name, *args, **kwargs) method of biothings.hub.dataload.dumper.DumperManager instance
Create a dumper for datasource “src” and call method “method_name” on it, with the given arguments. Used to create arbitrary calls on a dumper. “method_name” within the dumper definition must be a coroutine.
- backup(folder='.', archive=None)¶
function backup in module biothings.utils.hub_db
- backup(folder=’.’, archive=None)
Dump the whole hub_db database in the given folder. “archive” can be passed to specify the target filename; otherwise, it’s randomly generated
Note
this doesn’t backup source/merge data, just the internal data used by the hub
- bm¶
This is an instance of type: <class ‘biothings.hub.databuild.builder.BuilderManager’>
- build(id)¶
function <lambda> in module biothings.hub
<lambda> lambda id
- build_config_info()¶
method build_config_info in module biothings.hub.databuild.builder
build_config_info() method of biothings.hub.databuild.builder.BuilderManager instance
- build_save_mapping(name, mapping=None, dest='build', mode='mapping')¶
method save_mapping in module biothings.hub.databuild.builder
save_mapping(name, mapping=None, dest=’build’, mode=’mapping’) method of biothings.hub.databuild.builder.BuilderManager instance
- builds(id=None, conf_name=None, fields=None, only_archived=False)¶
method build_info in module biothings.hub.databuild.builder
- build_info(id=None, conf_name=None, fields=None, only_archived=False) method of biothings.hub.databuild.builder.BuilderManager instance
Return build information given a build _id, or all builds if _id is None. “fields” can be passed to select which fields to return or not (mongo notation for projections); if None, return everything except:
“mapping” (too long)
- If id is None, more fields are filtered:
“sources” and some of “build_config”
only_archived=True will return archived merges only
- check(src, force=False, skip_manual=False, schedule=False, check_only=False, **kwargs)¶
method dump_src in module biothings.hub.dataload.dumper
dump_src(src, force=False, skip_manual=False, schedule=False, check_only=False, **kwargs) method of biothings.hub.dataload.dumper.DumperManager instance
- command(id, *args, **kwargs)¶
function <lambda> in module biothings.utils.hub
<lambda> lambda id, *args, **kwargs
- commands(id=None, running=None, failed=None)¶
method command_info in module biothings.utils.hub
command_info(id=None, running=None, failed=None) method of traitlets.traitlets.MetaHasTraits instance
- config()¶
method show in module biothings.utils.configuration
show() method of biothings.utils.configuration.ConfigurationWrapper instance
- create_api(api_id, es_host, index, doc_type, port, description=None, **kwargs)¶
method create_api in module biothings.hub.api.manager
create_api(api_id, es_host, index, doc_type, port, description=None, **kwargs) method of biothings.hub.api.manager.APIManager instance
- create_build_conf(name, doc_type, sources, roots=[], builder_class=None, params={}, archived=False)¶
method create_build_configuration in module biothings.hub.databuild.builder
create_build_configuration(name, doc_type, sources, roots=[], builder_class=None, params={}, archived=False) method of biothings.hub.databuild.builder.BuilderManager instance
- create_release_note(old, new, filename=None, note=None, format='txt')¶
method create_release_note in module biothings.hub.datarelease.publisher
- create_release_note(old, new, filename=None, note=None, format=’txt’) method of biothings.hub.datarelease.publisher.ReleaseManager instance
Generate release note files, in TXT and JSON format, containing a summary of significant changes between target collections old and new. Output files are stored in a diff folder using generate_folder(old,new).
‘filename’ can optionally be specified, though it’s not recommended, as the publishing pipeline, which uses these files, expects a file naming convention.
‘note’ is an optional free text that can be added to the release note, at the end.
txt ‘format’ is the only one supported for now.
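For example, from the Hub shell (the two build collection names are hypothetical):
>>> create_release_note("mybuild_20220901", "mybuild_20221001", note="Monthly data refresh")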
- delete_api(api_id)¶
method delete_api in module biothings.hub.api.manager
delete_api(api_id) method of biothings.hub.api.manager.APIManager instance
- delete_build_conf(name)¶
method delete_build_configuration in module biothings.hub.databuild.builder
delete_build_configuration(name) method of biothings.hub.databuild.builder.BuilderManager instance
- diff(diff_type, old, new, batch_size=100000, steps=['content', 'mapping', 'reduce', 'post'], mode=None, exclude=['_timestamp'])¶
method diff in module biothings.hub.databuild.differ
- diff(diff_type, old, new, batch_size=100000, steps=[‘content’, ‘mapping’, ‘reduce’, ‘post’], mode=None, exclude=[‘_timestamp’]) method of biothings.hub.databuild.differ.DifferManager instance
Run a diff to compare old vs. new collections, using the differ algorithm diff_type. Results are stored in a diff folder. Steps can be passed to choose what to do:
- count: will count root keys in the new collection and store them as statistics.
- content: will diff the content between old and new. The format of the results (diff files) depends on diff_type.
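For example (assuming a differ named "jsondiff" is registered; the build collection names are hypothetical):
>>> diff("jsondiff", "mybuild_20220901", "mybuild_20221001")
>>> diff("jsondiff", "mybuild_20220901", "mybuild_20221001", steps=["content"])   # only diff the document content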
- diff_info()¶
method diff_info in module biothings.hub.databuild.differ
diff_info() method of biothings.hub.databuild.differ.DifferManager instance
- dim¶
This is an instance of type: <class ‘biothings.hub.databuild.differ.DifferManager’>
- dm¶
This is an instance of type: <class ‘biothings.hub.dataload.dumper.DumperManager’>
- download(src, force=False, skip_manual=False, schedule=False, check_only=False, **kwargs)¶
method dump_src in module biothings.hub.dataload.dumper
dump_src(src, force=False, skip_manual=False, schedule=False, check_only=False, **kwargs) method of biothings.hub.dataload.dumper.DumperManager instance
- dpm¶
This is an instance of type: <class ‘biothings.hub.dataplugin.manager.DataPluginManager’>
- dump(src, force=False, skip_manual=False, schedule=False, check_only=False, **kwargs)¶
method dump_src in module biothings.hub.dataload.dumper
dump_src(src, force=False, skip_manual=False, schedule=False, check_only=False, **kwargs) method of biothings.hub.dataload.dumper.DumperManager instance
- dump_all(force=False, **kwargs)¶
method dump_all in module biothings.hub.dataload.dumper
- dump_all(force=False, **kwargs) method of biothings.hub.dataload.dumper.DumperManager instance
Run all dumpers, except manual ones
- dump_info()¶
method dump_info in module biothings.hub.dataload.dumper
dump_info() method of biothings.hub.dataload.dumper.DumperManager instance
- dump_plugin(src, force=False, skip_manual=False, schedule=False, check_only=False, **kwargs)¶
method dump_src in module biothings.hub.dataload.dumper
dump_src(src, force=False, skip_manual=False, schedule=False, check_only=False, **kwargs) method of biothings.hub.dataplugin.manager.DataPluginManager instance
- export_command_documents(filepath)¶
method export_command_documents in module biothings.hub
export_command_documents(filepath) method of biothings.hub.HubServer instance
- export_plugin(plugin_name, folder=None, what=['dumper', 'uploader', 'mapping'], purge=False)¶
method export in module biothings.hub.dataplugin.assistant
- export(plugin_name, folder=None, what=[‘dumper’, ‘uploader’, ‘mapping’], purge=False) method of biothings.hub.dataplugin.assistant.AssistantManager instance
Export generated code for a given plugin name, in the given folder (or use DEFAULT_EXPORT_FOLDER if None). Exported information can be:
- dumper: dumper class generated from the manifest
- uploader: uploader class generated from the manifest
- mapping: mapping generated from inspection or from the manifest
If “purge” is true, any existing folder/code will be deleted first; otherwise, an error is raised if some folders/files already exist.
- expose(endpoint_name, command_name, method, **kwargs)¶
method add_api_endpoint in module biothings.hub
- add_api_endpoint(endpoint_name, command_name, method, **kwargs) method of biothings.hub.HubServer instance
Add an API endpoint to expose command named “command_name” using HTTP method “method”. **kwargs are used to specify more arguments for EndpointDefinition
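For example, a hedged sketch exposing the status command documented below over HTTP GET (additional kwargs depend on EndpointDefinition):
>>> expose("status", "status", "GET")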
- g¶
This is an instance of type: <class ‘dict’>
- get_apis()¶
method get_apis in module biothings.hub.api.manager
get_apis() method of biothings.hub.api.manager.APIManager instance
- get_release_note(old, new, format='txt', prefix='release_*')¶
method get_release_note in module biothings.hub.datarelease.publisher
get_release_note(old, new, format=’txt’, prefix=’release_*’) method of biothings.hub.datarelease.publisher.ReleaseManager instance
- help(func=None)¶
method help in module biothings.utils.hub
- help(func=None) method of biothings.utils.hub.HubShell instance
Display help on given function/object or list all available commands
- im¶
This is an instance of type: <class ‘biothings.hub.dataindex.indexer.IndexManager’>
- index(indexer_env, build_name, index_name=None, ids=None, **kwargs)¶
method index in module biothings.hub.dataindex.indexer
- index(indexer_env, build_name, index_name=None, ids=None, **kwargs) method of biothings.hub.dataindex.indexer.IndexManager instance
Trigger an index creation to index the collection build_name and create an index named index_name (or build_name if None). Optional list of IDs can be passed to index specific documents.
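For example (reusing the "local" indexer environment and the hypothetical build/index names from the update_metadata examples below):
>>> index("local", "mynews_201228_vsdevjd")                                      # index name defaults to the build name
>>> index("local", "mynews_201228_vsdevjd", index_name="mynews_201228_current")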
- index_cleanup(env=None, keep=3, dryrun=True, **filters)¶
method cleanup in module biothings.hub.dataindex.indexer
- cleanup(env=None, keep=3, dryrun=True, **filters) method of biothings.hub.dataindex.indexer.IndexManager instance
Delete old indices except for the most recent ones.
- Examples:
>>> index_cleanup()
>>> index_cleanup("production")
>>> index_cleanup("local", build_config="demo")
>>> index_cleanup("local", keep=0)
>>> index_cleanup(_id="<elasticsearch_index>")
- index_config¶
- index_info(remote=False)¶
method index_info in module biothings.hub.dataindex.indexer
- index_info(remote=False) method of biothings.hub.dataindex.indexer.IndexManager instance
Show index manager config with enhanced index information.
- indexes_by_name(index_name=None, limit=10)¶
method get_indexes_by_name in module biothings.hub.dataindex.indexer
- get_indexes_by_name(index_name=None, limit=10) method of biothings.hub.dataindex.indexer.IndexManager instance
Accept an index_name and return a list of indexes retrieved from all Elasticsearch environments.
If index_name is blank, all indexes will be returned. limit can be used to specify how many indexes should be returned.
The list of indexes will look like this:
[
    {
        "index_name": "…",
        "build_version": "…",
        "count": 1000,
        "creation_date": 1653468868933,
        "environment": {
            "name": "env name",
            "host": "localhost:9200",
        }
    },
]
- info(src, method_name, *args, **kwargs)¶
method call in module biothings.hub.dataload.dumper
- call(src, method_name, *args, **kwargs) method of biothings.hub.dataload.dumper.DumperManager instance
Create a dumper for datasource “src” and call method “method_name” on it, with the given arguments. Used to create arbitrary calls on a dumper. “method_name” within the dumper definition must be a coroutine.
- inspect(data_provider, mode='type', batch_size=10000, limit=None, sample=None, **kwargs)¶
method inspect in module biothings.hub.datainspect.inspector
- inspect(data_provider, mode=’type’, batch_size=10000, limit=None, sample=None, **kwargs) method of biothings.hub.datainspect.inspector.InspectorManager instance
Inspect the given data provider:
- a backend definition (see bt.hub.databuild.create_backend for supported formats), eg. “merged_collection” or (“src”, “clinvar”)
- or a callable yielding documents
Mode:
- “type”: will inspect and report the type map found in the data (internal/non-standard format)
- “mapping”: will inspect and return a map compatible with later ElasticSearch mapping generation (see bt.utils.es.generate_es_mapping)
- “stats”: will inspect and report types + distinct counts found in the data, giving a detailed overview of the volumetry of each field and sub-field
- “jsonschema”: same as “type” but the result is formatted following the JSON Schema standard
limit: when set to an integer, will inspect only that many documents.
sample: combined with limit, for each document, if random.random() <= sample (float), the document is inspected. This option allows inspecting only a sample of the data.
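For example, a hedged sketch (the ("src", "clinvar") backend is taken from the description above; the merged collection name is hypothetical, and the combined mode follows the "type,stats" form also used by the CLI inspect command):
>>> inspect(("src", "clinvar"), mode="mapping", limit=1000)
>>> inspect("merged_collection", mode="type,stats", limit=10000, sample=0.1)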
- install(src_name, version='latest', dry=False, force=False, use_no_downtime_method=True)¶
method install in module biothings.hub.standalone
- install(src_name, version=’latest’, dry=False, force=False, use_no_downtime_method=True) method of biothings.hub.standalone.AutoHubFeature instance
Update the hub’s data up to the given version (default is the latest available), using full and incremental updates to reach that version (if possible).
- ism¶
This is an instance of type: <class ‘biothings.hub.datainspect.inspector.InspectorManager’>
- jm¶
This is an instance of type: <class ‘biothings.utils.manager.JobManager’>
- job_info()¶
method job_info in module biothings.utils.manager
job_info() method of biothings.utils.manager.JobManager instance
- jsondiff(src, dst, **kwargs)¶
function make in module biothings.utils.jsondiff
make(src, dst, **kwargs)
- list()¶
method list_biothings in module biothings.hub.standalone
- list_biothings() method of biothings.hub.standalone.AutoHubFeature instance
Example: [{‘name’: ‘mygene.info’, ‘url’: ‘https://biothings-releases.s3-us-west-2.amazonaws.com/mygene.info/versions.json’}]
- loop¶
This is an instance of type: <class ‘asyncio.unix_events._UnixSelectorEventLoop’>
- lsmerge(build_config=None, only_archived=False)¶
method list_merge in module biothings.hub.databuild.builder
list_merge(build_config=None, only_archived=False) method of biothings.hub.databuild.builder.BuilderManager instance
- merge(build_name, sources=None, target_name=None, **kwargs)¶
method merge in module biothings.hub.databuild.builder
- merge(build_name, sources=None, target_name=None, **kwargs) method of biothings.hub.databuild.builder.BuilderManager instance
Trigger a merge for the build named ‘build_name’. An optional list of sources can be passed (a single one or a list). target_name is the target collection name used to store the merged data. If None, each call will generate a unique target_name.
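For example (the build configuration, source and target names are hypothetical):
>>> merge("demo_build")                        # merge all sources defined in the build configuration
>>> merge("demo_build", sources=["clinvar"], target_name="demo_build_20221001")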
- pending¶
This is an instance of type: <class ‘str’>
- pqueue¶
This is an instance of type: <class ‘concurrent.futures.process.ProcessPoolExecutor’>
- publish(publisher_env, snapshot_or_build_name, *args, **kwargs)¶
method publish in module biothings.hub.datarelease.publisher
publish(publisher_env, snapshot_or_build_name, *args, **kwargs) method of biothings.hub.datarelease.publisher.ReleaseManager instance
- publish_diff(publisher_env, build_name, previous_build=None, steps=['pre', 'reset', 'upload', 'meta', 'post'])¶
method publish_diff in module biothings.hub.datarelease.publisher
publish_diff(publisher_env, build_name, previous_build=None, steps=[‘pre’, ‘reset’, ‘upload’, ‘meta’, ‘post’]) method of biothings.hub.datarelease.publisher.ReleaseManager instance
- publish_snapshot(publisher_env, snapshot, build_name=None, previous_build=None, steps=['pre', 'meta', 'post'])¶
method publish_snapshot in module biothings.hub.datarelease.publisher
publish_snapshot(publisher_env, snapshot, build_name=None, previous_build=None, steps=[‘pre’, ‘meta’, ‘post’]) method of biothings.hub.datarelease.publisher.ReleaseManager instance
- quick_index(datasource_name, doc_type, indexer_env, subsource=None, index_name=None, **kwargs)¶
method quick_index in module biothings.hub
- quick_index(datasource_name, doc_type, indexer_env, subsource=None, index_name=None, **kwargs) method of biothings.hub.HubServer instance
Intended for datasource developers to quickly create an index to test their datasources. Automatically creates a temporary build config and build collection, then calls the index method with the temporary build collection’s name.
- register_url(url)¶
method register_url in module biothings.hub.dataplugin.assistant
register_url(url) method of biothings.hub.dataplugin.assistant.AssistantManager instance
- release_info(env=None, remote=False)¶
method release_info in module biothings.hub.datarelease.publisher
release_info(env=None, remote=False) method of biothings.hub.datarelease.publisher.ReleaseManager instance
- report(old_db_col_names, new_db_col_names, report_filename='report.txt', format='txt', detailed=True, max_reported_ids=None, max_randomly_picked=None, mode=None)¶
method diff_report in module biothings.hub.databuild.differ
diff_report(old_db_col_names, new_db_col_names, report_filename=’report.txt’, format=’txt’, detailed=True, max_reported_ids=None, max_randomly_picked=None, mode=None) method of biothings.hub.databuild.differ.DifferManager instance
- reset_backend(src, method_name, *args, **kwargs)¶
method call in module biothings.hub.dataload.dumper
- call(src, method_name, *args, **kwargs) method of biothings.hub.dataload.dumper.DumperManager instance
Create a dumper for datasource “src” and call method “method_name” on it, with the given arguments. Used to create arbitrary calls on a dumper. “method_name” within the dumper definition must be a coroutine.
- reset_synced(old, new)¶
method reset_synced in module biothings.hub.datarelease.publisher
- reset_synced(old, new) method of biothings.hub.datarelease.publisher.ReleaseManager instance
Reset sync flags for diff files produced between the “old” and “new” builds. Once a diff has been applied, diff files are flagged as synced so a subsequent diff won’t be applied twice (for optimization reasons, not to avoid data corruption, since diff files can be safely applied multiple times). If the diff needs to be applied another time, the diff files need to be reset.
- resetconf(name=None)¶
method reset in module biothings.utils.configuration
reset(name=None) method of biothings.utils.configuration.ConfigurationWrapper instance
- restart(force=False, stop=False)¶
method restart in module biothings.utils.hub
restart(force=False, stop=False) method of biothings.utils.hub.HubShell instance
- restore(archive, drop=False)¶
function restore in module biothings.utils.hub_db
- restore(archive, drop=False)
Restore database from given archive. If drop is True, then delete existing collections
- rm¶
This is an instance of type: <class ‘biothings.hub.datarelease.publisher.ReleaseManager’>
- rmmerge(merge_name)¶
method delete_merge in module biothings.hub.databuild.builder
- delete_merge(merge_name) method of biothings.hub.databuild.builder.BuilderManager instance
Delete merged collections and associated metadata
- sch(loop)¶
function get_schedule in module biothings.hub
- get_schedule(loop)
Try to render jobs in a human-readable way.
- schedule(crontab, func, *args, **kwargs)¶
method schedule in module biothings.utils.manager
- schedule(crontab, func, *args, **kwargs) method of biothings.utils.manager.JobManager instance
Helper to create a cron job from a callable “func”. *args and **kwargs are passed to func. “crontab” follows aiocron notation.
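For example, a hedged sketch scheduling the dump command for a hypothetical "clinvar" source every day at 3am:
>>> schedule("0 3 * * *", dump, "clinvar")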
- setconf(name, value)¶
method store_value_to_db in module biothings.utils.configuration
store_value_to_db(name, value) method of biothings.utils.configuration.ConfigurationWrapper instance
- sm¶
This is an instance of type: <class ‘biothings.hub.dataload.source.SourceManager’>
- snapshot(snapshot_env, index, snapshot=None)¶
method snapshot in module biothings.hub.dataindex.snapshooter
- snapshot(snapshot_env, index, snapshot=None) method of biothings.hub.dataindex.snapshooter.SnapshotManager instance
Create a snapshot named “snapshot” (or, by default, with the same name as the index) from “index”, according to the environment definition (repository, etc.) “env”.
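For example (reusing the "s3_outbreak" snapshot environment from the snapshot_cleanup examples below; the index and snapshot names are hypothetical):
>>> snapshot("s3_outbreak", "mynews_201228_vsdevjd")                             # snapshot name defaults to the index name
>>> snapshot("s3_outbreak", "mynews_201228_vsdevjd", snapshot="mynews_201228_manual")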
- snapshot_cleanup(env=None, keep=3, group_by='build_config', dryrun=True, **filters)¶
method cleanup in module biothings.hub.dataindex.snapshooter
- cleanup(env=None, keep=3, group_by=’build_config’, dryrun=True, **filters) method of biothings.hub.dataindex.snapshooter.SnapshotManager instance
Delete past snapshots and keep only the most recent ones.
- Examples:
>>> snapshot_cleanup()
>>> snapshot_cleanup("s3_outbreak")
>>> snapshot_cleanup("s3_outbreak", keep=0)
- snapshot_config¶
- snapshot_info(env=None, remote=False)¶
method snapshot_info in module biothings.hub.dataindex.snapshooter
snapshot_info(env=None, remote=False) method of biothings.hub.dataindex.snapshooter.SnapshotManager instance
- source_info(name, debug=False)¶
method get_source in module biothings.hub.dataload.source
get_source(name, debug=False) method of biothings.hub.dataload.source.SourceManager instance
- source_reset(name, key='upload', subkey=None)¶
method reset in module biothings.hub.dataload.source
- reset(name, key=’upload’, subkey=None) method of biothings.hub.dataload.source.SourceManager instance
Reset, ie. delete, internal data (src_dump document) for the given source name, key and subkey. This method is useful to clean outdated information in the Hub’s internal database.
- Ex: key=upload, name=mysource, subkey=mysubsource will delete the entry in the corresponding
src_dump doc (_id=mysource), under key “upload”, for the sub-source named “mysubsource”.
“key” can be either ‘download’, ‘upload’ or ‘inspect’. Because there’s no such notion of a subkey for dumpers (ie. ‘download’), subkey is optional.
- source_save_mapping(name, mapping=None, dest='master', mode='mapping')¶
method save_mapping in module biothings.hub.dataload.source
save_mapping(name, mapping=None, dest=’master’, mode=’mapping’) method of biothings.hub.dataload.source.SourceManager instance
- sources(id=None, debug=False, detailed=False)¶
method get_sources in module biothings.hub.dataload.source
get_sources(id=None, debug=False, detailed=False) method of biothings.hub.dataload.source.SourceManager instance
- ssm¶
This is an instance of type: <class ‘biothings.hub.dataindex.snapshooter.SnapshotManager’>
- start_api(api_id)¶
method start_api in module biothings.hub.api.manager
start_api(api_id) method of biothings.hub.api.manager.APIManager instance
- status(managers)¶
function status in module biothings.hub
- status(managers)
Return a global hub status (number of sources, documents, etc.) according to the available managers
- stop(force=False)¶
method stop in module biothings.utils.hub
stop(force=False) method of biothings.utils.hub.HubShell instance
- stop_api(api_id)¶
method stop_api in module biothings.hub.api.manager
stop_api(api_id) method of biothings.hub.api.manager.APIManager instance
- sym¶
This is an instance of type: <class ‘biothings.hub.databuild.syncer.SyncerManager’>
- sync(backend_type, old_db_col_names, new_db_col_names, diff_folder=None, batch_size=10000, mode=None, target_backend=None, steps=['mapping', 'content', 'meta', 'post'], debug=False)¶
method sync in module biothings.hub.databuild.syncer
sync(backend_type, old_db_col_names, new_db_col_names, diff_folder=None, batch_size=10000, mode=None, target_backend=None, steps=[‘mapping’, ‘content’, ‘meta’, ‘post’], debug=False) method of biothings.hub.databuild.syncer.SyncerManager instance
- top(action='summary')¶
method top in module biothings.utils.manager
top(action=’summary’) method of biothings.utils.manager.JobManager instance
- tqueue¶
This is an instance of type: <class ‘concurrent.futures.thread.ThreadPoolExecutor’>
- um¶
This is an instance of type: <class ‘biothings.hub.dataload.uploader.UploaderManager’>
- unregister_url(url=None, name=None)¶
method unregister_url in module biothings.hub.dataplugin.assistant
unregister_url(url=None, name=None) method of biothings.hub.dataplugin.assistant.AssistantManager instance
- update_build_conf(name, doc_type, sources, roots=[], builder_class=None, params={}, archived=False)¶
method update_build_configuration in module biothings.hub.databuild.builder
update_build_configuration(name, doc_type, sources, roots=[], builder_class=None, params={}, archived=False) method of biothings.hub.databuild.builder.BuilderManager instance
- update_metadata(indexer_env, index_name, build_name=None, _meta=None)¶
method update_metadata in module biothings.hub.dataindex.indexer
- update_metadata(indexer_env, index_name, build_name=None, _meta=None) method of biothings.hub.dataindex.indexer.IndexManager instance
- Update the _meta field of the index mappings, based on
the _meta value provided, including {}.
the _meta value of the build_name in src_build.
the _meta value of the build with the same name as the index.
- Examples:
update_metadata("local", "mynews_201228_vsdevjd")
update_metadata("local", "mynews_201228_vsdevjd", _meta={})
update_metadata("local", "mynews_201228_vsdevjd", _meta={"author":"b"})
update_metadata("local", "mynews_201228_current", "mynews_201228_vsdevjd")
- update_source_meta(src, dry=False)¶
method update_source_meta in module biothings.hub.dataload.uploader
- update_source_meta(src, dry=False) method of biothings.hub.dataload.uploader.UploaderManager instance
Trigger update for registered resource named ‘src’.
- upgrade(code_base)¶
function upgrade in module biothings.hub
- upgrade(code_base)
Upgrade (git pull) repository for given code base name (“biothings_sdk” or “application”)
- upload(src, *args, **kwargs)¶
method upload_src in module biothings.hub.dataload.uploader
- upload_src(src, *args, **kwargs) method of biothings.hub.dataload.uploader.UploaderManager instance
Trigger upload for registered resource named ‘src’. Other args are passed to uploader’s load() method
- upload_all(raise_on_error=False, **kwargs)¶
method upload_all in module biothings.hub.dataload.uploader
- upload_all(raise_on_error=False, **kwargs) method of biothings.hub.dataload.uploader.UploaderManager instance
Trigger upload processes for all registered resources. **kwargs are passed to upload_src() method
- upload_info()¶
method upload_info in module biothings.hub.dataload.uploader
upload_info() method of biothings.hub.dataload.uploader.UploaderManager instance
- validate_mapping(mapping, env)¶
method validate_mapping in module biothings.hub.dataindex.indexer
validate_mapping(mapping, env) method of biothings.hub.dataindex.indexer.IndexManager instance
- versions(src, method_name, *args, **kwargs)¶
method call in module biothings.hub.dataload.dumper
- call(src, method_name, *args, **kwargs) method of biothings.hub.dataload.dumper.DumperManager instance
Create a dumper for datasource “src” and call method “method_name” on it, with the given arguments. Used to create arbitrary calls on a dumper. “method_name” within the dumper definition must be a coroutine.
- whatsnew(build_name=None, old=None)¶
method whatsnew in module biothings.hub.databuild.builder
- whatsnew(build_name=None, old=None) method of biothings.hub.databuild.builder.BuilderManager instance
Return datasources which have changed since last time (last time is datasource information from metadata, either from given old src_build doc name, or the latest found if old=None)
biothings.cli¶
- biothings.cli.main()[source]¶
The main entry point for running the BioThings CLI to test your local data plugins.
biothings.cli.dataplugin¶
- biothings.cli.dataplugin.clean_data(dump: bool | None = False, upload: bool | None = False, clean_all: bool | None = False)[source]¶
clean command for deleting all dumped files and/or dropping uploaded source tables
- biothings.cli.dataplugin.create_data_plugin(name: str = '', multi_uploaders: bool | None = False, parallelizer: bool | None = False)[source]¶
create command for creating a new data plugin from the template
- biothings.cli.dataplugin.dump_and_upload()[source]¶
dump_and_upload command for downloading source data files locally, then converting them into JSON documents and uploading them to the source database. Two steps in one command.
- biothings.cli.dataplugin.dump_data()[source]¶
dump command for downloading source data files locally
- biothings.cli.dataplugin.inspect_source(sub_source_name: str | None = '', mode: str | None = 'type,stats', limit: int | None = None, output: str | None = None)[source]¶
inspect command for giving detailed information about the structure of documents coming from the parser after the upload step
- biothings.cli.dataplugin.listing(dump: bool | None = False, upload: bool | None = False, hubdb: bool | None = False)[source]¶
list command for listing dumped files and/or uploaded sources
- biothings.cli.dataplugin.serve(host: str | None = 'localhost', port: int | None = 9999)[source]¶
serve command runs a simple API server for serving documents from the source database.
For example, after running ‘dump_and_upload’, we have a source_name = “test” with a document structure like this:
doc = {"_id": "123", "key": {"a": {"b": "1"}, "x": [{"y": "3", "z": "4"}, "5"]}}
An API server will run at http://host:port/<your source name>/, like http://localhost:9999/test/:
You can see all available sources on the index page: http://localhost:9999/
You can list all docs: http://localhost:9999/test/ (default is to return the first 10 docs)
You can paginate doc list: http://localhost:9999/test/?start=10&limit=10
You can retrieve a doc by id: http://localhost:9999/test/123
- You can filter docs with one or multiple fielded terms:
http://localhost:9999/test/?q=key.a.b:1 (query by any field with dot notation like key.a.b=1)
http://localhost:9999/test/?q=key.a.b:1%20AND%20key.x.y=3 (find all docs that match two fields)
http://localhost:9999/test/?q=key.x.z:4* (field value can contain wildcard * or ?)
http://localhost:9999/test/?q=key.x:5&start=10&limit=10 (pagination also works)
biothings.cli.dataplugin_hub¶
- biothings.cli.dataplugin_hub.clean_data(plugin_name: str = '', dump: bool | None = False, upload: bool | None = False, clean_all: bool | None = False)[source]¶
clean command for deleting all dumped files and/or dropping uploaded source tables
- biothings.cli.dataplugin_hub.create_data_plugin(name: str = '', multi_uploaders: bool | None = False, parallelizer: bool | None = False)[source]¶
create command for creating a new data plugin from the template
- biothings.cli.dataplugin_hub.dump_and_upload(plugin_name: str = '')[source]¶
dump_and_upload command for downloading source data files locally, then converting them into JSON documents and uploading them to the source database. Two steps in one command.
- biothings.cli.dataplugin_hub.dump_data(plugin_name: str = '')[source]¶
dump command for downloading source data files locally
- biothings.cli.dataplugin_hub.inspect_source(plugin_name: str = '', sub_source_name: str | None = '', mode: str | None = 'type,stats', limit: int | None = None, output: str | None = None)[source]¶
inspect command for giving detailed information about the structure of documents coming from the parser after the upload step
- biothings.cli.dataplugin_hub.listing(plugin_name: str = '', dump: bool | None = False, upload: bool | None = False, hubdb: bool | None = False)[source]¶
list command for listing dumped files and/or uploaded sources
- biothings.cli.dataplugin_hub.serve(plugin_name: str = '', host: str | None = 'localhost', port: int | None = 9999)[source]¶
serve command runs a simple API server for serving documents from the source database.
For example, after running ‘dump_and_upload’, we have a source_name = “test” with a document structure like this:
doc = {"_id": "123", "key": {"a": {"b": "1"}, "x": [{"y": "3", "z": "4"}, "5"]}}
An API server will run at http://host:port/<your source name>/, like http://localhost:9999/test/:
You can see all available sources on the index page: http://localhost:9999/
You can list all docs: http://localhost:9999/test/ (default is to return the first 10 docs)
You can paginate doc list: http://localhost:9999/test/?start=10&limit=10
You can retrieve a doc by id: http://localhost:9999/test/123
- You can filter docs with one or multiple fielded terms:
http://localhost:9999/test/?q=key.a.b:1 (query by any field with dot notation like key.a.b=1)
http://localhost:9999/test/?q=key.a.b:1%20AND%20key.x.y=3 (find all docs that match two fields)
http://localhost:9999/test/?q=key.x.z:4* (field value can contain wildcard * or ?)
http://localhost:9999/test/?q=key.x:5&start=10&limit=10 (pagination also works)
biothings.cli.utils¶
- biothings.cli.utils.do_clean(plugin_name=None, dump=False, upload=False, clean_all=False, logger=None)[source]¶
Clean the dumped files, uploaded sources, or both.
- biothings.cli.utils.do_clean_dumped_files(data_folder, plugin_name)[source]¶
Remove all dumped files by a data plugin in the data folder.
- biothings.cli.utils.do_clean_uploaded_sources(working_dir, plugin_name)[source]¶
Remove all uploaded sources by a data plugin in the working directory.
- biothings.cli.utils.do_create(name, multi_uploaders=False, parallelizer=False, logger=None)[source]¶
Create a new data plugin from the template
- biothings.cli.utils.do_dump(plugin_name=None, show_dumped=True, logger=None)[source]¶
Perform dump for the given plugin
- biothings.cli.utils.do_dump_and_upload(plugin_name, logger=None)[source]¶
Perform both dump and upload for the given plugin
- biothings.cli.utils.do_inspect(plugin_name=None, sub_source_name=None, mode='type,stats', limit=None, merge=False, output=None, logger=None)[source]¶
Perform inspection on a data plugin.
- biothings.cli.utils.do_list(plugin_name=None, dump=False, upload=False, hubdb=False, logger=None)[source]¶
List the dumped files, uploaded sources, or hubdb content.
- biothings.cli.utils.do_upload(plugin_name=None, show_uploaded=True, logger=None)[source]¶
Perform upload for the given list of uploader_classes
- biothings.cli.utils.get_logger(name=None)[source]¶
Get a logger with the given name. If name is None, return the root logger.
- biothings.cli.utils.get_manifest_content(working_dir)[source]¶
return the manifest content of the data plugin in the working directory
- biothings.cli.utils.get_plugin_name(plugin_name=None, with_working_dir=True)[source]¶
Return a valid plugin name (the name of the folder containing a data plugin). When plugin_name is None, the current working folder is used. When with_working_dir is True, return a (plugin_name, working_dir) tuple.
- biothings.cli.utils.get_uploaded_collections(src_db, uploaders)[source]¶
A helper function to get the uploaded collections in the source database
- biothings.cli.utils.get_uploaders(working_dir: Path)[source]¶
A helper function to get the uploaders from the manifest file in the working directory, used in show_uploaded_sources function below
- biothings.cli.utils.is_valid_data_plugin_dir(data_plugin_dir)[source]¶
Return True/False if the given folder is a valid data plugin folder (contains either manifest.yaml or manifest.json)
- biothings.cli.utils.load_plugin(plugin_name=None, dumper=True, uploader=True, logger=None)[source]¶
Return a plugin object for the given plugin_name. If dumper is True, include a dumper instance in the plugin object. If uploader is True, include uploader_classes in the plugin object.
If <plugin_name> is not valid, raise the proper error and exit.
- biothings.cli.utils.load_plugin_managers(plugin_path, plugin_name=None, data_folder=None)[source]¶
Load a data plugin from <plugin_path>, and return a tuple of (dumper_manager, upload_manager)
- biothings.cli.utils.process_inspect(source_name, mode, limit, merge, logger, do_validate, output=None)[source]¶
Perform inspect for the given source. It’s used in do_inspect function below
- biothings.cli.utils.run_sync_or_async_job(func, *args, **kwargs)[source]¶
When func is defined as either a normal or an async function/method, this helper calls it properly and returns the results. For an async function/method, CLIJobManager is used to run it.
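A minimal sketch of how this helper can be used (the parser functions and file name are hypothetical):
from biothings.cli.utils import run_sync_or_async_job

def parse(path):
    # a regular function: called directly
    return [{"_id": "1", "source": path}]

async def parse_async(path):
    # an async function: executed through CLIJobManager
    return [{"_id": "2", "source": path}]

docs = run_sync_or_async_job(parse, "data.tsv")
docs_async = run_sync_or_async_job(parse_async, "data.tsv")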
- biothings.cli.utils.serve(host, port, plugin_name, table_space)[source]¶
Serve a simple API server to query the data plugin source.
biothings.cli.web_app¶
- class biothings.cli.web_app.Application(db, table_space, **settings)[source]¶
Bases: Application
The main application class, which defines the routes and handlers.
- class biothings.cli.web_app.BaseHandler(application: Application, request: HTTPServerRequest, **kwargs: Any)[source]¶
Bases: RequestHandler
- set_default_headers()[source]¶
Override this to set HTTP headers at the beginning of the request.
For example, this is the place to set a custom Server header. Note that setting such headers in the normal flow of request processing may not do what you want, since headers may be reset during error handling.
- class biothings.cli.web_app.DocHandler(application: Application, request: HTTPServerRequest, **kwargs: Any)[source]¶
Bases: BaseHandler
The handler for the detail view of a document, e.g. /<source>/<doc_id>/
- class biothings.cli.web_app.HomeHandler(application: Application, request: HTTPServerRequest, **kwargs: Any)[source]¶
Bases: BaseHandler
The handler for the landing page, which lists all available routes
- class biothings.cli.web_app.QueryHandler(application: Application, request: HTTPServerRequest, **kwargs: Any)[source]¶
Bases: BaseHandler
The handler for returning a list of docs matching the query terms passed to the “q” parameter, e.g. /<source>/?q=<query>
- async biothings.cli.web_app.get_available_routes(db, table_space)[source]¶
Return a list of available URLs/routes based on the table_space and the actual collections in the database