pymapd¶
The pymapd client interface provides a Python DB API 2.0-compliant interface to OmniSci (formerly MapD). In addition, it provides methods to get results in the Apache Arrow-based cudf GPU DataFrame format for efficient data interchange.
>>> from pymapd import connect
>>> con = connect(user="admin", password="HyperInteractive", host="localhost",
... dbname="omnisci")
>>> df = con.select_ipc_gpu("SELECT depdelay, arrdelay "
...                         "FROM flights_2008_10k "
...                         "LIMIT 100")
>>> df.head()
depdelay arrdelay
0 -2 -13
1 -1 -13
2 -3 1
3 4 -3
4 12 7
5-Minute Quickstart¶
pymapd follows the Python DB API 2.0, so experience with other Python database clients will feel similar to pymapd.
Note
This tutorial assumes you have an OmniSci server running on localhost:6274 with the default logins and databases, and that you have loaded the example flights_2008_10k dataset. This dataset can be loaded with the insert_sample_data script included in the OmniSci install directory.
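A minimal sketch of loading it (the install path is an assumption; adjust to your system):
# from the OmniSci install directory (path is an assumption)
cd /opt/omnisci
sudo ./insert_sample_data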
Installing pymapd¶
pymapd¶
pymapd can be installed with conda using conda-forge, or with pip.
# conda
conda install -c conda-forge pymapd
# pip
pip install pymapd
If you have an NVIDIA GPU in the same machine where your pymapd code will be running, you’ll want to install cudf as well, to return result sets into GPU memory as a cudf GPU DataFrame:
cudf via conda¶
# CUDA 9.2
conda install -c nvidia -c rapidsai -c numba -c conda-forge -c defaults cudf
# CUDA 10.0
conda install -c nvidia/label/cuda10.0 -c rapidsai/label/cuda10.0 -c numba \
-c conda-forge -c defaults cudf
cudf via PyPI/pip¶
# CUDA 9.2
pip install cudf-cuda92
# CUDA 10.0
pip install cudf-cuda100
Connecting¶
Self-Hosted Install¶
For self-hosted OmniSci installs, use protocol='binary' (this is the default) to connect with OmniSci, as this will have better performance than using protocol='http' or protocol='https'.
To create a Connection, use the connect() method along with user, password, host and dbname:
>>> from pymapd import connect
>>> con = connect(user="admin", password="HyperInteractive", host="localhost",
... dbname="omnisci")
>>> con
Connection(mapd://admin:***@localhost:6274/omnisci?protocol=binary)
Alternatively, you can pass a SQLAlchemy-compliant connection string to the connect() method:
>>> uri = "mapd://admin:HyperInteractive@localhost:6274/omnisci?protocol=binary"
>>> con = connect(uri=uri)
Connection(mapd://admin:***@localhost:6274/omnisci?protocol=binary)
Querying¶
A few options are available for getting the results of a query into your Python process.
- Into GPU memory via cudf (Connection.select_ipc_gpu())
- Into CPU shared memory via Apache Arrow and pandas (Connection.select_ipc())
- Into Python objects via Apache Thrift (Connection.execute())
The best option depends on the hardware you have available, your connection to the database, and what you plan to do with the returned data. In general, the third method, using Thrift to serialize and deserialize the data, will be slower than the GPU or CPU shared memory methods. The shared memory methods require that your OmniSci database is running on the same machine.
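For example, the CPU shared-memory path mirrors the GPU example from the quickstart, returning a pandas DataFrame instead (a minimal sketch reusing the flights_2008_10k table):
>>> df = con.select_ipc("SELECT depdelay, arrdelay FROM flights_2008_10k LIMIT 100")
>>> df.head()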
Note
We currently support the Timestamp(0|3|6) data types, i.e. seconds, milliseconds, and microseconds granularity. Support for nanoseconds, Timestamp(9), is in progress.
pandas.read_sql()¶
With a Connection defined, you can use pandas.read_sql() to read your data into a pandas DataFrame. This will be slower than using Connection.select_ipc(), but works regardless of where the Python code is running (i.e. select_ipc() must be on the same machine as the OmniSci install, while pandas.read_sql() works everywhere):
>>> from pymapd import connect
>>> import pandas as pd
>>> con = connect(user="admin", password="HyperInteractive", host="localhost",
... dbname="omnisci")
>>> df = pd.read_sql("SELECT depdelay, arrdelay FROM flights_2008_10k limit 100", con)
Cursors¶
After connecting to OmniSci, a cursor can be created with Connection.cursor():
>>> c = con.cursor()
>>> c
<pymapd.cursor.Cursor at 0x110fe6438>
Or by using a context manager:
>>> with con as c:
... print(c)
<pymapd.cursor.Cursor object at 0x1041f9630>
Arbitrary SQL can be executed using Cursor.execute():
>>> c.execute("SELECT depdelay, arrdelay FROM flights_2008_10k limit 100")
<pymapd.cursor.Cursor at 0x110fe6438>
This will set the rowcount property, with the number of returned rows:
>>> c.rowcount
100
The description attribute contains a list of Description objects, a namedtuple with the usual attributes required by the spec. There’s one entry per returned column, and we fill the name, type_code and null_ok attributes.
>>> c.description
[Description(name='depdelay', type_code=0, display_size=None, internal_size=None, precision=None, scale=None, null_ok=True),
Description(name='arrdelay', type_code=0, display_size=None, internal_size=None, precision=None, scale=None, null_ok=True)]
Cursors are iterable, yielding a tuple of values per row:
>>> result = list(c)
>>> result[:5]
[(38, 28), (0, 8), (-4, 9), (1, -1), (1, 2)]
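The standard DB API fetch methods are also available (a small sketch with the same query):
>>> c.execute("SELECT depdelay, arrdelay FROM flights_2008_10k limit 100")
>>> c.fetchone()        # a single tuple, or None when no rows remain
>>> c.fetchmany(size=10)  # a list of up to 10 tuples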
Loading Data¶
The fastest way to load data is Connection.load_table_arrow(). Internally, this will use pyarrow and the Apache Arrow format to exchange data with the OmniSci database.
>>> import pyarrow as pa
>>> import pandas as pd
>>> df = pd.DataFrame({"A": [1, 2], "B": ['c', 'd']})
>>> table = pa.Table.from_pandas(df)
>>> con.load_table_arrow("table_name", table)
This accepts either a pyarrow.Table or a pandas.DataFrame, which will be converted to a pyarrow.Table before loading.
You can also load a pandas.DataFrame using the Connection.load_table() or Connection.load_table_columnar() methods.
>>> df = pd.DataFrame({"A": [1, 2], "B": ["c", "d"]})
>>> con.load_table_columnar("table_name", df, preserve_index=False)
If you aren’t using Arrow or pandas, you can pass a list of tuples to Connection.load_table_rowwise().
>>> data = [(1, "c"), (2, "d")]
>>> con.load_table_rowwise("table_name", data)
The high-level Connection.load_table() method will choose the fastest method available based on the type of data, as the sketch after this list shows.
- Lists of tuples are always loaded with Connection.load_table_rowwise()
- A pandas.DataFrame or pyarrow.Table will be loaded using Connection.load_table_arrow()
- If upload fails using the Arrow method, a pandas.DataFrame can be loaded using Connection.load_table_columnar()
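For instance, handing a DataFrame to the high-level method lets it pick the Arrow loader when pyarrow is available (a minimal sketch; the table name is illustrative):
>>> df = pd.DataFrame({"A": [1, 2], "B": ["c", "d"]})
>>> con.load_table("table_name", df, method="infer", create="infer")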
Database Metadata¶
Some helpful metadata are available on the Connection object.
Get a list of tables with Connection.get_tables():
>>> con.get_tables()
['flights_2008_10k', 'stocks']
Get column information for a table with Connection.get_table_details():
>>> con.get_table_details('stocks')
[ColumnDetails(name='date_', type='STR', nullable=True, precision=0, scale=0, comp_param=32),
 ColumnDetails(name='trans', type='STR', nullable=True, precision=0, scale=0, comp_param=32),
 ...
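Since each entry is a ColumnDetails namedtuple, you can, for example, build a quick name-to-type mapping (a small sketch; output abbreviated):
>>> {col.name: col.type for col in con.get_table_details('stocks')}
{'date_': 'STR', 'trans': 'STR', ...}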
Runtime User-Defined Functions¶
A Connection instance is callable, so it can be used as a decorator on Python functions to define them as Runtime UDFs:
>>> @con('int32(int32, int32)')
... def totaldelay(dep, arr):
... return dep + arr
...
>>> query = ("SELECT depdelay, arrdelay, totaldelay(depdelay, arrdelay)"
... " FROM flights_2008_10k limit 100")
>>> df = con.select_ipc(query)
>>> df.head()
depdelay arrdelay EXPR$2
0 8 -14 -6
1 19 2 21
2 8 14 22
3 -4 -6 -10
4 34 34 68
Note
Runtime UDFs can be defined only if the OmniSci server has enabled their support (see the --enable-runtime-udf option of omnisci_server) and the rbc package is installed. This is still experimental functionality, and it currently does not work on the Windows operating system.
API Reference¶
class pymapd.Connection(uri=None, user=None, password=None, host=None, port=6274, dbname=None, protocol='binary', sessionid=None, bin_cert_validate=None, bin_ca_certs=None, idpurl=None, idpformusernamefield='username', idpformpasswordfield='password', idpsslverify=True)¶
Connect to your OmniSci database.
close()¶
Disconnect from the database, unless the connection was created with a sessionid.
commit()¶
This is a noop, as OmniSci does not provide transactions.
Implemented to comply with the DBI specification.
create_table(table_name, data, preserve_index=False)¶
Create a table from a pandas.DataFrame.
- Parameters
- table_name: str
- data: DataFrame
- preserve_index: bool, default False
Whether to create a column in the table for the DataFrame index
deallocate_ipc(df, device_id=0)¶
Deallocate a DataFrame using CPU shared memory.
- Parameters
- device_id: int
GPU which contains TDataFrame
deallocate_ipc_gpu(df, device_id=0)¶
Deallocate a DataFrame using GPU memory.
- Parameters
- device_id: int
GPU which contains TDataFrame
duplicate_dashboard(dashboard_id, new_name=None, source_remap=None)¶
Duplicate an existing dashboard, returning the new dashboard id.
- Parameters
- dashboard_id: int
The id of the dashboard to duplicate
- new_name: str
The name for the new dashboard
- source_remap: dict
EXPERIMENTAL A dictionary remapping table names. The old table name(s) should be keys of the dict, with each value being another dict with a ‘name’ key holding the new table value. This structure can be used later to support changing column names.
Examples
>>> source_remap = {'oldtablename1': {'name': 'newtablename1'},
...                 'oldtablename2': {'name': 'newtablename2'}}
>>> newdash = con.duplicate_dashboard(12345, "new dash", source_remap)
execute(operation, parameters=None)¶
Execute a SQL statement.
- Parameters
- operation: str
A SQL statement to execute
- Returns
- c: Cursor
get_dashboard(dashboard_id)¶
Return the dashboard object of a specific dashboard.
Examples
>>> con.get_dashboard(123)
get_dashboards()¶
List all the dashboards in the database.
Examples
>>> con.get_dashboards()
get_table_details(table_name)¶
Get the column names and data types associated with a table.
- Parameters
- table_name: str
- Returns
- details: List[tuples]
Examples
>>> con.get_table_details('stocks')
[ColumnDetails(name='date_', type='STR', nullable=True, precision=0, scale=0, comp_param=32, encoding='DICT'),
 ColumnDetails(name='trans', type='STR', nullable=True, precision=0, scale=0, comp_param=32, encoding='DICT'),
 ...]
get_tables()¶
List all the tables in the database.
Examples
>>> con.get_tables()
['flights_2008_10k', 'stocks']
load_table(table_name, data, method='infer', preserve_index=False, create='infer')¶
Load data into a table.
- Parameters
- table_name: str
- data: pyarrow.Table, pandas.DataFrame, or iterable of tuples
- method: {‘infer’, ‘columnar’, ‘rows’, ‘arrow’}
Method to use for loading the data. Three options are available:
- the pyarrow and Apache Arrow loader
- the columnar loader
- the row-wise loader
The Arrow loader is typically the fastest, followed by the columnar loader, followed by the row-wise loader. If a DataFrame or pyarrow.Table is passed and pyarrow is installed, the Arrow-based loader will be used. If Arrow isn’t available, the columnar loader is used. Finally, if data is an iterable of tuples, the row-wise loader is used.
- preserve_index: bool, default False
Whether to keep the index when loading a pandas DataFrame
- create: {“infer”, True, False}
Whether to issue a CREATE TABLE before inserting the data:
- infer: check to see if the table already exists, and create a table if it does not
- True: attempt to create the table, without checking if it exists
- False: do not attempt to create the table
load_table_arrow(table_name, data, preserve_index=False)¶
Load a pandas.DataFrame or a pyarrow Table or RecordBatch to the database using the Arrow columnar format for interchange.
- Parameters
- table_name: str
- data: pandas.DataFrame, pyarrow.RecordBatch, pyarrow.Table
- preserve_index: bool, default False
Whether to include the index of a pandas DataFrame when writing.
Examples
>>> df = pd.DataFrame({"a": [1, 2, 3], "b": ['d', 'e', 'f']})
>>> con.load_table_arrow('foo', df, preserve_index=False)
load_table_columnar(table_name, data, preserve_index=False, chunk_size_bytes=0, col_names_from_schema=False)¶
Load a pandas DataFrame to the database using OmniSci’s Thrift-based columnar format.
- Parameters
- table_name: str
- data: DataFrame
- preserve_index: bool, default False
Whether to include the index of a pandas DataFrame when writing.
- chunk_size_bytes: integer, default 0
Chunk the loading of columns to prevent large Thrift requests. A value of 0 means do not chunk and send the dataframe as a single request
- col_names_from_schema: bool, default False
Read the existing table schema to determine the column names. This will read the schema of an existing table in OmniSci and match those names to the column names of the dataframe. This is for user convenience when loading from data that is unordered, especially handy when a table has a large number of columns.
Notes
Use pymapd >= 0.11.0 while running with omnisci >= 4.6.0 in order to avoid loading inconsistent values into DATE columns.
Examples
>>> df = pd.DataFrame({"a": [1, 2, 3], "b": ['d', 'e', 'f']})
>>> con.load_table_columnar('foo', df, preserve_index=False)
load_table_rowwise(table_name, data)¶
Load data into a table row-wise.
- Parameters
- table_name: str
- data: Iterable of tuples
Each element of data should be a row to be inserted
Examples
>>> data = [(1, 'a'), (2, 'b'), (3, 'c')]
>>> con.load_table_rowwise('bar', data)
register_runtime_udfs()¶
Register any pending Runtime UDF functions with the OmniSci server.
If no Runtime UDFs have been defined, the call to this method is a noop.
render_vega(vega, compression_level=1)¶
Render Vega data on the database backend, returning the image as a PNG.
- Parameters
- vega: dict
The vega specification to render.
- compression_level: int
The level of compression for the rendered PNG. Ranges from 0 (low compression, faster) to 9 (high compression, slower).
select_ipc(operation, parameters=None, first_n=-1, release_memory=True)¶
Execute a SELECT operation using CPU shared memory.
- Parameters
- operation: str
A SQL select statement
- parameters: dict, optional
Parameters to insert for a parametrized query
- first_n: int, optional
Number of records to return
- release_memory: bool, optional
Call self.deallocate_ipc(df) after the DataFrame is created
- Returns
- df: pandas.DataFrame
Notes
This method requires the Python code to be executed on the same machine where OmniSci is running.
select_ipc_gpu(operation, parameters=None, device_id=0, first_n=-1, release_memory=True)¶
Execute a SELECT operation using GPU memory.
- Parameters
- operation: str
A SQL statement
- parameters: dict, optional
Parameters to insert into a parametrized query
- device_id: int
GPU to return results to
- first_n: int, optional
Number of records to return
- release_memory: bool, optional
Call self.deallocate_ipc_gpu(df) after the DataFrame is created
- Returns
- gdf: cudf.GpuDataFrame
Notes
This method requires cudf and libcudf to be installed. An ImportError is raised if those aren’t available.
This method requires the Python code to be executed on the same machine where OmniSci is running.
class pymapd.Cursor(connection)¶
A database cursor.
property arraysize¶
The number of rows to fetch at a time with fetchmany. Default 1.
close()¶
Close this cursor.
property description¶
Read-only sequence describing columns of the result set. Each column is an instance of Description, describing:
- name
- type_code
- display_size
- internal_size
- precision
- scale
- null_ok
We only use name, type_code, and null_ok; the rest are always None.
execute(operation, parameters=None)¶
Execute a SQL statement.
- Parameters
- operation: str
A SQL query
- parameters: dict
Parameters to substitute into operation.
- Returns
- self: Cursor
Examples
>>> c = conn.cursor()
>>> c.execute("select symbol, qty from stocks")
>>> list(c)
[('RHAT', 100.0), ('IBM', 1000.0), ('MSFT', 1000.0), ('IBM', 500.0)]
Passing in parameters:
>>> c.execute("select symbol, qty from stocks where qty <= :max_qty",
...           parameters={"max_qty": 500})
[('RHAT', 100.0), ('IBM', 500.0)]
executemany(operation, parameters)¶
Execute a SQL statement for many sets of parameters.
- Parameters
- operation: str
- parameters: list of dict
- Returns
- results: list of lists
fetchmany(size=None)¶
Fetch size rows from the results set.
fetchone()¶
Fetch a single row from the results set.
pymapd.connect(uri=None, user=None, password=None, host=None, port=6274, dbname=None, protocol='binary', sessionid=None, bin_cert_validate=None, bin_ca_certs=None, idpurl=None, idpformusernamefield='username', idpformpasswordfield='password', idpsslverify=True)¶
Create a new Connection.
- Parameters
- uri: str
- user: str
- password: str
- host: str
- port: int
- dbname: str
- protocol: {‘binary’, ‘http’, ‘https’}
- sessionid: str
- bin_cert_validate: bool, optional, binary encrypted connection only
Whether to continue if there is any certificate error
- bin_ca_certs: str, optional, binary encrypted connection only
Path to the CA certificate file
- idpurl: str
EXPERIMENTAL Enable SAML authentication by providing the logon page of the SAML Identity Provider.
- idpformusernamefield: str
The HTML form ID for the username, defaults to ‘username’.
- idpformpasswordfield: str
The HTML form ID for the password, defaults to ‘password’.
- idpsslverify: bool
Enable / disable certificate checking, defaults to True.
- Returns
- conn: Connection
Examples
You can either pass a string uri, all the individual components, or an existing sessionid (excluding user, password, and database):
>>> connect('mapd://admin:HyperInteractive@localhost:6274/omnisci?'
...         'protocol=binary')
Connection(mapd://mapd:***@localhost:6274/mapd?protocol=binary)
>>> connect(user='admin', password='HyperInteractive', host='localhost',
...         port=6274, dbname='omnisci')
>>> connect(user='admin', password='HyperInteractive', host='localhost',
...         port=443, idpurl='https://sso.localhost/logon', protocol='https')
>>> connect(sessionid='XihlkjhdasfsadSDoasdllMweieisdpo', host='localhost',
...         port=6273, protocol='http')
Exceptions¶
Define exceptions as specified by the DB API 2.0 spec.
Includes some helper methods for translating thrift exceptions to the ones defined here.
exception pymapd.exceptions.DatabaseError¶
Raised when the database encounters an error.
exception pymapd.exceptions.Error¶
Base class for all pymapd errors.
exception pymapd.exceptions.IntegrityError¶
Raised when the relational integrity of the database is affected.
exception pymapd.exceptions.InterfaceError¶
Raised whenever you use the pymapd interface incorrectly.
exception pymapd.exceptions.InternalError¶
Raised for errors internal to the database, e.g. an invalid cursor.
exception pymapd.exceptions.NotSupportedError¶
Raised when an API not supported by the database is used.
exception pymapd.exceptions.OperationalError¶
Raised for non-programmer related database errors, e.g. an unexpected disconnect.
exception pymapd.exceptions.ProgrammingError¶
Raised for programming errors, e.g. syntax errors, table already exists.
Contributing to pymapd¶
As an open-source company, OmniSci welcomes contributions to all of its open-source repositories, including pymapd. All discussion and development takes place via the pymapd GitHub repository.
It is suggested, but not required, that you create a GitHub issue before contributing a feature or bug fix. This is so that other developers 1) know that you are working on the feature/issue and 2) that internal OmniSci experts can help you navigate any database-specific logic that may not be obvious within pymapd. All patches should be submitted as pull requests, and upon passing the test suite and review by OmniSci, will be merged to master for release as part of the next package release cycle.
Development Environment Setup¶
pymapd is written in plain Python 3 (i.e. no Cython), and as such doesn’t require any specialized development environment outside of installing the dependencies. However, we do suggest creating a new conda development environment with the provided conda environment.yml file to ensure that your changes work without relying on unspecified system-level Python packages.
Two development environment files are provided: one to provide the packages needed to develop on CPU only, and the other to provide GPU development packages. Only one is required, but you may decide to use both in order to run pytest against a CPU or GPU environment.
A pymapd development environment can be set up with the following:
CPU Environment¶
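A sketch of the conda setup, assuming the CPU environment file is named environment.yml and creates an environment called pymapd_dev (matching the prompt in the pytest example below; file and environment names are assumptions):
# create and activate the development environment
conda env create -f environment.yml
conda activate pymapd_dev
# install pymapd itself in editable/development mode
pip install -e .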
Docker Environment Setup¶
OmniSci Core CPU-only¶
Unless you are planning on developing GPU-specific functionality in pymapd, using the CPU image is enough to run the test suite:
docker run \
-d \
--name omnisci \
-p 6274:6274 \
-p 6278:6278 \
--ipc=host \
-v /home/<username>/omnisci-storage:/omnisci-storage \
omnisci/core-os-cpu
With the above code, we:
- create/run an instance of OmniSci Core CPU as a daemon (i.e. running in the background until stopped)
- forward ports 6274 (binary connection) and 6278 (http connection)
- set ipc=host for testing shared memory/IPC functionality
- point to a local directory to store data loaded to OmniSci, which allows our container to be ephemeral
To run the test suite, call pytest from the top-level pymapd folder:
(pymapd_dev) laptop:~/github_work/pymapd$ pytest
pytest will run through the test suite, running the tests against the Docker container. Because we are using CPU-only, the test suite skips the GPU tests, and you can expect to see the following messages at the end of the test suite run:
=============================================== short test summary info ================================================
SKIPPED [4] tests/test_data_no_nulls_gpu.py:15: No GPU available
SKIPPED [1] tests/test_deallocate.py:34: No GPU available
SKIPPED [1] tests/test_deallocate.py:54: deallocate non-functional in recent distros
SKIPPED [1] tests/test_deallocate.py:67: No GPU available
SKIPPED [1] tests/test_deallocate.py:80: deallocate non-functional in recent distros
SKIPPED [1] tests/test_deallocate.py:92: No GPU available
SKIPPED [1] tests/test_deallocate.py:105: deallocate non-functional in recent distros
SKIPPED [2] tests/test_integration.py:207: No GPU available
SKIPPED [1] tests/test_integration.py:238: No GPU available
================================== 69 passed, 13 skipped, 1 warnings in 19.40 seconds ==================================
OmniSci Core GPU-enabled¶
To run the pymapd test suite with the GPU tests, the workflow is pretty much the same as CPU-only, except with the OmniSci Core GPU-enabled container:
docker run \
--runtime=nvidia \
-d \
--name omnisci \
-p 6274:6274 \
-p 6278:6278 \
--ipc=host \
-v /home/<username>/omnisci-storage:/omnisci-storage \
omnisci/core-os-cuda
You also need to install cudf in your development environment. Because cudf is in active development, and requires attention to the specific version of CUDA installed, we recommend checking the cudf documentation to get the most up-to-date installation instructions.
Updating Apache Thrift Bindings¶
When the upstream mapd-core project updates its Apache Thrift definition file, the bindings shipped with pymapd need to be regenerated. Note that the omniscidb repository must be cloned locally.
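A sketch of the regeneration step, assuming the Apache Thrift compiler is installed and the IDL file is omnisci.thrift at the root of the cloned omniscidb repository (file name and location are assumptions):
# regenerate the Python bindings from the Thrift IDL
thrift -gen py /path/to/omniscidb/omnisci.thrift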
Updating the Documentation¶
The documentation for pymapd is generated by ReadTheDocs on each commit. Some pages (such as this one) are manually created; others, such as the API Reference, are generated from the docstrings of each method.
If you are planning on making non-trivial changes to the documentation and want to preview the result before making a commit, you need to install sphinx and sphinx-rtd-theme into your development environment:
pip install sphinx sphinx-rtd-theme
Once you have sphinx installed, to build the documentation, switch to the pymapd/docs directory and run make html. This will update the documentation in the pymapd/docs/build/html directory. From that directory, running python -m http.server will allow you to preview the site on localhost:8000 in the browser. Run make html each time you save a file to see the file changes in the documentation.
Publishing a new package version¶
pymapd doesn’t currently follow a rigid release schedule; rather, we release a new version when enough new functionality has accumulated, or when a sufficiently serious bug/issue is fixed. pymapd is distributed via PyPI and conda-forge.
Prior to submitting to PyPI and/or conda-forge, create a new release tag on GitHub (with notes), then run git pull to bring this tag into your local pymapd repository folder.
PyPI¶
To publish to PyPI, we use the twine package via the CLI. twine only allows for submitting to PyPI by registered users (currently, internal OmniSci employees):
conda install twine
python setup.py sdist
twine upload dist/*
Publishing a package to PyPI is near instantaneous after running twine upload dist/*. Before running twine upload, be sure the dist directory only has the current version of the package you intend to upload.
conda-forge¶
The release process for conda-forge is triggered via creating a new version number on the pymapd GitHub repository. Given the volume of packages released on conda-forge, it can take several hours for the bot to open a PR on pymapd-feedstock. There is nothing that needs to be done to speed this up, just be patient.
When the conda-forge bot opens a PR on the pymapd-feedstock repo, one of the feedstock maintainers needs to validate the correctness of the PR, check the accuracy of the package versions on the meta.yaml recipe file, and then merge once the CI tests pass.
Release Notes¶
The release notes for pymapd are managed on the GitHub repository in the Releases tab. Since pymapd releases try to track new features in the main OmniSci Core project, it’s highly recommended that you check the Releases tab any time you install a new version of pymapd or upgrade OmniSci so that you understand any breaking changes that may have been made during a new pymapd release.
Some notable breaking changes include:
Release | Breaking Change
---|---
 | Added preliminary support for Runtime User-Defined Functions
 | Support for binary TLS Thrift connections
 | Updated Thrift bindings to 4.8
 | Changed context manager to return
 | Updated Thrift to 4.6.1 bindings
 | Dropped Python 3.5 support
 | Modified
 | Removed ability to specify
 | Lower bounds for pandas, numpy, sqlalchemy and pytest increased
 | Default ports changed in connect statement from 9092 to 6274
 | Python 2 support dropped
 | Support for Python 3.4 dropped, support for Python 3.7 added
 | First release supporting cudf (removing option to use pygdf)
 | NumPy, pyarrow and pandas now hard dependencies instead of optional
FAQ and Known Limitations¶
This page contains information that doesn’t fit into other pages or is important enough to be called out separately. If you have a question or tidbit of information that you feel should be included here, please create an issue and/or pull request to get it added to this page.
Note
While we strive to keep this page updated, bugfixes and new features are being added regularly. If information on this page conflicts with your experience, please open an issue or drop by our Community forum to get clarification.
FAQ¶
- Q
Why do select_ipc() and select_ipc_gpu() give me errors, but execute() works fine?
- A
Both select_ipc() and select_ipc_gpu() require running the pymapd code on the same machine where OmniSci is running. This also implies that these two methods will not work on Windows machines, just Linux (CPU and GPU) and OSX (CPU-only).
- Q
Why does geospatial data get uploaded as TEXT ENCODED DICT(32)?
- A
When using load_table with create=True or create='infer', data whose type cannot be easily inferred will default to TEXT ENCODED DICT(32). To solve this issue, create the table definition before loading the data, as sketched below.
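For instance, you might pre-create the table with an explicit geospatial column, then load with table creation disabled (a sketch; table and column names are illustrative):
>>> con.execute("CREATE TABLE points_tbl (id INTEGER, location POINT)")
>>> con.load_table("points_tbl", df, create=False)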
Helpful Hints¶
- Convert your timestamps to UTC
OmniSci stores timestamps as UTC. When loading data to OmniSci, plain Python datetime objects are assumed to be UTC. If the datetime object has localization, only datetime64[ns, UTC] is supported; see the sketch below.
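With pandas, that usually means localizing (or converting) to UTC before loading (a minimal sketch; the table name is illustrative):
>>> import pandas as pd
>>> df = pd.DataFrame({"ts": pd.to_datetime(["2008-01-01 10:00"])})
>>> df["ts"] = df["ts"].dt.tz_localize("UTC")  # dtype becomes datetime64[ns, UTC]
>>> con.load_table("table_name", df)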
- When loading data, hand-create the table schema if performance is critical
While load_table() does provide a create keyword argument to auto-create the table before attempting to load to OmniSci, this functionality is for convenience purposes only. The user is in a much better position to know the exact data types of the input data than the heuristics used by pymapd.
Additionally, pymapd does not attempt to use the smallest possible column width to represent your data. For example, significant reductions in disk storage and a larger amount of ‘hot data’ can be realized if your data fits in a TINYINT column vs. storing it as an INTEGER; see the sketch below.
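For example, a hand-created narrow schema followed by a load with creation disabled (a sketch; the table name and column types are illustrative):
>>> con.execute(
...     "CREATE TABLE flights_small "
...     "(depdelay SMALLINT, arrdelay SMALLINT)")
>>> con.load_table("flights_small", df, create=False)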
Known Limitations¶
- OmniSci BIGINT is 64-bit
Be careful using pymapd on 32-bit systems, as we do not check for integer overflow when returning a query.
- OmniSci DECIMAL types are returned as Python float
OmniSci stores and performs DECIMAL calculations within the database at the column-definition level of precision. However, the results are currently returned back to Python as float. We are evaluating how to change this behavior, so that the exact decimal representation is consistent on the server and in Python.