chat-archive: Easy to use offline chat archive¶
Welcome to the documentation of chat-archive version 4.0.2! The following sections are available:
User documentation¶
The readme is the best place to start reading, it’s targeted at all users and documents the command line interface:
The Python program chat-archive provides a local archive of chat messages that can be viewed and searched on the command line. Supported chat services include Google Talk, Google Hangouts, Slack and Telegram. The program was developed on Linux and currently assumes a UNIX command line environment, although this is not fundamental to the program’s design (for example I could imagine someone building a GUI or web interface using the Python API).
When you add a new account the initial synchronization will download your full conversation history from the chat service in question, which can take quite a while. Later synchronization runs will be much quicker because only updates (new messages and conversations) are downloaded.
Chat messages are downloaded as plain text and when possible also with formatting (encoded as HTML). When viewing chat messages on the terminal the formatted text will be shown.
Python 3.5+ is required due to the asynchronous nature of some of the backends.
Status¶
This is very young software, developed in a couple of sprints in the summer of 2018, so it’s bound to be full of bugs! The fact that it doesn’t have a test suite doesn’t help. However since creating this program I’ve started using it on a daily basis, so I may very well be the first one to run into most if not all bugs 😇.
There’s a lot of implementation details in the code base that I’m not proud of and there’s a ton of features that I would like to add, for example right now the command line is still rather bare bones (minimal). I’ve decided to nevertheless publish what I have right now, because in its current state this project is already very useful for me, so it might be useful to others.
I consider the first release to be representative of the functional goals I had in mind when I set out to build this, but I’d love to find the time to refactor the code base once or twice more 😋. Before publishing the first release I had already gone through three or four complete rewrites and each of those rewrites improved the quality of the code, yet I’m still not fully satisfied… Oh well, at least it seems to work 😉.
Installation¶
The chat-archive package is available on PyPI which means installation should be as simple as:
$ pip3 install chat-archive
Make sure you’re using Python 3.5+ because this is required by dependencies of the chat-archive program.
There’s actually a multitude of ways to install Python packages (e.g. the per user site-packages directory, virtual environments or just installing system wide) and I have no intention of getting into that discussion here, so if this intimidates you then read up on your options before returning to these instructions 😉.
Usage¶
The command line interface is documented below. For more details about the Python API please refer to the API documentation available on Read the Docs.
Command line¶
Usage: chat-archive [OPTIONS] [COMMAND]
Easy to use offline chat archive that can gather chat message history from Google Talk, Google Hangouts, Slack and Telegram.
Supported commands:
- The ‘sync’ command downloads new chat messages from supported chat services and stores them in the local archive (an SQLite database).
- The ‘search’ command searches the chat messages in the local archive for the given keyword(s) and lists matching messages.
- The ‘list’ command lists all messages in the local archive.
- The ‘stats’ command shows statistics about the local archive.
- The ‘unknown’ command searches for conversations that contain messages from an unknown sender and allows you to enter the name of a new contact to associate with all of the messages from that unknown sender. Conversations involving multiple unknown senders are not supported.
Supported options:
Option | Description |
---|---|
-C, --context=COUNT | Print COUNT messages of output context during ‘chat-archive search’. This works similarly to ‘grep -C’. The default value of COUNT is 3. |
-f, --force | Retry synchronization of conversations where errors were previously encountered. This option is currently only relevant to the Google Hangouts backend, because I kept getting server errors when synchronizing a few specific conversations and I didn’t want to keep seeing each of those errors during every synchronization run :-). |
-c, --color=CHOICE, --colour=CHOICE | Specify whether ANSI escape sequences for text and background colors and text styles are to be used or not, depending on the value of |
-l, --log-file=LOGFILE | Save logs at DEBUG verbosity to the filename given by LOGFILE. This option was added to make it easy to capture the log output of an initial synchronization that will be downloading thousands of messages. |
-p, --profile=FILENAME | Enable profiling of the chat-archive application to make it possible to analyze performance problems. Python profiling data will be saved to FILENAME every time database changes are committed (making it possible to inspect the profile while the program is still running). |
-v, --verbose | Increase logging verbosity (can be repeated). |
-q, --quiet | Decrease logging verbosity (can be repeated). |
-h, --help | Show this message and exit. |
The ‘sync’ command¶
The command chat-archive sync downloads new chat messages using the configured backends and stores the messages in the local SQLite database. Positional arguments can be used to synchronize specific backends or accounts. For example I have two Telegram accounts, a personal account and a work account. The following command will synchronize both of these accounts:
$ chat-archive sync telegram
When I’m only interested in a specific account I can instead do this:
$ chat-archive sync telegram:personal
You can make this as complex as you want:
$ chat-archive sync hangouts slack:work telegram:personal
The command above will synchronize all configured Google Hangouts accounts, the Slack work account and the Telegram personal account. The following table shows the backend names you can use like this:
Backend name | Chat service |
---|---|
gtalk | Google Talk |
hangouts | Google Hangouts |
slack | Slack |
telegram | Telegram |
The ‘search’ command¶
The command chat-archive search performs a keyword search through the chat messages in the local SQLite database and renders the search results on the terminal. Keywords are provided as positional arguments to the search command and trigger a case insensitive AND search through the following message metadata:
- The name of the backend (see the table above).
- The name of the account (default or a user defined name).
- The name of the conversation (relevant for group conversations).
- The full name of the contact that sent the message.
- The email address of the contact that sent the message.
- The timestamp of the message. Any prefix of the date format YYYY-MM-DD HH:MM:SS should work, judging by the date/time searches that I’ve tried so far. So for example the keyword 2018 will match all messages from that year, 2018-08 will match all messages in a specific month, etc.
- The text of the message. The plain text chat message as well as the HTML formatted chat message (when available) are searched, this enables searching for semantically meaningful HTML data like hyperlink targets.
The search results reported on the terminal include surrounding chat messages from the matching conversations, to provide additional context. You can control how many surrounding chat messages are rendered using the -C, --context command line option, the value 0 can be used to omit the context.
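The matching rules described in this section can be sketched in a few lines of Python. This is only an illustration of one plausible reading of "case insensitive AND search" (every keyword has to occur in at least one metadata field), not the query that chat-archive actually runs against its SQLite database. The `matches` helper and the sample dictionaries are hypothetical.

```python
def matches(message, keywords):
    """Return True when every keyword occurs in at least one metadata field.

    Assumption: a keyword matches when it is a case insensitive substring
    of any field (backend, account, sender, timestamp, text, ...).
    """
    fields = [str(value).lower() for value in message.values() if value]
    return all(
        any(keyword.lower() in field for field in fields)
        for keyword in keywords
    )


# Hypothetical sample data, shaped after the metadata listed above.
messages = [
    {"backend": "telegram", "account": "personal", "sender": "Alice Example",
     "timestamp": "2018-08-14 09:30:00", "text": "Lunch today?"},
    {"backend": "slack", "account": "work", "sender": "Bob Example",
     "timestamp": "2017-03-02 16:05:12", "text": "Deploy is done"},
]

# The keyword "2018-08" matches as a prefix of the formatted timestamp.
hits = [m for m in messages if matches(m, ["2018-08", "alice"])]
```

Because the timestamp is compared as formatted text, a date prefix such as 2018-08 behaves like any other keyword in this sketch.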
The ‘list’ command¶
The command chat-archive list renders a listing of all chat messages in the database on the terminal.
Due to the gathering of context the chat-archive search command can be rather slow and this is why I added the chat-archive list command early in the development of the project (it’s faster because it doesn’t have to gather context). Since then I’ve collected 226,941 chat messages, completely negating the usefulness of the chat-archive list command 😇.
In any case this can be considered a very simple form of export functionality, so I’ve decided to keep the chat-archive list command for now, despite its limited usefulness once one actively starts using the chat-archive program.
The ‘stats’ command¶
The command chat-archive stats reports some statistics about the contents of the local SQLite database. Here’s what that looks like for me at the time of writing:
Statistics about ~/.local/share/chat-archive/database.sqlite3:
- Number of contacts: 284
- Number of conversations: 5803
- Number of messages: 226941
- Database file size: 90.81 MB
- Size of 226941 plain text chat messages: 18.7 MB
- Size of 13409 HTML formatted chat messages: 4.25 MB
The ‘unknown’ command¶
The first time I synchronized the thousands of chat messages in my Google Hangouts account I was very disappointed to find out that all metadata about contacts whose accounts had since been deleted was lost (no names, no email addresses, nothing).
This is why I added the chat-archive unknown command. It searches the local database for private conversations that contain messages from an unknown sender and prompts you to enter a name for the contact. When you enter a (nonempty) name a new contact is created and the messages in the conversation which have no sender are associated with the new contact.
Weirdly enough the Google Mail archive of chat messages was able to show me names for most of the contacts for which the Google Hangouts API no longer reported any useful information, this is how I was able to (manually) reconstruct this bit of history.
If the Google Mail archive had not provided me with this information I still would have been able to reconstruct the senders of 90% of these conversations simply by the fact that quite a few conversations start with “Hi $name” and I still have “client side chat archive backups” (Pidgin) from 2011-2015.
Configuration files¶
If you’re going to be synchronizing your chat message history frequently you can define credentials for the chat services that you are interested in using a configuration file.
Configuration files are text files in the subset of ini syntax supported by Python’s configparser module. They can be located in the following places:
Directory | Main configuration file | Modular configuration files |
---|---|---|
/etc | /etc/chat-archive.ini | /etc/chat-archive.d/*.ini |
~ | ~/.chat-archive.ini | ~/.chat-archive.d/*.ini |
~/.config | ~/.config/chat-archive.ini | ~/.config/chat-archive.d/*.ini |
The available configuration files are loaded in the order given above, so that user specific configuration files override system wide configuration files.
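The loading order can be illustrated with the standard library alone. The sketch below is not how chat-archive itself loads its configuration (the API documentation mentions a ConfigLoader object for that); the helper names `candidate_files` and `load_config` are hypothetical, and reading each main file before its modular `*.d/*.ini` files is an assumption based on the column order of the table.

```python
import configparser
import glob
import os


def candidate_files(root="/", home="~"):
    """List configuration files in the order of the table above."""
    home = os.path.expanduser(home)
    bases = [
        os.path.join(root, "etc", "chat-archive"),
        os.path.join(home, ".chat-archive"),
        os.path.join(home, ".config", "chat-archive"),
    ]
    filenames = []
    for base in bases:
        # Assumption: the main file is read before the modular files.
        filenames.append(base + ".ini")
        filenames.extend(sorted(glob.glob(base + ".d/*.ini")))
    return filenames


def load_config(filenames):
    """Parse the given files; values in later files override earlier ones."""
    # RawConfigParser avoids '%' interpolation, which could mangle secrets.
    parser = configparser.RawConfigParser()
    parser.read(filenames)  # files that don't exist are silently skipped
    return parser
```

With this sketch `load_config(candidate_files())` returns a parser in which a value from ~/.config/chat-archive.ini wins over the same option in /etc/chat-archive.ini.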
The special configuration file section chat-archive defines general options. Right now only the operator-name option is supported here. All other sections are specific to a chat account and encode the name of the backend and the name of the account in the name of the section by delimiting the two values with a colon. Here’s an example based on my configuration, that shows the supported options:
[chat-archive]
operator-name = ...
[hangouts:work]
email-address = ...
password = ...
# Alternatively:
password-name = ...
[slack:work]
api-token = ...
# Alternatively:
api-token-name = ...
[gtalk:work]
email = ...
password = ...
# Alternatively:
password-name = ...
[telegram:personal]
api-hash = ...
api-id = ...
phone-number = ...
[telegram:work]
api-hash = ...
api-id = ...
phone-number = ...
# Alternatively:
api-hash-name = ...
api-id-name = ...
When an account is configured but the configuration doesn’t define a required secret then you will be prompted to provide that secret every time you run the chat-archive sync command.
The values of the api-token-name, password-name, api-hash-name and api-id-name options identify secrets in ~/.password-store to use. This provides an alternative somewhere in between the following two extremes:
- Always typing your secrets interactively (because you don’t want them to be stored in the chat-archive configuration file, which is understandable from a security perspective).
- Storing your secrets directly in the chat-archive configuration files (so you don’t have to type secrets interactively) thereby exposing them to all software running on your computer.
Because pass can use gpg-agent you only have to type a single master password to unlock the secrets required to synchronize any number of chat accounts.
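For readers who haven’t used pass before, here is a rough sketch of what looking up a named secret amounts to. It shells out to `pass show NAME` and takes the first line of the output, which is where pass stores the password by convention. How chat-archive itself talks to the password store is not documented here, so the `read_secret` helper and the entry name used below are hypothetical.

```python
import subprocess


def read_secret(name, run=subprocess.run):
    """Return the first line of the password store entry called `name`.

    The `run` parameter exists only so the helper can be exercised
    without a real password store (an assumption of this sketch).
    """
    result = run(
        ["pass", "show", name],
        stdout=subprocess.PIPE,
        check=True,               # raise when pass exits with an error
        universal_newlines=True,  # decode the output to text
    )
    lines = result.stdout.splitlines()
    return lines[0] if lines else ""
```

Something like `read_secret("chat-archive/slack-work")` would then trigger gpg-agent to ask for the master password once, after which further lookups don’t prompt again while the agent has the key cached.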
The local database¶
The chat-archive program uses an SQLite database to store the chat messages that it collects. Because the whole point of the program is to safeguard the long term archival of chat messages, SQLAlchemy and Alembic are used to support database schema migrations. This is intended to ensure a reliable upgrade path for future enhancements without data loss.
There’s one significant exception I can think of: The current version of the chat-archive program doesn’t synchronize images and other multimedia files, only text messages are stored in the local database. If support for images is added in a later release (I’m not committing to this, but I am considering it) and collecting these is important to you then you may have to rebuild your database if and when this support is added.
You can change the location of the SQLite database and other datafiles by setting the environment variable $CHAT_ARCHIVE_DIRECTORY. Making a backup of your chat archive is as simple as saving a copy of the database file ~/.local/share/chat-archive/database.sqlite3 to another storage medium.
Please keep in mind that this database has the potential to contain a lot of sensitive data, so I strongly advise you to use disk encryption.
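A plain file copy is enough when the program isn’t running. As a slightly more careful alternative, the sketch below uses SQLite’s online backup API from Python’s sqlite3 module, which also produces a consistent copy while a synchronization is writing to the database. The helpers `database_path` and `backup_database` are hypothetical, and the sketch assumes that the database keeps the file name database.sqlite3 inside $CHAT_ARCHIVE_DIRECTORY.

```python
import os
import sqlite3


def database_path():
    """Locate the database, honoring $CHAT_ARCHIVE_DIRECTORY when set."""
    directory = os.environ.get(
        "CHAT_ARCHIVE_DIRECTORY",
        os.path.expanduser("~/.local/share/chat-archive"),
    )
    # Assumption: the file name doesn't change with the directory.
    return os.path.join(directory, "database.sqlite3")


def backup_database(destination, source=None):
    """Copy the SQLite database to `destination` (requires Python 3.7+)."""
    src = sqlite3.connect(source or database_path())
    dst = sqlite3.connect(destination)
    try:
        src.backup(dst)  # SQLite's online backup API
    finally:
        dst.close()
        src.close()
```

The advice about disk encryption applies to the backup as well: the copy contains the same sensitive messages as the original.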
Supported chat services¶
The following backends are currently available:
Chat service | Description |
---|---|
Google Talk | At one time this was the primary chat service of Google. It was based on (or at least cooperated well with) XMPP. My personal chat archive of Google Talk messages ends on 2013-12-12. |
Google Hangouts | The successor to Google Talk. Interestingly enough my personal chat archive of Google Hangouts messages starts on 2013-10-30 (what’s interesting to me is the overlap with the date above). |
Slack | Love it or hate it, when all of your colleagues are using it you can’t really get around it. Actually now that I write it down like that I can’t help but think of WhatsApp (where the “peer pressure” comes from family instead of colleagues). |
Telegram | A popular alternative to WhatsApp from Russia, without the Facebook baggage 😇 (which is not to say that the company behind Telegram can’t be just as evil). |
In the future more backends may be added:
- I’ve been contemplating scraping “WhatsApp Web” using something like Selenium. It would get ugly and nasty, the resulting backend would be fragile at best, but having those messages available might just be worth it…
- I’m considering writing a chat log parser for the HTML chat logs that Pidgin generated ten years ago (circa 2008) because I have megabytes of such chat logs stored in backups 🙂.
History¶
The fragmented nature of digital communication, where messages come to you via numerous channels (including multiple chat services), has bothered me for years now. Finding things back can actually become a challenge 😇. Tangentially related is the realization that these chat services come and go, taking with them years of chat history, lost forever. I’m looking at you Google 😉.
Given that I am a programmer by trade and heart, it’s been itching for several years now to try and solve both of these problems at the same time by creating a computer program that downloads and stores the chat message history of multiple chat services into a single local database, available for searching and trivially easy to back up.
For what it’s worth I didn’t start out with the goal of “full fidelity” chat history backup including images and other multimedia, although I may eventually decide to implement it anyway. What I initially set out to build was a local, searchable database of textual chat messages collected from multiple chat services, with an easy way to add support for new chat services.
Contact¶
The latest version of chat-archive is available on PyPI and GitHub. The documentation is hosted on Read the Docs and includes a changelog. For bug reports please create an issue on GitHub. If you have questions, suggestions, etc. feel free to send me an e-mail at peter@peterodding.com.
License¶
This software is licensed under the MIT license.
© 2018 Peter Odding.
Here’s a quick overview of the licenses of the dependencies:
Dependency | License |
---|---|
Alembic | MIT license |
emoji | BSD license |
hangups | MIT license |
Slacker | Apache Software License |
SQLAlchemy | MIT license |
Telethon | MIT license |
Shortly before publishing this project I got worried that I had included a GPL dependency which (if I understand correctly) would require me to publish under GPL as well, even though I’ve been consistently publishing my open source projects under the MIT license since 2010.
After assembling the table above I can confidently say that this is not the case 😇. The dependencies that are not listed in the table above are projects of mine, all of them published under the same MIT license as the chat-archive program (assuming I keep this up-to-date as new dependencies are added).
API documentation¶
The following API documentation is automatically generated from the source code:
This documentation is based on the source code of version 4.0.2 of the chat-archive package. The following modules are available:
chat_archive
chat_archive.backends
chat_archive.backends.gtalk
chat_archive.backends.hangouts
chat_archive.backends.slack
chat_archive.backends.telegram
chat_archive.cli
chat_archive.database
chat_archive.emoji
chat_archive.html
chat_archive.html.keywords
chat_archive.html.redirects
chat_archive.models
chat_archive.profiling
chat_archive.utils
chat_archive¶
Python API for the chat-archive program.
chat_archive.DEFAULT_ACCOUNT_NAME = 'default'¶
The name of the default account (a string).
class chat_archive.ChatArchive(*args, **kw)¶
Python API for the chat-archive program.
You can set the values of the data_directory, database_file and force properties by passing keyword arguments to the class initializer.
Here’s an overview of the ChatArchive class:
alembic_directory¶
The pathname of the directory containing Alembic migration scripts (a string).
The value of this property is computed at runtime based on the value of __file__ inside of the chat_archive/__init__.py module.
backends¶
A dictionary of available backends (names and dotted paths).
>>> from chat_archive import ChatArchive
>>> archive = ChatArchive()
>>> print(archive.backends)
{'gtalk': 'chat_archive.backends.gtalk', 'hangouts': 'chat_archive.backends.hangouts', 'slack': 'chat_archive.backends.slack', 'telegram': 'chat_archive.backends.telegram'}
Note
The backends property is a lazy_property. This property’s value is computed once (the first time it is accessed) and the result is cached.
config¶
A dictionary with general user defined configuration options.
Note
The config property is a lazy_property. This property’s value is computed once (the first time it is accessed) and the result is cached.
config_loader¶
A ConfigLoader object that provides access to the configuration.
Configuration files are text files in the subset of ini syntax supported by Python’s configparser module. They can be located in the following places:
Directory | Main configuration file | Modular configuration files |
---|---|---|
/etc | /etc/chat-archive.ini | /etc/chat-archive.d/*.ini |
~ | ~/.chat-archive.ini | ~/.chat-archive.d/*.ini |
~/.config | ~/.config/chat-archive.ini | ~/.config/chat-archive.d/*.ini |
The available configuration files are loaded in the order given above, so that user specific configuration files override system wide configuration files.
Note
The config_loader property is a lazy_property. This property’s value is computed once (the first time it is accessed) and the result is cached.
declarative_base¶
The base class for declarative models defined using SQLAlchemy.
data_directory¶
The pathname of the directory where data files are stored (a string).
The environment variable $CHAT_ARCHIVE_DIRECTORY can be used to set the value of this property. When the environment variable isn’t set the default value ~/.local/share/chat-archive is used (where ~ is expanded to the profile directory of the current user).
Note
The data_directory property is a custom_property. You can change the value of this property using normal attribute assignment syntax. This property’s value is computed once (the first time it is accessed) and the result is cached. To clear the cached value you can use del or delattr().
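The "computed once, cached, cleared with del" behavior that these notes describe can be demonstrated with the standard library. The sketch below uses functools.cached_property as a stand-in for the property types mentioned above, so it illustrates the semantics rather than the actual implementation. The `Archive` class is hypothetical.

```python
import functools
import os


class Archive:
    """Hypothetical stand-in for a class with a cached, overridable property."""

    @functools.cached_property  # requires Python 3.8+
    def data_directory(self):
        # Computed on first access only; later reads return the cached value.
        return os.environ.get(
            "CHAT_ARCHIVE_DIRECTORY",
            os.path.expanduser("~/.local/share/chat-archive"),
        )


archive = Archive()
first = archive.data_directory       # computes and caches the value
archive.data_directory = "/srv/chat" # normal attribute assignment overrides it
del archive.data_directory           # clears the cached or assigned value
recomputed = archive.data_directory  # computed again on the next access
```

One practical consequence is that changing $CHAT_ARCHIVE_DIRECTORY after the first access has no effect until the cached value is cleared.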
database_file¶
The absolute pathname of the SQLite database file (a string).
This defaults to ~/.local/share/chat-archive/database.sqlite3 (with ~ expanded to the home directory of the current user) based on data_directory.
Note
The database_file property is a mutable_property. You can change the value of this property using normal attribute assignment syntax. To reset it to its default (computed) value you can use del or delattr().
force¶
Retry synchronization of conversations where errors were previously encountered (a boolean, defaults to False).
Note
The force property is a mutable_property. You can change the value of this property using normal attribute assignment syntax. To reset it to its default (computed) value you can use del or delattr().
import_stats¶
Statistics about objects imported by backends (a BackendStats object).
Note
The import_stats property is a lazy_property. This property’s value is computed once (the first time it is accessed) and the result is cached.
num_contacts¶
The total number of chat contacts in the local archive (a number).
num_conversations¶
The total number of chat conversations in the local archive (a number).
num_html_messages¶
The total number of chat messages with HTML formatting in the local archive (a number).
num_messages¶
The total number of chat messages in the local archive (a number).
operator_name¶
The full name of the person using the chat-archive program (a string or None).
The value of operator_name is used to address the operator of the chat-archive program in first person instead of third person. You can change the value in the configuration file:
[chat-archive]
operator-name = ...
The default value in case none has been specified in the configuration file is taken from /etc/passwd using get_full_name().
Note
The operator_name property is a lazy_property. This property’s value is computed once (the first time it is accessed) and the result is cached.
get_accounts_for_backend(backend_name)¶
Select the configured and/or previously synchronized account names for the given backend.
get_accounts_from_database(backend_name)¶
Get the names of the accounts that are already in the database for the given backend.
get_accounts_from_config(backend_name)¶
Get the names of the accounts configured for the given backend in the configuration file.
initialize_backend(backend_name, account_name)¶
Load a chat archive backend module.
Parameters:
- backend_name – The name of the backend (one of the strings ‘gtalk’, ‘hangouts’, ‘slack’ or ‘telegram’).
- account_name – The name of the account (a string).
Returns: A ChatArchiveBackend object.
Raises: Exception when the backend doesn’t define a subclass of ChatArchiveBackend.
is_operator(contact)¶
Check whether the full name of the given contact matches operator_name.
load_backend_module(backend_name)¶
Load a chat archive backend module.
Parameters: backend_name – The name of the backend (one of the strings ‘gtalk’, ‘hangouts’, ‘slack’ or ‘telegram’).
Returns: The loaded module.
parse_account_expression(value)¶
Parse a backend:account expression.
Parameters: value – The backend:account expression (a string).
Returns: A tuple with two values:
- The name of a backend (a string).
- The name of an account (a string, possibly empty).
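The parsing that this entry describes can be approximated as follows. This is a standalone re-implementation for illustration, written from the description above rather than copied from the chat_archive source, so details such as whitespace handling are assumptions.

```python
def parse_account_expression(value):
    """Split a 'backend:account' expression into its two parts.

    When there is no colon the account name is an empty string,
    matching the "possibly empty" remark in the documentation.
    """
    backend_name, _, account_name = value.partition(":")
    return backend_name, account_name


both = parse_account_expression("telegram:personal")
backend_only = parse_account_expression("hangouts")
```

An empty account name corresponds to the command line form that names only a backend, as in chat-archive sync hangouts.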
search_messages(keywords)¶
Search the chat messages in the local archive for the given keyword(s).
synchronize(*backends)¶
Download new chat messages.
Parameters: backends – Any positional arguments limit the synchronization to backends whose name matches one of the strings provided as positional arguments.
If the name of a backend contains a colon the name is split into two:
- The backend name.
- An account name.
This way one backend can synchronize multiple named accounts into the same local database without causing confusion during synchronization about which conversations, contacts and messages belong to which account.
class chat_archive.BackendStats¶
Statistics about chat message synchronization backends.
__init__()¶
Initialize a BackendStats object.
scope¶
The current scope (a collections.defaultdict object).
chat_archive.backends¶
Namespace for chat archive backends.
The following chat archive backends have been implemented so far:
- Google Hangouts: chat_archive.backends.hangouts
- Google Talk: chat_archive.backends.gtalk
- Slack: chat_archive.backends.slack
- Telegram: chat_archive.backends.telegram
class chat_archive.backends.ChatArchiveBackend(**kw)¶
Abstract base class for chat-archive backends.
When you initialize a ChatArchiveBackend object you are required to provide values for the account_name, archive, backend_name and stats properties. You can set the values of the account_name, archive, backend_name and stats properties by passing keyword arguments to the class initializer.
Here’s an overview of the ChatArchiveBackend class:
account¶
The Account object corresponding to account_name and backend_name.
Note
The account property is a lazy_property. This property’s value is computed once (the first time it is accessed) and the result is cached.
account_name¶
The name of the chat account that is being synchronized (a string).
The value of account_name needs to be set by the caller and is used to “get or create” the account object on demand.
Note
The account_name property is a required_property. You are required to provide a value for this property by calling the constructor of the class that defines the property with a keyword argument named account_name (unless a custom constructor is defined, in this case please refer to the documentation of that constructor). You can change the value of this property using normal attribute assignment syntax.
archive¶
The ChatArchive that is using this backend.
Note
The archive property is a required_property. You are required to provide a value for this property by calling the constructor of the class that defines the property with a keyword argument named archive (unless a custom constructor is defined, in this case please refer to the documentation of that constructor). You can change the value of this property using normal attribute assignment syntax.
backend_name¶
The name of the chat archive backend (a short alphanumeric string).
The value of backend_name is used to “get or create” the account object on demand.
Note
The backend_name property is a required_property. You are required to provide a value for this property by calling the constructor of the class that defines the property with a keyword argument named backend_name (unless a custom constructor is defined, in this case please refer to the documentation of that constructor). You can change the value of this property using normal attribute assignment syntax.
config¶
The configuration options for this backend and account (a dictionary).
Note
The config property is a lazy_property. This property’s value is computed once (the first time it is accessed) and the result is cached.
external_id_cache¶
A dictionary mapping external IDs to Contact objects.
Note
The external_id_cache property is a lazy_property. This property’s value is computed once (the first time it is accessed) and the result is cached.
redirect_stripper¶
A RedirectStripper object.
Note
The redirect_stripper property is a lazy_property. This property’s value is computed once (the first time it is accessed) and the result is cached.
session¶
Shortcut for the session property of archive.
Note
The session property is a lazy_property. This property’s value is computed once (the first time it is accessed) and the result is cached.
stats¶
A BackendStats object.
Note
The stats property is a required_property. You are required to provide a value for this property by calling the constructor of the class that defines the property with a keyword argument named stats (unless a custom constructor is defined, in this case please refer to the documentation of that constructor). You can change the value of this property using normal attribute assignment syntax.
find_contact_by_attributes(attributes)¶
Find a contact based on their external ID, an email address or a telephone number.
Parameters: attributes – A dictionary with any of the following keys:
- external_id (string value)
- email_addresses (list of strings)
- telephone_numbers (list of strings)
Returns: A Contact object or None.
find_contact_by_email_address(value)¶
Find a contact based on their email address.
Parameters: value – An email address (a string).
Returns: A Contact object or None.
find_contact_by_external_id(external_id)¶
Find a contact based on their ‘external ID’.
Parameters: external_id – The external ID (a string).
Returns: A Contact object or None.
This method uses external_id_cache to speed up lookup of contacts by their external ID.
find_contact_by_telephone_number(value)¶
Find a contact based on their telephone number.
Parameters: value – A telephone number (a string).
Returns: A Contact object or None.
get_or_create_contact(**attributes)¶
Get or create a contact object.
Parameters: attributes – The names and values of model attributes, used to find existing contacts and create new ones.
Returns: A Contact object.
This method serves three distinct purposes:
- Finding existing contacts by their ‘external ID’ or one of their email addresses or telephone numbers.
- Creating new contacts (based on the given attributes).
- Updating existing contacts (based on the given attributes).
Here’s an overview of supported attributes:
- The external_id attribute (whose value is expected to be a string).
- The full_name attribute (whose value is expected to be a string) is split into separate first_name and last_name attributes.
- The attributes email_address and telephone_number (whose values are expected to be strings) are converted to their plural forms email_addresses and telephone_numbers (lists of strings).
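The attribute handling listed above could look roughly like this. The sketch covers only the normalization step (no database lookups), and the `normalize_contact_attributes` helper is hypothetical. In particular the rule for splitting full_name (first word versus the rest) is an assumption, because the documentation doesn’t say how the split is made.

```python
def normalize_contact_attributes(attributes):
    """Return a copy of `attributes` in the normalized form described above."""
    normalized = dict(attributes)
    full_name = normalized.pop("full_name", None)
    if full_name:
        # Assumption: the first word is the first name, the rest the last name.
        first_name, _, last_name = full_name.strip().partition(" ")
        normalized["first_name"] = first_name
        normalized["last_name"] = last_name.strip()
    for singular, plural in (
        ("email_address", "email_addresses"),
        ("telephone_number", "telephone_numbers"),
    ):
        value = normalized.pop(singular, None)
        if value:
            # Build a new list so the caller's list is never modified.
            normalized[plural] = list(normalized.get(plural, [])) + [value]
    return normalized


contact = normalize_contact_attributes({
    "external_id": "12345",
    "full_name": "Alice van Example",
    "email_address": "alice@example.com",
})
```

Converting the singular attributes to lists means a contact that shows up with different addresses in different conversations can accumulate all of them.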
-
get_or_create_conversation
(external_id, **attributes)[source]¶ Get or create a
Conversation
object.Parameters: - external_id – The external ID of the conversation (a string).
- attributes – Any optional attributes to set when creating a new conversation.
Returns: Refer to
get_or_create_object()
.
-
get_or_create_message
(conversation, **attributes)[source]¶ Get or create a
Message
object.Parameters: - conversation – The
Conversation
in which the message originated. - attributes – Any optional attributes to set when creating a new message.
Returns: Refer to
get_or_create_object()
.- conversation – The
-
get_or_create_email_address
(email_address)[source]¶ Get or create an EmailAddress object.
Parameters: email_address – The email address (a string).
Returns: An EmailAddress object.
-
get_or_create_object
(model, required, optional=None)[source]¶ Find an existing object in the local database or create a new object.
Parameters: - model – The model to query.
- required – A dictionary with the key/value pairs that should be used to search for an existing object.
- optional – Any optional attributes to set when creating a new object.
Returns: A tuple with two values:
-
get_or_create_telephone_number
(telephone_number)[source]¶ Get or create a TelephoneNumber object.
Parameters: telephone_number – The telephone number (a string containing a number).
Returns: A TelephoneNumber object.
-
have_message
(conversation, external_id)[source]¶ Check if a message exists in the local database.
Parameters:
- conversation – The Conversation that contains the message.
- external_id – The unique id of the message (a string).
Returns:
-
pre_process_text
(attributes)[source]¶ Pre-process the text and HTML of a chat message.
Parameters: attributes – A dictionary with Message attributes.
This method works as follows:
- The text is pre-processed using strip_redirects().
- The html is pre-processed using RedirectStripper.
- When the resulting HTML exactly equals the plain text chat message, the html key in attributes is removed.
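The three steps of pre_process_text() can be sketched as shown below. The two callables are placeholders (identity functions by default) that stand in for strip_redirects() and RedirectStripper, whose behavior is not reproduced here, so this only illustrates the control flow.

```python
def pre_process_text(attributes, strip_text=lambda text: text, strip_html=lambda html: html):
    """Sketch of the three steps above; the two callables are placeholders."""
    if attributes.get("text"):
        attributes["text"] = strip_text(attributes["text"])
    if attributes.get("html"):
        attributes["html"] = strip_html(attributes["html"])
        # Drop HTML that adds nothing over the plain text.
        if attributes["html"] == attributes.get("text"):
            del attributes["html"]
    return attributes
```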
-
chat_archive.backends.gtalk
¶
Synchronization logic for the Google Talk backend of the chat-archive program.
The Google Talk backend uses the IMAP protocol to discover and download the messages available in the chats_folder of your Google Mail account. The following requirements need to be met in order to use this backend:
- You need to enable IMAP access to your Google Mail account.
- You may need to specifically enable IMAP access to the chats_folder (this turned out to be necessary for me).
Before developing this module in June 2018 I had never implemented any IMAP automation [1] so I wasn’t that familiar with the protocol and I didn’t know about message UIDs. The Unique ID in IMAP protocol blog post provided me with some useful details about the semantics of message UIDs.
This backend assumes and requires that the Google Mail servers provide message UIDs that are stable across sessions (this enables discovery of new messages). My testing implies that this is the case, because it seems to work fine! :-)
[1] This despite operating my own IMAP server for the past ten years, which means I was already familiar with IMAP from the perspective of a user as well as a server administrator.
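To illustrate why stable message UIDs enable discovery of new messages, here is a small sketch: remember the highest UID seen so far and only ask the server for higher ones. The function names are hypothetical and this is not taken from the backend itself.

```python
def uid_search_criterion(highest_seen_uid):
    """Build an IMAP SEARCH criterion that asks for UIDs above the given one."""
    return "UID %d:*" % (highest_seen_uid + 1)


def select_new_uids(search_response, highest_seen_uid):
    """Keep only the UIDs that are really new.

    A range like N:* always matches the last message in the mailbox, even
    when its UID is lower than N, so the response is filtered again.
    """
    uids = (int(token) for token in search_response.split())
    return sorted(uid for uid in uids if uid > highest_seen_uid)
```

With the standard library imaplib module the criterion could be passed along as client.uid('search', None, criterion); whether the backend does exactly this is an assumption.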
-
chat_archive.backends.gtalk.
FRIENDLY_NAME
= 'Google Talk'¶ A user friendly name for the chat service supported by this backend (a string).
-
chat_archive.backends.gtalk.
NAMESPACED_TAG_PATTERN
= re.compile('^{[^}]+}(\\S+)$')¶ Compiled regular expression to match XML tag names with a name space.
-
chat_archive.backends.gtalk.
BOGUS_EMAIL_PATTERN
= re.compile('^private-chat(-[0-9a-f]+)+@groupchat.google.com$', re.IGNORECASE)¶ Compiled regular expression to recognize private messages in group conversations.
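The following sketch shows what the two compiled patterns above match. The patterns are copied from the documented values; the helper functions and the example addresses are made up for illustration.

```python
import re

NAMESPACED_TAG_PATTERN = re.compile(r"^{[^}]+}(\S+)$")
BOGUS_EMAIL_PATTERN = re.compile(
    r"^private-chat(-[0-9a-f]+)+@groupchat.google.com$", re.IGNORECASE
)


def strip_namespace(tag):
    """Return the local tag name, e.g. '{jabber:client}message' becomes 'message'."""
    match = NAMESPACED_TAG_PATTERN.match(tag)
    return match.group(1) if match else tag


def is_private_group_chat_address(address):
    """Check whether an address looks like a private message in a group chat."""
    return BOGUS_EMAIL_PATTERN.match(address) is not None
```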
-
class
chat_archive.backends.gtalk.
GoogleTalkBackend
(**kw)[source]¶ The Google Talk backend for the chat-archive program.
This backend supports the following configuration options:
Option | Description |
---|---|
chats-folder | See chats_folder. |
imap-server | See imap_server. |
email | The email address used to sign in to your Google Mail account. |
password-name | The name of a password in ~/.password-store to use. |
password | See password. |
If you set password-name then password doesn’t have to be set. If neither password nor password-name has been set then you will be prompted for your password every time you synchronize.
You can set the values of the chats_folder and imap_server properties by passing keyword arguments to the class initializer.
Here’s an overview of the GoogleTalkBackend class:
-
chats_folder
[source]¶ The folder that contains chat message archives (a string, defaults to ‘[Gmail]/Chats’).
Note
The chats_folder property is a mutable_property. You can change the value of this property using normal attribute assignment syntax. To reset it to its default (computed) value you can use del or delattr().
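The behavior described in the note above (assign to override, delete to fall back to the computed default) can be imitated with a plain descriptor. This is a sketch under the assumption that the semantics are as described; MutableDefault and Backend are hypothetical names and not part of chat-archive.

```python
class MutableDefault:
    """Hypothetical descriptor: assignable, falls back to a computed default."""

    def __init__(self, compute_default):
        self.compute_default = compute_default
        self.name = compute_default.__name__

    def __get__(self, instance, owner=None):
        if instance is None:
            return self
        # Prefer an explicitly assigned value, else compute the default.
        if self.name in instance.__dict__:
            return instance.__dict__[self.name]
        return self.compute_default(instance)

    def __set__(self, instance, value):
        instance.__dict__[self.name] = value

    def __delete__(self, instance):
        # Forget the assigned value so the default applies again.
        instance.__dict__.pop(self.name, None)


class Backend:
    @MutableDefault
    def chats_folder(self):
        return "[Gmail]/Chats"
```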
-
client
[source]¶ An IMAP client connection to imap_server.
Note
The client property is a lazy_property. This property’s value is computed once (the first time it is accessed) and the result is cached.
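The "computed once, then cached" behavior can be demonstrated with functools.cached_property from the standard library (Python 3.8+). It is used here as a stand-in only; the Backend class below is hypothetical.

```python
from functools import cached_property  # Python 3.8+


class Backend:
    connections_opened = 0

    @cached_property
    def client(self):
        # Stand-in for an expensive operation such as opening a connection.
        self.connections_opened += 1
        return object()


backend = Backend()
first = backend.client
second = backend.client  # served from the cache, nothing is recomputed
```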
-
conversation_map
[source]¶ A mapping of conversations.
Note
The conversation_map property is a lazy_property. This property’s value is computed once (the first time it is accessed) and the result is cached.
-
imap_server
[source]¶ The domain name of the Google Mail IMAP server (a string, defaults to ‘imap.gmail.com’).
Note
The imap_server property is a mutable_property. You can change the value of this property using normal attribute assignment syntax. To reset it to its default (computed) value you can use del or delattr().
-
password
[source]¶ The password used to sign in to the Google Mail account (a string).
Note
The password property is a lazy_property. This property’s value is computed once (the first time it is accessed) and the result is cached.
-
synchronize
()[source]¶ Download RFC822 encoded Google Talk conversations using IMAP and import the embedded chat messages.
-
parse_singlepart_email
(email)[source]¶ Extract a chat message from a single-part email downloaded from chats_folder.
-
parse_multipart_email
(email)[source]¶ Find the text/xml payload in an RFC 822 multi-part email message.
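Finding a text/xml part in a multi-part message can be done with the standard library email package, roughly as sketched below. The function name find_xml_payload and the sample message are made up; the real method may differ in how it handles encodings and missing parts.

```python
import email


def find_xml_payload(message):
    """Return the decoded text/xml part of a parsed email, or None."""
    for part in message.walk():
        if part.get_content_type() == "text/xml":
            payload = part.get_payload(decode=True)
            charset = part.get_content_charset() or "utf-8"
            return payload.decode(charset)
    return None


RAW = "\n".join([
    'MIME-Version: 1.0',
    'Content-Type: multipart/alternative; boundary="XYZ"',
    '',
    '--XYZ',
    'Content-Type: text/plain; charset="utf-8"',
    '',
    'plain text',
    '--XYZ',
    'Content-Type: text/xml; charset="utf-8"',
    '',
    '<conversation/>',
    '--XYZ--',
    '',
])
xml_text = find_xml_payload(email.message_from_string(RAW))
```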
-
find_conversation
(*participants)[source]¶ Find a conversation (without an external ID) that involves the given participants.
-
extract_timestamp
(message_node)[source]¶ Extract a timestamp from a <message> node.
Parameters: message_node – A <message> node.
Returns: A datetime.datetime object.
-
extract_html
(message_node)[source]¶ Try to extract HTML from a <message> node.
Parameters: message_node – A <message> node.
Returns: The extracted HTML (a string) or None.
-
contact_from_jid
(value)[source]¶ Convert a Jabber ID to an email address and use that to find or create a contact.
-
-
class
chat_archive.backends.gtalk.
EmailMessageParser
(**kw)[source]¶ Lazy evaluation of email.message_from_string().
When you initialize an EmailMessageParser object you are required to provide values for the raw_body and uid properties. You can set the values of the raw_body and uid properties by passing keyword arguments to the class initializer.
Here’s an overview of the EmailMessageParser class:
Superclass: PropertyManager
Properties: parsed_body, raw_body, timestamp and uid
-
parsed_body
[source]¶ The result of email.message_from_string().
Note
The parsed_body property is a lazy_property. This property’s value is computed once (the first time it is accessed) and the result is cached.
-
raw_body
[source]¶ The raw message body of the email (a string).
Note
The raw_body property is a required_property. You are required to provide a value for this property by calling the constructor of the class that defines the property with a keyword argument named raw_body (unless a custom constructor is defined, in this case please refer to the documentation of that constructor). You can change the value of this property using normal attribute assignment syntax.
-
timestamp
[source]¶ Convert the Date: header of the email message to a datetime object.
Note
The timestamp property is a lazy_property. This property’s value is computed once (the first time it is accessed) and the result is cached.
-
uid
[source]¶ The UID of the email message.
Note
The uid property is a required_property. You are required to provide a value for this property by calling the constructor of the class that defines the property with a keyword argument named uid (unless a custom constructor is defined, in this case please refer to the documentation of that constructor). You can change the value of this property using normal attribute assignment syntax.
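The idea behind this class (keep the raw body around and only parse it when something asks for the parsed form) can be sketched with the standard library alone. LazyEmail is a hypothetical stand-in that uses functools.cached_property (Python 3.8+) for the caching; it is not the actual class.

```python
import email
import email.utils
from functools import cached_property  # Python 3.8+


class LazyEmail:
    """Hypothetical stand-in that defers parsing until it is needed."""

    def __init__(self, raw_body, uid):
        self.raw_body = raw_body
        self.uid = uid

    @cached_property
    def parsed_body(self):
        return email.message_from_string(self.raw_body)

    @cached_property
    def timestamp(self):
        # Convert the Date: header to a datetime object.
        return email.utils.parsedate_to_datetime(self.parsed_body["Date"])


message = LazyEmail(raw_body="Date: Mon, 04 Jun 2018 10:30:00 +0000\n\nhello", uid=7)
```

Nothing is parsed when the object is created; the first access to timestamp triggers the parsing of the body as a side effect.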
-
-
class
chat_archive.backends.gtalk.
LazyXMLFormatter
(node)[source]¶ Lazy evaluation of xml.etree.ElementTree.tostring().
-
__init__
(node)[source]¶ Initialize a LazyXMLFormatter object.
Parameters: node – The XML node to render.
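A wrapper like this is typically useful for logging, which is an assumption on my part about how it is used here. The sketch below shows the pattern with a hypothetical LazyXML class: the node is only serialized when str() is called.

```python
import logging
import xml.etree.ElementTree as ET


class LazyXML:
    """Hypothetical stand-in: serialize the node only when str() is called."""

    def __init__(self, node):
        self.node = node

    def __str__(self):
        return ET.tostring(self.node, encoding="unicode")


logger = logging.getLogger(__name__)
node = ET.fromstring("<message><body>hi</body></message>")
# With %s formatting the logging module only calls str() when the record is
# actually emitted, so disabled DEBUG logging never pays for serialization.
logger.debug("Parsing node: %s", LazyXML(node))
```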
-
chat_archive.backends.hangouts
¶
Synchronization logic for the Google Hangouts backend of the chat-archive program.
-
chat_archive.backends.hangouts.
FRIENDLY_NAME
= 'Google Hangouts'¶ A user friendly name for the chat service supported by this backend (a string).
-
class
chat_archive.backends.hangouts.
HangoutsBackend
(**kw)[source]¶ The Google Hangouts backend for the chat-archive program.
This backend supports the following configuration options:
Option | Description |
---|---|
email-address | The email address used to sign in to your Google account. |
password-name | The name of a password in ~/.password-store to use. |
password | The password used to sign in to your Google account. |
If you set password-name then password doesn’t have to be set. If neither password nor password-name has been set then you will be prompted for your password every time you synchronize.
You can set the values of the cookie_file and retry_count properties by passing keyword arguments to the class initializer.
Here’s an overview of the HangoutsBackend class:
Superclass: ChatArchiveBackend
Public methods: connect_then_sync(), download_all_contacts(), download_all_conversations(), download_all_messages(), download_conversation(), download_message_batch(), get_message_html(), handle_import_errors(), is_bogus_user(), perform_initial_sync() and synchronize()
Properties: bogus_user_ids, client, cookie_file and retry_count
-
bogus_user_ids
[source]¶ A set of strings with ‘gaia_id’ values of “bogus” users.
Note
The bogus_user_ids property is a lazy_property. This property’s value is computed once (the first time it is accessed) and the result is cached.
-
cookie_file
[source]¶ The pathname of the *.json file with cached credentials (a string).
Note
The cookie_file property is a mutable_property. You can change the value of this property using normal attribute assignment syntax. To reset it to its default (computed) value you can use del or delattr().
-
client
[source]¶ The hangups client object.
Note
The client property is a lazy_property. This property’s value is computed once (the first time it is accessed) and the result is cached.
-
retry_count
[source]¶ The number of times that a batch of messages will be requested (a number, defaults to 5).
Note
The retry_count property is a mutable_property. You can change the value of this property using normal attribute assignment syntax. To reset it to its default (computed) value you can use del or delattr().
-
download_all_messages
(conversation, conversation_in_db, event_id=None)[source]¶ Download the messages in a specific Hangouts conversation.
-
download_message_batch
(conversation, event_id)[source]¶ Try to download a batch of messages (retrying according to retry_count).
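The retry behavior can be sketched as a simple loop that requests the batch up to retry_count times. The function below is hypothetical (the real method talks to the Hangouts API and may wait between attempts); it only shows the counting.

```python
def download_with_retries(fetch_batch, retry_count=5):
    """Call fetch_batch() up to retry_count times, returning the first result."""
    if retry_count < 1:
        raise ValueError("retry_count must be at least 1")
    last_error = None
    for attempt in range(retry_count):
        try:
            return fetch_batch()
        except Exception as error:  # a real backend would catch narrower errors
            last_error = error
    # Every attempt failed, so report the most recent error to the caller.
    raise last_error
```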
-
-
class
chat_archive.backends.hangouts.
GoogleAccountCredentials
(**kw)[source]¶ Used to non-interactively provide Google Account credentials to hangups.
When you initialize a GoogleAccountCredentials object you are required to provide values for the email_address and password properties. You can set the values of the email_address and password properties by passing keyword arguments to the class initializer.
Here’s an overview of the GoogleAccountCredentials class:
Superclass: PropertyManager
Public methods: get_email(), get_password() and get_verification_code()
Properties: email_address and password
-
email_address
[source]¶ The Google account email address (a string).
Note
The email_address property is a required_property. You are required to provide a value for this property by calling the constructor of the class that defines the property with a keyword argument named email_address (unless a custom constructor is defined, in this case please refer to the documentation of that constructor). You can change the value of this property using normal attribute assignment syntax.
-
password
[source]¶ The Google account password (a string).
Note
The password property is a required_property. You are required to provide a value for this property by calling the constructor of the class that defines the property with a keyword argument named password (unless a custom constructor is defined, in this case please refer to the documentation of that constructor). You can change the value of this property using normal attribute assignment syntax.
-
get_email
()[source]¶ Feed the configured email_address to hangups.
-
chat_archive.backends.slack
¶
Synchronization logic for the Slack backend of the chat-archive program.
-
chat_archive.backends.slack.
FRIENDLY_NAME
= 'Slack'¶ A user friendly name for the chat service supported by this backend (a string).
-
class
chat_archive.backends.slack.
SlackBackend
(**kw)[source]¶ Container for the Slack chat archive backend.
You can set the value of the is_limited property by passing a keyword argument to the class initializer.
Here’s an overview of the SlackBackend class:
Superclass: ChatArchiveBackend
Public methods: expand_reference_callback(), get_history(), import_messages(), synchronize(), synchronize_channels(), synchronize_direct_messages() and synchronize_users()
Properties: api_token, client, http_session, is_limited, mrkdwn_to_html and spinner
-
api_token
[source]¶ The Slack API token (a string).
Note
The api_token property is a lazy_property. This property’s value is computed once (the first time it is accessed) and the result is cached.
-
client
[source]¶ A slacker.Slacker instance initialized with api_token and http_session.
Note
The client property is a lazy_property. This property’s value is computed once (the first time it is accessed) and the result is cached.
-
is_limited
[source]¶ Whether result sets have been limited due to the free plan.
Note
The is_limited property is a mutable_property. You can change the value of this property using normal attribute assignment syntax. To reset it to its default (computed) value you can use del or delattr().
-
mrkdwn_to_html
[source]¶ An HTMLConverter object.
Note
The mrkdwn_to_html property is a lazy_property. This property’s value is computed once (the first time it is accessed) and the result is cached.
-
http_session
[source]¶ A requests.Session object used for HTTP connection re-use.
Note
The http_session property is a lazy_property. This property’s value is computed once (the first time it is accessed) and the result is cached.
-
spinner
[source]¶ An interactive spinner to provide feedback to the user (because the Slack backend is slow).
Note
The spinner property is a lazy_property. This property’s value is computed once (the first time it is accessed) and the result is cached.
-
-
class
chat_archive.backends.slack.
HTMLConverter
(expand_reference_callback=None)[source]¶ Convert Slack chat messages from mrkdwn format to HTML.
-
__init__
(expand_reference_callback=None)[source]¶ Initialize an HTMLConverter object.
-
__call__
(text)[source]¶ Convert a Slack chat message to HTML.
Parameters: text – The text of a Slack message (a string).
Returns: The generated HTML (a string).
-
followed_by_alphanumeric
(input, index, limit)[source]¶ Check if the given position is followed by an alphanumeric character.
-
chat_archive.backends.telegram
¶
Synchronization logic for the Telegram backend of the chat-archive program.
The use of this backend requires the user to register on my.telegram.org/apps to get an api_id and api_hash.
-
chat_archive.backends.telegram.
FRIENDLY_NAME
= 'Telegram'¶ A user friendly name for the chat service supported by this backend (a string).
-
class
chat_archive.backends.telegram.
TelegramBackend
(**kw)[source]¶ Container for the Telegram chat archive backend.
When you initialize a TelegramBackend object you are required to provide values for the api_hash and api_id properties. You can set the values of the api_hash, api_id and session_file properties by passing keyword arguments to the class initializer.
Here’s an overview of the TelegramBackend class:
Superclass: ChatArchiveBackend
Public methods: connect_then_sync(), dialog_to_ignore(), download_messages(), is_duplicate_dialog(), is_group_conversation(), is_service_dialog(), perform_initial_sync(), recipient_to_contact(), sender_to_contact(), synchronize() and update_conversation()
Properties: api_hash, api_id, client and session_file
-
api_hash
[source]¶ The API hash used to connect to the Telegram API (a string).
The value of this property can be configured as follows:
[telegram]
api-hash = ...
You can use the api-hash-name configuration file option to specify the name of a secret in ~/.password-store instead.
Note
The api_hash property is a required_property. You are required to provide a value for this property by calling the constructor of the class that defines the property with a keyword argument named api_hash (unless a custom constructor is defined, in this case please refer to the documentation of that constructor). You can change the value of this property using normal attribute assignment syntax.
-
api_id
[source]¶ The API ID used to connect to the Telegram API (an integer).
The value of this property can be configured as follows:
[telegram]
api-id = ...
You can use the api-id-name configuration file option to specify the name of a secret in ~/.password-store instead.
Note
The api_id property is a required_property. You are required to provide a value for this property by calling the constructor of the class that defines the property with a keyword argument named api_id (unless a custom constructor is defined, in this case please refer to the documentation of that constructor). You can change the value of this property using normal attribute assignment syntax.
-
client
[source]¶ A telethon.TelegramClient object constructed based on api_id, api_hash and session_file.
Note
The client property is a lazy_property. This property’s value is computed once (the first time it is accessed) and the result is cached.
-
session_file
[source]¶ The filename of the session file passed to telethon.TelegramClient.
Note
The session_file property is a mutable_property. You can change the value of this property using normal attribute assignment syntax. To reset it to its default (computed) value you can use del or delattr().
-
dialog_to_ignore
(dialog)[source]¶ Check if this conversation should be ignored.
This method exists to exclude two types of conversations:
- The conversation with the “Telegram” user, because I don’t consider the service messages in this conversation to be relevant to my chat archive.
- Group conversations that are being synchronized as part of a different Telegram account.
-
is_duplicate_dialog
(dialog)[source]¶ Check if the given dialog is being synchronized as part of a different Telegram account.
-
is_service_dialog
(dialog)[source]¶ Check if the given dialog is the dialog with the “Telegram” user, containing service messages.
-
connect_then_sync
()[source]¶ Connect to the Telegram API and synchronize the available conversations.
-
download_messages
(dialog, conversation_in_db, min_id=0, max_id=0)[source]¶ Download messages in the given conversation.
-
perform_initial_sync
(dialog, conversation_in_db)[source]¶ Start or resume the initial synchronization.
-
update_conversation
(dialog, conversation_in_db)[source]¶ Download new messages in an existing conversation.
-
chat_archive.cli
¶
Usage: chat-archive [OPTIONS] [COMMAND]
Easy to use offline chat archive that can gather chat message history from Google Talk, Google Hangouts, Slack and Telegram.
Supported commands:
- The ‘sync’ command downloads new chat messages from supported chat services and stores them in the local archive (an SQLite database).
- The ‘search’ command searches the chat messages in the local archive for the given keyword(s) and lists matching messages.
- The ‘list’ command lists all messages in the local archive.
- The ‘stats’ command shows statistics about the local archive.
- The ‘unknown’ command searches for conversations that contain messages from an unknown sender and allows you to enter the name of a new contact to associate with all of the messages from that unknown sender. Conversations involving multiple unknown senders are not supported.
Supported options:
Option | Description |
---|---|
-C, --context=COUNT | Print COUNT messages of output context during ‘chat-archive search’. This works similarly to ‘grep -C’. The default value of COUNT is 3. |
-f, --force | Retry synchronization of conversations where errors were previously encountered. This option is currently only relevant to the Google Hangouts backend, because I kept getting server errors when synchronizing a few specific conversations and I didn’t want to keep seeing each of those errors during every synchronization run :-). |
-c, --color=CHOICE, --colour=CHOICE | Specify whether ANSI escape sequences for text and background colors and text styles are to be used or not, depending on the value of |
-l, --log-file=LOGFILE | Save logs at DEBUG verbosity to the filename given by LOGFILE. This option was added to make it easy to capture the log output of an initial synchronization that will be downloading thousands of messages. |
-p, --profile=FILENAME | Enable profiling of the chat-archive application to make it possible to analyze performance problems. Python profiling data will be saved to FILENAME every time database changes are committed (making it possible to inspect the profile while the program is still running). |
-v, --verbose | Increase logging verbosity (can be repeated). |
-q, --quiet | Decrease logging verbosity (can be repeated). |
-h, --help | Show this message and exit. |
-
chat_archive.cli.
FORMATTING_TEMPLATES
= {'conversation_delimiter': '<span style="color: green">{text}</span>', 'conversation_name': '<span style="font-weight: bold; color: #FCE94F">{text}</span>', 'keyword_highlight': '<span style="color: black; background-color: yellow">{text}</span>', 'message_backend': '<span style="color: #C4A000">({text})</span>', 'message_contacts': '<span style="color: blue">{text}</span>', 'message_delimiter': '<span style="color: #555753">{text}</span>', 'message_timestamp': '<span style="color: green">{text}</span>'}¶ The formatting of output, specified as HTML with placeholders.
-
chat_archive.cli.
UNKNOWN_CONTACT_LABEL
= 'Unknown'¶ The label for contacts without a name or email address (a string).
-
class
chat_archive.cli.
UserInterface
(*args, **kw)[source]¶ The Python API for the command line interface for the chat-archive program.
You can set the values of the context, keywords, timestamp_format and use_colors properties by passing keyword arguments to the class initializer.
Here’s an overview of the UserInterface class:
-
context
[source]¶ The number of messages of output context to print during searches (defaults to 3).
Note
The context property is a mutable_property. You can change the value of this property using normal attribute assignment syntax. To reset it to its default (computed) value you can use del or delattr().
-
use_colors
[source]¶ Whether to output ANSI escape sequences for text colors and styles (a boolean).
Note
The use_colors property is a custom_property. You can change the value of this property using normal attribute assignment syntax. This property’s value is computed once (the first time it is accessed) and the result is cached. To clear the cached value you can use del or delattr().
-
html_to_ansi
[source]¶ An HTMLConverter object that uses normalize_emoji() as a text pre-processing callback.
Note
The html_to_ansi property is a lazy_property. This property’s value is computed once (the first time it is accessed) and the result is cached.
-
redirect_stripper
[source]¶ A RedirectStripper object.
Note
The redirect_stripper property is a lazy_property. This property’s value is computed once (the first time it is accessed) and the result is cached.
-
html_to_text
[source]¶ An HTMLStripper object.
Note
The html_to_text property is a lazy_property. This property’s value is computed once (the first time it is accessed) and the result is cached.
-
keyword_highlighter
[source]¶ A KeywordHighlighter object based on keywords.
Note
The keyword_highlighter property is a lazy_property. This property’s value is computed once (the first time it is accessed) and the result is cached.
-
keywords
[source]¶ A list of strings with search keywords.
Note
The keywords property is a mutable_property. You can change the value of this property using normal attribute assignment syntax. To reset it to its default (computed) value you can use del or delattr().
-
timestamp_format
[source]¶ The format of timestamps (defaults to %Y-%m-%d %H:%M:%S).
Note
The timestamp_format property is a mutable_property. You can change the value of this property using normal attribute assignment syntax. To reset it to its default (computed) value you can use del or delattr().
-
search_cmd
(arguments)[source]¶ Search the chat messages in the local archive for the given keyword(s).
-
unknown_cmd
(arguments)[source]¶ Find private conversations with messages from an unknown sender and interactively prompt the operator to provide a name for a new contact to associate the messages with.
-
generate_html
(name, text)[source]¶ Generate HTML based on a named format string.
Parameters:
- name – The name of an HTML format string in FORMATTING_TEMPLATES (a string).
- text – The text to interpolate (a string).
Returns: The generated HTML (a string).
This method does not escape the text given to it, in other words it is up to the caller to decide whether embedded HTML is allowed or not.
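Because the templates use a {text} placeholder, the interpolation can be sketched with str.format() as shown below. That the real method uses str.format() is an assumption; the two templates are copied from FORMATTING_TEMPLATES. The last lines show how a caller can escape untrusted text itself.

```python
import html

TEMPLATES = {
    "message_timestamp": '<span style="color: green">{text}</span>',
    "keyword_highlight": '<span style="color: black; background-color: yellow">{text}</span>',
}


def generate_html(name, text):
    """Interpolate text into a named template without escaping it."""
    return TEMPLATES[name].format(text=text)


# The caller decides whether embedded HTML is allowed or must be escaped.
trusted = generate_html("message_timestamp", "2018-06-04")
escaped = generate_html("keyword_highlight", html.escape("<b>"))
```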
-
normalize_whitespace
(text)[source]¶ Normalize the whitespace in a chat message before rendering on the terminal.
Parameters: text – The chat message text (a string).
Returns: The normalized text (a string).
This method works as follows:
- First leading and trailing whitespace is stripped from the text.
- When the resulting text consists of a single line, it is processed using compact() and returned.
- When the resulting text contains multiple lines the text is prefixed with a newline character, so that the chat message starts on its own line. This ensures that messages requiring vertical alignment render properly (for example a table drawn with | and - characters).
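The steps above can be sketched as follows. Collapsing runs of whitespace with split() and join() is a stand-in for compact(), so the result for single-line messages is an approximation.

```python
def normalize_whitespace(text):
    """Sketch of the whitespace normalization described above."""
    stripped = text.strip()
    if "\n" not in stripped:
        # Stand-in for compact(): collapse runs of whitespace into one space.
        return " ".join(stripped.split())
    # Multi-line messages start on their own line to preserve alignment.
    return "\n" + stripped
```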
-
render_conversation_summary
(conversation)[source]¶ Render a summary of which conversation a message is part of.
-
prepare_output
(text)[source]¶ Prepare text for rendering on the terminal.
Parameters: text – The HTML text to render (a string).
Returns: The rendered text (a string).
When use_colors is True this method first uses keyword_highlighter to highlight search matches in the given text and then it converts the string from HTML to ANSI escape sequences using html_to_ansi.
When use_colors is False then html_to_text is used to convert the given HTML to plain text. In this case keyword highlighting is skipped.
-
render_output
(text)[source]¶ Render text on the terminal.
Parameters: text – The HTML text to render (a string).
Refer to prepare_output() for details about how text is converted from HTML to text with ANSI escape sequences.
-
get_contact_name
(contact)[source]¶ Get a short string describing a contact (preferably their first name, but if that is not available then their email address will have to do). If no useful information is available UNKNOWN_CONTACT_LABEL is returned so as to explicitly mark the absence of more information.
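The fallback order (first name, then email address, then the ‘Unknown’ label) can be sketched like this. Treating email_addresses as a list of plain strings is an assumption made to keep the example self-contained; the contact objects below are simple stand-ins for the real model.

```python
from types import SimpleNamespace

UNKNOWN_CONTACT_LABEL = "Unknown"


def get_contact_name(contact):
    """Return the first name, else an email address, else the fallback label."""
    if contact is not None:
        if getattr(contact, "first_name", None):
            return contact.first_name
        addresses = getattr(contact, "email_addresses", None) or []
        if addresses:
            return addresses[0]
    return UNKNOWN_CONTACT_LABEL


named = SimpleNamespace(first_name="Ada", email_addresses=[])
email_only = SimpleNamespace(first_name=None, email_addresses=["ada@example.com"])
anonymous = SimpleNamespace(first_name=None, email_addresses=[])
```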
-
chat_archive.database
¶
SQLAlchemy based database helpers.
-
class
chat_archive.database.
DatabaseClient
(*args, **kw)[source]¶ Simple wrapper for SQLAlchemy that makes it easy to use with SQLite.
When you initialize a DatabaseClient object you are required to provide a value for the database_url property. You can set the values of the database_file, database_url and echo_queries properties by passing keyword arguments to the class initializer.
Here’s an overview of the DatabaseClient class:
Superclass: ProfileManager
Special methods: __exit__() and __init__()
Public methods: commit_changes()
Properties: database_engine, database_file, database_url, echo_queries, session and session_factory
-
__init__
(*args, **kw)[source]¶ Initialize a DatabaseClient object.
Please refer to the PropertyManager documentation for details about the handling of arguments.
-
database_engine
[source]¶ An SQLAlchemy database engine connected to database_url.
Note
The database_engine property is a lazy_property. This property’s value is computed once (the first time it is accessed) and the result is cached.
-
database_file
[source]¶ The absolute pathname of an SQLite database file (a string or None).
Note
The database_file property is a writable_property. You can change the value of this property using normal attribute assignment syntax.
-
database_url
[source]¶ A URL that indicates the database dialect and connection arguments to SQLAlchemy (a string).
The value of database_url defaults to a URL that instructs SQLAlchemy to use an SQLite 3 database file located at the pathname given by database_file, but of course you are free to point SQLAlchemy to any supported database server.
Note
The database_url property is a required_property. You are required to provide a value for this property by calling the constructor of the class that defines the property with a keyword argument named database_url (unless a custom constructor is defined, in this case please refer to the documentation of that constructor). You can change the value of this property using normal attribute assignment syntax.
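As an illustration of the default described above, an SQLite URL for SQLAlchemy can be derived from a file name as sketched below. The function name is hypothetical, the exact URL that chat-archive generates is an assumption, and the example assumes POSIX style pathnames.

```python
import os


def default_database_url(database_file):
    """Build an SQLAlchemy URL for an SQLite database file."""
    # SQLAlchemy's SQLite URLs look like sqlite:///<path>; an absolute POSIX
    # path therefore results in four slashes in a row.
    return "sqlite:///" + os.path.abspath(database_file)
```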
-
echo_queries
[source]¶ Whether queries should be logged to sys.stderr (a boolean, defaults to False).
Note
The echo_queries property is a writable_property. You can change the value of this property using normal attribute assignment syntax.
-
session
[source]¶ An SQLAlchemy session created by session_factory.
Note
The session property is a lazy_property. This property’s value is computed once (the first time it is accessed) and the result is cached.
-
session_factory
[source]¶ An SQLAlchemy session factory connected to database_engine.
Note
The session_factory property is a lazy_property. This property’s value is computed once (the first time it is accessed) and the result is cached.
-
-
class
chat_archive.database.
SchemaManager
(*args, **kw)[source]¶ Easy to use database schema upgrades based on Alembic.
You can set the values of the alembic_directory, auto_create_schema, auto_upgrade_schema and declarative_base properties by passing keyword arguments to the class initializer.
Here’s an overview of the SchemaManager class:
Superclass: DatabaseClient
Special methods: __init__()
Public methods: initialize_schema() and run_migrations()
Properties: alembic_config, alembic_directory, auto_create_schema, auto_upgrade_schema, current_schema_revision, declarative_base, latest_schema_revision and schema_up_to_date
-
__init__
(*args, **kw)[source]¶ Initialize a SchemaManager object.
This method automatically calls run_migrations() (and initialize_schema() when the database is initially created) to ensure that the database schema is up to date.
-
alembic_config
[source]¶ A minimal Alembic configuration object.
This configuration object contains two options:
- sqlalchemy.url is set to database_url
- script_location is set to alembic_directory
Raises: ValueError when alembic_directory isn’t set.
Note
The alembic_config property is a lazy_property. This property’s value is computed once (the first time it is accessed) and the result is cached.
-
alembic_directory
[source]¶ The absolute pathname of the directory containing Alembic’s env.py file (a string or None).
Note
The alembic_directory property is a writable_property. You can change the value of this property using normal attribute assignment syntax.
-
auto_create_schema
[source]¶ True if automatic database schema initialization is enabled, False otherwise.
This defaults to True when declarative_base is set, False otherwise.
Note
The auto_create_schema property is a writable_property. You can change the value of this property using normal attribute assignment syntax.
-
auto_upgrade_schema
[source]¶ True if automatic database schema upgrades are enabled, False otherwise.
This defaults to True when alembic_directory is set, False otherwise.
Note
The auto_upgrade_schema property is a writable_property. You can change the value of this property using normal attribute assignment syntax.
-
current_schema_revision
[source]¶ The current database schema revision in the database that we’re connected to (a string or None).
Note
The current_schema_revision property is a cached_property. This property’s value is computed once (the first time it is accessed) and the result is cached. To clear the cached value you can use del or delattr().
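The caching behaviour described in these notes can be illustrated with a small sketch. It only demonstrates the semantics (compute once, cache, clear with del) and makes no claim about how lazy_property or cached_property are actually implemented; the cached descriptor and the Demo class are hypothetical names.

```python
class cached(object):
    """Hypothetical non-data descriptor: compute once, then cache on the instance."""

    def __init__(self, func):
        self.func = func
        self.name = func.__name__

    def __get__(self, obj, objtype=None):
        if obj is None:
            return self
        # Store the result in the instance dictionary; later lookups find it
        # there and never reach this descriptor again.
        value = self.func(obj)
        obj.__dict__[self.name] = value
        return value


class Demo(object):
    calls = 0

    @cached
    def revision(self):
        Demo.calls += 1
        return "rev-%d" % Demo.calls


demo = Demo()
first = demo.revision    # computed on first access
second = demo.revision   # served from the cache
del demo.revision        # clear the cached value
third = demo.revision    # computed again
```

Because the descriptor defines no __delete__ method, del simply removes the cached entry from the instance, which forces a fresh computation on the next access.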
-
declarative_base
[source]¶ The base class for declarative models defined using SQLAlchemy.
Note
The declarative_base property is a writable_property. You can change the value of this property using normal attribute assignment syntax.
-
latest_schema_revision
[source]¶ The latest schema revision according to Alembic’s migration scripts (a string).
Note
The latest_schema_revision property is a lazy_property. This property’s value is computed once (the first time it is accessed) and the result is cached.
-
initialize_schema
()[source]¶ Initialize the database schema using SQLAlchemy.
This method is automatically called when a SchemaManager object is created. In order to initialize the database schema the declarative_base property needs to be set, but if it’s not set then initialize_schema() won’t complain.
-
run_migrations
()[source]¶ Upgrade the database schema using Alembic.
This method is automatically called when a SchemaManager object is created. In order to upgrade the database schema the alembic_directory property needs to be set, but if it’s not set then run_migrations() won’t complain.
-
-
class
chat_archive.database.
CustomVerbosity
(**kw)[source]¶ Easily customize logging verbosity for a given scope.
This is used by SchemaManager to silence Alembic because it’s rather verbose by default, presumably because its primary purpose is to be a command line program and not a library embedded in an application.
When you initialize a CustomVerbosity object you are required to provide a value for the level property. You can set the values of the level and original_level properties by passing keyword arguments to the class initializer.
Here’s an overview of the CustomVerbosity class:
Superclass: PropertyManager
Special methods: __enter__() and __exit__()
Properties: level and original_level
-
level
[source]¶ The overridden logging verbosity level.
Note
The level property is a required_property. You are required to provide a value for this property by calling the constructor of the class that defines the property with a keyword argument named level (unless a custom constructor is defined, in this case please refer to the documentation of that constructor). You can change the value of this property using normal attribute assignment syntax.
-
original_level
[source]¶ The original logging verbosity level.
Note
The original_level property is a writable_property. You can change the value of this property using normal attribute assignment syntax.
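To make the idea of a scoped verbosity override concrete, here is a minimal sketch built on the standard logging module. The VerbositySketch class and its logger_name parameter are hypothetical; which logger the real CustomVerbosity class adjusts is not documented here, so treat this as an illustration of the pattern only.

```python
import logging


class VerbositySketch(object):
    """Hypothetical stand-in: override a logger's level inside a with block."""

    def __init__(self, level, logger_name=None):
        self.level = level
        self.original_level = None
        self.logger = logging.getLogger(logger_name)

    def __enter__(self):
        # Remember the level that was in effect so it can be restored.
        self.original_level = self.logger.level
        self.logger.setLevel(self.level)
        return self

    def __exit__(self, exc_type=None, exc_value=None, traceback=None):
        self.logger.setLevel(self.original_level)


logger = logging.getLogger("sketch.noisy")
logger.setLevel(logging.INFO)
with VerbositySketch(level=logging.WARNING, logger_name="sketch.noisy"):
    inside = logger.level   # overridden inside the block
after = logger.level        # restored once the block ends
```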
-
chat_archive.emoji
¶
Utility functions to translate between various forms of smilies and emoji.
chat_archive.html
¶
Utility functions for working with the HTML encoded text.
-
chat_archive.html.
BLOCK_TAGS
= ['div', 'p', 'pre']¶ A list of strings with HTML tags that are considered block-level elements. The
HTMLStripper
emits an empty line before and after each block-level element that it encounters.
-
chat_archive.html.
URL_PATTERN
= re.compile('(http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+)')¶ A compiled regular expression pattern to find URLs in text (credit: taken from urlregex.com).
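As a quick illustration of what this pattern matches, the sketch below recompiles it and applies it to a made-up sample string (the example.com and example.org URLs are placeholders).

```python
import re

# The same pattern as documented above, written as a raw string.
URL_PATTERN = re.compile(
    r'(http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+)'
)

sample = 'Docs at https://example.com/guide?page=2 and http://example.org/ too'
urls = URL_PATTERN.findall(sample)
```

Note that the character class [$-_@.&+] contains the range $ to _, which covers most ASCII punctuation, so a comma or closing parenthesis that directly follows a URL becomes part of the match.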
-
chat_archive.html.
html_to_text
(html_text)[source]¶ Convert HTML to plain text.
Parameters: html_text – A fragment of HTML (a string).
Returns: The plain text (a string).
This function uses the HTMLStripper class that builds on top of the html.parser.HTMLParser class in the Python standard library.
-
chat_archive.html.
text_to_html
(text, callback=None)[source]¶ Convert plain text to HTML.
Parameters: - text – A fragment of plain text (a string).
- callback – An optional callback that provides the caller a chance to pre-process text before it is encoded as HTML.
Returns: The HTML encoded text (a string).
This function replaces URLs with
<a href="...">
tags and escapes special characters, that’s it, nothing more.
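The behaviour described above (escape special characters, wrap URLs in anchor tags) could look roughly like the following sketch. The function name text_to_html_sketch is hypothetical, and applying the callback to the whole input is an assumption because the documentation doesn’t say at which point the real function invokes it.

```python
import html
import re

URL_PATTERN = re.compile(
    r'(http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+)'
)


def text_to_html_sketch(text, callback=None):
    """Hypothetical re-implementation: escape text and turn URLs into links."""
    if callback is not None:
        text = callback(text)  # assumption: pre-process the complete input
    parts = []
    # re.split() with one capturing group alternates plain text and URLs.
    for index, token in enumerate(URL_PATTERN.split(text)):
        escaped = html.escape(token)
        if index % 2 == 1:
            parts.append('<a href="%s">%s</a>' % (escaped, escaped))
        else:
            parts.append(escaped)
    return ''.join(parts)


result = text_to_html_sketch('1 < 2, see http://example.org/')
```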
-
class
chat_archive.html.
HTMLStripper
(*, convert_charrefs=True)[source]¶ A simple HTML to text converter based on html.parser.HTMLParser.
-
__call__
(data)[source]¶ Convert HTML to text.
Parameters: data – The HTML to convert to text (a string).
Returns: The converted text (a string).
This method calls compact_empty_lines() on the converted text to normalize superfluous empty lines caused by vertical whitespace emitted around block level elements like <div>, <p> and <pre>.
-
handle_charref
(value)[source]¶ Process a decimal or hexadecimal numeric character reference.
Parameters: value – The decimal or hexadecimal value (a string).
-
handle_entityref
(name)[source]¶ Process a named character reference.
Parameters: name – The name of the character reference (a string).
-
reset
()[source]¶ Reset the state of the
HTMLStripper
instance.
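A stripped-down converter in the same spirit might look like the sketch below. StripperSketch is a hypothetical class, not the actual HTMLStripper: it emits blank lines around the block-level tags listed in BLOCK_TAGS and uses a regular expression where the real class calls compact_empty_lines().

```python
import re
from html.parser import HTMLParser

BLOCK_TAGS = ['div', 'p', 'pre']


class StripperSketch(HTMLParser):
    """Hypothetical HTML to text converter in the spirit of HTMLStripper."""

    def __call__(self, data):
        self.reset()
        self.feed(data)
        self.close()
        text = ''.join(self.output)
        # Collapse runs of blank lines left behind by adjacent block elements.
        return re.sub(r'\n{3,}', '\n\n', text).strip()

    def reset(self):
        HTMLParser.reset(self)
        self.output = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.output.append('\n\n')

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.output.append('\n\n')

    def handle_data(self, data):
        # With convert_charrefs=True (the default) character references
        # arrive here already decoded.
        self.output.append(data)


text = StripperSketch()('<p>Hello <b>world</b> &amp; all</p><p>Bye</p>')
```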
-
chat_archive.html.keywords
¶
Utility functions for working with the HTML encoded text.
-
class
chat_archive.html.keywords.
KeywordHighlighter
(*args, **kw)[source]¶ A simple keyword highlighter for HTML based on html.parser.HTMLParser.
-
__init__
(*args, **kw)[source]¶ Initialize a KeywordHighlighter object.
Parameters: - keywords – A list of strings with keywords to highlight.
- highlight_template – A template string with the {text} placeholder that’s used to highlight keyword matches.
-
chat_archive.html.redirects
¶
Utility functions to pre-process URLs before rendering on a terminal.
In web browsers and chat clients the URLs behind hyperlinks are usually hidden, but in a terminal there’s no “out of band” mechanism to communicate the URL behind a hyperlink - the URL needs to appear literally in the text that is rendered to the terminal.
Given this requirement, I’ve become rather annoyed at Google prefixing every
URL they can get their hands on with https://www.google.com/url?q=…
because
this user hostile “encoding” obscures the intended URL with a lot of fluff that
I don’t care for.
This module contains the expand_url()
function to transform redirect
URLs into their target URL, the strip_redirects()
function to
transform all redirect URLs in a given text and RedirectStripper
to
transform all redirect URLs in a given HTML fragment.
-
chat_archive.html.redirects.
GOOGLE_REDIRECT_URL
= 'www.google.com/url'¶ The base URL of the Google redirect service (a string).
Note that the URL scheme is omitted on purpose, to enable a substring search for the Google redirect service regardless of whether a given URL is using the http:// or https:// scheme.
-
chat_archive.html.redirects.
URL_PATTERN
= re.compile('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+')¶ A compiled regular expression pattern to find URLs in text (credit: taken from urlregex.com).
-
chat_archive.html.redirects.
expand_url
(url)[source]¶ Expand a redirect URL to its target URL.
Parameters: url – The URL to expand (a string). Returns: The expanded URL (a string).
-
chat_archive.html.redirects.
strip_redirects
(text)[source]¶ Expand redirect URLs in the given text.
Parameters: text – The text to process (a string). Returns: The processed text (a string).
-
chat_archive.html.redirects.
strip_redirects_callback
(match)[source]¶ Apply
expand_url()
to the matched URL.
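The three functions above fit together roughly as in the following sketch. The names ending in _sketch are hypothetical, and reading the target from the q query parameter is an assumption based on the https://www.google.com/url?q=… form mentioned in the module description.

```python
import re
from urllib.parse import parse_qs, urlparse

GOOGLE_REDIRECT_URL = 'www.google.com/url'
URL_PATTERN = re.compile(
    r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
)


def expand_url_sketch(url):
    """Hypothetical: return the target of a Google redirect URL."""
    if GOOGLE_REDIRECT_URL in url:
        # parse_qs() percent-decodes the value of the q parameter.
        targets = parse_qs(urlparse(url).query).get('q')
        if targets:
            return targets[0]
    return url


def strip_redirects_sketch(text):
    """Hypothetical: expand every redirect URL found in the text."""
    return URL_PATTERN.sub(lambda match: expand_url_sketch(match.group(0)), text)


text = 'Link: https://www.google.com/url?q=https%3A%2F%2Fexample.com%2Fpage&sa=D'
result = strip_redirects_sketch(text)
```

URLs that don’t contain the redirect prefix pass through unchanged, so the substitution is safe to run over arbitrary text.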
-
class
chat_archive.html.redirects.
RedirectStripper
(*, convert_charrefs=True)[source]¶ Expand redirect URLs embedded in HTML.
This class uses html.parser.HTMLParser to parse HTML and expand any redirect URLs that it encounters to their target URL. The __call__() method provides an easy way to use this functionality.
chat_archive.models
¶
Database models for the chat-archive program based on SQLAlchemy.
The chat_archive.models
module defines the following database models for
the chat-archive program:
-
chat_archive.models.
metadata
= MetaData(bind=None)¶ Define an explicit naming convention to simplify future database migrations.
-
class
chat_archive.models.
Base
(**kwargs)¶ The most base type
-
__init__
(**kwargs)¶ A simple constructor that allows initialization from kwargs.
Sets attributes on the constructed instance using the names and values in kwargs.
Only keys that are present as attributes of the instance’s class are allowed. These could be, for example, any mapped columns or relationships.
-
-
chat_archive.models.
address_mapping
= Table('email_address_mapping', MetaData(bind=None), Column('contact_id', Integer(), ForeignKey('contacts.id'), table=<email_address_mapping>), Column('address_id', Integer(), ForeignKey('email_addresses.id'), table=<email_address_mapping>), schema=None)¶ Mapping table for many-to-many relationship between contacts and email addresses.
-
chat_archive.models.
telephone_number_mapping
= Table('telephone_number_mapping', MetaData(bind=None), Column('contact_id', Integer(), ForeignKey('contacts.id'), table=<telephone_number_mapping>), Column('telephone_number_id', Integer(), ForeignKey('telephone_numbers.id'), table=<telephone_number_mapping>), schema=None)¶ Mapping table for many-to-many relationship between contacts and telephone numbers.
-
class
chat_archive.models.
Account
(**kwargs)[source]¶ Database model for chat accounts.
-
id
¶ The primary key of the account (an integer).
-
backend
¶ The name of the backend that manages this account (a string).
-
name
¶ A user defined name for the account (a string).
-
contacts
¶ The contacts that have been imported using this account.
-
conversations
¶ The conversations that have been imported using this account.
-
name_is_significant
¶ True if the database contains multiple accounts with this backend, False otherwise.
-
__init__
(**kwargs)¶ A simple constructor that allows initialization from kwargs.
Sets attributes on the constructed instance using the names and values in kwargs.
Only keys that are present as attributes of the instance’s class are allowed. These could be, for example, any mapped columns or relationships.
-
-
class
chat_archive.models.
EmailAddress
(**kwargs)[source]¶ Database model for email addresses of chat contacts.
-
id
¶ The primary key of the email address (an integer).
-
value
¶ The email address itself (a string).
-
__repr__
()[source]¶ Render a human friendly representation of an
EmailAddress
object.
-
__str__
()[source]¶ Render a human friendly representation of an
EmailAddress
object.
-
__init__
(**kwargs)¶ A simple constructor that allows initialization from kwargs.
Sets attributes on the constructed instance using the names and values in kwargs.
Only keys that are present as attributes of the instance’s class are allowed. These could be, for example, any mapped columns or relationships.
-
-
class
chat_archive.models.
TelephoneNumber
(**kwargs)[source]¶ Database model for telephone numbers of chat contacts.
-
id
¶ The primary key of the telephone number (an integer).
-
value
¶ The telephone number itself (a string).
-
__repr__
()[source]¶ Render a human friendly representation of a
TelephoneNumber
object.
-
__str__
()[source]¶ Render a human friendly representation of a
TelephoneNumber
object.
-
__init__
(**kwargs)¶ A simple constructor that allows initialization from kwargs.
Sets attributes on the constructed instance using the names and values in kwargs.
Only keys that are present as attributes of the instance’s class are allowed. These could be, for example, any mapped columns or relationships.
-
-
class
chat_archive.models.
Contact
(**kwargs)[source]¶ Database model for chat contacts.
-
id
¶ The primary key of the contact (an integer).
-
account_id
¶ A foreign key to associate contacts with accounts.
-
email_addresses
¶ The email addresses of this contact.
-
telephone_numbers
¶ The telephone numbers of this contact.
-
sent_messages
¶ The chat messages that were sent by this contact.
-
received_messages
¶ The chat messages that were received by this contact.
-
first_name_is_unambiguous
¶ True if this first name unambiguously refers to a single contact, False otherwise.
-
full_name
¶ The full name of the contact (as an SQL expression).
-
__init__
(**kwargs)¶ A simple constructor that allows initialization from kwargs.
Sets attributes on the constructed instance using the names and values in kwargs.
Only keys that are present as attributes of the instance’s class are allowed. These could be, for example, any mapped columns or relationships.
-
-
class
chat_archive.models.
Conversation
(**kwargs)[source]¶ Database model for chat conversations.
-
id
¶ The primary key of the conversation (an integer).
-
account_id
¶ A foreign key to associate conversations with accounts.
-
__init__
(**kwargs)¶ A simple constructor that allows initialization from kwargs.
Sets attributes on the constructed instance using the names and values in kwargs.
Only keys that are present as attributes of the instance’s class are allowed. These could be, for example, any mapped columns or relationships.
-
is_group_conversation
¶ Whether the conversation is a group conversation (a boolean, defaults to
False
).
-
messages
¶ The chat messages that belong to this conversation.
-
have_unknown_senders
¶ Whether this conversation includes messages from unknown senders (a boolean).
-
-
class
chat_archive.models.
Message
(**kwargs)[source]¶ Database model for chat messages.
Note that the Message model doesn’t have a direct relationship to the Account model because these two models already have an indirect relationship via the Conversation model (in other words, messages are implicitly namespaced to accounts via conversations).
-
__init__
(**kwargs)¶ A simple constructor that allows initialization from kwargs.
Sets attributes on the constructed instance using the names and values in kwargs.
Only keys that are present as attributes of the instance’s class are allowed. These could be, for example, any mapped columns or relationships.
-
id
¶ The primary key of the chat message (an integer).
-
conversation_id
¶ A foreign key to associate chat messages with conversations.
-
recipient_id
¶ A foreign key that points to the contact who received this message (an integer or
None
).
-
raw
¶ The raw message text in a backend specific format (a string or None).
This field was added to the database schema because the Slack backend emits chat messages in the somewhat peculiar mrkdwn format which is “almost but not quite” human readable (in my opinion). When the Slack backend imports a new message, the following steps take place:
- The original message text is stored without any modifications in the raw column.
- A custom mrkdwn parser developed for the chat-archive program is used to convert raw to html (during the import).
- The value of html is used to generate the value of text (during the import). If this surprises you: I could have developed a second mrkdwn converter with a different output format, but that’s 150 lines of code I don’t care to repeat and html_to_text() works fine for this purpose 😇.
If the custom mrkdwn parser (which is bound to contain bugs) receives bug fixes in a new release of the chat-archive program then raw values can be used to regenerate text and html values.
-
text
¶ The human readable plain text of the chat message (a string).
This field cannot be None (NULL) and is expected to always contain a nonempty chat message text. This field is used during searches and when chat-archive --color=never is run.
-
html
¶ The formatted text of the chat message (a string or None).
When a chat message doesn’t contain text formatting or hyperlinks html will be None and text should be used instead. This field will be used when chat-archive --color=yes is run.
-
conversation
¶ The conversation that this chat message took place in (a Conversation object or None).
-
newer_messages
¶ Newer messages in the conversation (not yet sorted!).
-
older_messages
¶ Older messages in the conversation (not yet sorted!).
-
chat_archive.profiling
¶
Easy to use Python code profiling support.
-
class
chat_archive.profiling.
ProfileManager
(*args, **kw)[source]¶ Base class for easy to use Python code profiling support.
This class makes it easy to enable and disable Python code profiling and save the results to a file. You can use it in a with statement to guarantee that the profile is saved even when your program is interrupted with Control-C. So when your program is too slow and you’re wondering why, you can just restart the program with profiling enabled, wait for it to get slow, give it a while to collect profile statistics and then interrupt it with Control-C.
When profile_file is set the class initializer method will automatically call enable_profiling().
You can set the values of the profile_file, profiler and profiling_enabled properties by passing keyword arguments to the class initializer.
Here’s an overview of the ProfileManager class:
Superclass: PropertyManager
Special methods: __enter__(), __exit__() and __init__()
Public methods: disable_profiling(), enable_profiling() and save_profile()
Properties: can_save_profile, profile_file, profiler and profiling_enabled
-
__init__
(*args, **kw)[source]¶ Initialize a ProfileManager object.
Please refer to the PropertyManager documentation for details about the handling of arguments.
-
__exit__
(exc_type=None, exc_value=None, traceback=None)[source]¶ Disable code profiling and save the profile statistics when the
with
block ends.
-
can_save_profile
¶ True if save_profile() is expected to work, False otherwise.
-
profile_file
[source]¶ The pathname of a file where Python profile statistics should be saved (a string or None).
Note
The profile_file property is a writable_property. You can change the value of this property using normal attribute assignment syntax.
-
profiler
[source]¶ A profile.Profile object (if profile_file is set) or None.
Note
The profiler property is a writable_property. You can change the value of this property using normal attribute assignment syntax.
-
profiling_enabled
[source]¶ True if code profiling is enabled, False otherwise.
Note
The profiling_enabled property is a writable_property. You can change the value of this property using normal attribute assignment syntax.
-
save_profile
(filename=None)[source]¶ Save gathered profile statistics to a file.
Parameters: filename – The pathname of the profile file (a string or None). Defaults to the value of profile_file.
Raises: ValueError when profiling was never enabled or filename isn’t given and profile_file also isn’t set.
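The with statement guarantee described for this class can be demonstrated with a small stand-alone sketch. ProfilerSketch is a hypothetical class, and it uses cProfile.Profile (which offers enable() and disable() methods) rather than claiming to mirror the internals of ProfileManager.

```python
import cProfile
import os
import tempfile


class ProfilerSketch(object):
    """Hypothetical context manager that always saves the profile on exit."""

    def __init__(self, profile_file):
        self.profile_file = profile_file
        self.profiler = cProfile.Profile()

    def __enter__(self):
        self.profiler.enable()
        return self

    def __exit__(self, exc_type=None, exc_value=None, traceback=None):
        # Runs for normal exits and for exceptions such as KeyboardInterrupt.
        self.profiler.disable()
        self.profiler.dump_stats(self.profile_file)
        return False  # never swallow the exception


profile_file = os.path.join(tempfile.mkdtemp(), 'sketch.prof')
interrupted = False
try:
    with ProfilerSketch(profile_file):
        sum(n * n for n in range(1000))
        raise KeyboardInterrupt  # simulate Control-C
except KeyboardInterrupt:
    interrupted = True
saved = os.path.getsize(profile_file) > 0
```

The saved file can afterwards be inspected with the pstats module from the standard library.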
-
chat_archive.utils
¶
Utility functions for the chat-archive program.
-
chat_archive.utils.
ensure_directory_exists
(pathname)[source]¶ Create a directory if it doesn’t exist yet.
Parameters: pathname – The pathname of the directory (a string).
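A function with this contract can be written in a couple of lines, as in the sketch below. The name ensure_directory_exists_sketch is hypothetical, and creating missing parent directories as well is an assumption that the documentation above doesn’t confirm.

```python
import os
import tempfile


def ensure_directory_exists_sketch(pathname):
    """Hypothetical equivalent: create the directory and any missing parents."""
    # exist_ok=True makes repeated calls harmless (Python 3.2+).
    os.makedirs(pathname, exist_ok=True)


target = os.path.join(tempfile.mkdtemp(), 'chat-archive', 'profiles')
ensure_directory_exists_sketch(target)
ensure_directory_exists_sketch(target)  # second call is a no-op
```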
-
chat_archive.utils.
get_full_name
()[source]¶ Find the full name of the current user on the local system based on /etc/passwd.
Returns: A string with the full name of the current user or an empty string when this information is not available.
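On a UNIX system this kind of lookup is typically done through the pwd module, roughly as in the following sketch. The function name get_full_name_sketch is hypothetical, and taking the first comma separated GECOS field as the full name is a common convention rather than something the documentation above spells out.

```python
import os
import pwd


def get_full_name_sketch():
    """Hypothetical: read the GECOS field of the current user's passwd entry."""
    try:
        gecos = pwd.getpwuid(os.getuid()).pw_gecos
    except KeyError:
        # No passwd entry for this user ID.
        return ''
    # By convention the full name is the first comma separated GECOS field.
    return gecos.split(',')[0].strip()


full_name = get_full_name_sketch()
```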
-
chat_archive.utils.
get_secret
(options, value_option, name_option, description)[source]¶ Get a secret needed to connect to a chat service (like a password or API token).
Parameters: - options – A dictionary with configuration options.
- value_option – The name of the configuration option that defines the value of a secret (a string).
- name_option – The name of the configuration option that defines the name of a secret in ~/.password-store (a string). See also get_secret_from_store().
- description – A description of the type of secret that the operator will be prompted for (a string).
Returns: The secret (a string).
-
chat_archive.utils.
get_secret_from_store
(name, directory=None)[source]¶ Use
qpass
to get a secret from~/.password-store
.Parameters: - name – The name of a password or a search pattern that matches a single entry in the password store (a string).
- directory – The directory to use (a string, defaults to
~/.password-store
).
Returns: The secret (a string).
Raises: exceptions.ValueError
when the given name doesn’t match any entries or matches multiple entries in the password store.
Change log¶
The change log lists notable changes to the project:
Changelog¶
The purpose of this document is to list all of the notable changes to this project. The format was inspired by Keep a Changelog. This project adheres to semantic versioning.
Release 4.0.2 (2018-12-31)¶
Release 4.0.1 (2018-08-02)¶
Just before publishing this project yesterday I propagated a rename throughout the code base, rephrasing “password” as “secret” (my rationale being that “naming things is important” 😇). Unfortunately that rename was propagated a bit more thoroughly than I had intended, impacting the interaction with the Hangups API. This should be fixed in release 4.0.1. For posterity, this relates to the following exception:
AttributeError: 'GoogleAccountCredentials' object has no attribute 'get_password'
Release 4.0 (2018-08-01)¶
The initial public release! 🎉
Because I love giving mixed signals I’ve decided to use the version number 4.0
for this release (because four chat service backends are supported) but I’ve
added the “beta” trove classifier to the setup.py
script and I’ve added a
big fat disclaimer to the readme (see the status section) 😛.
While publishing the project I decided to be pragmatic and strip the version
control history, because in the first weeks of development I hard coded quite a
few secrets in the code base. Since then I’ve added support for configuration
files and even ~/.password-store
but of course those secrets remain in the
history…
Now I could have spent hours poring over tens of thousands of lines of patch output to remove those secrets without trashing the history. Instead I decided to do something more useful with my time, hence “pragmatic” above 😇.
PS. This is that “awesome new project” that I’ve been referring to in the humanfriendly changelog. Over the course of developing chat-archive I’ve moved more than six hundred lines of code to the humanfriendly package due to its general purpose nature (the HTML to ANSI conversion).