Django CAL-ACCESS documentation¶
Technical documentation for a collection of applications that make it easier to work with CAL-ACCESS, the jumbled, dirty and difficult government database that tracks campaign-finance and lobbying-activity in California politics. Built using the Django web framework.
This is a work in progress. It is maintained by the California Civic Data Coalition, an open-source team of journalists and computer programmers from news organizations across America. To learn more and get involved, read our backstory in the FAQ and reach out to our leadership team.
Table of contents¶
Installation Guide¶
This guide will walk you through the process of installing the latest official release of django-calaccess-processed-data so that you can incorporate CAL-ACCESS data into your own Django project.
If, instead, you want to install the raw source code or contribute as a developer, please refer to the “How to contribute” tutorial.
Warning
This library is intended to be plugged into a project created with the Django web framework. Before you can begin, you’ll need to have one up and running. If you don’t know how, check out the official Django documentation.
Installing the Django apps¶
The latest version of the application can be installed from the Python Package Index using pip.
$ pip install django-calaccess-processed-data
Like most Django applications, the app then needs to be added to the INSTALLED_APPS in your settings.py configuration file. You also need to include other Django apps it depends on:
INSTALLED_APPS = (
    # ... other apps up here ...
    'calaccess_raw',
    'calaccess_scraped',
    'calaccess_processed',
    'calaccess_processed_filings',
    'calaccess_processed_elections',
    'calaccess_processed_flatfiles',
    'calaccess_processed_campaignfinance',
    'opencivicdata.core.apps.BaseConfig',
    'opencivicdata.elections.apps.BaseConfig',
)
A little more about these dependencies:
calaccess_raw
- This app downloads and extracts the raw data files exported each night from the CAL-ACCESS database. The app then loads these files into your Django project’s database with minimal transformations. For more details, see the django-calaccess-raw-data section.
calaccess_scraped
- This app scrapes the CAL-ACCESS website and loads additional data not included in the nightly exports. For more details, see the django-calaccess-scraped-data section.
opencivicdata.core
- This app includes Django models and admin panels for the core data types of the Open Civic Data specification, including Person, Organization, Post and Membership.
opencivicdata.elections
- This app includes Django models and admin panels for election-related data types that have been provisionally included in the Open Civic Data specification.
Connecting to a local database¶
Also in the settings.py file, you will need to configure Django so it can connect to your database.
Note
Unlike a typical Django project, this application only supports PostgreSQL database backends. This is because we enlist specialized tools to load the immense amount of source data more quickly than Django typically allows. We haven’t developed those routines for SQLite and the other Django backends yet, but we might someday.
Before you begin, make sure you have a PostgreSQL server installed. If you don’t, now is the time to hit Google and figure out how. The official PostgreSQL documentation is another good place to start.
Once that’s handled, add a database connection string like this to your settings.py.
DATABASES = {
    'default': {
        'NAME': 'calaccess_processed',
        'ENGINE': 'django.db.backends.postgresql_psycopg2',
        'USER': 'your-username-here',
        'PASSWORD': 'your-password-here',
        'HOST': 'localhost',
        'PORT': '5432'
    }
}
Return to the command line. This will create a PostgreSQL database to store the data.
$ createdb calaccess_processed
Note
If you’d prefer to load the CAL-ACCESS data outside your default database, check out our guide to working with Django’s system for multiple databases.
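As a rough illustration of that multiple-database approach, the sketch below shows a Django database router that sends every app whose label starts with calaccess_ to a separate connection. The alias "calaccess" and the class name are assumptions made for this example, not names defined by these apps, and the prefix rule is a guess based on the INSTALLED_APPS list above.

```python
class CalAccessRouter:
    """Route models from the calaccess_* apps to one dedicated database."""

    # Hypothetical alias; it must match a key in your DATABASES setting.
    db_alias = "calaccess"
    # Assumption: every app to be routed has a label with this prefix.
    app_prefix = "calaccess_"

    def _is_calaccess(self, app_label):
        return app_label.startswith(self.app_prefix)

    def db_for_read(self, model, **hints):
        # Returning None tells Django this router has no opinion.
        if self._is_calaccess(model._meta.app_label):
            return self.db_alias
        return None

    def db_for_write(self, model, **hints):
        if self._is_calaccess(model._meta.app_label):
            return self.db_alias
        return None

    def allow_relation(self, obj1, obj2, **hints):
        # Leave relation checks to Django's default behavior.
        return None

    def allow_migrate(self, db, app_label, model_name=None, **hints):
        # Create calaccess_* tables only in the dedicated database.
        if self._is_calaccess(app_label):
            return db == self.db_alias
        return None
```

To try it, you would add a second entry under the chosen alias to DATABASES and list the router's dotted path in Django's DATABASE_ROUTERS setting.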
Loading the data¶
Now you’re ready to create the database tables with Django using its manage.py utility belt.
$ python manage.py migrate
Once everything is set up, the updatecalaccessrawdata command will download the latest bulk data release from the Secretary of State’s website and load it into your local database.
$ python manage.py updatecalaccessrawdata
Warning
This will take an hour or more. Go grab some coffee.
Because the nightly raw export is incomplete, we have to scrape additional data from the CAL-ACCESS website. Use the scrapecalaccess command to kick off this process, either after updatecalaccessrawdata finishes or in a separate terminal window:
$ python manage.py scrapecalaccess
Once the raw CAL-ACCESS data is loaded and the scrape has finished, you can transform all this messy data and load it into a simpler structure with the processcalaccessdata command:
$ python manage.py processcalaccessdata
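If you would rather drive the three steps from Python than type them one by one, a minimal sketch might look like the following. It assumes a configured Django project; the run_pipeline helper and its run_command parameter are hypothetical names introduced here so the ordering can be exercised without a database.

```python
# Order taken from this guide: raw update, then scrape, then processing.
PIPELINE = (
    "updatecalaccessrawdata",
    "scrapecalaccess",
    "processcalaccessdata",
)


def run_pipeline(run_command=None):
    """Run each command in order and return the names that finished."""
    if run_command is None:
        # Imported lazily so this module can be imported without Django set up.
        from django.core.management import call_command
        run_command = call_command
    finished = []
    for name in PIPELINE:
        # Any exception raised here stops the loop, so later steps
        # never run on top of a failed earlier step.
        run_command(name)
        finished.append(name)
    return finished
```

Run this way, updatecalaccessrawdata still shows its confirmation prompt unless the --noinput option described later in this document is passed.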
Custom Project Settings¶
The settings listed below allow you to customize the behavior of our apps to suit your needs. They should be declared in your Django project’s settings.py file.
Read more about Django settings here.
CALACCESS_DATA_DIR¶
The local directory where the calaccess_raw and calaccess_processed management commands will download, extract and write files. By default, this will be {BASE_DIR}/data/, where BASE_DIR is a setting pre-populated in settings.py when you set up a new Django project.
You can change this location, say to the tmp/ directory at your file system’s root, by adding a line to settings.py:
CALACCESS_DATA_DIR = '/tmp/'
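The fallback described above can be pictured with a small stand-alone sketch. The get_data_dir helper is a hypothetical name, and SimpleNamespace objects stand in for Django's settings; this mirrors the documented default rather than reproducing the apps' own code.

```python
import os
from types import SimpleNamespace


def get_data_dir(settings):
    """Return CALACCESS_DATA_DIR if it is set, else BASE_DIR/data."""
    default = os.path.join(settings.BASE_DIR, "data")
    return getattr(settings, "CALACCESS_DATA_DIR", default)


# Stand-ins for a project's settings module (illustrative paths only).
defaults_only = SimpleNamespace(BASE_DIR="/srv/myproject")
customized = SimpleNamespace(BASE_DIR="/srv/myproject", CALACCESS_DATA_DIR="/tmp/")

print(get_data_dir(defaults_only))  # falls back to BASE_DIR/data
print(get_data_dir(customized))     # uses the explicit override
```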
CALACCESS_STORE_ARCHIVE¶
Enable archiving of all .ZIP, .TSV and .CSV files in order to preserve each snapshot of the raw and processed CAL-ACCESS data.
By default, archiving is disabled. You can enable it by adding this line to settings.py:
CALACCESS_STORE_ARCHIVE = True
If you enable archiving, files will be saved in your Django project’s default storage system, which you can also customize. For example, we use django-storages to upload our archived files to an Amazon Simple Storage Service (S3) bucket.
If you enable archiving without configuring DEFAULT_FILE_STORAGE, files will be stored in the directory specified in your Django project’s MEDIA_ROOT.
You can read more about how Django manages file storage here.
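For readers who want to try the S3 route mentioned above, a settings fragment along these lines is one possibility. It assumes the django-storages package and its S3 backend are installed and that AWS credentials are supplied separately; the bucket name is a placeholder, not one used by this project.

```python
# settings.py (fragment, not a complete settings module)

# Keep every downloaded, extracted and cleaned file instead of overwriting.
CALACCESS_STORE_ARCHIVE = True

# Assumption: django-storages is installed with its S3 dependencies.
DEFAULT_FILE_STORAGE = "storages.backends.s3boto3.S3Boto3Storage"

# Hypothetical bucket name; replace it with a bucket you control.
AWS_STORAGE_BUCKET_NAME = "my-calaccess-archive"
```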
The Django apps¶
The full set of features available in each of our Django applications.
django-calaccess-raw-data¶
A Django app to download, extract and load campaign-finance and lobbying-activity data from the California Secretary of State’s CAL-ACCESS database.
Management commands¶
The raw-data app includes the following commands for processing and verifying the raw data released in the CAL-ACCESS nightly exports.
As with any Django app management command, these can be invoked on the command line or called within your Python code.
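As a minimal sketch of the Python route, Django's call_command accepts the same flags you would type on the command line when they are passed as strings. The two helper functions below are hypothetical names for this example; the --noinput and --keep-files flags are the ones documented for updatecalaccessrawdata in this section.

```python
def build_update_args(noinput=False, keep_files=False):
    """Assemble the argument list for updatecalaccessrawdata."""
    args = ["updatecalaccessrawdata"]
    if noinput:
        args.append("--noinput")     # skip the confirmation prompt
    if keep_files:
        args.append("--keep-files")  # keep the downloaded source files
    return args


def run_update(noinput=False, keep_files=False):
    """Invoke the command inside a configured Django project."""
    # Imported lazily so the helper above can be used without Django set up.
    from django.core.management import call_command
    call_command(*build_update_args(noinput, keep_files))
```

Passing flags as strings sidesteps any guess about the keyword names the command's parser assigns to its options.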
updatecalaccessrawdata¶
This is the master command. It brings together all of the other management commands listed below to download, unzip, clean and load the latest snapshot of the CAL-ACCESS database.
Running the entire routine is as simple as this.
$ python manage.py updatecalaccessrawdata
This command will either:
- Update your copy of the CAL-ACCESS data to the latest snapshot on the California Secretary of State’s website
- Or complete your previously interrupted update, if possible.
You can skip the download’s confirmation prompt using Django’s standard --noinput option.
$ python manage.py updatecalaccessrawdata --noinput
The source files downloaded as part of the process will be deleted unless the --keep-files option is provided.
$ python manage.py updatecalaccessrawdata --keep-files
The other options are below.
usage: manage.py updatecalaccessrawdata [-h] [--version] [-v {0,1,2,3}]
[--settings SETTINGS]
[--pythonpath PYTHONPATH]
[--traceback] [--no-color]
[--keep-files] [--noinput]
[-a APP_NAME]
Download, unzip, clean and load the latest CAL-ACCESS database ZIP
optional arguments:
-h, --help show this help message and exit
--version show program's version number and exit
-v {0,1,2,3}, --verbosity {0,1,2,3}
Verbosity level; 0=minimal output, 1=normal output,
2=verbose output, 3=very verbose output
--settings SETTINGS The Python path to a settings module, e.g.
"myproject.settings.main". If this isn't provided, the
DJANGO_SETTINGS_MODULE environment variable will be
used.
--pythonpath PYTHONPATH
A directory to add to the Python path, e.g.
"/home/djangoprojects/myproject".
--traceback Raise on CommandError exceptions
--no-color Don't colorize the command output.
--keep-files Keep zip, unzipped, TSV and CSV files
--noinput Update or resume previous update without asking
permission
-a APP_NAME, --app-name APP_NAME
Name of Django app with models into which data will be
imported (if not calaccess_raw)
Note
The updatecalaccessrawdata command overwrites the previously downloaded, extracted and cleaned files in the application’s download directory.
cleancalaccessrawfile¶
Clean a source CAL-ACCESS TSV file and reformat it as a CSV. A component of the master updatecalaccessrawdata command.
Provide the name of the TSV file you would like to process. The command will attempt to find it in the application’s download directory.
$ python manage.py cleancalaccessrawfile RCPT_CD.TSV
The original TSV file will be deleted in favor of the new CSV unless the --keep-file option is provided.
$ python manage.py cleancalaccessrawfile RCPT_CD.TSV --keep-file
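To give a feel for what a TSV-to-CSV cleaning step involves, here is a simplified sketch using only Python's standard library. It is not the command's actual implementation: the clean_tsv function is a hypothetical name, and treating a row as unparseable when its field count differs from the header's is an assumption made for illustration.

```python
import csv
import io


def clean_tsv(tsv_text):
    """Convert tab-delimited text to CSV, setting aside malformed lines."""
    reader = csv.reader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    errors = []

    header = next(reader)
    writer.writerow(header)
    # Line 1 is the header, so data rows start at line 2.
    for line_number, row in enumerate(reader, start=2):
        if not row:
            continue  # skip empty lines
        if len(row) != len(header):
            # Assumption: a wrong field count marks the line as unparseable.
            errors.append((line_number, row))
            continue
        writer.writerow(row)
    return out.getvalue(), errors
```

The second return value plays the role of an error log: the lines kept out of the cleaned output, together with their line numbers.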
usage: manage.py cleancalaccessrawfile [-h] [--version] [-v {0,1,2,3}]
[--settings SETTINGS]
[--pythonpath PYTHONPATH] [--traceback]
[--no-color] [--keep-file]
file_name
Clean a source CAL-ACCESS TSV file and reformat it as a CSV
positional arguments:
file_name Name of the TSV file to be cleaned and discarded for a
CSV
optional arguments:
-h, --help show this help message and exit
--version show program's version number and exit
-v {0,1,2,3}, --verbosity {0,1,2,3}
Verbosity level; 0=minimal output, 1=normal output,
2=verbose output, 3=very verbose output
--settings SETTINGS The Python path to a settings module, e.g.
"myproject.settings.main". If this isn't provided, the
DJANGO_SETTINGS_MODULE environment variable will be
used.
--pythonpath PYTHONPATH
A directory to add to the Python path, e.g.
"/home/djangoprojects/myproject".
--traceback Raise on CommandError exceptions
--no-color Don't colorize the command output.
--keep-file Keep original TSV file
Note
The cleancalaccessrawfile command overwrites the CSV files previously processed from the original TSV files.
downloadcalaccessrawdata¶
Download the latest CAL-ACCESS database ZIP. A component of the master updatecalaccessrawdata command.
Here is how to run the command.
$ python manage.py downloadcalaccessrawdata
You will then see a prompt with the release date and size of the latest zip of raw CAL-ACCESS data files available to download from the California Secretary of State.
If your previous download did not complete and the same snapshot is still available to download, you will be prompted to resume your previous download.
You can skip the download’s confirmation prompt using Django’s standard --noinput option.
$ python manage.py downloadcalaccessrawdata --noinput
The other options are below.
The server hosting the ZIP doesn’t always provide the most up-to-date resource (as we have documented). As such, a CommandError will be raised under either of the following conditions:
- If the actual size of the ZIP does not match the value of the Content-Length in the HEAD response.
- If the Last-modified of HEAD and GET are more than five minutes apart.
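The two conditions above can be sketched as a small stand-alone check. The check_download function and its arguments are hypothetical; the real command raises a CommandError, whereas this sketch simply returns a list of problems so it can run without Django.

```python
from datetime import timedelta
from email.utils import parsedate_to_datetime

# Tolerance taken from the condition described in this section.
MAX_DRIFT = timedelta(minutes=5)


def check_download(head_headers, get_headers, actual_size):
    """Return a list of problems; an empty list means the ZIP looks consistent."""
    problems = []

    # Condition 1: the downloaded size must match the HEAD Content-Length.
    expected_size = int(head_headers["Content-Length"])
    if actual_size != expected_size:
        problems.append(
            "size mismatch: expected %d bytes, got %d" % (expected_size, actual_size)
        )

    # Condition 2: HEAD and GET Last-Modified must be within five minutes.
    head_modified = parsedate_to_datetime(head_headers["Last-Modified"])
    get_modified = parsedate_to_datetime(get_headers["Last-Modified"])
    if abs(get_modified - head_modified) > MAX_DRIFT:
        problems.append("Last-Modified values are more than five minutes apart")

    return problems
```

In a management command, a non-empty result would be the natural place to raise the error.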
usage: manage.py downloadcalaccessrawdata [-h] [--version] [-v {0,1,2,3}]
[--settings SETTINGS]
[--pythonpath PYTHONPATH]
[--traceback] [--no-color]
[--noinput] [--force-restart]
Download the latest CAL-ACCESS database ZIP
optional arguments:
-h, --help show this help message and exit
--version show program's version number and exit
-v {0,1,2,3}, --verbosity {0,1,2,3}
Verbosity level; 0=minimal output, 1=normal output,
2=verbose output, 3=very verbose output
--settings SETTINGS The Python path to a settings module, e.g.
"myproject.settings.main". If this isn't provided, the
DJANGO_SETTINGS_MODULE environment variable will be
used.
--pythonpath PYTHONPATH
A directory to add to the Python path, e.g.
"/home/djangoprojects/myproject".
--traceback Raise on CommandError exceptions
--no-color Don't colorize the command output.
--noinput Download the ZIP archive without asking permission
--force-restart, --restart
Force re-start (overrides auto-resume).
Note
The downloadcalaccessrawdata command overwrites the previously downloaded zip file.
extractcalaccessrawfiles¶
Extract the CAL-ACCESS raw data files from the downloaded ZIP. A component of the master updatecalaccessrawdata command.
Here is how to run the command.
$ python manage.py extractcalaccessrawfiles
The downloaded zip file will be deleted unless the --keep-files option is provided.
$ python manage.py extractcalaccessrawfiles --keep-files
usage: manage.py extractcalaccessrawfiles [-h] [--version] [-v {0,1,2,3}]
[--settings SETTINGS]
[--pythonpath PYTHONPATH]
[--traceback] [--no-color]
[--keep-files]
Extract the CAL-ACCESS raw data files from the database export ZIP
optional arguments:
-h, --help show this help message and exit
--version show program's version number and exit
-v {0,1,2,3}, --verbosity {0,1,2,3}
Verbosity level; 0=minimal output, 1=normal output,
2=verbose output, 3=very verbose output
--settings SETTINGS The Python path to a settings module, e.g.
"myproject.settings.main". If this isn't provided, the
DJANGO_SETTINGS_MODULE environment variable will be
used.
--pythonpath PYTHONPATH
A directory to add to the Python path, e.g.
"/home/djangoprojects/myproject".
--traceback Raise on CommandError exceptions
--no-color Don't colorize the command output.
--keep-files Keep downloaded zipped files
Note
The extractcalaccessrawfiles command overwrites the previously extracted TSV files.
loadcalaccessrawfile¶
Load a clean CAL-ACCESS CSV file into a database model. A component of the master updatecalaccessrawdata command.
The command expects the name of the Django database model where the file will be loaded.
$ python manage.py loadcalaccessrawfile RcptCd
The model will attempt to load its default CSV file unless one is provided with the --csv argument.
$ python manage.py loadcalaccessrawfile RcptCd --csv=/home/jerry/Data/MyFile.csv
usage: manage.py loadcalaccessrawfile [-h] [--version] [-v {0,1,2,3}]
[--settings SETTINGS]
[--pythonpath PYTHONPATH] [--traceback]
[--no-color] [--c CSV] [--keep-file]
[-a APP_NAME]
model_name
Load clean CAL-ACCESS CSV file into a database model
positional arguments:
model_name Name of the model into which data will be loaded
optional arguments:
-h, --help show this help message and exit
--version show program's version number and exit
-v {0,1,2,3}, --verbosity {0,1,2,3}
Verbosity level; 0=minimal output, 1=normal output,
2=verbose output, 3=very verbose output
--settings SETTINGS The Python path to a settings module, e.g.
"myproject.settings.main". If this isn't provided, the
DJANGO_SETTINGS_MODULE environment variable will be
used.
--pythonpath PYTHONPATH
A directory to add to the Python path, e.g.
"/home/djangoprojects/myproject".
--traceback Raise on CommandError exceptions
--no-color Don't colorize the command output.
--c CSV, --csv CSV Path to comma-delimited file to be loaded. Defaults to
one associated with model.
--keep-file Keep clean CSV file after loading
-a APP_NAME, --app-name APP_NAME
Name of Django app with models into which data will be
imported (if other not calaccess_raw)
Note
The loadcalaccessrawfile command deletes any data previously loaded into the calaccess_raw models before loading in the current data.
Models for tracking updates¶
The raw-data app also keeps track of each snapshot of the CAL-ACCESS database released by the California Secretary of State, including its release date and byte size, as well as the activity of the management commands that process this data.
This tracking information is stored in the data tables outlined below.
Note
By default, the raw-data app does not archive previous versions of the CAL-ACCESS database. Rather, with each call to the management commands, the data files they process are overwritten.
You can configure the raw-data app to keep each copy of the zip file downloaded from the California Secretary of State, as well as the individual raw .tsv files and cleaned .csv files, by setting CALACCESS_STORE_ARCHIVE to True in settings.py:
# in settings.py
CALACCESS_STORE_ARCHIVE = True
By default, the older copies of these files will be saved to the path specified by your Django project’s MEDIA_ROOT setting (more on that here). However, if you’ve implemented a custom storage system or installed a third-party app (such as django-storages), that should work too.
RawDataVersion¶
Versions of CAL-ACCESS raw source data, typically released every day.
Name | Type | Unique key | Definition |
---|---|---|---|
id | Integer | Yes | Auto-incrementing unique identifier of versions |
release_datetime | DateTime | No | (Unique) date and time the version of the CAL-ACCESS database was released (value of Last-Modified field in HTTP response header) |
expected_size | Integer | No | The expected size of the downloaded CAL-ACCESS zip, as specified in the content-length field in HTTP response header |
update_start_datetime | DateTime | No | Date and time when the update to the CAL-ACCESS version started |
update_finish_datetime | DateTime | No | Date and time when the update to the CAL-ACCESS version finished |
download_start_datetime | DateTime | No | Date and time when the download of the CAL-ACCESS database export started |
download_finish_datetime | DateTime | No | Date and time when the download of the CAL-ACCESS database export finished |
extract_start_datetime | DateTime | No | Date and time when extraction of the CAL-ACCESS data files started |
extract_finish_datetime | DateTime | No | Date and time when extraction of the CAL-ACCESS data files finished |
download_zip_archive | FileField | No | An archive of the original zipped file downloaded from CAL-ACCESS |
clean_zip_archive | FileField | No | An archive zip of cleaned (and error log) files |
clean_zip_size | Integer | No | The size of the zip containing all cleaned raw data files and error logs |
download_zip_size | Integer | No | The actual size of the downloaded CAL-ACCESS zip after the download completed |
.download_completed | Check if the download of the version's zip file completed. Return True or False. |
.download_stalled | Check if the download of the version's zip file started but did not complete. Return True or False. |
.download_file_count | Returns the count of files included in the version's downloaded zip. |
.download_record_count | Returns the count of records in the version's downloaded files. |
.clean_file_count | Returns the count of files cleaned in the version. |
.clean_record_count | Returns the count of records in the version's cleaned files. |
.error_file_count | Returns the count of cleaned files with errors in the version. |
.error_count | Returns the count of cleaning errors in the version. |
.extract_completed | Check if the extract of files from the downloaded zip completed. Return True or False. |
.extract_stalled | Check if the extract of files from the downloaded zip started but did not complete. Return True or False. |
.update_completed | Check if the database update to the version completed. Return True or False. |
.update_stalled | Check if the database update to the version started but did not complete. Return True or False. |
.pretty_clean_size() | Returns a prettified version (e.g., "725M") of the zip of clean data files and error logs. |
.pretty_download_size() | Returns a prettified version (e.g., "725M") of the actual size of the downloaded zip. |
.pretty_expected_size() | Returns a prettified version (e.g., "725M") of the expected size of the downloaded zip. |
.completed()
Filters down the QuerySet to return only versions that have a completed update.
$ python manage.py shell
>>> from calaccess_raw.models.tracking import RawDataVersion
>>> RawDataVersion.objects.completed()
<QuerySet [<RawDataVersion: 2016-08-15 11:20:29+00:00>, <RawDataVersion: 2016-08-11 11:20:24+00:00>, <RawDataVersion: 2016-08-09 11:20:49+00:00>, <RawDataVersion: 2016-08-05 11:20:27+00:00>, <RawDataVersion: 2016-08-04 11:20:28+00:00>, <RawDataVersion: 2016-07-31 11:20:29+00:00>, <RawDataVersion: 2016-07-30 11:20:42+00:00>, <RawDataVersion: 2016-07-29 11:20:30+00:00>, <RawDataVersion: 2016-07-28 11:20:30+00:00>, <RawDataVersion: 2016-07-26 11:20:28+00:00>, <RawDataVersion: 2016-07-22 11:20:30+00:00>, <RawDataVersion: 2016-07-05 11:20:30+00:00>, <RawDataVersion: 2016-07-04 11:20:30+00:00>, <RawDataVersion: 2016-06-28 11:20:28+00:00>, <RawDataVersion: 2016-06-14 11:20:49+00:00>, <RawDataVersion: 2016-06-10 11:20:26+00:00>, <RawDataVersion: 2016-06-08 11:20:29+00:00>, <RawDataVersion: 2016-05-27 11:20:28+00:00>, <RawDataVersion: 2016-05-21 15:35:11+00:00>, <RawDataVersion: 2016-05-20 13:59:57+00:00>, '...(remaining elements truncated)...']>
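The completed and stalled checks listed for RawDataVersion all follow the same start/finish pattern. The dataclass below is a stand-in written for this example, not the Django model itself; it sketches how such properties could be derived from a pair of timestamps, using the update fields as the example.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class VersionTimestamps:
    """Illustrative stand-in holding one start/finish pair."""

    update_start_datetime: Optional[datetime] = None
    update_finish_datetime: Optional[datetime] = None

    @property
    def update_completed(self):
        # Completed: the update both started and finished.
        return (
            self.update_start_datetime is not None
            and self.update_finish_datetime is not None
        )

    @property
    def update_stalled(self):
        # Stalled: the update started but never recorded a finish.
        return (
            self.update_start_datetime is not None
            and self.update_finish_datetime is None
        )
```

Under this reading, a version that has not started updating is neither completed nor stalled.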
RawDataFile¶
Data files included in the given version of the CAL-ACCESS raw source data.
Name | Type | Unique key | Definition |
---|---|---|---|
id | Integer | Yes | Auto-incrementing unique identifier of the file |
file_name | String (up to 100) | No | Name of the raw source data file without extension |
download_records_count | Integer | No | Count of records in the original file downloaded from CAL-ACCESS |
clean_records_count | Integer | No | Count of records in the cleaned file generated by calaccess_raw |
load_records_count | Integer | No | Count of records loaded from the cleaned file into calaccess_raw's data model |
download_columns_count | Integer | No | Count of columns in the original file downloaded from CAL-ACCESS |
clean_columns_count | Integer | No | Count of columns in the cleaned file generated by calaccess_raw |
load_columns_count | Integer | No | Count of columns on the loaded calaccess_raw data model |
download_file_archive | FileField | No | An archive of the original raw data file downloaded from CAL-ACCESS. |
clean_file_archive | FileField | No | An archive of the raw data file after being cleaned. |
clean_file_size | Integer | No | Size of the .CSV file |
download_file_size | Integer | No | Size of the .TSV file |
error_log_archive | FileField | No | An archive of the error log containing lines from the original download file that could not be parsed and are excluded from the cleaned file. |
error_count | Integer | No | Count of records in the original download that could not be parsed and are excluded from the cleaned file. |
version_id | Integer | No | Foreign key referencing the version of the raw source data in which the file was included. |
clean_start_datetime | DateTime | No | Date and time when the cleaning of the file started |
clean_finish_datetime | DateTime | No | Date and time when the cleaning of the file finished |
load_start_datetime | DateTime | No | Date and time when the loading of the file started |
load_finish_datetime | DateTime | No | Date and time when the loading of the file finished |
.model() | Returns the RawDataFile's corresponding CalAccess database model object. |
.pretty_clean_file_size | Returns a prettified version (e.g., "725M") of the cleaned file's size. |
.pretty_download_file_size | Returns a prettified version (e.g., "725M") of the downloaded file's size. |
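The prettified sizes mentioned in these tables (e.g., "725M") can be approximated with a few lines of plain Python. The pretty_size function is a hypothetical helper for illustration; the apps may well rely on a different routine or rounding rule, and this sketch assumes 1024-based units and truncates rather than rounds.

```python
def pretty_size(num_bytes):
    """Format a byte count with a one-letter unit, e.g. 725M."""
    units = ["B", "K", "M", "G", "T"]
    size = float(num_bytes)
    for unit in units:
        # Stop at the first unit that keeps the number under 1024,
        # or at the largest unit available.
        if size < 1024 or unit == units[-1]:
            return "%d%s" % (size, unit)  # %d truncates the fraction
        size /= 1024


print(pretty_size(725 * 1024 ** 2))
```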
Changelog¶
2.0.0 (August 2018)¶
- Dropped MySQL support
- Refactored clean and loading commands for readability
- Added Django 2.1 support
- Added Python 3.7 support
1.6.2 (October 2017)¶
- Renamed raw data zipfile as raw.zip to match the language on our revamped downloads page
1.6.1 (September 2017)¶
- Upgrade to v2.0.0 of django-postgres-copy and refactor export commands accordingly
- Require Django 1.10 or above
- Correct documentation for a number of fields
1.6.0 (July 2017)¶
- Deprecate --test, --use-test-data options from updatecalaccessrawdata command.
- Renamed custom Django project setting CALACCESS_DOWNLOAD_DIR to CALACCESS_DATA_DIR.
- Removed CALACCESS_TEST_DOWNLOAD_DIR project setting.
- Extract and track any .TSV file regardless of location in download .ZIP directory tree.
1.5.2 (April 2017)¶
- Fix duplicate updates. Only create a new RawDataVersion if:
  - Content-length from HEAD differs from expected_size on previous version, or,
  - Last-modified is at least five minutes more recent than release_datetime on previous version.
1.5.1 (April 2017)¶
- Proceed with download of ZIP file as long as last-modified datetimes in HEAD and GET requests are within five minutes of each other.
1.5.0 (April 2017)¶
- Django 1.11 compatibility.
- Fix check for existing clean zipped file when resuming.
- Fix message on response status code log.
- Skip dropping/re-adding of database table constraints and indexes when loading into MySQL (transactional DDL statements are not supported).
1.4.9 (March 2017)¶
- Reset auto-increment fields after truncating database tables in postgres.
- Add prefixes on tracking model admins.
- When making requests to sos.ca.gov, log HTTP status code and reason and raise HTTP error if bad status.
1.4.8 (January 2017)¶
- Upgrade to csvkit version 1.0.
1.4.7 (December 2016)¶
- Fixed search field on admins for models with ForeignKey fields.
1.4.6 (November 2016)¶
- Upgraded to latest version of django-postgres-copy
- Small improvements to CAL-ACCESS field documentation
- Small expansion of unittests
- Clean up of migrations
1.4.5 (September 2016)¶
- Copyediting of CAL-ACCESS form documentation
1.4.2 (late-August 2016)¶
- Docstring edits
1.4.1 (late-August 2016)¶
- Increase max character length on ReceivedFilingsCd fields.
- Prevent unnecessary download of zip when resuming updatecalaccessrawdata.
- Include release datetimes in log when downloadcalaccessrawdata and updatecalaccessrawdata versions are incompatible.
1.4.0 (mid-August 2016)¶
- Added zipping up and archiving of cleaned CSVs and error logs.
  - Added RawDataVersion.clean_zip_archive FileField.
  - Renamed RawDataVersion.zip_file_archive to RawDataVersion.download_zip_archive.
- Smaller clean data files (removed unnecessary quote characters).
- Improvements to tracking models
  - Replaced RawDataCommand model with datetime fields and related properties
    - Added to RawDataVersion instances
      - .update_start_datetime and .update_finish_datetime to store version’s most recent update start and finish datetimes.
      - .update_completed returns True if most recent update to version started and finished.
      - .update_stalled returns True if most recent update to version started but did not finish.
      - .download_start_datetime and .download_finish_datetime to store version’s most recent download start and finish datetimes.
      - .download_completed returns True if most recent download of version started and finished.
      - .download_stalled returns True if most recent download version started but did not finish.
      - .completed() QuerySet method to RawDataVersion to get all versions where the update completed.
    - Added to RawDataFile instances
      - .clean_start_datetime and .clean_finish_datetime to store raw file’s most recent clean start and finish datetimes.
      - .load_start_datetime and .load_finish_datetime to store raw file’s most recent load start and finish datetimes.
  - Expanded file size tracking
    - Renamed .size to .expected_size on RawDataVersion instances.
    - Added .download_zip_size to RawDataVersion instances.
    - Added .clean_zip_size to RawDataVersion instances.
    - Added methods to get a pretty version (e.g., 723M) of each file size field
      - Added to RawDataVersion instances
        - .pretty_expected_size()
        - .pretty_download_size()
        - .pretty_clean_size()
      - Added to RawDataFile instances
        - .pretty_download_file_size()
        - .pretty_clean_file_size()
    - Raise CommandError if completed download file size is not the same as expected size.
    - Added RawDataVersion properties to calculate file and record counts:
      - .download_file_count
      - .download_record_count
      - .clean_file_count
      - .clean_record_count
      - .error_file_count
      - .error_count
- Added extractcalaccessrawfiles management command for unzipping and extracting raw data files from downloaded CAL-ACCESS database export.
  - Start and finish times stored in .start_extract_datetime and .finish_extract_datetime on RawDataVersion instances.
- Bug fixes.
  - In downloadcalaccessrawdata, skip download if the size of the local zip file is equal to or bigger than the expected zip file size.
  - Because the server hosting the ZIP doesn’t always provide the most up-to-date resource (as we have documented), a CommandError will be raised under any of the following conditions:
    - If downloadcalaccessrawdata is not called from the command-line (presumably, then, it was called by updatecalaccessrawdata), and the RawDataVersion instance of the download command doesn’t match the most recently started update.
    - If the ETag in the initial HEAD request made by downloadcalaccessrawdata does not match the ETag in the subsequent GET request.
    - If the actual size of the ZIP does not match the value of the Content-Length in the HEAD response.
  - If downloadcalaccessrawdata raises any of the above errors, updatecalaccessrawdata will wait five minutes and try again.
  - When archiving zips and files, open in binary ('rb') mode.
  - In cleancalaccessrawfile, fixed skipping of empty lines for Python 3.5.
- Support for Django 1.10.
1.3.0 (July 2016)¶
- Now distributing on wheels.
- Added error_count to the output of reportcalaccessrawdata and excluded any unspecified fields.
- Added model property to RawDataFile that returns the CalAccess model object.
1.2.0 (July 2016)¶
- Enhancements to tracking models
  - Zero pad datetime parts of archive dir (for better sorting)
  - Calculate and store load_columns_count and load_records_count in loadcalaccessrawfile
  - Added error_count and error_log_archive fields to RawDataFile in order to track bad line parses during cleancalaccessrawfile.
  - Added download_file_size and clean_file_size to RawDataFile.
- Enhancements to CalAccess models
  - Added inactive models group for CAL-ACCESS tables that are empty or apparently no longer in use.
  - Added a CalAccessMetaClass to automatically configure meta attributes common to all models.
  - Added a custom admin for every model.
  - Model verbose names are pre-fixed with model groups
  - Edits to model doc strings.
- Enhancements to management commands
  - Added standard logging.
  - Added a logger.info to the end of the updatecalaccessrawdata command to allow sending of emails when finished
  - Edits to command doc strings.
- More tests
  - Test to confirm that any field included in a model’s UNIQUE_KEY attribute actually exists on the model.
  - Test to confirm that every model has a custom admin.
- Bug fixes
  - Fixed numbers in clean_records_count for RawDataFile.
  - Fixed line numbers logged in errors.csv files.
  - Write output of reportcalaccessrawdata to data directory instead of REPO_DIR, which may not be in settings.
1.1.0 (late June 2016)¶
- When --noinput is invoked for updatecalaccessrawdata, exit if previously updated to the currently available version.
- Enforce lowercase UNIQUE_KEY settings on models.
- Removed unnecessary pretty amount model methods as part of driving common.py models file test coverage up to 100%.
1.0.2 (early June 2016)¶
- Include migrations in official package.
- Fix verbose_name for RawDataFile.clean_file_archive.
1.0.0 (May 2016)¶
- Enhanced resume behavior
- Allow previously interrupted updates to resume at any stage of the process: downloading, cleaning or loading.
- Users will be prompted to resume (if possible). User may decline and re-start the entire update.
- Removed
--resume-download
option fromupdatecalaccessrawdata
anddownloadcalaccessrawdata
in favor of prompting the user to resume. - Removed
--database
option from all commands. Multi-database users are encouraged to use Django’s database routers.
- Raw data file archiving
- Added CALACCESS_STORE_ARCHIVE setting. When enabled, management commands will save each version of the downloaded .zip file, the extracted .tsv files and cleaned .csv files to the Django project’s MEDIA_ROOT.
- Added FileFields to RawDataVersion and RawDataFile in order to link the database records with the archived files they reference.
- Completed documentation of all 80 raw data models and 1,467 fields
- Defined hundreds of choices for 182 look-up fields.
- Published expanded Django project documentation. Added re-directs from old app-specific documentation.
- Integrated references to official documents and filing forms into data models. PDFs on DocumentCloud.
- Expanded unit testing of data model documentation
- Wider scope of choice field testing.
- Verify that each model has a UNIQUE_KEY attribute set.
- Verify that each model has a document reference.
- Verify that each choice field has a document reference.
- Verify that each model with a form_type or form_id field (with a few exceptions) is linked to filing forms.
- Introduced reportcalaccessrawdata command, which generates a report outlining the number / proportion of files / records cleaned and loaded.
- Model Re-modeling:
- Moved BallotMeasuresCd from other.py to campaign.py. Same with admin.
- Moved remaining models in other.py to common.py. Removed other.py. Same with admins.
- Re-ordered models into related groups.
- Bug fixes
- Truncate time portions of raw datetime values (see #1457).
- Strip newlines when loading into MySQL.
0.2.0 (January 2016)¶
- Support for Python 3.5
- Support for Django 1.9
- Simplified downloadcalaccessrawdata. Now only downloads, unzips and preps
- Introduced updatecalaccessrawdata, which downloads, cleans and loads data
- Added --resume-download option in case download is interrupted
- Added --csv option to loadcalaccessrawfile so that users can load from a file other than the one specified for the given calaccess_raw model
- Added --keep-files option. Unless the option is invoked, downloadcalaccessrawdata, cleancalaccessrawfile, loadcalaccessrawfile and updatecalaccessrawdata now clear out original and intermediate files
- Support for multiple databases configured in the Django DATABASES setting. Users can now load into a specified database using the --database option
- Fixed verifycalaccessrawfile
- Updated management command options to most recent Django style, using argparse instead of optparse
- Hundreds of unique keys, field defs and choices patched by Code Rushers
- Automatically generated table documentation page
- Expanded documentation
0.1.2 (February 2015)¶
- Substituted clint for progressbar
- Improved choices for form type fields
0.1.1 (January 2015)¶
- Datetime support for MySQL fields
- Fixed bug that didn’t allow null values in PostgreSQL datetime fields
0.1.0 (November 2014)¶
- Support for PostgreSQL database backends
- Upgraded to Django 1.7
- Prettified management command output and logging
- Improved docs, admins and configuration for some campaign-finance models
- Numerous small bug fixes and documentation corrections
0.0.7 (August 2014)¶
- Complete set of models that cover 100% of source CSV files
- Management commands that prep and load the data for MySQL backends
- Administration panels for previewing the data
Open-source resources¶
- Code: github.com/california-civic-data-coalition/django-calaccess-raw-data
- Issues: github.com/california-civic-data-coalition/django-calaccess-raw-data/issues
- Packaging: pypi.python.org/pypi/django-calaccess-raw-data
- Testing: travis-ci.org/california-civic-data-coalition/django-calaccess-raw-data
- Coverage: coveralls.io/r/california-civic-data-coalition/django-calaccess-raw-data
django-calaccess-scraped-data¶
A Django app to scrape campaign-finance data from the California Secretary of State’s CAL-ACCESS website.
Management commands¶
The scraped-data app includes the following commands for scraping campaign finance data from the CAL-ACCESS website.
As with any Django app management command, these can be invoked on the command line or called within your Python code.
Raw content downloaded from CAL-ACCESS is stored in .scraper_cache/, found in the directory specified by BASE_DIR in your Django project’s settings.
scrapecalaccess¶
This command runs the following management commands, in order:
scrapecalaccesspropositions
scrapecalaccesscandidates
scrapecalaccessincumbents
These commands are defined in more detail below.
The default behavior of the scraper commands is to avoid excessive downloads. As such, a CAL-ACCESS web page’s content will only be downloaded if:
- The page’s content isn’t cached; or
- The byte size of the cached content differs from the size of the content on the server (as specified in the Content-Length header).
You can override this default behavior by invoking the --force-download option:
$ python manage.py scrapecalaccess --force-download
Alternatively, you can avoid making any network requests by invoking the --cache-only option, which parses and stores data only from previously cached content:
$ python manage.py scrapecalaccess --cache-only
By default, data saved to your database from previous scrapes is preserved. To start over with empty data tables, invoke the --flush option:
$ python manage.py scrapecalaccess --flush
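The download rules described above can be sketched as a small decision function. This is only an illustration of the stated behavior, not the scraper's actual code: the name should_download, its parameters, and the choice to let --cache-only win over --force-download when both are given are all assumptions.

```python
import os


def should_download(cache_path, content_length,
                    force_download=False, cache_only=False):
    """Return True if a page should be fetched from the server.

    Hypothetical helper mirroring the documented defaults:
    cache_path is where the page's content would be cached, and
    content_length is the size reported by the Content-Length header.
    """
    if cache_only:
        # --cache-only: never make a network request
        # (assumption: this takes priority over --force-download).
        return False
    if force_download:
        # --force-download: fetch even if the page is cached.
        return True
    if not os.path.exists(cache_path):
        # The page's content isn't cached.
        return True
    # Cached: fetch only if the byte size differs from the server's.
    return os.path.getsize(cache_path) != content_length
```

Comparing byte sizes is a cheap check, but it cannot detect a page whose content changed while its length stayed the same; --force-download covers that case.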
usage: manage.py scrapecalaccess [-h] [--version] [-v {0,1,2,3}]
[--settings SETTINGS]
[--pythonpath PYTHONPATH] [--traceback]
[--no-color] [--flush] [--force-download]
[--cache-only]
Run all scraper commands
optional arguments:
-h, --help show this help message and exit
--version show program's version number and exit
-v {0,1,2,3}, --verbosity {0,1,2,3}
Verbosity level; 0=minimal output, 1=normal output,
2=verbose output, 3=very verbose output
--settings SETTINGS The Python path to a settings module, e.g.
"myproject.settings.main". If this isn't provided, the
DJANGO_SETTINGS_MODULE environment variable will be
used.
--pythonpath PYTHONPATH
A directory to add to the Python path, e.g.
"/home/djangoprojects/myproject".
--traceback Raise on CommandError exceptions
--no-color Don't colorize the command output.
--flush Flush database tables
--force-download Force the scraper to download URLs even if they are cached
--cache-only Skip the scraper's update checks. Use only cached
files.
scrapecalaccesscandidates¶
Scrape certified candidates for each election on the CAL-ACCESS site. A component of the scrapecalaccess command.
This command requests and parses content from the “certified” view of the Campaign/Candidates/list.aspx page (e.g., the 2016 General certified candidates). Data parsed from these pages are saved in the CandidateElection and Candidate models.
usage: manage.py scrapecalaccesscandidates [-h] [--version] [-v {0,1,2,3}]
[--settings SETTINGS]
[--pythonpath PYTHONPATH]
[--traceback] [--no-color]
[--flush] [--force-download]
[--cache-only]
Scrape certified candidates for each election on the CAL-ACCESS site.
optional arguments:
-h, --help show this help message and exit
--version show program's version number and exit
-v {0,1,2,3}, --verbosity {0,1,2,3}
Verbosity level; 0=minimal output, 1=normal output,
2=verbose output, 3=very verbose output
--settings SETTINGS The Python path to a settings module, e.g.
"myproject.settings.main". If this isn't provided, the
DJANGO_SETTINGS_MODULE environment variable will be
used.
--pythonpath PYTHONPATH
A directory to add to the Python path, e.g.
"/home/djangoprojects/myproject".
--traceback Raise on CommandError exceptions
--no-color Don't colorize the command output.
--flush Flush database tables
--force-download Force the scraper to download URLs even if they are
cached
--cache-only Skip the scraper's update checks. Use only cached
files.
scrapecalaccesscandidatecommittees¶
Scrape each candidate’s committees from the CAL-ACCESS site.
This command requests and parses content from the “general” view of the Campaign/Candidates/Detail.aspx page for each candidate’s most recent “session” (e.g., Edward T. Gaines general information leading up to the 2016 General election). Data parsed from these pages are saved in the CandidateCommittee model.
Note
The scrapecalaccesscandidatecommittees command is not currently included in scrapecalaccess because of the number of CAL-ACCESS web pages it scrapes. This may change in the future.
usage: manage.py scrapecalaccesscandidatecommittees [-h] [--version]
[-v {0,1,2,3}]
[--settings SETTINGS]
[--pythonpath PYTHONPATH]
[--traceback] [--no-color]
[--flush]
[--force-download]
[--cache-only]
Scrape each candidate's committees from the CAL-ACCESS site.
optional arguments:
-h, --help show this help message and exit
--version show program's version number and exit
-v {0,1,2,3}, --verbosity {0,1,2,3}
Verbosity level; 0=minimal output, 1=normal output,
2=verbose output, 3=very verbose output
--settings SETTINGS The Python path to a settings module, e.g.
"myproject.settings.main". If this isn't provided, the
DJANGO_SETTINGS_MODULE environment variable will be
used.
--pythonpath PYTHONPATH
A directory to add to the Python path, e.g.
"/home/djangoprojects/myproject".
--traceback Raise on CommandError exceptions
--no-color Don't colorize the command output.
--flush Flush database tables
--force-download Force the scraper to download URLs even if they are
cached
--cache-only Skip the scraper's update checks. Use only cached
files.
scrapecalaccessincumbents¶
Scrape list of incumbent state officials for each election on the CAL-ACCESS site. A component of the scrapecalaccess command.
This command requests and parses content from the “incumbent” view of the Campaign/Candidates/list.aspx page (e.g., the 2017-2018 General incumbents). Data parsed from these pages are saved in the IncumbentElection and Incumbent models.
usage: manage.py scrapecalaccessincumbents [-h] [--version] [-v {0,1,2,3}]
[--settings SETTINGS]
[--pythonpath PYTHONPATH]
[--traceback] [--no-color]
[--flush] [--force-download]
[--cache-only]
Scrape list of incumbent state officials for each election on CAL-ACCESS site.
optional arguments:
-h, --help show this help message and exit
--version show program's version number and exit
-v {0,1,2,3}, --verbosity {0,1,2,3}
Verbosity level; 0=minimal output, 1=normal output,
2=verbose output, 3=very verbose output
--settings SETTINGS The Python path to a settings module, e.g.
"myproject.settings.main". If this isn't provided, the
DJANGO_SETTINGS_MODULE environment variable will be
used.
--pythonpath PYTHONPATH
A directory to add to the Python path, e.g.
"/home/djangoprojects/myproject".
--traceback Raise on CommandError exceptions
--no-color Don't colorize the command output.
--flush Flush database tables
--force-download Force the scraper to download URLs even if they are
cached
--cache-only Skip the scraper's update checks. Use only cached
files.
scrapecalaccesspropositions¶
Scrape links between filers and propositions from the official CAL-ACCESS site. A component of the scrapecalaccess command.
This command requests and parses content from the Campaign/Measures/list.aspx page (e.g., the 2015-2016 propositions and ballot measures) and the “general” view of each proposition’s Campaign/Measures/Detail.aspx page (e.g., Prop 60’s general information). Data parsed from these pages are saved in the PropositionElection, Proposition and PropositionCommittee models.
$ python manage.py scrapecalaccesspropositions
usage: manage.py scrapecalaccesspropositions [-h] [--version] [-v {0,1,2,3}]
[--settings SETTINGS]
[--pythonpath PYTHONPATH]
[--traceback] [--no-color]
[--flush] [--force-download]
[--cache-only]
Scrape links between filers and propositions from the official CAL-ACCESS
site.
optional arguments:
-h, --help show this help message and exit
--version show program's version number and exit
-v {0,1,2,3}, --verbosity {0,1,2,3}
Verbosity level; 0=minimal output, 1=normal output,
2=verbose output, 3=very verbose output
--settings SETTINGS The Python path to a settings module, e.g.
"myproject.settings.main". If this isn't provided, the
DJANGO_SETTINGS_MODULE environment variable will be
used.
--pythonpath PYTHONPATH
A directory to add to the Python path, e.g.
"/home/djangoprojects/myproject".
--traceback Raise on CommandError exceptions
--no-color Don't colorize the command output.
--flush Flush database tables
--force-download Force the scraper to download URLs even if they are
cached
--cache-only Skip the scraper's update checks. Use only cached
files.
Changelog¶
0.1.2 (November 2017)¶
- Added five second timeout to requests made to CAL-ACCESS
0.1.1 (October 2017)¶
- Updated dependencies
- Added fixtures to distribution
0.1.0 (July 2017)¶
- Initial release of management commands and models for scraping:
- Ballot propositions (see scrapecalaccesspropositions)
- “Certified” candidates (see scrapecalaccesscandidates)
- Committees linked to certified candidates in each candidate’s most recent election cycle (see scrapecalaccesscandidatecommittees)
- Incumbent officeholders (see scrapecalaccessincumbents)
- The entire routine (see scrapecalaccess)
- Includes admin panels for previewing any scraped data
Open-source resources¶
- Code: github.com/california-civic-data-coalition/django-calaccess-scraped-data
- Issues: github.com/california-civic-data-coalition/django-calaccess-scraped-data/issues
- Packaging: pypi.python.org/pypi/django-calaccess-scraped-data
- Testing: travis-ci.org/california-civic-data-coalition/django-calaccess-scraped-data
- Coverage: coveralls.io/r/california-civic-data-coalition/django-calaccess-scraped-data
django-calaccess-processed-data¶
A Django app to clean, transform and refine campaign-finance and lobbying-activity data from the California Secretary of State’s CAL-ACCESS database.
Management commands¶
The processed-data app includes the following commands for refining data extracted and scraped from CAL-ACCESS. Specifically, the raw data is loaded into the following types of models:
- Filing models that surface the most recent version of data included on a campaign-finance filing form, schedule or line item (e.g., a Form 460, its Schedule A or Line 1 on that schedule).
- FilingVersion models that surface every version of a campaign-finance filing form, schedule or line item.
- Models that implement the core data types of the Open Civic Data specification (e.g., Person, Organization, Post and Membership).
- Models that implement election-related data types that have been provisionally included in the Open Civic Data specification (e.g., Election, CandidateContest and Candidacy).
As with any Django app management command, these can be invoked on the command line or called within your Python code.
Note
Before using any of the commands below, make sure you download and extract the raw CAL-ACCESS data:
$ python manage.py updatecalaccessrawdata
And scrape supplementary data from the CAL-ACCESS website:
$ python manage.py scrapecalaccess
processcalaccessdata¶
This is the master command. It brings together all of the other management commands listed below to load data into processed CAL-ACCESS models.
If your Django project is configured for archiving (details here), this command also will export a csv file for each loaded model.
Running the entire routine is as simple as this.
$ python manage.py processcalaccessdata
If a previous processing job stalled for any reason, processcalaccessdata will pick up wherever it left off. You can override this behavior by invoking the --force-restart option.
$ python manage.py processcalaccessdata --force-restart
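Because these commands can also be called from Python, the whole routine (the two prerequisite commands from the note, then the master command) could be chained with Django's call_command. The following is a minimal sketch: build_pipeline and run_pipeline are hypothetical helper names, and run_pipeline has to be called from inside a configured Django project (for example, from manage.py shell).

```python
def build_pipeline(force_restart=False):
    """Return the commands to run, in order, as tuples of CLI-style args."""
    steps = [
        ("updatecalaccessrawdata",),  # download and extract the raw data
        ("scrapecalaccess",),         # scrape supplementary data
    ]
    process = ("processcalaccessdata",)
    if force_restart:
        # Same flag as on the command line; overrides auto-resume.
        process += ("--force-restart",)
    steps.append(process)
    return steps


def run_pipeline(force_restart=False):
    """Run each step with Django's call_command (needs Django set up)."""
    # Imported lazily so build_pipeline works without Django installed.
    from django.core.management import call_command

    for name, *flags in build_pipeline(force_restart):
        call_command(name, *flags)
```

Passing flags as strings keeps the Python call identical to the command-line form, so the sketch does not rely on knowing each option's keyword name.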
usage: manage.py processcalaccessdata [-h] [--version] [-v {0,1,2,3}]
[--settings SETTINGS]
[--pythonpath PYTHONPATH] [--traceback]
[--no-color] [--force-restart]
[--no-scrape]
Load data into processed CAL-ACCESS models, archive processed files and ZIP.
optional arguments:
-h, --help show this help message and exit
--version show program's version number and exit
-v {0,1,2,3}, --verbosity {0,1,2,3}
Verbosity level; 0=minimal output, 1=normal output,
2=verbose output, 3=very verbose output
--settings SETTINGS The Python path to a settings module, e.g.
"myproject.settings.main". If this isn't provided, the
DJANGO_SETTINGS_MODULE environment variable will be
used.
--pythonpath PYTHONPATH
A directory to add to the Python path, e.g.
"/home/djangoprojects/myproject".
--traceback Raise on CommandError exceptions
--no-color Don't colorize the command output.
--force-restart, --restart
Force re-start (overrides auto-resume).
archivecalaccessprocessedfile¶
Export and archive a .csv file for a given model.
You must provide the model_name as the first and only positional argument. For example, here’s how you archive the Form460Filing model:
$ python manage.py archivecalaccessprocessedfile Form460Filing
Or the Candidacy model, which is one of Open Civic Data’s election-related data types:
$ python manage.py archivecalaccessprocessedfile Candidacy
usage: manage.py archivecalaccessprocessedfile [-h] [--version] [-v {0,1,2,3}]
[--settings SETTINGS]
[--pythonpath PYTHONPATH]
[--traceback] [--no-color]
model_name
Export and archive a .csv file for a given model.
positional arguments:
model_name Name of the model to archive
optional arguments:
-h, --help show this help message and exit
--version show program's version number and exit
-v {0,1,2,3}, --verbosity {0,1,2,3}
Verbosity level; 0=minimal output, 1=normal output,
2=verbose output, 3=very verbose output
--settings SETTINGS The Python path to a settings module, e.g.
"myproject.settings.main". If this isn't provided, the
DJANGO_SETTINGS_MODULE environment variable will be
used.
--pythonpath PYTHONPATH
A directory to add to the Python path, e.g.
"/home/djangoprojects/myproject".
--traceback Raise on CommandError exceptions
--no-color Don't colorize the command output.
loadcalaccessfilings¶
Load the CAL-ACCESS Filing and FilingVersion models. A component of the master processcalaccessdata command.
If your Django project is configured for archiving (details here), this command also will export a csv file for each loaded model.
Here is how to run the command.
$ python manage.py loadcalaccessfilings
This command will skip any Filing or FilingVersion models already loaded with raw data from the current CAL-ACCESS snapshot. You can override this behavior by invoking the --force-restart option.
$ python manage.py loadcalaccessfilings --force-restart
usage: manage.py loadcalaccessfilings [-h] [--version] [-v {0,1,2,3}]
[--settings SETTINGS]
[--pythonpath PYTHONPATH] [--traceback]
[--no-color] [--force-restart]
Load and archive the CAL-ACCESS Filing and FilingVersion models.
optional arguments:
-h, --help show this help message and exit
--version show program's version number and exit
-v {0,1,2,3}, --verbosity {0,1,2,3}
Verbosity level; 0=minimal output, 1=normal output,
2=verbose output, 3=very verbose output
--settings SETTINGS The Python path to a settings module, e.g.
"myproject.settings.main". If this isn't provided, the
DJANGO_SETTINGS_MODULE environment variable will be
used.
--pythonpath PYTHONPATH
A directory to add to the Python path, e.g.
"/home/djangoprojects/myproject".
--traceback Raise on CommandError exceptions
--no-color Don't colorize the command output.
--force-restart, --restart
Force re-start (overrides auto-resume).
loadocdelections¶
Load OCD elections models with data extracted and scraped from CAL-ACCESS. A component of the master processcalaccessdata command.
This command runs the following management commands, in order:
loadocdparties
loadocdballotmeasureelections
loadocdballotmeasurecontests
loadocdretentioncontests
loadocdcandidateelections
loadocdcandidatecontests
mergeocdpersonsbyfilerid
loadocdcandidaciesfrom501s
mergeocdpersonsbycontestandname
loadocdincumbentofficeholders
If your Django project is configured for archiving (details here), this command also will export a csv file for each loaded model.
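Several of the commands in this sequence carry notes saying which other command should be run first. As a rough consistency check, the listed order can be compared against those notes. In this sketch the PREREQUISITES mapping is transcribed from the notes in the per-command sections, and order_violations is a hypothetical helper, not part of the app.

```python
# Order in which loadocdelections runs its component commands.
LOAD_ORDER = [
    "loadocdparties",
    "loadocdballotmeasureelections",
    "loadocdballotmeasurecontests",
    "loadocdretentioncontests",
    "loadocdcandidateelections",
    "loadocdcandidatecontests",
    "mergeocdpersonsbyfilerid",
    "loadocdcandidaciesfrom501s",
    "mergeocdpersonsbycontestandname",
    "loadocdincumbentofficeholders",
]

# command -> commands its note says to run beforehand.
# Assumption: the retention-contest prerequisite is taken to be the
# loadocdballotmeasureelections command in the list above.
PREREQUISITES = {
    "loadocdballotmeasurecontests": ["loadocdballotmeasureelections"],
    "loadocdretentioncontests": ["loadocdballotmeasureelections"],
    "loadocdcandidatecontests": ["loadocdcandidateelections", "loadocdparties"],
    "loadocdincumbentofficeholders": ["loadocdcandidateelections"],
}


def order_violations(order, prerequisites):
    """Return (dependency, command) pairs where the dependency runs too late."""
    position = {name: index for index, name in enumerate(order)}
    violations = []
    for command, required in prerequisites.items():
        for dependency in required:
            if position[dependency] > position[command]:
                violations.append((dependency, command))
    return violations
```

The same mapping is a handy reference when running the component commands by hand rather than through loadocdelections.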
usage: manage.py loadocdelections [-h] [--version] [-v {0,1,2,3}]
[--settings SETTINGS]
[--pythonpath PYTHONPATH] [--traceback]
[--no-color]
Load OCD elections models with data extracted and scraped from CAL-ACCESS.
optional arguments:
-h, --help show this help message and exit
--version show program's version number and exit
-v {0,1,2,3}, --verbosity {0,1,2,3}
Verbosity level; 0=minimal output, 1=normal output,
2=verbose output, 3=very verbose output
--settings SETTINGS The Python path to a settings module, e.g.
"myproject.settings.main". If this isn't provided, the
DJANGO_SETTINGS_MODULE environment variable will be
used.
--pythonpath PYTHONPATH
A directory to add to the Python path, e.g.
"/home/djangoprojects/myproject".
--traceback Raise on CommandError exceptions
--no-color Don't colorize the command output.
loadocdballotmeasureelections¶
Load the OCD Election model from the scraped PropositionElection model. A component of the loadocdelections command.
usage: manage.py loadocdballotmeasureelections [-h] [--version] [-v {0,1,2,3}]
[--settings SETTINGS]
[--pythonpath PYTHONPATH]
[--traceback] [--no-color]
Load the OCD Election model from the scraped PropositionElection model
optional arguments:
-h, --help show this help message and exit
--version show program's version number and exit
-v {0,1,2,3}, --verbosity {0,1,2,3}
Verbosity level; 0=minimal output, 1=normal output,
2=verbose output, 3=very verbose output
--settings SETTINGS The Python path to a settings module, e.g.
"myproject.settings.main". If this isn't provided, the
DJANGO_SETTINGS_MODULE environment variable will be
used.
--pythonpath PYTHONPATH
A directory to add to the Python path, e.g.
"/home/djangoprojects/myproject".
--traceback Raise on CommandError exceptions
--no-color Don't colorize the command output.
loadocdballotmeasurecontests¶
Load OCD BallotMeasureContest and related models with scraped CAL-ACCESS data. A component of the loadocdelections command.
Note
Use loadocdballotmeasureelections before using loadocdballotmeasurecontests.
usage: manage.py loadocdballotmeasurecontests [-h] [--version] [-v {0,1,2,3}]
[--settings SETTINGS]
[--pythonpath PYTHONPATH]
[--traceback] [--no-color]
[--flush]
Load OCD BallotMeasureContest and related models with scraped CAL-ACCESS data
optional arguments:
-h, --help show this help message and exit
--version show program's version number and exit
-v {0,1,2,3}, --verbosity {0,1,2,3}
Verbosity level; 0=minimal output, 1=normal output,
2=verbose output, 3=very verbose output
--settings SETTINGS The Python path to a settings module, e.g.
"myproject.settings.main". If this isn't provided, the
DJANGO_SETTINGS_MODULE environment variable will be
used.
--pythonpath PYTHONPATH
A directory to add to the Python path, e.g.
"/home/djangoprojects/myproject".
--traceback Raise on CommandError exceptions
--no-color Don't colorize the command output.
--flush Flush the database tables filled by this command.
loadocdcandidateelections¶
Load the OCD Election model with data from the scraped CandidateElection model. A component of the loadocdelections command.
usage: manage.py loadocdcandidateelections [-h] [--version] [-v {0,1,2,3}]
[--settings SETTINGS]
[--pythonpath PYTHONPATH]
[--traceback] [--no-color]
[--flush]
Load the OCD Election model with data from the scraped CandidateElection model.
optional arguments:
-h, --help show this help message and exit
--version show program's version number and exit
-v {0,1,2,3}, --verbosity {0,1,2,3}
Verbosity level; 0=minimal output, 1=normal output,
2=verbose output, 3=very verbose output
--settings SETTINGS The Python path to a settings module, e.g.
"myproject.settings.main". If this isn't provided, the
DJANGO_SETTINGS_MODULE environment variable will be
used.
--pythonpath PYTHONPATH
A directory to add to the Python path, e.g.
"/home/djangoprojects/myproject".
--traceback Raise on CommandError exceptions
--no-color Don't colorize the command output.
--flush Flush the database tables filled by this command.
loadocdcandidatecontests¶
Load the OCD CandidateContest and related models with scraped CAL-ACCESS data. A component of the loadocdelections command.
This command loads data from the IncumbentElection and CandidateElection models in calaccess_scraped.
Note
Use loadocdcandidateelections and loadocdparties before using loadocdcandidatecontests.
usage: manage.py loadocdcandidatecontests [-h] [--version] [-v {0,1,2,3}]
[--settings SETTINGS]
[--pythonpath PYTHONPATH]
[--traceback] [--no-color] [--flush]
Load the OCD CandidateContest and related models with scraped CAL-ACCESS data
optional arguments:
-h, --help show this help message and exit
--version show program's version number and exit
-v {0,1,2,3}, --verbosity {0,1,2,3}
Verbosity level; 0=minimal output, 1=normal output,
2=verbose output, 3=very verbose output
--settings SETTINGS The Python path to a settings module, e.g.
"myproject.settings.main". If this isn't provided, the
DJANGO_SETTINGS_MODULE environment variable will be
used.
--pythonpath PYTHONPATH
A directory to add to the Python path, e.g.
"/home/djangoprojects/myproject".
--traceback Raise on CommandError exceptions
--no-color Don't colorize the command output.
--flush Flush the database tables filled by this command.
loadocdcandidaciesfrom501s¶
Load the OCD Candidacy model with data extracted from the Form501Filing model. A component of the loadocdelections command.
This command fills in Candidacy records with data missing on the CAL-ACCESS website (e.g., the candidate’s party in each contest). It also adds new Candidacy records.
usage: manage.py loadocdcandidaciesfrom501s [-h] [--version] [-v {0,1,2,3}]
[--settings SETTINGS]
[--pythonpath PYTHONPATH]
[--traceback] [--no-color]
Load the OCD Candidacy model with data extracted from the Form501Filing model.
optional arguments:
-h, --help show this help message and exit
--version show program's version number and exit
-v {0,1,2,3}, --verbosity {0,1,2,3}
Verbosity level; 0=minimal output, 1=normal output,
2=verbose output, 3=very verbose output
--settings SETTINGS The Python path to a settings module, e.g.
"myproject.settings.main". If this isn't provided, the
DJANGO_SETTINGS_MODULE environment variable will be
used.
--pythonpath PYTHONPATH
A directory to add to the Python path, e.g.
"/home/djangoprojects/myproject".
--traceback Raise on CommandError exceptions
--no-color Don't colorize the command output.
loadocdincumbentofficeholders¶
Load the OCD Membership model with data from the scraped Incumbent model. A component of the loadocdelections command.
Note
Use loadocdcandidateelections before using loadocdincumbentofficeholders.
usage: manage.py loadocdincumbentofficeholders [-h] [--version] [-v {0,1,2,3}]
[--settings SETTINGS]
[--pythonpath PYTHONPATH]
[--traceback] [--no-color]
Load the OCD Membership model with data from the scraped Incumbent model
optional arguments:
-h, --help show this help message and exit
--version show program's version number and exit
-v {0,1,2,3}, --verbosity {0,1,2,3}
Verbosity level; 0=minimal output, 1=normal output,
2=verbose output, 3=very verbose output
--settings SETTINGS The Python path to a settings module, e.g.
"myproject.settings.main". If this isn't provided, the
DJANGO_SETTINGS_MODULE environment variable will be
used.
--pythonpath PYTHONPATH
A directory to add to the Python path, e.g.
"/home/djangoprojects/myproject".
--traceback Raise on CommandError exceptions
--no-color Don't colorize the command output.
loadocdretentioncontests¶
Load OCD RetentionContest and related models with data scraped from CAL-ACCESS. A component of the loadocdelections command.
Note
Use loadocdballotmeasureelections before using loadocdretentioncontests.
usage: manage.py loadocdretentioncontests [-h] [--version] [-v {0,1,2,3}]
[--settings SETTINGS]
[--pythonpath PYTHONPATH]
[--traceback] [--no-color] [--flush]
Load OCD RetentionContest and related models with data scraped from CAL-ACCESS
optional arguments:
-h, --help show this help message and exit
--version show program's version number and exit
-v {0,1,2,3}, --verbosity {0,1,2,3}
Verbosity level; 0=minimal output, 1=normal output,
2=verbose output, 3=very verbose output
--settings SETTINGS The Python path to a settings module, e.g.
"myproject.settings.main". If this isn't provided, the
DJANGO_SETTINGS_MODULE environment variable will be
used.
--pythonpath PYTHONPATH
A directory to add to the Python path, e.g.
"/home/djangoprojects/myproject".
--traceback Raise on CommandError exceptions
--no-color Don't colorize the command output.
--flush Flush the database tables filled by this command.
loadocdparties¶
Load OCD Organization model with parties extracted from raw CAL-ACCESS data. A component of the loadocdelections command.
usage: manage.py loadocdparties [-h] [--version] [-v {0,1,2,3}]
[--settings SETTINGS]
[--pythonpath PYTHONPATH] [--traceback]
[--no-color] [--flush]
Load OCD Organization model with parties extracted from raw CAL-ACCESS data.
optional arguments:
-h, --help show this help message and exit
--version show program's version number and exit
-v {0,1,2,3}, --verbosity {0,1,2,3}
Verbosity level; 0=minimal output, 1=normal output,
2=verbose output, 3=very verbose output
--settings SETTINGS The Python path to a settings module, e.g.
"myproject.settings.main". If this isn't provided, the
DJANGO_SETTINGS_MODULE environment variable will be
used.
--pythonpath PYTHONPATH
A directory to add to the Python path, e.g.
"/home/djangoprojects/myproject".
--traceback Raise on CommandError exceptions
--no-color Don't colorize the command output.
--flush Flush the database tables filled by this command.
mergeocdpersonsbycontestandname¶
Find and merge OCD Person records that share a name and CandidateContest. A component of the loadocdelections command.
usage: manage.py mergeocdpersonsbycontestandname [-h] [--version]
[-v {0,1,2,3}]
[--settings SETTINGS]
[--pythonpath PYTHONPATH]
[--traceback] [--no-color]
Find and merge OCD Person records that share a name and CandidateContest
optional arguments:
-h, --help show this help message and exit
--version show program's version number and exit
-v {0,1,2,3}, --verbosity {0,1,2,3}
Verbosity level; 0=minimal output, 1=normal output,
2=verbose output, 3=very verbose output
--settings SETTINGS The Python path to a settings module, e.g.
"myproject.settings.main". If this isn't provided, the
DJANGO_SETTINGS_MODULE environment variable will be
used.
--pythonpath PYTHONPATH
A directory to add to the Python path, e.g.
"/home/djangoprojects/myproject".
--traceback Raise on CommandError exceptions
--no-color Don't colorize the command output.
mergeocdpersonsbyfilerid¶
Find and merge OCD Person records that share the same CAL-ACCESS filer_id. A component of the loadocdelections command.
usage: manage.py mergeocdpersonsbyfilerid [-h] [--version] [-v {0,1,2,3}]
[--settings SETTINGS]
[--pythonpath PYTHONPATH]
[--traceback] [--no-color]
Find and merge OCD Person records that share the same CAL-ACCESS filer_id
optional arguments:
-h, --help show this help message and exit
--version show program's version number and exit
-v {0,1,2,3}, --verbosity {0,1,2,3}
Verbosity level; 0=minimal output, 1=normal output,
2=verbose output, 3=very verbose output
--settings SETTINGS The Python path to a settings module, e.g.
"myproject.settings.main". If this isn't provided, the
DJANGO_SETTINGS_MODULE environment variable will be
used.
--pythonpath PYTHONPATH
A directory to add to the Python path, e.g.
"/home/djangoprojects/myproject".
--traceback Raise on CommandError exceptions
--no-color Don't colorize the command output.
Models for tracking updates¶
The processed-data app also keeps track of each snapshot of the CAL-ACCESS database it processes. This tracking information is stored in the data tables outlined below.
Note
By default, the processed-data app does not archive previous versions of the CAL-ACCESS database. Rather, with each call to the management commands, the data files they process are overwritten.
You can configure the raw-data app to keep each copy of the zip file downloaded from the California Secretary of State, as well as the individual raw .tsv files and cleaned .csv files, by setting CALACCESS_STORE_ARCHIVE to True in settings.py:
# in settings.py
CALACCESS_STORE_ARCHIVE = True
By default, the older copies of these files will be saved to the path specified by your Django project’s MEDIA_ROOT setting (more on that here). However, if you’ve implemented a custom storage system or installed a third-party app (such as django-storages), that should work too.
ProcessDataVersion¶
Versions of CAL-ACCESS raw source data, typically released every day.
Name | Type | Unique key | Definition |
---|---|---|---|
id | Integer | Yes | Auto-incrementing unique identifier of versions |
raw_version_id | Integer | Yes | Foreign key referencing the raw data version processed |
process_start_datetime | DateTime | No | Date and time when the processing of the CAL-ACCESS version started |
process_finish_datetime | DateTime | No | Date and time when the processing of the CAL-ACCESS version finished |
zip_archive | FileField | No | An archive zip of processed files |
zip_size | Integer | No | The expected size (in bytes) of the zip of processed files |
.update_completed |
Check if the database update to the version completed. Return True or False . |
.update_stalled |
Check if the database update to the version started but did not complete. Return True or False . |
.pretty_expected_size() |
Returns a prettified version (e.g., "725M") of the expected size of the downloaded zip. |
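As a rough illustration of how the three helpers above could behave, here is a minimal sketch in plain Python. It is not the app's actual implementation: the `VersionSketch` class is hypothetical, and the rules it applies (a finish timestamp means completed, a start without a finish means stalled, sizes rounded to the nearest binary unit) are assumptions drawn from the descriptions in the table.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class VersionSketch:
    """Hypothetical stand-in for a ProcessDataVersion row."""

    process_start_datetime: Optional[datetime] = None
    process_finish_datetime: Optional[datetime] = None
    zip_size: int = 0

    @property
    def update_completed(self) -> bool:
        # Assumed rule: a finish timestamp means the update completed.
        return self.process_finish_datetime is not None

    @property
    def update_stalled(self) -> bool:
        # Assumed rule: the update started but never recorded a finish.
        return (
            self.process_start_datetime is not None
            and self.process_finish_datetime is None
        )

    def pretty_expected_size(self) -> str:
        # Divide by 1024 until the number fits the current unit.
        size = float(self.zip_size)
        for unit in ("B", "K", "M", "G"):
            if size < 1024:
                return f"{size:.0f}{unit}"
            size /= 1024
        return f"{size:.0f}T"
```

A version with a start time but no finish time would report `update_stalled` as `True`, which is the kind of check you might run before kicking off another update.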
ProcessedDataFile¶
A data file included in a processed version of CAL-ACCESS.
Name | Type | Unique key | Definition |
---|---|---|---|
id | Integer | Yes | Auto-incrementing unique identifier of the file |
version_id | Integer | No | Foreign key referencing the processed version of CAL-ACCESS |
file_name | String (up to 100) | No | Name of the processed data file without extension |
process_start_datetime | DateTime | No | Date and time when the processing of the file started |
process_finish_datetime | DateTime | No | Date and time when the processing of the file finished |
records_count | Integer | No | Count of records in the processed file |
file_archive | FileField | No | An archive of the processed file |
file_size | Integer | No | Size of the processed file (in bytes) |
.pretty_file_size() |
Returns a prettified version (e.g., "725M") of the downloaded file's size. |
Open-source resources¶
- Code: github.com/california-civic-data-coalition/django-calaccess-processed-data
- Issues: github.com/california-civic-data-coalition/django-calaccess-processed-data/issues
- Packaging: pypi.python.org/pypi/django-calaccess-processed-data
- Testing: travis-ci.org/california-civic-data-coalition/django-calaccess-processed-data
- Coverage: coveralls.io/r/california-civic-data-coalition/django-calaccess-processed-data
django-calaccess-downloads-website¶
An open-source archive of campaign-finance and lobbying-activity data from the California Secretary of State’s CAL-ACCESS database.
Management commands¶
The downloads-website app includes the following commands for updating and publishing the website’s content.
Our website is one of those trendy static content sites that you’ve probably heard a lot about lately. This just means that, instead of generating HTML on-the-fly with each request from a user’s browser, we create and save all the web pages ahead of time by executing Python code against the database backend once a day.
This process is often called “baking”, and there’s a really handy app that we rely on to make all this work.
As with any Django app management command, these can be invoked on the command line or called within your Python code.
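For calling the commands from Python, a minimal sketch might look like the following. It relies on Django's `call_command`, which accepts command-line style flags as positional strings. The helper names `build_update_args` and `run_update` are hypothetical, and `run_update` assumes it is executed inside a configured Django project with the downloads-website app installed.

```python
def build_update_args(publish=False, keep_files=False):
    """Return the command-line flags to pass along to the command."""
    args = ["--noinput"]  # skip the download confirmation prompt
    if keep_files:
        args.append("--keep-files")
    if publish:
        args.append("--publish")
    return args


def run_update(**kwargs):
    """Run updatedownloadswebsite from Python (needs a configured project)."""
    # Imported here so the helper above can be used without Django.
    from django.core.management import call_command

    call_command("updatedownloadswebsite", *build_update_args(**kwargs))
```

Calling `run_update(publish=True)` from a scheduled script would then be roughly equivalent to running the command with `--noinput --publish` on the command line.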
updatedownloadswebsite¶
Update to the latest CAL-ACCESS snapshot and bake static website pages.
This is the master command that performs the entire daily routine of downloading, processing and archiving the latest raw data, then re-building the downloads website’s content.
$ python manage.py updatedownloadswebsite
In order to publish this content to the S3 bucket where it’s served, you can invoke the --publish option:
$ python manage.py updatedownloadswebsite --publish
Also, this command is a sub-class of the raw-data app’s updatecalaccessrawdata command, so it inherits all the options of the parent command. For example, if you want to keep copies of the latest raw data files on the app’s server, you can:
$ python manage.py updatedownloadswebsite --keep-files
The other options are below.
usage: manage.py updatedownloadswebsite [-h] [--version] [-v {0,1,2,3}]
[--settings SETTINGS]
[--pythonpath PYTHONPATH]
[--traceback] [--no-color]
[--keep-files] [--noinput] [--test]
[-a APP_NAME] [--publish]
Update to the latest CAL-ACCESS snapshot and bake static website pages
optional arguments:
-h, --help show this help message and exit
--version show program's version number and exit
-v {0,1,2,3}, --verbosity {0,1,2,3}
Verbosity level; 0=minimal output, 1=normal output,
2=verbose output, 3=very verbose output
--settings SETTINGS The Python path to a settings module, e.g.
"myproject.settings.main". If this isn't provided, the
DJANGO_SETTINGS_MODULE environment variable will be
used.
--pythonpath PYTHONPATH
A directory to add to the Python path, e.g.
"/home/djangoprojects/myproject".
--traceback Raise on CommandError exceptions
--no-color Don't colorize the command output.
--keep-files Keep zip, unzipped, TSV and CSV files
--noinput Download the ZIP archive without asking permission
--test, --use-test-data
Use sampled test data (skips download, clean a load)
-a APP_NAME, --app-name APP_NAME
Name of Django app with models into which data will be
imported (if other not calaccess_raw)
--publish Publish baked content
createlatestlinks¶
Save copies of data files from the most recently completed update in a latest/ directory in the Django project’s default file storage.
$ python manage.py createlatestlinks
This command will also clear out any objects currently saved under latest/ before saving new ones.
usage: manage.py createlatestlinks [-h] [--version] [-v {0,1,2,3}]
[--settings SETTINGS]
[--pythonpath PYTHONPATH] [--traceback]
[--no-color]
Save copies of data files from the most recently completed update in a latest
directory in the default file storage of the Django project.
optional arguments:
-h, --help show this help message and exit
--version show program's version number and exit
-v {0,1,2,3}, --verbosity {0,1,2,3}
Verbosity level; 0=minimal output, 1=normal output,
2=verbose output, 3=very verbose output
--settings SETTINGS The Python path to a settings module, e.g.
"myproject.settings.main". If this isn't provided, the
DJANGO_SETTINGS_MODULE environment variable will be
used.
--pythonpath PYTHONPATH
A directory to add to the Python path, e.g.
"/home/djangoprojects/myproject".
--traceback Raise on CommandError exceptions
--no-color Don't colorize the command output.
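The clear-then-copy behavior of createlatestlinks can be pictured with a small sketch that works on local folders. This is only an analogy under stated assumptions: the real command works against the Django project's default file storage, while the hypothetical `refresh_latest` function below uses the local filesystem and made-up directory arguments.

```python
import shutil
from pathlib import Path


def refresh_latest(version_dir, latest_dir):
    """Replace everything under latest_dir with copies from version_dir."""
    version_dir, latest_dir = Path(version_dir), Path(latest_dir)
    # Clear out whatever was saved by the previous run.
    if latest_dir.exists():
        shutil.rmtree(latest_dir)
    latest_dir.mkdir(parents=True)
    copied = []
    # Copy each file from the most recent update into latest/.
    for source in sorted(version_dir.iterdir()):
        if source.is_file():
            shutil.copy2(source, latest_dir / source.name)
            copied.append(source.name)
    return copied
```

Clearing first means a file that was dropped from the newest update cannot linger under latest/.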
Fab tasks index¶
We deploy and manage the downloads website infrastructure using Fabric, which makes processes like deploying the entire downloads website as simple as invoking a few commands from the command-line.
Below is the complete list of available Fabric tasks.
Note
Fabric allows you to run one task after another in a single fab command-line call like so:
$ fab task1:pos_arg1 task2:opt_arg=some_value
This can be useful for chaining tasks together for ad-hoc administrative processes. Read more here.
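To make the `task:arg,key=value` syntax concrete, here is a simplified sketch of how such a string breaks down into a task name, positional arguments and keyword arguments. The `parse_task` function is hypothetical and is not Fabric's own parser; in particular it ignores escaped commas and equals signs.

```python
def parse_task(spec):
    """Split 'name:arg1,key=value' into (name, args, kwargs)."""
    name, _, arg_string = spec.partition(":")
    args, kwargs = [], {}
    if arg_string:
        for piece in arg_string.split(","):
            key, sep, value = piece.partition("=")
            if sep:
                kwargs[key] = value  # keyword argument
            else:
                args.append(piece)  # positional argument
    return name, args, kwargs
```

Note that every value arrives as a string, so a task that expects a number or a boolean has to convert it itself.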
Amazon¶
Tasks for managing Amazon Web Services (AWS) resources.
createec2¶
Spin up a new Ubuntu 14.04 server on Amazon EC2. Returns the id and public address.
$ fab createec2
The address for your new EC2 instance will also be added to your current environment’s configuration (stored in .env). If you already have an EC2 host set in your current env, its address will be replaced.
Optional arguments:
- instance_name (default is calaccess_website)
- block_gb_size (default is 100)
- instance_type (default is c3.large)
- ami (default is ami-978dd9a7)
createkey¶
Creates an EC2 key pair and saves it to a .pem file.
The name for the key pair is the only positional argument:
$ fab createkey:ccdc-key
You’ll be stopped if you try to re-use an existing key pair name.
A new key pair will then be stored in ~/.ec2/<your-key-name>.pem, and the key pair name will be added to your current environment’s configuration (stored in .env). If you already have a key name set in your current env, it will be replaced.
createrds¶
Spin up a new database backend with Amazon RDS.
The instance_name is the only positional argument:
$ fab createrds:downloads-website
This may take several minutes.
The address for your new RDS instance will be added to your current environment’s configuration (stored in .env). If you already have an RDS host set in your current env, its address will be replaced.
Optional arguments:
- database_port (default is 5432)
- block_gb_size (default is 100)
- instance_type (default is db.t2.large)
copydb¶
Copy the most recent snapshot on the source AWS RDS instance to the destination RDS instance.
The positional arguments are:
- src_db_instance_id, which identifies the source instance from which to create a copy.
- dest_db_instance_id, which identifies the destination instance for the copy.
Warning
The current database on the destination instance will be deleted.
You might execute this task if, for example, you want to replicate the production database to a dev instance.
$ fab copydb:prod-db,dev-db
The process may take several minutes to complete.
If you would like to create a new snapshot of the source db instance before making a copy, you can pass in make_snapshot=True.
copys3¶
Copy objects in the source AWS S3 bucket to the destination S3 bucket.
Ignores source bucket objects with the same name as objects already in the destination bucket.
The positional arguments are:
- src_bucket, which identifies the bucket from which objects will be copied.
- dest_bucket, which identifies the bucket to which objects will be copied.
You might execute this task if, for example, you want to replicate the production archived data bucket to a dev instance.
$ fab copys3:prod-archived-data,dev-archived-data
The process may take several minutes to complete.
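The rule that copys3 skips objects already present in the destination can be sketched as a simple comparison of object names. The `keys_to_copy` function below is a hypothetical illustration that works on plain lists; the actual task talks to S3, which is not shown here.

```python
def keys_to_copy(src_keys, dest_keys):
    """Return source object names that are missing from the destination."""
    existing = set(dest_keys)
    # Keep the source order, dropping names the destination already has.
    return [key for key in src_keys if key not in existing]
```

Because only names are compared, an object that changed in the source bucket but kept its name would not be copied again under this rule.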
App¶
Tasks for deploying and managing the Django app.
deploy¶
Run a full deployment of code to the remote server.
$ fab deploy
More specifically, this task executes the following sub-tasks in order:
pull
rmpyc
pipinstall
migrate
collectstatic
Chef¶
Tasks related to installing and executing Chef, the Ruby framework we use to set up the Ubuntu server that hosts the downloads website code.
bootstrap¶
Install Chef and use it to install the app on an EC2 instance.
$ fab bootstrap
More specifically, this task executes the following sub-tasks in order:
rendernodejson
installchef
cook
copyconfig
migrate
collectstatic
This task also sets the environment in which the website will run on the server based on your current local CALACCESS_WEBSITE_ENV environment variable (defaults to DEV if not set).
cook¶
In order to do its thing, Chef requires a cookbook that contains recipes (basically, short Ruby scripts) that outline the configuration scenario on the remote server. You can see our cookbook for this project here.
This task updates the Chef cookbook on the server and executes it.
$ fab cook
installchef¶
Install all the dependencies to run a Chef cookbook.
$ fab installchef
More specifically, this task:
- Updates apt-get
- Installs git
- Installs Ruby
- Installs Chef
rendernodejson¶
Render Chef’s node.json file from a template.
$ fab rendernodejson
In addition to the cookbook, some of the settings Chef requires are stored in a local node.json file, which is rendered from a template.
This template file is where you can, for example, change the run times for the crontab job that updates the downloads website with the latest CAL-ACCESS data export.
In order for any changes you make to node.json.template to take effect on the server, you need to execute both the rendernodejson and cook tasks.
Configure¶
Tasks for configuring the downloads website Django environment.
createconfig¶
Prompt users for settings to be stored in the .env file.
$ fab createconfig
You will be prompted to provide:
- An AWS Access Key ID and Secret Access Key (read more here).
- An AWS region (defaults to us-west-2).
- An SSH key-pair file name (defaults to my-key-pair). This assumes you have a key pair stored in ~/.ec2/my-key-pair.pem (if you don’t, you should create one).
- The name of the PostgreSQL database that will serve as the backend for the downloads website (defaults to calaccess_website).
- The name of the database user the Django app will use to connect to the database (defaults to ccdc).
- The password for the database user.
- The name of the S3 bucket where the data files will be archived (defaults to django-calaccess-dev-data-archive).
- The name of the S3 bucket where the “baked” content files will be stored (defaults to django-calaccess-dev-baked-content).
- The host email address and password (press ENTER to skip).
- Addresses for the RDS and EC2 instances, in case these servers are already up and running. If not, press ENTER to skip for now, and spin them up later.
These configurations will be stored in a .env file (ignored by git) along with settings for other envs you have configured, each denoted by a section header such as [DEV] and [PROD].
setconfig¶
Add or edit a key-value pair in the .env configuration file.
$ fab setconfig:key=<new-variable-name>,value=<some-value>
Note that these changes will only take effect locally. In order to copy your new configuration to the EC2 instance, execute the copyconfig task.
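The effect of adding or editing a single key in a sectioned .env file can be sketched with Python's standard configparser module. The `set_config` function is a hypothetical illustration, not the Fabric task's source, and it assumes the file follows the `[DEV]` / `[PROD]` section layout described elsewhere in this documentation.

```python
import configparser


def set_config(path, section, key, value):
    """Add or update one key in the given section of an INI-style file."""
    config = configparser.ConfigParser()
    config.read(path)  # a missing file is silently treated as empty
    if not config.has_section(section):
        config.add_section(section)
    config.set(section, key, value)
    # Rewrite the whole file with the updated section.
    with open(path, "w") as f:
        config.write(f)
```

Other sections in the file are read and written back along with the change, though any comments in the file would be lost in the rewrite.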
Dev¶
Tasks for connecting to and running the downloads website server.
rs¶
Start up the Django runserver.
$ fab rs
The only optional argument is port, which defaults to 8000.
ssh¶
Log into the EC2 instance using SSH.
$ fab ssh
By default, you will connect to the instance specified in ec2_host under your current environment in the .env file. If you want to connect to another EC2 instance you have up and running, pass in the address like so:
$ fab ssh:<ec2_instance_address>
Env¶
Tasks for temporarily switching environments before running subsequent tasks.
For example, if your OS CALACCESS_WEBSITE_ENV environment variable is set to DEV, but you want to quickly deploy some recent changes to the production server, you can:
$ fab prod deploy
Open-source resources¶
- Code: github.com/california-civic-data-coalition/django-calaccess-downloads-website
- Issues: github.com/california-civic-data-coalition/django-calaccess-downloads-website/issues
- Documentation: django-calaccess.californiacivicdata.org
- Testing: travis-ci.org/california-civic-data-coalition/django-calaccess-downloads-website
- Coverage: coveralls.io/github/california-civic-data-coalition/django-calaccess-downloads-website
How to contribute¶
This is an open-source project that welcomes contributions from anyone who has the time and energy to help us untangle the CAL-ACCESS database. Here’s how to get started.
What developers can do¶
Contribute to django-calaccess-raw-data¶
This walkthrough is for developers who want to contribute to django-calaccess-raw-data, a Django app to download, extract and load campaign-finance and lobbying-activity data from the California Secretary of State’s CAL-ACCESS database.
It will show you how to install the source code of this application to fix bugs and develop new features.
Preparing a development environment¶
It is not required, but it is recommended that development of the library be done from within a contained virtual environment.
One way to accomplish that is with Python’s virtualenv tool and its helpful companion virtualenvwrapper. If you have that installed, a new project can be started with the following:
$ mkproject django-calaccess-raw-data
That will jump into a new folder in your code directory, where you can clone our code repository from GitHub after you make a fork of your own. Don’t know what that means? Read this.
$ git clone https://github.com/<YOUR-USERNAME>/django-calaccess-raw-data.git .
Next install the other Python libraries our code depends on.
$ pip install -r requirements.txt
Connecting to a local database¶
Unlike a typical Django project, this application only supports the PostgreSQL database backend. This is because we enlist specialized tools to load the immense amount of source data more quickly than Django typically allows.
Create the database the PostgreSQL way.
$ createdb calaccess_raw -U postgres
Create a file at example/project/settings_local.py to save your custom database credentials. That might look something like this.
DATABASES = {
'default': {
'NAME': 'calaccess_raw',
'ENGINE': 'django.db.backends.postgresql_psycopg2',
'USER': 'username', # <-- Change this
'PASSWORD': 'password', # <-- And this
'HOST': 'localhost',
'PORT': '5432'
}
}
Note
If you’d prefer to load the CAL-ACCESS data outside your default database, check out our guide to working with Django’s system for multiple databases.
Once the database is configured¶
Now create the tables and get to work.
$ python example/manage.py migrate
Once everything is set up, the updatecalaccessrawdata command will download the latest bulk data release from the Secretary of State’s website and load it into your local database.
$ python example/manage.py updatecalaccessrawdata
Warning
This will take a while. Go grab some coffee.
Welcome aboard!¶
Now that your development environment is set up, check out the GitHub issue tracker where plenty of work awaits.
As you submit your work, please pay attention to the results of our integration tests (more details here).
Contribute to django-calaccess-scraped-data¶
This walkthrough is for developers who want to contribute to django-calaccess-scraped-data, a Django app to scrape from the CAL-ACCESS website supplementary data not included in the California Secretary of State’s nightly data dumps.
It will show you how to install the source code of this application to fix bugs and develop new features.
Preparing a development environment¶
It is not required, but it is recommended that development of the library be done from within a contained virtual environment.
One way to accomplish that is with Python’s virtualenv tool and its helpful companion virtualenvwrapper. If you have that installed, a new project can be started with the following:
$ mkproject django-calaccess-scraped-data
That will jump into a new folder in your code directory, where you can clone our code repository from GitHub after you make a fork of your own. Don’t know what that means? Read this.
$ git clone https://github.com/<YOUR-USERNAME>/django-calaccess-scraped-data.git .
Next install the other Python libraries our code depends on.
$ pip install -r requirements.txt
Connecting to a local database¶
The calaccess_scraped app doesn’t have any specific database requirements. However, we recommend PostgreSQL 9.4 (or greater), which is a hard requirement of other apps in our tool chain.
Create the database the PostgreSQL way.
$ createdb calaccess_scraped -U postgres
Create a file at example/project/settings_local.py to save your custom database credentials. That might look something like this.
DATABASES = {
'default': {
'NAME': 'calaccess_scraped',
'ENGINE': 'django.db.backends.postgresql_psycopg2',
'USER': 'username', # <-- Change this
'PASSWORD': 'password', # <-- And this
'HOST': 'localhost',
'PORT': '5432'
}
}
Note
If you’d prefer to load the CAL-ACCESS data outside your default database, check out our guide to working with Django’s system for multiple databases.
Once the database is configured¶
Now create the tables and get to work.
$ python example/manage.py migrate
Now you’re ready to scrape. The scrapecalaccess command will download, cache and parse content from the CAL-ACCESS website:
$ python example/manage.py scrapecalaccess
Welcome aboard!¶
Now that your development environment is set up, check out the GitHub issue tracker where plenty of work awaits.
As you submit your work, please pay attention to the results of our integration tests (more details here).
Contribute to django-calaccess-processed-data¶
This walkthrough is for developers who want to contribute to django-calaccess-processed-data, a Django app to transform and refine campaign-finance and lobbying-activity data from the California Secretary of State’s CAL-ACCESS database.
It will show you how to install the source code of this application to fix bugs and develop new features.
Preparing a development environment¶
It is not required, but it is recommended that development of the library be done from within a contained virtual environment.
One way to accomplish that is with Python’s virtualenv tool and its helpful companion virtualenvwrapper. If you have that installed, a new project can be started with the following:
$ mkproject django-calaccess-processed-data
That will jump into a new folder in your code directory, where you can clone our code repository from GitHub after you make a fork of your own. Don’t know what that means? Read this.
$ git clone https://github.com/<YOUR-USERNAME>/django-calaccess-processed-data.git .
Next install the other Python libraries our code depends on.
$ pip install -r requirements.txt
Connecting to a local database¶
Unlike a typical Django project, this application only supports PostgreSQL version 9.6 and above as a database backend. This is because we enlist specialized tools to load the immense amount of source data. We haven’t developed those routines for SQLite and the other Django backends yet, but we might someday.
Create the database the PostgreSQL way.
$ createdb calaccess_processed -U postgres
Create a file at example/project/settings_local.py to save your custom database credentials. That might look something like this.
DATABASES = {
'default': {
'NAME': 'calaccess_processed',
'ENGINE': 'django.db.backends.postgresql_psycopg2',
'USER': 'username', # <-- Change this
'PASSWORD': 'password', # <-- And this
'HOST': 'localhost',
'PORT': '5432'
}
}
Note
If you’d prefer to load the CAL-ACCESS data outside your default database, check out our guide to working with Django’s system for multiple databases.
Once the database is configured¶
Now create the tables and get to work.
$ python example/manage.py migrate
Once everything is set up, the updatecalaccessrawdata command will download the latest bulk data release from the Secretary of State’s website and load it into your local database.
$ python example/manage.py updatecalaccessrawdata
Warning
This will take a while. Go grab some coffee.
Because the nightly raw export is incomplete, we have to scrape additional data from the CAL-ACCESS website. Use the scrapecalaccess command to kick off this process, either after updatecalaccessrawdata finishes or in a separate terminal window:
$ python example/manage.py scrapecalaccess
Once the raw CAL-ACCESS data is loaded and the scrape has finished, you can transform all this messy data you’ve collected into our easy-to-understand, well-documented models with the processcalaccessdata command:
$ python example/manage.py processcalaccessdata
Welcome aboard!¶
Now that your development environment is set up, check out the GitHub issue tracker where plenty of work awaits.
As you submit your work, please pay attention to the results of our integration tests (more details here).
Contribute to django-calaccess-downloads-website¶
This walkthrough is for developers who want to contribute to django-calaccess-downloads-website, an open-source archive of campaign-finance and lobbying-disclosure data from the California Secretary of State’s CAL-ACCESS database.
It will show you how to install the source code of this application to fix bugs, develop new features and deploy an archive to the Internet using Amazon Web Services.
Preparing a development environment¶
In order to contribute you first need to set up a local development environment by installing the source code and configuring a few settings.
While not required, we recommend that development be done within a contained virtual environment.
One way to accomplish that is with two related Python packages: virtualenv and virtualenvwrapper. If you have both of these installed, a new project can be easily created like so:
$ mkproject django-calaccess-downloads-website
That will jump into a new folder in your code directory, where you can fork our code repository from GitHub. Don’t know what that means? Read this.
Once you’ve created a fork, you should clone it to your computer.
$ git clone https://github.com/<YOUR-USERNAME>/django-calaccess-downloads-website.git .
Next, install the other Python libraries our code depends on, like the Django web framework.
$ pip install -r requirements.txt
Many of the settings in this project can vary depending on where the code is being run. For instance, your local installation of the code will likely connect to a different database than the public website.
To keep these different environments straight and avoid including sensitive passwords in public repositories, we have developed a system for storing many of the configuration options in a file named .env at the project’s root.
The file is excluded from Git’s version control system and needs to be created fresh each time the code is installed.
The .env file is expected to contain a separate section for each environment, using the structure favored by Python’s ConfigParser module. Here’s a simple example:
[DEV]
database_name=calaccess
mysecretpassword=password
[PROD]
database_name=calaccess
mysecretpassword=hotpockets
By default, the source code will draw settings from a section named DEV. To configure it to use a different set of variables (like the PROD section above), you must set the CALACCESS_WEBSITE_ENV environment variable.
$ export CALACCESS_WEBSITE_ENV=PROD
If you are using virtualenv and virtualenvwrapper, you could add the above line of code to $VIRTUAL_ENV/bin/postactivate so that whenever you start the project’s virtual environment, the variable will be exported automatically.
Note
You could also add the following line to your $VIRTUAL_ENV/bin/postdeactivate script to clear the variable whenever you deactivate the virtual environment:
$ unset CALACCESS_WEBSITE_ENV
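The way a section of the .env file is chosen can be sketched with Python's standard configparser module. This is an illustration rather than the project's own settings code: the `load_settings` function is hypothetical, and the sample text simply repeats the example file shown above.

```python
import configparser
import os

SAMPLE = """
[DEV]
database_name=calaccess
mysecretpassword=password

[PROD]
database_name=calaccess
mysecretpassword=hotpockets
"""


def load_settings(text, environ=os.environ):
    """Return the settings for the section named by CALACCESS_WEBSITE_ENV."""
    # Fall back to the DEV section when the variable is not set.
    section = environ.get("CALACCESS_WEBSITE_ENV", "DEV")
    config = configparser.ConfigParser()
    config.read_string(text)
    return dict(config[section])
```

With the variable unset, `load_settings(SAMPLE)` returns the DEV values; after `export CALACCESS_WEBSITE_ENV=PROD` it would return the PROD values instead.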
Unlike a typical Django project, this application only supports PostgreSQL version 9.6 and above as a database backend. This is because we enlist specialized tools to load the immense amount of source data more quickly than Django typically allows.
Create the database the PostgreSQL way.
$ createdb calaccess_website -U postgres
Even a development project that will run only on your computer needs an account with Amazon Web Services to store archived files in its S3 file service.
If you don’t already have an AWS account, make one now and request a key pair that lets you access its services via Python.
Then create a new S3 “bucket” to store files archived by this project.
The development environment can be created in the .env file by running a Fabric task that will ask you to provide a value for all of this project’s settings.
$ fab createconfig
You will be prompted to provide the project’s full list of settings, though some of them are only necessary when deploying the code and site with Amazon Web Services.
Setting | Required in development | Definition |
---|---|---|
db_name | Yes | Name of your database. |
db_user | Yes | Database user. |
db_password | Yes | Database password. |
db_host | Yes | Database host location. |
aws_access_key_id | Yes | Shorter secret key for accessing Amazon Web Services. |
aws_secret_access_key | Yes | The longer secret key for accessing Amazon Web Services. |
aws_region_name | Yes | Amazon Web Services region where your resources are located. |
s3_archived_data_bucket | Yes | Amazon S3 bucket where archived CAL-ACCESS data will be stored. |
s3_baked_content_bucket | No | Amazon S3 bucket where the public-facing website will be stored. |
key_name | No | Name of the SSH .pem file associated with Amazon Web Services. Should be found in ~/.ec2 . |
ec2_host | No | Public address of website’s Amazon EC2 instance. |
email_user | No | Gmail account for sending error emails. |
email_password | No | Gmail password for sending error emails. |
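If you want to double-check a development configuration against the table above, a small sketch like the following could flag what is missing. The `missing_settings` function is hypothetical and not part of the project; the tuple of names is copied from the rows marked "Yes" in the table.

```python
REQUIRED_IN_DEVELOPMENT = (
    "db_name", "db_user", "db_password", "db_host",
    "aws_access_key_id", "aws_secret_access_key",
    "aws_region_name", "s3_archived_data_bucket",
)


def missing_settings(settings):
    """Return required setting names that are absent or blank."""
    return [
        name for name in REQUIRED_IN_DEVELOPMENT
        # Treat None, empty strings and whitespace as missing.
        if not str(settings.get(name) or "").strip()
    ]
```

An empty list means every setting required for development has a value.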
If necessary, you can overwrite a specific setting or append a new one:
$ fab setconfig:key=<new-variable-name>,value=<some-value>
You can also print your current app environment’s configuration:
$ fab printconfig
Or everything in the Fabric environment:
$ fab printenv
Now that everything is configured, create the database tables.
$ python manage.py migrate
Once everything is set up, the updatedownloadswebsite command will download the latest bulk data release from the Secretary of State’s website, load it into your local database and archive the files on Amazon S3.
$ python manage.py updatedownloadswebsite
Warning
This will take a while. Go grab some coffee.
Finally, start the development server and visit localhost:8000/admin/ in your browser to inspect the site.
$ python manage.py runserver
Preparing a production server¶
This section will walk you through deploying the downloads website on the Internet via Amazon Web Services. You will need to have completed the steps above.
As described above, the source code draws its settings by default from the section of the .env file named DEV.
To configure your project for a production environment, set the CALACCESS_WEBSITE_ENV environment variable to PROD.
$ export CALACCESS_WEBSITE_ENV=PROD
If you are using virtualenv and virtualenvwrapper, you could add the above line of code to $VIRTUAL_ENV/bin/postactivate so that the variable will be exported automatically whenever you use workon to begin work.
You will need to create a hosted database to store the data and keep tabs on the archive over time. Our recommended method for doing this is using Amazon’s Relational Database Service.
You can spin up a PostgreSQL server there using our prepackaged Fabric commands. You’re only required to provide a name like download-website:
$ fab createrds:download-website
Then, wait several minutes while the server is provisioned.
By default, the new database server will have 100 GB of disk space allocated on a db.t2.large RDS class instance. If need be, you can override these settings:
$ fab createrds:download-website,block_gb_size=80,instance_type=db.m4.large
The address for the RDS host will automatically be added to the configuration for your current environment, which is stored in .env.
If you already had an RDS host set for your current env, its address will be overwritten.
Next you should create a new Ubuntu 14.04 server on Amazon’s Elastic Compute Cloud to host the Django project.
$ fab createec2
By default, the server will have 100 GB of disk space allocated on a c3.large class instance. If need be, you can override these settings:
$ fab createec2:block_gb_size=80,instance_type=c3.xlarge
You can also override our default Amazon Machine Image (AMI):
$ fab createec2:ami=<some-other-ami-id>
As with creating an RDS instance, the address for your new EC2 instance will automatically be added to the configuration for your current environment, which is stored in .env. If you already had an EC2 host set, its address will be overwritten.
Now you’ll want to run our configuration command again, this time filling in the new details from your AWS account, database and server. You may want to create a new set of S3 buckets separate from your development buckets.
$ fab createconfig
Finally, you’re ready to bootstrap the Django project on the Ubuntu server.
$ fab bootstrap
After connecting to your current EC2 instance, a framework called Chef and its dependencies, including Ruby, will be installed on the server. Chef is used to configure the server and install the downloads website’s code.
The bootstrap task also sets up a crontab job that runs every six hours to automate the collection, extraction and processing of the daily CAL-ACCESS database exports.
Testing¶
Our code is tested using Django’s built-in unit testing system via the TravisCI continuous integration service.
In addition, before the Django unit tests run, code is evaluated using Python’s pep8 and pyflakes style-guide enforcement tools.
When a commit or pull request is made to our repository, those tests are rerun with the latest code. We try not to be too uptight, but we generally expect the tests to pass before we will merge a request.
You can also run these tests locally. Change your directory into your local copy of any of our repos, and then:
$ make test
What anyone can do¶
Contribute to our documentation¶
We’re maintaining a single repository for all documents related to the Django CAL-ACCESS project. This section is for anyone who wants to contribute to these documents.
Do I need to know Python (or Django)?¶
No. But you should be familiar with the syntax of reStructuredText, since that’s the format in which these documents are written.
Which files should I edit?¶
Generally, you should be editing the .rst files in the docs/ directory, rather than any of the .html files in the _build/ directory. The .html files are compiled using Python’s Sphinx documentation framework (for more on this, see Viewing changes locally).
Viewing changes locally¶
After you make changes to your local versions of the .rst files in the docs/ directory, you can view your edits as they will appear on ReadTheDocs:
$ make html
This will invoke the sphinx command to compile the .html and .doctree files as they will be when deployed to ReadTheDocs. These files are then saved in docs/_build/html, where you can open them with your favorite web browser and see how they look.
However, you might instead prefer to see your edits as you’re making them:
$ make livehtml
This will start serving the docs on http://127.0.0.1:8000, where you can point your favorite web browser. While the server is running, Sphinx will also detect any change you make to the .rst files and, on save, automatically rebuild the .html files.
What maintainers should know¶
Version release checklist¶
The steps to follow each time we release a new version of a Django package.
- Update the changelog
- Update setup.py with new version and any new dependencies
  - Consider including RC (release candidate) in release name until we’re sure we’re uploading a package with all tiny details accounted for
- Update requirements.txt with any new dependencies
- Run python setup.py sdist bdist_wheel, make sure there aren’t any errors
- Spot-check new release package in dist/ to make sure all files made it in.
- Run twine upload dist/* --skip-existing
- Release on GitHub
  - git commit final change and run git tag "v#.#.#" with whatever the release number is
  - Run git push origin master --tags
  - Add list of changes to the release page on GitHub
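The tagging step can be made less error-prone with a small helper. The following is only a sketch: it assumes version strings follow a three-part number with an optional PEP 440 style release-candidate suffix (for example 1.2.3rc1), and the function names git_tag_for and is_release_candidate are hypothetical, not part of any of our packages.

```python
import re

# Assumed version format: three numeric parts plus an optional "rcN" suffix
VERSION_PATTERN = re.compile(r"^(\d+)\.(\d+)\.(\d+)(rc\d+)?$")


def git_tag_for(version):
    """Return the git tag name ("v" + version) for a version string."""
    if VERSION_PATTERN.match(version) is None:
        raise ValueError("Unexpected version string: %r" % version)
    return "v" + version


def is_release_candidate(version):
    """Return True if the version string carries an "rcN" suffix."""
    match = VERSION_PATTERN.match(version)
    return bool(match and match.group(4))


# Print the tag name and release-candidate flag for two sample versions
for version in ("1.2.3", "1.2.3rc1"):
    print(version, git_tag_for(version), is_release_candidate(version))
```

A check like this could run before git tag so that a malformed version string is caught before anything is pushed.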
Frequently asked questions¶
Questions and answers about the technical aspects of our work. A broader FAQ about the CAL-ACCESS database and our work with it can be found here.
How do the Django applications fit together?¶
The django-calaccess-raw-data application is intended as the base layer below more sophisticated apps, like django-calaccess-processed-data, that transform the source data and load it into simplified models to serve as a platform for investigative analysis.
Why does django-calaccess-raw-data use loading techniques not supported by Django?¶
Because the CAL-ACCESS database is huge. With more than 35 million records sprawled across 80 tables, it can take a long time to load into a database using the standard Django tools, which insert one record at a time. In our early testing, it took as long as 24 hours to load all of CAL-ACCESS into a database on a standard laptop computer.
To speed things up, our loading commands take advantage of the built-in bulk loading tools offered by PostgreSQL and MySQL, which are not currently included in Django’s system. These tools (COPY in PostgreSQL and LOAD DATA INFILE in MySQL) insert CSV files from the file system directly into the database in a small fraction of the time it would take to load them row by row.
As part of developing these tools we released django-postgres-copy, a Django extension that makes it easier for us and other developers to work with these valuable tools.
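To make the idea concrete, here is a minimal sketch of a bulk load through PostgreSQL’s COPY. It is not the code used by django-calaccess-raw-data or django-postgres-copy. The function names build_copy_sql and bulk_load and the table name example_table are hypothetical, and the connection argument is assumed to be an open psycopg2 connection.

```python
import csv


def build_copy_sql(table, columns):
    """Build a PostgreSQL COPY statement that reads CSV data from STDIN."""
    # Naive quoting for illustration only; do not pass untrusted names
    quoted = ", ".join('"%s"' % name for name in columns)
    return 'COPY "%s" (%s) FROM STDIN WITH (FORMAT csv, HEADER true)' % (
        table,
        quoted,
    )


def bulk_load(connection, table, csv_file):
    """
    Stream an open, seekable CSV file into a table with a single COPY.

    The connection is assumed to be an open psycopg2 connection.
    """
    # Read the header row to learn the column order, then rewind the file
    columns = next(csv.reader(csv_file))
    csv_file.seek(0)
    with connection.cursor() as cursor:
        # copy_expert sends the whole file to the server in one statement
        cursor.copy_expert(build_copy_sql(table, columns), csv_file)
    connection.commit()


# Building the statement does not require a database connection
print(build_copy_sql("example_table", ["id", "name"]))
```

The whole file travels to the database in one statement, which avoids the per-row overhead of issuing a separate INSERT for every record.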
Why does django-calaccess-raw-data only work with PostgreSQL and MySQL databases?¶
Because of the answer above. To run our loading routines in an acceptable amount of time, we need to take advantage of bulk file loading tools not currently supported by Django.
So far, we have only written custom loading routines for MySQL and PostgreSQL. We would welcome contributions that would expand our database support to other systems, like SQLite and Microsoft SQL Server. But we haven’t got there yet.
Do I have to load the CAL-ACCESS data into my default database?¶
No, django-calaccess-raw-data supports the use of automatic database routing, which Django’s own documentation describes as “the easiest way to use multiple databases”.
If you want to keep the CAL-ACCESS data in a separate database, first be sure you’ve carefully read through Django’s multiple databases topic guide.
Next, configure your additional databases in settings.py. Let’s assume you want two PostgreSQL databases: one for all CAL-ACCESS data, registered under the alias calaccess_raw, and a default my_project database for everything else:
DATABASES = {
    'default': {
        'NAME': 'my_project',
        'ENGINE': 'django.db.backends.postgresql_psycopg2',
        'USER': 'your-username-here',
        'PASSWORD': 'your-password-here',
        'HOST': 'localhost',
        'PORT': '5432'
    },
    'calaccess_raw': {
        'NAME': 'calaccess',
        'ENGINE': 'django.db.backends.postgresql_psycopg2',
        'USER': 'your-username-here',
        'PASSWORD': 'your-password-here',
        'HOST': 'localhost',
        'PORT': '5432'
    },
}
Then, create a routers.py file in your Django project’s directory (same place as manage.py and settings.py). Following from the above example, here’s how you could implement a router to send calaccess_raw model data to its own database and everything else to default:
class ExampleRouter(object):
    """
    Send calaccess_raw models to their own db. Everything else to default.
    """
    def get_db(self, model=None, app_label=None):
        # Use the explicit app label if given, otherwise read it off the model
        app_label = app_label or model._meta.app_label
        if app_label == 'calaccess_raw':
            db_label = 'calaccess_raw'
        else:
            db_label = 'default'
        return db_label

    def db_for_read(self, model, **hints):
        """
        Attempts to read calaccess_raw models go to calaccess_raw db.
        """
        return self.get_db(model=model)

    def db_for_write(self, model, **hints):
        """
        Attempts to write calaccess_raw models go to calaccess_raw db.
        """
        return self.get_db(model=model)

    def allow_relation(self, obj1, obj2, **hints):
        """
        Allow relations only when both objects are routed to the same db.
        """
        return self.get_db(model=obj1) == self.get_db(model=obj2)

    def allow_migrate(self, db, app_label, model=None, **hints):
        """
        Make sure the calaccess_raw app only appears in the calaccess_raw
        database.
        """
        intended_db = self.get_db(app_label=app_label)
        return (db == intended_db) or (db == 'default' and intended_db is None)
Finally, configure the router in settings.py:
DATABASE_ROUTERS = ['example.routers.ExampleRouter']
And everything should be ready.
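If you’d like to sanity-check the routing rules before running any migrations, the sketch below exercises them without a database or a configured Django project. It uses a condensed copy of the router described above so that it runs on its own. The fake_model helper is a hypothetical stand-in that exposes only the _meta.app_label attribute the router reads, and auth is used simply as an example of any other app.

```python
from types import SimpleNamespace


class ExampleRouter(object):
    """Condensed copy of the example router, so this sketch is standalone."""
    def get_db(self, model=None, app_label=None):
        app_label = app_label or model._meta.app_label
        return 'calaccess_raw' if app_label == 'calaccess_raw' else 'default'

    def db_for_read(self, model, **hints):
        return self.get_db(model=model)

    def db_for_write(self, model, **hints):
        return self.get_db(model=model)

    def allow_relation(self, obj1, obj2, **hints):
        return self.get_db(model=obj1) == self.get_db(model=obj2)

    def allow_migrate(self, db, app_label, model=None, **hints):
        return db == self.get_db(app_label=app_label)


def fake_model(app_label):
    """Hypothetical stand-in exposing only _meta.app_label."""
    return SimpleNamespace(_meta=SimpleNamespace(app_label=app_label))


router = ExampleRouter()
raw_model = fake_model('calaccess_raw')
other_model = fake_model('auth')

# Reads and writes for calaccess_raw models go to their own database
print(router.db_for_read(raw_model), router.db_for_write(raw_model))
# Everything else goes to the default database
print(router.db_for_read(other_model))
# Relations across the two databases are refused
print(router.allow_relation(raw_model, other_model))
# calaccess_raw tables are only migrated into the calaccess_raw database
print(router.allow_migrate('default', 'calaccess_raw'))
print(router.allow_migrate('calaccess_raw', 'calaccess_raw'))
```

In a real project you would import the router from your own routers.py rather than redefine it.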