Welcome to PasteHunter’s documentation!

PasteHunter is a python3 application that is designed to query a collection of sites that host publicly pasted data. For every paste it finds, it scans the raw contents against a series of yara rules, looking for information that can be used by an organisation or a researcher.

Installation

There are a few ways to install PasteHunter. Pip is the recommended route for stable releases.

Pip Installation

Note: Pip or setup.py installation will require gcc and wheel.

Pip installation is supported for versions after 1.2.1. This can easily be done using:

pip install pastehunter

You will then need to configure pastehunter. To do this, use:

mkdir -p ~/.config
wget https://raw.githubusercontent.com/kevthehermit/PasteHunter/master/settings.json.sample -O ~/.config/pastehunter.json

Then modify ~/.config/pastehunter.json to match your desired settings and run the project using pastehunter-cli.

Local Installation

Pastehunter

If you want to run the latest stable version grab the latest release from https://github.com/kevthehermit/PasteHunter/releases. If you want to run the development version clone the repository or download the latest archive.

Pastehunter has very few dependencies. You can install all the python libraries using the requirements.txt file:

sudo pip3 install -r requirements.txt

Yara

Yara is the scanning engine that scans each paste. Use the official documentation to install yara and the python3 library. https://yara.readthedocs.io/en/latest/gettingstarted.html#compiling-and-installing-yara

All yara rules are stored in the YaraRules directory. An index.yar file is created at run time that includes all additional yar files in this directory. To add or remove yara rules, simply add or remove the rule file from this directory.

Kibana

Kibana is the frontend search to Elasticsearch. If you have enabled the Elasticsearch module you probably want this. To install, follow the official directions at https://www.elastic.co/guide/en/kibana/current/deb.html.

Docker Installation

You will find a Dockerfile that will build the latest stable version of PasteHunter.

This can be used with the included docker-compose.yml file. A sample podspec for Kubernetes is coming soon.

Configuration

See this page for help migrating configs from older versions (<1.2.1)

Before you can get up and running you will need to set up the basic config. Copy the settings.json.sample to settings.json and edit with your editor of choice.
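
The exact key names and defaults live in settings.json.sample, but as a rough sketch the file is organised into top-level sections that mirror this documentation (treat this layout as illustrative only):

{
    "yara": {},
    "general": {},
    "log": {},
    "inputs": {},
    "outputs": {},
    "postprocess": {},
    "sandboxes": {}
}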

Yara

  • rule_path: defaults to the YaraRules directory in the PasteHunter root.
  • blacklist: If set to true, any pastes that match this rule will be ignored.
  • test_rules: Occasionally I release some early test rules. Set this to true to use them.
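
Putting these together, the yara section of the config might look something like this (the values shown are examples rather than the shipped defaults):

"yara": {
    "rule_path": "YaraRules",
    "blacklist": true,
    "test_rules": false
}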

log

Logging for the application is configured here.

Level     Numerical
CRITICAL  50
ERROR     40
WARNING   30
INFO      20
DEBUG     10
NOTSET    0

general

General config options here.

  • run_frequency: Sleep delay between fetches of the input lists. This helps to avoid rate limits.

For Input, Output and Postprocess settings please refer to the relevant sections of the docs.

Starting

You can run pastehunter by calling the script by name.

python3 pastehunter-cli

Service

You can install pastehunter as a service if you're planning on running it for long periods of time. An example systemd service file is shown below.

Create a new service file /etc/systemd/system/pastehunter.service

Add the following text, updating as appropriate for your setup and paying attention to file paths and usernames:

[Unit]
Description=PasteHunter

[Service]
WorkingDirectory=/opt/PasteHunter
ExecStart=/usr/bin/python3 /opt/PasteHunter/pastehunter-cli
User=localuser
Group=localuser
Restart=always

[Install]
WantedBy=multi-user.target

Before starting the service, ensure you have tested the pastehunter app on the command line and identified any errors. Once you're ready, reload systemd, enable the new service, and start it:

systemctl daemon-reload
systemctl enable pastehunter.service
systemctl start pastehunter

Inputs

This page details all the configuration options per input.

There are a few generic options for each input.

  • enabled: This turns the input on and off.
  • store_all: Store all pastes, not only those that match a rule.
  • module: This is used internally by pastehunter.

Pastebin

To use the pastebin scraping API you need a Pro account. These need to be purchased and are almost always on some sort of offer: https://pastebin.com/pro. The API uses your IP to authenticate instead of a key, so you will need to whitelist your IP at https://pastebin.com/api_scraping_faq

  • api_scrape: The URL endpoint for the list of recent paste ids.
  • api_raw: The URL endpoint for the raw paste.
  • paste_limit: How many pasteids to fetch from the recent list.
  • store_all: Store all pastes regardless of a rule match.
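
As an illustration, a pastebin input block might look like the following. The endpoint URLs and limits shown here are examples; use the values from settings.json.sample, which also includes the internal module key omitted here:

"pastebin": {
    "enabled": true,
    "store_all": false,
    "api_scrape": "https://scrape.pastebin.com/api_scraping.php",
    "api_raw": "https://scrape.pastebin.com/api_scrape_item.php",
    "paste_limit": 200
}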

Github Gists

Github has an API that can be used at no cost to query recent gists. There are two options here.

  • Without an access key - You will have a low rate limit.
  • With an access key - You will have a higher rate limit.

The unauthenticated option is not suitable for pastehunter running full time. To create your key visit https://github.com/settings/tokens

YOU DO NOT NEED TO GIVE IT ANY ACCESS PERMISSIONS

  • api_token: The token you generated.
  • api_limit: Rate limit to prevent being blocked.
  • store_all: Store all pastes regardless of a rule match.
  • user_blacklist: Do not process gists created by these usernames.
  • file_blacklist: Do not process gists that match these filenames.
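
A gist input block might then look something like the example below. The block name, limits and blacklist entries are illustrative; check settings.json.sample for the real defaults:

"gists": {
    "enabled": true,
    "store_all": false,
    "api_token": "<your github token>",
    "api_limit": 100,
    "user_blacklist": [],
    "file_blacklist": ["package-lock.json", "yarn.lock"]
}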

Github Activity

Github’s activity feed is a list of public changes made. We specifically filter on commits. It can be accessed in a similar manner to gists:

  • Without an access key - You will have a low rate limit.
  • With an access key - You will have a higher rate limit.

Again, the unauthenticated option is not suitable for pastehunter running full time, particularly if you’re also running the gist input. However, the same token may be used for both inputs.

  • api_token: The token you generated.
  • api_limit: Rate limit to prevent being blocked.
  • store_all: Store all pastes regardless of a rule match.
  • user_blacklist: Do not process events created by these usernames.
  • ignore_bots: Ignore users with [bot] in their username (only actual bots can do this)
  • file_blacklist: Do not process commits containing files that match these filenames. Supports glob syntax.
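
The config block is very similar to the gist input, with the addition of ignore_bots. Again, the block name and values below are illustrative only:

"github": {
    "enabled": true,
    "store_all": false,
    "api_token": "<your github token>",
    "api_limit": 100,
    "user_blacklist": [],
    "ignore_bots": true,
    "file_blacklist": ["*.lock"]
}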

Slexy

Slexy has some heavy rate limits (30 requests per 30 seconds), but may still return interesting results.

  • store_all: Store all pastes regardless of a rule match.
  • api_scrape: The URL endpoint for the list of recent pastes.
  • api_raw: The URL endpoint for the raw paste.
  • api_view: The URL endpoint to view the paste.

ix.io

ix.io is a smaller site used primarily for console/command line pastes.

  • store_all: Store all pastes regardless of a rule match.

StackExchange

The same API is used to query all of the StackExchange sites. Similar to github, there is a public API with a reduced rate limit, or an App API with a higher cap. There is a cap of 10,000 requests per day per IP, so pulling all sites would be impractical. Generate a key at https://stackapps.com/.

There are over 170 exchanges that form StackExchange. The following are the most likely to expose privileged information:

  • stackoverflow
  • serverfault
  • superuser
  • webapps
  • webmasters
  • dba

The configuration options are:

  • site_list: List of site short titles that will be scraped.
  • api_key: API App key as generated above.
  • store_filter: This is the stackexchange filter that determines what fields are returned. It must contain the body element.
  • pagesize: How many questions to pull from the latest list.
  • store_all: Store all pastes regardless of a rule match.
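
An illustrative stackexchange block is shown below. The store_filter value is a placeholder; you will need to generate a filter that includes the body element, and the exact defaults are in settings.json.sample:

"stackexchange": {
    "enabled": false,
    "store_all": false,
    "api_key": "<your stackapps key>",
    "site_list": ["stackoverflow", "serverfault", "superuser"],
    "store_filter": "<filter string that includes the body element>",
    "pagesize": 100
}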

Outputs

This page details all the configuration options for the output modules. There are a few generic options for each output.

  • enabled: This turns the output on and off.
  • module: This is used internally by pastehunter.
  • classname: This is used internally by pastehunter.

Elasticsearch

Elasticsearch was the default output, storing all pastes and using Kibana as a graphical frontend to view the results.

  • elastic_index: The name of the index.
  • weekly_index: Use a numbered index for each week of the year instead of a single index.
  • elastic_host: Hostname or IP of the Elasticsearch instance.
  • elastic_port: Port number for Elasticsearch. The default is 9200.
  • elastic_user: Username if using xpack / shield or basic auth.
  • elastic_pass: Password if using xpack / shield or basic auth.
  • elastic_ssl: True or false if Elasticsearch is served over SSL.
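
For example, an Elasticsearch output block might look like this (the block name and values are illustrative; the shipped defaults are in settings.json.sample):

"elastic_output": {
    "enabled": true,
    "elastic_index": "paste-test",
    "weekly_index": true,
    "elastic_host": "127.0.0.1",
    "elastic_port": 9200,
    "elastic_user": "elastic",
    "elastic_pass": "changeme",
    "elastic_ssl": false
}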

Splunk

Splunk output is similar to Elasticsearch. All the data is put into Splunk and then Splunk can be used for graphical frontend and querying.

  • splunk_host: Hostname or IP of your Splunk instance.
  • splunk_port: The Splunk management port. (Usually port 8089)
  • splunk_user: Username of your Splunk user.
  • splunk_pass: Password for your Splunk user.
  • splunk_index: The name of the Splunk index to store the data in.
  • store_raw: Include the raw paste in the data sent to Splunk.
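
An illustrative Splunk output block (the block name and values are examples only):

"splunk_output": {
    "enabled": false,
    "splunk_host": "127.0.0.1",
    "splunk_port": 8089,
    "splunk_user": "admin",
    "splunk_pass": "changeme",
    "splunk_index": "pastehunter",
    "store_raw": true
}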

JSON

This output module will store each paste in a json file on disk. The name of the file is the pasteid.

  • output_path: Path on disk to store output files.
  • store_raw: Include the raw paste in the json file. False just stores metadata.
  • encode_raw: Ignored; reserved for future usage.
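
An illustrative json output block (the block name and path are examples only):

"json_output": {
    "enabled": true,
    "output_path": "/opt/pastehunter/json",
    "store_raw": true,
    "encode_raw": true
}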

CSV

The CSV output will append lines to a CSV that contains basic metadata from all paste sources. The raw paste is not included.

  • output_path: Path on disk to store output files.

The stored elements are:

  • Timestamp
  • Pasteid
  • Yara Rules
  • Scrape URL
  • Pastesite

Syslog

Using the same format as the CSV output this writes paste metadata to a syslog server. The raw paste is not included.

  • host: IP or hostname of the syslog server.
  • port: Port number of the syslog server.

SMTP

This output will send an email to specific email addresses depending on the YaraRules that are matched. You need to set up an SMTP server.

  • smtp_host: hostname for the SMTP server.
  • smtp_port: Port number for the SMTP Server.
  • smtp_security: One of tls, starttls, none.
  • smtp_user: Username for SMTP Authentication.
  • smtp_pass: Password for SMTP Authentication.
  • recipients: Json array of recipients and rules.
    - address: Email address to send alerts to.
    - rule_list: A list of rules to alert on. Any of the rules in this list will trigger an email.
    - mandatory_rule_list: List of rules that MUST be present to trigger an email alert.
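
For example, a recipients array might look like the following. The rule names are placeholders for whichever yara rules you care about:

"recipients": [
    {
        "address": "soc@example.com",
        "rule_list": ["email_list", "custom_keywords"],
        "mandatory_rule_list": []
    }
]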

Slack

This output will send a Notification to a slack web hook. You need to configure the URL and the channel in Slack. Head over to https://api.slack.com/apps?new_app=1

Create a new Slack App with a name and the workspace that you want to send alerts to. Once created, under Add Features and Functionality, select Incoming Webhooks and toggle the Active button to on. At the bottom of the page select Add New Webhook to Workspace. This will show another page where you select the channel that will receive the notifications. Once it has authorized the app you will see a new Webhook URL. This is the URL that needs to be added to the pastehunter config.

  • webhook_url: Generated when creating a Slack App as described above.
  • rule_list: List of rules that will generate an alert.
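
An illustrative slack output block, using a placeholder webhook URL and rule name:

"slack_output": {
    "enabled": true,
    "webhook_url": "https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX",
    "rule_list": ["custom_keywords"]
}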

PostProcess

There are a handful of post process modules that can run additional checks on the raw paste data.

There are a few generic options for each postprocess module.

  • enabled: This turns the postprocess module on and off.
  • module: This is used internally by pastehunter.

Email

This postprocess module extracts additional information from data that includes email addresses. It will extract counts for:

  • Total Emails
  • Unique Email addresses
  • Unique Email domains

These three values are then added to the metadata for storage.

  • rule_list: List of rules that will trigger the postprocess module.
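
An illustrative email postprocess block; the block and rule names are examples only:

"post_email": {
    "enabled": true,
    "rule_list": ["email_list"]
}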

Base64

This postprocess will attempt to decode base64 data and then apply further processing on the new file data. At the moment this module only operates when the full paste is a base64 blob, i.e. it will not extract base64 code that is embedded in other data.

  • rule_list: List of rules that will trigger the postprocess module.

See the Sandboxes documentation for information on how to configure the sandboxes used for scanning decoded base64 data.

Entropy

This postprocess module calculates Shannon entropy on the raw paste data. This can be used to help identify binary, encoded, or encrypted data.

  • rule_list: List of rules that will trigger the postprocess module.

Compress

Compresses the data using LZMA (lossless compression) if it will reduce the size. Small pastes or pastes that don't benefit from compression will not be affected by this module. Its outputs can be decompressed by base64-decoding, then using the xz command.

  • rule_list: List of rules that will trigger the postprocess module.

Sandboxes

There are a few sandboxes that can be configured and used in various post process steps.

There are a few generic options for each sandbox.

  • enabled: This turns the sandbox on and off.
  • module: This is used internally by pastehunter.

Cuckoo

If a sample matches a binary file format you can optionally send it for analysis by a Cuckoo Sandbox.

  • api_host: IP or hostname for a Cuckoo API endpoint.
  • api_port: Port number for a Cuckoo API endpoint.
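
An illustrative cuckoo sandbox block. The host and port are examples (8090 is the usual Cuckoo REST API port); check settings.json.sample for the exact keys:

"cuckoo": {
    "enabled": false,
    "api_host": "127.0.0.1",
    "api_port": 8090
}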

Viper

If a sample matches a binary file format you can optionally send it to a Viper instance for further analysis.

  • api_host: IP or hostname for a Viper API endpoint.
  • api_port: Port number for a Viper API endpoint.