Ultimate Sitemap Parser¶
usp¶
usp package¶
Subpackages¶
usp.objects package¶
Submodules¶
usp.objects.page module¶
Objects that represent a page found in one of the sitemaps.
-
usp.objects.page.SITEMAP_PAGE_DEFAULT_PRIORITY= Decimal('0.5')¶ Default sitemap page priority, as per the spec.
-
class
usp.objects.page.SitemapNewsStory(title: str, publish_date: datetime.datetime, publication_name: Optional[str] = None, publication_language: Optional[str] = None, access: Optional[str] = None, genres: List[str] = None, keywords: List[str] = None, stock_tickers: List[str] = None)[source]¶ Bases:
object
Single story derived from Google News XML sitemap.
-
access¶ Return accessibility of the article.
Returns: Accessibility of the article.
-
genres¶ Return list of properties characterizing the content of the article.
Returns genres such as “PressRelease” or “UserGenerated”.
Returns: List of properties characterizing the content of the article
-
keywords¶ Return list of keywords describing the topic of the article.
Returns: List of keywords describing the topic of the article.
-
publication_language¶ Return primary language of the news publication in which the article appears.
It should be an ISO 639 Language Code (either 2 or 3 letters).
Returns: Primary language of the news publication in which the article appears.
-
publication_name¶ Return name of the news publication in which the article appears.
Returns: Name of the news publication in which the article appears.
-
publish_date¶ Return story publication date.
Returns: Story publication date.
-
stock_tickers¶ Return list of up to 5 stock tickers that are the main subject of the article.
Each ticker must be prefixed by the name of its stock exchange, and must match its entry in Google Finance. For example, “NASDAQ:AMAT” (but not “NASD:AMAT”), or “BOM:500325” (but not “BOM:RIL”).
Returns: List of up to 5 stock tickers that are the main subject of the article.
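The ticker format described above can be checked mechanically. The following is a minimal sketch, not part of the package: `clean_stock_tickers` and its regular expression are hypothetical, and the comma-separated input is an assumption about how the raw sitemap value arrives. It only checks the `EXCHANGE:SYMBOL` shape and the limit of 5; it cannot tell whether an exchange name matches Google Finance (so “NASD:AMAT” would still pass).

```python
import re
from typing import List

# Hypothetical shape check: exchange prefix, a colon, then the symbol.
_TICKER_RE = re.compile(r"^[A-Z]+:[A-Z0-9.]+$")
MAX_STOCK_TICKERS = 5  # "up to 5", per the description above


def clean_stock_tickers(raw: str) -> List[str]:
    """Split a comma-separated string, keep well-formed tickers, cap at 5."""
    tickers = [ticker.strip() for ticker in raw.split(",")]
    valid = [ticker for ticker in tickers if _TICKER_RE.match(ticker)]
    return valid[:MAX_STOCK_TICKERS]
```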
-
title¶ Return story title.
Returns: Story title.
-
-
class
usp.objects.page.SitemapPage(url: str, priority: decimal.Decimal = Decimal('0.5'), last_modified: Optional[datetime.datetime] = None, change_frequency: Optional[usp.objects.page.SitemapPageChangeFrequency] = None, news_story: Optional[usp.objects.page.SitemapNewsStory] = None)[source]¶ Bases:
object
Single sitemap-derived page.
-
change_frequency¶ Return change frequency of a sitemap URL.
Returns: Change frequency of a sitemap URL.
-
last_modified¶ Return date of last modification of the URL.
Returns: Date of last modification of the URL.
-
news_story¶ Return Google News story attached to the URL.
Returns: Google News story attached to the URL.
-
priority¶ Return priority of this URL relative to other URLs on your site.
Returns: Priority of this URL relative to other URLs on your site.
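As an illustration of how a priority value might be normalised before it reaches `SitemapPage`, here is a small sketch. `parse_priority` is a hypothetical helper, not a function of the package; the 0.0 to 1.0 range comes from the sitemaps protocol, and falling back to the default on bad input is an assumption rather than documented behaviour.

```python
from decimal import Decimal, InvalidOperation
from typing import Optional

# Mirrors SITEMAP_PAGE_DEFAULT_PRIORITY documented above.
DEFAULT_PRIORITY = Decimal("0.5")


def parse_priority(text: Optional[str]) -> Decimal:
    """Return a priority within [0.0, 1.0], or the default if unusable."""
    if text is None:
        return DEFAULT_PRIORITY
    try:
        value = Decimal(text.strip())
        # Comparing NaN raises InvalidOperation, so keep this inside the try.
        in_range = Decimal("0") <= value <= Decimal("1")
    except InvalidOperation:
        return DEFAULT_PRIORITY
    return value if in_range else DEFAULT_PRIORITY
```

Using `Decimal` rather than `float` keeps values such as 0.1 exact, which matches the type in the `SitemapPage` signature.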
-
url¶ Return page URL.
Returns: Page URL.
-
usp.objects.sitemap module¶
Objects that represent one of the found sitemaps.
-
class
usp.objects.sitemap.AbstractIndexSitemap(url: str, sub_sitemaps: List[usp.objects.sitemap.AbstractSitemap])[source]¶ Bases:
usp.objects.sitemap.AbstractSitemap
Abstract sitemap with URLs to other sitemaps.
-
all_pages() → Iterator[usp.objects.page.SitemapPage][source]¶ Return iterator which yields all pages of this sitemap and linked sitemaps (if any).
Returns: Iterator which yields all pages of this sitemap and linked sitemaps (if any).
-
sub_sitemaps¶ Return sub-sitemaps that are linked to from this sitemap.
Returns: Sub-sitemaps that are linked to from this sitemap.
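The relationship between `sub_sitemaps` and `all_pages()` can be pictured as a recursive walk over a tree. The classes below are simplified stand-ins written for this illustration (`Page`, `Sitemap`, `PagesSitemap` and `IndexSitemap` are not the package's classes), and the traversal order is an assumption.

```python
from typing import Iterator, List


class Page:
    def __init__(self, url: str) -> None:
        self.url = url


class Sitemap:
    """Stand-in for an abstract sitemap: it has a URL and can yield pages."""

    def __init__(self, url: str) -> None:
        self.url = url

    def all_pages(self) -> Iterator[Page]:
        return iter(())  # nothing to yield by default


class PagesSitemap(Sitemap):
    """Leaf node: holds pages directly."""

    def __init__(self, url: str, pages: List[Page]) -> None:
        super().__init__(url)
        self.pages = pages

    def all_pages(self) -> Iterator[Page]:
        yield from self.pages


class IndexSitemap(Sitemap):
    """Inner node: delegates to every linked sub-sitemap in turn."""

    def __init__(self, url: str, sub_sitemaps: List[Sitemap]) -> None:
        super().__init__(url)
        self.sub_sitemaps = sub_sitemaps

    def all_pages(self) -> Iterator[Page]:
        for sub_sitemap in self.sub_sitemaps:
            yield from sub_sitemap.all_pages()
```

Because each level is a generator, pages are produced lazily and the full list never has to be held in memory at once.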
-
-
class
usp.objects.sitemap.AbstractPagesSitemap(url: str, pages: List[usp.objects.page.SitemapPage])[source]¶ Bases:
usp.objects.sitemap.AbstractSitemap
Abstract sitemap that contains URLs to pages.
-
all_pages() → Iterator[usp.objects.page.SitemapPage][source]¶ Return iterator which yields all pages of this sitemap and linked sitemaps (if any).
Returns: Iterator which yields all pages of this sitemap and linked sitemaps (if any).
-
pages¶ Return list of pages found in a sitemap.
Returns: List of pages found in a sitemap.
-
-
class
usp.objects.sitemap.AbstractSitemap(url: str)[source]¶ Bases:
object
Abstract sitemap.
-
all_pages() → Iterator[usp.objects.page.SitemapPage][source]¶ Return iterator which yields all pages of this sitemap and linked sitemaps (if any).
Returns: Iterator which yields all pages of this sitemap and linked sitemaps (if any).
-
url¶ Return sitemap URL.
Returns: Sitemap URL.
-
-
class
usp.objects.sitemap.IndexRobotsTxtSitemap(url: str, sub_sitemaps: List[usp.objects.sitemap.AbstractSitemap])[source]¶ Bases:
usp.objects.sitemap.AbstractIndexSitemap
robots.txt sitemap with URLs to other sitemaps.
-
class
usp.objects.sitemap.IndexWebsiteSitemap(url: str, sub_sitemaps: List[usp.objects.sitemap.AbstractSitemap])[source]¶ Bases:
usp.objects.sitemap.AbstractIndexSitemap
Website’s root sitemaps, including robots.txt and extra ones.
-
class
usp.objects.sitemap.IndexXMLSitemap(url: str, sub_sitemaps: List[usp.objects.sitemap.AbstractSitemap])[source]¶ Bases:
usp.objects.sitemap.AbstractIndexSitemap
XML sitemap with URLs to other sitemaps.
-
class
usp.objects.sitemap.InvalidSitemap(url: str, reason: str)[source]¶ Bases:
usp.objects.sitemap.AbstractSitemap
Invalid sitemap, e.g. one that can’t be parsed.
-
all_pages() → Iterator[usp.objects.page.SitemapPage][source]¶ Return iterator which yields all pages of this sitemap and linked sitemaps (if any).
Returns: Iterator which yields all pages of this sitemap and linked sitemaps (if any).
-
reason¶ Return reason why the sitemap is deemed invalid.
Returns: Reason why the sitemap is deemed invalid.
-
-
class
usp.objects.sitemap.PagesAtomSitemap(url: str, pages: List[usp.objects.page.SitemapPage])[source]¶ Bases:
usp.objects.sitemap.AbstractPagesSitemap
Atom 0.3 / 1.0 sitemap that contains URLs to pages.
-
class
usp.objects.sitemap.PagesRSSSitemap(url: str, pages: List[usp.objects.page.SitemapPage])[source]¶ Bases:
usp.objects.sitemap.AbstractPagesSitemap
RSS 2.0 sitemap that contains URLs to pages.
-
class
usp.objects.sitemap.PagesTextSitemap(url: str, pages: List[usp.objects.page.SitemapPage])[source]¶ Bases:
usp.objects.sitemap.AbstractPagesSitemap
Plain text sitemap that contains URLs to pages.
-
class
usp.objects.sitemap.PagesXMLSitemap(url: str, pages: List[usp.objects.page.SitemapPage])[source]¶ Bases:
usp.objects.sitemap.AbstractPagesSitemap
XML sitemap that contains URLs to pages.
Module contents¶
usp.web_client package¶
Submodules¶
usp.web_client.abstract_client module¶
Abstract web client class.
-
class
usp.web_client.abstract_client.AbstractWebClient[source]¶ Bases:
object
Abstract web client to be used by the sitemap fetcher.
-
get(url: str) → usp.web_client.abstract_client.AbstractWebClientResponse[source]¶ Fetch a URL and return a response.
The method shouldn’t raise exceptions on connection errors (including timeouts); instead, such errors should be reported via the response object.
Parameters: url – URL to fetch.
Returns: Response object.
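The "report errors via the response object" contract can be sketched as follows. This is an illustration only: `SuccessResponse`, `ErrorResponse` and `safe_get` are stand-ins invented here (they are not the package's classes), the `fetch` callable is injected so the example does not touch the network, and marking connection errors as retryable is an assumption.

```python
from typing import Callable, Union


class SuccessResponse:
    def __init__(self, data: bytes) -> None:
        self.data = data


class ErrorResponse:
    def __init__(self, message: str, retryable: bool) -> None:
        self.message = message
        self.retryable = retryable


def safe_get(
    url: str, fetch: Callable[[str], bytes]
) -> Union[SuccessResponse, ErrorResponse]:
    """Call fetch(url) and convert connection failures into a response."""
    try:
        return SuccessResponse(fetch(url))
    except TimeoutError as ex:
        # TimeoutError is a subclass of OSError, so it is handled first.
        return ErrorResponse("Timed out: %s" % ex, retryable=True)
    except OSError as ex:
        return ErrorResponse("Connection error: %s" % ex, retryable=True)
```

The caller then branches on the response type instead of wrapping every call in its own `try` block.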
-
-
class
usp.web_client.abstract_client.AbstractWebClientResponse[source]¶ Bases:
object
Abstract response.
-
class
usp.web_client.abstract_client.AbstractWebClientSuccessResponse[source]¶ Bases:
usp.web_client.abstract_client.AbstractWebClientResponse
Successful response.
-
header(case_insensitive_name: str) → Optional[str][source]¶ Return HTTP header value for a given case-insensitive name, or None if such header wasn’t set.
Parameters: case_insensitive_name – HTTP header’s name, e.g. “Content-Type”.
Returns: HTTP header’s value, or None if it was unset.
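One simple way to satisfy the case-insensitive lookup described for `header()` is to normalise header names once, on construction. The `HeaderLookup` class below is a hypothetical stand-in for a response object, shown only to illustrate the idea.

```python
from typing import Dict, Optional


class HeaderLookup:
    """Stand-in response that stores header names in lower case."""

    def __init__(self, headers: Dict[str, str]) -> None:
        self._headers = {name.lower(): value for name, value in headers.items()}

    def header(self, case_insensitive_name: str) -> Optional[str]:
        # dict.get returns None when the header was not set.
        return self._headers.get(case_insensitive_name.lower())
```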
-
raw_data() → bytes[source]¶ Return encoded raw data of the response.
Returns: Encoded raw data of the response.
-
-
usp.web_client.abstract_client.RETRYABLE_HTTP_STATUS_CODES= {400, 408, 429, 499, 500, 502, 503, 504, 509, 520, 521, 522, 523, 524, 525, 526, 527, 530, 598}¶ HTTP status codes on which a request should be retried.
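A set like this is typically consulted in a retry loop. The sketch below copies the values listed above into a local constant; `fetch_with_retries`, its `fetch_status` callable and the `max_attempts` default are hypothetical, and a real client would normally wait between attempts, which is left out here to keep the example short.

```python
from typing import Callable

# Copied from the constant documented above.
RETRYABLE_HTTP_STATUS_CODES = {
    400, 408, 429, 499, 500, 502, 503, 504, 509,
    520, 521, 522, 523, 524, 525, 526, 527, 530, 598,
}


def fetch_with_retries(
    fetch_status: Callable[[str], int], url: str, max_attempts: int = 3
) -> int:
    """Return the last HTTP status, retrying only on retryable codes."""
    status = fetch_status(url)
    attempts = 1
    while status in RETRYABLE_HTTP_STATUS_CODES and attempts < max_attempts:
        status = fetch_status(url)
        attempts += 1
    return status
```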
-
class
usp.web_client.abstract_client.WebClientErrorResponse(message: str, retryable: bool)[source]¶ Bases:
usp.web_client.abstract_client.AbstractWebClientResponse
Error response.
usp.web_client.requests_client module¶
requests-based implementation of web client class.
-
class
usp.web_client.requests_client.RequestsWebClient[source]¶ Bases:
usp.web_client.abstract_client.AbstractWebClient
requests-based web client to be used by the sitemap fetcher.
-
get(url: str) → usp.web_client.abstract_client.AbstractWebClientResponse[source]¶ Fetch a URL and return a response.
The method shouldn’t raise exceptions on connection errors (including timeouts); instead, such errors should be reported via the response object.
Parameters: url – URL to fetch.
Returns: Response object.
-
set_max_response_data_length(max_response_data_length: int) → None[source]¶ Set the maximum number of bytes that the web client will fetch.
Parameters: max_response_data_length – Maximum number of bytes that the web client will fetch.
-
set_proxies(proxies: Dict[str, str]) → None[source]¶ Set proxies from a dictionary where:
- keys are schemes, e.g. “http” or “https”;
- values are “scheme://user:password@host:port/”.
For example:
proxies = {'http': 'http://user:pass@10.10.1.10:3128/'}
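Before handing such a dictionary to `set_proxies()`, it can be useful to confirm that each value really has the `scheme://user:password@host:port/` shape. The helper below is a sketch using only the standard library; `check_proxies` is a hypothetical name and is not part of the package.

```python
from typing import Dict
from urllib.parse import urlsplit


def check_proxies(proxies: Dict[str, str]) -> Dict[str, str]:
    """Raise ValueError if a proxy URL lacks a scheme or a host."""
    for scheme, proxy_url in proxies.items():
        parts = urlsplit(proxy_url)
        if not parts.scheme or not parts.hostname:
            raise ValueError(
                "Proxy for %r is not a full URL: %r" % (scheme, proxy_url)
            )
    return proxies


proxies = check_proxies({"http": "http://user:pass@10.10.1.10:3128/"})
```

The validated dictionary would then be passed to `set_proxies()` on a `RequestsWebClient` instance, as described above.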
-
-
class
usp.web_client.requests_client.RequestsWebClientErrorResponse(message: str, retryable: bool)[source]¶ Bases:
usp.web_client.abstract_client.WebClientErrorResponse
requests-based error response.
-
class
usp.web_client.requests_client.RequestsWebClientSuccessResponse(requests_response: requests.models.Response, max_response_data_length: Optional[int] = None)[source]¶ Bases:
usp.web_client.abstract_client.AbstractWebClientSuccessResponse
requests-based successful response.
-
header(case_insensitive_name: str) → Optional[str][source]¶ Return HTTP header value for a given case-insensitive name, or None if such header wasn’t set.
Parameters: case_insensitive_name – HTTP header’s name, e.g. “Content-Type”.
Returns: HTTP header’s value, or None if it was unset.
-
raw_data() → bytes[source]¶ Return encoded raw data of the response.
Returns: Encoded raw data of the response.
-
Module contents¶
Submodules¶
usp.exceptions module¶
Exceptions used by the sitemap parser.
-
exception
usp.exceptions.SitemapException[source]¶ Bases:
Exception
Problem that prevents the parser from continuing, e.g. wrong input parameters.
usp.tree module¶
Helpers to generate a sitemap tree.
-
usp.tree.sitemap_tree_for_homepage(homepage_url: str, web_client: Optional[usp.web_client.abstract_client.AbstractWebClient] = None) → usp.objects.sitemap.AbstractSitemap[source]¶ Using a homepage URL, fetch the tree of sitemaps and pages listed in them.
Parameters: - homepage_url – Homepage URL of a website to fetch the sitemap tree for, e.g. “http://www.example.com/”.
- web_client – Web client implementation to use for fetching sitemaps.
Returns: Root sitemap object of the fetched sitemap tree.
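A typical use of the returned tree is to walk `all_pages()` and collect URLs. The helper below is a sketch: `unique_page_urls` is a hypothetical function written for this example, and the de-duplication assumes a page may be listed in more than one sitemap. It relies only on the `all_pages()` method and `url` attribute documented above. The call to `sitemap_tree_for_homepage` is left in comments because it needs the package installed and network access.

```python
from typing import List


def unique_page_urls(tree) -> List[str]:
    """Collect page URLs from a sitemap tree, keeping first-seen order."""
    seen = set()
    urls = []
    for page in tree.all_pages():
        if page.url not in seen:
            seen.add(page.url)
            urls.append(page.url)
    return urls


# With the package installed, usage would look roughly like this:
#   from usp.tree import sitemap_tree_for_homepage
#   tree = sitemap_tree_for_homepage("http://www.example.com/")
#   for url in unique_page_urls(tree):
#       print(url)
```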