Welcome to weblib’s documentation!¶
Weblib provides tools to solve typical web scraping tasks:
- processing HTML
- handling text encodings
- controlling repeated and parallel tasks
- parsing RSS/Atom feeds
- preparing data for HTTP requests
- working with the DOM tree
- working with text and numeric data
- a list of common user agents
- cross-platform file locking
- operations with files and directories
Installation¶
pip install -U weblib
Testing¶
Install the tox package: pip install tox
Run the command: tox
API¶
weblib.content¶
weblib.control¶
weblib.control.repeat(func, limit=3, args=None, kwargs=None, fatal_exceptions=(), valid_exceptions=())
Return the value of executing the func function.
In case of error, try to execute func at most limit times and then raise the latest exception.
Example:

from weblib.control import repeat
import urllib

def download(url):
    return urllib.urlopen(url).read()

data = repeat(download, 3, args=['http://google.com/'])
weblib.debug¶
weblib.encoding¶
weblib.encoding.make_str(value, encoding='utf-8', errors='strict')
Normalize unicode/byte string to byte string.

weblib.encoding.make_unicode(value, encoding='utf-8', errors='strict')
Normalize unicode/byte string to unicode string.

weblib.encoding.smart_str(value, encoding='utf-8', errors='strict')
Normalize unicode/byte string to byte string.

weblib.encoding.smart_unicode(value, encoding='utf-8', errors='strict')
Normalize unicode/byte string to unicode string.
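A quick sketch of the conversion helpers above; the sample string is illustrative and the byte values shown assume the default utf-8 encoding:

from weblib.encoding import make_str, make_unicode

make_str(u'caf\xe9')          # -> 'caf\xc3\xa9' (utf-8 encoded byte string)
make_unicode(b'caf\xc3\xa9')  # -> u'caf\xe9'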
weblib.feed¶
Return a list of tag objects of the entry
weblib.files¶
Miscellaneous utilities that are sometimes helpful.
weblib.files.clear_directory(path)
Recursively delete all directories and files in the specified directory.
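A minimal example (the path is illustrative):

from weblib.files import clear_directory

clear_directory('/tmp/my-cache')  # empties the directory but keeps the directory itself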
weblib.html¶
weblib.html.decode_entities(html)
Convert all HTML entities into their unicode representations.
This function processes the following entities:
- &XXX;
- &#XXX;
Example:

>>> from weblib import html
>>> print(html.decode_entities('&rarr;ABC&nbsp;&#82;&copy;'))
→ABC R©
weblib.http¶
Serialize a dict or a sequence of two-element items into a string suitable for sending in the Cookie HTTP header.
weblib.http.normalize_http_values(items, charset='utf-8', ignore_classes=None)
Accept a sequence of (key, value) pairs or a dict and convert each value into a bytestring.
Unicode is converted into a bytestring using the charset of the previous response (or utf-8 if no requests were performed).
None is converted into an empty string.
If ignore_classes is not None and the value is an instance of any class from ignore_classes, then the value is not processed and is returned as-is.
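A minimal sketch based on the documented signature; the input data is illustrative:

from weblib.http import normalize_http_values

items = [('q', u'caf\xe9'), ('empty', None)]
# Unicode values become utf-8 bytestrings and None becomes an empty string.
print(normalize_http_values(items, charset='utf-8'))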
weblib.lock¶
Provides functions to check whether a file is locked.
weblib.logs¶
weblib.etree¶
Functions to process content of lxml nodes.
weblib.etree.clean_html(html, safe_attrs=('src', 'href'), input_encoding=None, output_encoding=None, **kwargs)
Fix HTML structure and remove non-allowed attributes from all tags.
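A small sketch of clean_html; the markup is illustrative and the exact output formatting may differ:

from weblib.etree import clean_html

dirty = '<div onclick="hack()"><a href="/page" target="_blank">link</a></div>'
# Attributes outside safe_attrs (here the defaults: src, href) are removed.
print(clean_html(dirty))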
weblib.etree.clone_node(elem)
Create a clone of the Element node.
The resulting clone is not connected to the original DOM tree.
weblib.etree.disable_links(elem)
Replace all links with span tags and drop href attributes.
weblib.etree.drop_node(tree, xpath, keep_content=False)
Find sub-node by its xpath and remove it.
weblib.etree.find_node_number(node, ignore_spaces=False, make_int=True)
Find number in text content of the node.
weblib.etree.get_node_text(node, smart=False, normalize_space=True)
Extract the text content of the node and all its descendants.
In smart mode, get_node_text inserts spaces between <tag><another tag> and also ignores the content of the script and style tags.
In non-smart mode, this function just returns text_content() of the node with normalized spaces.
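A short sketch combining several helpers above; the HTML snippet is illustrative, lxml is assumed to be available (weblib.etree works on lxml nodes), and find_node_number is assumed to return the first number found:

from lxml.html import fromstring
from weblib.etree import drop_node, get_node_text, find_node_number

tree = fromstring('<div><script>var x = 1;</script>Price: <b>100</b> USD</div>')
drop_node(tree, '//script')             # remove the script element from the tree
print(get_node_text(tree, smart=True))  # -> Price: 100 USD
print(find_node_number(tree))           # -> 100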
weblib.metric¶
weblib.parser¶
weblib.progress¶
weblib.pwork¶
weblib.pwork.make_work(callback, tasks, limit, ignore_exceptions=True, taskq_size=50)
Run up to “limit” processes, do tasks and yield results.
Parameters:
- callback – the function that will process a single task
- tasks – a sequence, iterator or queue of tasks; each task, in turn, is a sequence of arguments; if a task is just a single argument it should be wrapped into a list or tuple
- limit – the maximum number of processes
weblib.rex¶
weblib.rex.normalize_regexp(regexp, flags=0)
Accept string or compiled regular expression object.
Compile string into regular expression object.
weblib.rex.rex(body, regexp, flags=0, byte=False, default=<object object>)
Search regexp expression in body text.
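A small sketch based on the signatures above, assuming rex returns a standard re match object (the pattern and body are illustrative):

import re
from weblib.rex import normalize_regexp, rex

pattern = normalize_regexp(r'<title>(.+?)</title>', re.S)  # string compiled into a regex object
match = rex('<html><title>Hello</title></html>', pattern)
print(match.group(1))  # -> Hello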
weblib.russian¶
weblib.system¶
weblib.text¶
Text parsing and processing utilities.
weblib.text.find_number(text, ignore_spaces=False, make_int=True, ignore_chars=None)
Find the number in the text.
Parameters:
- text – unicode or byte-string text
- ignore_spaces – if True then groups of digits delimited by spaces are considered as one number
Raises: DataNotFound if the number was not found.
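A quick sketch of find_number, assuming it returns the first number found in the text:

from weblib.text import find_number

print(find_number(u'Page 12 of 304'))                   # -> 12
print(find_number(u'1 000 items', ignore_spaces=True))  # -> 1000 (space-delimited digit groups merged)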
weblib.user_agent¶
weblib.watch¶
-
class
weblib.watch.
Watcher
[source]¶ this class solves two problems with multithreaded programs in Python, (1) a signal might be delivered to any thread (which is just a malfeature) and (2) if the thread that gets the signal is waiting, the signal is ignored (which is a bug).
The watcher is a concurrent process (not thread) that waits for a signal and the process that contains the threads. See Appendix A of The Little Book of Semaphores. http://greenteapress.com/semaphores/
I have only tested this on Linux. I would expect it to work on the Macintosh and not work on Windows.
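A minimal usage sketch, assuming Watcher is simply instantiated in the main thread before any worker threads are started (the pattern from the recipe referenced above); the worker body is illustrative:

import threading
from weblib.watch import Watcher

def worker():
    pass  # placeholder for a long-running task

Watcher()  # fork the watcher process before creating threads
threads = [threading.Thread(target=worker) for _ in range(4)]
for th in threads:
    th.start()
for th in threads:
    th.join()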
weblib.work¶
weblib.work.make_work(callback, tasks, limit, ignore_exceptions=True, taskq_size=50)
Run up to “limit” threads, do tasks and yield results.
Parameters:
- callback – the function that will process a single task
- tasks – a sequence, iterator or queue of tasks; each task, in turn, is a sequence of arguments; if a task is just a single argument it should be wrapped into a list or tuple
- limit – the maximum number of threads
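A minimal sketch based on the documented signature (weblib.pwork.make_work above shares the same interface but uses processes); the callback and URLs are placeholders:

from weblib.work import make_work

def fetch(url):
    return len(url)  # placeholder: replace with real download logic

# Each task is wrapped in a list because a task is a sequence of arguments.
tasks = [['http://example.com/'], ['http://example.net/']]
for result in make_work(fetch, tasks, 2):
    print(result)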