reddit package

Submodules

reddit.get_all_subreddits.get_all_subreddits(lang='en') → List[Dict[str, Any]]
    Retrieves all subreddits of a given language.
    :param lang: The language of the subreddits to retrieve
    :return: The retrieved subreddits

reddit.get_all_subreddits.get_subreddits_from_response(js: Dict[str, Any], lang='en') → List[Dict[str, Any]]

reddit.scrape_reddit.filter_submissions(data: List[Dict[str, Any]], sub: Dict[str, Any], blacklist_flairs=[]) → List[Tuple[str, int]]
    Selects the submissions that should be used for the dataset.
    :param data: The list of submissions
    :param sub: The subreddit specification
    :param blacklist_flairs: Flairs which should be ignored
    :return: The permalinks and creation dates of all proper submissions
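
The filtering step described above can be illustrated with a minimal sketch. It is not the package's implementation: it assumes each submission dict carries the Reddit JSON fields `permalink`, `created_utc` and `link_flair_text`, it ignores the `sub` specification entirely, and the name `filter_submissions_sketch` is hypothetical.

```python
from typing import Any, Dict, List, Sequence, Tuple


def filter_submissions_sketch(
    data: List[Dict[str, Any]],
    blacklist_flairs: Sequence[str] = (),
) -> List[Tuple[str, int]]:
    """Keep (permalink, creation date) pairs of submissions without a blacklisted flair."""
    selected: List[Tuple[str, int]] = []
    for submission in data:
        # Assumed field name: "link_flair_text" holds the flair, or None if unset.
        if submission.get("link_flair_text") in blacklist_flairs:
            continue
        # Assumed field names: "permalink" and "created_utc" (a Unix timestamp).
        selected.append((submission["permalink"], int(submission["created_utc"])))
    return selected
```

The real function presumably applies further checks (see is_proper_submission); only the flair blacklist is shown here.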

reddit.scrape_reddit.has_enough_upvotes(body: Dict[str, Any]) → bool
    Determines, as a quality check, whether a submission or a comment has more upvotes than downvotes.
    :param body: Either a submission or a comment
    :return: True if the item has more upvotes than downvotes, False otherwise
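
As a rough illustration of this quality check, the sketch below compares two vote counts. The field names `ups` and `downs` and the treatment of missing values as 0 are assumptions; the documentation does not say which fields the real function reads.

```python
from typing import Any, Dict


def has_enough_upvotes_sketch(body: Dict[str, Any]) -> bool:
    """Return True if the item has strictly more upvotes than downvotes."""
    # Assumed field names: "ups" and "downs"; missing or null values count as 0.
    ups = body.get("ups") or 0
    downs = body.get("downs") or 0
    return ups > downs
```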

reddit.scrape_reddit.has_proper_text(text: Optional[str], text_html: Optional[str] = None) → bool
    Checks whether a post contains text that is usable for the dataset. Performs generic validation suitable for both submission and comment texts.
    :param text: The normal (Markdown) version of the text
    :param text_html: The HTML version of the text
    :return: True if the text is usable for the dataset, False otherwise
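
The documentation does not list the validation rules, so the following is only a sketch of what generic text validation might look like. The placeholder strings and the HTML rule are assumptions chosen for illustration, and `has_proper_text_sketch` is a hypothetical name.

```python
from typing import Optional

# Assumed placeholder strings shown in place of deleted or removed content.
PLACEHOLDERS = {"[deleted]", "[removed]"}


def has_proper_text_sketch(text: Optional[str], text_html: Optional[str] = None) -> bool:
    """Illustrative validation shared by submission and comment texts."""
    # Reject missing or whitespace-only texts.
    if text is None or not text.strip():
        return False
    # Reject texts that consist only of a placeholder.
    if text.strip() in PLACEHOLDERS:
        return False
    # Hypothetical HTML rule: reject texts that render as code blocks or tables.
    if text_html is not None and ("<pre>" in text_html or "<table" in text_html):
        return False
    return True
```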

reddit.scrape_reddit.is_ama_submission(submission: Dict[str, Any], sub: Dict[str, Any]) → bool
    Checks whether the given submission is an AMA-type submission. Apply is_proper_submission first.
    :param submission: A reddit submission
    :param sub: The subreddit specification
    :return: True if the submission is an AMA-type submission, False otherwise

reddit.scrape_reddit.is_error_message(json: Dict[str, Any]) → bool
    Checks whether the given JSON is an error message.
    :param json: The JSON returned by a request, as a dictionary
    :return: True if the JSON is an error message, False otherwise
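
One plausible way to detect an error response is to look for an `error` key in the decoded JSON. This is an assumption about the response shape, not the documented behaviour, and the parameter is renamed to `payload` here to avoid shadowing the standard `json` module.

```python
from typing import Any


def is_error_message_sketch(payload: Any) -> bool:
    """Return True if the decoded response looks like an error message."""
    # Assumed shape: error responses are dicts containing an "error" key.
    return isinstance(payload, dict) and "error" in payload
```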

reddit.scrape_reddit.is_proper_comment(comment: Dict[str, Any]) → bool
    Checks whether the comment should be used for the dataset.
    :param comment: A reddit comment
    :return: True if the comment should be used, False otherwise

reddit.scrape_reddit.is_proper_submission(submission: Dict[str, Any], sub: Dict[str, Any], blacklist_flairs=[]) → bool
    Checks whether the submission should be used for the dataset.
    :param submission: A reddit submission
    :param sub: The subreddit specification
    :param blacklist_flairs: Flairs which should be ignored
    :return: True if the submission should be used, False otherwise

reddit.scrape_reddit.main(output_dir, blacklist_flairs, text_maxlength, url)

reddit.scrape_reddit.process_subreddit(sub: Dict[str, Any], url_template: str, queue: multiprocessing.context.BaseContext.Queue, text_maxlen=1024, blacklist_flairs=[], last_timestamp: Optional[int] = None)
    Retrieves all relevant submissions from a subreddit and collects dialogues.
    :param sub: The subreddit specification
    :param url_template: The URL template to use for subreddit requests
    :param queue: The queue to output dialogues to
    :param text_maxlen: The maximum length of a comment text
    :param blacklist_flairs: Submissions with these flairs will be ignored
    :param last_timestamp: The timestamp from which to resume retrieval
    :return: Nothing is returned directly; the collected dialogues are put into the queue as a (subreddit_name, timestamp, dialogues) triple

reddit.scrape_reddit.retrieve_dialogs(js: List[Dict[str, Any]], sub: Dict[str, Any], top_url: str, text_maxlen=1024) → List[Dict[str, Any]]
    Retrieves all dialogs of a submission.
    :param js: The submission site JSON dict containing both submission and comments
    :param sub: The subreddit specification
    :param top_url: The URL prefix to use for requests
    :param text_maxlen: The maximum length of a comment text
    :return: All dialogs of the submission

reddit.scrape_reddit.traverse_dialog(comment: Dict[str, Any], turns: List[Dict[str, str]], request_url: str, sys=None, user=None, text_maxlen=1024) → List[Dict[str, str]]
    Traverses a comment chain recursively in order to assemble the dialog.
    :param comment: The current comment
    :param turns: The dialog turns up until now
    :param request_url: The URL used for HTTP requests
    :param sys: The author representing the system response
    :param user: The author representing the user response
    :param text_maxlen: The maximum length of a comment text
    :return: The turns of one dialog
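
The recursive traversal can be pictured with the simplified sketch below. It makes several assumptions that the documentation does not confirm: comments follow Reddit's nested JSON layout (`data.author`, `data.body`, `data.replies`), the two authors must alternate, turns are stored under the hypothetical keys `speaker` and `text`, and the first reply that extends the dialog is followed. Unlike the real function, it takes no `request_url` and performs no HTTP requests.

```python
from typing import Any, Dict, List, Optional


def traverse_dialog_sketch(
    comment: Dict[str, Any],
    turns: List[Dict[str, str]],
    sys: Optional[str] = None,
    user: Optional[str] = None,
    text_maxlen: int = 1024,
) -> List[Dict[str, str]]:
    """Follow a comment chain between two authors and collect alternating turns."""
    data = comment["data"]
    author, text = data.get("author"), data.get("body") or ""
    # Stop at empty comments or comments exceeding the length limit.
    if not author or not text or len(text) > text_maxlen:
        return turns
    # Only the two dialog participants may contribute turns.
    if author not in (sys, user):
        return turns
    # Speakers must alternate.
    speaker = "sys" if author == sys else "user"
    if turns and turns[-1]["speaker"] == speaker:
        return turns
    turns = turns + [{"speaker": speaker, "text": text}]
    # Assumed layout: "replies" is "" when empty, otherwise a listing dict.
    replies = data.get("replies") or {}
    for child in replies.get("data", {}).get("children", []):
        if child.get("kind") != "t1":  # skip non-comment entries such as "more"
            continue
        extended = traverse_dialog_sketch(child, turns, sys, user, text_maxlen)
        if len(extended) > len(turns):
            return extended  # follow the first reply that continues the dialog
    return turns
```

Building a new list with `turns + [...]` instead of appending in place keeps sibling branches from seeing each other's turns.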

reddit.scrape_reddit.write_output(queue: multiprocessing.context.BaseContext.Queue, top_dir: str, cache_dir: Optional[str] = None)
    Consumes processed dialog turns from a queue and writes them into a directory. An individual file is used for each subreddit.
    :param queue: The queue to wait on for new data. It should yield dialog JSON, or None once processing is finished
    :param top_dir: The directory to write the data to
    :param cache_dir: The directory used to cache timestamps
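
The consumer pattern described here (block on a queue, stop on a None sentinel, write one file per subreddit) might look roughly like the sketch below. The item layout is assumed to be the (subreddit_name, timestamp, dialogues) triple mentioned for process_subreddit; the `.jsonl` and `.txt` file names are hypothetical, and a plain `queue.Queue` stands in for the multiprocessing queue in the demo.

```python
import json
import os
import queue
import tempfile
from typing import Any, List, Optional


def write_output_sketch(q: Any, top_dir: str, cache_dir: Optional[str] = None) -> None:
    """Consume items from the queue until None arrives; append dialogs per subreddit."""
    os.makedirs(top_dir, exist_ok=True)
    if cache_dir is not None:
        os.makedirs(cache_dir, exist_ok=True)
    while True:
        item = q.get()  # blocks until a producer delivers data
        if item is None:  # sentinel: processing is finished
            break
        subreddit_name, timestamp, dialogues = item  # assumed item layout
        # One JSON-lines file per subreddit (file naming is an assumption).
        path = os.path.join(top_dir, f"{subreddit_name}.jsonl")
        with open(path, "a", encoding="utf-8") as f:
            for dialogue in dialogues:
                f.write(json.dumps(dialogue) + "\n")
        if cache_dir is not None:
            # Remember the latest timestamp so that retrieval could be resumed.
            cache_path = os.path.join(cache_dir, f"{subreddit_name}.txt")
            with open(cache_path, "w", encoding="utf-8") as f:
                f.write(str(timestamp))


def demo() -> List[Any]:
    """Feed one item plus the sentinel through the writer and read the file back."""
    q: "queue.Queue[Any]" = queue.Queue()
    q.put(("askscience", 1600000000, [{"turns": ["hi", "hello"]}]))
    q.put(None)
    with tempfile.TemporaryDirectory() as tmp:
        write_output_sketch(q, tmp)
        with open(os.path.join(tmp, "askscience.jsonl"), encoding="utf-8") as f:
            return [json.loads(line) for line in f]
```

Opening the file in append mode for every item keeps the sketch simple; a long-running writer would more likely keep file handles open per subreddit.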