In this article, we will scrape Reddit discussion boards, popularly known as subreddits, for Wh questions.
Reddit is an American social news aggregation, web content rating, and discussion website. Registered members submit content to the site such as links, text posts, and images, which are then voted up or down by other members.
Source: Wikipedia
Reddit is a gold mine of data, with real discussions going on between real people. Posts are categorized by topic into clusters or channels known as subreddits. These subreddits are user-created channels, much like Facebook groups or Telegram channels.
If you are a content creator or educator struggling to find enough ideas or topics to create content on, Reddit is a great platform for interacting with people. But what if you are lazy like me? Well, I've got your back.
Finding Interesting Subreddits
To get started, you first need to decide which subreddit to run the script on. Being a developer, I chose programming-related subreddits, but you can choose any subreddits you like.
To help you choose a subreddit in your niche, I have a trick for you: go to the Redditlist website and browse the popular subreddits for your category. When choosing, consider the recent activity, the number of subscribers, and the growth. The more active the subreddit, the better.

For this article, the subreddit I have chosen is r/Python.
Importing Python Packages
We will be using the following Python packages for scraping Reddit.
import pandas as pd  # pandas for creating the DataFrame
import requests      # for making GET requests
import json          # for handling JSON objects
import time          # for sleeping between requests
import re            # for regular expressions
If you don't have these packages in your environment, you can install them with pip: just run pip install pandas requests (the json, time, and re modules ship with Python's standard library).
Setting Up the URL and Request Header
We need to set up a request header before we can send a request. The header is sent along with the request so that it looks like an ordinary browser visit; some websites detect and block bots, which is why we do this.
hdr = {'User-Agent': 'Mozilla/5.0'}
url = 'https://www.reddit.com/r/python/.json'
Making the Request
To make the request we use the requests module. Notice that the URL above ends in .json, so the response will come back in JSON format. JSON is a very popular format for data exchange; you can read about it here. If you are familiar with nested arrays and dictionaries, you can easily parse JSON.
req = requests.get(url, headers=hdr)  # send a GET request to Reddit with our custom header
json_data = json.loads(req.text)      # parse the response body into a JSON object
print(json_data)                      # print the JSON response
When we send a GET request with requests.get(), we get back a response object, which we store in req. The body of that response, req.text, is just a raw string, so we can't use it as a JSON object directly; passing it to json.loads() parses it into nested Python dictionaries and lists. Now it is very easy to work with our data.
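To get a feel for how the parsed object is laid out, here is a minimal sketch; the key names match the ones used later in this article, though the exact fields Reddit returns may vary.
posts = json_data['data']['children']  # the list of post objects
first = posts[0]['data']               # payload of the first post
print(first['title'])                  # the post title
print(first['url'])                    # the link the post points to
print(first['name'])                   # the post's fullname, used later for pagination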
Scraping Reddit Post Titles
When we inspect the response, the interesting part is the "data" object, whose "children" array holds the posts. In our case it contains 27 post objects, which means by default we got 27 post titles. So what if I want to collect 1,000?
Let's collect one thousand post objects and store them in a list.
data_all = json_data['data']['children']  # posts collected so far
num_of_posts = 0
while len(data_all) <= 1000:
    time.sleep(2)  # pause between requests to avoid being rate limited
    last = data_all[-1]['data']['name']  # fullname of the last post we have
    url = 'https://www.reddit.com/r/python/.json?after=' + str(last)
    req = requests.get(url, headers=hdr)
    data = json.loads(req.text)
    data_all += data['data']['children']
    if num_of_posts == len(data_all):  # no new posts came back; stop
        break
    else:
        num_of_posts = len(data_all)
If you look at the code, we iterate until we have collected 1,000 posts. Since each request only returns a handful of posts, we have to ask Reddit for the page that comes after the last post we stored. This is done with the ?after= query parameter, which takes the fullname (the name field) of the last post. If a request returns no new posts, the counter stops changing and the loop breaks.
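If you want to reuse this pagination logic for other subreddits, here is a minimal sketch that wraps it in a hypothetical helper function, fetch_posts (not part of the original script; it reuses the imports from the top of this article):
def fetch_posts(subreddit, target=1000):
    # Hypothetical helper: pages through a subreddit's JSON feed
    # until `target` posts are collected or the feed runs out.
    hdr = {'User-Agent': 'Mozilla/5.0'}
    base = 'https://www.reddit.com/r/' + subreddit + '/.json'
    posts = json.loads(requests.get(base, headers=hdr).text)['data']['children']
    while len(posts) < target:
        time.sleep(2)  # be polite between requests
        last = posts[-1]['data']['name']
        req = requests.get(base + '?after=' + last, headers=hdr)
        new_posts = json.loads(req.text)['data']['children']
        if not new_posts:  # nothing new came back; the feed is exhausted
            break
        posts += new_posts
    return posts[:target]

data_all = fetch_posts('python', 1000)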
Once we have our 1,000 posts, we can move on to analyzing the data.
Finding Wh Question Prefix
There are two ways this can be done. The first is to use NLTK and build a natural-language parser to pick out questions; the second is to simply use a regex. We will use a regex here. Natural-language processing is the better approach for analyzing scraped Reddit data, but it is a little more complex to implement.
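One caveat worth knowing: a bare alternation like who|what|how also matches inside longer words, for example "how" inside "Showcase". That is why the pattern below adds \b word boundaries; a quick check illustrates the difference.
print(re.search(r'how', 'Showcase my project', re.IGNORECASE))      # matches inside 'Showcase'
print(re.search(r'\bhow\b', 'Showcase my project', re.IGNORECASE))  # None
print(re.search(r'\bhow\b', 'How do I start?', re.IGNORECASE))      # matches 'How'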
Let's iterate over our list of 1,000 posts and pick out the titles containing Wh words. But before that, let's create a DataFrame.
count = 0
df = pd.DataFrame(columns=['questions', 'link_to_answer'])
# \b word boundaries keep the pattern from matching inside longer words
whPattern = re.compile(r'\b(who|what|how|where|when|why|which|whom|whose)\b', re.IGNORECASE)
for post in data_all:
    whMatch = whPattern.search(str(post['data']['title']))
    if whMatch:
        print(str(post['data']['title']))
        print(str(post['data']['url']))
        count = count + 1
        # saving to DataFrame
        df.loc[count] = [str(post['data']['title']), str(post['data']['url'])]
We simply go through each post's JSON object and collect the title, along with the post link in case you want to visit it in the future.

Once you have the matching questions, you can save them.
df.to_csv('python.csv', sep=',', encoding='utf-8')
I have saved it to a CSV file.
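If you want to load the saved questions back later, something like this should work (assuming the same file name):
df = pd.read_csv('python.csv', index_col=0, encoding='utf-8')
print(df.head())  # a quick look at the first few scraped questions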
If you look at those questions, you will find that there are some genuine questions asked by users that need answers.
Conclusion
- If you analyze the questions, you can easily see what users are asking. You can create content based on them, and once it is ready, just go back to the link and post it.
- The Wh questions scraped can sometimes be useless, because we are just looking for a pattern rather than using NLTK for natural-language processing. So there is scope for improvement.
- It is remarkably easy to gather real conversations from Reddit.
- The method suggested in this post is only suitable for a few requests; to use it at scale, there is a Reddit API wrapper available for Python. Check out PRAW: The Python Reddit API Wrapper (see the sketch after this list).
- The code used in this scraping tutorial can be found on my GitHub – here.
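For the curious, here is a minimal PRAW sketch; the credentials are placeholders you would replace with your own Reddit app's values, and the limit is up to you:
import praw

# Placeholder credentials: create a Reddit app and fill in your own values
reddit = praw.Reddit(client_id='YOUR_CLIENT_ID',
                     client_secret='YOUR_CLIENT_SECRET',
                     user_agent='wh-question-scraper by u/your_username')

# Fetch the newest posts from r/Python; PRAW handles pagination for us
for submission in reddit.subreddit('python').new(limit=100):
    print(submission.title, submission.url)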
Thanks for reading 🙂
Comment if you have suggestions, questions, or concerns. Also check out my other posts.
Bonus
Do you know who co-founded Reddit?
Aaron Swartz was a co-founder of Reddit, but that's not all. He was also the computer programmer involved in the development of the RSS web feed format.
This bonus is in his honor; he is an inspiration to me. Please watch the video below to learn more about him.