Let me help you find your community on Reddit!

Minoo Taghavi
3 min read · Oct 15, 2020


As I am sitting here writing my blog, my computer is running way too many iterations to get about 20,000 posts from Reddit! I am not complaining at all since my MacBook Pro has exceeded my expectations time after time throughout this journey!

If you are an internet-savvy person and follow different social news platforms, you have probably heard of Reddit! There are about 138,000 active subreddits out of a total of 1.2 million. But how can we tell which subreddit actually contains the context, discussions, and information related to what we are looking for? Well… that was the challenge for this project: to build a machine learning model that determines the most used or discussed keywords and subjects in each subreddit and helps identify the right community for you!

The project?

Choose two subreddits, scrape their posts/comments, run them through natural language processing tools like CountVectorizer and TfidfVectorizer, try different machine learning models (in my case, Logistic Regression, Naïve Bayes, and Random Forest), and finally build a binary classifier that predicts which subreddit a given post came from.
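The whole project can be sketched end to end with a scikit-learn pipeline. This is a minimal illustration, not the author's actual code, and the toy posts below are hypothetical stand-ins for real Reddit data:

```python
# Sketch of the project pipeline: vectorize text, then classify.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Label 0 = first subreddit, 1 = second subreddit (hypothetical examples)
posts = [
    "finished my first 10k race today",
    "new running shoes for marathon training",
    "tracking calories helped me lose weight",
    "down ten pounds after cutting sugar",
]
labels = [0, 0, 1, 1]

# TF-IDF turns each post into a weighted word-frequency vector;
# Logistic Regression then learns a binary decision boundary.
pipe = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("clf", LogisticRegression()),
])
pipe.fit(posts, labels)

# Predict which subreddit an unseen post most likely came from.
print(pipe.predict(["weight loss progress this month"])[0])
```

Swapping `LogisticRegression` for `MultinomialNB` or `RandomForestClassifier` is a one-line change, which is what makes trying several models so quick.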

Subreddits?

My first pick was a running community; if you are into running, you will probably find it interesting. Highly recommended! I chose the “Lose it” subreddit as my second topic, since losing weight and staying in shape has always been a hot topic, and it is in some ways related to running!

Preprocessing step?

The first step was to build an API function to scrape Reddit posts and collect data (by the way, my computer is still running the code I started earlier to pull post data). After data collection, the obvious next step was data cleaning and EDA. I then merged the two data frames into one big data frame and passed it to my NLP tools, TF-IDF and CountVectorizer, to analyze and rank the vocabulary of each post.
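The merge-and-vectorize step might look like this. It assumes each subreddit's cleaned posts already sit in their own DataFrame; the column names and toy rows are illustrative, not the author's actual schema:

```python
# Sketch: tag each subreddit's posts, stack them, and vectorize the text.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical cleaned data, one frame per subreddit.
running = pd.DataFrame({"text": ["long run in the rain", "marathon pace advice"]})
loseit = pd.DataFrame({"text": ["weekly weigh in thread", "calorie deficit tips"]})

# Tag each frame with its source subreddit, then stack into one big frame.
running["subreddit"] = 0
loseit["subreddit"] = 1
df = pd.concat([running, loseit], ignore_index=True)

# Turn raw text into token counts; TfidfVectorizer is a drop-in swap.
cv = CountVectorizer(stop_words="english")
X = cv.fit_transform(df["text"])
print(X.shape)  # (number of posts, vocabulary size)
```

The `subreddit` column doubles as the target label for the classifier later on.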


Modeling the data?

After all of the time-consuming steps above, it was time to have some fun building the model: I initialized models using the Naïve Bayes and Logistic Regression algorithms in scikit-learn, trained and fit them, and evaluated performance by feeding posts from each subreddit back into the model and building a confusion matrix for the final assessment!
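A minimal version of that train-and-evaluate loop could look like the following. The toy posts here are hypothetical stand-ins for the real ~20,000-post dataset:

```python
# Sketch: split, train Naive Bayes, and score it with a confusion matrix.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

posts = [
    "tempo run splits", "race day nutrition", "couch to 5k week three",
    "counting macros daily", "non scale victory", "maintenance calories question",
] * 5  # repeated so both classes survive the split
labels = [0, 0, 0, 1, 1, 1] * 5

X_train, X_test, y_train, y_test = train_test_split(
    posts, labels, test_size=0.3, random_state=42, stratify=labels)

# Fit the vectorizer on training text only, then transform the test text.
tfidf = TfidfVectorizer()
nb = MultinomialNB()
nb.fit(tfidf.fit_transform(X_train), y_train)

# Rows of the confusion matrix are true subreddits, columns are predictions.
preds = nb.predict(tfidf.transform(X_test))
print(confusion_matrix(y_test, preds))
```

Fitting the vectorizer on the training split alone keeps test vocabulary from leaking into the model, which keeps the confusion matrix honest.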

While my computer is still running (it feels like the results are never coming), I made these two word clouds from my two subreddits using Andreas Mueller’s code (my passion for art never dies!). [1]

Word clouds

My final model surfaces the top keywords of each subreddit so you can easily decide which community is right for you! It identifies which subreddit a given post and its vocabulary came from with 98% accuracy.

[1] https://peekaboo-vision.blogspot.com/2012/11/a-wordcloud-in-python.html
