Classifying Clickbait Headlines

This app explores modeling of text data, in this case whether a news headline is real news or clickbait. It lets the user fit models that classify articles as clickbait or not clickbait based on the headline text. The data are preloaded from the textclassificationexamples package, which can be installed using remotes::install_github('leahannejohnson/textclassificationexamples'). A number of features are created by applying helper functions from the same package to the rows of the data frame with dplyr::mutate(). These functions include:

has_common_phrase(): Takes a character string and returns a logical - TRUE if the string contains a common phrase, and FALSE if it does not.
has_exaggerated_phrase(): Takes a character string and returns a logical - TRUE if the string contains an exaggerated phrase, and FALSE if it does not.
num_contractions(): Takes a character string and returns an integer - the number of contractions contained in the string.
num_stop_words(): Takes a character string and returns an integer - the number of stop words contained in the string.
num_pronouns(): Takes a character string and returns an integer - the number of pronouns contained in the string.
starts_with_num(): Takes a character string and returns a logical - TRUE if the string begins with a number, and FALSE if it does not.
has_question_word(): Takes a character string and returns a logical - TRUE if the string contains a question word, and FALSE if it does not.
positivity(): Takes a character string and returns the sum of the AFINN positivity scores of the words in the string.
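The feature-creation step described above can be sketched as follows. This is a minimal illustration, not the package's code: starts_with_num_sketch() and has_question_word_sketch() are hypothetical stand-ins written here so the example runs without the package, the question-word list is an assumption, and the toy headlines are made up. The real helpers in textclassificationexamples may be implemented differently.

```r
library(dplyr)

# Hypothetical stand-ins for two of the package helpers (assumptions, not
# the actual implementations in textclassificationexamples).
starts_with_num_sketch <- function(x) {
  grepl("^[0-9]", x)
}

# Assumed list of question words, for illustration only.
question_words <- c("who", "what", "when", "where", "why", "how", "which")

has_question_word_sketch <- function(x) {
  pattern <- paste0("\\b(", paste(question_words, collapse = "|"), ")\\b")
  grepl(pattern, tolower(x), perl = TRUE)
}

# Made-up headlines standing in for the preloaded data.
headlines <- data.frame(
  title = c(
    "17 Things You Won't Believe",
    "Senate Passes Budget Bill",
    "Why Do Cats Purr?"
  ),
  clickbait = c(TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# Each helper takes the headline text and adds one feature column.
features <- headlines %>%
  mutate(
    starts_with_num = starts_with_num_sketch(title),
    has_question_word = has_question_word_sketch(title)
  )

print(features)
```

With the package installed, the same mutate() call would presumably use the real helpers (for example starts_with_num(title)) in place of the stand-ins.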

By modifying the variable and model selection inputs on the sidebar panel on the left, the user is able to fit a variety of models including logistic regression, decision trees, and random forests.
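As a rough sketch of what fitting two of these model types might look like outside the app, the example below uses simulated data with column names borrowed from the helper functions. The data, the coefficients used to simulate the labels, and the 0.5 cutoff are all assumptions made for illustration; the app's own fitting code is not shown in this document. A random forest could be fit with the same formula interface using a package such as randomForest, which is omitted here to keep the example dependency-light.

```r
library(rpart)

set.seed(1)
n <- 200

# Simulated features (not the real clickbait data).
sim <- data.frame(
  starts_with_num = rbinom(n, 1, 0.3) == 1,
  num_pronouns = rpois(n, 1)
)

# Simulated labels: headlines that start with a number or contain more
# pronouns are made more likely to be clickbait (an arbitrary assumption).
p <- plogis(-1 + 2 * sim$starts_with_num + 0.5 * sim$num_pronouns)
sim$clickbait <- rbinom(n, 1, p) == 1

# Logistic regression on the selected variables.
logit_fit <- glm(
  clickbait ~ starts_with_num + num_pronouns,
  data = sim,
  family = binomial
)

# Classification tree on the same variables.
tree_fit <- rpart(
  factor(clickbait) ~ starts_with_num + num_pronouns,
  data = sim,
  method = "class"
)

# In-sample predictions: TRUE means "classified as clickbait".
logit_pred <- predict(logit_fit, type = "response") > 0.5
tree_pred <- predict(tree_fit, type = "class") == "TRUE"

# In-sample accuracy of each model.
cat("logistic accuracy:", mean(logit_pred == sim$clickbait), "\n")
cat("tree accuracy:    ", mean(tree_pred == sim$clickbait), "\n")
```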

Authors: Leah Johnson, Nicholas Horton

Last modified: May 2, 2025

At each node of the decision tree below, the box contains three values. The first is the classification decision: FALSE if the observation is classified as not clickbait, and TRUE if it is classified as clickbait. The second is the probability of that classification, and the third is the percentage of observations that fall into that category.


The default model (when no variables are selected) is the intercept-only logistic regression model.
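One way such a default could be implemented is to build the model formula from the selected variable names and fall back to an intercept-only right-hand side when the selection is empty. The sketch below is an assumption about the mechanism, not the app's code: build_formula() is a hypothetical helper and the toy data are made up.

```r
# Hypothetical helper: turn a character vector of selected variables into
# a model formula, using "1" (intercept only) when nothing is selected.
build_formula <- function(selected, response = "clickbait") {
  rhs <- if (length(selected) == 0) "1" else selected
  reformulate(rhs, response = response)
}

# Made-up data: four clickbait and four non-clickbait headlines.
toy <- data.frame(
  clickbait = c(TRUE, FALSE, TRUE, TRUE, FALSE, FALSE, TRUE, FALSE),
  num_pronouns = c(2, 0, 1, 3, 0, 1, 2, 0)
)

print(build_formula(character(0)))
print(build_formula(c("num_pronouns")))

# With no variables selected, the fitted probability for every headline is
# the overall proportion of clickbait in the data.
null_fit <- glm(build_formula(character(0)), data = toy, family = binomial)
print(unname(fitted(null_fit)[1]))
print(mean(toy$clickbait))
```

The intercept-only model gives a useful baseline: any model with predictors should classify at least as well as always predicting the more common class.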