This app explores the modeling of text data, in this case classifying news headlines as real news or clickbait. It lets the user fit models that classify articles as clickbait or not clickbait based on the headline text.
The data are preloaded from the textclassificationexamples package, which can be installed using remotes::install_github('leahannejohnson/textclassificationexamples'), and a number of features are created by applying helper functions from the same package to the rows of the data frame with the help of dplyr::mutate(). These functions include:
- has_common_phrase(): Takes a character string and returns a logical - TRUE if the string contains a common phrase, and FALSE if it does not.
- has_exaggerated_phrase(): Takes a character string and returns a logical - TRUE if the string contains an exaggerated phrase, and FALSE if it does not.
- num_contractions(): Takes a character string and returns an integer - the number of contractions contained in the string.
- num_stop_words(): Takes a character string and returns an integer - the number of stop words contained in the string.
- num_pronouns(): Takes a character string and returns an integer - the number of pronouns contained in the string.
- starts_with_num(): Takes a character string and returns a logical - TRUE if the string begins with a number, and FALSE if it does not.
- has_question_word(): Takes a character string and returns a logical - TRUE if the string contains a question word, and FALSE if it does not.
- positivity(): Takes a character string and returns the sum of the AFINN positivity scores of the words in the string.
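The feature-creation step described above can be sketched as follows. This is only an illustration: the two `_sketch` functions are hypothetical stand-ins written with base R regular expressions, not the package's own implementations (which may use different word lists and rules), and the three headlines are made up for the example.

```r
library(dplyr)

# Hypothetical stand-ins, for illustration only. They are NOT the
# implementations from textclassificationexamples.
starts_with_num_sketch <- function(x) {
  # TRUE if the string begins with a digit
  grepl("^[0-9]", x)
}

has_question_word_sketch <- function(x) {
  # TRUE if the string contains one of a small, assumed set of question words
  grepl("\\b(who|what|when|where|why|how)\\b", x,
        ignore.case = TRUE, perl = TRUE)
}

# Made-up headlines standing in for the preloaded data
headlines <- data.frame(
  title = c("17 Things You Didn't Know About Cats",
            "Senate passes budget bill",
            "Why Is the Sky Blue?"),
  stringsAsFactors = FALSE
)

# Each helper adds one feature column computed from the headline text
features <- headlines %>%
  mutate(
    starts_with_num   = starts_with_num_sketch(title),
    has_question_word = has_question_word_sketch(title)
  )

print(features)
```

With the real package installed, the same pattern would presumably apply with the package's helpers in place of the stand-ins, provided they accept a whole column of strings at once; if they only accept a single string, a row-wise call such as `sapply()` inside `mutate()` would be needed instead.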
Authors: Leah Johnson, Nicholas Horton
Last modified: January 21, 2022
Each node of the decision tree below is drawn as a box containing three values. The first is the classification decision: TRUE if observations at that node are classified as clickbait, and FALSE if they are not. The second is the probability of that classification, and the third is the percentage of observations that fall into that node.
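To make the three values concrete, here is a minimal sketch that recovers them from a small classification tree. It assumes the rpart package, which may or may not be what the app itself uses, and the toy data frame is invented for the example rather than taken from the app's data.

```r
library(rpart)

# Toy data, invented for illustration: 40 headlines with a single feature.
# 20 headlines have two pronouns (18 clickbait, 2 not);
# 20 headlines have none (4 clickbait, 16 not).
toy <- data.frame(
  num_pronouns = rep(c(2L, 0L), each = 20),
  clickbait = factor(c(rep(TRUE, 18), rep(FALSE, 2),
                       rep(TRUE, 4), rep(FALSE, 16)))
)

fit <- rpart(clickbait ~ num_pronouns, data = toy, method = "class")

# Value 1: the classification decision for two new headlines
new_headlines <- data.frame(num_pronouns = c(0L, 2L))
decision <- predict(fit, new_headlines, type = "class")

# Value 2: the probability attached to that decision.
# Columns of `prob` follow the factor levels ("FALSE", "TRUE"),
# so the integer code of the decision picks the matching column.
prob <- predict(fit, new_headlines, type = "prob")
prob_of_decision <- prob[cbind(seq_len(nrow(prob)), as.integer(decision))]

# Value 3: the percentage of training observations in each node
# (root first, then the two leaves)
pct_in_node <- 100 * fit$frame$n / fit$frame$n[1]

print(data.frame(new_headlines, decision, prob_of_decision))
print(pct_in_node)
```

Here the root node holds all of the observations and each leaf holds half of them, so the percentages in the leaves sum to the percentage in their parent.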