Understanding Twitter Dynamics With R and Gephi: Text Analysis and Centrality

Centrality and text analysis allow users to get more out of their social network data. Here’s how you can leverage them using R and Gephi.


Juan Manuel Ortiz de Zarate

This is the second installment in a three-part series on Twitter cluster analyses using R and Gephi. Part one laid the groundwork for the example we dive deeper into below; part three uses cluster analysis to draw conclusions from polarized posts about US politics.

Social Network Centrality

To achieve our goals, first we need to introduce the concept of centrality. In network science, centrality measures how influential a node is within the network. But influence is an ambiguous concept; it can be understood in many ways. Is a node with many edges more influential than a node with fewer but more “important” edges? What constitutes an important edge on a social network?

To address these ambiguities, network scientists have developed many measures of centrality. Here, we discuss four commonly used measures, though many more are available.

Degree

The most common and intuitive measure is degree centrality. The idea behind it is simple: Measure a node’s influence by its degree. If the graph is directed, there are two variants: the indegree (the number of edges pointing to the node) and the outdegree (the number of edges pointing away from it). The former is related to the authority score and the latter to the hub score.

In the first installment of this series, we used the undirected approach. This time, we focus on the indegree approach. This allows a more accurate analysis by emphasizing users who get retweeted by others over users who merely retweet frequently.
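
To make this concrete, here is a minimal sketch on a hypothetical retweet graph, assuming the igraph package that we load in the hands-on section below; the account names are purely illustrative:

library(igraph)

# Toy directed "retweet" graph: an edge A -> B means that A retweeted B
edges <- matrix(c("alice", "carol",
                  "bob",   "carol",
                  "dave",  "carol",
                  "carol", "bob"),
                ncol = 2, byrow = TRUE)
toy <- graph_from_edgelist(edges, directed = TRUE)

degree(toy, mode = "all") # total degree
degree(toy, mode = "in")  # indegree: how often each account gets retweeted
degree(toy, mode = "out") # outdegree: how often each account retweets others

Here, carol has the highest indegree because three other accounts retweet her, even though she only retweets once herself.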

Eigenvector

The eigenvector measure builds upon degree centrality: The more that influential nodes point to a given node, the higher its score. We start with an adjacency matrix, where rows and columns represent nodes, and a 1 or 0 indicates whether the corresponding nodes of a given row and column are connected. The main calculation estimates the matrix’s eigenvectors. The principal eigenvector contains the centrality scores we want: Position i holds the centrality score of node i.
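
As a rough illustration of the idea (not the code we run on the Twitter data), the sketch below builds the adjacency matrix of a small toy graph, extracts its principal eigenvector with base R’s eigen, and compares the result with igraph’s built-in eigen_centrality; the node names are illustrative:

library(igraph)

# Small undirected toy graph
g <- graph_from_edgelist(matrix(c("a","b", "a","c", "a","d", "b","c", "d","e"),
                                ncol = 2, byrow = TRUE), directed = FALSE)

# Adjacency matrix: 1 if the row/column nodes are connected, 0 otherwise
A <- as.matrix(as_adjacency_matrix(g))

# The principal eigenvector holds the centrality scores
scores <- abs(eigen(A)$vectors[, 1])
round(scores / max(scores), 3)   # normalized so the most central node scores 1

# igraph computes the same measure directly:
round(eigen_centrality(g)$vector, 3)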

PageRank

PageRank is the variation of the eigenvector measure at the core of Google’s search engine. The exact method Google uses is unknown, but the general idea is that each node starts with a score of 1, then distributes its score in equal parts to each of its edges. For example, if a node has three edges extending from it, it “sends” one third of its score through each edge. At the same time, the node is made more important by the edges that point to it. This results in a solvable system of N equations with N unknowns.
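
The sketch below illustrates that idea on a toy directed graph with a plain power iteration, using the damping factor from the published PageRank formulation (Google’s production algorithm is, as noted, not public); the graph and node names are illustrative, and igraph’s page_rank is shown for comparison:

library(igraph)

# Toy directed graph
g <- graph_from_edgelist(matrix(c("a","b", "a","c", "b","c", "c","a", "d","c"),
                                ncol = 2, byrow = TRUE), directed = TRUE)
A <- as.matrix(as_adjacency_matrix(g))

damping <- 0.85                         # the usual damping factor
n <- nrow(A)
M <- A / pmax(rowSums(A), 1)            # each node splits its score evenly among its out-edges
scores <- rep(1, n)                     # every node starts with a score of 1
for (step in 1:50) {                    # iterate until the scores stabilize
  scores <- (1 - damping) / n + damping * as.vector(t(M) %*% scores)
}
round(setNames(scores / sum(scores), rownames(A)), 3)

# igraph's implementation for comparison:
round(page_rank(g, damping = damping)$vector, 3)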

Betweenness

The fourth measure, betweenness, uses a very different approach. Here, a node is considered influential if it lies on many of the shortest paths between other nodes. That is, it’s responsible for communicating with many other nodes, connecting “different worlds.”

For example, in social network analysis, these kinds of nodes could be understood as the types of people who help others find new jobs or make new connections—they are the doors to previously unknown social circles.
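
A quick hypothetical example of this “bridge” behavior, again with illustrative node names rather than our Twitter data:

library(igraph)

# Two tightly knit triangles joined only through "broker"
edges <- matrix(c("a","b", "b","c", "a","c",      # first group
                  "d","e", "e","f", "d","f",      # second group
                  "c","broker", "broker","d"),    # the only bridge between them
                ncol = 2, byrow = TRUE)
g <- graph_from_edgelist(edges, directed = FALSE)

# "broker" sits on every shortest path between the two groups,
# so it gets the highest betweenness score:
sort(betweenness(g), decreasing = TRUE)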

Which Should I Use?

The appropriate centrality measure depends on the goal of your analysis. Do you want to know which users are most often interacted with, purely in terms of quantity? Degree centrality is likely your best option. Do you prefer a measure that also weighs the quality of those connections? Then eigenvector or PageRank will yield better results. If you want to know which users function most effectively as bridges between different communities, betweenness is your best option.

When using multiple similar measures, e.g., eigenvector and PageRank, you can estimate all of them and see if they yield equivalent rankings. If not, you can deepen your analysis of the differences or generate a new measure by combining their scores.
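
For instance, once the cent data frame built in the hands-on section below is available, a rank correlation is a quick (if rough) way to check agreement between two measures, and averaging ranks is one simple way to combine them; the combined_rank column is just an illustrative name:

# Assumes the `cent` data frame built in the hands-on section below
cor(cent$eig, cent$prank, method = "spearman")  # close to 1 means near-identical rankings

# One simple combined score: the average of the two rankings (lower = more central)
cent$combined_rank <- (rank(-cent$eig) + rank(-cent$prank)) / 2
head(cent[order(cent$combined_rank), c("account", "eig", "prank", "combined_rank")])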

Another approach uses principal component analysis to estimate which measure gives you more information about the real influence of the nodes on your network.
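
A minimal sketch of that idea, again assuming the cent data frame from below: scale the four measures, run a PCA, and inspect how much variance the first component explains and how each measure loads onto it.

# Assumes the `cent` data frame built in the hands-on section below
pca <- prcomp(cent[, c("bet", "eig", "prank", "degr")], scale. = TRUE)
summary(pca)       # proportion of variance captured by each component
pca$rotation[, 1]  # loadings: how strongly each measure contributes to the first component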

Hands-on Centrality Calculation

Let’s see how we can calculate these measures using R and RStudio. (They can also be done with Gephi.)

First, we need to load all the libraries that we are going to use throughout this article:

library("plyr")
library(igraph)
library(tidyverse)
library(NLP)
library("tm")
library(RColorBrewer)
library(wordcloud)
library(topicmodels)
library(SnowballC)
library("textmineR")

Next, we will remove isolated nodes from the data we used before, as they are not useful to this analysis. Then, we will use the igraph functions betweenness, centr_eigen, page_rank, and degree to estimate the centrality measures. Finally, we will store the scores on the igraph object and on a data frame to see which users were the most central.

load("art1_tweets.RData")
Isolated = which(degree(net)==0)
net_clean = delete.vertices(net, Isolated)

cent<-data.frame(bet=betweenness(net_clean),eig=centr_eigen(net_clean)$vector,prank=(page_rank(net_clean)$vector),degr=degree(net_clean, mode="in"))
cent <- cbind(account = rownames(cent), cent)

Now we can check the 10 most central users by each measure:

Degree
top_n(cent, 10, degr) %>% arrange(desc(degr)) %>% select(degr)

Eigenvector
top_n(cent, 10, eig) %>% arrange(desc(eig)) %>% select(eig)

PageRank
top_n(cent, 10, prank) %>% arrange(desc(prank)) %>% select(prank)

Betweenness
top_n(cent, 10, bet) %>% arrange(desc(bet)) %>% select(bet)

The results:

| Degree | Eigenvector | PageRank | Betweenness |
| --- | --- | --- | --- |
| ESPNFC 5892 | PSG_inside 1 | mundodabola 0.037 | viewsdey 77704 |
| TrollFootball 5755 | CrewsMat19 0.51 | AleLiparoti 0.026 | EdmundOris 76425 |
| PSG_inside 5194 | eh01195991 0.4 | PSG_inside 0.017 | ba*****lla 63799 |
| CrewsMat19 4344 | mohammad135680 0.37 | RoyNemer 0.016 | FranciscoGaius 63081 |
| brfootball 4054 | ActuFoot_ 0.34 | TrollFootball 0.013 | Yemihazan 62534 |
| PSG_espanol 3616 | marttvall 0.34 | ESPNFC 0.01 | hashtag2weet 61123 |
| IbaiOut 3258 | ESPNFC 0.3 | PSG_espanol 0.007 | Angela_FCB 60991 |
| ActuFoot_ 3175 | brfootball 0.25 | lnstantFoot 0.007 | Zyyon_ 57269 |
| FootyHumour 2976 | SaylorMoonArmy 0.22 | IbaiOut 0.006 | CrewsMat19 53758 |
| mundodabola 2778 | JohnsvillPat 0.22 | 2010MisterChip 0.006 | MdeenOlawale 49572 |

We can see that the first three measures share a number of users, such as PSG_inside, ESPNFC, CrewsMat19, and TrollFootball. We can assume that they had a strong influence over the discussion. Betweenness has a different approach to measuring centrality and therefore doesn’t show as much overlap with the other techniques.

Note: Views expressed by the Twitter accounts mentioned in this article do not reflect those of Toptal or the author.

In the following images, you can see our original colored network graph with two user label overlays. In the first, the nodes are highlighted by their PageRank scores, and in the second, by their betweenness scores:

An image showing a colored PageRank plot, with the top 10 users and their networks highlighted. The three biggest users are PSG_inside, TrollFootball, and ESPNFC. ESPNFC is located on the left of the plot and colored purple, while PSG_inside is placed to the right of it, colored red. TrollFootball is located higher and to the right of them, between green-, blue-, and orange-colored users.
Messi discussion with the top 10 PageRank users highlighted

An image showing a colored betweenness plot, with the top 10 users and their networks labeled and highlighted. All of the top 10 users, which are more similar in size than in the previous image, are located in the lower-left corner of the image, which is colored purple. They are grouped together tightly.
Messi discussion with the top 10 betweenness users highlighted

Gephi can be used to reproduce these images. You can estimate betweenness scores with the Network Diameter option in the statistics panel and PageRank scores with its dedicated PageRank option. Then, you can show node names using attributes, as demonstrated in the first installment of this series.

Text Analysis: R and LDA

We can also analyze social network discussions to identify what users have been talking about. There are multiple ways to approach this. Here, we will do topic modeling through latent Dirichlet allocation (LDA), an unsupervised machine learning technique that allows us to estimate which sets of words tend to appear together. Then, from those sets of words, we can infer the topic being discussed.

The first step is to sanitize the text. To do that, we define the following function:

# This function normalizes text by removing Twitter-related terms and noisy characters
sanitize_text <- function(text) {
  # Convert to ASCII to remove accented characters:
  text <- iconv(text, to = "ASCII", sub = " ")
  # Move to lowercase and delete the "rt" token that Twitter adds to retweets:
  text <- gsub("rt", " ", tolower(text))
  # Delete links and usernames:
  text <- gsub("@\\w+", " ", gsub("http.+ |http.+$", " ", text))
  # Delete tabs and punctuation:
  text <- gsub("[ |\t]{2,}", " ", gsub("[[:punct:]]", " ", text))
  # Delete "amp", left over from the HTML entity &amp;:
  text <- gsub("amp", " ", text)
  # Delete leading and trailing blanks:
  text <- gsub("^ ", "", gsub(" $", "", text))
  # Collapse repeated spaces:
  text <- gsub(" +", " ", text)
  return(text)
}

We also need to remove stop words, duplicates, and empty entries. Next, we have to convert our text to a document-term matrix to be processed by LDA.

In this dataset, users write in many languages (English, Spanish, French, etc.), but LDA works best when we focus on a single language. We will apply it to the users of the biggest community detected in the first installment of this series, which is composed principally of English-speaking accounts.

# Detect communities:
my.com.fast <- cluster_louvain(as.undirected(simplify(net)))
largestCommunities <- order(sizes(my.com.fast), decreasing = TRUE)[1:3]
# Save the usernames of the biggest community:
community1 <- names(which(membership(my.com.fast) == largestCommunities[1]))

# Sanitize the text of the users of the biggest community:
text <- unique(sanitize_text(tweets.df[which(tweets.df$screen_name %in% community1),]$text))
text = text[text != ''] # Delete empty entries
# Remove English stopwords:
stopwords_regex = paste(stopwords('en'), collapse = '\\b|\\b')
stopwords_regex = paste0('\\b', stopwords_regex, '\\b')
text = stringr::str_replace_all(text, stopwords_regex, '')
# Create the document-term matrix:
dtm <- CreateDtm(text,
                 doc_names = seq_along(text),
                 ngram_window = c(1, 2))

Topic Counts and Coherence Scores

The main hyperparameter that we need to define in LDA is the number (k) of topics that we want to estimate. However, how can we know it beforehand? One common approach is to train LDA models over different k values and measure the coherence of each one. We’ll do this for k values from 3 up to 20, since values outside this range aren’t worth checking, in my experience:

tf <- TermDocFreq(dtm = dtm)
# Keep terms that appear more than once but in fewer than half the documents:
tf_trimmed = tf$term[ tf$term_freq > 1 & tf$doc_freq < nrow(dtm) / 2 ]

# Create a folder to store trained models:
model_dir <- paste0("models_", digest::digest(tf_trimmed, algo = "sha1"))
if (!dir.exists(model_dir)) dir.create(model_dir)

# Define a function to infer LDA topics:
train_lda_model <- function(number_of_topics){
    filename = file.path(model_dir, paste0(number_of_topics, "_topics.rda"))
    # Check if the model already exists:
    if (!file.exists(filename)) {
        # To get exactly the same output on each run, use a constant seed:
        set.seed(12345)
        lda_model = FitLdaModel(dtm = dtm, k = number_of_topics, iterations = 500)
        lda_model$k = number_of_topics
        lda_model$coherence = CalcProbCoherence(phi = lda_model$phi, dtm = dtm, M = 5)
        save(lda_model, file = filename)
    } else {
        load(filename)
    }
    
    lda_model
}
# The number of topics that we are going to infer in each LDA training run:
topic_count = seq(3, 20, by = 1)
# Train through the TmParallelApply function
models = TmParallelApply(X = topic_count,
                         FUN = train_lda_model,
                         export = c("dtm", "model_dir"))

Next, we graph the coherence value of each:

coherence_by_topics_quantity = data.frame(
  topic_number = sapply(models, function(model_instance) nrow(model_instance$phi)),
  score_coherence = sapply(models, function(model_instance) mean(model_instance$coherence)),
  stringsAsFactors = FALSE)

ggplot(coherence_by_topics_quantity, aes(x = topic_number, y = score_coherence)) +
  geom_point() +
  geom_line(group = 1) +
  ggtitle("Coherence by Topic") +
  theme_minimal() +
  scale_x_continuous(breaks = seq(1, 20, 1)) +
  ylab("Coherence Score") +
  xlab("Number of topics")

A high coherence value shows a better segmentation of the text into topics:

A graph showing the coherence score for different topics. The coherence score varies from slightly over 0.05 on six to seven topics, with three to 12 topics all having a score below 0.065. The score suddenly peaks at about 0.105 for 13 topics. Then it goes below 0.06 for 17 topics, up to nearly 0.09 for 19 topics, and finishes at just above 0.07 for 20 topics.

We reach our peak coherence score with k = 13, so we’ll use the LDA model trained with 13 topics. Through the GetTopTerms function, we can see the 10 principal words of each topic and infer each topic’s meaning from them:

# Select the model with the highest coherence score:
best_model <- models[which.max(coherence_by_topics_quantity$score_coherence)][[1]]

# The 10 most important terms by topic:
best_model$top_terms <- GetTopTerms(phi = best_model$phi, M = 10)
top10 <- as.data.frame(best_model$top_terms)
top10

The following table details the five most important topics detected and the 10 principal words that exemplify them:

| | t_1 | t_2 | t_3 | t_4 | t_5 |
| --- | --- | --- | --- | --- | --- |
| 1 | messi | messi | messi | messi | messi |
| 2 | lionel | instagram | league | est | psg |
| 3 | lionel_messi | post | win | il | leo |
| 4 | psg | million | goals | au | leo_messi |
| 5 | madrid | likes | ch | pour | ahora |
| 6 | real | spo | ions | pas | compa |
| 7 | barcelona | goat | ch_ions | avec | va |
| 8 | paris | psg | ucl | du | ser |
| 9 | real_madrid | bar | ballon | qui | jugador |
| 10 | mbapp | bigger | world | je | mejor |

Although most users in this community are English speakers, there are still a number of French and Spanish speakers (topics t_4 and t_5 in the table). We can infer that the first topic (t_1) relates to Messi’s previous team (FC Barcelona), the second (t_2) is about Messi’s post on Instagram, and the third (t_3) focuses on Messi’s achievements.

Now that we have the topics, we can predict which of them was the most discussed. To do that, we will first concatenate tweets by users (again, from the largest community):

tweets.df.com1 = tweets.df[which(tweets.df$screen_name %in% community1),]
users_text <- ddply(tweets.df.com1,
                    ~screen_name,
                    summarise,
                    text = paste(text, collapse = " "))

Then, we sanitize the text as before and create the DTM. After that, we call the predict function with our LDA model and the DTM as arguments. We use the Gibbs method with a limited number of iterations to keep the computing time reasonable, since we have a lot of text to analyze:

users_text$text <- sanitize_text(users_text$text) # Sanitize the concatenated text
# Remove English stopwords:
stopwords_regex = paste(stopwords('en'), collapse = '\\b|\\b')
stopwords_regex = paste0('\\b', stopwords_regex, '\\b')
users_text$text = stringr::str_replace_all(users_text$text, stopwords_regex, '')

# Create the document-term matrix, one document per user:
dtm.users.com1 <- CreateDtm(users_text$text,
                            doc_names = users_text$screen_name,
                            ngram_window = c(1, 2))
# Predict the topic mixture of each user:
com1.users.topics = predict(best_model, dtm.users.com1, method = "gibbs", iterations = 100)

Now, in the com1.users.topics data frame, we see how much each user talked about each topic:

| Account | t_1 | t_2 | t_3 | t_4 | t_5 | […] |
| --- | --- | --- | --- | --- | --- | --- |
| ___99th | 0.02716049 | 0.86666666 | 0.00246913 | 0.00246913 | 0.00246913 | |
| Boss__ | 0.05185185 | 0.84197530 | 0.00246913 | 0.00246913 | 0.00246913 | |
| Memphis | 0.00327868 | 0.00327868 | 0.03606557 | 0.00327868 | 0.00327868 | |
| ___Alex1 | 0.00952380 | 0.00952380 | 0.00952380 | 0.00952380 | 0.00952380 | |
| […] | | | | | | |

Finally, with this information, we can create a new attribute on the node graph to define which topic was most talked about by which user. Then we can create a new GML file to visualize it in Gephi:

# Get the subgraph of the first community:
net.com1 = induced_subgraph(net, community1)
# Estimate the topic with the max score for each user:
com1.users.maxtopic = cbind(users_text$screen_name,
                            colnames(com1.users.topics)[apply(com1.users.topics,
                                                              1,
                                                              which.max)])
# Order the users' topic data frame by the users' order in the graph:
com1.users.maxtopic = com1.users.maxtopic[match(V(net.com1)$name,
                                                com1.users.maxtopic[,1]),]
# Create a new graph attribute holding the topic most discussed by each user:
V(net.com1)$topic = com1.users.maxtopic[,2]
# Write the graph to a GML file for Gephi:
write_graph(simplify(net.com1), "messi_graph_topics.gml", format = "gml")

A colored node graph generated using Gephi, showing ESPNFC as the highest-ranking user by PageRank centrality. ESPNFC is located near the bottom of the image, with many purple nodes grouped below it.
Largest community of Messi discussion colored by topic and with users highlighted by PageRank centrality

An image showing the percentage of users highlighted by each color used in the graph, with the purple "t 6" being the most-used color (40.53% of all users in the graph), followed by the green "t 13" at 11.02%, and blue/cyan "t 10" at 9.68%. A gray "NA," in second-to-last position of this list of 11, makes up 2.25%.
Topic labels and percentage of users for each color used in the graph

Inferring Important Topics and Applying Social Network Centrality

In the first installment of this series, we learned how to obtain data from Twitter, create the interaction graph, plot it through Gephi, and detect communities and important users. In this installment, we expanded upon this analysis by demonstrating the use of additional criteria to detect influential users. We also demonstrated how to detect and infer what the users were talking about and plot that in the network.

In our next article, we will continue to deepen this analysis by showing how users can explore clustered social media data.


Understanding the basics

  • What is centrality in network analysis?

    Centrality is a measure that, in the context of social network analysis, can help us detect influential users on the network.

  • What is degree centrality in social network analysis?

    Degree centrality measures the influence of a user by the number of edges that they have.

  • What is eigenvector centrality used for?

    Eigenvector centrality measures the influence of a user by the number and quality of edges that they have.

  • What is meant by betweenness centrality?

    Betweenness centrality measures the influence of a user by how many users connect through them.

  • What is RStudio used for?

    RStudio is an integrated development environment (IDE) for programming in R.

  • Can R do text analysis?

    R is a powerful programming language well suited to text analysis. It has many libraries for classic text analysis tasks, such as sentiment analysis and topic modeling.
