Small World Properties of Facebook Group Networks

by user

on 15 сентября 2016

Category: Documents

>> Downloads: 3

views

Report

Comments

Description

Download Small World Properties of Facebook Group Networks

Transcript

Small World Properties of Facebook Group Networks

Small World Properties of Facebook Group Networks
A Thesis
Presented to the
Department of Mathematics
and the
Faculty of the Graduate College
University of Nebraska
in Partial Fulfillment
of the requirements for the degree
Masters of Arts
University of Nebraska at Omaha
Jason Wohlgemuth
August 2012
Supervisory Committee:
Dora Matache
Robb Todd
Andrew Swift
Raj Dasgupta
Abstract: Small World Properties of Facebook Group
Networks
Jason Wohlgemuth, M.A.
University of Nebraska, 2012
Advisor: Dora Matache, Ph.D.
Small world networks permeate modern society. Paradoxically, the ubiquity of
small world networks is matched only by the difficulty of finding tractable examples.
Until we have an extensive taxonomy of small world networks at our disposal, intractability will prevail. In this thesis, we present a methodology for creating and
analyzing a practically limitless number of networks exhibiting small world network
properties. That is, we analyze networks, the vertices being Facebook groups sharing
a common word in the group name and the links being mutual members in any two
groups. Specifically, by analyzing the characteristics of single networks and network
ensembles, we attempt to quantify how the small world properties scale with network
size, and distill the essence of the networks to gain a deeper understanding of the
essential features based on insight gained from empirical observation. We show that
Facebook group networks have large clustering coefficients that do not vanish with
increased network size, and small average path lengths, thus exhibiting small world
features. At the same time, the average connectivity increases as a power of the network size, while the average clustering coefficients and average path lengths do not
exhibit a clear scaling with the size of the network. Our results are similar to what
have been found analyzing the Facebook network proper and give cause to pursue a
more in-depth investigation.
i
Acknowledgements
As I continue to learn, develop my mind, strive to understand the world or attempt
to convince others of my understanding, I will remember this thesis. In every man’s
life, there is a handful of people that hold special places in his memory. I would
like to thank Dr. Dora Matache for being one of those people for me. As a mentor,
she showed me professionalism in the pursuit of knowledge and a genuine passion for
learning on a most basic level. She guided me throughout this project and pointed
me at the finish line in the end.
I would also like to thank Drs. Andrew Swift, Robb Todd and Raj Dasgupta for
serving on my thesis committee and giving me their time and attention to further
develop this work and facilitate my growth.
Lastly and above all else, I would like to thank my wife, Julie Wohlgemuth. She is
the joy in my life and the foundation on which I stand. She makes me a better man
and I would be a shadow of my current self without her.
ii
Contents
List of Figures
iv
1. Introduction
1
1.1. Terminology
4
2. Overview of Relevant Literature and Motivation for Current Work
6
3. Mathematical Background
10
3.1. The Facebook Group Network and Significant Numerical Measures
10
3.2. Random Graph Models
13
3.3. Small World Network Models
15
4. More on the Facebook Social Network
19
4.1. Network Analysis Tools
21
4.2. Facebook Overview
22
5. Graph Creation Methodology
23
5.1. Python Overview
25
5.2. Node Creation
26
5.3. Edge Creation
27
5.4. Pruning
27
6. Analysis and Visualization
29
6.1. Network Visualization
29
6.2. Numerical Characteristics
37
7. Conclusions and Future Work
49
References
52
Appendix A. Python Code
56
iii
A.1. Module Import
56
A.2. Graph Generation
56
Appendix B. Tables
88
iv
List of Figures
1
The random rewiring procedure of the Watts-Strogatz model, which
interpolates between a regular ring lattice and a random network without
altering the number of nodes or edges [2]. After Watts and Strogatz,
1998 [43].
2
17
Characteristic path length, L(p), and clustering coefficient, C(p), for the
Watts-Strogatz model. The data are normalized by the values L(0) and C(0)
for a regular lattice. A logarithmic horizontal scale resolves the rapid drop
in L(p), corresponding to the onset of the small-world phenomenon. During
this drop C(p) remains almost constant, indicating that the transition to a
small world is almost undetectable at the local level [2]. After Watts and
Strogatz [43].
3
18
The process of creating a network out of Facebook groups of various size,
where two nodes (groups) are connected if they have at least one mutual
member. For example, the red group and yellow group have two mutual
members and are therefore connected in the lower network depiction. By
construction, isolated nodes are removed from the network. The purple group
has three members that are not in any of the other groups and is therefore
removed.
4
Facebook groups network created from groups with the word, anime, in the
title. The numerical characteristics are: N = 276, L = 9770, C̄ = 0.708, l =
1.837, < k >= 70.797, indicating short path lengths and a clustering
28
v
coefficient larger than the corresponding one for a random network with
similar topological aspects.
30
5
An enlarged depiction of the “core” of figure 4.
30
6
Plot of bieber network. The numerical characteristics are: N = 117, L =
1714, C̄ = 0.646, l = 1.903, < k >= 29.299, indicating short path lengths
and a clustering coefficient larger than the corresponding one for a random
network with similar topological aspects.
31
7
An enlarged depiction of the “core” of figure 6.
31
8
Plot of muslim network. The numerical characteristics are: N = 233, L =
4326, C̄ = 0.564, l = 2.121, < k >= 37.133, indicating short path lengths
and a clustering coefficient larger than the corresponding one for a random
9
network with similar topological aspects.
32
An enlarged depiction of the “core” of figure 8.
32
10 Plot of army network where the nodes are randomly distributed on a ring.
The numerical characteristics are: N = 48, L = 189, C̄ = 0.434, < k >= 7.875
and l = 2.32. It is difficult to gain much knowledge from a graph of this size,
but it is useful when analyzing the behaviors of network parameters as N is
increased.
33
11 Plot of navy network where the nodes are randomly distributed on a ring.
The numerical characteristics are: N = 20, L = 31, C̄ = 0.504, < k >= 3.1,
and l = 1.83. This is another example of small network. It serves as a good
conceptual comparison to figure 10, due to its generating keyword.
34
12 Plot of army network using the Fructerman-Reingold force-directed algorithm. 34
vi
13 Plot of navy network using the Fructerman-Reingold force-directed algorithm. 35
14 Plot of the Religion network. The individual key words used are: islam,
muslim, bible, atheist, catholic, christian, god, allah, mormon, pagan, pope.
The numerical characteristics are N = 1558, L = 39547, C̄ = 0.543, l =
2.723, < k >= 50.766.
39
15 An enlarged depiction of the “core” of figure 14. Observe the three smaller
cores and notice the density of links.
39
16 Plot of the Sports network. The individual key words used are: basketball,
soccer, football, baseball, golf, tennis, bowling and swim. The numerical
characteristics are N = 342, L = 10356, C̄ = 0.715, l = 2.403, < k >= 60.561. 40
17 An enlarged depiction of the “core” of figure 14.
40
18 The distribution of the connectivity values in log-log plots for the ten
networks of Table 1, in decreasing order of network size. Each graph has the
same general shape. As expected, the networks with more nodes render a
log-log plot as the plot appears to degrade at N ≈ 350. Although the graphs
appear to be mostly linear, the distributions cannot be characterized entirely
by a power law decay. In general, the plots adhere to a power law for small
values of k but then seem to decrease more quickly than a power law would
predict for large values of k. This is seen with more precision in figure 19
where we attempt to fit a Pareto distribution to a degree distribution.
41
19 Pareto fit for the degree distribution. The estimated shape parameter γ
is close to 1. However, the tail of the graph is actually heavier than the
estimated fit.
42
vii
20 Boxplot for the k/N values associated with the networks of Table 1.
43
21 Log-log plot of hki versus N for all the networks considered in this analysis,
including single and multiple keywords. The fitted line shows hki ∝ N 0.6 .
44
22 The average clustering coefficient, C̄(k), against the degree, k, on a log-log
scale for the ten networks of Table 1, in decreasing order of network size. In
general, each plot indicates an approximate logarithmic decay for large k.
These results are similar to other studies of small world networks [34, 38].
Observe that the largest network, Religion provides the most orderly plot,
which is to be expected.
45
23 Boxplot of the average clustering coefficients for individual connectivity
values C(k) values associated with the networks of Table 1.
46
24 Average clustering coefficient and average path length as functions of N.
The blue dots indicate the Facebook data and the red circles indicate the
corresponding values for a random network as listed in Table 1. There is a
clear difference for the clustering coefficients; however it is hard to notice
essential differences for the path lengths.
47
25 Average clustering coefficient as a function of N. The red dotted line depicts
the average value and possible steady state value for large N.
48
26 Average path length boxplot and distribution for both the Facebook group
networks and random networks with the same number of nodes and links.
49
1
1. Introduction
The networks of today are not static structures. They are dynamic and evolving
entities. The only constant parameters of a network are the meanings associated
with the network elements, nodes and edges. It has been said that the “science
of networks must become, in short, a manifestation of its own subject matter, a
network of scientists collectively solving problems that cannot be solved by any single
individual or even any single discipline” [42].
In the past few years there has been a great interest in studying the basic topology of a variety of networks such as the World Wide Web [3], signal transduction
networks [37], subway systems [23], railway networks [34] and more recently Facebook [38, 6]. Specifically, these efforts have seen the emergence of a very specific
type of network, a small world network, in which most nodes are not neighbors of
one another but can be reached in a small number of steps. First conceptualized
in a short story by the Hungarian novelist, Frigyes Karinthy, the concept of a small
world made its way into the realm of academia via the work of the social psychologist, Stanley Milgram [25, 26]. Physicists entered the fray with the development of
the aptly named Watts-Strogatz network model [43], which replaced the previously
uncontested random graph model of Erdős and Rényi.
In the Erdős-Rényi model the links between nodes are generated independently
with the same probability. The Watts-Strogatz model starts with a ring lattice (that
is the nodes are positioned in a ring and are connected to the k-nearest neighbors)
and each link is rewired with a given probability p. If p = 1 one obtains the ErdősRényi model, while p = 0 yields the ring lattice. Watts and Strogatz considered
2
two numerical characteristics associated with the network, the characteristic path
length and clustering coefficient [43]. They showed that for very small values of p and
for large enough values of p the two numerical characteristics tend to have similar
magnitudes. However, for intermediate values of p the two numerical characteristics
tend to be at opposite ends of their range, [0, 1], namely small characteristic path
length and large clustering coefficient. That behavior corresponds to small world
networks.
The random networks of Erdős and Rényi and the small world networks of Watts
and Strogatz seek to model the network phenomena all around us, the neural networks in our brains, the ecosystems of rainforests, the future of the stock market, the
dynamics of epidemics, and the internet to name a few. They seek to re-create these
network phenomena mathematically. This thesis is concerned with characterizing real
world examples of the small world network phenomena, starting with features such
as clustering and network diameter. That is, we seek to distill the essence of complex networks and gain a deeper understanding of the essential features for any given
network based on insight derived from empirical observation. Historically, such an
understanding has been out of reach due to technological limitations and the shear
vastness of data. By studying Facebook groups, we have found a tractable data set
of networks with amazing complexity. This thesis recounts the analysis of these Facebook group networks that can be analyzed independently and in comparison with
other similar networks to a high level of detail. Furthermore, having many networks
of similar construction allows us to address the scaling behavior of network parameters
as functions of network size, N.
3
The work in this thesis supplements previous research on the topology of Facebook
by focusing on groups rather than individuals. There is much to be learned about
networks by studying expansive social networks such as Facebook, but small world
properties have been seen and analyzed in networks with as few as 43 nodes [4].
Networks with fewer nodes are naturally more amenable to analysis and could yield
results without the requirement of expensive hardware or sophisticated software. By
analyzing the statistics of Facebook groups, we can search for the the most basic of
networks and compare the data to the Facebook network as a whole to see if small
world networks exhibit self-similiarity; that is, we can see if essential characteristics
are independent on the level of aggregation of the networks from basic elements to
modules that can be regarded as nodes. We are interested in understanding what
is the impact of aggregation of individuals into groups on the main statistics of the
network. In order to be able to perform a comparison with the previous results in the
literature, we use similar topology statistics [2, 27] and information, as well as some
new approaches. We collect information on connectivity or degree distribution, clustering coefficients, average shortest paths, and network density. We use the average
shortest paths and clustering coefficients to characterize small-world networks [43, 4]
and seek to develop a deeper understanding of what makes a small-world network, a
small-world network. In this regard we created several networks composed of Facebook groups with a common keyword in their titles, and compared the parameters
of each to find common characteristics and observe how they change with other parameters such as, but not limited to the number of nodes, graph density, and degree
distribution. We compare our results to the corresponding ones for random networks
or other Facebook studies. We show that an aggregation of Facebook users into
4
groups and of groups into further smaller categories of groups does not change the
basic small world features. Thus Facebook exhibits a scaling invariance of properties.
There are three complementary elements to this thesis: description of the social
networks under consideration, python code, and mathematical network analysis. Social networks provided the network data to be studied, python programming provided
access the the network data and mathematical analysis provided a means for understanding the complexity of the network data. All three parts will be described in
sufficient detail; however the main focus is the mathematical analysis. Although, the
possible avenues for sociological study is intuitively obvious.
The organization of this thesis follows the intuitive progression of the research it
describes. Section 3 lays the foundation for the mathematics needed for analysis.
Section 4 provides an overview of social network analysis with an emphasis on the
Facebook social network It also describes the tools available for social network analysis
and the reason why we chose to use python. Section 5 describes in detail how we
create our networks from the data pulled from Facebook using the Facebook Graph
Application Programming Interface (API) and python. Finally, we present the results
of our analysis with data and visualizations in section 6 and end our thesis with our
conclusions and ideas for future work in section 7. In Appendix A we provide the
extensive python scripts that have been created for this work and in Appendix B we
provide the significant part of the data used for the mathematical analysis.
1.1. Terminology. In this thesis we will use network and graph interchangeably.
The use of the word “group” will refer to Facebook groups composed of Facebook
users under some assumed common context. The Facebook group networks of this
5
thesis are collections of Facebook groups with a common word in its title and can be
created from a single search word or multiple. We will refer to the former as keyword
graphs and the latter as keyword multi-graphs. To be clear, keyword multi-graphs
are not merely the sum of multiple keyword graphs. Each keyword graph has isolated
nodes that are removed in the pruning phase of network creation. These isolated
nodes may not be isolated when combined with the nodes of another keyword graph.
Hence, a keyword multi-graph can be greater than the sum of its keyword graphs of
the multiple search words from which it was created. The nodes or our networks are
Facebook groups and we will use the words, links and edges, interchangeably. Finally,
we define a small world network to be a network that possesses two properties:
(1)
The average shortest path (to be defined later) length scales with the logarithm
of the number of nodes, ln N, where N is the size of the network. [13, 2, 28]
(2)
The network exhibits transitivity (or clustering, to be defined later) higher than
random networks that is bounded away from zero. [13].
Some papers [43, 13, 41] have included sparseness as a requirement for a small world
network. A network is said to be sparse if δ → 0 as n → ∞, where δ is the network
density and n is the number of nodes in the network. Network density is the ratio of
possible links given n, nodes, over the actual number of links, L. Mathematically,
n
2
δ=
L
=
n(n − 1)
.
2L
6
We conjecture that the assumption of sparseness for the Facebook network of this
study is valid. Our work does not include the analysis of sparseness due to both
computational limitations and time constraints. Based on previous studies [43, 13,
41] the restrictions on the average shortest path length and clustering adequately
encompass the networks analyzed in this thesis and provide a good metric for assessing
real world networks composed of a finite number of nodes that exist between complete
order and randomness.
Next we provide an overview of several papers and results that inspired this work.
2. Overview of Relevant Literature and Motivation for Current
Work
Most attribute the idea of small worlds to the social psychologist, Stanley Milgram [25, 26]. Milgram selected two participants, the first from Omaha, NE, and
the other from Wichita, KA. The initial participant was asked to send a letter in an
attempt to get close to the other participant. That is, he was to send the letter to
someone who might have a higher likelihood of getting closer to the other participant.
This recipient would mail the letter in a similar fashion and so on and so on until it
reached its destined recipient in Wichita. Milgram received results 4 days later. The
results pointed Milgram to the notion of small worlds. What he saw was that when
letters were successfully delivered, they did so in under 6 steps. It appeared that two
randomly chosen people with seemingly no causal relation, were in fact closer than
7
expected.† With regard to small worlds, Stanley Milgram believed that social networks are in some sense tightly woven, full of unexpected strands linking individuals
far removed from one another in physical or social space. This is not to say Milgram
believed himself to be four handshakes from having tea with the Prime Minister of
Hungary; rather, he believed that 6 degrees of separation is certainly less than intuition would dictate, but it is not as close as it seems. People are still approximately
6 groups apart and this is actually not so surprising.
Small worlds were first introduced to the scientific community by Watts and Strogatz in their seminal article, “Collective Dynamics of ‘small-world’ networks”. [43]
With their paper, the small-world network model was added to the random network
model of Erdős and Rényi. In their paper, Watts and Strogatz presented a model
that described systems that were highly clustered like regular lattices, yet had small
characteristic path lengths, like random graphs. [43] The Watts-Strogatz model highlighted random links much like the weak ties proposed by Mark S. Granovetter [18].
Members of any social network have strong ties, and weak ties. A strong tie might
be a relationship between family members or best friends. Weak ties are composed
of everything else, from your barber to the student you sit next to in second period.
Furthermore, Granovetter suggested that weak ties are immensely important, perhaps
more important than strong ties.
Neither the small-world model of Watts and Strogatz or the weak ties of Granovetter explained how small-world networks were created or grew. In a study by
Barabasi, Albert and Hawoong, called “Diameter of the World-Wide Web” [3], it was
†
His results were by no means definitive and were in fact contested by peers in his field, among
which was psychologist, Judith Kleinfeld
8
found that the World-Wide Web is in fact a scale-free network; that is, the connectivity has a power-law distribution with a heavy tail. Later studies would find that
several small-world networks have scale-free topologies. The article by Albert and
Barabasi, “Statistical Mechanics of Complex Networks” provides a comprehensive
treatment of networks and includes the parameters of many small-world networks of
a diverse variety. This thesis utilizes the mathematical foundation providing in “Statistical Mechanics of Complex Networks” as well research and writings of by M.E.J.
Newman, to include “Networks” [28] and “Models of a Small-World: A Review” [29].
With a strong theoretical basis, this work sought to apply the analysis to real world
social networks.
Small world networks have become quite mainstream with books such as Watts’
‘Six Degrees” [42], and Barabasi’s “Bursts” [7] and “Linked” [8]. Certainly, the
scientific community has taken an interest as well. The study, “Is the Boston Subway
a Small-World Network?” by Latora and Marchiori [23] made the leap from theory to
experiment. Latora and Marchiori made a complex network where the nodes are train
stations and the edges are stations connected by track. In their own terminology, they
found the Boston subway to be locally and globally efficient, tantamount to being a
small world network. The Indian railway network was also found to exhibit small
world properties by Sen, et all in their paper, “Small-World Properties of the Indian
Railway Network” [34].
This work sought to leverage data from a social network. Of all the social networks
currently in use, Facebook arguably has the best combination of access, popularity
and support. Facebook has amassed over 900 million users. The Facebook Graph
API provides access tempered by privacy restrictions with support for interfacing with
9
almost any programming language. Furthermore, there has already been significant
study into the structure and nature of the Facebook social graph as a whole. [38, 6]
“The Anatomy of the Facebook Social Network Graph” [38] confirms the ‘six degrees
of separation” to be present in the Facebook graph along with high local clustering.
That is, Facebook seems to be a small world network and the degrees of separation
would later be reduced to 4. [6]
This work does not analyze networks on the scale of the entire Facebook graph, but
successfully finds subgraphs of sufficient complexity in our Facebook group networks.
Previous studies on the entire Facebook graph have indicated the presence of the
small world properties. We are interested in seeing if an aggregation of users/groups
preserves these properties. We consider networks of Facebook groups (as opposed to
Facebook users that may belong to one or multiple groups) that share some common
interest. Different common interests could generate differences and/or similarities
between the basic structure of these networks. The actual description of the Facebook
networks constructed in this study will be provided in the next section.
We have by no means listed all the work in the field of complex networks or social
networks analysis. By analyzing a collection of tractable data sets, we hope to add
the conversation by helping to connect theory and reality.
In the next section we define the basic elements of the network and the numerical
characteristics to be used in our analysis.
10
3. Mathematical Background
In this section we provide and overview of the mathematical tools needed to analyze
the networks in this work. We also review the numerical characteristics of ordered,
random, and small world networks.
3.1. The Facebook Group Network and Significant Numerical Measures.
Let us denote by G = {x1 , x2 , . . . , xN } a network with N nodes. Each node, xi , is
assumed to be linked to ki ∈ {0, 1, . . . N − 1} other nodes in the network, called its
inputs. The parameter ki is called the connectivity of node xi . If ki = 0 then the
node is isolated. We will assume a non-directed network, that is if node xj is an
input to node xi , then xi is an input to xj . G can be viewed as a graph with vertices,
x1 , x2 , . . . , xN , and edges (xi , xj ), i, j = 1, 2, . . . N .
Now we describe the actual Facebook network considered in this study.
Definition 1. Let Fw be the set of (public) Facebook groups with the word, w, in the
group’s title. These groups represent the nodes of the network, and consequently each
node xi is basically the set of people belonging to that group. Then we can define the
set of links between the nodes of this network as
Lw = {(xi , xj ) ∈ Fw × Fw | xi ∩ xj 6= ∅}.
Furthermore, two nodes, xi and xj , are said to be adjacent nodes if (xi , xj ) ∈ Lw .
Thus, we construct Gw , a network with nodes Fw , and edges Lw , as defined above.
The Facebook networks under consideration consist of Facebook groups that share
a common word in their title (implying a common interest). Individual groups are
linked by having common members. For example, if w = f ood, then we consider all
11
the groups that have the word f ood in their title. Two possible group titles could be
”Food and wine” and ”All about food”. These would be two nodes in the network.
If they share common members, then there is an edge between them. The main
reason for constructing such networks is to be able to identify possible commonalities
between the structure and properties of networks that are all social, but represent
different types of personalities and interests. We would expect to see some impact
of the type of common interest of the groups, represented by the common word. For
instance, we have found that the w = bieber network is a lot more connected and
clustered than the w = math network, not to mention, much, much larger. Often the
graphs are intuitive. The graph for w = bieber, Fbieber , is large and connected, while
the graph for w = biology, Fbiology , is almost non-existent. However, it is not intuitive
that the nodes in the Fbieber are from all around the world and even though Fbiology is
very small, the fact that is exists at all on Facebook is somewhat surprising. Clearly,
choice of keyword effects the return drastically and the resultant graph is subject to
innumerable factors, both intuitive and otherwise.
It is our goal to analyze the impact of these common interests on the numerical
characteristics of the Facebook group networks, in order to identify what kind of
networks they are, namely ordered, random, or small-world. To this end the most
common numerical characteristics to be analyzed are the degree distribution, average
clustering coefficient and associated path lengths of the network [43, 2, 38, 6, 34].
Definition 2. If k1 , k2 , . . . , kN are the connectivity values of the nodes x1 , x2 , . . . xN
respectively, the connectivity distribution is given by the probability distribution function f (x) = P (ki = x) for any i, where x ∈ {1, 2, . . . N − 1}. The average connectivity
12
is then
(3)
< k >=
N
−1
X
xP (ki = x)
x=1
Observe that we do not take into account isolated nodes to insure that all numerical
characteristics are properly defined.
Definition 3. Given a node, xi , the clustering coefficient of xi , denoted Ci , is a
measure of transitivity. That is, it measures how connected the inputs of a node
are. Ci is given by the number of all possible adjacent node pairs that actually exist
in the network [43]. More precisely, if node xi has ki inputs then there exist at most
1
k (k −1)
2 i i
links between these ki inputs. Ci is defined as the fraction of these allowable
links that actually exist in the network. We can then define the average clustering
coefficient for a network with N nodes as
(4)
C̄ =
N
1 X
Ci .
N i=1
If k̂ = {x ∈ Fw | degree(x) = k}, where k ∈ {1, 2, . . . , N − 1}, then we can define
the average clustering coefficient of nodes with degree k, as
(5)
C̄(k) =
1 X
|k̂|
Cx ,
k = 1, 2, . . . , N − 1.
x∈k̂
Definition 4. Given two nodes, xi and xj , the shortest path connecting them, lij , is
given by the minimal number of links that lead from xi to xj (or viceversa since we
deal with a non-directed network, so that lij = lji ). By construction, our networks
contain no nodes with k = 0 (isolated nodes) which means each network is comprised
of one or more connected subgraphs/subnetworks where every node in each subgraph
13
has k ≥ 1. In the case that a network is composed of multiple connected subgraphs,
the average path length is determined from the average path lengths of each connected
subgraph where the average path length is defined by
(6)
< l >=
X
1
lij .
N (N − 1) i6=j
Small world networks exist in the realm between order and randomness. An ordered
network is basically a ring lattice in which the nodes are placed on a circle and each
node is connected to its k nearest neighbors (k/2 on either side). With this idea in
mind, we review the models for random and small world networks.
3.2. Random Graph Models. A random network is a network in which some specific parameters such as the number of edges or the average connectivity, take fixed
values, but the network is random in other respects. There are several models for
generating random graphs, the most common one being to maintain the number of
nodes and edges constant while randomly assigning the edges [28]. These graphs are
commonly referred as G(N,m) graphs, where N is the number of nodes and m is the
number of edges. Basically a G(N,m) graph is obtained by choosing randomly, according to a uniform probability distribution, one network out of the ensemble of all
possible networks with N nodes and m edges. It is known that the average connectivity for a G(N,m) graph is < k >= 2m/N , but beyond that it becomes increasingly
difficult to derive closed form equations for other network parameters [28].
Arguably the most widely studied type of model for the construction of a random
network is G(N, p) where p is the (fixed) probability of constructing an edge between
any two nodes. These networks have become known as Erdős-Rényi networks[15] due
to the eminent works of the namesakes. The connectivity for G(N,p) networks follows
14
a binomial distribution,
(7)
N −1 k
frand (k) =
p (1 − p)N −1−k ,
k
where p is the given probability that any two nodes are connected, assumed fixed
for all pairs of nodes. This is because
N −1
k
k
p is the probability that a node has edges
to k nodes. However, there must be no edges to the rest of the N − k nodes, which
occurs with probability (1 − p)n−k .
Consequently, the average connectivity is < k >= N p. For large N this leads
to the Poisson distribution as an approximation of the binomial distribution, with
parameter λ = N p =< k >,
(8)
f (k) =
e−λ λk
.
k!
Now, every pair of nodes in a G(N,p) network has the probability, p, of being
connected. If a node has k connections, it has a possible
k
2
of connected “friend”
pairs, also called the first neighbors. The probability that any two of these first
neighbors are connected is then equal to p, so that the clustering coefficient of any
node xi is
(9)
Ci = p.
The average clustering coefficient is defined over the entire network and for a G(N,p)
network, we have
(10)
C̄rand
N
1 X
1
<k>
=
Ci = N p = p =
.
N i=1
N
N
For a random network, each node has < k > connections, on average. Then each
adjacent node also has < k > connections, on average. If we counted every node each
15
time and repeated this process l times, we would have < k >l nodes, and l would be
the path length between two nodes considered in the beginning and the end of this
process. If we equate < k >l and N (since we cannot have more than N nodes), we
can derive the relation for the average shortest path length for a random network.
More precisely, it is known that the average shortest path length for a G(N,p) network
scales as follows
(11)
lrand ∼
ln N
.
ln < k >
Therefore, for a G(N,p) network, as N → ∞, we see that C̄rand vanishes and lrand
scales as ln N . Thus, property (1) of small world networks is fulfilled. However,
despite the fact that there is a relatively small path length between any two nodes,
the random graphs lack the inherent nontrivial clustering of a small world network
described by properties (1) and (2).
We will be computing all these global parameters for a number of Facebook networks, Gw , and we will compare them with the ones presented in this section.
3.3. Small World Network Models. In an effort to capture the transitivity of
actual small world networks and retain average shortest paths characteristic of random
graphs that scale with ln N , the Watts-Strogatz model was developed [43]. More
precisely, starting with a ring lattice, each edge is re-wired with probability p excluding
self-inputs and duplicate edges. When there is no chance of an edge being re-wired,
p = 0, we obtain an ordered/regular network. It exhibits high clustering fulfilling
property (2), but long path lengths, so it is not a small world network. At p = 1,
a Watts-Strogatz network is exactly a random network, and thus has short path
lengths fulfilling property (1), but little to no transitivity, so it is not a small world
16
network. For a significant range of values of 0 < p < 1, the networks created with the
Watts-Strogatz model can fulfill both properties (1) and (2), so they are small world
networks.
Another version of the Watts-Strogatz model is to randomly add ”shortcut” edges
according to the value of p, but not to remove any edges. That is, edges are added
between randomly chosen node pairs just as described above and depicted in Figure
1, but no edges are removed from the original circle [28]. For this variation, since
there are 21 N k edges on the circle, there are 12 N kp added shortcuts.
This means there are kp shortcut edges per node, on average. Let s be the number
of shortcut edges attached to any one node. Then the degree distribution for the
Watts-Strogatz network model is a Poisson distribution with parameter, kp,
(12)
kp (kp)
Ps = e
s!
s
.
Furthermore, by construction, we know that the total degree of a node is equal to
d = s + k. Substituting s = d − k into the previous equation, we have the degree
distribution for the Watts-Strogatz small-world model [28]:
(13)
Pd = ekp
(kp)d−k
(d − k)!
for d ≥ k and Pk = 0 if d < k.
To calculate the clustering coefficient for the Watts-Strogatz small-world model, one
has to calculate the number of triangles within a graph and the number of connected
triples. A triangle is a closed path of length two and a connected triple means three
nodes, u,v, and w, with edges, (u, v) and (v, w). Then, the clustering coefficient is
simply the ratio of three times the number of triangles over the number of connected
17
triples. For the Watts-Strogatz small-world model the triples formed from shortcut
edges must be taken into account and added to the triples from the original lattice.
Therefore the clustering coefficient for the Watts-Strogatz small-world model is
(14)
k/2
×3
2
2
N k p + 12 N k 2 p2
C=
k
2
+
=
( 41 N k( 12 k − 1)) × 3
3(k − 2)
=
1
1
2
2
2
4(k − 1) + 8kp + 4kp2
N k(k − 1) + N k p + 2 N k p
2
where N k 2 p is the number of connected triples formed from an added shortcut edge
and an original edge of the lattice and 21 N k 2 p2 is the number of connected triples
formed from pairs of added shortcut edges [28].
Finally, the average shortest path for the Watts-Strogatz small-world model has
been shown to scale with ln N just like random networks [28].
The behavior of the clustering coefficient and average shortest path as a function
of p is depicted in Figure 2.
Figure 1. The random rewiring procedure of the Watts-Strogatz model, which
interpolates between a regular ring lattice and a random network without altering
the number of nodes or edges [2]. After Watts and Strogatz, 1998 [43].
18
Figure 2. Characteristic path length, L(p), and clustering coefficient, C(p), for
the Watts-Strogatz model. The data are normalized by the values L(0) and C(0)
for a regular lattice. A logarithmic horizontal scale resolves the rapid drop in L(p),
corresponding to the onset of the small-world phenomenon. During this drop C(p)
remains almost constant, indicating that the transition to a small world is almost
undetectable at the local level [2]. After Watts and Strogatz [43].
Some small world networks have been shown to exhibit another interesting property,
along with high clustering and short path lengths. Networks such as the Internet,
collaboration networks, ecological networks, cellular networks, citation networks and
the community of actors exhibit power law degree distributions∗[2].
Definition 5. Scale-free networks obey a power law connectivity distribution (known
by several other names †). That is, the connectivity of a node is determined by a shape
∗
Although some small world networks are also scale free, not all scale free networks are also small
worlds.
†
Zeta distribution, Zipf distribution, and Pareto distribution to name a few.
19
parameter and a scaling factor with
(15)
where ζ(γ) =
f (k) =
PN
1
x=1 xγ
k −γ
ζ(γ)
k = 1, 2, . . . , N
is the truncated Riemann ζ-function, also called the scaling
factor. The shape parameter of the distribution is γ > 0.
Thus each node is assigned k inputs with probability f (k). For scale-free networks
the usual range is 2 < γ ≤ 3 , while 1 < γ < 2 is typical for biological networks
(genes, proteins, metabolism, and ecological networks)[2].
The power law distributions exhibit a scale invariance. That is,
(16)
f (βk) =
(βk)−γ
(k)−γ
= (β)−γ
= (β)−γ f (k) ∝ f (k),
ζ(γ)
ζ(γ)
where β is a constant. So the resulting function is equal to f (k) modulo a multiplicative constant. Because of this property, such networks have become known as
scale-free networks.
The purpose of this work is to identify the impact of common interests on Facebook
group networks’ numerical measures, and identify similarities and differences. We will
generate many Facebook group networks and compute their numerical measures in
comparison to the values described above for random, small world, and scale free
networks. With this in mind, we proceed to our discussion of the social aspects of
the Facebook networks from which we get all our data.
4. More on the Facebook Social Network
A social network consists of a finite set or sets of actors/people/groups and the
relation or relations defined on them [40]. In this thesis, we will call actors nodes,
20
and are referring to the social collective units of Facebook groups. All the nodes are
of the same type (groups) and our networks are therefore referred to as one-mode
networks. We will call relations links and will analyze the structural variables, connectivity, clustering and average shortest path length. These concepts are relatively
intuitive and as previously mentioned, this thesis is not meant to be a treatise on
the social interactions of Facebook groups. Rather, this thesis seeks to reveal the
underlying commonalities of the networks, in a mathematical sense, formed from the
social interactions between Facebook groups. Privacy restrictions aside, analyzing
groups of Facebook users as opposed to analyzing individual users lends itself well to
a social network study. The name of a given group is assumed to be consistent with
the context of the group itself. That is, if a group is named ”People who love cats”, it
is assumed that the group is comprised of people who have some preference for cats,
rather than individuals that collectively hate cats. To use social network analysis
terminology [40], this assumption addresses the validity of the data presented in this
thesis. A person who likes cats will join groups that like cats and are named to reflect
that fact. Contrary circumstances will be statistical minorities. In any case, for the
purposes of this thesis, it is not whether or not a user likes cats that is important,
but the fact that a user is concerned with cats at all. As will become evident in the
remainder of this thesis, complex networks can be created without ever considering
the personal preferences of the users within the groups (nodes). With a firm basis
for our work, we proceed to an overview of the Facebook social network and mention
some of its idiosyncracies.
21
4.1. Network Analysis Tools. As far as social network analysis software is concerned, there are three categories: simulation, mathematical analysis, and visualization. These categories are not mutually exclusive and most applications support two
or three. In any case, other features of the available software is cost (free or pay to
use), installation requirements (web app or downloadable software), and Operating
System (OS) support. As we said in the first paragraph of the introduction, collaboration is key to eventually understanding the simple complexity all around us. With
this in mind, this work sought to use open source, free of cost, no download required
technology that could analyze dynamic data pulled from constantly updated social
networks. We used MATLAB for some of the mathematical analysis, but in the future we hope to replace the MATLAB functionality with python coding. There is a
wide variety of literature on the available software for social network analysis. While
researching for this thesis, we came across myriad applications. But none of them
offered all the features needed. Most notably, none leveraged real world data that is
subject to continuous modifications for analysis. Of the applications that performed
analysis, all require the user to import network data as separate files. This may
sound like a simple endeavor, but an all-encompassing application for the analysis of
the kind of real world data used in this thesis does not exist to our knowledge. We
refer the reader to the reference section for articles on social network analysis software [16, 32, 40]. Imminent applications for analysis include Discrete Dynamics Lab
(DDLab) [44], Pajek [9] and UCINET [11], where all three have visualization capabilities built in. The bulk of the other applications we looked into include Krackplot [22],
Visual Social Networks(Visone) [12], NetVis, Touchgraph, MASON [33], Discrete Visualizer of Dynamics (DVD) [21], NodeXL [36, 19], Netminer [14], STRUCTURE,
22
NEGOPY [30], Applied Graph and Network Analysis (AGNA), FATCAT [31], and
Cytoscape [35].
4.2. Facebook Overview. Facebook is arguably the most influential social network
in the history of mankind [1]. Anyone can create a Facebook profile for free and
input as much information as he/she desires. Users can “friend” other users, join
groups of users, like users, check in at various locations and update his/her own
personal message board either for private or public consumption. These capabilities
have created an intricate network of users, connected by “friendships”. Paired with
its eminence is the accessibility of its data. In 2007, the Facebook API was released
and the flood gates were opened. Data of unparalleled detail and magnitude became
available as never before. With approximately 900 million users and over 125 billion
friend connections, Facebook is available in more than 70 languages† and is larger
than any network before it, permeating every corner of the globe and every walk of
life.
Naturally, massive amounts of data facilitate a quest for knowledge and meaning.
Ever vigilant to remain innovative, the initial Facebook API was succeeded by the
aptly named “Facebook Graph API” that represented every object of Facebook by a
numerical id easily accessible by “folder hierarchy” like programmatic calls. Although
significant, the access to the massive amounts of Facebook data is not completely
unfettered. There are privacy protocols in place to protect users from unwanted
intrusions. Basically, to gain access to objects like user friend lists and user checkins require permission from the target user. Permission is requested by Facebook
†
These
numbers
are
as
of
March
http://www.facebook.com/press/info.php?statistics
2012
and
are
available
at
23
applications (“apps” for short) created by other Facebook users called developers.
Every app has a unique id and leverages the data available from Facebook, including
private data with permission, to accomplish a seemingly endless number of tasks. The
Facebook Graph API is not limited to Facebook proper. Facebook and those who love
Facebook have developed multiple software development kits (SDKs), development
tools and programming language libraries in almost any programming language you
can imagine. These tools provide Facebook Platform support to the myriad web apps
that permeate the world wide web. This work leveraged just such a programming
language,python, which is completely free and open source.
Some have managed to analyze the entire Facebook network [6, 38] ,but in practice, the Facebook network is too large for casual analysis. Furthermore, since the
Facebook network is obviously not random, analyzing a random sample of Facebook
would be useless. The size of Facebook combined with the privacy restrictions pose
a serious hindrance for attaining interesting data that is characteristic of an actual
social network. These hindrances are not overcome without some compromise. In
section 5, we describe a methodology for analyzing coherent, self-contained Facebook
sub-networks of a tractable size.
5. Graph Creation Methodology
Today we have access to unprecedented amounts of data. For the most part, we also
have access to the means for analyzing the data. Online social networks are probably
the most imminent of the available data sources. Social networks are dynamic and
fluid. Because of this, the data collected about them will also be dynamic and fluid.
So, to properly understand the data of today’s ever changing online environment, one
24
must employ methods and means that are similarly dynamic. This is not an easy
task, and to date the burden of progress has been placed on sophisticated algorithms
run on powerful computers, analyzing stagnant and often constricted data. Simply
put, small data sets are prevalent and easy to apply statistical analysis to, but often
do not have the required complexity for interesting behavior. Yet large data sets are
more difficult to obtain and analyze, but often render more significant results. In
the case of large data sets, the nature of the data set only facilitates the analysis
of a “snapshot” of the data. In the spirit of small world networks, there has to be
something in between. Networks small enough to be analyzed dynamically, but large
enough to render significant results. Networks unique enough to not be commonplace,
but in enough abundance to analyze ensemble statistics.
A small network is characterized by a short average path between nodes and nonvanishing clustering. After a certain social psychologist decided to study the connectedness of a small group of Nebraska natives and a Massachusetts stockholder [25, 26],
the phrase, “6 Degrees of Separation” was born into the intellectual realm‡ The existence of “6 degrees of separation” is relatively un-contested. In fact, the logical
implications of a network where any node has k connections is that any node is
within 6 steps of k 6 other nodes. Even for small values of k, that is a large number of
connections. Since small world networks are somewhere in between perfectly random
and perfectly ordered, it is natural to see features of both to be present in a small
‡
The first mention of such a concept should be attributed to the Hungarian novelist, Frigyes
Karinthy. As a novelist, Karinthy rendered no actual study of the topic, only insight beyond his
time.
25
world network. That is, random networks display the small characteristic lengths
between nodes and ordered networks have high clustering. But, neither have both.
5.1. Python Overview. We chose Python for its syntax, availability, versatility,
community and support. Python is a very readable language. That is, the code can
be interpreted in an intuitive manner without the need for extensive experience in
programming. Hopefully, the ease at which one can understand Python can be seen in
appendix A. Python is free to use and available on almost every operating system. It
has user developed modules ‡ to interact with almost any other language with almost
limitless potential and unbounded versatility. Modules that expand the capability
of Python are developed by people that use Python. The Python community is
expansive and if you do not know how to accomplish a task with Python, it is not
difficult to find someone who does. Since Python is open source, this is all free and
available to anyone. We are not alone in our admiration of Python and definitely
not the first to leverage its power for scientific endeavors. Researchers in Los Alamos
have already found Python well suited for molecular dynamics modeling and Python
is still gaining support in the scientific computing community. [10]
This work used Python to collect data, visualize data and analyze data. Data is
collected using user developed code to interface with the Facebook Graph API. After
the data is collected, the NetworkX [5] module is leveraged to create the networks
and calculate the network parameters. Finally, the networks are visualized with the
aid of matplotlib [20]. The modules used in this work are not unique, but they have
‡
A module is a collection of Python code used to augment to capability of the core Python
installation. The modules used in this work are listed in appendix A
26
a diverse feature set that lent itself well to the purposes of this thesis. The code used
to accomplish these tasks are included in appendix A.
5.2. Node Creation. To say two people are separated by k degrees is tantamount to
saying that any two people are separated by k circle of acquaintances or k spheres of
influence or as we have chosen to interpret in this thesis, k groups of Facebook users.
That is, the idea of analyzing groups is an intuitive continuation of the logic behind
small world networks. Groups are collections of people connected via a common
context, a set of facts and circumstances that surround a situation, event or concept.
Naturally, to move beyond a set of isolated groups one must assume that the average
person belongs to multiple groups. This is a feasible as not many people would choose
to define themselves by a single context. To this end, in order to analyze Facebook
groups one must operate under a small number of restrictions. First, the method
for retrieving groups of a common context is limited to a search for groups with a
single word in the group’s title§. For example, using the definitions of section 3, if we
decided to form a network of groups with word in the title, we might have
Fword = {word, Words, I love words, WORD!!!,...}.
Furthermore, one can combine search results of multiple words to create multi-word
networks. To develop multi-word networks as networks defined by a unifying context,
we combined words that could define a community, circle of acquaintances or field
of interest. For example, to create a network from the words, foo and bar, we might
have
Ffoo = {foo, FOOD!,foobar? No Fools, FOOBAR!!!,...}
§
The Facebook API search functionality is not case sensitive
27
and
Fbar = {Bar Hopping, baristas unite, No Holds Barred!!, FOOBAR!!!,foobar?...}
from which we could form
Ffoobar = Ffoo ∪ Fbar
5.3. Edge Creation. The nodes of our networks are Facebook groups. As such, each
node has some number of members. Recall that two such nodes are adjacent or linked
if they contain at least one mutual member. That is, given two nodes, xi , xj ∈ Fw ,
(xi , xj ) ∈ Lw iff xi ∩ xj 6= ∅.
The process for generating Facebook groups networks is depicted pictorially in
diagram 3.
5.4. Pruning. Once all the edges have been added, the networks are pruned to
improve the results of analysis. Namely, there are two criteria for nodes to be removed
from a network: 1. All nodes with degree = 0, also called isolated nodes, are removed.
2. Every node of every subgraph with l = 1 are removed. Removing isolated nodes
and subgraphs with an average shortest path equal to 1 help to reduce the skew of
data that would result if they were not removed. That is, for a network composed
of multiple connected ‡ subgraphs, the average shortest path of the network becomes
the average of the average shortest path for each connected subgraph. Clearly, having
an abundance of isolated nodes would reduce the clustering coefficient of the network
and the presence of subgraphs with L = 1 would reduce the average shortest path of
‡
A graph is said to be connected if there is a path from every node in the graph to every other
node, where a path is any sequence of nodes such that every consecutive pair of nodes in the sequence
is connected by an edge in the graph.
28
Figure 3. The process of creating a network out of Facebook groups of various size,
where two nodes (groups) are connected if they have at least one mutual member.
For example, the red group and yellow group have two mutual members and are
therefore connected in the lower network depiction. By construction, isolated nodes
are removed from the network. The purple group has three members that are not
in any of the other groups and is therefore removed.
the network. In both cases, the effects of not removing such nodes would cause the
data to exhibit features not characteristic of the phenomenon we wish to quantify. In
essence, the pruning process enabled us to analyze the largest, connected, subgraph
of each keyword graph.
We present the single words and word groups that we used to generate networks in
Appendix B. The reason for combining multiple keywords is to create larger networks
with a unifying context for statistical purposes. In fact, we will be combining multiple
keywords in an attempt to understand network growth and scaling. Not every word
renders interesting results, but in general every network larger than approximately 40
29
nodes, exhibited the same general degree distribution shapes, low average path lengths
(≈ 2) and high clustering (on average about three times the expected clustering for a
random graph with the same number of nodes and edges). The networks were created
using python and the code can be found in Appendix A.
6. Analysis and Visualization
This section contains the main results obtained by analyzing the Facebook group
networks. We divide the section in two subsections that focus on network visualizations and numerical characteristics.
6.1. Network Visualization. We start by presenting several networks created according to the rules of Section 5 and listed in Appendix B. In Figures 4-12 we visualize
several of those networks. In each figure we generate the nodes of the network and
the links between them. The visualizations are rendered via an implementation of
the Fructerman-Reingold force-directed algorithm [17], which tends to create highly
clustered “cores”. We also include in the title the key words used to generate the
networks and the numerical characteristics: the number of nodes N , the number of
links L, the average connectivity < k >, the average clustering coefficient C, and the
average path length l. Observe the high density of the links in the networks anime,
bieber, and muslim networks for which we provide also a “zoom-in” view, as opposed
to the very sparse army and navy networks. We note that anime, bieber, and muslim
networks exhibit small world network features.
Visualization is an important aspect to consider when analyzing networks. For
example, the navy network of figure 11 may seem only slightly different from the
army network depicted in figure 10. Yet, when visualized using an implementation
30
Figure 4. Facebook groups network created from groups with the word, anime,
in the title. The numerical characteristics are: N = 276, L = 9770, C̄ = 0.708, l =
1.837, < k >= 70.797, indicating short path lengths and a clustering coefficient
larger than the corresponding one for a random network with similar topological
aspects.
Figure 5. An enlarged depiction of the “core” of figure 4.
31
Figure 6. Plot of bieber network. The numerical characteristics are: N = 117, L =
1714, C̄ = 0.646, l = 1.903, < k >= 29.299, indicating short path lengths and a
clustering coefficient larger than the corresponding one for a random network with
similar topological aspects.
Figure 7. An enlarged depiction of the “core” of figure 6.
32
Figure 8. Plot of muslim network.
The numerical characteristics are: N =
233, L = 4326, C̄ = 0.564, l = 2.121, < k >= 37.133, indicating short path lengths
and a clustering coefficient larger than the corresponding one for a random network
with similar topological aspects.
Figure 9. An enlarged depiction of the “core” of figure 8.
of the Fructerman-Reingold force-directed algorithm [17](figures 12 - 13), it becomes
evident that while the army network is a single, moderately clustered network, the
navy network is composed of two isolated sub graphs. The two main features of the
33
Fructerman-Reingold force directed algorithm are 1) Vertices connected by an edge
should be drawn near each other. 2) Vertices should not be drawn too close to each
other. In general, how close vertices should be placed depends on how many there are
and how much space is available. All vertices repel each other but connected vertices
attract. This leads to vertices with low connectivity being overcome by repelling
forces and a core of highly connected vertices to form near the center of the graph.
In this fashion, the model is driven, but not entirely constrained, by the physical
analogy of a collection of charged particles connected by springs.
Figure 10. Plot of army network where the nodes are randomly distributed on a
ring. The numerical characteristics are: N = 48, L = 189, C̄ = 0.434, < k >= 7.875
and l = 2.32. It is difficult to gain much knowledge from a graph of this size, but it
is useful when analyzing the behaviors of network parameters as N is increased.
Now, observe that the individual group networks presented in Figures 4-12 are
fairly small. To make sure that our networks are large enough for a more accurate
analysis, we have combined individual key words into categories, and then created
larger networks from these categories. The categories that will be used in graphical
34
Figure 11. Plot of navy network where the nodes are randomly distributed on a
ring. The numerical characteristics are: N = 20, L = 31, C̄ = 0.504, < k >= 3.1,
and l = 1.83. This is another example of small network. It serves as a good
conceptual comparison to figure 10, due to its generating keyword.
Figure 12. Plot of army network using the Fructerman-Reingold force-directed
algorithm.
representations in this work are presented in Table 1 and Appendix B. We will focus
on these larger networks in what follows.
Figures 14 and 15 show what a larger network actually looks like and the complexity that manifests by doubling the number of nodes and adding more related groups.
35
Figure 13. Plot of navy network using the Fructerman-Reingold force-directed
algorithm.
Other multi-word graphs display comparable levels of complexity, but it should be
noted that the 3 cores evident in figures 14 and 15 are characteristic of the Religion multi-word graph force-directed rendering and may or may not be seen in other
graphs. The multi-word graph, Sports is presented in Figures 16 17 for comparison.
Although the cores look slightly different in the Religion versus Sports visualizations,
it is the mathematical analysis that will shed more light upon the similarities and
differences between them. However, such visualizations allow us to have a pictorial
view of the networks under consideration as a starting point for our analysis. We
also notice the impact of increasing the network size on the level of complexity of the
networks.
Now that we have a pictorial view of some of the networks under consideration,
we focus on the numerical characteristics and identify some mathematical similarities
and differences between the categories. In table 3 we present a complete list of the
numerical characteristics for each of the networks considered in this analysis. We also
36
Table 1. Ten categories of key words used in the mathematical analysis of
Facebook group networks.
Category
Key words
Religion
islam, muslim, bible, atheist, catholic, christian, god,
allah, mormon, pagan, pope
Politics
republican, romney, vote, government, politic, election,
democrat, obama, senate
Hobby
art, gun, fish, collect, craft, comic, cook, eat, hunt, stamp
Sexes
boy, man, gay, girl, lesbian, male, woman
Anime
naruto, bleach, dragonball, DBZ, OnePiece, Gundam,
anime, saiyan, sasuke
Minorities
africa, white, black, american, china, asian, hispanic,
latin, greek, KKK, nazi
Science
physics, fractal, chaos, nuclear, math, science
Music
bieber, gaga, beyonce, spears, kesha, kardashian, MTV
Sports
basketball, soccer, football, baseball, golf,
tennis, bowling, swim
Computer
ipad, ipod, android, PC, mac, computer
include the analog values for random networks for comparison. We observe that the
studied networks exhibit average clustering coefficients 3-10 times larger than those
for random networks, while maintaining relatively small values of the average path
length. Thus we claim that both small world requirements (1) and (2) are satisfied
37
for Facebook group networks. We provide details in what follows, and refer to power
law distributions as well.
6.2. Numerical Characteristics. First let us look at the degree distribution for the
networks of Table 1, indicated accordingly in the titles of Figure 18. For each network
we generate the distribution of the connectivity values k and we present the results
on log-log plots. Observe that for smaller networks the plots are not as structured.
However, for larger networks it is apparent that the distribution is decreasing with
increased connectivity and has a rather mild curvature, which means that it appears
to be mostly linear. This is typical for power law distributions. We do not observe a
pronounced peak typical for random networks. As a matter of fact, we have used the
generalized Pareto fit tool of MATLAB [24], to fit a power law distribution to several
degree distributions. One sample is shown in Figure 19 for the Religion network,
which is the largest in this analysis. The fit is not performed on log-log scale as seen
from the graph. The parameters are γ, the power, and ζ, which is the truncated
Riemann zeta-function as described in formula (15). It is basically the scaling factor
necessary for normalizing the probability distribution function of the connectivity
levels. Observe that the best fit for all the data points yields 1 < γ < 2, but very
close to 1, and that the actual data tail seems to be heavier than the estimated
distribution. Thus, if we were to fit only the tail we would obtain γ values closer to 2,
which would be closer to what was found in the literature for say, various biological
networks. In Table 2 we present the fitted parameters for the largest networks. All
of them yield similar results.
38
We use a t-distribution for small samples, to find a confidence interval for the
average of the shape parameter, γ. We recall that if γ̄ and s are the values of
the mean and the standard deviation of a random sample of size n from a normal
distribution (assumed here for the shape parameter), then
s
s
γ̄ − tα/2,n−1 · √ , γ̄ + tα/2,n−1 · √
n
n
is a (1 − α)100% confidence interval for the mean of the underlying distribution of
the sample. We find a 95% confidence interval for the average γ to be (0.934, 1.499).
Barabási and Albert [2] collected the parameter values for several networks with
power law degree distributions. It is interesting to note that the networks with similar
values for γ (≈ 1) are the Ythan estuary and Silwood park networks, both of which
are un-directed ecological networks. On the other hand other internet networks have
been shown to exhibit a power law scaling factor 2 < γ < 3 [2]. More recently, a
study of the the social graph of active Facebook users does not yield a strict power
law distribution for the degree distribution [38]. The same phenomenon is observed
here for group networks, so a strict power law fit may not be the most appropriate
approach to Facebook degree distributions. Thus an aggregation of individual users
into groups renders a similar behavior. This is typical of scale invariance as discussed
in Equation (16).
At the same time we provide a boxplot representation of the values of the scaled
connectivity values k/N for each of the ten networks of Table 1 for an easier comparison of the connectivity with respect to the network size. This is shown in Figure 20,
including the median, the first and third quartiles and the outliers. Observe that only
the Sports network exhibits a significant departure from the overall trend of small
39
Figure 14. Plot of the Religion network. The individual key words used are: islam,
muslim, bible, atheist, catholic, christian, god, allah, mormon, pagan, pope. The
numerical characteristics are N = 1558, L = 39547, C̄ = 0.543, l = 2.723, < k >=
50.766.
Figure 15. An enlarged depiction of the “core” of figure 14. Observe the three
smaller cores and notice the density of links.
40
Figure 16. Plot of the Sports network. The individual key words used are: basketball, soccer, football, baseball, golf, tennis, bowling and swim. The numerical
characteristics are N = 342, L = 10356, C̄ = 0.715, l = 2.403, < k >= 60.561.
Figure 17. An enlarged depiction of the “core” of figure 14.
41
Figure 18. The distribution of the connectivity values in log-log plots for the ten
networks of Table 1, in decreasing order of network size. Each graph has the same
general shape. As expected, the networks with more nodes render a log-log plot as
the plot appears to degrade at N ≈ 350. Although the graphs appear to be mostly
linear, the distributions cannot be characterized entirely by a power law decay. In
general, the plots adhere to a power law for small values of k but then seem to
decrease more quickly than a power law would predict for large values of k. This is
seen with more precision in figure 19 where we attempt to fit a Pareto distribution
to a degree distribution.
median, indicating small connectivity for most nodes. However, this is one of the
smallest networks considered, and thus we need further analysis with larger networks
under this category.
Finally, in reference to the degree distribution, in Figure 21 we plot the average
connectivity, Y = Output = hki as a function of T = Target = N on a log-log
42
Figure 19. Pareto fit for the degree distribution. The estimated shape parameter
γ is close to 1. However, the tail of the graph is actually heavier than the estimated
fit.
Table 2. Parameters of Pareto distribution, p(x) = ξ(γ)x−γ , fitted to several multiple keyword networks.
Keyword Category
N
ξ
γ
Anime
839
0.4889 0.9354
Sexes
853
0.2533 1.3875
Hobby
929
0.1922 1.4237
Politics
1117 0.3025 1.3258
Religion
1558 0.3633 1.0082
plot. The plot clearly exhibits a positive correlation and linear regression analysis
yields log(< k >) = 0.6 log(N ) + 0.19, thus the average connectivity increases as a
power of the network size, and slower than N which is typical for random graphs.
The correlation coefficient is R 0.78 which indicates a moderate linear relationship
43
Figure 20. Boxplot for the k/N values associated with the networks of Table 1.
between N and < k >. We recall that
sT Y
R= √
sT T sY Y
where
sT T =
n
X
i=1
2
(Ti − T̄ ) ,
sY Y =
n
X
i=1
2
(Yi − Ȳ ) ,
sT Y
n
X
=
(Ti − T̄ )(Yi − Ȳ )
i=1
where Ti are the target sample points (inputs), Yi are the corresponding outputs, and
T̄ , Ȳ are the associated sample means.
We conclude here that the degree distribution of group networks exhibits a monotonic decrease at a fairly steep rate with increased connectivity, partially fitted by
a straight line which corresponds to a power law distribution. Most groups have a
rather small connectivity in comparison to the size of the network. For smaller networks, like Music, Sports, and Computer, these results are not very accurate, so more
analysis with an increased number of keywords in each category will be needed in the
future, paired with a more in-depth search for a best fit of the degree distribution.
44
Figure 21. Log-log plot of hki versus N for all the networks considered in this
analysis, including single and multiple keywords. The fitted line shows hki ∝ N 0.6 .
Now, let us look at the clustering coefficient defined in formula (3). We compute
C̄(k) for all possible connectivity values k in each network, and plot them against k
on a log-log scale in Figure 22. What we found is similar to other small world network
studies [34, 38] where each plot exhibits an approximately logarithmic decay for large
values of k.
Again we provide a boxplot representation of the values of C(k) for each of the
ten networks of Table 1 for an easier comparison of the average connectivity values
for different degrees. This is shown in Figure 23. We note that overall the values
are larger than the corresponding average clustering coefficients of random networks,
thus suggesting the possibility of small world networks. In fact, the average clustering
coefficients for the networks listed in table 1 are on average, 14 times larger than the
average clustering coefficients of random networks.
45
Figure 22. The average clustering coefficient, C̄(k), against the degree, k, on a
log-log scale for the ten networks of Table 1, in decreasing order of network size.
In general, each plot indicates an approximate logarithmic decay for large k. These
results are similar to other studies of small world networks [34, 38]. Observe that the
largest network, Religion provides the most orderly plot, which is to be expected.
Finally, in reference to the clustering coefficients, we present a plot of the average
clustering coefficient C̄ versus the network size N in Figure 24 (top graph, blue
dots) for all the networks considered in this study, not only the ten large networks
constructed by aggregating smaller networks corresponding to the various categories.
Most of the networks considered in this study have less than 400 nodes. We plot
46
Figure 23. Boxplot of the average clustering coefficients for individual connectivity
values C(k) values associated with the networks of Table 1.
on the same graph the corresponding values C̄rand for clustering coefficients of a
random network with the same number of nodes and edges (red circles). We can
notice immediately that the Facebook values are clearly higher than those for random
networks, fulfilling requirement (1) of small world networks. Not only that, but by
focusing only on the Facebook data for C̄ in Figure 25 we note an interesting result as
Indian railway network has exhibited C̄ values that remain relatively constant as the
network size increases [34]. Figure 25 has too few data points to definitively decide on
the behavior of C̄ for large networks; however, to some extent it appears to steady out
around C̄ 0.5 for larger values of N. This suggests that we need more sample networks
with a large number of nodes, which will be subject for future research. At the same
time we will need a systematic procedure for increasing the network size. For example
adding keywords one by one for a slower increase of the network, besides expanding
the selection of keywords and categories based on a dictionary search. These will be
subject for future research.
47
Figure 24. Average clustering coefficient and average path length as functions of
N. The blue dots indicate the Facebook data and the red circles indicate the corresponding values for a random network as listed in Table 1. There is a clear difference
for the clustering coefficients; however it is hard to notice essential differences for
the path lengths.
Now, regarding the average path length, in Figure 24 (bottom graph) we plot the
average path length for both the Facebook networks of this study (blue dots) and
the corresponding random networks (red circles). The plots are clustered around
small values and the average path length does not seem to exhibit an increase with
increased network size. Notice the similarity between the Facebook and random
network plots which suggests that requirement (2) of small world networks is also
satisfied. However, using the information of Table 3 we obtain a boxplot of l, as
well as the corresponding distribution in Figure 26 for both Facebook and random
networks. Notice that although the median path length for Facebook groups is higher
48
Figure 25. Average clustering coefficient as a function of N. The red dotted line
depicts the average value and possible steady state value for large N.
than that of random networks, the variation is smaller, so that overall, the values for
Facebook groups are slightly smaller than those for random networks. In general the
average path length is mostly less than 3, which comes to supplement previous findings
that the entire Facebook network of active users exhibits an average path length of
about 4 [6], as opposed to the well known six degrees of separation paradigm. Thus
the Facebook group networks are indeed examples of small world networks. In order
to offer more precise statements regarding the distribution of average path length as
shown in Figure 26 (right) we will have to generate a larger amount of data in the
future. However, notice that despite the differences in the two distributions, they
both have more or less the same behavior, which emphasizes requirement (2).
We conclude that the Facebook group networks exhibit features of ordered networks and random networks. That is, they have large clustering coefficients like the
49
Figure 26. Average path length boxplot and distribution for both the Facebook
group networks and random networks with the same number of nodes and links.
ring lattice networks described by Watts and Strogatz [43] that do not vanish with increased networks, and small average path lengths characteristic of random graphs [2].
Thus we conclude that the networks under consideration in this study exhibit small
world features. At the same time, the average connectivity increases as a power of
the network size with approximation, while the average clustering coefficients and
average path lengths do not exhibit a clear scaling with N .
In the future we plan to expand this analysis to more and larger networks, and to
supplement it with a more in-depth study of the numerical characteristics presented
in this work and other suitable measures.
7. Conclusions and Future Work
In this work we provided a study of the structure of Facebook group networks. We
generated a number of networks using keywords for group selection, and we linked
50
groups with common members. We focused on a number of measures that can describe the structure of these networks. Most eminent among the features of the
networks analyzed were our results concerning the degree distributions, and the behavior of the clustering coefficients and average shortest paths. Our networks have
degree distributions that cannot be entirely characterized by a power law, clustering
coefficients that are significantly larger than what would be expected for random networks and consistently small values for the average shortest paths, characteristic of
random graphs. That is to say, our analysis has shown that Facebook group networks
are small world networks. A unique element of this study is that we did more than
analyze only one networks or a hand full of networks, we analyzed an ensemble of
real world networks and were able to find relationships that otherwise could not have
been found. Figures 21 and 25 attest to this fact and suggest an interesting path to
take in future studies.
Facebook group networks provide a nearly limitless source of data to analyze, both
individually and collectively. If one was to study a larger set of networks, perhaps
it would be more interesting to see how the networks differ as opposed to how they
are similar. And if they are different, is there a finite set of network classes into
which any given network can be placed based on its parameters? Furthermore, this
analysis can lend itself to other fields as well. Most notably sociology. Although our
methodology for network construction is efficient, it is not the only possibility. We
took care to choose keywords that would most likely render interesting and mostly
unambiguous results using a programmatic approach, but one could create networks
using a more manual approach. Biased node selection could feasibly render networks
that more realistically portray the collection of nodes and relations of which it is
51
meant to model. Sociological credence aside, it is quite surprising to see so many
networks created of such consistent complexity and consistency. Our future work will
expand this analysis to include more networks with larger sizes, in order to be able to
generate a more complete statistical analysis of the Facebook group networks. Future
studies should include other measures of interest such as efficiency, correlations, or
spectral properties, as well as more sophisticated statistical analyses.
Moreover, it would be important to analyze the dynamics of the networks over
time, since Facebook groups could be added or removed from the networks based on
the natural changes that occur, and links readjusted. This would imply observing
Facebook over a significant amount of time to have sufficient data for statistical
purposes.
At the same time, it would be of interest to generate a dynamical system model of
the social networks described here. More precisely, one can define some meaningful
states of the nodes and associated rules that could help understand the complexity of
this kind of large scale network. For instance, by focusing on the links in the networks,
which are created based on individuals that are common to multiple groups, one could
put a weight on the influence of one group over another group (the target) in terms
of Facebook messages that are posted in the target group by the common members
during a unit of time. By defining a suitable threshold function, one could label a
node as active or inactive based on the aggregated influences of the individual input
nodes through the common members. This way one could generate a Boolean network
whose dynamics could be studied to provide some insight on the impact of multiple
memberships of individuals on the activity of groups.
52
References
1. Paul Adams, Grouped, New Riders, 2012.
2. Réka Albert and Albert-László Barabási, Statistical mechanics of complex networks, Rev. Mod.
Phys. 74 (2002), 47–97.
3. Réka Albert, Hawoong Jeong, and Albert-László Barabási, Diameter of the world-wide web,
Nature 401 (1999), 130–131.
4. L. A. N. Amaral, A. Scala, M. Barthélémy, and H. E. Stanley, Classes of small-world networks,
Proceedings of the National Academy of Sciences 97 (2000), no. 21, 11149–11152.
5. Daniel A. Schult Aric A. Hagberg and Pieter J. Swart, Exploring network structure,dynamics,
and function using networkx, Aug 2008, pp. 11–15.
6. Lars Backstrom, Paolo Boldi, Marco Rosa, Johan Ugander, and Sebastiano Vigna, Four degrees
of separation, CoRR abs/1111.4570 (2011).
7. Barabasi and Albert-Laszlo, Bursts: The hidden pattern behind everything we do, Dutton Adult,
2010.
8. Albert-Laszlo Barabasi, Linked: How Everything Is Connected to Everything Else and What It
Means, reissue ed., Plume, April 2003.
9. V. Batagelj and A. Mrvar, PAJEK – Program for large network analysis, 1998.
10. David M. Beazley and Peter S. Lomdahl, Feeding a large-scale physics application to python, In
6th International Python Conference, 1997, pp. 21–28.
11. S. P. Borgatti, M. G. Everett, and L. C. Freeman, UCINET 6 For Windows: Software for Social
Network Analysis, 2002.
12. Ulrik Brandes and Dorothea Wagner, Visone – analysis and visualization of social networks,
GRAPH DRAWING SOFTWARE, Springer-Verlag, 2003, pp. 321–340.
13. Rama Cont and Emily Tanimura, Small world graphs: characterization and alternative constructions, Adv. Appl. Prob 40 (2007), 939–965.
14. Cyram, NetMiner Webpage, 2002.
53
15. Paul Erdős and Alfred Rényi, On random graphs, Publicationes Mathematicae Debrecen 6
(1959), 290–297.
16. Linton C. Freeman, Computer programs and social network analysis, Connections 11 (1988),
26–31.
17. T.M.J. Fruchterman and E.M. Reingold, Graph drawing by force-directed placement, Software,
Practice and Experience 21 (1991), 1129–1164.
18. Mark Gnanovetter, The strength of weak ties, American Journal of Sociology 78 (1973), 1360–
1380.
19. Derek Hansen, Ben Shneiderman, and Marc A. Smith, Analyzing social media networks with
nodexl: Insights from a connected world, Morgan Kaufmann Publishers Inc., San Francisco, CA,
USA, 2010.
20. John D. Hunter, Matplotlib: A 2d graphics environment, Computing In Science & Engineering
9 (2007), no. 3, 90–95.
21. Abdul Salam Jarrah Hussein Vastani and Reinhard Laubenbacher, Visualization fo dynamics
for biological networks.
22. D. Krackhardt, J. Blythe, and C. Mcgrath, KrackPlot 3.0: An improved network drawing program, Connections 17 (1994), no. 2, 53–55.
23. Vito Latora and Massimo Marchiori, Is the boston subway a small-world network?, Physica A:
Statistical Mechanics and its Applications 314 (2002), no. 1–4, 109 – 113, Horizons in Complex
Systems.
24. MATLAB, version 7.10.0 (r2010a), The MathWorks Inc., Natick, Massachusetts, 2010.
25. Stanley Milgram, The small world problem, Psychology Today 2 (1967), 60–67.
26. Stanley Milgram and Jeffrey Travers, An experimental study of the small world problem, Sociometry 32 (1969), 425–443.
27. M. E. J. Newman, The mathematics of networks, The New Palgrave Encyclopedia of Economics
2 (2007), no. 5, 1–12.
28. Mark Newman, Networks: An introduction, Oxford University Press, Inc., New York, NY, USA,
2010.
54
29. M.E.J. Newman, Models of the small world: A review, J. Stat. Phys. 101 (2000), 819–841.
30. William D. Richards and Ronald E. Rice, The negopy network analysis program, Social Networks
3 (1981), 215–223.
31. William D. Richards and Andrew J. Seary, Fatcat – for thick data.
32. John Scott, A toolkit for social network analysis, Acta Sociologica 39 (1996), no. 2, 211–216.
33. Liviu Panait Keith Sullivan Sean Luke, Claudio Cioffi-revilla and Gabriel Balan, Mason: A
multi-agent simulation environment.
34. Parongama Sen, Subinay Dasgupta, Arnab Chatterjee, P. A. Sreeram, G. Mukherjee, and S. S.
Manna, Small-world properties of the indian railway network, Phys. Rev. E 67 (2003), 036106.
35. Paul Shannon, Andrew Markiel, Owen Ozier, Nitin S. Baliga, Jonathan T. Wang, Daniel Ramage, Nada Amin, Benno Schwikowski, and Trey Ideker, Cytoscape: A Software Environment for
Integrated Models of Biomolecular Interaction Networks, Genome Research 13 (2003), no. 11,
2498–2504.
36. Marc A. Smith, Ben Shneiderman, Natasa Milic-Frayling, Eduarda Mendes Rodrigues, Vladimir
Barash, Cody Dunne, Tony Capone, Adam Perer, and Eric Gleave, Analyzing (social media)
networks with nodexl, Proceedings of the fourth international conference on Communities and
technologies, 2009, pp. 255–264.
37. J. Heidel T. Helikar, J. Konvalina and J. A. Rogers, Emergent decision-making in biological
signal transduction networks, PNAS 105 (2008), no. 6, 1913–1918.
38. Johan Ugander, Brian Karrer, Lars Backstrom, and Cameron Marlow, The anatomy of the
facebook social graph, CoRR abs/1111.4503 (2011).
39. Guido van Rossum et al., The Python programming language, 1991–.
40. Stanley Wasserman and Katherine Faust, Social network analysis: Methods and applications, 1
ed., Structural analysis in the social sciences, no. 8, Cambridge University Press, 1994.
41. Duncan J. Watts, Networks, Dynamics, and the Small-World Phenomenon, The American Journal of Sociology 105 (1999), no. 2, 493–527.
42. Duncan J. Watts, Six degrees: The science of a connected age, W. W. Norton & Company, 2003.
55
43. Duncan J. Watts and Steven H. Strogatz, Collective dynamics of small-world networks, Nature
393 (1998), no. 6684, 440–442.
44. Andy Wuensche, Exploring discrete dynamics: The ddlab manual, 2011.
56
Appendix A. Python Code
A.1. Module Import. The following code imports the necessary python modules [39,
20, 5]. The Facebook module is open source and is available at
https://github.com/pythonforfacebook/facebook-sdk.
import json
import urllib
import urllib2
import Facebook
import networkx as nx
import matplotlib.pyplot as plt
import decimal as dec
import itertools as iter
import os
import sys
import time
import string
import math
A.2. Graph Generation. generate_graph() queries the Facebook Graph API to
obtain the group ids for the search parameter, *searchwords. Once all the nodes
have been retrieved using get_group_ids(), generate_graph() adds the edges using add_members(), removes isolated nodes, removes connected sub graphs with
l = 1, computes the parameters of the network and returns the generated network.
57
dataLog() and errorLog() are generic functions for recording progress and troubleshooting errors, respectively.
def dataLog(info):
if not os.path.isdir(’data’):
os.mkdir(’data’)
filename = os.path.join(’data’,’nL_dataLog.txt’)
log = open(filename,’a’)
t = time.strftime("%d%b%y %H:%M:%S", time.localtime())
log.write("(%s) %s\n"%(t,info))
log.close()
print(info)
def errorLog(info):
if not os.path.isdir(’data’):
os.mkdir(’data’)
filename = os.path.join(’data’,’nL_errorLog.txt’)
log = open(filename,’a’)
t = time.strftime("%d%b%y %H:%M:%S", time.localtime())
log.write("(%s) %s\n"%(t,info))
log.close()
print(info)
#Name:
get_group_ids
def get_group_ids(searchword):
ids = []
58
LIMIT = 100
i=0
while True:
results = graph.request(’search’,{’q’:searchword,’type’:’group’,
’limit’:LIMIT,’offset’:LIMIT*i})
if not results[’data’]:
break
group_ids = [group[’id’] for group in results[’data’]]
if len(group_ids) == 0:
break
ids += group_ids
i += 1
save_data(ids,"ids_%(searchword)s.txt"%vars())
return ids
#Name:
add_members
def add_members(G):
try:
title = G.graph[’title’]
except:
title = "<No Title>"
count=1
N = nx.number_of_nodes(G)
nodes = chunks(nx.nodes(G),100)
59
for item in nodes:
groups = graph.get_objects(item,metadata=1)
for g in groups:
member_ids = []
group = groups[g]
added = False
broken
= False
while not added:
try:
conn = urllib2.urlopen(group[’metadata’]
[’connections’]
[’members’])
members = json.loads(conn.read())[’data’]
for member in members:
member_ids.append(member["id"])
G.node[group[’id’]][’members’]= member_ids
added = True
except urllib2.URLError as e:
errorLog("URLError with %s node %s (%i of %i)"/
%(title,str(group[’id’]),count,N))
pass
except KeyError:
errorLog("KeyError with %s node %s (%i of %i)"/
%(title,str(group[’id’]),count,N))
60
broken = True
added = True
except:
errorLog("Error with %s node %s (%i of %i)"/
%(title,str(group[’id’]),count,N))
broken = True
pass
finally:
conn.close()
if added and not broken:
dataLog("(%i of %i) Added %s node %s"/
%(count,N,title,str(group[’id’])))
count += 1
dataLog("All " + title + " nodes added.")
#Name:
generate_graph
def generate_graph(*searchwords):
app_ID = <secret>
app_SECRET = <secret>
key = facebook.get_app_access_token(app_ID,app_SECRET)
keywords = []
for word in searchwords:
if isinstance(word,str):
61
keywords.append(word)
if isinstance(word,list):
for i in word:
keywords.append(i)
save = True
broken = False
connected = True
empty_graph = False
G=nx.Graph()
title = string.join(keywords,"+")
G.graph[’title’] = title
#Add nodes to graph, G.
Each node has ’members’ attribute.
keyword_group_ids = []
for word in keywords:
keyword_group_ids.append(get_group_ids(word))
keyword_group_ids = nx.utils.flatten(keyword_group_ids)#flatten list
keyword_group_ids = list(set(keyword_group_ids))#remove duplicate nodes
if keyword_group_ids:
G.add_nodes_from(keyword_group_ids)
add_members(G)
N = nx.number_of_nodes(G)
62
else:
empty_graph = True
#Add edges to graph, G.
Each edge has a ’weight’ attribute.
if not empty_graph:
count = 0
perm = iter.combinations(G.nodes(),2)
edges = (nx.number_of_nodes(G))*(nx.number_of_nodes(G) - 1)/2
try:
for edge in perm:
added = False
while not added:
try:
adj = [i for i in G.node[edge[0]][’members’] /
if i in G.node[edge[1]][’members’]]
w = len(adj)
if w > 0:
G.add_edge(edge[0],edge[1],weight=w)
dataLog("(%i of %i) Added %s edge (%s,%s)"/
%(count,edges,title,edge[0],edge[1]))
added = True
else:
connected = False
added = True
63
except:
errorLog("Error adding %s edge (%s,%s)"/
%(title,edge[0],edge[1]))
pass
finally:
count+=1
except:
errorLog("Error adding edges to %(title)s graph"%vars())
finally:
dataLog("All %(title)s edges added."%vars())
#Assign isolatedNodes parameter as graph attribute
isolatedNodes = len(nx.isolates(G))
G.graph[’isolatedNodes’] = isolatedNodes
#Remove isolated nodes from graph
for i in nx.isolates(G):
G.remove_node(i)
#Remove connected subgraphs with average shortest path length = 1
#Calculate average shortest path length, assign to graph as attribute
if connected and not empty_graph:
G.graph[’averagePathLength’] = nx.average_shortest_path_length(G)
else:
64
if not empty_graph:
subgraphs = len(nx.connected_component_subgraphs(G))
pathLengths = []
for g in nx.connected_component_subgraphs(G):
if nx.average_shortest_path_length(g) == 1:
G.remove_nodes_from(g)
subgraphs -= 1
else:
pathLength = dec.Decimal(nx.average_shortest_path_length(g))
pathLengths.append(pathLength)
try:
averagePathLength = sum(pathLengths) / subgraphs
G.graph[’averagePathLength’] = averagePathLength
except ZeroDivisionError as e:
empty_graph = True
except:
errorLog(/
"Error when calculating average path length for /
%(title)s graph."%vars())
#Calculate <k>, average clustering coefficient and density.
if not empty_graph:
N = nx.number_of_nodes(G)
nodeDegrees = []
65
try:
for i in nx.nodes(G):
nodeDegrees.append(nx.degree(G,i))
k = dec.Decimal(sum(nodeDegrees)) / N
G.graph[’k’] = k
G.graph[’C’] = nx.average_clustering(G)
except:
errorLog("Error when calculating C and <k> for %(title)s graph."/
%vars())
broken = True
try:
G.graph[’density’] = nx.density(G)
except:
errorLog("Error when calculating density for %(title)s graph."/
%vars())
broken = True
#If not empty or broken, save graph data.
if not broken and not empty_graph:
return G
if empty_graph:
return emptyGraph(name=title)
errorLog("%(title)s has no connected groups /
(%(isolatedNodes)i isolated groups)."%vars())
66
if broken:
return emptyGraph(name="ERROR")
errorLog("Error during %(title)s graph."%vars())
#Name:
save_data
def save_data(data,filename,path=’data/ids’,method=’w’):
if not os.path.isdir(path):
os.mkdir(path)
filepath = os.path.join(path,"%(filename)s.txt"%vars())
file = open(filepath,method)
for i in data:
file.write("%s\n"%(str(i)))
file.close()
def save_data_to_CSV(info):
if not os.path.isdir(’data’):
os.mkdir(’data’)
filename = os.path.join(’data’,’nL_data.txt’)
log = open(filename,’a’)
t = time.strftime("%y%m%d%H%M%S", time.localtime())
log.write("%s, %s\n"%(t,info))
log.close()
dataLog("Saved ’%(info)s’ to nL_data.txt."%vars())
67
def save_to_LaTeX_single(minN=40):
output = []
if not os.path.isdir(’data’):
os.mkdir(’data’)
filename = os.path.join(’data’,’nL_data.txt’)
input = open(filename,’r’)
data = input.readlines()
input.close()
data = [string.split(i,’,’) for i in data]
print(data)
for i in data:
temp = [k.strip() for k in i]
del temp[0]
N = dec.Decimal(eval(temp[1]))
if N > minN and ’+’ not in temp[0]:
L = dec.Decimal(eval(temp[2]))
C = dec.Decimal(eval(temp[3]))
l = dec.Decimal(eval(temp[4]))
k = dec.Decimal(eval(temp[5]))
temp[0] = "\\textsl{%s}"%(temp[0])
temp[1] = N
temp[2] = L
temp[3] = C
temp[4] = k/N
68
temp[5] = l
temp[6] = math.log(k)/math.log(N)
temp[7] = k
temp = rounded_list(temp,3)
temp[1] = int(float(temp[1]))
temp[2] = int(float(temp[2]))
temp = [str(i) for i in temp]
temp = string.join(temp,’ & ’)
temp = "%(temp)s \\\\"%vars()
output.append(temp)
output = sorted(list(set(output)),key=lambda sortParam:
eval(string.split(sortParam,’&’)[1]) )
header = ["\\begin{table}",/
"\\caption{caption here}\\label{single}",/
"\\begin{center}",/
"\\begin{tabular}{|c|c|c|c|c|c|c|c|}"]
title = ["\\hline","Keyword & N
& L
& $C$
& $C_{rand}$
& $l$ & $l_{rand}$
& $\langle k\rangle$ \\\\","\\hline"]
footer = ["\\hline","\\end{tabular}","\\end{center}","\end{table}"]
output = header + title + output + footer
69
filename = os.path.join(’data’,’nL_LaTeX_Table_single.txt’)
file = open(filename,’w’)
for i in output:
file.write("%(i)s\n"%vars())
file.close()
def save_to_LaTeX_multi(minN=40):
output = []
if not os.path.isdir(’data’):
os.mkdir(’data’)
filename = os.path.join(’data’,’nL_data.txt’)
input = open(filename,’r’)
data = input.readlines()
input.close()
data = [i[0:len(i)-3] for i in data]
data = [string.split(i,’,’) for i in data]
for i in data:
temp = [k.strip() for k in i]
del temp[0]
N = dec.Decimal(eval(temp[1]))
if N > minN and ’+’ in temp[0]:
L = dec.Decimal(eval(temp[2]))
C = dec.Decimal(eval(temp[3]))
l = dec.Decimal(eval(temp[4]))
70
k = dec.Decimal(eval(temp[5]))
temp[0] = "\\textsl{%s}"%(temp[0])
temp[1] = N
temp[2] = L
temp[3] = C
temp[4] = k/N
temp[5] = l
temp[6] = math.log(k)/math.log(N)
temp[7] = k
temp = rounded_list(temp,3)
temp[1] = int(float(temp[1]))
temp[2] = int(float(temp[2]))
temp = [str(i) for i in temp]
temp = string.join(temp,’ & ’)
temp = "%(temp)s \\\\"%vars()
output.append(temp)
output = sorted(list(set(output)))
header = ["\\begin{table}",
"\\caption{caption here}\\label{multi}",
"\\begin{center}",
"\\begin{tabular}{|c|c|c|c|c|c|c|c|}"]
title = ["\\hline","Keyword & N
& L
& $C$
71
& $C_{rand}$
& $l$
& $l_{rand}$
& <k> \\\\","\\hline"]
footer = ["\\hline","\\end{tabular}","\\end{center}","\end{table}"]
output = header + title + output + footer
filename = os.path.join(’data’,’nL_LaTeX_Table_multi.txt’)
file = open(filename,’w’)
for i in output:
file.write("%(i)s\n"%vars())
file.close()
#Name:
save_degree_sequence
def save_degree_sequence(G,min=30):
graph = emptyGraph()
if isinstance(G,str):
if G in searchwords():
graph = load_graph(G)
else:
if not isEmptyWordGraph(G):
graph = generate_graph(G)
save_graph(graph)
if type(G) == type(graph):
graph = G
72
degree_sequence=list(nx.degree(graph).values())
output = ""
for k in degree_sequence:
if output == "":
output = "%(k)i"%vars()
else:
output = "%(output)s\n%(k)i"%vars()
if not os.path.isdir(’data/deg_seq’):
os.mkdir(’data/deg_seq’)
filename = os.path.join(’data/deg_seq’,"deg_seq_%s.txt"/
%(graph.graph[’title’]))
if len(degree_sequence) > min:
log = open(filename,’w’)
log.write(output)
log.close()
dataLog("Saved \"%s\" degree sequence to file."%(graph.graph[’title’]))
else:
errorLog("The %s graph had fewer than %i nodes /
and degree sequence was not saved."%(graph.graph[’title’],min))
#Name:
save_C_vs_k
def save_C_vs_k(graph):
if isinstance(graph,str):
G = generate_graph(graph)
73
else:
G = graph
C = nx.clustering(G).keys()
x = [G.degree(i) for i in nx.clustering(G)]
degrees = list(set(x))
C_List = []
for k in degrees:
temp = [nx.clustering(G)[i] for i in C if isEqual(G.degree(i),k)]
average_C_k = sum(temp)/len(temp)
C_List.append(average_C_k)
data = zip(degrees,C_List)
data = ["%s, %s"%(str(i[0]),str(i[1])) for i in data]
save_data(data,"C_vs_k_%s"%(G.graph[’title’]),path=’data/C_vs_k’)
dataLog("Saved %s graph C vs. k data to file."%(G.graph[’title’]))
#Name:
save_graph
def save_graph(graph,stamp=False):
if isinstance(graph,str):
G = generate_graph(graph)
else:
G = graph
#Assign parameter values in preparation for writing to a file.
pickle = True
N = len(nx.nodes(G))
74
L = len(nx.edges(G))
isolatedNodes = G.graph[’isolatedNodes’]
density = G.graph[’density’]
k = G.graph[’k’]
C = G.graph[’C’]
l = G.graph[’averagePathLength’]
timeStamp = time.strftime("%m%d%H%M", time.localtime())
if not os.path.isdir(’data’):
os.mkdir(’data’)
#Save G parameters to CSV format.
title = G.graph[’title’]
if N == 0:
pickle = False
errorLog("%(title)s graph contained no data. Nothing was saved."%vars())
else:
save_data_to_CSV("%(title)s,
%(N)i,
%(L)i,
%(C)f,
%(l)f,
%(k)f,
%(isolatedNodes)i,
75
%(density)f"%vars())
save_degree_sequence(G)
save_C_vs_k(G)
dataLog("Saved %(title)s graph data."%vars())
#Pickle G data
if not os.path.isdir(’data/gpickle’):
os.mkdir(’data/gpickle’)
if stamp:
gpickleName = "data/gpickle/%(title)s_%(timeStamp)s.gpickle"%vars()
else:
gpickleName = "data/gpickle/%(title)s.gpickle"%vars()
try:
if pickle:
nx.write_gpickle(G,gpickleName)
dataLog("Pickled %(title)s graph data successfully."%vars())
except:
errorLog("Error attempting to pickle %(title)s graph data."%vars())
#Name:
load_data
#Input:
filename of file from which to load data
#Output:
input data from save_data call
def load_data(filename,path=’ids’):
filepath = os.path.join(’data’,path,"%(path)s_%(filename)s.txt"%vars())
76
file = open(filepath,’r’)
data = []
input = file.readlines()
for i in input:
try:
data.append(eval(i[0:len(i)-1]))
except:
data.append(i[0:len(i)-1])
return data
#Name:
load_graph
#Input:
searchword used to create graph with generate_graph
#Output:
Returns graph data of generate_graph(searchword)
def load_graph(searchwords,path="data/gpickle"):
results = []
found = False
dirList = os.listdir(path)
if isinstance(searchwords,str):
searchword = searchwords
for filename in dirList:
if searchword in filename and ’+’ not in filename:
print(filename)
results.append(filename)
found = True
77
if isinstance(searchwords,list):
perm = iter.permutations(searchwords,len(searchwords))
searchList = []
for i in perm:
searchList.append(string.join(i,’+’))
for word in searchList:
for filename in dirList:
if word in filename:
searchword = word
results.append(filename)
found = True
if results and found:
results = [i[0:len(i)-8] for i in results]
results = [i for i in results if isEqual(i,searchword)]
filename = "%s.gpickle"%(results[0])
return nx.read_gpickle("%(path)s/%(filename)s"%vars())
else:
errorLog("No graphs with %(searchword)s in title."%vars())
return emptyGraph()
#Name:
plot_degree_sequences
def plot_degree_sequences(*graphs):
plt.clf()
graphType = emptyGraph()
78
names = []
plots = []
graphList = []
for i in graphs:
if isinstance(i,str):
if i in searchwords():
graphList.append(load_graph(i))
names.append(load_graph(i).graph[’title’])
else:
g = generate_graph(i)
save_graph(g)
graphList.append(g)
names.append(g.graph[’title’])
if isinstance(i,list):
for k in i:
if type(k) == type(graphType):
graphList.append(k)
names.append(k.graph[’title’])
else:
if k in searchwords():
graphList.append(load_graph(k))
names.append(load_graph(k).graph[’title’])
else:
if not isEmptyWordGraph(k):
79
g = generate_graph(k)
save_graph(g)
if not isEmptyWordGraph(k):
graphList.append(g)
names.append(g.graph[’title’])
else:
errorLog("%(k)s graph omitted
from degree sequence plot"%vars())
if type(i) == type(graphType):
graphList.append(i)
names.append(i.graph[’title’])
name = string.join(names,"/")
markers = ["o","D","s"]
marker = 0
colors = ["red","blue","green","yellow","cyan","magenta","black"]
color = 0
for G in graphList:
if nx.number_of_nodes(G) != 0:
degree_sequence=sorted(list(set(nx.degree(G).values())),
reverse=True)
dmax=max(degree_sequence)
plot, = plt.loglog(degree_sequence,colors[color],
marker=markers[marker])
marker += 1
80
marker = divmod(marker,len(markers))[1]
color += 1
color = divmod(color,len(colors))[1]
plots.append(plot)
save_degree_sequence(G)
plt.ylabel("degree")
plt.xlabel("rank")
nodes = ["N=" + str(nx.number_of_nodes(i)) for i in graphList]
labels = zip(names,nodes)
plt.legend(plots,labels,loc=1)
dataLog("Created degree distribution plot of %(name)s."%vars())
#Name:
plot_C_vs_k
def plot_C_vs_k(searchwords,path="data/C_vs_k"):
results = []
found = False
dirList = os.listdir(path)
if isinstance(searchwords,str):
name = searchwords
searchword = searchwords
for filename in dirList:
if searchword in filename:
results.append(filename)
if isinstance(searchwords,list):
81
name = string.join(searchwords,’+’)
perm = iter.permutations(searchwords,len(searchwords))
searchList = []
for filename in dirList:
if name in filename:
searchword = name
results.append(filename)
found = True
if not found:
for i in perm:
searchList.append(string.join(i,’+’))
for word in searchList:
for filename in dirList:
if word in filename:
searchword = word
results.append(filename)
if results:
results = [i[7:len(i)-4] for i in results]
results = [i for i in results if isEqual(i,searchword)]
data = load_data(results[0],’C_vs_k’)
x = [i[0] for i in data]
y = [i[1] for i in data]
plt.semilogx(x,y,’ro’,ms=5)
plt.ylabel("C(k)")
82
plt.xlabel("k")
else:
dataLog("No C(k) vs k data")
G = generate_graph(searchwords)
save_graph(G)
plot_C_vs_k(string.split(G.graph[’title’],’+’))
def plot_vs_N(y,loglog=False):
xData = []
yData = []
if not os.path.isdir(’data’):
os.mkdir(’data’)
filename = os.path.join(’data’,’nL_data.txt’)
input = open(filename,’r’)
data = input.readlines()
input.close()
data = [string.split(i,’,’) for i in data]
N = [i[2] for i in data]
C = [i[4] for i in data]
l = [i[5] for i in data]
k = [i[6] for i in data]
if y == "C":
xData = N
83
yData = C
xlabel = "N"
ylabel = "C"
if y == "l":
xData = N
yData = l
xlabel = "N"
ylabel = "l"
if y == "k":
xData = k
yData = N
xlabel = "<k>"
ylabel = "N"
output = zip(xData,yData)
output = ["%s, %s"%(i[0],i[1]) for i in output]
save_data(output,filename="%(y)s_vs_N.txt"%vars(),path=’data’)
if loglog:
plt.ylabel("log(%(ylabel)s)"%vars())
plt.xlabel("log(%(xlabel)s)"%vars())
plt.loglog(xData,yData,’bo’)
else:
plt.ylabel(ylabel)
plt.xlabel(xlabel)
84
plt.plot(xData,yData,’ro’)
#Name:
plot_graph
def plot_graph(G,layout):
plt.clf()
name = G.graph[’title’]
k = str(round(G.graph[’k’],3))
N = str(nx.number_of_nodes(G))
L = str(nx.number_of_edges(G))
C = str(round(G.graph[’C’],3))
l = str(round(G.graph[’averagePathLength’],3))
density = str(round(G.graph[’density’],3))
plt.title("\"%(name)s\",
N = %(N)s,
L = %(L)s,
<k> = %(k)s,
C = %(C)s,
l = %(l)s"%vars())
if layout == "circular":
pos = nx.circular_layout(G)
nx.draw(G,pos,with_labels=False,node_size=30,width=0.1)
else:
pos = nx.spring_layout(G)
nx.draw(G,pos,with_labels=False,node_size=35,width=0.3,iterations=100)
if not os.path.isdir(’data/graphs’):
85
os.mkdir(’data/graphs’)
plt.savefig("data/graphs/%s_%s.png"%(G.graph[’title’],layout))
#Name:
emptyGraph
#Input:
none
#Output:
generates empty keyword graph with set parameters equal to zero
def emptyGraph(name="None"):
G = nx.Graph()
G.graph[’title’]=name
G.graph[’isolatedNodes’]=0
G.graph[’density’]=0
G.graph[’k’]=0
G.graph[’C’]=0
G.graph[’averagePathLength’]=0
return G
#Name:
searchwords
def searchwords(path="data/gpickle"):
searchwords = []
try:
dirList = os.listdir(path)
for i in dirList:
searchwords.append(string.split(i,".")[0])
except:
searchwords = ["error"]
86
errorLog("Failed to retrieve list of created keyword graphs.")
finally:
return sorted(searchwords)
#Name:
isEmptyGraph
def isEmptyWordGraph(searchword):
empty = False
if not os.path.isdir(’data’):
os.mkdir(’data’)
filename = os.path.join(’data’,’nL_errorLog.txt’)
log = open(filename,’r’)
errMsgs = log.read().split("\n")
searchStr = "%(searchword)s graph contained no data."%vars()
for i in errMsgs:
if searchStr in i:
empty = True
log.close()
return empty
def isEqual(a,b):
if a == b:
return True
else:
return False
def rounded_list(input,digits):
tempList = []
87
for i in input:
try:
tempValue = eval(str(i))
tempValue = round(tempValue,digits)
tempValue = str(tempValue)
except:
tempValue = i
finally:
tempList.append(tempValue)
return tempList
def chunks(l, n):
return [l[i:i+n] for i in range(0, len(l), n)]
88
Appendix B. Tables
This Appendix contains the table of all numerical characteristics obtained for the
Facebook networks under consideration. All the networks are listed by categories and
keywords.
The data presented below supports the claim that the Facebook group networks of
this thesis exhibit features of ordered networks and random networks. That is, they
have large clustering coefficients like the ring lattice networks described by Watts
and Strogatz [43], and small average path lengths characteristic of random graphs [2].
Thus we conclude that the networks under consideration in this study exhibit small
world features.
89
Table 3. The general characteristics of several Facebook group networks.
Number of nodes, N, number of edges, L, average clustering coefficient, C̄,
average shortest path, l, average connectivity, hki. The networks are grouped
by categories and listed in decreasing order of N. The corresponding values
of C̄ and l for random networks with N nodes and L edges, C̄rand and lrand ,
are included for comparison. This table is continued on several pages.
Keyword
N
L
C̄
C̄rand
l
lrand
hki
Religion 1558 39547 0.543 0.033 2.723 1.872 50.766
islam
329
6102
0.577 0.113 2.215 1.604 37.094
muslim
233
4326
0.564 0.159 2.121 1.508 37.133
bible
210
2907
0.594 0.132 2.157 1.610 27.686
atheist
194
4745
0.766 0.252 1.818 1.354 48.918
catholic
155
1905
0.687 0.159 2.071 1.575 24.581
christian
106
1335
0.668 0.238 1.988 1.445 25.189
god
78
255
0.461 0.084 2.653 2.320
allah
75
388
0.554 0.138 2.423 1.848 10.347
pagan
31
133
0.707 0.277 1.888 1.598
8.581
mormon
19
37
0.381 0.205 2.398 2.166
3.895
pope
0
0
0
0
0
0
6.538
0
90
N
L
Politics
1117
9856
republican
299
10248 0.695 0.229 1.617 1.348 68.548
romney
193
1665
0.682 0.089 2.351 1.848 17.254
vote
140
380
0.374 0.039 2.371 2.921
5.429
government
104
219
0.349
1.936 3.230
4.212
politic
76
211
0.4
0.073 3.268 2.526
5.553
election
59
83
0.293 0.048 2.074 3.942
2.814
democrat
27
53
0.524 0.145 2.875 2.410
3.926
obama
20
85
0.624 0.425
senate
13
16
Hobby
929
8896
art
291
4521
gun
138
425
0.393 0.045 2.863 2.710
fish
80
421
0.632 0.132
2.39
1.862 10.525
collect
72
184
0.335 0.071
3.26
2.621
craft
46
259
0.66
0.245 2.008 1.581 11.261
comic
40
168
0.595
0.21
cook
37
228
0.733 0.333 1.773 1.438 12.324
hunt
31
67
0.483 0.139 2.295 2.346
stamp
6
9
0.678
0.5
1.4
1.631
3
eat
0
0
0
0
0
0
0
C̄
C̄rand
l
lrand
hki
Keyword
0.482 0.016 1.801 2.445 17.647
0.4
0.04
1.7
1.4
0.189 1.422 2.847
8.5
2.462
0.423 0.021 3.280 2.315 19.152
0.6
0.107
2.31
1.651 31.072
2.149 1.733
6.159
5.111
8.4
4.323
91
N
L
Sexes
853
6621
0.357 0.018 3.158 2.461 15.524
boy
259
2300
0.445 0.069 2.684 1.931 17.761
man
225
1093
0.455 0.043 3.286 2.382
gay
209
3430
0.624 0.157 2.201 1.530 32.823
girl
107
1242
0.608 0.217 1.993 1.486 23.215
lesbian
56
265
0.513 0.169 2.217 1.791
9.464
woman
15
48
0.729 0.427 1.695 1.459
6.4
male
3
2
Anime
C̄
0
C̄rand
l
lrand
hki
Keyword
0.444 1.333 3.819
9.716
1.333
839 27155 0.598 0.077 2.296 1.614 64.732
anime
276
9770
0.708 0.257 1.837 1.319 70.797
naruto
184
3101
0.663 0.183
OnePiece
127
1440
0.656 0.179 1.996 1.552 22.677
bleach
69
416
0.671 0.175
saiyan
31
143
0.633 0.298 1.824 1.545
9.226
Gundam
23
78
0.630 0.295 1.858 1.638
6.783
DBZ
18
32
0.441 0.198 1.967 2.279
2.556
dragonball
13
17
0.531 0.201 2.397 2.668
2.615
sasuke
12
19
0.454 0.264 1.985 2.156
3.167
2.01
2.06
1.482 33.707
1.701 12.058
92
Keyword
N
L
C̄
C̄rand
l
lrand
hki
Minorities 679 4758 0.356 0.021 1.544 2.470 14.015
latin
140 1140
africa
103 1196 0.583 0.225 2.029 1.474 23.223
black
70
274
hispanic
64
221
0.432 0.108
2.7
2.152
6.906
greek
37
137
0.469
2.11
1.803
7.405
american
31
136
0.591 0.283 1.916 1.581
8.774
china
26
121
0.623 0.358
1.460
9.308
nazi
23
21
0.153 0.079 1.549 5.207
1.826
white
23
22
0.033 0.083 2.205 4.834
1.913
asian
16
17
0.236 0.133 1.923 3.678
2.125
KKK
5
9
Science
0.48
0.5
0.9
0.116 2.327 1.771 16.286
0.112 2.052 2.065
0.20
0.72
2.0
1.10
1.256
7.829
3.60
417 3724 0.429 0.043 2.892 2.093 17.861
science
188 1774 0.513
0.1
2.322 1.782 18.872
nuclear
91
376
0.494 0.091
fractal
44
71
0.38
chaos
0
0
0
math
40
161
0.575 0.201 2.059 1.769
8.050
physics
23
108
0.684 0.408 1.719
9.391
2.136
8.264
0.073 2.109 3.230
3.227
0
2.68
0
0
1.40
0
93
N
L
Music
358
5210
0.616 0.081 1.941 1.744 29.106
bieber
117
1714
0.646
0.25
spears
105
1771
0.71
0.321 1.733 1.323 33.733
gaga
33
186
0.648 0.342 1.858 1.443 11.273
MTV
27
109
0.693 0.299 1.915 1.578
8.074
beyonce
24
162
0.851 0.563 1.413 1.221
13.50
kardashian
21
26
0.498 0.118 2.288 3.358
2.476
kesha
0
0
C̄
0
C̄rand
0
l
lrand
hki
Keyword
1.903 1.410 29.299
0
0
0
Sports
342 10356 0.715 0.177 2.403 1.422 60.561
football
91
1401
0.76
0.338 1.858 1.316 30.791
soccer
66
1426
0.9
0.655 1.384 1.112 43.212
golf
60
799
0.814 0.444 1.647 1.247 26.633
tennis
38
280
0.741 0.388 1.711 1.352 14.737
basketball
28
124
0.698 0.316 1.907 1.528
8.857
swim
15
19
0.336 0.169 1.689 2.913
2.533
baseball
10
21
0.650 0.420 1.622 1.604
4.20
bowling
4
3
Computer 300
0
0.375 1.667 3.419
1.5
2927
0.505 0.065 2.247 1.920 19.513
computer
112
1406
0.605 0.224 2.166 1.464 25.107
android
71
653
0.651 0.259 1.971 1.464 18.394
ipad
26
140
0.866 0.414 1.702 1.371 10.769
ipod
6
4
mac
18
PC
12
0
0.222 1.333 6.228
1.333
22
0.537 0.136 1.350 3.234
2.444
22
0.466 0.306 2.061 1.913
3.667