Comments
Transcript
Small World Properties of Facebook Group Networks
Small World Properties of Facebook Group Networks A Thesis Presented to the Department of Mathematics and the Faculty of the Graduate College University of Nebraska in Partial Fulfillment of the requirements for the degree Masters of Arts University of Nebraska at Omaha Jason Wohlgemuth August 2012 Supervisory Committee: Dora Matache Robb Todd Andrew Swift Raj Dasgupta Abstract: Small World Properties of Facebook Group Networks Jason Wohlgemuth, M.A. University of Nebraska, 2012 Advisor: Dora Matache, Ph.D. Small world networks permeate modern society. Paradoxically, the ubiquity of small world networks is matched only by the difficulty of finding tractable examples. Until we have an extensive taxonomy of small world networks at our disposal, intractability will prevail. In this thesis, we present a methodology for creating and analyzing a practically limitless number of networks exhibiting small world network properties. That is, we analyze networks, the vertices being Facebook groups sharing a common word in the group name and the links being mutual members in any two groups. Specifically, by analyzing the characteristics of single networks and network ensembles, we attempt to quantify how the small world properties scale with network size, and distill the essence of the networks to gain a deeper understanding of the essential features based on insight gained from empirical observation. We show that Facebook group networks have large clustering coefficients that do not vanish with increased network size, and small average path lengths, thus exhibiting small world features. At the same time, the average connectivity increases as a power of the network size, while the average clustering coefficients and average path lengths do not exhibit a clear scaling with the size of the network. Our results are similar to what have been found analyzing the Facebook network proper and give cause to pursue a more in-depth investigation. i Acknowledgements As I continue to learn, develop my mind, strive to understand the world or attempt to convince others of my understanding, I will remember this thesis. In every man’s life, there is a handful of people that hold special places in his memory. I would like to thank Dr. Dora Matache for being one of those people for me. As a mentor, she showed me professionalism in the pursuit of knowledge and a genuine passion for learning on a most basic level. She guided me throughout this project and pointed me at the finish line in the end. I would also like to thank Drs. Andrew Swift, Robb Todd and Raj Dasgupta for serving on my thesis committee and giving me their time and attention to further develop this work and facilitate my growth. Lastly and above all else, I would like to thank my wife, Julie Wohlgemuth. She is the joy in my life and the foundation on which I stand. She makes me a better man and I would be a shadow of my current self without her. ii Contents List of Figures iv 1. Introduction 1 1.1. Terminology 4 2. Overview of Relevant Literature and Motivation for Current Work 6 3. Mathematical Background 10 3.1. The Facebook Group Network and Significant Numerical Measures 10 3.2. Random Graph Models 13 3.3. Small World Network Models 15 4. More on the Facebook Social Network 19 4.1. Network Analysis Tools 21 4.2. Facebook Overview 22 5. Graph Creation Methodology 23 5.1. Python Overview 25 5.2. Node Creation 26 5.3. Edge Creation 27 5.4. Pruning 27 6. Analysis and Visualization 29 6.1. Network Visualization 29 6.2. Numerical Characteristics 37 7. Conclusions and Future Work 49 References 52 Appendix A. Python Code 56 iii A.1. Module Import 56 A.2. Graph Generation 56 Appendix B. Tables 88 iv List of Figures 1 The random rewiring procedure of the Watts-Strogatz model, which interpolates between a regular ring lattice and a random network without altering the number of nodes or edges [2]. After Watts and Strogatz, 1998 [43]. 2 17 Characteristic path length, L(p), and clustering coefficient, C(p), for the Watts-Strogatz model. The data are normalized by the values L(0) and C(0) for a regular lattice. A logarithmic horizontal scale resolves the rapid drop in L(p), corresponding to the onset of the small-world phenomenon. During this drop C(p) remains almost constant, indicating that the transition to a small world is almost undetectable at the local level [2]. After Watts and Strogatz [43]. 3 18 The process of creating a network out of Facebook groups of various size, where two nodes (groups) are connected if they have at least one mutual member. For example, the red group and yellow group have two mutual members and are therefore connected in the lower network depiction. By construction, isolated nodes are removed from the network. The purple group has three members that are not in any of the other groups and is therefore removed. 4 Facebook groups network created from groups with the word, anime, in the title. The numerical characteristics are: N = 276, L = 9770, C̄ = 0.708, l = 1.837, < k >= 70.797, indicating short path lengths and a clustering 28 v coefficient larger than the corresponding one for a random network with similar topological aspects. 30 5 An enlarged depiction of the “core” of figure 4. 30 6 Plot of bieber network. The numerical characteristics are: N = 117, L = 1714, C̄ = 0.646, l = 1.903, < k >= 29.299, indicating short path lengths and a clustering coefficient larger than the corresponding one for a random network with similar topological aspects. 31 7 An enlarged depiction of the “core” of figure 6. 31 8 Plot of muslim network. The numerical characteristics are: N = 233, L = 4326, C̄ = 0.564, l = 2.121, < k >= 37.133, indicating short path lengths and a clustering coefficient larger than the corresponding one for a random 9 network with similar topological aspects. 32 An enlarged depiction of the “core” of figure 8. 32 10 Plot of army network where the nodes are randomly distributed on a ring. The numerical characteristics are: N = 48, L = 189, C̄ = 0.434, < k >= 7.875 and l = 2.32. It is difficult to gain much knowledge from a graph of this size, but it is useful when analyzing the behaviors of network parameters as N is increased. 33 11 Plot of navy network where the nodes are randomly distributed on a ring. The numerical characteristics are: N = 20, L = 31, C̄ = 0.504, < k >= 3.1, and l = 1.83. This is another example of small network. It serves as a good conceptual comparison to figure 10, due to its generating keyword. 34 12 Plot of army network using the Fructerman-Reingold force-directed algorithm. 34 vi 13 Plot of navy network using the Fructerman-Reingold force-directed algorithm. 35 14 Plot of the Religion network. The individual key words used are: islam, muslim, bible, atheist, catholic, christian, god, allah, mormon, pagan, pope. The numerical characteristics are N = 1558, L = 39547, C̄ = 0.543, l = 2.723, < k >= 50.766. 39 15 An enlarged depiction of the “core” of figure 14. Observe the three smaller cores and notice the density of links. 39 16 Plot of the Sports network. The individual key words used are: basketball, soccer, football, baseball, golf, tennis, bowling and swim. The numerical characteristics are N = 342, L = 10356, C̄ = 0.715, l = 2.403, < k >= 60.561. 40 17 An enlarged depiction of the “core” of figure 14. 40 18 The distribution of the connectivity values in log-log plots for the ten networks of Table 1, in decreasing order of network size. Each graph has the same general shape. As expected, the networks with more nodes render a log-log plot as the plot appears to degrade at N ≈ 350. Although the graphs appear to be mostly linear, the distributions cannot be characterized entirely by a power law decay. In general, the plots adhere to a power law for small values of k but then seem to decrease more quickly than a power law would predict for large values of k. This is seen with more precision in figure 19 where we attempt to fit a Pareto distribution to a degree distribution. 41 19 Pareto fit for the degree distribution. The estimated shape parameter γ is close to 1. However, the tail of the graph is actually heavier than the estimated fit. 42 vii 20 Boxplot for the k/N values associated with the networks of Table 1. 43 21 Log-log plot of hki versus N for all the networks considered in this analysis, including single and multiple keywords. The fitted line shows hki ∝ N 0.6 . 44 22 The average clustering coefficient, C̄(k), against the degree, k, on a log-log scale for the ten networks of Table 1, in decreasing order of network size. In general, each plot indicates an approximate logarithmic decay for large k. These results are similar to other studies of small world networks [34, 38]. Observe that the largest network, Religion provides the most orderly plot, which is to be expected. 45 23 Boxplot of the average clustering coefficients for individual connectivity values C(k) values associated with the networks of Table 1. 46 24 Average clustering coefficient and average path length as functions of N. The blue dots indicate the Facebook data and the red circles indicate the corresponding values for a random network as listed in Table 1. There is a clear difference for the clustering coefficients; however it is hard to notice essential differences for the path lengths. 47 25 Average clustering coefficient as a function of N. The red dotted line depicts the average value and possible steady state value for large N. 48 26 Average path length boxplot and distribution for both the Facebook group networks and random networks with the same number of nodes and links. 49 1 1. Introduction The networks of today are not static structures. They are dynamic and evolving entities. The only constant parameters of a network are the meanings associated with the network elements, nodes and edges. It has been said that the “science of networks must become, in short, a manifestation of its own subject matter, a network of scientists collectively solving problems that cannot be solved by any single individual or even any single discipline” [42]. In the past few years there has been a great interest in studying the basic topology of a variety of networks such as the World Wide Web [3], signal transduction networks [37], subway systems [23], railway networks [34] and more recently Facebook [38, 6]. Specifically, these efforts have seen the emergence of a very specific type of network, a small world network, in which most nodes are not neighbors of one another but can be reached in a small number of steps. First conceptualized in a short story by the Hungarian novelist, Frigyes Karinthy, the concept of a small world made its way into the realm of academia via the work of the social psychologist, Stanley Milgram [25, 26]. Physicists entered the fray with the development of the aptly named Watts-Strogatz network model [43], which replaced the previously uncontested random graph model of Erdős and Rényi. In the Erdős-Rényi model the links between nodes are generated independently with the same probability. The Watts-Strogatz model starts with a ring lattice (that is the nodes are positioned in a ring and are connected to the k-nearest neighbors) and each link is rewired with a given probability p. If p = 1 one obtains the ErdősRényi model, while p = 0 yields the ring lattice. Watts and Strogatz considered 2 two numerical characteristics associated with the network, the characteristic path length and clustering coefficient [43]. They showed that for very small values of p and for large enough values of p the two numerical characteristics tend to have similar magnitudes. However, for intermediate values of p the two numerical characteristics tend to be at opposite ends of their range, [0, 1], namely small characteristic path length and large clustering coefficient. That behavior corresponds to small world networks. The random networks of Erdős and Rényi and the small world networks of Watts and Strogatz seek to model the network phenomena all around us, the neural networks in our brains, the ecosystems of rainforests, the future of the stock market, the dynamics of epidemics, and the internet to name a few. They seek to re-create these network phenomena mathematically. This thesis is concerned with characterizing real world examples of the small world network phenomena, starting with features such as clustering and network diameter. That is, we seek to distill the essence of complex networks and gain a deeper understanding of the essential features for any given network based on insight derived from empirical observation. Historically, such an understanding has been out of reach due to technological limitations and the shear vastness of data. By studying Facebook groups, we have found a tractable data set of networks with amazing complexity. This thesis recounts the analysis of these Facebook group networks that can be analyzed independently and in comparison with other similar networks to a high level of detail. Furthermore, having many networks of similar construction allows us to address the scaling behavior of network parameters as functions of network size, N. 3 The work in this thesis supplements previous research on the topology of Facebook by focusing on groups rather than individuals. There is much to be learned about networks by studying expansive social networks such as Facebook, but small world properties have been seen and analyzed in networks with as few as 43 nodes [4]. Networks with fewer nodes are naturally more amenable to analysis and could yield results without the requirement of expensive hardware or sophisticated software. By analyzing the statistics of Facebook groups, we can search for the the most basic of networks and compare the data to the Facebook network as a whole to see if small world networks exhibit self-similiarity; that is, we can see if essential characteristics are independent on the level of aggregation of the networks from basic elements to modules that can be regarded as nodes. We are interested in understanding what is the impact of aggregation of individuals into groups on the main statistics of the network. In order to be able to perform a comparison with the previous results in the literature, we use similar topology statistics [2, 27] and information, as well as some new approaches. We collect information on connectivity or degree distribution, clustering coefficients, average shortest paths, and network density. We use the average shortest paths and clustering coefficients to characterize small-world networks [43, 4] and seek to develop a deeper understanding of what makes a small-world network, a small-world network. In this regard we created several networks composed of Facebook groups with a common keyword in their titles, and compared the parameters of each to find common characteristics and observe how they change with other parameters such as, but not limited to the number of nodes, graph density, and degree distribution. We compare our results to the corresponding ones for random networks or other Facebook studies. We show that an aggregation of Facebook users into 4 groups and of groups into further smaller categories of groups does not change the basic small world features. Thus Facebook exhibits a scaling invariance of properties. There are three complementary elements to this thesis: description of the social networks under consideration, python code, and mathematical network analysis. Social networks provided the network data to be studied, python programming provided access the the network data and mathematical analysis provided a means for understanding the complexity of the network data. All three parts will be described in sufficient detail; however the main focus is the mathematical analysis. Although, the possible avenues for sociological study is intuitively obvious. The organization of this thesis follows the intuitive progression of the research it describes. Section 3 lays the foundation for the mathematics needed for analysis. Section 4 provides an overview of social network analysis with an emphasis on the Facebook social network It also describes the tools available for social network analysis and the reason why we chose to use python. Section 5 describes in detail how we create our networks from the data pulled from Facebook using the Facebook Graph Application Programming Interface (API) and python. Finally, we present the results of our analysis with data and visualizations in section 6 and end our thesis with our conclusions and ideas for future work in section 7. In Appendix A we provide the extensive python scripts that have been created for this work and in Appendix B we provide the significant part of the data used for the mathematical analysis. 1.1. Terminology. In this thesis we will use network and graph interchangeably. The use of the word “group” will refer to Facebook groups composed of Facebook users under some assumed common context. The Facebook group networks of this 5 thesis are collections of Facebook groups with a common word in its title and can be created from a single search word or multiple. We will refer to the former as keyword graphs and the latter as keyword multi-graphs. To be clear, keyword multi-graphs are not merely the sum of multiple keyword graphs. Each keyword graph has isolated nodes that are removed in the pruning phase of network creation. These isolated nodes may not be isolated when combined with the nodes of another keyword graph. Hence, a keyword multi-graph can be greater than the sum of its keyword graphs of the multiple search words from which it was created. The nodes or our networks are Facebook groups and we will use the words, links and edges, interchangeably. Finally, we define a small world network to be a network that possesses two properties: (1) The average shortest path (to be defined later) length scales with the logarithm of the number of nodes, ln N, where N is the size of the network. [13, 2, 28] (2) The network exhibits transitivity (or clustering, to be defined later) higher than random networks that is bounded away from zero. [13]. Some papers [43, 13, 41] have included sparseness as a requirement for a small world network. A network is said to be sparse if δ → 0 as n → ∞, where δ is the network density and n is the number of nodes in the network. Network density is the ratio of possible links given n, nodes, over the actual number of links, L. Mathematically, n 2 δ= L = n(n − 1) . 2L 6 We conjecture that the assumption of sparseness for the Facebook network of this study is valid. Our work does not include the analysis of sparseness due to both computational limitations and time constraints. Based on previous studies [43, 13, 41] the restrictions on the average shortest path length and clustering adequately encompass the networks analyzed in this thesis and provide a good metric for assessing real world networks composed of a finite number of nodes that exist between complete order and randomness. Next we provide an overview of several papers and results that inspired this work. 2. Overview of Relevant Literature and Motivation for Current Work Most attribute the idea of small worlds to the social psychologist, Stanley Milgram [25, 26]. Milgram selected two participants, the first from Omaha, NE, and the other from Wichita, KA. The initial participant was asked to send a letter in an attempt to get close to the other participant. That is, he was to send the letter to someone who might have a higher likelihood of getting closer to the other participant. This recipient would mail the letter in a similar fashion and so on and so on until it reached its destined recipient in Wichita. Milgram received results 4 days later. The results pointed Milgram to the notion of small worlds. What he saw was that when letters were successfully delivered, they did so in under 6 steps. It appeared that two randomly chosen people with seemingly no causal relation, were in fact closer than 7 expected.† With regard to small worlds, Stanley Milgram believed that social networks are in some sense tightly woven, full of unexpected strands linking individuals far removed from one another in physical or social space. This is not to say Milgram believed himself to be four handshakes from having tea with the Prime Minister of Hungary; rather, he believed that 6 degrees of separation is certainly less than intuition would dictate, but it is not as close as it seems. People are still approximately 6 groups apart and this is actually not so surprising. Small worlds were first introduced to the scientific community by Watts and Strogatz in their seminal article, “Collective Dynamics of ‘small-world’ networks”. [43] With their paper, the small-world network model was added to the random network model of Erdős and Rényi. In their paper, Watts and Strogatz presented a model that described systems that were highly clustered like regular lattices, yet had small characteristic path lengths, like random graphs. [43] The Watts-Strogatz model highlighted random links much like the weak ties proposed by Mark S. Granovetter [18]. Members of any social network have strong ties, and weak ties. A strong tie might be a relationship between family members or best friends. Weak ties are composed of everything else, from your barber to the student you sit next to in second period. Furthermore, Granovetter suggested that weak ties are immensely important, perhaps more important than strong ties. Neither the small-world model of Watts and Strogatz or the weak ties of Granovetter explained how small-world networks were created or grew. In a study by Barabasi, Albert and Hawoong, called “Diameter of the World-Wide Web” [3], it was † His results were by no means definitive and were in fact contested by peers in his field, among which was psychologist, Judith Kleinfeld 8 found that the World-Wide Web is in fact a scale-free network; that is, the connectivity has a power-law distribution with a heavy tail. Later studies would find that several small-world networks have scale-free topologies. The article by Albert and Barabasi, “Statistical Mechanics of Complex Networks” provides a comprehensive treatment of networks and includes the parameters of many small-world networks of a diverse variety. This thesis utilizes the mathematical foundation providing in “Statistical Mechanics of Complex Networks” as well research and writings of by M.E.J. Newman, to include “Networks” [28] and “Models of a Small-World: A Review” [29]. With a strong theoretical basis, this work sought to apply the analysis to real world social networks. Small world networks have become quite mainstream with books such as Watts’ ‘Six Degrees” [42], and Barabasi’s “Bursts” [7] and “Linked” [8]. Certainly, the scientific community has taken an interest as well. The study, “Is the Boston Subway a Small-World Network?” by Latora and Marchiori [23] made the leap from theory to experiment. Latora and Marchiori made a complex network where the nodes are train stations and the edges are stations connected by track. In their own terminology, they found the Boston subway to be locally and globally efficient, tantamount to being a small world network. The Indian railway network was also found to exhibit small world properties by Sen, et all in their paper, “Small-World Properties of the Indian Railway Network” [34]. This work sought to leverage data from a social network. Of all the social networks currently in use, Facebook arguably has the best combination of access, popularity and support. Facebook has amassed over 900 million users. The Facebook Graph API provides access tempered by privacy restrictions with support for interfacing with 9 almost any programming language. Furthermore, there has already been significant study into the structure and nature of the Facebook social graph as a whole. [38, 6] “The Anatomy of the Facebook Social Network Graph” [38] confirms the ‘six degrees of separation” to be present in the Facebook graph along with high local clustering. That is, Facebook seems to be a small world network and the degrees of separation would later be reduced to 4. [6] This work does not analyze networks on the scale of the entire Facebook graph, but successfully finds subgraphs of sufficient complexity in our Facebook group networks. Previous studies on the entire Facebook graph have indicated the presence of the small world properties. We are interested in seeing if an aggregation of users/groups preserves these properties. We consider networks of Facebook groups (as opposed to Facebook users that may belong to one or multiple groups) that share some common interest. Different common interests could generate differences and/or similarities between the basic structure of these networks. The actual description of the Facebook networks constructed in this study will be provided in the next section. We have by no means listed all the work in the field of complex networks or social networks analysis. By analyzing a collection of tractable data sets, we hope to add the conversation by helping to connect theory and reality. In the next section we define the basic elements of the network and the numerical characteristics to be used in our analysis. 10 3. Mathematical Background In this section we provide and overview of the mathematical tools needed to analyze the networks in this work. We also review the numerical characteristics of ordered, random, and small world networks. 3.1. The Facebook Group Network and Significant Numerical Measures. Let us denote by G = {x1 , x2 , . . . , xN } a network with N nodes. Each node, xi , is assumed to be linked to ki ∈ {0, 1, . . . N − 1} other nodes in the network, called its inputs. The parameter ki is called the connectivity of node xi . If ki = 0 then the node is isolated. We will assume a non-directed network, that is if node xj is an input to node xi , then xi is an input to xj . G can be viewed as a graph with vertices, x1 , x2 , . . . , xN , and edges (xi , xj ), i, j = 1, 2, . . . N . Now we describe the actual Facebook network considered in this study. Definition 1. Let Fw be the set of (public) Facebook groups with the word, w, in the group’s title. These groups represent the nodes of the network, and consequently each node xi is basically the set of people belonging to that group. Then we can define the set of links between the nodes of this network as Lw = {(xi , xj ) ∈ Fw × Fw | xi ∩ xj 6= ∅}. Furthermore, two nodes, xi and xj , are said to be adjacent nodes if (xi , xj ) ∈ Lw . Thus, we construct Gw , a network with nodes Fw , and edges Lw , as defined above. The Facebook networks under consideration consist of Facebook groups that share a common word in their title (implying a common interest). Individual groups are linked by having common members. For example, if w = f ood, then we consider all 11 the groups that have the word f ood in their title. Two possible group titles could be ”Food and wine” and ”All about food”. These would be two nodes in the network. If they share common members, then there is an edge between them. The main reason for constructing such networks is to be able to identify possible commonalities between the structure and properties of networks that are all social, but represent different types of personalities and interests. We would expect to see some impact of the type of common interest of the groups, represented by the common word. For instance, we have found that the w = bieber network is a lot more connected and clustered than the w = math network, not to mention, much, much larger. Often the graphs are intuitive. The graph for w = bieber, Fbieber , is large and connected, while the graph for w = biology, Fbiology , is almost non-existent. However, it is not intuitive that the nodes in the Fbieber are from all around the world and even though Fbiology is very small, the fact that is exists at all on Facebook is somewhat surprising. Clearly, choice of keyword effects the return drastically and the resultant graph is subject to innumerable factors, both intuitive and otherwise. It is our goal to analyze the impact of these common interests on the numerical characteristics of the Facebook group networks, in order to identify what kind of networks they are, namely ordered, random, or small-world. To this end the most common numerical characteristics to be analyzed are the degree distribution, average clustering coefficient and associated path lengths of the network [43, 2, 38, 6, 34]. Definition 2. If k1 , k2 , . . . , kN are the connectivity values of the nodes x1 , x2 , . . . xN respectively, the connectivity distribution is given by the probability distribution function f (x) = P (ki = x) for any i, where x ∈ {1, 2, . . . N − 1}. The average connectivity 12 is then (3) < k >= N −1 X xP (ki = x) x=1 Observe that we do not take into account isolated nodes to insure that all numerical characteristics are properly defined. Definition 3. Given a node, xi , the clustering coefficient of xi , denoted Ci , is a measure of transitivity. That is, it measures how connected the inputs of a node are. Ci is given by the number of all possible adjacent node pairs that actually exist in the network [43]. More precisely, if node xi has ki inputs then there exist at most 1 k (k −1) 2 i i links between these ki inputs. Ci is defined as the fraction of these allowable links that actually exist in the network. We can then define the average clustering coefficient for a network with N nodes as (4) C̄ = N 1 X Ci . N i=1 If k̂ = {x ∈ Fw | degree(x) = k}, where k ∈ {1, 2, . . . , N − 1}, then we can define the average clustering coefficient of nodes with degree k, as (5) C̄(k) = 1 X |k̂| Cx , k = 1, 2, . . . , N − 1. x∈k̂ Definition 4. Given two nodes, xi and xj , the shortest path connecting them, lij , is given by the minimal number of links that lead from xi to xj (or viceversa since we deal with a non-directed network, so that lij = lji ). By construction, our networks contain no nodes with k = 0 (isolated nodes) which means each network is comprised of one or more connected subgraphs/subnetworks where every node in each subgraph 13 has k ≥ 1. In the case that a network is composed of multiple connected subgraphs, the average path length is determined from the average path lengths of each connected subgraph where the average path length is defined by (6) < l >= X 1 lij . N (N − 1) i6=j Small world networks exist in the realm between order and randomness. An ordered network is basically a ring lattice in which the nodes are placed on a circle and each node is connected to its k nearest neighbors (k/2 on either side). With this idea in mind, we review the models for random and small world networks. 3.2. Random Graph Models. A random network is a network in which some specific parameters such as the number of edges or the average connectivity, take fixed values, but the network is random in other respects. There are several models for generating random graphs, the most common one being to maintain the number of nodes and edges constant while randomly assigning the edges [28]. These graphs are commonly referred as G(N,m) graphs, where N is the number of nodes and m is the number of edges. Basically a G(N,m) graph is obtained by choosing randomly, according to a uniform probability distribution, one network out of the ensemble of all possible networks with N nodes and m edges. It is known that the average connectivity for a G(N,m) graph is < k >= 2m/N , but beyond that it becomes increasingly difficult to derive closed form equations for other network parameters [28]. Arguably the most widely studied type of model for the construction of a random network is G(N, p) where p is the (fixed) probability of constructing an edge between any two nodes. These networks have become known as Erdős-Rényi networks[15] due to the eminent works of the namesakes. The connectivity for G(N,p) networks follows 14 a binomial distribution, (7) N −1 k frand (k) = p (1 − p)N −1−k , k where p is the given probability that any two nodes are connected, assumed fixed for all pairs of nodes. This is because N −1 k k p is the probability that a node has edges to k nodes. However, there must be no edges to the rest of the N − k nodes, which occurs with probability (1 − p)n−k . Consequently, the average connectivity is < k >= N p. For large N this leads to the Poisson distribution as an approximation of the binomial distribution, with parameter λ = N p =< k >, (8) f (k) = e−λ λk . k! Now, every pair of nodes in a G(N,p) network has the probability, p, of being connected. If a node has k connections, it has a possible k 2 of connected “friend” pairs, also called the first neighbors. The probability that any two of these first neighbors are connected is then equal to p, so that the clustering coefficient of any node xi is (9) Ci = p. The average clustering coefficient is defined over the entire network and for a G(N,p) network, we have (10) C̄rand N 1 X 1 <k> = Ci = N p = p = . N i=1 N N For a random network, each node has < k > connections, on average. Then each adjacent node also has < k > connections, on average. If we counted every node each 15 time and repeated this process l times, we would have < k >l nodes, and l would be the path length between two nodes considered in the beginning and the end of this process. If we equate < k >l and N (since we cannot have more than N nodes), we can derive the relation for the average shortest path length for a random network. More precisely, it is known that the average shortest path length for a G(N,p) network scales as follows (11) lrand ∼ ln N . ln < k > Therefore, for a G(N,p) network, as N → ∞, we see that C̄rand vanishes and lrand scales as ln N . Thus, property (1) of small world networks is fulfilled. However, despite the fact that there is a relatively small path length between any two nodes, the random graphs lack the inherent nontrivial clustering of a small world network described by properties (1) and (2). We will be computing all these global parameters for a number of Facebook networks, Gw , and we will compare them with the ones presented in this section. 3.3. Small World Network Models. In an effort to capture the transitivity of actual small world networks and retain average shortest paths characteristic of random graphs that scale with ln N , the Watts-Strogatz model was developed [43]. More precisely, starting with a ring lattice, each edge is re-wired with probability p excluding self-inputs and duplicate edges. When there is no chance of an edge being re-wired, p = 0, we obtain an ordered/regular network. It exhibits high clustering fulfilling property (2), but long path lengths, so it is not a small world network. At p = 1, a Watts-Strogatz network is exactly a random network, and thus has short path lengths fulfilling property (1), but little to no transitivity, so it is not a small world 16 network. For a significant range of values of 0 < p < 1, the networks created with the Watts-Strogatz model can fulfill both properties (1) and (2), so they are small world networks. Another version of the Watts-Strogatz model is to randomly add ”shortcut” edges according to the value of p, but not to remove any edges. That is, edges are added between randomly chosen node pairs just as described above and depicted in Figure 1, but no edges are removed from the original circle [28]. For this variation, since there are 21 N k edges on the circle, there are 12 N kp added shortcuts. This means there are kp shortcut edges per node, on average. Let s be the number of shortcut edges attached to any one node. Then the degree distribution for the Watts-Strogatz network model is a Poisson distribution with parameter, kp, (12) kp (kp) Ps = e s! s . Furthermore, by construction, we know that the total degree of a node is equal to d = s + k. Substituting s = d − k into the previous equation, we have the degree distribution for the Watts-Strogatz small-world model [28]: (13) Pd = ekp (kp)d−k (d − k)! for d ≥ k and Pk = 0 if d < k. To calculate the clustering coefficient for the Watts-Strogatz small-world model, one has to calculate the number of triangles within a graph and the number of connected triples. A triangle is a closed path of length two and a connected triple means three nodes, u,v, and w, with edges, (u, v) and (v, w). Then, the clustering coefficient is simply the ratio of three times the number of triangles over the number of connected 17 triples. For the Watts-Strogatz small-world model the triples formed from shortcut edges must be taken into account and added to the triples from the original lattice. Therefore the clustering coefficient for the Watts-Strogatz small-world model is (14) k/2 ×3 2 2 N k p + 12 N k 2 p2 C= k 2 + = ( 41 N k( 12 k − 1)) × 3 3(k − 2) = 1 1 2 2 2 4(k − 1) + 8kp + 4kp2 N k(k − 1) + N k p + 2 N k p 2 where N k 2 p is the number of connected triples formed from an added shortcut edge and an original edge of the lattice and 21 N k 2 p2 is the number of connected triples formed from pairs of added shortcut edges [28]. Finally, the average shortest path for the Watts-Strogatz small-world model has been shown to scale with ln N just like random networks [28]. The behavior of the clustering coefficient and average shortest path as a function of p is depicted in Figure 2. Figure 1. The random rewiring procedure of the Watts-Strogatz model, which interpolates between a regular ring lattice and a random network without altering the number of nodes or edges [2]. After Watts and Strogatz, 1998 [43]. 18 Figure 2. Characteristic path length, L(p), and clustering coefficient, C(p), for the Watts-Strogatz model. The data are normalized by the values L(0) and C(0) for a regular lattice. A logarithmic horizontal scale resolves the rapid drop in L(p), corresponding to the onset of the small-world phenomenon. During this drop C(p) remains almost constant, indicating that the transition to a small world is almost undetectable at the local level [2]. After Watts and Strogatz [43]. Some small world networks have been shown to exhibit another interesting property, along with high clustering and short path lengths. Networks such as the Internet, collaboration networks, ecological networks, cellular networks, citation networks and the community of actors exhibit power law degree distributions∗[2]. Definition 5. Scale-free networks obey a power law connectivity distribution (known by several other names †). That is, the connectivity of a node is determined by a shape ∗ Although some small world networks are also scale free, not all scale free networks are also small worlds. † Zeta distribution, Zipf distribution, and Pareto distribution to name a few. 19 parameter and a scaling factor with (15) where ζ(γ) = f (k) = PN 1 x=1 xγ k −γ ζ(γ) k = 1, 2, . . . , N is the truncated Riemann ζ-function, also called the scaling factor. The shape parameter of the distribution is γ > 0. Thus each node is assigned k inputs with probability f (k). For scale-free networks the usual range is 2 < γ ≤ 3 , while 1 < γ < 2 is typical for biological networks (genes, proteins, metabolism, and ecological networks)[2]. The power law distributions exhibit a scale invariance. That is, (16) f (βk) = (βk)−γ (k)−γ = (β)−γ = (β)−γ f (k) ∝ f (k), ζ(γ) ζ(γ) where β is a constant. So the resulting function is equal to f (k) modulo a multiplicative constant. Because of this property, such networks have become known as scale-free networks. The purpose of this work is to identify the impact of common interests on Facebook group networks’ numerical measures, and identify similarities and differences. We will generate many Facebook group networks and compute their numerical measures in comparison to the values described above for random, small world, and scale free networks. With this in mind, we proceed to our discussion of the social aspects of the Facebook networks from which we get all our data. 4. More on the Facebook Social Network A social network consists of a finite set or sets of actors/people/groups and the relation or relations defined on them [40]. In this thesis, we will call actors nodes, 20 and are referring to the social collective units of Facebook groups. All the nodes are of the same type (groups) and our networks are therefore referred to as one-mode networks. We will call relations links and will analyze the structural variables, connectivity, clustering and average shortest path length. These concepts are relatively intuitive and as previously mentioned, this thesis is not meant to be a treatise on the social interactions of Facebook groups. Rather, this thesis seeks to reveal the underlying commonalities of the networks, in a mathematical sense, formed from the social interactions between Facebook groups. Privacy restrictions aside, analyzing groups of Facebook users as opposed to analyzing individual users lends itself well to a social network study. The name of a given group is assumed to be consistent with the context of the group itself. That is, if a group is named ”People who love cats”, it is assumed that the group is comprised of people who have some preference for cats, rather than individuals that collectively hate cats. To use social network analysis terminology [40], this assumption addresses the validity of the data presented in this thesis. A person who likes cats will join groups that like cats and are named to reflect that fact. Contrary circumstances will be statistical minorities. In any case, for the purposes of this thesis, it is not whether or not a user likes cats that is important, but the fact that a user is concerned with cats at all. As will become evident in the remainder of this thesis, complex networks can be created without ever considering the personal preferences of the users within the groups (nodes). With a firm basis for our work, we proceed to an overview of the Facebook social network and mention some of its idiosyncracies. 21 4.1. Network Analysis Tools. As far as social network analysis software is concerned, there are three categories: simulation, mathematical analysis, and visualization. These categories are not mutually exclusive and most applications support two or three. In any case, other features of the available software is cost (free or pay to use), installation requirements (web app or downloadable software), and Operating System (OS) support. As we said in the first paragraph of the introduction, collaboration is key to eventually understanding the simple complexity all around us. With this in mind, this work sought to use open source, free of cost, no download required technology that could analyze dynamic data pulled from constantly updated social networks. We used MATLAB for some of the mathematical analysis, but in the future we hope to replace the MATLAB functionality with python coding. There is a wide variety of literature on the available software for social network analysis. While researching for this thesis, we came across myriad applications. But none of them offered all the features needed. Most notably, none leveraged real world data that is subject to continuous modifications for analysis. Of the applications that performed analysis, all require the user to import network data as separate files. This may sound like a simple endeavor, but an all-encompassing application for the analysis of the kind of real world data used in this thesis does not exist to our knowledge. We refer the reader to the reference section for articles on social network analysis software [16, 32, 40]. Imminent applications for analysis include Discrete Dynamics Lab (DDLab) [44], Pajek [9] and UCINET [11], where all three have visualization capabilities built in. The bulk of the other applications we looked into include Krackplot [22], Visual Social Networks(Visone) [12], NetVis, Touchgraph, MASON [33], Discrete Visualizer of Dynamics (DVD) [21], NodeXL [36, 19], Netminer [14], STRUCTURE, 22 NEGOPY [30], Applied Graph and Network Analysis (AGNA), FATCAT [31], and Cytoscape [35]. 4.2. Facebook Overview. Facebook is arguably the most influential social network in the history of mankind [1]. Anyone can create a Facebook profile for free and input as much information as he/she desires. Users can “friend” other users, join groups of users, like users, check in at various locations and update his/her own personal message board either for private or public consumption. These capabilities have created an intricate network of users, connected by “friendships”. Paired with its eminence is the accessibility of its data. In 2007, the Facebook API was released and the flood gates were opened. Data of unparalleled detail and magnitude became available as never before. With approximately 900 million users and over 125 billion friend connections, Facebook is available in more than 70 languages† and is larger than any network before it, permeating every corner of the globe and every walk of life. Naturally, massive amounts of data facilitate a quest for knowledge and meaning. Ever vigilant to remain innovative, the initial Facebook API was succeeded by the aptly named “Facebook Graph API” that represented every object of Facebook by a numerical id easily accessible by “folder hierarchy” like programmatic calls. Although significant, the access to the massive amounts of Facebook data is not completely unfettered. There are privacy protocols in place to protect users from unwanted intrusions. Basically, to gain access to objects like user friend lists and user checkins require permission from the target user. Permission is requested by Facebook † These numbers are as of March http://www.facebook.com/press/info.php?statistics 2012 and are available at 23 applications (“apps” for short) created by other Facebook users called developers. Every app has a unique id and leverages the data available from Facebook, including private data with permission, to accomplish a seemingly endless number of tasks. The Facebook Graph API is not limited to Facebook proper. Facebook and those who love Facebook have developed multiple software development kits (SDKs), development tools and programming language libraries in almost any programming language you can imagine. These tools provide Facebook Platform support to the myriad web apps that permeate the world wide web. This work leveraged just such a programming language,python, which is completely free and open source. Some have managed to analyze the entire Facebook network [6, 38] ,but in practice, the Facebook network is too large for casual analysis. Furthermore, since the Facebook network is obviously not random, analyzing a random sample of Facebook would be useless. The size of Facebook combined with the privacy restrictions pose a serious hindrance for attaining interesting data that is characteristic of an actual social network. These hindrances are not overcome without some compromise. In section 5, we describe a methodology for analyzing coherent, self-contained Facebook sub-networks of a tractable size. 5. Graph Creation Methodology Today we have access to unprecedented amounts of data. For the most part, we also have access to the means for analyzing the data. Online social networks are probably the most imminent of the available data sources. Social networks are dynamic and fluid. Because of this, the data collected about them will also be dynamic and fluid. So, to properly understand the data of today’s ever changing online environment, one 24 must employ methods and means that are similarly dynamic. This is not an easy task, and to date the burden of progress has been placed on sophisticated algorithms run on powerful computers, analyzing stagnant and often constricted data. Simply put, small data sets are prevalent and easy to apply statistical analysis to, but often do not have the required complexity for interesting behavior. Yet large data sets are more difficult to obtain and analyze, but often render more significant results. In the case of large data sets, the nature of the data set only facilitates the analysis of a “snapshot” of the data. In the spirit of small world networks, there has to be something in between. Networks small enough to be analyzed dynamically, but large enough to render significant results. Networks unique enough to not be commonplace, but in enough abundance to analyze ensemble statistics. A small network is characterized by a short average path between nodes and nonvanishing clustering. After a certain social psychologist decided to study the connectedness of a small group of Nebraska natives and a Massachusetts stockholder [25, 26], the phrase, “6 Degrees of Separation” was born into the intellectual realm‡ The existence of “6 degrees of separation” is relatively un-contested. In fact, the logical implications of a network where any node has k connections is that any node is within 6 steps of k 6 other nodes. Even for small values of k, that is a large number of connections. Since small world networks are somewhere in between perfectly random and perfectly ordered, it is natural to see features of both to be present in a small ‡ The first mention of such a concept should be attributed to the Hungarian novelist, Frigyes Karinthy. As a novelist, Karinthy rendered no actual study of the topic, only insight beyond his time. 25 world network. That is, random networks display the small characteristic lengths between nodes and ordered networks have high clustering. But, neither have both. 5.1. Python Overview. We chose Python for its syntax, availability, versatility, community and support. Python is a very readable language. That is, the code can be interpreted in an intuitive manner without the need for extensive experience in programming. Hopefully, the ease at which one can understand Python can be seen in appendix A. Python is free to use and available on almost every operating system. It has user developed modules ‡ to interact with almost any other language with almost limitless potential and unbounded versatility. Modules that expand the capability of Python are developed by people that use Python. The Python community is expansive and if you do not know how to accomplish a task with Python, it is not difficult to find someone who does. Since Python is open source, this is all free and available to anyone. We are not alone in our admiration of Python and definitely not the first to leverage its power for scientific endeavors. Researchers in Los Alamos have already found Python well suited for molecular dynamics modeling and Python is still gaining support in the scientific computing community. [10] This work used Python to collect data, visualize data and analyze data. Data is collected using user developed code to interface with the Facebook Graph API. After the data is collected, the NetworkX [5] module is leveraged to create the networks and calculate the network parameters. Finally, the networks are visualized with the aid of matplotlib [20]. The modules used in this work are not unique, but they have ‡ A module is a collection of Python code used to augment to capability of the core Python installation. The modules used in this work are listed in appendix A 26 a diverse feature set that lent itself well to the purposes of this thesis. The code used to accomplish these tasks are included in appendix A. 5.2. Node Creation. To say two people are separated by k degrees is tantamount to saying that any two people are separated by k circle of acquaintances or k spheres of influence or as we have chosen to interpret in this thesis, k groups of Facebook users. That is, the idea of analyzing groups is an intuitive continuation of the logic behind small world networks. Groups are collections of people connected via a common context, a set of facts and circumstances that surround a situation, event or concept. Naturally, to move beyond a set of isolated groups one must assume that the average person belongs to multiple groups. This is a feasible as not many people would choose to define themselves by a single context. To this end, in order to analyze Facebook groups one must operate under a small number of restrictions. First, the method for retrieving groups of a common context is limited to a search for groups with a single word in the group’s title§. For example, using the definitions of section 3, if we decided to form a network of groups with word in the title, we might have Fword = {word, Words, I love words, WORD!!!,...}. Furthermore, one can combine search results of multiple words to create multi-word networks. To develop multi-word networks as networks defined by a unifying context, we combined words that could define a community, circle of acquaintances or field of interest. For example, to create a network from the words, foo and bar, we might have Ffoo = {foo, FOOD!,foobar? No Fools, FOOBAR!!!,...} § The Facebook API search functionality is not case sensitive 27 and Fbar = {Bar Hopping, baristas unite, No Holds Barred!!, FOOBAR!!!,foobar?...} from which we could form Ffoobar = Ffoo ∪ Fbar 5.3. Edge Creation. The nodes of our networks are Facebook groups. As such, each node has some number of members. Recall that two such nodes are adjacent or linked if they contain at least one mutual member. That is, given two nodes, xi , xj ∈ Fw , (xi , xj ) ∈ Lw iff xi ∩ xj 6= ∅. The process for generating Facebook groups networks is depicted pictorially in diagram 3. 5.4. Pruning. Once all the edges have been added, the networks are pruned to improve the results of analysis. Namely, there are two criteria for nodes to be removed from a network: 1. All nodes with degree = 0, also called isolated nodes, are removed. 2. Every node of every subgraph with l = 1 are removed. Removing isolated nodes and subgraphs with an average shortest path equal to 1 help to reduce the skew of data that would result if they were not removed. That is, for a network composed of multiple connected ‡ subgraphs, the average shortest path of the network becomes the average of the average shortest path for each connected subgraph. Clearly, having an abundance of isolated nodes would reduce the clustering coefficient of the network and the presence of subgraphs with L = 1 would reduce the average shortest path of ‡ A graph is said to be connected if there is a path from every node in the graph to every other node, where a path is any sequence of nodes such that every consecutive pair of nodes in the sequence is connected by an edge in the graph. 28 Figure 3. The process of creating a network out of Facebook groups of various size, where two nodes (groups) are connected if they have at least one mutual member. For example, the red group and yellow group have two mutual members and are therefore connected in the lower network depiction. By construction, isolated nodes are removed from the network. The purple group has three members that are not in any of the other groups and is therefore removed. the network. In both cases, the effects of not removing such nodes would cause the data to exhibit features not characteristic of the phenomenon we wish to quantify. In essence, the pruning process enabled us to analyze the largest, connected, subgraph of each keyword graph. We present the single words and word groups that we used to generate networks in Appendix B. The reason for combining multiple keywords is to create larger networks with a unifying context for statistical purposes. In fact, we will be combining multiple keywords in an attempt to understand network growth and scaling. Not every word renders interesting results, but in general every network larger than approximately 40 29 nodes, exhibited the same general degree distribution shapes, low average path lengths (≈ 2) and high clustering (on average about three times the expected clustering for a random graph with the same number of nodes and edges). The networks were created using python and the code can be found in Appendix A. 6. Analysis and Visualization This section contains the main results obtained by analyzing the Facebook group networks. We divide the section in two subsections that focus on network visualizations and numerical characteristics. 6.1. Network Visualization. We start by presenting several networks created according to the rules of Section 5 and listed in Appendix B. In Figures 4-12 we visualize several of those networks. In each figure we generate the nodes of the network and the links between them. The visualizations are rendered via an implementation of the Fructerman-Reingold force-directed algorithm [17], which tends to create highly clustered “cores”. We also include in the title the key words used to generate the networks and the numerical characteristics: the number of nodes N , the number of links L, the average connectivity < k >, the average clustering coefficient C, and the average path length l. Observe the high density of the links in the networks anime, bieber, and muslim networks for which we provide also a “zoom-in” view, as opposed to the very sparse army and navy networks. We note that anime, bieber, and muslim networks exhibit small world network features. Visualization is an important aspect to consider when analyzing networks. For example, the navy network of figure 11 may seem only slightly different from the army network depicted in figure 10. Yet, when visualized using an implementation 30 Figure 4. Facebook groups network created from groups with the word, anime, in the title. The numerical characteristics are: N = 276, L = 9770, C̄ = 0.708, l = 1.837, < k >= 70.797, indicating short path lengths and a clustering coefficient larger than the corresponding one for a random network with similar topological aspects. Figure 5. An enlarged depiction of the “core” of figure 4. 31 Figure 6. Plot of bieber network. The numerical characteristics are: N = 117, L = 1714, C̄ = 0.646, l = 1.903, < k >= 29.299, indicating short path lengths and a clustering coefficient larger than the corresponding one for a random network with similar topological aspects. Figure 7. An enlarged depiction of the “core” of figure 6. 32 Figure 8. Plot of muslim network. The numerical characteristics are: N = 233, L = 4326, C̄ = 0.564, l = 2.121, < k >= 37.133, indicating short path lengths and a clustering coefficient larger than the corresponding one for a random network with similar topological aspects. Figure 9. An enlarged depiction of the “core” of figure 8. of the Fructerman-Reingold force-directed algorithm [17](figures 12 - 13), it becomes evident that while the army network is a single, moderately clustered network, the navy network is composed of two isolated sub graphs. The two main features of the 33 Fructerman-Reingold force directed algorithm are 1) Vertices connected by an edge should be drawn near each other. 2) Vertices should not be drawn too close to each other. In general, how close vertices should be placed depends on how many there are and how much space is available. All vertices repel each other but connected vertices attract. This leads to vertices with low connectivity being overcome by repelling forces and a core of highly connected vertices to form near the center of the graph. In this fashion, the model is driven, but not entirely constrained, by the physical analogy of a collection of charged particles connected by springs. Figure 10. Plot of army network where the nodes are randomly distributed on a ring. The numerical characteristics are: N = 48, L = 189, C̄ = 0.434, < k >= 7.875 and l = 2.32. It is difficult to gain much knowledge from a graph of this size, but it is useful when analyzing the behaviors of network parameters as N is increased. Now, observe that the individual group networks presented in Figures 4-12 are fairly small. To make sure that our networks are large enough for a more accurate analysis, we have combined individual key words into categories, and then created larger networks from these categories. The categories that will be used in graphical 34 Figure 11. Plot of navy network where the nodes are randomly distributed on a ring. The numerical characteristics are: N = 20, L = 31, C̄ = 0.504, < k >= 3.1, and l = 1.83. This is another example of small network. It serves as a good conceptual comparison to figure 10, due to its generating keyword. Figure 12. Plot of army network using the Fructerman-Reingold force-directed algorithm. representations in this work are presented in Table 1 and Appendix B. We will focus on these larger networks in what follows. Figures 14 and 15 show what a larger network actually looks like and the complexity that manifests by doubling the number of nodes and adding more related groups. 35 Figure 13. Plot of navy network using the Fructerman-Reingold force-directed algorithm. Other multi-word graphs display comparable levels of complexity, but it should be noted that the 3 cores evident in figures 14 and 15 are characteristic of the Religion multi-word graph force-directed rendering and may or may not be seen in other graphs. The multi-word graph, Sports is presented in Figures 16 17 for comparison. Although the cores look slightly different in the Religion versus Sports visualizations, it is the mathematical analysis that will shed more light upon the similarities and differences between them. However, such visualizations allow us to have a pictorial view of the networks under consideration as a starting point for our analysis. We also notice the impact of increasing the network size on the level of complexity of the networks. Now that we have a pictorial view of some of the networks under consideration, we focus on the numerical characteristics and identify some mathematical similarities and differences between the categories. In table 3 we present a complete list of the numerical characteristics for each of the networks considered in this analysis. We also 36 Table 1. Ten categories of key words used in the mathematical analysis of Facebook group networks. Category Key words Religion islam, muslim, bible, atheist, catholic, christian, god, allah, mormon, pagan, pope Politics republican, romney, vote, government, politic, election, democrat, obama, senate Hobby art, gun, fish, collect, craft, comic, cook, eat, hunt, stamp Sexes boy, man, gay, girl, lesbian, male, woman Anime naruto, bleach, dragonball, DBZ, OnePiece, Gundam, anime, saiyan, sasuke Minorities africa, white, black, american, china, asian, hispanic, latin, greek, KKK, nazi Science physics, fractal, chaos, nuclear, math, science Music bieber, gaga, beyonce, spears, kesha, kardashian, MTV Sports basketball, soccer, football, baseball, golf, tennis, bowling, swim Computer ipad, ipod, android, PC, mac, computer include the analog values for random networks for comparison. We observe that the studied networks exhibit average clustering coefficients 3-10 times larger than those for random networks, while maintaining relatively small values of the average path length. Thus we claim that both small world requirements (1) and (2) are satisfied 37 for Facebook group networks. We provide details in what follows, and refer to power law distributions as well. 6.2. Numerical Characteristics. First let us look at the degree distribution for the networks of Table 1, indicated accordingly in the titles of Figure 18. For each network we generate the distribution of the connectivity values k and we present the results on log-log plots. Observe that for smaller networks the plots are not as structured. However, for larger networks it is apparent that the distribution is decreasing with increased connectivity and has a rather mild curvature, which means that it appears to be mostly linear. This is typical for power law distributions. We do not observe a pronounced peak typical for random networks. As a matter of fact, we have used the generalized Pareto fit tool of MATLAB [24], to fit a power law distribution to several degree distributions. One sample is shown in Figure 19 for the Religion network, which is the largest in this analysis. The fit is not performed on log-log scale as seen from the graph. The parameters are γ, the power, and ζ, which is the truncated Riemann zeta-function as described in formula (15). It is basically the scaling factor necessary for normalizing the probability distribution function of the connectivity levels. Observe that the best fit for all the data points yields 1 < γ < 2, but very close to 1, and that the actual data tail seems to be heavier than the estimated distribution. Thus, if we were to fit only the tail we would obtain γ values closer to 2, which would be closer to what was found in the literature for say, various biological networks. In Table 2 we present the fitted parameters for the largest networks. All of them yield similar results. 38 We use a t-distribution for small samples, to find a confidence interval for the average of the shape parameter, γ. We recall that if γ̄ and s are the values of the mean and the standard deviation of a random sample of size n from a normal distribution (assumed here for the shape parameter), then s s γ̄ − tα/2,n−1 · √ , γ̄ + tα/2,n−1 · √ n n is a (1 − α)100% confidence interval for the mean of the underlying distribution of the sample. We find a 95% confidence interval for the average γ to be (0.934, 1.499). Barabási and Albert [2] collected the parameter values for several networks with power law degree distributions. It is interesting to note that the networks with similar values for γ (≈ 1) are the Ythan estuary and Silwood park networks, both of which are un-directed ecological networks. On the other hand other internet networks have been shown to exhibit a power law scaling factor 2 < γ < 3 [2]. More recently, a study of the the social graph of active Facebook users does not yield a strict power law distribution for the degree distribution [38]. The same phenomenon is observed here for group networks, so a strict power law fit may not be the most appropriate approach to Facebook degree distributions. Thus an aggregation of individual users into groups renders a similar behavior. This is typical of scale invariance as discussed in Equation (16). At the same time we provide a boxplot representation of the values of the scaled connectivity values k/N for each of the ten networks of Table 1 for an easier comparison of the connectivity with respect to the network size. This is shown in Figure 20, including the median, the first and third quartiles and the outliers. Observe that only the Sports network exhibits a significant departure from the overall trend of small 39 Figure 14. Plot of the Religion network. The individual key words used are: islam, muslim, bible, atheist, catholic, christian, god, allah, mormon, pagan, pope. The numerical characteristics are N = 1558, L = 39547, C̄ = 0.543, l = 2.723, < k >= 50.766. Figure 15. An enlarged depiction of the “core” of figure 14. Observe the three smaller cores and notice the density of links. 40 Figure 16. Plot of the Sports network. The individual key words used are: basketball, soccer, football, baseball, golf, tennis, bowling and swim. The numerical characteristics are N = 342, L = 10356, C̄ = 0.715, l = 2.403, < k >= 60.561. Figure 17. An enlarged depiction of the “core” of figure 14. 41 Figure 18. The distribution of the connectivity values in log-log plots for the ten networks of Table 1, in decreasing order of network size. Each graph has the same general shape. As expected, the networks with more nodes render a log-log plot as the plot appears to degrade at N ≈ 350. Although the graphs appear to be mostly linear, the distributions cannot be characterized entirely by a power law decay. In general, the plots adhere to a power law for small values of k but then seem to decrease more quickly than a power law would predict for large values of k. This is seen with more precision in figure 19 where we attempt to fit a Pareto distribution to a degree distribution. median, indicating small connectivity for most nodes. However, this is one of the smallest networks considered, and thus we need further analysis with larger networks under this category. Finally, in reference to the degree distribution, in Figure 21 we plot the average connectivity, Y = Output = hki as a function of T = Target = N on a log-log 42 Figure 19. Pareto fit for the degree distribution. The estimated shape parameter γ is close to 1. However, the tail of the graph is actually heavier than the estimated fit. Table 2. Parameters of Pareto distribution, p(x) = ξ(γ)x−γ , fitted to several multiple keyword networks. Keyword Category N ξ γ Anime 839 0.4889 0.9354 Sexes 853 0.2533 1.3875 Hobby 929 0.1922 1.4237 Politics 1117 0.3025 1.3258 Religion 1558 0.3633 1.0082 plot. The plot clearly exhibits a positive correlation and linear regression analysis yields log(< k >) = 0.6 log(N ) + 0.19, thus the average connectivity increases as a power of the network size, and slower than N which is typical for random graphs. The correlation coefficient is R 0.78 which indicates a moderate linear relationship 43 Figure 20. Boxplot for the k/N values associated with the networks of Table 1. between N and < k >. We recall that sT Y R= √ sT T sY Y where sT T = n X i=1 2 (Ti − T̄ ) , sY Y = n X i=1 2 (Yi − Ȳ ) , sT Y n X = (Ti − T̄ )(Yi − Ȳ ) i=1 where Ti are the target sample points (inputs), Yi are the corresponding outputs, and T̄ , Ȳ are the associated sample means. We conclude here that the degree distribution of group networks exhibits a monotonic decrease at a fairly steep rate with increased connectivity, partially fitted by a straight line which corresponds to a power law distribution. Most groups have a rather small connectivity in comparison to the size of the network. For smaller networks, like Music, Sports, and Computer, these results are not very accurate, so more analysis with an increased number of keywords in each category will be needed in the future, paired with a more in-depth search for a best fit of the degree distribution. 44 Figure 21. Log-log plot of hki versus N for all the networks considered in this analysis, including single and multiple keywords. The fitted line shows hki ∝ N 0.6 . Now, let us look at the clustering coefficient defined in formula (3). We compute C̄(k) for all possible connectivity values k in each network, and plot them against k on a log-log scale in Figure 22. What we found is similar to other small world network studies [34, 38] where each plot exhibits an approximately logarithmic decay for large values of k. Again we provide a boxplot representation of the values of C(k) for each of the ten networks of Table 1 for an easier comparison of the average connectivity values for different degrees. This is shown in Figure 23. We note that overall the values are larger than the corresponding average clustering coefficients of random networks, thus suggesting the possibility of small world networks. In fact, the average clustering coefficients for the networks listed in table 1 are on average, 14 times larger than the average clustering coefficients of random networks. 45 Figure 22. The average clustering coefficient, C̄(k), against the degree, k, on a log-log scale for the ten networks of Table 1, in decreasing order of network size. In general, each plot indicates an approximate logarithmic decay for large k. These results are similar to other studies of small world networks [34, 38]. Observe that the largest network, Religion provides the most orderly plot, which is to be expected. Finally, in reference to the clustering coefficients, we present a plot of the average clustering coefficient C̄ versus the network size N in Figure 24 (top graph, blue dots) for all the networks considered in this study, not only the ten large networks constructed by aggregating smaller networks corresponding to the various categories. Most of the networks considered in this study have less than 400 nodes. We plot 46 Figure 23. Boxplot of the average clustering coefficients for individual connectivity values C(k) values associated with the networks of Table 1. on the same graph the corresponding values C̄rand for clustering coefficients of a random network with the same number of nodes and edges (red circles). We can notice immediately that the Facebook values are clearly higher than those for random networks, fulfilling requirement (1) of small world networks. Not only that, but by focusing only on the Facebook data for C̄ in Figure 25 we note an interesting result as Indian railway network has exhibited C̄ values that remain relatively constant as the network size increases [34]. Figure 25 has too few data points to definitively decide on the behavior of C̄ for large networks; however, to some extent it appears to steady out around C̄ 0.5 for larger values of N. This suggests that we need more sample networks with a large number of nodes, which will be subject for future research. At the same time we will need a systematic procedure for increasing the network size. For example adding keywords one by one for a slower increase of the network, besides expanding the selection of keywords and categories based on a dictionary search. These will be subject for future research. 47 Figure 24. Average clustering coefficient and average path length as functions of N. The blue dots indicate the Facebook data and the red circles indicate the corresponding values for a random network as listed in Table 1. There is a clear difference for the clustering coefficients; however it is hard to notice essential differences for the path lengths. Now, regarding the average path length, in Figure 24 (bottom graph) we plot the average path length for both the Facebook networks of this study (blue dots) and the corresponding random networks (red circles). The plots are clustered around small values and the average path length does not seem to exhibit an increase with increased network size. Notice the similarity between the Facebook and random network plots which suggests that requirement (2) of small world networks is also satisfied. However, using the information of Table 3 we obtain a boxplot of l, as well as the corresponding distribution in Figure 26 for both Facebook and random networks. Notice that although the median path length for Facebook groups is higher 48 Figure 25. Average clustering coefficient as a function of N. The red dotted line depicts the average value and possible steady state value for large N. than that of random networks, the variation is smaller, so that overall, the values for Facebook groups are slightly smaller than those for random networks. In general the average path length is mostly less than 3, which comes to supplement previous findings that the entire Facebook network of active users exhibits an average path length of about 4 [6], as opposed to the well known six degrees of separation paradigm. Thus the Facebook group networks are indeed examples of small world networks. In order to offer more precise statements regarding the distribution of average path length as shown in Figure 26 (right) we will have to generate a larger amount of data in the future. However, notice that despite the differences in the two distributions, they both have more or less the same behavior, which emphasizes requirement (2). We conclude that the Facebook group networks exhibit features of ordered networks and random networks. That is, they have large clustering coefficients like the 49 Figure 26. Average path length boxplot and distribution for both the Facebook group networks and random networks with the same number of nodes and links. ring lattice networks described by Watts and Strogatz [43] that do not vanish with increased networks, and small average path lengths characteristic of random graphs [2]. Thus we conclude that the networks under consideration in this study exhibit small world features. At the same time, the average connectivity increases as a power of the network size with approximation, while the average clustering coefficients and average path lengths do not exhibit a clear scaling with N . In the future we plan to expand this analysis to more and larger networks, and to supplement it with a more in-depth study of the numerical characteristics presented in this work and other suitable measures. 7. Conclusions and Future Work In this work we provided a study of the structure of Facebook group networks. We generated a number of networks using keywords for group selection, and we linked 50 groups with common members. We focused on a number of measures that can describe the structure of these networks. Most eminent among the features of the networks analyzed were our results concerning the degree distributions, and the behavior of the clustering coefficients and average shortest paths. Our networks have degree distributions that cannot be entirely characterized by a power law, clustering coefficients that are significantly larger than what would be expected for random networks and consistently small values for the average shortest paths, characteristic of random graphs. That is to say, our analysis has shown that Facebook group networks are small world networks. A unique element of this study is that we did more than analyze only one networks or a hand full of networks, we analyzed an ensemble of real world networks and were able to find relationships that otherwise could not have been found. Figures 21 and 25 attest to this fact and suggest an interesting path to take in future studies. Facebook group networks provide a nearly limitless source of data to analyze, both individually and collectively. If one was to study a larger set of networks, perhaps it would be more interesting to see how the networks differ as opposed to how they are similar. And if they are different, is there a finite set of network classes into which any given network can be placed based on its parameters? Furthermore, this analysis can lend itself to other fields as well. Most notably sociology. Although our methodology for network construction is efficient, it is not the only possibility. We took care to choose keywords that would most likely render interesting and mostly unambiguous results using a programmatic approach, but one could create networks using a more manual approach. Biased node selection could feasibly render networks that more realistically portray the collection of nodes and relations of which it is 51 meant to model. Sociological credence aside, it is quite surprising to see so many networks created of such consistent complexity and consistency. Our future work will expand this analysis to include more networks with larger sizes, in order to be able to generate a more complete statistical analysis of the Facebook group networks. Future studies should include other measures of interest such as efficiency, correlations, or spectral properties, as well as more sophisticated statistical analyses. Moreover, it would be important to analyze the dynamics of the networks over time, since Facebook groups could be added or removed from the networks based on the natural changes that occur, and links readjusted. This would imply observing Facebook over a significant amount of time to have sufficient data for statistical purposes. At the same time, it would be of interest to generate a dynamical system model of the social networks described here. More precisely, one can define some meaningful states of the nodes and associated rules that could help understand the complexity of this kind of large scale network. For instance, by focusing on the links in the networks, which are created based on individuals that are common to multiple groups, one could put a weight on the influence of one group over another group (the target) in terms of Facebook messages that are posted in the target group by the common members during a unit of time. By defining a suitable threshold function, one could label a node as active or inactive based on the aggregated influences of the individual input nodes through the common members. This way one could generate a Boolean network whose dynamics could be studied to provide some insight on the impact of multiple memberships of individuals on the activity of groups. 52 References 1. Paul Adams, Grouped, New Riders, 2012. 2. Réka Albert and Albert-László Barabási, Statistical mechanics of complex networks, Rev. Mod. Phys. 74 (2002), 47–97. 3. Réka Albert, Hawoong Jeong, and Albert-László Barabási, Diameter of the world-wide web, Nature 401 (1999), 130–131. 4. L. A. N. Amaral, A. Scala, M. Barthélémy, and H. E. Stanley, Classes of small-world networks, Proceedings of the National Academy of Sciences 97 (2000), no. 21, 11149–11152. 5. Daniel A. Schult Aric A. Hagberg and Pieter J. Swart, Exploring network structure,dynamics, and function using networkx, Aug 2008, pp. 11–15. 6. Lars Backstrom, Paolo Boldi, Marco Rosa, Johan Ugander, and Sebastiano Vigna, Four degrees of separation, CoRR abs/1111.4570 (2011). 7. Barabasi and Albert-Laszlo, Bursts: The hidden pattern behind everything we do, Dutton Adult, 2010. 8. Albert-Laszlo Barabasi, Linked: How Everything Is Connected to Everything Else and What It Means, reissue ed., Plume, April 2003. 9. V. Batagelj and A. Mrvar, PAJEK – Program for large network analysis, 1998. 10. David M. Beazley and Peter S. Lomdahl, Feeding a large-scale physics application to python, In 6th International Python Conference, 1997, pp. 21–28. 11. S. P. Borgatti, M. G. Everett, and L. C. Freeman, UCINET 6 For Windows: Software for Social Network Analysis, 2002. 12. Ulrik Brandes and Dorothea Wagner, Visone – analysis and visualization of social networks, GRAPH DRAWING SOFTWARE, Springer-Verlag, 2003, pp. 321–340. 13. Rama Cont and Emily Tanimura, Small world graphs: characterization and alternative constructions, Adv. Appl. Prob 40 (2007), 939–965. 14. Cyram, NetMiner Webpage, 2002. 53 15. Paul Erdős and Alfred Rényi, On random graphs, Publicationes Mathematicae Debrecen 6 (1959), 290–297. 16. Linton C. Freeman, Computer programs and social network analysis, Connections 11 (1988), 26–31. 17. T.M.J. Fruchterman and E.M. Reingold, Graph drawing by force-directed placement, Software, Practice and Experience 21 (1991), 1129–1164. 18. Mark Gnanovetter, The strength of weak ties, American Journal of Sociology 78 (1973), 1360– 1380. 19. Derek Hansen, Ben Shneiderman, and Marc A. Smith, Analyzing social media networks with nodexl: Insights from a connected world, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2010. 20. John D. Hunter, Matplotlib: A 2d graphics environment, Computing In Science & Engineering 9 (2007), no. 3, 90–95. 21. Abdul Salam Jarrah Hussein Vastani and Reinhard Laubenbacher, Visualization fo dynamics for biological networks. 22. D. Krackhardt, J. Blythe, and C. Mcgrath, KrackPlot 3.0: An improved network drawing program, Connections 17 (1994), no. 2, 53–55. 23. Vito Latora and Massimo Marchiori, Is the boston subway a small-world network?, Physica A: Statistical Mechanics and its Applications 314 (2002), no. 1–4, 109 – 113, Horizons in Complex Systems. 24. MATLAB, version 7.10.0 (r2010a), The MathWorks Inc., Natick, Massachusetts, 2010. 25. Stanley Milgram, The small world problem, Psychology Today 2 (1967), 60–67. 26. Stanley Milgram and Jeffrey Travers, An experimental study of the small world problem, Sociometry 32 (1969), 425–443. 27. M. E. J. Newman, The mathematics of networks, The New Palgrave Encyclopedia of Economics 2 (2007), no. 5, 1–12. 28. Mark Newman, Networks: An introduction, Oxford University Press, Inc., New York, NY, USA, 2010. 54 29. M.E.J. Newman, Models of the small world: A review, J. Stat. Phys. 101 (2000), 819–841. 30. William D. Richards and Ronald E. Rice, The negopy network analysis program, Social Networks 3 (1981), 215–223. 31. William D. Richards and Andrew J. Seary, Fatcat – for thick data. 32. John Scott, A toolkit for social network analysis, Acta Sociologica 39 (1996), no. 2, 211–216. 33. Liviu Panait Keith Sullivan Sean Luke, Claudio Cioffi-revilla and Gabriel Balan, Mason: A multi-agent simulation environment. 34. Parongama Sen, Subinay Dasgupta, Arnab Chatterjee, P. A. Sreeram, G. Mukherjee, and S. S. Manna, Small-world properties of the indian railway network, Phys. Rev. E 67 (2003), 036106. 35. Paul Shannon, Andrew Markiel, Owen Ozier, Nitin S. Baliga, Jonathan T. Wang, Daniel Ramage, Nada Amin, Benno Schwikowski, and Trey Ideker, Cytoscape: A Software Environment for Integrated Models of Biomolecular Interaction Networks, Genome Research 13 (2003), no. 11, 2498–2504. 36. Marc A. Smith, Ben Shneiderman, Natasa Milic-Frayling, Eduarda Mendes Rodrigues, Vladimir Barash, Cody Dunne, Tony Capone, Adam Perer, and Eric Gleave, Analyzing (social media) networks with nodexl, Proceedings of the fourth international conference on Communities and technologies, 2009, pp. 255–264. 37. J. Heidel T. Helikar, J. Konvalina and J. A. Rogers, Emergent decision-making in biological signal transduction networks, PNAS 105 (2008), no. 6, 1913–1918. 38. Johan Ugander, Brian Karrer, Lars Backstrom, and Cameron Marlow, The anatomy of the facebook social graph, CoRR abs/1111.4503 (2011). 39. Guido van Rossum et al., The Python programming language, 1991–. 40. Stanley Wasserman and Katherine Faust, Social network analysis: Methods and applications, 1 ed., Structural analysis in the social sciences, no. 8, Cambridge University Press, 1994. 41. Duncan J. Watts, Networks, Dynamics, and the Small-World Phenomenon, The American Journal of Sociology 105 (1999), no. 2, 493–527. 42. Duncan J. Watts, Six degrees: The science of a connected age, W. W. Norton & Company, 2003. 55 43. Duncan J. Watts and Steven H. Strogatz, Collective dynamics of small-world networks, Nature 393 (1998), no. 6684, 440–442. 44. Andy Wuensche, Exploring discrete dynamics: The ddlab manual, 2011. 56 Appendix A. Python Code A.1. Module Import. The following code imports the necessary python modules [39, 20, 5]. The Facebook module is open source and is available at https://github.com/pythonforfacebook/facebook-sdk. import json import urllib import urllib2 import Facebook import networkx as nx import matplotlib.pyplot as plt import decimal as dec import itertools as iter import os import sys import time import string import math A.2. Graph Generation. generate_graph() queries the Facebook Graph API to obtain the group ids for the search parameter, *searchwords. Once all the nodes have been retrieved using get_group_ids(), generate_graph() adds the edges using add_members(), removes isolated nodes, removes connected sub graphs with l = 1, computes the parameters of the network and returns the generated network. 57 dataLog() and errorLog() are generic functions for recording progress and troubleshooting errors, respectively. def dataLog(info): if not os.path.isdir(’data’): os.mkdir(’data’) filename = os.path.join(’data’,’nL_dataLog.txt’) log = open(filename,’a’) t = time.strftime("%d%b%y %H:%M:%S", time.localtime()) log.write("(%s) %s\n"%(t,info)) log.close() print(info) def errorLog(info): if not os.path.isdir(’data’): os.mkdir(’data’) filename = os.path.join(’data’,’nL_errorLog.txt’) log = open(filename,’a’) t = time.strftime("%d%b%y %H:%M:%S", time.localtime()) log.write("(%s) %s\n"%(t,info)) log.close() print(info) #Name: get_group_ids def get_group_ids(searchword): ids = [] 58 LIMIT = 100 i=0 while True: results = graph.request(’search’,{’q’:searchword,’type’:’group’, ’limit’:LIMIT,’offset’:LIMIT*i}) if not results[’data’]: break group_ids = [group[’id’] for group in results[’data’]] if len(group_ids) == 0: break ids += group_ids i += 1 save_data(ids,"ids_%(searchword)s.txt"%vars()) return ids #Name: add_members def add_members(G): try: title = G.graph[’title’] except: title = "<No Title>" count=1 N = nx.number_of_nodes(G) nodes = chunks(nx.nodes(G),100) 59 for item in nodes: groups = graph.get_objects(item,metadata=1) for g in groups: member_ids = [] group = groups[g] added = False broken = False while not added: try: conn = urllib2.urlopen(group[’metadata’] [’connections’] [’members’]) members = json.loads(conn.read())[’data’] for member in members: member_ids.append(member["id"]) G.node[group[’id’]][’members’]= member_ids added = True except urllib2.URLError as e: errorLog("URLError with %s node %s (%i of %i)"/ %(title,str(group[’id’]),count,N)) pass except KeyError: errorLog("KeyError with %s node %s (%i of %i)"/ %(title,str(group[’id’]),count,N)) 60 broken = True added = True except: errorLog("Error with %s node %s (%i of %i)"/ %(title,str(group[’id’]),count,N)) broken = True pass finally: conn.close() if added and not broken: dataLog("(%i of %i) Added %s node %s"/ %(count,N,title,str(group[’id’]))) count += 1 dataLog("All " + title + " nodes added.") #Name: generate_graph def generate_graph(*searchwords): app_ID = <secret> app_SECRET = <secret> key = facebook.get_app_access_token(app_ID,app_SECRET) keywords = [] for word in searchwords: if isinstance(word,str): 61 keywords.append(word) if isinstance(word,list): for i in word: keywords.append(i) save = True broken = False connected = True empty_graph = False G=nx.Graph() title = string.join(keywords,"+") G.graph[’title’] = title #Add nodes to graph, G. Each node has ’members’ attribute. keyword_group_ids = [] for word in keywords: keyword_group_ids.append(get_group_ids(word)) keyword_group_ids = nx.utils.flatten(keyword_group_ids)#flatten list keyword_group_ids = list(set(keyword_group_ids))#remove duplicate nodes if keyword_group_ids: G.add_nodes_from(keyword_group_ids) add_members(G) N = nx.number_of_nodes(G) 62 else: empty_graph = True #Add edges to graph, G. Each edge has a ’weight’ attribute. if not empty_graph: count = 0 perm = iter.combinations(G.nodes(),2) edges = (nx.number_of_nodes(G))*(nx.number_of_nodes(G) - 1)/2 try: for edge in perm: added = False while not added: try: adj = [i for i in G.node[edge[0]][’members’] / if i in G.node[edge[1]][’members’]] w = len(adj) if w > 0: G.add_edge(edge[0],edge[1],weight=w) dataLog("(%i of %i) Added %s edge (%s,%s)"/ %(count,edges,title,edge[0],edge[1])) added = True else: connected = False added = True 63 except: errorLog("Error adding %s edge (%s,%s)"/ %(title,edge[0],edge[1])) pass finally: count+=1 except: errorLog("Error adding edges to %(title)s graph"%vars()) finally: dataLog("All %(title)s edges added."%vars()) #Assign isolatedNodes parameter as graph attribute isolatedNodes = len(nx.isolates(G)) G.graph[’isolatedNodes’] = isolatedNodes #Remove isolated nodes from graph for i in nx.isolates(G): G.remove_node(i) #Remove connected subgraphs with average shortest path length = 1 #Calculate average shortest path length, assign to graph as attribute if connected and not empty_graph: G.graph[’averagePathLength’] = nx.average_shortest_path_length(G) else: 64 if not empty_graph: subgraphs = len(nx.connected_component_subgraphs(G)) pathLengths = [] for g in nx.connected_component_subgraphs(G): if nx.average_shortest_path_length(g) == 1: G.remove_nodes_from(g) subgraphs -= 1 else: pathLength = dec.Decimal(nx.average_shortest_path_length(g)) pathLengths.append(pathLength) try: averagePathLength = sum(pathLengths) / subgraphs G.graph[’averagePathLength’] = averagePathLength except ZeroDivisionError as e: empty_graph = True except: errorLog(/ "Error when calculating average path length for / %(title)s graph."%vars()) #Calculate <k>, average clustering coefficient and density. if not empty_graph: N = nx.number_of_nodes(G) nodeDegrees = [] 65 try: for i in nx.nodes(G): nodeDegrees.append(nx.degree(G,i)) k = dec.Decimal(sum(nodeDegrees)) / N G.graph[’k’] = k G.graph[’C’] = nx.average_clustering(G) except: errorLog("Error when calculating C and <k> for %(title)s graph."/ %vars()) broken = True try: G.graph[’density’] = nx.density(G) except: errorLog("Error when calculating density for %(title)s graph."/ %vars()) broken = True #If not empty or broken, save graph data. if not broken and not empty_graph: return G if empty_graph: return emptyGraph(name=title) errorLog("%(title)s has no connected groups / (%(isolatedNodes)i isolated groups)."%vars()) 66 if broken: return emptyGraph(name="ERROR") errorLog("Error during %(title)s graph."%vars()) #Name: save_data def save_data(data,filename,path=’data/ids’,method=’w’): if not os.path.isdir(path): os.mkdir(path) filepath = os.path.join(path,"%(filename)s.txt"%vars()) file = open(filepath,method) for i in data: file.write("%s\n"%(str(i))) file.close() def save_data_to_CSV(info): if not os.path.isdir(’data’): os.mkdir(’data’) filename = os.path.join(’data’,’nL_data.txt’) log = open(filename,’a’) t = time.strftime("%y%m%d%H%M%S", time.localtime()) log.write("%s, %s\n"%(t,info)) log.close() dataLog("Saved ’%(info)s’ to nL_data.txt."%vars()) 67 def save_to_LaTeX_single(minN=40): output = [] if not os.path.isdir(’data’): os.mkdir(’data’) filename = os.path.join(’data’,’nL_data.txt’) input = open(filename,’r’) data = input.readlines() input.close() data = [string.split(i,’,’) for i in data] print(data) for i in data: temp = [k.strip() for k in i] del temp[0] N = dec.Decimal(eval(temp[1])) if N > minN and ’+’ not in temp[0]: L = dec.Decimal(eval(temp[2])) C = dec.Decimal(eval(temp[3])) l = dec.Decimal(eval(temp[4])) k = dec.Decimal(eval(temp[5])) temp[0] = "\\textsl{%s}"%(temp[0]) temp[1] = N temp[2] = L temp[3] = C temp[4] = k/N 68 temp[5] = l temp[6] = math.log(k)/math.log(N) temp[7] = k temp = rounded_list(temp,3) temp[1] = int(float(temp[1])) temp[2] = int(float(temp[2])) temp = [str(i) for i in temp] temp = string.join(temp,’ & ’) temp = "%(temp)s \\\\"%vars() output.append(temp) output = sorted(list(set(output)),key=lambda sortParam: eval(string.split(sortParam,’&’)[1]) ) header = ["\\begin{table}",/ "\\caption{caption here}\\label{single}",/ "\\begin{center}",/ "\\begin{tabular}{|c|c|c|c|c|c|c|c|}"] title = ["\\hline","Keyword & N & L & $C$ & $C_{rand}$ & $l$ & $l_{rand}$ & $\langle k\rangle$ \\\\","\\hline"] footer = ["\\hline","\\end{tabular}","\\end{center}","\end{table}"] output = header + title + output + footer 69 filename = os.path.join(’data’,’nL_LaTeX_Table_single.txt’) file = open(filename,’w’) for i in output: file.write("%(i)s\n"%vars()) file.close() def save_to_LaTeX_multi(minN=40): output = [] if not os.path.isdir(’data’): os.mkdir(’data’) filename = os.path.join(’data’,’nL_data.txt’) input = open(filename,’r’) data = input.readlines() input.close() data = [i[0:len(i)-3] for i in data] data = [string.split(i,’,’) for i in data] for i in data: temp = [k.strip() for k in i] del temp[0] N = dec.Decimal(eval(temp[1])) if N > minN and ’+’ in temp[0]: L = dec.Decimal(eval(temp[2])) C = dec.Decimal(eval(temp[3])) l = dec.Decimal(eval(temp[4])) 70 k = dec.Decimal(eval(temp[5])) temp[0] = "\\textsl{%s}"%(temp[0]) temp[1] = N temp[2] = L temp[3] = C temp[4] = k/N temp[5] = l temp[6] = math.log(k)/math.log(N) temp[7] = k temp = rounded_list(temp,3) temp[1] = int(float(temp[1])) temp[2] = int(float(temp[2])) temp = [str(i) for i in temp] temp = string.join(temp,’ & ’) temp = "%(temp)s \\\\"%vars() output.append(temp) output = sorted(list(set(output))) header = ["\\begin{table}", "\\caption{caption here}\\label{multi}", "\\begin{center}", "\\begin{tabular}{|c|c|c|c|c|c|c|c|}"] title = ["\\hline","Keyword & N & L & $C$ 71 & $C_{rand}$ & $l$ & $l_{rand}$ & <k> \\\\","\\hline"] footer = ["\\hline","\\end{tabular}","\\end{center}","\end{table}"] output = header + title + output + footer filename = os.path.join(’data’,’nL_LaTeX_Table_multi.txt’) file = open(filename,’w’) for i in output: file.write("%(i)s\n"%vars()) file.close() #Name: save_degree_sequence def save_degree_sequence(G,min=30): graph = emptyGraph() if isinstance(G,str): if G in searchwords(): graph = load_graph(G) else: if not isEmptyWordGraph(G): graph = generate_graph(G) save_graph(graph) if type(G) == type(graph): graph = G 72 degree_sequence=list(nx.degree(graph).values()) output = "" for k in degree_sequence: if output == "": output = "%(k)i"%vars() else: output = "%(output)s\n%(k)i"%vars() if not os.path.isdir(’data/deg_seq’): os.mkdir(’data/deg_seq’) filename = os.path.join(’data/deg_seq’,"deg_seq_%s.txt"/ %(graph.graph[’title’])) if len(degree_sequence) > min: log = open(filename,’w’) log.write(output) log.close() dataLog("Saved \"%s\" degree sequence to file."%(graph.graph[’title’])) else: errorLog("The %s graph had fewer than %i nodes / and degree sequence was not saved."%(graph.graph[’title’],min)) #Name: save_C_vs_k def save_C_vs_k(graph): if isinstance(graph,str): G = generate_graph(graph) 73 else: G = graph C = nx.clustering(G).keys() x = [G.degree(i) for i in nx.clustering(G)] degrees = list(set(x)) C_List = [] for k in degrees: temp = [nx.clustering(G)[i] for i in C if isEqual(G.degree(i),k)] average_C_k = sum(temp)/len(temp) C_List.append(average_C_k) data = zip(degrees,C_List) data = ["%s, %s"%(str(i[0]),str(i[1])) for i in data] save_data(data,"C_vs_k_%s"%(G.graph[’title’]),path=’data/C_vs_k’) dataLog("Saved %s graph C vs. k data to file."%(G.graph[’title’])) #Name: save_graph def save_graph(graph,stamp=False): if isinstance(graph,str): G = generate_graph(graph) else: G = graph #Assign parameter values in preparation for writing to a file. pickle = True N = len(nx.nodes(G)) 74 L = len(nx.edges(G)) isolatedNodes = G.graph[’isolatedNodes’] density = G.graph[’density’] k = G.graph[’k’] C = G.graph[’C’] l = G.graph[’averagePathLength’] timeStamp = time.strftime("%m%d%H%M", time.localtime()) if not os.path.isdir(’data’): os.mkdir(’data’) #Save G parameters to CSV format. title = G.graph[’title’] if N == 0: pickle = False errorLog("%(title)s graph contained no data. Nothing was saved."%vars()) else: save_data_to_CSV("%(title)s, %(N)i, %(L)i, %(C)f, %(l)f, %(k)f, %(isolatedNodes)i, 75 %(density)f"%vars()) save_degree_sequence(G) save_C_vs_k(G) dataLog("Saved %(title)s graph data."%vars()) #Pickle G data if not os.path.isdir(’data/gpickle’): os.mkdir(’data/gpickle’) if stamp: gpickleName = "data/gpickle/%(title)s_%(timeStamp)s.gpickle"%vars() else: gpickleName = "data/gpickle/%(title)s.gpickle"%vars() try: if pickle: nx.write_gpickle(G,gpickleName) dataLog("Pickled %(title)s graph data successfully."%vars()) except: errorLog("Error attempting to pickle %(title)s graph data."%vars()) #Name: load_data #Input: filename of file from which to load data #Output: input data from save_data call def load_data(filename,path=’ids’): filepath = os.path.join(’data’,path,"%(path)s_%(filename)s.txt"%vars()) 76 file = open(filepath,’r’) data = [] input = file.readlines() for i in input: try: data.append(eval(i[0:len(i)-1])) except: data.append(i[0:len(i)-1]) return data #Name: load_graph #Input: searchword used to create graph with generate_graph #Output: Returns graph data of generate_graph(searchword) def load_graph(searchwords,path="data/gpickle"): results = [] found = False dirList = os.listdir(path) if isinstance(searchwords,str): searchword = searchwords for filename in dirList: if searchword in filename and ’+’ not in filename: print(filename) results.append(filename) found = True 77 if isinstance(searchwords,list): perm = iter.permutations(searchwords,len(searchwords)) searchList = [] for i in perm: searchList.append(string.join(i,’+’)) for word in searchList: for filename in dirList: if word in filename: searchword = word results.append(filename) found = True if results and found: results = [i[0:len(i)-8] for i in results] results = [i for i in results if isEqual(i,searchword)] filename = "%s.gpickle"%(results[0]) return nx.read_gpickle("%(path)s/%(filename)s"%vars()) else: errorLog("No graphs with %(searchword)s in title."%vars()) return emptyGraph() #Name: plot_degree_sequences def plot_degree_sequences(*graphs): plt.clf() graphType = emptyGraph() 78 names = [] plots = [] graphList = [] for i in graphs: if isinstance(i,str): if i in searchwords(): graphList.append(load_graph(i)) names.append(load_graph(i).graph[’title’]) else: g = generate_graph(i) save_graph(g) graphList.append(g) names.append(g.graph[’title’]) if isinstance(i,list): for k in i: if type(k) == type(graphType): graphList.append(k) names.append(k.graph[’title’]) else: if k in searchwords(): graphList.append(load_graph(k)) names.append(load_graph(k).graph[’title’]) else: if not isEmptyWordGraph(k): 79 g = generate_graph(k) save_graph(g) if not isEmptyWordGraph(k): graphList.append(g) names.append(g.graph[’title’]) else: errorLog("%(k)s graph omitted from degree sequence plot"%vars()) if type(i) == type(graphType): graphList.append(i) names.append(i.graph[’title’]) name = string.join(names,"/") markers = ["o","D","s"] marker = 0 colors = ["red","blue","green","yellow","cyan","magenta","black"] color = 0 for G in graphList: if nx.number_of_nodes(G) != 0: degree_sequence=sorted(list(set(nx.degree(G).values())), reverse=True) dmax=max(degree_sequence) plot, = plt.loglog(degree_sequence,colors[color], marker=markers[marker]) marker += 1 80 marker = divmod(marker,len(markers))[1] color += 1 color = divmod(color,len(colors))[1] plots.append(plot) save_degree_sequence(G) plt.ylabel("degree") plt.xlabel("rank") nodes = ["N=" + str(nx.number_of_nodes(i)) for i in graphList] labels = zip(names,nodes) plt.legend(plots,labels,loc=1) dataLog("Created degree distribution plot of %(name)s."%vars()) #Name: plot_C_vs_k def plot_C_vs_k(searchwords,path="data/C_vs_k"): results = [] found = False dirList = os.listdir(path) if isinstance(searchwords,str): name = searchwords searchword = searchwords for filename in dirList: if searchword in filename: results.append(filename) if isinstance(searchwords,list): 81 name = string.join(searchwords,’+’) perm = iter.permutations(searchwords,len(searchwords)) searchList = [] for filename in dirList: if name in filename: searchword = name results.append(filename) found = True if not found: for i in perm: searchList.append(string.join(i,’+’)) for word in searchList: for filename in dirList: if word in filename: searchword = word results.append(filename) if results: results = [i[7:len(i)-4] for i in results] results = [i for i in results if isEqual(i,searchword)] data = load_data(results[0],’C_vs_k’) x = [i[0] for i in data] y = [i[1] for i in data] plt.semilogx(x,y,’ro’,ms=5) plt.ylabel("C(k)") 82 plt.xlabel("k") else: dataLog("No C(k) vs k data") G = generate_graph(searchwords) save_graph(G) plot_C_vs_k(string.split(G.graph[’title’],’+’)) def plot_vs_N(y,loglog=False): xData = [] yData = [] if not os.path.isdir(’data’): os.mkdir(’data’) filename = os.path.join(’data’,’nL_data.txt’) input = open(filename,’r’) data = input.readlines() input.close() data = [string.split(i,’,’) for i in data] N = [i[2] for i in data] C = [i[4] for i in data] l = [i[5] for i in data] k = [i[6] for i in data] if y == "C": xData = N 83 yData = C xlabel = "N" ylabel = "C" if y == "l": xData = N yData = l xlabel = "N" ylabel = "l" if y == "k": xData = k yData = N xlabel = "<k>" ylabel = "N" output = zip(xData,yData) output = ["%s, %s"%(i[0],i[1]) for i in output] save_data(output,filename="%(y)s_vs_N.txt"%vars(),path=’data’) if loglog: plt.ylabel("log(%(ylabel)s)"%vars()) plt.xlabel("log(%(xlabel)s)"%vars()) plt.loglog(xData,yData,’bo’) else: plt.ylabel(ylabel) plt.xlabel(xlabel) 84 plt.plot(xData,yData,’ro’) #Name: plot_graph def plot_graph(G,layout): plt.clf() name = G.graph[’title’] k = str(round(G.graph[’k’],3)) N = str(nx.number_of_nodes(G)) L = str(nx.number_of_edges(G)) C = str(round(G.graph[’C’],3)) l = str(round(G.graph[’averagePathLength’],3)) density = str(round(G.graph[’density’],3)) plt.title("\"%(name)s\", N = %(N)s, L = %(L)s, <k> = %(k)s, C = %(C)s, l = %(l)s"%vars()) if layout == "circular": pos = nx.circular_layout(G) nx.draw(G,pos,with_labels=False,node_size=30,width=0.1) else: pos = nx.spring_layout(G) nx.draw(G,pos,with_labels=False,node_size=35,width=0.3,iterations=100) if not os.path.isdir(’data/graphs’): 85 os.mkdir(’data/graphs’) plt.savefig("data/graphs/%s_%s.png"%(G.graph[’title’],layout)) #Name: emptyGraph #Input: none #Output: generates empty keyword graph with set parameters equal to zero def emptyGraph(name="None"): G = nx.Graph() G.graph[’title’]=name G.graph[’isolatedNodes’]=0 G.graph[’density’]=0 G.graph[’k’]=0 G.graph[’C’]=0 G.graph[’averagePathLength’]=0 return G #Name: searchwords def searchwords(path="data/gpickle"): searchwords = [] try: dirList = os.listdir(path) for i in dirList: searchwords.append(string.split(i,".")[0]) except: searchwords = ["error"] 86 errorLog("Failed to retrieve list of created keyword graphs.") finally: return sorted(searchwords) #Name: isEmptyGraph def isEmptyWordGraph(searchword): empty = False if not os.path.isdir(’data’): os.mkdir(’data’) filename = os.path.join(’data’,’nL_errorLog.txt’) log = open(filename,’r’) errMsgs = log.read().split("\n") searchStr = "%(searchword)s graph contained no data."%vars() for i in errMsgs: if searchStr in i: empty = True log.close() return empty def isEqual(a,b): if a == b: return True else: return False def rounded_list(input,digits): tempList = [] 87 for i in input: try: tempValue = eval(str(i)) tempValue = round(tempValue,digits) tempValue = str(tempValue) except: tempValue = i finally: tempList.append(tempValue) return tempList def chunks(l, n): return [l[i:i+n] for i in range(0, len(l), n)] 88 Appendix B. Tables This Appendix contains the table of all numerical characteristics obtained for the Facebook networks under consideration. All the networks are listed by categories and keywords. The data presented below supports the claim that the Facebook group networks of this thesis exhibit features of ordered networks and random networks. That is, they have large clustering coefficients like the ring lattice networks described by Watts and Strogatz [43], and small average path lengths characteristic of random graphs [2]. Thus we conclude that the networks under consideration in this study exhibit small world features. 89 Table 3. The general characteristics of several Facebook group networks. Number of nodes, N, number of edges, L, average clustering coefficient, C̄, average shortest path, l, average connectivity, hki. The networks are grouped by categories and listed in decreasing order of N. The corresponding values of C̄ and l for random networks with N nodes and L edges, C̄rand and lrand , are included for comparison. This table is continued on several pages. Keyword N L C̄ C̄rand l lrand hki Religion 1558 39547 0.543 0.033 2.723 1.872 50.766 islam 329 6102 0.577 0.113 2.215 1.604 37.094 muslim 233 4326 0.564 0.159 2.121 1.508 37.133 bible 210 2907 0.594 0.132 2.157 1.610 27.686 atheist 194 4745 0.766 0.252 1.818 1.354 48.918 catholic 155 1905 0.687 0.159 2.071 1.575 24.581 christian 106 1335 0.668 0.238 1.988 1.445 25.189 god 78 255 0.461 0.084 2.653 2.320 allah 75 388 0.554 0.138 2.423 1.848 10.347 pagan 31 133 0.707 0.277 1.888 1.598 8.581 mormon 19 37 0.381 0.205 2.398 2.166 3.895 pope 0 0 0 0 0 0 6.538 0 90 N L Politics 1117 9856 republican 299 10248 0.695 0.229 1.617 1.348 68.548 romney 193 1665 0.682 0.089 2.351 1.848 17.254 vote 140 380 0.374 0.039 2.371 2.921 5.429 government 104 219 0.349 1.936 3.230 4.212 politic 76 211 0.4 0.073 3.268 2.526 5.553 election 59 83 0.293 0.048 2.074 3.942 2.814 democrat 27 53 0.524 0.145 2.875 2.410 3.926 obama 20 85 0.624 0.425 senate 13 16 Hobby 929 8896 art 291 4521 gun 138 425 0.393 0.045 2.863 2.710 fish 80 421 0.632 0.132 2.39 1.862 10.525 collect 72 184 0.335 0.071 3.26 2.621 craft 46 259 0.66 0.245 2.008 1.581 11.261 comic 40 168 0.595 0.21 cook 37 228 0.733 0.333 1.773 1.438 12.324 hunt 31 67 0.483 0.139 2.295 2.346 stamp 6 9 0.678 0.5 1.4 1.631 3 eat 0 0 0 0 0 0 0 C̄ C̄rand l lrand hki Keyword 0.482 0.016 1.801 2.445 17.647 0.4 0.04 1.7 1.4 0.189 1.422 2.847 8.5 2.462 0.423 0.021 3.280 2.315 19.152 0.6 0.107 2.31 1.651 31.072 2.149 1.733 6.159 5.111 8.4 4.323 91 N L Sexes 853 6621 0.357 0.018 3.158 2.461 15.524 boy 259 2300 0.445 0.069 2.684 1.931 17.761 man 225 1093 0.455 0.043 3.286 2.382 gay 209 3430 0.624 0.157 2.201 1.530 32.823 girl 107 1242 0.608 0.217 1.993 1.486 23.215 lesbian 56 265 0.513 0.169 2.217 1.791 9.464 woman 15 48 0.729 0.427 1.695 1.459 6.4 male 3 2 Anime C̄ 0 C̄rand l lrand hki Keyword 0.444 1.333 3.819 9.716 1.333 839 27155 0.598 0.077 2.296 1.614 64.732 anime 276 9770 0.708 0.257 1.837 1.319 70.797 naruto 184 3101 0.663 0.183 OnePiece 127 1440 0.656 0.179 1.996 1.552 22.677 bleach 69 416 0.671 0.175 saiyan 31 143 0.633 0.298 1.824 1.545 9.226 Gundam 23 78 0.630 0.295 1.858 1.638 6.783 DBZ 18 32 0.441 0.198 1.967 2.279 2.556 dragonball 13 17 0.531 0.201 2.397 2.668 2.615 sasuke 12 19 0.454 0.264 1.985 2.156 3.167 2.01 2.06 1.482 33.707 1.701 12.058 92 Keyword N L C̄ C̄rand l lrand hki Minorities 679 4758 0.356 0.021 1.544 2.470 14.015 latin 140 1140 africa 103 1196 0.583 0.225 2.029 1.474 23.223 black 70 274 hispanic 64 221 0.432 0.108 2.7 2.152 6.906 greek 37 137 0.469 2.11 1.803 7.405 american 31 136 0.591 0.283 1.916 1.581 8.774 china 26 121 0.623 0.358 1.460 9.308 nazi 23 21 0.153 0.079 1.549 5.207 1.826 white 23 22 0.033 0.083 2.205 4.834 1.913 asian 16 17 0.236 0.133 1.923 3.678 2.125 KKK 5 9 Science 0.48 0.5 0.9 0.116 2.327 1.771 16.286 0.112 2.052 2.065 0.20 0.72 2.0 1.10 1.256 7.829 3.60 417 3724 0.429 0.043 2.892 2.093 17.861 science 188 1774 0.513 0.1 2.322 1.782 18.872 nuclear 91 376 0.494 0.091 fractal 44 71 0.38 chaos 0 0 0 math 40 161 0.575 0.201 2.059 1.769 8.050 physics 23 108 0.684 0.408 1.719 9.391 2.136 8.264 0.073 2.109 3.230 3.227 0 2.68 0 0 1.40 0 93 N L Music 358 5210 0.616 0.081 1.941 1.744 29.106 bieber 117 1714 0.646 0.25 spears 105 1771 0.71 0.321 1.733 1.323 33.733 gaga 33 186 0.648 0.342 1.858 1.443 11.273 MTV 27 109 0.693 0.299 1.915 1.578 8.074 beyonce 24 162 0.851 0.563 1.413 1.221 13.50 kardashian 21 26 0.498 0.118 2.288 3.358 2.476 kesha 0 0 C̄ 0 C̄rand 0 l lrand hki Keyword 1.903 1.410 29.299 0 0 0 Sports 342 10356 0.715 0.177 2.403 1.422 60.561 football 91 1401 0.76 0.338 1.858 1.316 30.791 soccer 66 1426 0.9 0.655 1.384 1.112 43.212 golf 60 799 0.814 0.444 1.647 1.247 26.633 tennis 38 280 0.741 0.388 1.711 1.352 14.737 basketball 28 124 0.698 0.316 1.907 1.528 8.857 swim 15 19 0.336 0.169 1.689 2.913 2.533 baseball 10 21 0.650 0.420 1.622 1.604 4.20 bowling 4 3 Computer 300 0 0.375 1.667 3.419 1.5 2927 0.505 0.065 2.247 1.920 19.513 computer 112 1406 0.605 0.224 2.166 1.464 25.107 android 71 653 0.651 0.259 1.971 1.464 18.394 ipad 26 140 0.866 0.414 1.702 1.371 10.769 ipod 6 4 mac 18 PC 12 0 0.222 1.333 6.228 1.333 22 0.537 0.136 1.350 3.234 2.444 22 0.466 0.306 2.061 1.913 3.667