[Data] Community Detection Datasets

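Contents
Datasets for link-based analysis
Datasets with links and discrete attributes
Datasets with links and text attributes
Other common dataset collections
Datasets collected by Mark Newman
Social and Information Network Analysis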
Datasets for link-based analysis
Zachary karate club

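The Zachary network is a social network built from observations of a karate club at a US university. It contains 34 nodes and 78 edges; nodes represent club members and edges represent friendships between members. The karate club network has become a classic benchmark for detecting community structure in complex networks [1]. [Download link]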

American College football

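The College Football network is a social network that Newman built from the schedule of American college football games. It contains 115 nodes and 613 edges; each node is a team, and an edge between two nodes means the two teams played a game against each other. The 115 college teams are divided into 12 conferences. Teams first play games within their own conference and then play teams from other conferences, so games within a conference are more frequent than games between conferences. The conferences therefore serve as the ground-truth community structure of the network [2]. [Download link]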
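
The claim that conferences act as ground-truth communities rests on most games being played inside a conference. A minimal sketch of that check follows, using a hypothetical six-team toy schedule rather than the real football data; the function name and the team and conference labels are assumptions made for illustration.

```python
def intra_community_fraction(edges, community):
    """Return the fraction of edges whose two endpoints share a community."""
    if not edges:
        raise ValueError("edge list is empty")
    same = sum(1 for u, v in edges if community[u] == community[v])
    return same / len(edges)


# Hypothetical toy schedule: two conferences of three teams each.
community = {
    "A1": "East", "A2": "East", "A3": "East",
    "B1": "West", "B2": "West", "B3": "West",
}
edges = [
    ("A1", "A2"), ("A1", "A3"), ("A2", "A3"),  # games inside East
    ("B1", "B2"), ("B1", "B3"), ("B2", "B3"),  # games inside West
    ("A1", "B1"), ("A2", "B2"),                # games between conferences
]

fraction = intra_community_fraction(edges, community)
print(f"{fraction:.2f}")  # 6 of the 8 toy games are intra-conference
```

On a real dataset, a value well above what random labels would give suggests that the labels are a reasonable reference partition for evaluating a community detection algorithm.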

Dolphin social network

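The Dolphin dataset is a social network of dolphins that D. Lusseau et al. obtained by observing the interactions of a group of 62 dolphins in Doubtful Sound, New Zealand, over seven years. The network has 62 nodes and 159 edges. Nodes represent dolphins and edges represent frequent association between dolphins [3]. [Download link]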

netscience dataset

Netscience is a coauthorship network of scientists working on network theory and experiment. The dataset contains all components of the network, for a total of 1589 scientists [12]. [Download link, access code: 4bfc]

Datasets with links and discrete attributes
Political blogs

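This dataset was compiled by Lada Adamic in 2005 and records the political leaning of blogs. It contains 1490 nodes and 19090 edges. Each node carries one attribute (0 or 1) that marks the blog as liberal or conservative [4]. [Download link]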

DBLP Dataset

Digital Bibliography Project (DBLP) is a computer science bibliography. In this dataset, authors are treated as users, the titles of an author's papers form that user's text, and coauthorship relationships form the links between users.
Monthly updated DBLP data: [data link]
Preprocessed DBLP dataset: [data link]
DBLP dataset usage notes: [usage note 1, usage note 2]

DBLP-10K 
DBLP bibliography data from four research areas: database (DB), data mining (DM), information retrieval (IR) and artificial intelligence (AI). We build a coauthor graph with the top 5,000 authors and their coauthor relationships. In addition, we use two relevant attributes: prolific and primary topic. For attribute “prolific”, authors with ≥ 20 papers are labeled as highly prolific; authors with ≥ 10 and < 20 papers are labeled as prolific; and authors with < 10 papers are labeled as low prolific. For attribute “primary topic”, we use a topic modeling approach (PLSA) to extract 100 research topics from a document collection composed of paper titles from the selected authors. Each extracted topic consists of a probability distribution over the keywords that are most representative of the topic. Each author is then assigned one of the 100 topics as his/her primary topic [5]. [Download link, access code: 0674]
DBLP-1K, DBLP-5K 
These two datasets are built by selecting the top 1,000 and top 5,000 authors directly from the DBLP-10K dataset. For DBLP-5K, see reference [6].
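
The thresholds for the “prolific” attribute described for DBLP-10K can be written as a small function. This is only a sketch of the stated rule; the function name `prolific_label` is a hypothetical choice and is not taken from the dataset or from [5].

```python
def prolific_label(paper_count):
    """Map an author's paper count to the three-level "prolific" attribute."""
    if paper_count < 0:
        raise ValueError("paper_count must be non-negative")
    if paper_count >= 20:
        return "highly prolific"
    if paper_count >= 10:
        return "prolific"
    return "low prolific"


# Boundary values make the half-open intervals explicit.
for count in (9, 10, 19, 20):
    print(count, prolific_label(count))
```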
Facebook Friendship Datasets

The datasets contain the Facebook networks (from a date in Sept. 2005) of these colleges: Caltech, Princeton, Georgetown and UNC Chapel Hill. The links represent friendships on Facebook. Each user has the following attributes: ID, a student/faculty status flag, gender, major, second major/minor (if applicable), dormitory (house), year and high school [10]. [Download link, access code: 264c]

Datasets with links and text attributes
Enron Email Dataset

This dataset was collected and prepared by the CALO Project (A Cognitive Assistant that Learns and Organizes). It contains data from about 150 users, mostly senior management of Enron, organized into folders [7]. [Download link]

Enron Mail subset 
A subset of about 1700 labeled email messages (4.5M). These were chosen in a semi-motivated fashion (focusing on business-related emails and the California Energy Crisis and on emails that occurred later in the collection, trying to avoid very personal messages, jokes, and so on). Students in the ANLP course annotated the selected messages with the category labels. Each message was labeled by two people, but no claims of consistency, comprehensiveness, nor generality are made about these labelings. This subset is divided into 11 categories according to the [category scheme]. [Download link]
The March 2005 version of the [Enron mail dataset]
CiteSeer

The CiteSeer dataset consists of 3312 scientific publications classified into one of six classes. The citation network consists of 4732 links. Each publication in the dataset is described by a 0/1-valued word vector indicating the absence/presence of the corresponding word from the dictionary. The dictionary consists of 3703 unique words. The README file in the dataset provides more details [8]. [Download link]

Cora

The Cora dataset consists of 2708 scientific publications classified into one of seven classes. The citation network consists of 5429 links. Each publication in the dataset is described by a 0/1-valued word vector indicating the absence/presence of the corresponding word from the dictionary. The dictionary consists of 1433 unique words. The README file in the dataset provides more details [8]. [Download link]
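
The 0/1-valued word vectors used by CiteSeer and Cora can be read with a few lines of code. The sketch below assumes a tab-separated record of the form identifier, word indicators, class label; that layout, the function name `parse_content_line`, and the five-word sample record are assumptions made for illustration, and the README shipped with each dataset remains the authoritative description.

```python
def parse_content_line(line, vocab_size):
    """Split one record into (paper_id, 0/1 word vector, class label).

    Assumed layout: paper_id <TAB> w_1 <TAB> ... <TAB> w_n <TAB> label.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) != vocab_size + 2:
        raise ValueError(f"expected {vocab_size + 2} fields, got {len(fields)}")
    paper_id, label = fields[0], fields[-1]
    vector = [int(value) for value in fields[1:-1]]
    if any(value not in (0, 1) for value in vector):
        raise ValueError("word indicators must be 0 or 1")
    return paper_id, vector, label


# Hypothetical record with a five-word dictionary (Cora itself has 1433 words).
sample = "p1\t0\t1\t0\t0\t1\tNeural_Networks\n"
paper_id, vector, label = parse_content_line(sample, vocab_size=5)

# Indices of the dictionary words that occur in the publication.
present = [index for index, value in enumerate(vector) if value == 1]
print(paper_id, label, present)
```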

WebKB

The WebKB dataset consists of 877 scientific publications classified into one of five classes. The citation network consists of 1608 links. Each publication in the dataset is described by a 0/1-valued word vector indicating the absence/presence of the corresponding word from the dictionary. The dictionary consists of 1703 unique words. The README file in the dataset provides more details [9]. [Download link]

Terrorists

[Download link]

Terrorist Attacks

This dataset consists of 1293 terrorist attacks, each assigned one of 6 labels indicating the type of the attack. Each attack is described by a 0/1-valued vector of attributes whose entries indicate the absence/presence of a feature. There are a total of 106 distinct features. The files in the dataset can be used to create two distinct graphs. The README file in the dataset provides more details. [Download link]

Flickr dataset

The Flickr image sharing network consists of nodes which represent Flickr users, and edges which indicate follow relations between users. We use tags of images uploaded by a given user as her attributes. In this network, the ground-truth communities are defined as user-created interest-based groups that have more than five members. [Download link, access code: ffdb]
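
The ground-truth rule for Flickr (keep user-created groups with more than five members) amounts to a simple filter. The sketch below uses a hypothetical in-memory mapping from group name to member list; the function name, the parameter `min_size_exclusive`, and the sample groups are illustrative assumptions, not part of the dataset's file format.

```python
def ground_truth_communities(groups, min_size_exclusive=5):
    """Keep only groups with more than `min_size_exclusive` distinct members."""
    return {
        name: set(members)
        for name, members in groups.items()
        if len(set(members)) > min_size_exclusive
    }


# Hypothetical interest groups and their member user IDs.
groups = {
    "macro": ["u1", "u2", "u3", "u4", "u5", "u6"],  # six members: kept
    "street": ["u1", "u7", "u8", "u9", "u10"],      # five members: dropped
    "film": ["u2", "u3"],                           # two members: dropped
}

communities = ground_truth_communities(groups)
print(sorted(communities))
```

Because one user can belong to several groups, communities defined this way may overlap.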

Other common dataset collections

Stanford Large Network Dataset Collection
Social networks: online social networks, edges represent interactions between people
Networks with ground-truth communities: ground-truth network communities in social and information networks
Communication networks: email communication networks with edges representing communication
Citation networks: nodes represent papers, edges represent citations
Collaboration networks: nodes represent scientists, edges represent collaborations (co-authoring a paper)
Web graphs: nodes represent webpages and edges are hyperlinks
Amazon networks: nodes represent products and edges link commonly co-purchased products
Internet networks: nodes represent computers and edges communication
Road networks: nodes represent intersections and edges roads connecting the intersections
Autonomous systems: graphs of the internet
Signed networks: networks with positive and negative edges (friend/foe, trust/distrust)
Location-based online social networks: social networks with geographic check-ins
Wikipedia networks and metadata: talk, editing and voting data from Wikipedia
Twitter and Memetracker: Memetracker phrases, links and 467 million Tweets
Online communities: data from online communities such as Reddit and Flickr
Online reviews: data from online review systems such as BeerAdvocate and Amazon

Datasets collected by Mark Newman
Introduction and related community detection algorithms: http://www-personal.umich.edu/~mejn/
Datasets: http://www-personal.umich.edu/~mejn/netdata/

Social and Information Network Analysis
KDD Cup Dataset 
http://www.cs.cornell.edu/projects/kddcup/datasets.html
Stack Overflow Data 
http://blog.stackoverflow.com/2009/06/stack-overflow-creative-commons-data-dump/
Youtube dataset 
YouTube videos are nodes. An edge a->b means that video b is in the related video list (first 20 only) of video a.
http://netsg.cs.sfu.ca/youtubedata/
Amazon Data 
The data was collected by crawling Amazon website and contains product metadata and review information about 548,552 different products (Books, music CDs, DVDs and VHS video tapes). http://snap.stanford.edu/data/amazon-meta.html
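
The edge rule given for the YouTube dataset above (a->b when b is among the first 20 related videos of a) can be sketched as follows. The input mapping, the function name `related_video_edges`, and the video IDs are hypothetical; the real crawl files may be laid out differently.

```python
def related_video_edges(related, limit=20):
    """Build directed edges (a, b) from each video's related list.

    Only the first `limit` related videos of each video produce an edge.
    """
    edges = []
    for video, related_list in related.items():
        for other in related_list[:limit]:
            edges.append((video, other))
    return edges


# Hypothetical related lists: "a" has 25 related videos, "b" has 2.
related = {
    "a": [f"v{i}" for i in range(25)],
    "b": ["a", "v0"],
}

edges = related_video_edges(related)
print(len(edges))  # 20 edges from "a" plus 2 edges from "b"
```

The resulting graph is directed: b appearing in the related list of a does not imply that a appears in the related list of b.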
[1]: Zachary W W. An information flow model for conflict and fission in small groups[J]. Journal of Anthropological Research, 1977, 33: 452-473.
[2]: Girvan M, Newman M E J. Community structure in social and biological networks[J]. Proc. Natl. Acad. Sci. USA, 2002, 99: 7821-7826.
[3]: Lusseau D, Schneider K, Boisseau O J, et al. The bottlenose dolphin community of Doubtful Sound features a large proportion of long-lasting associations[J]. Behavioral Ecology and Sociobiology, 2003, 54(4): 396-405.
[4]: Adamic L A, Glance N. The political blogosphere and the 2004 US Election[C]//Proceedings of the WWW-2005 Workshop on the Weblogging Ecosystem. 2005.
[5]: Zhou Y, Cheng H, Yu J X. Clustering large attributed graphs: An efficient incremental approach[C]//Data Mining (ICDM), 2010 IEEE 10th International Conference on. IEEE, 2010: 689-698.
[6]: Zhou Y, Cheng H, Yu J X. Graph clustering based on structural/attribute similarities[J]. Proceedings of the VLDB Endowment, 2009, 2(1): 718-729.
[7]: Klimt B, Yang Y. Introducing the Enron Corpus[C]//CEAS. 2004.
[8]: Yang T, Jin R, Chi Y, et al. Combining link and content for community detection: a discriminative approach[C]//Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2009: 927-936.
[9]: Lu Q, Getoor L. Link-based classification[C]//ICML. 2003, 3: 496-503.
[10]: Dang T A, Viennet E. Community detection based on structural and attribute similarities[C]//International Conference on Digital Society (ICDS). 2012: 7-12.
[11]: Li X, Snoek C G M, Worring M. Learning social tag relevance by neighbor voting[J]. IEEE Transactions on Multimedia, 2009.
[12]: Newman M E J. Finding community structure in networks using the eigenvectors of matrices[J]. Physical Review E, 2006, 74(3): 036104.

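Copyright notice: This is an original article by CSDN blogger "wzgang123", released under the CC 4.0 BY-SA license. When reposting, please include a link to the original source and this notice.
Original link: https://blog.csdn.net/wzgang123/article/details/51089521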
