LIBSVM Data: Multi-label Classification
Multi-label classification has recently become an important topic, but only a few data sets are publicly available. We tried hard to collect the following sets. Labels appear at the beginning of each line and are separated by commas.
Amazon-670K
- Source:
[JM13a]
- Preprocessing:
Since part of the original data from Amazon is no longer available, the raw texts are downloaded from The Extreme Classification Repository. We concatenate "title" and "content" once as our raw texts, instead of concatenating "title" twice and "content" once as in AttentionXML. For each instance, we use "\t" to separate labels and raw texts. We provide two versions of features. The first version consists of tf-idf features calculated from the raw texts provided here by using sklearn's TfidfVectorizer with default configurations except for the options "vocabulary", "stop_words" and "strip_accents" (a minimal sketch of this step is given after this entry). The option "vocabulary" is set to the vocabulary given in The Extreme Classification Repository. The option "stop_words" is set to "english" to ignore some uninformative words, like "and" and "or". The option "strip_accents" is set to "unicode" to remove accents and perform other NFKD character normalization. Our resulting tf-idf feature set is slightly different from the set provided in The Extreme Classification Repository for the following reasons. First, we set "stop_words" to "english", so some of their features are not in ours. Second, accents are handled differently in the tokenization process. For instance, "Aragón" is tokenized as "arag" in The Extreme Classification Repository, but as "aragon" in ours. The last reason is that, in the tf-idf features provided by The Extreme Classification Repository, some instances have features that do not appear in their raw texts. The second version is the BoW features downloaded from The Extreme Classification Repository. To satisfy the LibSVM format, we remove the first header line and change the feature indices to start from 1 instead of 0. We then normalize each instance to a unit vector. The code used to get raw texts, calculate tf-idf features, and modify the format of the downloaded BoW features is provided.
- # of classes: 670,091
- # of data:
490,449
/ 153,025 (testing)
- # of features:
135,909 (ver1) 135,909 (ver2)
- Files:
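A minimal sketch of the tf-idf computation described in the Amazon-670K preprocessing above, assuming hypothetical file names train_raw_texts.txt and vocabulary.txt; this is a sketch under those assumptions, not the exact script distributed with the data:

    from sklearn.feature_extraction.text import TfidfVectorizer

    # Raw texts: one instance per line in the form "labels\traw text".
    with open("train_raw_texts.txt", encoding="utf-8") as f:
        texts = [line.rstrip("\n").split("\t", 1)[1] for line in f]

    # Vocabulary given by The Extreme Classification Repository, one term per line.
    with open("vocabulary.txt", encoding="utf-8") as f:
        vocab = [line.strip() for line in f]

    vectorizer = TfidfVectorizer(
        vocabulary=vocab,         # fixed vocabulary from the repository
        stop_words="english",     # drop uninformative words such as "and", "or"
        strip_accents="unicode",  # remove accents / NFKD normalization
    )
    X_train = vectorizer.fit_transform(texts)  # sparse tf-idf matrix, one row per instance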
AmazonCat-13K
- Source:
[JM13a]
- Preprocessing:
Since part of the original data from Amazon is no longer available, the raw texts are downloaded from The Extreme Classification Repository. We extract "title" and "content" as our raw texts. A comparison with the raw texts given in AttentionXML shows that one instance given in The Extreme Classification Repository is missing. We add this instance so our raw texts are the same as those from AttentionXML. For each instance, we use "\t" to separate labels and raw texts. We provide three versions of features. The first two are tf-idf features calculated from the raw texts provided here by using sklearn's TfidfVectorizer, but with different configurations. The first version uses default configurations except for the options "vocabulary", "stop_words" and "strip_accents". The option "vocabulary" is set to the vocabulary given in The Extreme Classification Repository. The option "stop_words" is set to "english" to ignore some uninformative words, like "and" and "or". The option "strip_accents" is set to "unicode" to remove accents. The second version uses the default configurations. Note that these two tf-idf sets are very different from the tf-idf set provided in The Extreme Classification Repository but give higher performance. The third version is the BoW features downloaded from The Extreme Classification Repository. To satisfy the LibSVM format, we remove the first header line and change the feature indices to start from 1 instead of 0. We then normalize each instance to a unit vector (a minimal sketch of this conversion is given after this entry). The code used to get raw texts, calculate tf-idf features, and modify the format of the downloaded BoW features is provided.
- # of classes: 13,330
- # of data:
1,186,239
/ 306,782 (testing)
- # of features:
203,882 (ver1) 1,293,747 (ver2) 203,882 (ver3)
- Files:
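A minimal sketch of the conversion of the downloaded BoW features described above (header removal, shifting feature indices to start from 1, and normalizing each instance to a unit vector). File names are hypothetical, and each line is assumed to start with a non-empty, comma-separated label field followed by "index:value" pairs:

    import math

    with open("train_bow_download.txt", encoding="utf-8") as fin, \
         open("train_bow.svm", "w", encoding="utf-8") as fout:
        next(fin)  # drop the header line of the downloaded file
        for line in fin:
            parts = line.split()
            labels = parts[0]                          # comma-separated labels
            feats = [p.split(":") for p in parts[1:]]  # (index, value) pairs, 0-based indices
            norm = math.sqrt(sum(float(v) ** 2 for _, v in feats)) or 1.0
            feats = ["%d:%.6f" % (int(i) + 1, float(v) / norm) for i, v in feats]
            fout.write(labels + " " + " ".join(feats) + "\n")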
bibtex
BlogCatalog
delicious
ECtHR (A) (LexGLUE)
- Source:
[IC22b]
- Preprocessing:
The data are downloaded from Hugging Face. If a list of texts is provided, we follow lex-glue and concatenate them with a white space. In addition, all newlines are replaced with white spaces. The raw data are in the format of labels<TAB>texts. We also provide data with tf-idf features, which are calculated from the raw texts provided here using TfidfVectorizer from sklearn with default configurations (a minimal sketch is given after this entry). The training and validation sets are combined as one bigger training file. Note that the resulting tf-idf features are different from those provided by lex-glue. The code used to generate the raw texts and tf-idf features is provided.
- # of classes: 10
- # of data:
9,000
/ 1,000 (validation)
/ 1,000 (testing)
- # of features:
92,401
- Files:
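A minimal sketch of the raw-text extraction and tf-idf step described above, assuming the Hugging Face "lex_glue" dataset with configuration "ecthr_a" and its "text"/"labels" fields (these names are assumptions; check the dataset card). Writing out the labels<TAB>texts files is omitted:

    from datasets import load_dataset
    from sklearn.feature_extraction.text import TfidfVectorizer

    ds = load_dataset("lex_glue", "ecthr_a")

    def to_text(example):
        text = " ".join(example["text"])  # concatenate the list of texts with white spaces
        return text.replace("\n", " ")    # replace newlines with white spaces

    # Training and validation sets are combined as one bigger training set.
    train_texts = [to_text(e) for e in ds["train"]] + [to_text(e) for e in ds["validation"]]
    test_texts = [to_text(e) for e in ds["test"]]

    vectorizer = TfidfVectorizer()        # default configurations
    X_train = vectorizer.fit_transform(train_texts)
    X_test = vectorizer.transform(test_texts)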
ECtHR (B) (LexGLUE)
- Source:
[IC22b]
- Preprocessing:
The procedure is the same as that for ECtHR (A) (LexGLUE).
- # of classes: 10
- # of data:
9,000
/ 1,000 (validation)
/ 1,000 (testing)
- # of features:
92,401
- Files:
EUR-LEX (LexGLUE)
- Source:
[IC22b]
- Preprocessing:
The procedure is the same as that for ECtHR (A) (LexGLUE).
- # of classes: 100
- # of data:
55,000
/ 5,000 (validation)
/ 5,000 (testing)
- # of features:
147,464
- Files:
EUR-Lex
- Source:
[LM10a]
- Preprocessing:
Both the tokenized texts and tf-idf features provided here are the same as those used by AttentionXML (except tiny numerical differences in the tf-idf features). The texts are extracted from the source documents obtained from the original EUR-Lex dataset. Before tokenization, symbols like '>', '<', '"' and '&' are replaced with their textual representations 'gt', 'lt', 'quot' and 'amp', respectively. The text is then further processed using LetterTokenizer, LowerCaseFilter, StopFilter (with stopwords taken from the original dataset) and PorterStemFilter provided by PyLucene, in the mentioned order. To reproduce the text used by AttentionXML, words without any English letters are removed and the Greek letter 'σ' at the end of words is transformed to 'ς' (except for the formula dH/dσ that appears in one document). The EUROVOC labels for each instance are taken directly from the original dataset. Finally, the tokenized texts are output in the format of labels<TAB>texts, where the labels and texts are respectively separated by white spaces. The tf-idf features are not calculated from the texts provided here. Instead, they are calculated from the tokenized texts provided by the original EUR-Lex dataset, using TfidfVectorizer from sklearn with the idf formula modified to log(N/df) (a minimal sketch of this weighting is given after this entry). The code for generating the dataset is provided.
- # of classes: 3,956
- # of data:
15,449
/ 3,865 (testing)
- # of features:
186,104
- Files:
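A minimal sketch of tf-idf weighting with idf = log(N/df) as mentioned above, computed manually on top of CountVectorizer; the exact tokenization, term-frequency definition, and normalization of the released features may differ:

    import numpy as np
    import scipy.sparse as sp
    from sklearn.feature_extraction.text import CountVectorizer

    def tfidf_log_n_over_df(train_docs, test_docs):
        cv = CountVectorizer()
        tf_train = cv.fit_transform(train_docs)  # raw term counts (CSR)
        tf_test = cv.transform(test_docs)
        n = tf_train.shape[0]
        # document frequency: number of training documents containing each term
        df = np.bincount(tf_train.indices, minlength=tf_train.shape[1])
        idf = np.log(n / np.maximum(df, 1))      # idf = log(N/df)
        scale = sp.diags(idf)
        return tf_train @ scale, tf_test @ scale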
EURLEX57K
- Source:
[IC19a]
- Preprocessing:
The data for generating the raw texts are downloaded from this website. Following lmtc-eurlex57k, we concatenate header, recitals, main_body, and attachments with a space. Whitespace characters ("\s", including tabs and newlines) and non-breaking spaces ("\xa0") are replaced with a space (a minimal sketch is given after this entry). The data are in the format of
ID<TAB>labels<TAB>raw texts.
The code used to generate the sets is also provided.
- # of classes: 4,271
- # of data:
45,000
/ 6,000 (validation)
/ 6,000 (testing)
- # of features:
N/A
- Files:
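A minimal sketch of the text cleaning described above, assuming one JSON file per document with fields such as "celex_id", "concepts", "header", "recitals", "main_body", and "attachments" as in the EURLEX57K release; these field names, the file locations, and the label separator are assumptions:

    import glob
    import json
    import re

    def to_raw_text(doc):
        # concatenate header, recitals, main_body, and attachments with a space;
        # main_body is assumed to be a list of sections
        pieces = [doc["header"], doc["recitals"], *doc["main_body"], doc["attachments"]]
        text = " ".join(pieces)
        # replace whitespace characters ("\s", incl. tabs/newlines) and "\xa0" with a space
        return re.sub(r"[\s\xa0]+", " ", text).strip()

    with open("eurlex57k_train.txt", "w", encoding="utf-8") as fout:
        for path in sorted(glob.glob("dataset/train/*.json")):  # hypothetical location
            with open(path, encoding="utf-8") as fin:
                doc = json.load(fin)
            labels = " ".join(doc["concepts"])  # label separator here is an assumption
            fout.write("\t".join([doc["celex_id"], labels, to_raw_text(doc)]) + "\n")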
Flickr
mediamill (exp1)
- Source:
Mediamill
/ The Mediamill Challenge Problem
- Preprocessing:
We combine all binary classification problems into a multi-label one (a minimal sketch is given after this entry).
- # of classes: 101
- # of data:
30,993
/ 12,914 (testing)
- # of features:
120
- Files:
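A minimal sketch of combining per-class binary problems into one multi-label set, assuming a 0/1 label matrix with one column per concept and precomputed LibSVM feature strings; all names here are hypothetical:

    import numpy as np

    def write_multilabel(Y, feature_lines, path):
        """Y: (n_instances, n_classes) 0/1 array; feature_lines: LibSVM feature strings."""
        with open(path, "w") as f:
            for y, feats in zip(Y, feature_lines):
                labels = ",".join(str(c) for c in np.flatnonzero(y))  # positive classes
                f.write(labels + " " + feats + "\n")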
PPI
rcv1v2 (topics; subsets)
- Source:
[DL04b]
- # of classes: 101
- # of data:
3,000
/ 3,000 (testing)
- # of features:
47,236
- Files:
rcv1v2 (topics; full sets)
- Source:
[DL04b]
- Preprocessing:
The four testing sets correspond to the four testing files from the RCV1 site (appendix B.13). A combined file is also provided. In the testing set, the number of classes is 103. We further provide files of original labels and tokenized texts (B.12 of the RCV1 site) in the format of
ID<TAB>labels<TAB>tokens,
where labels and tokens are respectively separated by spaces.
- # of classes: 101
- # of data:
23,149
/ 781,265 (testing)
- # of features:
47,236
- Files:
rcv1v2 (industries; full sets)
- Source:
[DL04b]
- Preprocessing:
The four testing sets correspond to the four testing files from the RCV1 site. In the testing set, the number of classes is 350.
- # of classes: 313
- # of data:
23,149
/ 781,265 (testing)
- # of features:
47,236
- Files:
rcv1v2 (regions; full sets)
- Source:
[DL04b]
- Preprocessing:
The four testing sets correspond to the four testing files from the RCV1 site. In the testing set, the number of classes is 296.
- # of classes: 228
- # of data:
23,149
/ 781,265 (testing)
- # of features:
47,236
- Files:
scene-classification
- Source:
[MB04a]
- # of classes: 6
- # of data:
1,211
/ 1,196 (testing)
- # of features:
294
- Files:
siam-competition2007
- Source:
SIAM Text Mining Competition 2007
/ SIAM Text Mining Competition 2007
- Preprocessing:
We remove "." before transforming data to vectors. We use binary term frequencies and normalize each instance to unit length (a minimal sketch is given after this entry).
- # of classes: 22
- # of data:
21,519
/ 7,077 (testing)
- # of features:
30,438
- Files:
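A minimal sketch of the vectorization described above (binary term frequencies, each instance normalized to unit length); the documents here are placeholders:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.preprocessing import normalize

    raw_docs = ["engine shut down after takeoff.", "smoke in the cabin."]  # placeholders
    docs = [d.replace(".", " ") for d in raw_docs]    # remove "." before vectorizing

    cv = CountVectorizer(binary=True)                 # binary term frequencies
    X = normalize(cv.fit_transform(docs), norm="l2")  # each instance scaled to unit length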
UNFAIR-ToS (LexGLUE)
- Source:
[IC22b]
- Preprocessing:
The procedure is the same as that for ECtHR (A) (LexGLUE).
- # of classes: 8
- # of data:
5,532
/ 2,275 (validation)
/ 1,607 (testing)
- # of features:
6,290
- Files:
Wiki10-31K
- Source:
[AZ09a]
- Preprocessing:
The raw texts are extracted from the original HTML documents by concatenating all the <p> tags within the block <div id="bodyContent"> ... </div> in each file, with white spaces between them. We tried to make the raw texts and the training/testing split as close as possible to those used in AttentionXML, which are based on those provided by the Extreme Classification Repository. It seems that, due to updates of the source pages, eight instances are slightly different from those in AttentionXML. Other instances are exactly the same, except that tabs in the texts are replaced with white spaces. The raw text data are in the format of labels<TAB>raw texts, where the labels are separated by spaces. The tf-idf features are calculated from the raw texts provided here using sklearn's TfidfVectorizer with default configurations, except that "min_df" is set to 3 to avoid too many features (a minimal sketch is given after this entry). Note that the resulting tf-idf features are different from those provided by the Extreme Classification Repository. The code used to generate the raw texts and tf-idf features is provided.
- # of classes: 30,938
- # of data:
14,146
/ 6,616 (testing)
- # of features:
104,374
- Files:
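A minimal sketch of the raw-text extraction and tf-idf step described above; BeautifulSoup is used here only for illustration and the file locations are hypothetical:

    import glob
    from bs4 import BeautifulSoup
    from sklearn.feature_extraction.text import TfidfVectorizer

    def extract_text(html):
        soup = BeautifulSoup(html, "html.parser")
        body = soup.find("div", id="bodyContent")
        # concatenate all <p> tags within the bodyContent block with white spaces,
        # replacing tabs with white spaces
        return " ".join(p.get_text() for p in body.find_all("p")).replace("\t", " ")

    texts = []
    for path in sorted(glob.glob("wiki10/html/*.html")):  # hypothetical location
        with open(path, encoding="utf-8", errors="ignore") as f:
            texts.append(extract_text(f.read()))

    vectorizer = TfidfVectorizer(min_df=3)  # min_df=3 to avoid too many features
    X = vectorizer.fit_transform(texts)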
yeast
- Source:
[AE02a]
- # of classes: 14
- # of data:
1,500
/ 917 (testing)
- # of features:
103
- Files:
YouTube