LIBSVM Data: Multi-label Classification
Multi-label classification has recently become an important topic, but only a few data sets are publicly available. We tried hard to collect the following sets. Labels appear at the beginning of each line and are separated by commas.
Amazon-670K
- Source:
[JM13a]
- Preprocessing:
Since part of the original data from Amazon is no longer available, the raw texts are downloaded from The Extreme Classification Repository. We concatenate "title" and "content" once as our raw texts, instead of concatenating "title" twice and "content" once as in AttentionXML. For each instance, we use "\t" to separate labels and raw texts. We provide two versions of features. The first version consists of tf-idf features calculated from the raw texts provided here by using sklearn's TfidfVectorizer with default configurations except for the options "vocabulary", "stop_words" and "strip_accents" (a minimal sketch of this step is given after this entry). The option "vocabulary" is set to the vocabulary given in The Extreme Classification Repository. The option "stop_words" is set to "english" to ignore some uninformative words, like "and" and "or". The option "strip_accents" is set to "unicode" to remove accents and perform other NFKD character normalization. Our resulting tf-idf feature set is slightly different from the set provided in The Extreme Classification Repository for the following reasons. First, we set "stop_words" to "english", so some of their features are not in ours. Second, accents are handled differently in the tokenization process. For instance, "Aragón" is tokenized as "arag" in The Extreme Classification Repository, but as "aragon" in ours. The last reason is that, in the tf-idf features provided by The Extreme Classification Repository, some instances have features that do not appear in their raw texts. The second version is the BoW features downloaded from The Extreme Classification Repository. To satisfy the LibSVM format, we remove the first header line and change the feature indices to start from 1 instead of 0. We then normalize each instance to a unit vector. The code used to get raw texts, calculate tf-idf features, and modify the format of the downloaded BoW features is provided.
- # of classes: 670,091
- # of data:
490,449
/ 153,025 (testing)
- # of features:
135,909 (ver1) 135,909 (ver2)
- Files:
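A minimal sketch of the tf-idf computation described in the Amazon-670K preprocessing above, assuming hypothetical file names train_raw_texts.txt and vocabulary.txt; this is a sketch under those assumptions, not the exact script distributed with the data:

    from sklearn.feature_extraction.text import TfidfVectorizer

    # Raw texts: one instance per line in the form "labels\traw text".
    with open("train_raw_texts.txt", encoding="utf-8") as f:
        texts = [line.rstrip("\n").split("\t", 1)[1] for line in f]

    # Vocabulary given by The Extreme Classification Repository, one term per line.
    with open("vocabulary.txt", encoding="utf-8") as f:
        vocab = [line.strip() for line in f]

    vectorizer = TfidfVectorizer(
        vocabulary=vocab,         # fixed vocabulary from the repository
        stop_words="english",     # drop uninformative words such as "and", "or"
        strip_accents="unicode",  # remove accents / NFKD normalization
    )
    X_train = vectorizer.fit_transform(texts)  # sparse tf-idf matrix, one row per instance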
AmazonCat-13K
- Source:
[JM13a]
- Preprocessing:
Since part of the original data from Amazon is no longer available, the raw texts are downloaded from The Extreme Classification Repository. We extract "title" and "content" as our raw texts. A comparison with the raw texts given in AttentionXML shows that one instance given in The Extreme Classification Repository is missing. We add this instance so our raw texts are the same as those from AttentionXML. For each instance, we use "\t" to separate labels and raw texts. We provide three versions of features. The first two are tf-idf features calculated from the raw texts provided here by using sklearn's TfidfVectorizer, but with different configurations. The first version uses default configurations except for the options "vocabulary", "stop_words" and "strip_accents". The option "vocabulary" is set to the vocabulary given in The Extreme Classification Repository. The option "stop_words" is set to "english" to ignore some uninformative words, like "and" and "or". The option "strip_accents" is set to "unicode" to remove accents. The second version uses the default configurations. Note that these two tf-idf sets are very different from the tf-idf set provided in The Extreme Classification Repository but give higher performance. The third version is the BoW features downloaded from The Extreme Classification Repository. To satisfy the LibSVM format, we remove the first header line and change the feature indices to start from 1 instead of 0. We then normalize each instance to a unit vector (a minimal sketch of this conversion is given after this entry). The code used to get raw texts, calculate tf-idf features, and modify the format of the downloaded BoW features is provided.
- # of classes: 13,330
- # of data:
1,186,239
/ 306,782 (testing)
- # of features:
203,882 (ver1) 1,293,747 (ver2) 203,882 (ver3)
- Files:
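A minimal sketch of the conversion of the downloaded BoW features described above (header removal, shifting feature indices to start from 1, and normalizing each instance to a unit vector). File names are hypothetical, and each line is assumed to start with a non-empty, comma-separated label field followed by "index:value" pairs:

    import math

    with open("train_bow_download.txt", encoding="utf-8") as fin, \
         open("train_bow.svm", "w", encoding="utf-8") as fout:
        next(fin)  # drop the header line of the downloaded file
        for line in fin:
            parts = line.split()
            labels = parts[0]                          # comma-separated labels
            feats = [p.split(":") for p in parts[1:]]  # (index, value) pairs, 0-based indices
            norm = math.sqrt(sum(float(v) ** 2 for _, v in feats)) or 1.0
            feats = ["%d:%.6f" % (int(i) + 1, float(v) / norm) for i, v in feats]
            fout.write(labels + " " + " ".join(feats) + "\n")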
bibtex
BlogCatalog
delicious
ECtHR (A) (LexGLUE)
- Source:
[IC22b]
- Preprocessing:
The data are downloaded from Hugging Face. If a list of texts is provided, we follow lex-glue and concatenate them with a white space. In addition, all newlines are replaced with white spaces. The raw data are in the format of labels<TAB>texts. We also provide data with tf-idf features, which are calculated from the raw texts provided here using TfidfVectorizer from sklearn with default configurations (a minimal sketch is given after this entry). The training and validation sets are combined as one bigger training file. Note that the resulting tf-idf features are different from those provided by lex-glue. The code used to generate the raw texts and tf-idf features is provided.
- # of classes: 10
- # of data:
9,000
/ 1,000 (validation)
/ 1,000 (testing)
- # of features:
92,401
- Files:
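A minimal sketch of the raw-text extraction and tf-idf step described above, assuming the Hugging Face "lex_glue" dataset with configuration "ecthr_a" and its "text"/"labels" fields (these names are assumptions; check the dataset card). Writing out the labels<TAB>texts files is omitted:

    from datasets import load_dataset
    from sklearn.feature_extraction.text import TfidfVectorizer

    ds = load_dataset("lex_glue", "ecthr_a")

    def to_text(example):
        text = " ".join(example["text"])  # concatenate the list of texts with white spaces
        return text.replace("\n", " ")    # replace newlines with white spaces

    # Training and validation sets are combined as one bigger training set.
    train_texts = [to_text(e) for e in ds["train"]] + [to_text(e) for e in ds["validation"]]
    test_texts = [to_text(e) for e in ds["test"]]

    vectorizer = TfidfVectorizer()        # default configurations
    X_train = vectorizer.fit_transform(train_texts)
    X_test = vectorizer.transform(test_texts)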
ECtHR (B) (LexGLUE)
- Source:
[IC22b]
- Preprocessing:
The procedure is the same as that for ECtHR (A) (LexGLUE).
- # of classes: 10
- # of data:
9,000
/ 1,000 (validation)
/ 1,000 (testing)
- # of features:
92,401
- Files:
EUR-LEX (LexGLUE)
- Source:
[IC22b]
- Preprocessing:
The procedure is the same as that for ECtHR (A) (LexGLUE).
- # of classes: 100
- # of data:
55,000
/ 5,000 (validation)
/ 5,000 (testing)
- # of features:
147,464
- Files:
EUR-Lex
- Source:
[LM10a]
- Preprocessing:
Both the tokenized texts and tf-idf features provided here are the same as those used by AttentionXML (except tiny numerical differences in the tf-idf features). The texts are extracted from the source documents obtained from the original EUR-Lex dataset. Before tokenization, symbols like '>', '<', '"' and '&' are replaced with their textual representations 'gt', 'lt', 'quot' and 'amp', respectively. The text is then further processed using LetterTokenizer, LowerCaseFilter, StopFilter (with stopwords taken from the original dataset) and PorterStemFilter provided by PyLucene, in the mentioned order. To reproduce the text used by AttentionXML, words without any English letters are removed and the Greek letter 'σ' at the end of words is transformed to 'ς' (except for the formula dH/dσ that appears in one document). The EUROVOC labels for each instance are taken directly from the original dataset. Finally, the tokenized texts are output in the format of labels<TAB>texts, where the labels and texts are respectively separated by white spaces. The tf-idf features are not calculated from the texts provided here. Instead, they are calculated from the tokenized texts provided by the original EUR-Lex dataset, using TfidfVectorizer from sklearn with the idf formula modified to log(N/df) (a minimal sketch of this weighting is given after this entry). The code for generating the dataset is provided.
- # of classes: 3,956
- # of data:
15,449
/ 3,865 (testing)
- # of features:
186,104
- Files:
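A minimal sketch of tf-idf weighting with idf = log(N/df) as mentioned above, computed manually on top of CountVectorizer; the exact tokenization, term-frequency definition, and normalization of the released features may differ:

    import numpy as np
    import scipy.sparse as sp
    from sklearn.feature_extraction.text import CountVectorizer

    def tfidf_log_n_over_df(train_docs, test_docs):
        cv = CountVectorizer()
        tf_train = cv.fit_transform(train_docs)  # raw term counts (CSR)
        tf_test = cv.transform(test_docs)
        n = tf_train.shape[0]
        # document frequency: number of training documents containing each term
        df = np.bincount(tf_train.indices, minlength=tf_train.shape[1])
        idf = np.log(n / np.maximum(df, 1))      # idf = log(N/df)
        scale = sp.diags(idf)
        return tf_train @ scale, tf_test @ scale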
EURLEX57K
- Source:
[IC19a]
- Preprocessing:
The data for generating the raw texts are downloaded from this website. Following lmtc-eurlex57k, we concatenate header, recitals, main_body, and attachments with a space. Whitespace characters ("\s", including tabs and newlines) and non-breaking spaces ("\xa0") are replaced with a space (a minimal sketch is given after this entry). The data are in the format of
ID<TAB>labels<TAB>raw texts.
The code used to generate the sets is also provided.
- # of classes: 4,271
- # of data:
45,000
/ 6,000 (validation)
/ 6,000 (testing)
- # of features:
N/A
- Files:
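A minimal sketch of the text cleaning described above, assuming one JSON file per document with fields such as "celex_id", "concepts", "header", "recitals", "main_body", and "attachments" as in the EURLEX57K release; these field names, the file locations, and the label separator are assumptions:

    import glob
    import json
    import re

    def to_raw_text(doc):
        # concatenate header, recitals, main_body, and attachments with a space;
        # main_body is assumed to be a list of sections
        pieces = [doc["header"], doc["recitals"], *doc["main_body"], doc["attachments"]]
        text = " ".join(pieces)
        # replace whitespace characters ("\s", incl. tabs/newlines) and "\xa0" with a space
        return re.sub(r"[\s\xa0]+", " ", text).strip()

    with open("eurlex57k_train.txt", "w", encoding="utf-8") as fout:
        for path in sorted(glob.glob("dataset/train/*.json")):  # hypothetical location
            with open(path, encoding="utf-8") as fin:
                doc = json.load(fin)
            labels = " ".join(doc["concepts"])  # label separator here is an assumption
            fout.write("\t".join([doc["celex_id"], labels, to_raw_text(doc)]) + "\n")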
Flickr
mediamill (exp1)
- Source:
Mediamill
/ The Mediamill Challenge Problem
- Preprocessing:
We combine all binary classification problems into a multi-label one (a minimal sketch is given after this entry).
- # of classes: 101
- # of data:
30,993
/ 12,914 (testing)
- # of features:
120
- Files:
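A minimal sketch of combining per-class binary problems into one multi-label set, assuming a 0/1 label matrix with one column per concept and precomputed LibSVM feature strings; all names here are hypothetical:

    import numpy as np

    def write_multilabel(Y, feature_lines, path):
        """Y: (n_instances, n_classes) 0/1 array; feature_lines: LibSVM feature strings."""
        with open(path, "w") as f:
            for y, feats in zip(Y, feature_lines):
                labels = ",".join(str(c) for c in np.flatnonzero(y))  # positive classes
                f.write(labels + " " + feats + "\n")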
PPI
rcv1v2 (topics; subsets)
- Source:
[DL04b]
- # of classes: 101
- # of data:
3,000
/ 3,000 (testing)
- # of features:
47,236
- Files:
rcv1v2 (topics; full sets)
- Source:
[DL04b]
- Preprocessing:
The four testing sets correspond to the four testing files from the RCV1 site (appendix B.13). A combined file is also provided. In the testing set, the number of classes is 103. We further provide files of original labels and tokenized texts (B.12 of the RCV1 site) in the format of
ID<TAB>labels<TAB>tokens,
where labels and tokens are respectively separated by spaces.
- # of classes: 101
- # of data:
23,149
/ 781,265 (testing)
- # of features:
47,236
- Files:
rcv1v2 (industries; full sets)
- Source:
[DL04b]
- Preprocessing:
The four testing sets correspond to the four testing files from the RCV1 site. In the testing set, the number of classes is 350.
- # of classes: 313
- # of data:
23,149
/ 781,265 (testing)
- # of features:
47,236
- Files:
rcv1v2 (regions; full sets)
- Source:
[DL04b]
- Preprocessing:
The four testing sets correspond to the four testing files from the RCV1 site. In the testing set, the number of classes is 296.
- # of classes: 228
- # of data:
23,149
/ 781,265 (testing)
- # of features:
47,236
- Files:
scene-classification
- Source:
[MB04a]
- # of classes: 6
- # of data:
1,211
/ 1,196 (testing)
- # of features:
294
- Files:
siam-competition2007
- Source:
SIAM Text Mining Competition 2007
/ SIAM Text Mining Competition 2007
- Preprocessing:
We remove "." before transforming data to vectors. We use binary term frequencies and normalize each instance to unit length (a minimal sketch is given after this entry).
- # of classes: 22
- # of data:
21,519
/ 7,077 (testing)
- # of features:
30,438
- Files:
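A minimal sketch of the vectorization described above (binary term frequencies, each instance normalized to unit length); the documents here are placeholders:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.preprocessing import normalize

    raw_docs = ["engine shut down after takeoff.", "smoke in the cabin."]  # placeholders
    docs = [d.replace(".", " ") for d in raw_docs]    # remove "." before vectorizing

    cv = CountVectorizer(binary=True)                 # binary term frequencies
    X = normalize(cv.fit_transform(docs), norm="l2")  # each instance scaled to unit length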
UNFAIR-ToS (LexGLUE)
- Source:
[IC22b]
- Preprocessing:
The procedure is the same as that for ECtHR (A) (LexGLUE).
- # of classes: 8
- # of data:
5,532
/ 2,275 (validation)
/ 1,607 (testing)
- # of features:
6,290
- Files:
Wiki10-31K
- Source:
[AZ09a]
- Preprocessing:
The raw texts are extracted from the original HTML documents by concatenating all the <p> tags within the block <div id="bodyContent"> ... </div> in each file, with white spaces between them. We tried to make the raw texts and the training/testing split as close as possible to those used in AttentionXML, which are based on those provided by the Extreme Classification Repository. It seems that, due to updates of the source pages, eight instances are slightly different from those in AttentionXML. Other instances are exactly the same, except that tabs in the texts are replaced with white spaces. The raw text data are in the format of labels<TAB>raw texts, where the labels are separated by spaces. The tf-idf features are calculated from the raw texts provided here using sklearn's TfidfVectorizer with default configurations, except that "min_df" is set to 3 to avoid too many features (a minimal sketch is given after this entry). Note that the resulting tf-idf features are different from those provided by the Extreme Classification Repository. The code used to generate the raw texts and tf-idf features is provided.
- # of classes: 30,938
- # of data:
14,146
/ 6,616 (testing)
- # of features:
104,374
- Files:
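A minimal sketch of the raw-text extraction and tf-idf step described above; BeautifulSoup is used here only for illustration and the file locations are hypothetical:

    import glob
    from bs4 import BeautifulSoup
    from sklearn.feature_extraction.text import TfidfVectorizer

    def extract_text(html):
        soup = BeautifulSoup(html, "html.parser")
        body = soup.find("div", id="bodyContent")
        # concatenate all <p> tags within the bodyContent block with white spaces,
        # replacing tabs with white spaces
        return " ".join(p.get_text() for p in body.find_all("p")).replace("\t", " ")

    texts = []
    for path in sorted(glob.glob("wiki10/html/*.html")):  # hypothetical location
        with open(path, encoding="utf-8", errors="ignore") as f:
            texts.append(extract_text(f.read()))

    vectorizer = TfidfVectorizer(min_df=3)  # min_df=3 to avoid too many features
    X = vectorizer.fit_transform(texts)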
yeast
- Source:
[AE02a]
- # of classes: 14
- # of data:
1,500
/ 917 (testing)
- # of features:
103
- Files:
YouTube