This page contains many classification, regression, multi-label and string data sets stored in LIBSVM format. For some sets, raw materials (e.g., original texts) are also available. These data sets are from UCI, Statlog, StatLib and other collections; we thank them for their efforts. For most sets, we linearly scale each attribute to [-1,1] or [0,1]. The testing data (if provided) is adjusted accordingly. Some training data are further split into "training" (tr) and "validation" (val) sets. Details can be found in the description of each data set. To read the data in MATLAB, you can use "libsvmread" from the LIBSVM package.
A summary of all data sets is given below. If you have used LIBSVM with these sets and find them useful, please cite our work as:
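As a rough illustration of the format and the scaling described above, the following Python sketch parses lines of the form `<label> <index>:<value> ...` and linearly rescales each attribute using ranges taken from the training rows. The function names (`parse_libsvm_line`, `fit_ranges`, `scale_row`) are hypothetical helpers written for this example, not part of LIBSVM, and the treatment of absent entries as zeros and of constant attributes is an assumption of the sketch rather than a description of how the sets on this page were prepared.

```python
from typing import Dict, List, Tuple


def parse_libsvm_line(line: str) -> Tuple[List[float], Dict[int, float]]:
    """Parse '<label>[,<label>...] <index>:<value> ...' into (labels, features)."""
    tokens = line.split()
    labels: List[float] = []
    # Multi-label sets put several comma-separated labels in the first field;
    # a first token containing ':' means the label field is empty.
    if tokens and ":" not in tokens[0]:
        labels = [float(t) for t in tokens[0].split(",")]
        tokens = tokens[1:]
    features: Dict[int, float] = {}
    for tok in tokens:
        idx, val = tok.split(":", 1)
        features[int(idx)] = float(val)
    return labels, features


def fit_ranges(rows: List[Dict[int, float]]) -> Dict[int, Tuple[float, float]]:
    """Collect the (min, max) of every attribute over the training rows."""
    ranges: Dict[int, Tuple[float, float]] = {}
    counts: Dict[int, int] = {}
    for feats in rows:
        for idx, val in feats.items():
            lo, hi = ranges.get(idx, (val, val))
            ranges[idx] = (min(lo, val), max(hi, val))
            counts[idx] = counts.get(idx, 0) + 1
    # Assumption: an entry missing from a sparse row stands for the value 0.
    for idx, (lo, hi) in list(ranges.items()):
        if counts[idx] < len(rows):
            ranges[idx] = (min(lo, 0.0), max(hi, 0.0))
    return ranges


def scale_row(feats: Dict[int, float],
              ranges: Dict[int, Tuple[float, float]],
              lower: float = -1.0,
              upper: float = 1.0) -> Dict[int, float]:
    """Map each attribute linearly so that its training range becomes [lower, upper]."""
    scaled: Dict[int, float] = {}
    for idx, (lo, hi) in ranges.items():
        if hi == lo:
            continue  # constant attribute in training: dropped in this sketch
        value = feats.get(idx, 0.0)
        s = lower + (upper - lower) * (value - lo) / (hi - lo)
        if s != 0.0:  # keep the output sparse
            scaled[idx] = s
    return scaled  # attributes never seen in training are ignored


train = [parse_libsvm_line(l)[1] for l in ["+1 1:2 3:10", "-1 1:4 2:5 3:20"]]
ranges = fit_ranges(train)
scaled_train = [scale_row(row, ranges) for row in train]
# The testing data reuses the training ranges, so values may fall outside [-1, 1].
scaled_test = scale_row(parse_libsvm_line("+1 1:3 3:25")[1], ranges)
```

Reusing the training ranges for the testing rows is what "adjusted accordingly" is taken to mean here; fitting separate ranges on the testing data would make the two sets inconsistent.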
Chih-Chung Chang and Chih-Jen Lin, LIBSVM: a library for support vector machines.
ACM Transactions on Intelligent Systems and Technology, 2:27:1--27:27, 2011.
Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
Please also cite the source of the data sets (references given below).
Go to pages of classification (binary, multi-class), regression, multi-label, and string.
Some sets are large and the connection may fail. On Linux you can use
> wget -t inf URL_address_of_data
to retry infinitely many times. If it still fails, add -c to continue downloading a partially downloaded set.
You can also use
> lftp -c 'pget -c URL_address_of_data'
to open several connections and reduce the download time.
name | source | type | classes | training size | testing size | features |
---|---|---|---|---|---|---|
a1a | UCI | classification | 2 | 1,605 | 30,956 | 123 |
a2a | UCI | classification | 2 | 2,265 | 30,296 | 123 |
a3a | UCI | classification | 2 | 3,185 | 29,376 | 123 |
a4a | UCI | classification | 2 | 4,781 | 27,780 | 123 |
a5a | UCI | classification | 2 | 6,414 | 26,147 | 123 |
a6a | UCI | classification | 2 | 11,220 | 21,341 | 123 |
a7a | UCI | classification | 2 | 16,100 | 16,461 | 123 |
a8a | UCI | classification | 2 | 22,696 | 9,865 | 123 |
a9a | UCI | classification | 2 | 32,561 | 16,281 | 123 |
australian | Statlog | classification | 2 | 690 | | 14 |
avazu | Avazu's Click-through Prediction | classification | 2 | 40,428,967 | 4,577,464 | 1,000,000 |
breast-cancer | UCI | classification | 2 | 683 | | 10 |
cod-rna | [AVU06a] | classification | 2 | 59,535 | | 8 |
colon-cancer | [AU99a] | classification | 2 | 62 | | 2,000 |
covtype.binary | UCI | classification | 2 | 581,012 | | 54 |
criteo | Criteo's Display Advertising Challenge | classification | 2 | 45,840,617 | 6,042,135 | 1,000,000 |
criteo_tb | Criteo's Terabyte Click Logs | classification | 2 | 4,195,197,692 | 178,274,637 | 1,000,000 |
diabetes | UCI | classification | 2 | 768 | | 8 |
duke breast-cancer | [MW01a] | classification | 2 | 44 | | 7,129 |
epsilon | PASCAL Challenge 2008 | classification | 2 | 400,000 | 100,000 | 2,000 |
fourclass | [TKH96a] | classification | 2 | 862 | | 2 |
german.numer | Statlog | classification | 2 | 1,000 | | 24 |
gisette | NIPS 2003 Feature Selection Challenge [IG05a] | classification | 2 | 6,000 | 1,000 | 5,000 |
heart | Statlog | classification | 2 | 270 | | 13 |
HIGGS | UCI | classification | 2 | 11,000,000 | | 28 |
Hyperpartisan News Detection | SemEval-2019 Task 4: Hyperpartisan News Detection | classification | 2 | 516 | 65 | |
ijcnn1 | [DP01a] | classification | 2 | 49,990 | 91,701 | 22 |
imdb-sentiment | Learning Word Vectors for Sentiment Analysis | classification | 2 | 25,000 | 25,000 | |
ionosphere | UCI | classification | 2 | 351 | | 34 |
kdd2010 (algebra) | KDD CUP 2010 | classification | 2 | 8,407,752 | 510,302 | 20,216,830 |
kdd2010 (bridge to algebra) | KDD CUP 2010 | classification | 2 | 19,264,097 | 748,401 | 29,890,095 |
kdd2010 raw version (bridge to algebra) | KDD CUP 2010 | classification | 2 | 19,264,097 | 748,401 | 1,163,024 |
kdd2012 | KDD CUP 2012 | classification | 2 | 149,639,105 | | 54,686,452 |
leukemia | [TG99a] | classification | 2 | 38 | 34 | 7,129 |
liver-disorders | UCI | classification | 2 | 145 | 200 | 5 |
madelon | NIPS 2003 Feature Selection Challenge [IG05a] | classification | 2 | 2,000 | 600 | 500 |
mushrooms | UCI | classification | 2 | 8,124 | | 112 |
news20.binary | [SSK05a] | classification | 2 | 19,996 | | 1,355,191 |
phishing | UCI | classification | 2 | 11,055 | | 68 |
rcv1.binary | [DL04b] | classification | 2 | 20,242 | 677,399 | 47,236 |
real-sim | A. McCallum | classification | 2 | 72,309 | | 20,958 |
skin_nonskin | UCI | classification | 2 | 245,057 | | 3 |
splice | Delve | classification | 2 | 1,000 | 2,175 | 60 |
splice-site | [SS10a,AA12a] | classification | 2 | 50,000,000 | 4,627,840 | 11,725,480 |
sonar | UCI | classification | 2 | 208 | | 60 |
SUSY | UCI | classification | 2 | 5,000,000 | | 18 |
svmguide1 | [CWH03a] | classification | 2 | 3,089 | 4,000 | 4 |
svmguide3 | [CWH03a] | classification | 2 | 1,243 | 41 | 21 |
url | [JM09a] | classification | 2 | 2,396,130 | | 3,231,961 |
w1a | [JP98a] | classification | 2 | 2,477 | 47,272 | 300 |
w2a | [JP98a] | classification | 2 | 3,470 | 46,279 | 300 |
w3a | [JP98a] | classification | 2 | 4,912 | 44,837 | 300 |
w4a | [JP98a] | classification | 2 | 7,366 | 42,383 | 300 |
w5a | [JP98a] | classification | 2 | 9,888 | 39,861 | 300 |
w6a | [JP98a] | classification | 2 | 17,188 | 32,561 | 300 |
w7a | [JP98a] | classification | 2 | 24,692 | 25,057 | 300 |
w8a | [JP98a] | classification | 2 | 49,749 | 14,951 | 300 |
webspam | Webb Spam Corpus [ST06a] | classification | 2 | 350,000 | | 16,609,143 |
aloi | aloi [AR14a] | classification | 1,000 | 108,000 | | 128 |
cifar10 | The CIFAR-10 dataset [AK09a] | classification | 10 | 50,000 | 10,000 | 3,072 |
connect-4 | UCI | classification | 3 | 67,557 | | 126 |
covtype | UCI | classification | 7 | 581,012 | | 54 |
dna | Statlog | classification | 3 | 2,000 | 1,186 | 180 |
glass | UCI | classification | 6 | 214 | | 9 |
imdb-rating | Jointly Modelling Aspects, Ratings and Sentiments for Movie Recommendation | classification | 10 | 348,415 | | |
iris | UCI | classification | 3 | 150 | | 4 |
LEDGAR (LexGLUE) | [IC22b] | classification | 100 | 60,000 | 10,000 | 19,996 |
letter | Statlog | classification | 26 | 15,000 | 5,000 | 16 |
mnist | [YL98a] | classification | 10 | 60,000 | 10,000 | 780 |
mnist8m | Invariant SVM [GL07b] | classification | 10 | 8,100,000 | | 784 |
news20 | [KL95a] | classification | 20 | 15,935 | 3,993 | 62,061 |
news20 (18,846) | [KL95a] | classification | 20 | 9,051 | 7,532 | 130,107 |
pendigits | UCI | classification | 10 | 7,494 | 3,498 | 16 |
poker | UCI | classification | 10 | 25,010 | 1,000,000 | 10 |
protein | [JYW02a] | classification | 3 | 17,766 | 6,621 | 357 |
rcv1.multiclass | [DL04b] | classification | 53 | 15,564 | 518,571 | 47,236 |
SCOTUS (LexGLUE) | [IC22b] | classification | 13 | 5,000 | 1,400 | 126,405 |
satimage | Statlog | classification | 6 | 4,435 | 2,000 | 36 |
sector | [AM98a] | classification | 105 | 6,412 | 3,207 | 55,197 |
segment | Statlog | classification | 7 | 2,310 | | 19 |
Sensorless | UCI | classification | 11 | 58,509 | | 48 |
shuttle | Statlog | classification | 7 | 43,500 | 14,500 | 9 |
smallNORB | The Small NORB Dataset [YL04b] | classification | 5 | 24,300 | 24,300 | 18,432 |
SVHN | SVHN [YN11a] | classification | 10 | 73,257 | 26,032 | 3,072 |
svmguide2 | [CWH03a] | classification | 3 | 391 | | 20 |
svmguide4 | [CWH03a] | classification | 6 | 300 | 312 | 10 |
usps | [JJH94a] | classification | 10 | 7,291 | 2,007 | 256 |
SensIT Vehicle (acoustic) | Sensit [MD04a] | classification | 3 | 78,823 | 19,705 | 50 |
SensIT Vehicle (seismic) | Sensit [MD04a] | classification | 3 | 78,823 | 19,705 | 50 |
SensIT Vehicle (combined) | Sensit [MD04a] | classification | 3 | 78,823 | 19,705 | 100 |
vehicle | Statlog | classification | 4 | 846 | | 18 |
vowel | UCI | classification | 11 | 528 | 462 | 10 |
wine | UCI | classification | 3 | 178 | | 13 |
abalone | UCI | regression | | 4,177 | | 8 |
bodyfat | StatLib | regression | | 252 | | 14 |
cadata | StatLib | regression | | 20,640 | | 8 |
cpusmall | Delve | regression | | 8,192 | | 12 |
E2006-log1p | 10-K Corpus | regression | | 16,087 | 3,308 | 4,272,227 |
E2006-tfidf | 10-K Corpus | regression | | 16,087 | 3,308 | 150,360 |
eunite2001 | | regression | | 336 | 31 | 16 |
housing | UCI | regression | | 506 | | 13 |
mg | [GWF01a] | regression | | 1,385 | | 6 |
mpg | UCI | regression | | 392 | | 7 |
pyrim | UCI | regression | | 74 | | 27 |
space_ga | StatLib | regression | | 3,107 | | 6 |
triazines | UCI | regression | | 186 | | 60 |
YearPredictionMSD | UCI | regression | | 463,715 | 51,630 | 90 |
Amazon-670K | [JM13a] | multi-label | 670,091 | 490,449 | 153,025 | 135,909 (ver1) 135,909 (ver2) |
AmazonCat-13K | [JM13a] | multi-label | 13,330 | 1,186,239 | 306,782 | 203,882 (ver1) 1,293,747 (ver2) 203,882 (ver3) |
bibtex | [GT08a] | multi-label | 159 | 7,395 | | 1,836 |
BlogCatalog | [LT09a] | multi-label | 39 | 10,312 | | 128 |
delicious | [GT08a] | multi-label | 983 | 16,105 | | 500 |
ECtHR (A) (LexGLUE) | [IC22b] | multi-label | 10 | 9,000 | 1,000 | 92,401 |
ECtHR (B) (LexGLUE) | [IC22b] | multi-label | 10 | 9,000 | 1,000 | 92,401 |
EUR-LEX (LexGLUE) | [IC22b] | multi-label | 100 | 55,000 | 5,000 | 147,464 |
EUR-Lex | [LM10a] | multi-label | 3,956 | 15,449 | 3,865 | 186,104 |
EURLEX57K | [IC19a] | multi-label | 4,271 | 45,000 | 6,000 | N/A |
Flickr | [LT09a] | multi-label | 195 | 80,513 | | 128 |
mediamill (exp1) | Mediamill | multi-label | 101 | 30,993 | 12,914 | 120 |
PPI | [WLH17a] | multi-label | 121 | 54,958 | | 128 |
rcv1v2 (topics; subsets) | [DL04b] | multi-label | 101 | 3,000 | 3,000 | 47,236 |
rcv1v2 (topics; full sets) | [DL04b] | multi-label | 101 | 23,149 | 781,265 | 47,236 |
rcv1v2 (industries; full sets) | [DL04b] | multi-label | 313 | 23,149 | 781,265 | 47,236 |
rcv1v2 (regions; full sets) | [DL04b] | multi-label | 228 | 23,149 | 781,265 | 47,236 |
scene-classification | [MB04a] | multi-label | 6 | 1,211 | 1,196 | 294 |
siam-competition2007 | SIAM Text Mining Competition 2007 | multi-label | 22 | 21,519 | 7,077 | 30,438 |
UNFAIR-ToS (LexGLUE) | [IC22b] | multi-label | 8 | 5,532 | 1,607 | 6,290 |
Wiki10-31K | [AZ09a] | multi-label | 30,938 | 14,146 | 6,616 | 104,374 |
yeast | [AE02a] | multi-label | 14 | 1,500 | 917 | 103 |
mnist (string format) | [SL96a] | string | | 60,000 | 10,000 | string |
YouTube | [LT09a] | multi-label | 46 | 31,703 | | 128 |
We have tried our best to obtain permission from most original sources to distribute these sets. Please follow their respective copyrights when using them.
Author: Rong-En Fan at National Taiwan University. Please contact Chih-Jen Lin with any questions.