This page contains many classification, regression, multi-label, and string data sets stored in LIBSVM format. For some sets, raw materials (e.g., original texts) are also available. These data sets come from UCI, Statlog, StatLib, and other collections; we thank the original providers for their efforts. For most sets, we linearly scale each attribute to [-1,1] or [0,1]. The testing data (if provided) is adjusted accordingly. Some training data are further separated into "training" (tr) and "validation" (val) sets; details can be found in the description of each data set. To read the data in MATLAB, you can use "libsvmread" in the LIBSVM package.
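The LIBSVM format stores one example per line: a label followed by sparse index:value pairs. For illustration only (this is not libsvmread, just a minimal sketch of the same format), a plain-Python parser might look like:

```python
def read_libsvm(path):
    """Parse a LIBSVM-format file.

    Returns a list of labels and a parallel list of sparse feature
    dicts mapping 1-based feature index -> value.
    """
    labels, rows = [], []
    with open(path) as f:
        for line in f:
            parts = line.split()
            if not parts:          # skip blank lines
                continue
            labels.append(float(parts[0]))
            rows.append({int(i): float(v)
                         for i, v in (p.split(":") for p in parts[1:])})
    return labels, rows
```

For example, the line `+1 1:0.5 3:-1` yields label 1.0 and the sparse row {1: 0.5, 3: -1.0}; absent indices are implicit zeros.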
> splice_explicit_features data/H_sapiens_acc_all_examples_plain_139-279_50000000.fasta data/H_sapiens_acc_all_examples_plain_139-279_50000000.fasta_down data/H_sapiens_acc_all_examples_plain_139-279_50000000.fasta_up data/H_sapiens_acc_all_examples_plain_50000000.label

and

> splice_explicit_features data/H_sapiens_acc_all_examples_plain_139-279_5e7_test.fasta data/H_sapiens_acc_all_examples_plain_139-279_5e7_test.fasta_down data/H_sapiens_acc_all_examples_plain_139-279_5e7_test.fasta_up data/H_sapiens_acc_all_examples_plain_5e7_test.label

This set is highly skewed, so auPRC (area under the precision-recall curve) is the suitable evaluation criterion. Using the MATLAB Statistics Toolbox, you can obtain auPRC by

[Xpr,Ypr,Tpr,AUCpr] = perfcurve(labels, predictions, 1, 'xCrit', 'reca', 'yCrit', 'prec');
AUCpr

where labels are the true labels and predictions are your predicted decision values. You can use LIBLINEAR with option -s 3 (i.e., L2-regularized L1-loss SVM) to get an auPRC of 0.5773, similar to the 0.5775 reported in Table 2 of Sonnenburg and Franc (2010). If you don't have enough RAM to run LIBLINEAR, you can use the following code at LIBSVM tools and see our experimental log here. The code used is a disk-level linear classifier. [HFY11a]
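If MATLAB is unavailable, auPRC can also be computed directly from decision values. The sketch below is a plain-Python illustration (not the code used in our experiments): it scans examples in decreasing score order, updates precision and recall, handles tied scores as a group, and integrates the curve trapezoidally. perfcurve's exact interpolation at ties may differ slightly, so treat this only as a rough cross-check.

```python
def au_prc(labels, scores, pos=1):
    """Area under the precision-recall curve, trapezoidal rule.

    labels: true class labels; scores: decision values, where a
    higher score means "more likely positive"; pos: positive label.
    """
    pairs = sorted(zip(scores, labels), reverse=True)
    n_pos = sum(1 for y in labels if y == pos)
    tp = fp = 0
    area, prev_recall, prev_prec = 0.0, 0.0, 1.0
    i, n = 0, len(pairs)
    while i < n:
        j = i
        # consume every example tied at the current score together
        while j < n and pairs[j][0] == pairs[i][0]:
            if pairs[j][1] == pos:
                tp += 1
            else:
                fp += 1
            j += 1
        recall = tp / n_pos
        prec = tp / (tp + fp)
        area += (recall - prev_recall) * (prec + prev_prec) / 2
        prev_recall, prev_prec = recall, prec
        i = j
    return area
```

A perfectly ranked set (all positives scored above all negatives) gives auPRC 1.0, while interleaved rankings give proportionally less area.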