label, features = line xi = {} for e in features.split(): ind, val = e.split(":") xi[int(ind)] = float(val) prob_y += [float(label)] prob_x += [xi]with
label, features = line features = features.split() # Two-dimensional array # One row is for feature index, # the other is for feature value array = (c_double * 2) * len(features) xi = array() for i in range(len(features)): ind, val = features[i].split(":") xi[i][0] = int(ind) xi[i][1] = float(val) prob_y += [float(label)] prob_x += [xi]
index_range = range(1, len(xi)) else: raise TypeError('xi should be a dictionary, list or tuple')with
index_range = range(1, len(xi)) elif isinstance(xi, Array): # Handle ctypes array return gen_feature_nodearray_from_ctypes_array(xi, feature_max, issparse) else: raise TypeError('xi should be a dictionary, list, tuple or ctypes.Array')Add another function to handle this data structure respectively.
def gen_feature_nodearray_from_ctypes_array(xi, feature_max=None, issparse=True): values = list(xi) if feature_max: assert(isinstance(feature_max, int)) values = filter(lambda x: x[0] <= feature_max, values) if issparse: values = filter(lambda x: x[1] != 0, values) values = sorted(values, key=lambda x: x[0]) ret = (feature_node * (len(values)+2))() ret[-1].index = -1 # for bias term ret[-2].index = -1 for idx, (fea, val) in enumerate(values): ret[idx].index = int(fea) ret[idx].value = val max_idx = 0 if values: max_idx = int(values[-1][0]) return ret, max_idxAnd, we are done.
**************************************************************************************************** dataset: random-generated, 1.5M instances, 150K features, 100 non-zeros per instance. data structure | PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND VER ---------------------------------------------------------------------------------------------------- list of dictionary | 17122 applerma 20 0 21.5g 21g 3808 S 0 34.1 2:58.07 python (2.7.3) list of array | 16433 applerma 20 0 2664m 2.6g 3812 S 0 4.1 4:01.64 python (2.7.3) ---------------------------------------------------------------------------------------------------- list of dictionary | 12209 applerma 20 0 22.0g 21g 3648 S 0 34.9 4:06.99 python3.2 (3.2.3) list of array | 12140 applerma 20 0 2664m 2.6g 3660 S 0 4.1 5:39.53 python3.2 (3.2.3) ---------------------------------------------------------------------------------------------------- list of dictionary | 12069 applerma 20 0 12.5g 12g 3528 S 0 19.7 3:24.30 python3.3 (3.3.0) list of array | 11821 applerma 20 0 2660m 2.5g 3532 S 0 4.0 5:26.49 python3.3 (3.3.0) ==================================================================================================== Ratio between two approaches python (2.7.3) 21 / 2.6 ~ 8.07 (3.2.3) 21 / 2.6 ~ 8.07 (3.3.0) 12 / 2.5 ~ 4.80 **************************************************************************************************** dataset: 'url_combined', 2396130 instances, 3231961 features, around 100 non-zeros per instance. data structure | PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND VER ---------------------------------------------------------------------------------------------------- list of dictionary | 16730 applerma 20 0 38.6g 38g 3808 S 0 61.4 6:03.76 python (2.7.3) list of array | 17106 applerma 20 0 4810m 4.7g 3812 S 0 7.4 7:50.32 python (2.7.3) ---------------------------------------------------------------------------------------------------- list of dictionary | 12191 applerma 20 0 40.0g 39g 3648 S 0 63.6 9:06.73 python3.2 (3.2.3) list of array | 12232 applerma 20 0 4915m 4.8g 3652 S 0 7.6 10:56.34 python3.2 (3.2.3) ---------------------------------------------------------------------------------------------------- list of dictionary | 12118 applerma 20 0 26.2g 26g 3512 S 0 41.5 6:25.71 python3.3 (3.3.0) list of array | 12088 applerma 20 0 4901m 4.7g 3520 S 0 7.5 9:55.15 python3.3 (3.3.0) ==================================================================================================== Ratio between two approaches python (2.7.3) 38 / 4.7 ~ 8.085 (3.2.3) 39 / 4.8 ~ 8.125 (3.3.0) 2.6 / 4.7 ~ 5.53 ****************************************************************************************************Ctypes array has much better memory efficiency than Python dictionary. Note that Python 3.3.0 consumes much less memory than earlier Python. We tested on Python list/tuple, but these data structures could not finish in feasible time.
However, ctypes array seems to take more time than dictionary,
so next we check running time by presenting
READ TIME: read data file and store it in the data structure.
TRAIN TIME: convert the data structure to LIBSVM/LIBLINEAR's structures and train the data.
******************************************************************************************************************** version: python (3.3.0) dataset: randomly generated sparse data (for each setting, we run the experiment twice) total features: 20K Data Structure | Instance Size | Feature Size | READ TIME(s) | TRAIN TIME(s) | TOTAL TIME(s) | -------------------------------------------------------------------------------------------------------------------- list of dictionary | 1.5M | 100 | 233.873 / 230.885 | 2442.176 / 2372.660 | 2676.049 / 2603.545 | list of array | -- | -- | 343.717 / 357.864 | 2142.012 / 2125.639 | 2485.730 / 2483.504 | -------------------------------------------------------------------------------------------------------------------- list of dictionary | 500K | 100 | 74.262 / 75.424 | 616.010 / 616.809 | 690.272 / 692.234 | list of array | -- | -- | 110.396 / 112.538 | 698.169 / 691.384 | 808.566 / 803.923 | -------------------------------------------------------------------------------------------------------------------- list of dictionary | 10K | 100 | 1.644 / 1.779 | 7.987 / 7.945 | 9.632 / 9.725 | list of array | -- | -- | 2.664 / 2.280 | 9.661 / 9.256 | 12.326 / 11.537 | -------------------------------------------------------------------------------------------------------------------- list of dictionary | 10K | 1K | 14.047 / 14.140 | 51.475 / 50.927 | 65.522 / 65.067 | list of array | -- | -- | 22.564 / 22.553 | 67.680 / 67.753 | 90.244 / 90.307 | -------------------------------------------------------------------------------------------------------------------- list of dictionary | 10K | 10K | 138.109 / 138.069 | 111.620 / 105.487 | 249.730 / 243.556 | list of array | -- | -- | 235.826 / 214.531 | 285.419 / 278.632 | 521.246 / 493.164 | ******************************************************************************************************************** version: python (3.3.0) dataset: randomly generated dense data (for each setting, we run the experiment twice) Data Structure | Instance Size | Feature Size | READ TIME(s) | TRAIN TIME(s) | TOTAL TIME(s) | -------------------------------------------------------------------------------------------------------------------- list of dictionary | 1.5M | 100 | 212.936 / 222.638 | 2501.161 / 2504.249 | 2714.097 / 2726.887 | list of array | -- | -- | 350.020 / 348.115 | 2394.653 / 2401.696 | 2744.673 / 2749.812 | -------------------------------------------------------------------------------------------------------------------- list of dictionary | 500K | 100 | 72.373 / 74.261 | 755.902 / 750.575 | 828.275 / 824.836 | list of array | -- | -- | 116.867 / 118.757 | 702.789 / 706.021 | 819.657 / 824.778 | -------------------------------------------------------------------------------------------------------------------- list of dictionary | 10K | 100 | 1.574 / 1.526 | 8.292 / 8.115 | 9.866 / 9.641 | list of array | -- | -- | 2.290 / 2.362 | 9.510 / 9.996 | 11.800 / 12.359 | -------------------------------------------------------------------------------------------------------------------- list of dictionary | 10K | 1K | 13.768 / 13.935 | 56.815 / 56.858 | 70.583 / 70.794 | list of array | -- | -- | 22.779 / 22.954 | 73.417 / 73.496 | 96.197 / 96.451 | -------------------------------------------------------------------------------------------------------------------- list of dictionary | 10K | 10K | 140.726 / 145.051 | 176.726 / 170.361 | 317.453 / 315.413 | list of array | -- | -- | 230.698 / 233.546 | 348.275 / 330.043 | 578.973 / 563.590 | ********************************************************************************************************************We have also checked on different Python version, 2.7.3 and 3.2.3. The results are similar.