python - Performance issue in computing multiple linear regression with huge data sets
I am using np.linalg.lstsq to calculate a multiple linear regression. My data set is huge: it has 20,000 independent variables (x) and 1 dependent variable (y). Each independent variable has 10,000 data points, like this:

              x1    x2     x3   ..  x20,000   y
data1      -> 10    1.8    1    ..  1         3
data2      -> 20    2.3    200  ..  206       5
..            ..    ..     ..   ..  ..        ..
data10,000 -> 300   2398   878  ..  989       998

It is taking a huge amount of time (20-30 minutes) to compute the regression coefficients using np.linalg.lstsq. Can anyone tell me how to improve the solution with respect to computation time?
The time spent seems to scale as n**2.8, so you can increase the speed by reducing the number of data points.
If you downsample the data to one thousand rows, you can do the computations in a couple of seconds. You can then repeat the analysis with a different random sample.
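As a rough sketch of the downsampling idea, the snippet below fits np.linalg.lstsq on a random subset of rows. The helper name fit_on_random_sample and the small synthetic X, y are assumptions made for illustration; they stand in for the real 10,000 x 20,000 data, which is not shown in the question.

```python
import numpy as np

def fit_on_random_sample(X, y, n_rows, rng):
    """Least-squares fit on a random subset of rows (hypothetical helper)."""
    # Draw row indices without replacement so no observation is repeated.
    idx = rng.choice(X.shape[0], size=n_rows, replace=False)
    coef, _, _, _ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
    return coef

# Small synthetic stand-in for the real data (assumption, for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 50))
true_coef = rng.normal(size=50)
y = X @ true_coef + 0.1 * rng.normal(size=2000)

# Fit on 200 of the 2000 rows; call again to get a different random sample.
coef = fit_on_random_sample(X, y, n_rows=200, rng=rng)
```

Calling the helper repeatedly with the same rng gives a different random sample each time, which is what the repeated analysis needs.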
In order to combine the results, you have several options:

1. As is usual in cross correlation in statistics, weight them by the inverse of the norm of the residuals (fast to compute, it is in the output).
2. Measure the real residuals on the total dataset (that takes less than 3 seconds) and either:
   a) keep the best one, or
   b) weight them by the inverse of the real distance.

The best alternative depends on how much accuracy you need and on the nature of your data. If you only need a gross estimate in medium noise, a single downsampling should work. Keep in mind that you are underdetermined already (20,000 unknowns with only 10,000 rows), so your solutions will be degenerate.
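The combination options above could be sketched as follows. The function names fit_sample and combine_fits and the synthetic data are assumptions for illustration, not part of the original answer. The residuals are computed explicitly on the full data set, because np.linalg.lstsq returns an empty residuals array whenever the fitted system has no more rows than columns, which is the case when 1,000 rows are sampled from 20,000 variables.

```python
import numpy as np

def fit_sample(X, y, n_rows, rng):
    """Least-squares fit on a random subset of rows (hypothetical helper)."""
    idx = rng.choice(X.shape[0], size=n_rows, replace=False)
    return np.linalg.lstsq(X[idx], y[idx], rcond=None)[0]

def combine_fits(X, y, coefs):
    """Score each fit on the full data set, then combine (hypothetical helper)."""
    coefs = np.asarray(coefs)
    # Norm of the residuals of every candidate on the total data set.
    res_norms = np.array([np.linalg.norm(X @ c - y) for c in coefs])
    # Option (a): keep the single best candidate.
    best = coefs[np.argmin(res_norms)]
    # Option (b): average weighted by inverse residual norm
    # (assumes no candidate fits the full data exactly, i.e. no zero norm).
    weights = 1.0 / res_norms
    weights /= weights.sum()
    weighted = weights @ coefs
    return best, weighted, res_norms

# Small synthetic stand-in for the real data (assumption, for illustration).
rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 50))
true_coef = rng.normal(size=50)
y = X @ true_coef + 0.1 * rng.normal(size=2000)

# Five independent random subsamples of 200 rows each.
coefs = [fit_sample(X, y, n_rows=200, rng=rng) for _ in range(5)]
best, weighted, res_norms = combine_fits(X, y, coefs)
```

Scoring on the full data set costs only one matrix-vector product per candidate, so it stays cheap compared with the fit itself.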
python numpy machine-learning linear-regression