python - Numpy average from a large masked array -




What's the appropriate way to average the values of a big masked array? Calling .mean() on big arrays fails for me.

Consider creating an array of 1,000,000 elements, all with the value 500, like:

a = np.ones(1000000, dtype=np.int16) * 500

Then create a random mask and combine both in a new masked array:

mask = np.random.randint(0, 2, a.size)
b = np.ma.masked_array(a, mask=mask)

The array b inherits the dtype of a, int16.

Getting the average value of b can be done in different ways, and they all give the same result. The non-ma functions, however, ignore the mask and shouldn't be used.

print(np.average(b), np.mean(b), np.ma.average(b), np.ma.mean(b))
(500.0, 500.0, 500.0, 500.0)

But if I increase the original array size from 1 million to 10 million, the result becomes:

print(np.average(b), np.mean(b), np.ma.average(b), np.ma.mean(b))
(500.0, -359.19365132075774, -359.19365132075774, -359.19365132075774)

Now np.average seems correct but, as said, it is ignoring the mask and calculating the average over the entire array. This can be shown by changing some of the masked values, with b[b.mask] = 1000 for example. I would expect np.mean to do the same, though.

Casting the masked array b to float32 results in:

b = b.astype(np.float32)
print(np.average(b), np.mean(b), np.ma.average(b), np.ma.mean(b))
(511.18945, 510.37895680000003, 510.37895680000003, 510.37895680000003)

And casting the masked array b to float64 (which should be done by default according to the documentation) results in:

b = b.astype(np.float64)
print(np.average(b), np.mean(b), np.ma.average(b), np.ma.mean(b))
(500.0, 500.0, 500.0, 500.0)

So casting to float64 seems to work, but I would rather avoid it since it increases the memory footprint a lot.
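To put a number on that memory concern, here is a small sketch comparing the size of the data buffer before and after the cast. The names a16 and a64 are only illustrative, and the mask's own storage is not counted.

```python
import numpy as np

# 1 million int16 values, as in the example above
a16 = np.ones(1_000_000, dtype=np.int16) * 500

# Casting to float64 makes a full copy with 8 bytes per element instead of 2
a64 = a16.astype(np.float64)

print(a16.dtype, a16.nbytes)  # int16 buffer size in bytes
print(a64.dtype, a64.nbytes)  # float64 buffer size in bytes, four times larger
```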

And while testing some of this, I also noticed that calling np.ma.average on a non-masked array (a) gives the right result if the size is 1 million, and a wrong result when it is 10 million, while np.ma.mean is right for both sizes.

Can someone explain the relation between the dtype and the size of the array in this example? It's a bit of a mystery to me when this occurs and how to handle it properly.

All of this was done in NumPy 1.8.1 on a 64-bit Windows 7 machine, installed via conda.

Here is a notebook replicating what I have done:

http://nbviewer.ipython.org/gist/rutgerk/69b60da73f464900310a

"This can be shown by changing some of the masked values, with b[b.mask] = 1000 for example. I would expect np.mean to do the same, though."

This is not correct. b.mask is True where there are masked values, so when you assign a new value to the masked values you are unmasking them, making all the values in the array valid. You can use b[np.invert(b.mask)] instead.

So this should work:

import numpy as np
a = np.ones(10000000, dtype=np.int64) * 500
mask = np.random.randint(0, 2, a.size)
b = np.ma.masked_array(a, mask=mask)
b[np.invert(b.mask)] = 1000
print(np.average(b), np.mean(b), np.ma.average(b), np.ma.mean(b))

which gives the right value for everything except np.average.
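The unmasking behaviour is easier to see on a tiny array. The following is a minimal sketch, assuming the default soft mask; the four-element arrays and the names b1 and b2 are made up for illustration.

```python
import numpy as np

mask = np.array([1, 0, 1, 0], dtype=bool)

# Assigning through b1.mask writes to the masked slots, which unmasks them.
b1 = np.ma.masked_array(np.full(4, 500, dtype=np.int64), mask=mask.copy())
b1[b1.mask] = 1000
n_valid_1 = b1.count()   # every element is valid now
mean_1 = b1.mean()       # average of 1000, 500, 1000, 500

# Assigning through the inverted mask only touches the valid elements,
# so the masked slots stay masked.
b2 = np.ma.masked_array(np.full(4, 500, dtype=np.int64), mask=mask.copy())
b2[np.invert(b2.mask)] = 1000
n_valid_2 = b2.count()   # still only the two unmasked elements
mean_2 = b2.mean()       # average of the two valid values

print(n_valid_1, mean_1)
print(n_valid_2, mean_2)
```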

Apart from that, when you are getting negative or incorrect values it's because you are getting an integer overflow. Using dtype=np.int64 instead should solve it.

Edit: An alternative is to use Python integers with dtype=object instead of fixed-width integers, but it will be slower. This change makes np.average crash, while the rest of the methods work properly.

Edit 2: As mentioned in the comments, in this case it's not necessary to increase the size of the elements of the array. You can call np.mean(b, dtype=np.float64) so that np.mean uses a bigger accumulator and avoids overflowing.
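As a rough illustration of the overflow, the sketch below forces a 32-bit accumulator with the dtype argument of sum. This assumes the sum in the question was accumulated in int32, which is presumably what that Windows build of NumPy did by default. The count of 5 million valid values is a stand-in for "about half of 10 million".

```python
import numpy as np

n_valid = 5_000_000                        # roughly half of 10 million elements
values = np.full(n_valid, 500, dtype=np.int16)

true_sum = n_valid * 500                   # exact sum as a Python int
int32_max = np.iinfo(np.int32).max         # largest value an int32 can hold

wrapped_sum = values.sum(dtype=np.int32)   # 32-bit accumulator wraps around
safe_sum = values.sum(dtype=np.int64)      # 64-bit accumulator has room

print(true_sum > int32_max)                # the exact sum does not fit in int32
print(wrapped_sum / n_valid)               # negative "mean" from the wrapped sum
print(safe_sum / n_valid)                  # mean from the 64-bit sum
```

With 1 million elements the sum stays well below the int32 limit, which would explain why the smaller array gave the right answer.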
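A minimal sketch of that last suggestion, using a seeded RandomState (my addition, only to make the run repeatable): the array keeps its int16 storage and only the accumulation is done in float64.

```python
import numpy as np

rng = np.random.RandomState(0)             # seeded only for repeatability
a = np.ones(10_000_000, dtype=np.int16) * 500
mask = rng.randint(0, 2, a.size)
b = np.ma.masked_array(a, mask=mask)

# Accumulate in float64 without converting the stored data
mean_f64 = np.mean(b, dtype=np.float64)

print(mean_f64)   # mean over the unmasked values
print(b.dtype)    # the array itself is still int16
```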

python numpy
