python - Numpy average from a large masked array -
What is the appropriate way to get the average value of a big masked array? Calling .mean() on a big array fails for me.
Consider creating an array of 1,000,000 elements, each with value 500, like:
import numpy as np
a = np.ones(1000000, dtype=np.int16) * 500
Then create a random mask and combine both in a new masked array:
mask = np.random.randint(0, 2, a.size)
b = np.ma.masked_array(a, mask=mask)
The array b inherits the dtype of a, so it is also int16.
Getting the average value of b can be done in different ways, which all give the same result. The non-ma functions ignore the mask and shouldn't be used.
print(np.average(b), np.mean(b), np.ma.average(b), np.ma.mean(b))
(500.0, 500.0, 500.0, 500.0)
But if I increase the original array size from 1 million to 10 million, the result becomes:
print(np.average(b), np.mean(b), np.ma.average(b), np.ma.mean(b))
(500.0, -359.19365132075774, -359.19365132075774, -359.19365132075774)
Now np.average seems correct, but as said it is ignoring the mask and calculating the average over the entire array. This can be shown by changing some of the masked values, with b[b.mask] = 1000 for example. I would expect np.mean to do the same, though.
Casting the masked array b to float32 results in:
b = b.astype(np.float32)
print(np.average(b), np.mean(b), np.ma.average(b), np.ma.mean(b))
(511.18945, 510.37895680000003, 510.37895680000003, 510.37895680000003)
And casting the masked array b to float64 (which should be done by default according to the documentation) results in:
b = b.astype(np.float64)
print(np.average(b), np.mean(b), np.ma.average(b), np.ma.mean(b))
(500.0, 500.0, 500.0, 500.0)
So casting to float64 seems to work, but I would rather avoid it since it would increase the memory footprint a lot.
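To put a number on that footprint, here is a minimal sketch that compares the storage of the 10 million element array in both dtypes. The variable names are only illustrative, and the mask's own storage is left out.

```python
import numpy as np

a = np.ones(10000000, dtype=np.int16) * 500

bytes_int16 = a.nbytes                        # 2 bytes per element
bytes_float64 = a.astype(np.float64).nbytes   # 8 bytes per element

print(bytes_int16, bytes_float64)   # 20000000 80000000
```

The float64 copy takes four times the memory of the int16 data.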
And while testing some of this, I also noted that calling np.ma.average on the non-masked array (a) gives the right result if the size is 1 million, and a wrong result when it is 10 million, while np.ma.mean is right for both sizes.
Can someone explain the relation between the dtype and the size of the array in this example? It's a bit of a mystery to me when this occurs and how to handle it properly.
All of this was done in NumPy 1.8.1 on a 64-bit Win 7 machine, installed via conda.
Here is a notebook replicating what I have done:
http://nbviewer.ipython.org/gist/rutgerk/69b60da73f464900310a
"This can be shown by changing some of the masked values, with b[b.mask] = 1000 for example. I would expect np.mean to do the same, though."
This is not correct: b.mask is True where there are masked values. When you assign a new value to the masked values you are unmasking them, making all the values in the array valid. You can use b[np.invert(b.mask)] instead.
So this should work:
import numpy as np
a = np.ones(10000000, dtype=np.int64) * 500
mask = np.random.randint(0, 2, a.size)
b = np.ma.masked_array(a, mask=mask)
b[np.invert(b.mask)] = 1000
print(np.average(b), np.mean(b), np.ma.average(b), np.ma.mean(b))
which will give the right value for all of them except np.average.
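The difference between the two assignments can be checked on a tiny array. This is a minimal sketch with a hand-written mask, assuming the default soft mask (not a hard mask); the names mask_values, b_unmasked and b_kept are only illustrative.

```python
import numpy as np

# A plain list, so each masked array below builds its own mask.
mask_values = [True, False, True, False, False, True]

# Assigning through b.mask writes to the masked slots and, with the
# default soft mask, clears their mask bits.
b_unmasked = np.ma.masked_array(np.full(6, 500, dtype=np.int16), mask=mask_values)
b_unmasked[b_unmasked.mask] = 1000

# Assigning through the inverted mask only touches the valid slots.
b_kept = np.ma.masked_array(np.full(6, 500, dtype=np.int16), mask=mask_values)
b_kept[np.invert(b_kept.mask)] = 1000

print(b_unmasked.count(), b_unmasked.mean())  # 6 750.0
print(b_kept.count(), b_kept.mean())          # 3 1000.0
```

In the first case all six values are valid afterwards, so every averaging function sees the same data. In the second case the mask survives and only the three valid values enter the masked mean.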
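Apart from that, when you get negative or otherwise incorrect values it is because of an integer overflow: the running sum no longer fits in the integer type used to accumulate it. Using dtype=np.int64 for the array instead should solve it.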
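To make the overflow concrete, here is a minimal sketch. It assumes the sum is accumulated in a 32-bit integer, which was NumPy's default integer on Windows at the time; the explicit dtype=np.int32 is only there to reproduce that behaviour on any platform, and n_valid is an illustrative stand-in for the roughly 5 million unmasked elements.

```python
import numpy as np

n_valid = 5000000  # illustrative: about half of the 10 million elements
a = np.full(n_valid, 500, dtype=np.int16)

# Force a 32-bit accumulator to mimic a 32-bit default integer.
total32 = a.sum(dtype=np.int32)
# A 64-bit accumulator has plenty of room for the same sum.
total64 = a.sum(dtype=np.int64)

print(total64)            # 2500000000, above the int32 maximum of 2147483647
print(total32)            # -1794967296, i.e. 2500000000 - 2**32
print(total32 / n_valid)  # about -358.99, close to the value in the question
print(total64 / n_valid)  # 500.0
```

With 1 million elements the sum stays around 2.5e8, well inside the 32-bit range, which is why the smaller array gave the right answer.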
Edit: an alternative is to use Python integers with dtype=object instead of fixed-width integers, but it will be slower. This change makes np.average crash, while the rest of the methods work properly.
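As a rough illustration of why Python integers avoid the problem, the sketch below sums three large values once as int64 and once as dtype=object. It uses a plain (non-masked) array and values chosen only to overflow int64; the names are illustrative.

```python
import numpy as np

big = 2**62  # three of these exceed the int64 maximum of 2**63 - 1

fixed = np.array([big, big, big], dtype=np.int64)
flexible = np.array([big, big, big], dtype=object)

wrapped = fixed.sum()   # wraps around silently to -2**62
exact = flexible.sum()  # arbitrary-precision Python int

print(wrapped)  # -4611686018427387904
print(exact)    # 13835058055282163712
```

The object array stores references to Python int objects, so every addition goes through the interpreter. That is where the slowdown comes from.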
Edit 2: as said in the comments, in this case it's not necessary to increase the size of the elements of the array. We can just call np.mean(b, dtype=np.float64) so that np.mean uses a bigger accumulator and avoids overflowing.
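A small sketch of that suggestion, assuming the same setup as in the question (the seed is only there to make the run repeatable): the array keeps its int16 storage while the sum is carried out in float64.

```python
import numpy as np

np.random.seed(0)  # illustrative seed, not part of the original question
a = np.ones(10000000, dtype=np.int16) * 500
mask = np.random.randint(0, 2, a.size)
b = np.ma.masked_array(a, mask=mask)

# Only the accumulator is float64; the stored data stays int16.
result = np.mean(b, dtype=np.float64)

print(b.dtype)  # int16
print(result)   # 500.0
```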