Aggregate NumPy array with condition as mask

I have a matrix $b$ with elements:
$$b =
\begin{pmatrix}
0.01 & 0.02 & \cdots & 1 \\
0.01 & 0.02 & \cdots & 1 \\
\vdots & \vdots & \ddots & \vdots \\
0.01 & 0.02 & \cdots & 1 \\
\end{pmatrix}
$$
Through a series of vectorised calculations, $b$ is used to compute $a$, another matrix with the same dimensions/shape as $b$:
$$a =
\begin{pmatrix}
3 & 5 & \cdots & 17 \\
2 & 6 & \cdots & 23 \\
\vdots & \vdots & \ddots & \vdots \\
4 & 3 & \cdots & 19 \\
\end{pmatrix}
$$

At this point it is important to note that the elements of $a$ and $b$ have a one-to-one correspondence. The different values along a row (let's call them $\sigma$), $0.01, 0.02, \ldots$, are the different parameters for a series of simulations that I'm running. Hence, for a fixed value of, say, $\sigma = 0.01$, the length of its column corresponds to the total number of "simulations" I'm running for that particular parameter. If you know Python vectorisation then you'll start to understand what I'm doing.



It is known that the higher $\sigma$ is, the more simulations for that particular $\sigma$ will have a value higher than 5, i.e. more of the matrix elements along a column will have a value bigger than 5. Essentially what I'm doing is vectorising $N$ (columns) different simulations for $M$ (rows) different parameters. Now I wish to find the value of $\sigma$ for which the number of simulations with a result bigger than 5 exceeds 95% of the total number of simulations.



To put it more concisely, for a $\sigma$ of 0.02, each simulation would have results of $$5, 6, \ldots, 3$$ with, say, a total of $N$ simulations. So let $$\kappa = \sum(\text{all the simulations that have values bigger than 5}),$$ i.e. the count of simulations whose value exceeds 5. I wish to find the FIRST $\sigma$ for which
$$\frac{\kappa}{N} > 0.95,$$
i.e. the FIRST $\sigma$ for which more than 95% of the simulations have a value $> 5$.



The code that I have written is:



# say 10000 simulations for a particular sigma
SIMULATION = 10000

# say 100 different values of sigma ranging from 0.01 to 1
# this is equivalent to matrix b in mathjax above
SIGMA = np.ones((EXPERIMENTS, 100)) * np.linspace(0.01, 1, 100)

def return_sigma(matrix, simulation, sigma):
    """
    My idea here is I put in sigma and matrix and the total number of simulations.
    Each time using np.ndenumerate, looping over i and j to compare if the
    element values are greater than 5. If yes then I add 1 to counter, if no
    then continue. If the number of experiments with result bigger than 5 is
    bigger than 95% of the total number of experiments then I return that
    particular sigma.
    """
    counter = 0
    for (i, j), value in np.ndenumerate(matrix):
        if value[i, j] > 5:
            counter += 1
        if counter/experiments > 0.95*simulation:
            break
    return sigma[0, j]  # sigma[:, j] should all be the same anyway

# Now this can be run by:
print(return_sigma(a, SIMULATION, SIGMA))


which doesn't quite seem to work, and as I'm not well-versed with 2D slicing this is quite a challenging problem for me. Thanks in advance.



EDIT
I apologise for not giving away my calculation, as it's part of a coursework of mine. I have generated a for 15 different values of $\sigma$ with 15 simulations each, and here it is:



array([[ 6,  2, 12, 12, 14, 14, 11, 11,  9, 23, 15,  3, 10, 12, 10],
       [ 7,  7,  6,  9, 13,  8, 11, 17, 13,  8, 10, 16, 11, 16,  8],
       [14,  6,  4,  8, 10,  9, 11, 14, 12, 14,  5,  8, 18, 29, 22],
       [ 4, 12, 12,  3,  7,  8,  5, 13, 13, 10, 14, 16, 22, 15, 22],
       [ 9,  8,  7, 12, 12,  6,  4, 13, 12, 12, 18, 20, 18, 14, 23],
       [ 8,  6,  8,  6, 12, 11, 11,  4,  9,  9, 13, 19, 13, 11, 20],
       [12,  8,  7, 17,  3,  9, 11,  5, 12, 24, 11, 12, 17,  9, 16],
       [ 4,  8,  7,  5,  6, 10,  9,  6,  4, 13, 13, 14, 18, 20, 23],
       [ 5, 10,  5,  6,  8,  4,  7,  7, 10, 11,  9, 22, 14, 30, 17],
       [ 6,  4,  5,  9,  8,  8,  4, 21, 14, 18, 21, 13, 14, 22, 10],
       [ 6,  2,  7,  7,  8,  3,  7, 19, 14,  7, 13, 12, 18,  8, 12],
       [ 5,  7,  6,  4, 13,  9,  4,  3, 20, 11, 11,  8, 12, 29, 14],
       [ 6,  3, 13,  6, 12, 10, 17,  6,  9, 15, 12, 12, 16, 12, 15],
       [ 2,  9,  8, 15,  5,  4,  5,  7, 16, 13, 20, 18, 14, 18, 14],
       [14, 10,  7, 11,  8, 13, 14, 13, 12, 19,  9, 10, 11, 17, 13]])


As you can see, as $\sigma$ gets higher, the number of matrix elements in each column that are bigger than 5 increases.
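
For example, a quick per-column count of the values above 5 makes this trend visible (a minimal sketch; a is assumed to hold the 15x15 array pasted above):

import numpy as np

# a: the 15x15 results array above (rows = simulations, columns = sigma values)
counts = (a > 5).sum(axis=0)   # number of simulations above 5, for each sigma column
print(counts)                  # the counts generally grow towards the right-hand columns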



EDIT 2
So now condition is giving me the right thing, which is an array of booleans:



array([[False, False, False, False, False, False, False, False,  True,  True],
       ....................................................................,
       [False, False, False, False, False, False, False,  True,  True,  True]])


So now the last row is the important thing here, as it corresponds to the parameters, which in this case are:



array([[0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.5],
       ...........................................................,
       [0.05, 0.1 , 0.15, 0.2 , 0.25, 0.3 , 0.35, 0.4 , 0.45, 0.5]])


Now the last row of condition is telling me that the first True happens at $\sigma = 0.4$, i.e. the first $\sigma$ for which more than 95% of the total simulations for that $\sigma$ have a result $> 5$. So now I need to return the index of condition where the first True in the last row appears, i.e. [i, j]. Then doing b[i, j] should give me the parameter I want (which I'm not sure your next few lines of code are doing).
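
For what it's worth, this last step can be done directly with np.argmax on the last row (a minimal sketch, assuming condition and b are the arrays shown above):

import numpy as np

last_row = condition[-1]                 # one boolean per sigma column
if last_row.any():
    j = int(np.argmax(last_row))         # index of the first True in that row
    first_sigma = b[-1, j]               # every row of b holds the same sigma values
    print(first_sigma)                   # should give 0.4 for the arrays shown above
else:
    print("threshold never reached")     # np.argmax would otherwise silently return 0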










  • If you could provide the matrix a (a dummy version at least) it would be helpful to check the output against your expectations. – n1k31t4, Mar 31 at 23:22

  • Hi, thanks for the reminder, I've added a to my edit. – user3613025, Mar 31 at 23:41

  • Thanks for adding the example. I'd just like to point out that this smaller test matrix will probably never hit your target threshold of 95% of simulations going over 5 ;) – n1k31t4, Mar 31 at 23:47

2 Answers

Answer by n1k31t4 (answered Mar 31 at 23:43):

I think I have understood your problem (mostly from the comments added in your function).



I'll show step by step what the logic is, building upon each previous step to get the final solution.



First we want to find all positions where the matrix is larger than 5:



a > 5    # returns a boolean array with true/false in each position


Now we want to check each row and count whether the proportion of matches (> 5) has reached a certain threshold, $0.95 N$. We can divide by the number of simulations (the number of columns) to essentially normalise by the number of simulations:



(a > 5) / SIMULATION    # returns the value of one match


These values are required to sum to your threshold for an experiment to be valid.



Now we cumulatively sum across each row. As the True/False array is ones and zeros, we now have a running total of the number of matches for each experiment (each row).



np.cumsum((a > 5) / SIMULATION, axis=1)     # still same shape as b


Now we just need to find out where (in each row) the sum of matches reaches your threshold. We can use np.where:



## EDIT: we only need to check the cumsum is greater than 0.95 and not (0.95 * SIMULATION)
## because we already "normalised" the values within the cumsum.
condition = np.cumsum((a > 5) / SIMULATION, axis=0) > 0.95
mask = np.where(condition)


I broke it down now as the expressions are getting long.



That gave us the i and j coordinates of the places where the condition was True. We just want the place where the threshold was first breached, so we need the indices of the first occurrence in each row:



valid_rows = np.unique(mask[0], return_index=True)[1]    # [1] gets the indices themselves


Now we can simply use these indices to get the first index in each valid row, where the threshold was breached:



valid_cols = mask[1][valid_rows]


So now you can get the corresponding values from the parameter matrix using these valid rows/columns:



params = b[valid_rows, valid_cols]




If this is correct, it should be significantly faster than your solution because it avoids looping over the 2D array and instead utilises NumPy's vectorised methods and ufuncs.
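
Putting those steps together in one place (just a sketch using this answer's per-row reading, where each row of a is one experiment, and a, b, SIMULATION are as defined in the question; per the comments below, swap to axis=0 if each column instead holds one parameter's simulations):

import numpy as np

# running fraction of matches (> 5) across each row of a
running = np.cumsum((a > 5) / SIMULATION, axis=1)
condition = running > 0.95               # True from the column where a row crosses 95%

rows, cols = np.where(condition)         # coordinates of every True cell, in row-major order
unique_rows, first_pos = np.unique(rows, return_index=True)
first_cols = cols[first_pos]             # first column at which each row crossed the threshold

params = b[unique_rows, first_cols]      # corresponding parameter values from b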






  • Hi, I'd really love to try your method but it's really late here and I can barely open my eyes, so I'm going to try it tomorrow and let you know. Cheers. – user3613025, Mar 31 at 23:45

  • Your method seems to be doing fine until I try to print mask, where it just keeps giving me an empty array, and subsequently valid_rows, valid_cols and params all become empty arrays too. Even if the first $\sigma$ value had already given me over 95% of > 5, your params should still be returning the first $\sigma$ value, right? Happy to send you my data and code in private if you'd like. – user3613025, Apr 2 at 13:51

  • From your comment it sounds like I interpreted the logic incorrectly. If each value must be over (0.95 * 5), then you should just change the condition line to match your needs. My line checks that more than 95% of the experiment's simulations are over 5. That sounds different to your description. – n1k31t4, Apr 2 at 16:08

  • Sorry, let me explain it more clearly. Looking at the example a output that I gave above, each column represents the set of simulations being run with a specific value of $\sigma$, in increasing order (across the rows). I would like to find the first $\sigma$ value for which the number of elements in that set of simulations with values > 5 is bigger than 95% of the number of simulations. So for 100 simulations (each matrix element is one simulation), if 96 of them turn out to be > 5 (more than 95% of the total simulations), I want that particular $\sigma$ value. – user3613025, Apr 2 at 16:42

  • So each row in b is identical? And each column in a contains the results of num_rows experiments for the sigma value of that column? Could the solution then be as simple as altering condition to perform np.cumsum(..., axis=0)... ? – n1k31t4, Apr 2 at 23:56


Answer by user3658307 (answered Mar 31 at 23:34):

Is this helpful?



import numpy as np, numpy.random as npr
N_sims = 15  # sims per sigma
N_vals = 15  # num sigmas
# Parameters
SIGMA = np.ones((N_sims, N_vals)) * np.linspace(0.01, 1, N_vals)
# Generate "results" :3 (i.e., the matrix a)
RESULTS = npr.random_integers(low=1, high=10, size=SIGMA.shape)
for i in range(N_vals):
    RESULTS[:, i] += npr.random_integers(low=0, high=1, size=(N_sims)) + i // 3
print("SIGMA\n", SIGMA)
print("RESULTS\n", RESULTS)
# Mark the positions > 5
more_than_five = RESULTS > 5
print("more_than_five\n", more_than_five)
# Count how many are greater than five, per column (i.e., per sigma)
counts = more_than_five.sum(axis=0)
print('COUNTS\n', counts)
# Compute the proportions (so, 1 if all exps were > 5)
proportions = counts.astype(float) / N_sims
print('Proportions\n', proportions)
# Find the first time it is larger than 0.95
first_index = np.argmax(proportions > 0.95)
print('---\nFIRST INDEX\n', first_index)
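
If this matches what you're after, the corresponding $\sigma$ value can then be read off the parameter matrix (a small follow-up sketch reusing the names above; note that np.argmax returns 0 when no proportion exceeds 0.95, so that case is checked explicitly):

if (proportions > 0.95).any():
    first_sigma = SIGMA[0, first_index]   # all rows of SIGMA are identical
    print("first sigma over the 95% threshold:", first_sigma)
else:
    print("no sigma reached the 95% threshold")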




