I’ve implemented what’s known as a CountSketch in Python (page 17 of https://arxiv.org/pdf/1411.4357.pdf), but my implementation is currently lacking in efficiency. The algorithm computes the product `SA`, where `A` is an `n x d` matrix and `S` is an `m x n` matrix defined as follows: for each column of `S`, uniformly at random choose a row (hash bucket) from the `m` rows, and for that given row, uniformly at random choose +1 or -1. So `S` is a matrix with exactly one nonzero in each column and all zeros otherwise.
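For concreteness, here is a small (dense, non-streaming) construction of such an `S`; the helper name is just illustrative, it is not part of my code below:

```python
import numpy as np

def build_countsketch_matrix(m, n, seed=0):
    """Build an m x n CountSketch matrix: exactly one random +/-1 per column."""
    rng = np.random.default_rng(seed)
    S = np.zeros((m, n))
    buckets = rng.integers(0, m, size=n)      # hash bucket h(i) for each column i
    signs = rng.choice([-1.0, 1.0], size=n)   # random sign s(i) for each column i
    S[buckets, np.arange(n)] = signs          # place the sign in the chosen row
    return S

S = build_countsketch_matrix(5, 12)
# every column has exactly one nonzero, equal to +1 or -1
```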

My intention is to compute `SA` in a *streaming* fashion by reading the entries of `A`. The idea for my implementation is as follows: observe a stream of triples `(i, j, A_ij)` and return a stream `(h(i), j, s(i)*A_ij)`, where:

– `h(i)` is a hash bucket (a row of `S` chosen uniformly at random from the `m` possible rows),

– `s(i)` is the random sign function as described above.
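As a minimal sketch of that mapping (pure Python with dicts, so it doesn't need the row-order assumption; names are illustrative only):

```python
import random

def stream_countsketch(triples, m, seed=0):
    """Map a stream of (i, j, A_ij) triples to (h(i), j, s(i)*A_ij)."""
    rng = random.Random(seed)
    h, s = {}, {}                        # lazily drawn bucket and sign per row
    for i, j, a in triples:
        if i not in h:                   # first time row i is seen
            h[i] = rng.randrange(m)              # hash bucket h(i)
            s[i] = rng.choice((-1.0, 1.0))       # random sign s(i)
        yield h[i], j, s[i] * a

out = list(stream_countsketch([(0, 0, 2.0), (0, 1, 3.0), (1, 0, 1.0)], m=4))
```

All entries of the same input row land in the same output row with the same sign, which is what makes the sketch valid.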

I’ve assumed that the matrix arrives in *row order*, so that the first row of `A` arrives in its entirety before the next row of `A`, because this limits the number of calls I need to pick a random bucket and avoids the need for a hash library. I’ve also assumed that the number of nonzero entries (i.e. the length of the input stream) is known, so that I can efficiently encode the iteration.

My problem is that the sketch should satisfy `(1-error)*||Ax||^2 <= ||SAx||^2 <= (1+error)*||Ax||^2`, and the difference in Frobenius norms between `A^T S^T S A` and `A^T A` should also be small. However, while the first condition seems to hold for my implementation, the latter quantity is consistently too small. I was wondering if there’s an obvious reason for this that I’m missing, because it seems to be underestimating that quantity.
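For reference, this is roughly how I check the two conditions against a dense `S` (a self-contained sanity check with illustrative names, separate from the streaming code below):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 1000, 50, 300
A = rng.standard_normal((n, d))
x = rng.standard_normal(d)

# dense CountSketch: exactly one random +/-1 in each column of S
S = np.zeros((m, n))
S[rng.integers(0, m, size=n), np.arange(n)] = rng.choice([-1.0, 1.0], size=n)
SA = S @ A

# condition 1: ||SAx||^2 should be close to ||Ax||^2 (ratio near 1)
ratio = np.linalg.norm(SA @ x) ** 2 / np.linalg.norm(A @ x) ** 2

# condition 2: Frobenius distance between A^T S^T S A and A^T A should be small
frob_gap = np.linalg.norm(A.T @ A - SA.T @ SA, ord='fro')
```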

I’m open to suggestions on changing the code if there are obvious improvements to be made.

NB. If you don’t want to run using `numba`, just comment out the import and the function decorator and it will run in standard numpy/scipy.

```
import numpy as np
import numpy.random as npr
import scipy.sparse as sparse
from scipy.sparse import coo_matrix
import numba
from numba import jit


@jit(nopython=True)  # comment this out if you want just numpy
def countSketch(input_rows, input_data, input_nnz, sketch_size, seed=None):
    '''
    input_rows:  row indices for the data (can contain repeats)
    input_data:  values seen at the row locations
    input_nnz:   number of nonzeros in the data (could be replaced with
                 len(input_data) but avoided here for speed)
    sketch_size: int, number of rows m of the sketch
    seed=None:   random seed
    '''
    hashed_rows = np.empty(input_rows.shape, dtype=np.int32)
    current_row = 0
    hash_val = npr.choice(sketch_size)
    sign_val = npr.choice(np.array([-1.0, 1.0]))
    hashed_rows[0] = hash_val
    for idx in np.arange(input_nnz):
        row_id = input_rows[idx]
        data_val = input_data[idx]
        if row_id == current_row:
            hashed_rows[idx] = hash_val
            input_data[idx] = sign_val * data_val
        else:
            # make new hashes
            hash_val = npr.choice(sketch_size)
            sign_val = npr.choice(np.array([-1.0, 1.0]))
            hashed_rows[idx] = hash_val
            input_data[idx] = sign_val * data_val
    return hashed_rows, input_data


def sort_row_order(input_data):
    sorted_row_column = np.array((input_data.row,
                                  input_data.col,
                                  input_data.data))
    idx = np.argsort(sorted_row_column[0])
    sorted_rows = np.array(sorted_row_column[0, idx], dtype=np.int32)
    sorted_cols = np.array(sorted_row_column[1, idx], dtype=np.int32)
    sorted_data = np.array(sorted_row_column[2, idx], dtype=np.float64)
    return sorted_rows, sorted_cols, sorted_data


if __name__ == "__main__":
    import time
    from tabulate import tabulate

    matrix = sparse.random(1000, 50, 0.1)
    x = np.random.randn(matrix.shape[1])
    true_norm = np.linalg.norm(matrix @ x, ord=2)**2
    tidy_data = sort_row_order(matrix)

    sketch_size = 300

    # first run includes numba compilation time
    start = time.time()
    hashed_rows, sketched_data = countSketch(tidy_data[0], tidy_data[2],
                                             matrix.nnz, sketch_size)
    duration_slow = time.time() - start
    S_A = sparse.coo_matrix((sketched_data, (hashed_rows, matrix.col)))
    approx_norm_slow = np.linalg.norm(S_A @ x, ord=2)**2
    rel_error_slow = approx_norm_slow / true_norm

    # second run on the already-compiled function
    start = time.time()
    hashed_rows, sketched_data = countSketch(tidy_data[0], tidy_data[2],
                                             matrix.nnz, sketch_size)
    duration = time.time() - start
    S_A = sparse.coo_matrix((sketched_data, (hashed_rows, matrix.col)))
    approx_norm = np.linalg.norm(S_A @ x, ord=2)**2
    rel_error = approx_norm / true_norm

    print(tabulate([[duration_slow, rel_error_slow, 'Yes'],
                    [duration, rel_error, 'No']],
                   headers=['Sketch Time', 'Relative Error', 'Dry Run'],
                   tablefmt='orgtbl'))
```