Hi. I am a Software Engineer with interests in Java, Python, SQL, Scala, Bash and computer networks.

You can find me on GitHub https://github.com/appym

-Apratim

Posted by Apratim Mishra

## Postgres window and aggregate functions

Postgres supports a wide array of aggregate and window functions that can be used for analytics on data or for data visualization in general. Aggregate functions act as window functions only when an OVER clause follows the call; otherwise they act as regular aggregates.

There are different categories of aggregate functions in Postgres; the most common are the general-purpose aggregates such as COUNT, SUM, AVG, MIN and MAX.

In this post I am going to take an example data set and walk through some of the common window functions and where they can be useful.

Let's go ahead and create the following table, and then insert some data into it.
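Since a concrete schema helps, here is a minimal sketch (the table name, columns and values are hypothetical, chosen so that the median works out to 825 as computed later):

```sql
-- Hypothetical table and sample data (illustrative only).
CREATE TABLE payments (
    id       SERIAL PRIMARY KEY,
    customer TEXT    NOT NULL,
    amount   NUMERIC NOT NULL
);

INSERT INTO payments (customer, amount) VALUES
    ('alice',  100), ('alice',  650), ('alice', 2000),
    ('bob',    450), ('bob',   1000), ('bob',   1500);
```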

Window functions are driven by two main clauses:

1. ORDER BY
2. PARTITION BY combined with ORDER BY

ROW_NUMBER()
Assigns a unique number to each row to which it is applied (either each row in the partition or each row returned by the query), in the ordered sequence of rows specified in the order_by_clause, beginning with 1

RANK()
Computes the rank of each row returned from a query with respect to the other rows returned by the query, based on the values of the expression in the order_by_clause. Ties are assigned the same rank, with the next ranking(s) skipped

DENSE_RANK()
Rank of the current row without gaps. This function counts peer groups
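With the hypothetical payments table, the three ranking functions can be compared side by side; on tied amounts ROW_NUMBER still increments, RANK leaves gaps and DENSE_RANK does not:

```sql
SELECT customer, amount,
       ROW_NUMBER() OVER (ORDER BY amount DESC) AS row_num,
       RANK()       OVER (ORDER BY amount DESC) AS rnk,
       DENSE_RANK() OVER (ORDER BY amount DESC) AS dense_rnk
FROM payments;
```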

We could also use the PARTITION BY clause with aggregate functions such as SUM, MIN and MAX.
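For example, an aggregate over a partition attaches a per-group value to every row (a sketch against the hypothetical payments table):

```sql
SELECT customer, amount,
       SUM(amount) OVER (PARTITION BY customer) AS customer_total,
       MAX(amount) OVER (PARTITION BY customer) AS customer_max
FROM payments;
```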

PERCENT_RANK()
Relative rank of the current row: (rank - 1) / (total rows - 1)

CUME_DIST()
Relative rank of the current row: (number of rows preceding or peer with current row) / (total rows)
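Both can be computed in one pass (a sketch against the hypothetical payments table):

```sql
SELECT customer, amount,
       PERCENT_RANK() OVER (ORDER BY amount) AS pct_rank,
       CUME_DIST()    OVER (ORDER BY amount) AS cume_dist
FROM payments;
```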

Percentile functions

The median for the above data set is (1000 + 650) / 2 = 825.

Percentiles by customer
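In Postgres these are ordered-set aggregates; a sketch against the hypothetical payments table, whose six amounts give the median of 825:

```sql
-- Median over the whole table.
SELECT PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY amount) AS median
FROM payments;

-- Median and 90th percentile per customer.
SELECT customer,
       PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY amount) AS p50,
       PERCENTILE_CONT(0.9) WITHIN GROUP (ORDER BY amount) AS p90
FROM payments
GROUP BY customer;
```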

Reusing a window function
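When several functions share the same partitioning and ordering, the WINDOW clause lets you name the window once and reuse it (a sketch against the hypothetical payments table):

```sql
SELECT customer, amount,
       RANK()       OVER w AS rnk,
       DENSE_RANK() OVER w AS dense_rnk,
       SUM(amount)  OVER w AS running_total
FROM payments
WINDOW w AS (PARTITION BY customer ORDER BY amount DESC);
```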


## Document Similarity with LSH and MinHash

In this blog post I am going to discuss two probabilistic algorithms used for matching documents for similarity: Minhash, for determining whether two documents are similar to each other, and LSH, for comparing N documents for similarity. The implementations of the Minhash and LSH algorithms are based on [1] and [2].

In this article, similarity of documents does not mean contextual similarity but rather lexical matching of documents.

## Similarity Techniques

### Jaccard Similarity

Jaccard similarity of two sets is the ratio of the size of the intersection of the two sets to the size of the union of the two sets.

For sets A and B

J (A, B) = |A ∩ B| / |A U B|

### Jaccard Bag Similarity

Jaccard bag similarity for bags A and B is defined by counting an element n times in the intersection if n is the minimum of the number of times the element appears in A and B. In the union, count the sum of number of times an element appears in A and B.

Example: if bag A contains {a, a, a, b} and bag B contains {a, a, b, b, c}, then the min count of ‘a’ is 2 and the min count of ‘b’ is 1, making a total of 3. The union of two bags is the sum of the sizes of each bag, so 9. Hence the Jaccard bag similarity is 3/9 = 1/3.
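Both definitions translate directly into a few lines of Python (a sketch; `collections.Counter` handles the bag case):

```python
from collections import Counter

def jaccard(a, b):
    """Jaccard similarity of two sets: |A intersect B| / |A union B|."""
    return len(a & b) / len(a | b)

def jaccard_bag(a, b):
    """Bag (multiset) Jaccard: min counts in the intersection,
    sum of the bag sizes in the union."""
    ca, cb = Counter(a), Counter(b)
    inter = sum((ca & cb).values())              # min count per element
    union = sum(ca.values()) + sum(cb.values())  # total size of both bags
    return inter / union

# The bag example from the text: {a,a,a,b} vs {a,a,b,b,c} -> 3/9 = 1/3
print(jaccard_bag(list("aaab"), list("aabbc")))
```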

### Representation of documents

Before matching two documents for similarity, they need to be represented as sets. The most effective way to lexically match documents is to construct the set of N-grams of each document. N-grams, or shingles, are substrings of N consecutive characters. So for example the 2-grams for the word “match” would be {ma, at, tc, ch}: essentially a sliding window of 2 characters over the word “match”.

Shingling can also be constructed from words in a text instead of single characters. Shingles for a sentence would be a set formed from a sliding window of consecutive words. Example 2-shingles or 2-Grams for “Its quite sunny today” would be {“its quite”, “quite sunny”, “sunny today”}. This can be very helpful when matching documents that might have sentences in different order or words in sentences in different order. This can be customized to remove stop words from the documents.

Choosing the right shingle size is important. If the size were too small then the shingles would most likely repeat across documents leading to false positives. If the size were too large then the similarity would be very low leading to false negatives. The shingle size should be chosen such that probability of finding any given shingle in a document is low.
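A quick sketch of shingling in Python (the helper name and the whitespace-based word splitting are my own choices):

```python
def shingles(text, k, by_words=False):
    """Return the set of k-shingles of text, over characters by
    default, or over whitespace-separated words when by_words=True."""
    if by_words:
        tokens = text.split()
        return {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}
    return {text[i:i + k] for i in range(len(text) - k + 1)}

# Character 2-grams of "match": ma, at, tc, ch
print(shingles("match", 2))
# Word 2-shingles of "its quite sunny today"
print(shingles("its quite sunny today", 2, by_words=True))
```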

### Hashing Shingles

Another concern with shingles made from substrings of a document is the memory cost. English has 26 letters, but a document also contains whitespace, punctuation and other symbols, which makes the number of possible shingles very high. A standard approximation for documents is to assume 20 commonly occurring characters, giving 20^k possible k-shingles.

To circumvent this cost, shingles for a large k, say 10 can be hashed to a bucket number in the 4-byte range thus compacting the data.

### Characteristic matrices and Signatures

If there were N documents being matched for similarity, where N is a large number, it would not be feasible to compare the many k-shingles of each document. Instead, signatures, which are much smaller subsets, are formed from each set of shingles (each set derived from one document). A characteristic matrix represents the set or document identifiers as columns and the elements of all the sets combined as rows. A cell is given the value 1 when the element is part of the set, 0 otherwise. Characteristic matrices are unlikely candidates for data storage since they are sparse, with many more zeros than ones, and are best used for visualizing the data.

|       | D1 | D2 | ... | Dn |
|-------|----|----|-----|----|
| h(s1) | 1  | 0  | ... | 1  |
| h(s2) | 0  | 1  | ... | 0  |
| ...   | ...| ...| ... | ...|
| h(sn) | 1  | 1  | ... | 0  |

### Minhashing and Signature matrices

Minhashing is another, more compact way of representing the characteristic matrix mentioned above. A minhash of a set is calculated by picking a permutation of the rows of the characteristic matrix; the minhash value of any column is the number of the first row, in the permuted order, in which that column has a 1. In practice, several hundred such permutations (hash functions) are applied to each set.

Signature matrices consist of the hash functions as rows and the sets as columns. Each column is essentially a vector of minhash values and is referred to as the minhash signature for that set. Signature matrices are much smaller than characteristic matrices.

### Minhashing and Jaccard Similarity

To find the similarity between two documents Jaccard similarity of the minhash sets is a good estimator.

|       | D1 | D2 |
|-------|----|----|
| h(s1) | 0  | 0  |
| h(s2) | 1  | 0  |
| ...   | ...| ...|
| h(sn) | 1  | 1  |

There are 3 types of combinations possible for the rows corresponding to each hash function as seen in the above table.

- Both columns have 0's: Type X
- One column has a 0 and the other a 1: Type Y
- Both columns have 1's: Type Z

Type X rows contribute to neither the union nor the intersection of the two sets, so they can be ignored.

The union of the two sets would be combination of Y and Z type rows. Hence |A U B| = Y + Z

The intersection of two sets would be where both columns are 1 so type Z.

Now as mentioned before Jaccard Similarity is J (A, B) = |A ∩ B| / |A U B|

so J (A, B) = Z / (Y + Z)

### Computing Minhash Signatures

Characteristic matrices are generally huge, and picking a permutation of millions of rows would be time-consuming. A better approach is to use a random hash function that maps row numbers to bucket numbers. It is possible for multiple rows to map to the same bucket and collide, but if the number of buckets is large, the probability of collision is low.

| Row | S1 | S2 | S3 | S4 | h1(x) = (x + 1) mod 5 | h2(x) = (3x + 1) mod 5 |
|-----|----|----|----|----|-----------------------|------------------------|
| 0   | 1  | 0  | 0  | 1  | 1                     | 1                      |
| 1   | 0  | 0  | 1  | 0  | 2                     | 4                      |
| 2   | 0  | 1  | 0  | 1  | 3                     | 2                      |
| 3   | 1  | 0  | 1  | 1  | 4                     | 0                      |
| 4   | 0  | 0  | 1  | 0  | 0                     | 3                      |

Table: Hash functions for Characteristic matrix

To compute the minhash signature matrix, begin by choosing n hash functions on the rows (n = 2 in the above example: h1 and h2); the signature matrix initially consists of ∞ in every cell.

|    | S1 | S2 | S3 | S4 |
|----|----|----|----|----|
| h1 | ∞  | ∞  | ∞  | ∞  |
| h2 | ∞  | ∞  | ∞  | ∞  |

First consider row 0, S1 and S4 have 1's and S2 and S3 have 0's so only S1 and S4 can be altered. The values of both h1(0) and h2(0) are 1 for row 0 and will replace ∞ (1 < ∞). Thus the matrix looks like below.

|    | S1 | S2 | S3 | S4 |
|----|----|----|----|----|
| h1 | 1  | ∞  | ∞  | 1  |
| h2 | 1  | ∞  | ∞  | 1  |

Next for row 1, only S3 has 1 and can be altered. Thus we set the values of h1 and h2  for S3.

|    | S1 | S2 | S3 | S4 |
|----|----|----|----|----|
| h1 | 1  | ∞  | 2  | 1  |
| h2 | 1  | ∞  | 4  | 1  |

Similarly, row 2 has 1's for S2 and S4. S2 had ∞ for both h1 and h2, which get replaced by 3 and 2 respectively. S4 had 1's for both h1 and h2, which are lower than 3 and 2, so there is no change.

|    | S1 | S2 | S3 | S4 |
|----|----|----|----|----|
| h1 | 1  | 3  | 2  | 1  |
| h2 | 1  | 2  | 4  | 1  |

Following through the minhash process of iteratively replacing a hash value whenever a lower one is found, we get the signature matrix below.

|    | S1 | S2 | S3 | S4 |
|----|----|----|----|----|
| h1 | 1  | 3  | 0  | 1  |
| h2 | 0  | 2  | 0  | 0  |

Jaccard similarity can be applied to the above signature matrix to estimate the similarity between sets (documents). So, for example, the Jaccard similarity between S1 and S2 would be 0 (no hashes match), whereas for S1 and S3 it would be 0.5 (one of the two hashes matches).
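The iteration above is mechanical, so it can be scripted; a Python sketch that reproduces the worked example:

```python
INF = float("inf")

def minhash_signatures(matrix, hash_funcs):
    """Compute the minhash signature matrix from a characteristic
    matrix (list of 0/1 rows, one column per set) and row-hash functions."""
    n_sets = len(matrix[0])
    sig = [[INF] * n_sets for _ in hash_funcs]
    for row, bits in enumerate(matrix):
        hashes = [h(row) for h in hash_funcs]
        for col, bit in enumerate(bits):
            if bit:  # only columns with a 1 in this row can be lowered
                for i, hv in enumerate(hashes):
                    sig[i][col] = min(sig[i][col], hv)
    return sig

# Characteristic matrix from the worked example (columns S1..S4).
M = [
    [1, 0, 0, 1],
    [0, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 0, 1, 1],
    [0, 0, 1, 0],
]
h1 = lambda x: (x + 1) % 5
h2 = lambda x: (3 * x + 1) % 5
print(minhash_signatures(M, [h1, h2]))  # [[1, 3, 0, 1], [0, 2, 0, 0]]
```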

## Locality Sensitive Hashing

Minhash solves the problem of comparing large documents by compressing them into signature matrices which are a much smaller dataset compared to the original characteristic matrix. But when there are a large number of large documents involved then comparing them to each other would still be a time consuming task because of the number of pairs of documents that would need to be compared for similarity.

Locality sensitive hashing, or near-neighbor search, solves this problem by focusing on pairs that are likely to be similar rather than on all pairs.

The idea is to hash each item several times, such that similar items are more likely to hash to the same bucket than dissimilar items. Any pair of documents that hashes to the same bucket is treated as a candidate pair, and these are in turn evaluated for similarity using Jaccard similarity, cosine distance or Euclidean distance.

### LSH with Minhash

If we have already computed the minhash signatures for the items, then an effective approach to LSH is to divide the signature matrix into b bands of r rows each. For each band there is a hash function that takes the vector of r values and hashes it into one of many buckets.

Pick a threshold value t such that t = (1/b) ^ (1/r). Amongst the candidate pairs find out which pairs have Jaccard similarity higher than t to get the list of documents which are most similar to each other.
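A minimal sketch of the banding step (the band and row counts and the example signatures are hypothetical):

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(signatures, b, r):
    """Split each signature into b bands of r rows; any two documents
    sharing an identical band become a candidate pair."""
    buckets = defaultdict(set)
    for doc_id, sig in signatures.items():
        assert len(sig) == b * r
        for band in range(b):
            # The band index is part of the key so bands never collide
            # across band positions.
            key = (band, tuple(sig[band * r:(band + 1) * r]))
            buckets[key].add(doc_id)
    pairs = set()
    for docs in buckets.values():
        pairs.update(combinations(sorted(docs), 2))
    return pairs

sigs = {
    "d1": [1, 3, 0, 2, 5, 9],
    "d2": [1, 3, 0, 7, 5, 8],  # shares its first band with d1
    "d3": [4, 4, 2, 6, 6, 6],  # shares no band with anyone
}
print(candidate_pairs(sigs, b=3, r=2))  # {('d1', 'd2')}
```

Only the surviving candidate pairs are then checked against the threshold t.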

You can find the worked code for matching two documents here:

https://github.com/appym/lsh

### References

1. Anand Rajaraman and Jeffrey Ullman. *Mining of Massive Datasets*, Chapter 3. 2010 (draft).
2. http://okomestudio.net/biboroku/?p=2065

## Unix Commands for OS X

I have been meaning to note down my Unix checklist of commands (for OS X) which are very handy for basic operations on data. I will modify this post as and when I remember or come across something that fits here. These *nix commands are specifically tested on Mac OS X.

Uniques

uniq - This is the Unix unique utility, which can primarily be used to remove duplicates from a file, amongst other things. The file has to be pre-sorted for uniq to work.
Consider file test which contains the following

Remove duplicates

Count occurrences of each item

Print only duplicate items in file

Print only unique lines
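A sketch of all four operations, with hypothetical contents for `test`:

```shell
printf 'apple\nbanana\napple\ncherry\nbanana\napple\n' > test

sort test | uniq        # remove duplicates
sort test | uniq -c     # count occurrences of each item
sort test | uniq -d     # print only items that appear more than once
sort test | uniq -u     # print only lines that appear exactly once
```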

Consider test now contains

Remove duplicates, case-insensitively.
This file is not sorted, so it has to be sorted before running uniq. The -i flag makes the comparison case-insensitive.
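For example (hypothetical contents; `sort -f` folds case while sorting):

```shell
printf 'Apple\napple\nBANANA\nbanana\n' > test
sort -f test | uniq -i   # leaves one line per case-folded word
```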

Case conversion

Convert all upper case in fileA to lower case and output as fileB
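A sketch using `tr` (file names and contents are hypothetical):

```shell
printf 'Hello World\n' > fileA
tr '[:upper:]' '[:lower:]' < fileA > fileB   # fileB now holds: hello world
```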

File comparison

Compare two files and keep strings present in fileA but not in fileB

Compare two files and keep strings present in fileB but not in fileA

Compare two files and keep only strings which are present in both files
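`comm` covers all three cases, provided both files are sorted (contents here are hypothetical):

```shell
printf 'a\nb\nc\n' > fileA
printf 'b\nc\nd\n' > fileB

comm -23 fileA fileB   # only in fileA: a
comm -13 fileA fileB   # only in fileB: d
comm -12 fileA fileB   # in both files: b, c
```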

Sed

Primary purpose of sed is string replacement or pattern replacement.
Consider the following file as input

1. Replacing or substituting a string. By default, the sed command replaces only the first occurrence of the pattern in each line; it won't replace the second, third, ... occurrence in the line. In `s/unix/linux/`, the "s" specifies the substitution operation, the "/" characters are delimiters, "unix" is the search pattern and "linux" is the replacement string. If you miss a delimiter, the expression errors out.
2. Replacing the nth occurrence of a pattern in a line. Use the /1, /2, etc. flags to replace the first, second, ... occurrence of a pattern in a line; for example, the /2 flag replaces only the second occurrence of the word "unix" with "linux" in a line (the /1 flag is the default).
3. To replace all occurrences, use the 'g' (global replacement) flag.
4. To make the search case-insensitive: sed on the Mac does not have a dedicated flag, but you can use plain regex to achieve it.
5. To find a string in all the files contained in a directory, use grep or find.
6. To find/replace multiple strings at once, use the -e flag.
7. To replace a string that begins with a pattern, use the corresponding regex along with sed.
8. To remove only the trailing spaces in a line, use ` *$`. The `*` character means "any number of the previous character" and `$` refers to end of line. Verify the trailing whitespace is removed with `:set list` (in vim), which also shows whether your file has whitespace or tab characters.
9. To remove BOM (Byte Order Mark) characters from your file: open the file in binary mode (the -b flag) to verify you have a BOM, and then remove it.
10. Use the -i flag to overwrite the existing file and create a backup of the original file. For example, removing all whitespace in a file this way creates a backup called file.txt.bak with the original contents and overwrites file.txt.
11. To replace a blank line with something else: match a blank line by specifying an end-of-line immediately after a beginning-of-line, i.e. with ^$.
12. To remove tabs at the end of a line (add a tab to the end of the first line, so `:set list` shows ^I): to type a literal tab in your sed command, press Ctrl+V and then Ctrl+I.
13. To extract the content after a given field (e.g. "firstname") in a file.
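A few of these in action as a sketch (file contents are hypothetical; `-i.bak` for in-place editing with a backup works with both the BSD sed on the Mac and GNU sed):

```shell
printf 'unix is great. unix is simple.\n' > file.txt
printf 'ends here   \n' > spaces.txt

sed 's/unix/linux/'  file.txt    # replace first occurrence in each line
sed 's/unix/linux/2' file.txt    # replace only the second occurrence
sed 's/unix/linux/g' file.txt    # replace every occurrence
sed 's/ *$//'        spaces.txt  # strip trailing spaces
sed -i.bak 's/unix/linux/g' file.txt   # in place, backup in file.txt.bak
```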

Search Strings

Total occurrences of searchStr in the current directory

Total number of files where searchStr occurs in current directory

To get an exact word match use the -w flag.
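All three as a sketch, with hypothetical files:

```shell
printf 'foo bar foo\nbaz\n' > f1
printf 'foo\n' > f2

grep -o 'foo' f1 f2 | wc -l   # total occurrences across files: 3
grep -l 'foo' f1 f2 | wc -l   # number of files containing it: 2
grep -w 'foo' f1              # match 'foo' only as a whole word
```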

Recursively replace the string original with replacement in all files under the OS X directory mydir (excluding hidden files and folders)

OR

The regex excludes all hidden files and folders which is particularly important if you want to avoid messing up your .DS_Store or .git files unknowningly.
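One find + sed combination that does this (the directory layout and strings are hypothetical; `-i.bak` keeps backups and works with both BSD and GNU sed):

```shell
mkdir -p mydir/sub
printf 'original text\n' > mydir/a.txt
printf 'keep original\n' > mydir/sub/b.txt
printf 'original\n'      > mydir/.hidden

# Skip any path containing a dot-file or dot-directory component,
# then edit the remaining files in place.
find mydir -type f -not -path '*/.*' \
    -exec sed -i.bak 's/original/replacement/g' {} +
```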
If you use zsh, then the following would also work.

This isn't excluding hidden files though: the `**/*(D)` is basically the zsh way of saying recursively go through all subdirectories and all files, including dotfiles.