About

Hi. I am a Software Engineer with interests in Java, Python, SQL, Scala, Bash, and computer networks.

You can find me on GitHub https://github.com/appym

-Apratim

| Posted by Apratim Mishra

Postgres window and aggregate functions

Postgres supports a wide array of aggregate and window functions which can be used for analytics or data visualization in general. Aggregate functions act as window functions only when an OVER clause follows the call; otherwise they act as regular aggregates.

There are different categories of aggregate functions in Postgres, the most common ones being:

'N' for normal aggregates, like MAX, MIN, etc.
'O' for ordered-set aggregates
'H' for hypothetical-set aggregates, which are a subclass of ordered-set aggregates

The hypothetical aggregates do not drop input rows containing nulls. Null values sort according to the rule specified in the ORDER BY clause.

psql=>
SELECT AGGFNOID, AGGKIND
FROM PG_AGGREGATE
WHERE AGGKIND IN ('O', 'H');

AGGFNOID | AGGKIND
----------------------------+---------
PG_CATALOG.PERCENTILE_DISC | O
PG_CATALOG.PERCENTILE_CONT | O
PG_CATALOG.PERCENTILE_CONT | O
PG_CATALOG.PERCENTILE_DISC | O
PG_CATALOG.PERCENTILE_CONT | O
PG_CATALOG.PERCENTILE_CONT | O
MODE                       | O
PG_CATALOG.RANK            | H
PG_CATALOG.PERCENT_RANK    | H
PG_CATALOG.CUME_DIST       | H
PG_CATALOG.DENSE_RANK      | H

In this blog I am going to take an example data set and walk through some of the common window functions and where they can be useful.

Let's go ahead and create the following table (the DISTRIBUTED BY clause is Greenplum-specific; omit it on vanilla Postgres)

CREATE TABLE orderranking
(
  orderid serial NOT NULL,
  customerid integer,
  ordertotal numeric(15,2)
)
DISTRIBUTED BY (orderid);

And then insert data into the table

INSERT INTO OrderRanking (CustomerID, OrderTotal)
SELECT 1, 1000
UNION ALL
SELECT 1, 1000
UNION ALL
SELECT 1, 500
UNION ALL
SELECT 1, 650
UNION ALL
SELECT 1, 3000
UNION ALL
SELECT 2, 1000
UNION ALL
SELECT 2, 2000
UNION ALL
SELECT 2, 500
UNION ALL
SELECT 2, 500
UNION ALL
SELECT 3, 500;

select * from orderranking order by orderid;
 orderid | customerid | ordertotal 
---------+------------+------------
       1 |          1 |     500.00
       2 |          1 |     650.00
       3 |          1 |    1000.00
       4 |          1 |    3000.00
       5 |          2 |     500.00
       6 |          2 |    1000.00
       7 |          2 |    2000.00
       8 |          3 |     500.00
       9 |          1 |    1000.00
      10 |          2 |     500.00
(10 rows)

Window functions are primarily used with two major clauses:

  1. ORDER BY
  2. PARTITION BY combined with ORDER BY

ROW_NUMBER()
Assigns a unique number to each row to which it is applied (either each row in the partition or each row returned by the query), in the ordered sequence of rows specified in the order_by_clause, beginning with 1.

select *,
ROW_NUMBER() OVER (ORDER BY OrderTotal DESC) AS row_num
from OrderRanking order by OrderTotal desc;

 orderid | customerid | ordertotal | row_num                                                                                                                              
---------+------------+------------+---------
       4 |          1 |    3000.00 |       1
       7 |          2 |    2000.00 |       2
       6 |          2 |    1000.00 |       3
       3 |          1 |    1000.00 |       4
       9 |          1 |    1000.00 |       5
       2 |          1 |     650.00 |       6
       1 |          1 |     500.00 |       7
      10 |          2 |     500.00 |       8
       8 |          3 |     500.00 |       9
       5 |          2 |     500.00 |      10
(10 rows)

select *,
ROW_NUMBER() OVER (ORDER BY OrderTotal DESC) AS row_num,
ROW_NUMBER() OVER (PARTITION BY CustomerID ORDER BY OrderTotal DESC) AS row_num_partition
from OrderRanking order by customerid, ordertotal desc;

 orderid | customerid | ordertotal | row_num | row_num_partition 
---------+------------+------------+---------+-------------------
       4 |          1 |    3000.00 |       1 |                 1
       3 |          1 |    1000.00 |       4 |                 2
       9 |          1 |    1000.00 |       5 |                 3
       2 |          1 |     650.00 |       6 |                 4
       1 |          1 |     500.00 |       7 |                 5
       7 |          2 |    2000.00 |       2 |                 1
       6 |          2 |    1000.00 |       3 |                 2
       5 |          2 |     500.00 |      10 |                 4
      10 |          2 |     500.00 |       9 |                 3
       8 |          3 |     500.00 |       8 |                 1
(10 rows)

RANK()
Computes the rank of each row returned from a query with respect to the other rows returned by the query, based on the values of the expression in the order_by_clause. Ties are assigned the same rank, with the next ranking(s) skipped.

select *,
RANK() OVER (ORDER BY OrderTotal DESC) AS rank
from OrderRanking order by OrderTotal desc;

 orderid | customerid | ordertotal | rank                                                                                                                                 
---------+------------+------------+------
       4 |          1 |    3000.00 |    1
       7 |          2 |    2000.00 |    2
       6 |          2 |    1000.00 |    3
       9 |          1 |    1000.00 |    3
       3 |          1 |    1000.00 |    3
       2 |          1 |     650.00 |    6
      10 |          2 |     500.00 |    7
       8 |          3 |     500.00 |    7
       5 |          2 |     500.00 |    7
       1 |          1 |     500.00 |    7
(10 rows)

select *,
RANK() OVER (ORDER BY OrderTotal DESC) AS rank,
RANK() OVER (PARTITION BY CustomerID ORDER BY OrderTotal DESC) AS rank_partition
from OrderRanking order by customerid, ordertotal desc;

 orderid | customerid | ordertotal | rank | rank_partition                                                                                                                
---------+------------+------------+------+----------------
       4 |          1 |    3000.00 |    1 |              1
       3 |          1 |    1000.00 |    3 |              2
       9 |          1 |    1000.00 |    3 |              2
       2 |          1 |     650.00 |    6 |              4
       1 |          1 |     500.00 |    7 |              5
       7 |          2 |    2000.00 |    2 |              1
       6 |          2 |    1000.00 |    3 |              2
      10 |          2 |     500.00 |    7 |              3
       5 |          2 |     500.00 |    7 |              3
       8 |          3 |     500.00 |    7 |              1
(10 rows)

DENSE_RANK()
Rank of the current row without gaps. This function counts peer groups.

select *,
DENSE_RANK() OVER (ORDER BY OrderTotal DESC) AS dense_rank
from OrderRanking order by OrderTotal desc;

 orderid | customerid | ordertotal | dense_rank                                                                                                                           
---------+------------+------------+------------
       4 |          1 |    3000.00 |          1
       7 |          2 |    2000.00 |          2
       6 |          2 |    1000.00 |          3
       3 |          1 |    1000.00 |          3
       9 |          1 |    1000.00 |          3
       2 |          1 |     650.00 |          4
       1 |          1 |     500.00 |          5
      10 |          2 |     500.00 |          5
       8 |          3 |     500.00 |          5
       5 |          2 |     500.00 |          5
(10 rows)

select *,
DENSE_RANK() OVER (ORDER BY OrderTotal DESC) AS dense_rank,
DENSE_RANK() OVER (PARTITION BY CustomerID ORDER BY OrderTotal DESC) AS dense_rank_partition
from OrderRanking order by customerid, ordertotal desc;

 orderid | customerid | ordertotal | dense_rank | dense_rank_partition 
---------+------------+------------+------------+----------------------
       4 |          1 |    3000.00 |          1 |                    1
       9 |          1 |    1000.00 |          3 |                    2
       3 |          1 |    1000.00 |          3 |                    2
       2 |          1 |     650.00 |          4 |                    3
       1 |          1 |     500.00 |          5 |                    4
       7 |          2 |    2000.00 |          2 |                    1
       6 |          2 |    1000.00 |          3 |                    2
      10 |          2 |     500.00 |          5 |                    3
       5 |          2 |     500.00 |          5 |                    3
       8 |          3 |     500.00 |          5 |                    1
(10 rows)

We can also use the PARTITION BY clause with aggregate functions such as SUM, MIN, and MAX.

select *,
SUM(OrderTotal) OVER (PARTITION BY CustomerID) AS sum
from OrderRanking order by customerid, ordertotal desc;

 orderid | customerid | ordertotal |   sum   
---------+------------+------------+---------
       4 |          1 |    3000.00 | 6150.00
       3 |          1 |    1000.00 | 6150.00
       9 |          1 |    1000.00 | 6150.00
       2 |          1 |     650.00 | 6150.00
       1 |          1 |     500.00 | 6150.00
       7 |          2 |    2000.00 | 4000.00
       6 |          2 |    1000.00 | 4000.00
       5 |          2 |     500.00 | 4000.00
      10 |          2 |     500.00 | 4000.00
       8 |          3 |     500.00 |  500.00
(10 rows)

PERCENT_RANK()
Relative rank of the current row: (rank - 1) / (total rows - 1)

select *,
RANK() OVER (ORDER BY OrderTotal DESC) AS rank,
PERCENT_RANK() OVER (ORDER BY OrderTotal DESC) AS percent_rank
from OrderRanking order by ordertotal desc;

 orderid | customerid | ordertotal | rank |   percent_rank                                                                                                                
---------+------------+------------+------+-------------------
       4 |          1 |    3000.00 |    1 |                 0
       7 |          2 |    2000.00 |    2 | 0.111111111111111
       6 |          2 |    1000.00 |    3 | 0.222222222222222
       9 |          1 |    1000.00 |    3 | 0.222222222222222
       3 |          1 |    1000.00 |    3 | 0.222222222222222
       2 |          1 |     650.00 |    6 | 0.555555555555556
      10 |          2 |     500.00 |    7 | 0.666666666666667
       8 |          3 |     500.00 |    7 | 0.666666666666667
       5 |          2 |     500.00 |    7 | 0.666666666666667
       1 |          1 |     500.00 |    7 | 0.666666666666667
(10 rows)

select *,
RANK() OVER (PARTITION BY CustomerID ORDER BY OrderTotal DESC) AS rank_partition,
PERCENT_RANK() OVER (PARTITION BY CustomerID ORDER BY OrderTotal DESC) AS percent_rank_partition
from OrderRanking order by customerid, ordertotal desc;

 orderid | customerid | ordertotal | rank_partition | percent_rank_partition 
---------+------------+------------+----------------+------------------------
       4 |          1 |    3000.00 |              1 |                      0
       3 |          1 |    1000.00 |              2 |                   0.25
       9 |          1 |    1000.00 |              2 |                   0.25
       2 |          1 |     650.00 |              4 |                   0.75
       1 |          1 |     500.00 |              5 |                      1
       7 |          2 |    2000.00 |              1 |                      0
       6 |          2 |    1000.00 |              2 |      0.333333333333333
      10 |          2 |     500.00 |              3 |      0.666666666666667
       5 |          2 |     500.00 |              3 |      0.666666666666667
       8 |          3 |     500.00 |              1 |                      0
(10 rows)

CUME_DIST()
Relative rank of the current row: (number of rows preceding or peer with current row) / (total rows)

select *,
CUME_DIST() OVER (ORDER BY OrderTotal DESC) AS cume_dist
from OrderRanking order by OrderTotal desc;

 orderid | customerid | ordertotal | cume_dist 
---------+------------+------------+-----------
       4 |          1 |    3000.00 |       0.1
       7 |          2 |    2000.00 |       0.2
       6 |          2 |    1000.00 |       0.5
       9 |          1 |    1000.00 |       0.5
       3 |          1 |    1000.00 |       0.5
       2 |          1 |     650.00 |       0.6
      10 |          2 |     500.00 |         1
       8 |          3 |     500.00 |         1
       5 |          2 |     500.00 |         1
       1 |          1 |     500.00 |         1
(10 rows)

select *,
CUME_DIST() OVER (ORDER BY OrderTotal DESC) AS cume_dist,
CUME_DIST() OVER (PARTITION BY CustomerID ORDER BY OrderTotal DESC) AS cume_dist_partition
from OrderRanking order by customerid, ordertotal desc;

 orderid | customerid | ordertotal | cume_dist | cume_dist_partition 
---------+------------+------------+-----------+---------------------
       4 |          1 |    3000.00 |       0.1 |                 0.2
       3 |          1 |    1000.00 |       0.5 |                 0.6
       9 |          1 |    1000.00 |       0.5 |                 0.6
       2 |          1 |     650.00 |       0.6 |                 0.8
       1 |          1 |     500.00 |         1 |                   1
       7 |          2 |    2000.00 |       0.2 |                0.25
       6 |          2 |    1000.00 |       0.5 |                 0.5
      10 |          2 |     500.00 |         1 |                   1
       5 |          2 |     500.00 |         1 |                   1
       8 |          3 |     500.00 |         1 |                   1
(10 rows)

Percentile functions

select ordertotal from orderranking order by 1;
 ordertotal 
------------
     500.00
     500.00
     500.00
     500.00
     650.00
    1000.00
    1000.00
    1000.00
    2000.00
    3000.00
(10 rows)

The median for the above data set is (650 + 1000) / 2 = 825.

select percentile_disc(0.5) WITHIN GROUP (order by ordertotal),
percentile_cont(0.5) WITHIN GROUP (order by ordertotal), 
median(ordertotal) from orderranking; 
 
 percentile_disc | percentile_cont | median 
-----------------+-----------------+--------
             650 |             825 |    825
(1 row)

Percentiles by customer

select 
 customerid,
 percentile_disc(0.25) WITHIN GROUP (order by ordertotal) as bottom_quartile,
 percentile_cont(0.25) WITHIN GROUP (order by ordertotal) as bottom_continuous,
 percentile_disc(0.5) WITHIN GROUP (order by ordertotal) as median_quartile,
 percentile_cont(0.5) WITHIN GROUP (order by ordertotal) as median_continuous,
 percentile_disc(0.75) WITHIN GROUP (order by ordertotal) as top_quartile,
 percentile_cont(0.75) WITHIN GROUP (order by ordertotal) as top_continuous
from orderranking group by 1 order by 1;

 customerid | bottom_quartile | bottom_continuous | median_quartile | median_continuous | top_quartile | top_continuous 
------------+-----------------+-------------------+-----------------+-------------------+--------------+----------------
          1 |             650 |               650 |            1000 |              1000 |         1000 |           1000
          2 |             500 |               500 |             500 |               750 |         1000 |           1250
          3 |             500 |               500 |             500 |               500 |          500 |            500
(3 rows)

Lead & Lag

select
customerid,
ordertotal,
lag(customerid, 1) OVER (order by ordertotal) as lag_func,
lead(customerid, 1) OVER (order by ordertotal) as lead_func
from orderranking group by 1,2 order by ordertotal;

 customerid | ordertotal | lag_func | lead_func 
------------+------------+----------+-----------
          3 |     500.00 |          |         2
          2 |     500.00 |        3 |         1
          1 |     500.00 |        2 |         1
          1 |     650.00 |        1 |         1
          1 |    1000.00 |        1 |         2
          2 |    1000.00 |        1 |         2
          2 |    2000.00 |        2 |         1
          1 |    3000.00 |        2 |          
(8 rows)

Reusing a window definition

select 
CustomerID, 
SUM(OrderTotal) OVER (w) as sum, 
AVG(OrderTotal) OVER (w) as avg 
from OrderRanking WINDOW w AS (PARTITION BY CustomerID)
order by 1;

 customerid |   sum   |          avg          
------------+---------+-----------------------
          1 | 6150.00 | 1230.0000000000000000
          1 | 6150.00 | 1230.0000000000000000
          1 | 6150.00 | 1230.0000000000000000
          1 | 6150.00 | 1230.0000000000000000
          1 | 6150.00 | 1230.0000000000000000
          2 | 4000.00 | 1000.0000000000000000
          2 | 4000.00 | 1000.0000000000000000
          2 | 4000.00 | 1000.0000000000000000
          2 | 4000.00 | 1000.0000000000000000
          3 |  500.00 |  500.0000000000000000
(10 rows)

Document Similarity with LSH and MinHash

In this blog post I am going to discuss two probabilistic algorithms which are used for matching documents for similarity: determining if two documents are similar to each other using MinHash, and comparing N documents for similarity using LSH. The implementations of the MinHash and LSH algorithms are based on [1] and [2].

In this article, similarity of documents does not mean contextual similarity but rather lexical matching of documents.

Similarity Techniques

Jaccard Similarity

The Jaccard similarity of two sets is the ratio of the size of the intersection of the two sets to the size of their union.

For sets A and B

J (A, B) = |A ∩ B| / |A U B|
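The set form of the formula is a one-liner in Python. This is an illustrative sketch; the function name `jaccard` is my own choice, not something from the post's code.

```python
def jaccard(a, b):
    """Jaccard similarity: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # convention for two empty sets
    return len(a & b) / len(a | b)

# {1, 2, 3} vs {2, 3, 4}: 2 shared elements out of 4 distinct → 0.5
```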

Jaccard Bag Similarity

The Jaccard bag similarity for bags A and B is defined by counting an element n times in the intersection, where n is the minimum of the number of times the element appears in A and in B. In the union, count the sum of the number of times an element appears in A and B.

Example – if bag A contains {a, a, a, b} and bag B contains {a, a, b, b, c}, then the min count of 'a' is 2 and the min count of 'b' is 1, making a total of 3. The union of the two bags is the sum of the sizes of each bag, so 9. Hence the Jaccard bag similarity is 3/9 = 1/3.
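The bag variant from the example above can be sketched with `collections.Counter`. The function name is mine, and note that the "union" here is the sum of bag sizes, as defined above, not the multiset max-union.

```python
from collections import Counter

def jaccard_bag(a, b):
    """Jaccard bag similarity: min counts in the intersection,
    sum of bag sizes in the union (per the definition above)."""
    ca, cb = Counter(a), Counter(b)
    inter = sum(min(ca[x], cb[x]) for x in ca.keys() & cb.keys())
    union = sum(ca.values()) + sum(cb.values())
    return inter / union

# The example from the text:
# jaccard_bag(list("aaab"), list("aabbc")) → 3/9 = 1/3
```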

Representation of documents

Before matching two documents for similarity, they need to be represented as sets. The most effective way to lexically match documents is to construct a set of N-grams of the document. N-grams, or shingles, are sequences of N consecutive characters. So, for example, the 2-grams for the word “match” would be {ma, at, tc, ch} — essentially a sliding window of 2 characters over the word “match”.

Shingles can also be constructed from words in a text instead of single characters. Shingles for a sentence would be a set formed from a sliding window of consecutive words. For example, the 2-shingles or 2-grams for “Its quite sunny today” would be {“its quite”, “quite sunny”, “sunny today”}. This can be very helpful when matching documents that might have sentences in different order, or words within sentences in different order. This can also be customized to remove stop words from the documents.
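Both shingling schemes above can be sketched in a few lines of Python. The function names are my own, not from the linked implementation:

```python
def char_shingles(text, k):
    """Set of all k-character substrings (k-grams) of text."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def word_shingles(text, k):
    """Set of all k-word windows, joined back into strings."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

# char_shingles("match", 2) → {"ma", "at", "tc", "ch"}
# word_shingles("Its quite sunny today", 2)
#   → {"its quite", "quite sunny", "sunny today"}
```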

Choosing the right shingle size is important. If the size were too small then the shingles would most likely repeat across documents leading to false positives. If the size were too large then the similarity would be very low leading to false negatives. The shingle size should be chosen such that probability of finding any given shingle in a document is low.

Hashing Shingles

Another concern with shingles made from substrings of a document is the memory cost. English has 26 letters, but a document has other characters such as whitespace, '.', or symbols, which makes the total number of possible shingles very high. A standard approximation for documents is to consider the 20 characters which occur most often, and hence the number of possible shingles would be 20^k for k-shingles.

To circumvent this cost, shingles for a large k, say 10, can be hashed to a bucket number in the 4-byte range, thus compacting the data.
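One way to do this in Python is to hash each shingle with a 32-bit checksum such as CRC-32, so each shingle becomes a bucket number in the 4-byte range. The helper below is an illustrative sketch, not the post's actual implementation:

```python
import zlib

def hashed_shingles(text, k):
    """Replace each k-shingle with a bucket number in the 4-byte
    (32-bit) range instead of storing the substring itself."""
    return {zlib.crc32(text[i:i + k].encode())
            for i in range(len(text) - k + 1)}
```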

Characteristic matrices and Signatures

If there were N documents being matched for similarity, where N is a large number, it would not be feasible to match the many k-shingles of each document. Instead, signatures, or smaller subsets, are formed from each set of shingles (each set is derived from one document). A characteristic matrix represents the set or document identifiers as columns and the elements of all the sets combined as rows. A cell is given the value 1 when an element is part of a set, 0 otherwise. Characteristic matrices are unlikely candidates for data storage as they are sparse, with more zeros than ones, and are best used for visualizing data.

        D1   D2   ...  Dn
h(s1)    1    0   ...   1
h(s2)    0    1   ...   0
...     ...  ...  ...  ...
h(sn)    1    1   ...   0
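As a sketch, a characteristic matrix like the one above can be built from a list of shingle sets. The helper name is my own:

```python
def characteristic_matrix(sets):
    """Rows = all elements across the sets, columns = the sets;
    a cell is 1 if the element belongs to that set, else 0."""
    elements = sorted(set().union(*sets))
    matrix = [[1 if e in s else 0 for s in sets] for e in elements]
    return elements, matrix

# Two tiny "documents" as shingle sets:
# characteristic_matrix([{"ab", "bc"}, {"bc", "cd"}])
#   → (["ab", "bc", "cd"], [[1, 0], [1, 1], [0, 1]])
```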

Minhashing and Signature matrices

Minhashing is another way of representing the characteristic matrix mentioned above. The minhash of a set or document is a subset of, say, several hundred of the many hash functions applied to it. A minhash on a set is calculated by picking a permutation of the rows for a column of the characteristic matrix. The minhash value of any column is the number of the first row, in the permuted order, in which the column has a 1.

Signature matrices consist of the hash functions as rows and the sets as columns. Each column essentially is a vector of all the permuted hash values and is referred to as Minhash Signatures for that Set. Signature matrices are much smaller than characteristic matrices.

Minhashing and Jaccard Similarity

To find the similarity between two documents Jaccard similarity of the minhash sets is a good estimator.

        D1   D2
h(s1)    0    0
h(s2)    1    0
...     ...  ...
h(sn)    1    1

There are 3 types of combinations possible for the rows corresponding to each hash function as seen in the above table.

  • Both columns have 0's: Type X
  • One column has a 0 and the other a 1: Type Y
  • Both columns have 1's: Type Z

Type X rows do not fall into either the union or the intersection of the two sets, hence we ignore them.

The union of the two sets would be the combination of Type Y and Type Z rows. Hence |A U B| = Y + Z.

The intersection of the two sets would be where both columns are 1, so Type Z. Hence |A ∩ B| = Z.

Now, as mentioned before, the Jaccard similarity is J (A, B) = |A ∩ B| / |A U B|,

so J (A, B) = Z / (Y + Z)

Computing Minhash Signatures

Characteristic matrices are generally huge, and picking a permutation of rows from millions of rows would be a time-consuming process. A better way to do this is to use a random hash function which maps row numbers to bucket numbers. It is possible for multiple rows to be mapped to the same bucket number and cause collisions, but if the number of buckets is large then the probability of collisions is much lower.

Row   S1   S2   S3   S4   h1(x) = (x + 1) mod 5   h2(x) = (3x + 1) mod 5
 0     1    0    0    1             1                        1
 1     0    0    1    0             2                        4
 2     0    1    0    1             3                        2
 3     1    0    1    1             4                        0
 4     0    0    1    0             0                        3

                                    Table: Hash functions for Characteristic matrix

To compute the minhash signature matrix, begin by choosing n (2 in the above example) hash functions (h1 and h2) on the rows; the signature matrix initially consists of ∞ in all cells.

      S1   S2   S3   S4
h1     ∞    ∞    ∞    ∞
h2     ∞    ∞    ∞    ∞

First consider row 0, S1 and S4 have 1's and S2 and S3 have 0's so only S1 and S4 can be altered. The values of both h1(0) and h2(0) are 1 for row 0 and will replace ∞ (1 < ∞). Thus the matrix looks like below.

      S1   S2   S3   S4
h1     1    ∞    ∞    1
h2     1    ∞    ∞    1

Next, for row 1, only S3 has a 1 and can be altered. Thus we set the values of h1 and h2 for S3.

      S1   S2   S3   S4
h1     1    ∞    2    1
h2     1    ∞    4    1

Similarly, row 2 has 1's for S2 and S4. S2 had ∞ for both h1 and h2, and these will be replaced by 3 and 2 respectively. S4 had 1's for both h1 and h2, which are lower than 3 and 2, hence there will be no change.

      S1   S2   S3   S4
h1     1    3    2    1
h2     1    2    4    1

Following through the minhash process of iteratively replacing a hash value whenever a lower one exists, we get the signature matrix below.

      S1   S2   S3   S4
h1     1    3    0    1
h2     0    2    0    0

Jaccard similarity can be applied to the above signature matrix to estimate the similarity between sets (documents). For example, the estimated Jaccard similarity between S1 and S2 would be 0 (no hashes match), whereas for S1 and S3 it would be 0.5 (one hash matches).
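The row-by-row procedure above can be sketched in Python; running it on the characteristic matrix and the two hash functions from the table reproduces the final signature matrix. The names and structure here are mine, not the linked repo's:

```python
INF = float("inf")

def minhash_signatures(char_matrix, hash_funcs):
    """One pass over the characteristic matrix: for every row r and
    every set with a 1 in that row, keep the smallest h(r) seen so far."""
    n_sets = len(char_matrix[0])
    sig = [[INF] * n_sets for _ in hash_funcs]
    for r, row in enumerate(char_matrix):
        hashes = [h(r) for h in hash_funcs]
        for s, bit in enumerate(row):
            if bit:
                for i, hv in enumerate(hashes):
                    if hv < sig[i][s]:
                        sig[i][s] = hv
    return sig

# The characteristic matrix from the table above (rows 0-4, sets S1-S4):
matrix = [
    [1, 0, 0, 1],
    [0, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 0, 1, 1],
    [0, 0, 1, 0],
]
h1 = lambda x: (x + 1) % 5
h2 = lambda x: (3 * x + 1) % 5
# minhash_signatures(matrix, [h1, h2]) → [[1, 3, 0, 1], [0, 2, 0, 0]]
```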

Locality Sensitive Hashing

Minhash solves the problem of comparing large documents by compressing them into signature matrices, which are a much smaller data set than the original characteristic matrix. But when a large number of large documents is involved, comparing them to each other would still be a time-consuming task because of the sheer number of pairs of documents that would need to be compared for similarity.

Locality sensitive hashing, or near-neighbor search, solves this problem by focusing on pairs which are likely to be similar rather than on all pairs.

The idea is to hash each item several times such that similar items are more likely to hash to the same bucket than dissimilar items. Any pair of documents which hashed to the same bucket is treated as a candidate pair, and these are in turn evaluated for similarity using, for example, the Jaccard measure, cosine distance, or a Euclidean approach.

LSH with Minhash

If we have already computed the minhash signatures for the items, then an effective approach to LSH is to divide the signature matrix into b bands of r rows each. For each band there is a hash function that takes the vector of r values and hashes it to one of many buckets.

Pick a threshold value t such that t = (1/b) ^ (1/r). Amongst the candidate pairs find out which pairs have Jaccard similarity higher than t to get the list of documents which are most similar to each other.
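A minimal banding sketch, assuming the signature matrix is a list of rows (one per hash function) and using Python's built-in hash of each band chunk as the bucket key. The function name and details are my own, not from the linked repo:

```python
from collections import defaultdict
from itertools import combinations

def lsh_candidate_pairs(signatures, b, r):
    """Split each column (document signature) into b bands of r rows;
    documents whose band hashes into the same bucket become candidates."""
    n_docs = len(signatures[0])
    columns = [tuple(row[d] for row in signatures) for d in range(n_docs)]
    candidates = set()
    for band in range(b):
        buckets = defaultdict(list)
        for d, col in enumerate(columns):
            chunk = col[band * r:(band + 1) * r]
            buckets[hash(chunk)].append(d)
        for docs in buckets.values():
            candidates.update(combinations(docs, 2))
    return candidates

# Similarity threshold approximated by the banding parameters:
threshold = (1 / 20) ** (1 / 5)   # b = 20 bands of r = 5 rows, t ≈ 0.55
```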

You can find the code I worked on for matching two documents here:

https://github.com/appym/lsh

References

  1. Rajaraman, Anand, and Jeffrey Ullman. Mining of Massive Datasets, Chapter 3. 2010 (draft).
  2. http://okomestudio.net/biboroku/?p=2065

Unix Commands for OSx

I have been meaning to note down my checklist of Unix commands (for OS X) which are very handy for basic operations on data. I will modify this post as and when I remember or come across something that fits here. These *nix commands are specifically tested on Mac OS.

Uniques

uniq - This is the Unix unique function, which can be used primarily to remove duplicates from a file, amongst other things. The file has to be pre-sorted for uniq to work.
Consider a file test which contains the following

$ cat test
aa
bb
bb
cc
cc
cc

Remove duplicates

$ uniq test
aa
bb
cc

Count occurrences of each item

$ uniq -c test
1 aa
2 bb
3 cc

Print only duplicate items in file

$ uniq -d test
bb
cc

Print only unique lines

$ uniq -u test
aa

Consider test now contains

$ cat test
aa
bb
cc
AA
cC

Remove duplicates, case-insensitive.
This file is not sorted though, so it has to be sorted first before uniq. The -i flag is for case-insensitive comparison

$ sort test | uniq -i
AA
bb
cC

Case conversion

Convert all upper case in fileA to lower case and output as fileB

$ tr '[:upper:]' '[:lower:]' < fileA.txt > fileB.txt

File comparison

Compare two sorted files and keep lines present in fileA but not in fileB

$ comm -23 fileA fileB

Compare two sorted files and keep lines present in fileB but not in fileA

$ comm -13 fileA fileB

Compare two sorted files and keep only lines which are present in both files

$ comm -12 fileA fileB

Sed

The primary purpose of sed is string replacement or pattern replacement.
Consider the following file as input

$ cat file.txt
unix is great os. unix is opensource. unix is free os.
learn operating system.
unixlinux which one you choose.
  1. Replacing or substituting string
    $ sed 's/unix/linux/' file.txt
    linux is great os. unix is opensource. unix is free os.
    learn operating system.
    linuxlinux which one you choose.
    
    By default, the sed command replaces the first occurrence of the pattern in each line; it won't replace the second, third, ... occurrences in the line. Here the "s" specifies the substitution operation and the "/" characters are delimiters. "unix" is the search pattern and "linux" is the replacement string. If you miss a delimiter then the expression errors out as below
    $ sed 's/unix/linux' file.txt 
    sed: 1: "s/unix/linux": unterminated substitute in regular expression
    
    2. Replacing the nth occurrence of a pattern in a line. Use the /1, /2, etc. flags to replace the first, second, ... occurrence of a pattern in a line. The below command replaces the second occurrence of the word "unix" with "linux" in each line.
    $ sed 's/unix/linux/2' file.txt
    unix is great os. linux is opensource. unix is free os.
    learn operating system.
    unixlinux which one you choose.
    
    Here is the first occurrence, which is the default option
    $ sed 's/unix/linux/1' file.txt
    linux is great os. unix is opensource. unix is free os.
    learn operating system.
    linuxlinux which one you choose.
    
    And the third occurrence
    $ sed 's/unix/linux/3' file.txt
    unix is great os. unix is opensource. linux is free os.
    learn operating system.
    unixlinux which one you choose.
    
    To replace all the occurrences use 'g' (global replacement)
    $ sed 's/unix/linux/g' file.txt
    linux is great os. linux is opensource. linux is free os.
    learn operating system.
    linuxlinux which one you choose.
    
    To make the search case-insensitive: sed on Mac does not have a flag for this, but you can use plain regex to achieve it. For example, modify file.txt to the below
    $ vi file.txt
    unix is great os. Unix is opensource. unix is free os.
    learn operating system.
    Unixlinux which one you choose.
    sed 's/[Uu]nix/linux/g' file.txt
    linux is great os. linux is opensource. linux is free os.
    learn operating system.
    linuxlinux which one you choose.
    
    To find a string in all the files contained in a directory, you can use grep or find.
    grep -lr searchStr mydir
    grep --recursive --ignore-case --files-with-matches "searchStr" mydir
    find mydir -type f | xargs grep -l searchStr
    
    To find/replace multiple strings use the -e flag.
    sed -e 's/unix/linux/g' -e 's/Unix/Linux/g' file.txt
    linux is great os. Linux is opensource. linux is free os.
    learn operating system.
    Linuxlinux which one you choose.
    
    To replace a string at the beginning of a line, use the ^ anchor in the regex along with sed
    sed 's/^learn/learn to use/g' file.txt
    unix is great os. Unix is opensource. unix is free os.
    learn to use operating system.
    Unixlinux which one you choose
    
    To remove whitespace characters before a '|' field delimiter; here <spc> and <tab> stand for a literal space and a literal tab typed inside the brackets
    sed 's/[<spc><tab>]|/|/g' file.txt
    
    Unix command to know if your file has whitespace or tab characters
    vi file.txt
    :set list
    
    Unix command to remove BOM (Byte Order Mark) characters from your file: open the file in binary mode using the -b flag to verify if you have a BOM, and then remove it
    vi -b file.txt 
    :set nobomb
    :wq
    
    Use the -i flag to overwrite the existing file in place and create a backup of the original file. For example, to remove all white space in a file
    sed -i .bak 's/ //g' file.txt
    cat file.txt
    unixisgreatos.Unixisopensource.unixisfreeos.
    learnoperatingsystem.
    Unixlinuxwhichoneyouchoose
    
    This will create a backup file called file.txt.bak with the original file contents and overwrite file.txt with the spaces removed. To remove only the trailing spaces in a line use ' *$'. The * character means "any number of the previous character" and $ refers to the end of the line.
    sed -i .bak 's/ *$//g' file.txt
    
    Verify the trailing whitespaces are removed by :set list
    vi file.txt
    :set list
    unix is great os. Unix is opensource. unix is free os.$
    
    learn operating system.$
    
    Unixlinux which one you choose.$
    
    To replace a blank line with something else. You can match a blank line by specifying an end-of-line immediately after a beginning-of-line, i.e. with ^$
    vi file.txt
    unix is great os. Unix is opensource. unix is free os.
    learn operating system.
    Unixlinux which one you choose.
    sed 's/^$/this used to be a blank line/' file.txt
    unix is great os. Unix is opensource. unix is free os.
    this used to be a blank line
    learn operating system.
    Unixlinux which one you choose.
    
    To remove tabs at the end of a line. For example, add a tab to the end of the first line, so :set list will show ^I
    vi file.txt 
    unix is great os. Unix is opensource. unix is free os.^I$
    
    learn operating system.$
    
    Unixlinux which one you choose.$
    
    To type a literal tab in your sed command, press ctrl + v and then ctrl + i; the space before *$ in the command below should be that literal tab
    sed -i.bak 's/ *$//' file.txt
    vi file.txt
    :set list
    unix is great os. Unix is opensource. unix is free os.$
    
    learn operating system.$
    
    Unixlinux which one you choose.$
    
    Consider file test which contains the following
    $ cat test
    (firstname).aa
    (firstname).bb
    (firstname).bb
    (firstname).cc
    (firstname).CC
    (lastname).hh
    (lastname).jj
    (lastname).ll
    
    To extract the content after firstname
    sed -En 's/\(firstname\)\.([A-Za-z]+).*/\1/p' test
    aa
    bb
    bb
    cc
    CC
    

Search Strings

Total occurrences of searchStr in the current directory

grep -ro searchStr . | wc -l | xargs echo "Total matches :"

Total number of files where searchStr occurs in current directory

grep -lor searchStr . | wc -l | xargs echo "Total matches :"

To get an exact word match use the -w flag.

grep -lwr searchStr mydir

Recursively replace the string original with replacement in all files under the OS X directory mydir (excludes hidden files and folders)

find mydir \( ! -regex '.*/\..*' \) -type f -exec sed -i '' 's/original/replacement/g' {} \;

OR

find mydir \( ! -regex '.*/\..*' \) -type f -exec sed -i '' 's/original/replacement/g' {} +

The regex excludes all hidden files and folders, which is particularly important if you want to avoid messing up your .DS_Store or .git files unknowingly.
If you use zsh then the following would also work

sed -i -- 's/original/replacement/g' **/*(D*)

This isn't excluding hidden files though. The **/*(D*) is basically the zsh way of saying recursively go through all sub-directories and all files; the D glob qualifier makes the pattern include dotfiles, which is why hidden files are not excluded.