1. Near Duplicate Detection Using Simhash: Simhash is a locality-sensitive hashing (LSH) algorithm for estimating the cosine similarity between two documents in a large collection (millions) of files. In this project, I preprocessed the documents, created weighted word vectors, and then implemented the simhash algorithm to generate a 64-bit fingerprint of each document. Finally, I implemented a block-permuted Hamming search over the fingerprint space to find the documents that are very similar to a given document.
  2. Text Classification: In this project I classified a large pool of text documents using machine learning techniques. I built a Naive Bayes classifier to classify approximately 20,000 newsgroup documents. The original data set is available here.
  3. Benchmarking for Link Prediction in Social Networks: Given a snapshot of a social network database, we can predict, for a given person (or for the entire network), whether that person is likely to become connected to another person by analyzing the existing links. This work focuses on the node similarity metrics and the databases used for such analysis. We take two datasets (the Facebook dataset from the Stanford Large Network Dataset Collection and a bibliography dataset from DBLP), import them into MySQL (a relational database) and Neo4j (a graph database), and evaluate several link metrics on different network topologies.
  4. PEKS with Bloom Filter: Public Key Encryption with Keyword Search (PEKS) is one of the most widely used methods for searching keywords over encrypted data. However, it does not preserve semantic security: a dictionary attack can help an attacker guess keywords and cause serious damage. I addressed this issue by implementing PEKS and searching for keywords over a Bloom filter. Because a Bloom filter produces false positives, an attacker cannot reliably confirm a guessed keyword, which keeps PEKS from being susceptible to dictionary attacks.
  5. Path ORAM: This project implements Path ORAM, a simple oblivious RAM algorithm. When data is kept on a cloud platform or in any other insecure memory, an attack can be mounted using the access pattern alone. Oblivious RAM hides the memory access pattern at the cost of some extra bandwidth and memory overhead.
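
As a rough illustration of the first project, the simhash fingerprint and a simplified block-based Hamming search could be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the project's actual code: the function names, the MD5-based token hash, the use of term frequency as the word weight, and the split into four 16-bit blocks are all chosen here for illustration.

```python
import hashlib
from collections import Counter, defaultdict

BITS = 64       # fingerprint width, as in the project description
N_BLOCKS = 4    # assumption: four 16-bit blocks


def simhash(text: str) -> int:
    """Return a 64-bit simhash fingerprint of a whitespace-tokenized text."""
    weights = Counter(text.lower().split())  # assumption: weight = term frequency
    acc = [0] * BITS
    for token, weight in weights.items():
        # Assumption: first 8 bytes of MD5 serve as the 64-bit token hash.
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        for i in range(BITS):
            acc[i] += weight if (h >> i) & 1 else -weight
    fingerprint = 0
    for i in range(BITS):
        if acc[i] > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def blocks(fingerprint: int):
    """Split a fingerprint into (block_index, block_value) keys."""
    width = BITS // N_BLOCKS
    mask = (1 << width) - 1
    return [(i, (fingerprint >> (i * width)) & mask) for i in range(N_BLOCKS)]


def build_index(fingerprints: dict) -> dict:
    """Map every block key to the set of documents containing that block."""
    index = defaultdict(set)
    for doc_id, fp in fingerprints.items():
        for key in blocks(fp):
            index[key].add(doc_id)
    return index


def near_duplicates(query_fp: int, fingerprints: dict, index: dict, k: int = 3):
    """Return ids of documents within Hamming distance k of the query.

    With four blocks and k <= 3, at least one block must match exactly
    (pigeonhole), so only documents sharing a block need a full comparison.
    """
    candidates = set()
    for key in blocks(query_fp):
        candidates |= index.get(key, set())
    return sorted(
        d for d in candidates
        if hamming_distance(query_fp, fingerprints[d]) <= k
    )
```

A typical use would be to compute `simhash` once per document, call `build_index` on the resulting dictionary of fingerprints, and then pass a query fingerprint to `near_duplicates`. The exact-match lookup on blocks is what avoids comparing the query against every fingerprint in the collection.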