Govinda M. Kamath

Office # 12065,
1 Memorial Drive,
Microsoft Research New England Research and Development Lab,
Massachusetts 02142,
United States of America.
Email - gkamath AT stanford.edu
Research interests: Machine learning, Algorithms, Computational Genomics, Information Theory
[Google Scholar] [Github] [Linkedin]

About Me

I am a post-doctoral researcher at Microsoft Research New England Lab. Before this I obtained my PhD from the Information Systems Lab at Stanford, working with Prof. David Tse. My work during my doctoral and post-doctoral studies has mainly involved designing and deploying scalable machine learning algorithms to solve real problems, and showing their goodness, with problems often coming from computational genomics.

I've also been involved in TA-ing and helping develop a new course Data science for high throughput sequencing taught by my adviser David Tse. We discuss various problems in computational genomics, the different ways of modelling them and solving them. We cover a variety of tools from statistics, computer science and information theory that have been instrumental in tackling these problems.

I'm passionate about making empirical components of my research easily reproducible. Jupyter notebooks have been a godsend in that context, and has enabled me improve the ease of reproducing of some of my more recent work quite drastically.

In a previous life, I obtained a masters in Electrical Communication Engineering from Indian Institute of science, where I worked with Prof. P Vijay Kumar on designing and proving optimality results of codes related to Distributed Storage Systems.

Research

During my doctoral studies, the following are a sample of the problems I have worked on.

Genome assembly: The genome of an organism is a long strings of A,C,T,G. Current sequencing technology allows us to obtain noisy substrings of the genome called reads from the genome. These noisy reads are of length around 10,000 while the genome is of length in millions or billions. Assembling the genome from reads is a fundamental problem of computational genomics. I worked on this from both a theoretic and a practical perspective.
Genome assembly is constrained by repeats in the genome. We realised that the output of an assembler should be a graph where every sequence corresponding to a genome is an eulerian cycle (or an eulerian tour if the genome is linear). We wrote a practical assembler called HINGE which was published in Genome Research, and is still under active development with quite a few users primarily doing bacterial genomics. The current focus here is reducing the memory footprint. The theoretic results, which showed goodness of simplified version of HINGE from a rate distortion perspective were presented at International Symposium of Information Theory (ISIT).
Using Monte-Carlo tree search to speed up common ML algorithms: We consider common machine learning algorithms such as k-Nearest Neighbours, k-means and show that a monte-carlo search based algorithm can provide speedups of the order of 10x in practice. We also prove theoretical results using the multi-armed bandit framework. This was presented in part in NIPS-2017 workshop, in AISTATS-2018 and is under review currently.
Low-Rank models for sequence alignment: We considered a class of algorithms for sequence alignment - seed and extend algorithms, and drew connections to low rank models. We derived spectral estimators for such models and then used the framework we developed in 2. to speed up its computation. This was published in Recomb 2020, Cell Patterns and NeurIPS 2020.
Knowledge Distillation And Semiparametric Inference: We studied the problem of knowledge distillation - where it has been observed that training a small model using logits soft class labels generated using a large model trained from the data does better than directly training the small model using the data. We cast knowledge distillation as a semiparametric inference problem and derive several new guarantees for the prediction error of standard distillation. This will be published in ICLR 2021.
Single Cell RNA-seq: RNA is an intermediate compound often created as an intermediate between DNA and protein. RNA transcripts are much smaller than DNA (of length tens of thousand instead of billions), and the number of molecule of each RNA transcript is dynamically varying with time (and the type of cell in the organism), while all the cells in any organism has the same DNA. RNA-seq is a sequencing assay that sequences RNA rather than DNA. With advances in the bio-chemistry, researchers have been able to sequence RNA from single cells. This presents interesting computational problems, as there is very little RNA in each cell. The fact that millions of cells are sequenced and each cell has an expression (feature vector) of tens of thousands of dimensions makes the problem more challenging.
One basic application of this technology is to be able to cluster cells from their RNA-seq reads. We developed an equivalence class based method that does this with significant computational savings over conventional methods. This was published in Genome Biology.
Post clustering differential analysis for Single Cell RNA-seq: We noticed that standard pipelines of single cell RNA seq had a data snooping problem leading to artificially low p-values and a selection of spurious genes as differentiating clusters of cells discovered in the single cell RNA-seq analysis. We quantified this bias and proposed solutions. This was published in RECOMB 2019 and Cell Systems.
Haplotype Assembly, Codes and Spectral Algorithms: Humans and other higher organisms are diploid, that is they have two copies of their genome. These two copies are almost identical with some polymorphic sites and regions (less than 0.3% of the genome). The problem here is to estimate which of the polymorphisms are on the same copy of a chromosome. We attacked this in two theoretic frameworks. The first was noting that this was a convolutional code, and the second was posing this as a community detection problem. These were published at ISIT-2015 and ICML-2016.

Some other problems I've dabbled with include trying to understand deep neural nets in the context of information theoretic and computational genomics problems (like the problem of finding DNA motifs binding with proteins), and building deep learning models to predict where CRISPR-CAS9 guides cut a DNA sequences.

Preprints and Publications

Tri Dao, Govinda M Kamath, Vasilis Syrgkanis, and Lester Mackey, ‘‘Knowledge Distillation as Semiparametric Inference’’, accepted at ICLR 2021. [Preprint]

Govinda M Kamath*, Tavor Baharav*, Ilan Shomorony, ‘‘Adaptive Learning of Rank-One Models for Efficient Pairwise Sequence Alignment’’, NeurIPS 2020. [Paper] [Preprint] [Analysis].
* co-first authors

Govinda M. Kamath, Hugh Yeh, Ifrah Tariq, Ernest Fraenkel, and Lester Mackey, ‘‘Graph-structured Feature Selection with Normalized Trend Filtering’’, Under Review.

Tavor Baharav*, Govinda M Kamath*, David N Tse, and Ilan Shomorony, ‘‘Spectral Jaccard Similarity: A new approach to estimating pairwise sequence alignments’’, Cell Patterns 1.6. [Paper] [Preprint] [Analysis]
* co-first authors

Tara Basu Trivedi, Ron Boger, Govinda M Kamath, Georgios Evangelopoulos, Jamie Cate, Jennifer Doudna, and Jack Hidary, ‘‘crispr2vec: Machine Learning Model Predicts Off-Target Cuts of CRISPR systems’’, Under Review. [Preprint]

Jesse M. Zhang, Govinda M. Kamath, David N. Tse, ‘‘Towards a post-clustering test for differential expression’’, Under review at RECOMB 2019. [Preprint] [Analysis]

Vivek Bagaria*, Govinda M. Kamath*, David N. Tse, ‘‘Adaptive Monte-Carlo Optimization’’, Under review. [Preprint] [Analysis]
* co-first authors

Vivek Bagaria *, Govinda M. Kamath *, Vasilis Ntranos *, Martin J. Zhang *, and David N. Tse. “Medoids in almost linear time via multi-armed bandits.” Presented at International Conference on Artificial Intelligence and Statistics - 2018, (AISTATS-2018). Also presented at NIPS workshop for learning interactions from complex data. [Preprint] [Paper] [Analysis].
* co-first authors

Govinda M.Kamath *, Ilan Shomorony*, Fei Xia*, Thomas A. Courtade, and David N. Tse. “HINGE: long-read assembly achieves optimal repeat resolution.” Genome research 27, no. 5 (2017): 747-756. [Preprint] [Paper] [Code] [Analysis] [Press Coverage]
* co-first authors

Yuxin Chen, Govinda M. Kamath, Changho Suh, and David N. Tse, ‘‘Community Recovery in Graphs with Locality’’, International Conference on Machine Learning - 2016 (ICML- 2016), New York City, USA. [Preprint] [Paper]

Lee Organick, Siena Dumas Ang, Yuan-Jyue Chen, Randolph Lopez, Sergey Yekhanin, Konstantin Makarychev, Miklos Z Racz, Govinda M. Kamath, Parikshit Gopalan, Bichlien Nguyen, Christopher Takahashi, Sharon Newman, Hsing-Yeh Parker, Cyrus Rashtchian, Kendall Stewart, Gagan Gupta, Robert Carlson, John Mulligan, Douglas Carmean, Georg Seelig, Luis Ceze, Karin Strauss, ‘‘Scaling up DNA data storage and random access retrieval’’, to be published in Nature Biotechnology. [Preprint]
This was a large project that I was a part of as an intern at MSR, Redmond. My work was mainly in analysing the data, modelling and characterising the storage channel.

Ilan Shomorony, Govinda M. Kamath, Fei Xia, Thomas A. Courtade, and David N. Tse, ‘‘Partial Assembly: A Rate-Distortion Perspective’’, IEEE International Symposium of Information Theory - 2016 (ISIT- 2016), Barcelona, Spain. [Preprint] [Paper]

Jesse Zhang and Govinda M. Kamath, ’'Learning the Language of the Genome using RNNs’’, CS224d Project. [Technical Report]

Vasilis Ntranos*, Govinda M. Kamath*, Jesse Zhang*, Lior Pachter and David N. Tse, ‘‘Fast and accurate single-cell RNA-Seq analysis by clustering of transcript-compatibility counts’’, in press in Genome Biology special issue on single cell omics. [Preprint] [Paper] [Analysis]
* co-first authors

Govinda M Kamath, Eren Sasoglu, and David Tse. ‘‘Optimal Haplotype Assembly from High-Throughput Mate Pair Reads’’, published in IEEE International Symposium of Information Theory - 2015 (ISIT-2015), Hong Kong. [Preprint] [Paper]

Govinda M Kamath, Narayanamoorthy Prakash, Lalitha Vadlamani, and P Vijay Kumar.‘‘Codes With Local Regeneration and Erasure Correction’’. IEEE Transactions of Information Theory June 2014. Earlier version published in the proceedings of IEEE International Symposium of Information Theory - 2013 (ISIT 2013), Istanbul, Turkey. [Preprint] [Conference Version] [Journal Version]

Govinda M Kamath, Narayanamoorthy Prakash, Lalitha Vadlamani, P. Vijay Kumar, Natalia Silberstein, Ankit S. Rawat, O. Ozan Koyluoglu, and Sriram Vishwanath, ‘‘Explicit MBR All-Symbol Locality Codes’’. Proceedings of IEEE International Symposium of Information Theory - 2013 (ISIT 2013), Istanbul, Turkey. [Preprint] [Paper]

Narayanamoorthy Prakash, Govinda M Kamath, Lalitha Vadlamani, and P Vijay Kumar.‘‘Optimal Linear Codes with a Local-Error-Correction Property’’. in Proceedings of IEEE International Symposium of Information Theory - 2012 (ISIT 2012), Cambridge, Massachusetts, USA. [Preprint] [Paper]

Govinda M Kamath and P Vijay Kumar. ‘‘Regenerating codes: a reformulated storage-bandwidth trade-off and a new construction.’'in Proceedings. of the National Conference on Communication 2012 (NCC 2012), IIT Kharagpur, India. [Paper]

Lalitha Vadlamani, Narayanamoorthy Prakash, Govinda M Kamath, and P Vijay Kumar. ‘‘On t-designs and bounds relating query complexity to error resilience in locally correctable codes.’’ in Proceedings. of the National Conference on Communication 2012 (NCC 2012), IIT Kharagpur, India. [Paper]