Govinda M. Kamath
|
Office # 12065,
1 Memorial Drive,
Microsoft Research New England Research and Development Lab,
Massachusetts 02142,
United States of America.
Email - gkamath AT stanford.edu
Research interests: Machine learning, Algorithms, Computational Genomics, Information Theory
[Google Scholar] [Github]
[Linkedin]
|
About Me
I am a post-doctoral researcher at Microsoft Research New England Lab.
Before this I obtained my PhD from the
Information Systems Lab at Stanford,
working with Prof. David Tse.
My work during my doctoral and post-doctoral studies
has mainly involved designing and
deploying scalable machine learning algorithms to solve real problems,
and showing their goodness, with problems often coming from computational genomics.
I've also been involved in TA-ing and helping develop a new course
Data science for high throughput sequencing taught by my adviser David Tse.
We discuss various problems in computational genomics, the different ways of modelling them and solving them. We
cover a variety of tools from statistics, computer science and information theory that have been instrumental in
tackling these problems.
I'm passionate about making empirical components of my research easily reproducible. Jupyter notebooks
have been a godsend in that context, and has enabled me improve the ease of reproducing of some of my more
recent work quite drastically.
In a previous life, I obtained a masters in Electrical Communication Engineering from Indian Institute of science,
where I worked with Prof. P Vijay Kumar on designing and
proving optimality results of codes related to Distributed Storage Systems.
Research
During my doctoral studies, the following are a sample of the problems I have worked on.
Genome assembly: The genome of an organism is a long strings of A,C,T,G.
Current sequencing technology allows us to obtain noisy substrings of the genome called reads from the genome.
These noisy reads are of length around 10,000 while the genome is of length in millions or billions.
Assembling the genome from reads is a fundamental problem of computational genomics.
I worked on this from both a theoretic and a practical perspective.
Genome assembly is constrained by repeats in the genome. We realised that the output of an assembler should
be a graph where every sequence corresponding to a genome is an eulerian cycle
(or an eulerian tour if the genome is linear). We wrote a practical assembler called HINGE
which was published in Genome Research,
and is still under active development with quite a few users primarily doing bacterial genomics. The current focus here is reducing the memory footprint. The theoretic results, which showed goodness of simplified version of HINGE from a rate distortion perspective
were presented at International Symposium of Information Theory (ISIT).
Using Monte-Carlo tree search to speed up common ML algorithms:
We consider common machine learning algorithms such as k-Nearest Neighbours, k-means and show that a monte-carlo search based algorithm can
provide speedups of the order of 10x in practice. We also prove theoretical results using the multi-armed bandit framework.
This was presented in part in NIPS-2017 workshop, in AISTATS-2018 and is
under review currently.
Low-Rank models for sequence alignment: We considered a class of algorithms for sequence alignment - seed
and extend algorithms, and drew connections to low rank models. We derived spectral
estimators for such models and then used the framework we developed in 2. to speed up its computation.
This was published in Recomb 2020,
Cell Patterns and
NeurIPS 2020.
Knowledge Distillation And Semiparametric Inference:
We studied the problem of knowledge distillation - where it has been observed
that training a small model using logits soft class labels generated using a large
model trained from the data does better than directly training the small model using the data.
We cast knowledge distillation as a semiparametric inference problem and derive several new guarantees
for the prediction error of standard distillation. This will be published in ICLR 2021.
Single Cell RNA-seq: RNA is an intermediate compound often created as an intermediate between DNA and protein.
RNA transcripts are much smaller than DNA (of length tens of thousand instead of billions), and the number of
molecule of each RNA transcript is dynamically varying with time (and the type of cell in the organism), while
all the cells in any organism has the same DNA.
RNA-seq is a sequencing assay that sequences RNA rather than DNA.
With advances in the bio-chemistry, researchers have been able to sequence RNA from single cells.
This presents interesting computational problems, as there is very little RNA in each cell.
The fact that millions of cells are sequenced and each cell has an expression (feature vector) of tens
of thousands of dimensions makes the problem more challenging.
One basic application of this technology is to be able to cluster cells from their RNA-seq reads.
We developed an equivalence class based method that does this with significant computational
savings over conventional methods. This was published in
Genome Biology.
Post clustering differential analysis for Single Cell RNA-seq:
We noticed that standard pipelines of single cell RNA seq had a data snooping
problem leading to artificially low p-values
and a selection of spurious genes as differentiating clusters of cells discovered
in the single cell RNA-seq analysis.
We quantified this bias and proposed solutions. This was published in
RECOMB 2019 and
Cell Systems.
Haplotype Assembly, Codes and Spectral Algorithms: Humans and other higher organisms are diploid,
that is they have two copies of their genome. These two copies are almost identical
with some polymorphic sites and regions (less than 0.3% of the genome).
The problem here is to estimate which of the polymorphisms are on the same copy of a chromosome.
We attacked this in two theoretic frameworks.
The first was noting that this was a convolutional code, and the second was posing this as a community detection problem.
These were published at ISIT-2015 and ICML-2016.
Some other problems I've dabbled with include trying to understand deep neural nets in the context of
information theoretic and computational genomics problems (like the problem of finding DNA motifs binding
with proteins), and building deep learning models to predict where CRISPR-CAS9 guides cut a DNA sequences.
Preprints and Publications
Tri Dao, Govinda M Kamath, Vasilis Syrgkanis, and Lester Mackey, ‘‘Knowledge Distillation as Semiparametric Inference’’,
accepted at ICLR 2021. [Preprint]
Govinda M Kamath*, Tavor Baharav*, Ilan Shomorony, ‘‘Adaptive Learning of Rank-One Models for Efficient Pairwise Sequence Alignment’’,
NeurIPS 2020. [Paper] [Preprint]
[Analysis].
* co-first authors
Govinda M. Kamath, Hugh Yeh, Ifrah Tariq, Ernest Fraenkel, and Lester Mackey, ‘‘Graph-structured Feature Selection with
Normalized Trend Filtering’’, Under Review.
Tavor Baharav*, Govinda M Kamath*, David N Tse, and Ilan Shomorony, ‘‘Spectral Jaccard Similarity:
A new approach to estimating pairwise sequence alignments’’, Cell Patterns 1.6. [Paper]
[Preprint] [Analysis]
* co-first authors
Tara Basu Trivedi, Ron Boger, Govinda M Kamath, Georgios Evangelopoulos, Jamie Cate, Jennifer Doudna, and Jack Hidary, ‘‘crispr2vec: Machine Learning Model Predicts Off-Target Cuts of CRISPR systems’’, Under Review. [Preprint]
Jesse M. Zhang, Govinda M. Kamath, David N. Tse, ‘‘Towards a post-clustering test for differential expression’’,
Under review at RECOMB 2019. [Preprint]
[Analysis]
Vivek Bagaria*, Govinda M. Kamath*, David N. Tse, ‘‘Adaptive
Monte-Carlo Optimization’’, Under review. [Preprint]
[Analysis]
* co-first authors
Govinda M.Kamath *, Ilan Shomorony*, Fei Xia*, Thomas A. Courtade, and David N. Tse. “HINGE: long-read assembly achieves optimal repeat resolution.” Genome research 27, no. 5 (2017): 747-756. [Preprint] [Paper] [Code] [Analysis]
[Press Coverage]
* co-first authors
Yuxin Chen, Govinda M. Kamath, Changho Suh, and David N. Tse, ‘‘Community Recovery in Graphs with Locality’’, International Conference on Machine Learning - 2016 (ICML- 2016), New York City, USA. [Preprint] [Paper]
Lee Organick, Siena Dumas Ang, Yuan-Jyue Chen, Randolph Lopez, Sergey Yekhanin, Konstantin Makarychev, Miklos Z Racz, Govinda M. Kamath, Parikshit Gopalan, Bichlien Nguyen, Christopher Takahashi, Sharon Newman, Hsing-Yeh Parker, Cyrus Rashtchian, Kendall Stewart, Gagan Gupta, Robert Carlson, John Mulligan, Douglas Carmean, Georg Seelig, Luis Ceze, Karin Strauss,
‘‘Scaling up DNA data storage and random access retrieval’’, to be published in Nature Biotechnology. [Preprint]
This was a large project that I was a part of as an intern at MSR, Redmond. My work was mainly in analysing the data, modelling and characterising the storage channel.
Ilan Shomorony, Govinda M. Kamath, Fei Xia, Thomas A. Courtade, and David N. Tse, ‘‘Partial Assembly: A Rate-Distortion Perspective’’, IEEE International Symposium of Information Theory - 2016 (ISIT- 2016), Barcelona, Spain. [Preprint] [Paper]
Vasilis Ntranos*, Govinda M. Kamath*, Jesse Zhang*, Lior Pachter and David N. Tse, ‘‘Fast and accurate single-cell RNA-Seq analysis by clustering of transcript-compatibility counts’’, in press in Genome Biology special issue on single cell omics. [Preprint] [Paper] [Analysis]
* co-first authors
Govinda M Kamath, Eren Sasoglu, and David Tse. ‘‘Optimal Haplotype Assembly from High-Throughput Mate Pair Reads’’, published in IEEE International Symposium of Information Theory - 2015 (ISIT-2015), Hong Kong. [Preprint] [Paper]
Govinda M Kamath, Narayanamoorthy Prakash, Lalitha Vadlamani, and P Vijay Kumar.‘‘Codes With Local Regeneration and Erasure Correction’’. IEEE Transactions of Information Theory June 2014. Earlier version published in the proceedings of IEEE International Symposium of Information Theory - 2013 (ISIT 2013), Istanbul, Turkey. [Preprint] [Conference Version] [Journal Version]
Govinda M Kamath, Narayanamoorthy Prakash, Lalitha Vadlamani, P. Vijay Kumar, Natalia Silberstein, Ankit S. Rawat, O. Ozan Koyluoglu, and Sriram Vishwanath, ‘‘Explicit MBR All-Symbol Locality Codes’’. Proceedings of IEEE International Symposium of Information Theory - 2013 (ISIT 2013), Istanbul, Turkey. [Preprint] [Paper]
Narayanamoorthy Prakash, Govinda M Kamath, Lalitha Vadlamani, and P Vijay Kumar.‘‘Optimal Linear Codes with a Local-Error-Correction Property’’. in Proceedings of IEEE International Symposium of Information Theory - 2012 (ISIT 2012), Cambridge, Massachusetts, USA. [Preprint] [Paper]
Govinda M Kamath and P Vijay Kumar. ‘‘Regenerating codes: a reformulated storage-bandwidth trade-off and a new construction.’'in Proceedings. of the National Conference on Communication 2012 (NCC 2012), IIT Kharagpur, India. [Paper]
Lalitha Vadlamani, Narayanamoorthy Prakash, Govinda M Kamath, and P Vijay Kumar. ‘‘On t-designs and bounds relating query complexity to error resilience in locally correctable codes.’’ in Proceedings. of the National Conference on Communication 2012 (NCC 2012), IIT Kharagpur, India. [Paper]
|