Projects
🔬 Computational Identification of Non-coding RNAs Associated with Chronic Diseases
Currently developing an AI/ML-based predictive framework to identify non-coding RNAs (primarily lncRNAs) associated with chronic diseases, with a current focus on Type 2 Diabetes (T2D). The project covers the complete end-to-end workflow: data collection through bioinformatics pipelines, dataset preprocessing, hypothesis formulation, feature extraction, modeling, and interpretation of results.
I am exploring new approaches at multiple stages of the pipeline, such as feature engineering with sequence-, structure-, and expression-based representations, and hybrid machine learning and deep learning architectures to improve biomarker discovery and predictive stability. I am also identifying SNPs across multiple datasets to integrate genomic variation into the model. The project aims to support precision medicine by facilitating patient-specific diagnosis and improving the understanding of lncRNA involvement in chronic metabolic disorders.
Key Contributions & Skills Learned:
- Bioinformatics workflow: sequence alignment and variant calling using HISAT2, SAMtools, BWA, GATK, VCFtools
- VCF preprocessing, variant interpretation, and metadata handling to identify mutations and their positions
- Feature engineering techniques for genomic data
- Modeling with Random Forest, XGBoost, and deep learning models
- Handling heavy class imbalance, hyperparameter optimization, and model evaluation
- Formulating hypotheses and novel solution directions for biomarker prediction
Tools & Libraries: Python, Scikit-learn, Pandas, NumPy, PyTorch, Biopython, PC-PseDNC-General, Plotly, etc.
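As an illustration of the VCF preprocessing step listed above, here is a minimal sketch that reads VCF text and reports each alternate allele with its position and a coarse variant type. It is not the project's actual pipeline: the function names (`parse_vcf_variants`, `classify_variant`) and the two example records are hypothetical, and only the standard tab-separated VCF columns (CHROM, POS, ID, REF, ALT) are assumed.

```python
import io


def parse_vcf_variants(handle):
    """Yield (chrom, pos, ref, alt) for each alternate allele in a VCF stream."""
    for line in handle:
        line = line.rstrip("\n")
        # Skip blank lines, meta-information (##...) and the column header (#CHROM...).
        if not line or line.startswith("#"):
            continue
        chrom, pos, _id, ref, alt = line.split("\t")[:5]
        # Multi-allelic sites list several ALT alleles separated by commas.
        for allele in alt.split(","):
            # Skip missing, spanning-deletion and symbolic alleles (e.g. <DEL>).
            if allele in (".", "*") or allele.startswith("<"):
                continue
            yield chrom, int(pos), ref, allele


def classify_variant(ref, alt):
    """Coarse variant type based only on REF/ALT allele lengths."""
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"
    if len(ref) == len(alt):
        return "MNP"
    return "INDEL"


# Made-up records for illustration only.
EXAMPLE_VCF = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "chr1\t12345\t.\tA\tG\t50\tPASS\t.\n"
    "chr2\t200\t.\tAT\tA,ATT\t30\tPASS\t.\n"
)

variants = [
    (chrom, pos, ref, alt, classify_variant(ref, alt))
    for chrom, pos, ref, alt in parse_vcf_variants(io.StringIO(EXAMPLE_VCF))
]

for record in variants:
    print(record)
```

For real data, the same generator could be fed an open file handle instead of the in-memory example; compressed or very large VCFs would usually be handled with dedicated tools such as the ones listed above.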
🧬 Mentoring Undergraduate Students in Computational Genomics & Machine Learning
Mentoring undergraduate students working on projects related to Parkinson’s Disease,
microexon discovery, and splice site identification, and providing guidance
through the full workflow from problem formulation to ML result interpretation.
Supporting them in exploring GEO datasets, performing preprocessing and normalization,
and applying machine learning methods to gain biological insight.
Responsibilities & Skills Applied:
- GEO dataset processing, sample grouping, metadata-based filtering
- Methylation data handling using methylprep and downstream statistical analysis
- EDA, visualization, and ML model building with reproducible workflow design
- End-to-end guidance: data → preprocessing → ML → final results & documentation
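The sample grouping and metadata-based filtering mentioned above can be sketched in a few lines. This is only an illustrative example: the sample records, accession IDs, and field names (`gsm`, `tissue`, `status`) are hypothetical stand-ins for whatever metadata a given GEO series provides, and `group_samples` is not part of any library.

```python
from collections import defaultdict

# Hypothetical sample metadata, as it might look after parsing a GEO series.
samples = [
    {"gsm": "GSM0001", "tissue": "blood", "status": "case"},
    {"gsm": "GSM0002", "tissue": "blood", "status": "control"},
    {"gsm": "GSM0003", "tissue": "brain", "status": "case"},
    {"gsm": "GSM0004", "tissue": "blood", "status": "case"},
]


def group_samples(records, keep, group_key):
    """Keep records matching every key/value in `keep`, then group IDs by `group_key`."""
    groups = defaultdict(list)
    for record in records:
        if all(record.get(key) == value for key, value in keep.items()):
            groups[record[group_key]].append(record["gsm"])
    return dict(groups)


# Restrict to blood samples and split them into case/control groups.
blood_groups = group_samples(samples, keep={"tissue": "blood"}, group_key="status")
print(blood_groups)
```

Keeping the filter criteria in a single dictionary makes the selection explicit and easy to record alongside the results, which helps with reproducibility.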
🤖 Deep Learning for Genomic Sequence Modeling
Exploring BiLSTM, CNN, and Transformer-based architectures (including genomic language models such as DNABERT, Nucleotide Transformer, and BigRNA)
to learn biological sequence patterns beyond manually extracted features. Working on embedding strategies,
attention-based learning, and representation learning for improved biological relevance.
Frameworks: PyTorch, Hugging Face Transformers.
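One common way to feed nucleotide sequences to the embedding layers of such models is k-mer tokenization. The sketch below is a simplified illustration, not the tokenizer of any specific model: the helper names (`kmer_tokens`, `build_vocab`, `encode`), the special tokens, and the default `k=3` are assumptions chosen for the example. The `stride` parameter switches between overlapping (`stride=1`) and non-overlapping (`stride=k`) k-mers.

```python
from itertools import product


def kmer_tokens(seq, k=3, stride=1):
    """Split a sequence into k-mers; stride=1 overlaps, stride=k does not."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


def build_vocab(k=3, alphabet="ACGT"):
    """Map every possible k-mer over the alphabet to an integer ID."""
    vocab = {"[PAD]": 0, "[UNK]": 1}
    for letters in product(alphabet, repeat=k):
        vocab["".join(letters)] = len(vocab)
    return vocab


def encode(seq, vocab, k=3, stride=1):
    """Convert a sequence to token IDs; k-mers with unknown bases map to [UNK]."""
    unk = vocab["[UNK]"]
    return [vocab.get(token, unk) for token in kmer_tokens(seq, k, stride)]


vocab = build_vocab(k=3)
tokens = kmer_tokens("ACGTAC", k=3)
ids = encode("ACGTNC", vocab, k=3)  # 'N' is not in the alphabet

print(tokens)
print(ids)
```

The resulting integer IDs can be passed to an embedding layer (for example `torch.nn.Embedding`) in a BiLSTM, CNN, or Transformer model. Pretrained genomic language models ship their own tokenizers, which should be used instead when fine-tuning them.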