Research Journal of Biotechnology

Indexed in Web of Science, SCOPUS, BioTechnology Citation Index®, Chemical Abstracts,
Biological Abstracts, ESCI, UGC, NAAS, Indian Citation Index etc.









In silico Identification and Assessment of Potential Chloroplast DNA Barcodes for Discriminating Brassicaceous Species using Machine Learning Algorithms

Singh Bhupinder Pal, Kumar Ajay, Singh Harpreet and Nagpal Avinash Kaur

Res. J. Biotech.; Vol. 17(1); 64-96; doi: https://doi.org/10.25303/1701rjbt6496; (2022)

Abstract
Complete chloroplast genome sequences of 89 Brassicaceous species (42 genera) were used to identify and assess potential DNA barcodes at the genus and species levels. Sliding window analysis was performed on the aligned sequences to identify hyper-variable regions based on nucleotide diversity (π). Out of 23 identified hyper-variable regions, three coding regions (ycf1, ndhF and ndhA) and three combinations of coding and non-coding regions (‘ndhH-rps15, rps15, rps15-ycf1, ycf1’; ‘ccsA/ycf5, ccsA/ycf5-ndhD, ndhD’; and ‘ndhE-ndhG, ndhG, ndhG-ndhI, ndhI’) were selected for sequence enrichment. These regions were assessed using six supervised machine learning algorithms (J48, JRip, SMO, Naive Bayes, Random Forest and KNN) in WEKA, along with a distance-based method using the ‘nearneighbour’ function in SPIDER.
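
The sliding window scan for nucleotide diversity can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names, the window and step sizes, and the toy alignment are hypothetical, and gaps or ambiguous bases are treated as ordinary characters for simplicity. Here π is computed as the mean number of pairwise differences per site within each window.

```python
from itertools import combinations


def nucleotide_diversity(seqs):
    """Mean pairwise differences per site (pi) for equal-length aligned sequences.

    Simplifying assumption: gaps and ambiguity codes are compared like any
    other character; dedicated tools usually handle them separately.
    """
    n = len(seqs)
    if n < 2 or len(seqs[0]) == 0:
        return 0.0
    length = len(seqs[0])
    total_diffs = 0
    n_pairs = 0
    for a, b in combinations(seqs, 2):
        total_diffs += sum(1 for x, y in zip(a, b) if x != y)
        n_pairs += 1
    return total_diffs / (n_pairs * length)


def sliding_pi(seqs, window, step):
    """Return (window_start, pi) for each full window along the alignment."""
    length = len(seqs[0])
    results = []
    for start in range(0, length - window + 1, step):
        chunk = [s[start:start + window] for s in seqs]
        results.append((start, nucleotide_diversity(chunk)))
    return results


# Toy alignment (hypothetical data, three sequences of four sites).
toy_alignment = ["AAAA", "AAAT", "AATT"]
for start, pi in sliding_pi(toy_alignment, window=2, step=2):
    print(start, round(pi, 4))
```

Windows with the highest π values would be the candidates for hyper-variable regions; the threshold used to call a region hyper-variable is not stated in the abstract and is left out here.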

The ycf1 region was the most efficient for discriminating Brassicaceous species, with an average identification rate of 76% and a maximum identification rate of 86% at the species level. Three other regions (ndhA, ‘ndhH-rps15, rps15, rps15-ycf1, ycf1’ and ‘ccsA/ycf5, ccsA/ycf5-ndhD, ndhD’) were more efficient than the well-established markers matK and rbcL and can therefore be used as potential DNA barcodes for the family Brassicaceae. The supervised machine learning algorithms SMO, Random Forest and KNN, along with the distance-based method SPIDER (NN), were more efficient and stable than JRip, J48 and Naive Bayes.
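
As a rough illustration of how an identification rate can be obtained with a distance-based nearest-neighbour test, the Python sketch below takes each sequence in turn as a query and checks whether its closest other sequence carries the same species label. This is an assumption-laden sketch and not the SPIDER implementation: the uncorrected p-distance, the first-match tie-breaking, the function names and the toy data are all hypothetical choices made for the example.

```python
def p_distance(a, b):
    """Proportion of differing sites between two equal-length aligned sequences."""
    return sum(1 for x, y in zip(a, b) if x != y) / len(a)


def nn_identification_rate(seqs, labels):
    """Fraction of sequences whose nearest other sequence has the same label.

    Assumption: ties are broken by taking the first closest sequence.
    """
    correct = 0
    for i, query in enumerate(seqs):
        others = [j for j in range(len(seqs)) if j != i]
        nearest = min(others, key=lambda j: p_distance(query, seqs[j]))
        if labels[nearest] == labels[i]:
            correct += 1
    return correct / len(seqs)


# Toy barcode data (hypothetical): two species, two sequences each.
toy_seqs = ["AAAA", "AAAT", "TTTT", "TTTA"]
toy_labels = ["sp1", "sp1", "sp2", "sp2"]
print(nn_identification_rate(toy_seqs, toy_labels))
```

Under this scheme a species represented by a single sequence can never be identified correctly, which is one reason identification rates depend on how many sequences per species are available.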