In silico Identification
and Assessment of Potential Chloroplast DNA Barcodes for Discriminating Brassicaceous
Species using Machine Learning Algorithms
Singh Bhupinder Pal, Kumar Ajay, Singh Harpreet and Nagpal Avinash Kaur
Res. J. Biotech.; Vol. 17(1); 64-96;
doi: https://doi.org/10.25303/1701rjbt6496; (2022)
Abstract
Complete chloroplast genome sequences of 89 Brassicaceous species (42 genera) were
used for identification and assessment of potential DNA barcodes at genus and species
levels. Sliding window analysis was performed on the aligned sequences to identify hyper-variable
regions based on nucleotide diversity (π). Out of 23 identified hyper-variable regions,
three coding regions (ycf1, ndhF and ndhA) and three combinations of coding and non-coding
regions (‘ndhH-rps15, rps15, rps15-ycf1, ycf1’; ‘ccsA/ycf5, ccsA/ycf5-ndhD,
ndhD’; and ‘ndhE-ndhG, ndhG, ndhG-ndhI, ndhI’) were selected for sequence enrichment
and assessed using six supervised machine learning algorithms (J48, JRip, SMO,
Naive Bayes, Random Forest and KNN) in WEKA, along with a distance-based method using
the ‘nearNeighbour’ function in SPIDER.
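
The sliding window step can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names, the window and step parameters, and the rule of skipping gap or N positions pairwise are all assumptions made for illustration. Here π for a window is taken as the mean proportion of differing sites over all pairs of aligned sequences.

```python
from itertools import combinations

def pairwise_diff(a, b):
    """Proportion of differing sites between two aligned sequences.
    Sites where either sequence has a gap or N are skipped (an assumption)."""
    compared = diffs = 0
    for x, y in zip(a, b):
        if x in "-N" or y in "-N":
            continue
        compared += 1
        diffs += x != y
    return diffs / compared if compared else 0.0

def nucleotide_diversity(seqs):
    """Mean pairwise proportion of differences (pi) over all sequence pairs."""
    pairs = list(combinations(seqs, 2))
    return sum(pairwise_diff(a, b) for a, b in pairs) / len(pairs)

def sliding_pi(seqs, window, step):
    """Return (window_start, pi) for each window along the alignment."""
    length = len(seqs[0])
    results = []
    for start in range(0, length - window + 1, step):
        chunk = [s[start:start + window] for s in seqs]
        results.append((start, nucleotide_diversity(chunk)))
    return results

# Toy alignment (hypothetical data, not from the study).
toy_alignment = ["AAAAAAAA", "AAAAAAAT", "AAAAAATT"]
toy_profile = sliding_pi(toy_alignment, window=4, step=4)
```

Windows whose π stands out from the rest of the profile would be the candidates for hyper-variable regions; the threshold used to call a region hyper-variable is not stated here and is left open in the sketch.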
It was observed that ycf1 was the most efficient region for discriminating Brassicaceous
species, with an average identification rate of 76% and a maximum identification rate
of 86% at the species level. Three other regions (ndhA, ‘ndhH-rps15, rps15, rps15-ycf1, ycf1’
and ‘ccsA/ycf5, ccsA/ycf5-ndhD, ndhD’) were found to be more efficient than the well-established
markers matK and rbcL and hence can be used as potential DNA barcodes for the family
Brassicaceae. The supervised machine learning algorithms SMO, Random Forest and
KNN, along with the distance-based method SPIDER (NN), were shown to be more efficient
and stable than JRip, J48 and Naive Bayes.
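
As a rough illustration of how a distance-based identification rate can be computed, the sketch below treats each sequence in turn as a query, finds its closest other sequence, and counts the query as correctly identified when the two share a species label. This is a simplified stand-in written in Python, not the SPIDER implementation: the uncorrected p-distance, the tie-breaking rule (first minimum wins) and all names are assumptions.

```python
def p_distance(a, b):
    """Uncorrected proportion of differing sites, skipping gap positions."""
    compared = diffs = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            continue
        compared += 1
        diffs += x != y
    return diffs / compared if compared else 1.0

def nearest_neighbour_rate(seqs, species):
    """Fraction of sequences whose nearest other sequence has the same label."""
    correct = 0
    for i, query in enumerate(seqs):
        others = [j for j in range(len(seqs)) if j != i]
        # Ties are resolved by taking the first closest sequence (an assumption).
        best = min(others, key=lambda j: p_distance(query, seqs[j]))
        correct += species[best] == species[i]
    return correct / len(seqs)

# Toy data (hypothetical): two species with two sequences each, one singleton.
toy_seqs = ["AAAA", "AAAT", "CCCC", "CCCG", "AACC"]
toy_species = ["sp_a", "sp_a", "sp_b", "sp_b", "sp_c"]
toy_rate = nearest_neighbour_rate(toy_seqs, toy_species)
```

In this toy set the singleton species can never be matched to a conspecific, so it always counts as a failure. This is one reason the number of sequences available per species affects identification rates under this kind of criterion.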