Databases & Softwares
Genomic, Proteomic, Transcriptomic, Metabolomic Softwares and Databases
Scooped by Biswapriya Biswavas Misra onto Databases & Softwares

CAT 1.0 - Composition Analysis Toolkit

CAT (Composition Analysis Toolkit) is a software package that includes a novel measure of codon usage bias, Codon Deviation Coefficient (CDC). ... The Zhang Lab — Computational Biology and Bioinformatics ...

Interface-Resolved Network of Protein-Protein Interactions

PLOS Computational Biology is an open-access
Biswapriya Biswavas Misra's insight:
Abstract

We define an interface-interaction network (IIN) to capture the specificity and competition between protein-protein interactions (PPI). This new type of network represents interactions between individual interfaces used in functional protein binding and thereby contains the detail necessary to describe the competition and cooperation between any pair of binding partners. Here we establish a general framework for the construction of IINs that merges computational structure-based interface assignment with careful curation of available literature. To complement limited structural data, the inclusion of biochemical data is critical for achieving the accuracy and completeness necessary to analyze the specificity and competition between the protein interactions. Firstly, this procedure provides a means to clarify the information content of existing data on purported protein interactions and to remove indirect and spurious interactions. Secondly, the IIN we have constructed here for proteins involved in clathrin-mediated endocytosis (CME) exhibits distinctive topological properties. In contrast to PPI networks with their global and relatively dense connectivity, the fragmentation of the IIN into distinctive network modules suggests that different functional pressures act on the evolution of its topology. Large modules in the IIN are formed by interfaces sharing specificity for certain domain types, such as SH3 domains distributed across different proteins. The shared and distinct specificity of an interface is necessary for effective negative and positive design of highly selective binding targets. Lastly, the organization of detailed structural data in a network format allows one to identify pathways of specific binding interactions and thereby predict effects of mutations at specific surfaces on a protein and of specific binding inhibitors, as we explore in several examples. 
Overall, the endocytosis IIN is remarkably complex and rich in features masked in the coarser PPI, and collects relevant detail of protein association in a readily interpretable format.
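The contrast between a protein-level network and an interface-level network can be illustrated with a small sketch. The protein names, interface labels, and edges below are invented toy data, not taken from the paper; the code only shows how an interface-resolved edge list can fragment into modules and expose competition for a shared binding site, while the collapsed PPI view stays connected.

```python
from collections import defaultdict

# Hypothetical interface-level interactions: nodes are (protein, interface) pairs.
IIN_EDGES = [
    (("A", "SH3"), ("B", "PxxP")),
    (("C", "SH3"), ("B", "PxxP")),   # A and C compete for the same site on B
    (("A", "EH"), ("D", "NPF")),
]

def components(edges):
    """Return the connected components of an undirected graph as a list of sets."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, comps = set(), []
    for start in adj:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            node = stack.pop()
            if node in comp:
                continue
            comp.add(node)
            stack.extend(adj[node] - comp)
        seen |= comp
        comps.append(comp)
    return comps

def collapse_to_ppi(iin_edges):
    """Drop interface labels, keeping one edge per interacting protein pair."""
    return {frozenset((u[0], v[0])) for u, v in iin_edges}

def competitors(iin_edges):
    """Map each interface to its partners when more than one partner binds it."""
    partners = defaultdict(set)
    for u, v in iin_edges:
        partners[u].add(v)
        partners[v].add(u)
    return {site: p for site, p in partners.items() if len(p) > 1}

ppi_edges = [tuple(e) for e in collapse_to_ppi(IIN_EDGES)]
```

In this toy case the PPI view is a single connected component, while the interface view splits into two modules and flags the PxxP site on B as contested.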


The MORPH Algorithm: Ranking Candidate Genes for Membership in Arabidopsis and Tomato Pathways

Biswapriya Biswavas Misra's insight:
Abstract

Closing gaps in our current knowledge about biological pathways is a fundamental challenge. The development of novel computational methods along with high-throughput experimental data carries the promise to help in the challenge. We present an algorithm called MORPH (for module-guided ranking of candidate pathway genes) for revealing unknown genes in biological pathways. The method receives as input a set of known genes from the target pathway, a collection of expression profiles, and interaction and metabolic networks. Using machine learning techniques, MORPH selects the best combination of data and analysis method and outputs a ranking of candidate genes predicted to belong to the target pathway. We tested MORPH on 230 known pathways in Arabidopsis thaliana and 93 known pathways in tomato (Solanum lycopersicum) and obtained high-quality cross-validation results. In the photosynthesis light reactions, homogalacturonan biosynthesis, and chlorophyll biosynthetic pathways of Arabidopsis, genes ranked highly by MORPH were recently verified to be associated with these pathways. MORPH candidates ranked for the carotenoid pathway from Arabidopsis and tomato are derived from pathways that compete for common precursors or from pathways that a


The Potential of Text Mining in Data Integration and Network Biology for Plant Research: A Case Study on Arabidopsis

Biswapriya Biswavas Misra's insight:
Abstract

Despite the availability of various data repositories for plant research, a wealth of information currently remains hidden within the biomolecular literature. Text mining provides the necessary means to retrieve these data through automated processing of texts. However, only recently has advanced text mining methodology been implemented with sufficient computational power to process texts at a large scale. In this study, we assess the potential of large-scale text mining for plant biology research in general and for network biology in particular using a state-of-the-art text mining system applied to all PubMed abstracts and PubMed Central full texts. We present extensive evaluation of the textual data for Arabidopsis thaliana, assessing the overall accuracy of this new resource for usage in plant network analyses. Furthermore, we combine text mining information with both protein–protein and regulatory interactions from experimental databases. Clusters of tightly connected genes are delineated from the resulting network, illustrating how such an integrative approach is essential to grasp the current knowledge available for Arabidopsis and to uncover gene information through guilt by association. All large-scale data sets, as well as the manually curated textual data, are made publicly available, hereby stimulating the application of text mining data in future plant biology studies.


PLOS ONE: Ancestral Genome Inference Using a Genetic Algorithm Approach

PLOS ONE: an inclusive, peer-reviewed, open-access resource from the PUBLIC LIBRARY OF SCIENCE. Reports of well-performed scientific studies from all disciplines freely available to the whole world.
Biswapriya Biswavas Misra's insight:
Abstract

Recent advances in technology have made it routine to obtain and compare gene orders within genomes. Rearrangements of gene orders by operations such as reversal and transposition are rare events that enable researchers to reconstruct deep evolutionary histories. An important application of genome rearrangement analysis is to infer gene orders of ancestral genomes, which is valuable for identifying patterns of evolution and for modeling the evolutionary processes. Among various available methods, parsimony-based methods (including GRAPPA and MGR) are the most widely used. Since the core algorithms of these methods are solvers for the so-called median problem, providing an efficient and accurate median solver has attracted a lot of attention in this field. The “double-cut-and-join” (DCJ) model uses the single DCJ operation to account for all genome rearrangement events. Because it is mathematically much simpler than handling events directly, parsimony methods using DCJ median solvers have better speed and accuracy. However, the DCJ median problem is NP-hard and, although several exact algorithms are available, they all have great difficulties when the given genomes are distant. In this paper, we present a new algorithm that combines a genetic algorithm (GA) with genomic sorting to produce a new method which can solve the DCJ median problem in limited time and space, especially in large and distant datasets. Our experimental results show that this new GA-based method can find optimal or near-optimal results for problems ranging from easy to very difficult. Compared to existing parsimony methods, which may severely underestimate the true number of evolutionary events, the sorting-based approach can infer ancestral genomes which are much closer to their true ancestors. The code is available at http://phylo.cse.sc.edu.
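For readers unfamiliar with the DCJ model, the pairwise DCJ distance that a median solver builds on can be sketched briefly. This is not the paper's genetic algorithm or its median solver; it is a minimal illustration that assumes two circular, single-chromosome genomes with identical gene content, for which the distance equals the number of genes minus the number of cycles in the adjacency graph. The function names are illustrative.

```python
def adjacencies(genome):
    """Map each gene extremity to its neighbour in a circular signed gene order."""
    def ends(gene):
        # (left end, right end) as read along the chromosome; "t" = tail, "h" = head
        g = abs(gene)
        return ((g, "t"), (g, "h")) if gene > 0 else ((g, "h"), (g, "t"))
    adj = {}
    n = len(genome)
    for i in range(n):
        right = ends(genome[i])[1]
        left = ends(genome[(i + 1) % n])[0]
        adj[right] = left
        adj[left] = right
    return adj

def dcj_distance(a, b):
    """DCJ distance between two circular genomes with the same gene content."""
    if sorted(map(abs, a)) != sorted(map(abs, b)):
        raise ValueError("genomes must contain the same genes")
    adj_a, adj_b = adjacencies(a), adjacencies(b)
    seen, cycles = set(), 0
    for start in adj_a:
        if start in seen:
            continue
        cycles += 1
        x = start
        # Alternate between adjacencies of genome a and genome b until the cycle closes.
        while x not in seen:
            y = adj_a[x]
            seen.update((x, y))
            x = adj_b[y]
    return len(a) - cycles
```

For example, `dcj_distance([1, 2, 3], [1, -2, 3])` counts a single reversal as one DCJ operation, while moving a gene (a transposition) costs two.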

 


Phylogeny of Bacterial and Archaeal Genomes Using Conserved Genes: Supertrees and Supermatrices

PLOS ONE: an inclusive, peer-reviewed, open-access resource from the PUBLIC LIBRARY OF SCIENCE. Reports of well-performed scientific studies from all disciplines freely available to the whole world.
Biswapriya Biswavas Misra's insight:
Abstract

Over 3000 microbial (bacterial and archaeal) genomes have been made publicly available to date, providing an unprecedented opportunity to examine evolutionary genomic trends and offering valuable reference data for a variety of other studies such as metagenomics. The utility of these genome sequences is greatly enhanced when we have an understanding of how they are phylogenetically related to each other. Therefore, we here describe our efforts to reconstruct the phylogeny of all available bacterial and archaeal genomes. We identified 24 single-copy, ubiquitous genes suitable for this phylogenetic analysis. We used two approaches to combine the data for the 24 genes. First, we concatenated alignments of all genes into a single alignment from which a Maximum Likelihood (ML) tree was inferred using RAxML. Second, we used a relatively new approach to combining gene data, Bayesian Concordance Analysis (BCA), as implemented in the BUCKy software, in which the results of 24 single-gene phylogenetic analyses are used to generate a “primary concordance” tree. A comparison of the concatenated ML tree and the primary concordance (BUCKy) tree reveals that the two approaches give similar results, relative to a phylogenetic tree inferred from the 16S rRNA gene. After comparing the results and the methods used, we conclude that the current best approach for generating a single phylogenetic tree, suitable for use as a reference phylogeny for comparative analyses, is to perform a maximum likelihood analysis of a concatenated alignment of conserved, single-copy genes.
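The concatenation step (the "supermatrix" of the title) can be sketched as follows. This is an illustrative assumption about how per-gene alignments might be joined, not the authors' pipeline: each gene alignment is a dictionary from taxon name to aligned sequence, and taxa missing from a gene are padded with gap characters. The function and variable names are hypothetical.

```python
def concatenate_alignments(gene_alignments, missing="-"):
    """Join per-gene alignments (dicts of taxon -> aligned sequence) column-wise.

    Returns the concatenated alignment and the column range of each gene
    (1-based, inclusive), which tree-inference tools often take as partitions.
    """
    taxa = sorted({t for aln in gene_alignments for t in aln})
    supermatrix = {t: [] for t in taxa}
    partitions = []
    offset = 0
    for aln in gene_alignments:
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError("sequences within one gene alignment must have equal length")
        width = lengths.pop()
        for t in taxa:
            # Pad with gap characters when a taxon lacks this gene.
            supermatrix[t].append(aln.get(t, missing * width))
        partitions.append((offset + 1, offset + width))
        offset += width
    return {t: "".join(parts) for t, parts in supermatrix.items()}, partitions
```

The resulting single alignment is what would then be handed to an ML program; the partition list makes it possible to fit a separate model to each gene.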


Use of a global metabolic network to curate organismal metabolic networks : Scientific Reports

Biswapriya Biswavas Misra's insight:

The difficulty in annotating the vast amounts of biological information poses one of the greatest current challenges in biological research. The number of genomic, proteomic, and metabolomic datasets has increased dramatically over the last two decades, far outstripping the pace of curation efforts. Here, we tackle the challenge of curating metabolic network reconstructions. We predict organismal metabolic networks using sequence homology and a global metabolic network constructed from all available organismal networks. While sequence homology has been a standard for annotating metabolic networks, it has been faulted for its lack of predictive power. We show, however, that when homology is used with a global metabolic network, one is able to predict organismal metabolic networks that have enhanced network connectivity. Additionally, we compare the annotation behavior of current database curation efforts with our predictions and find that curation efforts are biased towards adding (rather than removing) reactions to organismal networks.


Plantagora: Modeling Whole Genome Sequencing and Assembly of Plant Genomes

PLOS ONE: an inclusive, peer-reviewed, open-access resource from the PUBLIC LIBRARY OF SCIENCE. Reports of well-performed scientific studies from all disciplines freely available to the whole world.
Biswapriya Biswavas Misra's insight:
Abstract

Background

Genomics studies are being revolutionized by the next generation sequencing technologies, which have made whole genome sequencing much more accessible to the average researcher. Whole genome sequencing with the new technologies is a developing art that, despite the large volumes of data that can be produced, may still fail to provide a clear and thorough map of a genome. The Plantagora project was conceived to address specifically the gap between having the technical tools for genome sequencing and knowing precisely the best way to use them.

Methodology/Principal Findings

For Plantagora, a platform was created for generating simulated reads from several different plant genomes of different sizes. The resulting read files mimicked either 454 or Illumina reads, with varying paired end spacing. Thousands of datasets of reads were created, most derived from our primary model genome, rice chromosome one. All reads were assembled with different software assemblers, including Newbler, Abyss, and SOAPdenovo, and the resulting assemblies were evaluated by an extensive battery of metrics chosen for these studies. The metrics included both statistics of the assembly sequences and fidelity-related measures derived by alignment of the assemblies to the original genome source for the reads. The results were presented in a website, which includes a data graphing tool, all created to help the user compare rapidly the feasibility and effectiveness of different sequencing and assembly strategies prior to testing an approach in the lab. Some of our own conclusions regarding the different strategies were also recorded on the website.
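A simulate-then-evaluate workflow of this kind can be sketched in a few lines. The abstract does not name the individual metrics or the simulator used, so both functions below are assumptions for illustration: an error-free paired-read sampler with a fixed insert size, and N50 as one common example of a statistic computed on assembly sequences.

```python
import random

def simulate_paired_reads(genome, n_pairs, read_len, insert_size, seed=0):
    """Sample error-free read pairs; the second read is reverse-complemented.

    Assumes read_len <= insert_size <= len(genome) and an ACGT-only genome.
    """
    rng = random.Random(seed)
    comp = str.maketrans("ACGT", "TGCA")
    pairs = []
    for _ in range(n_pairs):
        start = rng.randrange(len(genome) - insert_size + 1)
        fragment = genome[start:start + insert_size]
        fwd = fragment[:read_len]
        rev = fragment[-read_len:].translate(comp)[::-1]
        pairs.append((fwd, rev))
    return pairs

def n50(contig_lengths):
    """Length L such that contigs of length >= L hold at least half of all bases."""
    if not contig_lengths:
        raise ValueError("no contigs")
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
```

Fidelity-related measures, as the abstract notes, additionally require aligning the assembly back to the source genome, which this sketch does not attempt.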

Conclusions/Significance

Plantagora provides a substantial body of information for comparing different approaches to sequencing a plant genome, and some conclusions regarding some of the specific approaches. Plantagora also provides a platform of metrics and tools for studying the process of sequencing and assembly further.


BMC Bioinformatics | Abstract | An automated graphics tool for comparative genomics: the Coulson plot generator

Comparative analysis is an essential component of biology. When applied to genomics, for example, analysis may require comparisons between the predicted presence and absence of genes in a group of genomes under consideration.
Rescooped by Biswapriya Biswavas Misra from Genomic Parasites: Coevolution between host and parasites

TIRfinder: A Web Tool for Mining Class II Transposons Carrying Terminal Inverted Repeats

Gabriel Wallau's curator insight, April 26, 11:51 AM

Transposable elements (TEs) can be found in virtually all known genomes; plant genomes are exceptionally rich in this kind of dispersed repetitive sequence. Current knowledge of TE proliferation dynamics places them among the main forces of molecular evolution. Efficient tools for analyzing TE distribution in genomes are therefore needed, both for comparative genomics studies and for studying TE dynamics within a genome. This was the main motivation for building TIRfinder, an efficient tool for mining class II TEs carrying terminal inverted repeats. TIRfinder takes as input a genomic sequence and information on the structural properties of a TE family, and identifies all TEs in the genome showing the desired structural characteristics. The efficiency and small memory requirements of our approach stem from the use of suffix trees to identify all DNA segments surrounded by user-specified terminal inverted repeats (TIRs) and target site duplications (TSDs), which together constitute a mask. In addition, the flexibility of the TIR/TSD mask makes it possible to use the tool for de novo detection. The main advantages of TIRfinder are its speed, accuracy and convenience of use for biologists. A web-based interface is freely available at http://bioputer.mimuw.edu.pl/tirfindertool/.
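The idea of a TIR/TSD mask can be made concrete with a naive scan. This sketch deliberately does not use suffix trees, so it says nothing about TIRfinder's actual implementation or performance; it assumes an exact TIR sequence, an exact-match TSD of fixed length, and illustrative function names.

```python
def reverse_complement(seq):
    """Reverse complement of an ACGT-only DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def find_all(seq, motif):
    """Yield every (possibly overlapping) start index of motif in seq."""
    i = seq.find(motif)
    while i != -1:
        yield i
        i = seq.find(motif, i + 1)

def find_tir_elements(seq, tir, tsd_len, min_len, max_len):
    """Return (start, end) spans that begin with `tir`, end with its reverse
    complement, and are flanked on both sides by identical `tsd_len`-base copies.

    `end` is exclusive; choose min_len >= 2 * len(tir) so the two repeats
    cannot overlap.
    """
    tail = reverse_complement(tir)
    tail_starts = list(find_all(seq, tail))
    hits = []
    for start in find_all(seq, tir):
        for t in tail_starts:
            end = t + len(tail)
            if not (min_len <= end - start <= max_len):
                continue
            left = seq[start - tsd_len:start] if start >= tsd_len else ""
            right = seq[end:end + tsd_len]
            if len(left) == tsd_len and left == right:
                hits.append((start, end))
    return hits
```

The quadratic pairing of head and tail matches is acceptable for a toy sequence but is exactly the cost that an indexed approach is meant to avoid on whole genomes.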


Development of a Natural Products Database from the Biodiversity of Brazil

Biswapriya Biswavas Misra's insight:

We describe herein the design and development of an innovative tool called the NuBBE database (NuBBEDB), a new Web-based database, which incorporates several classes of secondary metabolites and derivatives from the biodiversity of Brazil. This natural product database incorporates botanical, chemical, pharmacological, and toxicological compound information. The NuBBEDB provides specialized information to the worldwide scientific community and can serve as a useful tool for studies on the multidisciplinary interfaces related to chemistry and biology, including virtual screening, dereplication, metabolomics, and medicinal chemistry. The NuBBEDB site is at http://nubbe.iq.unesp.br/nubbeDB.html.


Integrated database of information from structural genomics experiments

Biswapriya Biswavas Misra's insight:

Abstract: Information from structural genomics experiments at the RIKEN SPring-8 Center, Japan has been compiled and published as an integrated database. The contents of the database are (i) experimental data from nine species of bacteria that cover a large variety of protein molecules in terms of both evolution and properties (http://database.riken.jp/db/bacpedia), (ii) experimental data from mutant proteins that were designed systematically to study the influence of mutations on the diffraction quality of protein crystals (http://database.riken.jp/db/bacpedia) and (iii) experimental data from heavy-atom-labelled proteins from the heavy-atom database HATODAS (http://database.riken.jp/db/hatodas). The database integration adopts the semantic web, which is suitable for data reuse and automatic processing, thereby allowing batch downloads of full data and data reconstruction to produce new databases. In addition, to enhance the use of data (i) and (ii) by general researchers in biosciences, a comprehensible user interface, Bacpedia (http://bacpedia.harima.riken.jp), has been developed.


MicrobeDB: a locally maintainable database of microbial genomic sequences

Biswapriya Biswavas Misra's insight:

Summary: Analysis of microbial genomes often requires the general organization and comparison of tens to thousands of genomes, both from public repositories and unpublished sources. MicrobeDB provides a foundation for such projects by automating the download of published, completed bacterial and archaeal genomes from key sources, parsing annotations of all genomes (both public and private) into a local database, and allowing interaction with the database through an easy-to-use programming interface. MicrobeDB creates a simple-to-use, easy-to-maintain, centralized local resource for various large-scale comparative genomic analyses and a back-end for future microbial application design.


WEP: a high-performance analysis pipeline for whole-exome data

Biswapriya Biswavas Misra's insight:
Abstract

Background

The advent of massively parallel sequencing technologies (Next Generation Sequencing, NGS) profoundly modified the landscape of human genetics.

In particular, Whole Exome Sequencing (WES) is the NGS branch that focuses on the exonic regions of eukaryotic genomes; exomes are ideal for helping us understand high-penetrance allelic variation and its relationship to phenotype. A complete WES analysis involves several steps which need to be suitably designed and arranged into an efficient pipeline.

Managing an NGS analysis pipeline and the large amount of data it produces requires non-trivial IT skills and computational power.

Results

Our web resource WEP (Whole-Exome sequencing Pipeline web tool) performs a complete WES pipeline and provides easy access, through a web interface, to intermediate and final results. The WEP pipeline is composed of several steps:

1) verification of input integrity and quality checks, read trimming and filtering;
2) gapped alignment;
3) BAM conversion, sorting and indexing;
4) duplicates removal;
5) alignment optimization around insertion/deletion (indel) positions;
6) recalibration of quality scores;
7) single nucleotide and deletion/insertion polymorphism (SNP and DIP) variant calling;
8) variant annotation;
9) result storage into custom databases to allow cross-linking and intersections, statistics and much more.

In order to overcome the challenge of managing large amounts of data and to maximize the biological information extracted from them, our tool restricts the number of final results by filtering data with customizable thresholds, facilitating the identification of functionally significant variants. Default threshold values, tuned according to the most common literature published in recent years, are also provided when the analysis completes.
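The threshold-based filtering of final results described above might look like the following sketch. Everything here is hypothetical: the abstract does not say which fields WEP filters on or what its default values are, so the `Variant` fields, the threshold names, and the default numbers are placeholders chosen only to show the pattern of defaults that a user can override.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Variant:
    chrom: str
    pos: int
    ref: str
    alt: str
    quality: float   # variant quality score
    depth: int       # read depth at the site

# Placeholder defaults; the abstract does not state WEP's actual values.
DEFAULT_THRESHOLDS = {"min_quality": 30.0, "min_depth": 10}

def filter_variants(variants, **overrides):
    """Keep variants meeting every threshold; keyword arguments override defaults."""
    unknown = set(overrides) - set(DEFAULT_THRESHOLDS)
    if unknown:
        raise TypeError(f"unknown thresholds: {sorted(unknown)}")
    t = {**DEFAULT_THRESHOLDS, **overrides}
    return [v for v in variants
            if v.quality >= t["min_quality"] and v.depth >= t["min_depth"]]
```

Calling `filter_variants(calls)` applies the defaults, while `filter_variants(calls, min_depth=5)` relaxes a single threshold and leaves the rest unchanged.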

Conclusions

Through our tool a user can perform the whole analysis without knowing the underlying hardware and software architecture, working with both paired-end and single-end data. The interface provides easy and intuitive access for data submission and a user-friendly web interface for annotated variant visualization.

Through WEP, users without IT expertise can access up-to-date, tested WES algorithms, tuned to maximize the quality of called variants while minimizing artifacts and false positives.

The web tool is available at the following web address: http://www.caspur.it/wep



Optimizing de novo assembly of short-read RNA-seq data for phylogenomics

Biswapriya Biswavas Misra's insight:
Abstract (provisional)

Background

RNA-seq has shown huge potential for phylogenomic inferences in non-model organisms. However, error, incompleteness, and redundant assembled transcripts for each gene in de novo assembly of short reads cause noise in analyses and a large amount of missing data in the aligned matrix. To address these problems, we compare de novo assemblies of paired end 90 bp RNA-seq reads using Oases, Trinity, Trans-ABySS and SOAPdenovo-Trans to transcripts from genome annotation of the model plant Ricinus communis. By doing so we evaluate strategies for optimizing total gene coverage and minimizing assembly chimeras and redundancy.

Results

We found that the frequency and structure of chimeras vary dramatically among different software packages. The differences were largely due to the number of trans-self chimeras that contain repeats in the opposite direction. More than half of the total chimeras in Oases and Trinity were trans-self chimeras. Within each package, we found a trade-off between maximizing reference coverage and minimizing redundancy and chimera rate. In order to reduce redundancy, we investigated three methods: 1) using cap3 and CD-HIT-EST to combine highly similar transcripts, 2) only retaining the transcript with the highest read coverage, or removing the transcript with the lowest read coverage for each subcomponent in Trinity, and 3) filtering Oases single k-mer assemblies by number of transcripts per locus and relative transcript length, and then finding the transcript with the highest read coverage. We then utilized results from blastx against model protein sequences to effectively remove trans chimeras. After optimization, seven assembly strategies among all four packages successfully assembled 42.9–47.1% of reference genes to more than 200 bp, with a chimera rate of 0.92–2.21%, and on average 1.8–3.1 transcripts per reference gene assembled.
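The second redundancy-reduction method, retaining only the transcript with the highest read coverage, reduces to a per-group maximum. The sketch below assumes transcripts arrive as (transcript ID, locus or subcomponent ID, read coverage) tuples; that input layout and the function name are illustrative rather than taken from the study.

```python
def keep_best_per_locus(transcripts):
    """Pick the highest-coverage transcript for each locus.

    transcripts: iterable of (transcript_id, locus_id, read_coverage).
    Returns {locus_id: transcript_id}; on ties the first transcript seen wins.
    """
    best = {}
    for tid, locus, cov in transcripts:
        if locus not in best or cov > best[locus][1]:
            best[locus] = (tid, cov)
    return {locus: tid for locus, (tid, _cov) in best.items()}
```

This trades completeness for simplicity: genuine alternative isoforms at a locus are discarded along with redundant assemblies, which is part of the coverage versus redundancy trade-off the authors describe.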

Conclusions

With rapidly improving sequencing and assembly tools, our study provides a framework to benchmark and optimize performance before choosing tools or parameter combinations for analyzing short-read RNA-seq data. Our study demonstrates that choice of assembly package, k-mer sizes, post-assembly redundancy reduction and chimera cleanup, and strand-specific RNA-seq library preparation and assembly dramatically improve gene coverage by non-redundant and non-chimeric transcripts that are optimized for downstream phylogenomic analyses.


On the edge of web-based multiple sequence alignment services

Biswapriya Biswavas Misra's insight:

There are many web-based multiple sequence alignment services accessible around the world. However, many researchers working on biological sequence analysis still struggle with multiple sequence alignment software that is inefficient, has an unfriendly user interface, and offers limited capability. In this study, we provide a comprehensive survey of regional and continental facilities that provide web-based alignment services. We also analyze and identify much-needed services that are not available through these existing service providers. We then implement a web-based model to address these needs. From that perspective, our web-based multiple sequence alignment server, SeqAna, provides a unique set of services that none of the surveyed facilities offers. For example, SeqAna provides a multiple sequence alignment scoring and ranking service. This service, the only one of its kind, allows SeqAna's users to perform multiple sequence alignment with several alignment tools and rank the results of these alignments in order of quality. With this service, SeqAna's users will be able to identify which alignment tools are more appropriate for their specific set of sequences. In addition, SeqAna's users can customize a small alignment sample as a reference for SeqAna to automatically identify the best tool to align their large set of sequences.


PLOS ONE: ODoSE: A Webserver for Genome-Wide Calculation of Adaptive Divergence in Prokaryotes

Biswapriya Biswavas Misra's insight:
Abstract

Quantifying patterns of adaptive divergence between taxa is a major goal in the comparative and evolutionary study of prokaryote genomes. When applied appropriately, the McDonald-Kreitman (MK) test is a powerful test of selection based on the relative frequency of non-synonymous and synonymous substitutions between species compared to non-synonymous and synonymous polymorphisms within species. The webserver ODoSE (Ortholog Direction of Selection Engine) allows the calculation of a novel extension of the MK test, the Direction of Selection (DoS) statistic, as well as the calculation of a weighted-average Neutrality Index (NI) statistic for the entire core genome, allowing for systematic analysis of the evolutionary forces shaping core genome divergence in prokaryotes. ODoSE is hosted in a Galaxy environment, which makes it easy to use and amenable to customization and is freely available at www.odose.nl.
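The per-gene statistics behind the MK test can be written down compactly. The sketch below uses the standard definitions, with Dn and Ds the non-synonymous and synonymous substitutions between species and Pn and Ps the corresponding polymorphisms within species. It does not reproduce ODoSE's weighted-average NI across the core genome, whose exact weighting is not given in the abstract, and the function names are illustrative.

```python
def direction_of_selection(dn, ds, pn, ps):
    """DoS = Dn/(Dn+Ds) - Pn/(Pn+Ps).

    Positive values suggest adaptive divergence; negative values suggest
    segregating slightly deleterious variants. Returns None when undefined.
    """
    if dn + ds == 0 or pn + ps == 0:
        return None
    return dn / (dn + ds) - pn / (pn + ps)

def neutrality_index(dn, ds, pn, ps):
    """NI = (Pn/Ps) / (Dn/Ds); returns None when any ratio is undefined."""
    if ps == 0 or dn == 0 or ds == 0:
        return None
    return (pn * ds) / (ps * dn)
```

One practical difference is visible in the guards: NI is undefined whenever a single count in a denominator is zero, which is common for short genes, whereas DoS only requires that some substitutions and some polymorphisms were observed.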


FOGSAA: Fast Optimal Global Sequence Alignment Algorithm

Biswapriya Biswavas Misra's insight:

In this article we propose a Fast Optimal Global Sequence Alignment Algorithm, FOGSAA, which aligns a pair of nucleotide/protein sequences faster than any optimal global alignment method including the widely used Needleman-Wunsch (NW) algorithm. FOGSAA is applicable to all types of sequences, with any scoring scheme, and with or without affine gap penalty. Compared to NW, FOGSAA achieves a time gain of 70–90% for highly similar nucleotide sequences (> 80% similarity), and 54–70% for sequences having 30–80% similarity. For other sequences, it terminates with an approximate score. For protein sequences, the average time gain is between 25% and 40%. Compared to three heuristic global alignment methods, the quality of alignment is improved by about 23–53%. FOGSAA is, in general, suitable for aligning any two sequences defined over a finite alphabet set, where the quality of the global alignment is of supreme importance.
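For context, the Needleman-Wunsch baseline that FOGSAA is compared against can be sketched as a short dynamic program. This is the classic algorithm, not FOGSAA itself; the scoring values are arbitrary defaults, a linear (non-affine) gap penalty is assumed, and only the optimal score is returned rather than the alignment.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-2):
    """Optimal global alignment score with a linear gap penalty.

    Fills the full dynamic-programming table row by row, so the running
    time is proportional to len(a) * len(b) regardless of similarity.
    """
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]
```

Because every cell of the table is computed even for nearly identical sequences, there is room for the kind of time gain on highly similar inputs that the abstract reports.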

 

ContigScape: a Cytoscape plugin facilitating microbial genome gap closing

Biswapriya Biswavas Misra's insight:
Abstract (provisional)

Background

With the emergence of next-generation sequencing, the availability of prokaryotic genome sequences is expanding rapidly. A total of 5,276 genomes have been released since 2008, yet only 1,692 genomes were complete. The final phase of microbial genome sequencing, particularly gap closing, is frequently the rate-limiting step, either because of complex genomic structures that cause sequence bias even with high genomic coverage, or because of repeat sequences that may cause gaps in assembly.

Results

We have developed a Cytoscape plugin to facilitate gap closing for high-throughput sequencing data from microbial genomes. This plugin is capable of interactively displaying the relationships among genomic contigs derived from various sequencing formats. The sequence contigs of plasmids and special repeats (IS elements, ribosomal RNAs, terminal repeats, etc.) can be displayed as well.

Conclusions

Displaying relationships between contigs using graphs in Cytoscape rather than tables provides a more straightforward visual representation. This will facilitate a faster and more precise determination of the linkages among contigs and greatly improve the efficiency of gap closing.


Proteome and Protein Analysis (Results and Problems in Cell Differentiation)


An automated graphics tool for comparative genomics: the Coulson plot generator

Biswapriya Biswavas Misra's insight:
Abstract (provisional)

Background

Comparative analysis is an essential component of biology. When applied to genomics, for example, analysis may require comparisons between the predicted presence and absence of genes in a group of genomes under consideration. Frequently, genes can be grouped into small categories based on functional criteria, for example membership of a multimeric complex, participation in a metabolic or signaling pathway, or shared sequence features and/or paralogy. These patterns of retention and loss are highly informative for the prediction of function, and hence possible biological context, and can provide great insights into the evolutionary history of cellular functions. However, a standard spreadsheet is a poor visual means from which to extract patterns within a dataset.

Results

We devised the Coulson Plot, a new graphical representation that exploits a matrix of pie charts to display comparative genomics data. Each pie is used to describe a complex or process from a separate taxon, and is divided into sectors corresponding to the number of proteins (subunits) in a complex/process. The predicted presence or absence of each protein in a complex is delineated by occupancy of a given sector; this format is visually highly accessible and makes pattern recognition rapid and reliable. A key to the identity of each subunit, plus hierarchical naming of taxa and coloring, are included. A Java-based application, the Coulson plot generator (CPG), automates graphic production, taking a tab- or comma-delimited text file as input and generating an editable portable document format (PDF) or SVG file.

Conclusions

CPG software may be used to rapidly convert spreadsheet data to a graphical matrix pie chart format. The representation essentially retains all of the information from the spreadsheet but presents it in a graphically rich format, making comparisons and identification of patterns significantly clearer. While the Coulson plot format is highly useful in comparative genomics, its original purpose, the software can be used to visualize any dataset where entity occupancy is compared between different classes.

Availability

CPG software is available at SourceForge, http://sourceforge.net/projects/coulson, and at http://dl.dropbox.com/u/6701906/Web/Sites/Labsite/CPG.html
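The data layout behind such a matrix of pies can be sketched as follows: one pie per (taxon, complex) pair, split into equal sectors, one per subunit, each marked as filled or empty. This is only an illustration of the idea; the column names `taxon`, `complex`, `subunit` and `present` are hypothetical and do not reflect the actual CPG input format.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, List, Tuple

# (subunit name, start angle in degrees, end angle in degrees, filled?)
Sector = Tuple[str, float, float, bool]


def coulson_matrix(table_text: str) -> Dict[Tuple[str, str], List[Sector]]:
    """Turn a presence/absence table into per-pie sector lists.

    Assumes a comma-delimited table with a header row containing the
    hypothetical columns taxon, complex, subunit, present ("1" or "0").
    """
    grouped = defaultdict(list)
    for row in csv.DictReader(io.StringIO(table_text)):
        key = (row["taxon"], row["complex"])
        grouped[key].append((row["subunit"], row["present"].strip() == "1"))

    pies = {}
    for key, subunits in grouped.items():
        step = 360.0 / len(subunits)  # equal-sized sector per subunit
        pies[key] = [(name, i * step, (i + 1) * step, present)
                     for i, (name, present) in enumerate(subunits)]
    return pies
```

Each value in the returned dictionary holds everything a drawing layer would need to render one pie: the sector boundaries and whether each sector is occupied.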

TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions

Biswapriya Biswavas Misra's insight:
Abstract (provisional)

TopHat is a popular spliced aligner for RNA-seq experiments. Here, we describe TopHat2, which incorporates many significant enhancements to TopHat. TopHat2 can align reads of various lengths produced by the latest sequencing technologies, while allowing for variable-length indels with respect to the reference genome. In addition to de novo spliced alignment, TopHat2 can align reads across fusion breaks, which occur after genomic translocations. TopHat2 combines the ability to discover novel splice sites with direct mapping to known transcripts, producing sensitive and accurate alignments, even for highly repetitive genomes or in the presence of pseudogenes. TopHat2 is available at http://ccb.jhu.edu/software/tophat.


The Metadata Coverage Index (MCI): A standardized metric for quantifying database metadata richness | Liolios | Standards in Genomic Sciences

Biswapriya Biswavas Misra's insight:
Abstract

 

Variability in the extent of the descriptions of data (‘metadata’) held in public repositories forces users to assess the quality of records individually, which rapidly becomes impractical. The scoring of records on the richness of their description provides a simple, objective proxy measure for quality that enables filtering that supports downstream analysis. Pivotally, such descriptions should spur on improvements. Here, we introduce such a measure, the ‘Metadata Coverage Index’ (MCI): the percentage of available fields actually filled in a record or description. MCI scores can be calculated across a database, for individual records or for their component parts (e.g., fields of interest). There are many potential uses for this simple metric, for example: to filter, rank or search for records; to assess the metadata availability of an ad hoc collection; to determine the frequency with which fields in a particular record type are filled, especially with respect to standards compliance; to assess the utility of specific tools and resources, and of data capture practice more generally; to prioritize records for further curation; to serve as performance metrics of funded projects; or to quantify the value added by curation. Here we demonstrate the utility of MCI scores using metadata from the Genomes Online Database (GOLD), including records compliant with the ‘Minimum Information about a Genome Sequence’ (MIGS) standard developed by the Genomic Standards Consortium. We discuss challenges and address the further application of MCI scores: to show improvements in annotation quality over time, to inform the work of standards bodies and repository providers on the usability and popularity of their products, and to assess and credit the work of curators. Such an index provides a step towards putting metadata capture practices and, in the future, standards compliance into a quantitative and objective framework.
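The definition above (percentage of available fields actually filled) translates directly into a few lines of code. The sketch below is illustrative only: what counts as "filled" (here, anything other than a missing value or a blank string) and the averaging used for a whole database are assumptions, and the function names are hypothetical.

```python
from typing import Any, Iterable, Mapping, Sequence


def is_filled(value: Any) -> bool:
    """Treat None and blank strings as empty; everything else as filled."""
    if value is None:
        return False
    if isinstance(value, str):
        return value.strip() != ""
    return True


def record_mci(record: Mapping[str, Any], fields: Sequence[str]) -> float:
    """Percentage of the available fields that are filled in one record."""
    if not fields:
        raise ValueError("at least one field is required")
    filled = sum(1 for name in fields if is_filled(record.get(name)))
    return 100.0 * filled / len(fields)


def database_mci(records: Iterable[Mapping[str, Any]],
                 fields: Sequence[str]) -> float:
    """Mean record MCI, assuming every record shares the same field list."""
    scores = [record_mci(r, fields) for r in records]
    if not scores:
        raise ValueError("at least one record is required")
    return sum(scores) / len(scores)
```

Passing a subset of field names (for example, only those required by a standard) gives the per-component score mentioned in the abstract.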


The challenge of increasing Pfam coverage of the human proteome

Biswapriya Biswavas Misra's insight:
Abstract

It is a worthy goal to completely characterize all human proteins in terms of their domains. Here, using the Pfam database, we asked how far we have progressed in this endeavour. Ninety per cent of proteins in the human proteome matched at least one of 5494 manually curated Pfam-A families. In contrast, human residue coverage by Pfam-A families was <45%, with 9418 automatically generated Pfam-B families adding a further 10%. Even after excluding predicted signal peptide regions and short regions (<50 consecutive residues) unlikely to harbour new families, for ∼38% of the human protein residues, there was no information in Pfam about conservation and evolutionary relationship with other protein regions. This uncovered portion of the human proteome was found to be distributed over almost 25 000 distinct protein regions. Comparison with proteins in the UniProtKB database suggested that the human regions that exhibited similarity to thousands of other sequences were often either divergent elements or N- or C-terminal extensions of existing families. Thirty-four per cent of regions, on the other hand, matched fewer than 100 sequences in UniProtKB. Most of these did not appear to share any relationship with existing Pfam-A families, suggesting that thousands of new families would need to be generated to cover them. Also, these latter regions were particularly rich in amino acid compositional bias such as the one associated with intrinsic disorder. This could represent a significant obstacle toward their inclusion into new Pfam families. Based on these observations, a major focus for increasing Pfam coverage of the human proteome will be to improve the definition of existing families. New families will also be built, prioritizing those that have been experimentally functionally characterized.

Database URL: http://pfam.sanger.ac.uk/
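The gap between the two figures in the abstract (about 90% of proteins matched, but under 45% of residues covered) comes from measuring coverage at two different levels. A minimal sketch of both measures follows; the input layout (a protein length plus 1-based inclusive domain regions) and the function names are assumptions made for illustration, not the Pfam data format.

```python
from typing import Iterable, List, Tuple

Region = Tuple[int, int]  # 1-based, inclusive (start, end) of a domain match


def covered_residues(length: int, regions: Iterable[Region]) -> int:
    """Count residues covered by at least one region, merging overlaps."""
    covered = 0
    last_end = 0  # last residue already counted
    for start, end in sorted(regions):
        start = max(start, last_end + 1, 1)  # skip residues already counted
        end = min(end, length)               # clip to the protein length
        if start <= end:
            covered += end - start + 1
            last_end = end
    return covered


def coverage(proteins: List[Tuple[int, List[Region]]]) -> Tuple[float, float]:
    """Return (sequence coverage, residue coverage) as fractions."""
    matched = sum(1 for _, regions in proteins if regions)
    total_residues = sum(length for length, _ in proteins)
    residues = sum(covered_residues(length, regions)
                   for length, regions in proteins)
    return matched / len(proteins), residues / total_residues
```

A protein with a single short domain counts fully toward sequence coverage while contributing only a few residues to residue coverage, which is why the second number can be much lower.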


GAM-NGS: genomic assemblies merger for next generation sequencing

Biswapriya Biswavas Misra's insight:
Abstract

Background

In recent years more than 20 assemblers have been proposed to tackle the hard task of assembling NGS data. A common heuristic when assembling a genome is to use several assemblers and then select the best assembly according to some criteria. However, recent results clearly show that some assemblers lead to better statistics than others on specific regions but are outperformed on other regions or on different evaluation measures. To limit these problems we developed GAM-NGS (Genomic Assemblies Merger for Next Generation Sequencing), whose primary goal is to merge two or more assemblies in order to enhance contiguity and correctness of both. GAM-NGS does not rely on global alignment: regions of the two assemblies representing the same genomic locus (called blocks) are identified through reads' alignments and stored in a weighted graph. The merging phase is carried out with the help of this weighted graph that allows an optimal resolution of local problematic regions.
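The idea of finding matching regions through read alignments rather than through global alignment can be illustrated with a deliberately simplified sketch. It works at the level of whole contigs rather than regions, and the inputs (a mapping from read ID to the contig it aligned to, one per assembly), the `min_reads` threshold, and the function name are all hypothetical; the actual tool builds a more elaborate graph.

```python
from collections import Counter
from typing import Dict, Tuple


def candidate_blocks(reads_on_a: Dict[str, str],
                     reads_on_b: Dict[str, str],
                     min_reads: int = 2) -> Dict[Tuple[str, str], int]:
    """Pair contigs from two assemblies that share aligned reads.

    Returns {(contig_in_a, contig_in_b): number_of_shared_reads}, keeping
    only pairs supported by at least min_reads reads. The counts can serve
    as edge weights in a graph linking the two assemblies.
    """
    shared: Counter = Counter()
    for read_id, contig_a in reads_on_a.items():
        contig_b = reads_on_b.get(read_id)
        if contig_b is not None:  # read aligned in both assemblies
            shared[(contig_a, contig_b)] += 1
    return {pair: n for pair, n in shared.items() if n >= min_reads}
```

Because the same reads were used to build both assemblies, shared read support is a cheap proxy for "same genomic locus" and avoids aligning contigs to each other.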

Results

GAM-NGS has been tested on six different datasets and compared to other assembly reconciliation tools. The availability of a reference sequence for three of them allowed us to show that GAM-NGS outputs an improved, reliable set of sequences. GAM-NGS is also very efficient, merging assemblies with substantially fewer computational resources than comparable tools. To achieve this, GAM-NGS avoids global alignment between contigs, making its strategy unique among assembly reconciliation tools.

Conclusions

The difficulty of obtaining correct and reliable assemblies using a single assembler is forcing the introduction of new algorithms able to enhance de novo assemblies. GAM-NGS is a tool able to merge two or more assemblies in order to improve contiguity and correctness. It can be used on all NGS-based assembly projects and it shows its full potential with multi-library Illumina-based projects. With more than 20 available assemblers it is hard to select the best tool. In this context we propose a tool that improves assemblies (and, as a by-product, perhaps even assemblers) by merging them and selecting the generating assembly that is most likely to be correct.

