Life Sciences
Scooped by Gaurav Kaul

Random Forests in Python


Random forest is a highly versatile machine learning method with numerous applications ranging from marketing to healthcare and insurance. It can be used to model the impact of marketing on customer acquisition, retention, and churn or to predict disease risk and susceptibility in patients.

Random forest is capable of both regression and classification. It can handle a large number of features, and it's helpful for estimating which of your variables are important in the underlying data being modeled.

This is a post about random forests using Python.
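
As a minimal sketch of the ideas above, assuming scikit-learn (the original post may use a different library or dataset), the following trains a random forest classifier and reads off the per-variable importance estimates:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# A small disease-classification dataset as a stand-in for patient data.
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, random_state=0)

# Fit an ensemble of 100 decision trees on bootstrapped samples.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
print("Accuracy:", model.score(X_test, y_test))

# feature_importances_ estimates which variables drive the predictions;
# RandomForestRegressor exposes the same attribute for regression tasks.
ranked = sorted(zip(data.feature_names, model.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, importance in ranked[:5]:
    print(f"{name}: {importance:.3f}")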

Rescooped by Gaurav Kaul from biotools

Contrail: Assembly of Large Genomes using Cloud Computing


The first step towards analyzing a previously unsequenced organism is to assemble the reads by merging similar reads into progressively longer sequences. New assemblers such as Velvet and Euler attempt to solve the assembly problem by constructing, simplifying, and traversing the de Bruijn graph of the read sequences. Nodes in the graph represent substrings of the reads, and directed edges connect consecutive substrings. Genome assembly is then modeled as finding an Eulerian tour through the graph, although repeats may lead to multiple possible tours. As such, assemblers primarily focus on correcting errors, reconstructing unambiguous regions, and resolving short repeats. These assemblers have successfully assembled small genomes from short reads, but have had limited success scaling to larger mammalian-sized genomes, in part because they require constructing and manipulating graphs far larger than can fit into memory.
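
To make the graph construction concrete, here is a minimal Python sketch (an illustration, not Velvet's, Euler's, or Contrail's actual code): each k-mer in a read contributes a directed edge from its (k-1)-mer prefix to its (k-1)-mer suffix, and an Eulerian tour through the resulting graph spells out a candidate assembly.

from collections import defaultdict

def de_bruijn_graph(reads, k):
    # Map each (k-1)-mer prefix to the (k-1)-mer suffixes that follow it.
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return graph

for node, successors in de_bruijn_graph(["ACGTAC", "CGTACG"], k=4).items():
    print(node, "->", successors)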
Addressing this limitation, we have developed a new assembly program, Contrail, that uses Hadoop for de novo assembly of large genomes from short sequencing reads. Similar to other leading short read assemblers, Contrail relies on the graph-theoretic framework of de Bruijn graphs. However, unlike these programs, which require large RAM resources, Contrail relies on Hadoop to iteratively transform an on-disk representation of the assembly graph, allowing an in-depth analysis even for large genomes. Preliminary results show Contrail's contigs are of similar size and quality to those generated by Velvet when applied to small (bacterial) genomes, while providing vastly superior scaling capabilities when applied to large genomes. We are also developing extensions to Contrail to efficiently compute a traditional overlap-graph based assembly of large genomes within Hadoop, a strategy that will be especially valuable as read lengths increase beyond 100bp.
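
Contrail's actual Hadoop jobs are not shown in the abstract; the following hypothetical, in-memory Python sketch only illustrates the map/reduce pattern of transforming an on-disk edge list of the graph. Here one round flags nodes with exactly one in-edge and one out-edge, which lie on unambiguous linear paths and could be compressed in a later pass:

from itertools import groupby
from operator import itemgetter

# On-disk graph represented as (node, successor) records.
edges = [("ACG", "CGT"), ("CGT", "GTA"), ("GTA", "TAC")]

def map_phase(records):
    # Emit out-edges under the source node and in-edge markers under the
    # target, so each reducer sees a node's full neighborhood.
    for src, dst in records:
        yield src, ("out", dst)
        yield dst, ("in", src)

def reduce_phase(pairs):
    # Group by node; exactly one in- and one out-edge means the node sits on
    # an unambiguous linear path.
    for node, group in groupby(sorted(pairs, key=itemgetter(0)),
                               key=itemgetter(0)):
        values = [v for _, v in group]
        ins = [s for tag, s in values if tag == "in"]
        outs = [d for tag, d in values if tag == "out"]
        if len(ins) == 1 and len(outs) == 1:
            yield node, ("compressible", ins[0], outs[0])
        else:
            yield node, ("keep", ins, outs)

for node, result in reduce_phase(map_phase(edges)):
    print(node, result)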

Contrail enables de novo assembly of large genomes from short reads by bridging research in computational biology with research in high-performance computing. This combination is essential in light of the large data sets involved, and has the potential to unlock significant discoveries. Whereas the published analyses of the African and Asian human individuals used read mapping to discover conserved regions and regions with small polymorphisms, de novo assembly has the unique potential to also discover large-scale polymorphisms between these individuals and the reference human genome. Mapping the large-scale differences is an important step towards a better understanding of our own biology, and may reveal previously unknown characteristics of the human genome related to health or disease. Furthermore, a short read assembler for large genomes is also essential for sequencing the vast numbers of complex organisms that have never been sequenced before, and will directly contribute to new biological knowledge.

Via bitslife
Gaurav Kaul's insight:

Thanks, this is helpful. It would be nice to have more info on running Contrail on some kind of cloud infrastructure.
