This post is about a prototype ‘network’ approach to finding papers using data from Google Scholar, hopefully pointing to what could be done with more open data. I extracted my data with a supervised program that searches Google Scholar, but a scalable version of this tool would require open data.
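As a concrete illustration of what that supervised extraction step might look like, here is a minimal Python sketch built on the third-party `scholarly` package. The post does not name the tool it actually used, so the package, the seed query and the crawl limit below are all assumptions, not a description of the real pipeline.

```python
# A hedged sketch of a supervised Google Scholar crawl: find a seed paper,
# then follow its "cited by" links to build edges of a citation network.
# The `scholarly` package, the seed query and the 20-edge cap are assumed
# for illustration; they are not the tool described in the post.
from scholarly import scholarly

seed_query = "exome aggregation consortium"            # hypothetical seed query
seed_paper = next(scholarly.search_pubs(seed_query))   # take the top search hit

print(seed_paper["bib"]["title"], "|", seed_paper.get("num_citations"), "citations")

# Each citing paper is a new node; (cited, citing) pairs are the network's edges.
edges = []
for citing in scholarly.citedby(seed_paper):
    edges.append((seed_paper["bib"]["title"], citing["bib"]["title"]))
    if len(edges) >= 20:    # keep the supervised crawl deliberately small
        break
```

Google Scholar's rate limits and CAPTCHAs are exactly why a crawl like this stays small and supervised, which is the argument for doing the same thing over open citation data.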
More than one million people have now had their genome, or at least its protein-coding regions (the exome), sequenced. The hope is that this information can be shared and linked to phenotype — specifically, disease — and improve medical care. An obstacle is that only a small fraction of these data are publicly available.
In an important step, we report this week the first publication from the Exome Aggregation Consortium (ExAC), which has generated the largest catalogue so far of variation in human protein-coding regions. It aggregates sequence data from some 60,000 people. Most importantly, it puts the information in a publicly accessible database that is already a crucial resource (http://exac.broadinstitute.org).
There are challenges in sharing such data sets — the project scientists deserve credit for making this one open access. Its scale offers insight into rare genetic variation across populations. It identifies more than 7.4 million (mostly new) variants at high confidence, and documents rare mutations that independently emerged, providing the first estimate of the frequency of their recurrence. And it finds 3,230 genes that show nearly no cases of loss of function. More than two-thirds have not been linked to disease, which points to how much we have yet to understand.
The study also raises concern about how genetic variants have been linked to rare disease. The average ExAC participant has some 54 variants previously classified as causal for a rare disorder; many show up at an implausibly high frequency, suggesting that they were incorrectly classified. The authors review evidence for 192 variants reported earlier to cause rare Mendelian disorders and found at a high frequency by ExAC, and uncover support for pathogenicity for only 9. The implications are broad: these variant data already guide diagnoses and treatment (see E. V. Minikel et al. Sci. Transl. Med. 8, 322ra9; 2016 and R. Walsh et al. Genet. Med. http://dx.doi.org/10.1038/gim.2016.90; 2016).
These findings show that researchers and clinicians must carefully evaluate published results on rare genetic disorders. They also demonstrate the need to filter variants seen in sequence data using the ExAC data set and other reference tools — a practice widely adopted in genomics.
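To make that filtering step concrete, here is a toy Python sketch of screening reported pathogenic variants against reference allele frequencies. The file names, column names and the 0.1% frequency cutoff are hypothetical choices for illustration; they are not values specified by ExAC or the editorial.

```python
# Toy frequency filter: flag "pathogenic" variants that are too common in a
# reference panel such as ExAC to plausibly cause a rare disorder.
# All file names, columns and the cutoff below are assumptions.
import pandas as pd

MAX_PLAUSIBLE_AF = 0.001  # assumed allele-frequency ceiling for a rare disorder

reported = pd.read_csv("reported_pathogenic_variants.csv")  # variant_id, gene, claimed_disorder
exac_af = pd.read_csv("exac_allele_frequencies.csv")        # variant_id, allele_freq

merged = reported.merge(exac_af, on="variant_id", how="left")

# Variants absent from the panel keep the benefit of the doubt (frequency 0).
merged["allele_freq"] = merged["allele_freq"].fillna(0.0)
suspect = merged[merged["allele_freq"] > MAX_PLAUSIBLE_AF]

print(f"{len(suspect)} of {len(merged)} reported variants look too common to be causal")
```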
The ExAC project plans to grow over the next year to include 120,000 exome and 20,000 whole-genome sequences. It relies on the willingness of large research consortia to cooperate, and highlights the huge value of sharing, aggregation and harmonization of genomic data. This is also true for patient variants — there is a need for databases that provide greater confidence in variant interpretation, such as the US National Center for Biotechnology Information’s ClinVar database.
IF YOU WANTED to write a history of the Internet, one of the first things you would do is dig into the email archives of Vint Cerf. In 1973, he co-created the protocols that Internet servers use to communicate with each other without the need for any kind of centralized authority or control. He has spent the decades since shaping the Internet’s development, most recently as Google’s “chief Internet evangelist.”
Thankfully, Cerf says he has archived about 40 years of old email—a first-hand history of the Internet stretching back almost as far as the Internet itself. But you’d also have a pretty big problem: a whole lot of that email you just wouldn’t be able to open. The programs Cerf used to write those emails, and the formats in which they’re stored, just don’t work on any current computer you’d likely be using to try to read them.
Today, much of the responsibility for preserving the web’s history rests on The Internet Archive. The non-profit’s Wayback Machine crawls the web perpetually, taking snapshots that let you, say, go back and see how WIRED looked in 1997. But the Wayback Machine has to know about a site before it can index it, and it only grabs sites periodically. Based on the Internet Archive’s own findings, the average webpage only lasts about 100 days. In order to preserve a site, the Wayback Machine has to spot it in that brief window before it disappears.
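For a sense of how one checks whether a page was captured in time, here is a small Python sketch against the Internet Archive's public availability endpoint; the example URL and timestamp are placeholders, not part of the article.

```python
# Query the Wayback Machine's availability endpoint for the snapshot closest
# to a given timestamp. Returns the archived URL, or None if nothing was captured.
import requests

def closest_snapshot(url: str, timestamp: str = "19970101"):
    resp = requests.get(
        "https://archive.org/wayback/available",
        params={"url": url, "timestamp": timestamp},
        timeout=10,
    )
    resp.raise_for_status()
    closest = resp.json().get("archived_snapshots", {}).get("closest")
    return closest["url"] if closest and closest.get("available") else None

# e.g. find a 1997-era capture of WIRED, if the crawler spotted it in time
print(closest_snapshot("wired.com", "19970101"))
```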
What’s more, the Wayback Machine is a centralized silo of information—an irony that’s not lost on the inventors of the Internet. If it runs out of money, it could go dark. And because the archives originate from just one web address, it’s relatively easy for censors, such as those in China, to block users from accessing the site entirely. The Archive Team–an unrelated organization–is leading an effort to create a more decentralized backup of the Internet Archive. But if Internet Archive founder Brewster Kahle, Cerf, and their allies who recently came together at what they called the Decentralized Web Summit have their way, the world will one day have a web that archives itself and backs itself up automatically.
Some pieces of this new web already exist. The InterPlanetary File System, or IPFS, is an open-source project that taps into ideas pioneered by the decentralized digital currency Bitcoin and the peer-to-peer file-sharing system BitTorrent. Sites opt in to IPFS, and the protocol distributes files among participating users. If the original web server goes down, the site will live on thanks to the backups running on other people’s computers. What’s more, these distributed archives will let people browse previous versions of the site, much the way you can browse old edits in Wikipedia or old versions of websites in the Wayback Machine.
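To illustrate the opt-in model, here is a minimal sketch that adds a file to a locally running IPFS daemon and reads it back by its content identifier. It assumes the daemon's default HTTP RPC endpoint on 127.0.0.1:5001; the file and its contents are made up for the example.

```python
# Add a file to a local IPFS node and fetch it back by CID. Any other node
# that pins this CID can serve the same bytes, so the content outlives the
# original server. Assumes a local daemon at the default API address.
import requests

API = "http://127.0.0.1:5001/api/v0"

# Publish: the daemon returns a content identifier (CID) derived from the bytes.
add_resp = requests.post(
    f"{API}/add",
    files={"file": ("hello.html", b"<h1>hello, permanent web</h1>")},
)
cid = add_resp.json()["Hash"]
print("content identifier:", cid)

# Retrieve: the same CID always resolves to the same content.
cat_resp = requests.post(f"{API}/cat", params={"arg": cid})
print(cat_resp.content.decode())
```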
“We are giving digital information print-like quality,” says IPFS founder Juan Benet. “If I print a piece of paper and physically hand it to you, you have it, you can physically archive it and use it in the future.” And you can share that copy with someone else.
Unlike the early web, the web of today isn’t just a collection of static HTML files. It’s a rich network of interconnected applications like Facebook and Twitter and Slack that are constantly changing. A truly decentralized web will need ways to back up not just pages but applications and data as well. That’s where things get really tricky–just ask the team behind the DAO, the decentralized crowdfunding system that was hacked to the tune of $50 million last week.
The IPFS team is already hard at work on a feature that would allow a web app to keep trucking along even if the original server disappears, and it’s already built a chat app to demonstrate the concept. Meanwhile, several other projects, such as Ethereum, ZeroNet and the SAFE Network, aspire to create ways to build websites and applications that don’t depend on a single server or company to keep running. And now, thanks in large part to the Summit, many of them are working to make their systems cross-compatible.
For thousands of years humans believed that authority came from the gods. Then, during the modern era, humanism gradually shifted authority from deities to people. Jean-Jacques Rousseau summed up this revolution in Emile, his 1762 treatise on education.
Social-media companies are trying to level the playing field in the online propaganda war with Islamist radicals. Their goal is to see what kinds of messages could reach potential extremists before they become radicalized.
Big Data is coming whether we are ready or not. But given the human ability to ignore facts when they conflict with beliefs, will Big Data help on our biggest issues, or just give us a store coupon while the world burns?
Have you ever wondered why data science skills are so scarce on the market? In my view the answer is rather simple: a massive skill gap among university academics. Despite course names like ‘Business Analytics’ or ‘Data Science’, I would venture that the vast majority of the academics leading them have no idea what ‘Data Science’ in the business world really looks like. They are not even close. And what’s worse, to the students’ detriment, they are perfectly happy with it.
A guide to authoring books with R Markdown, including how to generate figures and tables, and insert cross-references, citations, HTML widgets, and Shiny apps in R Markdown. The book can be exported to HTML, PDF, and e-books (e.g. EPUB). The book style is customizable. You can easily write and preview the book in RStudio IDE or other editors, and host the book wherever you want (e.g. bookdown.org).