The Open Data Institute will catalyse an open data culture that has economic, environmental and social benefits. It will unlock supply, generate demand, and create and disseminate knowledge to address local and global issues.
We will convene world-class experts to collaborate, incubate, nurture and mentor new ideas, and promote innovation. We will enable anyone to learn and engage with open data, and empower our teams to help others through professional coaching and mentoring.
The Open Data for Africa platform is a response from the African Development Bank Group (AfDB) aimed at boosting access to the quality data necessary for managing and monitoring development results in African countries, including the Millennium Development Goals. It responds to a number of important global and regional initiatives to increase the availability of data on Africa. It will foster evidence-based decision-making, public accountability, and good governance. The initiative forms part of the worldwide effort to strengthen statistical capacity articulated in the Busan Action Plan for Statistics (BAPS), which was endorsed by the international community at the High-Level Forum on Aid Effectiveness held in Busan, Korea, from 28 November to 1 December 2011.
Open data for GLAMs: Open up your institution's data
This challenge is for professionals in cultural institutions who are interested in opening up their data as open culture data.
The course will guide you through the different steps towards open data and provide you with extensive background information on how to handle copyright and other possible issues.
These steps will prompt you to think about aspects of your data that could lead to a more efficient data infrastructure and a coherent data policy, with great internal benefits for your institution.
New York State Unveils New Open Data Portal
New York Governor Andrew Cuomo launched a new open data portal Monday, Open.ny.gov, following through on a promise made in his State of the State speech in January. The site will feature data from every New York State agency, and tie in localities from all over the state.
Open access to publicly funded, agriculturally relevant data is critical to increasing global food security. It is being used by innovators and entrepreneurs around the world to accelerate development, whether by tracking election transparency in Kenya or providing essential information to rural farmers in Uganda.
The G-8 conference will convene policy makers, thought leaders, food security stakeholders, and data experts to discuss the role of public, agriculturally relevant data in increasing food security and to build a strategy to spur innovation by making agriculture data more accessible. As part of the conference, selected applicants will be invited to showcase innovative uses of open data for food security in either a Lightning Presentation (a 3-5 minute, image-rich presentation on the first day of the conference) or in the Exhibit Hall (an image-rich exhibit on display throughout the two-day conference).
The G-8 is inviting innovators to apply to present ideas that demonstrate how open data can be unleashed to increase food security at the G-8 International Conference on Open Data for Agriculture on April 29-30, 2013 in Washington, D.C.
For more information on the conference and to submit your application, please visit the conference website or email G8AGOPENDATA@osec.usda.gov.
Discuss the upcoming conference and the role of open data in promoting agriculture and food security on Twitter! #OpenAgData
Richard Corbridge, chief information officer, explains why NIHR Clinical Research Network is creating an open data platform. He says the idea is to have a transparent data set that everyone can have access to.
Health data: How open data can be used to examine the NHS (Journalism.co.uk)
"There are some drugs that the NHS wants to understand the patterns of usage of, and we've used those files to turn the data into interactive maps."
The dangers in interpreting data
The researchers wanted to highlight the story of the potential savings to journalists. But in briefing journalists on the story, they paid particular care to how the data was released, as simply mapping spending would merely have shown the most densely populated areas of the country.
"There's an easy and sensational story of 'this doctor has a high proportion [of prescribing expensive drugs], so they are a very bad person'. That's not necessarily true but would be easy to pick up and run with," Bennett said.
"We were careful about the level on which we released data and the way in which we visualised it, which tried to encourage people to think about wider patterns and systems and try and discourage into digging into an individual's behaviour.
"PCTs and health organisations might want to do that but we didn't want to create a tabloid story around it, we wanted it to be a genuine discussion about 'how do we manage the NHS well?'
"We think it's a great institution and it's about giving it positive support and transparency rather than accusing people of things."
The planning around the controlled release worked, and the story was reported by the Financial Times, the Economist and the Daily Mail.
A large number of Wikipedia articles are geocoded. This means that when an article pertains to a location, its latitude and longitude are linked to the article. As you can imagine, this can be useful to generate insightful and eye-catching infographics. A while ago, a team at Oxford built this magnificent tool to illustrate the language boundaries in Wikipedia articles. This led me to wonder if it would be possible to extract the different topics in Wikipedia.
This is exactly what I managed to do in the past few days. I downloaded all of Wikipedia, extracted 300 different topics using a powerful clustering algorithm, projected all the geocoded articles on a map and highlighted the different clusters (or topics) in red. The results were much more interesting than I thought. For example, the map on the left shows all the articles related to mountains, peaks, summits, etc. in red on a blue base map. The highlighted articles from this topic match the main mountain ranges exactly.
Read on for more details, pretty pictures and slideshows.
A bit about the process
You can skip this section if you don’t really care about the nitty-gritty of the production of the maps. Scroll down to get to the slideshows.
Getting the data
Trains, stations, platforms, railways, etc.
The first step in creating these maps was to retrieve all Wikipedia articles. There are 1.5 million of them and only a portion (400,000) are geocoded, but this doesn't matter, because it's an all-or-nothing deal: everything must be downloaded. I had to download the raw data from this page. It's quite a large download at 9GB compressed, and it expands to about 40GB once uncompressed. I then parsed this very large file to extract the article content, links and geographical coordinates.
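The post doesn't show the parsing code, but a minimal Python sketch of the idea might look like the following. The dump file name, the export-schema namespace and the {{coord}} regex are assumptions; in particular, this only catches the simple decimal {{coord|lat|lon}} form, not the degrees/minutes variants.

```python
# Sketch: stream a compressed Wikipedia dump and yield geocoded articles.
import bz2
import re
import xml.etree.ElementTree as ET

NS = "{http://www.mediawiki.org/xml/export-0.10/}"  # namespace varies by dump version
# Only matches the plain decimal form, e.g. {{coord|51.5|-0.12|...}}.
COORD = re.compile(r"\{\{[Cc]oord\|(-?\d+(?:\.\d+)?)\|(-?\d+(?:\.\d+)?)")

def geocoded_articles(dump_path):
    """Yield (title, lat, lon) for every article carrying a simple {{coord}} tag."""
    with bz2.open(dump_path, "rb") as f:
        for _, elem in ET.iterparse(f):
            if elem.tag == NS + "page":
                title = elem.findtext(NS + "title")
                text = elem.findtext(f"{NS}revision/{NS}text") or ""
                m = COORD.search(text)
                if m:
                    yield title, float(m.group(1)), float(m.group(2))
                elem.clear()  # keep memory bounded on a ~40GB file

for title, lat, lon in geocoded_articles("enwiki-latest-pages-articles.xml.bz2"):
    print(title, lat, lon)
```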
Islands, coasts, beaches, oceans, etc.
To extract topics from this huge corpus, I used Latent Dirichlet Allocation. This algorithm can extract a given number of topics from a large corpus. Usually the optimal number of topics can be inferred from the likelihood values over several runs with different topic counts. However, in this case, since the corpus is very large and each run is very time-consuming (50 hours on the most powerful AWS cluster instance), I chose the number based on an educated guess and my LDA experience.
I ran the LDA algorithm using Yahoo's LDA implementation, since it's quite fast and can be parallelized. After 50 hours, I got 300 different topics linked to 1.5 million articles, but because only 400,000 of them are geocoded, the rest of this post only pertains to those 400,000. You can download the topic descriptions here. The topics are very varied, ranging across geographical regions, ethnic groups, science, sports (including both kinds of football!), historical sites and even archaeological dig sites.
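Yahoo's LDA is a distributed command-line tool; as an illustrative stand-in, the same kind of topic fit can be sketched in Python with gensim (my substitution, a single-machine toy that would be far too slow for the full 1.5-million-article corpus):

```python
# Sketch: fit an LDA topic model with gensim on a tokenized corpus.
from gensim import corpora, models

# `texts` is assumed: one tokenized article per entry (a toy stand-in here).
texts = [["mountain", "peak", "summit"], ["island", "beach", "coast"]]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(tokens) for tokens in texts]

# The post used 300 topics; shrink num_topics for a toy corpus like this one.
lda = models.LdaModel(corpus, num_topics=300, id2word=dictionary, passes=10)
for topic_id, words in lda.show_topics(num_topics=5, formatted=False):
    print(topic_id, [w for w, _ in words])
```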
Here are the most recent data sets uploaded to Many Eyes. Use the link in the Data column to open a view of the data set itself. Use the blue Visualize button to visualize the data.
The Source column shows the source as described by the person who uploaded the file. Please be aware that these files have been provided by users of the site; we cannot vouch for their accuracy or authenticity. To upload your own data, use the Upload page.
There is a growing need for efficient and integrated access to databases provided by diverse institutions. Using a linked data design pattern allows data scattered across the Internet to be linked effectively and accessed efficiently by computers.
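As a concrete illustration, here is a small Python sketch using rdflib (my choice of library; the text names no specific toolkit) that dereferences a linked data URI and walks the triples it returns:

```python
# Sketch: dereference a linked data resource and inspect its RDF triples.
from rdflib import Graph

g = Graph()
# DBpedia serves machine-readable RDF for the same URI a browser sees as HTML
# (subject to the endpoint being reachable).
g.parse("http://dbpedia.org/resource/Berlin")

# Every (subject, predicate, object) triple may point at a resource
# hosted by a different institution, which can be dereferenced in turn.
for s, p, o in list(g)[:10]:
    print(s, p, o)
```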
Pan-European open data available online from EuroGeographics
From today (8 March 2013), the 1:1 million scale topographic dataset EuroGlobalMap will be available free of charge for any use under a new open data...
For those of you who are looking for data mining tools, here are five of the best open-source data mining packages that you can get for free:
Orange
Orange is a component-based data mining and machine learning software suite that features a friendly yet powerful, fast and versatile visual programming front-end for exploratory data analysis and visualization, plus Python bindings and libraries for scripting. It contains a complete set of components for data preprocessing, feature scoring and filtering, modeling, model evaluation, and exploration techniques. It is written in C++ and Python, and its graphical user interface is based on the cross-platform Qt framework.
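For a flavour of the scripting side, here is a minimal sketch assuming Orange's Python API (the API has changed across versions, so treat the exact names as illustrative):

```python
# Sketch: train and apply a classification tree via Orange scripting.
import Orange

data = Orange.data.Table("iris")                 # built-in sample dataset
learner = Orange.classification.TreeLearner()
model = learner(data)                            # fit the tree on all rows
print(model(data[:5]))                           # predict the first five rows
```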
RapidMiner
RapidMiner, formerly called YALE (Yet Another Learning Environment), is an environment for machine learning and data mining experiments that is used for both research and real-world data mining tasks. Experiments are made up of a large number of arbitrarily nestable operators, which are described in XML files and composed with RapidMiner's graphical user interface. RapidMiner provides more than 500 operators covering all the main machine learning procedures, and it also incorporates the learning schemes and attribute evaluators of the Weka learning environment. It is available as a stand-alone tool for data analysis and as a data-mining engine that can be integrated into your own products.
Weka
Written in Java, Weka (Waikato Environment for Knowledge Analysis) is a well-known suite of machine learning software that supports several typical data mining tasks, particularly data preprocessing, clustering, classification, regression, visualization, and feature selection. Its techniques are based on the assumption that the data is available as a single flat file or relation, where each data point is described by a fixed number of attributes. Weka provides access to SQL databases via Java Database Connectivity and can process the results returned by a database query. Its main user interface is the Explorer, but the same functionality can be accessed from the command line or through the component-based Knowledge Flow interface.
jHepWork
Designed for scientists, engineers and students, jHepWork is a free and open-source data-analysis framework built from open-source packages, with the aim of providing a comprehensible user interface and a tool competitive with commercial programs. It is specially made for interactive scientific plots in 2D and 3D and contains numerical scientific libraries implemented in Java for mathematical functions, random numbers, and other data mining algorithms. jHepWork is based on the high-level programming language Jython, but Java code can also be used to call jHepWork's numerical and graphical libraries.
KNIME
KNIME (Konstanz Information Miner) is a user-friendly, intelligible, and comprehensive open-source data integration, processing, analysis, and exploration platform. It gives users the ability to visually create data flows or pipelines, selectively execute some or all analysis steps, and later study the results, models, and interactive views. KNIME is written in Java and is based on Eclipse, making use of its extension mechanism to support plugins that provide additional functionality. Through plugins, users can add modules for text, image, and time series processing, and integrate various other open source projects such as the R programming language, Weka, the Chemistry Development Kit, and LibSVM.
A map of the 2,770 most important accounts in the GitHub community.
The size of each node reflects the user's number of followers. The thickness of the links represents the number of projects forked between two users. The layout is the result of the ForceAtlas spatialization algorithm.
In natural language processing, latent Dirichlet allocation (LDA) is a generative model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar. For example, if observations are words collected into documents, it posits that each document is a mixture of a small number of topics and that each word's creation is attributable to one of the document's topics. LDA is an example of a topic model and was first presented as a graphical model for topic discovery by David Blei, Andrew Ng, and Michael Jordan in 2003.
In LDA, each document may be viewed as a mixture of various topics. This is similar to probabilistic latent semantic analysis (pLSA), except that in LDA the topic distribution is assumed to have a Dirichlet prior. In practice, this results in more reasonable mixtures of topics in a document. It has been noted, however, that the pLSA model is equivalent to the LDA model under a uniform Dirichlet prior distribution.
For example, an LDA model might have topics that can be classified as CAT_related and DOG_related. A topic has probabilities of generating various words, such as milk, meow, and kitten, which can be classified and interpreted by the viewer as "CAT_related". Naturally, the word cat itself will have high probability given this topic. The DOG_related topic likewise has probabilities of generating each word: puppy, bark, and bone might have high probability. Words without special relevance, such as the (see function word), will have roughly even probability between classes (or can be placed into a separate category). A topic is not strongly defined, neither semantically nor epistemologically. It is identified on the basis of supervised labeling and (manual) pruning, guided by the likelihood of term co-occurrence. A lexical word may occur in several topics with different probabilities, but with a different typical set of neighboring words in each topic.
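To make the generative story concrete, here is a toy numerical rendering of the CAT/DOG example in Python; the probabilities are invented for illustration, not taken from any fitted model:

```python
# Sketch of LDA's generative process: draw a per-document topic mixture
# from a Dirichlet prior, then draw each word via its topic.
import numpy as np

rng = np.random.default_rng(0)
vocab = ["milk", "meow", "kitten", "puppy", "bark", "bone", "the"]
# Per-topic word distributions (rows sum to 1): CAT_related, DOG_related.
phi = np.array([
    [0.25, 0.25, 0.25, 0.01, 0.01, 0.01, 0.22],   # CAT_related
    [0.01, 0.01, 0.01, 0.25, 0.25, 0.25, 0.22],   # DOG_related
])

# Each document draws its own topic mixture from the Dirichlet prior...
theta = rng.dirichlet(alpha=[0.5, 0.5])
# ...then every word first picks a topic z, then a word from topic z.
words = []
for _ in range(8):
    z = rng.choice(2, p=theta)
    words.append(vocab[rng.choice(len(vocab), p=phi[z])])
print(theta, words)
```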
Many complex systems display a surprising degree of tolerance against errors. For example, relatively simple organisms grow, persist and reproduce despite drastic pharmaceutical or environmental interventions, an error tolerance attributed to the robustness of the underlying metabolic network [1]. Complex communication networks [2] display a surprising degree of robustness: although key components regularly malfunction, local failures rarely lead to the loss of the global information-carrying ability of the network. The stability of these and other complex systems is often attributed to the redundant wiring of the functional web defined by the systems' components. Here we demonstrate that error tolerance is not shared by all redundant systems: it is displayed only by a class of inhomogeneously wired networks, called scale-free networks, which include the World-Wide Web [3, 4, 5], the Internet [6], social networks [7] and cells [8]. We find that such networks display an unexpected degree of robustness, the ability of their nodes to communicate being unaffected even by unrealistically high failure rates. However, error tolerance comes at a high price in that these networks are extremely vulnerable to attacks (that is, to the selection and removal of a few nodes that play a vital role in maintaining the network's connectivity). Such error tolerance and attack vulnerability are generic properties of communication networks.
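The paper's central contrast is easy to reproduce in simulation. Here is a rough Python sketch (my own, not the authors' code) comparing random failures with a targeted attack on the hubs of a scale-free graph built with networkx:

```python
# Sketch: random failures vs. hub attack on a scale-free network.
import random
import networkx as nx

def giant_fraction(G, n_original=1000):
    """Largest connected component as a fraction of the original node count."""
    return max(len(c) for c in nx.connected_components(G)) / n_original

random.seed(0)
G = nx.barabasi_albert_graph(1000, 2)   # scale-free test network

# Random failures: remove 50 nodes chosen uniformly at random.
failures = G.copy()
failures.remove_nodes_from(random.sample(list(failures.nodes), 50))

# Targeted attack: remove the 50 highest-degree hubs.
attack = G.copy()
hubs = sorted(attack.degree, key=lambda kv: kv[1], reverse=True)[:50]
attack.remove_nodes_from([n for n, _ in hubs])

# The giant component survives random failures far better than the attack.
print("random failures:", giant_fraction(failures))
print("targeted attack:", giant_fraction(attack))
```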
If India is to transition to a true knowledge economy, open access, availability and contestability of public knowledge are paramount. Building sound public policies requires robust information systems and data.