I am less and less often mistaken for a pirate when I mention the R language. While I miss the excuse to wear an eyepatch, I'm glad more people are beginning to explore a statistical language I've been touting for years. When it comes to plotting or running complex statistics in a single line of code, R is a great tool to have. That said, there are plenty of pitfalls for the casual or new user: syntax, learning to write vectorized code, or even just knowing which "apply" function you really should choose.
I want to explore a slightly less-often considered aspect of R development: parallelism. Out of the box, R can seem very limited to someone used to working on compute clusters or even a multicore server. However, there are a few tricks we can leverage to get the most out of R on everything from a personal workstation to a Hadoop cluster.
Hortonworks, a leading contributor to Apache Hadoop, today announced it has joined the OpenStack Foundation, which promotes the development, distribution and adoption of the OpenStack cloud operating system.
The VoltDB engineering team is thrilled to announce that VoltDB 3.0 is now available! Over the past six months we’ve added a ton of features to VoltDB 3.0. This blog post lists the highlights, but that just scratches the surface. Look for future blog posts to dive into specific areas of version 3 functionality.
Open Government was published in 2010 by O'Reilly Media. The United States had just elected a president in 2008, who, on his first day in office, issued an executive order committing his administration to "an unprecedented level of openness in government." The contributors of Open Government had long fought for transparency and openness in government, as well as access to public information. Aaron Swartz was one of these contributors (Chapter 25: When is Transparency Useful?). Aaron was a hacker, an activist, a builder, and a respected member of the technology community. O'Reilly Media is making Open Government free to all to access in honor of Aaron. #PDFtribute
Zynga has deployed nearly 100 nodes of MemSQL, the hot new database from two former Facebook engineers. It might not be a magic pill for Zynga’s woes, but it could help the company boost revenue and even build new types of games.
D&R is being developed to meet these many challenges. In a D&R analysis, the data are divided into subsets in one or more ways, forming multiple divisions. Numeric and visualization methods are applied to each of the subsets of a division, and the results of each method are recombined across subsets.
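The divide, apply, recombine pattern described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (`divide`, `apply_method`, `recombine`) and the toy records are hypothetical and are not taken from any D&R implementation, and averaging the per-subset means is just one possible recombination.

```python
from collections import defaultdict
from statistics import mean


def divide(records, key):
    """Split records into subsets, one per distinct value of key(record)."""
    subsets = defaultdict(list)
    for record in records:
        subsets[key(record)].append(record)
    return dict(subsets)


def apply_method(subsets, method):
    """Apply the same numeric method to every subset independently."""
    return {name: method(rows) for name, rows in subsets.items()}


def recombine(results, combine):
    """Merge the per-subset results into a single summary."""
    return combine(list(results.values()))


# Hypothetical toy data: (group label, measurement)
records = [("a", 1.0), ("a", 3.0), ("b", 10.0), ("b", 20.0), ("b", 30.0)]

# One division of the data: subsets defined by the group label
subsets = divide(records, key=lambda r: r[0])

# Per-subset method: the mean of the measurements in that subset
per_subset = apply_method(subsets, lambda rows: mean(v for _, v in rows))

# Recombination: an unweighted average of the per-subset means
overall = recombine(per_subset, mean)
```

Because each subset is processed independently, the apply step is the part that could be distributed across workers; a different division (another `key`) or a different `combine` function gives a different analysis of the same data.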
A very early stage of a Redis monitoring tool using hiredis and express on Node.js, presenting a dashboard inspired by Netflix’s Hystrix. The project is on GitHub so you can send some pull requests...
Chris Winslett of MongoHQ is experimenting with MongoDB text search and declares himself satisfied: Full-text searching with MongoDB 2.4 is more complex and powerful than originally illustrated in our first blog post outlining this feature.
As you hopefully have heard, we at scikit-learn are doing a user survey (which is still open by the way). One of the requests there was to provide some sort of flow chart on how to do machine learning.
ControlTier is an open source, cross-platform build and deployment automation framework. ControlTier can help you to coordinate and scale service management and administration activities across multiple nodes and application tiers.
In today's social world, it's important to be able to collaborate with others online when working with data, and to be able to easily share your outputs online. Fortunately, the R language and the broad R community provide a number of facilities for collaboration and sharing, which are summarized in Noam Ross's guide to tools for collaboration with R. Among the resources he lists:
Facebook has released, under the Open Compute umbrella, the design of a new database server they’ve introduced in one of their datacenters. The bit that caught my eye is that this is not about more disk space or more CPU, but redundant power supplies:
What is good C++11 coding style? Although just ratified in late 2011, we already have early implementations to show us the path to good coding style. Indeed, much of C++11 was designed following rules to allow us to achieve a superior coding style rather than the litter of bad code that was normal in C++98/03. Gone are the casts, macros, pointers, naked new and deletes, complicated control structures, special cases that don't fit generic programming, and deep nesting. This talk will illustrate the design rules we were aiming for in C++11, many of the pitfalls of migrating from C++98/03 to C++11, and how to achieve good code style using C++11's type-rich interfaces, integrated resource management, uniform generic programming and efficient mapping to hardware. In essence, modern C++11 coding style should be noticeably different than C++98 code, even at a glance. This talk will show you what to look for as code begins to migrate towards C++11.
It shouldn’t be a surprise to anyone that the most connected companies in the Hadoop space are Cloudera and Hortonworks. They outrank the IT industry mammoths: IBM, HP, Microsoft, Oracle, SAP, etc.