Mongodb BigData NoSql & Search Engine
Topics about MongoDB, NoSQL, big data and search engines
Curated by Alex Kantone

Big Data Benchmark - Redshift, Hive, Shark, Impala, Tez


Several analytic frameworks have been announced in the last year. Among them are inexpensive data-warehousing solutions based on traditional Massively Parallel Processor (MPP) architectures (Redshift), systems which impose MPP-like execution engines on top of Hadoop (Impala, HAWQ) and systems which optimize MapReduce to improve performance on analytical workloads (Shark, Stinger/Tez). This benchmark provides quantitative and qualitative comparisons of five systems. It is entirely hosted on EC2 and can be reproduced directly from your computer.

Redshift - a hosted MPP database offered by Amazon.com based on the ParAccel data warehouse. We tested Redshift on HDDs.
Hive - a Hadoop-based data warehousing system. (v0.12)
Shark - a Hive-compatible SQL engine which runs on top of the Spark computing framework. (v0.8.1)
Impala - a Hive-compatible* SQL engine with its own MPP-like execution engine. (v1.2.3)
Stinger/Tez - Tez is a next-generation Hadoop execution engine currently in development. (v0.2.0)

This remains a work in progress and will evolve to include additional frameworks and new capabilities. We welcome contributions.

What this benchmark is not

This benchmark is not intended to provide a comprehensive overview of the tested platforms. We are aware that by choosing default configurations we have excluded many optimizations. The choice of a simple storage format, compressed SequenceFile, omits optimizations included in columnar formats such as ORCFile and Parquet. For now, we've targeted a simple comparison between these systems with the goal that the results are understandable and reproducible.

What is being evaluated?

This benchmark measures response time on a handful of relational queries: scans, aggregations, joins, and UDF's, across different data sizes. Keep in mind that these systems have very different sets of capabilities. MapReduce-like systems (Shark/Hive) target flexible and large-scale computation, supporting complex User Defined Functions (UDF's), tolerating failures, and scaling to thousands of nodes. Traditional MPP databases are strictly SQL compliant and heavily optimized for relational queries. The workload here is simply one set of queries that most of these systems can complete.


MongoDB and Indexing: Best Practices


Back in May last year, we talked about a set of best practices for indexing and MongoDB, focusing on performance. Since then MongoDB 2.6 has arrived and it’s a good time to update that list.

One Index per Query

MongoDB can only use one index in any one query operation. Although MongoDB 2.6 has introduced index intersection, which allows more than one index to be used (specifically two indexes currently), there are a number of caveats on their use and one index, whether singular or compound, is still easier to manage and optimise around.
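As a rough illustration of why a single compound index is easy to reason about, here is a toy model in pure Python (not MongoDB internals) of a compound index on two hypothetical fields, status and created, serving an equality predicate and a range predicate at the same time:

```python
import bisect

# Illustrative model only: a compound index on (status, created) is a
# sorted list of key tuples, each pointing at a document id. The field
# names and doc ids are invented for this sketch.
index = sorted([
    (("active", 1), "doc1"),
    (("active", 3), "doc2"),
    (("active", 7), "doc3"),
    (("archived", 2), "doc4"),
])
keys = [k for k, _ in index]

def find_since(status, since):
    """Equality on the index prefix plus a range on the next field:
    one compound index answers both predicates with one range scan."""
    lo = bisect.bisect_left(keys, (status, since))
    hi = bisect.bisect_right(keys, (status, float("inf")))
    return [doc for _, doc in index[lo:hi]]

print(find_since("active", 3))  # -> ['doc2', 'doc3']
```

The point of the sketch is that both predicates collapse into a single contiguous slice of the index, which is what makes a well-chosen compound index straightforward to optimise around.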

One “Multi-Value” Operator per Query

There’s a range of selectors we call “multi-value” operators. They can return records that match a variety of different values. They include $gt, $gte, $lt and $lte, $in and $nin, $ne, $not and $near. If you need to use one, use one and no more. This is because the “multi-value” operators can return an unpredictable number of values and are more likely to be scanned to be evaluated – the more you have to scan, the slower the query.
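To make the "multi-value" idea concrete, here is a toy matcher, an illustration only and not MongoDB's query engine, that evaluates a few of these operators against plain Python dicts:

```python
# Hypothetical mini-matcher for a few MongoDB-style "multi-value"
# operators. It shows how these operators select ranges of values
# rather than a single point; the real engine works very differently.
OPS = {
    "$gt":  lambda value, arg: value > arg,
    "$gte": lambda value, arg: value >= arg,
    "$lt":  lambda value, arg: value < arg,
    "$lte": lambda value, arg: value <= arg,
    "$in":  lambda value, arg: value in arg,
    "$ne":  lambda value, arg: value != arg,
}

def matches(doc, query):
    for field, cond in query.items():
        if isinstance(cond, dict):  # operator form, e.g. {"$gte": 5}
            for op, arg in cond.items():
                if not OPS[op](doc.get(field), arg):
                    return False
        elif doc.get(field) != cond:  # plain equality
            return False
    return True

docs = [{"qty": 3}, {"qty": 7}, {"qty": 12}]
print([d["qty"] for d in docs if matches(d, {"qty": {"$gte": 5}})])  # -> [7, 12]
```

Each extra multi-value operator in a query widens the set of candidate values the engine may have to scan, which is the cost the advice above is warning about.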


Java Clients for Elasticsearch

One of the important aspects of Elasticsearch is that it is programming language independent. All of the APIs for indexing, searching and monitoring can be accessed using HTTP and JSON, so it can be integrated in any language that has those capabilities. Nevertheless Java, the language Elasticsearch and Lucene are implemented in, is very dominant. In this post I would like to show you some of the options for integrating Elasticsearch with a Java application.

 

The Native Client

The obvious first choice is to look at the client Elasticsearch provides natively. Unlike other solutions, there is no separate jar file that contains just the client API; instead you integrate the whole Elasticsearch application. This is partly caused by the way the client connects to Elasticsearch: it doesn't use the REST API but connects to the cluster as a cluster node. This node normally doesn't contain any data but is aware of the state of the cluster.

The node client integrates with your Elasticsearch cluster

On the right side we can see two normal nodes, each containing two shards. Each node of the cluster, including our application’s client node, has access to the cluster state as indicated by the cylinder icon. That way, when requesting a document that resides on one of the shards of Node 1 your client node already knows that it has to ask Node 1. This saves a potential hop that would occur when asking Node 2 for the document that would then route your request to Node 1 for you.
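The routing idea can be sketched in a few lines. This is a deliberately simplified stand-in: a toy additive hash instead of the hash of the routing key Elasticsearch actually computes, and an invented shard-to-node table standing in for the cluster state:

```python
# A cluster-aware client saves a hop because it can compute the shard
# for a document id itself and send the request straight to the node
# holding that shard. Shard count and node names are invented here.
NUM_SHARDS = 4
shard_to_node = {0: "node1", 1: "node1", 2: "node2", 3: "node2"}

def route(doc_id):
    """Return (shard, node) for a document id using a toy hash."""
    shard = sum(doc_id.encode()) % NUM_SHARDS
    return shard, shard_to_node[shard]

shard, node = route("doc-42")
print(f"doc-42 lives in shard {shard} on {node}")
```

Without the shared cluster state, the client would have to send the request to an arbitrary node and let it forward the request, which is exactly the extra hop described above.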

Creating a client node in code is easy. You can use the NodeBuilder to get access to the Client interface. This then has methods for all of the API functionality, e.g. for indexing and searching data.


Why Microsoft's Azure Machine Learning is such a big deal


Microsoft made two big announcements on Monday, showcasing two different aspects of its changes. While developer access to new builds of Internet Explorer would have been a showstopper on most days, the biggest news came an hour or so later with the unveiling of Azure Machine Learning (ML).

Dig down into the heart of any big cloud service, whether it's Amazon, Google, or Bing, and you'll find machine learning. It's one of the technologies that powers a new generation of cloud scale AI-based technologies, using big data to learn how to respond to a wide range of inputs. A powerful set of tools, machine learning simplifies what were once complex programming tasks -- while allowing you to build software that can make inferences from your data sources.

But making machine learning easy to use is surprisingly difficult. In the past, building learning systems took time and required significant computing resources. It required data scientists to build the training sets and to guide the learning system through its first steps. It was a complex and expensive process.


Elasticsearch’s behavior under various types of network failure.


Elasticsearch is a distributed search engine, built around Apache Lucene–a well-respected Java indexing library. Lucene handles the on-disk storage, indexing, and searching of documents, while ElasticSearch handles document updates, the API, and distribution. Documents are written to collections as free-form JSON; schemas can be overlaid onto collections to specify particular indexing strategies.

As with many distributed systems, Elasticsearch scales in two axes: sharding and replication. The document space is sharded–sliced up–into many disjoint chunks, and each chunk allocated to different nodes. Adding more nodes allows Elasticsearch to store a document space larger than any single node could handle, and offers quasilinear increases in throughput and capacity with additional nodes. For fault-tolerance, each shard is replicated to multiple nodes. If one node fails or becomes unavailable, another can take over. There are additional distinctions between nodes which can process writes, and those which are read-only copies–termed “data nodes”–but this is primarily a performance optimization.

Because index construction is a somewhat expensive process, Elasticsearch provides a faster, more strongly consistent database backed by a write-ahead log. Document creation, reads, updates, and deletes talk directly to this strongly-consistent database, which is asynchronously indexed into Lucene. Search queries lag behind the “true” state of Elasticsearch records, but should eventually catch up. One can force a flush of the transaction log to the index, ensuring changes written before the flush are made visible.
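A toy model of that split, with invented names, might look like this: reads by id hit the strongly consistent store immediately, while search only sees documents after the asynchronous index catches up (modelled here as an explicit refresh):

```python
# Illustrative sketch only, not Elasticsearch's implementation.
class ToyStore:
    def __init__(self):
        self.translog = {}   # authoritative, strongly consistent store
        self.index = {}      # what search queries actually see

    def put(self, doc_id, body):
        self.translog[doc_id] = body

    def get(self, doc_id):
        """Reads by id are served from the consistent store."""
        return self.translog.get(doc_id)

    def refresh(self):
        """Asynchronous in the real system; explicit here."""
        self.index.update(self.translog)

    def search(self, word):
        return [i for i, b in self.index.items() if word in b]

s = ToyStore()
s.put("1", "jepsen tests elasticsearch")
print(s.get("1"))          # visible immediately by id
print(s.search("jepsen"))  # [] because the index lags behind
s.refresh()
print(s.search("jepsen"))  # ['1'] after the flush
```

This is the lag the paragraph describes: a forced flush makes everything written before it visible to search.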

But this is Jepsen, where nothing works the way it’s supposed to. Let’s give this system’s core assumptions a good shake and see what falls out!


Using the Power of Real-Time Distributed Search with ElasticSearch


Internet is a place where everyone in the world can find any information they want. But with billions of documents available in the web, how is it possible to find exactly what we want in seconds or less?

For this purpose special programs called ‘search engines’ are developed by using many algorithms for analyzing, stemming, building indexes and searching querying terms. In Java world there is one of the most popular open source libraries called Lucene from Apache. It is a high performance, reliable and widely used full-featured Information Retrieval library written in Java. On top of it are built a few servers such as Solr, ElasticSearch and others.

Nowadays most companies are trying to move all computation into the cloud and Search is not an exception. In this article I would like to consider ElasticSearch, which, besides many other features, is initially designed to work in clouds and is quite successful in accomplishing that mission.


Can a search engine predict the World Cup? Microsoft Bing is giving it a try


Microsoft’s Bing search engine, after using data to successfully forecast the results of American Idol, Dancing with the Stars and The Voice, will now test its predictive abilities on a contest with an even more passionate fan base: The World Cup.

Yes, the company is expanding its “Bing Predicts” initiative to the beautiful game, and hoping for a similar result. Microsoft launched the World Cup predictions effort this morning, letting users type in “World Cup Predictions” or “Group A Predictions,” for example, to bring up the forecast generated by its algorithm.


Cassandra Vs HBase : Which NoSql store do I need ?


There are many NoSQL databases out there and it can be confusing to determine which one is suitable for a particular use case. In this blog, we discuss the two most popular ones, Cassandra and HBase. If you are new to NoSQL, you may review these earlier posts:

What is NoSql?
HBase
HBase architecture

 

To understand the goals and motivation behind any product, it is a good idea to trace its origin.

HBase is based on Google's Bigtable as described in the paper "Bigtable: A Distributed Storage System for Structured Data". You only need to read the first line of the abstract to understand what BigTable attempts to do. It is a distributed storage system for managing structured data that can scale to very large size of the order of petabytes using thousands of commodity servers.

Cassandra derives its motivation from Amazon's Dynamo as described in the paper "Dynamo: Amazon's highly available key value store". Reading the first page of this paper, it is clear that the primary goals were reliability at scale and high availability for Amazon's online store.

While both papers talk about scale, reliability and availability, the primary problem BigTable (and HBase) addresses is random access to data at the scale of hundreds of terabytes or petabytes, while the primary problem Dynamo (and Cassandra) addresses is high availability.

Yeoman, Mongoose & MongoDB


Yeoman is a scaffolding tool that scaffolds out projects using Grunt, Bower and Node. There are times when you end up cut 'n pasting boilerplate code around to create a new project. This is precisely what Yeoman does, but with a single command and a few awesome generators.

Yeoman uses Grunt as the task runner to perform run/build/test tasks. If you want to use Gulp for the same, you can check out Slush. Slush is also a scaffolding tool but uses Gulp as the task runner.


Data Analytics Handbook


An in-depth look at the data science industry
Interviews with data scientists, data analysts, CEOs, managers, and researchers at the cutting edge of the data science industry.

Created to inform young professionals, by young professionals. We are three UC Berkeley students (Go Bears!) who set out to educate young professionals on the big data industry.


Using logstash, elasticsearch and Kibana to monitor your video card - a tutorial


A few weeks ago my colleague Jettro wrote a blog post about an interesting real-life use case for Kibana: using it to graph meta-data of the photos you took. Given that photography is not a hobby of mine I decided to find a use-case for Kibana using something closer to my heart: gaming.

This Christmas I treated myself to a new computer. The toughest decision I had to make was regarding the video card. In the end I went with a reference AMD R9 290, notoriously known for its noisiness. Because I'm really interested in seeing how the card performs while gaming, I decided to spend some time on my other hobby, programming, in order to come up with a video card monitoring solution based on logstash, elasticsearch & Kibana. Overkill? Probably. Fun? Definitely.

I believe it's also a very nice introduction on how to set up a fully working setup of logstash - elasticsearch - Kibana. Because of the "Windowsy" nature of gaming, some of the commands listed are the Windows version. The Unix folk should have no problems translating these as everything is kept very simple.


Big Data Tools that You Need to Know About Hadoop & NoSQL


If you’ve heard anything about Big Data chances are good that you’ve also heard some buzz around a platform called Hadoop. Hadoop was developed in 2005 by Doug Cutting and Mike Cafarella. Cutting, who worked for Yahoo at the time, actually named this tool after his son’s toy elephant. Hadoop really came to light as an outgrowth of efforts by Google, Yahoo, and other companies to provide faster methods for indexing web pages and handling the growing data bottleneck. By 2008 Yahoo announced that a 10,000 core Hadoop cluster was being used to run its production index search and since then there’s been no looking back. We’re at the point now where, as one publication put it, Hadoop is “the focal point of an immense big data movement.” A fast-growing ecosystem of commercial vendors has emerged in recent years with companies like Cloudera, HortonWorks, and MapR developing customizable and accessible out-of-the-box solutions for scaling up Hadoop. In 2012 the research firm IDC estimated the Hadoop market at $77M with projected annual growth of 60% to $813M by 2016.


3 lessons in database design from the team behind Twitter's Manhattan


In April, Twitter announced that it had had enough of trying to bend existing database technologies to its unique needs, inspiring the company to build its own database called Manhattan. Last week, I spoke with the trio of Twitter engineers — Chris Goffinet, Peter Schuller and Boaz Avital — who built the database to get a higher-level view of Manhattan beyond its technological underpinnings, to get a better sense of how prevalent Manhattan will be inside Twitter and what its creation says about software development at web scale.


ElasticSearch Query: Performance Optimisation


In one of my previous posts on Elasticsearch, I shared my understanding of Elasticsearch configurations and best practices. That was mostly from an indexing perspective. There are several tweaks one can use to optimise query performance as well. Improving query time can be even more challenging than trying to improve indexing times. Let's see why querying is more of a challenge:

Queries can go on while the index is getting updated
Different queries need different optimisation strategies
There are far more configurations that impact query performance: query syntax/clauses used, index schema, Elasticsearch configuration, RAM, CPU, network, IO

And there are times when you need to fire 2 or more queries in succession to get certain results back from ES. I had one such scenario recently where I needed to fire 3 queries to ES and make sure that the response times were always less than a second. The 3 queries were related in the sense that query 2 uses the output of query 1 and query 3 uses the output of query 2. For my use case, one of the queries was simple, while the other two were more complex as they had aggregations, stats, filters etc.

As outlined above, there are several things that can prevent an optimal response time. Also, to safely say that a desired response time has been achieved, one needs to test and test right. A poor testing method would lead to misleading performance statistics. Below are details of my testing methodology and the tweaks that led to sub-second response times for 3 queries.

ElasticSearch Cluster and Indexes
5 machines in the cluster
5 shards per index
250 GB EBS volume on each machine to hold indexes
Indexes are stored compressed
No indexing takes place while testing (my use case asks for indexing in batch once a day)
3 indexes:
Index A: 24+ million records (used in 1st query); all integer fields; 4 fields
Index B: 90+ million records (used in 2nd query); all integers; 3 fields
Index C: 340K records (used in 3rd query); string, integer and date fields; only a few fields used in querying
Different machine types to hold ES indexes: m3.large to c3.4xlarge
RAM: different sizes for tests, from 4 GB to 15 GB given to the ES instance

NoSQL Performance Benchmarks: Cassandra vs HBase vs MongoDB vs Redis vs MySQL


Apache Cassandra is a leading NoSQL database platform for online applications.  By offering benefits of continuous availability, high scalability & performance, strong security, and operational simplicity —  while lowering overall cost of ownership — Cassandra has become a proven choice for both technical and business stakeholders.

When compared to other database platforms such as HBase, MongoDB, Redis, MySQL and many others, Cassandra delivers higher performance under heavy workloads.

The following  benchmark tests provide a graphical, ‘at a glance’ view of how these platforms compare under different scenarios.


End Point Benchmark Configuration and Results

University of Toronto NoSQL Database Performance

Netflix Benchmarking Cassandra Scalability on AWS

 

End Point Benchmark Configuration and Results Summary

End Point, a database and open source consulting company, benchmarked the top NoSQL databases — Apache Cassandra, Apache HBase, and MongoDB — using a variety of different workloads on Amazon Web Services EC2 instances. This is an industry-standard platform for hosting horizontally scalable services such as the three NoSQL databases that were tested. In order to minimize the effect of AWS CPU and I/O variability, End Point performed each test 3 times on 3 different days. New EC2 instances were used for each test run to further reduce the impact of any “lame instance” or “noisy neighbor” effect on any one test.

A summary of the workload analysis is available below. For a review of the entire testing process with testing environment configuration details, the benchmarking NoSQL databases white paper by End Point is available.


Get Started: Ambari for provisioning, managing and monitoring Hadoop


Ambari is 100% open source and included in HDP, greatly simplifying installation and initial configuration of Hadoop clusters. In this article we’ll be running through some installation steps to get started with Ambari. Most of the steps here are covered in the main HDP documentation here.

The first order of business is getting Ambari Server itself installed. There are different approaches to this, but for the purposes of this short tour, we’ll assume Ambari is already installed on its own dedicated node somewhere or on one of the nodes on the (future) cluster itself. Instructions can be found under the installation steps linked above. Once Ambari Server is running, the hard work is actually done. Ambari  simplifies cluster install and initial configuration with a wizard interface, taking care of it with but a few clicks and decisions from the end user. Hit http://<server_you_installed_ambari>:8080 and log in with admin/admin. Upon logging in, we are greeted with a user-friendly, wizard interface. Welcome to Apache Ambari! Name that cluster and let’s get going.


10 of the most useful cloud databases


IDC predicts that big data is growing at an annual rate of 60% for structured and unstructured data. Businesses need to do something with all that data, and traditionally databases have been the answer. With cloud technology, providers are rolling out more ways to host those databases in the public cloud, freeing users from dedicating their own hardware to these databases, while providing the ability to scale the databases into large capacities. "This is a really huge market given all the data out there," says Jeff Kelly, a big data expert at research firm Wikibon. "The cloud is going to be the destination for a lot of this big data moving forward."

Some concerns remain for what some call database as a service (DBaaS), specifically around sensitive information being stored in the cloud and around cloud outages. But still, an emerging market of cloud database services and tools seems to be picking up steam. Here, Network World looks at 10 cloud database tools. Some of these are providers of direct relational, SQL or NoSQL databases, while others are niche focused on various open source databases. Please note this list is not meant to be exhaustive, as some big players, like Oracle, HP and EMC/VMware are still rounding out their cloud-based products and strategies for these tools.


Facebook Apollo NoSQL Database


Facebook’s latest project is a NoSQL database called Apollo that provides online low latency hierarchical storage.

The details of the database project were revealed at QCon New York on Wednesday by Jeff Johnson, a software engineer in Facebook's Core Data group. He described Apollo as a distributed database built around strong consistency, using Paxos-style quorum protocols.

Paxos is a family of quorum consensus protocols, originally defined for deriving a single agreed result from a number of possibilities on a network of unreliable processors. It can be used in replicated databases to overcome the problems caused if distributed servers fail. In order for an update to be accepted it must be voted on by a majority of servers within a shard, and updates are only completed when they make their way to a majority of servers.
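The majority rule in that paragraph can be sketched in a few lines. This shows only the counting rule for accepting an update, not the full Paxos protocol:

```python
# Minimal sketch of quorum acceptance: an update commits only once a
# strict majority of a shard's replicas have acknowledged it.
def committed(acks, replicas):
    """True when acks form a strict majority of the replica set."""
    return acks >= replicas // 2 + 1

print(committed(2, 3))  # True: 2 of 3 is a majority
print(committed(2, 5))  # False: 3 of 5 acks are needed
```

The same rule applied to reads and writes is what lets a quorum system keep returning correct results even when a minority of servers in a shard are down.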

Distributed databases suffer from a problem described using the CAP or Brewer’s theorem, which states that a distributed database can’t achieve the following all at the same time:

Consistency of data across nodes
Availability, in the sense of always responding to requests
Partition tolerance, working even if part of the network is unavailable

Best Machine Learning Resources for Getting Started


This was a really hard post to write because I want it to be really valuable. I sat down with a blank page and asked the really hard question of what are the very best libraries, courses, papers and books I would recommend to an absolute beginner in the field of Machine Learning.

I really agonised over what to include and what to exclude. I had to work hard to put myself in the shoes of a programmer and beginner at machine learning and think about what resources would best benefit them.

I picked the best for each type of resource. If you are a true beginner and excited to get started in the field of machine learning, I hope you find something useful. My suggestion would be to pick one thing, one book or one library and read it cover to cover or work through all of the tutorials. Pick one and stick to it, then once you master it, pick another and repeat. Let’s get into it.


The Absolute Basics of Indexing Data


Ever wondered how a search engine works? In this post I would like to show you a high-level view of the internal workings of a search engine and how it can be used to give fast access to your data. I won't go into any technical details; what I am describing here holds true for any Lucene-based search engine, be it Lucene itself, Solr or Elasticsearch.

Input

Normally a search engine is agnostic to the real source of the data it indexes. Most often you push data into it via an API, and it already needs to be in the expected format, mostly strings and data types like integers. It doesn't matter whether this data originally resides in a document in the filesystem, on a website or in a database. Search engines work with documents that consist of fields and values. Though not always used directly, you can think of documents as JSON documents.

For this post imagine we are building a book database. In our simplified world a book just consists of a title and one or more authors.
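The indexing step described above can be sketched as a tiny inverted index over a toy book database. The book data and field handling here are invented for illustration; real Lucene analysis (tokenisation, stemming and so on) is far richer:

```python
from collections import defaultdict

# Toy documents: each book has a title and one or more authors.
books = {
    1: {"title": "Lucene in Action", "authors": ["McCandless"]},
    2: {"title": "Elasticsearch in Action", "authors": ["Gheorghe"]},
}

# Build the inverted index: each term maps to the set of documents
# containing it. Analysis here is just lowercasing and splitting.
inverted = defaultdict(set)
for doc_id, doc in books.items():
    terms = doc["title"].lower().split() + [a.lower() for a in doc["authors"]]
    for term in terms:
        inverted[term].add(doc_id)

def search(term):
    return sorted(inverted.get(term.lower(), set()))

print(search("action"))  # -> [1, 2], both books
print(search("lucene"))  # -> [1], only the first
```

Lookups are fast because a query term goes straight to its posting list instead of scanning every document, which is the core idea behind every Lucene-based engine.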


Elasticsearch 1.2.0 And 1.1.2 Released | Blog | Elasticsearch

Elasticsearch 1.2.0 is a bumper release, containing over 300 new features, enhancements, and bug fixes. You can see the full changes list in the Elasticsearch 1.2.0 release notes, but we will highlight some of the important ones below:

 

Java 7 required

Elasticsearch now requires Java 7 and will no longer work with Java 6. We recommend using Oracle's JDK 7u55 or JDK 7u25. Avoid any of the updates in between, as they contain a nasty bug which can cause index corruption.

Dynamic scripting disabled by default

Elasticsearch allows the use of scripts in several APIs: document updates, searches and aggregations. Scripts can be loaded from disk (static scripts) or specified directly within a request (dynamic scripts). Unfortunately MVEL, the current default scripting language, does not support sandboxing, meaning that a dynamic script can be used to do pretty much anything that the elasticsearch user can do.

While it has been possible to disable dynamic scripting for a long time, we've decided to change the default to disable dynamic scripting out of the box. See the instructions for how to re-enable dynamic scripting. Watch this space for a blog post giving more details about the future of scripting in Elasticsearch.
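Assuming the 1.2-era setting name, re-enabling dynamic scripts is a one-line change in elasticsearch.yml; do check the linked instructions for your exact version before relying on it:

```yaml
# elasticsearch.yml: Elasticsearch 1.2 disables dynamic scripts by
# default; this setting re-enables them (at your own risk, since MVEL
# scripts are not sandboxed).
script.disable_dynamic: false
```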

 


DuckDuckGo Relaunches & Starts To Look Like A Real Search Engine


DuckDuckGo, the spunky little search engine that hangs its hat on user privacy, has relaunched today with a new look and feel, not to mention a number of new features like maps and local search, image search and more.

Put it all together, and DuckDuckGo is starting to look less and less “spunky,” and more and more like a real search engine. It’s gone from something that looks like it would appeal mainly to its active developer crowd to something that consumers would use.

In other words, it’s growing up.

Here’s a look at some of DuckDuckGo’s new features — things we’re all familiar with on Google, Bing, etc., but were missing from DuckDuckGo.


Aggregation in MongoDB 2.6: Things Worth Knowing


The MongoDB 2.6 release improved the aggregation framework (one of MongoDB's best features) considerably. We often hear from customers who are unaware of the aggregation framework, or unsure exactly why they should be using it. We frequently find them wrestling with unnecessarily complex and slow methods of solving problems that the aggregation framework is purpose built to solve. With this in mind, we'll take a moment to introduce aggregation before diving into the 2.6 changes. If you already understand the framework, feel free to skip ahead; otherwise read on…
Introducing Aggregation

The aggregation framework in MongoDB has become the go-to tool for a range of problems which would traditionally have been solved with the map-reduce engine. Introduced back in MongoDB 2.2, the framework distills collections down to essential information by using a multi-stage pipeline of filters, groupers, sorters, transformations and other operators. The distilled set of results is produced far more efficiently than other techniques. The set of operations is fixed, though, and does not have the flexibility of map-reduce scripts. Before investing development time in map-reduce, it is best to check whether you can achieve the same results with the aggregation framework.
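A pipeline of this shape can be mimicked in a few lines of plain Python. This is only a sketch of the idea (match, then group with a sum, then sort), not the aggregation framework itself, and the documents are invented sample data:

```python
from itertools import groupby

docs = [
    {"item": "a", "qty": 5}, {"item": "b", "qty": 10},
    {"item": "a", "qty": 7}, {"item": "b", "qty": 2},
]

# Each stage consumes what the previous stage emitted, like the
# $match -> $group -> $sort stages of a MongoDB pipeline.
matched = [d for d in docs if d["qty"] >= 5]                 # "$match"
matched.sort(key=lambda d: d["item"])                        # groupby needs sorted input
grouped = [                                                  # "$group" with a "$sum"
    {"_id": k, "total": sum(d["qty"] for d in g)}
    for k, g in groupby(matched, key=lambda d: d["item"])
]
grouped.sort(key=lambda d: -d["total"])                      # "$sort"
print(grouped)  # -> [{'_id': 'a', 'total': 12}, {'_id': 'b', 'total': 10}]
```

The real framework runs these stages inside the database, close to the data, which is why it usually beats pulling documents out and post-processing them in application code.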

Greg Deckler's curator insight, May 20, 12:58 PM

Fantastic overview of aggregation and how to get started with it.


10 things you shouldn't expect big data to do


Every organization pursues big data with high hopes that it can answer long-standing business questions that will make the company more competitive in its markets and better in the delivery of products and services. Yet in the midst of this enthusiasm, it's easy to build false expectations for big data -- benefits that will never materialize unless you give it the right amount of "help." Here are 10 key things that big data in itself won't do for you unless you take the right steps to optimize its value.


Getting Started With MongoLab And The MongoDB Shell


Yesterday, I blogged about MongoDB: The Definitive Guide by Kristina Chodorow. It is an excellent book and has gotten me really interested in learning more about MongoDB. Apparently, MongoDB is really easy to download, install, and run locally; but, for some reason, I wanted to try running it as a remote database. So, I signed up for MongoLab - a hosted MongoDB platform that has a free developer sandbox. And, within minutes, I had created my database, connected to it from the Mongo Shell, and was performing CRUD (Create, Read, Update, and Delete) operations!
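The four CRUD operations can be sketched with an in-memory stand-in for a collection. This mimics only the shape of the operations, not pymongo or the Mongo shell, and the document contents are invented:

```python
# Toy in-memory "collection" keyed by document id.
collection = {}

def create(doc_id, doc):
    collection[doc_id] = doc

def read(doc_id):
    return collection.get(doc_id)

def update(doc_id, fields):
    collection[doc_id].update(fields)

def delete(doc_id):
    collection.pop(doc_id, None)

create(1, {"name": "Ben"})
update(1, {"role": "author"})
print(read(1))  # -> {'name': 'Ben', 'role': 'author'}
delete(1)
print(read(1))  # -> None
```

Against MongoLab the same four verbs are issued from the Mongo shell over the connection string the dashboard gives you; the shapes of the operations are what carry over.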

To start with, I used the Homebrew package manager to install MongoDB:

brew install mongodb

The installation was very easy and went off without error.

Then, I signed up for a MongoLab account and created a database using the free developer sandbox. This process was also very easy! Once logged in, I clicked on the Create New button to create a new database:


MongoLab allows you to select the Cloud provider for your MongoDB database; I don't really know one provider from another, so I just went with Amazon Web Services, which was selected by default. Then, I chose to use the free developer sandbox, which has limitations, but is perfect for some experimentation.

I am not exactly sure about this following statement, but I believe that the database name you choose - within the developer sandbox - has to be unique. I say this because when I went to choose the database name, "Ben," MongoLab complained that the given name was already taken. I assume this is a byproduct of the shared sandbox and will not be an issue with a dedicated plan.
