Text-Mining, Metadata & Publishing
Things that are relevant to how publishers should use semantic and related information analysis technologies to transform their business.

The Answer Engine: How Humans Can Provide Better Knowledge Than Algorithms

 When you have a question you need answered, do you ask a friend – or a robot? The answer to that question used to be easy.
Vincent Henderson's insight:

This is a nice article that soundly brings people like, I guess, you and me, back to earth to some extent. Its main points are:

  • small niche knowledge is hard to generalize for all-purpose search engines, making good answers in such niches far less relevant than in broader, more central domains;
  • in such niche knowledge areas, specific useful knowledge for specific situations is far more usefully provided by humans who really know.


I think that, as often happens when one tries to make a point, the article overstates its case by seeming to advocate, as a matter of policy or principle, for powering question answering with actual humans answering the questions, as Yahoo Answers or Quora do.


Because ultimately, while Yahoo Answers or Quora do indeed often come up in my search results, I have never asked a question there. I have only found, through Google, answers that people had provided before to similar questions.


But this left me, of course, with the work of reading through the results and figuring out whether they actually answered my question. And that's really the key here: what tools do I have in Google that prevent me from having to go read websites to figure out my answer? Well, if I'm looking for info on films, books, celebrities, sports teams, the weather (the list keeps growing stealthily), I get all the info straight from Google's awesome user experience. But as soon as I want to know something that falls outside that general knowledge and entertainment fare, essentially what amounts to general newspaper rubrics, I have to do the work myself.


That's really the core of the question that Michał Borkowski touches on here. As he points out, much of it is driven by the fact that the data that Google mines from the web to answer questions is actually marked up, so that it's possible for Google to mine it and have a reliable idea of what the data is.


To innovate search experiences for their users, the work of professional publishers should be two-fold:

  1. Mark up our data so that the knowledge contained in the documents or databases is explicit and can be reasoned on for what it is by an algorithm (whether and how it's opened up to Google is a business strategy question, not relevant here); a sketch of what such markup can look like follows this list; and
  2. Devise user experiences à la Google's film or weather widgets, with little applets that respond to the kinds of questions that the professionals we serve need answers to.
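
To give a flavor of point 1, here is a minimal sketch of what such machine-readable markup could look like, using the public schema.org vocabulary expressed as JSON-LD; the document, topics and publisher named here are entirely hypothetical.

```python
import json

# A hypothetical piece of professional content, described with schema.org
# vocabulary so that an algorithm can reason about what it is, rather than
# just index its words.
document_markup = {
    "@context": "https://schema.org",
    "@type": "Article",
    "headline": "Termination clauses in cross-border employment contracts",
    "about": [
        {"@type": "Thing", "name": "Employment law"},
        {"@type": "Thing", "name": "Termination clause"},
    ],
    "author": {"@type": "Organization", "name": "Example Professional Publisher"},
    "datePublished": "2014-09-15",
    "isPartOf": {"@type": "PublicationIssue", "name": "Employment Law Commentary"},
}

# Embedded in a page as <script type="application/ld+json">, this is the kind
# of explicit statement of "what the data is" that search engines, or our own
# answer widgets, can mine reliably.
print(json.dumps(document_markup, indent=2))
```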


In the case of 1. above, it requires a pretty decent knowledge model representing the domains that are covered. That's hard work. It's also work that looks pretty unproductive at first glance. It's hard to make a business case for "develop a knowledge model", because by nature it covers all areas, both information and workflows, and in and of itself it doesn't deliver a product. On the other hand, if you want to build a product that could use such features, your business case can't support developing the knowledge model just for that product. It's the canonical catch-22.


The way to work around that problem, if the business owners don't really get it, which happens, er, sometimes..., is for professionals who implement the stuff, developers, user experience designers, business analysts, and, crucially, the subject matter organization ("editorial"), to be smart: don't just work off of a single product's requirement. Identify the underlying logic of modern product requirements, imagine what the next ones will be, and architect your solutions in ways that open new capabilities that will lower future products' business case thresholds. Build your knowledge model one product at a time, in a way that you can grow. Use the logic of open data and other semantic standards that mean you can grow your model.


In other words: don't silo.


How did I get there from the article? Well, the thing is: if we professional publishers maintain, as we do, significant levels of human expertise running our ships, people who write analysis, questions and answers and so on, we must use that work of humans answering specific questions to add to the knowledge model and make these answers parseable by algorithms. When we answer a niche question, we should ask ourselves: what are the data points and their relationships, and how do they relate to other things that I have? How can I model this in my knowledge model if it's not already covered?


If we apply this type of practice to our knowledge production workflows, in tandem with the writing of texts that answer questions, then we can really produce innovation and make sure that Google can't make us irrelevant once they decide to tackle our professional domains. This makes me think that I should do something about Google Scholar at some point.


So yes, human niche knowledge to answer hard questions is still the most useful thing around. But we can model this meaning one question at a time to make it even more useful to others once the question has been answered.


MonkeyLearn Startup Alley Interview | Disrupt SF 2014 | TechCrunch

Alex Wilhelm interviews MonkeyLearn during Day 1 of Startup Alley at Disrupt SF 2014, on the TechCrunch Disrupt show from TechCrunch TV.
Vincent Henderson's insight:

Hey, so it seems to turn out that MonkeyLearn, of which I spoke last Friday, was founded by Spanish guys. Anyway, I just like the accent.


So I've been testing this since this weekend, and I think it holds its promise, as being "the WordPress of Machine Learning". It holds it in more ways than one, both good and less good.


On the good side, well, it does what it's supposed to do: you can prepare, load and tweak a training set, train the algorithm against it, and get results. And that's a breeze.


It's simple to use (provided you can convert all your files to text, as it only works with text format for now), the UI is clean and simple, and no fuss. It's a beta, so there's some glitches, of course. I lost a whole training set while editing the taxonomy at one point. But that's to be expected in beta. It only happened once.


On the plus side still, you not only get your precision/recall, but you can also see the confusion matrix for the children of your categories. That's very useful to see which categories are creating problems with your performance. It's a real help to tweak the tree and samples. In fact it's pretty much the only help you can get from the data in such a tool.


On the minus side, there's a couple of things. I don't understand why we don't get stats for leaf nodes beyond the confusion matrix. It can't be that hard to do a little graph based on the confusion matrix numbers to highlight the problem leaves. That's a little disappointing.
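
For what it's worth, per-leaf figures are straightforward to derive once a confusion matrix is available; here is a minimal sketch with a made-up matrix (not MonkeyLearn's actual numbers), just to show what I mean.

```python
import numpy as np

# Hypothetical confusion matrix: rows = true category, columns = predicted category.
labels = ["Pizza", "Soups", "Wine", "Beer"]
cm = np.array([
    [42,  3,  1,  0],
    [ 5, 37,  0,  2],
    [ 0,  1, 50,  4],
    [ 1,  0,  6, 48],
])

for i, label in enumerate(labels):
    tp = cm[i, i]
    precision = tp / cm[:, i].sum()  # of everything predicted as this leaf, how much was right
    recall = tp / cm[i, :].sum()     # of everything truly in this leaf, how much was found
    print(f"{label:>5}: precision={precision:.2f} recall={recall:.2f}")
```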


Also, I couldn't find out where the numbers for the confusion matrix come from. They don't add up to the sample totals, which I would have expected them to.


I haven't tested the API and the classification of new text yet, but overall, I have to say that the whole training process is a real breeze. I tweaked, changed, reorganized, split and retrained a 9K doc training set plenty of times in a couple of days, with only that one glitch that I talked about. You can also download your tree once you've tweaked it for archiving/backup purposes, or to rework it on your hard drive, which can be better than online. Online, you can't really do anything in bulk other than change the parents of your nodes, which is already something.


The precision/recall values that I got seem pretty good, depending on the training subsets that I used. I could get some categories up to 99% for several hundred training documents (I mean without too much overfitting risk), and managed to get most of them up to about 80%. I had to split my training set by type of content to get better results also. But that's nothing unexpected on such algorithms. It's not magic. However, as I say, I have not made it to actually testing on new content. It could be that my training set is overfitted. I'll continue testing a little bit over the next few days.


Overall, I think that this tool could be a game-changer, if they manage to grow it and make it really work 24/7, scale and, mostly, add the text-mining features that everybody expects of a text-mining tool these days. Because what they say in the TechCrunch video above is not accurate: you still can't extract topics and entities from your corpus. The only feature available is the supervised classifier.


I'm looking forward to more, it's quite exciting to see such technologies become broadly available. Publishers beware: all this only lowers the barriers to entry to disrupting our positions in our markets by newcomers.


Text Mining Made Easy: MonkeyLearn Go into Public Beta

Text mining innovators-in-the-making MonkeyLearn have finally opened their beta version for the public to sign up. The announcement came at TechCrunch Disrupt in San Francisco, on Monday.
Vincent Henderson's insight:
It was only a matter of time until something like that became available and worked. There used to be OpenCalais, let's see what this offers. I will try it out for sure. In the modern economic environment for publishing, there is no metadata worth talking about without text mining.

Big Data should not be a faith-based initiative

Cory Doctorow summarizes the problem with the idea that sensitive personal information can be removed responsibly from big data: computer scientists are pretty sure that's impossible.
Vincent Henderson's insight:

When I read the title of that piece, I thought that this was going to be about how one should not expect more from Big Data than what it can give you, and such. But it's not really about that, or at least only very narrowly so.


It's about the question of making big claims about big data, such as those Cavoukian and Castro made in a paper where they explain that there's nothing to be afraid of, in terms of privacy, in the spread and commoditization of countless large datasets that have been unilaterally declared "de-identified" by the producers of the dataset.


So it still is a good reminder that we can't just have a vague idea and overall faith in whatever we think would be great in Big Data in principle, and then strenuously argue that this is the truth.


Until you can state what it is that you are seeking from it and how you are going to get it, making broad claims about how useful or wonderful it is is just Big Data religion.


Call for papers: Special Issue on Question Answering over Linked Data | www.semantic-web-journal.net


One of the main [obstacles to broad Linked Data adoption] (...) is that accessing the billions of RDF triples already available requires (...) familiarity with the datasets available and their underlying schemas.


[T]here is a growing amount of research on interaction paradigms that allow end users to access linked data and hide the complexity of Semantic Web standards behind an easy-to-use interface. Especially question answering systems play a major role, as they allow users to express arbitrarily complex information needs in an intuitive fashion and, at least in principle, in their native language.

Vincent Henderson's insight:

This call for papers does a good job of succinctly identifying one of the main problems of Linked Data: the knowledge models determine how knowledge is stored, and therefore how it can be retrieved, thus requiring a user to have a lot of prior knowledge about the data in order to get information from it.


Conversely, the problem with Big Data is its ability to provide beautiful answers that leave you searching for a question.


Linked Data, on the other hand, is by nature driven by questions. A knowledge model is a question-answering system. Every relation that exists between two concepts is implicitly answering a question about each concept.


So while Big Data approaches essentially attempt to free themselves from formal syntax and curated knowledge at the expense of explanation, Linked Data focuses on curated knowledge at the expense of accessibility.


The idea behind this call for papers is about building an interpretive layer that would transform (translate) natural language questions into some kind of  list of query parameters that could then be passed on to multiple knowledge sources (presumably using some kind of map of these open parameters to the underlying ontology schema?), and then aggregate the results into answers. It's a beautiful idea, and a great avenue of work.
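
To make the idea concrete, here is a toy sketch of such an interpretive layer: a couple of parameters extracted from a question are bound into a SPARQL template and sent to a public endpoint (DBpedia, purely as an example). The question-analysis step itself is waved away here; a real system would need far more sophisticated parsing and schema mapping.

```python
from SPARQLWrapper import SPARQLWrapper, JSON

def answer(params):
    # params is what a (hypothetical) natural-language layer extracted from the
    # user's question, e.g. "Who wrote The Trial?" ->
    # {"relation": "dbo:author", "thing": "The Trial"}
    query = """
        PREFIX dbo: <http://dbpedia.org/ontology/>
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
        SELECT ?answer WHERE {
          ?work rdfs:label "%(thing)s"@en ;
                %(relation)s ?answer .
        } LIMIT 5
    """ % params

    endpoint = SPARQLWrapper("https://dbpedia.org/sparql")
    endpoint.setQuery(query)
    endpoint.setReturnFormat(JSON)
    results = endpoint.query().convert()
    return [row["answer"]["value"] for row in results["results"]["bindings"]]

print(answer({"relation": "dbo:author", "thing": "The Trial"}))
```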


The open battle between Big Data and Linked Data will continue, until such time as Big Data accepts that there are no explanations to phenomena without a priori knowledge (aka "theory"), and Linked Data figures out that knowledge without the ability to answer questions is meaningless and, most importantly, useless.


This means that Big Data needs knowledge models in order to discriminate against meaningless answers or recognise meaningful questions, and Linked Data can use Big Data thinking to unify the multitude of knowledge silos that it links together via the content assets that it describes and the questions that users want to ask.


It's an interesting avenue for innovation.


Content Is King, But It Won't Be For Long: Analyst

This won't go over well with the media moguls hobnobbing at the Allen & Co retreat in Sun Valley. But it's the thesis behind the "neutral" rating that Barclays Capital's Kannan Venkateshwar ass...
Vincent Henderson's insight:

The absolute core sentence in there:


"They [successful newcomers like Netflix or Amazon] focus on shows and search engines — not networks and schedules."


For professional publishers, this translates as "They focus on knowledge, metadata and search engines, not books, journals and looseleafs".


Why is that important? Because:


“traditional media companies get pushed further back into the value chain, further away from a direct relationship with the consumer.”


Exactly.


Open data can be bigger than big data. An interview with open data czar Gavin Starks - Digits - WSJ

Open Data Institute CEO Gavin Starks believes there is a strong business case to be made for governments and businesses to open up their big data. He also thinks individuals will eventually own their private data and license it back to companies in exchange for money, goods or services.
Vincent Henderson's insight:

Short and to-the-point interview about what the open data opportunity is.


As far as we professional publishers are concerned, it's time we recognized that, by and large, what we do is resell open data with additional knowledge baked in somehow.


Strategies around the open data proposition should be driving much of our thinking.


So how does driving a revenue-generating community around open government (or scientific) data play out for us in the future? Historically, we have taken the open data and walled it in with our added value.


We have devised so-called "freemium" models to lure people to our content. But these are still sales and marketing tactics; we still have to find a place in the open information stream and organically drive revenues from information flows and knowledge needs. This would require us to establish our presence as a topical expertise nexus in the overall online user experience of professional knowledge workers, rather than sell applications and subscriptions.


Easier said than done.


What universities have in common with record labels

If you spent the 1990s plucking songs from a stack of cassettes to make the perfect mixtape, you probably welcomed innovations of the next decade that served your favorite albums up as individual songs, often for free. The internet’s power to unbundle content sparked a rapid transformation of the music industry, which today generates just...
Vincent Henderson's insight:

Very thought-provoking piece about the lessons to be learnt from the music industry for universities. I would argue that this analysis holds for any producer of information and content that people rely on for critical knowledge, and that includes professional publishers.


But what isn't mentioned in the piece is **how** the music industry managed to "unbundle" content, as Martin Smith puts it.


How do you unbundle stuff and then curate it back into diverse information sets? You need loads of metadata, most of which is mined. Let's remember that, whether it's Genius, Deezer, Shazam or all kinds of other ways, aggregating playlists is based on **both** people putting things together **and** machines mining metadata from the music files (bpm, tunes, tones, length, and a host of other properties), creating fingerprints as well as descriptive facets.
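
A small illustration of the "machines mining metadata from the music files" half of that equation, using the open-source librosa library to pull a few descriptive facets out of an audio file; the file path is hypothetical, and this is only a sketch of the idea.

```python
import librosa

# Load a (hypothetical) track and let the machine estimate some of the
# descriptive facets that playlist aggregation relies on.
y, sr = librosa.load("some_track.mp3")

duration = librosa.get_duration(y=y, sr=sr)           # length in seconds
tempo, _beats = librosa.beat.beat_track(y=y, sr=sr)   # estimated bpm
chroma = librosa.feature.chroma_stft(y=y, sr=sr)      # pitch-class profile, a crude "tune" fingerprint

print("duration (s):", round(duration, 1))
print("estimated tempo (bpm):", tempo)
print("chroma fingerprint shape:", chroma.shape)
```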


So this is also what we have to learn from the music industry: disaggregating and re-aggregating content requires massive metadata of all kinds, and machines and algorithms to mine a lot of it.


Tamr Emerges With $16M to Crack Data Curation for Enterprises | Xconomy

One of the Boston tech scene’s most dynamic duos is at it again. Yes, Andy Palmer and Michael Stonebraker are coming out of stealth with their latest compa
Vincent Henderson's insight:

It's when I hit Tamr that I started my Listly list of data curation software.


Tamr seems to be buzzing a lot right now, and this is an interesting article about them.


What caught my attention here was that Tamr "has done large-scale pilots with the likes of Novartis and Thomson Reuters. Palmer cites one customer who found that, of the $60 million worth of data it licensed, about one-third was redundant. In other words, the company could save $20 million if it had better visibility into its data."


This certainly feels, *a lot*, like the kinds of problems global professional publishers have, and the name of Thomson Reuters sure piqued my interest.


The question with all these tools remains: many of them claim "semantic analysis" of one kind or another. But what kind of semantic algorithms are involved, what kind of information do they infer from the underlying data, to what extent they distinguish between data, information and metadata, I don't yet know.


I sure am very curious, though.


Exclusive Interview: Michael O’Connell, Chief Data Scientist, TIBCO on How to Lead in Big Data

Vincent Henderson's insight:

Big Data is mostly talked about in terms of Business Intelligence, analyzing exhaust data, with a constituency of business management. But the insights gained apply broadly to information.


What professional publishers have always done is to analyze real-world information and digest it into actionable or otherwise enlightening content. Big data is merely another source of real-world data now. Publishers of information need to figure out what data is out there (and where)  that may contain useful insights. 


O'Connell insists on the need to use tools to first create an analytical context that enables the discovery of potential sources of insights, rather than scour the data and hope for the best.


Also, useful distinction between "at-rest data" and "fast data".


Sitecore DMS: Real-Time Personalization | Sitecore Video Player


Mapping content to personas.

Vincent Henderson's insight:

While this CMS & video are very explicitly targeting marketing campaigns for catalog product sales, it's a tool that analyzes the context of a user's activity, maps it to personas, and presents content that is relevant to it.


This kind of mapping of personas to properties (metadata) of information and content is what professional publishers need to consider as their core competency now. Writing or publishing the documents is something anyone can do now. The core challenge is: who knows the users and customers best, to hone content delivery to customer needs in real time, and to optimize SEO correspondingly for marketing purposes?
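
A minimal sketch of the underlying idea, with hypothetical personas and content metadata: each persona is a weighted bag of topics, each document carries topic tags, and the delivery layer simply ranks documents against the persona inferred from the user's current activity.

```python
# Hypothetical persona profiles: topic weights inferred from browsing behaviour.
personas = {
    "tax_practitioner": {"tax": 0.6, "compliance": 0.3, "case-law": 0.1},
    "hr_manager": {"employment": 0.5, "payroll": 0.3, "compliance": 0.2},
}

# Hypothetical content metadata: topic tags per document.
documents = {
    "doc-101": {"tax", "case-law"},
    "doc-102": {"employment", "compliance"},
    "doc-103": {"payroll"},
}

def rank_for(persona_name):
    weights = personas[persona_name]
    scored = [
        (sum(weights.get(topic, 0.0) for topic in topics), doc_id)
        for doc_id, topics in documents.items()
    ]
    return [doc_id for _score, doc_id in sorted(scored, reverse=True)]

print(rank_for("hr_manager"))  # -> ['doc-102', 'doc-103', 'doc-101']
```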


This is what differentiates good publishers from commodity content buckets.


OpenRefine

OpenRefine : A free, open source, power tool for working with messy data
Vincent Henderson's insight:

I feel like I'm late to the party here, but this tool looks like an incredibly, fantastically useful support for dealing with messy lists of would-be controlled values.
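
For context, the heart of what makes it so useful for messy value lists is its clustering. Here is a rough sketch of the "fingerprint" key-collision method it popularised; this is my own simplified reimplementation of the idea, not OpenRefine's code.

```python
import string
from collections import defaultdict

def fingerprint(value):
    # Lowercase, strip punctuation, split into tokens, dedupe and sort:
    # variants of the same value tend to collapse onto the same key.
    cleaned = value.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(sorted(set(cleaned.split())))

messy_values = ["Wolters Kluwer", "Kluwer, Wolters", "wolters KLUWER.",
                "Thomson Reuters", "Reuters, Thomson"]

clusters = defaultdict(list)
for value in messy_values:
    clusters[fingerprint(value)].append(value)

for key, variants in clusters.items():
    print(key, "->", variants)
```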


BPO: Pronounced dead, but still very much alive

If I had a Bitcoin every time someone claimed that BPO is “dead” / “hitting the bottom” / “merely staff augmentation that’s going away soon”, I could commission a whole team of robots to write this blog until the new year.
Vincent Henderson's insight:

BPO/Outsourcing/Offshoring is indeed here to stay. In Publishing, this has been used mostly to do "conversion and tagging" supporting whatever legacy publishing process happened to be there.


But in the new big data/text mining era, they should be used to support publishing innovation by providing data cleanup/prep in support of semantic-technology-driven curation business processes.


Update on MonkeyLearn after further testing - Reporting features


As per my previous few posts, I've been investigating MonkeyLearn, the recently beta'd online classifier, self-described by its publisher as "the Wordpress of Machine-Learning". After some initial fairly basic testing of a couple of days, I published a review on it here a couple of weeks ago.


I have now gone through far more extensive attempts at training MonkeyLearn to produce topical tags for which I have a decent source of content to compare to, including the expected tags. I'm in a position to have a pretty complete understanding of MonkeyLearn's strengths and weaknesses. None of the comments in my initial review are overruled, so you can still take a look at that. This is additional, focusing on more advanced and detailed questions.


Before I go on, I should say that MonkeyLearn got in touch and made comments on my review, and was very welcoming of the feedback. So hopefully that will translate into feature evolution soon.


What's MonkeyLearn?


So to start with, let's be clear about what MonkeyLearn is at this point: it's a classifier. You create category nodes, you give it text examples, and it will attempt to classify other text in these nodes, based on similarities of some kind. I have not spoken to MonkeyLearn about their algorithm, so I don't know what's implemented under the covers. However, it's very likely purely token-based, and does no morphological analysis, and likely no noun-phrase identification or anything of the sort. I have not tested that assertion in any way, but given that I don't see any options that have anything to do with language settings, I can only assume that it's purely analyzing the statistical distribution of single text tokens (i.e. words, not sentences or expressions).
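
To be clear about what a purely token-based classifier looks like in practice, here is a toy sketch in scikit-learn. I am not suggesting this is MonkeyLearn's actual algorithm; it is just an example of the family of approaches that would be consistent with what I'm seeing.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny hypothetical training set: texts and their category nodes.
texts = [
    "tomato mozzarella basil crust oven",
    "broth vegetables simmer ladle",
    "tannin grape vintage barrel",
    "hops malt brewery lager",
]
labels = ["Pizza", "Soups", "Wine", "Beer"]

# Bag-of-words counts plus a simple statistical model: no morphology,
# no noun phrases, just the distribution of individual tokens.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)

print(model.predict(["grape barrel aged vintage"]))  # -> ['Wine']
```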


So I'm leaving aside any considerations of this type at this point, and will focus only on the features.


MonkeyLearn Features


There are two key (related) things that will determine the quality of the outcomes of a statistical classifier, in my opinion:

  • the ability to fine-tune the training set,
  • the ability to get insights into where the accuracy leaks are.
Stats reporting for sub-nodes


As I mentioned in my first review, I find the statistical reporting of MonkeyLearn to be very, very basic. Professional-grade offerings will give you document-level reporting on confidence levels, and will give you node-by-node precision/recall values. MonkeyLearn, for reasons that I can't really explain, only gives you these figures and a confusion matrix for nodes that contain other nodes.


An unfortunate side-effect is that for nodes that contain a single sub-node, it will always give you a 100% number. Which can only mean that its accuracy calculations are answering the question "for stuff that's in this node, how am I doing discriminating between sub-nodes?" If there's only one sub-node, then there's nothing to discriminate between, so it's by default 100%.


Until you understand that, it gives you a very misleading feeling. When I click on a node to see its statistics, what I expect is to see the statistics for that node, meaning I want to answer the question "for this node that I just clicked on, what is the percentage of documents that are in here that should be in here?" (for precision; same comment for recall or accuracy).


Documents in non-leaf nodes


This is a problem when you put it alongside another thing in MonkeyLearn that I liked to start with, but am now not sure is what I expected: you can classify texts in nodes that contain other nodes. This means that in a given node, you can have some texts and other nodes. In my case, wherever this happens in my training sets, there is only one sub-node.


Combine that with the "100% effect", and basically I can't get any useful information for these nodes. When I click on a node like that, I have no information about whether the texts in there train properly, and I have no information about even how this node can discriminate between the node level and the sub-node level.


Even the confusion matrix, which is otherwise the only way to get detailed information on leaf nodes, doesn't help there, since it's only showing one node.


In my opinion, when there are both texts and sub-nodes, the classifier should assume that the texts at the main node level should behave as though they were in a sub-node alongside the other sub-nodes in the main node. Otherwise, it feels like these documents are lost in a limbo.


So to conclude on this topic, I'd say that there's a pretty fundamental design flaw there. It seems that the reporting statistics are going about it wrong; they are answering the wrong question. It's as if they can only calculate stats about the lower level in aggregate, whereas we want stats for each level in particular. There isn't a single stat that gives me usable information about one single node (let alone documents).


And that makes training set tweaking very painstaking. It's even weirder that this data is not available, because when you send a document to the classifier, it gives you precise confidence values. So why not give those values that come out of the training set documents?


Is there a deeper problem?


But this also raises another question: to what extent do the reporting flaws reflect algorithm flaws?


I suspect that it does to some extent. Let me explain.


If I have two levels down from the root like this, with the associated statistics (this kind of relative accuracy is quite typical):


-Things to ingest (75%)

---Foods (95%)

--------Pizza

--------Soups

---Drinks (90%)

--------Wine

--------Beer


I now tentatively believe that this is telling me the following things:

  • of all the documents that I am testing, only 75% find themselves accurately somewhere in the Foods or Drinks sub-tree.
  • of the documents that I have put in Foods (therefore with only a 75% confidence), I'm 95% sure that those that should go either in Pizza or in Soups are in the right category.


My question is: for the second statement, did the testing algo get rid of the documents that shouldn't be there to start with? Did it do so only for the calculation, or also for the training?


If yes, then that means that the 95% accuracy is misleading, since it's really 75% * 95% of the documents that should be in there that are in there (sort of).


If no, then how is it possible to have 95% of all the right documents in Pizza and Soups, yet only have 75% confidence that things that should be in Foods are indeed in Foods? There's really some logic that I don't get here, and I don't think it's just me not getting the point. There's a usability and logic issue here. The question is to what extent it is just flawed reporting logic, or whether it reflects deeper classifier logic.
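
Just to spell out why the answer matters, here is the arithmetic of the two readings, using the hypothetical numbers above.

```python
# Reading 1: the 95% is conditional on already being in the right branch.
p_branch = 0.75             # chance a document lands correctly under Foods/Drinks at all
p_leaf_given_branch = 0.95  # chance it then lands in the right leaf (Pizza vs Soups)
print(f"end-to-end accuracy under reading 1: {p_branch * p_leaf_given_branch:.2%}")  # ~71%, not 95%

# Reading 2: the 95% is an absolute, end-to-end figure for the leaves.
# Then it is hard to see how only 75% of documents can be in the right branch
# while 95% of them are in the right leaf of that branch.
```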


The confusion matrix


MonkeyLearn also has a confusion matrix view, where you can see to what extent the documents that should be somewhere are indeed there. As I mentioned, this is really what makes the reporting useful in terms of having usable data. But it certainly doesn't make it user-friendly.


In the response that I received from MonkeyLearn, they did address that point and clarified how the confusion matrix numbers relate to the documents. I would have liked to spend some time looking at that in detail to get a firm grasp on it, and in particular analyze the confusion matrices to shed light on the aggregate accuracy numbers, but well, I had other things to do and didn't get to it.


That's for the stats. Next, let's look at questions of accuracy and results.

Vincent Henderson's insight:

So on the whole, MonkeyLearn certainly doesn't shine by its ability to give you really meaningful, easy-to-use information to help you analyze your results and curate your training set. It has a lot of flaws there, both in terms of depth of reporting but also, until I'm proven wrong, in terms of the reporting logic itself, which left me confused. And there is no documentation to speak of that attempts to explain any of this, either.


It's not useless; the confusion matrices do help. But it makes the job very tedious and leaves you with certain situations that you can't effectively troubleshoot (non-leaf-node documents). And it doesn't give you any document-level data, so you can't analyze how the documents themselves impact the classifier.


MonkeyLearn did tell me that they found these comments very interesting and that the reporting stats issue would be high up on their backlog. So let's stay tuned to see how they improve there. I also expect that they will be responding to the questions raised in this part of the review too.


The next post will cover the classification performance itself.


Big Data Analytics Vs. The Gut Check | TechCrunch

Data is more varied and fast-moving than ever, and analyzing it effectively now requires highly sophisticated software and machinery. But where does business experience come in?
Vincent Henderson's insight:
A very concrete piece using feedback from real-life analytics projects. The key notion here is that data is not knowledge. Knowledge is what you see in the data. It's the sense that you make of the data. That's knowledge. No answers without questions, as always. Data can help you ask the right questions, and it can give you answers to some of them, but it doesn't ask your questions for you. Likewise for metadata: metadata is the thing that will help answer questions. It's not in and of itself an answer. If I have a document that concerns a lawsuit between a hospital and a patient, I can add metadata to that effect. But that's not an answer to a question. It's just information. It's only when I ask the question "how likely am I to win my case?" that this metadata can contribute to me getting an answer, and gaining knowledge in the process.

AllAnalytics - Mary E. Shacklett - Making Analytics a Corporate Strategic Role


CXOs need to ask the bold forward questions from big data, not just look in the rear-view mirror.

Vincent Henderson's insight:

People often talk about "big-data analytics", but this obfuscates the difference between the two. Or at least, like all buzzwords, it makes it impossible to tell what the person actually means by using them.


For instance, if I'm analyzing a few hundred Excel lines of some data, making a few crafty formulas in a "Synthetic view" sheet, I am doing analytics, aren't I?


But the bold questions to ask from Big Data are the prospective questions, such as "what are the prospects most likely to convert to customers?", and such like.


So the question is, in terms of our customers' experience, what could we as professional publishers ask of our data?


One question that I find particularly compelling is "what content type is most likely to be clicked on for this search query?" Or "What topics are most likely to be covered by the last opened document for this search query by this user?"


It's the things that enable you to make a decision, such as "what information should I show this person right now?", that really make a difference.
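
As a toy sketch of the kind of model that could sit behind a question like "what content type is most likely to be clicked on for this search query?", trained on a handful of hypothetical query/click pairs:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical usage data: the search query a user typed, and the type of
# content they ended up opening.
queries = [
    "vat rate cross border services",
    "dismissal notice period template",
    "transfer pricing documentation requirements",
    "employment contract clause sample",
    "withholding tax treaty rates table",
    "shareholder agreement model clauses",
]
clicked_type = ["commentary", "template", "commentary", "template", "table", "template"]

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(queries, clicked_type)

# At query time, the predicted content type can drive what to show first.
print(model.predict(["notice period clause example"]))
```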


Text-mining-assisted biocuration workflows in Argo

A Web-based Text Mining Workbench "Text-mining-assisted biocuration workflows in Argo" http://t.co/kUXnbCV20P #textmining #Database
Vincent Henderson's insight:
A very instructive (and exhaustive!) case study of the implementation of a professional information curation tool. We're not talking about curating the latest lolcat to drive content to your marketing site here. We're talking heavy-duty, expert review, annotation and analysis of highly specific and critical data. This (pending further review of the tools) looks like the best-practice professional editorial curation workflow for professional publishers. It's not bleeding edge, but reasonably state of the art, and it uses all the stable technologies and semantic approaches available today on the market. The next innovation step would include more big data trawling and analytics upstream, which would be required for non-medical-scientific information (e.g. legal), which enjoys far less out-of-the-box structure and predictable entities and relationships. However, if we are to truly enter 21st-century publishing, there is no alternative but to structure and formalize our knowledge domains across the board, in order to be able to use such curation technologies effectively.

Machine Learning | Microsoft Azure - Analytics for the masses


Analytics for the masses is coming - Get ready now: get the data out of your content.

Vincent Henderson's insight:

Assuming this turns out to be what it claims to be (and it will eventually), this is a big deal.


Predictive analytics is arguably the single most important enabler of point-of-[whatever professional actions you care about] decision-making support.


For publishers of information whose strategy depends on "the right information at the right time", "decision-support" and the like, predictive analytics will be the game-changer.


But doing predictive analytics requires what? It requires data to analyze in order to predict future data, and therefore future courses of action. I haven't read much about predictive analytics driving relevance. We mostly talk about matching profile metadata to content metadata to fine-tune content relevance. But ultimately, the ability to formalize the data inherent in our content and in the behavior of our users will make it possible to apply this sort of analytics to caselaw and regulatory trends, and to infer information needs based on our customers' actions and states.


But the first step is: model the data in our content. This means performing expert analysis of the content and then applying data science to it.


5 Ways to Align Content Publishing and Semantic Web Optimization


Publishers must integrate their content marketing and semantic search optimization practices in order to create a natural single-track strategy.

Vincent Henderson's insight:

Basic stuff, but fundamental. Highlighting the 2 main questions of:


- Authority / Legitimacy

- Accessibility / Findability / Shareability


While established professional publishers tend to obviously have the edge on the former, we tend to lag on the latter, due to a culture of content protection.


The more we protect our content, the more we leave space for newcomers to take the stage.


We've successfully navigated the digitisation transition; now we have to insert ourselves in the stream or river of knowledge. Does a rising tide lift all boats?


Overcoming The Initial Challenges Of Big Data

Big data expert Joe Caserta says understanding exactly what the technology can and cannot do is essential to the success of big data and deep analytic initiatives.
Vincent Henderson's insight:

Rather boring interview, but contains a couple of interesting tidbits:


- "Amazon-like" recommendation engines are now accessible to non-data scientists thanks to Mahout-type (open-source machine-learning) technology

- the future of data is the "data lake", where you have a loosely structured and loosely governed large dataset, on which you perform highly structured and highly governed interventions to extract knowledge when you need it.


The second point is interesting. Some of us have been operating for the past 10 years or so under the mantra "structuring your data is the starting point if you want to get any useful knowledge out of it on a large scale". Big data and machine-learning technology, and the "data lake" approach may make us think differently about the problem.


We used to insist on strict governance and compliance on the back of a precisely designed knowledge model to govern the data set. Now, it looks like with new technologies and available processing power and algorithms, one can take a looser view to data governance and advocate for lighter governance but a strong data-science-driven curation of the data set to extract and publish knowledge  from it.


The great thing about this new stance is that it allows for a better "needs-driven" effort. The previous mindset of strict governance tended to create "artificial" needs for the data, dictated by governance and compliance requirements. It also had the drawback of strongly constraining the realizability of new needs that were not already baked into the data, thus raising business case thresholds.


Now, a looser governance need will reduce the overhead that needs to be pushed back to the data producers, and lower the threshold for addressing new output needs that were not originally envisioned.


But that is only possible with an effective data-management team who knows data-science, machine-learning and software.


Legal Current Awareness Must Help in the Business of Law

As promised in the prior post about how the new normal for legal increases the need for current awareness, I want to offer some thoughts focusing on how current awareness is increasing in importanc...
Vincent Henderson's insight:

The key to effective current awareness is: Context!


Ultimately, this is what metadata is: Context.


More than any other kind of information publishing, arguably, current awareness epitomises what we could see as a shift from “Content is King” to “Context is King!”.


Because context is what makes content relevant, and it’s by being able to contextualize content that we can reduce information overload for information users.


A nice, short and to the point illustration of that principle here by John Barker.


Listly List - (Big-)Data preparation, cleanup and discovery

Listly List - (Big-)Data preparation, cleanup and discovery - Applications and other tools to prepare data in the world of "big data".
Vincent Henderson's insight:

I've scooped a few data preparation and analysis tools lately, and was going to continue, given the obviously critical nature of such tools these days. I figured I'd create a list.ly instead.


I'm not putting in everything I come across, just those that seem relevant to text-mining and related metadata management and discovery issues. Putting them on the list is not an endorsement, because I've obviously not tried most of them, but I try to make sure that they're alive and well and seem to me to have potential use for content and metadata publishing in the big data era.


Most of the tools in data preparation and cleanup, as well as most of the big data discourse, seem to focus a lot on the marketing angle, obviously trying to cater first to the largest possible market (what business doesn't have messy data that they'd like to make sense of?). But in the world of professional information publishing, we very likely have uses for these kinds of data-science approaches, to manage our metadata and the entities that we are interested in and that may come up in all kinds of various contexts, and that we would like to relate.


I hope I'll manage to keep it up to date, and recruit helpers to enrich it.


Harry Surden - Computable Contracts on Vimeo

This is "Harry Surden - Computable Contracts" by ReInvent Law Channel on Vimeo.
Vincent Henderson's insight:

Quick talk surveying what computable contracts mean.


Quite relevant for legal publishers: what is a legal clause or a legal provision, from a computing perspective? How can we translate that into algorithms?
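
One very small way to picture "a clause from a computing perspective": its conditions and consequences captured as data plus a predicate, rather than prose. The late-payment clause below is hypothetical and not any real drafting standard.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class LatePaymentClause:
    """A toy computable rendering of: 'If payment is not received within
    30 days of the invoice date, interest of 8% per annum accrues.'"""
    grace_days: int = 30
    annual_interest: float = 0.08

    def interest_due(self, invoice_date: date, paid_date: date, amount: float) -> float:
        days_late = (paid_date - invoice_date).days - self.grace_days
        if days_late <= 0:
            return 0.0
        return amount * self.annual_interest * days_late / 365

clause = LatePaymentClause()
print(clause.interest_due(date(2014, 1, 1), date(2014, 3, 15), 10000.0))
```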


But the question always remains: beyond the computing question, what applications does this enable?


PaxataTV - YouTube


A commercial company providing data cleanup and preparation software. 

Vincent Henderson's insight:

This looks like a nice data management tool, but I haven't seen what it does better than OpenRefine yet.


Vurb's Contextual Search Engine Blows Away Those Stupid Lists Of Links - TechCrunch

Vurb's Contextual Search Engine Blows Away Those Stupid Lists Of Links (TechCrunch). Search is outdated. Google steers you to the right section of the library, but doesn't answer your question or compile that answer with others to help you make a decision.
Vincent Henderson's insight:
While the article woefully misrepresents what the Google UXP is these days, the amount of effort that is going into context-driven search is the story here. All these offerings are targeting B2C preoccupations. What are the contexts that define professionals? That's the real question publishers have to answer.