Text. English. Chinese. Multi-structured. Language. Fuzzy. Logs. Hard-to-parse. Rich. Semi-structured. Whatever you want to call data that doesn’t fit neatly into tidy little rows and columns these days, can we please stop calling it “unstructured”? I feel a bit like Don Quixote in even pursuing this topic, but after 15-plus years of (mostly) working on search and text processing (including writing a book on the subject), I can’t help but feel that it’s time for the word “unstructured” to be retired and for us to find a better term to describe all of this stuff spewing from us and our computational creations.
Why all the (somewhat tongue-in-cheek) vitriol towards such a simple word? When I’m feeling cynical, I think that, in the early days of databases, someone coined “unstructured” as a derogatory term to mean “all the stuff a database isn’t good at working on.” If “structured” is good, then “un”-structured must be bad, right? The problem is that working with text is one of the defining computational challenges of our time. We need our best and brightest working on it, and not just so we can better target ads to consumers. It’s too full of promise to describe with such a dismissive word as “unstructured.” Numerical data? Child’s play! Text? Now there’s a real challenge.