Language Tech Market News
The Home of Multilingual Intelligence
Curated by LT-Innovate

US Law and Corpus Linguistics Technology Platform Now Open Free of Charge

This includes: the Corpus of Founding Era American English, a collection spanning 1760 to 1799 that contains nearly 100,000 documents from the founders, ordinary people and legal sources, including letters, diaries, newspapers, non-fiction and fiction books, sermons, speeches, debates, legal cases and other legal materials.
The Corpus of Supreme Court of the United States, a collection of all Supreme Court opinions in the United States Reports through the 2017 term (with the 2018 term soon to be added).
Early English Books Online (EEBO) and Eighteenth Century Collections Online (ECCO), as corrected by the Text Creation Partnership (TCP), plus the Evans Bibliography (University of Michigan).

@LangNet Partners with @ENEBA to Boost #SpeechTech in Gaming

LangNet has signed a new partnership with ENEBA, a blockchain-based video game distribution platform that seeks to bring new revenue streams to game developers and to provide a personalized user experience in the game distribution industry. In addition to voice search features, developers on ENEBA will be able to use LangNet’s language AI ecosystem of language data, trained language models, and public APIs.

DefinedCrowd Raises $11.8M to Create Bespoke Datasets for AI Model Training

The three-year-old Seattle-based startup, which describes itself as a “smart” data curation platform, offers a bespoke model-training service to clients in customer service, automotive, retail, health care, and other enterprise sectors. It’s raised $11.8 million in a funding round led by Evolution Equity Partners, Mastercard, Kibo Ventures, and Energias de Portugal (EDP), and secured additional capital from current investors Sony, Portugal Ventures, Amazon, and Busy Angels.

@snips Plans to Generate Better Conversational Data by Incentivising 'Workers'

Our data generation tool has been available as a centralized, fiat-based SaaS for a few months, and has been used across all our enterprise customers. Our goal is now to decentralize it, using the upcoming Snips AIR token to incentivize “workers” to produce high-quality training data. The first step is to hire workers and onboard them. This is done via a bounty program, where they are asked to generate data for internal Snips tasks in exchange for tokens. This lets workers start earning tokens while learning how the process works. Anyone can be a worker, including users of Snips AIR. Once a worker has accumulated enough experience, they can start participating in data generation for developers.
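The onboarding flow described above (bounty tasks first, developer campaigns once experienced) can be sketched as a simple state machine. Everything here is illustrative: the class, the threshold, and the reward amounts are invented for the sketch and are not Snips' actual mechanism.

```python
# Sketch of the worker pipeline: bounty tasks pay tokens and build
# experience; enough experience unlocks developer campaigns.
EXPERIENCE_THRESHOLD = 5  # invented value for illustration


class Worker:
    def __init__(self, name):
        self.name = name
        self.tokens = 0
        self.experience = 0

    def complete_bounty_task(self, reward_tokens=1):
        """Internal bounty task: pays tokens and builds experience."""
        self.tokens += reward_tokens
        self.experience += 1

    @property
    def can_serve_developers(self):
        """Workers graduate to developer data-generation work."""
        return self.experience >= EXPERIENCE_THRESHOLD


w = Worker("alice")
for _ in range(5):
    w.complete_bounty_task()
```

After five bounty tasks, `w.can_serve_developers` is true; a newly onboarded worker with no completed tasks would not qualify.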

LT-Innovate's insight:

It looks as if decentralized, blockchain-type systems could both attract and reward targeted data contributors. This approach is already widely practiced for language-data production in Asia.

How @Snips Supports Skill Creation Using Decentralized Voice Data Generation

Data generation is an important tool for developers to create high-quality skills. Thus, the more consumers adopt Snips, the more we can expect the number of data generation campaigns to increase, and with it the total transaction volume. With the worker staking mechanisms in place, we can expect this to help create a sound economy around the Snips token, while simultaneously solving the issue of data quality.
LT-Innovate's insight:

Worth reading in full for an overview of the Snips skill-developer pipeline.

Alibaba Expands Cloud to ROW

Alibaba Cloud, the cloud computing arm of Alibaba Group, has partnered with Istanbul-based B2B services provider e-Glober to offer its cloud services in Turkey. It has expanded overseas to Singapore, Indonesia and Malaysia in Southeast Asia; Frankfurt, London and Paris in Europe; New York and San Mateo in the United States; Dubai in the Middle East; as well as Seoul, Tokyo and Sydney. It currently has more than 2.3 million customers worldwide.

LT-Innovate's insight:

More cloud big data resources in more geographies means more language data which in turn means more opportunities for closer-to-customer machine-learning language technology applications worldwide. It also means more competition between Amazon, Microsoft and Alibaba in data tool provision.

@Bottos — A Decentralized AI Data-Sharing Network

The Bottos project will comprise two developmental stages. The first stage will focus on building a beta marketplace where AI companies can acquire specified high-quality, low-cost training data to mature their AI models. Currently, unless you are a big-data titan such as Google, Amazon, or Facebook, it is difficult to finance and scour for such necessary data. For this reason, small and medium enterprises, research institutions, and other entities in need of AI training data cannot complete their models.

LT-Innovate's insight:

There is considerable interest in a new regime for data sharing in a Big Data world, further sharpened by the AI boom. In the case of language tech, the problem is not so much the GAFA monopoly as the literal lack (i.e. non-existence) of "low-resource" language data in digital form. In China, crowdsourcing is already being used to expand the language-data base. Other ideas are needed.

Conceptual Captions: A New Dataset and Challenge for Image Captioning

We introduce Conceptual Captions, a new dataset consisting of ~3.3 million image/caption pairs that are created by automatically extracting and filtering image caption annotations from billions of web pages. Introduced in a paper presented at ACL 2018, Conceptual Captions represents an order of magnitude increase of captioned images over the human-curated MS-COCO dataset. As measured by human raters, the machine-curated Conceptual Captions has an accuracy of ~90%. Furthermore, because images in Conceptual Captions are pulled from across the web, it represents a wider variety of image-caption styles than previous datasets, allowing for better training of image captioning models.
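Conceptual Captions is distributed as tab-separated files pairing each caption with an image URL. A minimal sketch of loading such a file, assuming that caption-then-URL field order (the example URLs and captions are invented):

```python
import csv
import io


def load_caption_pairs(tsv_text):
    """Parse caption/image-URL pairs from a Conceptual Captions
    style TSV, where each line holds one tab-separated pair."""
    pairs = []
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        if len(row) != 2:
            continue  # skip malformed lines rather than failing
        caption, url = row
        pairs.append({"caption": caption.strip(), "url": url.strip()})
    return pairs


sample = (
    "a dog runs on the beach\thttp://example.com/dog.jpg\n"
    "city skyline at dusk\thttp://example.com/city.jpg\n"
)
pairs = load_caption_pairs(sample)
```

At the dataset's full ~3.3M-pair scale you would stream the file line by line rather than hold it in memory, but the parsing logic is the same.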

7 Companies Using Blockchain To Power AI Applications

Emerging blockchain startups are re-imagining internet services and access to data with a decentralized twist.
New cryptonetworks are proposing a way to create data marketplaces that democratize access to AI training data. These marketplaces would coordinate users offering their data with projects in need of it — and because the exchanges are on a blockchain, there’s no middleman handling files, ensuring the shared data stays secure.

@Explosion.ai Automates Annotation to Accelerate Machine Learning

Explosion AI (Germany) provides Prodigy, software which automates some parts of annotation. It can extrapolate a corpus of relevant terms from a few seed words and helps data scientists quickly confirm the targeted language using a Tinder-like graphical interface. Co-founder Ines Montani has demonstrated the efficiency of Prodigy in annotating insulting language to help moderate online behavior, for example on social media or ecommerce feedback comments, but the tools have been used to build applications analyzing text in financial services, she says.
 “The bottleneck is training data. Companies are amassing data, hoping they can do something with it. While machine learning might provide some good applications, you still have to document and label the data to use it for training machine learning models,” Montani says.
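The seed-word extrapolation described above can be sketched with toy word vectors: rank vocabulary terms by their similarity to a small seed set, then ask a human to confirm the top candidates. Nothing here is Prodigy's actual API; the three-dimensional vectors and the cosine ranking are illustrative assumptions.

```python
import math

# Toy embeddings; a real system would use pretrained word vectors.
VECTORS = {
    "idiot":  [0.9, 0.1, 0.0],
    "moron":  [0.85, 0.15, 0.05],
    "fool":   [0.8, 0.2, 0.1],
    "banana": [0.0, 0.1, 0.9],
    "table":  [0.1, 0.0, 0.8],
}


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)


def expand_seeds(seeds, vocab, top_n=2):
    """Rank non-seed terms by average similarity to the seed set."""
    scored = []
    for term, vec in vocab.items():
        if term in seeds:
            continue
        score = sum(cosine(vec, vocab[s]) for s in seeds) / len(seeds)
        scored.append((term, score))
    scored.sort(key=lambda t: t[1], reverse=True)
    return [term for term, _ in scored[:top_n]]


candidates = expand_seeds({"idiot", "moron"}, VECTORS)
```

In an annotation workflow, each candidate would then be shown to the annotator for an accept/reject decision, which is what the Tinder-like interface streamlines.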

Gengo.ai Debuts Service for Acquiring Multilingual Machine-Learning Training Data

Gengo (crowdsourced translation services), is launching Gengo.ai, an on-demand platform that provides developers of machine-learning systems access to a wide array of multilingual services delivered by Gengo’s crowdsourced network of 25,000+ vetted contributors. The Gengo.ai platform offers data-curation services for both text and speech, including sentiment analysis, transcription, and content summarization. Equipped with this data, software developers at global technology companies can now accelerate the training of their AI systems and deliver more sophisticated products to market, faster.

LT-Innovate's insight:

Interesting move to leverage crowd-sourced text & speech production, but who owns the data? And does it matter?

Datasheets to Make AI Datasets More Transparent

Currently there is no standard way to identify how a dataset was created, and what characteristics, motivations, and potential skews it represents. To begin to address this issue, we propose the concept of a datasheet for datasets, a short document to accompany public datasets, commercial APIs, and pretrained models. The goal of this proposal is to enable better communication between dataset creators and users, and help the AI community move toward greater transparency and accountability. 
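A datasheet in this sense is simply structured documentation shipped alongside the data. A minimal sketch as a Python dataclass, where the field names loosely follow the questions the proposal raises but are not the paper's exact schema:

```python
from dataclasses import dataclass, field, asdict


@dataclass
class Datasheet:
    """Minimal 'datasheet for datasets' record: why the data was
    created, what it contains, how it was collected, and what
    known skews it carries."""
    name: str
    motivation: str
    composition: str
    collection_process: str
    known_skews: list = field(default_factory=list)

    def to_dict(self):
        """Serialize for publishing alongside the dataset."""
        return asdict(self)


# Hypothetical example dataset, invented for illustration.
sheet = Datasheet(
    name="toy-sentiment-v1",
    motivation="Benchmark sentiment models on product reviews.",
    composition="10k English reviews with 3-class labels.",
    collection_process="Scraped from a public review site, 2017-2018.",
    known_skews=["English only", "electronics category over-represented"],
)
```

The point of the proposal is less the schema than the habit: every released dataset, API, or pretrained model would carry such a record so users can judge its fit and its skews before training on it.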

LT-Innovate's insight:

arXiv paper

YAFC Corpus: Corpus, Benchmarks and Metrics for Formality Style Transfer

Style transfer is the task of automatically transforming a piece of text in one particular style into another. A major barrier to progress in this field has been a lack of training and evaluation datasets, as well as benchmarks and automatic metrics. In this work, we create the largest corpus for a particular stylistic transfer (formality) and show that techniques from the machine translation community can serve as strong baselines for future work. We also discuss challenges of using automatic metrics.
LT-Innovate's insight:

Use case? Raymond Queneau's Exercices de Style 2.0?

The Data Challenge for Indian Language Tech

Tech companies are finally waking up to the fact that Indian languages need digital support too, and that involves creating a user experience that is completely optimised for Indian languages - merely providing a suboptimal, patchwork user experience won’t do. Developing language tech, however, comes with its own numerous challenges.
There’s a very real scarcity in actual resources for building digital support for Indian languages.
The European Union, for example, has Europarl, a collection of parallel corpora (aligned translations of parliamentary proceedings) covering multiple European languages. Indian languages have nothing comparable. This means these resources for Indian languages have to be built from the ground up.

LT-Innovate's insight:

India probably lacks good speech data for its ten or fifteen major languages. This will make #ASR and #TTS development harder. In Europe, even the famous Europarl parallel data resource for #MT is insufficient for low-resource European languages. Expect innovative crowdsourcing projects in India to help bridge the language-data gap.
