Artificial intelligence algorithms can find your face in a crowd and detect what you’re feeling. But it’s harder for them to identify what you’re doing.
That’s because most actions, dancing for example, are actually a series of smaller actions. If an image showed a person with hands in the air and hips cocked to one side, an algorithm would have a hard time telling what that person was doing.
Researchers at MIT and U.C. Irvine have developed a new algorithm that detects actions in video far more effectively than past efforts. It does so by applying lessons from the grammars that computer scientists have devised to help machines parse natural language.
“We see an analogy here, which is, if you have a complex action — like making tea or making coffee — that has some subactions, we can basically stitch together these subactions and look at each one as something like verb, adjective, and adverb,” said MIT post-doctoral researcher Hamed Pirsiavash in a news release.
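The idea is easiest to see in miniature. The sketch below, in Python, shows what such a compositional rule might look like: a complex action defined as an ordered sequence of subactions, with a simple parser that checks whether the segments detected in a video fit the rule. The grammar, the labels, and the matching logic here are invented for illustration and are far simpler than the researchers’ actual model, which scores many candidate parses probabilistically.

```python
# Toy illustration (not the researchers' code): a complex action is
# written as a grammar rule over subactions, much as a sentence
# decomposes into parts of speech. All rule names and labels below
# are hypothetical, chosen to echo the tea/coffee example.

GRAMMAR = {
    "make_tea":    ["boil_water", "steep_tea", "pour_tea"],
    "make_coffee": ["boil_water", "grind_beans", "brew", "pour_coffee"],
}

def parse_action(subactions):
    """Return the composite action whose rule appears, in order, within
    the observed subaction sequence (extra "noise" segments between
    steps are allowed), or None if no rule matches."""
    for action, rule in GRAMMAR.items():
        segments = iter(subactions)
        # Each rule step must be found somewhere later in the sequence.
        if all(any(seg == step for seg in segments) for step in rule):
            return action
    return None

# A per-segment detector might emit labels like these for one video:
observed = ["boil_water", "idle", "steep_tea", "pour_tea"]
print(parse_action(observed))  # -> "make_tea"
```

Even this toy version captures the key design choice the quote describes: rather than classifying a whole clip at once, the system labels short segments and then asks which higher-level “sentence” those labels spell out.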