Input: At the top, there is voice input. Voice-to-text is a relatively well-solved problem for most common use cases.
Natural Language Understanding (NLU): Below that is NLU, which consists of parsing text (from voice or from direct input) and pulling out the key entities. At this point NLU is good and moving towards “great.” Parsing already approaches human-level success rates.
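To make "pulling out the key entities" concrete, here is a minimal sketch in Python. It is a toy, not how production NLU systems work: real systems use trained statistical models, while this one relies on a few hand-written regular expressions. The reminder-style domain, the `PATTERNS` table, and the `extract_entities` function are all assumptions made up for illustration.

```python
import re

# Hypothetical patterns for a toy "reminder" domain. These are illustrative
# assumptions, not part of any real NLU library.
PATTERNS = {
    # Clock times such as "3pm" or "10:30 am"
    "time": re.compile(r"\b\d{1,2}(?::\d{2})?\s?(?:am|pm)\b", re.IGNORECASE),
    # A small fixed vocabulary of relative dates
    "date": re.compile(r"\b(?:today|tomorrow|tonight)\b", re.IGNORECASE),
    # A capitalized word following a contact verb is treated as a person
    "person": re.compile(r"\b(?:call|email|text)\s+([A-Z][a-z]+)"),
}


def extract_entities(text):
    """Return a dict mapping entity label to the first matching span in text."""
    entities = {}
    for label, pattern in PATTERNS.items():
        match = pattern.search(text)
        if match:
            # Use the capture group when the pattern defines one,
            # otherwise use the whole match.
            entities[label] = match.group(1) if pattern.groups else match.group(0)
    return entities


print(extract_entities("Remind me to call Alice tomorrow at 3pm"))
```

The same function works whether the text came from a speech recognizer or was typed directly, which is the point of treating NLU as its own layer beneath input.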
Inference & Reasoning: Moving further down the stack, inference and reasoning are a bit more difficult. A.I.s with bodies (robots, for example) may be needed to help ground language in some other modality, or new graphical techniques might help, but these are all very nascent. (But keep an eye on the intersection of deep learning and graphical models.)