Extracting the main text content from a web page is an important step in the text processing pipeline. The source code of pages in HTML is usually cluttered with advertising and other text which is not related to the main content. Formally, in the context of computer science, it is impossible for a computer to distinguish between the main content and other content on the same page. That is, no algorithm can recognize it for all possible cases. Sometimes it is even difficult for humans to distinguish it. Recognition of primary content is part of the machine learning/artificial intelligence field of study.