Representative and role model of the web page content

A P Golovko; V N Lapin

Journal: Bulletin of Nizhnevartovsk State University, No. 3, 2015

Rubrics: ARTICLES

UDC 51:004.738.5+004.8

A P Golovko ¹

V N Lapin ²

Author and publication information

Authors:

1. Kurgan State University

2. Smolny University of the Russian Academy of Education

Type:

Article

Pages:

from 3 to 14

Status:

Published

Received:

06.09.2015

Accepted:

15.09.2015

Published:

25.09.2015

Language:

Russian

Keywords:

web, modeling, artificial intelligence

Abstract and keywords

Abstract:
Automatic analysis of web page content is a topical problem. Such analysis helps solve several practical tasks, including detecting the role structure of a page's content: distinguishing the main article, visitor comments, advertisements, and fragments with other roles. Solving this problem is also an important step towards deeper automatic analysis of website semantics. We apply an approach that determines the role of an HTML code fragment from the way it is rendered on the screen, which corresponds to how a human perceives the page. The developed model identifies the HTML code fragments that act as the main header and the main article of a page. The main article may contain different elements, such as text, tables, and images. Other elements (advertisements, etc.) are often deleted from the main article, and pages may use various layouts and ways of placing content elements on the screen. The model is an expert system whose knowledge base contains 1) a semantic net reflecting the relations between the objects and concepts used in problem solving; 2) a production system containing a set of inference rules. The inference strategy is constructed so as to exclude any iteration: during inference, all elements that can play a given role are selected, after which their number is gradually reduced to one. The production system has a hierarchical structure; each local system consists of 5-10 rules and has its own local data storage, which minimizes the probability of side effects. The model is implemented as a program in the Python programming language. The program reads an HTML file from the Internet, removes all elements except the main header and the main article, and stores the result as a file on a hard disk. The program was tested on news sites and habrahabr.ru. The proportion of correctly processed pages was 85-90% for pages with a table layout and 95-97% for pages with a block layout.
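
The elimination-style inference described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the Block features, the three rules, their thresholds, and the names LocalSystem and find_main_article are all hypothetical and chosen only for illustration. The sketch shows the general idea of starting from every candidate fragment and letting small rule groups, each with its own local store, narrow the set down to one without revisiting earlier decisions.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Block:
    """A rendered page fragment. All features here are assumed, not from the paper."""
    tag: str
    width: int      # rendered width in pixels
    text_len: int   # number of visible text characters
    link_len: int   # number of those characters that sit inside links


# A rule takes the surviving candidates plus a local store and returns those to keep.
Rule = Callable[[List[Block], Dict], List[Block]]


def keep_wide(cands: List[Block], store: Dict) -> List[Block]:
    # Hypothetical rule: the main article occupies a wide area of the screen.
    store["max_width"] = max(b.width for b in cands)
    return [b for b in cands if b.width >= 0.5 * store["max_width"]]


def drop_link_heavy(cands: List[Block], store: Dict) -> List[Block]:
    # Hypothetical rule: menus and footers consist mostly of link text.
    return [b for b in cands if b.link_len <= 0.5 * b.text_len]


def keep_longest_text(cands: List[Block], store: Dict) -> List[Block]:
    # Hypothetical tie-breaker: prefer the fragment with the most text.
    return [max(cands, key=lambda b: b.text_len)]


@dataclass
class LocalSystem:
    """A small rule group with private storage, so rules in other groups cannot see it."""
    name: str
    rules: List[Rule]
    store: Dict = field(default_factory=dict)

    def run(self, cands: List[Block]) -> List[Block]:
        for rule in self.rules:
            if len(cands) <= 1:
                break
            kept = rule(cands, self.store)
            if kept:  # a rule is ignored if it would eliminate every candidate
                cands = kept
        return cands


def find_main_article(blocks: List[Block]) -> Block:
    if not blocks:
        raise ValueError("no candidate blocks")
    systems = [
        LocalSystem("geometry", [keep_wide]),
        LocalSystem("text", [drop_link_heavy, keep_longest_text]),
    ]
    cands = list(blocks)  # every block starts as a candidate for the role
    for system in systems:
        cands = system.run(cands)  # single forward pass, no backtracking
    return cands[0]
```

Each rule only ever shrinks the candidate list, so a single forward pass over the rule groups is enough, and keeping a separate store per group limits the chance that one group's intermediate values affect another.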
