Life in Technology: December 2010

The volume of information on the web is increasing day by day. The information in the web can be broadly categorized into two types i.e. surface web and deep web. The surface web pages can be easily indexed through conventional techniques but the deep web, whose size assumed to be thousand times larger than surface web, cannot be indexed through conventional search technique. The first stage of the extraction of the deep web information is the detection of deep web search interface. A search interface is generally consisting of html forms. The conventional techniques of searching the deep web information is done by filling the html forms on the search interface manually but recently the research is going on automatic accessing and understanding of html forms. Being the first stage of deep web extraction process, the detection of deep web search interface becomes one of the important module of deep web information retrieval. In this paper a technical analysis of some of the important deep web search interface detection techniques is done to find out their relative strengths and limitations with reference to current development in the field of deep web information retrieval technology.

Keywords : Deep web, hidden web, search interface detection, crawler, random forest.
1. IntroductionThe The whole process of extraction of information from deep web can be broadly categorized into four steps i.e. query interface analysis, values allotment, response analysis & navigation and relevance ranking. Query interface analysis is the first and most important step for deep web information retrieval. In query interface analysis, a request of fetching a web page from a web server is made by a crawler. After completion of the fetching process, an internal representation of the web page is produced after parsing and processing of html forms based on the developed model. Further the query interface analysis can be broken into the some modules that are detection of hidden web search interface, search form schema matching and domain ontology identification. In these module the detection of hidden web search interface is the first and foremost step towards deep web information retrieval. As expected, a human user can easily identify a deep web search interface but to understand a deep web search interface through a automatic technique without human intervention is a challenging task [1][2][3][4][5]. Figures 1 depict the different types of search interfaces.

Fig. 1 : Different types of search interface

Related Work
One of the prominent works for detection of deep web search interface is done by Leo Breiman (2001)[6] in form of random forest algorithm. A random forest algorithm detects the deep web search interface by using a model, based on decision trees classification. A random forest model can be defined as a collection of decision trees. A decision tree can be generated by bootstrapping processing of the training data. Various classification trees can be generated through random forest algorithm. To classify a new object from its input vector, the sample vector is passed to every tree defined in algorithm. A decision for classification is given by every tree. A decision about most voted classification is done by using all of the classification results of the individual trees. The advantages of random forest algorithm are that it exhibits a substantial performance improvement over single tree classifiers and injecting of the right kind of randomness makes accurate classifiers and regulators. The disadvantage of this algorithm is that it may select unimportant and noisy features in the training data, as a result a bad classification results because of its random selection feature.
One of the deep web crawler architecture is proposed by Sriram Raghavan and Hector Garcia-Molina (2001) [7]. In this paper, a task-specific, human-assisted approach is used for crawl the hidden web. There are two basic problems related to deep web search, firstly the volume of the hidden web is very large and secondly there is a need of such type of crawlers which can handle search interfaces efficiently, which are designed mainly for humans. In this paper a model of task specific human assisted web crawler is designed and relized in HiWE (hidden web exposure). The HiWE prototype built at Stanford which crawl the dynamic pages. HiWE is designed to automatically process, analyze, and submit forms, using an internal model of forms and form submissions. HiWE uses a layout-based information extraction (LITE) technique to process and extract useful information. The advantages of HiWE architecture is that its application/task specific approach allows the crawler to concentrate on relevant pages only and with the human assisted approach automatic form filling can be done. Limitations of this architecture are that it is not precise with response to partially filled forms and it is not able to identify and respond to simple dependency between form elements.
A technique for collecting hidden web pages for data extraction is proposed by Juliano Palmieri Lage et al. (2002) [8] . In this technique the authors have proposed the concept of web wrappers. A web wrapper is programs which extract the unstructured data from web pages. It takes a set of target pages from the web source as an input. These set of target pages are automatically generated by an approach called “Spiders”. Spiders automatically traverse the web for web pages. Hidden web agents assist the wrappers to deal with the data available on the hidden web. The advantage of this technique is that it can access a large number of web sites from diverse domains and limitation of this technique is that it can access only that web site that follow common navigation patterns. Further, modification can be done in this technique to cover navigation patterns based on these mechanisms.
A technique for automated discovery of search interface from a set of html forms is proposed by Jared Cope, Nick Craswell and David Hawking (2003) [9]. This paper defined a novel technique to automatically detect search interface from a group of html forms. A decision tree was developed with the C4.5 learning algorithm using automatically generated features from html markup that can give a classification accuracy of about 85% for general web interfaces. Advantage of this technique is that it can automatically discover the search interface. Limitation of this technique is that it is based on single tree classification method and number of feature generation is limited due to use of limited data set. As a future work, modification is suggested that a search engine can be develop using existing methods for other stages along with the proposed one with a technique to eliminate false positives.
A technique for understanding web query interfaces through best effort parsing with hidden syntax is proposed by Zhen Zhang et al. (2004)[10]. This paper addresses the problem of understanding web search interfaces by presenting a best-effort parsing framework. The paper presented a form extractor framework based on 2P grammar and the best effort parses in a language parsing framework. It identifies the search interface by continuously producing fresh instances by applying productions until attaining a fix-point, when no fresh instance can be produced. Best effort parser technique minimizes wrong interpretation as much as possible in a very fast manner. It also understands the interface to a large extent. Advantage of this technique is that it is a very simple and consistent technique with no priority among preferences and it can handle missing elements in form and limitation of this technique is that establishment of single global grammar that can be interacted to the machine globally is a critical issue.
A technique named as “siphoning hidden web data through key word based interface” for retrieval of information from hidden web databases through generation of a small set of representative keywords and build queries is proposed by Luciano Barbosa and Juliana Freire (2004) [11]. This technique is designed to enhance coverage of deep web. Advantage of this technique is that it is a simple and completely automated strategy that can be quite effective in practice, leading to very high coverage of deep web. Limitation of this technique is that it is not able to achieve the coverage for collection whose search interface fixes a number of results. Further the authors have advised that modification can be done in this algorithm to characterize search interfaces techniques in a better way so that different notions and levels of security can be achieved.
An improved version of random forest algorithm is proposed by Deng et al. (2008) [12]. In this improved technique a weighted feature selection algorithm is proposed to generate the decision trees. The advantage of this improved algorithm is that it minimizes the problem of classification of high dimension and sparse search interface using the ensemble of decision trees. Disadvantage of this improved algorithm is that it is highly sensitive towards the changes in training data set.
Further improvement in random forest algorithm is done by Yunming Ye et al. (2009) [13] by using feature weighting random forest algorithm for detection of hidden web search interface. This paper had presented a feature weighting selection process rather than random selection process. Advantage of this technique is that it makes a weighted feature selection process instead of random selection hence reduces the chances of noisy feature selection and limitation of this techniques is that features available only in the search forms were used. Future modification suggested in random forest algorithm to investigate more feature weighting methods for construction of random forests.
An algorithm named as “The naive bayesian web text classification algorithm” is proposed by Ping Bai and Junqing Li (2009) [14] for automatic and effective classification of web pages with reference to given model for machine learning. In the conventional techniques, category abstracts are produced using the inspection by domain experts either through semiautomatic method or artificial method. All the items are provided equal important according to conventional common bayesian classifier whereas according to improved naive bayesian web text classification algorithm, whole of the items in every title are provided higher importance to others. The strength of this technique is that text classification results are very accurate and further scope in this algorithm is suggested to make the classification process automatic in an efficient way.
An approach for automatic detection and unification of web search query interfaces using domain ontology is proposed by Anuradha and A.K.Sharma (2010) [15]. The technique proposed in this paper works by concentrating the crawler on the given topic considering the domain ontology. This technique results in the pages which contains the domain specific search form. The strengths of this technique are that results are produced from multiple sources, human effort is reduced and results are very accurate in less execution time. Limitation of this technique is that it is domain specific.

Summary of various techniques for detection of deep web search interface
By going through the literature survey of some deep web search interface detection techniques, it is concluded that each techniques for detection of deep web search interface have some relative strengths and limitations. A tabular summary is given below in table 1, which summarizes the techniques, strengths and limitations of some of important detection techniques for deep web search interface.

Table 1 : Summary of various techniques for detection of deep web search interface

Authors	Technique	Strengths	Limitations
Leo Breiman (2001)	Forest of regression trees as classifiers	A substantial improvement in performance over single tree classifiers.	May include un-important or noisy features.

Sri Ram Raghavan et al. (2001)	Hidden Web Exposer	An application specific approach to hidden web crawling	Imprecise in filling the forms.

Palmieri Lage et al. (2002)	Hidden Web Agents	Wide coverage of distinct domains.	Restricted to web sites that follow common navigation patterns.

Jared Cope et al. (2003)	Single tree classifiers	Automatically discovery of search interface, performed well when rules are generated on the same domain.	Long rules, large size of feature space in training samples, Over fitting, Classification precision is not very satisfying.

Zhen Zhang et al. (2004)	2P Grammar and Best effort Parser	Very simple and consistent, No priority among preferences, Handling of missing elements in form.	Critical to establish single global grammar that can be interacted to the machine globally.

Luciano Barbosa et al. (2004)	Automatic query generation based on small set of keywords.	A simple and completely automated strategy that can be quite effective in practice	A large domain of Keywords has to be generated.

Deng, X. B. et al. (2008)	weighted feature selection algorithm	Minimizes the problem of classification of high dimension and sparse search interface using the ensemble of decision trees	Highly sensitive towards the changes in training data set.

Ye, Li, Deng et al.(2009)	Feature weighted selection process	Minimizes the chances of selection of noisy features.	No use of contextual information associated with forms.

Ping Bai et al.(2009)	Naïve Bayesian Algorithm	Text classification results are very accurate.	Classification algorithm is not automatic.

Anuradha et al. (2010)	Based on domain ontology	Results are produced from multiple sources, reduces the human effort, less execution time, accuracy is high.	It is domain specific.

Conclusion
Deep web search interface are the entry point for the searching of the deep web information. A deep web crawler should understand and detect the deep web search interface efficiently to facilitate the further process of deep web information retrieval. An efficient detection of deep web search interface may results towards a significant retrieval of deep web information so the first and foremost step of deep web information retrieval is the efficient understanding and detection of deep web search interface. In this paper a technical analysis of some of the techniques for detection of deep web search interface is done and it is concluded that each of them have some relative strengths and limitations in detecting of deep web search interface. To explore the deep web information efficiently, an efficient technique for detection of deep web search interface should be designed which should have strengths simultaneously and particularly in terms of wide coverage of different domains, automatic procedure, resistant to noisy and unwanted features, ability to consider the features as per their importance, application specific approach as per requirement and user friendly approach. Finally the technique for detection of deep web search interface should be compatible with current web technology.

This is a paper published by :Dilip Kumar Sharma1, A. K. Sharma21GLA University, Mathura, UP, India. Email: todilipsharma@rediffmail.com
2YMCA University of Science and Technology, Faridabad, Haryana, India

I seem to be intrested in the topic specified by the author.Hope this become useful for any of the users of the blog.

References
Bergman, M.K. (2001). The Deep Web: Surfacing Hidden Value. In The Journal of Electronic Publishing, Vol. 7, No. 1.
Peisu, X., Ke, T. and Qinzhen, H.(2008). A Framework of Deep Web Crawler. In Proceedings of the 27th Chinese Control Conference, Kunming,Yunnan, China.
Sharma, D. K., and Sharma, A.K. (2010). Deep Web Information Retrieval Process: A Technical Survey. In International Journal of Information Technology & Web Engineering, USA, Vol 5, No. 1.
Khare, R., An, Y., and Song, Y. (2010). Understanding Deep Web Search Interfaces: A Survey. In ACM SIGMOD Record, Volume 39 , Issue 1, PP: 33-40.
Sharma D. K., and Sharma A.K. (2009). Query Intensive Interface Information Extraction Protocol for Deep Web., In Proceedings of IEEE International Conference on Intelligent Agent & Multi-Agent Systems, PP. 1-5 , IEEE Explorer.
Breiman, L. (2001). Random Forests. In Machine Learning, Vol. 45, No.1, PP: 5-32, Kluwer Academic Publishers.
Raghavan, S. and Garcia-Molina, H. (2001). Crawling the Hidden Web. In Proceedings of the 27th International Conference on Very Large Data Bases, Roma, Italy.
Lage, P. et al. (2002). Collecting Hidden Web Pages for Data Extraction. In Proceedings of the 4th international workshop on Web information and data management , PP: 69-75.
Cope, J., Craswell, N., and Hawking, D. (2003). Automated Discovery of Search Interfaces on the web. In
Proceedings of the Fourteenth Australasian Database Conference (ADC2003), Adelaide, Australi,a.
Zhang, Z., He, B., and Chang, K. (2004). Understanding Web Query Interfaces: Best-Effort Parsing with Hidden Syntax. In Proceedings of ACM International Conference on Management of Data ,PP: 107-118.
Barbosa, L., and Freirel, J.(2004). Siphoning Hidden-Web Data through Keyword-Based Interface., In Proceedings of SBBD.
Deng, X. B., Ye, Y. M., Li, H. B., & Huang, J. Z. (2008). An Improved Random Forest Approach For Detection Of Hidden Web Search Interfaces. In Proceedings of the Seventh International Conference on Machine Learning and Cybernetics, Kunming, China. IEEE.
Ye, Y., et al. (2009). Feature Weighting Random Forest for Detection of Hidden Web Search Interfaces. In Computational Linguistics and Chinese Language Processing , Vol. 13, No. 4, PP: 387-404.
Bai, P., and Li, J.(2009). The Improved Naive Bayesian WEB Text Classification Algorithm, In International Symposium on Computer Network and Multimedia Technology, IEEE Explorer.
Anuradha, and Sharma, A.K. (2010). A Novel Approach For Automatic Detection and Unification of Web Search Query Interfaces Using Domain Ontology. In International Journal of Information Technology and Knowledge Management, July-December, Vol. 2, No. 2,PP: 196-199.
Dilip Kumar Sharma is B.Sc, B.E.(CSE), M.Tech.(IT), M.Tech. (CSE) and pursuing Ph.D in Computer Engineering. He is life member of CSI, IETE, ISTE,, ISCA, SSI and member of CSTA, USA. He has attended 21 short term courses/workshops/seminars organized by various esteemed originations. He has published 21 research papers in International Journals /Conferences of repute and participated in 18 International/National conferences. Presently he is working as Reader in Department of Computer Science, IET at GLA University, Mathura, U.P. since March 2003 and he is also CSI Student branch Coordinator. His research interests are deep web information retrieval, Digital Watermarking and Software Engineering. He has guided various projects and seminars undertaken by the students of undergraduate/postgraduate.
Prof. A. K. Sharma received his M.Tech. (CST) with Hons. from University of Roorkee (Presently I.I.T. Roorkee) and Ph.D (Fuzzy Expert Systems) from JMI, New Delhi and he obtained his second Ph.D. in Information Technology form IIITM, Gwalior in 2004. Presently he is working as Dean, Faculty of Engineering and Technology & Chairman, Dept of Computer Engineering at YMCA University of Science and Technology, Faridabad. His research interest includes Fuzzy Systems, OOPS, Knowledge Representation and Internet Technologies. He has guided 9 Ph.D thesis and 8 more are in progress with about 175 research publications in International and National journals and conferences. The author of 7 books, is actively engaged in research related to Fuzzy logic, Knowledge based systems, MANETS, Design of crawlers. Besides being member of many BOS and Academic councils, he has been Visiting Professor at JMI, IIIT&M, and I.I.T. Roorkee.

Life in Technology

Wednesday, December 29, 2010

Analysis of Techniques for Detection of Deep Web Search Interface

Nature Inspired Machine Intelligence

Harmony Search Algorithm