Wednesday, December 29, 2010

Analysis of Techniques for Detection of Deep Web Search Interface

The volume of information on the web is increasing day by day. The information in the web can be broadly categorized into two types i.e. surface web and deep web. The surface web pages can be easily indexed through conventional techniques but the deep web, whose size assumed to be thousand times larger than surface web, cannot be indexed through conventional search technique. The first stage of the extraction of the deep web information is the detection of deep web search interface.  A search interface is generally consisting of html forms. The conventional techniques of searching the deep web information is done by filling the html forms on the search interface manually but recently the research is going on automatic accessing and understanding of html forms. Being the first stage of deep web extraction process, the detection of deep web search interface becomes one of the important module of deep web information retrieval. In this paper a technical analysis of some of the important deep web search interface detection techniques is done to find out their relative strengths and limitations with reference to current development in the field of deep web information retrieval technology.     

Keywords : Deep web, hidden web, search interface detection, crawler, random forest.

1. IntroductionThe The whole process of extraction of information from deep web can be broadly categorized into four steps i.e. query interface analysis, values allotment, response analysis & navigation and relevance ranking. Query interface analysis is the first and most important step for deep web information retrieval. In query interface analysis, a request of fetching a web page from a web server is made by a crawler.   After completion of the fetching process, an internal representation of the web page is produced after parsing and processing of html forms based on the developed model. Further the query interface analysis can be broken into the some modules that are detection of hidden web search interface, search form schema matching and domain ontology identification. In these module the detection of hidden web search interface is the first and foremost step towards deep web information retrieval. As expected, a human user can easily identify a deep web search interface but to understand a deep web search interface through a automatic technique without human intervention is a challenging task [1][2][3][4][5]. Figures 1 depict the different types of search interfaces.

Fig. 1 : Different types of search interface

Related Work
One of the prominent works for detection of deep web search interface is done by Leo Breiman (2001)[6] in form of random forest algorithm. A random forest algorithm detects the deep web search interface by using a model, based on decision trees classification.  A random forest model can be defined as a collection of decision trees. A decision tree can be generated by bootstrapping processing of the training data. Various classification trees can be generated through random forest algorithm. To classify a new object from its input vector, the sample vector is passed to every tree defined in algorithm. A decision for classification is given by every tree.  A decision about most voted classification is done by using all of the classification results of the individual trees.  The advantages of random forest algorithm are that it exhibits a substantial performance improvement over single tree classifiers and injecting of the right kind of randomness makes accurate classifiers and regulators. The disadvantage of this algorithm is that it may select unimportant and noisy features in the training data, as a result a bad classification results because of its random selection feature.                                                                                      
One of the deep web crawler architecture is proposed by Sriram Raghavan and Hector Garcia-Molina (2001) [7]. In this paper, a task-specific, human-assisted approach is used for crawl the hidden web. There are two basic problems related to deep web search, firstly the volume of the hidden web is very large and secondly there is a need of such type of crawlers which can handle search interfaces efficiently, which are designed mainly for humans. In this paper a model of task specific human assisted web crawler is designed and relized in HiWE (hidden web exposure).  The HiWE prototype built at Stanford which crawl the dynamic pages. HiWE is designed to automatically process, analyze, and submit forms, using an internal model of forms and form submissions. HiWE uses a layout-based information extraction (LITE) technique  to process and extract useful information. The advantages of HiWE architecture is that its application/task specific approach allows the crawler to concentrate on relevant pages only and with the human assisted approach automatic form filling can be done. Limitations of this architecture are that it is not precise with response to partially filled forms and it is not able to identify and respond to simple dependency between form elements.
A technique for collecting hidden web pages for data extraction is proposed by Juliano Palmieri Lage et al. (2002) [8] . In this technique the authors have proposed the concept of web wrappers. A web wrapper is programs which extract the unstructured data from web pages.  It takes a set of target pages from the web source as an input. These set of target pages are automatically generated by an approach called   “Spiders”.  Spiders automatically traverse the web for web pages. Hidden web agents assist the wrappers to deal with the data available on the hidden web. The advantage of this technique is that it can access a large number of web sites from diverse domains and limitation of this technique is that it can access only that web site that follow common navigation patterns. Further, modification can be done in this technique to cover navigation patterns based on these mechanisms.
A technique for automated discovery of search interface from a set of html forms is proposed by Jared Cope, Nick Craswell and David Hawking (2003) [9]. This paper defined a novel technique to automatically detect search interface from a group of html forms. A decision tree was developed with the C4.5 learning algorithm using automatically generated features from html markup that can give a classification accuracy of about 85% for general web interfaces. Advantage of this technique is that it can automatically discover the search interface. Limitation of this technique is that it is based on single tree classification method and number of feature generation is limited due to use of limited data set. As a future work, modification is suggested that a search engine can be develop using existing methods for other stages along with the proposed one with a technique to eliminate false positives.
A technique for understanding web query interfaces through best effort parsing with hidden syntax is proposed by  Zhen Zhang et al. (2004)[10]. This paper addresses the problem of understanding web search interfaces by presenting a best-effort parsing framework. The paper presented a form extractor framework based on 2P grammar and the best effort parses in a language parsing framework. It identifies the search interface by continuously producing fresh instances by applying productions until attaining a fix-point, when no fresh instance can be produced. Best effort parser technique minimizes wrong interpretation as much as possible in a very fast manner. It also understands the interface to a large extent. Advantage of this technique is that it is a very simple and consistent technique with no priority among preferences and it  can handle missing elements in form and limitation of this technique is that establishment of single global grammar that can be interacted to the machine globally is a critical issue.
A technique named as “siphoning hidden web data through key word based interface” for retrieval of information from hidden web databases through generation of a small set of representative keywords and build queries is proposed by Luciano Barbosa and Juliana Freire (2004) [11]. This technique is designed to enhance coverage of deep web. Advantage of this technique is that it is a simple and completely automated strategy that can be quite effective in practice, leading to very high coverage of deep web. Limitation of this technique is that it is not able to achieve the coverage for collection whose search interface fixes a number of results. Further the authors have advised that modification can be done in this algorithm to characterize search interfaces techniques in a better way so that different notions and levels of security can be achieved.
An improved version of random forest algorithm is proposed by Deng et al. (2008) [12]. In this improved technique a weighted feature selection algorithm is proposed to generate the decision trees. The advantage of this improved algorithm is that it minimizes the problem of classification of high dimension and sparse search interface using the ensemble of decision trees. Disadvantage of this improved algorithm is that it is highly sensitive towards the changes in training data set. 
Further improvement in random forest algorithm is done by Yunming Ye et al. (2009) [13] by using feature weighting random forest algorithm for detection of hidden web search interface. This paper had presented a feature weighting selection process rather than random selection process. Advantage of this technique is that it makes a weighted feature selection process instead of random selection hence reduces the chances of noisy feature selection and limitation of this techniques is that features available only in the search forms were used. Future modification suggested in random forest algorithm to investigate more feature weighting methods for construction of random forests.
An algorithm named as “The naive bayesian web text classification algorithm” is proposed by Ping Bai and Junqing Li (2009) [14] for automatic and effective classification of web pages with reference to given model for machine learning. In the conventional techniques, category abstracts are produced using the inspection by domain experts either through semiautomatic method or artificial method. All the items are provided equal important according to conventional common bayesian classifier whereas according to improved naive bayesian web text classification algorithm, whole of the items in every title are provided higher importance to others. The strength of this technique is that text classification results are very accurate and further scope in this algorithm is suggested to make the classification process automatic in an efficient way.
An approach for automatic detection and unification of web search query interfaces using domain ontology is proposed by Anuradha and A.K.Sharma (2010) [15]. The technique proposed in this paper works by concentrating the crawler on the given topic considering the domain ontology. This technique results in the pages which contains the domain specific search form. The strengths of this technique are that results are produced from multiple sources, human effort is reduced and results are very accurate in less execution time. Limitation of this technique is that it is domain specific.

Summary of various techniques for detection of deep web search interface

By going through the literature survey of some deep web search interface detection techniques, it is concluded that each techniques for detection of deep web search interface have some relative strengths and limitations. A tabular summary is given below in table 1, which summarizes the techniques, strengths and limitations of some of important detection techniques for deep web search interface.

Table 1 : Summary of various techniques for detection of deep web search interface
AuthorsTechniqueStrengthsLimitations
Leo Breiman
(2001)
Forest of regression trees as classifiersA substantial improvement in performance over single tree classifiers.May include un-important or noisy features.
Sri Ram Raghavan et al. (2001)Hidden Web ExposerAn application specific approach to hidden web crawlingImprecise in filling the forms.
Palmieri Lage et al. (2002)Hidden Web AgentsWide coverage of distinct domains.Restricted to web sites that follow common navigation patterns.
Jared Cope  et al. (2003)Single tree classifiersAutomatically discovery of search interface, performed well when rules are generated on the same domain.Long rules, large size of feature space in training samples, Over fitting, Classification precision is not very satisfying.
Zhen Zhang et al. (2004)2P Grammar and Best effort ParserVery simple and consistent, No priority among  preferences, Handling of missing elements in form.Critical to establish single global grammar that can be interacted to the machine globally.
Luciano Barbosa  et al. (2004)Automatic query generation based on small set of keywords.A simple and completely automated strategy that can be quite effective in practiceA large domain of Keywords has to be generated.
Deng, X. B. et al. (2008)weighted feature selection algorithmMinimizes the problem of classification of high dimension and sparse search interface using the ensemble of decision treesHighly sensitive towards the changes in training data set.
Ye, Li, Deng  et al.(2009)Feature weighted selection processMinimizes the chances of selection of noisy features.No use of contextual information associated with forms.
Ping Bai  et al.(2009)Naïve Bayesian AlgorithmText classification results are very accurate.Classification algorithm is not automatic.
Anuradha et al. (2010)Based on domain ontologyResults are produced from multiple sources, reduces the human effort, less execution time, accuracy is high.It is domain specific.


Conclusion
Deep web search interface are the entry point for the searching of the deep web information. A deep web crawler should understand and detect the deep web search interface efficiently to facilitate the further process of deep web information retrieval. An efficient detection of deep web search interface may results towards a significant retrieval of deep web information so the first and foremost step of deep web information retrieval is the efficient understanding and detection of deep web search interface. In this paper a technical analysis of some of the techniques for detection of deep web search interface is done and it is concluded that each of them have some relative strengths and limitations in detecting of deep web search interface. To explore the deep web information efficiently, an efficient technique for detection of deep web search interface should be designed which should have strengths simultaneously and particularly in terms of wide coverage of different domains, automatic procedure, resistant to noisy and unwanted features, ability to consider the features as per their importance, application specific approach as per requirement and user friendly approach.  Finally the technique for detection of deep web search interface should be compatible with current web technology.
This is a paper published by :Dilip Kumar Sharma1, A. K. Sharma21GLA University, Mathura, UP, India. Email: todilipsharma@rediffmail.com
2YMCA University of Science and Technology, Faridabad, Haryana, India


I seem to be intrested in the topic specified by the author.Hope this become useful for any of the users of the blog.

References
Bergman, M.K. (2001). The Deep Web: Surfacing Hidden Value. In The Journal of Electronic Publishing, Vol. 7, No. 1.
Peisu, X., Ke, T. and Qinzhen, H.(2008).  A Framework of Deep Web Crawler. In Proceedings of the 27th Chinese Control Conference,  Kunming,Yunnan, China.
Sharma, D. K., and Sharma, A.K. (2010). Deep Web Information Retrieval Process: A Technical Survey. In International Journal of Information Technology & Web Engineering, USA, Vol 5, No. 1.
Khare, R., An, Y., and Song, Y. (2010). Understanding Deep Web Search Interfaces: A Survey. In ACM SIGMOD Record, Volume 39 ,  Issue 1,   PP: 33-40.  
Sharma D. K., and Sharma A.K. (2009). Query Intensive Interface Information Extraction Protocol for Deep Web., In Proceedings of IEEE International Conference on Intelligent Agent & Multi-Agent Systems, PP. 1-5 , IEEE Explorer.
Breiman, L. (2001).  Random Forests. In Machine Learning, Vol. 45, No.1, PP: 5-32, Kluwer Academic Publishers.
Raghavan, S. and  Garcia-Molina, H. (2001). Crawling the Hidden Web. In Proceedings of the 27th International Conference on Very Large Data Bases, Roma, Italy.
Lage,  P. et al. (2002).  Collecting Hidden Web Pages for Data Extraction. In Proceedings of the 4th international workshop on Web information and data management , PP: 69-75.
Cope, J., Craswell, N.,  and  Hawking, D. (2003). Automated Discovery of Search Interfaces on the web. In
Proceedings of the Fourteenth Australasian Database Conference (ADC2003), Adelaide, Australi,a.
Zhang, Z., He, B., and Chang, K. (2004). Understanding Web Query Interfaces: Best-Effort Parsing with Hidden  Syntax. In Proceedings of ACM International Conference on Management of Data ,PP: 107-118.
Barbosa, L., and Freirel, J.(2004). Siphoning Hidden-Web Data through Keyword-Based Interface., In Proceedings of   SBBD.
Deng, X. B., Ye, Y. M., Li, H. B., & Huang, J. Z. (2008). An Improved Random Forest Approach For Detection Of Hidden Web Search Interfaces. In Proceedings of the Seventh International Conference on Machine Learning and Cybernetics, Kunming, China. IEEE.
Ye, Y., et al. (2009). Feature Weighting Random Forest for Detection of Hidden Web Search Interfaces. In Computational Linguistics and Chinese Language Processing , Vol. 13, No. 4,  PP: 387-404.
Bai, P., and Li, J.(2009). The Improved Naive Bayesian WEB Text Classification Algorithm, In International Symposium on Computer Network and Multimedia Technology, IEEE Explorer. 
Anuradha,  and Sharma, A.K. (2010). A Novel Approach For Automatic Detection and Unification of Web Search Query Interfaces Using Domain Ontology. In International Journal of Information Technology and Knowledge Management, July-December, Vol. 2, No. 2,PP: 196-199.
Dilip Kumar Sharma is B.Sc, B.E.(CSE), M.Tech.(IT), M.Tech. (CSE) and pursuing Ph.D in Computer Engineering. He is life member of CSI, IETE, ISTE,, ISCA, SSI and member of CSTA, USA. He has attended 21 short term courses/workshops/seminars organized by various esteemed originations. He has published 21 research papers in International Journals /Conferences of repute and participated in 18 International/National conferences. Presently he is working as Reader in Department of Computer Science, IET at GLA University, Mathura, U.P.  since March 2003 and he is also CSI Student branch Coordinator. His research interests are deep web information retrieval, Digital Watermarking and Software Engineering. He has guided various projects and seminars undertaken by the students of undergraduate/postgraduate.
Prof. A. K. Sharma received his M.Tech. (CST) with Hons. from University of Roorkee (Presently I.I.T. Roorkee) and Ph.D (Fuzzy Expert Systems) from JMI, New Delhi and he obtained his second Ph.D. in Information Technology form IIITM, Gwalior in 2004. Presently he is working as Dean, Faculty of Engineering and Technology & Chairman, Dept of Computer Engineering at YMCA University of Science and Technology, Faridabad. His research interest includes Fuzzy Systems, OOPS, Knowledge Representation and Internet Technologies. He has guided 9 Ph.D thesis and 8 more are in progress with about 175 research publications in International and National journals and conferences. The author of 7 books, is actively engaged in research related to Fuzzy logic, Knowledge based systems, MANETS, Design of crawlers. Besides being member of many BOS and Academic councils, he has been Visiting Professor at JMI, IIIT&M, and I.I.T. Roorkee.

Nature Inspired Machine Intelligence

  1. Artificial Neural Networks
  2. Evolutionary Algorithms  
  3. Swarm Intelligence
  4. Harmony Search
  5. Simulated Annealing
  6. Membrane Computing
  7. Artificial Immune System (AIS)
  8. DNA Computation
  9. Computing with Words
  10. Artificial Life
  11. Quantum Computation
  12. Hybrid Approaches
If All the above features can be combined and made effective implementation on a real-time basis which should provide the right information at the right time can be said as NIMI (Nature Inspired Machine Intelligence). No matter who develops, any R&D division of a small company to Microsoft, Infosys, TCS, Wipro can emerge as a indisputable common brand in common mans day to day life. I say it because I believe technology has to be cheaper and should always reach common man as a help. What is the future of burning carbon and making a global footprint on Ozone layer? To avoid education and methodologies has to evolve from basic to basic+1 at least.

References[1] A. Abraham, “Neuro-Fuzzy Systems: State-of-the-Art Modeling Techniques”, in Jose Mira and Alberto Prieto, eds., Connectionist Models of Neurons, Learning Processes, and Artificial Intelligence, Springer Verlag Germany, 2001, pp. 269-276.
[2] W. Banzhaf, P. Nordin, E.R. Keller, and F.D. Francone, “Genetic Programming: An Introduction on The Automatic Evolution of Computer Programs and its Applications”, Morgan Kaufmann Publishers, Inc., 1998
[3] Kirkpatrick, S., C. D. Gelatt Jr., M. P. Vecchi, Optimization by Simulated Annealing, Science, 220, 4598, 671-680, 1983.
[4] G. Paun, Computing with membranes, Journal of Computer and System Sciences, 61 (1), 108-143, 2000.
[5] Deutsch, D., Quantum Theory, the Church-Turing Principle, and the Universal Quantum Computer”. Proc. Roy. Soc. Lond. A400, 97–117, 1985.
[6] A. Abraham, Intelligent Systems: Architectures and Perspectives, Recent Advances in Intelligent Paradigms and Applications, Abraham A., Jain L. and Kacprzyk J. (Eds.), Studies in Fuzziness and Soft Computing, Springer Verlag Germany, ISBN 3790815381, Chapter 1, pp. 1-35, 2002.
[7] Bishop C.M., Neural Networks for Pattern Recognition, Oxford University Press, Oxford, UK, 1995.
[8] Fogel, D. B., Evolutionary Computation: Toward a New Philosophy of Machine Intelligence. IEEE Press, Piscataway, NJ, Second edition, 1999.
[9] Kennedy J. and Eberhart R. Swarm intelligence. Morgan Kaufmann Publishers, Inc., San Francisco, CA, 2001.
[10] Passino, K.M., Biomimicry of Bacterial Foraging for Distributed Optimization and Control, IEEE Control Systems Magazine, pp. 52-67, June 2002.
[11] de Castro, L. N. and Timmis, J. I., Artificial Immune Systems: A New Computational Intelligence Approach, Springer-Verlag, London, 2002.
[12] Amos M., Theoretical and Experimental DNA Computation. Springer, ISBN: 3-540-65773-8, 2005.
[13] Zadeh L.A. and Kacprzyk J. (Eds.) Computing with Words in Information/Intelligent Systems: Foundations, Studies in Fuzziness and Soft Computing, Springer Verlag, Germany, ISBN 379081217X, 1999.
[14] Reynolds R.G., Michalewicz, Z. Cavaretta M.J., Using Cultural Algorithms for Constraint Handling in GENOCOP. Proceedings of the Fourth Annual Conference on Evolutionary Programming. MIT Press, Cambridge, pp. 289-305, 1995.
[15] C. Adami, Introduction to Artificial Life. Springer-Verlag New York, Inc., 1998.
[16] Z.W. Geem, J.H. Kim, and G.V. Loganathan, “A new heuristic optimization algorithm: harmony search”, Simulation 76 (2), 60–68, 2001.           

Harmony Search Algorithm

Harmony search (HS) is a music-inspired algorithm (Geem et al., 2001) and has been applied to various optimization problems including music composition, Sudoku puzzle, magic square, timetabling, tour planning, logistics, web page clustering, text summarization, Internet routing, visual tracking, robotics, energy system dispatch, power system design, cell phone networking, structural design, water network design, dam scheduling, flood model calibration, groundwater management, soil stability analysis, ecological conservation, vehicle routing, heat exchanger design, satellite heat pipe design, offshore structure mooring, RNA structure prediction, medical imaging, medical physics, etc (Geem, 2009; 2010a). Recently, HS was also applied to astronomical data analysis, which was published in Nature (Deeg et al., 2010).
Each musician in music performance plays a musical note at a time, and those musical notes together make a harmony. Likewise, each variable in optimization has a value at a time, and those values together make a solution vector. Just like the music group improves their harmonies practice by practice, the algorithm improves its solution vectors iteration by iteration.
The HS algorithm basically has three operations, such as memory consideration, pitch adjustment, and random selection. Using memory consideration operation, HS chooses a value from harmony memory (HM); using pitch adjustment operation, HS chooses a value which is slightly modified from HM; and using random selection operation, HS chooses a value randomly from entire value range. These basic operations constitute a novel stochastic derivative (Geem, 2008), instead of traditional calculus-based derivative, in order to search for the right direction to the optimal solution.
For more advanced issues in HS, researchers have researched exploratory power (Das et al., 2010), multi-modal solution space (Gao et al., 2009), multi-objective optimization (Geem, 2010b), distributed memory (Pan et al., 2010), hybridization (Fesanghary et al., 2008), and adaptive theory (Geem and Sim, 2010).  In addition, HS has a unique derivative which considers the relationship among variables (Geem, 2011).

References
Das, S., Mukhopadhyay, A., Roy, A., Abraham, A., & Panigrahi, B. K. (2010) Exploratory Power of the Harmony Search Algorithm: Analysis and Improvements for Global Numerical Optimization. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, http://dx.doi.org/10.1109/TSMCB.2010.2046035
Deeg, H. J., Moutou, C., & Erikson A. et al. A transiting giant planet with a temperature between 250 K and 430 K. Nature, 464, 384-387.
Fesanghary, M., Mahdavi, M., Minary-Jolandan, M., Alizadeh, Y. (2008). Hybridizing harmony search algorithm with sequential quadratic programming for engineering optimization problems. Computer Methods in Applied Mechanics and Engineering, 197(33-40), 3080-3091.
Gao, X. Z., Wang, X., & Ovaska S. J. (2009) Uni-modal and Multi-modal Optimization Using Modified Harmony Search Methods. International Journal of Innovative Computing, Information and Control, 5(10A), 2985-2996.
Geem, Z. W. (2008). Novel Derivative of Harmony Search Algorithm for Discrete Design Variables. Applied Mathematics and Computation, 199(1), 223-230.
Geem, Z. W. (2009). Music-Inspired Harmony Search Algorithms: Theory and Applications. Berlin: Springer.
Geem, Z. W. (2010a). Recent Advances in Harmony Search Algorithm. Berlin: Springer.
Geem, Z. W. (2010b). Multiobjective Optimization of Time-Cost Trade-Off Using Harmony Search. ASCE Journal of Construction Engineering and Management, 136(6), 711-716.
Geem, Z. W. (2011). Stochastic Co-Derivative of Harmony Search Algorithm. International Journal of Mathematical Modelling and Numerical Optimisation, 2(1), 1-12.
Geem, Z. W., Kim, J. H., & Loganathan, G. V. (2001). A New Heuristic Optimization Algorithm: Harmony Search. Simulation, 76(2), 60-68.
Geem, Z.W., Sim, K.-B. (2010). Parameter-Setting-Free Harmony Search Algorithm. Applied Mathematics and Computation, http://dx.doi.org/10.1016/j.amc.2010.09.049
Pan, Q.-K., Suganthan, P.N., Liang, J. J., Tasgetiren, M.F. (2010). A local-best harmony search algorithm with dynamic subpopulations. Engineering Optimization, 42(2), 101 - 117.