
LSI and SEO Benefits

By: Pat Lovell
Date Added : 2008-09-13

I am going to give you two definitions of Latent Semantic Indexing (LSI). The reason is that LSI is derived from a mathematical technique for retrieving data and was originally used at universities to make searching large information databases more accurate. The first definition explains LSI and LSA (latent semantic analysis) from the educational perspective. The second describes how the search engines (mainly Google) are using LSI in their algorithms to produce their search results.

Latent Semantic Analysis (LSA) is a theory and method for extracting and representing the contextual-usage meaning of words by statistical computations applied to a large corpus of text. The underlying idea is that the aggregate of all the word contexts in which a given word does and does not appear provides a set of mutual constraints that largely determines the similarity of meaning of words and sets of words to each other. The adequacy of LSA's reflection of human knowledge has been established in a variety of ways. For example, its scores overlap those of humans on standard vocabulary and subject matter tests; it mimics human word sorting and category judgments; it simulates word-word and passage-word lexical priming data; and, as has been reported, it accurately estimates passage coherence, learnability of passages by individual students, and the quality and quantity of knowledge contained in an essay.

LSA can be construed in two ways:

(1) simply as a practical expedient for obtaining approximate estimates of the contextual-usage substitutability of words in larger text segments, and of the kinds of (as yet incompletely specified) meaning similarities among words and text segments that such relations may reflect, or

(2) as a model of the computational processes and representations underlying substantial portions of the acquisition and utilization of knowledge.

Regular keyword searches approach a document collection with a kind of accountant mentality: a document contains a given word or it doesn't, with no middle ground. We create a result set by looking through each document in turn for certain keywords and phrases, tossing aside any documents that don't contain them, and ordering the rest based on some ranking system. Each document stands alone in judgment before the search algorithm: there is no interdependence of any kind between documents, which are evaluated solely on their contents.
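The keyword approach described above can be sketched in a few lines of Python. This is only an illustration: the function name `keyword_search`, the sample documents and the ranking rule (counting occurrences of the query terms) are assumptions made for the example, not the method of any particular search engine.

```python
def keyword_search(documents, query):
    """Return (doc_id, score) pairs for documents that contain every query term."""
    terms = query.lower().split()
    results = []
    for doc_id, text in documents.items():
        words = text.lower().split()
        # Accountant mentality: a document either contains a term or it does not.
        if all(term in words for term in terms):
            # Hypothetical ranking rule: total occurrences of the query terms.
            score = sum(words.count(term) for term in terms)
            results.append((doc_id, score))
    # Highest score first; each document is judged only on its own contents.
    return sorted(results, key=lambda pair: pair[1], reverse=True)


# Made-up sample collection for illustration.
documents = {
    "a": "n-dimensional manifold topology notes",
    "b": "topology of surfaces",
    "c": "manifold manifold learning on an n-dimensional manifold",
}

hits = keyword_search(documents, "n-dimensional manifold")
print(hits)
```

Document "b" is discarded outright because it lacks the query words, even though it covers a closely related subject. That is the gap LSI tries to close.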

Latent semantic indexing adds an important step to the document indexing process. In addition to recording which keywords a document contains, the method examines the document collection as a whole, to see which other documents contain some of those same words. LSI considers documents that have many words in common to be semantically close, and ones with few words in common to be semantically distant. This simple method correlates surprisingly well with how a human being, looking at content, might classify a document collection. Although the LSI algorithm doesn't understand anything about what the words mean, the patterns it notices can make it seem astonishingly intelligent.
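For readers who want to see the idea in motion, here is a minimal sketch in Python with NumPy. It assumes the usual textbook formulation of LSI, which the article does not spell out: build a term-document count matrix, take its singular value decomposition, and keep only the top `k` dimensions. The tiny corpus, the choice `k = 2` and the helper name `cosine` are made up for the example.

```python
import numpy as np

# Made-up corpus: three mathematics documents and two search documents.
docs = [
    "n-dimensional manifold topology",
    "manifold topology",
    "n-dimensional manifold",
    "keyword ranking search",
    "search engine ranking",
]

# Term-document matrix: one row per word, one column per document.
vocab = sorted({word for doc in docs for word in doc.split()})
A = np.array([[doc.split().count(word) for doc in docs] for word in vocab],
             dtype=float)

# Truncated SVD: keep the k strongest patterns of word co-occurrence.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2  # assumed number of latent dimensions for this toy corpus
doc_vectors = (np.diag(s[:k]) @ Vt[:k]).T  # one k-dimensional row per document


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Documents 1 and 2 share only one word, yet they sit together in the reduced space.
raw_similarity = cosine(A[:, 1], A[:, 2])
lsi_similarity = cosine(doc_vectors[1], doc_vectors[2])
# A mathematics document and a search document stay far apart.
cross_similarity = cosine(doc_vectors[0], doc_vectors[3])
print(raw_similarity, lsi_similarity, cross_similarity)
```

In this toy corpus the two topics share no vocabulary at all, so the separation is unusually clean. A real collection has overlapping vocabulary and needs far more dimensions, but the principle is the same: the algorithm knows nothing about the words, only the patterns in which they occur together.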

When you search an LSI-indexed database, the search engine looks at similarity values it has calculated for every content word, and returns the documents that it thinks best fit the query. Because two documents may be semantically very close even if they do not share a particular keyword, LSI does not require an exact match to return useful results. Where a plain keyword search will fail if there is no exact match, LSI will often return relevant documents that don't contain the keyword at all.

As an example, let's say we use LSI to index a collection of mathematical articles. If the words n-dimensional, manifold and topology appear together in enough articles, the search algorithm will notice that the three terms are semantically close. A search for n-dimensional manifolds will therefore return a set of articles containing that phrase (the same result we would get with a regular search), but also articles that contain just the word topology. The search engine understands nothing about mathematics, but examining a sufficient number of documents teaches it that the three terms are related. It then uses that information to provide an expanded set of results with better recall than a plain keyword search.
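The mathematics example can be tried out on a toy scale. The sketch below again assumes the standard SVD-based formulation of LSI. The six short "articles", the choice `k = 2`, the helper names and the 0.5 similarity cutoff are all assumptions made for illustration. The query is projected into the same reduced space as the documents and compared by cosine similarity.

```python
import numpy as np

# Made-up collection; document 3 never mentions either query term.
docs = [
    "n-dimensional manifold topology",
    "n-dimensional manifold topology proof",
    "manifold topology",
    "topology lecture",
    "search engine ranking",
    "keyword search ranking",
]
query = "n-dimensional manifold"

vocab = sorted({word for doc in docs for word in doc.split()})


def to_vector(text):
    """Count how often each vocabulary word occurs in the text."""
    words = text.split()
    return np.array([words.count(word) for word in vocab], dtype=float)


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


A = np.column_stack([to_vector(doc) for doc in docs])  # terms x documents
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2                # assumed number of latent dimensions
U_k = U[:, :k]

# Project the documents and the query into the same k-dimensional space.
doc_vectors = (U_k.T @ A).T
query_vector = U_k.T @ to_vector(query)
scores = [cosine(query_vector, vec) for vec in doc_vectors]

# Plain keyword search: every query word must be present in the document.
keyword_hits = [i for i, doc in enumerate(docs)
                if all(term in doc.split() for term in query.split())]
# LSI search: anything close enough to the query (0.5 is an arbitrary cutoff).
lsi_hits = [i for i, score in enumerate(scores) if score > 0.5]

print("keyword:", keyword_hits)
print("LSI:    ", lsi_hits)
```

The keyword search finds only the two documents that contain both query words. The LSI search also returns "manifold topology" and "topology lecture", because topology keeps company with the query terms elsewhere in the collection, while the two search-engine documents are left out. With only two dimensions every mathematics document scores almost identically, so a sketch like this shows the gain in recall and says little about how to rank the results.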

