1. Automatic methods have trouble handling synonyms, homonyms, and semantic relations.
Conceptualizing is very poor.
2. Human indexers go through cognitive processes that may be influenced by their background experience, education, training, intelligence, and common sense.
3. Computers can, and humans cannot, organize all words in a text and in a given database and perform statistical operations on them.
Friday, December 20, 2013
Thursday, December 19, 2013
Principles of KWIC Indexing
1. Titles are generally informative.
2. Words extracted from the title can be used as an effective guide.
3. Although the meaning of an individual word viewed in isolation may be ambiguous or too general, the context surrounding the words helps to define and explain the meaning.
Examples.
for Croatians. Cataloging and Classification
Cataloging and classification for Croatians
For Croatians. Cataloging and Classification
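The principles above can be sketched in code. This is only a minimal illustration of a rotated-title KWIC listing, not a standard implementation: the function name `kwic`, the small stoplist, and the "/" marker for the wrap-around point are all assumptions made for the example.

```python
def kwic(title, stopwords=frozenset({"and", "for", "of", "the", "a", "an", "in"})):
    """Return one entry per significant title word, rotated so the keyword leads."""
    words = title.split()
    entries = []
    for i, word in enumerate(words):
        if word.lower() in stopwords:
            continue  # function words are not used as keywords
        # Keyword and the words after it come first; "/" marks where the title wraps.
        rotated = words[i:] + (["/"] + words[:i] if i else [])
        entries.append(" ".join(rotated))
    # File the entries alphabetically by keyword, ignoring case.
    return sorted(entries, key=str.lower)


entries = kwic("Cataloging and Classification for Croatians")
for entry in entries:
    print(entry)
```

Each keyword stays surrounded by the rest of the title, which is what lets the context define an otherwise ambiguous word.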
Construction of a Thesaurus
1. Identify the subject field.
2. Identify the nature of literature to be indexed.
3. Identify the users.
4. Identify the file structure.
5. Cluster the terms.
6. Establish term relationships.
Difference Between Authority Lists and Thesauri
1. Thesauri are made up of single terms and bound terms representing single concepts. Subject heading lists have phrases and other pre-coordinated terms in addition to single terms.
2. Thesauri are more strictly hierarchical.
3. Thesauri are narrow in scope.
4. Thesauri are more likely multilingual.
Similarities Between Authority Lists and Thesauri
1. Both attempt to provide subject access to information resources by providing terminology that can be consistent rather than uncontrolled and unpredictable.
2. Both choose preferred terms and make references from non-used terms.
3. Both provide hierarchies so that terms are presented in relation to their broader, narrower, and related terms.
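The shared features listed above (preferred terms, references from non-used terms, and broader/narrower/related relations) can be pictured as a small data structure. The sketch below is illustrative only: the entries are hypothetical, and the labels UF, BT, NT, and RT follow common thesaurus notation rather than any particular published list.

```python
# Hypothetical entries; the terms and relations are made up for illustration.
THESAURUS = {
    "automobiles": {
        "UF": ["cars", "motor cars"],    # "used for": non-preferred synonyms
        "BT": ["motor vehicles"],        # broader term
        "NT": ["electric automobiles"],  # narrower term
        "RT": ["roads"],                 # related term
    },
    "motor vehicles": {"UF": [], "BT": [], "NT": ["automobiles"], "RT": []},
}


def build_use_references(thesaurus):
    """Map every non-used term to the preferred term it refers to."""
    refs = {}
    for preferred, relations in thesaurus.items():
        for variant in relations["UF"]:
            refs[variant] = preferred
    return refs


USE = build_use_references(THESAURUS)


def preferred_term(term):
    """Follow a USE reference if there is one; otherwise return the term unchanged."""
    term = term.lower()
    return USE.get(term, term)


def broader_terms(term):
    """Return the broader terms recorded for the preferred form of a term."""
    return THESAURUS.get(preferred_term(term), {}).get("BT", [])


print(preferred_term("Cars"))   # automobiles
print(broader_terms("cars"))    # ['motor vehicles']
```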
Types of Controlled Vocabulary
1. Authority List / Subject Authority List
Examples: Library of Congress Subject Headings
Sears List of Subject Headings
Dewey Decimal Classification
2. Thesaurus
From the Latin word for "treasure"
Examples: The Art & Architecture Thesaurus
ERIC (Education Resources Information Center) Thesaurus
Types of Indexing Language
1. Natural Language (derived term system)
Characteristics are:
Improves recall because it provides more access points but reduces precision
Redundancy is greater
Uses more current terms
May also be called Indexing by Extraction
2. Controlled Vocabulary (assigned term system)
Functions:
To control synonyms by choosing one form as the standard term
To make distinctions among homographs
To bring or link together terms that are closely related
To establish the size or scope of a term
Main Purpose of the Abstract
1. To indicate what the document is about or to summarize its contents.
2. To facilitate selection.
3. To help the reader decide whether a particular item is likely to be of interest or not.
4. To save the time of the reader.
Advantages of Controlled Vocabulary Language
1. Increases the probability that both indexer and searcher will express a particular concept in the same way.
2. Increases the probability that the same term will be used by different indexers or by the same indexer at different times.
3. Helps searchers to focus their thoughts when they approach the information system without a full and precise realization of what information they need.
Disadvantages of Controlled Vocabulary Language
1. Incompatibility of different indexing languages.
2. High input cost.
3. The possibility of inadequate vocabulary.
Difference Between Book and Periodical Indexes
Book Index
1. Compiled only once, within a relatively short time, and usually performed by a single person.
2. Deals with a more or less well-defined central topic.
3. Indexing terms are almost always derived from the text.
4. Specificity is largely governed by the text itself.
5. Every single page of a book must be read.
6. Always bound with the indexed text.
Periodical Indexes
1. A continuous process, more often performed by a team of indexers and lasting for an extended period.
2. Deals with a great variety of topics.
3. Terminology must be consistent and derived from a controlled vocabulary.
4. Terms are prescribed by a controlled vocabulary and their level of specificity may be lower than the book index
5. Articles are scanned for indexable items and may rely on an abstract or summary compiled.
6. A periodical index will depend on a number of policy decisions.
7. Compiled separately.
Measures of Effectiveness of the Indexing System
1. Recall Measure - is a simple quantitative ratio of relevant documents retrieved to the total number of relevant documents potentially available. Recall depends on the level of exhaustivity allowed by the indexing policy.
Example:
If there are 100 documents in the library that are relevant to the user's needs and the indexing system retrieves 75 of them, then the recall ratio is 75 out of 100 (75/100). Recall for this search is 75 percent.
2. Precision Measure - is the ratio of relevant documents retrieved to the total number of documents retrieved. Relevance or precision depends on the terminology of the text being indexed and the specificity of the indexing language used.
Example:
If 100 documents are retrieved and 50 of those items are relevant to the request, the precision ratio is 50 to 100 (50/100). Precision for this search is 50 percent effective.
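The two ratios defined above can be written as short functions. This is a minimal sketch using the figures from the worked examples; the function names and the helper `evaluate`, which works from sets of document identifiers, are assumptions made for illustration.

```python
def recall(relevant_retrieved, total_relevant):
    """Relevant documents retrieved / all relevant documents available."""
    return relevant_retrieved / total_relevant if total_relevant else 0.0


def precision(relevant_retrieved, total_retrieved):
    """Relevant documents retrieved / all documents retrieved."""
    return relevant_retrieved / total_retrieved if total_retrieved else 0.0


def evaluate(retrieved, relevant):
    """Compute (recall, precision) from two collections of document identifiers."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # documents that are both retrieved and relevant
    return recall(hits, len(relevant)), precision(hits, len(retrieved))


print(recall(75, 100))     # 0.75, the recall example above
print(precision(50, 100))  # 0.5, the precision example above
```

Note that both measures share the same numerator and differ only in what they divide by.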
Functions Involved in an Information Retrieval System
2. Knowledge records are analyzed and tagged with sets of index terms.
3. The knowledge records are stored physically and index terms are stored into a structured file.
4. The user's query is tagged with sets of index terms and then is matched against tagged records.
5. Matched documents are stored for review.
6. Feedback may lead to several reiterations of the search.
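The tagging and matching steps above can be sketched as follows. This is a toy illustration under simple assumptions: the record identifiers and index terms are hypothetical, and ranking by the number of shared terms is just one possible way to match a tagged query against tagged records.

```python
# Hypothetical knowledge records, each already tagged with a set of index terms.
RECORDS = {
    "rec1": {"indexing", "thesauri"},
    "rec2": {"abstracting", "periodicals"},
    "rec3": {"indexing", "periodicals", "recall"},
}


def match(query_terms, records):
    """Return ids of records sharing at least one index term with the query,
    ordered by number of shared terms (ties broken by record id)."""
    query_terms = set(query_terms)
    scored = [(len(terms & query_terms), rec_id) for rec_id, terms in records.items()]
    ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [rec_id for score, rec_id in ranked if score > 0]


results = match({"indexing", "periodicals"}, RECORDS)
print(results)  # ['rec3', 'rec1', 'rec2']
```

Feedback would correspond to calling `match` again with a revised set of query terms.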
Stopwords / Stoplist
1. Function words do not bear useful information for IR: of, in, about, with, I, although, ...
2. Stoplist: contains stopwords, which are not to be used as index terms
a. Prepositions
b. Articles
c. Pronouns
d. Some adverbs and adjectives
e. Some frequent words (e.g. document)
3. The removal of stopwords usually improves IR effectiveness
4. A few "standard" stoplists are commonly used.
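Stopword removal as described above can be sketched in a few lines. The stoplist here is a tiny hypothetical one built from the examples in this section, not one of the standard stoplists, and the simple letters-only tokenizer is an assumption.

```python
import re

# A tiny illustrative stoplist: function words plus one frequent word ("document").
STOPLIST = {"of", "in", "about", "with", "i", "although", "the", "a", "an", "document"}


def index_terms(text):
    """Lowercase the text, split it into words, and drop anything on the stoplist."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [token for token in tokens if token not in STOPLIST]


print(index_terms("A document about indexing in the library"))
# ['indexing', 'library']
```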
Document Indexing
1. Goal = Find the important meanings and create an internal representation
2. Factors to consider:
Accuracy to represent meanings (semantics)
Exhaustiveness (cover all the contents)
Facility for computer to manipulate.
3. What is the best representation of contents?
Character string (character trigrams): not precise enough
Word: good coverage, not precise
Phrase: poor coverage, more precise
Concept: poor coverage, precise
[Figure: coverage (recall) and accuracy (precision) compared across string, word, phrase, and concept representations]
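The trade-off between the first two representations can be seen in a small experiment. This is only a sketch: the helper names `char_trigrams` and `words` are made up for the example, and phrases and concepts are left out because they need more machinery than fits here.

```python
def char_trigrams(text):
    """Represent a text as the set of its overlapping three-character strings."""
    s = text.lower()
    return {s[i:i + 3] for i in range(len(s) - 2)}


def words(text):
    """Represent a text as the set of its whitespace-separated words."""
    return set(text.lower().split())


title = "Cataloging rules"

# The trigram "cat" matches inside "cataloging": wide coverage, but the
# match says nothing about the animal, so it is not precise.
print("cat" in char_trigrams(title))  # True

# At the word level, "cat" no longer matches, which is more precise.
print("cat" in words(title))          # False
```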
Main problems in IR
1. Document and query indexing
- How to best represent their contents?
- To what extent does a document correspond to a query?
3. System Evaluation
- How good is a system?
- Are the retrieved documents relevant? (precision)
- Are the relevant documents retrieved? (recall)
Possible Approaches
1. String matching (linear search in documents)
- Slow
- Difficult to improve
2. Indexing (*)
- Fast
- Flexible to further improvement
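The two approaches above can be compared side by side. This is a minimal sketch with hypothetical documents: the linear search reads every document for every query, while the index is built once and then answers each query with a single lookup. A real system would add stopword removal and other normalization.

```python
from collections import defaultdict

# Hypothetical document collection.
DOCS = {
    "d1": "Indexing and abstracting services",
    "d2": "Thesaurus construction",
    "d3": "Automatic indexing methods",
}


def linear_search(term, docs):
    """Approach 1: scan the words of every document on every query."""
    return sorted(doc_id for doc_id, text in docs.items() if term in text.lower().split())


def build_index(docs):
    """Approach 2, done once: map each word to the set of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def index_search(term, index):
    """Approach 2, per query: a single dictionary lookup."""
    return sorted(index.get(term, set()))


INDEX = build_index(DOCS)
print(linear_search("indexing", DOCS))  # ['d1', 'd3']
print(index_search("indexing", INDEX))  # ['d1', 'd3']
```

Both return the same answer; the difference is in how much work each query costs and in how easily the stored index can be extended.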
Information Retrieval Problem
First Application: in libraries (1950)
ISBN: 0-201-12227-8
Author: Salton, Gerard
and retrieval of information by computer
Publisher: Addison-Wesley
Date: 1989
Content: <Text>
External attributes and internal attribute (content)
Search by external attributes = Search in Database
IR: Search by content
Why is Information Retrieval difficult
1. Vocabulary mismatch
Synonymy: e.g. car vs. automobile
Polysemy: e.g. table
2. Queries are ambiguous; they are partial specifications of the user's need
3. Content representation may be inadequate and incomplete
4. The user is the ultimate judge, but we don't know how the user judges ...
- the notion of relevance is imprecise, context- and user-dependent.
Indexing and Abstracting Services And Access to Information
Terms and concepts used in Information Retrieval and Indexing
1. Online Searching
2. Information Retrieval System
3. Index
4. Abstract
5. Indexing Vocabulary
6. Controlled-Vocabulary Indexing System
7. Subject Authority List
8. Thesaurus
9. Descriptor
10. Subject Heading
11. Natural Language Indexing System
12. Keyword
13. Keyword in Context (KWIC)
14. Recall
15. Precision