Ryan A. Rossi
Adobe Research
Introduction to Search Engine Theory
I developed this undergraduate course with the help of Jean-Louis Lassez and taught it in 2007-2008. It was designed to cover only the basics.
Syllabus
[Word Document]
Lectures
Lectures 1 & 2: Introduction, Ergodic Theorem, Perron-Frobenius Theorem, Power Method and Foundations of PageRank
Lecture 3: Hyperlink-Induced Topic Search (HITS)
Lecture 4: PageRank & SALSA
Lecture 5: Latent Semantic Analysis
Lecture 6: Ranking Links: Search and Surf Engines
Lecture 7: Detecting Spam Sites
Lecture 8: Spectral Clustering and Graph Partitioning
Lecture 9: K-means, Hierarchical and Zoomed Clustering, Hidden Markov Models
Homework
Homework 1 - Ergodic and Perron-Frobenius Theorems
Homework 2 - Hubs and Authorities (HITS)
Homework 2.1 - Sets of Hubs and Authorities
Homework 3 - PageRank
Homework 4 - Latent Semantic Analysis
Homework 5 - Ranking Links
Homework 6 - K-means and Hierarchical Clustering
Homework 7 - Spectral Clustering
Homework 8 - Building A Search Engine
Textbook
A reference textbook is Introduction to Information Retrieval, by Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schutze, Cambridge University Press, 2008. You may view the textbook online, or print your own copy.
Core Papers
Authoritative Sources in a Hyperlinked Environment
The PageRank Citation Ranking: Bringing Order to the Web
The stochastic approach for link-structure analysis (SALSA) and the TKC effect
Introduction to Latent Semantic Analysis
Automatic Cross-Language Information Retrieval using Latent Semantic Indexing
Ranking Links on the Web: Search and Surf Engines
Spam Detection Papers
Combating Web Spam with TrustRank
Measuring Similarity to Detect Qualified Links
Topical TrustRank: Using Topicality to Combat Web Spam
Improving Web Spam Classifiers Using Link Structure
A Large-Scale Study of Link Spam Detection by Graph Algorithms
Advanced Reading
The ATHENS System for Novel Information Discovery
Detecting Anomalies in Graphs
Searching and Ranking Web Pages
Self-Organization and Identification of Web Communities
Organizing WWW Images Based On The Analysis of Page Layout and Web Link Structure
Indexing by Latent Semantic Analysis
Signature Based Intrusion Detection using Latent Semantic Analysis
Symbolic Stochastic Systems
Software
As an alternative to Matlab, there is a free and very similar software package called Scilab. You can download it from http://www.scilab.org. There is also online help at http://www.scilab.org/product/man/ and a guide: An Introduction to Scilab.
Data sets
Abortion Refined -- Sites
Computational Geometry Refined -- Sites
Death Penalty Refined -- Sites
Gun Control Refined -- Sites
Movies Refined -- Sites
Net Censorship Refined -- Sites
These data sets were made public by Panayiotis Tsaparas.
Last updated 05/10/07