Data Mining and Machine Learning
- Instructor: Chih-Jen Lin, Room 413, CSIE building.
- TA: Xiang-Rui Wang (email: r95073@csie)
Final score
Approximately 10% will fail. (tentative)
- Discussion board: ptt.cc; data mining board in ntu/csie
- Time: Monday 10:20am-1:10pm, Room 105, CSIE building.
Usually we have a 20-minute break at around 11:30am.
- Note for this course:
This course will be taught in English.
The course load is designed under
the assumption that you are taking no more than
four major courses this semester.
No prerequisites for this course, so anyone (from
high school students to Ph.D. students) is welcome if you work
hard and are enthusiastic about the topic.
To get to the essence of things one has to work long and hard
--- Vincent Van Gogh
Difference from another course offered this
semester, Data Mining and Machine Learning
Case Studies: this one is more basic. The
case-study course assumes that you already have certain
machine learning knowledge.
- Textbook:
Data Mining: Practical Machine Learning Tools and Techniques with
Java Implementations
(2nd edition) by
Ian H. Witten and Eibe Frank, 2005
- Lecture slides: downloadable from the publisher's web site
- Reference:
- Past course pages: 2006 Winter, 2005 Winter, 2004 Winter, 2002 Fall
Course Outline (tentative)
- Input: Concepts, Instances, Attributes
- Output: Knowledge Representation
- Algorithms: The Basic Methods
- Credibility: Evaluating What's Been Learned
- Implementations: Real Machine Learning Schemes
- Engineering the Input and Output
Homework
Once every week. Please write your homework/reports in English.
For late homework, the score will be
exponentially decreased.
Please print out your homework; do not e-mail it to the TA.
Every week at around 12:50pm we randomly select one student to present
his/her homework.
Moreover, you are required to turn in your homework before
the 20-minute break.
Rules: We do not require you to come every week. If you are
absent and are selected, you will be
required to do a presentation the next week. If you fail
to show up then, 15 points will be deducted from
your mid-term exam. On the other hand, every week we first
ask for a volunteer, who will get 10 bonus points on
the mid-term. However, you can volunteer only once in this course.
When no one volunteers, everyone can be picked, regardless of
whether you have presented homework before or not.
- hw1, simple experiments using R (decision
trees on iris data), due March 5, 2007.
- hw2, input format, due March 12, 2007.
- hw3, further analysis of the UCI university data, due March 19, 2007.
- hw4, 1-Nearest Neighbor, due March 26, 2007.
- hw5, naive Bayes, due April 2, 2007.
- hw6, decision tree, due April 9, 2007.
- hw7, rule construction, due April 23, 2007.
- hw8, logistic regression, due April 30, 2007.
- hw9, k-means, due May 14, 2007.
- hw10, cross validation and t-test, due May 21, 2007.
- hw11, ROC curve, due May 28, 2007.
Exams
You can bring your textbook, slides, and class notes,
but nothing else. For example, you can bring neither
a computer nor a person.
Final Project
We will have one final project. Project presentations:
May 7 and June 11.
Each group: a 20-minute presentation.
Please give me your final report (<= 10 pages) by June 9.
Each group has three or four members.
Yes, your presentation will be in English.
project topic: spam filtering
We all hate spam, but there is no good
way yet to deal with it.
There are quite a few approaches
to controlling spam. For example, some
use blacklists, and some servers delay
incoming messages to see if the mail
is resent.
Using data classification is another possible
approach.
The basic idea is simple. We have a training
set of spam/non-spam mails. After training
a model, we can predict whether a new
mail is spam or not.
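The training/prediction idea above can be sketched in a few lines. This is only an illustration, not part of the course material: it uses plain Python rather than the R/Weka tools of the homework, the tiny mails and labels below are invented, and a real filter needs far more data and proper tokenization.

```python
# Minimal sketch of a word-based naive Bayes spam filter with
# add-one smoothing. The training mails are invented examples.
import math
from collections import Counter

train = [
    ("win a free prize now", "spam"),
    ("cheap offer claim your prize", "spam"),
    ("meeting moved to monday morning", "ham"),
    ("please review the attached report", "ham"),
]

# Count words per class and how many mails each class has.
word_counts = {"spam": Counter(), "ham": Counter()}
class_counts = Counter()
for text, label in train:
    word_counts[label].update(text.lower().split())
    class_counts[label] += 1

vocab = set(w for c in word_counts.values() for w in c)

def classify(text):
    """Return the most likely class under naive Bayes."""
    scores = {}
    for label in word_counts:
        # log P(class) + sum of log P(word | class) over the mail's words
        score = math.log(class_counts[label] / sum(class_counts.values()))
        total = sum(word_counts[label].values())
        for w in text.lower().split():
            count = word_counts[label][w]  # 0 for unseen words
            score += math.log((count + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

print(classify("free prize offer"))  # -> spam
```

After training on the labeled mails, `classify` scores a new mail under each class and returns the higher-scoring one; the questions below show why this simple procedure is not the whole story.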
However, there are quite a few
problems with using this training/testing
procedure.
- How many training instances are
enough? Is the testing accuracy so
far satisfactory?
- E-mail messages are becoming more and more complicated.
They may include figures and even videos.
How should we process them?
- How are we going to handle multiple
languages?
- Can we have personal preferences?
Of course we don't expect to fully solve
these problems in this project.
However, we hope to understand
the following.
- What is the status of using training/testing
for spam filtering? How is the performance?
- Some software packages already use learning
techniques. What are their settings?
- We would like to do some simple
implementations ourselves. This helps
us understand more about this topic.
You also need to learn a bit about
document processing.
- Any idea on improving the performance?
Grading
30% homework, 30% project, 40% exams. (tentative)
Related Information
- KDnuggets: a useful collection of data-mining-related software,
books, and many other resources.
Last modified: Tue Jun 12 14:43:53 CST 2007