Data Mining and Machine Learning
- Instructor: Chih-Jen Lin, Room 413, CSIE building.
- TA: Xiang-Rui Wang (email: r95073@csie)
Final score
Approximately 10% will fail. (tentative)
- Discussion board: ptt.cc; data mining board in ntu/csie
- Time: Monday 10:20am-1:10pm, Room 105, CSIE building.
Usually we have a 20-minute break at around 11:30am.
- Note for this course:
This course will be taught in English.
The course load is designed under
the assumption that you are taking no more than
four major courses this semester.
No prerequisites for this course, so anyone (from
high school students to Ph.D. students) is welcome if you work
hard and are enthusiastic about the topic.
To get to the essence of things one has to work long and hard
--- Vincent Van Gogh
Difference from another course offered this
semester, Data Mining and Machine Learning
Case Studies: this one is more basic. The
case-study course assumes that you already have certain
machine learning knowledge.
- Textbook:
Data Mining: Practical Machine Learning Tools and Techniques with
Java Implementations
(2nd edition) by
Ian H. Witten and Eibe Frank, 2005
- Lecture slides: downloadable from the publisher's web site
- Reference:
- Past course pages: 2006 Winter, 2005 Winter, 2004 Winter, 2002 Fall
Course Outline (tentative)
- Input: Concepts, Instances, Attributes
- Output: Knowledge Representation
- Algorithms: The Basic Methods
- Credibility: Evaluating What's Been Learned
- Implementations: Real Machine Learning Schemes
- Engineering the Input and Output
Homework
Once every week. Please write your homework/reports in English.
For late homework, the score will be
exponentially decreased.
Please print out your homework; do not e-mail it to the TA.
Every week at around 12:50pm we randomly select one student to present
his/her homework.
Moreover, you are required to turn in your homework before
the 20-minute break.
Rules: We do not require you to come every week. If you are
absent and are selected, you will be
required to do a presentation the next week. If you fail
to show up then, 15 points will be deducted from
your mid-term exam. On the other hand, every week we first
ask for a volunteer, who will get 10 bonus points on
the mid-term. However, you can volunteer only once in this course.
When no one volunteers, everyone can be picked, regardless of
whether you have presented homework before or not.
- hw1, simple experiments using R (decision
trees on iris data), due March 5, 2007.
- hw2, input format, due March 12, 2007.
- hw3, further analysis of the UCI university data, due March 19, 2007.
- hw4, 1-Nearest Neighbor, due March 26, 2007.
- hw5, naive Bayes, due April 2, 2007.
- hw6, decision tree, due April 9, 2007.
- hw7, rule construction, due April 23, 2007.
- hw8, logistic regression, due April 30, 2007.
- hw9, k-means, due May 14, 2007.
- hw10, cross validation and t-test, due May 21, 2007.
- hw11, ROC curve, due May 28, 2007.
Exams
You can bring your textbook, slides, and class notes,
but nothing else. For example, you can bring neither
a computer nor a person.
Final Project
We will have one final project. Project presentations:
May 7 and June 11.
Each group: a 20-minute presentation.
Please give me your final report (<= 10 pages) by June 9.
Each group has three or four members.
Yes, your presentation will be in English.
project topic: spam filtering
We all hate spam, but there is no good
way yet to deal with it.
There are quite a few approaches
to controlling spam. For example, some
use blacklists, and some servers delay
incoming messages to see if the mail
is resent.
Using data classification is another possible
approach.
The basic idea is simple. We have a training
set of spam/non-spam mails. After training
a model, we can predict whether a new
mail is spam or not.
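The training/prediction idea above can be sketched in a few lines. This is only an illustration, not part of the course material: it uses plain Python rather than the R/Weka tools of the homework, the tiny mails and labels below are invented, and a real filter needs far more data and proper tokenization.

```python
# Minimal sketch of a word-based naive Bayes spam filter with
# add-one smoothing. The training mails are invented examples.
import math
from collections import Counter

train = [
    ("win a free prize now", "spam"),
    ("cheap offer claim your prize", "spam"),
    ("meeting moved to monday morning", "ham"),
    ("please review the attached report", "ham"),
]

# Count words per class and how many mails each class has.
word_counts = {"spam": Counter(), "ham": Counter()}
class_counts = Counter()
for text, label in train:
    word_counts[label].update(text.lower().split())
    class_counts[label] += 1

vocab = set(w for c in word_counts.values() for w in c)

def classify(text):
    """Return the most likely class under naive Bayes."""
    scores = {}
    for label in word_counts:
        # log P(class) + sum of log P(word | class) over the mail's words
        score = math.log(class_counts[label] / sum(class_counts.values()))
        total = sum(word_counts[label].values())
        for w in text.lower().split():
            count = word_counts[label][w]  # 0 for unseen words
            score += math.log((count + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

print(classify("free prize offer"))  # -> spam
```

After training on the labeled mails, `classify` scores a new mail under each class and returns the higher-scoring one; the questions below show why this simple procedure is not the whole story.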
However, there are quite a few
problems with using this training/testing
procedure.
- How many training instances are
enough? Is the testing accuracy so
far satisfactory?
- E-mail messages are becoming more and more complicated.
They may include figures and even videos.
How should we process them?
- How are we going to handle multiple
languages?
- Can we have personal preferences?
Of course we don't expect to fully solve
these problems in this project.
However, we hope to understand
the following.
- What is the status of using training/testing
for spam filtering? How is the performance?
- Some software packages already use learning
techniques. What are their settings?
- We would like to do some simple
implementations ourselves. This helps
us understand more about this topic.
You also need to learn a bit about
document processing.
- Any idea on improving the performance?
Grading
30% homework, 30% project, 40% exams. (tentative)
Related Information
- KDnuggets: a useful collection of data-mining-related software,
books, and many other resources.
Last modified: Tue Jun 12 14:43:53 CST 2007