In traditional signal processing, you often want to represent an analog signal accurately in digital form. To do this, you use an analog-to-digital converter (ADC) to sample the signal. If the signal is, for example, a voltage that varies with time, then you sample it at many points in time so that the digital representation you build up looks the same as the original analog signal. The samples are taken at regular intervals, and the number of samples you need is governed by the Nyquist rate: you have to sample at more than twice the highest frequency present in the signal. The idea behind the Nyquist rate is that the faster the signal changes, the more samples you need in order to accurately capture those changes.
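To make this concrete, here is a minimal sketch (in Python, with the 5 Hz signal and the sampling rates chosen purely for illustration) of what happens when you sample above versus below the Nyquist rate:

```python
import numpy as np

f_signal = 5.0               # signal frequency in Hz (arbitrary choice)
nyquist_rate = 2 * f_signal  # minimum sampling rate for exact reconstruction: 10 Hz

def sample(f_sample, duration=1.0):
    """Sample a sine wave of frequency f_signal at rate f_sample."""
    t = np.arange(0.0, duration, 1.0 / f_sample)
    return t, np.sin(2.0 * np.pi * f_signal * t)

# Sampling well above the Nyquist rate preserves the shape of the signal.
t_good, x_good = sample(f_sample=50.0)   # 50 Hz > 10 Hz: faithful representation

# Sampling below the Nyquist rate aliases: these samples are indistinguishable
# from samples of a slower, 1 Hz sine wave.
t_bad, x_bad = sample(f_sample=6.0)      # 6 Hz < 10 Hz Nyquist rate
```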
People often forget that even though this idea was originally developed for communications and traditional signal processing, it applies just as much to machine learning and classification. In classification, you are trying to predict the class of a feature vector based on feature vectors that have already been labeled. The feature vectors are drawn from some sample space, and the classifications vary over that sample space just as voltage varies over time in traditional communications and DSP. To be sure that the classifier you learn correctly represents the true labeling of the sample space, you have to have enough examples drawn from that space.
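One way to picture the analogy is to treat the class label as a signal defined over the feature axis, just as voltage is a signal defined over time. The sketch below uses an invented one-dimensional labeling function, true_class, whose label flips faster and faster as the feature grows:

```python
import numpy as np

def true_class(x):
    """Ground-truth labeling over a one-dimensional feature axis (invented
    for illustration). The label flips faster and faster as x grows, so the
    right-hand part of the sample space "changes quickly" in the Nyquist sense."""
    return (np.sin(2.0 * np.pi * x**2) > 0).astype(int)

x = np.linspace(0.0, 2.0, 1000)   # a fine grid over the sample space
labels = true_class(x)            # the labeling a classifier is trying to learn
```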
While the number of training examples is important, it is equally important that the examples are drawn from all parts of the sample space. Unfortunately, many machine learning applications focus more on generating lots of training data than on ensuring that the data covers all portions of the sample space. This works well much of the time, especially when the labeling over the sample space isn't particularly complex or noisy. When the labeling varies quickly, though, it becomes much more important that your training data is dense enough to meet the Nyquist rate in those regions.
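Building on the labeling above, a rough sketch with a hand-rolled nearest-neighbor classifier shows the effect: given the same budget of labeled examples, a training set spread over the whole space does noticeably better than one clustered in the slowly varying half (all of the numbers here are arbitrary illustrations):

```python
import numpy as np

rng = np.random.default_rng(0)

def true_class(x):
    # Same illustrative labeling as above: the label flips faster as x grows.
    return (np.sin(2.0 * np.pi * x**2) > 0).astype(int)

def predict_1nn(train_x, train_y, query_x):
    """Label each query point with the label of its nearest training point."""
    nearest = np.abs(query_x[:, None] - train_x[None, :]).argmin(axis=1)
    return train_y[nearest]

test_x = np.linspace(0.0, 2.0, 2000)
test_y = true_class(test_x)

# Same budget of 80 labeled examples, drawn two different ways.
covered_x = rng.uniform(0.0, 2.0, size=80)    # spread over the whole sample space
clustered_x = rng.uniform(0.0, 1.0, size=80)  # plenty of data, but only from [0, 1]

for name, train_x in [("covers whole space", covered_x),
                      ("clustered in [0, 1]", clustered_x)]:
    pred = predict_1nn(train_x, true_class(train_x), test_x)
    print(f"{name}: accuracy = {(pred == test_y).mean():.2f}")
```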
Theoretically, knowledge of the Nyquist rate can also help minimize the time spent collecting training data. If you already know how fast the classifications vary over the sample space, then you can determine how many samples you need to accurately reconstruct the labeling. Unfortunately, this is often very difficult in practice, because you may not be able to choose where in the sample space your training data comes from.
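As a back-of-the-envelope version of that calculation (with the bandwidth B and feature range W invented purely for illustration), a Nyquist-style argument for a single feature says you need at least two samples per cycle of the fastest variation in the labeling:

```python
import math

# Illustrative assumptions: over a feature range of width W, the labeling
# behaves like a band-limited signal whose fastest variation is B cycles
# per unit of feature range.
B = 4.0   # fastest rate at which classes alternate (cycles per feature unit)
W = 2.5   # width of the feature range of interest

# Nyquist-style lower bound: at least two samples per cycle of the fastest variation.
min_samples = 2 * B * W
print(f"Need at least {math.ceil(min_samples)} evenly spaced training examples")
```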
Traditional signal processing also deals mainly with one- and two-dimensional signals, while classification may be done over a feature space with tens to hundreds of dimensions. That dramatically increases the computational cost of determining even a simple function over the sample space. A trade-off is often made where more data is collected in order to minimize the processing time required to learn a useful classifier. Many of the learning algorithms are also fairly naive, and wouldn't fare well with the bare minimum of training data.
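One way to see why, sketched with arbitrary numbers: if tracking the variation along a single axis takes roughly n samples, then a Nyquist-style grid covering of a d-dimensional feature space needs on the order of n^d samples, which becomes hopeless very quickly:

```python
# If tracking the variation along a single axis takes ~20 samples, a grid-style
# covering of a d-dimensional feature space needs roughly 20**d samples.
samples_per_dim = 20
for d in (1, 2, 5, 10, 50):
    print(f"{d:>3} dimensions: ~{samples_per_dim ** d:.2e} samples")
```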
Regardless of how much training data you choose to collect, you need to be sure that you collect enough data from different parts of the sample space. More samples will be needed in areas of the sample space where classifications change quickly. It can often be difficult to collect this training data, but it will dramatically simplify the process of learning a classifier.