Applying the Nyquist rate to Machine Learning

In traditional signal processing, you often want to represent an analog signal accurately in digital form. To do this, you use an ADC to sample the signal. If the signal is (for example) a voltage that varies with time, then you sample it at many points in time so that the digital representation you build up looks the same as the original analog signal. The samples are taken periodically, and the number of samples you need is governed by the Nyquist rate: you have to sample at least twice as fast as the highest frequency present in the signal. The idea behind the Nyquist rate is that the faster the signal changes, the more samples you need in order to accurately capture those changes.
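To make that concrete, here's a minimal NumPy sketch (the signal and rates are made-up illustrative numbers): a 5 Hz sine wave has a Nyquist rate of 10 Hz, and sampling below that rate loses the signal to aliasing.

```python
import numpy as np

f_signal = 5.0  # Hz; the highest (and only) frequency in the signal

def sample(rate_hz, duration_s=1.0):
    """Sample a 5 Hz sine wave at the given rate for one second."""
    t = np.arange(0, duration_s, 1.0 / rate_hz)
    return t, np.sin(2 * np.pi * f_signal * t)

# Above the Nyquist rate (2 * 5 Hz = 10 Hz): the samples capture every wiggle.
t_good, x_good = sample(rate_hz=50)

# Below the Nyquist rate: the 5 Hz wave aliases to a slower-looking one,
# and no amount of processing can recover the original from these samples.
t_bad, x_bad = sample(rate_hz=6)
```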

People often forget that even though this idea was originally developed for communications and traditional signal processing, it applies just as much to machine learning and classification. In classification, for example, you’re trying to identify the class of some feature vector based on feature vectors that have already been classified. The feature vectors are drawn from some sample space, and the classifications vary over the sample space just as voltage varies over time in traditional communications and DSP. In order to be sure that the classifier you learn correctly represents the true distribution in the sample space, you have to have enough examples drawn from that sample space.

While the number of training examples is important, it is equally important that the examples are drawn from all parts of the sample space. Unfortunately, many machine learning applications focus more on generating lots of training data than on ensuring that the data covers all portions of the sample space. This works well a lot of the time, especially when the sample space isn't particularly complex or noisy. But when the classifications vary quickly over the sample space, it becomes much more important that your training data is dense enough to meet the Nyquist rate.

Theoretically, knowledge of the Nyquist rate can also help minimize time spent collecting training data. If you already know how fast the classifications vary over the sample space, then you can determine how many samples you need to accurately reconstruct the decision boundary. Unfortunately, this is often very difficult in practice, because you may not be able to choose where in the sample space your training data comes from.
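If you could estimate that rate of variation, the calculation itself would be easy. Here's a rough sketch with purely illustrative numbers (treating the fastest variation of the labels along one feature axis as a highest frequency):

```python
import math

# Made-up numbers: suppose class labels vary no faster than f_max "cycles"
# per unit distance along a feature axis of length extent.
f_max = 3.0    # fastest variation of the labels, cycles per unit distance
extent = 2.0   # length of the feature axis

# Nyquist: you need at least two samples per cycle of the fastest variation.
samples_needed = math.ceil(2 * f_max * extent)
print(samples_needed)  # 12 training examples along this one axis
```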

Traditional signal processing also deals mainly with one- and two-dimensional signals, while classification may be done on a feature space with tens to hundreds of dimensions. That dramatically increases the computational cost of determining even a simple function over the sample space (the number of samples needed grows exponentially with the number of dimensions). A trade-off is often made where more data is collected in order to reduce the processing time required to learn a useful classifier. Many learning algorithms are also fairly naive, and wouldn't fare well with only the bare minimum of training data.

Regardless of how much training data you choose to collect, you need to be sure that you collect enough data from different parts of the sample space. More samples will be needed in areas of the sample space where classifications change quickly. It can often be difficult to collect this training data, but it will dramatically simplify the process of learning a classifier.

Use the right metric

In the field of artificial intelligence, there’s a fairly simple (compared to other AI techniques) method called A* (pronounced A-star) that’s used to solve a lot of problems. It can be used in any situation where you’d want to find the shortest path from where you are now to some goal. This is a very common problem, and shows up everywhere from calculating the best airline routes to video game NPC AI. It’s useful in any kind of route planning, whether it’s a route through space or through some other set of sequential decisions.

A* works by comparing the state of the world now with the state of the world if a certain decision were made. Since it has to choose which decision to make before it knows what the result of a decision is, it uses a heuristic to approximate the result of different decisions and then chooses between those decisions. The heuristic approximates the distance to the goal after each given decision. There’s a bit more to it than that, but you can read about it on Wikipedia or something if you’re interested. What I find really intriguing about A* is the importance of the heuristic.

It turns out that A* doesn't always work well. If the heuristic isn't chosen carefully, then A* will do very poorly at finding a way to the goal, and it may settle on a route that is much worse than what's possible. Whatever heuristic is chosen must never over-estimate the true distance to the goal; a heuristic with that property is called admissible, and with one, A* is guaranteed to find a shortest path. AI researchers will spend a lot of time thinking up good heuristics for their application, often much more time than they spend implementing the path-finding algorithm itself.
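To make the heuristic's role concrete, here's a toy A* sketch on a 4-connected grid (my own minimal example, not from any particular library). Manhattan distance is admissible here because it never over-estimates the number of steps remaining:

```python
import heapq

def a_star(grid, start, goal):
    """Find a shortest path on a grid of 0 (free) / 1 (wall) cells."""
    def h(cell):  # Manhattan distance: an admissible heuristic on this grid
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    rows, cols = len(grid), len(grid[0])
    frontier = [(h(start), 0, start, [start])]  # (f, g, cell, path so far)
    best_g = {start: 0}
    while frontier:
        f, g, cell, path = heapq.heappop(frontier)  # cheapest f = g + h first
        if cell == goal:
            return path
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            r, c = cell[0] + dr, cell[1] + dc
            if 0 <= r < rows and 0 <= c < cols and grid[r][c] == 0:
                if g + 1 < best_g.get((r, c), float("inf")):
                    best_g[(r, c)] = g + 1
                    heapq.heappush(
                        frontier,
                        (g + 1 + h((r, c)), g + 1, (r, c), path + [(r, c)]),
                    )
    return None  # no path exists

grid = [[0, 0, 0],
        [1, 1, 0],
        [0, 0, 0]]
print(a_star(grid, (0, 0), (2, 0)))  # routes around the wall
```

If you swapped in a heuristic that over-estimates (say, ten times the Manhattan distance), the search might run faster, but it could hand back a path much longer than the best one.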

What AI researchers realize that a lot of other people don't is that the way you estimate the distance to your goal has a huge impact on how fast you reach it. I've been noticing recently that the standard heuristic is very poor in a lot of areas of life where people are trying to achieve something. I first noticed this in the field of education, when I was having trouble learning about AI in the first place.

When I took my first few algorithms and AI classes, I got very high grades. I'm good at school and test-taking, so I got 4.0s in most of the classes I took. Unfortunately, I was often unable to apply the information or implement the algorithms after the class ended. The heuristic I'd been using to see if I'd learned, my grade, was over-estimating my progress. I was eventually able to truly learn and implement the algorithms that were important, but I had to completely ignore grades while doing so. Instead, I just focused on small projects that included the algorithms. These small projects (what homework aspires to be, but often isn't) were a truer measure of progress than grades ever were.

Since noticing that grades were such a poor metric for education goals, I’ve been on the lookout for other places that this appears in people’s daily lives. Here’s what I’ve noticed so far. Can you think of any that I’ve missed?

  • Education: The default heuristic is grades, but number of completed projects is probably a better metric.
  • Life-success: It seems like a lot of people use income to measure this, but that’s a pretty limited metric and often seems to lead people to unhappiness. I would argue that autonomy is probably a better metric here, but I’m not sure how to measure autonomy.
  • Weight loss: I know a lot of people who have been working on losing weight and being healthier in general, but I think that weight itself is probably not a good heuristic for this. Body composition (body-fat percentage) does a lot better, but it's a bit harder for the average person to measure accurately.
  • Relationship health: People tend to measure the health of a relationship by how long it’s lasted, but this is a very inaccurate guide at best. Lots of long relationships are pretty bad, but for one reason or another, the people don’t break up. Other relationships are short, but end up being very positive experiences for everyone involved. It seems like the happiness of the partners is a much better heuristic, but it’s also harder to measure.

In each of these cases, it's pretty hard to measure something that's actually a good heuristic, so people choose a less useful one instead. With the worse heuristic, people can easily make measurements, but those measurements aren't actually very useful. Even though good heuristics are often harder to measure, they offer much better results.

The two most important techniques in technical writing: grad school edition

In grad school you do a lot of technical writing. You write research summaries for your professors, guides for your fellow students, outlines for courses you’re TAing, and (if you’re lucky) you even get to write journal papers. You might assume from this that grad schools require English classes or writing classes. They do not (at least mine didn’t). In my experience, grad students learn to write by writing a lot and being judged on it.

Most of what I’ve written never gets more than a few comments or questions on the topic, but when I write journal or conference submissions, I tend to get a lot more structured feedback. Since the good feedback on my own writing has been fairly rare, the most helpful thing that I’ve done to learn how to write better is critique other people’s work. Critiquing other people’s work allows me to hone some writing skills without the emotional pushback of being invested in the content that’s already there. It works even better if I’m currently writing a paper while I’m editing someone else’s, because I can immediately apply what I learn from editing to the paper I’m writing.

As I’ve progressed as a writer, I’ve noticed that grammar and sentence structure are really only the first step to writing well. Once you memorize the rules, it’s pretty easy to throw together a sentence that doesn’t make someone cringe. The hard part comes when you try to use those sentences to convey complex ideas. What part of the idea do you want to present first? What if the audience doesn’t know the background material? How do you make the paper interesting? What do you do if nobody understands the thing you’ve written, or if they don’t catch the most important part of it?

What I’ve come to realize is that the answers to all of these questions revolve around two things: ordering concepts and choosing detail level.

Ordering Concepts

When you’re writing a technical paper, there are a lot of different things to cover. The naive approach, which I’ve seen used a lot and have even used a time or two myself, is to present concepts in the order that you figured them out. For example, if you’ve designed a piece of hardware to solve the problem, you might write the document by going through all of the design decisions that went into it: what it has to do, the electronics design, the mechanical design, implementation, testing, the second prototype, re-testing, etc. This works alright, but it often isn’t the best way to go.

It can be much better to organize concepts by complexity instead of chronologically. This is especially true for documents that don’t detail a specific design or discovery. In this case, you want to start with the simplest concepts, and then present ideas that build on those. This kind of document leads the reader from simple to complex, rather than from start to finish. If you’re writing a report on the effects of opium use on Chinese culture, then you might want to start with the individual effects of opium, then progress to interpersonal effects, large-scale social and economic impact, and then the whole historical picture.

There are many more ways to organize a paper than just these two, but these are the ones I’ve had the most success with. The main thing is to have some kind of organization at all. Don’t start writing without knowing what the overall organization of the paper is. It can be annoying to have to sit down and think things out when you really want to get started writing, but it’s worth it. For a document you might spend a week or more on (or a month or more), spending some time up front figuring out the organizational structure can save you lots of time in the end.

Choosing Detail Level

I’ve read some incredibly boring papers about subjects that I loved. The main thing that all of these papers had in common was that they used the same level of detail throughout the paper. New writers are often more comfortable at a single level of detail, so their paper either reads like a list of facts (if they’re very detail oriented) or feels like it has no substance (if they’re always high level). When it comes to technical writing, I’ve noticed a lot more of the former.

No matter who’s reading your paper or why, you have to get them to care about the content if you want them to remember it. Often they’ll do this themselves if you can show them why the content is relevant. When encountering new information, people need some way to connect it to things they already know in order to retain it. This is why lists of facts are so hard to read: there’s no high-level context. High-level context before lots of facts and theory primes the reader with similar things they’ve heard before. This lets them associate what they’re reading with what they already know, and it makes the document more interesting.

On the other hand, if all you give is high level context, then the reader won’t learn anything from your paper. At that point, there’s no reason for them to even read it. You need the lower level detailed explanation. With technical writing, this usually isn’t the hard part. As a grad student, this was always what I started with when I had to describe a study and the results.

It’s clear you need both high level context and low level detail to make your paper worth reading. The main question is what ratio to use and how to organize the high and low level sections.

My own opinion is that each large concept needs some high level to tie it to things the reader already knows before the low level explaining the concept itself. When figuring out how to arrange these, I usually fall back on that old teaching advice “first tell them what you’ll tell them, tell them, then tell them what you told them”.

Whenever I have a new concept to cover (in the right place according to my outline of course) I introduce it with some text about why it’s important. Then I go over the concept. Then I talk about what the concept means and what it implies about other things in my paper. Often, this serves as a good introduction to the next concept I want to write about. This is pretty similar to the topic sentence/detail sentence/transition that they taught me in elementary school, just on a larger scale.

Putting it all together

I’ve found that using these two concepts not only makes my writing better, it makes my writing faster. Once I have an outline using whatever structure I think is best for the paper, I can go in and add some context in each section talking about why the different sections are important. That’s easy to write, because I just write down why I put the sections there in the first place. Then I can put in the details of each point.

The whole thing is an iterative process, so I usually end up with a different outline or structure than the one I started with. The act of adding in context and details in different places reminds me of other things I need to cover so that it all makes sense. After that, it’s just a matter of editing it 40 or 50 times before I can send it in.

Voting and the Prisoner’s Dilemma

Voting is important to me. It’s something that I do every time elections roll around, and it’s something that I take a lot of pride in. I wouldn’t say that I’m the most political guy in the world, but I definitely have strong opinions about the best direction for my country to go. I want to make sure that those opinions get counted and have an effect.

That’s why I’m so bothered by people who don’t vote, especially if those people agree with me. People’s excuses for not voting are as numerous as the non-voters themselves, but they usually fall into a few categories. There are the people who don’t vote because they’re too lazy. Some don’t vote because they just don’t care (or claim not to; they sure complain about the government a lot for people who don’t care). Others don’t vote because they say it’s not worth it.

It’s that last group that I want to address here. I’ve recently been in a few different conversations with people who feel that their vote has such a small impact that it’s not worth the time it takes to get informed and actually cast their ballot. The people I’ve met who think this way tend to be economists and computer scientists, so it’s not like they’re ignorant about the situation. From their perspective, they’re probably making the right choice. However, that choice is predicated on the assumption that many other people with similar views will be voting.

Voting is the Prisoner’s Dilemma played at scale. There are huge groups of people who can either defect (not vote) or cooperate (vote). Assume for simplicity that each issue is divided on party lines and each party is the same size. In that case, each individual voter is playing the prisoner’s dilemma with people of their own party.

If a person defects and doesn’t vote, they keep the time and energy they would have spent on voting to use for other things. If everyone defects, though, the bill/politician/whatever that they wanted gets voted down by the other party. That’s the classic dilemma structure: each individual is better off defecting no matter what the others do, but universal defection leaves everyone worse off than universal cooperation.
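Here's the payoff structure as a tiny sketch. The numbers are made up purely for illustration; the only assumptions are that winning is worth more than the cost of voting, and that one vote almost never tips a large election:

```python
# Illustrative payoffs for one voter whose side needs broad turnout to win.
BENEFIT = 10  # value to you if your side wins
COST = 1      # time and energy spent voting

def payoff(you_vote, your_side_turns_out):
    wins = your_side_turns_out  # your lone vote doesn't tip a large election
    return (BENEFIT if wins else 0) - (COST if you_vote else 0)

# Whatever everyone else does, you personally do better by not voting:
assert payoff(False, True) > payoff(True, True)    # 10 > 9
assert payoff(False, False) > payoff(True, False)  # 0 > -1

# But everyone defecting (0) is worse than everyone cooperating (9):
assert payoff(True, True) > payoff(False, False)
```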

So when people who share my political ideas say they don’t vote, I get upset. I don’t much mind that my vote is then that much more important. What I do mind is that those people are making it less likely that the world turns out how I want it.

This is a bit different than the popular idea that voting is a civic duty, and that everyone should vote. Basically I’m saying that I only want people who agree with me to vote. I’m still working on how to reconcile that with my general belief in the process of democracy.

Mood Monitoring and Fun Density

I’ve been getting more into the quantified self movement recently. The idea is that by tracking what you do and what the result is, you can understand yourself better. If you sift through the data, you can also find pretty easy ways to improve your life.

To get my feet wet measuring my life, I downloaded an app to my phone that will periodically ask me how I’m doing. I can answer anything from “normal” to “ok” to “excellent” on the positive end. On the negative end, I could answer “not good”, “terrible”, etc. When I make an entry about how I’m doing, I can also leave a note about what I was doing at that time. Once I have a large set of data from several weeks, I can go through and pull out patterns.

I first read about doing this in The Motivation Hacker, by Nick Winter. In it, he describes how he does a lot of new things and logs how he feels about them. Based on those logs, he figures out the fun density of different activities and tries to only do activities that have high fun density. He uses the example of white water rafting, where he has to travel to the river (not good), then rafts (ok with periods of awesome during rapids), then travel home (not good). His final conclusion after rafting was that it wasn’t worth it, because the fun density was low.

I like the idea of fun density as a way to measure activities, but I think that Winter’s application of it might be a bit flawed. Specifically, he describes his mood ratings as a logarithmic measure: ok is twice as fun as normal, and good is twice as fun as ok. But when he represents the ratings with numbers, he uses 1 to 10 (1 = terrible, 10 = excellent), and he takes the time average on that logarithmic scale of 1 to 10, not on the experienced scale of 2^1 to 2^10. This means that his fun density measurements will undervalue short periods of high fun, which matches my surprise at him not wanting to go white water rafting again.

Now, I’m not saying that he should go white water rafting again. If he actually didn’t think it was worth it, that’s totally fine and he should do other things he finds more fun.

What I am saying is that logarithmic rating systems are a bit tricky. If the rapids during rafting actually are an 8, and lasted for about twenty minutes of a two-hour raft that was a 6 on average, and there was a two-hour ride there and back that was a 4, then his experienced average would be (20*2^8 + 100*2^6 + 240*2^4)/360 ~ 42.7. That 42.7 is about 5.4 on Winter’s scale, not the 4.8 that a simple average would give.

(To be fair, Winter also got a headache on his rafting trip that dropped the last hour of rafting down to a 4. Taking this into account, we get (20*2^8 + 40*2^6 + 60*2^4 + 240*2^4)/360 ~ 34.7. That’s about 5.1 on his scale. It’s definitely less than his daily average (6.2!), but it’s more than the 4.4 that he was giving it.)
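For anyone who wants to check the arithmetic, here's a small sketch of the conversion, using the segment durations and ratings estimated above:

```python
import math

# Winter's 1-10 ratings are logarithmic: each step doubles the fun. To
# average what was actually experienced, convert each rating to the linear
# scale (2**rating), take the time-weighted mean, and convert back.
def experienced_average(segments):
    """segments: list of (minutes, rating) pairs."""
    total = sum(minutes for minutes, _ in segments)
    linear = sum(minutes * 2 ** r for minutes, r in segments) / total
    return math.log2(linear)

# The rafting trip, headache included, from the paragraph above:
trip = [(20, 8), (40, 6), (60, 4), (240, 4)]
print(experienced_average(trip))                              # ~5.1
print(sum(m * r for m, r in trip) / sum(m for m, _ in trip))  # ~4.4 simple average
```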

I’m pretty excited to start finding patterns in my own experienced fun levels. Having never tried to optimize my life for fun, I think my fun density might be pretty low. In a few weeks, I’ll have enough data to start finding things to do that improve my fun density. When I start doing that, I’m going to make sure that I don’t underestimate the impact of short periods of extreme emotion.

Walking Through Walls

I think my favorite super power is probably the ability to walk through walls. I’ve always wanted to be able to break into any building, escape any pursuer, or drop through the floor instead of taking the stairs. It’s one of my main regrets that I’ll never be able to do it.

The next best thing to being able to do something is knowing as much as you can about it. In elementary school I’d learned that atoms were mostly empty space. You have the electrons and the nucleus, but it seemed to me that if you managed to squish two atoms into the same space then they might be able to pass through each other.

I was pretty excited to learn, when I got to middle school, why exactly objects couldn’t do that. For one thing, if you got the atoms occupying roughly the same space, then the electromagnetic forces would interfere with each other and the electrons of both atoms would be disrupted. This would severely mess up any chemistry that was going on with those atoms, and probably do very bad things to a person walking through a wall (and the wall itself).

Luckily, my youthful attempts to pass through my bedroom wall were doomed to fail for a reason that didn’t involve all of my molecules coming apart. Originally, I thought this was due to electron repulsion. According to my high school physics teachers, atoms can’t get that close together because the electrons of the atoms repel each other. Just like magnets of the same pole, two electrons will stay as far apart as they can. You can’t make use of all that empty space within an atom because the electrons form a kind of force field to keep other atoms out of their own territory.

It turns out that electron repulsion isn’t actually what prevents objects from passing through each other. It’s a quantum effect called electron degeneracy pressure. Basically, two electrons can’t occupy the same quantum state (that’s the Pauli exclusion principle). When electrons are forced close together, they must assume different energy levels. This means that to bring electrons close together, you need to add enough energy to put most of them into very high energy states. The closer the objects come, the more energy you need. On the macro level, that manifests as degeneracy pressure. That’s why objects feel solid.

Understanding this almost makes up for not being able to walk through walls.

Copyright Unbalanced

I’ve been reading Copyright Unbalanced lately, and it gives a pretty good description of what copyright is for and why it’s important:

Like all other forms of property, copyright exists to address an externality problem. Because the author of a creative work, such as a song, cannot exclude others from the benefits her work creates, authors who publish works are creating a positive externality. The problem is that if authors can’t internalize at least some of the positive externality they produce, then they will have only a weak incentive to create and publish works. Put another way, if authors have no way to exclude others from enjoying their works, and therefore can’t charge users for access, then they won’t produce as many works as they otherwise would, making everyone worse off. Copyright addresses this externality problem by creating a legal right to exclude others from enjoying the work without the author’s permission. If authors can sell permission for money, they can capture a higher proportion of the benefits they create, and their incentive to produce creative works in the first place will increase.

The book also goes over many of the things that are wrong with copyright as it’s currently implemented in this country. I was especially interested in this tidbit:

Congress is supposed to represent the public’s interest, but it has abdicated that responsibility. As Jessica Litman has carefully documented, Congress has turned over the responsibility of crafting copyright law to the representatives of copyright-affected industries. That is, lobbyists write the copyright laws—not just figuratively, but literally.

For more than 100 years, copyright statutes have not been forged by members of Congress and their staff, but by industry, union, and library representatives who meet (often convened by the Copyright Office) to negotiate the language of new copyright legislation. As Litman explains, “When all the lobbyists have worked out their disagreements and arrived at language they can all live with … they give it to Congress and Congress passes the bill, often by unanimous consent.”

So to sum up, copyright was created to benefit the public by making artists more willing to create. It then got taken over by the artists (or, more specifically, the marketers and labels), who pushed to have it extended beyond any reasonable benefit to society. The current state of affairs is pretty sad.

Relativity and your smartphone

Einstein’s theory of relativity has dramatically changed life on our planet. It’s used in a lot of different technologies, but perhaps the most surprising place to find relativity is in your smartphone. Smartphones account for relativistic effects in two different ways.

The place where it’s most commonly pointed out is in GPS. Your phone figures out where it is by calculating the distance to a number of satellites. It does this by measuring the time of flight of a signal broadcast by each satellite. Once the phone knows how far away the different satellites are, it can use trilateration against the known positions of the satellites to figure out where you are. This location measurement can be pretty precise (on the order of meters).

The precision of GPS is possible because your phone takes into account special relativity in the form of time dilation. Satellites are travelling very fast with respect to a stationary smartphone. That high speed means that time goes slower for the satellite, and the clock it uses to calculate time of flight is off. Your phone takes that into account when calculating how long it took the signal broadcast by the satellite to get to wherever you are.

General relativity comes into play because satellites are so much higher than your phone, which means that they experience less of Earth’s gravity than you do (note that this is different from microgravity). Since satellites experience less gravity than you do, time passes faster for them. So there are really two relativistic effects that need to be taken into account to figure out how fast the satellite’s clock is ticking, which in turn determines how long it takes for a radio signal to travel from the satellite to your phone.
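You can estimate both effects with a back-of-the-envelope sketch using standard textbook constants (the orbital numbers below are approximate):

```python
import math

C = 2.998e8        # speed of light, m/s
GM = 3.986e14      # Earth's gravitational parameter, m^3/s^2
R_EARTH = 6.371e6  # Earth's radius, m
R_SAT = 2.657e7    # GPS orbital radius, m (~20,200 km altitude)
DAY = 86400        # seconds per day

v = math.sqrt(GM / R_SAT)  # orbital speed, roughly 3.9 km/s

# Special relativity: the fast-moving clock runs slow by ~v^2 / 2c^2.
sr_us = (v**2 / (2 * C**2)) * DAY * 1e6  # about 7 microseconds/day slow

# General relativity: the higher clock runs fast by the potential difference.
gr_us = (GM / C**2) * (1 / R_EARTH - 1 / R_SAT) * DAY * 1e6  # about 46 fast

print(f"special relativity: -{sr_us:.1f} us/day")
print(f"general relativity: +{gr_us:.1f} us/day")
print(f"net drift: +{gr_us - sr_us:.0f} us/day")
```

That net drift of roughly 38 microseconds per day sounds tiny, but multiplied by the speed of light it works out to around 10 kilometers of position error per day if left uncorrected.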

The second way that a smartphone takes general relativity into account is far simpler. Your phone has an accelerometer in it that measures acceleration on the phone. This is how your phone knows which way you’re holding it. It’s also how it makes those cool light-saber sounds when you swing your phone around.

When you’re having a light-saber duel, your phone is measuring the acceleration applied by your wild jabs and lunges. No relativity there. However, when the phone is stationary and it detects which way it’s oriented, it’s measuring gravity. Gravity isn’t acceleration, but under general relativity’s equivalence principle the two are locally indistinguishable. It’s only through the effects described by general relativity that your phone works the way that it does.

Science! It’s closer than you think!

Men in Black ethics

I remember really enjoying the Men in Black movies when I was younger. They’ve got explosions, aliens, flying saucers, Will Smith. They’ve got everything that makes a movie great. However, the movies are missing one very important thing: good ethics.

It slipped by me when I was watching the movies as a kid, but humans in the Men in Black universe are kind of the galaxy’s village idiot. The first movie explains that television, computers, and basically every other technology were given to us by aliens. We’re apparently not capable of developing any of these things ourselves.

In a universe where it’s easy to travel from one planet to another and there are all kinds of interesting planets with interesting life to visit, we humans are stuck on Earth. We get the alien technology that they don’t want: television. They keep their faster-than-light travel for themselves.

The culture of craft and making that’s seen a resurgence in the past decade or so has me very excited. It shows that people are creative, interested in learning, and willing to build things that make the world a better and more exciting place to live. It’s humans making these inventions, and we celebrate those past humans who invent. People like Philo Farnsworth, the Wright Brothers, and Alan Turing.

When movies like MIB cast humans as incapable of inventing, they do a disservice to our culture and our history.

Why Transform?

I’ve just recently had an epiphany about signal processing. It’s kind of embarrassing that it’s taken me so long to realize this, but all the transforms I’ve been doing in classes exist to make the signal separable from the noise in my data.

That seems pretty simple, so let me back up and explain why it took me so long to realize it. I’ve been taking signal processing classes off and on for about five years now. The classes have mostly focused on a few transforms (Fourier and wavelet, mostly) and how they can be used to filter an incoming signal. We’ve made low-pass filters, high-pass filters, and everything in between. It was never quite clear to me why you use the transform, though, when you can just do everything in the time domain.

I didn’t put too much thought into that, because computations can be easier to do in the frequency domain. Convolution in the time domain corresponds to multiplication in the frequency domain, so some calculations are much faster there. I understood that, and thought that I was using transforms that brilliant people had invented just to speed up their computations. I had no intuition for how they could have developed the transforms, though. How could they have known a transform would make calculations faster? I put it down to Laplace and Fourier just being more brilliant than me.
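The correspondence is easy to verify numerically. Here's a quick NumPy sketch (the only subtlety is zero-padding both sequences to the full convolution length):

```python
import numpy as np

x = np.random.randn(64)  # a signal
h = np.random.randn(16)  # a filter's impulse response

direct = np.convolve(x, h)  # convolution in time: O(N*M)

n = len(x) + len(h) - 1     # full linear-convolution length
via_fft = np.fft.ifft(np.fft.fft(x, n) * np.fft.fft(h, n)).real  # O(N log N)

print(np.allclose(direct, via_fft))  # True
```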

What I’ve recently come to realize is that, while Laplace and Fourier were indeed brilliant, their transforms serve a different purpose altogether. The speed up that I got in filter calculations is almost an afterthought to the real purpose of using a transform.

Filters only let through the frequencies that you want. This is obvious when you see plots of filters in the frequency (Fourier) domain. I was clear on this from the outset. You use the Fourier transform to select frequencies, gotcha.

For some reason, this knowledge didn’t generalize like it should have. I went around telling myself that filters select different frequencies, and that convolution in time was multiplication in frequency, but I didn’t get that this was the whole point of the transform in the first place. Noise in the time domain is hard to separate from a signal, but in the frequency domain it can be very easy to separate.

And that is the key behind transforms. The real reason you do the transform isn’t so that you can do fast multiplication instead of slow convolution. The real reason to transform a signal to a new domain is because the new domain can make the parts of the signal you’re interested in easier to separate from everything else. That just happens to make the calculations faster too.

This separability comes up in all kinds of signal processing, pattern recognition, and machine learning. A transform may help anywhere you want to separate one type of thing from another. Making it easier to separate the wheat from the chaff is why you calculate features before feeding your data into machine learning algorithms.

My understanding of signal processing now revolves around three steps.

  1. Transform the incoming data so that the components you’re interested in are easy to separate from the components you’re not (separate the signal from the noise).
  2. Do whatever calculations you need to in order to get the output that you want.
  3. Transform the output to the domain you need it in; the new domain is usually, but not always, the same as the domain the data had in the first place.
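Here's a minimal sketch of all three steps on a made-up example: a 5 Hz sine buried in broadband noise, cleaned up in the Fourier domain and transformed back.

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 500, endpoint=False)
clean = np.sin(2 * np.pi * 5 * t)                  # the 5 Hz signal we want
noisy = clean + 0.5 * rng.standard_normal(t.size)  # buried in broadband noise

# Step 1: transform to a domain where signal and noise separate.
spectrum = np.fft.rfft(noisy)
freqs = np.fft.rfftfreq(t.size, d=t[1] - t[0])

# Step 2: do the calculation (here, just zero out the noisy bins).
spectrum[freqs > 10] = 0

# Step 3: transform back to the domain you need the output in.
recovered = np.fft.irfft(spectrum, n=t.size)
```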