This post is all about how to evaluate your project. This is something you will do ideally about half way through your work, but more realistically towards the end of your work, just before you write up. However, it's imporant that you plan your evaluation early on, otherwise you run the risk of getting almost to the end of your work and finding that your evaluation process isn't going to give you any good results. So, the main thing you need to understand when you are planning is how evaluation works in the sciences and how to apply that understanding to your own projects.

**Proof**

Sometimes we see student dissertations which make claims like "this experiment has proved that ...". Almost always the student has used the word "proof" incorrectly, and will lose marks because they have given the impression that they have not understood the contribution that their work has made to the field they are working in. Very often students in this situation simply haven't understood how scientists use the word "proof".

*Proof*, in a scientific context, is a mathematical argument that is used to convince other mathematicians or scientists that a *theorem* (or a mathematical idea) is true. Proofs must never involve evidence or experiments, only arguments. There's an example proof of a simple theorem at the end of this post.

Once mathematicians are convinced that a proof is correct (and sometimes that is difficult in itself, if the proof is several hundred pages long) then it is irrifutable. This is very different to the sort of science that is advanced by experiments, where another scientist can find new data or eveidence that shows that an old idea was wrong.

So, we generally say that a *theorem* can be proved correct whereas a *hypothesis* (or guess!) can only be tested via experiments. A hypothesis might turn out to be wrong if experimental data cannot be found to support the hypothesis, or contradictary evidence is found. If a lot of evidence is found to support a hypothesis we might call it a *theory*. Even so, a theory cannot be proved correct in all cases. For example, if you came up with a theory that said all atoms have a particular shape, you might invent a special microscope to look at atoms and find out if they have your shape. This would provide some evidence to support your theory. You couldn't, however, test every single atom in the Universe, so your hypothesis might well become a theory, but it can never be "proved" correct.

*[Aside: there's a long literature in this sort of philosophy of science. If you are really interested, read Karl Popper on Falsification, AJ Ayer on Verification and Paul Feyerabend on scientific revolutions and Imre Lakatos on Proof.]*

**Scientific method**

*Scientific method* is the way that scientists decide whether a particular *hypothesis* (or guess) is likely to be a good model for the way the world works. If most scientists accept that the hypothesis is likely to be true, then we call it a *theory*. Of course, even theories have limitations, and it may be that as more experiments are carried out we find that a different theory fits the evidence better, or that the theory only works in certain circumstances. This is exactly what happened in physics to Newton's laws of motion. It turns out that Newton's laws describe the world pretty well in most cases, they can certainly tell you when your train is likely to arrive at its destination. For other circumstances, for example when you are travelling very fast, close to the speed of light, or for very small particles like quarks, other theories (like Einstein's theories or quantum mechanics) better fit the data we have gathered. Of course, much of this work in areas like physics is driven by what we can measure and observe. Better telescopes mean better theories of cosmology, and so on.

In computer science we also have hypotheses that we can test. For example "functional programming languages can run just as efficiently as imperative languages", "online learning increases student engagement", "objects and inheritance improve code reuse in software companies", and so on.

To be a true hypothesis, and not just the opinion of the author, a statement must be *refutable*, that is, it must be possible for experiments to determine that the hypothesis is incorrect. The opposite statement to a hypothesis is called an *alternate hypothesis*. Examples for the hypotheses listed above would be "functional languages are necessarily slower (or faster!) than imperative ones", "online learning has no effect on student engagement" and "objects and inheritance have no effect on code reuse in software companies".

So, to evaluate your own research questions, you need to do the following:

- Devise a hypothesis.
- Form your alternative hypothesis.
- Plan an experiment that tests whether the hypothesis or the alternative hypothesis is true.
- Conduct your experiment.
- Analyse the results of your experiments.
- If the results are conclusive, STOP. Else, re-run the experiments, or devise a better experiment and repeat.

In a student project, you may not have time to repeat your experiments, especially if they involve people, but you should design your evaluation in such a way that this would be possible, were you to continue the work.

**About experiments**

A good experiment should test one variable and one variable only. So, if your hypothesis is "neural network algorithms run faster in C than C++" then you will probably want to implement some neural network algorithms in both languages. You should make sure that the programs are as similar as possible, except for the language you are using. If you implement slightly different algorithms, it may be the algorithm and not the language which is causing any change in performance you observe. In this case, the programming language is called the *independent variable* and the algorithms are called the *controlled variable* and the speed is the *dependent variable* which is being measured.

**Interpreting your results: correlation does not imply causation**

Correlation by xkcd

When you perform an experiment, you are hoping that the outcome will lend some evidence to either your hypothesis or your alternate hypothesis. Going back to the example above, they hypothesis "neural network algorithms run faster in C than C++" has an alternate hypothesis "neural network algorithms run no faster in C than in C++". If we run an experiment to test this, and assume it's a fair experiment, and the results are that all our algorithms run faster in C, what has this told us? A naive answer would be that the experiments have confirmed the hypothesis that C is the faster language for this sort of algorithm. A more subtle answer would be that efficient neural networks are *correlated* with neural networks written in C. That means that when the algorithm is written in C it's likely to run quickly, which is what the experiment reported. This does not necessarily mean that the algorithms implemented in C ran quickly *because* they were written in C, it may be that there was some other factor involved that the experiment didn't effectively control.

In experimental work it is very important to understand this subtle distinction, otherwise you can easily fool yourself into believing that your experiments have discovered something far more conclusive than is actually possible.

To give you a better idea of how this distinction between correlation and causation works, below are some examples of incorrect conclusions drawn from perfectly reasonable correlations. See if you can work out why the conclusions are unreasonable:

- Children with bigger feet have higher reading ages. Therefore, people with bigger feet are more intelligent.
- Teenagers who text late at night have poor motivation in class (see news reports here). Therefore, using mobile phones leads to poor performance in class (see also a more skeptical analysis here).
- In the last 150 years there has been a dramatic increase in the number of people who report being abducted by aliens. There has also been a trend towards global warming. Therefore, alien abductions cause global warming.

In your own work, just be honest and straight forward about your results. If they aren't conclusive then say so and demonstrate your understanding by describing what future work could be done to gather more data.

**Some basic dos and don'ts**

This is some more specific advice, based on good and bad practice we have seen from students over the years:

- DO be clear and honest about what results your evaluation has obtained.
- DON'T claim to have "prooven" anything if you haven't written a formal, mathematical proof.
- DO use an appropriate experiment for your hypothesis. For example, if your work is about evaluating the performance or security of a technique, there is no need to involve real users in your evaluation. If your hypothesis is about usability you really must involve real users.
- DON'T use questionnaires unless you can guarentee to get a large sample size of answers (always well above thirty) and you understand the statistics needed to analyse the results. If you are in any doubt at all about this then seek the advice of a qualified statistician before you start your project. If you can't do that, think about using an alternative evaluation method such as semi-structured interviews.

**Appendix**

**Example proof: The square root of 2 cannot be written as a fraction of whole numbers**

**Theorem**

The square root of 2 cannot be written as a fraction of two whole numbers. (This is sometimes called the Theorem of Theaetetus)

**Proof (by contradiction)**

Imagine we could write the square root of 2 as a fraction of two whole numbers, say *x/y* where *x* and *y* are integers.

Let's say that *x* and *y* don't have any factors in common, so *x/y* is already written in its simplest form and no numbers can be "cancelled out" of the fraction.

So, we can also say that (*x/y)*(x/y)=2*

Therefore *(x*x)/(y*y)=2*

Therefore *(x*x)=2*(y*y)*

So we now know that *x*x* is even, since *x* is *2* times another number.

Since *x*x* is even, we also know that *x* is even (by the "Lemma" or little theorem that squares of odd numbers are never even).

Therefore, there must be a number, which we'll call *z* such that *x=2*z*

So, *(2*z)*(2*z)=2*(y*y)*

Or, more simply, *2*z*z=y*y*

*y* must also be even, by the same argument that we used to say that *x* is even.

If *y* is also even, there must be some number, which we'll call *w* such that *y=2*w*

But if *x/y=2z/2w* then the fraction *x/y* was not in its simplest form like we assumed above.

This contradicts our initial assumptions, which must have been wrong.

So, the square root of 2 cannot be written as a fraction of whole numbers.