In “An Undetectable Computer Virus”, the authors David M. Chess and Steve R. White expand on Cohen’s result that no algorithm can perfectly detect all possible viruses by showing that there exist viruses that no algorithm can detect. After a brief introduction to the notation used in the article, the authors summarize Cohen’s result and give an example of it. The example is that for any virus detection algorithm A there exists a program p which reads:
“if A(p), then exit; else spread”
So if A(p) returns true, the program exits (does not spread) and is therefore not a virus (since it is only a virus if it spreads), which makes A’s answer a false positive. However, if A(p) returns false, then the program spreads, so A failed to detect the virus p (a false negative). This shows that there “is no algorithm which detects all viruses without error”. In fact, any program that tries to do this will give either a false positive or a false negative for some files.
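The argument above can be played out in a small toy model. The sketch below is only an illustration and is not taken from the article: programs are modeled as plain Python functions that return the string "exit" or "spread", and the names make_contradictory_program, is_virus and detector_is_wrong_on are hypothetical helpers invented here. Nothing in it replicates or infects anything. The detectors used are deliberately simple ones that never run the program they inspect, since a detector that ran p would call itself without end.

```python
from typing import Callable

# Toy model (an assumption of this sketch): a "program" is a function that,
# when run, reports what it did as the string "exit" or "spread".
Program = Callable[[], str]
Detector = Callable[[Program], bool]


def make_contradictory_program(detector: Detector) -> Program:
    """Build p = 'if A(p), then exit; else spread' for a given detector A."""
    def p() -> str:
        # p asks the detector about itself and then does the opposite
        # of what the detector predicted.
        return "exit" if detector(p) else "spread"
    return p


def is_virus(program: Program) -> bool:
    """Ground truth in this model: a program is a virus iff it spreads."""
    return program() == "spread"


def detector_is_wrong_on(detector: Detector, program: Program) -> bool:
    """True when the detector's verdict disagrees with the ground truth."""
    return detector(program) != is_virus(program)


# Two deliberately simple detectors that never execute the program.
def flags_everything(program: Program) -> bool:
    return True


def flags_nothing(program: Program) -> bool:
    return False


if __name__ == "__main__":
    for detector in (flags_everything, flags_nothing):
        p = make_contradictory_program(detector)
        print(detector.__name__, "verdict:", detector(p), "| p does:", p(),
              "| detector wrong:", detector_is_wrong_on(detector, p))
```

Whichever verdict the detector gives, the program built against it behaves the other way: flags_everything produces a false positive and flags_nothing a false negative. The same construction works for any detector that returns an answer, which is the point of the example.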
The next section deals with the claim that there is a virus that no algorithm can detect. Chess and White state that “even when we have a sample of the virus in hand and have analyzed it completely, we cannot write a program that detects just that particular virus with no false positives”. The authors then note that a virus is polymorphic if it can mutate (that is, it can change its components a little while still doing exactly what it was programmed to do). In their notation a virus V is a set of programs, and V is polymorphic if the size of the set is greater than one. That is, for programs (or instances) p, q in V, p eventually produces (or mutates into) q. The authors then consider a virus set that is sufficiently polymorphic (V is big enough) that “for any implementable algorithm X and program p:
if X(p) then exit, else spread
is an instance of the virus.” That is, no algorithm X correctly detects this virus all of the time, since for any algorithm X there will be an instance p in V (because V is sufficiently polymorphic) that reads:
if X(p) then exit, else spread
so that X returns the wrong result on p and thus does not correctly detect the virus all of the time (essentially the same argument as the one above).
The authors now back up their claim that there exist viruses that no algorithm can detect by giving an example of one:
“Consider virus W one instance of which is r:
if subroutine_one(r) then exit, else{
replace the text of subroutine_one with a random program;
spread;
exit;
}
subroutine_one:
return false;
”
The authors then show, by an argument similar to the one above, that for any detection algorithm C there is a program s (an instance of W) whose subroutine_one reads: return C(argument), where argument is replaced by s when the program is run. The result follows by the same argument as for X and p above. The authors then briefly explain that their claim still holds even under a looser notion of detection (that is, even if a detection algorithm X need only return true if p is infected with W and something else otherwise); W is still undetectable.
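To make the role of subroutine_one concrete, here is a minimal sketch of the construction in a toy model. It is an illustration under stated assumptions rather than the authors’ own code: an instance of W is modeled as a Python object whose only varying part is its subroutine_one, running an instance just returns "exit" or "spread", and the mutation step (replacing subroutine_one with a random program) is not simulated. The names Instance, original_r and instance_defeating are hypothetical and introduced only for this example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Instance:
    """Toy model of one instance of W; only subroutine_one differs
    from one instance to another."""
    subroutine_one: Callable[["Instance"], bool]

    def run(self) -> str:
        # Mirrors: if subroutine_one(r) then exit, else { mutate; spread; exit }
        # The mutation is not simulated; only the outcome is reported.
        if self.subroutine_one(self):
            return "exit"
        return "spread"


def original_r() -> Instance:
    """The instance r quoted above, whose subroutine_one returns false."""
    return Instance(subroutine_one=lambda r: False)


def instance_defeating(detector: Callable[[Instance], bool]) -> Instance:
    """The instance s whose subroutine_one is 'return C(s)' for detector C."""
    return Instance(subroutine_one=lambda s: detector(s))


# Simple stand-in detectors (assumptions of this sketch); they inspect
# nothing and never run the instance they are given.
def flags_everything(instance: Instance) -> bool:
    return True


def flags_nothing(instance: Instance) -> bool:
    return False


if __name__ == "__main__":
    print("r does:", original_r().run())
    for detector in (flags_everything, flags_nothing):
        s = instance_defeating(detector)
        print(detector.__name__, "says", detector(s), "| s does:", s.run())
```

In this model, if C(s) is true then s simply exits and C has raised a false alarm, and if C(s) is false then s spreads and C has missed an instance of W. Because mutation can replace subroutine_one with any program, an instance of this shape exists for every candidate detector C.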
Finally, the authors explain that the only real practical application (so far) of their result and Cohen’s result is to dismiss the idea that there is an algorithm that can detect all viruses “known and unknown”. The authors then point out that the notion of detection used in their article differs from that of the real world, since real-world detection allows for some false positives (though very few), and they leave the door open for more work in the area to develop “a more formal characterization of this more realistic notion of detection”.
One of the main assumptions that the authors make in writing this article is that the reader is familiar with pseudocode (the pieces quoted above that look like computer code), which is used to present the example viruses that support their arguments. A reader who does not understand pseudocode would not be able to follow the proofs of the claims the authors make in the article (and thus would not be able to judge whether or not they are really true). However, it is fair to assume that the reader understands pseudocode, since the target audience for this article is computer scientists (who should all know how to read pseudocode from their programming classes). Another (small) assumption that the authors make in the article is that the reader understands some basic mathematical notation, such as the symbols for “there exists” and “for all” (the backwards E and upside-down A respectively), as well as some basic set theory. Again, given the audience this article is aimed at, this assumption is a fair one.
