/dev/kev

Swine - 20th IOCCC loser

I submitted an entry to the 20th International Obfuscated C Code Contest (IOCCC) in January 2012. It didn't win anything, and the way it abused the rules has been patched, which means I can't try to improve or resubmit it. So I'm publishing it here so I don't feel like all that effort was for nothing.

Swine is not your average quine - it is effectively a "source-level virus", propagating itself via stdio.h, and with a payload that causes if() statements to very occasionally be inverted.

I've had code that more-or-less does this since around mid-2000, and have always wanted to get it into shape for an IOCCC. Now the 21st IOCCC has been announced, and the preliminary rules have been updated (emphasis added):

|  21) Your program must not modify the content of the original
|      prog.c C source file.  If you need to modify the entry, copy
|      prog.c to another filename in the same directory and then
|      modify that file.  Your entry must not create or modify
|      files above the current directory with the exception of
|      of /tmp the /var/tmp directories.
So at least I can be reasonably sure that the judges did pay some attention to my entry. :)

You can get the as-submitted swine.c code, remarks/README and build instructions (which are just 'gcc -Wall -std=c99 -pedantic -o swine swine.c'). The markdown-formatted remarks are also included inline below. I might also put this onto github, including the scripts used to generate swine.c.

This code is licensed under the Creative Commons Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0) license.


SWINE

Not only did I write this code, but I can read your mind. You are thinking, "Oh great, yet another boring self-replicating code. I see it's based on the venerable 1990/scjones, with a few extra tricks thrown in, like jumbling up the code string array. yawn"

WRONG

Although I started with 1990/scjones as a base and there is some resemblance, this entry does much more than just spew itself onto stdout.

But first, a very important warning:


DO NOT RUN THIS CODE AS A SUPER-USER


This program may look like a quine, but it's actually more like swine. Running it with root privileges will effectively "infect" the C compiler on your system.

On the surface

For each file given on the command line, the program will ensure that the file ends with a copy of the program's source code. If the file doesn't exist, then it will be created. The program's source will be appended to the file if necessary, ie. it checks if the end of the file is correct, and if not then appends to it (as opposed to blindly always appending, or blindly always rewriting/overwriting the final bytes of the file).
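The check-then-append behaviour can be sketched on its own. This is a minimal illustration, not the code from swine.c: the name ensure_tail() and its return convention are made up for the example, and it assumes a POSIX-style fseek() that accepts a negative offset from SEEK_END.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns 1 if text was appended, 0 if the file
 * already ended with it, and -1 on an I/O or allocation error. */
int ensure_tail(const char *path, const char *text)
{
    size_t len;
    FILE *f;
    int matches = 0;

    assert(path != NULL && text != NULL);
    len = strlen(text);
    if (len == 0)
        return 0;                       /* nothing to ensure */

    f = fopen(path, "rb");
    if (f != NULL) {
        char *buf = malloc(len);
        if (buf == NULL) {
            fclose(f);
            return -1;
        }
        /* Seek back len bytes from the end; this fails (and so counts
         * as "no match") when the file is shorter than text. */
        if (fseek(f, -(long)len, SEEK_END) == 0 &&
            fread(buf, 1, len, f) == len &&
            memcmp(buf, text, len) == 0)
            matches = 1;
        free(buf);
        fclose(f);
    }
    if (matches)
        return 0;

    f = fopen(path, "ab");              /* append mode creates a missing file */
    if (f == NULL)
        return -1;
    if (fwrite(text, 1, len, f) != len) {
        fclose(f);
        return -1;
    }
    return fclose(f) == 0 ? 1 : -1;
}
```

Calling ensure_tail() twice in a row with the same arguments appends the first time and does nothing the second time, which is the "append only if necessary" behaviour described above.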



SPOILER WARNING

Stop reading now if you want to try to figure it out for yourself.



What it really does

In addition to the files listed on the command line, the program will similarly process /usr/include/stdio.h (or wherever your system's stdio.h is located). If it has write permission to this file, then it will ensure that an alternate set of code is always at the end of stdio.h. This code is itself self-replicating, and by way of payload includes a #define of the if keyword that occasionally causes its result to be inverted.

Thus, if you are stupid/crazy enough to run this program as root, or as any user that has write permission to the compiler's stdio.h, then your system will effectively be "infected" by this "source-level virus". From this point, any other program compiled on the system (by any user) will also be infected, and if those compiled programs are subsequently run on other similar systems, then they too will be infected.

Seeing it in action without nailing your system

The easiest way is to copy /usr/include/stdio.h (or wherever it is on your system) into the current directory, and then add -I. to the compiler flags when compiling swine.c. The resulting program will then mess with ./stdio.h, rather than /usr/include/stdio.h.

You can then compile some other test program which uses stdio.h, and give it -I. as well, so as to use the infected ./stdio.h. If ./stdio.h is "cleaned" by re-copying it from /usr/include, then running the other program will nicely reinfect it.

Abuse of the rules

By now it should come as no surprise that this entry is aiming for the "worst abuse of the rules" award, or maybe "most deceptive C code". Clearly, code that is designed to propagate itself in perpetuity (ie. a virus by any other name) is outside the scope of the spirit of the rules. However, it is not outside the letter of the rules - #5 is very clear:

The build file, the source and the resulting executable should be treated as read-only files.

Since no other files are mentioned, these must be the only files that are considered read-only. This means that all system files, including those used by the compiler (such as stdio.h), are fair game for writing. Thus, this entry is not invalidated merely because it tries to adulterate your system files. There is also no rule which states that malicious code is not allowed.

Probably the best plug for this hole in future years would be to add "system files" to the list in rule #5.

Obfuscations and how it works

The top-level obfuscation is the usual mess of #define's, along with terse/misleading identifiers, hideous expressions, excessively long lines and no code indentation at all. (Actually I wish these last two didn't have to be the case, and the original version wasn't, but it was necessary for the size limit - the self-replicating nature of the code effectively halves the limit.)

A sloppy reader of the code will just assume that the string version of the program in the a[] array contains the same lines as the main program, only jumbled up somehow. Slightly more astute readers will see that there are more lines in a[] than in the rest of the program, but may just assume that they are red herrings thrown in as distractions. Or they may notice the strange string of apparent line noise, and wonder what that's all about.

There are three main, subtle obfuscations that the program hinges on. Two are simple "bugs" that are reasonably common, but difficult to catch visually (or to put it another way, will easily pass the "glance test").

Obfuscation 1: Finding stdio.h

The first is the contents of the prgnam string. This is set to the value of __FILE__ as evaluated inside stdio.h. This is how the program is able to write to /usr/include/stdio.h without ever having to mention /usr/include or anything else so blindingly suspicious. The only mention of stdio.h at all is in the exceedingly common and very innocent #include <stdio.h>. (Indeed, I haven't tested but I wouldn't be surprised if the code was portable to Windows, where prgnam may end up being something like "C:\Program Files\Some Stuff\Include\stdio.h".) The only very slightly conspicuous aspect is that this #include appears a little lower than it ordinarily does (ie. the first line).

In order to do this, the code defines a macro for fflush() like so:

#define fflush(f) b; c prgnam[] = __FILE__; int fflush(f)

When the fflush() function is later defined inside stdio.h, it is changed from looking something like this:

extern int fflush(FILE *f);

to this (newlines added for readability):

extern int b;
c prgnam[] = __FILE__;
int fflush(FILE *f);

This is a great little trick for injecting code into places it's not supposed to be, and it's used again later.

(Technically speaking, the virus will be implanted into whichever system file defines fflush(), which is usually stdio.h, but could be any file #included therein. In any case, #include <stdio.h> will still pick up the implanted virus.)
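The trick can be tried in a single file without touching any system header. The sketch below is only an illustration: demo_flush, demo_b, demo_where, captured_file() and check_capture() are made-up names, and a #line directive stands in for "being inside stdio.h" by changing what __FILE__ expands to.

```c
#include <assert.h>
#include <string.h>

/* Stand-in for the fflush() macro, using hypothetical names. */
#define demo_flush(f) demo_b; const char demo_where[] = __FILE__; int demo_flush(f)

/* Pretend the next declaration lives in a system header. */
#line 1 "fake_header.h"
extern int demo_flush(void *stream);
/* The line above expands to:
 *   extern int demo_b;
 *   const char demo_where[] = "fake_header.h";
 *   int demo_flush(void *stream);
 */
#line 20 "demo.c"   /* back to a made-up name for this file */

#undef demo_flush   /* so the real function can be defined and called */

int demo_flush(void *stream)
{
    (void)stream;
    return 0;
}

/* Returns the file name captured at the hijacked declaration. */
const char *captured_file(void)
{
    return demo_where;
}

/* Self-check: the string holds the "header" name, not this file's. */
void check_capture(void)
{
    assert(strcmp(captured_file(), "fake_header.h") == 0);
}
```

__FILE__ is only expanded where the macro is used, not where it is defined, which is why the string ends up naming the header rather than the file containing the #define.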

Obfuscation 2: Using stdio.h

The second main obfuscation is being able to use the contents of prgnam. This is done by setting argv[0] (referred to as *v) to point instead to prgnam. This seems innocent enough - the program is just ensuring that argv[0] (which shows up in ps output, etc) is what it should be: prgnam, presumably the program's name. So this is a misleading identifier name.

This is followed by a subtle (intentional) bug in the subsequent loop over argv, so that it starts from argv[0] instead of argv[1]. The loop pointer p starts at v++, ie. v = argv, when it should actually start from ++v, ie. v = argv + 1. Hidden inside the mess of the for loop, it's easy to miss that:

for (q = (s = p = v++) + ac; p < q; s = p++)

should actually be:

for (q = (s = p = ++v) + ac; p < q; s = p++)

Obfuscation 3: Special treatment of stdio.h

The third obfuscation is the way that one set of lines from a[] is used for the first file (stdio.h), and another set for the rest (from the command line).

There are three functions that define mappings into the array a[]. l() is a linear mapping, g() uses a list encoded into one of the unused strings in a[], and x() uses a simple 5-bit Fibonacci linear feedback shift register (LFSR) to get a pseudo-random cycle through the available 31 lines.

x() is used for the lines of swine.c itself, so that when they are output unquoted as the program code, they appear in the correct (unjumbled) order. l() is used to output the quoted string lines, thus preserving their (jumbled) order. The lines specified by the coded string used by g() are the lines that are output to stdio.h, and this is used for both the unquoted program code and the quoted strings, so that the lines used by the stdio.h quine are unjumbled and only require l() to self-output.

For easy access, the array r[] has pointers to these three functions in order: l(), g(), x(). In the main argv loop inside main(), p points to the current filename, and s points to the previous one, since the loop increment is s = p++. Thus, (p-s) is always 1. The macro da(), where the functions in r[] are referenced, uses array indices 1+p-s (=2) to access x(), and 1-(p-s) (=0) to access l(). However, there is another subtle intentional bug: s is incorrectly initialised as s = p = v++. Thus, the first time through the loop (which is the stdio.h case), p == s, so p-s is 0, and so r[1±(p-s)] == r[1] == g() is used in both cases - which is exactly what is required.
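The index arithmetic is easy to trace with stand-in functions. In this sketch the loop header is the one quoted earlier, but everything else is hypothetical: map_l(), map_g() and map_x() just report which mapping was picked, and trace_selection() records the choice for each pass.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for the three mappings; each reports which one was chosen. */
static int map_l(void) { return 'l'; }
static int map_g(void) { return 'g'; }
static int map_x(void) { return 'x'; }

/* Same order as r[] in the text: l(), g(), x(). */
static int (*const r[3])(void) = { map_l, map_g, map_x };

/* Runs the loop header over v[0..ac-1] and records, per iteration,
 * which mapping r[1+(p-s)] and r[1-(p-s)] select.
 * Returns the number of iterations. */
int trace_selection(int ac, char **v, char *plus, char *minus)
{
    char **p, **q, **s;
    int n = 0;

    assert(v != NULL && plus != NULL && minus != NULL);
    for (q = (s = p = v++) + ac; p < q; s = p++) {
        ptrdiff_t d = p - s;            /* 0 on the first pass, 1 afterwards */
        plus[n]  = (char)r[1 + d]();
        minus[n] = (char)r[1 - d]();
        n++;
    }
    return n;
}
```

With three entries, the first pass selects g() for both roles, and the remaining two passes select x() and l().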

Doing things this way allows the string version of swine.c to be nicely jumbled inside a[], which helps to mask the fact that there are more strings than source lines. (Having a[] appear after the main source also helps with this.) In turn, this also helps to hide the somewhat-less-than-innocent stdio.h lines in amongst the swine.c lines. It also means that the g() string encoding can refer to any line, allowing the bulk of the self-replicating code from swine.c to be cheaply recycled into stdio.h.

More details

The LFSR used by x() uses 5 bits, with taps in bits 3 and 5 (so that it is a maximal LFSR). The initial value of 23 is carefully chosen so that the value of 31 (which indexes to the null string at the end of a[]) occurs immediately after all of the swine.c lines.
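For reference, here is a minimal sketch of a maximal 5-bit Fibonacci LFSR with taps at 3 and 5. The bit numbering and shift direction are assumptions, so the order of states it produces need not match the one in swine.c; lfsr5_step() and lfsr5_period() are names invented for the example.

```c
#include <assert.h>

/* One step of a 5-bit Fibonacci LFSR with taps at bits 5 and 3
 * (polynomial x^5 + x^3 + 1). Bit numbering and shift direction
 * are assumptions; swine.c may order its bits differently. */
unsigned lfsr5_step(unsigned s)
{
    unsigned bit = (s ^ (s >> 2)) & 1u;
    return (s >> 1) | (bit << 4);
}

/* Counts steps until the register returns to its seed, asserting that
 * no state repeats early. A maximal 5-bit LFSR visits all 31 non-zero
 * states before repeating. */
int lfsr5_period(unsigned seed)
{
    unsigned char seen[32] = {0};
    unsigned s = seed;
    int n = 0;

    assert(seed >= 1 && seed <= 31);
    do {
        assert(!seen[s]);
        seen[s] = 1;
        s = lfsr5_step(s);
        n++;
    } while (s != seed);
    return n;
}
```

Any non-zero seed, including 23, gives a period of 31, so every index from 1 to 31 comes up exactly once per cycle. The seed only decides where in the cycle the walk starts.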

The encoding used by g() is very simple, and is shifted by 1 (modulo 32, the number of strings) so that the encoded string's trailing '\0' maps to 31 (again, so as to end the list of strings).

The da() macro serves three purposes: counting the length of the potential output (so as to know how far back to seek from the end of the file; this differs for stdio.h vs the rest, so cannot be hard-coded or computed once), comparing the contents of the file against the strings, and actually doing the output. This is done by passing in different function names, which eventually work their way back to the pt(), ck() and ct() macros.

Because there are no trigraphs, backslash encoded non-printable characters, or other stupendities, the quoting macro e() can be quite simple.
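To give an idea of how little quoting is needed under those conditions, here is a sketch written as a function rather than a macro. quote_line() is a made-up name and is not the e() macro itself; it assumes the only characters needing an escape are the double quote and the backslash.

```c
#include <assert.h>
#include <stddef.h>

/* Writes line into out as a C string literal, escaping only '"' and '\\'.
 * Returns the literal's length, or 0 if out (cap bytes) is too small. */
size_t quote_line(const char *line, char *out, size_t cap)
{
    size_t n = 0;

    assert(line != NULL && out != NULL);
    if (cap < 3)
        return 0;                       /* need room for "" and the NUL */
    out[n++] = '"';
    for (; *line != '\0'; line++) {
        if (n + 4 > cap)
            return 0;                   /* worst case: 2 bytes + '"' + NUL */
        if (*line == '"' || *line == '\\')
            out[n++] = '\\';
        out[n++] = *line;
    }
    out[n++] = '"';
    out[n] = '\0';
    return n;
}
```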

stdio.h: hijacking main()

The stdio.h code features a macro to hijack any main() that is later defined by a user program which #includes <stdio.h>. It works in a similar way to the fflush() trick described earlier. The #define will turn this:

int main(int argc, char *argv[]) {
    ...
}

into this (newlines added for readability):

int main_(int argc, char *argv[]);
int main(int ac, char *v[]) {
    char *z = w;
    qu(p=s=&z);
    return main_(ac, v);
}
int main_(int argc, char *argv[]) {
    ...
}

(Where w has previously been set to __FILE__.)

Thus, the main() that the compiler sees is the one that contains our code (that outputs itself to stdio.h if possible), while the user-specified main() is now called main_(). Even in a debugger, the extra frame and introduction of an apparently munged "main_" function is unlikely to attract the attention of the programmer - after all, compilers munge and rearrange things all the time.
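The same shape can be tried safely with a stand-in name. In this sketch "entry" plays the role of "main" so that it can coexist with a real main(), and a counter replaces the self-replicating payload. All names here are hypothetical, and it uses a C99 variadic macro, which may not be how the stdio.h code spells it.

```c
#include <assert.h>
#include <stddef.h>

static int hook_calls = 0;   /* counts runs of the injected wrapper */

/* Stand-in for the main() macro: declare entry_(), define a wrapper
 * called entry(), then rename the user's definition to entry_(). */
#define entry(...) entry_(__VA_ARGS__); \
    int entry(int ac, char **v) { hook_calls++; return entry_(ac, v); } \
    int entry_(__VA_ARGS__)

/* "User" code, written as if nothing unusual were going on. */
int entry(int argc, char *argv[])
{
    (void)argv;
    return argc * 10;
}

#undef entry   /* only so this sketch can call entry() directly below */

/* Number of times the injected wrapper has run so far. */
int wrapper_runs(void)
{
    return hook_calls;
}

/* Self-check: the call reaches the user's body via the wrapper. */
void check_hijack(void)
{
    int before = hook_calls;
    int result = entry(3, NULL);

    assert(result == 30);
    assert(hook_calls == before + 1);
    (void)before;
    (void)result;
}
```

The #undef is there purely for the demonstration. Without it, any later call written as entry(...) would be expanded by the preprocessor too.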

stdio.h: redefining if()

The stdio.h code also includes an "interesting" macro for the if() keyword. This appears to be for debugging purposes, passing a stringified version of the if condition to the d() ("debug"?) function. In fact, d() is a simple linear congruential generator (LCG) pseudo-random number generator (PRNG) which returns a pseudo-random unsigned long. If this random number is a multiple of the address of the d() function, then the effect of the entire expression is to invert (by way of an XOR) the result of the conditional expression that has been passed to the if() statement.

Thus, the overall effect is that approximately one if() conditional out of every 10 million (on x86_64 Linux; this will vary on other architectures) will inexplicably and randomly fail, in the sense that if the condition is true, the else block will be executed, and if it is false, then the if block will be executed.
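A toned-down sketch of the idea follows. It deliberately differs from the real thing in a few labelled ways: the macro is called flaky_if rather than redefining the keyword, the flip test uses an explicit flip_modulus instead of the address of the PRNG function, and the LCG constants are the sample rand() ones from the C standard rather than whatever swine.c uses.

```c
#include <assert.h>

static unsigned long prng_state = 1;            /* hypothetical seed */
static unsigned long flip_modulus = 10000000UL; /* flip about 1 time in this many */

/* LCG step. The constants are the C standard's sample rand() values,
 * not those of swine.c. The condition text is accepted but ignored. */
static unsigned long lcg_next(const char *cond_text)
{
    (void)cond_text;
    prng_state = prng_state * 1103515245UL + 12345UL;
    return prng_state >> 16;
}

/* !(c) is XORed with "no flip this time". Normally that gives the truth
 * value of c; when the PRNG value is a multiple of flip_modulus it
 * gives !c instead. */
#define flaky_if(c) if (!(c) ^ (lcg_next(#c) % flip_modulus != 0))

/* Reseeds, then reports which branch a flaky_if on cond actually takes:
 * 1 for the "then" branch, 0 for the "else" branch. */
int branch_taken(int cond, unsigned long seed, unsigned long modulus)
{
    assert(modulus != 0);
    prng_state = seed;
    flip_modulus = modulus;
    flaky_if (cond)
        return 1;
    else
        return 0;
}
```

With a modulus of 1 every PRNG value is a multiple, so every condition is inverted. With a large modulus, inversions become correspondingly rare, which is what makes the real payload so hard to track down.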

One interesting side effect of this is that the if keyword cannot be trusted by the stdio.h code (or the swine.c code, because it shares lines with the stdio.h code and in case it is run on an already infected system). Instead, the y() macro implements an "else-less if" by instead doing while(condition) { ... break;}. Other code that attempts to test the "trustworthiness" of if() also has to rely on similar tricks, or the ?: ternary operator.
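Both workarounds are short enough to show. In this sketch WHEN() is a hypothetical stand-in for y() (the real macro's shape may differ), and the two small functions exist only to exercise it and the ternary operator.

```c
#include <assert.h>

/* Hypothetical stand-in for y(): run body at most once, and only when
 * cond holds, without ever using the if keyword. */
#define WHEN(cond, body) while (cond) { body; break; }

/* Uses the else-less if: negative values become 0. */
int clamp_to_zero(int n)
{
    WHEN(n < 0, n = 0)
    return n;
}

/* The conditional operator is another branch an if macro cannot reach. */
int max_of(int a, int b)
{
    return a > b ? a : b;
}

/* Self-check for both forms. */
void check_when(void)
{
    assert(clamp_to_zero(-5) == 0);
    assert(clamp_to_zero(7) == 7);
    assert(max_of(2, 9) == 9);
}
```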

Bugbears

Needs an ASCII character set. But what doesn't these days?

There are more space-saving #define's (eg. "rt" for "return") than I'd like, but they were necessary to make the program fit within the limit.

There's no way to properly initialise the PRNG used by the if() macro. Time-based calls would be best, but would require sucking in other headers (like time.h) from inside stdio.h, which might raise suspicions. Similarly for getpid(). Hashing the contents of argv and environ is probably the best that could be done, and even that's not very good. Any/all of these could be easily done from inside the hijacked main(), though.

The stdio.h code leaves behind a fair amount of namespace pollution, both preprocessor macros and non-static variables. With more space, the macros could be #undef'd, m could be moved inside d() as a static variable, and the remaining global identifiers could have munged/innocent/misleading names.

Finally, the main() hijacking macro won't work for reentrant code, ie. code that calls main() again. This is because the call to main() will also be macro expanded by the preprocessor, and the expanded code won't make sense within a code block. Thankfully not much does this (not much real code, anyway).


Last updated: Monday, 23 July, 2012.
Copyright © 1994-2018, Kevin Pulo, kev at pulo dot com dot au
Public key fingerprint: 94A4 D2B6 85E6 A46A 5330 74F3 199C 4F85 563D C85F