Seb

note: This is work in progress, and I’d love some critical feedback.

After analysing my Google sources, I’ve noticed people asking questions to Google that I haven’t answered. I figured I’d give it a shot. This question is a good one, but I may need to go back and fill in some gaps. (It’s been a while since I posted here…).

Before I answer this question, there are a few basic guidelines that should be covered.

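  • Watch the types! When a parameter is a void pointer (void *), C will implicitly convert any pointer-to-object type you like (eg. char *, int *, struct fortytwo *), so the compiler can’t warn you when the source and destination types don’t match. This can bite you. Consider copying the bytes of unsigned int foo = 0xBEEF; into an int bar; on an implementation where int has padding bits (eg. parity bits): the copied bytes might not be a valid representation of any int value (a trap representation), and reading bar would then be undefined behaviour. Can anyone spell “segfault”?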
  • Watch the length! I’m sure I don’t have to tell you, a scholar studying C, not to overflow buffers! For the same reason that you should ensure your code does not write out of bounds of an array or object, you should ensure your code does not read out of bounds of an array or object. Likewise, always ensure you give your objects values before you access them; Accessing an uninitialised object is also undefined behaviour.
  • Copy from one object into another. Don’t use the same array or object as both the source and the destination; memcpy isn’t good at that. If the source and destination overlap, the behaviour is undefined. Use memmove for that, instead. The usage is exactly the same.
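To make the last guideline concrete, here is a minimal sketch of a case where the source and destination overlap. The helper name drop_front is an assumption made up for this illustration; it isn't part of any library.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: removes the first `count` elements of `items` by
 * sliding the remaining elements down to the front of the same array.
 * Source and destination overlap, so memmove is required; memcpy would be
 * undefined behaviour here. Returns the new element count. */
size_t drop_front(int *items, size_t length, size_t count) {
    assert(items != NULL && count <= length);
    memmove(items, items + count, (length - count) * sizeof *items);
    return length - count;
}
```

Calling drop_front on an array holding 1, 2, 3, 4, 5 with a count of 2 leaves 3, 4, 5 in the first three elements.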

Now that we’ve covered the safety material, let us move on to creating objects. There are two ways to create objects in C. The first, you probably already know about, is the easiest…

Declare a variable. If you’re only copying one value (eg. not an array), then there is no need to use memcpy; Just use the assignment operator! memcpy is most useful for copying arrays. Below I have declared three arrays. The first two I have initialised to store strings.

char foo[] = "hello";
char bar[] = "world";

I’ll be demonstrating a concatenation (join operation) on these strings. There are just a few things to note, first.

  • foo is an array of 6 chars. The 6 chars in that string are 'h', 'e', 'l', 'l', 'o' and the string terminator, '\0'. bar is also an array of 6 chars. There is no surprise here.
  • Both foo and bar have well defined values. I can access the following six items safely: foo[0], foo[1], foo[2], foo[3], foo[4] and foo[5]. The same goes for bar.
  • Neither foo nor bar have enough space to store the resulting concatenation; strcat(foo, bar) is undefined behaviour because strcat will copy bar onto the end of foo and there isn’t enough space in foo… Out of the question (this is a jab at Mark Lassoff because he’s too self-righteous to fix his errors).

char foobar[sizeof foo + sizeof bar]; // foobar is an array large enough to hold the characters from “hello” and “world”, including the two lots of string terminators. foobar is required because neither foo nor bar are large enough to store the result of the concatenation; I needed to declare something large enough.

memcpy(foobar, foo, sizeof foo); // This should seem fairly obvious to the average student studying C: It copies the characters within foo (“hello”) into foobar. There are a few things happening behind the scenes here, however… The first two arguments are arrays, but they’re being converted to pointers before they’re passed to memcpy. There is an implicit conversion there between char[6] and char *, and another one between char * and the type that memcpy expects for those arguments: void * (const void * for the source).

char *temp = foobar + sizeof foo; // temp now points to foobar[6]. To those who have studied arrays briefly, having this line probably seems silly or confusing. However, there is more to it than meets the eye. [] is not an array operator; it is actually a pointer operator. Within the above memcpy code, there are two implicit conversions going on behind the scenes. The same implicit array-to-pointer conversion happens to foobar when you access foobar[0], foobar[1], etc. foobar[0] is the same as *(foobar + 0). Thus, *temp is the same as temp[0]. Because temp == foobar + sizeof foo, the following is also true: temp[0] == foobar[sizeof foo].

temp[-1] = ' '; // Prior to this statement, temp[-1] was the string terminator for “hello”, right? I didn’t think it’d be suitable to paste “world” straight onto the end without a space in between ;)

memcpy(&temp[0], bar, sizeof bar); // This notation may seem more familiar than the pointer arithmetic I used above. It is the alternative to memcpy(temp + 0, bar, sizeof bar);. Pointer arithmetic is a very important aspect of C, and I may have to fill in some blanks later on.
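Putting the statements above together, the whole concatenation might look like the sketch below. Wrapping it in a function named join_hello_world with an out parameter is my own assumption, made so the result can be inspected by a caller; the four statements in the body are the ones walked through above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Joins "hello" and "world" with a space, following the steps above.
 * `out` must have room for at least sizeof "hello" + sizeof "world"
 * (12) bytes. The function name and parameters are hypothetical. */
void join_hello_world(char *out, size_t out_size) {
    char foo[] = "hello";
    char bar[] = "world";
    assert(out != NULL && out_size >= sizeof foo + sizeof bar);

    memcpy(out, foo, sizeof foo);       /* "hello" plus its '\0' */
    char *temp = out + sizeof foo;      /* one past that '\0' */
    temp[-1] = ' ';                     /* overwrite the '\0' with a space */
    memcpy(&temp[0], bar, sizeof bar);  /* "world" plus its '\0' */
}
```

Given a 12-byte buffer, the function leaves the string "hello world" in it: 11 characters followed by a single string terminator.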

This concludes this post. Hopefully I’ve managed to answer the question one unlucky Googler posed, while exploring some other interesting concepts within C. If you got to the pointer arithmetic section of this post before you were confused, read the entire blog post again and let me know precisely what it is that boggles your mind ;)

(… to be continued…)

After seeing lots of “simple calculator” posts and help requests from those I like to refer to as “incompetents”, I decided to waste a little time and write something slightly more advanced. It’s nothing special; It handles brackets but not operator precedence. All evaluation is performed left to right.

Before I continue, I’d like to explain why I refer to those as “incompetents”. I’d love to write a lengthy, patriotic speech about it, but the long story is short: They can obviously read as they are coming to forums to ask questions, yet they obviously haven’t bothered reading a book because all of the books deal with these problems. … That or they’re idiots.

I didn’t get where I am today by ignoring books, and I’m not trying to brag… I don’t consider myself advanced by any means. If you want to learn, take control of your own learning. Don’t let other idiots mislead or misinform you.

/* (c) Sebastian Ramadan 2011, all rights reserved */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct operation {
	int operat0r;
	int x;
	int y;
};

struct operation *resize(struct operation *alloc, unsigned int capacity);
int calc(char *str);

int main(void) {
	char *problems[] = {
		"3 * 2 + 8 / 4",
		"3 * 2 + (8 / 4)",
		"3 * (2 + 8 ) / 4",
		"3 * (2 + 8 / 4)",
		"(3 * 2 + 8 ) / 4",
		"(3 * 2) + 8 / 4",
		"(3 * 2) + (8 / 4)",
		"(3 * 2 + 8 / 4)"
	};

	for (size_t x = 0; x < sizeof (problems) / sizeof (*problems); x++) { /* size_t matches the type that sizeof yields */
		printf("%s = %d\n", problems[x], calc(problems[x]));
	}

	return 0;
}

struct operation *resize(struct operation *alloc, unsigned int capacity) {
    struct operation *temp = (struct operation *) realloc(alloc, sizeof (struct operation) * capacity);
    if (temp == NULL) {
        fprintf(stderr, "Error in realloc.\n");
        exit(EXIT_FAILURE);
    }
    return temp;
}

int calc(char *str) {
    int operat0r = '\0', x = 0, y = 0, *operand;

	struct operation *stack = NULL;
	unsigned int level = 0, capacity = 0;

	operand = &x;
    stack = resize(stack, ++capacity);

    while (*str != '\0') {
        switch (*str) {
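            case '(':
				/* Grow the stack before pushing, so nested brackets can't write past its end. */
				if (level == capacity) {
					stack = resize(stack, ++capacity);
				}
                stack[level  ].operat0r = operat0r;
				stack[level  ].x = x;
				stack[level++].y = y;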

				operat0r = '\0', x = 0, y = 0;
				str++;
                break;
            case ')':
				if (level == 0) {
					fprintf(stderr, "Error: Too many closing brackets...\n");
					exit(EXIT_FAILURE);
				}
				operand = stack[--level].operat0r == '\0' ? &stack[level].x : &stack[level].y;
				*operand = operat0r == '+' ? x + y :
						   operat0r == '-' ? x - y :
						   operat0r == '*' ? x * y :
						   operat0r == '/' ? x / y :
						   x;

				operat0r = stack[level].operat0r;
				x = stack[level].x;
				y = stack[level].y;
				str++;
			case '\0':
				x = operat0r == '+' ? x + y :
					operat0r == '-' ? x - y :
					operat0r == '*' ? x * y :
					operat0r == '/' ? x / y :
					x;
				break;
			case '0':
			case '1':
			case '2':
			case '3':
			case '4':
			case '5':
			case '6':
			case '7':
			case '8':
			case '9': {
				/* The braces open a block: in C, a declaration can't directly follow a case label. */
				int consumed;
				/* Read into x when no operator is pending, otherwise into y. */
				if (sscanf(str, "%d%n", operat0r == '\0' ? &x : &y, &consumed) != 1) {
					fprintf(stderr, "Error: Invalid operand.\n");
					exit(EXIT_FAILURE);
				}
				if (operat0r != '\0') {
					x = operat0r == '+' ? x + y :
						operat0r == '-' ? x - y :
						operat0r == '*' ? x * y :
						operat0r == '/' ? x / y :
						x;
					y = 0;
					operat0r = '\0';
				}
				str += consumed;
				break;
			}
            case '+':
			case '-':
			case '*':
			case '/':
				operat0r = *str;
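			case ' ':
				str++;
				break;
			default:
				/* An unrecognised character would otherwise never be consumed, and the loop would spin forever. */
				fprintf(stderr, "Error: Unexpected character '%c'.\n", *str);
				exit(EXIT_FAILURE);
        }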
    }

	if (level > 0) {
		fprintf(stderr, "Error: Too many opening brackets...\n");
		exit(EXIT_FAILURE);
	}

	free(stack); /* release the bracket stack; it was leaked before */
	return x;
}

Before I go on to discuss the problems that can be encountered when using getchar, I just want to announce that the solution to both of these problems is fairly simple. getchar returns an int, so it is a great idea to store the return value of getchar into an int! You could then check to ensure it’s not EOF before converting it to an unsigned char. For example:

int foo = getchar();
if (foo == EOF) {
    fputs("getchar was unable to read from stdin (end of file or a read error).\n", stderr);
} else {
    unsigned char bar = foo; /* foo is known to be a character value here */
}

If you’re happy to stick strictly to that, you may stop reading here. getchar() is equivalent to fgetc(stdin) [1]. At first glance it seems normal to store the result of getchar into a char. That is a bad idea! Let us take a little glimpse of the description of fgetc:

If the end-of-file indicator for the input stream pointed to by stream is not set and a next character is present, the fgetc function obtains that character as an unsigned char converted to an int and advances the associated file position indicator for the stream (if defined). … If the end-of-file indicator for the stream is set, or if the stream is at end-of-file, the end-of-file indicator for the stream is set and the fgetc function returns EOF.

What can go wrong? When storing the return value into a char, there are two possible scenarios.

  • When “char” is an unsigned type, any negative value is out of range, and will be converted by repeatedly adding UCHAR_MAX + 1 until the value can fit into its destination. This is perfectly well defined and fine, but how do you differentiate between the negative EOF value and an actual char value?
  • When “char” is a signed type, we have a different problem when getchar returns a value that is outside of the range (those values greater than CHAR_MAX, for example). Converting an out-of-range value to a signed integer type gives either an implementation-defined result or raises an implementation-defined signal. That means a legitimate character could end up comparing equal to EOF, or your program could be interrupted, when you try to store the return value of getchar into a char.

Whether your implementation uses signed or unsigned chars is entirely “implementation defined” according to the C standard. That means your implementation (compiler or whatever) makes the decision and is required to explain this in documentation. If you wish to know which of these scenarios applies to you, I suggest consulting the documentation.
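The unsigned scenario above can be sketched in a few lines. Both helper names below are assumptions invented for this illustration. The first shows what an unsigned char ends up holding; the second shows the check being done on the int, before any narrowing takes place.

```c
#include <assert.h>
#include <limits.h>
#include <stdio.h>

/* Hypothetical helper: the value an unsigned char holds after `value`
 * has been assigned to it. Out-of-range values wrap modulo UCHAR_MAX + 1. */
unsigned char store_in_unsigned_char(int value) {
    return (unsigned char)value;
}

/* Hypothetical helper: returns 1 if `value` (as returned by getchar) is a
 * real character and 0 if it is EOF. The comparison happens while the
 * value is still an int, so no character can be mistaken for EOF. */
int is_character(int value) {
    assert(value == EOF || (value >= 0 && value <= UCHAR_MAX));
    return value != EOF;
}
```

On a typical system where EOF is -1 and UCHAR_MAX is 255, store_in_unsigned_char(EOF) and store_in_unsigned_char(255) give the same result, which is exactly the ambiguity described above. is_character has no such problem because it never narrows the value.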

  1. The C standard actually states that getchar is equivalent to getc, and getc is equivalent to fgetc, except that getc may be implemented as a macro. Since there are no side-effects caused by evaluating stdin, getchar(), getc(stdin) and fgetc(stdin) are all equivalent.

Types, objects and values are all surprisingly easy topics to cover. I’ve already covered the dangers of ignoring the return value of getchar, scanf and realloc. I’ll also be covering the dangers of providing incorrect types to printf later. The common theme for all of these is that they have a return value, and that the type of the return value should be the same as the type of the object declared to store it.

If you don’t know about types, values and objects, you shouldn’t be studying any of the following:

  • Functions of any kind. These include, and are not limited to the following, which are commonly abused due to lack of type understanding: scanf, printf, getchar. Some very basic, but extremely important type/object/value-related concepts are also commonly misunderstood.
  • Operators of any kind. It would be silly to learn about the modulo (%) operator, without first learning what an int is, right? Equally important is the range of an int.

Let us cover the basics. These really are simple concepts, so they won’t take long.

  • Objects are storage locations. These are commonly confused with variables. Variables are objects, though they’re not the only objects in C. malloc, for example, returns the pointer to an object.
  • Values are stored in objects.
  • Types are identifiers that indicate the representation of a value. Values require representation in order to be interpreted correctly. Types provide those representations. Wouldn’t it be silly for an implementation to be operating on an int (object or value) without knowing how to interpret the value?

“Why is it important to study types before functions?” Most functions return values and require values as arguments. Usually those values actually mean something.

  • Sometimes those values should be used in your logic for error checking. For example, the return values of scanf and malloc. If these values aren’t processed correctly, the result can be undefined behaviour. Undefined behaviour is the difference between “I already know this will work on any system today and in the future” and “Oh no! It crashes on some other system!”.
  • Other times, those values may be ignored. For example, the return value of printf (though it can be a very useful return value because it indicates the number of bytes written).

“Why is it important to study types before operators?” Most operators invoke undefined behaviour if they’re used incorrectly.

C guarantees the following:

  • A signed char object will be able to store at least any value between (and including) -127 and 127.
  • An unsigned char object will be able to store at least any value between (and including) 0 and 255.
  • char may be signed or unsigned, depending upon the choices made by those who developed your compiler. As a result, a char is guaranteed to be able to store at least any value between (and including) 0 and 127.
  • An int (or signed int) object will be able to store at least any value between (and including) -32767 and 32767.
  • An unsigned int object will be able to store at least any value between (and including) 0 and 65535.
  • For the ranges of other integer types, check section 5.2.4.2.1 of n1256.pdf.

Notice how each unsigned type is guaranteed one more value than its corresponding signed type (256 values for unsigned char, but only 255 for signed char)? I won’t go into details about signed representation, because such details are easily found by searching, but this difference is due to the possible presence of negative zero on some systems. Relying on the absence of negative zero is an incorrect way to develop portable code. Code such as int x = INT_MAX + 1; overflows a signed integer, which is undefined behaviour and should be avoided at all costs. If wrapping is desired, a better way to provide that wrapping is to work with an unsigned type, and reduce the value down to the range of the signed type.
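As a rough sketch of the difference, the first function below relies on unsigned arithmetic, which is defined to wrap. The second is a hypothetical checked addition (the name checked_add is my own) that refuses to perform a signed addition that would overflow, rather than wrapping it.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Unsigned arithmetic is defined to wrap modulo UINT_MAX + 1. */
unsigned int wrapping_add(unsigned int a, unsigned int b) {
    return a + b;
}

/* Hypothetical helper: adds two ints only when the sum is representable.
 * Returns 1 and stores the sum in *sum on success; returns 0 and leaves
 * *sum untouched if the addition would overflow. */
int checked_add(int a, int b, int *sum) {
    assert(sum != NULL);
    if ((b > 0 && a > INT_MAX - b) || (b < 0 && a < INT_MIN - b)) {
        return 0;
    }
    *sum = a + b;
    return 1;
}
```

The overflow test is written so that the checks themselves can’t overflow: INT_MAX - b is only evaluated when b is positive, and INT_MIN - b only when b is negative.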

Contrary to the belief of the freenode/##C populace, realloc is not that bad. It does require careful usage, though.

Let us first consider that realloc does have a return value. The set of return values falls into two categories: fail (NULL) or success (everything else). Realloc knows whether it’ll succeed or fail the moment it finishes allocating memory; It won’t even consider copying the bytes from the old to the new until there is a new. It seems logical to cover the fail set before we move onto the success set, so here goes:

Fail
Realloc indicates failure by returning NULL, just like malloc. The first thing to remember when using realloc is: Keep your original pointer in case realloc fails! Store the return value in a temp variable. If you don’t store the return value in a temp variable and realloc fails, you’ll have lost the old object to outer-space; It becomes a memory leak.

Here’s an analogy: If you’re going back to University to continue studying, the staff will ask you for your diploma as proof that you’ve completed the prerequisite courses. Are you going to give them the original, and assume they’ll give it back when you’re accepted? No, because if they don’t accept you as a student you’ll never get it back.

If you keep the original pointer and realloc fails, your application can recover, save its state to disk and try again later or something. Think of the users; They’re the ones who will be most annoyed by a failed realloc if you don’t handle it correctly.

Success
realloc returns any value except for NULL to indicate that a call is successful. Two things happen when a call to realloc is successful, in this order:

  1. The contents of the old object are roughly copied into the new object. If the old object can fit entirely into the new object, the old object is copied entirely and the remaining bytes in the new object are uninitialised. Otherwise, the old object won’t fit into the new object, and realloc will cram as many bytes as possible into the new object. realloc won’t even consider freeing the old object until those bytes have been copied over to the new object.
  2. Once those bytes are copied over, the old object is automatically freed by realloc. When the original pointer is automatically freed by realloc, it is no longer valid to use that pointer.

To extend the university analogy, once you’ve completed your study your diploma is superfluous. The university will issue you with a Bachelor of Arts certificate. Your BA certificate has everything your diploma certificate had and more, so you may as well throw your diploma certificate away. realloc takes this step for you; it’ll automatically free the old object immediately before it returns the new object.

To recap:

  • Use a temp var to store the return value of realloc.
  • When you know realloc was successful, you know the old object was destroyed. Replace all references to the old object with references to the new object. If there is more than one reference to the old object, I suggest finding where they all are before you call realloc.
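The recap can be sketched as a small growable array. The helper name append_int and its parameters are assumptions made for this example; the point is only the temp variable and the single place where the old pointer gets replaced.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: appends `value` to the dynamic array *items, which
 * currently holds *count elements. Returns 1 on success. On failure it
 * returns 0 and leaves *items and *count exactly as they were, so the
 * caller still owns (and can still free) the old object. */
int append_int(int **items, size_t *count, int value) {
    assert(items != NULL && count != NULL);

    int *temp = realloc(*items, (*count + 1) * sizeof *temp);
    if (temp == NULL) {
        return 0;               /* *items is still valid here */
    }

    /* Success: the old object is gone, so replace the reference to it. */
    temp[*count] = value;
    *items = temp;
    *count += 1;
    return 1;
}
```

Starting from a null pointer and a count of zero works too, because realloc behaves like malloc when it is given a null pointer. The caller frees the array once, when it is finished with it.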

The main reason the freenode/##C crowd is against use of realloc is that it can lead to stale pointers. Stale pointers are pointers to the old object. As you’ll have gathered by reading this post, using them is a bad idea because the old objects have been destroyed. In most cases, I believe they’re right to discourage its use.

Most C students won’t bother reading the manual; They’ll just use the function even if they don’t know how to use it correctly. Using malloc, memcpy and free can’t solve the stale pointer problem alone, though suggesting their use will solve the problem of convincing students to read and understand the realloc manual. Linked lists don’t tend to represent strings and other contiguous data very elegantly, though they are preferable over realloc when the type of data being stored needn’t be contiguous.

Speaking of linked lists, I’ll be doing a series of blogs on abstract data structures later on. For today, however, this concludes my blogging output. Toodles.

Description
2 The realloc function deallocates the old object pointed to by ptr and returns a pointer to a new object that has the size specified by size. The contents of the new object shall be the same as that of the old object prior to deallocation, up to the lesser of the new and old sizes. Any bytes in the new object beyond the size of the old object have indeterminate values.
3 If ptr is a null pointer, the realloc function behaves like the malloc function for the specified size. Otherwise, if ptr does not match a pointer earlier returned by the calloc, malloc, or realloc function, or if the space has been deallocated by a call to the free or realloc function, the behavior is undefined. If memory for the new object cannot be allocated, the old object is not deallocated and its value is unchanged.
Returns
4 The realloc function returns a pointer to the new object (which may have the same value as a pointer to the old object), or a null pointer if the new object could not be allocated.

In a rather lengthy post I wrote on TNB some time ago, I detailed some of the possible behaviours of fflush(stdin). Just to ensure the message is conveyed, before I continue: Do not use fflush(stdin), EVER! I will be outlining an alternative later in this blog post. Here is an extract from the original, lengthy TNB post:

In the ideal console application, input should almost never be discarded. It’s only because very bad code is used too often that such code is available in my signature. fflush(stdin) is wrong. It is just as defined to cause problems as it is to work. Consider what it may do, depending upon the implementation:

  • not work at all, or even worse, do something unexpected like crash or launch missiles in error. This can be really annoying.
  • consume and discard the entire buffer, which may be one of: nothing, a line, part of a line or multiple lines, etc. This can be really annoying.
  • consume and discard a line (thus, appears to “work” just like the code in my sig would, after scanf(“%d”,…)). This can be annoying, when lengthy information contains just 1 invalid byte, and the program requires it to be re-entered.
  • consume and discard a single character (thus, appears to “work” just like getchar would, after scanf(“%d”,…)). This can be annoying, when a program that silently ignored invalid bytes does not inform the user.

It then goes on to express the message conveyed in my recent blog post, Console development; Poor design observed en masse. Note that the “code in my sig”, as referred to above, is the following:

int c; do { c = getchar(); } while (c >= 0 && c != '\n'); /* When is this code useful? */

Let us further consider what fflush is actually defined to do, and work from there. The C99 standard states that fflush will cause unwritten data to be written to the file. Can you imagine how slow writing to the hard disk would be if, for every byte written, the OS caused the drive to seek to the correct location, write a byte and then seek to some other location for some other application? When fprintf, fputs, fputc, fwrite or any other form of file modification function is called, an implementation may buffer the write to memory to optimise write access to the hard disk. That means the OS can wait until some point later in time to write an entire chunk to the drive in one seek.

That is what the C standard defines for output streams. (Well, actually, it doesn’t mention buffers or hard disks, so neither of those are required.) Now let us consider the behaviour as expected by someone using fflush on an input stream (eg. fflush(stdin)), which would be any of the following, from my previous quote:

  • consume and discard the entire buffer, which may be one of: nothing, a line, part of a line or multiple lines, etc. This can be really annoying.
  • consume and discard a line (thus, appears to “work” just like the code in my sig would, after scanf(“%d”,…)). This can be annoying, when lengthy information contains just 1 invalid byte, and the program requires it to be re-entered.
  • consume and discard a single character (thus, appears to “work” just like getchar would, after scanf(“%d”,…)). This can be annoying, when a program that silently ignored invalid bytes does not inform the user.

We now have a conundrum: Which of these behaviours would be suitable for a stream that permits both input and output access? The behaviour the C99 standard defines for fflush on an output stream is inconsistent with every one of the behaviours expected by someone using fflush(stdin). Do you really think the C99 standard would be so inconsistent with the semantics of a function? The answer lies in the following quote from the C99 standard draft:

If stream points to an output stream or an update stream in which the most recent operation was not input, the fflush function causes any unwritten data for that stream to be delivered to the host environment to be written to the file; otherwise, the behavior is undefined.

This quote makes it explicitly clear that fflush on an input stream is bad news. The well defined alternative is to use getchar or fgetc to read and discard bytes until some suitable condition (eg. a '\n' has been read), and an example of this has been provided above. The elegant alternative is to not use stdin at all, as outlined by my previous post, Console development; Poor design observed en masse.
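For reference, the read-and-discard alternative can be wrapped in a function so it works on any input stream. This is only a sketch; the name discard_line is an assumption, and the loop is the same idea as the code in my signature with getc in place of getchar.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: reads and discards characters from `stream` up to
 * and including the next '\n'. Returns the last value read: '\n'
 * normally, or EOF if the stream ended (or failed) first. */
int discard_line(FILE *stream) {
    assert(stream != NULL);
    int c;
    do {
        c = getc(stream);
    } while (c != EOF && c != '\n');
    return c;
}
```

A typical use would be discard_line(stdin) after a scanf("%d", ...) call, so that whatever follows the number on that line isn’t picked up by the next read. Unlike fflush(stdin), its behaviour is the same on every implementation.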

In this, my first blog post, I’ll be referring to the code in my signature. For those unfamiliar with the C programming forum on TheNewBoston, the code in my signature is this:

int c; do { c = getchar(); } while (c >= 0 && c != '\n'); /* When is this code useful? */

That code discards input, which is really unfavourable from the user’s perspective. I’ll also be referring to input or conversion errors, which leave seemingly little choice but to discard the invalid input. Any input discarded can cause users to quickly become frustrated.

“What’s the solution?” Take a look at your command line interface. If you type something like “dir foo” (or “ls foo” on most Unix-like systems) and foo doesn’t exist, you will be given an error. Most consoles permit you to press the “up” arrow, modify the command and run it again. Beautifully designed, isn’t it? The console lets you recover quickly from your errors! Most programs that use gets, fgets or scanf don’t have this elegance. Imagine how difficult that would be to write into one of your applications. What’s the point, when the console already supports it? Try to design applications that accept input only by main’s argv argument, and avoid functions like fgets, scanf and especially gets altogether, when they aren’t sensible.

From the C standard, “The scanf function returns the value of the macro EOF if an input failure occurs before any conversion. Otherwise, the scanf function returns the number of input items assigned, which can be fewer than provided for, or even zero, in the event of an early matching failure.”

“When can an input failure occur?” One example is when the user presses ^D on Linux, or CTRL+Z on Windows. stdin will close, which will cause an input failure, and scanf will return EOF. Another example is when input is piped from the output of another application or a named pipe. When the other application exits, stdin will close, which will cause an input failure. There are various implementation defined methods of closing a named pipe. On FreeBSD, if a program attempts to read from a named pipe while another program is reading from it, the previous application will end up with a closed stdin.

“So what?” The variable likely won’t be touched by scanf. That means you’ll be operating on old, unassigned values. Not only is this undefined behaviour, but it could be crafted by a malicious user to compromise certain systems.

“What about matching failures? How can they occur?” When scanf expects numeric information to be entered, for example “12345”, and a user enters non-numeric information, for example “foobar”, scanf will indicate that as a matching failure via its return value. The corresponding argument won’t be modified by scanf. That means you’ll be operating on old, unassigned values. This is undefined behaviour, and can be harmful.

How can I avoid undefined behaviour when using scanf?

  • Make sure the type that you tell scanf to expect matches the type of the pointer that you give scanf.
  • Ensure the return value of scanf indicates the correct number of items were read. For a typical call to scanf that reads a single integer, the return value must be verified to be 1. For a call to scanf that reads 2 integers, the return value must be verified to be 2, and so on.
  • %s should be avoided, as scanf has no way of preventing buffer overflows. Use %{xx}s instead, where {xx} is one less than the size of your buffer; scanf stores up to {xx} characters and then appends the string terminator. For example:

char foo[64];
if (scanf("%63s", foo) != 1) { /* 63 characters plus the '\0' fill all 64 bytes */
    fputs("Error inputting into foo.\n", stderr);
    exit(EXIT_FAILURE);
}
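The return value check from the list above might look like the sketch below. I’ve used sscanf on a string rather than scanf on stdin so the example is easy to try, and the helper name parse_pair is an assumption made for this illustration. The return value rule is the same for both functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: reads two integers from `text`. Returns 1 and
 * stores both values only when both conversions succeeded; otherwise
 * returns 0 and leaves *first and *second untouched. */
int parse_pair(const char *text, int *first, int *second) {
    assert(text != NULL && first != NULL && second != NULL);
    int a, b;
    /* Two conversions were requested, so anything other than 2 (including
     * EOF for an input failure) means a and b can't be trusted. */
    if (sscanf(text, "%d %d", &a, &b) != 2) {
        return 0;
    }
    *first = a;
    *second = b;
    return 1;
}
```

With the input “3 foo”, sscanf returns 1 because only the first conversion succeeded, so parse_pair reports failure instead of handing back an unassigned second value.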

One more thing to be wary of: “Trailing white space (including new-line characters) is left unread unless matched by a directive.” For this reason it is particularly important to realise that %s reads words, not lines. It’ll stop at the first whitespace (space, tab or newline). That means scanf operates differently to fgets, and using them together can lead to unexpected, but perfectly defined behaviour. Consider what happens when a user enters a number, and presses enter. scanf reads and converts up until the first non-numeric character. The first non-numeric character will be the ‘\n’ left from pressing enter, and this will be left in stdin. What happens when fgets is called? It’ll read up to (and including) the first ‘\n’. See the code in my signature for a solution. If this paragraph didn’t make any sense to you, I’d suggest reading fflush vs gets on the c-faq.com website.

Don’t these issues seem to act as support for the divergence at the top of this message? Well, not really. You’ll still need to convert integer values using sscanf (note the double-s, of which the first stands for “string”), and you should also ensure that the return value is correct for your task. However, you won’t need to worry about buffer size for input strings because the OS will handle that. You also won’t need to worry about trailing whitespace such as ‘\n’ left by scanf, affecting future calls to fgets and friends.
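Here is a minimal sketch of converting one argv-style string with sscanf. The helper name parse_int is an assumption, and so is the decision to reject trailing junk such as “12x” by checking how many characters %n says were consumed. Note that it doesn’t detect numbers too large for an int; strtol would be a sturdier choice if that matters.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: converts the whole of `text` to an int. Returns 1
 * and stores the value in *out on success; returns 0 and leaves *out
 * untouched if `text` isn't entirely a decimal integer. */
int parse_int(const char *text, int *out) {
    assert(text != NULL && out != NULL);
    int value, consumed;
    /* %n records how many characters were read and doesn't count towards
     * the return value, so success is still indicated by 1. */
    if (sscanf(text, "%d%n", &value, &consumed) != 1 || text[consumed] != '\0') {
        return 0;
    }
    *out = value;
    return 1;
}
```

In main, this would be called as something like parse_int(argv[1], &n) after checking that argc is at least 2, printing a usage message when it returns 0. The user can then press the “up” arrow, fix the argument and run the program again.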

There’s my rant for the day. Designing most basic applications with the console in mind will be slightly easier, and your end users will love you for it! I hope this was most beneficial :) Toodles.

© 2011 GeekyCode