Ulrich Drepper

Ulrich Drepper - LiveJournal.com

Fedora 10 a little bit more secure

Ulrich Drepper - Sat, 03/01/2009 - 12:21am
Fedora 10 comes with filesystem capability support. Unfortunately it is not used by default in the packages which can take advantage of it. I think the excuse is that there are people who build their own kernels and disable it. That's nonsense since there are many other options we rely on and which can be compiled out.

Anyway, you can do the following by hand. Unfortunately you have to do it every time the program is updated again.

sudo chmod u-s /bin/ping
sudo /usr/sbin/setcap cap_net_raw=ep /bin/ping
sudo chmod u-s /bin/ping6
sudo /usr/sbin/setcap cap_net_raw=ep /bin/ping6


Voilà, ping and ping6 are no longer SUID binaries. Note that ls still signals (at least when you're using --color) that there is something special about the file, namely, that it has filesystem attributes.

These are two easy cases. Other SUID programs need some research to see whether they can use filesystem capabilities as well and which capabilities they need.

Secure File Descriptor Handling

Ulrich Drepper - Fri, 01/08/2008 - 11:25pm

During the 2.6.27 merge window a number of my patches were merged and now we are at the point where we can securely create file descriptors without the danger of possibly leaking information. Before I go into the details let's get some background information.

A file descriptor in the Unix/POSIX world has lots of state associated with it. One bit of information determines whether the file descriptor is automatically closed when the process executes an exec call to start executing another program. This is useful, for instance, to establish pipelines. Traditionally, when a file descriptor is created (e.g., with the default open() mode) this close-on-exec flag is not set and a programmer has to explicitly set it using

   fcntl(fd, F_SETFD, FD_CLOEXEC);

Closing the descriptor is a good idea for two main reasons:

  • the new program's file descriptor table might fill up. For every open file descriptor resources are consumed.
  • more importantly, information might be leaked to the second program. That program might get access to information it normally wouldn't have access to.

It is easy to see why the latter point is such a problem. Assume this common scenario:

A web browser has two windows or tabs open, both loading a new page (maybe triggered through Javascript). One connection is to your bank, the other some random Internet site. The latter contains some random object which must be handled by a plug-in. The plug-in could be an external program processing some scripting language. The external program will be started through a fork() and exec sequence, inheriting all the file descriptors open and not marked with close-on-exec from the web browser process.

The result is that the plug-in can have access to the file descriptor used for the bank connection. This is especially bad if the plug-in is used for a scripting language such as Flash because this could make the descriptor easily available to the script. In case the author of the script has malicious intentions you might end up losing money.

Until not too long ago the best programs could do was to set the close-on-exec flag for file descriptors as quickly as possible after the file descriptor has been created. Programs would break if the default for new file descriptors were changed to set the bit automatically.

This does not solve the problem, though. There is a (possibly brief) period of time between the return of the open() call or other function creating a file descriptor and the fcntl() call to set the flag. This is problematic because the fork() function is signal-safe (i.e., it can be called from a signal handler). In multi-threaded code a second thread might call fork() concurrently. It is theoretically possible to avoid these races by blocking all signals and by ensuring through locks that fork() cannot be called concurrently. This very quickly gets far too complicated to even contemplate:

  • To block all signals, each thread in the process has to be interrupted (through another signal) and in the signal handler block all the other signals. This is complicated, slow, possibly unreliable, and might introduce deadlocks.
  • Using a lock also means there has to be a lock around fork() itself. But fork() is signal safe. This means this step also needs to block all signals. This by itself requires additional work since child processes inherit signal masks.
  • Making all this work in projects which come from different sources (and which non-trivial program doesn't use system or third-party libraries?) is virtually impossible.

It is therefore necessary to find a different solution. The first set of patches to achieve the goal went into the Linux kernel in 2.6.23, the last, as already mentioned, will be in the 2.6.27 release. The patches are all rather simple. They just extend the interface of various system calls so that already existing functionality can be taken advantage of.

The simplest case is the open() system call. To create a file descriptor with the close-on-exec flag atomically set all one has to do is to add the O_CLOEXEC flag to the call. There is already a parameter which takes such flags.
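As a minimal sketch of the difference, the two variants can be put side by side. The function names (open_then_set_cloexec, open_with_cloexec, has_cloexec) are made up for this illustration; only open(), fcntl(), and the flags are real interfaces.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Racy variant: between open() and fcntl() the descriptor exists without
   FD_CLOEXEC, so a concurrent fork()+exec can inherit it. */
int open_then_set_cloexec(const char *path)
{
    assert(path != NULL);
    int fd = open(path, O_RDONLY);
    if (fd == -1)
        return -1;
    if (fcntl(fd, F_SETFD, FD_CLOEXEC) == -1) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Atomic variant: the flag is set as part of the open() call itself. */
int open_with_cloexec(const char *path)
{
    assert(path != NULL);
    return open(path, O_RDONLY | O_CLOEXEC);
}

/* Returns 1 if close-on-exec is set on fd, 0 if not, -1 on error. */
int has_cloexec(int fd)
{
    int flags = fcntl(fd, F_GETFD);
    if (flags == -1)
        return -1;
    return (flags & FD_CLOEXEC) != 0;
}
```

Both functions end up with the same descriptor state; only the second one never exposes a window in which the flag is missing.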

The next, more complicated case is the solution chosen to extend the socket() and socketcall() system calls. No flag parameter is available but the second parameter to these interfaces (the type) has a very limited range requirement. It was felt that overloading the parameter is an acceptable solution. It definitely makes using the new interfaces simpler.
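A short sketch of the overloaded type parameter follows. The helper names make_socket and is_cloexec_nonblock are hypothetical, and AF_UNIX is chosen here only because it needs no network setup.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sys/socket.h>
#include <unistd.h>

/* Create a Unix domain stream socket that is close-on-exec and
   non-blocking from the moment it exists. The flags are OR-ed into
   the type argument. */
int make_socket(void)
{
    return socket(AF_UNIX, SOCK_STREAM | SOCK_CLOEXEC | SOCK_NONBLOCK, 0);
}

/* Returns 1 if both close-on-exec and non-blocking are set on fd,
   0 otherwise (including on error). */
int is_cloexec_nonblock(int fd)
{
    assert(fd >= 0);
    int fdflags = fcntl(fd, F_GETFD);
    int flflags = fcntl(fd, F_GETFL);
    if (fdflags == -1 || flflags == -1)
        return 0;
    return (fdflags & FD_CLOEXEC) && (flflags & O_NONBLOCK);
}
```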

The last group are interfaces where the original interface simply doesn't provide a way to pass additional parameters. In all these cases a generic flags parameter was added. This is preferable to using specialized new interfaces (like, for instance, dup2_cloexec) because we do and will need other flags. O_NONBLOCK is one case. Hopefully we'll have non-sequential file descriptors at some point and we can then request them using the flags, too.
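For the interfaces which gained a flags parameter, a minimal sketch using pipe2() and dup3() might look like this. The wrapper names are invented for the example. Note one assumption stated in the comment: on current Linux, dup3() accepts only O_CLOEXEC in its flags argument.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Create a pipe whose two ends are close-on-exec and non-blocking. */
int make_pipe(int fds[2])
{
    assert(fds != NULL);
    return pipe2(fds, O_CLOEXEC | O_NONBLOCK);
}

/* Duplicate oldfd onto newfd with close-on-exec set on the copy.
   Unlike dup2(), dup3() fails with EINVAL when oldfd == newfd.
   Current Linux accepts only O_CLOEXEC in the dup3() flags. */
int dup_cloexec(int oldfd, int newfd)
{
    return dup3(oldfd, newfd, O_CLOEXEC);
}
```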

The (hopefully complete) list of interface changes which were introduced is given below. Note: these are the userlevel changes. Inside the kernel things look different.

Userlevel Interface   What changed?
open                  O_CLOEXEC flag added
fcntl                 F_DUPFD_CLOEXEC command added
recvmsg               MSG_CMSG_CLOEXEC flag for transmission of file descriptors over Unix domain sockets
dup3                  New interface taking an additional flag parameter (O_CLOEXEC, O_NONBLOCK)
pipe2                 New interface taking an additional flag parameter (O_CLOEXEC, O_NONBLOCK)
socket                SOCK_CLOEXEC and SOCK_NONBLOCK flags added to type parameter
socketcall            SOCK_CLOEXEC and SOCK_NONBLOCK flags added to type parameter
paccept               New interface taking an additional flag parameter (SOCK_CLOEXEC, SOCK_NONBLOCK) and a temporary signal mask
fopen                 New mode 'e' to open file with close-on-exec set
eventfd               Takes new flags EFD_CLOEXEC and EFD_NONBLOCK
signalfd              Takes new flags SFD_CLOEXEC and SFD_NONBLOCK
timerfd               Takes new flags TFD_CLOEXEC and TFD_NONBLOCK
epoll_create1         New interface which takes a flag value. Supports EPOLL_CLOEXEC and EPOLL_NONBLOCK
inotify_init1         New interface taking a flag parameter (IN_CLOEXEC, IN_NONBLOCK)

When should these interfaces be used? The answer is simple: whenever the author is not sure that no asynchronous fork()+exec can happen and that no concurrently running thread executes fork()+exec (or posix_spawn(), BTW).

Application writers might have control over this. But I'd say that in all library code one has to play it safe. In glibc we now open the file descriptor with the close-on-exec flag set in almost all interfaces. This means a lot of work but it has to be done. Applications also have to change (see this autofs bug, for instance).
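For library code built on stdio, the 'e' mode character from the list of interface changes is probably the easiest way to play it safe. The following is only a sketch; fopen_cloexec is a made-up wrapper name, and the 'e' mode is a glibc extension rather than standard C.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>

/* Open a file for reading with close-on-exec set on the underlying
   descriptor. The 'e' in the mode string is a glibc extension. */
FILE *fopen_cloexec(const char *path)
{
    assert(path != NULL);
    return fopen(path, "re");
}
```

The flag can be checked afterwards with fcntl(fileno(fp), F_GETFD).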

dual head xrandr configuration

Ulrich Drepper - Fri, 23/05/2008 - 5:53pm

ajax told me that extra wide screens now work with the latest Fedora 9 binaries for X11. So I had to try it out and after some experimenting I got it to work. To save others the work here is what I did.

Hardware:

  • ATI FireGL V3600
  • 2x Dell 3007WFP

I use the free driver, of course. No need for 3D here.

The old way to get a spanning desktop was to use Xinerama. This has been replaced by xrandr nowadays. xrandr is not just for external screens of laptops and to change the resolution. One can assign the origin of various screens and therefore display different parts of a bigger virtual desktop. This is the whole trick here. The /etc/X11/xorg.conf file I use is this:

Section "ServerLayout"
	Identifier     "dual head configuration"
	Screen      0  "Screen0" 0 0
	InputDevice    "Keyboard0" "CoreKeyboard"
EndSection

Section "InputDevice"
	Identifier  "Keyboard0"
	Driver      "kbd"
	Option	    "XkbModel" "pc105"
	Option	    "XkbLayout" "us+inet"
EndSection

Section "Device"
	Identifier  "Videocard0"
	Driver      "radeon"
	Option	    "monitor-DVI-0" "dvi0"
	Option	    "monitor-DVI-1" "dvi1"
EndSection

Section "Monitor"
	Identifier "dvi0"
	Option "Position" "2560 0"
EndSection

Section "Monitor"
	Identifier "dvi1"
	Option "LeftOf" "dvi0"
EndSection

Section "Screen"
	Identifier "Screen0"
	Device     "Videocard0"
	DefaultDepth     16
	SubSection "Display"
		Viewport   0 0
		Depth     16
		Modes	"2560x1600"
		Virtual	5120 1600
	EndSubSection
EndSection

Fortunately X11 configuration got much easier since I last had to edit the file by hand. I started from the most basic setup for a single screen which the installer or system-config-display will be happy to create for you. The important changes on top of this initial version are these:

	Option	    "monitor-DVI-0" "dvi0"
	Option	    "monitor-DVI-1" "dvi1"

These lines in the Device section announce the two screens. It is unfortunately not well (at all?) documented that the first parameter strings are magic. If you run xrandr -q on your system with two screens attached you'll see the identifiers assigned to the screens by the system. In my case:

$ xrandr -q
Screen 0: minimum 320 x 200, current 5120 x 1600, maximum 5120 x 1600
DVI-1 connected 2560x1600+0+0 (normal left inverted right x axis y axis) 646mm x 406mm
...
DVI-0 connected 2560x1600+2560+0 (normal left inverted right x axis y axis) 646mm x 406mm
...

Add to the names DVI-0 and DVI-1 the magic prefix monitor- and add as the second parameter string an arbitrary identifier. Do not drop or change the monitor- prefix, that's the main magic which seems to make all this work. Then create two monitor sections in the xorg.conf file, one for each screen:

Section "Monitor"
	Identifier "dvi0"
	Option "Position" "2560 0"
EndSection

Section "Monitor"
	Identifier "dvi1"
	Option "LeftOf" "dvi0"
EndSection

The Identifier lines must of course match the identifiers used in the Device section. The rest are options which determine what the screens show. Since the LCDs have a resolution of 2560x1600 and since I want to have a spanning desktop and the DVI-0 connector is used for the display on the right side, I'm using an x-offset of 2560 and a y-offset of 0 for that screen. Then just tell the server to place the second screen at the left of it and the server will figure out the rest.

What remains to be done is to tell the server how large the screen in total is. That's done using

		Virtual	5120 1600

The numbers should explain themselves. Now the two screens show non-overlapping regions of the total desktop with no area not displayed, all due to the correct arithmetic in the calculation of the total screen size and the offset.
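The arithmetic behind the Position and Virtual values can be written down as a small sketch. The struct and function names are invented for this illustration and have nothing to do with the X server's own code; the sketch only assumes screens placed side by side from left to right.

```c
#include <assert.h>
#include <stddef.h>

struct mode { int width, height; };
struct layout { int virtual_width, virtual_height; };

/* Place n screens left to right. x_offset[i] receives the x value for
   the "Position" option of screen i; the result is the size to put on
   the "Virtual" line. */
struct layout side_by_side(const struct mode *m, size_t n, int *x_offset)
{
    struct layout l = { 0, 0 };
    assert(m != NULL && x_offset != NULL);
    for (size_t i = 0; i < n; i++) {
        x_offset[i] = l.virtual_width;     /* starts where the previous one ends */
        l.virtual_width += m[i].width;     /* widths add up */
        if (m[i].height > l.virtual_height)
            l.virtual_height = m[i].height; /* tallest screen wins */
    }
    return l;
}
```

For two 2560x1600 panels this gives offsets 0 and 2560 and a virtual size of 5120x1600, matching the configuration above.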

Note: there is only one Screen section. That's something which is IIRC different from the last Xinerama setup I did years ago.

Producing PDFs

Ulrich Drepper - Thu, 22/11/2007 - 2:35am
I don't want to throw this in with the announcement of the availability of the paper on memory and cache handling but I also don't want to forget it. So, here we go.

I write all the text I can using TeX (PDFLaTeX to be exact). This leads directly to a PDF document without intermediate steps. The graphics are done using Metapost because I'm better at programming than at drawing. Metapost produces Postscript-like files which some LaTeX macros then read and directly integrate into the PDF output.

The result in this case is a PDF with 114 pages which is only 934051 bytes in size. Just about 8kB for each page. Given the multi-column text and the numerous graphics in it, this is amazingly small.

I mentioned before how badly OO.org sucks at exporting graphics. I bet all the other word processors, spreadsheets, etc. suck just as badly. Also, the PDFs generated for text are much, much bigger.

My guess is that if I'd written the document with OO.org the size would be north of 4MB, probably significantly more. I cannot understand why people do this to themselves and, more importantly, to others.

Memory and Cache Paper

Ulrich Drepper - Thu, 22/11/2007 - 2:09am
Well, it's finally done. I've uploaded the PDF of the memory and cache paper to my home page. You can download it but do not re-publish it or make it available in any form to others. I do not want multiple copies flying around, at least not while I'm still intending to maintain the document.

With Jonathan Corbet's help the text should actually be readable. I had to change some of the text in the end to accommodate line breaks in the PDF. So I might have introduced problems; don't think badly of Jonathan's abilities. Besides, this is a large document. You simply go blind after a while, I know I do.

Which brings me to the next point. Even though I intend to maintain the document, don't expect me to do much in the near future. I've been working on it for far too long now and need a break. Integrating all the editing Jonathan produced plus today's line breaking have finished me off. I haven't even integrated all the comments I've received. I know the structure of the document is a bit weak in a few places, esp. section 5 which contains a lot of non-NUMA information. But it was simply too much work so far. Maybe some day.

The Evils of pkgconfig and libtool

Ulrich Drepper - Tue, 13/11/2007 - 2:05am

If you need more proof that this is insane just look at some of the packages using it. I recently was looking at krb5-auth-dialog. The output of ldd -u -r on the original binary shows 26 unused DSOs.

This can be changed quite easily: add -Wl,--as-needed to the link line. Doing this for this package makes all but one of the unused dependencies go away. This has several benefits:

The binary size is actually measurably reduced.

   text    data     bss     dec     hex filename
  35944    6512      64   42520    a618 src/krb5-auth-dialog-old
  35517    6112      64   41693    a2dd src/krb5-auth-dialog

That’s a 2% improvement. Note that the saved dependencies are all recursive dependencies. The runtime is therefore not much affected (only a little). The saved data is pure overhead. Multiply the number by the thousands of binaries and DSOs which are shipped and the savings are significant.

The second problem to mention here is that not all unused dependencies are gone because somebody thought s/he is clever and uses -pthread in one of the pkgconfig files instead of linking with -lpthread. That’s just stupid when combined with the insanity called libtool. The result is that the -Wl,--as-needed is not applied to the thread library.

Just avoid libtool and pkgconfig. At the very least fix up the pkgconfig files to use -Wl,--as-needed.

Energy saving is everybody's business

Ulrich Drepper - Fri, 09/11/2007 - 4:44am
With the wide acceptance of laptops and even smaller devices more and more people have been exposed to devices limited by energy consumption. Still, programmers don't pay much attention to this aspect.

This statement is not entirely accurate: there has been a big push towards energy conservation in the kernel world (at least in the Linux kernel). With the tickless kernels we have the infrastructure to sleep for long times (long is a relative term here). Other internal changes avoid unnecessary wakeups. It is now really up to the userlevel world to do its part.

The situation is pretty dire here. There are some projects (e.g., PowerTOP) which highlight the problems. Still, not much happens.

I've been somewhat guilty myself. nscd (part of glibc) was waking up every 5 seconds to clean up its cache, even if often nothing was to be done. This program structure has several reasons. Good ones, but no ultimate reason. So I finally bit the bullet and changed the program structure significantly to enable better wakeup behavior. The result is that now nscd at all times determines when the next cache cleanup is due and sleeps until then. Cache cleanups might be many hours out, so the code improved from one wakeup every 5 seconds to one wakeup every couple of hours.

nscd is a very small drop in the bucket, though. Just look at your machine and examine the running processes and those which are regularly started. PowerTOP cannot really help here (Arjan said something will be coming soon though). There is a tool which can help, though: systemtap. Simply create a small script which traps syscalls the violators will use and display process information. The syscalls to use include: open, stat, access, poll, select, nanosleep, futex. For the latter four the problem is small timeout values. I'll post a script to do this soon (just not now).

But the guilty parties probably already know who they are. Just don't do this quasi busy waiting!

  • If a program has to react to a file change or removal or creation, use inotify.
  • For internal cleanups, choose reasonable values and then compute the timeout so that you don't wake up when nothing has to be done.

If you want to see how not to do it, look at something like the flash player (the proprietary one). If you inadvertently have started it it'll remain active (even if no flash page is displayed) and it is basically busy waiting on something. Let's show the proprietary software world we can do better.
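The two recommendations in the list above (inotify instead of polling, and a timeout computed from the next due time) can be sketched roughly as follows. All function names are hypothetical, and this is not how nscd is implemented; it only illustrates the idea of sleeping until there is actually something to do.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <limits.h>
#include <poll.h>
#include <sys/inotify.h>
#include <time.h>
#include <unistd.h>

/* Start watching a directory for created, deleted, or modified entries.
   Returns an inotify descriptor, or -1 on error. */
int watch_dir(const char *path)
{
    assert(path != NULL);
    int ifd = inotify_init1(IN_CLOEXEC | IN_NONBLOCK);
    if (ifd == -1)
        return -1;
    if (inotify_add_watch(ifd, path, IN_CREATE | IN_DELETE | IN_MODIFY) == -1) {
        close(ifd);
        return -1;
    }
    return ifd;
}

/* Milliseconds from now until next_due, clamped to [0, INT_MAX]. */
int ms_until(time_t now, time_t next_due)
{
    if (next_due <= now)
        return 0;
    long long ms = (long long)(next_due - now) * 1000;
    return ms > INT_MAX ? INT_MAX : (int)ms;
}

/* Sleep until the watched directory changes or the next cleanup is due.
   Returns 1 on a directory event, 0 when the deadline passed, -1 on
   error. After a 1 the caller still has to read() the queued events. */
int wait_for_work(int ifd, time_t next_due)
{
    struct pollfd pfd = { .fd = ifd, .events = POLLIN };
    int r = poll(&pfd, 1, ms_until(time(NULL), next_due));
    return r > 0 ? 1 : r;
}
```

With a cleanup that is two hours out, the process makes a single poll() call and stays asleep instead of waking up 1440 times at a 5 second interval.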

Part 2 released

Ulrich Drepper - Mon, 01/10/2007 - 3:09pm

Jonathan and crew published part 2 of the paper. If you have an LWN subscription you can read it here.

Directory Reading

Ulrich Drepper - Thu, 27/09/2007 - 5:45pm

In the last weeks I have seen far too much code which reads directory content in horribly inefficient ways to let this slide. Programmers really have to learn doing this efficiently. Some of the instances I've seen are in code which runs frequently. Frequently as in once per second. Doing it right can make a huge difference.

The following is an exemplary piece of code. Not taken from an actual project but it shows some of the problems quite well, all in one example. I drop the error handling to make the point clearer.

  DIR *dir = opendir(some_path);
  struct dirent *d;
  struct dirent d_mem;
  while (readdir_r(dir, &d_mem, &d) == 0 && d != NULL) {
    char path[PATH_MAX];
    snprintf(path, sizeof(path), "%s/%s/somefile", some_path, d->d_name);
    int fd = open(path, O_RDONLY);
    if (fd != -1) {
      ... do something ...
      close (fd);
    }
  }
  closedir(dir);

How many things are inefficient at best and outright problematic in some cases?

Let's enumerate:

  1. Why use readdir_r?
  2. Even the use of readdir is dangerous.
  3. Creating a path string might exceed the PATH_MAX limit.
  4. Using a path like this is racy.
  5. What if the directory contains entries which are not directories?

readdir_r is only needed if multiple threads are using the same directory stream. I have yet to see a program where this really is the case. In this toy example the stream (variable dir) is definitely not shared between different threads. Therefore the use of readdir is just fine. Should this matter? Yes, it should, since readdir_r has to copy the data into the buffer provided by the user while readdir has the possibility to avoid that.

Instead of readdir code should in fact use readdir64. The definition of the dirent structure comes from an innocent time when hard drives with a couple of dozen MB of capacity were huge. Things change and we need larger values for inode numbers etc. Modern (i.e., 64-bit) ABIs do this by default but if the code is supposed to be used on 32-bit machines as well the *64 variants should always be used.

Path length limits are becoming an ever-increasing problem. Linux, like most Unix implementations, imposes a length limit on each filename string which is passed to a system call. But this does not mean that in general path names have any length limit. It just means that longer names have to be implicitly constructed through the use of multiple relative path names. In the example above, what happens if some_path is already close to PATH_MAX bytes in size? It means the snprintf call will truncate the output. This can and should of course be caught but this doesn't help the program. It is crippled.

Any use of filenames with path components (i.e., with one or more slashes in the name) is racy: an attacker can change any of the contained path components. This can lead to exploits. In the example, the some_path string itself might be long and traverse multiple directories. A change in any of these will lead to the open call not reaching the desired file or directory.

Finally, while the code above works (the open call will fail if d->d_name does not name a directory) it is anything but efficient. In fact, the open system calls are quite expensive. Before any work is done, the kernel has to reserve a file descriptor. Since file descriptors are a shared resource this requires coordination and synchronization which is expensive. Synchronization also reduces parallelism, which might be a big issue in some code. The open call then has to follow the path which also is not free.

To make a long story short, here is how the code should look (again, sans error handling):

  DIR *dir = opendir(some_path);
  int dfd = dirfd(dir);
  struct dirent64 *d;
  while ((d = readdir64(dir)) != NULL) {
    if (d->d_type != DT_DIR && d->d_type != DT_UNKNOWN)
      continue;
    char path[PATH_MAX];
    snprintf(path, sizeof(path), "%s/somefile", d->d_name);
    int fd = openat(dfd, path, O_RDONLY);
    if (fd != -1) {
      ... do something ...
      close (fd);
    }
  }
  closedir(dir);

This rewrite addresses all the issues. It uses readdir64 which will do just fine in this case and it is safe when it comes to huge disk drives. It uses the d_type field of the dirent64 to check whether we already know the file is no directory. Most of Linux's file systems today fill in the d_type field correctly (including all the pseudo filesystems like sysfs and proc). Those file systems which do not have the information handy fill in DT_UNKNOWN which is why the code above allows this case, too. In some programs one also might want to allow DT_LNK since a symbolic link might point to a directory. But often enough this is not the case and not following symlinks is a security measure.

Finally, the new code uses openat to open the file. This avoids the lengthy path lookup and it closes most of the races of the original open call since the pathname lookup starts at the directory read by readdir64. Any change to the filesystem below this directory has no effect on the openat call. Also, since now the generated path is very short (just the maximum of 256 bytes for d_name plus 10) we know that the buffer path is sufficient.

It is easy enough to apply these changes to all the places which read directories. The result will be smaller, faster, and safer code.
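For reference, here is a sketch of the same pattern with the error handling put back in, wrapped into a function so it can be compiled as is. The function name and the idea of merely counting matches are assumptions made for the example, as is the decision to skip "." and "..".

```c
#define _GNU_SOURCE
#include <assert.h>
#include <dirent.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Count the subdirectories of dir_path which contain a readable file
   called name. Returns -1 if dir_path cannot be opened. */
int count_subdirs_with_file(const char *dir_path, const char *name)
{
    assert(dir_path != NULL && name != NULL);
    DIR *dir = opendir(dir_path);
    if (dir == NULL)
        return -1;
    int dfd = dirfd(dir);
    int count = 0;
    struct dirent64 *d;
    while ((d = readdir64(dir)) != NULL) {
        /* Skip entries known not to be directories. */
        if (d->d_type != DT_DIR && d->d_type != DT_UNKNOWN)
            continue;
        /* Skip the directory itself and its parent (choice made for this sketch). */
        if (strcmp(d->d_name, ".") == 0 || strcmp(d->d_name, "..") == 0)
            continue;
        char path[512];
        int n = snprintf(path, sizeof(path), "%s/%s", d->d_name, name);
        if (n < 0 || (size_t)n >= sizeof(path))
            continue;                      /* would have been truncated */
        /* The lookup starts at the directory being read, not at the root. */
        int fd = openat(dfd, path, O_RDONLY | O_CLOEXEC);
        if (fd != -1) {
            count++;
            close(fd);
        }
    }
    closedir(dir);                         /* also closes dfd */
    return count;
}
```

A call such as count_subdirs_with_file(some_path, "somefile") corresponds to the loop above with the "do something" step replaced by a counter.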

The Series is Underway

Ulrich Drepper - Fri, 21/09/2007 - 8:41pm
Jon Corbet has edited the first two sections of the document I mentioned earlier here and here.

The document will be published in multiple installments, beginning with Sections 1 and 2 which are available now. Since LWN is a business the reasonable limitation is put in place that for the first week only subscribers have access to it.

So, get a subscription to LWN.

If you find mistakes in the text let me know directly, either as a comment here or as a personal mail. Don't bother Jon with that.

SHA for crypt

Ulrich Drepper - Wed, 19/09/2007 - 9:55pm
Just a short note: I added SHA support to the Unix crypt implementation in glibc. The reason for all this (including replies to the expected "NIH" complaints) can be found here.

Publishing Update

Ulrich Drepper - Tue, 14/08/2007 - 3:43am

A few weeks back I asked how I should publish the document on memory and cache handling. I got quite some feedback.

  • There was the usual "it doesn't matter but I want it for free" crowd.
  • Then there was the "even $8 for a book is too much for me" group. These are people from outside the US and $8 translated to local currency and income is certainly far too much for many people. I do not throw this group in with the first.
  • Several people (all or mostly US-based) thought the idea of printed paper to be nice. The price was no issue.
  • Most people said a freely available PDF is more important than a printed copy. Some derogatory comments about lecturers who require books were heard. Others said editing isn't important.

Because of this first obnoxious group of people I would probably have gone with a print-only route. This attitude that just because somebody works on free software he always has to make everything available for free makes me sick. These are most probably the same people who never in their life produced anything that others found of value or they are the criminals working on (mostly embedded) projects exploiting free software.

But since I really want the document to be widely distributed and available to places where $8 is too much money I will release the PDF for free. But this won't happen right away. Unlike some of the people making comments I do think that editing is important. Fortunately having professional editing and a free PDF don't exclude each other.

I'll not go with a publisher (esp not these $%# at O'Reilly, as several people suggested). This would in most cases have precluded retaining the copyright and making the text available for free.

Instead the nice people at LWN, Jonathan Corbet and crew, will edit the document. They will then serialize it, I guess, along with the weekly edition. It's up to Jon to make this decision. The document has 8 large sections including the introduction, so my guess is that after 7 installments the whole document will be published. Once this has happened I'll then make the whole updated and edited PDF available.

This means if you think it's worth it, get a subscription to LWN instead of waiting a week to read it for free.

So in summary, I get professional editing, keep the copyright, and might be able to help get some more subscribers for LWN. Win, win, win. If the L in LWN bothers you I've news for you: the document itself is very Linux-centric.

I haven't forgotten the printed version. I've read a bit more of the Lulu documentation. Apparently there is a model where I don't have to pay anything. People ordering the book pay a per-copy price and that's it (apparently with discounts for larger orders). If I submit it in letter/A4 format I don't have to do any reformatting and the price is less (for the color print) since there are fewer pages.

I'll probably try to do this after the PDF is freely available. People who like to have something in their hands will get their wish. The only problem I see right now is that Lulu has a stupid requirement that the PDF documents must be generated with proprietary tools from Adobe. Of course I don't do this, I use pdfTeX. If this proves to be the case I guess I'll have to have a word with Bob Young...

Increasing Virtualization Insanity

Ulrich Drepper - Mon, 13/08/2007 - 11:52pm

People are starting to realize how broken the Xen model is with its privileged Dom0 domain. But the actions they want to take are simply ridiculous: they want to add the drivers back into the hypervisor. There are many technical reasons why this is a terrible idea. You'd have to add (back, mind you, Xen before version 2 did this) all the PCI handling and lots of other lowlevel code which is now maintained as part of the Linux kernel. This would of course play nicely into XenSource's (the company) pocket. Their technical people so far turn this down but I have no faith in this group: sooner or later they want to be independent of OS vendors and have their own mini-OS in the hypervisor. Adios remaining few advantages of the hypervisor model. But this is of course also the direction of VMware who loudly proclaim that in the future we won't have OSes as they exist today. Instead there will only be domains with a mini-OS, ideally consisting only of hooks into the hypervisor OS, in which single applications run.

I hope everybody realizes the insanity of this:

  • If they really mean single application this must also mean single-process. If not, you'll have to implement an OS which can provide multi-process services. But this means that you either have no support to create processes or you rely on a mini-OS which is a front for the hypervisor. In VMware's case this is some proprietary mini-OS and I imagine XenSource would like to do the very same.
  • Imagine that you have such application domains. All nicely separated because replicated. The result is a maintenance nightmare. What if a component which is needed in all application domains has to be updated? In a traditional system you update the one instance per machine/domain. With application domains you have to update every single one and not forget one.
And worst of all:
  • Don't people realize that this is the KVM model just implemented much more poorly and more proprietary? If you invite drivers and all the infrastructure into the hypervisor it is not small enough anymore to have a complete code review. I.e., you end up with a full OS which is too large for that. Why not use one which already works: Linux.

I fear I have to repeat myself over and over again until the last person recognizes that the hypervisor model does not work for the type of virtualization for commodity hardware we try to achieve. Using a hypervisor was simply the first idea which popped into people's head since it was already done before in quite different environments. The change from Xen v1 to v2 should have shown how rotten the model is. Only when you take a step back you can see the whole picture and realize the KVM model is not only better, it's the only logical choice.

I know people have invested in Xen and that KVM is not there yet but a) there has been a lot of progress in KVM-land and b) the performance is constantly improving and especially with next year's processor updates hardware virtualization costs will go down even further.

For sysadmin types this means: do what you have to do with Xen for now. But keep the investments small. For developers this means: don't let yourself be tied to a platform. Use an abstraction layer such as libvirt to bridge over the differences. For architects this means: don't look to Xen for answers, base your new designs on KVM.

How to publish?

Ulrich Drepper - Mon, 25/06/2007 - 5:08pm

That is meant as a question to the readers. The problem I have right now is that I have more or less finished the paper accompanying one of the talks I gave at the Red Hat Summit in Nashville last year. The slides for the talk about CPU Caches are available. But quite honestly, as with most slide sets, they don't do the topic any justice. I had to compress things to < 45 mins which is of course not enough. The paper covers everything I can currently think of and which makes sense with relation to CPU caches and CPU memory, as far as programmers are concerned (nothing for hardware people). The title I currently use is

What Every Programmer Should Know About Memory

and I think this is adequate.

For this reason I usually write a paper on the important topics I talk about. And this topic qualifies. I consider the topic especially important since it's almost never treated in the software world at all. College grads today in most cases have not the slightest clue about this topic. Ideally I'd like the paper to be picked up by some lecturers (like they do for many of my other publications) and used in a course. Heck, I'm even willing to teach it myself if that is what it takes to get credibility.

The problem I'm facing is that the document is (using my usual paper style, two column etc) around 100 densely packed pages long. Some of the people I've shown it to suggested that it should rather be published as a book. I'm a bit unsure about this. I have a few publishers who have been pestering me for a long time about writing something for them (some even prematurely submitted titles to distributors!). One I talked to would be willing to print it even though it's thin for a book. But there are a lot of pluses and minuses all around:

My PDF only
Going this route means the document is easy to change and extend. The format is exactly as I want it. The visibility is restricted, not in the print market. No professional review. Due to the size (and use of color) it is hard to print.
Go with a publisher
Professional editing, maybe a college edition, visibility through listing in catalogs etc. Additionally available as e-book. But it likely means the color has to go (printing in color is expensive) and there will be no free-of-charge copy. Getting a revision out will be almost impossible.
Go with Lulu
The alternative publishing route: I could submit an appropriately formatted PDF to Lulu and have them publish it. Demand printing, ISBN available. B&W and color printing possible. Even e-books if anybody cares. No professional editing.

Going with Lulu has the advantages I want but it's quite an effort. And there are costs associated with it. I do not plan to make money out of all this but I'd have to recover the costs. Excess gains would probably go to charity (in my case this is the Monterey Bay Aquarium in case anybody is interested).

So, the questions I have and would like to get some feedback on are:

  • Are printed copies wanted at all? Especially for those teaching, is it a prerequisite?
  • If yes, do you prefer a professional, more expensive book?
  • Or perhaps an amateur-ish publication which is either B&W and cheap (I guess not much more than $10)...
  • ... or a colored print for around $30. The paper has currently around 60 diagrams and color helps.

If you have an opinion, send a mail or add a comment to the blog (which won't be published). I know it is not easy to answer given that you haven't seen the material. But this is the same for most books, isn't it? Look at the slides and assume 100 times more details. I doubt I'll find many people who know all these details now (I had to do research myself).

grep and color

Ulrich Drepper - Fri, 01/06/2007 - 1:53pm

I cannot believe there are still people who are surprised when they see me working with the command line on my machine or when I tell them that the output of grep can use highlighting. Just add --color to the command line (with the optional argument just like ls). I implemented that more than six years ago. In my .bashrc I have the following:

alias egrep='egrep --color=tty -d skip'
alias egrpe='egrep --color=tty -d skip'
alias fgrep='fgrep --color=tty -d skip'
alias fgrpe='fgrep --color=tty -d skip'
alias grep='grep --color=tty -d skip'
alias grpe='grep --color=tty -d skip'

Yes, I mistype grep often enough to warrant the extra aliases. Using tty as the color mode means that if I pipe the output into another program there won't be any color escape sequences added which could irritate those programs.

Just make your life easier and add such aliases, too.

pthread_t and similar types

Ulrich Drepper - Tue, 22/05/2007 - 6:46pm

Constantly people complain that the runtime does not catch their mistakes. They are hiding behind this requirement in the POSIX specification (for pthread_join in this case, also applies to pthread_kill and similar functions):

       The pthread_join() function shall fail if:
       [...]

       ESRCH  No thread could be found corresponding to that specified by the given thread ID.

The glibc implementation follows this requirement to the letter. *IFF* we can detect that the thread descriptor is invalid we do return ESRCH.

But: the above does not mean that all uses of invalid thread descriptors must result in ESRCH errors. The reason is simple: the standard does not restrict the implementation in any way in the definition of the type pthread_t. It does not even have to be an arithmetic type. This means it is valid to use a pointer type and this is just what NPTL does.
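
One practical consequence of pthread_t being opaque: portable code cannot compare thread IDs with == and has to go through pthread_equal, which is the comparison POSIX defines. A minimal sketch; the helper name is_current_thread is made up for illustration and is not part of any library:

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical helper: reports whether t names the calling thread.
   pthread_t may be a pointer or even a structure, so the comparison
   goes through pthread_equal instead of the == operator. */
static bool is_current_thread(pthread_t t)
{
  return pthread_equal(t, pthread_self()) != 0;
}
```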

Nobody argues that functions like strcpy should not dump a core in case the buffer is invalid. The same for pthread_attr_t references passed to pthread_attr_init etc. The use of pthread_t when defined as a pointer is no different. The only complication is in the understanding that pthread_t can be a pointer type. This is obvious for void* etc.

In the POSIX committee we discussed several times changing the pthread_join and pthread_kill man pages. The ESRCH errors could be marked as may fail. But

  1. this really is not necessary, see above.
  2. it would mean we have to go through the entire specification and treat every other place where this is an issue the same way.

If somebody wants to do the work associated with the second step above and we have confidence in the results, we (= Austin Group) might make the change at some later date. But it is a rather high risk for no real gain. Programmers have to educate themselves anyway.

What remains is the question: how can programs avoid these mistakes? It is actually pretty simple: the program should make sure that no calls to pthread_kill, for instance, can happen when the thread is exiting. One way to solve this problem is:

  1. Associate a variable running of some sort and a mutex with each thread.
  2. In the function started by pthread_create (the thread function) set running to true.
  3. Before returning from the thread function or calling pthread_exit or in a cancellation handler acquire the mutex, set running to false, unlock the mutex, and proceed.
  4. Any thread trying to use pthread_kill etc first must get the mutex for the target thread, if running is true call pthread_kill, and finally unlock the mutex.
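
The four steps above could look roughly like this in C. This is only a sketch under assumptions: the structure and the function names (tracked_thread, tracked_enter, tracked_leave, tracked_kill) are invented for illustration, and the thread stores its own ID via pthread_self so that the ID is only ever read under the mutex.

```c
#define _POSIX_C_SOURCE 200809L  /* request the POSIX declarations */
#include <errno.h>
#include <pthread.h>
#include <signal.h>
#include <stdbool.h>

/* Step 1: hypothetical bookkeeping record, one per thread. */
struct tracked_thread {
  pthread_mutex_t lock;
  pthread_t tid;
  bool running;
};

/* Step 2: called first thing in the thread function. */
static void tracked_enter(struct tracked_thread *t)
{
  pthread_mutex_lock(&t->lock);
  t->tid = pthread_self();
  t->running = true;
  pthread_mutex_unlock(&t->lock);
}

/* Step 3: called before returning from the thread function, before
   pthread_exit, or from a cancellation handler. */
static void tracked_leave(struct tracked_thread *t)
{
  pthread_mutex_lock(&t->lock);
  t->running = false;
  pthread_mutex_unlock(&t->lock);
}

/* Step 4: signal the thread only while it is known to be running.
   Returns the pthread_kill result, or ESRCH if the thread is not
   marked as running. */
static int tracked_kill(struct tracked_thread *t, int sig)
{
  int result = ESRCH;

  pthread_mutex_lock(&t->lock);
  if (t->running)
    result = pthread_kill(t->tid, sig);
  pthread_mutex_unlock(&t->lock);
  return result;
}
```

A record would be initialized with something like `struct tracked_thread t = { .lock = PTHREAD_MUTEX_INITIALIZER };` before pthread_create is called. Because the exiting thread has to take the same mutex in tracked_leave, it cannot go away while another thread is in the middle of tracked_kill.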

This ensures that no invalid descriptor is used. But I can already hear people complain:

This is too expensive!

That is ridiculous. The implementation would have to do something similar if it would try to catch bad thread descriptors. In fact, it would have to do more. What is important is to recognize that this price would have to be paid by every program, not just the buggy ones. This is wrong. Only those people who need this extra protection should pay the price.

But I don't have control over the code calling pthread_create!

Boo hoo, cry me a river. Don't expect sympathy for using proprietary software. I will never allow good free software to be shackled because of proprietary code. If you cannot get this changed in the code you pay good money for this just means it is time to find a new supplier or, even better, use free software.

In summary, this is entirely a problem of the programs which experience them. Existing Linux systems are proof that it is possible to write complex programs without requiring the implementation to help incompetent programmers. We will have a few more words in the next revision of the POSIX specification which talk about this issue. But I expect they will be ignored anyway and all focus remains on the shall fail errors of pthread_kill etc.

The Growing Importance of Parallel Programming

Ulrich Drepper - Sat, 12/05/2007 - 5:49pm
At the 2007 Red Hat Summit in San Diego which just wrapped up yesterday I gave a talk about parallel programming which the marketing folks retitled Programming for tomorrow's high speed processors, today.

The crux of the talk is that programmers in the future cannot always rely on improving hardware to make their programs run faster. This is summarized nicely in the following graph which I generated from performance data for x86 processors.



The crucial part is the divergence of the two lines going forward and the flattening of the blue line. This means programs which are not able to take advantage of ever increasing numbers of processing cores simply won't run (much) faster.

Parallel programming is hard. There are algorithms to change to allow more than one thread in parallel. Well, not necessarily thread, especially on Linux one should use processes if the sharing requirement between the processes makes this feasible.

There are data structures to lay out correctly to allow a) vectorization and b) data parallelization. Vectorization is important if one wants to come even close to the peak performance listed for the processor. But when you do this you also have to know a lot about CPU design (pipelines etc), caches, and memory.
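
To make the layout point a bit more concrete: a loop over one field of an array of structures strides through memory, while the same data stored as a structure of arrays is contiguous and usually easier for a compiler to vectorize. A rough sketch with made-up type and function names; whether the loop really gets vectorized depends on the compiler and the flags used:

```c
#include <stddef.h>

/* Array-of-structures layout: x, y, z of one point are adjacent, so a
   loop touching only x skips over y and z on every iteration. */
struct point_aos { float x, y, z; };

/* Structure-of-arrays layout (hypothetical): all x values are contiguous. */
struct points_soa {
  float *x;
  float *y;
  float *z;
  size_t n;
};

/* Scale every x coordinate; all accesses are unit-stride on one array. */
static void scale_x(struct points_soa *p, float factor)
{
  for (size_t i = 0; i < p->n; ++i)
    p->x[i] *= factor;
}
```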

And then there is something people might have heard about but didn't really register: co-processors are back. Intel's Geneseo and AMD's Torrenza are technologies to couple 3rd party processors tightly to the existing processor-memory mash.

In general I think the industry is entirely ill-prepared for these upcoming changes. Many/most programmers are not able to write code with these requirements. Companies and other organizations will have to invest into education. The system providers (like Red Hat) have to find ways to make parallel programming easier.

One big step in the right direction is OpenMP. It is officially supported in gcc 4.2, and Red Hat has backported the changes to our gcc 4.1 used in RHEL5 and Fedora Core 6 and later. Not only does OpenMP allow relatively easy conversion of existing code, it also frees the programmer from dealing with all the details of thread lifetime handling, thread stacks, etc. Even mutual exclusion happens at a higher level. All this is good. It will make programmers more productive if only it is used more often.

But there is one more thing: the OpenMP runtime is basically in complete control. It can decide on using just one thread or many threads. It can decide where to run threads and many more things. All these details are hidden from the programmer. This is a good thing since it allows the runtime to perform optimizations. I'll have more about this at a later date.
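
For those who have not seen OpenMP code yet, here is a small sketch of what the conversion of an existing loop can look like. The function is a made-up example. With gcc the pragma takes effect when compiling with -fopenmp; without that option it is ignored and the loop runs serially, giving the same result up to floating-point rounding.

```c
/* Hypothetical example: sum of squares of v[0..n-1].
   No thread creation, stacks, or locks appear in the source; the
   reduction clause gives each thread a private partial sum and the
   runtime combines them.  How many threads run, and where, is up to
   the OpenMP runtime. */
static double sum_squares(const double *v, long n)
{
  double sum = 0.0;

#pragma omp parallel for reduction(+:sum)
  for (long i = 0; i < n; ++i)
    sum += v[i] * v[i];
  return sum;
}
```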

In summary, programmers have to learn, re-learn or for the first time, about parallelism. I think the topic of this talk is very important. If you are a Red Hat customer you could potentially ask for somebody from Red Hat to come in and talk about these issues. I'll give the slides and the details to our consulting organization and possibly also sales engineers. I cannot make any promises but I'll encourage those gals and guys to be willing to talk about this. If you're a big enough customer and you demand it, I might (have to) come out myself, if this is wanted. Or somebody can organize gatherings in places I have to go to anyway and have me speak there.

nscd and DNS TTL

Ulrich Drepper - Sat, 12/05/2007 - 5:04pm
Recently some people spread their non-existing knowledge about nscd (Name Service Cache Daemon) by claiming it ignores the TTL (time-to-live) value a DNS server returns. As far as I know this rampant ignorance is especially wide-spread in the ubuntu world. They claim that for this reason one has to run a local, caching DNS server. This is complete nonsense. nscd does handle TTL for a long time now (committed to the public CVS on 2004-9-15). All reasonable requests are handled, i.e., all getaddrinfo requests.

As I have pointed out many times before (here and here and in other places), it is completely unacceptable today to use gethostbyname etc. These functions simply don't work. Which is why I found it unnecessary to make the implementation of nscd more complicated and add more compatibility and maintenance problems just to fix one of the many problems these interfaces have. Just don't use them and convert all your programs (e.g., I think we've done just that for all of RHEL and Fedora nowadays). Also don't use

  getent hosts some.host


You have to use

  getent ahosts some.host


For all getaddrinfo lookups the TTL value from DNS replies takes precedence over the TTL value from /etc/nscd.conf. The latter is used for services which do not provide a TTL themselves (today all other services).

getaddrinfo is not just for IPv6

Ulrich Drepper - Wed, 07/03/2007 - 9:11am

I've heard far too often that getaddrinfo is only interesting for IPv6 and therefore can be ignored since one does not have IPv6.

Aside from the fact that all programs should be protocol independent this statement is bogus. gethostbyname etc do not perform correctly in some situations where only ever IPv4 is involved.

Assume you have an internal IPv4 network with, say, 192.168.x.y addresses. In addition you have a server (web server, for instance) which is also visible on the Internet. This server has two addresses: one 192.168.x.y address and one global address. The client is a NATed machine on the intranet.

Now what happens if the nameserver returns both addresses to a query for the addresses of said server? With gethostbyname the addresses are returned to the caller in the order they are received from the DNS server. Maybe some randomization is applied. In short, it is possible that the internal machine sees the public IPv4 address first and then connects to it. This is not only wasteful (the request has to be routed through a switch), it might even be dangerous (the traffic might actually have to go through the Internet).

With getaddrinfo this is not the case. The sorting according to RFC 3484 makes sure that the internal address of the server is returned first. The sorting function will notice that the source address used on the client is also an internal address and therefore the internal address of the server is a better match than the global address.

In summary, getaddrinfo is not only about IPv6. The old interfaces were simply completely inadequate and should never be used. If you still haven't converted your programs to use getaddrinfo instead of gethostbyname and gethostbyname2, do it now. Some time ago I wrote a brief intro.
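
As a reminder of how little code the conversion takes, here is a minimal sketch of a protocol-independent lookup. The helper name first_address is invented for this example. It only reports the first entry of the list getaddrinfo returns, i.e., the one the sorting considers the best match; a real client would walk the list and try connect on each entry until one succeeds.

```c
#define _POSIX_C_SOURCE 200809L  /* request the POSIX declarations */
#include <netdb.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical helper: store the numeric form of the first (preferred)
   address for host in buf.  Returns 0 on success or an error code
   suitable for gai_strerror. */
static int first_address(const char *host, char *buf, size_t len)
{
  struct addrinfo hints;
  struct addrinfo *res;
  int e;

  memset(&hints, 0, sizeof hints);
  hints.ai_family = AF_UNSPEC;      /* IPv4 or IPv6, whatever is there */
  hints.ai_socktype = SOCK_STREAM;

  e = getaddrinfo(host, NULL, &hints, &res);
  if (e != 0)
    return e;

  /* The list is already sorted; just convert the first entry. */
  e = getnameinfo(res->ai_addr, res->ai_addrlen, buf, (socklen_t) len,
                  NULL, 0, NI_NUMERICHOST);
  freeaddrinfo(res);
  return e;
}
```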

Xensource/VMWare start sandbagging

Ulrich Drepper - Tue, 27/02/2007 - 6:07am

With KVM proving more and more that it is viable Xensource and VMWare start sandbagging. They call KVM immature and the wrong approach (see their quotes in CNET article).

Calling KVM immature is, well, premature and misleading. Xen has a headstart of several years. KVM is today not supposed to be in the state Xen is. Nevertheless, KVM already has hardware virt support, SMP support, support for 64-bit host and guests (despite what the article says), live migration, and more. Xen simply started from the other direction, namely para-virt, hardware virt took them a long time and a lot of help from the hardware vendors. I think para-virt will be done RSN.

But immature is not the worst complaint. Claiming the hypervisor approach is the only viable option is what should get people worked up. Look at the arguments:

[...] but hypervisors offer better performance, have security advantages, and juggle the competing needs of multiple virtual machines better [...] In order to [deliver Virtual Infrastructure], you need the separate hypervisor layer.

These are bogus claims. And you have to realize where they come from. VMWare’s ESX is a kernel in itself, one which only few people work on (compared to something like Linux). Device drivers will always be a nightmare unless/until devices get their own PCI devices (once DMA can be virtualized). Nevertheless, ESX is a full OS by itself. Plus, ESX has the service console, a Linux OS. The service console of course has to have some control over the hypervisor.

For Xen the situation is similar. Here the hypervisor, after the mistakes of the 1.x series, doesn’t have device drivers included and uses a privileged domain, a complete OS.

This means, both Xen and VMWare do not have less code. I’d say they even have more code that is part of the privileged code base. Certainly a Linux installation hosting KVM domains can be scaled down to only have the kernel, kqemu, and the service console.

As for specific security support, Xen has in theory shype or whatever will come out of it. Like SELinux, it’s based on Flask. But it still is a separate code base. And if shype is actually moved out of the hypervisor itself and into a separate domain you have even more interfaces to worry about. I haven’t seen any security features of this caliber even mentioned for VMWare. With KVM, the SELinux policy governing the kernel can also handle the KVM module. It’s after all part of the same kernel. One implementation, one policy.

As for performance, let’s wait until KVM actually has been optimized. Ingo did some work on a para-virt network driver and the results are simply great. It’s just that performance tuning hasn’t been a focus. In theory there is absolutely no reason why the KVM approach should be any slower.

As for better scheduling with a hypervisor: that can only be a joke. Especially for Xen, the privileged domain (Dom0) has to be scheduled without the hypervisor having any insight into the Dom0 kernel. How can this be better? For VMWare we have a simple-minded OS serving as the hypervisor. The Linux kernel has support for all kinds of situations, including NUMA machines, many processor machines, HT and multi-core processors etc. And it’s an O(1) scheduler which sooner or later will make a difference even for hypervisors.

And then there are the advantages the KVM solution has. For instance, ever tried to run Xen on a laptop while on battery? It’s almost not worth it since power management does not exist. The machine will always run at full power. VMWare has the same problem. This is not only an issue for laptops. Cooling is a major issue in data centers. Maybe even a bigger issue with increasing density.

NUMA has already been mentioned, but there is also the memory allocation issue as part of the problem. Xen has nothing of it, I bet VMWare neither or something simple. The Linux kernel can provide KVM with all kinds of support, as the performance on big NUMA machines like SGI’s Altix shows.

In short: neither Xen nor VMWare have any real advantages which cannot be surmounted by giving KVM more time to catch up, i.e., grant it the same time to develop the features. On the other hand there are device driver issues which VMWare will never be able to muster. Xen is not included in the mainstream kernel and even with the paravirt_ops interface will be lagging because it needs synchronization.

So why do these companies (and Xensource makes this statement as a company) make such statements? The answer should not be surprising: they have a lot or all to lose. KVM can be the one-in-all solution, unlike any of the others. Xensource and VMWare want to get on your system by providing a hypervisor which then can be used with all kinds of OSes. But: they are in ultimate control. The idea that there suddenly is a virtualization solution which does not need any hypervisor must be absolutely frightening to them. So, they try to suppress this new technology from the start.

Don’t believe the propaganda. Try KVM once Fedora 7 is out. I expect it to be updated over the lifetime of Fedora 7. Or for the more adventurous people, start using rawhide now and keep using it.

DST Panic

Ulrich Drepper - Thu, 22/02/2007 - 9:27pm

With the DST rule changes for the US going into effect real soon (2007-3-11) people are panicking.

  1. there are those who run completely obsolete OSes. People contact us for support of RHL9 or even RHL7.2 (that's the predecessor of Fedora, for those who don't know). Guess what, even FC4 is not supported anymore, let alone anything earlier. The DST change is just one of many good reasons to update. Security is the other big one.
  2. many applications are broken and they use private timezone data. In their quest to achieve perfect portability the people writing the Java runtime added the data into their sources. And unfortunately the same has been done for libgcj. Only somebody without the slightest clue about the nature of DST rules would do something like this. Would Sun/BEA/IBM/... be willing to update their JVMs 20 times a year for all DST changes (if they would have that broad support in the first place)? Of course not. Only people in countries with stable rules would not think about it. There probably hasn't been a day in the life of the JVMs when the data was really accurate and complete.

Time zone changes are nothing special. I guess on average we see about 20 each year, maybe more. Do you see people panicking 20 times a year? No, only if the US is involved. Yes, you can argue that more computers are affected but aren't the computers which are in countries affected by those changes as important to the people living there? There are also banks, utilities, etc which need to keep the correct time.

Having lived here in the US now for quite a few years I think the root of the problem is the same that keeps the US from making progress on other fronts: fear of change and trying to prevent change through denial. Another example? Take the measurement system. When knowing the metric and the imperial system equally well, who would argue the latter is better? And it's not that people don't know the metric system at all. There are large numbers of people who serve/d in the military and all these people had to use it in their job. Every food container also shows grams. But I'm getting off-topic.

Fact is, people delay things. Delaying to update their OSes and even delaying to think about the problem. It might go away on its own and then no time has been wasted. But guess what: the DST change is coming.

So, people, get your act together. Update your OSes. If you really for some obscure reason cannot do this, update the timezone data (in /usr/share/zoneinfo). The data we have today is fully compatible, even the extended file format. Since there is no glibc update coming you could just overwrite the files without fear of reverting the changes inadvertently later. Update applications with their own timezone data. There are lots of (especially big) programs which come with their own data. The timezone data is free to copy by everybody so companies take advantage of this. Now you know why I always advised against this. More likely than not, updates for old versions of these programs are not available anymore. Make sure you let the companies who produce this kind of shitty product know what you think of them: duplicating timezone data is always bad. As for JVM: have fun! Old versions will get no updates and new versions often don't run on old OS versions. At least libgcj should now be fixed for good due to the work of one of my colleagues.

Once the timezone data is updated some more steps might be needed. Many programs will just work. glibc detects updates to the /etc/localtime file and reloads the data. Lots of people complained about this in the past and present since it means time operations cause many filesystem operations, but it is critical in some situations. If a process only uses the time functions which do not implicitly call tzset() it must be restarted. The same is true for processes which have the TZ environment variable set. In general you cannot know whether a process falls into any of the latter categories. The safest thing to do is to reboot the machine.
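
For your own code the tzset() issue can be sidestepped. localtime_r, for instance, is not required to behave as though tzset() was called, so a long-running program which wants to pick up new data can make the call itself. A minimal sketch, assuming the glibc behavior described above; the helper name format_local is made up, and this does not help a process which has TZ set in its environment:

```c
#define _POSIX_C_SOURCE 200809L  /* request the POSIX declarations */
#include <stddef.h>
#include <time.h>

/* Hypothetical helper: format t as local time into buf.
   The explicit tzset() call gives the C library the chance to notice
   changed timezone data before the conversion happens.
   Returns the number of bytes written, or 0 on failure. */
static size_t format_local(time_t t, char *buf, size_t len)
{
  struct tm tm;

  tzset();
  if (localtime_r(&t, &tm) == NULL)
    return 0;
  return strftime(buf, len, "%Y-%m-%d %H:%M:%S", &tm);
}
```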

More array fun

Ulrich Drepper - Tue, 20/02/2007 - 11:27pm

As a continuation of a previous post, here's another thing I frequently stumble across:

#include <stdio.h>
#include <string.h>
int main(void)
{
  const char s[] = "hello";
  strcpy (s, "bye");
  puts (s);
  return 0;
}

Yes, this code will produce a warning. But it will run. Slowly, since it does not do what the programmer actually meant. s is a dynamic variable. The compiler has to allocate space on the stack (or in TLS) and then copy the string from some static, read-only area into it. This of course is not only slow and uses memory, it also means that the newly created string is NOT read-only. The compiler-generated function prologue has to write to the memory.

Whenever you write code where you define an array in the scope of a function, always stop and think what the semantics should be. For all constant arrays it is almost always correct to have exactly one copy (it cannot be changed). If one copy is needed or OK then don't forget the static:

  static const char s[] = "hello";

If you do this, all of a sudden the code will not only produce a warning, it will also crash at runtime since the string is stored in read-only memory and s is now not really a variable anymore, it's a label for the region in read-only memory.
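
The difference can be seen without provoking the crash. In the following sketch (the function names are made up) the first function semantically gets a fresh copy of the string on every call, although an optimizing compiler may elide it, while the second one refers to the single read-only copy, which is why returning its address is fine:

```c
#include <stddef.h>
#include <string.h>

/* Automatic array: semantically each call copies the six bytes of
   "hello" (including the terminating NUL) into storage which belongs
   to this invocation only. */
static size_t greeting_len_automatic(void)
{
  const char s[] = "hello";
  return strlen(s);
}

/* Static array: exactly one copy exists and s is merely a name for
   that storage.  The returned pointer stays valid after the function
   returns, which would not be true for the automatic array above. */
static const char *greeting_static(void)
{
  static const char s[] = "hello";
  return s;
}
```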

But I Have Nothing Of Interest On My Machine

Ulrich Drepper - Tue, 13/02/2007 - 5:09am

I'm sick and tired of hearing people saying

I don't have to secure my machine since I have nothing of interest on it. Nobody would want to steal anything I have.

That's absolutely not the point. Yes, some attackers are after personal data like account numbers. But this is not all:

  • passwords are high on the list since people use the same password for all their accounts, be it banks, Amazon, eBay, whatever. Do you still agree you don't have anything interesting protected by those passwords?
  • if a machine can be taken over it can be used to a) sniff the local network, b) attack other machines, c) send spam. Some ISPs already stopped being lenient towards idiots who allow this to happen unchecked and they simply suspend the accounts. Do you care about having an Internet connection?

Security always matters even if the data stored on the machine is benign. Nobody should be allowed to even run machines which have no distinction between user and administrator. This includes more and more Linux people because new idiot distributions like Linspire, NimbleX, etc pop up. No machine should be without firewalls, in both directions. For RHEL/Fedora users it of course doesn't stop there, we have many more security features and if it would be up to me I would take out the switch to disable them.

Next time when you see somebody writing nonsense like the above (or hear them talking like this) do me a favor: smack them a bit so that they come to their senses. These are the people who create the opportunity for spam, phishing, and other illicit activities. Heck, they deserve more than a bit of smacking...

RSA conference, Day 1 (for me)

Ulrich Drepper - Wed, 07/02/2007 - 8:50am
I had the podium discussion today (nothing special to report) and so I stayed a bit longer until my ride arrived. What to do? The show floor is boring for me; nobody really targets developers. So join a few sessions.

The first by Eugene Kaspersky. Well known name, quite interesting title: The Dark Side of Cybercrime: Details on the Latest Hacker Tactics from Around the World. What would you expect when reading this? I myself expected to actually learn about attack vectors etc since this guy must be exposed to them on a daily basis.

Well, Mr. Kaspersky didn't think so. He spent the first 40-45 minutes on recounting the history of attacks, viruses, worms, trojans, etc. Some statistics thrown in, some pictures of authors. Then in the last 5-10 minutes he talks about attacks going on today but still only at the level of there will be phishing attacks, and data theft, and .... And suddenly it was all over?

If the title promises the latest tactics, why waste time on ancient history? When promising details, why only scratch the surface and throw out a few buzzwords? This was probably one of the most wasteful hours I've spent in a long time. Heck, I might have enjoyed an HR seminar more than this baloney.

Still not time to leave, so I go into the podium discussion about Virtualization and Security. I was skeptical from the get go. A panel without anyone who actually works on virtualization technology. Only security professionals, i.e., the people who benefit from security problems. Turns out this discussion is really meant as a big fright fest. It was an enumeration of additional problems in security, monitoring, auditing when you deploy virtualization. Close to the end one of the panelists actually asked (I paraphrase) And who in the audience still considers deploying virtualization after what you heard here today?

I'm always willing to accept that there are some new problems. They are mostly concerning the introduction of a new code base (hypervisor or the hardware emulation like KQEMU) and the interfaces between it and the VMs. But many (most?) of the problems they mentioned are home made or are simply problems which exist without virtualization. For instance, they were complaining about VLANs which are created between the domains so that a single NIC is sufficient for all domains. Dah! If this is a problem for you, don't do it. Use separate network cards for each domain. PCI forwarding is there and by the time people actually start deploying Intel will have VT-d in their chips (and AMD whatever they need). We'll soon enough have NICs with virtualization support built in (Infiniband already can do this today). Once this is true I hear them shout but who audits the firmware which implements this (it'll indeed be something mostly implemented in firmware). The answer here is again: do you audit the firmware of the NIC today? I don't think so and still it can very well be a security risk.

I took away from this that the security industry sees virtualization as yet another source of money and full employment. Yes, you'll have problems if you do stupid things when deploying virtualization. But the same is true without virtualization. I fail to see the difference. And the panel constantly reminded everybody that no company out there has a person who understands all the problems, front to back, from technical details about virtualization to specific problems of SOA deployments in virtualized environments. That's most probably true. But how is this different from non-virtual deployments? I dare a security professional to step forward and prove s/he knows all this. Heck, I can think of a gazillion security-relevant details at low levels which are not known except to people who actually work on that code.

The organizers claim that they try to keep the sessions clear from being marketing sessions. Mr Kaspersky certainly didn't manage to do this, my podium discussion obviously couldn't (it was after all about three specific implementations), and this virtualization session was a big see, we are more than ever relevant session by the security professionals (with special plugs of the Center for Internet Security).

What is missing are sessions with actual practical advice for programmers, i.e., to cure the root of all the evil. My Thursday session is probably one of the very few exceptions. And the funny thing is: during my podium session people actually made it known that one of the things they like to hear about at conferences is specifically this.

My opinion thus far: if you are a security professional, CSO, etc, run to San Francisco, don't walk. You'll get plenty of stories you can tell your boss to frighten her/him and give you a large budget and many underlings to have fun with. You'll also find people who want to sell you peace of mind and that should be well worth it to you. After all, you somehow have to spend the money your scared boss throws at you.

If you actually are interested in fixing the problem, don't bother. The organizers don't either.

Security Now! podcast

Ulrich Drepper - Sun, 04/02/2007 - 10:19am
I happened to listen to a few episodes of the Security Now podcast, by Leo Laporte and Steve Gibson. It's mostly Windows stuff, hence uninteresting technically, but it's an eye opener nevertheless. And not in the positive sense. They, well Steve, often make clueless comments about non-Windows OSes in his attempt to give every OS its fair share. But this of course backfires when the comments are wrong or misleading.

But the worst thing I came across so far is in episode 71, called "Securable". That's a program of Steve's and of no relevance. But he tried to explain the NX feature of modern x86/x86-64 processors and this is what he said (see the transcript):


[...] what this does is essentially it allows the system to stop virtually all buffer overruns. And that’s big. I mean, all the security problems that we encounter with incredibly small exception are buffer overrun attacks. [...]


This is really what he thinks, he repeats it with different words later on in the show.

It seems for him buffer overflow is synonymous with injecting code by writing over buffer boundaries and then executing that code in place. Everybody who deals with security will laugh about such a definition. These are the first generation of buffer overflows which were exploited. At least on platforms which are secure. In Linux we have had for the longest time means to protect against these kinds of attacks, ranging from address space randomization to NX emulation. This does not in any way stop buffer overflows from being a problem.

Buffer overflows can still be used to redirect program execution: overwriting return addresses for return-to-libc exploits (or other libraries), overwriting function pointers elsewhere, overwriting local variables and changing the direction of execution at branch points. The list goes on. These kinds of effects of buffer overruns are not detected by NX.
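As a minimal sketch of the "overwriting local variables" case, the following C fragment models an overflow that never executes injected code, so a non-executable stack or heap has nothing to catch. The struct layout, the names (request, handle_request, is_admin) and the deliberately missing length check are illustrative assumptions, not code from any real program. The copy is clamped to the size of the whole struct so that the demonstration itself stays well-defined.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical request record: a decision flag sits right behind the buffer. */
struct request {
    char name[16];
    int is_admin;
};

/* Returns nonzero if the request ends up being treated as privileged. */
int handle_request(const unsigned char *input, size_t len)
{
    struct request r;
    memset(&r, 0, sizeof r);

    /* Clamp to the struct so this demo never writes outside of r. */
    if (len > sizeof r)
        len = sizeof r;

    /* Modeled bug: the limit should have been sizeof r.name. */
    memcpy(&r, input, len);

    /* The branch is steered by attacker data; no injected code runs. */
    return r.is_admin != 0;
}
```

An input of at most 16 bytes leaves is_admin at zero. A longer input with nonzero trailing bytes flips the branch, and NX plays no role in either case.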

There are two ways I can interpret Steve's comments:


  1. On Windows, because it is such a soft target, attackers didn't have to bother with more sophisticated attacks and they really didn't happen. In this situation the attackers will simply adapt and use the attack vectors I described above.


  2. Steve doesn't know what he's talking about and he's doing his listeners a disservice by suggesting they are almost completely safe just because they enable NX.



For Steve's sake, and Leo's since he would be guilty by association, let's go with the first possibility. But all this means that Windows is years, many years, behind the Unix world when it comes to security. It might be a rude awakening for some people to find that the new features do not cure all problems.

Yes, MSFT has copied us on many levels and also implemented things like address space randomization and stack canaries. This will help but only if the features are enabled. And this is the second eye opener from the show. Windows apparently has no fine-grained control. This means at the slightest sign of problems the features will be turned off completely. They mentioned the BIOS and OS control of the NX bit. Since drivers and many applications are badly written the machines run with the feature turned off. One point for an easy sysadmin interface, but -100 points for security.

I think everybody who hopes that with the (slow) proliferation of MSFT's new OS release the Internet will be more secure is gravely mistaken. There are still not enough security features in place and those which are in place will be turned off. Heck, or they are not even implemented. Apparently several of the security features are not implemented in the 32-bit version to maintain compatibility.

This is very, very wrong. But it's been MSFT's goal: don't piss off the customer even if it's technically wrong and it causes huge problems for everybody. I'm a strong advocate of security over backward compatibility if there is a good reason. But usually it does not come to this because you can strengthen security without compromising backward compatibility. Case in point: see how we implemented non-executable stacks. Old programs continue to run while almost all new code automatically gets protected. And the cases which, automatically again, get flagged as requiring an executable stack got fixed. It is one of Red Hat's release requirements that no binary needs stack execution permission.

One last thing: it's really amusing to see that x86-64 pick-up (I mean real 64-bit code) is so slow on Windows. For the last 3 years I haven't been using any 32-bit machine except my laptop. This is no isolated case in the Linux world, we are well on the way to make 32-bit obsolete.

Lock-Free Datastructures

Ulrich Drepper - Sat, 03/02/2007 - 8:41am
I looked at an unfortunately quite widely cited paper about lock-free operations:

Lock-Free Techniques for Concurrent Access to Shared Objects

Shared object here refers to a data structure. The paper describes how to use compare-and-exchange to avoid mutual exclusion. This is the second example on the way to derive a working LIFO implementation:

lifo-pop (lf: pointer to lifo): pointer to cell
       C1:     loop
       C2:          head = lf->top                # get the top cell of the lifo
       C3:          if head == NULL
       C4:               return NULL              # LIFO is empty
       C5:          endif
       C6:          next = head->next             # get the next cell of cell
       C7:          if CAS (&lf->top, head, next) # try to set the top of the lifo to the next cell
       C8:               break
       C9:          endif
       C10: endloop
       C11: return head


The authors correctly point out that this code has a so-called ABA problem on preemption after C6. Assume another thread is scheduled and it pops the top element, pushes one or more other elements, and then pushes the initially popped element again (only the memory address matters here, the use can differ, so this actually makes sense). If then the first thread continues with C7 it will see the CAS operation succeed and it'll screw up the list. The solution is a double-word CAS operation as provided by x86 and x86-64. Hurray! I guess this is why Intel cites the paper in a few documents, it makes them look good.

But wait, if they care about preemption as they must, what about preemption between C2 and C6? In this case the top element (pointed to by head) can be popped and suddenly the pointer reference in C6 can go bad. This problem is never pointed out and people cite this paper to point out how double-word CAS saves the day.

Fact is, you have to write a very special, expensive, and limited data structure for the code to work. The pointer dereference must never fail. This means the memory used for the LIFO elements must never be freed. This means equipping the LIFO data structure with its own memory allocator. A small, specialized one, but an allocator nonetheless. This is a problem in case the number of elements can grow large since after that none of the elements can be freed again. At least in general. An implementation could try to determine that no thread holds a reference to an element and in this case proceed with freeing it. But the whole point behind lock-free data structures is that they are supposed to be fast. Adding what amounts to a simple RCU (Read Copy Update) implementation plus the memory allocator is counterproductive.
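To make the two hazards concrete, here is a rough C11 transcription of the pseudocode above using <stdatomic.h>. It is a sketch under assumptions: the type and function names (cell, lifo, lifo_push, lifo_pop) are made up for illustration, it uses a single-word CAS and therefore still has the ABA problem, and it is only safe if the cells come from storage that is never freed or unmapped, which is exactly the restriction discussed above.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical cell type; cells must live in memory that is never freed. */
struct cell {
    _Atomic(struct cell *) next;
    int value;
};

struct lifo {
    _Atomic(struct cell *) top;
};

void lifo_push(struct lifo *lf, struct cell *c)
{
    struct cell *head = atomic_load(&lf->top);
    do {
        atomic_store(&c->next, head);
        /* On failure, head is reloaded with the current top. */
    } while (!atomic_compare_exchange_weak(&lf->top, &head, c));
}

struct cell *lifo_pop(struct lifo *lf)
{
    struct cell *head = atomic_load(&lf->top);          /* C2 */
    while (head != NULL) {
        /* C6: if another thread popped and freed head between C2 and
           here, this dereference touches dead memory. */
        struct cell *next = atomic_load(&head->next);
        /* C7: single-word CAS, so the ABA problem remains. */
        if (atomic_compare_exchange_weak(&lf->top, &head, next))
            break;
    }
    return head;                                         /* C11 */
}
```

Run from a single thread the functions behave like an ordinary stack. The comments mark the points where preemption would break a version whose cells are freed after popping.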

I guess what I mean to say here is:



  1. lock-free data structures with today's processor technology are limited to very few uses


  2. take the paper above with a huge grain of salt, it's typical academia work, without relevance to practice




For now mutual exclusion is the best solution for most data structures, at least those which are general enough to have to deal with dynamically allocated memory. This will change in future, maybe in the not so distant future in fact.

So close, but no cigar

Ulrich Drepper - Tue, 30/01/2007 - 4:43pm
It's nice to see some people actually look at their DSOs and rewrite them to not be resource hogs. One recent example is this PCRE code and the optimization done by one Marco Barisione who should be applauded for starting the work. But then this:

const char *_pcre_ucp_names =
  "Any\0"
  "Arabic\0"
  "Armenian\0"
  ...
  "Zs";


This is a global variable. Anybody seeing what is wrong?

What this does is define a variable in .data (it's modifiable) which points to a constant string. This means


  1. An additional variable

  2. More attack points, the variable is writable

  3. An additional relocation

  4. Getting the string address requires a memory load and accessing the string itself requires two memory loads



People, think before writing code! All that is needed here is a name for the memory area containing the constant string. I.e.:

const char _pcre_ucp_names[] =
  "Any\0"
  "Arabic\0"
  "Armenian\0"
  ...
  "Zs";


See the difference? This one character removed and two added make all the difference in the world. The binary is smaller (at least 32 bytes on x86-64, more counting the simpler memory access in the actual code), one less relative relocation, faster code at runtime since the code to compute the string address needs no memory access.
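A small sketch of how such an array-based table can be used follows. The table is a shortened stand-in (four entries, under the hypothetical name ucp_names) since the real list is elided above, and the helper functions are assumptions made for illustration. Because the name refers to the array itself, sizeof yields the size of the whole table and no pointer variable has to be loaded first.

```c
#include <stddef.h>
#include <string.h>

/* Shortened, hypothetical stand-in for the real table. */
const char ucp_names[] =
    "Any\0"
    "Arabic\0"
    "Armenian\0"
    "Zs";

/* Size of the whole table in bytes, including the final NUL. */
size_t ucp_table_size(void)
{
    return sizeof ucp_names;
}

/* Returns the index-th name, or NULL if the table has fewer entries. */
const char *ucp_name_at(size_t index)
{
    const char *p = ucp_names;
    const char *end = ucp_names + sizeof ucp_names;

    while (p < end) {
        if (index == 0)
            return p;
        p += strlen(p) + 1;   /* skip this name and its NUL */
        index--;
    }
    return NULL;
}
```

With the pointer variant, sizeof would only report the size of the pointer and every access would first have to load the pointer value from .data.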

RSA conference

Ulrich Drepper - Mon, 29/01/2007 - 8:49am
I perhaps should mention that I'll be talking at the RSA conference in San Francisco on February 6th and 8th. I don't know yet whether I'll be around outside of these two times. There are not too many other talks which I am interested in. Two I found conflict with my own appearances. I have a few others but hardly anything which deals with secure development and system software design. Maybe somebody has some proposals.

Pointer Encryption

Ulrich Drepper - Wed, 24/01/2007 - 10:00pm
Mark pointed out that I haven't mentioned anywhere public parts of the security features new in FC6. Well, here it is.

One of the remaining attack vectors in the runtime are function pointers in writable memory. Overwrite the value and you can redirect execution. Of course the pointer must actually be used and randomization must be overcome, but it's theoretically possible.

The remedy I've implemented in libc internally is to encrypt function pointers. I.e., they are not stored as-is but instead in a mangled form. This mangling consists in my code of XOR-ing the pointer value with a random 32/64-bit value. Each process has its own random value. The code was publicly committed back in December 2005 and is in FC6.

The only real challenge was to make this fast. Especially on platforms like x86 which have no fast PC-relative data access. To not use a fixed address the value is stored in the TCB.
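The XOR mangling described above can be sketched in a few lines of C. This is not the libc code: the names (pointer_guard, guard_init, mangle, demangle, twice) are hypothetical, the guard lives in an ordinary static variable instead of the TCB, and the caller is assumed to supply the per-process random value. The conversion between function pointers and uintptr_t is implementation-defined but works on the usual POSIX platforms.

```c
#include <stdint.h>

typedef int (*callback_t)(int);

/* Per-process secret; a real runtime would fill this with random bits
   at startup and keep it somewhere cheap to reach, such as the TCB. */
static uintptr_t pointer_guard;

void guard_init(uintptr_t secret)
{
    pointer_guard = secret;
}

/* Value that is actually stored in writable memory. */
uintptr_t mangle(callback_t fn)
{
    return (uintptr_t)fn ^ pointer_guard;
}

/* Recover the callable pointer right before use. */
callback_t demangle(uintptr_t stored)
{
    return (callback_t)(stored ^ pointer_guard);
}

/* Sample callback for trying the functions out. */
int twice(int x)
{
    return 2 * x;
}
```

An attacker who overwrites the stored value without knowing the guard ends up with an essentially random address after demangling, which is the point of the scheme.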

What is protected? I hope meanwhile most function pointers in libc. Some are probably still missing and others cannot be handled this way since they are visible to the outside. For some broken programs (including UML) the setjmp change was the biggest. These programs tried to access the stored code address which now is not really useful anymore (programs don't know how to decrypt the value). Other pointers which are encrypted are the iconv and atexit structures as well as some function pointer tables people don't really know about, they are completely internal.

Using encryption (instead of canaries) to protect structures like jmp_buf is at least as secure and in addition faster. Question is whether we can extend the use to other parts of the runtime. Runtimes for languages like C++ and Java just scream for such a protection, virtual function tables are a prime target.

The Case For 64-bit Futexes

Ulrich Drepper - Fri, 12/01/2007 - 11:41pm
There are currently patches to implement 64-bit futexes floating around. So far, futexes are limited to 32 bits, regardless of platform. The new code will implement 64-bit futexes for 64-bit platforms. That's possible for all 64-bit architectures which can implement futexes in the first place. People were wondering why this code is needed. PIDs are not going to be extended.

Here is one reason: I have a new reader/writer lock implementation which is a whole lot faster. But to be general enough to replace the existing code in libpthread it needs 64-bit futexes. Does it matter? Here are the measurements, judge yourself.



This is an extreme case but something customers report as being quite close to frequent cases. Here only readers are used and they do nothing but take a lock, do some memory intensive work, unlock, and repeat. The graphs show the speed-up when adding more threads. The measurements are done on a UP quad core machine.

For the old code any number of threads greater than one loses. With the new implementation we can achieve about 50% speed-up when using more threads. It seems three threads is about the sweet spot. This is of course a detail of this specific test. Regardless, suddenly throwing threads at the problem yields better results and this is of course good news given that we'll see more and more cores.

Ulrich Drepper - Fri, 05/01/2007 - 5:21pm
I got some requests to share the MetaPost sources I use. The reason I haven't done this right away is that I haven't cleaned them up. At all. They are not parametrized, nothing. So far I copy the figure definitions and reuse them. The changes are minimal each time but that's it. I still haven't cleaned them up and probably won't for some time. So, the source for two graph types is here. The result looks like this:


OO.org PDFs exported

Ulrich Drepper - Sun, 31/12/2006 - 9:13pm
I write all my documents using TeX but I like using OO.org (especially calc) for preparing data. My test programs usually emit .csv files which calc can read. Then I can do the preprocessing of the data using the spreadsheet (the greatest type of application ever developed).

When I want to use the data in my TeX documents I used to export the graphs as PDFs. I.e., I copy&pasted them to draw and then wrote the PDF. This works OK, the results always looked good since they were native PDF files. Scaling in the PDF viewer works.

There is a drawback, though, which only became really obvious in my latest project. An upcoming document features a lot of graphs (about 20 so far). I used the OO.org generated PDFs with the result that the generated document PDF was larger than 2MB. Each graph weighs in at about 100kB. That's not acceptable nor necessary.

Looking at the files it's obvious that the overhead is due to embedded fonts. Something which is needed in most cases. It is not needed for me since I use the same typefaces in the document as well, or at least in the other graphs. Maybe there is a way to tell OO.org to not write out the font information. I haven't found it, though.

Regardless, I'm using something better now anyway. I'm already using MetaPost to generate figures. MetaPost comes with graph.mp, a macro package to generate graphs. The graphs generated by it by default are very plain and IMO not really publication quality. At least not for today's world, things were different when graph.mp was written.

But since this is MetaPost (for those who don't know it, it's basically Knuth's Metafont changed to generate Postscript graphs instead of font files) it's fully programmable. After reading graph.mp I now have templates for MetaPost to generate several of the types of graphs OO.org's calc can produce. All driven by text files containing the measurements.

The result: the graphs IMO look nicer since I have more fine-grained control over the generation and the document PDF file shrank from ~2.2M to 600kB. I just love TeX and friends. No OO.org word processor (or equivalent product) can ever produce documents of this quality.

Static linking clarification

Ulrich Drepper - Sat, 11/11/2006 - 10:57pm
I've updated my page on static linking a bit. I've seen too many people misinterpreting the recommendation when they talk about splitting a project in as many DSOs as possible. I never suggested that. Please read it. And if your project suffers from too many DSOs, please do something against it.

Linux event handling

Ulrich Drepper - Tue, 31/10/2006 - 12:40am
I had a little chat with Andrew Morton on Monday. The topic was the event handling, async I/O, etc. All the same stuff I talked about in Ottawa at OLS this year (paper and slides on my web page).

Unfortunately not much has happened since OLS. Lots of people said they would be willing and able to help but life intervened, I guess. I’m no kernel developer but perhaps I have to start taking this up (this is a threat...).

There is one stream of patches which exist but they are not usable in my opinion. I’m talking about Evgeniy Polyakov’s kevent patches. I sent my analysis to lkml but this didn’t result in much (except removing the ring buffer support). There are more fundamental problems with his proposed approach but I don’t get through to him and it appears as if I’m the only one complaining. Andrew keeps saying that this is because nobody else understands the topic. Well, let’s be a bit more specific. I’ll try again to motivate the case, describe what I think the userlevel interface should look like and then deduce something about the kernel interfaces. A bit of the latter Andrew and I did today. There are also network and file AIO aspects to this and then the DMA handling, but I’m restricting myself here to the event handling.

The fundamental premise is that multi-threaded code is absolutely necessary going forward. All processors grow horizontally by having ever and ever larger numbers of execution contexts (cores, hyper-threads, ...). For applications to scale there are two types of approaches:



  • Parallelize large chunks of code. This is not a suitable approach for small chunks of code (because the startup, and distribution and collection of results are not free) or for code which is hardly parallelizable. If code can be parallelized people hopefully will start using OpenMP and similar methods. Unfortunately not everything is parallelizable in this form.


  • Parallelize individual requests.




A web server is an example where the former approach is not usable but the latter is. Usually each request is easy to handle. Even for dynamically built pages the work is usually small. It makes no sense to parallelize the code. But many requests can be processed in parallel.

So it is logical to achieve parallelism by having many threads (or processes as in Apache). This is what programs do today. But this is not without problems:



  • The current network interfaces are not well prepared. If you park all inactive threads in the accept() call to get a new incoming connection you get a thundering herd effect if there is an incoming connection: all threads get woken up. But only one or a few will find that the accept() succeeded. The same is true for poll() etc calls.


  • If the number of requests is high it is a bad approach to increase the number of threads until no connection remains unanswered. The problem is that this adds a lot of administrative overhead in the kernel and processor (context switches). If the processing of the requests uses 100% of the CPU time we end up with a higher total processing time because of the overhead.


  • If the CPU time is not 100% we are wasting even more. If a thread does not have 100% utilization (and this is almost always the case) then the thread blocks somewhere, either in a system call or on a synchronization primitive. But this means there is CPU time available for the thread to do something else like working on another request. So the total system time increases because of management overhead for no good reason: one thread could do the work needed for two or more requests.




The optimal programming model would therefore use only as many threads as are needed to utilize the system at 100%. This means one or two threads per CPU or core.

To scale with this fixed number of threads and achieve 100% utilization it is necessary to avoid any blocking. If a thread has to perform an operation which might block (such as reading from a socket or waiting on a mutex) the normal call cannot be executed. Instead the thread requests an event to be signaled when the reason for the blocking is gone and while this is going on it starts/continues working on another request.
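With today's interfaces the "never block, ask for an event instead" pattern can be approximated for file descriptors using O_NONBLOCK and poll(). The following is only a sketch of that pattern, not the event mechanism proposed in this post. The helper names and the result codes (make_nonblocking, try_read, wait_readable, TRY_WOULD_BLOCK, TRY_ERROR) are assumptions made for illustration.

```c
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical result codes for try_read(). */
enum { TRY_WOULD_BLOCK = -1, TRY_ERROR = -2 };

/* Switch a descriptor to non-blocking mode; 0 on success, -1 on error. */
int make_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL);
    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1 ? -1 : 0;
}

/* Returns the number of bytes read (0 means end of file),
   TRY_WOULD_BLOCK if the caller should work on another request,
   or TRY_ERROR on a real failure. */
long try_read(int fd, void *buf, size_t len)
{
    ssize_t n;
    do {
        n = read(fd, buf, len);
    } while (n == -1 && errno == EINTR);

    if (n >= 0)
        return (long)n;
    if (errno == EAGAIN || errno == EWOULDBLOCK)
        return TRY_WOULD_BLOCK;
    return TRY_ERROR;
}

/* The "event" part: 1 if fd became readable, 0 on timeout, -1 on error. */
int wait_readable(int fd, int timeout_ms)
{
    struct pollfd p = { .fd = fd, .events = POLLIN };
    int rc;
    do {
        rc = poll(&p, 1, timeout_ms);
    } while (rc == -1 && errno == EINTR);

    if (rc <= 0)
        return rc;
    return (p.revents & (POLLIN | POLLHUP)) ? 1 : -1;
}
```

A worker thread would call try_read() and, on TRY_WOULD_BLOCK, park the request and pick up another one. poll() covers only file descriptors here, which is one reason a more general event interface is wanted.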

This is a model of appealing simplicity: scaling is achieved by adding threads into a central loop. Add threads corresponding to the capabilities of the machine not the complexity of the program. All threads are using the same code.

This simplicity in the inner loop does come at a price: code has to be restructured to avoid all blocking calls and the central loop has to be able to wait for events of all types corresponding to operations which can block and are avoided.

This is where my proposal comes in. The new event handling mechanism is the bit in the inner loop. We can define its abilities and properties as this:



  1. We can wait for all kinds of events corresponding to all the possible interfaces which can block


  2. Event notification had better be fast. Already today with 1Gb/s Ethernet we can create requests faster than they can be processed. This is only going to increase with 10Gb/s Ethernet or fabrics.


  3. Obviously, the interface must be usable in multi-threaded situations. This means multiple threads must be able to wait in the kernel for events. The alternative is to have one waiting and signal a thread in a thread pool that an event is available. This is unnecessarily slow. When multiple threads are used it must be avoided that an unnecessary number gets woken if only one or a few events are available. Ideally we wake one thread per available event.


  4. Ideally the interface is also usable from multiple processes.




A consequence of waking only one thread per available event is that no wakeup must be lost. This is a new concept which does not exist in the POSIX interfaces and the kernel (except the level-triggered epoll interface). For this to work we have to have a mechanism to tell the kernel "I cannot handle this event, tell another thread". This is important, for instance, when the system call to wait for the event is canceled (as in thread cancellation). In this case the userlevel function call does not return and the program does not even know the notification is lost. Therefore the system library wrapper around the wait-for-event system call needs to tell the kernel about the notification loss.

The requirement for such a system call to request additional notifications has itself a consequence. If the wait-for-event system call returns itself the data for the events instead of just a notification, then this data would have to be pushed back to the kernel as well. This is a high overhead. And more problematic: if there is not sufficient memory to store the event data, what to do then?

It is therefore better to separate the signalling of an event from delivering the data associated with the event. The data can be retrieved using another system call. Or: we can use shared memory. Shared between kernel and userlevel. Or userlevels (plural), see the multi-process requirement above.

If we implement the shared memory mechanism this calls for a ring buffer. Since there can be multiple threads working on the same ring buffer it must be possible to mark a ring buffer entry as worked on and/or at least as finished. The bulk of the ring buffer need not be modified. In fact, it might be best to not allow the userlevel code to change it at all. This way the kernel can store information in the ring buffer in a form which might cause problems if the values are changed (e.g., pointers of some sort). Not all the fields of a ring buffer entry need to be used or even usable by the userlevel code.

Proposed Kernel Interfaces

This is the current state of my/our thinking. We want to use a ring buffer. The memory must not be required to be pinned down. The memory might only be used among the threads in a process or among several processes. Depending on the needs the application should allocate memory either anonymously or file-backed. The size of the buffer can be chosen by the application.

To create an event queue we pass the address and size of the buffer to the kernel:

int kev_create (void *addr, size_t size, int flags);

We throw in a flags parameter to be on the safe side for future use.

We need some structure in the buffer. At the very least we need to be able to locate the ring buffer and identify the elements of the buffer which are currently active. If we are going to implement CPU ID-based cache line optimization (see next section) we need some sort of mapping from CPU ID to ring buffer entry. And we need to chain the ring buffer entries to reach only the entries for the CPU. Searching for them defeats the purpose.

Ideally we would split the memory area in two pieces. One piece is the ring buffer itself. This part needs not be written to by the userlevel code. See above for the advantages. The writable section is the part where the program(s) provide feedback to the kernel about the entries which have been handled. Here we have two possibilities:



  1. We use only a front and a tail pointer. The front pointer points to the first unused entry, the tail pointer to the last still-used entry. Everything between tail and front is assumed used even though some entries might actually have been handled already. Everything between front and tail is free and available to receive new events. This approach requires little work from the kernel when queuing new events but it means we might run out of space even if there are empty slots. The latter can be avoided by making it the policy that userlevel code has to evacuate the ring buffer slots ASAP.


  2. The alternative is to use per-ring buffer-entry accounting. Each entry has a flag associated with it. This means less memory requirement. We can still have front and tail pointers to mark regions about which we have absolute knowledge but the kernel would also try to look for holes and use those entries.




The alternatives do not differ too much. When moving the tail pointer we always have to have knowledge about the individual entries. It's just that in the second case the work is all done by the userlevel code. The kernel has an easier life and we avoid possibly long delays (imagine a huge ring buffer with many thousand entries). Maybe the best argument in favor of the first solution: we only have one pointer as modifiable state (the tail pointer, the kernel maintains the front pointer). This means we potentially can do away with the writable part of the mapping altogether. Just pass a void ** value to kev_create() and maintain this variable in the runtime. This advantage would go away if we need more writable state.
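The front/tail variant can be modeled in a few lines. The following single-threaded sketch is not the proposed kernel interface: the ring size, the struct layout and the function names (ring, ring_queue, ring_finish) are assumptions, the "kernel" side is just ring_queue(), and the per-entry done flags stand for bookkeeping the userlevel runtime keeps privately in order to know when the tail may advance.

```c
#include <stdbool.h>
#include <stddef.h>

#define RING_SIZE 4

struct ring {
    int events[RING_SIZE];
    bool done[RING_SIZE];   /* private userlevel bookkeeping */
    size_t front;           /* next position the producer fills */
    size_t tail;            /* oldest position still considered in use */
};

/* Producer ("kernel") side: fails when everything between tail and
   front is considered used, even if some of those entries are done. */
bool ring_queue(struct ring *r, int ev)
{
    if (r->front - r->tail == RING_SIZE)
        return false;
    r->events[r->front % RING_SIZE] = ev;
    r->done[r->front % RING_SIZE] = false;
    r->front++;
    return true;
}

/* Consumer side: mark the entry at absolute position pos as handled
   and advance the tail over every leading handled entry. */
void ring_finish(struct ring *r, size_t pos)
{
    r->done[pos % RING_SIZE] = true;
    while (r->tail != r->front && r->done[r->tail % RING_SIZE])
        r->tail++;
}
```

Finishing an entry in the middle frees nothing until all older entries are finished as well. This is the "out of space even if there are empty slots" situation of the first alternative, and the reason for the policy that userlevel code should clear slots as soon as possible.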

A few words about the memory handling of the ring buffer. I would imagine that on 64bit platforms we can use large areas. Several MBs if necessary. This would cover worst case scenarios. The key points are that a) the memory need not be pinned down (interrupt handlers can try to write and just punt to the helper code in case that fails because the memory is swapped out) and b) we can enforce a policy where the pages freed by advancing the tail pointer are simply discarded (madvise(MADV_REMOVE)). This can be done implicitly in the kernel or explicitly at userlevel. There is therefore no huge memory usage associated with the ring buffers unless the program never frees entries. It is all ordinary anonymous or file-backed memory, subject to existing rlimit restrictions. No need to invent yet more rlimits.

To register delivery of events through an event queue we likely need a number of interfaces. Fortunately one interface already exists: the sigevent structure. I described all this in my paper but some people keep misunderstanding this point. The use of sigevent has nothing whatsoever to do with using signals. It's just a convenient way to request for a whole bunch of existing interfaces that the new method of event delivery should be used. This applies, so far, to POSIX AIO (which might be extended for networking) and POSIX timers.

Other uses will need new interfaces. For instance, signal delivery could be requested through an extension of sigaction(). We define a new flag to be ORed into sa_flags and add int sa_kev to the union already containing sa_handler and sa_sigaction. None of these would be used at the same time.

My biggest worry at the moment is to find a suitable interface for futexes. I'll think about it once the work actually starts.



Implementation Notes

These are a few notes about implementation details Andrew and I came up with during the discussions. It is not necessarily all usable:



  • Ring buffer entries can be annotated with CPU ID (or perhaps even core ID). This would allow the scheduler to choose, among the threads which are waiting for events, one which is scheduled for the same CPU. At userlevel we have the getcpu() system call and we are therefore able to pick an entry from the ring buffer which matches the CPU. This means no unnecessary cache line transfers.


  • To locate the entries for a CPU the ring buffer could have a hash table or something similar which points to the appropriate entry. It would also be possible for the wait-for-event system call to return the index of the ring buffer entry for the current CPU.


  • I think the memory for the ring buffer should be either anonymous or backed by tmpfs. Nothing in a persistent file system. We should not in any form or way encourage that people even look at the data in the file. It should be meaningless and should not survive a reboot.


Critical or Not

Ulrich Drepper - Fri, 30/06/2006 - 11:51pm
One thing many people apparently don’t understand is that the same reported security problem can have different severity levels for different distributions. This is why one distribution might have to issue a security update right away when the vulnerability is made public while others can wait.

RHEL (especially RHEL4) has many security features which can alleviate many problems. Critical problems suddenly are not critical anymore since the security features will prevent the remote exploit. This is why we spent so much time on the security features.

So, next time you see somebody complain that a RHEL update for a vulnerability is not released in time make sure Red Hat does not classify the bug differently than your other distribution. Given that we are not shipping all kinds of junk and we can classify some vulnerabilities as less severe we can focus on the inevitable remaining problems.

MALLOC_PERTURB_

Ulrich Drepper - Wed, 21/06/2006 - 11:53pm
Seems like the number of people who know this feature is still almost zero. Well, RH developers know it and we have some who use it all the time on development machines. Basically, setting this environment variable causes the malloc functions in libc to return memory which has been wiped and clear memory when it is returned. Of course this does not affect calloc which always does clear the memory.

The reason for this exercise is, of course, to find code which uses memory returned by malloc without initializing it and code which uses memory after it is freed. valgrind can do this but it's costly to run. MALLOC_PERTURB_ trades the ability to detect problems in 100% of the cases for speed.

The by