sudo chmod u-s /bin/ping
sudo /usr/sbin/setcap cap_net_raw=ep /bin/ping
sudo chmod u-s /bin/ping6
sudo /usr/sbin/setcap cap_net_raw=ep /bin/ping6
During the 2.6.27 merge window a number of my patches were merged and now we are at the point where we can securely create file descriptors without the danger of possibly leaking information. Before I go into the details let's get some background information.
A file descriptor in the Unix/POSIX world has lots of state associated with it. One bit of information determines whether the file descriptor is automatically closed when the process executes an exec call to start executing another program. This is useful, for instance, to establish pipelines. Traditionally, when a file descriptor is created (e.g., with the default open() mode) this close-on-exec flag is not set and a programmer has to explicitly set it using
fcntl(fd, F_SETFD, FD_CLOEXEC);
Closing the descriptor is a good idea for two main reasons:
It is easy to see why the latter point is such a problem. Assume this common scenario:
A web browser has two windows or tabs open, both loading a new page (maybe triggered through Javascript). One connection is to your bank, the other to some random Internet site. The latter contains some random object which must be handled by a plug-in. The plug-in could be an external program processing some scripting language. The external program will be started through a fork() and exec sequence, inheriting from the web browser process all the file descriptors which are open and not marked with close-on-exec. The result is that the plug-in can have access to the file descriptor used for the bank connection. This is especially bad if the plug-in is used for a scripting language such as Flash because this could make the descriptor easily available to the script. In case the author of the script has malicious intentions you might end up losing money.
Until not too long ago the best programs could do was to set the close-on-exec flag for file descriptors as quickly as possible after the file descriptor had been created. Programs would break if the default for new file descriptors were changed to set the bit automatically.
This does not solve the problem, though. There is a (possibly brief) period of time between the return of the open() call or other function creating a file descriptor and the fcntl() call to set the flag. This is problematic because the fork() function is signal-safe (i.e., it can be called from a signal handler). In multi-threaded code a second thread might call fork() concurrently. It is theoretically possible to avoid these races by blocking all signals and by ensuring through locks that fork() cannot be called concurrently. This very quickly gets far too complicated to even contemplate:
It is therefore necessary to find a different solution. The first set of patches to achieve the goal went into the Linux kernel in 2.6.23, the last, as already mentioned, will be in the 2.6.27 release. The patches are all rather simple. They just extend the interface of various system calls so that already existing functionality can be taken advantage of.
The simplest case is the open() system call. To create a file descriptor with the close-on-exec flag atomically set all one has to do is to add the O_CLOEXEC flag to the call. There is already a parameter which takes such flags.
Next in complexity is the solution chosen to extend the socket() and socketcall() system calls. No flag parameter is available but the second parameter to these interfaces (the type) has a very limited range requirement. It was felt that overloading the parameter is an acceptable solution. It definitely makes using the new interfaces simpler.
The last group are interfaces where the original interface simply doesn't provide a way to pass additional parameters. In all these cases a generic flags parameter was added. This is preferable to using specialized new interfaces (like, for instance, dup2_cloexec) because we do and will need other flags. O_NONBLOCK is one case. Hopefully we'll have non-sequential file descriptors at some point and we can then request them using the flags, too.
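The two kinds of extension described above might be used as in the following sketch. The wrapper names (socket_cloexec, pipe_cloexec, dup_cloexec) are hypothetical and exist only for this example; the underlying calls are socket(), pipe2() and dup3() as provided by glibc on Linux.

```c
#define _GNU_SOURCE           /* for pipe2() and dup3() on glibc */
#include <fcntl.h>
#include <sys/socket.h>
#include <unistd.h>

/* socket(): the flag is OR-ed into the type parameter. */
int socket_cloexec(int domain, int type)
{
    return socket(domain, type | SOCK_CLOEXEC, 0);
}

/* pipe2(): like pipe() but with an additional flags parameter. */
int pipe_cloexec(int fds[2])
{
    return pipe2(fds, O_CLOEXEC);
}

/* dup3(): like dup2() but with an additional flags parameter. */
int dup_cloexec(int oldfd, int newfd)
{
    return dup3(oldfd, newfd, O_CLOEXEC);
}
```

In each case the descriptor is never visible without the close-on-exec flag, which is the whole point of the new interfaces.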
The (hopefully complete) list of interface changes which were introduced is listed below. Note: these are the userlevel changes. Inside the kernel things look different.
Userlevel Interface   What changed?
open                  O_CLOEXEC flag added
fcntl                 F_DUPFD_CLOEXEC command added
recvmsg               MSG_CMSG_CLOEXEC flag for transmission of file descriptors over Unix domain sockets
dup3                  New interface taking an additional flag parameter (O_CLOEXEC, O_NONBLOCK)
pipe2                 New interface taking an additional flag parameter (O_CLOEXEC, O_NONBLOCK)
socket                SOCK_CLOEXEC and SOCK_NONBLOCK flags added to type parameter
socketcall            SOCK_CLOEXEC and SOCK_NONBLOCK flags added to type parameter
paccept               New interface taking an additional flag parameter (SOCK_CLOEXEC, SOCK_NONBLOCK) and a temporary signal mask
fopen                 New mode 'e' to open file with close-on-exec set
eventfd               Takes new flags EFD_CLOEXEC and EFD_NONBLOCK
signalfd              Takes new flags SFD_CLOEXEC and SFD_NONBLOCK
timerfd               Takes new flags TFD_CLOEXEC and TFD_NONBLOCK
epoll_create1         New interface which takes a flag value. Supports EPOLL_CLOEXEC and EPOLL_NONBLOCK
inotify_init1         New interface taking a flag parameter (IN_CLOEXEC, IN_NONBLOCK)
When should these interfaces be used? The answer is simple: whenever the author is not sure that no asynchronous fork()+exec can happen and that no concurrently running thread executes fork()+exec (or posix_spawn(), BTW).
Application writers might have control over this. But I'd say that in all library code one has to play it safe. In glibc we now open the file descriptor with the close-on-exec flag set in almost all interfaces. This means a lot of work but it has to be done. Applications also have to change (see this autofs bug, for instance).
ajax told me that extra wide screens now work with the latest Fedora 9 binaries for X11. So I had to try it out and after some experimenting I got it to work. To save others the work here is what I did.
Hardware:
I use the free driver, of course. No need for 3D here.
The old way to get a spanning desktop was to use Xinerama. This has been replaced by xrandr nowadays. xrandr is not just for external screens of laptops and to change the resolution. One can assign the origin of various screens and therefore display different parts of a bigger virtual desktop. This is the whole trick here. The /etc/X11/xorg.conf file I use is this:
Section "ServerLayout"
    Identifier  "dual head configuration"
    Screen      0 "Screen0" 0 0
    InputDevice "Keyboard0" "CoreKeyboard"
EndSection
Section "InputDevice"
    Identifier "Keyboard0"
    Driver     "kbd"
    Option     "XkbModel" "pc105"
    Option     "XkbLayout" "us+inet"
EndSection
Section "Device"
    Identifier "Videocard0"
    Driver     "radeon"
    Option     "monitor-DVI-0" "dvi0"
    Option     "monitor-DVI-1" "dvi1"
EndSection
Section "Monitor"
    Identifier "dvi0"
    Option     "Position" "2560 0"
EndSection
Section "Monitor"
    Identifier "dvi1"
    Option     "LeftOf" "dvi0"
EndSection
Section "Screen"
    Identifier   "Screen0"
    Device       "Videocard0"
    DefaultDepth 16
    SubSection "Display"
        Viewport 0 0
        Depth    16
        Modes    "2560x1600"
        Virtual  5120 1600
    EndSubSection
EndSection
Fortunately X11 configuration got much easier since I last had to edit the file by hand. I started from the most basic setup for a single screen which the installer or system-config-display will be happy to create for you. The important changes on top of this initial version are these:
Option "monitor-DVI-0" "dvi0"
Option "monitor-DVI-1" "dvi1"
These lines in the Device section announce the two screens. It is unfortunately not well (at all?) documented that the first parameter strings are magic. If you run xrandr -q on your system with two screens attached you'll see the identifiers assigned to the screens by the system. In my case:
$ xrandr -q
Screen 0: minimum 320 x 200, current 5120 x 1600, maximum 5120 x 1600
DVI-1 connected 2560x1600+0+0 (normal left inverted right x axis y axis) 646mm x 406mm
...
DVI-0 connected 2560x1600+2560+0 (normal left inverted right x axis y axis) 646mm x 406mm
...
Add to the names DVI-0 and DVI-1 the magic prefix monitor- and add as the second parameter string an arbitrary identifier. Do not drop or change the monitor- prefix, that's the main magic which seems to make all this work. Then create two monitor sections in the xorg.conf file, one for each screen:
Section "Monitor"
    Identifier "dvi0"
    Option     "Position" "2560 0"
EndSection
Section "Monitor"
    Identifier "dvi1"
    Option     "LeftOf" "dvi0"
EndSection
The Identifier lines must of course match the identifiers used in the Device section. The rest are options which determine what the screens show. Since the LCDs have a resolution of 2560x1600 and since I want to have a spanning desktop and the DVI-0 connector is used for the display on the right side, I'm using an x-offset of 2560 and a y-offset of 0 for that screen. Then just tell the server to place the second screen to the left of it and the server will figure out the rest.
What remains to be done is to tell the server how large the screen in total is. That's done using
Virtual 5120 1600
The numbers should explain themselves. Now the two screens show non-overlapping regions of the total desktop with no area not displayed, all due to the correct arithmetic in the calculation of the total screen size and the offset.
Note: there is only one Screen section. That's something which is IIRC different from the last Xinerama setup I did years ago.
If you need more proof that this is insane just look at some of the packages using it. I recently was looking at krb5-auth-dialog. The output of ldd -u -r on the original binary shows 26 unused DSOs.
This can be changed quite easily: add -Wl,--as-needed to the link line. Doing this for this package makes all but one of the unused dependencies go away. This has several benefits:
The binary size is actually measurably reduced.
   text    data     bss     dec     hex filename
  35944    6512      64   42520    a618 src/krb5-auth-dialog-old
  35517    6112      64   41693    a2dd src/krb5-auth-dialog
That’s a 2% improvement. Note that the saved dependencies are all recursive dependencies. The runtime is therefore not much affected (only a little). The saved data is pure overhead. Multiply the number by the thousands of binaries and DSOs which are shipped and the savings are significant.
The second problem to mention here is that not all unused dependencies are gone because somebody thought s/he is clever and uses -pthread in one of the pkgconfig files instead of linking with -lpthread. That’s just stupid when combined with the insanity called libtool. The result is that the -Wl,--as-needed is not applied to the thread library.
Just avoid libtool and pkgconfig. At the very least fix up the pkgconfig files to use -Wl,--as-needed.
Jonathan and crew published part 2 of the paper. If you have an LWN subscription you can read it here.
In the last weeks I have seen far too much code which reads directory content in horribly inefficient ways to let this slide. Programmers really have to learn to do this efficiently. Some of the instances I've seen are in code which runs frequently. Frequently as in once per second. Doing it right can make a huge difference.
The following is an exemplary piece of code. Not taken from an actual project but it shows some of the problems quite well, all in one example. I drop the error handling to make the point clearer.
DIR *dir = opendir(some_path);
struct dirent *d;
struct dirent d_mem;
/* readdir_r takes the directory stream; at the end of the directory it returns 0 and sets d to NULL */
while (readdir_r(dir, &d_mem, &d) == 0 && d != NULL) {
  char path[PATH_MAX];
  snprintf(path, sizeof(path), "%s/%s/somefile", some_path, d->d_name);
  int fd = open(path, O_RDONLY);
  if (fd != -1) {
    ... do something ...
    close (fd);
  }
}
closedir(dir);
How many things are inefficient at best and outright problematic in some cases?
Let's enumerate:
readdir_r is only needed if multiple threads are using the same directory stream. I have yet to see a program where this really is the case. In this toy example the stream (variable dir) is definitely not shared between different threads. Therefore the use of readdir is just fine. Should this matter? Yes, it should, since readdir_r has to copy the data into the buffer provided by the user while readdir has the possibility to avoid that.
Instead of readdir code should in fact use readdir64. The definition of the dirent structure comes from an innocent time when hard drives with a couple of dozen MB of capacity were huge. Things change and we need larger values for inode numbers etc. Modern (i.e., 64-bit) ABIs do this by default but if the code is supposed to be used on 32-bit machines as well the *64 variants should always be used.
Path length limits are becoming an ever-increasing problem. Linux, like most Unix implementations, imposes a length limit on each filename string which is passed to a system call. But this does not mean that in general path names have any length limit. It just means that longer names have to be implicitly constructed through the use of multiple relative path names. In the example above, what happens if some_path is already close to PATH_MAX bytes in size? It means the snprintf call will truncate the output. This can and should of course be caught but this doesn't help the program. It is crippled.
Any use of filenames with path components (i.e., with one or more slashes in the name) is racy: an attacker can change any of the contained path components. This can lead to exploits. In the example, the some_path string itself might be long and traverse multiple directories. A change in any of these will lead to the open call not reaching the desired file or directory.
Finally, while the code above works (the open call will fail if d->d_name does not name a directory) it is anything but efficient. In fact, the open system calls are quite expensive. Before any work is done, the kernel has to reserve a file descriptor. Since file descriptors are a shared resource this requires coordination and synchronization which is expensive. Synchronization also reduces parallelism, which might be a big issue in some code. The open call then has to follow the path which also is not free.
To make a long story short, here is how the code should look (again, sans error handling):
DIR *dir = opendir(some_path);
int dfd = dirfd(dir);
struct dirent64 *d;
/* readdir64 takes the directory stream, not the entry pointer */
while ((d = readdir64(dir)) != NULL) {
  if (d->d_type != DT_DIR && d->d_type != DT_UNKNOWN)
    continue;
  char path[PATH_MAX];
  snprintf(path, sizeof(path), "%s/somefile", d->d_name);
  int fd = openat(dfd, path, O_RDONLY);
  if (fd != -1) {
    ... do something ...
    close (fd);
  }
}
closedir(dir);
This rewrite addresses all the issues. It uses readdir64 which will do just fine in this case and it is safe when it comes to huge disk drives. It uses the d_type field of the dirent64 structure to check whether we already know the file is not a directory. Most of Linux's file systems today fill in the d_type field correctly (including all the pseudo filesystems like sysfs and proc). Those file systems which do not have the information handy fill in DT_UNKNOWN which is why the code above allows this case, too. In some programs one also might want to allow DT_LNK since a symbolic link might point to a directory. But often enough this is not the case and not following symlinks is a security measure.
Finally, the new code uses openat to open the file. This avoids the lengthy path lookup and it closes most of the races of the original open call since the pathname lookup starts at the directory read by readdir64. Any change to the filesystem below this directory has no effect on the openat call. Also, since now the generated path is very short (just the maximum of 256 bytes for d_name plus 10) we know that the buffer path is sufficient.
It is easy enough to apply these changes to all the places which read directories. The result will be smaller, faster, and safer code.
A few weeks back I asked how I should publish the document on memory and cache handling. I got quite some feedback.
Because of this first obnoxious group of people I would probably have gone with a print-only route. This attitude that just because somebody works on free software he always has to make everything available for free makes me sick. These are most probably the same people who never in their life produced anything that others found of value or they are the criminals working on (mostly embedded) projects exploiting free software.
But since I really want the document to be widely distributed and available to places where $8 is too much money I will release the PDF for free. But this won't happen right away. Unlike some of the people making comments I do think that editing is important. Fortunately having professional editing and a free PDF don't exclude each other.
I'll not go with a publisher (esp not these $%# at O'Reilly, as several people suggested). This would in most cases have precluded retaining the copyright and making the text available for free.
Instead the nice people at LWN, Jonathan Corbet and crew, will edit the document. They will then serialize it, I guess, along with the weekly edition. It's up to Jon to make this decision. The document has 8 large sections including the introduction which means my guess is that after 7 installments the whole document is published. Once this has happened I'll then make the whole updated and edited PDF available.
This means if you think it's worth it, get a subscription to the LWN instead of waiting a week to read it for free.
So in summary, I get professional editing, keep the copyright, and might be able to help getting some more subscribers for the LWN. Win, win, win. If the L in LWN bothers you I've news for you: the document itself is very Linux-centric.
I haven't forgotten the printed version. I've read a bit more of the Lulu documentation. Apparently there is a model where I don't have to pay anything. People ordering the book pay a per-copy price and that's it (apparently with discounts for larger orders). If I submit it in letter/A4 format I don't have to do any reformatting and the price is less (for the color print) since there are fewer pages.
I'll probably try to do this after the PDF is freely available. People who like to have something in their hands will have their wishes. The only problem I see right now is that Lulu has a stupid requirement that the PDF documents must be generated with proprietary tools from Adobe. Of course I don't do this, I use pdfTeX. If this proves to be the case I guess I'll have to have a word with Bob Young...
People are starting to realize how broken the Xen model is with its privileged Dom0 domain. But the actions they want to take are simply ridiculous: they want to add the drivers back into the hypervisor. There are many technical reasons why this is a terrible idea. You'd have to add (back, mind you, Xen before version 2 did this) all the PCI handling and lots of other lowlevel code which is now maintained as part of the Linux kernel. This would of course play nicely into Xensource's (the company) pocket. Their technical people so far turn this down but I have no faith in this group: sooner or later they want to be independent of OS vendors and have their own mini-OS in the hypervisor. Adios remaining few advantages of the hypervisor model. But this is of course also the direction of VMWare who loudly proclaim that in the future we won't have OS as they exist today. Instead only domains with mini-OS which are ideally only hooks into the hypervisor OS where single applications run.
I hope everybody realizes the insanity of this:
I fear I have to repeat myself over and over again until the last person recognizes that the hypervisor model does not work for the type of virtualization for commodity hardware we try to achieve. Using a hypervisor was simply the first idea which popped into people's heads since it was already done before in quite different environments. The change from Xen v1 to v2 should have shown how rotten the model is. Only when you take a step back can you see the whole picture and realize the KVM model is not only better, it's the only logical choice.
I know people have invested in Xen and that KVM is not there yet but a) there has been a lot of progress in KVM-land and b) the performance is constantly improving and especially with next year's processor updates hardware virtualization costs will go down even further.
For sysadmin types this means: do what you have to do with Xen for now. But keep the investments small. For developers this means: don't let yourself be tied to a platform. Use an abstraction layer such as libvirt to bridge over the differences. For architects this means: don't look to Xen for answers, base your new designs on KVM.
That is meant as a question to the readers. The problem I have right now is that I have more or less finished the paper accompanying one of the talks I gave at the Red Hat Summit in Nashville last year. The slides for the talk about CPU Caches are available. But quite honestly, as most slide sets, they don't do the topic any justice. I had to compress things to < 45 mins which is of course not enough. The paper covers everything I can currently think of and which makes sense with relation to CPU caches and CPU memory, as far as programmers are concerned (nothing for hardware people). The title I currently use is
What Every Programmer Should Know About Memory
and I think this is adequate.
For this reason I usually write a paper on the important topics I talk about. And this topic qualifies. I consider the topic especially important since it's almost never treated in the software world at all. College grads today in most cases have not the slightest clue about this topic. Ideally I'd like the paper to be picked up by some lecturers (like they do for many of my other publications) and used in a course. Heck, I'm even willing to teach it myself if that is what it takes to get credibility.
The problem I'm facing is that the document is (using my usual paper style, two column etc) around 100 densely packed pages long. Some of the people I've shown it to suggested that it should rather be published as a book. I'm a bit unsure about this. I have a few publishers who have for a long time been pestering me about writing something for them (some even prematurely submitted titles to distributors!). One I talked to would be willing to print it even though it's thin for a book. But there are a lot of pluses and minuses all around:
Going with Lulu has the advantages I want but it's quite an effort. And there are costs associated with it. I do not plan to make money out of all this but I'd have to recover the costs. Excess gains would probably go to charity (in my case this is the Monterey Bay Aquarium in case anybody is interested).
So, the questions I have and would like to get some feedback on are:
If you have an opinion, send a mail or add a comment to the blog (which won't be published). I know it is not easy to answer given that you haven't seen the material. But this is the same for most books, isn't it? Look at the slides and assume 100 times more details. I doubt I'll find many people who know all these details now (I had to do research myself).
I cannot believe there are still people who are surprised when they see me working with the command line on my machine or when I tell them that the output of grep can use highlighting. Just add --color to the command line (with the optional argument just like ls). I implemented that more than six years ago. In my .bashrc I have the following:
alias egrep='egrep --color=tty -d skip'
alias egrpe='egrep --color=tty -d skip'
alias fgrep='fgrep --color=tty -d skip'
alias fgrpe='fgrep --color=tty -d skip'
alias grep='grep --color=tty -d skip'
alias grpe='grep --color=tty -d skip'
Yes, I mistype grep often enough to warrant the extra aliases. Using tty as the color mode means that if I pipe the output into another program there won't be any color escape sequences added which could irritate those programs.
Just make your life easier and add such aliases, too.
Constantly people complain that the runtime does not catch their mistakes. They are hiding behind this requirement in the POSIX specification (for pthread_join in this case, also applies to pthread_kill and similar functions):
The pthread_join() function shall fail if:
[...]
ESRCH No thread could be found corresponding to that specified by the given thread ID.
The glibc implementation follows this requirement to the letter. *IFF* we can detect that the thread descriptor is invalid we do return ESRCH.
But: the above does not mean that all uses of invalid thread descriptors must result in ESRCH errors. The reason is simple: the standard does not restrict the implementation in any way in the definition of the type pthread_t. It does not even have to be an arithmetic type. This means it is valid to use a pointer type and this is just what NPTL does.
Nobody argues that functions like strcpy should not dump a core in case the buffer is invalid. The same for pthread_attr_t references passed to pthread_attr_init etc. The use of pthread_t when defined as a pointer is no different. The only complication is in the understanding that pthread_t can be a pointer type. This is obvious for void* etc.
In the POSIX committee we discussed several times changing the pthread_join and pthread_kill man pages. The ESRCH errors could be marked as may fail. But
If somebody wants to do the work associated with the second step above and we have confidence in the results, we (= Austin Group) might make the change at some later date. But it is a rather high risk for no real gain. Programmers have to educate themselves anyway.
What remains is the question: how can programs avoid these mistakes? It is actually pretty simple: the program should make sure that no calls to pthread_kill, for instance, can happen when the thread is exiting. One way to solve this problem is:
This ensures that no invalid descriptor is used. But I can already hear people complain:
This is too expensive!
That is ridiculous. The implementation would have to do something similar if it tried to catch bad thread descriptors. In fact, it would have to do more. What is important is to recognize that this price would have to be paid by every program, not just the buggy ones. This is wrong. Only those people who need this extra protection should pay the price.
But I don't have control over the code calling pthread_create!
Boo hoo, cry me a river. Don't expect sympathy for using proprietary software. I will never allow good free software to be shackled because of proprietary code. If you cannot get this changed in the code you pay good money for, this just means it is time to find a new supplier or, even better, use free software.
In summary, this is entirely a problem of the programs which experience them. Existing Linux systems are proof that it is possible to write complex programs without requiring the implementation to help incompetent programmers. We will have a few more words in the next revision of the POSIX specification which talk about this issue. But I expect they will be ignored anyway and all focus remains on the shall fail errors of pthread_kill etc.
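One way a program can make sure pthread_kill is never called on an exiting or exited thread is sketched below. This is only an illustration of the general idea (a per-thread flag protected by a mutex), not necessarily the exact scheme meant above; all names (struct tracked_thread, tracked_create, tracked_kill, return_arg) are made up for the example. It assumes the thread is joinable and that the structure outlives the thread.

```c
#include <errno.h>
#include <pthread.h>
#include <signal.h>
#include <stdbool.h>

struct tracked_thread {
    pthread_mutex_t lock;     /* protects running and id */
    bool running;
    pthread_t id;
    void *(*fn)(void *);
    void *arg;
};

/* Cleanup handler: runs on return, pthread_exit, and cancellation. */
static void mark_finished(void *p)
{
    struct tracked_thread *t = p;
    pthread_mutex_lock(&t->lock);
    t->running = false;
    pthread_mutex_unlock(&t->lock);
}

static void *tracked_trampoline(void *p)
{
    struct tracked_thread *t = p;
    void *result;
    pthread_cleanup_push(mark_finished, t);
    result = t->fn(t->arg);
    pthread_cleanup_pop(1);   /* execute mark_finished on normal return */
    return result;
}

int tracked_create(struct tracked_thread *t, void *(*fn)(void *), void *arg)
{
    pthread_mutex_init(&t->lock, NULL);
    t->fn = fn;
    t->arg = arg;
    pthread_mutex_lock(&t->lock);
    int err = pthread_create(&t->id, NULL, tracked_trampoline, t);
    t->running = (err == 0);
    pthread_mutex_unlock(&t->lock);
    return err;
}

/* Sends sig only while the thread is known to be running; otherwise
   returns ESRCH without ever touching the thread descriptor. */
int tracked_kill(struct tracked_thread *t, int sig)
{
    pthread_mutex_lock(&t->lock);
    int err = t->running ? pthread_kill(t->id, sig) : ESRCH;
    pthread_mutex_unlock(&t->lock);
    return err;
}

/* Trivial thread body used to try the functions out. */
void *return_arg(void *arg)
{
    return arg;
}
```

The cost is one mutex operation per signal sent and one at thread exit, and it is paid only by the programs which need the protection.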

getent hosts some.host
getent ahosts some.host
I've heard far too often that getaddrinfo is only interesting for IPv6 and therefore can be ignored since one does not have IPv6.
Aside from the fact that all programs should be protocol independent this statement is bogus. gethostbyname etc do not perform correctly in some situations where only IPv4 is ever involved.
Assume you have an internal IPv4 network with, say, 192.168.x.y addresses. In addition you have a server (web server, for instance) which is also visible on the Internet. This server has two addresses: one 192.168.x.y address and one global address. The client is a NATed machine on the intranet.
Now what happens if the nameserver returns both addresses to a query for the addresses of said server? With gethostbyname the addresses are returned to the caller in the order they are received from the DNS server. Maybe some randomization is applied. In short, it is possible that the internal machine sees the public IPv4 address first and then connects to it. This is not only wasteful (the request has to be routed through a switch), it might even be dangerous (the traffic might actually have to go through the Internet).
With getaddrinfo this is not the case. The sorting according to RFC 3484 makes sure that the internal address of the server is returned first. The sorting function will notice that the source address used on the client is also an internal address and therefore the internal address of the server is a better match than the global address.
In summary, getaddrinfo is not only about IPv6. The old interfaces were simply completely inadequate and should never be used. If you still haven't converted your programs to use getaddrinfo instead of gethostbyname and gethostbyname2 do it now. Some time ago I wrote a brief intro.
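As a minimal sketch of how a program can take advantage of the sorting, the following function returns the first, i.e., most preferred, address for a host name in numeric form. The name preferred_address is invented for this example; the calls themselves are the standard getaddrinfo, getnameinfo, and freeaddrinfo.

```c
#include <netdb.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Store the numeric form of the first address getaddrinfo returns for
   host in buf. Returns 0 on success or a getaddrinfo/getnameinfo error
   code (printable with gai_strerror). */
int preferred_address(const char *host, char *buf, size_t len)
{
    struct addrinfo hints;
    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_UNSPEC;      /* IPv4 or IPv6, whatever fits */
    hints.ai_socktype = SOCK_STREAM;  /* one entry per address */

    struct addrinfo *res;
    int err = getaddrinfo(host, NULL, &hints, &res);
    if (err != 0)
        return err;

    /* The list is already sorted; the head is the best match. */
    err = getnameinfo(res->ai_addr, res->ai_addrlen,
                      buf, (socklen_t) len, NULL, 0, NI_NUMERICHOST);
    freeaddrinfo(res);
    return err;
}
```

A real client would not stop at the first entry: it would walk the list via ai_next and call socket() and connect() on each address in order until one succeeds.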
With KVM proving more and more that it is viable Xensource and VMWare start sandbagging. They call KVM immature and the wrong approach (see their quotes in the CNET article).
Calling KVM immature is, well, premature and misleading. Xen has a headstart of several years. KVM is today not supposed to be in the state Xen is. Nevertheless, KVM already has hardware virt support, SMP support, support for 64-bit host and guests (despite what the article says), live migration, and more. Xen simply started from the other direction, namely para-virt; hardware virt took them a long time and a lot of help from the hardware vendors. I think para-virt will be done RSN.
But immature is not the worst complaint. Claiming the hypervisor approach is the only viable option is what should get people worked up. Look at the arguments:
[...] but hypervisors offer better performance, have security advantages, and juggle the competing needs of multiple virtual machines better [...] In order to [deliver Virtual Infrastructure], you need the separate hypervisor layer.
These are bogus claims. And you have to realize where they come from. VMWare’s ESX is a kernel in itself, one which only few people work on (compared to something like Linux). Device drivers will always be a nightmare unless/until devices get their own PCI devices (once DMA can be virtualized). Nevertheless, ESX is a full OS by itself. Plus, ESX has the service console, a Linux OS. The service console of course has to have some control over the hypervisor.
For Xen the situation is similar. Here the hypervisor, after the mistakes of the 1.x series, doesn’t have device drivers included and uses a privileged domain, a complete OS.
This means, both Xen and VMWare do not have less code. I’d say they even have more code that is part of the privileged code base. Certainly a Linux installation hosting KVM domains can be scaled down to only have the kernel, kqemu, and the service console.
As for specific security support, Xen has in theory shype or whatever will come out of it. Like SELinux, it’s based on Flask. But it still is a separate code base. And if shype is actually moved out of the hypervisor itself and into a separate domain you have even more interfaces to worry about. I haven’t seen any security features of this caliber even mentioned for VMWare. With KVM, the SELinux policy governing the kernel can also handle the KVM module. It’s after all part of the same kernel. One implementation, one policy.
As for performance, let’s wait until KVM actually has been optimized. Ingo did some work on a para-virt network driver and the results are simply great. It’s just that performance tuning hasn’t been a focus. In theory there is absolutely no reason why the KVM approach should be any slower.
As for better scheduling with a hypervisor: that can only be a joke. Especially for Xen, the privileged domain (Dom0) has to be scheduled without the hypervisor having any insight into the Dom0 kernel. How can this be better? For VMWare we have a simple-minded OS serving as the hypervisor. The Linux kernel has support for all kinds of situations, including NUMA machines, many processor machines, HT and multi-core processors etc. And it’s an O(1) scheduler which sooner or later will make a difference even for hypervisors.
And then there are the advantages the KVM solution has. For instance, ever tried to run Xen on a laptop while on battery? It’s almost not worth it since power management does not exist. The machine will always run at full power. VMWare has the same problem. This is not only an issue for laptops. Cooling is a major issue in data centers. Maybe even a bigger issue with increasing density.
NUMA has already been mentioned, but there is also the memory allocation issue as part of the problem. Xen has nothing of the sort, and I bet VMWare has either nothing or something simple. The Linux kernel can provide KVM with all kinds of support, as the performance on big NUMA machines like SGI’s Altix shows.
In short: neither Xen nor VMWare has any real advantage which cannot be surmounted by giving KVM more time to catch up, i.e., granting it the same time to develop the features. On the other hand there are device driver issues which VMWare will never be able to master. Xen is not included in the mainstream kernel and even with the paravirt_ops interface it will be lagging because it needs synchronization.
So why do these companies (and Xensource makes this statement as a company) make such statements? The answer should not be surprising: they have a lot, or everything, to lose. KVM can be the all-in-one solution, unlike any of the others. Xensource and VMWare want to get on your system by providing a hypervisor which then can be used with all kinds of OSes. But: they are in ultimate control. The idea that there suddenly is a virtualization solution which does not need any hypervisor must be absolutely frightening to them. So, they try to suppress this new technology from the start.
Don’t believe the propaganda. Try KVM once Fedora 7 is out. I expect it to be updated over the lifetime of Fedora 7. Or for the more adventurous people, start using rawhide now and keep using it.
With the DST rule changes for the US going into effect real soon (2007-3-11) people are panicking.
Time zone changes are nothing special. I guess on average we see about 20 each year, maybe more. Do you see people panicking 20 times a year? No, only if the US is involved. Yes, you can argue that more computers are affected but aren't the computers which are in countries affected by those changes as important to the people living there? There are also banks, utilities, etc. which need to keep the correct time.
Having lived here in the US now for quite a few years I think the root of the problem is the same that keeps the US from making progress on other fronts: fear of change and trying to prevent change through denial. Another example? Take the measurement system. When knowing the metric and the imperial system equally well, who would argue the latter is better? And it's not that people don't know the metric system at all. There are large numbers of people who serve/d in the military and all these people had to use it in their job. Every food container also shows grams. But I'm getting off-topic.
Fact is, people delay things. Delaying to update their OSes and even delaying to think about the problem. It might go away on its own and then no time has been wasted. But guess what: the DST change is coming.
So, people, get your act together. Update your OSes. If you really for some obscure reason cannot do this, update the timezone data (in /usr/share/zoneinfo). The data we have today is fully compatible, even the extended file format. Since there is no glibc update coming you could just overwrite the files without fear of reverting the changes inadvertently later. Update applications with their own timezone data. There are lots of (especially big) programs which come with their own data. The timezone data is free to copy by everybody so companies take advantage of this. Now you know why I always advised against this. More likely than not, updates for old versions of these programs are not available anymore. Make sure you let the companies who produce this kind of shitty product know what you think of them: duplicating timezone data is always bad. As for JVM: have fun! Old versions will get no updates and new versions often don't run on old OS versions. At least libgcj should now be fixed for good due to the work of one of my colleagues.
Once the timezone data is updated some more steps might be needed. Many programs will just work. glibc detects updates to the /etc/localtime file and reloads the data. Lots of people complained about this in the past and present since it means time operations cause filesystem operations, but it is critical in some situations. If a process only uses the time functions which do not implicitly call tzset() it must be restarted. The same is true for processes which have the TZ environment variable set. In general you cannot know whether a process falls into any of the latter categories. The safest thing to do is to reboot the machine.
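To illustrate the distinction between time functions which implicitly call tzset() and those which do not, here is a minimal sketch. POSIX does not require localtime_r() to behave as if tzset() had been called, so a long-running process using only that function might keep stale rules. The helper name format_local_time is made up for this example, and the sketch assumes, as described above, that the C library re-checks the timezone data when tzset() is called.
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical helper: format T as local time into BUF.
   Returns the number of characters written, or 0 on failure.  */
size_t format_local_time(time_t t, char *buf, size_t len)
{
  struct tm tm;

  assert(buf != NULL && len > 0);

  /* localtime_r() is not required to call tzset() itself, so do it
     explicitly to give the library a chance to reload the rules.  */
  tzset();

  if (localtime_r(&t, &tm) == NULL)
    return 0;

  return strftime(buf, len, "%Y-%m-%d %H:%M:%S %Z", &tm);
}
A caller would pass time(NULL) and a buffer of, say, 64 bytes. This does not help processes which have TZ set in their environment; those keep following whatever TZ names.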
As a continuation of a previous post, here's another thing I frequently stumble across:
#include <stdio.h>
#include <string.h>

int main(void)
{
  const char s[] = "hello";
  strcpy (s, "bye");
  puts (s);
  return 0;
}
Yes, this code will produce a warning. But it will run. Slowly, since it does not do what the programmer actually meant. s is a dynamic variable. The compiler has to allocate space on the stack (or in TLS) and then copy the string from some static, read-only area into it. This of course is not only slow and uses memory, it also means that the newly created string is NOT read-only. The compiler-generated function prologue has to write to the memory.
Whenever you write code where you define an array in the scope of a function, always stop and think what the semantics should be. For all constant arrays it is almost always correct to have exactly one copy (it cannot be changed). If one copy is needed or OK then don't forget the static:
static const char s[] = "hello";
If you do this, all of a sudden the code will not only produce a warning, it will also crash at runtime since the string is stored in read-only memory and s is not really a variable anymore; it's a label for the region in read-only memory.
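A small sketch of the recommended form, for the case where one shared copy is what is wanted. The function name greeting is invented for this illustration. Because the array is static, no copy is made on function entry and every call hands out the same address.
#include <assert.h>
#include <stddef.h>

/* Hypothetical example function: returns a pointer to the single,
   read-only copy of the string.  */
const char *greeting(void)
{
  static const char s[] = "hello";

  /* s is still an array, so its size is known at compile time:
     five characters plus the terminating NUL.  */
  static_assert(sizeof s == 6, "unexpected array size");

  return s;
}
Returning a pointer to s is only valid because of the static. Without it the array would live in the function's stack frame and the pointer would dangle after the return.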
I'm sick and tired of hearing people saying
I don't have to secure my machine since I have nothing of interest on it. Nobody would want to steal anything I have.
That's absolutely not the point. Yes, some attackers are after personal data like account numbers. But this is not all:
Security always matters even if the data stored on the machine is benign. Nobody should be allowed to even run machines which have no distinction between user and administrator. This includes more and more Linux people because new idiot distributions like Linspire, NimbleX, etc. pop up. No machine should be without firewalls, in both directions. For RHEL/Fedora users it of course doesn't stop there, we have many more security features and if it were up to me I would take out the switch to disable them.
Next time when you see somebody writing nonsense like the above (or hear them talking like this) do me a favor: smack them a bit so that they come to their senses. These are the people who create the opportunity for spam, phishing, and other illicit activities. Heck, they deserve more than a bit of smacking...
lifo-pop (lf: pointer to lifo): pointer to cell
C1:  loop
C2:    head = lf->top                    # get the top cell of the lifo
C3:    if head == NULL
C4:      return NULL                     # LIFO is empty
C5:    endif
C6:    next = head->next                 # get the next cell of cell
C7:    if CAS (&lf->top, head, next)     # try to set the top of the lifo to the next cell
C8:      break
C9:    endif
C10: endloop
C11: return head
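The pseudocode above could be written with C11 atomics roughly as follows. This is a sketch, not the original implementation: the types struct cell and struct lifo, the value member, and the lifo_push helper (needed only to have something to pop) are assumptions made for this example. Like the pseudocode, it reads head->next before the CAS and is therefore exposed to the ABA problem if cells can be freed and reused while another thread is in the loop.
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct cell {
  struct cell *next;
  int value;                    /* payload, made up for the example */
};

struct lifo {
  _Atomic(struct cell *) top;
};

/* Not part of the pseudocode; only here so the pop can be exercised.  */
void lifo_push(struct lifo *lf, struct cell *c)
{
  assert(lf != NULL && c != NULL);
  struct cell *head = atomic_load(&lf->top);
  do
    c->next = head;
  while (!atomic_compare_exchange_weak(&lf->top, &head, c));
}

struct cell *lifo_pop(struct lifo *lf)
{
  assert(lf != NULL);
  struct cell *head = atomic_load(&lf->top);      /* C2 */
  while (head != NULL) {                          /* C3 */
    struct cell *next = head->next;               /* C6 */
    /* C7: on failure the current top is stored back into head,
       which replaces the reload at C2 for the next iteration.  */
    if (atomic_compare_exchange_weak(&lf->top, &head, next))
      return head;                                /* C8, C11 */
  }
  return NULL;                                    /* C4: LIFO is empty */
}
The weak variant of the compare-and-exchange may fail spuriously, which is harmless here because the operation sits in a retry loop anyway.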
const char *_pcre_ucp_names = "Any\0" "Arabic\0" "Armenian\0" ... "Zs";
const char _pcre_ucp_names[] = "Any\0" "Arabic\0" "Armenian\0" ... "Zs";

