Thursday, July 21, 2011

Is Open Source Headed In The Wrong Direction?

In the time since I began writing here, I have made many comments about Open Source projects and have, overall, been a fairly vocal advocate of Open Source software.  I still feel that open standards and a rich library of open source software are the only way that computer technology can continue to be innovative, particularly in light of the patent frenzy that looms over the computer science and technology community as a whole.  In my line of work I get to use both commercial (closed-source) and community (open-source) software and, in general, I find that while the commercial software tends to be more polished, the same kinds of bugs and software quality issues exist in both.  The difference, generally, is that to get the commercial software fixed you first have to convince the company there is a bug, then convince them that the bug is worth fixing in a reasonable time period, and if you get that far, you may actually see the fruits of your labor in a software update -- though rarely for free, since you generally end up having to buy a new version or a support contract.  Traditionally, Open Source software had a quick turnaround on bug fixes because you could, indeed, fix the bug yourself, or someone among a large user community could usually do so.

There are some dirty little secrets, though, festering in the Open Source community that I feel compelled to reveal in the hope that we can reverse the trend.  Note that I am not an OO (Object-Oriented) type of person, so what you will see here references procedural software design.  However, I feel that these same issues apply (maybe even more so) to OO classes as they do to subroutine libraries.  That said...

Issue 1:  Poorly Documented Code

If you have a small utility program that does a specific task and is not generally embedded in some other program, then documenting your code is not critical (nice, but not critical).  When I refer to code documentation, I mean subroutine/class libraries or major OS subsystems that are meant to interface with (or act as an interface for) other software.  What I am finding more and more is that automated utilities are being used to interpret meta-tags in comments and produce documentation, but this really isn't documentation.  These systems produce loads and loads of HTML pages and subroutine descriptions, but don't really show how the subroutine or subsystem is meant to be used.

The example I like to cite is the Linux D-Bus system, an IPC (inter-process communication) system meant to allow various software subsystems to talk to each other, replacing the many existing IPC methods that are both incompatible and non-interoperable.  Search Google for "dbus documentation" and you may end up at several places.  The de-facto place to start is the freedesktop.org site, where the D-Bus specifications are located.  While the introduction to the specification says that D-Bus is "easy to use," I find the entire description so incredibly complex and obtuse that I have yet to understand how all the pieces fit together, what is going on inside, and most importantly, how and why I would use this rather than trying to roll my own (in actuality, I know why I wouldn't want to "roll my own" -- it's because I wouldn't want to write yet another non-interoperable IPC system!).  By the time you've read the third or fourth dot-separated interface definition example, your head begins to spin.  By the time I'm done reading, I don't really have a clue how to write a program that uses a D-Bus interface, nor do I have a good idea what my responsibilities are in using the system.  Okay, yes, you've given me an API in several different languages, and maybe even an example program, but I still don't know, either as a software developer or as a system administrator, what I have to do for the care and feeding of the D-Bus system.
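To make the complaint concrete, here is roughly the fact the documentation should state up front: every D-Bus method call is addressed by three coordinates -- a bus name, an object path, and a fully qualified interface.member.  A minimal sketch of building a call with the reference implementation's dbus-send tool (the org.freedesktop.DBus names are real; the helper function is my own invention):

```python
def dbus_send_argv(dest, object_path, method, bus="--system"):
    """Build the argv for a dbus-send method call.

    A D-Bus method call needs three coordinates:
      - a bus name (dest),          e.g. org.freedesktop.DBus
      - an object path,             e.g. /org/freedesktop/DBus
      - an interface.member name,   e.g. org.freedesktop.DBus.ListNames
    """
    return ["dbus-send", bus, "--print-reply",
            "--dest=" + dest, object_path, method]

argv = dbus_send_argv("org.freedesktop.DBus",
                      "/org/freedesktop/DBus",
                      "org.freedesktop.DBus.ListNames")
# On a machine with a running system bus you could execute this,
# e.g. subprocess.run(argv), to list every name on the bus.
print(" ".join(argv))
```

Three lines of orientation like that would have saved me hours with the specification.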

I call this poor documentation not because the people involved didn't attempt to document the code, because they clearly did.  In fact, they wrote a lot.  What's bad is that it is not helpful.  It goes through lots and lots of history about the inner workings, but it is not organized so that a system administrator can read one section and understand their responsibilities with respect to D-Bus, an application developer can read another section and understand the interfaces they need to know, and someone who wants to hack on the internals of D-Bus can find a deep investigation of how it really works (complete with some diagrams, because a picture really is worth a thousand words).  Instead, all of these are probably documented somehow, but they are so jumbled together that it is impossible for these three groups of users to truly understand the system as it applies to their role.

I am not picking on D-Bus alone.  The same is true of many projects -- particularly in the Linux world, because of the rapid development taking place -- and D-Bus is not the worst offender, either.  There are other libraries and OS interfaces with little to no documentation at all, so your best bet is to grab the source code and try to wrap your brain around what the authors were thinking when they wrote it.  That makes for a very elitist group, and seriously limits who can participate in development.

I remember using the VAX/VMS and TOPS-10 operating systems, which had excellent documentation on the OS libraries and system services (TOPS-10 had the Monitor Calls Manual).  There, you knew what the library call did, when you would want to use it, how to use it, and what data structures needed to be defined.  I think I have just dated myself...

PS:  OO programs and classes are not self-documenting.

Issue 2:  Unnecessary Complexity

I always laugh when I talk about SNMP -- the Simple Network Management Protocol -- because it is so far from simple as to nearly be an oxymoron.  I've been working with network equipment for years, and even networking giants like Cisco can't implement SNMP correctly in their products.  In fact, I have never seen SNMP implemented entirely correctly anywhere.  The reason is that while the protocol itself may be "simple" (though ASN.1 may be simple in theory, it is not simple to implement), the interfaces are so complex that nobody really implements them properly.  Anyone who has downloaded a manufacturer's MIBs and run them through the ever-friendly (said with heavy sarcasm) Net-SNMP (originally developed at Carnegie Mellon University) MIB parser will notice that they end up with hundreds of error messages.

After looking at SNMP for a while you start to ask yourself, "Can't this be done any simpler?"

In defense of SNMP, I actually wonder if it can.  I certainly haven't come up with anything better, but then I haven't tried much, either.  In any case, organizationally speaking, it is hard to figure out how to manage network equipment with SNMP even after looking at the MIBs the device implements.  You typically end up having to join one OID (object identifier) in one "table" against an OID in another table to get the information you want.  Cisco makes things even worse by requiring that you get information for different VLANs by using the community string in an undocumented way (for those interested, it's {community}@{VLAN}).  Sure, this is all documented in the MIBs, but trying to navigate them is a chore.  Cisco at least has a web-based tool for navigating their OIDs, but given Cisco's slow web site, it isn't pleasant to use, either.
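The kind of table join I mean can be sketched with ordinary dictionaries.  The interface names and counter values below are invented for illustration; in practice each "table" would be populated by walking an OID subtree (e.g. IF-MIB's ifDescr and ifInOctets columns) and the rows correlate only through their shared ifIndex suffix:

```python
# Toy model of joining two SNMP "tables" on their common index.
if_descr = {        # ifDescr: ifIndex -> interface name
    1: "lo0",
    2: "eth0",
    3: "eth1",
}
if_in_octets = {    # ifInOctets: ifIndex -> received byte counter
    1: 48200,
    2: 9731055,
    3: 0,
}

# The "join": neither table is useful alone; you must correlate
# rows from both on the shared index to answer a simple question
# like "how many bytes has eth0 received?"
report = {name: if_in_octets[idx] for idx, name in if_descr.items()}
print(report)   # {'lo0': 48200, 'eth0': 9731055, 'eth1': 0}
```

Two dictionary walks to answer one question is the *easy* case; real MIBs often chain three or four such joins.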

Some software simply grows organically in such a way that its interfaces or functionality become so complex that it would be better to re-think the design than to pile more and more functionality onto the already complex framework.  I point to two other examples of software that has grown in this way:  the newer syslog daemons for Linux, and the latest Linux OS startup manager, "systemd."  Both of these simple functions should really have (or retain) a simple interface, but are becoming more and more complex as time goes on.

One other casualty of excessive software complexity is that it becomes so difficult to use and/or configure properly that latent security holes form that can eventually be exploited.  While I have come to like sendmail, it is an early example of a software system that suffered from this kind of issue.

UNIX was originally designed around simple software tools that served as building blocks, meant to be coupled together to form more complex systems.  While the problems we're now solving are more complex than those original tools were designed to handle, it seems that we've lost the basic principles that made UNIX a desirable operating system to use.
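A sketch of that original principle, assuming ordinary POSIX `sort` and `uniq` are on the PATH: two tiny tools, each doing one job, coupled through a pipe to do something neither does alone (count distinct lines):

```python
import subprocess

# Couple two small tools the UNIX way: `sort` feeds `uniq -c`.
text = "pear\napple\npear\napple\napple\n"

sort = subprocess.Popen(["sort"], stdin=subprocess.PIPE,
                        stdout=subprocess.PIPE, text=True)
uniq = subprocess.Popen(["uniq", "-c"], stdin=sort.stdout,
                        stdout=subprocess.PIPE, text=True)
sort.stdout.close()          # let uniq see EOF when sort exits
sort.stdin.write(text)
sort.stdin.close()
out, _ = uniq.communicate()
print(out)                   # counts of "apple" and "pear"
```

The exact column spacing of `uniq -c` varies between implementations, but the result is always three apples and two pears -- composed, not programmed.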

Issue 3:  Bloat / Scalability

I almost feel that the issues of bloat and scalability should be handled independently, but upon further thought I think they are so closely related that I am going to keep them together.

When I talk about bloat, I am talking about a system that has become so large that one should question whether it ought to be broken into smaller, more manageable pieces.  Most often, bloat occurs because the software (like what was described under excessive complexity) grows organically and eventually starts to do things it wasn't originally meant to do.  At other times, it simply grows too big because it is trying to do everything all at once.  I have a few examples of this that I am going to pass on, because my writing is also becoming kind of bloated as a result.

A cousin of bloat is scalability.  Scalability problems generally arise when someone writes code to solve a small problem, then someone else sees it, thinks it's a great idea, and uses it to solve a much bigger problem.  In Open Source software, the biggest scalability problems I see can be categorically called memory abuses.  In so many software systems I see code that will casually read an entire configuration file, parse it, and keep it all in memory.  This is acceptable when the configuration file doesn't grow too big, and most don't.  However, some configuration files hold data that is probably better suited to a database of some kind than to being read wholesale into memory.  The Asterisk Open Source PBX contains many classic examples of such an abuse.  In addition to the application tying up loads of memory holding copies of these configuration files, this approach prevents other applications from modifying the data from outside in a way that would let multiple applications work together.
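The difference can be sketched with an invented key=value format (Asterisk's real configuration syntax is richer than this): one approach parses everything up front and keeps it resident, the other streams the file and keeps only the value it was asked for.

```python
import io

# Invented key=value config data, standing in for a large config file.
config_text = "exten_100=alice\nexten_101=bob\nexten_102=carol\n"

# Approach 1 (common, memory-hungry): parse everything up front.
def load_all(fp):
    return dict(line.rstrip("\n").split("=", 1) for line in fp)

# Approach 2 (scales better): stream the file, keep only what we need.
def lookup(fp, key):
    for line in fp:
        k, _, v = line.rstrip("\n").partition("=")
        if k == key:
            return v
    return None

table = load_all(io.StringIO(config_text))           # whole file resident
one = lookup(io.StringIO(config_text), "exten_101")  # constant memory
print(table["exten_100"], one)   # alice bob
```

At three entries either approach is fine; at a hundred thousand extensions the first approach is a memory abuse, and the resident copy also goes stale the moment another process edits the file.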

Another example of a hidden and subtle memory abuse and scalability problem is OpenWrt's UCI (Unified Configuration Interface).  UCI keeps a common configuration file format across multiple files in a single directory tree (/etc/config, in their case).  Applications wishing to use UCI use a library that effectively takes all the files in the directory, parses them, and converts them into a tree of dot-separated key/value pairs.  The configuration language itself presents a scalability issue because its syntax is limited, yet it is expected to be used for virtually all OS configuration tasks.  The bigger scalability issue is that the shell-script API for UCI reads the entire configuration tree into shell variables.  So even if a shell script uses only a small portion of the UCI-based configuration, it must read, parse, and store all of the configuration files in memory.  In fact, every time the UCI library is used, the configuration files are effectively re-parsed, because the application never knows which configuration file it may need, and the UCI system doesn't know whether one of the configuration files was changed by another process.  As I came to understand more and more of what UCI was doing and how it worked, I asked myself, "Why in the world didn't they just use SQLite?"  Now, granted, there are some advantages to UCI, an important one being the ability to maintain temporary state by grouping temporary directories with the normal configuration directories.  Yes, I get that.  However, SQLite gives you the flexibility and scalability of a SQL-based database coupled with a footprint small enough to work well in embedded devices; it was designed with exactly this in mind.  State could be maintained in a temporary table, for example.  OpenWrt has some fine conceptual design features but lacks sufficient scalability in many areas.
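For what it's worth, here is a minimal sketch of what a UCI-style store could look like on SQLite, using Python's bundled sqlite3 module.  The table layout and the dot-separated keys are my own invention for illustration, not anything OpenWrt ships:

```python
import sqlite3

# An embedded database: no server process, a single file (or in-memory).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE config (key TEXT PRIMARY KEY, value TEXT)")
db.executemany("INSERT INTO config VALUES (?, ?)", [
    ("network.lan.ipaddr",      "192.168.1.1"),
    ("network.lan.netmask",     "255.255.255.0"),
    ("wireless.radio0.channel", "11"),
])
db.commit()

# Fetch one value: only this row is read, not every config file,
# and concurrent writers are handled by the database, not by hope.
(ip,) = db.execute("SELECT value FROM config WHERE key = ?",
                   ("network.lan.ipaddr",)).fetchone()
print(ip)   # 192.168.1.1
```

Temporary state could live in a `CREATE TEMPORARY TABLE` alongside the persistent one, which is roughly the role UCI's temporary directories play today.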

When Open Source software systems become more bloated and less scalable, it forces people to ask themselves, "Why don't I just cave and use Windows?  Or MacOS?"  Which leads me to the final issue...

Issue 4:  Bugs & Egos

Bugs are a fact of any software system, particularly one that grows larger and more complex.  It is in how bugs are addressed that Open Source software projects are becoming more and more troublesome.  While it is true that having the source code means you can fix a bug yourself, where you can't in a commercial (closed-source) model, actually fixing the bug requires a particular level of expertise.  When you're dealing with issues 1 through 3 outlined above, that necessary expertise becomes less and less available, even to experienced software developers.  In addition, if you find the bug and are skilled enough to fix it, you'll ultimately want your hard work incorporated "upstream" into the next release of the software.  If you can't fix it yourself, then you need to report the bug upstream as well.

Now, having worked on several software projects in my day, I realize there are frivolous and even incorrect bug reports and patch suggestions.  However, the developers on many Open Source projects have developed such an inflated sense of their ability to produce good, bug-free code that they frequently place demands on people reporting bugs stringent enough to discourage participation.  Without naming specific projects, I have submitted bugs with a high enough level of detail that the core development group should have been able to reproduce the bug without much additional information.  Yet they required so much additional detail that it frequently took me longer to report the bug than it did to fix it (when I could).  Seriously, many projects simply refuse to acknowledge your bug report unless you provide massive amounts of debugging output and precise steps to reproduce it -- but some bugs cannot be easily reproduced outside a specific environment that can't be trimmed down to a small sample case.  I have also had projects reject my bug report simply by refusing to acknowledge the behavior as a bug, or by refusing to address the issue, generally asserting that the problem won't occur in a typical environment.  This is more a problem of ego than of failing software, and it is becoming more common as more and more people use Open Source software.  I genuinely value people's time and understand that these projects are run by volunteers, but if a project is to be taken seriously, it can't be so inaccessible that only a few elites can be trusted to address problems in the software.

The other ego issue is code that is so badly written that, while it works, it is hardly maintainable or expandable.  Most software projects won't allow such code to be rewritten, or if they do, not without a large amount of supporting evidence.  In many cases, what constitutes badly-written code is in the eye of the beholder, but I've seen some utter crap in my travels that makes you wonder how any computerized device in existence today works at all.  Again, without naming the project, a well-known Open Source developer wrote some C code with two arrays.  That developer depended on the C compiler allocating space for the two arrays adjacent to each other, and proceeded to access part of the first array by using a negative index into the second array (editor's note: I believe it was actually worse than this -- that this was being used to compensate for the case where the index became negative and the programmer was too lazy to handle that case).  Not only is this blatantly bad coding practice, it was sufficiently non-portable that the code simply failed on a different operating system.  Happily, this specific issue has since been fixed...but what possesses a person to write code like this and assert it is correct?

Commercial software companies employ FUD (fear, uncertainty, and doubt) about Open Source software to convince people that using it is risky.  If, as Open Source advocates and project developers, we don't address the issues noted here, we are likely to play into that FUD.  Larger Open Source projects remain viable because businesses use the software and support its development by allowing employees to work on it.  Further, without enough continued interest and active participation, there will not be enough development and support to keep the various projects going.  While I am, and will likely continue to be, a strong Open Source advocate, I am beginning to see these issues as an unraveling of some of what I admire about Open Source.  I understand that my criticisms here are likely to gnaw at some people, but I hope they will also provoke some thought that ultimately leads to better software.

Finally, I want to emphasize, again, that the specific projects I mention here have many positive points, even though I have not commented on them.  I mentioned them because I was interested enough in each project to want to learn more or to participate.  If you're reading this and are part of one of these projects, understand that what I am saying here is meant to make the project better...not to trash it.
