
Better Software — No Matter What - Part 3


Formal Metadata

Title
Better Software — No Matter What - Part 3
Series Title
Number of Parts
150
Author
License
CC Attribution - NonCommercial - ShareAlike 3.0 Unported:
You may use, adapt, and copy, distribute, and make the work or content publicly accessible for any legal and non-commercial purpose, in unchanged or changed form, provided that you credit the author/rights holder in the manner specified by them and distribute the work or this content, including in changed form, only under the terms of this license.
Identifiers
Publisher
Publication Year
Language

Content Metadata

Subject Area
Genre
Abstract
Some development practices improve software quality, regardless of the domain of the application, the language in which it's written, the platform on which it runs, or the users it is intended to serve. This seminar explores fundamental principles, practices, and standards that improve software quality, no matter what the software does, how it does it, or whom it does it for. Unlike most treatments of software quality, this seminar focuses on the critical role that programmers play, and it discusses specific strategies applicable to their activities.
Transcript: English (auto-generated)
If one of the people sitting on the unbelievably comfortable concrete would prefer to move to a seat right there, which is probably less comfortable because it's got all that cushioning and it sits up higher, there's enough room for one person to do that.
Okay, since this is the only all-day session running at the conference, I will mention again, this is part three of an all-day session. You should be able to follow along even if you haven't been present
for the first two sessions. We are talking about ways to improve interfaces at this point. In particular, we're talking about what I consider the most important design guideline, which is how to make interfaces easy to use correctly and hard to use incorrectly. And before the break, we were talking about the importance of consistency.
So I gave an example of inconsistency between what the software says and what the hardware says as to where you're supposed to deposit things. Then we talked about how Java had three different ways to find out how many elements were in a container. Microsoft optimized that down to only two different ways to figure out how many elements were in a container. And then I explained that inconsistency sometimes has unanticipated consequences, such that even people who thought, well, I'm always gonna be using an integrated development environment, were surprised to find out that it complicates reflection-based code. But it's not entirely fair that I'm picking on things like Java and C#,
because really, there's so many other things to pick on. Let's go back to C, back to the beginning. It doesn't get much older than that. Now, in the C standard library, you've got some members of the standard library where the first parameter is a file pointer. Nice and consistent.
First parameter is the file pointer. That's great, if that were as big as the library got. Because the problem is that in other parts of the library, the file pointer is the last parameter. If you talk to experienced C programmers, people who have been programming in C for 20, 30 years, they will tell you they still have to look it up every single time to find out what's going on.
It has been remarked that this inconsistency has frustrated millions of developers for more than 30 years. If you think about it, it seems like such a little thing: what is the order in which we put the parameters? If you are so lucky as to design an interface
that is still being used 40 years from now, do you really want to be remembered for something like this? This is not what you should be aspiring to. At the time it probably didn't seem like such a big deal, but when things are successful, and we all would like our things to be successful, they should put their best foot forward, and this is inconsistent.
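To make the C example concrete, here are two real functions from the C standard library; the talk doesn't name specific ones, so this pair is an illustrative choice:

```cpp
#include <cstdio>

int main() {
    std::FILE* f = std::fopen("out.txt", "w");
    if (!f) return 1;
    std::fprintf(f, "hello\n");  // FILE* is the first parameter here...
    std::fputs("hello\n", f);    // ...and the last parameter here
    std::fclose(f);
    return 0;
}
```

Same library, same FILE*, opposite positions.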
This is from the C++98 standard library. If you want to eliminate all the elements in a particular container with a given value: if your container is a set, you call erase. If your container is a multi-set, you call erase. If your container is a map, you call erase. If your container is a multi-map, you call erase. Yes. And if your container is a list? You call remove.
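In code (modern initializer syntax, but the naming is exactly the C++98 one):

```cpp
#include <list>
#include <set>

int main() {
    std::set<int>      s = {1, 2, 3};
    std::multiset<int> m = {1, 2, 2, 3};
    std::list<int>     l = {1, 2, 2, 3};

    s.erase(2);   // sets: erase
    m.erase(2);   // multisets: erase
    l.remove(2);  // lists: the same idea, but it's called remove
    return 0;
}
```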
Now, if we look at the C++98 standard library, we find a different kind of inconsistency that nevertheless is problematic. In the C++98 standard library, there is a function called sort.
And sort will run in n log n time or your code will not compile. The underlying principle here is if you ask us to do something and we can't do it efficiently, we're not going to compile it. All right, that's a reasonable principle.
However, there is another function in the standard library called binary search. Binary search will run in log n time, which is what you would expect from a binary search, if it can. And if it can't, it actually runs in linear time, strangely enough. And you should see the way that they specify that
to avoid having to look like they're complete liars. Now, this philosophy says we will do something no matter what it takes, regardless of how slow it is. That's also a legitimate principle. The problem is when you have both of these principles in the same library at the same time,
you end up in a situation where programmers don't know if they call a particular standard library routine, whether it's going to compile, and if it does compile, whether it's going to be fast. That doesn't help anybody. Again, this is not a syntactic thing. This is more of a principle that was not applied consistently.
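A sketch of the two philosophies side by side (modern container syntax, but the behavior described is C++98's):

```cpp
#include <algorithm>
#include <list>
#include <vector>

int main() {
    std::vector<int> v = {3, 1, 2};
    std::list<int>   l = {3, 1, 2};

    std::sort(v.begin(), v.end());    // fine: random-access iterators
    // std::sort(l.begin(), l.end()); // won't compile: list iterators can't
                                      // support the promised n log n sort
    l.sort();                         // list provides its own member sort

    // binary_search compiles for both, but with list iterators the
    // traversal is linear even though the comparison count is logarithmic:
    bool found = std::binary_search(l.begin(), l.end(), 2);
    return found ? 0 : 1;
}
```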
So the issue of consistency arises in a lot of different forms. So something else from the C++98 standard library. There is a function called sort. It is not guaranteed to be stable. For what it's worth, sorts can be either stable or not stable. The difference is not important.
You just need to know that there are two possibilities, stable or not stable. So sort is not guaranteed to be stable. If you need stability in your sorting, no problem. There is a different function. It is called stable sort. Stable sort is guaranteed to be stable. That seems pretty reasonable. And then there is a member function
called sort in the list class. It's called sort, and it's guaranteed to be stable. So this sort, not guaranteed to be stable. This sort is guaranteed to be stable. Again, it's not that difficult to choose an appropriate name. Remember, I told you that choosing good names is really important.
This is a case where one could easily imagine somebody familiar with calling stable sort assuming that this version of sort is not stable, simply because things are inconsistent.
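The three calls next to each other, as a minimal sketch:

```cpp
#include <algorithm>
#include <list>
#include <vector>

int main() {
    std::vector<int> v = {3, 1, 2};
    std::list<int>   l = {3, 1, 2};

    std::sort(v.begin(), v.end());         // not guaranteed to be stable
    std::stable_sort(v.begin(), v.end());  // guaranteed stable, and says so
    l.sort();                              // guaranteed stable, but just "sort"
    return 0;
}
```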
Another technique that you can use to make your interfaces easy to use correctly and hard to use incorrectly, and remember, the higher-level design guideline we're talking about is making interfaces easy to use correctly and hard to use incorrectly, and what we're covering are different ways to go about doing that. So we've talked about consistency, and now I'm talking about progressive disclosure. Fundamentally, progressive disclosure
is about presenting options to people in a way that avoids overwhelming them with choices. The more choices you give people, the higher the likelihood they are going to accidentally choose the choice that is inappropriate. So you don't want to overwhelm people with choices. Fundamentally, what you want to be able to do is distinguish normal options
from expert and advanced options. Statistically speaking, most people aren't experts. They don't want to do the most advanced stuff in the world. There are many examples of how this is done correctly. So as an example, this happens to be from Firefox. So here in Firefox, if I'm under the content tab, then these are the choices I have available to me.
But it turns out there are additional choices; if I want to get at them, I click on Advanced, and then I can come over and I have some more options. But what that means is that users are not shown all of these options simultaneously. They are encouraged to limit themselves to the basic options, and they only get the advanced options if they expressly ask for them.
Assuming you have partitioned things appropriately, people should be less likely to get into trouble by fiddling with the advanced stuff when they should be fiddling with the basic stuff. So that's progressive disclosure. Now, it is important to recognize that simply partitioning things does not correspond to progressive disclosure.
So for example, this is from a program called Super. Now, it has these lovely laid out areas, so things are nicely divided, but there's no progressive disclosure here. Every option is sitting right in front of you. And similarly, this happens to be iTunes, but there's not really progressive disclosure here either. It's divided into various tabs.
That's categorization. But on every tab, every option you have is available to you. Categorization is great. I'm not opposed to categorization. It's just important to recognize that categorization is not progressive disclosure. Progressive disclosure is not based on dividing things into equal categories. It is designed around the idea that some things are more likely to need to be addressed than others, and the things that are more likely to be accessed by users are the ones that should be presented first. This is also applicable to class and library design.
What you could do, if you have an API, if you have some interface, is imagine breaking it into the functions people are more likely to want to call and the functions people are less likely to want to call. Now, Ken Arnold had an article called Programmers Are People Too. This was published in 2005. And his central thesis in that article was that we go to a lot of work in user interface design to encourage people to make the right choices and stay away from the wrong choices, and yet we give developers these giant APIs where there's a whole bunch of methods or a whole bunch of functions all at the same level, and we basically say, here are some functions, use the right ones, even though some functions are much more likely than others to be useful. And he gives this example: in Java Swing's JButton class, there are over 100 methods. But it turns out that typically people only want a very small minority of those 100 methods. So essentially, giving people who use the button class 100 different methods just encourages them to get in trouble. It doesn't distinguish the methods you probably want to call from the methods you don't want to call. And he offers a design: retain a few commonly used methods in the JButton interface, so the interface immediately shrinks much, much smaller. Those are the things most people are gonna want to use. And then he says, take, for example, the button-tweaking functionality, the methods that tweak exactly the way the button looks, and put that into an object that is accessible by, for example, a JButton getExpertKnobs method. What this would mean is that if you wanted to use this special functionality, you couldn't call the method directly. You'd have to make a call to get an intermediate object, which itself would offer those methods. So you'd have to take an additional step to get at the more complicated methods. Similarly, for integration functionality, which has to do with integrating JButton with other parts of the system, offer a getIntegrationHooks method and remove that functionality from JButton. So everything is still in the interface. Users can do just as much as they could do before. The difference is that when they look at the interface, they say, oh, here's the small set of methods I probably want to call, and here are two additional objects I can access if I need the advanced functionality. It naturally encourages users to focus on the methods they probably want to use, and not be distracted by all those other methods
that you probably don't want to use. And the result of this would be an interface that is easier to use correctly and harder to use incorrectly, without losing any functionality.
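Arnold's example is Java, but here's a hedged C++ sketch of the shape of his proposal, with invented names and drastically simplified method bodies:

```cpp
// Invented names, C++ instead of Arnold's Java, and toy bodies throughout.
class ButtonExpertKnobs {
public:
    void setPressedIcon(int iconId) { pressedIcon = iconId; }  // rarely needed
private:
    int pressedIcon = 0;
};

class Button {
public:
    // The handful of methods almost everyone needs, at the top level:
    void setText(const char* t) { text = t; }
    void onClick(void (*handler)()) { click = handler; }

    // The expert stuff still exists, but it's one deliberate step away:
    ButtonExpertKnobs& getExpertKnobs() { return knobs; }

private:
    const char* text = "";
    void (*click)() = nullptr;
    ButtonExpertKnobs knobs;
};
```

A caller who really needs the tweaks writes button.getExpertKnobs().setPressedIcon(42); everyone else never sees them.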
The next thing you can do to make interfaces easy to use correctly and hard to use incorrectly is to prevent resource leaks. Any time you tell people, do this, and later, do that to get rid of the resource, there are two possible problems. Any interface that looks like this, here's a resource, and you say,
okay, I'm gonna get the resource, and later on, I have to release the resource. Any interface, and it doesn't matter what the resource is, the resource can be memory, the resource can be a file handle, the resource can be a mutex, the resource can be a font handle, any resource where you ultimately have to release the resource later. If you have this kind of an interface,
you have two problems. Number one, whoever gets the resource can fail to release it; they call release zero times. That's normally called a resource leak. Problem number two, they can't count, which means they release it more than once. If they make more than one release call, you might get a runtime exception, you might get undefined behavior. Anybody who's ever had the problem, for example, of dealing with mutexes: if you acquire a mutex and you never release it, kind of bad. If you acquire a mutex and you release it more than once, equally bad on many platforms.
So you would like to avoid those kinds of problems. Any interface that has this characteristic that says once you've done this, later you have to do that, immediately has a problem. Now, one way to resolve this problem is whenever you can, when somebody wants a resource, what you don't do is you don't give them the resource.
You actually return to them a resource manager. And the resource manager object automatically manages the resource's lifetime, such that the person using the resource simply doesn't have to worry about it. The simplest way to implement this under the hood is usually to base the timing of resource release on reference counting. And fundamentally, under the hood,
you're counting how many references refer to the resource. When there's no more references to it, you can automatically release the resource. It's a little bit different from garbage collection because garbage collection doesn't release things deterministically. For example, you would like to release a mutex as soon as you possibly can, not at some point after as soon as you possibly can.
This is a common thing in C++, although arguably that's because we don't have garbage collection. It's based on automatic deterministic finalization. In other words, we know exactly when objects will be either destroyed or finalized, and that that's specified by the language semantics.
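In C++ this idea of automatic, deterministic release can be packaged as a resource-managing object; a minimal sketch using a reference-counted pointer with a custom deleter (my example, not the talk's slide):

```cpp
#include <cstdio>
#include <memory>

// Hand back a manager object instead of a raw FILE*. When the last
// copy of the shared_ptr goes away, the deleter runs and the file is
// closed, deterministically and exactly once.
std::shared_ptr<std::FILE> openFile(const char* name, const char* mode) {
    return std::shared_ptr<std::FILE>(std::fopen(name, mode),
                                      [](std::FILE* f) { if (f) std::fclose(f); });
}

int main() {
    auto f = openFile("log.txt", "w");
    if (f) std::fprintf(f.get(), "nobody has to remember fclose\n");
    return 0;
}  // the file is closed here, automatically
```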
Java and C#, for example, don't have this feature, although C# has a using statement which approximates it. Unfortunately, callers have to remember to use it. So if callers forget to write using, then the automatic mechanism for making sure that things get released doesn't kick in. At the same time, reference counting schemes have trouble with cyclic structures,
so reference counting is not the solution to every problem either. But that's a separate issue. The fundamental idea is get away from interfaces that require that people release resources they acquire, and whenever you can come up with a way to do it, replace them with interfaces where the resource release is automated
so that clients don't need to worry about it. How you implement it often will be reference counting, but that's not the only way to do it. Setting aside the details of all of this, I told you earlier that it is not uncommon for systems to be fairly complicated,
and you have to move the complexity around somewhere so that somebody has to deal with it. What you want to do is minimize the number of places where resource management needs to occur. Hide it from as many people as possible, which is just an example of whenever it's possible, you encapsulate the tricky stuff. So if there's a way for an error to occur,
try to design that error so that it's only something which can be made by as few people as possible. Ideally, you're gonna hide it inside a class or inside a function, so only the class implementer or the function implementer has to worry about it. In terms of preventing resource leaks,
one thing you can do is C++'s idea of what is unfortunately called resource acquisition is initialization, which says that constructors acquire resources and destructors release them. The whole idea is that destructors are responsible for releasing resources. There are other ways to do it, though. For example, let us suppose what I want to do
is make it possible for someone to be able to write some data to a file. Doesn't sound very complicated. Let's write data to a file. What do I have to do? I have to open the file, write the data, I have to close the file, which means I can forget to close the file, which means I can't count and I can close the file
more than one time. So that's an error-prone design. If I open the file, I essentially have a resource. I have to release the resource. What I could do is encapsulate that in a function called writeToFile. So I could say, okay, here's a function writeToFile. That's the name of the file I want to write to, and this is the data that I want to write. So now all a client has to say is,
I want to write this data to this file. It is now up to the implementer of writeToFile to open the file, write the data, and close it. But as a result, the client doesn't have to think about opening and closing files. Or let's suppose I want to make it possible for somebody to easily acquire a lock,
do some work, and then release the lock, because I don't want to have them have to remember to release the lock when they're done, all right? I write a function called doLockedWork. This is the object that needs to be locked, and that's the function that should be called on the object once it has been locked. So as a client, I simply say, do this to that object
in a thread-safe fashion, and this function, doLockedWork, takes care of it. These kinds of interfaces are nice for avoiding usage errors. They don't necessarily replace lower-level interfaces.
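A hedged sketch of the two helpers; the transcript gives the names writeToFile and doLockedWork but not the signatures, so those are assumptions:

```cpp
#include <fstream>
#include <mutex>
#include <string>

// The client names the file and the data; opening and closing are
// encapsulated, so there is nothing for the caller to leak.
bool writeToFile(const std::string& name, const std::string& data) {
    std::ofstream out(name);        // opened here...
    out << data;
    return static_cast<bool>(out);  // ...closed automatically on return
}

// The client names the object and the work; acquiring and releasing
// the lock are encapsulated, and the lock is released even on a throw.
template<typename Object, typename Work>
void doLockedWork(std::mutex& m, Object& obj, Work work) {
    std::lock_guard<std::mutex> lock(m);
    work(obj);
}
```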
As an example, if I have a whole bunch of different things to write to the file, I don't really want to open the file, write a line, close the file, then open the file, write a line, and close the file again, over and over. That would be very inefficient, and in a multi-threaded environment it could lead to interleaving problems, so it wouldn't even be correct. Similarly, if I have several things which need to be done to a particular object
under the same lock, I don't really want to get the lock, do some work, release it, and then get the lock again, do some work, and release it again. It might be inefficient, and it might not have the right behavior. So I probably do need lower-level APIs that will give me the ability to expressly manage the resources. But what I want to do is I want to advocate
these kinds of interfaces, make them very well-known to my clients. I want to give them especially beautiful names so that they're attractive and people want to use them. And the lower-level APIs, I want to give really ugly, gross, hard-to-use names that no one wants to type to discourage people from using them.
So I give them the functionality, but I'm encouraging them to do things the easy way that is less error-prone, and I'm discouraging them from doing things the way that is more error-prone. One of the nice things about programmer discretion is that, somewhat like water, it often will seek the easiest path to where it needs to go.
You can also do things like have one high-level object that manages multiple resources. So for example, if I have some work to do, which requires opening a bunch of files, doing some work, and then closing all of the files, this means I have to remember to close all of those files. So maybe what I want to do is have a multi-file object,
where a multi-file object actually opens and closes files as a group. So now, it may be the case that clients still have to remember to expressly tell the multi-file object, okay, I'm done with you, it's time to be closed now, but instead of having to remember to close n files and release n resources,
now they only have to remember to release one resource. Fewer things for them to remember, fewer opportunities for making errors. As long as we're on the topic of resource leaks, because it is an important topic, something else you can do is augment prevention with detection.
Typically, you can't design all possible resource leaks out of a system. When you can't design them out of a system, resource leaks tend to be hard to track down, so what you can do is build auditing support into your resource-managing classes. You can therefore figure out who acquires what, so for example, maybe which thread acquires what,
or which function is making certain calls. This will then allow you later to find out who acquired what at what point and failed to release it, which allows you to detect leaks as soon as possible. Fundamentally, aggressive detection helps prevent leaks from making their way into production software.
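A toy version of the auditing idea, invented for illustration: every handle records who acquired it, and anything destroyed without being released complains:

```cpp
#include <cstdio>
#include <string>

// Toy auditing handle: remembers who acquired the resource and
// complains at destruction time if nobody ever released it.
class AuditedHandle {
public:
    explicit AuditedHandle(std::string owner) : owner(std::move(owner)) {}
    void release() { released = true; }
    ~AuditedHandle() {
        if (!released)
            std::fprintf(stderr, "LEAK: acquired by %s, never released\n",
                         owner.c_str());
    }
private:
    std::string owner;
    bool released = false;
};
```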
So that's the story on resource release. Something else you can do to make interfaces easy to use correctly and hard to use incorrectly is to document your interfaces before you implement them. This is a wonderful way to find out about interface problems.
If you find that it is unpleasant to explain how an interface works, it's gonna be really unpleasant to use. So just by describing what the interface is gonna look like and how it's gonna be used, you make it much more likely that you'll design a good interface in the first place. Some of the things we've talked about become pretty obvious. If there's surprising or underspecified behavior, you're gonna have to document it. You're gonna have to say, be aware that if blah, blah, blah. You want to be able to eliminate that kind of comment; you shouldn't have to make those kinds of comments about interfaces. Bad names, inconsistencies in names or in layouts,
opportunities to leak resources, all of these things tend to become more visible if you document the interfaces before you actually write them. This is consistent with test-driven design, which we'll talk about later on this afternoon.
One of the most important things that you can do to improve the quality of your interfaces is to introduce new types. Assuming you are working in a strongly typed language, the type system is an unbelievably powerful weapon in preventing people from making certain kinds of mistakes.
So let us consider something like this: I have a class for representing dates, so here's a date class. The month is an integer, the day is an integer, the year is an integer. Now, the first thing to notice is that because these are all integers, it is impossible for the compiler to tell the difference between a month, a day, and a year.
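The slide code isn't in the transcript, but the shape he's describing is a minimal sketch like this:

```cpp
// The error-prone interface: three parameters that all look alike.
class Date {
public:
    Date(int /*month*/, int /*day*/, int /*year*/) {}
};

int main() {
    Date d(4, 8, 2005);  // April 8th? Or the 4th of August?
    Date e(8, 4, 2005);  // The compiler happily accepts both orders.
    return 0;
}
```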
And that means it's really easy for people to pass things in the wrong order. Dates are particularly susceptible to this because in the United States, we typically do month, day, year, whereas in other parts of the world, it's day, month, year. So that's an easy kind of mistake to make. But there's a generalization of this.
Any function interface which has two parameters of the same type that are adjacent to one another means that if you swap the order for some reason, the compiler will be unable to tell. So if you have any function interfaces which take two parameters of the same type
or of compatible types that are adjacent to one another, you inherently have the problem that people could call by passing parameters accidentally in the wrong order. And you would then want to consider finding a way to eliminate those two adjacent parameters that are of the same type. Now, what we can do in this particular case, which would also solve the general problem
of having two parameters adjacent to one another, is to turn day, month, and year into classes. Now, I'm doing the minimal possible work in C++ to make this work. I've just said, okay, day is now a type, month is now a type, and year is now a type. I haven't made them full-blown classes. I haven't done any encapsulation.
I just told the compiler that they're different types. And now I can say, okay, here's my date class. I have a month object that comes first, a day object that comes second, and a year object that comes third. Now, there is no ambiguity.
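Reconstructed as a minimal sketch (the details are assumptions; the transcript describes only the idea):

```cpp
// Minimal wrapper types: no encapsulation yet, just distinct types.
struct Day   { explicit Day(int d)   : val(d) {} int val; };
struct Month { explicit Month(int m) : val(m) {} int val; };
struct Year  { explicit Year(int y)  : val(y) {} int val; };

class Date {
public:
    Date(const Month&, const Day&, const Year&) {}
};

int main() {
    // Date bad(4, 8, 2005);               // won't compile: ints aren't Months
    Date d(Month(4), Day(8), Year(2005));  // unambiguous at the call site
    return 0;
}
```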
It is now impossible to pass parameters in the wrong order, and as a bonus, the calling code is clearer. So if I say Date d(4, 8, 2005), that is not gonna compile, because the types are wrong. But if I say the month is 4, the day is 8, and the year is 2005, then that will compile. So it eliminates ambiguity,
and it makes the calling code a lot clearer as well. This approach only works if you actually create distinct types that the compiler views as being separate. Now, in C and in C++, at least, you can do typedefs.
So you can say, okay, when I say day, I mean int. When I say month, I mean int. When I say year, I mean int. Now I can say the date is a month, a day, and a year, and I can write, okay, the day is 4, the month is 8, the year is 2005, except that's wrong. I really wanted it to be that the month was 4
and the day was eight. The problem is that although the source code looks pretty, the types are all the same. This is what I call programming to make you feel better about yourself. So you don't actually improve the quality of the code. It's more readable, but it doesn't prevent mistakes.
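Here's the trap in a few lines (a sketch; the typedef names are the ones he mentions):

```cpp
// Typedefs look like new types but aren't: these are all still int.
typedef int Day;
typedef int Month;
typedef int Year;

class Date {
public:
    Date(Month, Day, Year) {}
};

int main() {
    Date d(Day(4), Month(8), Year(2005));  // arguments swapped, compiles anyway,
    return 0;                              // because every "type" here is int
}
```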
In case you want to know why I have this fixation on April 8th, that's the day we got our puppy, Darla, who's not quite so small anymore. Now, Darla's adorable. That's the most important thing in this entire seminar is that Darla is adorable.
But it's still possible to use the interface incorrectly. For example, somebody could say, okay, the day is 8 and the month is minus 4 and the year is 2005. Well, obviously the month is not minus 4. You might say, look, no one is gonna say the month is minus 4. Actually, I can think of two plausible ways you might end up with a month of minus 4. One of them is someone's hacking your system and trying to break it; it's a security issue. That's one possibility. The more interesting possibility is they didn't write minus 4. What they wrote was something like x minus y when what they should have written was x plus y, or some other simple typo in an expression.
In other words, they meant to get it right; they just made a mistake. So this doesn't solve all of your problems. So, all right, let's eliminate invalid months. There are only 12 months, so what we can do is enforce that constraint. We're gonna make month a real class.
We're gonna declare 12 immutable month objects, and we're gonna limit the creation of month objects to copies of those values. So in C++, it would look like this. Here's a month class. I have static objects for January through December. Then the constructor for Month from an integer
is declared private. This prevents people from creating new months. This prevents people from creating uninitialized months. The point is I'm making it so that people can only use one of 12 possible values. So then we initialize all these month objects
to make sure that they have the appropriate integer inside them. And the result now is better. So now, if I try to say the month is minus 4, this won't compile, because you can't call that constructor. Instead, you have to say the month is Month::April.
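Reconstructed as a sketch (the slide code isn't in the transcript, so details like the accessor are assumptions):

```cpp
class Month {
public:
    static const Month January, February, March, April, May, June, July,
                       August, September, October, November, December;
    int value() const { return val; }   // accessor name is an assumption
private:
    explicit Month(int m) : val(m) {}   // private: no Month(-4), no Month(13)
    int val;
};

const Month Month::January(1);   const Month Month::February(2);
const Month Month::March(3);     const Month Month::April(4);
const Month Month::May(5);       const Month Month::June(6);
const Month Month::July(7);      const Month Month::August(8);
const Month Month::September(9); const Month Month::October(10);
const Month Month::November(11); const Month Month::December(12);

int main() {
    // Month bad(-4);        // won't compile: the constructor is private
    Month m = Month::April;  // copying one of the 12 values is the only way in
    return m.value() == 4 ? 0 : 1;
}
```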
We take advantage of the fact that we know there are only 12 possible values, so we enumerate those values, and we eliminate the possibility of using anything other than those specified values. Well, we haven't eliminated all the problems, because now we've got the possibility of saying, okay, the day is 71.
I don't know of any months with 71 days in them. If I wanted to make it so that the day and the year were impossible to get wrong, well, if you're willing to work hard enough, you can do it. For example, I might say, okay, I'm going to create a year object, and then from the year objects, I'll get month objects, and then from the month objects, I will get day objects,
and they will only give me objects corresponding to valid days in the valid month of the valid year. You could clearly do that. Now, that's not what I'm advocating. I'm not saying everybody should go out and create year objects and month objects and day objects that enforce these kinds of constraints, but my experience has been that in many cases,
people don't think about how interfaces can be used incorrectly. I believe when you design an interface, it is a very important exercise to say, okay, here's my prospective interface. How could people innocently use it incorrectly? How could they accidentally make mistakes?
And once you have identified how they could accidentally make mistakes, then what you can say is, all right, how much work would it be for me to change the interface so that the mistake becomes impossible? After you've done that analysis, you're in a position to say, okay, this is how common I think the mistakes are,
and this is how serious the implications are if the mistakes occur, and this is how much work it's going to be for me to change things so that the mistake becomes impossible. And now you can make an engineering judgment to say, is this more important or is this more important? And what the right answer is will depend on your particular circumstances.
But the important thing is to recognize that you have created an interface which could be misused and to ask yourself, could it be revised so that it cannot be misused? Now, the sessions run for an hour at a time,
but this is the session right after the lunch break. And so I'm assuming you've got kind of a heavy meal floating around in your stomach right now. It's a big room. There's lots of people. It's warm, and I'm talking about software. So this is what we're going to do. We're going to take a break for five minutes right now, and it's just going to be five minutes. There's not a lot of room, but you can at least stand up
and try to get your blood circulating. And those of you who want to can go and try to get as much caffeine as you can possibly consume in five minutes. So we'll start again in five minutes.
So this is very nice. I've got a drop-down.
Drop-downs are very empowering. They make me feel like I can't possibly choose an incorrect date, except that I chose an incorrect date. But it's not fair to pick on Lonely Planet, because actually it turns out that Lonely Planet has a wide variety of ways to make mistakes. So this is all from Lonely Planet.
So it turns out that if you want to fly somewhere, you actually don't even have a drop-down. All you have is a widget you click on, which is great. It brings up a calendar, and you can only choose valid dates. That's wonderful. That works really well. But if you want to find a hotel, then all you have is a drop-down and a widget. So you kind of have a choice as to whether you want to get it right or you want to have the risk of getting it wrong.
And speaking of choice, when you want to rent a car, the only possibility is a drop-down. So, wait, did I talk about consistency yet? Have I mentioned the importance of consistency? So here we have three essentially identical operations at the same website with three different interfaces,
some of them allowing some kinds of mistakes and others not. So this is the kind of thing you would like to avoid if you possibly can. Now, constraining values, such as saying there are only 12 possible month objects, is a legitimate technique. But to be honest, it is not that common to have such a restricted universe of values that you can prevent people from using any others. The more general technique is introducing types. So I did some work with a company. They make slot machines, these lovely automated things whose sole purpose in life is to separate you
from as much money as possible and somehow leave a smile on your face. It's an interesting industry, actually. So one of the techniques that they use to make sure that you will keep on playing to lose all of your money is to give you bonus money so you feel like you've gotten money for free. The problem is it's really important to them that the bonus money never leave the machine
as real money. The bonus money is only there to keep you playing so you lose more of the real money. So under those conditions, it makes a lot of sense in their software, and they ultimately did do this: they have a type called something like real money and a type called something like bonus money, and they have various operations for combining them and adding them and showing totals and so on, but under no conditions can bonus money ever be converted into real money.
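A hedged sketch of the shape of that design (names and representation invented):

```cpp
// Invented names and representation; the company's real design differed.
class RealMoney {
public:
    explicit RealMoney(long cents) : cents(cents) {}
    long amount() const { return cents; }
private:
    long cents;
};

class BonusMoney {
public:
    explicit BonusMoney(long cents) : cents(cents) {}
    long amount() const { return cents; }
    // Deliberately no conversion to RealMoney anywhere in the interface.
private:
    long cents;
};

// You can display a combined total, but the result is a plain number,
// not RealMoney, so bonus money can never escape as the real thing:
long displayTotal(const RealMoney& r, const BonusMoney& b) {
    return r.amount() + b.amount();
}
```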
As another example, we know from studying engineering and physics that if we have units like time and mass and distance, then you can't arbitrarily combine them.
But an awful lot of software that represents things like time and mass and distance actually programs only in terms of floating-point numbers. Maybe they use typedefs or something like that to make themselves feel better. What we would like to do is make sure that unit conversion errors are impossible. I'll talk about that a bit more in a moment.
But first, I want to mention everybody's favorite type: string. That's not true; everybody's favorite type is int, and the problem is that int is meaningless. Int means I've got a number, but we don't know whether it's someone's age, a street address, or the number of times we've circled the moon.
I don't know. It's a number. And string is the equivalent thing. If I say I have a string, that is effectively meaningless. You might as well not even be using a type if you say something is of type string. A file name, for example, should be different from a customer name, and they should both be different from a regular expression,
which really is a different kind of thing. Printer name should be different from driver name. That's not my observation. That is the observation of a client I worked with one time that spent a huge amount of time debugging a problem where driver names and printer names were both represented as strings. They only differed in terms of the last three characters of the name, and debugging that was not terribly pleasant.
Once you have different types, if I know I have a customer name, or if I know I have a driver name, or if I know I have a regular expression, or an address, once I know what it represents, and that's encoded in the type system, then I can start doing type-specific validation,
I can do type-specific printing, depending on what it is that I have. But if all I have is just a string, and I have no idea what it is, it's essentially useless information. So string is very convenient, but it doesn't carry any real type information.
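A small sketch using his printer-name example (wrapper names invented; the real client's code presumably differed):

```cpp
#include <string>

// Invented wrapper types; each can later grow its own validation.
class PrinterName {
public:
    explicit PrinterName(std::string s) : text(std::move(s)) {}
    const std::string& str() const { return text; }
private:
    std::string text;
};

class DriverName {
public:
    explicit DriverName(std::string s) : text(std::move(s)) {}
    const std::string& str() const { return text; }
private:
    std::string text;
};

void install(const PrinterName&, const DriverName&) {}

int main() {
    PrinterName p("LaserJet 4");
    DriverName  d("LaserJet 4.sys");
    install(p, d);     // fine
    // install(d, p);  // won't compile: the strings now carry meaning
    return 0;
}
```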
So I'm advocating the idea that you can eliminate certain kinds of client errors by generating new types. In many cases, the types are gonna be written by hand. Under some conditions, you may need to generate the types automatically. And this is a case where you can do this
using template metaprogramming in C++. As an example domain, let us suppose what you are dealing with are things like mass and distance and time, so the normal units that you deal with in physics and engineering applications. Now, the problem that we have is that the number of possible types
is in principle unlimited. So for example, if I have mass, and I multiply it by mass, I get a new type, mass squared. And if I multiply that by mass, I get another new type, mass cubed. In principle, the number of types that I need to be able to express is unlimited.
So it would be really nice if we could find a way to have the compiler automatically generate all the types we need, so that we would get full type checking and not have these problems. So this is an example that was published in a paper a while ago. This is a simple formula, claims the paper.
I'm sure to some people it is simple. I am not one of those people. But what is important is this. Many, many businesses have people who are working in a world that looks more or less like this. They might be physicists. They might be mathematicians
working for insurance companies. They might be statisticians. They might be people who are trying to come up with new ways of drug therapy. There are people who are working in an analytical and a symbolic realm. And sometimes they will come up with something and they will say, okay, we think this could be useful in our software. And there comes a point where these symbolic formulas
need to be translated into a language where they can execute more efficiently. In many cases, this turns out to be C or C++. And as an example, the question is, is this formula correctly translated into this C++ source code?
Well, all right, we've got four times alpha times RE squared. And here I've got four times alpha times RE squared. I mean, it looks kind of similar. So that's encouraging. But it turns out that if you work in physics, I mean, if you think it's bad to try to choose good names in computer science,
think about trying to choose good names in physics, which has been around for hundreds of years. So there is something in physics known as the dimensional thickness. There's the dimensional thickness right there. Thickness, how thick something is. So it was modeled as a length.
Which would be great if it were a length, but it turns out that the names in physics are no better than they are in computer science. It's actually an area. So this is incorrect. And as a result, this code will not compile. Now, the nice thing is the people who are doing these kinds of experiments, who, as I recall, are actually working
at the research labs in the United States on nuclear bombs. So we kind of would like them to come out correct. They assure me they have an extensive testing program. So I have great faith in that. But we would like to avoid having to spend the time to test things at runtime if we could simply detect that they don't make any sense during compilation.
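The paper's actual templates aren't in the transcript; a common shape for the technique looks like this sketch (the exponent encoding is an assumption about the design):

```cpp
// M, L, T are the exponents of mass, length, and time in the unit.
template<int M, int L, int T>
struct Quantity {
    explicit Quantity(double v) : value(v) {}
    double value;
};

// Multiplying two quantities adds the exponents; the compiler creates
// each new type (mass squared, mass cubed, ...) on demand.
template<int M1, int L1, int T1, int M2, int L2, int T2>
Quantity<M1 + M2, L1 + L2, T1 + T2>
operator*(Quantity<M1, L1, T1> a, Quantity<M2, L2, T2> b) {
    return Quantity<M1 + M2, L1 + L2, T1 + T2>(a.value * b.value);
}

typedef Quantity<0, 1, 0> Length;
typedef Quantity<0, 2, 0> Area;    // length squared

int main() {
    Length thickness(0.1);
    Area a = thickness * thickness;  // fine: length times length is area
    // Area b = thickness;           // won't compile: a thickness modeled
                                     // as a length is not an area
    return a.value > 0 ? 0 : 1;
}
```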
And by generating types from templates like these, we are able to prevent those kinds of mistakes. The general idea has nothing to do with C++. The general idea is that by creating new types, you give the type system information that it can use to detect mistakes that you have made
and thus prevent your code from compiling when it's wrong. That's the general idea, and it is applicable to pretty much any compiled language which can do static type checking. So this brings me to the end of what I've already said is, I believe, the single most important design guideline: you should make interfaces easy to use correctly and hard to use incorrectly. First, you should adhere to the principle of least astonishment. Avoid gratuitous incompatibilities with the surrounding environment. Stay away from undefined behavior if you possibly can.
Choose good names. Shoot for what are called nice classes and definitely be consistent. You can also employ progressive disclosure to encourage people to go to the area where they're more likely to want to fiddle with things and to stay away from the dangerous stuff. By design, minimize potential resource leaks
so you're not relying on your clients to release things appropriately. Document your interfaces before you implement them so you can identify likely trouble spots before you've gone to the work of writing the code. And then finally, introduce new types to prevent common mistakes. Make sure that these are actual types, not type synonyms like typedefs.
Consider explicitly defining all the possible values for a type if you can identify all the possible values. Generate the types automatically if you can and avoid over-reliance on string. These are all specific things you can do to improve the quality of your interfaces.
Any questions about anything to do with my discussion of making interfaces easy to use correctly, hard to use incorrectly? I will interpret that as awestruck silence.
So let us move on to the next topic, which is to embrace static analysis. So now we're up to the higher-level discussion again of how can you write better software regardless of what it's supposed to do, regardless of the technology? So static analysis is the analysis of code
without actually executing it. It is the complement to dynamic analysis. Dynamic analysis is where you actually get information about the code by executing it. Dynamic analysis includes things like CPU profiling for performance, measuring memory usage, test coverage, deadlock detection, all that kind of stuff. I am a fan of dynamic analysis.
I like CPU profiling. I like doing memory analysis. I like doing test coverage. I'm a believer in all those things. Don't get me wrong. I believe in those technologies. But it is my belief that static analysis is at least as useful and is not as widely appreciated. And that's why I'm focusing now on static analysis.
It's not because it's better. It's because I think it needs to get a little more attention. The number of problems that static analysis can find in code is larger than most people realize. For example, you can look for likely design violations, such as dependency errors. You can look for superfluous dependencies. You can look for cyclic dependencies. You can look for dependency inversion, which is where you have a more stable module depending on a less stable module. These can all be detected statically. Static analysis can also notice when a data type is smaller or larger than expected, in languages which don't nail down exactly how big the data types are.
Static analysis can find things that aren't there. So for example, if you have a convention that a derived class virtual function implementation or a subclass method is supposed to invoke its base class version before doing any additional work and that's missing, then that could be something which is identified by static analysis.
Logic errors, off-by-one errors, or other kinds of boundary errors. Likely concurrency issues, for example, lack of needed synchronization: the tool notices that you access this variable with a mutex held over here, but you access the same variable without the mutex over there. Likely concurrency mistake. Likely inefficiencies: I've got two loops over similar data structures; they could be fused together to go through the data structures only one time. Likely security issues, such as failure to validate user input. In some languages, likely typos, like an if that assigns where you meant to compare; in some languages that's an assignment, and in some languages it actually will compile.
Likely violations of local coding standards. For example, functions that exceed a particular complexity metric or failure to follow naming conventions. These are all things that can be found by static analysis. Many of them can be found by static analysis tools, which thus frees up human beings for more useful work.
A lot of the things I've just talked about can also be caught by testing, for example, or during debugging, but the thing is that static analysis is more reliable because static analysis should not miss any paths. Testing in nontrivial systems is typically
not gonna be able to cover every single path. Static analysis can analyze them all. And furthermore, static analysis incurs no runtime cost because it occurs prior to runtime. As a result, if you can guarantee that certain conditions cannot occur because static analysis has ruled out the possibility, you can eliminate the runtime checks
for those conditions and you can eliminate the error handling code when those conditions arise. So you can actually make your program a little smaller and a little faster simply by having ruled out the possibility of certain kinds of mistakes. There are a whole bunch of different kinds of static analysis. What I wanna do is just introduce you to the variety of forms of static analysis.
And we're gonna start with compiler warnings. Compiler warnings are about the lowest of the low-hanging fruit when it comes to static analysis. This is the situation: it is highly likely that compiler writers know the language better than you do. Highly likely that that is true. As a result, you should pay attention to their warnings.
Now, it's an interesting thing about compiler writers. In my experience, compiler writers view their job as taking a valid source program and generating the best possible object code from it. That's their job. Their job is not to babysit you and find a lot of mistakes,
except for in certain parts of GCC. But generally speaking, they take a valid source program and they produce an object code. Now, if they take the time to issue a warning, it is highly likely that it is a relevant warning because number one, they don't view
issuing warnings as their job. And number two, they understand that many of their clients work in an environment where they are required to compile without getting any warnings. So if they issue a warning, that means somebody somewhere's gonna get really upset and have to change some code. So generally speaking, compiler vendors don't issue a lot of warnings.
So if they do issue a warning, usually it is meaningful. You should therefore try to compile cleanly at maximum warning level. Actually, you should try to require compiling cleanly at maximum warning level if you can. Not everybody has the luxury of that. At the same time, you do not want to become dependent on compiler warnings, especially in languages like C and C++
with multiple compiler vendors that behave slightly differently. It is entirely possible to find different compilers that warn about different things so your code can sail through one compiler with no comment at all and get warnings from other compilers. And also, different compilers may issue warnings under differing conditions.
So you don't want to become dependent on the existence of compiler warnings. And this is a problem I run into in practice all the time where someone will say, I don't need to remember that because if I make that mistake, the compiler will warn me. And then I have to point out, well, yes, but I know another compiler that doesn't issue a warning and if you end up porting your code to a different platform, for example, you may run into this problem.
So let us look at a really small piece of C code described as an extremely small piece of bad C code. This is from an article from a number of years ago. So here's the code. Under GCC 3.2.3 with the default compiler options, that compiles cleanly.
No warnings. However, if you turn on full warnings with -Wall, it says: too few arguments for format; control reaches the end of a non-void function. I mean, this is giving you some really useful information. Wouldn't you like to know that you don't have enough arguments for your format specifier? Seems vaguely relevant.
All you had to do was ask this particular compiler to tell you the things that it recognized was wrong with your code. So if it's just a matter of enabling a particular command line option, it seems like you definitely would like to be able to do that. It doesn't get a lot easier than that.
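The article's exact snippet isn't reproduced in the transcript, but a tiny function in the same spirit draws both of the quoted warnings (under gcc with -Wall; exact behavior varies by version):

```cpp
#include <cstdio>

int f(int x) {
    std::printf("%d and %d\n", x);  // warning: too few arguments for format
}                                   // warning: control reaches end of
                                    // non-void function
```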
The next step up after turning on compiler warnings is lint and similar utilities that read through your source code. Now, lint and similar utilities, their only job is to issue warnings. They don't generate object code. So their only reason for existence is to try to find things which might be mistakes. So they check for things like
constructs with unexpected behavior. For example, testing floating point numbers for equality. Most people have learned at one point or another that just because I have two mathematical expressions which are mathematically equal does not mean that if I translate them into source code and run them, I'm gonna get two bit patterns that are identical. But checking floating point numbers for equality checks the bit patterns (there's a small example of this after the list). So that's a nasty little trap to fall into, as I can testify by the nine hours I spent one time trying to figure out what the problem was. Placing mandatory cleanup in a Java finalizer, for example: finalizers in Java aren't always called, so putting mandatory stuff there is a bad idea.
Potential concurrency problems, for example, invoking Thread.run instead of Thread.start in Java. In Java Concurrency in Practice, they say that static analysis tools are an effective complement to formal testing and code review. Or potential security risks: for example, requesting read-write file access when you only need read-only, or making unchecked writes to fixed-size buffers. Gary McGraw in the security industry says that static analysis is number one of his seven touchpoints of secure software, recognized as being able to find really interesting problems in your code.
A couple of other things that can be identified: one of them is unportable code, for example, use of compiler-specific extensions, or dependencies on evaluation order in languages where the evaluation order is not completely nailed down. Likely maintenance problems, like overly complex expressions or failure to follow naming conventions.
As I already mentioned, lint-like programs are typically a lot more aggressive than compilers. Traditionally, lint-like programs have required a non-trivial investment up front to get them configured to the point where they can do something useful,
especially for large legacy systems. It's very, very common that if you get a brand new lint-like tool, deploy it for the very first time, and you have a large code base, you will be inundated with hundreds of thousands of warnings. And what you'll actually be trying to figure out is how to uninstall the static analysis tool,
because you just can't do anything with hundreds of thousands of warnings. Output filtering helps a lot, but you're gonna have to set aside time for initial configuration. Now, this is sort of the traditional path for static analysis tools. In the last, let's say, half dozen years,
a number of companies have arisen with a different philosophy. Their philosophy has been, we are gonna issue almost no warnings unless we are really, really sure that this is a problem. And their goal is to issue almost no false positives. So what they do is they have a very low false positive rate, but they don't catch as many mistakes.
And companies that I've talked to that have used both kinds have said, well, you know, it's not exactly obvious which is the best solution. Because with the low-false-positive tools, whatever they warn about usually has to be fixed; that's great. But the problem is there's a lot of other stuff that needs to be fixed that they didn't warn about. And the first set of tools, which give more warnings,
will bring those other things to light. At the same time, there's a ton of payoffs for using these kinds of tools. One of them is reduced debugging time. You have to ask yourself, how long is it gonna take me to track down, for example, use of an uninitialized variable? Uninitialized variables are easy to catch with data flow analysis.
Tools do it all the time. But if you don't run the tool, you don't necessarily know it's uninitialized. What if you have an evaluation order problem, where you think that X and Y are being added before being multiplied by Z, but actually it's happening in some other order? Or the order in which the arguments of a function call were evaluated was different from what you expected?
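A small illustration of the function-argument case:

```cpp
#include <cstdio>

int counter = 0;
int next() { return ++counter; }

int main() {
    // The order in which these two arguments are evaluated is
    // unspecified in C++, so the output may be "1 2" or "2 1".
    std::printf("%d %d\n", next(), next());
    return 0;
}
```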
In addition, one of the nice things about using these kinds of tools is that when people use them and get warnings, it helps educate the programmers about those kinds of problems. After you've received a warning six times in a row that if you do this, you could have a problem, we would like to believe you're gonna learn
you probably should not do that. So it's a way to educate people over time. And by running static analysis tools like Lint, you can identify modules that probably should be looked at more closely, either through testing or through general review. And that is because it is an empirical observation
that defects in modules tend to cluster, which means that they're not uniformly distributed. If you find a bunch of mistakes in one area, there's probably other mistakes in the same area. So if Lint gives you a whole ton of warnings in one module or two or three files, you should probably be subjecting those files to additional scrutiny, because statistically it is likely
there are other issues there which require some way of being addressed. There is an interesting thesis that says programmers don't do anything unless they think it will have an effect. Seems reasonable. Programmers don't write code unless they think it's gonna do something. So the question is, how can we take advantage of the observation that programmers don't write code unless they think it's gonna do something?
I will tell you in 20 minutes, because that's the time for the break. So we'll start again in 20 minutes. Oh, thank you.