Better Software — No Matter What - Part 5

Formal Metadata

Title: Better Software — No Matter What - Part 5
Number of Parts: 150
License: CC Attribution - NonCommercial - ShareAlike 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.

Content Metadata

Abstract: Beyond Inconvenience
Transcript: English (auto-generated)
Welcome back. So we have been talking about what I call the keyhole problem, which is an unjustifiable arbitrary restriction on what you can see or what you can express. I had just talked about how it went from annoyances
to lives being at stake. There's been a lot of press over the years about buffer overruns. Why do we get buffer overruns? It's because we have a fixed-size buffer, which means somebody can supply data of arbitrary size, and then somebody either didn't bother to check that size or just wasn't very careful about it.
This is just a list of lots of different buffer overrun attacks which have been reported over the years on all kinds of operating systems and all kinds of applications. It's not something restricted to a particular operating system or application style. San Francisco, November 5th, 2004. 2004 was a big year for presidential elections in the United States. It was very hotly contested, so a lot of people turned out to vote. San Francisco had a little problem: too many people turned out to vote. An unexpectedly high voter turnout caused the vote-counting software to crash. And they found out later that the cause was that the amount of data exceeded a preset limit. Apparently, somebody said no one will ever have
more than this many ballots. And the vendor later determined that the limit was not necessary. A classic keyhole problem. Somebody set an arbitrary restriction, even though it was not technologically justified. Presidential election, kind of important, at least to us. And then there was Three Mile Island. Three Mile Island, March 1979.
Nuclear reactor in Pennsylvania. Comes within about 30 minutes of a total meltdown. Now, there's a variety of reasons why this occurred, but what I found interesting in reading about the account was at some point, what they notice is the temperature is rising, and this is a bad thing. So you've got these people who are looking
at a control board trying to figure out why is the temperature rising? They don't know what the problem is. Now, these are smart people. These are smart engineers. They understand that when they're looking at the control board, it is possible that what they see is not accurate. In other words, if they look at a particular setting for a valve, and the control board says
the valve is closed, they understand that there could be a malfunction and that the valve might actually be open. And in this particular case, there was a valve, and they were suspicious because the board said it was closed, but they knew that if that valve was truly open, that would account for the heating that they were looking at.
Now, these are nuclear engineers in probably the most important debugging session of their lives. They understand that even though the board says that the valve is closed, the valve might really be open. And they go, okay, we have a way to check. What we can do is we can check to see what is the temperature of the steam
on the other side of the valve. Now, the steam is under pressure, so it's gonna boil at a much hotter temperature than it usually does because there's pressure in there. They go, what we'll do is we will check the temperature of the steam in the chamber beyond the valve. And if that temperature is higher than we expect, that will indicate that the valve is open,
even though the board says that it is closed. So they check the temperature of the steam, and as it turns out, a steam temperature readout was programmed never to display values over 280 degrees. This is what I call a restricted range keyhole. Now, it could be that this is because the sensor
simply can't detect temperatures over 280 degrees. That's a possibility. Or it could be that there is an arbitrary restriction in the software that simply doesn't display numbers over 280 degrees for some reason. But I hope that we can all agree that there is a giant difference between "280 degrees" and "280 degrees, or possibly a whole lot higher." And this particular readout did not make that distinction. Now, fortunately, they were able to get the situation under control. They actually did determine through other means that the valve really was open, and they found a way to close it. But it didn't help them any that somebody had arbitrarily programmed things
to never display values over 280 degrees. Again, an arbitrary restriction. Even if the sensor didn't go above 280 degrees, the software should have been smart enough to say, don't know what the temperature is. So at least they would have a reason for checking further.
Which brings me to the end of the discussion of keyholes. I want to recap where we are. Keyholes are primarily gratuitous restrictions on what our user can see or what our user can express. And I want to emphasize not all restrictions are gratuitous. As an example, consider many, many websites
where you have to enter a username and you have to enter a password. Usernames and passwords are looked up very, very frequently. They're stored in databases. This means they need to be efficient to store in a database. They need to be efficient to look up. I understand that. Furthermore, in databases, the larger the field you allow,
for example, for a username, the more memory it takes. The more memory it takes, the more expensive it is, but also you get more fragmentation in the real memory. Which means there are technical reasons why restricting usernames to below a certain length can be justified. You can argue about what that number should be, but the point is there are direct implications
between the decision that you make about username length and how that manifests in the database and how efficient it is to use the database to support your website. I don't have a problem with restrictions on the length of usernames. Passwords are completely different.
The reason passwords are completely different is that you shouldn't be storing the password in clear text anyway. You should take the password, run it through some kind of hashing function, typically with a salt, and get a result which cannot be reversed to find out what it was originally. This has been done in Unix since the 1970s. And then you should store the hashed version in the database.
And mathematically, you can take an arbitrarily long password and you can smush it down to as few bits as you want. So just because you want to limit the amount of space in the database, which is a legitimate technical concern, does not mean you can justify restricting the length of the password in the first place.
So there is a completely different analysis for usernames and for passwords. And it is therefore my belief, usernames can legitimately be limited in length, but passwords cannot be, even though many websites and many applications restrict them in the first place, but there's no good technical reason for doing that.
Which means anytime you're faced with a potential restriction, you have to do a case-by-case analysis to determine whether the restriction is technically justified. Often, a keyhole is the use of a constant where a variable would be better, or the imposition of a constant where none at all is warranted.
Anything to do with a fixed size array usually cannot be justified. There's a few exceptions, but we have so many dynamically extensible array-like data structures now in every language, there's just no excuse for saying you can't have more than n characters most of the time.
In order for the notion of a keyhole to be meaningful, we have to be able to describe what it is and what it is not. I want to point out that a missing feature is typically not due to a keyhole. A keyhole keeps you from seeing everything at once.
A missing feature says, well, you can't even do that at all. I don't want people to start saying, well, that's a keyhole to mean it doesn't have the feature I want. That's not what keyholes are about. Keyholes are where somebody does offer you a feature and then they impose an arbitrary restriction on it. Okay. All right.
Not all usability problems are due to keyholes, and not all keyholes lead to what are usually considered usability issues. It's a particular kind of technical constraint. Keyholes are important because they lead to serious security and safety vulnerabilities,
as we've seen with things like buffer overruns and sawmills that blow things up. They make systems brittle in the face of change. Now, I mentioned C4786; that's Microsoft's warning about identifiers being truncated to 255 characters. Over and over again, we run into situations where people say, I have a limit that's obviously sufficient, and then over time, it becomes insufficient,
either because machines get faster, we get a larger amount of data, bandwidth improves, the world changes in some way. By now, we should understand that almost nothing is gonna stay the same size. Even IP addresses are getting bigger now. They typically degrade a system's usability. Users can't see what they wanna see.
They can't express what they wanna express. They have to cope with a lot of inconsistency, and it leads to users who are frustrated, who are unhappy, who are what I call unloyal to a product or a system. Somebody who's disloyal actively works against you, but somebody who's unloyal simply lacks loyalty to you. They would be happy to switch to somebody else.
What I really want to accomplish with this part of the talk is I would like it so that keyholes get what I call a seat at the table. When people are sitting down and they're designing software and they're implementing software, there's a bunch of considerations which are already there.
Everybody worries about performance. Everybody worries about when can we deliver it. Everybody worries about correctness. All of these things have what I call a seat at the table. You're going to discuss them when you make trade-offs. What I would like is that keyholes get a seat at the table. It doesn't mean that you never ever impose them, but it means that if you choose to impose them,
you do it consciously, having done a trade-off analysis to determine whether it's really justifiable. And during design and development, you should avoid gratuitous restrictions. What it boils down to is pretty simple. Determine things dynamically instead of statically.
If constraints have to be imposed, make them as lax as they possibly can be and reject components that impose keyholes. And when it comes to buffers, before you put something somewhere, make sure that it will fit. So the guideline is to minimize the introduction of keyholes. Any questions about the keyhole problem?
Yes.
All right, so I think your question is, if I'm doing a trade-off analysis
and I have a constraint which isn't completely arbitrary, then it's not a keyhole because it's not arbitrary. Is that essentially what you're arguing? Now, which is a legitimate question. As an example, you might find yourself in a situation where you say, okay, we already have a part of a system
and it behaves in this particular way, but it's got a keyhole, but it's been there for five years and everybody's used to it. And now we're going to add this new feature to the system which does this other thing over here. And now we have a choice. We could implement this system to have the same keyhole which would then be consistent with the existing system. And I've already argued in favor of consistency.
But that would mean that this part of the system over here now has a keyhole, and I've argued that keyholes are bad. Now you have to make a trade-off. Assuming you can't go back and rewrite the system, now you've got to choose between do I impose the arbitrary restriction, it's still a keyhole, in the name of achieving consistency with the rest of the system,
or do I choose not to impose the keyhole and now have inconsistent behavior across the system? So that's the kind of trade-off analysis I'm talking about. Does that seem reasonable? Okay. Other questions about keyholes?
All right, the next topic I want to talk about is minimizing duplication. Martin Fowler, well known for a book called Refactoring, talks in that book about code smells: code that just has a bad smell.
There's something wrong with it. And when he talks about that, he says, number one in the stink parade is duplicated code. So what I want to talk about is minimizing duplication. Now fundamentally, the reason why code duplication is problematic is because the software is bigger. Now sometimes people think, well, size doesn't really matter anymore.
Again, from a historical perspective, we keep discovering the same things over and over. So back in the Dark Ages, PCs were slow and disks were small, so we cared about size. And then PCs got fast and disks got big, and we said, great, we don't care about size anymore. And then, going way back, came dial-up internet access, okay?
Then the internet access got a whole lot faster. We had broadband, and we had cable, and we had DSL, it was great. And then came wireless access, when suddenly we had to be concerned about bandwidth again. And now wireless is getting faster, and there's no doubt in my mind that something else will come along at some point. Furthermore, maybe you don't care about size. Maybe you say, yeah, not in my world.
Generally speaking, bigger is slower, even on really fast machines. Big systems take longer to build. Big systems take longer to swap in and start up. Big systems have more page faults, they have more cache misses. Generally speaking, big systems are slower and more cumbersome than smaller systems.
So even if you don't care about size per se, maybe you do care about responsiveness and performance. There are two kinds of code duplication. The first one is source code duplication, which is actually what Martin Fowler is talking about in Refactoring. Source code duplication complicates working with the source code, and it leads to object code duplication, unsurprisingly.
If I have two chunks of code in the source code that are the same, they will compile down typically into two chunks of object code. Then there's object code duplication. Now, object code duplication actually leads to the runtime performance problems I just mentioned. And that can arise without source code duplication.
So for example, macros in C and C++, templates in C++, inlining in any language, aspect-oriented programming, all these things make your executable larger. This is typically referred to generically as code bloat. I've got this bloated program
that's just bigger than I want it to be. Usually that refers to the object code, not the source code. So we have to address two issues here. I'm gonna start with source code duplication and then we'll move on from there to object code duplication. Source code duplication is bad for a couple of reasons. The first one is the code is a lot harder to comprehend. So if I'm looking at some code
and I see a function here and another function here and they look identical, I'm gonna go, wait a minute, they can't really be identical. There's gotta be some subtle difference between it. So I keep looking at them or maybe I run a diff or something like that. The point is I waste time trying to determine whether they really are doing the same thing or not.
Furthermore, we talked about code reviews. The software's harder to review because there's just so darn much of it. If you have duplicated code, this means you're gonna be reviewing the same code over and over, not a good use of your resources. The most common reason people talk about there being problems with source code duplication
is that fixing bugs is harder because the same problem has to be addressed in multiple places in the code. You fix this function, you gotta fix the duplicated function over here, or you get inconsistent behavior, and inconsistent behavior suddenly reduces the quality of your software. And adding features is harder for the same reason that fixing bugs is more difficult.
If you add a feature and you have to add it to all the duplicated sections of code, it's gonna take longer to do that. If you wanna prevent source code duplication, the fundamental strategy is what is known as commonality and variability analysis. So what you do is you say,
okay, I've got two sections of code, and I need to figure out what is common between them and what varies between them, and then you put the common stuff in one place and you find a way to implement the variability. So for example, what you might do is you might put the common stuff in a single function and the variability would be a parameter,
or you might put the variability in another function and have both functions call the same third function. To prevent code duplication, you can move common code into functions, using parameters for variability.
You can move the common code features into other classes, often into a base class. So if I see two classes with common features, I can extract the common features and move them into a third class, possibly a base class from which they both inherit, or possibly another class that they both simply use, but I get the common functionality out of them.
This actually improves the design. Common features belong in a single place. You shouldn't be duplicating that kind of functionality. This applies to methods, to attributes, to properties, to nested types, all that kind of stuff. If you migrate commonality into a base class, you can then use virtual functions, dynamically bound functions for variability.
For example, this is the template method design pattern where you move common stuff up into a base class and then you have virtual functions that call down for implementing what needs to vary. You can move type dependent code into generics using type parameters for variability.
So you've only got one template or one generic to maintain. Furthermore, anytime you do copy and paste, you always get duplication. The whole idea behind copy and paste is to get duplication. So after you've done copy and paste, you want to make sure you refactor after that to avoid keeping the duplication.
And then after you've done the copy and paste, you can apply some of these techniques up here. How many people are familiar with aspect-oriented programming? A few people familiar with aspect-oriented programming. The idea behind aspect-oriented programming is that in some cases, you can identify repeated need
to do something in what is called a cross-cutting concern. So for example, we talked earlier about design by contract. So maybe what you'd like to say is, at the beginning of every method, it should call a function called checkInvariants, and at the end of every method, it should call checkInvariants again. Now you could manually go into every single method and type that at the beginning and at the end, but that's going to be problematic. That's a lot of work. Furthermore, it's going to introduce a lot of redundancy. With aspect-oriented programming, you would be able to describe a way of saying, I want to insert a call to checkInvariants at the beginning and at the end of every method, and then it automatically gets knitted in through the entire system. So aspect-oriented programming can do things like check invariants on entry and exit for public methods. There are very few programming languages which have aspects in them right now. It used to be something we heard more about probably five or ten years ago, but there's still work being done on that.
But if you like the idea of aspects, you can approximate it through what is known as interception. Interception is where proxy objects are inserted between callers and callees. So I think I'm calling that function there, but with interception, instead of calling the function directly, that function call gets sidetracked
and goes to a third object: the proxy. That proxy object, for example, can do some work before forwarding the call to the target function. For example, it could check the invariants. And then when that function returns, the return doesn't go directly to the caller. It goes back to the proxy object, which can do some more processing, for example, check the invariants again before returning to the original caller.
That is what interception can do. Technologies for doing this vary. For example, Java has dynamic proxies, .NET has context-bound objects, and CORBA has interceptors. So there's a variety of ways to implement it. There's also a design pattern called Interceptor. But the point is the notion of an aspect
allows you to achieve the effect of inserting code into your code base based on some criterion that you specify, like entry to a method or exit from a method. It's more general that you can also say, anytime I call a method that is from this library, then I want to insert a call to do a security check or something like that, as an example.
That avoids source code duplication because you describe what needs to be done in one place, and then the system handles it automatically. You should be wary of editor macros. There are some people who program their editors so they hit one key and a boilerplate class pops out or a boilerplate function pops out or something like that.
Just be aware that the inserted code leads to duplication, because the macros are just producing the same thing every time. You can also consider code generation tools. For example, .NET has attributes, and you have proxy and stub generators from IDL.
You could write custom code generation scripts, macros in C and in C++. The thing is, in all of these cases, the tool input is the source code that you as a human maintain. So what you're doing as a human is you're maintaining the attributes or the proxy and stubs or the code generation scripts, and the output should be generated
for each build of the system, and you're not supposed to edit it. So this has the effect of you have less source code to look at, so the source code gets smaller. If you want to find source code duplication, there's a couple of ways you can do it.
First, just keep an eye out for it as you review source code or as you view source code. So as you're editing files, you might notice, oh, I noticed that this class is really similar to that class, or this function is really similar to that function, so you're looking for duplication all the time. And then there's a whole bunch of tools these days which look for similar sections of code.
It's sort of an interesting technology transfer, I guess. I think most of these tools originally started in university settings, where the problem they were trying to solve was catching students who were copying from one another on their homework. So they were looking for ways to find people who were cheating in computer science classes, and then they realized that actually, finding duplicated code has other advantages as well.
Last time I googled, here was an example of some tools that I found. I have not used any of them personally, but there are a number of tools available now that can help you look for similar sections of code and therefore possibly identify places to look for eliminating source code duplication.
If you find source code duplication, then the question becomes, what are you going to do about it? Well, basically, you refactor mercilessly. So once you've found code duplication, what you have to do is you have to find a way to eliminate it. You can use any of the techniques I talked about earlier. You can factor things into a common function. You can move them into a base class.
You can move them into a third class. You can templatize things. There's a variety of ways you can do it once you have identified that duplication is present, but first you have to find it. You can also have tools generate code you'd otherwise have to duplicate manually. So if you're working in an environment, for example, where there's no support for generics,
you can create a tool that would generate generics for you. And again, what you would do is you would maintain the source code that you feed to the tool, and you wouldn't look at its output at all. That would just be object code from your perspective. Now, in the realm of object code duplication,
there are some different things you can do. One of them is to inline judiciously, assuming you're working in a language which gives you control over inlining. If you are in a language where you have control over inlining, what you want to do is you want to focus on short code sequences that are frequently called. Those are the best candidates for inlining,
short code sequences that are frequently called. And if you have to make these decisions manually, that's what you want to look for. Short sequences mean you have the least chance of code bloat, and frequently called means that you have the best chance of actually getting a performance boost. If you have a section of code that is not called very often, the chances of it affecting the overall performance
of your program are very, very small, and the benefit of inlining is typically nothing. So you want to look for small sections of code that are frequently called. Increasingly, we have compilation technologies that can take this over for you, and you don't have to really think about this. But there are still some languages where you have to think about manual inlining.
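As a minimal C++ sketch of that guideline (the function names here are made up for illustration), a short, frequently called function is marked `inline`, while the surrounding hot loop is left to the compiler:

```cpp
#include <cstddef>

// A short, frequently called function: a good inlining candidate.
// Marking it inline (and defining it where callers can see the body)
// lets the compiler replace each call with the body itself,
// avoiding call overhead with negligible code bloat.
inline double squared(double x) { return x * x; }

// Hot loop: squared() may be called millions of times, so inlining
// it is where a performance boost is actually plausible. Inlining
// sum_of_squares() itself, a longer and less frequently called
// function, would buy little and cost more object code.
double sum_of_squares(const double* v, std::size_t n) {
    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i)
        sum += squared(v[i]);
    return sum;
}
```

Modern compilers will often make this decision themselves, as noted above; the explicit `inline` matters most where you still control inlining manually.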
Now, I told you before that in order to reduce source code duplication, you should do things like consider using tools that will generate code. Which is true. Now I'm gonna tell you to be wary of those same tools. And the reason I'm gonna tell you that is because the use of tools
can reduce the amount of source code you maintain. That's true. But the problem is now those tools are generating potentially a whole lot of object code. So if what you're trying to do is minimize the total amount of object code, then you need to also try to reduce the total amount of source code that gets generated by these tools. This means that, first, the source code that you see,
which is what you're editing, is not necessarily the same as the source code that your compiler sees, because your tools may be emitting source code as well. It also means that in some cases there can be a tension between minimizing source code and minimizing object code, and you have to weigh those to figure out where the right balance lies.
You can move type independent code out of generics. This is especially important in C++. What this means is if you have a generic class or a generic function, and there is some functionality in the function or in the class that doesn't depend on the type,
you wanna pull it out of the class or out of the function so it doesn't get duplicated every time that that template gets instantiated. The same idea probably applies in other languages, although I know that .NET is more sophisticated about that than C++ is. So it depends on your platform. If you are using aspects, you want to move location independent code out of aspects.
So with aspects, what you're trying to say is, at the location described by this expression, what I wanna do is put some source code in there. You wanna make sure you don't put in source code there that could be factored out, for example, into a function call.
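The template advice above can be sketched in C++ (this is an illustrative container, not an example from the talk): the type-independent bounds check lives in a non-template base class, so it is compiled once instead of once per instantiation:

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Type-independent logic in a non-template base: this code is
// generated once, no matter how many CheckedVector<T> types exist.
class BoundsCheckerBase {
protected:
    static void check(std::size_t i, std::size_t size) {
        if (i >= size) throw std::out_of_range("index out of range");
    }
};

// Only the thin, genuinely type-dependent part stays templated,
// so each instantiation adds very little object code.
template <typename T>
class CheckedVector : private BoundsCheckerBase {
public:
    void push_back(const T& x) { data_.push_back(x); }
    T& at(std::size_t i) { check(i, data_.size()); return data_[i]; }
private:
    std::vector<T> data_;
};
```

Without the hoisting, every instantiation such as `CheckedVector<int>` and `CheckedVector<std::string>` would carry its own copy of the identical checking logic.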
I've focused on code duplication because that's the primary artifact that is produced by developers. And developers are the people who I am primarily interested in, but there are other places where duplication can arise. For example, you're gonna have consistency problems anytime you have duplication.
If I have, for example, a database schema, which specifies how things look, and I have a program's data structures, and the data structures are supposed to map into the database schema, then if I change the schema, I have to update the data structures. If I change the data structures, I have to update the database schema. I'm expressing the same information in two places.
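One way to attack that particular duplication in C++ is the X-macro technique: keep a single authoritative field list and expand it into both the struct and the SQL. This is a sketch with made-up table and field names, not something from the talk:

```cpp
#include <string>

// Single authoritative field list. Both the C++ struct and the
// SQL DDL are generated from it, so the two cannot drift apart.
#define USER_FIELDS(X)            \
    X(int,         id,   "INTEGER") \
    X(std::string, name, "TEXT")

struct User {
#define DECLARE_FIELD(ctype, fname, sqltype) ctype fname;
    USER_FIELDS(DECLARE_FIELD)
#undef DECLARE_FIELD
};

// Build the CREATE TABLE statement from the same field list.
inline std::string create_table_sql() {
    std::string sql = "CREATE TABLE user (";
#define SQL_FIELD(ctype, fname, sqltype) sql += #fname " " sqltype ", ";
    USER_FIELDS(SQL_FIELD)
#undef SQL_FIELD
    sql.erase(sql.size() - 2);  // drop the trailing ", "
    return sql + ")";
}
```

Adding a field now means editing exactly one line; the struct and the schema are regenerated together.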
For example, between specification and test cases, if you have a written specification that is not executable, and somebody writes test cases based on that specification, if the spec changes, the test cases have to be changed. As a result, you have duplication here. Ideally, therefore, the test cases
would be the specification. If you have code and you have comments, ideally, the comments correctly correspond to the code. I suspect everybody here has run into the situation where you look at the comments and you find out that they don't correspond to the code because somebody edited the code and they forgot to update the comments. This is another example of duplication.
Andy Hunt and Dave Thomas, in their book, The Pragmatic Programmer, they advocate what they call the DRY principle. DRY stands for Don't Repeat Yourself, which they define as every piece of knowledge must have a single, unambiguous, authoritative representation within a system.
The way that I phrase it is slightly different. I say, if you're doing something more than once, you're probably doing something wrong. So if you find yourself doing something more than once, you want to try to find a way to eliminate that repeated effort. And the guideline, therefore, is to minimize duplication.
Any questions about dealing with duplication? You're gonna have to speak a little louder, I'm sorry.
Okay, so I believe the question is,
when you're doing some refactoring, so you found some duplication and you want to get rid of the duplication, the process of refactoring can potentially interfere with the readability of the code, because you take some local information
and maybe you move it to a remote location, maybe to another class, maybe to another function, and as a result, instead of having all the information that you had at your disposal before, now you've sort of got it distributed across a couple of places. Is that an accurate summary of your question? Okay, so, and the question is, do I have any feelings about that?
As a general rule, my feeling is that it's still worth doing the refactoring. There are some edge cases with one or two lines of code. So the question is, how can you mitigate the effect,
the impact that that has on the code? What I would argue is that you need to become comfortable with the idea of being familiar with the interfaces of what you are using. So for example, if you have two different functions which have a common section of code, and you decide to refactor that into a function that they're both going to call, then you move the common code into that function.
At that location in the code, you're now going to be calling the function, which means you're only relying on its interface. You're no longer relying on the details of exactly what's going on inside. And I think that if you are familiar with the interface of a function that you're calling, it doesn't make it any less comprehensible because you know what the function does.
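A minimal C++ sketch of that extract-function refactoring (the validation rule and all names are made up for illustration):

```cpp
#include <string>

// The formerly duplicated logic, factored into one well-named
// function. Callers now rely only on its interface, not on the
// details of what goes on inside.
bool is_valid_id(const std::string& s) {
    if (s.empty() || s.size() > 8) return false;
    for (char c : s)
        if (c < '0' || c > '9') return false;
    return true;
}

// Both former duplicators now just call the shared function.
// Reading either caller only requires knowing what is_valid_id
// promises, not how it is implemented.
bool can_create_account(const std::string& id) {
    return is_valid_id(id);  // plus creation-specific checks...
}

bool can_delete_account(const std::string& id) {
    return is_valid_id(id);  // plus deletion-specific checks...
}
```
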
And similarly, if I have two classes and I'm going to extract a class from the two of them and put that someplace else, whether it's a base class or a third-party class, as long as I want to deal with that class through its interface and I focus on its interface, it should not be any more difficult for me to understand what's going on
in the places where I use that class. So fundamentally, I would argue that you mitigate the cost of distributing the information by focusing on better interfaces. Question in the back?
Is the question, would it just be a matter of choosing suitable names? Okay, so the question is, if you just choose suitable names, that that would explain what's going on. I would go a little bit more broadly than that. I would say that you want to give those entities, whether they're functions or classes or templates
or whatever, you want to give them interfaces that are easy to use correctly and hard to use incorrectly, and certainly a key component of that is choosing good names. But I would also want to follow some of the other advice I had earlier. I would try to have them behave consistently with respect to one another so I could predict how they're going to behave and all those other kinds of things that I talked about. Was there another question here?
Yeah, okay.
So the observation is that with aspect-oriented programming you're actually, essentially, part of your program is located separate from everything else and you can't therefore look at the code, at the source code, and know everything that's going to be there because you may have an aspect that comes in later,
which I completely agree with. Aspects essentially are designed to improve the ability to refactor the code and to avoid the need to introduce duplicated code. But they do have the drawback, I agree with you, that when you're reading through the code you have to read the aspects at the same time to really understand everything that is going on.
I completely agree. Yes?
Okay, so I believe your observation is that if you have two existing systems which have some similar functionality but they're using different interfaces to access it,
if you try to extract that into a single component that they could both then use, it's going to be more difficult because you may need to support both interfaces simultaneously. Is that an accurate summary? Okay, I think that's a legitimate observation. Under those conditions, under the assumption I can't change the calling interfaces of the existing components,
then what I would probably do is I would choose one of the interfaces as the interface I really want to support, and then I'd write a thin adapter layer to adapt the secondary interface to the first one if I could. So then I'd only have to maintain the one component with its one interface, and assuming that there's a relatively straightforward mapping from the alternative interface
to the first interface, I would hope I wouldn't have to maintain that adapter very much. I mean, that would be my original goal. Whether I could achieve it would depend on the circumstances. Does that seem reasonable?
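That thin adapter might look something like this in C++ (the logger interfaces are hypothetical, chosen only to illustrate the shape of the solution):

```cpp
#include <string>

// The interface we chose to keep supporting and maintaining.
class Logger {
public:
    void log(const std::string& msg) { last = "LOG: " + msg; }
    std::string last;  // exposed here only so the example is checkable
};

// The interface the second, legacy system expects.
class LegacyLoggerInterface {
public:
    virtual ~LegacyLoggerInterface() = default;
    virtual void write_line(const char* text) = 0;
};

// Thin adapter: forwards legacy calls to the kept implementation,
// so only one real component has to be maintained. If the mapping
// stays this straightforward, the adapter itself rarely changes.
class LegacyLoggerAdapter : public LegacyLoggerInterface {
public:
    explicit LegacyLoggerAdapter(Logger& l) : logger(l) {}
    void write_line(const char* text) override { logger.log(text); }
private:
    Logger& logger;
};
```
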
All right, so the observation is that in some cases
it can be such a difficult problem to extract things and to go through the interfaces that it may not be worth the trouble in the end. Is that an accurate summary? Okay, and so the fundamental observation is that this means you don't want to do these things
blindly, you need to really analyze the situation and find out what the implications are of trying to do this kind of refactoring. Certainly I am a big fan of thinking about what you're going to do before you do it to determine whether it's worthwhile. Having said that, generally speaking, people don't like change,
because change is hard, change is a lot of work. And so it is not uncommon for people to want to find excuses for not doing things. So I completely agree with the need for analysis, but at the same time I think that if you do have duplication, along with the disadvantages that arise from that, you're already accustomed to a particular price
that you're paying, and I think you need to take into account when you're doing the analysis that once we refactor, we're no longer going to have to pay for excess source code duplication, for the inability to make certain kinds of changes, or for excess object code duplication. We can make those kinds of problems either go away or at least be reduced. So I do agree with your desire for analysis,
but I think that at the same time that shouldn't be used as an excuse to come to the conclusion that it's not worth doing. Yes? Yes. Let's say you have an existing system and a common library.
Right, mm-hmm, mm-hmm. And that would be an extra cost to that.
So I think what you're saying is that maybe there's some functionality in part of a system that I recognize I could use, but it's bound up in the other system. So what I want to do is I consider extracting it to make it a library, for example, which is nice, but then I have to go through the effort of ensuring that the existing system which used it,
I have to test to make sure that it continues to function as well as it did before. And there's an additional cost associated with doing that, which I agree with, but I'm going to say this. Anytime you perform a refactoring, well, the definition of refactoring is that I take an existing body of code and I make a change to it in some way
that doesn't change its behavior. That's what a refactoring is. It's a source transformation that does not change the behavior, which means in every kind of refactoring, you always have to verify that the behavior hasn't changed. So the need to verify that you didn't change behavior is inherent in every kind of refactoring.
It's certainly a cost, I'm not disputing that, but what I'm saying is if you're not going to take on the need to pay that cost, you're not going to do any refactorings at all because the definition of refactoring is it doesn't change the behavior of the system. All right, any other questions?
Okay, so the assumptions,
let's suppose I've got two functions, they have some shared functionality, some code duplication, so we say, great, I'm going to refactor that into a third function. They both now call the third function, everybody's happy. But then life goes on and it turns out that one of these functions, its specification changes in some way, so it can't use this third function any longer.
So we change it so that it no longer calls this third function, and the question is, do we now take that function which we originally extracted and basically move it back into the original second function that was calling it? I would argue that there's no reason to do that, and what I would say is, presumably at the time you extracted that functionality, it corresponded to some conceptual operation.
You were able to describe what it did, and you gave it a reasonable function name. So there's no reason to automatically put it back into the only other calling function, because that calling function is still calling this high-level abstraction which you've incorporated. So I don't see any benefit to be gained
by putting it back into that other function. Basically, what you did was you extracted a meaningful piece of functionality that at one point was useful in two places, and I think that holding on to that abstraction makes perfect sense. Okay, so let's continue a little bit further. We have a couple more guidelines I want to get through.
The next guideline is to embrace automated unit testing. So, and let me just ask, how many people are familiar with the notion of unit testing? Pretty much everybody, which is essentially what I expected, so. All right, unit tests verify the correct behavior
of standalone units of software. They're usually classes and functions that may or may not be parts of classes. Now, there are some obvious benefits, like, ooh, I test my class and things are going to be better, but I want to point out some less obvious benefits. The first one is improved interface design.
Fundamentally, it naturally promotes testable interfaces. If you are writing unit tests for your software, this means you are able to test your software, which means it has to be testable. This naturally encourages loose coupling, and this means that unit testing some unit x requires isolating x from all of its collaborators,
and the mere act of having introduced unit testing almost always leads to decoupling of systems within your software, and that's a benefit in its own right. It also encourages the definition and the use of formal interfaces, and by now, I hope you are convinced that I am a really big believer in having good interfaces.
In addition, the use of unit testing can facilitate the discovery of overly complex units, classes and functions, for example, because they're really hard to test. If it turns out that it's really hard to write unit tests for a class or really hard to write unit tests for a function, that suggests that maybe you have something which is too complicated, and it should be broken down further
into units that are easier to test. One of the most important things about unit tests is they provide a safety net for refactoring. We just talked about the fact that if you do a refactoring, after the refactoring you have to make sure you did not change the behavior of the system. If you have an array of unit tests at your disposal,
what you can do is say, okay, I know that I'm passing all my unit tests, and now what I'm going to do is a refactoring. So I do the refactoring, and if I rerun the unit tests and all of them pass, this means I have high confidence that I did not change the behavior of the system. So in your particular scenario,
what this would mean is I already had unit tests for the other component anyway, so when I extract it and I hook it up to use the library, if all of their unit tests continue to pass, I should have high confidence I did not break anything. If those unit tests didn't already exist, then it becomes much dicier. This is one of the best things about unit tests: they make refactoring easier.
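A sketch of what that safety net looks like in C++ (the `trim` function and its tests are invented for illustration; the tests follow the straight-line style discussed below):

```cpp
#include <cassert>
#include <string>

// Unit under test: strips leading and trailing spaces.
std::string trim(const std::string& s) {
    std::size_t b = s.find_first_not_of(' ');
    if (b == std::string::npos) return "";
    std::size_t e = s.find_last_not_of(' ');
    return s.substr(b, e - b + 1);
}

// Straight-line unit tests: no loops, no conditionals, and fast.
// Rerun these after every refactoring of trim(); if they all still
// pass, you have high confidence the behavior did not change.
void test_trim() {
    assert(trim("  hi  ") == "hi");
    assert(trim("hi") == "hi");
    assert(trim("   ") == "");
    assert(trim("") == "");
}
```
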
Unit tests also are executable client documentation for unit users, so if you want to say, look, this is how you use this function, or this is how you use this class, you can point to people and say, look, just take a look at the unit tests. They show you how this interface works in practice. So unit tests sort of obviously test the software,
but these other benefits are really important. So good unit tests set up and run very, very quickly. Michael Feathers argues that a tenth of a second for a test is too long. You want these things to run really, really fast. It needs to be fundamentally painless to run your unit tests
because you want them to be run as frequently as they can be, ideally every time you build. So anything that takes a long time is not going to work. This means you can't have any non-trivial I/O: no network traffic, no database traffic, no file system access, nothing that changes the environment, like having to edit a configuration file or something like that.
All that stuff is way too much work for a unit test. You want unit tests to be as close to instantaneous as you can possibly make them because the goal is that they get run very, very frequently. Unit tests should localize problems. If you get a failure in a unit test, it should only be caused by a really small amount of code, which means that if a unit test fails,
I should be able to go in very, very quickly and find the part of the code that needs to be fixed. Now, unit tests are software. And unit tests themselves ideally don't need unit testing. They should be so obviously right
that you don't really need to test them. And this means that unit tests are really, really simple. They're not complicated at all. In fact, Gerard Meszaros says you shouldn't even have conditionals or loops in test code. It should just be straight-line code. That's ideally what you want your tests to be. Now, there are other kinds of tests that are not so constrained.
Integration tests, load tests, acceptance tests, they don't need to be this simple. But unit tests should be really simple, really focused, and really fast. In terms of the things that are supposed to be tested, only the accessible parts of the unit should have tests, basically the public and the protected methods.
Private methods should not be tested. In fact, if you do unit tests on the private methods of your class, you are going to hinder refactoring, because refactoring typically changes implementation details, which means if you have a bunch of unit tests for your private methods and you do a refactoring,
it would not be surprising if a whole bunch of unit tests start failing. But if a whole bunch of unit tests start failing, you're not gonna wanna make the change, because this means you gotta go and fix a whole bunch of code. So what you wanna do is you wanna test the stuff that is available to clients. Fundamentally, time that you might spend on unit tests for inaccessible methods
would frankly be better spent on unit tests for accessible methods. If you have unit tests set up properly for all of your public and protected methods, then if you break something in a private method, one of those other unit tests is gonna find it, because it will have been discovered through the public or the protected interface.
So the question is, why do we test the protected methods? We test the protected methods because a protected method is available to all of its derived classes, and the number of derived classes in principle is unknowably large. So basically, anybody using a protected method
is a client of your class. What library developers have discovered over time is that practically speaking, there's only two kinds of methods in a hierarchy,
private and everything else. And the reason that's the case is because public methods are the most obvious situation. If I break a public method, if I make a change to a public method so it doesn't work anymore, how much client code might be broken? The answer is, we don't know. Unknowably large amounts, because it could have as many clients as it needs.
Okay, so the comment is that typically when you're testing DLLs, then what you do is you test the public methods, you don't test the protected ones.
I can't speak to testing through DLLs because I don't know things at that level, but what I can say is, the fundamental question is this, if a method breaks, if I do a refactoring or something so that a method breaks, how much client code is broken? And with a public method, an unknowably large amount of client code is broken. Anybody who calls that method could be broken.
And with protected methods, it's the same thing. An unknowably large amount of client code can be broken. So we need therefore to have unit tests for anything that is available to clients. And the stuff that's available to clients is the public stuff and the protected stuff.
This also means, by the way, as class and library implementers, the only protection level that matters is private. There's private and everything else. Does that make sense? Okay. So you would not be the first person who would be disappointed to come to that conclusion.
But many, many library vendors have discovered that once they publish something, for example, in a protected interface, they can't ever change it again because too many people, basically if your class or your library is successful, lots and lots of people start using that because it is available to anybody who wants it.
All right. We want our classes and our units to be testable. We also want things to be encapsulated.
Well, encapsulated things are inaccessible. That's what it means to be encapsulated, which means they're not directly testable. So I've already indicated that if you have a private method, the private method should not have its own unit tests. But clearly you need to exercise the functionality of the private methods. You may decide to refactor a private method, for example.
Presumably, all the functionality in the private method can be gotten to, excuse me, through the public interface or through the protected interface. So if you have come up with a comprehensive set of unit tests for both the public methods and the protected methods, that should exercise all the private methods as well,
which means if you refactor or otherwise change a private method so it doesn't work, one of your other unit tests should fail. If that is not the case, if it turns out that there is a piece of functionality not accessible to you that is encapsulated and you can't test it indirectly
through the public or the protected interface, then that could suggest that you may actually have a design flaw. Maybe what you need to do is break your units down into something simpler so you have smaller abstractions so that you can get at that underlying functionality. So there is no reason you have to abandon encapsulation,
but you do need to recognize you're going to have to test through the encapsulation barrier indirectly. Pardon me? Or it could be dead code. Although, if it's dead code, you should have written unit tests which will actually determine that it's functioning correctly.
So. What's that? Or you could use coverage tools as well. I mean, I think his fundamental point is if you have dead code, then there's no path that ever reaches it, in which case all your tests are always going to succeed. And unit testing is not the best way to identify dead code.
But you're right, coverage tools are a better way to find out that you were not able to reach the code. Ideally what you really want is a tool, ideally a static analysis tool that can prove that you can never reach a particular section of code and it's truly dead. But that may or may not be possible, especially with dynamic binding and dynamically bound function calls.
I want to talk a little bit about concurrency and unit testing. Fundamentally what you need to understand is unit testing was not really designed for concurrent systems. That's not really its strength. Unit tests should be simple. Concurrency tests typically are not simple. However, having said that,
even if you have a system that does have concurrency, presumably you still have parts of the system that execute purely serially. For example, you may have independent background threads that run, or producers and consumers in producer-consumer systems. So although unit tests are not great for establishing that the concurrency is correct,
a concurrent system fundamentally consists of a bunch of serial things executing at the same time. You can at least use unit testing to verify that the sequential things work correctly by themselves. There is some limited concurrency functionality that can be unit tested. So for example, you can demonstrate
that states that should block or unblock threads really do. For example, you could use bounded buffers. So you could have a unit test that fills up a bounded buffer and makes sure that trying to add any more actually blocks, and then you could work with a buffer that is not empty and confirm that somebody
can pull something out of the buffer and continue running. So you can do some limited concurrency testing. I will just remark in passing, Java has a method called getState, which tells you what the state of the thread is, but it's not reliable. As I recall the problem, one of the things is it can't tell the difference
between blocked on something and, I don't remember, I'd have to look up the details, but if you're interested, send me some email. I've just forgotten the details right now. The problem with getState is it can tell you one thing but actually mean something else, due to the way that it's been specified. These kinds of tests for concurrency
are still more complex than most sequential unit tests. So increasingly as we work with multi-threaded systems, we need to be able to perform some testing on them. Just bear in mind that unit tests were never really designed for that and don't do as good a job as we might hope.
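The bounded-buffer check described above might be sketched like this in C++ (the buffer itself is a minimal illustrative implementation, not code from the talk), and it shows why even limited concurrency tests are more involved than sequential ones:

```cpp
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>
#include <thread>

// A minimal bounded buffer: put() blocks when full, take() when empty.
class BoundedBuffer {
public:
    explicit BoundedBuffer(std::size_t cap) : capacity(cap) {}
    void put(int x) {
        std::unique_lock<std::mutex> lk(m);
        not_full.wait(lk, [&] { return q.size() < capacity; });
        q.push(x);
        not_empty.notify_one();
    }
    int take() {
        std::unique_lock<std::mutex> lk(m);
        not_empty.wait(lk, [&] { return !q.empty(); });
        int x = q.front();
        q.pop();
        not_full.notify_one();
        return x;
    }
private:
    std::mutex m;
    std::condition_variable not_full, not_empty;
    std::queue<int> q;
    std::size_t capacity;
};

// Test the blocking contract: a producer trying to put into a full
// buffer must block until a consumer takes an element out.
bool producer_unblocks_after_take() {
    BoundedBuffer buf(1);
    buf.put(1);                                  // buffer is now full
    std::thread producer([&] { buf.put(2); });   // blocks until a take
    int first = buf.take();                      // unblocks the producer
    std::thread consumer([&] { buf.take(); });   // drains the second item
    producer.join();
    consumer.join();
    return first == 1;
}
```

Even this small test needs threads and joins, which is exactly why it is more complex than a straight-line sequential unit test.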
If you're using concurrency, you definitely want to have disciplined static analysis, increasingly there are some pretty good static analysis tools for looking for concurrency errors. You'll also want to use manual code reviews because human beings are still probably better at this than tools are in many cases. You're certainly going to want to do automatic stress testing on concurrent systems.
Especially when you're dealing with concurrent systems, your test design is really critical. You have to make sure, for example, that you have more threads and processes than you actually have cores or processors, or else you may not exercise true concurrency. You've got to make sure that the scheduler doesn't actually run all the threads sequentially. So you might think that you've got this wonderful interleaved concurrent execution
when in fact the thread scheduler just ran them one after the other. And then you have to avoid inadvertently limiting the concurrency in artificial ways. Brian Goetz, in Java Concurrency in Practice, and I do love this quote, says, many benchmarks are, unbeknownst to the developers or users, simply tests of how great a concurrency bottleneck
the random number generator is. So that's not really the kind of scalability result you're looking for, I think, in most cases. Which brings me to the end of talking about unit tests, what their characteristics are, and what they can be used for. And what I want to do now is talk about
the notion of automated unit tests. But it turns out that it's almost time for a break anyway, and this is a good break point. So let's take a break now and we'll continue in 20 minutes. And I will remind you again, they like you to evaluate after every session, so either put in a red card,
a yellow card, or a green card. Thank you.