Tuesday, August 15, 2017

Words I Don’t Use, Part 5: “Wait”

The fifth “word I do not use” is the Oracle technical term wait.

The Oracle Wait Interface

In 1991, Oracle Corporation released some of the most important software instrumentation of all time: the wait statistics that were implemented in Oracle 7.0. Here’s part of the story, in Juan Loaiza’s words, as told in Nørgaard et. al (2004), Oracle Insights: Tales of the Oak Table.
This stuff was developed because we were running a benchmark that we could not get to perform. We had spent several weeks trying to figure out what was happening with no success. The symptoms were clear—the system was mostly idle—we just couldn’t figure out why.

We looked at the statistics and ratios and kept coming up with theories, the trouble was that none of them were right. So we wasted weeks tuning and fixing things that were not the problem. Finally we ran out of ideas and were forced to go back and instrument the code to figure out what the problem was.

Once the waits were instrumented the problem was diagnosed in minutes. We were having “free buffer” waits because the DBWR was not writing blocks fast enough. It’s amazing how hard that was to figure out with statistics, and how easy it was to figure out once the waits were instrumented.

...In retrospect a lot of the names could be greatly improved. The wait interface was added after the freeze date as a “stealth” project so it did not get as well thought through as it should have. Like I said, we were just trying to solve a problem in the course of a benchmark. The trouble is that so many people use this stuff now that if you change the names it will break all sorts of thing tools, so we have to leave them alone.
Before Juan’s team added this code, the Oracle kernel would show you only how much time its user calls (like parse, exec, and fetch) were taking. The new instrumentation, which included a set of new fixed views like v$session_wait and new WAIT lines in our trace files, showed how much time Oracle’s system calls (like reads, writes, and semops) were taking.

The Working-Waiting Model

The wait interface begat a whole new mental model about Oracle performance, based on the principle of working versus waiting:
Response Time = Service Time + Wait Time
In this formula, Oracle defines service time as the duration of the CPU used by your Oracle session (the duration Oracle spent working), and wait time as the sum of the durations of your Oracle wait events (the duration that Oracle spent waiting). Of course, response time in this formula means the duration spent inside the Oracle Database kernel.

Why I Don’t Say Wait, Part 1

There are two reasons I don’t use the word wait. The first is simply that it’s ambiguous.

The Oracle formula is okay for talking about database time, but the scope of my attention is almost never just Oracle’s response time—I’m interested in the business’s response time. And when you think about the whole stack (which, of course you do; see holistic), there are events we could call wait events all the way up and down:
  • The customer waits for an answer from a user.
  • The user waits for a screen from the browser.
  • The browser waits for an HTML page from the application server.
  • The application server waits for a database call from the Oracle kernel.
  • The Oracle kernel waits for a system call from the operating system.
  • The operating system’s I/O request waits to clear the device’s queue before receiving service.
  • ...
If I say waits, the users in the room will think I’m talking about application response time, the Oracle people will think I’m talking about Oracle system calls, and the hardware people will think I’m talking about device queueing delays. Even when I’m not.

Why I Don’t Say Wait, Part 2

There is a deeper problem with wait than just ambiguity, though. The word wait invites a mental model that actually obscures your thinking about performance.

Here’s the problem: waiting sounds like something you’d want to avoid, and working sounds like something you’d want more of. Your program is waiting?! Unacceptable. You want it to be working. The connotations of the words working and waiting are unavoidable. It sounds like, if a program is waiting a lot, then you need to fix it; but if it’s working a lot, then it is probably okay. Right?

Actually, no.

The connotations “work is virtuous” and “waits are abhorrent” are false connotations in Oracle. One is not inherently better or worse than the other. Working and waiting are not accurate value judgments about Oracle software. On the contrary, they’re not even meaningful; they’re just arbitrary labels. We could just as well have been taught to say that an Oracle program is “working on disk I/O” and “waiting to finish its CPU instructions.”

The terms working and waiting really just refer to different subroutine call types:

“Oracle is workingmeans“your Oracle kernel process is executing a user call”
“Oracle is waitingmeans“your Oracle kernel process is executing a system call”

The working-waiting model implies a distinction that does not exist, because these two call types have equal footing. One is no worse than the other, except by virtue of how much time it consumes. It doesn’t matter whether a program is working or waiting; it only matters how long it takes.

Working-Waiting Is a Flawed Analogy

The working-waiting paradigm is a flawed analogy. I’ll illustrate. Imagine two programs that consume 100 seconds apiece when you run them:

Program AProgram B
DurationCall typeDurationCall type
98system calls (waiting)98user calls (working)
2user calls (working)2system calls (waiting)
100Total100Total

To improve program A, you should seek to eliminate unnecessary system calls, because that’s where most of A’s time has gone. To improve B, you should seek to eliminate unnecessary user calls, because that’s where most of B’s time has gone. That’s it. Your diagnostic priority shouldn’t be based on your calls’ names; it should be based solely on your calls’ contributions to total duration. Specifically, conclusions like, “Program B is okay because it doesn’t spend much time waiting,” are false.

A Better Model

I find that discarding the working-waiting model helps people optimize better. Here’s how you can do it. First, understand the substitute phrasing: working means executing a user call; and waiting means executing a system call. Second, understand that the excellent ideas people use to optimize other software are excellent ideas for optimizing Oracle, too:
  1. Any program’s duration is a function of all of its subroutine call durations (both user calls and system calls), and
  2. A program is running as fast as possible only when (1) its unnecessary calls have been eliminated, and (2) its necessary calls are running at hardware speed.
Oracle’s wait interface is vital because it helps us measure an Oracle program’s complete execution duration—not just Oracle’s user calls, but its system calls as well. But I avoid saying wait to help people steer clear of the incorrect bias introduced by the working-waiting analogy.

Monday, August 7, 2017

Words I Don’t Use, Part 4: “Expert”

The fourth “word I do not use” is expert.

When I was a young boy, my dad would sometimes drive me to school. It was 17 miles of country roads and two-lane highways, so it gave us time to talk.

At least once a year, and always on the first day of school, he would tell me, “Son, there are two answers to every test question. There’s the correct answer, and there’s the answer that the teacher expects. ...They’re not always the same.”

He would continue, “And I expect you to know them both.”

He wanted me to make perfect grades, but he expected me to understand my responsibility to know the difference between authority and truth. My dad thus taught me from a young age to be skeptical of experts.

The word expert always warns me of a potentially dangerous type of thinking. The word is used to confer authority upon the person it describes. But it’s ideas that are right or wrong; not people. You should evaluate an idea on its own merit, not on the merits of the person who conveys it. For every expert, there is an equal and opposite expert; but for every fact, there is not necessarily an equal and opposite fact.

A big problem with expert is corruption—when self-congratulators hijack the label to confer authority upon themselves. But of course, misusing the word erodes the word. After too much abuse within a community, expert makes sense only with finger quotes. It becomes a word that critical thinkers use only ironically, to describe people they want to avoid.

Saturday, July 29, 2017

Words I Don’t Use, Part 3: “Best Practice”

The third “word I do not use” is best practice.

The “best practice” serves a vital need in any industry. It is the answer to, “Please don’t make me learn about this; just tell me what to do.” The “best practice” is a fine idea in spirit, but here’s the thing: many practices labeled “best” don’t deserve the adjective. They’re often containers for bad advice.

The most common problem with “best practices” is that they’re not parameterized like they should be. A good practice usually depends on something: if this is true, then do that; otherwise, do this other thing. But most “best practices” don’t come with conditions of execution—they often contain no if statements at all. They come disguised as recipes that can save you time, but they often encourage you to skip past thinking about things that you really ought to be thinking about.

Most of my objections to “best practices” go away when the practices being prescribed are actually good. But the ones I see are often not, like the old SQL “avoid full-table scans” advice. Enforcing practices like this yields applications that don’t run as well as they should and developers that don’t learn the things they should. Practices like “Measure the efficiency of your SQL at every phase of the software life cycle,” are actually “best”-worthy, but alas, they’re less popular because they sound like real work.

Thursday, July 27, 2017

Words I Don’t Use, Part 2: “Holistic”

The second “word I do not use” is holistic.

When people use the word “holistic” in my industry (Oracle), it means that they’re paying attention to not just an individual subcomponent of a system, but to a whole system, including (I hope) even the people it serves.

But trying to differentiate technology services by saying “we take a holistic view of your system” is about like differentiating myself by saying I’ll wear clothes to work. Saying “holistic” would make it look like I’ve only just recently become aware that optimizing a system’s individual subsystems is not a reliable way to optimize the system itself. This should not be a distinctive revelation.

Friday, July 21, 2017

Words I Don’t Use, Part 1: “Methodology”

Today, I will begin a brief sequence of blog posts that I’ll call “Words I Don’t Use.” I hope you’ll have some fun with it, and maybe find a thought or two worth considering.

The first word I’ll discuss is methodology. Yes. I made a shirt about it.

Approximately 100% of the time in the [mostly non-scientific] Oracle world that I live in, when people say “methodology,” they’re using it in the form that American Heritage describes as a pretentious substitute for “method.” But methodology is not the same as method. Methods are processes. Sequences of steps. Methodology is the scientific study of methods.

I like this article called “Method versus Methodology” by Peter Klein, which cites the same American Heritage Dictionary passage that I quoted on page 358 of Optimizing Oracle Performance.

Thursday, May 19, 2016

Messed-Up App of the Day: Tables of Numbers

Quick, which database is the biggest space consumer on this system?
Database                  Total Size   Total Storage
-------------------- --------------- ---------------
SAD99PS                    635.53 GB         1.24 TB
ANGLL                        9.15 TB         18.3 TB
FRI_W1                       2.14 TB         4.29 TB
DEMO                         6.62 TB        13.24 TB
H111D16                      7.81 TB        15.63 TB
HAANT                         1.1 TB          2.2 TB
FSU                          7.41 TB        14.81 TB
BYNANK                       2.69 TB         5.38 TB
HDMI7                      237.68 GB       476.12 GB
SXXZPP                     598.49 GB         1.17 TB
TPAA                         1.71 TB         3.43 TB
MAISTERS                   823.96 GB         1.61 TB
p17gv_data01.dbf            800.0 GB         1.56 TB
It’s harder than it looks.

Did you come up with ANGLL? If you didn’t, then you should look again. If you did, then what steps did you have to execute to find the answer?

I’m guessing you did something like I did:
  1. Skim the entire list. Notice that HDMI7 has a really big value in the third column.
  2. Read the column headings. Parse the difference in meaning between “size” and “storage.” Realize that the “storage” column is where the answer to a question about space consumption will lie.
  3. Skim the “Total Storage” column again and notice that the wide “476.12” number I found previously has a GB label beside it, while all the other labels are TB.
  4. Skim the table again to make sure there’s no PB in there.
  5. Do a little arithmetic in my head to realize that a TB is 1000× bigger than a GB, so 476.12 is probably not the biggest number after all, in spite of how big it looked.
  6. Re-skim the “Total Storage” column looking for big TB numbers.
  7. The biggest-looking TB number is 15.63 on the H111D16 row.
  8. Notice the trap on the ANGLL row that there are only three significant digits showing in the “18.3” figure, which looks physically the same size as the three-digit figures “1.24” and “4.29” directly above and below it, but realize that 18.3 (which should have been rendered “18.30”) is an order of magnitude larger.
  9. Skim the column again to make sure I’m not missing another such number.
  10. The answer is ANGLL.
That’s a lot of work. Every reader who uses this table to answer that question has to do it.

Rendering the table differently makes your readers’ (plural!) job much easier:
Database          Size (TB)  Storage (TB)
----------------  ---------  ------------
SAD99PS                 .64          1.24
ANGLL                  9.15         18.30
FRI_W1                 2.14          4.29
DEMO                   6.62         13.24
H111D16                7.81         15.63
HAANT                  1.10          2.20
FSU                    7.41         14.81
BYNANK                 2.69          5.38
HDMI7                   .24           .48
SXXZPP                  .60          1.17
TPAA                   1.71          3.43
MAISTERS                .82          1.61
p17gv_data01.dbf        .80          1.56
This table obeys an important design principle:
The amount of ink it takes to render each number is proportional to its relative magnitude.
I fixed two problems: (i) now all the units are consistent (I have guaranteed this feature by adding unit label to the header and deleting all labels from the rows); and (ii) I’m showing the same number of significant digits for each number. Now, you don’t have to do arithmetic in your head, and now you can see more easily that the answer is ANGLL, at 18.30 TB.

Let’s go one step further and finish the deal. If you really want to make it as easy as possible for readers to understand your space consumption problem, then you should sort the data, too:
Database          Size (TB)  Storage (TB)
----------------  ---------  ------------
ANGLL                  9.15         18.30
H111D16                7.81         15.63
FSU                    7.41         14.81
DEMO                   6.62         13.24
BYNANK                 2.69          5.38
FRI_W1                 2.14          4.29
TPAA                   1.71          3.43
HAANT                  1.10          2.20
MAISTERS                .82          1.61
p17gv_data01.dbf        .80          1.56
SAD99PS                 .64          1.24
SXXZPP                  .60          1.17
HDMI7                   .24           .48
Now, your answer comes in a glance. Think back at the comprehension steps that I described above. With the table here, you only need:
  1. Notice that the table is sorted in descending numerical order.
  2. Comprehend the column headings.
  3. The answer is ANGLL.
As a reader, you have executed far less code path in your brain to completely comprehend the data that the author wants you to understand.

Good design is a topic of consideration. And even conservation. If spending 10 extra minutes formatting your data better saves 1,000 readers 2 minutes each, then you’ve saved the world 1,990 minutes of wasted effort.

But good design is also a very practical matter for you personally, too. If you want your audience to understand your work, then make your information easier for them to consume—whether you’re writing email, proposals, reports, infographics, slides, or software. It’s part of the pathway to being more persuasive.

Friday, May 13, 2016

Fail Fast

Among movements like Agile, Lean Startup, and Design Thinking these days, you hear the term fail fast. The principle of failing fast is vital to efficiency, but I’ve seen project managers and business partners be offended or even agitated by the term fail fast. I’ve seen it come out like, “Why the hell would I want to fail fast?! I don’t want to fail at all.” The implication, of course: “Failing is for losers. If you’re planning to fail, then I don’t want you on my team.”

I think I can help explain why the principle of “fail fast” is so important, and maybe I can help you explain it, too.

Software developers know about fail fast already, whether they realize it or not. Yesterday was a prime example for me. It was a really long day. I didn’t leave my office until after 9pm, and then I turned my laptop back on as soon as I got home to work another three hours. I had been fighting a bug all afternoon. It was a program that ran about 90 seconds normally, but when I tried a code path that should have been much faster, I could let it run 50 times that long and it still wouldn’t finish.

At home, I ran it again and left it running while I watched the Thunder beat the Spurs, assuming the program would finish eventually, so I could see the log file (which we’re not flushing often enough, which is another problem). My MacBook Pro ran so hard that the fan compelled my son to ask me why my laptop was suddenly so loud. I was wishing the whole time, “I wish this thing would fail faster.” And there it is.

When you know your code is destined to fail, you want it to fail faster. Debugging is hard enough as it is, without your stupid code forcing you to wait an hour just to see your log file, so you might gain an idea of what you need to go fix. If I could fail faster, I could fix my problem earlier, get more work done, and ship my improvements sooner.

But how does that relate to wanting my business idea to fail faster? Well, imagine that a given business idea is in fact destined to fail. When would you rather find out? (a) In a week, before you invest millions of dollars and thousands of hours investing into the idea? Or (b) In a year, after you’ve invested millions of dollars and thousands of hours?

I’ll take option (a) a million times out of a million. It’s like asking if I’d like a crystal ball. Um, yes.

The operative principle here is “destined to fail.” When I’m fixing a reported bug, I know that once I create reproducible test case for that bug, my software will fail. It is destined to fail on that test case. So, of course, I want for my process of creating the reproducible test case, my software build process, and my program execution itself to all happen as fast as possible. Even better, I wish I had come up with the reproducible test case a year or two ago, so I wouldn’t be under so much pressure now. Because seeing the failure earlier—failing fast—will help me improve my product earlier.

But back to that business idea... Why would you want a business idea to fail fast? Why would you want it to fail at all? Well, of course, you don’t want it to fail, but it doesn’t matter what you want. What if it is destined to fail? It’s really important for you to know that. So how can you know?

Here’s a little trick I can teach you. Your business idea is destined to fail. It is. No matter how awesome your idea is, if you implement your current vision of some non-trivial business idea that will take you, say, a month or more to implement, not refining or evolving your original idea at all, your idea will fail. It will. Seriously. If your brain won’t permit you to conceive of this as a possibility, then your brain is actually increasing the probability that your idea will fail.

You need to figure out what will make your idea fail. If you can’t find it, then find smart people who can. Then, don’t fear it. Don’t try to pretend that it’s not there. Don’t work for a year on the easy parts of your idea, delaying the inevitable hard stuff, hoping and praying that the hard stuff will work its way out. Attack that hard stuff first. That takes courage, but you need to do it.

Find your worst bottleneck, and make it your highest priority. If you cannot solve your idea’s worst problem, then get a new idea. You’ll do yourself a favor by killing a bad idea before it kills you. If you solve your worst problem, then find the next one. Iterate. Shorter iterations are better. You’re done when you’ve proven that your idea actually works. In reality. And then, because life keeps moving, you have to keep iterating.

That’s what fail fast means. It’s about shortening your feedback loop. It’s about learning the most you can about the most important things you need to know, as soon as possible.

So, when I wish you fail fast, it’s a blessing; not a curse.