The Dew Review – RavenDB 2.x Beginner’s Guide

I just finished reading Packt Publishing’s RavenDB 2.x Beginner’s Guide by Khaled Tannir. I haven’t used RavenDB in a project before, so when I was asked to review the book, I jumped at the opportunity. NoSQL in general, and RavenDB in particular, is something I have been meaning to start learning.


I really liked the format of the book. Each section starts with a brief introduction of the topic, continues with a step-by-step set of instructions complete with code snippets and/or screenshots, and finishes up with a deeper explanation of what was done and what happened behind the scenes. The instructions part of each section is titled “Time for action”, the deeper dive is titled “What just happened?”, and some sections also have a “Have a go hero” challenge. These challenges give the reader a more advanced task to perform based on the one just completed and explained. Most of these challenges include some tips to get you started.

In some of the more introductory sections, the format felt a little repetitive, but it’s easy enough to skim through those parts if you’re comfortable with them already. For the more advanced topics, it’s a great way to reinforce the material.

The book begins with an overview of RavenDB, covers the basics of NoSQL at a conceptual level, and compares and contrasts its strengths with those of relational databases. Next it moves into the Management Studio: getting it installed and running, plus an overview of what can be done in the Studio. The next several chapters focus on using RavenDB within .NET and Visual Studio. Indexes, queries, and documents are all covered at a good level of detail. Chapters 7 through 10 cover less code-focused aspects of RavenDB, including deployment, scaling and profiling. There is a chapter on accessing RavenDB via a RESTful interface over HTTP rather than through the .NET API, and the book finishes with a “Putting it all together” chapter where the author walks through building an ASP.NET MVC application with RavenDB as the data source.

The book is well written, well organized, and an all-around good read. I think it targets a large number of developers – those who are experienced in .NET but have little or no exposure to NoSQL or RavenDB. If you fall into that category, I highly recommend picking up this title. When you see that Oren Eini, the main man behind RavenDB, is one of the reviewers, you know it’s going to be a technically solid tutorial.

 

The RavenDB Indexing Process

By Itamar Syn-Hershko, author of RavenDB in Action

Indexes play a crucial part in answering queries. Without them, it is impossible to find data on anything other than the document ID, and, therefore, RavenDB becomes just a bloated key/value store. Indexes are the piece of the puzzle that allows rich queries for data to be satisfied efficiently. In this article, based on chapter 3 of RavenDB in Action, author Itamar Syn-Hershko explains how indexing works in RavenDB.

Get RavenDB in Action for 50% off through June 30, 2013 by using promo code RAVDBAA at checkout.

As a document database, RavenDB has a dedicated storage for documents—where documents are stored and pulled from when accessed. This is the heart of RavenDB, and what we call the Document Store. When we stored and updated documents in the previous chapter, we were working directly against the Document Store.

The Document Store has one important feature – it is very efficient in pulling documents out by their ID. However, this is also its only feature, and the only way it can find documents. It can only have one key for a document, and that key is always the document ID; documents cannot be retrieved based on any other criteria.

When you need to pull documents out of the Document Store based on some search criteria other than their ID, the Document Store itself becomes useless. To retrieve documents by some other properties they have, you need indexes. Those indexes are stored separately from the documents themselves, in what we call the Index Store.
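To make the limitation concrete, here is a minimal sketch that treats the Document Store as nothing more than a dictionary keyed by document ID. It is an illustration only and not RavenDB’s storage code; the Book class, the method names, the IDs, and the sample data are all made up for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical document type, used only for this illustration.
public class Book
{
    public string Title { get; set; }
    public string Author { get; set; }
}

public static class DocumentStoreSketch
{
    // Stand-in for the Document Store: the document ID is the only key.
    public static Dictionary<string, Book> CreateStore()
    {
        return new Dictionary<string, Book>
        {
            ["books/1"] = new Book { Title = "First Book", Author = "Author A" },
            ["books/2"] = new Book { Title = "Second Book", Author = "Author B" },
            ["books/3"] = new Book { Title = "Third Book", Author = "Author A" },
        };
    }

    // Lookup by ID: a single keyed access, which is what the store is good at.
    public static Book LoadById(Dictionary<string, Book> store, string id)
    {
        Book book;
        return store.TryGetValue(id, out book) ? book : null;
    }

    // Lookup by any other property: with no index, every document is inspected.
    public static List<string> FindIdsByAuthor(Dictionary<string, Book> store, string author)
    {
        return store.Where(pair => pair.Value.Author == author)
                    .Select(pair => pair.Key)
                    .OrderBy(id => id)
                    .ToList();
    }

    public static void Main()
    {
        var store = CreateStore();
        Console.WriteLine(LoadById(store, "books/2").Title);
        Console.WriteLine(string.Join(", ", FindIdsByAuthor(store, "Author A")));
    }
}
```

The second method still returns the right answer, but only by visiting every entry, which is exactly the cost an index is meant to avoid.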

In this article, we will discuss indexes and the indexing process in RavenDB. It is important to understand what indexing is and why it is needed before making actual use of it.

The indexing process

Let’s assume for a moment that all we have in our database is the Document Store, with a couple of million documents in it, and that we now get a user query we need to answer. The Document Store by itself can’t really help us, as the query doesn’t have the document IDs in it. What do we do now?

One option is to go through all the documents in the system and check them one by one to see if they match the query. This is going to work, sure, if the user who issued the query is kind enough to wait for a few hours in a large system. But no user is. In order to efficiently satisfy user queries, we need to have our data indexed. By using indexes, the software can perform searches much more efficiently and complete queries much faster.

Let’s consider for a moment when the indexes are going to be built or updated with the new documents that came in. If we calculate them when the user issues the query, we again delay returning the results. The delay is much less substantial than going over all the documents, but it is still a performance hit we impose on the user for every query they make.

Another, perhaps more sensible, option is to update the indexes when the user puts new documents. This makes more sense at first, but when you start to consider what it would take to update several complex indexes on every put, it becomes much less attractive. In real systems, this means writes would take quite a lot of time, as not only is the document written, but all the indexes have to be updated as well. There is also the question of transactions: what happens when a failure occurs while the indexes are being updated? Should it fail the transaction?

With RavenDB, a conscious design decision was made to not cause any wait due to indexing. There should be no wait at all, never when you ask for data, and also never during other operations—like adding new documents to the store.

So when are indexes updated?

Updating indexes

RavenDB has a background process that is handed new documents and document updates as they come in, right after they were stored in the Document Store, and it passes them in batches through all the indexes in the system. For write operations, the user gets an immediate confirmation on their transaction—even before the indexing process started processing these updates—without waiting for indexing, but being 100 percent certain the changes were recorded in the database. Queries do not wait for indexing either—they just use the indexes that exist at the time the query was issued. This ensures both smooth operation on all fronts, and that no documents are left behind.

This is shown in figure 1.


Figure 1 RavenDB’s background indexing process does not affect response time for either updates or queries.

It all sounds suspiciously good, doesn’t it? Obviously, there is a catch. Since indexing is done in the background, when enough data comes in, that process can take a while to complete. This means it may take a while for new documents to appear in query results. While RavenDB is highly optimized to minimize such cases, they can still happen, and when they do we say the index results are stale. This is by design, and we discuss the implications of that at the end of this section.
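The trade-off can be sketched in a few lines. The class below is a toy model under loud assumptions: it is not RavenDB’s implementation, every name in it is hypothetical, the "background process" is an ordinary method we call by hand so the behavior is deterministic, and each document ID is assumed to be put only once.

```csharp
using System;
using System.Collections.Generic;

public class BackgroundIndexingSketch
{
    // Document Store stand-in: document ID -> author.
    private readonly Dictionary<string, string> documents = new Dictionary<string, string>();
    // IDs that were stored but not yet passed through the index.
    private readonly Queue<string> pending = new Queue<string>();
    // Index Store stand-in: author -> document IDs.
    private readonly Dictionary<string, List<string>> authorIndex = new Dictionary<string, List<string>>();

    // A write returns as soon as the document is stored; indexing is only queued.
    public void Put(string id, string author)
    {
        documents[id] = author;
        pending.Enqueue(id);
    }

    // A query uses the index as it exists right now and reports whether it is stale.
    public List<string> QueryByAuthor(string author, out bool isStale)
    {
        isStale = pending.Count > 0;
        List<string> ids;
        return authorIndex.TryGetValue(author, out ids) ? new List<string>(ids) : new List<string>();
    }

    // Stand-in for the background process: index one batch of pending documents.
    public void RunIndexingBatch(int batchSize)
    {
        for (int i = 0; i < batchSize && pending.Count > 0; i++)
        {
            string id = pending.Dequeue();
            string author = documents[id];
            List<string> ids;
            if (!authorIndex.TryGetValue(author, out ids))
            {
                ids = new List<string>();
                authorIndex[author] = ids;
            }
            ids.Add(id);
        }
    }

    public static void Main()
    {
        var db = new BackgroundIndexingSketch();
        db.Put("books/1", "Author A");
        db.Put("books/2", "Author A");

        bool stale;
        int before = db.QueryByAuthor("Author A", out stale).Count;
        Console.WriteLine(before + " result(s), stale: " + stale);

        db.RunIndexingBatch(10);
        int after = db.QueryByAuthor("Author A", out stale).Count;
        Console.WriteLine(after + " result(s), stale: " + stale);
    }
}
```

The first query answers immediately with zero results and flags them as stale; after the batch runs, the same query sees both documents. Neither the writes nor the queries ever waited for the indexing step.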

What is an index?

Consider the following list of books:

[Image: the list of books]

If I asked you for the price of the book written by J.K. Rowling, or to name all the books with more than 600 pages, how would you find the answer? Going through the entire list is not too cumbersome when there are only 10 books in it, but it becomes a problem rather quickly as the list grows.

An index is just a way to help us answer such questions more quickly. It is all about making a list of all possible values grouped by their context, and ordering it alphabetically. As a result, the list of books from above becomes the following lists of values, each value accompanied by the book number it was taken from:


Figure 2 A list of books (left) and lists of all possible searchable values, grouped by context

Since the values are grouped by context (a Title, an Author name, and so on), and are sorted lexicographically, it is now rather easy to find a book by any of those values even if we had millions of them. You simply go to the appropriate list (say, Author Names) and look the value up; since the lists are lexicographically sorted, this can be done rather efficiently. Once the value has been found in the list, the book number that is associated with it is returned, and can be used to get the actual book if you need more information on it.
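Here is a small sketch of that idea: one sorted list of (value, book number) pairs per context, searched with a binary search. It is a deliberately simplified model and not how Lucene.NET stores its data; the Book class, the method names, and the four sample books are assumptions made up for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical book record, used only for this illustration.
public class Book
{
    public int Number { get; set; }
    public string Author { get; set; }
    public int Pages { get; set; }
}

public static class SortedIndexSketch
{
    // One index per context: (value, book number) pairs ordered by value.
    public static List<KeyValuePair<TValue, int>> BuildIndex<TValue>(
        IEnumerable<Book> books, Func<Book, TValue> selector)
    {
        return books
            .Select(b => new KeyValuePair<TValue, int>(selector(b), b.Number))
            .OrderBy(e => e.Key, Comparer<TValue>.Default)
            .ToList();
    }

    // Binary search for the position of the first entry whose value is >= target.
    private static int LowerBound<TValue>(List<KeyValuePair<TValue, int>> index, TValue target)
    {
        var comparer = Comparer<TValue>.Default;
        int lo = 0, hi = index.Count;
        while (lo < hi)
        {
            int mid = lo + (hi - lo) / 2;
            if (comparer.Compare(index[mid].Key, target) < 0) lo = mid + 1;
            else hi = mid;
        }
        return lo;
    }

    // Exact match: book numbers whose value equals the target.
    public static List<int> FindEqual<TValue>(List<KeyValuePair<TValue, int>> index, TValue target)
    {
        var comparer = Comparer<TValue>.Default;
        var result = new List<int>();
        for (int i = LowerBound(index, target);
             i < index.Count && comparer.Compare(index[i].Key, target) == 0;
             i++)
        {
            result.Add(index[i].Value);
        }
        return result;
    }

    // Range match: book numbers whose value is strictly greater than the target.
    public static List<int> FindGreaterThan(List<KeyValuePair<int, int>> index, int target)
    {
        int start = LowerBound(index, target + 1);
        return index.GetRange(start, index.Count - start).Select(e => e.Value).ToList();
    }

    public static List<Book> SampleBooks()
    {
        return new List<Book>
        {
            new Book { Number = 1, Author = "Author A", Pages = 320 },
            new Book { Number = 2, Author = "Author B", Pages = 870 },
            new Book { Number = 3, Author = "Author A", Pages = 610 },
            new Book { Number = 4, Author = "Author C", Pages = 600 },
        };
    }

    public static void Main()
    {
        var books = SampleBooks();
        var authorIndex = BuildIndex(books, b => b.Author);
        var pagesIndex = BuildIndex(books, b => b.Pages);

        Console.WriteLine(string.Join(", ", FindEqual(authorIndex, "Author A")));
        Console.WriteLine(string.Join(", ", FindGreaterThan(pagesIndex, 600)));
    }
}
```

Both lookups touch only a logarithmic number of entries before reading off the matches, and what they return are book numbers, which can then be used to fetch the full records.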

Surprisingly, the process of creating an index like that is called indexing. RavenDB uses Lucene.NET as its indexing mechanism. Lucene.NET is the .NET port of the popular open-source search engine library Lucene. Originally written in Java and first released in 2000, Lucene is the leading open-source search engine library. It is being used by big names like Twitter, LinkedIn, and other online services to make their content searchable, and is constantly being improved to be made faster and better.

Summary

Having a scalable key/value store database is nice, but indexes are what really make RavenDB so special. Indexes make querying possible and efficient, and the more flexible indexes are, the more querying possibilities you have.

In this article, we laid the basics for understanding indexes in RavenDB and became familiar with RavenDB’s novel approach to indexing.

Here are some other Manning titles you might be interested in:

MongoDB in Action

Kyle Banker

Java Persistence with Hibernate, Second Edition

Christian Bauer, Gavin King, and Gary Gregory

Neo4j in Action

Jonas Partner, Aleksa Vukotic, and Nicki Watt

 


The Fundamental Aims of Asynchrony


C# in Depth, Third Edition

By Jon Skeet

Asynchrony has been a thorn in the side of developers for years. It’s been known to be useful as a way of avoiding tying up a thread while waiting for some arbitrary task to complete, but it’s also been a pain in the neck to implement correctly. In this article, based on chapter 15 of C# in Depth, Third Edition, author Jon Skeet explains the purpose of asynchrony in C#.

At the time of this writing, I’ve been playing with async/await for about two years, and it still makes me feel like a giddy schoolboy. I firmly believe it will do for asynchrony what LINQ did for data handling when C# 3 came out—except that dealing with asynchrony was a far harder problem.

Even within the .NET framework (which is still relatively young in the grand scheme of things), we’ve had three different models to try to make things simpler:

  • The BeginFoo/EndFoo approach from .NET 1.x, using IAsyncResult and AsyncCallback to propagate results
  • The event-based asynchronous pattern from .NET 2.0, as implemented by BackgroundWorker and WebClient
  • The Task Parallel Library (TPL) introduced in .NET 4, but then expanded in .NET 4.5

Despite its generally excellent design, writing robust and readable asynchronous code with the TPL was hard. While the support for parallelism was great, there are some aspects of general asynchrony that are simply much better fixed in a language instead of purely in libraries.

The main feature of C# 5 builds on the TPL so that you can write synchronous-looking code that uses asynchrony where appropriate. Gone is the spaghetti of callbacks, event subscriptions and fragmented error handling; instead, asynchronous code expresses its intentions clearly, and in a form that builds on the structures developers are already familiar with. A new language construct allows you to “await” an asynchronous operation. This “awaiting” looks very much like a normal blocking call, in that the rest of your code won’t continue until the operation has completed, but it manages to do this without actually blocking the currently executing thread. Don’t worry if that statement sounds completely contradictory.

The .NET framework has embraced asynchrony wholeheartedly in version 4.5, exposing asynchronous versions of a great many operations, following a newly documented task-based asynchronous pattern to give a consistent experience across multiple APIs. The WinRT framework used to create applications for Windows 8 enforces asynchrony for all long-running (or potentially long-running) operations. In short, the future is asynchronous and you’d be foolish not to take advantage of the new language features when trying to manage the additional complexity.

Just to be clear, C# hasn’t become omniscient, guessing where you might want to perform operations concurrently or asynchronously. The compiler is smart, but it doesn’t even attempt to remove the inherent complexity of asynchronous execution. You still need to think carefully, but the beauty of C# 5 is that all the tedious and confusing boilerplate code that used to be required has gone. Without the distraction of all the fluff required just to make your code asynchronous to start with, you can concentrate on the hard bits.

A word of warning: this topic is reasonably advanced. It has the unfortunate properties of being incredibly important (realistically, even entry-level developers will need to have a passing understanding of it in a few years) but also quite tricky to get your head round to start with. I’m not going to shy away from the complexity; we’ll look at what’s going on in a fair amount of detail.

It’s just possible that I may temporarily break your brain a little, before hopefully putting it back together again later on. If it all starts sounding a little crazy, don’t worry—it’s not just you; bafflement is an entirely natural reaction. The good news is that when you’re using C# 5, it all makes sense on the surface. It’s only when you try to think of exactly what’s going on behind the scenes that things get tough.

Let’s get started!

Introducing asynchronous functions

C# 5 introduces the concept of an asynchronous function. This is always either a method or an anonymous function[1] which is declared with the async modifier, and can include await expressions. These await expressions are the points where things get interesting from a language perspective: if the value that the expression is awaiting isn’t available yet, the asynchronous function will return immediately, and then continue where it left off (in the “right” thread) when the value becomes available. The natural flow of “don’t execute the next statement until this one has completed” is still maintained, but without blocking.

We’ll break that woolly description down into more concrete terms and behavior later on, but you really need to see an example of it before it’s likely to make any sense.

First encounters of the asynchronous kind

Let’s start with something very simple, but which demonstrates asynchrony in a practical way. We often curse network latency for causing delays in our real applications, but it does make it easy to show why asynchrony is so important, as we can see in listing 1.

Listing 1 Displaying a page length asynchronously

class AsyncForm : Form
{
    Label label;
    Button button;

    public AsyncForm()
    {
        label = new Label { Location = new Point(10, 20), Text = "Length" };
        button = new Button { Location = new Point(10, 50), Text = "Click" };
        button.Click += DisplayWebSiteLength;  // #1
        AutoSize = true;
        Controls.Add(label);
        Controls.Add(button);
    }

    async void DisplayWebSiteLength(object sender, EventArgs e)
    {
        label.Text = "Fetching...";
        HttpClient client = new HttpClient();  // #2
        string text = await client.GetStringAsync("http://csharpindepth.com");  // #2
        label.Text = text.Length.ToString();  // #3
    }
}

Application.Run(new AsyncForm());

#1 Wires up event handler

#2 Starts fetching the page

#3 Updates the UI

The first part of listing 1 simply creates the UI and hooks up an event handler for the button in a straightforward way. It’s the DisplayWebSiteLength method that we’re interested in. When you click on the button, the text of the book’s home page is fetched, and the label is updated to display the HTML length in characters.

I could have written a smaller example program as a console app, but hopefully this makes a more convincing demo. In particular, if you remove the async and await contextual keywords, change HttpClient to WebClient, and change GetStringAsync to DownloadString, the code will still compile and work… but the UI will freeze while it fetches the contents of the page.[2] If you run the async version (ideally over a slow network connection), you’ll see that the UI is responsive; you can still move the window around while the web page is being fetched.

Most developers are familiar with the two golden rules of threading in Windows Forms development:

  • Don’t perform any time-consuming action on the UI thread
  • Don’t access any UI controls other than on the UI thread

These are easier to state than to obey. As an exercise, you might want to try a few different ways of creating similar code to listing 1 without using the new features of C# 5. For this extremely simple example, it’s not actually too bad to use the event-based WebClient.DownloadStringAsync method, but as soon as more complex flow control (error handling, waiting for multiple pages to complete, and so on) comes into the equation, the “legacy” code quickly becomes hard to maintain, whereas the C# 5 code can be modified in a natural way.

At this point, the DisplayWebSiteLength method feels somewhat magical: we know it does what we need it to, but we have no idea how. Let’s take it apart just a little bit, saving the really gory details for later.

Breaking down the first example

First I’ll start by expanding the method very slightly – splitting the call to HttpClient.GetStringAsync from the await expression to highlight the types involved:

async void DisplayWebSiteLength(object sender, EventArgs e)
{
    label.Text = "Fetching...";
    HttpClient client = new HttpClient();
    Task<string> task = client.GetStringAsync("http://csharpindepth.com");
    string text = await task;
    label.Text = text.Length.ToString();
}

Notice how the type of task is Task<string>, but the type of the await task expression is just string. In this sense, an await expression performs an “unwrapping” operation, at least when the value being awaited is a Task<TResult>. (You can await other types too, as we’ll see, but Task<TResult> acts as a good starting point.) That’s one aspect of await – which doesn’t seem directly related to asynchrony, but makes life easier.

The main purpose of await is to avoid blocking while we wait for time-consuming operations to complete. You may be wondering how this all works in the concrete terms of threading. We’re setting label.Text at both the start and end of the method, so it’s reasonable to assume that both of those statements are executed on the UI thread, and yet we’re clearly not blocking the UI thread while we wait for the web page to download.

The trick is that the method actually returns as soon as we hit the await expression. Up until that point, it executes synchronously on the UI thread just as any other event handler would. If you put a breakpoint on the first line and hit it in the debugger, you’ll see that the stack trace shows that the button is busy raising its Click event. When we reach the await, the code checks whether the result is already available, and if it’s not (which will almost certainly be the case) it schedules a continuation to be executed when the web operation has completed. A continuation is effectively a callback which maintains the control state of the method: just as a closure maintains its environment in terms of variables, a continuation also remembers where it had got to, so it can continue from there when it’s executed. It’s very much like an iterator block, except that instead of yielding a value and then waiting to be executed again, await just “pauses” the method until the asynchronous operation has completed.[3]

In our case, the continuation executes the rest of the method, effectively jumping to the end of the await expression, back in the UI thread just as we want in order to manipulate the UI.
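You can watch this return-then-resume behavior in a small console sketch. It rests on some assumptions: a TaskCompletionSource stands in for the web operation so that we control exactly when it completes, the names are made up for the example, and a console app has no UI thread, so the continuation simply runs wherever the operation completes rather than being marshalled back to a particular thread.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class ReturnAtAwaitSketch
{
    // Plays the role of the event handler: runs synchronously up to the await.
    public static async Task HandlerAsync(Task<string> operation, List<string> log)
    {
        log.Add("handler: before await");
        string text = await operation;   // not completed yet, so the method returns here
        log.Add("handler: resumed, length " + text.Length);
    }

    public static List<string> Run()
    {
        var log = new List<string>();
        var operation = new TaskCompletionSource<string>();

        Task handler = HandlerAsync(operation.Task, log);
        // We get control back even though the handler has not finished.
        log.Add("caller: handler returned, finished = " + handler.IsCompleted);

        operation.SetResult("hello");    // the pretend web operation completes
        handler.Wait();                  // make sure the continuation has run
        return log;
    }

    public static void Main()
    {
        foreach (string line in Run())
        {
            Console.WriteLine(line);
        }
    }
}
```

The log shows the handler starting, then the caller continuing while the handler is still unfinished, and only then the handler resuming after the await.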

In case you’re wondering, all of this is handled by the compiler creating a complicated state machine. That’s an implementation detail. It’s instructive to examine it to get a better grasp of what’s going on, but before that, we need a more concrete description of what we’re trying to achieve and what the language actually specifies.

Thinking about asynchrony

If you ask a developer what they understand by asynchronous execution, chances are they’ll start talking about multi-threading. While that’s an important part of typical uses of asynchrony, it’s not really required for asynchronous execution. To fully appreciate how the async feature of C# 5 works, it’s best to strip away any thoughts of threading, and go back to basics.

Asynchrony strikes at the very heart of the execution model that C# developers are familiar with. Consider simple code like this:

Console.WriteLine("First");
Console.WriteLine("Second");

We expect the first call to complete, and then the second call to start. Execution flows from one statement to the next, in order. An asynchronous execution model doesn’t work that way. Instead, it’s all about continuations. When you start doing something, you tell that operation what you want to happen when that operation has completed. You may have heard (or used) the term callback for the same idea, but that’s got a broader meaning than the one we’re after here. We’re only interested in preserving the control state of the program, not arbitrary callbacks for other purposes.

Continuations are naturally represented as delegates in .NET, typically actions that receive the results of the asynchronous operation. That’s why to use the asynchronous methods in WebClient prior to C# 5, you would wire up various events to say what code should be executed in the case of success, failure and so on. The trouble is that creating all those delegates for a complicated sequence of steps ends up being very complicated, even with the benefit of lambda expressions. It’s even worse when you try to make sure that your error handling is correct. (On a good day, I can be reasonably confident that the success paths of hand-written asynchronous code are correct. I’m typically less certain that it reacts the right way on failure.)

Essentially, all that async in C# does is ask the compiler to build continuations for you. For an idea that can be expressed so simply, however, the consequences for readability and developer sanity are remarkable.

When I wrote earlier that we pass the continuation to the asynchronous operation at the same time that we start it, I was thinking about asynchrony in a classic, idealized sense. The reality in the task-based asynchronous pattern is very slightly different. Instead of the continuation being passed to the asynchronous operation, the asynchronous operation starts and returns us a token we can use to provide the continuation later. It represents the ongoing operation, which may have completed even before it’s returned to the calling code, or may still be in progress. That token is then used whenever we want to express the idea of “I can’t proceed any further until this operation has completed.” Typically the token is in the form of a Task or Task<TResult>, but it doesn’t have to be.
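The difference between the two shapes is easier to see side by side. This is only a sketch: both "operations" are trivial stand-ins that produce the number 42, and the method names are invented for the example.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class TokenSketch
{
    // Classic, idealized style: the continuation is supplied when the operation starts.
    public static void StartWithContinuation(Action<int> continuation)
    {
        ThreadPool.QueueUserWorkItem(_ => continuation(42));
    }

    // Task-based style: the operation starts and hands back a token straight away.
    public static Task<int> Start()
    {
        return Task.Run(() => 42);
    }

    public static void Main()
    {
        // Classic: we have to know the continuation up front.
        var done = new ManualResetEventSlim();
        StartWithContinuation(value =>
        {
            Console.WriteLine("callback got " + value);
            done.Set();
        });
        done.Wait();

        // Task-based: keep the token and attach the continuation later,
        // whether or not the operation has already completed by then.
        Task<int> token = Start();
        Task printed = token.ContinueWith(t => Console.WriteLine("continuation got " + t.Result));
        printed.Wait();
    }
}
```

Attaching the continuation by hand with ContinueWith is roughly the job that await takes over, although await also deals with details such as resuming in the right context and unwrapping exceptions.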

So, the execution flow in an asynchronous method in C# 5 is typically along the lines of:

  • Do some work
  • Start an asynchronous operation, and remember the token it returns
  • Possibly do some more work. (Often you really can’t make any further progress until the asynchronous operation has completed, in which case this step is empty.)
  • “Wait” for the asynchronous operation to complete (via the token)
  • Do some more work
  • Finish
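Mapped onto code, the steps above might look like the following sketch. LoadTextAsync is a hypothetical operation that just delays briefly and returns a fixed string, and the Steps list exists only so we can see the order in which things happened.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class ExecutionFlowSketch
{
    public static readonly List<string> Steps = new List<string>();

    // Hypothetical asynchronous operation: yields a string after a short delay.
    private static async Task<string> LoadTextAsync()
    {
        await Task.Delay(20);
        return "asynchrony";
    }

    public static async Task<int> MeasureAsync()
    {
        Steps.Add("do some work");
        Task<string> token = LoadTextAsync();   // start the operation, remember the token
        Steps.Add("do more work while it runs");
        string text = await token;              // "wait" without blocking the thread
        Steps.Add("do work that needed the result");
        return text.Length;                     // finish
    }

    public static void Main()
    {
        // Blocking on Result is tolerable in a console app with no UI thread to freeze.
        int length = MeasureAsync().Result;
        Console.WriteLine(string.Join(" -> ", Steps));
        Console.WriteLine(length);
    }
}
```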

If we didn’t care about exactly what the “wait” part meant, we could do all of this in C# 4. If we’re happy to block until the asynchronous operation completes, the token will normally provide us some way of doing so. For a Task, we could just call Wait(). At that point, though, we’re taking up a valuable resource (a thread) and not doing any useful work. It’s a little like phoning for a takeaway pizza, and then standing at the front door until it arrives. What we really want to do is get on with something else, ignoring the pizza until it arrives. That’s where async comes in.

When we “wait” for an asynchronous operation, what we’re really saying is, “I’ve gone as far as I can go for now. Keep going when the operation has completed.” But if we’re not going to block the thread, what can we do? Very simply, we can return right there and then. We’ll continue asynchronously ourselves. And if we want our caller to know when our asynchronous method has completed, we’ll pass a token back to them, which they can block on if they want, or (more likely) use with another continuation. Very often you’ll end up with a whole stack of asynchronous methods calling each other. It’s almost as if you go into “async mode” for a section of code. There’s nothing in the language which states that it has to be done that way, but the fact that the same code which consumes asynchronous operations also behaves as an asynchronous operation certainly encourages it.

With the theory out of the way, let’s take a closer look at the concrete details of asynchronous methods. Asynchronous anonymous functions fit into the same mental model, but it’s much easier to talk about asynchronous methods.

Modelling asynchronous methods

I find it very useful to think about asynchronous methods in the form of figure 1.


Figure 1 Async model

Here we have three blocks of code (the methods) and two boundaries (the method return types). To give a very simple example, we might have:

static async Task<int> GetPageLengthAsync(string url)
{
    HttpClient client = new HttpClient();
    Task<string> fetchTextTask = client.GetStringAsync(url);
    int length = (await fetchTextTask).Length;
    return length;
}

static void PrintPageLength()
{
    Task<int> lengthTask = GetPageLengthAsync("http://csharpindepth.com");
    Console.WriteLine(lengthTask.Result);
}

Here the five parts of the diagram correspond like this:

  • The calling method is PrintPageLength.
  • The async method is GetPageLengthAsync.
  • The asynchronous operation is HttpClient.GetStringAsync.
  • The boundary between the calling method and the async method is Task<int>.
  • The boundary between the async method and the asynchronous operation is Task<string>.

We’re mainly interested in the async method itself, but I’ve included the other methods so we can work out how they all interact together. In particular, you definitely need to know about the valid types at the method boundaries.

Summary

I hope that the more complicated, deep-dive sections of this chapter haven’t put you off the elegance of the asynchronous features of C# 5. The ability to write efficient asynchronous code in a more familiar execution model is a huge step forward, and I believe it will be transformative—once it’s well understood. In my experience giving presentations about async, many developers get easily confused by the feature the first time they see and use it. That’s entirely understandable, but please don’t let that put you off.



Here are some other Manning titles you might be interested in:


Dependency Injection in .NET

Mark Seemann


Windows Phone 7 in Action

Timothy Binkley-Jones, Massimo Perga and Michael Sync


Real-World Functional Programming

Tomas Petricek

 

[1] As a reminder, an anonymous function is either a lambda expression or an anonymous method.

[2] HttpClient is in some senses the “new and improved” WebClient—it’s the preferred HTTP API for .NET 4.5 onwards and only contains asynchronous operations. If you’re writing a Windows Store app, you don’t even have the option of using WebClient.

[3] There are many parallels to be drawn between iterator blocks and asynchronous functions, but don’t be fooled into thinking they’re equivalent features. In the past, asynchronous libraries have been built on top of iterators to take advantage of the generated state machines, but the fact that asynchronous functions have been designed specifically for asynchrony makes them considerably cleaner.

 
