Garbage Collector Internals
The CLR GC is a highly efficient, scalable, and reliable automatic memory manager. Much time and effort went into researching its optimal behavioral characteristics. Before delving into the details of the CLR GC, it is important to state what the GC is and what assumptions were made during its design and implementation. Let's begin by looking at some of the key assumptions.

- The CLR GC assumes that everything is garbage unless otherwise told. This means that the GC is ready to collect all objects on the managed heap unless told otherwise. In essence, it implements a reference tracking scheme for all live objects in the system (we will define what live means shortly) where objects without any references to them are considered garbage and can be collected (a small illustration follows this list).
- The CLR GC assumes that all objects on the managed heap will be short lived (or ephemeral). In other words, the GC attempts to collect short-lived objects more often than long-lived objects operating under the assumption that if an object has been around for a while, chances are it will be around for a little longer and there is no need to attempt to collect that object again.
- The CLR GC tracks an object's age via the use of generations. Young objects are placed in generation 0 and older objects in generations 1 and 2. As an object grows older, it is promoted from one generation to the next. As such, a generation can be said to define the age of an object.
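To make the first assumption concrete, the following standalone snippet (an illustrative sketch, not one of the chapter's listings) uses GC.GetTotalMemory to watch a large allocation appear and then disappear once its last reference is dropped; the exact numbers printed will vary from run to run.

using System;

class RootednessDemo
{
    static void Main()
    {
        long before = GC.GetTotalMemory(true);

        byte[] data = new byte[1000000];                       // rooted by the local 'data'
        Console.WriteLine(GC.GetTotalMemory(true) - before);   // roughly 1,000,000 bytes higher
        GC.KeepAlive(data);                                    // keep 'data' reachable up to this point

        data = null;                                           // last reference gone: the array is now garbage
        Console.WriteLine(GC.GetTotalMemory(true) - before);   // typically back near zero after the forced collection
    }
}

Passing true to GC.GetTotalMemory forces a collection before the value is returned, which is what makes the drop visible once the array is no longer referenced.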
Let's look at each of the parts of the definition more concretely and begin with how generations define the age of an object.
Generations
The CLR GC defines three generations, very innovatively called generation 0, generation 1, and generation 2. Each of the generations contains objects of a certain age, where generation 0 contains newly allocated objects and generation 2 contains the oldest objects. An object moves from one generation to the next by surviving a garbage collection. Surviving means that the object was still being referenced (or was still rooted) at the time of the garbage collection. Each of the generations can be garbage collected at any time, but the frequency of garbage collections depends on the generation. Remember from the previous section that one of the assumptions the CLR makes is that most objects are going to be short-lived (i.e., live in generation 0). Due to that assumption, generation 0 is collected far more frequently than generation 2 in the hope of pruning short-lived objects quickly. Figure 5-5 shows the overall algorithm when it comes to how the generations are garbage collected.
Figure 5-5 High-level overview of generational garbage collection algorithm
In Figure 5-5, we can see that a garbage collection is triggered when a new allocation request causes the budget for generation 0 to be exceeded. When that happens, the garbage collector collects all objects in generation 0 that have no roots associated with them and promotes all rooted objects to generation 1. Much in the same way that generation 0 has a budget defined, so does generation 1; if, as part of promoting objects from generation 0 to generation 1, that budget is exceeded, the GC repeats the process of collecting objects with no roots in generation 1 and promoting rooted objects to generation 2. The process repeats itself for generation 2. If, while promoting to generation 2, the GC cannot collect any objects and the budget for generation 2 is exceeded, the CLR heap manager tries to allocate another segment to hold generation 2 objects. If the creation of a new segment fails, an OutOfMemoryException is thrown. The CLR heap manager also releases segments that are no longer in use; we will discuss this process in more detail later in the chapter.
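To get a feel for how much more frequently the lower generations are collected under this budget-driven scheme, the following standalone snippet (an illustrative sketch, not one of the chapter's listings) churns through a large number of short-lived allocations, repeatedly exceeding the generation 0 budget, and then prints the per-generation collection counts using GC.CollectionCount. On a typical run, the generation 0 count dwarfs the generation 2 count.

using System;

class CollectionCountDemo
{
    static void Main()
    {
        // Allocate many short-lived objects; each byte[] becomes garbage almost immediately.
        for (int i = 0; i < 1000000; i++)
        {
            byte[] temp = new byte[1024];
        }

        // GC.CollectionCount(n) reports how many times generation n has been collected so far.
        Console.WriteLine("Gen 0 collections: {0}", GC.CollectionCount(0));
        Console.WriteLine("Gen 1 collections: {0}", GC.CollectionCount(1));
        Console.WriteLine("Gen 2 collections: {0}", GC.CollectionCount(2));
    }
}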
Let's take a practical look at how an object is collected and promoted. Listing 5-2 shows the source code behind the application we will use to illustrate the generational concepts.
Listing 5-2. Example source code to illustrate generational concepts
using System;
using System.Text;
using System.Runtime.Remoting;

namespace Advanced.NET.Debugging.Chapter5
{
    class Name
    {
        private string first;
        private string last;

        public string First { get { return first; } }
        public string Last { get { return last; } }

        public Name(string f, string l)
        {
            first = f;
            last = l;
        }
    }

    class Gen
    {
        static void Main(string[] args)
        {
            Name n1 = new Name("Mario", "Hewardt");
            Name n2 = new Name("Gemma", "Hewardt");

            Console.WriteLine("Allocated objects");
            Console.WriteLine("Press any key to invoke GC");
            Console.ReadKey();

            n1 = null;
            GC.Collect();

            Console.WriteLine("Press any key to invoke GC");
            Console.ReadKey();

            GC.Collect();

            Console.WriteLine("Press any key to exit");
            Console.ReadKey();
        }
    }
}

The source code and binary for Listing 5-2 can be found in the following folders:
- Source code: C:\ADND\Chapter5\Gen
- Binary: C:\ADNDBin\05Gen.exe
Let's run the application under the debugger and see how we can verify our theories on how n1 and n2 are collected and promoted. When the application is running under the debugger, resume execution until the first Press any key to invoke GC prompt. At that point, we need to break execution and find the addresses to the two object instances, which can easily be done via the ClrStack command as shown in the following:
0:000> !ClrStack -a OS Thread Id: 0x1c0c (0) ESP EIP 0028f3b4 77709a94 [NDirectMethodFrameSlim: 0028f3b4] Microsoft.Win32.Win32Native.ReadConsoleInput(IntPtr, InputRecord ByRef, Int32, Int32 ByRef) 0028f3cc 793e8f28 System.Console.ReadKey(Boolean) PARAMETERS: intercept = 0x00000000 LOCALS: <no data> 0x0028f3dc = 0x00000001 <no data> <no data> <no data> <no data> <no data> <no data> <no data> <no data> 0028f40c 793e8e33 System.Console.ReadKey() 0028f410 003000f3 Advanced.NET.Debugging.Chapter5.Gen.Main(System.String[]) PARAMETERS: args = 0x01c55818 LOCALS: <CLR reg> = 0x01da5938 <CLR reg> = 0x01da5948 0028f65c 79e7c74b [GCFrame: 0028f65c]The addresses of the two objects on the managed heap are 0x01da5938 and 0x01da5948. How can we figure out which generation objects on the managed heap belong to? The answer to that lies in understanding the correlation between managed heap segments and generations. As previously discussed, each managed heap consists of one or more segments where the objects reside. Furthermore, part of the segment(s) is dedicated to a given generation. Figure 5-6 shows an example of a hypothetical managed heap segment.
Figure 5-6 Hypothetical managed heap segment
In Figure 5-6,
the managed heap segment is divided into three generations, each with
its own starting address managed by the CLR heap manager. Generations 0
and 1 are part of a single segment known as the ephemeral segment where
short-lived objects live. Because the GC operates under the assumption that
most objects are short lived, most objects are not expected to live past
generation 0 or, at a maximum, generation 1. Objects that live in
generation 2 are the oldest objects and get collected very infrequently.
It is possible that generation 2 can also be part of the ephemeral
segment even though generation 2 is not collected as often. By looking
at an object's address and knowing the address ranges for each of the
generations, we can find out which generation an object belongs to. How
do we know what the generational starting addresses for the CLR heap
manager are? The answer lies in a command called eeheap. The eeheap command displays various memory statistics of data consumed by internal CLR data structures. By default, eeheap
displays verbose data, meaning that information related to the GC as
well as the loader is displayed. To display information only about the
GC, the –gc switch can be used. Let's run the command in our existing debug session and see what we get:0:004> !eeheap -gc Number of GC Heaps: 1 generation 0 starts at 0x01da1018 generation 1 starts at 0x01da100c generation 2 starts at 0x01da1000 ephemeral segment allocation context: none segment begin allocated size 002c7db0 790d8620 790f7d8c 0x0001f76c(128876) 01da0000 01da1000 01da8010 0x00007010(28688) Large object heap starts at 0x02da1000 segment begin allocated size 02da0000 02da1000 02da3250 0x00002250(8784) Total Size 0x289cc(166348) ––––––––––––––––––––––––––––– GC Heap Size 0x289cc(166348)Part of the output shows clearly the starting addresses of each of the generations. If we look at the object addresses in the debug session of our sample application, we can see the following:
<CLR reg> = 0x01da5938
<CLR reg> = 0x01da5948

Both of these addresses, corresponding to our objects, fall within the address range of generation 0 (starting at 0x01da1018); hence we can conclude that both of them live within the realm of that generation. This makes perfect sense because we are currently at the point in the code flow where the objects were just allocated and a garbage collection is pending. If we resume execution of the application and subsequently break execution again the next time we see the Press any key to invoke GC prompt, we should see a difference in which generation the objects belong to. If we look at the source code, we can see that prior to invoking a garbage collection, we set the n1 reference to null, which in essence makes the object rootless and one that should be garbage collected. Furthermore, n2 is still rooted and as such should be promoted to generation 1 during the garbage collection. Let's take a look by following the same process as earlier: find the object addresses, use the eeheap command to find the generational address ranges, and see which generation each object falls into:
0:000> !ClrStack -a OS Thread Id: 0x1910 (0) ESP EIP 0021f394 77709a94 [NDirectMethodFrameSlim: 0021f394] Microsoft.Win32.Win32Native.ReadConsoleInput(IntPtr, InputRecord ByRef, Int32, Int32 ByRef) 0021f3ac 793e8f28 System.Console.ReadKey(Boolean) PARAMETERS: intercept = 0x00000000 LOCALS: <no data> 0x0021f3bc = 0x00000001 <no data> <no data> <no data> <no data> <no data> <no data> <no data> <no data> 0021f3ec 793e8e33 System.Console.ReadKey() 0021f3f0 01690111 Advanced.NET.Debugging.Chapter5.Gen.Main(System.String[]) PARAMETERS: args = 0x01da5818 LOCALS: <CLR reg> = 0x00000000 <CLR reg> = 0x01da5948 0021f644 79e7c74b [GCFrame: 0021f644] 0:000> !eeheap -gc Number of GC Heaps: 1 generation 0 starts at 0x01da6c00 generation 1 starts at 0x01da100c generation 2 starts at 0x01da1000 ephemeral segment allocation context: none segment begin allocated size 002c7db0 790d8620 790f7d8c 0x0001f76c(128876) 01da0000 01da1000 01da8c0c 0x00007c0c(31756) Large object heap starts at 0x02da1000 segment begin allocated size 02da0000 02da1000 02da3240 0x00002240(8768) Total Size 0x295b8(169400) –––––––––––––––––––––––––––––– GC Heap Size 0x295b8(169400)The most interesting part of the output is in the eeheap command output. We can see now that the generational address ranges have changed slightly. More specifically, the starting address of generation 0 has changed from 0x01da1018 to 0x01da6c00, which in essence implies that generation 1 has become bigger (because the starting address of generation 1 remains unchanged). If we correlate the address of our n2 object (0x01da5948) with the generational address ranges that the eeheap command displayed, we can see that the n2 object falls into generation 1. Again, this is fully expected because n2 previously lived in generation 0 and was still rooted at the time of the garbage collection, thereby promoting the object to the next generation. I will leave it as an exercise to you to see what happens on the final garbage collection in the sample application.
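As a complement to the debugger-based approach, an object's current generation can also be queried from code with GC.GetGeneration. The following standalone snippet (an illustrative sketch, not one of the chapter's listings) shows an object being promoted as it survives induced collections; treat the values in the comments as typical rather than guaranteed.

using System;

class GetGenerationDemo
{
    static void Main()
    {
        object o = new object();
        Console.WriteLine(GC.GetGeneration(o));  // typically 0: freshly allocated

        GC.Collect();                            // o survives because it is still rooted
        Console.WriteLine(GC.GetGeneration(o));  // typically 1 after surviving one collection

        GC.Collect();
        Console.WriteLine(GC.GetGeneration(o));  // typically 2 after surviving another collection

        GC.KeepAlive(o);                         // keep o rooted past the last read in retail builds
    }
}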
Although the SOS debugger extension provides the means of finding out which generation any given object belongs to, it is a somewhat tedious process as it requires that addresses be checked against potentially changing generational addresses within any given managed heap segment. Furthermore, there is no concrete way to list all the objects that fall into any given generation, making it hard to get an overall picture of the per generation utilization. Fortunately, the SOSEX extension comes to the rescue with a command named dumpgen. With the dumpgen command, you can easily get a list of all objects that belong to the generation specified as an argument to the command. For example, using the same sample application as shown in Listing 5-2, here is the output when running dumpgen:
0:000> !dumpgen 0 01da6c00 12 **** FREE **** 01da6c0c 68 System.Char[] 2 objects, 80 bytes 0:000> !dumpgen 1 01da100c 12 **** FREE **** 01da1018 12 **** FREE **** 01da1024 72 System.OutOfMemoryException 01da106c 72 System.StackOverflowException 01da10b4 72 System.ExecutionEngineException 01da10fc 72 System.Threading.ThreadAbortException 01da1144 72 System.Threading.ThreadAbortException 01da118c 12 System.Object 01da1198 28 System.SharedStatics 01da11b4 100 System.AppDomain ... ... ... 01da5948 16 Advanced.NET.Debugging.Chapter5.Name 01da5958 28 Microsoft.Win32.Win32Native+InputRecord 01da5974 12 System.Object 01da5980 20 Microsoft.Win32.SafeHandles.SafeFileHandle 01da5994 36 System.IO.__ConsoleStream 01da59b8 28 System.IO.Stream+NullStream ... ... ...We can see that there aren't a lot of objects in generation 0; instead, we have a ton of objects in generation 1 including our n2 instance at address 0x01da5948. The dumpgen command really makes life easier when looking at generation specific data.
So far, we have discussed how objects live in managed heap segments divided into generations and how these objects are either garbage collected or promoted to the next generation, depending on if they are still referenced (or still rooted). One question that still remains is what it means for an object to be rooted. The next section introduces the notion of roots, which are at the heart of the decision-making process the GC uses to determine if an object can be collected.
Roots
One of the most fundamental aspects of a garbage collection is being able to determine which objects are still being referenced and which objects are not and can therefore be considered for garbage collection. Contrary to popular belief, the GC itself does not implement the logic for detecting which objects are still being referenced; rather, it uses other components in the CLR that have far more knowledge about the lifetimes of the objects. The CLR uses the following components to determine which objects are still referenced:

- Just-In-Time (JIT) compiler. The JIT compiler is the component responsible for translating IL to machine code and has detailed knowledge of which local variables are considered active at any given point in time. The JIT compiler maintains this information in a table that it subsequently references when the GC asks for objects that are still considered to be alive.
- Stack walker. This comes into play when unmanaged calls are made to the execution engine. During these calls, it is imperative that any managed objects used during the call also be part of the reference tracking system.
- Handle table. The CLR maintains a set of handle tables on a per application domain basis that can contain, for example, pointers to pinned reference types on the managed heap. During a GC inquiry, these handle tables are probed for live references to objects on the managed heap.
- Finalize queue. We will discuss the notion of object finalizers shortly, but for the time being, view objects with finalizers as objects that can be considered dead from an application's perspective but still need to be kept alive for cleanup purposes.
- Finally, an object is also considered rooted if it is referenced, directly or indirectly, by an object that falls into any of the above categories.
Listing 5-3. Sample application to illustrate object roots
using System;
using System.Text;
using System.Threading;

namespace Advanced.NET.Debugging.Chapter5
{
    class Name
    {
        private string first;
        private string last;

        public string First { get { return first; } }
        public string Last { get { return last; } }

        public Name(string f, string l)
        {
            first = f;
            last = l;
        }
    }

    class Roots
    {
        public static Name CompleteName = new Name("First", "Last");

        private Thread thread;
        private bool shouldExit;

        static void Main(string[] args)
        {
            Roots r = new Roots();
            r.Run();
        }

        public void Run()
        {
            shouldExit = false;

            Name n1 = CompleteName;

            thread = new Thread(this.Worker);
            thread.Start(n1);

            Thread.Sleep(1000);

            Console.WriteLine("Press any key to exit");
            Console.ReadKey();

            shouldExit = true;
        }

        public void Worker(Object o)
        {
            Name n1 = (Name)o;
            Console.WriteLine("Thread started {0}, {1}", n1.First, n1.Last);

            while (true)
            {
                // Do work
                Thread.Sleep(500);
                if (shouldExit) break;
            }
        }
    }
}

The source code and binary for Listing 5-3 can be found in the following folders:
- Source code: C:\ADND\Chapter5\Roots
- Binary: C:\ADNDBin\05Roots.exe
In Listing 5-3, the Name object instance referenced by CompleteName ends up with three roots:

- We have a static reference to the object instance at the Roots class level, serving as our first root to the object.
- In the Run method, we assign a local variable reference (n1) to the object instance, serving as our second root. The n1 local variable is not used after the thread has started and is subject to becoming invalid even before the end of the method scope (in retail builds); in debug builds, the reference is guaranteed to remain valid until the end of the scope is reached (see the sketch following this list).
- In the Run method, we pass the local variable reference n1 to the thread method during thread startup serving as our third root.
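As a side note on the lifetime issue called out in the second bullet: if you ever need to guarantee that a local reference remains a root up to a specific point, GC.KeepAlive can be used. The following is a minimal, hypothetical variant of the Run method from Listing 5-3 (not the book's original code) illustrating the pattern:

public void Run()
{
    shouldExit = false;

    Name n1 = CompleteName;

    thread = new Thread(this.Worker);
    thread.Start(n1);

    Thread.Sleep(1000);
    Console.WriteLine("Press any key to exit");
    Console.ReadKey();
    shouldExit = true;

    // GC.KeepAlive keeps n1 reported as a live reference up to this exact point,
    // even in retail builds where the JIT would otherwise stop reporting it after
    // its last use. (In Listing 5-3 the object is also rooted by the static
    // CompleteName field, so this call only affects the stack root.)
    GC.KeepAlive(n1);
}

With that noted, let's run the application in Listing 5-3 under the debugger, locate the Name instance, and ask SOS for its roots: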
0:005> ~0s eax=002cef9c ebx=002cef94 ecx=792274ec edx=79ec9058 esi=002cedf0 edi=00000000 eip=77709a94 esp=002ceda0 ebp=002cedc0 iopl=0 nv up ei pl zr na pe nc cs=001b ss=0023 ds=0023 es=0023 fs=003b gs=0000 efl=00000246 ntdll!KiFastSystemCallRet: 77709a94 c3 ret 0:000> !ClrStack -a OS Thread Id: 0x2358 (0) ESP EIP 002cef6c 77709a94 [NDirectMethodFrameSlim: 002cef6c] Microsoft.Win32.Win32Native.ReadConsoleInput(IntPtr, InputRecord ByRef, Int32, Int32 ByRef) 002cef84 793e8f28 System.Console.ReadKey(Boolean) PARAMETERS: intercept = 0x00000000 LOCALS: <no data> 0x002cef94 = 0x00000001 <no data> <no data> <no data> <no data> <no data> <no data> <no data> <no data> 002cefc4 793e8e33 System.Console.ReadKey() 002cefc8 00890212 Advanced.NET.Debugging.Chapter5.Roots.Run() PARAMETERS: this = 0x01c758e0 LOCALS: <CLR reg> = 0x01c758d0 002cefe8 0089013f Advanced.NET.Debugging.Chapter5.Roots.Main(System.String[]) PARAMETERS: args = 0x01c75888 LOCALS: <CLR reg> = 0x01c758e0 002cf208 79e7c74b [GCFrame: 002cf208] 0:000> !do 0x01c758d0 Name: Advanced.NET.Debugging.Chapter5.Name MethodTable: 001b311c EEClass: 001b13a0 Size: 16(0x10) bytes (C:\ADNDBin\05Roots.exe) Fields: MT Field Offset Type VT Attr Value Name 790fd8c4 4000001 4 System.String 0 instance 01c75898 first 790fd8c4 4000002 8 System.String 0 instance 01c758b4 last 0:000> !gcroot 0x01c758d0 Note: Roots found on stacks may be false positives. Run "!help gcroot" for more info. Scan Thread 0 OSTHread 2358 ESP:2cefbc:Root:01c758d0(Advanced.NET.Debugging.Chapter5.Name) Scan Thread 1 OSTHread 1630 Scan Thread 3 OSTHread 254c ESP:47df428:Root:01c758d0(Advanced.NET.Debugging.Chapter5.Name) ESP:47df42c:Root:01c758d0(Advanced.NET.Debugging.Chapter5.Name) ESP:47df438:Root:01c758d0(Advanced.NET.Debugging.Chapter5.Name) ESP:47df4d0:Root:01c75984(System.Threading.ThreadHelper)-> 01c758d0(Advanced.NET.Debugging.Chapter5.Name) ESP:47df4d8:Root:01c75984(System.Threading.ThreadHelper)-> 01c758d0(Advanced.NET.Debugging.Chapter5.Name) ESP:47df4f4:Root:01c75984(System.Threading.ThreadHelper)-> 01c758d0(Advanced.NET.Debugging.Chapter5.Name) ESP:47df500:Root:01c75984(System.Threading.ThreadHelper)-> 01c758d0(Advanced.NET.Debugging.Chapter5.Name) ESP:47df5c0:Root:01c758d0(Advanced.NET.Debugging.Chapter5.Name)-> 01c758d0(Advanced.NET.Debugging.Chapter5.Name) ESP:47df5c4:Root:01c75998(System.Threading.ParameterizedThreadStart)-> 01c75984(System.Threading.ThreadHelper) ESP:47df754:Root:01c758d0(Advanced.NET.Debugging.Chapter5.Name)-> 01c75984(System.Threading.ThreadHelper) ESP:47df758:Root:01c75998(System.Threading.ParameterizedThreadStart)-> 01c75984(System.Threading.ThreadHelper) ESP:47df764:Root:01c75998(System.Threading.ParameterizedThreadStart)-> 01c75984(System.Threading.ThreadHelper) ESP:47df76c:Root:01c758d0(Advanced.NET.Debugging.Chapter5.Name)-> 01c75984(System.Threading.ThreadHelper) DOMAIN(0037FCF8):HANDLE(Pinned):a13fc:Root:02c71010(System.Object[])-> 01c758d0(Advanced.NET.Debugging.Chapter5.Name)As you can see from the gcroot output, the command scans a number of different sources to find and build the reference chain to the object specified. Regardless of the source, the output of the GCRoot command results in the following general format:
<root>-><reference 1>-><reference 2>-><reference X>-><object>

Depending on the source probed, each of the elements takes on a slightly different format as shown.
- Local variables on a thread's stack. The root element typically looks like the following: <stack register>:<stack pointer>:Root:<object>. The stack register depends on the architecture. For example, on x86 machines it shows as ESP and on x64 machines it shows as RSP. The stack pointer shows the location on the stack where the object is rooted, and the object address
is the address of the object that is holding a reference to the next
object in the reference chain. Let's take a look at an example:
ESP:47df428:Root:01c758d0(Advanced.NET.Debugging.Chapter5.Name)
We can see that there is a local variable located at stack (ESP) location 0x047df428. Furthermore, the output tells us that this constitutes a root to the object at address 0x01c758d0, which is a reference to the Advanced.NET.Debugging.Chapter5.Name type.

- Handle tables. All handle tables are scanned as part of GCRoot
execution looking for references to the specified object. If a
reference is found, the output of the command takes on the following
general syntax: DOMAIN(<address>):HANDLE(<type>):<handle address>:Root:<object>. The domain address field indicates the address of the application domain to which the handle reference belongs. The handle type specifies the type of the handle. The possible handle types are Weak, WeakTrackResurrection, Normal, and Pinned. Next is the handle address,
which is the address to the handle itself. Please keep in mind that the
handle type is a value type and if you want to dump out the contents
you must use the DumpVC command rather than DumpObj. Finally, the root object address is shown. Let's take a look at an example:
DOMAIN(002EFCD8):HANDLE(Pinned):2813fc:Root:02c81010 (System.Object[])->01c858d0(Advanced.NET.Debugging. Chapter5.Name)
The preceding output indicates that the object at address 0x01c858d0 is rooted by an object that resides in the handle table corresponding to the application domain with address 0x002efcd8. Furthermore, the address of the handle value holding the reference is located at address 0x002813fc and the type of the handle value is pinned. Lastly, the actual object that holds the reference is at address 0x02c81010, which is of type System.Object[].
- F-reachable queue. The f-reachable queue is scanned to see if
there are any references to the specified object. If a root reference to
the object is found on the f-reachable queue, it will be displayed in
the following general format: Finalizer queue:Root:<object address>(<object type>).
The first part of the output indicates that the source of the root is
the f-reachable queue. Next, the address of the referenced object is
displayed, followed by the object type. What follows is an example of
the output of GCRoot when run against an object that is on the f-reachable queue:
Finalizer queue:Root:01d15750(Advanced.NET.Debugging.Chapter5.Name)
In the preceding output, we can see that the object at address 0x01d15750 of type Advanced.NET.Debugging.Chapter5.Name is rooted by the f-reachable queue.

- Other objects. The last source of output for the GCRoot command is other objects; that is, an object kept alive through a chain of references whose root falls into any of the preceding categories is reported together with that chain.
A final note on GCRoot and thread stacks: the roots it reports from a stack scan may be false positives. Consider the following code:

public void Run()
{
    Name n1 = new Name("A", "B");
    Console.WriteLine("Press any key to exit");
    Console.ReadKey();
}

In the source code, we have a simple instance of the Name class assigned to the n1 local variable. If we ran the GCRoot command on the n1 reference, we would expect to see only one reference on the thread stack:
0:000> !GCRoot 0x01e9580c
Note: Roots found on stacks may be false positives. Run "!help gcroot" for more info.
Scan Thread 0 OSTHread 1638
ESP:1df29c:Root:01e9580c(Advanced.NET.Debugging.Chapter5.Name)
ESP:1df2a0:Root:01e9580c(Advanced.NET.Debugging.Chapter5.Name)
Scan Thread 2 OSTHread 14ac

The output clearly shows that thread 0 apparently has two references to the object on the thread stack. How is this possible? The GCRoot command works by assuming that every address on the stack is the address of an object and then trying to verify that assumption using various metadata information. In light of this, values that are (or were previously) present on the stack can be treated as first-class references to objects and listed in the output of GCRoot. If you suspect that the output of GCRoot, insofar as thread stacks are concerned, is incorrect, the best approach is to use the u command to unassemble the stack frames and correlate the stack registers in the GCRoot output with the unassembled code to see which references are truly valid.
Finalization
The garbage collection mechanism described so far assumes that collected objects do not require any special cleanup code. At times, objects that encapsulate other resources require that these resources be cleaned up as part of object destruction. A great example is an object that wraps an underlying native resource such as a file handle. Without explicit cleanup code, the memory behind the managed object is cleaned up by the GC, but the underlying handle that the object encapsulates is not (because the GC has no special knowledge of native handles). The net result is, naturally, a resource leak. To provide a proper cleanup mechanism, the CLR introduces what are known as finalizers. A finalizer can be compared to a destructor in the native C++ world: whenever an object is freed (or garbage collected), the destructor (or finalizer) is run. In C#, a finalizer is declared very similarly to a C++ destructor by using the ~<class name>() notation. An example is shown in the following listing:

public class MyClass
{
    ...
    ...
    ...
    ~MyClass()
    {
        // Cleanup code
    }
}

When the class is compiled into IL, the finalizer gets translated into a method called Finalize. The key thing about objects with finalizers is that the garbage collector treats them a little differently than other objects. Because the garbage collector is in fact an automatic memory manager, it also has the responsibility of executing all finalization code that an object may have during a garbage collection. To keep tabs on which objects have finalizers, the garbage collector maintains a queue called the finalization queue. Objects that are created on the managed heap and contain finalizers are automatically placed on the finalization queue during creation. Please note that the finalization queue does not contain objects that are considered garbage; rather, it contains all objects with finalizers that are alive on the managed heap. When an object with a finalizer becomes rootless and a garbage collection occurs, the GC places the object on a different queue known as the f-reachable queue. This queue contains all objects with defined finalizers that are considered to be garbage and need to have their finalizers executed. The f-reachable queue is itself considered a root, meaning that an object on it is still alive. It is important to note that the finalizer code for each of the objects on the f-reachable queue is not executed as part of the garbage collection phase. Instead, each .NET process contains a special thread known as the finalization thread. The finalization thread wakes up, on request of the GC, and checks the state of the f-reachable queue. If there are any objects on the f-reachable queue, the finalization thread picks them up one by one and executes their finalize methods.

When the garbage collection finishes, collected objects with finalizers remain on the f-reachable queue (rooted and alive) until the finalization thread executes their finalize methods. At that point, an object is removed from the f-reachable queue, is considered rootless, and can be truly reclaimed by the garbage collector. The next time a garbage collection is started, those objects are collected. Figure 5-7 illustrates an example of the finalization process.
Figure 5-7 Example of finalization process
Step 1 in Figure 5-7 consists of allocating Obj D and Obj E,
both of which contain finalize methods. As part of the allocation, the
objects are placed on the managed heap as well as on the finalization
queue to indicate that the objects need to be finalized when no longer
in use. In step 2, Obj D and Obj E have both become
rootless when a garbage collection occurs. At that point, both objects
are moved from the finalization queue to the f-reachable queue to
indicate that the finalize methods are now ready to be run. At some
point in the future (nondeterministic), step 3 is executed and the
finalizer thread wakes up and starts running the finalize methods for
both of the objects. Even after the finalizer has finished, both objects
are still rooted on the f-reachable queue. Lastly, in step 4, another
garbage collection occurs and the objects are removed from the
f-reachable queue (no longer rooted) and then collected from the managed
heap by the garbage collector.

An interesting aspect of having a dedicated thread executing the finalize methods is that the CLR makes no guarantees about when that thread wakes up and executes. As such, it is possible that it will take some time before an object with a finalizer is actually cleaned up. When dealing with objects that aggregate scarce resources, it may not always be feasible to wait a long period of time for the resource to be reclaimed. In such situations, it is best to implement an explicit and deterministic cleanup pattern such as the IDisposable and/or Close patterns. Finally, having a dedicated thread also means that you have no control over the state of that thread, and making assumptions based on that state can break your application.
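For completeness, the following is a minimal sketch of the IDisposable pattern mentioned above, applied to a hypothetical NativeResourceHolder class that wraps a native handle; the handle field and ReleaseHandle helper are stand-ins for illustration only. The pattern gives callers a deterministic way to release the resource while keeping the finalizer as a safety net, and it calls GC.SuppressFinalize so an explicitly disposed object does not pay the finalization cost described above.

using System;

public class NativeResourceHolder : IDisposable
{
    private IntPtr handle;          // hypothetical native handle acquired elsewhere
    private bool disposed;

    public void Dispose()
    {
        Dispose(true);
        // The resource is already released; remove the object from the
        // finalization queue so the finalizer does not need to run.
        GC.SuppressFinalize(this);
    }

    protected virtual void Dispose(bool disposing)
    {
        if (!disposed)
        {
            ReleaseHandle();        // release the unmanaged resource
            disposed = true;
        }
    }

    ~NativeResourceHolder()
    {
        // Safety net: runs only if Dispose was never called.
        Dispose(false);
    }

    private void ReleaseHandle()
    {
        if (handle != IntPtr.Zero)
        {
            // e.g., CloseHandle(handle) via P/Invoke in a real implementation
            handle = IntPtr.Zero;
        }
    }
}

Callers can then wrap instances in a using statement, and the finalizer only runs for instances that were never disposed.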
Let's take a look at a concrete example of an object with a finalize method and see if we can track the object during a garbage collection. Listing 5-4 shows the source code of the application we will be utilizing.
Listing 5-4. Simple object with a finalize method
using System;
using System.Text;
using System.Runtime.InteropServices;

namespace Advanced.NET.Debugging.Chapter5
{
    class NativeEvent
    {
        private IntPtr nativeHandle;

        public IntPtr NativeHandle { get { return nativeHandle; } }

        public NativeEvent(string name)
        {
            nativeHandle = CreateEvent(IntPtr.Zero, false, true, name);
        }

        ~NativeEvent()
        {
            if (nativeHandle != IntPtr.Zero)
            {
                CloseHandle(nativeHandle);
                nativeHandle = IntPtr.Zero;
            }
        }

        [DllImport("kernel32.dll")]
        static extern IntPtr CreateEvent(IntPtr lpEventAttributes, bool bManualReset, bool bInitialState, string lpName);

        [DllImport("kernel32.dll")]
        static extern IntPtr CloseHandle(IntPtr lpEvent);
    }

    class Finalize
    {
        static void Main(string[] args)
        {
            Finalize f = new Finalize();
            f.Run();
        }

        public void Run()
        {
            NativeEvent nEvent = new NativeEvent("MyNewEvent");

            //
            // Use nEvent
            //

            nEvent = null;

            Console.WriteLine("Press any key to GC");
            Console.ReadKey();

            GC.Collect();

            Console.WriteLine("Press any key to GC");
            Console.ReadKey();

            GC.Collect();

            Console.WriteLine("Press any key to exit");
            Console.ReadKey();
        }
    }
}

The source code and binary for Listing 5-4 can be found in the following folders:
- Source code: C:\ADND\Chapter5\Finalize
- Binary: C:\ADNDBin\05Finalize.exe

Let's run the application under the debugger and, when the first Press any key to GC prompt appears, break execution and inspect the finalization queues using the FinalizeQueue command:
0:004> !FinalizeQueue SyncBlocks to be cleaned up: 0 MTA Interfaces to be released: 0 STA Interfaces to be released: 0 –––––––––––––––––––––––––––––––––– generation 0 has 6 finalizable objects (003d3160->003d3178) generation 1 has 0 finalizable objects (003d3160->003d3160) generation 2 has 0 finalizable objects (003d3160->003d3160) Ready for finalization 0 objects (003d3178->003d3178) Statistics: MT Count TotalSize Class Name 00123128 1 12 Advanced.NET.Debugging.Chapter5.NativeEvent 7911c9c8 1 20 Microsoft.Win32.SafeHandles.SafePEFileHandle 791037c0 1 20 Microsoft.Win32.SafeHandles.SafeFileMappingHandle 79103764 1 20 Microsoft.Win32.SafeHandles.SafeViewOfFileHandle 79101444 1 20 Microsoft.Win32.SafeHandles.SafeFileHandle 790fe704 1 56 System.Threading.Thread Total 6 objectsThere are several pieces of useful information in the output. First, the finalization queues for each generation are shown. In this particular case, generation 0 has 6 finalizable objects and generations 1 and 2 have none. For each of the finalization queues, the FinalizeQueue command also shows the address range of the queue itself for that particular generation. For example, generation 0's finalization queue starts at address 0x003d3160 and ends at address 0x003d3178. We can use the dd command to dump the queue as shown here:
0:004> dd 003d3160 l6
003d3160  01fc1df0 01fc5090 01fc5964 01fc5998
003d3170  01fc683c 01fc6850

The elements in the queue can be looked at further by using the do command. If we want to look at the object at address 0x01fc5964 in more detail, we would use the command shown here:
0:004> !do 01fc5964
Name: Advanced.NET.Debugging.Chapter5.NativeEvent
MethodTable: 00123128
EEClass: 00121804
Size: 12(0xc) bytes
 (C:\ADNDBin\05Finalize.exe)
Fields:
      MT    Field   Offset                 Type VT     Attr    Value Name
791016bc  4000001        4        System.IntPtr  1 instance      1f0 nativeHandle

The next piece of useful information from the FinalizeQueue command is the f-reachable queue, which is shown in the following output:
Ready for finalization 0 objects (000c3178->000c3178)

The output indicates that at this point there are no objects that are ready to be finalized. This makes perfect sense because a garbage collection has not yet occurred.
The final piece of output in the FinalizeQueue command is the statistics section, which shows a summarized list of all objects in either the finalization queue or the f-reachable queue.
Before we resume execution, we need to discuss the magic finalization thread that exists in all managed processes. What does the stack trace of this thread look like? To find the answer, use the ~*kn command to display the stack traces of all the threads in the process including frame numbers. In the output, one thread in particular looks interesting:
2 Id: 1a10.c10 Suspend: 1 Teb: 7ffdd000 Unfrozen # ChildEBP RetAddr 00 011cf604 77709254 ntdll!KiFastSystemCallRet 01 011cf608 7618c244 ntdll!ZwWaitForSingleObject+0xc 02 011cf678 79e789c6 KERNEL32!WaitForSingleObjectEx+0xbe 03 011cf6bc 79e7898f mscorwks!PEImage::LoadImage+0x1af 04 011cf70c 79e78944 mscorwks!CLREvent::WaitEx+0x117 05 011cf720 79ef2220 mscorwks!CLREvent::Wait+0x17 06 011cf73c 79fb997b mscorwks!WKS::WaitForFinalizerEvent+0x4a 07 011cf750 79ef3207 mscorwks!WKS::GCHeap::FinalizerThreadWorker+0x79 08 011cf764 79ef31a3 mscorwks!Thread::DoADCallBack+0x32a 09 011cf7f8 79ef30c3 mscorwks!Thread::ShouldChangeAbortToUnload+0xe3 0a 011cf834 79fb9643 mscorwks!Thread::ShouldChangeAbortToUnload+0x30a 0b 011cf85c 79fb960d mscorwks!ManagedThreadBase_NoADTransition+0x32 0c 011cf86c 79fba09b mscorwks!ManagedThreadBase::FinalizerBase+0xd 0d 011cf8a4 79f95a2e mscorwks!WKS::GCHeap::FinalizerThreadStart+0xbb 0e 011cf93c 76184911 mscorwks!Thread::intermediateThreadProc+0x49 0f 011cf948 776ee4b6 KERNEL32!BaseThreadInitThunk+0xe 10 011cf988 776ee489 ntdll!__RtlUserThreadStart+0x23 11 011cf9a0 00000000 ntdll!_RtlUserThreadStart+0x1bFrames 6 and 7 in the stack trace indicate that in fact this is the finalizer thread for the process. Frame 6 in particular shows that the thread is currently waiting for finalizer events (or objects that need to be finalized). Let's set a breakpoint on the return address of frame 6 (0x79fb997b), which will trigger any time the finalizer thread is awakened to perform work:
bp 79fb997bWhen the breakpoint is set, resume execution and press any key to trigger the first garbage collection. You'll notice that a breakpoint is hit, as shown in the following:
0:003> g Breakpoint 0 hit eax=00000001 ebx=00000001 ecx=7618c42d edx=77709a94 esi=00000000 edi=00493a48 eip=79fb997b esp=00b7f768 ebp=00b7f770 iopl=0 nv up ei pl nz na po nc cs=001b ss=0023 ds=0023 es=0023 fs=003b gs=0000 efl=00000202 mscorwks!WKS::GCHeap::FinalizerThreadWorker+0x79: 79fb997b 3bde cmp ebx,esiThe breakpoint corresponds to the finalizer thread breakpoint set earlier and indicates that the finalizer is ready to execute the Finalize methods on the objects in the f-reachable queue. How do we find out what objects are in the f-reachable queue? You guessed it: by using the FinalizeQueue command:
0:002> !FinalizeQueue SyncBlocks to be cleaned up: 0 MTA Interfaces to be released: 0 STA Interfaces to be released: 0 –––––––––––––––––––––––––––––––––– generation 0 has 0 finalizable objects (003d3170->003d3170) generation 1 has 4 finalizable objects (003d3160->003d3170) generation 2 has 0 finalizable objects (003d3160->003d3160) Ready for finalization 2 objects (003d3170->003d3178) Statistics: MT Count TotalSize Class Name 00123128 1 12 Advanced.NET.Debugging.Chapter5.NativeEvent 7911c9c8 1 20 Microsoft.Win32.SafeHandles.SafePEFileHandle 791037c0 1 20 Microsoft.Win32.SafeHandles.SafeFileMappingHandle 79103764 1 20 Microsoft.Win32.SafeHandles.SafeViewOfFileHandle 79101444 1 20 Microsoft.Win32.SafeHandles.SafeFileHandle 790fe704 1 56 System.Threading.ThreadThis time, the output states that there are two objects in the f-reachable queue, starting at address 0x003d3160, that the finalization thread is about to execute. If we dump out the contents of the f-reachable queue and each of the objects, we can see the following:
0:002> dd 003d3170 l2 003d3170 01fc5090 01fc5964 0:002> !do 01fc5090 Name: Microsoft.Win32.SafeHandles.SafePEFileHandle MethodTable: 7911c9c8 EEClass: 791fb61c Size: 20(0x14) bytes (C:\Windows\assembly\GAC_32\mscorlib\2.0.0.0__b77a5c561934e089\mscorlib.dll) Fields: MT Field Offset Type VT Attr Value Name 791016bc 40005c1 4 System.IntPtr 1 instance 3eab28 handle 79102290 40005c2 8 System.Int32 1 instance 4 _state 7910be50 40005c3 c System.Boolean 1 instance 1 _ownsHandle 7910be50 40005c4 d System.Boolean 1 instance 1 _fullyInitialized 0:002> !do01fc5964 Name: Advanced.NET.Debugging.Chapter5.NativeEvent MethodTable: 00123128 EEClass: 00121804 Size: 12(0xc) bytes (C:\ADNDBin\05Finalize.exe) Fields: MT Field Offset Type VT Attr Value Name 791016bc 4000001 4 System.IntPtr 1 instance 1f0 nativeHandleThe first object is of type SafePEFileHandle and the second object is of type NativeEvent, which happens to be the object we are interested in. If we resume execution, the finalizer thread executes the Finalize method of our NativeEvent class. What happens to the objects on the f-reachable queue after finalization has completed? Well, the objects are removed from the f-reachable queue, which renders them rootless; they will be collected during the next garbage collection.
This concludes our discussion of finalization. As you can see, there is a lot of work being done under the hood whenever a finalizable type comes into play. Not only does the CLR need additional data structures (such as the finalization queue and f-reachable queue), but it also spins up a dedicated thread to run the Finalize methods for the objects that are being collected. Furthermore, an object with a Finalize method does not get collected in just one garbage collection, but rather two, which in essence means that objects with Finalize methods always get promoted to generation 1 before they are truly dead, making them far more expensive objects to work with.
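The two-collection life cycle can also be observed from code. The following standalone snippet (an illustrative sketch, not one of the chapter's listings) forces a collection, waits for the finalization thread to drain the f-reachable queue via GC.WaitForPendingFinalizers, and then collects again so the finalized object's memory can actually be reclaimed:

using System;

class FinalizationCostDemo
{
    class Finalizable
    {
        ~Finalizable() { Console.WriteLine("Finalizer ran"); }
    }

    static void Main()
    {
        Finalizable f = new Finalizable();
        f = null;                        // object is now rootless, but it has a finalizer

        GC.Collect();                    // first collection: object moves to the f-reachable queue
        GC.WaitForPendingFinalizers();   // block until the finalization thread has run the finalizer
        GC.Collect();                    // second collection: the memory can now be reclaimed

        Console.WriteLine("Done");
    }
}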
Reclaiming GC Memory
We have discussed the GC in quite a bit of detail. We now know exactly what the GC does when an object is considered garbage. The one missing piece of information is what the GC does with the memory that becomes available after an object is garbage collected. Does the memory get put on some sort of free list and then reused when another allocation request arrives? Does the memory get freed? Is fragmentation ever a problem on the managed heap? The answer is a combination of all three. If a collection that occurs in generations 0 and 1 leaves a gap on the managed heap, the garbage collector compacts all live objects so that they reside next to each other and coalesces any free blocks on the managed heap into a larger block that is located after the last live object (starting at the current allocation pointer). Figure 5-8 shows an example of the compacting and coalescing.
Figure 5-8 Garbage collection compacting and coalescing phase
In Figure 5-8,
the initial state of the managed heap contains five rooted objects (A
through E). At some point during execution, objects B and D become
rootless and are candidates to be reclaimed during a garbage collection.
When the garbage collection occurs, the memory occupied by objects B
and D is reclaimed, which leads to gaps on the managed heap. To remove
these gaps, the garbage collector compacts the remaining live objects
(Obj A, C, and E) and coalesces the two free blocks (used to hold Obj B
and D) into one free block. Lastly, the current allocation pointer is
updated as a result of the compacting and coalescing.

The ephemeral segment contains both generation 0 and generation 1 (and also part of generation 2), but generation 2 can consist of multiple managed heap segments. As more and more objects make it to generation 2, the need to grow generation 2 also increases. The way that the CLR heap manager grows generation 2 is by allocating more segments. When objects in generation 2 are collected, the CLR heap manager decommits memory in the segments, and when a segment is no longer needed, it is freed entirely. In certain situations and allocation patterns, generation 2 grows and shrinks quite frequently, leading to a large number of calls to allocate and free virtual memory (the VirtualAlloc and VirtualFree APIs). Two common drawbacks of this approach are that these calls can be expensive (a transition to kernel mode is required) and that they can fragment the VM address space. As such, CLR 2.0 introduces a feature called VM hoarding, which essentially does not free segments but rather keeps them on a standby list to be utilized when more memory is required. To utilize the VM hoarding feature, the CLR host itself must specify that it wants to use it.
Because the cost of a compaction is directly proportional to the size of the object (the bigger the object, the costlier the compaction), the garbage collector introduces another type of heap called the large object heap (LOH). Objects that are large enough to severely hurt the performance of a compaction are placed on the LOH, which we will discuss next.
Large Object Heap
The large object heap (LOH) consists of objects that are greater than or equal to 85,000 bytes in size. The decision to separate objects of that size into their own heap is related to the fact that during the compacting phase of a garbage collection, the cost of compacting an object is directly proportional to the size of the object being compacted. Rather than having large objects on the standard heap eating up garbage collection time during compaction, the LOH was created. The LOH is best viewed as an extension of generation 2, and the LOH is collected only when a generation 2 collection occurs, implying that a collection of the LOH is only done during a full garbage collection. Because compacting large objects is very expensive, the GC avoids compacting the LOH altogether and instead uses a process known as sweeping, which maintains a free list to keep track of available memory in the LOH segment(s). Figure 5-9 shows an example of a LOH with two segments.
Figure 5-9 LOH example
Please note that although the LOH does not perform any compaction, it
does do coalescing of adjacent free blocks. That is, if you ever end up
with two free adjacent blocks, the GC coalesces those blocks into a
larger block and adds it to the free list (while also removing the two
smaller blocks).

To find out the current state of the LOH in the debugger, we can again use the eeheap –gc command, which includes details on the LOH:
0:004> !eeheap -gc Number of GC Heaps: 1 generation 0 starts at 0x01fc6c18 generation 1 starts at 0x01fc100c generation 2 starts at 0x01fc1000 ephemeral segment allocation context: none segment begin allocated size 00308030 790d8620 790f7d8c 0x0001f76c(128876) 01fc0000 01fc1000 01fc8c24 0x00007c24(31780) Large object heap starts at 0x02fc1000 segment begin allocated size 02fc0000 02fc1000 02fc3240 0x00002240(8768) Total Size 0x295d0(169424) –––––––––––––––––––––––––––––– GC Heap Size 0x295d0(169424)The LOH section in the command output shows the starting point of the LOH as well as per-segment information such as the segment, start, and end address of the segment and total size of the segment. In the preceding example, we can see that the LOH has one segment (0x02fc000) starting at address 0x02fc1000 and ending at 0x02fc3240 with a total size of 0x00002240. The last piece of information is the total size of all segments in the LOH. One interesting question related to the LOH is how the contents of the LOH can be dumped. There are a couple of options that both revolve around using DumpHeap command switches. The first switch of interest is the –min switch, which tells the DumpHeap command that you are only interested in objects of the specified size. Because we know that LOH objects are greater than or equal to 85,000 bytes in size, we can use the following command:
0:004> !DumpHeap -min 85000 Address MT Size 02c53250 7912dae8 100016 total 1 objects Statistics: MT Count TotalSize Class Name 7912dae8 1 100016 System.Byte[]Here, we can see that there is one object of size 100016 on the LOH. You can verify or convince yourself that the object is in fact on the LOH by looking at the address. If the address of the object falls within the LOH segments addresses, it must be located on the LOH (with the exception of free objects, which can reside both in the SOH as well as the LOH).
The next option we have is to specify a starting address for the DumpHeap command. If we specify the starting address of the LOH, we can ask the command to dump out all objects on the LOH. The switch to use is the –startAtLowerBound switch, which takes the address as a parameter. Using the same LOH as earlier, the following command can be used:
0:004> !DumpHeap -startAtLowerBound 02c51000 Address MT Size 02c51000 002a6360 16 Free 02c51010 7912d8f8 4096 02c52010 002a6360 16 Free 02c52020 7912d8f8 4096 02c53020 002a6360 16 Free 02c53030 7912d8f8 528 02c53240 002a6360 16 Free 02c53250 7912dae8 100016 02c6b900 002a6360 16 Free total 9 objects Statistics: MT Count TotalSize Class Name 002a6360 5 80 Free 7912d8f8 3 8720 System.Object[] 7912dae8 1 100016 System.Byte[] Total 9 objectsAgain, we see the object of size 100016, but even more interesting is that we see objects that are smaller than 85,000 bytes on the LOH. What are these objects and how did they end up on the LOH? The answer is that these very, very small objects are placed there by the CLR heap manager, which uses them for its own purposes. Generally speaking, you always see a select few objects with a size less than 85,000 bytes exclusively used by the GC.
Let's take a look at a small sample application that allocates a single large object of size 10,000 bytes (see Listing 5-5). We will then use the debuggers to see if we can locate the object on the LOH and see what happens when the object is collected.
Listing 5-5. Sample application demonstrating LOH
using System;
using System.Text;
using System.Runtime.InteropServices;

namespace Advanced.NET.Debugging.Chapter5
{
    class LOH
    {
        static void Main(string[] args)
        {
            LOH l = new LOH();
            l.Run();
        }

        public void Run()
        {
            byte[] b = null;

            Console.WriteLine("Press any key to allocate on LOH");
            Console.ReadKey();

            b = new byte[100000];

            Console.WriteLine("Press any key to GC");
            Console.ReadKey();

            b = null;
            GC.Collect();

            Console.WriteLine("Press any key to exit");
            Console.ReadKey();
        }
    }
}

The source code and binary for Listing 5-5 can be found in the following folders:
- Source code: C:\ADND\Chapter5\LOH
- Binary: C:\ADNDBin\05LOH.exe
0:004> !eeheap -gc Number of GC Heaps: 1 generation 0 starts at 0x01f01018 generation 1 starts at 0x01f0100c generation 2 starts at 0x01f01000 ephemeral segment allocation context: none segment begin allocated size 004a8008 790d8620 790f7d8c 0x0001f76c(128876) 01f00000 01f01000 01f5c334 0x0005b334(373556) Large object heap starts at 0x02f01000 segment begin allocated size 02f00000 02f01000 02f03250 0x00002250(8784) Total Size 0x7ccf0(511216) –––––––––––––––––––––––––––––– GC Heap Size 0x7ccf0(511216) 0:004> !dumpheap -startatlowerbound 02f01000 Address MT Size 02f01000 00496360 16 Free 02f01010 7912d8f8 4096 02f02010 00496360 16 Free 02f02020 7912d8f8 4096 02f03020 00496360 16 Free 02f03030 7912d8f8 528 02f03240 00496360 16 Free total 7 objects Statistics: MT Count TotalSize Class Name 00496360 4 64 Free 7912d8f8 3 8720 System.Object[] Total 7 objectsWe start by finding the starting point of the LOH by using the eeheap command. The starting point in this case is 0x02f01000. Then, we feed the starting address to the dumpheap command using the –startatlowerbound switch to output all objects on the LOH. In the output, we can see that the only objects that are on the LOH are the mysterious object arrays that are smaller than 85,000 bytes. Other than that, we have no other objects present. Next, resume execution and again manually break execution when the Press any key to GC is shown.
We issue the same dumpheap command as before to see if we can spot our 100KB allocation:
0:003> !dumpheap -startatlowerbound 02f01000 Address MT Size 02f01000 00496360 16 Free 02f01010 7912d8f8 4096 02f02010 00496360 16 Free 02f02020 7912d8f8 4096 02f03020 00496360 16 Free 02f03030 7912d8f8 528 02f03240 00496360 16 Free 02f03250 7912dae8 100016 02f1b900 00496360 16 Free total 9 objects Statistics: MT Count TotalSize Class Name 00496360 5 80 Free 7912d8f8 3 8720 System.Object[] 7912dae8 1 100016 System.Byte[] Total 9 objectsWe can see that our allocation is stored at address 0x02f03250 on the LOH. Next, we resume execution until we see the Press any key to exit prompt. At this point, a garbage collection has occurred, so let's see what the LOH looks like by using the same dumpheap command again:
0:003> !dumpheap -startatlowerbound 02f01000 Address MT Size 02f01000 00496360 16 Free 02f01010 7912d8f8 4096 02f02010 00496360 16 Free 02f02020 7912d8f8 4096 02f03020 00496360 16 Free 02f03030 7912d8f8 528 total 6 objects Statistics: MT Count TotalSize Class Name 00496360 3 48 Free 7912d8f8 3 8720 System.Object[]This time, we can see how the object has been removed from the LOH and the free blocks available as a result of the collection.
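As a quick programmatic cross-check, a freshly allocated large object reports generation 2 from GC.GetGeneration because the LOH is collected along with generation 2. The following standalone snippet (an illustrative sketch, not one of the chapter's listings) contrasts a small and a large array; the exact behavior right at the 85,000-byte boundary depends on object overhead, so the sizes chosen here stay well clear of it.

using System;

class LohThresholdDemo
{
    static void Main()
    {
        byte[] small = new byte[80000];    // well under the LOH threshold
        byte[] large = new byte[100000];   // well over the LOH threshold

        // A fresh small array starts out in generation 0; a fresh large array
        // reports generation 2 because the LOH is collected with generation 2.
        Console.WriteLine(GC.GetGeneration(small));   // typically 0
        Console.WriteLine(GC.GetGeneration(large));   // typically 2
    }
}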
Pinning
As we saw in the Reclaiming GC Memory section, the garbage collector employs a technique known as compaction to reduce fragmentation on the GC heap. When a compaction occurs, objects may end up moving around on the heap so that they can be placed together, thereby avoiding gaps. As part of the object move, because the address of the object changes, all references to the object are also updated. This works well assuming all references to the object are contained within the CLR, but quite often it is necessary for .NET applications to work outside of the boundary of the CLR by using the interoperability services (such as platform invocation or COM interoperability). If a reference to a managed object is passed to an underlying native API, the object might be moved while the native API is reading and/or writing to the memory, causing serious problems because the CLR clearly cannot notify the native API of the address change. Figure 5-10 illustrates the problem.
Figure 5-10 Interoperability services and GC compaction problem
From the flow in Figure 5-10, we can see that the initial state of the managed heap includes five objects starting with Obj A at address 0x02000000. At a certain point, a platform invocation call to an asynchronous native API is required. Furthermore, the address of Obj C (0x02000090) needs to be passed to the API. Upon successfully calling the asynchronous native API, a garbage collection occurs causing Obj A and Obj B
to be collected. This leaves a gap of two free objects on the managed
heap and the garbage collector dutifully rectifies the problem by
compacting the managed heap and therefore moving Obj C to address 0x02000000.
It also coalesces the two free blocks and places them at the end of the
heap. After the garbage collection has finished, the asynchronous API
call we made earlier decides to write to the address initially passed to
it (0x02000090), which originally held Obj C. As you
can see, with the asynchronous API writing to that address, we will
experience a managed heap corruption as the memory is no longer occupied
by Obj C.

Because the invocation of native code is such a common task, a solution had to be devised that allowed for safe invocation in light of a compacting garbage collector. The solution is called pinning and refers to the capability to pin specific objects on the managed heap. When an object is pinned, the garbage collector will not move the object for any reason until the object is unpinned. If Obj C in Figure 5-10 had been pinned prior to invoking the asynchronous native API, the managed heap corruption would not have occurred because the garbage collector would not have moved Obj C during the compaction phase.
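It is worth noting that for the common case of a synchronous native call, C# also offers the fixed statement, which pins an object only for the duration of a block. The following is a minimal sketch (not one of the chapter's listings; it must be compiled with /unsafe, and the FillBuffer import from a hypothetical native.dll is made up purely for illustration):

using System;
using System.Runtime.InteropServices;

class FixedPinningSketch
{
    // Hypothetical native function (illustration only) that writes into a caller-supplied buffer.
    [DllImport("native.dll")]
    static extern unsafe void FillBuffer(byte* buffer, int size);

    static unsafe void Main()
    {
        byte[] data = new byte[256];

        // The fixed statement pins 'data' for the duration of the block, so the
        // GC cannot move the array while native code writes into it.
        fixed (byte* p = data)
        {
            FillBuffer(p, data.Length);
        }
        // When the block exits, 'data' is automatically unpinned.
    }
}

This only helps when the native call completes before the fixed block exits; for asynchronous scenarios like the one in Figure 5-10, a longer-lived pin obtained through GCHandle, as the sample that follows demonstrates, is required.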
Let's take a look at an example of a simple application that performs pinning and see what it looks like in the debugger. Listing 5-6 shows the source code of the application.
Listing 5-6. Sample application using pinning
using System;
using System.Text;
using System.Runtime.InteropServices;

namespace Advanced.NET.Debugging.Chapter5
{
    class Pinning
    {
        static void Main(string[] args)
        {
            Pinning p = new Pinning();
            p.Run();
        }

        public void Run()
        {
            SByte[] b1 = null;
            SByte[] b2 = null;
            SByte[] b3 = null;

            Console.WriteLine("Press any key to alloc");
            Console.ReadKey();

            b1 = new SByte[100];
            b2 = new SByte[200];
            b3 = new SByte[300];

            GCHandle h1 = GCHandle.Alloc(b1, GCHandleType.Pinned);
            GCHandle h2 = GCHandle.Alloc(b2, GCHandleType.Pinned);
            GCHandle h3 = GCHandle.Alloc(b3, GCHandleType.Pinned);

            Console.WriteLine("Press any key to GC");
            Console.ReadKey();

            GC.Collect();

            Console.WriteLine("Press any key to exit");
            Console.ReadKey();

            h1.Free();
            h2.Free();
            h3.Free();
        }
    }
}

The source code and binary for Listing 5-6 can be found in the following folders:
- Source code: C:\ADND\Chapter5\Pinning
- Binary: C:\ADNDBin\05Pinning.exe
Resume execution of the application until you see the Press any key to GC prompt. At this point, we manually break execution and use a command called GCHandles. The GCHandles command displays a list of all the handles available in the process:
0:004> !GCHandles GC Handle Statistics: Strong Handles: 15 Pinned Handles: 7 Async Pinned Handles: 0 Ref Count Handles: 0 Weak Long Handles: 0 Weak Short Handles: 1 Other Handles: 0 Statistics: MT Count TotalSize Class Name 790fd0f0 1 12 System.Object 790feba4 1 28 System.SharedStatics 790fcc48 2 48 System.Reflection.Assembly 790fe17c 1 72 System.ExecutionEngineException 790fe0e0 1 72 System.StackOverflowException 790fe044 1 72 System.OutOfMemoryException 790fed00 1 100 System.AppDomain 790fe704 2 112 System.Threading.Thread 79100a18 4 144 System.Security.PermissionSet 790fe284 2 144 System.Threading.ThreadAbortException 7912ee44 3 636 System.SByte[] 7912d8f8 4 8736 System.Object[] Total 23 objectsThe GCHandles command walks the handle tables and looks for all types of different handles (strong, weak, pinned, etc.) and displays a summary of the results as well as a statistical section with detailed information on each type found. In the preceding output, we can see that we have 15 strong handles, 7 pinned handles, and 1 weak short handle. In addition, in the Statistics section, we can see the three SByte arrays that we allocated and pinned. The GCHandles command provides a good overview of the handle activity in any given process, but if further information is required, such as the type of handle for each of the types listed in the Statistics section, we have to use an additional command called objsize. One of the functions of the objsize command is to output the size of the object passed in as an argument. If no arguments are specified, it scans all the referenced objects in the process and outputs the size as well as other useful information:
0:004> !objsize
Scan Thread 0 OSTHread 2558
ESP:2fed54: sizeof(01d9599c) = 20 ( 0x14) bytes (Microsoft.Win32.SafeHandles.SafeFileHandle)
ESP:2fee18: sizeof(01d96d9c) = 312 ( 0x138) bytes (System.SByte[])
ESP:2fee20: sizeof(01d96c58) = 112 ( 0x70) bytes (System.SByte[])
ESP:2fee24: sizeof(01d96cc8) = 212 ( 0xd4) bytes (System.SByte[])
ESP:2fee30: sizeof(01d958b4) = 12 ( 0xc) bytes (Advanced.NET.Debugging.Chapter5.Pinning)
...
...
...
Scan Thread 2 OSTHread 2c80
DOMAIN(004DFD10):HANDLE(Strong):1c119c: sizeof(01d958a4) = 16 ( 0x10) bytes (System.Object[])
...
...
...
DOMAIN(004DFD10):HANDLE(WeakSh):1c12fc: sizeof(01d91de8) = 56 ( 0x38) bytes (System.Threading.Thread)
DOMAIN(004DFD10):HANDLE(Pinned):1c13e4: sizeof(01d96d9c) = 312 ( 0x138) bytes (System.SByte[])
DOMAIN(004DFD10):HANDLE(Pinned):1c13e8: sizeof(01d96cc8) = 212 ( 0xd4) bytes (System.SByte[])
DOMAIN(004DFD10):HANDLE(Pinned):1c13ec: sizeof(01d96c58) = 112 ( 0x70) bytes (System.SByte[])
DOMAIN(004DFD10):HANDLE(Pinned):1c13f0: sizeof(02d93030) = 708 ( 0x2c4) bytes (System.Object[])
DOMAIN(004DFD10):HANDLE(Pinned):1c13f4: sizeof(02d92020) = 4276 ( 0x10b4) bytes (System.Object[])
DOMAIN(004DFD10):HANDLE(Pinned):1c13f8: sizeof(01d9118c) = 12 ( 0xc) bytes (System.Object)
DOMAIN(004DFD10):HANDLE(Pinned):1c13fc: sizeof(02d91010) = 19332 ( 0x4b84) bytes (System.Object[])

The output has been abbreviated, but clearly shows that our SByte arrays have been pinned as shown by HANDLE(Pinned).
Although the notion of pinning objects solves the problem of movable objects during native code invocations, it presents a problem to the garbage collector: fragmentation, one of the very problems that compaction is meant to solve. If many pinned objects are interleaved with free blocks on the managed heap, situations may occur where there isn't enough contiguous free space available to satisfy an allocation request. Figure 5-11 shows a hypothetical example of a managed heap fragmented by excessive pinning.
Figure 5-11 Hypothetical example of a fragmented managed heap
In the layout illustrated in Figure 5-11, we can see several smaller free blocks intertwined with live objects (Obj A through D). If a garbage collection occurs, the layout of the managed heap will remain unchanged. The reason is simple: The garbage collector cannot perform a compaction because all of the live objects are pinned and hence cannot be moved. Because the free blocks are not adjacent, it cannot coalesce them either. Even though free blocks are available, memory allocation requests may in fact fail if the size of the requested allocation is greater than the largest contiguous free block (32 bytes in this hypothetical layout).
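To make the effect concrete, here is a minimal sketch (the array sizes and counts are arbitrary, chosen only to make the interleaving visible) that interleaves pinned and unpinned allocations and then forces a collection, leaving the heap peppered with small free gaps that cannot be compacted away:

using System;
using System.Runtime.InteropServices;

class FragmentationSketch
{
    static void Main()
    {
        // Arbitrary sizes and counts purely for illustration.
        byte[][] buffers = new byte[1000][];
        GCHandle[] pins = new GCHandle[500];

        // Interleave pinned and unpinned allocations.
        for (int i = 0; i < buffers.Length; i++)
        {
            buffers[i] = new byte[1024];
            if (i % 2 == 0)
            {
                pins[i / 2] = GCHandle.Alloc(buffers[i], GCHandleType.Pinned);
            }
        }

        // Drop the unpinned buffers and force a collection. The pinned buffers
        // cannot be moved, so the freed slots between them remain as small,
        // non-adjacent gaps instead of being compacted into one large block.
        for (int i = 1; i < buffers.Length; i += 2)
        {
            buffers[i] = null;
        }
        GC.Collect();

        Console.WriteLine("Heap now contains pinned objects interleaved with free gaps");
        Console.ReadKey();

        // Unpin everything once the experiment (or the native call) is done.
        for (int i = 0; i < pins.Length; i++)
        {
            pins[i].Free();
        }
    }
}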
We will take a look at a real-world managed heap fragmentation problem in detail later in the chapter.

Garbage Collection Modes
The last topic we will discuss is the set of modes that the garbage collector can run in. There are three primary modes of operation (a minimal configuration sketch follows the list):
- Nonconcurrent workstation
- Concurrent workstation
- Server
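Which mode is used is decided by the hosting process or by configuration rather than in code. As a minimal sketch (the file name MyApp.exe.config is only a placeholder), a .NET Framework application configuration file could request the server mode and disable concurrent collection as follows:

<?xml version="1.0" encoding="utf-8"?>
<configuration>
  <runtime>
    <!-- Request the server flavor of the GC (per-CPU heaps and GC threads). -->
    <gcServer enabled="true"/>
    <!-- Disable concurrent collection. -->
    <gcConcurrent enabled="false"/>
  </runtime>
</configuration>

On a single-processor machine the runtime typically falls back to the workstation flavor even if server GC is requested, so the setting is best treated as a request rather than a guarantee.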
I think I found a small typo – “Denormalization can be defined as coping of the same data..” – “coping” should be “copying”.
When relational databases began, there was a thing called ISAM, which was the system du jour for large databases at the time. The interesting part about ISAM in reference to NoSQL was that the application code navigated indexes to find the data, and the indexes were designed into the application. Changing these became a nightmare: the tight coupling of data and access methods was very difficult to change, and often it was easier to chuck it and start again. Relational databases provided the freedom to model the data, and then, as the system evolved and you needed to provide different queries for reporting and so on, you could tune your queries by adding indexes, adding columns here and there, and modifying the schema.
There was much argument at the time from the ISAM guys saying this couldn’t possibly work: how can you optimise your access paths without knowing them beforehand, query optimisation will be too slow, and so on. It’s funny to be seeing the same old arguments going back the other way.
I’ll pick one of your points above among the many – ‘Relational modeling is typically driven by the structure of available data’ – this is not true. Relational databases are designed with a view of what data needs to be stored for the application. You could be referring to queries – in which case this would be true, since querying and reporting, by their nature, can only give you the data you have. There is an incredible body of work on how to design databases to suit applications, not the other way round.
High transaction rates are not a problem with RDBMSes, and likewise availability, but here’s the rub: the present incumbents (Oracle et al.) are very expensive, and they are also complex – the black arts of being a DBA are legendary. So creating a high-availability platform using a commercial RDBMS is expensive because of licensing and specialized skills. So what’s happening, IMHO, is that we’re going back to the bad old days – not because it’s better, but because it’s cheaper: NoSQL is free, and for uncomplicated data models – like shopping sites – it probably will do. *However*, know what you’re trading off, because as your site gets larger and more complicated access paths are required – say the CEO wants sales reports every two weeks but your data structure stores them monthly – it will hit the wall.
So like everything, NoSQL is a trade-off – in this case cost versus coding hours, amongst other things. If your coding time is cheap, then by all means reproduce all the things an RDBMS does; however, know that you’ll be delaying the inevitable: something will come along that your NoSQL model doesn’t cover, and changing the model isn’t as easy as typing ‘Create Index…’.
A very nice exposition of NoSQL though, and it does have its place, much the same way as Microsoft Access does :-)
I would like to mention that many NoSQL systems provide data introspection tools, so it is often possible to do ad-hoc queries. Of course, many applications use their own binary data format, but in this case custom introspection tools are often developed.
NoSQL data modeling often starts from the application-specific queries as opposed to relational modeling:
Relational modeling is typically driven by structure of available data, the main design theme is ”What answers do I have?”
Relational databases are not very convenient for hierarchical or graph-like data modeling and processing
and I’m afraid all these statements are false. The implication is that RDBMSes are deficient and NoSQL is superior: again an incorrect inference, and that was my point in calling NoSQL a back-to-the-past technology, for the reasons I gave. As I’ve said, NoSQL has its place, but look at it with open eyes. If you said that NoSQL can be used for performance and cost reasons in particular cases, then I would be less inclined to argue.
I admit that the third statement may be controversial, but I don’t think that a judgement like “are not very convenient” can be true or false. There is a group of people, including me, who think that graph databases are more convenient than RDBMSes in certain cases, but of course there are different opinions.
I completely agree with you that performance, scalability and cost reasons are the main drivers of NoSQL. Obviously, complex data modeling is not an end in itself ;)
The real equation is that storage of data in a structured store represents a significant investment in software, hardware, and particularly in human capital. Then maintenance and updates to this store require significant additional investment of human capital.
It’s justified and it really works well, but things have changed. Here’s how:
- Cost of storage hardware has decreased by 1400x in past years.
- Cost of network transport has decreased by around 400x in the same time period.
- Data read times have improved only 12x in that same time.
So to get at data at the scale we now see becoming common, the data must be stored in a fashion that can be distributed over a large pool of compute resources.
In light of that, key-value stores now find themselves being the Belle of the Ball again.
The cost of insertion and maintenance for data in a Hadoop cluster is lower than in an RDBMS by a couple of orders of magnitude. It’s not completely clear yet, but the cost of design and maintenance will probably show a full order of magnitude improvement.
Ignore this paradigm evolution at your own peril.
I worked with eCommerce systems that used Nested Sets or Materialized Paths, but these structures were chosen to meet high performance requirements, say, thousands of requests/sec against a set of tens of thousands of products. But high performance doesn’t come for free – these structures are relatively difficult to implement and update. Of course I don’t know your functional and performance requirements, but from what you said (small data size, statistical processing) I can advise you to consider something straightforward – per-product documents in MongoDB, a search engine like Solr (which can be helpful for free-text data like product descriptions), or a standard relational DB. Complicated modeling should be avoided unless it is unavoidable because of performance requirements or whatever.
http://blog.mikemccandless.com/2012/01/searching-relational-content-with.html
https://issues.apache.org/jira/browse/SOLR-3076
These are worth mentioning too.
Wrong. I have done rdb modeling for 20 years and we don’t do that. Statements like this undermine the integrity of your otherwise useful article.
Can I ask you what exactly is wrong?
Really nice overview of data modelling techniques for NoSQL databases.
If you don’t mind, I’d like to translate your article into Korean and publish it on my community website.
And I promise to respect all your rights and never use it for profit.
Would you give me permission to do so, please?
Thanks for your time and concern. I’m looking forward to hearing from you.
Great, long discussion on NoSQL data modeling.
Existing NoSQL systems are little more than bucketed hashtables, and provide terrible consistency properties (i.e. eventual consistency) as a result. HyperDex provides linearizability. It does so while supporting retrieval by secondary attributes. And it does so with very high performance.
I liked the first few bullet points, but the article went south around the middle, when it started assuming that the only way to support retrieval by secondary attributes is by building indices. This is a very RDBMS-centric view, and it poses problems for consistency. HyperDex’s internal data organization (called hyperspace hashing) enables it to sidestep these problems.
Overall, the author needs to do some further reading on recent developments in the NoSQL field.
Also, a note on all these controversies of RDB vs. NoSQL and How vs. What:
By saying “What answers do I have?” I think what Ilya meant is that SQL is by nature a declarative language. SQL tends to specify “what should be accomplished” without worrying about what technologies are used to produce the query results, and an RDBMS uses SQL to manipulate and define data.
On the other hand NoSQL tends to focus on “how” to process data based on data access patterns.
NoSQL could also be declarative; for example, by adding Hive on top of HBase, SQL can be used to query a NoSQL system.
An RDB could also be procedural, by using PL/SQL.
So from my point of view, the boundary between RDB and NoSQL being “what” versus “how” is getting less definite over time.
Also, (Voice in the Wind) mentioned we’re going back to the bad old days, but nothing is going back. Whether NoSQL uses denormalization or not, it is a perfectly normal chronological evolution of DB systems. In fact, the 5th RDB normal form is all about denormalization. NoSQL became a recent trend in DB systems because many DB engineers felt performance and consistency issues with RDBs in complex operations on large data.
Who knows? In a couple of years RDBs may find the right hardware and technologies to support large data.
As (Voice in the Wind) said, all DBs currently have trade-offs and their place. Even within NoSQL, key-value stores, Bigtable, and MongoDB all have their pros and cons.
Unfortunately, DB engineers today will need to be thoroughly accustomed to all of them to make the right decisions for business needs.
I need full information on NoSQL; I need to give a seminar on NoSQL. Please email me important docs.
ling…thanks!
Just a small typo: ”entires can be partitioned across multiple servers” – I guess you meant “entries”.
{
“skill” : “Math”,
“level” : “Low”,
“name” : “John”
}
{
“skill” : “Poetry”,
“level” : “High”,
“name” : “John”
}
Then you could run the query “skill:Math AND level:High”, and then remove all duplicate names.
“Relational modeling is typically driven by structure of available data” … Statements like this undermine the integrity of your otherwise useful article.
Robert,
Can I ask you what exactly is wrong?
I think a more accurate statement about relational modeling would be something like:
Relational modeling is typically driven by modeling the business.
Once the relational data model represents the business, update anomalies are avoided and reporting is just a matter of joining the appropriate structures (with the noted performance issues at scale).
I agree each modeling technique and platform has an appropriate use, based on data volumes, available funding, data sources, and access patterns.
Excellent post, thank you Ilya for putting in the effort to make it so thorough and professional!
Great post. I’m reblogging this for my reference and of course for spreading the information to others.
A good blog post. If it had even more photos, it might be even better.
Thanks -Seymour