JL Computer Consultancy

How a Distributed Lock Manager Works

Prior to Aug 1999


The Distributed Lock Manager is a critical adjunct to Oracle Parallel Server. I have already mentioned that the multiple nodes of a parallel server system have to be able to synchronise themselves through Parallel Cache Management (PCM) locks and non-PCM locks. This article drops a level below PCM and non-PCM locks to describe the inter-node communication and in-memory objects that allow such locks to exist.

There are a few important points to note before we start. First and foremost - these notes outline the DLM package for one particular version of an operating system running one specific version of Oracle (specifically one of the versions of Siemens' Reliant Unix, running Oracle 7.3). Operating systems move on, and Oracle has moved on dramatically, so if you are running Oracle 8 on a different platform I cannot guarantee the level of similarity between what I describe and what you see.

Secondly - bear in mind that there are typically three levels of operation to consider - Oracle's internal handling of locking (e.g. row locks etc.), PCM and non-PCM locks, and the DLM communications method. The first is completely independent of the second (and third); the second and third are very closely related, to the extent that you could consider one to be the logical representation and the other the physical representation of the same set of events. PCM/non-PCM locks are in some senses Oracle's view of what is going on, whilst DLM locks are the same thing from the point of view of the operating system. (This distinction is even finer in Oracle 8, where Oracle has taken the DLM inside Oracle, and left an even smaller component of the action to the operating system.)

No doubt I should split hairs carefully all the way through this article and qualify all my remarks with 'in some cases', 'it is probably the case', etc.; to keep the text moving on, I am going to ignore such conventions and write the article as if it were the definitive guide rather than a specific example with a number of general features.

The DLM Strategy:

A Distributed Lock Manager has two main components; a set of resources and a directory of those resources. The resources are compound objects - they identify 'something' and allow processes to queue up for locks on that 'something': in fact when setting up the DLM the two most critical parameters are the number of resources and the number of locks that the DLM will reserve memory for on each node in the cluster. Unfortunately the Oracle manuals have always been a bit sloppy when distinguishing between Resources (the things to be locked) and Locks (the things doing the locking), in particular the word 'lock' is often used when the word 'resource' is the correct expression.
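As a purely illustrative sketch - the field names and layout below are my own invention in Python, not the real internal structures of any DLM - a resource can be pictured as a named object carrying two queues of locks:

    from collections import deque
    from dataclasses import dataclass, field

    @dataclass
    class Lock:
        owner_node: int     # node whose process is asking for, or holding, the lock
        mode: str           # e.g. 'EX' (exclusive) or 'SH' (shared)

    @dataclass
    class Resource:
        name: str                                          # e.g. 'QC 13012'
        granted: deque = field(default_factory=deque)      # the 'granted lock' queue
        requested: deque = field(default_factory=deque)    # the 'requested lock' queue

In effect, the two critical DLM start-up parameters decide how many Resource objects and how many Lock objects each node reserves memory for.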

The directory is simply a list, hashed across all the nodes in the cluster, which takes the name of a resource and returns the node and address of that resource. This can best be explained through an example (a rough code sketch follows the steps):

1. A process on node 1 wants an exclusive lock on objectX.

2. Through a suitable hash function the process determines that the directory entry for the necessary DLM resource will be on node 7.

3. By messaging the directory entry on node 7, the process determines that the relevant DLM resource already exists and is located on node 13, with resource ID 'QC 13012'.

4. The process sends a message to node 13 requesting an exclusive lock on resource QC 13012. The request goes into the 'requested lock' queue for the resource.

5. Eventually (we assume) the request is granted, the request moves from the 'requested lock' queue into the 'granted lock' queue for the resource, and a message is sent to node 1 telling it that it now has the necessary lock.
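A hedged sketch of this two-step lookup, in the same illustrative style (the hash function, the per-node dictionaries and the node numbers are all assumptions made up for the example):

    NODES = 16                                    # assumed cluster size

    def directory_node(resource_name: str) -> int:
        """Hash the resource name to the node holding its directory entry."""
        return sum(map(ord, resource_name)) % NODES

    # One directory slice per node: resource name -> node currently mastering the resource.
    directory = {n: {} for n in range(NODES)}

    # Pretend the resource for 'objectX' already exists and is mastered on node 13.
    directory[directory_node('objectX')]['objectX'] = 13

    def find_master(resource_name: str) -> int:
        """First hop: message the directory node and ask where the resource lives."""
        dir_node = directory_node(resource_name)
        return directory[dir_node][resource_name]

    # Second hop: a message to the master node (13 here) asking for an exclusive
    # lock to be placed on the resource's 'requested lock' queue.
    print(find_master('objectX'))                 # -> 13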

This two-step 'directory to resource' link may seem unduly complicated, and in some cases it is, but it is necessary because Oracle needs to use both STATIC and DYNAMIC resources. Traditionally PCM locks equate to static DLM resources, and non-PCM locks equate to dynamic DLM resources, but with the advent of fine-grained locking PCM locks may also be dynamic DLM resources.

Non-PCM locks and PCM locks - both hashed and fine-grained - will be the subject of a further note.

Note: I have simplified matters somewhat by suggesting that the interested Oracle process sends messages from node to node and receives answers. In fact a process communicates with its local DLM process (usually the LCK0 process), and the LCK0 processes communicate with each other. These communications can be either synchronous (wait for reply) or asynchronous (get posted with reply, or go back and check at intervals).
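Purely as an illustration of the two waiting styles (the real interface is the DLM API and the LCK0 process, not Python threads; everything named below is invented), the difference might be sketched like this:

    import queue
    import threading

    reply_queue = queue.Queue()      # stands in for the channel back from the local LCK0 process

    def lck0_request(message: str) -> None:
        """Pretend LCK0 call: the reply is posted back a little later."""
        threading.Timer(0.1, reply_queue.put, args=[f"granted: {message}"]).start()

    # Synchronous style: send the request and wait for the reply.
    lck0_request("EX lock on QC 13012")
    print(reply_queue.get())                       # blocks until the reply arrives

    # Asynchronous style: send the request, then go back and check at intervals.
    lck0_request("EX lock on QC 13013")
    while True:
        try:
            print(reply_queue.get(timeout=0.05))   # check back periodically
            break
        except queue.Empty:
            pass                                   # carry on with other work meanwhile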

Let us take a pair of concrete examples to demonstrate DLM activity in more detail. The first will relate to a dynamically allocated lock, the second to a statically allocated lock.

A transaction lock (Oracle v$lock type TX)

Under single-instance Oracle, all sorts of locking activity is governed through Enqueues, many of which are visible through the v$lock view. One of the commonest enqueues is probably the TX (transaction) enqueue - this is actually a type of lock on a rollback block which is used to ensure that a transaction which needs to change a row locked by another transaction has to wait for that other transaction to complete.

In an OPS system, the two transactions that are interested in the same row (hence the same rollback block) may be on different nodes (say nodes A and B), so the internal Enqueue mechanism has to be made public by exporting it to the DLM. The steps are a simplified case of the sequence above:

1. The process on node A wants enqueue TX, 3523, 142 (see v$lock.type, id1, id2).

2. Node A determines that node C holds the directory entry for the appropriate DLM resource.

3. Node C informs node A that no other node has the resource, so node A creates the resource and attaches an exclusive lock to the 'granted locks' queue on that resource. The location of the resource (the master) is stored in the directory at node C, and node A continues with its work.
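Continuing the earlier sketch (same invented structures and hash function, still not the real DLM interface), the 'no other node has it, so create it locally' branch might look like this:

    from collections import deque

    NODES = 16

    def directory_node(name: str) -> int:
        return sum(map(ord, name)) % NODES

    directory = {n: {} for n in range(NODES)}      # resource name -> master node
    resources = {n: {} for n in range(NODES)}      # resources mastered on each node

    def create_local_resource(local_node: int, name: str, mode: str = 'EX') -> None:
        """A dynamic resource: the first interested node masters it and registers it."""
        dir_node = directory_node(name)
        assert name not in directory[dir_node]     # the directory confirmed nobody has it yet
        directory[dir_node][name] = local_node     # tell the directory node who the master is
        resources[local_node][name] = {'granted':   deque([(local_node, mode)]),
                                       'requested': deque()}

    # Node A (node 0 here) exports its TX enqueue to the DLM and carries on working.
    create_local_resource(0, 'TX 3523,142')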

At this point the local node (node A) can carry on with its processing in the normal way, and in most cases it will eventually commit, release the lock on the DLM resource, tell node C that the resource has been cleared, and node C will clear the directory entry. In a well-balanced OPS system this chain of events will happen for most Enqueues on most occasions (except for dictionary cache enqueues - more on that in another article). In other words, the DLM introduces a small overhead along the lines of 'I want to create an enqueue, so I'll need to create a local resource - where do I find the directory entry that tells all the other nodes that I've done this?'.

However - assume now that node B wants to interfere with a row currently locked by the transaction on node A - it needs to queue on the TX enqueue for that transaction. In a single instance we would see the same TX lock ('TX', 3523, 142) reported in v$lock twice, once with lock mode 6, once with request mode 6; in OPS, though, the waiter is on a different node from the holder, so it won't be visible in the same instance - instead node B goes to the DLM with the following steps:

1. The process on node B wants enqueue TX, 3523, 142 (see v$lock.type, id1, id2).

2. Node B determines that node C holds the directory entry for the appropriate DLM resource.

3. Node C informs node B that the resource already exists and is mastered on node A.

4. Node B sends a message to node A requesting an exclusive lock on the DLM resource. Node A replies with a message that the resource is not available for an exclusive lock, and node B cycles into a wait loop.

Eventually the transaction on node A commits, drops its enqueue, and releases the exclusive lock on the DLM resource, at which point the request on the 'requested lock' queue gets promoted to the 'granted lock' queue and node B is told that it now has an exclusive lock on the DLM resource. Then node B commits, drops its enqueue, and releases its exclusive lock on the DLM resource, at which point the DLM resource is free and node A eliminates the resource and sends a message to node C to remove it from the directory.
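The release path - promote the next waiter if there is one, otherwise dismantle the resource and its directory entry - might be sketched as follows (again with invented structures; 'node A', 'node B' and the queue contents simply follow the example above):

    from collections import deque

    directory_entry = {'TX 3523,142': 'node A'}               # held at node C
    resource = {'granted':   deque([('node A', 'EX')]),       # node A holds the enqueue
                'requested': deque([('node B', 'EX')])}       # node B is queued behind it

    def release(holder: str, name: str) -> None:
        """Drop the holder's lock; promote the next waiter, or dismantle the resource."""
        resource['granted'].remove((holder, 'EX'))
        if resource['requested']:
            waiter = resource['requested'].popleft()
            resource['granted'].append(waiter)                # 'requested lock' -> 'granted lock'
            print(waiter[0], "is told it now holds the lock on", name)
        else:
            del directory_entry[name]                         # node C clears the directory entry
            print("resource", name, "eliminated, directory entry cleared")

    release('node A', 'TX 3523,142')    # node A commits: node B gets the lock
    release('node B', 'TX 3523,142')    # node B commits: resource and directory entry go away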

Hash locks on db_block_buffers (static PCM locks)

There is one important difference between (static) PCM locks and non-PCM locks. Non-PCM locks (usually representing things like Oracle Enqueues) are maintained as locally as possible to the interested node - if node A wants to hold a specific TX lock, and node A is in almost all cases the only node that will ever want to hold that lock, then it would be inefficient for the DLM resource for that lock to be held anywhere other than node A. A directory entry has to be globally available just in case any other node wants the same resource, but in general one resource gets one lock - so make the action take place as close to the relevant instance as possible.

However, PCM locks cover database blocks that are in each instance's SGA; and since each PCM lock has to map to a DLM resource, and it is not feasible to have one DLM resource per Oracle database block, each PCM lock / DLM resource has to cover a large number of Oracle database blocks. (Remember at this point that fine-grained locking is a newer feature for which different rules apply).
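A hedged illustration of the many-blocks-to-one-lock idea (the real mapping is driven by the OPS init.ora settings and is rather more elaborate; the lock count and the modulo rule below are just assumptions for the example):

    HASH_PCM_LOCKS = 1000              # assumed number of hashed PCM locks

    def pcm_lock_for_block(block_id: int, locks: int = HASH_PCM_LOCKS) -> int:
        """Every database block maps to one of a fixed pool of hashed PCM locks."""
        return block_id % locks        # a deliberately simplistic mapping rule

    # Blocks 17, 1017 and 2017 all fall under the same PCM lock / DLM resource,
    # so one lock covers many blocks across every instance's SGA.
    print({b: pcm_lock_for_block(b) for b in (17, 1017, 2017)})    # all map to lock 17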

Since every instance may have an interest in any block in the database, it makes less sense for a DLM resource to be created at the location where the block is being used - it is quite reasonable for a large number of nodes all to have a (shared) lock on one and the same resource, so there is no sound argument dictating that any one node should preferentially hold the DLM resource.

For PCM locks then (other than fine-grained PCM locks), Oracle uses static DLM resources. All this means is that all the relevant resources have to be created as the DLM starts up, and the resources are hashed across all the active DLM nodes (even the nodes which are not running Oracle), and basically any one DLM resource is held on the node which also holds the directory entry for that resource. In this way some of the DLM traffic is eliminated: when Oracle wants a resource, the communication looks more like this:

1. The process on node A wants to put a shared lock on static DLM resource BL,1,345.

2. Node A determines that node C holds the directory entry for the appropriate DLM resource.

3. Node C is also holding the statically allocated DLM resource BL,1,345, so it applies the lock request to it and informs node A that it now has the required lock (or not, as the case may be).

If the resources and the directory entries for resources did not hash to the same node, then instead of a 2-message dialogue there would be 4 messages, as node A first discovered where the resource was held and then sent a request to that third node to acquire the lock. This, of course, is how fine-grained PCM locking works, and when choosing between static PCM locks and fine-grained locks, the probability of excessive messaging is one of the factors in deciding how the system should be set up.
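To make the message-count argument concrete, here is a sketch in the same invented style: for a static resource the hash that picks the directory node also picks the master, so the dialogue collapses to one request and one reply.

    NODES = 16

    def directory_node(name: str) -> int:
        return sum(map(ord, name)) % NODES

    def some_other_node(dir_node: int) -> int:
        return (dir_node + 1) % NODES         # arbitrary stand-in for a dynamically placed master

    def messages_needed(name: str, static: bool) -> int:
        """Very rough count of inter-node messages for one lock request."""
        dir_node = directory_node(name)
        master = dir_node if static else some_other_node(dir_node)
        if master == dir_node:
            return 2      # ask the directory-cum-master node, get the answer back
        return 4          # ask the directory, learn the master, ask the master, get the answer

    print(messages_needed('BL,1,345', static=True))     # -> 2
    print(messages_needed('BL,1,345', static=False))    # -> 4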

Consequences:

There are five important consequences to the way in which DLM locks are implemented.

First, because of the way in which the hashed directory is implemented, every node in the physical cluster participates in the dynamic lock allocation process - even the nodes which are NOT running Oracle instances.

Secondly, because of the hashing mechanism used for static locks, every node in the physical cluster is responsible for some of the static (i.e. PCM) locks - even the nodes which are NOT running Oracle instances.

Third, because of this hashing, the task of adding a node to the cluster results in a significant re-hashing operation, and on some platforms a system restart (either at the O/S, DLM, or at least at the Oracle level) would be required.

Fourth, where you can choose between statically hashed PCM locks and fine-grain PCM locks, you need to be aware of the trade-offs between reduced messaging and (to be covered later) pinging.

Fifth, whilst many Oracle Enqueues will be held only by a single process, and the dynamic DLM resource strategy of 'first user creates it locally' is a good thing, there are cases (particularly in the dictionary cache/library cache at parse time) where this strategy can lead to severe resource contention and processing overhead.

