Wednesday, April 30, 2014

Puppet Internals - Basic Modeling with RGen

In the previous post about modeling I covered the basic principles of models: a model is an abstraction, a model that describes a model is a meta-model, and a model that defines the grammar for how meta-models are written is called a meta-meta model.

Since this may sound like hocus-pocus, I am going to show concrete examples of how it works. Before we can start doing tricks with meta-models, we must have something concrete to model, and we must learn the concepts used to express these models - that is the focus of this post.

I am going to use the RGen Metamodel Builder API, a Ruby DSL for creating meta-models, to illustrate - simply because it is the easiest way to create a model.

Modeling Process - an example

When we are modeling, we typically produce the model in several iterations - first just starting loosely with the nouns, verbs, and attributes of our problem domain. We recently did this exercise for a first version of Puppet's future Catalog Builder, and it went something like this:

  • We need to be able to build a Catalog
  • A Catalog contains Resources
  • A Resource describes a wanted state of a particular resource type in terms of Properties
  • A Resource also describes Parameters that are input to the resource provider performing the work
  • It would be great if we could define a catalog consisting of multiple catalogs to be applied in sequence - maybe Section is a good name for this
  • While the catalog is built we need to be able to make future references
  • We need to track where resources in the catalog originate (they may be virtual and exported from elsewhere)
  • ... etc.

Once we settled on a handful of statements, we could then whiteboard a tentative model, and reason about the implications. We continued with all the various sets of requirements, and we took pictures of the whiteboard, and made some written notes to remember what we did.

Next step was to document this more formally. I used the graphical modeling tool for Eclipse to make a diagram. Using this diagram, we then walked through it, continued discussing / testing use cases, and revising the diagram. The revised diagram after a couple of iterations looks like this:

At this point in the lifecycle of the model, it is fastest to make changes in the graphical tool, and it is much faster and easier to communicate what the model means than if everything is written in a programming language. After a while, though, it becomes somewhat tedious to describe all the details in the diagram, and while we can use the output directly from the tool to get a version in Ruby, we really want to maintain the original model in Ruby source code form.

We then made an implementation of the model in Ruby that you can view on github.

The work of making the diagram and the implementation in Ruby took me something like 4 hours spread out over the breaks we took while discussing and white-boarding.

We expect to revise this model several times until it is done. If we want to generate a graph, we can now go in the other direction and create input to the graphical tool from the code we wrote in Ruby. This is basically a transformation from the Ruby code to Ecore in XMI since that is what the graphical tool understands.
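
If you want a sketch of how that can be done (a minimal sketch, assuming RGen's XMI 2.0 serializer; the file name is made up), a module built with the metamodel builder can hand out its Ecore model via its ecore method, which can then be serialized:

require 'rgen/serializer/xmi20_serializer'

# 'MyModel' stands for any module extended with
# RGen::MetamodelBuilder::ModuleExtension (like the examples below)
File.open('mymodel.ecore', 'w') do |f|
  serializer = RGen::Serializer::XMI20Serializer.new(f)
  serializer.serialize(MyModel.ecore)
end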

The reason why I included this real life example is to show something that is relevant to the Puppet community. I am however going to switch to toy examples to demonstrate the various techniques when modeling since any real life model tends to overshadow the technique with domain specific issues - i.e. this post is not about the Catalog-model, it is about modeling in general.

Concepts

The basic concepts used when modeling are:

  • We model classifiers / classes
  • Classes have named features
  • A feature is an attribute of a simple data type, or a relationship to another class
  • Attributes can be multi valued
  • A relationship can describe containment
  • A relationship can be uni-, or bi-directional (a bi-directional reference allows us to conveniently navigate to an object that is "pointing to" the object in question).
  • Relationships can be multi valued at one side (or both sides for non containment relationships).
  • The term multiplicity is sometimes used to denote optional/single/multi-value in relationships, e.g. 0:1, 1:1, 0:M, 1:M, or M:M
  • An abstract class cannot be instantiated
  • Ecore also supports modeling of interfaces; this is mostly useful for Java and I am going to skip explaining it.

Classes

When using RGen's Meta Model Builder, the classes in a model are simply placed in a module and made to inherit the base class RGen::MetamodelBuilder::MMBase. It is common to let each meta model define an abstract class that signals that it is part of that model.

require 'rgen/metamodel_builder'

module MyModel
  # Let RGen know this module is a model
  extend RGen::MetamodelBuilder::ModuleExtension

  # An abstract class that makes it easier to check if a given
  # object is an element of "MyModel"
  #
  class MyModelElement < RGen::MetamodelBuilder::MMBase
    # We make this class abstract to make it impossible to create instances of it
    abstract
  end

  class Car < MyModelElement
    # definition of Car
  end
end

At this point we cannot really do much except create an instance of a Car:

require 'mymodel'
a_car = MyModel::Car.new

Attribute Features

Attribute features are used to define the attributes of the class that are represented by basic data types. Attributes have a name, a type, and multiplicity. A single value attribute is defined by calling has_attr, a multivalued attribute by calling has_many_attr.

The basic data types are:

  • String
  • Integer
  • Float
  • Numeric
  • Boolean
  • Enum

You can also use the completely generic Object type, but this also means that the model cannot be serialized, so this should be avoided. It is also possible to reference implementation classes, but this should also be avoided for serialization and cross platform reasons.

The Enum type can be specified separately, and given a name, or it may be defined inline.

EngineEnum = RGen::MetamodelBuilder::DataTypes::Enum.new([:combustion, :electric, :hybrid])

class Car < MyModelElement
  has_attr 'engine', EngineEnum
  has_attr 'doors', Integer
  has_attr 'steering_wheel', String
  has_attr 'wheels', Integer
end
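
The enum can also be created inline in the has_attr call; a small sketch (the 'gearbox' attribute is invented for this example):

class Car < MyModelElement
  # an anonymous enum defined inline in the attribute definition
  has_attr 'gearbox', RGen::MetamodelBuilder::DataTypes::Enum.new([:manual, :automatic])
end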

If we have an attribute that should be capable of holding many values, we use has_many_attr. In the example below, an enum for extra equipment is used - it does not have to be an enum; a multi-valued attribute can have any basic data type.

ExtraEquipmentEnum = RGen::MetamodelBuilder::DataTypes::Enum.new([
  :gps, :stereo, :sun_roof, :metallic
])
class Car < MyModelElement
  has_many_attr 'extras', ExtraEquipmentEnum
  # ...
end

The resulting implementation allows us to set and get values.

a_car = MyModel::Car.new()
a_car.engine = :combustion
a_car.extras = [:gps, :sun_roof]

For multi-valued attributes, we can add and remove entries

a_car.addExtras(:metallic)
a_car.removeExtras(:sun_roof)
a_car.extras   # => [:gps, :metallic]

If we attempt to assign something of the wrong type, we get an error:

a_car.addExtras(:catapult_chair)

=> In MyModel::Car : Can not use a Symbol(:catapult_chair) where a
   [:gps,:stereo,:sun_roof,:metallic] is expected

We can also specify additional metadata for each attribute, but I will return to that later.

Relationship Features

All non basic data types are handled via references. There are two types of references: containment, and regular. A containment reference is used when the referenced element is an integral part of the object - e.g. when it is an attribute of the object, when something cannot and should not be shared between objects (one particular wheel can only be mounted on one car at a time), and when it should be serialized as part of the object that holds the reference. All other references require that the referenced object is contained somewhere else (although this is not quite true, as we will see later when we talk about more advanced concepts, it is a reasonable conceptual approximation for now).

In order to have something meaningful to model, let's expand the notion of Engine.

class Engine < MyModelElement
  abstract
end

FuelTypeEnum = RGen::MetamodelBuilder::DataTypes::Enum.new([:diesel, :petrol, :ethanol])

class CombustionEngine < Engine
  has_attr 'fuel', FuelTypeEnum
end

class ElectricalEngine < Engine
  has_attr 'charge_time_h', Integer
end

# skipping HybridEngine for now

Containment

We can now change the Car to contain an Engine. When we do this, we must also decide if the Engine should explicitly know about which car it is mounted in or not. That is, if the containment relationship is uni- or bi- directional. Let's start with a uni-directional containment:

class Car < MyModelElement
  contains_one_uni 'engine', Engine
  # ...
end

We can now create and assign an Engine to a Car.

a_car = MyModel::Car.new
a_car.engine = MyModel::CombustionEngine.new

If we want to make the containment bi-directional:

class Car < MyModelElement
  contains_one 'engine', Engine, 'in_car'
  # ...
end

The assignment works as before, but we can now also navigate to the car the engine is mounted in. We achieved this by defining the reverse role 'in_car' for the bi-directional containment. Now we can do this:

an_engine = MyModel::CombustionEngine.new
an_engine.in_car            # => nil
a_car = MyModel::Car.new
a_car.engine = an_engine
an_engine.in_car            # => Car

The semantics of containment means that if we assign the engine to another car we will move it!

# continued from previous example
another_car = MyModel::Car.new
a_car.engine                # => CombustionEngine
another_car.engine = an_engine
another_car.engine          # => CombustionEngine
a_car.engine                # => nil

This may seem scary, but it is actually quite natural. If we find that we want to contain something in more than one place at a given time, our model (and thinking) is just wrong, and one of the references should not be a containment reference. Say we have a ServiceOrder (imagine one for the purpose of repairing an engine); the engine is never contained in the order (it is still mounted in the car). Model-wise we simply use a non-containment / regular reference from the ServiceOrder to an Engine.

We specify different kinds of containment with the methods:

  • contains_one
  • contains_many
  • contains_one_uni
  • contains_many_uni
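
To illustrate the multi-valued variants, here is a hedged sketch (the Wheel class is invented for this example) where a car contains its wheels uni-directionally:

class Wheel < MyModelElement
  has_attr 'position', String
end

class Car < MyModelElement
  # the car contains its wheels; a wheel does not know which car it is mounted on
  contains_many_uni 'wheels', Wheel
end

a_car = MyModel::Car.new
a_car.addWheels(MyModel::Wheel.new)
a_car.wheels.size   # => 1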

Regular References

Regular references are defined with one of the methods:

  • has_one, uni-directional
  • has_many, uni-directional
  • one_to_many, bi-directional
  • many_to_one, bi-directional
  • many_to_many, bi-directional

We can now define a ServiceOrder for service of engines:

class ServiceOrder < MyModelElement
  has_one 'serviced_engine', Engine
end

And we can use this:

an_engine = MyModel::CombustionEngine.new
a_car = MyModel::Car.new
a_car.engine = an_engine
a_car.engine                   # => CombustionEngine
so = MyModel::ServiceOrder.new
so.serviced_engine = an_engine
a_car.engine                   # => CombustionEngine

As you can see, since the relationship is non-containment, the engine does not move to become an integral part of the service order - it is still mounted in the car (which is exactly what we wanted).

For the bi-directional references, the reverse role is required. When using these, it is important to consider the cohesion of the system - we do not want every piece to know about everything else. Therefore, choose bi-directional references only where it really matters.
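
As a hedged sketch of such a bi-directional, non-containment reference (the Workshop class is invented for this example), a workshop could reference the engines queued for service, with each engine knowing its workshop:

class Workshop < MyModelElement
  # bi-directional one-to-many: the workshop references many engines,
  # each engine references (at most) one workshop via the reverse role
  one_to_many 'engines', Engine, 'workshop'
end

a_workshop = MyModel::Workshop.new
an_engine = MyModel::CombustionEngine.new
a_workshop.addEngines(an_engine)
an_engine.workshop   # => Workshop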

Testing the examples

You can easily try the examples in this gist. You need to have the rgen gem installed; then you can run the examples in irb and try things out.

In this Post

In this post you have seen examples of how to build a model with RGen, and how it can be used. As you probably have noticed, you get quite a lot of functionality with only a small amount of work. How much code would you have to write to correctly support fully typed many-to-many relationships? (And then discover that you want to model it differently.)

I hope this post has shown the usefulness of modeling - even when no fancy modeling tricks are used - simply because of the robust, type-safe and referentially safe implementation that we get from a small and concise definition.

Monday, April 28, 2014

Puppet Internals - Introduction to Modeling

What is a "model" really?

As you probably know, Puppet creates a Declarative Catalog that describes the desired system state of a machine. What you probably have not thought about is that this means that Puppet is actually Model Driven. The model in this case is the catalog and the descriptions of the desired state of a set of resources that is built by Puppet from the instructions given to it in the Puppet Programming Language.

While Puppet has always been based on this model, it has not been built using modeling technology. What is new in the future parser and evaluator features in Puppet is that we have started using modeling technology to also implement how Puppet itself works.

In this post, I am going to talk a bit about modeling in general, the Ecore modeling technology and the Ruby implementation of Ecore called RGen.

What is a "model" really?

If, when you first hear the word model, you think about Tyra Banks or Marcus Schenkenberg and then laugh a little because that is obviously not the kind of model we are talking about here - you are actually not far off the mark.

A model is just an abstraction, and we use them all the time - they are the essence of any spoken language; when we talk about something like a "Car" we are talking about an unspecific instance of something like "a powered means of transportation, probably a 4 door sedan". Since it is not very efficient to communicate in that way, we use the abstraction "Car". Fashion models such as Tyra or Marcus are also abstractions of "people wearing clothes" albeit very good looking ones (perhaps you think they represent "people that do not wear enough clothes", but they are still abstractions).

We can express a model concretely using natural language:

A Car has an engine, 2 to 5 doors, a steering wheel, brakes, and 3-4 wheels.
. . .

We can also express such a model in a programming language such as Ruby:

class Car
  attr_accessor :engine
  attr_accessor :doors
  attr_accessor :steering_wheel
  attr_accessor :brakes
  attr_accessor :wheels

. . .
end

As you can see in the above attempt to model a Car, we lack the semantics in Ruby to declare more details about the Car's attributes - there is no way to declaratively state how many doors there could be, the number of wheels, etc. There is also no way to declare that the engine attribute must be of Engine type, etc. All such details must be implemented as logic in the setter and getter methods that manipulate a Car instance. While this is fine from a runtime perspective (it protects us from mistakes when using the class), we cannot (at least not easily) introspect the class and deduce the number of allowed doors, or the allowed type of the various attributes.
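
To make this concrete, here is a sketch of the kind of hand-written boilerplate needed for just one constrained attribute (the rules are the ones from the natural language model above):

class Car
  attr_reader :doors

  # validation logic that a modeling technology would generate for us
  def doors=(count)
    raise ArgumentError, 'doors must be an Integer' unless count.is_a?(Integer)
    raise ArgumentError, 'a Car has 2 to 5 doors' unless (2..5).include?(count)
    @doors = count
  end
end

And even with all of this code in place, the rules are still trapped inside the method body - nothing can introspect the class and learn that doors must be an Integer between 2 and 5.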

Introspection (or reflection) is the ability to programmatically obtain information about the expressed logic. Sometimes we talk about this as the ability to get meta-data (data about data) that describes what we are interested in.

While Ruby is a very fluid implementation language, in itself it is not very good at expressing a model.

Modeling Language

A modeling language (in contrast to an implementation language such as Ruby) lets us describe far more details about the abstraction without having to express them in imperative code in one particular implementation language. One such family of "languages" are those that are used to describe data formats - they are referred to as schemas - and you are probably familiar with some of them, such as XmlSchema, JsonSchema, and YamlSchema. These schema technologies allow us to make declarations about what is allowed in data that conforms to the schema, and we can use this to validate actual data.

A schema is a form of modeling language. What is interesting about them is that they enable transformation from one form to another! Given an XmlSchema, it is possible to transform it into a corresponding Yaml or Json schema, and likewise transform data conformant with one such schema into data conformant with the transformed schema. (The problems doing this in practice have to do with the difference in semantic power between the different technologies - we may be able to express rules/constraints in one schema technology that do not exist in the others.)

Schemas and Meta Models

When we look at a schema, we are actually looking at a meta model: a model that describes a model. That is, if we describe a Car in Json we have a Car model:

{ "engine": "combustion",
  "steering-wheel": "sport-leather",
  "wheels": ...
}

And if we describe a schema for it:

{ "title": "Car Schema",
  "type"; "object",
  "properties": {
    "engine": { "type": "string"},
    "steering-wheel": { "type": "string" },
    . . .
  }
  "required": ["engine", "steering-wheel", ...]
}

We have a Car meta-model.

In everyday speech, we typically refer to the schema as "schema" or "model" and simply ignore its "meta status". But since we are on the topic of meta models - what we can do now is also express the meta model as a model, i.e. what is the schema for a JsonSchema? Here is an excerpt from the Json "Core/Validation Meta-Schema":

{
  "id": "http://json-schema.org/draft-04/schema#",
  "$schema": "http://json-schema.org/draft-04/schema#",
  . . .
   "title": {
        "type": "string"
    },
  . . .
  "required": { "$ref": "#/definitions/stringArray" },
  . . .
}

If you are interested in what it looks like, do download it. Be warned that you will quickly become somewhat disoriented since it is a schema describing a schema that describes what Json data should look like...

A meta schema such as that for JsonSchema is very useful as it can be used to validate schemas that describe data.

Schemas such as XmlSchema, YamlSchema, and JsonSchema are good for describing data, but they become somewhat difficult to use in practice for the construction of software. There are other modeling languages that more specifically target software system constructs. There are both graphical and textual languages, as well as those that have both kinds of representations.

What we are specifically interested in is an Object modeling language, which I will explain in more detail.

Levels of reality / meta-ness

We can now organize what we just learned about a model, its relationship to a meta-model, and the fact that each of these can itself be expressed as a model. Here is a table that illustrates this, from most concrete to most abstract:

Meta-Level   Description
M0           Real Object - e.g. the movie "Casablanca" on a DVD
M1           User Model / Instance Level - e.g. a computer abstraction of the DVD - aVideo = Video.new('Casablanca')
M2           Meta Model - e.g. defines what a Video is, its attributes and operations
M3           Meta-meta model (the modeling language/grammar) - e.g. defines how the definition of "what a Video is" is expressed

We very rarely talk about meta-meta models (this is the technology we use to implement a meta-model), and we also typically leave out the meta word when talking about a meta-model (i.e. just using the word model). We also typically talk about an "instance of a model" as being just a "model", without any distinction about its form: as live objects in memory that we can manipulate, serialized to disk, stored in a database, etc. It is only when we talk explicitly about modeling and modeling technology that we need to use the more precise terms; most of the time it is perfectly clear what we are referring to when we use the word "model".

Object Modeling Language

An Object Modeling Language is a language that directly supports the kinds of elements we are interested in when constructing software - e.g. classes, methods, and the properties of objects. You have probably heard of one such technology called UML - Unified Modeling Language - a broad modeling technology associated with object oriented software development methodologies such as Booch, OMT, Objectory, IBM's RUP, and the Dynamic Systems Development Method. This was The Big Thing in the 90's, but UML has since more or less slid into darkness as "the way to write better software" has shifted focus. An interesting debate from 2009 can be found here.

There is however a very useful part of the UML technology that is often overlooked - the so called Meta Object Facility (MOF) that sits at the very core of UML. It contains the (meta-meta) model that UML itself is defined in. MOF plays the same role for models as Extended Backus-Naur Form (EBNF) plays for programming languages - it defines the grammar. Thus MOF can be said to be a domain specific language used to define meta models. The technology used in MOF is called Ecore - it is the reference implementation of MOF. (It is a model at level M3 in the table above.)

Ecore

Ecore is part of Eclipse EMF, and is heavily used within Eclipse for a wide variety of IDE applications and application development domains. In the Puppet domain, EMF/Ecore technology is used in the Puppetlabs Geppetto IDE tool, in combination with additional frameworks for language development such as Xtext.

Eclipse EMF is a Java centric implementation of Ecore. There are also implementations for Ruby (RGen) and C++ (EMF4CPP).

Thus, there are many different ways to express an Ecore model. The UML MOF has defined one serialization format known as XMI, which is based on XML, but there are many other concrete formats, such as Xcore (a DSL built with Xtext), annotated Java, Json, binary serialization formats, the RGen DSL in Ruby, etc.

Car in RGen

Here is the Car expressed with RGen's MetamodelBuilder. In the next blog post about modeling I will talk a lot more about RGen - this is just a simple illustration:

class Car < RGen::MetamodelBuilder::MMBase
  has_attr 'engine', DataTypes::Enum.new([:combustion, :electric, :hybrid])
  has_attr 'doors', Integer
  has_attr 'steering_wheel', String
  has_attr 'wheels', Integer

. . .
end

A few points though:

  • RGen::MetamodelBuilder::MMBase is the base class for all models implemented with RGen (irrespective of how they are defined: using the metamodel builder, using the API directly, loading an Ecore XMI file, or any other serialization format).
  • has_attr is similar to Ruby's attr_accessor, but it also specifies the type, and type checking is automatic.
  • You can probably guess what Enum does.

If you are eager to see real RGen models as they are used in Puppet, you can take a look at the AST model, or the type model.

Benefits of Modeling Technology

So what is it that is appealing about using modeling technology?

  • The model is declarative, and can often be free of implementation concerns
  • Polyglot - a model can be used dynamically in different runtimes, or be used to generate code.
  • Automatic type checking
  • Support for containment and serialization
  • Models can be navigated
  • Objects in the model have information (meta-data) about where they are contained ("this is the left front wheel of the car XYZ 123").
  • Direct support for relationships, including many to many, and bidirectional relationships

Which I guess boils down to: "Modeling technology removes the need to write a lot of boilerplate code for non-trivial structures". In the following posts I will talk about these various benefits and concepts of Ecore using examples in RGen.

In this Blog Post

In this blog post I explained that a model is an abstraction, and how such an abstraction can be implemented in software: hand written, or using modeling technology such as data schemas, or a technology specifically designed for software such as Ecore, which is available for Ruby in the form of the RGen gem. I also dived into the somewhat mysterious domain of meta-meta models - the grammar used to describe a meta-model, which in turn describes something that we want to manipulate/work with in our system.

Things will be more concrete in the next post, I promise.

Monday, April 7, 2014

Getting your Puppet Ducks in a Row

A conversation that comes up frequently is whether the Puppet Programming Language is declarative or not. This is usually the topic when someone has been fighting with how the master-side order of evaluation of manifests works and has been left beaten by what may sometimes seem like random behavior. In this post I want to explain how Puppet works and try to straighten out some of the misconceptions.

First, let's get the terminology right (or this will remain confusing). It is common to refer to "parse order" instead of "evaluation order", and the use of the term "parse order" is deeply rooted in the Puppet community - this is unfortunate as it is quite misleading. A computer language is typically first parsed and then evaluated (Puppet does the same), and as you will see, almost all of the peculiarities occur during evaluation.

"Parse Order"

Parse Order is the order in which Puppet reads puppet manifests (.pp) from disk, turns them into tokens and checks their grammar. The result is something that can be evaluated (technically an Abstract Syntax Tree (AST)). The order in which this is done is actually of minor importance from a user perspective, you really do not need to think about how an expression such as $a = 1 + 2 becomes an AST.

The overall ordering of the execution is that Puppet starts with the site.pp file (or possibly the code setting in the configuration), then asks external services (such as the ENC) for additional things that are not included in the logic that was loaded from the site.pp. In versions from 3.5.1 the manifest setting can also refer to a directory of .pp files (preferred over using the now deprecated import statement).

After having parsed the initial manifest(s), Puppet then matches the information about the node making a request for a catalog with available node definitions, and selects the first matching node definition. At this point Puppet has the notion of:

  • node - a mapping of the node the request is for.
  • a set of classes to include and possibly parameters to set that it got from external sources.
  • parsed content in the form of one or several ASTs (one per file that was initially parsed)

Evaluation of the puppet logic (the ASTs) now starts. The evaluation order is imperative - lines in the logic are executed in the order they are written. However, all classes and defines in a file are defined prior to starting evaluation, but they are not evaluated (i.e. their bodies of code are just associated with the respective name and set aside for later "lazy" evaluation).

Which leads to the question what "being defined" really means.

Definition and Declaration

In computer science these terms are used as follows:

  • Declaration - introduces a named entity and possibly its type, but it does not fully define the entity (its value, functionality, etc.)
  • Definition - binds a full definition to a name (possibly declared somewhere). A definition is what gives a variable a value, or defines the body of code for a function.

A user-defined resource type is defined in puppet using a define expression. E.g. something like this:

define mytype($some_parameter) {
  # body of definition
}

A host class is defined in puppet using the class expression. E.g. something like this:

class ourapp {
  # body of class definition
}

After such a resource type definition or class definition has been made, if we try to ask whether mytype or ourapp is defined by using the function defined, we will be told that it is not! This is because the implementer of the function defined used the word in a very ambiguous manner - the defined function actually answers "is ourapp in the catalog?", not "do you know what a mytype is?".

The terminology is further muddled by the fact that the result of a resource expression is computed in two steps - the instruction is queued, and later evaluated. Thus, there is a period of time when it is defined, but what it defines does not yet exist (i.e. it is a kind of recorded desire / partial evaluation). The defined function will however return true for resources that are either in the queue or have been fully evaluated.

 mytype { '/tmp/foo': ...}
 notice defined(Mytype['/tmp/foo'])  # true

When this is evaluated, a declaration of a mytype resource is made in the catalog being built. The actual resource '/tmp/foo' is "on its way to be evaluated" and the defined function returns true since it is (about to be) "in the catalog" (only not quite yet).

Read on to learn more, or skip to the examples at the end if you want something concrete, and then come back and read about "Order of Evaluation".

Order of Evaluation

In order for a class to be evaluated, it must be included in the computation via a call to include, or by being instantiated via the resource instantiation expression. (In comparison to a classic Object Oriented programming language, include is the same as creating a new instance of the class.) If something is not included, then nothing that it in turn defines is visible. Also note that instances of Puppet classes are singletons (a class can only be instantiated once in one catalog).

Originally, the idea was that you could include a given class as many times as you wanted. (Since there can only be one instance per class name, multiple calls to include a class only repeat the desire to include that single instance. There is no harm in this.) Prior to the introduction of parameterized classes, it was easy to ensure that a class was included; a call to 'include' before using the class was all that was required. Parameterized classes were then introduced, along with new expression syntax allowing you to "instantiate a class as a resource". When a class is parameterized, the "signature" of the class is changed by the values given to the parameters, but the class name remains the same. (In other words, ourapp("foo") has a different signature than ourapp(42), even though the class itself is still ourapp.) Parameterization of classes therefore implies that including a class only works when that class does not have multiple signatures. This is because multiple signatures would require multiple singleton instantiations of the same class (a logical impossibility). Unfortunately puppet cannot handle this even if the parameter values are identical - it sees this as an attempt to create a second (illegal) instance of the class.
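
A hedged sketch of these rules (the class name and parameter are invented):

class ourapp_db($port = 5432) {
  notice "ourapp_db on port ${port}"
}

class { 'ourapp_db': port => 5433 }   # resource expression: instantiates the singleton
include ourapp_db                     # ok - a later include refers to that same instance

# The other way around - include first, then class { 'ourapp_db': ... } - is an
# error, as is a second resource expression for the class, even with identical
# parameter values.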

When something includes a class (or uses the resource instantiation expression to do the same), the class is auto-loaded; this means that puppet maps the name to a file location, parses the content, and expects to find the class with a matching name. When it has found the class, this class is evaluated (its body of code is evaluated).

The result of the evaluation is a catalog - the catalog contains resources and edges and is declarative. The catalog is transported to the agent, which applies the catalog. The order in which resources are applied is determined by their dependencies as well as their containment, use of the anchor pattern or the contain function, and settings (apply in random or in source order, etc.). No evaluation of any puppet logic takes place at this point (at least not in the current version of Puppet) - on the agent the evaluation is done by the providers operating on the resources in the order determined by the catalog application logic running on the agent.

The duality of this; a mostly imperative, but sometimes lazy production (as you will learn below) of a catalog and a declarative catalog application is something that confuses many users.

As an analogy: if you are writing a web service in PHP, the PHP logic runs on the web server and produces HTML which is sent to the browser. The browser interprets the HTML (which consists of declarative markup) and decides what to render where and the order in which rendering will take place (images load in the background, some elements must be rendered first because their size is needed to position other elements, etc.). Compared to Puppet: the imperative PHP backend corresponds to the master computing a catalog in a mostly imperative fashion, and an agent's application of the declarative catalog corresponds to the web browser's rendering of HTML.

Up to this point, the business of "doing things in a particular order" is actually quite clear; the initial set of puppet logic is loaded, parsed and evaluated, which defines nodes (and possibly other things), then the matching node is evaluated, things it references are then autoloaded, parsed and evaluated, etc. until everything that was included has been evaluated.

What still remains to be explained is the order in which the bodies of classes and user-defined types are evaluated, as well as when relationships (dependencies between resources) and queries are evaluated.

Producing the Catalog

The production of the catalog is handled by what is currently known as the "Puppet Compiler". This is again a misnomer; it is not a compiler in the sense that other computer languages have a compiler that translates the source text to machine code (or some intermediate form like Java byte code). It does however compile in the sense that it is assembling something (a catalog) out of many pieces of information (resources). Going forward (Puppet 4x) you will see us referring to a Catalog Builder instead of a Compiler - who knows, one day we may have an actual compiler (to machine code) that compiles the instructions that build the catalog. Even if we do not, for anyone that has used a compiler it is not intuitive that the compiler runs the program, which is what the current Puppet Compiler does.

When puppet evaluates the AST, it does this imperatively - $a = $b + $c, will immediately look up the value of $b, then $c, then add them, and then assign that value to $a. The evaluation will use the values assigned to $b and $c at the time the assignment expression is evaluated. There is nothing "lazy" going on here - it is not waiting for $b or $c to get a value that will be produced elsewhere at some later point in time.

Some instructions have side effects - i.e. they change the state of something external to the function. This is in contrast to an operation like + which is a pure function: it takes two values, adds them, and produces the result - once this is done there is no memory of that having taken place (unless the result is used in yet another expression, and so on, until it is assigned to some variable - a side effect).

The operations that have an effect on the catalog are evaluated for the sole purpose of their side effect. The include function tells the catalog builder about our desire to have a particular class included in the catalog. A resource expression tells the catalog builder about our desire to have a particular resource applied by the agent, a dependency formed between resources again tells the catalog builder about our desire that one resource should be applied before/after another. While the instructions that cause the side effects are immediate, the side effects are not completely finished, instead they are recorded for later action. This is the case for most operations that involve building a catalog. This is what we mean when we say that evaluation is lazy.

To summarize:

  • An include will evaluate the body of a class (since classes are singletons this happens only once). The fact that we have instantiated the class is recorded in the catalog - a class is a container of resources, and the class instance is fully evaluated and exists as a container, but it does not yet contain the actual resources. In fact, it only contains instructions (i.e. our desire to have a particular resource with particular parameter values applied on the agent).
  • A class included via what looks like a resource expression i.e. class { name: } behaves like the include function wrt. evaluation order.
  • A dependency between two (or a chain of) resources is also just an instruction at this point.
  • Queries (i.e. space-ship expressions) are instructions to find and realize resources.

When there are no more expressions to immediately evaluate, the catalog builder starts processing the queued up instructions to evaluate resources. Since a resource may be of user-defined type, and it in turn may include other classes, the processing of resources is interrupted while any included classes are evaluated (this typically adds additional resource instructions to the queue). This continues until all instructions about what to place in the catalog have been evaluated (and nothing new was added). Now, the queue is empty.

The lazy evaluation of the catalog building instructions is done in the order the instructions were added to the catalog, with the exception of application of default values, queries, and relations, which are delayed until the very end. (Exactly how these work is beyond the topic of this already long blog post.)

How many different Orders are there?

The different orders are:

  • Parse Order - a more or less insignificant term meaning the order in which text is translated into something the puppet runtime can act on. (If you have a problem with ordering, you are almost certainly not having a problem with parse order).
  • Evaluation Order - the order in which puppet logic is evaluated with the purpose of producing a catalog. Pure evaluation order issues are usually related to the order in which arguments are evaluated, or the order in which case options are evaluated - these are usually not difficult to figure out.
  • Catalog Build Order - the order in which the catalog builder evaluates definitions. (If you are having problems with ordering, this is where things appears to be mysterious).
  • Application Order - the order in which the resources are applied on an agent (host). (If you are having ordering problems here, they are more apparent, "resource x" must come before "resource y", or something (like a file) that "resource y" needs will be missing). Solutions here are to use dependencies, the anchor pattern, or use the contain function.)

Please Make Puppet less Random!

This is a request that pops up from time to time, usually because someone has blown a fuse over a Catalog Build Order problem. As you have learned, the order is far from random. It is, however, still quite complex to figure out the order, especially in a large system.

Is there something we can do about this?

The mechanisms in the language have been around for quite some time, and they are not an easy thing to change due to the number of systems that rely on the current behavior. However, there are many ways around the pitfalls that work well for people creating complex configurations - i.e. there are "best practices". There are also some things that are impossible or difficult to achieve.

Many suggestions have been made about how the language should change to be both more powerful and easier to understand, and several options are being considered to help with the mysterious Catalog Build Order and the constraints it imposes. These options include:

  • Being able to include a resource multiple times if the declarations are identical (or if they augment each other).
  • If using a resource expression to instantiate a class, consider a previous include of that class to be identical (since the include did not specify any parameters it can be considered as a desire of lower precedence). (The reverse interpretation is currently allowed).

Another common request is to support decoupling between resources, sometimes referred to as "co-op", where there is a desire to include things "if they are present" (as opposed to someone explicitly including them). The current set of functions and language mechanisms makes this hard to achieve (due to Catalog Build Order being complex to reason about).

Here the best bet is the ENC (for older versions), or the Node Classifier for newer Puppet versions. Related to this is the topic of "data in modules", which in part deals with the overall composition of the system. The features around "data in modules" have not been settled; while there are experimental things to play with, none of the existing proposals is a clear winner at present.

I guess this was a long way of saying - we will get to it in time. What we have to do first (and what we are working on) is the semantics of evaluation and catalog building. At this point, the new evaluator (that evaluates the AST) is available when using the --parser future flag in the just to be released 3.5.1. We have just started up the work on the new Catalog Builder where we will more clearly (with the goal of being both strict and deterministic) define the semantics of the catalog and the process that constructs it. We currently do not have "inversion of control" as a feature under consideration (i.e. by adding a module to the module path you also make its starting point included), but are well aware that this feature is much wanted (in conjunction with being able to compose data).

What better way to end than with a couple of examples...

Getting Your Ducks in a Row

Here is an example of a manifest containing a number of ducks. In which order will they appear?

define duck($name) {
  notice "duck $name"
  include c
}

class c {
  notice 'in c'
  duck { 'duck0': name => 'mc scrooge' }
}

class a {
  notice 'in a'
  duck {'duck1': name => 'donald' }
  include b
  duck {'duck2': name => 'daisy' }
}

class b {
  notice 'in b'
  duck {'duck3': name => 'huey' }
  duck {'duck4': name => 'dewey' }
  duck {'duck5': name => 'louie' }
}

include a

This is the output:

Notice: Scope(Class[A]): in a
Notice: Scope(Class[B]): in b
Notice: Scope(Duck[duck1]): duck donald
Notice: Scope(Class[C]): in c
Notice: Scope(Duck[duck3]): duck huey
Notice: Scope(Duck[duck4]): duck dewey
Notice: Scope(Duck[duck5]): duck louie
Notice: Scope(Duck[duck2]): duck daisy
Notice: Scope(Duck[duck0]): duck mc scrooge

(This manifest is found in this gist if you want to get it and play with it yourself).

Here is a walk through:

  • class a is included and its body starts to evaluate
  • it placed duck1 - donald in the catalog builder's queue
  • it includes class b and starts evaluating its body (before it evaluates duck2 - daisy)
  • class b places ducks 3-5 (the nephews) in the catalog builder's queue
  • class a evaluation continues, and duck2 - daisy is now placed in the queue
  • the immediate evaluation is now done, and the catalog builder starts executing the queue
  • duck1 - donald is first; when it is evaluated, the name is logged, and class c is included
  • class c queues duck0 - mc scrooge
  • the catalog builder now processes the remaining queued ducks in order 3, 4, 5, 2, 0

The order in which resources are processed may seem to be random, but now you know the actual rules.

Summary

In this (very long) post, I tried to explain "how puppet master really works", and while the order in which puppet takes action may seem mysterious or random at first, it is actually both defined and deterministic - albeit quite unintuitive when reading the puppet logic at "face value".

Big thanks to Brian LaMetterey, and Charlie Sharpsteen who helped me proof read, edit, and put this post together. Any remaining mistakes are all mine...

Thursday, April 3, 2014

Stdlib Module Functions vs. Puppet Future Parser / Evaluator


Earlier in this series of blog posts about the future capabilities of Puppet, and the Puppet Type System in particular, you have seen how the match operator can be used to check the type of values. In Puppet 3.6 (with --parser future) there is a new function called assert_type that helps with type checking. This led to questions about the existing functionality in the puppetlabs-stdlib module, and how the new capabilities differ and offer alternatives.

In this post I am going to show examples of when to use type matching, and when to use the new assert_type function as well as showing examples of a few other stdlib functions and how the same tasks can be achieved with the future parser/evaluator available in Puppet 3.5.0 and later.

The Stdlib is_xxx functions

The puppetlabs-stdlib module has several functions for checking if the given value is an instance of a particular type. Here is a comparison:

stdlib             type system
is_array($x)       $x =~ Array
is_bool($x)        $x =~ Boolean
is_float($x)       $x =~ Float
is_hash($x)        $x =~ Hash
is_integer($x)     $x =~ Integer
is_numeric($x)     $x =~ Numeric
is_string($x)      $x =~ String
n/a                $x =~ Regexp

Note that the type system operations do not coerce strings into numbers or vice versa. They also make no distinction based on how a number was entered (decimal, hex, or octal). The stdlib functions vary in their behavior, but typically only treat strings with decimal notation as being numeric or integer (which is both wrong and confusing).
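
A small sketch of the difference (the stdlib behavior shown is the typical one - exact behavior varies between stdlib versions):

notice(is_integer('10'))   # => true  (stdlib accepts the decimal string)
notice('10' =~ Integer)    # => false (no coercion - '10' is a String)
notice(0x10 =~ Integer)    # => true  (radix does not matter to the type system)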

In addition to the basic type checking shown in the table above, you can also match against parameterized types to perform more advanced checks; range of numeric values, checking the size of an array, the size and type of elements in an array, arrays with a sequence of different types (i.e. using the Tuple type). You can do the same for Hash where the Struct type allows specification of expected keys and their respective type). See the earlier posts in this series for how to use those types.
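
To give a flavor of these parameterized checks, here are a few hedged examples (using the future parser type system described in those earlier posts):

notice(7 =~ Integer[1, 10])                             # a number in a range => true
notice([1, 2, 3] =~ Array[Integer])                     # an array of integers => true
notice(['/tmp', 0] =~ Tuple[String, Integer])           # a typed sequence => true
notice({'mode' => 'rw'} =~ Struct[{'mode' => String}])  # expected key and value type => true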

The Stdlib validate_xxx functions

The puppetlabs-stdlib module has several functions to validate if the given value is an instance of a particular type. If not, an error is raised. The new assert_type function does the same, but it checks only one argument; thus, if you want to check multiple values at once, you place them in an array and check against an Array type parameterized with the type you want each element of the array to be an instance of. Here are examples:

stdlib                     type system
validate_array($x)         assert_type(Array, $x)
validate_array($x, $y)     assert_type(Array[Array], [$x, $y])
validate_bool($x)          assert_type(Boolean, $x)
validate_bool($x, $y)      assert_type(Array[Boolean], [$x, $y])
validate_hash($x)          assert_type(Hash, $x)
validate_hash($x, $y)      assert_type(Array[Hash], [$x, $y])
validate_re($x)            assert_type(Regexp, $x)
validate_re($x, $y)        assert_type(Array[Regexp], [$x, $y])
validate_string($x)        assert_type(String, $x)
validate_string($x, $y)    assert_type(Array[String], [$x, $y])

Note that the Regexp type only matches regular expressions. If the desire is to assert that a String is a valid regular expression, it can be given as a parameter to the Regexp or Pattern type, since this performs a regular expression compilation of the pattern string and raises an error with details about the failure.

'foo' =~ Pattern["{}[?"] # this will fail with an error

Note that the 3.5.0 --parser future does not validate the regular expression pattern until it is used in a match (not when it is constructed). This is fixed in Puppet 3.6.

The validate_slength function

The validate_slength function is a bit of a Swiss Army knife and it allows validation of length in various ways for one or more strings. It has the following signatures:

validate_slength(String value, Integer max, Integer min) - arg count {2,3}
validate_slength(Array[String] value, Integer max, Integer min) - arg count {2,3}

To achieve the same with the type system:

# matching (there is no is_xxx function for this)
$x =~ String[min, max]
[$x, $y] =~ Array[String[min, max]]

# validation
assert_type(String[min,max], $x)
assert_type(Array[String[min,max]], [$x, $y])

A common assertion is to check if a string is not empty:

assert_type(String[1], $x)
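
A typical place for such an assertion is at the top of a class or define; a hedged sketch with invented names:

define myapp::config($owner) {
  # fails catalog production with a type mismatch error if $owner
  # is not a non-empty string
  assert_type(String[1], $owner)
  # ...
}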

The Stdlib values_at function

The stdlib function values_at can pick values from an array given a single index value or a range. The same can now be achieved with the [] operator:

stdlib                           future parser
values_at([1,2,3,4], 2)          [1,2,3,4][2]
values_at([1,2,3,4], ["1-2"])    [1,2,3,4][1,2]

The values_at function also allows picking various values by giving it an array of indexes to pick. This is not supported by the [] operator. OTOH, if you find that you often need to pick elements 1, 6, 32-38, and 164 from an array, you are probably not doing it right.
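
To spell out the [] semantics used in the table above:

$a = [1, 2, 3, 4]
notice($a[2])      # => 3 - a single index
notice($a[1, 2])   # => [2, 3] - from index 1, take 2 elements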

The Stdlib type function

The type function returns the name of the type as a lower case string, i.e. 'array', 'hash', 'float', 'integer', 'boolean'. This stdlib function does not perform any inference or provide any details about the type; it only returns the name of the base type.

When writing this, there is currently no corresponding function for the new type system, but a type_of function will be added in 3.6 that returns a fully inferred Puppet Type (with all details intact). When this function is added it may have an option to make the type generic (i.e. reduce it to its most generic form).

The typical usage of type is to... uh, check the type - this is easily done with the match operator:

stdlib                  future parser
type($x) == 'string'    $x =~ String

The Stdlib merge, concat, difference functions

Merging of hashes and concatenation of arrays can be performed with the + operator instead of calling merge and concat. The - operator can be used to compute the difference.

stdlib                future parser
merge($x, $y)         $x + $y
concat($x, $y)        $x + $y
difference($x, $y)    $x - $y
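
A few examples of what this looks like (for hashes, the + operator lets the right hand side win for duplicate keys):

$defaults  = { 'owner' => 'root', 'mode' => '0644' }
$overrides = { 'mode'  => '0600' }
notice($defaults + $overrides)   # => {owner => root, mode => 0600}

notice([1, 2] + [3, 4])          # => [1, 2, 3, 4]
notice([1, 2, 3] - [2])          # => [1, 3]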

Other functions

There are other functions that partially overlap new features (like the range function), but where the new feature does not completely replace the functionality provided by the function. There is also the possibility of enhancing some of the functions to give them the ability to accept a block of code, or to make use of the type system.

At some point during the work on Puppet 4x we will need to revisit all of the stdlib functions.