Implementing a FDW

Multicorn provides a simple interface for writing foreign data wrappers: the multicorn.ForeignDataWrapper interface.

Implementing a foreign data wrapper is as simple as inheriting from multicorn.ForeignDataWrapper and implemening the execute method.

What are we trying to achieve?

Supposing we want to implement a foreign data wrapper which only returns a set of 20 rows, containing in each column the name of the column itself concatenated with the number of the line.

The goal of this tutorial is to be able to execute this:

CREATE FOREIGN TABLE constanttable (
    test character varying,
    test2 character varying
) server multicorn_srv options (
    wrapper 'myfdw.ConstantForeignDataWrapper'
)

SELECT * from constanttable;

And obtain this as a result:

  test   |  test2
---------|----------
 test 0  | test2 0
 test 1  | test2 1
 test 2  | test2 2
 test 3  | test2 3
 test 4  | test2 4
 test 5  | test2 5
 test 6  | test2 6
 test 7  | test2 7
 test 8  | test2 8
 test 9  | test2 9
 test 10 | test2 10
 test 11 | test2 11
 test 12 | test2 12
 test 13 | test2 13
 test 14 | test2 14
 test 15 | test2 15
 test 16 | test2 16
 test 17 | test2 17
 test 18 | test2 18
 test 19 | test2 19
(20 lignes)

How do we do that?

The fdw described above is pretty simple, implementing it should be easy!

First things first, we have to create a new python module.

This can be achieved with the most simple setup.py file:

import subprocess from setuptools import setup, find_packages, Extension

setup(
  name='myfdw',
  version='0.0.1',
  author='Ronan Dunklau',
  license='Postgresql',
  packages=['myfdw']
)

But let’s see the whole code. To be usable with the above CREATE FOREIGN TABLE statement, this module should be named myfdw.

from multicorn import ForeignDataWrapper

class ConstantForeignDataWrapper(ForeignDataWrapper):

    def __init__(self, options, columns):
        super(ConstantForeignDataWrapper, self).__init__(options, columns)
        self.columns = columns

    def execute(self, quals, columns):
        for index in range(20):
            line = {}
            for column_name in self.columns:
                line[column_name] = '%s %s' % (column_name, index)
            yield line

You should have the following directory structure:

.
|-- myfdw/
|   `-- __init__.py
`-- setup.py

To install it, just run python setup.py install, and the file will be copied to your global python installation, which should be the one your PostgreSQL instance is using.

And that’s it! You just created your first foreign data wrapper. But let’s look a bit more thoroughly to the class…

The first thing to do (although optional, since you can implement the interface via duck-typing), is to import the base class and subclass it:

from multicorn import ForeignDataWrapper

class ConstantForeignDataWrapper(ForeignDataWrapper):

The init method must accept two arguments:

Our access point do not need any options, thus we will only need to keep a reference to the columns:

def __init__(self, options, columns):
    super(ConstantForeignDataWrapper, self).__init__(options, columns)
    self.columns = columns

The execute method is the core of the API. It is called with a list of Qual objects, and a list column names, which we will ignore for now but more on that later.

This method must return an iterable of the resulting lines. Each line can be either a list containing an item by column, or a dictonary mappning the column names to their value.

For this example, we chose to build a dictionary. Each column contains the concatenation of the column name and the line index.

def execute(self, quals):
    for index in range(20):
        line = {}
        for column_name in self.columns:
            line[column_name] = '%s %s' % (column_name, index)
        yield line

And that’s it !

Write API

Since PostgreSQL 9.3, foreign data wrappers can implement a write API.

In multicorn, this involves defining which column will be used as a primary key (mandatory) and implementing the following methods at your discretion:

def insert(self, new_values)
def update(self, rowid, new_values)
def delete(self, rowid)

Defining the primary key

Which column is used as primary key is defined by the property rowid_column. In order to support the write API, you will have to implement this property to return a column name, e.g.

@property
def rowid_column(self):
  return 'id'

Without implementing this property any attempt to use the write API methods will raise an exception.

Inserts

The insert method receives the new_values argument, which is a dictionary of column names to values. This method should returns a dictionary which will be used in case the INSERT has a RETURNING clause, e.g.

INSERT INTO my_foreign_table VALUES (some_value) RETURNING *;

In other words: The returned dictionary should be a mapping from column names to values.

Updates

The update method behaves like the insert method with regards to output. However, as input it receives an additional argument: the rowid. This is the value of the column named primary key by the rowid_column property, and thus defines which row should be updated.

Deletes

The delete method receives only the rowid argument, and should always return nothing.

Transactions

If you want to handle transaction hooks, you can implement the following methods:

def commit(self)
def rollback(self)
def pre_commit(self)

The pre_commit method will be called just before the local transaction commits. You can raise an exception here to abort the current transaction were your remote commit to fail.

The commit method will be called just at commit time, while the rollback method will be called whenever the local transaction is rollbacked.

Optimizations

As was noted in the code comments, the execute methods accept a quals argument. This argument is a list of quals object, which are defined in multicorn/__init__.py. A Qual object defines a simple condition wich can be used by the foreign data wrapper to restrict the number of the results. The Qual class defines three instance’s attributes:

Let’s suppose we write the following query:

SELECT * from constanttable where test = 'test 2' and test2 like '%3%';

The method execute would be called with the following quals:

[Qual('test', '=', 'test 2'), Qual('test', '~~', '3')]

Now you can use this information to reduce the set of results to return to the PostgreSQL server.

You don’t have to enforce those quals, PostgreSQL will check them anyway. It’s nonetheless useful to reduce the amount of results you fetch over the network, for example.

Similarly, the columns argument contains the list of needed columns. You can use this information to reduce the amount of data that has to be fetched.

For example, the following query:

select test, test2 from constanttable;

would result in the following columns argument:

['test', 'test2']

Once again, if you return more than these columns, everything should be fine.

Parameterized paths

The python FDW implementor can affect the planner by implementing the get_path_keys and get_rel_size methods.

def get_rel_size(self, quals, columns):

This method must return a tuple of the form (expected_number_of_row, expected_mean_width_of_a_row (in bytes)).

The quals and columns arguments can be used to compute those estimates.

For example, the imapfdw computes a huge width whenever the payload column is requested.

def get_path_keys(self):

This method must return a list of tuple of the form (column_name, expected_number_of_row).

The expected_number_of_row must be computed as if a where column_name = some_value filter were applied.

This helps the planner to estimate parameterized paths cost, and change the plan accordingly.

For example, informing the planner that a filter on a column may return exactly one row, instead of the full billion, may help it on deciding to use a nested-loop instead of a full sequential scan.

Error reporting

In the multicorn.utils module lies a simple utility function, log_to_postgres.

This function is mapped to the PostgreSQL function erreport.

It accepts three arguments:

Foreign Data Wrapper lifecycle

The foreign data wrapper associated to a table is instantiated on a per-process basis, and it happens when the first query is run against it.

Usually, PostgreSQL server processes are spawned on a per-connection basis.

During the life time of a server process, the instance is cached. That means that if you have to keep references to resources such as connections, you should establish them in the __init__ method and cache them as instance attributes.