Ulf Wendel

PECL/mysqlnd_ms needs updates for MySQL Group Replication

2014/10/30 - by admin - 0 comments

‘Synchronous’, multi-master, auto-everything – that’s the new MySQL Group Replication (IPC14 talk/slides) in simple words. After torturing PHP developers for decades with MySQL Replication there is now a new replication option which does not require read-write splitting. A system that does not know about slave lags and reading stale data. In theory, MySQL Group Replication is just about the perfect approach to run a standard PHP application (WordPress, Drupal, …) on a small cluster (3-7 nodes) in LAN settings. In theory, MySQL Group Replication improves both availability and performance.

Distribution Transparency

When designing replication systems there are some desirable goals which contradict each other. In a perfect world, from a developer's perspective, a database cluster would behave exactly the same way as a single database. The user should never have to worry where and how data is stored in the cluster. Transactions executed on the cluster would provide the same properties as transactions run on a standalone database. The cluster would never return stale data (synchronous).

Synchronous replication is desired but it requires coordination among cluster nodes. In LAN settings coordination can be reasonably fast. MySQL Group Replication is ‘synchronous’ replication (see slides for details). Deploy it on LAN only. On the internet, in WAN settings, when trying to replicate from Europe to Asia, things will be slow. If WAN, then forget about either distribution transparency or performance. If WAN, go for asynchronous MySQL Replication.

The extra work different clusters cause for the developer

Synchronous and asynchronous clusters always cause some extra work for the developer. Either approach requires load balancing and failover logic. An asynchronous approach adds: dealing with delays and stale reads. MySQL Replication is not only asynchronous but has only one master (primary). This adds: read-write splitting.

PECL/mysqlnd_ms tries to help with all these tasks and take them over in a semi-transparent way.

PECL/mysqlnd_ms support for synchronous clusters

PECL/mysqlnd_ms is a plugin for mysqlnd. PDO_MySQL and mysqli use mysqlnd as their default library to talk to MySQL. Either of the two APIs works with PECL/mysqlnd_ms, our load balancing and replication plugin.

The plugin monitors many API calls and aims to make using any kind of MySQL clusters easier. No matter what cluster: MySQL Replication, MySQL Cluster, MySQL Group Replication, 3rd party solutions. Example configurations are given in the PHP manual.

MySQL Group Replication usage task 1: load balancing

When moving an application from a single database server to a synchronous cluster there are two additional tasks: load balancing and failover. With PECL/mysqlnd_ms, load balancing does not require any code changes. The plugin intercepts your connect calls and tests whether the host you connect to matches the name of a config entry. If so, the plugin loads the config, learns from it which nodes there are and starts load balancing connections. Should you be too lazy to change the host name in your connects to match a PECL/mysqlnd_ms config entry, then just name the config entry after your current host names, have a config entry for ‘’ etc.

$link = new mysqli("myapp", ...);
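For illustration, a matching config entry could look like the following sketch (node names, hosts and the master/slave role assignment are made-up examples; with Group Replication, any node can take writes):

```json
{
    "myapp": {
        "master": {
            "node_1": { "host": "192.168.1.1", "port": "3306" }
        },
        "slave": {
            "node_2": { "host": "192.168.1.2", "port": "3306" },
            "node_3": { "host": "192.168.1.3", "port": "3306" }
        }
    }
}
```

Because the connect call above uses the host name "myapp", the plugin picks up this entry and balances over the listed nodes.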

MySQL Group Replication usage task 2: failover

The second task is to handle the failure of a cluster node and connect to the next available one. PECL/mysqlnd_ms does that for you if you want. It picks an alternative from the config and connects you to it.
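Automatic failover is opt-in and configured per cluster entry. A sketch, assuming the documented failover section of the PECL/mysqlnd_ms config (hosts are placeholders):

```json
{
    "myapp": {
        "master": { "node_1": { "host": "192.168.1.1" } },
        "slave":  { "node_2": { "host": "192.168.1.2" } },
        "failover": {
            "strategy": "loop_before_master",
            "remember_failed": true
        }
    }
}
```

With "loop_before_master" the plugin tries the remaining slaves and finally the master before it gives up; "remember_failed" keeps failed nodes out of the load balancing for the rest of the script run.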

There’s a small feature gap here. MySQL Group Replication tries to be an auto-everything solution. It automatically detects failed nodes. It also fully automates adding new nodes to the cluster. That’s cool but it means that over time the set of nodes can change and your config needs to be updated.

The PECL/mysqlnd_ms feature gap

There are two options. First, you could deploy the config. Second, after a failover or periodically, we could make PECL/mysqlnd_ms fetch the list of nodes from the cluster and make it reconfigure itself (see also here). That’s finally possible because MySQL Group Replication shows the list of nodes in a performance schema table.
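Fetching the node list could boil down to a query along these lines (a sketch; the table and column names are those of later GA releases, and the preview may have differed):

```sql
-- List the nodes currently in the group
SELECT member_host, member_port, member_state
  FROM performance_schema.replication_group_members;
```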

Once that is done, and MySQL Group Replication has reached GA, the auto-everything cluster for MySQL becomes real. All the stuff on the server side is already automatic. PECL/mysqlnd_ms is already GA and already handles all additional tasks – without code changes. Only a tiny addition is missing, and you could even get an auto-deployed PECL/mysqlnd_ms…

Happy hacking!

@Ulf_Wendel Follow me on Twitter

Is the HTTP Plugin for MySQL secure?

2014/10/02 - by admin - 0 comments

The HTTP Plugin for MySQL offers three APIs: REST-like CRUD, REST-like JSON DOCUMENT and SQL. The SQL API lets you run any SQL you want. Including, for example, DROP mysql.users if you mess up your setup. Insecure? It depends on your viewpoint.

It’s more than just another protocol…

At first glance HTTP is just another network protocol for sending requests to MySQL. HTTP is the protocol of the web. Whether you need to integrate MySQL into a larger setup and use web services for data exchange, or you want to access MySQL from a JavaScript client that is restricted to HTTP/Websocket: HTTP is the natural choice. CouchDB convinced many when it introduced the idea.

HTTP Client       Standard client
     |                  |
HTTP Protocol     MySQL C/S Protocol
     |                  |
          MySQL Server

Standard clients use the proprietary binary MySQL Client/Server Protocol to communicate with MySQL. The optional HTTP Plugin makes MySQL also listen to HTTP requests. It is your choice whether you want to make MySQL speak HTTP or not. HTTP is a clear-text protocol. Whether clear-text or binary makes no difference: security by obscurity does not work.

Both the MySQL Client/Server Protocol and the HTTP Protocol can be run over secure channels. This usually means SSL. SSL is available for both protocols, which means it’s 1:1 again. Let’s do a little cheat-sheet:

  HTTP MySQL C/S Security Rating
Obscurity Clear-text Binary 0 : 0
Encryption Supports SSL Supports SSL 0 : 0

The transport layer

The MySQL Client/Server Protocol either runs over TCP/IP, Unix Domain Sockets or Windows named pipes. Unix Domain Sockets are an inter-process communication method for processes on one machine. If you deploy the MySQL Server and its clients on the same machine, you can use this method of communication. It ensures that the data exchanged never appears on the network. The data stays on the machine, which adds to the overall security.

Although you may use TCP/IP to connect from a client to a MySQL Server on the same machine and the data should stay on the machine, this is not a common deployment. Just by using the TCP/IP stack instead of a local IPC method, you lose a tiny bit of security. The TCP/IP stack could send data to the network, the local IPC method can’t. It has no means to do so.

I assume most deploy MySQL and its clients on different machines. Then TCP/IP is used. TCP/IP is the only choice for the HTTP Plugin. Although you can use SSL to secure the connection, it is just a little worse than the MySQL C/S protocol:

  HTTP MySQL C/S Security Rating
Transport TCP/IP TCP/IP, Unix Domain Socket, Windows pipe 0 : 0.5

The access methods: SQL and NoSQL

Today’s MySQL speaks three protocols: HTTP, MySQL Client/Server Protocol and Memcache (InnoDB Memcache Plugin, MySQL Cluster support).

HTTP Client      Standard client        Memcache client
     |                 |                     |
HTTP Protocol    MySQL C/S Protocol     Memcache Protocol
     |                 |                     |
                  MySQL Server

This is done to offer SQL and Not-Only-SQL access methods. The Not-Only-SQL access methods bypass the SQL processing layers in MySQL. Already a decade ago, MySQL Connectors folks estimated that SQL parsing can take 50% of the total execution time of a simple query. Yet, it took innovative MySQL expert users and the NoSQL movement until the popular Memcache protocol crept into MySQL and killed the 50% overhead. In addition to SQL commands, MySQL also accepts key-value style commands. The latter use lower-level APIs inside the server. Thus, they are faster. The Memcache key-value style APIs don’t know about SQL injection or the like – no SQL used…

The HTTP Plugin development version we show maps all HTTP requests internally to SQL. It has three APIs. REST-like CRUD and REST-like DOCUMENT perfectly qualify for taking the lower-level API route. Whether we will do that is undecided. If so, SQL injection or the like does not matter – no SQL used…

Currently, in the development version, we don’t use the lower-level APIs. That’s no secret, we released the source. In the source one will find the use of SQL and escaping. We could have, had we bothered, used prepared statements. The initial HTTP Plugin development version works exactly like a trivial proxy written in a language like PHP, Python or the like. As much as such a proxy can be secured, the current code with all its SQL mapping could be secured.

I won’t put any rating entry on the cheat-sheet as this is an implementation detail of a development version.

The user authentication

Things are now down to one third of the HTTP Plugin – the SQL endpoint – versus a standard MySQL Client.

Both a standard MySQL client and a HTTP Plugin client execute SQL as a certain MySQL user. The actions of the client are restricted by the permissions granted. If you allow your client to run DELETE FROM mysql.user you may find yourself in an uncomfortable situation soon.
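A sketch of how one would lock down the MySQL user behind the HTTP Plugin's SQL endpoint (user, password, database and table names are made-up examples):

```sql
-- Dedicated, minimally privileged account for the SQL endpoint:
-- it may read one table and nothing else.
CREATE USER 'http_sql_user'@'localhost' IDENTIFIED BY 'secret';
GRANT SELECT ON myapp.products TO 'http_sql_user'@'localhost';
```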

If SSL is used, it is valid to accept MySQL user credentials from the HTTP client and log the client in as the MySQL user given. This clear-text user credential exchange over SSL is also done with the MySQL Client/Server Protocol. It is implementation details of the various pluggable authentication methods MySQL offers that restrict HTTP clients to logging in as a MySQL user of their choice only when SSL is used. For example, some of the authentication methods require a handshake procedure.

  HTTP MySQL C/S Security Rating
MySQL user password security Clear-text over SSL Clear-text over SSL and other methods 0 : 0

Without SSL, the HTTP client defaults to a preconfigured user. We used HTTP Basic Authentication, which is a rather undesirable HTTP authentication method, as a simple and immediately understandable way of getting user credentials from an HTTP client. Of course, there are better ones!

Other HTTP authentication methods are also more complicated and distract. Distract from the SSL explanation. Distract from the option that one could, and should, see the authentication process as two-staged. The first step is the login towards the HTTP Plugin, then a second step maps the HTTP user to a MySQL user. Wouldn’t it be nice if for non-SSL connections HTTP user ‘Ulf’ could be mapped to MySQL user ‘evil’ and HTTP user ‘Johannes’ could be mapped to MySQL user ‘caretaker’? Here, the MySQL user’s password would not be taken from the HTTP world, it would be taken from some plugin config.

First stage (network)
HTTP user -> HTTP Authentication method (Basic Auth, OAuth, …)
Second stage (inside the Plugin)
HTTP user -> Mapping to MySQL user

No rating. I find it hard to compare with MySQL where you always have the password in clear-text at the client.

What people missed, …

The user authentication and mapping alone is still not enough. See slide 100! Slide 100 is what comes after the safe harbour statement. In other words this is pure speculation and idea spreading. No promise we will ever make this happen.

Here’s the illustration from the slide, slightly simplified (it’s 2am – time for shortcuts ;-)):

HTTP Client
Improved security, more freedom
HTTP Plugin
app/ v8 JavaScript sql/, crud/, docs/ – built-in

The MySQL permission system alone is too coarse to fully match the capabilities of any of today’s self-baked HTTP-to-MySQL proxies. You can limit a MySQL user to be allowed to read from a single table only. But, the person can read all rows from the table. You could try to use a view, but probably you would just implement some filtering in your proxy. “Just implement some filtering” becomes a problem when the HTTP Plugin is your proxy. There’s no script language, you can’t do that. Unless there was a script language built into the proxy…

  HTTP MySQL C/S Security Rating
Application based filtering Implementation detail not bound to protocol Implementation detail not bound to protocol 0 : 0.5 (see text)

All in all, a HTTP Plugin can get pretty close to today’s solutions. It’s not there yet.

Happy hacking!


MySQL 5.7 – HTTP Plugin for MySQL

2014/09/27 - by admin - 8 Comments

It is official: MySQL listens to HTTP and speaks JSON. MySQL got a new plugin that lets HTTP clients and JavaScript users connect to MySQL using HTTP. The development preview brings three APIs: key-document for nested JSON documents, CRUD for JSON mapped SQL tables and plain SQL with JSON replies. More so: MySQL 5.7.4 has SQL functions for modifying JSON, for searching documents and new indexing methods! The download and documentation (also here) are on http://labs.mysql.com/, the slides are below:

What a buzzword bingo! The HTTP Plugin is just a HTTP proxy. One that you can load into MySQL, if you want, to avoid having to write your own one. It simplifies the setup, it is a convenience feature that you should consider if you don’t need the power of scaling the database and the web server independently.

Without the HTTP Plugin          With the HTTP Plugin
HTTP Client                      HTTP Client
     |                                |
Web Service (e.g. PHP script)         |
Web server                            |
     |                           HTTP Plugin
MySQL Server                     MySQL Server

Did we forget about the developer?

The HTTP Plugin just makes MySQL a bit more web developer friendly. Hello frontend developers, we heard about you ;-). MySQL was born in a world without the web. MySQL grew up with backend developers. MySQL somewhat missed the growing role of frontend development, the developers and their ideas. NoSQL happily listened: be it built-in JavaScript, the choice of JSON as a serialization format for document stores, asynchronous APIs or just HTTP interfaces.

The three APIs

There are three APIs: plain SQL over HTTP, CRUD and DOCUMENT. All three return JSON. Technically, JSON is just a different data serialization format for MySQL. If you are a frontend JavaScript developer and live in a world of HTTP and JavaScript, you likely prefer that over some binary format.

To be totally frank: at this point the integration of JSON into the APIs is only so-so. That is on purpose. None of the APIs is finalized. We translated some “what if” questions into code so that one has something to play with. Now we fish for input.

Let’s get the fishing started, let’s cause some laughter. Here’s the SQL endpoint executing SELECT 1:

shell> curl ... --url ""

The SQL endpoint replies 1:1 with what a standard MySQL client gets and what a MySQL Connector may hide from the developer’s eyes. This is all the information the MySQL Client/Server Protocol has to offer. We included everything to see if there is anybody who wants to write their own “driver” on top of it. If so, that self-made driver can expose similar features.

Ignoring the chatty reply, the SQL endpoint is needed for rich queries. The CRUD and DOCUMENT endpoint currently support key-document semantics only.

The CRUD endpoint returns a single row from a MySQL table identified by its primary key value. The row is mapped to JSON in the most simple way. No meta data is included: a very lightweight reply.

shell> curl ...  --url ""

The SQL endpoint supports HTTP GET requests only. The CRUD and DOCUMENT endpoints accept GET, PUT and DELETE requests. A better mapping is much desired. To be frank, once more: we had the choice between writing documentation and attempting to look a bit more REST-like. You know REST, what’s there to demo?

The DOCUMENT endpoint takes any valid JSON and stores it. The access pattern is key-document:

shell> # curl -i -X PUT --user basic_auth_user:basic_auth_passwd --url ""
HTTP/1.1 201 Created
{"info": "Table created"}
shell> # curl -i -X PUT -d '{"words": ["Hello", "world"]}' --user basic_auth_user:basic_auth_passwd --url ""
HTTP/1.1 200 OK
{"info": "Document added"}
shell> # curl -X DELETE -i --user basic_auth_user:basic_auth_passwd --url ""
HTTP/1.1 200 OK
{"info": "Table dropped"}

That’s inside

The HTTP Plugin for MySQL works exactly like any of today’s client-side proxies/web services. A multi-threaded web server library featuring asynchronous I/O accepts the web request. Then, the plugin translates it to SQL statements to compute an answer. A server plugin alone does not make MySQL NoSQL or a document store overnight.

MySQL 5.7.4 brings some improvements that help with the proxy job: updated SQL JSON functions, virtual columns and – stay tuned on that topic – functional indexes!

Acting exactly like any of today’s client-side proxies and storing JSON documents in plain BLOB columns has the advantage of allowing us to act fast. Improving the HTTP API then takes about as much effort as improving a web service! The MySQL Connectors team can act fast and independently of any other team on this.

Starting simple with the CRUD and DOCUMENT endpoints, offering key-document semantics only, leaves room for future optimization. There are faster ways than mapping HTTP requests to SQL. For MySQL 5.7 the InnoDB team claims 1,100,000 QPS through the Memcache protocol and 625,000 QPS through SQL. But optimizing for speed now would be a premature optimization. After all, we don’t have finalized APIs yet.

The other stories

Since 2009, since the time when the equation NoSQL = Not Only SQL appeared, MySQL expert users have attempted to create key-value access methods for MySQL. This ultimately resulted in the development of the Memcache interfaces for MySQL. There is a dual access method for data inside MySQL: fast key/value and rich SQL.

PHP, … any standard client
Memcache driver MySQL Connector
| |
Fast Key/Value Rich SQL
| |
InnoDB Memcache Plugin Standard C/S interface

This story has one disadvantage: clients need two drivers. The protocol changes, so two libraries are needed that implement the network protocol and data serialization. Assuming the CRUD and DOCUMENT endpoints were optimized for performance, clients could choose between fast access and a rich query language by just changing the URL.

How about SQL over protocol X? The SQL endpoint has given you a quick impression of how much information the standard MySQL Client/Server Protocol must carry to handle all possible SQL and every possible client. If you need only a subset of the SQL features offered by MySQL, you could try to write a plugin that has a leaner protocol. There are major users of MySQL that restrict themselves to certain SQL features. Some even developed their own programming languages. To come back down to earth: how about Websocket for JavaScript users?

HTTP Client HTTP Client Your Client
| | |
HTTP (JSON) Websocket (JSON) Your protocol (your serialization)
| | |
HTTP Plugin Websocket Plugin Your Plugin

These and many more stories, some of which I hinted towards the end of the presentation, become possible if MySQL had just two plugin services:

HTTP Client Your Client
| |
HTTP (JSON) Your protocol (your serialization)
| |
HTTP Plugin Your Plugin
Plugin SQL Execution Service Plugin Authentication Service

But that’s another story for another audience. And, no, the development preview release only contains some proof-of-concept hacks here, no proper plugin services…

Happy hacking!


PECL/mysqlnd_ms: summer time, (connection) pool time – Fabric support internals

2014/08/07 - by admin - 0 comments

The MySQL replication and load balancing plugin for PHP, PECL/mysqlnd_ms, aims to make using a cluster of MySQL servers instead of a single server as transparent as possible. Should you ever want to migrate your PHP app from a single MySQL server to any kind of MySQL cluster, install PECL/mysqlnd_ms and see how much breaks. A simple app might need no code changes at all. Your API’s connect call remains unchanged, your query execution calls remain the same: one connection handle is transparently switched to an appropriate cluster node (read-write splitting, weighted load balancing, …).

/* Or, use PDO_MySQL... One user handle for an entire cluster */
$link = new mysqli("mycluster", ...);  
/* Goes automatically to a slave, if read-write splitting is enabled */
$link->query("SELECT ...");
/* All actual connections to master or slave will use UTF8 from now on */
$link->set_charset("utf8");
/* With read-write splitting on, goes to master */
$link->query("INSERT ...");

Tell PECL/mysqlnd_ms that connect("mycluster", ...) means you want to connect to a cluster “mycluster” which is, for example, a MySQL Replication cluster with master A and slaves B, C. Call $link->select_db("mydb"), $link->autocommit(false), $link->set_charset("utf8"), … and no matter which node (A, B, C) you actually get automatically connected to, the connection has those settings! Unless you are using the sharding feature of our current, still very limited MySQL Fabric support… A new internal connection pool shall cure this and more. A tour through PECL/mysqlnd_ms internals follows.

PECL/mysqlnd_ms example configuration for a simple MySQL Replication setup

{
    "myapp": {
        "master": {
            "master_A": {
                "host": "localhost",
                "socket": "/tmp/mysql.sock"
            }
        },
        "slave": {
            "slave_B": {
                "host": "",
                "port": "3306"
            },
            "slave_C": {
                "host": "",
                "port": "3306"
            }
        }
    }
}

How a single user handle stands for many actual connections

No matter whether you use PECL/mysqlnd_ms for a simple MySQL Replication cluster or a farm of MySQL servers that shard data one way or the other, it always confronts you with a massive change: a single user connection handle is mapped internally to many actual connections.

PHP script (*.phpt) Inside PHP (*.c)

$link = 
  mysqli_connect("mycluster", ...);

Object representing $link
List of masters Actual connection to master_A
List of slaves Actual connection to slave_B
Actual connection to slave_C
List of filters Load Balancer: Random

Furthermore, to keep the number of actual connections low, the PECL/mysqlnd_ms plugin for mysqlnd will – by default – use lazy connections. A lazy connection to a node is not established before a query is to be run on the node. Assume you do $link = mysqli_connect("mycluster"); $link->set_charset("UTF8");. What happens inside PECL/mysqlnd_ms is very different from the normal case. Normally, without the plugin, $link would be one actual, established connection to a MySQL server and the server would immediately reply to your set_charset() command.

PHP script (*.phpt) Inside PHP (*.c)


Loop over all actual master and slaves connections:

  • If lazy: recall setting for later connect()
  • If established: execute on server

With the plugin, we take the command and dispatch it to all masters and slaves. At this point the internal master and slave lists may hold already established or lazy connections. If it’s an established connection, the command is executed on the server. If it’s a lazy connection, the setting is remembered but the command execution itself is delayed until the connection is established. This is what I like to call semi-transparent. However, because we delay command execution in case of lazy connections, errors may also be delayed until the connection is established – be aware.
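Lazy connections are on by default and can be toggled per cluster entry in the plugin config. A sketch (hosts are placeholders; setting the value to 0 forces all connections to be opened at connect time):

```json
{
    "myapp": {
        "master": { "master_A": { "host": "localhost" } },
        "slave":  { "slave_B":  { "host": "192.168.2.27" } },
        "lazy_connections": 1
    }
}
```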

Still, this is about as transparent and comfy as things can get, if there is such a thing as a connection state (note that I’m ignoring the SQL side). An alternative would have been to disable features and strip down connections to something without much of a state, however, different story…

MySQL Fabric: the auto-everything clustering vision

Probably a decade after we should have had a solution ready, PECL/mysqlnd_ms was ready. But MySQL Replication was lagging badly behind any of the better NoSQL clustering solutions. Automatic master failover? Considered a user task. Have clients discover the new master automatically? Unthinkable! Sharding for MySQL Replication? Considered a user task. MySQL Proxy? Still alpha. Sure, after all the years there had been third-party solutions for all this, but the MySQL offering became less attractive.

MySQL Fabric is an administration tool for farms of MySQL servers. It is for the mass of the “all I need is a backup server…” users that want nothing but automatic failover for MySQL Replication. And, it is for the handful of “we host several thousand MySQL servers” experts that go crazy and really need sharding.

MySQL Fabric – monitoring, administration --> Any MySQL Replication cluster
     ^                                           Master (fails)
     | asks Fabric about nodes, e.g. master      Slave       Slave
Clients, e.g. PHP using PECL/mysqlnd_ms

The pure existence of an official administration tool is nice, but without a way for clients to ask about the current structure of the cluster it would have been only yet another tool. The communication channel between clients and Fabric is what adds value. For example, should a master fail, Fabric promotes a new one – like any other tool. Then, the clients need to learn about the new master. This can finally be automated, be done out of the box: the clients simply ask Fabric for the new master.

The story is similar with sharding. You tell the driver (here: PECL/mysqlnd_ms) which shard key you are going to use, and the driver asks Fabric for a list of appropriate shards. Usually, Fabric will reply with not only one shard but return a master and a couple of slaves, because you want HA for your shards and use a MySQL Replication cluster for each and every shard. Then, PECL/mysqlnd_ms picks a shard for you.

$link = new mysqli("myfabric", ...);
mysqlnd_ms_fabric_select_shard($link, $table, $shard_key);
$link->query("SELECT ...");
$link->query("INSERT ...");

PECL/mysqlnd_ms matches all Fabric requirements but implementations clash

At first glance, this seemed peanuts to support. Johannes added the code for talking to Fabric and replaced the internal server lists. That was the idea we all had. But, simple testing quickly revealed we had been wrong.

PHP script (*.phpt) Inside PHP (*.c)


Object representing $link
List of masters Load from Fabric
List of slaves Load from Fabric

When we have a sequence like $link->set_charset("utf8"); mysqlnd_ms_fabric_select_shard($link, ...), then the charset setting is forgotten. The charset was set on the actual connections. But, we have loaded a new server list from Fabric and with it we got new connections.

PHP script (*.phpt) Inside PHP (*.c)

mysqlnd_ms_fabric_select_shard($link, ...);

  • Load master list from Fabric
  • Load slave list from Fabric


  • Set all master connections to utf8
  • Set all slave connections to utf8

mysqlnd_ms_fabric_select_shard($link, ...);

  • Load server lists from Fabric
  • utf8 charset setting lost

Furthermore, there was a problem with our load balancing filters. The load balancing filters (random, round-robin) support assigning a weight to a server. The higher the weight, the more requests we dispatch to the server on average. You can use this to optimize for proximity/latency or to assign load depending on the size of your machines. For the weighted load balancing, the filters create cached information during the startup phase based on the initial server lists. It is assumed that the server lists will not change until the script ends. However, Fabric does exactly that: replace server lists during runtime. Hence, this was broken too. The breakage is of little relevance as you would never combine the classic weighted load balancing with Fabric, but still.
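For reference, this is roughly what a weighted load balancing config looks like (node names, hosts and weights are made-up examples): with a weight of 2, slave_B receives on average twice as many requests as slave_C.

```json
{
    "myapp": {
        "master": { "master_A": { "host": "localhost" } },
        "slave": {
            "slave_B": { "host": "192.168.2.27" },
            "slave_C": { "host": "192.168.2.28" }
        },
        "filters": {
            "random": {
                "weights": {
                    "master_A": 1,
                    "slave_B": 2,
                    "slave_C": 1
                }
            }
        }
    }
}
```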

Last but not least, there is our new distributed/XA transaction support. Imagine you need a transaction that spans two shards. You start it on one shard and PECL/mysqlnd_ms does all the XA-related SQL magic for you in the background. Then, you switch to another shard. All the connections get swapped out, including the one on which you started the XA transaction. Upon swapping out the connections, they get closed and MySQL aborts the distributed transaction…

PHP script (*.phpt) Inside PHP (*.c)

mysqlnd_ms_xa_begin($link, ...);

  • Begin transaction on shard/server A

mysqlnd_ms_fabric_select_shard($link, ...);

  • Load server lists from Fabric
  • Connection to A closed, transaction aborted

SOLVED, well, on the way to be

There is now an internal connection pool abstraction layer that solves all these problems. Connections are added to the pool and removed from it through API calls. There is some reference counting that will prevent a connection from being closed while it is still needed for an ongoing distributed transaction.

The pool gets to know all the API commands that have been dispatched to connections to align their state. At the time of writing we monitor change_user(), select_db(), set_charset(), set_server_option(), set_client_option(), set_autocommit() and ssl_set(). When you switch from one shard to another using mysqlnd_ms_fabric_select_shard(), the pool can replay the commands. The knowledge about the last charset chosen is no longer bound to the connections. The pool knows the charset and can align the state. When done, the pool can inform interested parties, such as a filter, about the new list of active connections.

PHP script (*.phpt) Inside PHP (*.c)


  • Inform the connection pool charset=utf8
  • Set all connections to utf8

mysqlnd_ms_fabric_select_shard($link, ...);

  • Load server lists from Fabric
  • Tell the pool to replace active connection list
  • Keep connections open which are still referenced, e.g. for XA transactions
  • Ask the pool to replay setting, e.g. charset=utf8
  • Tell all listeners, e.g. filter about new connection list

In theory, we should now be able to start supporting Fabric properly in PECL/mysqlnd_ms. But, some details are still open and not finished. For example, the order in which the pool replays state alignment commands when the connection lists are replaced is somewhat random. This is likely to cause issues when settings depend on each other and on the order of execution. In the worst case, we may be forced to offer a user API call that allows setting an order, burdening the developer with this task.

Looking forward: reduce connection overhead

A positive side effect of having a pool abstraction is that we may be able to reduce connection overhead. Assume your application is constantly switching between two shards A and B. We load the server lists for A and B exactly once from Fabric, then (in theory, for now) we cache them. When you query shard A, we open at least one connection. If the shard itself is a MySQL Replication cluster, we may have to open one connection for reading and one for writing. The next minute you switch to shard group B. So far, we have closed the connections to shard group A. Assume you now go back to shard group A. We close the connections to B and reopen some to A… what a waste.

Shard can be MySQL Replication clusters for HA
Shard group A   Shard group B
Master   Standalone/Master
Slave Slave  

With the pool abstraction in place, we put all the connections we ever open in a pool, when you first use a shard group. Consider the case again where your application switches from shard group A to B and back to A. When you switch from one shard to another, we don’t necessarily close the current connections. We mark them inactive. Then, we check whether we already have connections to the new shard open. If so, we mark them active and use them. Otherwise, we open new connections. In a sequence of switching from A to B and back to A, the last switch will – likely, depending on load balancing rules – boil down to marking already open connections as active.

New internal connection pool

  • Active (= currently used for load balancing)
    • Shard group A server: master connection
    • Shard group A server: slave connection
  • Inactive (= currently not used)
    • Shard group B server: connection to B

No pool management/administration possible yet

The new connection pool does not break any existing functionality. No surprise, because the classic functionality of PECL/mysqlnd_ms does not really need or use the new pool. It is the Fabric support for which we need the pool. And the Fabric support in PECL/mysqlnd_ms is not much tested. Until this testing is done, the implementation is complete and we trust the new pool code, you should consider it internal. Because it is internal, there is no way for you to manage or administrate the pool. Not yet.

There are, however, some new statistics that let you monitor pool activity.

Pool monitoring statistics
pool_masters_total Number of master servers (connections) in the internal connection pool.
pool_slaves_total Number of slave servers (connections) in the internal connection pool.
pool_masters_active Number of master servers (connections) from the internal connection pool which are currently used for picking a connection.
pool_slaves_active Number of slave servers (connections) from the internal connection pool which are currently used for picking a connection.
pool_updates How often the active connection list has been replaced and a new set of master and slave servers has been installed.
pool_master_reactivated How often a master connection has been reused after being flushed from the active list.
pool_slave_reactivated How often a slave connection has been reused after being flushed from the active list.
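The counters can be read at runtime through the plugin's existing statistics call, mysqlnd_ms_get_stats(). A minimal sketch, assuming a build that includes the new pool code:

```php
/* Read the pool counters; assumes a build with the new pool statistics. */
$stats = mysqlnd_ms_get_stats();
printf("masters: %d total, %d active\n",
       $stats['pool_masters_total'], $stats['pool_masters_active']);
printf("slaves: %d total, %d active\n",
       $stats['pool_slaves_total'], $stats['pool_slaves_active']);
printf("list updates: %d, reactivated: %d master(s), %d slave(s)\n",
       $stats['pool_updates'],
       $stats['pool_master_reactivated'],
       $stats['pool_slave_reactivated']);
```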

Happy hacking!

@Ulf_Wendel Follow me on Twitter

PECL/mysqlnd_ms: how failed XA transactions get fixed

2014/07/29 - by admin - 0 comments

XA transactions are an open standard for distributed transactions. A distributed or global transaction can span multiple databases. XA transactions come in handy, for example, when data sets are sharded over multiple servers and a business transaction affects multiple shards. The free MySQL Fabric tools help to set up and manage a sharded MySQL cluster. The development version of PECL/mysqlnd_ms 1.6 helps with XA transactions. It abstracts SQL details and acts as a transaction manager. The PHP MySQL driver kicks in when things go wrong and XA transactions show their nasty side: blocked servers. Good news: this is a rare case. Bad news: a deep dive below.

Grant all Ulf’s a 10% discount (or none of them = use a transaction)
MySQL shard A: EMEA customers MySQL shard B: US customers MySQL shard C: APAC customers

/* Note: Fabric and XA don't go together - yet... it's coming! */
$link = new mysqli("mysql_sharding_cluster", "user", "password");

mysqlnd_ms_xa_begin($link, 1);

mysqlnd_ms_fabric_select_shard($link, "shop.customer", "EMEA");
$link->query("UPDATE customer SET discount = 10 WHERE first_name = 'Ulf'");

mysqlnd_ms_fabric_select_shard($link, "shop.customer", "US");
$link->query("UPDATE customer SET discount = 10 WHERE first_name = 'Ulf'");

mysqlnd_ms_fabric_select_shard($link, "shop.customer", "APAC");
$link->query("UPDATE customer SET discount = 10 WHERE first_name = 'Ulf'");

mysqlnd_ms_xa_commit($link, 1);

PECL/mysqlnd_ms as a transaction manager

XA transactions use the two-phase commit protocol, which is a blocking protocol. Please see also my previous blog post on the nature of the protocol and the MySQL implementation limitations. If the client that drives the XA transaction, your PHP script, crashes at a certain point, some XA participants (MySQL servers) cannot make any progress. In the worst case, they end up waiting endlessly for a decision on the global transaction's outcome. No, there is no timeout. As they wait, they block resources. That can be memory used for the transaction or a lock on some table.

Blocked during the second phase of the 2PC/XA protocol
PHP (coordinator) –> Global commit –> first MySQL participant, then the coordinator crashes
First participant: committed
Remaining participants: uncommitted, waiting for global commit or rollback

Any serious user of XA transactions will therefore have to implement some mechanism that ensures progress in case of crashes. After a crash, it must be possible to learn which participant is blocked, connect to the participant and tell it to either commit or roll back the open transaction. This housekeeping job is rather annoying, yet important. PECL/mysqlnd_ms can do it for you: it can act as a transaction manager. (As an aside: the academic world does distinguish between a transaction manager and a coordinator. I am using the terms interchangeably here.)
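Without the plugin, that housekeeping looks roughly like the sketch below. It is an illustration only: host and credentials are placeholders, and in a real system the commit-or-rollback decision per stuck transaction depends on what the other participants already did.

```php
/* Sketch of manual housekeeping on one participant after a crash.
   Host/credentials are placeholders; whether to commit or roll back
   each transaction depends on the state of the other participants. */
$participant = new mysqli("participant_host", "user", "password");

$res = $participant->query("XA RECOVER");
while ($row = $res->fetch_assoc()) {
    /* The 'data' column holds the xid of a transaction stuck in PREPARED state. */
    $xid = $participant->real_escape_string($row['data']);
    $participant->query("XA ROLLBACK '" . $xid . "'");
}
```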

MySQL as a state store to track XA transactions

Upon request, PECL/mysqlnd_ms can record the state of each global transaction in a MySQL database. Should your PHP script (and with it PECL/mysqlnd_ms) crash or be interrupted in some other way while an XA transaction is unfinished, then the next PHP script that runs can check the database and “garbage collect” the unfinished global transaction. The “next” PHP script could run on the same server or on another one; as long as all servers use the same MySQL database to track XA transactions, the garbage collection will work flawlessly.

The illustration below shows a scenario with two PHP servers that use PECL/mysqlnd_ms. A script on one of the servers runs and begins a distributed (XA) transaction that involves two MySQL servers M1 and M2. As you can see, a simple sequence of mysqlnd_ms_xa_begin(); ... ; mysqli_query(); mysqli_query() causes a lot of background activity inside PECL/mysqlnd_ms.

Tracking XA transactions for “garbage collection” to avoid blocking situations
PHP server A PHP server B
Your code Inside PECL/mysqlnd_ms
mysqlnd_ms_xa_begin(id = 1)
  • ensure there is no open transaction
  • optional: record new XA trx in state store
… /* select some server M1 */ …
  • optional: if new participant, record participant in state store
  • inject XA BEGIN on server M1
  • update participant state to XA BEGIN sent
  • optional: record new state in state store
  • execute query on server M1
… /* select some server M2 */ …
  • optional: if new participant, record participant in state store
  • inject XA BEGIN on server M2
  • update participant state to XA BEGIN sent
  • optional: record new state in state store
  • execute query on server M2

PECL/mysqlnd_ms does all the things that you would have to do in your script manually otherwise. It issues the appropriate SQL commands for XA transactions on the participating MySQL servers. And, optionally, it can record the participants and their state in a state store.

PHP server A PHP server B
Your code Inside PECL/mysqlnd_ms
  • optional: record intent to commit in state store
  • inject XA END on server M1
  • optional: record new state in state store
  • inject XA PREPARE on server M1
  • optional: record new state in state store
  • inject XA COMMIT on server M1
  • optional: record new state in state store
  • inject XA END on server M2
  • optional: record new state in state store
  • inject XA PREPARE on server M2
  • optional: record new state in state store
  • unexpected crash

In case of an interruption, it is possible to reconstruct the state of the XA transaction from the records in the state store. Using a state store is optional; however, if you don't configure PECL/mysqlnd_ms to use one, it is your responsibility to ensure that no MySQL server ever gets blocked. Please note how often the state store gets queried. Using a state store adds quite some work to the system, but it is necessary to play it safe.

PHP server A PHP server B
Your code Inside PECL/mysqlnd_ms
Script ends
  • Rollback open XA transactions, if any and rollback_on_close is set (default)
PHP internal shutdown: RSHUTDOWN
  • Decide whether to do background garbage collection based on probability setting
  • Search for up to max_transactions_per_run unfinished XA transactions in state store
  • Do at most max_retries GC runs attempting to fix a certain failed XA transaction

Issues that require no garbage collection

Most errors around XA transactions do not require special garbage collection. Simple cases are handled immediately by PECL/mysqlnd_ms and end with the transaction being rolled back. Errors during the first phase of the XA two-phase commit protocol are uncritical. No matter whether the error is caused by a server crash, a client/PHP crash, the client losing its connection to one of the MySQL servers participating in the global transaction, or a participant reporting a simple SQL error, the XA transaction will be properly rolled back.

Participants that got disconnected or recover from a crash forget about XA transactions that were in their first phase at the time of the error. Assume you have started an XA transaction in which three MySQL servers A, B, and C participate. During mysqlnd_ms_xa_commit(), PECL/mysqlnd_ms steps the participating servers through the commit protocol and issues the SQL commands: XA END (XA_ACTIVE state), XA PREPARE (XA_PREPARED state), XA COMMIT (XA_COMMIT state). For each step, the list of participants is iterated.

Uncritical first-phase errors during mysqlnd_ms_xa_commit()
MySQL A MySQL B MySQL C PHP using PECL/mysqlnd_ms
XA_IDLE mysqli_query(<some query>)

  • Pick appropriate server for some query
  • Inject XA BEGIN on server
  • Run some query on server
XA_IDLE XA_IDLE XA_IDLE Assorted queries run on all participants
XA_ACTIVE XA_ACTIVE XA_IDLE (XA END failed) mysqlnd_ms_xa_commit()

  • On all participants: change participant state from XA_IDLE to XA_ACTIVE (= run XA END)
  • If error, try rollback on all participants, including the failed one
  • Reason of the error on participant 3 (or during rollback on other participants…)
    • Client crashed or got disconnected: safe to ignore, server has dropped the XA transaction
    • Server crashed: safe to ignore, XA transaction has been dropped (2PC: first phase)
    • SQL error: rollback performed

Let there be an error with some participant around XA END. Upon the error, the plugin rolls back the global transaction. Effectively, it continues stepping the servers through the states, but this time with the goal of sending XA ROLLBACK. No matter what made XA END fail in the first place, the overall issue will most likely be solved. Should PHP have crashed during XA END, all the servers notice it and drop the XA transaction. Should PHP have lost its connection to a server, it is the same: the server drops the XA transaction. Should one MySQL server have gone away, it will drop the XA transaction upon recovery. Because PHP is still working, PECL/mysqlnd_ms tells the other servers to do the same and issues a rollback.

Whatever goes wrong prior to any server reaching XA_PREPARED (= second phase of the two-phase protocol), a rollback happens and no garbage collection is required. You are safe even without configuring a state store for garbage collection!

When garbage collection becomes necessary

Any trouble that PECL/mysqlnd_ms has committing a global transaction after any participant entered the XA_PREPARED state may require garbage collection. But there are still cases that can be ignored and require no attention.

Assume the below case where the global transaction has been prepared on servers A and B but XA PREPARE fails on the third participant, server C. Should PHP crash, it's safe to ignore the case. A and B will see the client disconnect and roll back the transaction, and so does C.

Should PHP have lost a connection to any of the servers, the disconnected MySQL server forgets about the XA transaction, even if it is in XA_PREPARED state. And, any participant that can still be reached will receive the rollback command from PECL/mysqlnd_ms.

Second-phase errors during mysqlnd_ms_xa_commit()
MySQL A MySQL B MySQL C PHP using PECL/mysqlnd_ms
XA_PREPARED XA_PREPARED XA_ACTIVE (XA PREPARE failed) mysqlnd_ms_xa_commit()

  • On all participants: change participant state from XA_IDLE to XA_ACTIVE, then to XA_PREPARED
  • If error, try rollback on all participants, including the failed one
  • Reason of the error on participant 3 (or during rollback on other participants…)
    • Client crashed or got disconnected: safe to ignore, server has dropped the XA transaction
    • SQL error (but no server crash): rollback performed

However, should a participating server crash while in XA_PREPARED state, or fail to switch to XA_COMMIT after the global transaction has been committed on any of the other participants, PECL/mysqlnd_ms leaves the issue to the garbage collection. The plugin will neither wait until a crashed server possibly returns nor retry a failed XA COMMIT command.

A case during mysqlnd_ms_xa_commit() that needs garbage collection
MySQL A MySQL B MySQL C PHP using PECL/mysqlnd_ms
XA_COMMIT XA_COMMIT XA_PREPARED mysqlnd_ms_xa_commit()

  • On all participants: change state to XA_PREPARED (success)
  • On all participants: change state to XA_COMMIT (fails on C, either due to an SQL error or a server crash)
  • If error (as is the case on C), wait for garbage collection

Please note that by default the plugin takes no immediate action. It leaves you with a yet-to-be-completed global transaction. In the example, the failed transaction would already be committed on participants A and B, but C is blocked waiting for a decision. Should the MySQL server C be online, it can, of course, accept clients and run transactions, but all resources associated with the failed transaction are still held and blocked.

Transactional databases are designed with the assumption that transactions rarely fail. To get into a situation with a blocked server (above example), a rare transaction failure must happen. And it must happen at a certain point. In other words, the example describes a rare case. If you are still concerned about this rare case and want to see it solved immediately – if anyhow possible – do not wait for the automatic background garbage collection but invoke the garbage collection manually.

if (!mysqlnd_ms_xa_commit($link, $xa_id)) {
  /* optional: trigger GC immediately */
  mysqlnd_ms_xa_gc($link, $xa_id);
}
A MySQL limitation: replication troubles

MySQL has some XA restrictions but one really hurts. Assume the example with three participants A, B and C. MySQL servers A and B have committed an XA transaction, C crashes in XA_PREPARED state. Upon recovery, the PECL/mysqlnd_ms garbage collection will commit the XA transaction on C because the transaction is already partially committed on A and B.

Replication troubles
MySQL A: changes in binary log –> slave of A gets the changes
MySQL B: changes in binary log –> slave of B gets the changes
MySQL C: crash, then PECL/mysqlnd_ms GC issues XA COMMIT; changes not in binary log –> slave of C loses the changes

Trouble is, C does not log the changes made by the XA transaction in its binary log. A slave of C will not receive the updates. The master and the slave go out of sync. With MySQL Fabric and sharding being a prime use case for XA transactions, this is a severe server limitation. MySQL Fabric makes heavy use of MySQL Replication to achieve high availability of shards.

To play really safe: use XA only for read only transactions

The lost change set problem leads to the recommendation to use XA transactions for read-only operations only, if you want to be 100% safe. Should you not be using MySQL Replication, write operations are safe too.
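A hedged sketch of that recommendation: running a cross-shard read-only report inside one XA transaction. Table and shard names follow the article's earlier (forward-looking) example, where Fabric and XA are assumed to work together; error handling is omitted.

```php
/* Read-only XA across two shards: safe with MySQL Replication,
   because no change sets can be lost. Names follow the earlier example;
   Fabric + XA together is forward-looking, as noted in the article. */
$link = new mysqli("mysql_sharding_cluster", "user", "password");

mysqlnd_ms_xa_begin($link, 2);

mysqlnd_ms_fabric_select_shard($link, "shop.customer", "EMEA");
$emea = $link->query("SELECT COUNT(*) FROM customer")->fetch_row();

mysqlnd_ms_fabric_select_shard($link, "shop.customer", "US");
$us = $link->query("SELECT COUNT(*) FROM customer")->fetch_row();

mysqlnd_ms_xa_commit($link, 2);

printf("EMEA: %d, US: %d customers\n", $emea[0], $us[0]);
```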

Configuring a state store for garbage collection

For PECL/mysqlnd_ms to be able to solve failed XA transactions that cannot simply be rolled back immediately, you must configure a state store. Version 1.6 of the plugin supports only one state store: MySQL. The code has been laid out to allow for other state stores, but hey, why look further? MySQL is readily available (otherwise you would not use a MySQL driver…) and it will not lose data during a crash.

  "myapp": {
    "xa": {
      "rollback_on_close": 1,
      "state_store": {
        "participant_localhost_ip": "",
        "mysql": {
          "host": "",
          "user": "root",
          "password": "",
          "db": "test",
          "port": "3312",
          "socket": null

Configuring the state store in the plugin's configuration file is straightforward, see above. Note the participant_localhost_ip setting. As explained, the state store will hold records of all participants. If a participant is to be recorded that has a host name of localhost, then localhost has a special meaning only to the host that writes the record. Should, later on, garbage collection be required and be run on a different host than the one that wrote the record, the garbage collection needs to know the actual IP associated with the original localhost entry.

  "myapp": {
    "xa": {
      "rollback_on_close": 1,
      "state_store": {
        "participant_localhost_ip": "",
        "mysql": {
          "host": "",
          "user": "root",
          "password": "",
          "db": "test",
          "port": "3312",
          "socket": null
      "garbage_collection": {
        "max_retries": 3,
        "probability": 0,
        "max_transactions_per_run": 100

With a state store configured, you can also change the defaults for the background garbage collection. The probability setting determines how often the garbage collection is invoked during RSHUTDOWN. Allowed values range from 0...1000. During RSHUTDOWN, the plugin calculates a random value between 1...1000. If the probability setting is greater than or equal to the random value, the background garbage collection starts. max_retries and max_transactions_per_run have been explained above and are also listed in the manual.
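The decision can be mirrored in a few lines of PHP. This is an illustration of the described logic only; the real check happens in the plugin's C code during RSHUTDOWN.

```php
/* Sketch of the RSHUTDOWN decision; the real check lives in the plugin's C code. */
function should_run_gc($probability /* configured value, 0...1000 */)
{
    $random = mt_rand(1, 1000);     /* the plugin draws from 1...1000 */
    return $probability >= $random; /* e.g. 500 gives roughly a 50% chance */
}
```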

Testers wanted

The XA related code has been lab tested, however, I do not plan to call it anything more than beta prior to hearing back from actual users. As you saw, the garbage collection is a rather tricky piece of code. It should work, give it a try… and let us know whether it works.

Happy hacking!

@Ulf_Wendel Follow me on Twitter

PECL/mysqlnd_ms: Distributed Transaction/XA support coming, e.g. for MySQL Fabric?

2014/06/13 - by admin - 4 Comments

The development version of PECL/mysqlnd_ms now has a very early version of distributed (2PC/XA) transaction support tailored for use with clusters of MySQL servers. XA transactions can span multiple servers in a cluster to ensure transaction guarantees among them. In real life there is always, at some point, a sequence of work on your data for which atomicity, consistency, isolation, and durability must be given. The early generation of NoSQL sharding solutions has a weak spot here. How do you, as a developer, ensure a logical operation affecting two shards is either applied to both or to none? Either you don't at all, or you hack and pray, or you start reading lecture books on transactions, or you grow your documents to avoid cross-shard operations. MySQL Fabric, our own sharding solution, has a gap here too. However, three new API calls do the troublesome part for you: mysqlnd_ms_xa_begin(), mysqlnd_ms_xa_commit() and mysqlnd_ms_xa_rollback().

Any sharding cluster: consistent work across shards?
Any task involving more than one shard, for example: copy customer information into an order and update the order counter stored with the customer in one step. Or: update the email address of a user in all places – forum users, orders, customers.
Shard 1 – Primary: customers (US residents only), forum users (all)
Shard 2 – Primary: customers (rest of the world), orders (all)
Copies flow between the shards in both directions.

Massive news but where are we on distributed transactions?

The current XA support of the PHP Replication and Load Balancing plugin for mysqlnd is hardly any good – yet! Still, we sort of have a tradition of explaining pain points and feature gaps early. MySQL 5.6/5.7 is on track learning NoSQL lessons (protocol and query language flexibility, nested data, replication experience), MySQL Fabric is now GA and brings us sharding and automated high availability outside MySQL Cluster. Fabric abstracts managing shards, shuffling data around during resharding and telling a client where to find data. Also, the core server improves, and some massive pain points from when we started (properly detecting transaction boundaries, reliably finding up-to-date replicas) are being addressed.

But: distributed transactions… Let's keep the focus narrow and talk XA only; let's forget about cross-shard joins here. Even if the PHP MySQL APIs had proper XA support (and, of course, it will be so in no more than two hours ;-)), we would hit MySQL limitations. Hence, time to start barking.

How it shall look in code… one day

That warning given, let’s talk code. For the “copy customer information into order and update order counter stored with customer in one step” task from the introduction, your code looks like this, if you want transaction guarantees:

/* Connect to cluster: mysqli, PDO_MySQL, ... whatever */
$link = mysqli_connect('my_sharding_cluster', ...);

/* Begin distributed transaction */
mysqlnd_ms_xa_begin($link, $xa_id);

  /* Do something on your shards */
  mysqlnd_ms_fabric_select_shard($link, "shop.customers", "us_123");
  $link->query("SELECT * FROM customer 
                        WHERE key = 'us_123'");
  $link->query("UPDATE customer SET 
                  order_counter = order_counter + 1 
                   WHERE key = 'us_123'");
  mysqlnd_ms_fabric_select_global($link, "shop.orders");
  $link->query("INSERT INTO order(...) VALUES (...)");

/* End distributed transaction */
mysqlnd_ms_xa_commit($link, $xa_id);

Remove mysqlnd_ms_xa_begin() and mysqlnd_ms_xa_commit(), and you go without distributed transactions. Your choice! Should there be a series of failures at the right time, you get logically inconsistent data without the distributed transaction. With MySQL, individual actions will still be consistent, as we default to InnoDB these days. InnoDB is nowadays faster than the non-transactional MyISAM from the ancient days of MySQL, and very much so under load.

A word on XA/2PC to illustrate the convenience PECL/mysqlnd_ms brings

The XA specification has 94 pages. The (My)SQL world folds this into six XA related SQL commands. XA follows the two-phase commit protocol. The protocol operates in two rounds and has two distinct players. The players are a coordinator and the participants. In the first phase the coordinator asks all participants whether they are ready to commit some previous work carried out on them.

Two-phase commit (2PC) in XA: first phase
Coordinator –> Vote request –> each participant
Coordinator <– Pre commit <– each participant

A participant that replies to the coordinator's vote request with a pre commit makes the firm promise that the work done will not be lost even in the case of a temporary failure of itself, e.g. when it crashes. But the pre-committed work does not yet become visible to others. Should all participants reply positively to the vote request, the coordinator sends a global commit, which makes the work visible to everybody.

Two-phase commit (2PC) in XA: second phase
Coordinator –> Global commit –> each participant

Failures of participants are no problem for the protocol. The coordinator can use timeouts to detect unresponsive participants and either retry after the participant is recovered or send a global rollback to the rest. But should the coordinator crash in the course of informing participants of a global rollback or global commit decision, the protocol becomes a blocking one. The participants wait for instructions.

Two-phase commit (2PC) in XA: blocking protocol
Coordinator –> Global commit –> first participant, then the coordinator crashes
First participant: committed
Remaining participants: uncommitted, waiting for global commit or rollback

If you call mysqlnd_ms_xa_begin(), PECL/mysqlnd_ms acts as a coordinator. It must handle all the state transitions of the participants, the possible failures of participants and its own failure as a coordinator.

Should you have become curious about 2PC or the challenges of distributed transactions… MySQL Cluster is using two-phase commit internally (2PC), see also the presentation DIY: A distributed database cluster, or: MySQL Cluster or the really in-depth presentation Data massage: How databases have been scaled from one to one million nodes, which takes you from early distributed NoSQL to the very latest developments.

What PECL/mysqlnd_ms does for you, what it hides, where it's limited

As said, there are six SQL commands related to XA/2PC:

  • Phase 1 related
    • XA BEGIN – mark the beginning of a unit of work on a participant, telling the participant what to vote about
    • XA END – mark the end of the unit of work on a participant
    • XA PREPARE – the pre commit
  • Phase 2 related
    • XA COMMIT – global commit
    • XA ROLLBACK – global rollback
    • XA RECOVER – error handling: list pre commits

PECL/mysqlnd_ms further compresses this to three API calls. The calls hide issuing the SQL commands on the participants and all the steps a coordinator has to perform. Some features are lost on the way, but convenience is won.

The loss of features is not critical: as a general rule we strive to provide you with a way to overrule any automatic action, go down to the lower layers and handle everything on your own. The features you lose are the choice of how to react when participant failures happen, and support for XA participants other than MySQL servers. Upon any participant error the automatic action is rollback. Future versions may lift this limitation; I am blogging about a pre-alpha development version. At this point, I also have not bothered about any XA participants other than MySQL servers. The XA specification describes distributed transactions for any system implementing a certain API. That could be a database server, a web service or a moon rocket. Other RDBMS also feature SQL commands for XA, so that you could have Microsoft SQL Server, IBM's DB2, Oracle, Postgres and MySQL jointly working on a distributed transaction. If you need that, use the SQL commands directly.
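Driving XA by hand with those SQL commands looks roughly like this on a single MySQL participant. A sketch only: the host, credentials, xid value and query are placeholders, and error handling (including XA ROLLBACK and XA RECOVER) is omitted.

```php
/* Hand-written XA on one MySQL participant; host, xid and query are
   placeholders, error handling omitted. */
$participant = new mysqli("some_mysql_host", "user", "password", "test");

$participant->query("XA BEGIN 'trx_42'");
$participant->query("UPDATE customer SET discount = 10 WHERE first_name = 'Ulf'");
$participant->query("XA END 'trx_42'");
$participant->query("XA PREPARE 'trx_42'");  /* first phase: pre commit */
$participant->query("XA COMMIT 'trx_42'");   /* second phase: global commit */
```

With several participants, a coordinator would run XA PREPARE on all of them before deciding between XA COMMIT and XA ROLLBACK, which is exactly the bookkeeping the three plugin calls hide.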

I have had MySQL Fabric in mind, the sharding scenario. Fabric manages farms or clusters of MySQL servers, nothing else. And MySQL should not lose one of its strengths when combined with Fabric: transactions, should you need them.

PECL/mysqlnd_ms as a transaction coordinator

Should you not fear the risk of blocked servers when the coordinator crashes, you can use the new functions straight away.

On mysqlnd_ms_xa_begin(), PECL/mysqlnd_ms will first try to find out whether you are in the middle of a local transaction. Local transactions and global transactions (XA) are mutually exclusive. To detect local transaction boundaries, PECL/mysqlnd_ms monitors all API calls related to them, e.g. mysqli_begin_transaction(), mysqli_autocommit(). It does not monitor SQL commands such as BEGIN TRANSACTION, though. Details are described in the manual. Good news is: mid term the MySQL server will announce transaction boundaries and we can make things bullet proof. Should you not be in the middle of a transaction, it remembers the so-called gtrid (global transaction identifier) given in the second parameter: bool mysqlnd_ms_xa_begin(mixed connection, int gtrid). At the time of writing, gtrid is massively limited – I haven't finished my SQL related C code. That's for sure a temporary limitation.
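The mutual exclusion can be illustrated as follows. A hedged sketch: the exact failure mode and error reporting of a pre-alpha build may differ.

```php
/* Sketch: local and global (XA) transactions exclude each other.
   Exact error reporting of the pre-alpha plugin may differ. */
$link = new mysqli("mysql_sharding_cluster", "user", "password");

$link->begin_transaction();          /* local transaction, monitored by the plugin */
if (!mysqlnd_ms_xa_begin($link, 1)) {
    /* refused: a local transaction is still open */
    $link->commit();                 /* finish the local transaction first... */
    mysqlnd_ms_xa_begin($link, 1);   /* ...then start the global one */
}
```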

Then, as you continue, PECL/mysqlnd_ms transparently injects XA BEGIN gtrid on every shard/node that you run a query on. When doing so, it also calculates a bqual (branch qualifier) but does not use it yet – for the same reason as above: lazy me, early blogging…

Built-in garbage collection to cover coordinator crashes

I assume most users talking transactions want to see a coordinator crash covered too. This is done by optionally recording all state changing actions in a persistent store that survives a crash. Should the coordinator – your PHP process/script – crash, another process/script can do a garbage collection run to get the participants into a defined state: rollback, clean up.

Garbage collection is built in. It is done the traditional way: wait for the next PHP process/script to begin its work and run the garbage collection based on a probability value – very much like PHP sessions or PECL/mysqlnd_qc (the query cache plugin) do it. An additional timeout can be configured to decide whether a recorded global transaction and its coordinator are still active or should be considered crashed. Had I invested more brain power already, maybe the timeout would not be needed. Key is: it's automatic.

Note that for the garbage collection to work, one needs to record information on how to connect to the participants. Recording of user name and password has been made optional, and the special meaning of localhost has been thought of. Should the GC be run on a different host than the one that wrote a record with host=localhost, we need to know which IP localhost refers to. Some things are covered and tested – but, this is pre-alpha…

The plugin is designed to allow any data store as a backing store, but only one storage backend is implemented: MySQL. MySQL is transactional, survives crashes and, after all, we are talking about managing MySQL farms here.

mysqlnd_ms_xa_commit(), mysqlnd_ms_xa_rollback() – note the limitation

Upon mysqlnd_ms_xa_commit(), the plugin becomes chatty. It tells all participants that the work is done (XA END) and gathers pre commit replies (XA PREPARE) to decide on global commit or global rollback. Should all participants give a positive reply, it sends a global commit (XA COMMIT), as requested, and mysqlnd_ms_xa_commit() indicates success.

The emergency brake is pulled in case of any error with the participants: roll back as much as we can, leave any possible rest to the garbage collection. There is currently no way for the user to react to low-level errors and, for example, do a reconnect to a participant. To handle such cases, the user would need to handle the connections to the participants themselves. Any API for doing this would be more complex than just doing everything yourself using SQL and plain vanilla connect()/query() calls! (These lines are being written in early June 2014 – check the manual for updates should you read this later. Some tweaks are likely to happen.)

The function mysqlnd_ms_xa_rollback() also ends the XA transaction on the participants but always sends out a global rollback.

The sad truth: not yet with Fabric, plugin refactoring needed…

Before I can close and move a black jack to the overloaded server folks, well, yeah: it doesn’t work together with Fabric yet. Our fault.

The reason is an implementation detail of PECL/mysqlnd_ms which is not visible to you. The plugin does not have any dedicated connection management module. We keep a list of all servers of a cluster. When we choose a server for you – say the read-write splitter tells us to use a master – we open a connection and associate it with that server list. When you use Fabric and call mysqlnd_ms_fabric_select_shard()/mysqlnd_ms_fabric_select_global(), as in the example from the beginning, the whole list is replaced. With it, the connections are gone. If there was a participant connection in the list… *ouch*. This is also the reason why I think the state alignment feature is broken when using Fabric.

Fixing this, using a simple option, takes days. However, Andrey, Johannes and I need to find a time to discuss our options. Documenting XA, which I have not done yet, also takes days.

The sad truth: our server has limitations too…

Any server change takes way more time for various reasons. The server is more complex, server code plays in a different league, there is little chance of saying we just did it that way, server developers are a rare species and the server has a different release cycle. The server manual reads:

If an XA transaction has reached the PREPARED state and the MySQL server is killed (for example, with kill -9 on Unix) or shuts down abnormally, the transaction can be continued after the server restarts. However, if the client reconnects and commits the transaction, the transaction will be absent from the binary log even though it has been committed. This means the data and the binary log have gone out of synchrony. An implication is that XA cannot be used safely together with replication.

In other words: if you don’t mess around, if you prefer the emergency brake approach – rollback – as PECL/mysqlnd_ms does, over any attempt to recover, there is a chance it works despite not being bullet proof. I can only hope any server plugin toying with XA keeps an eye on this. The manual then goes on:

It is possible that the server will roll back a pending XA transaction, even one that has reached the PREPARED state. This happens if a client connection terminates and the server continues to run, or if clients are connected and the server shuts down gracefully. (In the latter case, the server marks each connection to be terminated, and then rolls back the PREPARED XA transaction associated with it.) It should be possible to commit or roll back a PREPARED XA transaction, but this cannot be done without changes to the binary logging mechanism.

Again, make the emergency brake policy your default, and it may become acceptable. At least for coordinators and implementations tailored for MySQL Fabric. If these are all the restrictions, and no major bug is lurking around, then it’s time for: how to use distributed transactions with MySQL Fabric sharding ;-).
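To make the emergency brake policy concrete, here is a minimal sketch of how an XA transaction looks from PHP with the plugin’s mysqlnd_ms_xa_*() calls. The configuration section name "myapp", the credentials and the table names are made up for illustration; this assumes PECL/mysqlnd_ms 1.6 built with XA support.

```php
<?php
/* Sketch only: "myapp" is an assumed mysqlnd_ms config section,
 * table names are made up. Requires PECL/mysqlnd_ms 1.6 with XA support. */
$link = new mysqli("myapp", "user", "password", "database");

$gtrid = 1234;  /* global transaction identifier */
if (!mysqlnd_ms_xa_begin($link, $gtrid)) {
    die(sprintf("XA begin failed: [%d] %s\n", $link->errno, $link->error));
}

/* every server the plugin picks from now on joins as an XA participant */
$link->query("UPDATE shard_one SET counter = counter + 1");
$link->query("UPDATE shard_two SET counter = counter + 1");

if (!mysqlnd_ms_xa_commit($link, $gtrid)) {
    /* emergency brake: global rollback on all participants */
    mysqlnd_ms_xa_rollback($link, $gtrid);
}
?>
```

No recovery attempt is made here: on a failed commit the sketch falls straight back to a global rollback, matching the plugin’s policy described above.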

For proper XA support, as it may be needed by JDBC, these restrictions could be a deal breaker.

Happy PHP hacking!

@Ulf_Wendel Follow me on Twitter

Background: The PHP replication and load balancing plugin

Happy birthday! Three years ago we pimped all the PHP MySQL APIs to make using any sort of MySQL cluster easier. We began developing a plugin which does all the annoying jobs that clusters may burden developers with: transaction aware read/write splitting if needed, load balancing with the option to assign weight to a replica, connection state alignment when switching servers, automatic client failover, user defined filters/pipes for replica selection, an abstraction for requesting eventual consistency, session consistency (defined as read-your-writes) and strong consistency with a single API call. That is 90% of the infrastructure one needs to get all out of MySQL Fabric. But, MySQL Fabric support is still in the works.

All these things are done on the client side to scale by client. Unlike with basic proxy solutions there is no single point of failure, as a single client’s failure does not affect the rest.

Yet, if used for nothing but read-write splitting, the ease of use is not compromised. It boils down to installing the plugin and replacing the host name in your API’s connect call mysqli_connect("my_mysql_server", ...) with the name of an entry from the plugin’s configuration file: new PDO("mysql:section_from_ms_config", ...). And, as you dig deeper into massive database clustering, you will soon realize how powerful it is to have clients supporting a specific cluster design, for which you just don’t want an all-automatic solution.
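For illustration, a minimal plugin configuration file could look like this (the section name "myapp" and the host names are made up; the format follows the plugin’s JSON configuration):

```json
{
    "myapp": {
        "master": {
            "master_0": {
                "host": "localhost",
                "socket": "/tmp/mysql.sock"
            }
        },
        "slave": {
            "slave_0": {
                "host": "192.168.2.27",
                "port": "3306"
            }
        }
    }
}
```

With mysqlnd_ms.enable=1 and mysqlnd_ms.config_file pointing at this file, a plain mysqli_connect("myapp", ...) is load balanced over the listed servers.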

PHP Unconference Europe, a legend. But lacking PHP…

2014/05/21 - by admin - 0 comments

A legend is born: PHP Unconference Europe, Majorca. Nearly one hundred witnesses exist. The #phpuceu combines the proven concept of an unconference with the beauty of the island of Majorca. Within a day after the end of the event, one third had bought a ticket for next year’s edition (May 9-10, 2015)! We have known for years that unconferences work. We finally have one at a place that is easy to get to and just about perfect for any kind of visitor. Goal accomplished. Just one detail, PHP (partner happiness program), was missing…

The power of the people

Unconferences give power to the people. Participants gather in the morning to propose sessions and vote on an agenda for the day. Sponsors and participants never know what road will be taken. However, with a solid number of PHP experts and frequent speakers traditionally being around, the worst that can happen is that they are asked to repeat one of their usual talks… I have visited some ten PHP Unconferences in the past years. Usually, even heavy-weights sometimes have issues getting enough votes for a slot because the other proposals are so hot! Here are some of this year’s sessions:


What kind of session it is, is again up to the audience! Johannes Schlueters got it right and gathered with followers in the lobby area when no room was available. Kristian Koehntopp tortured listeners with a spontaneous drawing on a tissue of how to get benchmarking right… Long, long breaks for discussions and a need to become active are part of the game.

The soft factor: Majorca – it’s not expensive

I must say, I am surprised how well the visitors behaved. The no-show rate on the first day was as little as three percent. This is absolutely amazing. Recall that a flight to Palma can cost you no more than a train trip through your own country. Round-trip flights to many destinations are around 150-200 Euros. A train ticket from Kiel in the north of Germany, where I live, to Munich in the very south and back is 210-280 Euros. And, #phpuceu sponsor trivago reports 210 Euros for a single room over the weekend in Munich, which is an exact match for Palma.

At the same cost as the national PHP Unconference in Hamburg, I get a weekend vacation in the sun! Still, people showed up early on Sunday morning with a smile on their faces. They knew yummy catering (tapas) was on the agenda for lunch.

If everybody is relaxed, discussions go smoother. And exchanging ideas, getting connected is what it is all about: whether you wanted to bother us MySQL folks, ask Zend about the green jumbo elePHPant, hack PHP or squeeze out long-term project management consultants… all up to you. Or, how about table soccer to break the ice?

The smile factor: food, sun and partying

Compared with a national unconference, say one in Munich, Palma has more party locations and better drinks to offer. The 1-litre beer monster, which goes stale before it reaches you in a huge umpfda-umpfada hall, is available. If you really want that… see the photo.

But then, there is also the other world of smaller restaurants, with the usual Twitter or Facebook comments going like: “Relaxing after the #PHPucEU” or “Nachklang zur #phpuceu an klasse Essen und Wein” (“reminiscing about #phpuceu over great food and wine”), or this one http://click-to-read-mo.re/p/7dr8/5321aee8… The fine dining or the simple yet tasty local food, the vacation feeling, the beaches, the old town…

… hiking in the hills, cycling or riding a scooter. Quite a few decided it would be a good idea to stay a day longer for some nice time on the island in late spring. This has a very positive effect on the unconference itself. There is less of a rush on Sunday. People stay for an additional night and the last session of the day is almost as popular as the first one was.

The trained elePHPant

On the first day the organizers (Judith Andresen, Fabian Blechschmidt, Jonathan Maron, Karl Spies) announced a photo competition. Zend had given them one of the big green elePHPants for the winner. The task was to take a photo of the elePHPant going on safari during the event. The winning photo is…


… and the winner got a hug from one of the organizers, after the latter had stopped biting the elePHPant! Something that obviously irritated the new shepherd of the elePHPant.

My own elePHPants, well prepared and trained for the trip, became friends with him. However, I think the big one is not quite the athlete the smaller ones are.

PHP – Partner happiness program

This was a wonderful event: work and fun combined! Despite the fun, everybody needs measurable results to justify their investments. Search Twitter, Facebook, Google+ or blog posts for opinions, if you have not participated and want to know if all was green – not just the elePHPant.

Of course, sponsors do the same. For sponsors it is not all about getting a new hire or immediately winning a big customer. These events have a lot to do with learning what really matters to the crowd. It is about reaching out to key people who can act as multipliers. And, it is about fostering the community. Obviously, many sponsors have been very pleased with this year’s results. AFAIK, some announced their strong interest to come back in 2015, or even gave firm promises. Let’s shout THANK YOU at:

PHP is missing! Spouses and family welcome!?

For next year, it would be perfect if there was a partner happiness program – PHP for short. Ideally, it would be run as an unconference. All spouses gather in the morning to propose something and vote on their plans for the day. I would be fine with that as long as they don’t start stealing sponsors, or run photo contests about the best looking new potential partner or the like…

Again, 200 reasons to go there…


Happy hacking!


Train your elePHPants for the PHP Unconf EU – Palma de Majorca!

2014/05/01 - by admin - 1 Comment

Majorca! May 17-18, 2014! Could there be a better place and time on earth for the PHP EU Unconference: sun, fun, plenty of cheap flights, affordable accommodation? The PHP EU Unconference (tickets) is the international version of the national unconference. The national one frequently sells out quickly (>350 tickets), not only because it matches other conferences despite a ticket price of only ~40 Euros. Traditionally, the international one is more cosy, with about 100 elePHPant lovers coming in the years before.

Palma de Majorca

A schedule that really matters – with a solid fallback…

The key to success is the very nature of an unconference. There is no call for papers and no program committee that makes more or less educated guesses about what the audience might want to see. There is no risk of a schedule dominated by sponsor talks. It is the very audience that gathers every morning to decide on the schedule for the rest of the day. This works amazingly well: only topics that really matter to the audience have a chance to pass the voting. And, with many heavy-weights and frequent speakers being around, the worst that can happen is that they give a talk. The 2716th “What’s new in PHP”, the 1872th “MySQL – my love” and the 999th “PHPUnit” talk are available at your convenience, but only if there is nothing hotter than that ;-). That said, here’s a random list of well known nerds said to be coming in two weeks: Sebastian Bergmann, Johannes Schlueter, Stefan Priebsch, Kore Nordmann, Lars Jankowsky, Arne Blankerts, Boris Erdmann, …

Train your #elePHPant for Majorca! #phpuceu Take photos of your #elePHPant and tweet them.

Sun and fun

A day at either the national or the PHP EU Unconference follows the same pattern:

  1. eat and drink: you have breakfast together – your ticket includes catering during the day
  2. contribute: you propose a talk and hope for votes or just vote
  3. contribute: you start talking either in front of others or with others
  4. eat and drink: you have lunch
  5. contribute: continue talking
  6. eat, drink and contribute: party – sleep, goto 1)

All this hard work finally takes place where it should, to ensure a relaxed atmosphere. If you plan an international meeting without knowing where people come from, the travel costs are roughly a constant. So pick any place with a major airport, and pick one that is beautiful in May! Majorca is a perfect choice.

Flights are available, still

Being a popular party location, many low-cost flights are available. Round-trip flights from Paris/London/Berlin to PMI are still available starting at ~120 Euros. During JS Unconf, I heard of people sharing holiday homes (including WiFi :-)) for very low rates. If you have deeper pockets you may prefer the unconference hotel. Check for a good offer. My wife and I booked two days ago at a very reasonable rate.

If you consider staying in Palma for some extra days, you may well run into other Unconf visitors. Those I spoke to who said they consider spending some extra days in Majorca tend to arrive no earlier than May 16 but to stay beyond May 19.

Train your elePHPants!

When I told my elePHPant that we had finally booked the trip to Majorca, he went crazy. He insisted on being trained for life on sunny beaches. I hope he will have the opportunity to get to know many new friends. Go share photos of how you prepare your elePHPant for sun, fun and the PHP EU Unconf with @phpuceu.

And, while you are at it, think about session proposals. At an unconf you are not limited to giving a talk. There is a time slot waiting to be filled with whatever gets enough votes. You may present, discuss, entertain, train elePHPants, run an elePHPant meetup…


Happy hacking!


JS Unconf Hamburg 2014

2014/04/28 - by admin - 2 Comments

What happens when you take a well known recipe for unconferences, a group of young, motivated and talented chefs, and some 300 attendees is in the photos below. Some call it the JavaScript Unconference Hamburg (JS Unconf). Some say they had seasons in the sun. All I know is, I’ve been there and will come again!


The menu delivered by the chefs is as tasty and sophisticated as that of a commercial conference, but it does not cost you a fortune. For some 40 Euros you get two days packed with talks, intensive discussions and nice fellows. It is the people that drive an unconference, and that becomes obvious the very moment you get involved. People know their jobs, they know your problems as they have experienced the same. People have something to share and they want to share it with others!

How it works

Like the PHP Unconf Hamburg, the JS Unconf takes place in a building of Hamburg university in the heart of Hamburg. On Saturday morning everybody gathers in or around the “bath tub” in the lobby area to propose sessions. Then, voting starts to find the lucky ones that get to run their discussions, talks or whatever has been proposed. Six lecture rooms that fit 50 to 250 people each are available, and there are up to four rounds of sessions per day. Despite the large number of slots it is not easy to get elected – the level is high. And, there is no program committee that could prefer a big name or a sponsor over a simply brilliant session. Only what’s hot to the very audience on the very day gets elected: AngularJS, Hoodie, React, …

Being unfamiliar with the JavaScript celebrities, I cannot give a list of big names spotted at the event. I managed only to identify one of the uber-names in Jan Lehnardt (CouchDB, Hoodie), a solid group of well known PHP experts that also do JavaScript work, and the all-mighty Kristian Koehntopp (working with a group of ’40 sufficiently crazy nerds’). However, my first, random click on the program and slides shows that the speaker has once written a book. Need I say more about how these ‘small’ events compare to the ‘big’ ones? Small means some 300 tickets sold, 300 arguments to join next year… get your ticket in time, it was sold out.

New: lightning talks

The fresh group of JS Unconf chefs has spiced up the event with a change: lightning talks. It is a clever change inspired by two observations. First, some topics and thoughts worth sharing and presenting do not fill a 45-minute slot. Also, giving a 45-minute show takes some serious time to prepare, depending on the topic. Second, many people have to travel back early on Sunday afternoon, which means they are forced to leave in the middle of a talk.

The talks took place in the “bath tub” with everybody enjoying the show and leaving whenever need be. They absolutely delivered what you expect at an Unconf: ranging from requests for help, to presenting a thought, to showing how to monitor your power consumption using a Raspberry Pi, toys and stuff! Just watching how people use a 12” notebook to share slides when no beamer is available was worth it.

It just works…

I have a hard time saying much about the event as it just worked! The organizers Robert Kowalski, Robert Katzki and Robin Drexler had help from Hinrich Sager and Ekkehard Dörre. The latter two have helped run numerous PHP Unconferences at the same place before. Consequently, everything was running smoothly as usual, yet with nice tweaks from the new chefs. They just do it right. They just don’t run out of beer, their party is where all the parties take place, it is easy to get to the event, beamers and WiFi work, the food is tasty and healthy, they have weird soft drinks such as rhubarb, they serve first class fair trade coffee, people are relaxed and friendly – they learned their lessons in the years before… More of this, please!

Credits to sponsors!

No such event, no such low ticket price without the sponsors. Thank you, …

Happy hacking!


29/07/2014 – Added official figures: 342 tickets sold, 294 appeared on Saturday.

PHP mysqlnd memory optimizations: from 49MB to 2MB

2014/04/10 - by admin - 2 Comments

Inspired by Antony, Andrey has implemented a memory optimization for the PHP mysqlnd library. Depending on your usage pattern and the actual query, memory used for result sets is lower and free’d earlier, to be reused by the PHP engine. In other cases, the optimization will consume about the same or even more memory. The additional choice is currently available with mysqli only.

From the network line into your script

Many wheels start spinning when mysqli_query() is called. All the PHP MySQL APIs/extensions (mysqli, PDO_MySQL, mysql) use a client library that handles the networking details and provides a C API to the C extensions. Any recent PHP defaults to using the mysqlnd library. The library speaks the MySQL Client Server protocol and handles the communication with the MySQL server. The actions behind a user’s mysqli_query() are sketched below.

The memory story begins in the C world when mysqlnd fetches query results from MySQL. It ends with passing those results to PHP.

(Diagram: PHP script (*.php) → mysqli extension/API (*.c) → mysqlnd library → MySQL; the result set is returned along the same chain.)

The new memory optimization is for buffered result sets as you get them from mysqli_query() or a sequence of mysqli_real_query(), mysqli_store_result(). With a buffered result set, a client fetches all query results into a local buffer as soon as they become available from MySQL. In most cases, this is the desired behaviour. The network line with MySQL becomes ready for a new command quickly. And, the hard to scale servers is offloaded from the duty to keep all results in memory until a potentially slow client has fetched and released them.
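The two ways of obtaining a buffered result set mentioned above are equivalent. A minimal sketch, assuming a MySQL server on localhost and a table test:

```php
<?php
/* Sketch: assumes a MySQL server on localhost and a table "test" */
$link = new mysqli("localhost", "root", "", "test");

/* one step: query() buffers the complete result set on the client */
$res = $link->query("SELECT id FROM test");

/* two steps, same effect: send the statement, then buffer the rows */
$link->real_query("SELECT id FROM test");
$res = $link->store_result();

/* either way, the server-side resources are released as soon as the
 * buffering returns; fetching then reads from the local buffer */
?>
```

The two-step variant matters here because store_result() is where the new copy option can be requested, as shown later in this post.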

The result buffering happens first at the C level inside the mysqlnd library. The buffer holds zvals. A zval is the internal representation structure for a plain PHP variable. Hence, think of the mysqlnd result buffer as a list of anonymous PHP variables.

(Diagram: the mysqlnd library buffers the rows as zvals (MYSQLND_PACKET_ROW) – think: PHP variables.)

The default: reference and copy-on-write

When results are to be fetched from the mysqlnd internal buffers to a PHP script, the default behaviour of mysqlnd is to reference the internal buffer from the PHP script. When code like $rows = mysqli_fetch_all($res) is executed, first $rows gets created. Then, mysqlnd makes $rows reference the mysqlnd internal result buffers. MySQL results are not copied initially. Result set data is kept only once in memory.

(Diagram: after $rows = mysqli_fetch_all($res), each rows[n] references result_buffer_zval[n] inside the mysqlnd buffer.)

The reference approach works fine if you have code that follows the general pattern: query(), fetch() followed by implicit or explicit unset(), free().

A call to query() fills the internal buffer. Then, all rows are fetched into PHP variables. You may fetch all of them at once or read them row by row into the same array using a pattern such as while ($row = $res->fetch_assoc()) { ... }. As long as you do not modify $row and free $row explicitly (may happen implicitly by overwriting them in a loop), no data will be copied. Then, as the last step you call free() to dispose the buffered result set.

$res = mysqli_query(...);               /* create internal buffer */
$rows = $res->fetch_all(MYSQLI_ASSOC);  /* rows &= internal buffer */
unset($rows);                           /* remove references */
$res->free();                          /* free internal buffer */   

The memory saving by using references is gone, if mysqlnd is forced to perform a copy-on-write. If you free the result set prior to freeing $rows, then mysqlnd is forced to copy the data into $rows before it can free the internal buffer. Otherwise, $rows points to nowhere.

$res = mysqli_query(...);               /* buffered results */
$rows = $res->fetch_all(MYSQLI_ASSOC);  /* rows &= internal buffer */
$res->free();                           /* free internal buffer: copy to rows */
unset($rows);                           /* free copied results */

The copy_on_write_saved and copy_on_write_performed statistics tell you what your code does.
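A sketch of how one might read those statistics, assuming PHP was built with mysqlnd and mysqlnd.collect_statistics=1 (the default):

```php
<?php
/* mysqli_get_client_stats() reports process-wide mysqlnd statistics;
 * sample them before and after the code you want to inspect */
$before = mysqli_get_client_stats();

/* ... your query/fetch/free code goes here ... */

$after = mysqli_get_client_stats();
printf("copy_on_write_saved:     %d\n",
       $after['copy_on_write_saved'] - $before['copy_on_write_saved']);
printf("copy_on_write_performed: %d\n",
       $after['copy_on_write_performed'] - $before['copy_on_write_performed']);
?>
```

A rising copy_on_write_performed count is the hint that your code frees result sets before releasing the fetched rows, forcing mysqlnd to copy.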

When I say that memory is free’d, I mean that it is given back to the PHP engine. The PHP engine may or may not release it to the system immediately. If you are just after monitoring how mysqlnd behaves, check the statistics, or go for huge result sets to ensure the garbage collection kicks in and makes the effects visible.

The price of copy-on-write: management overhead

Copy-on-write does not come for free: mysqlnd must track the variables that reference the internal buffer. This adds some memory overhead (zval**). And, the internal tracking list must not point to nowhere. As long as the tracking list exists, the referenced user variables – here: the elements of $rows – must not be released. There is kind of a circular reference. The resulting overhead is twofold: there is an additional tracking list, and the references from elements of $rows might occupy memory until free() is called. But, you save holding query results twice in memory. So the question is: what takes more memory, and when?

(Diagram: with the reference approach, $rows = mysqli_fetch_all($res) makes mysqlnd assign rows[n] &= result_buffer_zval[n], remember that rows[n] belongs to result_buffer_zval[n], and increase the rows[n] reference counter.)

The answer depends on your code and the size of the result sets. If you fetch at most a few thousand rows on your one PHP server, it does not matter much how mysqlnd manages the result sets. The impact of this implementation detail is small, and it takes time-consuming fine tuning to find the optimum; it is not worth optimizing. But for Antony it mattered: likely, he handles PHP at scale…

New: classical copy as an alternative

At times it is more efficient to bite the bullet and copy the data from the internal buffer into user variables. It may sound counter-intuitive, but this can help to ‘save’ memory. ‘Save’ must not only be measured in absolute terms at a given point in time. That is what the figures below show, as it is easy to grasp. When done reading, ask yourself what it means to release memory early during the execution of one or many scripts; think of memory usage over time.

(Diagram: with the copy approach, $rows = mysqli_fetch_all($res) makes mysqlnd copy result_buffer_zval[n] into rows[n] – no references are kept.)

Every optimization is a trade-off: when not to COPY

The new always-copy policy can be enabled using mysqli_store_result(MYSQLI_STORE_RESULT_COPY_DATA). Let’s consider a case that seems to prove one cannot ‘save’ memory by duplicating data. The code runs a single SELECT that returns a result set of 200,000 rows. Anything big enough to make the basic principles visible through memory_get_usage(true) works. Then, all rows are fetched using fetch_all() and the result set is released.

$link = new mysqli("", "root", "", "test", 3307);
$link->real_query("SELECT * FROM test LIMIT 200000");
$res = $link->store_result(MYSQLI_STORE_RESULT_COPY_DATA);
$rows = $res->fetch_all(MYSQLI_ASSOC);

The diagram shows the memory usage reported by memory_get_usage(true) immediately after the function calls. The default behaviour is to make $rows reference (red, NO COPY) the internal buffer. The blue line shows the effect when $rows gets populated with copies. The initial memory usage of the reference approach is lower until free() is called. Now copy-on-write must be done and either approach uses the same amount of memory.

BTW, don’t get nervous about the huge figures. This is what happens if you create large PHP arrays…

fetch_all() + STORE_COPY: memory

Function runtimes measured with microtime(true) are as expected. The reference approach is a tad faster when $rows is to be populated but has some hidden costs at free() when the internal buffer reference lists are to be checked.

For this very script, with this very result set, version of MySQL and PHP and notebook, there was no significant difference in total runtime. Only a minor tendency of MYSQLI_STORE_RESULT_COPY_DATA being ever so slightly slower became visible. This meets expectations due to the additional memory copies and allocations.

fetch_all() + STORE_COPY: runtime

Counter example done, time to show the potential win

Copying result sets is no significant win when you have many rows and use fetch_all(), when you have few rows at all (point SELECTs, paging with LIMIT, a few hundred rows overall), or if you have few large rows with, for example, BLOB columns. In those cases, there is no need to bother about the flavour of the mysqlnd result set handling.

The code to try the copy optimization with should:

  • fetch many, rather small rows
  • and, should not use fetch_all()
  • or… is sloppy with free() (bad style, use mysqlnd statistics to identify such scripts)

Replace the fetch_all() with while ($row = $res->fetch_assoc()) in the 200k rows result set example and you immediately see the potential of Antony’s proposal:

$link->real_query("SELECT * FROM test LIMIT 200000");
$res = $link->store_result(MYSQLI_STORE_RESULT_COPY_DATA);
while ($row = $res->fetch_assoc()) {
    /* ... */
}
$res->free();

Peak memory reported for this code is 1808 KB (~2 MB) versus 50368 KB (~49 MB) for the referencing logic! But the runtime is about 10% slower. As with all such benchmarks: run your own! I am showing trivial microbenchmark observations only to highlight the principles behind them. I did not even bother to align the negative values reported by memory_get_usage().

fetch_all() + STORE_COPY: proper usage, memory

The lower peak memory usage is because the memory used for $row can now be released as the result set is being iterated.

Pseudo code to illustrate the copy logic during a fetch loop:

store_result(): create internal_row_buffer[]
  • memory: internal_row_buffer: n rows

$row = fetch_assoc(): create $row, copy internal_row_buffer[0] into $row
  • memory: internal_row_buffer: n rows
  • + copy of internal_row_buffer[0]
  • + zval $row

$row = fetch_assoc(): free $row, copy internal_row_buffer[1] into $row
  • memory: internal_row_buffer: n rows
  • – copy of internal_row_buffer[0] (previous contents of $row)
  • + copy of internal_row_buffer[1]
Obviously, with the copy technique, there is no reference from the mysqlnd library to $row because mysqlnd does not have to bother about copy-on-write and the like. $row contains a copy of the actual data from MySQL and no longer points to the mysqlnd internal result buffer. The table below tries to illustrate the situation when data copies are avoided and references are used:

Pseudo code to illustrate the reference logic during a fetch loop:

store_result(): create internal_row_buffer[], internal_reference_list[]
  • memory: internal_row_buffer: n rows
  • internal_reference_list: 0 entries

$row = fetch_assoc(): create $row, $row = &internal_row_buffer[0]; internal_reference_list[0] = $row;
  • memory: internal_row_buffer: n rows
  • internal_reference_list: + 1 entry (first $row)
  • + zval $row

$row = fetch_assoc(): $row = &internal_row_buffer[1]; internal_reference_list[1] = $row;
  • memory: internal_row_buffer: n rows
  • internal_reference_list: 1 + 1 entries (first, second $row)
  • zval $row (first) + zval $row (second)

n-th $row = fetch_assoc():
  • memory: internal_row_buffer: n rows
  • internal_reference_list: n entries (all $row)
  • n * zval $row (all)

unset($row); free_result($res): free(internal_row_buffer); free(internal_reference_list)
  • memory: empty

Don’t get too excited. The copy approach is not an ultimate solution. There are many factors to consider: the actual code, the actual statements and the size of their result sets, your server’s demands. And, don’t even try to count bytes: a few hundred or even thousand bytes here and there may not matter much. Write proper code (explicitly calling free()) and check your server’s reaction.


The key takeaway is: try MYSQLI_STORE_RESULT_COPY_DATA if you have result sets with many rows. The copy approach will release memory to PHP for reuse faster, and thus peak memory usage may be significantly lower. Again, be warned that I have chosen my examples carefully to show a hefty difference when doing point-in-time memory measurements. There may be a broad grey area for which you need to consider memory usage over time, and possibly over many scripts, to decide whether copy or reference fits your bill.

Ready for use?

Yes, the code is ready for use! The patch was running in production for a while and I failed to break it with tests too.

A new PHP configuration setting mysqlnd.fetch_data_copy has been introduced. Setting mysqlnd.fetch_data_copy=1 makes mysqlnd use and enforce the copy approach for all buffered result sets.
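In php.ini this is a one-line config fragment (MYSQLI_STORE_RESULT_COPY_DATA remains available for per-call control):

```ini
; enforce the copy approach for all buffered result sets, process-wide
mysqlnd.fetch_data_copy=1
```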

Not for PDO yet

The copy approach is not compatible with PDO_MySQL, though. PDO_MySQL relies on the zval reference logic and must be updated first.

Credits to A + A = A+ ?!

The manual has got a few words on it too, but not substantially more than the above. The changes should bubble through build, staging and mirror servers within a few days.

Happy hacking!
