Recently there was a fault at one of the sites of the hosting provider we use at Telos UK. This caused an unclean shutdown of a couple of our nodeos instances, leaving them in a “Database dirty flag set” state. I attempted to replay from a snapshot file with the existing blocks.log file in place, as described in Replay from Snapshot with Blocks Log in my prior post Nodeos Replays, but this time the replay failed with the following error message in the log:
Error unpacking field new_producers {"field":"new_producers"} nodeos raw.hpp:367 operator()
error unpacking eosio::chain::signed_block {"type":"eosio::chain::signed_block"} nodeos raw.hpp:644 unpack
So, I removed the blocks.log file and tried again. This succeeded, but of course it meant that the blocks required to take us from the snapshot back into synchronisation with the chain would have to be sourced entirely via the P2P (peer-to-peer) network, rather than being partially replayed from the blocks.log file.
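For concreteness, the two recovery paths look roughly like this. This is a sketch only: the data directory path and snapshot file name are placeholders, and the state directory is cleared first because nodeos requires an empty state database when starting from a snapshot.

```
# Assumed layout; adjust to your deployment.
DATA_DIR=/srv/nodeos/data

# Clear the state left dirty by the unclean shutdown.
rm -rf "$DATA_DIR/state"

# Path 1: replay from snapshot with the existing blocks.log in place
# (the approach that failed here with the unpack error):
nodeos --data-dir "$DATA_DIR" \
       --snapshot "$DATA_DIR/snapshots/snapshot-xxxx.bin"

# Path 2: also remove the blocks log and its index, so that every block
# after the snapshot must be fetched over P2P:
rm -f "$DATA_DIR/blocks/blocks.log" "$DATA_DIR/blocks/blocks.index"
nodeos --data-dir "$DATA_DIR" \
       --snapshot "$DATA_DIR/snapshots/snapshot-xxxx.bin"
```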
At Telos UK, we run a main and standby pair of servers in different data centres. Each server runs two instances of nodeos, a block producing node and a kind of proxy node that synchronises with the peer nodes of other Telos block producers and provides our external API service. Our block producing nodes only synchronise in a mesh with our own internal nodes.
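For illustration, the Net Plugin peering in the two configs differs along these lines (hostnames and ports are placeholders, not our real addresses):

```
# Block producing node: peers only in a mesh with our own internal nodes.
p2p-listen-endpoint = 0.0.0.0:9876
p2p-peer-address = proxy-a.internal:9876
p2p-peer-address = proxy-b.internal:9876

# Proxy node: additionally peers with other Telos block producers
# and fronts our external API service.
p2p-listen-endpoint = 0.0.0.0:9876
p2p-peer-address = bp-a.internal:9876
p2p-peer-address = peer1.example-bp.io:9876
p2p-peer-address = peer2.example-bp.io:9876
```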
The snapshot that I used to recover both nodes at the affected site was a couple of months old. I set the synchronisation going late at night and checked on its progress the next morning, about eight hours later. To my surprise, whereas the block producing node had already caught up with the chain, the proxy node had only received about a week’s worth of blocks. In ball-park terms, the block producing node had synchronised at upwards of 180 times the rate at which the chain grows (roughly two months of blocks in under eight hours), versus only about 20 times that rate for the proxy node (a week of blocks in eight hours).
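As an aside, catch-up progress is easy to gauge by comparing the head block time reported by the chain API against the wall clock. A minimal check, assuming the default HTTP endpoint on localhost port 8888:

```
# How far behind is this node? Compare head_block_time with the current UTC time.
curl -s http://127.0.0.1:8888/v1/chain/get_info | jq -r .head_block_time
date -u +%Y-%m-%dT%H:%M:%S
```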
I take care not to include hosts in our P2P configuration that produce Net Plugin errors in our nodeos logs. During catch-up, however, I noted that whereas the log for our block producing node reported nothing other than requests for ranges of blocks, in the proxy node’s log these requests were interspersed with various other Net Plugin info messages which, though I wasn’t confident that I understood them, suggested that all was not well.
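A simple way to compare the two logs, assuming nodeos is logging to a file (the path is illustrative), is to tally the distinct Net Plugin messages with the variable parts such as block numbers stripped out:

```
# Summarise Net Plugin message types, most frequent first.
grep net_plugin nodeos.log | sed 's/[0-9]\+/N/g' | sort | uniq -c | sort -rn | head
```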
I also temporarily made the catching-up proxy node unavailable for incoming P2P connections, to see whether this made any difference to its catch-up speed. It did not.
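One way to do this, assuming the default P2P port 9876, is a firewall rule that drops new inbound connections while leaving the node’s own outbound peering intact (outbound sessions originate from ephemeral ports and so are unaffected):

```
# Temporarily refuse new inbound P2P connections (default port 9876 assumed).
sudo iptables -A INPUT -p tcp --dport 9876 -m conntrack --ctstate NEW -j DROP
# ...and remove the rule again afterwards:
sudo iptables -D INPUT -p tcp --dport 9876 -m conntrack --ctstate NEW -j DROP
```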
This set me thinking. Could it be that:

- The info messages that I saw in the log for our proxy node are indeed, as they appeared to be, indicative that all is not well?
- Dealing with whatever those info messages are reporting explains the poor catch-up performance of our proxy node relative to our block producing node?
So, I set about exploring these questions by synchronising a test node with the Telos testnet. I believe that what I found out may be useful to the wider Telos community and indeed other communities who manage chains based on EOSIO. As always, comments that advance, expand on or correct the learning in this post are most welcome.