Group Replication GCS Troubleshooting (reposted by IT–小哥)



In the last post I shared a simple set of steps to configure a Group Replication setup using SQL commands, plus a few settings in the configuration file. Indeed, it can be simple.  But there are times when the requirements are more involved and the configuration needs more attention.  Maybe the OS environment we use for MySQL setups never affected us until we built a Group like this.

Or maybe the Group Replication plugin simply introduces things we never needed before, such as an extra network port or security requirements we’re not used to.  So let’s look at a couple of these common surprises, so that we can remedy them more quickly.

Security in Group Replication

Group Replication offers a few areas of added security that you may want to use.  One of them is SSL support, both in the Group Communication System (GCS) and in the Recovery process that brings members back online into the Group.  You can read about the SSL features and the configuration involved here.  It’s not something that I’ll blog about right now.  Other areas include the following…

OS Configuration Settings

The MySQL install package for your Operating System distribution may include various security settings that integrate MySQL properly, some of which “may” include SELinux. Others that need to be handled would be firewalls like Iptables.  This is of course assuming that SELinux and/or the firewall are enabled. For both of these there are references in the MySQL Group Replication documentation for changes that “may” need to be addressed in your environment.  Specifically, these topics are covered in the Group Replication FAQ section as we’ll see next.

MySQL GCS and firewalls

The iptables firewall on Linux helps to illustrate how Group Replication is configured. Group Replication requires an additional network port to work.  No specific port is mandated, with one core exception: it can’t be the same port that the MySQL database instance uses (typically 3306 unless changed).  The port you choose should also not conflict with other services defined on your OS (mostly for clarity when observing running services through netstat or the like) or with other services in use elsewhere in your company infrastructure.  You don’t want conflicts to arise, so choose wisely.

The above aside, the firewall on your server likely needs to have this port configured, and instructions for doing so can be found in this FAQ entry for Iptables.

So if your configuration files look like the following, then these details may be needed in your firewall.

Group Replication GCS Configurations:

group_replication_local_address = 'HOST1:16601'
group_replication_group_seeds = 'HOST2:16601,HOST3:16601'

Then your associated iptables or possibly firewalld entries ‘might’ look like this:

Firewall configurations for port 16601 as per the example:

# Handling Iptables for Enterprise Linux 6, CentOS6/etc (according to the MySQL Docs FAQ document)
# look at the current rules in place to see if port 16601 is listed already
iptables -L
 
# if our port is not listed, then add a rule to accept network traffic on port 16601 and save
iptables -A INPUT -p tcp --dport 16601 -j ACCEPT
service iptables save
 
# Handling firewalld for Enterprise Linux 7, CentOS7/etc
# look at the current rules and ports in place (runtime and permanent)
firewall-cmd --list-ports
firewall-cmd --permanent --list-ports
 
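# if our port isn't there then add it permanently to the suitable zone 
firewall-cmd --zone=public --permanent --add-port=16601/tcp
# --permanent only changes the saved configuration, so reload to apply it to the running firewall
firewall-cmd --reload
firewall-cmd --zone=public --permanent --list-ports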
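
Once the firewall rule is in place, a quick check from another member can confirm that the GCS port actually accepts TCP connections. The following is a minimal Python sketch, not something from the MySQL documentation; the function name `gcs_port_reachable` is hypothetical, and the host names and port in the usage note are the example values from the configuration above.

```python
import socket


def gcs_port_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host name and performs the TCP handshake
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # covers refused connections, timeouts and name resolution failures
        return False
```

For example, `gcs_port_reachable("HOST1", 16601)` run from HOST2 should return True only while Group Replication is started on HOST1, because nothing listens on the GCS port otherwise. A successful TCP connection says nothing about the IP whitelist, which is evaluated by GCS after the connection is made.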

MySQL GCS and SELinux

For SELinux with Group Replication, you can also refer to the MySQL Group Replication FAQ Document.  The SELinux notes in the FAQ show the following in the scripted section below.  This setup is needed so that SELinux will allow MySQL to receive traffic across the port you’ve defined for GCS.

SELinux handling for MySQL Group Replication GCS

# SELinux notes as from the MySQL Group Replication FAQ page
# verify the status of selinux
sestatus -v
 
# check which ports MySQL is allowed to use
semanage port -l | grep mysqld
 
# add the port used by your configuration file if it's not already found
semanage port -a -t mysqld_port_t -p tcp 16601
 
# verify the change, output should include both 16601 and 3306 (if using the default)
semanage port -l | grep mysqld
 
# Of course there are more sophisticated ways of handling SELinux, 
# this is a minimal highlight only
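
If you manage several servers, the verification step can be scripted. Below is a minimal Python sketch that checks whether a port appears on the `mysqld_port_t` line of `semanage port -l` output. The function name `port_allowed_for_mysqld` is hypothetical, and `SAMPLE` is an illustrative line that I am assuming matches the usual three-column layout (type, protocol, port list); compare it against the real output on your system before relying on it.

```python
# Illustrative output line (an assumption, not captured from a real server)
SAMPLE = "mysqld_port_t                  tcp      16601, 1186, 3306, 63132-63164"


def port_allowed_for_mysqld(semanage_output: str, port: int) -> bool:
    """Return True if port is listed for mysqld_port_t over tcp.

    Port ranges such as 63132-63164 are treated as inclusive.
    """
    for line in semanage_output.splitlines():
        fields = line.split(None, 2)  # type, protocol, port list
        if len(fields) < 3 or fields[0] != "mysqld_port_t" or fields[1] != "tcp":
            continue
        for item in fields[2].split(","):
            item = item.strip()
            if not item:
                continue
            low, _, high = item.partition("-")
            if int(low) <= port <= int(high or low):
                return True
    return False
```

The text passed in would normally come from running `semanage port -l` yourself and capturing its output; the sketch deliberately does not run the command so that it can be tried without root privileges.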

So by now you’ve possibly made some changes as above. Did you find things that needed to be configured?  Are things working now?

No changes made as they weren’t relevant for a fix? (maybe your firewall is off and SELinux is permissive)

…..if this is the case, then continue to the next section of this post!

IP Address Whitelisting

The other area of security that the MySQL team embarked on in Group Replication was the setup of Network IP Whitelist support.  This is a really interesting feature for a few reasons.

  1. It only pertains to the GCS implementation, which is responsible for all of the Group’s membership awareness and transaction certification consensus, and is generally the intelligence that makes this new capability from the MySQL team work.
  2. Otherwise, it does not impact the MySQL database instances whatsoever.
  3. It involves all currently active and future members by defining their network locations as (implicitly or explicitly) safe.
  4. By default, it’s kind of like an auto-configured firewall for GCS, unless you define it otherwise.

So you’re saying to yourself now…this looks great, how could anything go wrong!  **famous last words**

Let’s explore!

Diagnosing & Solving GCS Communication Problems

So you’ve built your MySQL instances, you have members spanning your cloud platform and then you notice that the Group membership is failing and some nodes are offline from the Group.  You think to yourself…this is odd, I did the exact same setup in our lab environment and everything worked great, what is the difference now?  Plus I know (from above) that SELinux and our firewall are configured properly.

The likely and first suspect to look at, the network!

Where to look first though….the MySQL Error Log.  **huh! you say.**

GCS Logging in the MySQL Error Log

Yes…the error log is useful for trouble-shooting all sorts of things.  I used to have bash alias setups in the past so that I could type 2 characters and immediately be able to view the precious Error Log….source for confirmations of server startup progress, successes, failures, etc.

Specifically though, the GCS team logs all sorts of useful information into it, particularly when the local Group member tries to initialize or join a Group Replication Cluster.  So let’s look at an example where a 2nd member fails to join, and how to diagnose it in the error log.

Log Observations for a Bootstrap Member

What is a bootstrap member again?  It is the first member to initialize a Group Replication cluster.  All subsequent members will utilize the cluster that this member began.  The bootstrap setup can be reviewed again here; the link takes you to the right section of my previous blog.

What do we look for in the error log for a new “bootstrap member” initializing a group?  Let’s look and I’ll comment below:

Initial Bootstrap Member Error Log entries of interest:

# First set of log entries outlines the Group's configuration, defined or defaulted
# The 1st log entry shows the auto-configured whitelist settings
# The 2nd log entry is also very valuable, showing the IP address for the current host's hostname
2017-03-14T17:27:21.487697Z 7 [Note] Plugin group_replication reported: '[GCS] Added automatically IP ranges 127.0.0.1/8,192.168.56.127/32 to the whitelist'
2017-03-14T17:27:21.489565Z 7 [Note] Plugin group_replication reported: '[GCS] Translated 'HOST1' to 192.168.56.127'
2017-03-14T17:27:21.489682Z 7 [Note] Plugin group_replication reported: '[GCS] SSL was not enabled'
2017-03-14T17:27:21.489697Z 7 [Note] Plugin group_replication reported: 'Initialized group communication with configuration: group_replication_group_name: "1a1c5221-fd26-11e6-8e12-1246aeecf2d5"; group_replication_local_address: "HOST1:16601"; group_replication_group_seeds: "HOST2:16601,HOST3:16601"; group_replication_bootstrap_group: true; group_replication_poll_spin_loops: 0; group_replication_compression_threshold: 1000000; group_replication_ip_whitelist: "AUTOMATIC"'
...
...
# Last 3 rows here confirm this is the bootstrap member for the Group, all is well
2017-03-14T17:27:22.543425Z 0 [Note] Plugin group_replication reported: 'Starting group replication recovery with view_id 14895124425433056:1'
2017-03-14T17:27:22.543713Z 15 [Note] Plugin group_replication reported: 'Only one server alive. Declaring this server as online within the replication group'
2017-03-14T17:27:22.555434Z 0 [Note] Plugin group_replication reported: 'This server was declared online within the replication group'

So we have a successfully initialized bootstrap member to start our Group Replication clustered setup.
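
The automatic whitelist entry shown in the log above can also be pulled out of a long error log programmatically. This is a minimal Python sketch rather than part of any Group Replication tooling; the function name `automatic_whitelist` is hypothetical, and the regular expression assumes the exact message wording seen in the log excerpt above.

```python
import re

# Matches the [GCS] note that lists the automatically generated whitelist
AUTO_RANGES = re.compile(
    r"\[GCS\] Added automatically IP ranges (\S+) to the whitelist"
)


def automatic_whitelist(log_lines):
    """Return the entries from the most recent automatic-whitelist log line.

    An empty list means no such line was found.
    """
    entries = []
    for line in log_lines:
        match = AUTO_RANGES.search(line)
        if match:
            # later matches overwrite earlier ones, so the newest start wins
            entries = match.group(1).split(",")
    return entries
```

An open file object can be passed directly, for example `automatic_whitelist(open(path))`, where `path` is wherever your installation writes the MySQL error log.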

Log Observations for a Failing Member Trying to Join

Next, we configure the subsequent member as I’ve explained earlier here, and it should join the Group that we’ve established above.  Let’s review its log entries:

Subsequent Member attempting to join the Group:

# 2nd member trying to join the group but failing !!!! 
# lines directly below, same as for the bootstrap member, outline the configuration
2017-03-14T17:37:25.890691Z 6 [Note] Plugin group_replication reported: '[GCS] Added automatically IP ranges 127.0.0.1/8,192.168.56.128/32 to the whitelist'
2017-03-14T17:37:25.892653Z 6 [Note] Plugin group_replication reported: '[GCS] Translated 'HOST2' to 192.168.56.128'
2017-03-14T17:37:25.892770Z 6 [Note] Plugin group_replication reported: '[GCS] SSL was not enabled'
2017-03-14T17:37:25.892784Z 6 [Note] Plugin group_replication reported: 'Initialized group communication with configuration: group_replication_group_name: "1a1c5221-fd26-11e6-8e12-1246aeecf2d5"; group_replication_local_address: "HOST2:16601"; group_replication_group_seeds: "HOST1:16601,HOST3:16601"; group_replication_bootstrap_group: false; group_replication_poll_spin_loops: 0; group_replication_compression_threshold: 1000000; group_replication_ip_whitelist: "AUTOMATIC"'
...
# The line directly below we can see it is reaching out to HOST1 on port 16601
# The 2nd line notes that it times out, and the 3rd states this member is NOT ready to join
2017-03-14T17:37:25.943647Z 0 [Note] Plugin group_replication reported: 'client connected to HOST1 16601 fd 130'
2017-03-14T17:37:55.944076Z 0 [ERROR] Plugin group_replication reported: '[GCS] Timeout while waiting for the group communication engine to be ready!'
2017-03-14T17:37:55.944116Z 0 [ERROR] Plugin group_replication reported: '[GCS] The group communication engine is not ready for the member to join. Local port: 16601'

With the errors noted above, let’s confirm whether the Bootstrap member participated.

Bootstrap Member's Error Log entries since our last look:

# There is one line entry added to the error log since we last looked
# It identifies the IP attempting to connect, and states the reason for rejection
# It wasn't in the IP whitelist!
 
2017-03-14T17:37:25.935650Z 0 [Warning] Plugin group_replication reported: '[GCS] Connection attempt from IP address 192.168.56.128 refused. Address is not in the IP whitelist.'

Yes, it did!  The Bootstrap member rejected the 2nd member that tried to join because it wasn’t included in the auto-configured IP whitelist.   But why not?

Look at the automatically created IP whitelist entries: 192.168.56.127/32 on the bootstrap member and 192.168.56.128/32 on the joining member.  They may not seem curious right away.  Look a bit closer at the netmask, though: the /32 indicates that each IP is in a subnet of its own.  So we can conclude that the automatic GCS IP whitelist accounts for the configured netmask of the host’s IP and includes that IP range in the whitelist (along with the very liberal localhost IP range).  Had the servers been configured with a /24 netmask (192.168.56.0/24), which allows the full range of values for the last octet, no problem would have appeared, because the automatic IP whitelist would have been sufficient.
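
The effect of the netmask is easy to reproduce with plain CIDR arithmetic. The sketch below uses Python’s standard `ipaddress` module and treats the whitelist as a simple list of networks; that containment model is my assumption for illustration, not the plugin’s actual code, and the function name `whitelisted` is hypothetical. It does not handle the special value AUTOMATIC.

```python
import ipaddress


def whitelisted(ip: str, whitelist: str) -> bool:
    """Return True if ip falls inside any comma-separated whitelist entry."""
    addr = ipaddress.ip_address(ip)
    for entry in whitelist.split(","):
        # strict=False tolerates host bits being set, as in 127.0.0.1/8
        net = ipaddress.ip_network(entry.strip(), strict=False)
        if addr in net:
            return True
    return False


# Whitelist generated on the bootstrap member with a /32 netmask
print(whitelisted("192.168.56.128", "127.0.0.1/8,192.168.56.127/32"))  # False
# What the same host would have generated with a /24 netmask
print(whitelisted("192.168.56.128", "127.0.0.1/8,192.168.56.127/24"))  # True
```

With the /32 entry only 192.168.56.127 itself matches, which is consistent with the rejection of 192.168.56.128 seen in the bootstrap member’s log.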

Options for the Fix

There are a few ways that this can be addressed.

  1. Adjust the netmask on all the servers to include a suitable range of IPs, one that covers all Group Replication related servers.  If this is acceptable, then you’ll need to stop Group Replication on the initial Member (HOST1 in this case), and re-bootstrap it so that the automatic IP whitelist picks up the new netmask configured on its server.  Other members should join properly after that, provided their own netmask setups have been adjusted too.
     
  2. Maybe the restrictive netmask was intentional, in which case you can construct the IP whitelist explicitly from the IP address of each member; netmasks do not need to be included.  See the IP whitelist documentation for more information.
    group_replication_ip_whitelist="192.168.56.127,192.168.56.128,192.168.56.129,127.0.0.1/8"

    Don’t forget, as per option 1, once you’ve added the above entry to all your Group Member configuration files, it still needs to be made active. There are 2 ways to do this:

    a) Restart the instance once the configuration file is set up.  Since the group in our case never made it past the first member, we’ll need to bootstrap that member again.  Once the initial bootstrap member is ready, the other servers need to restart before joining.

    b) Assuming the config file has also been adjusted to include the config entry above, then the configuration can be made dynamically to the server.  However the Group Replication “service” needs to be recycled.

Dynamic commands to run in MySQL:

# Here are the needed commands to dynamically adjust the IP whitelist
mysql> STOP GROUP_REPLICATION;
mysql> SET GLOBAL group_replication_ip_whitelist="192.168.56.127,192.168.56.128,192.168.56.129,127.0.0.1/8";
mysql> START GROUP_REPLICATION;
 
# On the bootstrap member, use this variant instead: enable the bootstrap flag before START and disable it afterward
mysql> STOP GROUP_REPLICATION;
mysql> SET GLOBAL group_replication_ip_whitelist="192.168.56.127,192.168.56.128,192.168.56.129,127.0.0.1/8";
mysql> SET GLOBAL group_replication_bootstrap_group=ON;
mysql> START GROUP_REPLICATION;
mysql> SET GLOBAL group_replication_bootstrap_group=OFF;

Offline Mode Usage with Group Replication

One more item to add: Anytime you plan to execute a restart of the Group Replication Service, ALWAYS dynamically enable offline_mode=ON before stopping the service. Once the group replication service is running again, then you can turn offline_mode=OFF.

Dynamic commands to run in MySQL:

# Same whitelist change as before, wrapped in offline_mode while the service is recycled
mysql> SET GLOBAL OFFLINE_MODE=ON;
mysql> STOP GROUP_REPLICATION;
mysql> SET GLOBAL group_replication_ip_whitelist="192.168.56.127,192.168.56.128,192.168.56.129,127.0.0.1/8";
mysql> START GROUP_REPLICATION;
mysql> SET GLOBAL OFFLINE_MODE=OFF;
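
When the same change has to be applied to every member, it can help to generate the statement sequence rather than retype it. The following Python sketch only builds the SQL text shown above and does not connect to MySQL; the function name `whitelist_change_statements` is hypothetical, and combining the bootstrap flag with offline_mode in one sequence is my own assumption rather than something spelled out in this post.

```python
def whitelist_change_statements(member_ips, bootstrap=False):
    """Return the ordered SQL statements for changing the whitelist on one member.

    member_ips: iterable of member IP addresses (localhost range is appended).
    bootstrap:  True only for the member that re-bootstraps the Group.
    """
    value = ",".join(list(member_ips) + ["127.0.0.1/8"])
    statements = [
        "SET GLOBAL OFFLINE_MODE=ON;",
        "STOP GROUP_REPLICATION;",
        f'SET GLOBAL group_replication_ip_whitelist="{value}";',
    ]
    if bootstrap:
        statements.append("SET GLOBAL group_replication_bootstrap_group=ON;")
    statements.append("START GROUP_REPLICATION;")
    if bootstrap:
        # turn the flag off again so a later restart cannot bootstrap a second group
        statements.append("SET GLOBAL group_replication_bootstrap_group=OFF;")
    statements.append("SET GLOBAL OFFLINE_MODE=OFF;")
    return statements
```

Calling it with `["192.168.56.127", "192.168.56.128", "192.168.56.129"]` reproduces the five statements listed above, which can then be pasted into the mysql client on each member.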

Reasons and understanding for the above will be in my next blog post.

Conclusion

Hopefully this walk through some key messages in the error log will help you get past complications you might come across.  There are a variety of things that might hold users up from getting Group Replication going, but I consider the items noted in this blog post the likely candidates, based on my experience and engagements with the companies that I’ve been working with so far.

Would love to hear your feedback and experience with Group Replication and look forward to supporting the wider MySQL Community and the commercial crowds alike!