MongoDB Backup/Restore Methodology using IBM® Spectrum Protect™ for Linux on z
Ryan Bertsche, Robert McNamara, Kyle Moser, Dulce Smith
© Copyright IBM Corp. 2016. All rights reserved. January 7, 2016

Table of Contents
1. Introduction
2. IBM® Spectrum Protect™ background
3. MongoDB background
4. Benefits
5. MongoDB environment setup
6. IBM® Spectrum Protect™ (TSM) resources and setup
7. Implementation
8. Test scenarios
9. Performance considerations
10. Summary

1. Introduction

IBM has been running virtual machines since the 1970s and has customers that run hundreds to thousands of Linux images as guests on mainframe z/VM LPARs. The IBM Linux on z Test Integration Center in Poughkeepsie, NY focuses on running a mix of IBM and open source products to prove the versatility of Linux on z Systems.

The focus of this white paper is backing up fairly large instances of MongoDB that are sharded and replicated across numerous servers. We accomplish this by using IBM® Spectrum Protect™, formerly Tivoli® Storage Manager (TSM), running as backup and management agents on all of the MongoDB servers. The purpose is to show how MongoDB can be integrated with traditional or existing backup tools like IBM® Spectrum Protect™, even at production-style scale. This is all possible using the completely free and open-source MongoDB, version 3.0 and higher. There are many tools and features embedded directly in MongoDB that assist in the process and complement MongoDB's backup abilities; we cover these in the next sections.

2. IBM® Spectrum Protect™ background

IBM® Spectrum Protect™ provides storage management solutions for multivendor computer environments. It provides automated, centrally scheduled, policy-managed backup, archive, and space-management capabilities for file servers, workstations, virtual machines, and applications.
Furthermore, IBM® Spectrum Protect™ supports systems of all sizes, including virtual machines, file servers, email, databases, Enterprise Resource Planning (ERP) systems, mainframes, and desktops. IBM® Spectrum Protect™ does all this from a single environment that expands as data grows. This white paper covers the backup and recovery functions of IBM® Spectrum Protect™ and shows how they can be used to protect MongoDB data.

In addition, there are a few specific reasons we chose IBM® Spectrum Protect™. First, it has a trusted track record as a reliable backup manager for large enterprise systems. Many large customers have trusted it for years and may already have it integrated in their enterprise systems. Even for customers that do not yet use IBM® Spectrum Protect™, there is no need to maintain separate backup software specific to each application; that is significantly more overhead than simply using IBM® Spectrum Protect™ for all data.

There are other important reasons to use IBM® Spectrum Protect™ as a backup solution. For instance, the backup occurs on site and does not depend on cloud services that could impede the backup or restore, leading to loss of data or time. The data is also more secure, and there is less chance of sensitive data being nefariously accessed if it stays on site. Finally, IBM® Spectrum Protect™ has a robust platform for managing backed-up data. Some applications will just pile data up on a hard drive on some other server and require it to be restored manually, or you would have to manually manage where data is stored and what to do with it over time. IBM® Spectrum Protect™ can be configured with automatic procedures, like writing data to disk and tape, and has numerous archiving features. And when it comes to restoring the data, IBM® Spectrum Protect™ agents make it simple to get to the most recent backup for a particular machine and handle the process of putting the data back where it belongs.

3. MongoDB background

MongoDB is an open source database considered to be the most popular and fastest growing NoSQL database, mostly because of how well it works in areas where traditional SQL databases have trouble. It is very good at dealing with large sets of unstructured data and has exceptionally good read times on the data that is stored. That, combined with powerful queries written in JavaScript, makes MongoDB a powerful tool for modern applications like mobile and analytics that require frequent reads and consumption of data. While it is not a replacement for all SQL applications that store structured data, it does give a modern solution for massive amounts of unstructured data and mobile traffic.

In addition, MongoDB is designed to be highly scalable and available. These features are built into the design and structure of a MongoDB environment. MongoDB in a production environment is actually a cluster of processes running different tasks, usually on different machines. It consists of three different types of servers:

a. Config Servers - These servers store metadata about the locations of data. For a production environment, there need to be exactly three config servers. These are the metadata servers that hold all the important information for the clustered database: where data is stored and how much there is.
b. Query Routers - These special instances of MongoDB (mongos) are the interface, or gateway, between outside applications and the data stored in MongoDB. Requests come into these servers, and the requested data is returned to the application through the query router. No data is stored permanently on these servers, which gives an added layer of security when connecting to the outside world. These instances work by querying the config servers to find where the data is stored; they then intelligently fetch the data and return it to the application. They also act as the management interface for cluster-level administration. While the configuration itself is stored on the config servers, the query routers are the administrative console through which you access all settings and preferences. There can be any number of query routers, but you need at least one for the database to be functional.

c. Shards - This is where the data is actually stored in the system. The purpose of sharding is to horizontally scale the NoSQL database. The data is broken into pieces (shards) spread among a set of servers for the purpose of keeping the data consistent and available while avoiding I/O bottlenecks. The shards are also automatically load balanced to prevent one server becoming disproportionately full or large. The idea is to break up the dataset as logically as possible and spread it across different machines; more machines mean more availability. Additional shards can be added, and the database will redistribute the data equally, allowing MongoDB to handle even more traffic. Of course, each shard is not an individual server. Each shard is actually a set of duplicated servers called a replica set.

• Replica Sets - This is the solution to the data redundancy and availability issues that can be faced with large databases, especially those of the NoSQL variety. The goal is to have everything backed up, in a synchronized fashion, to the point where the complete failure of an entire server can be handled automatically with no downtime. Replica sets hold an election amongst themselves to pick a single server to be the primary. The primary is responsible for handling all writes for the replica set. All writes are handled first by the primary server, which writes them to an operation log that it distributes to the secondary members of the replica set. The secondary members then play back that log and apply all operations to their own data. One very interesting feature of replica sets is that while the primary handles all the writes, data can be read from all of the replica servers at the same time. This means that read operations can occur concurrently on the same piece of data across a replica set, which leads to great availability. As far as failover goes, replica sets are designed to automatically detect and recover from the loss of any server in the set, including the primary. With the loss of a primary, the remaining servers automatically elect a new primary and continue operation.

Figure 1: MongoDB Structural Diagram

As you can tell from the diagram, in a production environment each shard of a database is also a replica set, so there is always built-in redundancy with the data. There is no place in the MongoDB cluster where the data exists only once.
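As a minimal sketch of how these three roles map to running processes in MongoDB 3.0 (the hostnames, ports, and data paths here are illustrative, not taken from our test environment), a sharded cluster is simply a collection of mongod and mongos processes:

# Config server (run one of these on each of three separate hosts)
mongod --configsvr --dbpath /data/configdb --port 27019

# Shard member (one mongod per replica set member, e.g. replica set rs1)
mongod --replSet rs1 --dbpath /data --port 27018

# Query router, pointed at all three config servers
mongos --configdb cfg1.example.com:27019,cfg2.example.com:27019,cfg3.example.com:27019 --port 27017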
MongoDB is built with persistent and reliable data in mind, which is why it has native sharding and replica sets. Sharding is primarily focused on the scalability of the data by dividing a database among multiple servers. Replica sets are sets of servers holding identical copies of a shard's data, kept on separate machines for redundancy and speed.

There is also a feature called journaling, in which the database stores pending write operations in a journal file before they are written to disk. These files will grow as large as 1GB in size and are only deleted once all the operations in the journal have been completed or a clean shutdown of the server has occurred. In the case of a dirty shutdown, as soon as the server is turned back on, the journal file is read and played back to verify that all operations were committed and to commit the ones that were not. This, again, can be done automatically once the MongoDB instance on the server is brought back online. In the case of some write errors, MongoDB will actually restart itself in order to fix the issue: it restarts, reads from the journal, and then, once the journal reads are completed, deletes the old journal file and creates a new one.

Between journaling, sharding, and replica sets, there is a fair amount of automated failover for minor hiccups, like a primary shard member losing connectivity and going offline. Things would continue to run without flaw, and the downed member could be brought back up automatically after a restart. However, on bigger systems that store important data with security and reliability requirements, these techniques alone aren't quite enough. In this paper, we use IBM® Spectrum Protect™ as a solution to make MongoDB's built-in failover features more robust, enabling fully scheduled backups for not just MongoDB but an entire enterprise ecosystem. In an enterprise system, you need full backups of all of your data for record keeping, audits, and disaster recovery, amongst other things. MongoDB's redundancy is more about availability and does not provide a good way to handle these large, long-term backups. By using IBM® Spectrum Protect™, we are able to merge MongoDB with existing enterprise infrastructure that already has backup policies in place.

In our backup procedure, IBM® Spectrum Protect™ uses a built-in feature of Red Hat (and most other major Linux distributions) called LVM snapshot to take a filesystem-level copy of the MongoDB files on a replicated server. This happens in a series of steps that are extensively described in the following sections. What should be pointed out at this point is that all the journals and database files need to be in the same logical volume, to avoid having to snapshot unnecessary amounts of unrelated data. This is a best practice for MongoDB anyway, so it is assumed that this basic rule is followed. In the end, the snapshot is compressed and stored by IBM® Spectrum Protect™ according to its own storage procedures.

The MongoDB backup process relies on a dedicated replica set member for each shard that is used strictly for backup purposes. For availability purposes in a production environment, there would be multiple replicas of a single server. For our backup purposes, we want a replica that is predefined to be the backup instance. That means it cannot be a primary server, and MongoDB gives configuration options to make this happen.
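A minimal sketch of such a configuration, assuming the designated backup member is members[2] of a replica set reachable at the illustrative hostname below, sets its priority to 0 so it can never become primary (and can optionally hide it from application reads); the full step-by-step procedure we used is given in the Implementation section:

# From the bash shell, against the replica set primary (hostname/port illustrative)
mongo rs1primary.example.com:27018 <<'EOF'
cfg = rs.conf()
cfg.members[2].priority = 0   // this member can never be elected primary
cfg.members[2].hidden = true  // optional: keep application read traffic off the backup member
rs.reconfig(cfg)
EOF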
It will continue to operate as a functional member of the replica set while it is being backed up. However, we know that the backup procedure will take some resources on the backup server, and we don't want to slow down the cluster. The production environment and backup schedule should be designed with that in mind. This backup procedure imposes a few minimum requirements on the MongoDB cluster, which coincide with the minimum requirements for a production environment set forth by the MongoDB documentation.

4. Benefits

The following are some of the benefits IBM® Spectrum Protect™ offers as a backup and restore tool to enhance MongoDB's capabilities:
• Single backup manager for all applications, keeping complexity and overhead lower
• Robust and configurable storage mediums and procedures
• Separate server to do the backup
• Cataloging of backups
• Backup of changed data only
• Built-in scheduling
• Single control point
• Database for logging, configuration, statistics, and metadata

In the next sections, we will show you how some of these features are utilized to make MongoDB a more robust database.

5. MongoDB environment setup

Set up your replicated MongoDB servers, according to best practices, with replicas and shards.

1. Our servers had the following minimum configurations:
• 5GB of storage
• 1GB of RAM (more is recommended)
• MongoDB 3.0 or greater
• RHEL 7.1 on z/VM

2. The backup environment will need to have, at a minimum, the following qualifications:
• There must be 3 configuration servers
• The database is divided into at least 2 shards
  o Those shards are split on an optimized shard key
• There are at least 3 servers running as a replica set for each shard
  o i.e., 1 primary server and 2 secondary servers per shard
• All servers in the cluster are networked by their domain names, not directly by their IP addresses, for failover purposes
• At least one MongoDB router instance (mongos) is able to handle all requests. There should be enough routers to sufficiently handle incoming and outgoing requests; more routers should be added as needed.

The procedure for backing up MongoDB is the main topic of this paper. This paper does not cover the entire procedure for creating or maintaining a production MongoDB instance. For more information on that and other MongoDB configurations, the official documentation is a great resource: http://docs.mongodb.org/manual/

6. IBM® Spectrum Protect™ (TSM) resources and setup

1. IBM® Spectrum Protect™ (TSM) Server resources
To demonstrate the consistency and flexibility of the MongoDB Linux on z backup/restore methodology, the following IBM® Spectrum Protect™ (TSM) Server resources were each individually tested for this evaluation:
• System p AIX IBM® Spectrum Protect™ (TSM) Server version 6.3.5.0
• System p AIX IBM® Spectrum Protect™ (TSM) Server version 7.1.3.0
• System z Linux IBM® Spectrum Protect™ (TSM) Server version 6.3.5.0
• System z Linux IBM® Spectrum Protect™ (TSM) Server version 7.1.3.0

2. IBM® Spectrum Protect™ (TSM) Client resources
To demonstrate the consistency of the MongoDB Linux on z backup/restore methodology, the following IBM® Spectrum Protect™ (TSM) Client resources were each individually tested for this evaluation:
• System z Linux IBM® Spectrum Protect™ (TSM) Client 7.1.2.0
• System z Linux IBM® Spectrum Protect™ (TSM) Client 7.1.3.0
• System z Linux IBM® Spectrum Protect™ (TSM) Client 7.1.4.0

3. Install and Configure the IBM® Spectrum Protect™ (TSM) Client
The IBM® Spectrum Protect™ Client, version 7.1.4.0, has been installed on each Linux on z MongoDB server. Here are the steps to perform this:

a. Copy the install tar file, 7.1.4.0-TIV-TSMBAC-LinuxS390.tar, from your install repository to the target server:
• Create a folder on the target MongoDB server to download the file to:
[root@LTMNGO01 /]# mkdir /tmp/tsm
[root@LTMNGO01 /]#
• Copy the 7.1.4.0-TIV-TSMBAC-LinuxS390.tar tar file from the source repository (e.g. /images/tsm.clients/7.1.4.0) to the target directory (e.g. /tmp/tsm) using SCP. On the target MongoDB server, LTMNGO01, execute:
[root@LTMNGO01 tsm]# scp -pr 10.20.xx.xxx:/images/tsm.clients/7.1.4.0/7.1.4.0-TIV-TSMBAC-LinuxS390.tar /tmp/tsm
Password:
7.1.4.0-TIV-TSMBAC-LinuxS390.tar    100%  140MB  46.5MB/s  00:03
[root@LTMNGO01 tsm]#
[root@LTMNGO01 tsm]# ls
7.1.4.0-TIV-TSMBAC-LinuxS390.tar
[root@LTMNGO01 tsm]#

b. Untar the 7.1.4.0-TIV-TSMBAC-LinuxS390.tar file located on the target MongoDB server:
[root@LTMNGO01 tsm]# tar -xvpf 7.1.4.0-TIV-TSMBAC-LinuxS390.tar
README_enu.htm
README_api_enu.htm
TIVsm-API64.s390x.rpm
TIVsm-BA.s390x.rpm
TIVsm-JBB.s390x.rpm
TIVsm-APIcit.s390x.rpm
TIVsm-BAcit.s390x.rpm
TIVsm-filepath-7.1.4-0-rhel59.s390x.rpm
TIVsm-filepath-7.1.4-0-rhel64.s390x.rpm
TIVsm-filepath-7.1.4-0-rhel70.s390x.rpm
TIVsm-filepath-7.1.4-0-sles11sp2.s390x.rpm
TIVsm-filepath-source.tar.gz
gskcrypt64-8.0.50.52.linux.s390x.rpm
gskssl64-8.0.50.52.linux.s390x.rpm

c. Install/update the following 4 rpms:
TIVsm-API64.s390x.rpm
TIVsm-BA.s390x.rpm
gskcrypt64-8.0.50.52.linux.s390x.rpm
gskssl64-8.0.50.52.linux.s390x.rpm

[root@LTMNGO01 tsm]# rpm -i gskcrypt64-8.0.50.52.linux.s390x.rpm
[root@LTMNGO01 tsm]# rpm -i gskssl64-8.0.50.52.linux.s390x.rpm
[root@LTMNGO01 tsm]# rpm -i TIVsm-API64.s390x.rpm
[root@LTMNGO01 tsm]# rpm -i TIVsm-BA.s390x.rpm
[root@LTMNGO01 tsm]#
[root@LTMNGO01 tsm]# rpm -qa | grep gsk
gskssl64-8.0-50.52.s390x
gskcrypt64-8.0-50.52.s390x
[root@LTMNGO01 tsm]#
[root@LTMNGO01 tsm]# rpm -qa | grep TIVsm
TIVsm-API64-7.1.4-0.s390x
TIVsm-BA-7.1.4-0.s390x
[root@LTMNGO01 tsm]#

d. Create and update the /opt/tivoli/tsm/client/ba/bin/dsm.sys and /opt/tivoli/tsm/client/ba/bin/dsm.opt files.
Note: Any time you make a change to the dsm.sys or dsm.opt files, issue /etc/init.d/tsm restart so that the IBM® Spectrum Protect™ Client scheduler daemon re-reads these files.
• Change directory:
[root@LTMNGO01 tsm]# cd /opt/tivoli/tsm/client/ba/bin
[root@LTMNGO01 bin]#
• Copy the dsm.opt.smp file to dsm.opt:
[root@LTMNGO01 bin]# cp dsm.opt.smp dsm.opt
[root@LTMNGO01 bin]#
• Copy the dsm.sys.smp file to dsm.sys:
[root@LTMNGO01 bin]# cp dsm.sys.smp dsm.sys
[root@LTMNGO01 bin]#
• List the files in the directory:
[root@LTMNGO01 bin]# ls -al dsm.sys dsm.opt
-r--r--r-- 1 root root 834 Dec 14 12:52 dsm.opt
-r--r--r-- 1 root root 971 Dec 14 12:47 dsm.sys
[root@LTMNGO01 bin]#
• Update the dsm.opt file:
- This is what dsm.opt currently contains:
[root@LTMNGO01 bin]# cat dsm.opt
************************************************************************
* Tivoli Storage Manager                                               *
*                                                                      *
* Sample Client User Options file for UNIX (dsm.opt.smp)               *
************************************************************************
* This file contains an option you can use to specify the TSM
* server to contact if more than one is defined in your client
* system options file (dsm.sys). Copy dsm.opt.smp to dsm.opt.
* If you enter a server name for the option below, remove the
* leading asterisk (*).
************************************************************************
* SErvername        A server name defined in the dsm.sys file
[root@LTMNGO01 bin]#
- Add the following line at the end of the file (where TSMLINUX is the name of the TSM server):
SErvername TSMLINUX
- Now, it reads:
[root@LTMNGO01 bin]# cat dsm.opt
************************************************************************
* Tivoli Storage Manager                                               *
*                                                                      *
* Sample Client User Options file for UNIX (dsm.opt.smp)               *
************************************************************************
* This file contains an option you can use to specify the TSM
* server to contact if more than one is defined in your client
* system options file (dsm.sys). Copy dsm.opt.smp to dsm.opt.
* If you enter a server name for the option below, remove the
* leading asterisk (*).
************************************************************************
* SErvername        A server name defined in the dsm.sys file
SErvername TSMLINUX
SErvername TSMAIX
[root@LTMNGO01 bin]#
• Update the dsm.sys file:
This is what the sample IBM® Spectrum Protect™ Client dsm.sys file looks like:
[root@ltmngo02 bin]# cat dsm.sys
*****************************************************************
* Tivoli Storage Manager                                        *
*                                                               *
* Sample Client System Options file for UNIX (dsm.sys.smp)      *
*****************************************************************
* This file contains the minimum options required to get started
* using TSM. Copy dsm.sys.smp to dsm.sys. In the dsm.sys file,
* enter the appropriate values for each option listed below and
* remove the leading asterisk (*) for each one.
* If your client node communicates with multiple TSM servers, be
* sure to add a stanza, beginning with the SERVERNAME option, for
* each additional server.
*****************************************************************
SErvername server_a
   COMMMethod        TCPip
   TCPPort           1500
   TCPServeraddress  node.domain.company.COM
[root@ltmngo01 bin]#

As an example of a dsm.sys file configured with 2 TSM Servers, the following is an updated dsm.sys file containing 2 IBM® Spectrum Protect™ servername stanzas and corresponding entries for both the Linux on z IBM® Spectrum Protect™ 7.1.3.0 Server (TSMLINUX) and the System p AIX TSM 7.1.3.0 Server (TSMAIX). Either or both TSM Servers could be used for MongoDB Server image backup purposes, depending on the data protection requirements of the specific installation.
The same dsm.sys TSM Server stanza and corresponding entry configuration rules could be used to add more TSM Servers, should they be required by a specific installation. In the following dsm.sys file, all values and entries that differ from the default dsm.sys file have been modified for this environment. In addition, the performance-related entries (DISKBUFFSIZE, TCPBUFFSIZE, TCPWINDOWSIZE, TCPNODELAY, USELARGEBUFFERS, and TXNBYTELIMIT) are currently set with the maximum values for optimal MongoDB Server image backup performance. Please note that in an environment with a heavily utilized or resource-constrained (i.e., CPU, memory, I/O, network, etc.) MongoDB Server, configuring these dsm.sys performance-related entries with their optimal performance values could negatively impact MongoDB Server performance. In such situations, it would be better to start with the default values for these performance-related entries and incrementally evaluate any changes from there.

[root@LTMNGO01 bin]# cat dsm.sys
SErvername TSMLINUX
   COMMMethod          TCPIP
   TCPPort             1500
   TCPServeraddress    10.1.1.1
   QUERYSCHEDPERIOD    6
   NODENAME            LTMNGO01
   SCHEDLOGNAME        /tmp/tsm1/dsmsched.log
   ERRORLOGNAME        /tmp/tsm1/dsmerror.log
   SCHEDLOGRETENTION   7 D
   ERRORLOGRETENTION   7 D
   EXCLUDE.DIR         /proc
   PASSWORDACCESS      GENERATE
   COMPRESSION         NO
   DISKBUFFSIZE        1023
   TCPBUFFSIZE         512
   TCPWINDOWSIZE       512
   TCPNODELAY          YES
   USELARGEBUFFERS     YES
   TXNBYTELIMIT        64000

SErvername TSMAIX
   COMMMethod          TCPIP
   TCPPort             1500
   TCPServeraddress    10.1.1.2
   QUERYSCHEDPERIOD    6
   NODENAME            LTMNGO01
   SCHEDLOGNAME        /tmp/tsm2/dsmsched.log
   ERRORLOGNAME        /tmp/tsm2/dsmerror.log
   SCHEDLOGRETENTION   7 D
   ERRORLOGRETENTION   7 D
   EXCLUDE.DIR         /proc
   PASSWORDACCESS      GENERATE
   COMPRESSION         NO
   DISKBUFFSIZE        1023
   TCPBUFFSIZE         512
   TCPWINDOWSIZE       512
   TCPNODELAY          YES
   USELARGEBUFFERS     YES
   TXNBYTELIMIT        64000

e. Self-register the IBM® Spectrum Protect™ Client with the IBM® Spectrum Protect™ server TSMLINUX
Note: To enable open IBM® Spectrum Protect™ Client registration with the IBM® Spectrum Protect™ Server, the IBM® Spectrum Protect™ Server REGISTRATION setting must be set to OPEN.
Execute the dsmc -se=TSMLINUX command and answer the questions that follow:
[root@LTMNGO01 bin]# dsmc -se=TSMLINUX
IBM Tivoli Storage Manager
Command Line Backup-Archive Client Interface
  Client Version 7, Release 1, Level 4.0
  Client date/time: 12/14/2015 13:00:39
(c) Copyright by IBM Corporation and other(s) 1990, 2015. All Rights Reserved.

Node Name: LTMNGO01
Your ID (LTMNGO01) is not currently registered with the server.
Enter the following information to set up a new ID:
Please enter a password:
Enter new password for verification:
Enter contact information: [email protected]
Session established with server TSMLINUX: Linux/s390x
  Server Version 7, Release 1, Level 1.300
  Server date/time: 12/14/2015 13:00:54  Last access: 12/14/2015 13:00:54

tsm>

At this point, the IBM® Spectrum Protect™ client, version 7.1.4.0, has been installed on the MongoDB server.

4. Install and configure the IBM® Spectrum Protect™ (TSM) client scheduler
To run the IBM® Spectrum Protect™ client scheduler daemon as a service at system start, follow these steps.

a. Place the following script in /etc/init.d as the file tsm:
#! /bin/bash
#
# tsm          starts and stops the Tivoli Storage Manager Client
#
# chkconfig: 2345 99 00
# description: Starts and stops the Tivoli Storage Manager Client
#
# config: /opt/tivoli/tsm/client/ba/bin/dsm.sys
# processname: dsmc

# Source function library.
. /etc/init.d/functions

# See how we were called.
prog="tsm"

# Define environment variables
PATH=$PATH:/opt/tivoli/tsm/client/ba/bin
LANG="en_US"
LOCK="/var/lock/subsys/$prog"
PID="/var/run/$prog.pid"
export LANG

start () {
    echo -n $"Starting $prog: "
    mkdir -p /tmp/tsm
    dsmc schedule > /tmp/tsm/dsmsched.log 2>/tmp/tsm/dsmerror.log &
    pgrep dsmc > $PID && success
    RETVAL=$?
    echo
    [ $RETVAL -eq 0 ] && touch $LOCK
    return $RETVAL
}

stop () {
    echo -n $"Stopping $prog: "
    killproc dsmc
    RETVAL=$?
    echo
    [ $RETVAL -eq 0 ] && rm -f $LOCK
    [ $RETVAL -eq 0 ] && rm -f $PID
    return $RETVAL
}

rhstatus() {
    status dsmc
}

restart() {
    stop
    start
}

case "$1" in
  start)
    start
    ;;
  stop)
    stop
    ;;
  restart)
    restart
    ;;
  status)
    rhstatus
    ;;
  *)
    echo $"Usage: $0 {start|stop|status|restart}"
    exit 1
esac

exit $?

For additional information, please see this link:
https://adsm.org/forum/index.php?threads/how-to-start-tsm-client-scheduler-in-linux.26676/

b. Issue the following 4 commands to configure the "tsm" service for run levels 3, 4, and 5:
chmod 755 /etc/init.d/tsm
chkconfig --add tsm
chkconfig --level 345 tsm on
chkconfig --list

Note: the following are the manual IBM® Spectrum Protect™ client scheduler daemon controls.
To start: /etc/init.d/tsm start
To stop: /etc/init.d/tsm stop
To restart: /etc/init.d/tsm restart

c. Check that the IBM® Spectrum Protect™ client scheduler process is running.
To check that the dsmc scheduler process is running, execute ps -ef | grep dsmc:
[root@LTMNGO01 bin]# ps -ef | grep dsmc | grep -v grep
root  42685  1  0 15:18 ?  00:00:00 dsmc schedule
[root@LTMNGO01 bin]#
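As an additional sanity check (a sketch using the stanza name TSMLINUX defined earlier in dsm.opt, and the log path written by the init script above), a few commands confirm that the scheduler is up and that the client can see its schedules on the server:

# Quick verification after the scheduler service is running
/etc/init.d/tsm status                 # scheduler daemon status via the init script
dsmc query schedule -se=TSMLINUX       # list backup schedules defined for this node on the server
tail /tmp/tsm/dsmsched.log             # review recent scheduler activity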
7. Implementation

The goal of our IBM® Spectrum Protect™ agent backup is to provide a robust and reliable backup solution for MongoDB sharded environments that can be done entirely on premises. For customers already using the robust IBM® Spectrum Protect™ software, or for customers that may have more to back up than just MongoDB, this provides a unified solution. The best way to capture a point-in-time backup is to initiate LVM snapshots simultaneously across all the servers being backed up. The servers continue to operate in the cluster while the snapshot freezes the data being backed up, and after the backup has completed each server continues to operate as normal. So, in this backup method, there is no downtime for any of the MongoDB instances. An overview of the entire backup procedure is described below, with the specific commands given in the testing section.

The way IBM® Spectrum Protect™ will back up the data is as follows:
• A script will be run to shut off the MongoDB balancer across the entire cluster to ensure a quick and accurate backup
• One of the three config servers will be shut down for the backup procedure - that way, no metadata can be changed
• The IBM® Spectrum Protect™ backup procedure will be called on the designated backup replica members and the one config server simultaneously
• Once the backup is complete, the config server will be turned back on, and it will automatically begin to 'catch up' with the other 2 config servers
• The balancer will be resumed, and the MongoDB cluster will resume all normal operations

Restoring the backup:
• The backup is restored by using IBM® Spectrum Protect™ commands to restore the entire logical volume to the machine
  o There must be a volume group of the same name on the machine being restored to, with enough space to hold the backup volume
  o You should make sure all of the DNS and hostname settings of the new machine match those of the backed-up machine
  o Make sure that the configuration file for this new machine is the same as the one it is being restored from
• Once the command is run, you start the MongoDB instance, and it acts just like the original server that was backed up

Keeping track of the hostnames and configuration files for an environment can become an increasingly complex task. We handled this by using dynamic configuration files and hostnames that were generated using Ansible. Alternatively, the MongoDB instances could be deployed with some other form of infrastructure as code, like Chef, or containerized with Docker, or possibly even a cloud-based implementation. Either way, the setup and configuration are worth backing up, because recreating them by hand can delay the recovery of a system.

Steps to setting up the MongoDB cluster with shards and replicas (a consolidated command sketch follows step 7):

1. Install MongoDB on each server in the cluster (at least 2 shards of 3 replicas each, plus 3 config servers, and at least 1 router server, for a total of at least 10).
2. Time to configure the first replica set.
   a. Go into the first server of replica set 'RS1'.
      i. In the bash shell, enter mongo ipaddress:port to enter the MongoDB shell for that instance.
      ii. In the MongoDB shell, enter rs.initiate() to initialize the replica set, starting with this server.
      iii. In the MongoDB shell, issue rs.add("ipaddress:port") for each additional mongod instance that is part of the replica set. (Note: you do not need to run this command for the MongoDB instance you are currently in.)
   b. Set one of the replicas to be a priority 0 server, to be the designated backup server.
      i. Run rs.conf() and rs.status() to get information about your replica set, and double check that everything was added correctly. The order in which the servers are displayed corresponds to their array index in the set.
      ii. Run cfg = rs.conf() to assign the whole replica set config to a variable.
      iii. Run cfg.members[0].priority = 0 to set the priority of member [0] to 0.
      iv. Run cfg.members[n] before running the previous command to make sure you're configuring the desired member of the replica set.
      v. Run rs.reconfig(cfg) to make the changes stick.
      vi. Confirm the changes stuck with the rs.conf() and rs.status() commands.
   Note: You cannot change the priority of the current primary to 0. In order to do that, you must first issue the command rs.stepDown() to force the primary to step down. Then you can set the priority of the former primary to 0 so it won't become primary again.
3. The first replica set is up. It is now time to initialize the sharding.
   a. First, you will need to add the initial replica set RS1 to the shard group.
      i. Run the command mongo ipaddress:27017/admin to connect to the mongos admin shell.
      ii. Run sh.addShard("rs1/ipaddress:port, ipaddress:port") in the mongos shell to add the replica set as a shard.
      Note: When adding a replica set as a shard, you only need to give the name of one replica set member for the entire replica set to be added.
4. At this point, start the driver to add data to the current collection. It will make it easier to find a shard key once we add the second set.
   a. To accomplish this, you should run some test data through your application and write it into MongoDB, pointing it at your mongos instances.
5. Now that you have some test data, bring up the other replica sets you would like to add to the cluster.
   a. For each additional shard you wish to add, you need to bring up one additional replica set.
      i. Simply follow the commands in step 2 and bring up as many replica sets as you need shards.
6. Now it's time to add all of the additional replica sets we just configured to the shard pool.
   a. You will want to make sure to add all shard groups to the shard pool.
      i. This time, repeat step 3 for each replica set. You only need to log in to the mongos shell once, but repeat step 3-a-ii for every replica set being added as a shard.
   b. The next step is to enable sharding of the particular database we are using.
      i. Connect to a mongos admin shell (see step 3-a-i).
      ii. Enter the command sh.enableSharding("databaseName").
7. Your database is now ready to have its collections sharded. Once a collection is sharded, the data in that collection will be distributed evenly between all the shards. First, you must assign a shard key for each collection. This process is a little more in depth and takes some finessing, so it is described in greater detail in the section below.
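The commands from steps 2 through 6 can be collected into a single sketch (the hostnames, ports, and the replica set names rs1/rs2 are illustrative and would be replaced by the values for your own cluster):

# Step 2: initiate replica set rs1 and designate a priority-0 backup member
mongo shard1a.example.com:27018 <<'EOF'
rs.initiate()
rs.add("shard1b.example.com:27018")
rs.add("shard1c.example.com:27018")
cfg = rs.conf()
cfg.members[2].priority = 0   // dedicated backup member, never primary
rs.reconfig(cfg)
EOF

# Steps 3 and 6: add each replica set as a shard and enable sharding, via a mongos
mongo router.example.com:27017/admin <<'EOF'
sh.addShard("rs1/shard1a.example.com:27018")
sh.addShard("rs2/shard2a.example.com:27018")
sh.enableSharding("databaseName")
EOF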
8. Test scenarios

In order to clearly show the steps in the backup process, we will demonstrate the proper method to shard a database consisting of multiple collections. We will demonstrate this with AcmeAir, one of the drivers we used in testing our MongoDB backup procedures. The source code and documentation for this driver can be found at https://github.com/acmeair/acmeair-nodejs. AcmeAir is an open source application that is meant to resemble an airline company website, including flight lookup and a booking system. There are customers that have data attached to their login accounts, as well as flights to book, flights with layovers, bookings for flights, and current customer sessions online. It uses a Node.js environment to run a web application that has traffic driven against it by a JMeter automation driver. Traffic is sent to the web page via API, and the application uses a Node.js driver to run that traffic against our MongoDB system.

The question still remains: how do you shard the data? This is an area where MongoDB could mostly use improvement with dynamic shard values. Right now, we are going to walk through the index sharding of the AcmeAir database with you. The AcmeAir database contains six collections (e.g. booking, flight, flightSegment). Sharding is enabled for the database, and then each one of the collections gets its own shard index and each collection is independently sharded.

First, we need to load the database with some initial data and run the driver a few times to get some test data. The database will be ready to shard but will not yet be actively sharded. We do this because we don't know what the values in those collections will be, or the distribution of those values. If we knew the expected distribution beforehand, we could shard before adding data to the collection.

Then, we will access the MongoDB shell of the primary server with the command:
mongo ipaddress:port
where the port is the port on which the primary replica of the database is running. Then, you will be in the MongoDB shell. Next, type in the command:
use acmeair
This switches the focus to the database we will be working with. Next, we need to get all the collections in the database with the command:
db.getCollectionInfos()
This gives a list of the collections that exist within the database. In our AcmeAir example, the output from this command is as follows:
rs1:PRIMARY> db.getCollectionInfos()
[
  { "name" : "airportCodeMapping", "options" : { } },
  { "name" : "booking", "options" : { } },
  { "name" : "customer", "options" : { } },
  { "name" : "customerSession", "options" : { } },
  { "name" : "flight", "options" : { } },
  { "name" : "flightSegment", "options" : { } }
]
Alternatively, you can use the following command for a more compressed list of collection names:
db.getCollectionNames()
which for us resulted in the following output:
[
  "airportCodeMapping",
  "booking",
  "customer",
  "customerSession",
  "flight",
  "flightSegment"
]
After you have the names of the collections, you can see the fields they are made up of with the command:
db.collectionName.find()
with collectionName being the actual name of the collection, such as airportCodeMapping. Here is an example of the output of that command:
rs1:PRIMARY> db.airportCodeMapping.find()
{ "_id" : "BOM", "airportName" : "Mumbai" }
{ "_id" : "DEL", "airportName" : "Delhi" }
{ "_id" : "FRA", "airportName" : "Frankfurt" }
{ "_id" : "HKG", "airportName" : "Hong Kong" }
{ "_id" : "LHR", "airportName" : "London" }
{ "_id" : "YUL", "airportName" : "Montreal" }
{ "_id" : "SVO", "airportName" : "Moscow" }
{ "_id" : "JFK", "airportName" : "New York" }
{ "_id" : "CDG", "airportName" : "Paris" }
{ "_id" : "FCO", "airportName" : "Rome" }
{ "_id" : "SIN", "airportName" : "Singapore" }
{ "_id" : "SYD", "airportName" : "Sydney" }
{ "_id" : "IKA", "airportName" : "Tehran" }
{ "_id" : "NRT", "airportName" : "Tokyo" }

MongoDB requires that you manually choose the shard field, and in this case we will choose the airportName field. After you have decided which fields you would like to shard on, it is time to make it happen. The first step in the process is to get into the mongos admin shell, which you should already be in from the last step. Then you need to issue use databaseName to switch focus to the database of your choice. These next steps will be repeated for each collection in your database. Remember, it is the collections (more or less the equivalent of tables in a traditional relational SQL database) that make up the database being sharded.
The sharding process is essentially indexing a collection on one field of the collection and then splitting that collection at the halfway point of that index. Now, the steps:

1. Before you index, check whether a good index already exists, with the command db.collectionName.getIndexes().
2. Run the command db.collectionName.createIndex( { fieldName : 1 } ). This command indexes the collection collectionName on the field fieldName. The 1 after the field name stands for the ordering, in this case ascending order; -1 would signify descending order.
• Note: Manually indexing a field is only necessary if data is already stored in the collection. If you know the format of the data beforehand, you can skip this manual indexing; MongoDB will automatically make the shard key the index.

After that, you can set the shard key with the command sh.shardCollection("db.collection", shard-key). For example:
sh.shardCollection("acmeair.airportCodeMapping", { "airportName" : 1 })

Backup Procedure and scripting

Now that we have a fully set up MongoDB database, we can start a backup. Interestingly, there are multiple storage engines available in MongoDB: MMAPv1 and WiredTiger. For our backup and restore procedure, we chose to use MMAPv1 because it was more stable. However, with our backup procedure, it makes no difference which storage engine you decide to use.

One of the main reasons we settled on an LVM snapshot to get a point-in-time backup of our system is that we cannot otherwise guarantee a perfect moment-in-time backup without completely shutting down each MongoDB backup instance. There is a MongoDB method, db.fsyncLock(), which is supposed to stop writes to a database, but it turns out that fsyncLock() does not guarantee that WiredTiger actually stops writing. Since that creates a data integrity issue, we use the LVM snapshot instead. That actually makes the process simpler, and our only remaining tasks are to 1) stop the balancer and 2) start the IBM® Spectrum Protect™ agent backups with some synchronization.

Part of what needs to be scripted is the stopping of the balancer. The balancer is the locking mechanism that manages the redistribution of data amongst the shards in a MongoDB cluster. For example, if you sharded a collection on the field Last Name, the split between 2 shards might be the letter 'L'. If MongoDB notices there are significantly more entries in the 'A-L' shard than the 'M-Z' shard, it may change the split point from 'L' to 'K'. In this case, all the data with a last name starting with 'L' would move from shard 1 to shard 2.

It makes sense to turn the balancer off because you don't want the balancer to move things during the backup process. New data will still be placed on the correct shard based on its shard key, but the split points will not change while the balancer is off. That means that MongoDB can still accept new writes, as well as reads, while the balancer is turned off. It is possible for the data to become unbalanced during this time if there is a large influx of data from one section of the shard key range. This can be reduced by using more complex shard keys, like a hashed key, or by sharding on a different value altogether. Either way, MongoDB will re-balance itself once the balancer is re-enabled. You must be careful that the balancer is both stopped and disabled, and not just one or the other.
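As a minimal sketch of that check (the mongos hostname is illustrative), the balancer can be disabled and then verified to have actually stopped running before the backup begins:

# Disable the balancer and wait until no balancing round is still running
mongo router.example.com:27017 <<'EOF'
sh.setBalancerState(false)           // disable the balancer
while (sh.isBalancerRunning()) {     // wait for any in-flight migration round to finish
    print("balancer still running...");
    sleep(1000);
}
print("balancer enabled: " + sh.getBalancerState());   // should print false
EOF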
It is possible that you run into the problem of a stuck balancer. We found that the best way to deal with that is to find the mongos instance that is causing the trouble and bring it down softly. This is accomplished through the following procedure:

1. Connect to one of the mongos instances through the MongoDB shell:
mongo ipaddr:27017
2. Now that you are in the MongoDB shell, switch to the config database and then check the state of the cluster:
use config
sh.status()
3. Near the top of that data printout will be a field called balancer. The data looks like:
balancer:
  Currently enabled: yes
  Currently running: yes
  Balancer lock taken at Thu Oct 08 2015 16:18:45 GMT-0400 (EDT) by ltmngo10:27017:1444334119:1804289383:Balancer:846930886
  Failed balancer rounds in last 5 attempts: 0
  Migration Results for the last 24 hours:
    No recent migrations
4. This tells you whether the balancer is enabled, whether it is running, and when it started. The balancer can be disabled and still be running, because disabling the balancer doesn't stop the running processes but simply waits for them to stop. To disable it, run the command:
sh.setBalancerState(false)
5. You can check again and see when the balancer stops running to begin the backup. If the balancer does not stop running, and the lock start time from above is more than a couple of hours old, you may need to manually fix it.
6. First, try to get MongoDB to take care of it with the command:
sh.stopBalancer()
• This will give you live prompts of the active process to shut down the balancer.
7. In our experience, if that doesn't work, you will need to shut down the server that holds the lock. The status above lists which server holds the lock. It should be a mongos server that can be brought down safely and then brought back up without much consequence. From the bash shell, run:
ps aux | grep mongos
• This will give you the PID of the mongos instance you need to kill.
kill xxxxx
• This will kill the process with the ID xxxxx that you give it.
• Bring it back online and give it some time to time out the lock before you try again (about 15-30 minutes, at most). The default timeout should be 900,000 milliseconds, or 15 minutes.
8. Once the balancer is stopped, the shard data servers are ready to be backed up, but the config server is not.
• To get a clean config server backup, it makes the most sense to completely shut down one of the config servers. We do this because bringing the service down totally guarantees that there will be no reads or writes to that configuration, and we have a true moment-in-time backup. There is also the benefit that the config servers won't try to do any chunk migrations or allow any sort of config changes while one is down, making sure the backup will work.
• This is done by stopping the running mongod service on the designated backup config server with the command:
service mongod stop
• Now we are ready to start the IBM® Spectrum Protect™ backup.

About the Backup process

IBM® Spectrum Protect™ has a built-in method called backup image /dev/vg/logical_volume. On Linux systems, this method runs a backup procedure that backs up the specified logical volume with Linux's built-in Logical Volume Manager (LVM).
Specifically, the LVM creates a snapshot of the logical volume, and IBM® Spectrum Protect™ makes a backup of that snapshot and uploads it to the IBM® Spectrum Protect™ server. The way an LVM snapshot works is to create a copy-on-write clone of the logical volume in the free space of the volume group. What that really means is that the LVM makes a virtual copy of the actual logical volume by creating a virtual drive in the free space of the volume group. The snapshot initially holds only references that point to the data on the real logical volume, so a snapshot of a static dataset would take almost no storage space. When something is about to be written to the original copy, the original data is copied to the snapshot before the new data is written to the actual disk. What that means is that you can keep the instance up and running with new writes happening to the database, all while you take a moment-in-time backup that starts the moment you issue the command. The snapshot always contains the data exactly as it was on the logical volume when you started the snapshot.

It is important to note that the journal files automatically created by MongoDB are also part of this snapshot. The journal files are MongoDB's method of tracking recent changes before they are written to disk. This is especially important in a backup process, because MongoDB holds some data in memory before it flushes it to disk in a batch. By also backing up the journal, we have a record of all the transactions that have occurred on the database but have not been completely written to disk. The snapshot function captures the state of both the database and the journal files at the exact same time to ensure that no data is lost in the process. MongoDB will read from the journal when it is restored and apply all pending or incomplete updates.

There are, of course, some storage considerations to keep in mind, which may vary from instance to instance. Any data that changes during the live backup gets written to the snapshot space, which is taken from the free space of the volume group that the logical volume belongs to. You need to have enough free space in the volume group to cover all changes that may occur during the backup window. In the worst case, you would need an amount of free space in the volume group equal to the space the logical volume occupies. This, however, is most often not going to be the case, depending on what percentage of your database gets overwritten during the time a backup may take. In addition, that space can be added to the volume group for the backup and then removed afterward.

The extra disk space needed is made up for by the ability to do live backups. We don't have to stop any of our database servers for a backup, so there should be no significant performance impact while the backup is in progress. The primary servers in each shard will still operate at full speed, with only a reduction in read capacity for any queries being processed by the backup server. This can be eliminated as well with some special replica member modes, like 'hidden', that prevent reads from happening on the backup server. It's also worth noting that this is the backup method that MongoDB currently recommends, and one that many of the cloud backup companies currently use.
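Before starting a backup, the free space available for the copy-on-write snapshot can be checked with standard LVM commands (a quick sketch; the volume group and logical volume names are illustrative):

# Check how much free space the volume group has for the snapshot
vgs rhel7_system                        # the VFree column shows unallocated space in the volume group
lvs rhel7_system                        # confirms the size of the data logical volume being backed up
vgdisplay rhel7_system | grep -i free   # alternative view: free physical extents / size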
There are a few backup and dump tools provided by MongoDB, but by their own admission they are too slow to scale past a very small database of a couple of gigabytes. In this case, IBM® Spectrum Protect™ provides a much cleaner way of implementing the LVM snapshot backup, and it handles the storage and safekeeping of the files on the backend server.

As for the actual IBM® Spectrum Protect™ commands, they are very simple. On all the backup servers, at the same time, issue the command:
dsmc backup image /dev/volumeGroup/data
This is the basic command that initiates the backup from the command line on the client servers, but there are a few parts and options you should know. First, dsmc is the IBM® Spectrum Protect™ command on the Linux command line that the program responds to; all of our IBM® Spectrum Protect™ commands run from the shell begin with dsmc. The next two words, backup image, tell IBM® Spectrum Protect™ what to do (backup) and what technique to use (image). The last part of this command is the logical volume that you would like to back up. It is important to note that you must use the device path (/dev/volume_group/volume) and not just the name of the mount point (like /data). Using the file system mount point (e.g. /data) will be interpreted differently by IBM® Spectrum Protect™, and it will try to do a file system backup instead of an LVM logical volume snapshot.

There are also a couple of other options that are useful and worth knowing. The first one is snapshotcachesize. It allows you to set the percentage of the total size of the logical volume that IBM® Spectrum Protect™ will reserve from the volume group. So if your /data logical volume is 100GB, passing 10 to this option will cause LVM to reserve 10GB from the volume group for the snapshot. If more than 10GB of the original 100GB gets changed during the backup, the IBM® Spectrum Protect™ agent will return an error and the backup will fail. In the event of a mission-critical, 100% time-sensitive backup, you will want to have 100% of the logical volume's size free and unallocated in the volume group (in this example: 100GB).
dsmc backup image -snapshotcachesize=10 /dev/volumeGroup/data
Remember that the default is 100%, so if you do not use this option, the volume group free space will have to be greater than or equal to the actual size of the logical volume that you are backing up. In our testing, we used a value of about 25 and never came close to running out of space. This should be monitored and set on a case-by-case basis. You will also notice that the option is joined to its value by an equal sign, which is different from the Linux norm. This is what the command would look like designating a 55% cache size:
dsmc backup image -snapshotcachesize=55 /dev/volumeGroup/data
One last important command line option is compression. This option allows you to enable compression of the backup before it is uploaded to the IBM® Spectrum Protect™ server. We will get into the details of compression shortly, but for now know that the command is as follows:
dsmc backup image -compression=yes /dev/volumeGroup/data
That is all there is to backing up our MongoDB cluster. With a properly designed LVM layout and a well-planned configuration, IBM® Spectrum Protect™ can take care of the backup in one command. You should be aware that this is a point-in-time backup.
This is a backup that is meant to capture a perfect copy of the database at the instant the backup process is started. Even though the MongoDB data servers are live and taking reads and writes, the data that is backed up is the data that is on disk at the moment the backup is started. All data that comes in or gets removed between the start and finish of the backup process will be reflected in the current state of the server, but not in the backup data.

Executing the commands

There are a few different ways that you can go about executing the commands to perform the backup process. There are a few different commands you need to issue to MongoDB, as well as an IBM® Spectrum Protect™ command. The exact method for scripting this will come down to the scripting preferences and skill set of the database administrator. One very important factor in this scripting is the need for all of the IBM® Spectrum Protect™ backup processes to start at nearly the same time. For the point-in-time snapshot to work, you don't want any data to come into any of the data servers that is not going to be recognized in the config server backup; such data would be orphaned in the restored database with no reference to it in the config server, making it useless. To avoid that as much as possible, there needs to be reliable synchronization. There are a few methods that we believe will satisfy this need:

• Cron and shell scripts - Scripting this out in a shell script and then running a synced cron job on all the backup machines is the classic way to handle Linux administration. You may need to observe the status of the IBM® Spectrum Protect™ service to make sure that the backup happens at the same time on all the machines and that it is relatively error free.
• Issuing the MongoDB commands - Issuing the shutdown and startup commands to the running database can be a little tricky, but there are quite a few ways to do it. You can write the shutdown commands in JavaScript® (the native language of the MongoDB shell) and pass that JavaScript® file into the MongoDB shell through bash. There is also a Python driver, called pymongo, if that is your preferred scripting method; you can do just about anything you need to manage MongoDB from the pymongo driver. There are also drivers in just about every other language that you could use.
• Automation software - This includes software like Chef, Puppet, Ansible, and SaltStack. Many companies already use these services to manage their infrastructure, and there are many great benefits to using them for MongoDB. MongoDB scales by adding new machines, shards, and replicas; the cluster can get complex quickly, and you don't want to manage that by hand. But specifically in terms of backups, these services give you a single point from which to manage the backup. This allows you to synchronize all the servers through the automation software.

We chose to use Ansible in this paper for building and backing up our MongoDB clusters. Not only was it flexible, but we were also able to go from using it for the first time to having a working MongoDB script in a few days.
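To make the sequence concrete, the following shell script is a minimal sketch of one possible cron-driven orchestration. The hostnames, volume group name, and list of backup members are illustrative and would normally be generated by your automation tooling; this is not the Ansible playbook used in our testing:

#!/bin/bash
# Sketch of the backup sequence: stop balancer, stop one config server,
# run the image backups in parallel, then bring everything back.

ROUTER=router.example.com:27017                            # any mongos
BACKUP_CONFIG_SERVER=cfg3.example.com                      # config server chosen for backup
BACKUP_MEMBERS="shard1c.example.com shard2c.example.com"   # priority-0 backup replicas

# 1. Disable the balancer and wait until it stops running
mongo $ROUTER --eval 'sh.setBalancerState(false); while (sh.isBalancerRunning()) { sleep(1000); }'

# 2. Stop the designated config server so the cluster metadata cannot change
ssh root@$BACKUP_CONFIG_SERVER 'service mongod stop'

# 3. Start the image backups at (nearly) the same time on every backup member
for host in $BACKUP_MEMBERS $BACKUP_CONFIG_SERVER; do
    ssh root@$host 'dsmc backup image -snapshotcachesize=25 /dev/volumeGroup/data' &
done
wait   # block until every dsmc backup has finished

# 4. Restart the config server and re-enable the balancer
ssh root@$BACKUP_CONFIG_SERVER 'service mongod start'
mongo $ROUTER --eval 'sh.setBalancerState(true)'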
Restoring the database

Once again, when restoring the database, the IBM® Spectrum Protect™ command itself is very simple; it is everything else that takes some setup.

1. Assumptions

You are trying to recover from a point-in-time snapshot, either onto fresh servers or onto servers that previously ran MongoDB and whose data is corrupted or gone. Either way, the process is nearly identical.

2. Important Considerations

There are a few design features of MongoDB that will affect how we restore the backups. To review what was already covered:

• There are always three config servers in a production cluster, and they are essentially identical.
• There are multiple shards that each hold a slice of the whole MongoDB database.
• Each shard is replicated, which means there are multiple MongoDB instances that hold identical data.
• The mongos server instances do not persist any data and only need a configuration file pointing to the MongoDB config servers in order to run. The config servers hold the sharding rules and the configuration for most of the cluster. The config servers also hold all the metadata describing where the actual data is stored.
• The MongoDB data instances store the information about their own primary and secondary servers.
• MongoDB servers are identified by hostname.

So you need three things to fully restore a system: the configuration file, the data, and the hostname. These facts play a determining role in deciding how to back up and restore, and there are a few important conclusions we can draw from them.

First, if a server is identical to another server, you do not need to take backups of each server and restore them individually. Instead, take a backup of one of the identical servers; when you restore it, you propagate it to all the identical servers in the cluster.

Secondly, since all the configuration files are part of the backup, a well-planned and well-executed restore brings back the old configuration and requires no extra configuration by the end user. In the end, you just power the restored machines on, and the cluster is just as it was when it was backed up. This makes restorations much quicker and much simpler, especially at scale. This can be done by keeping the configuration file in the logical volume with the backup data, or by dynamically creating the configuration files with one of the automation platforms mentioned above.

Finally, because the servers are recognized by hostname, it is easy to replace hardware and restore to entirely different machines. If you need to change some or all of the hardware when doing a restore, it is as simple as changing the hostnames on the new machines to match what was in the original configuration. This way your entire configuration is intact and MongoDB can continue to communicate across the cluster over the network.
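For example, if a replacement machine is taking over for a failed data server, its hostname can be aligned with the original configuration before any restore is attempted. The hostname and address below are purely illustrative:

# Hypothetical replacement machine taking over the identity of a failed server
hostnamectl set-hostname ltmngo04
# Make sure the other cluster members can resolve the name, via DNS or /etc/hosts
echo "10.20.30.44  ltmngo04" >> /etc/hosts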
3. Configuring the Server

You need to get your server into the exact, or a very similar, state it was in before MongoDB crashed. This is why it is highly suggested to use automation software to define your infrastructure as code; it makes it very easy to deploy a structurally identical server. In any case, you need to make sure certain steps are completed before running any IBM® Spectrum Protect™ restore functions. These steps are very nearly identical to the steps taken to originally set up the server, and they must be performed on every server that will be part of the restored cluster.

• Make sure that MongoDB is installed and that the identical configuration file is in the right place. This includes making sure all of the directories needed for logging and PID files exist. These files should be owned by the user who will be running the mongod or mongos instance; if they are not, use chown so the mongo process can access them.

• Make sure you create a logical volume with at least as much space as the original logical volume that was backed up (this can be determined from the size of the backup image):

lvcreate -L 10G -n data rhel7_system

• There may be a prompt asking if you want to overwrite a filesystem block; respond with yes. The underlying filesystem that we need is on the image we are about to restore.

• IBM® Spectrum Protect™ requires an allocated logical volume to restore the backup into. It will not restore to free volume group space; it needs the shell of the logical volume.

• You cannot restore to a logical volume smaller than the backup image; if you try, the restore will fail. A larger logical volume will work, but only if you refrain from building a filesystem on the newly created logical volume.

• If you create a filesystem, specifically an XFS filesystem, on the newly created shell logical volume and try to restore a smaller backup image, the IBM® Spectrum Protect™ agent will try to shrink the filesystem so the two match. The problem is that XFS has no filesystem shrink command, so the restore fails. For this type of restoration from a full LVM snapshot you do not need to build a filesystem on the target logical volume, and doing so should be avoided because of the errors described here.

• Make sure that the mount point exists for attaching the logical volume once it is restored. If it does not, create it (e.g. mkdir /data).

• Make sure that the IBM® Spectrum Protect™ agent is installed on the machine, in a directory different from the one you will be restoring MongoDB onto (e.g. if IBM® Spectrum Protect™ is installed in /opt, you should not be trying to restore the /opt logical volume).

• If you are using different hardware, make sure you change the hostname to match a hostname in the configuration, and that the host resides in the replica set of the image you are about to restore.

4. Running the restoration command

• If you are restoring to the same machine that the backup was originally taken from, use the command:

dsmc restore image /dev/rhel7_system/data

• If you wish to restore the IBM® Spectrum Protect™ image to a different machine, you first need to authorize the new machine to access the backup images of the original. To do that:

• Access the server where the backup originated and enter the IBM® Spectrum Protect™ shell with the terminal command dsmc. Your prompt should now be tsm>.

• From there, you can query which nodes have access with the q access command, or view all the backed-up files with the q files command. What you want to do is run the command set access backup "*" nodeName. This gives any node with the given nodeName access to the backup files of the current node.

• Now that the target restoration server can access the backup, restore on it with the command:

dsmc restore image -fromnode=LTMNGO09 /dev/rhel7_system/data /dev/rhel7_system/data

• Now that the logical volume is restored, it needs to be remounted to the filesystem. Run mount /dev/mapper/<volumeGroup>-<name> <mountPoint> (e.g. mount /dev/mapper/rhel7_system-data /data).

• Once you have done that, you can start the MongoDB instance on that server and try to connect to the local MongoDB shell with mongo --port <port>.
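To recap the flow for a cross-node restore, a minimal sketch using the same example names as above (the rhel7_system volume group, LTMNGO09 as the node that took the backup, and the default mongod user) might look like this; adjust every name to your own environment:

# 1. Create the shell logical volume to restore into (do not build a filesystem on it)
lvcreate -L 10G -n data rhel7_system

# 2. On the node that owns the backup, grant access (run at the tsm> prompt inside dsmc):
#        set access backup "*" nodeName

# 3. On the target server, pull the image from the original node's backup
dsmc restore image -fromnode=LTMNGO09 /dev/rhel7_system/data /dev/rhel7_system/data

# 4. Mount the restored logical volume and hand the data back to the MongoDB user
mount /dev/mapper/rhel7_system-data /data
chown -R mongod:mongod /data/mongo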
Once you have verified that the system is up and running with a proper configuration, stop the MongoDB instance.

This is where the restoration process can differ slightly. Because we have identical servers, we only took a backup image of one of the replicas. Now that we are restoring, we need to restore that one image back to all of the replicas. The two easiest and simplest ways of doing this are to either:

1. Restore to one server using IBM® Spectrum Protect™ and use a copy command to move the files from the restored server to all of its replicas, or

2. Install the IBM® Spectrum Protect™ agents on all the machines and simultaneously restore the image from the IBM® Spectrum Protect™ server to all the replicated servers. This requires the aforementioned step of giving every other node in the replica set access to the backup.

The advantage of copying files after the restore is needing fewer IBM® Spectrum Protect™ agents. Since we only take backups from one machine per replica set, most of the IBM® Spectrum Protect™ agents would be installed strictly to do restores and never any backups. When copying files from the restored backup server to the other replica set members, the IBM® Spectrum Protect™ server does less work, and the IBM® Spectrum Protect™ agents only need to be installed on the backup servers. Depending on the pricing package of IBM® Spectrum Protect™, this can be a factor, especially at scale. Of course, you have to devise a way to copy and distribute the files from the restored server to the others, so a little more customization is required.

Using the IBM® Spectrum Protect™ restore function on all of your servers can be a little simpler. However, there is more overhead in having an IBM® Spectrum Protect™ agent on every machine in the cluster. There is also added strain on the IBM® Spectrum Protect™ server, because every single node in the MongoDB cluster has to restore the LVM image directly from it. For this paper, we used both methods and found that both were satisfactory.

Method 1 - Copying the files

Restore to the server with the IBM® Spectrum Protect™ agent as explained above. Then go into the data directory of the restored volume and scp the data to the other servers:

scp -r /data/mongo/….. hostname1:/ (e.g. scp -r /data/mongo root@ltmgo04:/data)

Then make sure you chown the files when they arrive at the other server, so that they are owned by the user who runs MongoDB (by default, mongod). You could also use rsync or some other file copying/synchronization method to pass the files between the servers.

Method 2 - Using IBM® Spectrum Protect™ on all servers

For this method, you need to install the IBM® Spectrum Protect™ agent on all the servers and run the set access command given above. It needs to be run on every backup machine for every member of its own replica set, so that each member has access to that IBM® Spectrum Protect™ image. Once that is done, you use the remote restore method, also given above, to restore the logical volume on each member, following all of the restore steps above. Then you should chown the data directory to ensure it is owned by the user that runs MongoDB.

This restoration process is the same for all of the MongoDB data servers and the MongoDB config servers. The only restoration the mongos servers need is for the configuration file to be put on the server they will run from. Now that we know how to get all the individual servers back up, we need to make sure we bring the cluster up in the right sequence.
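If Method 1 is scripted, the copy-and-chown step for the remaining replica set members might look like the following minimal sketch. The hostnames are placeholders; the /data/mongo path and the mongod user are the defaults used in our examples:

# Hypothetical distribution of the restored data to the other replica set members
for host in ltmngo05 ltmngo06; do
    scp -r /data/mongo root@${host}:/data/
    ssh root@${host} "chown -R mongod:mongod /data/mongo"
done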
At this point in the restore, every MongoDB instance should have all of its data on it and have been individually tested to make sure it comes up. The MongoDB services should then have been shut off again, leaving all of the MongoDB services currently down. Moving forward, we will start the MongoDB services and keep them on, barring any errors. This is the launch sequence for restoring the MongoDB instances, and it is just about the same as the original startup.

First, start each member of a single replica set. You want to make sure that there is a primary and a secondary and that all the members can communicate with each other and recognize themselves as part of the same replica set. Do this with all the different replica sets, leaving them on and running as you move on to the next.

Second, once all the replica sets are active, move on to the config servers. These must all be started before any of the mongos instances. Once all three are up and running, we can start to bring up the mongos instances. Make sure you never change the hostnames of the config servers; even if you restored to new config server machines, you should change their hostnames to match what was in the original mongos configuration.

Start by bringing up one mongos instance and checking its log to make sure it can connect to all of the config server instances. Once you see that, connect to the MongoDB shell through the mongos instance. From there you can run shard tests, look at the data, and make sure everything is there. If everything is indeed working, you have successfully restored a MongoDB backup and are running an active cluster. Bring up as many mongos instances as you want and enjoy your freshly restored cluster.

5. Where to put the data

As a result of various factors having to do with the backup procedure, we determined that the data portion of MongoDB should be kept in a separate logical volume, mounted at the /data directory. That logical volume can reside in an existing volume group, but it must be its own logical volume. We do not want to waste any space when we do our backup, and we want to be able to mount it with ease in a known location.

6. Testing

For our testing process, we needed to make sure that our backup and restore process worked efficiently while maintaining data integrity. We also wanted to collect some performance metrics to make sure that our backup servers, as well as the whole MongoDB cluster, continued to operate efficiently. To do this, we ran two different test drivers against the database.

The first is the AcmeAir driver introduced earlier in the paper. It was chosen because it is a robust test driver for MongoDB that uses multiple collections. It also has JMeter drivers to vary the amount of traffic generated per second against the database, which allowed us to keep track of overall cluster speeds before, after, and throughout the backup process.

The second is a custom Python script made specifically to test the data integrity of the database. It writes from a predefined JSON data set and inserts certain markers into each entry written to MongoDB, including the order in which the items were written as well as an exact timestamp of each item being added. By doing this, we were able to compare the data that was in the database after the restore to the timestamps and numbering recorded by the Python script.
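As an illustration of this kind of integrity check, the newest marker in the restored database can be inspected directly from the mongo shell and compared against the recorded backup start time. The host, database, collection, and field names below are hypothetical stand-ins for whatever the test driver actually uses:

# Hypothetical check: print the sequence number and timestamp of the newest marker document
mongo --host ltmngo01 --port 27017 --eval '
    var last = db.getSiblingDB("integrity").markers.find().sort({seq: -1}).limit(1).next();
    print("last seq=" + last.seq + ", written at " + last.ts);'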
We were able to verify the integrity of the data by comparing the recorded start time of the backup process to the last entry in the restored MongoDB cluster. If those times matched, we knew the point-in-time backup had succeeded. In addition, we were able to inspect the numbering sequence to confirm there were no gaps in the data, such as having entry 200 and entry 202 but not entry 201. Using these techniques, we were able to drive various workloads against the servers, all while verifying the integrity of the workload.

We start with a baseline test of running a backup, with compression enabled, while no traffic is being driven to the MongoDB environment. The graph shows the CPU idle percentages, followed by the standard output from the IBM® Spectrum Protect™ backup agent after calling a backup, which reports the elapsed time as well as the compression percentages and network transfer speeds.

Figure 2: CPU usage example #1: Backup with compression. The compressed backup starts at 12:15:01 and ends at 12:18:45.

Figure 3: CPU usage example #2: Backup without compression. For a direct comparison, this is the same backup procedure on the exact same MongoDB instances with compression disabled. The non-compressed backup starts at 12:30:01 and ends at 12:33:01.

Figure 4: Workload baseline, with 2 drivers pulling from 2 separate sharded databases, no backups.

Figure 5: During the backup with active workloads and compression. The compressed backup starts at 11:30:01 and ends at 11:33:51.

Figure 6: During the backup with active workloads and no compression. The non-compressed backup starts at 11:45:04 and ends at 11:48:11.

9. Performance considerations

Considerations to be aware of during the backup process:

Time - During our testing, the backup of a consistent, set amount of MongoDB server TSM image backup data per server required between 2.5 and 3.5 minutes, depending on network speeds. These backups are done concurrently across all shards, but at a much larger scale the elapsed time could become an issue. However, there is the option of adding more shards to reduce the size of an individual backup. In addition, the elapsed time is not a major factor because there is no degradation in MongoDB cluster performance during the backup.

CPU Performance - Additional CPU utilization can be observed during the process. An uncompressed backup saw the %idle value drop by 10-20 points during the 2.5 to 3.5 minutes it ran. The most extreme case was with compression turned on: we saw some machines go to 0% idle for short periods of time (around 30 seconds) during a compressed backup, which is certainly not ideal. These machines are live when the backup occurs and may be called upon to do lookups, and such high CPU utilization can cause a bottleneck in MongoDB. However, we noticed no discernible drop in MongoDB cluster performance during the backup with compression. It is a tradeoff between CPU on one side and network traffic and storage on the other, and which method is best should be decided on a case-by-case basis.
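The %idle behavior described above can be observed with any standard Linux monitoring tool during the backup window; as one possible sketch, sar from the sysstat package can sample CPU utilization once per second for the duration of a typical backup:

# Sample CPU utilization (including %idle) once per second for five minutes while the backup runs
sar -u 1 300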
It is important to remember that MongoDB with the MMAPv1 storage engine is not typically bound by CPU utilization; the stressors are more likely to be memory space and disk I/O. Given that in our test cases the backup server was a secondary replica that handles only occasional reads and no direct writes, the CPU cycles used by the backup process should be negligible. By design, the backup takes place on a backup server that is under less load than the primary, so it can complete the backup quickly while remaining available to handle requests during a usage spike, and it stays current with the rest of the database without needing to catch up after the backup.

MongoDB Performance - The backups we take leave all the servers, including the backup servers, active and able to perform normal tasks. This should leave the performance of MongoDB unchanged except in the most stressful of scenarios. The biggest performance tradeoff during the backup process is the disabling of one of the config servers, which means the mongos servers must retrieve all of their metadata from two servers rather than three. However, version 3.0 of MongoDB will not scale past three config servers, so there is no performance gain in having more config servers; they are a redundancy measure. Once the config server comes back online, it goes into a sync mode to catch up on any metadata changes that occurred while it was offline. This normally completes in seconds and should cause no issues at all.

Compression vs. non-compression backups

For the System p AIX and System z IBM® Spectrum Protect™ servers, a series of image backups was performed, both compressed and non-compressed, across the two TSM 6.3.5.0 servers, the two IBM® Spectrum Protect™ 7.1.3.0 servers, and three levels of TSM 7.1 clients. The compressed image backups consistently required a moderate amount of additional time to complete; the non-compressed image backups were consistently faster in all TSM/Spectrum Protect server and client combinations.

10. Summary

MongoDB continues to grow in popularity and will find widespread use in Linux on z environments. IBM® Spectrum Protect™, formerly Tivoli® Storage Manager, is a mature, proven, and strategic world-class data protection platform that helps organizations of all sizes meet their data protection requirements. This paper documents methods to back up MongoDB data with IBM® Spectrum Protect™. The IBM® Spectrum Protect™ backup function is a good option for data protection, particularly if the customer already has IBM® Spectrum Protect™ established in their data center.

The backup procedure reviewed here recommends using IBM® Spectrum Protect™ to create Linux file system snapshots against a MongoDB replica set member dedicated as a backup server. The minimum requirements to set up the backup environment are defined above, and the steps to back up the data and journal files are outlined. The backup can be done concurrently with application writes to the data, with no application downtime.

Elapsed time and CPU utilization measurements were captured during the testing using a consistent, set amount of MongoDB server TSM image backup data, consisting of MongoDB data and journal files. Measurements were taken with compression disabled and enabled against the same amount of source data. CPU usage was very high during the compression tests.
Each customer will have to determine whether this trade-off is justified in their own environment. Measurements were also taken with and without workload, and with one and two data shards.

IBM® Spectrum Protect™ restore of a MongoDB database was also tested and documented. The steps to prepare the target file systems are included, along with the IBM® Spectrum Protect™ commands to initiate the restore. The recommended restore option is to 1) recover to one replica set member using IBM® Spectrum Protect™, and 2) use Linux commands to copy the data to the other members of the replica set. Another option is to 1) have IBM® Spectrum Protect™ agents on all servers and 2) use IBM® Spectrum Protect™ to restore to all of them. This paper therefore provides multiple options, showing how flexible the solution can be, and we strive to let customers determine what will work best for their environments.

Copyright IBM Corporation 2016
IBM Systems
Route 100
Somers, New York 10589
U.S.A.

Produced in the United States of America, 01/2016
All Rights Reserved

IBM, IBM logo, ECKD, HiperSockets, z Systems, EC12, z13, System p, Spectrum Protect, Tivoli, and WebSphere are trademarks or registered trademarks of the International Business Machines Corporation.

Microsoft, Windows, Windows NT, and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries, or both.

Java and all Java based trademarks and logos are trademarks or registered trademarks of Oracle and/or its affiliates.

Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both.

All statements regarding IBM's future direction and intent are subject to change or withdrawal without notice, and represent goals and objectives only.

Performance is in Internal Throughput Rate (ITR) ratio based on measurements and projections using standard IBM benchmarks in a controlled environment. The actual throughput that any user will experience will vary depending upon considerations such as the amount of multiprogramming in the user's job stream, the I/O configuration, the storage configuration, and the workload processed. Therefore, no assurance can be given that an individual user will achieve throughput improvements equivalent to the performance ratios stated here.