Back Up & Disaster Recovery
Physical Facilities & Security
Backup Solutions
Power Protection
Data Protection
System Security
Data Availability & Reliability
Mirrored Disks & Servers
RAID 5 Striped Disk
Separate Paging Disk
Secondary Domain Controller
Hi-Speed Disk/Tape Backup
Disaster Recovery Monitoring
Off-Site Backup Storage
Disaster Recovery
Disaster recovery is a multifaceted problem. It depends on many important criteria that must be dealt with as a living system, not as separate items on a list. After many years of experience in the fault-tolerance and disaster recovery industry, I have come to learn that most companies are really a house of cards when it comes to dealing with disaster.
Disaster can strike on many fronts. For the purposes of this document I am going to assume that only a single disaster occurs at one time. This is not a very accurate way to look at the problem, because most major disaster events occur simultaneously. A bad storm knocks out the power and causes water damage. A minor earthquake knocks out the power, gas, and water, and causes a fire. As you can see, many disasters can occur at the same time and, as a general rule, do.
As humans we cannot deal with everything going wrong at once. The trick is to build a physical, electrical, and logical system in layers, each layer reinforcing the layers above and below. If the system is built well then multiple disasters are survivable.
Physical Facilities & Security
We provide unparalleled installation support. All you have to do is order the system; the rest is up to us. We will transfer your data and applications from your old system, or install any applications you may have. We can usually get you any application software you want at prices lower than you could purchase it for yourself.
We will set up your system, printer, camera, and any other item you require. We will also do it when it is convenient for you, not us. When the system is installed we will instruct you in how to use it.
Back Up Solutions
Backup is a very significant operation that must be planned and well thought out. Your entire organization’s future existence may depend on it. We provide a series of backup solutions, from very inexpensive flash memory solutions to very complex and expensive NAS (Network Attached Storage) solutions with on-line secondary backup to guarantee off-site storage.
There are many industries that require off-site storage as part of a backup procedure. Compliance can be audited, and your auditors may actually have to report your non-compliance. HIPAA, the SEC, the Fed, state insurance regulators, and many other government or industry regulators require off-site backup, mail archiving, and system backup of client, customer, or critical financial data.
Your firm may not be in compliance with your industry’s regulations.
Power Protection
The most common form of power protection is a powerful UPS (uninterruptible power supply). This can be the most effective way to protect your data. Servers do not like to lose power; they have a very good chance of corrupting data when they take an unexpected power failure. The UPS provides a buffer period that is configured to manage both short and long power outages.
When a short power outage occurs the UPS is programmed to ride out the power outage. This means that a short power outage has no effect on the server data or operation. For long power outages the UPS waits until the power outage exceeds the short power outage time. It then sends a shutdown signal to the servers. This causes the servers to go through a normal shut down process.
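The ride-out and shutdown behavior described above can be sketched as a small decision function. This is only an illustration: the function name `ups_action` and the 120-second default for `ride_out_seconds` are hypothetical, and real UPS management software has its own settings for the short-outage window.

```python
def ups_action(outage_seconds: float, ride_out_seconds: float = 120.0) -> str:
    """Return what the UPS does for an outage that has lasted outage_seconds.

    ride_out_seconds is an assumed, configurable short-outage window.
    """
    if outage_seconds <= 0:
        return "normal"           # utility power is present
    if outage_seconds <= ride_out_seconds:
        return "ride_out"         # servers run from battery, nothing else happens
    return "signal_shutdown"      # outage exceeded the window: ask servers to shut down


for elapsed in (0, 30, 120, 121):
    print(elapsed, ups_action(elapsed))
```

The point of the threshold is that the servers only ever see a normal shutdown, never an unexpected loss of power.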
The ideal Primary Domain Controller (PDC) is also protected with parallel 400+ Watt power supplies. The system needs only about 400 Watts of power to run all the server services. The PDC is configured with two 400+ Watt power supplies that normally run at 200 Watts each. If one power supply fails, the other takes on the additional 200 Watt load and sounds an alarm. Since each power supply is rated for at least 400 Watts, the surviving power supply can continue to operate indefinitely. Each power supply is hot swappable, so replacing the failed supply does not require a system shutdown.
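The load-sharing arithmetic above can be checked with a few lines of code. This is a minimal sketch that assumes the load is split evenly between working supplies; the names `load_per_supply`, `RATED_WATTS`, and `SYSTEM_LOAD_WATTS` are made up for the example.

```python
RATED_WATTS = 400.0        # rating of each power supply (from the text)
SYSTEM_LOAD_WATTS = 400.0  # power the server needs (from the text)


def load_per_supply(total_load_watts: float, working_supplies: int) -> float:
    """Split the load evenly across the supplies that are still working."""
    if working_supplies < 1:
        raise ValueError("no working power supply")
    return total_load_watts / working_supplies


both_working = load_per_supply(SYSTEM_LOAD_WATTS, 2)  # 200 Watts each
one_failed = load_per_supply(SYSTEM_LOAD_WATTS, 1)    # survivor carries it all

print(both_working, one_failed, one_failed <= RATED_WATTS)
```

Because the full load still fits within one supply's rating, a single failure costs redundancy but not uptime.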
Data Protection
Data protection must be managed by several methods in order to have multiple chances to recover the company’s data in the event of a disaster. Disasters that affect data are far more common than natural disasters that affect facilities and equipment. Disasters that affect data are usually due to a hardware malfunction or failure, a security breach that can be used to change or destroy data, a software failure that can corrupt data, or an outside computer virus attack targeted at the company’s data.
System Security
All access to the system is controlled by the Windows system security module. For the layman, Windows security is tantamount to having a pretty good lock on your front door. This does not mean that someone who is schooled in the art of breaking system security (a hacker) cannot break in, but they would need specialized knowledge and tools to do it. That is why the DOD (Department of Defense) requires that all computers be guarded with a locked door and an external physical security device, like a “SmartCard Reader”, to log on.
System security usually pertains to user ID and password protection as administered by the system administrator. The purpose of this is to protect the company from unauthorized use by outsiders or company employees. The security is set up so that only a restricted number of people can gain access to the entire system. Normal users are restricted to specific areas of the system depending on their job function. These restrictions are set up through the system groups they are assigned to or the logon scripts that specify the disk drive letters they have access to. The same restrictions carry through to the printers and servers that are available to them. All passwords are issued by the administrator and kept in a secure location. This is meant to ensure a password policy that is difficult to break through.
The company must publish a policy that notifies all employees that data contained on company systems is the property of the company and is subject to inspection at any time. No data or files are considered private; this includes e-mail, any electronic document, any paper document stored in company facilities, and any company document stored off-site.
Data Availability & Reliability
There is a trade-off between system performance, reliability, and availability. When speaking about data reliability and availability, we refer to how well the system can operate when a disk subsystem hardware error occurs. In order to keep the system operating, the OS has to be 100% reliable and the data has to be 100% available. This means that if the OS disk fails, there is a 100% accurate copy of the OS that is ready to be used with very little effort. RAID 1, or disk mirroring, provides such an access method.
The data, on the other hand, needs to be available as long as the operating system is functioning. A disk subsystem that tolerates a complete hard drive failure without data interruption is required. The data must also be stored on a disk subsystem that makes access very fast and always available. RAID 5, or data striping with parity, provides such an access method.
To increase system performance and remove any write contention on the OS disk drives, we move the paging functions of the OS to a separate paging disk. If this paging disk fails, the system automatically switches the paging functions to the mirrored OS disks. This is called automatic fail-over.
Systems designed to operate in the above fashion will continue to operate in most single or multiple failure modes. You would have to experience an OS disk failure and a multiple data disk failure before you actually noticed a server availability failure. If the OS disk, the paging disk, or a data disk failed, you would not lose data or lose access to the server. In fact, the system would continue to operate in a slightly degraded mode, unnoticeable to the users. Even an OS disk failure would only require a reboot to bring the system back.
The only thing you can do to make the above system more reliable is to duplicate the system architecture and spread the server operations across multiple servers built the same way. If you are willing to live with a lower level of data throughput and capacity, you can trade performance for reliability by mirroring all drives on the system. RAID 1 disk access is generally more expensive and more reliable than RAID 5 disk access. RAID stands for Redundant Array of Inexpensive Disks. RAID 5 disk subsystems trade absolute reliability for a much higher level of data availability and an overall improvement in disk subsystem performance.
RAID 1 is disk mirroring: maintaining a live copy that is exact up to the last write operation, using non-synchronized disks. RAID 1 attains its reliability by having an active copy of the data on a separate disk spindle. This high level of reliability comes at the expense of capacity. You need twice the disk capacity your application requires to support RAID 1.
RAID 3 is data striping at the byte level with a dedicated parity disk. This has the effect of spreading the data across multiple disk spindles. Using very smart hardware and software, the system uses all the disks holding data to complete each disk operation. This arrangement yields high availability and high disk capacity, but lower performance than mirroring for small, random requests, because every operation involves all the disks and the array can service only one request at a time.
RAID 5 is block-level data striping with parity. The data is spread across the disk array to give the disk controller the opportunity to apply all the disk access hardware to the task of accessing the data. If you had 4 data disks, this would allow the system to access 4 disk sectors at the same time. The reliability in RAID 3 and RAID 5 comes from redundant parity information: RAID 3 keeps it on one extra disk, while RAID 5 distributes it across all the drives at a cost of one disk’s worth of capacity. Using this information, which is an exclusive-OR (XOR) of the data blocks in each stripe, the RAID controller can reconstruct the data from any single failing disk in the array.
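To make the reconstruction idea concrete, here is a minimal sketch of XOR parity, the scheme commonly used for single-disk redundancy in striped arrays. The block contents and the helper name `xor_blocks` are made up for the illustration; a real controller does this in hardware on fixed-size stripes.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Four data blocks, one per data disk (illustrative contents).
data = [b"ACCT", b"MAIL", b"DOCS", b"LOGS"]
parity = xor_blocks(data)  # the redundancy information stored by the array

# Simulate losing disk 2 and rebuilding it from the survivors plus parity.
lost = 2
survivors = [block for i, block in enumerate(data) if i != lost]
rebuilt = xor_blocks(survivors + [parity])

print(rebuilt)  # b'DOCS'
```

XOR-ing the surviving blocks with the parity cancels everything except the missing block, which is why one failed disk can be rebuilt but two simultaneous failures cannot.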
RAID 1 Mirrored Disc OS
The most reliable method of storing data on hard drives is to use RAID 1 (disk mirroring). This method of data access allows you to have a complete copy of the hard drive as a live backup. The operating system manages the data copy and makes an exact backup at the hardware level. The only disadvantages are about a 30% degradation in system performance and the need for twice the disk space to store the OS. This is a small price to pay for 100% data reliability and reasonable data availability. This performance degradation can be negated if you set the system up so that 95% of the disk commands to the mirrored disks are read commands. In fact, in this read-mostly scenario you get a performance boost: since there are 2 copies of the operating system, reads can be served from both mirrored disks at the same time.
Mirrored Servers
If you want to get better availability, you can build a set of mirrored servers that gives you a hot stand-by system always available to take the place of the main server. This is a very expensive solution, but if you have critical operations that must be available 24/7/365, this method will satisfy that requirement.
RAID 5 Striped Disk Array
A RAID 5 data access method combines very good reliability with very high data availability. Nothing is more reliable than RAID 1, but in a datastore you really want safety combined with very fast data access. In RAID 5, the system attains very high access speed because all the drives in the disk array work in parallel. This means that if you are using 3, 4, or 5 hard drives, you can get a data transfer rate approaching 3, 4, or 5 times the transfer rate of 1 drive. The data is written in such a way that a portion of the data is written on each disk drive, or striped across the array. The disk subsystem goes after the data by moving the disk read heads on all the drives at the same time. To enhance reliability, one disk’s worth of capacity in the array holds parity information that is used to recreate the data for any failing disk in the array. You would have to lose more than one disk in the array at the same time to lose any data.
The other speed enhancement is a separate disk controller that controls the RAID 5 disk array. These controllers have their own processors and memory arrays that make the data available to the main server processor in a way that is much faster than a normal disk channel. These controllers have a very large memory that acts as a disk cache. The OS does not make requests directly to the hard drives; it talks to the disk cache. This means that the hardware delay caused by the mechanical motion of the read heads is largely hidden whenever the requested data is already in the cache.
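To put numbers on the capacity cost of the redundancy in the RAID 1 and RAID 5 arrangements described above, the following sketch compares usable space. It assumes equal-size disks; the 500 GB disk size and the function names are hypothetical values chosen for the example.

```python
def raid1_usable_gb(disk_gb: float) -> float:
    """A mirrored pair stores one disk's worth of data on two disks."""
    return disk_gb


def raid5_usable_gb(disk_count: int, disk_gb: float) -> float:
    """RAID 5 gives up one disk's worth of capacity to parity."""
    if disk_count < 3:
        raise ValueError("RAID 5 needs at least 3 disks")
    return (disk_count - 1) * disk_gb


DISK_GB = 500.0  # assumed disk size for the illustration

print("RAID 1, 2 disks:", raid1_usable_gb(DISK_GB), "GB usable of", 2 * DISK_GB)
for count in (3, 4, 5):
    usable = raid5_usable_gb(count, DISK_GB)
    print(f"RAID 5, {count} disks:", usable, "GB usable of", count * DISK_GB)
```

Mirroring always costs half the raw capacity, while the parity overhead of RAID 5 shrinks as more disks are added to the array.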
Separate Paging Disk
A paging disk is a sort of scratch pad for the operating system. When your programs use up the available physical memory, the OS uses a concept called virtual memory to make the program think more memory is available. Paging disk space is also used to store information the OS needs to run your program when your program has to be moved out of memory so another task can operate. Normally the OS fails when no paging space is available or when the paging disk fails. Because a separate paging disk is used, the system can always fail over to the OS disk when this happens. You will see a minor degradation in OS performance, but you will keep functioning.
Secondary Domain Controller
A backup (secondary) domain controller is used as a way to manage system security and user access to the data. A backup domain controller keeps a copy of all the security information, user login information, and login startup files. It can also be used as a mail handler or critical data store. The main reason we use a backup domain controller is to give the users a way to log in to the system if the primary domain controller is not available.
Hi-Speed Disc/Tape Backup subsystem
There are many backup methods at many price points. The following is the ultimate backup method. It was originally designed for a tape backup system, but it can be adapted very easily to a disc backup system. Using a backup device like Imation’s RDX-1000 and cartridge disc media, local backup can be used to do complete backups. If you have a large amount of data on each server, you need one backup subsystem per server. If you have more data than the largest backup media capacity, you will have to resort to Network Attached Storage (NAS) or on-line backup.
Ultimate Backup
Making sure the data on the system is protected is the single most important administrative function the server can perform. The fact that the hard drives are protected with mirroring and disk arrays does not take away from the fact that the data must be preserved on different media. The data must be written to tape, and the current tape must be taken off-site to protect it. There are many formulas for saving the system’s data, but one that allows a very high level of protection and at the same time uses a reasonable number of tapes follows.
- 8 Daily media to back up the servers: Monday 1 thru Thursday 1 and Monday 2 thru Thursday 2.
- 4 Friday media: Friday 1, 2, 3 & 4.
- 3 Quarterly media Q1, 2, & 3.
- 2 Yearly Tapes.
- 17 media in all, plus 3 spares to allow for bad media or for special cases.
With 20 pieces of media, the system is preserved in such a way as to enable a complete restore from a single piece of media, with 100% accuracy and consistency, for up to 2 weeks. It also provides a checkpoint to within a week for a year. With the quarterly, Friday, and daily backups, data can be restored on any system with a like media drive. It is important to make sure that you have a commonly available media drive. Using this method, the last backup media is always off-site, stored in a briefcase or other kind of portable storage.
This tape backup method preserves data in the following way.
- 2 weeks of daily backups (Mon-Thu)
- 4 weeks of weekly backups (Friday backups)
- 1 monthly backup (Last Friday of month)
- 3 quarterly backups (Last day of quarter)
- 2 yearly backups (Last day of year)
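The rotation can be sketched as a function that picks the media label for a given date. Several details here are assumptions, not part of the scheme as written: the daily set alternates on the parity of the ISO week number, a fifth Friday in a month wraps back to Friday 1, December 31 uses a yearly media instead of a fourth quarterly one, quarter-end and year-end labels apply even if the date falls on a weekend, and the name `media_for` is hypothetical.

```python
from datetime import date
from typing import Optional

# Quarter-end dates that use a quarterly media (assumption: Dec 31 is yearly).
QUARTER_ENDS = {(3, 31): "Quarterly 1", (6, 30): "Quarterly 2", (9, 30): "Quarterly 3"}
DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday"]


def media_for(day: date) -> Optional[str]:
    """Return the label of the media to use on `day`, or None for no backup."""
    if (day.month, day.day) == (12, 31):
        return f"Yearly {day.year % 2 + 1}"       # two yearly media alternate
    if (day.month, day.day) in QUARTER_ENDS:
        return QUARTER_ENDS[(day.month, day.day)]
    weekday = day.weekday()                       # Monday is 0, Sunday is 6
    if weekday >= 5:
        return None                               # no weekend backup (assumed)
    if weekday == 4:
        # Friday 1-4 by week of the month; a fifth Friday reuses Friday 1.
        return f"Friday {((day.day - 1) // 7) % 4 + 1}"
    week_set = day.isocalendar()[1] % 2 + 1       # alternate daily sets weekly
    return f"{DAY_NAMES[weekday]} {week_set}"


for d in (date(2024, 1, 8), date(2024, 1, 12), date(2024, 6, 30)):
    print(d.isoformat(), media_for(d))
```

Printing the labels for a month ahead is a simple way to produce a schedule that can be taped to the media storage box.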
Simple Backup Method
This method uses seven pieces of backup media. You do a complete backup every day to one piece of media. The media is rotated every day, Monday thru Thursday, then Friday 1. The Monday thru Thursday media is then reused, followed by Friday 2. This way you have two weeks of backup on a reasonable budget.
Inexpensive Backup Method
This method uses two pieces of backup media. You simply rotate the media every day so you always have at least two backup images.
Disaster Recovery Module
The disaster recovery module keeps a copy of the backup set on a remote system somewhere else in the building. This allows a complete data restore very quickly over the network. You still need a local backup, but the disaster recovery module gives you an immediately available backup image from the last successful backup. This module is an optional part of the Veritas backup engine, but it is very valuable and allows the company to recover very quickly in the event of a catastrophic server failure. The Acronis backup engine builds an image on the backup media, and you can recover the complete system from the image. If you store the image on another server, you can get the disaster recovery module features for free.
Off-Site Backup Storage
Off-site storage is an absolutely sure way to preserve your data. There are services that will allow you to back up your entire server data set to an off-site backup server. These servers are accessible over the internet at a very modest price per month. In the event of a system failure, the backup image is shipped to you on a USB drive within 24 hours. When this type of service is used in conjunction with a strict backup procedure, you are sure of never being more than 24 hours away from recovery. While not a “Real-Time” recovery method, it can certainly allow the principals of the company to sleep better.
This method is particularly good when dealing with a data corruption or virus problem that has occurred unnoticed and has been backed up for days or weeks. The perfect example is a file or database that is only used once a quarter. If this file were to become inaccessible, the party or department that needed it would not know it was corrupted or infected until a quarter had passed. By having the off-site storage company keep a month of daily backup images, a quarter of monthly images, a year of quarterly images, and two years of yearly images, you can never find yourself in a position that has no recovery. Being off-site, you have a level of protection that is impossible with any other method.
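The retention tiers described above can be sketched as a rule that decides whether a stored image should still be kept. The exact day counts (31, 92, 366, 731) and the choice of the last day of the month, quarter, or year as the image that gets promoted to the longer tier are assumptions for this example, as is the function name `keep_image`.

```python
from datetime import date, timedelta


def is_month_end(d: date) -> bool:
    """True when d is the last day of its month."""
    return (d + timedelta(days=1)).day == 1


def keep_image(taken: date, today: date) -> bool:
    """Decide whether a backup image taken on `taken` is still retained."""
    age = (today - taken).days
    if age <= 31:
        return True                                   # a month of daily images
    if is_month_end(taken) and age <= 92:
        return True                                   # a quarter of monthly images
    if is_month_end(taken) and taken.month in (3, 6, 9, 12) and age <= 366:
        return True                                   # a year of quarterly images
    if (taken.month, taken.day) == (12, 31) and age <= 731:
        return True                                   # two years of yearly images
    return False


today = date(2024, 6, 15)
for taken in (date(2024, 6, 1), date(2024, 4, 29), date(2023, 9, 30), date(2021, 12, 31)):
    print(taken.isoformat(), keep_image(taken, today))
```

With tiers like these, a file that is only opened once a quarter can still be recovered from an image taken before the corruption occurred.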
The downside of this off-site storage is that all your data, your secure company information, and your personnel records are in someone else’s hands. Normally this data is encrypted and only you know the key. You also have to wait 24 hours to get your recovery data set. This is also very expensive.