Planet Sysadmin               

          blogs for sysadmins, chosen by sysadmins...

May 20, 2013

Sam Ruby

Prosody as a personal xmpp server

Nearly six years ago, I set up a personal Jabber server using ejabberd.  This setup survived the server migration to Ubuntu 8.04 and 10.04.  This past weekend, I attempted to migrate that to a server running 12.04 and all I could get out of it was an erlang crash dump.

A quick scan for successors turned up prosody. Configuration was as simple as adding a VirtualHost and setting allow_registration to true.

May 20, 2013 05:29 PM

my other pc is a cloud

Active Directory List Object Mode

This is something I've been wanting to blog about for a long time, but have been putting it off because I knew it might turn into a long, time-consuming post. Well it's time to bite the bullet and get started.

We were facing a bit of a problem in one of our managed hosting environments. We had this high-volume, multitenant Active Directory being used by dozens of different customers. There was a business requirement in this domain that customers not be able to read from one another's organizational units, for the sake of the mutual privacy of the customers. Things seemed to be working well for a while, but one day, it appeared that customer users logging on to many of the client computers were failing to process Group Policy upon logon:

Event ID: 1101
Source: Userenv
User: NT Authority\System
Description: Windows cannot access the object OU=Customers, DC=contoso, DC=com in Active Directory. The access to the object may be denied. Group Policy processing aborted.

To start troubleshooting, I copied one of the affected user accounts and used it to log in to one of their machines, and I was able to reproduce the issue. Upon trying to update Group Policy with gpupdate.exe, I noticed that the computer configuration was updating fine, while only the user portion of the update failed, and the event 1101 was produced.

The basic layout of the OU structure in the domain was this:

    
CONTOSO.COM
    |
    + Customers (OU)
          |
          + Customer1 (OU)
          |
          + Customer2 (OU)
          |
          + ...

Still using my customer-level user account, I noticed that I was able to browse the contents of my own Customer1 OU, but I was not able to browse the contents of any other OU. The permissions on these OUs had certainly been modified.

In fact, the read permission for the Authenticated Users security group had been removed from the access control list on the Customers OU. That explains the event 1101s and the GPO processing failures. From Microsoft:

[GPO processing fails] when the Group Policy engine cannot read one of the OUs.

The Group Policy engine must be able to read all OUs from the level of the user object or the computer object to the level of the domain root object. Also, the Group Policy engine must be able to read the domain root object and the site object of the computer. This is because these objects may contain links to group policies. If the Group Policy engine cannot read one of these OUs, the events that are mentioned in the "Symptoms" section will be logged.

So in satisfying the business requirement that no customer be allowed to list the contents of another customer's OU, Group Policy processing had been broken. But if we simply give Authenticated Users their read permissions back on the Customers OU, they get to browse all the other customers' OUs as well.

We needed the best of both worlds.

This Microsoft article would lead you to believe that if a security principal just had the Read gpLink and Read gpOptions access control entries, then GPO processing should work fine.

But that's not enough. The four ACEs that were needed on the Customers OU were:

  • Read gpLink
  • Read gpOptions
  • Read cn
  • Read distinguishedName

Now we're making progress, but we're still not out of the woods. Giving Authenticated Users the List Contents permission on the Customers OU would allow them to see the names of all the other customers' OUs, although now they show up as "Unknown" object types and can't have their respective contents listed. But that's a messy solution in my opinion and doesn't fully satisfy the requirement. Customer1 shouldn't even be aware of Customer2's existence.

There's one last piece of the puzzle missing, and that brings me to List Object Mode.

List Object Mode is one strategy available to Active Directory administrators for hiding certain bits of data from certain users. List Object Mode has to be enabled manually; it's turned off by default. To enable it, use ADSI Edit to set the value of the dsHeuristics property to 001. The property lives on the CN=Directory Service,CN=Windows NT,CN=Services object in the Configuration partition.

Now you will have a new access control entry in the list on objects in your forest: List Object. The ACE was actually there before, but Active Directory doesn't enforce it by default.
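
For anyone who would rather script the dsHeuristics change than click through ADSI Edit, here is a minimal sketch that writes the equivalent LDIF change record. The output file name is hypothetical, the forest root DN is borrowed from the contoso.com example above, and how you apply the file (ldifde on a domain controller, or any LDAP client that accepts LDIF) is left out. Note the loud assumption: this replaces the whole attribute, so it is only safe when dsHeuristics is currently unset.

```shell
# Sketch only: write an LDIF change record that sets dSHeuristics to 001.
# ASSUMPTION: dSHeuristics is not already set. If it is, only the third
# character of the existing value should be changed to 1, not the whole value.

write_list_object_ldif() {
    # $1 = forest root DN, $2 = output file
    cat > "$2" <<EOF
dn: CN=Directory Service,CN=Windows NT,CN=Services,CN=Configuration,$1
changetype: modify
replace: dSHeuristics
dSHeuristics: 001
-
EOF
}

# Hypothetical output location; DC=contoso,DC=com comes from the example above.
write_list_object_ldif "DC=contoso,DC=com" "${TMPDIR:-/tmp}/list-object-mode.ldif"
```

Review the generated file before importing it, and document the change, since it affects every domain controller in the forest.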

List Object Mode is a form of Access Based Enumeration (not to be confused with file system ABE), where items are not displayed to users that do not have List Object permissions to them. By default, when a user has the List Contents permission on an OU, and queries that OU, he or she is given a list of all child OUs in that parent OU, even if the user doesn't have read access to those other child OUs.  They show up in ADUC as "Unknown" object types and get that little blank page for an icon which is the Microsoft universal symbol for "wth is this?"

With List Object Mode enabled as just described, Active Directory evaluates the permissions of all the child objects under the object that was queried before returning the results to the user. Any child object on which the user lacks the List Object permission is omitted from the results. So now we have a customer user who is able to read just his or her own OU, and the other Customer OUs are completely hidden from view.

And no more Group Policy failures due to access denied, either.

So are there disadvantages to enabling and using List Object mode in your domain? Yes there are. So even though it may be appropriate for your environment, List Object Mode is not for everybody and it's not a decision that should be made lightly:

  • Significantly increased access control checks on LDAP queries = busier domain controllers.
  • You may need to rethink your entire User and Computer organization strategy to accommodate how the new permissions work.
  • It's a less common configuration that fewer people are familiar with. Administrative complexity++. You need to fully document the change and make sure every administrator is aware of it.

So there you have it. Now go impress your friends with your knowledge of AD List Object Mode!

by ryan@myotherpcisacloud.com at May 20, 2013 05:00 PM

bc-log

Securely backing up your files with rdiff-backup and sudo

Backups are important, whether you are backing up your databases or your wedding pictures. The loss of data can ruin your day. There is a huge list of backup software to choose from, some good, some not so good. One of the tools that I have used for years is rdiff-backup.

rdiff-backup is an rsync-delta-based backup tool that stores both a full mirror and incremental changes. It determines changes based on the rsync method of creating small delta files, which allows rdiff-backup to restore files to any point in time (within the specified retention period).

In the examples below I will refer to two server names, backup-server and server. The names are pretty self-explanatory but just in case, backup-server is the location where I permanently store files copied (backed up) from server.

Setting up rdiff-backup

Installing rdiff-backup is easy considering most Linux distributions include it in their default repositories. In this article I will be using Ubuntu for my example systems.

Note: For Red Hat you will need to enable the EPEL repository to install rdiff-backup via YUM.

Installing

In order for rdiff-backup to work both the source and destination will require the rdiff-backup package. You can install it via apt-get.

On backup-server:

root@backup-server# apt-get install rdiff-backup

On server:

root@server# apt-get install rdiff-backup

Validate rdiff-backup versions match

One of the quirky things about rdiff-backup is that the tool does not support backward compatibility with older versions. For this reason it is best to make sure that your rdiff-backup versions are the same on both servers.

On backup-server:

root@backup-server# rdiff-backup --version
rdiff-backup 1.2.8

On server:

root@server# rdiff-backup --version
rdiff-backup 1.2.8
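
If you have more than a couple of servers, comparing the two version strings by eye gets old. Below is a minimal sketch of the same check as a shell function. The function names are my own, and check_remote_version assumes you can already SSH to the remote host (it will prompt for a password until the keys in the next section are in place).

```shell
# Sketch: compare the rdiff-backup version strings from two hosts.

versions_match() {
    # $1 = local version string, $2 = remote version string.
    # An empty string never counts as a match.
    [ -n "$1" ] && [ "$1" = "$2" ]
}

check_remote_version() {
    # $1 = user@host of the system being backed up (hypothetical argument).
    local_ver=$(rdiff-backup --version)
    remote_ver=$(ssh "$1" rdiff-backup --version)
    if versions_match "$local_ver" "$remote_ver"; then
        echo "OK: both sides run $local_ver"
    else
        echo "MISMATCH: local='$local_ver' remote='$remote_ver'" >&2
        return 1
    fi
}
```

On backup-server you would call it as check_remote_version test@server and only carry on if it returns success.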

Setting up SSH Keys

By default rdiff-backup uses SSH to communicate with remote systems. To avoid typing a password every time rdiff-backup runs, we will need to set up SSH keys with passphrase-less authentication.

On backup-server:

root@backup-server# ssh-keygen -t rsa
Generating public/private rsa key pair.
Enter file in which to save the key (/root/.ssh/id_rsa):
Created directory '/root/.ssh'.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /root/.ssh/id_rsa.
Your public key has been saved in /root/.ssh/id_rsa.pub.

When asked, leave the passphrase empty.

Once you have the SSH key generated you will need to copy the contents of /root/.ssh/id_rsa.pub to the remote servers for key-based authentication. For our configuration we will use a non-privileged user account (test), as this will let us implement rdiff-backup without giving the backup-server full access to the systems being backed up.

On backup-server:

root@backup-server:# scp /root/.ssh/id_rsa.pub test@server:/var/tmp/id_rsa.pub.temp

On server:

test@server:$ cat /var/tmp/id_rsa.pub.temp >> ~/.ssh/authorized_keys

You should now be able to SSH from backup-server to server without being asked for a password.
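
The cat command above assumes ~/.ssh already exists with sane permissions, which is not always true for a freshly created account. Here is a minimal sketch of a slightly more defensive version; install_pubkey is a hypothetical helper name, and the 700/600 modes are the usual ones sshd is happy with under its default settings.

```shell
# Sketch: append a public key to authorized_keys, creating ~/.ssh if needed.

install_pubkey() {
    # $1 = public key file, $2 = home directory of the target user
    ssh_dir="$2/.ssh"
    auth_file="$ssh_dir/authorized_keys"
    mkdir -p "$ssh_dir" || return 1
    chmod 700 "$ssh_dir"
    touch "$auth_file"
    # Only append when this exact key line is not already present,
    # so running the helper twice does not duplicate the key.
    if ! grep -qxF -f "$1" "$auth_file"; then
        cat "$1" >> "$auth_file"
    fi
    chmod 600 "$auth_file"
}
```

On server, as the test user, this would be run as install_pubkey /var/tmp/id_rsa.pub.temp "$HOME".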

Running backup jobs

Now that backup-server is able to SSH to server without being asked for a password, and rdiff-backup is the same version on both systems, we are able to perform the first backup.

The directory we will back up today is /var/tmp/backmeup and we will be backing it up to /var/tmp/backups/server.example.com/. I personally prefer to back up to a directory named after the originating server; that way there is no question as to where the files came from.

On backup-server:

root@backup-server:# mkdir -p /var/tmp/backups/server.example.com
root@backup-server:# rdiff-backup test@server.example.com::/var/tmp/backmeup /var/tmp/backups/server.example.com/

rdiff-backup has now created a mirror of the /var/tmp/backmeup directory from server.example.com in /var/tmp/backups/server.example.com.

root@backup-server:# ls -la /var/tmp/backups/server.example.com/
total 52
drwxr-xr-x 3 root root 4096 May 19 13:07 .
drwxr-xr-x 3 root root 4096 May 19 13:53 ..
-rw-r--r-- 1 root root   25 May 19 13:07 10.file
-rw-r--r-- 1 root root   24 May 19 13:07 1.file
-rw-r--r-- 1 root root   24 May 19 13:07 2.file
-rw-r--r-- 1 root root   24 May 19 13:07 3.file
-rw-r--r-- 1 root root   24 May 19 13:07 4.file
-rw-r--r-- 1 root root   24 May 19 13:07 5.file
-rw-r--r-- 1 root root   24 May 19 13:07 6.file
-rw-r--r-- 1 root root   24 May 19 13:07 7.file
-rw-r--r-- 1 root root   24 May 19 13:07 8.file
-rw-r--r-- 1 root root   24 May 19 13:07 9.file
drwx------ 3 root root 4096 May 19 13:56 rdiff-backup-data

Now that we have backed up the original files we will run a second backup to capture changed data, this time with a little more verbosity.

root@backup-server:# rdiff-backup -v5 test@server.example.com::/var/tmp/backmeup /var/tmp/backups/server.example.com/
Using rdiff-backup version 1.2.8
Executing ssh -C test@server.example.com rdiff-backup --server
<truncated for length>
Backup: must_escape_dos_devices = 0
Starting increment operation /var/tmp/backmeup to /var/tmp/backups/server.example.com
Processing changed file .
Incrementing mirror file /var/tmp/backups/server.example.com
Processing changed file 1.file
Incrementing mirror file /var/tmp/backups/server.example.com/1.file
Processing changed file 10.file
Incrementing mirror file /var/tmp/backups/server.example.com/10.file
Processing changed file 2.file
Incrementing mirror file /var/tmp/backups/server.example.com/2.file
Processing changed file 3.file
Incrementing mirror file /var/tmp/backups/server.example.com/3.file
Processing changed file 4.file
Incrementing mirror file /var/tmp/backups/server.example.com/4.file
Processing changed file 5.file
Incrementing mirror file /var/tmp/backups/server.example.com/5.file
Processing changed file 6.file
Incrementing mirror file /var/tmp/backups/server.example.com/6.file
Processing changed file 7.file
Incrementing mirror file /var/tmp/backups/server.example.com/7.file
Processing changed file 8.file
Incrementing mirror file /var/tmp/backups/server.example.com/8.file
Processing changed file 9.file
Incrementing mirror file /var/tmp/backups/server.example.com/9.file

As you can see, -v5 tells us what files are being processed. This is handy for seeing what is being backed up or restored.

Now if we change only files 1 to 3 and run rdiff-backup again, it should back up only the files that have changed, leaving the others alone.

root@backup-server:# rdiff-backup -v5 test@server.example.com::/var/tmp/backmeup /var/tmp/backups/server.example.com/
Using rdiff-backup version 1.2.8
Executing ssh -C test@server.example.com rdiff-backup --server
<truncated for length>
Starting increment operation /var/tmp/backmeup to /var/tmp/backups/server.example.com
Processing changed file .
Incrementing mirror file /var/tmp/backups/server.example.com
Processing changed file 1.file
Incrementing mirror file /var/tmp/backups/server.example.com/1.file
Processing changed file 2.file
Incrementing mirror file /var/tmp/backups/server.example.com/2.file
Processing changed file 3.file
Incrementing mirror file /var/tmp/backups/server.example.com/3.file

If we look at the backup directory, the number of files has not changed; the contents and time stamps, however, have.

root@backup-server:# ls -la /var/tmp/backups/server.example.com/
total 52
drwxr-xr-x 3 root root 4096 May 19 13:07 .
drwxr-xr-x 3 root root 4096 May 19 13:53 ..
-rw-r--r-- 1 root root   76 May 19 14:10 10.file
-rw-r--r-- 1 root root   98 May 19 14:16 1.file
-rw-r--r-- 1 root root   98 May 19 14:16 2.file
-rw-r--r-- 1 root root   98 May 19 14:16 3.file
-rw-r--r-- 1 root root   73 May 19 14:10 4.file
-rw-r--r-- 1 root root   73 May 19 14:10 5.file
-rw-r--r-- 1 root root   73 May 19 14:10 6.file
-rw-r--r-- 1 root root   73 May 19 14:10 7.file
-rw-r--r-- 1 root root   73 May 19 14:10 8.file
-rw-r--r-- 1 root root   73 May 19 14:10 9.file
drwx------ 3 root root 4096 May 19 14:16 rdiff-backup-data

rdiff-backup keeps the mirror as a plain, current copy of the source, and the differences needed to reconstruct older versions are kept in diff files within the rdiff-backup-data directory. It is not advised to modify or interact with the mirror or diff files directly; it is better to use the rdiff-backup command itself.

Listing available backups

To see the available backups we can use rdiff-backup -l.

root@backup-server:# rdiff-backup -l /var/tmp/backups/server.example.com/
Found 5 increments:
    increments.2013-05-19T13:56:57-07:00.dir   Sun May 19 13:56:57 2013
    increments.2013-05-19T14:09:52-07:00.dir   Sun May 19 14:09:52 2013
    increments.2013-05-19T14:11:29-07:00.dir   Sun May 19 14:11:29 2013
    increments.2013-05-19T14:16:44-07:00.dir   Sun May 19 14:16:44 2013
    increments.2013-05-19T14:29:38-07:00.dir   Sun May 19 14:29:38 2013
Current mirror: Sun May 19 14:30:20 2013

If a file has been deleted and rdiff-backup has run since the deletion, you will not find the file in the directory. You can still, however, list the available backups for that file by specifying it as if it did exist.

root@backup-server:# rdiff-backup -l /var/tmp/backups/server.example.com/1.file
Found 4 increments:
    1.file.2013-05-19T13:56:57-07:00.diff.gz   Sun May 19 13:56:57 2013
    1.file.2013-05-19T14:09:52-07:00.diff.gz   Sun May 19 14:09:52 2013
    1.file.2013-05-19T14:11:29-07:00.diff.gz   Sun May 19 14:11:29 2013
    1.file.2013-05-19T14:16:44-07:00.snapshot.gz   Sun May 19 14:16:44 2013
Current mirror: Sun May 19 14:30:20 2013

Restoring backed up files and directories

rdiff-backup has the ability to restore either individual files or entire directories, as long as rdiff-backup has the item within its incremental lists.

Restoring an individual file

When restoring an individual file with rdiff-backup you can specify either a time or the incremental file to restore from. The following example uses the incremental file.

root@backup-server:# cd server.example.com/rdiff-backup-data/increments/
root@backup-server:# rdiff-backup -v5 1.file.2013-05-19T14\:11\:29-07\:00.diff.gz test@server.example.com::/var/tmp/backmeup/1.file

Restoring a directory

When restoring a directory however we will need to specify a specific time that we want to restore to.

root@backup-server:# rdiff-backup -v5 -r 1h server.example.com/ test@server.example.com::/var/tmp/backmeup

This command will restore the entire directory to the state it was in 1 hour ago, or as close to that as the available backups allow. rdiff-backup supports many time formats, but I commonly find myself using the xDays format (e.g. 2D for 2 days).

Don’t use the force flag

While the above command will restore the whole directory, it will only do so if the directory is empty. If the directory has files in it and you ask rdiff-backup to restore that directory, then it will try to remove the existing files in order to match your backup. This action could result in data that has not been backed up being removed.

To protect against accidental deletion rdiff-backup requires the force flag to be used anytime a file is being overwritten or deleted.

root@backup-server:# rdiff-backup -v5 -r 1h server.example.com/ test@server.example.com::/var/tmp/backmeup
Using rdiff-backup version 1.2.8
Executing ssh -C server.example.com rdiff-backup --server
Fatal Error: Restore target /var/tmp/backmeup already exists, specify --force to overwrite.

I advise avoiding the use of the force flag whenever possible. If you truly do not want the contents of the directory, then just remove them manually before restoring. I have seen many times where people used the force flag and accidentally overwrote a directory they did not mean to (like /etc/ for example…).
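
If you script your restores, one way to stay away from the force flag is to check the target first and bail out when it is not empty. This is a minimal sketch with hypothetical helper names; it has to run on the machine that holds the restore target (server in these examples), not on backup-server.

```shell
# Sketch: decide whether a directory is a safe restore target.

dir_is_empty() {
    # True when $1 is a directory with no entries (dotfiles included).
    [ -d "$1" ] && [ -z "$(ls -A "$1")" ]
}

safe_to_restore_into() {
    # A missing or empty directory is safe; anything else needs a human.
    [ ! -e "$1" ] || dir_is_empty "$1"
}
```

A wrapper could call safe_to_restore_into /var/tmp/backmeup and refuse to continue on failure, rather than reaching for --force.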

Restoring to another location

When restoring with rdiff-backup you can restore files or directories to a location other than their originating source. This can be handy if you need to check the contents before completely restoring the file.

root@backup-server:# rdiff-backup -v5 -r 3h server.example.com/1.file test@server.example.com::/var/tmp/backmeup/1.file.restore

Backup Retention

Backups are only as good as their retention period. Without a retention period you will eventually run out of disk space or use far more disk space than you had originally planned. rdiff-backup can keep backups either for a certain amount of time or up to a certain number of backups.

On backup-server:

Time method

The time method uses the same time format as restore.

root@backup-server:# rdiff-backup --force --remove-older-than 4h /var/tmp/backups/server.example.com

Number of backups method

To specify a number of backups use the number followed by a capital B.

root@backup-server:# rdiff-backup --force --remove-older-than 4B /var/tmp/backups/server.example.com

I used the force flag with the above commands as rdiff-backup requires force to be given if you are removing more than one incremental copy.

Providing more access with sudo

So far we have been backing up files and directories that the test user has access to. If we were to try to back up or restore a file that the test user does not have access to, then the backup/restore will fail with a permission denied error. To provide greater access you can either run rdiff-backup as the root user on the remote systems (which raises security concerns), or provide the test user with the ability to run rdiff-backup as the root user via sudo.

Example of permission denied error:

root@backup-server:# rdiff-backup -v5 test@server.example.com::/var/tmp/backmeup /var/tmp/backups/server.example.com
Using rdiff-backup version 1.2.8
Executing ssh -C test@server.example.com rdiff-backup --server
Exception '[Errno 13] Permission denied: '/var/tmp/backmeup'' raised of class '<type 'exceptions.OSError'>':

Adding rdiff-backup to /etc/sudoers

In order to allow the test user the ability to run rdiff-backup as root we need to add an entry into the /etc/sudoers file, which controls what commands users can run via sudo. To modify this file we will use the visudo command.

On server:

root@server:/var/tmp# visudo

Append:

## Give test user the ability to run rdiff-backup
test    ALL = NOPASSWD: /usr/bin/rdiff-backup --server

As the test user you will now see rdiff-backup in the list of available sudo commands:

test@server:~$ sudo -l
User test may run the following commands on this host:
    (root) NOPASSWD: /usr/bin/rdiff-backup --server

We are specifying NOPASSWD as by default sudo would normally ask the user for their password, which would not work very well with an automated backup script.
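
When this has to be rolled out to many servers, editing /etc/sudoers by hand with visudo does not scale well. Here is a minimal sketch that writes the same entry to a drop-in file and syntax-checks it with visudo -cf before installing it. It assumes your /etc/sudoers includes the /etc/sudoers.d directory (the Ubuntu default, but verify it on your systems); the drop-in file name and the function names are hypothetical.

```shell
# Sketch: generate the sudoers entry and validate it before installing.

write_sudoers_entry() {
    # $1 = user name, $2 = output file
    printf '%s\n' \
        "## Give $1 the ability to run rdiff-backup" \
        "$1    ALL = NOPASSWD: /usr/bin/rdiff-backup --server" > "$2"
}

install_sudoers_entry() {
    # Must run as root. $1 = user name.
    tmp=$(mktemp) || return 1
    write_sudoers_entry "$1" "$tmp"
    if visudo -cf "$tmp"; then
        # ASSUMPTION: /etc/sudoers.d is included by /etc/sudoers.
        install -m 0440 -o root -g root "$tmp" /etc/sudoers.d/rdiff-backup
        status=$?
    else
        echo "sudoers syntax check failed; nothing installed" >&2
        status=1
    fi
    rm -f "$tmp"
    return "$status"
}
```

On server, as root, install_sudoers_entry test would produce the same rule shown above, and sudo -l as the test user should list it.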

Running rdiff-backup with remote-schema

In order for rdiff-backup to use sudo we will need to change the command we have been using a bit; we will use the --remote-schema flag to tell rdiff-backup to run “sudo /usr/bin/rdiff-backup --server” on the remote system.

On backup-server:

Backup command

root@backup-server:# rdiff-backup -v5 --remote-schema 'ssh -C %s "sudo /usr/bin/rdiff-backup --server"' \
test@server.example.com::/var/tmp/backmeup /var/tmp/backups/server.example.com

<truncated>
Processing changed file 9.file
Incrementing mirror file /var/tmp/backups/server.example.com/9.file

Restore command

root@backup-server:# rdiff-backup -v5 -r 3h --remote-schema 'ssh -C %s "sudo /usr/bin/rdiff-backup --server"' \
/var/tmp/backups/server.example.com/5.file test@server.example.com::/var/tmp/backmeup/5.file

By adding sudo we are allowing the test user to back up and restore any file on the system with rdiff-backup.

Adding restrict-read-only for even more security

Using rdiff-backup with sudo prevents people from using the SSH key to log in as root to all of our remote systems. By itself, however, this solution does not stop someone from using rdiff-backup's restore function to deploy compromised files.

For even more security we can use the --restrict-read-only flag to restrict rdiff-backup to reading files only, blocking all write requests. The downside of this setting is that it also prevents valid restore requests. If you are more worried about someone accessing your systems than about having to edit the sudoers file every time you want to restore a file, then this is a good option.

Adding restrict-read-only to the sudoers entry

In order to add --restrict-read-only we need to add it to both the rdiff-backup command and the sudoers entry.

root@server# visudo

Modify to:

test    ALL = NOPASSWD: /usr/bin/rdiff-backup --server --restrict-read-only /

The / at the end is the path that you want rdiff-backup to be restricted to. This entry would give rdiff-backup the ability to back up all files on the system. If you are not backing up the entire system you can restrict this to a specific path, which prevents rdiff-backup from reading files outside that path.

Running the backup command with restrict-read-only

Now that sudo allows us to run the full command we can add it to the remote-schema.

root@backup-server:# rdiff-backup -v5 --remote-schema 'ssh -C %s "sudo /usr/bin/rdiff-backup --server --restrict-read-only /"' \
 test@server.example.com::/var/tmp/backmeup /var/tmp/backups/server.example.com
Using rdiff-backup version 1.2.8
Executing ssh -C test@server.example.com "sudo /usr/bin/rdiff-backup --server"

If you modified the path in the sudoers file you would need to do the same with the rdiff-backup command above.
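
Because the path has to match in two places, it is easy for the sudoers entry and the backup command to drift apart. One way to reduce that risk is to build the remote-schema string from a single variable. This is a minimal sketch; remote_schema and BACKUP_ROOT are hypothetical names, and the helper assumes the path contains no spaces or quote characters.

```shell
# Sketch: build the --remote-schema value from one restricted path.

BACKUP_ROOT="/"   # must match the path used in the sudoers entry

remote_schema() {
    # $1 = path rdiff-backup is restricted to on the remote side.
    # %%s prints a literal %s, which rdiff-backup replaces with the host.
    printf 'ssh -C %%s "sudo /usr/bin/rdiff-backup --server --restrict-read-only %s"' "$1"
}
```

The backup command then becomes rdiff-backup -v5 --remote-schema "$(remote_schema "$BACKUP_ROOT")" followed by the same source and destination as before.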

Automating with Cron

Automating rdiff-backup with cron is as simple as tossing the commands above into a script and adding it to the crontab. The script below is meant only as an example. I would advise anyone reading this to script in some more intelligence to handle failed backups and concurrent runs, but if you need something quick and dirty this will work.

On backup-server:

Creating the backup script

root@backup-server# vi /root/backup-example.sh

Add:

#!/bin/bash
## Example rdiff-backup script - http://bencane.com
## This is not fancy, and you should really add error checking

# Backup
rdiff-backup -v5 --remote-schema 'ssh -C %s "sudo /usr/bin/rdiff-backup --server --restrict-read-only /"' \
 test@server.example.com::/var/tmp/backmeup /var/tmp/backups/server.example.com

# Clean Increments
rdiff-backup --force --remove-older-than 4B /var/tmp/backups/server.example.com

Adding to crontab

Once you have the script you can simply add the script into the crontab on the backup-server.

root@backup-server# crontab -e

Append:

# m h  dom mon dow   command
0 0 * * * /root/backup-example.sh > /dev/null 2>&1

The above crontab entry will run backup-example.sh every night at midnight. Because the script removes everything older than the last 4 backups, a nightly run keeps roughly 4 days of incremental copies once it has been running for that long.
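
As a starting point for the extra intelligence mentioned above, here is a minimal sketch of the same script with a lock against concurrent runs and exit-status checks. The lock directory path and the function names are my own assumptions, and a lock left behind by a killed run has to be removed by hand; treat this as a sketch, not a finished script.

```shell
#!/bin/bash
## Sketch: the example rdiff-backup script with a lock and status checks.

LOCK_DIR="${LOCK_DIR:-/var/tmp/backup-example.lock}"   # hypothetical path

acquire_lock() {
    # mkdir is atomic: it fails if another run already holds the lock.
    mkdir "$LOCK_DIR" 2>/dev/null
}

release_lock() {
    rmdir "$LOCK_DIR" 2>/dev/null
}

run_backup() {
    rdiff-backup --remote-schema 'ssh -C %s "sudo /usr/bin/rdiff-backup --server --restrict-read-only /"' \
        test@server.example.com::/var/tmp/backmeup /var/tmp/backups/server.example.com
}

clean_increments() {
    rdiff-backup --force --remove-older-than 4B /var/tmp/backups/server.example.com
}

main() {
    if ! acquire_lock; then
        echo "previous backup still running; skipping this run" >&2
        return 1
    fi
    status=0
    if run_backup; then
        # Only prune old increments after a successful backup.
        clean_increments || { echo "increment cleanup failed" >&2; status=3; }
    else
        echo "backup failed; increments left untouched" >&2
        status=2
    fi
    release_lock
    return "$status"
}
```

To use it, end the script with a call to main followed by exit $?. Errors go to stderr, so if you want cron to mail them to you, drop the > /dev/null 2>&1 redirect from the crontab entry.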

by Benjamin Cane at May 20, 2013 04:10 PM

/sys/admin/blog

HBR on what value creation will look like in the future

What value creation will look like in the future:  http://blogs.hbr.org/cs/2013/05/what_value_creation_will_look_like_in_the_future.html

A teaser from the article:

“Organizations have nearly perfected implementing the industrial model of managing work — the effort applied toward completing a task. For individuals, this model ensures that we know what we’re supposed to do each day. For organizations, it guarantees predictability and efficiency. The problem with the model is that work is becoming commoditized at an increasing rate, extending beyond manual tasks into knowledge work, as data entry, purchasing, billing, payroll, and similar responsibilities become automated. If your organization draws value from optimizing repetitive work, you’ll find that it will be increasingly difficult to extract that value.”

What you can do:

  • Master the machines.
  • Get obsessed with value.
  • Make creativity real.

 

by Joe at May 20, 2013 03:08 PM

Standalone Sysadmin

Advancing Women in Computing - Panelists Needed!

I got an email from a friend of mine who is soliciting for women who work in IT (preferably IT administration) to take part in a panel at LISA'13 called "Advancing Women in Computing". You can also watch last year's panel to get a feel for what it's like.

Once again, my good friend Rikki Endsley will be moderating the (probably) 90 minute session. They are giving preference to women in the Washington DC area (or people who are going to be attending LISA anyway), so if you're in that region and this sounds like something that interests you, email lisa13gurus@usenix.org (or drop a line here, and I'll get the message to them).

Thanks!

by Matt Simmons at May 20, 2013 02:53 PM

Rich Bowen

What I've learned at SourceForge

Today I'll be leaving SourceForge and taking a role at Red Hat. Please don't think for a moment that it's because I don't like SourceForge. I continue to think that SourceForge does community *way* better than either GitHub or Google Code, and while there are places where the platform can improve, the team that's working on it is one of the finest groups of engineers I've ever had the privilege of working with.

Here are a few of the many things I've learned at SourceForge.

People are passionate

Every time I talk to anybody about my job, I mention two projects: PonyKart and OpenMRS. These projects illustrate to me how people can be passionate about anything. Having talked with the leads of both of these projects, I'm blown away by their passion for excellence.

Of course, these projects could hardly be more different.

PonyKart is a My Little Pony themed Mario-Kart style game. It's fun. The physics are well done. The courses are well designed. The community is very engaged. And it has My Little Pony characters in it. The guys that did this project wanted it to be a MLP game, but they also wanted it to be excellent. They wanted it to be fun. They wanted it to be *good*. They are passionate about it.

The OpenMRS project is a medical records system that was developed for a hospital in Kenya that had a hacked-together Access database monstrosity, and it was faster and easier for these guys to hack something together than to try to fix what was there. But that wasn't enough. They were passionate. They wanted it to be done right, and they wanted hospitals all over the world to benefit from it. And now they have a non-profit dedicated to giving this product away to hospitals in developing nations that need it. These guys are my heroes.

I am continually blown away by the quest for excellence, and the vast range of ways that it manifests itself.

People are kind

I've met amazing people in my time at SourceForge. These people are helpful, kind, patient, and, as I've mentioned, passionate. For the most part, people get that I'm human and can't solve all of their problems immediately. They get that we all have the limitation of time and resources.

Most people *don't* throw tantrums or demand their way. For this I am very grateful. I'm glad to have met a few of the nice people.

People are cruel

Sure, SourceForge is the underdog right now. I get that. It's not necessary to be a jerk.

It's hard to remember, when people are being jerks, that they're in the minority. Most people are, in fact, nice. But the jerks are very loud.

I'd like to remind the jerks that the folks who happen to be developing their project on the SourceForge platform are passionate, and they are pragmatic, and they are doing something useful while you fling mud at them.

'nuff said.

People are pragmatic

Tools are tools. They are not your children.

For the most part, people want to get a job done, and they use the tools they have, because the focus is the task, not the tools. Once, we used CVS and MailMan and we *liked* it. SVN is better. Some people like Git better. But if we had to use CVS and MailMan, you know what? We'd still get stuff done.

Religious debates over the relative merits of DVCS and CVCS systems are all well and good over beer at conferences, but most of us have a job to do, and we don't have time for that indulgence. You may, in fact, be right, but I don't have that kind of time.

I grow very weary of the This vs That flame wars that have characterized the IT world for so long. Perl vs Python, VI vs Emacs, Linux vs Windows vs Mac, Git vs SVN. The thing is, if you're a professional, you need to know *all* of them, and you're not coming across as brilliant, you're coming across as only knowing one tool. Nice hammer. Sometimes a screwdriver is useful.

But, much as most people are nice, it turns out most people are pragmatic. Most people don't have time for those debates either. They want to get their job done. I really appreciate having met a lot of those kinds of people.

by rbowen at May 20, 2013 12:05 PM

The Nubby Admin

Lessons Learned from Cascading Failure and Face Punching a House

In regard to my post “Cascading Failure, Technical Debt, and Punching a House with my Face“, I was asked about my conclusions and how I dug myself out of that hole.

Before I go any further, let me confess that the blog post itself was another failure in that saga. I began writing it the day I discovered that a forgotten boot DVD in the optical drive was the cause of the server not coming back up, and I continued adding to the post over the next several days. Because I didn’t give a contiguous block of time to the writing, I left out some details. Furthermore, I posted it too soon because I was absent-minded as I was saving the post and accidentally published it instead of saving it. I had to quickly unpublish it, but it was too late; an email notification went out to my subscribers so the unfinished article was read by a few people. Then when they clicked to go to the blog, they were met with a 404 error. Finally, I continued to write the post, scheduled it to be published the next day, and then completely forgot to polish it off because I staggered away from my computer to hit the treadmill, shower, and then collapse into bed. I woke up the next day, my computer still on, my desk still in “work mode” with items scattered all over it, and the WordPress control panel still open with the post being edited. The failure train was still going full steam.

If you’ve read the previous blog post, you might want to read it again because it’s now a little better presented with some key points added that I had left out.

The conclusions that I came to as a result of the Circus of Calamity that happened over Mother’s Day weekend are nothing new to me, and very likely nothing new to you. However I think they bear enough fruit to be written down. I’d like to codify my thoughts on technical debt in the system administration world and how to avoid it or deal with it if you’re currently in over your head. Mayhaps this will be the first effort in a larger work.

Without giving too much thought to the order in which the following tenets should be ranked, here are some of the lessons that stand out to me.

Lessons learned:

Devote contiguous time to projects. In this case, my client gives me an hour cap. I can work on their systems for a certain amount of hours per month. That naturally lends itself to non contiguous blocks of time as I hit my maximum and then wait for the month to roll over. However, even with that limit, it’s best to spend it all in a largely uninterrupted segment of time. Regardless of if you are salaried, full time contracted, or an independent hired gun that works for whoever needs you whenever they can pay you, the concept remains the same.

If you have something to do, devote as much contiguous time to performing a task as you can. In my case I tend to carve my day up thusly: four or five hours on one client’s systems, then another four or five hours on another client, and then three or four hours of tasks that I need to do on my own systems, bookkeeping, and general business busywork. This leads to long days of frequent context shifting. In my last post’s situation, that led to errors like leaving a DVD in the optical drive of a server as well as not copying over the client’s password store on a schedule. The last post of mine was even a victim; as I context shifted too frequently, I forgot that I still had some polishing to do on it.

Don’t break your thoughts up. Think on a specific task, task set, and/or client for as long as possible with as few interruptions and context changes as possible.

Fix problems as they come. “I’ll get to that later” is death. The problem with the ILO needed to be addressed immediately. The problem with the BIOS clock resetting after power state changes needed to be addressed immediately. The problem with the hard drive controller failing needed to be addressed immediately. These were emergency level things that got distracted by the peculiarities of being an independent consultant and working for a place that has very small project budgets. Or perhaps that had nothing to do with it. Perhaps I could have pushed harder and taken more initiative. I know this office well and could have ordered the BIOS battery and contacted a local contractor to walk in and replace it. Act first, bill later! After all the years working with this group, I’m fairly certain that such an emergency action would be accepted and paid with no question.

Regardless of my specific situation, the idea is that, with few exceptions, one needs to address problems that crop up at that time, and not later. This is a sister concept to devoting contiguous time to a project. Keep on solving a problem and its directly related troubles for as long as possible without interruption. That can mean hours, or days, or longer if possible. This can lead to a rabbit-hole scenario where one simple change then leads to a huge infrastructure change. However, simply glossing over an issue adds another mound of debt to the overall systems debt. Don’t know why DNS queries are taking five seconds to resolve? Eh, it’s just a few seconds to wait and we’ve got bigger issues to solve with the payroll application. However, when a problem crops up as a result of DNS not being as smooth as one would expect, now you’ve got to face the DNS problem with another problem on your back. That pressure may lead to greater problems by encouraging you to implement half-measure solutions for the DNS resolution delay, which then causes another problem, which then causes another, and another… etc. and etc.

This is rather hard for consultants like me, however, so this bad tendency is strengthened. Consultants are paid by the hour in most cases, so the pressure to deliver without drawing out billable time is great. If a flat-rate project quote is made, you’re always one step away from hitting a mine field if you go too deeply into ancillary systems. I’ve lost my shirt as a result of a flat-rate project quote that ended up with scope-creep. While once in a while chasing down each problem to its root turns out great because you get extra business to fix those systems, most of the time it causes much more pain and suffering in the form of unpaid invoices, broken systems, and angry business owners.

This also seems rather tough for any IT person in general. At any given moment we’ve got large amounts of projects and rooms full of executives all competing for our time. Each one thinks they’re the most important person and project. Each one wants to be completed yesterday. There comes a point when being driven by the tyranny of the urgent has to stop, one way or another. The balance of when to chase down a problem to completion is a fine one, but I think we should collectively assume that a problem should be fixed immediately and require strong evidence to the contrary before abandoning the pursuit.

Get rest and stay healthy. I’ve been grinding hard for far too long. Starting a business is no joke. Keeping the business alive with paying clients all the while sharpening your skills and keeping potential clients engaged just in case existing business leaves is even less funny. I’ve been working so many hours in a week for three straight years that I’m aging myself prematurely. I’d love it if I could take a vacation or relax more, but the truth of the matter is that the clients I’ve picked up haven’t been the most lucrative and there have been some billing and invoicing… issues. I don’t have the time or the money to do much else with life except clatter away in front of a computer.

I’ve been departing in the evenings to get back to a hobby that has always interested me: weight lifting. That’s helped, and I’m contentedly regaining strength and getting re-bitten by the lifting bug and the addiction to “the pump”, but this is a fairly new re-commitment. The plain facts are that I’m tired and zapped of mental energy. I’ve been grinding and it shows. The dumb mistakes I made for the client in the last post are in some part a result of mental exhaustion. I’ve lost some of the love for information technology that I used to have. I’ve visibly aged ten years in just three and I don’t have much material gain to show for it. I’m tired, and I did a disservice to my best client because of it. Shame on me.

(That said, if anyone knows of a business known for paying market hourly rates, and on time, that needs an independent system administrator with my skills based in Phoenix, Arizona, I’ve now got some time that I can book for a new client. Please, no employment positions at this point. Only consultant / contractor.)

Kanban can help! Kanban – I lurves it. I’m not about to suggest that it is the solution to everyone’s problems, but for those of us who are more tactile and visual, this kind of project and task management system can really make a difference. Kanban, in the simplest explanation that I’ve come up with so far, is a means of visualizing work and encouraging a limit to concurrent work.

In the cases of carving out contiguous time, fixing problems at the earliest possible moment after discovery, and even staying rested, kanban can be very useful since it forces you to be constantly aware of what work is currently being performed and what work is waiting to be performed. I use it to break larger projects into smaller chunks. Typically if a task would take more than four hours then it needs to be broken down into more than one ticket. However, in some cases I simply write “Work on X project, 4 hours” and that’s enough. But I’m digressing into the specifics of kanban when that’s not the point here.

Kanban as a means of staying on target and being ever aware of what context you’re currently in can be a huge boon to the hamstrung IT person. It’s especially helpful if you work with others and keep the kanban board highly visible. That way people always know what you’re working on. It’s very helpful to have management buy-in to that kind of system. Why? Imagine you’re in an environment where everyone thinks their projects are of utmost importance. If you limit your concurrent working projects to one or two like a good kanban system suggests, you can point to the board, specifically the area that is dedicated to tasks that you are currently working on this very moment, and force a choice. When someone complains about their stuff not getting done, then, with the blessings of leadership, you can require that the person who wants their project to be given top priority, contact the task owner of the current project that’s being worked on and explain to them why that has to be shelved and how long it will take for you to get back to work on it.

This can really help. Everyone understands sticky notes on a white board. If Gilles in accounting thinks that his project is the most important thing, tell him he can move James’s ticket from the currently worked-on project over to the holding tank. You know James. Six feet eight inches of security guard whose beard has more muscle than your entire body? Good luck, Gilles! But seriously, putting things into perspective for various project owners can seriously help you block off contiguous amounts of time and stay on track.

Sadly, in my case as an independent consultant, I can’t make Client A call Client B and explain why Client B needs to let up on using my time. It doesn’t work that way. However, I can still use kanban to aid in my own self discipline of keeping track of what I’m working on and when.
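
For the tactile-minded, the two rules I lean on (cap the work in progress, and split anything over four hours into more than one ticket) can be sketched in a few lines of Python. This is only a rough illustration under my own assumptions: the `Board` class, its column names, and the default limit of two are made up for the example and don't come from any particular kanban tool.

```python
from dataclasses import dataclass, field

MAX_TICKET_HOURS = 4  # rule of thumb from above: split anything longer


@dataclass
class Board:
    """A tiny kanban board: a waiting column and a WIP-limited doing column."""
    wip_limit: int = 2
    waiting: list = field(default_factory=list)
    doing: list = field(default_factory=list)

    def add(self, name, hours):
        """Queue a task, splitting it into tickets of at most MAX_TICKET_HOURS."""
        part = 1
        while hours > 0:
            chunk = min(hours, MAX_TICKET_HOURS)
            self.waiting.append((f"{name} (part {part})", chunk))
            hours -= chunk
            part += 1

    def start(self, title):
        """Move a ticket to 'doing', refusing if the WIP limit is reached."""
        if len(self.doing) >= self.wip_limit:
            raise RuntimeError("WIP limit reached: shelve something first")
        for ticket in self.waiting:
            if ticket[0] == title:
                self.waiting.remove(ticket)
                self.doing.append(ticket)
                return ticket
        raise KeyError(title)
```

The refusal in `start` is the whole point: the board forces a choice about what gets shelved instead of letting a third or fourth task quietly pile on.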

Digging out of the Hole

Some people have asked how I’ve dug out of that hole with the client. I’m still working on it. Meanwhile, in addition to that client, I’ve got a few other projects that I’m trying to sew up, so it’s a delicate balance of time and guarding contiguous working hours on each project. I can tell you that when I do get out of the hole of technical debt, it will be in large part due to kanban and a good set of targeted goals.

I need to first create a large vantage point for all my major task spheres. Currently I have three clients that I’m working with. One is a bit of a deadbeat. One is low priority. One is high priority (the one with the crashing server). In my personal life, I’ve got a few large goals to be concerned with as well. I’ll make those large circles of potential task lists and then figure out what is most important to me at this point in my life.

I can tell you that the high priority client (who also pays on time and is in good standing) will be at the top of the list of work projects. My low priority client will be a close second because the project is smaller and close to completion. It will be a great relief to get them finished. The client who is late paying is down on the list of priorities. Even them paying up their overdue invoices won’t get them past third spot until I see a history of on-time payments and not wanting cut-rate hourly rates.

Within each task sphere will be a list of tasks ordered by importance. Of highest importance for the client with the failing server will be to get new equipment in and start the migration. That much is obvious. I’ll need to address each quirk and hiccup along the way, such as ILOs dropping off the network and the like. Furthermore, I’ll need to guard contiguous time. Instead of dividing a day in three parts, (four or five hours for one client, four or five hours for a second client, and then a handful of hours for business management), I prefer to block off multiple days in a row for each client and task. That looks like this: Monday and Tuesday for one client, Wednesday for another, Thursday for business management, Friday and Saturday for another client (Yes, I work six days a week. It ain’t easy being self employed).

This should facilitate a steady march towards normalcy and healthy systems.

Parting Thoughts

  1. Don’t assume something isn’t important. Assume it’s important and require a lot of evidence to prove that it’s not. Give a serious effort at fixing any problem at the earliest possible moment.
  2. Dedicate contiguous time to completing a task and don’t dare multitask.
  3. Prioritize based on danger and worth.
  4. Chill out and get some exercise. BRO, DO YOU EVEN LIFT?

Not exactly ground breaking advice, but perhaps you need to hear it. I know I do. Preach it. Got any other tips and ideas? Any stories of failure and phoenix-like recovery from ashes? Let me know in the comments below or send me a guest post!

by Wesley David at May 20, 2013 10:44 AM

Chris Siebenmann

Today's comment spammer trick: regurgitated comments

I log the contents of some attempted spam comments here on Wandering Thoughts (the concise summary of when is when the spammer seems to be trying hard). Usually this doesn't get anything, but today my trawl through the logs turned up a succession of bizarre and odd comment attempts. The text had misspellings and typos but it generally made sense and most of the comment attempts were even about technical things that are vaguely on topic for here. But they were invariably attempts to comment on very inapplicable entries.

When I looked at the logs in detail, one of the most striking was a series of comment attempts that looked very much like a conversation between two or more people about using git on home directories. This was very odd since none of the comments were being posted, yet the people were pretty clearly replying to each other; I began to develop all sorts of theories about disturbingly intelligent content auto-generation. Finally I noticed something in one of the comment texts and the penny dropped:

[...] Possibly related posts: (automatically generated)Heroku, the Rails app.

There is a really simple way to get this text into a spam comment: you can be scraping content from existing blog posts and/or blog comments. So my new theory is that the would-be comment spammer is scraping comment text from other blogs, mangling it somewhat, and then spam-posting it on other blogs (including mine).
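
If you wanted to flag this sort of thing automatically, one crude approach is to look for boilerplate that only makes sense on someone else's blog page. Here is a minimal Python sketch of that idea; the function name and the marker list are hypothetical, and the only marker that comes from the logged comments is the one quoted above.

```python
# Boilerplate that belongs to someone else's blog page, not to a real comment.
# Only this one marker comes from the logged spam; extend the tuple as needed.
SCRAPE_MARKERS = (
    "possibly related posts: (automatically generated)",
)


def looks_scraped(comment_text):
    """Return True if the comment contains a known scraped-page marker."""
    # Lowercase and collapse whitespace so minor mangling does not hide a marker.
    normalized = " ".join(comment_text.lower().split())
    return any(marker in normalized for marker in SCRAPE_MARKERS)
```

This only catches scrapes that drag platform boilerplate along with them; a spammer who scrapes cleanly would sail straight past it.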

The mangled text doesn't seem to have any links or other spam-relevant text so I'm not sure why the spammers are doing this. Maybe they're fishing to see what blogs will allow their comments through moderation and will follow up with more active content on blogs where this works.

Sidebar: source details and other things

So far 30 different IP addresses have tried this here today; most IP addresses have made only one attempt each. The IP addresses cover a large range of source networks. A few of them are CBL listed but that's pretty much it as far as DNSBLs are concerned. Four of the IP addresses actually belong to Microsoft (168.63.43.185, 168.63.62.182, 168.63.76.184, and 168.63.84.217; all four are currently listed on the CBL). I'm assuming that these are compromised machines, VPS servers, or both.
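
For anyone who hasn't checked an address against a DNS blocklist by hand: the usual convention is to reverse the IPv4 octets, append the list's zone, and see whether that name resolves. A rough Python sketch follows. The function names are made up for illustration, and the default zone `cbl.abuseat.org` is an assumption on my part; use whatever zone your blocklist documents.

```python
import ipaddress
import socket


def dnsbl_query_name(ip, zone="cbl.abuseat.org"):
    """Build the DNS name to look up: reversed IPv4 octets plus the list's zone."""
    addr = ipaddress.IPv4Address(ip)  # raises ValueError on malformed input
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"


def is_listed(ip, zone="cbl.abuseat.org"):
    """Return True if the blocklist answers for this address (needs network access)."""
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
        return True
    except socket.gaierror:
        # No such name usually means "not listed"; it can also mean DNS trouble.
        return False
```

Note that a failed lookup is ambiguous, so treat a `False` from `is_listed` as "no evidence of listing" and not as a clean bill of health.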

Many of the IP addresses also made a burst of GET requests for various other URLs here. Maybe they're scraping text from Wandering Thoughts for use in their corpus for their next spam run somewhere else.

by cks at May 20, 2013 02:45 AM

May 19, 2013

witalis

cisco config archive doing

The basics of archiving a router's configuration are covered at:

http://www.techrepublic.com/blog/networking/use-the-cisco-ios-archive-command-to-archive-your-routers-configuration/532

I would like to add some useful commands in this area:

# sh archive config differences nvram:startup-config system:running-config

it’s pretty self explanatory – produces output with differences between startup and running config.
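
If you keep copies of both configs off the router, you can get a similar comparison offline. Below is a minimal Python sketch using the standard difflib module; the function name is hypothetical, and the output is an ordinary unified diff, not the format IOS itself prints.

```python
import difflib


def config_differences(startup, running):
    """Return unified-diff lines between two saved configuration texts."""
    return list(difflib.unified_diff(
        startup.splitlines(),
        running.splitlines(),
        fromfile="startup-config",
        tofile="running-config",
        lineterm="",
    ))
```

An empty list means the two saved copies are identical.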

You can also log the configuration commands that are entered:

archive
 log config
  logging enable
  notify syslog
  hidekeys

With hidekeys, commands are logged to syslog without password entries. Besides syslog, you can see which commands were entered using:

# sh archive log config all

Using the archive you can easily roll back your configuration to a previous state:

# configure replace ftp:<path_to_archive_cfg> list force time 10

it means that your configuration is replaced with the archived config, and if you don’t confirm within 10 minutes, it will revert to the previous running config. To confirm the configuration:

# configure confirm

Moreover you can make your changes safer by doing

# configure terminal revert timer 1

it will roll back your configuration in 1 minute if you don’t confirm it. More about this feature in this great post:

http://packetpushers.net/cisco-configuration-archive-rollback-using-revert-instead-of-reload/
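
The idea behind both timers is the same: apply the change, remember the old state, and undo everything unless someone confirms in time. Here is a toy Python model of that pattern, only to illustrate the logic. The class and method names are invented for the example and say nothing about how IOS implements it.

```python
class RevertTimerSession:
    """Toy model of a config change that rolls back unless confirmed in time."""

    def __init__(self, running_config):
        self.running = running_config
        self._saved = None      # config to restore if the timer expires
        self._deadline = None   # time (in minutes) by which to confirm

    def replace(self, new_config, now, minutes):
        """Apply new_config and arm the rollback timer."""
        self._saved = self.running
        self.running = new_config
        self._deadline = now + minutes

    def confirm(self):
        """Keep the new config and cancel the pending rollback."""
        self._disarm()

    def tick(self, now):
        """Roll back if the deadline has passed without a confirm."""
        if self._deadline is not None and now >= self._deadline:
            self.running = self._saved
            self._disarm()

    def _disarm(self):
        self._saved = None
        self._deadline = None
```

The useful property is that losing your session (for example by locking yourself out with a bad ACL) counts as "no confirm", so the old configuration comes back on its own.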



by admin at May 19, 2013 05:55 PM

Server Density

Chris Siebenmann

The technical effects of being an out of tree Linux kernel module

Suppose that you have a kernel module that is not in the mainstream kernel source for one reason or another. Perhaps it is license compatible but just not integrated for various reasons (as is the case with IET) or perhaps it is license incompatible (as is the case with ZFS on Linux). This non-inclusion has a number of cultural effects, but it also has real technical effects. Although I've mentioned them before, today I want to talk about them in some detail.

The first thing to know is that the Linux kernel does not have a stable kernel API for modules; how a module interacts with the rest of the kernel can and will change without notice. When your module is part of the kernel source, changing it to cope with the API change is generally the responsibility of the kernel developer who wants to make the API change. When your module is not in the kernel tree, not only is changing its code your job but so is even knowing about the API change. And API changes are not always obvious because sometimes they're things like changes in locking requirements or how you are supposed to use existing functions.

(Sometimes they are semi-obvious, like changing just what arguments a function takes. You do pay attention to all warning messages that show up when building your kernel module, right?)

Any number of people would like this to change but it isn't going to. The Linux kernel development process is optimized for in-tree code and not for out of tree code. If your out of tree code cannot be included in the kernel for various reasons, that's tough luck but the kernel developers really don't care that much (as a general rule). Locking themselves down to any stable module API would reduce their ability to improve and evolve the kernel code.

The next effect is pragmatic: if your code is not in the kernel tree, almost no one will look at it (and this includes automated scans over the kernel source code that look for various things) or do things to it. This is great if you're possessive about your code but it means that you're missing out on the quality checking that this creates, all of the little janitorial cleanups that people do, and if there is a bug then your module's developers are the only people who are looking at it.

(In some quarters it's fashionable to think that the Linux kernel developers are all clowns and cannot possibly contribute anything worthwhile to your code. This is a major mistake. Among other things they're basically certain to know the overall Linux kernel environment better than you do.)

A related issue is that the kernel developers try not to create bugs and regressions in in-tree code, especially if it's considered important (which, say, a commonly used filesystem will be); if one is created anyways a bunch of people will go looking to try to fix it. It's almost certain that no official kernel release would go out that broke a significant filesystem; the change that created the breakage would be identified and then reverted, with the change's developer told to try again. If your module is not in the tree, well, you're on your own. Performance regressions or actual breakages are your problem to diagnose and then either fix or try to argue the kernel developers into changing their side of the problem.

(And they may not, especially if your code is license-incompatible with the kernel and most especially if their change actually improves in-tree code and performance and so on.)

All of this means an out of tree kernel module requires more ongoing development work than an in-tree kernel module. In-tree kernel modules generally get somewhat of a free ride from general kernel developers; out of tree modules do not and have to make up for it with time from their own developers. One predictable result is that many out of tree modules don't necessarily support all kernel versions, including kernel versions that sysadmins may want to use. A worst case situation with out of tree modules is that the developers simply stop updating the module for new kernels; any users of the module are then orphaned on old kernels.

by cks at May 19, 2013 05:20 AM

May 18, 2013

Ubuntu Geek

Libreoffice 4.0.3 released and PPA installation instructions included

LibreOffice is a comprehensive, professional-quality productivity suite that you can download and install for free. There is a large base of satisfied LibreOffice users worldwide, and it is available in more than 30 languages and for all major operating systems, including Microsoft Windows, Mac OS X and GNU/Linux (Debian, Ubuntu, Fedora, Mandriva, Suse, ...).
(...)
Read the rest of Libreoffice 4.0.3 released and PPA installation instructions included (367 words)


by ruchi at May 18, 2013 11:12 PM

Milek

/sys/admin/blog

Running IT like a business

Some older but really good articles on running IT like a business:

There are a few additions IT Managers should think hard about:

  • Businesses have customers, not users.  We have to be customer focused and look at everything from the customer’s view at service levels, not at each of our component levels.
  • We should treat the allocation of our human resources, where our staff time goes, just like we treat our financial budgets.
  • Our major goals should include improving business-IT communication and creating value for the business.  The more we integrate with the business the better the value we can add.
  • Business models are moving to cloud strategies.  We’re only going to get busier and need to respond quicker to business needs as our product and IT strategies evolve.   Every little bit we do to improve and standardize processes now will pay us back with dividends as our new world evolves.

We’re embarking on a major culture change in IT if we are going to keep pace with the changing business strategy.

by Joe at May 18, 2013 03:08 PM

Chris Siebenmann

A little habit of our documentation: how we write logins

Over the years, we've developed a number of local conventions for our local documentation. One of them is that we always write Unix logins with < and > around them, as if they were local email addresses, so that we'll talk about how <cks>'s processes had to be terminated or whatever. When I started here this struck me as vaguely goofy; over time it has rather grown on me and I now think it's a quite clever idea.

Writing logins this way does two things. The first is that they become completely unambiguous. This is not much of an issue with a login like 'cks', but we have any number of logins that are (or could be) people's first or last names, and vice versa. Consistently writing the login with <> around it removes that ambiguity and uncertainty. The second thing it does is that it makes it much easier to search for a particular login in old messages and documentation. Searching for 'chris' may get all sorts of hits that are not actually talking about the login chris; searching for '<chris>' narrows that down a lot.

(Well, sort of. The reality is that we sometimes wind up quoting various sorts of system messages and system logs in our messages and of course these messages generally don't use the '<login>' form. However, often excluding these messages from a later search is good enough because we're mostly interested in the record of active things we did to an account.)
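
To make the search benefit concrete, here is a small Python sketch of the difference between the two searches. The helper name is made up for illustration, and it assumes the old messages are available as plain text.

```python
import re


def find_login_mentions(text, login):
    """Return the offsets where the login appears in the bracketed <login> form."""
    pattern = re.compile(re.escape(f"<{login}>"))
    return [m.start() for m in pattern.finditer(text)]
```

A bare case-insensitive search for 'chris' in a sentence like "Chris asked us to kill <chris>'s runaway jobs; chris agreed." matches three times, while the bracketed search matches only the one place that is actually about the login.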

There's a corollary to the convenience of <login>: right now we have no similar notation convention for Unix groups. We write less about Unix groups than about Unix logins (and groups generally have more distinct names), but it would still be nice to have some convention so we could do unambiguous searches and so on.

by cks at May 18, 2013 05:13 AM

May 17, 2013

Byron Miller

Devops – It’s about critical thinking & the evolutionary “WHY” of Silos.

I believe one of the best things to come out of the DevOps movement, and one that no one seems to be describing, is essentially an explosion of critical thinking and reasoning skills: the maturing of IT, if you will. Some people subscribe to different views or distill it into different methods such as C.A.M.S (Culture, Automation, Measure, […]

by byronm at May 17, 2013 01:56 PM

Aaron Johnson

Chris Siebenmann

Why I'm not considering btrfs for our future fileservers just yet

In a comment on yesterday's entry I was asked:

Could you elaborate on the "btrfs does not qualify" part?

What's missing? How likely do you think this to change in the near future?

I will give a simple looking answer that conceals big depths: what's missing is a btrfs webpage that doesn't say 'run the latest kernel.org kernel' and a Fedora release that doesn't say 'btrfs is still experimental and is included as a technology preview' (which is what Fedora 18 says). It's possible that btrfs is more mature and ready than I think it is, but if so the btrfs people are doing a terrible job of publicizing this. Fundamentally I want to be using something that the developers consider 'mature' or at least 'ready' and I don't want us to be among the first pioneers with a production deployment of decent size in a challenging environment.

Pragmatically there is nothing that btrfs can do to make us consider it in the near future, for reasons I wrote about two years ago in an entry on the timing of production btrfs deployments. If btrfs magically became perfect tomorrow, it would only appear in an Ubuntu LTS release in 2014 and a Red Hat Enterprise release in, well, who knows but probably not this year.

(The current Ubuntu 12.04 LTS has btrfs v3.2, whereas btrfs is up to v3.9 already. The btrfs changelog shows the scope of a year's evolution.)

As far as what in specific is missing, well, I have to confess that I haven't looked at the current state of btrfs in much detail and so I don't have specific answers. I poke at btrfs vaguely every so often; generally I discover something that strikes me as alarming and then I go away again. Since btrfs is never going to be exactly like ZFS, I can't just directly translate our ZFS fileserver design to btrfs and then complain about what's missing or different. To have a really informed opinion on what btrfs needed and what was wrong with it, I'd have to do a btrfs-based fileserver design from scratch, trying to harmonize what we think we want (which has been shaped by what ZFS gives us) with what btrfs gives us. So far there seems to be no real point to doing that before btrfs stabilizes.

(I'm starting to think that btrfs and ZFS have fundamentally different visions about some things, but that needs some more reading and another entry.)

Sidebar: ZFS on Linux maturity versus btrfs maturity

You might ask why I'm willing to consider ZFS on Linux even though it's a relatively young project, just like btrfs. The answer is that the two are fundamentally different. The ZFS part of ZFS on Linux is generally a mature and well proven codebase; most of the uncertain new bits are just for fitting it into Linux.

by cks at May 17, 2013 05:30 AM

The Nubby Admin

Cascading Failure, Technical Debt, and Punching a House with my Face

At 11:32PM, Saturday May 11th, I got an email from MX Toolbox notifying me that a SBS 2008 machine that I support had gone unresponsive. It’s 600 miles away from me in another state. This was not a strange occurrence with this server.

A Cluster of Prior Failures

Five years ago a small office with a minimal budget needed a SBS implementation. I recommended an HP ML 115 G5 with four hard drives and onboard RAID provided by an NVIDIA chipset. I have regretted that decision for all five years. Here’s a post of mine concerning that chipset and the troubles I’ve had with it.

In short, I have poor insight into and control over the entire server’s health. Some examples include:

  • I couldn’t update the hard drives’ firmware, which was a big deal because the serial numbers of those hard drives fell into a set of drives that have a known problem with suddenly going offline. The firmware update has to be applied through HP’s support tools, which are not supported on the ML 110/115. After much research and seeking help from HP, I was told that, in essence, I was left out to dry.
  • The ML 110/115 does not support the ProLiant Support Pack nor does that model support the Insight Control Manager. Keeping drivers updated and staying abreast of the various components’ health was virtually impossible.
  • There was also no HP ILO CLI interface available which made doing things like firmware updates especially difficult remotely.
  • The on-board storage controller had poor support from Nvidia, and offered very slim storage management features or reporting on hard drive health.

For years I hit the management ceiling with that box, which probably cost my client more in my time and theirs than a more robust server at twice the hardware cost would have. And then what I had been dreading for years finally happened…

Two Months Ago

“Did you reboot the server?” That’s never a question you want to hear, especially when you did not reboot a server. I VPN’d into that office’s network and checked for the presence of the server on the network. Yes, the server was down. One power cycle later, the OS loaded just fine.

I checked the event logs and it turns out there was a massive flurry of parity errors that came out of nowhere. The server froze as a result. The controller was apparently dying. After a reboot, the data appeared fine, and there were no more parity errors coming from the Nvidia storage driver. I knew something had to be done, but being remote and working with an office that has a shoestring budget (and can often only afford used shoestrings) made the options few and unattractive.

What’s worse, as I started investigating things further, I noticed that the ILO Advanced card that was in the server was no longer showing on the network. Aaaaand the BIOS clock would reset to July 2009 after being shut down (BIOS battery dying) causing strange problems with Active Directory and other applications running on the network that relied on accurate time (read: everything). AAAaaaaaand the two mirror sets (one for the system volume and one for the email server’s databases) had split apart and could not be re-synced because the Nvidia storage management software no longer recognized that any hard drives were connected.

The options, as I saw them, were for the business to either buy a new RAID controller, BIOS battery, and perhaps ILO card (and then scramble to perform the complex surgery remotely on their own, or pay a local consultant to coordinate with me, or pay to ship me on site) or get a new server altogether (and pay a local consultant to coordinate with me, or… you get the idea). Either way, it started to look more and more like a total forklift migration was necessary.

Two Months Later

Yes, it’s been about two months and the server is still riding in the same perilous state. Split mirrors, bi-monthly freezes that require a power cycle to recover from, and a lot of hoping and praying that data is not corrupted. Welcome to the world of supporting small business IT where people re-use tea bags and don’t run heat or AC in order to save money and keep the business open.

That Saturday night, it was getting late and I was thinking about bed. I checked my email one last time for anything pressing when I saw an MX Toolbox alert. This is never good. I scanned the email, saw what host was causing the alert, and knew that I was dead in the water. I could get into that client’s network via both a SonicWall VPN and unattended TeamViewer installations that existed on most of the workstation PCs. However, it was all futile because I didn’t have hardware level access to the server as a result of the ILO’s failure. The office has a Lantronix Spider KVMoIP device that was being used to work on a workstation migration for one employee, and was therefore not hooked up to the main office server. Those were two layers of out-of-band management that were not doing any good for the most important technology asset in the building.

All of this meant that someone would have to show up at the office to power cycle the PC. The technical debt and compound interest of failure had already mounted fairly high by that point, considering the state of the server. However, things were about to get comical.

I’ll Gladly Pay You Tomorrow for Out of Band Management Today

What happened in the next 24 hours was a morbid comedy of oversights and compounded problems that ended in a whiplash inducing facepalm.

First, I needed to email three people who would most likely be in the vicinity of that office so I could coordinate with one of them to drop by on their Sunday morning and power cycle the server. Except the server is what does email for the organization so I can’t send to their organization email addresses (this is a Microsoft SBS machine). I only know of one employee’s non-work address, and I also happen to know the gmail address of another employee’s son.

I email those two people and tell them of the situation. As it turns out, two key workers are out, traveling to a convention in Texas. That makes access to email even more vital than normal. Everyone knows the situation and there’s not much more I can do so I get to bed. It’s not until about 2PM on Sunday, Mother’s Day here in the USA, that I hear back from one worker who has just enough time to skip by the office and power cycle the server.

Myself, I’m in the midst of a Mother’s Day dinner with my own family so I had ditched my phone… just moments before the employee called me from the remote office. I missed the call and the employee left a voicemail expressing a state of confusion over which server to power cycle. The organization is small and only has two servers. One is the SBS machine and the other is a HP MicroServer that is used as a network monitoring station and catchall for various extraneous services. I had assumed that over the years everyone had each server’s role understood by sight so I simply asked him to power cycle the SBS server, expecting that it would be known which piece of hardware that was. The fellow power cycled both servers since he couldn’t get in touch with me directly.

Okay, no big deal. The MicroServer is just running CentOS and OpenNMS. They’re resilient and can handle a sudden shutdown. As I listened to that voicemail, I checked to see if I could remotely connect to the server that had been down all night. I couldn’t. Great. Time to call the office and talk to the person who was on site and see what else could be done. Except the voicemail had been left over an hour ago and the employee had naturally left shortly after power cycling the server. I called his cell phone back, but he didn’t pick up. I left a voicemail.

A little later that Sunday I get in touch with another employee in the area who lives closer. He’s on his way out to pick up Mother’s Day dinner for his wife and can swing by to check out the server. First, I have him power cycle it again. Maybe the first guy just clicked the power button and didn’t hold it in? I held out hope for such a simple explanation. However, after I instructed this second person on how to make sure the server had shut down and then powered up, I waited for the duration of the standard bootup but nothing was showing up. It became apparent that the server was not coming back online.

“Do you know where the Spider is?” I asked hopefully. “No, I dunno where the other guy put it.” Gah! The Spider is a well known piece of equipment in that office, and it’s very rare that it can’t be found. I was about to concede defeat for that Sunday when, after some searching, the employee found the Spider. A few minutes of scrambling around and he had the thing hooked up to the server. Except… now I couldn’t get to the Spider. The fellow had to leave to pick up dinner and I wasn’t about to ruin his family’s Mother’s Day so I told him I’d see what I could do remotely, expecting nothing to be successful.

In the process of hooking up the Lantronix Spider, the employee had pulled the network cable out of the server and put it into the Spider. Then from the spider’s cascade port (it’s essentially a one port switch) he had connected a patch cable to the server’s LAN port. That made me wonder… perhaps it was a port on the ProCurve switch that was bad? That would explain both the server and now the Lantronix Spider being inaccessible. Or maybe the port spontaneously shut down as a result of some bug. Crazier things have happened.

I browsed to the switch’s management interface. “Please enter your username and password!” Okay, no problem! “Wait… I can’t remember what the password is… NOOOOOO!” The organization uses KeePass to store important passwords and software keys. The KeePass file is on the server. The server that is down.

But wait! I have a copy of the keepass databases on my own storage. Once a month or so I copy the files to my local storage so that I have an in-sync copy just in case. Whew! I find the switch’s login credentials and begin inspecting things. I looked, hoping for some bad news concerning the switch’s health (at least that would mean the server was okay), but the switch looked perfect. Nothing was amiss.

I’ve always been told to troubleshoot network problems from the lowest layer first. I had pretty much ruled out the physical layer. Layer 2 seemed healthy. Not much that can go wrong on a small, single subnet LAN. Layer three, IP… IP addresses… I gritted my teeth. I knew what the problem was. The Lantronix Spider is set to pick up an address via DHCP. Specifically it’s a DHCP reservation on the network’s DHCP server. The server that’s down. I wanted the network layer benefits of a static IP address, however I also wanted it to be easily portable between networks. My original idea was that the Spider could be used to support PCs on other LANs, like perhaps workers that were based in their home office that didn’t come into the organization’s building very often. With the Spider getting an IP address via DHCP, I could just tell someone to take it home with them and I’d only be left with walking them through configuring port forwarding, or getting TeamViewer set up on a PC on their LAN so I could get in and access the Spider via a local web browser. Except now the Spider was barking out forlorn DHCP discover packets and not getting any response back.

I fired up Network Monitor on an office PC to be sure. Yep, there it was. A DHCP discover request broadcasting every sixty seconds or so. Okay, I can handle this. The small office has a SonicWall firewall that has DHCP services on it. I only need to enable them, check its list of leases to find what IP address it was given, and I’ll be good! I mosey my web browser on over to the firewall’s administrative page. I stare at it. It wants the password for the admin user. “Password… password… I had to change it a few weeks ago. What did I choose…”

Oh well, I’ll look in the organization’s copied password file that I keep on my local storage! Yay foresight!! I found the firewall admin password and entered it. “Password Failure. Please Retry.” What?! Then I remembered that I had changed the firewall password due to security policy about two weeks ago. However, I hadn’t copied the organization’s password file to my local storage in a month. I had the old password in my copy of the password file, but not the new one. The new one was on the server that was currently down. Backups are taken every few hours, but a restoration needs to be done on functioning hardware. Super.

So that means I did it again. I couldn’t log in to the interface because I didn’t have the long password committed to memory. For super important passwords like that, I do keep a disaster recovery hard copy around. It’s essentially a few pages spelling out the most important usernames and password for the organization. However, only two people have that physical copy of information. While I could call them up and have them read off the password to me, I wasn’t ready to do that.

Instead, I turned to the HP MicroServer running CentOS 6. I have OpenNMS installed on it and have plans to install some ticketing software and maybe smokeping or M/Monit. Now, however, it’s going to be an impromptu DHCP server. Fortunately I can remember the password for the MicroServer! A quick ‘yum install dhcpd’ later and… “Couldn’t resolve host ‘centos-distro.cavecreek.net’” WHAT DEVILRY IS THIS?! But of course; DNS for the network is performed by the SBS server… which is down. After facepalming, I changed resolv.conf to point to OpenDNS and continued my march towards a functioning DHCP server on the network. After a few minutes I have dhcpd running and it quickly hands out a lease to the Spider.
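For readers who want to picture the stopgap, here is a minimal sketch of the kind of configuration an impromptu dhcpd needs. It is not the author's actual file: the function name make_dhcpd_conf and every subnet, range, and router address below are hypothetical placeholders, and the lease times are arbitrary. The two resolver addresses are the public OpenDNS servers, used because the network's usual DNS lives on the server that is down. On CentOS 6 the daemon is normally provided by the package named dhcp and reads /etc/dhcp/dhcpd.conf, but treat that as an assumption to verify on your own box.

```shell
#!/bin/sh
# Sketch only: print a minimal dhcpd.conf for a single subnet.
# All addresses passed in are illustrative, not taken from the post.
make_dhcpd_conf() {
    subnet=$1
    netmask=$2
    range_start=$3
    range_end=$4
    router=$5
    cat <<EOF
# Hand out OpenDNS resolvers while the internal DNS server is down
option domain-name-servers 208.67.222.222, 208.67.220.220;
default-lease-time 600;
max-lease-time 7200;

subnet $subnet netmask $netmask {
  range $range_start $range_end;
  option routers $router;
}
EOF
}

# Example invocation with hypothetical addresses; redirect the output
# to the dhcpd configuration file once it looks right.
make_dhcpd_conf 192.168.1.0 255.255.255.0 192.168.1.200 192.168.1.220 192.168.1.1
```

Keeping the temporary range small and outside the normal server's scope makes it easier to spot and clean up these leases once the real DHCP server is back.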

And it was then that I saw it. After logging into the Spider, I viewed the remote console and saw a Windows installation screen on the server. Suddenly, I remembered what happened. In the process of preparing for a migration away from the failing hardware, I needed to experiment with making an unattended installation file. I had a remote worker put the SBS 2008 install CD in the main server’s tray. Of course, rebooting caused the server to boot into the high boot priority CD drive. I sat in horror, thinking about my cascade of failures. Nevertheless, that wasn’t the time to flail in self loathing. I simply needed to hit “cancel” and get out of the installation welcome screen to boot from the hard drive.

Except the Spider was unable to interact with the server as a remote keyboard or mouse. I’ve used the Spider on that very server in the past, and it worked great at all stages of the boot process. In the years that I’ve worked with that office I’ve had to check BIOS settings, ILO firmware settings, and storage controller settings, all using either the Spider or the ILO itself. But now, for some unexplained reason, the Spider was not able to input anything. I couldn’t move the mouse, I couldn’t press keys. So I sat and stared at the remote video in complete disbelief.

It was a simple matter of leaving a voicemail for someone and telling them to remove the disc from the DVD drive the next time they were in the office. The next morning the worker that I left a message for did just that, power cycled the server, and it booted up as normal. Life continued.

I was abashed.

More about my conclusions concerning the situation later. In the mean time, got a similar story to share? Let me know in the comments below or contact me and you can write a guest blog post about it.

by Wesley David at May 17, 2013 03:38 AM

May 16, 2013

Ubuntu Geek

How to Install Cinnamon 1.8 on ubuntu 13.04

Cinnamon is a user interface. It is a fork of GNOME Shell, initially developed by (and for) Linux Mint. It attempts to provide a more traditional user environment based on the desktop metaphor, like GNOME 2. Cinnamon uses Muffin, a fork of the GNOME 3 window manager Mutter, as its window manager from Cinnamon 1.2 onwards
(...)

by ruchi at May 16, 2013 11:54 PM

Standalone Sysadmin

Busy, Busy, Busy

I might not notice it at the time, but I can always tell how busy I am by how many blog posts I manage to get live. By my count, I've been doing about one every eight days so far this month (if you count this one). So I'm behind :-) So what's been going on?

LOPSA-East

But I've been doing good, fun things. For instance, on May 3rd and 4th, I went to LOPSA-East, which was yet another really great conference. There were somewhere around 150 attendees this year, and it was really nice to see everyone again from previous years.

Way back in October of 2011 (were some of you even born then?), I asked about a class on SSDs, to see if there was any interest. Well, in October of 2011, the earliest I could have done it was spring of 2012, and didn't get around to finishing the course before then, so spring of 2013 it was, and I taught the SSD class on Saturday afternoon. Only three years in the making. That's cool, right? :-D

If you were in my class, you probably have the slides from the USB key. If you weren't in my class, then you'll be happy to know that since I don't really intend to teach the class again (although if my feedback is overwhelmingly positive, I'll consider it), I opted to have it recorded, and whenever that goes live, I'll be linking to it from here and including my full slide deck, too.

Storage Field Day

At the end of April, I went to Denver to do Storage Field Day. I haven't had a chance to write about the things I saw yet, but I'm very excited to talk about what we saw with Pernix Data. If you want to see some cool ideas, watch the videos there. I'll write more as soon as I get time.

LOPSA stuff

We're still in the swing of the election season. You might have seen when I updated my earlier post that the LOPSA Live transcript had been posted. That was the first of two candidate sessions. The other is tonight at 9pm, so follow the instructions by Aaron Sachs for connecting to #LOPSA-Live on Freenode and come ask the candidates good, hard questions.

The election is coming up next month. I've posted my series of discussions on internal concerns (including membership numbers, member communications, and operational transparency). Starting tomorrow, I'm going to start posting discussions related to external concerns - we have a lot of problems with marketing and how we're seen externally...when we're seen at all. Make sure to watch for those blog entries, too.

LISA Training

I haven't posted anything about it here, but I'm working with Dan Klein to help get training ideas for LISA'13. For the past several years, I've been involved as a blogger at the LISA conference (along with Ben Cotton, Marius Ducea, Greg Riedesel, and many others). I'm planning on continuing that for as long as they'll have me, but it's also nice to be able to contribute to the program in some small way, too. This means that if there's training that you think LISA should have, but doesn't, let me know and I'll do my best to figure out how we can have it.

Actual, "I get paid to do this" work stuff

At work, we've been doing all kinds of things. I've now got a production vSphere cluster, a new Nimble storage box, I'm trying desperately to get new gear for my core switch (I'm going with a pair of Nexus 5548s and six FEX to go along), and I need to order five or six more server racks to replace some of the ones we have now.

I continue to be mystified by the way that academia works. Specifically, budgeting and deadlines. For reasons that I'm unable to fathom, in order to get things on this year's budget, I have to order hardware and have it delivered and in my space by the end of June. Not, "ordered and paid for". Ordered, delivered, and in my space. I've thought about it, and I can't come up with any kind of compelling reason for this rule. Anyone with more experience in academia than I have want to weigh in? I'm at a loss.

Personal Stuff

I've finally bit the bullet and decided to get LASIK.

I'm in a large-ish metro area now, and the technology has been continually developing for a couple of decades, and I think it's matured to the point where I'm cool with people cutting my eye open and burning part of it away using lasers. I can't be 100% about technology enhancing our lives unless I walk the walk and take advantage of it, so I'm doing it.

I went in last week for my "free consultation", which determined that I was an excellent fit for normal "LASIK" surgery. If my cornea had been too thin, I guess I could have gotten either LASEK or PRK, both of which work well but have a longer healing and recovery time. Turns out my cornea is just fine.

Also, can I just say - they have the coolest eye equipment I've ever seen there. I've worn glasses or contacts since elementary school, and I've lived in a dozen cities or so since then, so I've seen my share of optometry equipment, but man, the toys the LASIK guys have are nuts. I'm practically blind, so when they said, "take off your glasses and look in this machine, and you'll see a hot-air balloon", I thought, "please, I'll be lucky to see a blurry light". Sure enough, looking into the machine, it was blurry...for a second. Then, like a camera, it "autofocused" and just like that, they had nearly my exact prescription. Awesome!

So the whole "lasering my eyeballs" thing is happening tomorrow afternoon. I honestly can't wait. I've been thinking about it for years, and having it this close is really exciting. I'll make sure to update early next week with the results.

So there you go. That's what I've been up to. I'll try to get back to posting more regularly, and maybe even on topics that you care about! Wouldn't that be exciting? ;-)

We'll see. Thanks!

by Matt Simmons at May 16, 2013 10:22 AM

Aaron Johnson

Chris Siebenmann

Why ZFS's CDDL license matters for ZFS on Linux

In a G+ conversation about ZFS I read the following:

[...] so, why use BTRFS at all? :-) Just the fact that it's GPL (and so able to be embedded into the kernel source tree) doesn't seem enough, specially considering that CDDL (the ZFS license) is a bona fide open source license, [...]

On the whole I like ZFS on Linux, but let's not mince words here: this licensing issue is a big issue. Were btrfs and ZFS close to general parity, it would be a very strong push towards btrfs.

That ZFS is CDDL licensed means that it can never be included in the Linux kernel source. It may mean that it can't be prepackaged in binary form by distributions, or at least by distributions that care strongly about licensing issues. The CDDL is part of what makes it extremely unlikely that Red Hat Enterprise or Ubuntu LTS will ever officially support ZoL, making it always be a 'batteries not included, you get to integrate it' portion of the system.

That ZFS will not be included in the Linux kernel source (because of the CDDL among other reasons) means that you are more at risk of developers ceasing to update ZFS for newer kernels (among other less important effects).

(Being in the Linux kernel source is no guarantee that code will be maintained, but it increases the chances a fair bit.)

These are risks that we'd be willing and able to take on, so they aren't real obstacles for us using ZoL if that turns out to be the best option for new fileservers. But they still weigh on my mind and there are any number of places where they are going to be real issues, sometimes killer ones.

(I've written about this before.)

(Given the current situation with 4k disks, we're already looking at recreating pools when we move them to a new fileserver infrastructure. At that point we could just as easily migrate from ZFS to something else, if the something else was good enough. Btrfs currently does not qualify.)

by cks at May 16, 2013 05:17 AM

May 15, 2013

UnixDaemon

Facter 1.7+ and External facts

While Puppet may get all the glory, Facter, the hard working information gathering library that can, seldom gets much exciting new functionality. However with the release of Facter 1.7 Puppetlabs have standardised and included a couple of useful facter enhancements that make it easier than ever to add custom facts to your puppet runs.

These two improvements come under the banner of 'External Facts'. The first allows you to surface your own facts from a static file, either plain text key value pairs or a specific YAML / JSON format. These static files should be placed under /etc/facter/facts.d:


$ sudo mkdir -p /etc/facter/facts.d

# note - the .txt file extension
$ echo 'external_fact=worked' | sudo tee /etc/facter/facts.d/external_test.txt
external_fact=worked

$ facter external_fact
worked

At its simplest this is a way to surface basic, static details from system provisioning and other similar large events, but it's also an easy way to include details from other daemons and cronjobs. One of my first use cases for this was to create 'last_backup_time' and 'last_backup_status' facts that are written at the conclusion of my backup cronjob. Having the values inserted from out of band is a much nicer prospect than writing a custom fact that parses the cron logs.
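As an illustration of that backup use case, the snippet below is a minimal sketch rather than the author's actual cronjob. The function name write_backup_facts, the file name backup.txt, and the timestamp format are all assumptions; only the fact names and the /etc/facter/facts.d directory come from the text above. It writes to a temporary file first and renames it, so facter should never read a half-written file.

```shell
#!/bin/sh
# Sketch only: record the outcome of a backup run as external facts.
# $1 = status string (for example "success" or "failed")
# $2 = facts directory (defaults to the directory named in the post)
write_backup_facts() {
    status=$1
    facts_dir=${2:-/etc/facter/facts.d}

    # Temp file lives in the same directory so the final mv is a rename.
    tmp=$(mktemp "$facts_dir/.backup_facts.XXXXXX") || return 1
    {
        echo "last_backup_time=$(date -u +%Y-%m-%dT%H:%M:%SZ)"
        echo "last_backup_status=$status"
    } > "$tmp"

    # Facts should be world-readable; the .txt extension marks key=value format.
    chmod 644 "$tmp"
    mv "$tmp" "$facts_dir/backup.txt"
}

# Typical use, as the last step of a backup job run as root:
#   write_backup_facts success
```

Afterwards, running facter last_backup_status on that host should print whatever status string the job recorded.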

If that's a little too static for you then the second usage might be what you're looking for. Any executable scripts dropped in the same directory that produce the same output formats as allowed above will be executed by facter when it's invoked.


# scripts must be executable!
$ sudo chmod a+rx /etc/facter/facts.d/process_count

$ cat /etc/facter/facts.d/process_count
#!/bin/bash

count=$(ps -efwww | wc -l | tr -d ' ')
echo "process_count=$count"

$ facter process_count
209

The ability to run scripts that provide facts and values makes customisation easier in situations where ruby isn't the best language for the job. It's also a nice way to reuse existing tools or for including information from further afield - such as the current binary log in use by MySQL or Postgres or the host's current state in the load balancer.
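A single executable can also print more than one key=value line, giving one fact per line. The following is a small sketch of that idea; the function name emit_facts and the fact names root_fs_used_percent and logged_in_users are made up for illustration and are not built-in facter facts.

```shell
#!/bin/sh
# Sketch only: an external fact script that emits two facts, one per line.
emit_facts() {
    # Percentage of the root filesystem in use, digits only.
    # df -P keeps each filesystem on a single line so field 5 is reliable.
    root_pct=$(df -P / | awk 'NR==2 { sub(/%/, "", $5); print $5 }')

    # Number of login sessions currently listed by who.
    users=$(who | wc -l | tr -d ' ')

    echo "root_fs_used_percent=$root_pct"
    echo "logged_in_users=$users"
}

emit_facts
```

Dropped into the facts directory and made executable, a script like this would be run each time facter is invoked, so it is worth keeping such scripts fast and free of side effects.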

While there have been third party extensions that provided this functionality for a while it's great to see these enhancements get included in core facter.


May 15, 2013 11:29 PM

SysAdmin1138

Yes, that happens

We all know it can happen, a BIOS update of some kind bricks whatever just got flashed, but it's one of those things you hope happens to other people first so you know not to go there. It happened to me recently, which got me thinking about continuous deployment from a hardware POV. Hardware being what it is, hard, you can't iterate and roll-back the way you can do software. There is no such thing as Vagrant for Embedded Systems that I've found!

The problem of, "when do I update the firmware for my server," is one that faces anyone with a physical infrastructure. There isn't really a globally accepted best-practice for this one, though the closest I can find is:

If the vendor lists the update as critical, apply it.
If you're experiencing one of the problems listed in the fixes, apply it.
If vendor tech-support tells you to apply it, apply it.
Otherwise, don't apply it.

But only apply it to a test device first to verify it actually fixes the problem. Then roll it out.
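The rule of thumb above is simple enough to write down as a tiny decision helper, which can be handy in a runbook. This is only a sketch of the list as stated; the function name firmware_decision and its yes/no arguments are invented for illustration.

```shell
#!/bin/sh
# Sketch only: encode the firmware rule of thumb.
# $1 = vendor lists the update as critical (yes/no)
# $2 = you are hitting a problem listed in the fixes (yes/no)
# $3 = vendor tech support told you to apply it (yes/no)
firmware_decision() {
    vendor_critical=$1
    hitting_fixed_bug=$2
    support_requested=$3

    if [ "$vendor_critical" = yes ] ||
       [ "$hitting_fixed_bug" = yes ] ||
       [ "$support_requested" = yes ]; then
        # Any one reason is enough, but a test device always goes first.
        echo "apply: test device first, then roll out"
    else
        echo "skip"
    fi
}

# Example: not critical, but we are hitting a listed bug.
firmware_decision no yes no   # prints "apply: test device first, then roll out"
```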

Doing so pro-actively is kind of risky, and only really useful in repurposing scenarios. Also, this 'best practice' assumes you have identical hardware to actually test with. Which a lot of us don't, and often can't due to slight differences between servers of the same model.

So. For those of us who are working on infrastructures either small enough to not be able to afford test hardware, or diverse enough that there is no such thing as a common class of machine, what are we to do?

Hope, mostly, and trust in your vendor support contracts to ship you new hardware in case you get a brick.

Or, trust in your redundancies and treat new-firmware-updates like a lost-server outage. If you get a brick, you're still within your failure tolerance and know not to go there for the rest of 'em. This is the approach we ended up taking, and it worked. We were running without our scale-test environment for a few days but production was unaffected until we could unbrick the affected machines.

In our case I suspect we had a v1.0 hardware revision, and the newest firmware was only backwards compatible for v1.0a and newer or something. I don't have proof of this, but that's what it feels like. Of course, this eventuality was not mentioned in the release-notes anywhere. Thus, testing.

by SysAdmin1138 at May 15, 2013 07:36 PM

Simplehelp

How to “Split” the iPad Keyboard

This very brief tutorial will show you how to ‘split’ the keyboard on your iPad so that you can type more comfortably using just your thumbs.

Follow the simple steps below to enable this lesser known feature (hat tip to Eric Hogg).

  1. Open any App that uses the keyboard. Put your thumbs on the middle of the keyboard and ‘swipe’ outwards.

  2. Now your keyboard is split into two halves – making it much easier to type using only your thumbs – which also makes it easier to hold your iPad while typing.

  3. To revert to the normal/default keyboard, simply swipe the keyboards back together.

by Ross McKillop at May 15, 2013 06:11 PM

Google Webmasters

Using schema.org markup for organization logos

Webmaster level: all

Today, we’re launching support for the schema.org markup for organization logos, a way to connect your site with an iconic image. We want you to be able to specify which image we use as your logo in Google search results.

Using schema.org Organization markup, you can indicate to our algorithms the location of your preferred logo. For example, a business whose homepage is www.example.com can add the following markup using visible on-page elements on their homepage:

<div itemscope itemtype="http://schema.org/Organization">
  <a itemprop="url" href="http://www.example.com/">Home</a>
  <img itemprop="logo" src="http://www.example.com/logo.png" />
</div>

This example indicates to Google that this image is designated as the organization’s logo image for the homepage also included in the markup, and, where possible, may be used in Google search results. Markup like this is a strong signal to our algorithms to show this image in preference over others, for example when we show Knowledge Graph on the right hand side based on users’ queries.

As always, please ask us in the Webmaster Help Forum if you have any questions.

by Google Webmaster Central (noreply@blogger.com) at May 15, 2013 02:52 PM

Yellow Bricks

EMC ViPR; My take


When I started writing this article I knew people were going to say that I am biased considering I work for VMware (EMC owns a part of VMware), but so be it. It is not like that has ever stopped me from posting anything about potential competitors, so it will not stop me now either. After seeing all the heated debates on twitter between the various storage vendors I figured it wouldn’t hurt to try to provide my perspective. I am looking at this from a VMware Infrastructure point of view and with my customer hat on. Considering I have a huge interest in Software Defined Storage solutions this should be my cup of tea. So here you go, my take on EMC ViPR. Note that I have not actually played with the product yet (like most people providing public feedback), so this is purely about the concept of ViPR.

First of all, when I wrote about Software Defined Storage one of the key requirements I mentioned was the ability to leverage existing legacy storage infrastructures… The primary reason for this is the fact that I don’t expect customers to deprecate their legacy storage all at once, if they will at all. Keep that in mind when reading the rest of the article.

Let me briefly summarize what EMC introduced last week. EMC introduced a brand new product called ViPR. ViPR is a Software Defined Storage product; at least this is how EMC labels it. Those who read my articles on SDS know the “abstract / pool / automate” motto by now, and that is indeed what ViPR can offer:

  • It allows you to abstract the control path from the actual underlying hardware, enabling management of different storage devices through a common interface
  • It enables grouping of different types of storage into a single virtual storage pool. Based on policies/profiles the right type of storage can be consumed
  • It offers a single API for managing various devices; in other words a lower barrier to automate. On top of that, when it comes to integration it for instance allows you to use a single “VASA” (vSphere APIs for Storage Awareness) provider instead of the many needed in a multi-vendor environment

So what does that look like?

What surprised me is that ViPR not only works with EMC arrays of all kinds but will also work for 3rd party storage solutions. For now NetApp support has been announced but I can see that being extended, and I know EMC is aiming to. You can also manage your fabric using ViPR; do note that this is currently limited to just a couple of vendors, but how cool is that? When I did vSphere implementations the one thing I never liked doing was setting up the FC zones. ViPR makes that a lot easier and I can also see how this will be very useful in environments where workloads move around clusters. (Chad has a great article with awesome demos here) So what does this all mean? Let me give an example from a VMware point of view:

Your infrastructure has 3 different storage systems. Each of these systems has various data services and different storage tiers. Now when you need to add new data stores or introduce a new storage system without ViPR, it would mean you will need to add new VASA providers, create LUNs, present these, potentially label these, see how automation works as API implementations typically differ, etc. Yes, a lot of work, but what if you had a system sitting in between you and your physical systems that takes some of these burdens on? That is indeed where ViPR comes into play. Single VASA provider on vSphere, single API, single UI and self-service.

Now what is all the drama about then, I can hear some of you think, as it sounds pretty compelling. To be honest, I don’t know. Maybe it was the messaging used by EMC, or maybe the competition in the Software Defined space thought the world was crowded enough already? Maybe it is just the way of the storage industry today; considering all the heated debates witnessed over the last couple of years that is a perfectly viable option. Or maybe the problem is that ViPR enables a Software Defined Storage strategy without necessarily introducing new storage. Meaning that where some pitch a full new stack, in this case the current solution is used and a man-in-the-middle solution is introduced.

Don’t get me wrong, I am not saying that ViPR is THE solution for everyone. But it definitely bridges a gap and enables you to realise your SDS strategy. (Yes I know, there are other vendors who offer something similar.) ViPR can help those who have an existing storage solution to: abstract / pool / automate. Yes indeed, not everyone can afford to swap out their full storage infrastructure for a new so-called Software Defined Storage device and that is where ViPR will come in handy. On top of that, some of you have, and probably always will have, a multi-vendor strategy… again this is where ViPR can help simplify your operations. The nice thing is that ViPR is an open platform; according to Chad, source code and examples of all critical elements will be published so that anyone can ensure their storage system works with ViPR.

I would like to see ViPR integrate with host-local caching solutions; it would be nice to be able to accelerate specific datastores (read caching / write back / write through) using a single interface / policy, meaning as part of the policy ViPR surfaces to vCenter. The same applies to host-side replication solutions, by the way. I would also be interested in seeing how ViPR will integrate with solutions like Virtual Volumes (VVOLs) when it is released… but I guess time will tell.

I am about to start playing with ViPR in my lab so this is all based on what I have read and heard about ViPR (I like this series by Greg Schultz on ViPR). My understanding, and opinion, might change over time and if so I will be the first to admit and edit this article accordingly.

I wonder how those of you who are on the customer side look at ViPR, and I want to invite you to leave a comment.

"EMC ViPR; My take" originally appeared on Yellow-Bricks.com. Follow me on twitter - @DuncanYB.

by Duncan Epping at May 15, 2013 11:56 AM

Google Blog

Live from Google I/O: Mo’ screens, mo’ goodness

This morning, we kicked off the 6th annual Google I/O developer conference with over 6,000 developers at Moscone Center in San Francisco, 460 I/O Extended sites in 90 countries, and millions of you around the world who tuned in via our livestream. Over the next three days, we’ll be hosting technical sessions, hands-on code labs, and demonstrations of Google's products and partners' technology.

We believe computing is going through one of the most exciting moments in its history: people are increasingly adopting phones, tablets and newer types of devices. And this spread of technology has the potential to make a positive impact in the lives of people around the world, whether it's simply helping you in your daily commute, or connecting you to information that was previously inaccessible.

This is why we focus so much on our two open platforms: Android and Chrome. They enable developers to innovate and reach as many people as possible with their apps and services across multiple devices. Android started as a simple idea to advance open standards on mobile; today it is the world’s leading mobile platform and growing rapidly. Similarly, Chrome launched less than five years ago from an open source project; today it’s the world’s most popular browser.

In line with that vision, we made several announcements today designed to give developers even more tools to build great apps on Android and Chrome. We also shared new innovations from across Google meant to help make life just a little easier for you, including improvements in search, communications, photos, and maps.

Here’s a quick look at some of the announcements we made at I/O:

  • Android & Google Play: In addition to new developer tools, we unveiled Google Play Music All Access, a monthly music subscription service with access to millions of songs that joins our music store and locker; and the Google Play game services with real-time multiplayer and leaderboards. Also, coming next month to Google Play is a special Samsung Galaxy S4, which brings together cutting edge hardware from Samsung with Google’s latest software and services—including the user experience that ships with our popular Nexus devices.
  • Chrome: With over 750 million active users on Chrome, we’re now focused on bringing to mobile the speed, simplicity and security improvements that we’ve seen on the desktop. To that end, today we previewed next-generation video codec VP9 for faster video-streaming performance; the requestAutocomplete API for faster payments; and Chrome Experiments such as "A Journey Through Middle Earth" and Racer to demonstrate the ability to create immersive mobile experiences not possible in years past.
  • Google+: We unveiled the newly designed Google+, which helps you easily explore content and dramatically improves your online photo experience to give you crisp, beautiful photos, without the work! We also upgraded Google+ Hangouts, our popular group video application, to help bring all of your real-life conversations online, across any device or platform, and with groups of up to 10 friends.
  • Search: Search has evolved considerably in recent years: it can now have a real conversation with you, and even make your day a bit smoother by predicting information you might need. Today we added the ability to set reminders by voice and we previewed “spoken answers” on laptops and desktops in Chrome—meaning you can ask Google a question and it will speak the answer back to you.
  • Maps: Today we previewed the next generation of Google Maps, which gets rid of any clutter in order to put your individual experience and exploration front and center. Each time you click or search, our technology draws you a tailored map that highlights the information you need. From design to directions, the new Google Maps is smarter and more useful.

Technology can have a profound, positive impact on the daily lives of billions of people. But we can’t do this alone—developers play a crucial role. I/O is our chance to come together and thank you for everything you do.

by Emily Wood (noreply@blogger.com) at May 15, 2013 12:48 PM

Aaron Johnson

Chris Siebenmann

Why I've so far been neglecting functional programming languages

Functional programming languages are in many ways the latest hotness and so for years I've been making off and on runs at things like yet another explanation of monads (which I think I sort of understand by now) and similar topics. Despite this, so far I've been almost completely uninterested in actually trying to write a functional program or exploring a FP language.

The big problem for me is that as far as I can tell, the kind of programs I usually work with are exactly the kind of programs that functional programming is stereotypically a bad fit for. The stereotype I've absorbed is that functional programming is quite a good fit for computation but not a good fit for IO, because IO intrinsically has side effects. Unfortunately most of what I write is all about IO and has little or no computation. Bashing a squarish peg into a roundish hole is unlikely to tell me anything particularly meaningful about how nice the language is to work in; what I really need is a roundish peg, a computational problem, and those are relatively scarce around here.

(It's possible that I'm not looking hard enough. For example, I do periodically want to do things like log analysis or event reassembly, where the original data could just as well be a predefined data structure in the program instead of processed from logfiles on disk. I suspect that a functional language would handle these fine, maybe better than ad-hoc hackery in awk, Python, or whatever. If I was really crazy I would try rewriting the logic in our ZFS spares handling system in an FP language to see if it got clearer; it's fundamentally a series of transformations of a tree and then some analysis of the result. The result might even be more testable.)

by cks at May 15, 2013 04:57 AM

May 14, 2013

Ubuntu Geek

my other pc is a cloud

My Entry for the Advanced Event #3 of the 2013 Scripting Games

Halfway done.  Here's my third entry for this year's Powershell games.  I used a workflow this time, mostly in an attempt to garner favor from the voters for using new features exclusive to PS3.  Even though the multithreading with jobs that I did in the last event is a neat idea, it really doesn't perform very well.  The workflow will likely perform better, though I don't know if it's going to handle the throttling of thread creation if I handed it a list of 500 computers.

#Requires -Version 3
Function New-DiskSpaceReport
{
	<#
		.SYNOPSIS
			Gets hard drive information from one or more computers and saves it as HTML reports.
		.DESCRIPTION
			Gets hard drive information from one or more computers and saves it as HTML reports.
			The reports are saved to the specified directory with the name of the computer in
			the filename. The list of computers is processed in parallel for increased speed.
			Use the -Verbose switch if you want to see console output, which is very useful if you
			are having problems generating all the desired reports.
		.PARAMETER ComputerName
			One or more computer names from which to get information. This can be a
			comma-separated list, or a file of computer names one per line. The alias
			of this parameter is -Computer. The default value is the local computer.
		.PARAMETER Directory
			The directory to write the HTML files to. E.g., C:\Reports. The directory
			must exist. The default is the current working directory.
		.INPUTS
			[String[]]$ComputerName
			This is an array of strings representing the hostnames of the computers
			for which you want to retrieve information. This can also be supplied by
			(Get-Content file.txt). This can be piped into the cmdlet.
		.INPUTS
			[String]$Directory
			The directory to save the HTML reports to. The directory must exist.
		.OUTPUTS
			HTML files representing the information obtained from all
			the computers supplied to the cmdlet.
		.EXAMPLE
			New-DiskSpaceReport
			
			This will generate a report for the local computer and output the HTML file to
			the current working directory.			
		.EXAMPLE
			New-DiskSpaceReport -ComputerName server01,server02,server03 -Directory C:\Reports
			
			This will generate three HTML reports for the servers and save them in the C:\Reports
			directory.
		.EXAMPLE
			New-DiskSpaceReport -Computer (Get-Content .\computers.txt)
			
			This will generate HTML reports for all the computers in the computers.txt file and
			save the reports in the current working directory.
		.EXAMPLE
			,(Get-Content .\computers.txt) | New-DiskSpaceReport -Directory C:\Reports
			
			This will generate HTML reports for all the computers in the computers.txt file and
			save the reports in C:\Reports. Please note the leading comma in this example.
		.NOTES
			Scripting Games 2013 Advanced Event 3
	#>
	[CmdletBinding()]
	Param([Parameter(ValueFromPipeline=$True)]
			[Alias('Computer')]
			[String[]]$ComputerName = $Env:Computername,
		  [Parameter()]
			[ValidateScript({Test-Path $_ -PathType Container})]
			[String]$Directory = (Get-Location).Path)
	
	Write-Verbose -Message "Writing reports to $Directory..."
	
	WorkFlow BuildReports
	{
		Param([String[]]$Computers, [String]$Directory)
		ForEach -Parallel ($Computer In $Computers)
		{			
			InlineScript
			{				
				Write-Verbose -Message "Generating report for $Using:Computer..."
				$Header = @'
				<title>Disk Free Space Report</title>
				<style type="text/css">
					<!--
						TABLE { border-width: 1px; border-style: solid;  border-color: black; }
						TD    { border-width: 1px; border-style: dotted; border-color: black; }
					-->
				</style>
'@
				$Pre  = "<p><h2>Local Fixed Disk Report for $Using:Computer</h2></p>"
				$Post = "<hr><p style=`"font-size: 10px; font-style: italic;`">This report was generated on $(Get-Date)</p>"
				Try
				{					
					$LogicalDisks = Get-WMIObject -Query "SELECT * FROM Win32_LogicalDisk WHERE DriveType = 3" -ComputerName $Using:Computer -ErrorAction Stop | Select-Object -Property DeviceID,@{Label='SizeGB';Expression={"{0:N2}" -F ($_.Size/1GB)}},@{Label='FreeMB';Expression={"{0:N2}" -F ($_.FreeSpace/1MB)}},@{Label='PercentFree';Expression={"{0:N2}" -F (($_.Freespace/$_.Size)*100)}};
					$LogicalDisks | ConvertTo-HTML -Property DeviceID, SizeGB, FreeMB, PercentFree -Head $Header -PreContent $Pre -PostContent $Post | Out-File -FilePath $(Join-Path -Path $Using:Directory -ChildPath $Using:Computer`.html)
					Write-Verbose -Message "Report generated for $Using:Computer."
				}
				Catch
				{
					Write-Verbose -Message "Cannot build report for $Using:Computer. $($_.Exception.Message)"
				}
			}
		}
	}
	
	If($PSBoundParameters['Verbose'])
	{
		BuildReports -Computers $ComputerName -Directory $Directory -Verbose
	}
	Else
	{
		BuildReports -Computers $ComputerName -Directory $Directory
	}
}

by ryan@myotherpcisacloud.com at May 14, 2013 02:09 PM

Yellow Bricks

vSphere HA – VM Monitoring sensitivity


Last week there was a question on VMTN about VM Monitoring sensitivity. I could have sworn I did an article on that exact topic, but I couldn’t find it. I figured I would do a new one with a table explaining the levels of sensitivity that you can configure VM Monitoring to.

The question that was asked was based on a false positive response of VM Monitoring: the virtual machine was frozen due to the consolidation of snapshots, and VM Monitoring responded by restarting the virtual machine. As you can imagine the admin wasn’t too impressed, as it caused downtime for his virtual machine. He wanted to know how to prevent this from happening. The answer was simple: change the sensitivity, as it is set to “high” by default.

As shown in the table, high sensitivity means that VM Monitoring responds to a missing “VMware Tools heartbeat” within 30 seconds. However, before VM Monitoring restarts the VM, it will check whether there was any storage or networking I/O during the last 120 seconds (advanced setting: das.iostatsInterval). If the answer is no to both, the VM will be restarted. So if you feel VM Monitoring is too aggressive, change it accordingly!

Sensitivity   Failure Interval   Max Failures   Max Failures Time Window
Low           120 seconds        3              7 days
Medium        60 seconds         3              24 hours
High          30 seconds         3              1 hour

Do note that you can change the above settings individually as well in the UI, as seen in the screenshot below. For instance, you could manually increase the failure interval to 240 seconds. How you should configure it is something I cannot answer; it should be based on what you feel is an acceptable response time to a failure, and on where the sweet spot lies for avoiding false positives. A lot to think about indeed when introducing VM Monitoring.

"vSphere HA – VM Monitoring sensitivity" originally appeared on Yellow-Bricks.com. Follow me on twitter - @DuncanYB.

by Duncan Epping at May 14, 2013 12:30 PM

Chris Siebenmann

My language irritations with Go (so far) and why I'm wrong about them

The great thing about an evolving language is that if you're slow enough about writing up your irritations with it, some of them can wind up fixed (or part fixed). So this list is somewhat shorter than it was when I originally wrote my first Go program, and none of the irritations are major. Also, I will reluctantly concede that Go has good engineering reasons for all of them.

My largest single irritation is that break acts on switch and select; I expected it to act only on any enclosing control structure, so that you could write something like:

for {
   select {
   case <-mchan:
      // message silently swallowed
   case <-schan:
      break   // what I wanted: leave the for loop (Go only leaves the select)
   }
}

Instead you have to invent a boolean loop condition. I understand why Go does this; it enables you to exit early out of a switch or select case instead of having to wrap everything in ever increasing levels of nesting. This is likely especially important because Go uses explicit error checking (which would otherwise force those nested if blocks).
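
For what it's worth, Go also has labeled break statements, which can stand in for the boolean loop condition. A minimal sketch, where the drain function, the label and the channel names are all hypothetical and not taken from the original program:

```go
package main

import "fmt"

// drain swallows messages from msgs until stop is closed (or delivers a
// value), then returns how many messages it swallowed.
func drain(msgs <-chan string, stop <-chan struct{}) int {
	swallowed := 0
loop:
	for {
		select {
		case <-msgs:
			swallowed++ // message silently swallowed
		case <-stop:
			break loop // leaves the for loop, not just the select
		}
	}
	return swallowed
}

func main() {
	msgs := make(chan string)
	stop := make(chan struct{})
	done := make(chan int)

	go func() { done <- drain(msgs, stop) }()

	msgs <- "a" // unbuffered: returns once drain has received it
	msgs <- "b"
	close(stop)
	fmt.Println("swallowed:", <-done)
}
```

A plain break in the same position would only leave the select, and the for loop would go around again.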

The issue that got partially fixed is Go's return requirements. When I wrote the original version of my program the natural form of one function was a big switch with a number of specific cases and then a default: to catch the rest; however, the original rules required a surplus return at the end of the function, which irritated me by forcing me to move the default case to the end of the function, obscuring the logic. The Go 1.1 changes make my particular case okay but I believe there remain cases where you need an unreachable ending return (or panic) to make the compiler happy.
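
As an illustration of where the line falls under the Go 1.1 rules, here is a small sketch (the function names are made up). A switch with a default case in which every case ends in a return counts as a terminating statement; a switch without a default does not, even when its cases happen to cover every possible value.

```go
package main

import "fmt"

// sign needs no trailing return under the Go 1.1 rules: the switch has a
// default case and every case ends in a return.
func sign(n int) string {
	switch {
	case n < 0:
		return "negative"
	case n > 0:
		return "positive"
	default:
		return "zero"
	}
}

// signNoDefault covers every int too, but the compiler does not analyze
// the case expressions. Without a default the switch is not a terminating
// statement, so the final panic (or a return) is required to compile.
func signNoDefault(n int) string {
	switch {
	case n < 0:
		return "negative"
	case n > 0:
		return "positive"
	case n == 0:
		return "zero"
	}
	panic("unreachable")
}

func main() {
	fmt.Println(sign(-3), signNoDefault(0))
}
```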

You can make an argument that the original and current state of affairs are good software engineering. If the compiler did true reachability analysis it'd increase the number of cases where an innocent looking change to some part of the code would suddenly make the return coverage not be complete and thus produce potentially odd messages about missing returns. The current brute force rules protect against this and lead Go programmers to write in a certain sort of consistent style.

My final issue is my perennial one of being unable to cleanly cancel IO being done by goroutines, breaking them out of things so that they can see a death signal from outside. You can argue that this is a bug in the runtime, but the problem with this is that everything that calls an IO operation then needs to be aware of this particular error case (and catch it, and propagate it up the call stack in whatever way is appropriate). A good start to making it a bug in the runtime would be for the runtime to define a specific error for 'IO attempted on closed connection' and for absolutely everything to use it.

(As it stands, the net package doesn't even define a publicly visible error instance for this case, although it does define one internally. It's my personal view that this beautifully illustrates why this is a general language problem; while you can 'solve' it in code, it requires absolutely everyone to get it right and, well, they clearly don't.)

Again this is a software engineering tradeoff. Both the semantics and the runtime implementation of goroutines are undoubtedly vastly simplified because you don't have to worry about being able to signal or cancel a goroutine from outside itself. Outside of the program exiting, all of the interactions that a goroutine has with the outside world are initiated by itself, on its own terms. This makes it much easier to reason about the effects of a goroutine, especially if it's careful not to use global state.

by cks at May 14, 2013 03:39 AM


Administered by Joe. Content copyright by their respective authors.