root@cn012:~# modprobe ipmi-devintf
root@cn012:~# modprobe ipmi-si
root@cn012:~# ipmitool mc reset cold
Sent cold reset command to MC
root@cn012:~# ping -c1 cn012-bmc
PING cn012-bmc.internal (10.162.64.23) 56(84) bytes of data.
64 bytes from cn012-bmc.internal (10.162.64.23): icmp_seq=1 ttl=63 time=0.359 ms

--- cn012-bmc.internal ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.359/0.359/0.359/0.000 ms
#!/bin/bash
for host in $(cat hosts); do
  if ping -c1 -w1 $host | grep -qi "bytes from"; then
    ssh $host "\
      skill -KILL -u sysmes; \
      if grep -qi '^sysmes' /etc/passwd; then \
        usermod -u 901 sysmes; \
      else \
        useradd -m -d /opt/sysmes -g 100 -s /bin/bash -u 901 sysmes; \
      fi; \
      if [ ! -d /opt/sysmes/.ssh/ ]; then \
        mkdir /opt/sysmes/.ssh; \
      fi; \
      chown -R sysmes:users /opt/sysmes; \
      passwd -l sysmes; \
      if ! dpkg --list | grep -qi 'ii sudo '; then \
        apt-get install --force-yes -y sudo; \
      fi; \
      if ! grep -qi '^sysmes' /etc/sudoers; then \
        printf '\n# Sudo for sysmes-user\nsysmes ALL = NOPASSWD: ALL\n\n' >> /etc/sudoers; \
      fi;"
    scp /opt/sysmes/.ssh/authorized_keys $host:/opt/sysmes/.ssh/authorized_keys
  fi
done
Last week most of my time went into writing this sync-users-to-these-nodes script. It’s needed since we’re not using LDAP anymore. Once the script is done, maintaining users could be done by 6-year-olds (yes, for real), which is somewhat easier than what we had with LDAP (where, literally, it took months before admins got admin-rights :-P).
So, anyways. The script is nearly done. A few parts are still missing (adding/removing groups, changing user info), but the rest is there: add user, delete user, reset password. Adding/deleting/changing groups isn’t planned to be implemented, as it’s fairly rare. The groups will still be synced if added manually, though.
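To give an idea of the shape of it, here’s a minimal, purely hypothetical sketch of what the add-user part could look like; the hosts-file, group and UID handling are assumptions, not the actual implementation;

#!/bin/bash
# Hypothetical sketch of the add-user part of the sync script.
# Usage: ./add-user.sh <username> <uid>
user=$1
uid=$2
for host in $(cat hosts); do
  if ping -c1 -w1 "$host" | grep -qi "bytes from"; then
    ssh "$host" "\
      if ! grep -qi '^$user:' /etc/passwd; then \
        useradd -m -g 100 -s /bin/bash -u $uid $user; \
      fi"
  fi
done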
The virtualization-cluster is also starting to get on its feet. This is where we’ll, over time, move almost all the infrastructure-machines; ns0, ns1, mon0, mon1, etc.
Today I also fixed the user-account for sysmes; disabling password-login and making it accessible by public key only. The entire production-cluster is done; still missing a few of the infrastructure-machines and the DEV-cluster, but we’re getting there.
root@portal-ecs1:~# for host in $(cat /etc/dsh/group/prodcluster); do if ping -c1 -w1 $host|grep -qi "bytes from"; then ssh $host "skill -KILL -u sysmes; usermod -u 901 sysmes; if [ ! -d /opt/sysmes/.ssh/ ]; then mkdir /opt/sysmes/.ssh; fi; chown -R sysmes:users /opt/sysmes; passwd -l sysmes"; scp /opt/sysmes/.ssh/authorized_keys $host:/opt/sysmes/.ssh/authorized_keys; fi; done
There are a lot of small things, like inconsistent sshd_config, that create all these small obstacles when trying to fix/achieve something. Somewhat annoying in the long run, but we’ll get there I guess; God didn’t make the world in one day, you know. (-:
Oh, yes, we also got 3 of the premium licenses for the switches, which is good. We’re still missing the 4th (IT didn’t have more), so during the next two-three days we’ll figure out how long it’ll take to get it. If it takes long, IT said they have a spare one installed in a lab-switch or something, which we could get. So unless something comes up, I’ll most likely configure the core-switches to be fully redundant using VRRP within a week or two.
I’ll also be spending the next few days upgrading BIOSes. We made a new image that _should_ work, so I’ll test it out tomorrow. If it works, then I’ll have 49 nodes to upgrade. Yay \o/
Then I’ll also fix host-based login in the PROD-cluster, which is somewhat easier to maintain than key-based, as you don’t need to generate/copy keys for each user you want to have password-less login for. You do need to maintain the host-list, but that we can sync/push from somewhere, using a pubkey for root, or something (since the pubkey can’t be “broken” as easily as the host-based setup can).
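For reference, a minimal sketch of what host-based SSH auth typically involves; the hostnames are just examples, and exact file locations may differ per distro (see sshd_config(5)/ssh_config(5));

# On each server, in /etc/ssh/sshd_config:
HostbasedAuthentication yes

# /etc/shosts.equiv -- hosts allowed in without a password:
portal-ecs1
cn006

# The server also needs the clients' host keys, e.g.:
ssh-keyscan portal-ecs1 cn006 >> /etc/ssh/ssh_known_hosts

# On each client, in /etc/ssh/ssh_config:
HostbasedAuthentication yes
EnableSSHKeysign yes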
Oh, and then we have the script for configuring DHCP/DNS; this is somewhat important as well. I kinda want to write this in Perl, so that I can learn it a bit, but we’ll see.
Since we have no way of knowing which nodes we want to upgrade just by looking at the hostnames, we need to find them some other way. We know they have 2 CPUs with 12 cores each, so the following should list the nodes we need;
root@portal-ecs1:/etc/dsh/group# for host in $(cat prodcluster); do if ping -c1 -w1 $host|grep -qi "bytes from"; then ssh $host 'if [ `cat /proc/cpuinfo|grep -i "model name"|wc -l` -eq 24 ]; then hostname; fi'; fi; done
cn006
cn007
cn008
cn009
cn011
cn012
cn013
cn020
cn021
cn022
cn023
cn024
cn025
cn026
cn027
cngpu000
cngpu001
cngpu002
cngpu003
cngpu004
cngpu005
cngpu006
cngpu007
cngpu008
cngpu009
cngpu010
cngpu011
cngpu012
cngpu013
cngpu014
cngpu015
cngpu016
cngpu017
cngpu018
cngpu019
cngpu020
cngpu021
cngpu022
cngpu023
cngpu024
cngpu025
cngpu026
cngpu027
cngpu028
cngpu029
cngpu030
cngpu031
cngpu032
cngpu033
49 in total.
The problem, however, is that, even though we live in 2011, you need to use DOS to upgrade it. Fair enough. But what about the built-in BIOS-upgrade in the BMC/IPMI? Well, that one actually bricks the node: at some point the BMC/IPMI goes through the BIOS, and so, while upgrading the BIOS, it loses connectivity to itself somehow. Brilliant. So, back to DOS. The nodes, obviously, have no floppy-drives, so we need to use a CD. They don’t have a CD-ROM drive either, so you’ll have to use either a USB-stick, a USB CD-ROM, or the BMC/IPMI’s built-in virtual CD-ROM, where you can mount .iso-files from an SMB-share. Quite nifty. Except that, so far, we haven’t found a CD-ROM driver for DOS that accepts the virtual CD-ROM. So then we can’t access the BIOS-upgrade software. Great. What about using the BMC/IPMI’s built-in virtual floppy-disk? That would have been a great idea, except that it’s limited to 1.44MB. Guess what? The new BIOS-firmware is 2.1MB. Wohooo!
So, for the moment we’re somewhat stuck. We’ll be looking into using a USB-stick, and maybe getting it to work that way. I guess it’s all about finding a driver that accepts the virtual virtual virtual floppy-disk virtualized as a CD-ROM on a USB-stick, or something. Hahaha.
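If the USB-stick route works out, it would probably be something along these lines; purely a hypothetical sketch, assuming we can find a bootable FreeDOS image (the image name, device and directory are placeholders);

# Hypothetical sketch: build a bootable FreeDOS USB-stick and copy the BIOS
# update onto it (image name, device and paths are placeholders).
dd if=freedos-usb.img of=/dev/sdX bs=1M && sync   # write the FreeDOS boot image
mount /dev/sdX1 /mnt                              # mount the FAT partition
cp bios-upgrade/* /mnt/                           # flash utility + the 2.1MB firmware
umount /mnt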
- Tyan BMC = fail
- It uses only LAN1 to send DHCP-requests for its BMC-interface
- Have to use a Gbps NIC for the management
- Inconsistent naming of interfaces; LAN1 is sometimes eth0, other times eth1. Each node had to be checked manually, and only a few nodes could be scripted;
root@portal-ecs1:~# for host in feptofc00 feptofc02 feptofc04 feptofc06 feptofa08 feptofc10 feptofc12 feptofc14; do ssh $host "sed -i 's/eth1/eth2/g' /etc/network/interfaces"; done
- Glad I don’t have to do this again
Current status of the cluster;
- 1 machine is down/gone (cn010)
- 3 machines down
root@portal-ecs1:~# for host in $(cat /etc/dsh/group/prodcluster); do if ! ping -c1 -w1 "$host"|grep -qi "bytes from"; then echo $host; fi; done
cn010
fepfmdaccorde
fephmpid0
fephmpid2
root@portal-ecs1:~# for host in $(cat /etc/dsh/group/prodcluster); do if ! ping -c1 -w1 "$host-mgmt"|grep -qi "bytes from"; then echo $host-mgmt; fi; done
cn010-mgmt
Yesterday I discovered that some of the nodes have failover-functionality, so that if the dedicated BMC NIC fails, it switches over to the normal NIC. Unfortunately there is no fallback (if/when the BMC NIC comes back online); once the failover has been triggered, it stays that way until the BMC power-cycles. When redoing the network-layout, this caused a lot of BMCs to fail over and request their management IP on the PROD/DEV-network, which, of course, meant they got no IP at all.
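In principle, a cold reset of the BMC through the in-band IPMI drivers (ipmitool mc reset cold, as in the cn012 transcript above) should also count as a power-cycle and make it re-select the NIC; a hypothetical loop over the affected nodes might look like this (the host-list name is made up);

# Hypothetical: cold-reset the BMC in-band on each affected node,
# instead of pulling the power cord ("failover-nodes" is an assumed host list).
for host in $(cat failover-nodes); do
  if ping -c1 -w1 "$host" | grep -qi "bytes from"; then
    ssh "$host" "modprobe ipmi-devintf; modprobe ipmi-si; ipmitool mc reset cold"
  fi
done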
Today I powered down all those nodes and pulled the power-cords, making the BMCs lose power. After about 5-10 seconds off, I turned them on again. So far, so good. Loads of BMCs came up correctly on the management-network. However, a large part still tried to get an IP on the PROD-network. I couldn’t figure out why; the switches are all properly configured. And then I found the pattern;
root@portal-ecs1:/etc/dsh/group# for host in $(cat prodcluster); do if ! ping -c1 -t2 "$host-mgmt"|grep -qi "bytes from"; then echo "$host-mgmt"; fi; done
cn010-mgmt
fepdimutrk3-mgmt
fepemcal0-mgmt
fepemcal1-mgmt
fepemcal2-mgmt
fepemcal3-mgmt
fepemcal4-mgmt
fephltout2-mgmt
feppmd1-mgmt
fepsdd0-mgmt
fepsdd1-mgmt
fepsdd2-mgmt
fepsdd3-mgmt
fepsdd4-mgmt
fepsdd5-mgmt
feptofa00-mgmt
feptofa02-mgmt
feptofa04-mgmt
feptofa06-mgmt
feptofa08-mgmt
feptofa10-mgmt
feptofa12-mgmt
feptofa14-mgmt
feptofa16-mgmt
feptofc00-mgmt
feptofc02-mgmt
feptofc04-mgmt
feptofc06-mgmt
feptofc10-mgmt
feptofc12-mgmt
feptofc14-mgmt
feptofc16-mgmt
feptpcco17-mgmt
feptrd14-mgmt
All of these (with _maybe_ a few exceptions) are running the same motherboard, with the same IPMI/BMC-addon card, both from Tyan. This addon-card is needed to activate the BMC-features. The sad thing, though, is that it seems to only send its DHCP-requests out of the main NIC, that is LAN1/eth1, regardless of whether it has link on the two other NICs. The ironic thing is that the source MAC of the DHCPDISCOVERs is actually the one for eth0 (which is the dedicated management-NIC). So, with no help from Tyan’s homepages, and none of the available change-the-NIC-through-ipmitool tricks working, I decided to go the somewhat easy way, even though it’s not ideal;
Make eth1 the dedicated management-NIC, and use eth2 for the normal network. I don’t like using a Gbps-NIC for management, but it’s better than spending ages figuring out how to get it to request IPs from the management-NIC. There have been some suggestions to flash the BIOS, the NIC and the IPMI/BMC-addon card, but this involves a lot of risk of bricking stuff, so I won’t go down that route. Not for now, at least.
So, tomorrow I’ll change eth1 to eth2 on ~30 nodes. Nice spending time on something as useful as this! :-D
Update: According to the picture below, it’s actually true; only eth1/LAN1 is able to carry the IPMI/BMC traffic. That’s kinda LOL, considering you’ll be wasting a Gbps-NIC when you have a 100Mbps-NIC available. GG, Tyan!
When power is applied to the power supply, the BMC powers on immediately. During the boot process the BMC (via Uboot which is booting Linux on the BMC) checks to see if the dedicated IPMI NIC port sees a link state. If not, the shared NIC port will be used. The NIC port selected at BMC boot time will be the NIC port used until the BMC is power cycled, either through a direct BMC reboot or when power is removed from the power supply. Rebooting the system itself will do nothing to the BMC.
This creates a cabling time race condition between plugging in the dedicated IPMI NIC and the power cable, which is very obnoxious. Or, for example, if you have a power outage and the BMC comes up before the switch does, the BMC will select the shared NIC in spite of the dedicated NIC being wired, and LAN IPMI access will, in the case of VLANed ports, be on the wrong network. We experience this more often than we like and find it quite frustrating.
So, I guess I’ll have to power those nodes completely down :-(
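To figure out which BMCs actually ended up with an address on the wrong network, something like this in-band check might do; a hypothetical sketch (the IPMI channel number, 1, is an assumption and may differ per board);

# Hypothetical: ask each reachable node which IP its BMC currently holds.
for host in $(cat /etc/dsh/group/prodcluster); do
  if ping -c1 -w1 "$host" | grep -qi "bytes from"; then
    echo -n "$host: "
    ssh "$host" "modprobe ipmi-devintf; modprobe ipmi-si; ipmitool lan print 1 | grep -E '^IP Address +:'"
  fi
done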