CERN / Fixing the kinks

So, yesterday I switched the production-cluster over to the new IP-layout. The transition went rather smoothly, and I’ve spent today cleaning up loose ends (machines that didn’t get IPs, etc.). Since there are no dynamic IP-ranges configured in the DHCP server, every DHCPDISCOVER from a “bad” MAC-address ends up logged as “no free leases”, which makes them easy to find.

root@ns0:~# cat /var/log/syslog*|grep -i "no free leases"|grep -vi "via eth0"|cut -d' ' -f9,11|sort|uniq|perl -wple 's|:$||i'
00:25:90:11:ce:1a 10.162.32.1
00:25:90:12:32:5c 10.162.32.1
00:30:48:c9:8d:3b 10.162.48.1
00:30:48:c9:8d:52 10.162.48.1
00:30:48:c9:8d:ce 10.162.48.1
00:30:48:c9:8f:e3 10.162.48.1
00:30:48:c9:a7:ce 10.162.48.1
00:30:48:c9:a7:fc 10.162.48.1
00:e0:81:48:10:af 10.162.32.1
00:e0:81:49:6b:f8 10.162.48.1
00:e0:81:4d:2d:c8 10.162.32.1
00:e0:81:79:83:04 10.162.48.1
00:e0:81:b1:39:80 10.162.48.1
00:e0:81:b1:39:b2 10.162.48.1
00:e0:81:b1:39:ca 10.162.48.1
00:e0:81:b1:39:dc 10.162.48.1
00:e0:81:b1:39:f2 10.162.48.1
00:e0:81:b1:3a:64 10.162.48.1
00:e0:81:b1:3a:90 10.162.48.1
00:e0:81:c0:70:57 10.162.32.1
00:e0:81:c0:70:58 10.162.32.1
00:e0:81:c0:70:95 10.162.32.1
00:e0:81:c0:70:d7 10.162.32.1
00:e0:81:c0:70:e4 10.162.32.1
00:e0:81:c0:71:3a 10.162.32.1
00:e0:81:c0:71:91 10.162.32.1
00:e0:81:c0:71:eb 10.162.32.1
00:e0:81:c0:75:4d 10.162.32.1
00:e0:81:c0:75:50 10.162.32.1
00:e0:81:c0:75:6f 10.162.32.1

This gives me the MAC-addresses, and which network/VLAN each request came from (the second column is the address the request arrived via).
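
The cut on fields 9 and 11 depends on how many spaces the syslog timestamp contains, so it can shift between single-digit and double-digit days. Below is a minimal sketch of a less position-dependent variant, assuming the dhcpd lines look like "DHCPDISCOVER from <mac> via <address>: ... no free leases". The function name extract_no_free_leases and the "#"-joined output format are my own choices for illustration, not something dhcpd provides.

```shell
# Read syslog lines on stdin and print unique "mac#network" pairs for
# dhcpd "no free leases" messages that did not arrive via eth0.
# Assumption: the MAC follows the word "from" and the address follows "via".
extract_no_free_leases() {
    grep -i 'no free leases' \
        | grep -vi 'via eth0' \
        | awk '{
              mac = ""; network = ""
              for (i = 1; i < NF; i++) {
                  if ($i == "from") mac = $(i + 1)
                  if ($i == "via")  network = $(i + 1)
              }
              sub(/:$/, "", network)   # drop the colon that trails the address
              if (mac != "" && network != "") print tolower(mac) "#" network
          }' \
        | sort -u
}
```

It could be used as: cat /var/log/syslog* | extract_no_free_leases > no-free-leases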

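Using an LDAP-dump and the new DNS/DHCP-configuration files, I can then generate a list of old and new hostnames. The loop below reads the pairs from a file called no-free-leases, one per line, with the MAC and the network joined by a “#”: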

joachim@keklolwtf: ~/Documents/CERN/Temp/Hostname-stuff $ for host in $(cat no-free-leases); do mac=`echo $host|cut -d'#' -f1`; network=`echo $host|cut -d'#' -f2`; old_hostname=`cat ldap.ldif|grep -i -A1 "$mac"|tail -1|perl -wple 's|dhcpOption: host-name||gi,s|"||gi,s| ||gi'`; new_hostname=`cat dhcpd-10.162.*|grep -i "$mac"|cut -d' ' -f2`; echo -e "$old_hostname -> $new_hostname\t$network\t$mac"; done
cngpu02-bmc -> cngpu001-bmc	10.162.32.1	00:25:90:11:ce:1a
cngpu01-bmc -> cngpu000-bmc	10.162.32.1	00:25:90:12:32:5c
cntpca015-bmc -> 		10.162.48.1	00:30:48:c9:8d:3b
cntpca016-bmc -> 		10.162.48.1	00:30:48:c9:8d:52
cntpca096-bmc -> 		10.162.48.1	00:30:48:c9:8d:ce
cntpca095-bmc -> 		10.162.48.1	00:30:48:c9:8f:e3
feptofc10-bmc -> feptofc10-bmc	10.162.32.1	00:e0:81:4d:2d:c8
vhost1 -> 			10.162.48.1	00:e0:81:b1:39:ca
vhost0 -> 			10.162.48.1	00:e0:81:b1:39:f2
feptrd14-bmc -> feptrd14-bmc	10.162.32.1	00:e0:81:c0:70:57
feptofa04-bmc -> feptofa04-bmc	10.162.32.1	00:e0:81:c0:70:58
fepemcal2-bmc -> fepemcal2-bmc	10.162.32.1	00:e0:81:c0:70:95
feptofa16-bmc -> feptofa16-bmc	10.162.32.1	00:e0:81:c0:70:d7
feptofa00-bmc -> feptofa00-bmc	10.162.32.1	00:e0:81:c0:70:e4
feptpcco17 -> 			10.162.32.1	00:e0:81:c0:71:3a
fepsdd0-bmc -> fepsdd0-bmc	10.162.32.1	00:e0:81:c0:71:91
feptofc16-bmc -> feptofc16-bmc	10.162.32.1	00:e0:81:c0:71:eb
fepemcal3-bmc -> fepemcal3-bmc	10.162.32.1	00:e0:81:c0:75:4d
feptofa06-bmc -> feptofa06-bmc	10.162.32.1	00:e0:81:c0:75:50
fepsdd1-bmc -> fepsdd1-bmc	10.162.32.1	00:e0:81:c0:75:6f

There seems to be an “issue” where some of the BMC-cards try to request IPs through the main NICs. This would work with the old IP-layout (since it was just one flat network), but not now, since the BMCs/CHARMs will be in a different subnet from the normal NICs. I have yet to look into this.

Next, we can figure out how large a part of the cluster we have brought back online. After the initial reboot earlier this morning (before I went to bed), there were 34 nodes that were “dead”:

root@portal-ecs1:/etc/dsh# for host in $(cat group/prodcluster); do if ! ping -c1 -t2 $host|grep -qi "bytes from"; then echo $host; fi; done
cn010
cn013
cn014
cn015
cn023
cn024
cn025
cn027
fepdimutrk2
fepdimutrk3
fepemcal0
fepemcal1
fepfmdaccorde
fephltout1
fephmpid0
fephmpid1
fephmpid2
fephmpid3
feppmd1
fepspd4
fepssd1
feptofa12
feptpcai06
feptpcai12
feptpcai14
feptpcao06
feptpcao08
feptpcci04
feptpcci06
feptpcci12
feptpcco11
feptpcco13
feptpcco17
feptrd00
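
The same check can be wrapped in a small function that relies on the exit status of ping instead of grepping its output. This is only a sketch under assumptions: it presumes Linux with iputils ping, where -W is the reply timeout in seconds (there, -t sets the TTL, while on BSD-style ping -t is a timeout). The function name unreachable_hosts and the PROBE override are hypothetical helpers added for illustration.

```shell
# Print every host from the list file ($1) that does not answer the probe.
# PROBE is a hypothetical override hook; the default is a single ping
# with a 2-second reply timeout (iputils syntax, assumed).
unreachable_hosts() {
    while read -r host; do
        [ -n "$host" ] || continue        # skip empty lines
        if ! ${PROBE:-ping -c1 -W2} "$host" </dev/null >/dev/null 2>&1; then
            echo "$host"
        fi
    done < "$1"
}
```

For example: unreachable_hosts group/prodcluster > dead-now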

Now, after a few fixes, we’re down to 7, of which 5 are “known issues”: nodes that were unavailable even before the switchover to the new IP-layout.

root@portal-ecs1:/etc/dsh# for host in $(cat group/prodcluster); do if ! ping -c1 -t2 $host|grep -qi "bytes from"; then echo $host; fi; done
cn010
fepfmdaccorde
fephmpid0
fephmpid2
fepspd4
feptpcci12
feptpcco17
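
If both lists are saved to files, the nodes that came back in between can be listed with comm. A minimal sketch, assuming the two outputs were stored in files of your choosing; the function name recovered_hosts is hypothetical.

```shell
# Print hosts that appear in the first list ($1) but not in the second ($2).
# comm requires sorted input, so both lists are sorted into temp files first.
recovered_hosts() {
    before=$(mktemp)
    after=$(mktemp)
    sort -u "$1" > "$before"
    sort -u "$2" > "$after"
    comm -23 "$before" "$after"   # -23 keeps lines unique to the first file
    rm -f "$before" "$after"
}
```

For example: recovered_hosts dead-this-morning dead-now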

feptpcci12 and fepspd4 are the only nodes that still need a fix.