So, yesterday I switched the production-cluster over to the new IP-layout. The transition went rather smoothly, and I’ve spent today cleaning up loose ends (machines that didn’t get IPs, etc.). Since there are no dynamic IP-ranges configured in the DHCP server, I can easily find DHCPDISCOVERs from “bad” MAC-addresses.
root@ns0:~# cat /var/log/syslog*|grep -i "no free leases"|grep -vi "via eth0"|cut -d' ' -f9,11|sort|uniq|perl -wple 's|:$||i'
00:25:90:11:ce:1a 10.162.32.1
00:25:90:12:32:5c 10.162.32.1
00:30:48:c9:8d:3b 10.162.48.1
00:30:48:c9:8d:52 10.162.48.1
00:30:48:c9:8d:ce 10.162.48.1
00:30:48:c9:8f:e3 10.162.48.1
00:30:48:c9:a7:ce 10.162.48.1
00:30:48:c9:a7:fc 10.162.48.1
00:e0:81:48:10:af 10.162.32.1
00:e0:81:49:6b:f8 10.162.48.1
00:e0:81:4d:2d:c8 10.162.32.1
00:e0:81:79:83:04 10.162.48.1
00:e0:81:b1:39:80 10.162.48.1
00:e0:81:b1:39:b2 10.162.48.1
00:e0:81:b1:39:ca 10.162.48.1
00:e0:81:b1:39:dc 10.162.48.1
00:e0:81:b1:39:f2 10.162.48.1
00:e0:81:b1:3a:64 10.162.48.1
00:e0:81:b1:3a:90 10.162.48.1
00:e0:81:c0:70:57 10.162.32.1
00:e0:81:c0:70:58 10.162.32.1
00:e0:81:c0:70:95 10.162.32.1
00:e0:81:c0:70:d7 10.162.32.1
00:e0:81:c0:70:e4 10.162.32.1
00:e0:81:c0:71:3a 10.162.32.1
00:e0:81:c0:71:91 10.162.32.1
00:e0:81:c0:71:eb 10.162.32.1
00:e0:81:c0:75:4d 10.162.32.1
00:e0:81:c0:75:50 10.162.32.1
00:e0:81:c0:75:6f 10.162.32.1
This gives me the MAC-addresses, and which network/VLAN each request came from.
By using an LDAP-dump and the new DNS/DHCP-configuration files, I can then generate a list of old and new hostnames:
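One caveat with picking fixed field numbers: assuming the traditional syslog timestamp format, single-digit days are padded with an extra space, so the positions seen by `cut -d' '` shift by one once the day of the month has two digits. A minimal sketch of a variant that looks for the words `from` and `via` instead is shown here. The function name `extract_unknown_macs` is made up for this example, and the log line layout (`DHCPDISCOVER from <mac> via <relay>: network ...: no free leases`) is an assumption about what dhcpd writes on this system.

```shell
# Print "MAC relay-IP" for every "no free leases" line read from stdin.
# Assumed line layout (not verified against this server's logs):
#   ... DHCPDISCOVER from <mac> via <relay-ip>: network ...: no free leases
extract_unknown_macs() {
    awk 'tolower($0) ~ /no free leases/ {
        mac = ""; via = ""
        # Find the tokens following "from" and "via", wherever they are.
        for (i = 1; i < NF; i++) {
            if ($i == "from") mac = $(i + 1)
            if ($i == "via")  via = $(i + 1)
        }
        sub(/:$/, "", via)                # drop the trailing colon after the relay
        if (mac != "" && via != "eth0")   # skip requests seen directly on eth0
            print mac, via
    }' | sort -u
}
```

It would be used the same way as the pipeline above, for example `cat /var/log/syslog* | extract_unknown_macs`.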
joachim@keklolwtf: ~/Documents/CERN/Temp/Hostname-stuff $ for host in $(cat no-free-leases); do mac=`echo $host|cut -d'#' -f1`; network=`echo $host|cut -d'#' -f2`; old_hostname=`cat ldap.ldif|grep -i -A1 "$mac"|tail -1|perl -wple 's|dhcpOption: host-name||gi,s|"||gi,s| ||gi'`; new_hostname=`cat dhcpd-10.162.*|grep -i "$mac"|cut -d' ' -f2`; echo -e "$old_hostname -> $new_hostname\t$network\t$mac"; done
cngpu02-bmc -> cngpu001-bmc 10.162.32.1 00:25:90:11:ce:1a
cngpu01-bmc -> cngpu000-bmc 10.162.32.1 00:25:90:12:32:5c
cntpca015-bmc -> 10.162.48.1 00:30:48:c9:8d:3b
cntpca016-bmc -> 10.162.48.1 00:30:48:c9:8d:52
cntpca096-bmc -> 10.162.48.1 00:30:48:c9:8d:ce
cntpca095-bmc -> 10.162.48.1 00:30:48:c9:8f:e3
feptofc10-bmc -> feptofc10-bmc 10.162.32.1 00:e0:81:4d:2d:c8
vhost1 -> 10.162.48.1 00:e0:81:b1:39:ca
vhost0 -> 10.162.48.1 00:e0:81:b1:39:f2
feptrd14-bmc -> feptrd14-bmc 10.162.32.1 00:e0:81:c0:70:57
feptofa04-bmc -> feptofa04-bmc 10.162.32.1 00:e0:81:c0:70:58
fepemcal2-bmc -> fepemcal2-bmc 10.162.32.1 00:e0:81:c0:70:95
feptofa16-bmc -> feptofa16-bmc 10.162.32.1 00:e0:81:c0:70:d7
feptofa00-bmc -> feptofa00-bmc 10.162.32.1 00:e0:81:c0:70:e4
feptpcco17 -> 10.162.32.1 00:e0:81:c0:71:3a
fepsdd0-bmc -> fepsdd0-bmc 10.162.32.1 00:e0:81:c0:71:91
feptofc16-bmc -> feptofc16-bmc 10.162.32.1 00:e0:81:c0:71:eb
fepemcal3-bmc -> fepemcal3-bmc 10.162.32.1 00:e0:81:c0:75:4d
feptofa06-bmc -> feptofa06-bmc 10.162.32.1 00:e0:81:c0:75:50
fepsdd1-bmc -> fepsdd1-bmc 10.162.32.1 00:e0:81:c0:75:6f
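Rows with nothing after the arrow are MACs for which the `grep` in the new DHCP files returned no match. A rough sketch for pulling just those rows out of the report could look like this. It assumes the report is fed in exactly as the loop prints it (`old -> new`, then a tab, the network, a tab, the MAC); `missing_in_new_config` is only a name picked for the example.

```shell
# List report rows whose new hostname is empty, i.e. MACs that were not
# found in the new DHCP configuration.
# Assumed input format: "old -> new<TAB>network<TAB>mac"
missing_in_new_config() {
    awk -F'\t' '{
        p = index($1, " ->")          # position of the arrow in the first column
        if (p == 0) next              # not a report row, ignore it
        old = substr($1, 1, p - 1)    # text before the arrow
        new = substr($1, p + 3)       # text after the arrow, possibly blank
        gsub(/ /, "", new)
        if (new == "") print old, $3  # old hostname and MAC
    }'
}
```

Saving the loop output to a file and running `missing_in_new_config < report.txt` (the file name is just an example) would then list only the machines that still need an entry.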
There seems to be an “issue” where some of the BMC-cards try to request IPs through the main NICs. This would work with the old IP-layout (since it was just one flat network), but not now, since BMC/CHARMs will be in a different subnet than the normal NICs. I have yet to look into this.
Next, we can figure out how large a part of the cluster we have brought back online. After the initial reboot earlier this morning (before I went to bed), there were 34 nodes that were “dead”:
root@portal-ecs1:/etc/dsh# for host in $(cat group/prodcluster); do if ! ping -c1 -t2 $host|grep -qi "bytes from"; then echo $host; fi; done
cn010
cn013
cn014
cn015
cn023
cn024
cn025
cn027
fepdimutrk2
fepdimutrk3
fepemcal0
fepemcal1
fepfmdaccorde
fephltout1
fephmpid0
fephmpid1
fephmpid2
fephmpid3
feppmd1
fepspd4
fepssd1
feptofa12
feptpcai06
feptpcai12
feptpcai14
feptpcao06
feptpcao08
feptpcci04
feptpcci06
feptpcci12
feptpcco11
feptpcco13
feptpcco17
feptrd00
Now, after a few fixes, we’re down to 7, of which 5 are “known issues”: nodes that were unavailable even before the switchover to the new IP-layout.
root@portal-ecs1:/etc/dsh# for host in $(cat group/prodcluster); do if ! ping -c1 -t2 $host|grep -qi "bytes from"; then echo $host; fi; done
cn010
fepfmdaccorde
fephmpid0
fephmpid2
fepspd4
feptpcci12
feptpcco17
feptpcci12 and fepspd4 are the only nodes that still need a fix.
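To see which nodes came back between the two ping runs, the two lists can be compared directly. The following is a small sketch, assuming each run's output was saved to a file with one hostname per line; `recovered_nodes` and the file names in the usage line are made up for the example.

```shell
# Print hosts listed in the first file but not in the second,
# i.e. nodes that were dead in the earlier run and answer in the later one.
recovered_nodes() {
    before=$1
    after=$2
    # Read the later list first and remember it, then print every host
    # from the earlier list that is no longer in it.
    awk 'FILENAME == ARGV[1] { still_dead[$1]; next }
         !($1 in still_dead)' "$after" "$before"
}
```

With the lists above, `recovered_nodes dead-morning dead-now` should print the 27 nodes that are reachable again.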