Server farm aberrations : is it normal?

CougTek · Feb 24, 2013

In my view, the company I work for has way too many servers compared to what would normally be needed. However, I haven't seen that many server farms so maybe I'm wrong.

Is it normal, for a 20-employees company with around a thousand customers and a 1.5TB monthly transfer volume, to have :

- Two dedicated LDAP servers
- Two FTP servers (which also combined as our two main DNS servers)
- Three dedicated NAT servers
- Two dedicated Domain Controllers

Is it just me, or do I count 9 separated servers doing what could be done, with enough redundancy, by two or three servers? All the above servers are three-years-old or more and none are better than Core 2 generation of processors.

My main problem with this is that we are looking to send our server farm in colocation space and the more servers we have, the higher it will cost us. IMO, we have way too many servers to perform jobs that could be done by less than half as many. The above is just part of our setup (we also have a 5x 2U single-socket servers to host a website serving only 250-300 daily viewers, for instance). Elsewhere, we have another system spread accross 24x 1U servers, but I calculated that it could be done, with more redundancy and performance, by a 3U or 4U setup, etc.

It's really depressing to work in such an unoptimized environment and with stubborn people who insist on keeping using the hardware they mistakenly bought, invoking they need to make those profitable. Like if pursuing a mistake doesn't make it more blatant.

mubs · Feb 24, 2013

That's the way it is in a lot of businesses, Coug. Sometimes it's frustrating as hell. But time is on your side. These servers are near the end of their life. Flip the problem on its head, and you have a great opportunity to show these people how to do it right, save them money, and become a hero. Play the game well to win!

CougTek · Feb 24, 2013

Oh and did I mentioned that the network switches are daisy-chained in two of the four racks we manage? So if the first one fails (not often, but it does happen), we lose two entire racks worth of servers. That would be one of our two main websites, plus our server group responsible for sending the bid&ask data to our customers (a big deal since we are a financial market company). But hey, no big deal, the switch only is seven years old and we don't pay Cisco for the warranty service (so no quick replacement).

ddrueding · Feb 24, 2013

Mubs is right. What you are facing is a political problem. That it can persist despite all the money it is costing is a sign that the forces involved are too strong to fight. Here would be my recommendation:

Identify clusters of servers that will be EOL'd at about the same time. Proactively demonstrate which services can be consolidated at that time (taking production machines offline before EOL is admitting a mistake, no need to make them take their medicine). Perhaps even using wording that suggests that the solutions you propose are now possible due to advances in technology that couldn't have been implemented in the past (even though they could).

Another way to pivot the strategy without placing blame is to introduce a new paradigm. If you are renting space, virtualization can allow higher densities than just about anything else. And by using this new business requirement (colocation) that would benefit from a different implementation (virtualization), you can get away with EOLing machines before their time and still save money.

Mercutio · Feb 24, 2013

In most of those cases the thinking was or is probably that the second server is present for redundancy and/or load balancing. It doesn't appear that those things have been properly implemented, but it sounds like the hardware is present in case it's ever needed. You basically have something closer to a warm spare setup that requires some un-clusterfucking in order to get running.

Also, there are a lot of admin types who adamantly believe that discrete functions should be performed by discrete servers, no matter how trivial. In a way, I do agree. It's a lot easier to remember that SiteA_LDAP_2 is the back-up LDAP server than to remember that SiteA_Server_2 is the LDAP/DNS/FTP/NTP/Squid/Postgres/CA server.

Granted that there's no reason why they can't eliminate some of the hardware they've got, but throwing money and hardware at problems is absolutely easy mode for IT people.

mubs · Feb 24, 2013

++1 to what DD said; he makes total sense. Spoken like a man who's won these types of battles! The good thing, Coug, is that things are now at an inflection point that you can take advantage of.

Chewy509 · Feb 24, 2013

Completely agree with what dd says...

Mind you I can't talk about my current workplace (2 fulltime employees - 9 physical servers - but we use a lot of virtualisation for server role separation (about 20 roles) and utilise a lot of storage - 7 of the servers are just for storage).

A previous place I worked as the sysadmin ('03-'05 timeframe) - 80 employees, spread across 4 sites. We had:
3x DCs (One of these played many roles, including Intranet server, fileserver, etc).
1x Exchange Server
3x Storage Servers
1x Intranet Server (WIMP based)

But, this was spread across 4 sites (connected via 512k ADSL, 100Mb FSO or 128k ISDN links to the main office), with servers spread around the site. The main office (roughly 50 employees) only had 4 of those servers.

CougTek · Feb 24, 2013

See Chewy, we have four times less employees and 76 servers overall. While I agree that Ddrueding's strategy is the wisest, I think Merc nailed it when he mentioned lazinest. But that lazinest has a price. The main architect of this clusterfuck (my collegue) would be completly incapable of drawing of the interconnections between the various switches and servers. Even if he had an entire afternoon to do it. And even if he did, it would look more a like a spagetti meal than a block diagram of a server farm.

My opinion is that in order to be reliable and efficient, an architecture has to remain simple. And that's utterly not the case here.

Thanks all for your input.

CougTek · Feb 26, 2013

Apparently, there's even worse than our company.

ddrueding · Feb 26, 2013

The main advantage I find to running a light system is that I can upgrade/swap out very frequently and still be well within any reasonable budget.

At the moment my company (400 employees at 7 sites in two counties, 200+ computers and 100+ smartphones) runs on 2 NAS boxes (Synology 8-drive units mirroring each other), 2 servers (VMWare Workstation mirroring each other), and a firewall appliance. That and a bunch of wireless WAN products and 3Com switches to stitch it all together. I could replace all that stuff annually and no one would even blink.

Mercutio · Feb 26, 2013

Why Workstation and not ESX? Can't you afford to go to vsphere?

ddrueding · Feb 26, 2013

I had vSphere. Still have the licenses, though I let the support lapse.

The main reason is damage control. When something goes really wrong in vSphere, I'm in a command line world I have very little experience in. With Workstation I know my way around and can deal with OS issues without stress. I also like that the files and drives are in a format that my workstation can read; one more recovery option.

I'm sure that vSphere has some fantastic recovery tools, and is more stable than Windows to begin with, but I'm just not as familiar. And I wasn't using the bells and whistles anyway. All I'm wishing for at the moment is a program that can do scheduled online backups of VMs from a GUI (instead of my awkward scripts that require the VM to be offline. I would pay several thousand dollars for one of those.

Howell · Feb 28, 2013

We use Quest's Vranger; though I am not responsible for it. We use the old SAN as an intermediate step before going to tape.

Howell · Feb 28, 2013

CougTek said:
See Chewy, we have four times less employees and 76 servers overall. While I agree that Ddrueding's strategy is the wisest, I think Merc nailed it when he mentioned lazinest. But that lazinest has a price. The main architect of this clusterfuck (my collegue) would be completly incapable of drawing of the interconnections between the various switches and servers. Even if he had an entire afternoon to do it. And even if he did, it would look more a like a spagetti meal than a block diagram of a server farm.

My opinion is that in order to be reliable and efficient, an architecture has to remain simple. And that's utterly not the case here.

Thanks all for your input.

Do you work for him or with him? For him, good luck. With him, start documenting the critical connections and functions and come up with an actualy plan for the colo that you can distill into 1 sentence that would be attractive to management.

CougTek · Feb 28, 2013

Howell said:
Do you work for him or with him?

With him. For him, I would have cracked his skull and left the job. I've ben working on a plan for a while, but since there's no available budget for this, everything is frozen.

Howell · Feb 28, 2013

Out of curiosity, what are your three dedicated NAT servers for?

CougTek · Mar 1, 2013

Honestly, I don't even know for sure. I know we're using one for the NAT service between the outside world and our internal networks and another one is purely there as a backup. I don't remember what the third one is for. I think it's because we use two different internet connections and its there for the second link.

Speaking of network, I can't find if that Supermicro switch (SSE-3348SR) has fail-over capability (with another similar switch) or not. If it does, it would be the perfect switch (well, pair of switches, because I'd use 2 for redundancy) for my network re-engineering plan. It's less than 10K$. Tremendously more capable than anything Cisco sells at that price. Cheaper than comparable HP's too. I'd link both switches with a pair of QSFP cables (~70$ each, so quite cheap) and wire all the servers with 10G SFP cables (one from each server to each switch). No more network bottleneck for the foreseable future AND redundancy for everything. Instead of 3.5 full racks, everything would fit in an 11U space (4x Supermicro 2027HTRF+ 2U servers, plus the two 3348SR switches and one last swich purely for the IPMI connections).

The entire thing would only cost 125K$ in hardware. I'll wait until September, when the 10-core Ivy Bridge-EP Xeon E5 will be out, before pushing it. I'd would sell the current servers and networking equipment to absorb a small portion of the upgrading cost. I'd be able to draw the entire setup on a napkin in 5 minutes. Simple, fast, reliable and not really expensive when you think about it. Certainly less than half what our current setup has cost. Probably less than a third, in fact.

It'll never pass, but it will be fun to work on the plan anyway.

ddrueding · Mar 1, 2013

Just be sure to specify when you talk about selling the hardware that all hard drives will be pulled first and disposed of in a manner appropriate to your disposition

Howell · Mar 4, 2013

CougTek said:
Honestly, I don't even know for sure. I know we're using one for the NAT service between the outside world and our internal networks and another one is purely there as a backup. I don't remember what the third one is for. I think it's because we use two different internet connections and its there for the second link.

Based on what you've said your coworker would not be able to maintain this setup or make changes if necessary. Presumably you would be relying on consulting labor for this. You should look into how you can parlay this problem into an oppotunity to save money by spending money on something that you can actually support in-house. If your employers are not feeling any pain currently then it becomes a much harder sell. This is why open source is so hard to make a business case for because it is not as easy to find knowledgable people and your knowledge base can leave for another job. Then you have to spend money on support or to get the pieces where they can be supported.

Cisco is not the be all end all it used to be and there are plenty of competitors out there for every type of equipment.

Handruin · Mar 8, 2013

ddrueding said:
I had vSphere. Still have the licenses, though I let the support lapse.

The main reason is damage control. When something goes really wrong in vSphere, I'm in a command line world I have very little experience in. With Workstation I know my way around and can deal with OS issues without stress. I also like that the files and drives are in a format that my workstation can read; one more recovery option.

I'm sure that vSphere has some fantastic recovery tools, and is more stable than Windows to begin with, but I'm just not as familiar. And I wasn't using the bells and whistles anyway. All I'm wishing for at the moment is a program that can do scheduled online backups of VMs from a GUI (instead of my awkward scripts that require the VM to be offline. I would pay several thousand dollars for one of those.

This may not be the thread to elaborate but I am curious to know more details about the issues you've faced with the various VMware ESXi deployments you've managed. There is certainly a great amount of value in knowing you're environment when shit goes bad so I can't argue that logic. I am however surprised that there were enough instances of failures that caused you to have to deal with command line interfaces in ESXi/vSphere. At present I'm now managing 41 ESXi servers under vCenter (with 12 more coming soon) and I seldom ever need to enable SSH to log into an ESXi box for troubleshooting purposes. With the exception of gathering performance data through ESXTop, there isn't many recovery tasks I even attempt at that level unless if things go horribly wrong.

You mentioned you would consider paying thousands for a proper backup management tool set. I know I mentioned Veeam in another thread but have you looked into the Veeam Backup Management Suite? Maybe it would be worth trying a 3-day demo of their product to see if it does what you need.

Handruin · Mar 8, 2013

CougTek said:
Honestly, I don't even know for sure. I know we're using one for the NAT service between the outside world and our internal networks and another one is purely there as a backup. I don't remember what the third one is for. I think it's because we use two different internet connections and its there for the second link.

Speaking of network, I can't find if that Supermicro switch (SSE-3348SR) has fail-over capability (with another similar switch) or not. If it does, it would be the perfect switch (well, pair of switches, because I'd use 2 for redundancy) for my network re-engineering plan. It's less than 10K$. Tremendously more capable than anything Cisco sells at that price. Cheaper than comparable HP's too. I'd link both switches with a pair of QSFP cables (~70$ each, so quite cheap) and wire all the servers with 10G SFP cables (one from each server to each switch). No more network bottleneck for the foreseable future AND redundancy for everything. Instead of 3.5 full racks, everything would fit in an 11U space (4x Supermicro 2027HTRF+ 2U servers, plus the two 3348SR switches and one last swich purely for the IPMI connections).

The entire thing would only cost 125K$ in hardware. I'll wait until September, when the 10-core Ivy Bridge-EP Xeon E5 will be out, before pushing it. I'd would sell the current servers and networking equipment to absorb a small portion of the upgrading cost. I'd be able to draw the entire setup on a napkin in 5 minutes. Simple, fast, reliable and not really expensive when you think about it. Certainly less than half what our current setup has cost. Probably less than a third, in fact.

It'll never pass, but it will be fun to work on the plan anyway.

How do you manage the network path A/B fail-over from the host perspective when you wire each host to both switches? How do you plan to configure the switches so that you don't create a loopback between the QSFP links? Have comparison-shopped F5 networks? How far off is the price of the Nexus 5K (5548P) or a Nexus 3048?

CougTek · Mar 9, 2013

Handruin said:
How do you manage the network path A/B fail-over from the host perspective when you wire each host to both switches?

Not sure. I know each Supermicro module has a dedicated RJ-45 port for IPMI and KVM-over-Ethernet. I planned on controlling the server thru that port. Therefore, if a switch would fail and I hadn't found an intelligent way to perform the link switching automatically, I could remote-connect via the KVM-over-network port and manually change each connection from one link to the other. I know this deserves further toughts.

The reason I planned this the way I did is the result from my meeting with the local Insight consulting team. They pushed me the idea of going with 10Gbps links instead of our current 1Gbps links because in most of their analysis, the bottleneck in the server farms they studied was the network connections between servers. They also tried to convince me to opt for blade servers instead of either a bunch of 1U or the 4-node/2U servers I'm currently favoring. It's apparently easy to manage a failed link on an HP blade chassis when one of the two links connected to two different switches go down. However, I've discarded blades because a blade setup would cost over 30% more than one with 4-modules/2U servers and my company is short on cash. So cash decides and 2U solution wins. The 10Gbps link idea stucked though. It's not that much more expensive to go that way and it garantees a lot of margin for future growth.

They pushed for HP hardware over Cisco's because the warranty and support are, according to them, more avantageous for the customer. That remains to be verified, but the fact that there was an HP representative present to the meeting versus none for Cisco suggest me that, at least locally, it would be easier to get support for HP stuff than for Cisco's. I do not remember the number number of the switches they showed (I'd guess it was part of the 5820 series), but those HP Procurve models came from 3Com's products range (3Com has been eaten by HP recently). However, the HP official peddler gave me some wrong information about his own products (told me there is no internal PCI-E slot in Proliant SL230 Gen8 servers, while there is one), so I doubled-checked most of the things he said about HP's hardware after the meeting. Browsing HP's product porte-folio and playing with my calculator, I ended up with an HP setup that would cost over 20% more than an all-Supermicro solution like the one described earlier. Plus, there are some points that, at least on paper, advantage Supermicro's servers and networking equipment over HP's. For instance, the Supermicro 2027TR-HTRF+ supports CPU with a slightly higher TDP than the HP Proliant SL230 Gen8 does. The 2027TR has 6 hot-swap drive bays per module, while the SL230 only has one (the other drive bays are internal) if you add a PCI-E card (which I'd need). I've also missed the mention of a KVM-over-LAN feature on the SL230, while it's clearly advertised on the Supermicro server. HP's SFP+ cables are almost 3x more expensive than Supermicro's. The HP switches in the price range of the Supermicro SSE-3348SR have a lower switching capacity than the Supermicro model, although that is almost a non-issue since the SSE-3348SR is a tremendous overkill. I realize that HP has more mature products than Supermicro, but the latter still offers data-center-grade equipement, just less refined.

So anyway, I figured that if it's possible to program a blade chassis to switch connections when a link fails, it must or at least should be possible to do the same with a few 2U servers. I don't know how, but there must be a way.

Handruin said:
How do you plan to configure the switches so that you don't create a loopback between the QSFP links?

I did something horrible, I assumed the switch could be programmed to use those links as fail-over links. I know that it's possible to do this with some HP Procurve switches and probably with several Cisco switches too. The Supermicro network switch I linked above targets the same market segment as the HP and Cisco switches which offer fail-over capabilities, so I presumed it can be configured the same too. I admit there are lot of points to verify and clarify there.

Handruin said:
Have comparison-shopped F5 networks? How far off is the price of the Nexus 5K (5548P) or a Nexus 3048?

I've never heard of F5 Networks before. It's a consulting firm?

Regarding the Cisco Nexus switches, the 5548P has fiber connectors while I'd need SFP+ (copper wires I believe) with the Supermicro servers. It also cost at least 50% more than the Supermicro SSE-3348SR. The 3048 only has gigabit ports, except for 4 SFP+ ports. It's much lower-end than the SSE-3348SR, but only cost a thousand bucks less (~8100$ vs ~9250$ for the Supermicro). I know the Supermicro switch I've selected is overkill for what we need, but since the price gap between this one and the 24-port one of the similar series is small (~2500$) and that it offers superior hardware capabilities than similarly priced HP and Cisco switches, why not go for it?

Lastly, although I did my best, it's Friday's evening/night here and I'm quite burned up from my week so I'm sure I did a lot of spelling mistakes and I've often used wrong verb tense and I'm sorry for it. I hope I'm still readable. Thank you very much for your input.

CougTek · Mar 9, 2013

Here's a representation of my idea that I did with dia (yes, it's ugly) :

"Switch pur IPMI" means "Switch for IPMI"
"Lien" means "link".
Nevermind the three lines of French text.

Since this won't be showed to the company's directors before Ivy Bridge EP Xeon are out, I took the liberty of writing that each node will support two 10-core CPU (probably the Xeon E5-2690v2 if the naming of the Ivy Bridge Xeon E3 tells us anything). Each module will also include a 2-port SFP+ 10Gbps add-on card. Each 2027TR-HTRF+ (or 2027TR-H70RF+, I haven't made my mind yet) integrates four dual-socket nodes.

It isn't noted, but each switch would have a WAN connection in the remaining RJ-45 port (the other one going to the IPMI-dedicated switch).

Handruin · Mar 10, 2013

CougTek said:
Not sure. I know each Supermicro module has a dedicated RJ-45 port for IPMI and KVM-over-Ethernet. I planned on controlling the server thru that port. Therefore, if a switch would fail and I hadn't found an intelligent way to perform the link switching automatically, I could remote-connect via the KVM-over-network port and manually change each connection from one link to the other. I know this deserves further toughts.

Looking at your diagram I'm wondering if you're going to run into weird problems with the connectivity for the IPMI connections. I'm confused at how you have so many IPMI connection per server. Are there really 4 LAN connections each? Do they also need to hop back to the SSE-X3348SR switches or could you just plug the house LAN directly into that 100Mb switch instead?

Traditionally you could do bonding of you multiple network ports through software. I'm not entirely certain how this is done in OpenIndiana but I have to imagine it's possible since most if not all the other Linux variants have options for enabling this. You may want to research which mode of bonding/trunking that you want to use for your setup. I get the impression that an active->passive setup is what you're aiming for. I think what you may need to do is trunk 802.3ad (LACP) or make use of Spanning Tree in the 40Gb QSFP links between your switches.

CougTek said:
The reason I planned this the way I did is the result from my meeting with the local Insight consulting team. They pushed me the idea of going with 10Gbps links instead of our current 1Gbps links because in most of their analysis, the bottleneck in the server farms they studied was the network connections between servers. They also tried to convince me to opt for blade servers instead of either a bunch of 1U or the 4-node/2U servers I'm currently favoring. It's apparently easy to manage a failed link on an HP blade chassis when one of the two links connected to two different switches go down. However, I've discarded blades because a blade setup would cost over 30% more than one with 4-modules/2U servers and my company is short on cash. So cash decides and 2U solution wins. The 10Gbps link idea stucked though. It's not that much more expensive to go that way and it garantees a lot of margin for future growth.

They pushed for HP hardware over Cisco's because the warranty and support are, according to them, more avantageous for the customer. That remains to be verified, but the fact that there was an HP representative present to the meeting versus none for Cisco suggest me that, at least locally, it would be easier to get support for HP stuff than for Cisco's. I do not remember the number number of the switches they showed (I'd guess it was part of the 5820 series), but those HP Procurve models came from 3Com's products range (3Com has been eaten by HP recently). However, the HP official peddler gave me some wrong information about his own products (told me there is no internal PCI-E slot in Proliant SL230 Gen8 servers, while there is one), so I doubled-checked most of the things he said about HP's hardware after the meeting. Browsing HP's product porte-folio and playing with my calculator, I ended up with an HP setup that would cost over 20% more than an all-Supermicro solution like the one described earlier. Plus, there are some points that, at least on paper, advantage Supermicro's servers and networking equipment over HP's. For instance, the Supermicro 2027TR-HTRF+ supports CPU with a slightly higher TDP than the HP Proliant SL230 Gen8 does. The 2027TR has 6 hot-swap drive bays per module, while the SL230 only has one (the other drive bays are internal) if you add a PCI-E card (which I'd need). I've also missed the mention of a KVM-over-LAN feature on the SL230, while it's clearly advertised on the Supermicro server. HP's SFP+ cables are almost 3x more expensive than Supermicro's. The HP switches in the price range of the Supermicro SSE-3348SR have a lower switching capacity than the Supermicro model, although that is almost a non-issue since the SSE-3348SR is a tremendous overkill. I realize that HP has more mature products than Supermicro, but the latter still offers data-center-grade equipement, just less refined.

So anyway, I figured that if it's possible to program a blade chassis to switch connections when a link fails, it must or at least should be possible to do the same with a few 2U servers. I don't know how, but there must be a way.

I did something horrible, I assumed the switch could be programmed to use those links as fail-over links. I know that it's possible to do this with some HP Procurve switches and probably with several Cisco switches too. The Supermicro network switch I linked above targets the same market segment as the HP and Cisco switches which offer fail-over capabilities, so I presumed it can be configured the same too. I admit there are lot of points to verify and clarify there.

I believe the Supermicro switches also support the fail-over features through use of spanning tree and/or trunking between switches. Those features are listed in the spec.

CougTek said:
I've never heard of F5 Networks before. It's a consulting firm?

Regarding the Cisco Nexus switches, the 5548P has fiber connectors while I'd need SFP+ (copper wires I believe) with the Supermicro servers. It also cost at least 50% more than the Supermicro SSE-3348SR. The 3048 only has gigabit ports, except for 4 SFP+ ports. It's much lower-end than the SSE-3348SR, but only cost a thousand bucks less (~8100$ vs ~9250$ for the Supermicro). I know the Supermicro switch I've selected is overkill for what we need, but since the price gap between this one and the 24-port one of the similar series is small (~2500$) and that it offers superior hardware capabilities than similarly priced HP and Cisco switches, why not go for it?

Lastly, although I did my best, it's Friday's evening/night here and I'm quite burned up from my week so I'm sure I did a lot of spelling mistakes and I've often used wrong verb tense and I'm sorry for it. I hope I'm still readable. Thank you very much for your input.

Handruin · Mar 10, 2013

CougTek said:
Here's a representation of my idea that I did with dia (yes, it's ugly) :

View attachment 572

"Switch pur IPMI" means "Switch for IPMI"
"Lien" means "link".
Nevermind the three lines of French text.

Since this won't be showed to the company's directors before Ivy Bridge EP Xeon are out, I took the liberty of writing that each node will support two 10-core CPU (probably the Xeon E5-2690v2 if the naming of the Ivy Bridge Xeon E3 tells us anything). Each module will also include a 2-port SFP+ 10Gbps add-on card. Each 2027TR-HTRF+ (or 2027TR-H70RF+, I haven't made my mind yet) integrates four dual-socket nodes.

It isn't noted, but each switch would have a WAN connection in the remaining RJ-45 port (the other one going to the IPMI-dedicated switch).

I think that could work fine assuming your WAN switch or switches also support the ability for creating a LACP connection. This would allow your environment to tolerate a failure at that level also. I'm assuming your so-called WAN connection goes to some kind of router or router + firewall?

Chewy509 · Mar 10, 2013

Handruin said:
I'm not entirely certain how this is done in OpenIndiana but I have to imagine it's possible since most if not all the other Linux variants have options for enabling this.

Assuming NWAM is disabled.

Code:

# ifconfig e1000g0 unplumb
# ifconfig e1000g1 unplumb
# dladm create-aggr -d e1000g0 -d e1000g1 1
# ifconfig aggr1 plumb
# ifconfig aggr1 192.168.0.253 netmask 255.255.255.0 up

The rest of the LAN config is as per normal, (eg gateway, DNS, hostname, etc).
IIRC, there is no real limit to the number of aggregates, but you may need to check your LAN adapter documentation, and see if the driver has any limitations.
PS. e1000g0 / e1000g1 are the LAN adapters to bond (Intel in this case, change to suit your LAN adapter).

CougTek · Mar 11, 2013

I haven't address this, but it needs to :

ddrueding said:
What you are facing is a political problem.

Indeed

ddrueding said:
Identify clusters of servers that will be EOL'd at about the same time. Proactively demonstrate which services can be consolidated at that time (taking production machines offline before EOL is admitting a mistake, no need to make them take their medicine). Perhaps even using wording that suggests that the solutions you propose are now possible due to advances in technology that couldn't have been implemented in the past (even though they could).

Indeed, they could have done it in 2008 and 2011, the two times they invested massively in their architecture. I've already identified the servers that are due to be replaced and the assosciated services. However, the network needs to be redone entirely in my opinion and that would be better done with a complete replacement of our current structure if we want to do this in the most effective way.

ddrueding said:
If you are renting space, virtualization can allow higher densities than just about anything else. And by using this new business requirement (colocation) that would benefit from a different implementation (virtualization), you can get away with EOLing machines before their time and still save money.

Tried this 6 months ago. No money before January was the answer back then. January came and then they weren't sure anymore if they wanted to ship the servers in colocation or not. I decided to push back the upgrading plan until the new Ivy Bridge-EP Xeon arrive since I consider that it would be a poor investment to spend just as much money now while we could get more than 20% more performance by waiting just a few months.

CougTek · Mar 11, 2013

Handruin said:
Looking at your diagram I'm wondering if you're going to run into weird problems with the connectivity for the IPMI connections. I'm confused at how you have so many IPMI connection per server. Are there really 4 LAN connections each?

If you take a look at the servers I plan to use, whether it is the Supermicro 2027TR-H70RF+, HP Proliant Gen8 SL230 (in an SL6500 chassis), or something similar, you'll see that it's 4 half-width 1U servers inside a 2U chassis. The only things that are shared are the power supplies and sometimes the hot-swap drive backplate. Each of the 4 server nodes needs to be configured separately. They need separated IPMI and data network cabling.

Handruin said:
Do they also need to hop back to the SSE-X3348SR switches or could you just plug the house LAN directly into that 100Mb switch instead?

How else could it be? I'm planning this change partly because I want to make it easier to move in case I can finally convince them to send everything in colocation. There are other reasons also, of course, but the entire setup has to connect to a single WAN connection (or two : one in each SSE-3348SR switch for redundancy). It will be a thougher sell if I need a third connection dedicated to the IPMI switch. Not to mention I'd need another firewall on that connection too.

Handruin said:
Traditionally you could do bonding of you multiple network ports through software. I'm not entirely certain how this is done in OpenIndiana but I have to imagine it's possible since most if not all the other Linux variants have options for enabling this.

Good to know. I didn't.

Handruin said:
You may want to research which mode of bonding/trunking that you want to use for your setup. I get the impression that an active->passive setup is what you're aiming for. I think what you may need to do is trunk 802.3ad (LACP) or make use of Spanning Tree in the 40Gb QSFP links between your switches.

[...]

I believe the Supermicro switches also support the fail-over features through use of spanning tree and/or trunking between switches. Those features are listed in the spec.

Ok, that's the part that is like chinese to me. Last time I took a networking course was in 1997 and it was fairly low-end (how to setup an already old-by-then NE2000 network card with the jumpers...was part of the class). I realize that including a course regarding how to configure the network switches will be mandatory in the upgrading proposition. Otherwise, I'm worried we won't be able to use the expensive equipment we'll buy.

Oh and thanks to Chewy for the Solaris class on configuring an aggregate link.

One point of concern I have is that I've stumbled on Intel's H2216JFJR web page and they EOL'ed it after only one year. Granted, that model uses tiny and whiny 40mm fans instead of larger 80mm fans like the Supermicro and HP servers I focus on, but I guess Intel ran in some issues with those servers and that's why they pulled the plug on those so soon.

CougTek · Mar 11, 2013

Handruin said:
I'm assuming your so-called WAN connection goes to some kind of router or router + firewall?

I planned to send the WAN straight into the Supermicro switch and isolate the port in the switch VLAN configuration, then send the signal to a VM with a firewall configured into it (under OpenBSD) and then ship back the connection to the switch to the rest of the network.

This is a bad idea, right? Better to pass thru two separated firewalls (one for each WAN connection) and then to send the signal to the switches? More expensive and more rack space needed, but probably safer too.

ddrueding · Mar 11, 2013

If you want to keep the WAN connections on two dedicated hardware machines, look for a system that will get two low-end servers in a 1U chassis. Anything will do.

But I would be comfortable using the vLAN technique you mentioned, just be sure that vLAN tags are stripped both ways somewhere upstream.

Howell · Mar 11, 2013

Why not pass the two external connections through the same firewall or clustered firewalls for load balancing, failover and NATing? That would give you a more robust solution and free up those servers and Us for something else.

I run an L2 link to colo with brocade on either end. In addition, I have 2 Internet connections at HQ and 1 at colo.

CougTek · Mar 11, 2013

Howell said:
Why not pass the two external connections through the same firewall...

Because if this single firewall fails, I'm fucked.

Howell said:
...or clustered firewalls for load balancing, failover and NATing? That would give you a more robust solution and free up those servers and Us for something else.

Better. Combining those services at the firewall is an idea. I was planning on configuring the network failover on the switches rather than on the firewalls. I have no idea how to configured clustered firewalls for fail-over and load-balancing. I'll hit Google about it.

Howell · Mar 11, 2013

Clustered firewall implementations can cost more than something techically simpler, like a cold spare.

Your RTO will guide this decision and must be defined by the business in order for these decisions to be anything more than shooting in the dark with a shotgun.

VLANs are management tools; not security tools. Even correctly configured implementations can be buggy. I would not expect them to be rigorously tested for security.

You may also want to look into PCI compliance for guidance or your insurance company. You might be surprised what they demand in order to accept the liability.

CougTek · Mar 12, 2013

I've found this document regarding clustered firewalls. It only covers the topic on surface, but it's a good start, I think, to get a general idea about the concept.

Regarding the guidance of our insurance company, with the cluster-fuck we currently have, I don't believe I can do worse.

CougTek · Mar 13, 2013

I've found this document quite interesting. It more or less confirms that I'm on the right track. I'll look into the prices of Netgear switches too.

Anyone knows of a better free application than "dia" to create diagrams?

Mercutio · Mar 13, 2013

Try Gliffy.
The free version doesn't really let you save much, but you can screenshot them into images.

time · Mar 13, 2013

Everything is better than Dia - it's bug-infested crapware.

You could try yEd; still not a patch on a commercial product but at least you can produce something with it.

CougTek · Mar 13, 2013

Thanks to both of you. I prefer yEd and what I've been able to do with it so far is much better than Dia. Not that it is a surprise.

CougTek · Mar 13, 2013

BTW, that article at Anandtech made me smile :

By the end of 2010 we realized two things. First, the server infrastructure that powered AnandTech was getting very old and we were seeing an increase in component failures, leading to higher than desired downtime. Secondly, our growth over the previous years had begun to tax our existing hardware. We needed an upgrade.

[...]

We needed to embrace virtualization and the ease of deployment benefits that came with it. The days of one box per application were over, and we had more than enough hardware to begin to consolidate multiple services per box.

The other guy I work with, who's supervising all the upgrades and maintenance on the servers, made the company buy for 80,000$ of equipment in 2011. We still don't use virtualization...

Handruin · Mar 14, 2013

CougTek said:
BTW, that article at Anandtech made me smile :

The other guy I work with, who's supervising all the upgrades and maintenance on the servers, made the company buy for 80,000$ of equipment in 2011. We still don't use virtualization...

I wish I could convey how much better it has been since the equipment that I've worked on and managed had switched over to a fully virtualized environment. We rarely deploy a non virtualized machine these days. There hasn't been a compelling reason to use a bare metal setup in a long time. All of the products I test and work on are now all virtual appliances. This is what our customers are also buying.

I hope that you can convince your management to eventually go down this path some day. It will make your life easier.

Server farm aberrations : is it normal?

Hairy Aussie

Storage? I am Storage!

Hairy Aussie

Fixture

Fatwah on Western Digital

Storage? I am Storage!

Wotty wot wot.

Hairy Aussie

Hairy Aussie

Fixture

Fatwah on Western Digital

Fixture

Storage? I am Storage!

Storage? I am Storage!

Hairy Aussie

Storage? I am Storage!

Hairy Aussie

Fixture

Storage? I am Storage!

Administrator

Administrator

Hairy Aussie

Hairy Aussie

Administrator

Administrator

Wotty wot wot.

Hairy Aussie

Hairy Aussie

Hairy Aussie

Fixture

Storage? I am Storage!

Hairy Aussie

Storage? I am Storage!

Hairy Aussie

Hairy Aussie

Fatwah on Western Digital

Storage? I am Storage!

Hairy Aussie

Hairy Aussie

Administrator