
We may have overloaded the network


Tripredacus


I may reference network issues with Server 2008 in the past from this thread:

http://www.msfn.org/board/index.php?showtopic=122871

Which is in the 2003 forum...

We may have hit a wall and overloaded the network that the Imagex Server is running on. Here are the specs from my servers thread:

Computer Name: Imagex (Server 2008 Standard x64)

Config

- Intel S5000PSL

- 2x Intel Xeon 2.4GHz Quad-Core CPUs

- 16GB Fully-Buffered RAM

- 80GB SATA RAID1 (onboard)

- 1.76TB RAID10 (storage)

- 3Ware 9550SXU-8LP 64bit SATA RAID

- 2x Intel Gigabit NICs (onboard) TEAMed

- 2x Intel Gigabit NICs (PCI) TEAMed but not used (performance issues)

- apps: Active Directory, Domain Controller, Microsoft OPK, DHCP, DNS, WDS

Networking wise this is the current usage. Now I want to point out we have temporarily altered the network layout and this isn't normal. Also, we are not using managed switches, even though it is recommended (verbiage made it sound like required) by our TAM. Alas I don't get to make purchasing decisions.

Server -> 24 Port gigabit switch (Netgear) -> Netgear 24 port gigabit switch -> Switch3 & Switch4

No activity currently on Switch3

Switch 4 has 2 active connections. Connection1 is imaging with Imagex. Connection2 -> Switch5 -> Switch6

Total connected clients: 25 (24 + 0 + 1)

Client limit on server: 250

Total bandwidth being used (seen via Networking Tab in Task Manager) average: 10%

The problem is that between half and 2/3 of the clients are actively imaging with an 8GB image (using fast compression), and the remaining 1/3 are getting the PE transferred. None of the clients are locked up, but the data transfer rate appears to be nearly zero.

Doing the math on the available bandwidth: 1000Mb/s * 0.10 = 100Mb/s. Split across the 25 connected clients, this gives a maximum of 4Mbps (0.5MBps) per client, not counting SMB and other overhead.
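As a rough check on that arithmetic, here is a minimal Python sketch. It assumes an even split of the observed throughput across all 25 clients and a decimal 8GB image, and it ignores SMB/TCP overhead and compression, so treat the numbers as a ballpark only. The function names are made up for illustration.

```python
# Back-of-the-envelope estimate only; ignores protocol overhead and compression.
LINK_MBPS = 1000     # nominal gigabit link, in megabits per second
UTILIZATION = 0.10   # ~10% average seen in the Task Manager Networking tab
CLIENTS = 25         # total connected clients from the post
IMAGE_GB = 8         # image size from the post (decimal GB assumed)


def per_client_mbps(link_mbps, utilization, clients):
    """Even share of the observed throughput, in megabits per second."""
    return link_mbps * utilization / clients


def transfer_hours(image_gb, rate_mbps):
    """Hours needed to move image_gb gigabytes at rate_mbps megabits per second."""
    megabits = image_gb * 1000 * 8
    return megabits / rate_mbps / 3600


rate = per_client_mbps(LINK_MBPS, UTILIZATION, CLIENTS)
print(f"{rate:.1f} Mb/s per client ({rate / 8:.2f} MB/s)")
print(f"{transfer_hours(IMAGE_GB, rate):.1f} hours per {IMAGE_GB}GB image")
```

At that rate a single 8GB image would take several hours, which matches the "nearly zero" transfer the clients are showing.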

If anyone has any ideas as to why this is happening, other than the "you should be using managed switches" angle, let me know. I totally agree we should upgrade our network in that respect. At least I can say we use the same model switches through-out. I will pull off a trace and open an SR with Microsoft.

Link to comment
Share on other sites


Agreed - the behavior seems more network related than server related. I do recommend managed switches, and also using Multicast vs Unicast (no word on which you're using) if you plan on having multiple users imaging at the exact same time.

Link to comment
Share on other sites

I may have found the problem. Yes it appears to be topology related, also related to bad management decisions in the past. I think our analogy was he (old manager) went to Staples with $100 and brought back 10 switches. Easy to fool some buyers too, say "go to the store and buy a gigabit switch" and we get all these "gigabit" switches.

Except they aren't gigabit switches, but gigabit capable. They are SMC switches capable of handling 13 clients. 12 clients plus 1 ghost server. They are actually 10/100 switches with 2 gigabit ports. So we needed to upgrade the network a little bit, but we had to drop it to do so. See, we have enough switches but they are not in the correct place. Here was the old setup:

Switch 1 uses 6 connections, but has 24 ports.

Switch 2 uses 2 connections, but has 16 ports.

Switch 3 is a 10/100 with 16 ports, uses 13.

Switch 4 has 5 ports and was still in inventory.

So I went and got that out of inventory and did a big network rotation. Now the 6 connections use the 16-port, the 2 connections use the 5-port, and the 24-port replaces the SMC.

I swear, if it really isn't gigabit, it shouldn't say Gigabit switch on the front of it! :realmad:

Link to comment
Share on other sites

  • 2 weeks later...

OK I have a little update, and a question, because the behaviour on the network seems to have some adverse effects on the domain controller. Now that all switches are updated to gigabit, there is still a problem. Any clients with the D945GCLF board in them will only run at 100 max, even though the spec for the onboard NIC says 10/100/1000. I can tell this because the switch indicates orange (100) lights for those ports instead of green (gigabit).

After prolonged use of these machines on the network for imaging, say 2-3 days, one of the other segments loses connectivity. The main switch that the Server connects to indicates an interesting blink pattern for that particular segment. Something like:

5 blinks, then lights out for 2 seconds, then repeat x infinity.

At this point, that particular segment loses DHCP services, and connectivity to the Domain Controller. However, if any clients on that segment (its called Server Island) already have an IP address, they can still access the file server, but not the domain controller. After their leases expire (DHCP is set for 3 hour lease time) those clients then lose all connectivity with the network. Then after a period of time, the second segment (called Gilligan's Island) will go through this process as well, and eventually the entire network loses connectivity.

Neither of the servers indicates any errors in the Event Viewer, except the Domain Controller, which only has Kerberos warnings, which I ignore because Kerberos auth is not used. Also, the DHCP log has no errors or warnings either: no NACKs, only lease cleanups and expiry messages.

Here's the strange part. If I reset the main switch (or any of the switches on any affected Islands) they still do not regain connectivity. The only way to regain connectivity is to reset the main switch and reboot the Domain Controller.

Another strange thing discovered early on is that if you reset the main switch, DHCP will not function (there will be no errors reported either) until you reboot the server. It will show no link issues either. So is it possible that the presence of these clients running at 100 causes enough lag that the Server loses communication with the main switch for a split second, and that this is enough time for it to stop being able to provide DHCP? Also, when these 10/100 clients are active, the rest of the network slows down incredibly, even to a standstill.

Any ideas about this behaviour?

Link to comment
Share on other sites

Config

- Intel S5000PSL

- 2x Intel Xeon 2.4GHz Quad-Core CPUs

- 16GB Fully-Buffered RAM

- 80GB SATA RAID1 (onboard)

- 1.76TB RAID10 (storage)

- 3Ware 9550SXU-8LP 64bit SATA RAID

- 2x Intel Gigabit NICs (onboard) TEAMed

- 2x Intel Gigabit NICs (PCI) TEAMed but not used (performance issues)

- apps: Active Directory, Domain Controller, Microsoft OPK, DHCP, DNS, WDS

How the heck are you teaming ports with a non-managed switch? Is this for fault tolerance only? Because you're definitely not getting link aggregation, and probably confusing the crap out of the switch.

Link to comment
Share on other sites

Hmmm... I will have to bring that up then. I only manage the Domain Controller, I didn't set it up like that. Or at least, I manage the roles, not the config. Another guy does that. Is it possible that this can cause things like this?

Another recent development. The bank of clients with the "gigabit" nics also started a new behaviour. After imaging was complete, they all reported hal.dll errors on reboot, which means the Image didn't actually complete. Temporarily, I had them switch to using my test server (unclesocks) which can only handle 4 clients at a time. I haven't moved that fast (damage control) in a while. It sucked because Unclesocks is used as a dev server, so it didn't have the right PE version (for production) set up on it. It took me about 20 minutes to copy the image over and get the PE all set up, but it seems it is moving a little better now. And if you know, Unclesocks isn't on the production network, so I had to do some copypasta to get the image over there. :sneaky:

EDIT: Also, regarding the teaming issue and the wrong switches. Can someone post some docs or whitepapers about this requirement? I am going to have to prove it to the guy in order to get it either "unteamed" or get management to upgrade the switches.

Link to comment
Share on other sites

If your network guy doesn't know you can't team NICs attached to an unmanaged switch, you should probably look for a new network guy. Anyway, to be specific, you'd likely need to know the algorithm being used for anything specific. However, the most common at this point is going to be LACP (802.3ad / 802.1AX) - just try to find an unmanaged switch that supports the protocol (and find a managed switch that doesn't, which would also be fruitless ;)). Do some reading on 802.3ad or 802.1AX if you want to know more.

Again, if you want an exercise in futility, try finding an unmanaged switch that supports 802.1AX or 802.3ad. Or, even better, find out what switches YOU are using there and spec them - do THEY support either of those protocols? The answer, if they're unmanaged, is likely going to be "no". For example, you mentioned you were using unmanaged 24 port netgear switches, so I've done a little research for you.

The most expensive netgear unmanaged 24-port gigabit switch (JGS524F):

Standards Compliance

IEEE 802.3i 10BASE-T Ethernet
IEEE 802.3u 100BASE-TX Fast Ethernet
IEEE 802.3ab 1000BASE-T Gigabit Ethernet
IEEE 802.3z 1000BASE-X Gigabit Ethernet
IEEE 802.3X Flow Control

And to contrast, the cheapest 24 port gigabit "smart" switch (not even fully L2 managed, only partial managed switch features - GS724T):

Network Protocol and Standards Compatibility 

IEEE 802.3 10BASE-T Ethernet
IEEE 802.3u 100BASE-TX Fast Ethernet
IEEE 802.3ab 1000BASE-T Gigabit Ethernet
IEEE 802.3x full-duplex flow control

Administrative Switch Management

IEEE 802.1D Spanning Tree Protocol
RFC 1157 SNMP v1, v2c
RFC 1213 MIB II
RFC 1643 Ethernet Interface MIB
RFC1493 Bridge MIB
Private Enterprise MIB
Jumbo Frame Support (up to 9216 bytes)
IEEE 802.1Q Tag VLAN
GS716T: 64 Static VLANs
- Supports 16 port-based VLAN
GS724T: 128 Static VLANs
- Supports 24 port-based VLAN
IEEE 802.1p (Class of Service)
DSCP - L3 QoS
Port-based QoS (options High/Normal)
Port Trunking - Manual as per IEEE802.3ad Link Aggregation // <- Link Aggregation, aka NIC Teaming support
DHCP client function
Access Control: Trusted MAC
Broadcast storm control (GS724T only)
Port mirroring (many-to-one)
Port setting
Web-based configuration, anywhere on the network
Smartwizard Discovery Utility program auto discovers devices (up to 254 agents/switches); set system configuration to each agent
Configuration backup/restore (easy to configure more than one switch)
Password access control and Restricted IP Access List
Firmware upgradeable

The unmanaged switch was $259.99 USD, and the "smart" switch is $299.99 USD. Hopefully you can see the difference between a managed and unmanaged switch (and this isn't even a "fully" managed L2 switch, let alone managed at L3 or higher - those feature lists are usually pages long). So YES, you DO *need* a managed switch to get link aggregation functionality. Remember, it's not just the NIC that needs to be capable. Honestly, if you've got a network guy who doesn't know this, he'd be fired if he were my employee. This is pretty 101 stuff.
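If you end up checking a pile of spec sheets for this, a small Python sketch like the following can do the comparison. The two lists are abridged from the Netgear excerpts above; the function name and the pattern are my own illustration, and a spec sheet that words the feature differently (or omits the standard number) would need a broader pattern.

```python
import re

# Abridged from the spec excerpts quoted above.
JGS524F_SPECS = [
    "IEEE 802.3i 10BASE-T Ethernet",
    "IEEE 802.3u 100BASE-TX Fast Ethernet",
    "IEEE 802.3ab 1000BASE-T Gigabit Ethernet",
    "IEEE 802.3z 1000BASE-X Gigabit Ethernet",
    "IEEE 802.3X Flow Control",
]
GS724T_SPECS = [
    "IEEE 802.3ab 1000BASE-T Gigabit Ethernet",
    "IEEE 802.1D Spanning Tree Protocol",
    "IEEE 802.1Q Tag VLAN",
    "Port Trunking - Manual as per IEEE802.3ad Link Aggregation",
]

# Matches either name of the link aggregation standard mentioned in this thread.
LAG_PATTERN = re.compile(r"802\.3ad|802\.1AX", re.IGNORECASE)


def advertises_link_aggregation(spec_lines):
    """True if any spec line mentions 802.3ad or 802.1AX."""
    return any(LAG_PATTERN.search(line) for line in spec_lines)


for name, specs in [("JGS524F", JGS524F_SPECS), ("GS724T", GS724T_SPECS)]:
    print(name, "lists link aggregation:", advertises_link_aggregation(specs))
```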

I don't know what SMC switches you had (I noticed you mentioned you had a mix), but expect the results of checking managed vs unmanaged on that brand to be pretty much the same.

Link to comment
Share on other sites

Thanks.

Also to be accurate, one was SMC but all the rest were made by TrendNet. Right now we get to determine which is the better route for us, building a separate server and network to handle these clients, or to upgrade the switches. Both would be nice but we'll see what happens.

Link to comment
Share on other sites

(DHCP is set for 3 hour lease time)

Just wondering. Why do you have the DHCP lease time set so short? This causes a lot of unneeded broadcasting. When 50% of the lease time has passed, the client will attempt to renew the lease with the original DHCP server. And any time the client boots and the lease is 50% or more passed, the client will attempt to renew the lease...
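To put numbers on that, here is a minimal Python sketch of the renewal timers for a 3 hour lease. It assumes the RFC 2131 default timer fractions (T1 at 50% of the lease for the renewal to the original server, T2 at 87.5% for rebinding) and that every renewal succeeds at T1; a server can override these timers, and the function names are just for illustration.

```python
from datetime import timedelta

# RFC 2131 default fractions of the lease duration (assumed not overridden).
T1_FRACTION = 0.5     # client tries to renew with the original server
T2_FRACTION = 0.875   # client falls back to rebinding with any server


def lease_timers(lease):
    """Return (T1, T2) as timedeltas for a given lease duration."""
    return lease * T1_FRACTION, lease * T2_FRACTION


def renewals_per_day(lease):
    """Renewals per client per day if every renewal succeeds at T1."""
    t1, _ = lease_timers(lease)
    return timedelta(days=1) / t1


lease = timedelta(hours=3)
t1, t2 = lease_timers(lease)
print("T1 (renew): ", t1)
print("T2 (rebind):", t2)
print("Renewals per client per day:", renewals_per_day(lease))
```

Multiply the per-client figure by the number of clients on the segment to get a feel for the extra DHCP traffic a short lease generates.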

Link to comment
Share on other sites

When I worked OPs at an ISP, we used this. The longest an interface should be active on that specific network (depending on certain conditions) would be 1 hour.

Link to comment
Share on other sites

  • 2 months later...

Yet again I make the request to upgrade the network. Our main issue is that once any certain (uncertain) number of clients running at 100Mbps start using the network, the entire thing drops to 100. Even the gigabit clients, and heck the server drops too. Will using a smart switch be able to keep 100Mbps clients at 100 and Gigabit at Gigabit speeds? The purchasing guy balked at the prices of full-managed switches, but seemed to like the price of the smart switches.

Link to comment
Share on other sites
