MSFN Forum: Ethernet problems when CPU is not being used - MSFN Forum


Ethernet problems when CPU is not being used

#1 User is offline   ajh499 

  • Newbie
  • Group: Members
  • Posts: 13
  • Joined: 14-March 07

  Posted 17 May 2009 - 02:14 PM

Hello everyone,

I hope someone can help me with a very strange problem.

I have two brand new HP EliteBook 8730w laptops, with quad core processors. One is running Vista, the other has XP on it.

The problem I am having is that when streaming data packets over a TCP connection from either an embedded device or from another computer running some test software, the laptop either loses some of the data, or it arrives late after a number of retries. This is using the built in Intel gigabit ethernet card, not the wireless.

I'm seeing errors in the received column if I do a netstat -e. And I see "Packets received discarded" in Windows Performance Monitor.
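For anyone who wants to compare those counters between runs rather than eyeballing them, here is a rough Python sketch that parses the text printed by `netstat -e`. The sample text and its numbers are invented for illustration, and the row labels assume the English-language Windows output; the function names are hypothetical.

```python
import re

def parse_netstat_e(text):
    """Return {row label: (received, sent or None)} from `netstat -e` text."""
    stats = {}
    for line in text.splitlines():
        # A data row is a label followed by one or two integer columns.
        match = re.match(r"^\s*(\D+?)\s+(\d+)(?:\s+(\d+))?\s*$", line)
        if match:
            name, received, sent = match.groups()
            stats[name] = (int(received), int(sent) if sent is not None else None)
    return stats

def received_delta(before, after, row):
    """How much the Received column of `row` grew between two snapshots."""
    return after[row][0] - before[row][0]

# Invented sample snapshots (not measurements from this thread).
SAMPLE_BEFORE = """Interface Statistics

                           Received            Sent

Bytes                      10000000         5000000
Unicast packets                9000            4000
Non-unicast packets              10               5
Discards                          2               0
Errors                            3               0
Unknown protocols                 0
"""

SAMPLE_AFTER = SAMPLE_BEFORE.replace(
    "Errors                            3", "Errors                          503"
)

before = parse_netstat_e(SAMPLE_BEFORE)
after = parse_netstat_e(SAMPLE_AFTER)
print("new receive errors:", received_delta(before, after, "Errors"))
```

In practice the two snapshots would come from running `netstat -e` before and after a transfer and saving the output to text.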

The really strange thing is that if I run something other than the data receiver program, either a bit of test software written by me, or something like Prime95, so that the CPU is loaded, suddenly the data errors stop and all the packets are received correctly.

It doesn't matter which of the two laptops I use as the receiver, both have the same problem. I've tried different network drivers and made sure all other drivers are up to date. The laptops already came with the most up to date bios.

I've even tried using ExpressCard and USB network adapters instead of the built in Intel card with no luck.

Has anyone ever seen anything like this?

Any help would be greatly appreciated.


Alex


#2 User is offline   tain 

  • Cyber Ops
  • Group: Super Moderator
  • Posts: 3,443
  • Joined: 24-September 05
  • OS:none specified

Posted 28 May 2009 - 10:06 AM

Since they have common symptoms then it might be the hub/switch/router that they are plugged into. Or some other device on the network. We'd have to know more about your topology to figure out which component to troubleshoot first.

#3 User is offline   ajh499 

  • Newbie
  • Group: Members
  • Posts: 13
  • Joined: 14-March 07

Posted 01 June 2009 - 01:19 AM

tain, on May 28 2009, 05:06 PM, said:

Since they have common symptoms then it might be the hub/switch/router that they are plugged into. Or some other device on the network. We'd have to know more about your topology to figure out which component to troubleshoot first.


I've tried it both with a switch between them and by simply connecting back to back, there are no other devices on the network. Both ways of connecting display the same problems.

This post has been edited by ajh499: 01 June 2009 - 01:21 AM


#4 User is offline   DigeratiPrime 

  • MSFN Junkie
  • Group: Super Moderator
  • Posts: 3,490
  • Joined: 18-August 04
  • OS:Windows 7 x64

Posted 02 June 2009 - 09:09 AM

What service packs do you have installed on XP and Vista?

What OS is running on the embedded device?

You initially said that the packets arrive late; that would implicate the source or the network.

What is the speed of the link? Perhaps you are trying to push a gigabit of traffic over a fast ethernet link?

Maybe you get fewer errors with the CPU loaded because you are receiving fewer packets? Prime95 will only load 1 core per instance.

Are the laptops plugged in or running on battery?

#5 User is offline   ajh499 

  • Newbie
  • Group: Members
  • Posts: 13
  • Joined: 14-March 07

Posted 03 June 2009 - 01:14 AM

DigeratiPrime, on Jun 2 2009, 04:09 PM, said:

What service packs do you have installed on XP and Vista?

What OS is running on the embedded device?

You initially said that the packets arrive late; that would implicate the source or the network.

What is the speed of the link? Perhaps you are trying to push a gigabit of traffic over a fast ethernet link?

Maybe you get fewer errors with the CPU loaded because you are receiving fewer packets? Prime95 will only load 1 core per instance.

Are the laptops plugged in or running on battery?


The embedded device runs VxWorks, I think, but I might be wrong.

The Vista laptop has SP1, the XP one is SP2.

I think the packets arrive late because they have to be resent due to the first attempt containing an error for whatever reason.

All devices used in trying to find the cause of this problem are gigabit.

The laptops have been tested plugged in and running on battery, it doesn't seem to make any difference.

The latest version of Prime95 will load all cores by default. However only one core (doesn't seem to matter which) needs to be loaded for the packet errors to disappear.
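As a stand-in for Prime95 when reproducing this, a single-threaded busy loop is enough to keep roughly one core occupied. This is only a minimal Python sketch of that idea, not the test software mentioned above; `busy_loop` is a hypothetical name, and the operating system is free to move the thread between cores.

```python
import time

def busy_loop(seconds):
    """Spin on one thread for roughly `seconds`; return the iteration count."""
    deadline = time.perf_counter() + seconds
    iterations = 0
    while time.perf_counter() < deadline:
        iterations += 1  # trivial work, just to keep the core busy
    return iterations

# Short demonstration run; use a longer duration during a real transfer test.
count = busy_loop(0.1)
print("iterations completed:", count)
```

Calling `busy_loop(60)` in a separate console while the network test runs would load one core for about a minute.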

#6 User is offline   DigeratiPrime 

  • MSFN Junkie
  • Group: Super Moderator
  • Posts: 3,490
  • Joined: 18-August 04
  • OS:Windows 7 x64

Posted 03 June 2009 - 07:50 AM

The problem with the claim that loading the CPU on the receiving machine affects the reception of the packets is that error detection is usually offloaded to the NIC. Hence the CPU would not be involved in the error detection.

Check if this is true by opening the Device Manager snap-in (devmgmt.msc), expanding Network Adapters, opening the Properties for the Intel Gb NIC, going to the Advanced tab, and checking that IP/TCP/UDP Checksum Offload is enabled.

Only other thing I can think of where the CPU might be involved is in the link speed. Maybe if the CPU is taxed the link speed would reduce from 1000BASE-T to 100BASE-TX or 10BASE-T and there is a poor link between the two nodes.

First verify that Auto Negotiation is set for both NICs (TX and RX) and the cable is CAT-6 or better. You could try forcing 10BASE-T Half Duplex and see if the quality improves.

I think it is more likely we are overlooking something.

#7 User is offline   ajh499 

  • Newbie
  • Group: Members
  • Posts: 13
  • Joined: 14-March 07

Posted 03 June 2009 - 09:01 AM

DigeratiPrime, on Jun 3 2009, 02:50 PM, said:

The problem with the claim that loading the CPU on the receiving machine affects the reception of the packets is that error detection is usually offloaded to the NIC. Hence the CPU would not be involved in the error detection.

Check if this is true by opening the Device Manager snap-in (devmgmt.msc), expanding Network Adapters, opening the Properties for the Intel Gb NIC, going to the Advanced tab, and checking that IP/TCP/UDP Checksum Offload is enabled.

Only other thing I can think of where the CPU might be involved is in the link speed. Maybe if the CPU is taxed the link speed would reduce from 1000BASE-T to 100BASE-TX or 10BASE-T and there is a poor link between the two nodes.

First verify that Auto Negotiation is set for both NICs (TX and RX) and the cable is CAT-6 or better. You could try forcing 10BASE-T Half Duplex and see if the quality improves.

I think it is more likely we are overlooking something.


Thank you for your suggestions, I've just given them a try.

TCP/UDP Checksum Offload makes no difference whether it is enabled or disabled.

I've used a variety of cables when testing this, and they don't seem to make a difference and all work with a different computer as the receiver. Or sending from the laptop to another machine, for that matter.

I don't think that the link speed is dropping due to the CPU being loaded as the number of bytes sent and the average transfer rate are very similar regardless of the CPU loading. I guess it would actually be slightly lower due to the retries of the discarded packets, but the readout in Performance Test is not showing it in enough detail for that.

The one thing that did seem to make a difference was the link speed and duplex setting. I tried all combinations of half and full duplex at 10, 100 and 1000 Mbit/s (except for 1000 Half duplex as it is not an option). The transfer rate would be the same at the same speed setting, but the half duplex mode would not have any discarded packets, but the full duplex modes would. 100 BASE - Full Duplex discarded more packets than even Gigabit.

The trouble is we need around 200 Mbit/s to receive the data from the embedded device, so we have to be able to use Gigabit.
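To put numbers on that requirement, here is a small Python sketch comparing the roughly 200 Mbit/s figure against the nominal line rate of each link speed. It ignores framing and protocol overhead, so real usable throughput is somewhat lower than the nominal rate; the function name is made up for illustration.

```python
def utilisation(required_mbit_s, link_mbit_s):
    """Fraction of the nominal line rate that the stream would occupy."""
    return required_mbit_s / link_mbit_s

REQUIRED_MBIT_S = 200  # approximate figure quoted in the post

for name, rate in (("10BASE-T", 10), ("100BASE-TX", 100), ("1000BASE-T", 1000)):
    share = utilisation(REQUIRED_MBIT_S, rate)
    verdict = "fits" if share < 1 else "does not fit"
    print(f"{name}: {share:.0%} of nominal line rate, {verdict}")
```

Only the gigabit link leaves any headroom, which is why dropping to 100 Mbit/s half duplex is not a usable workaround here.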

I'm still very confused :wacko:


Alex

#8 User is offline   DigeratiPrime 

  • MSFN Junkie
  • Group: Super Moderator
  • Posts: 3,490
  • Joined: 18-August 04
  • OS:Windows 7 x64

Posted 03 June 2009 - 12:42 PM

Well we are not going to eliminate packet errors entirely, and I am not sure how "good" or "bad" your connection really is.

What about IP checksum?

Please verify that you're using Certified CAT-6 or better patch cable(s), meaning not self terminated or dollar store stuff.

#9 User is offline   ajh499 

  • Newbie
  • Group: Members
  • Posts: 13
  • Joined: 14-March 07

Posted 03 June 2009 - 03:00 PM

DigeratiPrime, on Jun 3 2009, 07:42 PM, said:

Well we are not going to eliminate packet errors entirely, and I am not sure how "good" or "bad" your connection really is.

What about IP checksum?

Please verify that you're using Certified CAT-6 or better patch cable(s), meaning not self terminated or dollar store stuff.


IP checksum makes no difference either.

Of course there is always a chance of packet errors, but I'm seeing hundreds, or even thousands of errors per second, when the CPU is doing nothing. Then no errors at all when the CPU has a bit of load on it.

I'll check out what the cables are that I've tried and make sure at least one of them is Cat 6, but I don't think it's that.

The connection should be as "Good" as it can get, two machines with gigabit cards connected back to back with a 2m-ish long cable.

I could understand a connection problem if the same cable lost packets on another machine, but it doesn't. Or, if it was something to do with the laptop's built in network adapter, but I've tried two different cards and both have the same problem. And why does loading the CPU stop the packets from apparently containing errors?

Very, very odd!!

Alex

#10 User is offline   DigeratiPrime 

  • MSFN Junkie
  • Group: Super Moderator
  • Posts: 3,490
  • Joined: 18-August 04
  • OS:Windows 7 x64

Posted 03 June 2009 - 07:58 PM

I think our major clue is now in this observation (emphasis mine):

Quote

the half duplex mode would not have any discarded packets, but the full duplex modes would. 100 BASE - Full Duplex discarded more packets than even Gigabit.

It indicates a duplex mismatch
http://en.wikipedia....Duplex_mismatch

Maybe your VxWorks device has problems with Full Duplex mode? Many embedded devices do have cheap NICs, often not gigabit rated.

#11 User is offline   ajh499 

  • Newbie
  • Group: Members
  • Posts: 13
  • Joined: 14-March 07

Posted 04 June 2009 - 02:17 AM

DigeratiPrime, on Jun 4 2009, 02:58 AM, said:

I think our major clue is now in this observation (emphasis mine):

Quote

the half duplex mode would not have any discarded packets, but the full duplex modes would. 100 BASE - Full Duplex discarded more packets than even Gigabit.

It indicates a duplex mismatch
http://en.wikipedia....Duplex_mismatch

Maybe your VxWorks device has problems with Full Duplex mode? Many embedded devices do have cheap NICs, often not gigabit rated.



Sorry, the half / full duplex problem appears to be a red herring. My fault. :blushing:


I don't think I explained what I'm doing very well.

I'm not actually testing this problem most of the time using the embedded device, as it is quite complicated to test against and there is always the potential for bugs in our data receiving code.

Most of the testing I'm doing is using the Network test from Performance Test 7 by Passmark software, and two computers (the HP laptop with the problem and a Dell desktop) connected back-to-back. I'm using this software as it is easy to use to show up the same problem on the HP laptop as we see with either our own test software on two computers back to back, or with the embedded device connected to the laptop with more complex software running.

While I was trying out the half / full duplex options yesterday, I was forcing the laptop to the speed and duplex mode that I wanted but leaving the other machine on Auto. Apparently this was causing a duplex mismatch. I've just tried it again, this time setting the speed and duplex of both machines, and it works correctly, with no packet errors except at gigabit.
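The forced-versus-Auto behaviour can be sketched in a few lines of Python. This assumes the textbook case for 10 and 100 Mbit/s links: the forced end sends no negotiation signalling, so the Auto end detects the speed but falls back to half duplex. Some drivers behave differently when "forced", so treat this as an illustration rather than a description of these particular NICs; the function names are hypothetical.

```python
def auto_side_result(forced_speed_mbps, forced_duplex):
    """Mode the auto-negotiating end settles on against a forced 10/100 partner."""
    if forced_speed_mbps not in (10, 100):
        raise ValueError("this sketch only models forced 10 or 100 Mbit/s links")
    if forced_duplex not in ("half", "full"):
        raise ValueError("duplex must be 'half' or 'full'")
    # Speed is detected from the signal; duplex cannot be, so half is assumed.
    return (forced_speed_mbps, "half")

def is_duplex_mismatch(forced_speed_mbps, forced_duplex):
    """True when the two ends end up in different duplex modes."""
    return auto_side_result(forced_speed_mbps, forced_duplex)[1] != forced_duplex

for speed in (10, 100):
    for duplex in ("half", "full"):
        state = "mismatch" if is_duplex_mismatch(speed, duplex) else "ok"
        print(f"forced {speed}/{duplex} vs Auto: {state}")
```

Under that assumption only the forced full duplex settings produce a mismatch, which lines up with the half duplex tests looking clean while the full duplex ones did not.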

This is with two computers back to back with some test software, however the embedded device needs gigabit as the data rate is far too high for a 100BASE network.

I also tried out a 1m long CAT6 cable, and at gigabit it still shows packet errors when the CPU is not busy.


Any more suggestions? Anyone?

Alex

#12 User is offline   DigeratiPrime 

  • MSFN Junkie
  • Group: Super Moderator
  • Posts: 3,490
  • Joined: 18-August 04
  • OS:Windows 7 x64

Posted 04 June 2009 - 11:20 AM

At this point I would be running Network Monitor or Wireshark on both interfaces and try to figure out what is happening.

http://www.microsoft.com/downloads/details...;displaylang=en
http://www.wireshark.org/

Basically we need to know how the packets looked when transmitted and how they were received.

Quote

all work with a different computer as the receiver

Time to get on the phone/email HP and Intel about this and see what they say.

#13 User is offline   ajh499 

  • Newbie
  • Group: Members
  • Posts: 13
  • Joined: 14-March 07

Posted 04 June 2009 - 12:32 PM

DigeratiPrime, on Jun 4 2009, 06:20 PM, said:

At this point I would be running Network Monitor or Wireshark on both interfaces and try to figure out what is happening.

http://www.microsoft.com/downloads/details...;displaylang=en
http://www.wireshark.org/

Basically we need to know how the packets looked when transmitted and how they were received.

Quote

all work with a different computer as the receiver

Time to get on the phone/email HP and Intel about this and see what they say.



I've already tried phoning HP technical support, and they were completely useless.

I don't think Intel offer any support for notebook products, they just direct you to the manufacturer of the machine.

I've had a go with Wireshark already, and I can see that every so often the TCP sequence number stops increasing for a while, then carries on again. I guess that is the point at which the data is being resent.
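One way to pick those stalls out of a capture is to export the sequence number and payload length of each data segment in one direction and look for segments that do not advance the highest byte seen so far. The Python sketch below does that on an invented list of values; it ignores sequence-number wraparound, and the function name is hypothetical.

```python
def find_non_advancing(segments):
    """Return indexes of segments that end at or below the highest byte seen.

    `segments` is a list of (sequence_number, payload_length) tuples in
    capture order for a single direction of one TCP connection.
    """
    highest_end = None
    flagged = []
    for index, (seq, length) in enumerate(segments):
        end = seq + length
        if length > 0 and highest_end is not None and end <= highest_end:
            flagged.append(index)  # resend or late gap-fill, no new high-water mark
        if highest_end is None or end > highest_end:
            highest_end = end
    return flagged

# Invented example: the fourth segment repeats bytes 1461..2920.
capture = [(1, 1460), (1461, 1460), (2921, 1460), (1461, 1460), (4381, 1460)]
print("non-advancing segment indexes:", find_non_advancing(capture))
```

If the original segment never reached the capture, the resend still shows up here because it starts below data that arrived after it.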

I'll have another go with it tomorrow, but I think that the erroneous packets do not make it as far as Wireshark. Which doesn't really help very much.

#14 User is offline   DigeratiPrime 

  • MSFN Junkie
  • Group: Super Moderator
  • Posts: 3,490
  • Joined: 18-August 04
  • OS:Windows 7 x64

Posted 04 June 2009 - 06:20 PM

I've actually contacted Intel Support about their Network Adapters before and I've had good experiences with them.

http://supportmail.i...elcome.aspx?id=
Family: Network Connectivity
Line: Intel Desktop Adapters
Product: *model name from device manager*

What differences do you observe in Wireshark while throttling your CPU? I am still curious if that really affects the network...

#15 User is offline   DigeratiPrime 

  • MSFN Junkie
  • Group: Super Moderator
  • Posts: 3,490
  • Joined: 18-August 04
  • OS:Windows 7 x64

Posted 04 June 2009 - 11:59 PM

Got an idea: try setting up Perfmon as you see here. I have to double check later if these are the right counters...

We should see packet errors when your CPU is idle, and fewer errors when the CPU is taxed.
Posted Image

#16 User is offline   ajh499 

  • Newbie
  • Group: Members
  • Posts: 13
  • Joined: 14-March 07

Posted 05 June 2009 - 07:36 AM

I've run perfmon, I think the image below shows the problem pretty well!

Posted Image


As for Wireshark, there is a slight problem. Capturing the packets introduces enough load on the CPU that the packet errors do not occur.

I'm not sure if it is a problem with the Intel network card, as the same thing happens with an ExpressCard adapter with a Marvell chipset.


Alex

#17 User is offline   DigeratiPrime 

  • MSFN Junkie
  • Group: Super Moderator
  • Posts: 3,490
  • Joined: 18-August 04
  • OS:Windows 7 x64

Posted 05 June 2009 - 11:36 AM

Yeah, that definitely shows it! :wacko:

Can you run Wireshark without Prime95 and maybe see these errors?

You could set the numproc to 1, or try moving Prime95 to a different core and see if these errors still happen.

I think another thing to consider, besides the "cpu effect", is you're getting too many errors to begin with.

I would definitely want to know what Intel says about this.

#18 User is offline   ajh499 

  • Newbie
  • Group: Members
  • Posts: 13
  • Joined: 14-March 07

Posted 05 June 2009 - 12:00 PM

DigeratiPrime, on Jun 5 2009, 06:36 PM, said:

Yeah, that definitely shows it! :wacko:

Can you run Wireshark without Prime95 and maybe see these errors?

You could set the numproc to 1, or try moving Prime95 to a different core and see if these errors still happen.

I think another thing to consider, besides the "cpu effect", is you're getting too many errors to begin with.

I would definitely want to know what Intel says about this.



I did run Wireshark without prime, but the CPU load of capturing the packets prevents the errors from occurring.

Also for the test in the perfmon screenshot, I only run Prime95 on one core, that's why the CPU usage only goes up by 25%. It's maxing out one core of a quad core laptop.

I know I'm getting too many errors to begin with, it should be practically nothing. The issue with the CPU only adds to the weirdness.

Alex

#19 User is offline   DigeratiPrime 

  • MSFN Junkie
  • Group: Super Moderator
  • Posts: 3,490
  • Joined: 18-August 04
  • OS:Windows 7 x64

Posted 05 June 2009 - 03:02 PM

Well, in my previous post I suggested trying to load a different core, or having Windows use only one core via the numproc setting in msconfig. I am leaning towards the NDIS drivers, though you have tried different ones; quad cores are still rather new. This might have something to do with RSS (Receive Side Scaling). I would suggest going to Vista SP2 to see if things change. BTW are these x86 or x64 machines (it could affect the drivers)?

You could try to disable RSS using
netsh interface tcp set global rss=disabled


Can you describe in more detail the 'test' you are doing. For example I tried to replicate your experience by streaming a bluray movie between two of my pcs on a gigabit connection.

I am not familiar with the other counters available in Performance Monitor, but you may want to read the descriptions of these: 'IPv4', 'Per Processor Network Activity Cycles', and 'Per Processor Network Interface Card Activity'. You could do a shotgun approach and just add all of those counters and look for weird behavior. You can also add counters from the other PC so they appear in the same graph.

I am really anxious to get the Windows Internals 5th edition book, only 12 more days! :P

#20 User is offline   ajh499 

  • Newbie
  • Group: Members
  • Posts: 13
  • Joined: 14-March 07

Posted 07 June 2009 - 03:25 AM

I have tried loading different cores; it doesn't seem to matter which is loaded as long as one is.

I'll have a go tomorrow at turning RSS off and running perfmon with a few more counters and seeing if anything weird shows up.

The machines are both running the 32-bit versions of windows.

The simplest test I've been doing is using the Network test option in Performance Test 7 from Passmark software to stream packets over a TCP connection from one machine to another, and watching the number of packet errors in perfmon. The reason I'm using this software is that I wanted to make sure that the problems still showed up when using third-party software rather than something I had written that may have had a bug in it. Also, it's easy for someone else to try the same software as it has a 30 day evaluation available.

The packet size seems to make a difference. If I set the packet size to 3000 bytes, I get very large numbers of errors, but if I set it to 3072 (ie 3K), I get very few, if any errors. The MTU size is 1460 so neither 3000, nor 3072 is a multiple.
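To see how the two packet sizes would break up on the wire, here is a small Python sketch that splits a write into segments. It treats the 1460 figure above as the per-segment payload limit, which is an assumption (1460 bytes is the usual TCP payload on a 1500-byte Ethernet MTU). TCP is a byte stream, so the stack may coalesce or split writes differently; this only shows the simplest case of one write sent on its own.

```python
def split_into_segments(payload_bytes, max_segment=1460):
    """Sizes of the segments needed to carry one write of `payload_bytes`."""
    full, remainder = divmod(payload_bytes, max_segment)
    sizes = [max_segment] * full
    if remainder:
        sizes.append(remainder)  # short trailing segment
    return sizes

for size in (3000, 3072):
    print(size, "->", split_into_segments(size))
```

In this simple model both sizes need three segments and differ only in the short trailing one (80 versus 152 bytes), so the segment count alone does not explain the difference in error rate.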

All trademarks mentioned on this page are the property of their respective owners
Copyright © 2001 - 2011 msfn.org