That's what the customer said last Friday. The Linux driver for the Intel e1000 card doesn't work on their hardware. No syslog messages, though. And it's urgent! Their setup should go live in a week! O-kay. Personally, I didn't think it was the driver. It sounded more like a hardware issue. Does the switch show a hardware link? What, they have an Ethernet bonding configuration in a high availability setup? Turn that off first. So they told us that they had turned off HA and Ethernet bonding, but it still didn't work. But what do they mean by "does not work"? They told us that the switch wasn't accessible; it was in a different location or something like that. How about getting some cheap Longshine switch and trying it, just to see if there is an Ethernet link beat?
This went back and forth until Wednesday. They insisted that the driver was broken and that we were supposed to work with the hardware vendor (despite the machine not being certified in that configuration). There was lots of contradictory information about their configuration, plus the usual we-already-tried-it and but-it-used-to-work-with-SLES9 claims... Then they broke the news that the card in fact works without bonding. Finally, I got their network configuration files and a somewhat plausible explanation of what they were trying to do.
The explanation, of course, was misleading and confusing. The issue escalated further. And I still had the impression that we did not have all the necessary information. In fact, we didn't, until yesterday.
The non-working bonded connection is their fail-over heartbeat link, connected directly to the other machine. This cannot work in the "active-backup" bonding mode; it requires a specially configured switch in between. In fact, when asked, they admitted that they had tried it with the switch and that it does work. Of course, the switch would introduce a single point of failure, defeating the purpose of having two lines. But why the bonding? Why don't they just use two single connections? Well, they use the same connection for DRBD communication, which, of course, introduces the risk of sporadic erroneous fail-overs due to missed or late heartbeats. In a high availability cluster setup, the heartbeat lines must not be used for anything else.
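To make the late-heartbeat risk concrete, here is a minimal sketch in Python. It is not the customer's cluster software, and the function name, the timestamps, and the three-second limit are all made up for illustration. It only shows the principle: a node is declared dead as soon as the gap between two heartbeats exceeds a fixed limit, so a single heartbeat stuck behind a burst of replication traffic is enough to trigger a fail-over.

```python
def missed_deadline(arrival_times, deadtime):
    """Return True if any gap between consecutive heartbeats exceeds deadtime.

    arrival_times: heartbeat arrival timestamps in seconds (hypothetical data).
    deadtime: seconds of silence after which the peer is considered dead
              (an assumed value, not taken from the customer's configuration).
    """
    gaps = [later - earlier
            for earlier, later in zip(arrival_times, arrival_times[1:])]
    return any(gap > deadtime for gap in gaps)


# Dedicated heartbeat line: one heartbeat per second, nothing in the way.
dedicated = [0.0, 1.0, 2.0, 3.0, 4.0]

# Shared line: the fourth heartbeat is queued behind replication traffic.
shared = [0.0, 1.0, 2.0, 5.5, 6.0]

print(missed_deadline(dedicated, deadtime=3.0))
print(missed_deadline(shared, deadtime=3.0))
```

The peer on the shared line never actually failed; its heartbeat was merely late, and a simple timeout check like this one cannot tell the difference.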
With their setup they would be much better off with a cold stand-by machine, switched on manually if the main machine fails. I recommend postponing the "go live" by a couple of weeks, purchasing some training and consulting hours from us, and doing a proper design.