


Shotgun Wedding
by Darren Ashby
Have you ever had a problem
with a circuit that you just couldn't figure out? One reader did and sent me
a very specific example, then concluded with a very general question that inspired
this article. After I responded I wondered: could I come up with a troubleshooting
methodology that could be passed on to my fellow engineers? I present an answer
to this question after the letter. (If you want to read the methodology
and skip the letter and my response, jump ahead to the text that follows my answer.)
Question:
Hi Darren,
I have a general question
inspired by a very particular incident.
The company I work
for makes industrial controllers (PID) and indicators used to monitor
inputs such as thermocouples, 0-5V, 4-20mA, 0-20mA, etc. Units are assembled
and placed for a minimum of 48 hours in a burn-in chamber set to 60
degrees C. After removal from the burn-in chamber, units are tested/calibrated
in a homemade station.
Here is what has happened
to me. A software bug was corrected, new code flashed into a micro and
then tested in a calibration stand. This was done 32 times and all 32
times the unit passed calibration. Now onto production. Guess what?
Many units are failing calibration.
So this is the question.
In general, in your experience, which PARTS (ICs or transistors or discretes)
are MOST LIKELY to be the PROBLEM?
Appreciate your help.
Michael
Answer:
Hello Mike,
From your description,
it sounds like the temperature increase caused the problem. So I would
suspect capacitors first if they are part of the input circuitry. (They
would be unlikely culprits if they are in the power supply, however.) Next I would
suspect the microcontroller, as it uses some type of oscillator circuit
that could change with temperature. The dies in transistors and diodes
are usually good to 125C, so I would not suspect those unless they are
carrying enough current to account for the additional 65C (the gap between
the 60C chamber and that 125C limit). Are the units
permanently damaged, or do they recover if left to cool down? Don't overlook
the PCB, however. It expands and contracts with temperature, stressing
traces and solder joints, which could be a problem if the circuit design
is sensitive to that change.
In a more general
light, I think engineers often overlook specific parameters of discrete
parts. (They usually aren't perfect like we are taught to believe in school.)
I think analog engineers who know their transistors inside and out are
a dying breed, so if a circuit uses a transistor for anything
more than a switch, I would suspect that circuit. ICs are really just a
bucket of transistors that were made to be easy to use, so I would look
at them last.
I guess it's hard
to make a blanket statement on what is likely to fail, as there are often
many small clues to a particular problem. And to complicate
matters, it is likely a combination of two or more factors causing the
problem. Sometimes the clues may seem insignificant. One time
early in my career we had a problem with some displays we were producing.
A percentage of them were failing, and I was assigned to find out why.
When I took a unit apart, it would function correctly. When I put it
back together, it would fail again. I looked for hours trying to find problems
with pinched wires and cold solder joints, to no avail. So I sat there
and stared at the PCB for a while. As I did, I noticed two small marks
on a resistor. I wondered where they came from, because I hadn't scratched
anything. After some examination I discovered a screw head that would
contact this resistor when the PCB was installed. When I removed the screw,
the console worked correctly after assembly.
My only rule of thumb
is don't discount a theory (no matter how obvious or ridiculous it may
seem). Try to prove it right or wrong by experiment and then move on to
the next idea. And one more thing, start with the simple things first.
(If you want to know what I mean, read "ohms law still works" in my archives.)
Darren
I often see engineers having immense difficulty with diagnosing the cause of a
problem when a lowly tech can identify the bad part right away. Sometimes a
tech will struggle for days and the engineer will take one look at the schematic
and say, "there is your problem." Some people have trouble with troubleshooting.
First, let's categorize different
types of problems as well as different methods of troubleshooting.
Design problem: This
is the most common mistake and the easiest to find, as it is generally repeatable
and consistent.
Tolerance problem:
Really a design problem, but I give it a special category because this
is typically inconsistent and difficult to repeat. Environmental effects commonly
aggravate this type of problem.
EMI problem:
This can also be difficult to repeat; who knows when EMI is going to hit? It
will often trip up the most competent engineers.
Software problem:
So many products today use some type of software or firmware. I have seen software
exhibit all of the symptoms above, and I have also seen it used to correct one of the above
problems even though it was really a hardware issue. It gets its own category
for that reason. Here is a philosophical question: if you can fix a hardware
problem with software, was it really a software problem in the first place?
So what types of troubleshooting
methods are there? I like to group them into two categories.
Scientific method
Do what any good detective would do, look for all the clues you have been given
and deduce what might be the problem based on experience and knowledge. Advantage:
eventually you will identify the problem. Disadvantage: it takes a lot of patience.
Shotgun method
Take a shot at as many possibilities as you can and hope you get a hit. Sometimes
you get lucky, and you solve the problem fast. But if you aren't careful
you can easily chase yourself in circles.
Scientific Shotgun Method
Does it surprise
you if I say I think you can solve all of the above problems with a combination,
a marriage if you will, of the shotgun method and the scientific method? I will
elaborate. When a problem first comes to your attention, sit down and write
down all the things you think it might be. Use your intuition as well as your
experience in this exercise. Speaking metaphorically, get out the shotgun, take
aim and fire. Now comes the second part. Let the scientific method kick in:
figure out a way to evaluate each of your theories to prove or disprove it.
And have at it.
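For those who think better in code, here is one way the two steps might be sketched in Python. This is only an illustration, not a tool I actually use: the `Hypothesis` class, the `troubleshoot` function, the time estimates and the example guesses are all made up for the purpose. Sorting by effort is my own assumption, borrowed from the advice to start with the simple things first.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Hypothesis:
    """One pellet from the shotgun blast: a guess plus a way to test it."""
    description: str
    minutes_to_check: int      # rough effort estimate (assumed, not measured)
    check: Callable[[], bool]  # returns True if the experiment confirms the guess


def troubleshoot(hypotheses: List[Hypothesis]) -> Optional[Hypothesis]:
    """Test each guess by experiment, cheapest first; return the first hit."""
    for h in sorted(hypotheses, key=lambda h: h.minutes_to_check):
        confirmed = h.check()
        print(f"{h.description}: {'CONFIRMED' if confirmed else 'ruled out'}")
        if confirmed:
            return h
    return None  # no hit: reload with what you learned and fire again


# Step 1 (shotgun): write down everything it might be.
# The checks below are stand-ins; real ones would be bench measurements.
guesses = [
    Hypothesis("Input capacitor drifts with temperature", 120, lambda: False),
    Hypothesis("Oscillator frequency shifts when hot", 60, lambda: False),
    Hypothesis("Wrong software version flashed", 5, lambda: True),
]

# Step 2 (scientific): prove or disprove each one.
culprit = troubleshoot(guesses)
```

With these stand-in checks the five-minute software check runs first and ends the hunt before anyone reaches for the heat gun. If every guess is ruled out, the function returns `None`, which is the cue to make a new list using what the experiments taught you.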
I typically see these results:
7 out of 10 times it was something stupid that the shotgun method caught easily
and quickly, such as an old software version, a component that wasn't
stuffed, or a blown fuse. About 2 out of 10 times something
more subtle was found that took some trial and error and required new data to
be gathered and evaluated until the problem was solved. 1 out of 10 times the solution
took longer, but eventually was found by repeated applications of both
methods, where the shotgun approach opened up new areas of research that scientifically
led to the resolution. In aggregate, problems are typically solved quickly
with a minimum of running in circles when the scientific shotgun approach
is used. (Did you ever think you would see those two words together as something
meaningful?) This is a real boon in a consumer product world where shipping that
new design on time is all-important.
As for Mike's problem, I
should have taken my own advice and suggested checking the software version
first, since there had been a problem with previous versions. He reported in
a later email that they were burning the old version of software. Go figure,
I could have used a little shotgun wedding on this one.