Search - "diagnostics"
My first job: The Mystery of The Powered-Down Server
I paid my way through college by working every-other-semester in the Cooperative-Education Program my school provided. My first job was with a small company (now defunct) which made some of the very first optical-storage robotic storage systems. I honestly forgot what I was "officially" hired for at first, but I quickly moved up into the kernel device-driver team and was quite happy there.
It was primarily a Solaris shop, with a smattering of IBM AIX RS/6000. It was one of these ill-fated RS/6000 machines which (by no fault of its own) plays a major role in this story.
One day, I came to work to find my team-leader in quite a tizzy -- cursing and ranting about our VAR selling us bad equipment; about how IBM just doesn't make good hardware like they did in the good old days; about how back when _he_ was in charge of buying equipment this wouldn't happen, and on and on and on.
Our primary AIX dev server was powered off when he arrived. He booted it up, checked logs and was running self-diagnostics, but absolutely nothing so far indicated why the machine had shut down. We blew a couple of hours trying to figure out what happened, to no avail. Eventually, with other deadlines looming, we just chalked it up to something we'd look into more later.
Several days went by, with the usual day-to-day comings and goings; no surprises.
Then, next week, it happened again.
My team-leader was LIVID. The same server was hard-down again when he came in; no explanation. He opened a ticket with IBM and put in a call to our VAR rep, demanding answers -- how could they sell us bad equipment -- why isn't there any indication of what's failing -- someone must come out here and fix this NOW, and on and on and on.
(As a quick aside, in case it's not clearly coming through between the lines, our team leader was always a little bit "over the top" for me. He was the kind of person who "got things done," and as long as you stayed on his good side, you could just watch the fireworks most days - but it became pretty exhausting sometimes).
Back to our story -
An IBM CE comes out and does a full on-site hardware diagnostic -- tears the whole server down, runs through everything one part at a time. Absolutely. Nothing. Wrong.
I recall, at some point of all this, making the comment "It's almost like someone just pulls the plug on it -- like the power just, poof, goes away."
My team-leader demands the CE replace the power supply, even though it appeared to be operating normally. He does, at our cost, of course.
Another week goes by and all is forgotten in the swamp of work we have to do.
Until one day, the next week... Yes, you guessed it... It happens again. The server is down. Heads are exploding (well, at least one head, as we all know by now). With all the screaming going on, the entire office staff should have been comped some Advil.
My team-leader demands the facilities team do a full diagnostic on the UPS system and assure we aren't getting drop-outs on the power system. They do the diagnostic. They also review the logs for the power/load distribution to the entire lab and office spaces. Nothing is amiss.
This would also be a good time to draw the picture of where this server is -- this particular server is not in the actual server room, it's out in the office area. That's on purpose, since it is connected to a demo robotics cabinet we use for testing and POC work. And customer demos. This will date me, but these were the days when robotic storage was new and VERY exciting to watch...
So, this is basically a couple of big boxes out on the office floor, with power cables running into a special power-drop near the middle of the room. That information might seem superfluous now, but will come into play shortly in our story.
So, we still have no answer to what's causing the server problems, but we all have work to do, so we keep plugging away, hoping for the best.
The team leader is insisting the VAR swap in a new server.
One night, we (the device-driver team) are working late, burning the midnight oil, right there in the office, and we bear witness to something I will never forget.
The cleaning staff came in.
Anxious for a brief distraction from our marathon of debugging, we stopped to watch them set up and start cleaning the office for a bit.
Then, friends, I Am Not Making This Up(tm)... I watched one of the cleaning staff walk right over to that beautiful RS/6000 dev server, dwarfed in shadow beside that huge robotic disc enclosure... and yank the server power cable right out of the dedicated power drop. And plug in their vacuum cleaner. And vacuum the floor.
We each looked at one-another, slowly, in bewilderment... and then went home, after a brief discussion on the way out the door.
You see, our team-leader wasn't with us that night; so before we left, we all agreed to come in late the next day. Very late indeed.
This customer comes in and practically throws a computer on the counter.
Customer: This computer isn't working. I've ran the diagnostics and it says it's software. *places a dvd case with a 32 bit Windows 7 disk in it on the counter* It had Windows 10 on it, but I want Windows 7 on it.
Me: Well, you may have issues with the drivers if you put Windows 7 on it--
Customer: I don't care, I just want Windows 7.
Me: You SHOULD care. That means no wifi, no display, no mouse... Windows 7 doesn't like Windows 10 hardware.
Customer: Then... check to see Windows 7 compatibility!
Me: Alright.... *makes notes to check for Windows 7 compatibility*
Me: So has this Windows 7 been used before?
Customer: Yes, it has.
Me: On how many computers?
Customer: I've installed it on two computers and it works just fine.
Me: That's weird because Windows license keys are for one computer only. Are both of them connected to the internet?
Me: Well, okay then... *finishes up ticket*
Customer: I work in this field and I just don't understand why they don't come with the disks anymore. How much is a Windows 10 disk?
Me: *gives price*
Customer: And do you have any?
Me: Let me check *I go to where they are, find some and come back out*
Me: Unfortunately we're out at the moment and would have to special order some back in.
Customer: OK. So then how much to fix this computer?
Me: *price of installing Windows and backing up data*
Customer: That's halfway to the price of a new one of these!
Me: Well yes, an HP at Walmart... But you do have that option if you want to take it.
Customer: Well, why does it cost that much?
Me: Well, it's $labor1 to install Windows, $labor2 to do some basic setup and drivers, and $labor3 to backup and restore data.
Customer: Oh, well I don't want data.
Me: Okay, well then it would be $total - $labor3
Customer: ...Okay, fine
Me: *updates the ticket*
When she finally left I put it on the bench and the first message said "SMART ERROR." I then did 4 different tests that said "lol, the hard drive is failing."
If you "worked in this field," you would know that a SMART error is hard drive related.
If you worked in this field, you would know that Windows is only a 1PC license, so why are you lying about installing it with no issues on other computers?
If you worked in this field, you would know you would want a 64bit Windows on your computer.
If you worked in this field, you would know how to find a Windows 10 installation media online.
If you worked in this field, you would know that HPs are not good computers to get.
IF YOU FUCKING WORKED IN THIS FIELD YOU WOULDN'T BE SUCH A FUCKING CUNT.
Worst dev team failure I've experienced?
One of several.
Around 2012, a team of devs was tasked with converting to WCF an ASPX service that had one responsibility: returning product data (description, price, availability, etc...simple stuff)
No complex searching, just pass the ID, you get the response.
I was the original developer of the ASPX service, whose API took an XML request and returned an XML response. The 'powers-that-be' decided anything XML was evil and had to be purged from the planet. If this thought bubble popped up over your head "Wait a sec...doesn't WCF transmit everything via SOAP, which is XML?", yes, but in their minds SOAP wasn't XML. That's not the worst WTF of this story.
The team (3 developers, 2 DBAs, network administrators, and several web developers) worked on the conversion for about 9 months using the Waterfall method (3~5 months of that went mostly to meetings and very basic prototyping) and a test-first approach (their own flavor of TDD). The 'go live' was set for 3:00AM, and it was mandatory that nearly the entire department be on-site (including the department VP) and available to help troubleshoot any system issues.
3:00AM - Teams start their deployments
3:05AM - Thousands and thousands of errors from all kinds of sources (web exceptions, database exceptions, server exceptions, etc), site goes down, teams roll everything back.
3:30AM - The primary developer remembered he made a last minute change to a stored procedure parameter that hadn't been pushed to production, which caused a side effect across several layers of their stack.
4:00AM - The developer found his bug, but the manager decided it would be better if everyone went home and took a fresh look at the problem at 8:00AM (yes, he expected everyone to be back in the office at 8:00AM).
About a month later, the team scheduled another 3:00AM deployment (VP was present again), confident that introducing mocking into their testing pipeline would fix any database related errors.
3:00AM - Team starts their deployments.
3:30AM - No major errors, things seem to be going well. High fives, cheers..manager tells everyone to head home.
3:35AM - Site crashes, like white page, no response from the servers kind of crash. Resetting IIS on the servers works, but only for around 10 minutes or so.
4:00AM - Team rolls back, manager is clearly pissed at this point, "Nobody is going fucking home until we figure this out!!"
6:00AM - Diagnostics found the WCF client was causing the server to run out of resources, with a mix of clogging up server bandwidth, and a sprinkle of N+1 scaling problem. Manager lets everyone go home, but be back in the office at 8:00AM to develop a plan so this *never* happens again.
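For readers who have not run into the term, the N+1 pattern mentioned in the 6:00AM entry can be sketched roughly as below. This is an illustration only, not the team's WCF code: the `products` table, its columns, and the function names are hypothetical, and an in-memory SQLite database stands in for whatever the real service queried.

```python
import sqlite3


def make_db():
    """Build a throwaway in-memory database with a hypothetical products table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE products (id INTEGER PRIMARY KEY, description TEXT, price REAL)"
    )
    conn.executemany(
        "INSERT INTO products VALUES (?, ?, ?)",
        [(i, f"product {i}", i * 1.5) for i in range(1, 6)],
    )
    return conn


def fetch_n_plus_one(conn):
    """N+1 shape: one query for the ID list, then one more query per ID."""
    queries = 1
    ids = [row[0] for row in conn.execute("SELECT id FROM products ORDER BY id")]
    rows = []
    for product_id in ids:
        rows.append(
            conn.execute(
                "SELECT id, description, price FROM products WHERE id = ?",
                (product_id,),
            ).fetchone()
        )
        queries += 1
    return rows, queries


def fetch_batched(conn):
    """Batched shape: the same data in a single query."""
    rows = conn.execute(
        "SELECT id, description, price FROM products ORDER BY id"
    ).fetchall()
    return rows, 1


if __name__ == "__main__":
    conn = make_db()
    _, slow = fetch_n_plus_one(conn)
    _, fast = fetch_batched(conn)
    print(f"N+1 queries: {slow}, batched queries: {fast}")
```

With five rows the difference is trivial; the trouble starts when each of those extra queries is a network round trip and N is the size of a product catalogue.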
About 2 months later, a 'real' development+integration environment was in place (previously, any+all integration tests ran on the developer's machine) and the team scheduled a 6:00AM deployment, but at a much, much smaller scale with just the 3 development team members.
Why? Because the manager 'froze' changes to the ASPX service, the web team still needed various enhancements, so they bypassed the service (not using the ASPX service at all) and wrote their own SQL scripts that hit the database directly and utilized AppFabric/Velocity caching to allow the site to scale. There were only a couple of client applications using the ASPX service that needed to be converted, so deploying at 6:00AM gave everyone a couple of hours before users got into the office. Service deployed, worked like a champ.
A week later the VP schedules a celebration for the successful migration to WCF. Pizza, cake, the works. The 3 team members received awards (and an envelope, which probably equaled some $$$) and the entire team received a custom Benchmade pocket knife to remember this project's success. Several others and I just stared at each other, not knowing what to say.
Later, my manager pulls several of us into a conference room
Me: "What the hell? This is one of the biggest failures I've been a part of. We got rewarded for thousands and thousands of dollars of wasted time."
<others expressed the same and expletive sentiments>
Mgr: "I know..I know...but that's the story we have to stick with. If the company realizes what a fucking mess this is, we could all be fired."
Me: "What?!! All of us?!"
Mgr: "Well, shit rolls downhill. Dept-Mgr-John is ready to fire anyone he felt could make him look bad, which is why I pulled you guys in here. The other sheep out there will go along with anything he says and more than happy to throw you under the bus. Keep your head down until this blows over. Say nothing."
I was in a public place on my laptop, and my laptop went into hibernation to save battery. I switched it back on and then the laptop's BIOS came up saying that the battery was critically low, nothing bad here.
Instead of clicking continue, I decided to press "Diagnostics" instead. The diagnostics immediately began to run in the BIOS.
The screen began to show different coloured bars and patterns, obviously a screen test. Then a prompt appeared asking me if coloured bars were displayed. The options were yes and no, and a button saying "Exit" in the top right. Me, not wanting to do a full diagnostics on such a low battery, pressed exit.
The screen turned black, and then flashed red. The beeper on the motherboard began to beep at an ear-piercing volume. It sounded as if it was a bomb about to go off. Everyone around me stared and some people began to even panic. I tried switching it off by holding the power button but nothing was happening. People were just staring all around me.
After about 10 seconds, the beeping stopped and the screen displayed an error message similar to this:
"CRITICAL ERROR: Monitor test FAILED.
No user input was provided."
Moral of the story: Make your program account for all possible options.
A couple of weeks ago, I asked the "brand manager" if he knew how to reset printers to their defaults before reconfiguring them, knowing full well that he did not. He assured me that he did. I smiled and let him leave.
He called me yesterday, frantic, because he didn't know how to reconfigure a printer that already had a password. After reminding him of the above, I told him how to put the printer in diagnostic mode and how to navigate the menus. Literally: "Turn the printer off, then hold down the feed paper button while turning the printer on. It will print out a bunch of diagnostics, and a menu at the bottom. Just follow the instructions at the bottom to use the menu"
Apparently following simple instructions is well outside of his abilities. After he spent five minutes fighting with it and complaining, I called him and walked him through powering the printer on while holding down the feed paper button. Terribly difficult.
The next step amounts to "hold down the feed paper button for more than 1 second." He spent ten minutes (ten!) on this unimaginably challenging step, and, frustrated at his inability to outsmart a simple button, he gave up completely.
He literally couldn't follow the instructions on the printout. I've attached a picture to show how ridiculous this is, and it saddens me terribly to report that I'm quite serious. He was literally unable to figure this out.
HE SPENT TEN MINUTES TRYING TO PUSH A BUTTON FOR >1 SECOND! TEN MINUTES!
That's what was too difficult for him! A button! With written instructions!
I can't even.
But the kicker?
Now he and the bossman want me to drive half an hour so I can push a button for ~1.2 seconds because they're utterly incapable.
I'm soo done.
I love Linux, but its community can be so full of incompetent assholes..
Just now I asked in Freenode ##linux how to get the process ID of my current running process in bash. I got my answer - it's a shell built-in called "$$".
Then people start to nitpick some more - why do you need it? How is that different from an exit? - to which my response was.. well I know the whole idea behind exit codes, and I'd use it whenever possible, in all defined behavior that allows my program to terminate itself whenever it can. This pidfile however would be used to exit itself and provide diagnostic information whenever the program enters undefined behavior - a segfault in C language. Scenarios in which I don't have full control over the script's behavior anymore, such as the system entering an unworkable state where the system stalled, still got some binaries in RAM but the rootfs got unwritable, such as now - very helpfully, thanks HP! - when my laptop likely overheated and shat itself. I issued sudo reboot into it, but even that wouldn't issue properly anymore due to the /sbin/poweroff binary becoming inaccessible too. I had to issue a hard power cycle.. one of the few times in which I'm thankful to HP for actually causing shit like this, lol.
Point is, that undefined behavior is what I'm trying to mitigate against. I certainly can't let any files other than diagnostics remain in nonvolatile storage like that, especially when their state should be predictable in order to ensure good operation (like files expressing whether the script is already running or not, i.e. lock files).
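The pidfile idea described above can be sketched along these lines. It is written in Python rather than bash (where `$$` plays the role of `os.getpid()`), it assumes a POSIX system, and the function names and file path are made up for illustration. The point it tries to show is the stale-file check: if the previous run died without cleaning up, the recorded PID no longer belongs to a live process and the file can be taken over.

```python
import atexit
import os


def pid_is_running(pid):
    """Return True if a process with this PID exists (POSIX only)."""
    if pid <= 0:
        return False
    try:
        os.kill(pid, 0)  # signal 0 checks for existence without sending anything
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def release_pidfile(path):
    """Remove the pidfile; a missing file is not an error."""
    try:
        os.remove(path)
    except FileNotFoundError:
        pass


def acquire_pidfile(path):
    """Record our PID in path unless another live process already holds it.

    Not race-free: two processes starting at the same instant could both
    pass the check. Good enough for a sketch, not for a real lock.
    """
    try:
        with open(path) as handle:
            old_pid = int(handle.read().strip())
    except (FileNotFoundError, ValueError):
        old_pid = None  # no file, or garbage left behind by a crash

    if old_pid is not None and old_pid != os.getpid() and pid_is_running(old_pid):
        return False

    with open(path, "w") as handle:
        handle.write(str(os.getpid()))
    atexit.register(release_pidfile, path)  # only runs on a clean exit
    return True
```

A caller would do something like `if not acquire_pidfile("/tmp/myscript.pid"): sys.exit("already running")`. After a hard power cycle the `atexit` hook never fires, which is exactly the case the stale check is there for.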
Back to that IRC chat. Aside from the answer, I got ridicule from people who probably don't even know how to properly compile a kernel. Ubuntu users, overconfident scum. Sometimes I feel like I should ask questions in channels like #archlinux only, where such incompetency is ridiculed on its own.
1. Scripting out a team. I've built a collection of bash scripts to do what one of our teams does. Except the script does it in 30min and always does it well where that team used to take 4 to 10 hours and almost always missed something along the way.
2. Automate 70-80% of our BAU tasks with a single >4k loc bash script. Integrations with servicenow, lots of internal portals, predefined huge sets of commands to run on separate servers or lists of servers, do all sorts of diagnostics, schedule hw maintenance for DC folks, chase for approvals, track CHNG/CTSK tickets in a graphical chart so we would not miss any of them and lots lots more.
Finally we were able to afford time to make some coffee/tea.
These are the BAU optimizations I'm proud of the most. And they have made a significant impact on how our teams operate.
Whoever recognizes both company values in the tags and knows which company that is - are they still using ´S´ in the unix team? :)
I was pressured to shift the blame.
We received an angry email from a customer that some of their data had disappeared. The boss assigns me to this task. This feature is relatively new and we've found some bugs in the past in here. I go through request logs, search the database, run some diagnostics, etc. for about 5 hours and I cannot find the problem. I focus on the bugs that we've had before but they don't seem to be the problem.
I tell the boss "sorry but I checked XYZ and I can't find the problem. I'm out of ideas." But the boss wanted answers by the end of the day. They did not want to admit to the client that we couldn't figure out what's wrong.
By now I was more pressured to find an answer, find something or someone to blame it on, not exactly to find the real solution. So I made up some BS:
"Sometimes, in HTML forms, the number inputs allow you to change the number by scrolling. We have some long forms where the user has to scroll. Perhaps the focus remained on the number input, so when they scrolled down they accidentally changed the number they meant to input."
The boss was happy with that. We explained this to the customer, and there's now a ticket to change type="number" to type="text" in our HTML forms and to validate it in the backend.
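Once the field is type="text", the backend receives an arbitrary string, so the validation in that ticket might look roughly like this. It is a minimal sketch in Python: the function name and the bounds are assumptions, since the post doesn't say what the field holds or what stack the backend uses.

```python
import re

# Plain ASCII digits with an optional minus sign; rejects "4e2", "12.5", "".
_INTEGER_PATTERN = re.compile(r"-?[0-9]+")


def parse_quantity(raw, minimum=0, maximum=1_000_000):
    """Convert a text-input value to an int, or raise ValueError.

    The bounds are placeholder assumptions; a real form would use
    whatever range makes sense for the field.
    """
    text = raw.strip()
    if not _INTEGER_PATTERN.fullmatch(text):
        raise ValueError(f"not a whole number: {raw!r}")
    value = int(text)
    if not minimum <= value <= maximum:
        raise ValueError(f"{value} is outside {minimum}..{maximum}")
    return value
```

For example, `parse_quantity(" 42 ")` returns the integer 42, while `parse_quantity("4e2")` raises `ValueError` instead of silently accepting something the number input would have handled differently.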
A week later another customer shows us a different error. This one is more clear because it had a stack trace, but I realise that this error is what caused our last error. It was pretty obscure, mind you, the unit tests didn't detect it.
I didn't tell the boss that they were connected tho.
With two angry clients in two weeks, I finally convinced the boss to give us more time to write more unit tests with full coverage.
A customer had spilled beer on his Macbook and brought it in for us to run diagnostics on.
Me: So it looks like his Mac got cultured...
Coworker: I'm not even going to respond.
Motherfuckin fuckidy duck fuck!
I am so done with Azure for today!
After I ran out of space on a secondary drive I shut the VM down and increased said drive and now after starting it (which takes way too long already) I can't ssh into it: "Connection refused". Diagnostics say "everything is fine bruh" and now I'm stuck with an inaccessible VM which I already spent half the night on configuring and downloading 60gb of sources.. aaargh!
Me: have you tried turning it off and on again?
Customer: oh come on, is that the best you can do!
M: OK, how about we
clear all active memory,
Reset the firmware parameters
run system diagnostics and
reinitialise the basic input output system?
C: Wow .. yeah how do we do that?
M: turn it off and on again!
Time: 0600 hrs.
Mental State: Almost falling asleep on my laptop
I get a call from my "random cousin" with whom I haven't spoken in a looooong time, and he says "Hey, Good Morning ! I can't connect to my WiFi from my Windows laptop running Windows 7. Can you help ?.."
That moment when you TRULY believe in the person who developed the "Network Diagnostics" utility on Windows and ask the "random cousin" who calls you up at 6 AM to try it...
And he sends you this screenshot after some time ...
And then you have to wake up and pinch yourself to see if you are in a dream...
Long sleepless day ahead...
A customer brought in an older, beat-up machine and told us it wasn't booting. We noticed that his power supply was damaged, but checked it in for other diagnostics.
I found out he had a corrupted operating system, but with everything else on the computer, I didn't recommend fixing the computer.
Now, for reference, this is a Windows 7 computer with 10GB of RAM. But it also has a bent side-panel, the front-panel is hanging on by a thread, and it would also need the new power supply -- all of which would be over $200 USD.
When I finally relayed this info to him over the phone, we started talking about the system.
Him: So what do you think?
Me: I mean, this computer has some good specs, but with the damage, I wouldn't recommend repairing the computer. Now, this is your computer and you are more than welcome to tell me to shove it, but I'd recommend replacing it. We're at the breaking point of doing whatever you want to do, and it's your money that you're spending, but in my professional opinion, I don't think it's worth saving.
Him: Well, okay. I'll come in later and see what options I have.
I started writing code at a young age, modding games, building websites, modifying hex files, hacking etc... I started my career off tho in high school, writing embedded code for a local medical robotics company, and also got tasked with building the mobile app to control these robots and use them for diagnostics, etc.... this was before the App bubble, before there was an app degree and that bullshit.. anyway, graduated high school, went to college to get a comp sci degree.
Wanted to teach for the university and research AI...
Well, I dropped out of college after 3 years, cuz I spent more time at work than in class (I was a software consultant in the auto industry in Detroit). I wasn’t learning anything I didn’t already know or couldn’t learn from books or a quick google search.
I also didn’t like the way the professors and the department taught software... the way none of the kids had a good foundation of what the fuck they were doing... and everyone relied on the god damn IDEs... so I said fuck it and dropped out after getting in plenty of arguments with the professors and department leads.
I probably should have chosen CE... but whatever. CS imo still needs a solid CE/EE foundation; without it, I fear what will become of the electronics industry 30 years from now... when all the current gen folks are retired and there's nobody to write the embedded code that literally ALLLLL consumer electronics run on. Newer generations don’t understand pointers, proper memory management etc.
So I combined my passion for AI with my knowledge of software in general and embedded software, and have been working on my career in the auto industry without a degree. Never looked back.
If I unplug a charger then my laptop immediately turns off
if I run a hw diag [boot into diag mode], it says I have a healthy battery, but a faulty ssd
I was browsing through devrant on my phone OP 5t and I noticed a small white pixel near the notification bar
I was shit scared, cause I got a history of damaged LCD with my previous phone.
I tried opening other apps with full screen no change.
Checked lcd test from hardware diagnostics tests.
The white pixel disappeared by itself.
My dad, the man who taught me cutting corners is less possible in the IT field than any other field and that you have to do it CORRECTLY unless you're deliberately asking for problems, is using the OEM recovery utility to reinstall the OEM copy of Win7 Starter onto a shitbook destined to be a diagnostics machine for smart cars *because he doesn't wanna go driver hunting.*
They're all literally right fucking here. On this one page.
My mentor has become the bad example he once steered me away from becoming.
Seeking help from anyone able to read a laptop motherboard schematics sheet
In short: Looking for a blown fuse on Laptop (Dell Inspiron 7547) near LCD cable connector, as not getting backlight after a new screen installation. Screen is functional and is detected properly and the device is passing all the diagnostics tests.
Goddamned apple. "Just works" my ass.
My girlfriend's 2014 MBP told her an update had already downloaded and she had no choice but to install it, which doesn't bother me. Now, though, it tries to update, then fails, and tells you to run diagnostics. Diagnostics says nothing is wrong, and says to contact support and it boots a minimal version of the OS that only runs a minimal version of Safari. You try to access the chat feature and it never fully loads the chat interface. I can probably restore it, but I wanna know what was wrong. It's a really expensive brick right now.
Hello everyone, looking for some career advice here.
First off, let me list my credentials here. I graduated in 2016 with a BS in Computer Science. While I was working on my degree I worked as an engineer for 3 years at a cell phone repair company. This entailed managing/reverse engineering a software solution from one of that company's vendors, writing documentation, etc. (it started as a summer internship and became a job that I worked full time over summers and up to 30 hours/week during the school year).
Anyway, the vendor I acted as a point of contact for offered me a job before I graduated, and I started with them in May 2016 as the most junior dev. Since then I have kept the same job title (software developer); however, my duties have increased.
Currently I maintain several of our build servers, manage software releases (as in I am the lead developer of this application) for the service that makes 90% of this company's money, and am the subject matter expert for everything regarding smartphone diagnostics. I've literally been entrusted with access to all of the company servers in case something goes wrong. I'm also training our newest developers and being told I'm doing a good job at doing so.
Currently with my job on a day to day basis I'm working with Java, Android, C++, Golang, MongoDB, iOS in Objective C, and Python
(Please note this is a small company of less than 50 people)
Currently I'm only being paid 60k USD and am wondering if I should hold out for a raise or consider looking for a better job? (Please note I live on the East Coast in an area where the cost of living isn't absurd.)
Because this job was practically handed to me I don't know what to expect and feel imposter syndrome, as I think I deserve better pay but think I don't have enough years of experience. All advice is welcome.
Let's try again.
What the fuck is with Apache? Why can't I load the page? It should be 5 minutes of work,
but it gives some shitty error where it is not clear what is wrong:
This site can’t be reached. timetracker.local’s server IP address could not be found.
Checking the connection
Checking the proxy, firewall, and DNS configuration
Running Windows Network Diagnostics
How long has Apache been in development? 10 years? More? And it still can't produce normal error messages so you would know how to fix the problem. Fuck that. I hate it so much. Wasting my time. Bastards.
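Worth noting: the quoted "server IP address could not be found" text is the browser reporting that the name timetracker.local did not resolve, which happens before any web server is contacted. A quick way to separate the two cases is to check name resolution on its own. The sketch below is only that, a sketch: the function name is made up, and the hint about the hosts file is an assumption about how a `.local` dev name is usually set up.

```python
import socket


def check_name_resolution(hostname, port=80, resolver=socket.getaddrinfo):
    """Return (ok, message) describing whether hostname resolves to an address.

    resolver is injectable so the check can be exercised without a network.
    """
    try:
        infos = resolver(hostname, port, proto=socket.IPPROTO_TCP)
    except socket.gaierror as exc:
        return False, (
            f"{hostname} does not resolve ({exc}); "
            "check the hosts file or DNS before looking at the web server"
        )
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    addresses = sorted({info[4][0] for info in infos})
    return True, f"{hostname} resolves to {', '.join(addresses)}"


if __name__ == "__main__":
    ok, message = check_name_resolution("timetracker.local")
    print("OK:" if ok else "FAIL:", message)
```

If this reports a failure, the virtual host configuration is not the problem yet; the machine simply has no mapping from the name to an address.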
Is there anyone out there who knows opennms? I got assigned to "improve" the nms diagnostics page (graphs are drawn and shit) but I can't find any decent documentation. My task has even been changed to "if you solve the problem, write down a documentation on how you did it"
So yeah... Feeling lost.. Not even a SO thread to help me 😳😖
I just went through a super long debugging process trying to figure out what was going on with my ZFS volumes. It turned out I had bad memory:
Disclaimer: I love open source and I adore the owasp for what they do.
BUT owasp zap has to be the most overly complicated, badly documented tool in existence. As long as one stays within its most basic functions everything is fine, setting it up as a proxy and even issuing a root cert for our test devices worked wonderfully simple.
Then I made the mistake of trying to actually do anything with the data we pulled and had to dive into the scripting console.
The documentation basically consists only of "This thing exists", it provides a msg object with no information what it contains or how it's structured, has no code completion and, here comes the kicker, if the script is run and has an error it gets flagged and can't be reenabled after the error is fixed. So I'm currently at forwarder48.groovy trying to simply store the request on a database for possible diagnostics.
So right now I already know that I'll spend most of my vacation next week trying to decipher the source, document it, fix that damn "flagged as error" bullshit and jump through a billion hoops trying to get a pull request through.