Reliability Engineer II

Microsoft - Redmond, WA

The successful candidate will be responsible for driving the quality and reliability of equipment used in Microsoft’s cloud to meet and exceed our customers’ expectations. The candidate will act as the internal consultant on all reliability matters and interface with program management, vendors, and design engineering (as necessary) on key reliability programs and issues. This includes creating or revising reliability engineering guidelines to improve product field performance through design enhancements that meet reliability goals. The candidate will use principles of performance evaluation and prediction to improve the reliability and maintainability of Cloud Infrastructure servers, including PCBAs (printed circuit board assemblies); identify, collect, analyze, and manage various types of data to minimize failures and improve product performance; use scripting and real-time capture of responses from electronic devices under test (DUTs) to determine proper operation of the DUT or to trace faults to their root cause; and work cross-functionally to resolve reliability problems that result in excessive field failures.

In addition, the candidate will use Microsoft cloud performance IoT (Internet of Things)/telemetry data and traditional reliability engineering principles to determine and predict the reliability of critical commodities and parts. The successful candidate should be considered an expert in her or his technical field and have a proven track record of success. The candidate must demonstrate a detailed understanding of the significance of time to market, risk mitigation, contingency plans, return on investment, etc.


Primary responsibilities include:

The candidate should possess most (or all) of the following capabilities:

  • Understanding of Design for Reliability principles and Physics-of-Failure concepts to develop and implement accelerated tests to identify and mitigate risks and qualify engineering designs during product development
  • Ability to use knowledge of product design and manufacturing processes to conduct Failure Modes and Effects Analysis (FMEA)
  • Capable of applying Design-of-Experiments concepts to identify Critical-to-Quality parameters and develop robust evaluation plans based on them
  • Knowledge of acceleration models for common failure mechanisms and stress types
  • Knowledge of statistical techniques to analyze test data and create estimates for field failure rates
  • Good understanding of fundamental properties and characteristics of materials used in cutting-edge consumer electronics products
  • Familiarity with application of simulation tools like Finite Element Analysis, etc. to evaluate product performance (mechanical, electrical and thermal) is useful
  • Experience with balancing the significance of time to market, risk mitigation plans and return on investment while creating and executing reliability plans during product development
  • Completing measurement method analysis, gage correlation studies (GRR) and other data fidelity studies
  • Completing data trend and variation analysis and creating engineering reports
  • Summarizing test data and performing data analysis in accordance with product design requirements
  • Preparing test reports and communicating results and findings to reliability engineers to facilitate root cause analysis and resolution when failures occur
  • Ability to develop reliability stress hardware specifications and procedures, writing test programs for reliability testing and characterization, and assisting in selection of new reliability lab equipment
  • Ability to set up test equipment for functional tests at the subsystem (e.g., PCBA) and system levels, including product cabling and instrumentation for thermal and mechanical characterization, such as thermocouples, accelerometers, strain gages, power supplies, and data acquisition systems
  • Develop and execute reliability qualification plans based on product lifecycle requirements while interfacing with design and manufacturing partners and external suppliers
  • Develop, with other functional disciplines, customer usage models and translate understanding of the customer into practical reliability test specifications
  • Standardize methodologies and processes for increased effectiveness of qualification plans and sample sizes used
  • Participate in component vendor selection activity and drive component qualification activity for components that are critical to Microsoft product requirements
  • Use knowledge of manufacturing process capability as well as system-level performance requirements to establish Critical-to-Reliability performance metrics
  • Monitor product performance in the field, understand customer-facing product issues and drive failure analysis and corrective action with the appropriate partner engineering teams
  • Strong working knowledge of PCBA (printed circuit board assembly) and electronic component failure mechanisms
  • Strong familiarity with industry standards, including IPC, JEDEC, Telcordia, and MIL standards
  • Strong working knowledge of life test, ALT, HALT, and HASS design and execution
  • Knowledge of manufacturing methods for electronic components


  • Minimum of a BS degree in Electrical, Mechanical, Physics, or Materials Science Engineering
  • 5-10 years of experience in computer hardware development
  • Working knowledge of power supplies, FPGAs, and DIMMs/DRAM; signal integrity; development, integration, and test plans; and balancing performance, power, complexity, and timing
  • Working knowledge of computer organization, architecture, and electrical, electro-mechanical components, and peripheral devices
  • Strong problem-solving skills using an analytical and data-driven approach; strong initiative and ability to work in a self-directed environment

  • Ability to communicate clearly, both orally and in writing; ability to present clear and concise information to the team and to internal and external customers

Microsoft is a highly innovative company that collaborates across disciplines to produce cutting-edge cloud technology that changes our world. The Cloud Server Infrastructure (CSI) team in Microsoft’s Azure C+E division is responsible for delivering server infrastructure for Microsoft’s online services. The hardware for operating these services (over 200 and counting) comprises hundreds of thousands of servers spread globally and applications that reach hundreds of millions of users every day. Our customer base is growing rapidly, our infrastructure investments are multiplying, and the size of our global infrastructure is increasing by the day, along with the scale of our challenges. Learn more about our team and projects here: Azure Hardware Infrastructure



The ability to meet Microsoft, customer, and/or government security screening requirements is required for this role. These requirements include, but are not limited to, the following specialized security screenings: Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud Background Check upon hire/transfer and every two years thereafter.

Microsoft is an equal opportunity employer. All qualified applicants will receive consideration for employment without regard to age, ancestry, color, family or medical care leave, gender identity or expression, genetic information, marital status, medical condition, national origin, physical or mental disability, political affiliation, protected veteran status, race, religion, sex (including pregnancy), sexual orientation, or any other characteristic protected by applicable laws, regulations and ordinances.

Benefits/perks listed below may vary depending on the nature of your employment with Microsoft and the country where you work.

  • Industry-leading healthcare
  • Savings and investments
  • Giving programs
  • Educational resources
  • Maternity and paternity leave
  • Opportunities to network and connect
  • Discounts on products and services
  • Generous time away