Email overheat alerts need summary info

Posted: 2015.07.01. 16:50
by mr-b
Hi

When I receive emailed alerts I find the summary info rather basic, i.e. it tells me the error (overheat etc.), the machine name and then a percentage.
On opening the email I see a huge amount of General Information with app/computer/system info but nothing to read quickly to see more details about the source of the alert error.
Also I can't actually find out what the percentage number means - even after I search for it in the body of the email.

Is there any way of making this info more easily readable?

Re: Email overheat alerts need summary info

Posted: 2015.07.02. 14:11
by hdsentinel
By default, Hard Disk Sentinel sends a complete hard disk report in e-mail.
This can be configured easily: see Configuration -> Message settings -> Send E-mail option. There you can configure if the e-mail should be
- Detailed HTML report
- Detailed text report
- Brief text report (which sends only the information you may prefer: the quick and short status)

In the detailed report, you can search for a word, for example "overheat", to jump to the drive(s) reporting higher temperatures.

Also on this Configuration -> Message settings page, you can adjust which items are included in the detailed report (if you prefer to use it), for example to disable general information and/or sections about hard disks - to reduce the report size and improve readability.

The % value in the e-mail subject reflects the lowest health % value detected about the hard disks.
This is a quick indicator: for example, if your hard disks all have 100% health, then any degradation in the health of any hard disk will cause the e-mail subject to show the reduced % value as well.
This way very quickly (even without checking the e-mail body) you will be notified about the degradation of the hard disk health - which may require attention.
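
As a rough sketch of the idea (not the actual Hard Disk Sentinel code; the `Disk` record, its field names and the subject layout are assumptions made only for illustration), the value in the subject is simply the minimum health over all disks:

```python
from dataclasses import dataclass


@dataclass
class Disk:
    # Hypothetical record; the field names are illustrative only.
    name: str
    health_percent: int


def lowest_health(disks):
    """Return the lowest health % among all disks."""
    if not disks:
        raise ValueError("no disks to report on")
    return min(d.health_percent for d in disks)


def build_subject(alert, machine, disks):
    """Assumed subject layout: '<alert> (<machine>) <lowest health> %'."""
    return f"{alert} ({machine}) {lowest_health(disks)} %"


disks = [Disk("Disk #1", 100), Disk("Disk #2", 100), Disk("Disk #3", 99)]
print(build_subject("Overheat", "SERVER1", disks))
```

With these sample values the subject shows 99 %, even if the overheating disk itself is at 100 % health.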

Re: Email overheat alerts need summary info

Posted: 2015.07.02. 19:30
by mr-b
Tx for the info. I will try out the "brief text" option. I had looked through the config setting but my eye missed the Report type as it appeared to be indented together with a bunch of connection type settings which weren't relevant to me.

However I would suggest that the email subject could be clearer by indicating what the percentage actually refers to. Also it'd be handy to have the relevant drive & volume label to be displayed together with the temp, since that is what the warning is primarily about e.g. Overheat (SERVER1) [Disk #2 Vol2 Primary - Temp 43 °C Health 100 %].
This would usually save me from having to open the mail at all.

Re: Email overheat alerts need summary info

Posted: 2015.07.03. 07:44
by hdsentinel
Thanks for the tip.
Yes, generally, the current lowest % value is included precisely to help: it makes the subject informative about the general status before opening the e-mail.

Yes, it can be changed, for example as you wrote:

Overheat (SERVER1) [Disk #2 Vol2 Primary - Temp 43 °C Health 100 %]

But then it will immediately reduce usability, as then the health would be the actual health of Disk #2, instead of the lowest health detected (of all hard disk drives). So users expecting the lowest health would lose that information.

Maybe

Overheat (SERVER1) [Disk #2 Vol2 Primary - Temp 43 °C Health 100 %] Lowest health of all disks: 99%

would be better - but I suspect this is a bit too crowded and may even be confusing.
Also if more than one hard disk has a high temperature, but the subject mentions only one, the admin (without opening the e-mail) may think that only one hard disk is affected by the overheat. This may lead to a false assumption.
In this case the subject can be even longer to include the other hard disk(s) affected....

So it is hard to find the balance to make it short but effective ;)

Maybe the best would be to have an option to configure this, for example by specifying any custom text and variables, so users can configure the subject of the e-mail alerts.
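
A minimal sketch of how such a configurable subject could work, assuming a simple `$variable` template syntax. The variable names below are hypothetical and are not existing Hard Disk Sentinel settings:

```python
from string import Template

# Hypothetical variable names, chosen only for this illustration.
DEFAULT_SUBJECT = Template("$alert ($machine) $lowest_health %")


def render_subject(template, values):
    """Fill a user-defined subject template.

    safe_substitute leaves unknown $variables in place instead of raising,
    so a typo in the template does not stop the alert from being sent.
    """
    return template.safe_substitute(values)


values = {
    "alert": "Overheat",
    "machine": "SERVER1",
    "disk": "Disk #2",
    "temp": 43,
    "health": 100,
    "lowest_health": 99,
}

custom = Template(
    "$alert ($machine) [$disk - Temp $temp °C Health $health %] "
    "Lowest health of all disks: $lowest_health %"
)

print(render_subject(DEFAULT_SUBJECT, values))
print(render_subject(custom, values))
```

The default template keeps the short form, while users who prefer the longer, per-disk form could define it themselves.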

Re: Email overheat alerts need summary info

Posted: 2015.07.04. 23:59
by mr-b
Tx - I think making the stats clearer is a definite way forward.

But TBH I find including the machine's lowest health % in the subject rather confusing, since surely the alert is about overheating on a particular disk? So in an Overheat alert, I'd expect primarily stats about that overheating disk and any trending historical stats, rather than lowest machine health stats (except in the body of the mail).
If health stats do start reducing to certain levels then I'd expect alerts about those!
Or am I missing something here?

Also if multiple disks have high temps, won't they generate individual overheat alerts too?

And yes, an option to configure subject field would be handy, but then again sometimes ppl can hang themselves with too many options and simpler is better! ;-)

As an aside, when I used to work for a large storage array co (not EMC), I recall that there was a paper issued by a customer with many data centres who said that observed drive failure rates could not be correlated with temp mgmt issues and so overheating wasn't really a perceived issue with drives even though this seems to counter common sense ... I'm trying to dig out the ref but can't locate it currently.

Re: Email overheat alerts need summary info

Posted: 2015.07.05. 08:06
by hdsentinel
The lowest health % is automatically added to all e-mails. This includes the alert type e-mails (like this one) and for example the daily status e-mail as well.

> If health stats do start reducing to certain levels then I'd expect alerts about those!

Of course, if the proper options are enabled, then an alert will be issued when hard disk health is decreasing. See the Configuration -> Alerts page: there it is possible to enable "When disk health is low..." and to configure different alerts for two levels, the yellow (warning) and red (alert) levels.
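
To illustrate the two-level idea only: the sketch below uses made-up threshold values (50 and 20) and a hypothetical function name; the real limits are whatever is configured on the Alerts page.

```python
def health_alert_level(health_percent, warning_below=50, alert_below=20):
    """Classify a disk health % into None, 'yellow' (warning) or 'red' (alert).

    The default thresholds are placeholders for illustration,
    not Hard Disk Sentinel defaults.
    """
    if not 0 <= health_percent <= 100:
        raise ValueError("health must be between 0 and 100")
    if alert_below > warning_below:
        raise ValueError("alert threshold must not exceed warning threshold")
    if health_percent < alert_below:
        return "red"
    if health_percent < warning_below:
        return "yellow"
    return None  # healthy enough: no health alert


for health in (100, 45, 10):
    print(health, health_alert_level(health))
```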


> Also if multiple disks have high temps, won't they generate individual overheat alerts too?

Yes, it's true.


> ppl can hang themselves with too many options and simpler is better! ;-)

I 100% agree ;)


> that observed drive failure rates could not be correlated with temp mgmt issues and so overheating wasn't really a perceived issue
> with drives even though this seems to counter common sense ...

Generally, overheating can damage hard disks, and using any hard disk for a longer time in a high temperature environment reduces its lifetime.
Just because its effect is usually not immediate and does not cause immediate hard disk failure, we can't say it's safe to operate hard disks at higher temperatures for a longer time.

Also different types of hard disks tolerate higher temperatures differently. For example enterprise SAS / SCSI hard disks may tolerate this better (usually the manufacturers also show higher operating temperature ranges for them).

This is why temperature monitoring and displaying the status (including the daily maximum / average values) play an important role and Hard Disk Sentinel shows these. And if it detects an overheat condition for a longer time, this negatively impacts the estimated remaining lifetime.

Re: Email overheat alerts need summary info

Posted: 2015.07.06. 23:52
by mr-b
OK I found that disk health report - it was Google - "Failure Trends in a Large Disk Drive Population" http://research.google.com/archive/disk_failures.pdf

I maintain that clarifying the different disk and/or lowest machine drive healths would be very useful in these reports, as just an unlabelled % figure is rather mystifying.

Re: Email overheat alerts need summary info

Posted: 2015.07.10. 12:30
by hdsentinel
Yes, this is a well known report - and most of the results are not surprising: they follow from the "traditional" S.M.A.R.T. checking methods.

But temperature is still an important factor: even their research showed that higher temperature ranges increase the hard disk failure %.
And this is not about excessively high temperatures, just 40-42 Celsius; see Fig. 4 on page 6.

Temperature is not really important - as long as the hard disk is operating in its best temperature range.
A hard disk drive usually does not fail sooner if it operates at 40 Celsius compared to a similar hard disk at 30 or 35 Celsius. But there is a good range, where hard disk drives operate well and their lifetime can be higher - for this, the F.A.Q. page ( http://www.hdsentinel.com/faq.php#q1 ) suggests 35-40 Celsius as ideal.
(Yes, even over-cooling can be dangerous - as it can also shorten the lifetime.)

Partly to "answer" the problems raised by this article, and also to show the biggest problems with the traditional S.M.A.R.T. checking method (which does not verify the correlation between different self-monitoring attributes and does not verify/report REAL hard disk issues, but only checks if (or when) the threshold-exceeded condition is reached for some attributes), the www.hdsentinel.com/smart page was created many years ago.
This shows how this S.M.A.R.T. evaluation model (used by other tools, system BIOSes, OSes, etc.) *really* can't be used to detect and report actual hard disk problems or to predict possible lifetime. This led to the false assumption that hard disk problems can't be reported/detected - which is usually not true.

That's why we required a completely different approach, to show that S.M.A.R.T. in general (when understood / managed correctly) really CAN show problems - and Hard Disk Sentinel has worked this way since its first version, to reveal and show real problems and degradations (even minor ones).
Hard Disk Sentinel is generally designed to overcome these problems and help us to reveal and fix possible problems.
Also the Help -> Appendix -> Health Calculation page shows the fundamentals of how the errors are determined, reported and counted.
(this is only part of the picture, the actual calculation is more complicated).

www.hdsentinel.com/smart page shows the biggest weaknesses of traditional S.M.A.R.T. checking model and shows how things can be done differently.

Also we completely agree that it is best to TEST any hard disk intensively (regardless of its reported status) BEFORE filling it with actual data. This reveals any (even minor) problem before the hard disk is filled with sensitive data.

Hard Disk Sentinel offers different tests, exactly for this purpose and the page http://www.hdsentinel.com/faq.php#tests suggests different tests to be used first (even on a new hard disk drive). These tests reveal and fix problems - or confirm if the hard disk is perfect and can be used.


Thanks for the suggestion about the e-mail alert subject!

Re: Email overheat alerts need summary info

Posted: 2015.07.10. 23:29
by mr-b
Very interesting, espec regarding the suggested tests, of which I was not aware. I'll post in another thread about those!
Tx for considering the email alert subject change.